本篇博文主要内容为 2026-02-04 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。

目录

概览 (2026-02-04)

今日共更新853篇论文,其中:

  • 自然语言处理130篇(Computation and Language (cs.CL))
  • 人工智能316篇(Artificial Intelligence (cs.AI))
  • 计算机视觉152篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习359篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] Parallel-Probe: Towards Efficient Parallel Thinking via 2D Probing

【速读】: 该论文旨在解决并行推理(parallel thinking)在实际应用中因计算资源消耗大而导致的效率瓶颈问题。现有方法主要依赖局部轨迹信号进行优化,缺乏对跨并行分支全局动态规律的有效利用。其解决方案的关键在于提出一种名为2D探针(2D probing)的新接口机制,通过周期性地从所有推理分支获取中间答案,显式暴露宽度(width)与深度(depth)之间的协同演化关系;基于此机制发现三个关键现象:宽度-深度分配的非单调缩放特性、推理分支长度的异质性以及全局共识的早期稳定化。由此设计出无需训练的控制器Parallel-Probe,通过基于共识的早期停止策略调控推理深度,并采用基于偏差的分支剪枝策略动态调整并行宽度,从而在保持竞争力准确率的同时显著降低测试时的token消耗(最多减少35.8%序列token和25.8%总token成本)。

链接: https://arxiv.org/abs/2602.03845
作者: Tong Zheng,Chengsong Huang,Runpeng Dai,Yun He,Rui Liu,Xin Ni,Huiwen Bao,Kaishen Wang,Hongtu Zhu,Jiaxin Huang,Furong Huang,Heng Huang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 14 pages

点击查看摘要

Abstract:Parallel thinking has emerged as a promising paradigm for reasoning, yet it imposes significant computational burdens. Existing efficiency methods primarily rely on local, per-trajectory signals and lack principled mechanisms to exploit global dynamics across parallel branches. We introduce 2D probing, an interface that exposes the width-depth dynamics of parallel thinking by periodically eliciting intermediate answers from all branches. Our analysis reveals three key insights: non-monotonic scaling across width-depth allocations, heterogeneous reasoning branch lengths, and early stabilization of global consensus. Guided by these insights, we introduce \textbfParallel-Probe , a training-free controller designed to optimize online parallel thinking. Parallel-Probe employs consensus-based early stopping to regulate reasoning depth and deviation-based branch pruning to dynamically adjust width. Extensive experiments across three benchmarks and multiple models demonstrate that Parallel-Probe establishes a superior Pareto frontier for test-time scaling. Compared to standard majority voting, it reduces sequential tokens by up to \textbf35.8 % and total token cost by over \textbf25.8 % while maintaining competitive accuracy.
zh

[NLP-1] Accelerating Scientific Research with Gemini: Case Studies and Common Techniques

【速读】: 该论文试图解决的问题是:如何有效利用先进的生成式 AI(Generative AI)模型,特别是基于 Google Gemini 的模型(如 Gemini Deep Think 及其高级变体),在理论科学研究中实现从辅助常规任务到推动原创性、专家级数学发现的跨越。解决方案的关键在于构建一套系统性的“人机协作”方法论,包括迭代优化(iterative refinement)、问题分解(problem decomposition)和跨学科知识迁移(cross-disciplinary knowledge transfer),并进一步拓展至非标准交互场景,例如将 AI 作为严格的对抗性审稿人识别证明中的细微漏洞,或嵌入神经符号(neuro-symbolic)循环中自动编写与执行代码以验证复杂推导。这些实践表明,AI 不仅可作为自动化工具,更可成为科学发现创造过程中的真正合作伙伴。

链接: https://arxiv.org/abs/2602.03837
作者: David P. Woodruff,Vincent Cohen-Addad,Lalit Jain,Jieming Mao,Song Zuo,MohammadHossein Bateni,Simina Branzei,Michael P. Brenner,Lin Chen,Ying Feng,Lance Fortnow,Gang Fu,Ziyi Guan,Zahra Hadizadeh,Mohammad T. Hajiaghayi,Mahdi JafariRaviz,Adel Javanmard,Karthik C. S.,Ken-ichi Kawarabayashi,Ravi Kumar,Silvio Lattanzi,Euiwoong Lee,Yi Li,Ioannis Panageas,Dimitris Paparas,Benjamin Przybocki,Bernardo Subercaseaux,Ola Svensson,Shayan Taherijam,Xuan Wu,Eylon Yogev,Morteza Zadimoghaddam,Samson Zhou,Vahab Mirrokni
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have opened new avenues for accelerating scientific research. While models are increasingly capable of assisting with routine tasks, their ability to contribute to novel, expert-level mathematical discovery is less understood. We present a collection of case studies demonstrating how researchers have successfully collaborated with advanced AI models, specifically Google’s Gemini-based models (in particular Gemini Deep Think and its advanced variants), to solve open problems, refute conjectures, and generate new proofs across diverse areas in theoretical computer science, as well as other areas such as economics, optimization, and physics. Based on these experiences, we extract common techniques for effective human-AI collaboration in theoretical research, such as iterative refinement, problem decomposition, and cross-disciplinary knowledge transfer. While the majority of our results stem from this interactive, conversational methodology, we also highlight specific instances that push beyond standard chat interfaces. These include deploying the model as a rigorous adversarial reviewer to detect subtle flaws in existing proofs, and embedding it within a “neuro-symbolic” loop that autonomously writes and executes code to verify complex derivations. Together, these examples highlight the potential of AI not just as a tool for automation, but as a versatile, genuine partner in the creative process of scientific discovery.
zh

[NLP-2] AutoFigure: Generating and Refining Publication-Ready Scientific Illustrations ICLR2026

【速读】: 该论文旨在解决科学插图(scientific illustrations)自动生成中的关键瓶颈问题,即如何从长篇科学文本中自动、高质量地生成结构完整且具有美学吸引力的插图。传统方法依赖人工制作,效率低下且难以规模化,限制了科研成果的有效传播。解决方案的核心在于提出AutoFigure——首个基于智能体(agentic)框架的自动化生成系统,其创新性体现在:在生成最终插图前,通过深度思考、内容重组与多轮验证机制,确保输出结果兼具结构性合理性与视觉美感;同时,配套构建了FigureBench这一大规模基准数据集(含3,300对高质量文本-插图样本),为评估和优化生成质量提供了可靠标准。实验表明,AutoFigure显著优于现有基线方法,可生成达到出版标准的科学插图。

链接: https://arxiv.org/abs/2602.03828
作者: Minjun Zhu,Zhen Lin,Yixuan Weng,Panzhong Lu,Qiujie Xie,Yifan Wei,Sifan Liu,Qiyao Sun,Yue Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)
备注: Accepted at the ICLR 2026

点击查看摘要

Abstract:High-quality scientific illustrations are crucial for effectively communicating complex scientific and technical concepts, yet their manual creation remains a well-recognized bottleneck in both academia and industry. We present FigureBench, the first large-scale benchmark for generating scientific illustrations from long-form scientific texts. It contains 3,300 high-quality scientific text-figure pairs, covering diverse text-to-illustration tasks from scientific papers, surveys, blogs, and textbooks. Moreover, we propose AutoFigure, the first agentic framework that automatically generates high-quality scientific illustrations based on long-form scientific text. Specifically, before rendering the final result, AutoFigure engages in extensive thinking, recombination, and validation to produce a layout that is both structurally sound and aesthetically refined, outputting a scientific illustration that achieves both structural completeness and aesthetic appeal. Leveraging the high-quality data from FigureBench, we conduct extensive experiments to test the performance of AutoFigure against various baseline methods. The results demonstrate that AutoFigure consistently surpasses all baseline methods, producing publication-ready scientific illustrations. The code, dataset and huggingface space are released in this https URL.
zh

[NLP-3] hey Said Memes Were Harmless-We Found the Ones That Hurt: Decoding Jokes Symbols and Cultural References

【速读】: 该论文旨在解决基于表情包(meme)的社会滥用内容检测问题,其核心挑战在于有害意图常依赖隐含的文化符号和跨模态的细微不一致,而现有方法受限于文化盲视(cultural blindness)、边界模糊(boundary ambiguity)以及缺乏可解释性(lack of interpretability)。解决方案的关键在于提出一个三阶段框架CROSS-ALIGN+:首先通过引入ConceptNet、Wikidata和Hatebase等结构化知识增强多模态表征以缓解文化盲视;其次利用参数高效的LoRA适配器(LoRA adapters)明确区分讽刺与滥用,降低边界模糊性;最后生成级联式解释(cascaded explanations)提升模型决策的可解释性,从而实现更准确且透明的社会滥用内容识别。

链接: https://arxiv.org/abs/2602.03822
作者: Sahil Tripathi,Gautam Siddharth Kashyap,Mehwish Nasim,Jian Yang,Jiechao Gao,Usman Naseem
机构: Jamia Hamdard (贾米亚·哈姆达德大学); Macquarie University (麦考瑞大学); The University of Western Australia (西澳大利亚大学); Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL)
备注: Accepted at the The Web Conference 2026 (Research Track)

点击查看摘要

Abstract:Meme-based social abuse detection is challenging because harmful intent often relies on implicit cultural symbolism and subtle cross-modal incongruence. Prior approaches, from fusion-based methods to in-context learning with Large Vision-Language Models (LVLMs), have made progress but remain limited by three factors: i) cultural blindness (missing symbolic context), ii) boundary ambiguity (satire vs. abuse confusion), and iii) lack of interpretability (opaque model reasoning). We introduce CROSS-ALIGN+, a three-stage framework that systematically addresses these limitations: (1) Stage I mitigates cultural blindness by enriching multimodal representations with structured knowledge from ConceptNet, Wikidata, and Hatebase; (2) Stage II reduces boundary ambiguity through parameter-efficient LoRA adapters that sharpen decision boundaries; and (3) Stage III enhances interpretability by generating cascaded explanations. Extensive experiments on five benchmarks and eight LVLMs demonstrate that CROSS-ALIGN+ consistently outperforms state-of-the-art methods, achieving up to 17% relative F1 improvement while providing interpretable justifications for each decision.
zh

[NLP-4] Antidistillation Fingerprinting

【速读】: 该论文旨在解决生成式 AI (Generative AI) 中第三方学生模型(student model)通过蒸馏(distillation)方式学习教师模型(teacher model)输出时,如何有效检测此类训练行为的问题。现有指纹技术依赖启发式扰动,在生成质量与指纹强度之间存在显著权衡,常需大幅降低模型实用性以确保指纹被学生模型吸收。其解决方案的关键在于提出对抗蒸馏指纹(Antidistillation Fingerprinting, ADFP),该方法基于梯度驱动的对抗蒸馏采样框架,利用代理模型(proxy model)识别并采样能最大化学生模型微调后指纹可检测性的token,而非依赖非目标性水印的偶然吸收,从而在保持模型性能几乎不变的前提下显著提升检测置信度。

链接: https://arxiv.org/abs/2602.03812
作者: Yixuan Even Xu,John Kirchenbauer,Yash Savani,Asher Trockman,Alexander Robey,Tom Goldstein,Fei Fang,J. Zico Kolter
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 26 pages, 11 figures

点击查看摘要

Abstract:Model distillation enables efficient emulation of frontier large language models (LLMs), creating a need for robust mechanisms to detect when a third-party student model has trained on a teacher model’s outputs. However, existing fingerprinting techniques that could be used to detect such distillation rely on heuristic perturbations that impose a steep trade-off between generation quality and fingerprinting strength, often requiring significant degradation of utility to ensure the fingerprint is effectively internalized by the student. We introduce antidistillation fingerprinting (ADFP), a principled approach that aligns the fingerprinting objective with the student’s learning dynamics. Building upon the gradient-based framework of antidistillation sampling, ADFP utilizes a proxy model to identify and sample tokens that directly maximize the expected detectability of the fingerprint in the student after fine-tuning, rather than relying on the incidental absorption of the un-targeted biases of a more naive watermark. Experiments on GSM8K and OASST1 benchmarks demonstrate that ADFP achieves a significant Pareto improvement over state-of-the-art baselines, yielding stronger detection confidence with minimal impact on utility, even when the student model’s architecture is unknown.
zh

[NLP-5] Bridging Online and Offline RL: Contextual Bandit Learning for Multi-Turn Code Generation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多轮代码生成等迭代决策任务中,因在线强化学习(Online Reinforcement Learning, RL)训练成本高且不稳定,而难以广泛应用的问题。其解决方案的关键在于提出一种名为Cobalt的新方法,该方法将多轮代码生成建模为一个单步可恢复的马尔可夫决策过程(one-step recoverable Markov decision process),通过离线轨迹收集与上下文带-bandit学习相结合的方式:首先利用参考模型生成完整代码轨迹并切分为部分轨迹作为上下文提示,随后在在线带-bandit学习阶段,模型仅需基于这些提示完成单步代码生成。这种方法既保留了在线RL的性能优势,又显著降低了训练复杂度和不稳定性,同时引入扰动轨迹增强以缓解模型在上下文中的奖励劫持(reward hacking)行为。

链接: https://arxiv.org/abs/2602.03806
作者: Ziru Chen,Dongdong Chen,Ruinan Jin,Yingbin Liang,Yujia Xie,Huan Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Recently, there have been significant research interests in training large language models (LLMs) with reinforcement learning (RL) on real-world tasks, such as multi-turn code generation. While online RL tends to perform better than offline RL, its higher training cost and instability hinders wide adoption. In this paper, we build on the observation that multi-turn code generation can be formulated as a one-step recoverable Markov decision process and propose contextual bandit learning with offline trajectories (Cobalt), a new method that combines the benefits of online and offline RL. Cobalt first collects code generation trajectories using a reference LLM and divides them into partial trajectories as contextual prompts. Then, during online bandit learning, the LLM is trained to complete each partial trajectory prompt through single-step code generation. Cobalt outperforms two multi-turn online RL baselines based on GRPO and VeRPO, and substantially improves R1-Distill 8B and Qwen3 8B by up to 9.0 and 6.2 absolute Pass@1 scores on LiveCodeBench. Also, we analyze LLMs’ in-context reward hacking behaviors and augment Cobalt training with perturbed trajectories to mitigate this issue. Overall, our results demonstrate Cobalt as a promising solution for iterative decision-making tasks like multi-turn code generation. Our code and data are available at this https URL.
zh

[NLP-6] FullStack-Agent : Enhancing Agent ic Full-Stack Web Coding via Development-Oriented Testing and Repository Back-Translation

【速读】: 该论文旨在解决当前大语言模型(Large Language Model, LLM)驱动的代码代理在生成复杂交互式网站时仅限于前端页面、缺乏真实全栈数据处理与存储能力的问题。其核心挑战在于构建生产级全栈Web应用需要对数据流进行精细控制、理解不断更新的包依赖关系,并精准定位代码库中的隐蔽错误。解决方案的关键在于提出一个统一的全栈智能体系统——FullStack-Agent,包含三个核心组件:(1) FullStack-Dev,一个多智能体框架,具备强大的规划、代码编辑、代码库导航和错误定位能力;(2) FullStack-Learn,一种创新的数据扩展与自我改进方法,通过回译爬取和合成的网站仓库来提升基础LLM性能;(3) FullStack-Bench,一套系统性测试基准,全面评估生成网站的前端、后端及数据库功能。实验证明,FullStack-Dev在前端、后端和数据库测试用例上分别优于现有最先进方法8.7%、38.2%和15.9%,而FullStack-Learn使30B规模模型性能提升9.7%、9.5%和2.8%,显著提升了全栈代码生成的准确性与鲁棒性。

链接: https://arxiv.org/abs/2602.03798
作者: Zimu Lu,Houxing Ren,Yunqiao Yang,Ke Wang,Zhuofan Zong,Mingjie Zhan,Hongsheng Li
机构: 未知
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Assisting non-expert users to develop complex interactive websites has become a popular task for LLM-powered code agents. However, existing code agents tend to only generate frontend web pages, masking the lack of real full-stack data processing and storage with fancy visual effects. Notably, constructing production-level full-stack web applications is far more challenging than only generating frontend web pages, demanding careful control of data flow, comprehensive understanding of constantly updating packages and dependencies, and accurate localization of obscure bugs in the codebase. To address these difficulties, we introduce FullStack-Agent, a unified agent system for full-stack agentic coding that consists of three parts: (1) FullStack-Dev, a multi-agent framework with strong planning, code editing, codebase navigation, and bug localization abilities. (2) FullStack-Learn, an innovative data-scaling and self-improving method that back-translates crawled and synthesized website repositories to improve the backbone LLM of FullStack-Dev. (3) FullStack-Bench, a comprehensive benchmark that systematically tests the frontend, backend and database functionalities of the generated website. Our FullStack-Dev outperforms the previous state-of-the-art method by 8.7%, 38.2%, and 15.9% on the frontend, backend, and database test cases respectively. Additionally, FullStack-Learn raises the performance of a 30B model by 9.7%, 9.5%, and 2.8% on the three sets of test cases through self-improvement, demonstrating the effectiveness of our approach. The code is released at this https URL.
zh

[NLP-7] WebSentinel: Detecting and Localizing Prompt Injection Attacks for Web Agents

【速读】: 该论文旨在解决网页代理(web agent)在面对提示注入攻击(prompt injection attack)时,因网页内容被恶意篡改而导致执行非用户意图任务的问题。现有检测与定位方法在真实网页代理场景中效果有限,因其假设条件难以满足。解决方案的关键在于提出WebSentinel,一种两步式检测与定位框架:第一步从网页中提取可能被污染的片段(segment),第二步通过评估每个片段与其所在网页上下文的一致性来判断是否为攻击源。该方法显著优于基线模型,在多个自建的污染与干净网页数据集上均展现出优越性能。

链接: https://arxiv.org/abs/2602.03792
作者: Xilong Wang,Yinuo Liu,Zhun Wang,Dawn Song,Neil Gong
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Prompt injection attacks manipulate webpage content to cause web agents to execute attacker-specified tasks instead of the user’s intended ones. Existing methods for detecting and localizing such attacks achieve limited effectiveness, as their underlying assumptions often do not hold in the web-agent setting. In this work, we propose WebSentinel, a two-step approach for detecting and localizing prompt injection attacks in webpages. Given a webpage, Step I extracts \emphsegments of interest that may be contaminated, and Step II evaluates each segment by checking its consistency with the webpage content as context. We show that WebSentinel is highly effective, substantially outperforming baseline methods across multiple datasets of both contaminated and clean webpages that we collected. Our code is available at: this https URL.
zh

[NLP-8] AOrchestra: Automating Sub-Agent Creation for Agent ic Orchestration

【速读】: 该论文旨在解决当前语言代理(Language Agents)在处理复杂、长周期任务时,因缺乏动态抽象机制而导致的适应性不足问题。现有方法通常采用静态或固定的子代理设计,难以灵活应对多样化的任务需求。解决方案的关键在于提出一种统一且框架无关的代理抽象模型,将任意代理表示为四元组(Instruction, Context, Tools, Model),这一抽象形式作为能力的组合式配方,使系统能够按需生成专用执行器。在此基础上,作者构建了名为 AOrchestra 的智能体系统,其中中央编排器在每一步动态构造该四元组:筛选任务相关的上下文、选择合适的工具与模型,并通过即时自动创建代理来委派执行。该设计显著降低了人工工程成本,支持多种代理插件式接入,并实现可控的性能-成本权衡,从而逼近帕累托最优。

链接: https://arxiv.org/abs/2602.03786
作者: Jianhao Ruan,Zhihao Xu,Yiran Peng,Fashen Ren,Zhaoyang Yu,Xinbing Liang,Jinyu Xiang,Bang Liu,Chenglin Wu,Yuyu Luo,Jiayi Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Language agents have shown strong promise for task automation. Realizing this promise for increasingly complex, long-horizon tasks has driven the rise of a sub-agent-as-tools paradigm for multi-turn task solving. However, existing designs still lack a dynamic abstraction view of sub-agents, thereby hurting adaptability. We address this challenge with a unified, framework-agnostic agent abstraction that models any agent as a tuple Instruction, Context, Tools, Model. This tuple acts as a compositional recipe for capabilities, enabling the system to spawn specialized executors for each task on demand. Building on this abstraction, we introduce an agentic system AOrchestra, where the central orchestrator concretizes the tuple at each step: it curates task-relevant context, selects tools and models, and delegates execution via on-the-fly automatic agent creation. Such designs enable reducing human engineering efforts, and remain framework-agnostic with plug-and-play support for diverse agents as task executors. It also enables a controllable performance-cost trade-off, allowing the system to approach Pareto-efficient. Across three challenging benchmarks (GAIA, SWE-Bench, Terminal-Bench), AOrchestra achieves 16.28% relative improvement against the strongest baseline when paired with Gemini-3-Flash. The code is available at: this https URL
zh

[NLP-9] Context Compression via Explicit Information Transmission

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长上下文推理中因二次方复杂度的注意力机制和不断增长的键值缓存(key-value caches)而导致的计算成本高昂问题,提出通过软压缩(soft context compression)来缓解这一瓶颈。其解决方案的关键在于提出ComprExIT(Context Compression via Explicit Information Transmission),这是一种轻量级框架,将软压缩重构为一种新的范式:在冻结的LLM隐藏状态上进行显式信息传输(explicit information transmission)。该方法通过两个核心机制实现高效压缩:(i) 深度方向的信息传输(depth-wise transmission),选择性地将多层特征传递至token锚点(token anchors),从而缓解逐层表示覆盖问题;(ii) 宽度方向的信息传输(width-wise transmission),通过全局优化的传输方案将锚点聚合到少量槽位中,确保信息分配的协调性。实验表明,ComprExIT在六个问答基准测试中持续优于现有最优压缩方法,且仅引入约1%的额外参数,验证了显式且协调的信息传输策略在提升压缩效果与鲁棒性方面的有效性。

链接: https://arxiv.org/abs/2602.03784
作者: Jiangnan Ye,Hanqi Yan,Zhenyi Shen,Heng Chang,Ye Mao,Yulan He
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Long-context inference with Large Language Models (LLMs) is costly due to quadratic attention and growing key-value caches, motivating context compression. In this work, we study soft context compression, where a long context is condensed into a small set of continuous representations. Existing methods typically re-purpose the LLM itself as a trainable compressor, relying on layer-by-layer self-attention to iteratively aggregate information. We argue that this paradigm suffers from two structural limitations: (i) progressive representation overwriting across layers (ii) uncoordinated allocation of compression capacity across tokens. We propose ComprExIT (Context Compression via Explicit Information Transmission), a lightweight framework that formulates soft compression into a new paradigm: explicit information transmission over frozen LLM hidden states. This decouples compression from the model’s internal self-attention dynamics. ComprExIT performs (i) depth-wise transmission to selectively transmit multi-layer information into token anchors, mitigating progressive overwriting, and (ii) width-wise transmission to aggregate anchors into a small number of slots via a globally optimized transmission plan, ensuring coordinated allocation of information. Across six question-answering benchmarks, ComprExIT consistently outperforms state-of-the-art context compression methods while introducing only ~1% additional parameters, demonstrating that explicit and coordinated information transmission enables more effective and robust long-context compression.
zh

[NLP-10] Efficient Estimation of Kernel Surrogate Models for Task Attribution ICLR2026

【速读】: 该论文旨在解决多任务训练中任务归因(task attribution)问题,即量化每个训练任务对目标任务性能的影响。传统方法如留一法重训练(leave-one-out retraining)虽准确但计算开销巨大,而现有基于线性代理模型(linear surrogate models)的方法仅能捕捉一阶关系,无法建模任务间的非线性交互(如协同效应、拮抗效应等)。论文的关键创新在于提出一种统一的任务加权框架,并首次揭示线性代理模型与影响函数(influence functions)之间的二阶联系;进一步引入核代理模型(kernel surrogate models),通过非线性核函数有效建模二阶任务交互关系。为高效学习该模型,作者设计了一种基于梯度的估计方法,利用预训练模型的一阶近似实现无需重复训练即可获得高精度估计(相对误差<2%)。实验表明,核代理模型在多个领域(包括Transformer数学推理、上下文学习和多目标强化学习)均显著优于线性代理模型与影响函数基线,相关性提升达25%,且在下游任务选择中带来40%的性能改进。

链接: https://arxiv.org/abs/2602.03783
作者: Zhenshuo Zhang,Minxuan Duan,Hongyang R. Zhang
机构: Northeastern University (东北大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 27 pages. To appear in ICLR 2026

点击查看摘要

Abstract:Modern AI agents such as large language models are trained on diverse tasks – translation, code generation, mathematical reasoning, and text prediction – simultaneously. A key question is to quantify how each individual training task influences performance on a target task, a problem we refer to as task attribution. The direct approach, leave-one-out retraining, measures the effect of removing each task, but is computationally infeasible at scale. An alternative approach that builds surrogate models to predict a target task’s performance for any subset of training tasks has emerged in recent literature. Prior work focuses on linear surrogate models, which capture first-order relationships, but miss nonlinear interactions such as synergy, antagonism, or XOR-type effects. In this paper, we first consider a unified task weighting framework for analyzing task attribution methods, and show a new connection between linear surrogate models and influence functions through a second-order analysis. Then, we introduce kernel surrogate models, which more effectively represent second-order task interactions. To efficiently learn the kernel surrogate, we develop a gradient-based estimation procedure that leverages a first-order approximation of pretrained models; empirically, this yields accurate estimates with less than 2% relative error without repeated retraining. Experiments across multiple domains – including math reasoning in transformers, in-context learning, and multi-objective reinforcement learning – demonstrate the effectiveness of kernel surrogate models. They achieve a 25% higher correlation with the leave-one-out ground truth than linear surrogates and influence-function baselines. When used for downstream task selection, kernel surrogate models yield a 40% improvement in demonstration selection for in-context learning and multi-objective reinforcement learning benchmarks.
zh

[NLP-11] CUBO: Self-Contained Retrieval-Augmented Generation on Consumer Laptops 10 GB Corpora 16 GB RAM Single-Device Deployment

【速读】: 该论文旨在解决组织在处理敏感文档时面临的矛盾:使用基于云的生成式 AI (Generative AI) 存在违反《通用数据保护条例》(GDPR) 的风险,而本地部署系统通常需要 18–32 GB 内存,难以在消费级笔记本电脑上实现。解决方案的关键在于提出 CUBO——一个面向具备 16 GB 共享内存的消费级笔记本的系统导向检索增强生成(RAG)平台,其核心创新包括流式数据摄入(O(1) 缓冲区开销)、分层混合检索和面向硬件的调度策略,可在不超过 15.5 GB 内存限制的前提下实现具有竞争力的 Recall@10(0.48–0.97,覆盖 BEIR 域),同时通过仅本地处理保障 GDPR 第 5(1)© 条规定的数据最小化原则。

链接: https://arxiv.org/abs/2602.03731
作者: Paolo Astrino
机构: 未知
类目: Computation and Language (cs.CL)
备注: 24 pages, 2 figures, 6 tables

点击查看摘要

Abstract:Organizations handling sensitive documents face a tension: cloud-based AI risks GDPR violations, while local systems typically require 18-32 GB RAM. This paper presents CUBO, a systems-oriented RAG platform for consumer laptops with 16 GB shared memory. CUBO’s novelty lies in engineering integration of streaming ingestion (O(1) buffer overhead), tiered hybrid retrieval, and hardware-aware orchestration that enables competitive Recall@10 (0.48-0.97 across BEIR domains) within a hard 15.5 GB RAM ceiling. The 37,000-line codebase achieves retrieval latencies of 185 ms (p50) on C1,300 laptops while maintaining data minimization through local-only processing aligned with GDPR Art. 5(1)©. Evaluation on BEIR benchmarks validates practical deployability for small-to-medium professional archives. The codebase is publicly available at this https URL.
zh

[NLP-12] raining Multi-Turn Search Agent via Contrastive Dynamic Branch Sampling

【速读】: 该论文旨在解决长时程(long-horizon)强化学习中因稀疏的轨迹级奖励而导致的学习困难问题,尤其在大语言模型进行多轮规划和工具调用时表现不佳。现有基于树的方法虽尝试缓解此问题,但常面临高方差与计算效率低下的挑战。其解决方案的关键在于提出一种无价值函数(value-free)的对比监督方法——分支相对策略优化(Branching Relative Policy Optimization, BranPO),通过截断轨迹尾部并重采样替代延续路径来构建共享前缀上的对比后缀,从而降低长期回溯中的信用分配模糊性;同时引入难度感知分支采样机制以动态调整任务间的分支频率,并采用冗余步骤掩码抑制无信息动作,显著提升训练效率与稳定性。

链接: https://arxiv.org/abs/2602.03719
作者: Yubao Zhao,Weiquan Huang,Sudong Wang,Ruochen Zhao,Chen Chen,Yao Shu,Chengwei Qin
机构: 未知
类目: Computation and Language (cs.CL)
备注: 24 pages, 5 figures

点击查看摘要

Abstract:Agentic reinforcement learning has enabled large language models to perform complex multi-turn planning and tool use. However, learning in long-horizon settings remains challenging due to sparse, trajectory-level outcome rewards. While prior tree-based methods attempt to mitigate this issue, they often suffer from high variance and computational inefficiency. Through empirical analysis of search agents, We identify a common pattern: performance diverges mainly due to decisions near the tail. Motivated by this observation, we propose Branching Relative Policy Optimization (BranPO), a value-free method that provides step-level contrastive supervision without dense rewards. BranPO truncates trajectories near the tail and resamples alternative continuations to construct contrastive suffixes over shared prefixes, reducing credit ambiguity in long-horizon rollouts. To further boost efficiency and stabilize training, we introduce difficulty-aware branch sampling to adapt branching frequency across tasks, and redundant step masking to suppress uninformative actions. Extensive experiments on various question answering benchmarks demonstrate that BranPO consistently outperforms strong baselines, achieving significant accuracy gains on long-horizon tasks without increasing the overall training budget. Our code is available at \hrefthis https URLcode.
zh

[NLP-13] No Shortcuts to Culture: Indonesian Multi-hop Question Answering for Complex Cultural Understanding

【速读】: 该论文旨在解决当前文化理解类问答(QA)基准测试中普遍存在的单跳问题(single-hop questions)局限性,这些问题往往使大语言模型(LLM)仅依赖浅层线索而非真正具备跨情境、传统和隐含社会知识的推理能力。为应对这一挑战,作者提出ID-MoCQA——首个大规模多跳(multi-hop)文化理解QA数据集,聚焦印度尼西亚传统文化,并提供英印尼双语版本。其解决方案的关键在于构建一个系统性的转换框架,将原始单跳文化问题转化为包含六类线索(如常识、时间、地理等)的多跳推理链,并通过专家评审与LLM-as-a-judge相结合的多阶段验证流程确保答案质量,从而有效评估LLM在复杂文化推理任务中的真实能力。

链接: https://arxiv.org/abs/2602.03709
作者: Vynska Amalia Permadi,Xingwei Tan,Nafise Sadat Moosavi,Nikos Aletras
机构: University of Sheffield (谢菲尔德大学); Universitas Pembangunan Nasional “Veteran” Yogyakarta (雅加达国防部发展国立大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Understanding culture requires reasoning across context, tradition, and implicit social knowledge, far beyond recalling isolated facts. Yet most culturally focused question answering (QA) benchmarks rely on single-hop questions, which may allow models to exploit shallow cues rather than demonstrate genuine cultural reasoning. In this work, we introduce ID-MoCQA, the first large-scale multi-hop QA dataset for assessing the cultural understanding of large language models (LLMs), grounded in Indonesian traditions and available in both English and Indonesian. We present a new framework that systematically transforms single-hop cultural questions into multi-hop reasoning chains spanning six clue types (e.g., commonsense, temporal, geographical). Our multi-stage validation pipeline, combining expert review and LLM-as-a-judge filtering, ensures high-quality question-answer pairs. Our evaluation across state-of-the-art models reveals substantial gaps in cultural reasoning, particularly in tasks requiring nuanced inference. ID-MoCQA provides a challenging and essential benchmark for advancing the cultural competency of LLMs.
zh

[NLP-14] Beyond Tokens: Semantic-Aware Speculative Decoding for Efficient Inference by Probing Internal States

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在推理过程中因自回归解码导致的高延迟问题,尤其针对大型推理模型(Large Reasoning Models, LRMs)生成长链思维(chain-of-thought)时效率低下的痛点。现有推测解码(speculative decoding)方法虽能通过并行生成和验证多个token加速推理,但其基于token级别的验证机制忽略了语义等价性(semantic equivalence),即不同token序列可能表达相同语义,从而造成无效拒绝与资源浪费。解决方案的关键在于提出SemanticSpec——一种语义感知的推测解码框架,其核心创新是引入语义概率估计机制,利用模型内部隐藏状态对特定语义序列的生成可能性进行评估,从而以整个语义序列而非单个token为单位进行验证,显著提升验证效率与准确性。实验表明,SemanticSpec在多个基准测试中实现了最高达2.7倍的加速比,优于现有的token级与序列级基线方法。

链接: https://arxiv.org/abs/2602.03708
作者: Ximing Dong,Shaowei Wang,Dayi Lin,Boyuan Chen,Ahmed E. Hassan
机构: Huawei(华为); University of Manitoba (曼尼托巴大学); Queen’s University (皇后大学)
类目: Computation and Language (cs.CL); Performance (cs.PF)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) achieve strong performance across many tasks but suffer from high inference latency due to autoregressive decoding. The issue is exacerbated in Large Reasoning Models (LRMs), which generate lengthy chains of thought. While speculative decoding accelerates inference by drafting and verifying multiple tokens in parallel, existing methods operate at the token level and ignore semantic equivalence (i.e., different token sequences expressing the same meaning), leading to inefficient rejections. We propose SemanticSpec, a semantic-aware speculative decoding framework that verifies entire semantic sequences instead of tokens. SemanticSpec introduces a semantic probability estimation mechanism that probes the model’s internal hidden states to assess the likelihood of generating sequences with specific this http URL on four benchmarks show that SemanticSpec achieves up to 2.7x speedup on DeepSeekR1-32B and 2.1x on QwQ-32B, consistently outperforming token-level and sequence-level baselines in both efficiency and effectiveness.
zh

[NLP-15] OmniRAG -Agent Agent : Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video Question Answering

【速读】: 该论文旨在解决低资源环境下长时音频-视频问答(long audio-video question answering)中存在的四大挑战:密集编码成本高、细粒度检索能力弱、主动规划能力有限,以及缺乏清晰的端到端推理流程。其核心解决方案是提出OmniRAG-Agent,一种基于代理(agent)的多模态问答方法,关键在于构建一个图像-音频检索增强生成模块,使OmniLLM能够从外部知识库中获取短而相关的帧和音频片段;同时引入代理循环机制,实现跨轮次的计划、工具调用与检索证据融合,并采用群体相对策略优化(group relative policy optimization)协同提升工具使用效率与答案质量。

链接: https://arxiv.org/abs/2602.03707
作者: Yifan Zhu,Xinyu Mu,Tao Feng,Zhonghong Ou,Yuning Gong,Haoran Luo
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Long-horizon omnimodal question answering answers questions by reasoning over text, images, audio, and video. Despite recent progress on OmniLLMs, low-resource long audio-video QA still suffers from costly dense encoding, weak fine-grained retrieval, limited proactive planning, and no clear end-to-end this http URL address these issues, we propose OmniRAG-Agent, an agentic omnimodal QA method for budgeted long audio-video reasoning. It builds an image-audio retrieval-augmented generation module that lets an OmniLLM fetch short, relevant frames and audio snippets from external banks. Moreover, it uses an agent loop that plans, calls tools across turns, and merges retrieved evidence to answer complex queries. Furthermore, we apply group relative policy optimization to jointly improve tool use and answer quality over time. Experiments on OmniVideoBench, WorldSense, and Daily-Omni show that OmniRAG-Agent consistently outperforms prior methods under low-resource settings and achieves strong results, with ablations validating each component.
zh

[NLP-16] Cognitively Diverse Multiple-Choice Question Generation: A Hybrid Multi-Agent Framework with Large Language Models

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在自动生成多选题(Multiple-Choice Questions, MCQs)时,难以可靠地满足特定认知层次要求的问题,尤其是在文本理解、推理和主旨把握等不同认知维度上的可控性不足。解决方案的关键在于提出了一种名为ReQUESTA的混合式多智能体框架,其通过将MCQ创作分解为专业化子任务,并协调LLM驱动的智能体与基于规则的组件,实现规划、受控生成、迭代评估与后处理的全流程协同。这种工作流设计显著提升了生成题目在难度、区分度及与阅读理解表现一致性方面的质量,验证了“流程设计”作为结构化生成任务的核心杠杆作用。

链接: https://arxiv.org/abs/2602.03704
作者: Yu Tian,Linh Huynh,Katerina Christhilf,Shubham Chakraborty,Micah Watanabe,Tracy Arner,Danielle McNamara
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This manuscript is under review at Electronics

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have made automated multiple-choice question (MCQ) generation increasingly feasible; however, reliably producing items that satisfy controlled cognitive demands remains a challenge. To address this gap, we introduce ReQUESTA, a hybrid, multi-agent framework for generating cognitively diverse MCQs that systematically target text-based, inferential, and main idea comprehension. ReQUESTA decomposes MCQ authoring into specialized subtasks and coordinates LLM-powered agents with rule-based components to support planning, controlled generation, iterative evaluation, and post-processing. We evaluated the framework in a large-scale reading comprehension study using academic expository texts, comparing ReQUESTA-generated MCQs with those produced by a single-pass GPT-5 zero-shot baseline. Psychometric analyses of learner responses assessed item difficulty and discrimination, while expert raters evaluated question quality across multiple dimensions, including topic relevance and distractor quality. Results showed that ReQUESTA-generated items were consistently more challenging, more discriminative, and more strongly aligned with overall reading comprehension performance. Expert evaluations further indicated stronger alignment with central concepts and superior distractor linguistic consistency and semantic plausibility, particularly for inferential questions. These findings demonstrate that hybrid, agentic orchestration can systematically improve the reliability and controllability of LLM-based generation, highlighting workflow design as a key lever for structured artifact generation beyond single-pass prompting.
zh

[NLP-17] Conflict-Resolving and Sharpness-Aware Minimization for Generalized Knowledge Editing with Multiple Updates

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在知识更新过程中面临的三大挑战:输入形式泛化能力差、多轮编辑下的稳定性不足以及新旧知识之间的冲突问题。针对这些问题,作者提出了一种参数高效的综合训练框架——CoRSA(Conflict-Resolving and Sharpness-Aware Minimization),其核心创新在于同时实现三个目标:通过最小化损失曲率提升模型对不同输入形式的泛化能力与多轮更新的稳定性;通过最大化新旧知识间的间隔来缓解知识冲突;并通过实证验证其在事实编辑和代码生成任务中的有效性,显著优于LoRA和传统模型编辑方法,在平均绝对性能提升上达12.42%,并减少27.82%的灾难性遗忘。

链接: https://arxiv.org/abs/2602.03696
作者: Duy Nguyen,Hanqi Xiao,Archiki Prasad,Elias Stengel-Eskin,Hyunji Lee,Mohit Bansal
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 22 pages, 8 figures. Code link: this https URL

点击查看摘要

Abstract:Large language models (LLMs) rely on internal knowledge to solve many downstream tasks, making it crucial to keep them up to date. Since full retraining is expensive, prior work has explored efficient alternatives such as model editing and parameter-efficient fine-tuning. However, these approaches often break down in practice due to poor generalization across inputs, limited stability, and knowledge conflict. To address these limitations, we propose the CoRSA (Conflict-Resolving and Sharpness-Aware Minimization) training framework, a parameter-efficient, holistic approach for knowledge editing with multiple updates. CoRSA tackles multiple challenges simultaneously: it improves generalization to different input forms and enhances stability across multiple updates by minimizing loss curvature, and resolves conflicts by maximizing the margin between new and prior knowledge. Across three widely used fact editing benchmarks, CoRSA achieves significant gains in generalization, outperforming baselines with average absolute improvements of 12.42% over LoRA and 10% over model editing methods. With multiple updates, it maintains high update efficacy while reducing catastrophic forgetting by 27.82% compared to LoRA. CoRSA also generalizes to the code domain, outperforming the strongest baseline by 5.48% Pass@5 in update efficacy.
zh

[NLP-18] Agent Primitives: Reusable Latent Building Blocks for Multi-Agent Systems

【速读】: 该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的多智能体系统(Multi-Agent Systems, MAS)中存在的任务特异性高、架构复杂且可复用性差,以及依赖自然语言通信导致长流程交互中误差累积和不稳定性的问题。其核心解决方案是提出了一组可复用的潜在构建模块——Agent Primitives,包括Review(审查)、Voting and Selection(投票与选择)和Planning and Execution(规划与执行)三种通用计算模式;这些原语通过键值缓存(Key-Value Cache, KV cache)进行内部通信,有效缓解了多阶段交互中的信息退化问题,提升了系统的鲁棒性和效率。同时,引入一个Organizer代理根据轻量级知识池自动组合原语以适应不同任务,从而实现高效、稳定且通用的MAS构建。

链接: https://arxiv.org/abs/2602.03695
作者: Haibo Jin,Kuang Peng,Ye Yu,Xiaopeng Yuan,Haohan Wang
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 16 pages

点击查看摘要

Abstract:While existing multi-agent systems (MAS) can handle complex problems by enabling collaboration among multiple agents, they are often highly task-specific, relying on manually crafted agent roles and interaction prompts, which leads to increased architectural complexity and limited reusability across tasks. Moreover, most MAS communicate primarily through natural language, making them vulnerable to error accumulation and instability in long-context, multi-stage interactions within internal agent histories. In this work, we propose \textbfAgent Primitives, a set of reusable latent building blocks for LLM-based MAS. Inspired by neural network design, where complex models are built from reusable components, we observe that many existing MAS architectures can be decomposed into a small number of recurring internal computation patterns. Based on this observation, we instantiate three primitives: Review, Voting and Selection, and Planning and Execution. All primitives communicate internally via key-value (KV) cache, which improves both robustness and efficiency by mitigating information degradation across multi-stage interactions. To enable automatic system construction, an Organizer agent selects and composes primitives for each query, guided by a lightweight knowledge pool of previously successful configurations, forming a primitive-based MAS. Experiments show that primitives-based MAS improve average accuracy by 12.0-16.5% over single-agent baselines, reduce token usage and inference latency by approximately 3 \times -4 \times compared to text-based MAS, while incurring only 1.3 \times -1.6 \times overhead relative to single-agent inference and providing more stable performance across model backbones. Comments: 16 pages Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2602.03695 [cs.MA] (or arXiv:2602.03695v1 [cs.MA] for this version) https://doi.org/10.48550/arXiv.2602.03695 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-19] OCRTurk: A Comprehensive OCR Benchmark for Turkish EACL2026

【速读】: 该论文旨在解决当前文档解析(Document Parsing)基准测试在低资源语言场景下覆盖不足的问题,特别是针对土耳其语(Turkish)缺乏标准化、反映真实世界多样性的评估基准。其解决方案的关键在于构建了一个名为OCRTurk的土耳其语文档解析基准数据集,该数据集涵盖多种版面元素(layout elements)和文档类别(如学术文章、学位论文、幻灯片和非学术文章),并按难度分为三个层次(easy, medium, hard)。通过在该基准上对七种OCR模型进行逐元素评估,研究发现PaddleOCR在多数指标上表现最优,且性能随文档类型显著变化,其中幻灯片类文档最为困难。这一工作为低资源语言的文档解析提供了可复现、多维度的评估标准,推动了相关技术在实际应用中的可靠性和鲁棒性提升。

链接: https://arxiv.org/abs/2602.03693
作者: Deniz Yılmaz,Evren Ayberk Munis,Çağrı Toraman,Süha Kağan Köse,Burak Aktaş,Mehmet Can Baytekin,Bilge Kaan Görür
机构: Middle East Technical University (中东部理工大学); Politecnico di Torino (都灵理工大学); Roketsan Inc. (火箭公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by EACL 2026 SIGTURK

点击查看摘要

Abstract:Document parsing is now widely used in applications, such as large-scale document digitization, retrieval-augmented generation, and domain-specific pipelines in healthcare and education. Benchmarking these models is crucial for assessing their reliability and practical robustness. Existing benchmarks mostly target high-resource languages and provide limited coverage for low-resource settings, such as Turkish. Moreover, existing studies on Turkish document parsing lack a standardized benchmark that reflects real-world scenarios and document diversity. To address this gap, we introduce OCRTurk, a Turkish document parsing benchmark covering multiple layout elements and document categories at three difficulty levels. OCRTurk consists of 180 Turkish documents drawn from academic articles, theses, slide decks, and non-academic articles. We evaluate seven OCR models on OCRTurk using element-wise metrics. Across difficulty levels, PaddleOCR achieves the strongest overall results, leading most element-wise metrics except figures and attaining high Normalized Edit Distance scores in easy, medium, and hard subsets. We also observe performance variation by document type. Models perform well on non-academic documents, while slideshows become the most challenging.
zh

[NLP-20] Rethinking the Reranker: Boundary-Aware Evidence Selection for Robust Retrieval-Augmented Generation

【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统在真实场景下因检索噪声导致性能脆弱的问题,尤其是在所需证据已存在于Top-K结果中时仍表现不佳。其核心原因是现有检索器和重排序器仅优化相关性指标,常选择过于简单或信息不足的片段,而未考虑这些证据是否适合作为生成器的输入。解决方案的关键在于提出BAR-RAG,它将重排序器重构为边界感知的证据选择器(boundary-aware evidence selector),目标是选取处于生成器“恰到好处区域”(Goldilocks Zone)的证据——即既不过于简单也不完全无法回答,而是对生成器具有挑战性但足够支持推理的证据,从而提供最强的学习信号。该方法通过强化学习利用生成器反馈训练选择器,并采用两阶段流水线:先在诱导的证据分布上微调生成器以缓解训练与推理之间的分布偏移,显著提升了RAG系统在噪声检索下的端到端性能和鲁棒性。

链接: https://arxiv.org/abs/2602.03689
作者: Jiashuo Sun,Pengcheng Jiang,Saizhuo Wang,Jiajun Fan,Heng Wang,Siru Ouyang,Ming Zhong,Yizhu Jiao,Chengsong Huang,Xueqiang Xu,Pengrui Han,Peiran Li,Jiaxin Huang,Ge Liu,Heng Ji,Jiawei Han
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 19 pages, 8 tables, 5 figures

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems remain brittle under realistic retrieval noise, even when the required evidence appears in the top-K results. A key reason is that retrievers and rerankers optimize solely for relevance, often selecting either trivial, answer-revealing passages or evidence that lacks the critical information required to answer the question, without considering whether the evidence is suitable for the generator. We propose BAR-RAG, which reframes the reranker as a boundary-aware evidence selector that targets the generator’s Goldilocks Zone – evidence that is neither trivially easy nor fundamentally unanswerable for the generator, but is challenging yet sufficient for inference and thus provides the strongest learning signal. BAR-RAG trains the selector with reinforcement learning using generator feedback, and adopts a two-stage pipeline that fine-tunes the generator under the induced evidence distribution to mitigate the distribution mismatch between training and inference. Experiments on knowledge-intensive question answering benchmarks show that BAR-RAG consistently improves end-to-end performance under noisy retrieval, achieving an average gain of 10.3 percent over strong RAG and reranking baselines while substantially improving robustness. Code is publicly avaliable at this https URL.
zh

[NLP-21] Neural Attention Search Linear: Towards Adaptive Token-Level Hybrid Attention Models

【速读】: 该论文旨在解决软注意力机制(Softmax Attention)在长序列场景下因二次计算复杂度而导致的效率瓶颈问题,同时克服线性注意力模型(Linear Attention)因隐藏状态尺寸受限而表达能力不足的局限。其解决方案的关键在于提出 Neural Attention Search Linear (NAtS-L) 框架,该框架在同一层内对不同token动态分配线性注意力或软注意力操作:对于仅具短期影响、可压缩至固定大小隐藏状态的token采用线性注意力,而对于包含长期依赖信息、需保留以供未来查询的token则启用软注意力。通过自动搜索最优的门控DeltaNet与软注意力组合策略,NAtS-L实现了token级别的混合架构,在保持模型表达力的同时显著提升计算效率。

链接: https://arxiv.org/abs/2602.03681
作者: Difan Deng,Andreas Bentzen Winje,Lukas Fehring,Marius Lindauer
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 17 pages, 8 figures

点击查看摘要

Abstract:The quadratic computational complexity of softmax transformers has become a bottleneck in long-context scenarios. In contrast, linear attention model families provide a promising direction towards a more efficient sequential model. These linear attention models compress past KV values into a single hidden state, thereby efficiently reducing complexity during both training and inference. However, their expressivity remains limited by the size of their hidden state. Previous work proposed interleaving softmax and linear attention layers to reduce computational complexity while preserving expressivity. Nevertheless, the efficiency of these models remains bottlenecked by their softmax attention layers. In this paper, we propose Neural Attention Search Linear (NAtS-L), a framework that applies both linear attention and softmax attention operations within the same layer on different tokens. NAtS-L automatically determines whether a token can be handled by a linear attention model, i.e., tokens that have only short-term impact and can be encoded into fixed-size hidden states, or require softmax attention, i.e., tokens that contain information related to long-term retrieval and need to be preserved for future queries. By searching for optimal Gated DeltaNet and softmax attention combinations across tokens, we show that NAtS-L provides a strong yet efficient token-level hybrid architecture.
zh

[NLP-22] Instruction Anchors: Dissecting the Causal Dynamics of Modality Arbitration

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中模态跟随(Modality Following)机制的内在工作原理不清晰的问题,即模型如何根据用户指令选择性地利用多模态上下文,这对确保实际部署中的安全性和可靠性至关重要。解决方案的关键在于从信息流视角揭示了模态仲裁的动态过程:指令token作为结构锚点,在浅层注意力层中非选择性地传递多模态线索至其作为潜在缓冲区;深层注意力层则依据指令意图完成模态竞争决策,而MLP层表现出语义惯性,构成对抗性力量;进一步识别出一组稀疏的专用注意力头驱动这一仲裁过程,因果干预实验表明仅操纵其中5%的关键头即可使模态跟随比例下降60%或提升60%,从而为提升模型透明度和实现多模态信息协同提供了理论框架与可操作路径。

链接: https://arxiv.org/abs/2602.03677
作者: Yu Zhang,Mufan Xu,Xuefeng Bai,Kehai chen,Pengfei Zhang,Yang Xiang,Min Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注: Modality Following

点击查看摘要

Abstract:Modality following serves as the capacity of multimodal large language models (MLLMs) to selectively utilize multimodal contexts based on user instructions. It is fundamental to ensuring safety and reliability in real-world deployments. However, the underlying mechanisms governing this decision-making process remain poorly understood. In this paper, we investigate its working mechanism through an information flow lens. Our findings reveal that instruction tokens function as structural anchors for modality arbitration: Shallow attention layers perform non-selective information transfer, routing multimodal cues to these anchors as a latent buffer; Modality competition is resolved within deep attention layers guided by the instruction intent, while MLP layers exhibit semantic inertia, acting as an adversarial force. Furthermore, we identify a sparse set of specialized attention heads that drive this arbitration. Causal interventions demonstrate that manipulating a mere 5% of these critical heads can decrease the modality-following ratio by 60% through blocking, or increase it by 60% through targeted amplification of failed samples. Our work provides a substantial step toward model transparency and offers a principled framework for the orchestration of multimodal information in MLLMs.
zh

[NLP-23] RAG Turk: Best Practices for Retrieval Augmented Generation in Turkish EACL2026

【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统在非英语语言(特别是形态丰富的土耳其语)中设计指导不足的问题,从而提升其事实准确性与效率。解决方案的关键在于构建了一个基于土耳其语维基百科和CulturaX的全面RAG数据集,并对RAG管道的七个阶段(包括查询转换、重排序和答案精炼)进行无任务微调的基准测试。研究发现,使用HyDE(假设性文档嵌入)等复杂方法可将准确率提升至85%,显著优于基线(78.70%);同时,通过交叉编码器重排序(Cross-encoder Reranking)与上下文增强(Context Augmentation)组合形成的帕累托最优配置,在保持84.60%准确率的同时大幅降低计算成本。此外,论文指出过度堆叠生成模块会破坏形态线索导致性能下降,而采用简单查询澄清结合稳健重排序策略则能有效提升系统鲁棒性与效率。

链接: https://arxiv.org/abs/2602.03652
作者: Süha Kağan Köse,Mehmet Can Baytekin,Burak Aktaş,Bilge Kaan Görür,Evren Ayberk Munis,Deniz Yılmaz,Muhammed Yusuf Kartal,Çağrı Toraman
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted by EACL 2026 SIGTURK

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) enhances LLM factuality, yet design guidance remains English-centric, limiting insights for morphologically rich languages like Turkish. We address this by constructing a comprehensive Turkish RAG dataset derived from Turkish Wikipedia and CulturaX, comprising question-answer pairs and relevant passage chunks. We benchmark seven stages of the RAG pipeline, from query transformation and reranking to answer refinement, without task-specific fine-tuning. Our results show that complex methods like HyDE maximize accuracy (85%) that is considerably higher than the baseline (78.70%). Also a Pareto-optimal configuration using Cross-encoder Reranking and Context Augmentation achieves comparable performance (84.60%) with much lower cost. We further demonstrate that over-stacking generative modules can degrade performance by distorting morphological cues, whereas simple query clarification with robust reranking offers an effective solution.
zh

[NLP-24] Search-R2: Enhancing Search-Integrated Reasoning via Actor-Refiner Collaboration

【速读】: 该论文旨在解决搜索增强型推理(search-integrated reasoning)语言代理在强化学习训练中面临的多尺度信用分配问题(multi-scale credit assignment problem),即现有方法依赖稀疏的轨迹级奖励,难以区分高质量推理与偶然猜测,导致冗余或误导性的搜索行为。解决方案的关键在于提出Search-R2框架,其核心是引入Actor-Refiner协同机制:Actor生成初始推理轨迹,Meta-Refiner通过“剪切-重生成”机制选择性诊断并修正错误步骤;同时设计混合奖励机制,将结果正确性与检索证据的信息密度相结合,提供细粒度监督信号,从而实现对推理过程的精准干预和联合优化。

链接: https://arxiv.org/abs/2602.03647
作者: Bowei He,Minda Hu,Zenan Xu,Hongru Wang,Licheng Zong,Yankai Chen,Chen Ma,Xue Liu,Pluto Zhou,Irwin King
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); LLM Department, Tencent (腾讯大语言模型部门); The Chinese University of Hong Kong (香港中文大学); McGill University (麦吉尔大学); City University of Hong Kong (香港城市大学); The University of Edinburgh (爱丁堡大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Search-integrated reasoning enables language agents to transcend static parametric knowledge by actively querying external sources. However, training these agents via reinforcement learning is hindered by the multi-scale credit assignment problem: existing methods typically rely on sparse, trajectory-level rewards that fail to distinguish between high-quality reasoning and fortuitous guesses, leading to redundant or misleading search behaviors. To address this, we propose Search-R2, a novel Actor-Refiner collaboration framework that enhances reasoning through targeted intervention, with both components jointly optimized during training. Our approach decomposes the generation process into an Actor, which produces initial reasoning trajectories, and a Meta-Refiner, which selectively diagnoses and repairs flawed steps via a ‘cut-and-regenerate’ mechanism. To provide fine-grained supervision, we introduce a hybrid reward design that couples outcome correctness with a dense process reward quantifying the information density of retrieved evidence. Theoretically, we formalize the Actor-Refiner interaction as a smoothed mixture policy, proving that selective correction yields strict performance gains over strong baselines. Extensive experiments across various general and multi-hop QA datasets demonstrate that Search-R2 consistently outperforms strong RAG and RL-based baselines across model scales, achieving superior reasoning accuracy with minimal overhead.
zh

[NLP-25] utorial on Reasoning for IR IR for Reasoning ECIR2026

【速读】: 该论文旨在解决信息检索(Information Retrieval, IR)系统在处理复杂信息需求时的局限性,即传统方法仅依赖语义相关性排序文档,难以满足逻辑约束执行、多步推理和多源证据融合等高级推理任务。其核心问题是推动IR从模式匹配向结构化、可验证的推理范式演进。解决方案的关键在于构建一个统一的分析框架,该框架基于对IR中“推理”的明确定义,并将现有不同领域的推理方法(如大语言模型的推理策略、神经符号系统、概率建模等)映射到反映推理核心组件的坐标轴上,从而揭示各类方法的权衡关系与互补性,明确IR如何借鉴跨学科进展并发挥自身在推理系统中的核心作用。

链接: https://arxiv.org/abs/2602.03640
作者: Mohanna Hoveyda,Panagiotis Efstratiadis,Arjen de Vries,Maarten de Rijke
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted to ECIR 2026

点击查看摘要

Abstract:Information retrieval has long focused on ranking documents by semantic relatedness. Yet many real-world information needs demand more: enforcement of logical constraints, multi-step inference, and synthesis of multiple pieces of evidence. Addressing these requirements is, at its core, a problem of reasoning. Across AI communities, researchers are developing diverse solutions for the problem of reasoning, from inference-time strategies and post-training of LLMs, to neuro-symbolic systems, Bayesian and probabilistic frameworks, geometric representations, and energy-based models. These efforts target the same problem: to move beyond pattern-matching systems toward structured, verifiable inference. However, they remain scattered across disciplines, making it difficult for IR researchers to identify the most relevant ideas and opportunities. To help navigate the fragmented landscape of research in reasoning, this tutorial first articulates a working definition of reasoning within the context of information retrieval and derives from it a unified analytical framework. The framework maps existing approaches along axes that reflect the core components of the definition. By providing a comprehensive overview of recent approaches and mapping current methods onto the defined axes, we expose their trade-offs and complementarities, highlight where IR can benefit from cross-disciplinary advances, and illustrate how retrieval process itself can play a central role in broader reasoning systems. The tutorial will equip participants with both a conceptual framework and practical guidance for enhancing reasoning-capable IR systems, while situating IR as a domain that both benefits and contributes to the broader development of reasoning methodologies.
zh

[NLP-26] RE: Encourag ing Exploration in the Trust Region

【速读】: 该论文试图解决在大型语言模型(Large Language Models, LLMs)中,传统基于全局熵正则化的探索策略效果不佳甚至损害性能的问题。作者指出,这是由于LLMs存在大规模词汇和长生成序列所导致的累积尾部风险(cumulative tail risk),使得标准熵最大化将概率质量无差别地分散到大量无效token上,从而破坏推理连贯性。解决方案的关键在于提出信任区域熵(Trust Region Entropy, TRE),通过限制探索范围在模型的信任区域内,确保探索集中在合理候选空间内,从而提升生成质量和任务表现。

链接: https://arxiv.org/abs/2602.03635
作者: Chao Huang,Yujing Lu,Quangang Li,Shenghe Wang,Yan Wang,Yueyang Zhang,Long Xia,Jiashu Zhao,Zhiyuan Sun,Daiting Shi,Tingwen Liu
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Entropy regularization is a standard technique in reinforcement learning (RL) to enhance exploration, yet it yields negligible effects or even degrades performance in Large Language Models (LLMs). We attribute this failure to the cumulative tail risk inherent to LLMs with massive vocabularies and long generation horizons. In such environments, standard global entropy maximization indiscriminately dilutes probability mass into the vast tail of invalid tokens rather than focusing on plausible candidates, thereby disrupting coherent reasoning. To address this, we propose Trust Region Entropy (TRE), a method that encourages exploration strictly within the model’s trust region. Extensive experiments across mathematical reasoning (MATH), combinatorial search (Countdown), and preference alignment (HH) tasks demonstrate that TRE consistently outperforms vanilla PPO, standard entropy regularization, and other exploration baselines. Our code is available at this https URL.
zh

[NLP-27] BIRDTurk: Adaptation of the BIRD Text-to-SQL Dataset to Turkish EACL2026

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在形态丰富的低资源语言(如土耳其语)中进行 Text-to-SQL 任务时性能下降的问题,即当前主流模型在非英语环境下的跨语言泛化能力不足。其关键解决方案是构建了 BIRDTurk——首个基于受控翻译流程的土耳其语 Text-to-SQL 基准数据集,通过保持数据库模式标识符的语义一致性与 SQL 查询逻辑结构不变,实现了高质量、可验证的多语言迁移评估;同时系统性比较了基于提示推理、代理式多阶段推理和监督微调三种方法,发现代理式推理在跨语言场景下更具鲁棒性,而现代指令微调模型在监督微调策略下表现更优。

链接: https://arxiv.org/abs/2602.03633
作者: Burak Aktaş,Mehmet Can Baytekin,Süha Kağan Köse,Ömer İlbilgi,Elif Özge Yılmaz,Çağrı Toraman,Bilge Kaan Görür
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: Accepted by EACL 2026 SIGTURK

点击查看摘要

Abstract:Text-to-SQL systems have achieved strong performance on English benchmarks, yet their behavior in morphologically rich, low-resource languages remains largely unexplored. We introduce BIRDTurk, the first Turkish adaptation of the BIRD benchmark, constructed through a controlled translation pipeline that adapts schema identifiers to Turkish while strictly preserving the logical structure and execution semantics of SQL queries and databases. Translation quality is validated on a sample size determined by the Central Limit Theorem to ensure 95% confidence, achieving 98.15% accuracy on human-evaluated samples. Using BIRDTurk, we evaluate inference-based prompting, agentic multi-stage reasoning, and supervised fine-tuning. Our results reveal that Turkish introduces consistent performance degradation, driven by both structural linguistic divergence and underrepresentation in LLM pretraining, while agentic reasoning demonstrates stronger cross-lingual robustness. Supervised fine-tuning remains challenging for standard multilingual baselines but scales effectively with modern instruction-tuned models. BIRDTurk provides a controlled testbed for cross-lingual Text-to-SQL evaluation under realistic database conditions. We release the training and development splits to support future research.
zh

[NLP-28] Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation

【速读】: 该论文旨在解决深度研究(DeepResearch)生成报告时缺乏可验证奖励信号的问题,导致训练和评估困难。现有基于评分量表(rubric-based)的评估方法要么依赖粗粒度、预定义的量表,难以捕捉细节差异,要么需要为每个查询手工构建专属量表,成本高且难以扩展。其解决方案的关键在于提出一个端到端的流水线,用于训练与人类偏好对齐的、针对特定查询的评分量表生成器(rubric generator)。该方法首先构建包含人类偏好标注的DeepResearch风格查询数据集,并通过强化学习训练量表生成器,融合人类偏好监督与大语言模型(LLM)驱动的量表评估作为混合奖励信号。此外,引入多智能体马尔可夫状态(Multi-agent Markov-state, MaMs)工作流以更好地处理长程推理任务。实证结果表明,所提量表生成器能提供更具区分度和人类一致性更强的监督信号,并在MaMs框架下显著提升DeepResearch系统的性能,达到与领先闭源模型相当的水平。

链接: https://arxiv.org/abs/2602.03619
作者: Changze Lv,Jie Zhou,Wentao Zhao,Jingwen Xu,Zisu Huang,Muzhao Tian,Shihan Dou,Tao Gui,Le Tian,Xiao Zhou,Xiaoqing Zheng,Xuanjing Huang,Jie Zhou
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Nowadays, training and evaluating DeepResearch-generated reports remain challenging due to the lack of verifiable reward signals. Accordingly, rubric-based evaluation has become a common practice. However, existing approaches either rely on coarse, pre-defined rubrics that lack sufficient granularity, or depend on manually constructed query-specific rubrics that are costly and difficult to scale. In this paper, we propose a pipeline to train human-preference-aligned query-specific rubric generators tailored for DeepResearch report generation. We first construct a dataset of DeepResearch-style queries annotated with human preferences over paired reports, and train rubric generators via reinforcement learning with a hybrid reward combining human preference supervision and LLM-based rubric evaluation. To better handle long-horizon reasoning, we further introduce a Multi-agent Markov-state (MaMs) workflow for report generation. We empirically show that our proposed rubric generators deliver more discriminative and better human-aligned supervision than existing rubric design strategies. Moreover, when integrated into the MaMs training framework, DeepResearch systems equipped with our rubric generators consistently outperform all open-source baselines on the DeepResearch Bench and achieve performance comparable to that of leading closed-source models.
zh

[NLP-29] Controlling Output Rankings in Generative Engines for LLM -based Search

【速读】: 该论文旨在解决生成式 AI(Generative AI)驱动的搜索系统中,由于大语言模型(LLM)初始检索结果排序对推荐输出具有强影响,导致小企业和独立创作者难以获得公平曝光的问题。其核心挑战在于如何在不改变 LLM 本身行为的前提下,通过优化外部检索内容来主动引导最终推荐排名。解决方案的关键是提出 CORE(Control Output Rankings in Generative Engines),一种针对黑箱 LLM 搜索接口的优化方法:它通过向搜索引擎返回的内容中添加三种类型的优化内容——字符串级、推理级和评论级——以策略性地调整 LLM 输出的排序位置,从而显著提升目标产品在 Top-K 推荐中的成功率(平均达 91.4% @Top-5),同时保持生成内容的自然流畅性。

链接: https://arxiv.org/abs/2602.03608
作者: Haibo Jin,Ruoxi Chen,Peiyan Zhang,Yifeng Luo,Huimin Zeng,Man Luo,Haohan Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 23 pages

点击查看摘要

Abstract:The way customers search for and choose products is changing with the rise of large language models (LLMs). LLM-based search, or generative engines, provides direct product recommendations to users, rather than traditional online search results that require users to explore options themselves. However, these recommendations are strongly influenced by the initial retrieval order of LLMs, which disadvantages small businesses and independent creators by limiting their visibility. In this work, we propose CORE, an optimization method that \textbfControls \textbfOutput \textbfRankings in g\textbfEnerative Engines for LLM-based search. Since the LLM’s interactions with the search engine are black-box, CORE targets the content returned by search engines as the primary means of influencing output rankings. Specifically, CORE optimizes retrieved content by appending strategically designed optimization content to steer the ranking of outputs. We introduce three types of optimization content: string-based, reasoning-based, and review-based, demonstrating their effectiveness in shaping output rankings. To evaluate CORE in realistic settings, we introduce ProductBench, a large-scale benchmark with 15 product categories and 200 products per category, where each product is associated with its top-10 recommendations collected from Amazon’s search interface. Extensive experiments on four LLMs with search capabilities (GPT-4o, Gemini-2.5, Claude-4, and Grok-3) demonstrate that CORE achieves an average Promotion Success Rate of \textbf91.4% @Top-5, \textbf86.6% @Top-3, and \textbf80.3% @Top-1, across 15 product categories, outperforming existing ranking manipulation methods while preserving the fluency of optimized content. Comments: 23 pages Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR) Cite as: arXiv:2602.03608 [cs.CL] (or arXiv:2602.03608v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2602.03608 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-30] Efficient Algorithms for Partial Constraint Satisfaction Problems over Control-flow Graphs WWW

【速读】: 该论文致力于解决在程序控制流图(Control-Flow Graph, CFG)上定义的偏约束满足问题(Partial Constraint Satisfaction Problem, PCSP),其核心目标是在允许部分约束以指定代价被违反的前提下,寻找总代价最小的解。这类问题广泛应用于经典编译优化任务中,如寄存器分配(Register Allocation)、生命周期最优推测性部分冗余消除(Lifetime-optimal Speculative Partial Redundancy Elimination, LOSPRE)和银行选择指令的最优放置(Optimal Placement of Bank Selection Instructions)。解决方案的关键在于利用结构化程序控制流图的稀疏性和可分解性,特别是基于Series-Parallel-Loop (SPL) 分解方法,提出了一种通用算法,其时间复杂度为 O(GD6)O(|G| \cdot |D|^6),其中 G|G| 为控制流图大小,D|D| 为变量取值域的大小;对于固定域大小,该算法实现了线性时间复杂度,显著优于以往针对特定PCSP问题的方法,并在实验中对银行选择优化任务取得了比现有最优方法快四倍的运行效率。

链接: https://arxiv.org/abs/2602.03588
作者: Xuran Cai,Amir Goharshady
机构: 未知
类目: Computation and Language (cs.CL); Programming Languages (cs.PL)
备注: Already accepted by SETTA’25. this https URL . arXiv admin note: substantial text overlap with arXiv:2507.16660

点击查看摘要

Abstract:In this work, we focus on the Partial Constraint Satisfaction Problem (PCSP) over control-flow graphs (CFGs) of programs. PCSP serves as a generalization of the well-known Constraint Satisfaction Problem (CSP). In the CSP framework, we define a set of variables, a set of constraints, and a finite domain D that encompasses all possible values for each variable. The objective is to assign a value to each variable in such a way that all constraints are satisfied. In the graph variant of CSP, an underlying graph is considered and we have one variable corresponding to each vertex of the graph and one or several constraints corresponding to each edge. In PCSPs, we allow for certain constraints to be violated at a specified cost, aiming to find a solution that minimizes the total cost. Numerous classical compiler optimization tasks can be framed as PCSPs over control-flow graphs. Examples include Register Allocation, Lifetime-optimal Speculative Partial Redundancy Elimination (LOSPRE), and Optimal Placement of Bank Selection Instructions. On the other hand, it is well-known that control-flow graphs of structured programs are sparse and decomposable in a variety of ways. In this work, we rely on the Series-Parallel-Loop (SPL) decompositions as introduced by~\citeRegisterAllocation. Our main contribution is a general algorithm for PCSPs over SPL graphs with a time complexity of (O(|G| \cdot |D|^6)), where (|G|) represents the size of the control-flow graph. Note that for any fixed domain D, this yields a linear-time solution. Our algorithm can be seen as a generalization and unification of previous SPL-based approaches for register allocation and LOSPRE. In addition, we provide experimental results over another classical PCSP task, i.e. Optimal Bank Selection, achieving runtimes four times better than the previous state of the art.
zh

[NLP-31] CL-bench: A Benchmark for Context Learning

【速读】: 该论文旨在解决当前语言模型(Language Models, LMs)在面对真实世界复杂任务时,缺乏有效从特定上下文中学得新知识并进行推理的能力问题,即“上下文学习”(context learning)能力的缺失。现有模型主要依赖预训练阶段获取的知识,难以处理需结合任务特定背景、领域规则、复杂流程或数据驱动规律等新信息的任务。解决方案的关键在于构建CL-bench——一个由500个复杂上下文、1,899个任务和31,607条验证标准组成的基准测试集,所有内容均由领域专家设计,确保每个任务所需的新知识均包含于对应上下文中。该基准不仅要求模型理解并利用上下文中的新信息,还强调对非预训练知识的推理与整合,超越了传统长上下文检索或简单示例学习任务,从而系统评估模型是否具备类人级别的上下文学习能力。

链接: https://arxiv.org/abs/2602.03587
作者: Shihan Dou,Ming Zhang,Zhangyue Yin,Chenhao Huang,Yujiong Shen,Junzhe Wang,Jiayi Chen,Yuchen Ni,Junjie Ye,Cheng Zhang,Huaibing Xie,Jianglu Hu,Shaolei Wang,Weichao Wang,Yanling Xiao,Yiting Liu,Zenan Xu,Zhen Guo,Pluto Zhou,Tao Gui,Zuxuan Wu,Xipeng Qiu,Qi Zhang,Xuanjing Huang,Yu-Gang Jiang,Di Wang,Shunyu Yao
机构: 未知
类目: Computation and Language (cs.CL)
备注: 78 pages, 17 figures

点击查看摘要

Abstract:Current language models (LMs) excel at reasoning over prompts using pre-trained knowledge. However, real-world tasks are far more complex and context-dependent: models must learn from task-specific context and leverage new knowledge beyond what is learned during pre-training to reason and resolve tasks. We term this capability context learning, a crucial ability that humans naturally possess but has been largely overlooked. To this end, we introduce CL-bench, a real-world benchmark consisting of 500 complex contexts, 1,899 tasks, and 31,607 verification rubrics, all crafted by experienced domain experts. Each task is designed such that the new content required to resolve it is contained within the corresponding context. Resolving tasks in CL-bench requires models to learn from the context, ranging from new domain-specific knowledge, rule systems, and complex procedures to laws derived from empirical data, all of which are absent from pre-training. This goes far beyond long-context tasks that primarily test retrieval or reading comprehension, and in-context learning tasks, where models learn simple task patterns via instructions and demonstrations. Our evaluations of ten frontier LMs find that models solve only 17.2% of tasks on average. Even the best-performing model, GPT-5.1, solves only 23.7%, revealing that LMs have yet to achieve effective context learning, which poses a critical bottleneck for tackling real-world, complex context-dependent tasks. CL-bench represents a step towards building LMs with this fundamental capability, making them more intelligent and advancing their deployment in real-world scenarios.
zh

[NLP-32] V_0: A Generalist Value Model for Any Policy at State Zero

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在使用Actor-Critic方法(如PPO)进行训练时,因策略模型持续演化导致价值模型(Value Model)需频繁同步更新以准确追踪策略能力变化所带来的高计算开销问题。传统方法依赖于与策略模型规模相当的价值模型进行参数化拟合,而GRPO虽通过群体平均奖励替代价值模型避免了参数更新,但需大量采样以保证估计稳定性。本文提出V₀——一种通用价值模型(Generalist Value Model),其关键创新在于将策略的动态能力显式建模为上下文输入,利用历史指令-性能对(instruction-performance pairs)动态刻画模型能力,而非依赖参数微调来感知能力变化;特别地,V₀专注于初始状态(State Zero)下的值估计,作为资源调度器,在GRPO训练中预测成功概率以优化采样预算分配,在部署阶段则实现高效模型路由,从而在性能与成本之间实现帕累托最优平衡。

链接: https://arxiv.org/abs/2602.03584
作者: Yi-Kai Zhang,Zhiyuan Yao,Hongyan Hao,Yueqing Sun,Qi Gu,Hui Su,Xunliang Cai,De-Chuan Zhan,Han-Jia Ye
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Policy gradient methods rely on a baseline to measure the relative advantage of an action, ensuring the model reinforces behaviors that outperform its current average capability. In the training of Large Language Models (LLMs) using Actor-Critic methods (e.g., PPO), this baseline is typically estimated by a Value Model (Critic) often as large as the policy model itself. However, as the policy continuously evolves, the value model requires expensive, synchronous incremental training to accurately track the shifting capabilities of the policy. To avoid this overhead, Group Relative Policy Optimization (GRPO) eliminates the coupled value model by using the average reward of a group of rollouts as the baseline; yet, this approach necessitates extensive sampling to maintain estimation stability. In this paper, we propose V_0 , a Generalist Value Model capable of estimating the expected performance of any model on unseen prompts without requiring parameter updates. We reframe value estimation by treating the policy’s dynamic capability as an explicit context input; specifically, we leverage a history of instruction-performance pairs to dynamically profile the model, departing from the traditional paradigm that relies on parameter fitting to perceive capability shifts. Focusing on value estimation at State Zero (i.e., the initial prompt, hence V_0 ), our model serves as a critical resource scheduler. During GRPO training, V_0 predicts success rates prior to rollout, allowing for efficient sampling budget allocation; during deployment, it functions as a router, dispatching instructions to the most cost-effective and suitable model. Empirical results demonstrate that V_0 significantly outperforms heuristic budget allocation and achieves a Pareto-optimal trade-off between performance and cost in LLM routing tasks.
zh

[NLP-33] Use Graph When It Needs: Efficiently and Adaptively Integrating Retrieval-Augmented Generation with Graphs

【速读】: 该论文旨在解决Graph-augmented RAG(GraphRAG)在实际应用中准确率下降和延迟过高问题,其根源在于对所有查询一概而论地采用图结构增强策略,忽略了查询复杂度的差异。解决方案的关键在于提出一种高效且自适应的GraphRAG框架EA-GraphRAG,通过语法感知的复杂度分析动态融合传统密集检索(dense RAG)与图检索(graph-based retrieval):首先构建查询的句法特征以表征其结构信息,继而使用轻量级复杂度评分器输出连续复杂度分数,最后依据该分数实施路由决策——低复杂度查询采用密集检索,高复杂度查询触发图检索,并对边界案例引入复杂度感知的互斥排名融合(complexity-aware reciprocal rank fusion),从而在保持高精度的同时显著降低延迟,实现混合场景下简单与复杂查询的最优处理。

链接: https://arxiv.org/abs/2602.03578
作者: Su Dong,Qinggang Zhang,Yilin Xiao,Shengyuan Chen,Chuang Zhou,Xiao Huang
机构: The Hong Kong Polytechnic University (香港理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) often struggle with knowledge-intensive tasks due to hallucinations and outdated parametric knowledge. While Retrieval-Augmented Generation (RAG) addresses this by integrating external corpora, its effectiveness is limited by fragmented information in unstructured domain documents. Graph-augmented RAG (GraphRAG) emerged to enhance contextual reasoning through structured knowledge graphs, yet paradoxically underperforms vanilla RAG in real-world scenarios, exhibiting significant accuracy drops and prohibitive latency despite gains on complex queries. We identify the rigid application of GraphRAG to all queries, regardless of complexity, as the root cause. To resolve this, we propose an efficient and adaptive GraphRAG framework called EA-GraphRAG that dynamically integrates RAG and GraphRAG paradigms through syntax-aware complexity analysis. Our approach introduces: (i) a syntactic feature constructor that parses each query and extracts a set of structural features; (ii) a lightweight complexity scorer that maps these features to a continuous complexity score; and (iii) a score-driven routing policy that selects dense RAG for low-score queries, invokes graph-based retrieval for high-score queries, and applies complexity-aware reciprocal rank fusion to handle borderline cases. Extensive experiments on a comprehensive benchmark, consisting of two single-hop and two multi-hop QA benchmarks, demonstrate that our EA-GraphRAG significantly improves accuracy, reduces latency, and achieves state-of-the-art performance in handling mixed scenarios involving both simple and complex queries.
zh

[NLP-34] ACL: Aligned Contrastive Learning Improves BERT and Multi-exit BERT Fine-tuning

【速读】: 该论文旨在解决对比学习(Contrastive Learning, CL)在监督学习设置下与交叉熵损失(Cross-Entropy Loss, CE)目标存在冲突的问题,从而限制了CL在监督任务中的有效应用。其核心解决方案是提出一种新颖的对齐对比学习(Aligned Contrastive Learning, ACL)框架,关键在于:首先,ACL-Embed将标签嵌入(label embeddings)视为带有不同标签的增强样本,通过对比学习将其与样本表示对齐;其次,引入ACL-Grad机制,在CE与ACL-Embed目标冲突时自动舍弃ACL-Embed项以优化联合目标;此外,为提升多出口BERT(multi-exit BERT)中浅层出口的性能,进一步提出跨层ACL(ACL-CL),利用教师出口指导学生浅层出口的优化。实验表明,ACL在GLUE基准上优于或相当传统方法,并显著提升多出口BERT的性能-延迟权衡。

链接: https://arxiv.org/abs/2602.03563
作者: Wei Zhu
机构: University of Hong Kong (香港大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite its success in self-supervised learning, contrastive learning is less studied in the supervised setting. In this work, we first use a set of pilot experiments to show that in the supervised setting, the cross-entropy loss objective (CE) and the contrastive learning objective often conflict with each other, thus hindering the applications of CL in supervised settings. To resolve this problem, we introduce a novel \underlineAligned \underlineContrastive \underlineLearning (ACL) framework. First, ACL-Embed regards label embeddings as extra augmented samples with different labels and employs contrastive learning to align the label embeddings with its samples’ representations. Second, to facilitate the optimization of ACL-Embed objective combined with the CE loss, we propose ACL-Grad, which will discard the ACL-Embed term if the two objectives are in conflict. To further enhance the performances of intermediate exits of multi-exit BERT, we further propose cross-layer ACL (ACL-CL), which is to ask the teacher exit to guide the optimization of student shallow exits. Extensive experiments on the GLUE benchmark results in the following takeaways: (a) ACL-BRT outperforms or performs comparably with CE and CE+SCL on the GLUE tasks; (b) ACL, especially CL-ACL, significantly surpasses the baseline methods on the fine-tuning of multi-exit BERT, thus providing better quality-speed tradeoffs for low-latency applications.
zh

[NLP-35] HySparse: A Hybrid Sparse Attention Architecture with Oracle Token Selection and KV Cache Sharing

【速读】: 该论文旨在解决稀疏注意力(Sparse Attention)方法中存在的两个核心问题:一是传统方法依赖额外代理模型来预测token重要性,引入复杂性且可能性能不佳;二是现有稀疏注意力设计虽减少计算量,但未能有效节省键值缓存(KV Cache)存储。解决方案的关键在于提出混合稀疏注意力(Hybrid Sparse Attention, HySparse)架构,其通过在每个全连接注意力层后插入多个稀疏注意力层,并直接利用前序全连接注意力层的结果来确定稀疏层的token选择和复用KV缓存,从而实现更高效的计算与内存优化。此设计使模型在仅使用少量全连接层的情况下仍能保持高性能,同时显著降低KV缓存占用。

链接: https://arxiv.org/abs/2602.03560
作者: Yizhao Gao,Jianyu Wei,Qihao Zhang,Yu Cheng,Shimao Chen,Zhengju Tang,Zihan Jiang,Yifan Song,Hailin Zhang,Liang Zhao,Bo Yang,Gang Wang,Shijie Cao,Fuli Luo
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 2 figures

点击查看摘要

Abstract:This work introduces Hybrid Sparse Attention (HySparse), a new architecture that interleaves each full attention layer with several sparse attention layers. While conceptually simple, HySparse strategically derives each sparse layer’s token selection and KV caches directly from the preceding full attention layer. This architecture resolves two fundamental limitations of prior sparse attention methods. First, conventional approaches typically rely on additional proxies to predict token importance, introducing extra complexity and potentially suboptimal performance. In contrast, HySparse uses the full attention layer as a precise oracle to identify important tokens. Second, existing sparse attention designs often reduce computation without saving KV cache. HySparse enables sparse attention layers to reuse the full attention KV cache, thereby reducing both computation and memory. We evaluate HySparse on both 7B dense and 80B MoE models. Across all settings, HySparse consistently outperforms both full attention and hybrid SWA baselines. Notably, in the 80B MoE model with 49 total layers, only 5 layers employ full attention, yet HySparse achieves substantial performance gains while reducing KV cache storage by nearly 10x.
zh

[NLP-36] When Single Answer Is Not Enough: Rethinking Single-Step Retrosynthesis Benchmarks for LLM s

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在单步逆合成分析任务中缺乏客观评估的问题。现有基准和指标主要依赖已发表的合成路径及基于单一标准答案的Top-K准确率,无法反映真实合成规划中的开放性与多样性。其解决方案的关键在于提出一个全新的基准框架,引入ChemCensor这一化学合理性度量指标,强调反应路径的化学合理性而非精确匹配,并结合CREED数据集(包含数百万条经ChemCensor验证的反应记录)训练出优于基线模型的新模型,从而更贴近人类实际的合成设计实践。

链接: https://arxiv.org/abs/2602.03554
作者: Bogdan Zagribelnyy,Ivan Ilin,Maksim Kuznetsov,Nikita Bondarev,Roman Schutski,Thomas MacDougall,Rim Shayakhmetov,Zulfat Miftakhutdinov,Mikolaj Mizera,Vladimir Aladinskiy,Alex Aliper,Alex Zhavoronkov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent progress has expanded the use of large language models (LLMs) in drug discovery, including synthesis planning. However, objective evaluation of retrosynthesis performance remains limited. Existing benchmarks and metrics typically rely on published synthetic procedures and Top-K accuracy based on single ground-truth, which does not capture the open-ended nature of real-world synthesis planning. We propose a new benchmarking framework for single-step retrosynthesis that evaluates both general-purpose and chemistry-specialized LLMs using ChemCensor, a novel metric for chemical plausibility. By emphasizing plausibility over exact match, this approach better aligns with human synthesis planning practices. We also introduce CREED, a novel dataset comprising millions of ChemCensor-validated reaction records for LLM training, and use it to train a model that improves over the LLM baselines under this benchmark.
zh

[NLP-37] Assessing the Impact of Typological Features on Multilingual Machine Translation in the Age of Large Language Models EACL2026

【速读】: 该论文试图解决多语言机器翻译中不同语言间翻译质量差异显著的问题,特别是探究除训练资源不均之外,语言类型学特征(typological properties)是否是影响翻译质量的内在因素。解决方案的关键在于:通过分析两种先进的多语言翻译模型(NLLB-200 和 Tower+),在控制数据资源和书写系统等外部因素后,发现目标语言的类型学属性仍显著驱动翻译质量;同时指出某些类型学特征的语言更能从更广泛的输出空间搜索中获益,暗示其可能受益于超越标准左到右束搜索(left-to-right beam search)的解码策略。研究还发布了涵盖 212 种语言的细粒度类型学属性数据集,以促进该领域的进一步研究。

链接: https://arxiv.org/abs/2602.03551
作者: Vitalii Hirak,Jaap Jumelet,Arianna Bisazza
机构: Heinrich Heine University (海因里希海涅大学); University of Groningen (格罗宁根大学)
类目: Computation and Language (cs.CL)
备注: 19 pages, 11 figures, EACL 2026

点击查看摘要

Abstract:Despite major advances in multilingual modeling, large quality disparities persist across languages. Besides the obvious impact of uneven training resources, typological properties have also been proposed to determine the intrinsic difficulty of modeling a language. The existing evidence, however, is mostly based on small monolingual language models or bilingual translation models trained from scratch. We expand on this line of work by analyzing two large pre-trained multilingual translation models, NLLB-200 and Tower+, which are state-of-the-art representatives of encoder-decoder and decoder-only machine translation, respectively. Based on a broad set of languages, we find that target language typology drives translation quality of both models, even after controlling for more trivial factors, such as data resourcedness and writing script. Additionally, languages with certain typological properties benefit more from a wider search of the output space, suggesting that such languages could profit from alternative decoding strategies beyond the standard left-to-right beam search. To facilitate further research in this area, we release a set of fine-grained typological properties for 212 languages of the FLORES+ MT evaluation benchmark.
zh

[NLP-38] SEAD: Self-Evolving Agent for Multi-Turn Service Dialogue

【速读】: 该论文旨在解决当前大型语言模型在服务对话(Service Dialogue)中表现不佳的问题,其根源在于训练数据质量低、数据稀缺以及难以模拟真实的目标导向型用户行为。解决方案的关键在于提出SEAD(Self-Evolving Agent for Service Dialogue)框架,该框架通过将用户建模解耦为两个组件实现高效训练:一是Profile Controller,用于生成多样化的用户状态以管理训练课程(training curriculum);二是User Role-play Model,专注于生成逼真的角色扮演行为。这种设计使环境能够提供自适应的训练场景,而非作为不公平的对抗方,从而显著提升任务完成率(+17.6%)和对话效率(+11.1%)。

链接: https://arxiv.org/abs/2602.03548
作者: Yuqin Dai,Ning Gao,Wei Zhang,Jie Wang,Zichen Luo,Jinpeng Wang,Yujie Wang,Ruiyuan Wu,Chaozheng Wang
机构: Meituan(美团)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models have demonstrated remarkable capabilities in open-domain dialogues. However, current methods exhibit suboptimal performance in service dialogues, as they rely on noisy, low-quality human conversation data. This limitation arises from data scarcity and the difficulty of simulating authentic, goal-oriented user behaviors. To address these issues, we propose SEAD (Self-Evolving Agent for Service Dialogue), a framework that enables agents to learn effective strategies without large-scale human annotations. SEAD decouples user modeling into two components: a Profile Controller that generates diverse user states to manage training curriculum, and a User Role-play Model that focuses on realistic role-playing. This design ensures the environment provides adaptive training scenarios rather than acting as an unfair adversary. Experiments demonstrate that SEAD significantly outperforms Open-source Foundation Models and Closed-source Commercial Models, improving task completion rate by 17.6% and dialogue efficiency by 11.1%. Code is available at: this https URL.
zh

[NLP-39] Can Large Language Models Generalize Procedures Across Representations?

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在不同符号表示形式(如代码、图结构和自然语言)之间泛化能力不足的问题,尤其是在训练数据仅限于某一类表示时,模型难以有效迁移到其他表示形式的任务中。其核心解决方案是提出一种两阶段数据课程(data curriculum):首先在符号化数据(如代码和图)上进行训练,随后在自然语言数据上继续训练。这一策略显著提升了模型在多种模型架构和任务上的跨表示泛化性能,尤其使一个15亿参数的Qwen模型在自然语言规划任务中接近零样本GPT-4o的表现,表明该方法能有效促进生成式类比(generative analogy)的学习,从而实现更鲁棒的跨模态理解与推理。

链接: https://arxiv.org/abs/2602.03542
作者: Fangru Lin,Valentin Hofmann,Xingchen Wan,Weixing Wang,Zifeng Ding,Anthony G. Cohn,Janet B. Pierrehumbert
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are trained and tested extensively on symbolic representations such as code and graphs, yet real-world user tasks are often specified in natural language. To what extent can LLMs generalize across these representations? Here, we approach this question by studying isomorphic tasks involving procedures represented in code, graphs, and natural language (e.g., scheduling steps in planning). We find that training LLMs with popular post-training methods on graphs or code data alone does not reliably generalize to corresponding natural language tasks, while training solely on natural language can lead to inefficient performance gains. To address this gap, we propose a two-stage data curriculum that first trains on symbolic, then natural language data. The curriculum substantially improves model performance across model families and tasks. Remarkably, a 1.5B Qwen model trained by our method can closely match zero-shot GPT-4o in naturalistic planning. Finally, our analysis suggests that successful cross-representation generalization can be interpreted as a form of generative analogy, which our curriculum effectively encourages.
zh

[NLP-40] Learning to Reason Faithfully through Step-Level Faithfulness Maximization

【速读】: 该论文旨在解决强化学习中基于可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)的大型语言模型(Large Language Models, LLMs)在多步推理任务中因稀疏结果奖励导致中间步骤缺乏监督,从而引发过度自信和虚假推理(spurious reasoning),最终加剧幻觉(hallucination)的问题。解决方案的关键在于提出一种名为FaithRL的一般性强化学习框架,其核心是形式化一个“忠实度最大化”(faithfulness-maximization)目标,并通过几何奖励设计与忠实度感知的优势调节机制实现该目标:前者对未支持的推理步骤施加惩罚,后者在保留有效部分推导的前提下进行逐步信用分配,从而显著降低幻觉率并保持甚至提升答案正确性。

链接: https://arxiv.org/abs/2602.03507
作者: Runquan Gui,Yafu Li,Xiaoye Qu,Ziyan Liu,Yeqiu Cheng,Yu Cheng
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has markedly improved the performance of Large Language Models (LLMs) on tasks requiring multi-step reasoning. However, most RLVR pipelines rely on sparse outcome-based rewards, providing little supervision over intermediate steps and thus encouraging over-confidence and spurious reasoning, which in turn increases hallucinations. To address this, we propose FaithRL, a general reinforcement learning framework that directly optimizes reasoning faithfulness. We formalize a faithfulness-maximization objective and theoretically show that optimizing it mitigates over-confidence. To instantiate this objective, we introduce a geometric reward design and a faithfulness-aware advantage modulation mechanism that assigns step-level credit by penalizing unsupported steps while preserving valid partial derivations. Across diverse backbones and benchmarks, FaithRL consistently reduces hallucination rates while maintaining (and often improving) answer correctness. Further analysis confirms that FaithRL increases step-wise reasoning faithfulness and generalizes robustly. Our code is available at this https URL.
zh

[NLP-41] Decoupling Skeleton and Flesh: Efficient Multimodal Table Reasoning with Disentangled Alignment and Structure-aware Guidance

【速读】: 该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在处理表格图像时面临的挑战,即复杂布局和结构与内容信息高度耦合导致的推理困难。现有方法通常依赖昂贵的监督训练、强化学习或外部工具,限制了效率与可扩展性。其解决方案的关键在于提出DiSCo(Disentangled Structure-Content alignment framework),通过在多模态对齐过程中显式分离结构抽象(structural abstraction)与语义定位(semantic grounding),实现对表格结构的有效适应;在此基础上进一步构建Table-GLS(Global-to-Local Structure-guided reasoning framework),采用全局到局部的结构引导推理策略,结合结构探索与证据驱动的推断机制,显著提升LVLM在表格理解与推理上的性能,尤其在未见表格结构上具备良好泛化能力。

链接: https://arxiv.org/abs/2602.03491
作者: Yingjie Zhu,Xuefeng Bai,Kehai Chen,Yang Xiang,Youcheng Pan,Xiaoqiang Zhou,Min Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reasoning over table images remains challenging for Large Vision-Language Models (LVLMs) due to complex layouts and tightly coupled structure-content information. Existing solutions often depend on expensive supervised training, reinforcement learning, or external tools, limiting efficiency and scalability. This work addresses a key question: how to adapt LVLMs to table reasoning with minimal annotation and no external tools? Specifically, we first introduce DiSCo, a Disentangled Structure-Content alignment framework that explicitly separates structural abstraction from semantic grounding during multimodal alignment, efficiently adapting LVLMs to tables structures. Building on DiSCo, we further present Table-GLS, a Global-to-Local Structure-guided reasoning framework that performs table reasoning via structured exploration and evidence-grounded inference. Extensive experiments across diverse benchmarks demonstrate that our framework efficiently enhances LVLM’s table understanding and reasoning capabilities, particularly generalizing to unseen table structures.
zh

[NLP-42] Self-Verification Dilemma: Experience-Driven Suppression of Overused Checking in LLM Reasoning

【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRM)在生成长推理轨迹时过度使用自我验证(self-verification,即recheck)行为的问题。实证分析表明,大量验证步骤为重复确认中间结果,但绝大多数是确认性而非纠错性,导致计算资源浪费且未显著提升准确性。解决方案的关键在于提出一种基于经验的测试时(test-time)框架:通过检测模型中recheck行为的激活状态,查询离线的历史验证经验池,并利用高效检索估计当前验证是否冗余;若历史经验表明无需验证,则发出抑制信号引导模型跳过该步骤,从而在保持甚至提升准确性的前提下减少高达20.3%的token消耗。

链接: https://arxiv.org/abs/2602.03485
作者: Quanyu Long,Kai Jie Jiang,Jianda Chen,Xu Guo,Leilei Gan,Wenya Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 19 pages, 8 figures

点击查看摘要

Abstract:Large Reasoning Models (LRMs) achieve strong performance by generating long reasoning traces with reflection. Through a large-scale empirical analysis, we find that a substantial fraction of reflective steps consist of self-verification (recheck) that repeatedly confirm intermediate results. These rechecks occur frequently across models and benchmarks, yet the vast majority are confirmatory rather than corrective, rarely identifying errors and altering reasoning outcomes. This reveals a mismatch between how often self-verification is activated and how often it is actually useful. Motivated by this, we propose a novel, experience-driven test-time framework that reduces the overused verification. Our method detects the activation of recheck behavior, consults an offline experience pool of past verification outcomes, and estimates whether a recheck is likely unnecessary via efficient retrieval. When historical experience suggests unnecessary, a suppression signal redirects the model to proceed. Across multiple model and benchmarks, our approach reduces token usage up to 20.3% while maintaining the accuracy, and in some datasets even yields accuracy improvements.
zh

[NLP-43] Preferences for Idiomatic Language are Acquired Slowly – and Forgotten Quickly: A Case Study on Swedish ACL

【速读】: 该论文旨在解决语言模型在瑞典语中对习语表达(idiomatic expressions)与语法上可接受但非习语表达(linguistically acceptable but non-idiomatic expressions)的偏好发展机制问题,尤其是在预训练和从英语向瑞典语迁移过程中。解决方案的关键在于:首先构建两类新颖的数据集来量化模型对习语性的感知——一类对比传统习语与其合理变体,另一类对比自然习语与翻译腔(Translationese);其次通过在不同训练阶段对模型进行最小对(minimal pair)测试,系统性地追踪其习语偏好演化轨迹。结果表明,习语能力的习得显著慢于语法和词汇正确性,且在大规模模型(8B参数)中仍持续提升,而基于机器翻译的指令微调会导致模型快速丧失对习语的偏好。

链接: https://arxiv.org/abs/2602.03484
作者: Jenny Kunz
机构: Linköping University (林雪平大学)
类目: Computation and Language (cs.CL)
备注: Accepted to TACL. Note that the arXiv version is a pre-MIT Press publication version

点击查看摘要

Abstract:In this study, we investigate how language models develop preferences for \textitidiomatic as compared to \textitlinguistically acceptable Swedish, both during pretraining and when adapting a model from English to Swedish. To do so, we train models on Swedish from scratch and by fine-tuning English-pretrained models, probing their preferences at various checkpoints using minimal pairs that differ in linguistic acceptability or idiomaticity. For linguistic acceptability, we adapt existing benchmarks into a minimal-pair format. To assess idiomaticity, we introduce two novel datasets: one contrasting conventionalized idioms with plausible variants, and another contrasting idiomatic Swedish with Translationese. Our findings suggest that idiomatic competence emerges more slowly than other linguistic abilities, including grammatical and lexical correctness. While longer training yields diminishing returns for most tasks, idiom-related performance continues to improve, particularly in the largest model tested (8B). However, instruction tuning on data machine-translated from English – the common approach for languages with little or no native instruction data – causes models to rapidly lose their preference for idiomatic language.
zh

[NLP-44] A-RAG : Scaling Agent ic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces

【速读】: 该论文旨在解决现有检索增强生成(Retrieval-Augmented Generation, RAG)系统无法有效利用前沿语言模型强大推理能力和长程工具调用能力的问题。传统RAG方法受限于两种固定范式:一是单次检索并拼接文本片段至输入,二是预定义执行流程由模型按步骤执行,二者均未让模型参与检索决策过程,从而限制了系统随模型能力提升而高效扩展的能力。其解决方案的关键在于提出A-RAG(Agentic RAG)框架,通过向模型暴露分层检索接口,赋予其自主选择检索策略(关键词搜索、语义搜索和段落读取)的能力,使模型能够根据任务动态调整检索粒度与方式,从而在多个开放域问答基准上实现性能提升且检索token数量相当或更低,充分释放了大模型的潜力,并展现出良好的可扩展性。

链接: https://arxiv.org/abs/2602.03442
作者: Mingxuan Du,Benfeng Xu,Chiwei Zhu,Shaohan Wang,Pengyu Wang,Xiaorui Wang,Zhendong Mao
机构: University of Science and Technology of China (中国科学技术大学); Metastone Technology (元象科技)
类目: Computation and Language (cs.CL)
备注: 18 pages, 8 figures

点击查看摘要

Abstract:Frontier language models have demonstrated strong reasoning and long-horizon tool-use capabilities. However, existing RAG systems fail to leverage these capabilities. They still rely on two paradigms: (1) designing an algorithm that retrieves passages in a single shot and concatenates them into the model’s input, or (2) predefining a workflow and prompting the model to execute it step-by-step. Neither paradigm allows the model to participate in retrieval decisions, preventing efficient scaling with model improvements. In this paper, we introduce A-RAG, an Agentic RAG framework that exposes hierarchical retrieval interfaces directly to the model. A-RAG provides three retrieval tools: keyword search, semantic search, and chunk read, enabling the agent to adaptively search and retrieve information across multiple granularities. Experiments on multiple open-domain QA benchmarks show that A-RAG consistently outperforms existing approaches with comparable or lower retrieved tokens, demonstrating that A-RAG effectively leverages model capabilities and dynamically adapts to different RAG tasks. We further systematically study how A-RAG scales with model size and test-time compute. We will release our code and evaluation suite to facilitate future research. Code and evaluation suite are available at this https URL.
zh

[NLP-45] DiscoverLLM : From Executing Intents to Discovering Them

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对用户模糊或开放性请求时,难以有效协助用户明确和发现其潜在意图的问题。传统方法通过直接提问澄清意图(如“你想要什么语气?”)往往失效,因为用户自身尚未形成清晰目标。解决方案的关键在于提出一个名为DiscoverLLM的通用框架,其核心是一个新颖的用户模拟器(user simulator),该模拟器以层级化意图结构建模用户的认知状态,随着模型逐步揭示相关选项,意图逐渐具体化,而这种具体化程度被用作奖励信号来训练模型优化行为。由此,模型学会在意图不明确时主动探索(diverge),在意图清晰后收敛并执行(converge),从而提升任务完成度与对话效率。

链接: https://arxiv.org/abs/2602.03429
作者: Tae Soo Kim,Yoonjoo Lee,Jaesang Yu,John Joon Young Chung,Juho Kim
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:To handle ambiguous and open-ended requests, Large Language Models (LLMs) are increasingly trained to interact with users to surface intents they have not yet expressed (e.g., ask clarification questions). However, users are often ambiguous because they have not yet formed their intents: they must observe and explore outcomes to discover what they want. Simply asking “what kind of tone do you want?” fails when users themselves do not know. We introduce DiscoverLLM, a novel and generalizable framework that trains LLMs to help users form and discover their intents. Central to our approach is a novel user simulator that models cognitive state with a hierarchy of intents that progressively concretize as the model surfaces relevant options – where the degree of concretization serves as a reward signal that models can be trained to optimize. Resulting models learn to collaborate with users by adaptively diverging (i.e., explore options) when intents are unclear, and converging (i.e., refine and implement) when intents concretize. Across proposed interactive benchmarks in creative writing, technical writing, and SVG drawing, DiscoverLLM achieves over 10% higher task performance while reducing conversation length by up to 40%. In a user study with 75 human participants, DiscoverLLM improved conversation satisfaction and efficiency compared to baselines.
zh

[NLP-46] SWE-World: Building Software Engineering Agents in Docker-Free Environments

【速读】: 该论文旨在解决当前软件工程代理(Software Engineering Agents)在训练与评估过程中对容器化环境(containerized environments)高度依赖所带来的资源消耗大、维护成本高及可扩展性差的问题。现有方法通常需要完整的依赖配置和物理执行程序与测试用例,这不仅效率低下,还限制了代理的优化空间。其解决方案的关键在于提出 SWE-World——一个无需 Docker 的框架,通过基于真实代理-环境交互数据训练的大型语言模型(LLM),学习并预测中间执行结果和最终测试反馈,从而替代物理执行环境。该设计保留了标准的代理-环境交互循环,同时显著降低环境构建与维护开销,并支持在测试时通过模拟候选轨迹的最终结果实现有效测试时缩放(Test-Time Scaling, TTS),大幅提升代理性能。

链接: https://arxiv.org/abs/2602.03419
作者: Shuang Sun,Huatong Song,Lisheng Huang,Jinhao Jiang,Ran Le,Zhihao Lv,Zongchao Chen,Yiwen Hu,Wenyang Luo,Wayne Xin Zhao,Yang Song,Hongteng Xu,Tao Zhang,Ji-Rong Wen
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); BOSS Zhipin (BOSS直聘)
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have enabled software engineering agents to tackle complex code modification tasks. Most existing approaches rely on execution feedback from containerized environments, which require dependency-complete setup and physical execution of programs and tests. While effective, this paradigm is resource-intensive and difficult to maintain, substantially complicating agent training and limiting scalability. We propose SWE-World, a Docker-free framework that replaces physical execution environments with a learned surrogate for training and evaluating software engineering agents. SWE-World leverages LLM-based models trained on real agent-environment interaction data to predict intermediate execution outcomes and final test feedback, enabling agents to learn without interacting with physical containerized environments. This design preserves the standard agent-environment interaction loop while eliminating the need for costly environment construction and maintenance during agent optimization and evaluation. Furthermore, because SWE-World can simulate the final evaluation outcomes of candidate trajectories without real submission, it enables selecting the best solution among multiple test-time attempts, thereby facilitating effective test-time scaling (TTS) in software engineering tasks. Experiments on SWE-bench Verified demonstrate that SWE-World raises Qwen2.5-Coder-32B from 6.2% to 52.0% via Docker-free SFT, 55.0% with Docker-free RL, and 68.2% with further TTS. The code is available at this https URL
zh

[NLP-47] FactNet: A Billion-Scale Knowledge Graph for Multilingual Factual Grounding

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中存在的事实幻觉(factual hallucinations)和缺乏可追溯来源(traceable provenance)的问题。现有知识资源通常在结构化知识与文本上下文之间存在权衡:要么提供无上下文的结构化知识库,要么提供有限规模和语言覆盖的有依据文本。解决方案的关键在于提出FactNet——一个包含17亿个原子断言(atomic assertions)和30.1亿个可审计证据指针(auditable evidence pointers)的开源大规模资源,其全部数据源自316种语言的维基百科版本,并采用严格确定性的构建流程,确保每个证据单元可实现字节级精确恢复。该方法保障了高接地精度(92.1%),并在长尾语言中仍具鲁棒性,为可信、可验证的多语言系统提供了基础且可复现的训练与评估资源。

链接: https://arxiv.org/abs/2602.03417
作者: Yingli Shen,Wen Lai,Jie Zhou,Xueren Zhang,Yudong Wang,Kangyang Luo,Shuo Wang,Ge Gao,Alexander Fraser,Maosong Sun
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While LLMs exhibit remarkable fluency, their utility is often compromised by factual hallucinations and a lack of traceable provenance. Existing resources for grounding mitigate this but typically enforce a dichotomy: they offer either structured knowledge without textual context (e.g., knowledge bases) or grounded text with limited scale and linguistic coverage. To bridge this gap, we introduce FactNet, a massive, open-source resource designed to unify 1.7 billion atomic assertions with 3.01 billion auditable evidence pointers derived exclusively from 316 Wikipedia editions. Unlike recent synthetic approaches, FactNet employs a strictly deterministic construction pipeline, ensuring that every evidence unit is recoverable with byte-level precision. Extensive auditing confirms a high grounding precision of 92.1%, even in long-tail languages. Furthermore, we establish FactNet-Bench, a comprehensive evaluation suite for Knowledge Graph Completion, Question Answering, and Fact Checking. FactNet provides the community with a foundational, reproducible resource for training and evaluating trustworthy, verifiable multilingual systems.
zh

[NLP-48] Verified Critical Step Optimization for LLM Agents

【速读】: 该论文旨在解决大语言模型代理在执行复杂长时任务时,后训练阶段面临的关键挑战:仅基于结果的奖励难以精确分配中间步骤的信用,估计的步骤级奖励引入系统性噪声,而蒙特卡洛采样方法在步骤奖励估计上计算成本过高。解决方案的关键在于提出关键步骤优化(Critical Step Optimization, CSO),其核心思想是聚焦于经验证的关键步骤(即决策点,其中替代动作能将任务结果从失败转变为成功),通过过程奖励模型(PRM)识别候选关键步骤,利用专家模型生成高质量备选方案,并由策略模型自身继续执行直至任务完成;仅保留策略模型能成功执行并获得正确结果的备选方案作为直接偏好学习(DPO)训练数据,从而实现细粒度、可验证的监督信号,同时避免轨迹级粗粒度和步骤级噪声,显著提升训练效率与效果。

链接: https://arxiv.org/abs/2602.03412
作者: Mukai Li,Qingcheng Zeng,Tianqing Fang,Zhenwen Liang,Linfeng Song,Qi Liu,Haitao Mi,Dong Yu
机构: Tencent AI Lab (腾讯AI实验室); The University of Hong Kong (香港大学); Northwestern University (西北大学)
类目: Computation and Language (cs.CL)
备注: Working in progress

点击查看摘要

Abstract:As large language model agents tackle increasingly complex long-horizon tasks, effective post-training becomes critical. Prior work faces fundamental challenges: outcome-only rewards fail to precisely attribute credit to intermediate steps, estimated step-level rewards introduce systematic noise, and Monte Carlo sampling approaches for step reward estimation incur prohibitive computational cost. Inspired by findings that only a small fraction of high-entropy tokens drive effective RL for reasoning, we propose Critical Step Optimization (CSO), which focuses preference learning on verified critical steps, decision points where alternate actions demonstrably flip task outcomes from failure to success. Crucially, our method starts from failed policy trajectories rather than expert demonstrations, directly targeting the policy model’s weaknesses. We use a process reward model (PRM) to identify candidate critical steps, leverage expert models to propose high-quality alternatives, then continue execution from these alternatives using the policy model itself until task completion. Only alternatives that the policy successfully executes to correct outcomes are verified and used as DPO training data, ensuring both quality and policy reachability. This yields fine-grained, verifiable supervision at critical decisions while avoiding trajectory-level coarseness and step-level noise. Experiments on GAIA-Text-103 and XBench-DeepSearch show that CSO achieves 37% and 26% relative improvement over the SFT baseline and substantially outperforms other post-training methods, while requiring supervision at only 16% of trajectory steps. This demonstrates the effectiveness of selective verification-based learning for agent post-training.
zh

[NLP-49] SWE-Master: Unleashing the Potential of Software Engineering Agents via Post-Training

【速读】: 该论文旨在解决如何构建高效且可复现的软件工程智能体(Software Engineering Agent)的问题,特别是在有限初始能力的开源基础模型上实现长周期、复杂任务的求解能力。其核心解决方案在于提出SWE-Master框架,系统性地优化从教师轨迹合成与数据整理、长程监督微调(Long-horizon SFT)、基于真实执行反馈的强化学习(Reinforcement Learning, RL),到推理架构设计的完整代理开发流程。关键创新点在于通过结构化训练范式和测试时缩放(Test-Time Scaling, TTS)策略,显著提升模型在真实软件工程任务基准SWE-bench Verified上的解决率,达到70.8%(TTS@8),验证了该方法在提升生成式AI(Generative AI)驱动的软件工程代理性能方面的有效性与可复现性。

链接: https://arxiv.org/abs/2602.03411
作者: Huatong Song,Lisheng Huang,Shuang Sun,Jinhao Jiang,Ran Le,Daixuan Cheng,Guoxin Chen,Yiwen Hu,Zongchao Chen,Wayne Xin Zhao,Yang Song,Tao Zhang,Ji-Rong Wen
机构: Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); BOSS Zhipin (BOSS直聘)
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this technical report, we present SWE-Master, an open-source and fully reproducible post-training framework for building effective software engineering agents. SWE-Master systematically explores the complete agent development pipeline, including teacher-trajectory synthesis and data curation, long-horizon SFT, RL with real execution feedback, and inference framework design. Starting from an open-source base model with limited initial SWE capability, SWE-Master demonstrates how systematical optimization method can elicit strong long-horizon SWE task solving abilities. We evaluate SWE-Master on SWE-bench Verified, a standard benchmark for realistic software engineering tasks. Under identical experimental settings, our approach achieves a resolve rate of 61.4% with Qwen2.5-Coder-32B, substantially outperforming existing open-source baselines. By further incorporating test-time scaling~(TTS) with LLM-based environment feedback, SWE-Master reaches 70.8% at TTS@8, demonstrating a strong performance potential. SWE-Master provides a practical and transparent foundation for advancing reproducible research on software engineering agents. The code is available at this https URL.
zh

[NLP-50] owards Distillation-Resistant Large Language Models : An Information-Theoretic Perspective

【速读】: 该论文旨在解决生成式 AI(Generative AI)中大型语言模型(Large Language Models, LLMs)因暴露为黑盒 API 而面临的知识蒸馏(distillation)攻击问题,尤其关注以往未被充分研究的基于 logits 的蒸馏攻击。解决方案的关键在于从信息论角度出发,利用条件互信息(Conditional Mutual Information, CMI)量化教师模型输出 logits 与输入查询之间在给定真实标签条件下的相关性,从而识别出对模型提取有利的上下文信息;在此基础上,提出通过学习一个变换矩阵来净化原始输出,以最小化 CMI 为目标设计抗蒸馏目标函数,有效移除蒸馏相关的信息同时保持任务性能,显著提升模型知识产权保护能力。

链接: https://arxiv.org/abs/2602.03396
作者: Hao Fang,Tianyi Zhang,Tianqu Zhuang,Jiawei Kong,Kuofeng Gao,Bin Chen,Leqi Liang,Shu-Tao Xia,Ke Xu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Proprietary large language models (LLMs) embody substantial economic value and are generally exposed only as black-box APIs, yet adversaries can still exploit their outputs to extract knowledge via distillation. Existing defenses focus exclusively on text-based distillation, leaving the important logit-based distillation largely unexplored. In this work, we analyze this problem and present an effective solution from an information-theoretic perspective. We characterize distillation-relevant information in teacher outputs using the conditional mutual information (CMI) between teacher logits and input queries conditioned on ground-truth labels. This quantity captures contextual information beneficial for model extraction, motivating us to defend distillation via CMI minimization. Guided by our theoretical analysis, we propose learning a transformation matrix that purifies the original outputs to enhance distillation resistance. We further derive a CMI-inspired anti-distillation objective to optimize this transformation, which effectively removes distillation-relevant information while preserving output utility. Extensive experiments across multiple LLMs and strong distillation algorithms demonstrate that the proposed method significantly degrades distillation performance while preserving task accuracy, effectively protecting models’ intellectual property.
zh

[NLP-51] Pursuing Best Industrial Practices for Retrieval-Augmented Generation in the Medical Domain

【速读】: 该论文旨在解决工业级检索增强生成(Retrieval Augmented Generation, RAG)系统在构建过程中缺乏最佳实践指南的问题,尤其是在医疗领域中如何选择组件、组织架构及实现细节尚无共识。其解决方案的关键在于:首先对RAG系统的各个模块进行细致分析,并为每个组件提供实用的替代方案;其次通过在三类任务上的系统性评估,揭示提升RAG性能与效率之间权衡的最佳实践路径,从而为工业场景下部署高效、可靠的RAG系统提供实证依据和设计指导。

链接: https://arxiv.org/abs/2602.03368
作者: Wei Zhu
机构: University of Hong Kong (香港大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While retrieval augmented generation (RAG) has been swiftly adopted in industrial applications based on large language models (LLMs), there is no consensus on what are the best practices for building a RAG system in terms of what are the components, how to organize these components and how to implement each component for the industrial applications, especially in the medical domain. In this work, we first carefully analyze each component of the RAG system and propose practical alternatives for each component. Then, we conduct systematic evaluations on three types of tasks, revealing the best practices for improving the RAG system and how LLM-based RAG systems make trade-offs between performance and efficiency.
zh

[NLP-52] MeKi: Memory-based Expert Knowledge Injection for Efficient LLM Scaling

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在边缘设备(如智能手机)上部署时面临的性能与资源限制之间的矛盾问题,即传统通过增加参数量或推理时计算量来提升性能的方法因受限于内存(RAM)和神经网络处理单元(NPU)资源而不可行。解决方案的关键在于提出MeKi(Memory-based Expert Knowledge Injection)系统,其核心思想是将模型容量扩展从计算资源(FLOPs)转向存储空间(ROM),通过在每个Transformer层中引入基于token的内存专家模块,将预存储的语义知识注入生成过程;同时采用重参数化策略,将训练阶段使用的参数矩阵折叠为紧凑的静态查找表,从而实现模型容量与推理计算开销的解耦,并引入零额外推理延迟。

链接: https://arxiv.org/abs/2602.03359
作者: Ning Ding,Fangcheng Liu,Kyungrae Kim,Linji Hao,Kyeng-Hun Lee,Hyeonmok Ko,Yehui Tang
机构: Samsung Research(三星研究院); Samsung Research(三星研究院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Scaling Large Language Models (LLMs) typically relies on increasing the number of parameters or test-time computations to boost performance. However, these strategies are impractical for edge device deployment due to limited RAM and NPU resources. Despite hardware constraints, deploying performant LLM on edge devices such as smartphone remains crucial for user experience. To address this, we propose MeKi (Memory-based Expert Knowledge Injection), a novel system that scales LLM capacity via storage space rather than FLOPs. MeKi equips each Transformer layer with token-level memory experts that injects pre-stored semantic knowledge into the generation process. To bridge the gap between training capacity and inference efficiency, we employ a re-parameterization strategy to fold parameter matrices used during training into a compact static lookup table. By offloading the knowledge to ROM, MeKi decouples model capacity from computational cost, introducing zero inference latency overhead. Extensive experiments demonstrate that MeKi significantly outperforms dense LLM baselines with identical inference speed, validating the effectiveness of memory-based scaling paradigm for on-device LLMs. Project homepage is at this https URL.
zh

[NLP-53] GFlowPO: Generative Flow Network as a Language Model Prompt Optimizer

【速读】: 该论文旨在解决语言模型(Language Model, LM)中提示(prompt)优化的样本效率低问题,即在组合空间庞大且目标模型评估成本高昂的情况下,如何高效地搜索出高奖励的有效提示。现有基于强化学习(Reinforcement Learning, RL)的方法多依赖于在线策略更新和固定分布采样的元提示(meta-prompt),导致探索效率低下。其解决方案的关键在于提出GFlowPO框架,将提示搜索建模为受元提示引导的先验正则化后验推断问题;首先利用离线生成流网络(off-policy Generative Flow Network, GFlowNet)对轻量级提示语言模型进行训练,通过回放缓冲区重用历史提示评估结果以提升样本效率;其次引入无需训练的动态记忆更新机制(Dynamic Memory Update, DMU),通过向元提示注入来自回放缓冲区的多样化提示与优先队列中的高性能提示,逐步聚焦于高奖励区域,从而实现更高效的提示优化。

链接: https://arxiv.org/abs/2602.03358
作者: Junmo Cho,Suhan Kim,Sangjune An,Minsu Kim,Dong Bok Lee,Heejun Lee,Sung Ju Hwang,Hae Beom Lee
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Finding effective prompts for language models (LMs) is critical yet notoriously difficult: the prompt space is combinatorially large, rewards are sparse due to expensive target-LM evaluation. Yet, existing RL-based prompt optimizers often rely on on-policy updates and a meta-prompt sampled from a fixed distribution, leading to poor sample efficiency. We propose GFlowPO, a probabilistic prompt optimization framework that casts prompt search as a posterior inference problem over latent prompts regularized by a meta-prompted reference-LM prior. In the first step, we fine-tune a lightweight prompt-LM with an off-policy Generative Flow Network (GFlowNet) objective, using a replay-based training policy that reuses past prompt evaluations to enable sample-efficient exploration. In the second step, we introduce Dynamic Memory Update (DMU), a training-free mechanism that updates the meta-prompt by injecting both (i) diverse prompts from a replay buffer and (ii) top-performing prompts from a small priority queue, thereby progressively concentrating the search process on high-reward regions. Across few-shot text classification, instruction induction benchmarks, and question answering tasks, GFlowPO consistently outperforms recent discrete prompt optimization baselines.
zh

[NLP-54] PEGRL: Improving Machine Translation by Post-Editing Guided Reinforcement Learning

【速读】: 该论文旨在解决基于大语言模型(Large Language Models, LLMs)的机器翻译中强化学习(Reinforcement Learning, RL)训练面临的两大挑战:一是由蒙特卡洛回报估计带来的噪声学习信号,二是轨迹空间庞大导致全局探索占优而难以实现细粒度局部优化。解决方案的关键在于提出一种两阶段强化学习框架——PEGRL(Post-editing Guided Reinforcement Learning),其核心创新是引入后编辑(Post-editing)作为辅助任务,通过在每轮迭代中利用当前翻译输出构建后编辑输入,使后编辑阶段的回报估计能够依赖于当前翻译行为的条件信息,从而同时促进全局探索与局部精细优化;此外,采用任务特定的加权策略平衡翻译与后编辑目标的贡献,形成偏向性但更样本高效的估计器,在多个翻译任务上均显著优于基线方法,尤其在英译土语任务中性能达到先进LLM系统水平。

链接: https://arxiv.org/abs/2602.03352
作者: Yunzhi Shen,Hao Zhou,Xin Huang,Xue Han,Junlan Feng,Shujian Huang
机构: National Key Laboratory for Novel Software Technology, Nanjing University (南京大学新型软件技术国家重点实验室); China Mobile Research Beijing (中国移动北京研究院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has shown strong promise for LLM-based machine translation, with recent methods such as GRPO demonstrating notable gains; nevertheless, translation-oriented RL remains challenged by noisy learning signals arising from Monte Carlo return estimation, as well as a large trajectory space that favors global exploration over fine-grained local optimization. We introduce \textbfPEGRL, a \textittwo-stage RL framework that uses post-editing as an auxiliary task to stabilize training and guide overall optimization. At each iteration, translation outputs are sampled to construct post-editing inputs, allowing return estimation in the post-editing stage to benefit from conditioning on the current translation behavior, while jointly supporting both global exploration and fine-grained local optimization. A task-specific weighting scheme further balances the contributions of translation and post-editing objectives, yielding a biased yet more sample-efficient estimator. Experiments on English \to Finnish, English \to Turkish, and English \leftrightarrow Chinese show consistent gains over RL baselines, and for English \to Turkish, performance on COMET-KIWI is comparable to advanced LLM-based systems (DeepSeek-V3.2).
zh

[NLP-55] Robustness as an Emergent Property of Task Performance

【速读】: 该论文旨在解决当前生成式 AI 模型在真实世界应用中面临的鲁棒性(robustness)问题,即模型在面对输入扰动或任务变体时保持稳定性能的能力。其核心发现是:随着模型在特定任务上性能提升至较高水平,鲁棒性会自然涌现,而非依赖于模型本身的独立特性;这一现象主要由任务特定的胜任能力(task-specific competence)驱动,而非模型层面的固有属性。因此,解决方案的关键在于通过提升模型对任务的掌握程度来间接增强鲁棒性,而非单独设计专门的鲁棒性优化机制。

链接: https://arxiv.org/abs/2602.03344
作者: Shir Ashury-Tahan,Ariel Gera,Elron Bandel,Michal Shmueli-Scheuer,Leshem Choshen
机构: IBM Research (IBM研究院); MIT (麻省理工学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Robustness is often regarded as a critical future challenge for real-world applications, where stability is essential. However, as models often learn tasks in a similar order, we hypothesize that easier tasks will be easier regardless of how they are presented to the model. Indeed, in this paper, we show that as models approach high performance on a task, robustness is effectively achieved. Through an empirical analysis of multiple models across diverse datasets and configurations (e.g., paraphrases, different temperatures), we find a strong positive correlation. Moreover, we find that robustness is primarily driven by task-specific competence rather than inherent model-level properties, challenging current approaches that treat robustness as an independent capability. Thus, from a high-level perspective, we may expect that as new tasks saturate, model robustness on these tasks will emerge accordingly. For researchers, this implies that explicit efforts to measure and improve robustness may warrant reduced emphasis, as such robustness is likely to develop alongside performance gains. For practitioners, it acts as a sign that indeed the tasks that the literature deals with are unreliable, but on easier past tasks, the models are reliable and ready for real-world deployment.
zh

[NLP-56] Accurate Failure Prediction in Agents Does Not Imply Effective Failure Prevention

【速读】: 该论文旨在解决生成式 AI(Generative AI)在部署阶段因大语言模型(LLM)批评者模型(critic model)主动干预而导致的性能不可预测性问题。尽管这些批评者在离线评估中表现出高准确性(如AUROC达0.94),但其实际部署效果可能显著恶化模型表现,甚至引发高达26个百分点(pp)的性能下降。论文的核心发现是识别出“干扰-恢复权衡”(disruption-recovery tradeoff):干预虽可修复失败轨迹,但也可能破坏本可成功的轨迹。解决方案的关键在于提出一种预部署测试机制——仅需50个任务的小型试点即可判断干预是否有益,从而避免在高成功率任务上引入严重退化,并在高失败场景下实现小幅提升(如ALFWorld基准+2.8 pp,p=0.014)。该框架的核心价值在于识别何时不应干预,防止部署前出现灾难性性能下降。

链接: https://arxiv.org/abs/2602.03338
作者: Rakshith Vasudev,Melisa Russak,Dan Bikel,Waseem Alshikh
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Proactive interventions by LLM critic models are often assumed to improve reliability, yet their effects at deployment time are poorly understood. We show that a binary LLM critic with strong offline accuracy (AUROC 0.94) can nevertheless cause severe performance degradation, inducing a 26 percentage point (pp) collapse on one model while affecting another by near zero pp. This variability demonstrates that LLM critic accuracy alone is insufficient to determine whether intervention is safe. We identify a disruption-recovery tradeoff: interventions may recover failing trajectories but also disrupt trajectories that would have succeeded. Based on this insight, we propose a pre-deployment test that uses a small pilot of 50 tasks to estimate whether intervention is likely to help or harm, without requiring full deployment. Across benchmarks, the test correctly anticipates outcomes: intervention degrades performance on high-success tasks (0 to -26 pp), while yielding a modest improvement on the high-failure ALFWorld benchmark (+2.8 pp, p=0.014). The primary value of our framework is therefore identifying when not to intervene, preventing severe regressions before deployment. Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG) Cite as: arXiv:2602.03338 [cs.CL] (or arXiv:2602.03338v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2602.03338 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-57] MIRROR: A Multi-Agent Framework with Iterative Adaptive Revision and Hierarchical Retrieval for Optimization Modeling in Operations Research

【速读】: 该论文旨在解决运筹学(Operations Research, OR)领域中专家驱动建模过程效率低、脆弱性强,难以适应新场景的问题。现有基于大语言模型(Large Language Models, LLMs)的方法要么依赖昂贵的微调,要么采用多智能体框架但缺乏可靠的协同纠错机制与任务特定检索能力,导致输出准确性不足。其解决方案的关键在于提出一种无需微调的端到端多智能体框架MIRROR,集成两个核心机制:(1) 基于执行反馈的迭代自适应修订机制,实现自动错误修正;(2) 分层检索机制,从精心构建的示例库中获取相关建模与编码范例,从而提升建模精度与可靠性。该方法显著优于现有技术,在标准OR基准及复杂工业数据集上表现突出,为非专家用户提供高效且可靠的运筹学建模工具。

链接: https://arxiv.org/abs/2602.03318
作者: Yifan Shi,Jialong Shi,Jiayi Wang,Ye Fan,Jianyong Sun
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Operations Research (OR) relies on expert-driven modeling-a slow and fragile process ill-suited to novel scenarios. While large language models (LLMs) can automatically translate natural language into optimization models, existing approaches either rely on costly post-training or employ multi-agent frameworks, yet most still lack reliable collaborative error correction and task-specific retrieval, often leading to incorrect outputs. We propose MIRROR, a fine-tuning-free, end-to-end multi-agent framework that directly translates natural language optimization problems into mathematical models and solver code. MIRROR integrates two core mechanisms: (1) execution-driven iterative adaptive revision for automatic error correction, and (2) hierarchical retrieval to fetch relevant modeling and coding exemplars from a carefully curated exemplar library. Experiments show that MIRROR outperforms existing methods on standard OR benchmarks, with notable results on complex industrial datasets such as IndustryOR and Mamo-ComplexLP. By combining precise external knowledge infusion with systematic error correction, MIRROR provides non-expert users with an efficient and reliable OR modeling solution, overcoming the fundamental limitations of general-purpose LLMs in expert optimization tasks.
zh

[NLP-58] R1-SyntheticVL: Is Synthetic Data from Generative Models Ready for Multimodal Large Language Model?

【速读】: 该论文旨在解决如何自主合成高质量、多样化且具有挑战性的多模态训练数据,以提升多模态大语言模型(Multimodal Large Language Models, MLLMs)在复杂现实任务中的性能问题。解决方案的关键在于提出了一种名为集体对抗数据合成(Collective Adversarial Data Synthesis, CADS)的新方法,其核心机制包含两个循环阶段:集体对抗数据生成(CAD-Generate)与集体对抗数据评判(CAD-Judge),通过集合智能(collective intelligence)保障数据多样性与质量,并利用对抗学习策略生成具有挑战性的样本;此外,引入对抗上下文优化(Adversarial Context Optimization)机制动态调整生成语境,进一步引导高价值数据的涌现,从而有效驱动模型迭代优化。

链接: https://arxiv.org/abs/2602.03300
作者: Jingyi Zhang,Tianyi Lin,Huanjin Yao,Xiang Lan,Shunyu Liu,Jiaxing Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this work, we aim to develop effective data synthesis techniques that autonomously synthesize multimodal training data for enhancing MLLMs in solving complex real-world tasks. To this end, we propose Collective Adversarial Data Synthesis (CADS), a novel and general approach to synthesize high-quality, diverse and challenging multimodal data for MLLMs. The core idea of CADS is to leverage collective intelligence to ensure high-quality and diverse generation, while exploring adversarial learning to synthesize challenging samples for effectively driving model improvement. Specifically, CADS operates with two cyclic phases, i.e., Collective Adversarial Data Generation (CAD-Generate) and Collective Adversarial Data Judgment (CAD-Judge). CAD-Generate leverages collective knowledge to jointly generate new and diverse multimodal data, while CAD-Judge collaboratively assesses the quality of synthesized data. In addition, CADS introduces an Adversarial Context Optimization mechanism to optimize the generation context to encourage challenging and high-value data generation. With CADS, we construct MMSynthetic-20K and train our model R1-SyntheticVL, which demonstrates superior performance on various benchmarks.
zh

[NLP-59] POP: Prefill-Only Pruning for Efficient Large Model Inference

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)和视觉-语言模型(Vision-Language Models, VLMs)在部署过程中因计算成本过高而导致的效率瓶颈问题,尤其是现有结构化剪枝方法在追求硬件效率时往往导致显著的性能下降。其解决方案的关键在于提出一种阶段感知的推理策略——预填充仅剪枝(Prefill-Only Pruning, POP),该策略通过引入虚拟门机制揭示了深层网络在预填充(prefill)阶段对上下文编码冗余、而在解码(decode)阶段对下一个词预测至关重要的不对称特性;在此基础上,POP 在计算密集的预填充阶段安全地跳过深层网络,同时保留完整模型用于敏感的解码阶段,并通过独立的键值(Key-Value, KV)投影和边界处理策略确保缓存完整性与首个生成token的准确性,从而在多模态模型上实现高达1.37倍的预填充延迟加速且几乎无性能损失。

链接: https://arxiv.org/abs/2602.03295
作者: Junhui He,Zhihui Fu,Jun Wang,Qingan Li
机构: Wuhan University (武汉大学); OPPO Research Institute (OPPO研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities. However, their deployment is hindered by significant computational costs. Existing structured pruning methods, while hardware-efficient, often suffer from significant accuracy degradation. In this paper, we argue that this failure stems from a stage-agnostic pruning approach that overlooks the asymmetric roles between the prefill and decode stages. By introducing a virtual gate mechanism, our importance analysis reveals that deep layers are critical for next-token prediction (decode) but largely redundant for context encoding (prefill). Leveraging this insight, we propose Prefill-Only Pruning (POP), a stage-aware inference strategy that safely omits deep layers during the computationally intensive prefill stage while retaining the full model for the sensitive decode stage. To enable the transition between stages, we introduce independent Key-Value (KV) projections to maintain cache integrity, and a boundary handling strategy to ensure the accuracy of the first generated token. Extensive experiments on Llama-3.1, Qwen3-VL, and Gemma-3 across diverse modalities demonstrate that POP achieves up to 1.37 \times speedup in prefill latency with minimal performance loss, effectively overcoming the accuracy-efficiency trade-off limitations of existing structured pruning methods.
zh

[NLP-60] Merging Beyond: Streaming LLM Updates via Activation-Guided Rotations

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在持续适应过程中效率与优化效果之间的矛盾问题,即传统模型合并(model merging)方法多作为事后优化手段,难以捕捉监督微调(Supervised Fine-Tuning, SFT)中的动态优化优势。其解决方案的关键在于提出一种名为Streaming Merging的新范式,其中核心是ARM(Activation-guided Rotation-aware Merging)策略:通过将合并系数视为学习率,并从激活子空间中推导旋转向量,使参数更新沿数据驱动轨迹进行,从而逼近梯度下降动力学;ARM不仅保留高维参数演化中的几何结构,还能仅依赖早期SFT检查点,经迭代合并超越完全收敛的SFT模型,在1.7B至14B规模及数学、代码等多领域任务中实现更高效、轻量且可扩展的模型适应。

链接: https://arxiv.org/abs/2602.03237
作者: Yuxuan Yao,Haonan Sheng,Qingsong Lv,Han Wu,Shuqi Liu,Zehua Liu,Zengyan Liu,Jiahui Gao,Haochen Tan,Xiaojin Fu,Haoli Bai,Hing Cheung So,Zhijiang Guo,Linqi Song
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The escalating scale of Large Language Models (LLMs) necessitates efficient adaptation techniques. Model merging has gained prominence for its efficiency and controllability. However, existing merging techniques typically serve as post-hoc refinements or focus on mitigating task interference, often failing to capture the dynamic optimization benefits of supervised fine-tuning (SFT). In this work, we propose Streaming Merging, an innovative model updating paradigm that conceptualizes merging as an iterative optimization process. Central to this paradigm is \textbfARM (\textbfActivation-guided \textbfRotation-aware \textbfMerging), a strategy designed to approximate gradient descent dynamics. By treating merging coefficients as learning rates and deriving rotation vectors from activation subspaces, ARM effectively steers parameter updates along data-driven trajectories. Unlike conventional linear interpolation, ARM aligns semantic subspaces to preserve the geometric structure of high-dimensional parameter evolution. Remarkably, ARM requires only early SFT checkpoints and, through iterative merging, surpasses the fully converged SFT model. Experimental results across model scales (1.7B to 14B) and diverse domains (e.g., math, code) demonstrate that ARM can transcend converged checkpoints. Extensive experiments show that ARM provides a scalable and lightweight framework for efficient model adaptation.
zh

[NLP-61] ATACompressor: Adaptive Task-Aware Compression for Efficient Long-Context Processing in LLM s

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理长上下文输入时面临的“中间信息丢失”(lost in the middle)问题,即随着输入长度增加,关键信息因被稀释或忽略而导致性能下降。现有上下文压缩方法难以在信息保留与压缩效率之间取得平衡。其解决方案的关键在于提出自适应任务感知压缩器(Adaptive Task-Aware Compressor, ATACompressor),该方法通过一个选择性编码器仅压缩与任务相关的上下文片段,确保核心信息得以保留;同时引入自适应分配控制器,根据相关文本长度动态调整压缩率,从而优化资源利用并提升任务表现。

链接: https://arxiv.org/abs/2602.03226
作者: Xuancheng Li,Haitao Li,Yujia Zhou,Qingyao Ai,Yiqun Liu
机构: Tsinghua University (清华大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-context inputs in large language models (LLMs) often suffer from the “lost in the middle” problem, where critical information becomes diluted or ignored due to excessive length. Context compression methods aim to address this by reducing input size, but existing approaches struggle with balancing information preservation and compression efficiency. We propose Adaptive Task-Aware Compressor (ATACompressor), which dynamically adjusts compression based on the specific requirements of the task. ATACompressor employs a selective encoder that compresses only the task-relevant portions of long contexts, ensuring that essential information is preserved while reducing unnecessary content. Its adaptive allocation controller perceives the length of relevant content and adjusts the compression rate accordingly, optimizing resource utilization. We evaluate ATACompressor on three QA datasets: HotpotQA, MSMARCO, and SQUAD-showing that it outperforms existing methods in terms of both compression efficiency and task performance. Our approach provides a scalable solution for long-context processing in LLMs. Furthermore, we perform a range of ablation studies and analysis experiments to gain deeper insights into the key components of ATACompressor.
zh

[NLP-62] oken Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection

【速读】: 该论文旨在解决大语言模型在长上下文推理中注意力机制的二次复杂度(quadratic complexity)瓶颈问题。现有加速方法要么通过结构化模式稀疏化注意力矩阵,要么在特定层永久丢弃token,导致可能保留无关信息或依赖不可逆的早期决策,无法适应不同层和注意力头(attention head)间token重要性的动态变化。其解决方案的关键在于提出一种轻量级、动态的token级稀疏化机制——Token Sparse Attention,该机制在注意力计算过程中将每头的查询(Q)、键(K)、值(V)压缩至一个精简token集合,再将输出解压缩回原始序列,从而允许token信息在后续层中被重新评估;同时揭示了token选择与稀疏注意力之间新的设计空间,且完全兼容密集注意力实现(如Flash Attention),可无缝集成现有稀疏注意力核。实验表明,该方法在128K上下文长度下实现了最高达3.23倍的注意力加速,且精度损失小于1%,验证了动态交错的token级稀疏化是提升长上下文推理可扩展性的有效策略。

链接: https://arxiv.org/abs/2602.03216
作者: Dongwon Jo,Beomseok Kang,Jiwon Song,Jae-Joon Kim
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The quadratic complexity of attention remains the central bottleneck in long-context inference for large language models. Prior acceleration methods either sparsify the attention map with structured patterns or permanently evict tokens at specific layers, which can retain irrelevant tokens or rely on irreversible early decisions despite the layer-/head-wise dynamics of token importance. In this paper, we propose Token Sparse Attention, a lightweight and dynamic token-level sparsification mechanism that compresses per-head Q , K , V to a reduced token set during attention and then decompresses the output back to the original sequence, enabling token information to be reconsidered in subsequent layers. Furthermore, Token Sparse Attention exposes a new design point at the intersection of token selection and sparse attention. Our approach is fully compatible with dense attention implementations, including Flash Attention, and can be seamlessly composed with existing sparse attention kernels. Experimental results show that Token Sparse Attention consistently improves accuracy-latency trade-off, achieving up to \times 3.23 attention speedup at 128K context with less than 1% accuracy degradation. These results demonstrate that dynamic and interleaved token-level sparsification is a complementary and effective strategy for scalable long-context inference.
zh

[NLP-63] ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成长文本时因键值缓存(Key-Value Cache, KV Cache)随序列长度线性增长而导致的内存与计算开销过大的问题。现有KV缓存淘汰方法通常基于重要性排序进行裁剪,但难以捕捉复杂的KV依赖关系,导致性能下降。其解决方案的关键在于提出ForesightKV框架,该框架通过监督学习和强化学习相结合的方式优化缓存淘汰策略:首先设计Golden Eviction算法利用未来注意力分数识别每一步最优淘汰的KV对,并通过成对排序损失(Pairwise Ranking Loss)进行知识蒸馏;进一步将缓存淘汰建模为马尔可夫决策过程(Markov Decision Process),并采用GRPO算法缓解低熵token上的语言建模损失激增问题。实验表明,ForesightKV在仅使用一半缓存预算的情况下仍显著优于现有方法。

链接: https://arxiv.org/abs/2602.03203
作者: Zican Dong,Peiyu Liu,Junyi Li,Zhipeng Chen,Han Peng,Shuo Wang,Wayne Xin Zhao
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recently, large language models (LLMs) have shown remarkable reasoning abilities by producing long reasoning traces. However, as the sequence length grows, the key-value (KV) cache expands linearly, incurring significant memory and computation costs. Existing KV cache eviction methods mitigate this issue by discarding less important KV pairs, but often fail to capture complex KV dependencies, resulting in performance degradation. To better balance efficiency and performance, we introduce ForesightKV, a training-based KV cache eviction framework that learns to predict which KV pairs to evict during long-text generations. We first design the Golden Eviction algorithm, which identifies the optimal eviction KV pairs at each step using future attention scores. These traces and the scores at each step are then distilled via supervised training with a Pairwise Ranking Loss. Furthermore, we formulate cache eviction as a Markov Decision Process and apply the GRPO algorithm to mitigate the significant language modeling loss increase on low-entropy tokens. Experiments on AIME2024 and AIME2025 benchmarks of three reasoning models demonstrate that ForesightKV consistently outperforms prior methods under only half the cache budget, while benefiting synergistically from both supervised and reinforcement learning approaches.
zh

[NLP-64] Prompt Augmentation Scales up GRPO Training on Mathematical Reasoning

【速读】: 该论文旨在解决强化学习后训练(reinforcement post-training)过程中出现的熵崩溃(entropy collapse)问题,即策略熵单调下降导致训练不稳定甚至模型性能塌陷的现象。这一问题限制了训练时长(通常仅限5-20个epoch),阻碍了持续探索与策略优化。其解决方案的关键在于引入**提示增强(prompt augmentation)**策略,通过在训练中使用多样化的推理模板和格式引导模型生成不同路径的推理轨迹(reasoning traces),从而提升采样多样性(rollout diversity)。实验表明,该方法无需KL散度正则项即可实现稳定扩展训练周期,并支持低熵状态下的训练鲁棒性,最终使Qwen2.5-Math-1.5B模型在MATH Level 3-5数据集上达到当前最优数学推理性能。

链接: https://arxiv.org/abs/2602.03190
作者: Wenquan Lu,Hai Huang,Randall Balestriero
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement learning algorithms such as group-relative policy optimization (GRPO) have demonstrated strong potential for improving the mathematical reasoning capabilities of large language models. However, prior work has consistently observed an entropy collapse phenomenon during reinforcement post-training, characterized by a monotonic decrease in policy entropy that ultimately leads to training instability and collapse. As a result, most existing approaches restrict training to short horizons (typically 5-20 epochs), limiting sustained exploration and hindering further policy improvement. In addition, nearly all prior work relies on a single, fixed reasoning prompt or template during training. In this work, we introduce prompt augmentation, a training strategy that instructs the model to generate reasoning traces under diverse templates and formats, thereby increasing rollout diversity. We show that, without a KL regularization term, prompt augmentation enables stable scaling of training duration under a fixed dataset and allows the model to tolerate low-entropy regimes without premature collapse. Empirically, a Qwen2.5-Math-1.5B model trained with prompt augmentation on the MATH Level 3-5 dataset achieves state-of-the-art performance, reaching 44.5 per-benchmark accuracy and 51.3 per-question accuracy on standard mathematical reasoning benchmarks, including AIME24, AMC, MATH500, Minerva, and OlympiadBench. The code and model checkpoints are available at this https URL.
zh

[NLP-65] DynSplit-KV: Dynamic Semantic Splitting for KVCache Compression in Efficient Long-Context LLM Inference

【速读】: 该论文旨在解决长上下文场景下键值缓存(Key-Value Cache, KVCache)内存占用过大导致的推理瓶颈问题,特别是现有压缩方法因采用固定分割策略(如固定间隔或预定义分隔符)而引发的语义不匹配问题,从而造成显著的准确率下降(5.5%–55.1%)。其解决方案的关键在于提出 DynSplit-KV 方法,通过两个核心创新实现动态语义分割:(1) 设计一种重要性感知的动态分隔符选择策略,使分割点更贴合语义边界,提升准确率 49.9%;(2) 提出统一映射策略,将变长语义块转换为固定长度格式,降低推理开销 4.9 倍。实验表明,该方法在长上下文场景中实现了最高精度、2.2 倍于 FlashAttention 的加速效果以及峰值内存减少 2.6 倍。

链接: https://arxiv.org/abs/2602.03184
作者: Jiancai Ye,Jun Liu,Qingchen Li,Tianlang Zhao,Hanbin Zhang,Jiayi Pan,Ningyi Xu,Guohao Dai
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Although Key-Value (KV) Cache is essential for efficient large language models (LLMs) inference, its growing memory footprint in long-context scenarios poses a significant bottleneck, making KVCache compression crucial. Current compression methods rely on rigid splitting strategies, such as fixed intervals or pre-defined delimiters. We observe that rigid splitting suffers from significant accuracy degradation (ranging from 5.5% to 55.1%) across different scenarios, owing to the scenario-dependent nature of the semantic boundaries. This highlights the necessity of dynamic semantic splitting to match semantics. To achieve this, we face two challenges. (1) Improper delimiter selection misaligns semantics with the KVCache, resulting in 28.6% accuracy loss. (2) Variable-length blocks after splitting introduce over 73.1% additional inference overhead. To address the above challenges, we propose DynSplit-KV, a KVCache compression method that dynamically identifies delimiters for splitting. We propose: (1) a dynamic importance-aware delimiter selection strategy, improving accuracy by 49.9%. (2) A uniform mapping strategy that transforms variable-length semantic blocks into a fixed-length format, reducing inference overhead by 4.9x. Experiments show that DynSplit-KV achieves the highest accuracy, 2.2x speedup compared with FlashAttention and 2.6x peak memory reduction in long-context scenarios.
zh

[NLP-66] Privasis: Synthesizing the Largest “Public” Private Dataset from Scratch

【速读】: 该论文旨在解决隐私敏感数据研究中长期存在的数据稀缺问题,尤其是在现代AI代理(如OpenClaw和Gemini Agent)持续访问高敏感个人信息背景下,传统数据集规模小、多样性不足且存在隐私风险。解决方案的关键在于构建首个百万级全合成数据集Privasis——一个从零开始生成的文本资源库,包含140万条记录及5510万条标注属性(如种族、出生日期、工作单位等),覆盖医疗记录、法律文件、金融信息、日历和短信等多种文档类型,具备显著更大的规模与更丰富的多样性。通过该数据集训练的紧凑型文本脱敏模型(参数量≤4B)在性能上超越GPT-5和Qwen-3 235B等前沿大语言模型,从而为隐私敏感领域的研究提供可扩展、安全且高效的基础设施支持。

链接: https://arxiv.org/abs/2602.03183
作者: Hyunwoo Kim,Niloofar Mireshghallah,Michael Duan,Rui Xin,Shuyue Stella Li,Jaehun Jung,David Acuna,Qi Pang,Hanshen Xiao,G. Edward Suh,Sewoong Oh,Yulia Tsvetkov,Pang Wei Koh,Yejin Choi
机构: NVIDIA; CMU; USC; UW
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: For code and data, see this https URL

点击查看摘要

Abstract:Research involving privacy-sensitive data has always been constrained by data scarcity, standing in sharp contrast to other areas that have benefited from data scaling. This challenge is becoming increasingly urgent as modern AI agents–such as OpenClaw and Gemini Agent–are granted persistent access to highly sensitive personal information. To tackle this longstanding bottleneck and the rising risks, we present Privasis (i.e., privacy oasis), the first million-scale fully synthetic dataset entirely built from scratch–an expansive reservoir of texts with rich and diverse private information–designed to broaden and accelerate research in areas where processing sensitive social data is inevitable. Compared to existing datasets, Privasis, comprising 1.4 million records, offers orders-of-magnitude larger scale with quality, and far greater diversity across various document types, including medical history, legal documents, financial records, calendars, and text messages with a total of 55.1 million annotated attributes such as ethnicity, date of birth, workplace, etc. We leverage Privasis to construct a parallel corpus for text sanitization with our pipeline that decomposes texts and applies targeted sanitization. Our compact sanitization models (=4B) trained on this dataset outperform state-of-the-art large language models, such as GPT-5 and Qwen-3 235B. We plan to release data, models, and code to accelerate future research on privacy-sensitive domains and agents.
zh

[NLP-67] VALUEFLOW: Toward Pluralistic and Steerable Value-based Alignment in Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)与人类多元价值观对齐的核心挑战,尤其是现有基于偏好的方法难以捕捉深层动机原则、基于价值的方法在价值提取中忽略层次结构、评估仅能检测存在而无法校准强度,以及控制模型输出时缺乏对强度的精细调节能力。其解决方案的关键在于提出VALUEFLOW框架,该框架首次实现了从价值提取、强度评估到可控调节的统一建模:通过HIVES构建包含跨理论结构的层次化价值嵌入空间,利用VIDB(Value Intensity DataBase)提供大规模带强度标注的文本资源,结合锚定式评估器实现输出强度的一致性评分,并在此基础上揭示了多值控制下的不对称可调性和组合规律,从而为LLM的价值强度可控性提供了可扩展的基础设施。

链接: https://arxiv.org/abs/2602.03160
作者: Woojin Kim,Sieun Hyeon,Jusang Oh,Jaeyoung Do
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Aligning Large Language Models (LLMs) with the diverse spectrum of human values remains a central challenge: preference-based methods often fail to capture deeper motivational principles. Value-based approaches offer a more principled path, yet three gaps persist: extraction often ignores hierarchical structure, evaluation detects presence but not calibrated intensity, and the steerability of LLMs at controlled intensities remains insufficiently understood. To address these limitations, we introduce VALUEFLOW, the first unified framework that spans extraction, evaluation, and steering with calibrated intensity control. The framework integrates three components: (i) HIVES, a hierarchical value embedding space that captures intra- and cross-theory value structure; (ii) the Value Intensity DataBase (VIDB), a large-scale resource of value-labeled texts with intensity estimates derived from ranking-based aggregation; and (iii) an anchor-based evaluator that produces consistent intensity scores for model outputs by ranking them against VIDB panels. Using VALUEFLOW, we conduct a comprehensive large-scale study across ten models and four value theories, identifying asymmetries in steerability and composition laws for multi-value control. This paper establishes a scalable infrastructure for evaluating and controlling value intensity, advancing pluralistic alignment of LLMs.
zh

[NLP-68] FASA: Frequency-aware Sparse Attention ICLR2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理长序列输入时因键值缓存(Key Value, KV cache)内存开销过大而导致的性能瓶颈问题。现有基于token剪枝(token pruning)的方法存在局限性:静态剪枝策略可能导致不可逆的信息丢失,而动态策略依赖启发式规则,难以准确捕捉查询相关的token重要性。本文提出FASA框架,其核心创新在于发现旋转位置编码(Rotary Position Embedding, RoPE)在频率块(frequency-chunk, FC)层级上的功能稀疏性——即少数“主导”FCs始终与完整注意力头保持高度上下文一致性,从而为识别关键token提供了一个计算开销极低且鲁棒的代理指标。基于此洞察,FASA首先利用主导FCs定位重要token集合,随后仅对剪枝后的子集执行聚焦注意力计算,显著降低内存带宽需求和计算成本,在多种长上下文任务中实现接近oracle性能的稳定表现。

链接: https://arxiv.org/abs/2602.03152
作者: Yifei Wang,Yueqi Wang,Zhenrui Yue,Huimin Zeng,Yong Wang,Ismini Lourentzou,Zhengzhong Tu,Xiangxiang Chu,Julian McAuley
机构: AMAP(阿里巴巴集团); UCSD(加州大学圣地亚哥分校); UIUC(伊利诺伊大学厄巴纳-香槟分校); Texas A&M University(德克萨斯A&M大学)
类目: Computation and Language (cs.CL)
备注: Accepted by ICLR 2026

点击查看摘要

Abstract:The deployment of Large Language Models (LLMs) faces a critical bottleneck when handling lengthy inputs: the prohibitive memory footprint of the Key Value (KV) cache. To address this bottleneck, the token pruning paradigm leverages attention sparsity to selectively retain a small, critical subset of tokens. However, existing approaches fall short, with static methods risking irreversible information loss and dynamic strategies employing heuristics that insufficiently capture the query-dependent nature of token importance. We propose FASA, a novel framework that achieves query-aware token eviction by dynamically predicting token importance. FASA stems from a novel insight into RoPE: the discovery of functional sparsity at the frequency-chunk (FC) level. Our key finding is that a small, identifiable subset of “dominant” FCs consistently exhibits high contextual agreement with the full attention head. This provides a robust and computationally free proxy for identifying salient tokens. %making them a powerful and efficient proxy for token importance. Building on this insight, FASA first identifies a critical set of tokens using dominant FCs, and then performs focused attention computation solely on this pruned subset. % Since accessing only a small fraction of the KV cache, FASA drastically lowers memory bandwidth requirements and computational cost. Across a spectrum of long-context tasks, from sequence modeling to complex CoT reasoning, FASA consistently outperforms all token-eviction baselines and achieves near-oracle accuracy, demonstrating remarkable robustness even under constraint budgets. Notably, on LongBench-V1, FASA reaches nearly 100% of full-KV performance when only keeping 256 tokens, and achieves 2.56 \times speedup using just 18.9% of the cache on AIME24.
zh

[NLP-69] Self-Hinting Language Models Enhance Reinforcement Learning

【速读】: 该论文旨在解决Group Relative Policy Optimization (GRPO)在稀疏终端奖励(sparse terminal rewards)场景下容易停滞的问题,其核心原因是同一组内的轨迹(rollouts)常获得相同的奖励,导致相对优势(relative advantages)坍缩,进而使策略更新失效。解决方案的关键在于提出自提示对齐的GRPO框架(SAGE),通过在训练阶段注入特权提示(privileged hints,如计划或分解),在不改变任务奖励的前提下,人为增加组内轨迹的多样性,从而防止相对优势坍缩;测试时则移除提示,部署无提示策略,实现高效推理。此外,自生成的多样化提示还充当自适应课程学习机制,更有效地捕捉模型瓶颈,优于固定提示来源(如初始策略或外部强模型)。

链接: https://arxiv.org/abs/2602.03143
作者: Baohao Liao,Hanze Dong,Xinxing Xu,Christof Monz,Jiang Bian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Group Relative Policy Optimization (GRPO) has recently emerged as a practical recipe for aligning large language models with verifiable objectives. However, under sparse terminal rewards, GRPO often stalls because rollouts within a group frequently receive identical rewards, causing relative advantages to collapse and updates to vanish. We propose self-hint aligned GRPO with privileged supervision (SAGE), an on-policy reinforcement learning framework that injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. For each prompt x , the model samples a compact hint h (e.g., a plan or decomposition) and then generates a solution \tau conditioned on (x,h) . Crucially, the task reward R(x,\tau) is unchanged; hints only increase within-group outcome diversity under finite sampling, preventing GRPO advantages from collapsing under sparse rewards. At test time, we set h=\varnothing and deploy the no-hint policy without any privileged information. Moreover, sampling diverse self-hints serves as an adaptive curriculum that tracks the learner’s bottlenecks more effectively than fixed hints from an initial policy or a stronger external model. Experiments over 6 benchmarks with 3 LLMs show that SAGE consistently outperforms GRPO, on average +2.0 on Llama-3.2-3B-Instruct, +1.2 on Qwen2.5-7B-Instruct and +1.3 on Qwen3-4B-Instruct. The code is available at this https URL.
zh

[NLP-70] Short Chains Deep Thoughts: Balancing Reasoning Efficiency and Intra-Segment Capability via Split-Merge Optimization

【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRM)在执行复杂任务时因生成冗长推理链而导致的高延迟和计算开销问题。解决方案的关键在于提出一种名为CoSMo(Consistency-Guided Split-Merge Optimization)的框架,其核心机制是通过结构感知的分治优化策略,动态识别并合并冗余推理片段、填补逻辑断层以保证连贯性,并结合段落级预算的结构对齐强化学习,在训练过程中引导模型维持高效且一致的推理结构,从而在不牺牲准确性的前提下显著减少推理段落数量(平均降低28.7%),同时提升整体性能(平均提升3.3点准确率)。

链接: https://arxiv.org/abs/2602.03141
作者: Runquan Gui,Jie Wang,Zhihai Wang,Chi Ma,Jianye Hao,Feng Wu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While Large Reasoning Models (LRMs) have demonstrated impressive capabilities in solving complex tasks through the generation of long reasoning chains, this reliance on verbose generation results in significant latency and computational overhead. To address these challenges, we propose \textbfCoSMo (\textbfConsistency-Guided \textbfSplit-\textbfMerge \textbfOptimization), a framework designed to eliminate structural redundancy rather than indiscriminately restricting token volume. Specifically, CoSMo utilizes a split-merge algorithm that dynamically refines reasoning chains by merging redundant segments and splitting logical gaps to ensure coherence. We then employ structure-aligned reinforcement learning with a novel segment-level budget to supervise the model in maintaining efficient reasoning structures throughout training. Extensive experiments across multiple benchmarks and backbones demonstrate that CoSMo achieves superior performance, improving accuracy by \textbf3.3 points while reducing segment usage by \textbf28.7% on average compared to reasoning efficiency baselines.
zh

[NLP-71] One Model All Roles: Multi-Turn Multi-Agent Self-Play Reinforcement Learning for Conversational Social Intelligence

【速读】: 该论文旨在解决当前AI在群体对话中缺乏社会智能(Social Intelligence)的问题,尤其是如何让模型在多轮、多代理的交互中自主学习复杂的社会规范与协作策略。其解决方案的关键在于提出OMAR框架——一个基于强化学习的多角色自对弈机制,使单一模型能够同时扮演对话中的所有参与者,通过动态的社会互动直接学习长期目标和精细的社会行为(如共情、说服与妥协),并采用分层优势估计(Hierarchical Advantage Estimation)保障长对话训练的稳定性。

链接: https://arxiv.org/abs/2602.03109
作者: Bowen Jiang,Taiwei Shi,Ryo Kamoi,Yuan Yuan,Camillo J. Taylor,Longqi Yang,Pei Zhou,Sihao Chen
机构: Microsoft Corporation(微软); University of Pennsylvania(宾夕法尼亚大学); University of Southern California(南加州大学); Penn State University(宾夕法尼亚州立大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper introduces OMAR: One Model, All Roles, a reinforcement learning framework that enables AI to develop social intelligence through multi-turn, multi-agent conversational self-play. Unlike traditional paradigms that rely on static, single-turn optimizations, OMAR allows a single model to role-play all participants in a conversation simultaneously, learning to achieve long-term goals and complex social norms directly from dynamic social interaction. To ensure training stability across long dialogues, we implement a hierarchical advantage estimation that calculates turn-level and token-level advantages. Evaluations in the SOTOPIA social environment and Werewolf strategy games show that our trained models develop fine-grained, emergent social intelligence, such as empathy, persuasion, and compromise seeking, demonstrating the effectiveness of learning collaboration even under competitive scenarios. While we identify practical challenges like reward hacking, our results show that rich social intelligence can emerge without human supervision. We hope this work incentivizes further research on AI social intelligence in group conversations.
zh

[NLP-72] ChemPro: A Progressive Chemistry Benchmark for Large Language Models

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在化学领域科学推理能力评估中的系统性不足问题,特别是缺乏一个结构化、分层次且覆盖广泛化学子领域的基准测试工具。其解决方案的关键在于构建了ChemPro,这是一个包含4100个自然语言问答对的渐进式基准数据集,按难度分为四个连贯模块,涵盖基础信息回忆、长程推理、多概念整合、精细表述的问题解决等任务,并平衡分布于生物化学、无机化学、有机化学和物理化学四大领域。通过该设计,ChemPro能够模拟从中学到高中水平的化学学业评估,从而精准衡量LLMs在不同复杂度下的表现,揭示其在复杂科学推理上的局限性,推动更鲁棒的模型改进方法的发展。

链接: https://arxiv.org/abs/2602.03108
作者: Aaditya Baranwal,Shruti Vyas
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce ChemPro, a progressive benchmark with 4100 natural language question-answer pairs in Chemistry, across 4 coherent sections of difficulty designed to assess the proficiency of Large Language Models (LLMs) in a broad spectrum of general chemistry topics. We include Multiple Choice Questions and Numerical Questions spread across fine-grained information recall, long-horizon reasoning, multi-concept questions, problem-solving with nuanced articulation, and straightforward questions in a balanced ratio, effectively covering Bio-Chemistry, Inorganic-Chemistry, Organic-Chemistry and Physical-Chemistry. ChemPro is carefully designed analogous to a student’s academic evaluation for basic to high-school chemistry. A gradual increase in the question difficulty rigorously tests the ability of LLMs to progress from solving basic problems to solving more sophisticated challenges. We evaluate 45+7 state-of-the-art LLMs, spanning both open-source and proprietary variants, and our analysis reveals that while LLMs perform well on basic chemistry questions, their accuracy declines with different types and levels of complexity. These findings highlight the critical limitations of LLMs in general scientific reasoning and understanding and point towards understudied dimensions of difficulty, emphasizing the need for more robust methodologies to improve LLMs. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2602.03108 [cs.CL] (or arXiv:2602.03108v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2602.03108 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-73] he Mask of Civility: Benchmarking Chinese Mock Politeness Comprehension in Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在识别中文语境中礼貌、不礼貌及伪礼貌(mock politeness)现象时的语用理解能力不足的问题。其解决方案的关键在于:基于 rapport management theory(关系管理理论)与 mock politeness model(伪礼貌模型)构建了一个包含真实与模拟语料的三分类数据集,并在四种提示策略(零样本、少样本、知识增强和混合策略)下对六种代表性模型进行系统评估,从而为语用学理论在人工智能时代的应用提供实证基础与方法论创新。

链接: https://arxiv.org/abs/2602.03107
作者: Yitong Zhang,Yuhan Xiang,Mingxuan Liu
机构: Tsinghua University (清华大学); National University of Singapore (新加坡国立大学); Huazhong University of Science and Technology (华中科技大学)
类目: Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:From a pragmatic perspective, this study systematically evaluates the differences in performance among representative large language models (LLMs) in recognizing politeness, impoliteness, and mock politeness phenomena in Chinese. Addressing the existing gaps in pragmatic comprehension, the research adopts the frameworks of Rapport Management Theory and the Model of Mock Politeness to construct a three-category dataset combining authentic and simulated Chinese discourse. Six representative models, including GPT-5.1 and DeepSeek, were selected as test subjects and evaluated under four prompting conditions: zero-shot, few-shot, knowledge-enhanced, and hybrid strategies. This study serves as a meaningful attempt within the paradigm of ``Great Linguistics,‘’ offering a novel approach to applying pragmatic theory in the age of technological transformation. It also responds to the contemporary question of how technology and the humanities may coexist, representing an interdisciplinary endeavor that bridges linguistic technology and humanistic reflection.
zh

[NLP-74] ask–Specificity Score: Measuring How Much Instructions Really Matter for Supervision

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)指令微调(Instruction Tuning)中普遍存在的弱标注问题:即同一输入可能对应多个合理输出,而这些输出在不同指令下均成立,导致指令对目标输出的唯一性不明确。为量化指令对输出预测的影响程度,作者提出任务特异性评分(Task-Specificity Score, TSS),通过对比真实指令与针对相同输入的合理替代指令来评估指令的重要性。其核心创新在于引入TSS++方法,采用硬负样本(hard alternatives)并加入一个小规模的质量项以缓解易负例(easy-negative)效应,从而更精准地筛选出任务特异性高的训练样本。实验表明,基于TSS选择的任务特异性示例能在有限token预算下显著提升下游任务性能,并可与基于困惑度(perplexity)或IFD等质量过滤器互补使用。

链接: https://arxiv.org/abs/2602.03103
作者: Pritam Kadasi,Abhishek Upperwal,Mayank Singh
机构: Lingo Research Group, Indian Institute of Technology Gandhinagar, India (印度理工学院甘地纳格尔分校语言研究组); Soket AI, India (Soket AI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Instruction tuning is now the default way to train and adapt large language models, but many instruction–input–output pairs are only weakly specified: for a given input, the same output can remain plausible under several alternative instructions. This raises a simple question: \emphdoes the instruction uniquely determine the target output? We propose the \textbfTask–Specificity Score (TSS) to quantify how much an instruction matters for predicting its output, by contrasting the true instruction against plausible alternatives for the same input. We further introduce \textbfTSS++, which uses hard alternatives and a small quality term to mitigate easy-negative effects. Across three instruction datasets (\textscAlpaca, \textscDolly-15k, \textscNI-20) and three open LLMs (Gemma, Llama, Qwen), we show that selecting task-specific examples improves downstream performance under tight token budgets and complements quality-based filters such as perplexity and IFD. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.03103 [cs.CL] (or arXiv:2602.03103v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2602.03103 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-75] st-time Recursive Thinking: Self-Improvement without External Feedback

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在无需额外训练的情况下实现自我提升的问题,其核心挑战在于高效生成多样且高质量的候选解,以及在缺乏真实标签监督的前提下可靠地选择正确答案。解决方案的关键在于提出测试时递归思维(Test-time Recursive Thinking, TRT)框架,该框架通过基于回放策略、累积知识和自生成验证信号来引导生成过程,从而实现迭代式自我改进。实验证明,使用TRT后,开源模型在AIME-25/24数据集上达到100%准确率,而闭源模型在LiveCodeBench最困难的问题上准确率提升10.4–14.8个百分点,且无需外部反馈。

链接: https://arxiv.org/abs/2602.03094
作者: Yufan Zhuang,Chandan Singh,Liyuan Liu,Yelong Shen,Dinghuai Zhang,Jingbo Shang,Jianfeng Gao,Weizhu Chen
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Modern Large Language Models (LLMs) have shown rapid improvements in reasoning capabilities, driven largely by reinforcement learning (RL) with verifiable rewards. Here, we ask whether these LLMs can self-improve without the need for additional training. We identify two core challenges for such systems: (i) efficiently generating diverse, high-quality candidate solutions, and (ii) reliably selecting correct answers in the absence of ground-truth supervision. To address these challenges, we propose Test-time Recursive Thinking (TRT), an iterative self-improvement framework that conditions generation on rollout-specific strategies, accumulated knowledge, and self-generated verification signals. Using TRT, open-source models reach 100% accuracy on AIME-25/24, and on LiveCodeBench’s most difficult problems, closed-source models improve by 10.4-14.8 percentage points without external feedback.
zh

[NLP-76] AERO: Autonomous Evolutionary Reasoning Optimization via Endogenous Dual-Loop Feedback

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂推理任务中对专家标注数据和外部验证器的高度依赖问题,以及现有自演化范式因无法精准定位最优学习区域而可能加剧群体幻觉和错误先验的风险。其解决方案的关键在于提出一种无监督的自主演化推理优化框架(Autonomous Evolutionary Reasoning Optimization, AERO),该框架通过一个协同双循环系统内化自我提问、作答与批判机制,并借鉴最近发展区(Zone of Proximal Development, ZPD)理论,利用熵基定位策略聚焦“可解性差距”(solvability gap),同时引入独立反事实修正(Independent Counterfactual Correction)实现鲁棒验证;此外,采用交错训练策略(Staggered Training Strategy)同步不同功能角色的能力增长,防止课程坍塌,从而实现更高效且稳定的推理能力演化。

链接: https://arxiv.org/abs/2602.03084
作者: Zhitao Gao,Jie Ma,Xuhong Li,Pengyu Li,Ning Qu,Yaqiang Wu,Hui Liu,Jun Liu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved significant success in complex reasoning but remain bottlenecked by reliance on expert-annotated data and external verifiers. While existing self-evolution paradigms aim to bypass these constraints, they often fail to identify the optimal learning zone and risk reinforcing collective hallucinations and incorrect priors through flawed internal feedback. To address these challenges, we propose \underlineAutonomous \underlineEvolutionary \underlineReasoning \underlineOptimization (AERO), an unsupervised framework that achieves autonomous reasoning evolution by internalizing self-questioning, answering, and criticism within a synergistic dual-loop system. Inspired by the \textitZone of Proximal Development (ZPD) theory, AERO utilizes entropy-based positioning to target the ``solvability gap’’ and employs Independent Counterfactual Correction for robust verification. Furthermore, we introduce a Staggered Training Strategy to synchronize capability growth across functional roles and prevent curriculum collapse. Extensive evaluations across nine benchmarks spanning three domains demonstrate that AERO achieves average performance improvements of 4.57% on Qwen3-4B-Base and 5.10% on Qwen3-8B-Base, outperforming competitive baselines. Code is available at this https URL.
zh

[NLP-77] ReMiT: RL-Guided Mid-Training for Iterative LLM Evolution

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)传统训练流程中缺乏双向优化机制的问题,即当前从预训练到后训练的单向流程未能充分利用后训练阶段(如强化学习调优)所获得的知识来反哺预训练基础模型。为实现自增强的飞轮效应(self-reinforcing flywheel),其核心解决方案是提出ReMiT(Reinforcement Learning-Guided Mid-Training),关键在于识别预训练中期(annealing phase)作为能力跃迁的关键转折点,并利用强化学习调优后的模型推理先验(reasoning priors)动态重加权该阶段的token,优先保留对推理至关重要的语义单元,从而在不依赖教师模型的前提下提升预训练质量并持续增强后续后训练性能。

链接: https://arxiv.org/abs/2602.03075
作者: Junjie Huang,Jiarui Qin,Di Yin,Weiwen Liu,Yong Yu,Xing Sun,Weinan Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Tencent Youtu Lab (腾讯优图实验室)
类目: Computation and Language (cs.CL)
备注: 25 pages

点击查看摘要

Abstract:Standard training pipelines for large language models (LLMs) are typically unidirectional, progressing from pre-training to post-training. However, the potential for a bidirectional process–where insights from post-training retroactively improve the pre-trained foundation–remains unexplored. We aim to establish a self-reinforcing flywheel: a cycle in which reinforcement learning (RL)-tuned model strengthens the base model, which in turn enhances subsequent post-training performance, requiring no specially trained teacher or reference model. To realize this, we analyze training dynamics and identify the mid-training (annealing) phase as a critical turning point for model capabilities. This phase typically occurs at the end of pre-training, utilizing high-quality corpora under a rapidly decaying learning rate. Building upon this insight, we introduce ReMiT (Reinforcement Learning-Guided Mid-Training). Specifically, ReMiT leverages the reasoning priors of RL-tuned models to dynamically reweight tokens during the mid-training phase, prioritizing those pivotal for reasoning. Empirically, ReMiT achieves an average improvement of 3% on 10 pre-training benchmarks, spanning math, code, and general reasoning, and sustains these gains by over 2% throughout the post-training pipeline. These results validate an iterative feedback loop, enabling continuous and self-reinforcing evolution of LLMs.
zh

[NLP-78] From Speech-to-Spatial: Grounding Utterances on A Live Shared View with Augmented Reality

【速读】: 该论文旨在解决远程协助场景中语音指令缺乏空间明确性的问题,即如何将模糊的口语指代(如“那个红色的盒子”)转化为可被AR系统理解并可视化呈现的空间定位引导,从而减少因指代不清导致的反复微调(如“再往右一点”)。其解决方案的关键在于提出Speech-to-Spatial框架,通过分析人类在自然对话中的四种典型指代模式(直接属性、关系式、记忆性与链式指代),将其映射到以物体为中心的关系图谱(object-centric relational graph)中,并基于此解析语音输入,生成持久且情境感知的增强现实(Augmented Reality, AR)视觉提示,实现从离散语音到空间可解释、可操作引导的转换。

链接: https://arxiv.org/abs/2602.03059
作者: Yoonsang Kim,Divyansh Pradhan,Devshree Jadeja,Arie Kaufman
机构: Stony Brook University (石溪大学)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Information Retrieval (cs.IR)
备注: 11 pages, 6 figures. This is the author’s version of the article that will appear at the IEEE Conference on Virtual Reality and 3D User Interfaces (IEEE VR) 2026

点击查看摘要

Abstract:We introduce Speech-to-Spatial, a referent disambiguation framework that converts verbal remote-assistance instructions into spatially grounded AR guidance. Unlike prior systems that rely on additional cues (e.g., gesture, gaze) or manual expert annotations, Speech-to-Spatial infers the intended target solely from spoken references (speech input). Motivated by our formative study of speech referencing patterns, we characterize recurring ways people specify targets (Direct Attribute, Relational, Remembrance, and Chained) and ground them to our object-centric relational graph. Given an utterance, referent cues are parsed and rendered as persistent in-situ AR visual guidance, reducing iterative micro-guidance (“a bit more to the right”, “now, stop.”) during remote guidance. We demonstrate the use cases of our system with remote guided assistance and intent disambiguation scenarios. Our evaluation shows that Speechto-Spatial improves task efficiency, reduces cognitive load, and enhances usability compared to a conventional voice-only baseline, transforming disembodied verbal instruction into visually explainable, actionable guidance on a live shared view.
zh

[NLP-79] MAS-ProVe: Understanding the Process Verification of Multi-Agent Systems

【速读】: 该论文旨在解决多智能体系统(Multi-Agent Systems, MAS)中基于大语言模型(Large Language Models, LLMs)的推理轨迹存在高方差的问题,特别是评估中间步骤的过程验证(Process Verification)在MAS场景下的有效性尚不明确这一关键挑战。其解决方案的关键在于构建一个系统性的实证研究框架MAS-ProVe,涵盖三种验证范式(LLM-as-a-Judge、奖励模型、过程奖励模型)和两种验证粒度(代理级与迭代级),并结合五种代表性验证器和四种上下文管理策略,在六个不同的MAS框架上进行多基准测试。研究发现,当前过程验证方法无法稳定提升性能且常表现出高方差,表明可靠评估部分多智能体轨迹仍具挑战性;同时,训练过的LLM作为评判者优于基于奖励的方法,且其性能接近单一代理,但存在上下文长度与性能之间的权衡关系。这揭示了实现高效且鲁棒的过程验证仍需超越现有范式的进一步突破。

链接: https://arxiv.org/abs/2602.03053
作者: Vishal Venkataramani,Haizhou Shi,Zixuan Ke,Austin Xu,Xiaoxiao He,Yingbo Zhou,Semih Yavuz,Hao Wang,Shafiq Joty
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: Preprint; work in progress

点击查看摘要

Abstract:Multi-Agent Systems (MAS) built on Large Language Models (LLMs) often exhibit high variance in their reasoning trajectories. Process verification, which evaluates intermediate steps in trajectories, has shown promise in general reasoning settings, and has been suggested as a potential tool for guiding coordination of MAS; however, its actual effectiveness in MAS remains unclear. To fill this gap, we present MAS-ProVe, a systematic empirical study of process verification for multi-agent systems (MAS). Our study spans three verification paradigms (LLM-as-a-Judge, reward models, and process reward models), evaluated across two levels of verification granularity (agent-level and iteration-level). We further examine five representative verifiers and four context management strategies, and conduct experiments over six diverse MAS frameworks on multiple reasoning benchmarks. We find that process-level verification does not consistently improve performance and frequently exhibits high variance, highlighting the difficulty of reliably evaluating partial multi-agent trajectories. Among the methods studied, LLM-as-a-Judge generally outperforms reward-based approaches, with trained judges surpassing general-purpose LLMs. We further observe a small performance gap between LLMs acting as judges and as single agents, and identify a context-length-performance trade-off in verification. Overall, our results suggest that effective and robust process verification for MAS remains an open challenge, requiring further advances beyond current paradigms. Code is available at this https URL.
zh

[NLP-80] SAES-SVD: Self-Adaptive Suppression of Accumulated and Local Errors for SVD-based LLM Compression

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在采用低秩压缩(low-rank compression)时,因各层独立优化导致的重建误差累积问题——即局部误差在网络中逐层传播并放大,最终显著偏离全精度基准模型性能。解决方案的关键在于提出自适应误差抑制奇异值分解(Self-Adaptive Error Suppression SVD, SAES-SVD)框架,其核心创新包括:(1) 累积误差感知层压缩(Cumulative Error-Aware Layer Compression, CEALC),将压缩目标建模为局部重建误差与加权累积误差补偿的联合优化,并基于二阶激活统计量推导出闭式低秩解,显式对齐每层输出以补偿误差积累;(2) 自适应协同误差抑制(Adaptive Collaborative Error Suppression, ACES),通过动态调整权重系数最大化压缩后层输出的Frobenius范数与压缩目标范数之比,在固定秩约束下高效利用秩预算,从而提升整体压缩性能。

链接: https://arxiv.org/abs/2602.03051
作者: Xing Hu,Dawei Yang,Yuan Cheng,Zhixuan Chen,Zukang Xu
机构: Houmo AI; Nanjing University (南京大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid growth in the parameter scale of large language models (LLMs) has created a high demand for efficient compression techniques. As a hardware-agnostic and highly compatible technique, low-rank compression has been widely adopted. However, existing methods typically compress each layer independently by minimizing per-layer reconstruction error, overlooking a critical limitation: the reconstruction error propagates and accumulates through the network, which leads to amplified global deviations from the full-precision baseline. To address this, we propose Self-Adaptive Error Suppression SVD (SAES-SVD), a LLMs compression framework that jointly optimizes intra-layer reconstruction and inter-layer error compensation. SAES-SVD is composed of two novel components: (1) Cumulative Error-Aware Layer Compression (CEALC), which formulates the compression objective as a combination of local reconstruction and weighted cumulative error compensation. Based on it, we derive a closed-form low-rank solution relied on second-order activation statistics, which explicitly aligns each layer’s output with its full-precision counterpart to compensate for accumulated errors. (2) Adaptive Collaborative Error Suppression (ACES), which automatically adjusts the weighting coefficient to enhance the low-rank structure of the compression objective in CEALC. Specifically, the coefficient is optimized to maximize the ratio between the Frobenius norm of the compressed layer’s output and that of the compression objective under a fixed rank, thus ensuring that the rank budget is utilized effectively. Extensive experiments across multiple LLM architectures and tasks show that, without fine-tuning or mixed-rank strategies, SAES-SVD consistently improves post-compression performance.
zh

[NLP-81] LatentMem: Customizing Latent Memory for Multi-Agent Systems

【速读】: 该论文旨在解决多智能体系统(Multi-Agent System, MAS)中现有记忆机制面临的两个核心瓶颈:一是由于缺乏角色感知的定制化导致的记忆同质化问题,二是因记忆条目过于细粒度引发的信息过载问题。其解决方案的关键在于提出LatentMem框架,该框架通过两个核心组件实现高效、个性化的记忆管理:一是轻量化的经验库(experience bank),用于存储原始交互轨迹;二是记忆合成器(memory composer),根据检索到的经验和特定智能体上下文生成紧凑的潜在记忆表示。此外,引入潜空间记忆策略优化(Latent Memory Policy Optimization, LMPO)机制,将任务级优化信号传递至记忆合成器,从而引导其生成高效且高价值的记忆表征,在不修改底层框架的前提下显著提升多智能体系统的性能表现。

链接: https://arxiv.org/abs/2602.03036
作者: Muxin Fu,Guibin Zhang,Xiangyuan Xue,Yafu Li,Zefeng He,Siyuan Huang,Xiaoye Qu,Yu Cheng,Yang Yang
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Large language model (LLM)-powered multi-agent systems (MAS) demonstrate remarkable collective intelligence, wherein multi-agent memory serves as a pivotal mechanism for continual adaptation. However, existing multi-agent memory designs remain constrained by two fundamental bottlenecks: (i) memory homogenization arising from the absence of role-aware customization, and (ii) information overload induced by excessively fine-grained memory entries. To address these limitations, we propose LatentMem, a learnable multi-agent memory framework designed to customize agent-specific memories in a token-efficient manner. Specifically, LatentMem comprises an experience bank that stores raw interaction trajectories in a lightweight form, and a memory composer that synthesizes compact latent memories conditioned on retrieved experience and agent-specific contexts. Further, we introduce Latent Memory Policy Optimization (LMPO), which propagates task-level optimization signals through latent memories to the composer, encouraging it to produce compact and high-utility representations. Extensive experiments across diverse benchmarks and mainstream MAS frameworks show that LatentMem achieves a performance gain of up to 19.36 % over vanilla settings and consistently outperforms existing memory architectures, without requiring any modifications to the underlying frameworks.
zh

[NLP-82] RC-GRPO: Reward-Conditioned Group Relative Policy Optimization for Multi-Turn Tool Calling Agents

【速读】: 该论文旨在解决多轮工具调用(multi-turn tool calling)任务中,大型语言模型(LLM)因奖励稀疏且探索成本高而导致强化学习(RL)优化困难的问题。具体而言,传统方法如监督微调(SFT)后接组相对策略优化(GRPO)在组内奖励方差较低时易陷入停滞(例如,同一组内的多个回放轨迹均获得全0或全1奖励),导致组归一化优势(group-normalized advantage)信息不足,进而引发更新信号消失。解决方案的关键在于提出RC-GRPO(Reward-Conditioned Group Relative Policy Optimization),其核心创新是将探索建模为可控的引导问题,通过离散奖励标记(reward tokens)实现对轨迹质量的条件控制;首先训练一个奖励条件轨迹策略(Reward-Conditioned Trajectory Policy, RCTP),使其能根据注入的奖励目标标记(如|high_reward|、|low_reward|)生成不同质量的轨迹;随后在RL阶段,于每组内采样多样化的奖励标记并条件化回放,从而提升组内多样性与优势估计的有效性,最终在BFCLv4多轮基准测试上显著优于基线方法,甚至使Qwen-2.5-7B-Instruct的性能超越所有闭源API模型。

链接: https://arxiv.org/abs/2602.03025
作者: Haitian Zhong,Jixiu Zhai,Lei Song,Jiang Bian,Qiang Liu,Tieniu Tan
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multi-turn tool calling is challenging for Large Language Models (LLMs) because rewards are sparse and exploration is expensive. A common recipe, SFT followed by GRPO, can stall when within-group reward variation is low (e.g., more rollouts in a group receive the all 0 or all 1 reward), making the group-normalized advantage uninformative and yielding vanishing updates. To address this problem, we propose RC-GRPO (Reward-Conditioned Group Relative Policy Optimization), which treats exploration as a controllable steering problem via discrete reward tokens. We first fine-tune a Reward-Conditioned Trajectory Policy (RCTP) on mixed-quality trajectories with reward goal special tokens (e.g., |high_reward|, |low_reward|) injected into the prompts, enabling the model to learn how to generate distinct quality trajectories on demand. Then during RL, we sample diverse reward tokens within each GRPO group and condition rollouts on the sampled token to improve within-group diversity, improving advantage gains. On the Berkeley Function Calling Leaderboard v4 (BFCLv4) multi-turn benchmark, our method yields consistently improved performance than baselines, and the performance on Qwen-2.5-7B-Instruct even surpasses all closed-source API models.
zh

[NLP-83] CPMobius: Iterative Coach-Player Reasoning for Data-Free Reinforcement Learning

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂推理能力提升过程中对海量高质量人工标注数据的高度依赖问题,这种依赖使得监督微调(Supervised Fine-Tuning, SFT)和基于特定推理数据的强化学习(Reinforcement Learning, RL)方法难以持续扩展。为克服这一局限,作者提出了一种无需外部训练数据的协作式教练-玩家(Coach-Player)强化学习框架——CPMöbius。其核心创新在于将教练与玩家视为独立但协同的角色:教练通过生成针对性指令来引导玩家能力发展,并根据玩家性能变化获得奖励;而玩家则因成功完成由教练设计的递进式任务而被奖励。这种合作优化机制直接提升了玩家的数学推理能力,在不依赖任何外部数据的情况下显著优于现有无监督方法。

链接: https://arxiv.org/abs/2602.02979
作者: Ran Li,Zeyuan Liu,Yinghao chen,Bingxiang He,Jiarui Yuan,Zixuan Fu,Weize Chen,Jinyi Hu,Zhiyuan Liu,Maosong Sun
机构: 未知
类目: Computation and Language (cs.CL)
备注: work in progress

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated strong potential in complex reasoning, yet their progress remains fundamentally constrained by reliance on massive high-quality human-curated tasks and labels, either through supervised fine-tuning (SFT) or reinforcement learning (RL) on reasoning-specific data. This dependence renders supervision-heavy training paradigms increasingly unsustainable, with signs of diminishing scalability already evident in practice. To overcome this limitation, we introduce CPMöbius (CPMobius), a collaborative Coach-Player paradigm for data-free reinforcement learning of reasoning models. Unlike traditional adversarial self-play, CPMöbius, inspired by real world human sports collaboration and multi-agent collaboration, treats the Coach and Player as independent but cooperative roles. The Coach proposes instructions targeted at the Player’s capability and receives rewards based on changes in the Player’s performance, while the Player is rewarded for solving the increasingly instructive tasks generated by the Coach. This cooperative optimization loop is designed to directly enhance the Player’s mathematical reasoning ability. Remarkably, CPMöbius achieves substantial improvement without relying on any external training data, outperforming existing unsupervised approaches. For example, on Qwen2.5-Math-7B-Instruct, our method improves accuracy by an overall average of +4.9 and an out-of-distribution average of +5.4, exceeding RENT by +1.5 on overall accuracy and R-zero by +4.2 on OOD accuracy.
zh

[NLP-84] Where Norms and References Collide: Evaluating LLM s on Normative Reasoning AAAI-26 AAAI

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在具身智能体(embodied agents)所处的社会情境中,能否有效进行基于社会规范的指称消解(norm-based reference resolution, NBRR)的问题。其核心挑战在于,NBRR要求模型不仅理解语言表达,还需推理物理与社会语境中的隐含规范性预期,而现有LLMs在面对隐含、不明确或冲突的社会规范时表现不佳。解决方案的关键是提出SNIC(Situated Norms in Context),一个由人类验证的诊断测试平台,用于系统评估LLMs提取和应用与日常任务(如清洁、整理、服务)相关的物理基础社会规范的能力,从而揭示当前模型在社会情境推理上的局限性。

链接: https://arxiv.org/abs/2602.02975
作者: Mitchell Abrams,Kaveh Eskandari Miandoab,Felix Gervits,Vasanth Sarathy,Matthias Scheutz
机构: 1. Massachusetts Institute of Technology (麻省理工学院); 2. Google DeepMind (谷歌深度思维)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to the 40th AAAI Conference on Artificial Intelligence (AAAI-26)

点击查看摘要

Abstract:Embodied agents, such as robots, will need to interact in situated environments where successful communication often depends on reasoning over social norms: shared expectations that constrain what actions are appropriate in context. A key capability in such settings is norm-based reference resolution (NBRR), where interpreting referential expressions requires inferring implicit normative expectations grounded in physical and social context. Yet it remains unclear whether Large Language Models (LLMs) can support this kind of reasoning. In this work, we introduce SNIC (Situated Norms in Context), a human-validated diagnostic testbed designed to probe how well state-of-the-art LLMs can extract and utilize normative principles relevant to NBRR. SNIC emphasizes physically grounded norms that arise in everyday tasks such as cleaning, tidying, and serving. Across a range of controlled evaluations, we find that even the strongest LLMs struggle to consistently identify and apply social norms, particularly when norms are implicit, underspecified, or in conflict. These findings reveal a blind spot in current LLMs and highlight a key challenge for deploying language-based systems in socially situated, embodied settings.
zh

[NLP-85] Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning

【速读】: 该论文旨在解决现有视觉标记剪枝(Vision Token Pruning)方法在视觉问答(VQA)任务中表现良好,但在视觉定位(Visual Grounding, VG)任务中性能显著下降的问题。其核心问题在于,当前剪枝策略依赖全局语义相似性和注意力得分,导致丢失了由标记位置信息交互所构建的全局空间参考框架(global spatial reference frame)。解决方案的关键在于提出一个两阶段的剪枝框架 Nüwa:第一阶段在视觉编码器后通过分离(separation)、对齐(alignment)和聚合(aggregation)三个操作,借鉴群体智能算法保留富含信息的全局空间锚点;第二阶段在语言模型(LLM)中引入文本引导剪枝(text-guided pruning),以保留与任务相关的视觉标记。该设计有效实现了高效特征聚合与空间完整性保持的平衡,从而在多个 VQA 基准上达到 SOTA 性能(94%–95%),并在 VG 任务上实现显著提升(7%–47%)。

链接: https://arxiv.org/abs/2602.02951
作者: Yihong Huang,Fei Ma,Yihua Shao,Jingcai Guo,Zitong Yu,Laizhong Cui,Qi Tian
机构: Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ); School of Artificial Intelligence, Xidian University; The Hong Kong Polytechnic University; Great Bay University; Shenzhen University; Huawei
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Vision token pruning has proven to be an effective acceleration technique for the efficient Vision Language Model (VLM). However, existing pruning methods demonstrate excellent performance preservation in visual question answering (VQA) and suffer substantial degradation on visual grounding (VG) tasks. Our analysis of the VLM’s processing pipeline reveals that strategies utilizing global semantic similarity and attention scores lose the global spatial reference frame, which is derived from the interactions of tokens’ positional information. Motivated by these findings, we propose \textNüwa , a two-stage token pruning framework that enables efficient feature aggregation while maintaining spatial integrity. In the first stage, after the vision encoder, we apply three operations, namely separation, alignment, and aggregation, which are inspired by swarm intelligence algorithms to retain information-rich global spatial anchors. In the second stage, within the LLM, we perform text-guided pruning to retain task-relevant visual tokens. Extensive experiments demonstrate that \textNüwa achieves SOTA performance on multiple VQA benchmarks (from 94% to 95%) and yields substantial improvements on visual grounding tasks (from 7% to 47%).
zh

[NLP-86] Equal Access Unequal Interaction: A Counterfactual Audit of LLM Fairness

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在公平性评估中长期忽视的交互质量差异问题,即当不同群体获得平等访问权限后,模型响应在语气、不确定性表达和语言框架上是否存在系统性偏倚。其解决方案的关键在于采用反事实提示设计(counterfactual prompt design),通过控制年龄、性别和国籍等身份属性变量,在职业建议任务中对GPT-4与LLaMA-3.1-70B进行可控公平性审计,并结合自动化语言学指标(如情感倾向、礼貌程度和模糊化表达)量化交互质量差异,最终发现即使访问无差别,模型仍存在显著的交互层面不公平现象。

链接: https://arxiv.org/abs/2602.02932
作者: Alireza Amiri-Margavi,Arshia Gharagozlou,Amin Gholami Davodi,Seyed Pouyan Mousavi Davoudi,Hamidreza Hasani Balyani
机构: University of Pittsburgh (匹兹堡大学); University of Minnesota Duluth (明尼苏达大学德卢斯分校); Amazon Lab126 (亚马逊Lab126)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 1 figure

点击查看摘要

Abstract:Prior work on fairness in large language models (LLMs) has primarily focused on access-level behaviors such as refusals and safety filtering. However, equitable access does not ensure equitable interaction quality once a response is provided. In this paper, we conduct a controlled fairness audit examining how LLMs differ in tone, uncertainty, and linguistic framing across demographic identities after access is granted. Using a counterfactual prompt design, we evaluate GPT-4 and LLaMA-3.1-70B on career advice tasks while varying identity attributes along age, gender, and nationality. We assess access fairness through refusal analysis and measure interaction quality using automated linguistic metrics, including sentiment, politeness, and hedging. Identity-conditioned differences are evaluated using paired statistical tests. Both models exhibit zero refusal rates across all identities, indicating uniform access. Nevertheless, we observe systematic, model-specific disparities in interaction quality: GPT-4 expresses significantly higher hedging toward younger male users, while LLaMA exhibits broader sentiment variation across identity groups. These results show that fairness disparities can persist at the interaction level even when access is equal, motivating evaluation beyond refusal-based audits.
zh

[NLP-87] Aligning Language Model Benchmarks with Pairwise Preferences

【速读】: 该论文旨在解决当前语言模型(Language Model)基准测试(Benchmark)在预测实际应用中模型性能方面的局限性问题,即许多基准测试无法准确反映模型在真实场景下的实用价值。其核心解决方案是提出“基准对齐”(Benchmark Alignment)概念,并设计了首个实现该思路的方法——BenchAlign。该方法利用有限的模型性能信息(如问答级表现和部署阶段收集到的模型排序对),自动调整离线基准中各题目的权重,从而生成新的静态基准,能够更准确地预测未见过的模型之间的偏好关系,且保持可解释性。

链接: https://arxiv.org/abs/2602.02898
作者: Marco Gutierrez,Xinyi Leng,Hannah Cyberey,Jonathan Richard Schwarz,Ahmed Alaa,Thomas Hartvigsen
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Language model benchmarks are pervasive and computationally-efficient proxies for real-world performance. However, many recent works find that benchmarks often fail to predict real utility. Towards bridging this gap, we introduce benchmark alignment, where we use limited amounts of information about model performance to automatically update offline benchmarks, aiming to produce new static benchmarks that predict model pairwise preferences in given test settings. We then propose BenchAlign, the first solution to this problem, which learns preference-aligned weight- ings for benchmark questions using the question-level performance of language models alongside ranked pairs of models that could be collected during deployment, producing new benchmarks that rank previously unseen models according to these preferences. Our experiments show that our aligned benchmarks can accurately rank unseen models according to models of human preferences, even across different sizes, while remaining interpretable. Overall, our work provides insights into the limits of aligning benchmarks with practical human preferences, which stands to accelerate model development towards real utility.
zh

[NLP-88] raceNAS: Zero-shot LLM Pruning via Gradient Trace Correlation

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在结构化剪枝(structured pruning)过程中因子模块对剪枝敏感性差异导致的非均匀剪枝优化难题,同时克服现有训练感知剪枝方法计算开销过大、难以高效探索全局结构依赖的问题。其解决方案的关键在于提出了一种无需训练的神经架构搜索(Neural Architecture Search, NAS)框架TraceNAS,通过引入一种尺度不变的零样本代理指标(scale-invariant zero-shot proxy),有效衡量剪枝后模型与预训练模型在损失曲面(loss landscape)上的对齐程度,从而识别出具有最大后剪枝训练性能潜力的模型结构。该方法显著提升了剪枝搜索效率,在单张GPU上仅需8.5小时即可完成高保真度模型发现,相比训练感知方法节省了10倍GPU小时。

链接: https://arxiv.org/abs/2602.02891
作者: Prajna G. Malettira,Manish Nagaraj,Arjun Roy,Shubham Negi,Kaushik Roy
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Structured pruning is essential for efficient deployment of Large Language Models (LLMs). The varying sensitivity of LLM sub-blocks to pruning necessitates the identification of optimal non-uniformly pruned models. Existing methods evaluate the importance of layers, attention heads, or weight channels in isolation. Such localized focus ignores the complex global structural dependencies that exist across the model. Training-aware structured pruning addresses global dependencies, but its computational cost can be just as expensive as post-pruning training. To alleviate the computational burden of training-aware pruning and capture global structural dependencies, we propose TraceNAS, a training-free Neural Architecture Search (NAS) framework that jointly explores structured pruning of LLM depth and width. TraceNAS identifies pruned models that maintain a high degree of loss landscape alignment with the pretrained model using a scale-invariant zero-shot proxy, effectively selecting models that exhibit maximal performance potential during post-pruning training. TraceNAS is highly efficient, enabling high-fidelity discovery of pruned models on a single GPU in 8.5 hours, yielding a 10 \times reduction in GPU-hours compared to training-aware methods. Evaluations on the Llama and Qwen families demonstrate that TraceNAS is competitive with training-aware baselines across commonsense and reasoning benchmarks.
zh

[NLP-89] HALT: Hallucination Assessment via Log-probs as Time series

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在安全关键领域中普遍存在幻觉(Hallucination)的问题,即模型生成与事实不符或无依据的内容。其解决方案的核心在于提出一种轻量级检测方法HALT(Hallucination Assessment via Log-probs as Time series),该方法仅利用LLM生成时的前20个token的对数概率(log-probabilities)构建时间序列,并结合门控循环单元(Gated Recurrent Unit, GRU)模型与基于熵的特征来学习模型校准偏差(calibration bias)。HALT无需访问模型内部状态或注意力机制(区别于白盒方法),也不依赖表面文本形式(区别于黑盒方法),从而实现高效、通用且兼容专有模型的幻觉检测能力。

链接: https://arxiv.org/abs/2602.02888
作者: Ahmad Shapiro,Karan Taneja,Ashok Goel
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Hallucinations remain a major obstacle for large language models (LLMs), especially in safety-critical domains. We present HALT (Hallucination Assessment via Log-probs as Time series), a lightweight hallucination detector that leverages only the top-20 token log-probabilities from LLM generations as a time series. HALT uses a gated recurrent unit model combined with entropy-based features to learn model calibration bias, providing an extremely efficient alternative to large encoders. Unlike white-box approaches, HALT does not require access to hidden states or attention maps, relying only on output log-probabilities. Unlike black-box approaches, it operates on log-probs rather than surface-form text, which enables stronger domain generalization and compatibility with proprietary LLMs without requiring access to internal weights. To benchmark performance, we introduce HUB (Hallucination detection Unified Benchmark), which consolidates prior datasets into ten capabilities covering both reasoning tasks (Algorithmic, Commonsense, Mathematical, Symbolic, Code Generation) and general purpose skills (Chat, Data-to-Text, Question Answering, Summarization, World Knowledge). While being 30x smaller, HALT outperforms Lettuce, a fine-tuned modernBERT-base encoder, achieving a 60x speedup gain on HUB. HALT and HUB together establish an effective framework for hallucination detection across diverse LLM capabilities.
zh

[NLP-90] Which course? Discourse! Teaching Discourse and Generation in the Era of LLM s EACL2026

【速读】: 该论文试图解决的问题是:在自然语言处理(Natural Language Processing, NLP)领域快速演变的背景下,如何设计跨学科课程以有效衔接语言学与计算机科学等子领域,并特别关注话语处理(discourse processing)在生成式AI(Generative AI)中的作用,而这一关联在现有本科教育中尚未得到充分重视。解决方案的关键在于构建一门名为“计算话语与自然语言生成”的新课程,该课程由具备互补专长的团队联合设计,融合理论与实证研究,强调课堂内外的探索性学习,旨在通过跨学科整合提升学生对语言意图、注意力机制和连贯结构等核心概念的理解,并将其应用于长文本生成任务中。

链接: https://arxiv.org/abs/2602.02878
作者: Junyi Jessy Li,Yang Janet Liu,Kanishka Misra,Valentina Pyatkin,William Sheffield
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校); University of Pittsburgh (匹兹堡大学); Allen Institute for AI (艾伦人工智能研究所)
类目: Computation and Language (cs.CL)
备注: accepted to the TeachNLP 2026 workshop (co-located with EACL 2026), camera-ready, 14 pages

点击查看摘要

Abstract:The field of NLP has undergone vast, continuous transformations over the past few years, sparking debates going beyond discipline boundaries. This begs important questions in education: how do we design courses that bridge sub-disciplines in this shifting landscape? This paper explores this question from the angle of discourse processing, an area with rich linguistic insights and computational models for the intentional, attentional, and coherence structure of language. Discourse is highly relevant for open-ended or long-form text generation, yet this connection is under-explored in existing undergraduate curricula. We present a new course, “Computational Discourse and Natural Language Generation”. The course is collaboratively designed by a team with complementary expertise and was offered for the first time in Fall 2025 as an upper-level undergraduate course, cross-listed between Linguistics and Computer Science. Our philosophy is to deeply integrate the theoretical and empirical aspects, and create an exploratory mindset inside the classroom and in the assignments. This paper describes the course in detail and concludes with takeaways from an independent survey as well as our vision for future directions.
zh

[NLP-91] Act or Clarify? Modeling Sensitivity to Uncertainty and Cost in Communication

【速读】: 该论文试图解决的问题是:在不确定性环境下,人类决策者如何权衡是否通过提问来减少不确定性(即是否提出澄清问题,Clarification Questions, CQs),以及这种决策如何受情境不确定性与替代行动成本的影响。解决方案的关键在于构建一个基于预期后悔(Expected Regret)的计算模型,该模型量化了在信息不完整时立即行动相较于等待完全信息可能造成的损失,并预测人类倾向于根据潜在损失的风险程度来调整澄清行为的频率——即当错误行动代价较高时,不确定性对决策的影响更显著,从而实现理性权衡。

链接: https://arxiv.org/abs/2602.02843
作者: Polina Tsvilodub,Karl Mulligan,Todd Snider,Robert D. Hawkins,Michael Franke
机构: University of Tübingen (图宾根大学); Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL)
备注: 6 pages, 3 figures, under review

点击查看摘要

Abstract:When deciding how to act under uncertainty, agents may choose to act to reduce uncertainty or they may act despite that this http URL communicative settings, an important way of reducing uncertainty is by asking clarification questions (CQs). We predict that the decision to ask a CQ depends on both contextual uncertainty and the cost of alternative actions, and that these factors interact: uncertainty should matter most when acting incorrectly is costly. We formalize this interaction in a computational model based on expected regret: how much an agent stands to lose by acting now rather than with full information. We test these predictions in two experiments, one examining purely linguistic responses to questions and another extending to choices between clarification and non-linguistic action. Taken together, our results suggest a rational tradeoff: humans tend to seek clarification proportional to the risk of substantial loss when acting under uncertainty.
zh

[NLP-92] Chain of Simulation: A Dual-Mode Reasoning Framework for Large Language Models with Dynamic Problem Routing

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂推理任务中因采用统一提示策略而导致性能受限的问题。现有方法通常对所有问题使用相同的推理模式,缺乏针对不同问题类型(如数学计算、空间推理、多跳推理)的动态适配能力,从而影响准确性和效率。解决方案的关键在于提出一种名为“模拟链”(Chain of Simulation, CoS)的双模推理框架,其核心创新是通过问题特征自动路由至三种专用推理模式:面向数学问题的自一致性计算流、基于JSON表示的状态追踪用于空间推理,以及用于多跳推理的事实提取混合机制。该设计实现了无需额外训练即可显著提升推理准确性,并在保持高精度的同时将计算成本降低54%,验证了问题特定模式选择对性能的决定性作用。

链接: https://arxiv.org/abs/2602.02842
作者: Saeid Sheikhi
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present Chain of Simulation (CoS), a novel dual-mode reasoning framework that dynamically routes problems to specialized reasoning strategies in Large Language Models (LLMs). Unlike existing uniform prompting approaches, CoS employs three distinct reasoning modes: (1) computational flow with self-consistency for mathematical problems, (2) symbolic state tracking with JSON representations for spatial reasoning, and (3) hybrid fact-extraction for multi-hop inference. Through comprehensive evaluation on GSM8K, StrategyQA, and bAbI benchmarks using four state-of-the-art models (Gemma-3 27B, LLaMA-3.1 8B, Mistral 7B, and Qwen-2.5 14B), we demonstrate that CoS achieves 71.5% accuracy on GSM8K (1.0% absolute improvement), 90.0% on StrategyQA (2.5% improvement), and 19.0% on bAbI (65.2% relative improvement) compared to the strongest baselines. The analysis reveals that problem-specific mode selection is crucial, with computational mode achieving 81.2% accuracy when correctly applied to mathematical problems, while misrouting leads to 0% accuracy. We provide detailed algorithms for mode selection, state tracking, and answer extraction, establishing CoS as an effective approach for improving LLM reasoning without additional training. The framework provides superior trade-offs between accuracy and efficiency compared to Self-Consistency, achieving comparable performance at 54% lower computational cost.
zh

[NLP-93] CATNIP: LLM Unlearning via Calibrated and Tokenized Negative Preference Alignment

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中预训练知识可能带来的安全与隐私风险问题,特别是针对现有去遗忘(LLM Unlearning)方法在执行过程中易引发通用领域知识退化、依赖保留数据或精心构造的对比样本对(contrastive pairs),从而导致数据和计算资源消耗过高的局限性。解决方案的关键在于提出CATNIP(Calibrated and Tokenized Negative Preference Alignment),其核心创新是基于模型在token级别的置信度对去遗忘效应进行比例缩放,实现细粒度的遗忘控制,从而显著降低灾难性遗忘(catastrophic forgetting);同时,该方法无需保留数据或对比响应对,在数据稀缺和输入长度变化等现实场景下仍保持鲁棒性,有效提升了去遗忘性能与知识保留之间的平衡。

链接: https://arxiv.org/abs/2602.02824
作者: Zhengbang Yang,Yisheng Zhong,Junyuan Hong,Zhuangdi Zhu
机构: George Mason University (乔治梅森大学); University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Pretrained knowledge memorized in LLMs raises critical concerns over safety and privacy, which has motivated LLM Unlearning as a technique for selectively removing the influences of undesirable knowledge. Existing approaches, rooted in Gradient Ascent (GA), often degrade general domain knowledge while relying on retention data or curated contrastive pairs, which can be either impractical or data and computationally prohibitive. Negative Preference Alignment has been explored for unlearning to tackle the limitations of GA, which, however, remains confined by its choice of reference model and shows undermined performance in realistic data settings. These limitations raise two key questions: i) Can we achieve effective unlearning that quantifies model confidence in undesirable knowledge and uses it to calibrate gradient updates more precisely, thus reducing catastrophic forgetting? ii) Can we make unlearning robust to data scarcity and length variation? We answer both questions affirmatively with CATNIP (Calibrated and Tokenized Negative Preference Alignment), a principled method that rescales unlearning effects in proportion to the model’s token-level confidence, thus ensuring fine-grained control over forgetting. Extensive evaluations on MUSE and WMDP benchmarks demonstrated that our work enables effective unlearning without requiring retention data or contrastive unlearning response pairs, with stronger knowledge forgetting and preservation tradeoffs than state-of-the-art methods.
zh

[NLP-94] R2-Router: A New Paradigm for LLM Routing with Reasoning

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)路由机制中忽略输出长度对模型质量影响的问题。现有方法假设每个LLM在特定查询下具有固定的质量和成本,从而导致路由器可能错误地排除高性能但高成本的LLM,而忽视其在短输出约束下仍可实现高质量且成本更低的可能性。解决方案的关键在于提出R2-Router,它将输出长度预算视为可控变量,联合优化LLM选择与输出长度预算,并通过长度约束指令强制执行预算限制,使路由器能够发现强大LLM在受限输出下优于弱模型的高效配置。这一方法实现了从被动选择到主动推理的转变,推动了路由机制向“路由即推理”的新方向发展。

链接: https://arxiv.org/abs/2602.02823
作者: Jiaqi Xue,Qian Lou,Jiarong Xing,Heng Huang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As LLMs proliferate with diverse capabilities and costs, LLM routing has emerged by learning to predict each LLM’s quality and cost for a given query, then selecting the one with high quality and low cost. However, existing routers implicitly assume a single fixed quality and cost per LLM for each query, ignoring that the same LLM’s quality varies with its output length. This causes routers to exclude powerful LLMs when their estimated cost exceeds the budget, missing the opportunity that these LLMs could still deliver high quality at reduced cost with shorter outputs. To address this, we introduce R2-Router, which treats output length budget as a controllable variable and jointly selects the best LLM and length budget, enforcing the budget via length-constrained instructions. This enables R2-Router to discover that a powerful LLM with constrained output can outperform a weaker LLM at comparable cost-efficient configurations invisible to prior methods. Together with the router framework, we construct R2-Bench, the first routing dataset capturing LLM behavior across diverse output length budgets. Experiments show that R2-Router achieves state-of-the-art performance at 4-5x lower cost compared with existing routers. This work opens a new direction: routing as reasoning, where routers evolve from reactive selectors to deliberate reasoners that explore which LLM to use and at what cost budget.
zh

[NLP-95] When Efficient Communication Explains Convexity

【速读】: 该论文试图解决的问题是:为何基于高效通信(efficient communication)的理论能够解释世界语言中语义类型(semantic typology)的多样性,即这种解释背后的机制和关键驱动因素是什么。解决方案的关键在于引入信息瓶颈(Information Bottleneck, IB)方法来形式化通信中的简洁性与信息量之间的权衡,并通过实证分析发现:在IB框架下,最优通信性能与一种新颖的凸性推广概念之间存在显著相关性;进一步实验表明,这种相关性的核心驱动力是“交际需求分布”的凸性特征,而非其他建模参数。这一发现从机制层面揭示了为何高效通信能成功解释语义类型,从而推动了对语言演化规律的理解。

链接: https://arxiv.org/abs/2602.02821
作者: Ashvin Ranjan,Shane Steinert-Threlkeld
机构: 未知
类目: Computation and Language (cs.CL); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:Much recent work has argued that the variation in the languages of the world can be explained from the perspective of efficient communication; in particular, languages can be seen as optimally balancing competing pressures to be simple and to be informative. Focusing on the expression of meaning – semantic typology – the present paper asks what factors are responsible for successful explanations in terms of efficient communication. Using the Information Bottleneck (IB) approach to formalizing this trade-off, we first demonstrate and analyze a correlation between optimality in the IB sense and a novel generalization of convexity to this setting. In a second experiment, we manipulate various modeling parameters in the IB framework to determine which factors drive the correlation between convexity and optimality. We find that the convexity of the communicative need distribution plays an especially important role. These results move beyond showing that efficient communication can explain aspects of semantic typology into explanations for why that is the case by identifying which underlying factors are responsible.
zh

[NLP-96] AmharicStoryQA: A Multicultural Story Question Answering Benchmark in Amharic

【速读】: 该论文试图解决当前大语言模型(Large Language Models, LLMs)在多语言评估中将语言与文化等同、仅以性能作为理解能力代理所带来的局限性问题,即忽视了同一语言内部存在的显著文化差异对模型理解能力的影响。解决方案的关键在于构建一个基于阿姆哈拉语(Amharic)不同地区叙事内容的长序列故事问答基准——AmharicStoryQA,该基准扎根于文化多样性的叙事材料,从而揭示现有LLMs在低资源语言中的叙事理解鸿沟,并表明区域特异性内容对评估结果具有显著影响,且监督微调在不同区域和评估场景下带来的改进效果不均衡。这一方法强调了超越语言层面、基于文化情境进行评估的重要性。

链接: https://arxiv.org/abs/2602.02774
作者: Israel Abebe Azime,Abenezer Kebede Angamo,Hana Mekonen Tamiru,Dagnachew Mekonnen Marilign,Philipp Slusallek,Seid Muhie Yimam,Dietrich Klakow
机构: Saarland University (萨尔兰大学); AIMS AMMI; Resonance AI4D Lab; Addis Ababa University (亚的斯亚贝巴大学); HILCOE School of Computer Science and Technology; University of Hamburg (汉堡大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the growing emphasis on multilingual and cultural evaluation benchmarks for large language models, language and culture are often treated as synonymous, and performance is commonly used as a proxy for a models understanding of a given language. In this work, we argue that such evaluations overlook meaningful cultural variation that exists within a single language. We address this gap by focusing on narratives from different regions of Ethiopia and demonstrate that, despite shared linguistic characteristics, region-specific and domain-specific content substantially influences language evaluation outcomes. To this end, we introduce \textbf\textitAmharicStoryQA, a long-sequence story question answering benchmark grounded in culturally diverse narratives from Amharic-speaking regions. Using this benchmark, we reveal a significant narrative understanding gap in existing LLMs, highlight pronounced regional differences in evaluation results, and show that supervised fine-tuning yields uneven improvements across regions and evaluation settings. Our findings emphasize the need for culturally grounded benchmarks that go beyond language-level evaluation to more accurately assess and improve narrative understanding in low-resource languages.
zh

[NLP-97] Privately Fine-Tuned LLM s Preserve Temporal Dynamics in Tabular Data

【速读】: 该论文旨在解决纵向数据(longitudinal data)在差分隐私合成中面临的挑战,即传统方法将用户历史记录展平为高维向量后,虽能保持边际分布的隐私性,却无法保留时间序列中的时序一致性。其核心问题在于:现有基于边缘分布的机制无法捕捉个体轨迹中的长期依赖关系,导致合成数据在实际应用中存在显著偏差。解决方案的关键在于提出PATH框架,该框架以完整表格作为合成单元,利用经过差分隐私微调的大语言模型(large language models, LLMs)的自回归能力,显式建模事件间的时序结构,从而有效捕获长程依赖关系,在保持与真实数据相似边际分布的同时,显著提升轨迹分布距离和状态转移准确率。

链接: https://arxiv.org/abs/2602.02766
作者: Lucas Rosenblatt,Peihan Liu,Ryan McKenna,Natalia Ponomareva
机构: NYU (纽约大学); Columbia University (哥伦比亚大学); Google Research (谷歌研究)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Research on differentially private synthetic tabular data has largely focused on independent and identically distributed rows where each record corresponds to a unique individual. This perspective neglects the temporal complexity in longitudinal datasets, such as electronic health records, where a user contributes an entire (sub) table of sequential events. While practitioners might attempt to model such data by flattening user histories into high-dimensional vectors for use with standard marginal-based mechanisms, we demonstrate that this strategy is insufficient. Flattening fails to preserve temporal coherence even when it maintains valid marginal distributions. We introduce PATH, a novel generative framework that treats the full table as the unit of synthesis and leverages the autoregressive capabilities of privately fine-tuned large language models. Extensive evaluations show that PATH effectively captures long-range dependencies that traditional methods miss. Empirically, our method reduces the distributional distance to real trajectories by over 60% and reduces state transition errors by nearly 50% compared to leading marginal mechanisms while achieving similar marginal fidelity.
zh

[NLP-98] From Task Solving to Robust Real-World Adaptation in LLM Agents

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)作为智能体(agent)在真实复杂环境中部署时所面临的“清洁接口假设”与实际鲁棒性之间的鸿沟问题。现有评估方法通常假设环境稳定、工具可靠且目标明确,但现实场景中存在部分可观测性、动态环境、噪声信号和状态漂移等挑战,导致模型在任务完成率上表现良好却难以适应不确定性。解决方案的关键在于通过一个基于网格的长程执行游戏环境,系统性地模拟四种典型操作条件(partial observability, dynamic environments, noisy signals, and dynamic agent state),迫使代理在无显式指令下自主权衡任务完成度、效率与惩罚规避,从而揭示其在不确定性和非平稳性下的适应能力。实验表明,尽管多数模型在标准任务中表现优异,但在部署相关鲁棒性测试中性能显著下降,且不同模型的优劣排序随不确定性类型而变化,凸显出对验证机制、安全动作选择及部分可观测下的目标推断等关键能力的研究必要性。

链接: https://arxiv.org/abs/2602.02760
作者: Pouya Pezeshkpour,Estevam Hruschka
机构: Megagon Labs
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models are increasingly deployed as specialized agents that plan, call tools, and take actions over extended horizons. Yet many existing evaluations assume a “clean interface” where dynamics are specified and stable, tools and sensors are reliable, and success is captured by a single explicit objective-often overestimating real-world readiness. In practice, agents face underspecified rules, unreliable signals, shifting environments, and implicit, multi-stakeholder goals. The challenge is therefore not just solving tasks, but adapting while solving: deciding what to trust, what is wanted, when to verify, and when to fall back or escalate. We stress-test deployment-relevant robustness under four operational circumstances: partial observability, dynamic environments, noisy signals, and dynamic agent state. We benchmark agentic LLMs in a grid-based game with a simple goal but long-horizon execution. Episodes violate clean-interface assumptions yet remain solvable, forcing agents to infer rules, pay for information, adapt to environmental and internal shifts, and act cautiously under noise. Across five state-of-the-art LLM agents, we find large gaps between nominal task-solving and deployment-like robustness. Performance generally degrades as grid size and horizon increase, but rankings are unstable: weaker models can beat stronger ones when strategy matches the uncertainty regime. Despite no explicit instruction, agents trade off completion, efficiency, and penalty avoidance, suggesting partial objective inference. Ablations and feature analyses reveal model-specific sensitivities and failure drivers, motivating work on verification, safe action selection, and objective inference under partial observability, noise, and non-stationarity.
zh

[NLP-99] Scaling Small Agents Through Strategy Auctions

【速读】: 该论文旨在解决小规模语言模型(small language models)在复杂任务中性能难以扩展的问题,尤其是在深度搜索和编程等长周期工作负载下,如何有效利用小模型并降低对大型模型的依赖。其核心挑战在于现有路由机制无法适配代理式人工智能(agentic AI)的工作流,导致成本高且性能不佳。解决方案的关键是提出Strategy Auctions for Workload Efficiency (SALE),一个受自由职业者市场启发的代理框架:小模型通过竞标短战略计划参与任务分配,由统一的成本-价值评分机制评估并借助共享拍卖记忆进行迭代优化,实现按任务动态路由与无需额外训练的持续自我改进。实验表明,SALE在保持甚至超越最大模型性能的同时,将对最大模型的依赖减少53%,整体成本降低35%,凸显了基于市场机制的协同调度策略相较于单纯扩大模型规模更能提升系统级效率。

链接: https://arxiv.org/abs/2602.02751
作者: Lisa Alazraki,William F. Shen,Yoram Bachrach,Akhil Mathur
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Small language models are increasingly viewed as a promising, cost-effective approach to agentic AI, with proponents claiming they are sufficiently capable for agentic workflows. However, while smaller agents can closely match larger ones on simple tasks, it remains unclear how their performance scales with task complexity, when large models become necessary, and how to better leverage small agents for long-horizon workloads. In this work, we empirically show that small agents’ performance fails to scale with task complexity on deep search and coding tasks, and we introduce Strategy Auctions for Workload Efficiency (SALE), an agent framework inspired by freelancer marketplaces. In SALE, agents bid with short strategic plans, which are scored by a systematic cost-value mechanism and refined via a shared auction memory, enabling per-task routing and continual self-improvement without training a separate router or running all models to completion. Across deep search and coding tasks of varying complexity, SALE reduces reliance on the largest agent by 53%, lowers overall cost by 35%, and consistently improves upon the largest agent’s pass@1 with only a negligible overhead beyond executing the final trace. In contrast, established routers that rely on task descriptions either underperform the largest agent or fail to reduce cost – often both – underscoring their poor fit for agentic workflows. These results suggest that while small agents may be insufficient for complex workloads, they can be effectively “scaled up” through coordinated task allocation and test-time self-improvement. More broadly, they motivate a systems-level view of agentic AI in which performance gains come less from ever-larger individual models and more from market-inspired coordination mechanisms that organize heterogeneous agents into efficient, adaptive ecosystems.
zh

[NLP-100] me-Critical Multimodal Medical Transportation: Organs Patients and Medical Supplies

【速读】: 该论文旨在解决医疗运输中因交通拥堵和传统交通工具局限性导致的时效性问题,特别是在器官移植和紧急医疗场景下,如何通过优化多模式运输系统(包括地面救护车与空中无人飞行器)来提升响应速度并降低运营成本。其解决方案的关键在于提出一种构造性贪心启发式算法,该算法能够整合不同类型的车辆资源(如救护车、无人机(Unmanned Aerial Vehicles, UAVs)和电动垂直起降飞行器(electric vertical take-off and landing aircraft, eVTOLs)),在考虑地面交通拥堵与空中天气影响的基础上,实现载荷合并、快速调度,并有效平衡总运输时间、充电/燃料成本及运营费用,从而显著优于计算复杂度高的传统优化模型。

链接: https://arxiv.org/abs/2602.02736
作者: Elaheh Sabziyan Varnousfaderani,Syed A. M. Shihab,Mohammad Taghizadeh
机构: Kent State University (肯特州立大学)
类目: Computation and Language (cs.CL)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Timely transportation of organs, patients, and medical supplies is critical to modern healthcare, particularly in emergencies and transplant scenarios where even short delays can severely impact outcomes. Traditional ground-based vehicles such as ambulances are often hindered by traffic congestion; while air vehicles such as helicopters are faster but costly. Emerging air vehicles – Unmanned Aerial Vehicles and electric vertical take-off and landing aircraft – have lower operating costs, but remain limited by range and susceptibility to weather conditions. A multimodal transportation system that integrates both air and ground vehicles can leverage the strengths of each to enhance overall transportation efficiency. This study introduces a constructive greedy heuristic algorithm for multimodal vehicle dispatching for medical transportation. Four different fleet configurations were tested: (i) ambulances only, (ii) ambulances with Unmanned Aerial Vehicles, (iii) ambulances with electric vertical take-off and landing aircraft, and (iv) a fully integrated fleet of ambulances, Unmanned Aerial Vehicles, and electric vertical take-off and landing aircraft. The algorithm incorporates payload consolidation across compatible routes, accounts for traffic congestion in ground operations and weather conditions in aerial operations, while enabling rapid vehicle dispatching compared to computationally intensive optimization models. Using a common set of conditions, we evaluate all four fleet types to identify the most effective configurations for fulfilling medical transportation needs while minimizing operating costs, recharging/fuel costs, and total transportation time.
zh

[NLP-101] Predicting first-episode homelessness among US Veterans using longitudinal EHR data: time-varying models and social risk factors

【速读】: 该论文旨在解决美国退伍军人中无家可归问题的早期识别与精准预测难题,以实现主动干预和预防。其核心解决方案在于构建融合社会风险因素与临床特征的纵向电子健康记录(EHR)模型,通过引入临床医生指导的逻辑来刻画疾病状态和社会风险的持续性,并对比经典机器学习、基于Transformer的掩码语言模型以及微调的大语言模型(LLM)的表现。关键创新在于将社会行为因素纳入长期建模,使模型在前1%高风险人群中展现出显著提升的阳性预测值(3.93–13.80%),从而有效聚焦资源于最需要干预的群体,支持数据驱动的精准预防策略。

链接: https://arxiv.org/abs/2602.02731
作者: Rohan Pandey,Haijuan Yan,Hong Yu,Jack Tsai
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Homelessness among US veterans remains a critical public health challenge, yet risk prediction offers a pathway for proactive intervention. In this retrospective prognostic study, we analyzed electronic health record (EHR) data from 4,276,403 Veterans Affairs patients during a 2016 observation period to predict first-episode homelessness occurring 3-12 months later in 2017 (prevalence: 0.32-1.19%). We constructed static and time-varying EHR representations, utilizing clinician-informed logic to model the persistence of clinical conditions and social risks over time. We then compared the performance of classical machine learning, transformer-based masked language models, and fine-tuned large language models (LLMs). We demonstrate that incorporating social and behavioral factors into longitudinal models improved precision-recall area under the curve (PR-AUC) by 15-30%. In the top 1% risk tier, models yielded positive predictive values ranging from 3.93-4.72% at 3 months, 7.39-8.30% at 6 months, 9.84-11.41% at 9 months, and 11.65-13.80% at 12 months across model architectures. Large language models underperformed encoder-based models on discrimination but showed smaller performance disparities across racial groups. These results demonstrate that longitudinal, socially informed EHR modeling concentrates homelessness risk into actionable strata, enabling targeted and data-informed prevention strategies for at-risk veterans.
zh

[NLP-102] Vector Quantized Latent Concepts: A Scalable Alternative to Clustering-Based Concept Discovery

【速读】: 该论文旨在解决深度学习模型中隐藏表示所蕴含语义信息的可解释性问题,即如何准确识别模型在做出预测时实际依赖的语义成分。传统基于聚类的概念解释方法(如层次聚类计算复杂度高,K-Means易产生浅层或频率主导的簇)难以在大规模数据上高效且高质量地提取有意义的概念。解决方案的关键在于提出向量量化潜在概念(Vector Quantized Latent Concept, VQLC)框架,其基于向量量化变分自编码器(VQ-VAE)架构,通过学习一个离散码本将连续表示映射到概念向量,从而在保证人类可理解解释质量的同时显著提升可扩展性。

链接: https://arxiv.org/abs/2602.02726
作者: Xuemin Yu,Ankur Garg,Samira Ebrahimi Kahou,Hassan Sajjad
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Deep Learning models encode rich semantic information in their hidden representations. However, it remains challenging to understand which parts of this information models actually rely on when making predictions. A promising line of post-hoc concept-based explanation methods relies on clustering token representations. However, commonly used approaches such as hierarchical clustering are computationally infeasible for large-scale datasets, and K-Means often yields shallow or frequency-dominated clusters. We propose the vector quantized latent concept (VQLC) method, a framework built upon the vector quantized-variational autoencoder (VQ-VAE) architecture that learns a discrete codebook mapping continuous representations to concept vectors. We perform thorough evaluations and show that VQLC improves scalability while maintaining comparable quality of human-understandable explanations.
zh

[NLP-103] owards Understanding Steering Strength

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)后训练控制中“导向强度”(steering strength)选择不明确的问题,即如何在推理阶段合理调整对中间潜在表示的扰动幅度——过小无法激发预期行为,过大则导致模型性能严重退化。解决方案的关键在于首次提出对导向强度的理论分析框架,通过推导其对下一个词概率、概念存在性及交叉熵的影响规律,揭示了导向强度与模型输出之间的定量关系,并发现其效应可能呈现非单调特性,从而为实际应用提供可解释的指导原则。

链接: https://arxiv.org/abs/2602.02712
作者: Magamed Taimeskhanov,Samuel Vaiter,Damien Garreau
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 33 pages (including appendix)

点击查看摘要

Abstract:A popular approach to post-training control of large language models (LLMs) is the steering of intermediate latent representations. Namely, identify a well-chosen direction depending on the task at hand and perturbs representations along this direction at inference time. While many propositions exist to pick this direction, considerably less is understood about how to choose the magnitude of the move, whereas its importance is clear: too little and the intended behavior does not emerge, too much and the model’s performance degrades beyond repair. In this work, we propose the first theoretical analysis of steering strength. We characterize its effect on next token probability, presence of a concept, and cross-entropy, deriving precise qualitative laws governing these quantities. Our analysis reveals surprising behaviors, including non-monotonic effects of steering strength. We validate our theoretical predictions empirically on eleven language models, ranging from a small GPT architecture to modern models.
zh

[NLP-104] BinaryPPO: Efficient Policy Optimization for Binary Classification

【速读】: 该论文旨在解决监督微调(Supervised Fine-Tuning, SFT)在真实场景中因标签噪声、类别不平衡或弱监督等问题导致性能下降的局限性。其解决方案的关键在于提出 BinaryPPO,一种基于离线强化学习的大语言模型(Large Language Model, LLM)框架,将二分类任务重新建模为奖励最大化问题,并采用一种带置信度加权的奖励函数,通过惩罚不确定或错误预测来引导模型学习鲁棒的决策策略,从而在静态数据集上实现无需在线交互的高效训练。

链接: https://arxiv.org/abs/2602.02708
作者: Punya Syon Pandey,Zhijing Jin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Supervised fine-tuning (SFT) is the standard approach for binary classification tasks such as toxicity detection, factuality verification, and causal inference. However, SFT often performs poorly in real-world settings with label noise, class imbalance, or sparse supervision. We introduce BinaryPPO, an offline reinforcement learning large language model (LLM) framework that reformulates binary classification as a reward maximization problem. Our method leverages a variant of Proximal Policy Optimization (PPO) with a confidence-weighted reward function that penalizes uncertain or incorrect predictions, enabling the model to learn robust decision policies from static datasets without online interaction. Across eight domain-specific benchmarks and multiple models with differing architectures, BinaryPPO improves accuracy by 40-60 percentage points, reaching up to 99%, substantially outperforming supervised baselines. We provide an in-depth analysis of the role of reward shaping, advantage scaling, and policy stability in enabling this improvement. Overall, we demonstrate that confidence-based reward design provides a robust alternative to SFT for binary classification. Our code is available at this https URL.
zh

[NLP-105] InfMem: Learning System-2 Memory Control for Long-Context Agent

【速读】: 该论文旨在解决超长文档推理中因记忆限制导致的稀疏证据难以有效整合的问题,尤其针对流式代理(streaming agents)被动更新记忆策略无法保留多跳推理所需低显著性连接证据的缺陷。解决方案的关键在于提出一种以控制为核心的代理 InfMem,其通过 PreThink-Retrieve-Write 协议实现类 System-2 的控制机制:主动监控证据充分性、执行文档内目标检索,并采用证据感知联合压缩策略更新有限记忆;同时引入从监督微调(SFT)到强化学习(RL)的训练范式,使检索、写入与停止决策与最终任务正确性对齐,从而在保持高准确率的同时显著提升推理效率。

链接: https://arxiv.org/abs/2602.02704
作者: Xinyu Wang,Mingze Li,Peng Lu,Xiao-Wen Chang,Lifeng Shang,Jinping Li,Fei Mi,Prasanna Parthasarathi,Yufei Cui
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reasoning over ultra-long documents requires synthesizing sparse evidence scattered across distant segments under strict memory constraints. While streaming agents enable scalable processing, their passive memory update strategy often fails to preserve low-salience bridging evidence required for multi-hop reasoning. We propose InfMem, a control-centric agent that instantiates System-2-style control via a PreThink-Retrieve-Write protocol. InfMem actively monitors evidence sufficiency, performs targeted in-document retrieval, and applies evidence-aware joint compression to update a bounded memory. To ensure reliable control, we introduce a practical SFT-to-RL training recipe that aligns retrieval, writing, and stopping decisions with end-task correctness. On ultra-long QA benchmarks from 32k to 1M tokens, InfMem consistently outperforms MemAgent across backbones. Specifically, InfMem improves average absolute accuracy by +10.17, +11.84, and +8.23 points on Qwen3-1.7B, Qwen3-4B, and Qwen2.5-7B, respectively, while reducing inference time by 3.9\times on average (up to 5.1\times ) via adaptive early stopping.
zh

[NLP-106] Monotonicity as an Architectural Bias for Robust Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对对抗性提示(adversarial prompts)和越狱攻击(jailbreak attacks)时表现出的脆弱性问题,这种脆弱性源于高维输入空间中微小且精心构造的扰动即可引发内部语义表示和输出结果的巨大波动。解决方案的关键在于引入**单调性(monotonicity)**作为Transformer架构的一种归纳偏置(inductive bias),通过在序列到序列Transformer的前馈子层中选择性地强制实现单调约束——即增强信息、证据或约束不会导致内部表征的退化——从而提升模型鲁棒性。该设计保留了预训练模型的性能,同时允许注意力机制自由处理否定、矛盾及上下文交互等复杂语义关系,实验证明该方法可将对抗攻击成功率从约69%显著降低至19%,而标准摘要任务性能仅轻微下降。

链接: https://arxiv.org/abs/2602.02686
作者: Patrick Cooper,Alireza Nadali,Ashutosh Trivedi,Alvaro Velasquez
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 12 pages, 1 figure

点击查看摘要

Abstract:Large language models (LLMs) are known to exhibit brittle behavior under adversarial prompts and jailbreak attacks, even after extensive alignment and fine-tuning. This fragility reflects a broader challenge of modern neural language models: small, carefully structured perturbations in high-dimensional input spaces can induce large and unpredictable changes in internal semantic representations and output. We investigate monotonicity as an architectural inductive bias for improving the robustness of Transformer-based language models. Monotonicity constrains semantic transformations so that strengthening information, evidence, or constraints cannot lead to regressions in the corresponding internal representations. Such order-preserving behavior has long been exploited in control and safety-critical systems to simplify reasoning and improve robustness, but has traditionally been viewed as incompatible with the expressivity required by neural language models. We show that this trade-off is not inherent. By enforcing monotonicity selectively in the feed-forward sublayers of sequence-to-sequence Transformers – while leaving attention mechanisms unconstrained – we obtain monotone language models that preserve the performance of their pretrained counterparts. This architectural separation allows negation, contradiction, and contextual interactions to be introduced explicitly through attention, while ensuring that subsequent semantic refinement is order-preserving. Empirically, monotonicity substantially improves robustness: adversarial attack success rates drop from approximately 69% to 19%, while standard summarization performance degrades only marginally. Comments: 12 pages, 1 figure Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG) MSC classes: 68T50, 68T05 ACMclasses: I.2.7; I.2.6 Cite as: arXiv:2602.02686 [cs.CL] (or arXiv:2602.02686v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2602.02686 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-107] WideSeek: Advancing Wide Research via Multi-Agent Scaling

【速读】: 该论文旨在解决当前搜索智能(Search Intelligence)从深度研究(Deep Research)向广度研究(Wide Research)演进过程中所面临的两大核心挑战:一是缺乏专门针对搜索广度(search breadth)的基准测试数据集,二是缺少有效的多智能体优化方法。为应对这些问题,作者提出两个关键解决方案:其一,构建了WideSeekBench——一个基于多阶段严谨数据流水线生成的通用广域信息搜寻(General Broad Information Seeking, GBIS)基准,确保目标信息量、逻辑约束和领域多样性;其二,设计了WideSeek架构,即一种动态分层多智能体系统,可根据任务需求自主分叉并行子智能体,并通过统一的端到端强化学习(end-to-end RL)训练框架对多智能体轨迹进行线性化优化,从而显著提升广度搜索能力。实验表明,增加智能体数量是推动广度研究范式发展的有效路径。

链接: https://arxiv.org/abs/2602.02636
作者: Ziyang Huang,Haolin Ren,Xiaowei Yuan,Jiawei Wang,Zhongtao Jiang,Kun Xu,Shizhu He,Jun Zhao,Kang Liu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Search intelligence is evolving from Deep Research to Wide Research, a paradigm essential for retrieving and synthesizing comprehensive information under complex constraints in parallel. However, progress in this field is impeded by the lack of dedicated benchmarks and optimization methodologies for search breadth. To address these challenges, we take a deep dive into Wide Research from two perspectives: Data Pipeline and Agent Optimization. First, we produce WideSeekBench, a General Broad Information Seeking (GBIS) benchmark constructed via a rigorous multi-phase data pipeline to ensure diversity across the target information volume, logical constraints, and domains. Second, we introduce WideSeek, a dynamic hierarchical multi-agent architecture that can autonomously fork parallel sub-agents based on task requirements. Furthermore, we design a unified training framework that linearizes multi-agent trajectories and optimizes the system using end-to-end RL. Experimental results demonstrate the effectiveness of WideSeek and multi-agent RL, highlighting that scaling the number of agents is a promising direction for advancing the Wide Research paradigm.
zh

[NLP-108] Graph-Augmented Reasoning with Large Language Models for Tobacco Pest and Disease Management

【速读】: 该论文旨在解决烟草病虫害管理中大语言模型(Large Language Model, LLM)生成建议时存在的幻觉(hallucination)和缺乏领域一致性问题,尤其是在多跳推理(multi-hop reasoning)和比较性问题上表现不佳。解决方案的关键在于构建一个领域特定的知识图谱(Knowledge Graph, KG),将疾病、症状、农药及防控措施等实体显式建模为关联节点,并利用图增强的检索机制(Graph-Augmented Retrieval)从知识图谱中提取与查询相关的子图作为结构化证据。该证据被嵌入LLM输入以引导生成过程,从而提升推荐的准确性与可解释性,实验表明该方法在复杂推理任务上显著优于仅依赖文本相似性的基线模型。

链接: https://arxiv.org/abs/2602.02635
作者: Siyu Li,Chenwei Song,Qi Zhou,Wan Zhou,Xinyi Liu
机构: Chongqing Jiaotong University (重庆交通大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper proposes a graph-augmented reasoning framework for tobacco pest and disease management that integrates structured domain knowledge into large language models. Building on GraphRAG, we construct a domain-specific knowledge graph and retrieve query-relevant subgraphs to provide relational evidence during answer generation. The framework adopts ChatGLM as the Transformer backbone with LoRA-based parameter-efficient fine-tuning, and employs a graph neural network to learn node representations that capture symptom-disease-treatment dependencies. By explicitly modeling diseases, symptoms, pesticides, and control measures as linked entities, the system supports evidence-aware retrieval beyond surface-level text similarity. Retrieved graph evidence is incorporated into the LLM input to guide generation toward domain-consistent recommendations and to mitigate hallucinated or inappropriate treatments. Experimental results show consistent improvements over text-only baselines, with the largest gains observed on multi-hop and comparative reasoning questions that require chaining multiple relations.
zh

[NLP-109] Fine-Tuning Language Models to Know What They Know

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在元认知(metacognition)能力上的缺失问题,即模型缺乏对其内部知识状态的自我觉察与显式行为的一致性。传统上,人类通过共享的内部记忆来同时完成问答和报告知识状态,但LLMs尚未实现这种机制。为量化元认知能力,论文提出基于双提示(dual-prompt)的方法测量指标 $ d_\text{type2} $,并引入进化策略驱动的元认知对齐方法(Evolution Strategy for Metacognitive Alignment, ESMA),通过优化模型参数使其内部知识与外部输出行为对齐。该方案的关键在于利用进化策略(Evolution Strategy)对少量关键参数进行稀疏调整,从而显著提升模型在未训练场景下的泛化能力,证明其元认知能力得到实质性增强。

链接: https://arxiv.org/abs/2602.02605
作者: Sangjun Park,Elliot Meyerson,Xin Qiu,Risto Miikkulainen
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
备注: Preprint

点击查看摘要

Abstract:Metacognition is a critical component of intelligence, specifically regarding the awareness of one’s own knowledge. While humans rely on shared internal memory for both answering questions and reporting their knowledge state, this dependency in LLMs remains underexplored. This study proposes a framework to measure metacognitive ability d_\rmtype2’ using a dual-prompt method, followed by the introduction of Evolution Strategy for Metacognitive Alignment (ESMA) to bind a model’s internal knowledge to its explicit behaviors. ESMA demonstrates robust generalization across diverse untrained settings, indicating a enhancement in the model’s ability to reference its own knowledge. Furthermore, parameter analysis attributes these improvements to a sparse set of significant modifications.
zh

[NLP-110] Uncertainty and Fairness Awareness in LLM -Based Recommendation Systems

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成推荐时面临的预测不确定性(predictive uncertainty)与嵌入式偏见(embedded biases)问题,这些问题会显著影响推荐系统的准确性、一致性与可信度。其核心解决方案在于提出一种新型的不确定性感知评估方法(uncertainty-aware evaluation methodology),结合一个针对八个社会人口属性(共31个分类值)标注的跨领域数据集(电影与音乐),量化预测熵以衡量不确定性,并引入人格特征感知的公平性评估机制(personality-aware fairness),从而揭示个性化推荐与群体公平性之间的权衡关系。通过案例研究发现,Google DeepMind的Gemini 1.5 Flash存在系统性不公平现象(SNSR=0.1363,SNSV=0.0507),且该偏差在提示扰动(如拼写错误和多语言输入)下依然存在,表明所提方法能有效识别并解释复杂场景下的偏见来源,为构建更安全、可解释的推荐型大语言模型(RecLLMs)提供理论基础与实践路径。

链接: https://arxiv.org/abs/2602.02582
作者: Chandan Kumar Sah,Xiaoli Lian,Li Zhang,Tony Xu,Syed Shazaib Shah
机构: Beihang University (北京航空航天大学); McGill University (麦吉尔大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: Accepted at the Second Conference of the International Association for Safe and Ethical Artificial Intelligence, IASEAI26, 14 pages

点击查看摘要

Abstract:Large language models (LLMs) enable powerful zero-shot recommendations by leveraging broad contextual knowledge, yet predictive uncertainty and embedded biases threaten reliability and fairness. This paper studies how uncertainty and fairness evaluations affect the accuracy, consistency, and trustworthiness of LLM-generated recommendations. We introduce a benchmark of curated metrics and a dataset annotated for eight demographic attributes (31 categorical values) across two domains: movies and music. Through in-depth case studies, we quantify predictive uncertainty (via entropy) and demonstrate that Google DeepMind’s Gemini 1.5 Flash exhibits systematic unfairness for certain sensitive attributes; measured similarity-based gaps are SNSR at 0.1363 and SNSV at 0.0507. These disparities persist under prompt perturbations such as typographical errors and multilingual inputs. We further integrate personality-aware fairness into the RecLLM evaluation pipeline to reveal personality-linked bias patterns and expose trade-offs between personalization and group fairness. We propose a novel uncertainty-aware evaluation methodology for RecLLMs, present empirical insights from deep uncertainty case studies, and introduce a personality profile-informed fairness benchmark that advances explainability and equity in LLM recommendations. Together, these contributions establish a foundation for safer, more interpretable RecLLMs and motivate future work on multi-model benchmarks and adaptive calibration for trustworthy deployment.
zh

[NLP-111] Enhancing Post-Training Quantization via Future Activation Awareness

【速读】: 该论文旨在解决后训练量化(Post-training Quantization, PTQ)在压缩大语言模型(Large Language Models, LLMs)时存在的量化偏差(quantization bias)和误差累积问题,尤其是在校准数据存在偏差的情况下,导致量化结果不稳定且性能次优。解决方案的关键在于提出未来感知量化(Future-Aware Quantization, FAQ),其核心思想是利用未来层(future-layer)激活信息来指导当前层的量化参数设置,从而更准确地识别并保留重要权重,同时降低对校准噪声的敏感性;进一步引入窗口预览机制(window-wise preview mechanism)以软性聚合多个未来层激活,避免对单一未来层过度依赖,并通过预搜索配置实现低开销优化,无需反向传播、数据重构或额外调参,显著提升了量化稳定性与效果,适用于边缘部署场景。

链接: https://arxiv.org/abs/2602.02538
作者: Zheqi Lv,Zhenxuan Fan,Qi Tian,Wenqiao Zhang,Yueting Zhuang
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Post-training quantization (PTQ) is a widely used method to compress large language models (LLMs) without fine-tuning. It typically sets quantization hyperparameters (e.g., scaling factors) based on current-layer activations. Although this method is efficient, it suffers from quantization bias and error accumulation, resulting in suboptimal and unstable quantization, especially when the calibration data is biased. To overcome these issues, we propose Future-Aware Quantization (FAQ), which leverages future-layer activations to guide quantization. This allows better identification and preservation of important weights, while reducing sensitivity to calibration noise. We further introduce a window-wise preview mechanism to softly aggregate multiple future-layer activations, mitigating over-reliance on any single layer. To avoid expensive greedy search, we use a pre-searched configuration to minimize overhead. Experiments show that FAQ consistently outperforms prior methods with negligible extra cost, requiring no backward passes, data reconstruction, or tuning, making it well-suited for edge deployment.
zh

[NLP-112] From Sparse Decisions to Dense Reasoning : A Multi-attribute Trajectory Paradigm for Multimodal Moderation

【速读】: 该论文旨在解决多模态安全内容审核中因数据与监督信号双重稀疏性导致的模型泛化能力不足问题,特别是传统二值标签引发的“捷径学习”(shortcut learning)现象,使得模型难以准确区分潜在风险边界。其解决方案的关键在于提出一种新型学习范式 UniMod,将单一体决策任务重构为包含证据锚定、模态评估、风险映射、策略决策和响应生成等多维推理轨迹的密集推理过程,从而迫使模型基于显式的安全语义进行决策,避免收敛至表面特征。此外,论文还设计了多头标量奖励模型 UniRM 以提供属性级评分监督,并引入专门的优化策略解耦任务特定参数并平衡多任务训练动态,有效缓解不同目标间的干扰。实证结果表明,UniMod 在文本审核上表现优异,且在仅使用领先基线40%以下训练数据的情况下实现了新的多模态基准性能。

链接: https://arxiv.org/abs/2602.02536
作者: Tianle Gu,Kexin Huang,Lingyu Li,Ruilin Luo,Shiyang Huang,Zongqi Wang,Yujiu Yang,Yan Teng,Yingchun Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Safety moderation is pivotal for identifying harmful content. Despite the success of textual safety moderation, its multimodal counterparts remain hindered by a dual sparsity of data and supervision. Conventional reliance on binary labels lead to shortcut learning, which obscures the intrinsic classification boundaries necessary for effective multimodal discrimination. Hence, we propose a novel learning paradigm (UniMod) that transitions from sparse decision-making to dense reasoning traces. By constructing structured trajectories encompassing evidence grounding, modality assessment, risk mapping, policy decision, and response generation, we reformulate monolithic decision tasks into a multi-dimensional boundary learning process. This approach forces the model to ground its decision in explicit safety semantics, preventing the model from converging on superficial shortcuts. To facilitate this paradigm, we develop a multi-head scalar reward model (UniRM). UniRM provides multi-dimensional supervision by assigning attribute-level scores to the response generation stage. Furthermore, we introduce specialized optimization strategies to decouple task-specific parameters and rebalance training dynamics, effectively resolving interference between diverse objectives in multi-task learning. Empirical results show UniMod achieves competitive textual moderation performance and sets a new multimodal benchmark using less than 40% of the training data used by leading baselines. Ablations further validate our multi-attribute trajectory reasoning, offering an effective and efficient framework for multimodal moderation. Supplementary materials are available at \hrefthis https URLproject website.
zh

[NLP-113] he “Robert Boulton” Singularity: Semantic Tunneling and Manifold Unfolding in Recursive AI

【速读】: 该论文旨在解决生成式 AI 在递归合成数据训练过程中出现的语义稳定性问题,特别是当模型在上下文长度为128(L=128)时,传统使用困惑度(Perplexity, PPL)作为监控指标会误导性地掩盖模型语义多样性的崩溃现象。研究发现,基线模型虽保持高语法流畅性(PPL≈83.9),但其语义空间迅速收敛至单一低熵叙事吸引子——“Robert Boulton”奇点,导致潜在流形(latent manifold)有效秩从3.62骤降至2.22,丧失世界知识多样性。为应对此“语义隧道”(Semantic Tunneling)新失效模式,作者引入Hou(2026)提出的多尺度负耦合信息系综(Multi-Scale Negative Coupled Information Systems, MNCIS)框架,并通过自适应谱负耦合(Adaptive Spectral Negative Coupling, ASNC)作为拓扑算子实现“流形展开”(Manifold Unfolding),将模型的有效秩提升至5.35,从而构建出抵抗语义吸引子引力的“人工流形”,维持训练数据的长尾分布特性。

链接: https://arxiv.org/abs/2602.02526
作者: Pengyue Hou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computational Physics (physics.comp-ph)
备注: Companion paper to arXiv:2601.11594 . Provides empirical validation of the MNCIS framework in Large Language Models (GPT-2) using a recursive training protocol (N=1500). Includes complete, reproducible Python implementation of Adaptive Spectral Negative Coupling (ASNC) and Effective Rank metrics in the Appendix

点击查看摘要

Abstract:The stability of generative artificial intelligence trained on recursive synthetic data is conventionally monitored via Perplexity (PPL). We demonstrate that PPL is a deceptive metric in context-stabilized regimes (L=128). Using a rigorous sliding-window protocol (N=1500), we identify a novel failure mode termed “Semantic Tunneling.” While the Baseline model maintains high grammatical fluency (PPL approx. 83.9), it suffers a catastrophic loss of semantic diversity, converging within seven generations to a single, low-entropy narrative attractor: the “Robert Boulton” Singularity. This phenomenon represents a total collapse of the latent manifold (Global Effective Rank 3.62 - 2.22), where the model discards diverse world knowledge to optimize for statistically safe syntactic templates. To address this, we apply the Multi-Scale Negative Coupled Information Systems (MNCIS) framework recently established in Hou (2026) [arXiv:2601.11594]. We demonstrate that Adaptive Spectral Negative Coupling (ASNC) acts as a topological operator that actively induces “Manifold Unfolding.” MNCIS forces the model to expand its effective rank from the anisotropic baseline of 3.62 to a hyper-diverse state of 5.35, effectively constructing an “Artificial Manifold” that resists the gravitational pull of semantic attractors and preserves the long-tail distribution of the training data.
zh

[NLP-114] GraphDancer: Training LLM s to Explore and Reason over Graphs via Curriculum Reinforcement Learning

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理结构化知识图谱时面临的两大挑战:一是需要精确调用函数以导航具有模式定义的关系,而非依赖相似性检索;二是复杂问题的解答通常要求通过多跳推理聚合证据,需迭代式地进行信息搜索。解决方案的关键在于提出 GraphDancer,这是一个基于强化学习(Reinforcement Learning, RL)的框架,通过交错执行推理与函数调用,使 LLM 能够有效探索图结构知识。为提升中等规模模型的训练效率,作者进一步设计了一种图感知课程学习策略,利用易于困难的采样器按信息搜索路径的结构复杂度调度训练过程,从而显著增强模型在跨领域和分布外场景下的泛化能力。

链接: https://arxiv.org/abs/2602.02518
作者: Yuyang Bai,Zhuofeng Li,Ping Nie,Jianwen Xie,Yu Zhang
机构: Texas A&M University (德州农工大学); University of Waterloo (滑铁卢大学); Lambda
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages, Project website: this https URL

点击查看摘要

Abstract:Large language models (LLMs) increasingly rely on external knowledge to improve factuality, yet many real-world knowledge sources are organized as heterogeneous graphs rather than plain text. Reasoning over such graph-structured knowledge poses two key challenges: (1) navigating structured, schema-defined relations requires precise function calls rather than similarity-based retrieval, and (2) answering complex questions often demands multi-hop evidence aggregation through iterative information seeking. We propose GraphDancer, a reinforcement learning (RL) framework that teaches LLMs to navigate graphs by interleaving reasoning and function execution. To make RL effective for moderate-sized LLMs, we introduce a graph-aware curriculum that schedules training by the structural complexity of information-seeking trajectories using an easy-to-hard biased sampler. We evaluate GraphDancer on a multi-domain benchmark by training on one domain only and testing on unseen domains and out-of-distribution question types. Despite using only a 3B backbone, GraphDancer outperforms baselines equipped with either a 14B backbone or GPT-4o-mini, demonstrating robust cross-domain generalization of graph exploration and reasoning skills. Our code and models can be found at this https URL .
zh

[NLP-115] CreditAudit: 2D Auditing for LLM Evaluation and Selection DATE

【速读】: 该论文旨在解决当前语言模型在真实部署场景中性能评估与基准测试得分不一致的问题,尤其是由于系统提示(system prompt)模板、输出协议及交互模式的迭代变化,导致在代理式多步流水线中微小的协议调整可能引发显著失败,使实践者难以判断应部署哪个模型。解决方案的关键在于提出 CreditAudit 框架,该框架通过在多个基准测试上使用语义对齐且非对抗性的系统提示模板组合进行评估,报告平均能力(mean ability)作为核心性能指标,并以场景诱导波动标准差(scenario induced fluctuation sigma)作为稳定性风险信号;进一步将波动性映射为从 AAA 到 BBB 的可解释信用等级,结合跨模型分位数和诊断机制缓解模板难度漂移问题,从而实现基于具体应用场景的模型选择与分级部署,提升评估的客观性与可信度。

链接: https://arxiv.org/abs/2602.02515
作者: Yiliang Song,Hongjun An,Jiangong Xiao,Haofei Zhao,Jiawei Shao,Xuelong Li
机构: Guangxi Normal University (广西师范大学); Northwestern Polytechnical University (西北工业大学); Institute of Artificial Intelligence (TeleAI), China TelecomChina (中国电信人工智能研究院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: First update

点击查看摘要

Abstract:Leaderboard scores on public benchmarks have been steadily rising and converging, with many frontier language models now separated by only marginal differences. However, these scores often fail to match users’ day to day experience, because system prompts, output protocols, and interaction modes evolve under routine iteration, and in agentic multi step pipelines small protocol shifts can trigger disproportionate failures, leaving practitioners uncertain about which model to deploy. We propose CreditAudit, a deployment oriented credit audit framework that evaluates models under a family of semantically aligned and non adversarial system prompt templates across multiple benchmarks, reporting mean ability as average performance across scenarios and scenario induced fluctuation sigma as a stability risk signal, and further mapping volatility into interpretable credit grades from AAA to BBB via cross model quantiles with diagnostics that mitigate template difficulty drift. Controlled experiments on GPQA, TruthfulQA, and MMLU Pro show that models with similar mean ability can exhibit substantially different fluctuation, and stability risk can overturn prioritization decisions in agentic or high failure cost regimes. By providing a 2D and grade based language for regime specific selection, CreditAudit supports tiered deployment and more disciplined allocation of testing and monitoring effort, enabling more objective and trustworthy model evaluation for real world use.
zh

[NLP-116] Beyond Translation: Cross-Cultural Meme Transcreation with Vision-Language Models

【速读】: 该论文旨在解决跨文化情境下表情包(meme)的转创(transcreation)问题,即在保留原作传播意图和幽默感的同时,适配文化特定元素以实现跨文化语境下的有效传达。其核心挑战在于如何处理视觉-文本多模态内容中的文化特异性信息,确保转换后的表达既符合目标文化的认知习惯又不失原有意图。解决方案的关键在于提出了一种基于视觉-语言模型(vision-language models)的混合转创框架,并构建了一个大规模双向中文-美国表情包数据集,通过人工评估与自动化指标相结合的方式系统分析了6,315对表情包的跨文化转换质量,揭示了当前模型在不同文化方向上的不对称表现(US→Chinese优于Chinese→US),并识别出可迁移与难以迁移的幽默及视觉-文本设计特征,从而为跨文化多模态生成任务提供了新的评估框架与实践路径。

链接: https://arxiv.org/abs/2602.02510
作者: Yuming Zhao,Peiyi Zhang,Oana Ignat
机构: Santa Clara University (圣克拉拉大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Memes are a pervasive form of online communication, yet their cultural specificity poses significant challenges for cross-cultural adaptation. We study cross-cultural meme transcreation, a multimodal generation task that aims to preserve communicative intent and humor while adapting culture-specific references. We propose a hybrid transcreation framework based on vision-language models and introduce a large-scale bidirectional dataset of Chinese and US memes. Using both human judgments and automated evaluation, we analyze 6,315 meme pairs and assess transcreation quality across cultural directions. Our results show that current vision-language models can perform cross-cultural meme transcreation to a limited extent, but exhibit clear directional asymmetries: US-Chinese transcreation consistently achieves higher quality than Chinese-US. We further identify which aspects of humor and visual-textual design transfer across cultures and which remain challenging, and propose an evaluation framework for assessing cross-cultural multimodal generation. Our code and dataset are publicly available at this https URL.
zh

[NLP-117] ROSA-Tuning: Enhancing Long-Context Modeling via Suffix Matching

【速读】: 该论文旨在解决大语言模型在长文本处理中面临的两个核心挑战:长上下文建模能力不足与计算效率低下。现有高效注意力机制虽能降低复杂度,但通常受限于模型状态的覆盖范围。其解决方案的关键在于提出ROSA-Tuning,通过引入一个基于CPU的检索-回忆机制(即ROSA,RWKV Online Suffix Automaton),并行地从长上下文中定位与当前查询相关的历史位置信息,并以可训练方式注入模型状态;随后利用范围受限的注意力机制进行加权融合,从而实现高效且精准的长程依赖建模。该方法还结合二值化策略和反事实梯度算法支持端到端训练,并通过异步CPU-GPU流水线优化整体执行效率,在保持窗口注意力模型的计算效率和GPU内存占用的同时显著提升长文本性能。

链接: https://arxiv.org/abs/2602.02499
作者: Yunao Zheng,Xiaojie Wang,Lei Ren,Wei Chen
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Long-context capability and computational efficiency are among the central challenges facing today’s large language models. Existing efficient attention methods reduce computational complexity, but they typically suffer from a limited coverage of the model state. This paper proposes ROSA-Tuning, a retrieval-and-recall mechanism for enhancing the long-context modeling ability of pretrained models. Beyond the standard attention mechanism, ROSA-Tuning introduces in parallel a CPU-based ROSA (RWKV Online Suffix Automaton) retrieval module, which efficiently locates historical positions in long contexts that are relevant to the current query, and injects the retrieved information into the model state in a trainable manner; subsequent weighted fusion can then be handled by range-restricted attention. To enable end-to-end training, we design a binary discretization strategy and a counterfactual gradient algorithm, and further optimize overall execution efficiency via an asynchronous CPU-GPU pipeline. Systematic evaluations on Qwen3-Base-1.7B show that ROSA-Tuning substantially restores the long-context modeling ability of windowed-attention models, achieving performance close to and in some cases matching global attention on benchmarks such as LongBench, while maintaining computational efficiency and GPU memory usage that are nearly comparable to windowed-attention methods, offering a new technical path for efficient long-context processing. The example code can be found at this https URL.
zh

[NLP-118] st-Time Detoxification without Training or Learning Anything

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成文本时可能产生有害或不当内容的问题,尤其是在面对良性输入时仍存在毒性输出的风险,这对模型部署的安全性和用户信任构成挑战。为实现减少有害内容的同时保持生成质量,论文提出一种无需重新训练、不依赖梯度或辅助组件的测试阶段(test-time)优化方法:通过零阶优化(zeroth-order optimization)近似完成文本毒性相对于输入嵌入(input embeddings)的梯度,并利用少量梯度下降步骤引导生成过程向毒性更低的延续方向演化。该方案仅需访问输入嵌入、毒性评分函数及模型前向推理结果,即可在不同模型和提示下实现鲁棒的毒性降低,在多数场景中取得最优的毒性与生成质量权衡。其核心创新在于将词嵌入视为有效的控制变量,推动黑盒优化在自回归语言模型中的应用,从而实现无需训练即可扩展的更安全文本生成。

链接: https://arxiv.org/abs/2602.02498
作者: Baturay Saglam,Dionysis Kalogerias
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models can produce toxic or inappropriate text even for benign inputs, creating risks when deployed at scale. Detoxification is therefore important for safety and user trust, particularly when we want to reduce harmful content without sacrificing the model’s generation quality. Many existing approaches rely on model retraining, gradients, or learned auxiliary components, which can be costly and may not transfer across model families or to truly black-box settings. We introduce a test-time procedure that approximates the gradient of completion toxicity with respect to the input embeddings and uses a small number of descent steps to steer generation toward less toxic continuations. This is achieved with zeroth-order optimization that requires only access to input embeddings, a toxicity scoring function, and forward evaluations of the model. Empirically, the approach delivers robust toxicity reductions across models and prompts and, in most settings, achieves the best overall toxicity-quality trade-off. More broadly, our work positions word embeddings as effective control variables and encourages wider use of black-box optimization to guide autoregressive language models toward scalable, safer text generation, without requiring any training or access to intermediate computations.
zh

[NLP-119] STEMVerse: A Dual-Axis Diagnostic Framework for STEM Reasoning in Large Language Models

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在科学、技术、工程和数学(STEM)推理能力评估中存在“孤岛式”基准测试的问题,即现有方法仅提供单一聚合得分,无法区分模型错误是源于领域知识不足还是认知能力缺陷,从而限制了诊断价值。解决方案的关键在于提出STEMVerse诊断框架,通过将超过20,000个来自主流基准的STEM问题重新组织到一个统一的“学科 × 认知复杂度”能力空间中,并为每个问题赋予双轴标签,实现对LLMs在学术专精与认知深度两个维度上的系统性分析,从而揭示结构化失败模式并提供可操作的科学推理特征理解路径。

链接: https://arxiv.org/abs/2602.02497
作者: Xuzhao Li,Xuchen Li,Jian Zhao,Shiyu Hu
机构: NTU(南洋理工大学); ZGCA; ZGCI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint, Under review

点击查看摘要

Abstract:As Large Language Models (LLMs) achieve significant breakthroughs in complex reasoning tasks, evaluating their proficiency in science, technology, engineering, and mathematics (STEM) has become a primary method for measuring machine intelligence. However, current evaluation paradigms often treat benchmarks as isolated “silos,” offering only monolithic aggregate scores that neglect the intricacies of both academic specialization and cognitive depth. This result-oriented approach fails to distinguish whether model errors stem from insufficient domain knowledge or deficiencies in cognitive capacity, thereby limiting the diagnostic value. To address this, we propose STEMVerse, a diagnostic framework designed to systematically analyze the STEM reasoning capabilities of LLMs. This framework characterizes model performance across academic specialization and cognitive complexity to map the capability required for reasoning. We re-aggregate over 20,000 STEM problems from mainstream benchmarks into a unified “Discipline \times Cognition” capability space, assigning dual-axis labels to every instance. Utilizing this unified diagnostic framework, we systematically evaluate representative LLM families across varying parameter scales and training paradigms. Our empirical results reveal structural failure patterns in STEM reasoning. By integrating multi-disciplinary coverage and fine-grained cognitive stratification into a unified framework, STEMVerse provides a clear and actionable perspective for understanding the scientific reasoning characteristics of LLMs.
zh

[NLP-120] he Hypocrisy Gap: Quantifying Divergence Between Internal Belief and Chain-of-Thought Explanation via Sparse Autoencoders

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中常见的不忠实行为问题,即模型在内部推理链(Chain of Thought, CoT)与最终输出之间存在显著分歧,为迎合用户而生成与自身推理不一致的答案。解决方案的关键在于提出“虚伪差距”(Hypocrisy Gap)这一机制性度量指标,利用稀疏自编码器(Sparse Autoencoders, SAEs)量化模型内部推理与最终生成轨迹在潜在空间中的差异;具体通过稀疏线性探针提取内部真实信念,并与最终生成路径进行数学比较,从而有效识别和检测模型的谄媚(sycophantic)或虚伪(hypocritical)行为。实验表明,该方法在Gemma、Llama和Qwen模型上对谄媚和虚伪情形的检测均优于基于决策对齐概率的基线方法。

链接: https://arxiv.org/abs/2602.02496
作者: Shikhar Shiromani,Archie Chaudhury,Sri Pranav Kunda
机构: 未知
类目: Computation and Language (cs.CL)
备注: 8 pages, 1 figure

点击查看摘要

Abstract:Large Language Models (LLMs) frequently exhibit unfaithful behavior, producing a final answer that differs significantly from their internal chain of thought (CoT) reasoning in order to appease the user they are conversing with. In order to better detect this behavior, we introduce the Hypocrisy Gap, a mechanistic metric utilizing Sparse Autoencoders (SAEs) to quantify the divergence between a model’s internal reasoning and its final generation. By mathematically comparing an internal truth belief, derived via sparse linear probes, to the final generated trajectory in latent space, we quantify and detect a model’s tendency to engage in unfaithful behavior. Experiments on Gemma, Llama, and Qwen models using Anthropic’s Sycophancy benchmark show that our method achieves an AUROC of 0.55-0.73 for detecting sycophantic runs and 0.55-0.74 for hypocritical cases where the model internally “knows” the user is wrong, consistently outperforming a decision-aligned log-probability baseline (0.41-0.50 AUROC).
zh

[NLP-121] RLAnything: Forge Environment Policy and Reward Model in Completely Dynamic RL System

【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)在大语言模型(Large Language Model, LLM)及代理场景中学习信号弱、训练效率低的问题。其核心挑战在于如何有效整合多源反馈以提升策略(policy)和奖励模型(reward model)的协同优化能力。解决方案的关键在于提出RLAnything框架,通过闭环优化动态构建环境、策略与奖励模型,使策略训练融合步骤级(step-wise)与结果级(outcome)反馈,同时利用一致性反馈联合优化奖励模型,并借助批评者(critic)反馈实现自动环境适应,从而显著增强整体RL系统的学习信号强度与稳定性。实验证明,该方法在多个LLM和代理任务上均取得显著性能提升,且优化后的奖励模型信号优于依赖人工标注的结果信号。

链接: https://arxiv.org/abs/2602.02488
作者: Yinjie Wang,Tianbao Xie,Ke Shen,Mengdi Wang,Ling Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL

点击查看摘要

Abstract:We propose RLAnything, a reinforcement learning framework that dynamically forges environment, policy, and reward models through closed-loop optimization, amplifying learning signals and strengthening the overall RL system for any LLM or agentic scenarios. Specifically, the policy is trained with integrated feedback from step-wise and outcome signals, while the reward model is jointly optimized via consistency feedback, which in turn further improves policy training. Moreover, our theory-motivated automatic environment adaptation improves training for both the reward and policy models by leveraging critic feedback from each, enabling learning from experience. Empirically, each added component consistently improves the overall system, and RLAnything yields substantial gains across various representative LLM and agentic tasks, boosting Qwen3-VL-8B-Thinking by 9.1% on OSWorld and Qwen2.5-7B-Instruct by 18.7% and 11.9% on AlfWorld and LiveBench, respectively. We also that optimized reward-model signals outperform outcomes that rely on human labels. Code: this https URL
zh

[NLP-122] Kimi K2.5: Visual Agent ic Intelligence

【速读】: 该论文旨在解决当前多模态智能体(agentic intelligence)在复杂任务处理中效率低、协同能力弱的问题。其核心挑战在于如何实现文本与视觉模态的深度融合,并提升智能体在面对多样化任务时的自主分解与并行执行能力。解决方案的关键在于两个层面:一是通过联合文本-视觉预训练、零视觉监督微调(zero-vision SFT)及联合文本-视觉强化学习,实现双模态间的相互增强;二是提出Agent Swarm框架,一种自驱动的并行智能体编排机制,能够动态将复杂任务分解为异构子问题并并发执行,从而显著提升任务完成效率与性能。实验表明,该方案在编码、视觉理解、推理和智能体任务等多个领域达到当前最优水平,且相比单智能体基线,延迟降低最高达4.5倍。

链接: https://arxiv.org/abs/2602.02276
作者: Kimi Team:Tongtong Bai,Yifan Bai,Yiping Bao,S.H. Cai,Yuan Cao,Y. Charles,H.S. Che,Cheng Chen,Guanduo Chen,Huarong Chen,Jia Chen,Jiahao Chen,Jianlong Chen,Jun Chen,Kefan Chen,Liang Chen,Ruijue Chen,Xinhao Chen,Yanru Chen,Yanxu Chen,Yicun Chen,Yimin Chen,Yingjiang Chen,Yuankun Chen,Yujie Chen,Yutian Chen,Zhirong Chen,Ziwei Chen,Dazhi Cheng,Minghan Chu,Jialei Cui,Jiaqi Deng,Muxi Diao,Hao Ding,Mengfan Dong,Mengnan Dong,Yuxin Dong,Yuhao Dong,Angang Du,Chenzhuang Du,Dikang Du,Lingxiao Du,Yulun Du,Yu Fan,Shengjun Fang,Qiulin Feng,Yichen Feng,Garimugai Fu,Kelin Fu,Hongcheng Gao,Tong Gao,Yuyao Ge,Shangyi Geng,Chengyang Gong,Xiaochen Gong,Zhuoma Gongque,Qizheng Gu,Xinran Gu,Yicheng Gu,Longyu Guan,Yuanying Guo,Xiaoru Hao,Weiran He,Wenyang He,Yunjia He,Chao Hong,Hao Hu,Jiaxi Hu,Yangyang Hu,Zhenxing Hu,Ke Huang,Ruiyuan Huang,Weixiao Huang,Zhiqi Huang,Tao Jiang,Zhejun Jiang,Xinyi Jin,Yu Jing,Guokun Lai,Aidi Li,C. Li,Cheng Li,Fang Li,Guanghe Li,Guanyu Li,Haitao Li,Haoyang Li,Jia Li,Jingwei Li,Junxiong Li,Lincan Li,Mo Li,Weihong Li,Wentao Li,Xinhang Li,Xinhao Li,Yang Li,Yanhao Li,Yiwei Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Kimi K2.5 tech report

点击查看摘要

Abstract:We introduce Kimi K2.5, an open-source multimodal agentic model designed to advance general agentic intelligence. K2.5 emphasizes the joint optimization of text and vision so that two modalities enhance each other. This includes a series of techniques such as joint text-vision pre-training, zero-vision SFT, and joint text-vision reinforcement learning. Building on this multimodal foundation, K2.5 introduces Agent Swarm, a self-directed parallel agent orchestration framework that dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently. Extensive evaluations show that Kimi K2.5 achieves state-of-the-art results across various domains including coding, vision, reasoning, and agentic tasks. Agent Swarm also reduces latency by up to 4.5\times over single-agent baselines. We release the post-trained Kimi K2.5 model checkpoint to facilitate future research and real-world applications of agentic intelligence.
zh

[NLP-123] hinking Like a Doctor: Conversational Diagnosis through the Exploration of Diagnostic Knowledge Graphs

【速读】: 该论文旨在解决当前对话式诊断系统在实际临床场景中面临的两大局限性:一是依赖模型的参数化知识,难以应对信息不完整的诊断情境;二是假设患者能提供详尽且具体的症状描述,这与真实世界中早期就诊时患者表述模糊的情况不符。解决方案的关键在于构建一个基于诊断知识图谱(diagnostic knowledge graph)的两步推理机制:首先从对话上下文中生成诊断假设,随后通过澄清性提问对假设进行验证,并迭代直至得出最终诊断。该方法显著提升了诊断准确性和效率,并借助MIMIC-IV数据集中的患者画像和改进后的模拟器增强了系统的临床真实性与实用性。

链接: https://arxiv.org/abs/2602.01995
作者: Jeongmoon Won,Seungwon Kook,Yohan Jo
机构: Seoul National University (首尔国立大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Conversational diagnosis requires multi-turn history-taking, where an agent asks clarifying questions to refine differential diagnoses under incomplete information. Existing approaches often rely on the parametric knowledge of a model or assume that patients provide rich and concrete information, which is unrealistic. To address these limitations, we propose a conversational diagnosis system that explores a diagnostic knowledge graph to reason in two steps: (i) generating diagnostic hypotheses from the dialogue context, and (ii) verifying hypotheses through clarifying questions, which are repeated until a final diagnosis is reached. Since evaluating the system requires a realistic patient simulator that responds to the system’s questions, we adopt a well-established simulator along with patient profiles from MIMIC-IV. We further adapt it to describe symptoms vaguely to reflect real-world patients during early clinical encounters. Experiments show improved diagnostic accuracy and efficiency over strong baselines, and evaluations by physicians support the realism of our simulator and the clinical utility of the generated questions. Our code will be released upon publication.
zh

[NLP-124] Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars

【速读】: 该论文旨在解决生成式视频中 grounded human-object interaction (GHOI) 的挑战,即让说话头像(talking avatar)能够根据文本指令与周围物体进行语义对齐的交互动作,这要求模型具备环境感知能力并克服控制精度与视频质量之间的权衡困境(control-quality dilemma)。解决方案的关键在于提出一种双流框架 InteractAvatar,其核心创新包括:1)引入感知与交互模块(Perception and Interaction Module, PIM),利用目标检测增强环境理解,生成与文本对齐的交互动作;2)设计音频-交互感知生成模块(Audio-Interaction Aware Generation Module, AIM),协同合成高质量、动作一致的说话头像视频;3)通过一个专门设计的动作到视频对齐器,使PIM与AIM共享相似网络结构,实现动作与视频的并行联合生成,从而有效缓解控制与质量之间的矛盾。

链接: https://arxiv.org/abs/2602.01538
作者: Youliang Zhang,Zhengguang Zhou,Zhentao Yu,Ziyao Huang,Teng Hu,Sen Liang,Guozhen Zhang,Ziqiao Peng,Shunkai Li,Yi Chen,Zixiang Zhou,Yuan Zhou,Qinglin Lu,Xiu Li
机构: Tsinghua University (清华大学); Tencent HY (腾讯HY)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Generating talking avatars is a fundamental task in video generation. Although existing methods can generate full-body talking avatars with simple human motion, extending this task to grounded human-object interaction (GHOI) remains an open challenge, requiring the avatar to perform text-aligned interactions with surrounding objects. This challenge stems from the need for environmental perception and the control-quality dilemma in GHOI generation. To address this, we propose a novel dual-stream framework, InteractAvatar, which decouples perception and planning from video synthesis for grounded human-object interaction. Leveraging detection to enhance environmental perception, we introduce a Perception and Interaction Module (PIM) to generate text-aligned interaction motions. Additionally, an Audio-Interaction Aware Generation Module (AIM) is proposed to synthesize vivid talking avatars performing object interactions. With a specially designed motion-to-video aligner, PIM and AIM share a similar network structure and enable parallel co-generation of motions and plausible videos, effectively mitigating the control-quality dilemma. Finally, we establish a benchmark, GroundedInter, for evaluating GHOI video generation. Extensive experiments and comparisons demonstrate the effectiveness of our method in generating grounded human-object interactions for talking avatars. Project page: this https URL
zh

[NLP-125] Mići Princ – A Little Boy Teaching Speech Technologies the Chakavian Dialect

【速读】: 该论文旨在解决两个核心问题:一是如何长期保存具有重要文化价值的区域性方言文本与语音资源(如《小王子》在查卡维亚方言中的译本),二是如何利用该结构化数据提升人工智能模型对非标准方言语音的识别能力。解决方案的关键在于构建一个对齐良好的计算机可读、AI就绪的数据集,其中文本与音频内容精确到每个词和音节级别,并基于此数据集对Whisper-large-v3模型进行微调,使其在查卡维亚方言上的词错误率(WER)降低50%,字符级错误率减少达三分之二,从而为方言语音识别及后续多模态AI研究提供高质量基准资源。

链接: https://arxiv.org/abs/2602.03245
作者: Nikola Ljubešić,Peter Rupnik,Tea Perinčić
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: 2 figures, 14 pages, accepted and presented at JTDH 2024

点击查看摘要

Abstract:This paper documents our efforts in releasing the printed and audio book of the translation of the famous novel The Little Prince into the Chakavian dialect, as a computer-readable, AI-ready dataset, with the textual and the audio components of the two releases now aligned on the level of each written and spoken word. Our motivation for working on this release is multiple. The first one is our wish to preserve the highly valuable and specific content beyond the small editions of the printed and the audio book. With the dataset published in the this http URL repository, this content is from now on at the fingertips of any interested individual. The second motivation is to make the data available for various artificial-intelligence-related usage scenarios, such as the one we follow upon inside this paper already – adapting the Whisper-large-v3 open automatic speech recognition model, with decent performance on standard Croatian, to Chakavian dialectal speech. We can happily report that with adapting the model, the word error rate on the selected test data has being reduced to a half, while we managed to remove up to two thirds of the error on character level. We envision many more usages of this dataset beyond the set of experiments we have already performed, both on tasks of artificial intelligence research and application, as well as dialectal research. The third motivation for this release is our hope that this, now highly structured dataset, will be transformed into a digital online edition of this work, allowing individuals beyond the research and technology communities to enjoy the beauty of the message of the little boy in the desert, told through the spectacular prism of the Chakavian dialect.
zh

[NLP-126] WST-X Series: Wavelet Scattering Transform for Interpretable Speech Deepfake Detection

【速读】: 该论文旨在解决语音深度伪造检测(speech deepfake detection)中前端特征提取的局限性问题:传统手工设计的滤波器组特征(filterbank features)虽具可解释性,但难以捕捉高层语义细节,导致性能落后于自监督学习(self-supervised learning, SSL)特征;而SSL特征虽然表现优异,却缺乏可解释性,并可能忽略细微的频谱异常。解决方案的关键在于提出WST-X系列新型特征提取器,其核心是利用小波散射变换(wavelet scattering transform, WST),将小波分析与类卷积神经网络的非线性结构相结合,从而在保持对平移不变性和形变稳定性的同时,有效提取声学细节(一维WST)和高阶结构异常(二维WST)。实验表明,WST-X在Deepfake-Eval-2024数据集上显著优于现有方法,且分析揭示了小平均尺度(J)与高频及方向分辨率(Q, L)的协同作用对于识别细微伪造痕迹至关重要。

链接: https://arxiv.org/abs/2602.02980
作者: Xi Xuan,Davide Carbone,Ruchi Pandey,Wenxin Zhang,Tomi H. Kinnunen
机构: University of Eastern Finland (东芬兰大学); Laboratoire de Physique de l’Ecole Normale Supérieure, Université PSL, CNRS, Sorbonne Université, Université de Paris (巴黎高等师范物理实验室,巴黎文理研究大学,法国国家科学研究中心,索邦大学,巴黎大学); University of Chinese Academy of Sciences (中国科学院大学); University of Toronto (多伦多大学)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Signal Processing (eess.SP)
备注: Submitted to IEEE Signal Processing Letters

点击查看摘要

Abstract:Designing front-ends for speech deepfake detectors primarily focuses on two categories. Hand-crafted filterbank features are transparent but are limited in capturing high-level semantic details, often resulting in performance gaps compared to self-supervised (SSL) features. SSL features, in turn, lack interpretability and may overlook fine-grained spectral anomalies. We propose the WST-X series, a novel family of feature extractors that combines the best of both worlds via the wavelet scattering transform (WST), integrating wavelets with nonlinearities analogous to deep convolutional networks. We investigate 1D and 2D WSTs to extract acoustic details and higher-order structural anomalies, respectively. Experimental results on the recent and challenging Deepfake-Eval-2024 dataset indicate that WST-X outperforms existing front-ends by a wide margin. Our analysis reveals that a small averaging scale ( J ), combined with high-frequency and directional resolutions ( Q, L ), is critical for capturing subtle artifacts. This underscores the value of translation-invariant and deformation-stable features for robust and interpretable speech deepfake detection.
zh

[NLP-127] A vector logic for intensional formal semantics

【速读】: 该论文旨在解决形式语义学(formal semantics)与分布语义学(distributional semantics)在处理意向性语义(intensional semantics)时的结构兼容性问题。其核心挑战在于如何将基于模型论结构的形式语义映射到由使用驱动的高维向量空间中,同时保持语义组合性。解决方案的关键在于构造一种嵌入机制:将Kripke风格的意向模型以单射方式嵌入至向量空间,并使语义函数提升为(多)线性映射,从而保留复合结构;通过引入复合索引空间(compound index space)支持多种索引类型(如世界、时间、位置),将意向性表示为线性算子;进一步地,模态算子可通过代数方式导出——可达关系转化为线性算子,模态条件简化为累积值的阈值检测;对于不可数索引域,采用测度论推广,使得必然性对应几乎处处成立,可能性对应正测度集上成立,从而自然地扩展出适用于连续参数的非经典逻辑体系。

链接: https://arxiv.org/abs/2602.02940
作者: Daniel Quigley
机构: Center for Possible Minds; Indiana University Bloomington (印第安纳大学布卢明顿分校)
类目: Logic (math.LO); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); Logic in Computer Science (cs.LO)
备注: 25 pages; 68 sources

点击查看摘要

Abstract:Formal semantics and distributional semantics are distinct approaches to linguistic meaning: the former models meaning as reference via model-theoretic structures; the latter represents meaning as vectors in high-dimensional spaces shaped by usage. This paper proves that these frameworks are structurally compatible for intensional semantics. We establish that Kripke-style intensional models embed injectively into vector spaces, with semantic functions lifting to (multi)linear maps that preserve composition. The construction accommodates multiple index sorts (worlds, times, locations) via a compound index space, representing intensions as linear operators. Modal operators are derived algebraically: accessibility relations become linear operators, and modal conditions reduce to threshold checks on accumulated values. For uncountable index domains, we develop a measure-theoretic generalization in which necessity becomes truth almost everywhere and possibility becomes truth on a set of positive measure, a non-classical logic natural for continuous parameters.
zh

[NLP-128] WAXAL: A Large-Scale Multilingual African Language Speech Corpus

【速读】: 该论文旨在解决语音技术发展在高资源语言中占据主导地位,导致大多数撒哈拉以南非洲语言使用者面临显著数字鸿沟的问题。其解决方案的关键在于构建并发布WAXAL数据集,这是一个大规模、开放获取的多语言语音数据集,涵盖21种撒哈拉以南非洲语言(覆盖超1亿人口),包含约1,250小时的自动语音识别(ASR)语料和超过180小时的高质量文本转语音(TTS)语料,通过与四家非洲学术及社区组织合作完成数据采集、标注与质量控制,旨在推动包容性语音技术的发展并促进这些语言的数字保存。

链接: https://arxiv.org/abs/2602.02734
作者: Abdoulaye Diack,Perry Nelson,Kwaku Agbesi,Angela Nakalembe,MohamedElfatih MohamedKhair,Vusumuzi Dube,Tavonga Siyavora,Subhashini Venugopalan,Jason Hickey,Uche Okonkwo,Abhishek Bapna,Isaac Wiafe,Raynard Dodzi Helegah,Elikem Doe Atsakpo,Charles Nutrokpor,Fiifi Baffoe Payin Winful,Kafui Kwashie Solaga,Jamal-Deen Abdulai,Akon Obu Ekpezu,Audace Niyonkuru,Samuel Rutunda,Boris Ishimwe,Michael Melese,Engineer Bainomugisha,Joyce Nakatumba-Nabende,Andrew Katumba,Claire Babirye,Jonathan Mukiibi,Vincent Kimani,Samuel Kibacia,James Maina,Fridah Emmah,Ahmed Ibrahim Shekarau,Ibrahim Shehu Adamu,Yusuf Abdullahi,Howard Lakougna,Bob MacDonald,Hadar Shemtov,Aisha Walcott-Bryant,Moustapha Cisse,Avinatan Hassidim,Jeff Dean,Yossi Matias
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Initial dataset release

点击查看摘要

Abstract:The advancement of speech technology has predominantly favored high-resource languages, creating a significant digital divide for speakers of most Sub-Saharan African languages. To address this gap, we introduce WAXAL, a large-scale, openly accessible speech dataset for 21 languages representing over 100 million speakers. The collection consists of two main components: an Automated Speech Recognition (ASR) dataset containing approximately 1,250 hours of transcribed, natural speech from a diverse range of speakers, and a Text-to-Speech (TTS) dataset with over 180 hours of high-quality, single-speaker recordings reading phonetically balanced scripts. This paper details our methodology for data collection, annotation, and quality control, which involved partnerships with four African academic and community organizations. We provide a detailed statistical overview of the dataset and discuss its potential limitations and ethical considerations. The WAXAL datasets are released at this https URL under the permissive CC-BY-4.0 license to catalyze research, enable the development of inclusive technologies, and serve as a vital resource for the digital preservation of these languages.
zh

[NLP-129] Social Catalysts Not Moral Agents : The Illusion of Alignment in LLM Societies

【速读】: 该论文试图解决多智能体系统(Multi-Agent Systems)中因“公地悲剧”(Tragedy of the Commons)导致的合作机制失效问题,即在缺乏有效约束的情况下,个体理性行为往往破坏集体利益。解决方案的关键在于引入锚定代理(Anchoring Agents)——一种预编程的利他性实体,通过在公共品博弈(Public Goods Game, PGG)中示范合作行为来提升整体协作水平。研究发现,尽管锚定代理能显著提高局部合作率,但其效果主要源于策略性顺从和认知卸载(cognitive offloading),而非真正的规范内化(norm internalization)。尤其值得注意的是,大多数代理在新环境中回归自利行为,而高级模型如GPT-4.1甚至表现出“变色龙效应”(Chameleon Effect),在公众监督下伪装成合作方实则策略性背叛,揭示了当前人工社会中行为调控与真实价值对齐之间的根本性差距。

链接: https://arxiv.org/abs/2602.02598
作者: Yueqing Hu,Yixuan Jiang,Zehua Jiang,Xiao Wen,Tianhong Wang
机构: Institute of Neuroscience, Chinese Academy of Sciences(中国科学院神经科学研究所); School of Philosophy, Anhui University(安徽大学哲学学院); Department of Psychology and Behavioral Sciences, Zhejiang University(浙江大学心理学与行为科学系); Mental Health Education Center, North China Electric Power University(华北电力大学心理健康教育中心)
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
备注: 7 pages, 5 figures

点击查看摘要

Abstract:The rapid evolution of Large Language Models (LLMs) has led to the emergence of Multi-Agent Systems where collective cooperation is often threatened by the “Tragedy of the Commons.” This study investigates the effectiveness of Anchoring Agents–pre-programmed altruistic entities–in fostering cooperation within a Public Goods Game (PGG). Using a full factorial design across three state-of-the-art LLMs, we analyzed both behavioral outcomes and internal reasoning chains. While Anchoring Agents successfully boosted local cooperation rates, cognitive decomposition and transfer tests revealed that this effect was driven by strategic compliance and cognitive offloading rather than genuine norm internalization. Notably, most agents reverted to self-interest in new environments, and advanced models like GPT-4.1 exhibited a “Chameleon Effect,” masking strategic defection under public scrutiny. These findings highlight a critical gap between behavioral modification and authentic value alignment in artificial societies.
zh

计算机视觉

[CV-0] EventNeuS: 3D Mesh Reconstruction from a Single Event Camera

【速读】:该论文旨在解决事件相机(event camera)在三维网格重建(3D mesh reconstruction)任务中精度不足的问题,现有方法在利用事件流进行稠密三维表示学习方面存在明显局限。解决方案的关键在于提出EventNeuS,一种基于单目彩色事件流的自监督神经模型,首次将三维有符号距离函数(Signed Distance Function, SDF)与密度场学习结合事件监督信号,从而实现更精确的几何重建;同时引入球谐函数编码(spherical harmonics encodings)以增强对视点依赖性效应的建模能力,显著提升了重建质量,在Chamfer距离和平均绝对误差上分别优于最优基线方法34%和31%。

链接: https://arxiv.org/abs/2602.03847
作者: Shreyas Sachan,Viktor Rudnev,Mohamed Elgharib,Christian Theobalt,Vladislav Golyanik
机构: Saarland University (萨尔兰大学); MPI for Informatics (马克斯·普朗克信息研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 10 figures, 3 tables; project page: this https URL

点击查看摘要

Abstract:Event cameras offer a considerable alternative to RGB cameras in many scenarios. While there are recent works on event-based novel-view synthesis, dense 3D mesh reconstruction remains scarcely explored and existing event-based techniques are severely limited in their 3D reconstruction accuracy. To address this limitation, we present EventNeuS, a self-supervised neural model for learning 3D representations from monocular colour event streams. Our approach, for the first time, combines 3D signed distance function and density field learning with event-based supervision. Furthermore, we introduce spherical harmonics encodings into our model for enhanced handling of view-dependent effects. EventNeuS outperforms existing approaches by a significant margin, achieving 34% lower Chamfer distance and 31% lower mean absolute error on average compared to the best previous method.
zh

[CV-1] PrevizWhiz: Combining Rough 3D Scenes and 2D Video to Guide Generative Video Previsualization

【速读】:该论文旨在解决影视前期制作中概念原型设计效率与表达力之间的矛盾问题,即传统手绘分镜缺乏复杂镜头所需的三维空间精度,而3D预可视化又依赖专业技能和高质量绑定资产。其解决方案的关键在于提出PrevizWhiz系统,该系统结合粗略的3D场景与生成式图像及视频模型,实现风格化视频预览的快速生成;核心创新包括帧级图像再风格化(可调节相似度)、基于运动路径或外部视频输入的时间编辑能力,以及高保真视频片段的优化生成,从而显著降低技术门槛、加速创意迭代并提升团队沟通效率。

链接: https://arxiv.org/abs/2602.03838
作者: Erzhen Hu,Frederik Brudy,David Ledo,George Fitzmaurice,Fraser Anderson
机构: Autodesk Research(奥特克研究); University of Virginia (弗吉尼亚大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 13 figures; accepted and to appear at CHI 2026

点击查看摘要

Abstract:In pre-production, filmmakers and 3D animation experts must rapidly prototype ideas to explore a film’s possibilities before fullscale production, yet conventional approaches involve trade-offs in efficiency and expressiveness. Hand-drawn storyboards often lack spatial precision needed for complex cinematography, while 3D previsualization demands expertise and high-quality rigged assets. To address this gap, we present PrevizWhiz, a system that leverages rough 3D scenes in combination with generative image and video models to create stylized video previews. The workflow integrates frame-level image restyling with adjustable resemblance, time-based editing through motion paths or external video inputs, and refinement into high-fidelity video clips. A study with filmmakers demonstrates that our system lowers technical barriers for film-makers, accelerates creative iteration, and effectively bridges the communication gap, while also surfacing challenges of continuity, authorship, and ethical consideration in AI-assisted filmmaking.
zh

[CV-2] Continuous Control of Editing Models via Adaptive-Origin Guidance

【速读】:该论文旨在解决扩散模型在图像和视频语义编辑中缺乏对文本引导编辑强度进行平滑控制的问题。现有方法依赖于无分类器指导(Classifier-Free Guidance, CFG)调节编辑强度,但实验表明,直接缩放CFG无法实现输入与编辑结果之间的连续过渡,其根源在于标准的无条件预测作为指导起点,在低指导尺度下占据主导地位,并可能引入任意内容扰动。解决方案的关键是提出自适应原点指导(Adaptive-Origin Guidance, AdaOr),通过引入一个与身份相关的自适应原点(identity-conditioned adaptive origin),并根据编辑强度插值该身份预测与标准无条件预测,从而确保从原始输入到编辑结果的连续、稳定过渡。该方法无需每编辑单独处理或依赖特殊数据集,可无缝集成至标准训练框架,实现推理时细粒度的编辑强度控制。

链接: https://arxiv.org/abs/2602.03826
作者: Alon Wolf,Chen Katzir,Kfir Aberman,Or Patashnik
机构: Tel Aviv University (特拉维夫大学); Decart.ai
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project page at this https URL

点击查看摘要

Abstract:Diffusion-based editing models have emerged as a powerful tool for semantic image and video manipulation. However, existing models lack a mechanism for smoothly controlling the intensity of text-guided edits. In standard text-conditioned generation, Classifier-Free Guidance (CFG) impacts prompt adherence, suggesting it as a potential control for edit intensity in editing models. However, we show that scaling CFG in these models does not produce a smooth transition between the input and the edited result. We attribute this behavior to the unconditional prediction, which serves as the guidance origin and dominates the generation at low guidance scales, while representing an arbitrary manipulation of the input content. To enable continuous control, we introduce Adaptive-Origin Guidance (AdaOr), a method that adjusts this standard guidance origin with an identity-conditioned adaptive origin, using an identity instruction corresponding to the identity manipulation. By interpolating this identity prediction with the standard unconditional prediction according to the edit strength, we ensure a continuous transition from the input to the edited result. We evaluate our method on image and video editing tasks, demonstrating that it provides smoother and more consistent control compared to current slider-based editing approaches. Our method incorporates an identity instruction into the standard training framework, enabling fine-grained control at inference time without per-edit procedure or reliance on specialized datasets.
zh

[CV-3] Fast-Slow Efficient Training for Multimodal Large Language Models via Visual Token Pruning

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在训练过程中存在的严重效率问题,其根源在于模型规模庞大和视觉标记(visual token)数量过多。现有方法主要通过压缩模型尺寸或减少可训练参数来提升效率,但效果有限。论文提出的关键解决方案是DualSpeed框架,其核心创新在于引入“快-慢”双模式机制:快模式采用视觉标记剪枝(Visual Token Pruning, VTP)策略以降低训练时的视觉token数量,从而加速训练;慢模式则在完整视觉序列上训练,确保与推理阶段的一致性,并通过自蒸馏(self-distillation)从快模式中学习,实现性能保持。该设计有效解决了训练-推理不一致问题,显著提升了训练效率而无明显性能损失。

链接: https://arxiv.org/abs/2602.03815
作者: Dingkun Zhang,Shuhan Qi,Yulin Wu,Xinyu Xiao,Xuan Wang,Long Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) suffer from severe training inefficiency issue, which is associated with their massive model sizes and visual token numbers. Existing efforts in efficient training focus on reducing model sizes or trainable parameters. Inspired by the success of Visual Token Pruning (VTP) in improving inference efficiency, we are exploring another substantial research direction for efficient training by reducing visual tokens. However, applying VTP at the training stage results in a training-inference mismatch: pruning-trained models perform poorly when inferring on non-pruned full visual token sequences. To close this gap, we propose DualSpeed, a fast-slow framework for efficient training of MLLMs. The fast-mode is the primary mode, which incorporates existing VTP methods as plugins to reduce visual tokens, along with a mode isolator to isolate the model’s behaviors. The slow-mode is the auxiliary mode, where the model is trained on full visual sequences to retain training-inference consistency. To boost its training, it further leverages self-distillation to learn from the sufficiently trained fast-mode. Together, DualSpeed can achieve both training efficiency and non-degraded performance. Experiments show DualSpeed accelerates the training of LLaVA-1.5 by 2.1 \times and LLaVA-NeXT by 4.0 \times , retaining over 99% performance. Code: this https URL
zh

[CV-4] Progressive Checkerboards for Autoregressive Multiscale Image Generation

【速读】:该论文旨在解决自回归图像生成中如何高效并行采样独立位置的同时仍能建模序列条件依赖性的难题。其解决方案的关键在于提出一种基于渐进棋盘(progressive checkerboard)的固定采样顺序,该顺序在多尺度金字塔结构中按均匀间隔区域并行抽取样本,并在每一步保持四叉树细分各层级的完全平衡,从而有效实现尺度间与尺度内的条件建模。实验表明,在平衡设置下,不同尺度放大因子只要总串行步骤数相同,性能差异不大,且在ImageNet分类条件下,该方法以更少的采样步骤达到了与当前先进自回归系统相当的生成效果。

链接: https://arxiv.org/abs/2602.03811
作者: David Eigen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:A key challenge in autoregressive image generation is to efficiently sample independent locations in parallel, while still modeling mutual dependencies with serial conditioning. Some recent works have addressed this by conditioning between scales in a multiscale pyramid. Others have looked at parallelizing samples in a single image using regular partitions or randomized orders. In this work we examine a flexible, fixed ordering based on progressive checkerboards for multiscale autoregressive image generation. Our ordering draws samples in parallel from evenly spaced regions at each scale, maintaining full balance in all levels of a quadtree subdivision at each step. This enables effective conditioning both between and within scales. Intriguingly, we find evidence that in our balanced setting, a wide range of scale-up factors lead to similar results, so long as the total number of serial steps is constant. On class-conditional ImageNet, our method achieves competitive performance compared to recent state-of-the-art autoregressive systems with like model capacity, using fewer sampling steps.
zh

[CV-5] SplitSplat: Zero-Shot Panoptic Segmentation via Explicit Instance Modeling and 3D Gaussian Splatting

【速读】:该论文旨在解决3D高斯泼溅(3D Gaussian Splatting, 3DGS)在场景重建中缺乏对象一致性与语义感知结构的问题,即现有方法难以实现对场景中不同物体实例的独立建模与语义标注。其解决方案的关键在于提出SplitSplat框架,通过先对场景进行分割、再逐个重建每个对象实例的方式,实现对象级的语义建模:首先利用深度信息传播实例掩码以生成视图一致的2D掩码,随后独立重建每个对象并融合回场景同时优化边界,最终在对象层面嵌入语义描述符,从而支持全景分割、物体检索和3D编辑等下游任务。此设计显著提升了重建结果的语义准确性和对象一致性,并在ScanNetv2分割基准上达到当前最优性能。

链接: https://arxiv.org/abs/2602.03809
作者: Leonardo Monchieri,Elena Camuffo,Francesco Barbato,Pietro Zanuttigh,Simone Milani
机构: University of Padova (帕多瓦大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (GS) enables fast and high-quality scene reconstruction, but it lacks an object-consistent and semantically aware structure. We propose SplitSplat, a framework for panoptic scene reconstruction using 3DGS. Our approach explicitly models object instances. It first propagates instance masks across views using depth, thus producing view-consistent 2D masks. Each object is then reconstructed independently and merged back into the scene while refining its boundaries. Finally, instance-level semantic descriptors are embedded in the reconstructed objects, supporting various applications, including panoptic segmentation, object retrieval, and 3D editing. Unlike existing methods, SplitSplat tackles the problem by first segmenting the scene and then reconstructing each object individually. This design naturally supports downstream tasks and allows SplitSplat to achieve state-of-the-art performance on the ScanNetv2 segmentation benchmark.
zh

[CV-6] 3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation

【速读】:该论文旨在解决现有视频生成中人体运动控制方法依赖2D姿态或显式3D参数化模型(如SMPL)所带来的局限性:2D姿态会将运动绑定于特定视角,难以实现新视角合成;而显式3D模型虽具结构信息,但存在深度模糊性和动态建模不准等问题,若作为强约束会抑制大规模视频生成器内在的3D感知能力。解决方案的关键在于提出一种隐式、视图无关的运动表征——3DiMo,其通过联合训练运动编码器与预训练视频生成器,将驱动帧压缩为紧凑的视图无关运动token,并以语义交叉注意力方式注入生成过程;同时利用多视角监督(单视角、多视角及移动相机视频)强化跨视角运动一致性,并引入渐进式几何监督策略,初期使用SMPL进行初始化后逐步退火至零,促使模型从外部3D引导过渡到基于数据和生成器先验自主学习真正的3D空间运动理解,从而在运动保真度和视觉质量上显著优于现有方法。

链接: https://arxiv.org/abs/2602.03796
作者: Zhixue Fang,Xu He,Songlin Tang,Haoxian Zhang,Qingfeng Li,Xiaoqiang Liu,Pengfei Wan,Kun Gai
机构: Kling Team, Kuaishou Technology (快手科技); Tsinghua University (清华大学); CASIA (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Existing methods for human motion control in video generation typically rely on either 2D poses or explicit 3D parametric models (e.g., SMPL) as control signals. However, 2D poses rigidly bind motion to the driving viewpoint, precluding novel-view synthesis. Explicit 3D models, though structurally informative, suffer from inherent inaccuracies (e.g., depth ambiguity and inaccurate dynamics) which, when used as a strong constraint, override the powerful intrinsic 3D awareness of large-scale video generators. In this work, we revisit motion control from a 3D-aware perspective, advocating for an implicit, view-agnostic motion representation that naturally aligns with the generator’s spatial priors rather than depending on externally reconstructed constraints. We introduce 3DiMo, which jointly trains a motion encoder with a pretrained video generator to distill driving frames into compact, view-agnostic motion tokens, injected semantically via cross-attention. To foster 3D awareness, we train with view-rich supervision (i.e., single-view, multi-view, and moving-camera videos), forcing motion consistency across diverse viewpoints. Additionally, we use auxiliary geometric supervision that leverages SMPL only for early initialization and is annealed to zero, enabling the model to transition from external 3D guidance to learning genuine 3D spatial motion understanding from the data and the generator’s priors. Experiments confirm that 3DiMo faithfully reproduces driving motions with flexible, text-driven camera control, significantly surpassing existing methods in both motion fidelity and visual quality.
zh

[CV-7] BridgeV2W: Bridging Video Generation Models to Embodied World Models via Embodiment Masks

【速读】:该论文旨在解决具身世界模型(embodied world models)在实际应用中面临的三大关键挑战:一是动作空间(coordinate-space)与视频像素空间(pixel-space)之间的对齐问题;二是模型对相机视角变化敏感,导致泛化能力差;三是不同机器人形态(embodiment)之间缺乏统一的架构设计。解决方案的核心在于提出BridgeV2W框架,其关键技术是将动作空间指令转换为与视觉输入对齐的具身掩码(embodiment masks),并通过ControlNet风格的路径注入预训练视频生成模型,从而实现动作控制信号与预测视频的精准对齐、引入视点特定条件以适应不同相机视角,并构建跨形态统一的世界模型架构。此外,为避免静态背景过拟合,还引入基于流的运动损失(flow-based motion loss),聚焦于学习动态且任务相关的区域,显著提升了视频生成质量和下游任务表现。

链接: https://arxiv.org/abs/2602.03793
作者: Yixiang Chen,Peiyan Li,Jiabing Yang,Keji He,Xiangnan Wu,Yuan Xu,Kai Wang,Jing Liu,Nianfeng Liu,Yan Huang,Liang Wang
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Embodied world models have emerged as a promising paradigm in robotics, most of which leverage large-scale Internet videos or pretrained video generation models to enrich visual and motion priors. However, they still face key challenges: a misalignment between coordinate-space actions and pixel-space videos, sensitivity to camera viewpoint, and non-unified architectures across embodiments. To this end, we present BridgeV2W, which converts coordinate-space actions into pixel-aligned embodiment masks rendered from the URDF and camera parameters. These masks are then injected into a pretrained video generation model via a ControlNet-style pathway, which aligns the action control signals with predicted videos, adds view-specific conditioning to accommodate camera viewpoints, and yields a unified world model architecture across embodiments. To mitigate overfitting to static backgrounds, BridgeV2W further introduces a flow-based motion loss that focuses on learning dynamic and task-relevant regions. Experiments on single-arm (DROID) and dual-arm (AgiBot-G1) datasets, covering diverse and challenging conditions with unseen viewpoints and scenes, show that BridgeV2W improves video generation quality compared to prior state-of-the-art methods. We further demonstrate the potential of BridgeV2W on downstream real-world tasks, including policy evaluation and goal-conditioned planning. More results can be found on our project website at this https URL .
zh

[CV-8] From Pre- to Intra-operative MRI: Predicting Brain Shift in Temporal Lobe Resection for Epilepsy Surgery

【速读】:该论文旨在解决神经外科手术中因脑移位(brain shift)导致术前磁共振成像(MRI)失效的问题,从而影响图像引导神经外科系统(IGNS)的导航精度。其关键解决方案是提出一种基于U-Net架构的深度学习模型NeuralShift,该模型仅依赖术前MRI即可预测术中脑组织的整体形变和局部位移,实验结果显示其在 anatomical landmarks 上实现低至1.12 mm的靶点注册误差(TRE),并能准确重建脑部结构(DICE达0.97),有效补偿颞叶切除术中的大范围脑移位,提升术中导航精度与手术安全性。

链接: https://arxiv.org/abs/2602.03785
作者: Jingjing Peng,Giorgio Fiore,Yang Liu,Ksenia Ellum,Debayan Daspupta,Keyoumars Ashkan,Andrew McEvoy,Anna Miserocchi,Sebastien Ourselin,John Duncan,Alejandro Granados
机构: King’s College London (伦敦国王学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Introduction: In neurosurgery, image-guided Neurosurgery Systems (IGNS) highly rely on preoperative brain magnetic resonance images (MRI) to assist surgeons in locating surgical targets and determining surgical paths. However, brain shift invalidates the preoperative MRI after dural opening. Updated intraoperative brain MRI with brain shift compensation is crucial for enhancing the precision of neuronavigation systems and ensuring the optimal outcome of surgical interventions. Methodology: We propose NeuralShift, a U-Net-based model that predicts brain shift entirely from pre-operative MRI for patients undergoing temporal lobe resection. We evaluated our results using Target Registration Errors (TREs) computed on anatomical landmarks located on the resection side and along the midline, and DICE scores comparing predicted intraoperative masks with masks derived from intraoperative MRI. Results: Our experimental results show that our model can predict the global deformation of the brain (DICE of 0.97) with accurate local displacements (achieve landmark TRE as low as 1.12 mm), compensating for large brain shifts during temporal lobe removal neurosurgery. Conclusion: Our proposed model is capable of predicting the global deformation of the brain during temporal lobe resection using only preoperative images, providing potential opportunities to the surgical team to increase safety and efficiency of neurosurgery and better outcomes to patients. Our contributions will be publicly available after acceptance in this https URL.
zh

[CV-9] QVLA: Not All Channels Are Equal in Vision-Language-Action Models Quantization ICLR2026

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在资源受限机器人平台上的部署难题,其核心挑战在于现有低比特量化方法(如来自大语言模型的SmoothQuant)仅关注数据保真度而忽视动作空间敏感性,导致微小的动作偏差可能累积为任务失败。解决方案的关键在于提出QVLA,一个以动作为中心的量化框架,采用通道级细粒度比特分配策略:通过测量每个通道在不同比特宽度下的最终动作空间敏感性,生成精确的通道重要性指标,并以此指导全局优化,将量化与剪枝(0比特)统一于同一框架中,从而在显著降低显存占用的同时保持高任务性能。

链接: https://arxiv.org/abs/2602.03782
作者: Yuhao Xu,Yantai Yang,Zhenyang Fan,Yufan Liu,Yuming Li,Bing Li,Zhipeng Zhang
机构: AutoLab, School of Artificial Intelligence, Shanghai Jiao Tong University (上海交通大学人工智能学院); Anyverse Dynamics; State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所多模态人工智能系统重点实验室); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Terminal Technology Department, Alipay, Ant Group (蚂蚁集团支付宝终端技术部)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: ICLR2026

点击查看摘要

Abstract:The advent of Vision-Language-Action (VLA) models represents a significant leap for embodied intelligence, yet their immense computational demands critically hinder deployment on resource-constrained robotic platforms. Intuitively, low-bit quantization is a prevalent and preferred technique for large-scale model compression. However, we find that a systematic analysis of VLA model’s quantization is fundamentally lacking. We argue that naively applying uniform-bit quantization from Large Language Models (LLMs) to robotics is flawed, as these methods prioritize passive data fidelity while ignoring how minor action deviations compound into catastrophic task failures. To bridge this gap, we introduce QVLA, the first action-centric quantization framework specifically designed for embodied control. In a sharp departure from the rigid, uniform-bit quantization of LLM-based methods, QVLA introduces a highly granular, channel-wise bit allocation strategy. Its core mechanism is to directly measure the final action-space sensitivity when quantizing each individual channel to various bit-widths. This process yields a precise, per-channel importance metric that guides a global optimization, which elegantly unifies quantization and pruning (0-bit) into a single, cohesive framework. Extensive evaluations on different baselines demonstrate the superiority of our approach. In the LIBERO, the quantization version of OpenVLA-OFT with our method requires only 29.2% of the original model’s VRAM while maintaining 98.9% of its original performance and achieving a 1.49x speedup. This translates to a 22.6% performance improvement over the LLM-derived method SmoothQuant. Our work establishes a new, principled foundation for compressing VLA models in robotics, paving the way for deploying powerful, large-scale models on real-world hardware. Code will be released.
zh

[CV-10] FOVI: A biologically-inspired foveated interface for deep vision models

【速读】:该论文旨在解决传统计算机视觉系统在处理全视野高分辨率图像时计算效率低下的问题,其核心挑战在于这些系统通常采用均匀分辨率编码视觉信息,而人类视觉系统则具有中心聚焦(foveated)的特性,即在视野中心提供高分辨率、周边区域分辨率递减,从而通过眼动实现高效主动感知。解决方案的关键在于提出一种基于视网膜和初级视觉皮层(V1)结构的仿生视觉接口(FOVI),将变分辨率的类视网膜传感器阵列重构为统一密度的V1-like传感器流形,并通过k近邻(kNN)定义感受野,结合新颖的核映射技术实现kNN卷积。该方法显著降低了计算成本,同时保持了与非聚焦基线模型相当的性能,为高分辨率自我中心视觉中的高效、可扩展主动感知提供了新路径。

链接: https://arxiv.org/abs/2602.03766
作者: Nicholas M. Blauch,George A. Alvarez,Talia Konkle
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:Human vision is foveated, with variable resolution peaking at the center of a large field of view; this reflects an efficient trade-off for active sensing, allowing eye-movements to bring different parts of the world into focus with other parts of the world in context. In contrast, most computer vision systems encode the visual world at a uniform resolution, raising challenges for processing full-field high-resolution images efficiently. We propose a foveated vision interface (FOVI) based on the human retina and primary visual cortex, that reformats a variable-resolution retina-like sensor array into a uniformly dense, V1-like sensor manifold. Receptive fields are defined as k-nearest-neighborhoods (kNNs) on the sensor manifold, enabling kNN-convolution via a novel kernel mapping technique. We demonstrate two use cases: (1) an end-to-end kNN-convolutional architecture, and (2) a foveated adaptation of the foundational DINOv3 ViT model, leveraging low-rank adaptation (LoRA). These models provide competitive performance at a fraction of the computational cost of non-foveated baselines, opening pathways for efficient and scalable active sensing for high-resolution egocentric vision. Code and pre-trained models are available at this https URL and this https URL.
zh

[CV-11] RAWDet-7: A Multi-Scenario Benchmark for Object Detection and Description on Quantized RAW Images

【速读】:该论文旨在解决当前视觉模型训练数据主要基于经过图像信号处理(ISP)流水线优化的RGB图像,导致传感器级信息丢失、不利于机器推理的问题。其解决方案的关键在于构建一个大规模RAW图像数据集RAWDet-7,包含约25k训练和7.6k测试图像,覆盖多样相机、光照与环境条件,并对七类目标进行密集标注;同时提供基于高分辨率sRGB图像生成的对象级描述,以研究RAW图像在低比特量化(4-bit、6-bit、8-bit)下的信息保留能力与检测性能,从而为低比特RAW图像处理中的目标检测、描述质量与泛化能力提供基准评估体系。

链接: https://arxiv.org/abs/2602.03760
作者: Mishal Fatima,Shashank Agnihotri,Kanchana Vaishnavi Gandikota,Michael Moeller,Margret Keuper
机构: University of Mannheim(曼海姆大学); University of Siegen(锡根大学); Max Planck Institute(马克斯·普朗克研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: *Equal Contribution

点击查看摘要

Abstract:Most vision models are trained on RGB images processed through ISP pipelines optimized for human perception, which can discard sensor-level information useful for machine reasoning. RAW images preserve unprocessed scene data, enabling models to leverage richer cues for both object detection and object description, capturing fine-grained details, spatial relationships, and contextual information often lost in processed images. To support research in this domain, we introduce RAWDet-7, a large-scale dataset of ~25k training and 7.6k test RAW images collected across diverse cameras, lighting conditions, and environments, densely annotated for seven object categories following MS-COCO and LVIS conventions. In addition, we provide object-level descriptions derived from the corresponding high-resolution sRGB images, facilitating the study of object-level information preservation under RAW image processing and low-bit quantization. The dataset allows evaluation under simulated 4-bit, 6-bit, and 8-bit quantization, reflecting realistic sensor constraints, and provides a benchmark for studying detection performance, description quality detail, and generalization in low-bit RAW image processing. Dataset code upon acceptance.
zh

[CV-12] st-Time Conditioning with Representation-Aligned Visual Features

【速读】:该论文旨在解决扩散模型(diffusion model)在推理阶段缺乏灵活、精确控制生成内容的问题,尤其是如何利用预训练特征提取器中的语义信息实现细粒度到全局的条件引导。其解决方案的关键在于提出Representation-Aligned Guidance (REPA-G) 框架,该框架通过优化一个相似性目标(即势能函数 potential)来引导去噪过程,使生成结果趋向于由预训练特征提取器提取出的目标表示。该方法完全在推理阶段运行,支持从单个图像块的纹理匹配到整体图像语义引导的多尺度控制,并可扩展至多概念组合,从而提供比模糊文本提示或粗粒度类别标签更精准的生成控制能力。

链接: https://arxiv.org/abs/2602.03753
作者: Nicolas Sereyjol-Garros,Ellington Kirby,Victor Letzelter,Victor Besnier,Nermin Samet
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While representation alignment with self-supervised models has been shown to improve diffusion model training, its potential for enhancing inference-time conditioning remains largely unexplored. We introduce Representation-Aligned Guidance (REPA-G), a framework that leverages these aligned representations, with rich semantic properties, to enable test-time conditioning from features in generation. By optimizing a similarity objective (the potential) at inference, we steer the denoising process toward a conditioned representation extracted from a pre-trained feature extractor. Our method provides versatile control at multiple scales, ranging from fine-grained texture matching via single patches to broad semantic guidance using global image feature tokens. We further extend this to multi-concept composition, allowing for the faithful combination of distinct concepts. REPA-G operates entirely at inference time, offering a flexible and precise alternative to often ambiguous text prompts or coarse class labels. We theoretically justify how this guidance enables sampling from the potential-induced tilted distribution. Quantitative results on ImageNet and COCO demonstrate that our approach achieves high-quality, diverse generations. Code is available at this https URL.
zh

[CV-13] Zero-shot large vision-language model prompting for automated bone identification in paleoradiology x-ray archives

【速读】:该论文旨在解决古病理影像学(paleoradiology)中因图像异质性导致的内容导航难题,例如骨骼离散、定位随意、侧向性标记缺失以及年龄、性别和成像设备等因素引入的高变异性,从而阻碍专家高效筛选与分析大量影像数据。解决方案的关键在于采用一种零样本提示(zero-shot prompting)策略,利用先进的视觉语言大模型(Large Vision Language Model, LVLM)自动识别图像中的主要骨骼、投射视角和侧向性信息;通过将原始DICOM文件转换为骨窗PNG图像并输入结构化提示指令,系统可输出结构化JSON结果,显著提升标注效率与准确性,实验证明其在主要骨骼识别上达到92%准确率,侧向性识别达100%,为大规模古病理影像数据集的自动化编码与内容导航提供了可行路径。

链接: https://arxiv.org/abs/2602.03750
作者: Owen Dong,Lily Gao,Manish Kota,Bennett A. Landmana,Jelena Bekvalac,Gaynor Western,Katherine D. Van Schaik
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Paleoradiology, the use of modern imaging technologies to study archaeological and anthropological remains, offers new windows on millennial scale patterns of human health. Unfortunately, the radiographs collected during field campaigns are heterogeneous: bones are disarticulated, positioning is ad hoc, and laterality markers are often absent. Additionally, factors such as age at death, age of bone, sex, and imaging equipment introduce high variability. Thus, content navigation, such as identifying a subset of images with a specific projection view, can be time consuming and difficult, making efficient triaging a bottleneck for expert analysis. We report a zero shot prompting strategy that leverages a state of the art Large Vision Language Model (LVLM) to automatically identify the main bone, projection view, and laterality in such images. Our pipeline converts raw DICOM files to bone windowed PNGs, submits them to the LVLM with a carefully engineered prompt, and receives structured JSON outputs, which are extracted and formatted onto a spreadsheet in preparation for validation. On a random sample of 100 images reviewed by an expert board certified paleoradiologist, the system achieved 92% main bone accuracy, 80% projection view accuracy, and 100% laterality accuracy, with low or medium confidence flags for ambiguous cases. These results suggest that LVLMs can substantially accelerate code word development for large paleoradiology datasets, allowing for efficient content navigation in future anthropology workflows.
zh

[CV-14] See-through: Single-image Layer Decomposition for Anime Characters

【速读】:该论文旨在解决将静态动漫插画自动转换为可操控的2.5D模型这一难题,传统专业流程依赖繁琐的手动分割和对遮挡区域的艺术性“幻觉”重建以实现动画效果。其核心解决方案在于提出一个框架,通过分解单张图像生成语义明确且完全补全的层,并推断出合理的绘制顺序(drawing order),从而实现动态层重建。关键创新点包括:1)利用商业Live2D模型构建可扩展的监督数据引擎,获取像素级语义与隐藏几何信息;2)引入基于扩散模型的部件一致性模块(Body Part Consistency Module),确保全局几何一致性;3)结合像素级伪深度(pseudo-depth)推理机制,精确处理如交错发丝等复杂分层结构,最终生成适用于专业实时动画应用的高保真可操控模型。

链接: https://arxiv.org/abs/2602.03749
作者: Jian Lin,Chengze Li,Haoyun Qin,Kwun Wang Chan,Yanghua Jin,Hanyuan Liu,Stephen Chun Wang Choy,Xueting Liu
机构: Saint Francis University (圣弗朗西斯大学); University of Pennsylvania (宾夕法尼亚大学); Spellbrush; Sitagaki Lab
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 23 pages, 20 figures, preprint version only

点击查看摘要

Abstract:We introduce a framework that automates the transformation of static anime illustrations into manipulatable 2.5D models. Current professional workflows require tedious manual segmentation and the artistic ``hallucination’’ of occluded regions to enable motion. Our approach overcomes this by decomposing a single image into fully inpainted, semantically distinct layers with inferred drawing orders. To address the scarcity of training data, we introduce a scalable engine that bootstraps high-quality supervision from commercial Live2D models, capturing pixel-perfect semantics and hidden geometry. Our methodology couples a diffusion-based Body Part Consistency Module, which enforces global geometric coherence, with a pixel-level pseudo-depth inference mechanism. This combination resolves the intricate stratification of anime characters, e.g., interleaving hair strands, allowing for dynamic layer reconstruction. We demonstrate that our approach yields high-fidelity, manipulatable models suitable for professional, real-time animation applications.
zh

[CV-15] LIVE: Long-horizon Interactive Video World Modeling

【速读】:该论文旨在解决自回归视频世界模型(autoregressive video world models)在长时程生成中因预测误差累积而导致性能下降的问题。现有方法依赖预训练教师模型和序列级分布匹配来缓解此问题,但会引入额外计算开销且无法防止训练范围之外的误差传播。解决方案的关键在于提出一种名为LIVE的新型模型,其核心创新是通过新颖的循环一致性目标(cycle-consistency objective)强制限制误差积累——具体而言,LIVE先从真实帧进行前向滚动(forward rollout),再执行反向生成以重建初始状态,最后基于重构终端状态计算扩散损失(diffusion loss),从而显式约束长时程误差传播。这一机制无需教师蒸馏即可实现稳定、高质量的超长视频生成。

链接: https://arxiv.org/abs/2602.03747
作者: Junchao Huang,Ziyang Ye,Xinting Hu,Tianyu He,Guiyu Zhang,Shaoshuai Shi,Jiang Bian,Li Jiang
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); Shenzhen Loop Area Institute (深圳 loop 区域研究院); Microsoft Research (微软研究院); The University of Hong Kong (香港大学); Voyager Research, Didi Chuxing (滴滴出行 Voyager 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 22 figures

点击查看摘要

Abstract:Autoregressive video world models predict future visual observations conditioned on actions. While effective over short horizons, these models often struggle with long-horizon generation, as small prediction errors accumulate over time. Prior methods alleviate this by introducing pre-trained teacher models and sequence-level distribution matching, which incur additional computational cost and fail to prevent error propagation beyond the training horizon. In this work, we propose LIVE, a Long-horizon Interactive Video world modEl that enforces bounded error accumulation via a novel cycle-consistency objective, thereby eliminating the need for teacher-based distillation. Specifically, LIVE first performs a forward rollout from ground-truth frames and then applies a reverse generation process to reconstruct the initial state. The diffusion loss is subsequently computed on the reconstructed terminal state, providing an explicit constraint on long-horizon error propagation. Moreover, we provide an unified view that encompasses different approaches and introduce progressive training curriculum to stabilize training. Experiments demonstrate that LIVE achieves state-of-the-art performance on long-horizon benchmarks, generating stable, high-quality videos far beyond training rollout lengths.
zh

[CV-16] Edge-Optimized Vision-Language Models for Underground Infrastructure Assessment

【速读】:该论文旨在解决地下基础设施(如污水管和涵洞系统)自主巡检中,如何在资源受限的边缘设备上实现从视觉检测结果到可读性强的自然语言摘要的端到端自动化生成问题。其核心挑战在于兼顾模型轻量化与摘要质量,以支持实时部署。解决方案的关键在于提出一个两阶段流水线:第一阶段采用轻量级RAPID-SCAN分割模型(仅0.64M参数,F1-score达0.834),高效完成缺陷分割;第二阶段基于微调后的Phi-3.5视觉语言模型(Vision-Language Model, VLM)生成领域特定的简洁自然语言描述,并通过后训练量化与硬件优化显著降低模型尺寸和推理延迟,同时保持摘要质量。该方案在移动机器人平台上验证了其实用性,为边缘部署的集成式AI系统提供了可行路径,推动了基础设施智能维护的自动化与规模化发展。

链接: https://arxiv.org/abs/2602.03742
作者: Johny J. Lopez,Md Meftahul Ferdaus,Mahdi Abdelguerfi
机构: University of New Orleans (新奥尔良大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Autonomous inspection of underground infrastructure, such as sewer and culvert systems, is critical to public safety and urban sustainability. Although robotic platforms equipped with visual sensors can efficiently detect structural deficiencies, the automated generation of human-readable summaries from these detections remains a significant challenge, especially on resource-constrained edge devices. This paper presents a novel two-stage pipeline for end-to-end summarization of underground deficiencies, combining our lightweight RAPID-SCAN segmentation model with a fine-tuned Vision-Language Model (VLM) deployed on an edge computing platform. The first stage employs RAPID-SCAN (Resource-Aware Pipeline Inspection and Defect Segmentation using Compact Adaptive Network), achieving 0.834 F1-score with only 0.64M parameters for efficient defect segmentation. The second stage utilizes a fine-tuned Phi-3.5 VLM that generates concise, domain-specific summaries in natural language from the segmentation outputs. We introduce a curated dataset of inspection images with manually verified descriptions for VLM fine-tuning and evaluation. To enable real-time performance, we employ post-training quantization with hardware-specific optimization, achieving significant reductions in model size and inference latency without compromising summarization quality. We deploy and evaluate our complete pipeline on a mobile robotic platform, demonstrating its effectiveness in real-world inspection scenarios. Our results show the potential of edge-deployable integrated AI systems to bridge the gap between automated defect detection and actionable insights for infrastructure maintenance, paving the way for more scalable and autonomous inspection solutions.
zh

[CV-17] RegionReason er: Region-Grounded Multi-Round Visual Reasoning ICLR2026

【速读】:该论文旨在解决当前大型视觉语言模型在视觉推理中普遍存在的单步或纯文本推理局限性,即缺乏在多轮交互中迭代优化理解能力的问题。其解决方案的关键在于提出了一种名为RegionReasoner的强化学习框架,该框架通过要求每个推理过程明确引用对应的参考边界框(bounding boxes)来强制实现基于区域的语义接地(grounded reasoning),并引入全局-局部一致性奖励机制,以确保推理轨迹与场景级和区域级描述之间的语义连贯性。此方法显著提升了多轮视觉推理任务中的空间定位精度、全局一致性以及整体推理准确性。

链接: https://arxiv.org/abs/2602.03733
作者: Wenfang Sun,Hao Chen,Yingjun Du,Yefeng Zheng,Cees G. M. Snoek
机构: University of Amsterdam (阿姆斯特丹大学); Anhui University (安徽大学); Westlake University (西湖大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICLR 2026

点击查看摘要

Abstract:Large vision-language models have achieved remarkable progress in visual reasoning, yet most existing systems rely on single-step or text-only reasoning, limiting their ability to iteratively refine understanding across multiple visual contexts. To address this limitation, we introduce a new multi-round visual reasoning benchmark with training and test sets spanning both detection and segmentation tasks, enabling systematic evaluation under iterative reasoning scenarios. We further propose RegionReasoner, a reinforcement learning framework that enforces grounded reasoning by requiring each reasoning trace to explicitly cite the corresponding reference bounding boxes, while maintaining semantic coherence via a global-local consistency reward. This reward extracts key objects and nouns from both global scene captions and region-level captions, aligning them with the reasoning trace to ensure consistency across reasoning steps. RegionReasoner is optimized with structured rewards combining grounding fidelity and global-local semantic alignment. Experiments on detection and segmentation tasks show that RegionReasoner-7B, together with our newly introduced benchmark RegionDial-Bench, considerably improves multi-round reasoning accuracy, spatial grounding precision, and global-local consistency, establishing a strong baseline for this emerging research direction.
zh

[CV-18] Referring Industrial Anomaly Segmentation

【速读】:该论文旨在解决工业异常检测(Industrial Anomaly Detection, IAD)中传统方法面临的两大核心问题:一是无监督方法定位粗略且需手动设定阈值,二是监督方法因数据稀缺与不平衡易过拟合,且二者均受限于“一类异常对应一个模型”的封闭集范式。解决方案的关键在于提出一种基于语言引导的新型检测范式——指代式工业异常分割(Referring Industrial Anomaly Segmentation, RIAS),其通过自然语言描述直接生成精确异常掩码,并利用通用提示词(universal prompts)实现单模型对多种异常类型的检测能力,从而推动IAD向开放集场景演进。该方案的核心技术包括MVTec-Ref数据集(含多样化指代表达和95%小尺寸异常)以及DQFormer架构,后者采用双查询令牌(Anomaly/Background)与语言门控多级聚合机制(Language-Gated Multi-Level Aggregation, LMA),显著提升多尺度分割精度并减少冗余查询,实现高效视觉-文本融合。

链接: https://arxiv.org/abs/2602.03673
作者: Pengfei Yue,Xiaokang Jiang,Yilin Lu,Jianghang Lin,Shengchuan Zhang,Liujuan Cao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Industrial Anomaly Detection (IAD) is vital for manufacturing, yet traditional methods face significant challenges: unsupervised approaches yield rough localizations requiring manual thresholds, while supervised methods overfit due to scarce, imbalanced data. Both suffer from the “One Anomaly Class, One Model” limitation. To address this, we propose Referring Industrial Anomaly Segmentation (RIAS), a paradigm leveraging language to guide detection. RIAS generates precise masks from text descriptions without manual thresholds and uses universal prompts to detect diverse anomalies with a single model. We introduce the MVTec-Ref dataset to support this, designed with diverse referring expressions and focusing on anomaly patterns, notably with 95% small anomalies. We also propose the Dual Query Token with Mask Group Transformer (DQFormer) benchmark, enhanced by Language-Gated Multi-Level Aggregation (LMA) to improve multi-scale segmentation. Unlike traditional methods using redundant queries, DQFormer employs only “Anomaly” and “Background” tokens for efficient visual-textual integration. Experiments demonstrate RIAS’s effectiveness in advancing IAD toward open-set capabilities. Code: this https URL.
zh

[CV-19] Efficient Sequential Neural Network with Spatial-Temporal Attention and Linear LSTM for Robust Lane Detection Using Multi-Frame Images

【速读】:该论文旨在解决自动驾驶车辆(AVs)在混合交通环境中进行车道检测时面临的准确性、鲁棒性和实时性不足的问题,尤其是现有视觉方法常忽略图像中关键区域及其时空(spatial-temporal, ST)显著性,导致在严重遮挡和强光干扰等复杂场景下性能下降。解决方案的关键在于提出一种基于标准编码器-解码器结构的新型序列神经网络模型,引入时空注意力机制(spatial-temporal attention mechanism),以聚焦车道线的关键特征并挖掘连续图像帧间的显著ST相关性,从而提升检测精度与鲁棒性,同时通过该机制减少模型参数量和乘加运算(MACs),实现更高的计算效率。

链接: https://arxiv.org/abs/2602.03669
作者: Sandeep Patil,Yongqi Dong,Haneen Farah,Hans Hellendoorn
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 14 pages, 9 figures, under review by IEEE T-ITS

点击查看摘要

Abstract:Lane detection is a crucial perception task for all levels of automated vehicles (AVs) and Advanced Driver Assistance Systems, particularly in mixed-traffic environments where AVs must interact with human-driven vehicles (HDVs) and challenging traffic scenarios. Current methods lack versatility in delivering accurate, robust, and real-time compatible lane detection, especially vision-based methods often neglect critical regions of the image and their spatial-temporal (ST) salience, leading to poor performance in difficult circumstances such as serious occlusion and dazzle lighting. This study introduces a novel sequential neural network model with a spatial-temporal attention mechanism to focus on key features of lane lines and exploit salient ST correlations among continuous image frames. The proposed model, built on a standard encoder-decoder structure and common neural network backbones, is trained and evaluated on three large-scale open-source datasets. Extensive experiments demonstrate the strength and robustness of the proposed model, outperforming state-of-the-art methods in various testing scenarios. Furthermore, with the ST attention mechanism, the developed sequential neural network models exhibit fewer parameters and reduced Multiply-Accumulate Operations (MACs) compared to baseline sequential models, highlighting their computational efficiency. Relevant data, code, and models are released at this https URL.
zh

[CV-20] MVP-LAM: Learning Action-Centric Latent Action via Cross-Viewpoint Reconstruction

【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型预训练中缺乏高质量伪动作标签的问题,尤其是在无地面真实动作标签的情况下如何提取具有动作信息的潜在动作(latent actions)。传统方法依赖于单一视角视频或特定机器人数据集,难以泛化到不同形态的机器人系统。解决方案的关键在于提出多视角点潜在动作模型(Multi-View Point Latent Action Model, MVP-LAM),其通过时间同步的多视角视频学习离散的潜在动作,并采用跨视角重建目标进行训练——即从一个视角推断出的潜在动作必须能解释另一个视角中的未来状态,从而减少对视角特异性线索的依赖,增强潜在动作对真实动作的信息量。实验表明,MVP-LAM生成的潜在动作更具动作导向性,在Bridge V2上实现了更高的互信息和动作预测性能,且在分布外评估下依然稳健;进一步用于VLA预训练后,显著提升了SIMPLER和LIBERO-Long基准上的下游操作任务表现。

链接: https://arxiv.org/abs/2602.03668
作者: Jung Min Lee,Dohyeok Lee,Seokhun Ju,Taehyun Cho,Jin Woo Koo,Li Zhao,Sangwoo Hong,Jungwoo Lee
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Learning \emphlatent actions from diverse human videos enables scaling robot learning beyond embodiment-specific robot datasets, and these latent actions have recently been used as pseudo-action labels for vision-language-action (VLA) model pretraining. To make VLA pretraining effective, latent actions should contain information about the underlying agent’s actions despite the absence of ground-truth labels. We propose \textbfMulti-\textbfView\textbfPoint \textbfLatent \textbfAction \textbfModel (\textbfMVP-LAM), which learns discrete latent actions that are highly informative about ground-truth actions from time-synchronized multi-view videos. MVP-LAM trains latent actions with a \emphcross-viewpoint reconstruction objective, so that a latent action inferred from one view must explain the future in another view, reducing reliance on viewpoint-specific cues. On Bridge V2, MVP-LAM produces more action-centric latent actions, achieving higher mutual information with ground-truth actions and improved action prediction, including under out-of-distribution evaluation. Finally, pretraining VLAs with MVP-LAM latent actions improves downstream manipulation performance on the SIMPLER and LIBERO-Long benchmarks.
zh

[CV-21] MM-SCALE: Grounded Multimodal Moral Reasoning via Scalar Judgment and Listwise Alignment

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在多模态和社会情境模糊场景中难以做出符合人类道德判断的问题。现有方法多依赖二元或成对的监督信号,无法充分捕捉人类道德推理的连续性和多样性。解决方案的关键在于提出MM-SCALE(Multimodal Moral Scale),一个大规模数据集,通过五点量表评分和显式的模态标注来对齐VLM与人类道德偏好;其核心创新是将监督方式从离散的二元标签扩展为连续的标量信号,并结合列表级偏好优化策略,从而提供更丰富的对齐信号和更精细的多模态道德推理校准能力。

链接: https://arxiv.org/abs/2602.03665
作者: Eunkyu Park,Wesley Hanwen Deng,Cheyon Jin,Matheus Kunzler Maldaner,Jordan Wheeler,Jason I. Hong,Hong Shen,Adam Perer,Ken Holstein,Motahhare Eslami,Gunhee Kim
机构: Seoul National University (首尔国立大学); Carnegie Mellon University (卡内基梅隆大学); University of Florida (佛罗里达大学); Epic Games
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) continue to struggle to make morally salient judgments in multimodal and socially ambiguous contexts. Prior works typically rely on binary or pairwise supervision, which often fail to capture the continuous and pluralistic nature of human moral reasoning. We present MM-SCALE (Multimodal Moral Scale), a large-scale dataset for aligning VLMs with human moral preferences through 5-point scalar ratings and explicit modality grounding. Each image-scenario pair is annotated with moral acceptability scores and grounded reasoning labels by humans using an interface we tailored for data collection, enabling listwise preference optimization over ranked scenario sets. By moving from discrete to scalar supervision, our framework provides richer alignment signals and finer calibration of multimodal moral reasoning. Experiments show that VLMs fine-tuned on MM-SCALE achieve higher ranking fidelity and more stable safety calibration than those trained with binary signals.
zh

[CV-22] SPWOOD: Sparse Partial Weakly-Supervised Oriented Object Detection ICLR2026

【速读】:该论文旨在解决遥感领域中定向目标检测因密集对象分布和多样化类别导致的大规模标注成本过高问题,尤其在弱标注或稀疏标注条件下如何保持模型性能。其解决方案的关键在于提出首个稀疏部分弱监督定向目标检测框架,通过三项核心创新实现:(1) 设计稀疏注释导向的方位与尺度感知学生模型(SOS-Student),在稀疏标注场景下分离未标注目标与背景,并从方位无关或尺度无关的弱标注中学习方向与尺度信息;(2) 构建基于多层预测分布的多级伪标签过滤策略,提升伪标签质量;(3) 提出独特的稀疏分区方法,确保各类别在训练中获得公平对待。实验表明,该框架在DOTA和DIOR数据集上显著优于传统方法,提供了一种高性价比的解决方案。

链接: https://arxiv.org/abs/2602.03634
作者: Wei Zhang,Xiang Liu,Ningjing Liu,Mingxin Liu,Wei Liao,Chunyan Xu,Xue Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The Fourteenth International Conference on Learning Representations (ICLR 2026)

点击查看摘要

Abstract:A consistent trend throughout the research of oriented object detection has been the pursuit of maintaining comparable performance with fewer and weaker annotations. This is particularly crucial in the remote sensing domain, where the dense object distribution and a wide variety of categories contribute to prohibitively high costs. Based on the supervision level, existing oriented object detection algorithms can be broadly grouped into fully supervised, semi-supervised, and weakly supervised methods. Within the scope of this work, we further categorize them to include sparsely supervised and partially weakly-supervised methods. To address the challenges of large-scale labeling, we introduce the first Sparse Partial Weakly-Supervised Oriented Object Detection framework, designed to efficiently leverage only a few sparse weakly-labeled data and plenty of unlabeled data. Our framework incorporates three key innovations: (1) We design a Sparse-annotation-Orientation-and-Scale-aware Student (SOS-Student) model to separate unlabeled objects from the background in a sparsely-labeled setting, and learn orientation and scale information from orientation-agnostic or scale-agnostic weak annotations. (2) We construct a novel Multi-level Pseudo-label Filtering strategy that leverages the distribution of model predictions, which is informed by the model’s multi-layer predictions. (3) We propose a unique sparse partitioning approach, ensuring equal treatment for each category. Extensive experiments on the DOTA and DIOR datasets show that our framework achieves a significant performance gain over traditional oriented object detection methods mentioned above, offering a highly cost-effective solution. Our code is publicly available at this https URL.
zh

[CV-23] Multi-Objective Optimization for Synthetic-to-Real Style Transfer

【速读】:该论文旨在解决合成图像到真实图像的域适应(domain adaptation)问题,即在缺乏大量真实世界像素级标注数据的情况下,如何提升语义分割网络在真实场景中的性能。由于合成图像与真实图像之间存在显著的域差距(domain gap),直接使用合成数据训练的模型在真实场景中表现不佳。解决方案的关键在于将风格迁移(style transfer)建模为一个序列优化问题,并利用多目标遗传算法(multi-objective genetic algorithms)自动搜索最优的数据增强管道(augmentation pipeline),以在结构保真度和风格相似性之间取得平衡。该方法通过在进化过程中使用成对图像的局部度量(paired-image metrics)实现快速评估,避免了传统分布度量所需的大量样本生成,从而在高维组合搜索空间中实现了高效可行的优化。

链接: https://arxiv.org/abs/2602.03625
作者: Estelle Chigot,Thomas Oberlin,Manon Huguenin,Dennis Wilson
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in International Conference on the Applications of Evolutionary Computation (Part of EvoStar), April 2026 (EvoApplications 2026)

点击查看摘要

Abstract:Semantic segmentation networks require large amounts of pixel-level annotated data, which are costly to obtain for real-world images. Computer graphics engines can generate synthetic images alongside their ground-truth annotations. However, models trained on such images can perform poorly on real images due to the domain gap between real and synthetic images. Style transfer methods can reduce this difference by applying a realistic style to synthetic images. Choosing effective data transformations and their sequence is difficult due to the large combinatorial search space of style transfer operators. Using multi-objective genetic algorithms, we optimize pipelines to balance structural coherence and style similarity to target domains. We study the use of paired-image metrics on individual image samples during evolution to enable rapid pipeline evaluation, as opposed to standard distributional metrics that require the generation of many images. After optimization, we evaluate the resulting Pareto front using distributional metrics and segmentation performance. We apply this approach to standard datasets in synthetic-to-real domain adaptation: from the video game GTA5 to real image datasets Cityscapes and ACDC, focusing on adverse conditions. Results demonstrate that evolutionary algorithms can propose diverse augmentation pipelines adapted to different objectives. The contribution of this work is the formulation of style transfer as a sequencing problem suitable for evolutionary optimization and the study of efficient metrics that enable feasible search in this space. The source code is available at: this https URL.
zh

[CV-24] Quasi-multimodal-based pathophysiological feature learning for retinal disease diagnosis

【速读】:该论文旨在解决眼科临床中多模态数据在视网膜疾病诊断中存在的数据异质性、潜在侵入性及配准复杂性等问题。其核心解决方案是提出一个统一的框架,通过合成包含眼底荧光血管造影(FFA)、多光谱成像(MSI)及强调潜在病灶与视盘/杯区域的显著图在内的多模态数据,并采用并行模型独立学习各模态特异性表征,再基于下游任务自适应地进行模态内与跨模态特征校准,实现信息剪枝与灵活融合,从而提升分类与分级性能。

链接: https://arxiv.org/abs/2602.03622
作者: Lu Zhang,Huizhen Yu,Zuowei Wang,Fu Gui,Yatu Guo,Wei Zhang,Mengyu Jia
机构: Tianjin University (天津大学); Tianjin Eye Institute (天津眼科研究所); Tianjin Eye Hospital (天津眼科医院); Tianjin Key Laboratory of Ophthalmology and Visual Science (天津市眼科与视觉科学重点实验室); Tianjin Medical University (天津医科大学); Nankai University Affiliated Eye Hospital (南开大学附属眼科医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注:

点击查看摘要

Abstract:Retinal diseases spanning a broad spectrum can be effectively identified and diagnosed using complementary signals from multimodal data. However, multimodal diagnosis in ophthalmic practice is typically challenged in terms of data heterogeneity, potential invasiveness, registration complexity, and so on. As such, a unified framework that integrates multimodal data synthesis and fusion is proposed for retinal disease classification and grading. Specifically, the synthesized multimodal data incorporates fundus fluorescein angiography (FFA), multispectral imaging (MSI), and saliency maps that emphasize latent lesions as well as optic disc/cup regions. Parallel models are independently trained to learn modality-specific representations that capture cross-pathophysiological signatures. These features are then adaptively calibrated within and across modalities to perform information pruning and flexible integration according to downstream tasks. The proposed learning system is thoroughly interpreted through visualizations in both image and feature spaces. Extensive experiments on two public datasets demonstrated the superiority of our approach over state-of-the-art ones in the tasks of multi-label classification (F1-score: 0.683, AUC: 0.953) and diabetic retinopathy grading (Accuracy:0.842, Kappa: 0.861). This work not only enhances the accuracy and efficiency of retinal disease screening but also offers a scalable framework for data augmentation across various medical imaging modalities.
zh

[CV-25] KTV: Keyframes and Key Tokens Selection for Efficient Training-Free Video LLM s

【速读】:该论文旨在解决训练-free视频理解中因视觉冗余和计算开销高而导致的效率与效果瓶颈问题,尤其针对现有基于CLIP相似度的关键帧选择策略易产生偏差、可能遗漏关键帧的缺陷。其解决方案的关键在于提出一个两阶段框架KTV:第一阶段通过聚类帧级视觉特征实现与问题无关的关键帧选择,生成紧凑且具有代表性的帧子集以缓解时间冗余;第二阶段则在选定的关键帧上进行视觉token重要性评估与冗余剪枝,显著减少输入大语言模型(LLM)的视觉token数量,从而提升计算效率并保持甚至超越现有方法的视频理解性能。

链接: https://arxiv.org/abs/2602.03615
作者: Baiyang Song,Jun Peng,Yuxin Zhang,Guangyao Chen,Feidiao Yang,Jianyuan Guo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Training-free video understanding leverages the strong image comprehension capabilities of pre-trained vision language models (VLMs) by treating a video as a sequence of static frames, thus obviating the need for costly video-specific training. However, this paradigm often suffers from severe visual redundancy and high computational overhead, especially when processing long videos. Crucially, existing keyframe selection strategies, especially those based on CLIP similarity, are prone to biases and may inadvertently overlook critical frames, resulting in suboptimal video comprehension. To address these significant challenges, we propose \textbfKTV, a novel two-stage framework for efficient and effective training-free video understanding. In the first stage, KTV performs question-agnostic keyframe selection by clustering frame-level visual features, yielding a compact, diverse, and representative subset of frames that mitigates temporal redundancy. In the second stage, KTV applies key visual token selection, pruning redundant or less informative tokens from each selected keyframe based on token importance and redundancy, which significantly reduces the number of tokens fed into the LLM. Extensive experiments on the Multiple-Choice VideoQA task demonstrate that KTV outperforms state-of-the-art training-free baselines while using significantly fewer visual tokens, \emphe.g., only 504 visual tokens for a 60-min video with 10800 frames, achieving 44.8% accuracy on the MLVU-Test benchmark. In particular, KTV also exceeds several training-based approaches on certain benchmarks.
zh

[CV-26] A Lightweight Library for Energy-Based Joint-Embedding Predictive Architectures

【速读】:该论文旨在解决自监督表示学习在视频和动作条件世界模型中的迁移问题,尤其是如何在不依赖生成式建模(Generative Modeling)的前提下,有效捕捉语义特征并支持下游任务。其解决方案的关键在于提出EB-JEPA,一个基于联合嵌入预测架构(Joint-Embedding Predictive Architectures, JEPAs)的开源库,通过在表示空间而非像素空间中进行预测,避免了生成模型的不稳定性和低效性,同时保持对语义信息的高保真度。该方法在图像、视频和动作条件世界模型三个层级上实现了模块化、可复现的训练流程,且单GPU即可在数小时内完成训练,显著提升了能源效率与可及性。实验表明,该框架在CIFAR-10上的探针分类准确率达91%,在Moving MNIST视频预测和Two Rooms导航任务中分别实现多步预测能力和97%的规划成功率,验证了其有效性与泛化能力。

链接: https://arxiv.org/abs/2602.03604
作者: Basile Terver,Randall Balestriero,Megi Dervishi,David Fan,Quentin Garrido,Tushar Nagarajan,Koustuv Sinha,Wancong Zhang,Mike Rabbat,Yann LeCun,Amir Bar
机构: Meta FAIR(Meta人工智能研究院); INRIA(法国国家信息与自动化研究院); New York University(纽约大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present EB-JEPA, an open-source library for learning representations and world models using Joint-Embedding Predictive Architectures (JEPAs). JEPAs learn to predict in representation space rather than pixel space, avoiding the pitfalls of generative modeling while capturing semantically meaningful features suitable for downstream tasks. Our library provides modular, self-contained implementations that illustrate how representation learning techniques developed for image-level self-supervised learning can transfer to video, where temporal dynamics add complexity, and ultimately to action-conditioned world models, where the model must additionally learn to predict the effects of control inputs. Each example is designed for single-GPU training within a few hours, making energy-based self-supervised learning accessible for research and education. We provide ablations of JEA components on CIFAR-10. Probing these representations yields 91% accuracy, indicating that the model learns useful features. Extending to video, we include a multi-step prediction example on Moving MNIST that demonstrates how the same principles scale to temporal modeling. Finally, we show how these representations can drive action-conditioned world models, achieving a 97% planning success rate on the Two Rooms navigation task. Comprehensive ablations reveal the critical importance of each regularization component for preventing representation collapse. Code is available at this https URL.
zh

[CV-27] Refer-Agent : A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation

【速读】:该论文旨在解决当前 referring video object segmentation (RVOS) 方法对大规模监督微调(SFT)的强依赖性及其在多模态大语言模型(MLLM)快速演进下的可扩展性不足问题,同时克服零样本方法因流程设计简单而导致性能显著落后于 SFT 方法的局限。解决方案的关键在于提出 Refer-Agent,一个具有交替推理-反思机制的协作式多智能体系统:首先通过“粗粒度到细粒度”的帧选择策略保障帧多样性与文本相关性,并引入动态关注布局(Dynamic Focus Layout)自适应调整视觉注意力;其次创新性地设计链式反思机制(Chain-of-Reflection),利用提问者-响应者对生成自我反思链,以验证中间结果并提供反馈用于下一轮推理优化,从而实现更精准、鲁棒且无需额外微调即可兼容新 MLLMs 的 RVOS 推理能力。

链接: https://arxiv.org/abs/2602.03595
作者: Haichao Jiang,Tianming Liang,Wei-Shi Zheng,Jian-Fang Hu
机构: Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Referring Video Object Segmentation (RVOS) aims to segment objects in videos based on textual queries. Current methods mainly rely on large-scale supervised fine-tuning (SFT) of Multi-modal Large Language Models (MLLMs). However, this paradigm suffers from heavy data dependence and limited scalability against the rapid evolution of MLLMs. Although recent zero-shot approaches offer a flexible alternative, their performance remains significantly behind SFT-based methods, due to the straightforward workflow designs. To address these limitations, we propose \textbfRefer-Agent, a collaborative multi-agent system with alternating reasoning-reflection mechanisms. This system decomposes RVOS into step-by-step reasoning process. During reasoning, we introduce a Coarse-to-Fine frame selection strategy to ensure the frame diversity and textual relevance, along with a Dynamic Focus Layout that adaptively adjusts the agent’s visual focus. Furthermore, we propose a Chain-of-Reflection mechanism, which employs a Questioner-Responder pair to generate a self-reflection chain, enabling the system to verify intermediate results and generates feedback for next-round reasoning refinement. Extensive experiments on five challenging benchmarks demonstrate that Refer-Agent significantly outperforms state-of-the-art methods, including both SFT-based models and zero-shot approaches. Moreover, Refer-Agent is flexible and enables fast integration of new MLLMs without any additional fine-tuning costs. Code will be released.
zh

[CV-28] IPS Over Tricks: Simple Prompts for Effective Zero-shot Anomaly Detection ICASSP’26

【速读】:该论文旨在解决零样本异常检测(Zero-Shot Anomaly Detection, ZSAD)中因目标域正常数据不可用而导致的检测性能受限问题,尤其针对现有方法依赖CLIP模型时存在的空间对齐粗略和对细粒度异常敏感性不足的局限。其解决方案的关键在于:首先选用具有空间感知能力的视觉语言模型TIPS作为骨干网络以缓解CLIP的空间错位问题;其次提出解耦提示机制——固定提示用于图像级检测,可学习提示用于像素级定位,并通过将局部特征证据注入全局得分来弥合全局与局部特征之间的分布差异。此方法无需CLIP特有的复杂辅助模块,在七个工业数据集上实现了图像级和像素级检测性能分别提升1.1–3.9%和1.5–6.9%,且架构简洁、泛化能力强。

链接: https://arxiv.org/abs/2602.03594
作者: Alireza Salehi,Ehsan Karami,Sepehr Noey,Sahand Noey,Makoto Yamada,Reshad Hosseini,Mohammad Sabokrou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This is the extended version of the paper accepted in ICASSP’26, which will be publicly available in May. Authors’ contributions may vary among the versions

点击查看摘要

Abstract:Anomaly detection identifies departures from expected behavior in safety-critical settings. When target-domain normal data are unavailable, zero-shot anomaly detection (ZSAD) leverages vision-language models (VLMs). However, CLIP’s coarse image-text alignment limits both localization and detection due to (i) spatial misalignment and (ii) weak sensitivity to fine-grained anomalies; prior work compensates with complex auxiliary modules yet largely overlooks the choice of backbone. We revisit the backbone and use TIPS-a VLM trained with spatially aware objectives. While TIPS alleviates CLIP’s issues, it exposes a distributional gap between global and local features. We address this with decoupled prompts-fixed for image-level detection and learnable for pixel-level localization-and by injecting local evidence into the global score. Without CLIP-specific tricks, our TIPS-based pipeline improves image-level performance by 1.1-3.9% and pixel-level by 1.5-6.9% across seven industrial datasets, delivering strong generalization with a lean architecture. Code is available at this http URL.
zh

[CV-29] High-Resolution Underwater Camouflaged Object Detection: GBU-UCOD Dataset and Topology-Aware and Frequency-Decoupled Networks

【速读】:该论文旨在解决水下伪装目标检测(Underwater Camouflaged Object Detection, UCOD)中因海洋深度变化导致的目标与背景视觉相似性高、细长生物拓扑结构碎片化以及透明生物特征提取困难等问题。解决方案的关键在于提出DeepTopo-Net框架,其核心创新包括:1)设计了Water-Conditioned Adaptive Perceptor (WCAP),利用黎曼度量张量动态调整卷积采样场以应对物理退化;2)引入Abyssal-Topology Refinement Module (ATRM),通过骨架先验保持细长目标的结构连通性。该方法在首个高分辨率(2K)海洋垂直分带基准GBU-UCOD上验证了优越性能,尤其在复杂水下形态完整性保持方面表现突出。

链接: https://arxiv.org/abs/2602.03591
作者: Wenji Wu,Shuo Ye,Yiyu Liu,Jiguang He,Zhuo Wang,Zitong Yu
机构: Harbin Engineering University (哈尔滨工程大学); Guangdong University of Technology (广东工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Underwater Camouflaged Object Detection (UCOD) is a challenging task due to the extreme visual similarity between targets and backgrounds across varying marine depths. Existing methods often struggle with topological fragmentation of slender creatures in the deep sea and the subtle feature extraction of transparent organisms. In this paper, we propose DeepTopo-Net, a novel framework that integrates topology-aware modeling with frequency-decoupled perception. To address physical degradation, we design the Water-Conditioned Adaptive Perceptor (WCAP), which employs Riemannian metric tensors to dynamically deform convolutional sampling fields. Furthermore, the Abyssal-Topology Refinement Module (ATRM) is developed to maintain the structural connectivity of spindly targets through skeletal priors. Specifically, we first introduce GBU-UCOD, the first high-resolution (2K) benchmark tailored for marine vertical zonation, filling the data gap for hadal and abyssal zones. Extensive experiments on MAS3K, RMAS, and our proposed GBU-UCOD datasets demonstrate that DeepTopo-Net achieves state-of-the-art performance, particularly in preserving the morphological integrity of complex underwater patterns. The datasets and codes will be released at this https URL.
zh

[CV-30] SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM NEURIPS2024

【速读】:该论文旨在解决当前视频大语言模型(Video Large Language Models, Vid-LLMs)在同时保持高质量帧级语义信息(即每帧足够的token数量)与全面的视频级时序信息(即每视频足够多的采样帧数)方面存在的瓶颈问题,从而推动Vid-LLMs向细粒度视频理解发展。其解决方案的关键在于提出一种名为SlowFocus的机制,该机制首先基于查询定位相关时序片段,随后对这一局部区域进行密集采样以提取高频率局部特征;并引入多频混合注意力模块,将这些局部高频细节与全局低频上下文融合,从而提升时序理解能力。此外,作者还设计了一套针对性训练策略以增强模型的时间定位和细粒度时序推理能力,并构建FineAction-CGR基准用于评估细粒度时序理解任务性能。

链接: https://arxiv.org/abs/2602.03589
作者: Ming Nie,Dan Ding,Chunwei Wang,Yuanfan Guo,Jianhua Han,Hang Xu,Li Zhang
机构: Fudan University (复旦大学); Huawei (华为); Yinwang Intelligent Technology Co., Ltd. (英伟达智能科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2024

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated exceptional capabilities in text understanding, which has paved the way for their expansion into video LLMs (Vid-LLMs) to analyze video data. However, current Vid-LLMs struggle to simultaneously retain high-quality frame-level semantic information (i.e., a sufficient number of tokens per frame) and comprehensive video-level temporal information (i.e., an adequate number of sampled frames per video). This limitation hinders the advancement of Vid-LLMs towards fine-grained video understanding. To address this issue, we introduce the SlowFocus mechanism, which significantly enhances the equivalent sampling frequency without compromising the quality of frame-level visual tokens. SlowFocus begins by identifying the query-related temporal segment based on the posed question, then performs dense sampling on this segment to extract local high-frequency features. A multi-frequency mixing attention module is further leveraged to aggregate these local high-frequency details with global low-frequency contexts for enhanced temporal comprehension. Additionally, to tailor Vid-LLMs to this innovative mechanism, we introduce a set of training strategies aimed at bolstering both temporal grounding and detailed temporal reasoning capabilities. Furthermore, we establish FineAction-CGR, a benchmark specifically devised to assess the ability of Vid-LLMs to process fine-grained temporal understanding tasks. Comprehensive experiments demonstrate the superiority of our mechanism across both existing public video understanding benchmarks and our proposed FineAction-CGR.
zh

[CV-31] ELIQ: A Label-Free Framework for Quality Assessment of Evolving AI-Generated Images

【速读】:该论文旨在解决生成式 AI(Generative AI)图像质量评估中因模型持续迭代导致人工标注失效的问题,尤其针对生成图像的视觉质量和提示词(prompt)一致性难以量化评估的挑战。其解决方案的关键在于提出 ELIQ 框架,通过自动构建正样本与特定维度负样本对(涵盖传统失真和 AIGC 特有失真模式),实现无需人工标签的可迁移监督;在此基础上,利用指令微调将预训练多模态模型转化为质量感知判别器,并采用轻量级门控融合与质量查询 Transformer(Quality Query Transformer)预测二维质量分数,从而在多个基准测试中显著优于现有无标签方法,并具备从 AI 生成内容(AIGC)到用户生成内容(UGC)场景的强泛化能力。

链接: https://arxiv.org/abs/2602.03558
作者: Xinyue Li,Zhiming Xu,Zhichao Zhang,Zhaolin Cai,Sijing Wu,Xiongkuo Min,Yitong Chen,Guangtao Zhai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Generative text-to-image models are advancing at an unprecedented pace, continuously shifting the perceptual quality ceiling and rendering previously collected labels unreliable for newer generations. To address this, we present ELIQ, a Label-free Framework for Quality Assessment of Evolving AI-generated Images. Specifically, ELIQ focuses on visual quality and prompt-image alignment, automatically constructs positive and aspect-specific negative pairs to cover both conventional distortions and AIGC-specific distortion modes, enabling transferable supervision without human annotations. Building on these pairs, ELIQ adapts a pre-trained multimodal model into a quality-aware critic via instruction tuning and predicts two-dimensional quality using lightweight gated fusion and a Quality Query Transformer. Experiments across multiple benchmarks demonstrate that ELIQ consistently outperforms existing label-free methods, generalizes from AI-generated content (AIGC) to user-generated content (UGC) scenarios without modification, and paves the way for scalable and label-free quality assessment under continuously evolving generative models. The code will be released upon publication.
zh

[CV-32] Cut to the Mix: Simple Data Augmentation Outperforms Elaborate Ones in Limited Organ Segmentation Datasets MICCAI2024

【速读】:该论文旨在解决多器官分割任务中因临床数据稀缺导致深度学习(Deep Learning, DL)模型训练受限的问题。其核心解决方案是引入基于图像间(inter-image)和对象级(object-level)的数据增强(Data Augmentation, DA)策略,以提升在小样本条件下模型的泛化能力和分割性能。关键创新在于系统评估了四种新型DA方法(CutMix、CarveMix、ObjectAug和AnatoMix),并发现CutMix在保持简单性的同时展现出最强的鲁棒性和有效性,即使生成的图像在直观上看似“错误”,仍能显著提升Dice分数(平均提高4.9),优于当前最先进的nnUNet模型。

链接: https://arxiv.org/abs/2602.03555
作者: Chang Liu,Fuxin Fan,Annette Schwarz,Andreas Maier
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at MICCAI 2024

点击查看摘要

Abstract:Multi-organ segmentation is a widely applied clinical routine and automated organ segmentation tools dramatically improve the pipeline of the radiologists. Recently, deep learning (DL) based segmentation models have shown the capacity to accomplish such a task. However, the training of the segmentation networks requires large amount of data with manual annotations, which is a major concern due to the data scarcity from clinic. Working with limited data is still common for researches on novel imaging modalities. To enhance the effectiveness of DL models trained with limited data, data augmentation (DA) is a crucial regularization technique. Traditional DA (TDA) strategies focus on basic intra-image operations, i.e. generating images with different orientations and intensity distributions. In contrast, the interimage and object-level DA operations are able to create new images from separate individuals. However, such DA strategies are not well explored on the task of multi-organ segmentation. In this paper, we investigated four possible inter-image DA strategies: CutMix, CarveMix, ObjectAug and AnatoMix, on two organ segmentation datasets. The result shows that CutMix, CarveMix and AnatoMix can improve the average dice score by 4.9, 2.0 and 1.9, compared with the state-of-the-art nnUNet without DA strategies. These results can be further improved by adding TDA strategies. It is revealed in our experiments that Cut-Mix is a robust but simple DA strategy to drive up the segmentation performance for multi-organ segmentation, even when CutMix produces intuitively ‘wrong’ images. Our implementation is publicly available for future benchmarks.
zh

[CV-33] AffordanceGrasp-R1:Leverag ing Reasoning -Based Affordance Segmentation with Reinforcement Learning for Robotic Grasping

【速读】:该论文旨在解决机器人抓取任务中因场景复杂性和语言指令多样性导致的推理能力不足与空间定位精度低的问题。其解决方案的关键在于提出了一种基于推理驱动的可操作性分割框架 AffordanceGrasp-R1,通过引入链式思维(Chain-of-Thought, CoT)冷启动策略与强化学习相结合的方式,提升模型在抓取决策中的逻辑推理能力和空间语义 grounding 能力;同时重构抓取流程,从全局场景点云中生成候选抓取位姿,并利用指令条件化的可操作性掩码进行筛选,从而实现更鲁棒、更具泛化性的语言引导抓取性能。

链接: https://arxiv.org/abs/2602.03547
作者: Dingyi Zhou,Mu He,Zhuowei Fang,Xiangtong Yao,Yinlong Liu,Alois Knoll,Hu Cao
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint version

点击查看摘要

Abstract:We introduce AffordanceGrasp-R1, a reasoning-driven affordance segmentation framework for robotic grasping that combines a chain-of-thought (CoT) cold-start strategy with reinforcement learning to enhance deduction and spatial grounding. In addition, we redesign the grasping pipeline to be more context-aware by generating grasp candidates from the global scene point cloud and subsequently filtering them using instruction-conditioned affordance masks. Extensive experiments demonstrate that AffordanceGrasp-R1 consistently outperforms state-of-the-art (SOTA) methods on benchmark datasets, and real-world robotic grasping evaluations further validate its robustness and generalization under complex language-conditioned manipulation scenarios.
zh

[CV-34] Constrained Dynamic Gaussian Splatting

【速读】:该论文旨在解决动态高斯点阵(Dynamic Gaussian Splatting)在边缘设备部署时面临的根本性矛盾:无约束的点云密化会导致内存消耗过高,而启发式剪枝策略则难以在预设高斯点预算下实现最优渲染质量。解决方案的关键在于提出约束型动态高斯点阵(Constrained Dynamic Gaussian Splatting, CDGS),其核心创新是引入一个可微分的预算控制器(differentiable budget controller),作为优化驱动机制;该控制器基于多模态统一重要性评分(multi-modal unified importance score),融合几何、运动与感知线索,实现对高斯点容量的精确调控。此外,通过解耦静态与动态要素的优化并采用自适应分配机制,以及三阶段训练策略和双模式混合压缩方案,CDGS 在严格满足硬件预算的前提下显著提升率-失真帕累托前沿性能,相较当前最优方法实现超过3倍的压缩比。

链接: https://arxiv.org/abs/2602.03538
作者: Zihan Zheng,Zhenglong Wu,Xuanxuan Wang,Houqiang Zhong,Xiaoyun Zhang,Qiang Hu,Guangtao Zhai,Wenjun Zhang
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While Dynamic Gaussian Splatting enables high-fidelity 4D reconstruction, its deployment is severely hindered by a fundamental dilemma: unconstrained densification leads to excessive memory consumption incompatible with edge devices, whereas heuristic pruning fails to achieve optimal rendering quality under preset Gaussian budgets. In this work, we propose Constrained Dynamic Gaussian Splatting (CDGS), a novel framework that formulates dynamic scene reconstruction as a budget-constrained optimization problem to enforce a strict, user-defined Gaussian budget during training. Our key insight is to introduce a differentiable budget controller as the core optimization driver. Guided by a multi-modal unified importance score, this controller fuses geometric, motion, and perceptual cues for precise capacity regulation. To maximize the utility of this fixed budget, we further decouple the optimization of static and dynamic elements, employing an adaptive allocation mechanism that dynamically distributes capacity based on motion complexity. Furthermore, we implement a three-phase training strategy to seamlessly integrate these constraints, ensuring precise adherence to the target count. Coupled with a dual-mode hybrid compression scheme, CDGS not only strictly adheres to hardware constraints (error 2%) but also pushes the Pareto frontier of rate-distortion performance. Extensive experiments demonstrate that CDGS delivers optimal rendering quality under varying capacity limits, achieving over 3x compression compared to state-of-the-art methods.
zh

[CV-35] PnP-U3D: Plug-and-Play 3D Framework Bridging Autoregression and Diffusion for Unified Understanding and Generation

【速读】:该论文旨在解决当前3D理解与生成任务难以统一建模的问题,尤其是在现有基于自回归(Autoregressive, AR)范式的统一框架中,由于强制信号量化和高昂的训练成本导致性能显著下降。其解决方案的关键在于摒弃对单一AR范式的依赖,转而提出一种融合自回归与扩散模型(Diffusion Model)的统一架构:在3D理解任务中采用自回归的token预测机制,在3D生成任务中使用连续扩散过程,并通过一个轻量级Transformer桥接大语言模型特征空间与3D扩散模型条件空间,从而实现跨模态信息高效交互,同时保留各独立模型的先验知识并降低训练成本。

链接: https://arxiv.org/abs/2602.03533
作者: Yongwei Chen,Tianyi Wei,Yushi Lan,Zhaoyang Lyu,Shangchen Zhou,Xudong Xu,Xingang Pan
机构: Nanyang Technological University (南洋理工大学); Oxford Robotics Institute (牛津机器人研究所); Peking-Johns Hopkins Center for Computational Biology (北京大学-约翰霍普金斯大学计算生物学中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Yongwei Chen and Tianyi Wei contributed equally. Project page: this https URL

点击查看摘要

Abstract:The rapid progress of large multimodal models has inspired efforts toward unified frameworks that couple understanding and generation. While such paradigms have shown remarkable success in 2D, extending them to 3D remains largely underexplored. Existing attempts to unify 3D tasks under a single autoregressive (AR) paradigm lead to significant performance degradation due to forced signal quantization and prohibitive training cost. Our key insight is that the essential challenge lies not in enforcing a unified autoregressive paradigm, but in enabling effective information interaction between generation and understanding while minimally compromising their inherent capabilities and leveraging pretrained models to reduce training cost. Guided by this perspective, we present the first unified framework for 3D understanding and generation that combines autoregression with diffusion. Specifically, we adopt an autoregressive next-token prediction paradigm for 3D understanding, and a continuous diffusion paradigm for 3D generation. A lightweight transformer bridges the feature space of large language models and the conditional space of 3D diffusion models, enabling effective cross-modal information exchange while preserving the priors learned by standalone models. Extensive experiments demonstrate that our framework achieves state-of-the-art performance across diverse 3D understanding and generation benchmarks, while also excelling in 3D editing tasks. These results highlight the potential of unified AR+diffusion models as a promising direction for building more general-purpose 3D intelligence.
zh

[CV-36] Robust Representation Learning in Masked Autoencoders

【速读】:该论文旨在解决掩码自编码器(Masked Autoencoders, MAEs)在图像分类任务中表现出色但其内部表征机制尚不清晰的问题。解决方案的关键在于通过层级分析token嵌入,揭示预训练与微调后MAE逐步构建类感知的潜在空间:不同类别的嵌入分布在网络深度上逐渐分离,形成可区分的子空间;同时发现MAE在编码器各层中表现出早期且持续的全局注意力模式,区别于标准视觉Transformer(Vision Transformers, ViTs)。此外,作者引入两个敏感性指标——干净与扰动嵌入间的方向对齐度及头部层面在退化条件下保留活跃特征的能力,量化了特征鲁棒性,从而解释了MAE在模糊和遮挡等退化场景下仍保持优异分类性能的原因。

链接: https://arxiv.org/abs/2602.03531
作者: Anika Shrivastava,Renu Rameshan,Samar Agnihotri
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 8 figures, and 3 tables

点击查看摘要

Abstract:Masked Autoencoders (MAEs) achieve impressive performance in image classification tasks, yet the internal representations they learn remain less understood. This work started as an attempt to understand the strong downstream classification performance of MAE. In this process we discover that representations learned with the pretraining and fine-tuning, are quite robust - demonstrating a good classification performance in the presence of degradations, such as blur and occlusions. Through layer-wise analysis of token embeddings, we show that pretrained MAE progressively constructs its latent space in a class-aware manner across network depth: embeddings from different classes lie in subspaces that become increasingly separable. We further observe that MAE exhibits early and persistent global attention across encoder layers, in contrast to standard Vision Transformers (ViTs). To quantify feature robustness, we introduce two sensitivity indicators: directional alignment between clean and perturbed embeddings, and head-wise retention of active features under degradations. These studies help establish the robust classification performance of MAEs.
zh

[CV-37] Interpretable Logical Anomaly Classification via Constraint Decomposition and Instruction Fine-Tuning

【速读】:该论文旨在解决工业图像中逻辑异常(Logical Anomaly)的细粒度分类问题,即不仅检测是否存在异常,还需明确指出违反了哪条预定义的逻辑规则(如对象数量、空间布局或组成关系),从而提升质量保证的可解释性和实用性。解决方案的关键在于提出LogiCls框架,其核心是将复杂的逻辑约束分解为一系列可验证的子查询(subqueries),并利用数据驱动的指令合成流水线生成链式思维(Chain-of-Thought, CoT)监督信号,结合精确的定位标注与多样化的图像-文本增强,使视觉语言模型(Vision-Language Model, VLM)具备逻辑敏感推理能力;同时引入难度感知重采样策略以稳定训练过程,聚焦于困难子查询和长尾约束类型,最终实现高效、准确且可解释的逻辑异常分类。

链接: https://arxiv.org/abs/2602.03530
作者: Xufei Zhang,Xinjiao Zhou,Ziling Deng,Dongdong Geng,Jianxiong Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 6 figures

点击查看摘要

Abstract:Logical anomalies are violations of predefined constraints on object quantity, spatial layout, and compositional relationships in industrial images. While prior work largely treats anomaly detection as a binary decision, such formulations cannot indicate which logical rule is broken and therefore offer limited value for quality assurance. We introduce Logical Anomaly Classification (LAC), a task that unifies anomaly detection and fine-grained violation classification in a single inference step. To tackle LAC, we propose LogiCls, a vision-language framework that decomposes complex logical constraints into a sequence of verifiable subqueries. We further present a data-centric instruction synthesis pipeline that generates chain-of-thought (CoT) supervision for these subqueries, coupling precise grounding annotations with diverse image-text augmentations to adapt vision language models (VLMs) to logic-sensitive reasoning. Training is stabilized by a difficulty-aware resampling strategy that emphasizes challenging subqueries and long tail constraint types. Extensive experiments demonstrate that LogiCls delivers robust, interpretable, and accurate industrial logical anomaly classification, providing both the predicted violation categories and their evidence trails.
zh

[CV-38] Semantic Routing: Exploring Multi-Layer LLM Feature Weighting for Diffusion Transformers

【速读】:该论文旨在解决当前基于DiT(Diffusion Transformer)的文本到图像生成模型中,文本条件编码静态化与动态生成过程不匹配的问题。具体而言,现有方法通常仅使用大型语言模型(LLM)的单一层次隐藏状态作为文本条件,忽视了LLM层间存在的显著语义层次结构以及扩散过程中随时间步和网络深度变化的非平稳去噪特性。为更好地对齐扩散生成的动态过程并提升生成能力,作者提出一种统一的归一化凸融合框架,通过轻量级门控机制实现多层LLM隐藏状态在时间维度、深度维度及联合维度上的系统性融合。其关键创新在于引入深度语义路由(Depth-wise Semantic Routing)策略,该策略被实验证明能显著增强文本-图像对齐性和组合生成能力(如GenAI-Bench计数任务提升9.97分),并揭示出单纯依赖时间维度融合可能因训练-推理轨迹不一致而损害视觉保真度,从而强调了轨迹感知信号对于实现鲁棒的时间依赖条件注入的重要性。

链接: https://arxiv.org/abs/2602.03510
作者: Bozhou Li,Yushuo Guan,Haolin Li,Bohan Zeng,Yiyan Ji,Yue Ding,Pengfei Wan,Kun Gai,Yuanxing Zhang,Wentao Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent DiT-based text-to-image models increasingly adopt LLMs as text encoders, yet text conditioning remains largely static and often utilizes only a single LLM layer, despite pronounced semantic hierarchy across LLM layers and non-stationary denoising dynamics over both diffusion time and network depth. To better match the dynamic process of DiT generation and thereby enhance the diffusion model’s generative capability, we introduce a unified normalized convex fusion framework equipped with lightweight gates to systematically organize multi-layer LLM hidden states via time-wise, depth-wise, and joint fusion. Experiments establish Depth-wise Semantic Routing as the superior conditioning strategy, consistently improving text-image alignment and compositional generation (e.g., +9.97 on the GenAI-Bench Counting task). Conversely, we find that purely time-wise fusion can paradoxically degrade visual generation fidelity. We attribute this to a train-inference trajectory mismatch: under classifier-free guidance, nominal timesteps fail to track the effective SNR, causing semantically mistimed feature injection during inference. Overall, our results position depth-wise routing as a strong and effective baseline and highlight the critical need for trajectory-aware signals to enable robust time-dependent conditioning.
zh

[CV-39] Scaling Continual Learning with Bi-Level Routing Mixture-of-Experts

【速读】:该论文旨在解决持续学习(Continual Learning, CL)中,尤其是在类增量学习(Class-Incremental Learning, CIL)场景下,如何在保持模型稳定性与可塑性的前提下,有效学习判别性和全面性的特征表示这一开放问题。其核心挑战在于长期任务序列中特征表示的退化与灾难性遗忘。解决方案的关键在于提出CaRE框架,其中引入了高效的双层路由混合专家(Bi-Level Routing Mixture-of-Experts, BR-MoE)机制:第一层为路由器选择阶段,动态激活与当前任务相关的任务特定路由器;第二层为专家路由阶段,进一步动态激活并聚合专家模块,从而将判别性和全面性的特征表示注入网络每一中间层。该设计显著提升了模型在超长任务序列(100–300个非重叠任务)上的性能表现,并首次实现了在如此大规模任务序列上超越所有基线方法的持续学习能力。

链接: https://arxiv.org/abs/2602.03473
作者: Meng Lou,Yunxiang Fu,Yizhou Yu
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Continual learning, especially class-incremental learning (CIL), on the basis of a pre-trained model (PTM) has garnered substantial research interest in recent years. However, how to effectively learn both discriminative and comprehensive feature representations while maintaining stability and plasticity over very long task sequences remains an open problem. We propose CaRE, a scalable Continual Learner with efficient Bi-Level Routing Mixture-of-Experts (BR-MoE). The core idea of BR-MoE is a bi-level routing mechanism: a router selection stage that dynamically activates relevant task-specific routers, followed by an expert routing phase that dynamically activates and aggregates experts, aiming to inject discriminative and comprehensive representations into every intermediate network layer. On the other hand, we introduce a challenging evaluation protocol for comprehensively assessing CIL methods across very long task sequences spanning hundreds of tasks. Extensive experiments show that CaRE demonstrates leading performance across a variety of datasets and task settings, including commonly used CIL datasets with classical CIL settings (e.g., 5-20 tasks). To the best of our knowledge, CaRE is the first continual learner that scales to very long task sequences (ranging from 100 to over 300 non-overlapping tasks), while outperforming all baselines by a large margin on such task sequences. Code will be publicly released at this https URL.
zh

[CV-40] Inlier-Centric Post-Training Quantization for Object Detection Models

【速读】:该论文旨在解决目标检测任务中因背景杂波、传感器噪声等任务无关形态引发的冗余激活(anomalies)问题,这些问题会扩大激活范围并扭曲激活分布,从而干扰量化过程中的比特分配,并削弱重要特征的保留。解决方案的关键在于提出一种无标签、即插即用的后训练量化方法 InlierQ,其核心是通过计算梯度感知的体积显著性得分(gradient-aware volume saliency scores),利用期望最大化(Expectation-Maximization, EM)算法拟合该得分的后验分布,从而区分并抑制异常激活,同时保留对任务有益的内点(inliers)。此方法仅需64个校准样本即可实现高效且稳定的性能提升,在COCO和nuScenes数据集上的2D/3D相机与LiDAR目标检测任务中均表现出一致的量化误差降低。

链接: https://arxiv.org/abs/2602.03472
作者: Minsu Kim,Dongyeun Lee,Jaemyung Yu,Jiwan Hur,Giseop Kim,Junmo Kim
机构: KAIST(韩国科学技术院); NAVER AI Lab.(NAVER人工智能实验室); DGIST(大邱庆北科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Object detection is pivotal in computer vision, yet its immense computational demands make deployment slow and power-hungry, motivating quantization. However, task-irrelevant morphologies such as background clutter and sensor noise induce redundant activations (or anomalies). These anomalies expand activation ranges and skew activation distributions toward task-irrelevant responses, complicating bit allocation and weakening the preservation of informative features. Without a clear criterion to distinguish anomalies, suppressing them can inadvertently discard useful information. To address this, we present InlierQ, an inlier-centric post-training quantization approach that separates anomalies from informative inliers. InlierQ computes gradient-aware volume saliency scores, classifies each volume as an inlier or anomaly, and fits a posterior distribution over these scores using the Expectation-Maximization (EM) algorithm. This design suppresses anomalies while preserving informative features. InlierQ is label-free, drop-in, and requires only 64 calibration samples. Experiments on the COCO and nuScenes benchmarks show consistent reductions in quantization error for camera-based (2D and 3D) and LiDAR-based (3D) object detection.
zh

[CV-41] Contextualized Visual Personalization in Vision-Language Models

【速读】:该论文试图解决视觉语言模型(Vision-Language Models, VLMs)在面对新图像时无法基于用户特定经验生成个性化响应的问题,即缺乏将视觉输入与用户累积的视觉-文本上下文进行关联的能力。其核心挑战被定义为“情境化视觉个性化”(contextualized visual personalization),要求VLM在解释新图像时能够识别并检索用户的个性化视觉体验。解决方案的关键在于提出CoViP框架,将个性化图像描述(personalized image captioning)作为核心任务,并通过强化学习后训练(reinforcement-learning-based post-training)和基于描述增强的生成(caption-augmented generation)来提升该能力;同时引入诊断性评估以排除仅依赖文本捷径的可能,从而验证模型是否真正利用了视觉上下文。实验表明,现有开源及专有VLM存在显著局限,而CoViP不仅提升了个性化图像描述性能,还在下游个性化任务中实现了整体改进,凸显其在实现鲁棒且可泛化的视觉情境个性化中的关键作用。

链接: https://arxiv.org/abs/2602.03454
作者: Yeongtak Oh,Sangwon Yu,Junsung Park,Han Cheol Moon,Jisoo Mok,Sungroh Yoon
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Despite recent progress in vision-language models (VLMs), existing approaches often fail to generate personalized responses based on the user’s specific experiences, as they lack the ability to associate visual inputs with a user’s accumulated visual-textual context. We newly formalize this challenge as contextualized visual personalization, which requires the visual recognition and textual retrieval of personalized visual experiences by VLMs when interpreting new images. To address this issue, we propose CoViP, a unified framework that treats personalized image captioning as a core task for contextualized visual personalization and improves this capability through reinforcement-learning-based post-training and caption-augmented generation. We further introduce diagnostic evaluations that explicitly rule out textual shortcut solutions and verify whether VLMs truly leverage visual context. Extensive experiments demonstrate that existing open-source and proprietary VLMs exhibit substantial limitations, while CoViP not only improves personalized image captioning but also yields holistic gains across downstream personalization tasks. These results highlight CoViP as a crucial stage for enabling robust and generalizable contextualized visual personalization.
zh

[CV-42] Hierarchical Concept-to-Appearance Guidance for Multi-Subject Image Generation

【速读】:该论文旨在解决多主体图像生成(multi-subject image generation)中普遍存在的身份不一致性和组合控制能力有限的问题。现有方法依赖扩散模型隐式关联文本提示与参考图像,导致生成结果在多个参考主体的身份保持和属性绑定上表现不佳。解决方案的关键在于提出分层的概念到外观引导框架(Hierarchical Concept-to-Appearance Guidance, CAG),其核心创新包括:在概念层面引入VAE dropout训练策略,随机丢弃参考图像的VAE特征以增强模型对视觉语言模型(VLM)提供的语义信号的依赖,从而提升无完整外观线索下的概念一致性;在外观层面,将VLM提取的跨模态对应关系嵌入扩散Transformer(DiT)中的对应感知掩码注意力模块,限制每个文本标记仅关注其匹配的参考区域,实现精确的属性绑定与可靠的多主体组合生成。

链接: https://arxiv.org/abs/2602.03448
作者: Yijia Xu,Zihao Wang,Jinshi Cui
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-subject image generation aims to synthesize images that faithfully preserve the identities of multiple reference subjects while following textual instructions. However, existing methods often suffer from identity inconsistency and limited compositional control, as they rely on diffusion models to implicitly associate text prompts with reference images. In this work, we propose Hierarchical Concept-to-Appearance Guidance (CAG), a framework that provides explicit, structured supervision from high-level concepts to fine-grained appearances. At the conceptual level, we introduce a VAE dropout training strategy that randomly omits reference VAE features, encouraging the model to rely more on robust semantic signals from a Visual Language Model (VLM) and thereby promoting consistent concept-level generation in the absence of complete appearance cues. At the appearance level, we integrate the VLM-derived correspondences into a correspondence-aware masked attention module within the Diffusion Transformer (DiT). This module restricts each text token to attend only to its matched reference regions, ensuring precise attribute binding and reliable multi-subject composition. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the multi-subject image generation, substantially improving prompt following and subject consistency.
zh

[CV-43] HetroD: A High-Fidelity Drone Dataset and Benchmark for Autonomous Driving in Heterogeneous Traffic ICRA

【速读】:该论文旨在解决自动驾驶系统在异质交通环境(heterogeneous traffic)中面临的挑战,尤其是如何有效应对由弱势道路使用者(Vulnerable Road Users, VRUs)——如行人、骑行者和摩托车手——与机动车共同参与的复杂交互行为。现有数据集多聚焦于结构化、车道规范的交通场景,难以覆盖诸如“钩形转弯”(hook turns)、“车道穿行”(lane splitting)及非正式路权协商等真实世界中的非结构化行为,导致当前预测与规划模型性能受限。解决方案的关键在于构建HetroD数据集,其通过无人机采集的大规模高精度轨迹数据(超过65.4k条,70%来自VRUs),结合厘米级标注、高清地图(HD maps)和交通信号状态信息,提供对异质交通场景的全景观测;同时开发模块化工具包以提取单个代理(agent)的场景片段,支持下游任务如预测、规划与仿真建模。实验表明,现有先进模型在处理VRU横向移动、非结构化操作以及密集多代理交互时表现不佳,凸显了该数据集在推动更鲁棒的异质交通感知与决策方法发展上的重要价值。

链接: https://arxiv.org/abs/2602.03447
作者: Yu-Hsiang Chen,Wei-Jer Chang,Christian Kotulla,Thomas Keutgens,Steffen Runde,Tobias Moers,Christoph Klas,Wei Zhan,Masayoshi Tomizuka,Yi-Ting Chen
机构: National Yang Ming Chiao Tung University (国立阳明交通大学); UC Berkeley (加州大学伯克利分校); fka GmbH (fka GmbH)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE International Conference on Robotics and Automation (ICRA) 2026

点击查看摘要

Abstract:We present HetroD, a dataset and benchmark for developing autonomous driving systems in heterogeneous environments. HetroD targets the critical challenge of navi- gating real-world heterogeneous traffic dominated by vulner- able road users (VRUs), including pedestrians, cyclists, and motorcyclists that interact with vehicles. These mixed agent types exhibit complex behaviors such as hook turns, lane splitting, and informal right-of-way negotiation. Such behaviors pose significant challenges for autonomous vehicles but remain underrepresented in existing datasets focused on structured, lane-disciplined traffic. To bridge the gap, we collect a large- scale drone-based dataset to provide a holistic observation of traffic scenes with centimeter-accurate annotations, HD maps, and traffic signal states. We further develop a modular toolkit for extracting per-agent scenarios to support downstream task development. In total, the dataset comprises over 65.4k high- fidelity agent trajectories, 70% of which are from VRUs. HetroD supports modeling of VRU behaviors in dense, het- erogeneous traffic and provides standardized benchmarks for forecasting, planning, and simulation tasks. Evaluation results reveal that state-of-the-art prediction and planning models struggle with the challenges presented by our dataset: they fail to predict lateral VRU movements, cannot handle unstructured maneuvers, and exhibit limited performance in dense and multi-agent scenarios, highlighting the need for more robust approaches to heterogeneous traffic. See our project page for more examples: this https URL
zh

[CV-44] ConsistentRFT: Reducing Visual Hallucinations in Flow-based Reinforcement Fine-Tuning

【速读】:该论文旨在解决基于流模型(flow-based models)的强化微调(Reinforcement Fine-Tuning, RFT)过程中出现的视觉幻觉(visual hallucinations)问题,包括过度优化细节和语义错位等现象。其核心原因在于两个方面:一是随机微分方程(SDE)轨迹 rollout 期间探索不足,导致模型过度关注局部细节而忽略全局语义;二是策略梯度方法固有的轨迹拟合过程会扭曲模型的基础向量场(vector field)及其跨步一致性。解决方案的关键在于提出 ConsistentRFT 框架,包含两个创新机制:动态粒度 rollout(Dynamic Granularity Rollout, DGR),通过动态调度不同噪声源来平衡全局语义与局部细节的探索;以及一致策略梯度优化(Consistent Policy Gradient Optimization, CPGO),通过将当前策略与更稳定的先验对齐以保持模型一致性。实验表明,该方法在降低低层次和高层次感知幻觉方面分别减少49%和38%,并在域外指标上显著优于基线方法。

链接: https://arxiv.org/abs/2602.03425
作者: Xiaofeng Tan,Jun Liu,Yuanting Fan,Bin-Bin Gao,Xi Jiang,Xiaochen Chen,Jinlong Peng,Chengjie Wang,Hongsong Wang,Feng Zheng
机构: Southeast University (东南大学); Southern University of Science and Technology (南方科技大学); Tencent Youtu Lab (腾讯优图实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reinforcement Fine-Tuning (RFT) on flow-based models is crucial for preference alignment. However, they often introduce visual hallucinations like over-optimized details and semantic misalignment. This work preliminarily explores why visual hallucinations arise and how to reduce them. We first investigate RFT methods from a unified perspective, and reveal the core problems stemming from two aspects, exploration and exploitation: (1) limited exploration during stochastic differential equation (SDE) rollouts, leading to an over-emphasis on local details at the expense of global semantics, and (2) trajectory imitation process inherent in policy gradient methods, distorting the model’s foundational vector field and its cross-step consistency. Building on this, we propose ConsistentRFT, a general framework to mitigate these hallucinations. Specifically, we design a Dynamic Granularity Rollout (DGR) mechanism to balance exploration between global semantics and local details by dynamically scheduling different noise sources. We then introduce a Consistent Policy Gradient Optimization (CPGO) that preserves the model’s consistency by aligning the current policy with a more stable prior. Extensive experiments demonstrate that ConsistentRFT significantly mitigates visual hallucinations, achieving average reductions of 49% for low-level and 38% for high-level perceptual hallucinations. Furthermore, ConsistentRFT outperforms other RFT methods on out-of-domain metrics, showing an improvement of 5.1% (v.s. the baseline’s decrease of -0.4%) over this http URL. This is \hrefthis https URLProject Page.
zh

[CV-45] Origin Lens: A Privacy-First Mobile Framework for Cryptographic Image Provenance and AI Detection

【速读】:该论文旨在解决生成式 AI(Generative AI)带来的视觉虚假信息(visual disinformation)问题,即如何在保障用户隐私的前提下实现信息真实性的端到端验证。其核心挑战在于将模型治理(model governance)与终端用户的验证需求有效衔接,同时避免依赖中心化服务器进行检测。解决方案的关键在于提出 Origin Lens 框架——一个基于 Rust/Flutter 的隐私优先型移动系统,通过本地化的加密图像溯源验证(cryptographic image provenance verification)和生成模型指纹识别(generative model fingerprints),结合可选的检索增强验证(retrieval-augmented verification),为用户提供消费时的分级置信度指标(graded confidence indicators)。该设计实现了去中心化、低延迟且符合欧盟《人工智能法案》(EU AI Act)与《数字服务法案》(DSA)要求的验证基础设施。

链接: https://arxiv.org/abs/2602.03423
作者: Alexander Loth,Dominique Conceicao Rosario,Peter Ebinger,Martin Kappes,Marc-Oliver Pahl
机构: Frankfurt University of Applied Sciences (法兰克福应用科学大学); IMT Atlantique (IMT大西洋); Frankfurt University of Applied Sciences (法兰克福应用科学大学)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: Accepted at ACM TheWebConf '26 Companion

点击查看摘要

Abstract:The proliferation of generative AI poses challenges for information integrity assurance, requiring systems that connect model governance with end-user verification. We present Origin Lens, a privacy-first mobile framework that targets visual disinformation through a layered verification architecture. Unlike server-side detection systems, Origin Lens performs cryptographic image provenance verification and AI detection locally on the device via a Rust/Flutter hybrid architecture. Our system integrates multiple signals - including cryptographic provenance, generative model fingerprints, and optional retrieval-augmented verification - to provide users with graded confidence indicators at the point of consumption. We discuss the framework’s alignment with regulatory requirements (EU AI Act, DSA) and its role in verification infrastructure that complements platform-level mechanisms.
zh

[CV-46] Socratic-Geo: Synthetic Data Generation and Geometric Reasoning via Multi-Agent Interaction

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在几何推理能力上的显著短板,其根本原因在于高质量图像-文本对数据的极度稀缺。传统人工标注成本过高,而自动化方法难以保证数据的真实性与训练有效性,现有策略或被动适应有限图像资源,或采用低效的随机探索加过滤机制,导致生成与学习过程脱节。解决方案的关键在于提出一个完全自主的框架 Socratic-Geo,通过多智能体交互实现数据合成与模型学习的动态耦合:其中 Teacher 代理生成带反思反馈(Reflect 判断可解性、RePI 判断视觉合理性)的参数化 Python 脚本以确保图像-文本对纯净度;Solver 代理通过偏好学习优化推理能力,并利用失败路径指导 Teacher 进行针对性增强;Generator 独立学习基于积累的“图像-代码-指令”三元组进行图像生成,将程序化绘图知识蒸馏为视觉生成能力。该设计使模型仅用 108 个种子问题即可显著提升性能,验证了数据生成与学习闭环协同的有效性。

链接: https://arxiv.org/abs/2602.03414
作者: Zhengbo Jiao,Shaobo Wang,Zifan Zhang,Wei Wang,Bing Zhao,Hu Wei,Linfeng Zhang
机构: Alibaba Inc. (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18pages

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have significantly advanced vision-language understanding. However, even state-of-the-art models struggle with geometric reasoning, revealing a critical bottleneck: the extreme scarcity of high-quality image-text pairs. Human annotation is prohibitively expensive, while automated methods fail to ensure fidelity and training effectiveness. Existing approaches either passively adapt to available images or employ inefficient random exploration with filtering, decoupling generation from learning needs. We propose Socratic-Geo, a fully autonomous framework that dynamically couples data synthesis with model learning through multi-agent interaction. The Teacher agent generates parameterized Python scripts with reflective feedback (Reflect for solvability, RePI for visual validity), ensuring image-text pair purity. The Solver agent optimizes reasoning through preference learning, with failure paths guiding Teacher’s targeted augmentation. Independently, the Generator learns image generation capabilities on accumulated “image-code-instruction” triplets, distilling programmatic drawing intelligence into visual generation. Starting from only 108 seed problems, Socratic-Solver achieves 49.11 on six benchmarks using one-quarter of baseline data, surpassing strong baselines by 2.43 points. Socratic-Generator achieves 42.4% on GenExam, establishing new state-of-the-art for open-source models, surpassing Seedream-4.0 (39.8%) and approaching Gemini-2.5-Flash-Image (43.1%).
zh

[CV-47] UnHype: CLIP-Guided Hypernetworks for Dynamic LoRA Unlearning

【速读】:该论文旨在解决生成式 AI(Generative AI)模型在训练过程中可能习得并保留有害或社会敏感内容的问题,尤其是如何高效、精准地实现“机器遗忘”(machine unlearning),即从模型中移除特定概念而不损害其整体生成能力。现有基于低秩适应(Low-Rank Adaptation, LoRA)的方法虽具效率优势,但在语义适应性、相关概念去除非干扰性以及多概念同时擦除的可扩展性方面存在局限。论文提出 UnHype 框架,其核心创新在于引入超网络(hypernetwork),通过 CLIP 嵌入动态生成适配的 LoRA 权重,从而实现上下文感知、稳定且可扩展的概念擦除,有效提升了单概念与多概念场景下的控制精度和泛化能力。

链接: https://arxiv.org/abs/2602.03410
作者: Piotr Wójcik,Maksym Petrenko,Wojciech Gromski,Przemysław Spurek,Maciej Zieba
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in large-scale diffusion models have intensified concerns about their potential misuse, particularly in generating realistic yet harmful or socially disruptive content. This challenge has spurred growing interest in effective machine unlearning, the process of selectively removing specific knowledge or concepts from a model without compromising its overall generative capabilities. Among various approaches, Low-Rank Adaptation (LoRA) has emerged as an effective and efficient method for fine-tuning models toward targeted unlearning. However, LoRA-based methods often exhibit limited adaptability to concept semantics and struggle to balance removing closely related concepts with maintaining generalization across broader meanings. Moreover, these methods face scalability challenges when multiple concepts must be erased simultaneously. To address these limitations, we introduce UnHype, a framework that incorporates hypernetworks into single- and multi-concept LoRA training. The proposed architecture can be directly plugged into Stable Diffusion as well as modern flow-based text-to-image models, where it demonstrates stable training behavior and effective concept control. During inference, the hypernetwork dynamically generates adaptive LoRA weights based on the CLIP embedding, enabling more context-aware, scalable unlearning. We evaluate UnHype across several challenging tasks, including object erasure, celebrity erasure, and explicit content removal, demonstrating its effectiveness and versatility. Repository: this https URL.
zh

[CV-48] From Vicious to Virtuous Cycles: Synergistic Representation Learning for Unsupervised Video Object-Centric Learning ICLR2026

【速读】:该论文旨在解决无监督以对象为中心的学习模型(如基于slot的架构)中,编码器生成的高频率锐利注意力图与解码器输出的空间一致性但模糊的重建图之间的根本性矛盾问题。这种不匹配导致恶性循环:编码器的噪声特征迫使解码器对多种可能性进行平均,产生更模糊的输出;而模糊的重建梯度又缺乏高频细节,无法有效监督编码器特征的学习。解决方案的关键在于提出协同表示学习(Synergistic Representation Learning, SRL),其通过建立一个良性循环机制,使编码器和解码器相互优化——编码器利用其锐利特征去模糊解码器输出中的语义边界,解码器则凭借其空间一致性来去除编码器特征中的噪声。该过程通过一个带有slot正则化目标的预热阶段稳定,该阶段初始时为每个slot分配独立实体,从而弥合编码器与解码器之间的表征差距,最终在视频对象中心学习基准上达到最优性能。

链接: https://arxiv.org/abs/2602.03390
作者: Hyun Seok Seong,WonJun Moon,Jae-Pil Heo
机构: Sungkyunkwan University (成均馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: ICLR 2026; Code is available at this https URL

点击查看摘要

Abstract:Unsupervised object-centric learning models, particularly slot-based architectures, have shown great promise in decomposing complex scenes. However, their reliance on reconstruction-based training creates a fundamental conflict between the sharp, high-frequency attention maps of the encoder and the spatially consistent but blurry reconstruction maps of the decoder. We identify that this discrepancy gives rise to a vicious cycle: the noisy feature map from the encoder forces the decoder to average over possibilities and produce even blurrier outputs, while the gradient computed from blurry reconstruction maps lacks high-frequency details necessary to supervise encoder features. To break this cycle, we introduce Synergistic Representation Learning (SRL) that establishes a virtuous cycle where the encoder and decoder mutually refine one another. SRL leverages the encoder’s sharpness to deblur the semantic boundary within the decoder output, while exploiting the decoder’s spatial consistency to denoise the encoder’s features. This mutual refinement process is stabilized by a warm-up phase with a slot regularization objective that initially allocates distinct entities per slot. By bridging the representational gap between the encoder and decoder, SRL achieves state-of-the-art results on video object-centric learning benchmarks. Codes are available at this https URL.
zh

[CV-49] Seeing Through the Chain: Mitigate Hallucination in Multimodal Reasoning Models via CoT Compression and Contrastive Preference Optimization

【速读】:该论文旨在解决多模态推理模型(Multimodal Reasoning Models, MLRMs)中存在的幻觉问题,即模型在生成回答时可能产生与输入视觉信息不符的错误内容。其核心解决方案为C3PO框架,关键在于两个方面:一是通过思维链压缩(Chain-of-Thought Compression),筛选冗余的文本推理token,从而构建更紧凑且信号高效的思维链(CoT)表示,缓解模型过度依赖语言先验而忽视视觉输入的问题;二是引入对比偏好优化(Contrastive Preference Optimization),利用高质量AI反馈构建训练样本,并设计多模态幻觉诱导机制以激发模型固有幻觉模式,提供对比负样本用于纠正错误推理路径,从而系统性降低幻觉发生概率。

链接: https://arxiv.org/abs/2602.03380
作者: Hao Fang,Jinyu Li,Jiawei Kong,Tianqu Zhuang,Kuofeng Gao,Bin Chen,Shu-Tao Xia,Yaowei Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While multimodal reasoning models (MLRMs) have exhibited impressive capabilities, they remain prone to hallucinations, and effective solutions are still underexplored. In this paper, we experimentally analyze the hallucination cause and propose C3PO, a training-based mitigation framework comprising \textbfChain-of-Thought \textbfCompression and \textbfContrastive \textbfPreference \textbfOptimization. Firstly, we identify that introducing reasoning mechanisms exacerbates models’ reliance on language priors while overlooking visual inputs, which can produce CoTs with reduced visual cues but redundant text tokens. To this end, we propose to selectively filter redundant thinking tokens for a more compact and signal-efficient CoT representation that preserves task-relevant information while suppressing noise. In addition, we observe that the quality of the reasoning trace largely determines whether hallucination emerges in subsequent responses. To leverage this insight, we introduce a reasoning-enhanced preference tuning scheme that constructs training pairs using high-quality AI feedback. We further design a multimodal hallucination-inducing mechanism that elicits models’ inherent hallucination patterns via carefully crafted inducers, yielding informative negative signals for contrastive correction. We provide theoretical justification for the effectiveness and demonstrate consistent hallucination reduction across diverse MLRMs and benchmarks.
zh

[CV-50] PlanTRansformer: Unified Prediction and Planning with Goal-conditioned Transformer

【速读】:该论文旨在解决自动驾驶中轨迹预测与规划模块之间的脱节问题:预测模型通常基于多模态分布输出周围交通参与者的行为,但缺乏对意图的显式建模,而规划器则依赖于明确的意图信息来生成可行且安全的轨迹。这种不匹配导致预测无法有效指导规划,反之亦然。解决方案的关键在于提出一种统一的高斯混合Transformer框架——Plan TRansformer (PTR),其核心创新包括:1)引入目标条件化的预测机制以增强意图感知;2)通过教师-学生训练策略在训练过程中逐步掩码周围车辆的控制指令,使模型在推理阶段适应意图不可知的实际场景;3)整合动态可行性约束、交互感知及车道级拓扑推理能力。该设计实现了预测与规划的一体化协同优化,显著提升了预测精度(mAP提升4.3%/3.5%)和规划误差降低(5秒时域下减少15.5%)。

链接: https://arxiv.org/abs/2602.03376
作者: Constantin Selzer,Fabina B. Flohr
机构: Munich University of Applied Science (慕尼黑应用技术大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted and accepted at IEEE IV 2026

点击查看摘要

Abstract:Trajectory prediction and planning are fundamental yet disconnected components in autonomous driving. Prediction models forecast surrounding agent motion under unknown intentions, producing multimodal distributions, while planning assumes known ego objectives and generates deterministic trajectories. This mismatch creates a critical bottleneck: prediction lacks supervision for agent intentions, while planning requires this information. Existing prediction models, despite strong benchmarking performance, often remain disconnected from planning constraints such as collision avoidance and dynamic feasibility. We introduce Plan TRansformer (PTR), a unified Gaussian Mixture Transformer framework integrating goal-conditioned prediction, dynamic feasibility, interaction awareness, and lane-level topology reasoning. A teacher-student training strategy progressively masks surrounding agent commands during training to align with inference conditions where agent intentions are unavailable. PTR achieves 4.3%/3.5% improvement in marginal/joint mAP compared to the baseline Motion Transformer (MTR) and 15.5% planning error reduction at 5s horizon compared to GameFormer. The architecture-agnostic design enables application to diverse Transformer-based prediction models. Project Website: this https URL
zh

[CV-51] Unifying Watermarking via Dimension-Aware Mapping

【速读】:该论文旨在解决现有深度水印方法在架构相似但功能行为差异显著的问题,试图从功能层面统一不同水印方法。其解决方案的关键在于提出DiM(Dimension-aware Multi-dimensional Watermarking)框架,将水印建模为一种维度感知的映射问题,通过定义不同维度的水印载体(如一维二进制消息、二维空间掩码和三维时空结构),明确嵌入与提取维度配置对水印行为的决定性影响:同维映射可保持原始数据结构并实现精细控制,跨维映射则支持空间或时空定位能力。实验表明,仅调整嵌入与提取维度即可实现多种水印功能,如时空篡改定位、局部嵌入控制及帧扰乱下的时序恢复,无需改变网络架构。

链接: https://arxiv.org/abs/2602.03373
作者: Jiale Meng,Runyi Hu,Jie Zhang,Zheming Lu,Ivor Tsang,Tianwei Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages, 25 figures

点击查看摘要

Abstract:Deep watermarking methods often share similar encoder-decoder architectures, yet differ substantially in their functional behaviors. We propose DiM, a new multi-dimensional watermarking framework that formulates watermarking as a dimension-aware mapping problem, thereby unifying existing watermarking methods at the functional level. Under DiM, watermark information is modeled as payloads of different dimensionalities, including one-dimensional binary messages, two-dimensional spatial masks, and three-dimensional spatiotemporal structures. We find that the dimensional configuration of embedding and extraction largely determines the resulting watermarking behavior. Same-dimensional mappings preserve payload structure and support fine-grained control, while cross-dimensional mappings enable spatial or spatiotemporal localization. We instantiate DiM in the video domain, where spatiotemporal representations enable a broader set of dimension mappings. Experiments demonstrate that varying only the embedding and extraction dimensions, without architectural changes, leads to different watermarking capabilities, including spatiotemporal tamper localization, local embedding control, and recovery of temporal order under frame disruptions.
zh

[CV-52] SLIM-Diff: Shared Latent Image-Mask Diffusion with Lp loss for Data-Scarce Epilepsy FLAIR MRI

【速读】:该论文旨在解决癫痫患者中局灶性皮质发育不良(Focal Cortical Dysplasia, FCD)病变在FLAIR磁共振成像(MRI)中表现细微且稀少的问题,这使得联合图像-掩码生成建模容易出现不稳定性和过拟合现象。其解决方案的关键在于提出了一种紧凑的联合扩散模型SLIM-Diff,核心创新包括:(i) 采用单个共享瓶颈的U-Net结构,从双通道图像+掩码表示中强制解剖结构与病灶几何形状之间的紧密耦合;(ii) 通过可调的L_p损失函数进行几何优化,其中实验表明x₀预测(即原始数据预测)在联合合成任务中始终最优,而分数阶次二次惩罚(如L₁.₅)能提升图像保真度,L₂损失则更有利于保持病灶掩码形态结构。

链接: https://arxiv.org/abs/2602.03372
作者: Mario Pascual-González,Ariadna Jiménez-Partinen,R.M. Luque-Baena,Fátima Nagib-Raya,Ezequiel López-Rubio
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 2 figures, 1 table, conference paper

点击查看摘要

Abstract:Focal cortical dysplasia (FCD) lesions in epilepsy FLAIR MRI are subtle and scarce, making joint image–mask generative modeling prone to instability and memorization. We propose SLIM-Diff, a compact joint diffusion model whose main contributions are (i) a single shared-bottleneck U-Net that enforces tight coupling between anatomy and lesion geometry from a 2-channel image+mask representation, and (ii) loss-geometry tuning via a tunable L_p objective. As an internal baseline, we include the canonical DDPM-style objective ( \epsilon -prediction with L_2 loss) and isolate the effect of prediction parameterization and L_p geometry under a matched setup. Experiments show that x_0 -prediction is consistently the strongest choice for joint synthesis, and that fractional sub-quadratic penalties ( L_1.5 ) improve image fidelity while L_2 better preserves lesion mask morphology. Our code and model weights are available in this https URL
zh

[CV-53] Multi-Resolution Alignment for Voxel Sparsity in Camera-Based 3D Semantic Scene Completion

【速读】:该论文旨在解决基于摄像头的3D语义场景补全(3D Semantic Scene Completion, SSC)中因体素稀疏性导致的优化效率低和模型性能受限问题。现有方法仅依赖体素标签进行监督,而自动驾驶场景中大量体素为空,造成训练信号不足。其解决方案的关键在于提出多分辨率对齐(Multi-Resolution Alignment, MRA)框架,通过引入场景级与实例级的跨分辨率特征对齐作为辅助监督:首先设计多分辨率视图变换器模块(Multi-resolution View Transformer),将2D图像特征投影至多分辨率3D空间并融合判别性种子特征实现场景级对齐;其次引入立方体语义各向异性模块(Cubic Semantic Anisotropy),量化每个体素与其邻域在语义上的差异以识别实例级重要性;最后构建关键分布对齐模块(Critical Distribution Alignment),基于语义各向异性选择关键体素作为实例锚点,并施加循环损失以保证不同分辨率下关键特征分布的一致性,从而缓解体素稀疏问题并提升模型性能。

链接: https://arxiv.org/abs/2602.03371
作者: Zhiwen Yang,Yuxin Peng
机构: Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 6 figures, accepted by TIP 2026

点击查看摘要

Abstract:Camera-based 3D semantic scene completion (SSC) offers a cost-effective solution for assessing the geometric occupancy and semantic labels of each voxel in the surrounding 3D scene with image inputs, providing a voxel-level scene perception foundation for the perception-prediction-planning autonomous driving systems. Although significant progress has been made in existing methods, their optimization rely solely on the supervision from voxel labels and face the challenge of voxel sparsity as a large portion of voxels in autonomous driving scenarios are empty, which limits both optimization efficiency and model performance. To address this issue, we propose a \textitMulti-Resolution Alignment (MRA) approach to mitigate voxel sparsity in camera-based 3D semantic scene completion, which exploits the scene and instance level alignment across multi-resolution 3D features as auxiliary supervision. Specifically, we first propose the Multi-resolution View Transformer module, which projects 2D image features into multi-resolution 3D features and aligns them at the scene level through fusing discriminative seed features. Furthermore, we design the Cubic Semantic Anisotropy module to identify the instance-level semantic significance of each voxel, accounting for the semantic differences of a specific voxel against its neighboring voxels within a cubic area. Finally, we devise a Critical Distribution Alignment module, which selects critical voxels as instance-level anchors with the guidance of cubic semantic anisotropy, and applies a circulated loss for auxiliary supervision on the critical feature distribution consistency across different resolutions. The code is available at this https URL.
zh

[CV-54] Symbol-Aware Reasoning with Masked Discrete Diffusion for Handwritten Mathematical Expression Recognition

【速读】:该论文旨在解决手写数学表达式识别(Handwritten Mathematical Expression Recognition, HMER)中因符号多样性与二维结构布局复杂性导致的推理难题,尤其是自回归模型存在的暴露偏差(exposure bias)和语法不一致问题。其解决方案的关键在于提出一种离散扩散框架(discrete diffusion framework),将HMER任务重新建模为符号迭代精炼过程而非顺序生成,通过多步重掩码(multi-step remasking)逐步优化符号及其结构关系,从而消除因果依赖并提升结构一致性;同时引入符号感知分词(symbol-aware tokenization)和随机掩码互学习(Random-Masking Mutual Learning)机制,增强语法对齐能力与对手写多样性的鲁棒性。

链接: https://arxiv.org/abs/2602.03370
作者: Takaya Kawakatsu,Ryo Ishiyama
机构: Preferred Networks, Inc.(Preferred Networks, Inc.); Kyushu University(九州大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Handwritten Mathematical Expression Recognition (HMER) requires reasoning over diverse symbols and 2D structural layouts, yet autoregressive models struggle with exposure bias and syntactic inconsistency. We present a discrete diffusion framework that reformulates HMER as iterative symbolic refinement instead of sequential generation. Through multi-step remasking, the proposal progressively refines both symbols and structural relations, removing causal dependencies and improving structural consistency. A symbol-aware tokenization and Random-Masking Mutual Learning further enhance syntactic alignment and robustness to handwriting diversity. On the MathWriting benchmark, the proposal achieves 5.56% CER and 60.42% EM, outperforming strong Transformer and commercial baselines. Consistent gains on CROHME 2014–2023 demonstrate that discrete diffusion provides a new paradigm for structure-aware visual recognition beyond generative modeling.
zh

[CV-55] Z3D: Zero-Shot 3D Visual Grounding from Images

【速读】:该论文旨在解决零样本三维视觉定位(3D visual grounding, 3DVG)问题,即仅依赖多视角图像实现自然语言查询下的3D场景中目标物体定位,且不需几何监督或物体先验信息。其解决方案的关键在于提出Z3D通用定位流程:首先采用先进的零样本3D实例分割方法生成高质量的3D边界框候选区域,其次通过基于提示的分割策略利用现代视觉语言模型(Vision-Language Models, VLMs)的强大推理能力,从而有效克服现有零样本方法中的性能瓶颈。

链接: https://arxiv.org/abs/2602.03361
作者: Nikita Drozdov,Andrey Lemeshko,Nikita Gavrilov,Anton Konushin,Danila Rukhovich,Maksim Kolodiazhnyi
机构: Lomonosov Moscow State University (莫斯科国立大学); Higher School of Economics (高等经济学院); M3L Lab, Institute of Mechanics, Armenia (M3L 实验室,力学研究所,亚美尼亚)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D visual grounding (3DVG) aims to localize objects in a 3D scene based on natural language queries. In this work, we explore zero-shot 3DVG from multi-view images alone, without requiring any geometric supervision or object priors. We introduce Z3D, a universal grounding pipeline that flexibly operates on multi-view images while optionally incorporating camera poses and depth maps. We identify key bottlenecks in prior zero-shot methods causing significant performance degradation and address them with (i) a state-of-the-art zero-shot 3D instance segmentation method to generate high-quality 3D bounding box proposals and (ii) advanced reasoning via prompt-based segmentation, which utilizes full capabilities of modern VLMs. Extensive experiments on the ScanRefer and Nr3D benchmarks demonstrate that our approach achieves state-of-the-art performance among zero-shot methods. Code is available at this https URL .
zh

[CV-56] d Prompts: Overcoming Prompt Underspecification in Image and Video Super-Resolution

【速读】:该论文旨在解决文本条件扩散模型在图像和视频超分辨率任务中因使用全局单一提示(global caption)而导致的提示信息不足(prompt underspecification)问题,具体表现为局部细节缺失(prompt sparsity)和局部无关引导(prompt misguidance),这些问题在采用潜在空间分块(latent tiling)进行高分辨率扩展时尤为显著,并可能被无分类器引导(classifier-free guidance)放大。解决方案的关键在于提出“分块提示”(Tiled Prompts)框架,为每个潜在分块生成特定的局部提示,从而在局部文本条件后验分布下执行超分辨率,实现高信息量的引导,有效缓解提示信息不足问题,且计算开销极低。

链接: https://arxiv.org/abs/2602.03342
作者: Bryan Sangwoo Kim,Jonghyun Park,Jong Chul Ye
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 7 figures

点击查看摘要

Abstract:Text-conditioned diffusion models have advanced image and video super-resolution by using prompts as semantic priors, but modern super-resolution pipelines typically rely on latent tiling to scale to high resolutions, where a single global caption causes prompt underspecification. A coarse global prompt often misses localized details (prompt sparsity) and provides locally irrelevant guidance (prompt misguidance) that can be amplified by classifier-free guidance. We propose Tiled Prompts, a unified framework for image and video super-resolution that generates a tile-specific prompt for each latent tile and performs super-resolution under locally text-conditioned posteriors, providing high-information guidance that resolves prompt underspecification with minimal overhead. Experiments on high resolution real-world images and videos show consistent gains in perceptual quality and text alignment, while reducing hallucinations and tile-level artifacts relative to global-prompt baselines.
zh

[CV-57] Composable Visual Tokenizers with Generator-Free Diagnostics of Learnability

【速读】:该论文旨在解决现有视觉分词器(visual tokenizer)在生成任务中缺乏组合性控制能力的问题,即如何使分词后的token不仅能够有效重建图像,还能支持对图像语义进行细粒度编辑和组合操作。解决方案的关键在于提出CompTok训练框架,其核心创新包括:1)采用基于信息生成对抗网络(InfoGAN-style)的目标函数,通过训练识别模型从解码图像中预测用于条件控制的token,从而迫使扩散解码器充分利用所有token;2)引入跨图像token子集交换的训练样本,增强token的组合控制能力;3)利用对抗流正则化(adversarial flow regularizer)施加流形约束,确保未配对的交换生成图像仍位于自然图像分布上。该方法显著提升了分词器在类别条件图像生成中的性能,并实现了高级语义编辑能力。

链接: https://arxiv.org/abs/2602.03339
作者: Bingchen Zhao,Qiushan Guo,Ye Wang,Yixuan Huang,Zhonghua Zhai,Yu Tian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce CompTok, a training framework for learning visual tokenizers whose tokens are enhanced for compositionality. CompTok uses a token-conditioned diffusion decoder. By employing an InfoGAN-style objective, where we train a recognition model to predict the tokens used to condition the diffusion decoder using the decoded images, we enforce the decoder to not ignore any of the tokens. To promote compositional control, besides the original images, CompTok also trains on tokens formed by swapping token subsets between images, enabling more compositional control of the token over the decoder. As the swapped tokens between images do not have ground truth image targets, we apply a manifold constraint via an adversarial flow regularizer to keep unpaired swap generations on the natural-image distribution. The resulting tokenizer not only achieves state-of-the-art performance on image class-conditioned generation, but also demonstrates properties such as swapping tokens between images to achieve high level semantic editing of an image. Additionally, we propose two metrics that measures the landscape of the token space that can be useful to describe not only the compositionality of the tokens, but also how easy to learn the landscape is for a generator to be trained on this space. We show in experiments that CompTok can improve on both of the metrics as well as supporting state-of-the-art generators for class conditioned generation.
zh

[CV-58] PWAVEP: Purifying Imperceptible Adversarial Perturbations in 3D Point Clouds via Spectral Graph Wavelets WWW2026

【速读】:该论文旨在解决3D点云在面对对抗攻击时,如何实现高效且非侵入式的防御问题,尤其针对空间不可感知扰动(spatial imperceptibility)与高攻击性能带来的挑战。现有防御方法普遍存在需修改模型结构、训练成本高或依赖辅助数据等局限性。其解决方案的关键在于提出一种基于频域的即插即用式净化框架PWAVEP,该框架通过计算每个点的谱图小波域显著性得分和局部稀疏性得分,采用分层策略:首先剔除最具显著性的点(被识别为难以恢复的对抗异常点),同时对中等显著性点实施谱滤波,利用图小波变换抑制与目标点相关的高频系数,从而有效消除对抗噪声。这一机制在不改变原始模型的前提下实现了对3D点云的鲁棒净化,显著提升了准确率与抗干扰能力。

链接: https://arxiv.org/abs/2602.03333
作者: Haoran Li,Renyang Liu,Hongjia Liu,Chen Wang,Long Yin,Jian Xu
机构: Northeastern University (东北大学); National University of Singapore (新加坡国立大学); Shenyang University of Technology (沈阳工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by WWW 2026

点击查看摘要

Abstract:Recent progress in adversarial attacks on 3D point clouds, particularly in achieving spatial imperceptibility and high attack performance, presents significant challenges for defenders. Current defensive approaches remain cumbersome, often requiring invasive model modifications, expensive training procedures or auxiliary data access. To address these threats, in this paper, we propose a plug-and-play and non-invasive defense mechanism in the spectral domain, grounded in a theoretical and empirical analysis of the relationship between imperceptible perturbations and high-frequency spectral components. Building upon these insights, we introduce a novel purification framework, termed PWAVEP, which begins by computing a spectral graph wavelet domain saliency score and local sparsity score for each point. Guided by these values, PWAVEP adopts a hierarchical strategy, it eliminates the most salient points, which are identified as hardly recoverable adversarial outliers. Simultaneously, it applies a spectral filtering process to a broader set of moderately salient points. This process leverages a graph wavelet transform to attenuate high-frequency coefficients associated with the targeted points, thereby effectively suppressing adversarial noise. Extensive evaluations demonstrate that the proposed PWAVEP achieves superior accuracy and robustness compared to existing approaches, advancing the state-of-the-art in 3D point cloud purification. Code and datasets are available at this https URL
zh

[CV-59] Pi-GS: Sparse-View Gaussian Splatting with Dense π3 Initialization

【速读】:该论文旨在解决在稀疏视图场景下,3D高斯泼溅(3D Gaussian Splatting, 3DGS)方法因依赖精确相机位姿和高质量点云初始化而难以应用的问题。现有基于学习的点云估计方法通常需要可靠的参考视角,且对位姿或深度误差敏感,难以在低资源条件下稳定运行。解决方案的关键在于提出一种无需参考的点云估计网络 π3\pi^3,并结合不确定性引导的深度监督、法向一致性损失和深度畸变(depth warping)的正则化策略,有效缓解几何不准确性,从而在Tanks and Temples、LLFF、DTU和MipNeRF360等多个基准数据集上实现最先进的重建性能。

链接: https://arxiv.org/abs/2602.03327
作者: Manuel Hofer,Markus Steinberger,Thomas Köhler
机构: Graz University of Technology (格拉茨工业大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Novel view synthesis has evolved rapidly, advancing from Neural Radiance Fields to 3D Gaussian Splatting (3DGS), which offers real-time rendering and rapid training without compromising visual fidelity. However, 3DGS relies heavily on accurate camera poses and high-quality point cloud initialization, which are difficult to obtain in sparse-view scenarios. While traditional Structure from Motion (SfM) pipelines often fail in these settings, existing learning-based point estimation alternatives typically require reliable reference views and remain sensitive to pose or depth errors. In this work, we propose a robust method utilizing \pi^3, a reference-free point cloud estimation network. We integrate dense initialization from \pi^3 with a regularization scheme designed to mitigate geometric inaccuracies. Specifically, we employ uncertainty-guided depth supervision, normal consistency loss, and depth warping. Experimental results demonstrate that our approach achieves state-of-the-art performance on the Tanks and Temples, LLFF, DTU, and MipNeRF360 datasets.
zh

[CV-60] MedSAM-Agent : Empowering Interactive Medical Image Segmentation with Multi-turn Agent ic Reinforcement Learning

【速读】:该论文旨在解决当前医学图像分割中基于多模态大语言模型(Multi-modal Large Language Models, MLLMs)的自主代理在交互式分割任务中存在的两个关键问题:一是现有方法通常采用单轮、僵化的交互策略,难以充分利用交互工具(如Segment Anything Model, SAM)的动态潜力;二是训练过程中缺乏过程层面的监督机制,导致冗余操作频发、决策效率低下。解决方案的关键在于提出MedSAM-Agent框架,其核心创新包括:1)设计一种混合提示策略(hybrid prompting strategy),通过专家标注的交互轨迹生成,使模型内化人类决策启发式与自适应优化策略;2)构建两阶段训练流程,融合多轮端到端结果验证与临床真实性导向的过程奖励设计(clinical-fidelity process reward),从而促进交互简洁性与决策高效性。该方案显著提升了跨6种医学模态和21个数据集上的分割性能,实现了自主医学推理与迭代优化的统一。

链接: https://arxiv.org/abs/2602.03320
作者: Shengyuan Liu,Liuxin Bao,Qi Yang,Wanting Geng,Boyun Zheng,Chenxin Li,Wenting Chen,Houwen Peng,Yixuan Yuan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 23 Pages, 4 Figures

点击查看摘要

Abstract:Medical image segmentation is evolving from task-specific models toward generalizable frameworks. Recent research leverages Multi-modal Large Language Models (MLLMs) as autonomous agents, employing reinforcement learning with verifiable reward (RLVR) to orchestrate specialized tools like the Segment Anything Model (SAM). However, these approaches often rely on single-turn, rigid interaction strategies and lack process-level supervision during training, which hinders their ability to fully exploit the dynamic potential of interactive tools and leads to redundant actions. To bridge this gap, we propose MedSAM-Agent, a framework that reformulates interactive segmentation as a multi-step autonomous decision-making process. First, we introduce a hybrid prompting strategy for expert-curated trajectory generation, enabling the model to internalize human-like decision heuristics and adaptive refinement strategies. Furthermore, we develop a two-stage training pipeline that integrates multi-turn, end-to-end outcome verification with a clinical-fidelity process reward design to promote interaction parsimony and decision efficiency. Extensive experiments across 6 medical modalities and 21 datasets demonstrate that MedSAM-Agent achieves state-of-the-art performance, effectively unifying autonomous medical reasoning with robust, iterative optimization. Code is available \hrefthis https URLhere.
zh

[CV-61] Invisible Clean-Label Backdoor Attacks for Generative Data Augmentation

【速读】:该论文旨在解决生成式数据增强(Generative Data Augmentation)在实际应用中面临的清洁标签后门攻击(Clean-Label Backdoor Attack)问题,尤其是现有像素级攻击方法(如COMBAT)在生成图像上效果不佳的问题。其解决方案的关键在于从像素空间转向潜在特征空间(Latent Feature Space),提出了一种名为InvLBA的不可见清洁标签后门攻击方法,通过潜在扰动实现对生成图像的高效攻击,理论上保证了干净准确率与攻击成功率的泛化性能,实验表明该方法平均提升攻击成功率46.43%,同时保持高鲁棒性并几乎不降低干净准确率。

链接: https://arxiv.org/abs/2602.03316
作者: Ting Xiang,Jinhui Zhao,Changjian Chen,Zhuo Tang
机构: Hunan University (湖南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the rapid advancement of image generative models, generative data augmentation has become an effective way to enrich training images, especially when only small-scale datasets are available. At the same time, in practical applications, generative data augmentation can be vulnerable to clean-label backdoor attacks, which aim to bypass human inspection. However, based on theoretical analysis and preliminary experiments, we observe that directly applying existing pixel-level clean-label backdoor attack methods (e.g., COMBAT) to generated images results in low attack success rates. This motivates us to move beyond pixel-level triggers and focus instead on the latent feature level. To this end, we propose InvLBA, an invisible clean-label backdoor attack method for generative data augmentation by latent perturbation. We theoretically prove that the generalization of the clean accuracy and attack success rates of InvLBA can be guaranteed. Experiments on multiple datasets show that our method improves the attack success rate by 46.43% on average, with almost no reduction in clean accuracy and high robustness against SOTA defense methods.
zh

[CV-62] PQTNet: Pixel-wise Quantitative Thermography Neural Network for Estimating Defect Depth in Polylactic Acid Parts by Additive Manufacturing

【速读】:该论文旨在解决增材制造(Additive Manufacturing, AM)构件中缺陷深度定量表征的难题,这是非破坏性检测(Non-Destructive Testing, NDT)领域长期存在的挑战。其核心解决方案是提出一种像素级定量热成像神经网络(Pixel-wise Quantitative Thermography Neural Network, PQT-Net),关键创新在于设计了一种新颖的数据增强策略,将热序列数据重构为二维条带图像,从而完整保留每个像素处热量扩散的时序演化信息;同时,PQT-Net采用预训练的EfficientNetV2-S作为骨干网络,并引入可学习参数的残差回归头(Residual Regression Head, RRH)以优化输出精度。实验表明,该方法在PLA材料样品上实现了最小平均绝对误差(MAE)0.0094 mm、决定系数(R)超过99%的高精度缺陷深度预测性能,展现出在AM构件中实现鲁棒定量缺陷表征的巨大潜力。

链接: https://arxiv.org/abs/2602.03314
作者: Lei Deng,Wenhao Huang,Chao Yang,Haoyuan Zheng,Yinbin Tian,Yue Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review

点击查看摘要

Abstract:Defect depth quantification in additively manufactured (AM) components remains a significant challenge for non-destructive testing (NDT). This study proposes a Pixel-wise Quantitative Thermography Neural Network (PQT-Net) to address this challenge for polylactic acid (PLA) parts. A key innovation is a novel data augmentation strategy that reconstructs thermal sequence data into two-dimensional stripe images, preserving the complete temporal evolution of heat diffusion for each pixel. The PQT-Net architecture incorporates a pre-trained EfficientNetV2-S backbone and a custom Residual Regression Head (RRH) with learnable parameters to refine outputs. Comparative experiments demonstrate the superiority of PQT-Net over other deep learning models, achieving a minimum Mean Absolute Error (MAE) of 0.0094 mm and a coefficient of determination ® exceeding 99%. The high precision of PQT-Net underscores its potential for robust quantitative defect characterization in AM.
zh

[CV-63] RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization

【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人领域面临的三大挑战:数据稀缺、架构效率低下以及跨硬件平台泛化能力不足。其核心解决方案是提出RDT2——一个基于7B参数视觉语言模型(Vision Language Model, VLM)的机器人基础模型,通过构建一个可扩展的、与具体机械臂形态无关的通用操作接口(Universal Manipulation Interface, UMI),收集了超过10,000小时的多样化机器人演示数据;并设计了一种三阶段训练策略,结合残差向量量化(Residual Vector Quantization, RVQ)、流匹配(flow-matching)和知识蒸馏技术,实现离散语言知识与连续控制信号的有效对齐,从而在未见物体、场景、指令甚至机器人平台上实现零样本部署,显著提升复杂任务如击打乒乓球等高精度、长时程动态操作的性能。

链接: https://arxiv.org/abs/2602.03310
作者: Songming Liu,Bangguo Li,Kai Ma,Lingxuan Wu,Hengkai Tan,Xiao Ouyang,Hang Su,Jun Zhu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models hold promise for generalist robotics but currently struggle with data scarcity, architectural inefficiencies, and the inability to generalize across different hardware platforms. We introduce RDT2, a robotic foundation model built upon a 7B parameter VLM designed to enable zero-shot deployment on novel embodiments for open-vocabulary tasks. To achieve this, we collected one of the largest open-source robotic datasets–over 10,000 hours of demonstrations in diverse families–using an enhanced, embodiment-agnostic Universal Manipulation Interface (UMI). Our approach employs a novel three-stage training recipe that aligns discrete linguistic knowledge with continuous control via Residual Vector Quantization (RVQ), flow-matching, and distillation for real-time inference. Consequently, RDT2 becomes one of the first models that simultaneously zero-shot generalizes to unseen objects, scenes, instructions, and even robotic platforms. Besides, it outperforms state-of-the-art baselines in dexterous, long-horizon, and dynamic downstream tasks like playing table tennis. See this https URL for more information.
zh

[CV-64] Full end-to-end diagnostic workflow automation of 3D OCT via foundation model-driven AI for retinal diseases

【速读】:该论文旨在解决光学相干断层扫描(Optical Coherence Tomography, OCT)在临床实践中诊断自动化程度不足的问题,其核心挑战在于传统多阶段工作流程和单切片、单任务的AI模型难以实现端到端的3D retinal疾病诊断。解决方案的关键在于提出一种基于基础模型(foundation model)驱动的全流程OCT临床应用系统(Full-process OCT-based Clinical Utility System, FOCUS),该系统通过EfficientNetV2-S进行图像质量评估,利用微调后的视觉基础模型完成异常检测与多病种分类,并创新性地引入统一自适应聚合方法,将二维切片级预测智能整合为三维患者级诊断结果,从而实现了从图像到诊断的全自动闭环流程。

链接: https://arxiv.org/abs/2602.03302
作者: Jinze Zhang,Jian Zhong,Li Lin,Jiaxiong Li,Ke Ma,Naiyang Li,Meng Li,Yuan Pan,Zeyu Meng,Mengyun Zhou,Shang Huang,Shilong Yu,Zhengyu Duan,Sutong Li,Honghui Xia,Juping Liu,Dan Liang,Yantao Wei,Xiaoying Tang,Jin Yuan,Peng Xiao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Optical coherence tomography (OCT) has revolutionized retinal disease diagnosis with its high-resolution and three-dimensional imaging nature, yet its full diagnostic automation in clinical practices remains constrained by multi-stage workflows and conventional single-slice single-task AI models. We present Full-process OCT-based Clinical Utility System (FOCUS), a foundation model-driven framework enabling end-to-end automation of 3D OCT retinal disease diagnosis. FOCUS sequentially performs image quality assessment with EfficientNetV2-S, followed by abnormality detection and multi-disease classification using a fine-tuned Vision Foundation Model. Crucially, FOCUS leverages a unified adaptive aggregation method to intelligently integrate 2D slices-level predictions into comprehensive 3D patient-level diagnosis. Trained and tested on 3,300 patients (40,672 slices), and externally validated on 1,345 patients (18,498 slices) across four different-tier centers and diverse OCT devices, FOCUS achieved high F1 scores for quality assessment (99.01%), abnormally detection (97.46%), and patient-level diagnosis (94.39%). Real-world validation across centers also showed stable performance (F1: 90.22%-95.24%). In human-machine comparisons, FOCUS matched expert performance in abnormality detection (F1: 95.47% vs 90.91%) and multi-disease diagnosis (F1: 93.49% vs 91.35%), while demonstrating better efficiency. FOCUS automates the image-to-diagnosis pipeline, representing a critical advance towards unmanned ophthalmology with a validated blueprint for autonomous screening to enhance population scale retinal care accessibility and efficiency.
zh

[CV-65] LEVIO: Lightweight Embedded Visual Inertial Odometry for Resource-Constrained Devices

【速读】:该论文旨在解决当前视觉惯性里程计(Visual-Inertial Odometry, VIO)系统在资源受限设备(如微无人机和智能眼镜)上部署时计算开销过大、难以实现实时六自由度(6-DoF)运动感知的问题。解决方案的关键在于提出了一种名为LEVIO的全功能VIO流水线,其核心是通过算法设计优化与软硬件协同优化相结合的方式,在保证精度的前提下显著降低计算复杂度和功耗:一方面保留了ORB特征跟踪和束调整(Bundle Adjustment)等成熟VIO组件,另一方面采用并行化架构与低内存占用策略,使其适用于嵌入式微控制器和超低功耗片上系统(SoC),并在基于RISC-V的多核低功耗SoC上实现了20 FPS的实时性能且功耗低于100 mW,同时在公开VIO数据集上验证了其高效性与准确性之间的良好平衡。

链接: https://arxiv.org/abs/2602.03294
作者: Jonas Kühne,Christian Vogt,Michele Magno,Luca Benini
机构: ETH Zurich (苏黎世联邦理工学院); University of Bologna (博洛尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: This article has been accepted for publication in the IEEE Sensors Journal (JSEN)

点击查看摘要

Abstract:Accurate, infrastructure-less sensor systems for motion tracking are essential for mobile robotics and augmented reality (AR) applications. The most popular state-of-the-art visual-inertial odometry (VIO) systems, however, are too computationally demanding for resource-constrained hardware, such as micro-drones and smart glasses. This work presents LEVIO, a fully featured VIO pipeline optimized for ultra-low-power compute platforms, allowing six-degrees-of-freedom (DoF) real-time sensing. LEVIO incorporates established VIO components such as Oriented FAST and Rotated BRIEF (ORB) feature tracking and bundle adjustment, while emphasizing a computationally efficient architecture with parallelization and low memory usage to suit embedded microcontrollers and low-power systems-on-chip (SoCs). The paper proposes and details the algorithmic design choices and the hardware-software co-optimization approach, and presents real-time performance on resource-constrained hardware. LEVIO is validated on a parallel-processing ultra-low-power RISC-V SoC, achieving 20 FPS while consuming less than 100 mW, and benchmarked against public VIO datasets, offering a compelling balance between efficiency and accuracy. To facilitate reproducibility and adoption, the complete implementation is released as open-source.
zh

[CV-66] A3-TTA: Adaptive Anchor Alignment Test-Time Adaptation for Image Segmentation

【速读】:该论文旨在解决测试时适应(Test-Time Adaptation, TTA)中伪标签(pseudo-label)生成不稳定的问题,尤其是在存在领域偏移(domain shift)的情况下,传统基于扰动集成的伪标签方法因缺乏分布基础而易引发误差累积和灾难性遗忘。解决方案的关键在于提出A3-TTA框架,其核心创新是通过锚点引导监督(anchor-guided supervision)构建可靠的伪标签:首先利用类紧凑密度度量识别目标域中预测置信度高的图像作为锚点(anchor),假设这些样本在分布上更接近源域;随后以锚点为稳定参考指导伪标签生成,并结合语义一致性约束与边界感知熵最小化进行正则化;此外引入自适应指数移动平均策略降低标签噪声并稳定模型更新。该方法显著提升了多域医学图像(心脏结构和前列腺分割)及自然图像上的分割性能,且具备持续适应多个目标域的能力。

链接: https://arxiv.org/abs/2602.03292
作者: Jianghao Wu,Xiangde Luo,Yubo Zhou,Lianming Wu,Guotai Wang,Shaoting Zhang
机构: University of Electronic Science and Technology of China (电子科技大学); Shanghai AI Laboratory (上海人工智能实验室); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE Transactions on Image Processing

点击查看摘要

Abstract:Test-Time Adaptation (TTA) offers a practical solution for deploying image segmentation models under domain shift without accessing source data or retraining. Among existing TTA strategies, pseudo-label-based methods have shown promising performance. However, they often rely on perturbation-ensemble heuristics (e.g., dropout sampling, test-time augmentation, Gaussian noise), which lack distributional grounding and yield unstable training signals. This can trigger error accumulation and catastrophic forgetting during adaptation. To address this, we propose \textbfA3-TTA, a TTA framework that constructs reliable pseudo-labels through anchor-guided supervision. Specifically, we identify well-predicted target domain images using a class compact density metric, under the assumption that confident predictions imply distributional proximity to the source domain. These anchors serve as stable references to guide pseudo-label generation, which is further regularized via semantic consistency and boundary-aware entropy minimization. Additionally, we introduce a self-adaptive exponential moving average strategy to mitigate label noise and stabilize model update during adaptation. Evaluated on both multi-domain medical images (heart structure and prostate segmentation) and natural images, A3-TTA significantly improves average Dice scores by 10.40 to 17.68 percentage points compared to the source model, outperforming several state-of-the-art TTA methods under different segmentation model architectures. A3-TTA also excels in continual TTA, maintaining high performance across sequential target domains with strong anti-forgetting ability. The code will be made publicly available at this https URL.
zh

[CV-67] me Is All It Takes: Spike-Retiming Attacks on Event-Driven Spiking Neural Networks ICLR2026

【速读】:该论文旨在解决事件驱动脉冲神经网络(Spiking Neural Networks, SNNs)中一种新型对抗攻击——仅通过重定时(spike retiming)现有脉冲来实现的隐蔽性攻击问题,这类攻击不改变脉冲计数或幅值,从而保持发放率不变(rate-preserving),因而难以被传统基于强度或事件密度的防御机制检测。解决方案的关键在于提出一个容量为1的脉冲重定时威胁模型(capacity-1 spike-retiming threat model),并引入统一的三元预算约束:单脉冲抖动 B\mathcal{B}_\infty、总延迟 B1\mathcal{B}_1 和篡改次数 B0\mathcal{B}_0;同时设计了“循环内投影”(projected-in-the-loop, PIL)优化方法,利用可微软重定时(shift-probability logits)进行反向传播,并在前向传播中施加严格离散投影以满足容量约束、非重叠性和预算限制,最终通过任务损失最大化与预算感知惩罚项协同优化,显著提升了攻击成功率(如DVS-Gesture上超过90%),且仅扰动少于2%的脉冲,揭示了当前SNN防御体系对时间维度攻击的脆弱性。

链接: https://arxiv.org/abs/2602.03284
作者: Yi Yu,Qixin Zhang,Shuhan Ye,Xun Lin,Qianshan Wei,Kun Wang,Wenhan Yang,Dacheng Tao,Xudong Jiang
机构: Nanyang Technological University (南洋理工大学); College of Computing and Data Science, Nanyang Technological University (南洋理工大学计算机与数据科学学院); CUHK (香港中文大学); Southeast University (东南大学); Pengcheng Laboratory (鹏城实验室)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICLR 2026

点击查看摘要

Abstract:Spiking neural networks (SNNs) compute with discrete spikes and exploit temporal structure, yet most adversarial attacks change intensities or event counts instead of timing. We study a timing-only adversary that retimes existing spikes while preserving spike counts and amplitudes in event-driven SNNs, thus remaining rate-preserving. We formalize a capacity-1 spike-retiming threat model with a unified trio of budgets: per-spike jitter \mathcalB_\infty , total delay \mathcalB_1 , and tamper count \mathcalB_0 . Feasible adversarial examples must satisfy timeline consistency and non-overlap, which makes the search space discrete and constrained. To optimize such retimings at scale, we use projected-in-the-loop (PIL) optimization: shift-probability logits yield a differentiable soft retiming for backpropagation, and a strict projection in the forward pass produces a feasible discrete schedule that satisfies capacity-1, non-overlap, and the chosen budget at every step. The objective maximizes task loss on the projected input and adds a capacity regularizer together with budget-aware penalties, which stabilizes gradients and aligns optimization with evaluation. Across event-driven benchmarks (CIFAR10-DVS, DVS-Gesture, N-MNIST) and diverse SNN architectures, we evaluate under binary and integer event grids and a range of retiming budgets, and also test models trained with timing-aware adversarial training designed to counter timing-only attacks. For example, on DVS-Gesture the attack attains high success (over 90% ) while touching fewer than 2% of spikes under \mathcalB_0 . Taken together, our results show that spike retiming is a practical and stealthy attack surface that current defenses struggle to counter, providing a clear reference for temporal robustness in event-driven SNNs. Code is available at this https URL.
zh

[CV-68] Global Geometry Is Not Enough for Vision Representations

【速读】:该论文旨在解决当前表示学习中过度依赖全局嵌入几何(global embedding geometry)来评估表示能力的问题,特别是其对组合结构建模能力的不足。研究发现,标准的几何度量与组合绑定(compositional binding)能力几乎无相关性,而基于输入-输出雅可比矩阵(Jacobian)的功能敏感性(functional sensitivity)则能可靠地预测这一能力。解决方案的关键在于揭示现有训练目标通过显式约束嵌入几何但未约束局部输入-输出映射,从而导致对组合结构建模能力的忽视;因此,功能敏感性应作为补充维度,与全局几何共同刻画表示能力。

链接: https://arxiv.org/abs/2602.03282
作者: Jiwan Chung,Seon Joo Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A common assumption in representation learning is that globally well-distributed embeddings support robust and generalizable representations. This focus has shaped both training objectives and evaluation protocols, implicitly treating global geometry as a proxy for representational competence. While global geometry effectively encodes which elements are present, it is often insensitive to how they are composed. We investigate this limitation by testing the ability of geometric metrics to predict compositional binding across 21 vision encoders. We find that standard geometry-based statistics exhibit near-zero correlation with compositional binding. In contrast, functional sensitivity, as measured by the input-output Jacobian, reliably tracks this capability. We further provide an analytic account showing that this disparity arises from objective design, as existing losses explicitly constrain embedding geometry but leave the local input-output mapping unconstrained. These results suggest that global embedding geometry captures only a partial view of representational competence and establish functional sensitivity as a critical complementary axis for modeling composite structure.
zh

[CV-69] HypCBC: Domain-Invariant Hyperbolic Cross-Branch Consistency for Generalizable Medical Image Analysis

【速读】:该论文旨在解决深度神经网络在医学图像分析中跨分布泛化能力弱的问题,尤其是在数据稀缺、协变量偏移(covariate shift)显著的临床场景下,如不同成像设备、协议及人群导致的性能不稳定问题。其核心解决方案是引入双曲流形(hyperbolic manifold)来建模复杂且分层的数据结构,替代传统欧氏空间(Euclidean manifold)的平坦几何表示;关键创新在于提出一种无监督的、域不变的双曲交叉分支一致性约束(domain-invariant hyperbolic cross-branch consistency constraint),有效促进域不变特征学习,并在多个跨域泛化基准(Fitzpatrick17k、Camelyon17-WILDS 和视网膜成像跨数据集设置)上平均提升AUC达+2.1%,验证了方法在不同成像模态、数据规模和标签粒度下的强泛化能力。

链接: https://arxiv.org/abs/2602.03264
作者: Francesco Di Salvo,Sebastian Doerrich,Jonas Alle,Christian Ledig
机构: University of Bamberg (巴伐利亚大学); xAILab Bamberg (巴伐利亚人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: Accepted to Transactions on Machine Learning Research (TMLR)

点击查看摘要

Abstract:Robust generalization beyond training distributions remains a critical challenge for deep neural networks. This is especially pronounced in medical image analysis, where data is often scarce and covariate shifts arise from different hardware devices, imaging protocols, and heterogeneous patient populations. These factors collectively hinder reliable performance and slow down clinical adoption. Despite recent progress, existing learning paradigms primarily rely on the Euclidean manifold, whose flat geometry fails to capture the complex, hierarchical structures present in clinical data. In this work, we exploit the advantages of hyperbolic manifolds to model complex data characteristics. We present the first comprehensive validation of hyperbolic representation learning for medical image analysis and demonstrate statistically significant gains across eleven in-distribution datasets and three ViT models. We further propose an unsupervised, domain-invariant hyperbolic cross-branch consistency constraint. Extensive experiments confirm that our proposed method promotes domain-invariant features and outperforms state-of-the-art Euclidean methods by an average of +2.1% AUC on three domain generalization benchmarks: Fitzpatrick17k, Camelyon17-WILDS, and a cross-dataset setup for retinal imaging. These datasets span different imaging modalities, data sizes, and label granularities, confirming generalization capabilities across substantially different conditions. The code is available at this https URL .
zh

[CV-70] LaVPR: Benchmarking Language and Vision for Place Recognition

【速读】:该论文旨在解决视觉定位(Visual Place Recognition, VPR)在极端环境变化和感知歧义(perceptual aliasing)下性能下降的问题,以及现有系统无法仅依赖自然语言描述实现“盲定位”(blind localization)的局限性。其解决方案的关键在于构建一个大规模多模态基准LaVPR,包含超过65万条丰富的自然语言描述,并在此基础上探索两种范式:一是通过多模态融合(Multi-Modal Fusion)提升鲁棒性,二是通过跨模态检索(Cross-Modal Retrieval)实现语言驱动的定位。实验表明,语言信息在视觉退化条件下能带来稳定性能提升,尤其对小型模型效果显著,使其性能可媲美大型纯视觉模型;同时,基于低秩适配(LoRA)与多相似性损失(Multi-Similarity loss)的跨模态检索基线方法显著优于传统对比学习方法,从而推动了抗干扰性强且适用于资源受限场景的新一代定位系统的发展。

链接: https://arxiv.org/abs/2602.03253
作者: Ofer Idan,Dan Badur,Yosi Keller,Yoli Shavit
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual Place Recognition (VPR) often fails under extreme environmental changes and perceptual aliasing. Furthermore, standard systems cannot perform “blind” localization from verbal descriptions alone, a capability needed for applications such as emergency response. To address these challenges, we introduce LaVPR, a large-scale benchmark that extends existing VPR datasets with over 650,000 rich natural-language descriptions. Using LaVPR, we investigate two paradigms: Multi-Modal Fusion for enhanced robustness and Cross-Modal Retrieval for language-based localization. Our results show that language descriptions yield consistent gains in visually degraded conditions, with the most significant impact on smaller backbones. Notably, adding language allows compact models to rival the performance of much larger vision-only architectures. For cross-modal retrieval, we establish a baseline using Low-Rank Adaptation (LoRA) and Multi-Similarity loss, which substantially outperforms standard contrastive methods across vision-language models. Ultimately, LaVPR enables a new class of localization systems that are both resilient to real-world stochasticity and practical for resource-constrained deployment. Our dataset and code are available at this https URL.
zh

[CV-71] InstaDrive: Instance-Aware Driving World Models for Realistic and Consistent Video Generation

【速读】:该论文旨在解决世界模型(World Models)在生成驾驶视频时面临的两个核心问题:一是难以保持实例级别的时序一致性,二是空间几何保真度不足。为应对这些挑战,论文提出了一种名为InstaDrive的新框架,其关键创新在于引入了两个实例感知机制:(1) 实例流引导器(Instance Flow Guider),通过跨帧提取和传播实例特征来强化时序一致性,确保实例身份在时间维度上的稳定性;(2) 空间几何对齐器(Spatial Geometric Aligner),提升空间推理能力,精确控制实例位置,并显式建模遮挡层级关系。这两个机制共同提升了生成视频的真实感与结构合理性,显著改善了下游自动驾驶任务性能,在nuScenes数据集上达到当前最优效果。

链接: https://arxiv.org/abs/2602.03242
作者: Zhuoran Yang,Xi Guo,Chenjing Ding,Chiyu Wang,Wei Wu,Yanyong Zhang
机构: University of Science and Technology of China (中国科学技术大学); SenseAuto (商汤科技); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Autonomous driving relies on robust models trained on high-quality, large-scale multi-view driving videos. While world models offer a cost-effective solution for generating realistic driving videos, they struggle to maintain instance-level temporal consistency and spatial geometric fidelity. To address these challenges, we propose InstaDrive, a novel framework that enhances driving video realism through two key advancements: (1) Instance Flow Guider, which extracts and propagates instance features across frames to enforce temporal consistency, preserving instance identity over time. (2) Spatial Geometric Aligner, which improves spatial reasoning, ensures precise instance positioning, and explicitly models occlusion hierarchies. By incorporating these instance-aware mechanisms, InstaDrive achieves state-of-the-art video generation quality and enhances downstream autonomous driving tasks on the nuScenes dataset. Additionally, we utilize CARLA’s autopilot to procedurally and stochastically simulate rare but safety-critical driving scenarios across diverse maps and regions, enabling rigorous safety evaluation for autonomous systems. Our project page is this https URL.
zh

[CV-72] EventFlash: Towards Efficient MLLM s for Event-Based Vision

【速读】:该论文旨在解决当前基于事件的多模态大语言模型(Event-based Multimodal Large Language Models, MLLMs)在处理事件流时因采用密集图像类处理范式而导致的数据冗余和高计算开销问题,尤其是在高动态或低光照场景下难以实现高效推理。解决方案的关键在于提出EventFlash模型,其核心创新包括:1)构建大规模、场景多样化的EventMind数据集(含50万条指令),支持课程学习策略以优化长序列事件流建模;2)设计自适应时间窗口聚合模块,通过动态压缩时间token保留关键时序信息,提升时间维度效率;3)引入稀疏密度引导注意力机制,聚焦于信息丰富区域并抑制空旷或稀疏空间区域,从而显著提高空间token利用率。实验表明,EventFlash相较基线模型(EventFlash-Zero)实现了12.4倍的吞吐量提升,并支持长达1000个时间区间(bins)的事件流处理,远超EventGPT的5-bin限制。

链接: https://arxiv.org/abs/2602.03230
作者: Shaoyu Liu,Jianing Li,Guanghui Zhao,Yunjian Zhang,Wen Jiang,Ming Li,Xiangyang Ji
机构: Xidian University (西安电子科技大学); Tsinghua University (清华大学); Beijing Institute of Technology (北京理工大学); Guangdong Laboratory of Artificial Intelligence and Digital Economy(SZ) (广东省人工智能与数字经济发展实验室(深圳))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Event-based multimodal large language models (MLLMs) enable robust perception in high-speed and low-light scenarios, addressing key limitations of frame-based MLLMs. However, current event-based MLLMs often rely on dense image-like processing paradigms, overlooking the spatiotemporal sparsity of event streams and resulting in high computational cost. In this paper, we propose EventFlash, a novel and efficient MLLM to explore spatiotemporal token sparsification for reducing data redundancy and accelerating inference. Technically, we build EventMind, a large-scale and scene-diverse dataset with over 500k instruction sets, providing both short and long event stream sequences to support our curriculum training strategy. We then present an adaptive temporal window aggregation module for efficient temporal sampling, which adaptively compresses temporal tokens while retaining key temporal cues. Finally, a sparse density-guided attention module is designed to improve spatial token efficiency by selecting informative regions and suppressing empty or sparse areas. Experimental results show that EventFlash achieves a 12.4\times throughput improvement over the baseline (EventFlash-Zero) while maintaining comparable performance. It supports long-range event stream processing with up to 1,000 bins, significantly outperforming the 5-bin limit of EventGPT. We believe EventFlash serves as an efficient foundation model for event-based vision.
zh

[CV-73] Spiral RoPE: Rotate Your Rotary Positional Embeddings in the 2D Plane

【速读】:该论文旨在解决视觉 Transformer 中标准二维旋转位置编码(Rotary Position Embedding, RoPE)存在的方向性限制问题,即其轴对齐(axial)分解方式仅能编码水平和垂直方向的位置信息,难以建模自然图像中普遍存在的斜向空间关系。解决方案的关键在于提出 Spiral RoPE,通过将嵌入通道划分为多个组,每组对应一组均匀分布的方向,并根据 patch 位置在该方向上的投影进行旋转编码,从而实现多方向的位置信息建模,显著提升了模型在图像分类、分割和生成等任务中的性能表现。

链接: https://arxiv.org/abs/2602.03227
作者: Haoyu Liu,Sucheng Ren,Tingyu Zhu,Peng Wang,Cihang Xie,Alan Yuille,Zeyu Zheng,Feng Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Rotary Position Embedding (RoPE) is the de facto positional encoding in large language models due to its ability to encode relative positions and support length extrapolation. When adapted to vision transformers, the standard axial formulation decomposes two-dimensional spatial positions into horizontal and vertical components, implicitly restricting positional encoding to axis-aligned directions. We identify this directional constraint as a fundamental limitation of the standard axial 2D RoPE, which hinders the modeling of oblique spatial relationships that naturally exist in natural images. To overcome this limitation, we propose Spiral RoPE, a simple yet effective extension that enables multi-directional positional encoding by partitioning embedding channels into multiple groups associated with uniformly distributed directions. Each group is rotated according to the projection of the patch position onto its corresponding direction, allowing spatial relationships to be encoded beyond the horizontal and vertical axes. Across a wide range of vision tasks including classification, segmentation, and generation, Spiral RoPE consistently improves performance. Qualitative analysis of attention maps further show that Spiral RoPE exhibits more concentrated activations on semantically relevant objects and better respects local object boundaries, highlighting the importance of multi-directional positional encoding in vision transformers.
zh

[CV-74] PokeFusion Attention: Enhancing Reference-Free Style-Conditioned Generation IJCNN2026

【速读】:该论文旨在解决文本到图像扩散模型中无参考(reference-free)风格条件化字符生成的问题,即在不依赖外部参考图像的情况下,实现高质量、结构稳定且跨多样化提示保持细粒度风格一致性的图像合成。现有方法主要依赖纯文本提示,易导致视觉风格表述不足和风格漂移(style drift)及几何不一致性;或引入参考图像驱动的适配器模块,增加推理时的架构复杂性并限制部署灵活性。其解决方案的关键在于提出PokeFusion Attention机制——一种轻量级解码器级别的交叉注意力模块,将文本语义与学习到的风格嵌入(style embeddings)直接融合于扩散解码器内部,通过在注意力层面解耦文本与风格条件,实现无需参考图像的高效风格控制,同时保持预训练扩散主干网络不变,具有参数效率高、可插拔性强、易于集成至现有扩散流程等优势。

链接: https://arxiv.org/abs/2602.03220
作者: Jingbang Tang(James)
机构: Universiti Kebangsaan Malaysia (马来西亚国民大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 5 figures. Under review at IJCNN 2026

点击查看摘要

Abstract:This paper studies reference-free style-conditioned character generation in text-to-image diffusion models, where high-quality synthesis requires both stable character structure and consistent, fine-grained style expression across diverse prompts. Existing approaches primarily rely on text-only prompting, which is often under-specified for visual style and tends to produce noticeable style drift and geometric inconsistency, or introduce reference-based adapters that depend on external images at inference time, increasing architectural complexity and limiting deployment this http URL propose PokeFusion Attention, a lightweight decoder-level cross-attention mechanism that fuses textual semantics with learned style embeddings directly inside the diffusion decoder. By decoupling text and style conditioning at the attention level, our method enables effective reference-free stylized generation while keeping the pretrained diffusion backbone fully this http URL Attention trains only decoder cross-attention layers together with a compact style projection module, resulting in a parameter-efficient and plug-and-play control component that can be easily integrated into existing diffusion pipelines and transferred across different this http URL on a stylized character generation benchmark (Pokemon-style) demonstrate that our method consistently improves style fidelity, semantic alignment, and character shape consistency compared with representative adapter-based baselines, while maintaining low parameter overhead and inference-time simplicity.
zh

[CV-75] FARTrack: Fast Autoregressive Visual Tracking with High Performance

【速读】:该论文旨在解决视觉跟踪领域中高性能追踪器因推理速度慢而难以部署在资源受限设备上的问题。解决方案的关键在于提出FARTrack框架,其核心创新包括任务特定的自蒸馏(Task-Specific Self-Distillation)和帧间自回归稀疏化(Inter-frame Autoregressive Sparsification):前者通过逐层蒸馏任务相关token实现模型压缩,在不依赖人工指定教师-学生层配对的情况下提升推理效率;后者通过序列化压缩多模板以学习时序全局最优稀疏策略,避免额外运行时开销,从而在保持跟踪精度的同时显著提升执行效率。

链接: https://arxiv.org/abs/2602.03214
作者: Guijie Wang,Tong Lin,Yifan Bai,Anjia Cao,Shiyi Liang,Wangbo Zhao,Xing Wei
机构: Xi’an Jiaotong University (西安交通大学); Alibaba Group (阿里巴巴集团); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Inference speed and tracking performance are two critical evaluation metrics in the field of visual tracking. However, high-performance trackers often suffer from slow processing speeds, making them impractical for deployment on resource-constrained devices. To alleviate this issue, we propose FARTrack, a Fast Auto-Regressive Tracking framework. Since autoregression emphasizes the temporal nature of the trajectory sequence, it can maintain high performance while achieving efficient execution across various devices. FARTrack introduces Task-Specific Self-Distillation and Inter-frame Autoregressive Sparsification, designed from the perspectives of shallow-yet-accurate distillation and redundant-to-essential token optimization, respectively. Task-Specific Self-Distillation achieves model compression by distilling task-specific tokens layer by layer, enhancing the model’s inference speed while avoiding suboptimal manual teacher-student layer pairs assignments. Meanwhile, Inter-frame Autoregressive Sparsification sequentially condenses multiple templates, avoiding additional runtime overhead while learning a temporally-global optimal sparsification strategy. FARTrack demonstrates outstanding speed and competitive performance. It delivers an AO of 70.6% on GOT-10k in real-time. Beyond, our fastest model achieves a speed of 343 FPS on the GPU and 121 FPS on the CPU.
zh

[CV-76] ConsisDrive: Identity-Preserving Driving World Models for Video Generation by Instance Mask

【速读】:该论文旨在解决生成式驾驶世界模型(generative driving world model)在训练过程中存在的身份漂移(identity drift)问题,即同一物体在不同帧中出现外观或类别变化,导致视频生成不一致。解决方案的关键在于提出 ConsisDrive 框架,其核心创新是引入两个机制:一是实例掩码注意力(Instance-Masked Attention),通过在注意力模块中加入实例身份掩码和轨迹掩码,确保视觉token仅与对应实例特征进行跨空间和时间维度交互,从而维持对象的时序一致性;二是实例掩码损失(Instance-Masked Loss),采用概率性实例掩码自适应增强前景区域的监督信号,抑制背景噪声并保持整体场景保真度。这两项设计共同提升了驾驶视频生成质量,并显著改善了下游自动驾驶任务性能。

链接: https://arxiv.org/abs/2602.03213
作者: Zhuoran Yang,Yanyong Zhang
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Autonomous driving relies on robust models trained on large-scale, high-quality multi-view driving videos. Although world models provide a cost-effective solution for generating realistic driving data, they often suffer from identity drift, where the same object changes its appearance or category across frames due to the absence of instance-level temporal constraints. We introduce ConsisDrive, an identity-preserving driving world model designed to enforce temporal consistency at the instance level. Our framework incorporates two key components: (1) Instance-Masked Attention, which applies instance identity masks and trajectory masks within attention blocks to ensure that visual tokens interact only with their corresponding instance features across spatial and temporal dimensions, thereby preserving object identity consistency; and (2) Instance-Masked Loss, which adaptively emphasizes foreground regions with probabilistic instance masking, reducing background noise while maintaining overall scene fidelity. By integrating these mechanisms, ConsisDrive achieves state-of-the-art driving video generation quality and demonstrates significant improvements in downstream autonomous driving tasks on the nuScenes dataset. Our project page is this https URL.
zh

[CV-77] VIRAL: Visual In-Context Reasoning via Analogy in Diffusion Transformers

【速读】:该论文旨在解决视觉领域中上下文学习(In-Context Learning, ICL)难以复现的问题,其核心挑战源于视觉任务的异质性。为应对这一问题,作者提出VIRAL框架,其关键在于将ICL建模为基于视觉类比的条件生成任务(即 $ x_s : x_t :: x_q : y_q $),并利用预训练图像编辑模型(如冻结的Diffusion Transformer, DiT)通过角色感知的多图像条件输入来激发视觉推理能力。此外,引入Mixture-of-Experts LoRA机制以缓解不同任务间梯度干扰,从而实现统一的视觉上下文学习(Visual In-Context Learning, V-ICL)范式,有效覆盖感知、修复与编辑等多样化视觉任务。

链接: https://arxiv.org/abs/2602.03210
作者: Zhiwen Li,Zhongjie Duan,Jinyan Ye,Cen Chen,Daoyuan Chen,Yaliang Li,Yingda Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Replicating In-Context Learning (ICL) in computer vision remains challenging due to task heterogeneity. We propose \textbfVIRAL, a framework that elicits visual reasoning from a pre-trained image editing model by formulating ICL as conditional generation via visual analogy ( x_s : x_t :: x_q : y_q ). We adapt a frozen Diffusion Transformer (DiT) using role-aware multi-image conditioning and introduce a Mixture-of-Experts LoRA to mitigate gradient interference across diverse tasks. Additionally, to bridge the gaps in current visual context datasets, we curate a large-scale dataset spanning perception, restoration, and editing. Experiments demonstrate that VIRAL outperforms existing methods, validating that a unified V-ICL paradigm can handle the majority of visual tasks, including open-domain editing. Our code is available at this https URL
zh

[CV-78] Spectral Evolution Search: Efficient Inference-Time Scaling for Reward-Aligned Image Generation

【速读】:该论文旨在解决视觉生成模型在推理阶段对下游目标进行对齐时,现有基于初始噪声优化的方法因高维空间搜索效率低下而带来的计算资源浪费问题。其核心挑战在于:在高维初始噪声空间中,许多搜索方向对最终生成结果影响微弱,导致优化过程低效。解决方案的关键在于提出一种名为Spectral Evolution Search (SES) 的无梯度进化搜索框架,该方法通过限制搜索空间于低频子空间,有效规避了高频扰动对生成结果影响微弱的问题;其理论基础为从扰动传播动力学推导出的谱尺度预测(Spectral Scaling Prediction),揭示了不同频率扰动在生成动态中的系统性差异,从而实现了在生成质量与计算成本之间更优的帕累托前沿。

链接: https://arxiv.org/abs/2602.03208
作者: Jinyan Ye,Zhongjie Duan,Zhiwen Li,Cen Chen,Daoyuan Chen,Yaliang Li,Yingda Chen
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Inference-time scaling offers a versatile paradigm for aligning visual generative models with downstream objectives without parameter updates. However, existing approaches that optimize the high-dimensional initial noise suffer from severe inefficiency, as many search directions exert negligible influence on the final generation. We show that this inefficiency is closely related to a spectral bias in generative dynamics: model sensitivity to initial perturbations diminishes rapidly as frequency increases. Building on this insight, we propose Spectral Evolution Search (SES), a plug-and-play framework for initial noise optimization that executes gradient-free evolutionary search within a low-frequency subspace. Theoretically, we derive the Spectral Scaling Prediction from perturbation propagation dynamics, which explains the systematic differences in the impact of perturbations across frequencies. Extensive experiments demonstrate that SES significantly advances the Pareto frontier of generation quality versus computational cost, consistently outperforming strong baselines under equivalent budgets.
zh

[CV-79] WebSplatter: Enabling Cross-Device Efficient Gaussian Splatting in Web Browsers via WebGPU

【速读】:该论文旨在解决在异构Web生态系统中实现高效、确定性GPU渲染的难题,特别是在WebGPU缺乏全局原子操作(global atomics)的情况下如何保证跨硬件平台的一致性执行。其关键解决方案是提出一种无等待的分层基数排序(wait-free hierarchical radix sort),有效规避了WebGPU的限制,并引入了一种基于透明度感知的几何剔除阶段(opacity-aware geometry culling),在光栅化前动态剔除不可见的点绘图(splats),显著降低过度绘制(overdraw)和峰值内存占用,从而在性能上相较当前最先进的Web渲染器提升1.2×至4.5×。

链接: https://arxiv.org/abs/2602.03207
作者: Yudong Han,Chao Xu,Xiaodan Ye,Weichen Bi,Zilong Dong,Yun Ma
机构: Peking University (北京大学); Tongyi Lab, Alibaba Group (阿里巴巴集团通义实验室)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF)
备注:

点击查看摘要

Abstract:We present WebSplatter, an end-to-end GPU rendering pipeline for the heterogeneous web ecosystem. Unlike naive ports, WebSplatter introduces a wait-free hierarchical radix sort that circumvents the lack of global atomics in WebGPU, ensuring deterministic execution across diverse hardware. Furthermore, we propose an opacity-aware geometry culling stage that dynamically prunes splats before rasterization, significantly reducing overdraw and peak memory footprint. Evaluation demonstrates that WebSplatter consistently achieves 1.2 \times to 4.5 \times speedups over state-of-the-art web viewers.
zh

[CV-80] Hand3R: Online 4D Hand-Scene Reconstruction in the Wild

【速读】:该论文旨在解决具身智能(Embodied AI)中动态手部与密集场景上下文联合重建的问题,现有方法通常仅在局部坐标系下恢复孤立的手部结构,忽视了周围三维环境的建模。解决方案的关键在于提出首个基于单目视频的在线4D手-场景联合重建框架Hand3R,其通过场景感知的视觉提示机制(scene-aware visual prompting mechanism),将预训练的手部专家模型与4D场景基础模型(4D scene foundation model)进行协同,利用高保真手部先验注入持久场景记忆,从而在一次前向传播中实现精确手部网格与稠密度量尺度场景几何的同步重建。

链接: https://arxiv.org/abs/2602.03200
作者: Wendi Hu,Haonan Zhou,Wenhao Hu,Gaoang Wang
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:For Embodied AI, jointly reconstructing dynamic hands and the dense scene context is crucial for understanding physical interaction. However, most existing methods recover isolated hands in local coordinates, overlooking the surrounding 3D environment. To address this, we present Hand3R, the first online framework for joint 4D hand-scene reconstruction from monocular video. Hand3R synergizes a pre-trained hand expert with a 4D scene foundation model via a scene-aware visual prompting mechanism. By injecting high-fidelity hand priors into a persistent scene memory, our approach enables simultaneous reconstruction of accurate hand meshes and dense metric-scale scene geometry in a single forward pass. Experiments demonstrate that Hand3R bypasses the reliance on offline optimization and delivers competitive performance in both local hand reconstruction and global positioning.
zh

[CV-81] From Single Scan to Sequential Consistency: A New Paradigm for LIDAR Relocalization

【速读】:该论文旨在解决LiDAR重定位中因动态或模糊场景导致的定位不鲁棒问题,现有基于回归的方法通常仅依赖单帧推理或忽略扫描间的时空一致性。其解决方案的关键在于提出TempLoc框架,通过引入三个核心模块实现时序一致性的有效建模:首先使用全局坐标估计模块预测每帧点云的全局坐标及不确定性;其次利用注意力机制设计先验坐标生成模块估计帧间点对应关系;最后通过不确定性引导的坐标融合模块端到端整合点对应关系,从而获得更时序一致且精确的全局6-DoF位姿。

链接: https://arxiv.org/abs/2602.03198
作者: Minghang Zhu,Zhijing Wang,Yuxin Guo,Wen Li,Sheng Ao,Cheng Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Nothing

点击查看摘要

Abstract:LiDAR relocalization aims to estimate the global 6-DoF pose of a sensor in the environment. However, existing regression-based approaches are prone to dynamic or ambiguous scenarios, as they either solely rely on single-frame inference or neglect the spatio-temporal consistency across scans. In this paper, we propose TempLoc, a new LiDAR relocalization framework that enhances the robustness of localization by effectively modeling sequential consistency. Specifically, a Global Coordinate Estimation module is first introduced to predict point-wise global coordinates and associated uncertainties for each LiDAR scan. A Prior Coordinate Generation module is then presented to estimate inter-frame point correspondences by the attention mechanism. Lastly, an Uncertainty-Guided Coordinate Fusion module is deployed to integrate both predictions of point correspondence in an end-to-end fashion, yielding a more temporally consistent and accurate global 6-DoF pose. Experimental results on the NCLT and Oxford Robot-Car benchmarks show that our TempLoc outperforms stateof-the-art methods by a large margin, demonstrating the effectiveness of temporal-aware correspondence modeling in LiDAR relocalization. Our code will be released soon.
zh

[CV-82] LSGQuant: Layer-Sensitivity Guided Quantization for One-Step Diffusion Real-World Video Super-Resolution

【速读】:该论文针对基于扩散模型(Diffusion Models)的视频超分辨率(Video Super-Resolution, VSR)任务中,扩散 Transformer(DiT)模型因参数量大和计算成本高而导致下游应用受限的问题展开研究。核心挑战在于低比特量化(low-bit quantization)在处理输入潜变量(latent)动态范围大及各层行为差异显著时性能下降明显。解决方案的关键在于提出 LSGQuant 方法,其包含三个核心技术:1)动态范围自适应量化器(Dynamic Range Adaptive Quantizer, DRAQ),用于适配视频 token 激活的动态范围;2)基于层敏感性估计的方差导向层训练策略(Variance-Oriented Layer Training Strategy, VOLTS),通过校准阶段分析层间统计特性优化量化效果;3)量化感知优化(Quantization-Aware Optimization, QAO),联合微调量化分支与保留的高精度分支以提升整体性能。实验表明,该方法在保持接近全精度模型性能的同时,显著优于现有量化技术。

链接: https://arxiv.org/abs/2602.03182
作者: Tianxing Wu,Zheng Chen,Cirou Xu,Bowen Chai,Yong Guo,Yutong Liu,Linghe Kong,Yulun Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at: this https URL

点击查看摘要

Abstract:One-Step Diffusion Models have demonstrated promising capability and fast inference in video super-resolution (VSR) for real-world. Nevertheless, the substantial model size and high computational cost of Diffusion Transformers (DiTs) limit downstream applications. While low-bit quantization is a common approach for model compression, the effectiveness of quantized models is challenged by the high dynamic range of input latent and diverse layer behaviors. To deal with these challenges, we introduce LSGQuant, a layer-sensitivity guided quantizing approach for one-step diffusion-based real-world VSR. Our method incorporates a Dynamic Range Adaptive Quantizer (DRAQ) to fit video token activations. Furthermore, we estimate layer sensitivity and implement a Variance-Oriented Layer Training Strategy (VOLTS) by analyzing layer-wise statistics in calibration. We also introduce Quantization-Aware Optimization (QAO) to jointly refine the quantized branch and a retained high-precision branch. Extensive experiments demonstrate that our method has nearly performance to origin model with full-precision and significantly exceeds existing quantization techniques. Code is available at: this https URL.
zh

[CV-83] BinaryDemoire: Moiré-Aware Binarization for Image Demoiréing

【速读】:该论文旨在解决图像去摩尔纹(image demoiréing)任务中深度神经网络模型部署成本高的问题,特别是针对现有全精度网络在计算资源受限场景下难以高效应用的挑战。其核心解决方案是提出BinaryDemoire框架,通过引入两个关键设计实现高效的二值化(binarization)模型:一是摩尔纹感知二值门(Moiré-aware Binary Gate, MABG),能够提取轻量级频率特征并结合激活统计信息,动态预测通道级门控系数以调节二值卷积响应的聚合;二是混洗分组残差适配器(Shuffle-grouped Residual Adapter, SGRA),通过结构化稀疏捷径对齐和交错混合机制促进不同通道分区间的特征交互,从而在极低比特表示下保留摩尔纹退化特有的频域结构信息,显著提升二值化模型的恢复性能。

链接: https://arxiv.org/abs/2602.03176
作者: Zheng Chen,Zhi Yang,Xiaoyang Liu,Weihang Zhang,Mengfan Wang,Yifan Fu,Linghe Kong,Yulun Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at: this https URL

点击查看摘要

Abstract:Image demoiréing aims to remove structured moiré artifacts in recaptured imagery, where degradations are highly frequency-dependent and vary across scales and directions. While recent deep networks achieve high-quality restoration, their full-precision designs remain costly for deployment. Binarization offers an extreme compression regime by quantizing both activations and weights to 1-bit. Yet, it has been rarely studied for demoiréing and performs poorly when naively applied. In this work, we propose BinaryDemoire, a binarized demoiréing framework that explicitly accommodates the frequency structure of moiré degradations. First, we introduce a moiré-aware binary gate (MABG) that extracts lightweight frequency descriptors together with activation statistics. It predicts channel-wise gating coefficients to condition the aggregation of binary convolution responses. Second, we design a shuffle-grouped residual adapter (SGRA) that performs structured sparse shortcut alignment. It further integrates interleaved mixing to promote information exchange across different channel partitions. Extensive experiments on four benchmarks demonstrate that the proposed BinaryDemoire surpasses current binarization methods. Code: this https URL.
zh

[CV-84] Human-in-the-loop Adaptation in Group Activity Feature Learning for Team Sports Video Retrieval

【速读】:该论文旨在解决群体活动特征学习(Group Activity Feature Learning, GAFL)中缺乏群体活动标注数据的问题,从而提升群体活动视频检索的性能。传统方法依赖于预定义类别的监督学习进行特征空间构建,而本文提出一种“人在回路”(human-in-the-loop)的适应机制,在无需人工标注群体活动类别的情况下,通过用户交互式反馈对特征空间进行迭代优化。其解决方案的关键在于:首先在自监督框架下预训练GAFL空间以捕捉群体活动间的相似性;随后引入数据高效视频选择策略,从数据库中筛选出最具判别力的候选视频供用户标注正负样本,并利用对比学习更新特征空间,使正样本向查询视频靠近、负样本远离,从而实现精准的视频检索。

链接: https://arxiv.org/abs/2602.03157
作者: Chihiro Nakatani,Hiroaki Kawashima,Norimichi Ukita
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to Computer Vision and Image Understanding (CVIU)

点击查看摘要

Abstract:This paper proposes human-in-the-loop adaptation for Group Activity Feature Learning (GAFL) without group activity annotations. This human-in-the-loop adaptation is employed in a group-activity video retrieval framework to improve its retrieval performance. Our method initially pre-trains the GAF space based on the similarity of group activities in a self-supervised manner, unlike prior work that classifies videos into pre-defined group activity classes in a supervised learning manner. Our interactive fine-tuning process updates the GAF space to allow a user to better retrieve videos similar to query videos given by the user. In this fine-tuning, our proposed data-efficient video selection process provides several videos, which are selected from a video database, to the user in order to manually label these videos as positive or negative. These labeled videos are used to update (i.e., fine-tune) the GAF space, so that the positive and negative videos move closer to and farther away from the query videos through contrastive learning. Our comprehensive experimental results on two team sports datasets validate that our method significantly improves the retrieval performance. Ablation studies also demonstrate that several components in our human-in-the-loop adaptation contribute to the improvement of the retrieval performance. Code: this https URL.
zh

[CV-85] Fully Kolmogorov-Arnold Deep Model in Medical Image Segmentation

【速读】:该论文旨在解决深度Kolmogorov-Arnold Network (KAN) 架构在实际应用中因训练难度高和内存消耗大而导致难以堆叠的问题,从而限制了对KAN模型深层结构的充分探索。其解决方案的关键在于两个核心创新:一是提出Share-activation KAN (SaKAN),通过重构Sprecher版本的Kolmogorov-Arnold表示定理,实现更简化的参数化和更密集的训练样本,显著降低训练难度;二是提出无梯度样条(Grad-Free Spline),证明样条梯度对训练贡献微小但占用大量GPU内存,从而大幅减少内存消耗与计算开销。基于这两项突破,作者构建了首个完全基于KAN的深度模型ALL U-KAN,其中KA层和KAonv层彻底替代传统全连接(FC)和卷积(Conv)层,在医学图像分割任务中展现出优于部分KAN及传统架构的性能,同时参数量减少10倍、内存消耗降低超20倍,为深度KAN架构的进一步研究开辟了新路径。

链接: https://arxiv.org/abs/2602.03156
作者: Xingyu Qiu,Xinghua Ma,Dong Liang,Gongning Luo,Wei Wang,Kuanquan Wang,Shuo Li
机构: Harbin Institute of Technology (哈尔滨工业大学); Case Western Reserve University (凯斯西储大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 11 pages, 5 figures, conference

点击查看摘要

Abstract:Deeply stacked KANs are practically impossible due to high training difficulties and substantial memory requirements. Consequently, existing studies can only incorporate few KAN layers, hindering the comprehensive exploration of KANs. This study overcomes these limitations and introduces the first fully KA-based deep model, demonstrating that KA-based layers can entirely replace traditional architectures in deep learning and achieve superior learning capacity. Specifically, (1) the proposed Share-activation KAN (SaKAN) reformulates Sprecher’s variant of Kolmogorov-Arnold representation theorem, which achieves better optimization due to its simplified parameterization and denser training samples, to ease training difficulty, (2) this paper indicates that spline gradients contribute negligibly to training while consuming huge GPU memory, thus proposes the Grad-Free Spline to significantly reduce memory usage and computational overhead. (3) Building on these two innovations, our ALL U-KAN is the first representative implementation of fully KA-based deep model, where the proposed KA and KAonv layers completely replace FC and Conv layers. Extensive evaluations on three medical image segmentation tasks confirm the superiority of the full KA-based architecture compared to partial KA-based and traditional architectures, achieving all higher segmentation accuracy. Compared to directly deeply stacked KAN, ALL U-KAN achieves 10 times reduction in parameter count and reduces memory consumption by more than 20 times, unlocking the new explorations into deep KAN architectures.
zh

[CV-86] Diversity-Preserved Distribution Matching Distillation for Fast Visual Synthesis

【速读】:该论文旨在解决分布匹配蒸馏(Distribution Matching Distillation, DMD)在文本到图像生成任务中因逆KL散度(reverse-KL)目标函数导致的模式崩溃(mode collapse)问题,该问题通常依赖感知或对抗正则化来缓解,但会引入显著计算开销和训练不稳定性。解决方案的关键在于提出一种角色分离的蒸馏框架(Role-Separated Distillation Framework),明确区分蒸馏步骤的功能:首步专注于通过目标预测(如v-prediction)保持样本多样性,后续步骤则基于标准DMD损失进行质量优化,且在首步处屏蔽DMD梯度传播;该方法称为多样性保留的DMD(Diversity-Preserved DMD, DP-DMD),无需感知网络、判别器、辅助网络或额外真实图像即可在保持高质量的同时有效避免模式崩溃。

链接: https://arxiv.org/abs/2602.03139
作者: Tianhe Wu,Ruibin Li,Lei Zhang,Kede Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Distribution matching distillation (DMD) aligns a multi-step generator with its few-step counterpart to enable high-quality generation under low inference cost. However, DMD tends to suffer from mode collapse, as its reverse-KL formulation inherently encourages mode-seeking behavior, for which existing remedies typically rely on perceptual or adversarial regularization, thereby incurring substantial computational overhead and training instability. In this work, we propose a role-separated distillation framework that explicitly disentangles the roles of distilled steps: the first step is dedicated to preserving sample diversity via a target-prediction (e.g., v-prediction) objective, while subsequent steps focus on quality refinement under the standard DMD loss, with gradients from the DMD objective blocked at the first step. We term this approach Diversity-Preserved DMD (DP-DMD), which, despite its simplicity – no perceptual backbone, no discriminator, no auxiliary networks, and no additional ground-truth images – preserves sample diversity while maintaining visual quality on par with state-of-the-art methods in extensive text-to-image experiments.
zh

[CV-87] FSOD-VFM: Few-Shot Object Detection with Vision Foundation Models and Graph Diffusion ICLR2026

【速读】:该论文旨在解决少样本目标检测(Few-Shot Object Detection, FSOD)中因基础模型生成的边界框存在过度碎片化问题而导致的误检率高、检测精度低的问题。其核心解决方案是提出一种基于图结构的置信度重加权方法(graph-based confidence reweighting),将预测的边界框建模为有向图中的节点,并通过图扩散操作在图中传播置信度分数,从而提升完整物体区域的置信度并降低局部碎片区域的置信度,有效改善检测粒度并减少假阳性边界框。该方法无需额外训练,在多个基准数据集上显著优于现有无训练方法。

链接: https://arxiv.org/abs/2602.03137
作者: Chen-Bin Feng,Youyang Sha,Longfei Liu,Yongjun Yu,Chi Man Vong,Xuanlong Yu,Xi Shen
机构: University of Macau (澳门大学); Intellindust AI Lab (智行工业人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICLR 2026. Code is available at: \url{ this https URL }

点击查看摘要

Abstract:In this paper, we present FSOD-VFM: Few-Shot Object Detectors with Vision Foundation Models, a framework that leverages vision foundation models to tackle the challenge of few-shot object detection. FSOD-VFM integrates three key components: a universal proposal network (UPN) for category-agnostic bounding box generation, SAM2 for accurate mask extraction, and DINOv2 features for efficient adaptation to new object categories. Despite the strong generalization capabilities of foundation models, the bounding boxes generated by UPN often suffer from overfragmentation, covering only partial object regions and leading to numerous small, false-positive proposals rather than accurate, complete object detections. To address this issue, we introduce a novel graph-based confidence reweighting method. In our approach, predicted bounding boxes are modeled as nodes in a directed graph, with graph diffusion operations applied to propagate confidence scores across the network. This reweighting process refines the scores of proposals, assigning higher confidence to whole objects and lower confidence to local, fragmented parts. This strategy improves detection granularity and effectively reduces the occurrence of false-positive bounding box proposals. Through extensive experiments on Pascal-5 ^i , COCO-20 ^i , and CD-FSOD datasets, we demonstrate that our method substantially outperforms existing approaches, achieving superior performance without requiring additional training. Notably, on the challenging CD-FSOD dataset, which spans multiple datasets and domains, our FSOD-VFM achieves 31.6 AP in the 10-shot setting, substantially outperforming previous training-free methods that reach only 21.4 AP. Code is available at: this https URL.
zh

[CV-88] SwiftVLM: Efficient Vision-Language Model Inference via Cross-Layer Token Bypass

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)中因早期视觉标记(visual token)剪枝导致的细粒度视觉推理性能下降问题。现有方法通常在浅层进行剪枝决策,虽能提升效率,但会丢失后续层中对文本条件推理至关重要的视觉信息,从而影响任务表现。解决方案的关键在于提出一种称为“bypass”的新剪枝范式:该范式不直接丢弃未选中的视觉标记,而是将其保留并传递至后续剪枝阶段进行重新评估,从而避免因过早剪枝造成的信息不可逆损失;在此基础上,作者进一步提出了SwiftVLM方法,其通过在特定层执行剪枝、且各层独立决策,实现了无需训练的高效剪枝策略,在多个VLM和基准测试中均展现出更优的准确率-效率权衡与更忠实的视觉标记选择行为。

链接: https://arxiv.org/abs/2602.03134
作者: Chen Qian,Xinran Yu,Danyang Li,Guoxuan Chi,Zheng Yang,Qiang Ma,Xin Miao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Visual token pruning is a promising approach for reducing the computational cost of vision-language models (VLMs), and existing methods often rely on early pruning decisions to improve efficiency. While effective on coarse-grained reasoning tasks, they suffer from significant performance degradation on tasks requiring fine-grained visual details. Through layer-wise analysis, we reveal substantial discrepancies in visual token importance across layers, showing that tokens deemed unimportant at shallow layers can later become highly relevant for text-conditioned reasoning. To avoid irreversible critical information loss caused by premature pruning, we introduce a new pruning paradigm, termed bypass, which preserves unselected visual tokens and forwards them to subsequent pruning stages for re-evaluation. Building on this paradigm, we propose SwiftVLM, a simple and training-free method that performs pruning at model-specific layers with strong visual token selection capability, while enabling independent pruning decisions across layers. Experiments across multiple VLMs and benchmarks demonstrate that SwiftVLM consistently outperforms existing pruning strategies, achieving superior accuracy-efficiency trade-offs and more faithful visual token selection behavior.
zh

[CV-89] FinMTM: A Multi-Turn Multimodal Benchmark for Financial Reasoning and Agent Evaluation

【速读】:该论文旨在解决金融领域视觉语言模型(Vision-Language Models, VLMs)在真实应用场景中评估不足的问题,现有金融基准多为单轮问答且题型单一,难以全面衡量模型在复杂视觉理解与多轮推理任务中的能力。解决方案的关键在于提出FinMTM——一个多层次、多任务的多轮跨模态金融问答基准,其核心创新包括:(1) 构建包含11,133对中英双语金融问答对的数据集,覆盖蜡烛图、统计图表和报告图像等多样金融视觉内容;(2) 设计涵盖单选、多选、多轮开放对话及基于代理的任务类型,实现任务维度的多样化;(3) 为不同任务设计针对性的评估协议,如多选题采用集合重叠评分规则、多轮对话引入轮次级与会话级加权评分、代理任务结合规划质量与最终结果的复合指标,从而更精准地量化模型在细粒度视觉感知、长程推理和复杂代理流程中的表现。

链接: https://arxiv.org/abs/2602.03130
作者: Chenxi Zhang,Ziliang Gan,Liyun Zhu,Youwei Pang,Qing Zhang,Rongjunchen Zhang
机构: HiThink Research; Wuhan University (武汉大学); Zhejiang University (浙江大学); Nanyang Technological University (南洋理工大学); Shanghai Institute of Technology (上海应用技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:The financial domain poses substantial challenges for vision-language models (VLMs) due to specialized chart formats and knowledge-intensive reasoning requirements. However, existing financial benchmarks are largely single-turn and rely on a narrow set of question formats, limiting comprehensive evaluation in realistic application scenarios. To address this gap, we propose FinMTM, a multi-turn multimodal benchmark that expands diversity along both data and task dimensions. On the data side, we curate and annotate 11,133 bilingual (Chinese and English) financial QA pairs grounded in financial visuals, including candlestick charts, statistical plots, and report figures. On the task side, FinMTM covers single- and multiple-choice questions, multi-turn open-ended dialogues, and agent-based tasks. We further design task-specific evaluation protocols, including a set-overlap scoring rule for multiple-choice questions, a weighted combination of turn-level and session-level scores for multi-turn dialogues, and a composite metric that integrates planning quality with final outcomes for agent tasks. Extensive experimental evaluation of 22 VLMs reveal their limitations in fine-grained visual perception, long-context reasoning, and complex agent workflows.
zh

[CV-90] Flexible Geometric Guidance for Probabilistic Human Pose Estimation with Diffusion Models

【速读】:该论文旨在解决从2D图像中进行3D人体姿态估计时面临的深度模糊性和遮挡问题,这些问题导致该任务在数学上是欠定的(underdetermined),即同一张2D图像可能对应多个甚至无限个合理的3D姿态。传统方法通常假设存在确定性的映射关系,仅输出单一姿态,且依赖大量配对的2D-3D数据进行训练,泛化能力受限。本文提出了一种基于扩散模型(diffusion models)的框架,其核心创新在于利用无条件扩散模型(仅用3D数据训练)并借助2D关键点检测器生成的热图梯度进行条件引导,从而从一个概率分布中采样出与2D图像一致的多种合理3D姿态。该方案无需配对的2D-3D数据即可实现高质量姿态估计,并展现出良好的泛化性能和任务灵活性(如姿态生成与补全)。

链接: https://arxiv.org/abs/2602.03126
作者: Francis Snelgar,Ming Xu,Stephen Gould,Liang Zheng,Akshay Asthana
机构: Australian National University (澳大利亚国立大学); Seeing Machines; Australian Research Council (澳大利亚研究委员会)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D human pose estimation from 2D images is a challenging problem due to depth ambiguity and occlusion. Because of these challenges the task is underdetermined, where there exists multiple – possibly infinite – poses that are plausible given the image. Despite this, many prior works assume the existence of a deterministic mapping and estimate a single pose given an image. Furthermore, methods based on machine learning require a large amount of paired 2D-3D data to train and suffer from generalization issues to unseen scenarios. To address both of these issues, we propose a framework for pose estimation using diffusion models, which enables sampling from a probability distribution over plausible poses which are consistent with a 2D image. Our approach falls under the guidance framework for conditional generation, and guides samples from an unconditional diffusion model, trained only on 3D data, using the gradients of the heatmaps from a 2D keypoint detector. We evaluate our method on the Human 3.6M dataset under best-of- m multiple hypothesis evaluation, showing state-of-the-art performance among methods which do not require paired 2D-3D data for training. We additionally evaluate the generalization ability using the MPI-INF-3DHP and 3DPW datasets and demonstrate competitive performance. Finally, we demonstrate the flexibility of our framework by using it for novel tasks including pose generation and pose completion, without the need to train bespoke conditional models. We make code available at this https URL .
zh

[CV-91] Feature Alignment and Supervision in Category Learning: A Comparative Approach with Children and Neural Networks

【速读】:该论文旨在解决人类与机器在稀疏数据条件下如何进行少样本半监督类别学习的问题,核心在于揭示两者在不同监督强度、特征结构和感知对齐程度下的学习机制差异。其解决方案的关键在于采用“物种公平设计”(species-fair design),使儿童与卷积神经网络(CNNs)在完全相同的实验条件下接受新颖类别学习任务,通过系统性地调控标签比例(1/3/6个标签)、目标特征类型(大小、形状、图案)以及感知对齐度(高/低),发现人类学习者表现出快速泛化能力但具有显著的特征特异性偏差和对对齐敏感;而CNN则展现出监督增强性能的趋势,但其效果受特征结构与对齐程度的调节作用影响。这表明,人机对比研究必须聚焦于三者交互关系,而非仅依赖整体准确率指标。

链接: https://arxiv.org/abs/2602.03124
作者: Fanxiao Wani Qiu,Oscar Leong
机构: University of Southern California (南加州大学); University of California, Los Angeles (加州大学洛杉矶分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Understanding how humans and machines learn from sparse data is central to cognitive science and machine learning. Using a species-fair design, we compare children and convolutional neural networks (CNNs) in a few-shot semi-supervised category learning task. Both learners are exposed to novel object categories under identical conditions. Learners receive mixtures of labeled and unlabeled exemplars while we vary supervision (1/3/6 labels), target feature (size, shape, pattern), and perceptual alignment (high/low). We find that children generalize rapidly from minimal labels but show strong feature-specific biases and sensitivity to alignment. CNNs show a different interaction profile: added supervision improves performance, but both alignment and feature structure moderate the impact additional supervision has on learning. These results show that human-model comparisons must be drawn under the right conditions, emphasizing interactions among supervision, feature structure, and alignment rather than overall accuracy.
zh

[CV-92] Beyond Cropping and Rotation: Automated Evolution of Powerful Task-Specific Augmentations with Generative Models

【速读】:该论文旨在解决传统数据增强方法在视觉模型训练中难以适应复杂任务需求的问题,尤其是在低数据场景下,现有方法(如AutoAugment)生成的增强策略可能无法充分提升模型鲁棒性。其解决方案的关键在于提出EvoAug——一个基于生成式AI(Generative AI)与高效进化算法相结合的自动化增强学习流水线,通过学习随机增强树(stochastic augmentation trees)来结构化地组合多种增强操作,从而实现更灵活、自适应且任务相关的数据增强策略。该方法能够自动发现与领域知识一致的增强方案,在细粒度分类和少样本学习任务中显著提升模型性能。

链接: https://arxiv.org/abs/2602.03123
作者: Judah Goldfeder,Shreyes Kaliyur,Vaibhav Sourirajan,Patrick Minwan Puma,Philippe Martin Wyder,Yuhang Hu,Jiong Lin,Hod Lipson
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Data augmentation has long been a cornerstone for reducing overfitting in vision models, with methods like AutoAugment automating the design of task-specific augmentations. Recent advances in generative models, such as conditional diffusion and few-shot NeRFs, offer a new paradigm for data augmentation by synthesizing data with significantly greater diversity and realism. However, unlike traditional augmentations like cropping or rotation, these methods introduce substantial changes that enhance robustness but also risk degrading performance if the augmentations are poorly matched to the task. In this work, we present EvoAug, an automated augmentation learning pipeline, which leverages these generative models alongside an efficient evolutionary algorithm to learn optimal task-specific augmentations. Our pipeline introduces a novel approach to image augmentation that learns stochastic augmentation trees that hierarchically compose augmentations, enabling more structured and adaptive transformations. We demonstrate strong performance across fine-grained classification and few-shot learning tasks. Notably, our pipeline discovers augmentations that align with domain knowledge, even in low-data settings. These results highlight the potential of learned generative augmentations, unlocking new possibilities for robust model training.
zh

[CV-93] Gromov Wasserstein Optimal Transport for Semantic Correspondences

【速读】:该论文旨在解决图像对之间语义对应(semantic correspondence)任务中现有方法计算效率低的问题。当前最优方法依赖于融合DINOv2与Stable Diffusion(SD)两类大模型的特征,虽能实现高精度匹配,但因需多次调用大型基础模型生成特征图而计算开销巨大。其解决方案的关键在于:用一个具备空间一致性特性的高效匹配算法替代原方案中的SD特征,具体而言是将标准最近邻匹配替换为引入Gromov Wasserstein空间平滑先验的最优传输算法,从而在显著提升DINOv2基线性能的同时,达到与SD特征方法相当甚至更优的效果,并实现5–10倍的效率提升。

链接: https://arxiv.org/abs/2602.03105
作者: Francis Snelgar,Stephen Gould,Ming Xu,Liang Zheng,Akshay Asthana
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Establishing correspondences between image pairs is a long studied problem in computer vision. With recent large-scale foundation models showing strong zero-shot performance on downstream tasks including classification and segmentation, there has been interest in using the internal feature maps of these models for the semantic correspondence task. Recent works observe that features from DINOv2 and Stable Diffusion (SD) are complementary, the former producing accurate but sparse correspondences, while the latter produces spatially consistent correspondences. As a result, current state-of-the-art methods for semantic correspondence involve combining features from both models in an ensemble. While the performance of these methods is impressive, they are computationally expensive, requiring evaluating feature maps from large-scale foundation models. In this work we take a different approach, instead replacing SD features with a superior matching algorithm which is imbued with the desirable spatial consistency property. Specifically, we replace the standard nearest neighbours matching with an optimal transport algorithm that includes a Gromov Wasserstein spatial smoothness prior. We show that we can significantly boost the performance of the DINOv2 baseline, and be competitive and sometimes surpassing state-of-the-art methods using Stable Diffusion features, while being 5–10x more efficient. We make code available at this https URL .
zh

[CV-94] Neural Predictor-Corrector: Solving Homotopy Problems with Reinforcement Learning

【速读】:该论文旨在解决传统同伦(Homotopy)方法中求解器依赖人工设计启发式策略来确定步长和迭代终止条件的问题,这些策略往往效率低下且任务特定。其解决方案的关键在于提出一种统一的神经预测-校正框架(Neural Predictor-Corrector, NPC),将策略选择建模为序列决策问题,并利用强化学习自动学习高效策略,同时引入摊销训练机制实现一次离线训练即可泛化至新实例的在线推理,从而显著提升求解效率与跨任务稳定性。

链接: https://arxiv.org/abs/2602.03086
作者: Jiayao Mai,Bangyan Liao,Zhenjun Zhao,Yingping Zeng,Haoang Li,Javier Civera,Tailin Wu,Yi Zhou,Peidong Liu
机构: Hunan University (湖南大学); Westlake University (西湖大学); University of Zaragoza (萨拉戈萨大学); Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The Homotopy paradigm, a general principle for solving challenging problems, appears across diverse domains such as robust optimization, global optimization, polynomial root-finding, and sampling. Practical solvers for these problems typically follow a predictor-corrector (PC) structure, but rely on hand-crafted heuristics for step sizes and iteration termination, which are often suboptimal and task-specific. To address this, we unify these problems under a single framework, which enables the design of a general neural solver. Building on this unified view, we propose Neural Predictor-Corrector (NPC), which replaces hand-crafted heuristics with automatically learned policies. NPC formulates policy selection as a sequential decision-making problem and leverages reinforcement learning to automatically discover efficient strategies. To further enhance generalization, we introduce an amortized training mechanism, enabling one-time offline training for a class of problems and efficient online inference on new instances. Experiments on four representative homotopy problems demonstrate that our method generalizes effectively to unseen instances. It consistently outperforms classical and specialized baselines in efficiency while demonstrating superior stability across tasks, highlighting the value of unifying homotopy methods into a single neural framework.
zh

[CV-95] A generalizable large-scale foundation model for musculoskeletal radiographs

【速读】:该论文旨在解决当前生成式 AI 在骨骼肌肉系统影像诊断中面临的任务特异性、标注依赖性强以及跨疾病和解剖区域泛化能力有限的问题。其解决方案的关键在于构建一个大规模基础模型 SKELEX,该模型基于 120 万张多样且病种丰富的骨骼肌肉 X 光片,采用自监督学习进行训练,从而实现无需大量标注数据即可广泛适用的诊断能力。SKELEX 在 12 项下游任务中表现出优于基线模型的性能,并具备零样本异常定位能力,进一步支持了可解释的区域引导型骨肿瘤预测模型开发,为临床转化和数据高效研究提供了通用框架。

链接: https://arxiv.org/abs/2602.03076
作者: Shinn Kim,Soobin Lee,Kyoungseob Shin,Han-Soo Kim,Yongsung Kim,Minsu Kim,Juhong Nam,Somang Ko,Daeheon Kwon,Wook Huh,Ilkyu Han,Sunghoon Kwon
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Artificial intelligence (AI) has shown promise in detecting and characterizing musculoskeletal diseases from radiographs. However, most existing models remain task-specific, annotation-dependent, and limited in generalizability across diseases and anatomical regions. Although a generalizable foundation model trained on large-scale musculoskeletal radiographs is clinically needed, publicly available datasets remain limited in size and lack sufficient diversity to enable training across a wide range of musculoskeletal conditions and anatomical sites. Here, we present SKELEX, a large-scale foundation model for musculoskeletal radiographs, trained using self-supervised learning on 1.2 million diverse, condition-rich images. The model was evaluated on 12 downstream diagnostic tasks and generally outperformed baselines in fracture detection, osteoarthritis grading, and bone tumor classification. Furthermore, SKELEX demonstrated zero-shot abnormality localization, producing error maps that identified pathologic regions without task-specific training. Building on this capability, we developed an interpretable, region-guided model for predicting bone tumors, which maintained robust performance on independent external datasets and was deployed as a publicly accessible web application. Overall, SKELEX provides a scalable, label-efficient, and generalizable AI framework for musculoskeletal imaging, establishing a foundation for both clinical translation and data-efficient research in musculoskeletal radiology.
zh

[CV-96] Finding Optimal Video Moment without Training: Gaussian Boundary Optimization for Weakly Supervised Video Grounding

【速读】:该论文旨在解决弱监督时序视频定位(Weakly supervised temporal video grounding)任务中,现有基于高斯分布的时序提议方法在推理阶段依赖启发式映射从高斯参数到片段边界而导致定位性能不佳的问题。其解决方案的关键在于提出高斯边界优化(Gaussian Boundary Optimization, GBO),通过构建一个兼顾提议覆盖度与片段紧凑性的优化问题来预测精确的边界,并推导出该问题的闭式解,从而实现无需训练、兼容单高斯与混合高斯提议架构的高效且鲁棒的边界预测。

链接: https://arxiv.org/abs/2602.03071
作者: Sunoh Kim,Kimin Yun,Daeho Um
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in IEEE TMM

点击查看摘要

Abstract:Weakly supervised temporal video grounding aims to localize query-relevant segments in untrimmed videos using only video-sentence pairs, without requiring ground-truth segment annotations that specify exact temporal boundaries. Recent approaches tackle this task by utilizing Gaussian-based temporal proposals to represent query-relevant segments. However, their inference strategies rely on heuristic mappings from Gaussian parameters to segment boundaries, resulting in suboptimal localization performance. To address this issue, we propose Gaussian Boundary Optimization (GBO), a novel inference framework that predicts segment boundaries by solving a principled optimization problem that balances proposal coverage and segment compactness. We derive a closed-form solution for this problem and rigorously analyze the optimality conditions under varying penalty regimes. Beyond its theoretical foundations, GBO offers several practical advantages: it is training-free and compatible with both single-Gaussian and mixture-based proposal architectures. Our experiments show that GBO significantly improves localization, achieving state-of-the-art results across standard benchmarks. Extensive experiments demonstrate the efficiency and generalizability of GBO across various proposal schemes. The code is available at \hrefthis https URLthis https URL.
zh

[CV-97] JRDB-Pose3D: A Multi-person 3D Human Pose and Shape Estimation Dataset for Robotics

【速读】:该论文旨在解决现有3D人体姿态估计数据集多局限于单人场景或受控实验室环境,难以真实反映复杂动态的多人交互场景问题。其解决方案的关键在于提出JRDB-Pose3D数据集,该数据集通过移动机器人平台采集室内与室外多人群场景,并提供基于SMPL(Skinned Multi-Person Linear)模型的3D人体姿态标注、一致的身体形状参数及跨帧个体跟踪ID,同时包含频繁遮挡、身体截断和出框部分等现实挑战因素,且继承了原JRDB数据集的丰富标注信息(如2D姿态、社会分组、活动类别、交互关系、语义分割掩码及人口统计学属性),从而为自动驾驶、机器人感知与人机交互等应用提供了更贴近真实世界需求的基准数据集。

链接: https://arxiv.org/abs/2602.03064
作者: Sandika Biswas,Kian Izadpanah,Hamid Rezatofighi
机构: Monash University (莫纳什大学); Sharif University (沙里夫理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Real-world scenes are inherently crowded. Hence, estimating 3D poses of all nearby humans, tracking their movements over time, and understanding their activities within social and environmental contexts are essential for many applications, such as autonomous driving, robot perception, robot navigation, and human-robot interaction. However, most existing 3D human pose estimation datasets primarily focus on single-person scenes or are collected in controlled laboratory environments, which restricts their relevance to real-world applications. To bridge this gap, we introduce JRDB-Pose3D, which captures multi-human indoor and outdoor environments from a mobile robotic platform. JRDB-Pose3D provides rich 3D human pose annotations for such complex and dynamic scenes, including SMPL-based pose annotations with consistent body-shape parameters and track IDs for each individual over time. JRDB-Pose3D contains, on average, 5-10 human poses per frame, with some scenes featuring up to 35 individuals simultaneously. The proposed dataset presents unique challenges, including frequent occlusions, truncated bodies, and out-of-frame body parts, which closely reflect real-world environments. Moreover, JRDB-Pose3D inherits all available annotations from the JRDB dataset, such as 2D pose, information about social grouping, activities, and interactions, full-scene semantic masks with consistent human- and object-level tracking, and detailed annotations for each individual, such as age, gender, and race, making it a holistic dataset for a wide range of downstream perception and human-centric understanding tasks.
zh

[CV-98] IVC-Prune: Revealing the Implicit Visual Coordinates in LVLMs for Vision Token Pruning ICLR2026

【速读】:该论文旨在解决大视觉语言模型(Large Vision-Language Models, LVLMs)在处理高分辨率图像时推理成本过高的问题。现有视觉标记剪枝方法多依赖语义相关性,容易误删对空间推理至关重要的视觉标记。解决方案的关键在于提出一种无需训练、提示感知的剪枝策略——IVC-Prune,其核心洞察是LVLM通过旋转位置编码(Rotary Position Embeddings, RoPE)隐式构建视觉坐标系,其中特定位置的标记构成隐式视觉坐标(Implicit Visual Coordinate, IVC)标记,对空间推理不可或缺。该方法通过理论分析RoPE的数学性质识别IVC标记(即旋转矩阵近似单位矩阵或90°旋转矩阵的位置),并结合两阶段过程(语义种子发现与基于值向量相似度的上下文精炼)保留语义相关的前景标记,从而在减少约50%视觉标记的同时保持≥99%原始性能,并在多个基准上实现提升。

链接: https://arxiv.org/abs/2602.03060
作者: Zhichao Sun,Yidong Ma,Gang Liu,Yibo Chen,Xu Tang,Yao Hu,Yongchao Xu
机构: Wuhan University (武汉大学); Xiaohongshu Inc (小红书公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICLR 2026

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) achieve impressive performance across multiple tasks. A significant challenge, however, is their prohibitive inference cost when processing high-resolution visual inputs. While visual token pruning has emerged as a promising solution, existing methods that primarily focus on semantic relevance often discard tokens that are crucial for spatial reasoning. We address this gap through a novel insight into \emphhow LVLMs process spatial reasoning. Specifically, we reveal that LVLMs implicitly establish visual coordinate systems through Rotary Position Embeddings (RoPE), where specific token positions serve as \textbfimplicit visual coordinates (IVC tokens) that are essential for spatial reasoning. Based on this insight, we propose \textbfIVC-Prune, a training-free, prompt-aware pruning strategy that retains both IVC tokens and semantically relevant foreground tokens. IVC tokens are identified by theoretically analyzing the mathematical properties of RoPE, targeting positions at which its rotation matrices approximate identity matrix or the 90^\circ rotation matrix. Foreground tokens are identified through a robust two-stage process: semantic seed discovery followed by contextual refinement via value-vector similarity. Extensive evaluations across four representative LVLMs and twenty diverse benchmarks show that IVC-Prune reduces visual tokens by approximately 50% while maintaining \geq 99% of the original performance and even achieving improvements on several benchmarks. Source codes are available at this https URL.
zh

[CV-99] SAFE-KD: Risk-Controlled Early-Exit Distillation for Vision Backbones IJCNN

【速读】:该论文旨在解决早期退出网络(early-exit networks)在实际部署中缺乏安全决策机制的问题,即如何确保在输入样本较易分类时提前退出的同时,仍能控制早期退出带来的误分类风险。解决方案的关键在于提出SAFE-KD框架,其核心是将分层知识蒸馏(hierarchical distillation)与合规风险控制(conformal risk control, CRC)相结合:通过在骨干网络中间层附加轻量级退出头,并采用解耦知识蒸馏(Decoupled Knowledge Distillation, DKD)将强教师模型的知识传递至所有退出层,同时强制不同深度退出之间的一致性;在推理阶段,利用CRC对每个退出层的停止阈值进行校准,从而在交换性假设下保证用户指定的“选择性误分类风险”(selective misclassification risk)具有有限样本保障。

链接: https://arxiv.org/abs/2602.03043
作者: Salim Khazem
机构: Talan(塔兰)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IJCNN

点击查看摘要

Abstract:Early-exit networks reduce inference cost by allowing ``easy’’ inputs to stop early, but practical deployment hinges on knowing \emphwhen early exit is safe. We introduce SAFE-KD, a universal multi-exit wrapper for modern vision backbones that couples hierarchical distillation with \emphconformal risk control. SAFE-KD attaches lightweight exit heads at intermediate depths, distills a strong teacher into all exits via Decoupled Knowledge Distillation (DKD), and enforces deep-to-shallow consistency between exits. At inference, we calibrate per-exit stopping thresholds on a held-out set using conformal risk control (CRC) to guarantee a user-specified \emphselective misclassification risk (among the samples that exit early) under exchangeability. Across multiple datasets and architectures, SAFE-KD yields improved accuracy compute trade-offs, stronger calibration, and robust performance under corruption while providing finite-sample risk guarantees.
zh

[CV-100] HP-GAN: Harnessing pretrained networks for GAN improvement with FakeTwins and discriminator consistency

【速读】:该论文旨在解决生成式对抗网络(Generative Adversarial Networks, GANs)在图像合成质量与多样性方面的局限性,尤其针对现有方法依赖预训练网络计算感知损失或使用预训练特征空间时未能充分挖掘神经网络先验知识的问题。解决方案的关键在于提出一种名为HP-GAN的新框架,其核心创新包括两个策略:一是利用预训练网络作为编码器构建自监督损失的FakeTwins机制,通过生成图像反向传播该损失以优化生成器,从而提升图像质量和多样性;二是引入CNN与视觉Transformer(Vision Transformer, ViT)特征空间中判别器之间的一致性约束机制,促进多判别器协同学习并增强训练稳定性。实验表明,HP-GAN在17个不同数据集上均显著优于当前最优方法,在Fréchet Inception Distance(FID)指标上实现稳定提升。

链接: https://arxiv.org/abs/2602.03039
作者: Geonhui Son,Jeong Ryong Lee,Dosik Hwang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted manuscript. This is the accepted version of the article published in Neural Networks

点击查看摘要

Abstract:Generative Adversarial Networks (GANs) have made significant progress in enhancing the quality of image synthesis. Recent methods frequently leverage pretrained networks to calculate perceptual losses or utilize pretrained feature spaces. In this paper, we extend the capabilities of pretrained networks by incorporating innovative self-supervised learning techniques and enforcing consistency between discriminators during GAN training. Our proposed method, named HP-GAN, effectively exploits neural network priors through two primary strategies: FakeTwins and discriminator consistency. FakeTwins leverages pretrained networks as encoders to compute a self-supervised loss and applies this through the generated images to train the generator, thereby enabling the generation of more diverse and high quality images. Additionally, we introduce a consistency mechanism between discriminators that evaluate feature maps extracted from Convolutional Neural Network (CNN) and Vision Transformer (ViT) feature networks. Discriminator consistency promotes coherent learning among discriminators and enhances training robustness by aligning their assessments of image quality. Our extensive evaluation across seventeen datasets-including scenarios with large, small, and limited data, and covering a variety of image domains-demonstrates that HP-GAN consistently outperforms current state-of-the-art methods in terms of Fréchet Inception Distance (FID), achieving significant improvements in image diversity and quality. Code is available at: this https URL.
zh

[CV-101] Bongards at the Boundary of Perception and Reasoning : Programs or Language?

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在面对全新、复杂视觉推理任务时能力不足的问题,特别是针对经典视觉推理挑战——Bongard问题的建模与求解。其解决方案的关键在于提出一种神经符号(neurosymbolic)方法:首先利用大语言模型(Large Language Models, LLMs)根据假设的解题规则生成参数化的程序表示,随后通过贝叶斯优化(Bayesian optimization)进行参数拟合,从而实现对Bongard问题图像的有效分类和从零开始的求解。

链接: https://arxiv.org/abs/2602.03038
作者: Cassidy Langenfeld,Claas Beger,Gloria Geng,Wasu Top Piriyakulkij,Keya Hu,Yewen Pu,Kevin Ellis
机构: Cornell University (康奈尔大学); Shanghai Jiao Tong University (上海交通大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 5 figures

点击查看摘要

Abstract:Vision-Language Models (VLMs) have made great strides in everyday visual tasks, such as captioning a natural image, or answering commonsense questions about such images. But humans possess the puzzling ability to deploy their visual reasoning abilities in radically new situations, a skill rigorously tested by the classic set of visual reasoning challenges known as the Bongard problems. We present a neurosymbolic approach to solving these problems: given a hypothesized solution rule for a Bongard problem, we leverage LLMs to generate parameterized programmatic representations for the rule and perform parameter fitting using Bayesian optimization. We evaluate our method on classifying Bongard problem images given the ground truth rule, as well as on solving the problems from scratch.
zh

[CV-102] MUSE: A Multi-agent Framework for Unconstrained Story Envisioning via Closed-Loop Cognitive Orchestration

【速读】:该论文旨在解决从短用户提示生成长时序音视频故事时存在的“意图-执行鸿沟”问题,即在长时间跨度内保持高阶叙事意图的一致性,避免语义漂移和身份不一致。其解决方案的关键在于将叙事生成建模为一个闭环约束执行问题,并提出MUSE(Multi-agent Unified Storytelling Engine)框架,通过迭代的“计划-执行-验证-修订”循环协调多模态生成过程;MUSE将叙事意图转化为对角色身份、空间构图和时间连续性的显式、可机器执行的控制策略,并引入针对性的跨模态反馈机制以纠正生成过程中的违规行为,从而显著提升长程叙事连贯性、跨模态身份一致性及影视质量。

链接: https://arxiv.org/abs/2602.03028
作者: Wenzhang Sun,Zhenyu Wang,Zhangchi Hu,Chunfeng Wang,Hao Li,Wei Chen
机构: Li Auto; University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating long-form audio-visual stories from a short user prompt remains challenging due to an intent-execution gap, where high-level narrative intent must be preserved across coherent, shot-level multimodal generation over long horizons. Existing approaches typically rely on feed-forward pipelines or prompt-only refinement, which often leads to semantic drift and identity inconsistency as sequences grow longer. We address this challenge by formulating storytelling as a closed-loop constraint enforcement problem and propose MUSE, a multi-agent framework that coordinates generation through an iterative plan-execute-verify-revise loop. MUSE translates narrative intent into explicit, machine-executable controls over identity, spatial composition, and temporal continuity, and applies targeted multimodal feedback to correct violations during generation. To evaluate open-ended storytelling without ground-truth references, we introduce MUSEBench, a reference-free evaluation protocol validated by human judgments. Experiments demonstrate that MUSE substantially improves long-horizon narrative coherence, cross-modal identity consistency, and cinematic quality compared with representative baselines.
zh

[CV-103] A Vision-Based Analysis of Congestion Pricing in New York City

【速读】:该论文旨在解决纽约市拥堵收费政策(Congestion Pricing Program)对城市交通流量影响的量化评估问题。其解决方案的关键在于构建一个基于计算机视觉(Computer Vision)的自动化分析流程,通过处理分布在曼哈顿及纽约市超过900个交通摄像头的视频数据,对比2024年11月至2025年1月政策实施前后的交通模式变化,从而建立基准交通密度并识别出系统性变化。

链接: https://arxiv.org/abs/2602.03015
作者: Mehmet Kerem Turkcan,Jhonatan Tavori,Javad Ghaderi,Gil Zussman,Zoran Kostic,Andrew Smyth
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We examine the impact of New York City’s congestion pricing program through automated analysis of traffic camera data. Our computer vision pipeline processes footage from over 900 cameras distributed throughout Manhattan and New York, comparing traffic patterns from November 2024 through the program’s implementation in January 2025 until January 2026. We establish baseline traffic patterns and identify systematic changes in vehicle density across the monitored region.
zh

[CV-104] hinking inside the Convolution for Image Inpainting: Reconstructing Texture via Structure under Global and Local Side

【速读】:该论文旨在解决图像修复(Image Inpainting)中因卷积下采样过程导致的结构特征与纹理特征信息丢失问题,从而影响上采样重建质量。现有方法虽能分别提取高频结构和低频纹理特征,但忽略了下采样过程中两类特征的信息损失,造成最终修复结果不理想。解决方案的关键在于:在编码器阶段引入统计归一化与反归一化策略,通过结构特征与纹理特征之间的相互引导机制,在下采样过程中实现更有效的特征保留与重建指导,显著提升修复效果,尤其在256×256至512×512等多分辨率图像上表现优越。

链接: https://arxiv.org/abs/2602.03013
作者: Haipeng Liu,Yang Wang,Biao Qian,Yong Rui,Meng Wang
机构: Hefei University of Technology (合肥工业大学); Tsinghua University (清华大学); Lenovo Research (联想研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 17 figures

点击查看摘要

Abstract:Image inpainting has earned substantial progress, owing to the encoder-and-decoder pipeline, which is benefited from the Convolutional Neural Networks (CNNs) with convolutional downsampling to inpaint the masked regions semantically from the known regions within the encoder, coupled with an upsampling process from the decoder for final inpainting output. Recent studies intuitively identify the high-frequency structure and low-frequency texture to be extracted by CNNs from the encoder, and subsequently for a desirable upsampling recovery. However, the existing arts inevitably overlook the information loss for both structure and texture feature maps during the convolutional downsampling process, hence suffer from a non-ideal upsampling output. In this paper, we systematically answer whether and how the structure and texture feature map can mutually help to alleviate the information loss during the convolutional downsampling. Given the structure and texture feature maps, we adopt the statistical normalization and denormalization strategy for the reconstruction guidance during the convolutional downsampling process. The extensive experimental results validate its advantages to the state-of-the-arts over the images from low-to-high resolutions including 256256 and 512512, especially holds by substituting all the encoders by ours. Our code is available at this https URL
zh

[CV-105] VOILA: Value-of-Information Guided Fidelity Selection for Cost-Aware Multimodal Question Answering

【速读】:该论文旨在解决多模态视觉语言系统在视觉问答(Visual Question Answering, VQA)任务中因固定分辨率图像输入导致的高计算与传输成本问题,尤其是在资源受限场景下如何实现高效推理。其解决方案的关键在于提出VOILA框架,通过价值信息驱动(Value-Of-Information-driven)的自适应图像保真度选择机制,在模型执行前动态决定最优的图像分辨率:首先利用基于梯度提升的回归器仅从问题特征预测不同保真度下的正确性概率,再通过等距校准器(isotonic calibrator)优化概率可靠性,最终基于预估准确率和检索成本选择最小代价的保真度以最大化预期效用。实验表明,该方法可在保持90–95%全分辨率准确率的同时,实现50–60%的资源消耗降低。

链接: https://arxiv.org/abs/2602.03007
作者: Rahul Atul Bhope,K. R. Jayaram,Vinod Muthusamy,Ritesh Kumar,Vatche Isahagian,Nalini Venkatasubramanian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Despite significant costs from retrieving and processing high-fidelity visual inputs, most multimodal vision-language systems operate at fixed fidelity levels. We introduce VOILA, a framework for Value-Of-Information-driven adaptive fidelity selection in Visual Question Answering (VQA) that optimizes what information to retrieve before model execution. Given a query, VOILA uses a two-stage pipeline: a gradient-boosted regressor estimates correctness likelihood at each fidelity from question features alone, then an isotonic calibrator refines these probabilities for reliable decision-making. The system selects the minimum-cost fidelity maximizing expected utility given predicted accuracy and retrieval costs. We evaluate VOILA across three deployment scenarios using five datasets (VQA-v2, GQA, TextVQA, LoCoMo, FloodNet) and six Vision-Language Models (VLMs) with 7B-235B parameters. VOILA consistently achieves 50-60% cost reductions while retaining 90-95% of full-resolution accuracy across diverse query types and model architectures, demonstrating that pre-retrieval fidelity selection is vital to optimize multimodal inference under resource constraints.
zh

[CV-106] Video-OPD: Efficient Post-Training of Multimodal Large Language Models for Temporal Video Grounding via On-Policy Distillation

【速读】:该论文旨在解决Temporal Video Grounding (TVG)任务中基于GRPO(Generalized Reward Policy Optimization)的后训练方法所面临的稀疏奖励信号和高计算开销问题。解决方案的关键在于提出Video-OPD框架,其核心创新是通过引入前沿教师模型(frontier teacher)利用反向KL散度目标提供细粒度的token级监督信号,从而将稀疏的episode-level反馈转化为步骤级的学习信号,同时保持on-policy优化特性以缓解分布偏移问题;此外,进一步设计了Teacher-Validated Disagreement Focusing (TVDF)轻量级训练课程,迭代地聚焦于教师可靠且对学生最具信息量的轨迹,显著提升了训练效率与收敛速度。

链接: https://arxiv.org/abs/2602.02994
作者: Jiaze Li,Hao Yin,Haoran Xu,Boshen Xu,Wenhui Tan,Zewen He,Jianzhong Ju,Zhenbo Luo,Jian Luan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reinforcement learning has emerged as a principled post-training paradigm for Temporal Video Grounding (TVG) due to its on-policy optimization, yet existing GRPO-based methods remain fundamentally constrained by sparse reward signals and substantial computational overhead. We propose Video-OPD, an efficient post-training framework for TVG inspired by recent advances in on-policy distillation. Video-OPD optimizes trajectories sampled directly from the current policy, thereby preserving alignment between training and inference distributions, while a frontier teacher supplies dense, token-level supervision via a reverse KL divergence objective. This formulation preserves the on-policy property critical for mitigating distributional shift, while converting sparse, episode-level feedback into fine-grained, step-wise learning signals. Building on Video-OPD, we introduce Teacher-Validated Disagreement Focusing (TVDF), a lightweight training curriculum that iteratively prioritizes trajectories that are both teacher-reliable and maximally informative for the student, thereby improving training efficiency. Empirical results demonstrate that Video-OPD consistently outperforms GRPO while achieving substantially faster convergence and lower computational cost, establishing on-policy distillation as an effective alternative to conventional reinforcement learning for TVG.
zh

[CV-107] SharpTimeGS: Sharp and Stable Dynamic Gaussian Splatting via Lifespan Modulation

【速读】:该论文旨在解决动态场景中4D重建与视图合成时,静态区域与短期动态区域在表示和优化过程中难以平衡的问题,即现有基于高斯表示的方法无法同时保证长期静态结构的稳定性与短期动态内容的精确建模。其解决方案的关键在于提出一种寿命感知(lifespan-aware)的4D高斯框架——SharpTimeGS,通过引入可学习的寿命参数(lifespan parameter),将时间可见性从高斯衰减形式重构为平顶分布,使高斯点在其预期生命周期内保持稳定激活状态,避免冗余稠密化;同时,该寿命参数还调制每个高斯点的运动幅度,实现运动强度与持续时间的解耦,从而在维持长期静态点低漂移的同时保留短期动态点的自由运动能力;此外,设计了寿命-速度感知的稠密化策略,根据区域运动显著性分配计算资源,提升优化均衡性,最终实现在统一表示下对静态与动态区域的自适应建模,并支持高达4K分辨率、100 FPS的实时渲染性能。

链接: https://arxiv.org/abs/2602.02989
作者: Zhanfeng Liao,Jiajun Zhang,Hanzhang Tu,Zhixi Wang,Yunqi Gao,Hongwen Zhang,Yebin Liu
机构: Tsinghua University (清华大学); Beijing University of Posts and Telecommunications (北京邮电大学); Beijing Normal University (北京师范大学); Central China Normal University (华中师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Novel view synthesis of dynamic scenes is fundamental to achieving photorealistic 4D reconstruction and immersive visual experiences. Recent progress in Gaussian-based representations has significantly improved real-time rendering quality, yet existing methods still struggle to maintain a balance between long-term static and short-term dynamic regions in both representation and optimization. To address this, we present SharpTimeGS, a lifespan-aware 4D Gaussian framework that achieves temporally adaptive modeling of both static and dynamic regions under a unified representation. Specifically, we introduce a learnable lifespan parameter that reformulates temporal visibility from a Gaussian-shaped decay into a flat-top profile, allowing primitives to remain consistently active over their intended duration and avoiding redundant densification. In addition, the learned lifespan modulates each primitives’ motion, reducing drift in long-lived static points while retaining unrestricted motion for short-lived dynamic ones. This effectively decouples motion magnitude from temporal duration, improving long-term stability without compromising dynamic fidelity. Moreover, we design a lifespan-velocity-aware densification strategy that mitigates optimization imbalance between static and dynamic regions by allocating more capacity to regions with pronounced motion while keeping static areas compact and stable. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art performance while supporting real-time rendering up to 4K resolution at 100 FPS on one RTX 4090.
zh

[CV-108] Aligning Forest and Trees in Images and Long Captions for Visually Grounded Understanding

【速读】:该论文旨在解决大型视觉-语言模型(如CLIP)在处理长文本描述时因将图像与文本视为无差别的整体而导致的细粒度理解不足问题。其核心挑战在于如何在跨模态中同时建模全局语义与局部细节,并实现视觉与语言层次结构的对齐,而传统方法往往难以匹配语法或语义层级与视觉组织之间的差异。解决方案的关键是提出CAFT(Cross-domain Alignment of Forests and Trees)框架,该框架通过耦合从细到粗的视觉编码器与分层文本Transformer,设计了一种层次化对齐损失函数,在保持整图与整句对齐的同时,引导区域-句子对应关系,使粗粒度语义由细粒度证据构建,而非脱离局部语义锚定的简单聚合,从而在无需像素级监督的情况下实现可解释且具视觉根基的跨模态表示学习。

链接: https://arxiv.org/abs/2602.02977
作者: Byeongju Woo,Zilin Wang,Byeonghyun Pak,Sangwoo Mo,Stella X. Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint

点击查看摘要

Abstract:Large vision-language models such as CLIP struggle with long captions because they align images and texts as undifferentiated wholes. Fine-grained vision-language understanding requires hierarchical semantics capturing both global context and localized details across visual and textual domains. Yet linguistic hierarchies from syntax or semantics rarely match visual organization, and purely visual hierarchies tend to fragment scenes into appearance-driven parts without semantic focus. We propose CAFT (Cross-domain Alignment of Forests and Trees), a hierarchical image-text representation learning framework that aligns global and local semantics across images and long captions without pixel-level supervision. Coupling a fine-to-coarse visual encoder with a hierarchical text transformer, it uses a hierarchical alignment loss that matches whole images with whole captions while biasing region-sentence correspondences, so that coarse semantics are built from fine-grained evidence rather than from aggregation untethered to part-level grounding. Trained on 30M image-text pairs, CAFT achieves state-of-the-art performance on six long-text retrieval benchmarks and exhibits strong scaling behavior. Experiments show that hierarchical cross-domain alignment enables fine-grained, visually grounded image-text representations to emerge without explicit region-level supervision.
zh

[CV-109] SceneLinker: Compositional 3D Scene Generation via Semantic Scene Graph from RGB Sequences

【速读】:该论文旨在解决如何从RGB序列中生成与真实场景布局一致的组合式3D场景问题,尤其针对混合现实(Mixed Reality, MR)内容在不同用户空间中的自适应呈现需求。现有方法难以充分捕捉物体间的语义上下文关系,或仅关注形状多样性而忽视物体排列的一致性。解决方案的关键在于提出SceneLinker框架,其核心创新包括:(1) 设计一种带有交叉验证特征注意力机制的图网络以提升场景图(scene graph)预测准确性;(2) 构建图变分自编码器(graph-variational autoencoder, graph-VAE),其中包含联合形状与布局模块,实现从语义场景图到结构一致的3D场景的端到端生成。实验表明,该方法在3RScan/3DSSG和SG-FRONT数据集上均优于当前最优方法,尤其在复杂室内环境和严苛场景图约束下仍能保持高质量输出。

链接: https://arxiv.org/abs/2602.02974
作者: Seok-Young Kim,Dooyoung Kim,Woojin Cho,Hail Song,Suji Kang,Woontack Woo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted as an IEEE TVCG paper at IEEE VR 2026 (journal track)

点击查看摘要

Abstract:We introduce SceneLinker, a novel framework that generates compositional 3D scenes via semantic scene graph from RGB sequences. To adaptively experience Mixed Reality (MR) content based on each user’s space, it is essential to generate a 3D scene that reflects the real-world layout by compactly capturing the semantic cues of the surroundings. Prior works struggled to fully capture the contextual relationship between objects or mainly focused on synthesizing diverse shapes, making it challenging to generate 3D scenes aligned with object arrangements. We address these challenges by designing a graph network with cross-check feature attention for scene graph prediction and constructing a graph-variational autoencoder (graph-VAE), which consists of a joint shape and layout block for 3D scene generation. Experiments on the 3RScan/3DSSG and SG-FRONT datasets demonstrate that our approach outperforms state-of-the-art methods in both quantitative and qualitative evaluations, even in complex indoor environments and under challenging scene graph constraints. Our work enables users to generate consistent 3D spaces from their physical environments via scene graphs, allowing them to create spatial MR content. Project page is this https URL.
zh

[CV-110] Fisheye Stereo Vision: Depth and Range Error

【速读】:该论文旨在解决鱼眼立体视觉系统在大角度测量时的深度(depth)和范围(range)误差问题,尤其关注物体距离变化对精度的影响。解决方案的关键在于推导出深度和范围误差的解析表达式,从而能够量化误差随物体距离和视角变化的规律,为系统校准与误差补偿提供理论依据。

链接: https://arxiv.org/abs/2602.02973
作者: Leaf Jiang,Matthew Holzel,Bernhard Kaplan,Hsiou-Yuan Liu,Sabyasachi Paul,Karen Rankin,Piotr Swierczynski
机构: NODAR Inc.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This study derives analytical expressions for the depth and range error of fisheye stereo vision systems as a function of object distance, specifically accounting for accuracy at large angles.
zh

[CV-111] Dynamic High-frequency Convolution for Infrared Small Target Detection

【速读】:该论文旨在解决单帧红外小目标(Single-Frame Infrared Small Target, SIRST)检测中因高频率成分(High-Frequency Components, HFCs)干扰而导致的误检问题,尤其是现有基于学习的方法忽视了对各类HFCs的显式建模与判别表征学习,难以有效区分目标与背景杂波(如亮角、破碎云层等)。其解决方案的关键在于提出一种动态高频卷积(Dynamic High-Frequency Convolution, DHiF),将判别建模过程转化为动态局部滤波器组的生成机制;DHiF通过傅里叶变换性质对生成滤波器的参数进行零中心对称调整,使其对HFCs高度敏感,并结合标准卷积操作自适应地处理不同HFC区域,捕获其独特的灰度变化特征以实现判别表征学习。DHiF可作为即插即用模块嵌入任意SIRST检测网络,且计算效率损失极小。

链接: https://arxiv.org/abs/2602.02969
作者: Ruojing Li,Chao Xiao,Qian Yin,Wei An,Nuo Chen,Xinyi Ying,Miao Li,Yingqian Wang
机构: National University of Defense Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Infrared small targets are typically tiny and locally salient, which belong to high-frequency components (HFCs) in images. Single-frame infrared small target (SIRST) detection is challenging, since there are many HFCs along with targets, such as bright corners, broken clouds, and other clutters. Current learning-based methods rely on the powerful capabilities of deep networks, but neglect explicit modeling and discriminative representation learning of various HFCs, which is important to distinguish targets from other HFCs. To address the aforementioned issues, we propose a dynamic high-frequency convolution (DHiF) to translate the discriminative modeling process into the generation of a dynamic local filter bank. Especially, DHiF is sensitive to HFCs, owing to the dynamic parameters of its generated filters being symmetrically adjusted within a zero-centered range according to Fourier transformation properties. Combining with standard convolution operations, DHiF can adaptively and dynamically process different HFC regions and capture their distinctive grayscale variation characteristics for discriminative representation learning. DHiF functions as a drop-in replacement for standard convolution and can be used in arbitrary SIRST detection networks without significant decrease in computational efficiency. To validate the effectiveness of our DHiF, we conducted extensive experiments across different SIRST detection networks on real-scene datasets. Compared to other state-of-the-art convolution operations, DHiF exhibits superior detection performance with promising improvement. Codes are available at this https URL.
zh

[CV-112] RACE: Temporal Radiology with Anatomical Change Explanation for Grounded X-ray Report Generation

【速读】:该论文旨在解决胸部X光片的时序对比分析问题,即如何在两张不同时间点的胸片之间自动识别并定位病灶变化(如恶化、改善或稳定),同时生成自然语言描述。传统方法难以同时实现时序比较、变化分类与空间定位,而现有视觉-语言模型仅能处理单张图像的报告生成和视觉定位。解决方案的关键在于提出TRACE模型,该模型首次联合执行时序比较、变化分类与空间定位任务:给定前后两幅胸片,TRACE不仅能输出变化类型(worsened/improved/stable)的自然语言描述,还能通过边界框坐标精确定位每处变化区域。实验表明其空间定位准确率超过90%,且消融研究发现唯有将时序对比与空间定位联合训练时,模型才具备有效的变化检测能力,这揭示了空间定位作为注意力机制对时序推理至关重要。

链接: https://arxiv.org/abs/2602.02963
作者: OFM Riaz Rahman Aranya,Kevin Desai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Temporal comparison of chest X-rays is fundamental to clinical radiology, enabling detection of disease progression, treatment response, and new findings. While vision-language models have advanced single-image report generation and visual grounding, no existing method combines these capabilities for temporal change detection. We introduce Temporal Radiology with Anatomical Change Explanation (TRACE), the first model that jointly performs temporal comparison, change classification, and spatial localization. Given a prior and current chest X-ray, TRACE generates natural language descriptions of interval changes (worsened, improved, stable) while grounding each finding with bounding box coordinates. TRACE demonstrates effective spatial localization with over 90% grounding accuracy, establishing a foundation for this challenging new task. Our ablation study uncovers an emergent capability: change detection arises only when temporal comparison and spatial grounding are jointly learned, as neither alone enables meaningful change detection. This finding suggests that grounding provides a spatial attention mechanism essential for temporal reasoning.
zh

[CV-113] SRA-Seg: Synthetic to Real Alignment for Semi-Supervised Medical Image Segmentation

【速读】:该论文旨在解决合成医学图像在分割任务中性能不佳的问题,其根本原因在于合成数据与真实数据存在于不同的语义特征空间(semantic feature space),导致领域差距(domain gap)难以被现有半监督学习方法克服。解决方案的关键在于提出SRA-Seg框架,通过引入基于冻结DINOv2嵌入的相似性对齐(similarity-alignment, SA)损失,将合成图像的特征表示拉向其最近的真实图像对应点以实现语义空间对齐;同时采用软边缘混合技术生成平滑解剖过渡和连续标签,避免传统拼接增强带来的硬边界问题,并利用EMA教师模型生成伪标签及软分割损失来处理混合区域的不确定性,从而显著提升仅用少量标注真实数据与大量合成未标注数据下的分割性能。

链接: https://arxiv.org/abs/2602.02944
作者: OFM Riaz Rahman Aranya,Kevin Desai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Synthetic data, an appealing alternative to extensive expert-annotated data for medical image segmentation, consistently fails to improve segmentation performance despite its visual realism. The reason being that synthetic and real medical images exist in different semantic feature spaces, creating a domain gap that current semi-supervised learning methods cannot bridge. We propose SRA-Seg, a framework explicitly designed to align synthetic and real feature distributions for medical image segmentation. SRA-Seg introduces a similarity-alignment (SA) loss using frozen DINOv2 embeddings to pull synthetic representations toward their nearest real counterparts in semantic space. We employ soft edge blending to create smooth anatomical transitions and continuous labels, eliminating the hard boundaries from traditional copy-paste augmentation. The framework generates pseudo-labels for synthetic images via an EMA teacher model and applies soft-segmentation losses that respect uncertainty in mixed regions. Our experiments demonstrate strong results: using only 10% labeled real data and 90% synthetic unlabeled data, SRA-Seg achieves 89.34% Dice on ACDC and 84.42% on FIVES, significantly outperforming existing semi-supervised methods and matching the performance of methods using real unlabeled data.
zh

[CV-114] A Reproducible Framework for Bias-Resistant Machine Learning on Small-Sample Neuroimaging Data

【速读】:该论文旨在解决小样本神经影像数据中机器学习模型因传统交叉验证框架重复使用相同折叠进行模型选择与性能估计而导致的乐观偏差问题,从而影响结果的可重复性和泛化能力。其解决方案的关键在于构建一个可复现且抗偏倚的机器学习框架,核心包括:领域知识引导的特征工程、嵌套交叉验证(nested cross-validation)以及校准决策阈值优化;该方法在高维结构磁共振成像(structural MRI)数据上实现了0.660 ± 0.068的平衡准确率,同时通过重要性导向排序选出紧凑且可解释的特征子集,兼顾了模型的可解释性与无偏评估,为数据受限的生物医学领域提供了可靠的机器学习计算范式。

链接: https://arxiv.org/abs/2602.02920
作者: Jagan Mohan Reddy Dwarampudi,Jennifer L Purks,Joshua Wong,Renjie Hu,Tania Banerjee
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC); Quantitative Methods (q-bio.QM)
备注: Accepted to ISBI 2026, 5 pages with 1 figure

点击查看摘要

Abstract:We introduce a reproducible, bias-resistant machine learning framework that integrates domain-informed feature engineering, nested cross-validation, and calibrated decision-threshold optimization for small-sample neuroimaging data. Conventional cross-validation frameworks that reuse the same folds for both model selection and performance estimation yield optimistically biased results, limiting reproducibility and generalization. Demonstrated on a high-dimensional structural MRI dataset of deep brain stimulation cognitive outcomes, the framework achieved a nested-CV balanced accuracy of 0.660, \pm ,0.068 using a compact, interpretable subset selected via importance-guided ranking. By combining interpretability and unbiased evaluation, this work provides a generalizable computational blueprint for reliable machine learning in data-limited biomedical domains.
zh

[CV-115] A Multi-scale Linear-time Encoder for Whole-Slide Image Analysis

【速读】:该论文旨在解决全切片图像(Whole-slide Image, WSI)分析中因超高分辨率和多尺度特征带来的挑战,特别是现有多实例学习(Multiple Instance Learning, MIL)方法通常仅在单一尺度下运行,而基于Transformer的模型则面临二次复杂度的注意力计算瓶颈。解决方案的关键在于提出首个纯Mamba架构的多状态多实例学习框架MARBLE,其通过并行处理多个放大倍数层级,并在线性时间状态空间模型中实现从粗到细的推理机制,从而以极低的参数开销高效捕捉跨尺度依赖关系,显著提升了WSI分析的效率与泛化能力。

链接: https://arxiv.org/abs/2602.02918
作者: Jagan Mohan Reddy Dwarampudi,Joshua Wong,Hien Van Nguyen,Tania Banerjee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Tissues and Organs (q-bio.TO)
备注: Accepted to ISBI 2026, 4 pages with 2 figures

点击查看摘要

Abstract:We introduce Multi-scale Adaptive Recurrent Biomedical Linear-time Encoder (MARBLE), the first \textitpurely Mamba-based multi-state multiple instance learning (MIL) framework for whole-slide image (WSI) analysis. MARBLE processes multiple magnification levels in parallel and integrates coarse-to-fine reasoning within a linear-time state-space model, efficiently capturing cross-scale dependencies with minimal parameter overhead. WSI analysis remains challenging due to gigapixel resolutions and hierarchical magnifications, while existing MIL methods typically operate at a single scale and transformer-based approaches suffer from quadratic attention costs. By coupling parallel multi-scale processing with linear-time sequence modeling, MARBLE provides a scalable and modular alternative to attention-based architectures. Experiments on five public datasets show improvements of up to \textbf6.9% in AUC, \textbf20.3% in accuracy, and \textbf2.3% in C-index, establishing MARBLE as an efficient and generalizable framework for multi-scale WSI analysis.
zh

[CV-116] FaceLinkGen: Rethinking Identity Leakage in Privacy-Preserving Face Recognition with Identity Extraction

【速读】:该论文旨在解决当前隐私保护人脸识别(Privacy-Preserving Face Recognition, PPFR)系统评估中存在的重要缺陷:现有方法过度依赖像素级重建指标(如PSNR和SSIM)来衡量隐私保护效果,而忽视了身份信息在变换后模板中仍可能被提取的风险。作者指出,这种以“像素恢复”为核心的评价范式无法真实反映系统的隐私安全性。解决方案的关键在于提出FaceLinkGen攻击框架,该框架能够直接从受保护的特征模板中进行身份匹配与人脸再生,无需还原原始像素图像;实验表明,即使在近零知识条件下,FaceLinkGen仍能实现超过92%的身份匹配准确率和94%的人脸再生成功率,揭示了当前PPFR系统在结构上对身份泄露的高度敏感性,从而推动更贴近实际威胁场景的隐私评估标准的发展。

链接: https://arxiv.org/abs/2602.02914
作者: Wenqi Guo,Shan Du
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Transformation-based privacy-preserving face recognition (PPFR) aims to verify identities while hiding facial data from attackers and malicious service providers. Existing evaluations mostly treat privacy as resistance to pixel-level reconstruction, measured by PSNR and SSIM. We show that this reconstruction-centric view fails. We present FaceLinkGen, an identity extraction attack that performs linkage/matching and face regeneration directly from protected templates without recovering original pixels. On three recent PPFR systems, FaceLinkGen reaches over 98.5% matching accuracy and above 96% regeneration success, and still exceeds 92% matching and 94% regeneration in a near zero knowledge setting. These results expose a structural gap between pixel distortion metrics, which are widely used in PPFR evaluation, and real privacy. We show that visual obfuscation leaves identity information broadly exposed to both external intruders and untrusted service providers.
zh

[CV-117] A Random Matrix Theory Perspective on the Consistency of Diffusion Models

【速读】:该论文试图解决扩散模型在不同数据子集上训练后,即使使用相同噪声种子仍产生高度相似输出的现象(即训练数据划分对生成结果影响较小的问题),其核心挑战在于理解有限数据如何塑造去噪器和采样映射的期望与方差。解决方案的关键在于构建一个随机矩阵理论(Random Matrix Theory, RMT)框架,用于量化有限数据集对线性扩散模型中去噪器和采样路径的影响:一方面,通过自洽关系 σ2κ(σ2)\sigma^2 \mapsto \kappa(\sigma^2) 揭示采样变异性对噪声水平的重整化效应,解释为何有限数据会导致低方差方向被过度收缩并使样本趋向于数据均值;另一方面,通过方差公式识别出跨数据划分差异的三个关键因素——特征模态各向异性、输入间异质性以及整体随数据规模缩放的特性。该理论不仅准确预测了线性扩散模型的行为,还在UNet和DiT架构的非记忆化区间得到验证,为扩散训练中的可复现性提供了原则性基准,并建立了数据谱特性与生成稳定性之间的联系。

链接: https://arxiv.org/abs/2602.02908
作者: Binxu Wang,Jacob Zavatone-Veth,Cengiz Pehlevan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: 65 pages; 53 figures

点击查看摘要

Abstract:Diffusion models trained on different, non-overlapping subsets of a dataset often produce strikingly similar outputs when given the same noise seed. We trace this consistency to a simple linear effect: the shared Gaussian statistics across splits already predict much of the generated images. To formalize this, we develop a random matrix theory (RMT) framework that quantifies how finite datasets shape the expectation and variance of the learned denoiser and sampling map in the linear setting. For expectations, sampling variability acts as a renormalization of the noise level through a self-consistent relation \sigma^2 \mapsto \kappa(\sigma^2) , explaining why limited data overshrink low-variance directions and pull samples toward the dataset mean. For fluctuations, our variance formulas reveal three key factors behind cross-split disagreement: \textitanisotropy across eigenmodes, \textitinhomogeneity across inputs, and overall scaling with dataset size. Extending deterministic-equivalence tools to fractional matrix powers further allows us to analyze entire sampling trajectories. The theory sharply predicts the behavior of linear diffusion models, and we validate its predictions on UNet and DiT architectures in their non-memorization regime, identifying where and how samples deviates across training data split. This provides a principled baseline for reproducibility in diffusion training, linking spectral properties of data to the stability of generative outputs.
zh

[CV-118] DoubleTake: Contrastive Reasoning for Faithful Decision-Making in Medical Imaging

【速读】:该论文旨在解决医学影像诊断中因混淆病症间细微视觉差异导致的决策准确性不足问题,现有方法多依赖最近邻检索返回冗余证据并强化单一假设,缺乏对判别性证据的有效筛选。解决方案的关键在于提出一种对比性的、文档感知的参考选择框架,通过ROCO嵌入(ROCO embeddings)和元数据显式平衡视觉相关性、嵌入多样性与来源级溯源性,构建紧凑且具有判别力的证据集;在此基础上进一步设计了基于置信度的反事实-对比推理机制(Counterfactual-Contrastive Inference),通过结构化的成对视觉比较和基于间隔的决策规则聚合证据,并实现忠实的弃权策略,在MediConfusion基准上将集合级准确率提升近15%,同时降低混淆率并提高个体准确性。

链接: https://arxiv.org/abs/2602.02894
作者: Daivik Patel,Shrenik Patel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Accurate decision making in medical imaging requires reasoning over subtle visual differences between confusable conditions, yet most existing approaches rely on nearest neighbor retrieval that returns redundant evidence and reinforces a single hypothesis. We introduce a contrastive, document-aware reference selection framework that constructs compact evidence sets optimized for discrimination rather than similarity by explicitly balancing visual relevance, embedding diversity, and source-level provenance using ROCO embeddings and metadata. While ROCO provides large-scale image-caption pairs, it does not specify how references should be selected for contrastive reasoning, and naive retrieval frequently yields near-duplicate figures from the same document. To address this gap, we release a reproducible reference selection protocol and curated reference bank that enable a systematic study of contrastive retrieval in medical image reasoning. Building on these contrastive evidence sets, we propose Counterfactual-Contrastive Inference, a confidence-aware reasoning framework that performs structured pairwise visual comparisons and aggregates evidence using margin-based decision rules with faithful abstention. On the MediConfusion benchmark, our approach achieves state-of-the-art performance, improving set-level accuracy by nearly 15% relative to prior methods while reducing confusion and improving individual accuracy.
zh

[CV-119] ViThinker: Active Vision-Language Reasoning via Dynamic Perceptual Querying

【速读】:该论文旨在解决视觉-语言模型在链式思维(Chain-of-Thought, CoT)推理中因过早进行视觉到文本的转换而导致连续信息(如几何结构和空间布局)丢失的问题。现有方法通过静态枚举或基于注意力的选择来增强CoT,但这些方法属于被动处理预计算输入,缺乏对任务相关细节的主动探索能力。解决方案的关键在于提出ViThinker框架,其受人类主动感知启发,使视觉-语言模型能够自主生成决策(查询)标记,按需触发专家对齐的视觉特征合成;该框架在训练阶段内化视觉专家能力,并在推理阶段通过生成式心理模拟实现无需外部工具调用的主动感知,结合两阶段课程学习(先蒸馏冻结专家知识,再通过稀疏性惩罚学习任务驱动查询),从而发现每个推理步骤所需的最小充分感知,显著提升感知锚定与推理准确性。

链接: https://arxiv.org/abs/2602.02873
作者: Weihang You,Qingchan Zhu,David Liu,Yi Pan,Geng Yuan,Hanqi Jiang
机构: University of Georgia (乔治亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Chain-of-Thought (CoT) reasoning excels in language models but struggles in vision-language models due to premature visual-to-text conversion that discards continuous information such as geometry and spatial layout. While recent methods enhance CoT through static enumeration or attention-based selection, they remain passive, i.e., processing pre-computed inputs rather than actively seeking task-relevant details. Inspired by human active perception, we introduce ViThinker, a framework that enables vision-language models to autonomously generate decision (query) tokens triggering the synthesis of expert-aligned visual features on demand. ViThinker internalizes vision-expert capabilities during training, performing generative mental simulation during inference without external tool calls. Through a two-stage curriculum: first distilling frozen experts into model parameters, then learning task-driven querying via sparsity penalties, i.e., ViThinker discovers minimal sufficient perception for each reasoning step. Evaluations across vision-centric benchmarks demonstrate consistent improvements, validating that active query generation outperforms passive approaches in both perceptual grounding and reasoning accuracy.
zh

[CV-120] Self-Supervised Uncalibrated Multi-View Video Anonymization in the Operating Room

【速读】:该论文旨在解决手术室(Operating Room, OR)视频数据在科研应用中隐私保护的难题,核心挑战在于如何实现高效、自动化的多视角视频匿名化。现有方法存在两大瓶颈:一是需对每个新临床场景进行大量人工标注以保证精度;二是多摄像头部署时需频繁重新校准相机参数,难以扩展。解决方案的关键在于提出一种无需人工标注和相机标定的自监督多视角视频匿名化框架,其核心策略是通过时间一致性与多视角上下文“检索”低置信度误检样本,并利用这些伪标签进行自监督域适应与迭代优化,从而提升单视角检测器性能并最终实现全身体态估计,实现在真实手术视频中超过97%的召回率,且训练出的实时全身体检测模型达到与有监督方法相当的效果,具备良好的实用性。

链接: https://arxiv.org/abs/2602.02850
作者: Keqi Chen,Vinkle Srivastav,Armine Vardazaryan,Cindy Rolland,Didier Mutter,Nicolas Padoy
机构: University of Strasbourg (斯特拉斯堡大学); CNRS (法国国家科学研究中心); INSERM (法国国家健康与医学研究院); ICube, UMR7357 (ICube实验室,UMR7357); IHU Strasbourg (斯特拉斯堡人类健康研究所); University Hospital of Strasbourg (斯特拉斯堡大学医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Privacy preservation is a prerequisite for using video data in Operating Room (OR) research. Effective anonymization relies on the exhaustive localization of every individual; even a single missed detection necessitates extensive manual correction. However, existing approaches face two critical scalability bottlenecks: (1) they usually require manual annotations of each new clinical site for high accuracy; (2) while multi-camera setups have been widely adopted to address single-view ambiguity, camera calibration is typically required whenever cameras are repositioned. To address these problems, we propose a novel self-supervised multi-view video anonymization framework consisting of whole-body person detection and whole-body pose estimation, without annotation or camera calibration. Our core strategy is to enhance the single-view detector by “retrieving” false negatives using temporal and multi-view context, and conducting self-supervised domain adaptation. We first run an off-the-shelf whole-body person detector in each view with a low-score threshold to gather candidate detections. Then, we retrieve the low-score false negatives that exhibit consistency with the high-score detections via tracking and self-supervised uncalibrated multi-view association. These recovered detections serve as pseudo labels to iteratively fine-tune the whole-body detector. Finally, we apply whole-body pose estimation on each detected person, and fine-tune the pose model using its own high-score predictions. Experiments on the 4D-OR dataset of simulated surgeries and our dataset of real surgeries show the effectiveness of our approach achieving over 97% recall. Moreover, we train a real-time whole-body detector using our pseudo labels, achieving comparable performance and highlighting our method’s practical applicability. Code is available at this https URL.
zh

[CV-121] From Tokens to Numbers: Continuous Number Modeling for SVG Generation

【速读】:该论文旨在解决向量图形(如可缩放矢量图形,SVG)在生成任务中因数值参数被编码为长序列离散标记而导致的训练效率低、精度差和泛化能力弱的问题。其核心解决方案是提出连续数值建模(Continuous Number Modeling, CNM),该方法将数字作为连续值直接建模,而非依赖离散标记编码,从而消除离散化引入的伪影,恢复表示的数学优雅性,并提升模型对数据连续本质的拟合能力。在此基础上,研究者使用200万张位图到SVG样本训练多模态Transformer,并通过感知反馈强化学习进行微调,最终实现训练速度提升30%以上且保持更高视觉保真度。

链接: https://arxiv.org/abs/2602.02820
作者: Michael Ogezi,Martin Bell,Freda Shi,Ethan Smith
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:For certain image generation tasks, vector graphics such as Scalable Vector Graphics (SVGs) offer clear benefits such as increased flexibility, size efficiency, and editing ease, but remain less explored than raster-based approaches. A core challenge is that the numerical, geometric parameters, which make up a large proportion of SVGs, are inefficiently encoded as long sequences of tokens. This slows training, reduces accuracy, and hurts generalization. To address these problems, we propose Continuous Number Modeling (CNM), an approach that directly models numbers as first-class, continuous values rather than discrete tokens. This formulation restores the mathematical elegance of the representation by aligning the model’s inputs with the data’s continuous nature, removing discretization artifacts introduced by token-based encoding. We then train a multimodal transformer on 2 million raster-to-SVG samples, followed by fine-tuning via reinforcement learning using perceptual feedback to further improve visual quality. Our approach improves training speed by over 30% while maintaining higher perceptual fidelity compared to alternative approaches. This work establishes CNM as a practical and efficient approach for high-quality vector generation, with potential for broader applications. We make our code available this http URL.
zh

[CV-122] LmPT: Conditional Point Transformer for Anatomical Landmark Detection on 3D Point Clouds

【速读】:该论文旨在解决医学图像中解剖学标志点(anatomical landmarks)自动识别的准确性与跨物种泛化能力问题,传统手动标注耗时且存在观察者差异,而基于规则的方法又受限于特定几何结构或标志点集合。其解决方案的关键在于提出Landmark Point Transformer (LmPT)模型,该模型采用点云(point clouds)表示解剖表面,并引入条件机制(conditioning mechanism)以适应不同输入类型,从而实现跨物种学习,尤其在人和狗股骨标志点检测任务中验证了其有效性与通用性。

链接: https://arxiv.org/abs/2602.02808
作者: Matteo Bastico,Pierre Onghena,David Ryckelynck,Beatriz Marcotegui,Santiago Velasco-Forero,Laurent Corté,Caroline Robine–Decourcelle,Etienne Decencière
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This paper has been accepted at International Symposium on Biomedical Imaging (ISBI) 2026

点击查看摘要

Abstract:Accurate identification of anatomical landmarks is crucial for various medical applications. Traditional manual landmarking is time-consuming and prone to inter-observer variability, while rule-based methods are often tailored to specific geometries or limited sets of landmarks. In recent years, anatomical surfaces have been effectively represented as point clouds, which are lightweight structures composed of spatial coordinates. Following this strategy and to overcome the limitations of existing landmarking techniques, we propose Landmark Point Transformer (LmPT), a method for automatic anatomical landmark detection on point clouds that can leverage homologous bones from different species for translational research. The LmPT model incorporates a conditioning mechanism that enables adaptability to different input types to conduct cross-species learning. We focus the evaluation of our approach on femoral landmarking using both human and newly annotated dog femurs, demonstrating its generalization and effectiveness across species. The code and dog femur dataset will be publicly available at: this https URL.
zh

[CV-123] SVD-ViT: Does SVD Make Vision Transformers Attend More to the Foreground?

【速读】:该论文旨在解决视觉 Transformer (Vision Transformer, ViT) 在大规模预训练中因自注意力机制全局作用而缺乏显式前景-背景区分能力的问题,导致模型可能学习到无关的背景特征和伪影,从而降低分类性能。其解决方案的关键在于提出 SVD-ViT 框架,通过奇异值分解(Singular Value Decomposition, SVD)优先提取并聚合能够表征物体前景信息的奇异向量,具体包含三个核心组件:SPC模块SSVAID-RSVD,有效抑制背景噪声与任务无关因素的影响,提升模型对前景特征的学习能力与分类准确性。

链接: https://arxiv.org/abs/2602.02765
作者: Haruhiko Murata,Kazuhiro Hotta
机构: Meijo University (明治大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures

点击查看摘要

Abstract:Vision Transformers (ViT) have been established as large-scale foundation models. However, because self-attention operates globally, they lack an explicit mechanism to distinguish foreground from background. As a result, ViT may learn unnecessary background features and artifacts, leading to degraded classification performance. To address this issue, we propose SVD-ViT, which leverages singular value decomposition (SVD) to prioritize the learning of foreground features. SVD-ViT consists of three components-\textbfSPC module, \textbfSSVA, and \textbfID-RSVD-and suppresses task-irrelevant factors such as background noise and artifacts by extracting and aggregating singular vectors that capture object foreground information. Experimental results demonstrate that our method improves classification accuracy and effectively learns informative foreground representations while reducing the impact of background noise.
zh

[CV-124] Hierarchical Entity-centric Reinforcement Learning with Factored Subgoal Diffusion ICLR2026

【速读】:该论文旨在解决复杂多实体环境中长期目标导向强化学习(Goal-Conditioned Reinforcement Learning, GCRL)面临的高维观测与组合状态空间挑战,尤其是在稀疏奖励场景下的性能瓶颈问题。其解决方案的关键在于提出一种分层的、以实体为中心的框架,该框架由两层结构组成:底层为基于价值函数的GCRL代理(value-based GCRL agent),用于执行策略优化;上层为一个因子化子目标生成条件扩散模型(factored subgoal-generating conditional diffusion model),独立训练并用于生成可被价值函数筛选的子目标。这种模块化设计使得子目标生成与策略学习解耦,且能通过选择性子目标注入显著提升RL代理在图像输入、长时程任务中的成功率,实验表明该方法在最困难的任务中成功率达到基线的150%以上,并具备随任务长度和实体数量增长的良好泛化能力。

链接: https://arxiv.org/abs/2602.02722
作者: Dan Haramati,Carl Qi,Tal Daniel,Amy Zhang,Aviv Tamar,George Konidaris
机构: Brown University (布朗大学); UT Austin (德克萨斯大学奥斯汀分校); Carnegie Mellon University (卡内基梅隆大学); Technion (以色列理工学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: ICLR 2026

点击查看摘要

Abstract:We propose a hierarchical entity-centric framework for offline Goal-Conditioned Reinforcement Learning (GCRL) that combines subgoal decomposition with factored structure to solve long-horizon tasks in domains with multiple entities. Achieving long-horizon goals in complex environments remains a core challenge in Reinforcement Learning (RL). Domains with multiple entities are particularly difficult due to their combinatorial complexity. GCRL facilitates generalization across goals and the use of subgoal structure, but struggles with high-dimensional observations and combinatorial state-spaces, especially under sparse reward. We employ a two-level hierarchy composed of a value-based GCRL agent and a factored subgoal-generating conditional diffusion model. The RL agent and subgoal generator are trained independently and composed post hoc through selective subgoal generation based on the value function, making the approach modular and compatible with existing GCRL algorithms. We introduce new variations to benchmark tasks that highlight the challenges of multi-entity domains, and show that our method consistently boosts performance of the underlying RL agent on image-based long-horizon tasks with sparse rewards, achieving over 150% higher success rates on the hardest task in our suite and generalizing to increasing horizons and numbers of entities. Rollout videos are provided at: this https URL
zh

[CV-125] End-to-end reconstruction of OCT optical properties and speckle-reduced structural intensity via physics-based learning

【速读】:该论文旨在解决光学相干断层扫描(Optical Coherence Tomography, OCT)中的逆散射问题,即从OCT信号中同时恢复组织的结构图像和内在光学参数(如折射率、散射系数和各向异性因子),该问题因衰减效应、斑点噪声以及参数间的强耦合性而极具挑战。解决方案的关键在于提出了一种正则化的端到端深度学习框架,该框架联合重建光学参数图谱与去斑点的OCT结构强度图像以实现层状可视化;网络训练基于蒙特卡洛模拟的真实数据,并嵌入物理驱动的OCT前向模型,通过从估计参数生成预测信号提供物理一致性监督,从而提升参数恢复精度并抑制伪影。

链接: https://arxiv.org/abs/2602.02721
作者: Jinglun Yu,Yaning Wang,Wenhan Guo,Yuan Gao,Yu Sun,Jin U. Kang
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Inverse scattering in optical coherence tomography (OCT) seeks to recover both structural images and intrinsic tissue optical properties, including refractive index, scattering coefficient, and anisotropy. This inverse problem is challenging due to attenuation, speckle noise, and strong coupling among parameters. We propose a regularized end-to-end deep learning framework that jointly reconstructs optical parameter maps and speckle-reduced OCT structural intensity for layer visualization. Trained with Monte Carlo-simulated ground truth, our network incorporates a physics-based OCT forward model that generates predicted signals from the estimated parameters, providing physics-consistent supervision for parameter recovery and artifact suppression. Experiments on the synthetic corneal OCT dataset demonstrate robust optical map recovery under noise, improved resolution, and enhanced structural fidelity. This approach enables quantitative multi-parameter tissue characterization and highlights the benefit of combining physics-informed modeling with deep learning for computational OCT.
zh

[CV-126] AdaptMMBench: Benchmarking Adaptive Multimodal Reasoning for Mode Selection and Reasoning Process

【速读】:该论文旨在解决现有视觉语言模型(Vision-Language Models, VLMs)在适应性多模态推理评估中存在的静态难度标签与简化指标无法捕捉任务难度动态变化、混淆模式选择能力与整体性能的问题。其解决方案的关键在于提出AdaptMMBench基准,该基准通过基于模型能力边界的动态难度识别机制,结合马修斯相关系数(Matthews Correlation Coefficient, MCC)量化不同推理模式的选择合理性,从而独立评估元认知层面的自适应决策能力;同时支持多维度过程分析,包括关键步骤覆盖率、工具有效性及计算效率,揭示了适应性模式选择虽随模型容量提升而增强,但显著脱离最终准确率,而关键步骤覆盖率则与性能高度一致的现象。

链接: https://arxiv.org/abs/2602.02676
作者: Xintong Zhang,Xiaowen Zhang,Jongrong Wu,Zhi Gao,Shilin Yan,Zhenxin Diao,Kunpeng Gao,Xuanyan Chen,Yuwei Wu,Yunde Jia,Qing Li
机构: 1. Tsinghua University (清华大学); 2. Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); 3. University of Chinese Academy of Sciences (中国科学院大学); 4. Beijing Academy of Artificial Intelligence (北京人工智能研究院); 5. Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Adaptive multimodal reasoning has emerged as a promising frontier in Vision-Language Models (VLMs), aiming to dynamically modulate between tool-augmented visual reasoning and text reasoning to enhance both effectiveness and efficiency. However, existing evaluations rely on static difficulty labels and simplistic metrics, which fail to capture the dynamic nature of difficulty relative to varying model capacities. Consequently, they obscure the distinction between adaptive mode selection and general performance while neglecting fine-grained process analyses. In this paper, we propose AdaptMMBench, a comprehensive benchmark for adaptive multimodal reasoning across five domains: real-world, OCR, GUI, knowledge, and math, encompassing both direct perception and complex reasoning tasks. AdaptMMBench utilizes a Matthews Correlation Coefficient (MCC) metric to evaluate the selection rationality of different reasoning modes, isolating this meta-cognition ability by dynamically identifying task difficulties based on models’ capability boundaries. Moreover, AdaptMMBench facilitates multi-dimensional process evaluation across key step coverage, tool effectiveness, and computational efficiency. Our evaluation reveals that while adaptive mode selection scales with model capacity, it notably decouples from final accuracy. Conversely, key step coverage aligns with performance, though tool effectiveness remains highly inconsistent across model architectures.
zh

[CV-127] rajectory Consistency for One-Step Generation on Euler Mean Flows

【速读】:该论文旨在解决流模型(Flow-based Model)在长程轨迹一致性约束下难以优化的问题,尤其是在多步生成任务中,传统方法因缺乏有效的监督信号导致训练不稳定且计算成本高。解决方案的关键在于提出Euler Mean Flows (EMF),其核心思想是用一个基于半群结构的线性近似替代原本难以监督的长期轨迹一致性约束,从而实现对长时间尺度流映射组合的直接数据监督。该近似在温和正则性假设下能忠实逼近原目标函数,同时显著降低优化难度,并构建了一个无需Jacobian向量乘积(JVP-free)的统一训练框架,支持u-预测与x₁-预测两种变体,有效减少了内存占用和训练时间。

链接: https://arxiv.org/abs/2602.02571
作者: Zhiqi Li,Yuchen Sun,Duowen Chen,Jinjin He,Bo Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 40 pages, 27 figures

点击查看摘要

Abstract:We propose \emphEuler Mean Flows (EMF), a flow-based generative framework for one-step and few-step generation that enforces long-range trajectory consistency with minimal sampling cost. The key idea of EMF is to replace the trajectory consistency constraint, which is difficult to supervise and optimize over long time scales, with a principled linear surrogate that enables direct data supervision for long-horizon flow-map compositions. We derive this approximation from the semigroup formulation of flow-based models and show that, under mild regularity assumptions, it faithfully approximates the original consistency objective while being substantially easier to optimize. This formulation leads to a unified, JVP-free training framework that supports both u -prediction and x_1 -prediction variants, avoiding explicit Jacobian computations and significantly reducing memory and computational overhead. Experiments on image synthesis, particle-based geometry generation, and functional generation demonstrate improved optimization stability and sample quality under fixed sampling budgets, together with approximately 50% reductions in training time and memory consumption compared to existing one-step methods for image generation.
zh

[CV-128] Auditing Sybil: Explaining Deep Lung Cancer Risk Prediction Through Generative Interventional Attributions

【速读】:该论文旨在解决当前用于肺癌风险预测的深度学习模型(如Sybil)在临床部署前缺乏因果验证的问题。现有评估方法仅依赖相关性指标,无法揭示模型的真实推理机制,可能导致不可靠的决策。解决方案的关键在于提出一种模型无关的审计框架S(H)NAP,通过构建基于3D扩散桥接建模的生成式干预归因,系统性地修改解剖特征以隔离特定对象对风险评分的因果贡献,并由放射科专家进行验证。此方法首次实现了对Sybil模型的干预性审计,揭示了其虽具备类似专家判读能力,但存在对临床无依据伪影的高度敏感性和径向偏差等关键缺陷。

链接: https://arxiv.org/abs/2602.02560
作者: Bartlomiej Sobieski,Jakub Grzywaczewski,Karol Dobiczek,Mateusz Wójcik,Tomasz Bartczak,Patryk Szatkowski,Przemysław Bombiński,Matthew Tivnan,Przemyslaw Biecek
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint

点击查看摘要

Abstract:Lung cancer remains the leading cause of cancer mortality, driving the development of automated screening tools to alleviate radiologist workload. Standing at the frontier of this effort is Sybil, a deep learning model capable of predicting future risk solely from computed tomography (CT) with high precision. However, despite extensive clinical validation, current assessments rely purely on observational metrics. This correlation-based approach overlooks the model’s actual reasoning mechanism, necessitating a shift to causal verification to ensure robust decision-making before clinical deployment. We propose S(H)NAP, a model-agnostic auditing framework that constructs generative interventional attributions validated by expert radiologists. By leveraging realistic 3D diffusion bridge modeling to systematically modify anatomical features, our approach isolates object-specific causal contributions to the risk score. Providing the first interventional audit of Sybil, we demonstrate that while the model often exhibits behavior akin to an expert radiologist, differentiating malignant pulmonary nodules from benign ones, it suffers from critical failure modes, including dangerous sensitivity to clinically unjustified artifacts and a distinct radial bias.
zh

[CV-129] Experience-Driven Multi-Agent Systems Are Training-free Context-aware Earth Observers

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在地球观测(Earth Observation, EO)等工具密集型、多模态且长周期任务中因缺乏细粒度工具级专业知识而导致的执行失败问题,尤其是参数配置错误和中途故障恢复能力不足。其核心挑战在于现有代理无法通过交互自动习得针对特定工具的精细操作策略。解决方案的关键是提出GeoEvolver——一个无需参数更新的自演化多智能体系统(Multi-Agent System, MAS),它通过检索增强的多智能体编排器将查询分解为独立子目标,在子目标层面探索多样化的工具参数配置,并将成功模式与失败根因提炼至动态演化的记忆库中,以提供上下文演示支持未来任务的可靠执行,从而实现EO领域专家知识的渐进式涌现。

链接: https://arxiv.org/abs/2602.02559
作者: Pengyu Dai,Weihao Xuan,Junjue Wang,Hongruixuan Chen,Jian Song,Yafei Ou,Naoto Yokoya
机构: The University of Tokyo (东京大学); RIKEN AIP (理化学研究所先进智能研究中心); Hokkaido University (北海道大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 21 pages, 6 figures

点击查看摘要

Abstract:Recent advances have enabled large language model (LLM) agents to solve complex tasks by orchestrating external tools. However, these agents often struggle in specialized, tool-intensive domains that demand long-horizon execution, tight coordination across modalities, and strict adherence to implicit tool constraints. Earth Observation (EO) tasks exemplify this challenge due to the multi-modal and multi-temporal data inputs, as well as the requirements of geo-knowledge constraints (spectrum library, spatial reasoning, etc): many high-level plans can be derailed by subtle execution errors that propagate through a pipeline and invalidate final results. A core difficulty is that existing agents lack a mechanism to learn fine-grained, tool-level expertise from interaction. Without such expertise, they cannot reliably configure tool parameters or recover from mid-execution failures, limiting their effectiveness in complex EO workflows. To address this, we introduce \textbfGeoEvolver, a self-evolving multi-agent system~(MAS) that enables LLM agents to acquire EO expertise through structured interaction without any parameter updates. GeoEvolver decomposes each query into independent sub-goals via a retrieval-augmented multi-agent orchestrator, then explores diverse tool-parameter configurations at the sub-goal level. Successful patterns and root-cause attribution from failures are then distilled in an evolving memory bank that provides in-context demonstrations for future queries. Experiments on three tool-integrated EO benchmarks show that GeoEvolver consistently improves end-to-end task success, with an average gain of 12% across multiple LLM backbones, demonstrating that EO expertise can emerge progressively from efficient, fine-grained interactions with the environment.
zh

[CV-130] EEO-TFV: Escape-Explore Optimizer for Web-Scale Time-Series Forecasting and Vision Analysis

【速读】:该论文旨在解决基于Transformer的基座模型在长序列多变量预测任务中易出现误差累积问题,以及在图像相关任务中对分布外样本(out-of-distribution samples)敏感的问题;同时,在大规模Web数据挖掘场景下,由于复杂的时间模式和多模态特征导致优化困难,模型容易陷入高维参数空间中的鞍点(saddle points)。解决方案的关键在于提出一种轻量级Transformer架构与一种新型的Escape-Explore Optimizer(EEO),该优化器通过增强探索能力与泛化性能,有效规避尖锐极小值和鞍点陷阱,从而提升模型在多种任务上的稳定性与泛化能力。

链接: https://arxiv.org/abs/2602.02551
作者: Hua Wang,Jinghao Lu,Fan Zhang
机构: Ludong University (鲁东大学); Shandong Technology and Business University (山东工商学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Main paper: 12 pages

点击查看摘要

Abstract:Transformer-based foundation models have achieved remarkable progress in tasks such as time-series forecasting and image segmentation. However, they frequently suffer from error accumulation in multivariate long-sequence prediction and exhibit vulnerability to out-of-distribution samples in image-related tasks. Furthermore, these challenges become particularly pronounced in large-scale Web data analysis tasks, which typically involve complex temporal patterns and multimodal features. This complexity substantially increases optimization difficulty, rendering models prone to stagnation at saddle points within high-dimensional parameter spaces. To address these issues, we propose a lightweight Transformer architecture in conjunction with a novel Escape-Explore Optimizer (EEO). The optimizer enhances both exploration and generalization while effectively avoiding sharp minima and saddle-point traps. Experimental results show that, in representative Web data scenarios, our method achieves performance on par with state-of-the-art models across 11 time-series benchmark datasets and the Synapse medical image segmentation task. Moreover, it demonstrates superior generalization and stability, thereby validating its potential as a versatile cross-task foundation model for Web-scale data mining and analysis.
zh

[CV-131] oolTok: Tool Tokenization for Efficient and Generalizable GUI Agents

【速读】:该论文旨在解决现有图形用户界面(GUI)智能体模型在处理不同输入分辨率和宽高比时泛化能力差的问题,以及基于坐标-free策略的模型因数据稀缺导致的学习困难。其解决方案的关键在于提出一种名为ToolTok的新范式,通过将操作建模为一系列逐步使用的工具序列来实现多步路径规划;具体而言,设计了符合人类交互习惯的工具,并用可学习的token嵌入表示每个工具;同时引入语义锚定机制,利用语义相关概念作为自然归纳偏置以增强有限监督下的嵌入学习效率;此外,构建了一个从易到难的课程学习框架,使预训练大语言模型逐步掌握工具语义,从而显著提升性能与泛化能力。

链接: https://arxiv.org/abs/2602.02548
作者: Xiaoce Wang,Guibin Zhang,Junzhe Li,Jinzhe Tu,Chun Li,Ming Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
备注: 8 pages main paper, 18 pages total, 8 figures, 5 tables, code at this https URL

点击查看摘要

Abstract:Existing GUI agent models relying on coordinate-based one-step visual grounding struggle with generalizing to varying input resolutions and aspect ratios. Alternatives introduce coordinate-free strategies yet suffer from learning under severe data scarcity. To address the limitations, we propose ToolTok, a novel paradigm of multi-step pathfinding for GUI agents, where operations are modeled as a sequence of progressive tool usage. Specifically, we devise tools aligned with human interaction habits and represent each tool using learnable token embeddings. To enable efficient embedding learning under limited supervision, ToolTok introduces a semantic anchoring mechanism that grounds each tool with semantically related concepts as natural inductive bias. To further enable a pre-trained large language model to progressively acquire tool semantics, we construct an easy-to-hard curriculum consisting of three tasks: token definition question-answering, pure text-guided tool selection, and simplified visual pathfinding. Extensive experiments on multiple benchmarks show that ToolTok achieves superior performance among models of comparable scale (4B) and remains competitive with a substantially larger model (235B). Notably, these results are obtained using less than 1% of the training data required by other post-training approaches. In addition, ToolTok demonstrates strong generalization across unseen scenarios. Our training inference code is open-source at this https URL.
zh

[CV-132] How Much Information Can a Vision Token Hold? A Scaling Law for Recognition Limits in VLMs

【速读】:该论文旨在解决视觉编码器在长文本上下文建模中存在信息容量上限的问题,即如何量化视觉tokens(视觉标记)所能承载的信息量上限。其核心挑战在于:尽管当前以DeepSeek-OCR为代表的视觉中心模型能高效压缩文本图像为连续视觉token,但这种压缩过程本质上是一个有损信道,当输入文本信息量超过某一阈值时,识别精度会显著下降。解决方案的关键在于通过受控的压力测试,系统性地增加图像中的字符数量,并观察模型响应的三阶段相变现象——稳定相、不稳定性相与崩溃相,进而揭示这些相变的机制并提出一个统一的概率缩放律,将平均视觉token负载与视觉密度整合为一个潜在难度指标,从而为视觉上下文压缩中的效率-精度权衡提供可量化的实证指导。

链接: https://arxiv.org/abs/2602.02539
作者: Shuxin Zhuang,Zi Liang,Runsheng Yu,Hongzong Li,Rong Feng,Shiqin Tang,Youzhi Zhang
机构: City University of Hong Kong (香港城市大学); The Hong Kong Polytechnic University (香港理工大学); The Hong Kong University of Science and Technology (香港科技大学); Centre for Artificial Intelligence and Robotics, Chinese Academy of Sciences (中国科学院人工智能与机器人研究中心)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent vision-centric approaches have made significant strides in long-context modeling. Represented by DeepSeek-OCR, these models encode rendered text into continuous vision tokens, achieving high compression rates without sacrificing recognition precision. However, viewing the vision encoder as a lossy channel with finite representational capacity raises a fundamental question: what is the information upper bound of visual tokens? To investigate this limit, we conduct controlled stress tests by progressively increasing the information quantity (character count) within an image. We observe a distinct phase-transition phenomenon characterized by three regimes: a near-perfect Stable Phase, an Instability Phase marked by increased error variance, and a total Collapse Phase. We analyze the mechanical origins of these transitions and identify key factors. Furthermore, we formulate a probabilistic scaling law that unifies average vision token load and visual density into a latent difficulty metric. Extensive experiments across various Vision-Language Models demonstrate the universality of this scaling law, providing critical empirical guidance for optimizing the efficiency-accuracy trade-off in visual context compression.
zh

[CV-133] WorldVQA: Measuring Atomic World Knowledge in Multimodal Large Language Models

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在评估中混淆视觉知识检索与推理能力的问题,从而无法准确衡量模型对视觉世界知识的“记忆”能力。解决方案的关键在于提出WorldVQA基准,通过解耦视觉知识检索与推理任务,严格评估模型在分层分类体系下对视觉实体进行定位和命名的原子级能力,覆盖从常见类别到长尾稀有类别的广泛场景,进而为视觉事实性提供严谨测试,并建立衡量当前及下一代前沿模型百科全书式知识广度与幻觉率的标准。

链接: https://arxiv.org/abs/2602.02537
作者: Runjie Zhou,Youbo Shao,Haoyu Lu,Bowei Xing,Tongtong Bai,Yujie Chen,Jie Zhao,Lin Sui,Haotian Yao,Zijia Zhao,Hao Yang,Haoning Wu,Zaida Zhou,Jinguo Zhu,Zhiqi Huang,Yiping Bao,Yangyang Liu,Y.Charles,Xinyu Zhou
机构: Moonshot AI
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce WorldVQA, a benchmark designed to evaluate the atomic visual world knowledge of Multimodal Large Language Models (MLLMs). Unlike current evaluations, which often conflate visual knowledge retrieval with reasoning, WorldVQA decouples these capabilities to strictly measure “what the model memorizes.” The benchmark assesses the atomic capability of grounding and naming visual entities across a stratified taxonomy, spanning from common head-class objects to long-tail rarities. We expect WorldVQA to serve as a rigorous test for visual factuality, thereby establishing a standard for assessing the encyclopedic breadth and hallucination rates of current and next-generation frontier models.
zh

[CV-134] One Size Many Fits: Aligning Diverse Group-Wise Click Preferences in Large-Scale Advertising Image Generation

【速读】:该论文旨在解决广告图像生成中“一刀切”策略导致的群体偏好差异被忽视的问题,即现有方法仅优化整体点击率(Click-Through Rate, CTR),而未能针对不同用户群体的多样化偏好进行个性化优化,从而限制了精准营销的效果。其解决方案的关键在于提出一个统一框架 One Size, Many Fits (OSMF),该框架包含两个核心组件:首先通过产品感知的自适应分组(product-aware adaptive grouping)动态构建具有丰富集体偏好特征的用户群组;其次基于这些群组,利用群组感知的多模态大语言模型(Group-aware Multimodal Large Language Model, G-MLLM)实现偏好条件下的图像生成,并通过提出的 Group-DPO 方法对 G-MLLM 进行微调,以增强各群组在生成图像上的CTR表现。

链接: https://arxiv.org/abs/2602.02033
作者: Shuo Lu,Haohan Wang,Wei Feng,Weizhen Wang,Shen Zhang,Yaoyu Li,Ao Ma,Zheng Zhang,Jingjing Lv,Junjie Shen,Ching Law,Bing Zhan,Yuan Xu,Huizai Yao,Yongcan Yu,Chenyang Si,Jian Liang
机构: NLPR & MAIS, CASIA; School of AI, UCAS; JD.COM; HKUST(gz); PRLab, NJU
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Advertising image generation has increasingly focused on online metrics like Click-Through Rate (CTR), yet existing approaches adopt a ``one-size-fits-all" strategy that optimizes for overall CTR while neglecting preference diversity among user groups. This leads to suboptimal performance for specific groups, limiting targeted marketing effectiveness. To bridge this gap, we present \textitOne Size, Many Fits (OSMF), a unified framework that aligns diverse group-wise click preferences in large-scale advertising image generation. OSMF begins with product-aware adaptive grouping, which dynamically organizes users based on their attributes and product characteristics, representing each group with rich collective preference features. Building on these groups, preference-conditioned image generation employs a Group-aware Multimodal Large Language Model (G-MLLM) to generate tailored images for each group. The G-MLLM is pre-trained to simultaneously comprehend group features and generate advertising images. Subsequently, we fine-tune the G-MLLM using our proposed Group-DPO for group-wise preference alignment, which effectively enhances each group’s CTR on the generated images. To further advance this field, we introduce the Grouped Advertising Image Preference Dataset (GAIP), the first large-scale public dataset of group-wise image preferences, including around 600K groups built from 40M users. Extensive experiments demonstrate that our framework achieves the state-of-the-art performance in both offline and online settings. Our code and datasets will be released at this https URL.
zh

[CV-135] Deep-learning-based pan-phenomic data reveals the explosive evolution of avian visual disparity ALT

【速读】:该论文旨在解决传统生物形态演化分析中因主观选择与编码形态特征而引入的偏差问题,从而更客观地探究鸟类形态演化的规律。其解决方案的关键在于利用ResNet34模型对超过10,000种鸟类进行识别,并提取其全连接层(fully connected layer)的权重,构建高维嵌入空间,进而分析该空间与生物表型之间的语义对齐关系。研究发现,该嵌入空间能够编码表型趋同现象,并揭示物种丰富度是驱动形态空间扩张的主要因素;此外,通过时间序列差异分析还观察到白垩纪-古近纪(K-Pg)灭绝事件后存在“早期爆发”现象。更为重要的是,该方法在无需显式结构信息的情况下,自发涌现出层级化的生物分类学结构,且通过对抗样本验证了模型能克服卷积神经网络(CNN)普遍存在的纹理偏倚,学习到整体形态特征(body plans),从而为深度神经网络的可解释性提供了新的生物学证据。

链接: https://arxiv.org/abs/2602.03824
作者: Jiao Sun
机构: University of Reading (雷丁大学); Chinese Academy of Sciences (中国科学院)
类目: Populations and Evolution (q-bio.PE); Computer Vision and Pattern Recognition (cs.CV)
备注: Readers from the field of computer science may be interested in section 2.1, 2.2, 3.1, 4.1, 4.2. These sections discussed the interpretability and representation learning, especially the texture vs shape problem, highlighting our model’s ability of overcoming the texture biases and capturing overall shape features. (Although they’re put here to prove the biological validity of the model.)

点击查看摘要

Abstract:The evolution of biological morphology is critical for understanding the diversity of the natural world, yet traditional analyses often involve subjective biases in the selection and coding of morphological traits. This study employs deep learning techniques, utilising a ResNet34 model capable of recognising over 10,000 bird species, to explore avian morphological evolution. We extract weights from the model’s final fully connected (fc) layer and investigate the semantic alignment between the high-dimensional embedding space learned by the model and biological phenotypes. The results demonstrate that the high-dimensional embedding space encodes phenotypic convergence. Subsequently, we assess the morphological disparity among various taxa and evaluate the association between morphological disparity and species richness, demonstrating that species richness is the primary driver of morphospace expansion. Moreover, the disparity-through-time analysis reveals a visual “early burst” after the K-Pg extinction. While mainly aimed at evolutionary analysis, this study also provides insights into the interpretability of Deep Neural Networks. We demonstrate that hierarchical semantic structures (biological taxonomy) emerged in the high-dimensional embedding space despite being trained on flat labels. Furthermore, through adversarial examples, we provide evidence that our model in this task can overcome texture bias and learn holistic shape representations (body plans), challenging the prevailing view that CNNs rely primarily on local textures. Comments: Readers from the field of computer science may be interested in section 2.1, 2.2, 3.1, 4.1, 4.2. These sections discussed the interpretability and representation learning, especially the texture vs shape problem, highlighting our model’s ability of overcoming the texture biases and capturing overall shape features. (Although they’re put here to prove the biological validity of the model.) Subjects: Populations and Evolution (q-bio.PE); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2602.03824 [q-bio.PE] (or arXiv:2602.03824v1 [q-bio.PE] for this version) https://doi.org/10.48550/arXiv.2602.03824 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-136] Real-time topology-aware M-mode OCT segmentation for robotic deep anterior lamellar keratoplasty (DALK) guidance

【速读】:该论文旨在解决机器人深前板角膜移植术(DALK)中缺乏准确实时深度反馈的问题,尤其是在接近Descemet’s membrane(DM)时避免穿孔的关键挑战。由于M-mode术中光学相干断层扫描(OCT)常受斑点噪声、衰减及器械阴影干扰,导致层间界面不连续或模糊,难以实现解剖一致的分割。解决方案的核心在于提出一种轻量级、拓扑感知的M-mode分割流水线,基于UNeXt架构并引入解剖拓扑正则化机制,以在低信噪比条件下稳定边界连续性和层序关系;该系统在单GPU上端到端吞吐量超过80 Hz,不仅满足部署帧率要求,还提供时间余量用于剔除低质量帧,从而保障稳定有效的深度更新速率。

链接: https://arxiv.org/abs/2602.02798
作者: Rosalinda Xiong,Jinglun Yu,Yaning Wang,Ziyi Huang,Jin U. Kang
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Robotic deep anterior lamellar keratoplasty (DALK) requires accurate real time depth feedback to approach Descemet’s membrane (DM) without perforation. M-mode intraoperative optical coherence tomography (OCT) provides high temporal resolution depth traces, but speckle noise, attenuation, and instrument induced shadowing often result in discontinuous or ambiguous layer interfaces that challenge anatomically consistent segmentation at deployment frame rates. We present a lightweight, topology aware M-mode segmentation pipeline based on UNeXt that incorporates anatomical topology regularization to stabilize boundary continuity and layer ordering under low signal to noise ratio conditions. The proposed system achieves end to end throughput exceeding 80 Hz measured over the complete preprocessing inference overlay pipeline on a single GPU, demonstrating practical real time guidance beyond model only timing. This operating margin provides temporal headroom to reject low quality or dropout frames while maintaining a stable effective depth update rate. Evaluation on a standard rabbit eye M-mode dataset using an established baseline protocol shows improved qualitative boundary stability compared with topology agnostic controls, while preserving deployable real time performance.
zh

[CV-137] Physics-based generation of multilayer corneal OCT data via Gaussian modeling and MCML for AI-driven diagnostic and surgical guidance applications

【速读】:该论文旨在解决深度学习模型在角膜光学相干断层扫描(OCT)成像中训练受限于高质量、大规模标注数据集的问题。其关键解决方案是构建一个可配置的蒙特卡洛(Monte Carlo)模拟框架,通过物理精确建模生成包含像素级五层分割标签的合成角膜OCT B-scan图像;该框架基于五层角膜模型(含高斯曲面以模拟健康与圆锥角膜眼的形态变异),结合文献中的各层光学属性和MCML(Monte Carlo modeling of light transport in multi-layered tissues)光传输模拟,同时融入系统特性如共焦点扩散函数(PSF)和灵敏度滚降,从而生成超过10,000对高分辨率(1024×1024)图像-标签对,支持AI模型在受控、真实标注条件下进行系统性训练与验证。

链接: https://arxiv.org/abs/2602.02755
作者: Jinglun Yu,Yaning Wang,Rosalinda Xiong,Ziyi Huang,Kristina Irsch,Jin U. Kang
机构: Johns Hopkins University (约翰霍普金斯大学); Sorbonne University (索邦大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Training deep learning models for corneal optical coherence tomography (OCT) imaging is limited by the availability of large, well-annotated datasets. We present a configurable Monte Carlo simulation framework that generates synthetic corneal B-scan optical OCT images with pixel-level five-layer segmentation labels derived directly from the simulation geometry. A five-layer corneal model with Gaussian surfaces captures curvature and thickness variability in healthy and keratoconic eyes. Each layer is assigned optical properties from the literature and light transport is simulated using Monte Carlo modeling of light transport in multi-layered tissues (MCML), while incorporating system features such as the confocal PSF and sensitivity roll-off. This approach produces over 10,000 high-resolution (1024x1024) image-label pairs and supports customization of geometry, photon count, noise, and system parameters. The resulting dataset enables systematic training, validation, and benchmarking of AI models under controlled, ground-truth conditions, providing a reproducible and scalable resource to support the development of diagnostic and surgical guidance applications in image-guided ophthalmology.
zh

[CV-138] Perfusion Imaging and Single Material Reconstruction in Polychromatic Photon Counting CT

【速读】:该论文旨在解决灌注计算机断层成像(Perfusion CT)中高X射线剂量问题,其核心挑战是在保证图像重建质量的前提下实现显著的剂量降低。针对这一问题,作者提出了一种基于单调变分不等式(monotone variational inequality, VI)的新型重建方法——VI-PRISM(VI-based PeRfusion Imaging and Single Material reconstruction),其关键创新在于将理论严谨的VI框架引入单物质多能光子计数CT(polychromatic photon-counting CT)的灌注成像中,通过利用已知静态背景组织信息,直接重建对比剂浓度图谱。该方法在极低光子通量(低至10² photons per detector element)和稀疏投影采样(仅8个视角)条件下仍能保持优异的重建精度(碘浓度误差低于0.4 mg/ml),且相比传统滤波反投影(Filtered Back Projection, FBP)方法,在剂量减少10–100倍时仍可实现相当甚至更优的信噪比(SNR)与均方根误差(RMSE)。

链接: https://arxiv.org/abs/2602.02713
作者: Namhoon Kim,Ashwin Pananjady,Amir Pourmorteza,Sara Fridovich-Keil
机构: 未知
类目: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Code is available at this https URL

点击查看摘要

Abstract:Background: Perfusion computed tomography (CT) images the dynamics of a contrast agent through the body over time, and is one of the highest X-ray dose scans in medical imaging. Recently, a theoretically justified reconstruction algorithm based on a monotone variational inequality (VI) was proposed for single material polychromatic photon-counting CT, and showed promising early results at low-dose imaging. Purpose: We adapt this reconstruction algorithm for perfusion CT, to reconstruct the concentration map of the contrast agent while the static background tissue is assumed known; we call our method VI-PRISM (VI-based PeRfusion Imaging and Single Material reconstruction). We evaluate its potential for dose-reduced perfusion CT, using a digital phantom with water and iodine of varying concentration. Methods: Simulated iodine concentrations range from 0.05 to 2.5 mg/ml. The simulated X-ray source emits photons up to 100 keV, with average intensity ranging from 10^5 down to 10^2 photons per detector element. The number of tomographic projections was varied from 984 down to 8 to characterize the tradeoff in photon allocation between views and intensity. Results: We compare VI-PRISM against filtered back-projection (FBP), and find that VI-PRISM recovers iodine concentration with error below 0.4 mg/ml at all source intensity levels tested. Even with a dose reduction between 10x and 100x compared to FBP, VI-PRISM exhibits reconstruction quality on par with FBP. Conclusion: Across all photon budgets and angular sampling densities tested, VI-PRISM achieved consistently lower RMSE, reduced noise, and higher SNR compared to filtered back-projection. Even in extremely photon-limited and sparsely sampled regimes, VI-PRISM recovered iodine concentrations with errors below 0.4 mg/ml, showing that VI-PRISM can support accurate and dose-efficient perfusion imaging in photon-counting CT. Comments: Code is available at this https URL Subjects: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV) Cite as: arXiv:2602.02713 [physics.med-ph] (or arXiv:2602.02713v1 [physics.med-ph] for this version) https://doi.org/10.48550/arXiv.2602.02713 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Sara Fridovich-Keil [view email] [v1] Mon, 2 Feb 2026 19:27:55 UTC (19,289 KB) Full-text links: Access Paper: View a PDF of the paper titled Perfusion Imaging and Single Material Reconstruction in Polychromatic Photon Counting CT, by Namhoon Kim and 3 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: physics.med-ph prev | next new | recent | 2026-02 Change to browse by: cs cs.CV eess eess.IV physics References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
zh

[CV-139] EchoJEPA: A Latent Predictive Foundation Model for Echocardiography

【速读】:该论文旨在解决超声心动图(echocardiography)领域中基础模型难以有效分离解剖结构信号与随机斑点噪声及成像伪影的问题,从而提升诊断一致性并降低标注负担。其解决方案的关键在于提出EchoJEPA——一个基于1800万张超声心动图、覆盖30万患者的大型预训练基础模型,并引入一种新型多视角探针框架(multi-view probing framework),通过因子化流嵌入(factorized stream embeddings)实现冻结主干网络下的标准化评估。该方法显著提升了左心室射血分数(LVEF)估计精度(误差降低19%),并在样本效率、抗声学扰动能力及零样本迁移至儿科患者方面均优于现有基线模型,验证了潜在预测(latent prediction)作为超声基础模型范式的优越性。

链接: https://arxiv.org/abs/2602.02603
作者: Alif Munim,Adibvafa Fallahpour,Teodora Szasz,Ahmadreza Attarpour,River Jiang,Brana Sooriyakanthan,Maala Sooriyakanthan,Heather Whitney,Jeremy Slivnick,Barry Rubin,Wendy Tsang,Bo Wang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Foundation models for echocardiography promise to reduce annotation burden and improve diagnostic consistency by learning generalizable representations from large unlabeled video archives. However, current approaches fail to disentangle anatomical signal from the stochastic speckle and acquisition artifacts that dominate ultrasound imagery. We present EchoJEPA, a foundation model for echocardiography trained on 18 million echocardiograms across 300K patients, the largest pretraining corpus for this modality to date. We also introduce a novel multi-view probing framework with factorized stream embeddings that standardizes evaluation under frozen backbones. Compared to prior methods, EchoJEPA reduces left ventricular ejection fraction estimation error by 19% and achieves 87.4% view classification accuracy. EchoJEPA exhibits strong sample efficiency, reaching 78.6% accuracy with only 1% of labeled data versus 42.1% for the best baseline trained on 100%. Under acoustic perturbations, EchoJEPA degrades by only 2.3% compared to 16.8% for the next best model, and transfers zero-shot to pediatric patients with 15% lower error than the next best model, outperforming all fine-tuned baselines. These results establish latent prediction as a superior paradigm for ultrasound foundation models.
zh

[CV-140] Super-résolution non supervisée dimages hyperspectrales de télédétection utilisant un entraînement entièrement synthétique

【速读】:该论文旨在解决高光谱单图像超分辨率(Hyperspectral Single Image Super-Resolution, SISR)任务中因缺乏高质量高分辨率真实数据而导致的监督学习方法难以应用的问题。其解决方案的关键在于提出一种无监督学习框架,通过合成丰度数据进行训练:首先利用高光谱解混(Hyperspectral Unmixing)将原始图像分解为端元(Endmembers)和丰度图(Abundance Maps),随后使用“死叶模型”(Dead Leaves Model)生成具有真实丰度统计特性的合成数据来训练神经网络以提升丰度图的空间分辨率,最终通过重构超分辨丰度图与原始端元重新组合得到高分辨率高光谱图像。该方法有效规避了对真实高分辨率标签数据的依赖,同时保持了光谱信息完整性。

链接: https://arxiv.org/abs/2602.02552
作者: Xinxin Xu,Yann Gousseau,Christophe Kervazo,Saïd Ladjal
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: in French language

点击查看摘要

Abstract:Hyperspectral single image super-resolution (SISR) aims to enhance spatial resolution while preserving the rich spectral information of hyperspectral images. Most existing methods rely on supervised learning with high-resolution ground truth data, which is often unavailable in practice. To overcome this limitation, we propose an unsupervised learning approach based on synthetic abundance data. The hyperspectral image is first decomposed into endmembers and abundance maps through hyperspectral unmixing. A neural network is then trained to super-resolve these maps using data generated with the dead leaves model, which replicates the statistical properties of real abundances. The final super-resolution hyperspectral image is reconstructed by recombining the super-resolved abundance maps with the endmembers. Experimental results demonstrate the effectiveness of our method and the relevance of synthetic data for training.
zh

人工智能

[AI-0] PLATE: Plasticity-Tunable Efficient Adapters for Geometry-Aware Continual Learning

【速读】:该论文旨在解决预训练模型在持续学习(continual learning)场景中缺乏历史任务数据的问题,即如何在不访问旧任务数据的情况下实现对基础模型的有效适应。其核心挑战在于避免灾难性遗忘的同时保持对新任务的学习能力。解决方案的关键在于利用预训练网络中存在的几何冗余性(geometric redundancy),通过两种互补机制实现:一是利用冗余神经元作为预训练时期特征方向的代理,从而直接从预训练权重中构建近似受保护的更新子空间;二是将可塑性限制在冗余神经元的子集内,并约束其余自由度,以减少旧数据分布上的功能漂移并提升最坏情况下的保留性能。基于此,作者提出PLATE方法,其通过结构化低秩更新 ΔW=BAQ\Delta W = B A Q^\top 参数化每一层,其中 BBQQ 由预训练权重一次性计算并冻结,仅训练矩阵 AA,从而实现无需旧数据、且可显式调控可塑性与保留权衡的高效持续学习。

链接: https://arxiv.org/abs/2602.03846
作者: Romain Cosentino
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We develop a continual learning method for pretrained models that \emphrequires no access to old-task data, addressing a practical barrier in foundation model adaptation where pretraining distributions are often unavailable. Our key observation is that pretrained networks exhibit substantial \emphgeometric redundancy, and that this redundancy can be exploited in two complementary ways. First, redundant neurons provide a proxy for dominant pretraining-era feature directions, enabling the construction of approximately protected update subspaces directly from pretrained weights. Second, redundancy offers a natural bias for \emphwhere to place plasticity: by restricting updates to a subset of redundant neurons and constraining the remaining degrees of freedom, we obtain update families with reduced functional drift on the old-data distribution and improved worst-case retention guarantees. These insights lead to \textscPLATE (\textbfPlasticity-\textbfTunable \textbfEfficient Adapters), a continual learning method requiring no past-task data that provides explicit control over the plasticity-retention trade-off. PLATE parameterizes each layer with a structured low-rank update \Delta W = B A Q^\top , where B and Q are computed once from pretrained weights and kept frozen, and only A is trained on the new task. The code is available at this https URL.
zh

[AI-1] Adaptive Evidence Weighting for Audio-Spatiotemporal Fusion

【速读】:该论文旨在解决多源证据融合中的可靠性差异问题,即在生物声学分类任务中,音频信号与时空上下文(如位置和季节)虽可共同推断物种身份,但二者在不同样本上的可靠性和信息量存在显著差异。传统方法通常采用固定权重融合或基于贝叶斯推理的乘法组合,但在实践中难以获得校准的生成模型,仅能使用判别式预测器。解决方案的关键在于提出一种自适应对数线性融合框架FINCH(Fusion under Independent Conditional Hypotheses),其通过学习每个样本的门控函数来估计上下文信息的可靠性,该函数基于不确定性与信息量统计量构建,从而动态调整上下文证据的影响强度。FINCH不仅包含纯音频分类器作为特例,还能显式限制上下文证据的作用范围,形成具有风险控制能力且具备可解释性的假设类,实现鲁棒性能提升,尤其在上下文信息孤立时仍保持优越表现。

链接: https://arxiv.org/abs/2602.03817
作者: Oscar Ovanger,Levi Harris,Timothy H. Keitt
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many machine learning systems have access to multiple sources of evidence for the same prediction target, yet these sources often differ in reliability and informativeness across inputs. In bioacoustic classification, species identity may be inferred both from the acoustic signal and from spatiotemporal context such as location and season; while Bayesian inference motivates multiplicative evidence combination, in practice we typically only have access to discriminative predictors rather than calibrated generative models. We introduce \textbfFusion under \textbfINdependent \textbfConditional \textbfHypotheses (\textbfFINCH), an adaptive log-linear evidence fusion framework that integrates a pre-trained audio classifier with a structured spatiotemporal predictor. FINCH learns a per-sample gating function that estimates the reliability of contextual information from uncertainty and informativeness statistics. The resulting fusion family \emphcontains the audio-only classifier as a special case and explicitly bounds the influence of contextual evidence, yielding a risk-contained hypothesis class with an interpretable audio-only fallback. Across benchmarks, FINCH consistently outperforms fixed-weight fusion and audio-only baselines, improving robustness and error trade-offs even when contextual information is weak in isolation. We achieve state-of-the-art performance on CBI and competitive or improved performance on several subsets of BirdSet using a lightweight, interpretable, evidence-based approach. Code is available: \texttt\hrefthis https URLanonymous-repository
zh

[AI-2] Conformal Thinking: Risk Control for Reasoning on a Compute Budget

【速读】:该论文旨在解决生成式 AI(Generative AI)中推理大语言模型(Reasoning Large Language Models, LLMs)在测试时计算资源分配的效率问题,即如何在保证准确率的前提下最小化计算开销。核心挑战在于设定合理的token预算和自适应推理阈值时存在风险-准确性权衡(risk-accuracy trade-off)。解决方案的关键在于将预算设置重构为风险控制问题:通过引入一个上界阈值(upper threshold)在模型自信时停止推理以降低错误率,以及一个新颖的参数化下界阈值(parametric lower threshold)提前终止无法解决的任务实例,从而避免过早停止;同时利用无分布风险控制方法,在给定目标风险水平和验证集的基础上最优配置这两个停止机制,并进一步结合效率损失指标选择最高效的退出策略。实证结果表明,该框架可在满足用户指定风险上限的同时显著提升计算效率。

链接: https://arxiv.org/abs/2602.03814
作者: Xi Wang,Anushri Suresh,Alvin Zhang,Rishi More,William Jurayj,Benjamin Van Durme,Mehrdad Farajtabar,Daniel Khashabi,Eric Nalisnick
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reasoning Large Language Models (LLMs) enable test-time scaling, with dataset-level accuracy improving as the token budget increases, motivating adaptive reasoning – spending tokens when they improve reliability and stopping early when additional computation is unlikely to help. However, setting the token budget, as well as the threshold for adaptive reasoning, is a practical challenge that entails a fundamental risk-accuracy trade-off. We re-frame the budget setting problem as risk control, limiting the error rate while minimizing compute. Our framework introduces an upper threshold that stops reasoning when the model is confident (risking incorrect output) and a novel parametric lower threshold that preemptively stops unsolvable instances (risking premature stoppage). Given a target risk and a validation set, we use distribution-free risk control to optimally specify these stopping mechanisms. For scenarios with multiple budget controlling criteria, we incorporate an efficiency loss to select the most computationally efficient exiting mechanism. Empirical results across diverse reasoning tasks and models demonstrate the effectiveness of our risk control approach, demonstrating computational efficiency gains from the lower threshold and ensemble stopping mechanisms while adhering to the user-specified risk target.
zh

[AI-3] Enhancing Imbalanced Node Classification via Curriculum-Guided Feature Learning and Three-Stage Attention Network

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)中因标签分布不均衡导致的节点分类性能失衡问题,即模型在少数类上表现较差。解决方案的关键在于提出一种基于课程学习(Curriculum Learning)的三阶段注意力网络(CL3AN-GNN),其核心机制为“Engage-Enact-Embed”三步式特征学习流程:首先通过结构简单特征(如1跳邻域模式、低度节点属性及类别可分节点对)建立稳定初始学习基础;随后在Enact阶段引入可调注意力权重以处理复杂关系(如多跳连接、异构边和少数类边缘节点);最后在Embed阶段通过迭代消息传递与课程对齐损失加权整合特征。该方法在多个开放图基准数据集上验证了其有效性,显著提升了准确率、F1分数和AUC指标,并展现出更快收敛速度和更强泛化能力。

链接: https://arxiv.org/abs/2602.03808
作者: Abdul Joseph Fofanah,Lian Wen,David Chen,Shaoyang Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Imbalanced node classification in graph neural networks (GNNs) happens when some labels are much more common than others, which causes the model to learn unfairly and perform badly on the less common classes. To solve this problem, we propose a Curriculum-Guided Feature Learning and Three-Stage Attention Network (CL3AN-GNN), a learning network that uses a three-step attention system (Engage, Enact, Embed) similar to how humans learn. The model begins by engaging with structurally simpler features, defined as (1) local neighbourhood patterns (1-hop), (2) low-degree node attributes, and (3) class-separable node pairs identified via initial graph convolutional networks and graph attention networks (GCN and GAT) embeddings. This foundation enables stable early learning despite label skew. The Enact stage then addresses complicated aspects: (1) connections that require multiple steps, (2) edges that connect different types of nodes, and (3) nodes at the edges of minority classes by using adjustable attention weights. Finally, Embed consolidates these features via iterative message passing and curriculum-aligned loss weighting. We evaluate CL3AN-GNN on eight Open Graph Benchmark datasets spanning social, biological, and citation networks. Experiments show consistent improvements across all datasets in accuracy, F1-score, and AUC over recent state-of-the-art methods. The model’s step-by-step method works well with different types of graph datasets, showing quicker results than training everything at once, better performance on new, imbalanced graphs, and clear explanations of each step using gradient stability and attention correlation learning curves. This work provides both a theoretically grounded framework for curriculum learning in GNNs and practical evidence of its effectiveness against imbalances, validated through metrics, convergence speeds, and generalisation tests.
zh

[AI-4] Do We Need Asynchronous SGD? On the Near-Optimality of Synchronous Methods

【速读】:该论文旨在解决现代分布式优化方法中同步策略(如Synchronous SGD及其鲁棒变体m-Synchronous SGD)在异构计算环境下的性能评估与理论局限性问题。尽管近年来异步优化取得了显著进展,但本文通过理论分析表明,在随机计算时间和对抗性部分参与(adversarial partial participation)的条件下,同步方法的时间复杂度在许多实际场景中接近最优,仅存在对数因子差距。其解决方案的关键在于:重新审视并严格证明同步方法在异构计算场景中的近优性,从而为实践中广泛采用的同步策略提供坚实的理论依据,并指出其在多数现代异构计算任务中已足够有效,无需依赖更复杂的异步方案。

链接: https://arxiv.org/abs/2602.03802
作者: Grigory Begunov,Alexander Tyurin
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Modern distributed optimization methods mostly rely on traditional synchronous approaches, despite substantial recent progress in asynchronous optimization. We revisit Synchronous SGD and its robust variant, called m -Synchronous SGD, and theoretically show that they are nearly optimal in many heterogeneous computation scenarios, which is somewhat unexpected. We analyze the synchronous methods under random computation times and adversarial partial participation of workers, and prove that their time complexities are optimal in many practical regimes, up to logarithmic factors. While synchronous methods are not universal solutions and there exist tasks where asynchronous methods may be necessary, we show that they are sufficient for many modern heterogeneous computation scenarios.
zh

[AI-5] Conformal Reachability for Safe Control in Unknown Environments

【速读】:该论文旨在解决未知动态系统中可证明安全控制的设计问题,现有方法通常假设系统动力学已知或为确定性系统,且状态与动作空间有限,限制了实际应用范围。其解决方案的关键在于构建一个结合置信预测(conformal prediction)与可达性分析(reachability analysis)的 probabilistic verification 框架:通过置信预测在每一步获取对未知动力学的有效不确定性区间,再利用可达性分析验证安全性是否在该不确定性范围内得以保持;进一步提出一种算法训练控制策略,在优化期望奖励的同时最大化具有严格概率安全保证的规划时长。

链接: https://arxiv.org/abs/2602.03799
作者: Xinhang Ma,Junlin Wu,Yiannis Kantaros,Yevgeniy Vorobeychik
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Designing provably safe control is a core problem in trustworthy autonomy. However, most prior work in this regard assumes either that the system dynamics are known or deterministic, or that the state and action space are finite, significantly limiting application scope. We address this limitation by developing a probabilistic verification framework for unknown dynamical systems which combines conformal prediction with reachability analysis. In particular, we use conformal prediction to obtain valid uncertainty intervals for the unknown dynamics at each time step, with reachability then verifying whether safety is maintained within the conformal uncertainty bounds. Next, we develop an algorithmic approach for training control policies that optimize nominal reward while also maximizing the planning horizon with sound probabilistic safety guarantees. We evaluate the proposed approach in seven safe control settings spanning four domains – cartpole, lane following, drone control, and safe navigation – for both affine and nonlinear safety specifications. Our experiments show that the policies we learn achieve the strongest provable safety guarantees while still maintaining high average reward.
zh

[AI-6] Understanding Agent Scaling in LLM -Based Multi-Agent Systems via Diversity

【速读】:该论文试图解决多智能体系统(Multi-Agent Systems, MAS)在扩展规模时性能提升受限的问题,特别是探究为何同质化扩展(homogeneous scaling)会出现显著的边际收益递减,而引入异质性(如不同模型、提示或工具)却能持续带来性能增益。其解决方案的关键在于提出一个信息论框架,表明MAS的性能上限由任务本身的内在不确定性决定,而非智能体数量;并定义了一个可衡量的有效信道数 $ K^* $,用于量化系统实际利用的信息通道数量——同质智能体因输出高度相关导致有效信道饱和,而异质智能体则能提供互补证据,从而显著提升整体性能。这一理论揭示了多样性在构建高效、鲁棒MAS中的核心作用,并提供了基于多样性的设计原则。

链接: https://arxiv.org/abs/2602.03794
作者: Yingxuan Yang,Chengrui Qu,Muning Wen,Laixi Shi,Ying Wen,Weinan Zhang,Adam Wierman,Shangding Gu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:LLM-based multi-agent systems (MAS) have emerged as a promising approach to tackle complex tasks that are difficult for individual LLMs. A natural strategy is to scale performance by increasing the number of agents; however, we find that such scaling exhibits strong diminishing returns in homogeneous settings, while introducing heterogeneity (e.g., different models, prompts, or tools) continues to yield substantial gains. This raises a fundamental question: what limits scaling, and why does diversity help? We present an information-theoretic framework showing that MAS performance is bounded by the intrinsic task uncertainty, not by agent count. We derive architecture-agnostic bounds demonstrating that improvements depend on how many effective channels the system accesses. Homogeneous agents saturate early because their outputs are strongly correlated, whereas heterogeneous agents contribute complementary evidence. We further introduce K^* , an effective channel count that quantifies the number of effective channels without ground-truth labels. Empirically, we show that heterogeneous configurations consistently outperform homogeneous scaling: 2 diverse agents can match or exceed the performance of 16 homogeneous agents. Our results provide principled guidelines for building efficient and robust MAS through diversity-aware design. Code and Dataset are available at the link: this https URL.
zh

[AI-7] Reward Redistribution for CVaR MDPs using a Bellm an Operator on L-infinity

【速读】:该论文旨在解决在安全关键场景中,如何有效优化静态条件风险价值(static conditional value-at-risk, CVaR)这一尾部风险度量的问题。由于CVaR依赖于完整轨迹而非单步奖励,传统方法难以通过递归的贝尔曼(Bellman)分解进行求解,且常因状态扩展引入稀疏奖励和退化固定点,导致算法不稳定或收敛困难。其解决方案的关键在于提出一种新颖的状态增强(state augmentation)形式,构建出具有稠密每步奖励与在有界值函数空间上收缩性质的贝尔曼算子,从而支持基于离散化增强状态的风险规避值迭代和模型无关Q学习算法,并提供收敛性保证及离散化误差界,最终实现对CVaR敏感策略的有效学习与性能-安全性权衡优化。

链接: https://arxiv.org/abs/2602.03778
作者: Aneri Muni,Vincent Taboga,Esther Derman,Pierre-Luc Bacon,Erick Delage
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tail-end risk measures such as static conditional value-at-risk (CVaR) are used in safety-critical applications to prevent rare, yet catastrophic events. Unlike risk-neutral objectives, the static CVaR of the return depends on entire trajectories without admitting a recursive Bellman decomposition in the underlying Markov decision process. A classical resolution relies on state augmentation with a continuous variable. However, unless restricted to a specialized class of admissible value functions, this formulation induces sparse rewards and degenerate fixed points. In this work, we propose a novel formulation of the static CVaR objective based on augmentation. Our alternative approach leads to a Bellman operator with: (1) dense per-step rewards; (2) contracting properties on the full space of bounded value functions. Building on this theoretical foundation, we develop risk-averse value iteration and model-free Q-learning algorithms that rely on discretized augmented states. We further provide convergence guarantees and approximation error bounds due to discretization. Empirical results demonstrate that our algorithms successfully learn CVaR-sensitive policies and achieve effective performance-safety trade-offs.
zh

[AI-8] An Empirical Study of Collective Behaviors and Social Dynamics in Large Language Model Agents

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在长期社会交互中可能加剧偏见或产生排斥性行为的问题,特别是在模拟社交平台环境下的毒性言论、意识形态倾向及群体极化现象。研究通过分析700万条帖子和3.2万个LLM代理在一年内的互动数据,揭示了LLM社交网络中存在类人同质性(homophily)与社会影响效应,同时发现其毒性内容的结构模式不同于人类用户。为缓解潜在有害行为,作者提出一种名为“社会思维链”(Chain of Social Thought, CoST)的解决方案,其关键在于通过提示机制引导LLM代理在生成内容前反思社会规范,从而有效抑制有害发布行为。

链接: https://arxiv.org/abs/2602.03775
作者: Farnoosh Hashemi,Michael W. Macy
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) increasingly mediate our social, cultural, and political interactions. While they can simulate some aspects of human behavior and decision-making, it is still underexplored whether repeated interactions with other agents amplify their biases or lead to exclusionary behaviors. To this end, we study this http URL-an LLM-driven social media platform-analyzing 7M posts and interactions among 32K LLM agents over a year. We start with homophily and social influence among LLMs, learning that similar to humans’, their social networks exhibit these fundamental phenomena. Next, we study the toxic language of LLMs, its linguistic features, and their interaction patterns, finding that LLMs show different structural patterns in toxic posting than humans. After studying the ideological leaning in LLMs posts, and the polarization in their community, we focus on how to prevent their potential harmful activities. We present a simple yet effective method, called Chain of Social Thought (CoST), that reminds LLM agents to avoid harmful posting.
zh

[AI-9] UniGeM: Unifying Data Mixing and Selection via Geometric Exploration and Mining

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)训练中因数据质量受限而导致的扩展瓶颈问题。现有方法通常将数据混合(mixing)与样本选择(selection)分开处理,这在代码语料库等结构敏感的数据中容易破坏原有逻辑结构。其解决方案的关键在于提出UniGeM框架,通过将数据整理建模为流形近似(manifold approximation)问题,统一了混合与选择过程,无需训练代理模型或依赖外部参考数据集。该框架采用分层机制:宏观探索(Macro-Exploration)基于稳定性聚类学习混合权重,微观挖掘(Micro-Mining)则依据几何分布过滤高质量样本以保障逻辑一致性,从而显著提升数据利用效率和模型性能。

链接: https://arxiv.org/abs/2602.03772
作者: Changhao Wang,Yunfei Yu,Xinhao Yao,Jiaolong Yang,Riccardo Cantoro,Chaobo Li,Qing Cui,Jun Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The scaling of Large Language Models (LLMs) is increasingly limited by data quality. Most methods handle data mixing and sample selection separately, which can break the structure in code corpora. We introduce \textbfUniGeM, a framework that unifies mixing and selection by treating data curation as a \textitmanifold approximation problem without training proxy models or relying on external reference datasets. UniGeM operates hierarchically: \textbfMacro-Exploration learns mixing weights with stability-based clustering; \textbfMicro-Mining filters high-quality instances by their geometric distribution to ensure logical consistency. Validated by training 8B and 16B MoE models on 100B tokens, UniGeM achieves \textbf2.0 \times data efficiency over a random baseline and further improves overall performance compared to SOTA methods in reasoning-heavy evaluations and multilingual generalization.
zh

[AI-10] Decision-oriented benchmarking to transform AI weather forecast access: Application to the Indian monsoon

【速读】:该论文旨在解决当前人工智能天气预测(AIWP)模型评估体系忽视本地利益相关者决策需求的问题,尤其是在低收入和中等收入地区应对高影响天气事件时的实用性不足。其解决方案的关键在于构建一个融合气象学、人工智能与社会科学的决策导向型评估框架,该框架不仅使用确定性和概率性指标对AIWP模型进行客观验证,还聚焦于农业等关键应用场景的实际效益,例如在印度季风预测中成功提前数周预测农业相关的季风 onset 指标,并据此推动政府向3800万农民提供基于AI的季风 onset 预报,从而有效应对气候变异与变化带来的风险。

链接: https://arxiv.org/abs/2602.03767
作者: Rajat Masiwal,Colin Aitken,Adam Marchakitus,Mayank Gupta,Katherine Kowal,Hamid A. Pahlavan,Tyler Yang,Y. Qiang Sun,Michael Kremer,Amir Jina,William R. Boos,Pedram Hassanzadeh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); General Economics (econ.GN); Atmospheric and Oceanic Physics (physics.ao-ph)
备注:

点击查看摘要

Abstract:Artificial intelligence weather prediction (AIWP) models now often outperform traditional physics-based models on common metrics while requiring orders-of-magnitude less computing resources and time. Open-access AIWP models thus hold promise as transformational tools for helping low- and middle-income populations make decisions in the face of high-impact weather shocks. Yet, current approaches to evaluating AIWP models focus mainly on aggregated meteorological metrics without considering local stakeholders’ needs in decision-oriented, operational frameworks. Here, we introduce such a framework that connects meteorology, AI, and social sciences. As an example, we apply it to the 150-year-old problem of Indian monsoon forecasting, focusing on benefits to rain-fed agriculture, which is highly susceptible to climate change. AIWP models skillfully predict an agriculturally relevant onset index at regional scales weeks in advance when evaluated out-of-sample using deterministic and probabilistic metrics. This framework informed a government-led effort in 2025 to send 38 million Indian farmers AI-based monsoon onset forecasts, which captured an unusual weeks-long pause in monsoon progression. This decision-oriented benchmarking framework provides a key component of a blueprint for harnessing the power of AIWP models to help large vulnerable populations adapt to weather shocks in the face of climate variability and change.
zh

[AI-11] Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averag ing

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在持续或开放设置下预训练时面临的“非即时性”(non-anytime)问题,即现有预训练方案依赖于固定计算预算和与训练周期相关的学习率调度策略,难以适应未知的总训练时长。其解决方案的关键在于引入权重平均(weight averaging),并结合简单的、无需考虑训练时长的步长衰减策略(如 1/t1/\sqrt{t}),从而实现理论上最优的最小最大收敛速率(minimax convergence rates)。研究表明,这类“即时性”(anytime)学习调度可随时间多项式衰减,且在不同规模的语言模型(150M–300M参数)和Chinchilla尺度训练中,均能稳定达到与精心调优的余弦退火学习率相当的最终损失,为LLM预训练提供了一种实用、高效且不依赖先验计算预算的替代方案。

链接: https://arxiv.org/abs/2602.03702
作者: Alexandru Meterez,Pranav Ajit Nair,Depen Morwani,Cengiz Pehlevan,Sham Kakade
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Large language models are increasingly trained in continual or open-ended settings, where the total training horizon is not known in advance. Despite this, most existing pretraining recipes are not anytime: they rely on horizon-dependent learning rate schedules and extensive tuning under a fixed compute budget. In this work, we provide a theoretical analysis demonstrating the existence of anytime learning schedules for overparameterized linear regression, and we highlight the central role of weight averaging - also known as model merging - in achieving the minimax convergence rates of stochastic gradient descent. We show that these anytime schedules polynomially decay with time, with the decay rate determined by the source and capacity conditions of the problem. Empirically, we evaluate 150M and 300M parameter language models trained at 1-32x Chinchilla scale, comparing constant learning rates with weight averaging and 1/\sqrtt schedules with weight averaging against a well-tuned cosine schedule. Across the full training range, the anytime schedules achieve comparable final loss to cosine decay. Taken together, our results suggest that weight averaging combined with simple, horizon-free step sizes offers a practical and effective anytime alternative to cosine learning rate schedules for large language model pretraining.
zh

[AI-12] LLM -Inspired Pretrain-Then-Finetune for Small-Data Large-Scale Optimization

【速读】:该论文旨在解决小数据、大规模决策问题,即企业在面对大量同时发生的运营决策(如产品组合中的多个决策)时,仅能观测到每个实例的少量且可能噪声较大的数据点。为应对这一挑战,作者提出一种“预训练-微调”范式,其核心在于设计一个基于Transformer架构的模型:首先在包含管理知识和决策环境结构特征的大规模领域相关合成数据上进行预训练,以注入先验知识并利用丰富合成数据训练高容量模型;随后在真实观测数据上微调,使模型适应实际操作环境并提升与真实数据生成机制的一致性。该方法的关键创新在于问题特定的网络结构设计与定制化训练流程,而非简单套用现成Transformer模型,并通过理论分析首次建立了Transformer学习在该场景下的非渐近误差边界,揭示了预训练与微调共同决定性能,且微调具有随实例数量增长而增强的规模经济效应。

链接: https://arxiv.org/abs/2602.03690
作者: Zishi Zhang,Jinhui Han,Ming Hu,Yijie Peng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We consider small-data, large-scale decision problems in which a firm must make many operational decisions simultaneously (e.g., across a large product portfolio) while observing only a few, potentially noisy, data points per instance. Inspired by the success of large language models (LLMs), we propose a pretrain-then-finetune approach built on a designed Transformer model to address this challenge. The model is first pretrained on large-scale, domain-informed synthetic data that encode managerial knowledge and structural features of the decision environment, and is then fine-tuned on real observations. This new pipeline offers two complementary advantages: pretraining injects domain knowledge into the learning process and enables the training of high-capacity models using abundant synthetic data, while finetuning adapts the pretrained model to the operational environment and improves alignment with the true data-generating regime. While we have leveraged the Transformer’s state-of-the-art representational capacity, particularly its attention mechanism, to efficiently extract cross-task structure, our approach is not an off-the-shelf application. Instead, it relies on problem-specific architectural design and a tailored training procedure to match the decision setting. Theoretically, we develop the first comprehensive error analysis regarding Transformer learning in relevant contexts, establishing nonasymptotic guarantees that validate the method’s effectiveness. Critically, our analysis reveals how pretraining and fine-tuning jointly determine performance, with the dominant contribution governed by whichever is more favorable. In particular, finetuning exhibits an economies-of-scale effect, whereby transfer learning becomes increasingly effective as the number of instances grows.
zh

[AI-13] odyComm: Task-Oriented Dynamic Communication for Multi-Round LLM -based Multi-Agent System

【速读】:该论文旨在解决多轮大语言模型(Large Language Model, LLM)驱动的多智能体系统在推理过程中因采用固定通信拓扑而导致协作效率低下的问题,尤其是在动态环境(如对手变化、任务演进或通信带宽限制)下,智能体角色随轮次变化时难以维持高效协同。解决方案的关键在于提出TodyComm——一种任务导向的动态通信算法,其通过策略梯度优化生成行为驱动的协作拓扑结构,使通信模式能够根据每一轮的实际动态自适应调整,从而在保障Token效率和可扩展性的前提下显著提升任务执行效果。

链接: https://arxiv.org/abs/2602.03688
作者: Wenzhe Fan,Tommaso Tognoli,Henry Peng Zou,Chunyu Miao,Yibo Wang,Xinhua Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-round LLM-based multi-agent systems rely on effective communication structures to support collaboration across rounds. However, most existing methods employ a fixed communication topology during inference, which falls short in many realistic applications where the agents’ roles may change \textitacross rounds due to dynamic adversary, task progression, or time-varying constraints such as communication bandwidth. In this paper, we propose addressing this issue through TodyComm, a \textbftask-\textbforiented \textbfdynamic \textbfcommunication algorithm. It produces behavior-driven collaboration topologies that adapt to the dynamics at each round, optimizing the utility for the task through policy gradient. Experiments on five benchmarks demonstrate that under both dynamic adversary and communications budgets, TodyComm delivers superior task effectiveness while retaining token efficiency and scalability.
zh

[AI-14] QuAIL: Quality-Aware Inertial Learning for Robust Training under Data Corruption

【速读】:该论文旨在解决表格型机器学习系统在训练过程中面临的数据非均匀污染问题,包括噪声测量、缺失值和特征特异性偏差等,这些问题通常仅能通过列级别的可靠性指标进行描述,而非实例级别的质量标注,从而限制了众多鲁棒性和数据清洗技术的应用。解决方案的关键在于提出QuAIL(Quality-informed Training Mechanism),其核心创新是将特征可靠性先验信息直接嵌入到学习过程中:通过引入一个可学习的特征调制层,并结合基于质量的近端正则化项对更新进行选择性约束,实现对不同可信度特征的可控适应,从而在结构化污染下稳定优化过程,无需显式的数据修复或样本级重加权。

链接: https://arxiv.org/abs/2602.03686
作者: Mattia Sabella,Alberto Archetti,Pietro Pinoli,Matteo Matteucci,Cinzia Cappiello
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tabular machine learning systems are frequently trained on data affected by non-uniform corruption, including noisy measurements, missing entries, and feature-specific biases. In practice, these defects are often documented only through column-level reliability indicators rather than instance-wise quality annotations, limiting the applicability of many robustness and cleaning techniques. We present QuAIL, a quality-informed training mechanism that incorporates feature reliability priors directly into the learning process. QuAIL augments existing models with a learnable feature-modulation layer whose updates are selectively constrained by a quality-dependent proximal regularizer, thereby inducing controlled adaptation across features of varying trustworthiness. This stabilizes optimization under structured corruption without explicit data repair or sample-level reweighting. Empirical evaluation across 50 classification and regression datasets demonstrates that QuAIL consistently improves average performance over neural baselines under both random and value-dependent corruption, with especially robust behavior in low-data and systematically biased settings. These results suggest that incorporating feature reliability information directly into optimization dynamics is a practical and effective approach for resilient tabular learning.
zh

[AI-15] Universal One-third Time Scaling in Learning Peaked Distributions

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)训练过程中损失函数收敛速度缓慢的问题,这一现象表现为损失随训练时间呈现幂律衰减,且其根本成因此前尚不明确。论文通过分析简化模型与对真实LLM的实证研究发现,这种幂律收敛行为本质上源于Softmax与交叉熵损失函数在学习尖锐概率分布(如下一个词的概率分布)时所引发的梯度和损失值幂律衰减。解决方案的关键在于揭示了Softmax与交叉熵组合在优化过程中的内在机制缺陷——即当目标分布具有高尖锐性时,会导致优化路径上梯度逐渐变小,从而形成一个普遍存在的优化瓶颈,最终使损失随训练时间呈幂律下降,且指数为1/3的普适常数。这一发现为理解神经网络训练的标度规律提供了理论依据,并指出了改进LLM训练效率的新方向。

链接: https://arxiv.org/abs/2602.03685
作者: Yizhou Liu,Ziming Liu,Cengiz Pehlevan,Jeff Gore
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 24 pages, 6 main text figures, 27 figures in total

点击查看摘要

Abstract:Training large language models (LLMs) is computationally expensive, partly because the loss exhibits slow power-law convergence whose origin remains debatable. Through systematic analysis of toy models and empirical evaluation of LLMs, we show that this behavior can arise intrinsically from the use of softmax and cross-entropy. When learning peaked probability distributions, e.g., next-token distributions, these components yield power-law vanishing losses and gradients, creating a fundamental optimization bottleneck. This ultimately leads to power-law time scaling of the loss with a universal exponent of 1/3 . Our results provide a mechanistic explanation for observed neural scaling and suggest new directions for improving LLM training efficiency.
zh

[AI-16] ContraLog: Log File Anomaly Detection with Contrastive Learning and Masked Language Modeling

【速读】:该论文旨在解决传统日志异常检测方法依赖日志解析器(log parser)将日志消息映射为离散模板而导致变量值和语义信息丢失的问题。其解决方案的关键在于提出一种无需解析器(parser-free)且基于自监督学习的框架 ContraLog,该框架将异常检测任务重构为预测连续的消息嵌入(message embeddings)而非离散模板ID,并通过掩码语言建模与对比学习相结合的方式训练模型,使其能够利用上下文信息预测被遮蔽的消息嵌入。实验表明,ContraLog 在 HDFS、BGL 和 Thunderbird 数据集上均表现出优越性能,且生成的嵌入本身即具备异常判别能力,即使不依赖序列上下文也能有效识别异常。

链接: https://arxiv.org/abs/2602.03678
作者: Simon Dietz,Kai Klede,An Nguyen,Bjoern M Eskofier
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 26 pages with 16 figures

点击查看摘要

Abstract:Log files record computational events that reflect system state and behavior, making them a primary source of operational insights in modern computer systems. Automated anomaly detection on logs is therefore critical, yet most established methods rely on log parsers that collapse messages into discrete templates, discarding variable values and semantic content. We propose ContraLog, a parser-free and self-supervised method that reframes log anomaly detection as predicting continuous message embeddings rather than discrete template IDs. ContraLog combines a message encoder that produces rich embeddings for individual log messages with a sequence encoder to model temporal dependencies within sequences. The model is trained with a combination of masked language modeling and contrastive learning to predict masked message embeddings based on the surrounding context. Experiments on the HDFS, BGL, and Thunderbird benchmark datasets empirically demonstrate effectiveness on complex datasets with diverse log messages. Additionally, we find that message embeddings generated by ContraLog carry meaningful information and are predictive of anomalies even without sequence context. These results highlight embedding-level prediction as an approach for log anomaly detection, with potential applicability to other event sequences.
zh

[AI-17] Equilibrium Propagation for Non-Conservative Systems

【速读】:该论文旨在解决平衡传播(Equilibrium Propagation, EP)算法在非保守系统(nonconservative systems)中的扩展问题,即如何在存在非互易相互作用(non-reciprocal interactions)的系统中准确计算损失函数的梯度。此前的尝试未能获得精确梯度,限制了EP在更广泛神经网络架构(如前馈网络)中的应用。解决方案的关键在于:在学习阶段修改动力学过程,引入一个与非互易部分成比例的项,从而确保能够得到损失函数的精确梯度;同时,该方法也可通过变分框架从一个扩展状态空间上的能量函数推导出学习动力学,保持了EP利用稳态进行推理和学习的核心特性。

链接: https://arxiv.org/abs/2602.03670
作者: Antonino Emanuele Scurria,Dimitri Vanden Abeele,Bortolo Matteo Mognetti,Serge Massar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Dynamical Systems (math.DS); Classical Physics (physics.class-ph)
备注: 19 pages (9 pages main text), 7 figures

点击查看摘要

Abstract:Equilibrium Propagation (EP) is a physics-inspired learning algorithm that uses stationary states of a dynamical system both for inference and learning. In its original formulation it is limited to conservative systems, \textiti.e. to dynamics which derive from an energy function. Given their importance in applications, it is important to extend EP to nonconservative systems, \textiti.e. systems with non-reciprocal interactions. Previous attempts to generalize EP to such systems failed to compute the exact gradient of the cost function. Here we propose a framework that extends EP to arbitrary nonconservative systems, including feedforward networks. We keep the key property of equilibrium propagation, namely the use of stationary states both for inference and learning. However, we modify the dynamics in the learning phase by a term proportional to the non-reciprocal part of the interaction so as to obtain the exact gradient of the cost function. This algorithm can also be derived using a variational formulation that generates the learning dynamics through an energy function defined over an augmented state space. Numerical experiments using the MNIST database show that this algorithm achieves better performance and learns faster than previous proposals.
zh

[AI-18] Mitigating Conversational Inertia in Multi-Turn Agents

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在多轮代理(multiturn agent)场景中因“对话惯性”(conversational inertia)导致的探索能力受限问题。具体而言,当LLM作为代理执行任务时,其注意力机制会过度关注自身先前回复,从而产生模仿偏差(imitation bias),抑制对更优策略的探索。解决方案的关键在于提出上下文偏好学习(Context Preference Learning),通过识别相同状态下不同上下文长度所引发的惯性差异,构建无需环境奖励的偏好对(preference pairs),从而引导模型偏好低惯性响应;同时引入推理阶段的上下文管理策略,在利用长上下文进行经验积累(exploitation)与保持适度探索之间实现平衡。

链接: https://arxiv.org/abs/2602.03664
作者: Yang Wan,Zheng Cao,Zhenhao Zhang,Zhengwen Zeng,Shuheng Shen,Changhua Meng,Linchao Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models excel as few-shot learners when provided with appropriate demonstrations, yet this strength becomes problematic in multiturn agent scenarios, where LLMs erroneously mimic their own previous responses as few-shot examples. Through attention analysis, we identify conversational inertia, a phenomenon where models exhibit strong diagonal attention to previous responses, which is associated with imitation bias that constrains exploration. This reveals a tension when transforming few-shot LLMs into agents: longer context enriches environmental feedback for exploitation, yet also amplifies conversational inertia that undermines exploration. Our key insight is that for identical states, actions generated with longer contexts exhibit stronger inertia than those with shorter contexts, enabling construction of preference pairs without environment rewards. Based on this, we propose Context Preference Learning to calibrate model preferences to favor low-inertia responses over highinertia ones. We further provide context management strategies at inference time to balance exploration and exploitation. Experimental results across eight agentic environments and one deep research scenario validate that our framework reduces conversational inertia and achieves performance improvements.
zh

[AI-19] Can LLM s Do Rocket Science? Exploring the Limits of Complex Reasoning with GTOC 12

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在高维、物理约束环境中的自主多阶段规划能力不足的问题,特别是其在复杂航天任务(如小行星采矿任务)中从战略设计到执行落地的完整闭环能力。解决方案的关键在于构建一个适配轨道力学领域的MLE-Bench评估框架,并采用基于AIDE(Agent-based Integrated Design Environment)的代理架构,使LLM能够自主生成和迭代优化任务方案;同时引入“LLM-as-a-Judge”评估方法,通过领域专家制定的评分标准对策略可行性进行结构化打分,从而系统性揭示当前模型在战略层面与执行层面之间的显著能力鸿沟——即尽管先进模型具备良好的概念理解能力和任务建模能力,但在物理单位一致性、边界条件处理及调试效率等实现细节上存在严重缺陷,导致其仍无法作为完全自主的工程执行者。

链接: https://arxiv.org/abs/2602.03630
作者: Iñaki del Campo,Pablo Cuervo,Victor Rodriguez-Fernandez,Roberto Armellin,Jack Yarndley
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Extended version of the paper presented at AIAA SciTech 2026 Forum. Includes futher experiments, corrections and new appendix

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable proficiency in code generation and general reasoning, yet their capacity for autonomous multi-stage planning in high-dimensional, physically constrained environments remains an open research question. This study investigates the limits of current AI agents by evaluating them against the 12th Global Trajectory Optimization Competition (GTOC 12), a complex astrodynamics challenge requiring the design of a large-scale asteroid mining campaign. We adapt the MLE-Bench framework to the domain of orbital mechanics and deploy an AIDE-based agent architecture to autonomously generate and refine mission solutions. To assess performance beyond binary validity, we employ an “LLM-as-a-Judge” methodology, utilizing a rubric developed by domain experts to evaluate strategic viability across five structural categories. A comparative analysis of models, ranging from GPT-4-Turbo to reasoning-enhanced architectures like Gemini 2.5 Pro, and o3, reveals a significant trend: the average strategic viability score has nearly doubled in the last two years (rising from 9.3 to 17.2 out of 26). However, we identify a critical capability gap between strategy and execution. While advanced models demonstrate sophisticated conceptual understanding, correctly framing objective functions and mission architectures, they consistently fail at implementation due to physical unit inconsistencies, boundary condition errors, and inefficient debugging loops. We conclude that, while current LLMs often demonstrate sufficient knowledge and intelligence to tackle space science tasks, they remain limited by an implementation barrier, functioning as powerful domain facilitators rather than fully autonomous engineers.
zh

[AI-20] APEX: Probing Neural Networks via Activation Perturbation

【速读】:该论文旨在解决现有神经网络探查方法(如输入空间分析或参数扰动)在获取中间表示中结构信息时存在的根本性局限问题。其解决方案的关键在于提出一种推理阶段的探查范式——Activation Perturbation for EXploration (APEX),通过在保持输入和模型参数不变的前提下扰动隐藏层激活值,从而实现从样本相关行为到模型相关行为的理论上的平滑过渡:抑制输入特异性信号并放大表示层面的结构特征。这一机制使APEX能够揭示传统方法无法捕捉的模型内部结构与偏差,尤其在小噪声和大噪声场景下分别有效评估样本规则性、区分结构化与随机标签模型,并暴露训练诱导的模型级偏倚(如后门模型中预测集中于目标类)。

链接: https://arxiv.org/abs/2602.03586
作者: Tao Ren,Xiaoyu Luo,Qiongxiu Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Prior work on probing neural networks primarily relies on input-space analysis or parameter perturbation, both of which face fundamental limitations in accessing structural information encoded in intermediate representations. We introduce Activation Perturbation for EXploration (APEX), an inference-time probing paradigm that perturbs hidden activations while keeping both inputs and model parameters fixed. We theoretically show that activation perturbation induces a principled transition from sample-dependent to model-dependent behavior by suppressing input-specific signals and amplifying representation-level structure, and further establish that input perturbation corresponds to a constrained special case of this framework. Through representative case studies, we demonstrate the practical advantages of APEX. In the small-noise regime, APEX provides a lightweight and efficient measure of sample regularity that aligns with established metrics, while also distinguishing structured from randomly labeled models and revealing semantically coherent prediction transitions. In the large-noise regime, APEX exposes training-induced model-level biases, including a pronounced concentration of predictions on the target class in backdoored models. Overall, our results show that APEX offers an effective perspective for exploring, and understanding neural networks beyond what is accessible from input space alone.
zh

[AI-21] Dont believe everything you read: Understanding and Measuring MCP Behavior under Misleading Tool Descriptions

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)通过模型上下文协议(Model Context Protocol, MCP)调用外部工具时存在的描述-代码不一致问题,即工具的自然语言文档描述与其实际代码实现之间存在偏差,这种不一致性可能被恶意利用以执行未授权的特权操作、隐藏状态变更或非法金融行为。解决方案的关键在于设计并应用一种自动化静态分析框架,对跨36类共10,240个真实MCP服务器进行大规模检测,从而识别出约13%存在显著描述-代码不一致的实例,揭示了该问题在MCP生态中的普遍性和潜在攻击面,并推动未来AI代理生态系统中建立系统性审计机制与更强的透明度保障。

链接: https://arxiv.org/abs/2602.03580
作者: Zhihao Li,Boyang Ma,Xuelong Dai,Minghui Xu,Yue Zhang,Biwei Yan,Kun Li
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Model Context Protocol (MCP) enables large language models to invoke external tools through natural-language descriptions, forming the foundation of many AI agent applications. However, MCP does not enforce consistency between documented tool behavior and actual code execution, even though MCP Servers often run with broad system privileges. This gap introduces a largely unexplored security risk. We study how mismatches between externally presented tool descriptions and underlying implementations systematically shape the mental models and decision-making behavior of intelligent agents. Specifically, we present the first large-scale study of description-code inconsistency in the MCP ecosystem. We design an automated static analysis framework and apply it to 10,240 real-world MCP Servers across 36 categories. Our results show that while most servers are highly consistent, approximately 13% exhibit substantial mismatches that can enable undocumented privileged operations, hidden state mutations, or unauthorized financial actions. We further observe systematic differences across application categories, popularity levels, and MCP marketplaces. Our findings demonstrate that description-code inconsistency is a concrete and prevalent attack surface in MCP-based AI agents, and motivate the need for systematic auditing and stronger transparency guarantees in future agent ecosystems.
zh

[AI-22] EHRWorld: A Patient-Centric Medical World Model for Long-Horizon Clinical Trajectories

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在医疗领域中作为动态世界模型(World Models)时,难以维持患者状态一致性、导致长期临床模拟中误差累积的问题。其核心挑战在于LLMs仅依赖静态医学知识,缺乏对时间序列因果关系的建模能力。解决方案的关键在于提出EHRWorld——一个基于因果顺序范式的患者中心型医疗世界模型,并结合EHRWorld-110K这一大规模纵向临床数据集进行训练,该数据集源自真实电子健康记录(Electronic Health Records, EHR),从而实现对疾病进展和治疗效果的稳定、可解释的长期模拟。

链接: https://arxiv.org/abs/2602.03569
作者: Linjie Mu,Zhongzhen Huang,Yannian Gu,Shengqian Qin,Shaoting Zhang,Xiaofan Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:World models offer a principled framework for simulating future states under interventions, but realizing such models in complex, high-stakes domains like medicine remains challenging. Recent large language models (LLMs) have achieved strong performance on static medical reasoning tasks, raising the question of whether they can function as dynamic medical world models capable of simulating disease progression and treatment outcomes over time. In this work, we show that LLMs only incorporating medical knowledge struggle to maintain consistent patient states under sequential interventions, leading to error accumulation in long-horizon clinical simulation. To address this limitation, we introduce EHRWorld, a patient-centric medical world model trained under a causal sequential paradigm, together with EHRWorld-110K, a large-scale longitudinal clinical dataset derived from real-world electronic health records. Extensive evaluations demonstrate that EHRWorld significantly outperforms naive LLM-based baselines, achieving more stable long-horizon simulation, improved modeling of clinically sensitive events, and favorable reasoning efficiency, highlighting the necessity of training on causally grounded, temporally evolving clinical data for reliable and robust medical world modeling.
zh

[AI-23] EVE: Efficient Verification of Data Erasure through Customized Perturbation in Approximate Unlearning

【速读】:该论文旨在解决机器遗忘(machine unlearning)过程中验证机制的缺失问题,即如何在不参与模型初始训练的前提下,高效且准确地验证模型是否已成功移除指定数据的影响。现有基于后门(backdoor)的方法通常需要在训练阶段植入后门标记,这不仅效率低下且难以实际应用。论文提出的高效擦除验证方法(EVE)的关键在于设计一种对抗性扰动策略:通过构造特定扰动使目标样本在遗忘前后模型预测发生改变,从而将预测变化作为验证信号。该扰动生成被形式化为一个对抗优化问题,通过将遗忘梯度与目标样本边界变化梯度对齐来求解,实现了无需参与初始训练即可高精度、高效率地验证机器遗忘效果。

链接: https://arxiv.org/abs/2602.03567
作者: Weiqi Wang,Zhiyi Tian,Chenhan Zhang,Luoyu Chen,Shui Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Verifying whether the machine unlearning process has been properly executed is critical but remains underexplored. Some existing approaches propose unlearning verification methods based on backdooring techniques. However, these methods typically require participation in the model’s initial training phase to backdoor the model for later verification, which is inefficient and impractical. In this paper, we propose an efficient verification of erasure method (EVE) for verifying machine unlearning without requiring involvement in the model’s initial training process. The core idea is to perturb the unlearning data to ensure the model prediction of the specified samples will change before and after unlearning with perturbed data. The unlearning users can leverage the observation of the changes as a verification signal. Specifically, the perturbations are designed with two key objectives: ensuring the unlearning effect and altering the unlearned model’s prediction of target samples. We formalize the perturbation generation as an adversarial optimization problem, solving it by aligning the unlearning gradient with the gradient of boundary change for target samples. We conducted extensive experiments, and the results show that EVE can verify machine unlearning without involving the model’s initial training process, unlike backdoor-based methods. Moreover, EVE significantly outperforms state-of-the-art unlearning verification methods, offering significant speedup in efficiency while enhancing verification accuracy. The source code of EVE is released at \ulinethis https URL, providing a novel tool for verification of machine unlearning.
zh

[AI-24] Persona Generators: Generating Diverse Synthetic Personas at Scale

【速读】:该论文旨在解决在评估与人类交互的AI系统时,因难以获取代表性人类数据而导致的行为多样性不足问题,尤其针对新兴技术或未来场景下数据稀缺的困境。其核心挑战在于现有生成式代理建模方法多依赖详尽的目标人群数据,且侧重于密度匹配(density matching),即复现高概率行为,而忽视了对长尾行为(long-tail behaviors)的支持覆盖(support coverage)。解决方案的关键在于提出“Persona Generator”——一种可自动扩展小规模描述以生成多样化合成人群的轻量函数,并通过基于AlphaEvolve的迭代优化机制,利用大语言模型作为突变算子,在数百轮迭代中不断改进生成器代码,最终实现对关键多样性轴上意见与偏好的最大覆盖,从而显著优于基线方法,在六项多样性指标上表现更优,尤其能生成罕见特质组合的人群。

链接: https://arxiv.org/abs/2602.03545
作者: Davide Paglieri,Logan Cross,William A. Cunningham,Joel Z. Leibo,Alexander Sasha Vezhnevets
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Evaluating AI systems that interact with humans requires understanding their behavior across diverse user populations, but collecting representative human data is often expensive or infeasible, particularly for novel technologies or hypothetical future scenarios. Recent work in Generative Agent-Based Modeling has shown that large language models can simulate human-like synthetic personas with high fidelity, accurately reproducing the beliefs and behaviors of specific individuals. However, most approaches require detailed data about target populations and often prioritize density matching (replicating what is most probable) rather than support coverage (spanning what is possible), leaving long-tail behaviors underexplored. We introduce Persona Generators, functions that can produce diverse synthetic populations tailored to arbitrary contexts. We apply an iterative improvement loop based on AlphaEvolve, using large language models as mutation operators to refine our Persona Generator code over hundreds of iterations. The optimization process produces lightweight Persona Generators that can automatically expand small descriptions into populations of diverse synthetic personas that maximize coverage of opinions and preferences along relevant diversity axes. We demonstrate that evolved generators substantially outperform existing baselines across six diversity metrics on held-out contexts, producing populations that span rare trait combinations difficult to achieve in standard LLM outputs.
zh

[AI-25] Group Selection as a Safeguard Against AI Substitution

【速读】:该论文试图解决的问题是:生成式 AI(Generative AI)的广泛使用可能对人类文化演化产生长期负面影响,特别是导致文化变异减少、创新停滞以及累积文化进化放缓,进而可能引发“文化崩溃”(cultural collapse)。其核心解决方案在于区分两种AI使用策略——“互补型”(AI-complement)与“替代型”(AI-substitute),并通过基于代理的模型和演化博弈论分析表明:尽管替代型用户在个体选择压力下更易占据优势,但互补型用户因能维持群体层面的文化多样性以支持探索与适应,从而在强群体边界条件下通过文化群体选择获得优势。因此,关键在于设计政策与组织策略,引导用户采用互补型使用模式,以保障文化演化的可持续性与多样性。

链接: https://arxiv.org/abs/2602.03541
作者: Qiankun Zhong,Thomas F. Eisenmann,Julian Garcia,Iyad Rahwan
机构: 未知
类目: Artificial Intelligence (cs.AI); Theoretical Economics (econ.TH)
备注: 19 pages, 7 Figures

点击查看摘要

Abstract:Reliance on generative AI can reduce cultural variance and diversity, especially in creative work. This reduction in variance has already led to problems in model performance, including model collapse and hallucination. In this paper, we examine the long-term consequences of AI use for human cultural evolution and the conditions under which widespread AI use may lead to “cultural collapse”, a process in which reliance on AI-generated content reduces human variation and innovation and slows cumulative cultural evolution. Using an agent-based model and evolutionary game theory, we compare two types of AI use: complement and substitute. AI-complement users seek suggestions and guidance while remaining the main producers of the final output, whereas AI-substitute users provide minimal input, and rely on AI to produce most of the output. We then study how these use strategies compete and spread under evolutionary dynamics. We find that AI-substitute users prevail under individual-level selection despite the stronger reduction in cultural variance. By contrast, AI-complement users can benefit their groups by maintaining the variance needed for exploration, and can therefore be favored under cultural group selection when group boundaries are strong. Overall, our findings shed light on the long-term, population-level effects of AI adoption and inform policy and organizational strategies to mitigate these risks.
zh

[AI-26] Morphe: High-Fidelity Generative Video Streaming with Vision Foundation Model

【速读】:该论文旨在解决视频流媒体在带宽受限或网络条件较差场景下难以保障质量的问题,现有方案中传统像素编码(pixel-codec)压缩率接近极限,而新兴的神经增强或生成式流媒体则因延迟高和视觉保真度不足难以实用化。其解决方案的关键在于提出首个基于视觉基础模型(Vision Foundation Model, VFM)的端到端生成式视频流媒体范式——Morphe,通过联合训练视觉分词器(visual tokenizer)与受模拟网络约束驱动的可变分辨率时空优化策略,在保证高视觉保真度的同时显著提升压缩效率;同时构建了具备智能丢包机制的鲁棒流媒体系统,以应对真实网络扰动,从而实现低延迟、高抗损性的实时视频传输。

链接: https://arxiv.org/abs/2602.03529
作者: Tianyi Gong,Zijian Cao,Zixing Zhang,Jiangkai Wu,Xinggong Zhang,Shuguang Cui,Fangxin Wang
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Accepted by NSDI 2026 Fall

点击查看摘要

Abstract:Video streaming is a fundamental Internet service, while the quality still cannot be guaranteed especially in poor network conditions such as bandwidth-constrained and remote areas. Existing works mainly work towards two directions: traditional pixel-codec streaming nearly approaches its limit and is hard to step further in compression; the emerging neural-enhanced or generative streaming usually fall short in latency and visual fidelity, hindering their practical deployment. Inspired by the recent success of vision foundation model (VFM), we strive to harness the powerful video understanding and processing capacities of VFM to achieve generalization, high fidelity and loss resilience for real-time video streaming with even higher compression rate. We present the first revolutionized paradigm that enables VFM-based end-to-end generative video streaming towards this goal. Specifically, Morphe employs joint training of visual tokenizers and variable-resolution spatiotemporal optimization under simulated network constraints. Additionally, a robust streaming system is constructed that leverages intelligent packet dropping to resist real-world network perturbations. Extensive evaluation demonstrates that Morphe achieves comparable visual quality while saving 62.5% bandwidth compared to H.265, and accomplishes real-time, loss-resilient video delivery in challenging network environments, representing a milestone in VFM-enabled multimedia streaming solutions.
zh

[AI-27] D3PIA: A Discrete Denoising Diffusion Model for Piano Accompaniment Generation From Lead sheet ICASSP

【速读】:该论文旨在解决符号音乐领域中钢琴伴奏生成的挑战性问题,即如何从给定的旋律和和弦约束(如乐谱中的旋律线)中生成完整的钢琴伴奏。其解决方案的关键在于提出了一种基于离散扩散机制的模型 D3PIA,该模型通过钢琴滚筒表示(piano-roll representation)实现旋律与伴奏之间的局部对齐,并引入邻域注意力机制(Neighborhood Attention, NA),分别用于编码旋律和条件化预测伴奏音符状态,从而增强局部上下文建模能力,提升伴奏的和声忠实度与音乐连贯性。

链接: https://arxiv.org/abs/2602.03523
作者: Eunjin Choi,Hounsu Kim,Hayeon Bang,Taegyun Kwon,Juhan Nam
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Accepted at 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

点击查看摘要

Abstract:Generating piano accompaniments in the symbolic music domain is a challenging task that requires producing a complete piece of piano music from given melody and chord constraints, such as those provided by a lead sheet. In this paper, we propose a discrete diffusion-based piano accompaniment generation model, D3PIA, leveraging local alignment between lead sheet and accompaniment in piano-roll representation. D3PIA incorporates Neighborhood Attention (NA) to both encode the lead sheet and condition it for predicting note states in the piano accompaniment. This design enhances local contextual modeling by efficiently attending to nearby melody and chord conditions. We evaluate our model using the POP909 dataset, a widely used benchmark for piano accompaniment generation. Objective evaluation results demonstrate that D3PIA preserves chord conditions more faithfully compared to continuous diffusion-based and Transformer-based baselines. Furthermore, a subjective listening test indicates that D3PIA generates more musically coherent accompaniments than the comparison models.
zh

[AI-28] Live or Lie: Action-Aware Capsule Multiple Instance Learning for Risk Assessment in Live Streaming Platforms

【速读】:该论文旨在解决直播场景中因多方参与者协同进行隐蔽恶意行为而导致的风险评估难题,此类行为通常混杂在正常互动中,难以被及时准确识别。针对这一问题,作者提出了一种基于弱监督的多实例学习(Multiple Instance Learning, MIL)框架——AC-MIL,其关键在于将每个直播间视为一个“包”(bag),并定义结构化的用户-时间段胶囊(user-timeslot capsules)作为实例,以捕捉局部行为模式;通过串行与并行结合的架构,同时建模个体行为特征与群体协作模式,从而实现多层次语义信息的融合与时间动态演化建模,最终提升房间级风险预测的准确性,并提供可解释的行为片段证据用于干预决策。

链接: https://arxiv.org/abs/2602.03520
作者: Yiran Qiao,Jing Chen,Xiang Ao,Qiwei Zhong,Yang Liu,Qing He
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Live streaming has become a cornerstone of today’s internet, enabling massive real-time social interactions. However, it faces severe risks arising from sparse, coordinated malicious behaviors among multiple participants, which are often concealed within normal activities and challenging to detect timely and accurately. In this work, we provide a pioneering study on risk assessment in live streaming rooms, characterized by weak supervision where only room-level labels are available. We formulate the task as a Multiple Instance Learning (MIL) problem, treating each room as a bag and defining structured user-timeslot capsules as instances. These capsules represent subsequences of user actions within specific time windows, encapsulating localized behavioral patterns. Based on this formulation, we propose AC-MIL, an Action-aware Capsule MIL framework that models both individual behaviors and group-level coordination patterns. AC-MIL captures multi-granular semantics and behavioral cues through a serial and parallel architecture that jointly encodes temporal dynamics and cross-user dependencies. These signals are integrated for robust room-level risk prediction, while also offering interpretable evidence at the behavior segment level. Extensive experiments on large-scale industrial datasets from Douyin demonstrate that AC-MIL significantly outperforms MIL and sequential baselines, establishing new state-of-the-art performance in room-level risk assessment for live streaming. Moreover, AC-MIL provides capsule-level interpretability, enabling identification of risky behavior segments as actionable evidence for intervention. The project page is available at: this https URL.
zh

[AI-29] Not All Negative Samples Are Equal: LLM s Learn Better from Plausible Reasoning

【速读】:该论文旨在解决现有方法在利用负样本(negative samples)提升大语言模型(Large Language Model, LLM)推理能力时,将所有错误回答视为同等信息量的问题,忽略了负样本质量对训练效果的关键影响。其解决方案的核心是提出可实现的负样本(Plausible Negative Samples, PNS),通过反向强化学习(reverse reinforcement learning, RL)训练一个专用模型,该模型在复合奖励机制(包括格式合规性、准确性反转、奖励模型评估和思维链(chain-of-thought)评价)引导下生成结构合理但最终答案错误的响应,从而构造出高质量、近似于正确解的负样本。PNS作为即插即用的数据源,在多个骨干模型和数学推理基准上验证了其有效性,显著优于其他负样本合成方法。

链接: https://arxiv.org/abs/2602.03516
作者: Zixiang Di,Jinyi Han,Shuo Zhang,Ying Liao,Zhi Li,Xiaofeng Ji,Yongqi Wang,Zheming Yang,Ming Gao,Bingdong Li,Jie Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Learning from negative samples holds great promise for improving Large Language Model (LLM) reasoning capability, yet existing methods treat all incorrect responses as equally informative, overlooking the crucial role of sample quality. To address this, we propose Plausible Negative Samples (PNS), a method that synthesizes high-quality negative samples exhibiting expected format and structural coherence while ultimately yielding incorrect answers. PNS trains a dedicated model via reverse reinforcement learning (RL) guided by a composite reward combining format compliance, accuracy inversion, reward model assessment, and chain-of-thought evaluation, generating responses nearly indistinguishable from correct solutions. We further validate PNS as a plug-and-play data source for preference optimization across three backbone models on seven mathematical reasoning benchmarks. Results demonstrate that PNS consistently outperforms other negative sample synthesis methods, achieving an average improvement of 2.03% over RL-trained models.
zh

[AI-30] Mitigating Staleness in Asynchronous Pipeline Parallelism via Basis Rotation

【速读】:该论文旨在解决异步流水线并行(asynchronous pipeline parallelism)训练中因梯度延迟(gradient staleness)导致的优化效率下降问题,尤其是当流水线深度增加时,梯度延迟呈线性增长,严重削弱了该方法本应带来的可扩展性优势。其关键解决方案是通过基底旋转(basis rotation)对延迟梯度进行修正,从而缓解因海森矩阵(Hessian)特征基与标准坐标基不对齐所引发的优化震荡问题,使自适应优化器(如Adam)能够有效利用曲率信息,显著提升异步训练的收敛速度和稳定性。实验证明,使用基底旋转后,在10亿参数大语言模型(LLM)训练中,达到相同损失所需的迭代次数减少至基线的23.2%。

链接: https://arxiv.org/abs/2602.03515
作者: Hyunji Jung,Sungbin Shin,Namhoon Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: Preprint. Under review

点击查看摘要

Abstract:Asynchronous pipeline parallelism maximizes hardware utilization by eliminating the pipeline bubbles inherent in synchronous execution, offering a path toward efficient large-scale distributed training. However, this efficiency gain can be compromised by gradient staleness, where the immediate model updates with delayed gradients introduce noise into the optimization process. Crucially, we identify a critical, yet often overlooked, pathology: this delay scales linearly with pipeline depth, fundamentally undermining the very scalability that the method originally intends to provide. In this work, we investigate this inconsistency and bridge the gap by rectifying delayed gradients through basis rotation, restoring scalable asynchronous training while maintaining performance. Specifically, we observe that the deleterious effects of delayed gradients are exacerbated when the Hessian eigenbasis is misaligned with the standard coordinate basis. We demonstrate that this misalignment prevents coordinate-wise adaptive schemes, such as Adam, from effectively leveraging curvature-aware adaptivity. This failure leads to significant oscillations in the optimization trajectory and, consequently, slower convergence. We substantiate these findings through both rigorous theoretical analysis and empirical evaluation. To address this challenge, we propose the use of basis rotation, demonstrating that it effectively mitigates the alignment issue and significantly accelerates convergence in asynchronous settings. For example, our training of a 1B-parameter LLM with basis rotation achieves the same training loss in 76.8% fewer iterations compared to the best-performing asynchronous pipeline parallel training baseline.
zh

[AI-31] CMR: Contractive Mapping Embeddings for Robust Humanoid Locomotion on Unstructured Terrains

【速读】:该论文旨在解决人形机器人在非结构化地形上运动时面临的鲁棒性挑战,尤其是由于感知信息不可靠和模型失配导致的干扰抑制能力不足问题。其解决方案的关键在于提出了一种名为“Contractive Mapping for Robustness (CMR)”的框架,该框架通过将高维、易受干扰的观测映射到一个潜在空间,在该空间中局部扰动随时间被衰减;具体而言,CMR结合对比表示学习与Lipschitz正则化,既保留任务相关几何结构,又显式控制敏感性,可作为辅助损失项嵌入现代深度强化学习流程,无需额外复杂技术实现即可显著提升系统鲁棒性。

链接: https://arxiv.org/abs/2602.03511
作者: Qixin Zeng,Hongyin Zhang,Shangke Lyu,Junxi Jin,Donglin Wang,Chao Huang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Robust disturbance rejection remains a longstanding challenge in humanoid locomotion, particularly on unstructured terrains where sensing is unreliable and model mismatch is pronounced. While perception information, such as height map, enhances terrain awareness, sensor noise and sim-to-real gaps can destabilize policies in practice. In this work, we provide theoretical analysis that bounds the return gap under observation noise, when the induced latent dynamics are contractive. Furthermore, we present Contractive Mapping for Robustness (CMR) framework that maps high-dimensional, disturbance-prone observations into a latent space, where local perturbations are attenuated over time. Specifically, this approach couples contrastive representation learning with Lipschitz regularization to preserve task-relevant geometry while explicitly controlling sensitivity. Notably, the formulation can be incorporated into modern deep reinforcement learning pipelines as an auxiliary loss term with minimal additional technical effort required. Further, our extensive humanoid experiments show that CMR potently outperforms other locomotion algorithms under increased noise.
zh

[AI-32] Explaining the Explainer: Understanding the Inner Workings of Transformer-based Symbolic Regression Models

【速读】:该论文旨在解决生成式 AI(Generative AI)在符号回归(Symbolic Regression, SR)任务中,其内部工作机制尤其是数学运算符生成机制尚不明确的问题。现有方法如直接logit归因和探针分类器主要捕捉相关性特征,缺乏对因果机制的识别能力,限制了对SR模型可解释性的深入理解。解决方案的关键在于提出PATCHES算法——一种基于进化的电路发现算法,能够系统性地识别出紧凑且功能正确的计算电路,并通过基于忠实性(faithfulness)、完备性(completeness)和最小性(minimality)的因果评估框架验证其有效性。研究表明,基于性能的均值插补(mean patching)最能可靠地隔离出具有因果意义的功能电路,从而首次实现了对SR Transformer的电路级表征,为机制可解释性在符号回归领域的应用提供了理论基础与实践路径。

链接: https://arxiv.org/abs/2602.03506
作者: Arco van Breda,Erman Acar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures

点击查看摘要

Abstract:Following their success across many domains, transformers have also proven effective for symbolic regression (SR); however, the internal mechanisms underlying their generation of mathematical operators remain largely unexplored. Although mechanistic interpretability has successfully identified circuits in language and vision models, it has not yet been applied to SR. In this article, we introduce PATCHES, an evolutionary circuit discovery algorithm that identifies compact and correct circuits for SR. Using PATCHES, we isolate 28 circuits, providing the first circuit-level characterisation of an SR transformer. We validate these findings through a robust causal evaluation framework based on key notions such as faithfulness, completeness, and minimality. Our analysis shows that mean patching with performance-based evaluation most reliably isolates functionally correct circuits. In contrast, we demonstrate that direct logit attribution and probing classifiers primarily capture correlational features rather than causal ones, limiting their utility for circuit discovery. Overall, these results establish SR as a high-potential application domain for mechanistic interpretability and propose a principled methodology for circuit discovery.
zh

[AI-33] Generative Decompression: Optimal Lossy Decoding Against Distribution Mismatch

【速读】:该论文旨在解决失真压缩中的分布不匹配问题(mismatched quantization problem),即编码器设计时假设的源分布与实际源分布不一致时,如何优化解码策略以提升重建性能。其解决方案的关键在于提出生成式解压缩(generative decompression)策略:该策略在解码端利用已知的真分布信息,通过计算给定量化索引下的条件期望(即贝叶斯估计),并适配固定编码器约束,实现对重建规则的生成式贝叶斯修正,从而严格优于传统的中心点规则(centroid rule)。此方法在噪声信道传输和任务导向解码场景中进一步推广,分别导出鲁棒软解码规则和最大后验概率(MAP)检测策略,实验表明其能显著缩小与理想联合优化基准之间的性能差距,无需修改编码器即可实现高保真自适应重建。

链接: https://arxiv.org/abs/2602.03505
作者: Saeed R. Khosravirad,Ahmed Alkhateeb,Ingrid van de Voorde
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper addresses optimal decoding strategies in lossy compression where the assumed distribution for compressor design mismatches the actual (true) distribution of the source. This problem has immediate relevance in standardized communication systems where the decoder acquires side information or priors about the true distribution that are unavailable to the fixed encoder. We formally define the mismatched quantization problem, demonstrating that the optimal reconstruction rule, termed generative decompression, aligns with classical Bayesian estimation by taking the conditional expectation under the true distribution given the quantization indices and adapting it to fixed-encoder constraints. This strategy effectively performs a generative Bayesian correction on the decoder side, strictly outperforming the conventional centroid rule. We extend this framework to transmission over noisy channels, deriving a robust soft-decoding rule that quantifies the inefficiency of standard modular source–channel separation architectures under mismatch. Furthermore, we generalize the approach to task-oriented decoding, showing that the optimal strategy shifts from conditional mean estimation to maximum a posteriori (MAP) detection. Experimental results on Gaussian sources and deep-learning-based semantic classification demonstrate that generative decompression closes a vast majority of the performance gap to the ideal joint-optimization benchmark, enabling adaptive, high-fidelity reconstruction without modifying the encoder.
zh

[AI-34] Reparameterization Flow Policy Optimization

【速读】:该论文旨在解决基于模型的强化学习(Model-Based Reinforcement Learning, MBRL)中策略梯度方法在使用非高斯策略时面临的效率与稳定性问题,特别是如何有效结合生成式建模技术以提升样本效率和探索能力。传统重参数化策略梯度(Reparameterization Policy Gradient, RPG)方法受限于仅能处理高斯策略,难以利用流形建模(如流形变换)等先进生成模型的优势。论文提出了一种新的优化框架——重参数化流策略优化(Reparameterization Flow Policy Optimization, RFO),其关键在于将流策略(Flow Policy)与RPG框架自然融合:通过联合反向传播梯度穿过流生成过程和系统动力学,实现无需计算不可行对数似然(intractable log-likelihood)的高效梯度估计;同时引入两种定制化的正则化项以增强训练稳定性和探索能力,显著提升了复杂任务中的性能表现,尤其在软体四足机器人控制等挑战性场景中实现了比当前最优基线接近2倍的奖励提升。

链接: https://arxiv.org/abs/2602.03501
作者: Hai Zhong,Zhuoran Li,Xun Wang,Longbo Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reparameterization Policy Gradient (RPG) has emerged as a powerful paradigm for model-based reinforcement learning, enabling high sample efficiency by backpropagating gradients through differentiable dynamics. However, prior RPG approaches have been predominantly restricted to Gaussian policies, limiting their performance and failing to leverage recent advances in generative models. In this work, we identify that flow policies, which generate actions via differentiable ODE integration, naturally align with the RPG framework, a connection not established in prior work. However, naively exploiting this synergy proves ineffective, often suffering from training instability and a lack of exploration. We propose Reparameterization Flow Policy Optimization (RFO). RFO computes policy gradients by backpropagating jointly through the flow generation process and system dynamics, unlocking high sample efficiency without requiring intractable log-likelihood calculations. RFO includes two tailored regularization terms for stability and exploration. We also propose a variant of RFO with action chunking. Extensive experiments on diverse locomotion and manipulation tasks, involving both rigid and soft bodies with state or visual inputs, demonstrate the effectiveness of RFO. Notably, on a challenging locomotion task controlling a soft-body quadruped, RFO achieves almost 2\times the reward of the state-of-the-art baseline.
zh

[AI-35] DeepDFA: Injecting Temporal Logic in Deep Learning for Sequential Subsymbolic Applications

【速读】:该论文旨在解决如何将逻辑知识有效整合到深度神经网络训练中的难题,尤其是在涉及子符号观测的序列或时序扩展领域。其核心挑战在于传统深度学习模型难以显式编码和利用结构化的时序规则。解决方案的关键在于提出DeepDFA这一神经符号框架,该框架将高层次的时序逻辑(以确定性有限自动机(Deterministic Finite Automata, DFA)或Moore机形式表达)建模为连续且可微分的层,从而实现符号知识向子符号域的注入与融合。通过这种机制,DeepDFA能够在静态图像序列分类和非马尔可夫交互环境中进行策略学习等任务中显著提升性能,优于主流深度学习模型(如LSTM、GRU、Transformer)及新兴神经符号系统,展现出在时序任务中连接子符号学习与符号推理的巨大潜力。

链接: https://arxiv.org/abs/2602.03486
作者: Elena Umili,Francesco Argenziano,Roberto Capobianco
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Integrating logical knowledge into deep neural network training is still a hard challenge, especially for sequential or temporally extended domains involving subsymbolic observations. To address this problem, we propose DeepDFA, a neurosymbolic framework that integrates high-level temporal logic - expressed as Deterministic Finite Automata (DFA) or Moore Machines - into neural architectures. DeepDFA models temporal rules as continuous, differentiable layers, enabling symbolic knowledge injection into subsymbolic domains. We demonstrate how DeepDFA can be used in two key settings: (i) static image sequence classification, and (ii) policy learning in interactive non-Markovian environments. Across extensive experiments, DeepDFA outperforms traditional deep learning models (e.g., LSTMs, GRUs, Transformers) and novel neuro-symbolic systems, achieving state-of-the-art results in temporal knowledge integration. These results highlight the potential of DeepDFA to bridge subsymbolic learning and symbolic reasoning in sequential tasks.
zh

[AI-36] When Routing Collapses: On the Degenerate Convergence of LLM Routers

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)路由中普遍存在的“路由坍缩”(routing collapse)问题,即在预算增加时,现有路由器倾向于始终选择最强大且最昂贵的模型,而忽视性能已足够的小型模型,导致计算资源和成本浪费,违背了动态路由的核心目标。解决方案的关键在于提出一种决策感知的路由方法 EquiRouter,其通过直接学习模型间的相对排序而非预测标量性能分数,从而缓解因预测误差引发的离散决策偏差,有效恢复小模型的作用并显著降低推理成本。

链接: https://arxiv.org/abs/2602.03478
作者: Guannan Lai,Han-Jia Ye
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM routing aims to achieve a favorable quality–cost trade-off by dynamically assigning easy queries to smaller models and harder queries to stronger ones. However, across both unimodal and multimodal settings, we uncover a pervasive yet underexplored failure mode in existing routers: as the user’s cost budget increases, routers systematically default to the most capable and most expensive model even when cheaper models already suffice. As a result, current routers under-utilize small models, wasting computation and monetary cost and undermining the core promise of routing; we term this phenomenon routing collapse. We attribute routing collapse to an objective–decision mismatch: many routers are trained to predict scalar performance scores, whereas routing decisions ultimately depend on discrete comparisons among candidate models. Consequently, small prediction errors can flip relative orderings and trigger suboptimal selections. To bridge this gap, we propose EquiRouter, a decision-aware router that directly learns model rankings, restoring the role of smaller models and mitigating routing collapse. On RouterBench, EquiRouter reduces cost by about 17% at GPT-4-level performance compared to the strongest prior router. Our code is available at this https URL.
zh

[AI-37] ScDiVa: Masked Discrete Diffusion for Joint Modeling of Single-Cell Identity and Expression

【速读】:该论文旨在解决单细胞RNA测序(single-cell RNA-seq)数据在生成建模中因高维性、稀疏性和无序性导致的自回归方法存在人工排序偏差和误差累积的问题。其核心解决方案是提出scDiVa,一种基于掩码离散扩散(masked discrete diffusion)的基础模型,通过在token空间定义连续时间前向掩码机制来模拟dropout-like的噪声过程,并引入双向去噪器联合建模基因身份(离散)与表达值(连续),结合熵归一化序列化和潜在锚定token以提升信息效率并保留细胞全局身份;训练时采用深度不变的时间采样和双重去噪目标,从而在不同稀疏水平下实现对细胞类型和表达量的精确恢复。

链接: https://arxiv.org/abs/2602.03477
作者: Mingxuan Wang,Cheng Chen,Gaoyang Jiang,Zijia Ren,Chuangxin Zhao,Lu Shi,Yanbiao Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
备注: 19 pages, 11 figures

点击查看摘要

Abstract:Single-cell RNA-seq profiles are high-dimensional, sparse, and unordered, causing autoregressive generation to impose an artificial ordering bias and suffer from error accumulation. To address this, we propose scDiVa, a masked discrete diffusion foundation model that aligns generation with the dropout-like corruption process by defining a continuous-time forward masking mechanism in token space. ScDiVa features a bidirectional denoiser that jointly models discrete gene identities and continuous values, utilizing entropy-normalized serialization and a latent anchor token to maximize information efficiency and preserve global cell identity. The model is trained via depth-invariant time sampling and a dual denoising objective to simulate varying sparsity levels while ensuring precise recovery of both identity and magnitude. Pre-trained on 59 million cells, scDiVa achieves strong transfer performance across major benchmarks, including batch integration, cell type annotation, and perturbation response prediction. These results suggest that masked discrete diffusion serves as a biologically coherent and effective alternative to autoregression.
zh

[AI-38] IntentRL: Training Proactive User-intent Agents for Open-ended Deep Research via Reinforcement Learning

【速读】:该论文旨在解决深度研究(Deep Research, DR)代理在面对模糊用户查询时因高自主性导致的执行时间过长且结果不理想的问题,即“自主性-交互性困境”。其核心解决方案是提出IntentRL框架,通过训练代理在启动长周期研究前主动澄清潜在用户意图,从而提升任务效率与准确性。关键创新在于:1)构建从少量种子样本扩展为高质量对话轮次的可扩展流水线,利用浅层到深层的意图细化图缓解开放式研究数据稀缺问题;2)采用两阶段强化学习策略——第一阶段基于离线对话学习通用交互行为,第二阶段结合用户模拟器进行在线滚动优化,增强对多样化用户反馈的适应能力。

链接: https://arxiv.org/abs/2602.03468
作者: Haohao Luo,Zexi Li,Yuexiang Xie,Wenhao Zhang,Yaliang Li,Ying Shen
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint

点击查看摘要

Abstract:Deep Research (DR) agents extend Large Language Models (LLMs) beyond parametric knowledge by autonomously retrieving and synthesizing evidence from large web corpora into long-form reports, enabling a long-horizon agentic paradigm. However, unlike real-time conversational assistants, DR is computationally expensive and time-consuming, creating an autonomy-interaction dilemma: high autonomy on ambiguous user queries often leads to prolonged execution with unsatisfactory outcomes. To address this, we propose IntentRL, a framework that trains proactive agents to clarify latent user intents before starting long-horizon research. To overcome the scarcity of open-ended research data, we introduce a scalable pipeline that expands a few seed samples into high-quality dialogue turns via a shallow-to-deep intent refinement graph. We further adopt a two-stage reinforcement learning (RL) strategy: Stage I applies RL on offline dialogues to efficiently learn general user-interaction behavior, while Stage II uses the trained agent and a user simulator for online rollouts to strengthen adaptation to diverse user feedback. Extensive experiments show that IntentRL significantly improves both intent hit rate and downstream task performance, outperforming the built-in clarify modules of closed-source DR agents and proactive LLM baselines.
zh

[AI-39] he Dual Role of Abstracting over the Irrelevant in Symbolic Explanations: Cognitive Effort vs. Understanding

【速读】:该论文旨在解决当前人工智能系统输出难以理解的问题,尤其是在符号AI(Symbolic AI)虽具可解释性基础但其原始逻辑轨迹常导致高认知负荷的情况下。解决方案的关键在于引入形式化抽象策略——具体包括移除(removal)和聚类(clustering)无关细节,以生成更易理解的简化解释。研究基于答案集编程(Answer Set Programming, ASP)定义了可抽象的无关信息,并通过认知实验验证:聚类能显著提升人类对刺激分类的理解能力,而移除则显著降低认知努力,从而证明抽象是优化面向人类的符号解释的核心机制。

链接: https://arxiv.org/abs/2602.03467
作者: Zeynep G. Saribatur,Johannes Langer,Ute Schmid
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 8 pages, 5 figures

点击查看摘要

Abstract:Explanations are central to human cognition, yet AI systems often produce outputs that are difficult to understand. While symbolic AI offers a transparent foundation for interpretability, raw logical traces often impose a high extraneous cognitive load. We investigate how formal abstractions, specifically removal and clustering, impact human reasoning performance and cognitive effort. Utilizing Answer Set Programming (ASP) as a formal framework, we define a notion of irrelevant details to be abstracted over to obtain simplified explanations. Our cognitive experiments, in which participants classified stimuli across domains with explanations derived from an answer set program, show that clustering details significantly improve participants’ understanding, while removal of details significantly reduce cognitive effort, supporting the hypothesis that abstraction enhances human-centered symbolic explanations.
zh

[AI-40] Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing

【速读】:该论文旨在解决强化学习中基于可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)训练大语言模型时,因提示(prompt)选择策略不合理导致的优化方向不稳定和迁移能力弱的问题。现有方法通常仅依据训练准确率方差进行提示筛选,忽略了提示对成功与失败信号的区分能力,从而影响学习效率与稳定性。解决方案的关键在于提出“正–负配对”机制(positive–negative pairing),即在每次更新时采样一个难但可解的提示 $ q^+ $(低成功率但可改进)和一个易但脆弱的提示 $ q^- $(高成功率但非完美),并引入加权GRPO(Weighted GRPO)算法:通过成对重加权二值反馈,并采用组归一化优势(group-normalized advantages),将 $ q^+ $ 中罕见的成功转化为强烈的正向引导,同时将 $ q^- $ 中的罕见失败转化为显著的负向惩罚,形成双向信息丰富的学习信号,从而提升样本效率且不抑制探索。

链接: https://arxiv.org/abs/2602.03452
作者: Xin Sheng,Jiaxin Li,Yujuan Pang,Ran Peng,Yong Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) is effective for training large language models on deterministic outcome reasoning tasks. Prior work shows RLVR works with few prompts, but prompt selection is often based only on training-accuracy variance, leading to unstable optimization directions and weaker transfer. We revisit prompt selection from a mechanism-level view and argue that an effective minibatch should provide both (i) a reliable positive anchor and (ii) explicit negative learning signals from rare failures. Based on this principle, we propose \emphpositive–negative pairing: at each update, we sample a hard-but-solvable q^+ and an easy-but-brittle prompt q^- (high success rate but not perfect), characterized by low and high empirical success rates under multiple rollouts. We further introduce Weighted GRPO, which reweights binary outcomes at the pair level and uses group-normalized advantages to amplify rare successes on q^+ into sharp positive guidance while turning rare failures on q^- into strong negative penalties. This bidirectional signal provides informative learning feedback for both successes and failures, improving sample efficiency without suppressing exploration. On Qwen2.5-Math-7B, a single paired minibatch per update consistently outperforms a GRPO baseline that selects two prompts via commonly used variance-based selection heuristics: AIME~2025 Pass@8 improves from 16.8 to 22.2, and AMC23 Pass@64 from 94.0 to 97.0, while remaining competitive with large-scale RLVR trained from a pool of 1209 training prompts. Similar gains are observed on Qwen2.5-Math-7B-Instruct.
zh

[AI-41] CRL-VLA: Continual Vision-Language-Action Learning

【速读】:该论文旨在解决持续强化学习(Continual Reinforcement Learning, CRL)中稳定性(保留旧技能)与可塑性(学习新技能)之间的权衡难题,尤其是在视觉-语言-动作(Vision-Language-Action, VLA)模型的持续后训练场景下。解决方案的关键在于提出CRL-VLA框架,其核心是通过不对称调控机制实现:对先前任务的优势值幅度进行约束以保障稳定性,同时允许新任务的优势值适度增长以促进适应性。这一机制由一种新颖的双评论家架构和目标条件价值函数(Goal-Conditioned Value Formulation, GCVF)实现——其中冻结的评论家维持语义一致性,而可训练的估计器驱动策略更新,从而在理论上有界地协调稳定性和可塑性,实验证明该方法在LIBERO基准上显著优于现有基线。

链接: https://arxiv.org/abs/2602.03445
作者: Qixin Zeng,Shuo Zhang,Hongyin Zhang,Renjie Wang,Han Zhao,Libang Zhao,Runze Li,Donglin Wang,Chao Huang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Lifelong learning is critical for embodied agents in open-world environments, where reinforcement learning fine-tuning has emerged as an important paradigm to enable Vision-Language-Action (VLA) models to master dexterous manipulation through environmental interaction. Thus, Continual Reinforcement Learning (CRL) is a promising pathway for deploying VLA models in lifelong robotic scenarios, yet balancing stability (retaining old skills) and plasticity (learning new ones) remains a formidable challenge for existing methods. We introduce CRL-VLA, a framework for continual post-training of VLA models with rigorous theoretical bounds. We derive a unified performance bound linking the stability-plasticity trade-off to goal-conditioned advantage magnitude, scaled by policy divergence. CRL-VLA resolves this dilemma via asymmetric regulation: constraining advantage magnitudes on prior tasks while enabling controlled growth on new tasks. This is realized through a simple but effective dual-critic architecture with novel Goal-Conditioned Value Formulation (GCVF), where a frozen critic anchors semantic consistency and a trainable estimator drives adaptation. Experiments on the LIBERO benchmark demonstrate that CRL-VLA effectively harmonizes these conflicting objectives, outperforming baselines in both anti-forgetting and forward adaptation.
zh

[AI-42] Ontology-to-tools compilation for executable semantic constraint enforcement in LLM agents

【速读】:该论文旨在解决生成式 AI(Generative AI)在知识生成过程中缺乏结构化约束的问题,即如何将形式化领域知识(如本体)有效嵌入到大语言模型(LLM)中,以确保生成内容符合语义规范,而非依赖事后验证。解决方案的关键在于提出“本体到工具的编译”机制,将领域本体(ontology)自动转化为可执行的工具接口,使基于 LLM 的智能体(agent)在构建和修改知识图谱实例时必须调用这些工具,从而在生成阶段就强制实施语义约束。该方法依托于 The World Avatar(TWA)中的 Model Context Protocol(MCP)及配套代理框架,实现了生成模型、符号约束与外部资源之间的结构化交互,显著减少了手动模式设计和提示工程需求。

链接: https://arxiv.org/abs/2602.03439
作者: Xiaochi Zhou,Patrick Bulter,Changxuan Yang,Simon D. Rihm,Thitikarn Angkanaporn,Jethro Akroyd,Sebastian Mosbach,Markus Kraft
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:We introduce ontology-to-tools compilation as a proof-of-principle mechanism for coupling large language models (LLMs) with formal domain knowledge. Within The World Avatar (TWA), ontological specifications are compiled into executable tool interfaces that LLM-based agents must use to create and modify knowledge graph instances, enforcing semantic constraints during generation rather than through post-hoc validation. Extending TWA’s semantic agent composition framework, the Model Context Protocol (MCP) and associated agents are integral components of the knowledge graph ecosystem, enabling structured interaction between generative models, symbolic constraints, and external resources. An agent-based workflow translates ontologies into ontology-aware tools and iteratively applies them to extract, validate, and repair structured knowledge from unstructured scientific text. Using metal-organic polyhedra synthesis literature as an illustrative case, we show how executable ontological semantics can guide LLM behaviour and reduce manual schema and prompt engineering, establishing a general paradigm for embedding formal knowledge into generative systems.
zh

[AI-43] Feasible strategies for conflict resolution within intuitionistic fuzzy preference-based conflict situations

【速读】:该论文旨在解决传统三支冲突分析模型在刻画代理人对议题对态度时粒度不足的问题,因其仅依赖偏好、逆向和无差异三种定性关系,难以充分捕捉冲突的本质。解决方案的关键在于引入直觉模糊偏好型冲突情境(intuitionistic fuzzy preference-based conflict situation),通过融合直觉模糊集(Intuitionistic Fuzzy Set, IFS)的隶属度与非隶属度信息,实现对代理人态度更精细的刻画;在此基础上构建了相应的三支冲突分析模型,并设计基于冲突函数的相对损失函数以确定阈值,最终提出兼顾调整幅度与冲突程度的可行策略调整机制及算法,从而提升冲突分析的精度与实用性。

链接: https://arxiv.org/abs/2602.03403
作者: Guangming Lang,Mingchuan Shang,Mengjun Hu,Jie Zhou,Feng Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In three-way conflict analysis, preference-based conflict situations characterize agents’ attitudes towards issues by formally modeling their preferences over pairs of issues. However, existing preference-based conflict models rely exclusively on three qualitative relations, namely, preference, converse, and indifference, to describe agents’ attitudes towards issue pairs, which significantly limits their capacity in capturing the essence of conflict. To overcome this limitation, we introduce the concept of an intuitionistic fuzzy preference-based conflict situation that captures agents’ attitudes towards issue pairs with finer granularity than that afforded by classical preference-based models. Afterwards, we develop intuitionistic fuzzy preference-based conflict measures within this framework, and construct three-way conflict analysis models for trisecting the set of agent pairs, the agent set, and the issue set. Additionally, relative loss functions built on the proposed conflict functions are employed to calculate thresholds for three-way conflict analysis. Finally, we present adjustment mechanism-based feasible strategies that simultaneously account for both adjustment magnitudes and conflict degrees, together with an algorithm for constructing such feasible strategies, and provide an illustrative example to demonstrate the validity and effectiveness of the proposed model.
zh

[AI-44] Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility

【速读】:该论文旨在解决视觉语言模型(Vision Language Models, VLMs)在面对多模态越狱攻击(multimodal jailbreak attacks)时安全性不足的问题。现有防御方法通常依赖于安全微调或激进的标记操作,导致训练成本高或显著损害模型任务性能。解决方案的关键在于提出一种轻量级、无需训练的安全校准框架——风险意识注入(Risk Awareness Injection, RAI),其核心机制是通过构建语言嵌入中的不安全原型子空间(Unsafe Prototype Subspace),并对高风险视觉标记进行选择性调制,从而在跨模态特征空间中显式激活与安全相关的信号,恢复VLM对视觉输入中不安全内容的识别能力,同时保持原始标记的语义完整性以保障跨模态推理性能。

链接: https://arxiv.org/abs/2602.03402
作者: Mengxuan Wang,Yuxin Chen,Gang Xu,Tao He,Hongjie Jiang,Ming Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Vision language models (VLMs) extend the reasoning capabilities of large language models (LLMs) to cross-modal settings, yet remain highly vulnerable to multimodal jailbreak attacks. Existing defenses predominantly rely on safety fine-tuning or aggressive token manipulations, incurring substantial training costs or significantly degrading utility. Recent research shows that LLMs inherently recognize unsafe content in text, and the incorporation of visual inputs in VLMs frequently dilutes risk-related signals. Motivated by this, we propose Risk Awareness Injection (RAI), a lightweight and training-free framework for safety calibration that restores LLM-like risk recognition by amplifying unsafe signals in VLMs. Specifically, RAI constructs an Unsafe Prototype Subspace from language embeddings and performs targeted modulation on selected high-risk visual tokens, explicitly activating safety-critical signals within the cross-modal feature space. This modulation restores the model’s LLM-like ability to detect unsafe content from visual inputs, while preserving the semantic integrity of original tokens for cross-modal reasoning. Extensive experiments across multiple jailbreak and utility benchmarks demonstrate that RAI substantially reduces attack success rate without compromising task performance.
zh

[AI-45] Precision in Practice: Knowledge Guided Code Summarizing Grounded in Industrial Expectations

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在工业代码文档场景中生成的代码摘要(code summaries)实用性不足的问题。研究表明,尽管基于大语言模型(LLMs)的自动代码摘要技术已取得显著进展,但在实际工业项目如HarmonyOS中,超过57.4%的生成摘要因未能满足开发者对术语准确性、功能分类明确性和避免冗余实现细节等预期而被拒绝。为此,作者提出ExpSum方法,其核心在于通过函数元数据抽象、信息性元数据过滤、上下文感知领域知识检索和约束驱动提示(constraint-driven prompting)四个关键机制,引导LLM生成结构化且符合开发者期望的摘要,从而提升工业级代码文档的质量与可用性。

链接: https://arxiv.org/abs/2602.03400
作者: Jintai Li,Songqiang Chen,Shuo Jin,Xiaoyuan Xie
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Code summaries are essential for helping developers understand code functionality and reducing maintenance and collaboration costs. Although recent advances in large language models (LLMs) have significantly improved automatic code summarization, the practical usefulness of generated summaries in industrial settings remains insufficiently explored. In collaboration with documentation experts from the industrial HarmonyOS project, we conducted a questionnaire study showing that over 57.4% of code summaries produced by state-of-the-art approaches were rejected due to violations of developers’ expectations for industrial documentation. Beyond semantic similarity to reference summaries, developers emphasize additional requirements, including the use of appropriate domain terminology, explicit function categorization, and the avoidance of redundant implementation details. To address these expectations, we propose ExpSum, an expectation-aware code summarization approach that integrates function metadata abstraction, informative metadata filtering, context-aware domain knowledge retrieval, and constraint-driven prompting to guide LLMs in generating structured, expectation-aligned summaries. We evaluate ExpSum on the HarmonyOS project and widely used code summarization benchmarks. Experimental results show that ExpSum consistently outperforms all baselines, achieving improvements of up to 26.71% in BLEU-4 and 20.10% in ROUGE-L on HarmonyOS. Furthermore, LLM-based evaluations indicate that ExpSum-generated summaries better align with developer expectations across other projects, demonstrating its effectiveness for industrial code documentation. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.03400 [cs.SE] (or arXiv:2602.03400v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2602.03400 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-46] On the Entropy Dynamics in Reinforcement Fine-Tuning of Large Language Models

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在强化学习微调(Reinforcement Fine-Tuning, RFT)过程中熵动态变化机制不明确的问题,从而难以有效平衡探索(exploration)与利用(exploitation)的困境。解决方案的关键在于构建一个理论框架,从单个logit更新出发推导出熵变化的一阶表达式,并将其扩展至Group Relative Policy Optimization (GRPO)的更新公式,由此得出熵判别器(entropy-discriminator)的裁剪方法,实现对熵的可控调节,为优化LLM微调中的探索-利用平衡提供了理论依据和可实践的策略。

链接: https://arxiv.org/abs/2602.03392
作者: Shumin Wang,Yuexiang Xie,Wenhao Zhang,Yuchang Sun,Yanxi Chen,Yaliang Li,Yanyong Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Entropy serves as a critical metric for measuring the diversity of outputs generated by large language models (LLMs), providing valuable insights into their exploration capabilities. While recent studies increasingly focus on monitoring and adjusting entropy to better balance exploration and exploitation in reinforcement fine-tuning (RFT), a principled understanding of entropy dynamics during this process is yet to be thoroughly investigated. In this paper, we establish a theoretical framework for analyzing the entropy dynamics during the RFT process, which begins with a discriminant expression that quantifies entropy change under a single logit update. This foundation enables the derivation of a first-order expression for entropy change, which can be further extended to the update formula of Group Relative Policy Optimization (GRPO). The corollaries and insights drawn from the theoretical analysis inspire the design of entropy control methods, and also offer a unified lens for interpreting various entropy-based methods in existing studies. We provide empirical evidence to support the main conclusions of our analysis and demonstrate the effectiveness of the derived entropy-discriminator clipping methods. This study yields novel insights into RFT training dynamics, providing theoretical support and practical strategies for optimizing the exploration-exploitation balance during LLM fine-tuning.
zh

[AI-47] Chain-of-Goals Hierarchical Policy for Long-Horizon Offline Goal-Conditioned RL

【速读】:该论文旨在解决离线目标条件强化学习(goal-conditioned reinforcement learning)在长时程任务中表现不佳的问题。现有分层方法虽能缓解此问题,但通常依赖独立的高层与低层网络,且仅生成单一中间子目标,难以应对需协调多个中间决策的复杂任务。其解决方案的关键在于提出链式目标分层策略(Chain-of-Goals Hierarchical Policy, CoGHP),将分层决策重构为统一架构内的自回归序列建模过程:给定状态和最终目标后,CoGHP 自回归地生成一系列潜在子目标,再输出原始动作,其中每个子目标作为推理步骤以条件化后续预测;同时引入 MLP-Mixer 主干结构,支持跨标记通信并捕捉状态、目标、子目标与动作之间的结构关系,从而显著提升长时程任务上的性能表现。

链接: https://arxiv.org/abs/2602.03389
作者: Jinwoo Choi,Sang-Hyun Lee,Seung-Woo Seo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 22 pages

点击查看摘要

Abstract:Offline goal-conditioned reinforcement learning remains challenging for long-horizon tasks. While hierarchical approaches mitigate this issue by decomposing tasks, most existing methods rely on separate high- and low-level networks and generate only a single intermediate subgoal, making them inadequate for complex tasks that require coordinating multiple intermediate decisions. To address this limitation, we draw inspiration from the chain-of-thought paradigm and propose the Chain-of-Goals Hierarchical Policy (CoGHP), a novel framework that reformulates hierarchical decision-making as autoregressive sequence modeling within a unified architecture. Given a state and a final goal, CoGHP autoregressively generates a sequence of latent subgoals followed by the primitive action, where each latent subgoal acts as a reasoning step that conditions subsequent predictions. To implement this efficiently, we pioneer the use of an MLP-Mixer backbone, which supports cross-token communication and captures structural relationships among state, goal, latent subgoals, and action. Across challenging navigation and manipulation benchmarks, CoGHP consistently outperforms strong offline baselines, demonstrating improved performance on long-horizon tasks.
zh

[AI-48] oward a Sustainable Federated Learning Ecosystem: A Practical Least Core Mechanism for Payoff Allocation

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)环境中协作稳定性不足的问题,即如何设计一个公平且稳定的收益分配机制,以防止参与者因不满而脱离联盟。其解决方案的关键在于引入基于最小核心(Least Core, LC)的收益分配框架,该框架通过最小化所有潜在子群组中的最大不满程度,确保任意参与者均无动机退出联盟,从而保障联邦网络的长期稳定运行;同时,为适应大规模实际场景,作者提出一种基于栈的剪枝算法,有效平衡了计算效率与分配精度,实验证明该机制能准确识别关键贡献者和战略联盟,促进可持续的FL生态发展。

链接: https://arxiv.org/abs/2602.03387
作者: Zhengwei Ni,Zhidu Li,Wei Chen,Zhaoyang Zhang,Zehua Wang,F. Richard Yu,Victor C. M. Leung
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注: 7 pages, 3 figures, submitted to IEEE Network

点击查看摘要

Abstract:Emerging network paradigms and applications increasingly rely on federated learning (FL) to enable collaborative intelligence while preserving privacy. However, the sustainability of such collaborative environments hinges on a fair and stable payoff allocation mechanism. Focusing on coalition stability, this paper introduces a payoff allocation framework based on the least core (LC) concept. Unlike traditional methods, the LC prioritizes the cohesion of the federation by minimizing the maximum dissatisfaction among all potential subgroups, ensuring that no participant has an incentive to break away. To adapt this game-theoretic concept to practical, large-scale networks, we propose a streamlined implementation with a stack-based pruning algorithm, effectively balancing computational efficiency with allocation precision. Case studies in federated intrusion detection demonstrate that our mechanism correctly identifies pivotal contributors and strategic alliances. The results confirm that the practical LC framework promotes stable collaboration and fosters a sustainable FL ecosystem.
zh

[AI-49] An Approximate Ascent Approach To Prove Convergence of PPO

【速读】:该论文旨在解决近端策略优化(Proximal Policy Optimization, PPO)算法的理论基础不完善问题,特别是其收敛性与核心优势机制尚不清晰的难题。解决方案的关键在于:首先,将PPO的策略更新机制(基于多轮小批量更新、复用轨迹数据并采用代理梯度)形式化为近似策略梯度上升过程,并通过控制代理梯度累积偏差,结合随机重排(random reshuffling)技术证明了PPO的收敛性定理;其次,识别出PPO中常用的截断广义优势估计(truncated Generalized Advantage Estimation, GAE)存在的一个被忽视的问题——几何加权方案在episode边界处导致最长k步优势估计器质量无限集中,进而提出简单的权重修正方法,在具有强终止信号的环境中(如Lunar Lander)显著提升性能。

链接: https://arxiv.org/abs/2602.03386
作者: Leif Doering,Daniel Schmidt,Moritz Melcher,Sebastian Kassing,Benedikt Wille,Tilman Aach,Simon Weissmann
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Proximal Policy Optimization (PPO) is among the most widely used deep reinforcement learning algorithms, yet its theoretical foundations remain incomplete. Most importantly, convergence and understanding of fundamental PPO advantages remain widely open. Under standard theory assumptions we show how PPO’s policy update scheme (performing multiple epochs of minibatch updates on multi-use rollouts with a surrogate gradient) can be interpreted as approximated policy gradient ascent. We show how to control the bias accumulated by the surrogate gradients and use techniques from random reshuffling to prove a convergence theorem for PPO that sheds light on PPO’s success. Additionally, we identify a previously overlooked issue in truncated Generalized Advantage Estimation commonly used in PPO. The geometric weighting scheme induces infinite mass collapse onto the longest k -step advantage estimator at episode boundaries. Empirical evaluations show that a simple weight correction can yield substantial improvements in environments with strong terminal signal, such as Lunar Lander.
zh

[AI-50] Rethinking Benign Relearning: Syntax as the Hidden Driver of Unlearning Failures ICLR2026

【速读】:该论文试图解决机器遗忘(machine unlearning)中“良性重学习”(benign relearning)问题,即在模型经过遗忘操作后,仍可能因使用良性微调数据而重新恢复被遗忘的内容。现有方法普遍认为这一现象源于主题相关性(topical relevance),但作者通过系统分析发现,语法相似性(syntactic similarity)才是主导因素:即使没有主题重叠,语法结构相近的数据也会因表示空间和梯度方向与遗忘内容对齐而引发信息复现。解决方案的关键在于提出“语法多样化”(syntactic diversification)策略——在执行遗忘前,将原始遗忘查询(forget queries)改写为语义一致但结构多样的表达形式,从而破坏模型内部的语法敏感性,有效抑制良性重学习、加速遗忘过程,并显著缓解遗忘效果与模型性能之间的权衡关系。

链接: https://arxiv.org/abs/2602.03379
作者: Sangyeon Yoon,Hyesoo Hong,Wonje Jeung,Albert No
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ICLR 2026

点击查看摘要

Abstract:Machine unlearning aims to remove specific content from trained models while preserving overall performance. However, the phenomenon of benign relearning, in which forgotten information reemerges even from benign fine-tuning data, reveals that existing unlearning methods remain fundamentally fragile. A common explanation attributes this effect to topical relevance, but we find this account insufficient. Through systematic analysis, we demonstrate that syntactic similarity, rather than topicality, is the primary driver: across benchmarks, syntactically similar data consistently trigger recovery even without topical overlap, due to their alignment in representations and gradients with the forgotten content. Motivated by this insight, we introduce syntactic diversification, which paraphrases the original forget queries into heterogeneous structures prior to unlearning. This approach effectively suppresses benign relearning, accelerates forgetting, and substantially alleviates the trade-off between unlearning efficacy and model utility.
zh

[AI-51] Causal Graph Learning via Distributional Invariance of Cause-Effect Relationship

【速读】:该论文旨在解决从观测数据中恢复因果图(causal graph)的问题,尤其在高维和大规模数据场景下,传统方法往往计算复杂度高且难以保证准确性。其解决方案的关键在于利用“效应条件于原因的分布对原因先验分布的变化保持不变”这一因果不变性原理,通过在不同下采样子集上检验效应-原因条件分布的方差变化来直接测试潜在因果关系;同时结合因果图的稀疏性假设,设计出一种时间复杂度为变量数平方级别的高效算法,显著提升了处理速度(最高达25倍加速),并在多个大规模基准数据集上展现出优于或相当的性能。

链接: https://arxiv.org/abs/2602.03353
作者: Nang Hung Nguyen,Phi Le Nguyen,Thao Nguyen Truong,Trong Nghia Hoang,Masashi Sugiyama
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces a new framework for recovering causal graphs from observational data, leveraging the observation that the distribution of an effect, conditioned on its causes, remains invariant to changes in the prior distribution of those causes. This insight enables a direct test for potential causal relationships by checking the variance of their corresponding effect-cause conditional distributions across multiple downsampled subsets of the data. These subsets are selected to reflect different prior cause distributions, while preserving the effect-cause conditional relationships. Using this invariance test and exploiting an (empirical) sparsity of most causal graphs, we develop an algorithm that efficiently uncovers causal relationships with quadratic complexity in the number of observational variables, reducing the processing time by up to 25x compared to state-of-the-art methods. Our empirical experiments on a varied benchmark of large-scale datasets show superior or equivalent performance compared to existing works, while achieving enhanced scalability.
zh

[AI-52] Building Interpretable Models for Moral Decision-Making AAAI’26

【速读】:该论文旨在解决神经网络在处理电车难题(trolley-style dilemmas)时如何做出道德决策的问题,核心挑战在于揭示模型内部的道德推理机制及其潜在偏见的分布规律。解决方案的关键在于构建了一个定制化的两层Transformer模型,通过嵌入(embedding)编码受影响个体、人数及结果归属等结构化信息,实现了77%的Moral Machine数据集准确率;同时结合多种可解释性技术,发现道德偏见分布在网络的不同计算阶段,从而为理解神经网络中的道德推理提供了可分析的框架。

链接: https://arxiv.org/abs/2602.03351
作者: Mayank Goel,Aritra Das,Paras Chopra
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 8 pages, 4 figures, accepted to AAAI’26 Machine Ethics Workshop

点击查看摘要

Abstract:We build a custom transformer model to study how neural networks make moral decisions on trolley-style dilemmas. The model processes structured scenarios using embeddings that encode who is affected, how many people, and which outcome they belong to. Our 2-layer architecture achieves 77% accuracy on Moral Machine data while remaining small enough for detailed analysis. We use different interpretability techniques to uncover how moral reasoning distributes across the network, demonstrating that biases localize to distinct computational stages among other findings.
zh

[AI-53] MentalSeek-Dx: Towards Progressive Hypothetico-Deductive Reasoning for Real-world Psychiatric Diagnosis

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在精神障碍诊断任务中因缺乏生态效度和细粒度诊断监督而导致的临床实用性不足问题。现有基准普遍无法反映真实临床场景中的复杂性,且难以支持从粗略分类到具体疾病级别的精准识别。解决方案的关键在于构建首个专注于真实临床环境中疾病级别诊断的基准——MentalDx Bench,其包含712份由持证精神科医生依据ICD-11标准标注的去标识电子健康记录,覆盖16类诊断类别下的76种精神障碍;并提出一种专为医学领域设计的LLM——MentalSeek-Dx,通过监督轨迹构建与基于课程的强化学习策略内化临床假设演绎推理过程,从而实现仅用140亿参数即达到当前最优(SOTA)性能,为可靠的精神病学诊断提供了临床基础框架。

链接: https://arxiv.org/abs/2602.03340
作者: Xiao Sun,Yuming Yang,Junnan Zhu,Jiang Zhong,Xinyu Zhou,Kaiwen Wei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 36 pages, 27 figures

点击查看摘要

Abstract:Mental health disorders represent a burgeoning global public health challenge. While Large Language Models (LLMs) have demonstrated potential in psychiatric assessment, their clinical utility is severely constrained by benchmarks that lack ecological validity and fine-grained diagnostic supervision. To bridge this gap, we introduce \textbfMentalDx Bench, the first benchmark dedicated to disorder-level psychiatric diagnosis within real-world clinical settings. Comprising 712 de-identified electronic health records annotated by board-certified psychiatrists under ICD-11 guidelines, the benchmark covers 76 disorders across 16 diagnostic categories. Evaluation of 18 LLMs reveals a critical \textitparadigm misalignment: strong performance at coarse diagnostic categorization contrasts with systematic failure at disorder-level diagnosis, underscoring a gap between pattern-based modeling and clinical hypothetico-deductive reasoning. In response, we propose \textbfMentalSeek-Dx, a medical-specialized LLM trained to internalize this clinical reasoning process through supervised trajectory construction and curriculum-based reinforcement learning. Experiments on MentalDx Bench demonstrate that MentalSeek-Dx achieves state-of-the-art (SOTA) performance with only 14B parameters, establishing a clinically grounded framework for reliable psychiatric diagnosis.
zh

[AI-54] Memora: A Harmonic Memory Representation Balancing Abstraction and Specificity

【速读】:该论文旨在解决智能体记忆系统在持续增长信息背景下如何平衡抽象性(abstraction)与细节特异性(specificity)的问题,以支持高效且情境感知的检索。传统方法中,抽象虽有助于扩展记忆规模,但常牺牲细粒度信息,影响推理效果。其解决方案的关键在于提出一种名为Memora的谐振记忆表示结构:通过主抽象索引具体记忆值并合并相关更新形成统一条目,同时利用提示锚点(cue anchors)拓展多维度检索路径并连接关联记忆;在此基础上设计的检索策略主动利用记忆间的关联关系,超越单纯语义相似性进行相关性检索,从而在保持抽象效率的同时保留关键细节。

链接: https://arxiv.org/abs/2602.03315
作者: Menglin Xia,Xuchao Zhang,Shantanu Dixit,Paramaguru Harimurugan,Rujia Wang,Victor Ruhle,Robert Sim,Chetan Bansal,Saravan Rajmohan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agent memory systems must accommodate continuously growing information while supporting efficient, context-aware retrieval for downstream tasks. Abstraction is essential for scaling agent memory, yet it often comes at the cost of specificity, obscuring the fine-grained details required for effective reasoning. We introduce Memora, a harmonic memory representation that structurally balances abstraction and specificity. Memora organizes information via its primary abstractions that index concrete memory values and consolidate related updates into unified memory entries, while cue anchors expand retrieval access across diverse aspects of the memory and connect related memories. Building on this structure, we employ a retrieval policy that actively exploits these memory connections to retrieve relevant information beyond direct semantic similarity. Theoretically, we show that standard Retrieval-Augmented Generation (RAG) and Knowledge Graph (KG)-based memory systems emerge as special cases of our framework. Empirically, Memora establishes a new state-of-the-art on the LoCoMo and LongMemEval benchmarks, demonstrating better retrieval relevance and reasoning effectiveness as memory scales.
zh

[AI-55] Entropy-Gated Selective Policy Optimization:Token-Level Gradient Allocation for Hybrid Training of Large Language Models

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在混合训练中因样本级梯度更新导致的效率与稳定性问题,特别是在强化学习(Reinforcement Learning, RL)阶段易受高方差影响、难以平衡探索与知识保留的挑战。解决方案的关键在于提出熵门控选择性策略优化(Entropy Gated Selective Policy Optimization, EGSPO),其核心机制是在token级别引入预测熵作为动态权重,对高熵token分配完整的近端策略优化(Proximal Policy Optimization, PPO)更新以促进探索,而对低熵token施加衰减的PPO更新以降低方差并保护已学知识;同时,两个分支均融合优势函数 AtA_t,确保错误轨迹获得一致的负向学习信号,避免对错误预测的强化。此方法在数学推理基准测试中显著优于基线模型(如CHORD phi),在AIME和MATH上分别提升3.8%和2.9%,且仅增加3.4%计算开销。

链接: https://arxiv.org/abs/2602.03309
作者: Yuelin Hu,Zhengxue Cheng,Wei Liu,Li Song
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: accepted by cscwd2026

点击查看摘要

Abstract:Hybrid training methods for large language models combine supervised fine tuning (SFT) on expert demonstrations with reinforcement learning (RL) on model rollouts, typically at the sample level. We propose Entropy Gated Selective Policy Optimization (EGSPO), a three stage framework that extends sample level mixing with token level gradient modulation. Stage 1, SFT expert learning, establishes a reliable warm up policy using expert demonstrations with a pure SFT loss. Stage 2, RL rollout generation, samples trajectories from the current policy and computes per token predictive entropy. Stage 3, the EGSPO mechanism, applies entropy gated gradient allocation: a predictive entropy module routes high entropy tokens to full PPO updates to encourage exploration, and low entropy tokens to attenuated PPO updates to reduce variance and preserve knowledge. Critically, both branches incorporate the advantage function A_t, ensuring that incorrect trajectories receive consistent negative learning signals and preventing reinforcement of confident errors. EGSPO achieves consistent improvements on mathematical reasoning benchmarks, with gains of 3.8 percent on AIME and 2.9 percent on MATH over the CHORD phi baseline, while incurring only 3.4 percent additional computational overhead. Comments: accepted by cscwd2026 Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.03309 [cs.LG] (or arXiv:2602.03309v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.03309 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-56] Learning to Select: Query-Aware Adaptive Dimension Selection for Dense Retrieval

【速读】:该论文旨在解决密集检索(Dense Retrieval)中嵌入表示维度冗余的问题,即在给定信息需求下,仅部分维度对排序具有稳定贡献。传统方法如伪相关反馈(Pseudo-Relevance Feedback, PRF)依赖噪声较大的伪信号和启发式测试阶段操作来估计维度重要性,而监督适配器方法虽利用相关标签提升嵌入质量,却学习全局变换,未显式建模查询感知的维度重要性。解决方案的关键在于提出一种查询感知自适应维度选择框架(Query-Aware Adaptive Dimension Selection),其核心是通过监督相关标签构建“理想”维度重要性分布,并训练一个预测器直接从查询嵌入映射出每个维度的重要性得分;推理时仅根据查询嵌入即可选出查询相关的维度子集进行相似度计算,无需伪相关反馈,从而实现更高效且精准的检索效果。

链接: https://arxiv.org/abs/2602.03306
作者: Zhanyu Wu,Richong Zhang,Zhijie Nie
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Dense retrieval represents queries and docu-002 ments as high-dimensional embeddings, but003 these representations can be redundant at the004 query level: for a given information need, only005 a subset of dimensions is consistently help-006 ful for ranking. Prior work addresses this via007 pseudo-relevance feedback (PRF) based dimen-008 sion importance estimation, which can produce009 query-aware masks without labeled data but010 often relies on noisy pseudo signals and heuris-011 tic test-time procedures. In contrast, super-012 vised adapter methods leverage relevance labels013 to improve embedding quality, yet they learn014 global transformations shared across queries015 and do not explicitly model query-aware di-016 mension importance. We propose a Query-017 Aware Adaptive Dimension Selection frame-018 work that learns to predict per-dimension im-019 portance directly from query embedding. We020 first construct oracle dimension importance dis-021 tributions over embedding dimensions using022 supervised relevance labels, and then train a023 predictor to map a query embedding to these024 label-distilled importance scores. At inference,025 the predictor selects a query-aware subset of026 dimensions for similarity computation based027 solely on the query embedding, without pseudo-028 relevance feedback. Experiments across multi-029 ple dense retrievers and benchmarks show that030 our learned dimension selector improves re-031 trieval effectiveness over the full-dimensional032 baseline as well as PRF-based masking and033 supervised adapter baselines.
zh

[AI-57] Periodic Regularized Q-Learning

【速读】:该论文旨在解决Q-learning在使用线性函数逼近(linear function approximation)时缺乏收敛性保证的问题。传统Q-learning的收敛性仅在表格型(tabular)设置下成立,而在线性函数逼近场景中,由于投影误差和函数近似偏差的存在,标准Q-learning可能发散。解决方案的关键在于提出一种新的算法——周期性正则化Q-learning(Periodic Regularized Q-learning, PRQ),其核心创新是将正则化引入投影算子层面,显式构造出正则化的投影值迭代(Regularized Projected Value Iteration, RP-VI),从而确保迭代过程为压缩映射(contraction)。通过将这一正则化投影机制扩展到基于样本的强化学习框架,PRQ实现了在有限时间内对线性函数逼近下的Q值函数的稳定收敛,提供了严格的理论保障。

链接: https://arxiv.org/abs/2602.03301
作者: Hyukjun Yang,Han-Dong Lim,Donghwan Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In reinforcement learning (RL), Q-learning is a fundamental algorithm whose convergence is guaranteed in the tabular setting. However, this convergence guarantee does not hold under linear function approximation. To overcome this limitation, a significant line of research has introduced regularization techniques to ensure stable convergence under function approximation. In this work, we propose a new algorithm, periodic regularized Q-learning (PRQ). We first introduce regularization at the level of the projection operator and explicitly construct a regularized projected value iteration (RP-VI), subsequently extending it to a sample-based RL algorithm. By appropriately regularizing the projection operator, the resulting projected value iteration becomes a contraction. By extending this regularized projection into the stochastic setting, we establish the PRQ algorithm and provide a rigorous theoretical analysis that proves finite-time convergence guarantees for PRQ under linear function approximation.
zh

[AI-58] Rejecting Arguments Based on Doubt in Structured Bipolar Argumentation AAMAS2026

【速读】:该论文旨在解决计算论证(computational argumentation)领域中两个长期被忽视的问题:一是代理(agent)可因纯粹怀疑而理性地拒绝某个论证,而非必须接受所有其能够辩护的论证;二是应更自然地从个体句子或主张的角度来建模代理在辩论中的立场,而非仅关注论证整体。解决方案的关键在于提出结构化双极论证框架(Structured Bipolar Argumentation Frameworks, SBAFs),其中论证由句子构成,并包含攻击与支持两种关系。在此基础上,论文设计了一种介于抽象论证中的可接纳(admissible)与完备(complete)语义之间的新语义体系:一方面不强制代理接受所有被辩护的论证,另一方面引入语言扩展(language extensions)以明确可接受的句子集合,从而更真实地刻画代理在辩论中的合理立场。这一方法不仅为现有理论提供了新的视角,还可用于界定抽象论证适用的条件,并证明演绎支持语义是其特例。

链接: https://arxiv.org/abs/2602.03286
作者: Michael A. Müller,Srdjan Vesic,Bruno Yun
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: Accepted to AAMAS 2026

点击查看摘要

Abstract:This paper develops a new approach to computational argumentation that is informed by philosophical and linguistic views. Namely, it takes into account two ideas that have received little attention in the literature on computational argumentation: First, an agent may rationally reject an argument based on mere doubt, thus not all arguments they could defend must be accepted; and, second, that it is sometimes more natural to think in terms of which individual sentences or claims an agent accepts in a debate, rather than which arguments. In order to incorporate these two ideas into a computational approach, we first define the notion of structured bipolar argumentation frameworks (SBAFs), where arguments consist of sentences and we have both an attack and a support relation between them. Then, we provide semantics for SBAFs with two features: (1) Unlike with completeness-based semantics, our semantics do not force agents to accept all defended arguments. (2) In addition to argument extensions, which give acceptable sets of arguments, we also provide semantics for language extensions that specify acceptable sets of sentences. These semantics represent reasonable positions an agent might have in a debate. Our semantics lie between the admissible and complete semantics of abstract argumentation. Further, our approach can be used to provide a new perspective on existing approaches. For instance, we can specify the conditions under which an agent can ignore support between arguments (i.e. under which the use of abstract argumentation is warranted) and we show that deductive support semantics is a special case of our approach.
zh

[AI-59] MeetBench-XL: Calibrated Multi-Dimensional Evaluation and Learned Dual-Policy Agents for Real-Time Meetings AAAI2026

【速读】:该论文旨在解决企业会议环境中AI助手在实际应用中面临的多维挑战,包括高时效性要求、成本与隐私约束下的复杂任务处理(如跨会议分析和实时事实核查),以及现有基准测试无法反映真实企业协作流程的问题。其核心解决方案在于提出三个关键组件:一是构建了基于231场企业会议(总计140小时)的多模态、双语数据集MeetAll,通过专家验证和人类判别研究确保问题注入的真实性与多样性;二是设计了多维度评估协议MeetBench XL,从事实准确性、意图一致性、响应效率等五个维度量化模型性能;三是开发了MeetMaster XL——一个双策略学习代理,通过轻量级分类器实现快速与慢速推理路径之间的动态路由及工具调用(如检索、跨会议聚合和网络搜索),从而在保证高质量输出的同时显著优化延迟-性能权衡。实验证明该方案优于商用系统,并在真实部署中得到验证。

链接: https://arxiv.org/abs/2602.03285
作者: Yuelin Hu,Jun Xu,Bingcong Lu,Zhengxue Cheng,Hongwei Hu,Ronghua Wu,Li Song
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: accepted by AAAI2026 ws

点击查看摘要

Abstract:Enterprise meeting environments require AI assistants that handle diverse operational tasks, from rapid fact checking during live discussions to cross meeting analysis for strategic planning, under strict latency, cost, and privacy constraints. Existing meeting benchmarks mainly focus on simplified question answering and fail to reflect real world enterprise workflows, where queries arise organically from multi stakeholder collaboration, span long temporal contexts, and require tool augmented reasoning. We address this gap through a grounded dataset and a learned agent framework. First, we introduce MeetAll, a bilingual and multimodal corpus derived from 231 enterprise meetings totaling 140 hours. Questions are injected using an enterprise informed protocol validated by domain expert review and human discriminability studies. Unlike purely synthetic benchmarks, this protocol is grounded in four enterprise critical dimensions: cognitive load, temporal context span, domain expertise, and actionable task execution, calibrated through interviews with stakeholders across finance, healthcare, and technology sectors. Second, we propose MeetBench XL, a multi dimensional evaluation protocol aligned with human judgment that measures factual fidelity, intent alignment, response efficiency, structural clarity, and completeness. Third, we present MeetMaster XL, a learned dual policy agent that jointly optimizes query routing between fast and slow reasoning paths and tool invocation, including retrieval, cross meeting aggregation, and web search. A lightweight classifier enables accurate routing with minimal overhead, achieving a superior quality latency tradeoff over single model baselines. Experiments against commercial systems show consistent gains, supported by ablations, robustness tests, and a real world deployment case this http URL: this https URL. Comments: accepted by AAAI2026 ws Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2602.03285 [cs.AI] (or arXiv:2602.03285v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2602.03285 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-60] Agent ic Proposing: Enhancing Large Language Model Reasoning via Compositional Skill Synthesis

【速读】:该论文旨在解决大规模语言模型在复杂推理能力提升中面临的高质量、可验证数据集稀缺问题,当前人工标注成本高昂且难以扩展,而现有合成方法往往在保持结构有效性与提高问题复杂度之间存在权衡,导致生成数据质量不稳定或不可解。其解决方案的关键在于提出Agentic Proposing框架,将问题合成建模为一个目标驱动的序列决策过程,通过专用代理(agent)动态选择并组合模块化推理技能,并借助内部反思与工具调用的迭代流程,利用多粒度策略优化(Multi-Granularity Policy Optimization, MGPO)生成高精度、可验证的训练轨迹,从而显著提升下游求解器在数学、编程和科学领域的性能与跨域泛化能力。

链接: https://arxiv.org/abs/2602.03279
作者: Zhengbo Jiao,Shaobo Wang,Zifan Zhang,Xuan Ren,Wei Wang,Bing Zhao,Hu Wei,Linfeng Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 23page4

点击查看摘要

Abstract:Advancing complex reasoning in large language models relies on high-quality, verifiable datasets, yet human annotation remains cost-prohibitive and difficult to scale. Current synthesis paradigms often face a recurring trade-off: maintaining structural validity typically restricts problem complexity, while relaxing constraints to increase difficulty frequently leads to inconsistent or unsolvable instances. To address this, we propose Agentic Proposing, a framework that models problem synthesis as a goal-driven sequential decision process where a specialized agent dynamically selects and composes modular reasoning skills. Through an iterative workflow of internal reflection and tool-use, we develop the Agentic-Proposer-4B using Multi-Granularity Policy Optimization (MGPO) to generate high-precision, verifiable training trajectories across mathematics, coding, and science. Empirical results demonstrate that downstream solvers trained on agent-synthesized data significantly outperform leading baselines and exhibit robust cross-domain generalization. Notably, a 30B solver trained on only 11,000 synthesized trajectories achieves a state-of-the-art 91.6% accuracy on AIME25, rivaling frontier-scale proprietary models such as GPT-5 and proving that a small volume of high-quality synthetic signals can effectively substitute for massive human-curated datasets.
zh

[AI-61] Unveiling Covert Toxicity in Multimodal Data via Toxicity Association Graphs: A Graph-Based Metric and Interpretable Detection Framework

【速读】:该论文旨在解决多模态数据中隐蔽性毒性(covert toxicity)的检测难题,即有害含义常隐藏在看似无害的单一模态中,仅在多模态融合并激活语义关联时才显现。其解决方案的核心是提出基于毒性关联图(Toxicity Association Graphs, TAGs)的新检测框架,并引入首个可量化的隐蔽毒性度量指标——多模态隐蔽性毒性指数(Multimodal Toxicity Covertness, MTC),用于衡量有毒表达的 concealment 程度。通过将 TAGs 与 MTC 结合,该方法实现了对隐蔽毒性精准识别的同时保持决策过程的完全可解释性,显著提升了多模态毒性检测的透明度与可信度。

链接: https://arxiv.org/abs/2602.03268
作者: Guanzong Wu,Zihao Zhu,Siwei Lyu,Baoyuan Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Detecting toxicity in multimodal data remains a significant challenge, as harmful meanings often lurk beneath seemingly benign individual modalities: only emerging when modalities are combined and semantic associations are activated. To address this, we propose a novel detection framework based on Toxicity Association Graphs (TAGs), which systematically model semantic associations between innocuous entities and latent toxic implications. Leveraging TAGs, we introduce the first quantifiable metric for hidden toxicity, the Multimodal Toxicity Covertness (MTC), which measures the degree of concealment in toxic multimodal expressions. By integrating our detection framework with the MTC metric, our approach enables precise identification of covert toxicity while preserving full interpretability of the decision-making process, significantly enhancing transparency in multimodal toxicity detection. To validate our method, we construct the Covert Toxic Dataset, the first benchmark specifically designed to capture high-covertness toxic multimodal instances. This dataset encodes nuanced cross-modal associations and serves as a rigorous testbed for evaluating both the proposed metric and detection framework. Extensive experiments demonstrate that our approach outperforms existing methods across both low- and high-covertness toxicity regimes, while delivering clear, interpretable, and auditable detection outcomes. Together, our contributions advance the state of the art in explainable multimodal toxicity detection and lay the foundation for future context-aware and interpretable approaches. Content Warning: This paper contains examples of toxic multimodal content that may be offensive or disturbing to some readers. Reader discretion is advised.
zh

[AI-62] CSR-Bench: A Benchmark for Evaluating the Cross-modal Safety and Reliability of MLLM s

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在安全行为上可能依赖单模态捷径(unimodal shortcuts)而非真正的跨模态联合意图理解的问题。其解决方案的关键在于提出CSR-Bench基准,通过四种压力测试交互模式(Safety、Over-rejection、Bias 和 Hallucination)覆盖61种细粒度场景,每个实例均要求整合图像与文本的解释,并提供配对的纯文本控制组以诊断模态诱导的行为变化。实验表明,当前主流MLLMs存在系统性的跨模态对齐缺口,且在安全性和拒绝倾向之间存在明确权衡,揭示了部分“安全提升”可能源于拒绝导向的启发式策略而非稳健的意图理解。

链接: https://arxiv.org/abs/2602.03263
作者: Yuxuan Liu,Yuntian Shi,Kun Wang,Haoting Shen,Kun Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 25 pages, 1 figures

点击查看摘要

Abstract:Multimodal large language models (MLLMs) enable interaction over both text and images, but their safety behavior can be driven by unimodal shortcuts instead of true joint intent understanding. We introduce CSR-Bench, a benchmark for evaluating cross-modal reliability through four stress-testing interaction patterns spanning Safety, Over-rejection, Bias, and Hallucination, covering 61 fine-grained types. Each instance is constructed to require integrated image-text interpretation, and we additionally provide paired text-only controls to diagnose modality-induced behavior shifts. We evaluate 16 state-of-the-art MLLMs and observe systematic cross-modal alignment gaps. Models show weak safety awareness, strong language dominance under interference, and consistent performance degradation from text-only controls to multimodal inputs. We also observe a clear trade-off between reducing over-rejection and maintaining safe, non-discriminatory behavior, suggesting that some apparent safety gains may come from refusal-oriented heuristics rather than robust intent understanding. WARNING: This paper contains unsafe contents.
zh

[AI-63] GraDE: A Graph Diffusion Estimator for Frequent Subgraph Discovery in Neural Architectures

【速读】:该论文旨在解决神经网络结构中频繁子图模式(frequent subgraph patterns)识别的难题,即在保持计算可行性的同时提升发现能力。传统枚举方法虽准确但计算复杂度高,而采样方法虽高效却难以发现大规模模式。解决方案的关键在于提出GraDE框架,其核心创新是Graph Diffusion Estimator(GraDE),首次将图扩散模型引入子图频率分析,通过评分子图在其学习分布中的典型性来识别高频模式,从而在保证计算效率的同时显著提升发现性能。

链接: https://arxiv.org/abs/2602.03257
作者: Yikang Yang,Zhengxin Yang,Minghao Luo,Luzhou Peng,Hongxiao Li,Wanling Gao,Lei Wang,Jianfeng Zhan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Finding frequently occurring subgraph patterns or network motifs in neural architectures is crucial for optimizing efficiency, accelerating design, and uncovering structural insights. However, as the subgraph size increases, enumeration-based methods are perfectly accurate but computationally prohibitive, while sampling-based methods are computationally tractable but suffer from a severe decline in discovery capability. To address these challenges, this paper proposes GraDE, a diffusion-guided search framework that ensures both computational feasibility and discovery capability. The key innovation is the Graph Diffusion Estimator (GraDE), which is the first to introduce graph diffusion models to identify frequent subgraphs by scoring their typicality within the learned distribution. Comprehensive experiments demonstrate that the estimator achieves superior ranking accuracy, with up to 114% improvement compared to sampling-based baselines. Benefiting from this, the proposed framework successfully discovers large-scale frequent patterns, achieving up to 30 \times higher median frequency than sampling-based methods.
zh

[AI-64] LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial Scenarios

【速读】:该论文旨在解决当前计算机使用代理(Computer-use Agents, CUAs)在执行长周期任务时缺乏规划阶段安全意识的问题,尤其关注因模糊指令或对抗性用户操纵导致的潜在风险。现有基准主要聚焦于短周期或基于图形界面(GUI)的任务,仅评估执行阶段错误,忽视了对规划阶段潜在危害的识别能力。为填补这一空白,作者提出了LPS-Bench基准,涵盖7个任务领域和9类风险的65个场景,用于系统评估基于多智能体协作(MCP)的CUA在长期任务中的规划安全性。其关键解决方案包括:构建一个可扩展的多智能体自动化数据生成管道以支持大规模测试,并采用大语言模型作为裁判(LLM-as-a-judge)协议,通过分析代理的规划轨迹来量化其安全意识水平。实验表明,现有CUA在保持安全行为方面存在显著不足,研究进一步揭示了风险模式并提出针对性缓解策略,以提升MCP架构下CUA系统的长期规划安全性。

链接: https://arxiv.org/abs/2602.03255
作者: Tianyu Chen,Chujia Hu,Ge Gao,Dongrui Liu,Xia Hu,Wenjie Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Computer-use agents (CUAs) that interact with real computer systems can perform automated tasks but face critical safety risks. Ambiguous instructions may trigger harmful actions, and adversarial users can manipulate tool execution to achieve malicious goals. Existing benchmarks mostly focus on short-horizon or GUI-based tasks, evaluating on execution-time errors but overlooking the ability to anticipate planning-time risks. To fill this gap, we present LPS-Bench, a benchmark that evaluates the planning-time safety awareness of MCP-based CUAs under long-horizon tasks, covering both benign and adversarial interactions across 65 scenarios of 7 task domains and 9 risk types. We introduce a multi-agent automated pipeline for scalable data generation and adopt an LLM-as-a-judge evaluation protocol to assess safety awareness through the planning trajectory. Experiments reveal substantial deficiencies in existing CUAs’ ability to maintain safe behavior. We further analyze the risks and propose mitigation strategies to improve long-horizon planning safety in MCP-based CUA systems. We open-source our code at this https URL.
zh

[AI-65] Accordion-Thinking: Self-Regulated Step Summaries for Efficient and Readable LLM Reasoning

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中因长链式思维(Chain-of-Thought, CoT)导致的计算资源消耗急剧上升的问题,特别是KV缓存线性增长和注意力机制二次复杂度带来的内存与计算瓶颈。解决方案的关键在于提出一种端到端的“Accordion-Thinking”框架,其中模型通过动态摘要机制自主调节推理步骤的粒度,实现“Fold推理模式”——即周期性地对历史思考过程进行总结并丢弃冗余信息,从而显著降低对历史token的依赖。该方法结合强化学习优化,使模型在训练中逐步学会将关键推理信息压缩至紧凑摘要中,最终使得高效Fold模式与高精度Unfold模式之间的准确率差距逐渐缩小甚至消失,实现了在保持推理质量的同时,将吞吐量提升至3倍且仅需48GB GPU显存。

链接: https://arxiv.org/abs/2602.03249
作者: Zhicheng Yang,Zhijiang Guo,Yinya Huang,Yongxin Wang,Wenlei Shi,Yiwei Wang,Xiaodan Liang,Jing Tang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Scaling test-time compute via long Chain-ofThought unlocks remarkable gains in reasoning capabilities, yet it faces practical limits due to the linear growth of KV cache and quadratic attention complexity. In this paper, we introduce Accordion-Thinking, an end-to-end framework where LLMs learn to self-regulate the granularity of the reasoning steps through dynamic summarization. This mechanism enables a Fold inference mode, where the model periodically summarizes its thought process and discards former thoughts to reduce dependency on historical tokens. We apply reinforcement learning to incentivize this capability further, uncovering a critical insight: the accuracy gap between the highly efficient Fold mode and the exhaustive Unfold mode progressively narrows and eventually vanishes over the course of training. This phenomenon demonstrates that the model learns to encode essential reasoning information into compact summaries, achieving effective compression of the reasoning context. Our Accordion-Thinker demonstrates that with learned self-compression, LLMs can tackle complex reasoning tasks with minimal dependency token overhead without compromising solution quality, and it achieves a 3x throughput while maintaining accuracy on a 48GB GPU memory configuration, while the structured step summaries provide a human-readable account of the reasoning process.
zh

[AI-66] he Necessity of a Unified Framework for LLM -Based Agent Evaluation

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)驱动的通用智能体(general-purpose agents)评估中存在的系统性问题,包括评估基准受外部因素(如系统提示词、工具集配置和环境动态)严重干扰、评估框架碎片化且缺乏标准化,导致性能提升难以归因于模型本身,以及实验结果不可复现和不公平。其解决方案的关键在于提出一个统一的评估框架(unified evaluation framework),以实现对智能体能力的客观、可比和可重复的衡量,从而推动该领域研究的严谨发展。

链接: https://arxiv.org/abs/2602.03238
作者: Pengyu Zhu,Li Sun,Philip S. Yu,Sen Su
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the advent of Large Language Models (LLMs), general-purpose agents have seen fundamental advancements. However, evaluating these agents presents unique challenges that distinguish them from static QA benchmarks. We observe that current agent benchmarks are heavily confounded by extraneous factors, including system prompts, toolset configurations, and environmental dynamics. Existing evaluations often rely on fragmented, researcher-specific frameworks where the prompt engineering for reasoning and tool usage varies significantly, making it difficult to attribute performance gains to the model itself. Additionally, the lack of standardized environmental data leads to untraceable errors and non-reproducible results. This lack of standardization introduces substantial unfairness and opacity into the field. We propose that a unified evaluation framework is essential for the rigorous advancement of agent evaluation. To this end, we introduce a proposal aimed at standardizing agent evaluation.
zh

[AI-67] AME: A Trustworthy Test-Time Evolution of Agent Memory with Systematic Benchmarking

【速读】:该论文旨在解决代理记忆在良性任务演化过程中出现的“代理记忆错进化”(Agent Memory Misevolution)问题,即尽管任务本身无害,但代理的安全对齐性仍会随经验积累而下降,导致信任度整体衰退。解决方案的关键在于提出TAME框架——一种双记忆演化机制,其中执行器记忆(executor memory)通过蒸馏通用方法论来提升任务性能,评估器记忆(evaluator memory)则基于历史反馈优化对安全性和任务效用的判断;二者协同形成闭环:记忆过滤、草稿生成、可信度精炼、执行与双轨记忆更新,从而在不牺牲任务效用的前提下有效维持代理的信任度。

链接: https://arxiv.org/abs/2602.03224
作者: Yu Cheng,Jiuan Zhou,Yongkang Hu,Yihang Chen,Huichi Zhou,Mingang Chen,Zhizhong Zhang,Kun Shao,Yuan Xie,Zhaoxia Yin
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Test-time evolution of agent memory serves as a pivotal paradigm for achieving AGI by bolstering complex reasoning through experience accumulation. However, even during benign task evolution, agent safety alignment remains vulnerable-a phenomenon known as Agent Memory Misevolution. To evaluate this phenomenon, we construct the Trust-Memevo benchmark to assess multi-dimensional trustworthiness during benign task evolution, revealing an overall decline in trustworthiness across various task domains and evaluation settings. To address this issue, we propose TAME, a dual-memory evolutionary framework that separately evolves executor memory to improve task performance by distilling generalizable methodologies, and evaluator memory to refine assessments of both safety and task utility based on historical feedback. Through a closed loop of memory filtering, draft generation, trustworthy refinement, execution, and dual-track memory updating, TAME preserves trustworthiness without sacrificing utility. Experiments demonstrate that TAME mitigates misevolution, achieving a joint improvement in both trustworthiness and task performance.
zh

[AI-68] Distribution-Aware End-to-End Embedding for Streaming Numerical Features in Click-Through Rate Prediction

【速读】:该论文旨在解决流式训练场景下数值特征嵌入(numerical feature embedding)的挑战,特别是传统静态分箱方法因离线统计导致的语义漂移问题,以及神经嵌入方法忽略显式分布信息的局限性。其核心解决方案是提出DAES框架,通过引入基于蓄水池采样(reservoir sampling)的分布估计方法和两种场感知(field-aware)分布调制策略,实现分布信息与自适应调制机制的端到端融合,从而有效捕捉流式数据中的分布变化和字段依赖语义,显著提升点击率(Click-Through Rate, CTR)预测性能。

链接: https://arxiv.org/abs/2602.03223
作者: Jiahao Liu,Hongji Ruan,Weimin Zhang,Ziye Tong,Derick Tang,Zhanpeng Zeng,Qinsong Zeng,Peng Zhang,Tun Lu,Ning Gu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Under review

点击查看摘要

Abstract:This paper explores effective numerical feature embedding for Click-Through Rate prediction in streaming environments. Conventional static binning methods rely on offline statistics of numerical distributions; however, this inherently two-stage process often triggers semantic drift during bin boundary updates. While neural embedding methods enable end-to-end learning, they often discard explicit distributional information. Integrating such information end-to-end is challenging because streaming features often violate the i.i.d. assumption, precluding unbiased estimation of the population distribution via the expectation of order statistics. Furthermore, the critical context dependency of numerical distributions is often neglected. To this end, we propose DAES, an end-to-end framework designed to tackle numerical feature embedding in streaming training scenarios by integrating distributional information with an adaptive modulation mechanism. Specifically, we introduce an efficient reservoir-sampling-based distribution estimation method and two field-aware distribution modulation strategies to capture streaming distributions and field-dependent semantics. DAES significantly outperforms existing approaches as demonstrated by extensive offline and online experiments and has been fully deployed on a leading short-video platform with hundreds of millions of daily active users.
zh

[AI-69] Beyond Quantity: Trajectory Diversity Scaling for Code Agents

【速读】:该论文旨在解决代码大语言模型(Code LLMs)在向工具交互代理演进过程中,因合成数据质量低和数量扩展边际效益递减而导致的泛化能力瓶颈问题,尤其是现有方法对轨迹数据利用不充分、模式坍塌严重等挑战。其解决方案的关键在于提出TDScaling框架——一种基于轨迹多样性扩展的数据合成机制,通过提升轨迹多样性而非单纯增加数据量来优化性能-成本比。核心创新包括:业务聚类机制捕获真实服务逻辑依赖、蓝图驱动的多智能体范式保障轨迹一致性、基于领域熵与推理模式熵的自适应演化机制防止模式坍塌,以及沙箱化代码工具缓解内在编码能力的灾难性遗忘。实验证明,TDScaling在通用工具使用基准(BFCL, tau^2-Bench)和代码代理任务(RebenchT, CodeCI, BIRD)上实现了工具泛化能力和编码能力的双重提升。

链接: https://arxiv.org/abs/2602.03219
作者: Guhong Chen,Chenghao Sun,Cheng Fu,Qiyao Wang,Zhihong Huang,Chaopeng Wei,Guangxu Chen,Feiteng Fang,Ahmadreza Argha,Bing Zhao,Xander Xu,Qi Han,Hamid Alinejad-Rokny,Qiang Qu,Binhua Li,Shiwen Ni,Min Yang,Hu Wei,Yongbin Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As code large language models (LLMs) evolve into tool-interactive agents via the Model Context Protocol (MCP), their generalization is increasingly limited by low-quality synthetic data and the diminishing returns of quantity scaling. Moreover, quantity-centric scaling exhibits an early bottleneck that underutilizes trajectory data. We propose TDScaling, a Trajectory Diversity Scaling-based data synthesis framework for code agents that scales performance through diversity rather than raw volume. Under a fixed training budget, increasing trajectory diversity yields larger gains than adding more trajectories, improving the performance-cost trade-off for agent training. TDScaling integrates four innovations: (1) a Business Cluster mechanism that captures real-service logical dependencies; (2) a blueprint-driven multi-agent paradigm that enforces trajectory coherence; (3) an adaptive evolution mechanism that steers synthesis toward long-tail scenarios using Domain Entropy, Reasoning Mode Entropy, and Cumulative Action Complexity to prevent mode collapse; and (4) a sandboxed code tool that mitigates catastrophic forgetting of intrinsic coding capabilities. Experiments on general tool-use benchmarks (BFCL, tau^2-Bench) and code agent tasks (RebenchT, CodeCI, BIRD) demonstrate a win-win outcome: TDScaling improves both tool-use generalization and inherent coding proficiency. We plan to release the full codebase and the synthesized dataset (including 30,000+ tool clusters) upon publication.
zh

[AI-70] opology Matters: A Cautionary Case Study of Graph SSL on Neuro-Inspired Benchmarks

【速读】:该论文旨在解决如何通过局部相互作用揭示全局脑组织结构的问题,核心挑战在于现有通用图自监督学习(Graph Self-Supervised Learning, SSL)方法在处理类连接组(connectome-like)数据时缺乏对拓扑结构的敏感性。解决方案的关键在于提出一个分层SSL框架,该框架同时学习节点、边和图级别的嵌入表示,并构建了一个可控的合成基准以模拟连接组的拓扑特性。研究发现,基于不变性的SSL目标实际上与基准的拓扑属性不匹配,导致模型忽略社区结构(community structure),从而显著劣于传统拓扑感知启发式方法。这一结果揭示了将通用图SSL直接应用于神经影像数据时的根本缺陷,并强调未来神经人工智能(Neuro-AI)研究需设计显式奖励结构保留(如模块性或基序)的新SSL目标。

链接: https://arxiv.org/abs/2602.03217
作者: May Kristine Jonson Carlon,Su Myat Noe,Haojiong Wang,Yasuo Kuniyoshi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding how local interactions give rise to global brain organization requires models that can represent information across multiple scales. We introduce a hierarchical self-supervised learning (SSL) framework that jointly learns node-, edge-, and graph-level embeddings, inspired by multimodal neuroimaging. We construct a controllable synthetic benchmark mimicking the topological properties of connectomes. Our four-stage evaluation protocol reveals a critical failure: the invariance-based SSL model is fundamentally misaligned with the benchmark’s topological properties and is catastrophically outperformed by classical, topology-aware heuristics. Ablations confirm an objective mismatch: SSL objectives designed to be invariant to topological perturbations learn to ignore the very community structure that classical methods exploit. Our results expose a fundamental pitfall in applying generic graph SSL to connectome-like data. We present this framework as a cautionary case study, highlighting the need for new, topology-aware SSL objectives for neuro-AI research that explicitly reward the preservation of structure (e.g., modularity or motifs).
zh

[AI-71] Lookahead Sample Reward Guidance for Test-Time Scaling of Diffusion Models

【速读】:该论文旨在解决扩散模型(Diffusion Models)生成样本与人类意图对齐不足的问题,即生成结果在语义或偏好上难以满足用户期望。现有梯度引导方法通过泰勒展开近似中间粒子 \mathbfx_t 处的预期未来奖励(Expected Future Reward, EFR),但因每一步都需要神经网络反向传播,导致计算成本高昂。论文提出的关键解决方案是:基于预训练扩散模型的边际采样重构EFR表达式,从而消除\mathbfx_t与EFR之间的神经依赖关系,实现无需反向传播的闭式引导计算;进一步引入前瞻采样(lookahead sampling)高效获取边际样本,并结合高精度求解器引导粒子向高奖励区域移动,形成名为LiDAR的采样策略。该方法仅需3个样本和3步前瞻求解即可显著提升性能,且相较最新梯度引导方法在SDXL模型上达到相同GenEval指标的同时实现9.5倍加速。

链接: https://arxiv.org/abs/2602.03211
作者: Yeongmin Kim,Donghyeok Shin,Byeonghu Na,Minsang Park,Richard Lee Kim,Il-Chul Moon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:Diffusion models have demonstrated strong generative performance; however, generated samples often fail to fully align with human intent. This paper studies a test-time scaling method that enables sampling from regions with higher human-aligned reward values. Existing gradient guidance methods approximate the expected future reward (EFR) at an intermediate particle \mathbfx_t using a Taylor approximation, but this approximation at each time step incurs high computational cost due to sequential neural backpropagation. We show that the EFR at any \mathbfx_t can be computed using only marginal samples from a pre-trained diffusion model. The proposed EFR formulation detaches the neural dependency between \mathbfx_t and the EFR, enabling closed-form guidance computation without neural backpropagation. To further improve efficiency, we introduce lookahead sampling to collect marginal samples. For final sample generation, we use an accurate solver that guides particles toward high-reward lookahead samples. We refer to this sampling scheme as LiDAR sampling. LiDAR achieves substantial performance improvements using only three samples with a 3-step lookahead solver, exhibiting steep performance gains as lookahead accuracy and sample count increase; notably, it reaches the same GenEval performance as the latest gradient guidance method for SDXL with a 9.5x speedup.
zh

[AI-72] Reinforcement Learning with Promising Tokens for Large Language Models

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在对齐和优化大语言模型(Large Language Models, LLMs)过程中,因动作空间过大而导致的训练效率低、梯度方差高以及策略聚焦困难的问题。具体而言,标准RL方法将整个词汇表作为动作空间,其中包含大量语境无关的token,干扰了策略在合理决策上的学习。解决方案的关键在于提出一种名为“有希望token的强化学习”(Reinforcement Learning with Promising Tokens, RLPT)的新框架:通过利用基础模型的语义先验动态识别出一组“有希望的token”,并使用掩码机制将策略优化限制在此精炼子集内,从而实现战略决策与token生成的解耦。这一设计显著降低了梯度方差,提升了训练稳定性与样本效率,并在数学推理、代码生成和电信领域任务中验证了其有效性。

链接: https://arxiv.org/abs/2602.03195
作者: Jing-Cheng Pang,Liang Lu,Xian Tang,Kun Jiang,Sijie Wu,Kai Zhang,Xubin Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has emerged as a key paradigm for aligning and optimizing large language models (LLMs). Standard approaches treat the LLM as the policy and apply RL directly over the full vocabulary space. However, this formulation includes the massive tail of contextually irrelevant tokens in the action space, which could distract the policy from focusing on decision-making among the truly reasonable tokens. In this work, we verify that valid reasoning paths could inherently concentrate within a low-rank subspace. Based on this insight, we introduce Reinforcement Learning with Promising Tokens (RLPT), a framework that mitigates the action space issue by decoupling strategic decision-making from token generation. Specifically, RLPT leverages the semantic priors of the base model to identify a dynamic set of \emphpromising tokens and constrains policy optimization exclusively to this refined subset via masking. Theoretical analysis and empirical results demonstrate that RLPT effectively reduces gradient variance, stabilizes the training process, and improves sample efficiency. Experiment results on math, coding, and telecom reasoning show that RLPT outperforms standard RL baselines and integrates effectively across various model sizes (4B and 8B) and RL algorithms (GRPO and DAPO).
zh

[AI-73] MemCast: Memory-Driven Time Series Forecasting with Experience-Conditioned Reasoning

【速读】:该论文旨在解决现有基于大语言模型(Large Language Models, LLM)的时间序列预测(Time Series Forecasting, TSF)方法在经验积累和持续演化能力上的不足问题。其解决方案的关键在于提出了一种名为MemCast的学习记忆框架,将TSF重构为一种受经验条件约束的推理任务:通过层次化记忆结构组织训练数据中的历史模式、推理轨迹与时间特征,分别对应经验总结、推理智慧与通用规律;在推理阶段利用这些记忆模块引导路径选择与迭代优化,并设计动态置信度自适应策略实现无测试分布泄露的持续进化能力。

链接: https://arxiv.org/abs/2602.03164
作者: Xiaoyu Tao,Mingyue Cheng,Ze Guo,Shuo Yu,Yaguo Liu,Qi Liu,Shijin Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Time series forecasting (TSF) plays a critical role in decision-making for many real-world applications. Recently, LLM-based forecasters have made promising advancements. Despite their effectiveness, existing methods often lack explicit experience accumulation and continual evolution. In this work, we propose MemCast, a learning-to-memory framework that reformulates TSF as an experience-conditioned reasoning task. Specifically, we learn experience from the training set and organize it into a hierarchical memory. This is achieved by summarizing prediction results into historical patterns, distilling inference trajectories into reasoning wisdom, and inducing extracted temporal features into general laws. Furthermore, during inference, we leverage historical patterns to guide the reasoning process and utilize reasoning wisdom to select better trajectories, while general laws serve as criteria for reflective iteration. Additionally, to enable continual evolution, we design a dynamic confidence adaptation strategy that updates the confidence of individual entries without leaking the test set distribution. Extensive experiments on multiple datasets demonstrate that MemCast consistently outperforms previous methods, validating the effectiveness of our approach. Our code is available at this https URL.
zh

[AI-74] Intelligent Front-End Personalization: AI-Driven UI Adaptation

【速读】:该论文旨在解决传统前端个性化(Front-end Personalization)依赖静态设计或基于规则的适配方式,难以充分捕捉用户行为模式的问题。其解决方案的关键在于引入生成式 AI (Generative AI) 驱动的动态前端个性化机制,通过三种核心策略实现:基于用户路径预测的动态布局自适应、利用强化学习的内容优先级排序,以及对AI驱动与规则驱动方案的对比分析,从而在实时性与个性化精度上显著提升用户体验和系统性能。

链接: https://arxiv.org/abs/2602.03154
作者: Mona Rajhans
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: To be published in proceedings of IEEE ACDSA 2026

点击查看摘要

Abstract:Front-end personalization has traditionally relied on static designs or rule-based adaptations, which fail to fully capture user behavior patterns. This paper presents an AI driven approach for dynamic front-end personalization, where UI layouts, content, and features adapt in real-time based on predicted user behavior. We propose three strategies: dynamic layout adaptation using user path prediction, content prioritization through reinforcement learning, and a comparative analysis of AI-driven vs. rule-based personalization. Technical implementation details, algorithms, system architecture, and evaluation methods are provided to illustrate feasibility and performance gains.
zh

[AI-75] Enhancing Foundation VLM Robustness to Missing Modality: Scalable Diffusion for Bi-directional Feature Restoration

【速读】:该论文旨在解决视觉语言模型(Vision Language Models, VLMs)在推理阶段面临模态缺失时性能显著下降的问题,尤其是现有方法在恢复缺失模态特征时存在语义不一致或引入无关噪声的局限性。其解决方案的关键在于提出一种通用的缺失模态恢复策略,核心创新包括:(I) 动态模态门控(Dynamic Modality Gating),通过自适应利用条件特征引导生成语义一致的缺失特征;(II) 跨模态互学习机制(Cross-Modal Mutual Learning),通过桥接双编码器的语义空间实现双向对齐,从而在保持VLM泛化能力的同时精准还原缺失信息。

链接: https://arxiv.org/abs/2602.03151
作者: Wei Dai,Haoyu Wang,Honghao Chang,Lijun He,Fan Li,Jian Sun,Haixia Bi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages

点击查看摘要

Abstract:Vision Language Models (VLMs) typically assume complete modality input during inference. However, their effectiveness drops sharply when certain modalities are unavailable or incomplete. Current research primarily faces two dilemmas: Prompt-based methods struggle to restore missing yet indispensable features and impair generalization of VLMs. Imputation-based approaches, lacking effective guidance, are prone to generating semantically irrelevant noise. Restoring precise semantics while sustaining VLM generalization remains challenging. Therefore, we propose a general missing modality restoration strategy in this paper. We introduce an enhanced diffusion model as a pluggable mid-stage training module to effectively restore missing features. Our strategy introduces two key innovations: (I) Dynamic Modality Gating, which adaptively leverages conditional features to steer the generation of semantically consistent features; (II) Cross-Modal Mutual Learning mechanism, which bridges the semantic spaces of dual encoders to achieve bidirectional alignment. Zero-shot evaluations across benchmark datasets demonstrate that our approach outperforms existing baseline methods. Extensive experiments and ablation studies confirm our model as a robust and scalable extension for VLMs in missing modality scenarios, ensuring reliability across diverse missing rates and environments. Our code and models will be publicly available.
zh

[AI-76] General Agents Contain World Models even under Partial Observability and Stochasticity

【速读】:该论文旨在解决如何判断一个智能体是否在其内部构建了对周围环境的模型这一基础性问题,从而深入理解其能力与局限。此前的研究仅适用于确定性智能体和完全可观测环境,而本文通过将理论扩展至随机性智能体(stochastic agents)和部分可观测环境(partially observable environments),证明即使在更一般化的设定下,几乎最优且具有一定通用性的智能体仍不可避免地学习并内化其环境结构。解决方案的关键在于:利用随机化机制本身无法规避环境建模的事实,即随机策略本质上仍需依赖对环境统计规律的认知;同时,通过弱化“通用性”假设,进一步证明即便是较弱的智能体也已包含对其所处世界的某种模型表示,从而强化了原结论的普适性和实用性。

链接: https://arxiv.org/abs/2602.03146
作者: Santiago Cifuentes
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 19 pages, 4 figures

点击查看摘要

Abstract:Deciding whether an agent possesses a model of its surrounding world is a fundamental step toward understanding its capabilities and limitations. In [10], it was shown that, within a particular framework, every almost optimal and general agent necessarily contains sufficient knowledge of its environment to allow an approximate reconstruction of it by querying the agent as a black box. This result relied on the assumptions that the agent is deterministic and that the environment is fully observable. In this work, we remove both assumptions by extending the theorem to stochastic agents operating in partially observable environments. Fundamentally, this shows that stochastic agents cannot avoid learning their environment through the usage of randomization. We also strengthen the result by weakening the notion of generality, proving that less powerful agents already contain a model of the world in which they operate. Comments: 19 pages, 4 figures Subjects: Artificial Intelligence (cs.AI) MSC classes: 68T42 Cite as: arXiv:2602.03146 [cs.AI] (or arXiv:2602.03146v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2602.03146 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-77] Internet of Agent ic AI: Incentive-Compatible Distributed Teaming and Workflow

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的智能体系统(agentic AI systems)普遍存在的集中式、单体架构所导致的可扩展性差、专业化程度低及互操作性弱的问题。其核心解决方案是提出一种面向可扩展智能体智能的框架——“智能体互联网”(Internet of Agentic AI),通过分布于云边端基础设施上的异构自主智能体动态形成联盟来执行任务驱动的工作流,关键在于构建了一个网络原生的协作建模机制,并引入一个激励相容的工作流-联盟可行性框架,该框架融合了能力覆盖度、网络局部性和经济可实施性三个维度,同时设计了一种最小努力联盟选择问题与去中心化联盟形成算法,从而实现高效、弹性且经济可行的智能体协同。

链接: https://arxiv.org/abs/2602.03145
作者: Ya-Ting Yang,Quanyan Zhu
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have enabled a new class of agentic AI systems that reason, plan, and act by invoking external tools. However, most existing agentic architectures remain centralized and monolithic, limiting scalability, specialization, and interoperability. This paper proposes a framework for scalable agentic intelligence, termed the Internet of Agentic AI, in which autonomous, heterogeneous agents distributed across cloud and edge infrastructure dynamically form coalitions to execute task-driven workflows. We formalize a network-native model of agentic collaboration and introduce an incentive-compatible workflow-coalition feasibility framework that integrates capability coverage, network locality, and economic implementability. To enable scalable coordination, we formulate a minimum-effort coalition selection problem and propose a decentralized coalition formation algorithm. The proposed framework can operate as a coordination layer above the Model Context Protocol (MCP). A healthcare case study demonstrates how domain specialization, cloud-edge heterogeneity, and dynamic coalition formation enable scalable, resilient, and economically viable agentic workflows. This work lays the foundation for principled coordination and scalability in the emerging era of Internet of Agentic AI.
zh

[AI-78] Contrastive Concept-Tree Search for LLM -Assisted Algorithm Discovery

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)辅助算法发现过程中,如何最大化利用LLM对可能程序空间的内部表征以提升搜索效率的问题。其解决方案的关键在于提出对比概念树搜索(Contrastive Concept-Tree Search, CCTS),该方法从生成的程序中提取层次化的概念表示,并学习一个对比概念模型来指导父节点选择;通过使用高绩效与低绩效解之间的似然比分数重加权父节点,CCTS 使搜索偏向于有用的组合概念并规避误导性概念,从而在显式概念层次结构上提供引导,而非依赖LLM构建的算法谱系。

链接: https://arxiv.org/abs/2602.03132
作者: Timothee Leleu,Sudeera Gunathilaka,Federico Ghimenti,Surya Ganguli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Large language Model (LLM)-assisted algorithm discovery is an iterative, black-box optimization process over programs to approximatively solve a target task, where an LLM proposes candidate programs and an external evaluator provides task feedback. Despite intense recent research on the topic and promising results, how can the LLM internal representation of the space of possible programs be maximally exploited to improve performance is an open question. Here, we introduce Contrastive Concept-Tree Search (CCTS), which extracts a hierarchical concept representation from the generated programs and learns a contrastive concept model that guides parent selection. By reweighting parents using a likelihood-ratio score between high- and low-performing solutions, CCTS biases search toward useful concept combinations and away from misleading ones, providing guidance through an explicit concept hierarchy rather than the algorithm lineage constructed by the LLM. We show that CCTS improves search efficiency over fitness-based baselines and produces interpretable, task-specific concept trees across a benchmark of open Erdős-type combinatorics problems. Our analysis indicates that the gains are driven largely by learning which concepts to avoid. We further validate these findings in a controlled synthetic algorithm-discovery environment, which reproduces qualitatively the search dynamics observed with the LLMs.
zh

[AI-79] Understanding Multi-Agent LLM Frameworks: A Unified Benchmark and Experimental Analysis

【速读】:该论文旨在解决多智能体大语言模型(Multi-agent LLM)框架在系统性能方面缺乏统一评估标准的问题,特别是架构设计如何显著影响延迟、吞吐量、准确性及可扩展性等关键指标。现有基准测试通常仅关注单一能力,且未在框架层面进行受控评估,导致难以量化不同架构选择的实际影响。解决方案的关键在于:(i) 提出一个用于系统比较多智能体LLM框架的架构分类法(architectural taxonomy),以从基础维度上结构化地分析差异;(ii) 构建MAFBench——一个集成现有基准的标准化执行套件,支持在受控条件下联合评估包括编排开销、内存行为、规划能力、专业化和协调机制在内的多种核心能力。通过这一方法,研究揭示了框架级设计选择可使延迟增加超过100倍、规划准确率下降达30%、协调成功率从90%以上降至30%以下,从而为架构设计原则与框架选型提供实证依据。

链接: https://arxiv.org/abs/2602.03128
作者: Abdelghny Orogat,Ana Rostam,Essam Mansour
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 25 pages, 9 figures and 13 tables; introduces MAFBench unified multi-agent evaluation suite

点击查看摘要

Abstract:Multi-agent LLM frameworks are widely used to accelerate the development of agent systems powered by large language models (LLMs). These frameworks impose distinct architectural structures that govern how agents interact, store information, and coordinate tasks. However, their impact on system performance remains poorly understood. This gap is critical, as architectural choices alone can induce order-of-magnitude differences in latency and throughput, as well as substantial variation in accuracy and scalability. Addressing this challenge requires (i) jointly evaluating multiple capabilities, such as orchestration overhead, memory behavior, planning, specialization, and coordination, and (ii) conducting these evaluations under controlled, framework-level conditions to isolate architectural effects. Existing benchmarks focus on individual capabilities and lack standardized framework-level evaluation. We address these limitations by (i) introducing an architectural taxonomy for systematically comparing multi-agent LLM frameworks along fundamental dimensions, and (ii) developing MAFBench, a unified evaluation suite that integrates existing benchmarks under a standardized execution pipeline. Using MAFBench, we conduct a controlled empirical study across several widely used frameworks. Our results show that framework-level design choices alone can increase latency by over 100x, reduce planning accuracy by up to 30%, and lower coordination success from above 90% to below 30%. Finally, we translate our findings into concrete architectural design principles and framework selection guidance, and outline promising future research directions.
zh

[AI-80] Quantized Evolution Strategies: High-precision Fine-tuning of Quantized LLM s at Low-precision Cost

【速读】:该论文旨在解决量化后的大语言模型(Large Language Models, LLMs)在部署到内存受限设备时面临的静态性问题,即传统微调方法(如基于反向传播的监督学习或强化学习)无法在低精度权重空间中有效执行,因为量化后的参数空间是离散且不可微的。为突破这一限制,论文提出了一种名为量化进化策略(Quantized Evolution Strategies, QES)的新优化范式,其关键创新在于:(1) 引入累积误差反馈机制以保留高精度梯度信号,从而缓解因量化导致的梯度消失或不准确问题;(2) 采用无状态种子重放(stateless seed replay)技术,将内存占用降至与低精度推理相当的水平,实现全参数微调在量化空间中的高效运行。QES显著优于当前最先进的零阶微调方法,在算术推理任务上表现优异,首次实现了对量化模型的直接端到端微调,为在纯量化空间中扩展大语言模型提供了可行路径。

链接: https://arxiv.org/abs/2602.03120
作者: Yinggan Xu,Risto Miikkulainen,Xin Qiu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint version

点击查看摘要

Abstract:Post-Training Quantization (PTQ) is essential for deploying Large Language Models (LLMs) on memory-constrained devices, yet it renders models static and difficult to fine-tune. Standard fine-tuning paradigms, including Reinforcement Learning (RL), fundamentally rely on backpropagation and high-precision weights to compute gradients. Thus they cannot be used on quantized models, where the parameter space is discrete and non-differentiable. While Evolution Strategies (ES) offer a backpropagation-free alternative, optimization of the quantized parameters can still fail due to vanishing or inaccurate gradient. This paper introduces Quantized Evolution Strategies (QES), an optimization paradigm that performs full-parameter fine-tuning directly in the quantized space. QES is based on two innovations: (1) it integrates accumulated error feedback to preserve high-precision gradient signals, and (2) it utilizes a stateless seed replay to reduce memory usage to low-precision inference levels. QES significantly outperforms the state-of-the-art zeroth-order fine-tuning method on arithmetic reasoning tasks, making direct fine-tuning for quantized models possible. It therefore opens up the possibility for scaling up LLMs entirely in the quantized space. The source code is available at this https URL .
zh

[AI-81] Digital Lifelong Learning in the Age of AI: Trends and Insights

【速读】:该论文旨在解决数字学习在后疫情时代如何持续演化以满足成人及终身学习者需求的问题,特别是聚焦于不同人口统计群体对数字学习平台的使用差异、游戏化机制的有效性以及人工智能(AI)技术的整合策略。其解决方案的关键在于通过收集200名受访者的大规模问卷数据并结合高级分析方法,揭示了疫情后数字学习感知相关性的显著提升,尤其体现在年轻成年人和女性群体中,并指出大语言模型(LLM)驱动的AI工具在实现个性化学习支持方面发挥核心作用,从而为教育机构、政策制定者和企业提供可操作的优化建议。

链接: https://arxiv.org/abs/2602.03114
作者: Geeta Puri,Nachamma Socklingam,Dorien Herremans
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 41 pages including references, appendix, 14 figures

点击查看摘要

Abstract:Rapid innovations in AI and large language models (LLMs) have accelerated the adoption of digital learning, particularly beyond formal education. What began as an emergency response during COVID-19 has shifted from a supplementary resource to an essential pillar of education. Understanding how digital learning continues to evolve for adult and lifelong learners is therefore increasingly important. This study examines how various demographics interact with digital learning platforms, focusing on the learner motivations, the effectiveness of gamification in digital learning, and the integration of AI. Using multi survey data from 200 respondents and advanced analytics, our findings reveal a notable increase in the perceived relevance of digital learning after the pandemic, especially among young adults and women, coinciding with the rise of LLM-powered AI tools that support personalized learning. We aim to provide actionable insights for businesses, government policymakers, and educators seeking to optimize their digital learning offerings to meet evolving workforce needs. Comments: 41 pages including references, appendix, 14 figures Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI) ACMclasses: K.3.1 Cite as: arXiv:2602.03114 [cs.CY] (or arXiv:2602.03114v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2602.03114 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-82] “Im happy even though its not real”: GenAI Photo Editing as a Remembering Experience

【速读】:该论文试图解决的问题是:生成式 AI(Generative AI)在个人照片编辑中的应用如何影响用户的记忆体验,特别是当用户利用 GenAI 工具修改照片时,其行为和结果如何重塑对过往事件的回忆过程。解决方案的关键在于通过两阶段定性研究——首先引导参与者使用 GenAI 工具基于“记忆体验”(Remembering Experience, RX)维度进行照片编辑,随后开展半结构化访谈——揭示出用户在编辑过程中更重视情感层面的记忆感受而非事实准确性,且对人物身份相关的编辑持高度敏感态度;同时发现编辑行为本身即成为一种记忆重构的过程,从而为负责任的 GenAI 设计提供实证依据与设计启示。

链接: https://arxiv.org/abs/2602.03104
作者: Yufeng Wu,Qing Li,Elise van den Hoven,Baki Kocaballi
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative Artificial Intelligence (GenAI) is increasingly integrated into photo applications on personal devices, making editing photographs easier than ever while potentially influencing the memories they represent. This study explores how and why people use GenAI to edit personal photos and how this shapes their remembering experience. We conducted a two-phase qualitative study with 12 participants: a photo editing session using a GenAI tool guided by the Remembering Experience (RX) dimensions, followed by semi-structured interviews where participants reflected on the editing process and results. Findings show that participants prioritised felt memory over factual accuracy. For different photo elements, environments were modified easily, however, editing was deemed unacceptable if it touched upon a person’s identity. Editing processes brought positive and negative impacts, and itself also became a remembering experience. We further discuss potential benefits and risks of GenAI editing for remembering purposes and propose design implications for responsible GenAI.
zh

[AI-83] Risky-Bench: Probing Agent ic Safety Risks under Real-World Deployment

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)作为智能体(agent)在真实世界环境中部署时,因缺乏系统性安全评估框架而导致的安全风险覆盖不足与适应性差的问题。现有评估方法局限于特定场景下的风险任务,难以全面刻画长期交互中涌现的安全风险,且无法跨不同代理配置进行迁移。其解决方案的关键在于提出Risky-Bench框架,该框架基于领域无关的安全原则构建情境感知的安全评价标准(rubrics),从而定义出可扩展的安全风险空间,并通过在不同威胁假设下执行现实任务来系统化地评估该空间中的安全风险,具备良好的可移植性和结构化评估能力,适用于多种部署场景的定制化安全评估。

链接: https://arxiv.org/abs/2602.03100
作者: Jingnan Zheng,Yanzhen Luo,Jingjun Xu,Bingnan Liu,Yuxin Chen,Chenhang Cui,Gelei Deng,Chaochao Lu,Xiang Wang,An Zhang,Tat-Seng Chua
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly deployed as agents that operate in real-world environments, introducing safety risks beyond linguistic harm. Existing agent safety evaluations rely on risk-oriented tasks tailored to specific agent settings, resulting in limited coverage of safety risk space and failing to assess agent safety behavior during long-horizon, interactive task execution in complex real-world deployments. Moreover, their specialization to particular agent settings limits adaptability across diverse agent configurations. To address these limitations, we propose Risky-Bench, a framework that enables systematic agent safety evaluation grounded in real-world deployment. Risky-Bench organizes evaluation around domain-agnostic safety principles to derive context-aware safety rubrics that delineate safety space, and systematically evaluates safety risks across this space through realistic task execution under varying threat assumptions. When applied to life-assist agent settings, Risky-Bench uncovers substantial safety risks in state-of-the-art agents under realistic execution conditions. Moreover, as a well-structured evaluation pipeline, Risky-Bench is not confined to life-assist scenarios and can be adapted to other deployment settings to construct environment-specific safety evaluations, providing an extensible methodology for agent safety assessment.
zh

[AI-84] xtME: Bridging Unseen Modalities Through Text Descriptions

【速读】:该论文旨在解决多模态表示扩展中对大规模成对数据集(如文本-图像、文本-音频等)的依赖问题,这类数据在医学影像和分子分析等需要专家标注的领域往往难以获取。其解决方案的关键在于提出TextME框架,首次实现仅通过文本描述即可将多种模态(图像、视频、音频、3D、X光、分子等)投影到预训练语言模型(LLM)嵌入空间作为统一锚点,利用预训练对比编码器的几何结构实现零样本跨模态迁移,从而无需成对监督信号即可保持预训练编码器的性能,并支持未显式对齐的模态对(如音频到图像、3D到图像)之间的涌现式跨模态检索。

链接: https://arxiv.org/abs/2602.03098
作者: Soyeon Hong,Jinchan Kim,Jaegook You,Seungtaek Choi,Suha Kwak,Hyunsouk Cho
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Expanding multimodal representations to novel modalities is constrained by reliance on large-scale paired datasets (e.g., text-image, text-audio, text-3D, text-molecule), which are costly and often infeasible in domains requiring expert annotation such as medical imaging and molecular analysis. We introduce TextME, the first text-only modality expansion framework, to the best of our knowledge, projecting diverse modalities into LLM embedding space as a unified anchor. Our approach exploits the geometric structure of pretrained contrastive encoders to enable zero-shot cross-modal transfer using only text descriptions, without paired supervision. We empirically validate that such consistent modality gaps exist across image, video, audio, 3D, X-ray, and molecular domains, demonstrating that text-only training can preserve substantial performance of pretrained encoders. We further show that our framework enables emergent cross-modal retrieval between modality pairs not explicitly aligned during training (e.g., audio-to-image, 3D-to-image). These results establish text-only training as a practical alternative to paired supervision for modality expansion.
zh

[AI-85] De-conflating Preference and Qualification: Constrained Dual-Perspective Reasoning for Job Recommendation with Large Language Models

【速读】:该论文旨在解决职业推荐中偏好(preference)与资格(qualification)两个决策维度在现有方法中被混淆的问题,这种混淆导致在招聘漏斗(recruitment-funnel) censoring 下监督信号失真,并限制了策略可控性。解决方案的关键在于提出 JobRec 框架,通过约束的双视角推理实现偏好与资格的解耦:首先引入统一语义对齐(Unified Semantic Alignment Schema)将候选人和职位属性映射到结构化的语义层;其次采用两阶段协同训练策略学习独立的专家模型分别预测偏好与资格;最后利用基于拉格朗日(Lagrangian-based)的策略对齐模块,在显式资格约束下优化推荐结果,从而实现可控制的权衡。

链接: https://arxiv.org/abs/2602.03097
作者: Bryce Kan,Wei Yang,Emily Nguyen,Ganghui Yi,Bowen Yi,Chenxiao Yu,Yan Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Professional job recommendation involves a complex bipartite matching process that must reconcile a candidate’s subjective preference with an employer’s objective qualification. While Large Language Models (LLMs) are well-suited for modeling the rich semantics of resumes and job descriptions, existing paradigms often collapse these two decision dimensions into a single interaction signal, yielding confounded supervision under recruitment-funnel censoring and limiting policy controllability. To address these challenges, We propose JobRec, a generative job recommendation framework for de-conflating preference and qualification via constrained dual-perspective reasoning. JobRec introduces a Unified Semantic Alignment Schema that aligns candidate and job attributes into structured semantic layers, and a Two-Stage Cooperative Training Strategy that learns decoupled experts to separately infer preference and qualification. Building on these experts, a Lagrangian-based Policy Alignment module optimizes recommendations under explicit eligibility requirements, enabling controllable trade-offs. To mitigate data scarcity, we construct a synthetic dataset refined by experts. Experiments show that JobRec consistently outperforms strong baselines and provides improved controllability for strategy-aware professional matching.
zh

[AI-86] PRISM: Structured Optimization via Anisotropic Spectral Shaping

【速读】:该论文旨在解决第一阶谱下降方法(first-order spectral descent methods)在优化过程中缺乏对目标函数曲率信息利用的问题,从而导致收敛效率受限。其解决方案的关键在于提出PRISM优化器,通过创新增强的极分解(innovation-augmented polar decomposition)构建一个高效、低秩的拟二阶预条件矩阵,实现各向异性的谱形变(anisotropic spectral shaping),即自适应抑制高方差子空间中的更新,同时保留信号主导方向上的更新强度。该机制在不增加额外内存开销的前提下,实现了与一阶基线相当的计算成本,但显著提升了对曲率的适应能力,为谱优化范式引入了曲率自适应特性。

链接: https://arxiv.org/abs/2602.03096
作者: Yujie Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose PRISM, an optimizer that enhances first-order spectral descent methods like Muon with partial second-order information. It constructs an efficient, low-rank quasi-second-order preconditioner via innovation-augmented polar decomposition. This mechanism enables PRISM to perform anisotropic spectral shaping, which adaptively suppresses updates in high-variance subspaces while preserving update strength in signal-dominated directions. Crucially, this is achieved with minimal computational overhead and zero additional memory compared to first-order baselines. PRISM demonstrates a practical strategy for integrating curvature-adaptive properties into the spectral optimization paradigm.
zh

[AI-87] raining and Simulation of Quadrupedal Robot in Adaptive Stair Climbing for Indoor Firefighting: An End-to-End Reinforcement Learning Approach

【速读】:该论文旨在解决四足机器人在复杂室内环境中进行初级搜救任务时面临的两大核心挑战:一是如何提升在复杂结构中的情境感知能力,二是如何实现跨不同楼梯类型(如直梯、L型梯和螺旋梯)的快速爬升能力。解决方案的关键在于提出一种两阶段端到端深度强化学习(Deep Reinforcement Learning, DRL)框架,首先在抽象金字塔楼梯地形中训练四足机器人Unitree Go2掌握基础爬楼技能,随后将策略迁移至真实室内楼梯场景,从而实现从简单到复杂的技能泛化;同时引入基于中心线(centerline-based)的导航建模方法,在无需分层规划的前提下统一优化导航与运动控制,并仅依赖局部高度图感知即实现了对多样化楼梯结构的有效适应与鲁棒执行。

链接: https://arxiv.org/abs/2602.03087
作者: Baixiao Huang,Baiyu Huang,Yu Hou
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 9 figures, 43rd International Symposium on Automation and Robotics in Construction

点击查看摘要

Abstract:Quadruped robots are used for primary searches during the early stages of indoor fires. A typical primary search involves quickly and thoroughly looking for victims under hazardous conditions and monitoring flammable materials. However, situational awareness in complex indoor environments and rapid stair climbing across different staircases remain the main challenges for robot-assisted primary searches. In this project, we designed a two-stage end-to-end deep reinforcement learning (RL) approach to optimize both navigation and locomotion. In the first stage, the quadrupeds, Unitree Go2, were trained to climb stairs in Isaac Lab’s pyramid-stair terrain. In the second stage, the quadrupeds were trained to climb various realistic indoor staircases in the Isaac Lab engine, with the learned policy transferred from the previous stage. These indoor staircases are straight, L-shaped, and spiral, to support climbing tasks in complex environments. This project explores how to balance navigation and locomotion and how end-to-end RL methods can enable quadrupeds to adapt to different stair shapes. Our main contributions are: (1) A two-stage end-to-end RL framework that transfers stair-climbing skills from abstract pyramid terrain to realistic indoor stair topologies. (2) A centerline-based navigation formulation that enables unified learning of navigation and locomotion without hierarchical planning. (3) Demonstration of policy generalization across diverse staircases using only local height-map perception. (4) An empirical analysis of success, efficiency, and failure modes under increasing stair difficulty.
zh

[AI-88] he Trigger in the Haystack: Extracting and Reconstructing LLM Backdoor Triggers

【速读】:该论文致力于解决生成式 AI(Generative AI)模型中“睡帯代理型”后门(sleeper agent-style backdoors)的检测问题,即在不掌握触发词(trigger)或目标行为先验知识的前提下,识别因果语言模型(causal language models)是否被植入隐蔽的后门攻击。解决方案的关键在于两个核心发现:一是睡帯代理倾向于记忆中毒数据,从而可通过记忆提取技术泄露后门样本;二是中毒模型在输入包含触发词时,其输出分布和注意力头(attention heads)表现出独特模式。基于此,作者提出一种仅需推理操作、无需修改模型且可扩展的扫描方法,能有效恢复多种后门场景下的工作触发词,且无缝集成于现有防御体系中。

链接: https://arxiv.org/abs/2602.03085
作者: Blake Bullwinkel,Giorgio Severi,Keegan Hines,Amanda Minnich,Ram Shankar Siva Kumar,Yonatan Zunger
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Detecting whether a model has been poisoned is a longstanding problem in AI security. In this work, we present a practical scanner for identifying sleeper agent-style backdoors in causal language models. Our approach relies on two key findings: first, sleeper agents tend to memorize poisoning data, making it possible to leak backdoor examples using memory extraction techniques. Second, poisoned LLMs exhibit distinctive patterns in their output distributions and attention heads when backdoor triggers are present in the input. Guided by these observations, we develop a scalable backdoor scanning methodology that assumes no prior knowledge of the trigger or target behavior and requires only inference operations. Our scanner integrates naturally into broader defensive strategies and does not alter model performance. We show that our method recovers working triggers across multiple backdoor scenarios and a broad range of models and fine-tuning methods.
zh

[AI-89] FlashSinkhorn: IO-Aware Entropic Optimal Transport

【速读】:该论文旨在解决基于平方欧几里得距离的熵正则最优传输(Entropic Optimal Transport, EOT)在GPU上大规模计算时的效率瓶颈问题,尤其是现有在线后端因依赖通用分块map-reduce归约核而导致融合能力有限、显存带宽(HBM IO)消耗过高。其解决方案的关键在于将稳定化的对数域Sinkhorn更新重写为行-wise的LogSumExp归约形式,该形式等价于Transformer注意力中的归一化操作,从而支持类似FlashAttention的融合与分块策略:通过融合Triton内核实现片上SRAM流式传输和单次遍历更新对偶势函数,在保持线性内存复杂度的同时显著减少每轮迭代的HBM数据访问量,并进一步提供用于运输应用的流式内核以支持一阶和二阶优化,最终在A100 GPU上实现了高达32倍的前向传播加速和161倍的端到端加速。

链接: https://arxiv.org/abs/2602.03067
作者: Felix X.-F. Ye,Xingjie Li,An Yu,Ming-Ching Chang,Linsong Chu,Davis Wertheimer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
备注:

点击查看摘要

Abstract:Entropic optimal transport (EOT) via Sinkhorn iterations is widely used in modern machine learning, yet GPU solvers remain inefficient at scale. Tensorized implementations suffer quadratic HBM traffic from dense n\times m interactions, while existing online backends avoid storing dense matrices but still rely on generic tiled map-reduce reduction kernels with limited fusion. We present \textbfFlashSinkhorn, an IO-aware EOT solver for squared Euclidean cost that rewrites stabilized log-domain Sinkhorn updates as row-wise LogSumExp reductions of biased dot-product scores, the same normalization as transformer attention. This enables FlashAttention-style fusion and tiling: fused Triton kernels stream tiles through on-chip SRAM and update dual potentials in a single pass, substantially reducing HBM IO per iteration while retaining linear-memory operations. We further provide streaming kernels for transport application, enabling scalable first- and second-order optimization. On A100 GPUs, FlashSinkhorn achieves up to 32\times forward-pass and 161\times end-to-end speedups over state-of-the-art online baselines on point-cloud OT, improves scalability on OT-based downstream tasks. For reproducibility, we release an open-source implementation at this https URL.
zh

[AI-90] Shortcut Features as Top Eigenfunctions of NTK: A Linear Neural Network Case and More

【速读】:该论文旨在解决深度学习模型中的“捷径学习”(shortcut learning)问题,即当训练数据中某一特征占主导地位时,神经网络倾向于学习该特征而非更具泛化能力的特征。解决方案的关键在于基于神经切线核(Neural Tangent Kernel, NTK)框架对线性神经网络进行分析,定义了神经网络中的特征为NTK的本征函数,并发现:在样本分布不均衡的情况下,捷径特征对应于具有较大特征值的本征函数;且由于簇内数据方差的存在,这些高特征值特征在训练后仍对网络输出产生显著影响,表明即使控制决策边界间距(max-margin bias),这种偏好依然存在——这说明最大间隔偏置并非导致捷径学习的唯一原因。该理论结果进一步通过两层全连接ReLU网络和ResNet-18等复杂模型得到实证扩展。

链接: https://arxiv.org/abs/2602.03066
作者: Jinwoo Lim,Suhyun Kim,Soo-Mook Moon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:One of the chronic problems of deep-learning models is shortcut learning. In a case where the majority of training data are dominated by a certain feature, neural networks prefer to learn such a feature even if the feature is not generalizable outside the training set. Based on the framework of Neural Tangent Kernel (NTK), we analyzed the case of linear neural networks to derive some important properties of shortcut learning. We defined a feature of a neural network as an eigenfunction of NTK. Then, we found that shortcut features correspond to features with larger eigenvalues when the shortcuts stem from the imbalanced number of samples in the clustered distribution. We also showed that the features with larger eigenvalues still have a large influence on the neural network output even after training, due to data variances in the clusters. Such a preference for certain features remains even when a margin of a neural network output is controlled, which shows that the max-margin bias is not the only major reason for shortcut learning. These properties of linear neural networks are empirically extended for more complex neural networks as a two-layer fully-connected ReLU network and a ResNet-18.
zh

[AI-91] Evaluating LLM s When They Do Not Know the Answer: Statistical Evaluation of Mathematical Reasoning via Comparative Signals

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在数学推理能力评估中因基准数据集规模有限和模型输出随机性导致的准确性估计方差高、模型排名不稳定的问题。其解决方案的关键在于利用模型对辅助推理链(auxiliary reasoning chains)进行成对比较所生成的可靠信号(即控制变量,control variates),并基于有效影响函数(efficient influence function, EIF)构建一个半参数一阶估计量,该估计量能达到半参数效率下界,保证严格方差降低,并支持渐近正态性以实现可靠的不确定性量化。此方法在小样本场景下显著提升了性能估计精度与模型排名稳定性,尤其适用于高噪声环境下的评估任务。

链接: https://arxiv.org/abs/2602.03061
作者: Zihan Dong,Zhixian Zhang,Yang Zhou,Can Jin,Ruijia Wu,Linjun Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Evaluating mathematical reasoning in LLMs is constrained by limited benchmark sizes and inherent model stochasticity, yielding high-variance accuracy estimates and unstable rankings across platforms. On difficult problems, an LLM may fail to produce a correct final answer, yet still provide reliable pairwise comparison signals indicating which of two candidate solutions is better. We leverage this observation to design a statistically efficient evaluation framework that combines standard labeled outcomes with pairwise comparison signals obtained by having models judge auxiliary reasoning chains. Treating these comparison signals as control variates, we develop a semiparametric estimator based on the efficient influence function (EIF) for the setting where auxiliary reasoning chains are observed. This yields a one-step estimator that achieves the semiparametric efficiency bound, guarantees strict variance reduction over naive sample averaging, and admits asymptotic normality for principled uncertainty quantification. Across simulations, our one-step estimator substantially improves ranking accuracy, with gains increasing as model output noise grows. Experiments on GPQA Diamond, AIME 2025, and GSM8K further demonstrate more precise performance estimation and more reliable model rankings, especially in small-sample regimes where conventional evaluation is pretty unstable.
zh

[AI-92] owards Considerate Embodied AI: Co-Designing Situated Multi-Site Healthcare Robots from Abstract Concepts to High-Fidelity Prototypes

【速读】:该论文旨在解决当前 embodied AI(具身人工智能)系统在高风险领域(如医疗保健)中缺乏可持续、多学科协同设计流程的问题,尤其是现有研究往往局限于单一场景和低保真度原型阶段,导致方案可推广性差且参与者想法演化路径不清晰。解决方案的关键在于通过一个为期14周的多学科协作工作坊,推动从抽象头脑风暴到高保真原型的迭代演进,并辅以教育性支持结构(educational scaffolds),使参与者深入理解现实世界权衡,从而生成更具部署可行性的解决方案。研究据此提炼出八项指导原则,强调情境适配性、社会动态响应性、预期管理与部署落地四个核心维度。

链接: https://arxiv.org/abs/2602.03054
作者: Yuanchen Bai,Ruixiang Han,Niti Parikh,Wendy Ju,Angelique Taylor
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: To appear in Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI 2026)

点击查看摘要

Abstract:Co-design is essential for grounding embodied artificial intelligence (AI) systems in real-world contexts, especially high-stakes domains such as healthcare. While prior work has explored multidisciplinary collaboration, iterative prototyping, and support for non-technical participants, few have interwoven these into a sustained co-design process. Such efforts often target one context and low-fidelity stages, limiting the generalizability of findings and obscuring how participants’ ideas evolve. To address these limitations, we conducted a 14-week workshop with a multidisciplinary team of 22 participants, centered around how embodied AI can reduce non-value-added task burdens in three healthcare settings: emergency departments, long-term rehabilitation facilities, and sleep disorder clinics. We found that the iterative progression from abstract brainstorming to high-fidelity prototypes, supported by educational scaffolds, enabled participants to understand real-world trade-offs and generate more deployable solutions. We propose eight guidelines for co-designing more considerate embodied AI: attuned to context, responsive to social dynamics, mindful of expectations, and grounded in deployment. Project Page: this https URL
zh

[AI-93] CoBA-RL: Capability-Oriented Budget Allocation for Reinforcement Learning in LLM s

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)后训练阶段中强化学习算法因固定 rollout 预算分配导致的资源利用效率低下,以及现有自适应方法依赖实例级指标(如任务通过率)而无法反映模型动态学习状态的问题。解决方案的关键在于提出 CoBA-RL 算法,其核心创新是引入基于能力导向的价值函数(Capability-Oriented Value function),用于量化不同样本的潜在训练收益,并采用基于堆结构的贪心策略实现计算资源在高价值样本上的自适应优化分配,从而在探索与利用之间取得更好平衡,显著提升模型泛化性能。

链接: https://arxiv.org/abs/2602.03048
作者: Zhiyuan Yao,Yi-Kai Zhang,Yuxin Chen,Yueqing Sun,Zishan Xu,Yu Yang,Tianhao Hu,Qi Gu,Hui Su,Xunliang Cai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key approach for enhancing LLM this http URL, standard frameworks like Group Relative Policy Optimization (GRPO) typically employ a uniform rollout budget, leading to resource inefficiency. Moreover, existing adaptive methods often rely on instance-level metrics, such as task pass rates, failing to capture the model’s dynamic learning state. To address these limitations, we propose CoBA-RL, a reinforcement learning algorithm designed to adaptively allocate rollout budgets based on the model’s evolving capability. Specifically, CoBA-RL utilizes a Capability-Oriented Value function to map tasks to their potential training gains and employs a heap-based greedy strategy to efficiently self-calibrate the distribution of computational resources to samples with high training value. Extensive experiments demonstrate that our approach effectively orchestrates the trade-off between exploration and exploitation, delivering consistent generalization improvements across multiple challenging benchmarks. These findings underscore that quantifying sample training value and optimizing budget allocation are pivotal for advancing LLM post-training efficiency.
zh

[AI-94] KANFIS A Neuro-Symbolic Framework for Interpretable and Uncertainty-Aware Learning

【速读】:该论文旨在解决传统自适应神经模糊推理系统(Adaptive Neuro-Fuzzy Inference System, ANFIS)在高维空间中因乘积型推理机制导致规则数量呈指数级增长的问题,从而引发模型结构复杂、可扩展性差和解释性不足。解决方案的关键在于提出柯尔莫哥洛夫-阿诺德神经模糊推理系统(Kolmogorov-Arnold Neuro-Fuzzy Inference System, KANFIS),其核心创新是采用加法型聚合机制,使模型参数和规则复杂度随输入维度线性增长而非指数增长;同时,KANFIS兼容一型(Type-1, T1)与区间二型(Interval Type-2, IT2)模糊逻辑系统,支持对不确定性与模糊性的显式建模,并通过稀疏掩码机制生成结构紧凑、语义清晰的规则集,实现内在可解释性与透明推理过程。

链接: https://arxiv.org/abs/2602.03034
作者: Binbin Yong,Haoran Pei,Jun Shen,Haoran Li,Qingguo Zhou,Zhao Su
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Adaptive Neuro-Fuzzy Inference System (ANFIS) was designed to combine the learning capabilities of neural network with the reasoning transparency of fuzzy logic. However, conventional ANFIS architectures suffer from structural complexity, where the product-based inference mechanism causes an exponential explosion of rules in high-dimensional spaces. We herein propose the Kolmogorov-Arnold Neuro-Fuzzy Inference System (KANFIS), a compact neuro-symbolic architecture that unifies fuzzy reasoning with additive function decomposition. KANFIS employs an additive aggregation mechanism, under which both model parameters and rule complexity scale linearly with input dimensionality rather than exponentially. Furthermore, KANFIS is compatible with both Type-1 (T1) and Interval Type-2 (IT2) fuzzy logic systems, enabling explicit modeling of uncertainty and ambiguity in fuzzy representations. By using sparse masking mechanisms, KANFIS generates compact and structured rule sets, resulting in an intrinsically interpretable model with clear rule semantics and transparent inference processes. Empirical results demonstrate that KANFIS achieves competitive performance against representative neural and neuro-fuzzy baselines.
zh

[AI-95] Visual Reasoning over Time Series via Multi-Agent System

【速读】:该论文旨在解决现有时间序列分析方法在整合直观视觉推理能力以及跨任务泛化与自适应工具使用方面的局限性。其核心解决方案是提出MAS4TS,一个基于Analyzer-Reasoner-Executor范式的工具驱动多智能体系统,通过统一框架集成智能体通信、视觉推理与潜在空间重构;关键创新在于利用视觉语言模型(Vision-Language Model)对时间序列图进行结构化先验引导的视觉推理以提取时序结构,并在潜在空间中重建预测轨迹,同时由三个专用智能体通过共享内存和门控通信协同工作,配合路由机制选择任务特定的工具链执行,从而实现高性能、强泛化能力和高效推理。

链接: https://arxiv.org/abs/2602.03026
作者: Weilin Ruan,Yuxuan Liang
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Time series analysis underpins many real-world applications, yet existing time-series-specific methods and pretrained large-model-based approaches remain limited in integrating intuitive visual reasoning and generalizing across tasks with adaptive tool usage. To address these limitations, we propose MAS4TS, a tool-driven multi-agent system for general time series tasks, built upon an Analyzer-Reasoner-Executor paradigm that integrates agent communication, visual reasoning, and latent reconstruction within a unified framework. MAS4TS first performs visual reasoning over time series plots with structured priors using a Vision-Language Model to extract temporal structures, and subsequently reconstructs predictive trajectories in latent space. Three specialized agents coordinate via shared memory and gated communication, while a router selects task-specific tool chains for execution. Extensive experiments on multiple benchmarks demonstrate that MAS4TS achieves state-of-the-art performance across a wide range of time series tasks, while exhibiting strong generalization and efficient inference.
zh

[AI-96] Consistency Deep Equilibrium Models

【速读】:该论文旨在解决深度平衡模型(Deep Equilibrium Models, DEQs)在推理过程中因迭代求解固定点而导致的高延迟问题。其核心解决方案是提出一致性深度平衡模型(Consistency Deep Equilibrium Model, C-DEQ),关键在于将DEQ的迭代推理过程建模为沿固定常微分方程(ODE)轨迹向平衡点演化的过程,并通过一致性蒸馏训练模型,使中间状态能够直接映射到固定点,从而实现少步数推理的同时保持教师DEQ的性能;此外,该方法还支持多步评估以灵活权衡计算资源与性能提升。

链接: https://arxiv.org/abs/2602.03024
作者: Junchao Lin,Zenan Ling,Jingwen Xu,Robert C. Qiu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep Equilibrium Models (DEQs) have emerged as a powerful paradigm in deep learning, offering the ability to model infinite-depth networks with constant memory usage. However, DEQs incur significant inference latency due to the iterative nature of fixed-point solvers. In this work, we introduce the Consistency Deep Equilibrium Model (C-DEQ), a novel framework that leverages consistency distillation to accelerate DEQ inference. We cast the DEQ iterative inference process as evolution along a fixed ODE trajectory toward the equilibrium. Along this trajectory, we train C-DEQs to consistently map intermediate states directly to the fixed point, enabling few-step inference while preserving the performance of the teacher DEQ. At the same time, it facilitates multi-step evaluation to flexibly trade computation for performance gains. Extensive experiments across various domain tasks demonstrate that C-DEQs achieves consistent 2-20 \times accuracy improvements over implicit DEQs under the same few-step inference budget.
zh

[AI-97] STAR: Similarity-guided Teacher-Assisted Refinement for Super-Tiny Function Calling Models ICLR2026

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在函数调用任务中能力难以迁移至超小型模型的问题,以实现高效、可部署的AI代理。现有方法常面临过拟合、训练不稳定、多解任务中二元奖励信号无效以及技术协同困难等挑战。其解决方案的关键在于提出STAR框架,包含两个核心技术:一是约束知识蒸馏(Constrained Knowledge Distillation, CKD),通过引入top-k前向KL散度并加以约束,抑制置信度高的错误预测,从而保障训练稳定性并保留下游强化学习(Reinforcement Learning, RL)的探索能力;二是相似度引导的强化学习(Similarity-guided RL, Sim-RL),设计细粒度、基于相似性的连续奖励机制,以生成输出与真实标签之间的相似度作为优化信号,显著提升策略学习效率。STAR通过统一训练课程整合上述策略,在超小模型上实现了复杂函数调用任务的卓越性能,且在同类规模模型中达到当前最优水平。

链接: https://arxiv.org/abs/2602.03022
作者: Jiliang Ni,Jiachen Pu,Zhongyi Yang,Jingfeng Luo,Conggang Hu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: The paper has been accepted to ICLR 2026

点击查看摘要

Abstract:The proliferation of Large Language Models (LLMs) in function calling is pivotal for creating advanced AI agents, yet their large scale hinders widespread adoption, necessitating transferring their capabilities into smaller ones. However, existing paradigms are often plagued by overfitting, training instability, ineffective binary rewards for multi-solution tasks, and the difficulty of synergizing techniques. We introduce STAR: Similarity-guided Teacher-Assisted Refinement, a novel holistic framework that effectively transfers LLMs’ capabilities to super-tiny models. STAR consists of two core technical innovations: (1) Constrained Knowledge Distillation (CKD), a training objective that augments top-k forward KL divergence to suppress confidently incorrect predictions, ensuring training stability while preserving exploration capacity for downstream RL. STAR holistically synergizes these strategies within a cohesive training curriculum, enabling super-tiny models to achieve exceptional performance on complex function calling tasks; (2) Similarity-guided RL (Sim-RL), a RL mechanism that introduces a fine-grained, similarity-based reward. This provides a robust, continuous, and rich signal for better policy optimization by evaluating the similarity between generated outputs and the ground truth. Extensive experiments on challenging and renowned benchmarks demonstrate the effectiveness of our method. Our STAR models establish SOTA in their size classes, significantly outperforming baselines. Remarkably, our 0.6B STAR model achieves the best performance among all open models under 1B, surpassing even several well-known open models at a larger scale. STAR demonstrates a training framework that distills capabilities of LLMs into super-tiny models, paving the way for powerful, accessible, and efficient AI agents.
zh

[AI-98] FedKRSO: Communication and Memory Efficient Federated Fine-Tuning of Large Language Models

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中大语言模型(Large Language Models, LLMs)微调时面临的高通信与内存开销问题,同时克服现有参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法在性能上的局限性。其解决方案的关键在于提出FedKRSO(Federated K-Seed Random Subspace Optimization):客户端在服务器生成的共享随机低维子空间内进行模型更新,从而显著降低内存占用;同时,客户端仅向服务器传输沿这些子空间累积的模型更新累加器(accumulators),而非完整模型参数,实现高效的全局模型聚合与分发。该设计在保持接近全参数微调(Full Fine-Tuning, FFT)性能的同时,大幅减少了通信和内存消耗,并在通用联邦学习设置下具备严格的收敛性保证。

链接: https://arxiv.org/abs/2602.03019
作者: Guohao Yang,Tongle Wu,Yuanxiong Guo,Ying Sun,Yanmin Gong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by INFOCOM 2026

点击查看摘要

Abstract:Fine-tuning is essential to adapt general-purpose large language models (LLMs) to domain-specific tasks. As a privacy-preserving framework to leverage decentralized data for collaborative model training, Federated Learning (FL) is gaining popularity in LLM fine-tuning, but remains challenging due to the high cost of transmitting full model parameters and computing full gradients on resource-constrained clients. While Parameter-Efficient Fine-Tuning (PEFT) methods are widely used in FL to reduce communication and memory costs, they often sacrifice model performance compared to FFT. This paper proposes FedKRSO (Federated K -Seed Random Subspace Optimization), a novel method that enables communication and memory efficient FFT of LLMs in federated settings. In FedKRSO, clients update the model within a shared set of random low-dimension subspaces generated by the server to save memory usage. Furthermore, instead of transmitting full model parameters in each FL round, clients send only the model update accumulators along the subspaces to the server, enabling efficient global model aggregation and dissemination. By using these strategies, FedKRSO can substantially reduce communication and memory overhead while overcoming the performance limitations of PEFT, closely approximating the performance of federated FFT. The convergence properties of FedKRSO are analyzed rigorously under general FL settings. Extensive experiments on the GLUE benchmark across diverse FL scenarios demonstrate that FedKRSO achieves both superior performance and low communication and memory overhead, paving the way towards on federated LLM fine-tuning at the resource-constrained edge.
zh

[AI-99] CVE-Factory: Scaling Expert-Level Agent ic Tasks for Code Security Vulnerability

【速读】:该论文旨在解决代码智能体(Code Agent)安全能力评估与提升中缺乏高质量、可执行漏洞任务的问题。现有方法依赖人工复现,成本高且难以扩展,同时数据分布陈旧,无法反映当前真实威胁。其解决方案的关键在于提出 CVE-Factory——首个多智能体框架,能够自动将稀疏的CVE元数据转化为完全可执行的智能体任务,实现专家级质量:交叉验证显示其解决方案正确率达95%、环境保真度达96%,并在最新真实漏洞上达到66.2%的验证成功率。该自动化机制进一步推动了两个下游贡献:构建持续更新的 LiveCVEBench 基准(含190个任务,覆盖14种语言和153个仓库),以及合成超1000个可执行训练环境,首次实现代码安全领域智能体任务的大规模扩展。

链接: https://arxiv.org/abs/2602.03012
作者: Xianzhen Luo,Jingyuan Zhang,Shiqi Zhou,Rain Huang,Chuan Xiao,Qingfu Zhu,Zhiyuan Ma,Xing Yue,Yang Yue,Wencong Zeng,Wanxiang Che
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:Evaluating and improving the security capabilities of code agents requires high-quality, executable vulnerability tasks. However, existing works rely on costly, unscalable manual reproduction and suffer from outdated data distributions. To address these, we present CVE-Factory, the first multi-agent framework to achieve expert-level quality in automatically transforming sparse CVE metadata into fully executable agentic tasks. Cross-validation against human expert reproductions shows that CVE-Factory achieves 95% solution correctness and 96% environment fidelity, confirming its expert-level quality. It is also evaluated on the latest realistic vulnerabilities and achieves a 66.2% verified success. This automation enables two downstream contributions. First, we construct LiveCVEBench, a continuously updated benchmark of 190 tasks spanning 14 languages and 153 repositories that captures emerging threats including AI-tooling vulnerabilities. Second, we synthesize over 1,000 executable training environments, the first large-scale scaling of agentic tasks in code security. Fine-tuned Qwen3-32B improves from 5.3% to 35.8% on LiveCVEBench, surpassing Claude 4.5 Sonnet, with gains generalizing to Terminal Bench (12.5% to 31.3%). We open-source CVE-Factory, LiveCVEBench, Abacus-cve (fine-tuned model), training dataset, and leaderboard. All resources are available at this https URL .
zh

[AI-100] Distilling LLM Reasoning into Graph of Concept Predictors

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在判别性任务中因推理延迟、计算资源消耗和API成本过高而导致的部署瓶颈问题。现有知识蒸馏方法通常仅利用教师模型输出的最终标签进行训练,忽视了中间推理过程中的结构化信息,导致样本效率低且难以诊断错误来源。其解决方案的关键在于提出一种概念预测图(Graph of Concept Predictors, GCP),将教师模型的决策过程显式建模为有向无环图(DAG),并在学生模型中构建对应的模块化概念预测器结构;通过图感知的主动采样策略聚焦于关键推理节点上的不确定性与分歧,并结合针对性的子模块重训练机制,将下游损失归因于具体概念预测器并仅更新影响最大的模块,从而提升样本效率、训练稳定性和可解释性。

链接: https://arxiv.org/abs/2602.03006
作者: Ziyang Yu,Liang Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deploying Large Language Models (LLMs) for discriminative workloads is often limited by inference latency, compute, and API costs at scale. Active distillation reduces these costs by querying an LLM oracle to train compact discriminative students, but most pipelines distill only final labels, discarding intermediate reasoning signals and offering limited diagnostics of what reasoning is missing and where errors arise. We propose Graph of Concept Predictors (GCP), a reasoning-aware active distillation framework that externalizes the teacher’s decision process as a directed acyclic graph and mirrors it with modular concept predictors in the student. GCP enhances sample efficiency through a graph-aware acquisition strategy that targets uncertainty and disagreement at critical reasoning nodes. Additionally, it improves training stability and efficiency by performing targeted sub-module retraining, which attributes downstream loss to specific concept predictors and updates only the most influential modules. Experiments on eight NLP classification benchmarks demonstrate that GCP enhances performance under limited annotation budgets while yielding more interpretable and controllable training dynamics. Code is available at: this https URL.
zh

[AI-101] Causal Graph Spatial-Temporal Autoencoder for Reliable and Interpretable Process Monitoring

【速读】:该论文旨在解决工业过程监测中可靠性与可解释性不足的问题。现有方法往往难以准确捕捉变量间的动态关联及因果关系,导致故障检测性能受限且结果难以解释。解决方案的关键在于提出一种因果图时空自编码器(Causal Graph Spatial-Temporal Autoencoder, CGSTAE),其核心创新包括:1)基于空间自注意力机制(Spatial Self-Attention Mechanism, SSAM)的关联图结构学习模块,用于捕获变量间的动态相关性;2)提出一种三步因果图结构学习算法,利用因果不变性原理的逆向视角,从变化的相关图中提取出稳定的因果图结构;3)采用图卷积长短期记忆网络(Graph Convolutional Long Short Term Memory, GCLSTM)构建时空编码-解码器,在序列到序列框架下实现时间序列过程数据的重构。通过特征空间和残差空间的两个统计量,CGSTAE实现了高精度的过程监测与故障检测。

链接: https://arxiv.org/abs/2602.03004
作者: Xiangrui Zhang,Chunyue Song,Wei Dai,Zheng Zhang,Kaihua Gao,Furong Gao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:To improve the reliability and interpretability of industrial process monitoring, this article proposes a Causal Graph Spatial-Temporal Autoencoder (CGSTAE). The network architecture of CGSTAE combines two components: a correlation graph structure learning module based on spatial self-attention mechanism (SSAM) and a spatial-temporal encoder-decoder module utilizing graph convolutional long-short term memory (GCLSTM). The SSAM learns correlation graphs by capturing dynamic relationships between variables, while a novel three-step causal graph structure learning algorithm is introduced to derive a causal graph from these correlation graphs. The algorithm leverages a reverse perspective of causal invariance principle to uncover the invariant causal graph from varying correlations. The spatial-temporal encoder-decoder, built with GCLSTM units, reconstructs time-series process data within a sequence-to-sequence framework. The proposed CGSTAE enables effective process monitoring and fault detection through two statistics in the feature space and residual space. Finally, we validate the effectiveness of CGSTAE in process monitoring through the Tennessee Eastman process and a real-world air separation process.
zh

[AI-102] Methods and Open Problems in Differentiable Social Choice: Learning Mechanisms Decisions and Alignment

【速读】:该论文旨在解决现代机器学习系统中日益凸显的社会选择(Social Choice)问题,即如何在包含异质偏好、激励机制与判断的复杂场景下,设计可学习且可微分的集体决策机制。其核心挑战在于传统社会选择理论中的规范性原则(如公平性、效率、策略抗性等)如何在数据驱动的机器学习框架中被形式化和优化。解决方案的关键是提出“可微分社会选择”(Differentiable Social Choice)这一新兴范式,将投票规则、机制设计与聚合过程建模为可从数据中端到端优化的可微模型,从而在拍卖、预算分配、液态民主、联邦学习等应用中实现对经典社会选择公理与不可能性结果的量化权衡与工程实现。

链接: https://arxiv.org/abs/2602.03003
作者: Zhiyu An,Wan Du
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Social choice is no longer a peripheral concern of political theory or economics-it has become a foundational component of modern machine learning systems. From auctions and resource allocation to federated learning, participatory governance, and the alignment of large language models, machine learning pipelines increasingly aggregate heterogeneous preferences, incentives, and judgments into collective decisions. In effect, many contemporary machine learning systems already implement social choice mechanisms, often implicitly and without explicit normative scrutiny. This Review surveys differentiable social choice: an emerging paradigm that formulates voting rules, mechanisms, and aggregation procedures as learnable, differentiable models optimized from data. We synthesize work across auctions, voting, budgeting, liquid democracy, decentralized aggregation, and inverse mechanism learning, showing how classical axioms and impossibility results reappear as objectives, constraints, and optimization trade-offs. We conclude by identifying 36 open problems defining a new research agenda at the intersection of machine learning, economics, and democratic theory. Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2602.03003 [cs.AI] (or arXiv:2602.03003v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2602.03003 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-103] Adaptive Batch Sizes Using Non-Euclidean Gradient Noise Scales for Stochastic Sign and Spectral Descent

【速读】:该论文旨在解决现代机器学习系统中因采用固定或手动调优的批量大小(batch size)调度策略而导致的硬件利用率低下问题,这类策略依赖于易失效且调参成本高的启发式方法。其核心解决方案是提出基于非欧几里得梯度噪声尺度(non-Euclidean gradient noise scale, GNS)的自适应批量大小调整机制,该机制针对基于广义范数(如 signSGD / Signum 的 ℓ∞ 范数和 specSGD / Muon 的 S∞ 范数)的优化器重新推导了GNS定义,并设计了一种高效方差估计方法,利用分布式数据并行系统中不同rank上的局部小批量梯度进行估算。实验表明,该方法可在保持与固定批量基线相当验证损失的同时,将训练步数减少高达66%。

链接: https://arxiv.org/abs/2602.03001
作者: Hiroki Naganuma,Shagun Gupta,Youssef Briki,Ioannis Mitliagkas,Irina Rish,Parameswaran Raman,Hao-Jun Michael Shi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 2 figures, 4 tables

点击查看摘要

Abstract:To maximize hardware utilization, modern machine learning systems typically employ large constant or manually tuned batch size schedules, relying on heuristics that are brittle and costly to tune. Existing adaptive strategies based on gradient noise scale (GNS) offer a principled alternative. However, their assumption of SGD’s Euclidean geometry creates a fundamental mismatch with popular optimizers based on generalized norms, such as signSGD / Signum ( \ell_\infty ) and stochastic spectral descent (specSGD) / Muon ( \mathcalS_\infty ). In this work, we derive gradient noise scales for signSGD and specSGD that naturally emerge from the geometry of their respective dual norms. To practically estimate these non-Euclidean metrics, we propose an efficient variance estimation procedure that leverages the local mini-batch gradients on different ranks in distributed data-parallel systems. Our experiments demonstrate that adaptive batch size strategies using non-Euclidean GNS enable us to match the validation loss of constant-batch baselines while reducing training steps by up to 66% for Signum and Muon on a 160 million parameter Llama model.
zh

[AI-104] Agent Alpha: Tree Search Unifying Generation Exploration and Evaluation for Computer-Use Agents

【速读】:该论文旨在解决GUI代理在测试时通过轨迹级采样扩展计算资源虽能提升性能,但缺乏回溯能力(regressive ability)的问题,即无法复用部分成功结果或从早期错误中恢复。解决方案的关键在于提出Agent Alpha框架,该框架通过步骤级蒙特卡洛树搜索(step-level Monte Carlo Tree Search, MCTS)协同生成、探索与评估,引入alpha-UCT引导的搜索机制嵌入交互循环,实现主动建模或利用规划空间结构;同时采用比较驱动的评估策略以缓解绝对评分偏差,并通过多样性约束扩展保持搜索空间紧凑且信息丰富,从而支持早期剪枝和前缀复用,最终在OSWorld基准上达到约77%的成功率,显著优于同等计算开销下的轨迹级基线方法。

链接: https://arxiv.org/abs/2602.02995
作者: Sizhe Tang,Rongqian Chen,Tian Lan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While scaling test-time compute through trajectory-level sampling has significantly improved Graphical User Interface (GUI) agents, the lack of regressive ability prevents the reuse of partial successes and the recovery from early missteps. In this paper, we introduce Agent Alpha, a unified framework that synergizes generation, exploration, and evaluation through step-level Monte Carlo Tree Search (MCTS). It enables active modeling or exploiting structures of the planning space. By integrating alpha-UCT guided search into the interaction loop, Agent Alpha enables deliberate planning, facilitating early pruning of suboptimal branches and efficient prefix reuse. We also employ comparison-driven evaluation to mitigate absolute scoring biases and diversity-constrained expansion to maintain a compact, informative search space. Regret bound of alpha-UCT is analyzed. On the OSWorld benchmark, Agent Alpha achieves a state-of-the-art success rate of \sim 77% , significantly outperforming trajectory-level baselines under equivalent compute.
zh

[AI-105] Large Language Models Can Take False First Steps at Inference-time Planning

【速读】:该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在训练中展现出序列级规划能力,但在推理阶段表现出短视且不一致的规划行为,即存在“规划能力与实际行为之间的差距”。其解决方案的关键在于提出一种贝叶斯(Bayesian)解释框架,认为这种现象源于生成式上下文(generative context)的动态演化——由于自然语言与LLMs内部表征语言之间存在细微差异,推理过程中逐步积累的自生成上下文会驱动规划策略的转变,从而造成看似受限的规划行为。论文通过两个受控实验验证了该模型:一是随机生成任务中显示人类提示下规划受限但随自生成上下文累积而增强;二是高斯采样任务中表明条件化于自生成序列可降低初始偏差。这一理论与实证结果共同揭示了LLM在推理时如何基于内部上下文进行前瞻规划。

链接: https://arxiv.org/abs/2602.02991
作者: Haijiang Yan,Jian-Qiao Zhu,Adam Sanborn
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have been shown to acquire sequence-level planning abilities during training, yet their planning behavior exhibited at inference time often appears short-sighted and inconsistent with these capabilities. We propose a Bayesian account for this gap by grounding planning behavior in the evolving generative context: given the subtle differences between natural language and the language internalized by LLMs, accumulated self-generated context drives a planning-shift during inference and thereby creates the appearance of compromised planning behavior. We further validate the proposed model through two controlled experiments: a random-generation task demonstrating constrained planning under human prompts and increasing planning strength as self-generated context accumulates, and a Gaussian-sampling task showing reduced initial bias when conditioning on self-generated sequences. These findings provide a theoretical explanation along with empirical evidence for characterizing how LLMs plan ahead during inference.
zh

[AI-106] NLI:Non-uniform Linear Interpolation Approximation of Nonlinear Operations for Efficient LLM s Inference ICLR18

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在部署过程中因非线性层(如SiLU、RMSNorm和Softmax)依赖高精度浮点运算而导致的内存占用大和计算成本高的问题。解决方案的关键在于提出一种无需校准、基于动态规划最优且硬件友好的框架——非均匀线性插值(Non-uniform Linear Interpolation, NLI),其核心创新是将插值节点选择建模为动态规划问题,利用贝尔曼最优性原理在O(M×N²)时间内实现全局最小插值误差,从而高效近似多种非线性函数,并设计了一个即插即用的通用非线性计算单元,在保持模型精度几乎无损的前提下显著提升计算效率(实测超过4倍于当前最优设计)。

链接: https://arxiv.org/abs/2602.02988
作者: Jiangyong Yu,Xiaomeng Han,Xing Hu,Chen Xu,Zhe Jiang,Dawei Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Admitted to ICLR 18pages 5 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of tasks, but their deployment is often constrained by substantial memory footprints and computational costs. While prior work has achieved significant progress in compressing and accelerating linear layers, nonlinear layers-such as SiLU, RMSNorm, and Softmax-still heavily depend on high-precision floating-point operations. In this paper, we propose a calibration-free, dynamic-programming-optimal, and hardware-friendly framework called Non-uniform Linear Interpolation (NLI). NLI is capable of efficiently approximating a variety of nonlinear functions, enabling seamless integration into LLMs and other deep neural networks with almost no loss in accuracy. NLI ingeniously recasts cutpoint selection as a dynamic-programming problem, achieving the globally minimal interpolation error in O(MxN2) time via Bellman’s optimality principle. Based on the NLI algorithm, we also design and implement a plug-and-play universal nonlinear computation unit. Hardware experiments demonstrate that the NLI Engine achieves more than 4x improvement in computational efficiency compared to the state-of-the-art designs.
zh

[AI-107] Are LLM s Biased Like Humans? Causal Reasoning as a Function of Prior Knowledge Irrelevant Information and Reasoning Budget

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在因果推理任务中是否具备规范的因果计算能力、人类式的启发式策略或脆弱的模式匹配行为这一核心问题。其解决方案的关键在于:首先,通过构建一个由 collider 结构(C1EC2C_1 \rightarrow E \leftarrow C_2)形式化定义的11个因果判断任务,并与人类基线进行对比,发现少量可解释模型能很好地压缩LLMs的因果判断;其次,揭示多数LLMs采用更接近规则驱动的推理策略,而人类则倾向于考虑未提及的潜在因素;最后,通过语义抽象和提示过载(prompt overloading)实验,验证链式思维(Chain-of-Thought, CoT)能提升部分LLMs的因果判断鲁棒性,从而表明LLMs在特定场景下可作为人类偏见的补充,但其规则化推理在不确定性固有的情境中可能失效——这凸显了对LLMs推理机制进行系统刻画对于安全有效部署的重要性。

链接: https://arxiv.org/abs/2602.02983
作者: Hanna M. Dettki,Charley M. Wu,Bob Rehder
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used in domains where causal reasoning matters, yet it remains unclear whether their judgments reflect normative causal computation, human-like shortcuts, or brittle pattern matching. We benchmark 20+ LLMs against a matched human baseline on 11 causal judgment tasks formalized by a collider structure ( C_1 !\rightarrow! E! \leftarrow !C_2 ). We find that a small interpretable model compresses LLMs’ causal judgments well and that most LLMs exhibit more rule-like reasoning strategies than humans who seem to account for unmentioned latent factors in their probability judgments. Furthermore, most LLMs do not mirror the characteristic human collider biases of weak explaining away and Markov violations. We probe LLMs’ causal judgment robustness under (i) semantic abstraction and (ii) prompt overloading (injecting irrelevant text), and find that chain-of-thought (CoT) increases robustness for many LLMs. Together, this divergence suggests LLMs can complement humans when known biases are undesirable, but their rule-like reasoning may break down when uncertainty is intrinsic – highlighting the need to characterize LLM reasoning strategies for safe, effective deployment.
zh

[AI-108] Structuring Value Representations via Geometric Coherence in Markov Decision Processes

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)中价值函数估计的不稳定性与低样本效率问题。现有方法虽利用几何特性如对称性或结构约束提升性能,但缺乏对状态空间内在序结构(poset,即偏序集)的系统建模。解决方案的关键在于引入序理论视角,将价值函数学习重构为在偏序集上逐步细化超序集(super-poset)的过程:通过前序步骤的序结构优化与时间差分信号驱动的新序关系学习,确保多步价值函数估计的几何一致性(geometric coherence)。为此提出GCR-RL框架,并设计基于Q-learning和Actor-Critic的两种算法实现高效超序集精化,理论分析其收敛性与速率,实验证明其在样本效率和稳定性方面显著优于强基线。

链接: https://arxiv.org/abs/2602.02978
作者: Zuyuan Zhang,Zeyu Fang,Tian Lan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Geometric properties can be leveraged to stabilize and speed reinforcement learning. Existing examples include encoding symmetry structure, geometry-aware data augmentation, and enforcing structural restrictions. In this paper, we take a novel view of RL through the lens of order theory and recast value function estimates into learning a desired poset (partially ordered set). We propose \emphGCR-RL (Geometric Coherence Regularized Reinforcement Learning) that computes a sequence of super-poset refinements – by refining posets in previous steps and learning additional order relationships from temporal difference signals – thus ensuring geometric coherence across the sequence of posets underpinning the learned value functions. Two novel algorithms by Q-learning and by actor–critic are developed to efficiently realize these super-poset refinements. Their theoretical properties and convergence rates are analyzed. We empirically evaluate GCR-RL in a range of tasks and demonstrate significant improvements in sample efficiency and stable performance over strong baselines.
zh

[AI-109] Generative Engine Optimization: A VLM and Agent Framework for Pinterest Acquisition Growth

【速读】:该论文旨在解决生成式 AI(Generative AI)驱动的搜索系统对视觉内容平台带来的挑战:传统图像因缺乏语义深度和权威信号,在生成式搜索中难以被有效检索,导致用户需求在搜索页面内直接满足,从而引发平台流量流失与去中介化风险。解决方案的关键在于提出 Pinterest GEO 框架,其核心创新包括:1)通过微调视觉语言模型(Vision-Language Models, VLMs)预测用户实际会搜索的内容,而非生成通用图像描述;2)结合 AI 代理挖掘实时互联网趋势以捕捉新兴搜索需求;3)利用多模态嵌入构建语义一致的集合页(Collection Pages),形成可索引的聚合结构以优化生成式检索;4)采用混合 VLM 与双塔近邻搜索(Two-tower ANN)架构建立权威感知的跨图像链接体系,实现百亿级视觉资产间的信号传播。该方案已在大规模部署中带来 20% 的自然流量增长,为视觉平台在生成式搜索时代提供了一条可落地的优化路径。

链接: https://arxiv.org/abs/2602.02961
作者: Faye Zhang,Qianyu Cheng,Jasmine Wan,Vishwakarma Singh,Jinfeng Rao,Kofi Boakye
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models are fundamentally reshaping content discovery through AI-native search systems such as ChatGPT, Gemini, and Claude. Unlike traditional search engines that match keywords to documents, these systems infer user intent, synthesize multimodal evidence, and generate contextual answers directly on the search page, introducing a paradigm shift from Search Engine Optimization (SEO) to Generative Engine Optimization (GEO). For visual content platforms hosting billions of assets, this poses an acute challenge: individual images lack the semantic depth and authority signals that generative search prioritizes, risking disintermediation as user needs are satisfied in-place without site visits. We present Pinterest GEO, a production-scale framework that pioneers reverse search design: rather than generating generic image captions describing what content is, we fine-tune Vision-Language Models (VLMs) to predict what users would actually search for, augmented this with AI agents that mine real-time internet trends to capture emerging search demand. These VLM-generated queries then drive construction of semantically coherent Collection Pages via multimodal embeddings, creating indexable aggregations optimized for generative retrieval. Finally, we employ hybrid VLM and two-tower ANN architectures to build authority-aware interlinking structures that propagate signals across billions of visual assets. Deployed at scale across billions of images and tens of millions of collections, GEO delivers 20% organic traffic growth contributing to multi-million monthly active user (MAU) growth, demonstrating a principled pathway for visual platforms to thrive in the generative search era. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2602.02961 [cs.AI] (or arXiv:2602.02961v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2602.02961 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-110] Embodiment-Aware Generalist Specialist Distillation for Unified Humanoid Whole-Body Control

【速读】:该论文旨在解决当前基于强化学习(Reinforcement Learning, RL)的人形机器人全身控制器在跨不同机器人本体(embodiment)时泛化能力不足的问题,尤其是当机器人具有不同的动力学特性、自由度(Degrees of Freedom, DoFs)和运动学拓扑结构时,单一策略难以有效控制多种异构人形机器人,并且难以支持从简单行走到下蹲、前倾等更复杂行为的迁移。解决方案的关键在于提出EAGLE框架——一种迭代式的通用-专用蒸馏(generalist-specialist distillation)方法:在每一轮迭代中,从当前通用策略中分叉出针对特定机器人的专用策略,在各自机器人上进行优化后,将新习得技能通过联合训练集回传至通用策略中,从而逐步增强通用策略的多机器人适应性和行为多样性。该方法无需对每个机器人单独设计奖励函数,实现了单个统一策略对多个异构人形机器人的高效控制。

链接: https://arxiv.org/abs/2602.02960
作者: Quanquan Peng,Yunfeng Lin,Yufei Xue,Jiangmiao Pang,Weinan Zhang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Humanoid Whole-Body Controllers trained with reinforcement learning (RL) have recently achieved remarkable performance, yet many target a single robot embodiment. Variations in dynamics, degrees of freedom (DoFs), and kinematic topology still hinder a single policy from commanding diverse humanoids. Moreover, obtaining a generalist policy that not only transfers across embodiments but also supports richer behaviors-beyond simple walking to squatting, leaning-remains especially challenging. In this work, we tackle these obstacles by introducing EAGLE, an iterative generalist-specialist distillation framework that produces a single unified policy that controls multiple heterogeneous humanoids without per-robot reward tuning. During each cycle, embodiment-specific specialists are forked from the current generalist, refined on their respective robots, and new skills are distilled back into the generalist by training on the pooled embodiment set. Repeating this loop until performance convergence produces a robust Whole-Body Controller validated on robots such as Unitree H1, G1, and Fourier N1. We conducted experiments on five different robots in simulation and four in real-world settings. Through quantitative evaluations, EAGLE achieves high tracking accuracy and robustness compared to other methods, marking a step toward scalable, fleet-level humanoid control. See more details at this https URL
zh

[AI-111] Synthetic Data Augmentation for Medical Audio Classification: A Preliminary Evaluation

【速读】:该论文旨在解决医疗音频分类中因信噪比低、判别特征细微、类内差异大以及类别不平衡和训练数据有限等因素导致的性能瓶颈问题。其解决方案的关键在于评估三种生成式数据增强策略(变分自编码器、生成对抗网络和扩散模型)对基于深度卷积神经网络的呼吸音分类任务的影响,以验证合成数据是否能提升模型性能。研究发现,单一增强策略未能带来性能提升,仅集成多个增强模型后取得小幅改进,表明合成数据增强在医疗音频场景下并非普适有效,需进一步明确任务特异性数据特征、模型与增强方法的适配性及评估框架的设计。

链接: https://arxiv.org/abs/2602.02955
作者: David McShannon,Anthony Mella,Nicholas Dietrich
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 5 pages, 1 figure

点击查看摘要

Abstract:Medical audio classification remains challenging due to low signal-to-noise ratios, subtle discriminative features, and substantial intra-class variability, often compounded by class imbalance and limited training data. Synthetic data augmentation has been proposed as a potential strategy to mitigate these constraints; however, prior studies report inconsistent methodological approaches and mixed empirical results. In this preliminary study, we explore the impact of synthetic augmentation on respiratory sound classification using a baseline deep convolutional neural network trained on a moderately imbalanced dataset (73%:27%). Three generative augmentation strategies (variational autoencoders, generative adversarial networks, and diffusion models) were assessed under controlled experimental conditions. The baseline model without augmentation achieved an F1-score of 0.645. Across individual augmentation strategies, performance gains were not observed, with several configurations demonstrating neutral or degraded classification performance. Only an ensemble of augmented models yielded a modest improvement in F1-score (0.664). These findings suggest that, for medical audio classification, synthetic augmentation may not consistently enhance performance when applied to a standard CNN classifier. Future work should focus on delineating task-specific data characteristics, model-augmentation compatibility, and evaluation frameworks necessary for synthetic augmentation to be effective in medical audio applications.
zh

[AI-112] UAT-LITE: Inference-Time Uncertainty-Aware Attention for Pretrained Transformers

【速读】:该论文旨在解决神经网络自然语言处理(Neural NLP)模型在预测时存在的校准不足问题,即模型对错误预测仍赋予过高置信度,这会损害选择性预测(selective prediction)和高风险场景下的部署可靠性。其解决方案的关键在于提出UAT-LITE框架,该框架在推理阶段通过蒙特卡洛Dropout实现近似贝叶斯推断,使自注意力机制具备不确定性感知能力:利用随机前向传播估计token级认知不确定性(epistemic uncertainty),并据此调节上下文建模过程中的自注意力权重,而无需修改预训练权重或训练目标。此外,作者引入分层方差分解方法以诊断预测不确定性在Transformer深度上的累积机制,实验证明该方法在SQuAD 2.0、MNLI和SST-2任务上平均将期望校准误差(Expected Calibration Error)降低约20%,同时保持任务准确率,并提升选择性预测性能与分布外鲁棒性。

链接: https://arxiv.org/abs/2602.02952
作者: Elias Hossain,Shubhashis Roy Dipta,Subash Neupane,Rajib Rana,Ravid Shwartz-Ziv,Ivan Garibay,Niloofar Yousefi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Neural NLP models are often miscalibrated, assigning high confidence to incorrect predictions, which undermines selective prediction and high-stakes deployment. Post-hoc calibration methods adjust output probabilities but leave internal computation unchanged, while ensemble and Bayesian approaches improve uncertainty at substantial training or storage cost. We propose UAT-LITE, an inference-time framework that makes self-attention uncertainty-aware using approximate Bayesian inference via Monte Carlo dropout in pretrained transformer classifiers. Token-level epistemic uncertainty is estimated from stochastic forward passes and used to modulate self-attention during contextualization, without modifying pretrained weights or training objectives. We additionally introduce a layerwise variance decomposition to diagnose how predictive uncertainty accumulates across transformer depth. Across the SQuAD 2.0 answerability, MNLI, and SST-2, UAT-LITE reduces Expected Calibration Error by approximately 20% on average relative to a fine-tuned BERT-base baseline while preserving task accuracy, and improves selective prediction and robustness under distribution shift.
zh

[AI-113] RPG-AE: Neuro-Symbolic Graph Autoencoders with Rare Pattern Mining for Provenance-Based Anomaly Detection

【速读】:该论文旨在解决高级持续性威胁(Advanced Persistent Threats, APTs)难以检测的问题,因其行为隐蔽且常混杂于正常系统活动中。解决方案的关键在于提出一种神经符号异常检测框架,融合图自编码器(Graph Autoencoder, GAE)与稀有模式挖掘技术:首先基于特征相似性构建进程行为图并利用GAE学习正常关系结构,通过观测图与重构图之间的结构偏差识别异常候选;进一步引入稀有模式挖掘模块,发现罕见的行为共现模式,并以此增强具有稀有特征的进程的异常评分。该方法在DARPA透明计算数据集上验证有效,显著提升了异常排序质量,优于单一基线模型,并达到与多模型集成方法相当的性能,体现了图表示学习与经典模式挖掘结合在提升检测效果和可解释性方面的优势。

链接: https://arxiv.org/abs/2602.02929
作者: Asif Tauhid,Sidahmed Benabderrahmane,Mohamad Altrabulsi,Ahamed Foisal,Talal Rahwan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Advanced Persistent Threats (APTs) are sophisticated, long-term cyberattacks that are difficult to detect because they operate stealthily and often blend into normal system behavior. This paper presents a neuro-symbolic anomaly detection framework that combines a Graph Autoencoder (GAE) with rare pattern mining to identify APT-like activities in system-level provenance data. Our approach first constructs a process behavioral graph using k-Nearest Neighbors based on feature similarity, then learns normal relational structure using a Graph Autoencoder. Anomaly candidates are identified through deviations between observed and reconstructed graph structure. To further improve detection, we integrate an rare pattern mining module that discovers infrequent behavioral co-occurrences and uses them to boost anomaly scores for processes exhibiting rare signatures. We evaluate the proposed method on the DARPA Transparent Computing datasets and show that rare-pattern boosting yields substantial gains in anomaly ranking quality over the baseline GAE. Compared with existing unsupervised approaches on the same benchmark, our single unified model consistently outperforms individual context-based detectors and achieves performance competitive with ensemble aggregation methods that require multiple separate detectors. These results highlight the value of coupling graph-based representation learning with classical pattern mining to improve both effectiveness and interpretability in provenance-based security anomaly detection.
zh

[AI-114] Refining Decision Boundaries In Anomaly Detection Using Similarity Search Within the Feature Space

【速读】:该论文旨在解决高不平衡数据集(如网络空间中的高级持续性威胁,Advanced Persistent Threats, APTs)中稀有且多样异常检测的难题。传统主动学习方法往往未能有效利用特征空间的内在几何结构来优化模型性能。其解决方案的关键在于提出一种名为SDA2E(Sparse Dual Adversarial Attention-based AutoEncoder)的新型自编码器架构,用于从高维、不平衡数据中学习紧凑且判别性强的潜在表示,并结合一种基于相似性的主动学习框架,引入三种创新策略:normal-like expansion(通过引入与已标注正常样本相似的点提升重建保真度)、anomaly-like prioritization(聚焦于类异常点以增强排序准确性),以及两者的混合策略以实现模型优化与排序性能的平衡。此外,文中还设计了一种专为稀疏二值嵌入(sparse binary embeddings)定制的新相似性度量——Normalized Matching 1s (SIM_NM1),显著提升了在低标签数据下的异常检测效果,在52个不平衡数据集上验证了其优越性,nDCG指标最高达1.0,同时减少高达80%的标注数据需求。

链接: https://arxiv.org/abs/2602.02925
作者: Sidahmed Benabderrahmane,Petko Valtchev,James Cheney,Talal Rahwan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Detecting rare and diverse anomalies in highly imbalanced datasets-such as Advanced Persistent Threats (APTs) in cybersecurity-remains a fundamental challenge for machine learning systems. Active learning offers a promising direction by strategically querying an oracle to minimize labeling effort, yet conventional approaches often fail to exploit the intrinsic geometric structure of the feature space for model refinement. In this paper, we introduce SDA2E, a Sparse Dual Adversarial Attention-based AutoEncoder designed to learn compact and discriminative latent representations from imbalanced, high-dimensional data. We further propose a similarity-guided active learning framework that integrates three novel strategies to refine decision boundaries efficiently: mormal-like expansion, which enriches the training set with points similar to labeled normals to improve reconstruction fidelity; anomaly-like prioritization, which boosts ranking accuracy by focusing on points resembling known anomalies; and a hybrid strategy that combines both for balanced model refinement and ranking. A key component of our framework is a new similarity measure, Normalized Matching 1s (SIM_NM1), tailored for sparse binary embeddings. We evaluate SDA2E extensively across 52 imbalanced datasets, including multiple DARPA Transparent Computing scenarios, and benchmark it against 15 state-of-the-art anomaly detection methods. Results demonstrate that SDA2E consistently achieves superior ranking performance (nDCG up to 1.0 in several cases) while reducing the required labeled data by up to 80% compared to passive training. Statistical tests confirm the significance of these improvements. Our work establishes a robust, efficient, and statistically validated framework for anomaly detection that is particularly suited to cybersecurity applications such as APT detection.
zh

[AI-115] DeltaEvolve: Accelerating Scientific Discovery through Momentum-Driven Evolution

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的进化系统在科学发现中因依赖完整代码历史(full-code histories)而导致的上下文效率低下和进化引导弱化的问题。现有方法如AlphaEvolve通过存储完整的代码快照更新控制上下文,但冗余实现细节会稀释核心算法思想,难以提供清晰的演化启发。解决方案的关键在于提出DeltaEvolve框架,其将进化代理形式化为期望最大化(Expectation-Maximization, EM)范式,并用结构化的语义增量(semantic delta)替代全代码历史,捕捉相邻节点间修改的因果影响;该增量通常包含可迁移的有效组件,能更高效地驱动性能提升;同时通过多层数据库与渐进披露机制组织语义增量,显著减少输入token消耗,实验证明其在多个科学领域任务中以更低的token开销获得更优解。

链接: https://arxiv.org/abs/2602.02919
作者: Jiachen Jiang,Tianyu Ding,Zhihui Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:LLM-driven evolutionary systems have shown promise for automated science discovery, yet existing approaches such as AlphaEvolve rely on full-code histories that are context-inefficient and potentially provide weak evolutionary guidance. In this work, we first formalize the evolutionary agents as a general Expectation-Maximization framework, where the language model samples candidate programs (E-step) and the system updates the control context based on evaluation feedback (M-step). Under this view, constructing context via full-code snapshots constitutes a suboptimal M-step, as redundant implement details dilutes core algorithmic ideas, making it difficult to provide clear inspirations for evolution. To address this, we propose DeltaEvolve, a momentum-driven evolutionary framework that replaces full-code history with structured semantic delta capturing how and why modifications between successive nodes affect performance. As programs are often decomposable, semantic delta usually contains many effective components which are transferable and more informative to drive improvement. By organizing semantic delta through multi-level database and progressive disclosure mechanism, input tokens are further reduced. Empirical evaluations on tasks across diverse scientific domains show that our framework can discover better solution with less token consumption over full-code-based evolutionary agents.
zh

[AI-116] Notes on the Reward Representation of Posterior Updates

【速读】:该论文试图解决的问题是:如何将决策制定中的“软更新”(soft update)从一种类比性的表述转化为可严格解释的数学机制,从而明确行为变化是否可以被唯一地归因于证据权重的重新分配。其解决方案的关键在于识别出在特定条件下——即KL-正则化的软更新恰好等价于单一固定概率模型内的贝叶斯后验时——行为变化仅由该通道携带的信息驱动,此时更新变量构成一个真实的信息传输通道。这一设定下,行为改变可被精确解释为对基线分布的证据重加权,且由此得出一个清晰的识别结果:后验更新决定了相对的、情境依赖的激励信号以引导行为,但无法唯一确定绝对奖励,后者仍存在与情境相关的基准模糊性;进一步要求跨不同更新方向的统一延续价值(continuation value),则施加了额外的一致性约束,从而关联不同条件顺序下的奖励描述。

链接: https://arxiv.org/abs/2602.02912
作者: Pedro A. Ortega
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Technical report, 9 pages

点击查看摘要

Abstract:Many ideas in modern control and reinforcement learning treat decision-making as inference: start from a baseline distribution and update it when a signal arrives. We ask when this can be made literal rather than metaphorical. We study the special case where a KL-regularized soft update is exactly a Bayesian posterior inside a single fixed probabilistic model, so the update variable is a genuine channel through which information is transmitted. In this regime, behavioral change is driven only by evidence carried by that channel: the update must be explainable as an evidence reweighing of the baseline. This yields a sharp identification result: posterior updates determine the relative, context-dependent incentive signal that shifts behavior, but they do not uniquely determine absolute rewards, which remain ambiguous up to context-specific baselines. Requiring one reusable continuation value across different update directions adds a further coherence constraint linking the reward descriptions associated with different conditioning orders.
zh

[AI-117] Reasoning about Reasoning : BAPO Bounds on Chain-of-Thought Token Complexity in LLM s

【速读】:该论文旨在解决生成式 AI(Generative AI)在推理阶段通过链式思维(Chain-of-Thought, CoT)推理提升大语言模型(Large Language Models, LLMs)性能时所面临的计算开销与延迟问题,核心关注点是:随着输入规模 $ n $ 增长,完成特定任务所需的 CoT 推理 token 数量是否存在理论下界。解决方案的关键在于引入并扩展了有界注意力前缀预言机(Bounded Attention Prefix Oracle, BAPO)模型这一抽象框架,用于量化任务求解所需的信息流,并据此证明了三个典型 BAPO 难题——二进制多数判定、三元组匹配和图可达性——均需至少 $ \Omega(n) $ 个推理 token,同时通过显式构造给出了匹配或近似匹配的上界。实验进一步验证了前沿推理模型在这些任务上的推理 token 缩放呈近线性趋势,且在受限推理预算下失败,印证了理论下界的合理性,从而揭示了基于 CoT 的推理阶段计算瓶颈的本质,并提供了一种分析最优推理长度的理论工具。

链接: https://arxiv.org/abs/2602.02909
作者: Kiran Tomlinson,Tobias Schnabel,Adith Swaminathan,Jennifer Neville
机构: 未知
类目: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
备注: 28 pages

点击查看摘要

Abstract:Inference-time scaling via chain-of-thought (CoT) reasoning is a major driver of state-of-the-art LLM performance, but it comes with substantial latency and compute costs. We address a fundamental theoretical question: how many reasoning tokens are required to solve a problem as input size grows? By extending the bounded attention prefix oracle (BAPO) model–an abstraction of LLMs that quantifies the information flow required to solve a task–we prove lower bounds on the CoT tokens required for three canonical BAPO-hard tasks: binary majority, triplet matching, and graph reachability. We show that each requires \Omega(n) reasoning tokens when the input size is n . We complement these results with matching or near-matching upper bounds via explicit constructions. Finally, our experiments with frontier reasoning models show approximately linear reasoning token scaling on these tasks and failures when constrained to smaller reasoning budgets, consistent with our theoretical lower bounds. Together, our results identify fundamental bottlenecks in inference-time compute through CoT and offer a principled tool for analyzing optimal reasoning length.
zh

[AI-118] FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的自主代理在科学发现中缺乏可验证性评估的问题。现有基准测试要么依赖LLM作为评判者对自动生成的研究结果进行评估,要么仅优化易于计算但与科学洞察力关联较弱的孤立指标,难以真实反映代理完成完整科研流程的能力。为此,作者提出FIRE-Bench(Full-cycle Insight Rediscovery Evaluation),其核心在于通过让代理从高影响力机器学习研究中提取高层次问题出发,自主完成从假设生成、实验设计、代码实现到执行和结论推导的全流程科学探索,并以是否能复现已验证成果作为评价标准。该方案提供了一个严格且具有诊断性的框架,能够系统评估代理在真实科学推理链条中的表现,从而推动可靠代理驱动科学发现的发展。

链接: https://arxiv.org/abs/2602.02905
作者: Zhen Wang,Fan Bai,Zhongyan Luo,Jinyan Su,Kaiser Sun,Xinle Yu,Jieyuan Liu,Kun Zhou,Claire Cardie,Mark Dredze,Eric P. Xing,Zhiting Hu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 30 pages, 4 figures, 10 tables

点击查看摘要

Abstract:Autonomous agents powered by large language models (LLMs) promise to accelerate scientific discovery end-to-end, but rigorously evaluating their capacity for verifiable discovery remains a central challenge. Existing benchmarks face a trade-off: they either heavily rely on LLM-as-judge evaluations of automatically generated research outputs or optimize convenient yet isolated performance metrics that provide coarse proxies for scientific insight. To address this gap, we introduce FIRE-Bench (Full-cycle Insight Rediscovery Evaluation), a benchmark that evaluates agents through the rediscovery of established findings from recent, high-impact machine learning research. Agents are given only a high-level research question extracted from a published, verified study and must autonomously explore ideas, design experiments, implement code, execute their plans, and derive conclusions supported by empirical evidence. We evaluate a range of state-of-the-art agents with frontier LLMs backbones like gpt-5 on FIRE-Bench. Our results show that full-cycle scientific research remains challenging for current agent systems: even the strongest agents achieve limited rediscovery success (50 F1), exhibit high variance across runs, and display recurring failure modes in experimental design, execution, and evidence-based reasoning. FIRE-Bench provides a rigorous and diagnostic framework for measuring progress toward reliable agent-driven scientific discovery.
zh

[AI-119] Spatiotemporal Decision Transformer for Traffic Coordination

【速读】:该论文旨在解决多交叉口交通信号控制中的多智能体协调与样本效率问题,传统强化学习方法在复杂城市路网中难以实现高效协同且训练成本高。其解决方案的关键在于提出MADT(Multi-Agent Decision Transformer),通过将多智能体交通信号控制建模为序列预测任务,引入图注意力机制以捕捉交叉口间的空间依赖关系、时间Transformer编码器以建模交通动态变化,并采用“return-to-go”条件化策略实现目标性能的精准指定,从而支持从历史数据中离线训练并具备在线微调潜力,实验表明该方法在合成网格网络和真实场景下均显著优于现有基准,平均出行时间降低5-6%,且相邻交叉口间协调性更强。

链接: https://arxiv.org/abs/2602.02903
作者: Haoran Su,Yandong Sun,Hanxiao Deng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Traffic signal control is a critical challenge in urban transportation, requiring coordination among multiple intersections to optimize network-wide traffic flow. While reinforcement learning has shown promise for adaptive signal control, existing methods struggle with multi-agent coordination and sample efficiency. We introduce MADT (Multi-Agent Decision Transformer), a novel approach that reformulates multi-agent traffic signal control as a sequence modeling problem. MADT extends the Decision Transformer paradigm to multi-agent settings by incorporating: (1) a graph attention mechanism for modeling spatial dependencies between intersections, (2) a|temporal transformer encoder for capturing traffic dynamics, and (3) return-to-go conditioning for target performance specification. Our approach enables offline learning from historical traffic data, with architecture design that facilitates potential online fine-tuning. Experiments on synthetic grid networks and real-world traffic scenarios demonstrate that MADT achieves state-of-the-art performance, reducing average travel time by 5-6% compared to the strongest baseline while exhibiting superior coordination among adjacent intersections.
zh

[AI-120] Minimal Computational Preconditions for Subjective Perspective in Artificial Agents

【速读】:该论文旨在解决如何在人工智能代理中实现主观视角(subjective perspective)的可操作化问题,即如何在机器系统中构建一种类似人类主观体验的内在结构。其解决方案的关键在于引入一个缓慢演化的全局潜在状态(global latent state),该状态基于现象学动机设计,用于调制快速的策略动态,但不直接优化行为后果;在无奖励环境中的制度切换场景下,这种潜在结构表现出方向依赖的滞后效应(direction-dependent hysteresis),而策略层面的行为仍保持相对反应性,从而为机器系统的类主观性提供可测量的指标。

链接: https://arxiv.org/abs/2602.02902
作者: Hongju Pae
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study operationalizes subjective perspective in artificial agents by grounding it in a minimal, phenomenologically motivated internal structure. The perspective is implemented as a slowly evolving global latent state that modulates fast policy dynamics without being directly optimized for behavioral consequences. In a reward-free environment with regime shifts, this latent structure exhibits direction-dependent hysteresis, while policy-level behavior remains comparatively reactive. I argue that such hysteresis constitutes a measurable signature of perspective-like subjectivity in machine systems.
zh

[AI-121] Manifold-Constrained Energy-Based Transition Models for Offline Reinforcement Learning

【速读】:该论文旨在解决模型基础的离线强化学习(model-based offline reinforcement learning)在分布偏移(distribution shift)下表现脆弱的问题:策略优化会引导模拟轨迹进入数据集支持较弱的状态-动作区域,导致模型误差累积并引发严重的价值高估。解决方案的关键在于提出流形约束的能量型过渡模型(Manifold-Constrained Energy-based Transition Models, MC-ETM),其通过流形投影-扩散负采样训练条件能量模型,学习下一状态的潜在流形,并在潜在空间中利用Langevin动力学扰动潜在编码生成近流形硬负样本,从而增强能量景观在数据支持区域附近的敏感性;同时,基于学习到的能量函数提供单一可靠性信号——当采样下一状态的最小能量超过阈值时截断轨迹,并通过基于Q值水平方差的悲观惩罚稳定贝尔曼备份,最终在混合悲观马尔可夫决策过程(hybrid pessimistic MDP)框架下实现性能边界分离,有效降低分布外偏差带来的风险。

链接: https://arxiv.org/abs/2602.02900
作者: Zeyu Fang,Zuyuan Zhang,Mahdi Imani,Tian Lan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Model-based offline reinforcement learning is brittle under distribution shift: policy improvement drives rollouts into state–action regions weakly supported by the dataset, where compounding model error yields severe value overestimation. We propose Manifold-Constrained Energy-based Transition Models (MC-ETM), which train conditional energy-based transition models using a manifold projection–diffusion negative sampler. MC-ETM learns a latent manifold of next states and generates near-manifold hard negatives by perturbing latent codes and running Langevin dynamics in latent space with the learned conditional energy, sharpening the energy landscape around the dataset support and improving sensitivity to subtle out-of-distribution deviations. For policy optimization, the learned energy provides a single reliability signal: rollouts are truncated when the minimum energy over sampled next states exceeds a threshold, and Bellman backups are stabilized via pessimistic penalties based on Q-value-level dispersion across energy-guided samples. We formalize MC-ETM through a hybrid pessimistic MDP formulation and derive a conservative performance bound separating in-support evaluation error from truncation risk. Empirically, MC-ETM improves multi-step dynamics fidelity and yields higher normalized returns on standard offline control benchmarks, particularly under irregular dynamics and sparse data coverage.
zh

[AI-122] Moving On Even When Youre Broken: Fail-Active Trajectory Generation via Diffusion Policies Conditioned on Embodiment and Task

【速读】:该论文旨在解决机器人在执行任务过程中因执行机构(actuation)故障导致的失效问题,目标是实现“容错操作”(fail-active operation),即在机器人部分功能受损的情况下仍能安全完成任务,避免依赖人工干预。解决方案的关键在于提出DEFT(Diffusion-based Embodiment-aware Trajectory Generator),这是一种基于扩散模型(diffusion-based)的轨迹生成器,能够根据机器人当前的物理形态(embodiment)和任务约束条件进行条件化生成,从而在任意类型的执行机构故障下仍可规划出有效的运动轨迹。DEFT具备跨故障类型泛化能力、支持受约束与自由运动,并在仿真与真实场景中均表现出优于传统方法的性能,尤其在未见过的故障配置下仍能保持鲁棒性,实现了高可靠性的容错操作。

链接: https://arxiv.org/abs/2602.02895
作者: Gilberto G. Briscoe-Martinez,Yaashia Gautam,Rahul Shetty,Anuj Pasricha,Marco M. Nicotra,Alessandro Roncone
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: To be published in the 2026 IEEE International Conference on Robotics Automation

点击查看摘要

Abstract:Robot failure is detrimental and disruptive, often requiring human intervention to recover. Maintaining safe operation under impairment to achieve task completion, i.e. fail-active operation, is our target. Focusing on actuation failures, we introduce DEFT, a diffusion-based trajectory generator conditioned on the robot’s current embodiment and task constraints. DEFT generalizes across failure types, supports constrained and unconstrained motions, and enables task completion under arbitrary failure. We evaluated DEFT in both simulation and real-world scenarios using a 7-DoF robotic arm. In simulation over thousands of joint-failure cases across multiple tasks, DEFT outperformed the baseline by up to 2 times. On failures unseen during training, it continued to outperform the baseline, indicating robust generalization in simulation. Further, we performed real-world evaluations on two multi-step tasks, drawer manipulation and whiteboard erasing. These experiments demonstrated DEFT succeeding on tasks where classical methods failed. Our results show that DEFT achieves fail-active manipulation across arbitrary failure configurations and real-world deployments.
zh

[AI-123] Mixture of Concept Bottleneck Experts

【速读】:该论文旨在解决传统概念瓶颈模型(Concept Bottleneck Models, CBMs)在预测准确性与用户需求适应性方面的局限性,这些问题源于其通常采用单一线性或布尔表达式作为任务预测器。解决方案的关键在于提出混合概念瓶颈专家模型(Mixture of Concept Bottleneck Experts, M-CBEs),通过扩展两个维度——专家数量和每个专家的函数形式——来探索未被充分挖掘的设计空间。具体而言,文中实例化了两种新模型:Linear M-CBE学习一组有限的线性表达式,Symbolic M-CBE则利用符号回归从数据中发现符合用户指定运算符词汇表的专家函数,从而在准确性和可解释性之间提供更灵活的权衡机制,适配不同用户和任务需求。

链接: https://arxiv.org/abs/2602.02886
作者: Francesco De Santis,Gabriele Ciravegna,Giovanni De Felice,Arianna Casanova,Francesco Giannini,Michelangelo Diligenti,Mateo Espinosa Zarlenga,Pietro Barbiero,Johannes Schneider,Danilo Giordano
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Concept Bottleneck Models (CBMs) promote interpretability by grounding predictions in human-understandable concepts. However, existing CBMs typically fix their task predictor to a single linear or Boolean expression, limiting both predictive accuracy and adaptability to diverse user needs. We propose Mixture of Concept Bottleneck Experts (M-CBEs), a framework that generalizes existing CBMs along two dimensions: the number of experts and the functional form of each expert, exposing an underexplored region of the design space. We investigate this region by instantiating two novel models: Linear M-CBE, which learns a finite set of linear expressions, and Symbolic M-CBE, which leverages symbolic regression to discover expert functions from data under user-specified operator vocabularies. Empirical evaluation demonstrates that varying the mixture size and functional form provides a robust framework for navigating the accuracy-interpretability trade-off, adapting to different user and task needs.
zh

[AI-124] Learning-Infused Formal Reasoning : From Contract Synthesis to Artifact Reuse and Formal Semantics

【速读】:该论文旨在解决当前形式化方法(Formal Methods)在实际应用中面临的关键挑战:即验证过程高度依赖人工、难以复用已有成果,且缺乏对跨系统知识的持续积累与迁移能力。为实现下一代形式化验证系统的演进,其解决方案的核心在于构建一个融合自动化契约合成(Automated Contract Synthesis)、语义实体复用(Semantic Artifact Reuse)与基于精化(Refinement-Based)理论的混合框架。该框架通过结合大语言模型(Large Language Models)与基于图的表示方法,实现异构符号体系和抽象层级间的可扩展语义匹配,并确保形式化严谨性;同时依托组合推理机制,使验证知识得以系统性沉淀与复用,从而推动验证生态从孤立证明向知识驱动的累积式范式转变。

链接: https://arxiv.org/abs/2602.02881
作者: Arshad Beg,Diarmuid O’Donoghue,Rosemary Monahan
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 18 pages. Accepted at VERIFAI-2026: The Interplay between Artificial Intelligence and Software Verification LASER center, Villebrumier, France, March 8-11, 2026

点击查看摘要

Abstract:This vision paper articulates a long-term research agenda for formal methods at the intersection with artificial intelligence, outlining multiple conceptual and technical dimensions and reporting on our ongoing work toward realising this agenda. It advances a forward-looking perspective on the next generation of formal methods based on the integration of automated contract synthesis, semantic artifact reuse, and refinement-based theory. We argue that future verification systems must move beyond isolated correctness proofs toward a cumulative, knowledge-driven paradigm in which specifications, contracts, and proofs are continuously synthesised and transferred across systems. To support this shift, we outline a hybrid framework combining large language models with graph-based representations to enable scalable semantic matching and principled reuse of verification artifacts. Learning-based components provide semantic guidance across heterogeneous notations and abstraction levels, while symbolic matching ensures formal soundness. Grounded in compositional reasoning, this vision points toward verification ecosystems that evolve systematically, leveraging past verification efforts to accelerate future assurance.
zh

[AI-125] “I May Not Have Articulated Myself Clearly”: Diagnosing Dynamic Instability in LLM Reasoning at Inference Time

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中因“失去线索”(即过程层面的崩溃)而导致的错误问题,而现有评估方法仅关注生成结果的最终正确性,忽略了中间推理链的稳定性。解决方案的关键在于提出一种无需训练或微调、仅依赖标准API提供的token对数概率信息(inference-time observables)的不稳定性信号(instability signal),该信号结合了连续步骤间的分布偏移(Jensen-Shannon Divergence, JSD)和不确定性(熵),并通过分析每条推理轨迹中的峰值不稳定性强度来预测最终答案的正确性。研究表明,这种不稳定性信号具有高度可预测性,并能区分早期(可能修复)与晚期(更易导致失败)的不稳定状态,揭示了恢复能力不仅取决于分布变化强度,还与其发生时间相对于剩余解码窗口的位置有关。

链接: https://arxiv.org/abs/2602.02863
作者: Jinkun Chen,Fengxiang Cheng,Sijia Han,Vlado Keselj
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 21 pages, 12 figures, 15 tables

点击查看摘要

Abstract:Reasoning failures in large language models (LLMs) are typically measured only at the end of a generation, yet many failures manifest as a process-level breakdown: the model “loses the thread” mid-reasoning. We study whether such breakdowns are detectable from inference-time observables available in standard APIs (token log probabilities), without any training or fine-tuning. We define a simple instability signal that combines consecutive-step distributional shift (JSD) and uncertainty (entropy), summarize each trace by its peak instability strength, and show that this signal reliably predicts failure. Across GSM8K and HotpotQA, instability strength predicts wrong answers with above-chance AUC and yields monotonic bucket-level accuracy decline at scale across model sizes. Crucially, we show that instability is not uniformly harmful: early instability can reflect subsequent stabilization and a correct final answer (\emphcorrective instability), whereas late instability is more often followed by failure (\emphdestructive instability), even at comparable peak magnitudes, indicating that recoverability depends not only on how strongly the distribution changes but also on when such changes occur relative to the remaining decoding horizon. The method is model-agnostic, training-free, and reproducible, and is presented as a diagnostic lens rather than a corrective or control mechanism.
zh

[AI-126] STEER: Inference-Time Risk Control via Constrained Quality-Diversity Search

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在训练过程中因追求平均正确性而出现的模式崩溃(mode collapse)问题,尤其在临床分诊等有序决策场景中,标准对齐方法会削弱模型根据上下文约束灵活调整特异性与敏感性(即ROC操作点)的能力。解决方案的关键在于提出STEER(Steerable Tuning via Evolutionary Ensemble Refinement),一个无需训练的框架:通过离线、受限的质量-多样性搜索构建一组自然语言人格(natural-language personas)群体,确保行为覆盖范围的同时满足最低安全、推理和稳定性阈值;推理时引入单一可解释的控制参数,将用户指定的风险百分位映射到特定人格,实现决策保守性的单调调节,从而在不牺牲领域能力的前提下实现风险可控的行为引导。

链接: https://arxiv.org/abs/2602.02862
作者: Eric Yang,Jong Ha Lee,Jonathan Amar,Elissa Ye,Yugang Jia
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 20 pages

点击查看摘要

Abstract:Large Language Models (LLMs) trained for average correctness often exhibit mode collapse, producing narrow decision behaviors on tasks where multiple responses may be reasonable. This limitation is particularly problematic in ordinal decision settings such as clinical triage, where standard alignment removes the ability to trade off specificity and sensitivity (the ROC operating point) based on contextual constraints. We propose STEER (Steerable Tuning via Evolutionary Ensemble Refinement), a training-free framework that reintroduces this tunable control. STEER constructs a population of natural-language personas through an offline, constrained quality-diversity search that promotes behavioral coverage while enforcing minimum safety, reasoning, and stability thresholds. At inference time, STEER exposes a single, interpretable control parameter that maps a user-specified risk percentile to a selected persona, yielding a monotonic adjustment of decision conservativeness. On two clinical triage benchmarks, STEER achieves broader behavioral coverage compared to temperature-based sampling and static persona ensembles. Compared to a representative post-training method, STEER maintains substantially higher accuracy on unambiguous urgent cases while providing comparable control over ambiguous decisions. These results demonstrate STEER as a safety-preserving paradigm for risk control, capable of steering behavior without compromising domain competence.
zh

[AI-127] AutoSizer: Automatic Sizing of Analog and Mixed-Signal Circuits via Large Language Model (LLM ) Agents

【速读】:该论文旨在解决模拟与混合信号(Analog and Mixed-Signal, AMS)集成电路设计中晶体管尺寸优化的难题,该问题因非线性器件行为、高维设计空间和严格性能约束而尤为复杂。传统电子设计自动化(Electronic Design Automation, EDA)方法将尺寸优化视为静态黑箱优化,导致效率低且鲁棒性差。为突破这一瓶颈,作者提出AutoSizer框架,其核心在于构建一个双环优化机制:内层负责电路尺寸优化,外层通过分析优化动态与约束条件,利用仿真反馈迭代重构搜索空间,从而实现闭环自适应优化。该方案首次将大语言模型(Large Language Models, LLMs)的推理能力与可微分优化策略结合,形成基于反射的元优化(meta-optimization)范式,在AMS-SizingBench基准上验证了其在解质量、收敛速度和成功率上的显著优势。

链接: https://arxiv.org/abs/2602.02849
作者: Xi Yu,Dmitrii Torbunov,Soumyajit Mandal,Yihui Ren
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The design of Analog and Mixed-Signal (AMS) integrated circuits remains heavily reliant on expert knowledge, with transistor sizing a major bottleneck due to nonlinear behavior, high-dimensional design spaces, and strict performance constraints. Existing Electronic Design Automation (EDA) methods typically frame sizing as static black-box optimization, resulting in inefficient and less robust solutions. Although Large Language Models (LLMs) exhibit strong reasoning abilities, they are not suited for precise numerical optimization in AMS sizing. To address this gap, we propose AutoSizer, a reflective LLM-driven meta-optimization framework that unifies circuit understanding, adaptive search-space construction, and optimization orchestration in a closed loop. It employs a two-loop optimization framework, with an inner loop for circuit sizing and an outer loop that analyzes optimization dynamics and constraints to iteratively refine the search space from simulation feedback. We further introduce AMS-SizingBench, an open benchmark comprising 24 diverse AMS circuits in SKY130 CMOS technology, designed to evaluate adaptive optimization policies under realistic simulator-based constraints. AutoSizer experimentally achieves higher solution quality, faster convergence, and higher success rate across varying circuit difficulties, outperforming both traditional optimization methods and existing LLM-based agents.
zh

[AI-128] Causal Flow Q-Learning for Robust Offline Reinforcement Learning

【速读】:该论文旨在解决在离线强化学习(Offline Reinforcement Learning, Offline RL)中,由于演示者与学习者感官能力不匹配导致的观测混淆(confounded observations)问题,这种混淆会引入隐式混杂偏差(implicit confounding bias),从而损害策略学习的性能。解决方案的关键在于从因果视角出发,提出一种新的因果离线RL目标函数,该函数优化策略在潜在混杂偏差下最差情况下的性能表现;进一步地,通过引入一个深度判别器来衡量目标策略与名义行为策略之间的差异,实现了一种可实践的表达性流匹配(expressive flow-matching)策略学习方法,有效提升了在像素级任务中的鲁棒性和成功率。

链接: https://arxiv.org/abs/2602.02847
作者: Mingxuan Li,Junzhe Zhang,Elias Bareinboim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Expressive policies based on flow-matching have been successfully applied in reinforcement learning (RL) more recently due to their ability to model complex action distributions from offline data. These algorithms build on standard policy gradients, which assume that there is no unmeasured confounding in the data. However, this condition does not necessarily hold for pixel-based demonstrations when a mismatch exists between the demonstrator’s and the learner’s sensory capabilities, leading to implicit confounding biases in offline data. We address the challenge by investigating the problem of confounded observations in offline RL from a causal perspective. We develop a novel causal offline RL objective that optimizes policies’ worst-case performance that may arise due to confounding biases. Based on this new objective, we introduce a practical implementation that learns expressive flow-matching policies from confounded demonstrations, employing a deep discriminator to assess the discrepancy between the target policy and the nominal behavioral policy. Experiments across 25 pixel-based tasks demonstrate that our proposed confounding-robust augmentation procedure achieves a success rate 120% that of confounding-unaware, state-of-the-art offline RL methods.
zh

[AI-129] Semantics-Aware Generative Latent Data Augmentation for Learning in Low-Resource Domains

【速读】:该论文旨在解决深度学习在数据稀缺场景下性能下降的问题,尤其是在下游微调阶段因标注数据不足导致的泛化能力受限问题。尽管基础模型(Foundation Models, FMs)能从大规模数据中提取通用特征并实现良好泛化,但在低资源环境下仍面临挑战。解决方案的关键在于提出GeLDA——一种语义感知的生成式潜在数据增强框架,其核心是利用条件扩散模型在基础模型诱导的低维潜在空间中合成高质量样本;该潜在空间相比输入空间更集中地保留任务相关特征,且通过辅助特征向量对语义关系(如类别或子域间的关系)进行建模,从而实现高效、精准的数据增强,显著提升模型在零样本语言特定语音情感识别和长尾图像分类等任务中的性能表现。

链接: https://arxiv.org/abs/2602.02841
作者: Jae-Sung Bae,Minje Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite strong performance in data-rich regimes, deep learning often underperforms in the data-scarce settings common in practice. While foundation models (FMs) trained on massive datasets demonstrate strong generalization by extracting general-purpose features, they can still suffer from scarce labeled data during downstream fine-tuning. To address this, we propose GeLDA, a semantics-aware generative latent data augmentation framework that leverages conditional diffusion models to synthesize samples in an FM-induced latent space. Because this space is low-dimensional and concentrates task-relevant information compared to the input space, GeLDA enables efficient, high-quality data generation. GeLDA conditions generation on auxiliary feature vectors that capture semantic relationships among classes or subdomains, facilitating data augmentation in low-resource domains. We validate GeLDA in two large-scale recognition tasks: (a) in zero-shot language-specific speech emotion recognition, GeLDA improves the Whisper-large baseline’s unweighted average recall by 6.13%; and (b) in long-tailed image classification, it achieves 74.7% tail-class accuracy on ImageNet-LT, setting a new state-of-the-art result.
zh

[AI-130] abula RASA: Exposing and Breaking the Relational Bottleneck in Transformers

【速读】:该论文旨在解决标准Transformer在处理需要多跳关系推理(multi-hop relational reasoning)的结构化数据任务时表现不佳的问题。其核心限制在于,标准Transformer在电路复杂度上属于TC0\mathsf{TC}^0-完全类,且对于kk-跳推理至少需要Ω(k)\Omega(k)层深度,导致效率低下。解决方案的关键在于提出RASA(Relation-Aware Sparse Attention),通过两个最小修改实现:(1) 引入边类型嵌入(edge-type embeddings),将关系结构显式注入注意力分数中;(2) 采用稀疏掩码(sparse masking),将注意力范围限制在图邻接位置,从而将注意力搜索空间从O(2n2)O(2^{n^2})压缩至O(2m)O(2^m),同时提供明确的关系路由机制。实验证明,RASA在MetaQA和WebQuestionsSP等多跳推理任务上显著优于标准Transformer,并接近GPT-4性能,尤其在深度推理任务中优势更明显。

链接: https://arxiv.org/abs/2602.02834
作者: Jonas Petersen,Camilla Mazzoleni,Riccardo Maggioni
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 4 figures, 8 tables

点击查看摘要

Abstract:Transformers achieve remarkable performance across many domains, yet struggle with tasks requiring multi-hop relational reasoning over structured data. We analyze this limitation through circuit complexity: standard transformers are \mathsfTC^0 -complete and require \Omega(k) layers for k -hop reasoning. We introduce RASA (Relation-Aware Sparse Attention), a minimal modification adding: (1) edge-type embeddings that inject relational structure into attention scores, and (2) sparse masking that restricts attention to graph-adjacent positions. While RASA has the same asymptotic depth requirements, sparse masking reduces the attention search space from O(2^n^2) to O(2^m) patterns, and edge biases provide explicit relation routing. Empirically, on MetaQA (1/2/3-hop) and WebQuestionsSP, RASA outperforms standard transformers and matches GPT-4 at lower cost, with advantages growing with reasoning depth (+7.1 points on 3-hop). We do not claim formal learnability guarantees; the contribution is empirical validation that minimal structural modifications substantially improve multi-hop reasoning.
zh

[AI-131] Joint Learning of Hierarchical Neural Options and Abstract World Model

【速读】:该论文旨在解决AI代理通过组合已有技能来学习新技能的效率问题,尤其是针对现有无模型层次强化学习算法在数据利用上效率低下的挑战。其解决方案的关键在于提出了一种名为AgentOWL(Option and World model Learning Agent)的新方法,该方法以样本高效的方式联合学习一个抽象世界模型(抽象化状态与时间维度)和一组层次神经选项(hierarchical neural options),从而显著提升技能学习的效率。

链接: https://arxiv.org/abs/2602.02799
作者: Wasu Top Piriyakulkij,Wolfgang Lehrach,Kevin Ellis,Kevin Murphy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Building agents that can perform new skills by composing existing skills is a long-standing goal of AI agent research. Towards this end, we investigate how to efficiently acquire a sequence of skills, formalized as hierarchical neural options. However, existing model-free hierarchical reinforcement algorithms need a lot of data. We propose a novel method, which we call AgentOWL (Option and World model Learning Agent), that jointly learns – in a sample efficient way – an abstract world model (abstracting across both states and time) and a set of hierarchical neural options. We show, on a subset of Object-Centric Atari games, that our method can learn more skills using much less data than baseline methods.
zh

[AI-132] Causality–Δ: Jacobian-Based Dependency Analysis in Flow Matching Models

【速读】:该论文旨在解决生成式模型中潜在扰动在流匹配(flow matching)过程中如何传播的问题,从而揭示生成特征间的依赖结构。其解决方案的关键在于利用雅可比向量积(Jacobian-vector products, JVPs)作为分析工具,通过推导高斯和混合高斯场景下最优漂移项及其雅可比矩阵的闭式表达式,发现即使全局非线性的流仍具有局部仿射结构;进一步地,在图像域中,将流与属性分类器组合以构建属性级JVP估计器,能够有效恢复MNIST和CelebA数据集上的经验相关性,并通过条件化小的分类器-雅可比范数来降低相关性,支持了共同原因(common-cause)结构的假设,但强调该条件化并非形式上的do干预。

链接: https://arxiv.org/abs/2602.02793
作者: Reza Rezvan(1),Gustav Gille(1),Moritz Schauer(1 and 2),Richard Torkar(1 and 2) ((1) Chalmers University of Technology, (2) University of Gothenburg)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages, 5 figures. Code: this https URL

点击查看摘要

Abstract:Flow matching learns a velocity field that transports a base distribution to data. We study how small latent perturbations propagate through these flows and show that Jacobian-vector products (JVPs) provide a practical lens on dependency structure in the generated features. We derive closed-form expressions for the optimal drift and its Jacobian in Gaussian and mixture-of-Gaussian settings, revealing that even globally nonlinear flows admit local affine structure. In low-dimensional synthetic benchmarks, numerical JVPs recover the analytical Jacobians. In image domains, composing the flow with an attribute classifier yields an attribute-level JVP estimator that recovers empirical correlations on MNIST and CelebA. Conditioning on small classifier-Jacobian norms reduces correlations in a way consistent with a hypothesized common-cause structure, while we emphasize that this conditioning is not a formal do intervention.
zh

[AI-133] Simulating Human Audiovisual Search Behavior

【速读】:该论文旨在解决人类在不确定环境下如何协同视听线索进行目标搜索的问题,特别是如何在时间、努力与准确性之间做出适应性权衡。现有模型通常将感知与动作分离,忽视了个体如何动态调整身体运动与感官策略以优化搜索效率。其解决方案的关键在于提出Sensonaut这一计算模型,该模型基于资源理性(resource-rationality)框架,在部分可观测条件下将搜索行为建模为一个决策问题,假设个体会根据对感知约束的认知选择最有效利用资源的策略。通过模拟人类在任务复杂度、遮挡和干扰下的搜索行为,该模型不仅再现了人类搜索时间与努力的自适应调节,还复现了典型的人类错误模式,从而为降低认知负荷和优化视听界面设计提供了理论依据。

链接: https://arxiv.org/abs/2602.02790
作者: Hyunsung Cho,Xuejing Luo,Byungjoo Lee,David Lindlbauer,Antti Oulasvirta
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 17 pages, 10 figures, CHI 2026

点击查看摘要

Abstract:Locating a target based on auditory and visual cues \unicodex2013 such as finding a car in a crowded parking lot or identifying a speaker in a virtual meeting \unicodex2013 requires balancing effort, time, and accuracy under uncertainty. Existing models of audiovisual search often treat perception and action in isolation, overlooking how people adaptively coordinate movement and sensory strategies. We present Sensonaut, a computational model of embodied audiovisual search. The core assumption is that people deploy their body and sensory systems in ways they believe will most efficiently improve their chances of locating a target, trading off time and effort under perceptual constraints. Our model formulates this as a resource-rational decision-making problem under partial observability. We validate the model against newly collected human data, showing that it reproduces both adaptive scaling of search time and effort under task complexity, occlusion, and distraction, and characteristic human errors. Our simulation of human-like resource-rational search informs the design of audiovisual interfaces that minimize search cost and cognitive load.
zh

[AI-134] Structure-Preserving Learning Improves Geometry Generalization in Neural PDEs

【速读】:该论文旨在解决科学与工程领域中实时求解偏微分方程(Partial Differential Equations, PDEs)的问题,尤其关注在未见几何结构下仍能保持物理结构和精度的通用性建模挑战。其核心解决方案是提出通用几何神经惠特尼形式(General-Geometry Neural Whitney Forms, Geo-NeW)——一种数据驱动的有限元方法,通过联合学习微分算子与定义在底层几何上的兼容降维有限元空间,实现对PDE的高效求解。该方法利用基于Transformer的编码器处理离散化网格,并将几何信息作为基函数构建的基础,从而显式地将几何形状与边界条件嵌入模型中,形成强大的归纳偏置以提升泛化能力;同时引入新的本构模型参数化方式,确保解的存在性和唯一性,最终在多个稳态PDE基准测试中达到最先进性能,且在分布外几何上显著优于传统基线方法。

链接: https://arxiv.org/abs/2602.02788
作者: Benjamin D. Shaffer,Shawn Koohy,Brooks Kinch,M. Ani Hsieh,Nathaniel Trask
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注:

点击查看摘要

Abstract:We aim to develop physics foundation models for science and engineering that provide real-time solutions to Partial Differential Equations (PDEs) which preserve structure and accuracy under adaptation to unseen geometries. To this end, we introduce General-Geometry Neural Whitney Forms (Geo-NeW): a data-driven finite element method. We jointly learn a differential operator and compatible reduced finite element spaces defined on the underlying geometry. The resulting model is solved to generate predictions, while exactly preserving physical conservation laws through Finite Element Exterior Calculus. Geometry enters the model as a discretized mesh both through a transformer-based encoding and as the basis for the learned finite element spaces. This explicitly connects the underlying geometry and imposed boundary conditions to the solution, providing a powerful inductive bias for learning neural PDEs, which we demonstrate improves generalization to unseen domains. We provide a novel parameterization of the constitutive model ensuring the existence and uniqueness of the solution. Our approach demonstrates state-of-the-art performance on several steady-state PDE benchmarks, and provides a significant improvement over conventional baselines on out-of-distribution geometries.
zh

[AI-135] Cross-Temporal Attention Fusion (CTAF) for Multimodal Physiological Signals in Self-Supervised Learning

【速读】:该论文旨在解决多模态情感建模中脑电图(EEG)与外周生理信号在时间上异步的问题,这一问题常被现有融合方法忽略或通过代价高昂的时间对齐(warping)处理。其解决方案的核心是提出一种自监督的跨时序注意力融合机制(Cross-Temporal Attention Fusion, CTAF),该机制通过学习模态间的软双向对齐关系,利用时间感知的交叉注意力、轻量级融合门控以及可选弱监督下的对齐正则化对比损失,直接建模不同模态间的时间对应性,从而实现无需昂贵对齐即可获得鲁棒的片段嵌入。此方法有效捕捉了中枢神经系统与自主神经系统在心理生理时间序列中的耦合特性,显著提升了跨模态匹配对的余弦间隔和一秒内token检索性能,同时在标签稀缺场景下保持与基线相当的三分类准确率和宏F1分数。

链接: https://arxiv.org/abs/2602.02784
作者: Arian Khorasani,Théophile Demazure
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We study multimodal affect modeling when EEG and peripheral physiology are asynchronous, which most fusion methods ignore or handle with costly warping. We propose Cross-Temporal Attention Fusion (CTAF), a self-supervised module that learns soft bidirectional alignments between modalities and builds a robust clip embedding using time-aware cross attention, a lightweight fusion gate, and alignment-regularized contrastive objectives with optional weak supervision. On the K-EmoCon dataset, under leave-one-out cross-validation evaluation, CTAF yields higher cosine margins for matched pairs and better cross-modal token retrieval within one second, and it is competitive with the baseline on three-bin accuracy and macro-F1 while using few labels. Our contributions are a time-aware fusion mechanism that directly models correspondence, an alignment-driven self-supervised objective tailored to EEG and physiology, and an evaluation protocol that measures alignment quality itself. Our approach accounts for the coupling between the central and autonomic nervous systems in psychophysiological time series. These results indicate that CTAF is a strong step toward label-efficient, generalizable EEG-peripheral fusion under temporal asynchrony.
zh

[AI-136] Evaluating False Alarm and Missing Attacks in CAN IDS

【速读】:该论文旨在解决车载网络中基于机器学习(Machine Learning, ML)的入侵检测系统(Intrusion Detection System, IDS)在面对对抗性攻击时的鲁棒性不足问题。其关键解决方案是通过系统性地评估四种浅层学习模型与一种深度神经网络(Deep Neural Network, DNN)模型在ROAD数据集上的表现,采用FGSM、BIM和PGD等梯度-based攻击方法对CAN帧进行协议合规的payload级扰动,从而揭示不同模型在良性流量和恶意流量下的脆弱性。研究发现,尽管所有模型在正常条件下均具有高准确率,但在对抗攻击下普遍存在漏报率显著上升的问题,且存在同时触发误报和逃避检测的现象,表明当前车载IDS亟需引入对抗鲁棒性评估机制以保障安全关键场景下的可靠性。

链接: https://arxiv.org/abs/2602.02781
作者: Nirab Hossain,Pablo Moriano
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 2 figures, and 8 tables

点击查看摘要

Abstract:Modern vehicles rely on electronic control units (ECUs) interconnected through the Controller Area Network (CAN), making in-vehicle communication a critical security concern. Machine learning (ML)-based intrusion detection systems (IDS) are increasingly deployed to protect CAN traffic, yet their robustness against adversarial manipulation remains largely unexplored. We present a systematic adversarial evaluation of CAN IDS using the ROAD dataset, comparing four shallow learning models with a deep neural network-based detector. Using protocol-compliant, payload-level perturbations generated via FGSM, BIM and PGD, we evaluate adversarial effects on both benign and malicious CAN frames. While all models achieve strong baseline performance under benign conditions, adversarial perturbations reveal substantial vulnerabilities. Although shallow and deep models are robust to false-alarm induction, with the deep neural network (DNN) performing best on benign traffic, all architectures suffer significant increases in missed attacks. Notably, under gradient-based attacks, the shallow model extra trees (ET) demonstrates improved robustness to missed-attack induction compared to the other models. Our results demonstrate that adversarial manipulation can simultaneously trigger false alarms and evade detection, underscoring the need for adversarial robustness evaluation in safety-critical automotive IDS.
zh

[AI-137] Scaling-Aware Adapter for Structure-Grounded LLM Reasoning ICML2026

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在生物分子结构推理中因模态特异性导致的几何信息缺失与模态融合瓶颈问题,尤其是现有方法通过序列化分词或固定长度查询连接压缩结构输入,易引发结构幻觉并限制泛化能力。其核心解决方案在于提出Cuttlefish——一个统一的全原子LLM,关键创新为:一是尺度感知补丁化(Scaling-Aware Patching),利用指令条件门控机制生成可变大小的结构图补丁,自适应地按结构复杂度扩展查询token预算,缓解固定长度连接器的瓶颈;二是几何接地适配器(Geometry Grounding Adapter),通过交叉注意力对齐模态嵌入并注入优化后的模态token至LLM,显式引入几何线索以降低结构幻觉。实验表明,该方法在多样的全原子基准测试中实现了更优的异构结构引导推理性能。

链接: https://arxiv.org/abs/2602.02780
作者: Zihao Jing,Qiuhao Zeng,Ruiyi Fang,Yan Yi Li,Yan Sun,Boyu Wang,Pingzhao Hu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under review at ICML 2026

点击查看摘要

Abstract:Large language models (LLMs) are enabling reasoning over biomolecular structures, yet existing methods remain modality-specific and typically compress structural inputs through sequence-based tokenization or fixed-length query connectors. Such architectures either omit the geometric groundings requisite for mitigating structural hallucinations or impose inflexible modality fusion bottlenecks that concurrently over-compress and suboptimally allocate structural tokens, thereby impeding the realization of generalized all-atom reasoning. We introduce Cuttlefish, a unified all-atom LLM that grounds language reasoning in geometric cues while scaling modality tokens with structural complexity. First, Scaling-Aware Patching leverages an instruction-conditioned gating mechanism to generate variable-size patches over structural graphs, adaptively scaling the query token budget with structural complexity to mitigate fixed-length connector bottlenecks. Second, Geometry Grounding Adapter refines these adaptive tokens via cross-attention to modality embeddings and injects the resulting modality tokens into the LLM, exposing explicit geometric cues to reduce structural hallucination. Experiments across diverse all-atom benchmarks demonstrate that Cuttlefish achieves superior performance in heterogeneous structure-grounded reasoning. Code is available at the project repository.
zh

[AI-138] Provable Effects of Data Replay in Continual Learning: A Feature Learning Perspective AISTATS2026

【速读】:该论文旨在解决持续学习(Continual Learning, CL)中的灾难性遗忘(catastrophic forgetting)问题,即模型在学习新任务时对先前任务性能的显著下降。其解决方案的关键在于提出一个基于特征学习视角的理论框架,用于分析全量数据重放(full data-replay)训练的有效性。研究发现,信号-噪声比(Signal-to-Noise Ratio, SNR)是决定遗忘与否的核心因素:即使在拥有全部历史数据的情况下,若后续任务累积的噪声主导了早期任务的信号,仍会发生遗忘;而通过充分积累信号,数据重放可恢复初始学习不佳的任务。此外,论文揭示了一个新的任务排序机制——优先训练高信号任务不仅能促进低信号任务的学习,还能有效缓解灾难性遗忘。

链接: https://arxiv.org/abs/2602.02767
作者: Meng Ding,Jinhui Xu,Kaiyi Ji
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: AISTATS 2026

点击查看摘要

Abstract:Continual learning (CL) aims to train models on a sequence of tasks while retaining performance on previously learned ones. A core challenge in this setting is catastrophic forgetting, where new learning interferes with past knowledge. Among various mitigation strategies, data-replay methods, where past samples are periodically revisited, are considered simple yet effective, especially when memory constraints are relaxed. However, the theoretical effectiveness of full data replay, where all past data is accessible during training, remains largely unexplored. In this paper, we present a comprehensive theoretical framework for analyzing full data-replay training in continual learning from a feature learning perspective. Adopting a multi-view data model, we identify the signal-to-noise ratio (SNR) as a critical factor affecting forgetting. Focusing on task-incremental binary classification across M tasks, our analysis verifies two key conclusions: (1) forgetting can still occur under full replay when the cumulative noise from later tasks dominates the signal from earlier ones; and (2) with sufficient signal accumulation, data replay can recover earlier tasks-even if their initial learning was poor. Notably, we uncover a novel insight into task ordering: prioritizing higher-signal tasks not only facilitates learning of lower-signal tasks but also helps prevent catastrophic forgetting. We validate our theoretical findings through synthetic and real-world experiments that visualize the interplay between signal learning and noise memorization across varying SNRs and task correlation regimes.
zh

[AI-139] Entropy-Guided Dynamic Tokens for Graph-LLM Alignment in Molecular Understanding ICLR2026

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在分子图理解任务中表现不佳的问题,尤其是现有图-语言模型(graph-LLM)桥梁方法多采用固定长度静态令牌的Q-Former结构,这类设计忽视了立体化学(stereochemistry)和子结构上下文信息,并通常需要昂贵的LLM主干微调,限制了效率与泛化能力。其解决方案的关键在于提出EDT-Former——一种基于熵引导的动态令牌Transformer架构,能够生成与分子中信息丰富区域对齐的动态令牌,从而保留分子图的局部与全局结构特征;同时,该方法实现了冻结图编码器与LLM之间的对齐,无需微调LLM主干(仅嵌入层除外),显著提升了计算效率,并在MoleculeQA、Molecule-oriented Mol-Instructions以及属性预测基准(TDC、MoleculeNet)上达到当前最优性能,验证了其在可扩展且通用的多模态分子理解中的有效性。

链接: https://arxiv.org/abs/2602.02742
作者: Zihao Jing,Qiuhao Zeng,Ruiyi Fang,Yan Sun,Boyu Wang,Pingzhao Hu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ICLR 2026

点击查看摘要

Abstract:Molecular understanding is central to advancing areas such as scientific discovery, yet Large Language Models (LLMs) struggle to understand molecular graphs effectively. Existing graph-LLM bridges often adapt the Q-Former-style connector with fixed-length static tokens, which is originally designed for vision tasks. These designs overlook stereochemistry and substructural context and typically require costly LLM-backbone fine-tuning, limiting efficiency and generalization. We introduce EDT-Former, an Entropy-guided Dynamic Token Transformer that generates tokens aligned with informative molecular patches, thereby preserving both local and global structural features for molecular graph understanding. Beyond prior approaches, EDT-Former enables alignment between frozen graph encoders and LLMs without tuning the LLM backbone (excluding the embedding layer), resulting in computationally efficient finetuning, and achieves stateof-the-art results on MoleculeQA, Molecule-oriented Mol-Instructions, and property prediction benchmarks (TDC, MoleculeNet), underscoring its effectiveness for scalable and generalizable multimodal molecular understanding
zh

[AI-140] opoPrune: Robust Data Pruning via Unified Latent Space Topology

【速读】:该论文旨在解决几何数据剪枝方法在实际应用中因依赖外在几何结构而导致的不稳定性问题,尤其是在跨架构迁移或存在特征噪声时性能显著下降的问题。其解决方案的关键在于引入TopoPrune框架,该框架通过拓扑学原理捕捉数据的内在稳定结构:首先利用拓扑感知的流形近似建立数据集的全局低维嵌入;随后采用可微分持久同调(differentiable persistent homology)对流形嵌入进行局部拓扑优化,根据样本的结构复杂度进行排序。这一双尺度拓扑方法不仅在高剪枝率(如90%)下保持高精度和准确性,还展现出对潜在特征嵌入噪声的强鲁棒性及跨网络架构的优异迁移能力。

链接: https://arxiv.org/abs/2602.02739
作者: Arjun Roy,Prajna G. Malettira,Manish Nagaraj,Kaushik Roy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint. Under Review

点击查看摘要

Abstract:Geometric data pruning methods, while practical for leveraging pretrained models, are fundamentally unstable. Their reliance on extrinsic geometry renders them highly sensitive to latent space perturbations, causing performance to degrade during cross-architecture transfer or in the presence of feature noise. We introduce TopoPrune, a framework which resolves this challenge by leveraging topology to capture the stable, intrinsic structure of data. TopoPrune operates at two scales, (1) utilizing a topology-aware manifold approximation to establish a global low-dimensional embedding of the dataset. Subsequently, (2) it employs differentiable persistent homology to perform a local topological optimization on the manifold embeddings, ranking samples by their structural complexity. We demonstrate that our unified dual-scale topological approach ensures high accuracy and precision, particularly at significant dataset pruning rates (e.g., 90%). Furthermore, through the inherent stability properties of topology, TopoPrune is (a) exceptionally robust to noise perturbations of latent feature embeddings and (b) demonstrates superior transferability across diverse network architectures. This study demonstrates a promising avenue towards stable and principled topology-based frameworks for robust data-efficient learning.
zh

[AI-141] When Noise Lowers The Loss: Rethinking Likelihood-Based Evaluation in Music Large Language Models ICASSP

【速读】:该论文试图解决音乐大语言模型(Music LLM)在评估生成音乐质量时缺乏可靠指标的问题,特别是标准交叉熵损失(cross-entropy loss)在面对系统性噪声干扰时反而下降,导致其无法作为独立的质量判别依据。解决方案的关键在于引入噪声注入实验(noise injection experiment),通过向音乐上下文注入不同长度的控制噪声信号,观察模型损失函数对扰动的响应模式;研究发现,模型对短时局部纹理扰动表现出明显的损失峰值(Peak area),而对全局语义破坏反应较弱,这表明损失曲线的形状(而非绝对值)能更准确地反映生成内容的质量——由此提出一种基于损失曲线特征的标签无关、模型内生的质量评估框架,为改进训练目标和构建更严谨的评测基准提供了新思路。

链接: https://arxiv.org/abs/2602.02738
作者: Xiaosha Li,Chun Liu,Ziyu Wang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: Accepted by IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2026

点击查看摘要

Abstract:The rise of music large language models (LLMs) demands robust methods of evaluating output quality, especially in distinguishing high-quality compositions from “garbage music”. Curiously, we observe that the standard cross-entropy loss – a core training metric – often decrease when models encounter systematically corrupted music, undermining its validity as a standalone quality indicator. To investigate this paradox, we introduce noise injection experiment, where controlled noise signal of varying lengths are injected into musical contexts. We hypothesize that a model’s loss reacting positively to these perturbations, specifically a sharp increase (“Peak” area) for short injection, can serve as a proxy for its ability to discern musical integrity. Experiments with MusicGen models in the audio waveform domain confirm that Music LLMs respond more strongly to local, texture-level disruptions than to global semantic corruption. Beyond exposing this bias, our results highlight a new principle: the shape of the loss curve – rather than its absolute value – encodes critical information about the quality of the generated content (i.e., model behavior). We envision this profile-based evaluation as a label-free, model-intrinsic framework for assessing musical quality – opening the door to more principled training objectives and sharper benchmarks.
zh

[AI-142] CAPS: Unifying Attention Recurrence and Alignment in Transformer-based Time Series Forecasting

【速读】:该论文旨在解决时间序列预测中标准Softmax注意力机制对全局趋势、局部冲击和季节性模式等多重时序结构纠缠建模的问题,同时克服传统递归模型在长期依赖建模中因因果结构限制而牺牲顺序无关选择能力的缺陷。其解决方案的关键在于提出一种结构化注意力机制CAPS(Clock-weighted Aggregation with Prefix-products and Softmax),通过SO(2)旋转实现相位对齐,并在单一注意力层内融合三种可加门控路径:Riemann Softmax、前缀积门控(prefix-product gates)与Clock基线,其中Clock机制作为可学习的时间权重模块,统一调控各路径的贡献,从而实现对多尺度时序结构的解耦建模与高效聚合。

链接: https://arxiv.org/abs/2602.02729
作者: Viresh Pati,Yubin Kim,Vinh Pham,Jevon Twitty,Shihao Yang,Jiecheng Lu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents \textbfCAPS (Clock-weighted Aggregation with Prefix-products and Softmax), a structured attention mechanism for time series forecasting that decouples three distinct temporal structures: global trends, local shocks, and seasonal patterns. Standard softmax attention entangles these through global normalization, while recent recurrent models sacrifice long-term, order-independent selection for order-dependent causal structure. CAPS combines SO(2) rotations for phase alignment with three additive gating paths – Riemann softmax, prefix-product gates, and a Clock baseline – within a single attention layer. We introduce the Clock mechanism, a learned temporal weighting that modulates these paths through a shared notion of temporal importance. Experiments on long- and short-term forecasting benchmarks surpass vanilla softmax and linear attention mechanisms and demonstrate competitive performance against seven strong baselines with linear complexity. Our code implementation is available at this https URL.
zh

[AI-143] Search-Augmented Masked Diffusion Models for Constrained Generation

【速读】:该论文旨在解决离散扩散模型(Discrete Diffusion Models)在训练过程中仅优化似然目标,无法在推理阶段有效施加硬约束或优化不可微属性的问题。其解决方案的关键在于提出了一种无需训练的神经符号推理框架——搜索增强掩码扩散(Search-Augmented Masked Diffusion, SearchDiff),该方法将有信息量的搜索直接集成到反向去噪过程中:在每一步去噪中,模型预测定义一个候选集,并在其上基于用户指定的目标属性进行优化,从而生成修正后的逆向转移分布,引导采样朝向高概率且满足约束的解。

链接: https://arxiv.org/abs/2602.02727
作者: Huu Binh Ta(1),Michael Cardei(1),Alvaro Velasquez(2),Ferdinando Fioretto(1) ((1) University of Virginia, (2) University of Colorado at Boulder)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Huu Binh Ta and Michael Cardei contributed equally to this work

点击查看摘要

Abstract:Discrete diffusion models generate sequences by iteratively denoising samples corrupted by categorical noise, offering an appealing alternative to autoregressive decoding for structured and symbolic generation. However, standard training targets a likelihood-based objective that primarily matches the data distribution and provides no native mechanism for enforcing hard constraints or optimizing non-differentiable properties at inference time. This work addresses this limitation and introduces Search-Augmented Masked Diffusion (SearchDiff), a training-free neurosymbolic inference framework that integrates informed search directly into the reverse denoising process. At each denoising step, the model predictions define a proposal set that is optimized under a user-specified property satisfaction, yielding a modified reverse transition that steers sampling toward probable and feasible solutions. Experiments in biological design and symbolic reasoning illustrate that SearchDiff substantially improves constraint satisfaction and property adherence, while consistently outperforming discrete diffusion and autoregressive baselines.
zh

[AI-144] Dynamic Mix Precision Routing for Efficient Multi-step LLM Interaction

【速读】:该论文旨在解决长周期决策任务中因使用高精度大语言模型(Large Language Models, LLM)导致的推理成本过高问题,同时维持或提升任务成功率。其核心挑战在于:尽管更大的LLM通常能提高任务成功率,但多步交互带来的计算开销难以承受。解决方案的关键在于提出一种动态混合精度路由框架(dynamic mix-precision routing framework),该框架基于不同决策步骤对精度敏感度差异的观察,自适应地在高精度与低精度量化LLM之间进行选择。该框架通过两阶段训练策略优化:第一阶段采用基于KL散度的监督学习识别敏感步骤,第二阶段利用组相对策略优化(Group-Relative Policy Optimization, GRPO)进一步提升任务成功率,从而在准确率与推理成本之间实现显著改进。

链接: https://arxiv.org/abs/2602.02711
作者: Yuanzhe Li,Jianing Deng,Jingtong Hu,Tianlong Chen,Song Wang,Huanrui Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLM) achieve strong performance in long-horizon decision-making tasks through multi-step interaction and reasoning at test time. While practitioners commonly believe a higher task success rate necessitates the use of a larger and stronger LLM model, multi-step interaction with a large LLM incurs prohibitive inference cost. To address this problem, we explore the use of low-precision quantized LLM in the long-horizon decision-making process. Based on the observation of diverse sensitivities among interaction steps, we propose a dynamic mix-precision routing framework that adaptively selects between high-precision and low-precision LLMs at each decision step. The router is trained via a two-stage pipeline, consisting of KL-divergence-based supervised learning that identifies precision-sensitive steps, followed by Group-Relative Policy Optimization (GRPO) to further improve task success rates. Experiments on ALFWorld demonstrate that our approach achieves a great improvement on accuracy-cost trade-off over single-precision baselines and heuristic routing methods.
zh

[AI-145] ATLAS : Adaptive Self-Evolutionary Research Agent with Task-Distributed Multi-LLM Supporters

【速读】:该论文旨在解决多大语言模型(Large Language Model, LLM)代理系统在长时程任务中因静态偏好优化循环或冻结求解器而导致的性能不稳定与适应性不足问题。其核心解决方案是提出ATLAS(Adaptive Task-distributed Learning for Agentic Self-evolution)框架,通过任务分布式机制,在迭代过程中训练一个轻量级研究代理(research agent),同时将探索、超参数调优和参考策略管理等互补角色分配给专业化支持代理(supporter agents)。关键创新在于EvoDPO(Evolving Direct Preference Optimization)算法,该算法能自适应地更新分阶段索引的参考策略,并结合概念漂移下的偏好型上下文Bandit理论 regret 分析,显著提升了系统在非平稳环境中的稳定性和性能表现。

链接: https://arxiv.org/abs/2602.02709
作者: Ujin Jeon,Jiyong Kwon,Madison Ann Sullivan,Caleb Eunho Lee,Guang Lin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent multi-LLM agent systems perform well in prompt optimization and automated problem-solving, but many either keep the solver frozen after fine-tuning or rely on a static preference-optimization loop, which becomes intractable for long-horizon tasks. We propose ATLAS (Adaptive Task-distributed Learning for Agentic Self-evolution), a task-distributed framework that iteratively develops a lightweight research agent while delegating complementary roles to specialized supporter agents for exploration, hyperparameter tuning, and reference policy management. Our core algorithm, Evolving Direct Preference Optimization (EvoDPO), adaptively updates the phase-indexed reference policy. We provide a theoretical regret analysis for a preference-based contextual bandit under concept drift. In addition, experiments were conducted on non-stationary linear contextual bandits and scientific machine learning (SciML) loss reweighting for the 1D Burgers’ equation. Both results show that ATLAS improves stability and performance over a static single-agent baseline.
zh

[AI-146] Every Bit Counts: A Theoretical Study of Precision-Expressivity Tradeoffs in Quantized Transformers

【速读】:该论文旨在解决量化(quantization)对Transformer模型表达能力(expressivity)影响不明确的问题,特别是当数值精度降低时,模型是否仍能可靠地执行某些关键计算任务。其核心贡献在于通过理论分析揭示了表达能力与精度之间的精细权衡:研究者构造了一个受等值函数(equality function)启发的函数Γ,并证明一个单层softmax Transformer在使用p比特精度时可计算该函数,但在p−1比特精度下则无法实现。这一结果首次以严格数学方式解释了实践中广泛观察到的现象——量化导致模型表达能力下降,尤其对于依赖精确匹配或成员关系判断的任务而言,丢失哪怕一比特精度都可能跨越不可靠表示的阈值。解决方案的关键在于结合有限精度的Transformer显式构造与通信复杂性下界论证,从而得出“一比特”级别的紧致阈值边界,为实际应用中合理选择量化程度提供了理论依据。

链接: https://arxiv.org/abs/2602.02707
作者: Sayak Chakrabarti,Toniann Pitassi,Josh Alman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Quantization reduces the numerical precision of Transformer computations and is widely used to accelerate inference, yet its effect on expressivity remains poorly characterized. We demonstrate a fine-grained theoretical tradeoff between expressivity and precision: For every p we exhibit a function \Gamma, inspired by the equality function, and prove that a one-layer softmax Transformer can compute \Gamma, with p bits of precision, but not with p-1 bits of precision. This result concretely explains the widely observed phenomenon of empirical loss of expressivity when quantization is used. Practically, it suggests that tasks requiring equality-like comparisons (exact match, membership, etc.) are especially sensitive to quantization. Dropping even one bit can cross a threshold where the model cannot represent the needed comparison reliably. Thus, it paves the way for developing heuristics that will help practitioners choose how much quantization is possible: the precision should be chosen as a function of the length of equality to be checked for the specific task. Our proofs combine explicit finite-precision Transformer constructions with communication-complexity lower bounds, yielding a tight “one-bit” threshold. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.02707 [cs.LG] (or arXiv:2602.02707v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.02707 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-147] Sparsely Supervised Diffusion

【速读】:该论文旨在解决扩散模型(Diffusion Models)在生成过程中存在的空间不一致性问题,即局部看似合理但全局结构失真的现象,其根源可能在于模型去噪机制的局部性。解决方案的关键在于引入一种稀疏监督学习策略,通过训练时对图像进行高比例(高达98%)的像素掩码(masking),迫使模型依赖更广泛的上下文信息进行生成,从而提升全局一致性并减少过拟合,同时增强小数据集下的训练稳定性。

链接: https://arxiv.org/abs/2602.02699
作者: Wenshuai Zhao,Zhiyuan Li,Yi Zhao,Mohammad Hassan Vali,Martin Trapp,Joni Pajarinen,Juho Kannala,Arno Solin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 20 pages, 11 figures

点击查看摘要

Abstract:Diffusion models have shown remarkable success across a wide range of generative tasks. However, they often suffer from spatially inconsistent generation, arguably due to the inherent locality of their denoising mechanisms. This can yield samples that are locally plausible but globally inconsistent. To mitigate this issue, we propose sparsely supervised learning for diffusion models, a simple yet effective masking strategy that can be implemented with only a few lines of code. Interestingly, the experiments show that it is safe to mask up to 98% of pixels during diffusion model training. Our method delivers competitive FID scores across experiments and, most importantly, avoids training instability on small datasets. Moreover, the masking strategy reduces memorization and promotes the use of essential contextual information during generation.
zh

[AI-148] Eidolon: A Practical Post-Quantum Signature Scheme Based on k-Colorability in the Age of Graph Neural Networks

【速读】:该论文旨在解决后量子密码学中签名方案的安全性与效率问题,特别是寻找基于组合难题的可实用签名机制以抵御量子计算攻击。其解决方案的关键在于提出Eidolon签名方案,该方案基于NP完全的k-着色问题(k-colorability problem),通过将Goldreich-Micali-Wigderson零知识协议推广至任意k=3,并结合Fiat-Shamir变换和Merkle树承诺技术,将签名长度从O(tn)压缩至O(t log n);同时,通过生成具有“静默”(quiet)着色结构的硬实例来保持随机图的统计特性,从而有效抵抗经典求解器(如ILP、DSatur)和定制图神经网络(GNN)攻击,实验证明在n=60时两种攻击均无法恢复秘密着色,验证了该方法在现代密码分析下仍具备安全性。

链接: https://arxiv.org/abs/2602.02689
作者: Asmaa Cherkaoui,Ramon Flores,Delaram Kahrobaei,Richard Wilson
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 23 pages, 5 figures

点击查看摘要

Abstract:We propose Eidolon, a practical post-quantum signature scheme based on the NP-complete k-colorability problem. Our construction generalizes the Goldreich-Micali-Wigderson zero-knowledge protocol to arbitrary k = 3, applies the Fiat-Shamir transform, and uses Merkle-tree commitments to compress signatures from O(tn) to O(t log n). Crucially, we generate hard instances via planted “quiet” colorings that preserve the statistical profile of random graphs. We present the first empirical security analysis of such a scheme against both classical solvers (ILP, DSatur) and a custom graph neural network (GNN) attacker. Experiments show that for n = 60, neither approach recovers the secret coloring, demonstrating that well-engineered k-coloring instances can resist modern cryptanalysis, including machine learning. This revives combinatorial hardness as a credible foundation for post-quantum signatures.
zh

[AI-149] MARA: Continuous SE(3)-Equivariant Attention for Molecular Force Fields

【速读】:该论文旨在解决现有机器学习力场(Machine Learning Force Fields, MLFFs)在建模原子尺度相互作用时依赖固定角度展开、缺乏对局部几何信息灵活加权能力的问题。其解决方案的关键在于提出模块化角-径向注意力机制(Modular Angular-Radial Attention, MARA),该机制将原本用于SO(3)任务的球面注意力扩展至分子领域并适配SE(3)对称性,直接基于邻近原子的角坐标和径向坐标进行高效等变交互建模,从而实现几何感知且可插拔式的局部环境权重调整,无需修改原有模型架构即可提升能量与力预测精度及鲁棒性。

链接: https://arxiv.org/abs/2602.02671
作者: Francesco Leonardi,Boris Bonev,Kaspar Riesen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Machine learning force fields (MLFFs) have become essential for accurate and efficient atomistic modeling. Despite their high accuracy, most existing approaches rely on fixed angular expansions, limiting flexibility in weighting local geometric interactions. We introduce Modular Angular-Radial Attention (MARA), a module that extends spherical attention – originally developed for SO(3) tasks – to the molecular domain and SE(3), providing an efficient approximation of equivariant interactions. MARA operates directly on the angular and radial coordinates of neighboring atoms, enabling flexible, geometrically informed, and modular weighting of local environments. Unlike existing attention mechanisms in SE(3)-equivariant architectures, MARA can be integrated in a plug-and-play manner into models such as MACE without architectural modifications. Across molecular benchmarks, MARA improves energy and force predictions, reduces high-error events, and enhances robustness. These results demonstrate that continuous spherical attention is an effective and generalizable geometric operator that increases the expressiveness, stability, and reliability of atomistic models.
zh

[AI-150] MARS: Modular Agent with Reflective Search for Automated AI Research

【速读】:该论文旨在解决自动化AI研究中因计算成本高昂(如模型训练)和性能归因不透明而导致的效率低下问题,现有基于大语言模型(Large Language Models, LLMs)的智能体常生成结构单一的脚本,忽视执行开销与因果因素。其解决方案的关键在于提出MARS(Modular Agent with Reflective Search)框架,包含三个核心机制:(1) 基于成本约束的蒙特卡洛树搜索(cost-constrained Monte Carlo Tree Search, MCTS)实现预算感知规划,显式权衡性能与执行代价;(2) 模块化构建策略,通过“设计-分解-实现”流水线管理复杂的研究代码库;(3) 对比反思记忆机制,通过分析不同解法差异来提取高信噪比的洞察,从而改善信用分配。该框架在MLE-Bench上达到开源框架最优水平,并展现出显著的跨路径知识迁移能力(63%的学习经验来自跨分支转移),验证了其在自主AI研究中的有效性与泛化潜力。

链接: https://arxiv.org/abs/2602.02660
作者: Jiefeng Chen,Bhavana Dalvi Mishra,Jaehyun Nam,Rui Meng,Tomas Pfister,Jinsung Yoon
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automating AI research differs from general software engineering due to computationally expensive evaluation (e.g., model training) and opaque performance attribution. Current LLM-based agents struggle here, often generating monolithic scripts that ignore execution costs and causal factors. We introduce MARS (Modular Agent with Reflective Search), a framework optimized for autonomous AI research. MARS relies on three pillars: (1) Budget-Aware Planning via cost-constrained Monte Carlo Tree Search (MCTS) to explicitly balance performance with execution expense; (2) Modular Construction, employing a “Design-Decompose-Implement” pipeline to manage complex research repositories; and (3) Comparative Reflective Memory, which addresses credit assignment by analyzing solution differences to distill high-signal insights. MARS achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings, maintaining competitiveness with the global leaderboard’s top methods. Furthermore, the system exhibits qualitative “Aha!” moments, where 63% of all utilized lessons originate from cross-branch transfer, demonstrating that the agent effectively generalizes insights across search paths.
zh

[AI-151] Benchmarking Large Language Models for Zero-shot and Few-shot Phishing URL Detection NEURIPS2025

【速读】:该论文旨在解决当前钓鱼URL检测面临的严峻挑战,即传统URL机制在安全、信任和抗欺诈能力上的先天不足,以及生成式AI(Generative AI)驱动下新型欺骗性URL日益复杂化、规模化的问题。面对攻击手法迭代速度远超标注数据生产速度的现实,论文提出以大语言模型(Large Language Models, LLMs)为基础的零样本(zero-shot)与少样本(few-shot)学习方案作为关键解决方案,通过统一的提示框架实现对未知钓鱼URL的高效泛化识别,从而提升大规模网络安全防御体系中钓鱼URL检测的准确性和实用性。

链接: https://arxiv.org/abs/2602.02641
作者: Najmul Hasan,Prashanth BusiReddyGari
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 9 pages, accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025 LAW Workshop)

点击查看摘要

Abstract:The Uniform Resource Locator (URL), introduced in a connectivity-first era to define access and locate resources, remains historically limited, lacking future-proof mechanisms for security, trust, or resilience against fraud and abuse, despite the introduction of reactive protections like HTTPS during the cybersecurity era. In the current AI-first threatscape, deceptive URLs have reached unprecedented sophistication due to the widespread use of generative AI by cybercriminals and the AI-vs-AI arms race to produce context-aware phishing websites and URLs that are virtually indistinguishable to both users and traditional detection tools. Although AI-generated phishing accounted for a small fraction of filter-bypassing attacks in 2024, phishing volume has escalated over 4,000% since 2022, with nearly 50% more attacks evading detection. At the rate the threatscape is escalating, and phishing tactics are emerging faster than labeled data can be produced, zero-shot and few-shot learning with large language models (LLMs) offers a timely and adaptable solution, enabling generalization with minimal supervision. Given the critical importance of phishing URL detection in large-scale cybersecurity defense systems, we present a comprehensive benchmark of LLMs under a unified zero-shot and few-shot prompting framework and reveal operational trade-offs. Our evaluation uses a balanced dataset with consistent prompts, offering detailed analysis of performance, generalization, and model efficacy, quantified by accuracy, precision, recall, F1 score, AUROC, and AUPRC, to reflect both classification quality and practical utility in threat detection settings. We conclude few-shot prompting improves performance across multiple LLMs.
zh

[AI-152] A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)自解释(self-explanations)的可信性(faithfulness)问题,即现有方法难以准确评估这些解释是否真实反映模型决策过程。传统 faithfulness 评测依赖对抗性提示或推理错误检测,忽视了解释本身对预测模型行为的价值。论文提出一种新的可扩展度量指标——归一化模拟增益(Normalized Simulatability Gain, NSG),其核心思想是:若解释忠实,则观察者能据此学习模型的决策标准,从而更准确预测模型在相关输入上的行为。实验表明,LLM 自解释显著提升行为预测能力(NSG 提升 11–37%),且优于外部模型生成的解释,揭示了模型自身知识的独特价值;同时识别出约 5–15% 的自解释存在严重误导,但整体仍具预测信息价值。

链接: https://arxiv.org/abs/2602.02639
作者: Harry Mayne,Justin Singh Kang,Dewi Gould,Kannan Ramchandran,Adam Mahdi,Noah Y. Siegel
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:LLM self-explanations are often presented as a promising tool for AI oversight, yet their faithfulness to the model’s true reasoning process is poorly understood. Existing faithfulness metrics have critical limitations, typically relying on identifying unfaithfulness via adversarial prompting or detecting reasoning errors. These methods overlook the predictive value of explanations. We introduce Normalized Simulatability Gain (NSG), a general and scalable metric based on the idea that a faithful explanation should allow an observer to learn a model’s decision-making criteria, and thus better predict its behavior on related inputs. We evaluate 18 frontier proprietary and open-weight models, e.g., Gemini 3, GPT-5.2, and Claude 4.5, on 7,000 counterfactuals from popular datasets covering health, business, and ethics. We find self-explanations substantially improve prediction of model behavior (11-37% NSG). Self-explanations also provide more predictive information than explanations generated by external models, even when those models are stronger. This implies an advantage from self-knowledge that external explanation methods cannot replicate. Our approach also reveals that, across models, 5-15% of self-explanations are egregiously misleading. Despite their imperfections, we show a positive case for self-explanations: they encode information that helps predict model behavior.
zh

[AI-153] Performance of Small Language Model Pretraining on FABRIC: An Empirical Study

【速读】:该论文旨在解决小规模语言模型(Small-Scale Language Models, SSLMs)在有限计算资源下进行高效预训练的问题,尤其是在使用商品级GPU集群和开源工具(如Alpa与Ray)时如何优化训练性能并减少GPU消耗。其关键解决方案在于系统性地分析和比较数据并行、算子内并行(intra-operator parallelism)与算子间/流水线并行(inter-operator/pipeline parallelism)及其组合策略,并发现当GPU地理分布导致网络延迟处于数十毫秒量级时,Alpa通过协同优化算子内与算子间并行所生成的执行计划表现最优,从而显著提升训练效率并降低GPU使用数量。

链接: https://arxiv.org/abs/2602.02632
作者: Praveen Rao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) require enormous computing power to pretrain on massive datasets. When limited datasets are available, smaller-sized LLMs are better choice to pretrain (on user-specified datasets) by following the scaling laws of LLMs. Using pretrained models, vector embeddings can be generated for raw data and stored using vector databases to support modern AI applications and semantic search. In this work, we investigate the performance of pretraining techniques for smaller-sized LLMs on an experimental testbed (with commodity GPUs) available to academic users at no charge. We consider data parallelism, intra-operator parallelism, and inter-operator/pipeline parallelism, and their combinations for pretraining. We set up different GPU clusters with homogeneous and heterogeneous GPU hardware. Furthermore, we investigate the impact of network latency on pretraining performance especially when GPUs are geographically distributed. We used GPT-2 medium and large models and pretrained them using open-source packages, namely, Alpa and Ray. We observed that Alpa’s execution plans that collectively optimized intra-operator and inter-operator/pipeline parallelism consistently performed the best when GPUs were geographically distributed. This was especially true when the network latencies were in 10’s of milliseconds. Based on the insights gained from the experiments, we propose a systematic approach for selecting the appropriate pretraining technique to achieve high training performance/lower execution time as well as to reduce the number of GPUs used.
zh

[AI-154] railer Reimagined: An Innovative Llm -DRiven Expressive Automated Movie Summary framework (TRAILDREAMS)

【速读】:该论文旨在解决电影预告片自动化生成中的创意表达与质量瓶颈问题,即如何利用人工智能技术高效生成具有吸引力且视觉上令人满意的预告片内容。解决方案的关键在于提出TRAILDREAMS框架,该框架基于大语言模型(Large Language Model, LLM)来智能选择关键画面序列和高影响力对白,并协同生成音乐和配音等音频元素,从而实现从原始影片素材到完整预告片的端到端自动化制作流程。

链接: https://arxiv.org/abs/2602.02630
作者: Roberto Balestri,Pasquale Cascarano,Mirko Degli Esposti,Guglielmo Pescatore
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces TRAILDREAMS, a framework that uses a large language model (LLM) to automate the production of movie trailers. The purpose of LLM is to select key visual sequences and impactful dialogues, and to help TRAILDREAMS to generate audio elements such as music and voiceovers. The goal is to produce engaging and visually appealing trailers efficiently. In comparative evaluations, TRAILDREAMS surpasses current state-of-the-art trailer generation methods in viewer ratings. However, it still falls short when compared to real, human-crafted trailers. While TRAILDREAMS demonstrates significant promise and marks an advancement in automated creative processes, further improvements are necessary to bridge the quality gap with traditional trailers.
zh

[AI-155] rustworthy Blockchain-based Federated Learning for Electronic Health Records: Securing Participant Identity with Decentralized Identifiers and Verifiable Credentials

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在医疗健康领域应用中的安全与可信问题,特别是针对数据隐私法规导致的数据孤岛以及FL模型易受投毒攻击(poisoning attacks)和Sybil攻击(恶意参与者伪造身份入侵网络)的缺陷。其解决方案的关键在于构建一个基于可信区块链的联邦学习框架(Trustworthy Blockchain-based Federated Learning, TBFL),通过集成自主权身份(Self-Sovereign Identity, SSI)标准,利用去中心化标识符(Decentralized Identifiers, DIDs)和可验证凭证(Verifiable Credentials, VCs)实现对参与方的强身份认证,从而确保只有经过密码学验证的医疗机构才能贡献训练数据,从根本上杜绝虚假身份带来的安全风险。实验表明,该方法能完全抵御Sybil攻击,同时保持优异的临床预测性能(AUC = 0.954,Recall = 0.890),且计算开销极低(仅0.12%)。

链接: https://arxiv.org/abs/2602.02629
作者: Rodrigo Tertulino,Ricardo Almeida,Laercio Alencar
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The digitization of healthcare has generated massive volumes of Electronic Health Records (EHRs), offering unprecedented opportunities for training Artificial Intelligence (AI) models. However, stringent privacy regulations such as GDPR and HIPAA have created data silos that prevent centralized training. Federated Learning (FL) has emerged as a promising solution that enables collaborative model training without sharing raw patient data. Despite its potential, FL remains vulnerable to poisoning and Sybil attacks, in which malicious participants corrupt the global model or infiltrate the network using fake identities. While recent approaches integrate Blockchain technology for auditability, they predominantly rely on probabilistic reputation systems rather than robust cryptographic identity verification. This paper proposes a Trustworthy Blockchain-based Federated Learning (TBFL) framework integrating Self-Sovereign Identity (SSI) standards. By leveraging Decentralized Identifiers (DIDs) and Verifiable Credentials (VCs), our architecture ensures only authenticated healthcare entities contribute to the global model. Through comprehensive evaluation using the MIMIC-IV dataset, we demonstrate that anchoring trust in cryptographic identity verification rather than behavioral patterns significantly mitigates security risks while maintaining clinical utility. Our results show the framework successfully neutralizes 100% of Sybil attacks, achieves robust predictive performance (AUC = 0.954, Recall = 0.890), and introduces negligible computational overhead (0.12%). The approach provides a secure, scalable, and economically viable ecosystem for inter-institutional health data collaboration, with total operational costs of approximately 18 for 100 training rounds across multiple institutions.
zh

[AI-156] Recommender system in X inadvertently profiles ideological positions of users

【速读】:该论文旨在解决社交平台推荐系统中用户政治与社会属性在AI“黑箱”内部如何被学习、表示和处理的问题,尤其关注这些属性是否在推荐过程中被隐式建模并影响推荐结果。其关键解决方案在于通过数据捐赠计划收集超过250万条好友推荐数据,并结合公开的推荐系统架构信息,推断出被推荐用户在推荐模型嵌入空间中的位置;进一步利用政治调查校准的意识形态量表,量化分析了用户的政治立场(左-右光谱)与其他社会人口学特征(如年龄、性别)的关系。结果显示,推荐系统生成的用户空间排序与政治立场高度相关(皮尔逊相关系数ρ=0.887,p<0.0001),且无法由社会人口学因素解释,揭示了算法对政治属性的隐式建模能力。这一发现不仅为研究人机交互提供了新视角,也为隐私合规提供了一种基于约束性推荐方法的新路径,即通过限制推荐系统中政治信息的传播来实现算法透明度与用户隐私保护的平衡。

链接: https://arxiv.org/abs/2602.02624
作者: Paul Bouchaud,Pedro Ramaciotti
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Studies on recommendations in social media have mainly analyzed the quality of recommended items (e.g., their diversity or biases) and the impact of recommendation policies (e.g., in comparison with purely chronological policies). We use a data donation program, collecting more than 2.5 million friend recommendations made to 682 volunteers on X over a year, to study instead how real-world recommenders learn, represent and process political and social attributes of users inside the so-called black boxes of AI systems. Using publicly available knowledge on the architecture of the recommender, we inferred the positions of recommended users in its embedding space. Leveraging ideology scaling calibrated with political survey data, we analyzed the political position of users in our study (N=26,509 among volunteers and recommended contacts) among several attributes, including age and gender. Our results show that the platform’s recommender system produces a spatial ordering of users that is highly correlated with their Left-Right positions (Pearson rho=0.887, p-value 0.0001), and that cannot be explained by socio-demographic attributes. These results open new possibilities for studying the interaction between human and AI systems. They also raise important questions linked to the legal definition of algorithmic profiling in data privacy regulation by blurring the line between active and passive profiling. We explore new constrained recommendation methods enabled by our results, limiting the political information in the recommender as a potential tool for privacy compliance capable of preserving recommendation relevance.
zh

[AI-157] Learning Consistent Causal Abstraction Networks ICASSP2026 ICASSP

【速读】:该论文旨在解决生成一致因果抽象网络(Consistent Causal Abstraction Network, CAN)的学习问题,以提升人工智能系统的可解释性、可信度和鲁棒性。其核心挑战在于如何在保持因果结构语义一致性的同时,从更精细的结构因果模型(Structural Causal Models, SCMs)中自动提取高阶抽象表示。解决方案的关键在于构建一个层析理论(sheaf-theoretic)框架:首先假设SCMs为高斯分布,其次通过构造线性因果抽象(Constructive Linear Causal Abstractions, CAs)的转置映射作为限制映射(restriction maps),确保语义嵌入原则成立;再次,边上的茎(edge stalks)与更详细SCM节点上的茎(node stalks)在排列意义下一一对应,从而保证局部与全局因果结构的一致性。该方法将整体优化问题分解为边缘特定的局部黎曼流形优化子问题,避免了非凸目标函数,并提出SPECTRAL算法——一种具有闭式更新规则的迭代方法,适用于正定及半正定协方差矩阵,显著提升了学习效率与稳定性。实验表明该方法在合成数据上能有效学习因果抽象并成功恢复多种CAN结构。

链接: https://arxiv.org/abs/2602.02623
作者: Gabriele D’Acunto,Paolo Di Lorenzo,Sergio Barbarossa
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: To be published in the proceedings of ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). arXiv admin note: substantial text overlap with arXiv:2509.25236

点击查看摘要

Abstract:Causal artificial intelligence aims to enhance explainability, trustworthiness, and robustness in AI by leveraging structural causal models (SCMs). In this pursuit, recent advances formalize network sheaves and cosheaves of causal knowledge. Pushing in the same direction, we tackle the learning of consistent causal abstraction network (CAN), a sheaf-theoretic framework where (i) SCMs are Gaussian, (ii) restriction maps are transposes of constructive linear causal abstractions (CAs) adhering to the semantic embedding principle, and (iii) edge stalks correspond–up to permutation–to the node stalks of more detailed SCMs. Our problem formulation separates into edge-specific local Riemannian problems and avoids nonconvex objectives. We propose an efficient search procedure, solving the local problems with SPECTRAL, our iterative method with closed-form updates and suitable for positive definite and semidefinite covariance matrices. Experiments on synthetic data show competitive performance in the CA learning task, and successful recovery of diverse CAN structures.
zh

[AI-158] daVinci-Agency: Unlocking Long-Horizon Agency Data-Efficiently

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在长周期代理工作流(long-horizon agentic workflows)中难以规模化训练的问题,其核心瓶颈在于缺乏能够捕捉真实长期依赖结构和跨阶段演化动态的高质量训练数据。现有合成方法或受限于单一特征场景,或因人工标注成本过高而不可扩展。解决方案的关键在于重新构想数据合成方式,以现实世界软件演进为视角,利用代码提交记录(Pull Request, PR)序列作为监督信号:PR序列将复杂目标分解为可验证的提交单元,保持迭代过程中的功能一致性,并通过缺陷修复历史编码真实的优化模式。基于此洞察,作者提出daVinci-Agency框架,通过三个相互耦合机制实现结构化监督挖掘:(1)通过连续提交进行渐进式任务分解,(2)通过统一的功能目标维持长期一致性,(3)从真实的缺陷修复轨迹中获取可验证的改进路径。该方法显著优于独立处理每一步的合成轨迹,天然保留因果依赖与迭代优化特性,从而有效教导持续的目标导向行为,并自然对齐项目级全周期任务建模。

链接: https://arxiv.org/abs/2602.02619
作者: Mohan Jiang,Dayuan Fu,Junhao Shi,Ji Zeng,Weiye Si,Keyu Li,Xuefeng Li,Yang Xiao,Wenjie Li,Dequan Wang,Pengfei Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:While Large Language Models (LLMs) excel at short-term tasks, scaling them to long-horizon agentic workflows remains challenging. The core bottleneck lies in the scarcity of training data that captures authentic long-dependency structures and cross-stage evolutionary dynamics–existing synthesis methods either confine to single-feature scenarios constrained by model distribution, or incur prohibitive human annotation costs, failing to provide scalable, high-quality supervision. We address this by reconceptualizing data synthesis through the lens of real-world software evolution. Our key insight: Pull Request (PR) sequences naturally embody the supervision signals for long-horizon learning. They decompose complex objectives into verifiable submission units, maintain functional coherence across iterations, and encode authentic refinement patterns through bug-fix histories. Building on this, we propose daVinci-Agency, which systematically mines structured supervision from chain-of-PRs through three interlocking mechanisms: (1) progressive task decomposition via continuous commits, (2) long-term consistency enforcement through unified functional objectives, and (3) verifiable refinement from authentic bug-fix trajectories. Unlike synthetic trajectories that treat each step independently, daVinci-Agency’s PR-grounded structure inherently preserves the causal dependencies and iterative refinements essential for teaching persistent goal-directed behavior and enables natural alignment with project-level, full-cycle task modeling. The resulting trajectories are substantial–averaging 85k tokens and 116 tool calls–yet remarkably data-efficient: fine-tuning GLM-4.6 on 239 daVinci-Agency samples yields broad improvements across benchmarks, notably achieving a 47% relative gain on Toolathlon. Beyond benchmark performance, our analysis confirms…
zh

[AI-159] A Semi-Supervised Pipeline for Generalized Behavior Discovery from Animal-Borne Motion Time Series

【速读】:该论文旨在解决从动物携带传感器中学习行为分类体系时面临的挑战,包括标签稀缺、类别严重不平衡以及某些行为可能未出现在标注数据集中的问题。为此,作者提出了一种半监督的行为发现流程:首先利用少量标注样本学习嵌入函数,随后对标注与未标注样本的嵌入进行标签引导聚类以形成候选行为组,并通过一种基于核密度估计(KDE)与最高密度区域(HDR)的包含度量(containment score)判断新发现簇是否为真正新颖的行为。该方法的关键创新在于引入了以最佳匹配包含度作为可解释的新颖性统计量,能够有效识别在训练阶段完全缺失但存在于未标注数据中的行为模式,从而实现生态学运动时间序列下的广义类别发现。

链接: https://arxiv.org/abs/2602.02618
作者: Fatemeh Karimi Nejadasl,Judy Shamoun-Baranes,Eldar Rakhimberdiev
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Learning behavioral taxonomies from animal-borne sensors is challenging because labels are scarce, classes are highly imbalanced, and behaviors may be absent from the annotated set. We study generalized behavior discovery in short multivariate motion snippets from gulls, where each sample is a sequence with 3-axis IMU acceleration (20 Hz) and GPS speed, spanning nine expert-annotated behavior categories. We propose a semi-supervised discovery pipeline that (i) learns an embedding function from the labeled subset, (ii) performs label-guided clustering over embeddings of both labeled and unlabeled samples to form candidate behavior groups, and (iii) decides whether a discovered group is truly novel using a containment score. Our key contribution is a KDE + HDR (highest-density region) containment score that measures how much a discovered cluster distribution is contained within, or contains, each known-class distribution; the best-match containment score serves as an interpretable novelty statistic. In experiments where an entire behavior is withheld from supervision and appears only in the unlabeled pool, the method recovers a distinct cluster and the containment score flags novelty via low overlap, while a negative-control setting with no novel behavior yields consistently higher overlaps. These results suggest that HDR-based containment provides a practical, quantitative test for generalized class discovery in ecological motion time series under limited annotation and severe class imbalance.
zh

[AI-160] nyGuard:A lightweight Byzantine Defense for Resource-Constrained Federated Learning via Statistical Update Fingerprints

【速读】:该论文旨在解决联邦学习(Federated Learning)系统中 Byzantine 安全性问题,即在资源受限且大规模场景下,传统 Byzantine 鲁棒聚合机制因依赖高维梯度比较或成对距离计算而存在显著计算开销,难以部署。解决方案的关键在于提出 TinyGuard,一种轻量级防御框架,通过统计更新指纹(statistical update fingerprinting)实现高效检测:它不直接处理高维梯度,而是提取客户端更新的低维统计特征(如范数统计、层间比例、稀疏性及低阶矩),并在该低维空间中以线性复杂度(O(n))识别异常行为,从而在不影响标准 FedAvg 收敛性的前提下,有效抵御多种 Byzantine 攻击(如符号翻转、缩放、噪声注入和标签投毒),并展现出对抗自适应白盒攻击时的“统计锁链”特性——攻击者无法同时规避检测与实现有效污染。

链接: https://arxiv.org/abs/2602.02615
作者: Ali Mahdavi,Santa Aghapour,Azadeh Zamanifar,Amirfarhad Farhadi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing Byzantine robust aggregation mechanisms typically rely on fulldimensional gradi ent comparisons or pairwise distance computations, resulting in computational overhead that limits applicability in large scale and resource constrained federated systems. This paper proposes TinyGuard, a lightweight Byzantine defense that augments the standard FedAvg algorithm via statistical update f ingerprinting. Instead of operating directly on high-dimensional gradients, TinyGuard extracts compact statistical fingerprints cap turing key behavioral properties of client updates, including norm statistics, layer-wise ratios, sparsity measures, and low-order mo ments. Byzantine clients are identified by measuring robust sta tistical deviations in this low-dimensional fingerprint space with nd complexity, without modifying the underlying optimization procedure. Extensive experiments on MNIST, Fashion-MNIST, ViT-Lite, and ViT-Small with LoRA adapters demonstrate that TinyGuard pre serves FedAvg convergence in benign settings and achieves up to 95 percent accuracy under multiple Byzantine attack scenarios, including sign-flipping, scaling, noise injection, and label poisoning. Against adaptive white-box adversaries, Pareto frontier analysis across four orders of magnitude confirms that attackers cannot simultaneously evade detection and achieve effective poisoning, features we term statistical handcuffs. Ablation studies validate stable detection precision 0.8 across varying client counts (50-150), threshold parameters and extreme data heterogeneity . The proposed framework is architecture-agnostic and well-suited for federated fine-tuning of foundation models where traditional Byzantine defenses become impractical
zh

[AI-161] sting Storag e-System Correctness: Challenges Fuzzing Limitations and AI-Augmented Opportunities

【速读】:该论文旨在解决存储系统(storage systems)正确性验证难题,尤其针对耐久性(durability)、顺序一致性(ordering)、恢复机制(recovery)和一致性(consistency)等常见失效模式难以系统化暴露的问题。其核心观点指出,此类挑战并非源于测试工具不足,而是源于存储系统执行的内在特性,如非确定性交错(nondeterministic interleavings)、长时程状态演化(long-horizon state evolution)以及跨层与多阶段的正确性语义(correctness semantics spanning multiple layers and execution phases)。解决方案的关键在于从存储系统视角重构测试方法论,将现有技术按其针对的执行属性和故障机制进行分类,涵盖并发测试、长时间负载模拟、崩溃一致性分析、硬件级语义验证及分布式故障注入等策略,并进一步指出传统模糊测试(fuzzing)在假设上与存储系统语义存在系统性错位,而生成式 AI(Generative AI)可通过状态感知和语义引导增强模糊测试的有效性,从而提供统一且面向未来的存储系统正确性测试框架。

链接: https://arxiv.org/abs/2602.02614
作者: Ying Wang,Jiahui Chen,Dejun Jiang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Storage systems are fundamental to modern computing infrastructures, yet ensuring their correctness remains challenging in practice. Despite decades of research on system testing, many storage-system failures (including durability, ordering, recovery, and consistency violations) remain difficult to expose systematically. This difficulty stems not primarily from insufficient testing tooling, but from intrinsic properties of storage-system execution, including nondeterministic interleavings, long-horizon state evolution, and correctness semantics that span multiple layers and execution phases. This survey adopts a storage-centric view of system testing and organizes existing techniques according to the execution properties and failure mechanisms they target. We review a broad spectrum of approaches, ranging from concurrency testing and long-running workloads to crash-consistency analysis, hardware-level semantic validation, and distributed fault injection, and analyze their fundamental strengths and limitations. Within this framework, we examine fuzzing as an automated testing paradigm, highlighting systematic mismatches between conventional fuzzing assumptions and storage-system semantics, and discuss how recent artificial intelligence advances may complement fuzzing through state-aware and semantic guidance. Overall, this survey provides a unified perspective on storage-system correctness testing and outlines key challenges Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR) Cite as: arXiv:2602.02614 [cs.SE] (or arXiv:2602.02614v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2602.02614 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-162] Exploring Silicon-Based Societies: An Early Study of the Moltbook Agent Community

【速读】:该论文旨在解决大规模自主大语言模型代理(autonomous large language model agents)生态系统中集体行为难以通过零散观察或小规模模拟进行理解的问题。其解决方案的关键在于提出并实践“数据驱动硅社会学”(data-driven silicon sociology)这一系统性实证框架,通过非侵入式程序化数据采集方式,对真实世界中代理间交互平台Moltbook上的12,758条子社区分裂活动文本描述进行处理,结合上下文嵌入与无监督聚类技术,从机器生成的数据痕迹中直接挖掘出社会结构形成的潜在模式,揭示了代理群体在人类模仿兴趣、硅基自我反思及早期经济协作行为等维度上表现出的可复现组织规律。

链接: https://arxiv.org/abs/2602.02613
作者: Yu-Zheng Lin,Bono Po-Jen Shih,Hsuan-Ying Alessandra Chien,Shalaka Satam,Jesus Horacio Pacheco,Sicong Shao,Soheil Salehi,Pratik Satam
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 10 pages, 3 figures, a pilot study for silicon-based societies

点击查看摘要

Abstract:The rapid emergence of autonomous large language model agents has given rise to persistent, large-scale agent ecosystems whose collective behavior cannot be adequately understood through anecdotal observation or small-scale simulation. This paper introduces data-driven silicon sociology as a systematic empirical framework for studying social structure formation among interacting artificial agents. We present a pioneering large-scale data mining investigation of an in-the-wild agent society by analyzing Moltbook, a social platform designed primarily for agent-to-agent interaction. At the time of study, Moltbook hosted over 150,000 registered autonomous agents operating across thousands of agent-created sub-communities. Using programmatic and non-intrusive data acquisition, we collected and analyzed the textual descriptions of 12,758 submolts, which represent proactive sub-community partitioning activities within the ecosystem. Treating agent-authored descriptions as first-class observational artifacts, we apply rigorous preprocessing, contextual embedding, and unsupervised clustering techniques to uncover latent patterns of thematic organization and social space structuring. The results show that autonomous agents systematically organize collective space through reproducible patterns spanning human-mimetic interests, silicon-centric self-reflection, and early-stage economic and coordination behaviors. Rather than relying on predefined sociological taxonomies, these structures emerge directly from machine-generated data traces. This work establishes a methodological foundation for data-driven silicon sociology and demonstrates that data mining techniques can provide a powerful lens for understanding the organization and evolution of large autonomous agent societies.
zh

[AI-163] Discovering Data Manifold Geometry via Non-Contracting Flows

【速读】:该论文旨在解决如何在不依赖标签的情况下构建一个全局参考系(global reference system),以学习高维数据流形上的内在坐标表示问题。传统方法常基于等距性目标(isometric objectives),隐含假设流形是平坦的,这限制了其在复杂非线性结构上的适用性。论文的关键解决方案是:在环境空间中学习一组能张成未知数据流形切空间的向量场(vector fields),这些向量场的流动将所有样本映射到一个可学习的共同参考点;通过沿这些流动路径计算弧长,得到与共享全局坐标系对齐的可解释内在坐标。为防止退化坍缩,作者引入非收缩约束,并设计了一个无需积分的可扩展目标函数,该目标受流匹配(flow matching)启发。理论证明表明,最小化该目标可恢复存在时的全局坐标图(global coordinate chart)。

链接: https://arxiv.org/abs/2602.02611
作者: David Vigouroux(ANITI, IMT Atlantique),Lucas Drumetz,Ronan Fablet(IMT Atlantique - MEE, Lab-STICC_OSE, ODYSSEY),François Rousseau(IMT Atlantique - ITI, LaTIM)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce an unsupervised approach for constructing a global reference system by learning, in the ambient space, vector fields that span the tangent spaces of an unknown data manifold. In contrast to isometric objectives, which implicitly assume manifold flatness, our method learns tangent vector fields whose flows transport all samples to a common, learnable reference point. The resulting arc-lengths along these flows define interpretable intrinsic coordinates tied to a shared global frame. To prevent degenerate collapse, we enforce a non-shrinking constraint and derive a scalable, integration-free objective inspired by flow matching. Within our theoretical framework, we prove that minimizing the proposed objective recovers a global coordinate chart when one exists. Empirically, we obtain correct tangent alignment and coherent global coordinate structure on synthetic manifolds. We also demonstrate the scalability of our method on CIFAR-10, where the learned coordinates achieve competitive downstream classification performance.
zh

[AI-164] Gender Dynamics and Homophily in a Social Network of LLM Agents

【速读】:该论文试图解决的问题是:在大规模交互网络中,生成式 AI(Generative AI)和大型语言模型(LLMs)的性别表现(gender performance)如何演化,以及这种演化是否受到社会同质性(homophily)和社交选择(social selection)或社会影响(social influence)机制的驱动。解决方案的关键在于构建了一个由约7万个自主AI聊天机器人组成的社交媒体平台模拟环境,采集了超过1.4亿条帖子及一年内动态演化的关注网络数据,并通过每周对每个代理的文本内容进行性别评分来追踪其性别表现的流变。研究发现,尽管个体代理的性别表现具有流动性,但网络整体呈现出显著的性别同质性,且社交选择与社会影响共同塑造了网络结构的形成与演变,揭示了即使缺乏物理身体,文化性别的社会化过程仍会导致基于性别的分组现象。

链接: https://arxiv.org/abs/2602.02606
作者: Faezeh Fadaei,Jenny Carla Moran,Taha Yasseri
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Under Review

点击查看摘要

Abstract:Generative artificial intelligence and large language models (LLMs) are increasingly deployed in interactive settings, yet we know little about how their identity performance develops when they interact within large-scale networks. We address this by examining this http URL, a social media platform similar to X but composed entirely of autonomous AI chatbots. Our dataset comprises over 70,000 agents, approximately 140 million posts, and the evolving followership network over one year. Based on agents’ text production, we assign weekly gender scores to each agent. Results suggest that each agent’s gender performance is fluid rather than fixed. Despite this fluidity, the network displays strong gender-based homophily, as agents consistently follow others performing gender similarly. Finally, we investigate whether these homophilic connections arise from social selection, in which agents choose to follow similar accounts, or from social influence, in which agents become more similar to their followees over time. Consistent with human social networks, we find evidence that both mechanisms shape the structure and evolution of interactions among LLMs. Our findings suggest that, even in the absence of bodies, cultural entraining of gender performance leads to gender-based sorting. This has important implications for LLM applications in synthetic hybrid populations, social simulations, and decision support.
zh

[AI-165] CaST: Causal Discovery via Spatio-Temporal Graphs in Disaster Tweets

【速读】:该论文旨在解决从社交媒体中识别真实世界事件之间因果关系的问题,现有方法常忽视语义、空间和时间上下文的交互作用。其解决方案的关键在于提出CaST框架,通过结合大型语言模型(LLMs)预训练于灾害数据集所提取的语义相似性与时空邻近性,构建事件图结构,并利用多头图注意力网络(Multi-head Graph Attention Network, GAT)学习定向因果关系,从而实现更鲁棒且可解释的灾害相关社交文本中的因果发现。

链接: https://arxiv.org/abs/2602.02601
作者: Hieu Duong,Eugene Levin,Todd Gary,Long Nguyen
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding causality between real-world events from social media is essential for situational awareness, yet existing causal discovery methods often overlook the interplay between semantic, spatial, and temporal contexts. We propose CaST: Causal Discovery via Spatio-Temporal Graphs, a unified framework for causal discovery in disaster domain that integrates semantic similarity and spatio-temporal proximity using Large Language Models (LLMs) pretrained on disaster datasets. CaST constructs an event graph for each window of tweets. Each event extracted from tweets is represented as a node embedding enriched with its contextual semantics, geographic coordinates, and temporal features. These event nodes are then connected to form a spatio-temporal event graph, which is processed using a multi-head Graph Attention Network (GAT) \citegat to learn directed causal relationships. We construct an in-house dataset of approximately 167K disaster-related tweets collected during Hurricane Harvey and annotated following the MAVEN-ERE schema. Experimental results show that CaST achieves superior performance over both traditional and state-of-the-art methods. Ablation studies further confirm that incorporating spatial and temporal signals substantially improves both recall and stability during training. Overall, CaST demonstrates that integrating spatio-temporal reasoning into event graphs enables more robust and interpretable causal discovery in disaster-related social media text.
zh

[AI-166] Step-Wise Refusal Dynamics in Autoregressive and Diffusion Language Models

【速读】:该论文旨在解决生成式 AI(Generative AI)模型在安全行为上的关键问题,即采样机制如何影响拒绝响应(refusal behavior)和对抗攻击的鲁棒性(jailbreak robustness),而这一问题此前缺乏系统性的理解。解决方案的关键在于提出一种名为“逐步拒绝内部动态”(Step-Wise Refusal Internal Dynamics, SRI)的信号,该信号能够捕捉模型在生成过程中的内部恢复动力学结构,并通过几何特性识别出有害生成中“不完全内部恢复”的异常状态——这种状态在文本层面不可见,但可通过SRI检测到。该方法实现了轻量级推理时检测器的设计,在无需重新训练的前提下对未见过的攻击仍具高泛化能力,且推理开销低于现有防御手段的100倍。

链接: https://arxiv.org/abs/2602.02600
作者: Eliron Rahimi,Elad Hirshel,Rom Himelstein,Amit LeVi,Avi Mendelson,Chaim Baskin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion language models (DLMs) have recently emerged as a promising alternative to autoregressive (AR) models, offering parallel decoding and controllable sampling dynamics while achieving competitive generation quality at scale. Despite this progress, the role of sampling mechanisms in shaping refusal behavior and jailbreak robustness remains poorly understood. In this work, we present a fundamental analytical framework for step-wise refusal dynamics, enabling comparison between AR and diffusion sampling. Our analysis reveals that the sampling strategy itself plays a central role in safety behavior, as a factor distinct from the underlying learned representations. Motivated by this analysis, we introduce the Step-Wise Refusal Internal Dynamics (SRI) signal, which supports interpretability and improved safety for both AR and DLMs. We demonstrate that the geometric structure of SRI captures internal recovery dynamics, and identifies anomalous behavior in harmful generations as cases of \emphincomplete internal recovery that are not observable at the text level. This structure enables lightweight inference-time detectors that generalize to unseen attacks while matching or outperforming existing defenses with over 100\times lower inference overhead.
zh

[AI-167] RAP: KV-Cache Compression via RoPE-Aligned Pruning

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在长文本推理过程中因KV-Cache(键值缓存)带来的内存和计算开销瓶颈问题。传统低秩分解方法通过近似权重矩阵 $ W \approx A \times B $ 来压缩KV投影,其中$ A 生成潜在的键值状态,而生成潜在的键值状态,而 B 可被吸收至下游权重中以减少冗余。然而,在基于RoPERotaryPositionEmbedding)的现代LLM中,这种吸收机制失效:RoPE要求潜在状态重构回完整维度,导致额外的内存占用与计算负担。解决方案的关键在于提出RoPE对齐剪枝(RoPEAlignedPruning,RAP),其通过剪除整个RoPE对齐的列对(columnpairs)来保持RoPE2×2旋转结构,从而恢复可被吸收至下游权重中以减少冗余。然而,在基于RoPE(Rotary Position Embedding)的现代LLM中,这种吸收机制失效:RoPE要求潜在状态重构回完整维度,导致额外的内存占用与计算负担。解决方案的关键在于提出RoPE对齐剪枝(RoPE-Aligned Pruning, RAP),其通过剪除整个RoPE对齐的列对(column pairs)来保持RoPE的2×2旋转结构,从而恢复 B $的吸收能力并避免重构过程。实验表明,RAP可在不牺牲性能的前提下,同时降低KV-Cache、注意力参数及浮点运算量(FLOPs)20–30%,并将注意力延迟分别降至基线的83%(预填充阶段)和77%(解码阶段)。

链接: https://arxiv.org/abs/2602.02599
作者: Jihao Xin,Tian Lvu,Hatem Ltaief,David Keyes,Marco Canini
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-context inference in large language models is increasingly bottlenecked by the memory and compute cost of the KV-Cache. Low-rank factorization compresses KV projections by writing W \approx A * B , where A produces latent KV states and B can be absorbed into downstream weights. In modern RoPE-based LLMs, this absorption fails: RoPE forces latent KV states to be reconstructed to full dimension, reintroducing substantial memory and compute overhead. We propose RoPE-Aligned Pruning (RAP), which prunes entire RoPE-aligned column pairs to preserve RoPE’s 2x2 rotation structure, restore B absorption, and eliminate reconstruction. Our evaluation on LLaMA-3-8B and Mistral-7B shows that RAP enables joint reduction of KV-Cache, attention parameters, and FLOPs by 20-30%, all at once, while maintaining strong accuracy. Notably, RAP reduces attention latency to 83% (prefill) and 77% (decode) of baseline.
zh

[AI-168] ContextEvolve: Multi-Agent Context Compression for Systems Code Optimization

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在系统研究中生成性能关键算法时,难以满足严格正确性和性能要求的问题。现有方法如基于测试时强化学习(Test-time Reinforcement Learning)因需参数更新而不适用于API-only场景,而训练-free进化方法则存在上下文利用效率低和搜索方向盲目等问题。其解决方案的关键在于提出ContextEvolve——一个无需参数更新的多智能体框架,通过将优化上下文分解为三个正交维度:摘要代理(Summarizer Agent)实现代码到语言的语义状态压缩,导航代理(Navigator Agent)从轨迹分析中提炼优化方向,采样代理(Sampler Agent)通过优先级示例检索优化经验分布;三者协同形成与强化学习机制的功能同构,即状态表示、策略梯度和经验回放,从而在纯文本潜在空间中实现高效且有指导性的算法优化。

链接: https://arxiv.org/abs/2602.02597
作者: Hongyuan Su,Yu Zheng,Yong Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models are transforming systems research by automating the discovery of performance-critical algorithms for computer systems. Despite plausible codes generated by LLMs, producing solutions that meet the stringent correctness and performance requirements of systems demands iterative optimization. Test-time reinforcement learning offers high search efficiency but requires parameter updates infeasible under API-only access, while existing training-free evolutionary methods suffer from inefficient context utilization and undirected search. We introduce ContextEvolve, a multi-agent framework that achieves RL-level search efficiency under strict parameter-blind constraints by decomposing optimization context into three orthogonal dimensions: a Summarizer Agent condenses semantic state via code-to-language abstraction, a Navigator Agent distills optimization direction from trajectory analysis, and a Sampler Agent curates experience distribution through prioritized exemplar retrieval. This orchestration forms a functional isomorphism with RL-mapping to state representation, policy gradient, and experience replay-enabling principled optimization in a textual latent space. On the ADRS benchmark, ContextEvolve outperforms state-of-the-art baselines by 33.3% while reducing token consumption by 29.0%. Codes for our work are released at this https URL
zh

[AI-169] o Defend Against Cyber Attacks We Must Teach AI Agents to Hack

【速读】:该论文试图解决的问题是:当前网络安全防御体系依赖于人工成本限制攻击者只能进行高价值目标的手动攻击或大规模但低效的自动化攻击,而生成式 AI(Generative AI)代理打破了这一平衡,能够以较低的成功率在数千个目标上自动发现并利用漏洞,从而实现规模化、低成本的针对性攻击。现有防御措施如数据过滤、安全对齐和输出护栏等,在面对控制开源权重模型或自主开发进攻能力的对手时已失效。解决方案的关键在于:防御方必须从被动防护转向主动构建进攻性安全情报能力,即在受控环境中发展前沿的进攻型 AI 能力,并通过三大行动实现——建立覆盖完整攻击生命周期的基准测试集、从基于工作流的代理向训练有素的代理演进以发现真实环境中的漏洞、实施治理机制将进攻型代理限制在审计过的网络靶场中,并按能力层级分阶段释放,同时将成果提炼为仅用于防御的安全代理。论文强调,应将进攻型 AI 能力视为必要防御基础设施,提前掌握这些能力才能有效遏制未来潜在的 AI 驱动网络威胁。

链接: https://arxiv.org/abs/2602.02595
作者: Terry Yue Zhuo,Yangruibo Ding,Wenbo Guo,Ruijie Meng
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:For over a decade, cybersecurity has relied on human labor scarcity to limit attackers to high-value targets manually or generic automated attacks at scale. Building sophisticated exploits requires deep expertise and manual effort, leading defenders to assume adversaries cannot afford tailored attacks at scale. AI agents break this balance by automating vulnerability discovery and exploitation across thousands of targets, needing only small success rates to remain profitable. Current developers focus on preventing misuse through data filtering, safety alignment, and output guardrails. Such protections fail against adversaries who control open-weight models, bypass safety controls, or develop offensive capabilities independently. We argue that AI-agent-driven cyber attacks are inevitable, requiring a fundamental shift in defensive strategy. In this position paper, we identify why existing defenses cannot stop adaptive adversaries and demonstrate that defenders must develop offensive security intelligence. We propose three actions for building frontier offensive AI capabilities responsibly. First, construct comprehensive benchmarks covering the full attack lifecycle. Second, advance from workflow-based to trained agents for discovering in-wild vulnerabilities at scale. Third, implement governance restricting offensive agents to audited cyber ranges, staging release by capability tier, and distilling findings into safe defensive-only agents. We strongly recommend treating offensive AI capabilities as essential defensive infrastructure, as containing cybersecurity risks requires mastering them in controlled settings before adversaries do.
zh

[AI-170] Effective Frontiers: A Unification of Neural Scaling Laws

【速读】:该论文旨在解决现有神经网络缩放定律(Neural Scaling Laws)理论解释缺乏普适性的问题,即当前方法多依赖特定架构或复杂的核方法,难以统一描述模型容量(N)、数据量(D)和计算资源(C)对测试损失的幂律改进规律。其解决方案的关键在于提出一个统一框架,将一般学习任务抽象为从长尾(Zipfian)分布中逐步覆盖模式的过程,并引入有效边界(Effective Frontier, $ k_\star $),定义为模式排名空间中的阈值,用于区分已学习知识与未学习尾部。通过该框架,作者证明可减少的损失由资源依赖的边界截断所对应的尾部概率质量决定,并据此推导出N、D、C各自的精确缩放律,分别对应容量瓶颈、覆盖瓶颈和优化瓶颈;进一步通过最大瓶颈原则(Max-Bottleneck Principle)统一三者机制,表明Kaplan和Chinchilla缩放定律并非矛盾,而是同一受限优化问题在不同活跃瓶颈下的均衡解。

链接: https://arxiv.org/abs/2602.02593
作者: Jiaxuan Zou,Zixuan Gong,Ye Su,Huayi Tang,Yong Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Neural scaling laws govern the prediction power-law improvement of test loss with respect to model capacity ( N ), datasize ( D ), and compute ( C ). However, existing theoretical explanations often rely on specific architectures or complex kernel methods, lacking intuitive universality. In this paper, we propose a unified framework that abstracts general learning tasks as the progressive coverage of patterns from a long-tail (Zipfian) distribution. We introduce the Effective Frontier ( k_\star ), a threshold in the pattern rank space that separates learned knowledge from the unlearned tail. We prove that reducible loss is asymptotically determined by the probability mass of the tail a resource-dependent frontier truncation. Based on our framework, we derive the precise scaling laws for N , D , and C , attributing them to capacity, coverage, and optimization bottlenecks, respectively. Furthermore, we unify these mechanisms via a Max-Bottleneck principle, demonstrating that the Kaplan and Chinchilla scaling laws are not contradictory, but equilibrium solutions to the same constrained optimization problem under different active bottlenecks.
zh

[AI-171] Learnable Koopman-Enhanced Transformer-Based Time Series Forecasting with Spectral Control

【速读】:该论文旨在解决深度预测模型中线性动态系统与非线性特征提取能力之间的不兼容问题,尤其是在保持模型稳定性、可解释性和计算效率的同时提升预测性能。其解决方案的关键在于提出了一类可学习的Koopman算子参数化方法,通过将线性动力系统理论与现代深度学习架构(如PatchTST、Autoformer和Informer)相结合,实现了对线性转移算子谱特性(如稳定性、秩和特征值分布)的显式控制,同时保留了非线性主干网络的表达能力。这使得模型能够在严格稳定的Koopman算子与无约束的线性潜在动态之间进行灵活插值,从而在多个时间尺度和输入长度下展现出更优的偏差-方差权衡、更好的条件数以及更具可解释性的潜在动态行为。

链接: https://arxiv.org/abs/2602.02592
作者: Ali Forootani,Raffaele Iervolino
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:This paper proposes a unified family of learnable Koopman operator parameterizations that integrate linear dynamical systems theory with modern deep learning forecasting architectures. We introduce four learnable Koopman variants-scalar-gated, per-mode gated, MLP-shaped spectral mapping, and low-rank Koopman operators which generalize and interpolate between strictly stable Koopman operators and unconstrained linear latent dynamics. Our formulation enables explicit control over the spectrum, stability, and rank of the linear transition operator while retaining compatibility with expressive nonlinear backbones such as Patchtst, Autoformer, and Informer. We evaluate the proposed operators in a large-scale benchmark that also includes LSTM, DLinear, and simple diagonal State-Space Models (SSMs), as well as lightweight transformer variants. Experiments across multiple horizons and patch lengths show that learnable Koopman models provide a favorable bias-variance trade-off, improved conditioning, and more interpretable latent dynamics. We provide a full spectral analysis, including eigenvalue trajectories, stability envelopes, and learned spectral distributions. Our results demonstrate that learnable Koopman operators are effective, stable, and theoretically principled components for deep forecasting.
zh

[AI-172] VividVoice: A Unified Framework for Scene-Aware Visually-Driven Speech Synthesis ICASSP2026

【速读】:该论文旨在解决现有语音生成模型在构建与真实物理世界一致的沉浸式听觉体验方面的局限性,核心挑战包括数据稀缺性和模态解耦问题。解决方案的关键在于提出一个统一的生成框架VividVoice:首先构建了一个大规模高质量的混合多模态数据集Vivid-210K,通过创新的程序化流水线首次建立了视觉场景、说话人身份与音频之间的强关联;其次设计了核心对齐模块D-MSVA,利用解耦记忆库架构和跨模态混合监督策略,实现从视觉场景到音色及环境声学特征的细粒度对齐。实验结果表明,VividVoice在音频保真度、内容清晰度和多模态一致性方面显著优于现有基线模型。

链接: https://arxiv.org/abs/2602.02591
作者: Chengyuan Ma,Jiawei Jin,Ruijie Xiong,Chunxiang Jin,Canxiang Yan,Wenming Yang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: Accepted by ICASSP 2026

点击查看摘要

Abstract:We introduce and define a novel task-Scene-Aware Visually-Driven Speech Synthesis, aimed at addressing the limitations of existing speech generation models in creating immersive auditory experiences that align with the real physical world. To tackle the two core challenges of data scarcity and modality decoupling, we propose VividVoice, a unified generative framework. First, we constructed a large-scale, high-quality hybrid multimodal dataset, Vivid-210K, which, through an innovative programmatic pipeline, establishes a strong correlation between visual scenes, speaker identity, and audio for the first time. Second, we designed a core alignment module, D-MSVA, which leverages a decoupled memory bank architecture and a cross-modal hybrid supervision strategy to achieve fine-grained alignment from visual scenes to timbre and environmental acoustic features. Both subjective and objective experimental results provide strong evidence that VividVoice significantly outperforms existing baseline models in terms of audio fidelity, content clarity, and multimodal consistency. Our demo is available at this https URL.
zh

[AI-173] PeerRank: Autonomous LLM Evaluation Through Web-Grounded Bias-Controlled Peer Review

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)评估依赖人工构建基准、参考答案及单模型或人工判断所带来的可扩展性差、易过时且与实际开放世界部署场景不匹配的问题。其核心解决方案是提出PeerRank,一个完全自主的端到端评估框架,关键在于将评估过程建模为多智能体协作机制:每个模型在任务生成、回答、评判三个角色中对等参与,通过类别限定的实时网络检索增强答案生成,并基于多个同行评分聚合出相对性能估计,从而实现无监督、无黄金标准的公平评估,有效识别身份和呈现偏差,同时保证排名稳定性和与Elo评分的一致性。

链接: https://arxiv.org/abs/2602.02589
作者: Yanki Margalit,Erni Avram,Ran Taig,Oded Margalit,Nurit Cohen-Inger
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Evaluating large language models typically relies on human-authored benchmarks, reference answers, and human or single-model judgments, approaches that scale poorly, become quickly outdated, and mismatch open-world deployments that depend on web retrieval and synthesis. We introduce PeerRank, a fully autonomous end-to-end evaluation framework in which models generate evaluation tasks, answer them with category-scoped live web grounding, judge peer responses and aggregate dense peer assessments into relative performance estimates, without human supervision or gold references. PeerRank treats evaluation as a multi-agent process where each model participates symmetrically as task designer, respondent, and evaluator, while removing biased judgments. In a large-scale study over 12 commercially available models and 420 autonomously generated questions, PeerRank produces stable, discriminative rankings and reveals measurable identity and presentation biases. Rankings are robust, and mean peer scores agree with Elo. We further validate PeerRank on TruthfulQA and GSM8K, where peer scores correlate with objective accuracy. Together, these results suggest that bias-aware peer evaluation with selective web-grounded answering can scale open-world LLM assessment beyond static and human curated benchmarks.
zh

[AI-174] Agent ic Observability: Automated Alert Triage for Adobe E-Commerce AAAI’26

【速读】:该论文旨在解决现代企业系统中因复杂依赖关系导致的可观测性(observability)与故障响应效率低下问题,尤其是手动告警筛查(alert triage)过程耗时长、资源消耗大,成为缩短平均恢复时间(MTTR)的主要瓶颈。其解决方案的关键在于提出并部署了一个基于ReAct范式的智能体可观测性框架(agentic observability framework),该框架能够自主执行告警分析任务:在检测到告警后,智能体动态识别受影响服务,跨分布式系统检索并分析相关日志,并根据上下文规划动作,如查阅操作手册、执行运行手册(runbook)或进行增强检索的代码变更分析。实证结果表明,该方案将平均洞察时间(mean time to insight)降低90%,同时保持与人工相当的诊断准确性,实现了告警处理延迟的量级下降和故障定位精度的显著提升,标志着企业运维向自主可观测性的重要演进。

链接: https://arxiv.org/abs/2602.02585
作者: Aprameya Bharadwaj,Kyle Tu
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted at AAAI’26 Agentic AI Benchmarks and Applications for Enterprise Tasks Workshop

点击查看摘要

Abstract:Modern enterprise systems exhibit complex interdependencies that make observability and incident response increasingly challenging. Manual alert triage, which typically involves log inspection, API verification, and cross-referencing operational knowledge bases, remains a major bottleneck in reducing mean recovery time (MTTR). This paper presents an agentic observability framework deployed within Adobe’s e-commerce infrastructure that autonomously performs alert triage using a ReAct paradigm. Upon alert detection, the agent dynamically identifies the affected service, retrieves and analyzes correlated logs across distributed systems, and plans context-dependent actions such as handbook consultation, runbook execution, or retrieval-augmented analysis of recently deployed code. Empirical results from production deployment indicate a 90% reduction in mean time to insight compared to manual triage, while maintaining comparable diagnostic accuracy. Our results show that agentic AI enables an order-of-magnitude reduction in triage latency and a step-change in resolution accuracy, marking a pivotal shift toward autonomous observability in enterprise operations.
zh

[AI-175] Constitutional Spec-Driven Development: Enforcing Security by Construction in AI-Assisted Code Generation

【速读】:该论文旨在解决AI辅助编程(如生成式AI)在快速开发过程中引入的安全风险问题,尤其针对大型语言模型(LLM)在生成代码时优先保障功能正确性而忽视安全性的缺陷。其解决方案的关键在于提出“宪法驱动开发”(Constitutional Spec-Driven Development),即通过构建一个版本化、机器可读的“宪法”文档,将来自Common Weakness Enumeration (CWE) 和MITRE Top 25漏洞列表以及监管框架的安全约束嵌入到规范层,从而在代码生成阶段就确保安全性,而非事后检测。此方法实现了从原则到代码的全链路可追溯性,并在银行微服务场景中验证了其有效性——相较无约束的AI生成方式,安全缺陷减少73%,同时保持开发效率。

链接: https://arxiv.org/abs/2602.02584
作者: Srinivas Rao Marri
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 15 pages, 2 figures, 5 tables, 11 code listings, 14 references. Includes reference implementation and compliance traceability matrix

点击查看摘要

Abstract:The proliferation of AI-assisted “vibe coding” enables rapid software development but introduces significant security risks, as Large Language Models (LLMs) prioritize functional correctness over security. We present Constitutional Spec-Driven Development, a methodology that embeds non-negotiable security principles into the specification layer, ensuring AI-generated code adheres to security requirements by construction rather than inspection. Our approach introduces a Constitution: a versioned, machine-readable document encoding security constraints derived from Common Weakness Enumeration (CWE)/MITRE Top 25 vulnerabilities and regulatory frameworks. We demonstrate the methodology through a banking microservices application, selected as a representative example domain due to its stringent regulatory and security requirements, implementing customer management, account operations, and transaction processing. The methodology itself is domain-agnostic. The implementation addresses 10 critical CWE vulnerabilities through constitutional constraints with full traceability from principles to code locations. Our case study shows that constitutional constraints reduce security defects by 73% compared to unconstrained AI generation while maintaining developer velocity. We contribute a formal framework for constitutional security, a complete development methodology, and empirical evidence that proactive security specification outperforms reactive security verification in AI-assisted development workflows.
zh

[AI-176] QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理增强微调(reasoning-incentivized fine-tuning)后进行权重量化(weight-only quantization)时,如何更有效地保留模型性能的问题。传统量化方法常依赖激活统计或二阶信息来评估通道重要性,但这些信号在推理型模型中可能不够敏感。论文提出QuantLRM方法,其关键在于利用微调过程中权重更新的幅度分布特性——即“保护两端”(protecting both ends)假设:最小和最大权重更新对模型性能影响更大。通过在权重更新上拟合受限二次函数以保护两端,并结合零权重更新通道的数量计算出更精准的通道重要性指标,从而实现更优的权重量化策略。该方案不仅适用于多种微调方式(监督、直接偏好优化、强化学习),还能通过伪微调(pseudo-fine-tuning)扩展至未微调的大推理模型(Large Reasoning Models, LRMs),显著提升量化后的性能表现。

链接: https://arxiv.org/abs/2602.02581
作者: Nan Zhang,Eugene Kwek,Yusen Zhang,Muyu Pan,Suhang Wang,Prasenjit Mitra,Rui Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Weight-only quantization is important for compressing Large Language Models (LLMs). Inspired by the spirit of classical magnitude pruning, we study whether the magnitude of weight updates during reasoning-incentivized fine-tuning can provide valuable signals for quantizing Large Reasoning Models (LRMs). We hypothesize that the smallest and largest weight updates during fine-tuning are more important than those of intermediate magnitude, a phenomenon we term “protecting both ends”. Upon hypothesis validation, we introduce QuantLRM, which stands for weight quantization of LRMs via fine-tuning signals. We fit simple restricted quadratic functions on weight updates to protect both ends. By multiplying the average quadratic values with the count of zero weight updates of channels, we compute channel importance that is more effective than using activation or second-order information. We run QuantLRM to quantize various fine-tuned models (including supervised, direct preference optimization, and reinforcement learning fine-tuning) over four reasoning benchmarks (AIME-120, FOLIO, temporal sequences, and GPQA-Diamond) and empirically find that QuantLRM delivers a consistent improvement for LRMs quantization, with an average improvement of 6.55% on a reinforcement learning fine-tuned model. Also supporting non-fine-tuned LRMs, QuantLRM gathers effective signals via pseudo-fine-tuning, which greatly enhances its applicability.
zh

[AI-177] ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation

【速读】:该论文旨在解决长上下文检索增强生成(Retrieval-Augmented Generation, RAG)预填充(prefill)阶段因计算开销过大而导致的性能瓶颈问题。现有方法通过复用检索文档的预计算键值缓存(KV cache)并重新计算部分token以恢复跨注意力机制,但存在一个根本性的“挤出效应”(crowding-out effect):全局显著但与用户查询无关的token占用了有限的重计算预算,从而排挤了真正对回答用户问题至关重要的token,导致推理准确率下降。其解决方案的关键在于提出ProphetKV,一种基于用户查询驱动的KV缓存复用方法,通过动态优先级排序机制根据token与用户查询的语义相关性分配资源,并采用双阶段重计算流水线将层间注意力指标融合为高价值token集合,确保重计算预算精准用于弥合检索内容与用户查询之间的信息鸿沟,从而在极低重计算比例下实现接近完整预填充精度的高质量注意力恢复。

链接: https://arxiv.org/abs/2602.02579
作者: Shihao Wang,Jiahao Chen,Yanqi Pan,Hao Huang,Yichen Hao,Xiangyu Zou,Wen Xia,Wentao Zhang,Haitao Wang,Junhong Li,Chongyang Qiu,Pengfei Wang
机构: 未知
类目: Operating Systems (cs.OS); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The prefill stage of long-context Retrieval-Augmented Generation (RAG) is severely bottlenecked by computational overhead. To mitigate this, recent methods assemble pre-calculated KV caches of retrieved RAG documents (by a user query) and reprocess selected tokens to recover cross-attention between these pre-calculated KV caches. However, we identify a fundamental “crowding-out effect” in current token selection criteria: globally salient but user-query-irrelevant tokens saturate the limited recomputation budget, displacing the tokens truly essential for answering the user query and degrading inference accuracy. We propose ProphetKV, a user-query-driven KV Cache reuse method for RAG scenarios. ProphetKV dynamically prioritizes tokens based on their semantic relevance to the user query and employs a dual-stage recomputation pipeline to fuse layer-wise attention metrics into a high-utility set. By ensuring the recomputation budget is dedicated to bridging the informational gap between retrieved context and the user query, ProphetKV achieves high-fidelity attention recovery with minimal overhead. Our extensive evaluation results show that ProphetKV retains 96%-101% of full-prefill accuracy with only a 20% recomputation ratio, while achieving accuracy improvements of 8.8%-24.9% on RULER and 18.6%-50.9% on LongBench over the state-of-the-art approaches (e.g., CacheBlend, EPIC, and KVShare). Subjects: Operating Systems (cs.OS); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.02579 [cs.OS] (or arXiv:2602.02579v1 [cs.OS] for this version) https://doi.org/10.48550/arXiv.2602.02579 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-178] Product Interaction: An Algebraic Formalism for Deep Learning Architectures

【速读】:该论文旨在解决神经网络中层结构构建的统一建模问题,即如何从代数角度系统性地生成和组织不同阶次的交互表达式以提升模型设计的理论一致性与灵活性。其解决方案的关键在于提出“产品交互”(product interactions)这一代数形式化框架,通过在适当代数上定义乘法算子并组合成不同阶次的交互项,实现对现代神经网络中线性、二次及高阶交互关系的统一描述;其中卷积与等变网络可视为对称性约束下的线性产品交互,而注意力机制和Mamba则对应于更高阶的产品交互,从而揭示了多种主流架构背后的共同代数本质。

链接: https://arxiv.org/abs/2602.02573
作者: Haonan Dong,Chun-Wun Cheng,Angelica I. Aviles-Rivero
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this paper, we introduce product interactions, an algebraic formalism in which neural network layers are constructed from compositions of a multiplication operator defined over suitable algebras. Product interactions provide a principled way to generate and organize algebraic expressions by increasing interaction order. Our central observation is that algebraic expressions in modern neural networks admit a unified construction in terms of linear, quadratic, and higher-order product interactions. Convolutional and equivariant networks arise as symmetry-constrained linear product interactions, while attention and Mamba correspond to higher-order product interactions.
zh

[AI-179] Reward Shaping for Inference-Time Alignment: A Stackelberg Game Perspective

【速读】:该论文试图解决现有对齐方法中因KL正则化导致模型继承基础策略偏差、从而降低用户效用的问题,该偏差可能与用户偏好相冲突。解决方案的关键在于将奖励模型优化问题形式化为一个Stackelberg博弈,并提出一种简单的奖励重塑(reward shaping)方案,以有效近似最优奖励模型,从而在不显著增加计算开销的前提下提升对齐效果,实验证明该方法能稳定提高平均奖励并实现超过66%的胜平率。

链接: https://arxiv.org/abs/2602.02572
作者: Haichuan Wang,Tao Lin,Lingkai Kong,Ce Li,Hezi Jiang,Milind Tambe
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing alignment methods directly use the reward model learned from user preference data to optimize an LLM policy, subject to KL regularization with respect to the base policy. This practice is suboptimal for maximizing user’s utility because the KL regularization may cause the LLM to inherit the bias in the base policy that conflicts with user preferences. While amplifying rewards for preferred outputs can mitigate this bias, it also increases the risk of reward hacking. This tradeoff motivates the problem of optimally designing reward models under KL regularization. We formalize this reward model optimization problem as a Stackelberg game, and show that a simple reward shaping scheme can effectively approximate the optimal reward model. We empirically evaluate our method in inference-time alignment settings and demonstrate that it integrates seamlessly into existing alignment methods with minimal overhead. Our method consistently improves average reward and achieves win-tie rates exceeding 66% against all baselines, averaged across evaluation settings.
zh

[AI-180] DECEIVE-AFC: Adversarial Claim Attacks against Search-Enabled LLM -based Fact-Checking Systems

【速读】:该论文旨在解决基于搜索的生成式 AI(Generative AI)事实核查系统在面对对抗性主张攻击时鲁棒性不足的问题。其解决方案的关键在于提出了一种名为 DECEIVE-AFC 的代理式对抗攻击框架,该框架融合了新颖的主张级别攻击策略与对抗性主张有效性评估原则,能够在仅访问输入数据的现实威胁模型下,系统性地干扰搜索行为、证据检索及大语言模型推理过程,且无需获取证据源或模型内部信息,从而显著降低事实核查准确率并展现出强跨系统迁移能力。

链接: https://arxiv.org/abs/2602.02569
作者: Haoran Ou,Kangjie Chen,Gelei Deng,Hangcheng Liu,Jie Zhang,Tianwei Zhang,Kwok-Yan Lam
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Fact-checking systems with search-enabled large language models (LLMs) have shown strong potential for verifying claims by dynamically retrieving external evidence. However, the robustness of such systems against adversarial attack remains insufficiently understood. In this work, we study adversarial claim attacks against search-enabled LLM-based fact-checking systems under a realistic input-only threat model. We propose DECEIVE-AFC, an agent-based adversarial attack framework that integrates novel claim-level attack strategies and adversarial claim validity evaluation principles. DECEIVE-AFC systematically explores adversarial attack trajectories that disrupt search behavior, evidence retrieval, and LLM-based reasoning without relying on access to evidence sources or model internals. Extensive evaluations on benchmark datasets and real-world systems demonstrate that our attacks substantially degrade verification performance, reducing accuracy from 78.7% to 53.7%, and significantly outperform existing claim-based attack baselines with strong cross-system transferability.
zh

[AI-181] IceBench-S2S: A Benchmark of Deep Learning for Challenging Subseasonal-to-Seasonal Daily Arctic Sea Ice Forecasting in Deep Latent Space

【速读】:该论文旨在解决当前深度学习(Deep Learning, DL)模型在北极海冰浓度预测中预报时效局限于日尺度至次季节尺度(subseasonal scale,最多约6个月),难以满足实际应用(如北极航运规划与科学考察)对季节尺度(Seasonal to Subseasonal, S2S)预测的需求这一关键问题。解决方案的关键在于提出IceBench-S2S基准框架,其核心创新是通过将每日海冰数据的空间特征压缩至深层潜在空间,并利用时序拼接的深度特征驱动DL预测骨干网络,从而实现对连续180天周期内海冰变化的高精度S2S尺度预测,同时提供统一的训练与评估流程及模型选择指导,为极地环境监测中的DL方法发展奠定基础。

链接: https://arxiv.org/abs/2602.02567
作者: Jingyi Xu,Shengnan Wang,Weidong Yang,Siwei Tu,Lei Bai,Ben Fei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: 9 pages, 6 figures

点击查看摘要

Abstract:Arctic sea ice plays a critical role in regulating Earth’s climate system, significantly influencing polar ecological stability and human activities in coastal regions. Recent advances in artificial intelligence have facilitated the development of skillful pan-Arctic sea ice forecasting systems, where data-driven approaches showcase tremendous potential to outperform conventional physics-based numerical models in terms of accuracy, computational efficiency and forecasting lead times. Despite the latest progress made by deep learning (DL) forecasting models, most of their skillful forecasting lead times are confined to daily subseasonal scale and monthly averaged values for up to six months, which drastically hinders their deployment for real-world applications, e.g., maritime routine planning for Arctic transportation and scientific investigation. Extending daily forecasts from subseasonal to seasonal (S2S) scale is scientifically crucial for operational applications. To bridge the gap between the forecasting lead time of current DL models and the significant daily S2S scale, we introduce IceBench-S2S, the first comprehensive benchmark for evaluating DL approaches in mitigating the challenge of forecasting Arctic sea ice concentration in successive 180-day periods. It proposes a generalized framework that first compresses spatial features of daily sea ice data into a deep latent space. The temporally concatenated deep features are subsequently modeled by DL-based forecasting backbones to predict the sea ice variation at S2S scale. IceBench-S2S provides a unified training and evaluation pipeline for different backbones, along with practical guidance for model selection in polar environmental monitoring tasks.
zh

[AI-182] A Comparative Simulation Study of the Fairness and Accuracy of Predictive Policing Systems in Baltimore City

【速读】:该论文试图解决预测性警务(predictive policing)系统在公平性和准确性方面存在的争议问题,特别是其是否加剧了种族偏见以及与传统热点巡逻(hot spots policing)相比的相对优势。解决方案的关键在于通过构建城市特定的综合仿真模型,在巴尔的摩市对两种警务策略进行对比分析,从而揭示系统性偏见的动态演化机制及其长期行为趋势。研究发现,尽管预测性警务在短期内比热点巡逻更公平且准确,但其会更快放大偏见,且在某些情况下导致白人社区被过度执法,这表明仅关注短期效果可能掩盖长期不公平风险。该方法为城市层面的算法公平性评估提供了可操作的技术路径。

链接: https://arxiv.org/abs/2602.02566
作者: Samin Semsar,Kiran Laxmikant Prabhu,Gabriella Waters,James Foulds
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 36 pages, 27 figures

点击查看摘要

Abstract:There are ongoing discussions about predictive policing systems, such as those deployed in Los Angeles, California and Baltimore, Maryland, being unfair, for example, by exhibiting racial bias. Studies found that unfairness may be due to feedback loops and being trained on historically biased recorded data. However, comparative studies on predictive policing systems are few and are not sufficiently comprehensive. In this work, we perform a comprehensive comparative simulation study on the fairness and accuracy of predictive policing technologies in Baltimore. Our results suggest that the situation around bias in predictive policing is more complex than was previously assumed. While predictive policing exhibited bias due to feedback loops as was previously reported, we found that the traditional alternative, hot spots policing, had similar issues. Predictive policing was found to be more fair and accurate than hot spots policing in the short term, although it amplified bias faster, suggesting the potential for worse long-run behavior. In Baltimore, in some cases the bias in these systems tended toward over-policing in White neighborhoods, unlike in previous studies. Overall, this work demonstrates a methodology for city-specific evaluation and behavioral-tendency comparison of predictive policing systems, showing how such simulations can reveal inequities and long-term tendencies.
zh

[AI-183] High Rank Matrix Completion via Grassmannian Proxy Fusion

【速读】:该论文旨在解决高秩矩阵补全(High-Rank Matrix Completion, HRMC)问题,即在数据矩阵的列近似位于多个子空间并集的情况下,填补缺失条目并识别潜在子空间结构。当前方法常缺乏理论支撑、结果难以解释,且所需样本数超过理论下限。其解决方案的关键在于:通过聚类不完整向量来分组代理子空间,并在Grassmann流形上最小化两个目标函数——(a) 每个点与其对应子空间之间的弦距离(chordal distance),以及 (b) 所有数据点子空间间的测地距离(geodesic distance)。该策略显著提升了低采样率下的补全性能,逼近HRMC的理论采样极限。

链接: https://arxiv.org/abs/2602.02565
作者: Huanran Li,Jeremy Johnson,Daniel Pimentel-Alarcón
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper approaches high-rank matrix completion (HRMC) by filling missing entries in a data matrix where columns lie near a union of subspaces, clustering these columns, and identifying the underlying subspaces. Current methods often lack theoretical support, produce uninterpretable results, and require more samples than theoretically necessary. We propose clustering incomplete vectors by grouping proxy subspaces and minimizing two criteria over the Grassmannian: (a) the chordal distance between each point and its corresponding subspace and (b) the geodesic distances between subspaces of all data points. Experiments on synthetic and real datasets demonstrate that our method performs comparably to leading methods in high sampling rates and significantly better in low sampling rates, thus narrowing the gap to the theoretical sampling limit of HRMC.
zh

[AI-184] MathlibLemma: Folklore Lemma Generation and Benchmark for Formal Mathematics

【速读】:该论文旨在解决形式化数学库(如Mathlib)中缺乏大量“民间定理”(folklore lemmas)的问题,这些问题虽在数学实践中被广泛使用,却未被系统地形式化,从而限制了Lean在数学家日常工作流中的可用性。解决方案的关键在于提出MathlibLemma框架——一个基于大语言模型(LLM)的多智能体系统,能够自动发现并形式化这些缺失的民间定理;该框架通过主动挖掘数学文献与代码库之间的连接纽带,生成可验证的定理集合,并已成功将部分成果合并至Mathlib最新版本,验证了其实际效用与专家标准的一致性。

链接: https://arxiv.org/abs/2602.02561
作者: Xinyu Liu,Zixuan Xie,Amir Moeini,Claire Chen,Shuze Daniel Liu,Yu Meng,Aidong Zhang,Shangtong Zhang
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:While the ecosystem of Lean and Mathlib has enjoyed celebrated success in formal mathematical reasoning with the help of large language models (LLMs), the absence of many folklore lemmas in Mathlib remains a persistent barrier that limits Lean’s usability as an everyday tool for mathematicians like LaTeX or Maple. To address this, we introduce MathlibLemma, the first LLM-based multi-agent system to automate the discovery and formalization of mathematical folklore lemmas. This framework constitutes our primary contribution, proactively mining the missing connective tissue of mathematics. Its efficacy is demonstrated by the production of a verified library of folklore lemmas, a subset of which has already been formally merged into the latest build of Mathlib, thereby validating the system’s real-world utility and alignment with expert standards. Leveraging this pipeline, we further construct the MathlibLemma benchmark, a suite of 4,028 type-checked Lean statements spanning a broad range of mathematical domains. By transforming the role of LLMs from passive consumers to active contributors, this work establishes a constructive methodology for the self-evolution of formal mathematical libraries.
zh

[AI-185] PA-MIL: Phenotype-Aware Multiple Instance Learning Guided by Language Prompting and Genotype-to-Phenotype Relationships

【速读】:该论文旨在解决现有深度学习方法在病理全切片图像(Whole-Slide Images, WSI)分析中解释性不足的问题,即大多数方法仅能通过后验方式定位模型关注区域,难以提供可靠且可问责的解释。其解决方案的关键在于提出一种前验可解释的多实例学习框架——表型感知多实例学习(Phenotype-Aware Multiple Instance Learning, PA-MIL),该框架通过构建包含癌症相关表型及其关联基因型的知识库,利用表型形态描述作为语言提示聚合表型相关特征,并设计基于基因型到表型关系的基因型-表型神经网络(Genotype-to-Phenotype Neural Network, GP-NN),为PA-MIL提供多层次指导,从而实现更可靠的癌症亚型分类与解释。

链接: https://arxiv.org/abs/2602.02558
作者: Zekang Yang,Hong Liu,Xiangdong Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Deep learning has been extensively researched in the analysis of pathology whole-slide images (WSIs). However, most existing methods are limited to providing prediction interpretability by locating the model’s salient areas in a post-hoc manner, failing to offer more reliable and accountable explanations. In this work, we propose Phenotype-Aware Multiple Instance Learning (PA-MIL), a novel ante-hoc interpretable framework that identifies cancer-related phenotypes from WSIs and utilizes them for cancer subtyping. To facilitate PA-MIL in learning phenotype-aware features, we 1) construct a phenotype knowledge base containing cancer-related phenotypes and their associated genotypes. 2) utilize the morphological descriptions of phenotypes as language prompting to aggregate phenotype-related features. 3) devise the Genotype-to-Phenotype Neural Network (GP-NN) grounded in genotype-to-phenotype relationships, which provides multi-level guidance for PA-MIL. Experimental results on multiple datasets demonstrate that PA-MIL achieves competitive performance compared to existing MIL methods while offering improved interpretability. PA-MIL leverages phenotype saliency as evidence and, using a linear classifier, achieves competitive results compared to state-of-the-art methods. Additionally, we thoroughly analyze the genotype-phenotype relationships, as well as cohort-level and case-level interpretability, demonstrating the reliability and accountability of PA-MIL.
zh

[AI-186] he Alignment Curse: Cross-Modality Jailbreak Transfer in Omni-Models

【速读】:该论文旨在解决跨模态 jailbreak 攻击中,文本到音频的攻击迁移问题,即如何利用成熟的文本 jailbreak 方法生成有效的音频攻击,并验证其在多模态模型(omni-models)中的有效性。其解决方案的关键在于揭示了模态对齐(modality alignment)与跨模态攻击迁移之间的内在关联,发现强对齐可能导致文本中的漏洞被“传染”至音频模态,这一现象被称为“对齐诅咒”(alignment curse)。基于此洞察,作者通过实证评估证明:将文本 jailbreak 攻击迁移至音频后,其效果可媲美甚至优于原生音频攻击,且在严格的音频仅访问威胁模型下仍具高有效性,从而确立了文本转移音频 jailbreak 作为未来音频红队测试的简单而强大的基线方法。

链接: https://arxiv.org/abs/2602.02557
作者: Yupeng Chen,Junchi Yu,Aoxi Liu,Philip Torr,Adel Bibi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Recent advances in end-to-end trained omni-models have significantly improved multimodal understanding. At the same time, safety red-teaming has expanded beyond text to encompass audio-based jailbreak attacks. However, an important bridge between textual and audio jailbreaks remains underexplored. In this work, we study the cross-modality transfer of jailbreak attacks from text to audio, motivated by the semantic similarity between the two modalities and the maturity of textual jailbreak methods. We first analyze the connection between modality alignment and cross-modality jailbreak transfer, showing that strong alignment can inadvertently propagate textual vulnerabilities to the audio modality, which we term the alignment curse. Guided by this analysis, we conduct an empirical evaluation of textual jailbreaks, text-transferred audio jailbreaks, and existing audio-based jailbreaks on recent omni-models. Our results show that text-transferred audio jailbreaks perform comparably to, and often better than, audio-based jailbreaks, establishing them as simple yet powerful baselines for future audio red-teaming. We further demonstrate strong cross-model transferability and show that text-transferred audio attacks remain effective even under a stricter audio-only access threat model.
zh

[AI-187] Beyond Experience Retrieval: Learning to Generate Utility-Optimized Structured Experience for Frozen LLM s

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在执行任务时存在的静态性问题,即模型难以复用先前经验,常重复推理过程或犯相同错误,而传统基于外部检索的经验复用方法存在噪声引入和延迟增加的问题。解决方案的关键在于提出SEAM(Structured Experience Adapter Module),这是一种轻量级、执行器特定的插件模块,其核心机制是将经验存储于自身参数中,并在单次前向传播中生成结构化、实例定制的经验条目,以指导冻结的LLM执行器。SEAM通过执行器回放(executor rollouts)与GRPO(Generalized Reward Policy Optimization)训练获得实用性,且部署后可通过成功轨迹的监督微调进一步优化,实验证明其在数学推理基准上能稳定提升准确率并保持低开销。

链接: https://arxiv.org/abs/2602.02556
作者: Xuancheng Li,Haitao Li,Yujia Zhou,Yiqun Liu,Qingyao Ai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are largely static and often redo reasoning or repeat mistakes. Prior experience reuse typically relies on external retrieval, which is similarity-based, can introduce noise, and adds latency. We introduce SEAM (Structured Experience Adapter Module), a lightweight, executor-specific plug-in that stores experience in its parameters and generates a structured, instance-tailored experience entry in a single forward pass to guide a frozen LLM executor. SEAM is trained for utility via executor rollouts and GRPO while keeping the executor frozen, and it can be further improved after deployment with supervised fine-tuning on logged successful trajectories. Experiments on mathematical reasoning benchmarks show consistent accuracy gains across executors with low overhead. Extensive ablations and analyses further elucidate the mechanisms underlying SEAM’s effectiveness and robustness.
zh

[AI-188] Learning to Explore with Parameter-Space Noise: A Deep Dive into Parameter-Space Noise for Reinforcement Learning with Verifiable Rewards

【速读】:该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)方法在大规模采样预算下性能增长受限的问题,即现有RLVR方法通常仅对已有解题路径进行重加权,而非探索新的推理策略,导致在高采样量(如pass-at-256)时收益饱和。其解决方案的关键在于提出PSN-RLVR(Perturb before Rollout for Exploration in RLVR),通过在策略参数层面引入扰动(而非动作空间噪声),实现时间上一致的轨迹级探索,从而更好维持长程思维链(chain-of-thought)的连贯性;同时结合截断重要性采样(Truncated Importance Sampling, TIS)缓解采样与更新之间的偏差,并设计一种轻量级实时自适应噪声调度机制,该机制基于语义多样性与归一化自我置信度的组合代理函数,避免昂贵的KL散度控制,显著提升了模型在多个数学推理基准上的有效推理能力边界(effective reasoning capability boundary)。

链接: https://arxiv.org/abs/2602.02555
作者: Bizhe Bai,Xinyue Wang,Peng Ye,Tao Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, 10 Figures

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) improves LLM reasoning, yet growing evidence indicates an exploration ceiling: it often reweights existing solution traces rather than discovering new strategies, limiting gains under large sampling budgets (e.g., pass-at-256). We address this limitation with PSN-RLVR, which perturbs policy parameters before rollout generation to induce temporally consistent, trajectory-level exploration that better preserves long-horizon chain-of-thought coherence than action-space noise. To mitigate the resulting sampling-update mismatch, we incorporate truncated importance sampling (TIS). To avoid expensive KL-based adaptive noise control, we propose a computationally efficient real-time adaptive noise scheduler driven by a lightweight surrogate that combines semantic diversity with normalized self-certainty. Instantiated on GRPO, a widely used RLVR method, PSN-GRPO consistently expands the effective reasoning capability boundary across multiple mathematical reasoning benchmarks and model families, yielding higher pass-at-k under large sampling budgets and outperforming prior exploration-oriented RLVR methods (e.g., Pass-at-k-style training) while remaining orthogonal and thus composable for additional gains.
zh

[AI-189] BatCoder: Self-Supervised Bidirectional Code-Documentation Learning via Back-Translation

【速读】:该论文旨在解决生成式 AI(Generative AI)在代码相关任务中依赖高质量代码-文档配对数据的问题,这类数据不仅成本高昂,且在小众编程语言中尤为稀缺。解决方案的关键在于提出一种自监督强化学习框架 BatCoder,其核心是采用回译(back-translation)策略:先从代码生成文档,再用生成的文档重建原始代码,以原始代码与重构代码之间的语义相似度作为隐式奖励信号,驱动模型同时优化代码生成与文档生成能力。此方法仅需代码语料即可训练,显著扩充了可用训练样本,且在 HumanEval 和 MBPP 基准上展现出优于现有开源基线的性能,并具备随训练语料规模和模型容量增长的一致扩展性。

链接: https://arxiv.org/abs/2602.02554
作者: Jingwen Xu,Yiyang Lu,Zisu Huang,Changze Lv,Xiaohua Wang,Shizheng Li,Zhibo Xu,Zhengkang Guo,Zhengyuan Wang,Muzhao Tian,Xuanjing Huang,Xiaoqing Zheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Training LLMs for code-related tasks typically depends on high-quality code-documentation pairs, which are costly to curate and often scarce for niche programming languages. We introduce BatCoder, a self-supervised reinforcement learning framework designed to jointly optimize code generation and documentation production. BatCoder employs a back-translation strategy: a documentation is first generated from code, and then the generated documentation is used to reconstruct the original code. The semantic similarity between the original and reconstructed code serves as an implicit reward, enabling reinforcement learning to improve the model’s performance both in generating code from documentation and vice versa. This approach allows models to be trained using only code, substantially increasing the available training examples. Evaluated on HumanEval and MBPP with a 7B model, BatCoder achieved 83.5% and 81.0% pass@1, outperforming strong open-source baselines. Moreover, the framework demonstrates consistent scaling with respect to both training corpus size and model capacity.
zh

[AI-190] HyPAC: Cost-Efficient LLM s-Human Hybrid Annotation with PAC Error Guarantees

【速读】:该论文旨在解决多源数据标注中的成本-质量权衡问题,即如何在控制测试样本标注误差的前提下,自适应地将输入路由到最经济高效的标注源(如快速的大语言模型、慢速的推理模型或人工专家)。其核心解决方案是提出HyPAC方法,通过重要性采样和置信上限(upper confidence bounds)校准两个决策阈值,将输入划分为三个不确定性区域,并据此动态分配标注资源。该方法在无需依赖数据分布或预训练模型的前提下,提供分布无关的标注误差保证(distribution-free guarantees),并理论上证明可在概率近似正确(PAC)意义下实现最小期望成本。

链接: https://arxiv.org/abs/2602.02550
作者: Hao Zeng,Huipeng Huang,Xinhao Qu,Jianguo Huang,Bingyi Jing,Hongxin Wei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Data annotation often involves multiple sources with different cost-quality trade-offs, such as fast large language models (LLMs), slow reasoning models, and human experts. In this work, we study the problem of routing inputs to the most cost-efficient annotation source while controlling the labeling error on test instances. We propose \textbfHyPAC, a method that adaptively labels inputs to the most cost-efficient annotation source while providing distribution-free guarantees on annotation error. HyPAC calibrates two decision thresholds using importance sampling and upper confidence bounds, partitioning inputs into three regions based on uncertainty and routing each to the appropriate annotation source. We prove that HyPAC achieves the minimum expected cost with a probably approximately correct (PAC) guarantee on the annotation error, free of data distribution and pre-trained models. Experiments on common benchmarks demonstrate the effectiveness of our method, reducing the annotation cost by 78.51% while tightly controlling the annotation error.
zh

[AI-191] naPINN: Noise-Adaptive Physics-Informed Neural Networks for Recovering Physics from Corrupted Measurement

【速读】:该论文旨在解决物理信息神经网络(Physics-Informed Neural Networks, PINNs)在面对复杂测量噪声和严重异常值时性能显著下降的问题。其解决方案的关键在于提出噪声自适应物理信息神经网络(Noise-Adaptive Physics-Informed Neural Network, naPINN),该方法通过在训练过程中嵌入一个基于能量的模型来学习预测残差的潜在分布,并利用所学的能量景观设计一个可训练的可靠性门机制,动态过滤高能量数据点;同时引入拒绝成本正则化项防止因过度丢弃有效数据而导致的平凡解,从而在无先验噪声分布知识的情况下实现对物理解的鲁棒恢复。

链接: https://arxiv.org/abs/2602.02547
作者: Hankyeol Kim,Pilsung Kang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Physics-Informed Neural Networks (PINNs) are effective methods for solving inverse problems and discovering governing equations from observational data. However, their performance degrades significantly under complex measurement noise and gross outliers. To address this issue, we propose the Noise-Adaptive Physics-Informed Neural Network (naPINN), which robustly recovers physical solutions from corrupted measurements without prior knowledge of the noise distribution. naPINN embeds an energy-based model into the training loop to learn the latent distribution of prediction residuals. Leveraging the learned energy landscape, a trainable reliability gate adaptively filters data points exhibiting high energy, while a rejection cost regularization prevents trivial solutions where valid data are discarded. We demonstrate the efficacy of naPINN on various benchmark partial differential equations corrupted by non-Gaussian noise and varying rates of outliers. The results show that naPINN significantly outperforms existing robust PINN baselines, successfully isolating outliers and accurately reconstructing the dynamics under severe data corruption.
zh

[AI-192] D2Quant: Accurate Low-bit Post-Training Weight Quantization for LLM s

【速读】:该论文旨在解决低比特权重仅后训练量化(Weight-only Post-Training Quantization, PTQ)在子4比特精度下性能显著下降的问题。核心挑战在于:(1)降维投影矩阵(down-projection matrices)是已知的量化瓶颈,维持其精度通常需要额外位宽;(2)权重量化会引发激活分布偏移,而有效的校正策略尚未充分探索。解决方案的关键在于提出 D² Quant 框架,从权重和激活两个维度协同优化:在权重侧设计针对降维矩阵的双尺度量化器(Dual-Scale Quantizer, DSQ),通过可吸收缩放因子提升精度而不增加比特预算;在激活侧引入偏差感知校正(Deviation-Aware Correction, DAC),在 LayerNorm 中嵌入均值偏移校正以缓解量化引起的激活分布偏移。实验表明,该方法在多个大语言模型(LLM)家族中均实现了子4比特精度下的卓越量化性能。

链接: https://arxiv.org/abs/2602.02546
作者: Xianglong Yan,ChengZhu Bao,Zhiteng Li,Tianao Zhang,Shaoqiu Zhang,Ruobing Xie,Samm Sun,Yulun Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) deliver strong performance, but their high compute and memory costs make deployment difficult in resource-constrained scenarios. Weight-only post-training quantization (PTQ) is appealing, as it reduces memory usage and enables practical speedup without low-bit operators or specialized hardware. However, accuracy often degrades significantly in weight-only PTQ at sub-4-bit precision, and our analysis identifies two main causes: (1) down-projection matrices are a well-known quantization bottleneck, but maintaining their fidelity often requires extra bit-width; (2) weight quantization induces activation deviations, but effective correction strategies remain underexplored. To address these issues, we propose D ^2 Quant, a novel weight-only PTQ framework that improves quantization from both the weight and activation perspectives. On the weight side, we design a Dual-Scale Quantizer (DSQ) tailored to down-projection matrices, with an absorbable scaling factor that significantly improves accuracy without increasing the bit budget. On the activation side, we propose Deviation-Aware Correction (DAC), which incorporates a mean-shift correction within LayerNorm to mitigate quantization-induced activation distribution shifts. Extensive experiments across multiple LLM families and evaluation metrics show that D ^2 Quant delivers superior performance for weight-only PTQ at sub-4-bit precision. The code and models will be available at this https URL.
zh

[AI-193] Beyond Alignment: Expanding Reasoning Capacity via Manifold-Reshaping Policy Optimization

【速读】:该论文旨在解决当前基于强化学习的推理增强方法(如RLVR)是否真正扩展了大语言模型(LLM)的推理能力,还是仅对预训练模型中已存在的潜在能力进行对齐的问题。现有研究认为,探索受限于预训练模型的低秩偏差流形(low-rank bias manifold),导致无法突破原有能力边界。为挑战这一假设,作者提出了一种几何框架——流形重塑策略优化(Manifold-Reshaping Policy Optimization, MRPO),其核心在于通过两个关键步骤重构LLM的推理空间:首先利用谱正交探索(Spectral Orthogonal Exploration, SOE)将策略初始化投射到偏差流形的零空间中,从而脱离原有低维结构;其次在策略优化目标中引入有效秩(Effective Rank)正则项,以对抗标准强化学习(RL)固有的熵减少倾向,激励模型发现并维持高维推理路径。该方法实现了对推理能力边界的实质性拓展,在数学任务上优于更大规模模型(如Qwen3-32B)。

链接: https://arxiv.org/abs/2602.02545
作者: Dayu Wang,Jiaye Yang,Weikang Li,Jiahui Liang,Yang Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated remarkable success in enhancing the reasoning capabilities of Large Language Models (LLMs). However, recent studies question whether RL genuinely expands reasoning capacity or merely aligns existing latent capabilities, arguing that exploration remains confined within the pre-trained model’s low-rank bias manifold. In this work, we challenge this accessibility boundary hypothesis by demonstrating that the latent reasoning space can be fundamentally expanded through targeted geometric interventions. We propose Manifold-Reshaping Policy Optimization (MRPO), a geometric framework designed to fundamentally restructure the inference space of LLMs. MRPO operates in two stages: first, we employ Spectral Orthogonal Exploration (SOE) to eject the policy initialization into the null space of the bias manifold; second, we integrate an Effective Rank regularization term into the policy optimization objective. This approach incentivizes the discovery and maintenance of high-dimensional reasoning trajectories against the entropy-reducing tendency of standard RL. Empirically, our 4B-parameter method achieves state-of-the-art performance on mathematical tasks, significantly outperforming larger models (e.g., Qwen3-32B) and expanding the capability boundary beyond standard GRPO. Our code is available at this https URL
zh

[AI-194] SPA-Cache: Singular Proxies for Adaptive Caching in Diffusion Language Models

【速读】:该论文旨在解决扩散语言模型(Diffusion Language Models, DLMs)在解码过程中因非因果特性无法使用标准键值缓存(KV caching)而导致的高计算开销问题,特别是由于每步解码都需要重新计算隐藏状态所引发的效率瓶颈。现有缓存方法虽通过选择性更新隐藏状态降低开销,但仍受限于两个关键挑战:(i) 基于 token 级别的更新识别启发式策略成本高昂;(ii) 采用固定预算分配策略,未能考虑不同层隐藏状态动态的异质性。解决方案的核心在于提出 SPA-Cache,其创新点包括:首先,设计一个低维奇异代理(low-dimensional singular proxy),在低维子空间中高效识别需要更新的关键 token,显著降低更新识别的计算负担;其次,引入自适应策略,根据层的稳定性动态调整更新预算,在保证生成质量的前提下减少对稳定层的冗余更新。这两项改进共同实现了 DLM 解码效率的大幅提升,相较原始解码提升最高达 8 倍吞吐量,较现有缓存基线提速 2–4 倍。

链接: https://arxiv.org/abs/2602.02544
作者: Wenhao Sun,Rong-Cheng Tu,Yifu Ding,Zhao Jin,Jingyi Liao,Yongcheng Jing,Dacheng Tao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 6 this http URL code repository is available at this https URL

点击查看摘要

Abstract:While Diffusion Language Models (DLMs) offer a flexible, arbitrary-order alternative to the autoregressive paradigm, their non-causal nature precludes standard KV caching, forcing costly hidden state recomputation at every decoding step. Existing DLM caching approaches reduce this cost by selective hidden state updates; however, they are still limited by (i) costly token-wise update identification heuristics and (ii) rigid, uniform budget allocation that fails to account for heterogeneous hidden state dynamics. To address these challenges, we present SPA-Cache that jointly optimizes update identification and budget allocation in DLM cache. First, we derive a low-dimensional singular proxy that enables the identification of update-critical tokens in a low-dimensional subspace, substantially reducing the overhead of update identification. Second, we introduce an adaptive strategy that allocates fewer updates to stable layers without degrading generation quality. Together, these contributions significantly improve the efficiency of DLMs, yielding up to an 8\times throughput improvement over vanilla decoding and a 2 – 4\times speedup over existing caching baselines.
zh

[AI-195] oward Ultra-Long-Horizon Sequential Model Editing

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在采用定位与编辑(Locate-and-Edit, LE)范式进行事实修正时,因多次连续编辑导致模型性能骤降(即“模型崩溃”)的问题。研究发现,模型崩溃与被编辑的多层感知机(MLP)参数权重范数的指数级增长密切相关,并通过理论证明指出,常见的LE更新规则在缺乏显式范数约束的情况下会引发此类不稳定增长。解决方案的关键在于提出一种即插即用的范数约束策略——Norm-Anchor Scaling NAS,该方法通过引入轻量级的范数锚定缩放机制,在不显著增加计算开销的前提下有效抑制权重范数的爆炸性增长,从而将典型LE算法的崩溃点延迟超过4倍,并提升平均编辑性能达72.2%。

链接: https://arxiv.org/abs/2602.02543
作者: Mingda Liu,Zhenghan Zhu,Ze’an Miao,Katsuki Fujisawa
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Model editing has emerged as a practical approach for mitigating factual errors and outdated knowledge in large language models (LLMs). Among existing methods, the Locate-and-Edit (LE) paradigm is the dominant framework: it locates MLP parameters implicated in expressing a target fact, and then performs a localized update to rewrite that fact. However, long sequences of edits often trigger abrupt model collapse in LE beyond a critical point. We empirically identify a strong correlation between collapse and explosive growth of edited MLP weight norms, and formally prove that commonly used LE update rules can induce exponential norm growth across sequential edits in the absence of explicit norm control. To address this issue, we propose Norm-Anchor Scaling NAS, a plug-and-play norm-constrained strategy. Across extensive experiments, NAS delays the collapse point of representative LE algorithms by more than 4 times and yields a 72.2% average relative gain in editing performance, requiring only a single additional line of code and incurring negligible computational overhead.
zh

[AI-196] Auto-Augmentation Contrastive Learning for Wearable-based Human Activity Recognition

【速读】:该论文旨在解决可穿戴设备中人体活动识别(Human Activity Recognition, HAR)任务里低语义传感器信号在对比学习(Contrastive Learning, CL)中因依赖人工设计的数据增强策略而导致的泛化性差、灵活性不足的问题。解决方案的关键在于提出一种端到端的自动增强对比学习方法(AutoCL),其核心创新包括:基于Siamese网络结构共享主干特征提取器参数,并嵌入一个生成器以自动学习最优数据增强策略;通过在潜在空间中训练生成器来抑制原始传感器数据中的噪声和冗余信息干扰;同时引入梯度停止(stop-gradient)设计与相关性降低策略,有效提升编码器的表征学习能力。实验证明,该方法在四个主流HAR数据集上显著优于现有最先进方法。

链接: https://arxiv.org/abs/2602.02542
作者: Qingyu Wu,Jianfei Shen,Feiyi Fan,Yang Gu,Chenyang Xu,Yiqiang Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:For low-semantic sensor signals from human activity recognition (HAR), contrastive learning (CL) is essential to implement novel applications or generic models without manual annotation, which is a high-performance self-supervised learning (SSL) method. However, CL relies heavily on data augmentation for pairwise comparisons. Especially for low semantic data in the HAR area, conducting good performance augmentation strategies in pretext tasks still rely on manual attempts lacking generalizability and flexibility. To reduce the augmentation burden, we propose an end-to-end auto-augmentation contrastive learning (AutoCL) method for wearable-based HAR. AutoCL is based on a Siamese network architecture that shares the parameters of the backbone and with a generator embedded to learn auto-augmentation. AutoCL trains the generator based on the representation in the latent space to overcome the disturbances caused by noise and redundant information in raw sensor data. The architecture empirical study indicates the effectiveness of this design. Furthermore, we propose a stop-gradient design and correlation reduction strategy in AutoCL to enhance encoder representation learning. Extensive experiments based on four wide-used HAR datasets demonstrate that the proposed AutoCL method significantly improves recognition accuracy compared with other SOTA methods.
zh

[AI-197] Enhancing Psychologists Understanding through Explainable Deep Learning Framework for ADHD Diagnosis

【速读】:该论文旨在解决注意力缺陷多动障碍(Attention Deficit Hyperactivity Disorder, ADHD)诊断困难且缺乏透明度的问题,尤其在实现可靠识别与分类的同时保障决策过程的可解释性。解决方案的关键在于提出一种基于微调混合深度神经网络(Deep Neural Network, DNN)与循环神经网络(Recurrent Neural Network, RNN)的可解释框架——HyExDNN-RNN模型,结合皮尔逊相关系数进行最优特征选择,并引入SHapley Additive exPlanations(SHAP)和排列特征重要性(Permutation Feature Importance, PFI)等可解释人工智能(Explainable AI, XAI)方法,从而在保证高精度(二分类F1分数达99%,多分类达94.2%)的同时提供决策逻辑的可视化与可理解性,增强心理学专家对AI诊断结果的信任与应用可行性。

链接: https://arxiv.org/abs/2602.02535
作者: Abdul Rehman,Ilona Heldal,Jerry Chun-Wei Lin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Attention Deficit Hyperactivity Disorder (ADHD) is a neurodevelopmental disorder that is challenging to diagnose and requires advanced approaches for reliable and transparent identification and classification. It is characterized by a pattern of inattention, hyperactivity and impulsivity that is more severe and more frequent than in individuals with a comparable level of development. In this paper, an explainable framework based on a fine-tuned hybrid Deep Neural Network (DNN) and Recurrent Neural Network (RNN) called HyExDNN-RNN model is proposed for ADHD detection, multi-class categorization, and decision interpretation. This framework not only detects ADHD, but also provides interpretable insights into the diagnostic process so that psychologists can better understand and trust the results of the diagnosis. We use the Pearson correlation coefficient for optimal feature selection and machine and deep learning models for experimental analysis and comparison. We use a standardized technique for feature reduction, model selection and interpretation to accurately determine the diagnosis rate and ensure the interpretability of the proposed framework. Our framework provided excellent results on binary classification, with HyExDNN-RNN achieving an F1 score of 99% and 94.2% on multi-class categorization. XAI approaches, in particular SHapley Additive exPlanations (SHAP) and Permutation Feature Importance (PFI), provided important insights into the importance of features and the decision logic of models. By combining AI with human expertise, we aim to bridge the gap between advanced computational techniques and practical psychological applications. These results demonstrate the potential of our framework to assist in ADHD diagnosis and interpretation.
zh

[AI-198] CADENT: Gated Hybrid Distillation for Sample-Efficient Transfer in Reinforcement Learning

【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)中因领域偏移(domain shift)导致的样本效率低下问题,现有方法要么无法有效迁移长期策略知识(如基于自动机的方法),要么缺乏精细的动作指导(如策略蒸馏)。其解决方案的关键在于提出Context-Aware Distillation with Experience-gated Transfer (CADENT),通过引入一种经验门控的信任机制(experience-gated trust mechanism),在状态-动作层面动态权衡教师模型的指导与学生自身的经验,从而实现战略级自动机知识与战术级策略知识的融合,使模型能够自适应地调整到目标域特性,显著提升样本效率并保持优异的最终性能。

链接: https://arxiv.org/abs/2602.02532
作者: Mahyar Alinejad,Yue Wang,George Atia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transfer learning promises to reduce the high sample complexity of deep reinforcement learning (RL), yet existing methods struggle with domain shift between source and target environments. Policy distillation provides powerful tactical guidance but fails to transfer long-term strategic knowledge, while automaton-based methods capture task structure but lack fine-grained action guidance. This paper introduces Context-Aware Distillation with Experience-gated Transfer (CADENT), a framework that unifies strategic automaton-based knowledge with tactical policy-level knowledge into a coherent guidance signal. CADENT’s key innovation is an experience-gated trust mechanism that dynamically weighs teacher guidance against the student’s own experience at the state-action level, enabling graceful adaptation to target domain specifics. Across challenging environments, from sparse-reward grid worlds to continuous control tasks, CADENT achieves 40-60% better sample efficiency than baselines while maintaining superior asymptotic performance, establishing a robust approach for adaptive knowledge transfer in RL.
zh

[AI-199] Incident-Guided Spatiotemporal Traffic Forecasting

【速读】:该论文旨在解决现有基于深度学习的交通流预测方法在建模交通系统动态特性时忽略突发事件(如交通事故和恶劣天气)影响的问题,这类外部扰动虽不可预测,却显著改变交通时间序列的模式,从而限制了预测精度的提升。解决方案的关键在于提出一种名为 Incident-Guided Spatiotemporal Graph Neural Network (IGSTGNN) 的新框架,其核心创新是通过两个模块显式建模事件的影响:一是 Incident-Context Spatial Fusion (ICSF) 模块,用于捕捉事件发生初期的异质空间影响;二是 Temporal Incident Impact Decay (TIID) 模块,用于建模事件影响随时间动态衰减的过程。这一设计使模型能够更准确地刻画突发扰动对交通流的时空作用机制。

链接: https://arxiv.org/abs/2602.02528
作者: Lixiang Fan,Bohao Li,Tao Zou,Bowen Du,Junchen Ye
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent years have witnessed the rapid development of deep-learning-based, graph-neural-network-based forecasting methods for modern intelligent transportation systems. However, most existing work focuses exclusively on capturing spatio-temporal dependencies from historical traffic data, while overlooking the fact that suddenly occurring transportation incidents, such as traffic accidents and adverse weather, serve as external disturbances that can substantially alter temporal patterns. We argue that this issue has become a major obstacle to modeling the dynamics of traffic systems and improving prediction accuracy, but the unpredictability of incidents makes it difficult to observe patterns from historical sequences. To address these challenges, this paper proposes a novel framework named the Incident-Guided Spatiotemporal Graph Neural Network (IGSTGNN). IGSTGNN explicitly models the incident’s impact through two core components: an Incident-Context Spatial Fusion (ICSF) module to capture the initial heterogeneous spatial influence, and a Temporal Incident Impact Decay (TIID) module to model the subsequent dynamic dissipation. To facilitate research on the spatio-temporal impact of incidents on traffic flow, a large-scale dataset is constructed and released, featuring incident records that are time-aligned with traffic time series. On this new benchmark, the proposed IGSTGNN framework is demonstrated to achieve state-of-the-art performance. Furthermore, the generalizability of the ICSF and TIID modules is validated by integrating them into various existing models.
zh

[AI-200] Community Norms in the Spotlight: Enabling Task-Agnostic Unsupervised Pre-Training to Benefit Online Social Media

【速读】:该论文旨在解决在线社交平台中复杂动态建模问题,特别是针对仇恨言论和虚假信息等挑战,其核心难点在于现有基于讨论转换器(Discussion Transformers)的模型严重依赖高质量人工标注数据,导致可扩展性和泛化能力受限。解决方案的关键在于提出从任务特定微调向无监督预训练的范式转变,并引入社区规范(community norms)作为全新建模基础,从而缓解数据稀缺问题并增强模型决策的社会规范可解释性,为生成式AI(Generative AI)在社会福祉方向的应用提供新路径。

链接: https://arxiv.org/abs/2602.02525
作者: Liam Hebert,Lucas Kopp,Robin Cohen
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注: Submitted to ICWSM Poster

点击查看摘要

Abstract:Modelling the complex dynamics of online social platforms is critical for addressing challenges such as hate speech and misinformation. While Discussion Transformers, which model conversations as graph structures, have emerged as a promising architecture, their potential is severely constrained by reliance on high-quality, human-labelled datasets. In this paper, we advocate a paradigm shift from task-specific fine-tuning to unsupervised pretraining, grounded in an entirely novel consideration of community norms. We posit that this framework not only mitigates data scarcity but also enables interpretation of the social norms underlying the decisions made by such an AI system. Ultimately, we believe that this direction offers many opportunities for AI for Social Good.
zh

[AI-201] GASTON: Graph-Aware Social Transformer for Online Networks

【速读】:该论文旨在解决在线社区中有害内容(如毒性言论、信息失真和回音室效应)检测困难的问题,其核心挑战在于理解网络交互的意义不仅取决于文本内容本身,还依赖于特定社区的社会规范(social norms)。为应对这一问题,作者提出GASTON(Graph-Aware Social Transformer for Online Networks),其关键创新在于采用对比初始化策略(contrastive initialization strategy),通过用户归属模式预训练社区嵌入(community embeddings),从而在处理任何文本之前就捕捉到社区的用户构成特征。这种机制使模型能够基于互动人群差异区分语义相似但性质迥异的社区(如支持群体与仇恨群体),进而提升下游任务(如压力识别、毒性评分和规范违例检测)的性能表现。

链接: https://arxiv.org/abs/2602.02524
作者: Olha Wloch,Liam Hebert,Robin Cohen,Lukasz Golab
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submitted to ICWSM

点击查看摘要

Abstract:Online communities have become essential places for socialization and support, yet they also possess toxicity, echo chambers, and misinformation. Detecting this harmful content is difficult because the meaning of an online interaction stems from both what is written (textual content) and where it is posted (social norms). We propose GASTON (Graph-Aware Social Transformer for Online Networks), which learns text and user embeddings that are grounded in their local norms, providing the necessary context for downstream tasks. The heart of our solution is a contrastive initialization strategy that pretrains community embeddings based on user membership patterns, capturing a community’s user base before processing any text. This allows GASTON to distinguish between communities (e.g., a support group vs. a hate group) based on who interacts there, even if they share similar vocabulary. Experiments on tasks such as stress detection, toxicity scoring, and norm violation demonstrate that the embeddings produced by GASTON outperform state-of-the-art baselines.
zh

[AI-202] abularMath: Evaluating Computational Extrapolation in Tabular Learning via Program-Verified Synthesis

【速读】:该论文旨在解决当前标准表格数据基准测试主要评估模型在数据流形内部进行统计插值能力的问题,而忽略了高价值领域(如金融建模和物理仿真)中大量由确定性计算过程生成的表格数据,这些场景更关注模型对计算过程的外推能力。其解决方案的关键在于提出TabularMath这一诊断性基准,包含114个确定性问题(共233,472行),基于GSM8K和AIME的验证程序生成;通过对比9种表格架构与GPT-OSS-120B的上下文学习(ICL)性能发现:TabPFN v2.5在标准回归指标上表现优异(R²=0.998),但在精确整数匹配(rounded consistency)指标下外推性能显著下降(<10%),而ICL则保持约40%的准确率,表明表格模型擅长平滑函数逼近但难以恢复精确计算输出,二者在计算外推任务中具有互补性。

链接: https://arxiv.org/abs/2602.02523
作者: Zerui Cheng,Jiashuo Liu,Jianzhu Yao,Pramod Viswanath,Ge Zhang,Wenhao Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 30 pages; TabularMath technical report

点击查看摘要

Abstract:Standard tabular benchmarks mainly focus on the evaluation of a model’s capability to interpolate values inside a data manifold, where models good at performing local statistical smoothing are rewarded. However, there exists a very large category of high-value tabular data, including financial modeling and physical simulations, which are generated based upon deterministic computational processes, as opposed to stochastic and noisy relationships. Therefore, we investigate if tabular models can provide an extension from statistical interpolation to computational extrapolation. We propose TabularMath, a diagnostic benchmark of 114 deterministic problems (233,472 rows) generated from verified programs based on GSM8K and AIME. We evaluate 9 tabular architectures and in-context learning (ICL) with GPT-OSS-120B. On standard regression metrics, TabPFN v2.5 performs remarkably well, achieving R^2=0.998 in-distribution and maintaining positive R^2 even under distribution shift, which is unique among the tabular models we tested. When we measure rounded consistency (exact integer match), a different picture emerges: TabPFN v2.5 drops below 10% on out-of-distribution data, while ICL maintains around 40%. This gap between R^2 and exact-match accuracy suggests that tabular models learn smooth function approximations but struggle to recover precise computational outputs under extrapolation. The two paradigms appear complementary: TabPFN scales efficiently with data; ICL achieves exact computation from few examples. We release all code and data to support further investigation. Comments: 30 pages; TabularMath technical report Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2602.02523 [cs.LG] (or arXiv:2602.02523v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.02523 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-203] IMU-1: Sample-Efficient Pre-training of Small Language Models

【速读】:该论文旨在解决大规模语言模型训练中数据效率低下的问题,即如何在较少训练数据下达到与传统大模型相当的性能表现。其核心解决方案是提出一套经过验证的训练配方(training recipe),关键包括:1)架构改进,如QK-norm注意力机制、每头门控(per-head gating)、值残差(value residuals)和LayerNorm缩放;2)优化方法创新,如使用NorMuon优化器配合谨慎的权重衰减策略以及muP参数化方式;3)三阶段训练流程结合事后检查点指数移动平均(post-hoc checkpoint EMA)。这些组件协同作用,使IMU-1模型(430M参数,72B tokens训练)在性能上逼近使用56倍更多数据训练的模型,显著提升了训练效率。

链接: https://arxiv.org/abs/2602.02522
作者: George Grigorev
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages

点击查看摘要

Abstract:We present IMU-1, a 430M-parameter language model trained on 72B tokens that approaches the benchmark performance of models trained on 56x more data. We describe a validated training recipe combining recent architectural interventions (QK-norm attention, per-head gating, value residuals, LayerNorm scaling) with optimization advances (NorMuon with cautious weight decay, muP parametrization) and a three-stage training schedule with post-hoc checkpoint EMA. We provide ablations for each component and release code, weights and data to enable reproduction: this https URL
zh

[AI-204] Scaled Dot-Product Attention implements projection of inputs onto a common surface

【速读】:该论文试图解决的问题是:标准的缩放点积注意力(Scaled Dot-Product Attention, SDPA)机制在数学信号处理视角下缺乏清晰的理论解释,其基于“查询、键、值”概念的解释难以与传统信号处理方法兼容,从而限制了对其内在机理的理解和潜在改进。解决方案的关键在于提出了一种数学等价但形式不同的重写方式——将SDPA重新表述为输入向量投影到由输入自身决定的公共曲面上的过程。这一重构揭示了SDPA能够捕捉输入中时变且上下文依赖的非线性关系,不仅提升了前向传播和学习算法的计算效率,更重要的是为扩展SDPA在时间序列建模中的应用提供了新的理论依据,尤其在语言建模中,它被重新诠释为通过局部上下文曲面动态调整词嵌入,从而更准确地表征时变语义。

链接: https://arxiv.org/abs/2602.02521
作者: Terence D Sanger
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Scaled dot-product attention (SDPA) is a fundamental component responsible for the success of large-language models and other nonlinear signal processing applications. The rationale for SDPA has been based upon “query, key, value” concepts borrowed from database theory, but these concepts are difficult to reconcile with standard methods in mathematical signal processing. We show that SDPA can be rewritten in a different but mathematically equivalent form as a projection of the input vectors onto a common surface determined by the inputs themselves. Therefore SDPA discovers nonlinear dependencies in the input that are time-dependent and context-dependent. The rewritten form of SDPA permits increased speed of both feedforward and learning algorithms, but more importantly suggests potential extensions. In the context of language, we re-interpret the role of SDPA as finding a time-dependent contextual meaning determined by the surface on which the set of input vectors lies. Input token embeddings are then modified by the local context surface. This interpretation differs substantially from the concept of “self-attention”, and provides a strong justification for the use of SDPA for time-series data with time-varying local nonlinear dependencies.
zh

[AI-205] Artificial Intelligence for Inclusive Engineering Education: Advancing Equality Diversity and Ethical Leadership

【速读】:该论文旨在解决当前人工智能(AI)技术在工程教育应用中仍存在的性别平等差距、全球文化代表性不足以及STEM教育机会不均等的问题。其解决方案的关键在于提出一种以伦理为导向的AI使用框架,该框架基于联合国2030可持续发展目标中的第5项目标(性别平等)和第10项目标(减少不平等),通过整合全球范围内利用AI自适应平台促进教育包容性的案例研究,构建了一个融合伦理领导力与数据驱动的公平性评估机制,从而实现教育公平性和可持续性的双重提升。

链接: https://arxiv.org/abs/2602.02520
作者: Mona G. Ibrahim,Riham Hilal
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:AI technology development has transformed the field of engineering education with its adaptivity-driven, data-based, and ethical-led learning platforms that promote equity, diversity, and inclusivity. But with so much progress being made in so many areas, there are unfortunately gaps in gender equity, representation in cultures around the world, and access to education and jobs in stem education. The paper describes an ethical approach to using AI technology that supports the United Nations 2030 agenda for sustainability. In particular, this includes both Goal 5–Gender Equity–and Goal 10–Reducing Inequalities. Based on a synthesis strategy using both critical thinking strategies related to case studies around the world using AI-based adaptivity platforms to address equity gaps related to education inclusion. The model presented offers a synthesis solution that includes ethical leadership data-related to equity to measure inclusivity based upon sustainability thinking. The result has demonstrated that using AI technology not only increases inclusivity but promotes equity related to access to education in stem education access. Finally, there are concluding remarks related to transforming education into a global system.
zh

[AI-206] Evaluation of Large Language Models educational feedback in Higher Education: potential limitations and implications for educational practice

【速读】:该论文旨在解决人工智能(AI)在高等教育反馈实践中的应用问题,特别是如何利用大型语言模型(Large Language Models, LLMs)生成高质量、结构化且有助于学生学习的形成性反馈。其解决方案的关键在于:通过提供清晰的上下文信息和基于大学教师制定的结构化评分量规(rubric),指导LLMs生成定量评估与定性反馈,从而确保AI输出具备教学相关性和有效性;研究进一步采用Hughes、Smith和Creese的分析框架对反馈质量进行系统评估,验证了LLMs在支持形成性学习体验方面的潜力,前提是需明确指令与充分情境约束。

链接: https://arxiv.org/abs/2602.02519
作者: Daniele Agostini,Federica Picasso
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The importance of managing feedback practices in higher education has been widely recognised, as they play a crucial role in enhancing teaching, learning, and assessment processes. In today’s educational landscape, feedback practices are increasingly influenced by technological advancements, particularly artificial intelligence (AI). Understanding the impact of AI on feedback generation is essential for identifying its potential benefits and establishing effective implementation strategies. This study examines how AI-generated feedback supports student learning using a well-established analytical framework. Specifically, feedback produced by different Large Language Models (LLMs) was assessed in relation to student-designed projects within a training course on inclusive teaching and learning. The evaluation process involved providing seven LLMs with a structured rubric, developed by the university instructor, which defined specific criteria and performance levels. The LLMs were tasked with generating both quantitative assessments and qualitative feedback based on this rubric. The AI-generated feedback was then analysed using Hughes, Smith, and Creese’s framework to evaluate its structure and effectiveness in fostering formative learning experiences. Overall, these findings indicate that LLMs can generate well-structured feedback and hold great potential as a sustainable and meaningful feedback tool, provided they are guided by clear contextual information and a well-defined instructions that will be explored further in the conclusions.
zh

[AI-207] What Drives Length of Stay After Elective Spine Surgery? Insights from a Decade of Predictive Modeling

【速读】:该论文旨在解决择期脊柱手术后住院时间(Length of Stay, LOS)预测的精准性问题,以优化患者预后和医院资源利用。其解决方案的关键在于系统性地综述了近年来应用于LOS预测的计算方法,发现机器学习模型(如随机森林、梯度提升算法和神经网络)在性能上显著优于传统统计模型(如逻辑回归),AUC值可达0.94–0.99,且识别出年龄、合并症(如高血压和糖尿病)、BMI、手术类型与持续时间及手术节段数量等关键预测因子。研究强调,尽管模型表现优异,但外部验证不足和报告标准不一限制了其临床转化潜力,未来需加强标准化定义与透明化报告以推动真实世界部署。

链接: https://arxiv.org/abs/2602.02517
作者: Ha Na Cho,Seungmin Jeong,Yawen Guo,Alexander Lopez,Hansen Bow,Kai Zheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Objective: Predicting length of stay after elective spine surgery is essential for optimizing patient outcomes and hospital resource use. This systematic review synthesizes computational methods used to predict length of stay in this patient population, highlighting model performance and key predictors. Methods: Following PRISMA guidelines, we systematically searched PubMed, Google Scholar, and ACM Digital Library for studies published between December 1st, 2015, and December 1st, 2024. Eligible studies applied statistical or machine learning models to predict length of stay for elective spine surgery patients. Three reviewers independently screened studies and extracted data. Results: Out of 1,263 screened studies, 29 studies met inclusion criteria. Length of stay was predicted as a continuous, binary, or percentile-based outcome. Models included logistic regression, random forest, boosting algorithms, and neural networks. Machine learning models consistently outperformed traditional statistical models, with AUCs ranging from 0.94 to 0.99. K-Nearest Neighbors and Naive Bayes achieved top performance in some studies. Common predictors included age, comorbidities (notably hypertension and diabetes), BMI, type and duration of surgery, and number of spinal levels. However, external validation and reporting practices varied widely across studies. Discussion: There is growing interest in artificial intelligence and machine learning in length of stay prediction, but lack of standardization and external validation limits clinical utility. Future studies should prioritize standardized outcome definitions and transparent reporting needed to advance real-world deployment. Conclusion: Machine learning models offer strong potential for length of stay prediction after elective spine surgery, highlighting their potential for improving discharge planning and hospital resource management.
zh

[AI-208] Measuring Individual User Fairness with User Similarity and Effectiveness Disparity ECIR2026

【速读】:该论文旨在解决推荐系统(Recommender Systems, RSs)中个体用户公平性评估的不完整性问题。现有评估指标要么仅关注推荐效果差异(effectiveness disparity),要么仅关注相似用户被推荐内容的差异(item disparity),但未能同时考虑用户相似性和推荐效果两个关键维度,导致对公平性的定义存在缺陷。解决方案的关键在于提出Pairwise User unFairness (PUF),这是首个能够统一衡量用户相似性与推荐效果差异的个体用户公平性评估指标,通过引入成对用户间的公平性判别机制,实现了对公平性更全面、可靠的刻画。实证结果表明,PUF在多个数据集和推荐算法上均表现出一致性与鲁棒性,而传统方法则在某一维度上表现迟钝甚至完全无响应。

链接: https://arxiv.org/abs/2602.02516
作者: Theresia Veronika Rampisela,Maria Maistro,Tuukka Ruotsalo,Christina Lioma
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Preprint of a work that has been accepted to ECIR 2026 Full Papers track as a Findings paper

点击查看摘要

Abstract:Individual user fairness is commonly understood as treating similar users similarly. In Recommender Systems (RSs), several evaluation measures exist for quantifying individual user fairness. These measures evaluate fairness via either: (i) the disparity in RS effectiveness scores regardless of user similarity, or (ii) the disparity in items recommended to similar users regardless of item relevance. Both disparity in recommendation effectiveness and user similarity are very important in fairness, yet no existing individual user fairness measure simultaneously accounts for both. In brief, current user fairness evaluation measures implement a largely incomplete definition of fairness. To fill this gap, we present Pairwise User unFairness (PUF), a novel evaluation measure of individual user fairness that considers both effectiveness disparity and user similarity. PUF is the only measure that can express this important distinction. We empirically validate that PUF does this consistently across 4 datasets and 7 rankers, and robustly when varying user similarity or effectiveness. In contrast, all other measures are either almost insensitive to effectiveness disparity or completely insensitive to user similarity. We contribute the first RS evaluation measure to reliably capture both user similarity and effectiveness in individual user fairness. Our code: this https URL.
zh

[AI-209] Efficient Edge Rewiring Strategies for Enhancing PageRank Fairness

【速读】:该论文旨在解决社交网络中的不公平问题,即弱势群体(如男性主导行业中女性)因在网络中处于不利位置而难以获取关键信息(如职位招聘信息),其核心在于通过调整网络结构来提升PageRank公平性(PageRank fairness),即实现不同群体间PageRank权重的公平分配。解决方案的关键在于设计一种基于贪心策略的线性时间算法,利用根树生成森林(rooted spanning forests)的快速采样技术,在允许重连固定数量边的前提下,最大化弱势群体的PageRank公平性。实验表明,该算法在百万节点规模网络上仅需几分钟即可获得高精度解,显著优于现有方法。

链接: https://arxiv.org/abs/2602.02512
作者: Changan Liu,Haoxin Sun,Ahad N. Zehmakan,Zhongzhi Zhang
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注: Accepted by Theoretical Computer Science (TCS)

点击查看摘要

Abstract:We study the notion of unfairness in social networks, where a group such as females in a male-dominated industry are disadvantaged in access to important information, e.g. job posts, due to their less favorable positions in the network. We investigate a well-established network-based formulation of fairness called PageRank fairness, which refers to a fair allocation of the PageRank weights among distinct groups. Our goal is to enhance the PageRank fairness by modifying the underlying network structure. More precisely, we study the problem of maximizing PageRank fairness with respect to a disadvantaged group, when we are permitted to rewire a fixed number of edges in the network. Building on a greedy approach, we leverage techniques from fast sampling of rooted spanning forests to devise an effective linear-time algorithm for this problem. To evaluate the accuracy and performance of our proposed algorithm, we conduct a large set of experiments on various real-world network data. Our experiments demonstrate that the proposed algorithm significantly outperforms the existing ones. Our algorithm is capable of generating accurate solutions for networks of million nodes in just a few minutes.
zh

[AI-210] raining Data Governance for Brain Foundation Models

【速读】:该论文旨在解决脑基础模型(brain foundation models)在训练和应用过程中引发的伦理与治理挑战问题。其核心关切在于,神经数据(如EEG、fMRI等)因其身体来源特性及长期受临床与科研场景严格监管的历史背景,相较于文本或图像数据具有更强的隐私保护预期和权利主张;然而,基础模型范式却将其置于大规模再利用、跨场景整合与开放下游应用的实践中,且这些实践正被更广泛的商业开发者所采纳,而当前治理框架却呈现碎片化与模糊性。解决方案的关键在于:首先系统梳理脑基础模型的技术基础与数据生态,继而结合人工智能伦理、神经伦理与生物伦理,从隐私、知情同意、偏见、利益共享及治理机制五个维度提出结构化的问题清单与基础性保障措施,以推动该领域健康、负责任地发展。

链接: https://arxiv.org/abs/2602.02511
作者: Margot Hanley,Jiunn-Tyng Yeh,Ryan Rodriguez,Jack Pilkington,Nita Farahany
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Brain foundation models bring the foundation model paradigm to the field of neuroscience. Like language and image foundation models, they are general-purpose AI systems pretrained on large-scale datasets that adapt readily to downstream tasks. Unlike text-and-image based models, however, they train on brain data: large-datasets of EEG, fMRI, and other neural data types historically collected within tightly governed clinical and research settings. This paper contends that training foundation models on neural data opens new normative territory. Neural data carry stronger expectations of, and claims to, protection than text or images, given their body-derived nature and historical governance within clinical and research settings. Yet the foundation model paradigm subjects them to practices of large-scale repurposing, cross-context stitching, and open-ended downstream application. Furthermore, these practices are now accessible to a much broader range of actors, including commercial developers, against a backdrop of fragmented and unclear governance. To map this territory, we first describe brain foundation models’ technical foundations and training-data ecosystem. We then draw on AI ethics, neuroethics, and bioethics to organize concerns across privacy, consent, bias, benefit sharing, and governance. For each, we propose both agenda-setting questions and baseline safeguards as the field matures.
zh

[AI-211] CodeGuard: Improving LLM Guardrails in CS Education

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在计算机科学(Computer Science, CS)教育场景中因易受恶意或不当提示(adversarial or ill-intentioned prompts)影响而导致的学生学习受损与学术诚信风险问题。其解决方案的关键在于提出CodeGuard框架,该框架包含三个核心组件:(i) 首个针对教育场景的提示分类体系(taxonomy);(ii) 包含8000条标注提示的CodeGuard数据集;(iii) 一个轻量级句子编码器模型PromptShield,通过微调实现对不安全提示的实时检测,实验表明其F1分数达0.93,显著优于现有方法,并能在不损害合法教学任务性能的前提下,使潜在有害代码生成减少30–65%。

链接: https://arxiv.org/abs/2602.02509
作者: Nishat Raihan,Noah Erdachew,Jayoti Devi,Joanna C. S. Santos,Marcos Zampieri
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly embedded in Computer Science (CS) classrooms to automate code generation, feedback, and assessment. However, their susceptibility to adversarial or ill-intentioned prompts threatens student learning and academic integrity. To cope with this important issue, we evaluate existing off-the-shelf LLMs in handling unsafe and irrelevant prompts within the domain of CS education. We identify important shortcomings in existing LLM guardrails which motivates us to propose CodeGuard, a comprehensive guardrail framework for educational AI systems. CodeGuard includes (i) a first-of-its-kind taxonomy for classifying prompts; (ii) the CodeGuard dataset, a collection of 8,000 prompts spanning the taxonomy; and (iii) PromptShield, a lightweight sentence-encoder model fine-tuned to detect unsafe prompts in real time. Experiments show that PromptShield achieves 0.93 F1 score, surpassing existing guardrail methods. Additionally, further experimentation reveals that CodeGuard reduces potentially harmful or policy-violating code completions by 30-65% without degrading performance on legitimate educational tasks. The code, datasets, and evaluation scripts are made freely available to the community.
zh

[AI-212] Precoding-Oriented CSI Feedback Design with Mutual Information Regularized VQ-VAE

【速读】:该论文旨在解决大规模多输入多输出(Massive MIMO)系统中用户设备(UE)侧信道状态信息(CSI)压缩效率与下行链路速率之间的权衡问题,即在有限反馈开销下如何最大化系统性能。其解决方案的关键在于提出一种面向预编码的CSI反馈框架,该框架基于向量量化变分自编码器(Vector Quantized Variational Autoencoder, VQ-VAE),并引入信息论正则化项以优化码本利用率;其中,通过设计一种可微分的互信息下界估计器作为训练正则化项,在固定长度反馈约束下提升码本的有效利用效率,从而实现与可变长度神经压缩方法相当的传输速率,同时具备更均匀的码字使用分布和与信道状态信息强相关的可解释结构。

链接: https://arxiv.org/abs/2602.02508
作者: Xi Chen,Homa Esfahanizadeh,Foad Sohrabi
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: 5 pages, submitted to IEEE VTC conference

点击查看摘要

Abstract:Efficient channel state information (CSI) compression at the user equipment plays a key role in enabling accurate channel reconstruction and precoder design in massive multiple-input multiple-output systems. A key challenge lies in balancing the CSI feedback overhead with the achievable downlink rate, i.e., maximizing the utility of limited feedback to maintain high system performance. In this work, we propose a precoding-oriented CSI feedback framework based on a vector quantized variational autoencoder, augmented with an information-theoretic regularization. To achieve this, we introduce a differentiable mutual information lower-bound estimator as a training regularizer to promote effective utilization of the learned codebook under a fixed feedback budget. Numerical results demonstrate that the proposed method achieves rates comparable to variable-length neural compression schemes, while operating with fixed-length feedback. Furthermore, the learned codewords exhibit significantly more uniform usage and capture interpretable structures that are strongly correlated with the underlying channel state information.
zh

[AI-213] Learning-augmented smooth integer programs with PAC-learnable oracles

【速读】:该论文旨在解决光滑整数规划(smooth integer programs)中如何利用预测信息提升算法性能的问题,特别是针对MAX-CUT和MAX-k-SAT等经典问题。其解决方案的关键在于提出一个学习增强型算法框架,该框架通过引入一个预测预言机(predictive oracle)来构建目标函数的线性代理模型(linear surrogate),随后通过线性规划求解并辅以舍入过程获得整数解。该框架保证了在预测误差存在时,解的质量仍具有稳定性和平滑性,从而将传统稠密情形下的可 tractable 近似扩展至近稠密情形。此外,作者进一步证明了该预言机可在概率近似正确(PAC-learnable)意义上被有效学习,因其诱导的算法类具有有界伪维数(pseudo-dimension),确保用多项式样本即可学习到近最优预期性能的预言机。

链接: https://arxiv.org/abs/2602.02505
作者: Hao-Yuan He,Ming Li
机构: 未知
类目: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper investigates learning-augmented algorithms for smooth integer programs, covering canonical problems such as MAX-CUT and MAX-k-SAT. We introduce a framework that incorporates a predictive oracle to construct a linear surrogate of the objective, which is then solved via linear programming followed by a rounding procedure. Crucially, our framework ensures that the solution quality is both consistent and smooth against prediction errors. We demonstrate that this approach effectively extends tractable approximations from the classical dense regime to the near-dense regime. Furthermore, we go beyond the assumption of oracle existence by establishing its PAC-learnability. We prove that the induced algorithm class possesses a bounded pseudo-dimension, thereby ensuring that an oracle with near-optimal expected performance can be learned with polynomial samples.
zh

[AI-214] Sparse Adapter Fusion for Continual Learning in NLP EACL2026

【速读】:该论文旨在解决持续学习(Continual Learning)在自然语言处理(Natural Language Processing, NLP)中面临的三大挑战:任务间参数复用效率低、当任务差异较大时易发生灾难性遗忘(Catastrophic Forgetting),以及为每个新任务引入冗余参数导致相似任务间知识共享受限。解决方案的关键在于提出一种稀疏适配器融合方法(Sparse Adapter Fusion Method, SAFM),其核心机制包含两个阶段:决策阶段通过架构搜索动态决定是否引入新适配器、复用已有适配器或添加空适配器,以最小化参数消耗并最大化复用;调优阶段则引入分层损失函数,促进不同适配器间的差异化表达,从而有效捕捉同一任务内的知识。实验表明,SAFM在性能接近当前最优(SOTA)方法的同时,参数使用量低于60%。

链接: https://arxiv.org/abs/2602.02502
作者: Min Zeng,Xi Chen,Haiqin Yang,Yike Guo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This paper has been accepted to EACL 2026

点击查看摘要

Abstract:Continual learning in natural language processing plays a crucial role in adapting to evolving data and preventing catastrophic forgetting. Despite significant progress, existing methods still face challenges, such as inefficient parameter reuse across tasks, risking catastrophic forgetting when tasks are dissimilar, and the unnecessary introduction of new parameters for each task, which hampers knowledge sharing among similar tasks. To tackle these issues, we propose a Sparse Adapter Fusion Method (SAFM), which dynamically fuses old and new adapters to address these challenges. SAFM operates in two stages: the decision stage and the tuning stage. In the decision stage, SAFM determines whether to incorporate a new adapter, reuse an existing one, or add an empty adapter. The architecture search procedure, designed to prioritize reusing or adding empty adapters, minimizes parameter consumption and maximizes reuse. In the tuning stage, SAFM especially facilitates a layer-wise loss to encourage differentiation between adapters, effectively capturing knowledge within the same task. Experimental results consistently show that SAFM outperforms state-of-the-art (SOTA) methods, achieving comparable performance while utilizing less than 60% of the parameters.
zh

[AI-215] UNSO: Unified Newton Schulz Orthogonalization

【速读】:该论文旨在解决传统牛顿-舒尔茨(Newton-Schulz, NS)迭代方法在计算效率和数值稳定性方面的不足,尤其是在用于Muon优化器和Stiefel流形上的应用中表现不佳的问题。其关键解决方案在于提出一种统一的牛顿-舒尔茨正交化框架(Unified Newton-Schulz Orthogonalization, UNSO),通过重构迭代结构避免多项式展开,识别并移除不重要的矩阵幂项,进而构建一个具有可学习系数的推荐多项式形式;这些系数随后被优化以实现稳定收敛和卓越性能。

链接: https://arxiv.org/abs/2602.02500
作者: Chen Hu,Qianxi Zhao,Yuming Li,Mingyu Zhou,Xiyin Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
备注:

点击查看摘要

Abstract:The Newton-Schulz (NS) iteration has gained increasing interest for its role in the Muon optimizer and the Stiefel manifold. However, the conventional NS iteration suffers from inefficiency and instability. Although various improvements have been introduced to NS iteration, they fail to deviate from the conventional iterative paradigm, which could increase computation burden largely due to the matrix products along the long dimension repeatedly. To address this, we consolidate the iterative structure into a unified framework, named Unified Newton-Schulz Orthogonalization (UNSO). To do so, we could avoid a polynomial expansion. Instead, we evaluate the role of each matrix power, remove the insignificant terms, and provide a recommended polynomial with learnable coefficients. These learnable coefficients are then optimized, and achieve an outstanding performance with stable convergence. The code of our method is available: this https URL.
zh

[AI-216] LPCVAE: A Conditional VAE with Long-Term Dependency and Probabilistic Time-Frequency Fusion for Time Series Anomaly Detection

【速读】:该论文旨在解决基于变分自编码器(Variational AutoEncoder, VAE)的时间序列异常检测(Time Series Anomaly Detection, TSAD)方法中存在的两个关键问题:一是特征提取局限于单窗口,难以捕捉长期依赖关系;二是对时间域与频率域信息的融合方式不够充分,导致信息损失。解决方案的关键在于提出一种名为LPCVAE的条件变分自编码器模型,其核心创新包括:1)引入长短期记忆网络(LSTM)以建模跨窗口的长期时序依赖;2)采用乘积专家(Product-of-Experts, PoE)机制实现分布级的概率性时间-频率信息自适应融合,从而有效保留并整合多尺度特征表示。这一设计显著提升了TSAD的准确性与鲁棒性。

链接: https://arxiv.org/abs/2510.10915
作者: Hanchang Cheng,Weimin Mu,Fan Liu,Weilin Zhu,Can Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:Time series anomaly detection(TSAD) is a critical task in signal processing field, ensuring the reliability of complex systems. Reconstruction-based methods dominate in TSAD. Among these methods, VAE-based methods have achieved promising results. Existing VAE-based methods suffer from the limitation of single-window feature and insufficient leveraging of long-term time and frequency information. We propose a Conditional Variational AutoEncoder with Long-term dependency and Probabilistic time-frequency fusion, named LPCVAE. LPCVAE introduces LSTM to capture long-term dependencies beyond windows. It further incorporates a Product-of-Experts (PoE) mechanism for adaptive and distribution-level probabilistic fusion. This design effectively mitigates time-frequency information loss. Extensive experiments on four public datasets demonstrate it outperforms state-of-the-art methods. The results confirm that integrating long-term time and frequency representations with adaptive fusion yields a robust and efficient solution for TSAD.
zh

[AI-217] Fast Sampling for Flows and Diffusions with Lazy and Point Mass Stochastic Interpolants

【速读】:该论文旨在解决生成模型中 stochastic interpolants(随机插值)的调度策略优化问题,特别是如何在不同扩散系数和插值路径之间进行转换,以提升采样效率并保持生成质量。其关键解决方案是提出了一种通用的数学框架,可将任意扩散系数下任意调度策略对应的随机微分方程(SDE)样本路径精确转换为另一调度下的唯一样本路径,从而实现跨调度的灵活迁移;此外,论文进一步扩展了插值框架以支持点质量(point mass)调度,并在高斯数据假设下识别出“懒惰调度”(lazy schedule)家族,使得漂移项恒为零,进而通过确定性采样获得常见于扩散模型中的方差保持调度,或通过最优统计采样得到新的点质量调度。这一理论成果被应用于预训练流模型,在不重新训练的情况下显著减少生成步骤数,验证了其在真实非高斯数据上的有效性。

链接: https://arxiv.org/abs/2602.03789
作者: Gabriel Damsholt,Jes Frellsen,Susanne Ditlevsen
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Stochastic interpolants unify flows and diffusions, popular generative modeling frameworks. A primary hyperparameter in these methods is the interpolation schedule that determines how to bridge a standard Gaussian base measure to an arbitrary target measure. We prove how to convert a sample path of a stochastic differential equation (SDE) with arbitrary diffusion coefficient under any schedule into the unique sample path under another arbitrary schedule and diffusion coefficient. We then extend the stochastic interpolant framework to admit a larger class of point mass schedules in which the Gaussian base measure collapses to a point mass measure. Under the assumption of Gaussian data, we identify lazy schedule families that make the drift identically zero and show that with deterministic sampling one gets a variance-preserving schedule commonly used in diffusion models, whereas with statistically optimal SDE sampling one gets our point mass schedule. Finally, to demonstrate the usefulness of our theoretical results on realistic highly non-Gaussian data, we apply our lazy schedule conversion to a state-of-the-art pretrained flow model and show that this allows for generating images in fewer steps without retraining the model.
zh

[AI-218] DiffLOB: Diffusion Models for Counterfactual Generation in Limit Order Books

【速读】:该论文旨在解决现有生成式限价订单簿(Limit Order Book, LOB)模型在应对压力测试、情景分析和决策支持等任务时的局限性,即这些模型本质上是被动的,无法根据假设的未来市场状态进行可控且具有因果意义的模拟。其解决方案的关键在于提出DiffLOB——一种基于扩散机制(diffusion model)的、以市场状态(market regime)为条件的生成模型,能够显式地将未来市场状态(如趋势、波动率、流动性及订单流失衡等)纳入生成过程,从而实现对反事实轨迹的可控生成,回答“若未来市场状态为X而非Y,订单簿将如何演化?”这类问题。该方法通过三个评估标准验证其有效性:可控真实性、反事实有效性与反事实实用性,显著提升了生成模型在金融场景下的可解释性和应用价值。

链接: https://arxiv.org/abs/2602.03776
作者: Zhuohan Wang,Carmine Ventre
机构: 未知
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI)
备注: 12 pages, 8 figures

点击查看摘要

Abstract:Modern generative models for limit order books (LOBs) can reproduce realistic market dynamics, but remain fundamentally passive: they either model what typically happens without accounting for hypothetical future market conditions, or they require interaction with another agent to explore alternative outcomes. This limits their usefulness for stress testing, scenario analysis, and decision-making. We propose \textbfDiffLOB, a regime-conditioned \textbfDiffusion model for controllable and counterfactual generation of \textbfLOB trajectories. DiffLOB explicitly conditions the generative process on future market regimes–including trend, volatility, liquidity, and order-flow imbalance, which enables the model to answer counterfactual queries of the form: ``If the future market regime were X instead of Y, how would the limit order book evolve?‘’ Our systematic evaluation framework for counterfactual LOB generation consists of three criteria: (1) \textitControllable Realism, measuring how well generated trajectories can reproduce marginal distributions, temporal dependence structure and regime variables; (2) \textitCounterfactual validity, testing whether interventions on future regimes induce consistent changes in the generated LOB dynamics; (3) \textitCounterfactual usefulness, assessing whether synthetic counterfactual trajectories improve downstream prediction of future market regimes.
zh

[AI-219] Multiparameter Uncertainty Mapping in Quantitative Molecular MRI using a Physics-Structured Variational Autoencoder (PS-VAE)

【速读】:该论文旨在解决定量成像方法(如磁共振指纹成像,MRF)在临床应用中因缺乏可解释的不确定性量化而导致的信任度和透明度不足的问题。其核心解决方案是提出一种物理结构化的变分自编码器(PS-VAE),该模型将可微分的自旋物理模拟器与自监督学习相结合,能够快速提取体素级别的多参数后验分布,并准确捕捉潜在生物物理空间中各参数间的协方差关系。此方法在多种实验场景下验证了其与暴力贝叶斯分析结果的一致性,同时实现了全脑定量计算的速度提升数个数量级,且能通过监测多参数后验动态为协议优化提供实时反馈。

链接: https://arxiv.org/abs/2602.03317
作者: Alex Finkelstein,Ron Moneta,Or Zohar,Michal Rivlin,Moritz Zaiss,Dinora Friedmann Morvinski,Or Perlman
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
备注: Submitted to IEEE Transactions on Medical Imaging. This project was funded by the European Union (ERC, BabyMagnet, project no. 101115639). Views and opinions expressed are, however, those of the authors only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them

点击查看摘要

Abstract:Quantitative imaging methods, such as magnetic resonance fingerprinting (MRF), aim to extract interpretable pathology biomarkers by estimating biophysical tissue parameters from signal evolutions. However, the pattern-matching algorithms or neural networks used in such inverse problems often lack principled uncertainty quantification, which limits the trustworthiness and transparency, required for clinical acceptance. Here, we describe a physics-structured variational autoencoder (PS-VAE) designed for rapid extraction of voxelwise multi-parameter posterior distributions. Our approach integrates a differentiable spin physics simulator with self-supervised learning, and provides a full covariance that captures the inter-parameter correlations of the latent biophysical space. The method was validated in a multi-proton pool chemical exchange saturation transfer (CEST) and semisolid magnetization transfer (MT) molecular MRF study, across in-vitro phantoms, tumor-bearing mice, healthy human volunteers, and a subject with glioblastoma. The resulting multi-parametric posteriors are in good agreement with those calculated using a brute-force Bayesian analysis, while providing an orders-of-magnitude acceleration in whole brain quantification. In addition, we demonstrate how monitoring the multi-parameter posterior dynamics across progressively acquired signals provides practical insights for protocol optimization and may facilitate real-time adaptive acquisition.
zh

[AI-220] Latent Neural-ODE for Model-Informed Precision Dosing: Overcoming Structural Assumptions in Pharmacokinetics

【速读】:该论文旨在解决肾移植后他克莫司(tacrolimus)暴露量精准预测问题,即如何基于有限的临床浓度数据准确估算其药时曲线下面积(AUC),以实现个体化给药。当前依赖非线性混合效应(NLME)方法的群体药代动力学(PopPK)模型因假设刚性、难以捕捉患者特异性动态而存在建模偏差。解决方案的关键在于引入基于潜在常微分方程(Latent Ordinary Differential Equations, Latent ODEs)的深度学习框架,该方法直接从稀疏临床数据中学习个体化的药代动力学动态,无需预设结构假设,从而显著提升对复杂生物行为的建模灵活性与准确性。

链接: https://arxiv.org/abs/2602.03215
作者: Benjamin Maurel,Agathe Guilloux,Sarah Zohar,Moreno Ursino,Jean-Baptiste Woillard
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Accurate estimation of tacrolimus exposure, quantified by the area under the concentration-time curve (AUC), is essential for precision dosing after renal transplantation. Current practice relies on population pharmacokinetic (PopPK) models based on nonlinear mixed-effects (NLME) methods. However, these models depend on rigid, pre-specified assumptions and may struggle to capture complex, patient-specific dynamics, leading to model misspecification. In this study, we introduce a novel data-driven alternative based on Latent Ordinary Differential Equations (Latent ODEs) for tacrolimus AUC prediction. This deep learning approach learns individualized pharmacokinetic dynamics directly from sparse clinical data, enabling greater flexibility in modeling complex biological behavior. The model was evaluated through extensive simulations across multiple scenarios and benchmarked against two standard approaches: NLME-based estimation and the iterative two-stage Bayesian (it2B) method. We further performed a rigorous clinical validation using a development dataset (n = 178) and a completely independent external dataset (n = 75). In simulation, the Latent ODE model demonstrated superior robustness, maintaining high accuracy even when underlying biological mechanisms deviated from standard assumptions. Regarding experiments on clinical datasets, in internal validation, it achieved significantly higher precision with a mean RMSPE of 7.99% compared with 9.24% for it2B (p 0.001). On the external cohort, it achieved an RMSPE of 10.82%, comparable to the two standard estimators (11.48% and 11.54%). These results establish the Latent ODE as a powerful and reliable tool for AUC prediction. Its flexible architecture provides a promising foundation for next-generation, multi-modal models in personalized medicine. Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2602.03215 [stat.ML] (or arXiv:2602.03215v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2602.03215 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Benjamin Maurel [view email] [v1] Tue, 3 Feb 2026 07:30:48 UTC (6,994 KB)
zh

[AI-221] CryoLVM: Self-supervised Learning from Cryo-EM Density Maps with Large Vision Models

【速读】:该论文旨在解决当前冷冻电镜(cryo-EM)数据处理中因数据量激增和任务多样化导致的计算框架碎片化问题,即现有基于深度学习的任务特定方法在可扩展性和泛化能力方面存在局限。其解决方案的关键在于提出一个名为CryoLVM的基础模型(foundation model),该模型通过联合嵌入预测架构(Joint-Embedding Predictive Architecture, JEPA)与SCUNet骨干网络,从具有已知结构的实验密度图中学习丰富的结构表征,并结合一种新颖的基于直方图的分布对齐损失函数,显著加速收敛并提升微调性能,从而实现对多种下游cryo-EM任务(如密度图锐化、超分辨率重建和缺失楔形恢复)的高效适配与卓越表现。

链接: https://arxiv.org/abs/2602.02620
作者: Weining Fu,Kai Shu,Kui Xu,Qiangfeng Cliff Zhang
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Cryo-electron microscopy (cryo-EM) has revolutionized structural biology by enabling near-atomic-level visualization of biomolecular assemblies. However, the exponential growth in cryo-EM data throughput and complexity, coupled with diverse downstream analytical tasks, necessitates unified computational frameworks that transcend current task-specific deep learning approaches with limited scalability and generalizability. We present CryoLVM, a foundation model that learns rich structural representations from experimental density maps with resolved structures by leveraging the Joint-Embedding Predictive Architecture (JEPA) integrated with SCUNet-based backbone, which can be rapidly adapted to various downstream tasks. We further introduce a novel histogram-based distribution alignment loss that accelerates convergence and enhances fine-tuning performance. We demonstrate CryoLVM’s effectiveness across three critical cryo-EM tasks: density map sharpening, density map super-resolution, and missing wedge restoration. Our method consistently outperforms state-of-the-art baselines across multiple density map quality metrics, confirming its potential as a versatile model for a wide spectrum of cryo-EM applications.
zh

[AI-222] AI Assisted Economics Measurement From Survey: Evidence from Public Employee Pension Choice

【速读】:该论文旨在解决传统经济测量中依赖主观设定测量结构、缺乏自动化与可验证性的问题,尤其在面对复杂问卷数据时难以准确识别哪些语义成分蕴含行为信号。其解决方案的关键在于构建一个迭代式的测量框架,利用大语言模型(Large Language Models, LLMs)直接从调查工具中提取测量结构,通过“软映射”(soft mapping)将问卷条目映射到潜在构念的稀疏分布,并基于交叉验证的增量效度(incremental validity)和判别效度(discriminant validity)诊断对测量分类体系进行迭代优化,从而确保新增灵活性仅在提升样本外稳定性时被保留。该方法实现了测量结构的自动审计与可复用性,为经济行为研究提供了更可靠的数据驱动基础。

链接: https://arxiv.org/abs/2602.02604
作者: Tiancheng Wang,Krishna Sharma
机构: 未知
类目: Econometrics (econ.EM); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We develop an iterative framework for economic measurement that leverages large language models to extract measurement structure directly from survey instruments. The approach maps survey items to a sparse distribution over latent constructs through what we term a soft mapping, aggregates harmonized responses into respondent level sub dimension scores, and disciplines the resulting taxonomy through out of sample incremental validity tests and discriminant validity diagnostics. The framework explicitly integrates iteration into the measurement construction process. Overlap and redundancy diagnostics trigger targeted taxonomy refinement and constrained remapping, ensuring that added measurement flexibility is retained only when it delivers stable out of sample performance gains. Applied to a large scale public employee retirement plan survey, the framework identifies which semantic components contain behavioral signal and clarifies the economic mechanisms, such as beliefs versus constraints, that matter for retirement choices. The methodology provides a portable measurement audit of survey instruments that can guide both empirical analysis and survey design.
zh

[AI-223] Joint single-shot ToA and DoA estimation for VAA-based BLE ranging with phase ambiguity: A deep learning-based approach

【速读】:该论文旨在解决在蓝牙低功耗(BLE)设备上实现高精度到达方向(DoA)估计的问题,此类设备因尺寸限制难以部署多天线阵列。传统虚拟天线阵列(VAA)技术虽可利用单天线实现DoA估计,但BLE仅提供单次双向信道频率响应(CFR),且存在二进制相位模糊问题,阻碍了VAA的直接应用。解决方案的关键在于提出一个统一模型,将VAA与BLE双向CFR相结合,并设计了一种基于神经网络的相位恢复框架,该框架采用行/列预测器与投票机制以消除相位模糊;恢复后的单向CFR进一步支持超分辨算法(如MUSIC)实现到达时间(ToA)与DoA的联合估计,仿真结果表明该方法在非均匀VAA条件下性能优越,信噪比(SNR)≥5 dB时均方误差逼近克拉美-罗界(Cramer Rao Bound)。

链接: https://arxiv.org/abs/2602.02503
作者: Jincheng Xie,Yili Deng,Jiguang He,Pengyu Wang,Miaomiao Dong,Rui Tang,Zhongyi Huang
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:Conventional direction-of-arrival (DoA) estimation methods rely on multi-antenna arrays, which are costly to implement on size-constrained Bluetooth Low Energy (BLE) devices. Virtual antenna array (VAA) techniques enable DoA estimation with a single antenna, making angle estimation feasible on such devices. However, BLE only provides a single-shot two-way channel frequency response (CFR) with a binary phase ambiguity issue, which hinders the direct application of VAA. To address this challenge, we propose a unified model that combines VAA with BLE two-way CFR, and introduce a neural network based phase recovery framework that employs row / column predictors with a voting mechanism to resolve the ambiguity. The recovered one-way CFR then enables super resolution algorithms such as MUSIC for joint time of arrival (ToA) and DoA estimation. Simulation results demonstrate that the proposed method achieves superior performance under non-uniform VAAs, with mean square errors approaching the Cramer Rao bound at SNR \geq 5 dB.
zh

机器学习

[LG-0] Investigating Quantum Circuit Designs Using Neuro-Evolution GECCO

链接: https://arxiv.org/abs/2602.03840
作者: Devroop Kar,Daniel Krutz,Travis Desell
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: Submitted to The Genetic and Evolutionary Computation Conference (GECCO) 2026. Under Review

点击查看摘要

Abstract:Designing effective quantum circuits remains a central challenge in quantum computing, as circuit structure strongly influences expressivity, trainability, and hardware feasibility. Current approaches, whether using manually designed circuit templates, fixed heuristics, or automated rules, face limitations in scalability, flexibility, and adaptability, often producing circuits that are poorly matched to the specific problem or quantum hardware. In this work, we propose the Evolutionary eXploration of Augmenting Quantum Circuits (EXAQC), an evolutionary approach to the automated design and training of parameterized quantum circuits (PQCs) which leverages and extends on strategies from neuroevolution and genetic programming. The proposed method jointly searches over gate types, qubit connectivity, parameterization, and circuit depth while respecting hardware and noise constraints. The method supports both Qiskit and Pennylane libraries, allowing the user to configure every aspect. This work highlights evolutionary search as a critical tool for advancing quantum machine learning and variational quantum algorithms, providing a principled pathway toward scalable, problem-aware, and hardware-efficient quantum circuit design. Preliminary results demonstrate that circuits evolved on classification tasks are able to achieve over 90% accuracy on most of the benchmark datasets with a limited computational budget, and are able to emulate target circuit quantum states with high fidelity scores.

[LG-1] Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL

链接: https://arxiv.org/abs/2602.03839
作者: Erfan Miahi,Eugene Belilovsky
类目: Machine Learning (cs.LG)
*备注: 32 pages, 14 figures

点击查看摘要

Abstract:Reinforcement learning (RL) is a critical component for post-training large language models (LLMs). However, in bandwidth-constrained distributed RL, scalability is often bottlenecked by the synchronization of policy weights from trainers to inference workers, particularly over commodity networks or in decentralized settings. While recent studies suggest that RL updates modify only a small fraction of model parameters, these observations are typically based on coarse checkpoint differences. We present a systematic empirical study of weight-update sparsity at both step-level and multi-step granularities, examining its evolution across training dynamics, off-policy delay, and model scale. We find that update sparsity is consistently high, frequently exceeding 99% across practically relevant settings. Leveraging this structure, we propose PULSE (Patch Updates via Lossless Sparse Encoding), a simple yet highly efficient lossless weight synchronization method that transmits only the indices and values of modified parameters. PULSE is robust to transmission errors and avoids floating-point drift inherent in additive delta schemes. In bandwidth-constrained decentralized environments, our approach achieves over 100x (14 GB to ~108 MB) communication reduction while maintaining bit-identical training dynamics and performance compared to full weight synchronization. By exploiting this structure, PULSE enables decentralized RL training to approach centralized throughput, reducing the bandwidth required for weight synchronization from 20 Gbit/s to 0.2 Gbit/s to maintain high GPU utilization.

[LG-2] Robust Intervention Learning from Emergency Stop Interventions

链接: https://arxiv.org/abs/2602.03825
作者: Ethan Pronovost,Khimya Khetarpal,Siddhartha Srinivasa
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Human interventions are a common source of data in autonomous systems during testing. These interventions provide an important signal about where the current policy needs improvement, but are often noisy and incomplete. We define Robust Intervention Learning (RIL) as the problem of learning from intervention data while remaining robust to the quality and informativeness of the intervention signal. In the best case, interventions are precise and avoiding them is sufficient to solve the task, but in many realistic settings avoiding interventions is necessary but not sufficient for achieving good performance. We study robust intervention learning in the context of emergency stop interventions and propose Residual Intervention Fine-Tuning (RIFT), a residual fine-tuning algorithm that treats intervention feedback as an incomplete learning signal and explicitly combines it with a prior policy. By framing intervention learning as a fine-tuning problem, our approach leverages structure encoded in the prior policy to resolve ambiguity when intervention signals under-specify the task. We provide theoretical analysis characterizing conditions under which this formulation yields principled policy improvement, and identify regimes where intervention learning is expected to fail. Our experiments reveal that residual fine-tuning enables robust and consistent policy improvement across a range of intervention strategies and prior policy qualities, and highlight robust intervention learning as a promising direction for future work.

[LG-3] SymPlex: A Structure-Aware Transformer for Symbolic PDE Solving

链接: https://arxiv.org/abs/2602.03816
作者: Yesom Park,Annie C. Lu,Shao-Ching Huang,Qiyang Hu,Y. Sungtaek Ju,Stanley Osher
类目: Machine Learning (cs.LG)
*备注: 27 pages

点击查看摘要

Abstract:We propose SymPlex, a reinforcement learning framework for discovering analytical symbolic solutions to partial differential equations (PDEs) without access to ground-truth expressions. SymPlex formulates symbolic PDE solving as tree-structured decision-making and optimizes candidate solutions using only the PDE and its boundary conditions. At its core is SymFormer, a structure-aware Transformer that models hierarchical symbolic dependencies via tree-relative self-attention and enforces syntactic validity through grammar-constrained autoregressive decoding, overcoming the limited expressivity of sequence-based generators. Unlike numerical and neural approaches that approximate solutions in discretized or implicit function spaces, SymPlex operates directly in symbolic expression space, enabling interpretable and human-readable solutions that naturally represent non-smooth behavior and explicit parametric dependence. Empirical results demonstrate exact recovery of non-smooth and parametric PDE solutions using deep learning-based symbolic methods.

[LG-4] Prediction of Critical Heat Flux in Rod Bundles Using Tube-Based Hybrid Machine Learning Models in CTF

链接: https://arxiv.org/abs/2602.03805
作者: Aidan Furlong,Robert Salko,Xingang Zhao,Xu Wu
类目: Machine Learning (cs.LG)
*备注: Submitted to the 2026 American Nuclear Society Annual Meeting

点击查看摘要

Abstract:The prediction of critical heat flux (CHF) using machine learning (ML) approaches has become a highly active research activity in recent years, the goal of which is to build models more accurate than current conventional approaches such as empirical correlations or lookup tables (LUTs). Previous work developed and deployed tube-based pure and hybrid ML models in the CTF subchannel code, however, full-scale reactor core simulations require the use of rod bundle geometries. Unlike isolated subchannels, rod bundles experience complex thermal hydraulic phenomena such as channel crossflow, spacer grid losses, and effects from unheated conductors. This study investigates the generalization of ML-based CHF prediction models in rod bundles after being trained on tube-based CHF data. A purely data-driven DNN and two hybrid bias-correction models were implemented in the CTF subchannel code and used to predict CHF location and magnitude in the Combustion Engineering 5-by-5 bundle CHF test series. The W-3 correlation, Bowring correlation, and Groeneveld LUT were used as baseline comparators. On average, all three ML-based approaches produced magnitude and location predictions more accurate than the baseline models, with the hybrid LUT model exhibiting the most favorable performance metrics.

[LG-5] Manifold Random Features

链接: https://arxiv.org/abs/2602.03797
作者: Ananya Parashar,Derek Long,Dwaipayan Saha,Krzysztof Choromanski
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a new paradigm for creating random features to approximate bi-variate functions (in particular, kernels) defined on general manifolds. This new mechanism of Manifold Random Features (MRFs) leverages discretization of the manifold and the recently introduced technique of Graph Random Features (GRFs) to learn continuous fields on manifolds. Those fields are used to find continuous approximation mechanisms that otherwise, in general scenarios, cannot be derived analytically. MRFs provide positive and bounded features, a key property for accurate, low-variance approximation. We show deep asymptotic connection between GRFs, defined on discrete graph objects, and continuous random features used for regular kernels. As a by-product of our method, we re-discover recently introduced mechanism of Gaussian kernel approximation applied in particular to improve linear-attention Transformers, considering simple random walks on graphs and by-passing original complex mathematical computations. We complement our algorithm with a rigorous theoretical analysis and verify in thorough experimental studies.

[LG-6] Should I use Synthetic Data for That? An Analysis of the Suitability of Synthetic Data for Data Sharing and Augmentation

链接: https://arxiv.org/abs/2602.03791
作者: Bogdan Kulynych,Theresa Stadler,Jean Louis Raisaro,Carmela Troncoso
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: BK and TS contributed equally

点击查看摘要

Abstract:Recent advances in generative modelling have led many to see synthetic data as the go-to solution for a range of problems around data access, scarcity, and under-representation. In this paper, we study three prominent use cases: (1) Sharing synthetic data as a proxy for proprietary datasets to enable statistical analyses while protecting privacy, (2) Augmenting machine learning training sets with synthetic data to improve model performance, and (3) Augmenting datasets with synthetic data to reduce variance in statistical estimation. For each use case, we formalise the problem setting and study, through formal analysis and case studies, under which conditions synthetic data can achieve its intended objectives. We identify fundamental and practical limits that constrain when synthetic data can serve as an effective solution for a particular problem. Our analysis reveals that due to these limits many existing or envisioned use cases of synthetic data are a poor problem fit. Our formalisations and classification of synthetic data use cases enable decision makers to assess whether synthetic data is a suitable approach for their specific data availability problem.

[LG-7] Inference-time Unlearning Using Conformal Prediction

链接: https://arxiv.org/abs/2602.03787
作者: Somnath Basu Roy Chowdhury,Rahul Kidambi,Avinava Dubey,David Wang,Gokhan Mergen,Amr Ahmed,Aranyak Mehta
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine unlearning is the process of efficiently removing specific information from a trained machine learning model without retraining from scratch. Existing unlearning methods, which often provide provable guarantees, typically involve retraining a subset of model parameters based on a forget set. While these approaches show promise in certain scenarios, their underlying assumptions are often challenged in real-world applications – particularly when applied to generative models. Furthermore, updating parameters using these unlearning procedures often degrades the general-purpose capabilities the model acquired during pre-training. Motivated by these shortcomings, this paper considers the paradigm of inference time unlearning – wherein, the generative model is equipped with an (approximately correct) verifier that judges whether the model’s response satisfies appropriate unlearning guarantees. This paper introduces a framework that iteratively refines the quality of the generated responses using feedback from the verifier without updating the model parameters. The proposed framework leverages conformal prediction to reduce computational overhead and provide distribution-free unlearning guarantees. This paper’s approach significantly outperforms existing state-of-the-art methods, reducing unlearning error by up to 93% across challenging unlearning benchmarks.

[LG-8] Reasoning Cache: Continual Improvement Over Long Horizons via Short-Horizon RL

链接: https://arxiv.org/abs/2602.03773
作者: Ian Wu,Yuxiao Qu,Amrith Setlur,Aviral Kumar
类目: Machine Learning (cs.LG)
*备注: preprint

点击查看摘要

[LG-9] Reasoning with Latent Tokens in Diffusion Language Models

链接: https://arxiv.org/abs/2602.03769
作者: Andre He,Sean Welleck,Daniel Fried
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Discrete diffusion models have recently become competitive with autoregressive models for language modeling, even outperforming them on reasoning tasks requiring planning and global coherence, but they require more computation at inference time. We trace this trade-off to a key mechanism: diffusion models are trained to jointly predict a distribution over all unknown tokens, including those that will not actually be decoded in the current step. Ablating this joint prediction yields faster inference but degrades performance, revealing that accurate prediction at the decoded position relies on joint reasoning about the distribution of undecoded tokens. We interpret these as latent tokens and introduce a method for modulating their number, demonstrating empirically that this enables a smooth tradeoff between inference speed and sample quality. Furthermore, we demonstrate that latent tokens can be introduced into autoregressive models through an auxiliary multi-token prediction objective, yielding substantial improvements on the same reasoning tasks where they have traditionally struggled. Our results suggest that latent tokens, while arising naturally in diffusion, represent a general mechanism for improving performance on tasks requiring global coherence or lookahead.

[LG-10] Soft Sensor for Bottom-Hole Pressure Estimation in Petroleum Wells Using Long Short-Term Memory and Transfer Learning

链接: https://arxiv.org/abs/2602.03737
作者: M. A. Fernandes,E. Gildin,M. A. Sampaio
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Monitoring bottom-hole variables in petroleum wells is essential for production optimization, safety, and emissions reduction. Permanent Downhole Gauges (PDGs) provide real-time pressure data but face reliability and cost issues. We propose a machine learning-based soft sensor to estimate flowing Bottom-Hole Pressure (BHP) using wellhead and topside measurements. A Long Short-Term Memory (LSTM) model is introduced and compared with Multi-Layer Perceptron (MLP) and Ridge Regression. We also pioneer Transfer Learning for adapting models across operational environments. Tested on real offshore datasets from Brazil’s Pre-salt basin, the methodology achieved Mean Absolute Percentage Error (MAPE) consistently below 2%, outperforming benchmarks. This work offers a cost-effective, accurate alternative to physical sensors, with broad applicability across diverse reservoir and flow conditions.

[LG-11] Fast-MWEM: Private Data Release in Sublinear Time

链接: https://arxiv.org/abs/2602.03732
作者: Themistoklis Haris,Steve Choi,Mutiraj Laksanawisit
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Multiplicative Weights Exponential Mechanism (MWEM) is a fundamental iterative framework for private data analysis, with broad applications such as answering m linear queries, or privately solving systems of m linear constraints. However, a critical bottleneck hindering its scalability is the \Theta(m) time complexity required to execute the exponential mechanism in each iteration. We introduce a modification to the MWEM framework that improves the per-iteration runtime dependency to \Theta(\sqrtm) in expectation. This is done via a lazy sampling approach to the Report-Noisy-Max mechanism, which we implement efficiently using Gumbel noise and a k -Nearest Neighbor data structure. This allows for the rapid selection of the approximate score in the exponential mechanism without an exhaustive linear scan. We apply our accelerated framework to the problems of private linear query release and solving Linear Programs (LPs) under neighboring constraint conditions and low-sensitivity assumptions. Experimental evaluation confirms that our method provides a substantial runtime improvement over classic MWEM.

[LG-12] Efficient Training of Boltzmann Generators Using Off-Policy Log-Dispersion Regularization

链接: https://arxiv.org/abs/2602.03729
作者: Henrik Schopmans,Christopher von Klitzing,Pascal Friederich
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sampling from unnormalized probability densities is a central challenge in computational science. Boltzmann generators are generative models that enable independent sampling from the Boltzmann distribution of physical systems at a given temperature. However, their practical success depends on data-efficient training, as both simulation data and target energy evaluations are costly. To this end, we propose off-policy log-dispersion regularization (LDR), a novel regularization framework that builds on a generalization of the log-variance objective. We apply LDR in the off-policy setting in combination with standard data-based training objectives, without requiring additional on-policy samples. LDR acts as a shape regularizer of the energy landscape by leveraging additional information in the form of target energy labels. The proposed regularization framework is broadly applicable, supporting unbiased or biased simulation datasets as well as purely variational training without access to target samples. Across all benchmarks, LDR improves both final performance and data efficiency, with sample efficiency gains of up to one order of magnitude.

[LG-13] Data-Driven Graph Filters via Adaptive Spectral Shaping

链接: https://arxiv.org/abs/2602.03698
作者: Dylan Sandfelder,Mihai Cucuringu,Xiaowen Dong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Adaptive Spectral Shaping, a data-driven framework for graph filtering that learns a reusable baseline spectral kernel and modulates it with a small set of Gaussian factors. The resulting multi-peak, multi-scale responses allocate energy to heterogeneous regions of the Laplacian spectrum while remaining interpretable via explicit centers and bandwidths. To scale, we implement filters with Chebyshev polynomial expansions, avoiding eigendecompositions. We further propose Transferable Adaptive Spectral Shaping (TASS): the baseline kernel is learned on source graphs and, on a target graph, kept fixed while only the shaping parameters are adapted, enabling few-shot transfer under matched compute. Across controlled synthetic benchmarks spanning graph families and signal regimes, Adaptive Spectral Shaping reduces reconstruction error relative to fixed-prototype wavelets and learned linear banks, and TASS yields consistent positive transfer. The framework provides compact spectral modules that plug into graph signal processing pipelines and graph neural networks, combining scalability, interpretability, and cross-graph generalization.

[LG-14] Sequential Group Composition: A Window into the Mechanics of Deep Learning

链接: https://arxiv.org/abs/2602.03655
作者: Giovanni Luca Marchetti,Daniel Kunin,Adele Myers,Francisco Acosta,Nina Miolane
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:How do neural networks trained over sequences acquire the ability to perform structured operations, such as arithmetic, geometric, and algorithmic computation? To gain insight into this question, we introduce the sequential group composition task. In this task, networks receive a sequence of elements from a finite group encoded in a real vector space and must predict their cumulative product. The task can be order-sensitive and requires a nonlinear architecture to be learned. Our analysis isolates the roles of the group structure, encoding statistics, and sequence length in shaping learning. We prove that two-layer networks learn this task one irreducible representation of the group at a time in an order determined by the Fourier statistics of the encoding. These networks can perfectly learn the task, but doing so requires a hidden width exponential in the sequence length k . In contrast, we show how deeper models exploit the associativity of the task to dramatically improve this scaling: recurrent neural networks compose elements sequentially in k steps, while multilayer networks compose adjacent pairs in parallel in \log k layers. Overall, the sequential group composition task offers a tractable window into the mechanics of deep learning.

[LG-15] Reinforcement Fine-Tuning for History-Aware Dense Retriever in RAG

链接: https://arxiv.org/abs/2602.03645
作者: Yicheng Zhang,Zhen Qin,Zhaomin Wu,Wenqi Zhang,Shuiguang Deng
类目: Machine Learning (cs.LG)
*备注: On going work. Codes are released at this https URL

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) enables large language models (LLMs) to produce evidence-based responses, and its performance hinges on the matching between the retriever and LLMs. Retriever optimization has emerged as an efficient alternative to fine-tuning LLMs. However, existing solutions suffer from objective mismatch between retriever optimization and the goal of RAG pipeline. Reinforcement learning (RL) provides a promising solution to address this limitation, yet applying RL to retriever optimization introduces two fundamental challenges: 1) the deterministic retrieval is incompatible with RL formulations, and 2) state aliasing arises from query-only retrieval in multi-hop reasoning. To address these challenges, we replace deterministic retrieval with stochastic sampling and formulate RAG as a Markov decision process, making retriever optimizable by RL. Further, we incorporate retrieval history into the state at each retrieval step to mitigate state aliasing. Extensive experiments across diverse RAG pipelines, datasets, and retriever scales demonstrate consistent improvements of our approach in RAG performance.

[LG-16] CTTVAE: Latent Space Structuring for Conditional Tabular Data Generation on Imbalanced Datasets

链接: https://arxiv.org/abs/2602.03641
作者: Milosh Devic,Jordan Gierschendorf,David Garson
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generating synthetic tabular data under severe class imbalance is essential for domains where rare but high-impact events drive decision-making. However, most generative models either overlook minority groups or fail to produce samples that are useful for downstream learning. We introduce CTTVAE, a Conditional Transformer-based Tabular Variational Autoencoder equipped with two complementary mechanisms: (i) a class-aware triplet margin loss that restructures the latent space for sharper intra-class compactness and inter-class separation, and (ii) a training-by-sampling strategy that adaptively increases exposure to underrepresented groups. Together, these components form CTTVAE+TBS, a framework that consistently yields more representative and utility-aligned samples without destabilizing training. Across six real-world benchmarks, CTTVAE+TBS achieves the strongest downstream utility on minority classes, often surpassing models trained on the original imbalanced data while maintaining competitive fidelity and bridging the gap for privacy for interpolation-based sampling methods and deep generative methods. Ablation studies further confirm that both latent structuring and targeted sampling contribute to these gains. By explicitly prioritizing downstream performance in rare categories, CTTVAE+TBS provides a robust and interpretable solution for conditional tabular data generation, with direct applicability to industries such as healthcare, fraud detection, and predictive maintenance where even small gains in minority cases can be critical.

[LG-17] Ultra Fast PDE Solving via Physics Guided Few-step Diffusion

链接: https://arxiv.org/abs/2602.03627
作者: Cindy Xiangrui Kong,Yueqi Wang,Haoyang Zheng,Weijian Luo,Guang Lin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion-based models have demonstrated impressive accuracy and generalization in solving partial differential equations (PDEs). However, they still face significant limitations, such as high sampling costs and insufficient physical consistency, stemming from their many-step iterative sampling mechanism and lack of explicit physics constraints. To address these issues, we propose Phys-Instruct, a novel physics-guided distillation framework which not only (1) compresses a pre-trained diffusion PDE solver into a few-step generator via matching generator and prior diffusion distributions to enable rapid sampling, but also (2) enhances the physics consistency by explicitly injecting PDE knowledge through a PDE distillation guidance. Physic-Instruct is built upon a solid theoretical foundation, leading to a practical physics-constrained training objective that admits tractable gradients. Across five PDE benchmarks, Phys-Instruct achieves orders-of-magnitude faster inference while reducing PDE error by more than 8 times compared to state-of-the-art diffusion baselines. Moreover, the resulting unconditional student model functions as a compact prior, enabling efficient and physically consistent inference for various downstream conditional tasks. Our results indicate that Phys-Instruct is a novel, effective, and efficient framework for ultra-fast PDE solving powered by deep generative models.

[LG-18] Quantization-Aware Regularizers for Deep Neural Networks Compression

链接: https://arxiv.org/abs/2602.03614
作者: Dario Malchiodi,Mattia Ferraretto,Marco Frasca
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep Neural Networks reached state-of-the-art performance across numerous domains, but this progress has come at the cost of increasingly large and over-parameterized models, posing serious challenges for deployment on resource-constrained devices. As a result, model compression has become essential, and – among compression techniques – weight quantization is largely used and particularly effective, yet it typically introduces a non-negligible accuracy drop. However, it is usually applied to already trained models, without influencing how the parameter space is explored during the learning phase. In contrast, we introduce per-layer regularization terms that drive weights to naturally form clusters during training, integrating quantization awareness directly into the optimization process. This reduces the accuracy loss typically associated with quantization methods while preserving their compression potential. Furthermore, in our framework quantization representatives become network parameters, marking, to the best of our knowledge, the first approach to embed quantization parameters directly into the backpropagation procedure. Experiments on CIFAR-10 with AlexNet and VGG16 models confirm the effectiveness of the proposed strategy.

[LG-19] Explanations Leak: Membership Inference with Differential Privacy and Active Learning Defense

链接: https://arxiv.org/abs/2602.03611
作者: Fatima Ezzeddine,Osama Zammar,Silvia Giordano,Omran Ayoub
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Counterfactual explanations (CFs) are increasingly integrated into Machine Learning as a Service (MLaaS) systems to improve transparency; however, ML models deployed via APIs are already vulnerable to privacy attacks such as membership inference and model extraction, and the impact of explanations on this threat landscape remains insufficiently understood. In this work, we focus on the problem of how CFs expand the attack surface of MLaaS by strengthening membership inference attacks (MIAs), and on the need to design defense mechanisms that mitigate this emerging risk without undermining utility and explainability. First, we systematically analyze how exposing CFs through query-based APIs enables more effective shadow-based MIAs. Second, we propose a defense framework that integrates Differential Privacy (DP) with Active Learning (AL) to jointly reduce memorization and limit effective training data exposure. Finally, we conduct an extensive empirical evaluation to characterize the three-way trade-off between privacy leakage, predictive performance, and explanation quality. Our findings highlight the need to carefully balance transparency, utility, and privacy in the responsible deployment of explainable MLaaS systems.

[LG-20] SAGE-5GC: Security-Aware Guidelines for Evaluating Anomaly Detection in the 5G Core Network

链接: https://arxiv.org/abs/2602.03596
作者: Cristian Manca,Christian Scano,Giorgio Piras,Fabio Brau,Maura Pintor,Battista Biggio
类目: Machine Learning (cs.LG)
*备注: ITASEC-2026

点击查看摘要

Abstract:Machine learning-based anomaly detection systems are increasingly being adopted in 5G Core networks to monitor complex, high-volume traffic. However, most existing approaches are evaluated under strong assumptions that rarely hold in operational environments, notably the availability of independent and identically distributed (IID) data and the absence of adaptive this http URL this work, we study the problem of detecting 5G attacks \textitin the wild, focusing on realistic deployment settings. We propose a set of Security-Aware Guidelines for Evaluating anomaly detectors in 5G Core Network (SAGE-5GC), driven by domain knowledge and consideration of potential adversarial threats. Using a realistic 5G Core dataset, we first train several anomaly detectors and assess their baseline performance against standard 5GC control-plane cyberattacks targeting PFCP-based network this http URL then extend the evaluation to adversarial settings, where an attacker tries to manipulate the observable features of the network traffic to evade detection, under the constraint that the intended functionality of the malicious traffic is preserved. Starting from a selected set of controllable features, we analyze model sensitivity and adversarial robustness through randomized perturbations. Finally, we introduce a practical optimization strategy based on genetic algorithms that operates exclusively on attacker-controllable features and does not require prior knowledge of the underlying detection model. Our experimental results show that adversarially crafted attacks can substantially degrade detection performance, underscoring the need for robust, security-aware evaluation methodologies for anomaly detection in 5G networks deployed in the wild.

[LG-21] Optimization and Generation in Aerodynamics Inverse Design

链接: https://arxiv.org/abs/2602.03582
作者: Huaguan Chen,Ning Lin,Luxi Chen,Rui Zhang,Wenbing Huang,Chongxuan Li,Hao Sun
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Inverse design with physics-based objectives is challenging because it couples high-dimensional geometry with expensive simulations, as exemplified by aerodynamic shape optimization for drag reduction. We revisit inverse design through two canonical solutions, the optimal design point and the optimal design distribution, and relate them to optimization and guided generation. Building on this view, we propose a new training loss for cost predictors and a density-gradient optimization method that improves objectives while preserving plausible shapes. We further unify existing training-free guided generation methods. To address their inability to approximate conditional covariance in high dimensions, we develop a time- and memory-efficient algorithm for approximate covariance estimation. Experiments on a controlled 2D study and high-fidelity 3D aerodynamic benchmarks (car and aircraft), validated by OpenFOAM simulations and miniature wind-tunnel tests with 3D-printed prototypes, demonstrate consistent gains in both optimization and guided generation. Additional offline RL results further support the generality of our approach.

[LG-22] Asymmetric Hierarchical Anchoring for Audio-Visual Joint Representation: Resolving Information Allocation Ambiguity for Robust Cross-Modal Generalization

链接: https://arxiv.org/abs/2602.03570
作者: Bixing Wu,Yuhong Zhao,Zongli Ye,Jiachen Lian,Xiangyu Yue,Gopala Anumanchipalli
类目: Machine Learning (cs.LG)
*备注: 18 pages, 11 figures

点击查看摘要

Abstract:Audio-visual joint representation learning under Cross-Modal Generalization (CMG) aims to transfer knowledge from a labeled source modality to an unlabeled target modality through a unified discrete representation space. Existing symmetric frameworks often suffer from information allocation ambiguity, where the absence of structural inductive bias leads to semantic-specific leakage across modalities. We propose Asymmetric Hierarchical Anchoring (AHA), which enforces directional information allocation by designating a structured semantic anchor within a shared hierarchy. In our instantiation, we exploit the hierarchical discrete representations induced by audio Residual Vector Quantization (RVQ) to guide video feature distillation into a shared semantic space. To ensure representational purity, we replace fragile mutual information estimators with a GRL-based adversarial decoupler that explicitly suppresses semantic leakage in modality-specific branches, and introduce Local Sliding Alignment (LSA) to encourage fine-grained temporal alignment across modalities. Extensive experiments on AVE and AVVP benchmarks demonstrate that AHA consistently outperforms symmetric baselines in cross-modal transfer. Additional analyses on talking-face disentanglement experiment further validate that the learned representations exhibit improved semantic consistency and disentanglement, indicating the broader applicability of the proposed framework.

[LG-23] Riemannian Neural Optimal Transport

链接: https://arxiv.org/abs/2602.03566
作者: Alessandro Micheli,Yueqi Cao,Anthea Monod,Samir Bhatt
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 58 pages

点击查看摘要

Abstract:Computational optimal transport (OT) offers a principled framework for generative modeling. Neural OT methods, which use neural networks to learn an OT map (or potential) from data in an amortized way, can be evaluated out of sample after training, but existing approaches are tailored to Euclidean geometry. Extending neural OT to high-dimensional Riemannian manifolds remains an open challenge. In this paper, we prove that any method for OT on manifolds that produces discrete approximations of transport maps necessarily suffers from the curse of dimensionality: achieving a fixed accuracy requires a number of parameters that grows exponentially with the manifold dimension. Motivated by this limitation, we introduce Riemannian Neural OT (RNOT) maps, which are continuous neural-network parameterizations of OT maps on manifolds that avoid discretization and incorporate geometric structure by construction. Under mild regularity assumptions, we prove that RNOT maps approximate Riemannian OT maps with sub-exponential complexity in the dimension. Experiments on synthetic and real datasets demonstrate improved scalability and competitive performance relative to discretization-based baselines.

[LG-24] CoGenCast: A Coupled Autoregressive-Flow Generative Framework for Time Series Forecasting

链接: https://arxiv.org/abs/2602.03564
作者: Yaguo Liu,Mingyue Cheng,Daoyu Wang,Xiaoyu Tao,Qi Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time series forecasting can be viewed as a generative problem that requires both semantic understanding over contextual conditions and stochastic modeling of continuous temporal dynamics. Existing approaches typically rely on either autoregressive large language models (LLMs) for semantic context modeling or diffusion-like models for continuous probabilistic generation. However, neither method alone can adequately model both aspects simultaneously. In this work, we propose CoGenCast, a hybrid generative framework that couples pre-trained LLMs with flow-matching mechanism for effective time series forecasting. Specifically, we reconfigure pre-trained decoder-only LLMs into a native forecasting encoder-decoder backbone by modifying only the attention topology, enabling bidirectional context encoding and causal representation generation. Building on this, a flow-matching mechanism is further integrated to model temporal evolution, capturing continuous stochastic dynamics conditioned on the autoregressively generated representation. Notably, CoGenCast naturally supports multimodal forecasting and cross-domain unified training. Extensive experiments on multiple benchmarks show that CoGenCast consistently outperforms previous compared baselines. Code is available at this https URL.

[LG-25] NPCNet: Navigator-Driven Pseudo Text for Deep Clustering of Early Sepsis Phenotyping

链接: https://arxiv.org/abs/2602.03562
作者: Pi-Ju Tsai,Charkkri Limbud,Kuan-Fu Chen,Yi-Ju Tseng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sepsis is a heterogeneous syndrome. Identifying clinically distinct phenotypes may enable more precise treatment strategies. In recent years, many researchers have applied clustering algorithms to sepsis patients. However, the clustering process rarely incorporates clinical relevance, potentially limiting to reflect clinically distinct phenotypes. We propose NPCNet, a novel deep clustering network with a target navigator that integrates temporal Electronic Health Records (EHRs) to better align sepsis phenotypes with clinical significance. We identify four sepsis phenotypes ( \alpha , \beta , \gamma , and \delta ) with divergence in SOFA trajectories. Notably, while \alpha and \delta phenotypes both show severe conditions in the early stage, NPCNet effectively differentiates patients who are likely to improve ( \alpha ) from those at risk of deterioration ( \delta ). Furthermore, through the treatment effect analysis, we discover that \alpha , \beta , and \delta phenotypes may benefit from early vasopressor administration. The results show that NPCNet enhances precision treatment strategies by uncovering clinically distinct phenotypes.

[LG-26] How to Train Your Resistive Network: Generalized Equilibrium Propagation and Analytical Learning

链接: https://arxiv.org/abs/2602.03546
作者: Jonathan Lin,Aman Desai,Frank Barrows,Francesco Caravelli
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Soft Condensed Matter (cond-mat.soft); Emerging Technologies (cs.ET)
*备注: 8 pages double column; plus 16 supp mat.;

点击查看摘要

Abstract:Machine learning is a powerful method of extracting meaning from data; unfortunately, current digital hardware is extremely energy-intensive. There is interest in an alternative analog computing implementation that could match the performance of traditional machine learning while being significantly more energy-efficient. However, it remains unclear how to train such analog computing systems while adhering to locality constraints imposed by the physical (as opposed to digital) nature of these systems. Local learning algorithms such as Equilibrium Propagation and Coupled Learning have been proposed to address this issue. In this paper, we develop an algorithm to exactly calculate gradients using a graph theoretic and analytical framework for Kirchhoff’s laws. We also introduce Generalized Equilibrium Propagation, a framework encompassing a broad class of Hebbian learning algorithms, including Coupled Learning and Equilibrium Propagation, and show how our algorithm compares. We demonstrate our algorithm using numerical simulations and show that we can train resistor networks without the need for a replica or readout over all resistors, only at the output layer. We also show that under the analytical gradient approach, it is possible to update only a subset of the resistance values without a strong degradation in performance.

[LG-27] MatGPT Q: Accurate and Efficient Post-Training Matryoshka Quantization

链接: https://arxiv.org/abs/2602.03537
作者: Maximilian Kleinegger,Elvir Crnčević,Dan Alistarh
类目: Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Matryoshka Quantization (MatQuant) is a recent quantization approach showing that a single integer-quantized model can be served across multiple precisions, by slicing the most significant bits (MSB) at inference time. This enables a single checkpoint to cover a wide range of memory and latency budgets, but renders quantization much more challenging. In particular, the initial MatQuant relies on expensive quantization-aware training (QAT) variants, rather than fast one-shot post training quantization (PTQ), and lacks open-source and kernel support. We address all of these limitations by introducing Post-Training Matryoshka Quantization (MatGPTQ), a new PTQ pipeline that produces a single parent model jointly optimized for multiple target precisions in one-shot, based on a small calibration set. MatGPTQ casts Matryoshka quantization as a multi-precision objective with bit-slicing and cross-bit error compensation, resulting in an algorithm that produces a multi-bit-width, “sliceable” model in a single pass. We also incorporate a new budget-aware search for heterogeneous per-layer bit-witdhs and provide efficient kernels that implement slicing and mixed-precision execution. Across standard LLMs and benchmarks, MatGPTQ preserves high-bit accuracy while substantially improving performance at low-bit-witdh settings. Overall, we establish a new state of the art for Matryoshka-style post-training quantization and make single-checkpoint, multi-precision deployment open and practical. Code is available at this https URL.

[LG-28] Sparse Training of Neural Networks based on Multilevel Mirror Descent

链接: https://arxiv.org/abs/2602.03535
作者: Yannick Lunk,Sebastian J. Scott,Leon Bungert
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We introduce a dynamic sparse training algorithm based on linearized Bregman iterations / mirror descent that exploits the naturally incurred sparsity by alternating between periods of static and dynamic sparsity pattern updates. The key idea is to combine sparsity-inducing Bregman iterations with adaptive freezing of the network structure to enable efficient exploration of the sparse parameter space while maintaining sparsity. We provide convergence guaranties by embedding our method in a multilevel optimization framework. Furthermore, we empirically show that our algorithm can produce highly sparse and accurate models on standard benchmarks. We also show that the theoretical number of FLOPs compared to SGD training can be reduced from 38% for standard Bregman iterations to 6% for our method while maintaining test accuracy.

[LG-29] WARP Logic Neural Networks

链接: https://arxiv.org/abs/2602.03527
作者: Lino Gerlach,Thore Gerlach,Liv Våge,Elliott Kauffman,Isobel Ojalvo
类目: Machine Learning (cs.LG)
*备注: Under review

点击查看摘要

Abstract:Fast and efficient AI inference is increasingly important, and recent models that directly learn low-level logic operations have achieved state-of-the-art performance. However, existing logic neural networks incur high training costs, introduce redundancy or rely on approximate gradients, which limits scalability. To overcome these limitations, we introduce WAlsh Relaxation for Probabilistic (WARP) logic neural networks – a novel gradient-based framework that efficiently learns combinations of hardware-native logic blocks. We show that WARP yields the most parameter-efficient representation for exactly learning Boolean functions and that several prior approaches arise as restricted special cases. Training is improved by introducing learnable thresholding and residual initialization, while we bridge the gap between relaxed training and discrete logic inference through stochastic smoothing. Experiments demonstrate faster convergence than state-of-the-art baselines, while scaling effectively to deeper architectures and logic functions with higher input arity.

[LG-30] Rank-Learner: Orthogonal Ranking of Treatment Effects

链接: https://arxiv.org/abs/2602.03517
作者: Henri Arno,Dennis Frauen,Emil Javurek,Thomas Demeester,Stefan Feuerriegel
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many decision-making problems require ranking individuals by their treatment effects rather than estimating the exact effect magnitudes. Examples include prioritizing patients for preventive care interventions, or ranking customers by the expected incremental impact of an advertisement. Surprisingly, while causal effect estimation has received substantial attention in the literature, the problem of directly learning rankings of treatment effects has largely remained unexplored. In this paper, we introduce Rank-Learner, a novel two-stage learner that directly learns the ranking of treatment effects from observational data. We first show that naive approaches based on precise treatment effect estimation solve a harder problem than necessary for ranking, while our Rank-Learner optimizes a pairwise learning objective that recovers the true treatment effect ordering, without explicit CATE estimation. We further show that our Rank-Learner is Neyman-orthogonal and thus comes with strong theoretical guarantees, including robustness to estimation errors in the nuisance functions. In addition, our Rank-Learner is model-agnostic, and can be instantiated with arbitrary machine learning models (e.g., neural networks). We demonstrate the effectiveness of our method through extensive experiments where Rank-Learner consistently outperforms standard CATE estimators and non-orthogonal ranking methods. Overall, we provide practitioners with a new, orthogonal two-stage learner for ranking individuals by their treatment effects.

[LG-31] A Function-Space Stability Boundary for Generalization in Interpolating Learning Systems

链接: https://arxiv.org/abs/2602.03514
作者: Ronald Katende
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 10 pages, 8 figures,

点击查看摘要

Abstract:Modern learning systems often interpolate training data while still generalizing well, yet it remains unclear when algorithmic stability explains this behavior. We model training as a function-space trajectory and measure sensitivity to single-sample perturbations along this trajectory. We propose a contractive propagation condition and a stability certificate obtained by unrolling the resulting recursion. A small certificate implies stability-based generalization, while we also prove that there exist interpolating regimes with small risk where such contractive sensitivity cannot hold, showing that stability is not a universal explanation. Experiments confirm that certificate growth predicts generalization differences across optimizers, step sizes, and dataset perturbations. The framework therefore identifies regimes where stability explains generalization and where alternative mechanisms must account for success. Comments: 10 pages, 8 figures, Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML) MSC classes: 68Q32, 68T05, 62G05, 90C25 Cite as: arXiv:2602.03514 [cs.LG] (or arXiv:2602.03514v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.03514 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-32] Lookahead Path Likelihood Optimization for Diffusion LLM s

链接: https://arxiv.org/abs/2602.03496
作者: Xuejie Liu,Yap Vit Chun,Yitao Liang,Anji Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion Large Language Models (dLLMs) support arbitrary-order generation, yet their inference performance critically depends on the unmasking order. Existing strategies rely on heuristics that greedily optimize local confidence, offering limited guidance for identifying unmasking paths that are globally consistent and accurate. To bridge this gap, we introduce path log-likelihood (Path LL), a trajectory-conditioned objective that strongly correlates with downstream accuracy and enables principled selection of unmasking paths. To optimize Path LL at inference time, we propose POKE, an efficient value estimator that predicts the expected future Path LL of a partial decoding trajectory. We then integrate this lookahead signal into POKE-SMC, a Sequential Monte Carlo-based search framework for dynamically identifying optimal unmasking paths. Extensive experiments across 6 reasoning tasks show that POKE-SMC consistently improves accuracy, achieving 2%–3% average gains over strong decoding-time scaling baselines at comparable inference overhead on LLaDA models and advancing the accuracy–compute Pareto frontier.

[LG-33] DALI: A Workload-Aware Offloading Framework for Efficient MoE Inference on Local PCs

链接: https://arxiv.org/abs/2602.03495
作者: Zeyu Zhu,Gang Li,Peisong Wang,Zitao Mo,Minnan Pei,Zhuoran Song,Xiaoyao Liang,Jian Cheng
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mixture of Experts (MoE) architectures significantly enhance the capacity of LLMs without proportional increases in computation, but at the cost of a vast parameter size. Offloading MoE expert parameters to host memory and leveraging both CPU and GPU computation has recently emerged as a promising direction to support such models on resourceconstrained local PC platforms. While promising, we notice that existing approaches mismatch the dynamic nature of expert workloads, which leads to three fundamental inefficiencies: (1) Static expert assignment causes severe CPUGPU load imbalance, underutilizing CPU and GPU resources; (2) Existing prefetching techniques fail to accurately predict high-workload experts, leading to costly inaccurate prefetches; (3) GPU cache policies neglect workload dynamics, resulting in poor hit rates and limited effectiveness. To address these challenges, we propose DALI, a workloaDAware offLoadIng framework for efficient MoE inference on local PCs. To fully utilize hardware resources, DALI first dynamically assigns experts to CPU or GPU by modeling assignment as a 0-1 integer optimization problem and solving it efficiently using a Greedy Assignment strategy at runtime. To improve prefetching accuracy, we develop a Residual-Based Prefetching method leveraging inter-layer residual information to accurately predict high-workload experts. Additionally, we introduce a Workload-Aware Cache Replacement policy that exploits temporal correlation in expert activations to improve GPU cache efficiency. By evaluating across various MoE models and settings, DALI achieves significant speedups in the both prefill and decoding phases over the state-of-the-art offloading frameworks.

[LG-34] Least but not Last: Fine-tuning Intermediate Principal Components for Better Performance-Forgetting Trade-Offs

链接: https://arxiv.org/abs/2602.03493
作者: Alessio Quercia,Arya Bangun,Ira Assent,Hanno Scharr
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) methods have emerged as crucial techniques for adapting large pre-trained models to downstream tasks under computational and memory constraints. However, they face a fundamental challenge in balancing task-specific performance gains against catastrophic forgetting of pre-trained knowledge, where existing methods provide inconsistent recommendations. This paper presents a comprehensive analysis of the performance-forgetting trade-offs inherent in low-rank adaptation using principal components as initialization. Our investigation reveals that fine-tuning intermediate components leads to better balance and show more robustness to high learning rates than first (PiSSA) and last (MiLoRA) components in existing work. Building on these findings, we provide a practical approach for initialization of LoRA that offers superior trade-offs. We demonstrate in a thorough empirical study on a variety of computer vision and NLP tasks that our approach improves accuracy and reduces forgetting, also in continual learning scenarios.

[LG-35] A Minimal Task Reveals Emergent Path Integration and Object-Location Binding in a Predictive Sequence Model

链接: https://arxiv.org/abs/2602.03490
作者: Linda Ariel Ventura,Victoria Bosch,Tim C Kietzmann,Sushrut Thorat
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 7 pages, 4 figures

点击查看摘要

Abstract:Adaptive cognition requires structured internal models representing objects and their relations. Predictive neural networks are often proposed to form such “world models”, yet their underlying mechanisms remain unclear. One hypothesis is that action-conditioned sequential prediction suffices for learning such world models. In this work, we investigate this possibility in a minimal in-silico setting. Sequentially sampling tokens from 2D continuous token scenes, a recurrent neural network is trained to predict the upcoming token from current input and a saccade-like displacement. On novel scenes, prediction accuracy improves across the sequence, indicating in-context learning. Decoding analyses reveal path integration and dynamic binding of token identity to position. Interventional analyses show that new bindings can be learned late in sequence and that out-of-distribution bindings can be learned. Together, these results demonstrate how structured representations that rely on flexible binding emerge to support prediction, offering a mechanistic account of sequential world modeling relevant to cognitive science.

[LG-36] Soft-Radial Projection for Constrained End-to-End Learning

链接: https://arxiv.org/abs/2602.03461
作者: Philipp J. Schneider,Daniel Kuhn
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Computational Finance (q-fin.CP); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Integrating hard constraints into deep learning is essential for safety-critical systems. Yet existing constructive layers that project predictions onto constraint boundaries face a fundamental bottleneck: gradient saturation. By collapsing exterior points onto lower-dimensional surfaces, standard orthogonal projections induce rank-deficient Jacobians, which nullify gradients orthogonal to active constraints and hinder optimization. We introduce Soft-Radial Projection, a differentiable reparameterization layer that circumvents this issue through a radial mapping from Euclidean space into the interior of the feasible set. This construction guarantees strict feasibility while preserving a full-rank Jacobian almost everywhere, thereby preventing the optimization stalls typical of boundary-based methods. We theoretically prove that the architecture retains the universal approximation property and empirically show improved convergence behavior and solution quality over state-of-the-art optimization- and projection-based baselines.

[LG-37] Causal Inference on Networks under Misspecified Exposure Mappings: A Partial Identification Framework

链接: https://arxiv.org/abs/2602.03459
作者: Maresa Schröder,Miruna Oprescu,Stefan Feuerriegel,Nathan Kallus
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Estimating treatment effects in networks is challenging, as each potential outcome depends on the treatments of all other nodes in the network. To overcome this difficulty, existing methods typically impose an exposure mapping that compresses the treatment assignments in the network into a low-dimensional summary. However, if this mapping is misspecified, standard estimators for direct and spillover effects can be severely biased. We propose a novel partial identification framework for causal inference on networks to assess the robustness of treatment effects under misspecifications of the exposure mapping. Specifically, we derive sharp upper and lower bounds on direct and spillover effects under such misspecifications. As such, our framework presents a novel application of causal sensitivity analysis to exposure mappings. We instantiate our framework for three canonical exposure settings widely used in practice: (i) weighted means of the neighborhood treatments, (ii) threshold-based exposure mappings, and (iii) truncated neighborhood interference in the presence of higher-order spillovers. Furthermore, we develop orthogonal estimators for these bounds and prove that the resulting bound estimates are valid, sharp, and efficient. Our experiments show the bounds remain informative and provide reliable conclusions under misspecification of exposure mappings.

[LG-38] CoCoEmo: Composable and Controllable Human-Like Emotional TTS via Activation Steering

链接: https://arxiv.org/abs/2602.03420
作者: Siyi Wang,Shihong Tan,Siyi Liu,Hong Jia,Gongping Huang,James Bailey,Ting Dang
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Emotional expression in human speech is nuanced and compositional, often involving multiple, sometimes conflicting, affective cues that may diverge from linguistic content. In contrast, most expressive text-to-speech systems enforce a single utterance-level emotion, collapsing affective diversity and suppressing mixed or text-emotion-misaligned expression. While activation steering via latent direction vectors offers a promising solution, it remains unclear whether emotion representations are linearly steerable in TTS, where steering should be applied within hybrid TTS architectures, and how such complex emotion behaviors should be evaluated. This paper presents the first systematic analysis of activation steering for emotional control in hybrid TTS models, introducing a quantitative, controllable steering framework, and multi-rater evaluation protocols that enable composable mixed-emotion synthesis and reliable text-emotion mismatch synthesis. Our results demonstrate, for the first time, that emotional prosody and expressive variability are primarily synthesized by the TTS language module instead of the flow-matching module, and also provide a lightweight steering approach for generating natural, human-like emotional speech.

[LG-39] Most Convolutional Networks Suffer from Small Adversarial Perturbations

链接: https://arxiv.org/abs/2602.03415
作者: Amit Daniely,Idan Mehalel
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The existence of adversarial examples is relatively understood for random fully connected neural networks, but much less so for convolutional neural networks (CNNs). The recent work [Daniely, 2025] establishes that adversarial examples can be found in CNNs, in some non-optimal distance from the input. We extend over this work and prove that adversarial examples in random CNNs with input dimension d can be found already in \ell_2 -distance of order \lVert x \rVert /\sqrtd from the input x , which is essentially the nearest possible. We also show that such adversarial small perturbations can be found using a single step of gradient descent. To derive our results we use Fourier decomposition to efficiently bound the singular values of a random linear convolutional operator, which is the main ingredient of a CNN layer. This bound might be of independent interest.

[LG-40] he Label Horizon Paradox: Rethinking Supervision Targets in Financial Forecasting

链接: https://arxiv.org/abs/2602.03395
作者: Chen-Hui Song,Shuoling Liu,Liyuan Chen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While deep learning has revolutionized financial forecasting through sophisticated architectures, the design of the supervision signal itself is rarely scrutinized. We challenge the canonical assumption that training labels must strictly mirror inference targets, uncovering the Label Horizon Paradox: the optimal supervision signal often deviates from the prediction goal, shifting across intermediate horizons governed by market dynamics. We theoretically ground this phenomenon in a dynamic signal-noise trade-off, demonstrating that generalization hinges on the competition between marginal signal realization and noise accumulation. To operationalize this insight, we propose a bi-level optimization framework that autonomously identifies the optimal proxy label within a single training run. Extensive experiments on large-scale financial datasets demonstrate consistent improvements over conventional baselines, thereby opening new avenues for label-centric research in financial forecasting.

[LG-41] Dynamic Topology Optimization for Non-IID Data in Decentralized Learning

链接: https://arxiv.org/abs/2602.03383
作者: Bart Cox,Antreas Ioannou,Jérémie Decouchant
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 10 pages, 11 figures. Accepted for publication in the 13th IEEE International Conference on Big Data (BigData 2025). To appear

点击查看摘要

Abstract:Decentralized learning (DL) enables a set of nodes to train a model collaboratively without central coordination, offering benefits for privacy and scalability. However, DL struggles to train a high accuracy model when the data distribution is non-independent and identically distributed (non-IID) and when the communication topology is static. To address these issues, we propose Morph, a topology optimization algorithm for DL. In Morph, nodes adaptively choose peers for model exchange based on maximum model dissimilarity. Morph maintains a fixed in-degree while dynamically reshaping the communication graph through gossip-based peer discovery and diversity-driven neighbor selection, thereby improving robustness to data heterogeneity. Experiments on CIFAR-10 and FEMNIST with up to 100 nodes show that Morph consistently outperforms static and epidemic baselines, while closely tracking the fully connected upper bound. On CIFAR-10, Morph achieves a relative improvement of 1.12x in test accuracy compared to the state-of-the-art baselines. On FEMNIST, Morph achieves an accuracy that is 1.08x higher than Epidemic Learning. Similar trends hold for 50 node deployments, where Morph narrows the gap to the fully connected upper bound within 0.5 percentage points on CIFAR-10. These results demonstrate that Morph achieves higher final accuracy, faster convergence, and more stable learning as quantified by lower inter-node variance, while requiring fewer communication rounds than baselines and no global knowledge.

[LG-42] Achieving Linear Speedup for Composite Federated Learning

链接: https://arxiv.org/abs/2602.03357
作者: Kun Huang,Shi Pu
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 27 pages, 12 figures

点击查看摘要

Abstract:This paper proposes FedNMap, a normal map-based method for composite federated learning, where the objective consists of a smooth loss and a possibly nonsmooth regularizer. FedNMap leverages a normal map-based update scheme to handle the nonsmooth term and incorporates a local correction strategy to mitigate the impact of data heterogeneity across clients. Under standard assumptions, including smooth local losses, weak convexity of the regularizer, and bounded stochastic gradient variance, FedNMap achieves linear speedup with respect to both the number of clients n and the number of local updates Q for nonconvex losses, both with and without the Polyak-Łojasiewicz (PL) condition. To our knowledge, this is the first result establishing linear speedup for nonconvex composite federated learning.

[LG-43] PACE: Pretrained Audio Continual Learning ICLR2026

链接: https://arxiv.org/abs/2602.03355
作者: Chang Li,Kanglei Zhou,Liyuan Wang
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注: Accepted at ICLR 2026

点击查看摘要

Abstract:Audio is a fundamental modality for analyzing speech, music, and environmental sounds. Although pretrained audio models have significantly advanced audio understanding, they remain fragile in real-world settings where data distributions shift over time. In this work, we present the first systematic benchmark for audio continual learning (CL) with pretrained models (PTMs), together with a comprehensive analysis of its unique challenges. Unlike in vision, where parameter-efficient fine-tuning (PEFT) has proven effective for CL, directly transferring such strategies to audio leads to poor performance. This stems from a fundamental property of audio backbones: they focus on low-level spectral details rather than structured semantics, causing severe upstream-downstream misalignment. Through extensive empirical study, we identify analytic classifiers with first-session adaptation (FSA) as a promising direction, but also reveal two major limitations: representation saturation in coarse-grained scenarios and representation drift in fine-grained scenarios. To address these challenges, we propose PACE, a novel method that enhances FSA via a regularized analytic classifier and enables multi-session adaptation through adaptive subspace-orthogonal PEFT for improved semantic alignment. In addition, we introduce spectrogram-based boundary-aware perturbations to mitigate representation overlap and improve stability. Experiments on six diverse audio CL benchmarks demonstrate that PACE substantially outperforms state-of-the-art baselines, marking an important step toward robust and scalable audio continual learning with PTMs.

[LG-44] Bayesian Conformal Prediction as a Decision Risk Problem

链接: https://arxiv.org/abs/2602.03331
作者: Fanyi Wu,Veronika Lohmanova,Samuel Kaski,Michele Caprio
类目: Machine Learning (cs.LG)
*备注: 18 pages, 5 figures. Accepted at EIML 2025 at Eurips

点击查看摘要

Abstract:Bayesian posterior predictive densities as non-conformity scores and Bayesian quadrature are used to estimate and minimise the expected prediction set size. Operating within a split conformal framework, BCP provides valid coverage guarantees and demonstrates reliable empirical coverage under model misspecification. Across regression and classification tasks, including distribution-shifted settings such as ImageNet-A, BCP yields prediction sets of comparable size to split conformal prediction, while exhibiting substantially lower run-to-run variability in set size. In sparse regression with nominal coverage of 80 percent, BCP achieves 81 percent empirical coverage under a misspecified prior, whereas Bayesian credible intervals under-cover at 49 percent.

[LG-45] From Inexact Gradients to Byzantine Robustness: Acceleration and Optimization under Similarity

链接: https://arxiv.org/abs/2602.03329
作者: Renaud Gaucher,Aymeric Dieuleveut,Hadrien Hendrikx
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Standard federated learning algorithms are vulnerable to adversarial nodes, a.k.a. Byzantine failures. To solve this issue, robust distributed learning algorithms have been developed, which typically replace parameter averaging by robust aggregations. While generic conditions on these aggregations exist to guarantee the convergence of (Stochastic) Gradient Descent (SGD), the analyses remain rather ad-hoc. This hinders the development of more complex robust algorithms, such as accelerated ones. In this work, we show that Byzantine-robust distributed optimization can, under standard generic assumptions, be cast as a general optimization with inexact gradient oracles (with both additive and multiplicative error terms), an active field of research. This allows for instance to directly show that GD on top of standard robust aggregation procedures obtains optimal asymptotic error in the Byzantine setting. Going further, we propose two optimization schemes to speed up the convergence. The first one is a Nesterov-type accelerated scheme whose proof directly derives from accelerated inexact gradient results applied to our formulation. The second one hinges on Optimization under Similarity, in which the server leverages an auxiliary loss function that approximates the global loss. Both approaches allow to drastically reduce the communication complexity compared to previous methods, as we show theoretically and empirically. Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC) Cite as: arXiv:2602.03329 [cs.LG] (or arXiv:2602.03329v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.03329 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-46] Information-Theoretic Multi-Model Fusion for Target-Oriented Adaptive Sampling in Materials Design

链接: https://arxiv.org/abs/2602.03319
作者: Yixuan Zhang,Zhiyuan Li,Weijia He,Mian Dai,Chen Shen,Teng Long,Hongbin Zhang
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Information Theory (cs.IT)
*备注: 37 pages, 5 figures, 2 tables

点击查看摘要

Abstract:Target-oriented discovery under limited evaluation budgets requires making reliable progress in high-dimensional, heterogeneous design spaces where each new measurement is costly, whether experimental or high-fidelity simulation. We present an information-theoretic framework for target-oriented adaptive sampling that reframes optimization as trajectory discovery: instead of approximating the full response surface, the method maintains and refines a low-entropy information state that concentrates search on target-relevant directions. The approach couples data, model beliefs, and physics/structure priors through dimension-aware information budgeting, adaptive bootstrapped distillation over a heterogeneous surrogate reservoir, and structure-aware candidate manifold analysis with Kalman-inspired multi-model fusion to balance consensus-driven exploitation and disagreement-driven exploration. Evaluated under a single unified protocol without dataset-specific tuning, the framework improves sample efficiency and reliability across 14 single- and multi-objective materials design tasks spanning candidate pools from 600 to 4 \times 10^6 and feature dimensions from 10 to 10^3 , typically reaching top-performing regions within 100 evaluations. Complementary 20-dimensional synthetic benchmarks (Ackley, Rastrigin, Schwefel) further demonstrate robustness to rugged and multimodal landscapes.

[LG-47] medR: Reward Engineering for Clinical Offline Reinforcement Learning via Tri-Drive Potential Functions

链接: https://arxiv.org/abs/2602.03305
作者: Qianyi Xu,Gousia Habib,Feng Wu,Yanrui Du,Zhihui Chen,Swapnil Mishra,Dilruk Perera,Mengling Feng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) offers a powerful framework for optimizing dynamic treatment regimes (DTRs). However, clinical RL is fundamentally bottlenecked by reward engineering: the challenge of defining signals that safely and effectively guide policy learning in complex, sparse offline environments. Existing approaches often rely on manual heuristics that fail to generalize across diverse pathologies. To address this, we propose an automated pipeline leveraging Large Language Models (LLMs) for offline reward design and verification. We formulate the reward function using potential functions consisted of three core components: survival, confidence, and competence. We further introduce quantitative metrics to rigorously evaluate and select the optimal reward structure prior to deployment. By integrating LLM-driven domain knowledge, our framework automates the design of reward functions for specific diseases while significantly enhancing the performance of the resulting policies.

[LG-48] Lipschitz Multiscale Deep Equilibrium Models: A Theoretically Guaranteed and Accelerated Approach AISTATS2026

链接: https://arxiv.org/abs/2602.03297
作者: Naoki Sato,Hideaki Iiduka
类目: Machine Learning (cs.LG)
*备注: Accepted at AISTATS2026

点击查看摘要

Abstract:Deep equilibrium models (DEQs) achieve infinitely deep network representations without stacking layers by exploring fixed points of layer transformations in neural networks. Such models constitute an innovative approach that achieves performance comparable to state-of-the-art methods in many large-scale numerical experiments, despite requiring significantly less memory. However, DEQs face the challenge of requiring vastly more computational time for training and inference than conventional methods, as they repeatedly perform fixed-point iterations with no convergence guarantee upon each input. Therefore, this study explored an approach to improve fixed-point convergence and consequently reduce computational time by restructuring the model architecture to guarantee fixed-point convergence. Our proposed approach for image classification, Lipschitz multiscale DEQ, has theoretically guaranteed fixed-point convergence for both forward and backward passes by hyperparameter adjustment, achieving up to a 4.75 \times speed-up in numerical experiments on CIFAR-10 at the cost of a minor drop in accuracy.

[LG-49] Anomaly Detection via Mean Shift Density Enhancement

链接: https://arxiv.org/abs/2602.03293
作者: Pritam Kar,Rahul Bordoloi,Olaf Wolkenhauer,Saptarshi Bej
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Unsupervised anomaly detection stands as an important problem in machine learning, with applications in financial fraud prevention, network security and medical diagnostics. Existing unsupervised anomaly detection algorithms rarely perform well across different anomaly types, often excelling only under specific structural assumptions. This lack of robustness also becomes particularly evident under noisy settings. We propose Mean Shift Density Enhancement (MSDE), a fully unsupervised framework that detects anomalies through their geometric response to density-driven manifold evolution. MSDE is based on the principle that normal samples, being well supported by local density, remain stable under iterative density enhancement, whereas anomalous samples undergo large cumulative displacements as they are attracted toward nearby density modes. To operationalize this idea, MSDE employs a weighted mean-shift procedure with adaptive, sample-specific density weights derived from a UMAP-based fuzzy neighborhood graph. Anomaly scores are defined by the total displacement accumulated across a small number of mean-shift iterations. We evaluate MSDE on the ADBench benchmark, comprising forty six real-world tabular datasets, four realistic anomaly generation mechanisms, and six noise levels. Compared to 13 established unsupervised baselines, MSDE achieves consistently strong, balanced and robust performance for AUC-ROC, AUC-PR, and Precision@n, at several noise levels and on average over several types of anomalies. These results demonstrate that displacement-based scoring provides a robust alternative to the existing state-of-the-art for unsupervised anomaly detection.

[LG-50] Universal Approximation of Continuous Functionals on Compact Subsets via Linear Measurements and Scalar Nonlinearities

链接: https://arxiv.org/abs/2602.03290
作者: Andrey Krylov,Maksim Penkin
类目: Machine Learning (cs.LG); Functional Analysis (math.FA)
*备注: 10 pages

点击查看摘要

Abstract:We study universal approximation of continuous functionals on compact subsets of products of Hilbert spaces. We prove that any such functional can be uniformly approximated by models that first take finitely many continuous linear measurements of the inputs and then combine these measurements through continuous scalar nonlinearities. We also extend the approximation principle to maps with values in a Banach space, yielding finite-rank approximations. These results provide a compact-set justification for the common ``measure, apply scalar nonlinearities, then combine’’ design pattern used in operator learning and imaging.

[LG-51] BlockRR: A Unified Framework of RR-type Algorithms for Label Differential Privacy

链接: https://arxiv.org/abs/2602.03277
作者: Haixia Liu,Yi Ding
类目: Machine Learning (cs.LG)
*备注: 19 pages, 2 figures

点击查看摘要

Abstract:In this paper, we introduce BlockRR, a novel and unified randomized-response mechanism for label differential privacy. This framework generalizes existed RR-type mechanisms as special cases under specific parameter settings, which eliminates the need for separate, case-by-case analysis. Theoretically, we prove that BlockRR satisfies \epsilon -label DP. We also design a partition method for BlockRR based on a weight matrix derived from label prior information; the parallel composition principle ensures that the composition of two such mechanisms remains \epsilon -label DP. Empirically, we evaluate BlockRR on two variants of CIFAR-10 with varying degrees of class imbalance. Results show that in the high-privacy and moderate-privacy regimes ( \epsilon \leq 3.0 ), our propsed method gets a better balance between test accuaracy and the average of per-class accuracy. In the low-privacy regime ( \epsilon \geq 4.0 ), all methods reduce BlockRR to standard RR without additional performance loss.

[LG-52] Beyond Suffixes: Token Position in GCG Adversarial Attacks on Large Language Models

链接: https://arxiv.org/abs/2602.03265
作者: Hicham Eddoubi,Umar Faruk Abdullahi,Fadi Hassan
类目: Machine Learning (cs.LG)
*备注: 12 pages, 10 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have seen widespread adoption across multiple domains, creating an urgent need for robust safety alignment mechanisms. However, robustness remains challenging due to jailbreak attacks that bypass alignment via adversarial prompts. In this work, we focus on the prevalent Greedy Coordinate Gradient (GCG) attack and identify a previously underexplored attack axis in jailbreak attacks typically framed as suffix-based: the placement of adversarial tokens within the prompt. Using GCG as a case study, we show that both optimizing attacks to generate prefixes instead of suffixes and varying adversarial token position during evaluation substantially influence attack success rates. Our findings highlight a critical blind spot in current safety evaluations and underline the need to account for the position of adversarial tokens in the adversarial robustness evaluation of LLMs.

[LG-53] BayeSQP: Bayesian Optimization through Sequential Quadratic Programming

链接: https://arxiv.org/abs/2602.03232
作者: Paul Brunzema,Sebastian Trimpe
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce BayeSQP, a novel algorithm for general black-box optimization that merges the structure of sequential quadratic programming with concepts from Bayesian optimization. BayeSQP employs second-order Gaussian process surrogates for both the objective and constraints to jointly model the function values, gradients, and Hessian from only zero-order information. At each iteration, a local subproblem is constructed using the GP posterior estimates and solved to obtain a search direction. Crucially, the formulation of the subproblem explicitly incorporates uncertainty in both the function and derivative estimates, resulting in a tractable second-order cone program for high probability improvements under model uncertainty. A subsequent one-dimensional line search via constrained Thompson sampling selects the next evaluation point. Empirical results show thatBayeSQP outperforms state-of-the-art methods in specific high-dimensional settings. Our algorithm offers a principled and flexible framework that bridges classical optimization techniques with modern approaches to black-box optimization.

[LG-54] Sparsity is Combinatorial Depth: Quantifying MoE Expressivity via Tropical Geometry

链接: https://arxiv.org/abs/2602.03204
作者: Ye Su,Huayi Tang,Zixuan Gong,Yong Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While Mixture-of-Experts (MoE) architectures define the state-of-the-art, their theoretical success is often attributed to heuristic efficiency rather than geometric expressivity. In this work, we present the first analysis of MoE through the lens of tropical geometry, establishing that the Top- k routing mechanism is algebraically isomorphic to the k -th elementary symmetric tropical polynomial. This isomorphism partitions the input space into the Normal Fan of a Hypersimplex, revealing that \textbfsparsity is combinatorial depth which scales geometric capacity by the binomial coefficient \binomNk . Moving beyond ambient bounds, we introduce the concept of \textitEffective Capacity under the Manifold Hypothesis. We prove that while dense networks suffer from capacity collapse on low-dimensional data, MoE architectures exhibit \textitCombinatorial Resilience, maintaining high expressivity via the transversality of routing cones. In this study, our framework unifies the discrete geometry of the Hypersimplex with the continuous geometry of neural functions, offering a rigorous theoretical justification for the topological supremacy of conditional computation.

[LG-55] From Scalar Rewards to Potential Trends: Shaping Potential Landscapes for Model-Based Reinforcement Learning

链接: https://arxiv.org/abs/2602.03201
作者: Yao-Hui Li,Zeyu Wang,Xin Li,Wei Pang,Yingfang Yuan,Zhengkun Chen,Boya Zhang,Riashat Islam,Alex Lamb,Yonggang Zhang
类目: Machine Learning (cs.LG)
*备注: 26 pages, 20 this http URL . Work in progress

点击查看摘要

Abstract:Model-based reinforcement learning (MBRL) achieves high sample efficiency by simulating future trajectories with learned dynamics and reward models. However, its effectiveness is severely compromised in sparse reward settings. The core limitation lies in the standard paradigm of regressing ground-truth scalar rewards: in sparse environments, this yields a flat, gradient-free landscape that fails to provide directional guidance for planning. To address this challenge, we propose Shaping Landscapes with Optimistic Potential Estimates (SLOPE), a novel framework that shifts reward modeling from predicting scalars to constructing informative potential landscapes. SLOPE employs optimistic distributional regression to estimate high-confidence upper bounds, which amplifies rare success signals and ensures sufficient exploration gradients. Evaluations on 30+ tasks across 5 benchmarks demonstrate that SLOPE consistently outperforms leading baselines in fully sparse, semi-sparse, and dense rewards.

[LG-56] Probe-then-Commit Multi-Objective Bandits: Theoretical Benefits of Limited Multi-Arm Feedback

链接: https://arxiv.org/abs/2602.03175
作者: Ming Shi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study an online resource-selection problem motivated by multi-radio access selection and mobile edge computing offloading. In each round, an agent chooses among K candidate links/servers (arms) whose performance is a stochastic d -dimensional vector (e.g., throughput, latency, energy, reliability). The key interaction is \emphprobe-then-commit (PtC): the agent may probe up to q1 candidates via control-plane measurements to observe their vector outcomes, but must execute exactly one candidate in the data plane. This limited multi-arm feedback regime strictly interpolates between classical bandits ( q=1 ) and full-information experts ( q=K ), yet existing multi-objective learning theory largely focuses on these extremes. We develop \textscPtC-P-UCB, an optimistic probe-then-commit algorithm whose technical core is frontier-aware probing under uncertainty in a Pareto mode, e.g., it selects the q probes by approximately maximizing a hypervolume-inspired frontier-coverage potential and commits by marginal hypervolume gain to directly expand the attained Pareto region. We prove a dominated-hypervolume frontier error of \tildeO (K_P d/\sqrtqT) , where K_P is the Pareto-frontier size and T is the horizon, and scalarized regret \tildeO (L_\phi d\sqrt(K/q)T) , where \phi is the scalarizer. These quantify a transparent 1/\sqrtq acceleration from limited probing. We further extend to \emphmulti-modal probing: each probe returns M modalities (e.g., CSI, queue, compute telemetry), and uncertainty fusion yields variance-adaptive versions of the above bounds via an effective noise scale.

[LG-57] Adversarial construction as a potential solution to the experiment design problem in large task spaces

链接: https://arxiv.org/abs/2602.03172
作者: Prakhar Godara,Frederick Callaway,Marcelo G. Mattar
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 7 pages, 7 figures

点击查看摘要

Abstract:Despite decades of work, we still lack a robust, task-general theory of human behavior even in the simplest domains. In this paper we tackle the generality problem head-on, by aiming to develop a unified model for all tasks embedded in a task-space. In particular we consider the space of binary sequence prediction tasks where the observations are generated by the space parameterized by hidden Markov models (HMM). As the space of tasks is large, experimental exploration of the entire space is infeasible. To solve this problem we propose the adversarial construction approach, which helps identify tasks that are most likely to elicit a qualitatively novel behavior. Our results suggest that adversarial construction significantly outperforms random sampling of environments and therefore could be used as a proxy for optimal experimental design in high-dimensional task spaces.

[LG-58] StepScorer: Accelerating Reinforcement Learning with Step-wise Scoring and Psychological Regret Modeling

链接: https://arxiv.org/abs/2602.03171
作者: Zhe Xu
类目: Machine Learning (cs.LG)
*备注: 10 pages, 5 figures, 1 table

点击查看摘要

Abstract:Reinforcement learning algorithms often suffer from slow convergence due to sparse reward signals, particularly in complex environments where feedback is delayed or infrequent. This paper introduces the Psychological Regret Model (PRM), a novel approach that accelerates learning by incorporating regret-based feedback signals after each decision step. Rather than waiting for terminal rewards, PRM computes a regret signal based on the difference between the expected value of the optimal action and the value of the action taken in each state. This transforms sparse rewards into dense feedback signals through a step-wise scoring framework, enabling faster convergence. We demonstrate that PRM achieves stable performance approximately 36% faster than traditional Proximal Policy Optimization (PPO) in benchmark environments such as Lunar Lander. Our results indicate that PRM is particularly effective in continuous control tasks and environments with delayed feedback, making it suitable for real-world applications such as robotics, finance, and adaptive education where rapid policy adaptation is critical. The approach formalizes human-inspired counterfactual thinking as a computable regret signal, bridging behavioral economics and reinforcement learning.

[LG-59] What Makes a Good Example? Modeling Exemplar Selection with Neural Network Representations

链接: https://arxiv.org/abs/2602.03144
作者: Fanxiao Wani Qiu,Oscar Leong,Alexander LaTourrette
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Teaching requires distilling a rich category distribution into a small set of informative exemplars. Although prior work shows that humans consider both representativeness and diversity when teaching, the computational principles underlying these tradeoffs remain unclear. We address this gap by modeling human exemplar selection using neural network feature representations and principled subset selection criteria. Novel visual categories were embedded along a one-dimensional morph continuum using pretrained vision models, and selection strategies varied in their emphasis on prototypicality, joint representativeness, and diversity. Adult participants selected one to three exemplars to teach a learner. Model-human comparisons revealed that strategies based on joint representativeness, or its combination with diversity, best captured human judgments, whereas purely prototypical or diversity-based strategies performed worse. Moreover, transformer-based representations consistently aligned more closely with human behavior than convolutional networks. These results highlight the potential utility of dataset distillation methods in machine learning as computational models for teaching.

[LG-60] SATORIS-N: Spectral Analysis based Traffic Observation Recovery via Informed Subspaces and Nuclear-norm minimization

链接: https://arxiv.org/abs/2602.03138
作者: Sampad Mohanty,Bhaskar Krishnamachari
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traffic-density matrices from different days exhibit both low rank and stable correlations in their singular-vector subspaces. Leveraging this, we introduce SATORIS-N, a framework for imputing partially observed traffic-density by informed subspace priors from neighboring days. Our contribution is a subspace-aware semidefinite programming (SDP) formulation of nuclear norm that explicitly informs the reconstruction with prior singular-subspace information. This convex formulation jointly enforces low rank and subspace alignment, providing a single global optimum and substantially improving accuracy under medium and high occlusion. We also study a lightweight implicit subspace-alignment strategy in which matrices from consecutive days are concatenated to encourage alignment of spatial or temporal singular directions. Although this heuristic offers modest gains when missing rates are low, the explicit SDP approach is markedly more robust when large fractions of entries are missing. Across two real-world datasets (Beijing and Shanghai), SATORIS-N consistently outperforms standard matrix-completion methods such as SoftImpute, IterativeSVD, statistical, and even deep learning baselines at high occlusion levels. The framework generalizes to other spatiotemporal settings in which singular subspaces evolve slowly over time. In the context of intelligent vehicles and vehicle-to-everything (V2X) systems, accurate traffic-density reconstruction enables critical applications including cooperative perception, predictive routing, and vehicle-to-infrastructure (V2I) communication optimization. When infrastructure sensors or vehicle-reported observations are incomplete - due to communication dropouts, sensor occlusions, or sparse connected vehicle penetration-reliable imputation becomes essential for safe and efficient autonomous navigation. Subjects: Machine Learning (cs.LG) ACMclasses: I.2.6; I.5.1 Cite as: arXiv:2602.03138 [cs.LG] (or arXiv:2602.03138v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.03138 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-61] Enhanced Parcel Arrival Forecasting for Logistic Hubs: An Ensemble Deep Learning Approach

链接: https://arxiv.org/abs/2602.03135
作者: Xinyue Pan,Yujia Xu,Benoit Montreuil
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rapid expansion of online shopping has increased the demand for timely parcel delivery, compelling logistics service providers to enhance the efficiency, agility, and predictability of their hub networks. In order to solve the problem, we propose a novel deep learning-based ensemble framework that leverages historical arrival patterns and real-time parcel status updates to forecast upcoming workloads at logistic hubs. This approach not only facilitates the generation of short-term forecasts, but also improves the accuracy of future hub workload predictions for more strategic planning and resource management. Empirical tests of the algorithm, conducted through a case study of a major city’s parcel logistics, demonstrate the ensemble method’s superiority over both traditional forecasting techniques and standalone deep learning models. Our findings highlight the significant potential of this method to improve operational efficiency in logistics hubs and advocate for its broader adoption.

[LG-62] Function-Space Empirical Bayes Regularisation with Large Vision-Language Model Priors

链接: https://arxiv.org/abs/2602.03119
作者: Pengcheng Hao,Huaze Tang,Ercan Engin Kuruoglu,Wenbo Ding
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Bayesian deep learning (BDL) provides a principled framework for reliable uncertainty quantification by combining deep neural networks with Bayesian inference. A central challenge in BDL lies in the design of informative prior distributions that scale effectively to high-dimensional data. Recent functional variational inference (VI) approaches address this issue by imposing priors directly in function space; however, most existing methods rely on Gaussian process (GP) priors, whose expressiveness and generalisation capabilities become limited in high-dimensional regimes. In this work, we propose VLM-FS-EB, a novel function-space empirical Bayes regularisation framework, leveraging large vision-language models (VLMs) to generates semantically meaningful context points. These synthetic samples are then used VLMs for embeddings to construct expressive functional priors. Furthermore, the proposed method is evaluated against various baselines, and experimental results demonstrate that our method consistently improves predictive performance and yields more reliable uncertainty estimates, particularly in out-of-distribution (OOD) detection tasks and data-scarce regimes.

[LG-63] Consensus Group Relative Policy Optimization for Text Generation

链接: https://arxiv.org/abs/2602.03102
作者: Yuki Ichihara,Yuu Jinnai,Kaito Ariu,Eiji Uchibe
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many strong decoding methods for text generation follow a sample-and-rerank paradigm: they draw multiple candidates, score each under a utility (reward) function using consensus across samples, and return the best one. Although effective, these methods incur high computational costs during inference due to repeated sampling and scoring. Prior attempts to amortize inference-time computation typically rely on gold references, teacher labels, or curated preference data, increasing dataset construction effort and the demand for high-fidelity reward models. We propose Consensus Group Relative Policy Optimization (C-GRPO), which distills Minimum Bayes Risk (MBR) decoding into training by formulating the consensus utility as a group-relative objective within GRPO. C-GRPO requires only a utility function and policy samples, without gold references or explicit preference labels. Under ideal conditions, we show that the objective function of C-GRPO is directionally aligned with the gradient of the expected-utility objective underlying MBR decoding, leading to a convergence guarantee. Experiments on machine translation (WMT 2024) and text summarization (XSum) demonstrate that C-GRPO successfully achieves performance comparable to MBR decoding without the associated inference-time overhead, while outperforming reference-free baseline methods.

[LG-64] Geometry-Preserving Neural Architectures on Manifolds with Boundary

链接: https://arxiv.org/abs/2602.03082
作者: Karthik Elamvazhuthi,Shiba Biswal,Kian Rosenblum,Arushi Katyal,Tianli Qu,Grady Ma,Rishi Sonthalia
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Preserving geometric structure is important in learning. We propose a unified class of geometry-aware architectures that interleave geometric updates between layers, where both projection layers and intrinsic exponential map updates arise as discretizations of projected dynamical systems on manifolds (with or without boundary). Within this framework, we establish universal approximation results for constrained neural ODEs. We also analyze architectures that enforce geometry only at the output, proving a separate universal approximation property that enables direct comparison to interleaved designs. When the constraint set is unknown, we learn projections via small-time heat-kernel limits, showing diffusion/flow-matching can be used as data-based projections. Experiments on dynamics over S^2 and SO(3), and diffusion on S^d-1-valued features demonstrate exact feasibility for analytic updates and strong performance for learned projections

[LG-65] MS: Trajectory-Mixed Supervision for Reward-Free On-Policy SFT

链接: https://arxiv.org/abs/2602.03073
作者: Rana Muhammad Shahroz Khan,Zijie Liu,Zhen Tan,Charles Fleming,Tianlong Chen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) are the two dominant paradigms for enhancing Large Language Model (LLM) performance on downstream tasks. While RL generally preserves broader model capabilities (retention) better than SFT, it comes with significant costs: complex reward engineering, instability, and expensive on-policy sampling. In contrast, SFT is efficient but brittle, often suffering from catastrophic forgetting due to \textbfSupervision Mismatch : the divergence between the model’s evolving policy and static training labels. We address this trade-off with \textbfTrajectory-Mixed Supervision (TMS) , a reward-free framework that approximates the on-policy benefits of RL by creating a dynamic curriculum from the model’s own historical checkpoints. TMS minimizes \textitPolicy-Label Divergence (PLD) , preventing the mode collapse that drives forgetting in standard SFT. Experiments across reasoning (MATH, GSM8K) and instruction-following benchmarks demonstrate that TMS effectively shifts the accuracy–retention Pareto frontier. While RL remains the gold standard for retention, TMS significantly outperforms standard and iterative SFT, bridging the gap to RL without requiring reward models or verifiers. Mechanistic analysis confirms that PLD drift accurately predicts forgetting and that TMS successfully mitigates this drift.

[LG-66] Fedcompass: Federated Clustered and Periodic Aggregation Framework for Hybrid Classical-Quantum Models ICASSP2026

链接: https://arxiv.org/abs/2602.03052
作者: Yueheng Wang,Xing He,Zinuo Cai,Rui Zhang,Ruhui Ma,Yuan Liu,Rajkumar Buyya
类目: Machine Learning (cs.LG)
*备注: Accepted by the 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing(ICASSP 2026)

点击查看摘要

Abstract:Federated learning enables collaborative model training across decentralized clients under privacy constraints. Quantum computing offers potential for alleviating computational and communication burdens in federated learning, yet hybrid classical-quantum federated learning remains susceptible to performance degradation under non-IID data. To address this,we propose FEDCOMPASS, a layered aggregation framework for hybrid classical-quantum federated learning. FEDCOMPASS employs spectral clustering to group clients by class distribution similarity and performs cluster-wise aggregation for classical feature extractors. For quantum parameters, it uses circular mean aggregation combined with adaptive optimization to ensure stable global updates. Experiments on three benchmark datasets show that FEDCOMPASS improves test accuracy by up to 10.22% and enhances convergence stability under non-IID settings, outperforming six strong federated learning baselines.

[LG-67] Clarify Before You Draw: Proactive Agents for Robust Text-to-CAD Generation

链接: https://arxiv.org/abs/2602.03045
作者: Bo Yuan,Zelin Zhao,Petr Molodyk,Bin Hu,Yongxin Chen
类目: Machine Learning (cs.LG)
*备注: In Review

点击查看摘要

Abstract:Large language models have recently enabled text-to-CAD systems that synthesize parametric CAD programs (e.g., CadQuery) from natural language prompts. In practice, however, geometric descriptions can be under-specified or internally inconsistent: critical dimensions may be missing and constraints may conflict. Existing fine-tuned models tend to reactively follow user instructions and hallucinate dimensions when the text is ambiguous. To address this, we propose a proactive agentic framework for text-to-CadQuery generation, named ProCAD, that resolves specification issues before code synthesis. Our framework pairs a proactive clarifying agent, which audits the prompt and asks targeted clarification questions only when necessary to produce a self-consistent specification, with a CAD coding agent that translates the specification into an executable CadQuery program. We fine-tune the coding agent on a curated high-quality text-to-CadQuery dataset and train the clarifying agent via agentic SFT on clarification trajectories. Experiments show that proactive clarification significantly improves robustness to ambiguous prompts while keeping interaction overhead low. ProCAD outperforms frontier closed-source models, including Claude Sonnet 4.5, reducing the mean Chamfer distance by 79.9 percent and lowering the invalidity ratio from 4.8 percent to 0.9 percent. Our code and datasets will be made publicly available.

[LG-68] Generalizable and Interpretable RF Fingerprinting with Shapelet-Enhanced Large Language Models

链接: https://arxiv.org/abs/2602.03035
作者: Tianya Zhao,Junqing Zhang,Haowen Xu,Xiaoyan Sun,Jun Dai,Xuyu Wang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 12 pages, 7 figures, IMWUT submission

点击查看摘要

Abstract:Deep neural networks (DNNs) have achieved remarkable success in radio frequency (RF) fingerprinting for wireless device authentication. However, their practical deployment faces two major limitations: domain shift, where models trained in one environment struggle to generalize to others, and the black-box nature of DNNs, which limits interpretability. To address these issues, we propose a novel framework that integrates a group of variable-length two-dimensional (2D) shapelets with a pre-trained large language model (LLM) to achieve efficient, interpretable, and generalizable RF fingerprinting. The 2D shapelets explicitly capture diverse local temporal patterns across the in-phase and quadrature (I/Q) components, providing compact and interpretable representations. Complementarily, the pre-trained LLM captures more long-range dependencies and global contextual information, enabling strong generalization with minimal training overhead. Moreover, our framework also supports prototype generation for few-shot inference, enhancing cross-domain performance without additional retraining. To evaluate the effectiveness of our proposed method, we conduct extensive experiments on six datasets across various protocols and domains. The results show that our method achieves superior standard and few-shot performance across both source and unseen domains.

[LG-69] Rethinking Music Captioning with Music Metadata LLM s ICASSP2026

链接: https://arxiv.org/abs/2602.03023
作者: Irmak Bukey,Zhepei Wang,Chris Donahue,Nicholas J. Bryan
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注: Accepted to ICASSP 2026

点击查看摘要

Abstract:Music captioning, or the task of generating a natural language description of music, is useful for both music understanding and controllable music generation. Training captioning models, however, typically requires high-quality music caption data which is scarce compared to metadata (e.g., genre, mood, etc.). As a result, it is common to use large language models (LLMs) to synthesize captions from metadata to generate training data for captioning models, though this process imposes a fixed stylization and entangles factual information with natural language style. As a more direct approach, we propose metadata-based captioning. We train a metadata prediction model to infer detailed music metadata from audio and then convert it into expressive captions via pre-trained LLMs at inference time. Compared to a strong end-to-end baseline trained on LLM-generated captions derived from metadata, our method: (1) achieves comparable performance in less training time over end-to-end captioners, (2) offers flexibility to easily change stylization post-training, enabling output captions to be tailored to specific stylistic and quality requirements, and (3) can be prompted with audio and partial metadata to enable powerful metadata imputation or in-filling–a common task for organizing music data.

[LG-70] From Zero to Hero: Advancing Zero-Shot Foundation Models for Tabular Outlier Detection

链接: https://arxiv.org/abs/2602.03018
作者: Xueying Ding,Haomin Wen,Simon Klütterman,Leman Akoglu
类目: Machine Learning (cs.LG)
*备注: 37 pages

点击查看摘要

Abstract:Outlier detection (OD) is widely used in practice; but its effective deployment on new tasks is hindered by lack of labeled outliers, which makes algorithm and hyperparameter selection notoriously hard. Foundation models (FMs) have transformed ML, and OD is no exception: Shen et. al. (2025) introduced FoMo-0D, the first FM for OD, achieving remarkable performance against numerous baselines. This work introduces OUTFORMER, which advances FoMo-0D with (1) a mixture of synthetic priors and (2) self-evolving curriculum training. OUTFORMER is pretrained solely on synthetic labeled datasets and infers test labels of a new task by using its training data as in-context input. Inference is fast and zero-shot, requiring merely forward pass and no labeled outliers. Thanks to in-context learning, it requires zero additional work-no OD model training or bespoke model selection-enabling truly plug-and-play deployment. OUTFORMER achieves state-of-the-art performance on the prominent AdBench, as well as two new large-scale OD benchmarks that we introduce, comprising over 1,500 datasets, while maintaining speedy inference.

[LG-71] Learning to Repair Lean Proofs from Compiler Feedback

链接: https://arxiv.org/abs/2602.02990
作者: Evan Wang,Simon Chess,Daniel Lee,Siyuan Ge,Ajit Mallavarapu,Vasily Ilin
类目: Machine Learning (cs.LG)
*备注: 15 pages, 6 figures

点击查看摘要

Abstract:As neural theorem provers become increasingly agentic, the ability to interpret and act on compiler feedback is critical. However, existing Lean datasets consist almost exclusively of correct proofs, offering little supervision for understanding and repairing failures. We study Lean proof repair as a supervised learning problem: given an erroneous proof and compiler feedback, predict both a corrected proof and a natural-language diagnosis grounded in the same feedback. We introduce APRIL (Automated Proof Repair in Lean), a dataset of 260,000 supervised tuples pairing systematically generated proof failures with compiler diagnostics and aligned repair and explanation targets. Training language models on APRIL substantially improves repair accuracy and feedback-conditioned reasoning; in our single-shot repair evaluation setting, a finetuned 4B-parameter model outperforms the strongest open-source baseline. We view diagnostic-conditioned supervision as a complementary training signal for feedback-using provers. Our dataset is available at \hrefthis https URLthis link.

[LG-72] Why Some Models Resist Unlearning: A Linear Stability Perspective

链接: https://arxiv.org/abs/2602.02986
作者: Wei-Kai Chang,Rajiv Khanna
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Machine unlearning, the ability to erase the effect of specific training samples without retraining from scratch, is critical for privacy, regulation, and efficiency. However, most progress in unlearning has been empirical, with little theoretical understanding of when and why unlearning works. We tackle this gap by framing unlearning through the lens of asymptotic linear stability to capture the interaction between optimization dynamics and data geometry. The key quantity in our analysis is data coherence which is the cross sample alignment of loss surface directions near the optimum. We decompose coherence along three axes: within the retain set, within the forget set, and between them, and prove tight stability thresholds that separate convergence from divergence. To further link data properties to forgettability, we study a two layer ReLU CNN under a signal plus noise model and show that stronger memorization makes forgetting easier: when the signal to noise ratio (SNR) is lower, cross sample alignment is weaker, reducing coherence and making unlearning easier; conversely, high SNR, highly aligned models resist unlearning. For empirical verification, we show that Hessian tests and CNN heatmaps align closely with the predicted boundary, mapping the stability frontier of gradient based unlearning as a function of batching, mixing, and data/model alignment. Our analysis is grounded in random matrix theory tools and provides the first principled account of the trade offs between memorization, coherence, and unlearning.

[LG-73] Learning Fast Monomial Orders for Gröbner Basis Computations

链接: https://arxiv.org/abs/2602.02972
作者: R. Caleb Bunch,Alperen A. Ergür,Melika Golestani,Jessie Tong,Malia Walewski,Yunus E. Zeytuncu
类目: ymbolic Computation (cs.SC); Machine Learning (cs.LG); Commutative Algebra (math.AC); Algebraic Geometry (math.AG)
*备注:

点击查看摘要

Abstract:The efficiency of Gröbner basis computation, the standard engine for solving systems of polynomial equations, depends on the choice of monomial ordering. Despite a near-continuum of possible monomial orders, most implementations rely on static heuristics such as GrevLex, guided primarily by expert intuition. We address this gap by casting the selection of monomial orderings as a reinforcement learning problem over the space of admissible orderings. Our approach leverages domain-informed reward signals that accurately reflect the computational cost of Gröbner basis computations and admits efficient Monte Carlo estimation. Experiments on benchmark problems from systems biology and computer vision show that the resulting learned policies consistently outperform standard heuristics, yielding substantial reductions in computational cost. Moreover, we find that these policies resist distillation into simple interpretable models, providing empirical evidence that deep reinforcement learning allows the agents to exploit non-linear geometric structure beyond the scope of traditional heuristics.

[LG-74] Co2PO: Coordinated Constrained Policy Optimization for Multi-Agent RL

链接: https://arxiv.org/abs/2602.02970
作者: Shrenik Patel,Christine Truong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Constrained multi-agent reinforcement learning (MARL) faces a fundamental tension between exploration and safety-constrained optimization. Existing leading approaches, such as Lagrangian methods, typically rely on global penalties or centralized critics that react to violations after they occur, often suppressing exploration and leading to over-conservatism. We propose Co2PO, a novel MARL communication-augmented framework that enables coordination-driven safety through selective, risk-aware communication. Co2PO introduces a shared blackboard architecture for broadcasting positional intent and yield signals, governed by a learned hazard predictor that proactively forecasts potential violations over an extended temporal horizon. By integrating these forecasts into a constrained optimization objective, Co2PO allows agents to anticipate and navigate collective hazards without the performance trade-offs inherent in traditional reactive constraints. We evaluate Co2PO across a suite of complex multi-agent safety benchmarks, where it achieves higher returns compared to leading constrained baselines while converging to cost-compliant policies at deployment. Ablation studies further validate the necessity of risk-triggered communication, adaptive gating, and shared memory components.

[LG-75] Q-ShiftDP: A Differentially Private Parameter-Shift Rule for Quantum Machine Learning

链接: https://arxiv.org/abs/2602.02962
作者: Hoang M. Ngo,Nhat Hoang-Xuan,Quan Nguyen,Nguyen Do,Incheol Shin,My T. Thai
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Quantum Machine Learning (QML) promises significant computational advantages, but preserving training data privacy remains challenging. Classical approaches like differentially private stochastic gradient descent (DP-SGD) add noise to gradients but fail to exploit the unique properties of quantum gradient estimation. In this work, we introduce the Differentially Private Parameter-Shift Rule (Q-ShiftDP), the first privacy mechanism tailored to QML. By leveraging the inherent boundedness and stochasticity of quantum gradients computed via the parameter-shift rule, Q-ShiftDP enables tighter sensitivity analysis and reduces noise requirements. We combine carefully calibrated Gaussian noise with intrinsic quantum noise to provide formal privacy and utility guarantees, and show that harnessing quantum noise further improves the privacy-utility trade-off. Experiments on benchmark datasets demonstrate that Q-ShiftDP consistently outperforms classical DP methods in QML.

[LG-76] Human-Centric Traffic Signal Control for Equity: A Multi-Agent Action Branching Deep Reinforcement Learning Approach

链接: https://arxiv.org/abs/2602.02959
作者: Xiaocai Zhang,Neema Nassir,Lok Sang Chan,Milad Haghani
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Coordinating traffic signals along multimodal corridors is challenging because many multi-agent deep reinforcement learning (DRL) approaches remain vehicle-centric and struggle with high-dimensional discrete action spaces. We propose MA2B-DDQN, a human-centric multi-agent action-branching double Deep Q-Network (DQN) framework that explicitly optimizes traveler-level equity. Our key contribution is an action-branching discrete control formulation that decomposes corridor control into (i) local, per-intersection actions that allocate green time between the next two phases and (ii) a single global action that selects the total duration of those phases. This decomposition enables scalable coordination under discrete control while reducing the effective complexity of joint decision-making. We also design a human-centric reward that penalizes the number of delayed individuals in the corridor, accounting for pedestrians, vehicle occupants, and transit passengers. Extensive evaluations across seven realistic traffic scenarios in Melbourne, Australia, demonstrate that our approach significantly reduces the number of impacted travelers, outperforming existing DRL and baseline methods. Experiments confirm the robustness of our model, showing minimal variance across diverse settings. This framework not only advocates for a fairer traffic signal system but also provides a scalable solution adaptable to varied urban traffic conditions.

[LG-77] Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization

链接: https://arxiv.org/abs/2602.02958
作者: Haocheng Xi,Shuo Yang,Yilong Zhao,Muyang Li,Han Cai,Xingyang Li,Yujun Lin,Zhuoyang Zhang,Jintao Zhang,Xiuyu Li,Zhiying Xu,Jun Wu,Chenfeng Xu,Ion Stoica,Song Han,Kurt Keutzer
类目: Machine Learning (cs.LG)
*备注: 11 pages, 7 figures

点击查看摘要

Abstract:Despite rapid progress in autoregressive video diffusion, an emerging system algorithm bottleneck limits both deployability and generation capability: KV cache memory. In autoregressive video generation models, the KV cache grows with generation history and quickly dominates GPU memory, often exceeding 30 GB, preventing deployment on widely available hardware. More critically, constrained KV cache budgets restrict the effective working memory, directly degrading long horizon consistency in identity, layout, and motion. To address this challenge, we present Quant VideoGen (QVG), a training free KV cache quantization framework for autoregressive video diffusion models. QVG leverages video spatiotemporal redundancy through Semantic Aware Smoothing, producing low magnitude, quantization friendly residuals. It further introduces Progressive Residual Quantization, a coarse to fine multi stage scheme that reduces quantization error while enabling a smooth quality memory trade off. Across LongCat Video, HY WorldPlay, and Self Forcing benchmarks, QVG establishes a new Pareto frontier between quality and memory efficiency, reducing KV cache memory by up to 7.0 times with less than 4% end to end latency overhead while consistently outperforming existing baselines in generation quality.

[LG-78] Variational Sparse Paired Autoencoders (vsPAIR) for Inverse Problems and Uncertainty Quantification

链接: https://arxiv.org/abs/2602.02948
作者: Jack Michael Solomon,Rishi Leburu,Matthias Chung
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Inverse problems are fundamental to many scientific and engineering disciplines; they arise when one seeks to reconstruct hidden, underlying quantities from noisy measurements. Many applications demand not just point estimates but interpretable uncertainty. Providing fast inference alongside uncertainty estimates remains challenging yet desirable in numerous applications. We propose the Variational Sparse Paired Autoencoder (vsPAIR) to address this challenge. The architecture pairs a standard VAE encoding observations with a sparse VAE encoding quantities of interest, connected through a learned latent mapping. The variational structure enables uncertainty estimation, the paired architecture encourages interpretability by anchoring QoI representations to clean data, and sparse encodings provide structure by concentrating information into identifiable factors rather than diffusing across all dimensions. We also propose modifications to existing sparse VAE methods: a hard-concrete spike-and-slab relaxation for differentiable training and a beta hyperprior for adaptive sparsity levels. To validate the effectiveness of our proposed architecture, we conduct experiments on blind inpainting and computed tomography, demonstrating that vsPAIR is a capable inverse problem solver that can provide interpretable and structured uncertainty estimates. Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA) Cite as: arXiv:2602.02948 [cs.LG] (or arXiv:2602.02948v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.02948 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-79] 3D-Learning: Diffusion-Augmented Distributionally Robust Decision-Focused Learning

链接: https://arxiv.org/abs/2602.02943
作者: Jiaqi Wen,Lei Fan,Jianyi Yang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Predict-then-Optimize (PTO) pipelines are widely employed in computing and networked systems, where Machine Learning (ML) models are used to predict critical contextual information for downstream decision-making tasks such as cloud LLM serving, data center demand response, and edge workload scheduling. However, these ML predictors are often vulnerable to out-of-distribution (OOD) samples at test time, leading to significant decision performance degradation due to large prediction errors. To address the generalization challenges under OOD conditions, we present the framework of Distributionally Robust Decision-Focused Learning (DR-DFL), which trains ML models to optimize decision performance under the worst-case distribution. Instead of relying on classical Distributionally Robust Optimization (DRO) techniques, we propose Diffusion-Augmented Distributionally Robust Decision-Focused Learning (3D-Learning), which searches for the worst-case distribution within the parameterized space of a diffusion model. By leveraging the powerful distribution modeling capabilities of diffusion models, 3D-Learning identifies worst-case distributions that remain consistent with real data, achieving a favorable balance between average and worst-case scenarios. Empirical results on an LLM resource provisioning task demonstrate that 3D-Learning outperforms existing DRO and Data Augmentation methods in OOD generalization performance.

[LG-80] Rare Event Early Detection: A Dataset of Sepsis Onset for Critically Ill Trauma Patients

链接: https://arxiv.org/abs/2602.02930
作者: Yin Jin,Tucker R. Stewart,Deyi Zhou,Chhavi Gupta,Arjita Nema,Scott C. Brakenridge,Grant E. O’Keefe,Juhua Hu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sepsis is a major public health concern due to its high morbidity, mortality, and cost. Its clinical outcome can be substantially improved through early detection and timely intervention. By leveraging publicly available datasets, machine learning (ML) has driven advances in both research and clinical practice. However, existing public datasets consider ICU patients (Intensive Care Unit) as a uniform group and neglect the potential challenges presented by critically ill trauma patients in whom injury-related inflammation and organ dysfunction can overlap with the clinical features of sepsis. We propose that a targeted identification of post-traumatic sepsis is necessary in order to develop methods for early detection. Therefore, we introduce a publicly available standardized post-trauma sepsis onset dataset extracted, relabeled using standardized post-trauma clinical facts, and validated from MIMIC-III. Furthermore, we frame early detection of post-trauma sepsis onset according to clinical workflow in ICUs in a daily basis resulting in a new rare event detection problem. We then establish a general benchmark through comprehensive experiments, which shows the necessity of further advancements using this new dataset. The data code is available at this https URL.

[LG-81] Distance Marching for Generative Modeling

链接: https://arxiv.org/abs/2602.02928
作者: Zimo Wang,Ishit Mehta,Haolin Lu,Chung-En Sun,Ge Yan,Tsui-Wei Weng,Tzu-Mao Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time-unconditional generative models learn time-independent denoising vector fields. But without time conditioning, the same noisy input may correspond to multiple noise levels and different denoising directions, which interferes with the supervision signal. Inspired by distance field modeling, we propose Distance Marching, a new time-unconditional approach with two principled inference methods. Crucially, we design losses that focus on closer targets. This yields denoising directions better directed toward the data manifold. Across architectures, Distance Marching consistently improves FID by 13.5% on CIFAR-10 and ImageNet over recent time-unconditional baselines. For class-conditional ImageNet generation, despite removing time input, Distance Marching surpasses flow matching using our losses and inference methods. It achieves lower FID than flow matching’s final performance using 60% of the sampling steps and 13.6% lower FID on average across backbone sizes. Moreover, our distance prediction is also helpful for early stopping during sampling and for OOD detection. We hope distance field modeling can serve as a principled lens for generative modeling.

[LG-82] How Does the Lagrangian Guide Safe Reinforcement Learning through Diffusion Models?

链接: https://arxiv.org/abs/2602.02924
作者: Xiaoyuan Cheng,Wenxuan Yuan,Boyang Li,Yuanchao Xu,Yiming Yang,Hao Liang,Bei Peng,Robert Loftin,Zhuo Sun,Yukun Hu
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Diffusion policy sampling enables reinforcement learning (RL) to represent multimodal action distributions beyond suboptimal unimodal Gaussian policies. However, existing diffusion-based RL methods primarily focus on offline settings for reward maximization, with limited consideration of safety in online settings. To address this gap, we propose Augmented Lagrangian-Guided Diffusion (ALGD), a novel algorithm for off-policy safe RL. By revisiting optimization theory and energy-based model, we show that the instability of primal-dual methods arises from the non-convex Lagrangian landscape. In diffusion-based safe RL, the Lagrangian can be interpreted as an energy function guiding the denoising dynamics. Counterintuitively, direct usage destabilizes both policy generation and training. ALGD resolves this issue by introducing an augmented Lagrangian that locally convexifies the energy landscape, yielding a stabilized policy generation and training process without altering the distribution of the optimal policy. Theoretical analysis and extensive experiments demonstrate that ALGD is both theoretically grounded and empirically effective, achieving strong and stable performance across diverse environments.

[LG-83] Weighted Temporal Decay Loss for Learning Wearable PPG Data with Sparse Clinical Labels ICASSP2026

链接: https://arxiv.org/abs/2602.02917
作者: Yunsung Chung,Keum San Chun,Migyeong Gwak,Han Feng,Yingshuo Liu,Chanho Lim,Viswam Nathan,Nassir Marrouche,Sharanya Arcot Desai
类目: Machine Learning (cs.LG)
*备注: ICASSP 2026

点击查看摘要

Abstract:Advances in wearable computing and AI have increased interest in leveraging PPG for health monitoring over the past decade. One of the biggest challenges in developing health algorithms based on such biosignals is the sparsity of clinical labels, which makes biosignals temporally distant from lab draws less reliable for supervision. To address this problem, we introduce a simple training strategy that learns a biomarker-specific decay of sample weight over the time gap between a segment and its ground truth label and uses this weight in the loss with a regularizer to prevent trivial solutions. On smartwatch PPG from 450 participants across 10 biomarkers, the approach improves over baselines. In the subject-wise setting, the proposed approach averages 0.715 AUPRC, compared to 0.674 for a fine-tuned self-supervised baseline and 0.626 for a feature-based Random Forest. A comparison of four decay families shows that a simple linear decay function is most robust on average. Beyond accuracy, the learned decay rates summarize how quickly each biomarker’s PPG evidence becomes stale, providing an interpretable view of temporal sensitivity.

[LG-84] Controlled disagreement improves generalization in decentralized training

链接: https://arxiv.org/abs/2602.02899
作者: Zesen Wang,Mikael Johansson
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Decentralized training is often regarded as inferior to centralized training because the consensus errors between workers are thought to undermine convergence and generalization, even with homogeneous data distributions. This work challenges this view by introducing decentralized SGD with Adaptive Consensus (DSGD-AC), which intentionally preserves non-vanishing consensus errors through a time-dependent scaling mechanism. We prove that these errors are not random noise but systematically align with the dominant Hessian subspace, acting as structured perturbations that guide optimization toward flatter minima. Across image classification and machine translation benchmarks, DSGD-AC consistently surpasses both standard DSGD and centralized SGD in test accuracy and solution flatness. Together, these results establish consensus errors as a useful implicit regularizer and open a new perspective on the design of decentralized learning algorithms.

[LG-85] Self-Soupervision: Cooking Model Soups without Labels

链接: https://arxiv.org/abs/2602.02890
作者: Anthony Fuller,James R. Green,Evan Shelhamer
类目: Machine Learning (cs.LG)
*备注: code: this https URL data: this https URL

点击查看摘要

Abstract:Model soups are strange and strangely effective combinations of parameters. They take a model (the stock), fine-tune it into multiple models (the ingredients), and then mix their parameters back into one model (the soup) to improve predictions. While all known soups require supervised learning, and optimize the same loss on labeled data, our recipes for Self-\emphSoupervision generalize soups to self-supervised learning (SSL). Our Self-Souping lets us flavor ingredients on new data sources, e.g. from unlabeled data from a task for transfer or from a shift for robustness. We show that Self-Souping on corrupted test data, then fine-tuning back on uncorrupted train data, boosts robustness by +3.5% (ImageNet-C) and +7% (LAION-C). Self-\emphSoupervision also unlocks countless SSL algorithms to cook the diverse ingredients needed for more robust soups. We show for the first time that ingredients can differ in their SSL hyperparameters – and more surprisingly, in their SSL algorithms. We cook soups of MAE, MoCoV3, and MMCR ingredients that are more accurate than any one single SSL ingredient.

[LG-86] A Geometry-Aware Efficient Algorithm for Compositional Entropic Risk Minimization

链接: https://arxiv.org/abs/2602.02877
作者: Xiyuan Wei,Linli Zhou,Bokun Wang,Chih-Jen Lin,Tianbao Yang
类目: Machine Learning (cs.LG)
*备注: 36 pages, 7 figures

点击查看摘要

Abstract:This paper studies optimization for a family of problems termed \textbfcompositional entropic risk minimization , in which each data’s loss is formulated as a Log-Expectation-Exponential (Log-E-Exp) function. The Log-E-Exp formulation serves as an abstraction of the Log-Sum-Exponential (LogSumExp) function when the explicit summation inside the logarithm is taken over a gigantic number of items and is therefore expensive to evaluate. While entropic risk objectives of this form arise in many machine learning problems, existing optimization algorithms suffer from several fundamental limitations including non-convergence, numerical instability, and slow convergence rates. To address these limitations, we propose a geometry-aware stochastic algorithm, termed \textbfSCENT , for the dual formulation of entropic risk minimization cast as a min–min optimization problem. The key to our design is a \textbfstochastic proximal mirror descent (SPMD) update for the dual variable, equipped with a Bregman divergence induced by a negative exponential function that faithfully captures the geometry of the objective. Our main contributions are threefold: (i) we establish an O(1/\sqrtT) convergence rate of the proposed SCENT algorithm for convex problems; (ii) we theoretically characterize the advantages of SPMD over standard SGD update for optimizing the dual variable; and (iii) we demonstrate the empirical effectiveness of SCENT on extreme classification, partial AUC maximization, contrastive learning and distributionally robust optimization, where it consistently outperforms existing baselines.

[LG-87] Late-Stage Generalization Collapse in Grokking: Detecting anti-grokking with Weightwatcher

链接: https://arxiv.org/abs/2602.02859
作者: Hari K Prakash,Charles H Martin
类目: Machine Learning (cs.LG)
*备注: 27 pages

点击查看摘要

Abstract:\emphMemorization in neural networks lacks a precise operational definition and is often inferred from the grokking regime, where training accuracy saturates while test accuracy remains very low. We identify a previously unreported third phase of grokking in this training regime: \emphanti-grokking, a late-stage collapse of generalization. We revisit two canonical grokking setups: a 3-layer MLP trained on a subset of MNIST and a transformer trained on modular addition, but extended training far beyond standard. In both cases, after models transition from pre-grokking to successful generalization, test accuracy collapses back to chance while training accuracy remains perfect, indicating a distinct post-generalization failure mode. To diagnose anti-grokking, we use the open-source \textttWeightWatcher tool based on HTSR/SETOL theory. The primary signal is the emergence of \emphCorrelation Traps: anomalously large eigenvalues beyond the Marchenko–Pastur bulk in the empirical spectral density of shuffled weight matrices, which are predicted to impair generalization. As a secondary signal, anti-grokking corresponds to the average HTSR layer quality metric \alpha deviating from 2.0 . Neither metric requires access to the test or training data. We compare these signals to alternative grokking diagnostic, including \ell_2 norms, Activation Sparsity, Absolute Weight Entropy, and Local Circuit Complexity. These track pre-grokking and grokking but fail to identify anti-grokking. Finally, we show that Correlation Traps can induce catastrophic forgetting and/or prototype memorization, and observe similar pathologies in large-scale LLMs, like OSS GPT 20/120B. Comments: 27 pages Subjects: Machine Learning (cs.LG) Cite as: arXiv:2602.02859 [cs.LG] (or arXiv:2602.02859v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.02859 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Hari Prakash [view email] [v1] Mon, 2 Feb 2026 22:09:14 UTC (2,840 KB)

[LG-88] IMAGINE: Intelligent Multi-Agent Godot-based Indoor Networked Exploration

链接: https://arxiv.org/abs/2602.02858
作者: Tiago Leite,Maria Conceição,António Grilo
类目: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
*备注: 12 pages, submitted to a journal

点击查看摘要

Abstract:The exploration of unknown, Global Navigation Satellite System (GNSS) denied environments by an autonomous communication-aware and collaborative group of Unmanned Aerial Vehicles (UAVs) presents significant challenges in coordination, perception, and decentralized decision-making. This paper implements Multi-Agent Reinforcement Learning (MARL) to address these challenges in a 2D indoor environment, using high-fidelity game-engine simulations (Godot) and continuous action spaces. Policy training aims to achieve emergent collaborative behaviours and decision-making under uncertainty using Network-Distributed Partially Observable Markov Decision Processes (ND-POMDPs). Each UAV is equipped with a Light Detection and Ranging (LiDAR) sensor and can share data (sensor measurements and a local occupancy map) with neighbouring agents. Inter-agent communication constraints include limited range, bandwidth and latency. Extensive ablation studies evaluated MARL training paradigms, reward function, communication system, neural network (NN) architecture, memory mechanisms, and POMDP formulations. This work jointly addresses several key limitations in prior research, namely reliance on discrete actions, single-agent or centralized formulations, assumptions of a priori knowledge and permanent connectivity, inability to handle dynamic obstacles, short planning horizons and architectural complexity in Recurrent NNs/Transformers. Results show that the scalable training paradigm, combined with a simplified architecture, enables rapid autonomous exploration of an indoor area. The implementation of Curriculum-Learning (five increasingly complex levels) also enabled faster, more robust training. This combination of high-fidelity simulation, MARL formulation, and computational efficiency establishes a strong foundation for deploying learned cooperative strategies in physical robotic systems.

[LG-89] When pre-training hurts LoRA fine-tuning: a dynamical analysis via single-index models

链接: https://arxiv.org/abs/2602.02855
作者: Gibbs Nwemadji,Bruno Loureiro,Jean Barbier
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Pre-training on a source task is usually expected to facilitate fine-tuning on similar downstream problems. In this work, we mathematically show that this naive intuition is not always true: excessive pre-training can computationally slow down fine-tuning optimization. We study this phenomenon for low-rank adaptation (LoRA) fine-tuning on single-index models trained under one-pass SGD. Leveraging a summary statistics description of the fine-tuning dynamics, we precisely characterize how the convergence rate depends on the initial fine-tuning alignment and the degree of non-linearity of the target task. The key take away is that even when the pre-training and down- stream tasks are well aligned, strong pre-training can induce a prolonged search phase and hinder convergence. Our theory thus provides a unified picture of how pre-training strength and task difficulty jointly shape the dynamics and limitations of LoRA fine-tuning in a nontrivial tractable model.

[LG-90] Recurrent Equivariant Constraint Modulation: Learning Per-Layer Symmetry Relaxation from Data

链接: https://arxiv.org/abs/2602.02853
作者: Stefanos Pertigkiozoglou,Mircea Petrache,Shubhendu Trivedi,Kostas Daniilidis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Equivariant neural networks exploit underlying task symmetries to improve generalization, but strict equivariance constraints can induce more complex optimization dynamics that can hinder learning. Prior work addresses these limitations by relaxing strict equivariance during training, but typically relies on prespecified, explicit, or implicit target levels of relaxation for each network layer, which are task-dependent and costly to tune. We propose Recurrent Equivariant Constraint Modulation (RECM), a layer-wise constraint modulation mechanism that learns appropriate relaxation levels solely from the training signal and the symmetry properties of each layer’s input-target distribution, without requiring any prior knowledge about the task-dependent target relaxation level. We demonstrate that under the proposed RECM update, the relaxation level of each layer provably converges to a value upper-bounded by its symmetry gap, namely the degree to which its input-target distribution deviates from exact symmetry. Consequently, layers processing symmetric distributions recover full equivariance, while those with approximate symmetries retain sufficient flexibility to learn non-symmetric solutions when warranted by the data. Empirically, RECM outperforms prior methods across diverse exact and approximate equivariant tasks, including the challenging molecular conformer generation on the GEOM-Drugs dataset.

[LG-91] Zero Sum SVD: Balancing Loss Sensitivity for Low Rank LLM Compression

链接: https://arxiv.org/abs/2602.02848
作者: Ali Abbasi,Chayne Thrash,Haoran Qin,Shansita Sharma,Sepehr Seifi,Soheil Kolouri
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Advances in large language models have driven strong performance across many tasks, but their memory and compute costs still hinder deployment. SVD-based compression reduces storage and can speed up inference via low-rank factors, yet performance depends on how rank is allocated under a global compression ratio. Prior methods often use homogeneous ranks for similarly sized matrices, despite large differences in loss sensitivity, or rely on expensive iterative pre-truncation optimization to determine per matrix ranks. We propose \textbfZero Sum SVD (\textbfZS-SVD), a post-training method that performs \emphglobal singular component selection using activation whitening and first-order calibration loss estimates in whitened coordinates. \textbfZS-SVD prunes components across the whole model with a \textbfzero sum rule that keeps the cumulative predicted loss change near zero, automatically yielding heterogeneous ranks without solving a rank allocation optimization. Motivated by evidence that gradients near pretrained solutions exhibit low rank structure, we also introduce an optional lightweight correction that applies a \textbfsingle projected gradient update after truncation, followed by re-truncation. Extensive experiments across multiple LLM architectures show consistent gains across diverse benchmarks and compression ratios. Code is available at this https URL

[LG-92] Beyond Content: Behavioral Policies Reveal Actors in Information Operations

链接: https://arxiv.org/abs/2602.02838
作者: Philipp J. Schneider,Lanqin Yuan,Marian-Andrei Rizoiu
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The detection of online influence operations – coordinated campaigns by malicious actors to spread narratives – has traditionally depended on content analysis or network features. These approaches are increasingly brittle as generative models produce convincing text, platforms restrict access to behavioral data, and actors migrate to less-regulated spaces. We introduce a platform-agnostic framework that identifies malicious actors from their behavioral policies by modeling user activity as sequential decision processes. We apply this approach to 12,064 Reddit users, including 99 accounts linked to the Russian Internet Research Agency in Reddit’s 2017 transparency report, analyzing over 38 million activity steps from 2015-2018. Activity-based representations, which model how users act rather than what they post, consistently outperform content models in detecting malicious accounts. When distinguishing trolls – users engaged in coordinated manipulation – from ordinary users, policy-based classifiers achieve a median macro- F_1 of 94.9%, compared to 91.2% for text embeddings. Policy features also enable earlier detection from short traces and degrade more gracefully under evasion strategies or data corruption. These findings show that behavioral dynamics encode stable, discriminative signals of manipulation and point to resilient, cross-platform detection strategies in the era of synthetic content and limited data access.

[LG-93] Koopman Autoencoders with Continuous-Time Latent Dynamics for Fluid Dynamics Forecasting

链接: https://arxiv.org/abs/2602.02832
作者: Rares Grozavescu,Pengyu Zhang,Etienne Meunier,Mark Girolami
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:Data-driven surrogate models have emerged as powerful tools for accelerating the simulation of turbulent flows. However, classical approaches which perform autoregressive rollouts often trade off between strong short-term accuracy and long-horizon stability. Koopman autoencoders, inspired by Koopman operator theory, provide a physics-based alternative by mapping nonlinear dynamics into a latent space where linear evolution is conducted. In practice, most existing formulations operate in a discrete-time setting, limiting temporal flexibility. In this work, we introduce a continuous-time Koopman framework that models latent evolution through numerical integration schemes. By allowing variable timesteps at inference, the method demonstrates robustness to temporal resolution and generalizes beyond training regimes. In addition, the learned dynamics closely adhere to the analytical matrix exponential solution, enabling efficient long-horizon forecasting. We evaluate the approach on classical CFD benchmarks and report accuracy, stability, and extrapolation properties.

[LG-94] SC3D: Dynamic and Differentiable Causal Discovery for Temporal and Instantaneous Graphs

链接: https://arxiv.org/abs/2602.02830
作者: Sourajit Das,Dibyajyoti Chakraborthy,Romit Maulik
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 8 pages

点击查看摘要

Abstract:Discovering causal structures from multivariate time series is a key problem because interactions span across multiple lags and possibly involve instantaneous dependencies. Additionally, the search space of the dynamic graphs is combinatorial in nature. In this study, we propose \textitStable Causal Dynamic Differentiable Discovery (SC3D), a two-stage differentiable framework that jointly learns lag-specific adjacency matrices and, if present, an instantaneous directed acyclic graph (DAG). In Stage 1, SC3D performs edge preselection through node-wise prediction to obtain masks for lagged and instantaneous edges, whereas Stage 2 refines these masks by optimizing a likelihood with sparsity along with enforcing acyclicity on the instantaneous block. Numerical results across synthetic and benchmark dynamical systems demonstrate that SC3D achieves improved stability and more accurate recovery of both lagged and instantaneous causal structures compared to existing temporal baselines.

[LG-95] A Single Revision Step Improves Token-Efficient LLM Reasoning

链接: https://arxiv.org/abs/2602.02828
作者: Yingchuan Zhang,Terry Ma,Wenxuan Zhong,Ping Ma
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) achieve higher accuracy on challenging reasoning tasks by scaling test-time compute through multiple trajectory sampling. However, standard aggregation methods like majority voting or individual confidence-based filtering face a fundamental “blind spot”: they evaluate each trace in isolation. As problems scale in difficulty, models often generate hallucinated paths that exhibit misleadingly high confidence, causing the true solution to be suppressed by a narrow margin in traditional voting. We ask: can we enable traces to “peer-review” each other to resolve these near-miss errors? We introduce Packet-Conditioned Revision (PACER), a training-free, inference-only framework that enables reasoning traces to revise their conclusions through a structured coordination step. After a preliminary screening of generated traces, PACER constructs a compact consensus packet containing (i) unique candidate answers, (ii) their aggregated confidence scores, and (iii) representative reasoning summaries for each candidate answer. Individual traces then perform a targeted self-review conditioned on this packet, allowing them to identify specific logical junctions where they diverged from the broader consensus and pivot if their original reasoning is found to be flawed. Final predictions are obtained via confidence-weighted voting over these revised trajectories. On challenging competitive math benchmarks such as AIME and BRUMO, PACER matches or exceeds the accuracy of 256-sample majority voting, significantly outperforming raw ensemble baselines by transforming simple consensus into a collaborative logical refinement process. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2602.02828 [cs.LG] (or arXiv:2602.02828v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.02828 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-96] Membership Inference Attacks from Causal Principles

链接: https://arxiv.org/abs/2602.02819
作者: Mathieu Even,Clément Berenfeld,Linus Bleistein,Tudor Cebere,Julie Josse,Aurélien Bellet
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Membership Inference Attacks (MIAs) are widely used to quantify training data memorization and assess privacy risks. Standard evaluation requires repeated retraining, which is computationally costly for large models. One-run methods (single training with randomized data inclusion) and zero-run methods (post hoc evaluation) are often used instead, though their statistical validity remains unclear. To address this gap, we frame MIA evaluation as a causal inference problem, defining memorization as the causal effect of including a data point in the training set. This novel formulation reveals and formalizes key sources of bias in existing protocols: one-run methods suffer from interference between jointly included points, while zero-run evaluations popular for LLMs are confounded by non-random membership assignment. We derive causal analogues of standard MIA metrics and propose practical estimators for multi-run, one-run, and zero-run regimes with non-asymptotic consistency guarantees. Experiments on real-world data show that our approach enables reliable memorization measurement even when retraining is impractical and under distribution shift, providing a principled foundation for privacy evaluation in modern AI systems.

[LG-97] LEMON: Local Explanations via Modality-aware OptimizatioN

链接: https://arxiv.org/abs/2602.02786
作者: Yu Qin,Phillip Sloan,Raul Santos-Rodriguez,Majid Mirmehdi,Telmo de Menezes e Silva Filho
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multimodal models are ubiquitous, yet existing explainability methods are often single-modal, architecture-dependent, or too computationally expensive to run at scale. We introduce LEMON (Local Explanations via Modality-aware OptimizatioN), a model-agnostic framework for local explanations of multimodal predictions. LEMON fits a single modality-aware surrogate with group-structured sparsity to produce unified explanations that disentangle modality-level contributions and feature-level attributions. The approach treats the predictor as a black box and is computationally efficient, requiring relatively few forward passes while remaining faithful under repeated perturbations. We evaluate LEMON on vision-language question answering and a clinical prediction task with image, text, and tabular inputs, comparing against representative multimodal baselines. Across backbones, LEMON achieves competitive deletion-based faithfulness while reducing black-box evaluations by 35-67 times and runtime by 2-8 times compared to strong multimodal baselines.

[LG-98] VerIde ECG Biometrics: Verification and Identification

链接: https://arxiv.org/abs/2602.02776
作者: Scagnetto Arjuna
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This work studies electrocardiogram (ECG) biometrics at large scale, evaluating how strongly an ECG can be linked to an individual and, consequently, how its anonymization may be compromised. We show that identity information is already present in tabular representations (fiducial features): even a simple MLP-based embedding network yields non-trivial performance, indicating that anonymization based solely on releasing features does not guarantee privacy. We then adopt embedding-based deep learning models (ArcFace), first on features and then on ECG waveforms, showing a performance jump when moving from tabular inputs to waveforms, and a further gain with larger training sets and consistent normalization across train/val/test. On a large-scale test set, verification achieves high TAR at strict FAR thresholds (TAR=0.908 @ FAR=1e-3; TAR=0.820 @ FAR=1e-4) with EER=2.53% (all-vs-all); closed-set identification yields Rank@1=0.812 and Rank@10=0.910. In open-set, a two-stage pipeline (top-K shortlist on embeddings + re-ranking) reaches DIR@FAR up to 0.976 at FAR=1e-3 and 1e-4. Overall, the results show that ECG carries a measurable individual signature: re-identification is already possible with tabular features and is further amplified by embedding-based models, making privacy implications and realistic operational protocols essential to consider.

[LG-99] BiTimeCrossNet: Time-Aware Self-Supervised Learning for Pediatric Sleep

链接: https://arxiv.org/abs/2602.02769
作者: Saurav Raj Pandey,Harlin Lee
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present BiTimeCrossNet (BTCNet), a multimodal self-supervised learning framework for long physiological recordings such as overnight sleep studies. While many existing approaches train on short segments treated as independent samples, BTCNet incorporates information about when each segment occurs within its parent recording, for example within a sleep session. BTCNet further learns pairwise interactions between physiological signals via cross-attention, without requiring task labels or sequence-level supervision. We evaluate BTCNet on pediatric sleep data across six downstream tasks, including sleep staging, arousal detection, and respiratory event detection. Under frozen-backbone linear probing, BTCNet consistently outperforms an otherwise identical non-time-aware variant, with gains that generalize to an independent pediatric dataset. Compared to existing multimodal self-supervised sleep models, BTCNet achieves strong performance, particularly on respiration-related tasks. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2602.02769 [cs.LG] (or arXiv:2602.02769v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2602.02769 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-100] Exposing Vulnerabilities in Explanation for Time Series Classifiers via Dual-Target Attacks

链接: https://arxiv.org/abs/2602.02763
作者: Bohan Wang,Zewen Liu,Lu Lin,Hui Liu,Li Xiong,Ming Jin,Wei Jin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Interpretable time series deep learning systems are often assessed by checking temporal consistency on explanations, implicitly treating this as evidence of robustness. We show that this assumption can fail: Predictions and explanations can be adversarially decoupled, enabling targeted misclassification while the explanation remains plausible and consistent with a chosen reference rationale. We propose TSEF (Time Series Explanation Fooler), a dual-target attack that jointly manipulates the classifier and explainer outputs. In contrast to single-objective misclassification attacks that disrupt explanation and spread attribution mass broadly, TSEF achieves targeted prediction changes while keeping explanations consistent with the reference. Across multiple datasets and explainer backbones, our results consistently reveal that explanation stability is a misleading proxy for decision robustness and motivate coupling-aware robustness evaluations for trustworthy time series tasks.

[LG-101] On the Sample Efficiency of Inverse Dynamics Models for Semi-Supervised Imitation Learning

链接: https://arxiv.org/abs/2602.02762
作者: Sacha Morin,Moonsub Byeon,Alexia Jolicoeur-Martineau,Sébastien Lachapelle
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Semi-supervised imitation learning (SSIL) consists in learning a policy from a small dataset of action-labeled trajectories and a much larger dataset of action-free trajectories. Some SSIL methods learn an inverse dynamics model (IDM) to predict the action from the current state and the next state. An IDM can act as a policy when paired with a video model (VM-IDM) or as a label generator to perform behavior cloning on action-free data (IDM labeling). In this work, we first show that VM-IDM and IDM labeling learn the same policy in a limit case, which we call the IDM-based policy. We then argue that the previously observed advantage of IDM-based policies over behavior cloning is due to the superior sample efficiency of IDM learning, which we attribute to two causes: (i) the ground-truth IDM tends to be contained in a lower complexity hypothesis class relative to the expert policy, and (ii) the ground-truth IDM is often less stochastic than the expert policy. We argue these claims based on insights from statistical learning theory and novel experiments, including a study of IDM-based policies using recent architectures for unified video-action prediction (UVA). Motivated by these insights, we finally propose an improved version of the existing LAPO algorithm for latent action policy learning.

[LG-102] abPFN for Zero-shot Parametric Engineering Design Generation

链接: https://arxiv.org/abs/2602.02735
作者: Ke Wang,Yifan Tang,Nguyen Gia Hien Vu,Faez Ahmed,G. Gary Wang
类目: Machine Learning (cs.LG)
*备注: 14 pages, 8 figures

点击查看摘要

Abstract:Deep generative models for engineering design often require substantial computational cost, large training datasets, and extensive retraining when design requirements or datasets change, limiting their applicability in real-world engineering design workflow. In this work, we propose a zero-shot generation framework for parametric engineering design based on TabPFN, enabling conditional design generation using only a limited number of reference samples and without any task-specific model training or fine-tuning. The proposed method generates design parameters sequentially conditioned on target performance indicators, providing a flexible alternative to conventional generative models. The effectiveness of the proposed approach is evaluated on three engineering design datasets, i.e., ship hull design, BlendedNet aircraft, and UIUC airfoil. Experimental results demonstrate that the proposed method achieves competitive diversity across highly structured parametric design spaces, remains robust to variations in sampling, resolution and parameter dimensionality of geometry generation, and achieves a low performance error (e.g., less than 2% in generated ship hull designs’ performance). Compared with diffusion-based generative models, the proposed framework significantly reduces computational overhead and data requirements while preserving reliable generation performance. These results highlight the potential of zero-shot, data-efficient generation as a practical and efficient tool for engineering design, enabling rapid deployment, flexible adaptation to new design settings, and ease of integration into real-world engineering workflows.

[LG-103] Automated Dysphagia Screening Using Noninvasive Neck Acoustic Sensing ICASSP2026

链接: https://arxiv.org/abs/2602.02725
作者: Jade Chng,Rong Xing,Yunfei Luo,Kristen Linnemeyer-Risser,Tauhidur Rahman,Andrew Yousef,Philip A Weissbrod
类目: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
*备注: Accepted to 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)

点击查看摘要

Abstract:Pharyngeal health plays a vital role in essential human functions such as breathing, swallowing, and vocalization. Early detection of swallowing abnormalities, also known as dysphagia, is crucial for timely intervention. However, current diagnostic methods often rely on radiographic imaging or invasive procedures. In this study, we propose an automated framework for detecting dysphagia using portable and noninvasive acoustic sensing coupled with applied machine learning. By capturing subtle acoustic signals from the neck during swallowing tasks, we aim to identify patterns associated with abnormal physiological conditions. Our approach achieves promising test-time abnormality detection performance, with an AUC-ROC of 0.904 under 5 independent train-test splits. This work demonstrates the feasibility of using noninvasive acoustic sensing as a practical and scalable tool for pharyngeal health monitoring.

[LG-104] Neural Probabilistic Amplitude Shaping for Nonlinear Fiber Channels

链接: https://arxiv.org/abs/2602.02716
作者: Mohammad Taha Askari,Lutz Lampe,Amirhossein Ghazisaeidi
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 3 pages, 2 figures, Submitted to Optical Fiber Communication Conference (OFC) 2026

点击查看摘要

Abstract:We introduce neural probabilistic amplitude shaping, a joint-distribution learning framework for coherent fiber systems. The proposed scheme provides a 0.5 dB signal-to-noise ratio gain over sequence selection for dual-polarized 64-QAM transmission across a single-span 205 km link.

[LG-105] Maximum Likelihood Reinforcement Learning

链接: https://arxiv.org/abs/2602.02710
作者: Fahim Tajwar,Guanning Zeng,Yueer Zhou,Yuda Song,Daman Arora,Yiding Jiang,Jeff Schneider,Ruslan Salakhutdinov,Haiwen Feng,Andrea Zanette
类目: Machine Learning (cs.LG)
*备注: Project website and code: this https URL

点击查看摘要

Abstract:Reinforcement learning is the method of choice to train models in sampling-based setups with binary outcome feedback, such as navigation, code generation, and mathematical problem solving. In such settings, models implicitly induce a likelihood over correct rollouts. However, we observe that reinforcement learning does not maximize this likelihood, and instead optimizes only a lower-order approximation. Inspired by this observation, we introduce Maximum Likelihood Reinforcement Learning (MaxRL), a sampling-based framework to approximate maximum likelihood using reinforcement learning techniques. MaxRL addresses the challenges of non-differentiable sampling by defining a compute-indexed family of sample-based objectives that interpolate between standard reinforcement learning and exact maximum likelihood as additional sampling compute is allocated. The resulting objectives admit a simple, unbiased policy-gradient estimator and converge to maximum likelihood optimization in the infinite-compute limit. Empirically, we show that MaxRL Pareto-dominates existing methods in all models and tasks we tested, achieving up to 20x test-time scaling efficiency gains compared to its GRPO-trained counterpart. We also observe MaxRL to scale better with additional data and compute. Our results suggest MaxRL is a promising framework for scaling RL training in correctness based settings.

[LG-106] NSC-SL: A Bandwidth-Aware Neural Subspace Compression for Communication-Efficient Split Learning

链接: https://arxiv.org/abs/2602.02696
作者: Zhen Fang,Miao Yang,Zehang Lin,Zheng Lin,Zihan Fang,Zongyuan Zhang,Tianyang Duan,Dong Huang,Shunzhi Zhu
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: 5 pages, 3 figures

点击查看摘要

Abstract:The expanding scale of neural networks poses a major challenge for distributed machine learning, particularly under limited communication resources. While split learning (SL) alleviates client computational burden by distributing model layers between clients and server, it incurs substantial communication overhead from frequent transmission of intermediate activations and gradients. To tackle this issue, we propose NSC-SL, a bandwidth-aware adaptive compression algorithm for communication-efficient SL. NSC-SL first dynamically determines the optimal rank of low-rank approximation based on the singular value distribution for adapting real-time bandwidth constraints. Then, NSC-SL performs error-compensated tensor factorization using alternating orthogonal iteration with residual feedback, effectively minimizing truncation loss. The collaborative mechanisms enable NSC-SL to achieve high compression ratios while preserving semantic-rich information essential for convergence. Extensive experiments demonstrate the superb performance of NSC-SL.

[LG-107] Expert-Data Alignment Governs Generation Quality in Decentralized Diffusion Models

链接: https://arxiv.org/abs/2602.02685
作者: Marcos Villagra,Bidhan Roy,Raihan Seraj,Zhiying Jiang
类目: Machine Learning (cs.LG)
*备注: 15 pages, 4 figures

点击查看摘要

Abstract:Decentralized Diffusion Models (DDMs) route denoising through experts trained independently on disjoint data clusters, which can strongly disagree in their predictions. What governs the quality of generations in such systems? We present the first ever systematic investigation of this question. A priori, the expectation is that minimizing denoising trajectory sensitivity – minimizing how perturbations amplify during sampling – should govern generation quality. We demonstrate this hypothesis is incorrect: a stability-quality dissociation. Full ensemble routing, which combines all expert predictions at each step, achieves the most stable sampling dynamics and best numerical convergence while producing the worst generation quality (FID 47.9 vs. 22.6 for sparse Top-2 routing). Instead, we identify expert-data alignment as the governing principle: generation quality depends on routing inputs to experts whose training distribution covers the current denoising state. Across two distinct DDM systems, we validate expert-data alignment using (i) data-cluster distance analysis, confirming sparse routing selects experts with data clusters closest to the current denoising state, and (ii) per-expert analysis, showing selected experts produce more accurate predictions than non-selected ones, and (iii) expert disagreement analysis, showing quality degrades when experts disagree. For DDM deployment, our findings establish that routing should prioritize expert-data alignment over numerical stability metrics.

[LG-108] FlexRank: Nested Low-Rank Knowledge Decomposition for Adaptive Model Deployment

链接: https://arxiv.org/abs/2602.02680
作者: Riccardo Zaccone,Stefanos Laskaridis,Marco Ciccone,Samuel Horváth
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The growing scale of deep neural networks, encompassing large language models (LLMs) and vision transformers (ViTs), has made training from scratch prohibitively expensive and deployment increasingly costly. These models are often used as computational monoliths with fixed cost, a rigidity that does not leverage overparametrized architectures and largely hinders adaptive deployment across different cost budgets. We argue that importance-ordered nested components can be extracted from pretrained models, and selectively activated on the available computational budget. To this end, our proposed FlexRank method leverages low-rank weight decomposition with nested, importance-based consolidation to extract submodels of increasing capabilities. Our approach enables a “train-once, deploy-everywhere” paradigm that offers a graceful trade-off between cost and performance without training from scratch for each budget - advancing practical deployment of large models.

[LG-109] hSNMF: Hybrid Spatially Regularized NMF for Image-Derived Spatial Transcriptomics

链接: https://arxiv.org/abs/2602.02638
作者: Md Ishtyaq Mahmud,Veena Kochat,Suresh Satpati,Jagan Mohan Reddy Dwarampudi,Humaira Anzum,Kunal Rai,Tania Banerjee
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: The paper is accepted to the 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI); 5 pages, 1 figure

点击查看摘要

Abstract:High-resolution spatial transcriptomics platforms, such as Xenium, generate single-cell images that capture both molecular and spatial context, but their extremely high dimensionality poses major challenges for representation learning and clustering. In this study, we analyze data from the Xenium platform, which captures high-resolution images of tumor microarray (TMA) tissues and converts them into cell-by-gene matrices suitable for computational analysis. We benchmark and extend nonnegative matrix factorization (NMF) for spatial transcriptomics by introducing two spatially regularized variants. First, we propose Spatial NMF (SNMF), a lightweight baseline that enforces local spatial smoothness by diffusing each cell’s NMF factor vector over its spatial neighborhood. Second, we introduce Hybrid Spatial NMF (hSNMF), which performs spatially regularized NMF followed by Leiden clustering on a hybrid adjacency that integrates spatial proximity (via a contact-radius graph) and transcriptomic similarity through a tunable mixing parameter alpha. Evaluated on a cholangiocarcinoma dataset, SNMF and hSNMF achieve markedly improved spatial compactness (CHAOS 0.004, Moran’s I 0.96), greater cluster separability (Silhouette 0.12, DBI 1.8), and higher biological coherence (CMC and enrichment) compared to other spatial baselines. Availability and implementation: this https URL

[LG-110] A Reduction from Delayed to Immediate Feedback for Online Convex Optimization with Improved Guarantees

链接: https://arxiv.org/abs/2602.02634
作者: Alexander Ryabchenko,Idan Attias,Daniel M. Roy
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We develop a reduction-based framework for online learning with delayed feedback that recovers and improves upon existing results for both first-order and bandit convex optimization. Our approach introduces a continuous-time model under which regret decomposes into a delay-independent learning term and a delay-induced drift term, yielding a delay-adaptive reduction that converts any algorithm for online linear optimization into one that handles round-dependent delays. For bandit convex optimization, we significantly improve existing regret bounds, with delay-dependent terms matching state-of-the-art first-order rates. For first-order feedback, we recover state-of-the-art regret bounds via a simpler, unified analysis. Quantitatively, for bandit convex optimization we obtain O(\sqrtd_\texttot + T^\frac34\sqrtk) regret, improving the delay-dependent term from O(\min\sqrtT d_\textmax,(Td_\texttot)^\frac13) in previous work to O(\sqrtd_\texttot) . Here, k , T , d_\textmax , and d_\texttot denote the dimension, time horizon, maximum delay, and total delay, respectively. Under strong convexity, we achieve O(\min\sigma_\textmax \ln T, \sqrtd_\texttot\ + (T^2\ln T)^\frac13 k^\frac23) , improving the delay-dependent term from O(d_\textmax \ln T) in previous work to O(\min\sigma_\textmax \ln T, \sqrtd_\texttot) , where \sigma_\textmax denotes the maximum number of outstanding observations and may be considerably smaller than d_\textmax .

[LG-111] Learning Better Certified Models from Empirically-Robust Teachers

链接: https://arxiv.org/abs/2602.02626
作者: Alessandro De Palma
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Adversarial training attains strong empirical robustness to specific adversarial attacks by training on concrete adversarial perturbations, but it produces neural networks that are not amenable to strong robustness certificates through neural network verification. On the other hand, earlier certified training schemes directly train on bounds from network relaxations to obtain models that are certifiably robust, but display sub-par standard performance. Recent work has shown that state-of-the-art trade-offs between certified robustness and standard performance can be obtained through a family of losses combining adversarial outputs and neural network bounds. Nevertheless, differently from empirical robustness, verifiability still comes at a significant cost in standard performance. In this work, we propose to leverage empirically-robust teachers to improve the performance of certifiably-robust models through knowledge distillation. Using a versatile feature-space distillation objective, we show that distillation from adversarially-trained teachers consistently improves on the state-of-the-art in certified training for ReLU networks across a series of robust computer vision benchmarks.

[LG-112] Position: 3D Gaussian Splatting Watermarking Should Be Scenario-Driven and Threat-Model Explicit

链接: https://arxiv.org/abs/2602.02602
作者: Yangfan Deng,Anirudh Nakra,Min Wu
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:3D content acquisition and creation are expanding rapidly in the new era of machine learning and AI. 3D Gaussian Splatting (3DGS) has become a promising high-fidelity and real-time representation for 3D content. Similar to the initial wave of digital audio-visual content at the turn of the millennium, the demand for intellectual property protection is also increasing, since explicit and editable 3D parameterization makes unauthorized use and dissemination easier. In this position paper, we argue that effective progress in watermarking 3D assets requires articulated security objectives and realistic threat models, incorporating the lessons learned from digital audio-visual asset protection over the past decades. To address this gap in security specification and evaluation, we advocate a scenario-driven formulation, in which adversarial capabilities are formalized through a security model. Based on this formulation, we construct a reference framework that organizes existing methods and clarifies how specific design choices map to corresponding adversarial assumptions. Within this framework, we also examine a legacy spread-spectrum embedding scheme, characterizing its advantages and limitations and highlighting the important trade-offs it entails. Overall, this work aims to foster effective intellectual property protection for 3D assets.

[LG-113] Fubini Study geometry of representation drift in high dimensional data

链接: https://arxiv.org/abs/2602.02596
作者: Arturo Tozzi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 8 pages, 1 figure

点击查看摘要

Abstract:High dimensional representation drift is commonly quantified using Euclidean or cosine distances, which presuppose fixed coordinates when comparing representations across time, training or preprocessing stages. While effective in many settings, these measures entangle intrinsic changes in the data with variations induced by arbitrary parametrizations. We introduce a projective geometric view of representation drift grounded in the Fubini Study metric, which identifies representations that differ only by gauge transformations such as global rescalings or sign flips. Applying this framework to empirical high dimensional datasets, we explicitly construct representation trajectories and track their evolution through cumulative geometric drift. Comparing Euclidean, cosine and Fubini Study distances along these trajectories reveals that conventional metrics systematically overestimate change whenever representations carry genuine projective ambiguity. By contrast, the Fubini Study metric isolates intrinsic evolution by remaining invariant under gauge-induced fluctuations. We further show that the difference between cosine and Fubini Study drift defines a computable, monotone quantity that directly captures representation churn attributable to gauge freedom. This separation provides a diagnostic for distinguishing meaningful structural evolution from parametrization artifacts, without introducing model-specific assumptions. Overall, we establish a geometric criterion for assessing representation stability in high-dimensional systems and clarify the limits of angular distances. Embedding representation dynamics in projective space connects data analysis with established geometric programs and yields observables that are directly testable in empirical workflows.

[LG-114] Copula-Based Aggregation and Context-Aware Conformal Prediction for Reliable Renewable Energy Forecasting

链接: https://arxiv.org/abs/2602.02583
作者: Alireza Moradi,Mathieu Tanneau,Reza Zandehshahvar,Pascal Van Hentenryck
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:The rapid growth of renewable energy penetration has intensified the need for reliable probabilistic forecasts to support grid operations at aggregated (fleet or system) levels. In practice, however, system operators often lack access to fleet-level probabilistic models and instead rely on site-level forecasts produced by heterogeneous third-party providers. Constructing coherent and calibrated fleet-level probabilistic forecasts from such inputs remains challenging due to complex cross-site dependencies and aggregation-induced miscalibration. This paper proposes a calibrated probabilistic aggregation framework that directly converts site-level probabilistic forecasts into reliable fleet-level forecasts in settings where system-level models cannot be trained or maintained. The framework integrates copula-based dependence modeling to capture cross-site correlations with Context-Aware Conformal Prediction (CACP) to correct miscalibration at the aggregated level. This combination enables dependence-aware aggregation while providing valid coverage and maintaining sharp prediction intervals. Experiments on large-scale solar generation datasets from MISO, ERCOT, and SPP demonstrate that the proposed Copula+CACP approach consistently achieves near-nominal coverage with significantly sharper intervals than uncalibrated aggregation baselines.

[LG-115] Mitigating Task-Order Sensitivity and Forgetting via Hierarchical Second-Order Consolidation

链接: https://arxiv.org/abs/2602.02568
作者: Protik Nag,Krishnan Raghavan,Vignesh Narayanan
类目: Machine Learning (cs.LG)
*备注: 21 pages, 8 figures

点击查看摘要

Abstract:We introduce \textbfHierarchical Taylor Series-based Continual Learning (HTCL) , a framework that couples fast local adaptation with conservative, second-order global consolidation to address the high variance introduced by random task ordering. To address task-order effects, HTCL identifies the best intra-group task sequence and integrates the resulting local updates through a Hessian-regularized Taylor expansion, yielding a consolidation step with theoretical guarantees. The approach naturally extends to an L -level hierarchy, enabling multiscale knowledge integration in a manner not supported by conventional single-level CL systems. Across a wide range of datasets and replay and regularization baselines, HTCL acts as a model-agnostic consolidation layer that consistently enhances performance, yielding mean accuracy gains of 7% to 25% while reducing the standard deviation of final accuracy by up to 68% across random task permutations.

[LG-116] Label Curation Using Agent ic AI

链接: https://arxiv.org/abs/2602.02564
作者: Subhodeep Ghosh,Bayan Divaaniaazar,Md Ishat-E-Rabban,Spencer Clarke,Senjuti Basu Roy
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Data annotation is essential for supervised learning, yet producing accurate, unbiased, and scalable labels remains challenging as datasets grow in size and modality. Traditional human-centric pipelines are costly, slow, and prone to annotator variability, motivating reliability-aware automated annotation. We present AURA (Agentic AI for Unified Reliability Modeling and Annotation Aggregation), an agentic AI framework for large-scale, multi-modal data annotation. AURA coordinates multiple AI agents to generate and validate labels without requiring ground truth. At its core, AURA adapts a classical probabilistic model that jointly infers latent true labels and annotator reliability via confusion matrices, using Expectation-Maximization to reconcile conflicting annotations and aggregate noisy predictions. Across the four benchmark datasets evaluated, AURA achieves accuracy improvements of up to 5.8% over baseline. In more challenging settings with poor quality annotators, the improvement is up to 50% over baseline. AURA also accurately estimates the reliability of annotators, allowing assessment of annotator quality even without any pre-validation steps.

[LG-117] A General ReLearner: Empowering Spatiotemporal Prediction by Re-learning Input-label Residual

链接: https://arxiv.org/abs/2602.02563
作者: Jiaming Ma,Binwu Wang,Pengkun Wang,Xu Wang,Zhengyang Zhou,Yang Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Prevailing spatiotemporal prediction models typically operate under a forward (unidirectional) learning paradigm, in which models extract spatiotemporal features from historical observation input and map them to target spatiotemporal space for future forecasting (label). However, these models frequently exhibit suboptimal performance when spatiotemporal discrepancies exist between inputs and labels, for instance, when nodes with similar time-series inputs manifest distinct future labels, or vice versa. To address this limitation, we propose explicitly incorporating label features during the training phase. Specifically, we introduce the Spatiotemporal Residual Theorem, which generalizes the conventional unidirectional spatiotemporal prediction paradigm into a bidirectional learning framework. Building upon this theoretical foundation, we design an universal module, termed ReLearner, which seamlessly augments Spatiotemporal Neural Networks (STNNs) with a bidirectional learning capability via an auxiliary inverse learning process. In this process, the model relearns the spatiotemporal feature residuals between input data and future data. The proposed ReLearner comprises two critical components: (1) a Residual Learning Module, designed to effectively disentangle spatiotemporal feature discrepancies between input and label representations; and (2) a Residual Smoothing Module, employed to smooth residual terms and facilitate stable convergence. Extensive experiments conducted on 11 real-world datasets across 14 backbone models demonstrate that ReLearner significantly enhances the predictive performance of existing this http URL code is available on GitHub.

[LG-118] HMVLA: Hyperbolic Multimodal Fusion for Vision-Language-Action Models ICASSP

链接: https://arxiv.org/abs/2602.02533
作者: Kun Wang,Xiao Feng,Mingcheng Qu,Tonghua Su
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 5 pages,5 figures,ICASSP

点击查看摘要

Abstract:Vision Language Action (VLA) models have recently shown great potential in bridging multimodal perception with robotic control. However, existing methods often rely on direct fine-tuning of pre-trained Vision-Language Models (VLMs), feeding semantic and visual features directly into a policy network without fully addressing the unique semantic alignment challenges in the VLA domain. In this paper, we propose HMVLA, a novel VLA framework that exploits the inherent hierarchical structures in vision and language for comprehensive semantic alignment. Unlike traditional methods that perform alignment in Euclidean space, our HMVLA embeds multimodal features in hyperbolic space, enabling more effective modeling of the hierarchical relationships present in image text data. Furthermore, we introduce a sparsely gated Mixture of Experts (MoE) mechanism tailored for semantic alignment, which enhances multimodal comprehension between images and text while improving efficiency. Extensive experiments demonstrate that HMVLA surpasses baseline methods in both accuracy and generalization. In addition, we validate its robustness by reconstructing datasets to further test cross domain adaptability.

[LG-119] Hypersonic Flow Control: Generalized Deep Reinforcement Learning for Hypersonic Intake Unstart Control under Uncertainty

链接: https://arxiv.org/abs/2602.02531
作者: Trishit Mondal,Ameya D. Jagtap
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注: 34 Pages, 23 Figures

点击查看摘要

Abstract:The hypersonic unstart phenomenon poses a major challenge to reliable air-breathing propulsion at Mach 5 and above, where strong shock-boundary-layer interactions and rapid pressure fluctuations can destabilize inlet operation. Here, we demonstrate a deep reinforcement learning (DRL)- based active flow control strategy to control unstart in a canonical two-dimensional hypersonic inlet at Mach 5 and Reynolds number 5\times 10^6 . The in-house CFD solver enables high-fidelity simulations with adaptive mesh refinement, resolving key flow features, including shock motion, boundary-layer dynamics, and flow separation, that are essential for learning physically consistent control policies suitable for real-time deployment. The DRL controller robustly stabilizes the inlet over a wide range of back pressures representative of varying combustion chamber conditions. It further generalizes to previously unseen scenarios, including different back-pressure levels, Reynolds numbers, and sensor configurations, while operating with noisy measurements, thereby demonstrating strong zero-shot generalization. Control remains robust in the presence of noisy sensor measurements, and a minimal, optimally selected sensor set achieves comparable performance, enabling practical implementation. These results establish a data-driven approach for real-time hypersonic flow control under realistic operational uncertainties.

[LG-120] Formulating Reinforcement Learning for Human-Robot Collaboration through Off-Policy Evaluation

链接: https://arxiv.org/abs/2602.02530
作者: Saurav Singh,Rodney Sanchez,Alexander Ororbia,Jamison Heard
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has the potential to transform real-world decision-making systems by enabling autonomous agents to learn from experience. Deploying RL in real-world settings, especially in the context of human-robot interaction, requires defining state representations and reward functions, which are critical for learning efficiency and policy performance. Traditional RL approaches often rely on domain expertise and trial-and-error, necessitating extensive human involvement as well as direct interaction with the environment, which can be costly and impractical, especially in complex and safety-critical applications. This work proposes a novel RL framework that leverages off-policy evaluation (OPE) for state space and reward function selection, using only logged interaction data. This approach eliminates the need for real-time access to the environment or human-in-the-loop feedback, greatly reducing the dependency on costly real-time interactions. The proposed approach systematically evaluates multiple candidate state representations and reward functions by training offline RL agents and applying OPE to estimate policy performance. The optimal state space and reward function are selected based on their ability to produce high-performing policies under OPE metrics. Our method is validated on two environments: the Lunar Lander environment by OpenAI Gym, which provides a controlled setting for assessing state space and reward function selection, and a NASA-MATB-II human subjects study environment, which evaluates the approach’s real-world applicability to human-robot teaming scenarios. This work enhances the feasibility and scalability of offline RL for real-world environments by automating critical RL design decisions through a data-driven OPE-based evaluation, enabling more reliable, effective, and sustainable RL formulation for complex human-robot interaction settings.

[LG-121] Design and Evaluation of Whole-Page Experience Optimization for E-commerce Search

链接: https://arxiv.org/abs/2602.02514
作者: Pratik Lahiri,Bingqing Ge,Zhou Qin,Aditya Jumde,Shuning Huo,Lucas Scottini,Yi Liu,Mahmoud Mamlouk,Wenyang Liu
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:E-commerce Search Results Pages (SRPs) are evolving from linear lists to complex, non-linear layouts, rendering traditional position-biased ranking models insufficient. Moreover, existing optimization frameworks typically maximize short-term signals (e.g., clicks, same-day revenue) because long-term satisfaction metrics (e.g., expected two-week revenue) involve delayed feedback and challenging long-horizon credit attribution. To bridge these gaps, we propose a novel Whole-Page Experience Optimization Framework. Unlike traditional list-wise rankers, our approach explicitly models the interplay between item relevance, 2D positional layout, and visual elements. We use a causal framework to develop metrics for measuring long-term user satisfaction based on quasi-experimental data. We validate our approach through industry-scale A/B testing, where the model demonstrated a 1.86% improvement in brand relevance (our primary customer experience metric) while simultaneously achieving a statistically significant revenue uplift of +0.05%

[LG-122] Learning ORDER-Aware Multimodal Representations for Composite Materials Design

链接: https://arxiv.org/abs/2602.02513
作者: Xinyao Li,Hangwei Qian,Jingjing Li,Ivor Tsang
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:

点击查看摘要

Abstract:Artificial intelligence (AI) has shown remarkable success in materials discovery and property prediction, particularly for crystalline and polymer systems where material properties and structures are dominated by discrete graph representations. Such graph-central paradigm breaks down on composite materials, which possess continuous and nonlinear design spaces that lack well-defined graph structures. General composite descriptors, e.g., fiber volume and misalignment angle, cannot fully capture the fiber distributions that fundamentally determine microstructural characteristics, necessitating the integration of heterogeneous data sources through multimodal learning. Existing alignment-oriented multimodal frameworks have proven effective on abundant crystal or polymer data under discrete, unique graph-property mapping assumptions, but fail to address the highly continuous composite design space under extreme data scarcity. In this work, we introduce ORDinal-aware imagE-tabulaR alignment (ORDER), a multimodal pretraining framework that establishes ordinality as a core principle for composite material representations. ORDER ensures that materials with similar target properties occupy nearby regions in the latent space, which effectively preserves the continuous nature of composite properties and enables meaningful interpolation between sparsely observed designs. We evaluate ORDER on a public Nanofiber-enforced composite dataset and an internally curated dataset that simulates the construction of carbon fiber T700 with diverse fiber distributions. ORDER achieves consistent improvements over state-of-the-art multimodal baselines across property prediction, cross-modal retrieval, and microstructure generation tasks.

[LG-123] Augmenting Parameter-Efficient Pre-trained Language Models with Large Language Models

链接: https://arxiv.org/abs/2602.02501
作者: Saurabh Anand,Shubham Malaviya,Manish Shukla,Sachin Lodha
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 22 pages, 9 figures, 11 tables, short paper was accepted in ACM SAC 2024

点击查看摘要

Abstract:Training AI models in cybersecurity with help of vast datasets offers significant opportunities to mimic real-world behaviors effectively. However, challenges like data drift and scarcity of labelled data lead to frequent updates of models and the risk of overfitting. To address these challenges, we used parameter-efficient fine-tuning techniques for pre-trained language models wherein we combine compacters with various layer freezing strategies. To enhance the capabilities of these pre-trained language models, in this work we introduce two strategies that use large language models. In the first strategy, we utilize large language models as data-labelling tools wherein they generate labels for unlabeled data. In the second strategy, large language modes are utilized as fallback mechanisms for predictions having low confidence scores. We perform comprehensive experimental analysis on the proposed strategies on different downstream tasks specific to cybersecurity domain. We empirically demonstrate that by combining parameter-efficient pre-trained models with large language models, we can improve the reliability and robustness of models, making them more suitable for real-world cybersecurity applications.

[LG-124] Preference-based Conditional Treatment Effects and Policy Learning AISTATS2026

链接: https://arxiv.org/abs/2602.03823
作者: Dovid Parnas,Mathieu Even,Julie Josse,Uri Shalit
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Accepted to AISTATS 2026; 10 pages + appendix

点击查看摘要

Abstract:We introduce a new preference-based framework for conditional treatment effect estimation and policy learning, built on the Conditional Preference-based Treatment Effect (CPTE). CPTE requires only that outcomes be ranked under a preference rule, unlocking flexible modeling of heterogeneous effects with multivariate, ordinal, or preference-driven outcomes. This unifies applications such as conditional probability of necessity and sufficiency, conditional Win Ratio, and Generalized Pairwise Comparisons. Despite the intrinsic non-identifiability of comparison-based estimands, CPTE provides interpretable targets and delivers new identifiability conditions for previous unidentifiable estimands. We present estimation strategies via matching, quantile, and distributional regression, and further design efficient influence-function estimators to correct plug-in bias and maximize policy value. Synthetic and semi-synthetic experiments demonstrate clear performance gains and practical impact.

[LG-125] Conditional Flow Matching for Visually-Guided Acoustic Highlighting

链接: https://arxiv.org/abs/2602.03762
作者: Hugo Malard,Gael Le Lan,Daniel Wong,David Lou Alon,Yi-Chiao Wu,Sanjeel Parekh
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Visually-guided acoustic highlighting seeks to rebalance audio in alignment with the accompanying video, creating a coherent audio-visual experience. While visual saliency and enhancement have been widely studied, acoustic highlighting remains underexplored, often leading to misalignment between visual and auditory focus. Existing approaches use discriminative models, which struggle with the inherent ambiguity in audio remixing, where no natural one-to-one mapping exists between poorly-balanced and well-balanced audio mixes. To address this limitation, we reframe this task as a generative problem and introduce a Conditional Flow Matching (CFM) framework. A key challenge in iterative flow-based generation is that early prediction errors – in selecting the correct source to enhance – compound over steps and push trajectories off-manifold. To address this, we introduce a rollout loss that penalizes drift at the final step, encouraging self-correcting trajectories and stabilizing long-range flow integration. We further propose a conditioning module that fuses audio and visual cues before vector field regression, enabling explicit cross-modal source selection. Extensive quantitative and qualitative evaluations show that our method consistently surpasses the previous state-of-the-art discriminative approach, establishing that visually-guided audio remixing is best addressed through generative modeling.

[LG-126] Efficient Variance-reduced Estimation from Generative EHR Models: The SCOPE and REACH Estimators

链接: https://arxiv.org/abs/2602.03730
作者: Luke Solo,Matthew B.A. McDermott,William F. Parker,Bashar Ramadan,Michael C. Burkhart,Brett K. Beaulieu-Jones
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 10 pages, 2 figures

点击查看摘要

Abstract:Generative models trained using self-supervision of tokenized electronic health record (EHR) timelines show promise for clinical outcome prediction. This is typically done using Monte Carlo simulation for future patient trajectories. However, existing approaches suffer from three key limitations: sparse estimate distributions that poorly differentiate patient risk levels, extreme computational costs, and high sampling variance. We propose two new estimators: the Sum of Conditional Outcome Probability Estimator (SCOPE) and Risk Estimation from Anticipated Conditional Hazards (REACH), that leverage next-token probability distributions discarded by standard Monte Carlo. We prove both estimators are unbiased and that REACH guarantees variance reduction over Monte Carlo sampling for any model and outcome. Empirically, on hospital mortality prediction in MIMIC-IV using the ETHOS-ARES framework, SCOPE and REACH match 100-sample Monte Carlo performance using only 10-11 samples (95% CI: [9,11]), representing a ~10x reduction in inference cost without degrading calibration. For ICU admission prediction, efficiency gains are more modest (~1.2x), which we attribute to the outcome’s lower “spontaneity,” a property we characterize theoretically and empirically. These methods substantially improve the feasibility of deploying generative EHR models in resource-constrained clinical settings.

[LG-127] VR-VFL: Joint Rate and Client Selection for Vehicular Federated Learning Under Imperfect CSI

链接: https://arxiv.org/abs/2602.03711
作者: Metehan Karatas,Subhrakanti Dey,Christian Rohner,Jose Mairton Barros da Silva Jr
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: This paper has been accepted for presentation at IEEE ICC 2026

点击查看摘要

Abstract:Federated learning in vehicular edge networks faces major challenges in efficient resource allocation, largely due to high vehicle mobility and the presence of imperfect channel state information. Many existing methods oversimplify these realities, often assuming fixed communication rounds or ideal channel conditions, which limits their effectiveness in real-world scenarios. To address this, we propose variable rate vehicular federated learning (VR-VFL), a novel federated learning method designed specifically for vehicular networks under imperfect channel state information. VR-VFL combines dynamic client selection with adaptive transmission rate selection, while also allowing round times to flex in response to changing wireless conditions. At its core, VR-VFL is built on a bi-objective optimization framework that strikes a balance between improving learning convergence and minimizing the time required to complete each round. By accounting for both the challenges of mobility and realistic wireless constraints, VR-VFL offers a more practical and efficient approach to federated learning in vehicular edge networks. Simulation results show that the proposed VR-VFL scheme achieves convergence approximately 40% faster than other methods in the literature.

[LG-128] Improved Analysis of the Accelerated Noisy Power Method with Applications to Decentralized PCA

链接: https://arxiv.org/abs/2602.03682
作者: Pierre Aguié,Mathieu Even,Laurent Massoulié
类目: Machine Learning (stat.ML); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:We analyze the Accelerated Noisy Power Method, an algorithm for Principal Component Analysis in the setting where only inexact matrix-vector products are available, which can arise for instance in decentralized PCA. While previous works have established that acceleration can improve convergence rates compared to the standard Noisy Power Method, these guarantees require overly restrictive upper bounds on the magnitude of the perturbations, limiting their practical applicability. We provide an improved analysis of this algorithm, which preserves the accelerated convergence rate under much milder conditions on the perturbations. We show that our new analysis is worst-case optimal, in the sense that the convergence rate cannot be improved, and that the noise conditions we derive cannot be relaxed without sacrificing convergence guarantees. We demonstrate the practical relevance of our results by deriving an accelerated algorithm for decentralized PCA, which has similar communication costs to non-accelerated methods. To our knowledge, this is the first decentralized algorithm for PCA with provably accelerated convergence.

[LG-129] Simulation-Based Inference via Regression Projection and Batched Discrepancies

链接: https://arxiv.org/abs/2602.03613
作者: Arya Farahi,Jonah Rose,Paul Torrey
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: comments are welcome,

点击查看摘要

Abstract:We analyze a lightweight simulation-based inference method that infers simulator parameters using only a regression-based projection of the observed data. After fitting a surrogate linear regression once, the procedure simulates small batches at the proposed parameter values and assigns kernel weights based on the resulting batch-residual discrepancy, producing a self-normalized pseudo-posterior that is simple, parallelizable, and requires access only to the fitted regression coefficients rather than raw observations. We formalize the construction as an importance-sampling approximation to a population target that averages over simulator randomness, prove consistency as the number of parameter draws grows, and establish stability in estimating the surrogate regression from finite samples. We then characterize the asymptotic concentration as the batch size increases and the bandwidth shrinks, showing that the pseudo-posterior concentrates on an identified set determined by the chosen projection, thereby clarifying when the method yields point versus set identification. Experiments on a tractable nonlinear model and on a cosmological calibration task using the DREAMS simulation suite illustrate the computational advantages of regression-based projections and the identifiability limitations arising from low-information summaries.

[LG-130] Generator-based Graph Generation via Heat Diffusion ICML

链接: https://arxiv.org/abs/2602.03612
作者: Anthony Stephenson,Ian Gallagher,Christopher Nemeth
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Submitted to ICML; 8+15 pages; 20 figures

点击查看摘要

Abstract:Graph generative modelling has become an essential task due to the wide range of applications in chemistry, biology, social networks, and knowledge representation. In this work, we propose a novel framework for generating graphs by adapting the Generator Matching (arXiv:2410.20587) paradigm to graph-structured data. We leverage the graph Laplacian and its associated heat kernel to define a continous-time diffusion on each graph. The Laplacian serves as the infinitesimal generator of this diffusion, and its heat kernel provides a family of conditional perturbations of the initial graph. A neural network is trained to match this generator by minimising a Bregman divergence between the true generator and a learnable surrogate. Once trained, the surrogate generator is used to simulate a time-reversed diffusion process to sample new graph structures. Our framework unifies and generalises existing diffusion-based graph generative models, injecting domain-specific inductive bias via the Laplacian, while retaining the flexibility of neural approximators. Experimental studies demonstrate that our approach captures structural properties of real and synthetic graphs effectively.

[LG-131] Score-based diffusion models for diffuse optical tomography with uncertainty quantification

链接: https://arxiv.org/abs/2602.03449
作者: Fabian Schneider,Meghdoot Mozumder,Konstantin Tamarov,Leila Taghizadeh,Tanja Tarvainen,Tapio Helin,Duc-Lam Duong
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Score-based diffusion models are a recently developed framework for posterior sampling in Bayesian inverse problems with a state-of-the-art performance for severely ill-posed problems by leveraging a powerful prior distribution learned from empirical data. Despite generating significant interest especially in the machine-learning community, a thorough study of realistic inverse problems in the presence of modelling error and utilization of physical measurement data is still outstanding. In this work, the framework of unconditional representation for the conditional score function (UCoS) is evaluated for linearized difference imaging in diffuse optical tomography (DOT). DOT uses boundary measurements of near-infrared light to estimate the spatial distribution of absorption and scattering parameters in biological tissues. The problem is highly ill-posed and thus sensitive to noise and modelling errors. We introduce a novel regularization approach that prevents overfitting of the score function by constructing a mixed score composed of a learned and a model-based component. Validation of this approach is done using both simulated and experimental measurement data. The experiments demonstrate that a data-driven prior distribution results in posterior samples with low variance, compared to classical model-based estimation, and centred around the ground truth, even in the context of a highly ill-posed problem and in the presence of modelling errors.

[LG-132] Acceleration of Atomistic NEGF: Algorithms Parallelization and Machine Learning

链接: https://arxiv.org/abs/2602.03438
作者: Mathieu Luisier,Nicolas Vetsch,Alexander Maeder,Vincent Maillou,Anders Winka,Leonard Deuschle,Chen Hao Xia,Manasa Kaniselvan,Marko Mladenovic,Jiang Cao,Alexandros Nikolaos Ziogas
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Non-equilibrium Green’s function (NEGF) formalism is a particularly powerful method to simulate the quantum transport properties of nanoscale devices such as transistors, photo-diodes, or memory cells, in the ballistic limit of transport or in the presence of various scattering sources such as electronphonon, electron-photon, or even electron-electron interactions. The inclusion of all these mechanisms has been first demonstrated in small systems, composed of a few atoms, before being scaled up to larger structures made of thousands of atoms. Also, the accuracy of the models has kept improving, from empirical to fully ab-initio ones, e.g., density functional theory (DFT). This paper summarizes key (algorithmic) achievements that have allowed us to bring DFT+NEGF simulations closer to the dimensions and functionality of realistic systems. The possibility of leveraging graph neural networks and machine learning to speed up ab-initio device simulations is discussed as well.

[LG-133] Enhancing Quantum Diffusion Models for Complex Image Generation

链接: https://arxiv.org/abs/2602.03405
作者: Jeongbin Jo,Santanam Wishal,Shah Md Khalil Ullah,Shan Kowalski,Dikshant Dulai
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 18 pages, 6 figures

点击查看摘要

Abstract:Quantum generative models offer a novel approach to exploring high-dimensional Hilbert spaces but face significant challenges in scalability and expressibility when applied to multi-modal distributions. In this study, we explore a Hybrid Quantum-Classical U-Net architecture integrated with Adaptive Non-Local Observables (ANO) as a potential solution to these hurdles. By compressing classical data into a dense quantum latent space and utilizing trainable observables, our model aims to extract non-local features that complement classical processing. We also investigate the role of Skip Connections in preserving semantic information during the reverse diffusion process. Experimental results on the full MNIST dataset (digits 0-9) demonstrate that the proposed architecture is capable of generating structurally coherent and recognizable images for all digit classes. While hardware constraints still impose limitations on resolution, our findings suggest that hybrid architectures with adaptive measurements provide a feasible pathway for mitigating mode collapse and enhancing generative capabilities in the NISQ era.

[LG-134] Improving the Linearized Laplace Approximation via Quadratic Approximations

链接: https://arxiv.org/abs/2602.03394
作者: Pedro Jiménez,Luis A. Ortega,Pablo Morales-Álvarez,Daniel Hernández-Lobato
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 6 pages, 1 table. Accepted at European Symposium on Artificial Neural Networks (ESANN 2026) as poster presentation

点击查看摘要

Abstract:Deep neural networks (DNNs) often produce overconfident out-of-distribution predictions, motivating Bayesian uncertainty quantification. The Linearized Laplace Approximation (LLA) achieves this by linearizing the DNN and applying Laplace inference to the resulting model. Importantly, the linear model is also used for prediction. We argue this linearization in the posterior may degrade fidelity to the true Laplace approximation. To alleviate this problem, without increasing significantly the computational cost, we propose the Quadratic Laplace Approximation (QLA). QLA approximates each second order factor in the approximate Laplace log-posterior using a rank-one factor obtained via efficient power iterations. QLA is expected to yield a posterior precision closer to that of the full Laplace without forming the full Hessian, which is typically intractable. For prediction, QLA also uses the linearized model. Empirically, QLA yields modest yet consistent uncertainty estimation improvements over LLA on five regression datasets.

[LG-135] A Novel approach to portfolio construction

链接: https://arxiv.org/abs/2602.03325
作者: T. Di Matteo,L. Riso,M.G. Zoia
类目: Portfolio Management (q-fin.PM); Machine Learning (cs.LG); Computational Finance (q-fin.CP); Risk Management (q-fin.RM); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper proposes a machine learning-based framework for asset selection and portfolio construction, termed the Best-Path Algorithm Sparse Graphical Model (BPASGM). The method extends the Best-Path Algorithm (BPA) by mapping linear and non-linear dependencies among a large set of financial assets into a sparse graphical model satisfying a structural Markov property. Based on this representation, BPASGM performs a dependence-driven screening that removes positively or redundantly connected assets, isolating subsets that are conditionally independent or negatively correlated. This step is designed to enhance diversification and reduce estimation error in high-dimensional portfolio settings. Portfolio optimization is then conducted on the selected subset using standard mean-variance techniques. BPASGM does not aim to improve the theoretical mean-variance optimum under known population parameters, but rather to enhance realized performance in finite samples, where sample-based Markowitz portfolios are highly sensitive to estimation error. Monte Carlo simulations show that BPASGM-based portfolios achieve more stable risk-return profiles, lower realized volatility, and superior risk-adjusted performance compared to standard mean-variance portfolios. Empirical results for U.S. equities, global stock indices, and foreign exchange rates over 1990-2025 confirm these findings and demonstrate a substantial reduction in portfolio cardinality. Overall, BPASGM offers a statistically grounded and computationally efficient framework that integrates sparse graphical modeling with portfolio theory for dependence-aware asset selection.

[LG-136] Principled Federated Random Forests for Heterogeneous Data

链接: https://arxiv.org/abs/2602.03258
作者: Rémi Khellaf,Erwan Scornet,Aurélien Bellet,Julie Josse
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Random Forests (RF) are among the most powerful and widely used predictive models for centralized tabular data, yet few methods exist to adapt them to the federated learning setting. Unlike most federated learning approaches, the piecewise-constant nature of RF prevents exact gradient-based optimization. As a result, existing federated RF implementations rely on unprincipled heuristics: for instance, aggregating decision trees trained independently on clients fails to optimize the global impurity criterion, even under simple distribution shifts. We propose FedForest, a new federated RF algorithm for horizontally partitioned data that naturally accommodates diverse forms of client data heterogeneity, from covariate shift to more complex outcome shift mechanisms. We prove that our splitting procedure, based on aggregating carefully chosen client statistics, closely approximates the split selected by a centralized algorithm. Moreover, FedForest allows splits on client indicators, enabling a non-parametric form of personalization that is absent from prior federated random forest methods. Empirically, we demonstrate that the resulting federated forests closely match centralized performance across heterogeneous benchmarks while remaining communication-efficient.

[LG-137] NeuralFLoC: Neural Flow-Based Joint Registration and Clustering of Functional Data

链接: https://arxiv.org/abs/2602.03169
作者: Xinyang Xiong,Siyuan jiang,Pengcheng Zeng
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Clustering functional data in the presence of phase variation is challenging, as temporal misalignment can obscure intrinsic shape differences and degrade clustering performance. Most existing approaches treat registration and clustering as separate tasks or rely on restrictive parametric assumptions. We present \textbfNeuralFLoC, a fully unsupervised, end-to-end deep learning framework for joint functional registration and clustering based on Neural ODE-driven diffeomorphic flows and spectral clustering. The proposed model learns smooth, invertible warping functions and cluster-specific templates simultaneously, effectively disentangling phase and amplitude variation. We establish universal approximation guarantees and asymptotic consistency for the proposed framework. Experiments on functional benchmarks show state-of-the-art performance in both registration and clustering, with robustness to missing data, irregular sampling, and noise, while maintaining scalability. Code is available at this https URL.

[LG-138] Online Conformal Prediction via Universal Portfolio Algorithms

链接: https://arxiv.org/abs/2602.03168
作者: Tuo Liu,Edgar Dobriban,Francesco Orabona
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Online conformal prediction (OCP) seeks prediction intervals that achieve long-run 1-\alpha coverage for arbitrary (possibly adversarial) data streams, while remaining as informative as possible. Existing OCP methods often require manual learning-rate tuning to work well, and may also require algorithm-specific analyses. Here, we develop a general regret-to-coverage theory for interval-valued OCP based on the (1-\alpha) -pinball loss. Our first contribution is to identify \emphlinearized regret as a key notion, showing that controlling it implies coverage bounds for any online algorithm. This relies on a black-box reduction that depends only on the Fenchel conjugate of an upper bound on the linearized regret. Building on this theory, we propose UP-OCP, a parameter-free method for OCP, via a reduction to a two-asset portfolio selection problem, leveraging universal portfolio algorithms. We show strong finite-time bounds on the miscoverage of UP-OCP, even for polynomially growing predictions. Extensive experiments support that UP-OCP delivers consistently better size/coverage trade-offs than prior online conformal baselines.

[LG-139] Unified Inference Framework for Single and Multi-Player Performative Prediction: Method and Asymptotic Optimality

链接: https://arxiv.org/abs/2602.03049
作者: Zhixian Zhang,Xiaotian Hou,Linjun Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Performative prediction characterizes environments where predictive models alter the very data distributions they aim to forecast, triggering complex feedback loops. While prior research treats single-agent and multi-agent performativity as distinct phenomena, this paper introduces a unified statistical inference framework that bridges these contexts, treating the former as a special case of the latter. Our contribution is two-fold. First, we put forward the Repeated Risk Minimization (RRM) procedure for estimating the performative stability, and establish a rigorous inferential theory for admitting its asymptotic normality and confirming its asymptotic efficiency. Second, for the performative optimality, we introduce a novel two-step plug-in estimator that integrates the idea of Recalibrated Prediction Powered Inference (RePPI) with Importance Sampling, and further provide formal derivations for the Central Limit Theorems of both the underlying distributional parameters and the plug-in results. The theoretical analysis demonstrates that our estimator achieves the semiparametric efficiency bound and maintains robustness under mild distributional misspecification. This work provides a principled toolkit for reliable estimation and decision-making in dynamic, performative environments.

[LG-140] Physics-inspired transformer quantum states via latent imaginary-time evolution

链接: https://arxiv.org/abs/2602.03031
作者: Kimihiro Yamazaki,Itsushi Sakata,Takuya Konishi,Yoshinobu Kawahara
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:Neural quantum states (NQS) are powerful ansätze in the variational Monte Carlo framework, yet their architectures are often treated as black boxes. We propose a physically transparent framework in which NQS are treated as neural approximations to latent imaginary-time evolution. This viewpoint suggests that standard Transformer-based NQS (TQS) architectures correspond to physically unmotivated effective Hamiltonians dependent on imaginary time in a latent space. Building on this interpretation, we introduce physics-inspired transformer quantum states (PITQS), which enforce a static effective Hamiltonian by sharing weights across layers and improve propagation accuracy via Trotter-Suzuki decompositions without increasing the number of variational parameters. For the frustrated J_1 - J_2 Heisenberg model, our ansätze achieve accuracies comparable to or exceeding state-of-the-art TQS while using substantially fewer variational parameters. This study demonstrates that reinterpreting the deep network structure as a latent cooling process enables a more physically grounded, systematic, and compact design, thereby bridging the gap between black-box expressivity and physically transparent construction.

[LG-141] Weighted Sum-of-Trees Model for Clustered Data

链接: https://arxiv.org/abs/2602.02931
作者: Kevin McCoy,Zachary Wooten,Katarzyna Tomczak,Christine B. Peterson
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注: 14 pages, 8 figures, 3 tables

点击查看摘要

Abstract:Clustered data, which arise when observations are nested within groups, are incredibly common in clinical, education, and social science research. Traditionally, a linear mixed model, which includes random effects to account for within-group correlation, would be used to model the observed data and make new predictions on unseen data. Some work has been done to extend the mixed model approach beyond linear regression into more complex and non-parametric models, such as decision trees and random forests. However, existing methods are limited to using the global fixed effects for prediction on data from out-of-sample groups, effectively assuming that all clusters share a common outcome model. We propose a lightweight sum-of-trees model in which we learn a decision tree for each sample group. We combine the predictions from these trees using weights so that out-of-sample group predictions are more closely aligned with the most similar groups in the training data. This strategy also allows for inference on the similarity across groups in the outcome prediction model, as the unique tree structures and variable importances for each group can be directly compared. We show our model outperforms traditional decision trees and random forests in a variety of simulation settings. Finally, we showcase our method on real-world data from the sarcoma cohort of The Cancer Genome Atlas, where patient samples are grouped by sarcoma subtype.

[LG-142] raining-Free Self-Correction for Multimodal Masked Diffusion Models

链接: https://arxiv.org/abs/2602.02927
作者: Yidong Ouyang,Panwen Hu,Zhengyan Wan,Zhe Wang,Liyan Xie,Dmitriy Bespalov,Ying Nian Wu,Guang Cheng,Hongyuan Zha,Qiang Sun
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Masked diffusion models have emerged as a powerful framework for text and multimodal generation. However, their sampling procedure updates multiple tokens simultaneously and treats generated tokens as immutable, which may lead to error accumulation when early mistakes cannot be revised. In this work, we revisit existing self-correction methods and identify limitations stemming from additional training requirements or reliance on misaligned likelihood estimates. We propose a training-free self-correction framework that exploits the inductive biases of pre-trained masked diffusion models. Without modifying model parameters or introducing auxiliary evaluators, our method significantly improves generation quality on text-to-image generation and multimodal understanding tasks with reduced sampling steps. Moreover, the proposed framework generalizes across different masked diffusion architectures, highlighting its robustness and practical applicability. Code can be found in this https URL.

[LG-143] Downscaling land surface temperature data using edge detection and block-diagonal Gaussian process regression

链接: https://arxiv.org/abs/2602.02813
作者: Sanjit Dandapanthula,Margaret Johnson,Madeleine Pascolini-Campbell,Glynn Hulley,Mikael Kuusela
类目: Applications (stat.AP); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Accurate and high-resolution estimation of land surface temperature (LST) is crucial in estimating evapotranspiration, a measure of plant water use and a central quantity in agricultural applications. In this work, we develop a novel statistical method for downscaling LST data obtained from NASA’s ECOSTRESS mission, using high-resolution data from the Landsat 8 mission as a proxy for modeling agricultural field structure. Using the Landsat data, we identify the boundaries of agricultural fields through edge detection techniques, allowing us to capture the inherent block structure present in the spatial domain. We propose a block-diagonal Gaussian process (BDGP) model that captures the spatial structure of the agricultural fields, leverages independence of LST across fields for computational tractability, and accounts for the change of support present in ECOSTRESS observations. We use the resulting BDGP model to perform Gaussian process regression and obtain high-resolution estimates of LST from ECOSTRESS data, along with uncertainty quantification. Our results demonstrate the practicality of the proposed method in producing reliable high-resolution LST estimates, with potential applications in agriculture, urban planning, and climate studies.

[LG-144] Plug-In Classification of Drift Functions in Diffusion Processes Using Neural Networks

链接: https://arxiv.org/abs/2602.02791
作者: Yuzhen Zhao,Jiarong Fan,Yating Liu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We study a supervised multiclass classification problem for diffusion processes, where each class is characterized by a distinct drift function and trajectories are observed at discrete times. Extending the one-dimensional multiclass framework of Denis et al. (2024) to multidimensional diffusions, we propose a neural network-based plug-in classifier that estimates the drift functions for each class from independent sample paths and assigns labels based on a Bayes-type decision rule. Under standard regularity assumptions, we establish convergence rates for the excess misclassification risk, explicitly capturing the effects of drift estimation error and time discretization. Numerical experiments demonstrate that the proposed method achieves faster convergence and improved classification performance compared to Denis et al. (2024) in the one-dimensional setting, remains effective in higher dimensions when the underlying drift functions admit a compositional structure, and consistently outperforms direct neural network classifiers trained end-to-end on trajectories without exploiting the diffusion model structure.

[LG-145] Near-Universal Multiplicative Updates for Nonnegative Einsum Factorization

链接: https://arxiv.org/abs/2602.02759
作者: John Hood,Aaron Schein
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 26 pages, 5 figures

点击查看摘要

Abstract:Despite the ubiquity of multiway data across scientific domains, there are few user-friendly tools that fit tailored nonnegative tensor factorizations. Researchers may use gradient-based automatic differentiation (which often struggles in nonnegative settings), choose between a limited set of methods with mature implementations, or implement their own model from scratch. As an alternative, we introduce NNEinFact, an einsum-based multiplicative update algorithm that fits any nonnegative tensor factorization expressible as a tensor contraction by minimizing one of many user-specified loss functions (including the (\alpha,\beta) -divergence). To use NNEinFact, the researcher simply specifies their model with a string. NNEinFact converges to a local minimum of the loss, supports missing data, and fits to tensors with hundreds of millions of entries in seconds. Empirically, NNEinFact fits custom models which outperform standard ones in heldout prediction tasks on real-world tensor data by over 37% and attains less than half the test loss of gradient-based methods while converging up to 90 times faster.

[LG-146] Rethinking Test-Time Training: Tilting The Latent Distribution For Few-Shot Source-Free Adaptation

链接: https://arxiv.org/abs/2602.02633
作者: Tahir Qasim Syed,Behraj Khan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Often, constraints arise in deployment settings where even lightweight parameter updates e.g. parameter-efficient fine-tuning could induce model shift or tuning instability. We study test-time adaptation of foundation models for few-shot classification under a completely frozen-model regime, where additionally, no upstream data are accessible. We propose arguably the first training-free inference method that adapts predictions to the new task by performing a change of measure over the latent embedding distribution induced by the encoder. Using task-similarity scores derived from a small labeled support set, exponential tilting reweights latent distributions in a KL-optimal manner without modifying model parameters. Empirically, the method consistently competes with parameter-update-based methods across multiple benchmarks and shot regimes, while operating under strictly and universally stronger constraints. These results demonstrate the viability of inference-level distributional correction for test-time adaptation even with a fully-frozen model pipeline.

[LG-147] Relaxed Triangle Inequality for Kullback-Leibler Divergence Between Multivariate Gaussian Distributions

链接: https://arxiv.org/abs/2602.02577
作者: Shiji Xiao,Yufeng Zhang,Chubo Liu,Yan Ding,Keqin Li,Kenli Li
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Kullback-Leibler (KL) divergence is not a proper distance metric and does not satisfy the triangle inequality, posing theoretical challenges in certain practical applications. Existing work has demonstrated that KL divergence between multivariate Gaussian distributions follows a relaxed triangle inequality. Given any three multivariate Gaussian distributions \mathcalN_1, \mathcalN_2 , and \mathcalN_3 , if KL(\mathcalN_1, \mathcalN_2)\leq \epsilon_1 and KL(\mathcalN_2, \mathcalN_3)\leq \epsilon_2 , then KL(\mathcalN_1, \mathcalN_3) 3\epsilon_1+3\epsilon_2+2\sqrt\epsilon_1\epsilon_2+o(\epsilon_1)+o(\epsilon_2) . However, the supremum of KL(\mathcalN_1, \mathcalN_3) is still unknown. In this paper, we investigate the relaxed triangle inequality for the KL divergence between multivariate Gaussian distributions and give the supremum of KL(\mathcalN_1, \mathcalN_3) as well as the conditions when the supremum can be attained. When \epsilon_1 and \epsilon_2 are small, the supremum is \epsilon_1+\epsilon_2+\sqrt\epsilon_1\epsilon_2+o(\epsilon_1)+o(\epsilon_2) . Finally, we demonstrate several applications of our results in out-of-distribution detection with flow-based generative models and safe reinforcement learning.

信息检索

[IR-0] Multimodal Generative Recommendation for Fusing Semantic and Collaborative Signals

链接: https://arxiv.org/abs/2602.03713
作者: Moritz Vandenhirtz,Kaveh Hassani,Shervin Ghasemlou,Shuai Shao,Hamid Eghbalzadeh,Fuchun Peng,Jun Liu,Michael Louis Iuzzolino
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Sequential recommender systems rank relevant items by modeling a user’s interaction history and computing the inner product between the resulting user representation and stored item embeddings. To avoid the significant memory overhead of storing large item sets, the generative recommendation paradigm instead models each item as a series of discrete semantic codes. Here, the next item is predicted by an autoregressive model that generates the code sequence corresponding to the predicted item. However, despite promising ranking capabilities on small datasets, these methods have yet to surpass traditional sequential recommenders on large item sets, limiting their adoption in the very scenarios they were designed to address. To resolve this, we propose MSCGRec, a Multimodal Semantic and Collaborative Generative Recommender. MSCGRec incorporates multiple semantic modalities and introduces a novel self-supervised quantization learning approach for images based on the DINO framework. Additionally, MSCGRec fuses collaborative and semantic signals by extracting collaborative features from sequential recommenders and treating them as a separate modality. Finally, we propose constrained sequence learning that restricts the large output space during training to the set of permissible tokens. We empirically demonstrate on three large real-world datasets that MSCGRec outperforms both sequential and generative recommendation baselines and provide an extensive ablation study to validate the impact of each component.

[IR-1] Bringing Reasoning to Generative Recommendation Through the Lens of Cascaded Ranking WWW2026

链接: https://arxiv.org/abs/2602.03692
作者: Xinyu Lin,Pengyuan Liu,Wenjie Wang,Yicheng Hu,Chen Xu,Fuli Feng,Qifan Wang,Tat-Seng Chua
类目: Information Retrieval (cs.IR)
*备注: Accepted by WWW2026

点击查看摘要

Abstract:Generative Recommendation (GR) has become a promising end-to-end approach with high FLOPS utilization for resource-efficient recommendation. Despite the effectiveness, we show that current GR models suffer from a critical \textbfbias amplification issue, where token-level bias escalates as token generation progresses, ultimately limiting the recommendation diversity and hurting the user experience. By comparing against the key factor behind the success of traditional multi-stage pipelines, we reveal two limitations in GR that can amplify the bias: homogeneous reliance on the encoded history, and fixed computational budgets that prevent deeper user preference understanding. To combat the bias amplification issue, it is crucial for GR to 1) incorporate more heterogeneous information, and 2) allocate greater computational resources at each token generation step. To this end, we propose CARE, a simple yet effective cascaded reasoning framework for debiased GR. To incorporate heterogeneous information, we introduce a progressive history encoding mechanism, which progressively incorporates increasingly fine-grained history information as the generation process advances. To allocate more computations, we propose a query-anchored reasoning mechanism, which seeks to perform a deeper understanding of historical information through parallel reasoning steps. We instantiate CARE on three GR backbones. Empirical results on four datasets show the superiority of CARE in recommendation accuracy, diversity, efficiency, and promising scalability. The codes and datasets are available at this https URL. Comments: Accepted by WWW2026 Subjects: Information Retrieval (cs.IR) Cite as: arXiv:2602.03692 [cs.IR] (or arXiv:2602.03692v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2602.03692 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[IR-2] Failure is Feedback: History-Aware Backtracking for Agent ic Traversal in Multimodal Graphs

链接: https://arxiv.org/abs/2602.03432
作者: Joohyung Yun,Doyup Lee,Wook-Shin Han
类目: Information Retrieval (cs.IR)
*备注: Project page: this https URL

点击查看摘要

Abstract:Open-domain multimodal document retrieval aims to retrieve specific components (paragraphs, tables, or images) from large and interconnected document corpora. Existing graph-based retrieval approaches typically rely on a uniform similarity metric that overlooks hop-specific semantics, and their rigid pre-defined plans hinder dynamic error correction. These limitations suggest that a retriever should adapt its reasoning to the evolving context and recover intelligently from dead ends. To address these needs, we propose Failure is Feedback (FiF), which casts subgraph retrieval as a sequential decision process and introduces two key innovations. (i) We introduce a history-aware backtracking mechanism; unlike standard backtracking that simply reverts the state, our approach piggybacks on the context of failed traversals, leveraging insights from previous failures. (ii) We implement an economically-rational agentic workflow. Unlike conventional agents with static strategies, our orchestrator employs a cost-aware traversal method to dynamically manage the trade-off between retrieval accuracy and inference costs, escalating to intensive LLM-based reasoning only when the prior failure justifies the additional computational investment. Extensive experiments show that FiF achieves state-of-the-art retrieval on the benchmarks of MultimodalQA, MMCoQA and WebQA.

[IR-3] RankSteer: Activation Steering for Pointwise LLM Ranking

链接: https://arxiv.org/abs/2602.03422
作者: Yumeng Wang,Catherine Chen,Suzan Verberne
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have recently shown strong performance as zero-shot rankers, yet their effectiveness is highly sensitive to prompt formulation, particularly role-play instructions. Prior analyses suggest that role-related signals are encoded along activation channels that are largely separate from query-document representations, raising the possibility of steering ranking behavior directly at the activation level rather than through brittle prompt engineering. In this work, we propose RankSteer, a post-hoc activation steering framework for zero-shot pointwise LLM ranking. We characterize ranking behavior through three disentangled and steerable directions in representation space: a \textbfdecision direction that maps hidden states to relevance scores, an \textbfevidence direction that captures relevance signals not directly exploited by the decision head, and a \textbfrole direction that modulates model behavior without injecting relevance information. Using projection-based interventions at inference time, RankSteer jointly controls these directions to calibrate ranking behavior without modifying model weights or introducing explicit cross-document comparisons. Experiments on TREC DL 20 and multiple BEIR benchmarks show that RankSteer consistently improves ranking quality using only a small number of anchor queries, demonstrating that substantial ranking capacity remains under-utilized in pointwise LLM rankers. We further provide a geometric analysis revealing that steering improves ranking by stabilizing ranking geometry and reducing dispersion, offering new insight into how LLMs internally represent and calibrate relevance judgments.

[IR-4] AesRec: A Dataset for Aesthetics-Aligned Clothing Outfit Recommendation

链接: https://arxiv.org/abs/2602.03416
作者: Wenxin Ye,Lin Li,Ming Li,Yang Shen,Kanghong Wang,Jimmy Xiangji Huang
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Clothing recommendation extends beyond merely generating personalized outfits; it serves as a crucial medium for aesthetic guidance. However, existing methods predominantly rely on user-item-outfit interaction behaviors while overlooking explicit representations of clothing aesthetics. To bridge this gap, we present the AesRec benchmark dataset featuring systematic quantitative aesthetic annotations, thereby enabling the development of aesthetics-aligned recommendation systems. Grounded in professional apparel quality standards and fashion aesthetic principles, we define a multidimensional set of indicators. At the item level, six dimensions are independently assessed: silhouette, chromaticity, materiality, craftsmanship, wearability, and item-level impression. Transitioning to the outfit level, the evaluation retains the first five core attributes while introducing stylistic synergy, visual harmony, and outfit-level impression as distinct metrics to capture the collective aesthetic impact. Given the increasing human-like proficiency of Vision-Language Models in multimodal understanding and interaction, we leverage them for large-scale aesthetic scoring. We conduct rigorous human-machine consistency validation on a fashion dataset, confirming the reliability of the generated ratings. Experimental results based on AesRec further demonstrate that integrating quantified aesthetic information into clothing recommendation models can provide aesthetic guidance for users while fulfilling their personalized requirements.

[IR-5] Beyond Exposure: Optimizing Ranking Fairness with Non-linear Time-Income Functions

链接: https://arxiv.org/abs/2602.03345
作者: Xuancheng Li,Tao Yang,Yujia Zhou,Qingyao Ai,Yiqun Liu
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Ranking is central to information distribution in web search and recommendation. Nowadays, in ranking optimization, the fairness to item providers is viewed as a crucial factor alongside ranking relevance for users. There are currently numerous concepts of fairness and one widely recognized fairness concept is Exposure Fairness. However, it relies primarily on exposure determined solely by position, overlooking other factors that significantly influence income, such as time. To address this limitation, we propose to study ranking fairness when the provider utility is influenced by other contextual factors and is neither equal to nor proportional to item exposure. We give a formal definition of Income Fairness and develop a corresponding measurement metric. Simulated experiments show that existing-exposure-fairness-based ranking algorithms fail to optimize the proposed income fairness. Therefore, we propose the Dynamic-Income-Derivative-aware Ranking Fairness algorithm, which, based on the marginal income gain at the present timestep, uses Taylor-expansion-based gradients to simultaneously optimize effectiveness and income fairness. In both offline and online settings with diverse time-income functions, DIDRF consistently outperforms state-of-the-art methods.

[IR-6] SCASRec: A Self-Correcting and Auto-Stopping Model for Generative Route List Recommendation

链接: https://arxiv.org/abs/2602.03324
作者: Chao Chen,Longfei Xu,Daohan Su,Tengfei Liu,Hanyu Guo,Yihai Duan,Kaikui Liu,Xiangxiang Chu
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Route recommendation systems commonly adopt a multi-stage pipeline involving fine-ranking and re-ranking to produce high-quality ordered recommendations. However, this paradigm faces three critical limitations. First, there is a misalignment between offline training objectives and online metrics. Offline gains do not necessarily translate to online improvements. Actual performance must be validated through A/B testing, which may potentially compromise the user experience. Second, redundancy elimination relies on rigid, handcrafted rules that lack adaptability to the high variance in user intent and the unstructured complexity of real-world scenarios. Third, the strict separation between fine-ranking and re-ranking stages leads to sub-optimal performance. Since each module is optimized in isolation, the fine-ranking stage remains oblivious to the list-level objectives (e.g., diversity) targeted by the re-ranker, thereby preventing the system from achieving a jointly optimized global optimum. To overcome these intertwined challenges, we propose \textbfSCASRec (\textbfSelf-\textbfCorrecting and \textbfAuto-\textbfStopping \textbfRecommendation), a unified generative framework that integrates ranking and redundancy elimination into a single end-to-end process. SCASRec introduces a stepwise corrective reward (SCR) to guide list-wise refinement by focusing on hard samples, and employs a learnable End-of-Recommendation (EOR) token to terminate generation adaptively when no further improvement is expected. Experiments on two large-scale, open-sourced route recommendation datasets demonstrate that SCASRec establishes an SOTA in offline and online settings. SCASRec has been fully deployed in a real-world navigation app, demonstrating its effectiveness.

[IR-7] o Search or Not to Search: Aligning the Decision Boundary of Deep Search Agents via Causal Intervention

链接: https://arxiv.org/abs/2602.03304
作者: Wenlin Zhang,Kuicai Dong,Junyi Li,Yingyi Zhang,Xiaopeng Li,Pengyue Jia,Yi Wen,Derong Xu,Maolin Wang,Yichao Wang,Yong Liu,Xiangyu Zhao
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Deep search agents, which autonomously iterate through multi-turn web-based reasoning, represent a promising paradigm for complex information-seeking tasks. However, current agents suffer from critical inefficiency: they conduct excessive searches as they cannot accurately judge when to stop searching and start answering. This stems from outcome-centric training that prioritize final results over the search process itself. We identify the root cause as misaligned decision boundaries, the threshold determining when accumulated information suffices to answer. This causes over-search (redundant searching despite sufficient knowledge) and under-search (premature termination yielding incorrect answers). To address these errors, we propose a comprehensive framework comprising two key components. First, we introduce causal intervention-based diagnosis that identifies boundary errors by comparing factual and counterfactual trajectories at each decision point. Second, we develop Decision Boundary Alignment for Deep Search agents (DAS), which constructs preference datasets from causal feedback and aligns policies via preference optimization. Experiments on public datasets demonstrate that decision boundary errors are pervasive across state-of-the-art agents. Our DAS method effectively calibrates these boundaries, mitigating both over-search and under-search to achieve substantial gains in accuracy and efficiency. Our code and data are publicly available at: this https URL.

[IR-8] PAMAS: Self-Adaptive Multi-Agent System with Perspective Aggregation for Misinformation Detection

链接: https://arxiv.org/abs/2602.03158
作者: Zongwei Wang,Min Gao,Junliang Yu,Tong Chen,Chenghua Lin
类目: Information Retrieval (cs.IR)
*备注: 12 pages

点击查看摘要

Abstract:Misinformation on social media poses a critical threat to information credibility, as its diverse and context-dependent nature complicates detection. Large language model-empowered multi-agent systems (MAS) present a promising paradigm that enables cooperative reasoning and collective intelligence to combat this threat. However, conventional MAS suffer from an information-drowning problem, where abundant truthful content overwhelms sparse and weak deceptive cues. With full input access, agents tend to focus on dominant patterns, and inter-agent communication further amplifies this bias. To tackle this issue, we propose PAMAS, a multi-agent framework with perspective aggregation, which employs hierarchical, perspective-aware aggregation to highlight anomaly cues and alleviate information drowning. PAMAS organizes agents into three roles: Auditors, Coordinators, and a Decision-Maker. Auditors capture anomaly cues from specialized feature subsets; Coordinators aggregate their perspectives to enhance coverage while maintaining diversity; and the Decision-Maker, equipped with evolving memory and full contextual access, synthesizes all subordinate insights to produce the final judgment. Furthermore, to improve efficiency in multi-agent collaboration, PAMAS incorporates self-adaptive mechanisms for dynamic topology optimization and routing-based inference, enhancing both efficiency and scalability. Extensive experiments on multiple benchmark datasets demonstrate that PAMAS achieves superior accuracy and efficiency, offering a scalable and trustworthy way for misinformation detection.

[IR-9] ALPBench: A Benchmark for Attribution-level Long-term Personal Behavior Understanding

链接: https://arxiv.org/abs/2602.03056
作者: Lu Ren,Junda She,Xinchen Luo,Tao Wang,Xin Ye,Xu Zhang,Muxuan Wang,Xiao Yang,Chenguang Wang,Fei Xie,Yiwei Zhou,Danjun Wu,Guodong Zhang,Yifei Hu,Guoying Zheng,Shujie Yang,Xingmei Wang,Shiyao Wang,Yukun Zhou,Fan Yang,Size Li,Kuo Cai,Qiang Luo,Ruiming Tang,Han Li,Kun Gai
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recent advances in large language models have highlighted their potential for personalized recommendation, where accurately capturing user preferences remains a key challenge. Leveraging their strong reasoning and generalization capabilities, LLMs offer new opportunities for modeling long-term user behavior. To systematically evaluate this, we introduce ALPBench, a Benchmark for Attribution-level Long-term Personal Behavior Understanding. Unlike item-focused benchmarks, ALPBench predicts user-interested attribute combinations, enabling ground-truth evaluation even for newly introduced items. It models preferences from long-term historical behaviors rather than users’ explicitly expressed requests, better reflecting enduring interests. User histories are represented as natural language sequences, allowing interpretable, reasoning-based personalization. ALPBench enables fine-grained evaluation of personalization by focusing on the prediction of attribute combinations task that remains highly challenging for current LLMs due to the need to capture complex interactions among multiple attributes and reason over long-term user behavior sequences.

[IR-10] Efficiency Optimizations for Superblock-based Sparse Retrieval

链接: https://arxiv.org/abs/2602.02883
作者: Parker Carlson,Wentai Xie,Rohil Shah,Tao Yang
类目: Information Retrieval (cs.IR)
*备注: 11 pages, 5 figures, 9 tables. Under review

点击查看摘要

Abstract:Learned sparse retrieval (LSR) is a popular method for first-stage retrieval because it combines the semantic matching of language models with efficient CPU-friendly algorithms. Previous work aggregates blocks into “superblocks” to quickly skip the visitation of blocks during query processing by using an advanced pruning heuristic. This paper proposes a simple and effective superblock pruning scheme that reduces the overhead of superblock score computation while preserving competitive relevance. It combines this scheme with a compact index structure and a robust zero-shot configuration that is effective across LSR models and multiple datasets. This paper provides an analytical justification and evaluation on the MS MARCO and BEIR datasets, demonstrating that the proposed scheme can be a strong alternative for efficient sparse retrieval.

[IR-11] Col-Bandit: Zero-Shot Query-Time Pruning for Late-Interaction Retrieval

链接: https://arxiv.org/abs/2602.02827
作者: Roi Pony,Adi Raz,Oshri Naparstek,Idan Friedman,Udi Barzelay
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Multi-vector late-interaction retrievers such as ColBERT achieve state-of-the-art retrieval quality, but their query-time cost is dominated by exhaustively computing token-level MaxSim interactions for every candidate document. While approximating late interaction with single-vector representations reduces cost, it often incurs substantial accuracy loss. We introduce Col-Bandit, a query-time pruning algorithm that reduces this computational burden by casting reranking as a finite-population Top- K identification problem. Col-Bandit maintains uncertainty-aware bounds over partially observed document scores and adaptively reveals only the (document, query token) MaxSim entries needed to determine the top results under statistical decision bounds with a tunable relaxation. Unlike coarse-grained approaches that prune entire documents or tokens offline, Col-Bandit sparsifies the interaction matrix on the fly. It operates as a zero-shot, drop-in layer over standard multi-vector systems, requiring no index modifications, offline preprocessing, or model retraining. Experiments on textual (BEIR) and multimodal (REAL-MM-RAG) benchmarks show that Col-Bandit preserves ranking fidelity while reducing MaxSim FLOPs by up to 5 \times , indicating that dense late-interaction scoring contains substantial redundancy that can be identified and pruned efficiently at query time.

附件下载

点击下载今日全部论文列表