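This post lists the latest papers retrieved from arXiv.org on 2026-01-30. The list updates automatically and is grouped into five areas: NLP, CV, ML, AI, and IR.
Note: the daily paper data is fetched from arXiv.org and refreshed automatically at around 12:00 each day.
Tip: to receive the daily paper list by email, leave your email address in the comments.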
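Contents
Overview (2026-01-30)
788 papers were updated today, including:
- Natural Language Processing: 120 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 276 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 115 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 302 papers (Machine Learning (cs.LG))
Natural Language Processing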
[NLP-0] RedSage: A Cybersecurity Generalist LLM ICLR2026
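Quick read: This paper targets two core problems in applying large language models (LLMs) to cybersecurity: existing solutions either rely on proprietary APIs that carry privacy risks or on open models that lack domain adaptation, and there is a shortage of high-quality training data and evaluation benchmarks aligned with security experts' workflows. The key to the solution: first, the authors curate 11.8B tokens of continual pretraining data through large-scale web filtering and manual collection, spanning 28.6K documents on frameworks, offensive techniques, and security tools; second, they design an agentic augmentation pipeline that simulates expert workflows to generate 266K multi-turn cybersecurity samples for supervised fine-tuning. The result is RedSage, an open-source, locally deployable cybersecurity assistant with domain-aware pretraining and post-training, accompanied by RedSage-Bench, a benchmark that evaluates knowledge, skills, and tool expertise.
Link: https://arxiv.org/abs/2601.22159
Authors: Naufal Suryanto, Muzammal Naseer, Pengfei Li, Syed Talal Wasim, Jinhui Yi, Juergen Gall, Paolo Ceravolo, Ernesto Damiani
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted on ICLR 2026; Project page: this https URL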
Abstract:Cybersecurity operations demand assistant LLMs that support diverse workflows without exposing sensitive data. Existing solutions either rely on proprietary APIs with privacy risks or on open models lacking domain adaptation. To bridge this gap, we curate 11.8B tokens of cybersecurity-focused continual pretraining data via large-scale web filtering and manual collection of high-quality resources, spanning 28.6K documents across frameworks, offensive techniques, and security tools. Building on this, we design an agentic augmentation pipeline that simulates expert workflows to generate 266K multi-turn cybersecurity samples for supervised fine-tuning. Combined with general open-source LLM data, these resources enable the training of RedSage, an open-source, locally deployable cybersecurity assistant with domain-aware pretraining and post-training. To rigorously evaluate the models, we introduce RedSage-Bench, a benchmark with 30K multiple-choice and 240 open-ended QA items covering cybersecurity knowledge, skills, and tool expertise. RedSage is further evaluated on established cybersecurity benchmarks (e.g., CTI-Bench, CyberMetric, SECURE) and general LLM benchmarks to assess broader generalization. At the 8B scale, RedSage achieves consistently better results, surpassing the baseline models by up to +5.59 points on cybersecurity benchmarks and +5.05 points on Open LLM Leaderboard tasks. These findings demonstrate that domain-aware agentic augmentation and pre/post-training can not only enhance cybersecurity-specific expertise but also help to improve general reasoning and instruction-following. All models, datasets, and code are publicly available.
[NLP-1] Discovering Hidden Gems in Model Repositories
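Quick read: This paper addresses the difficulty of discovering "hidden gems" in large model repositories, i.e., fine-tuned models that perform very well but are overlooked because they are rarely downloaded. Community usage concentrates on a small number of foundation checkpoints, which may hide many underused high-performing models. The key to the solution is to formulate model discovery as a Multi-Armed Bandit problem and to accelerate the Sequential Halving search algorithm with shared query sets and aggressive elimination schedules, retrieving top models with as few as 50 queries per candidate and accelerating discovery by over 50x.
Link: https://arxiv.org/abs/2601.22157
Authors: Jonathan Kahana, Eliahu Horwitz, Yedid Hoshen
Affiliations: The Hebrew University of Jerusalem
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: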
Abstract:Public repositories host millions of fine-tuned models, yet community usage remains disproportionately concentrated on a small number of foundation checkpoints. We investigate whether this concentration reflects efficient market selection or if superior models are systematically overlooked. Through an extensive evaluation of over 2,000 models, we show the prevalence of “hidden gems”, unpopular fine-tunes that significantly outperform their popular counterparts. Notably, within the Llama-3.1-8B family, we find rarely downloaded checkpoints that improve math performance from 83.2% to 96.0% without increasing inference costs. However, discovering these models through exhaustive evaluation of every uploaded model is computationally infeasible. We therefore formulate model discovery as a Multi-Armed Bandit problem and accelerate the Sequential Halving search algorithm by using shared query sets and aggressive elimination schedules. Our method retrieves top models with as few as 50 queries per candidate, accelerating discovery by over 50x.
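The search procedure named in the abstract can be illustrated with a minimal sketch of textbook Sequential Halving in which every surviving candidate is scored on the same batch of queries. This is not the authors' implementation: the function name, the budget split, the plain halving schedule, and the toy `SKILL` table are all assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence


def sequential_halving(
    candidates: Sequence[str],
    is_correct: Callable[[str, int], bool],
    queries: Sequence[int],
    total_budget: int,
) -> str:
    """Return the candidate that survives repeated halving of the pool."""
    if not candidates:
        raise ValueError("need at least one candidate")
    survivors: List[str] = list(candidates)
    rounds = max(1, math.ceil(math.log2(len(survivors))))
    cursor = 0  # position in the shared query list
    for _ in range(rounds):
        if len(survivors) == 1:
            break
        # Split the budget evenly over rounds, then over current survivors.
        per_model = max(1, total_budget // (len(survivors) * rounds))
        batch = queries[cursor:cursor + per_model]
        cursor += len(batch)
        if not batch:
            break  # ran out of queries
        # Shared query set: every survivor is scored on the same batch.
        accuracy: Dict[str, float] = {
            m: sum(is_correct(m, q) for q in batch) / len(batch)
            for m in survivors
        }
        survivors.sort(key=lambda m: accuracy[m], reverse=True)
        survivors = survivors[: max(1, len(survivors) // 2)]
    return survivors[0]


# Toy setup (hypothetical): model-k answers query q correctly iff q % 8 < k.
SKILL = {f"model-{k}": k for k in range(1, 9)}


def toy_is_correct(model: str, query: int) -> bool:
    return query % 8 < SKILL[model]


best = sequential_halving(list(SKILL), toy_is_correct, list(range(64)), 192)
print(best)  # model-8
```

Scoring all survivors on one batch means accuracy differences come from the models rather than from query difficulty, which is presumably why shared query sets help; the paper's aggressive elimination schedules would drop more than half of the pool per round.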
[NLP-2] Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts
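Quick read: This paper addresses the prohibitive cost of pre-training hybrid attention models from scratch for long-sequence modeling, and in particular the limitations of existing distillation methods, which need large amounts of training data (more than 10B tokens) and yield poor long-context performance. The key to the solution is the HALO (Hybrid Attention via Layer Optimization) distillation pipeline and the HypeNet architecture: HALO converts Transformer models into RNN-attention hybrids using only 2.3B tokens (less than 0.01% of the original pre-training data), while HypeNet introduces a novel position encoding scheme (HyPE) and several architectural modifications that improve length generalization. The converted models perform comparably to the original Transformers while offering better long-context performance and efficiency.
Link: https://arxiv.org/abs/2601.22156
Authors: Yingfa Chen, Zhen Leng Thai, Zihan Zhou, Zhu Zhang, Xingyu Shen, Shuo Wang, Chaojun Xiao, Xu Han, Zhiyuan Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, 8 figures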
Abstract:Hybrid Transformer architectures, which combine softmax attention blocks and recurrent neural networks (RNNs), have shown a desirable performance-throughput tradeoff for long-context modeling, but their adoption and studies are hindered by the prohibitive cost of large-scale pre-training from scratch. Some recent studies have shown that pre-trained softmax attention blocks can be converted into RNN blocks through parameter transfer and knowledge distillation. However, these transfer methods require substantial amounts of training data (more than 10B tokens), and the resulting hybrid models also exhibit poor long-context performance, which is the scenario where hybrid models enjoy significant inference speedups over Transformer-based models. In this paper, we present HALO (Hybrid Attention via Layer Optimization), a pipeline for distilling Transformer models into RNN-attention hybrid models. We then present HypeNet, a hybrid architecture with superior length generalization enabled by a novel position encoding scheme (named HyPE) and various architectural modifications. We convert the Qwen3 series into HypeNet using HALO, achieving performance comparable to the original Transformer models while enjoying superior long-context performance and efficiency. The conversion requires just 2.3B tokens, less than 0.01% of their pre-training data.
[NLP-3] UEval: A Benchmark for Unified Multimodal Generation
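Quick read: This paper addresses the lack of a reliable, fine-grained evaluation benchmark for unified models, i.e., models that can generate both images and text. Prior work relies on multimodal large language models (MLLMs) as raters, which can miss subtle differences in the generated output. The key to the solution is the UEval benchmark, whose structured scoring rubrics are drafted by an MLLM and then refined and validated by human experts, yielding 10,417 validated rubric criteria that enable scalable, fine-grained automatic scoring of open-ended multimodal output. Experiments show that the benchmark separates models of different capability and suggest that reasoning matters for complex multimodal understanding and generation.
Link: https://arxiv.org/abs/2601.22155
Authors: Bo Li, Yida Yin, Wenhao Chai, Xingyu Fu, Zhuang Liu
Affiliations: Princeton University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: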
Abstract:We introduce UEval, a benchmark to evaluate unified models, i.e., models capable of generating both images and text. UEval comprises 1,000 expert-curated questions that require both images and text in the model output, sourced from 8 real-world tasks. Our curated questions cover a wide range of reasoning types, from step-by-step guides to textbook explanations. Evaluating open-ended multimodal generation is non-trivial, as simple LLM-as-a-judge methods can miss the subtleties. Different from previous works that rely on multimodal Large Language Models (MLLMs) to rate image quality or text accuracy, we design a rubric-based scoring system in UEval. For each question, reference images and text answers are provided to a MLLM to generate an initial rubric, consisting of multiple evaluation criteria, and human experts then refine and validate these rubrics. In total, UEval contains 10,417 validated rubric criteria, enabling scalable and fine-grained automatic scoring. UEval is challenging for current unified models: GPT-5-Thinking scores only 66.4 out of 100, while the best open-source model reaches merely 49.1. We observe that reasoning models often outperform non-reasoning ones, and transferring reasoning traces from a reasoning model to a non-reasoning model significantly narrows the gap. This suggests that reasoning may be important for tasks requiring complex multimodal understanding and generation.
[NLP-4] Exploring Reasoning Reward Model for Agents
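Quick read: This paper addresses a weakness of current Agentic Reinforcement Learning (Agentic RL) methods: training relies on sparse outcome-based rewards that cannot differentiate the quality of intermediate reasoning, which leads to suboptimal training results. The key to the solution is the Agent Reasoning Reward Model (Agent-RRM), a multi-faceted reward model that gives structured feedback on agentic trajectories: an explicit reasoning trace, a focused critique that highlights reasoning flaws, and an overall score for process performance. Using these signals, the authors systematically investigate three integration strategies and find that unified feedback integration (Reagent-U) yields substantial gains across 12 benchmarks, validating the reasoning reward model.
Link: https://arxiv.org/abs/2601.22154
Authors: Kaixuan Fan, Kaituo Feng, Manyuan Zhang, Tianshuo Peng, Zhixun Li, Yilei Jiang, Shuang Chen, Peng Pei, Xunliang Cai, Xiangyu Yue
Affiliations: MMLab, CUHK; Meituan; SEEM, CUHK
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project page: this https URL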
Abstract:Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use. However, most methods still rely on sparse outcome-based reward for training. Such feedback fails to differentiate intermediate reasoning quality, leading to suboptimal training results. In this paper, we introduce Agent Reasoning Reward Model (Agent-RRM), a multi-faceted reward model that produces structured feedback for agentic trajectories, including (1) an explicit reasoning trace, (2) a focused critique that provides refinement guidance by highlighting reasoning flaws, and (3) an overall score that evaluates process performance. Leveraging these signals, we systematically investigate three integration strategies: Reagent-C (text-augmented refinement), Reagent-R (reward-augmented guidance), and Reagent-U (unified feedback integration). Extensive evaluations across 12 diverse benchmarks demonstrate that Reagent-U yields substantial performance leaps, achieving 43.7% on GAIA and 46.2% on WebWalkerQA, validating the effectiveness of our reasoning reward model and training schemes. Code, models, and datasets are all released to facilitate future research.
[NLP-5] DynaWeb: Model-Based Reinforcement Learning of Web Agents
Quick read: This paper addresses the inefficiency, cost, and risk of training autonomous web agents, built on Large Language Models (LLMs) and reinforcement learning (RL), against the live internet. The core solution is DynaWeb, a model-based reinforcement learning (MBRL) framework. Its key component is a web world model trained to predict naturalistic web page representations given agent actions; this model serves as a synthetic environment in which the policy generates large numbers of rollout action trajectories for efficient online RL. Real expert trajectories are randomly interleaved with on-policy rollouts to improve training stability and sample efficiency. Experiments show that the method consistently and significantly improves the performance of state-of-the-art open-source web agent models.
Link: https://arxiv.org/abs/2601.22149
Authors: Hang Ding, Peidong Liu, Junqiao Wang, Ziwei Ji, Meng Cao, Rongzhao Zhang, Lynn Ai, Eric Yang, Tianyu Shi, Lei Yu
Affiliations: Shanghai Jiao Tong University; Sichuan University; Hong Kong University of Science and Technology; McGill University; Shanghai AI Lab; Gradient; University of Toronto; Mila - Quebec AI Institute
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The development of autonomous web agents, powered by Large Language Models (LLMs) and reinforcement learning (RL), represents a significant step towards general-purpose AI assistants. However, training these agents is severely hampered by the challenges of interacting with the live internet, which is inefficient, costly, and fraught with risks. Model-based reinforcement learning (MBRL) offers a promising solution by learning a world model of the environment to enable simulated interaction. This paper introduces DynaWeb, a novel MBRL framework that trains web agents through interacting with a web world model trained to predict naturalistic web page representations given agent actions. This model serves as a synthetic web environment where an agent policy can dream by generating vast quantities of rollout action trajectories for efficient online reinforcement learning. Beyond free policy rollouts, DynaWeb incorporates real expert trajectories from training data, which are randomly interleaved with on-policy rollouts during training to improve stability and sample efficiency. Experiments conducted on the challenging WebArena and WebVoyager benchmarks demonstrate that DynaWeb consistently and significantly improves the performance of state-of-the-art open-source web agent models. Our findings establish the viability of training web agents through imagination, offering a scalable and efficient way to scale up online agentic RL.
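The random interleaving of expert trajectories with on-policy rollouts can be sketched as a batch builder. This is only an illustrative reading of the abstract: `build_batch`, the `expert_prob` mixing parameter, and the string stand-ins for trajectories are hypothetical and not taken from the paper.

```python
import random
from typing import Callable, List, Sequence


def build_batch(
    batch_size: int,
    expert_trajectories: Sequence[str],
    sample_rollout: Callable[[], str],
    expert_prob: float,
    rng: random.Random,
) -> List[str]:
    """Fill each slot with an expert trajectory (prob. expert_prob) or a rollout."""
    batch: List[str] = []
    for _ in range(batch_size):
        if expert_trajectories and rng.random() < expert_prob:
            # Replay a real trajectory from the training data.
            batch.append(rng.choice(expert_trajectories))
        else:
            # Imagined rollout; in DynaWeb this would come from the world model.
            batch.append(sample_rollout())
    return batch


rng = random.Random(0)
experts = ["expert-1", "expert-2"]
mixed = build_batch(8, experts, lambda: "rollout", 0.25, rng)
only_rollouts = build_batch(4, experts, lambda: "rollout", 0.0, rng)
only_experts = build_batch(4, experts, lambda: "rollout", 1.0, rng)
```

How the mixing ratio is chosen or scheduled is not stated in the abstract, so `expert_prob` is left as a free parameter here.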
[NLP-6] FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale
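Quick read: This paper addresses the limited amount of supervised data available for instruction tuning of large language models (LLMs). Standard pre-training uses a self-supervised "predict the next word" objective on large volumes of unstructured text, whereas instruction tuning uses a far smaller set of supervised instruction-response examples. The key to the solution is a scalable synthetic data procedure that transforms internet-scale pre-training documents into billions of synthetic instruction-answer pairs, producing the FineInstructions dataset. The dataset uses about 18M instruction templates created from real user-written queries and prompts, matched to and instantiated with human-written source documents from unstructured pre-training corpora. This makes it possible to pre-train a model from scratch with the instruction-tuning objective alone, and in controlled experiments it outperforms standard pre-training on benchmarks measuring free-form response quality.
Link: https://arxiv.org/abs/2601.22146
Authors: Ajay Patel, Colin Raffel, Chris Callison-Burch
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: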
Abstract:Due to limited supervised training data, large language models (LLMs) are typically pre-trained via a self-supervised “predict the next word” objective on a vast amount of unstructured text data. To make the resulting model useful to users, it is further trained on a far smaller amount of “instruction-tuning” data comprised of supervised training examples of instructions and responses. To overcome the limited amount of supervised data, we propose a procedure that can transform the knowledge in internet-scale pre-training documents into billions of synthetic instruction and answer training pairs. The resulting dataset, called FineInstructions, uses ~18M instruction templates created from real user-written queries and prompts. These instruction templates are matched to and instantiated with human-written source documents from unstructured pre-training corpora. With “supervised” synthetic training data generated at this scale, an LLM can be pre-trained from scratch solely with the instruction-tuning objective, which is far more in-distribution with the expected downstream usage of LLMs (responding to user prompts). We conduct controlled token-for-token training experiments and find pre-training on FineInstructions outperforms standard pre-training and other proposed synthetic pre-training techniques on standard benchmarks measuring free-form response quality. Our resources can be found at this https URL .
[NLP-7] Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers
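Quick read: This paper addresses wasted reasoning in reasoning-oriented Large Language Models (LLMs) built on Chain-of-Thought (CoT): under a "blind self-thinking" paradigm, they reason at length even when premises are uncertain or intent is ambiguous. The core solution is Proactive Interactive Reasoning (PIR), which turns the LLM from a passive solver into a proactive inquirer that asks the user for clarification during reasoning, reducing unnecessary computation and erroneous reasoning. PIR has two core components: an uncertainty-aware supervised fine-tuning procedure that gives models interactive reasoning capability, and a user-simulator-based policy optimization framework whose composite reward aligns model behavior with user intent. Experiments on mathematical reasoning, code generation, and document editing show that PIR consistently outperforms strong baselines while cutting reasoning computation by nearly half.
Link: https://arxiv.org/abs/2601.22139
Authors: Xin Chen, Feng Jiang, Yiqian Zhang, Hardy Chen, Shuo Yan, Wenya Xie, Min Yang, Shujian Huang
Affiliations: Nanjing University; Shenzhen University of Advanced Technology; Chinese Academy of Sciences; University of California, Santa Cruz; University of Texas, Dallas; University of Minnesota
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: The manuscript is under review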
Abstract:Reasoning-oriented Large Language Models (LLMs) have achieved remarkable progress with Chain-of-Thought (CoT) prompting, yet they remain fundamentally limited by a blind self-thinking paradigm: performing extensive internal reasoning even when critical information is missing or ambiguous. We propose Proactive Interactive Reasoning (PIR), a new reasoning paradigm that transforms LLMs from passive solvers into proactive inquirers that interleave reasoning with clarification. Unlike existing search- or tool-based frameworks that primarily address knowledge uncertainty by querying external environments, PIR targets premise- and intent-level uncertainty through direct interaction with the user. PIR is implemented via two core components: (1) an uncertainty-aware supervised fine-tuning procedure that equips models with interactive reasoning capability, and (2) a user-simulator-based policy optimization framework driven by a composite reward that aligns model behavior with user intent. Extensive experiments on mathematical reasoning, code generation, and document editing demonstrate that PIR consistently outperforms strong baselines, achieving up to 32.70% higher accuracy, 22.90% higher pass rate, and 41.36 BLEU improvement, while reducing nearly half of the reasoning computation and unnecessary interaction turns. Further reliability evaluations on factual knowledge, question answering, and missing-premise scenarios confirm the strong generalization and robustness of PIR. Model and code are publicly available at: this https URL
[NLP-8] A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine
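Quick read: This paper addresses the limited generalizability and safety of medical Large Language Models (LLMs) trained on single-institution data. Conventional Federated Learning (FL) is hard to apply to LLMs across heterogeneous clinical sites because it transmits the full model in every communication round and often implicitly assumes homogeneous data. The key to the solution is a model-agnostic, parameter-efficient FL framework: Fed-MedLoRA transmits only low-rank adapter parameters, which reduces communication and computation overhead, and Fed-MedLoRA+ adds adaptive, data-aware aggregation to improve convergence under cross-site heterogeneity.
Link: https://arxiv.org/abs/2601.22124
Authors: Anran Li, Yuanyuan Chen, Wenjun Long, Yu Yin, Yan Hu, Hyunjae Kim, Weipeng Zhou, Yujia Zhou, Hongyi Peng, Yang Ren, Xuguang Ai, Zhenyue Qin, Ming Hu, Xiaoxiao Li, Han Yu, Yih-Chung Tham, Lucila Ohno-Machado, Hua Xu, Qingyu Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 38 pages, 9 tables, 3 figures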
Abstract:Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis. To enable their use in clinical settings, LLMs are typically further adapted through continued pretraining or post-training using clinical data. However, most medical LLMs are trained on data from a single institution, which faces limitations in generalizability and safety in heterogeneous systems. Federated learning (FL) is a promising solution for enabling collaborative model development across healthcare institutions. Yet applying FL to LLMs in medicine remains fundamentally limited. First, conventional FL requires transmitting the full model during each communication round, which becomes impractical for multi-billion-parameter LLMs given the limited computational resources. Second, many FL algorithms implicitly assume data homogeneity, whereas real-world clinical data are highly heterogeneous across patients, diseases, and institutional practices. We introduce the model-agnostic and parameter-efficient federated learning framework for adapting LLMs to medical applications. Fed-MedLoRA transmits only low-rank adapter parameters, reducing communication and computation overhead, while Fed-MedLoRA+ further incorporates adaptive, data-aware aggregation to improve convergence under cross-site heterogeneity. We apply the framework to clinical information extraction (IE), which transforms patient narratives into structured medical entities and relations. Accuracy was assessed across five patient cohorts through comparisons with BERT models, and LLaMA-3 and DeepSeek-R1, GPT-4o models. Evaluation settings included (1) in-domain training and testing, (2) external validation on independent cohorts, and (3) a low-resource new-site adaptation scenario using real-world clinical notes from the Yale New Haven Health System.
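To see why sending only low-rank adapters shrinks the payload, the sketch below counts parameters for one weight matrix and then aggregates adapter values with plain sample-size-weighted averaging (FedAvg). FedAvg is used here as a stand-in; it is not the adaptive, data-aware rule of Fed-MedLoRA+, and the layer size, rank, and site sizes are hypothetical.

```python
from typing import List, Sequence


def full_matrix_params(d_out: int, d_in: int) -> int:
    """Parameters sent per round if the full weight matrix is transmitted."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in a low-rank update B @ A (B: d_out x rank, A: rank x d_in)."""
    return rank * (d_out + d_in)


def weighted_average(
    site_params: Sequence[Sequence[float]], site_sizes: Sequence[int]
) -> List[float]:
    """FedAvg-style aggregation: average flattened parameters by sample count."""
    total = sum(site_sizes)
    n = len(site_params[0])
    return [
        sum(p[i] * s for p, s in zip(site_params, site_sizes)) / total
        for i in range(n)
    ]


full = full_matrix_params(4096, 4096)      # 16,777,216
adapter = lora_params(4096, 4096, rank=8)  # 65,536
print(full // adapter)                     # 256

# Two hypothetical sites with 100 and 300 training samples.
merged = weighted_average([[1.0, 2.0], [3.0, 6.0]], [100, 300])
print(merged)  # [2.5, 5.0]
```

One caveat of this simple scheme: averaging the A and B factors separately is not the same as averaging the products B @ A, which is one reason more careful aggregation rules exist.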
[NLP-9] ECO: Quantized Training without Full-Precision Master Weights
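Quick read: This paper addresses the substantial memory overhead of keeping high-precision master weights during Large Language Model (LLM) training, which is especially pronounced for Sparse Mixture of Experts (SMoE) models. Existing quantized training improves compute and memory efficiency but still accumulates gradient updates in a high-precision buffer. The key to the solution is the Error-Compensating Optimizer (ECO), which quantizes the weights after each step and injects the resulting quantization error into the optimizer momentum, forming an error-feedback loop that removes the need to store master weights. The authors prove that, under standard assumptions and a decaying learning rate, ECO converges to a constant-radius neighborhood of the optimum, whereas naive master-weight removal can incur an error inversely proportional to the learning rate. Experiments on pretraining models from 30M to 2.1B parameters and on fine-tuning DeepSeek-MoE-16B show near-lossless accuracy and a significantly improved Pareto frontier of static memory versus validation loss.
Link: https://arxiv.org/abs/2601.22101
Authors: Mahdi Nikdan, Amir Zandieh, Dan Alistarh, Vahab Mirrokni
Affiliations: Google Research; ISTA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: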
Abstract:Quantization has significantly improved the compute and memory efficiency of Large Language Model (LLM) training. However, existing approaches still rely on accumulating their updates in high-precision: concretely, gradient updates must be applied to a high-precision weight buffer, known as master weights. This buffer introduces substantial memory overhead, particularly for Sparse Mixture of Experts (SMoE) models, where model parameters and optimizer states dominate memory usage. To address this, we introduce the Error-Compensating Optimizer (ECO), which eliminates master weights by applying updates directly to quantized parameters. ECO quantizes weights after each step and carefully injects the resulting quantization error into the optimizer momentum, forming an error-feedback loop with no additional memory. We prove that, under standard assumptions and a decaying learning rate, ECO converges to a constant-radius neighborhood of the optimum, while naive master-weight removal can incur an error that is inversely proportional to the learning rate. We show empirical results for pretraining small Transformers (30-800M), a Gemma-3 1B model, and a 2.1B parameter Sparse MoE model with FP8 quantization, and fine-tuning DeepSeek-MoE-16B in INT4 precision. Throughout, ECO matches baselines with master weights up to near-lossless accuracy, significantly shifting the static memory vs validation loss Pareto frontier.
[NLP-10] GeoNorm: Unify Pre-Norm and Post-Norm with Geodesic Optimization
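Quick read: This paper addresses the open question of where to place normalization layers in Transformer architectures (Pre-Norm versus Post-Norm). It revisits the question through manifold optimization, interpreting the outputs of the FFN and attention layers as update directions in optimization. The key to the solution is GeoNorm, which replaces standard normalization with geodesic updates on the manifold and adds a layer-wise update decay, analogous to a learning rate schedule, improving performance at negligible additional computational cost.
Link: https://arxiv.org/abs/2601.22095
Authors: Chuanyang Zheng, Jiankai Sun, Yihang Gao, Chi Wang, Yuehao Wang, Jing Xiong, Liliang Ren, Bo Peng, Qingmei Wang, Xiaoran Shang, Mac Schwager, Anderson Schneider, Yuriy Nevmyvaka, Xiaodong Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Tech Report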
Abstract:The placement of normalization layers, specifically Pre-Norm and Post-Norm, remains an open question in Transformer architecture design. In this work, we rethink these approaches through the lens of manifold optimization, interpreting the outputs of the Feed-Forward Network (FFN) and attention layers as update directions in optimization. Building on this perspective, we introduce GeoNorm, a novel method that replaces standard normalization with geodesic updates on the manifold. Furthermore, analogous to learning rate schedules, we propose a layer-wise update decay for the FFN and attention components. Comprehensive experiments demonstrate that GeoNorm consistently outperforms existing normalization methods in Transformer models. Crucially, GeoNorm can be seamlessly integrated into standard Transformer architectures, achieving performance improvements with negligible additional computational cost.
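As a rough illustration of what a geodesic update means, the sketch below moves a unit vector along a great circle of the unit sphere (the exponential map on the sphere) instead of adding an update and renormalizing. The choice of the unit sphere as the manifold is an assumption made here; the abstract does not say which manifold GeoNorm uses, and `geodesic_step` is a hypothetical helper, not the paper's layer.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def geodesic_step(
    x: Sequence[float], update: Sequence[float], step_size: float
) -> List[float]:
    """Move unit vector x along the great circle in the direction of update."""
    # Project the update onto the tangent space at x (drop the radial part).
    radial = dot(update, x)
    tangent = [u - radial * xi for u, xi in zip(update, x)]
    norm = math.sqrt(dot(tangent, tangent))
    if norm == 0.0:
        return list(x)  # update is purely radial: nothing to do
    # Exponential map on the unit sphere: rotate x by `angle` toward tangent.
    angle = step_size * norm
    return [
        math.cos(angle) * xi + math.sin(angle) * ti / norm
        for xi, ti in zip(x, tangent)
    ]


x = [1.0, 0.0, 0.0]
layer_output = [0.5, 2.0, 0.0]  # plays the role of an FFN/attention output
y = geodesic_step(x, layer_output, step_size=0.1)
print(math.sqrt(dot(y, y)))  # stays on the sphere (norm 1 up to rounding)
```

The result stays on the sphere by construction, so no separate normalization is needed after the update; a layer-wise decay would correspond to shrinking `step_size` with depth.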
[NLP-11] VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning
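Quick read: This paper addresses the severe efficiency bottleneck that long-context reasoning creates in large language models (LLMs) because of its computational complexity. Existing efficient approaches often rely on complex additional training or external compression models, which limits scalability and discards critical fine-grained information. The key to the solution is VTC-R1, a reasoning paradigm that renders intermediate reasoning segments into compact images and iteratively feeds them back into vision-language models (VLMs) as "optical memory", integrating vision-text compression into the reasoning process. The training data, built on OpenR1-Math-220K, achieves 3.4x token compression; the method gives a 2.7x speedup in end-to-end latency and outperforms standard long-context reasoning on benchmarks such as MATH500 and AIME25.
Link: https://arxiv.org/abs/2601.22069
Authors: Yibo Wang, Yongcheng Jing, Shunyu Liu, Hao Guan, Rong-cheng Tu, Chengyu Wang, Jun Huang, Dacheng Tao
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Code: this https URL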
Abstract:Long-context reasoning has significantly empowered large language models (LLMs) to tackle complex tasks, yet it introduces severe efficiency bottlenecks due to the computational complexity. Existing efficient approaches often rely on complex additional training or external models for compression, which limits scalability and discards critical fine-grained information. In this paper, we propose VTC-R1, a new efficient reasoning paradigm that integrates vision-text compression into the reasoning process. Instead of processing lengthy textual traces, VTC-R1 renders intermediate reasoning segments into compact images, which are iteratively fed back into vision-language models as “optical memory.” We construct a training dataset based on OpenR1-Math-220K achieving 3.4x token compression and fine-tune representative VLMs, Glyph and Qwen3-VL. Extensive experiments on benchmarks such as MATH500, AIME25, AMC23 and GPQA-D demonstrate that VTC-R1 consistently outperforms standard long-context reasoning. Furthermore, our approach significantly improves inference efficiency, achieving 2.7x speedup in end-to-end latency, highlighting its potential as a scalable solution for reasoning-intensive applications. Our code is available at this https URL.
[NLP-12] G2-Reader: Dual Evolving Graphs for Multimodal Document QA
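Quick read: This paper studies the brittleness of Retrieval-Augmented Generation (RAG) for multimodal document reading and targets two core challenges. First, flat chunking breaks document-native structure and cross-modal alignment, producing semantic fragments that are hard to interpret in isolation. Second, iterative retrieval in long contexts can loop on partial evidence or drift as noise accumulates, because each step is guided only by the current snippet without a persistent global search state. The key to the solution is G²-Reader, a dual-graph system: a Content Graph preserves document-native structure and cross-modal semantics, and a Planning Graph, an agentic directed acyclic graph of sub-questions, tracks intermediate findings and guides stepwise navigation for evidence completion. On VisDoMBench, G²-Reader reaches 66.21% average accuracy, outperforming strong baselines and a standalone GPT-5 (53.08%).
Link: https://arxiv.org/abs/2601.22055
Authors: Yaxin Du, Junru Song, Yifan Zhou, Cheng Wang, Jiahao Gu, Zimeng Chen, Menglan Chen, Wen Yao, Yang Yang, Ying Wen, Siheng Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: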
Abstract:Retrieval-augmented generation is a practical paradigm for question answering over long documents, but it remains brittle for multimodal reading where text, tables, and figures are interleaved across many pages. First, flat chunking breaks document-native structure and cross-modal alignment, yielding semantic fragments that are hard to interpret in isolation. Second, even iterative retrieval can fail in long contexts by looping on partial evidence or drifting into irrelevant sections as noise accumulates, since each step is guided only by the current snippet without a persistent global search state. We introduce G²-Reader, a dual-graph system, to address both issues. It evolves a Content Graph to preserve document-native structure and cross-modal semantics, and maintains a Planning Graph, an agentic directed acyclic graph of sub-questions, to track intermediate findings and guide stepwise navigation for evidence completion. On VisDoMBench across five multimodal domains, G²-Reader with Qwen3-VL-32B-Instruct reaches 66.21% average accuracy, outperforming strong baselines and a standalone GPT-5 (53.08%).
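[NLP-13] MasalBench: A Benchmark for Contextual and Cross-Cultural Understanding of Persian Proverbs in LLMs
Quick read: This paper addresses the limited contextual and cross-cultural understanding that multilingual Large Language Models (LLMs) show for figurative expressions, such as Persian proverbs, in low-resource languages. The key to the solution is MasalBench, a comprehensive benchmark for assessing how well LLMs identify Persian proverbs in context and map them to equivalent English proverbs. Experiments show that models identify Persian proverbs in context well (accuracy above 0.90), but performance drops considerably when they must identify equivalent English proverbs (0.79 for the best model). This highlights limitations in cultural knowledge and analogical reasoning and provides a framework for assessing cross-cultural understanding in other low-resource languages.
Link: https://arxiv.org/abs/2601.22050
Authors: Ghazal Kalhor, Behnam Bahrak
Affiliations: University of Tehran; Tehran Institute for Advanced Studies; Khatam University
Categories: Computation and Language (cs.CL)
Comments: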
Abstract:In recent years, multilingual Large Language Models (LLMs) have become an inseparable part of daily life, making it crucial for them to master the rules of conversational language in order to communicate effectively with users. While previous work has evaluated LLMs’ understanding of figurative language in high-resource languages, their performance in low-resource languages remains underexplored. In this paper, we introduce MasalBench, a comprehensive benchmark for assessing LLMs’ contextual and cross-cultural understanding of Persian proverbs, which are a key component of conversation in this low-resource language. We evaluate eight state-of-the-art LLMs on MasalBench and find that they perform well in identifying Persian proverbs in context, achieving accuracies above 0.90. However, their performance drops considerably when tasked with identifying equivalent English proverbs, with the best model achieving 0.79 accuracy. Our findings highlight the limitations of current LLMs in cultural knowledge and analogical reasoning, and they provide a framework for assessing cross-cultural understanding in other low-resource languages. MasalBench is available at this https URL.
[NLP-14] On the Paradoxical Interference between Instruction-Following and Task Solving
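Quick read: This paper addresses the finding that instruction following can interfere with the task-solving capability of Large Language Models (LLMs), i.e., that instruction following and task performance can conflict. The key to the solution is a metric, SUSTAINSCORE, which measures the drop in task performance after a self-evident constraint is inserted into the instruction; the constraint is extracted from the original successful output and is therefore naturally satisfied by it. Experiments show that the interference is substantial even for advanced models such as Claude-Sonnet-4.5, and attention analysis shows that failed cases allocate significantly more attention to the constraints than successful ones.
Link: https://arxiv.org/abs/2601.22047
Authors: Yunjia Qi, Hao Peng, Xintong Shi, Amy Xin, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li
Affiliations: Tsinghua University; Department of Computer Science and Technology, BNRist; Shenzhen International Graduate School
Categories: Computation and Language (cs.CL)
Comments: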
Abstract:Instruction following aims to align Large Language Models (LLMs) with human intent by specifying explicit constraints on how tasks should be performed. However, we reveal a counterintuitive phenomenon: instruction following can paradoxically interfere with LLMs’ task-solving capability. We propose a metric, SUSTAINSCORE, to quantify the interference of instruction following with task solving. It measures task performance drop after inserting into the instruction a self-evident constraint, which is naturally met by the original successful model output and extracted from it. Experiments on current LLMs in mathematics, multi-hop QA, and code generation show that adding the self-evident constraints leads to substantial performance drops, even for advanced models such as Claude-Sonnet-4.5. We validate the generality of the interference across constraint types and scales. Furthermore, we identify common failure patterns, and by investigating the mechanisms of interference, we observe that failed cases allocate significantly more attention to constraints compared to successful ones. Finally, we use SUSTAINSCORE to conduct an initial investigation into how distinct post-training paradigms affect the interference, presenting empirical observations on current alignment strategies. We will release our code and data to facilitate further research.
[NLP-15] A Separable Architecture for Continuous Token Representation in Language Models
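Quick read: This paper addresses the fact that, in small language models (SLMs), the embedding matrix dominates the parameter budget, an allocation the authors argue is both suboptimal and counterintuitive. Transformer scaling law analyses usually treat parameters as interchangeable, but below one billion parameters the embedding matrices take up a dominant share of the total. The authors propose Leviathan, whose core innovation is a continuous embedding generator that replaces the discrete lookup tables of canonical models. In isoparametric comparisons on the Pile dataset, Leviathan consistently outperforms a standard LLaMA-style architecture, and an empirical power-law fit indicates that it behaves like a dense model with 1.47 to 2.11 times more parameters.
Link: https://arxiv.org/abs/2601.22040
Authors: Reza T. Batley, Sourav Saha
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: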
Abstract:Transformer scaling law analyses typically treat parameters as interchangeable, an abstraction that accurately predicts loss-compute relationships. Yet, in sub-billion-parameter small language models (SLMs), embedding matrices dominate the parameter budget. This work argues that this allocation is as suboptimal as it is counterintuitive. Leviathan is an architecture with a continuous embedding generator to replace the discrete lookup tables of canonical models. Evaluating on the Pile dataset under isoparametric settings, Leviathan consistently outperforms a standard, LLaMA-style architecture. By means of an empirical power-law fit, Leviathan exhibits a markedly superior effective parameter capacity. Across the regime studied, Leviathan behaves as a dense model with 1.47 to 2.11× more parameters.
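[NLP-16] Thinking Out of Order: When Output Order Stops Reflecting Reasoning Order in Diffusion Language Models
[Quick Read]: This paper addresses the "premature commitment" problem that autoregressive (AR) language models face when the required output structure conflicts with the natural order of reasoning. For example, when a task requires the answer before the explanation, the fixed left-to-right generation order forces an AR model to commit to an answer before reasoning, and performance drops sharply. The key idea is to use masked diffusion language models (MDLMs), which iteratively refine all tokens in parallel and so decouple computation order from output structure, a property the authors call "order robustness". Experiments suggest that MDLMs stabilize simpler tokens (such as reasoning steps) earlier in the diffusion process than more complex ones (such as final answers), so reasoning can settle before the answer is fixed.
Link: https://arxiv.org/abs/2601.22035
Authors: Longxuan Yu,Yu Fu,Shaorong Zhang,Hui Liu,Mukund Varma T,Greg Ver Steeg,Yue Dong
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages, 13 figures, 5 tables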
Abstract:Autoregressive (AR) language models enforce a fixed left-to-right generation order, creating a fundamental limitation when the required output structure conflicts with natural reasoning (e.g., producing answers before explanations due to presentation or schema constraints). In such cases, AR models must commit to answers before generating intermediate reasoning, and this rigid constraint forces premature commitment. Masked diffusion language models (MDLMs), which iteratively refine all tokens in parallel, offer a way to decouple computation order from output structure. We validate this capability on GSM8K, Math500, and ReasonOrderQA, a benchmark we introduce with controlled difficulty and order-level evaluation. When prompts request answers before reasoning, AR models exhibit large accuracy gaps compared to standard chain-of-thought ordering (up to 67% relative drop), while MDLMs remain stable (≤ 14% relative drop), a property we term “order robustness”. Using ReasonOrderQA, we present evidence that MDLMs achieve order robustness by stabilizing simpler tokens (e.g., reasoning steps) earlier in the diffusion process than complex ones (e.g., final answers), enabling reasoning tokens to stabilize before answer commitment. Finally, we identify failure conditions where this advantage weakens, outlining the limits required for order robustness.
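The relative drops quoted in the abstract can be read as the accuracy lost when switching to answer-first prompting, expressed as a fraction of the chain-of-thought accuracy. The sketch below shows that arithmetic. The formula is an assumed reading of "relative drop", and the accuracy values are invented placeholders, not results from the paper.

```python
def relative_drop(baseline_acc: float, reordered_acc: float) -> float:
    """Accuracy lost after reordering, as a fraction of the baseline accuracy."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (baseline_acc - reordered_acc) / baseline_acc


# Hypothetical accuracies for illustration only (not taken from the paper).
ar_drop = relative_drop(baseline_acc=0.60, reordered_acc=0.30)
mdlm_drop = relative_drop(baseline_acc=0.50, reordered_acc=0.45)

print(f"AR relative drop:   {ar_drop:.0%}")
print(f"MDLM relative drop: {mdlm_drop:.0%}")
```

Under this reading, a model is more order-robust the closer its relative drop is to zero.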
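[NLP-17] Causal Autoregressive Diffusion Language Model
[Quick Read]: This paper aims to combine the training efficiency of autoregressive models (ARMs) with the high-throughput inference of diffusion models. The proposed Causal Autoregressive Diffusion (CARD) framework reformulates the diffusion process under a strictly causal attention mask, giving dense per-token supervision in a single forward pass. To stabilize optimization, it introduces a soft-tailed masking schema that preserves local context and a context-aware reweighting mechanism derived from signal-to-noise principles. The design supports dynamic parallel decoding, in which the model uses KV-caching to generate variable-length token sequences adaptively based on confidence, reducing training latency relative to block diffusion methods while keeping the latency benefits of parallel generation.
Link: https://arxiv.org/abs/2601.22031
Authors: Junhao Ruan,Bei Li,Yongjing Yin,Pengcheng Huang,Xin Chen,Jingang Wang,Xunliang Cai,Tong Xiao,JingBo Zhu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: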
Abstract:In this work, we propose Causal Autoregressive Diffusion (CARD), a novel framework that unifies the training efficiency of ARMs with the high-throughput inference of diffusion models. CARD reformulates the diffusion process within a strictly causal attention mask, enabling dense, per-token supervision in a single forward pass. To address the optimization instability of causal diffusion, we introduce a soft-tailed masking schema to preserve local context and a context-aware reweighting mechanism derived from signal-to-noise principles. This design enables dynamic parallel decoding, where the model leverages KV-caching to adaptively generate variable-length token sequences based on confidence. Empirically, CARD outperforms existing discrete diffusion baselines while reducing training latency by 3× compared to block diffusion methods. Our results demonstrate that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation, establishing a robust paradigm for next-generation efficient LLMs.
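The abstract describes decoding that emits a variable number of tokens per step depending on confidence. One simple way to picture this is a rule that accepts the leading run of proposed tokens whose confidence clears a threshold. The sketch below is only an illustration of that idea: the function name, the threshold value, and the acceptance rule are assumptions, not CARD's actual decoding procedure.

```python
from typing import List, Tuple


def accept_confident_prefix(
    proposals: List[Tuple[str, float]], threshold: float = 0.9
) -> List[str]:
    """Accept leading proposals until one falls below the confidence threshold.

    The first proposal is always accepted so that each step makes progress.
    """
    accepted: List[str] = []
    for token, confidence in proposals:
        if accepted and confidence < threshold:
            break  # stop at the first low-confidence token after the first one
        accepted.append(token)
    return accepted


# Hypothetical (token, confidence) pairs from one parallel proposal step.
step = [("The", 0.99), ("cat", 0.95), ("sat", 0.60), ("down", 0.97)]
print(accept_confident_prefix(step))
```

In this toy rule the number of tokens emitted per step grows when the model is confident and shrinks to one when it is not, which is the behaviour the phrase "variable-length token sequences based on confidence" suggests.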
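[NLP-18] When “Better” Prompts Hurt: Evaluation-Driven Iteration for LLM Applications
[Quick Read]: This paper addresses how evaluating large language model (LLM) applications differs from traditional software testing: outputs are stochastic, high-dimensional, and sensitive to prompt and model changes. It proposes an evaluation-driven workflow (Define, Test, Diagnose, Fix) that turns these challenges into a repeatable engineering loop, and introduces the Minimum Viable Evaluation Suite (MVES), a tiered set of evaluation components for general LLM applications, retrieval-augmented generation (RAG), and agentic tool-use workflows. The paper synthesizes automated checks, human rubrics, and LLM-as-judge methods together with known judge failure modes, and argues for prompt iteration driven by empirical feedback and careful claim calibration rather than universal prompt recipes.
Link: https://arxiv.org/abs/2601.22025
Authors: Daniel Commey
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Software Engineering (cs.SE)
Comments: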
Abstract:Evaluating Large Language Model (LLM) applications differs from traditional software testing because outputs are stochastic, high-dimensional, and sensitive to prompt and model changes. We present an evaluation-driven workflow - Define, Test, Diagnose, Fix - that turns these challenges into a repeatable engineering loop. We introduce the Minimum Viable Evaluation Suite (MVES), a tiered set of recommended evaluation components for (i) general LLM applications, (ii) retrieval-augmented generation (RAG), and (iii) agentic tool-use workflows. We also synthesize common evaluation methods (automated checks, human rubrics, and LLM-as-judge) and discuss known judge failure modes. In reproducible local experiments (Ollama; Llama 3 8B Instruct and Qwen 2.5 7B Instruct), we observe that a generic “improved” prompt template can trade off behaviors: on our small structured suites, extraction pass rate decreased from 100% to 90% and RAG compliance from 93.3% to 80% for Llama 3 when replacing task-specific prompts with generic rules, while instruction-following improved. These findings motivate evaluation-driven prompt iteration and careful claim calibration rather than universal prompt recipes. All test suites, harnesses, and results are included for reproducibility.
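The comparison reported in the abstract (pass rates of two prompt variants on a small structured suite) can be sketched with a single automated check. This is a minimal illustration under stated assumptions, not the paper's harness: `is_valid_json_object` is a hypothetical check, and the sample outputs are invented.

```python
import json
from typing import Callable, Dict, List

Check = Callable[[str], bool]


def is_valid_json_object(text: str) -> bool:
    """Automated check: the output must parse as a JSON object and nothing else."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


def pass_rate(outputs: List[str], check: Check) -> float:
    """Fraction of outputs that satisfy the check."""
    if not outputs:
        raise ValueError("no outputs to evaluate")
    return sum(1 for out in outputs if check(out)) / len(outputs)


def compare_prompts(runs: Dict[str, List[str]], check: Check) -> Dict[str, float]:
    """Pass rate per prompt variant, so a regression shows up before shipping."""
    return {name: pass_rate(outputs, check) for name, outputs in runs.items()}


# Invented model outputs for two prompt variants on the same extraction cases.
runs = {
    "task_specific": ['{"name": "Ada"}', '{"name": "Linus"}'],
    "generic_rules": ['{"name": "Ada"}', 'Sure! {"name": "Linus"}'],
}
results = compare_prompts(runs, is_valid_json_object)
print(results)
```

Running the same suite against every prompt change is what turns Define, Test, Diagnose, Fix into a loop: a lower pass rate for the supposedly improved variant is the signal to diagnose before adopting it.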
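[NLP-19] Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units
[Quick Read]: This paper addresses the unclear causal relationship between the interpretable mechanisms of large language models (LLMs) and their training data. Prior work has identified interpretable circuits, but their causal origins in the training data remain elusive. The authors propose Mechanistic Data Attribution (MDA), a scalable framework that uses influence functions to trace interpretable units back to specific training samples. Experiments show that targeted interventions on high-influence samples (removal or augmentation) significantly modulate the emergence of interpretable heads, whereas random interventions have no effect. The analysis identifies repetitive structural data (e.g., LaTeX, XML) as a mechanistic catalyst and provides direct causal evidence for the functional link between induction heads and in-context learning (ICL).
Link: https://arxiv.org/abs/2601.21996
Authors: Jianhui Chen,Yuzhang Luo,Liangming Pan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: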
Abstract:While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention (removing or augmenting a small fraction of high-influence samples) significantly modulates the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model’s in-context learning (ICL) capability. This provides direct causal evidence for the long-standing hypothesis regarding the functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that consistently accelerates circuit convergence across model scales, providing a principled methodology for steering the developmental trajectories of LLMs.
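[NLP-20] Token-Guard: Towards Token-Level Hallucination Control via Self-Checking Decoding ICLR2026
[Quick Read]: This paper addresses hallucination in large language models (LLMs), that is, generated content that is inconsistent with the input. Retrieval-Augmented Generation (RAG) and Reinforcement Learning with Human Feedback (RLHF) mitigate the problem but require resource-intensive retrieval or large-scale fine-tuning, while decoding-based methods lack explicit hallucination control. The proposed Token-Guard is a token-level control method based on self-checking decoding: it performs internal verification at each reasoning step to catch hallucinated tokens before they propagate, scores candidate fragments in a latent space with an explicit hallucination risk, and corrects detected errors through iterative pruning and regeneration, yielding a scalable, modular way to suppress hallucinations.
Link: https://arxiv.org/abs/2601.21969
Authors: Yifan Zhu,Huiqiang Rong,Haoran Luo
Affiliations: Beijing University of Posts and Telecommunications; Nanyang Technological University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 26 pages and 11 figures; this work has been accepted for presentation at ICLR 2026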
Abstract:Large Language Models (LLMs) often hallucinate, generating content inconsistent with the input. Retrieval-Augmented Generation (RAG) and Reinforcement Learning with Human Feedback (RLHF) can mitigate hallucinations but require resource-intensive retrieval or large-scale fine-tuning. Decoding-based methods are lighter yet lack explicit hallucination control. To address this, we present Token-Guard, a token-level hallucination control method based on self-checking decoding. Token-Guard performs internal verification at each reasoning step to detect hallucinated tokens before they propagate. Candidate fragments are further evaluated in a latent space with explicit hallucination risk scoring, while iterative pruning and regeneration dynamically correct detected errors. Experiments on HALU datasets show Token-Guard substantially reduces hallucinations and improves generation accuracy, offering a scalable, modular solution for reliable LLM outputs. Our code is publicly available.
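[NLP-21] OVD: On-policy Verbal Distillation
[Quick Read]: This paper addresses the limits of existing on-policy distillation methods in reinforcement learning settings, where reliance on token-level alignment restricts the student model's exploration, prevents effective use of environment feedback, and causes severe memory bottlenecks. The proposed On-policy Verbal Distillation (OVD) framework drops token-level probability matching and instead performs trajectory-level matching using discrete verbal scores (0–9) from the teacher model. This substantially reduces memory consumption and lets the student explore the output space freely without token-level alignment.
Link: https://arxiv.org/abs/2601.21968
Authors: Jing Xiong,Hui Shen,Shansan Gong,Yuxin Cheng,Jianghan Shen,Chaofan Tao,Haochen Tan,Haoli Bai,Lifeng Shang,Ngai Wong
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Technical Report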
Abstract:Knowledge distillation offers a promising path to transfer reasoning capabilities from large teacher models to efficient student models; however, existing token-level on-policy distillation methods require token-level alignment between the student and teacher models, which restricts the student model’s exploration ability, prevents effective use of interactive environment feedback, and causes severe memory bottlenecks in reinforcement learning. We introduce On-policy Verbal Distillation (OVD), a memory-efficient framework that replaces token-level probability matching with trajectory matching using discrete verbal scores (0–9) from teacher models. OVD dramatically reduces memory consumption while enabling on-policy distillation from teacher models with verbal feedback, and avoids token-level alignment, allowing the student model to freely explore the output space. Extensive experiments on Web question answering and mathematical reasoning tasks show that OVD substantially outperforms existing methods, delivering up to +12.9% absolute improvement in average EM on Web QA tasks and up to a +25.7% gain on math benchmarks (when trained with only one random sample), while also exhibiting superior training efficiency. Our project page is available at this https URL
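[NLP-22] Industrialized Deception: The Collateral Effects of LLM-Generated Misinformation on Digital Ecosystems
[Quick Read]: This paper addresses misinformation driven by generative AI, in particular the ongoing contest between humans' ability to recognize AI-generated content and the technology that produces it. Its key contribution is an experimental pipeline: the JudgeGPT platform evaluates human perception of AI-generated news, and RogueGPT generates controlled stimuli for empirical studies, together covering the path from content generation to perception and detection. The findings show that detection capabilities have improved but the competition between generation and detection continues; the paper discusses mitigation strategies including LLM-based detection, inoculation approaches, and the dual-use nature of generative AI.
Link: https://arxiv.org/abs/2601.21963
Authors: Alexander Loth,Martin Kappes,Marc-Oliver Pahl
Affiliations: Microsoft; Frankfurt University of Applied Sciences; IMT Atlantique
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: Accepted at ACM TheWebConf '26 Companion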
Abstract:Generative AI and misinformation research has evolved since our 2024 survey. This paper presents an updated perspective, transitioning from literature review to practical countermeasures. We report on changes in the threat landscape, including improved AI-generated content through Large Language Models (LLMs) and multimodal systems. Central to this work are our practical contributions: JudgeGPT, a platform for evaluating human perception of AI-generated news, and RogueGPT, a controlled stimulus generation engine for research. Together, these tools form an experimental pipeline for studying how humans perceive and detect AI-generated misinformation. Our findings show that detection capabilities have improved, but the competition between generation and detection continues. We discuss mitigation strategies including LLM-based detection, inoculation approaches, and the dual-use nature of generative AI. This work contributes to research addressing the adverse impacts of AI on information quality.
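[NLP-23] From Generative Modeling to Clinical Classification: A GPT-Based Architecture for EHR Notes
[Quick Read]: This paper addresses three challenges in clinical text classification: scarce labeled data, severe class imbalance, and the high computational cost of fine-tuning large pretrained language models. It proposes a GPT-based architecture with selective fine-tuning: most of the GPT-2 decoder weights are frozen, and only the final Transformer block, the final layer normalization, and a lightweight classification head are trained. This sharply reduces the number of trainable parameters while preserving the representational capacity needed for complex clinical language. Experiments on several task formulations over the MIMIC-IV-Note dataset show stable convergence and strong classification performance, particularly in settings dominated by non-mention and negated findings, offering an efficient and scalable route to classifying real-world electronic health record (EHR) text.
Link: https://arxiv.org/abs/2601.21955
Authors: Fariba Afrin Irany
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: This submission is a full-length research manuscript consisting of 37 pages and 15 figures. The paper presents a GPT-based architecture with selective fine-tuning for clinical text classification, including detailed architectural diagrams, learning curves, and evaluation figures such as ROC curves and confusion matrices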
Abstract:The increasing availability of unstructured clinical narratives in electronic health records (EHRs) has created new opportunities for automated disease characterization, cohort identification, and clinical decision support. However, modeling long, domain-specific clinical text remains challenging due to limited labeled data, severe class imbalance, and the high computational cost of adapting large pretrained language models. This study presents a GPT-based architecture for clinical text classification that adapts a pretrained decoder-only Transformer using a selective fine-tuning strategy. Rather than updating all model parameters, the majority of the GPT-2 backbone is frozen, and training is restricted to the final Transformer block, the final layer normalization, and a lightweight classification head. This approach substantially reduces the number of trainable parameters while preserving the representational capacity required to model complex clinical language. The proposed method is evaluated on radiology reports from the MIMIC-IV-Note dataset using uncertainty-aware CheXpert-style labels derived directly from report text. Experiments cover multiple problem formulations, including multi-label classification of radiographic findings, binary per-label classification under different uncertainty assumptions, and aggregate disease outcome prediction. Across varying dataset sizes, the model exhibits stable convergence behavior and strong classification performance, particularly in settings dominated by non-mention and negated findings. Overall, the results indicate that selective fine-tuning of pretrained generative language models provides an efficient and effective pathway for clinical text classification, enabling scalable adaptation to real-world EHR data while significantly reducing computational complexity.
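The selective fine-tuning strategy (freeze the backbone; train only the final Transformer block, the final layer normalization, and the classification head) amounts to choosing a subset of parameters by name. The sketch below shows that selection in plain Python. The parameter names follow the naming commonly seen in Hugging Face GPT-2 sequence classification checkpoints (`transformer.h.<i>.`, `transformer.ln_f.`, `score.`); this is an assumption for illustration, since the paper's own code is not shown here.

```python
from typing import Iterable, List


def trainable_parameter_names(names: Iterable[str], num_blocks: int) -> List[str]:
    """Keep only parameters in the last block, the final layer norm, and the head."""
    last_block = f"transformer.h.{num_blocks - 1}."  # trailing dot avoids matching h.1 for h.11
    prefixes = (last_block, "transformer.ln_f.", "score.")
    return [name for name in names if name.startswith(prefixes)]


# Assumed parameter names for a 12-block GPT-2 classifier (illustrative subset).
example_names = [
    "transformer.wte.weight",
    "transformer.h.0.attn.c_attn.weight",
    "transformer.h.1.mlp.c_fc.weight",
    "transformer.h.11.attn.c_attn.weight",
    "transformer.h.11.ln_2.bias",
    "transformer.ln_f.weight",
    "score.weight",
]
selected = trainable_parameter_names(example_names, num_blocks=12)
print(selected)
```

With a PyTorch model, the same selection would typically be applied by iterating over `model.named_parameters()` and setting `requires_grad` to `True` only for the selected names.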
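[NLP-24] SONIC: Segmented Optimized Nexus for Information Compression in Key-Value Caching
[Quick Read]: This paper addresses the memory bottleneck caused by the linear growth of the key-value (KV) cache in multi-turn LLM dialogue. Existing KV cache compression methods often ignore the structure of multi-turn dialogue and rely on heuristic eviction, which risks losing critical context. The proposed learning-based framework, SONIC, compresses historical dialogue segments into compact, semantically rich Nexus tokens and uses dynamic budget training so the model can adapt to different memory constraints without retraining. At compression ratios of 80% and 50%, SONIC outperforms baselines such as H2O and StreamingLLM, improves the average score on MTBench101 by 35.55%, and accelerates inference by 50.1% compared to full-context generation.
Link: https://arxiv.org/abs/2601.21927
Authors: Hong Chen,Xiang Liu,Bo Wang,Yuxuan Fan,Yuanlin Chu,Zongluo Li,Xiaowen Chu,Xuming Hu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: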
Abstract:The linear growth of Key-Value (KV) cache remains a bottleneck for multi-turn LLM deployment. Existing KV cache compression methods often fail to account for the structural properties of multi-turn dialogues, relying on heuristic eviction that risks losing critical context. We propose SONIC, a learning-based framework that compresses historical segments into compact and semantically rich Nexus tokens. By integrating dynamic budget training, SONIC allows flexible adaptation to varying memory constraints without retraining. Experiments show that at compression ratios of 80% and 50%, SONIC consistently outperforms baselines such as H2O and StreamingLLM on four diverse multi-turn benchmarks. Specifically, on the widely used MTBench101 benchmark, SONIC achieves an average score improvement of 35.55% over state-of-the-art baselines, validating its effectiveness in sustaining coherent multi-turn dialogues. Furthermore, SONIC enhances deployment efficiency, accelerating the overall inference process by 50.1% compared to full-context generation.
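[NLP-25] Self-Compression of Chain-of-Thought via Multi-Agent Reinforcement Learning
[Quick Read]: This paper addresses the excessive inference overhead that redundant reasoning causes in Large Reasoning Models (LRMs), which degrades the interactive experience and hinders deployment. Existing reinforcement learning (RL) methods couple a length penalty with outcome rewards, but this simple weighting struggles to balance brevity and accuracy because forced shortening can damage critical reasoning logic. The proposed multi-agent RL framework, Self-Compression via MARL (SCMA), uses two specialized agents: a Segmentation Agent that decomposes the reasoning process into logical chunks and a Scoring Agent that quantifies the importance of each chunk. Together they define an importance-weighted length penalty that, during training, encourages the Reasoning Agent to keep essential logic, with no extra inference overhead at deployment. Experiments show response length reduced by 11.1% to 39.0% while accuracy improves by 4.33% to 10.02%.
Link: https://arxiv.org/abs/2601.21919
Authors: Yiqun Chen,Jinyuan Feng,Wei Yang,Meizhi Zhong,Zhengliang Shi,Rui Li,Xiaochi Wei,Yan Gao,Yi Wu,Yao Hu,Zhiqiang Pu,Jiaxin Mao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: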
Abstract:The inference overhead induced by redundant reasoning undermines the interactive experience and severely bottlenecks the deployment of Large Reasoning Models. Existing reinforcement learning (RL)-based solutions tackle this problem by coupling a length penalty with outcome-based rewards. This simplistic reward weighting struggles to reconcile brevity with accuracy, as enforcing brevity may compromise critical reasoning logic. In this work, we address this limitation by proposing a multi-agent RL framework that selectively penalizes redundant chunks, while preserving essential reasoning logic. Our framework, Self-Compression via MARL (SCMA), instantiates redundancy detection and evaluation through two specialized agents: a Segmentation Agent for decomposing the reasoning process into logical chunks, and a Scoring Agent for quantifying the significance of each chunk. The Segmentation and Scoring agents collaboratively define an importance-weighted length penalty during training, incentivizing a Reasoning Agent to prioritize essential logic without introducing inference overhead during deployment. Empirical evaluations across model scales demonstrate that SCMA reduces response length by 11.1% to 39.0% while boosting accuracy by 4.33% to 10.02%. Furthermore, ablation studies and qualitative analysis validate that the synergistic optimization within the MARL framework fosters emergent behaviors, yielding more powerful LRMs compared to vanilla RL paradigms.
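An importance-weighted length penalty can be pictured as charging each reasoning chunk for its length in proportion to how unimportant it is, so essential chunks cost little and redundant ones cost a lot. The sketch below uses one plausible form, the sum of (1 - importance) × token count over chunks. The abstract does not give SCMA's formula, so this exact form, the `scale` parameter, and the example values are assumptions.

```python
from typing import List, Tuple


def importance_weighted_length_penalty(
    chunks: List[Tuple[int, float]], scale: float = 1.0
) -> float:
    """Penalty over (token_count, importance) chunks, with importance in [0, 1].

    A chunk with importance 1.0 is free; a chunk with importance 0.0 is
    charged for its full length.
    """
    penalty = 0.0
    for token_count, importance in chunks:
        if not 0.0 <= importance <= 1.0:
            raise ValueError("importance must lie in [0, 1]")
        penalty += (1.0 - importance) * token_count
    return scale * penalty


# Hypothetical segmentation of one response: (tokens in chunk, importance score).
chunks = [(100, 1.0), (50, 0.0), (20, 0.5)]
penalty = importance_weighted_length_penalty(chunks)
print(penalty)
```

Compared with a plain length penalty, which would charge all 170 tokens equally, this form leaves the essential 100-token chunk unpenalized, which is the selective behaviour the abstract describes.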
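[NLP-26] JADE: Bridging the Strategic-Operational Gap in Dynamic Agentic RAG
[Quick Read]: This paper addresses the "strategic-operational mismatch" between planning and execution modules in dynamic multi-turn reasoning workflows: existing methods either optimize modules jointly within fixed architectures, or allow dynamic planning while treating executors as frozen black-box tools, so added system complexity fails to deliver the expected gains. The proposed JADE (Joint Agentic Dynamic Execution) framework models the system as a cooperative multi-agent team sharing a single backbone and optimizes planning and execution jointly, end to end, driven by outcome-based rewards. This enables co-adaptation: the planner learns to operate within the executors' capability boundaries, while the executors evolve to align with high-level strategic intent, turning disjoint modules into a synergistic system and allowing a flexible trade-off between efficiency and effectiveness.
Link: https://arxiv.org/abs/2601.21916
Authors: Yiqun Chen,Erhan Zhang,Tianyi Hu,Shijie Wang,Zixuan Yang,Meizhi Zhong,Xiaochi Wei,Yan Gao,Yi Wu,Yao Hu,Jiaxin Mao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: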
Abstract:The evolution of Retrieval-Augmented Generation (RAG) has shifted from static retrieval pipelines to dynamic, agentic workflows where a central planner orchestrates multi-turn reasoning. However, existing paradigms face a critical dichotomy: they either optimize modules jointly within rigid, fixed-graph architectures, or empower dynamic planning while treating executors as frozen, black-box tools. We identify that this decoupled optimization creates a “strategic-operational mismatch,” where sophisticated planning strategies fail to materialize due to unadapted local executors, often leading to negative performance gains despite increased system complexity. In this paper, we propose JADE (Joint Agentic Dynamic Execution), a unified framework for the joint optimization of planning and execution within dynamic, multi-turn workflows. By modeling the system as a cooperative multi-agent team unified under a single shared backbone, JADE enables end-to-end learning driven by outcome-based rewards. This approach facilitates co-adaptation: the planner learns to operate within the capability boundaries of the executors, while the executors evolve to align with high-level strategic intent. Empirical results demonstrate that JADE transforms disjoint modules into a synergistic system, yielding remarkable performance improvements via joint optimization and enabling a flexible balance between efficiency and effectiveness through dynamic workflow orchestration.
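[NLP-27] ProRAG: Process-Supervised Reinforcement Learning for Retrieval-Augmented Generation
[Quick Read]: This paper addresses "process hallucinations" in Retrieval-Augmented Generation (RAG) for complex reasoning, where traditional outcome-based reinforcement learning (RL), hampered by sparse rewards and inefficient credit assignment, lets models reach correct answers through flawed logic or redundant retrieval steps. The proposed ProRAG framework integrates learned step-level process supervision into the online optimization loop in four stages: (1) supervised policy warmup to establish a structured reasoning format; (2) construction of a Process Reward Model (PRM) based on Monte Carlo Tree Search (MCTS) to quantify intermediate reasoning quality; (3) PRM-guided reasoning refinement to align the policy with fine-grained process preferences; and (4) process-supervised RL with a dual-granularity advantage mechanism that aggregates step-level rewards with the global outcome signal to give precise feedback for every action. The method improves reasoning accuracy and robustness, particularly on long-horizon tasks.
Link: https://arxiv.org/abs/2601.21912
Authors: Zhao Wang,Ziliang Zhao,Zhicheng Dou
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 11 pages, 6 figures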
Abstract:Reinforcement learning (RL) has become a promising paradigm for optimizing Retrieval-Augmented Generation (RAG) in complex reasoning tasks. However, traditional outcome-based RL approaches often suffer from reward sparsity and inefficient credit assignment, as coarse-grained scalar rewards fail to identify specific erroneous steps within long-horizon trajectories. This ambiguity frequently leads to “process hallucinations”, where models reach correct answers through flawed logic or redundant retrieval steps. Although recent process-aware approaches attempt to mitigate this via static preference learning or heuristic reward shaping, they often lack the on-policy exploration capabilities required to decouple step-level credit from global outcomes. To address these challenges, we propose ProRAG, a process-supervised reinforcement learning framework designed to integrate learned step-level supervision into the online optimization loop. Our framework consists of four stages: (1) Supervised Policy Warmup to initialize the model with a structured reasoning format; (2) construction of an MCTS-based Process Reward Model (PRM) to quantify intermediate reasoning quality; (3) PRM-Guided Reasoning Refinement to align the policy with fine-grained process preferences; and (4) Process-Supervised Reinforcement Learning with a dual-granularity advantage mechanism. By aggregating step-level process rewards with global outcome signals, ProRAG provides precise feedback for every action. Extensive experiments on five multi-hop reasoning benchmarks demonstrate that ProRAG achieves superior overall performance compared to strong outcome-based and process-aware RL baselines, particularly on complex long-horizon tasks, validating the effectiveness of fine-grained process supervision. The code and model are available at this https URL.
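[NLP-28] From Meta-Thought to Execution: Cognitively Aligned Post-Training for Generalizable and Reliable LLM Reasoning
[Quick Read]: This paper addresses a basic cognitive mismatch in current LLM post-training: Supervised Fine-Tuning (SFT) followed by outcome-based Reinforcement Learning (RL) optimizes complete reasoning trajectories, a problem-centric paradigm that ignores the two-stage way humans solve problems, first acquiring abstract strategies (meta-knowledge) that generalize and then adapting them to specific instances. The proposed framework decouples reasoning into two stages. Chain-of-Meta-Thought (CoMT) focuses supervised learning on abstract reasoning patterns without specific execution details, improving the generality of learned strategies. Confidence-Calibrated Reinforcement Learning (CCRL) then optimizes task adaptation with confidence-aware rewards on intermediate steps, preventing overconfident errors from cascading. Experiments show improvements of 2.19% in-distribution and 4.63% out-of-distribution, with training time reduced by 65–70% and token consumption by 50%.
Link: https://arxiv.org/abs/2601.21909
Authors: Shaojie Wang,Liang Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: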
Abstract:Current LLM post-training methods optimize complete reasoning trajectories through Supervised Fine-Tuning (SFT) followed by outcome-based Reinforcement Learning (RL). While effective, a closer examination reveals a fundamental gap: this approach does not align with how humans actually solve problems. Human cognition naturally decomposes problem-solving into two distinct stages: first acquiring abstract strategies (i.e., meta-knowledge) that generalize across problems, then adapting them to specific instances. In contrast, by treating complete trajectories as basic units, current methods are inherently problem-centric, entangling abstract strategies with problem-specific execution. To address this misalignment, we propose a cognitively-inspired framework that explicitly mirrors the two-stage human cognitive process. Specifically, Chain-of-Meta-Thought (CoMT) focuses supervised learning on abstract reasoning patterns without specific executions, enabling acquisition of generalizable strategies. Confidence-Calibrated Reinforcement Learning (CCRL) then optimizes task adaptation via confidence-aware rewards on intermediate steps, preventing overconfident errors from cascading and improving execution reliability. Experiments across four models and eight benchmarks show 2.19% and 4.63% improvements in-distribution and out-of-distribution respectively over standard methods, while reducing training time by 65-70% and token consumption by 50%, demonstrating that aligning post-training with human cognitive principles yields not only superior generalization but also enhanced training efficiency.
[NLP-29] Learn-to-Distance: Distance Learning for Detecting LLM-Generated Text ICLR2026
[Quick Read]: This paper addresses the difficulty of detecting text generated by large language models (LLMs), whose highly human-like output raises concerns about misinformation and academic integrity. It proposes a rewrite-based detection algorithm that adaptively learns a distance function between the original text and its rewritten version. Theoretical analysis shows that an adaptively learned distance is more effective for detection than a fixed one, and experiments across different target LLMs (e.g., GPT, Claude, and Gemini) show relative improvements of 57.8% to 80.6% over the strongest baseline.
Link: https://arxiv.org/abs/2601.21895
Authors: Hongyi Zhou,Jin Zhu,Erhan Xu,Kai Ye,Ying Yang,Chengchun Shi
Affiliations: Tsinghua University; University of Birmingham; London School of Economics and Political Science
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted by ICLR2026
Abstract:Modern large language models (LLMs) such as GPT, Claude, and Gemini have transformed the way we learn, work, and communicate. Yet, their ability to produce highly human-like text raises serious concerns about misinformation and academic integrity, creating an urgent need for reliable algorithms to detect LLM-generated content. In this paper, we start by presenting a geometric approach to demystify rewrite-based detection algorithms, revealing their underlying rationale and demonstrating their generalization ability. Building on this insight, we introduce a novel rewrite-based detection algorithm that adaptively learns the distance between the original and rewritten text. Theoretically, we demonstrate that employing an adaptively learned distance function is more effective for detection than using a fixed distance. Empirically, we conduct extensive experiments with over 100 settings, and find that our approach demonstrates superior performance over baseline algorithms in the majority of scenarios. In particular, it achieves relative improvements from 57.8% to 80.6% over the strongest baseline across different target LLMs (e.g., GPT, Claude, and Gemini).
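For context, the fixed-distance baseline that the abstract contrasts with a learned distance can be sketched in a few lines: rewrite the text, measure how far the rewrite moved from the original, and threshold that distance. The sketch assumes, as in earlier rewrite-based detectors, that machine-generated text tends to change less under rewriting than human text. The `rewrite` callable stands in for an LLM rewriting call, and the threshold is arbitrary; both are hypothetical, and this is not the paper's learned-distance method.

```python
from difflib import SequenceMatcher
from typing import Callable


def rewrite_distance(original: str, rewritten: str) -> float:
    """Fixed distance in [0, 1]: one minus the character-level similarity ratio."""
    return 1.0 - SequenceMatcher(None, original, rewritten).ratio()


def looks_machine_generated(
    text: str, rewrite: Callable[[str], str], threshold: float = 0.2
) -> bool:
    """Flag text whose rewrite stays close to the original (small distance)."""
    return rewrite_distance(text, rewrite(text)) < threshold


# Stand-in rewriters for illustration; a real detector would call an LLM here.
unchanged = lambda text: text
heavily_edited = lambda text: "zzzz"

print(looks_machine_generated("hello world", unchanged))
print(looks_machine_generated("hello world", heavily_edited))
```

The paper's argument is that replacing the hand-picked `rewrite_distance` with a distance function learned from data separates the two classes better than any single fixed choice.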
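[NLP-30] Embodied Task Planning via Graph-Informed Action Generation with Large Language Model
[Quick Read]: This paper addresses the loss of strategy coherence and the violation of environment constraints that large language models (LLMs) exhibit in long-horizon planning when deployed as embodied agents. Standard LLM planners struggle to keep multi-step decisions consistent in dynamic environments because of context window limits or hallucinated transitions. The proposed planning framework, GiG (Graph-in-Graph), has three components: (1) a Graph Neural Network (GNN) encodes environment states into embeddings, which are organized into action-connected execution trace graphs in an experience memory bank; (2) clustering these graph embeddings enables retrieval of structure-aware priors, grounding current decisions in relevant past structural patterns; and (3) a bounded lookahead module uses symbolic transition logic to strengthen planning through grounded action projection. The method outperforms state-of-the-art baselines on Robotouille and ALFWorld at comparable or lower computational cost.
Link: https://arxiv.org/abs/2601.21841
Authors: Xiang Li,Ning Yan,Masood Mortazavi
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: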
Abstract:While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning. Unlike open-ended text generation, embodied agents must decompose high-level intent into actionable sub-goals while strictly adhering to the logic of a dynamic, observed environment. Standard LLM planners frequently fail to maintain strategy coherence over extended horizons due to context window limitations, or they hallucinate transitions that violate constraints. We propose GiG, a novel planning framework that structures embodied agents’ memory using a Graph-in-Graph architecture. Our approach employs a Graph Neural Network (GNN) to encode environmental states into embeddings, organizing these embeddings into action-connected execution trace graphs within an experience memory bank. By clustering these graph embeddings, the framework enables retrieval of structure-aware priors, allowing agents to ground current decisions in relevant past structural patterns. Furthermore, we introduce a novel bounded lookahead module that leverages symbolic transition logic to enhance the agents’ planning capabilities through grounded action projection. We evaluate our framework on three embodied planning benchmarks: Robotouille Synchronous, Robotouille Asynchronous, and ALFWorld. Our method outperforms state-of-the-art baselines, achieving Pass@1 performance gains of up to 22% on Robotouille Synchronous, 37% on Asynchronous, and 15% on ALFWorld with comparable or lower computational cost.
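A bounded lookahead over symbolic transition logic can be illustrated with a depth-limited search in which each action has explicit preconditions and effects, so only transitions that the environment's rules allow are ever projected. The sketch below is a generic illustration of that idea. The state encoding, `make_action`, and the cooking-style action names are hypothetical and are not GiG's module or Robotouille's real action set.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Optional

State = FrozenSet[str]
# An action returns the successor state, or None when its preconditions fail.
Action = Callable[[State], Optional[State]]


def make_action(pre: Iterable[str], add: Iterable[str], delete: Iterable[str]) -> Action:
    """Build a symbolic action from precondition, add, and delete fact sets."""
    pre_set, add_set, del_set = frozenset(pre), frozenset(add), frozenset(delete)

    def apply(state: State) -> Optional[State]:
        if not pre_set <= state:
            return None  # transition not allowed in this state
        return (state - del_set) | add_set

    return apply


def bounded_lookahead(
    state: State,
    actions: Dict[str, Action],
    is_goal: Callable[[State], bool],
    depth: int,
) -> Optional[List[str]]:
    """Depth-limited search; returns the first valid plan found within `depth` steps."""
    if is_goal(state):
        return []
    if depth == 0:
        return None
    for name, apply in actions.items():
        successor = apply(state)
        if successor is None:
            continue
        rest = bounded_lookahead(successor, actions, is_goal, depth - 1)
        if rest is not None:
            return [name] + rest
    return None


actions = {
    "pick_up_patty": make_action({"patty_on_table"}, {"holding_patty"}, {"patty_on_table"}),
    "cook_patty": make_action({"holding_patty"}, {"patty_cooked"}, {"holding_patty"}),
}
start: State = frozenset({"patty_on_table"})
plan = bounded_lookahead(start, actions, lambda s: "patty_cooked" in s, depth=2)
print(plan)
```

Because infeasible actions are filtered by their preconditions, a planner that consults such a projection cannot propose a transition that violates the environment's constraints within the lookahead horizon.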
[NLP-31] Mil-SCORE: Benchmarking Long-Context Geospatial Reasoning and Planning in Large Language Models
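[Quick read]: This paper addresses the limited ability of current large language models (LLMs) to handle long contexts, fuse heterogeneous multi-source information, and support high-stakes decisions, focusing on geospatial planning tasks that require fast and accurate reasoning over maps, orders, intelligence reports, and other distributed data. The key to the solution is MilSCORE (Military Scenario Contextual Reasoning), described as the first dataset of expert-authored questions grounded in a complex simulated military scenario. It uses multi-hop questions to evaluate how well models combine tactical and spatial reasoning, and spans seven question categories covering factual recall, constraint analysis, strategy, and spatial analysis, providing a challenging benchmark for long-context understanding and decision-making in realistic scenarios.
Link: https://arxiv.org/abs/2601.21826
Authors: Aadi Palnitkar, Mingyang Mao, Nicholas Waytowich, Vinicius G. Goecks, Tinoosh Mohsenin, Xiaomin Lin
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: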
Abstract:As large language models (LLMs) are applied to increasingly longer and more complex tasks, there is a growing need for realistic long-context benchmarks that require selective reading and integration of heterogeneous, multi-modal information sources. This need is especially acute for geospatial planning problems, such as those found in planning for large-scale military operations, which demand fast and accurate reasoning over maps, orders, intelligence reports, and other distributed data. To address this gap, we present MilSCORE (Military Scenario Contextual Reasoning), to our knowledge the first scenario-level dataset of expert-authored, multi-hop questions grounded in a complex, simulated military planning scenario used for training. MilSCORE is designed to evaluate high-stakes decision-making and planning, probing LLMs’ ability to combine tactical and spatial reasoning across multiple sources and to reason over long-horizon, geospatially rich context. The benchmark includes a diverse set of question types across seven categories targeting both factual recall and multi-step reasoning about constraints, strategy, and spatial analysis. We provide an evaluation protocol and report baseline results for a range of contemporary vision-language models. Our findings highlight substantial headroom on MilSCORE, indicating that current systems struggle with realistic, scenario-level long-context planning, and positioning MilSCORE as a challenging testbed for future work.
[NLP-32] Moral Outrage Shapes Commitments Beyond Attention: Multimodal Moral Emotions on YouTube in Korea and the US
[Quick read]: This paper studies how moral emotional framing by mainstream news channels on YouTube affects user engagement, with particular attention to cross-cultural differences between Korea and the United States. The central question is which kinds of moral emotional expression raise audience attention and interaction. The key to the solution is a multimodal moral emotion classifier, obtained by fine-tuning a vision-language model on video thumbnails and titles, trained on Korean and English datasets, and applied to nearly 400,000 videos from mainstream news channels. The results show that other-condemning moral outrage significantly increases views, likes, and comments, which points to its effectiveness as an emotional strategy and to the risk that it deepens group polarization.
Link: https://arxiv.org/abs/2601.21815
Authors: Seongchan Park, Jaehong Kim, Hyeonseung Kim, Heejin Bin, Sue Moon, Wonjae Lee
Affiliations: KAIST
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: Accepted at The Web Conference 2026. We release Korean and English multimodal moral emotion classifiers
Abstract:Understanding how media rhetoric shapes audience engagement is crucial in the attention economy. This study examines how moral emotional framing by mainstream news channels on YouTube influences user behavior across Korea and the United States. To capture the platform’s multimodal nature, combining thumbnail images and video titles, we develop a multimodal moral emotion classifier by fine-tuning a vision-language model. The model is trained on human-annotated multimodal datasets in both languages and applied to approximately 400,000 videos from major news outlets. We analyze engagement levels including views, likes, and comments, representing increasing degrees of commitment. The results show that other-condemning rhetoric (expressions of moral outrage that criticize others morally) consistently increases all forms of engagement across cultures, with effects ranging from passive viewing to active commenting. These findings suggest that moral outrage is a particularly effective emotional strategy, attracting not only attention but also active participation. We discuss concerns about the potential misuse of other-condemning rhetoric, as such practices may deepen polarization by reinforcing in-group and out-group divisions. To facilitate future research and ensure reproducibility, we publicly release our Korean and English multimodal moral emotion classifiers.
[NLP-33] Distribution-Aware Reward Estimation for Test-Time Reinforcement Learning
[Quick read]: This paper addresses inaccurate reward estimation in test-time reinforcement learning (TTRL). In the unsupervised setting, existing methods rely on majority voting (MV) over multiple rollouts to produce deterministic rewards, but this assumption is fragile: MV compresses the rollout distribution into a single outcome and discards non-majority but correct action candidates, which leads to systematic bias. The key to the solution is Distribution-Aware Reward Estimation (DARE), which moves from a single-point reward to a reward based on the full empirical rollout distribution, and adds an exploration bonus and a distribution pruning mechanism to encourage exploration of non-majority rollouts and to denoise the reward, making the reward signal more informative and robust.
Link: https://arxiv.org/abs/2601.21804
Authors: Bodong Du, Xuanqi Huang, Xiaomeng Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Test-time reinforcement learning (TTRL) enables large language models (LLMs) to self-improve on unlabeled inputs, but its effectiveness critically depends on how reward signals are estimated without ground-truth supervision. Most existing TTRL methods rely on majority voting (MV) over rollouts to produce deterministic rewards, implicitly assuming that the majority rollout provides a reliable learning signal. We show that this assumption is fragile: MV reduces the rollout distribution into a single outcome, discarding information about non-majority but correct action candidates, and yields systematically biased reward estimates. To address this, we propose Distribution-Aware Reward Estimation (DARE), which shifts reward estimation from a single majority outcome to the full empirical rollout distribution. DARE further augments this distribution-based reward with an exploration bonus and a distribution pruning mechanism for non-majority rollout exploration and reward denoising, yielding a more informative and robust reward estimation. Extensive experiments on challenging reasoning benchmarks show that DARE improves optimization stability and final performance over recent baselines, achieving relative improvements of 25.3% on challenging AIME 2024 and 5.3% on AMC.
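The contrast drawn above between majority-vote rewards and rewards based on the full rollout distribution can be illustrated with a minimal Python sketch. It assumes rollouts have been reduced to final answer strings. The function names and the frequency-based soft reward are illustrative assumptions, not the DARE formulation, which the abstract does not spell out; the exploration bonus and pruning step are omitted.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Give reward 1.0 to rollouts matching the most frequent answer, else 0.0."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


def empirical_distribution_rewards(answers):
    """Reward each rollout by the empirical frequency of its answer.

    Illustrative assumption only; this is not the reward defined in the paper.
    """
    counts = Counter(answers)
    total = len(answers)
    return [counts[a] / total for a in answers]


# Hypothetical final answers from six rollouts on one unlabeled prompt.
rollouts = ["42", "42", "41", "42", "41", "7"]
mv_rewards = majority_vote_rewards(rollouts)
soft_rewards = empirical_distribution_rewards(rollouts)
```

In this toy case the two rollouts answering "41" receive zero reward under majority voting but a nonzero reward under the frequency-based variant, which is the kind of distributional information that MV discards.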
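[NLP-34] RAG-E: Quantifying Retriever-Generator Alignment and Failure Modes
[Quick read]: This paper addresses the opacity of component interactions in Retrieval-Augmented Generation (RAG) systems, in particular the lack of explainability and trustworthiness in high-stakes applications. The key to the solution is RAG-E, an end-to-end explainability framework that quantifies retriever-generator alignment through mathematically grounded attribution methods: it adapts Integrated Gradients for retriever analysis, introduces PMCSHAP (a Monte Carlo-stabilized Shapley value approximation) for generator attribution, and proposes the Weighted Attribution-Relevance Gap (WARG) metric to measure whether the generator's document usage matches the retriever's ranking. Empirical analysis shows that generators ignore the retriever's top-ranked document for up to 66.7% of queries, and that 48.1% to 65.9% rely on documents ranked as less relevant. RAG performance therefore depends not only on each module but also on their interplay, which RAG-E can audit.
Link: https://arxiv.org/abs/2601.21803
Authors: Korbinian Randl, Guido Rocchietti, Aron Henriksson, Ziawasch Abedjan, Tony Lindgren, John Pavlopoulos
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: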
Abstract:Retrieval-Augmented Generation (RAG) systems combine dense retrievers and language models to ground LLM outputs in retrieved documents. However, the opacity of how these components interact creates challenges for deployment in high-stakes domains. We present RAG-E, an end-to-end explainability framework that quantifies retriever-generator alignment through mathematically grounded attribution methods. Our approach adapts Integrated Gradients for retriever analysis, introduces PMCSHAP, a Monte Carlo-stabilized Shapley Value approximation, for generator attribution, and introduces the Weighted Attribution-Relevance Gap (WARG) metric to measure how well a generator’s document usage aligns with a retriever’s ranking. Empirical analysis on TREC CAsT and FoodSafeSum reveals critical misalignments: for 47.4% to 66.7% of queries, generators ignore the retriever’s top-ranked documents, while 48.1% to 65.9% rely on documents ranked as less relevant. These failure modes demonstrate that RAG output quality depends not solely on individual component performance but on their interplay, which can be audited via RAG-E.
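The generator attribution mentioned above rests on approximating Shapley values by Monte Carlo sampling. The sketch below shows only the generic permutation-sampling estimator. It is not PMCSHAP, whose stabilization details are not given in the abstract, and `toy_value` with its document utilities is a hypothetical stand-in for a generator-quality score.

```python
import random


def monte_carlo_shapley(players, value_fn, num_permutations=2000, seed=0):
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled orderings of the players."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        previous = value_fn(frozenset(coalition))
        for p in order:
            coalition.add(p)
            current = value_fn(frozenset(coalition))
            totals[p] += current - previous  # marginal contribution of p
            previous = current
    return {p: t / num_permutations for p, t in totals.items()}


# Hypothetical utilities: how much each retrieved document helps the answer.
DOC_UTILITY = {"doc_a": 0.6, "doc_b": 0.3, "doc_c": 0.0}


def toy_value(coalition):
    """Additive toy value function (an assumption made for illustration)."""
    return sum(DOC_UTILITY[d] for d in coalition)


shapley = monte_carlo_shapley(list(DOC_UTILITY), toy_value)
```

Because the toy value function is additive, the estimates recover each document's utility, which makes the sketch easy to check by hand. A real generator score would not be additive, and the sampling variance would then matter.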
[NLP-35] Enhancing Conversational Agents via Task-Oriented Adversarial Memory Adaptation
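[Quick read]: This paper addresses the lack of task-aware supervision in the offline memory construction and update phase of existing conversational memory systems, which leaves memory content misaligned with downstream task objectives and hurts performance. The core solution is an Adversarial Memory Adaptation mechanism (AMA) that uses three agents (a challenger agent, an evaluator agent, and an adapter agent) to simulate downstream task execution during the offline phase and generate task-relevant supervision signals. Specifically, the challenger agent generates question-answer pairs from the original dialogues, the evaluator agent performs error analysis on the memory system's answers, and the adapter agent uses that analysis to refine both the memory construction strategy and the memory content, aligning the memory system with task objectives and improving performance in long-dialogue settings.
Link: https://arxiv.org/abs/2601.21797
Authors: Yimin Deng, Yuqing Fu, Derong Xu, Yejing Wang, Wei Ni, Jingtong Gao, Xiaopeng Li, Chengxu Liu, Xiao Han, Guoshuai Zhao, Xiangyu Zhao, Li Zhu, Xueming Qian
Affiliations: Xi’an Jiaotong University; City University of Hong Kong; University of Science and Technology of China; Zhejiang University; Zhejiang University of Technology
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 4 figures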
Abstract:Conversational agents struggle to handle long conversations due to context window limitations. Therefore, memory systems are developed to leverage essential historical information. Existing memory systems typically follow a pipeline of offline memory construction and update, and online retrieval. Despite the flexible online phase, the offline phase remains fixed and task-independent. In this phase, memory construction operates under a predefined workflow and fails to emphasize task relevant information. Meanwhile, memory updates are guided by generic metrics rather than task specific supervision. This leads to a misalignment between offline memory preparation and task requirements, which undermines downstream task performance. To this end, we propose an Adversarial Memory Adaptation mechanism (AMA) that aligns memory construction and update with task objectives by simulating task execution. Specifically, first, a challenger agent generates question answer pairs based on the original dialogues. The constructed memory is then used to answer these questions, simulating downstream inference. Subsequently, an evaluator agent assesses the responses and performs error analysis. Finally, an adapter agent analyzes the error cases and performs dual level updates on both the construction strategy and the content. Through this process, the memory system receives task aware supervision signals in advance during the offline phase, enhancing its adaptability to downstream tasks. AMA can be integrated into various existing memory systems, and extensive experiments on long dialogue benchmark LoCoMo demonstrate its effectiveness.
[NLP-36] KID: Knowledge-Injected Dual-Head Learning for Knowledge-Grounded Harmful Meme Detection
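[Quick read]: This paper addresses the difficulty of detecting implicitly harmful content in Internet memes that circulate widely on social media, in particular the fact that their reliance on metaphor and sociocultural context makes automated content moderation hard. The key to the solution is KID, a Knowledge-Injected Dual-Head Learning framework. A label-constrained knowledge distillation scheme decomposes complex meme understanding into structured reasoning chains that explicitly link visual evidence, external background knowledge, and classification labels. A dual-head architecture jointly optimizes semantic generation and classification objectives, aligning linguistic reasoning while keeping decision boundaries stable, which improves the accuracy and generalization of harmful meme detection across languages, including low-resource languages such as Bengali.
Link: https://arxiv.org/abs/2601.21796
Authors: Yaocong Li, Leihan Zhang, Le Zhang, Qiang Yan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: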
Abstract:Internet memes have become pervasive carriers of digital culture on social platforms. However, their heavy reliance on metaphors and sociocultural context also makes them subtle vehicles for harmful content, posing significant challenges for automated content moderation. Existing approaches primarily focus on intra-modal and inter-modal signal analysis, while the understanding of implicit toxicity often depends on background knowledge that is not explicitly present in the meme itself. To address this challenge, we propose KID, a Knowledge-Injected Dual-Head Learning framework for knowledge-grounded harmful meme detection. KID adopts a label-constrained distillation paradigm to decompose complex meme understanding into structured reasoning chains that explicitly link visual evidence, background knowledge, and classification labels. These chains guide the learning process by grounding external knowledge in meme-specific contexts. In addition, KID employs a dual-head architecture that jointly optimizes semantic generation and classification objectives, enabling aligned linguistic reasoning while maintaining stable decision boundaries. Extensive experiments on five multilingual datasets spanning English, Chinese, and low-resource Bengali demonstrate that KID achieves SOTA performance on both binary and multi-label harmful meme detection tasks, improving over previous best methods by 2.1%–19.7% across primary evaluation metrics. Ablation studies further confirm the effectiveness of knowledge injection and dual-head joint learning, highlighting their complementary contributions to robust and generalizable meme understanding. The code and data are available at this https URL.
[NLP-37] Zonkey: A Hierarchical Diffusion Language Model with Differentiable Tokenization and Probabilistic Attention
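[Quick read]: This paper addresses the fact that large language models (LLMs) rely on fixed, non-differentiable tokenizers such as Byte Pair Encoding (BPE), which limits end-to-end optimization and adaptation to noisy or domain-specific data. The key to the solution is Zonkey, a hierarchical diffusion model with a fully trainable pipeline from raw characters to document-level representations. Its core innovation is a differentiable tokenizer (Segment Splitter) that learns probabilistic beginning-of-sequence (BOS) decisions to split sequences adaptively, so that linguistically meaningful boundaries (for example word boundaries at spaces and sentence starts at periods) emerge without supervision. This differentiability is supported by a new Probabilistic Attention mechanism, which uses position-specific existence probabilities to simulate soft masking over theoretically infinite sequences while preserving gradients. Sequences also decay probabilistically instead of relying on explicit end-of-sequence tokens, which supports variable-length outputs. In addition, hierarchical compression combined with the Denoising Diffusion Mixed Model (DDMM) and a Stitcher provides multi-scale abstraction and overlap invariance, enabling high-quality text generation under end-to-end training.
Link: https://arxiv.org/abs/2601.21768
Authors: Alon Rozental
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: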
Abstract:Large language models (LLMs) have revolutionized natural language processing, yet they remain constrained by fixed, non-differentiable tokenizers like Byte Pair Encoding (BPE), which hinder end-to-end optimization and adaptability to noisy or domain-specific data. We introduce Zonkey, a hierarchical diffusion model that addresses these limitations through a fully trainable pipeline from raw characters to document-level representations. At its core is a differentiable tokenizer (Segment Splitter) that learns probabilistic beginning-of-sequence (BOS) decisions, enabling adaptive splits that emerge as linguistically meaningful (e.g., word boundaries at spaces, sentence starts at periods) without explicit supervision. This differentiability is enabled by our novel Probabilistic Attention mechanism, which incorporates position-specific existence probabilities to simulate soft masking over theoretically infinite sequences while preserving gradients. Sequences decay probabilistically rather than relying on end-of-sequence tokens, supporting variable-length outputs. Hierarchical levels compress sequences into higher abstractions (e.g., character n-grams to word-like vectors, then sentence-like), with reconstruction via our Denoising Diffusion Mixed Model (DDMM) for stable and efficient denoising in latent space. A Stitcher ensures overlap invariance across segments. Trained end-to-end on Wikipedia, Zonkey generates coherent, variable-length text from noise, demonstrating emergent hierarchies and promising qualitative alignment to data distributions compared to entropy-based learnable tokenizers. Our approach advances toward fully gradient-based LLMs, with potential for better domain adaptation and scalable generation. We release the source code for training and reproducing our experiments.
[NLP-38] Evaluating ChatGPT on Medical Information Extraction Tasks: Performance Explainability and Beyond
[Quick read]: This paper examines the performance and reliability of generative AI systems such as ChatGPT on medical information extraction (MedIE) tasks. It systematically evaluates ChatGPT on six benchmark datasets along several dimensions: accuracy, explainability, confidence, faithfulness, and uncertainty. The key finding is that, although ChatGPT offers high explainability and high faithfulness to the original text, it is over-confident in its predictions and its extraction results are unstable because of generation uncertainty. This provides empirical evidence and directions for improvement for future use of large language models in medical settings.
Link: https://arxiv.org/abs/2601.21767
Authors: Wei Zhu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) like ChatGPT have demonstrated amazing capabilities in comprehending user intents and generating reasonable and useful responses. Besides their ability to chat, their capabilities in various natural language processing (NLP) tasks are of interest to the research community. In this paper, we focus on assessing the overall ability of ChatGPT in 4 different medical information extraction (MedIE) tasks across 6 benchmark datasets. We present a systematic analysis by measuring ChatGPT’s performance, explainability, confidence, faithfulness, and uncertainty. Our experiments reveal that: (a) ChatGPT’s performance scores on MedIE tasks fall behind those of the fine-tuned baseline models. (b) ChatGPT can provide high-quality explanations for its decisions; however, ChatGPT is over-confident in its predictions. (c) ChatGPT demonstrates a high level of faithfulness to the original text in the majority of cases. (d) The uncertainty in generation causes uncertainty in information extraction results, and thus may hinder its applications in MedIE tasks.
[NLP-39] CoFrGeNet: Continued Fraction Architectures for Language Generation
[Quick read]: This paper addresses the large parameter counts, high training cost, and low efficiency of current Transformer-based generative models. Its core solution is a new function class built on the mathematics of continued fractions, and a family of architectures based on it named CoFrGeNets (Continued Fraction Generative Networks). The architecture introduces new components that replace Multi-head Attention and Feed-Forward Networks in the original Transformer, cutting the parameter count substantially (to between 1/2 and 2/3 of the original model) while matching or sometimes exceeding the original models on downstream classification, question answering, reasoning, and text understanding tasks. The key innovation is a custom gradient formulation that optimizes the new components more accurately and efficiently than standard PyTorch gradients. The components are also plug-in replacements compatible with existing training and inference pipelines, which eases integration into large industrial workflows.
Link: https://arxiv.org/abs/2601.21766
Authors: Amit Dhurandhar, Vijil Chenthamarakshan, Dennis Wei, Tejaswini Pedapati, Karthikeyan Natesan Ramamurthy, Rahul Nair
Affiliations: IBM Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Transformers are arguably the preferred architecture for language generation. In this paper, inspired by continued fractions, we introduce a new function class for generative modeling. The architecture family implementing this function class is named CoFrGeNets (Continued Fraction Generative Networks). We design novel architectural components based on this function class that can replace Multi-head Attention and Feed-Forward Networks in Transformer blocks while requiring far fewer parameters. We derive custom gradient formulations to optimize the proposed components more accurately and efficiently than using standard PyTorch-based gradients. Our components are a plug-in replacement requiring little change in training or inference procedures that have already been put in place for Transformer-based models, thus making our approach easy to incorporate in large industrial workflows. We experiment on two very different transformer architectures, GPT2-xl (1.5B) and Llama3 (3.2B), where the former we pre-train on OpenWebText and GneissWeb, while the latter we pre-train on the docling data mix, which consists of nine different datasets. Results show that the performance on downstream classification, Q&A, reasoning and text understanding tasks of our models is competitive and sometimes even superior to the original models with $\frac{2}{3}$ to $\frac{1}{2}$ the parameters and shorter pre-training time. We believe that future implementations customized to hardware will further bring out the true potential of our architectures.
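For readers unfamiliar with the mathematical object the architecture is named after, the sketch below evaluates a finite continued fraction a0 + 1/(a1 + 1/(a2 + ...)). It only illustrates what a continued fraction is. How CoFrGeNets parameterize and train such expressions is not described in the abstract, so nothing here should be read as the paper's layer design.

```python
import math


def continued_fraction(coeffs):
    """Evaluate a0 + 1/(a1 + 1/(a2 + ...)) from the innermost term outwards."""
    value = float(coeffs[-1])
    for a in reversed(coeffs[:-1]):
        value = a + 1.0 / value
    return value


# sqrt(2) has the well-known expansion [1; 2, 2, 2, ...].
approx_sqrt2 = continued_fraction([1] + [2] * 20)
error = abs(approx_sqrt2 - math.sqrt(2))
```

A handful of coefficients already gives a very close approximation, which hints at why such expressions can be compact function approximators.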
[NLP-40] Influence Guided Sampling for Domain Adaptation of Text Retrievers
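[Quick read]: This paper addresses how to sample efficiently from diverse corpora and search tasks when training general-purpose open-domain dense retrieval models. Conventional approaches sample uniformly, in proportion to instance counts, or with weights assigned by human experts, but these strategies do not adapt dynamically to how much each dataset contributes to the target metric. The key to the solution is Inf-DDS, a reinforcement-learning-driven data sampling framework that adaptively reweights training datasets using influence-based reward signals and iteratively prioritizes datasets that maximize performance on a development set. The method clearly outperforms existing gradient-based sampling strategies, with stronger gains and better generalization across multiple text retrieval tasks, while reducing GPU compute cost by 1.5x to 4x.
Link: https://arxiv.org/abs/2601.21759
Authors: Meet Doshi, Vishwajeet Kumar, Yulong Li, Jaydeep Sen
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: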
Abstract:General-purpose open-domain dense retrieval systems are usually trained with a large, eclectic mix of corpora and search tasks. How should these diverse corpora and tasks be sampled for training? Conventional approaches sample them uniformly, proportional to their instance population sizes, or depend on human-level expert supervision. It is well known that the training data sampling strategy can greatly impact model performance. However, how to find the optimal strategy has not been adequately studied in the context of embedding models. We propose Inf-DDS, a novel reinforcement learning driven sampling framework that adaptively reweighs training datasets guided by influence-based reward signals and is much more lightweight with respect to GPU consumption. Our technique iteratively refines the sampling policy, prioritizing datasets that maximize model performance on a target development set. We evaluate the efficacy of our sampling strategy on a wide range of text retrieval tasks, demonstrating strong improvements in retrieval performance and better adaptation compared to existing gradient-based sampling methods, while also being 1.5x to 4x cheaper in GPU compute. Our sampling strategy achieves a 5.03 absolute NDCG@10 improvement while training a multilingual bge-m3 model and an absolute NDCG@10 improvement of 0.94 while training all-MiniLM-L6-v2, even when starting from expert-assigned weights on a large pool of training datasets.
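The gains above are reported in NDCG@10. As a reminder of how that metric is computed, here is a minimal sketch using the linear-gain form of DCG. It assumes the input list holds graded relevance labels in ranked order and that it contains every judged document for the query, so the ideal ranking can be obtained by sorting the same list. Evaluation toolkits may differ in gain function and in how the ideal ranking is built.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """Normalize DCG@k by the DCG@k of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k([3, 2, 1, 0])   # already in ideal order
swapped = ndcg_at_k([0, 1])         # the only relevant item is ranked second
```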
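[NLP-41] Temporal Guidance for Large Language Models
[Quick read]: This paper addresses the significant computational overhead that Contrastive Decoding (CD) incurs in large language models (LLMs) because it needs an auxiliary model. Existing internal self-contrastive decoding methods, such as DoLa, which contrasts different layers, are unstable on small-scale models and hard to apply reliably. The key to the solution is a new contrastive guidance strategy along the temporal dimension, Temporal Guidance (TeGu). It uses Multi-Token Prediction (MTP) to construct the model's own weaker predictions as the contrast baseline, and introduces a lightweight Conditional MTP Projector (cMTPP) to standardize the implementation and avoid maintaining multiple independent networks, improving generation quality while keeping memory and compute overhead low.
Link: https://arxiv.org/abs/2601.21744
Authors: Hong-Kai Zheng, Piji Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: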
Abstract:Contrastive Decoding (CD) enhances the generation quality of large language models (LLMs) but incurs significant additional computational overhead due to the need for an auxiliary model. Existing internal self-contrastive decoding methods, such as Decoding by Contrasting Layers (DoLa), focus on discrepancies across different layers, which are notably unstable on small-scale models. In this work, based on the observation that LLMs exhibit local preferences, we propose a novel contrastive guidance strategy along the temporal dimension, namely Temporal Guidance (TeGu). Our method ingeniously leverages Multi-Token Prediction (MTP) to construct weaker amateur predictions for model self-contrast. To standardize the implementation of this mechanism, we further introduce a lightweight Conditional MTP Projector (cMTPP), which avoids maintaining multiple independent networks as required by other MTP modules. Across various model series and benchmarks, TeGu achieves significant performance improvements while maintaining low additional memory consumption and computational overhead.
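The Contrastive Decoding baseline that TeGu builds on can be sketched in a few lines. The code below follows the commonly used form of CD: score each token by the log-probability gap between a stronger "expert" and a weaker "amateur" distribution, restricted to tokens the expert finds plausible. The toy distributions and the `alpha` value are assumptions for illustration. In TeGu the weaker predictions would come from the model's own multi-token prediction heads, which this sketch does not model.

```python
import math


def contrastive_scores(expert_probs, amateur_probs, alpha=0.1):
    """Score tokens by log p_expert - log p_amateur, keeping only tokens
    whose expert probability is at least alpha times the expert's maximum."""
    threshold = alpha * max(expert_probs.values())
    scores = {}
    for token, p_expert in expert_probs.items():
        if p_expert >= threshold:  # plausibility constraint
            scores[token] = math.log(p_expert) - math.log(amateur_probs[token])
    return scores


# Hypothetical next-token distributions over a four-token vocabulary.
expert = {"Paris": 0.50, "the": 0.30, "Lyon": 0.18, "zzz": 0.02}
amateur = {"Paris": 0.20, "the": 0.50, "Lyon": 0.10, "zzz": 0.20}

scores = contrastive_scores(expert, amateur)
best_token = max(scores, key=scores.get)
```

The plausibility filter removes "zzz", which the expert itself considers unlikely, and the contrast then favors "Paris" over the generic token "the" that the amateur prefers.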
[NLP-42] Epistemic Context Learning: Building Trust the Right Way in LLM-Based Multi-Agent Systems
[Quick read]: This paper addresses the problem that individual agents in multi-agent (MA) systems lack robustness and blindly follow misleading peers, which stems from sycophancy and from an inadequate ability to evaluate peer reliability. The key to the solution is Epistemic Context Learning (ECL), a reasoning framework that takes historical interactions as additional input so that an agent can build peer profiles from past interactions, explicitly estimate peer reliability, and learn from trustworthy peers when uncertain. ECL turns the harder task of evaluating peer reasoning quality into the more tractable problem of estimating reliability, and is further optimized by reinforcement learning with auxiliary rewards. Experiments show that it substantially improves small models such as Qwen 3-4B in complex MA scenarios and lifts frontier models to near-perfect (100%) accuracy.
Link: https://arxiv.org/abs/2601.21742
Authors: Ruiwen Zhou, Maojia Song, Xiaobao Wu, Sitao Cheng, Xunjian Yin, Yuxi Xie, Zhuoqun Hao, Wenyue Hua, Liangming Pan, Soujanya Poria, Min-Yen Kan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: Codes and data are available at this https URL
Abstract:Individual agents in multi-agent (MA) systems often lack robustness, tending to blindly conform to misleading peers. We show this weakness stems from both sycophancy and inadequate ability to evaluate peer reliability. To address this, we first formalize the learning problem of history-aware reference, introducing the historical interactions of peers as additional input, so that agents can estimate peer reliability and learn from trustworthy peers when uncertain. This shifts the task from evaluating peer reasoning quality to estimating peer reliability based on interaction history. We then develop Epistemic Context Learning (ECL): a reasoning framework that conditions predictions on explicitly-built peer profiles from history. We further optimize ECL by reinforcement learning using auxiliary rewards. Our experiments reveal that our ECL enables small models like Qwen 3-4B to outperform a history-agnostic baseline 8x its size (Qwen 3-30B) by accurately identifying reliable peers. ECL also boosts frontier models to near-perfect (100%) performance. We show that ECL generalizes well to various MA configurations and we find that trust is modeled well by LLMs, revealing a strong correlation between trust modeling accuracy and final answer quality.
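The idea of estimating peer reliability from interaction history, then deferring to a trusted peer only when uncertain, can be made concrete with a simple counting heuristic. This is a hypothetical sketch: ECL builds peer profiles through LLM reasoning and reinforcement learning, not through the Laplace-smoothed accuracy used here, and all names below are assumptions.

```python
def peer_reliability(history):
    """Laplace-smoothed accuracy per peer from (peer, was_correct) records."""
    stats = {}
    for peer, was_correct in history:
        correct, total = stats.get(peer, (0, 0))
        stats[peer] = (correct + int(was_correct), total + 1)
    return {p: (c + 1) / (t + 2) for p, (c, t) in stats.items()}


def pick_answer(own_answer, own_confident, peer_answers, reliability):
    """Keep the agent's own answer when confident; otherwise follow the
    peer with the highest estimated reliability (0.5 for unseen peers)."""
    if own_confident or not peer_answers:
        return own_answer
    best_peer = max(peer_answers, key=lambda p: reliability.get(p, 0.5))
    return peer_answers[best_peer]


# Hypothetical history: peer "a" was right twice and wrong once, "b" wrong twice.
history = [("a", True), ("a", True), ("a", False), ("b", False), ("b", False)]
reliability = peer_reliability(history)
chosen = pick_answer("x", False, {"a": "y", "b": "z"}, reliability)
kept = pick_answer("x", True, {"a": "y", "b": "z"}, reliability)
```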
[NLP-43] CE-GOCD: Central Entity-Guided Graph Optimization for Community Detection to Augment LLM Scientific Question Answering ICASSP2026
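[Quick read]: This paper addresses the problem that existing retrieval augmentation methods for question answering over scientific literature rely on isolated text chunks or concepts and overlook deeper semantic connections between papers, which impairs LLMs' comprehension of the literature and makes answers less comprehensive and less specific. The key to the solution is Central Entity-Guided Graph Optimization for Community Detection (CE-GOCD): it uses paper titles as central entities for targeted subgraph retrieval, enhances implicit semantic discovery through subgraph pruning and completion, and applies community detection to distill coherent paper groups with shared themes, thereby explicitly modeling and exploiting semantic substructures in academic knowledge graphs to improve LLM answer quality.
Link: https://arxiv.org/abs/2601.21733
Authors: Jiayin Lan, Jiaqi Li, Baoxin Wang, Ming Liu, Dayong Wu, Shijin Wang, Bing Qin, Guoping Hu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted by IEEE ICASSP 2026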
Abstract:Large Language Models (LLMs) are increasingly used for question answering over scientific research papers. Existing retrieval augmentation methods often rely on isolated text chunks or concepts, but overlook deeper semantic connections between papers. This impairs the LLM’s comprehension of scientific literature, hindering the comprehensiveness and specificity of its responses. To address this, we propose Central Entity-Guided Graph Optimization for Community Detection (CE-GOCD), a method that augments LLMs’ scientific question answering by explicitly modeling and leveraging semantic substructures within academic knowledge graphs. Our approach operates by: (1) leveraging paper titles as central entities for targeted subgraph retrieval, (2) enhancing implicit semantic discovery via subgraph pruning and completion, and (3) applying community detection to distill coherent paper groups with shared themes. We evaluated the proposed method on three NLP literature-based question-answering datasets, and the results demonstrate its superiority over other retrieval-augmented baseline approaches, confirming the effectiveness of our framework.
[NLP-44] Procedural Pretraining: Warming Up Language Models with Abstract Data
[Quick read]: This paper addresses the efficiency and performance bottlenecks of language model pretraining that relies on large-scale natural language data. The core challenge is how to acquire rich semantic knowledge more efficiently and speed up convergence. The key to the solution is procedural pretraining: in an initial stage the model is pretrained on abstract structured data generated by formal languages or simple algorithms (for example Dyck sequences of balanced brackets), which lays a structural foundation for later language understanding. For example, context recall accuracy rises from 10% to 98% after pretraining on Dyck sequences, and front-loading as little as 0.1% procedural data outperforms standard pretraining on natural language, code, and informal mathematics, letting models reach the same loss with only 55%, 67%, and 86% of the original data. Procedural pretraining also induces non-trivial structure in attention and multilayer perceptron (MLP) layers, with the former mattering more for structured domains such as code and the latter for natural language.
Link: https://arxiv.org/abs/2601.21725
Authors: Liangze Jiang, Zachary Shinnick, Anton van den Hengel, Hemanth Saratchandran, Damien Teney
Affiliations: EPFL; Idiap Research Institute; Australian Institute for Machine Learning (AIML); Adelaide University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Pretraining directly on web-scale corpora is the de facto paradigm for building language models. We study an alternative setting where the model is initially exposed to abstract structured data, as a means to ease the subsequent acquisition of rich semantic knowledge, much like humans learn simple logic and mathematics before higher reasoning. We specifically focus on procedural data, generated by formal languages and other simple algorithms, as such abstract data. We first diagnose the algorithmic skills that different forms of procedural data can improve, often significantly. For example, on context recall (Needle-in-a-haystack), the accuracy jumps from 10 to 98% when pretraining on Dyck sequences (balanced brackets). Second, we study how these gains are reflected in pretraining larger models (up to 1.3B). We find that front-loading as little as 0.1% procedural data significantly outperforms standard pretraining on natural language, code, and informal mathematics (C4, CodeParrot, and DeepMind-Math datasets). Notably, this procedural pretraining enables the models to reach the same loss value with only 55%, 67%, and 86% of the original data. Third, we explore the mechanisms behind and find that procedural pretraining instils non-trivial structure in both attention and MLP layers. The former is particularly important for structured domains (e.g. code), and the latter for language. Finally, we lay a path for combining multiple forms of procedural data. Our results show that procedural pretraining is a simple, lightweight means to improving performance and accelerating language model pretraining, ultimately suggesting the promise of disentangling knowledge acquisition from reasoning in LLMs.
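To make "procedural data" concrete, the sketch below generates Dyck sequences (balanced brackets) with two bracket types and checks that a sequence is balanced. It assumes a simple coin-flip generator. The paper's actual generation procedure, vocabulary, and sequence lengths are not given in the abstract, so every parameter here is hypothetical.

```python
import random

BRACKETS = {"(": ")", "[": "]"}


def generate_dyck(num_pairs, rng):
    """Generate a balanced bracket string containing `num_pairs` pairs."""
    sequence, stack = [], []
    opens_left = num_pairs
    while opens_left > 0 or stack:
        # Open a bracket if any remain and either nothing is open
        # or a fair coin flip says so; otherwise close the latest one.
        if opens_left > 0 and (not stack or rng.random() < 0.5):
            opener = rng.choice(list(BRACKETS))
            stack.append(opener)
            sequence.append(opener)
            opens_left -= 1
        else:
            sequence.append(BRACKETS[stack.pop()])
    return "".join(sequence)


def is_balanced(sequence):
    """Return True if every bracket is closed by its matching partner."""
    stack = []
    for ch in sequence:
        if ch in BRACKETS:
            stack.append(ch)
        elif not stack or BRACKETS[stack.pop()] != ch:
            return False
    return not stack


sample = generate_dyck(8, random.Random(0))
```

Predicting the next closing bracket requires tracking which openers are still unmatched, which is one plausible reason such data exercises recall-like skills.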
[NLP-45] Enhancing Language Models for Robust Greenwashing Detection
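[Quick read]: This paper addresses the reduced reliability of ESG (environmental, social, and governance) assessment caused by greenwashing and vague claims in sustainability reports. Existing natural language processing (NLP) models rely on surface-level patterns, so they lack robustness to such vagueness and generalize poorly. The key to the solution is a parameter-efficient framework that structures the latent space of a large language model (LLM) by combining contrastive learning with an ordinal ranking objective, capturing graded distinctions between concrete actions and ambiguous claims. It also uses gated feature modulation to filter disclosure noise and MetaGradNorm to stabilize multi-objective optimization, which improves robustness in cross-category settings, though with a trade-off between representational rigidity and generalization.
Link: https://arxiv.org/abs/2601.21722
Authors: Neil Heinrich Braun, Keane Ong, Rui Mao, Erik Cambria, Gianmarco Mengaldo
Affiliations: National University of Singapore; Massachusetts Institute of Technology; Nanyang Technological University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: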
Abstract:Sustainability reports are critical for ESG assessment, yet greenwashing and vague claims often undermine their reliability. Existing NLP models lack robustness to these practices, typically relying on surface-level patterns that generalize poorly. We propose a parameter-efficient framework that structures LLM latent spaces by combining contrastive learning with an ordinal ranking objective to capture graded distinctions between concrete actions and ambiguous claims. Our approach incorporates gated feature modulation to filter disclosure noise and utilizes MetaGradNorm to stabilize multi-objective optimization. Experiments in cross-category settings demonstrate superior robustness over standard baselines while revealing a trade-off between representational rigidity and generalization.
[NLP-46] TACLer: Tailored Curriculum Reinforcement Learning for Efficient Reasoning
[Quick read]: This paper addresses the high training cost, low inference efficiency, and redundant intermediate steps that arise when large language models (LLMs) depend on long chain-of-thought (CoT) reasoning for complex reasoning tasks. The key to the solution is TACLer, a model-tailored curriculum reinforcement learning framework with two core parts: a tailored curriculum learning mechanism that adjusts data complexity to the model's proficiency, identifying and supplying the knowledge the model lacks; and a hybrid Thinking/NoThinking reasoning paradigm that balances accuracy and efficiency by switching the Thinking mode on or off. The method maintains or improves performance while substantially reducing compute and inference token usage, and outperforms state-of-the-art baselines on several math reasoning datasets.
Link: https://arxiv.org/abs/2601.21711
Authors: Huiyuan Lai, Malvina Nissim
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) have shown remarkable performance on complex reasoning tasks, especially when equipped with long chain-of-thought (CoT) reasoning. However, eliciting long CoT typically requires large-scale reinforcement learning (RL) training, while often leading to overthinking with redundant intermediate steps. To improve learning and reasoning efficiency, while preserving or even enhancing performance, we propose TACLer, a model-tailored curriculum reinforcement learning framework that gradually increases the complexity of the data based on the model’s proficiency in multi-stage RL training. TACLer features two core components: (i) tailored curriculum learning that determines what knowledge the model lacks and needs to learn in progressive stages; (ii) a hybrid Thinking/NoThinking reasoning paradigm that balances accuracy and efficiency by enabling or disabling the Thinking mode. Our experiments show that TACLer yields a twofold advantage in learning and reasoning: (i) it reduces computational cost, cutting training compute by over 50% compared to long thinking models and reducing inference token usage by over 42% relative to the base model; and (ii) it improves accuracy by over 9% on the base model, consistently outperforming state-of-the-art NoThinking and Thinking baselines across four math datasets with complex problems.
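[NLP-47] Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis ICLR2026
[Quick Read]: This paper addresses the lack of a unified explanation for attention behavior in large language models (LLMs). Prior work has identified individual patterns such as retrieval heads, sink heads, and diagonal traces, but these findings are fragmented and lack theoretical integration. The key to the solution is the Temporal Attention Pattern Predictability Analysis (TAPPA) framework, which analyzes the mathematical formulation of attention patterns from a temporally continuous perspective, divides them into predictable patterns with clear regularities and unpredictable patterns that appear random, and attributes this distinction to the degree of query self-similarity along the temporal dimension. TAPPA further models three representative predictable patterns mathematically through a joint analysis of queries, keys, and Rotary Positional Embeddings (RoPE), providing quantifiable guidance for inference acceleration such as KV cache compression and model pruning. Experiments show that a simple metric motivated by TAPPA consistently outperforms baseline methods across these tasks.
Link: https://arxiv.org/abs/2601.21709
Authors: Qingyue Yang,Jie Wang,Xing Li,Yinqi Bai,Xialiang Tong,Huiling Zhen,Jianye Hao,Mingxuan Yuan,Bin Li
Affiliations: University of Science and Technology of China (中国科学技术大学); Huawei Technologies Co., Ltd. (华为技术有限公司); Tianjin University (天津大学)
Subjects: Computation and Language (cs.CL)
Comments: ICLR 2026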
Abstract:Attention patterns play a crucial role in both training and inference of large language models (LLMs). Prior works have identified individual patterns such as retrieval heads, sink heads, and diagonal traces, yet these observations remain fragmented and lack a unifying explanation. To bridge this gap, we introduce Temporal Attention Pattern Predictability Analysis (TAPPA), a unifying framework that explains diverse attention patterns by analyzing their underlying mathematical formulations from a temporally continuous perspective. TAPPA both deepens the understanding of attention behavior and guides inference acceleration approaches. Specifically, TAPPA characterizes attention patterns as predictable patterns with clear regularities and unpredictable patterns that appear effectively random. Our analysis further reveals that this distinction can be explained by the degree of query self-similarity along the temporal dimension. Focusing on the predictable patterns, we further provide a detailed mathematical analysis of three representative cases through the joint effect of queries, keys, and Rotary Positional Embeddings (RoPE). We validate TAPPA by applying its insights to KV cache compression and LLM pruning tasks. Across these tasks, a simple metric motivated by TAPPA consistently improves performance over baseline methods. The code is available at this https URL.
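[NLP-48] FBS: Modeling Native Parallel Reading inside a Transformer
[Quick Read]: This paper addresses the efficiency bottleneck caused by LLM inference still relying on strictly token-by-token autoregressive generation, which also overlooks core ingredients of human reading: content-adaptive foresight, chunk-structure-aware compute allocation, and train-test consistency for preview/skimming. The key to the solution is the Fovea-Block-Skip Transformer (FBS), which introduces a causal, trainable loop through three modules, the Parafovea-Attention Window (PAW), Chunk-Head (CH), and Skip-Gate (SG), enabling dynamic scheduling and skipping of attention computation. FBS improves the quality-efficiency trade-off without increasing the parameter count, and the three modules are complementary.
Link: https://arxiv.org/abs/2601.21708
Authors: Tongxi Wang
Affiliations: Southeast University (东南大学)
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: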
Abstract:Large language models (LLMs) excel across many tasks, yet inference is still dominated by strictly token-by-token autoregression. Existing acceleration methods largely patch this pipeline and miss core human-reading ingredients: content-adaptive foresight, chunk-structure-aware compute allocation, and train–test consistency for preview/skimming. We propose the Fovea-Block-Skip Transformer (FBS), which injects a causal, trainable loop into Transformers via Parafovea-Attention Window (PAW), Chunk-Head (CH), and Skip-Gate (SG). Across diverse benchmarks, FBS improves the quality-efficiency trade-off without increasing parameters, and ablations show the three modules are complementary.
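[NLP-49] Beyond Forgetting: Machine Unlearning Elicits Controllable Side Behaviors and Capabilities
[Quick Read]: This paper studies unlearning in large language models (LLMs), that is, how to effectively remove or suppress the influence of specific training samples. Representation misdirection (RM), a representative class of existing methods, achieves this by manipulating the latent representations of forget samples (forget-representations), but its core mechanism, in particular the role of the target vectors, is not well understood. The paper's key contribution is to revisit RM through the lens of the linear representation hypothesis: if a one-dimensional representation vector corresponding to a high-level concept can be identified, linear operations on that concept vector can be performed within the forget-representation space. Based on this view, the authors propose and empirically validate the hypothesis that machine unlearning not only produces forgetting but also elicits controllable side behaviors (such as controlling the truthfulness, sentiment, or refusal of model outputs) and stronger side capabilities (such as improved in-context learning). This reveals a dual potential: the phenomenon may be a hidden risk if misused, or a mechanism for developing models with stronger capabilities and controllable behaviors.
Link: https://arxiv.org/abs/2601.21702
Authors: Tien Dang,The-Hai Nguyen,Dinh Mai Phuong,Nguyen Minh Phuong,Hoang Thanh-Tung,Le-Minh Nguyen,Naoya Inoue
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 21 pages, 11 tables, 12 figures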
Abstract:We consider representation misdirection (RM), a class of LLM unlearning methods that achieves forgetting by manipulating the forget-representations, that is, latent representations of forget samples. Despite being important, the roles of target vectors used in RM, however, remain underexplored. Here, we approach and revisit RM through the lens of the linear representation hypothesis. Specifically, if one can somehow identify a one-dimensional representation corresponding to a high-level concept, the linear representation hypothesis enables linear operations on this concept vector within the forget-representation space. Under this view, we hypothesize that, beyond forgetting, machine unlearning elicits controllable side behaviors and stronger side capabilities corresponding to the high-level concept. Our hypothesis is empirically validated across a wide range of tasks, including behavioral control (e.g., controlling unlearned models’ truth, sentiment, and refusal) and capability enhancement (e.g., improving unlearned models’ in-context learning capability). Our findings reveal that this fairly attractive phenomenon could be either a hidden risk if misused or a mechanism that can be harnessed for developing models that require stronger capabilities and controllable behaviors.
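[NLP-50] Toward Culturally Aligned LLMs through Ontology-Guided Multi-Agent Reasoning
[Quick Read]: This paper addresses the misalignment that large language models (LLMs) exhibit in culturally sensitive decision making due to skewed pretraining data and the absence of structured value representations. Existing methods can steer outputs but typically lack demographic grounding and treat values as independent, unstructured signals, which reduces consistency and interpretability. The key to the solution is OG-MAR, an Ontology-Guided Multi-Agent Reasoning framework: it first summarizes respondent-specific values from the World Values Survey (WVS) and builds a global cultural ontology using competency questions; at inference time, it retrieves ontology-consistent relations and demographically similar profiles to instantiate multiple value-persona agents, whose outputs are synthesized by a judgment agent that enforces ontology consistency and demographic proximity, improving cultural alignment and reasoning transparency.
Link: https://arxiv.org/abs/2601.21700
Authors: Wonduk Seo,Wonseok Choi,Junseo Koh,Juhyeon Lee,Hyunjin An,Minhyeong Yu,Jian Park,Qingshan Zhou,Seunghyun Lee,Yi Bu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
Comments: 35 pages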
Abstract:Large Language Models (LLMs) increasingly support culturally sensitive decision making, yet often exhibit misalignment due to skewed pretraining data and the absence of structured value representations. Existing methods can steer outputs, but often lack demographic grounding and treat values as independent, unstructured signals, reducing consistency and interpretability. We propose OG-MAR, an Ontology-Guided Multi-Agent Reasoning framework. OG-MAR summarizes respondent-specific values from the World Values Survey (WVS) and constructs a global cultural ontology by eliciting relations over a fixed taxonomy via competency questions. At inference time, it retrieves ontology-consistent relations and demographically similar profiles to instantiate multiple value-persona agents, whose outputs are synthesized by a judgment agent that enforces ontology consistency and demographic proximity. Experiments on regional social-survey benchmarks across four LLM backbones show that OG-MAR improves cultural alignment and robustness over competitive baselines, while producing more transparent reasoning traces.
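[NLP-51] Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents
[Quick Read]: This paper addresses the unstable training, sparse exploration, and difficult credit assignment that small language models face when performing multi-hop reasoning under resource constraints. Existing reinforcement learning (RL) methods depend on extensive on-policy rollouts in a high-cost, high-accuracy regime, which does not suit small models with limited compute. The key to the solution is DAVID-GRPO, a framework with three core mechanisms: (i) stabilizing early learning with minimal supervision; (ii) assigning retrieval credit based on evidence recall; and (iii) improving exploration by resampling truncated near-miss trajectories. With this method, agents of up to 1.5B parameters trained on only four RTX 3090 GPUs outperform RL methods designed for large-scale settings on six multi-hop QA benchmarks, showing that the right inductive biases can deliver high accuracy at low training cost.
Link: https://arxiv.org/abs/2601.21699
Authors: Hojae Han,Heeyun Jung,Jongyoon Kim,Seung-won Hwang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Preprint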
Abstract:While reinforcement learning (RL) has empowered multi-turn reasoning agents with retrieval and tools, existing successes largely depend on extensive on-policy rollouts in high-cost, high-accuracy regimes. Under realistic resource constraints that cannot support large models or dense explorations, however, small language model agents fall into a low-cost, low-accuracy regime, where limited rollout budgets lead to sparse exploration, sparse credit assignment, and unstable training. In this work, we challenge this trade-off and show that small language models can achieve strong multi-hop reasoning under resource constraints. We introduce DAVID-GRPO, a budget-efficient RL framework that (i) stabilizes early learning with minimal supervision, (ii) assigns retrieval credit based on evidence recall, and (iii) improves exploration by resampling truncated near-miss trajectories. Evaluated on agents up to 1.5B parameters trained on only four RTX 3090 GPUs, DAVID-GRPO consistently outperforms prior RL methods designed for large-scale settings on six multi-hop QA benchmarks. These results show that with the right inductive biases, small agents can achieve low training cost with high accuracy.
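The abstract states that retrieval credit is assigned "based on evidence recall" but does not give the formula. As a minimal sketch, assuming recall is the fraction of gold evidence documents that a trajectory retrieved, the quantity could be computed as below. The function name `evidence_recall` and the document IDs are hypothetical, and how DAVID-GRPO turns this score into a reward is not specified in the source.

```python
def evidence_recall(retrieved_ids, gold_ids):
    """Fraction of gold evidence documents that appear among the retrieved ones.

    Assumption: documents are identified by hashable IDs; duplicates are ignored.
    """
    gold = set(gold_ids)
    if not gold:
        return 0.0  # no gold evidence defined, so no retrieval credit
    return len(gold & set(retrieved_ids)) / len(gold)


# A two-hop question with gold evidence d1 and d2; the agent found only d1.
score = evidence_recall(["d1", "d7", "d3"], ["d1", "d2"])
print(score)
```

A recall-style signal gives partial credit to trajectories that found some of the required evidence even when the final answer is wrong, which is one plausible way to densify an otherwise sparse reward.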
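[NLP-52] Do Not Waste Your Rollouts: Recycling Search Experience for Efficient Test-Time Scaling
[Quick Read]: This paper addresses the computational redundancy in test-time scaling caused by memoryless search strategies: existing methods treat each rollout as a disposable sample and discard valuable intermediate conclusions and failure patterns, so models repeatedly re-derive known results and revisit dead ends. The key to the solution is Recycling Search Experience (RSE), a training-free, self-guided strategy that distills raw trajectories into a shared experience bank, enabling positive recycling (reusing discovered intermediate conclusions to skip redundant derivations) and negative recycling (recording failure patterns to prune known dead ends). Theoretical analysis and experiments show that RSE outperforms independent-sampling baselines at comparable computational cost and achieves state-of-the-art scaling efficiency on several complex reasoning benchmarks.
Link: https://arxiv.org/abs/2601.21684
Authors: Xinglin Wang,Jiayi Shi,Shaoxiong Feng,Peiwen Yuan,Yiwei Li,Yueqi Zhang,Chuyi Tan,Ji Zhang,Boyuan Pan,Yao Hu,Kan Li
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: preprint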
Abstract:Test-Time Scaling enhances the reasoning capabilities of Large Language Models by allocating additional inference compute to broaden the exploration of the solution space. However, existing search strategies typically treat rollouts as disposable samples, where valuable intermediate insights are effectively discarded after each trial. This systemic memorylessness leads to massive computational redundancy, as models repeatedly re-derive discovered conclusions and revisit known dead ends across extensive attempts. To bridge this gap, we propose Recycling Search Experience (RSE), a self-guided, training-free strategy that turns test-time search from a series of isolated trials into a cumulative process. By actively distilling raw trajectories into a shared experience bank, RSE enables positive recycling of intermediate conclusions to shortcut redundant derivations and negative recycling of failure patterns to prune encountered dead ends. Theoretically, we provide an analysis that formalizes the efficiency gains of RSE, validating its advantage over independent sampling in solving complex reasoning tasks. Empirically, extensive experiments on HMMT24, HMMT25, IMO-Bench, and HLE show that RSE consistently outperforms strong baselines with comparable computational cost, achieving state-of-the-art scaling efficiency.
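[NLP-53] FIT: Defying Catastrophic Forgetting in Continual LLM Unlearning
[Quick Read]: This paper addresses the utility degradation and catastrophic forgetting that large language models (LLMs) suffer when facing continual, high-volume deletion requests. Existing methods usually ignore the continual and large-scale nature of real-world deletion requests, so after many unlearning operations the model struggles to retain its original capabilities and the forgotten content may be recovered. The key to the solution is FIT, a continual unlearning framework whose core mechanisms are rigorous data Filtering, Importance-aware updates, and Targeted layer attribution. FIT maintains forgetting effectiveness while improving robustness to catastrophic forgetting and post-unlearning recovery attacks, achieving a favorable balance between forgetting effectiveness and utility retention over long sequences of unlearning operations.
Link: https://arxiv.org/abs/2601.21682
Authors: Xiaoyu Xu,Minxin Du,Kun Fang,Zi Liang,Yaxin Xiao,Zhicong Huang,Cheng Hong,Qingqing Ye,Haibo Hu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 20 Pages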
Abstract:Large language models (LLMs) demonstrate impressive capabilities across diverse tasks but raise concerns about privacy, copyright, and harmful materials. Existing LLM unlearning methods rarely consider the continual and high-volume nature of real-world deletion requests, which can cause utility degradation and catastrophic forgetting as requests accumulate. To address this challenge, we introduce FIT, a framework for continual unlearning that handles large numbers of deletion requests while maintaining robustness against both catastrophic forgetting and post-unlearning recovery. FIT mitigates degradation through rigorous data Filtering, Importance-aware updates, and Targeted layer attribution, enabling stable performance across long sequences of unlearning operations and achieving a favorable balance between forgetting effectiveness and utility retention. To support realistic evaluation, we present PCH, a benchmark covering Personal information, Copyright, and Harmful content in sequential deletion scenarios, along with two symmetric metrics, Forget Degree (F.D.) and Retain Utility (R.U.), which jointly assess forgetting quality and utility preservation. Extensive experiments on four open-source LLMs with hundreds of deletion requests show that FIT achieves the strongest trade-off between F.D. and R.U., surpasses existing methods on MMLU, CommonsenseQA, and GSM8K, and remains resistant against both relearning and quantization recovery attacks.
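[NLP-54] Scale-Dependent Semantic Dynamics Revealed by Allan Deviation
[Quick Read]: This paper addresses the poorly understood dynamics of how language progresses at the semantic level, namely how to quantify and distinguish the stability and variation of semantic progression in human writing. The key to the solution is to model the semantic progression of text as a stochastic trajectory in a high-dimensional state space and to apply Allan deviation, a tool from precision metrology, treating ordered sentence embeddings as a displacement signal in order to analyze how semantic stability scales. The analysis reveals short-time power-law scaling (which distinguishes creative literature from technical texts) and a long-time crossover to a stability-limited noise floor. It thereby treats semantic coherence as a measurable physical property and offers a quantitative framework for comparing the semantic stability of human writing with that of generative AI models.
Link: https://arxiv.org/abs/2601.21678
Authors: Debayan Dasgupta
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Data Analysis, Statistics and Probability (physics.data-an)
Comments: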
Abstract:While language progresses through a sequence of semantic states, the underlying dynamics of this progression remain elusive. Here, we treat the semantic progression of written text as a stochastic trajectory in a high-dimensional state space. We utilize Allan deviation, a tool from precision metrology, to analyze the stability of meaning by treating ordered sentence embeddings as a displacement signal. Our analysis reveals two distinct dynamical regimes: short-time power-law scaling, which differentiates creative literature from technical texts, and a long-time crossover to a stability-limited noise floor. We find that while large language models successfully mimic the local scaling statistics of human text, they exhibit a systematic reduction in their stability horizon. These results establish semantic coherence as a measurable physical property, offering a framework to differentiate the nuanced dynamics of human cognition from the patterns generated by algorithmic models.
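To make the idea of an Allan deviation over sentence embeddings concrete, here is a minimal sketch. It assumes the standard non-overlapping block-average estimator, extended to vectors by using the squared Euclidean distance between adjacent block means. The abstract does not state which estimator, normalization, or embedding model the paper uses, so the function name `allan_deviation`, the block size `m`, and the toy data are all illustrative assumptions.

```python
import math


def allan_deviation(vectors, m):
    """Non-overlapping Allan deviation of a vector-valued sequence at block size m.

    Assumption: sigma^2(m) = mean over adjacent blocks of
    ||mean(block k+1) - mean(block k)||^2 / 2.
    """
    dim = len(vectors[0])
    n_blocks = len(vectors) // m
    if n_blocks < 2:
        raise ValueError("need at least two full blocks of size m")
    # Mean vector of each consecutive block of m samples.
    means = []
    for b in range(n_blocks):
        block = vectors[b * m:(b + 1) * m]
        means.append([sum(v[d] for v in block) / m for d in range(dim)])
    # Sum of squared Euclidean distances between adjacent block means.
    total = 0.0
    for prev, nxt in zip(means, means[1:]):
        total += sum((x - y) ** 2 for x, y in zip(prev, nxt))
    return math.sqrt(total / (2 * (n_blocks - 1)))


# Toy 1-D "embedding" sequence that alternates between two states.
signal = [[0.0], [1.0], [0.0], [1.0]]
for m in (1, 2):
    print(m, allan_deviation(signal, m))
```

Plotting this quantity against `m` on log-log axes is the usual way to read off scaling regimes: a straight segment corresponds to a power law, and a flattening corresponds to a noise floor of the kind the abstract describes.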
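[NLP-55] AdaptBPE: From General Purpose to Specialized Tokenizers EACL2026
[Quick Read]: This paper addresses the inefficiency of general-purpose subword tokenizers (such as Byte-Pair Encoding, BPE) on specific domains or languages: a standard tokenizer processes all text uniformly and can yield inefficient token representations. The key to the solution is a post-training adaptation strategy that selectively replaces low-utility tokens with more relevant ones based on token frequencies in an adaptation corpus, optimizing encoding efficiency on the target corpus at a fixed vocabulary size. The method is essentially a lightweight form of vocabulary fine-tuning that improves text compression for specific tasks or domains.
Link: https://arxiv.org/abs/2601.21665
Authors: Vijini Liyanage,François Yvon
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: EACL 2026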
Abstract:Subword tokenization methods, such as Byte-Pair Encoding (BPE), significantly impact the performance and efficiency of large language models (LLMs). The standard approach involves training a general-purpose tokenizer that uniformly processes all textual data during both training and inference. However, the use of a generic set of tokens can incur inefficiencies when applying the model to specific domains or languages. To address this limitation, we propose a post-training adaptation strategy that selectively replaces low-utility tokens with more relevant ones based on their frequency in an adaptation corpus. Our algorithm identifies the token inventory that most effectively encodes the adaptation corpus for a given target vocabulary size. Extensive experiments on generation and classification tasks across multiple languages demonstrate that our adapted tokenizers compress test corpora more effectively than baselines using the same vocabulary size. This method serves as a lightweight adaptation mechanism, akin to a vocabulary fine-tuning process, enabling optimized tokenization for specific domains or tasks. Our code and data are available at this https URL.
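[NLP-56] SWE-Spot: Building Small Repo-Experts with Repository-Centric Learning
[Quick Read]: This paper addresses a fundamental capability gap that Small Language Models (SLMs) face when deployed in privacy-sensitive, resource-constrained environments: they lack strong inference-time generalization and struggle with complicated, unfamiliar codebases. The conventional Task-Centric Learning (TCL) paradigm, which scales exposure across many repositories, cannot close this gap. The authors propose a paradigm shift, Repository-Centric Learning (RCL), which prioritizes vertical depth in a single repository over horizontal task breadth and holds that a model should internalize the "physics" of the target software environment through parametric knowledge acquisition rather than rely on costly inference-time search. Following this idea, the study designs a four-unit Repository-Centric Experience that turns static codebases into interactive learning signals and trains SWE-Spot-4B, a family of highly compact repo-specialized expert models. These models break established scaling trends, outperforming larger open-weight models (e.g., CWM by Meta and Qwen3-Coder-30B) and surpassing or matching efficiency-focused commercial models (e.g., GPT-4.1-mini and GPT-5-nano) on multiple software engineering (SWE) tasks, while achieving higher training sample efficiency and lower inference cost.
Link: https://arxiv.org/abs/2601.21649
Authors: Jinjun Peng,Magnus Saebo,Tianjun Zhong,Yi-Jie Cheng,Junfeng Yang,Baishakhi Ray,Simin Chen,Yangruibo Ding
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: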
Abstract:The deployment of coding agents in privacy-sensitive and resource-constrained environments drives the demand for capable open-weight Small Language Models (SLMs). However, they suffer from a fundamental capability gap: unlike frontier large models, they lack the inference-time strong generalization to work with complicated, unfamiliar codebases. We identify that the prevailing Task-Centric Learning (TCL) paradigm, which scales exposure across disparate repositories, fails to address this limitation. In response, we propose Repository-Centric Learning (RCL), a paradigm shift that prioritizes vertical repository depth over horizontal task breadth, suggesting SLMs must internalize the “physics” of a target software environment through parametric knowledge acquisition, rather than attempting to recover it via costly inference-time search. Following this new paradigm, we design a four-unit Repository-Centric Experience, transforming static codebases into interactive learning signals, to train SWE-Spot-4B, a family of highly compact models built as repo-specialized experts that breaks established scaling trends, outperforming open-weight models up to larger (e.g., CWM by Meta, Qwen3-Coder-30B) and surpassing/matching efficiency-focused commercial models (e.g., GPT-4.1-mini, GPT-5-nano) across multiple SWE tasks. Further analysis reveals that RCL yields higher training sample efficiency and lower inference costs, emphasizing that for building efficient intelligence, repository mastery is a distinct and necessary dimension that complements general coding capability.
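[NLP-57] ILRR: Inference-Time Steering Method for Masked Diffusion Language Models
[Quick Read]: This paper addresses the lack of effective inference-time control mechanisms for Discrete Diffusion Language Models (DLMs), in particular how to steer attributes of the generated text (such as sentiment) flexibly and precisely. The key to the solution is Iterative Latent Representation Refinement (ILRR), a framework that requires no additional training: during denoising it dynamically aligns the internal activations of the generated sequence with those of a reference sequence, capturing and transferring high-level semantic properties. The paper also introduces Spatially Modulated Steering, which regulates guidance intensity across the sequence so that long texts can be steered with shorter references. ILRR improves attribute accuracy over baselines by 10 to 60 percentage points and needs only one additional parallel forward pass per denoising step, so the computational overhead is minor.
Link: https://arxiv.org/abs/2601.21647
Authors: Eden Avrahami,Eliya Nachmani
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: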
Abstract:Discrete Diffusion Language Models (DLMs) offer a promising non-autoregressive alternative for text generation, yet effective mechanisms for inference-time control remain relatively underexplored. Existing approaches include sampling-level guidance procedures or trajectory optimization mechanisms. In this work, we introduce Iterative Latent Representation Refinement (ILRR), a learning-free framework for steering DLMs using a single reference sequence. ILRR guides generation by dynamically aligning the internal activations of the generated sequence with those of a given reference throughout the denoising process. This approach captures and transfers high-level semantic properties, with a tunable steering scale enabling flexible control over attributes such as sentiment. We further introduce Spatially Modulated Steering, an extension that enables steering long texts using shorter references by regulating guidance intensity across the sequence. Empirically, we demonstrate that ILRR achieves effective attribute steering on LLaDA and MDLM architectures with a minor computational overhead, requiring only one additional parallel forward pass per denoising step. Under the same compute budget, ILRR improves attribute accuracy over comparable baselines by 10 to 60 percentage points, while maintaining high generation quality.
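[NLP-58] Breaking the Overscaling Curse: Thinking Parallelism Before Parallel Thinking
[Quick Read]: This paper addresses the "overscaling curse" in parallel thinking for large language models (LLMs): a single system-level parallelism level is applied to all samples, yet because samples are heterogeneous, some reach good performance without high parallelism, which wastes compute. The key to the solution is T2, a lightweight method that uses latent representations to estimate the optimal parallelism level for each sample before decoding, allowing per-sample budget allocation that significantly reduces cost while maintaining performance comparable to the original approach.
Link: https://arxiv.org/abs/2601.21619
Authors: Yiming Wang,Zhuosheng Zhang,Rui Wang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: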
Abstract:Parallel thinking enhances LLM reasoning by multi-path sampling and aggregation. In system-level evaluations, a global parallelism level N is allocated to all samples, typically set large to maximize overall dataset accuracy. However, due to sample heterogeneity, some samples can achieve comparable performance with a smaller N' < N, causing budget redundancy. This incompatibility between system-level efficacy and sample-level efficiency constitutes the overscaling curse. In this paper, we formalize and quantify the overscaling curse, showing its universality and severity in practice, and analyze its trigger mechanism. We then propose a lightweight method, T2, to break the overscaling curse, which utilizes latent representations to estimate the optimal parallelism level for each sample before decoding. Experiments show that T2 significantly reduces cost while maintaining comparable performance, enabling more efficient parallel thinking.
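[NLP-59] Semantic Content Determines Algorithmic Performance
[Quick Read]: This paper studies how sensitive large language models (LLMs) are to the semantic content of their inputs when executing algorithmic tasks, that is, whether model behavior is independent of what the inputs mean, which is the core requirement of algorithmic invariance. Prior work often conflates semantic sensitivity with reasoning complexity or prompt variation. This paper proposes WhatCounts, an atomic test that isolates the property: the model counts the items in an unambiguous, delimited list with no duplicates, distractors, or reasoning steps, and the evaluation checks whether accuracy changes with the semantic type of the items (for example, cities versus chemicals). The key finding is that frontier LLMs show over 40% accuracy variation across semantic categories, and controlled ablations rule out confounds, indicating that the gap stems from the models' implicit dependence on input semantics rather than external factors. Further experiments show that small amounts of unrelated fine-tuning shift the gap unpredictably, suggesting that LLMs do not implement deterministic algorithms but approximate them in an argument-dependent way. LLM functions may therefore carry hidden semantic dependencies well beyond counting, with implications for LLM-based agentic systems.
Link: https://arxiv.org/abs/2601.21618
Authors: Martiño Ríos-García,Nawaf Alampara,Kevin Maik Jablonka
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: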
Abstract:Counting should not depend on what is being counted; more generally, any algorithm’s behavior should be invariant to the semantic content of its arguments. We introduce WhatCounts to test this property in isolation. Unlike prior work that conflates semantic sensitivity with reasoning complexity or prompt variation, WhatCounts is atomic: count items in an unambiguous, delimited list with no duplicates, distractors, or reasoning steps for different semantic types. Frontier LLMs show over 40% accuracy variation depending solely on what is being counted - cities versus chemicals, names versus symbols. Controlled ablations rule out confounds. The gap is semantic, and it shifts unpredictably with small amounts of unrelated fine-tuning. LLMs do not implement algorithms; they approximate them, and the approximation is argument-dependent. As we show with an agentic example, this has implications beyond counting: any LLM function may carry hidden dependencies on the meaning of its inputs.
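The task format described in the abstract (an unambiguous, delimited list with no duplicates, varied only by semantic type) can be sketched as follows. This is a hedged illustration, not the benchmark's actual generator: the prompt wording, the delimiter, the item pools, and the function name `make_counting_item` are all assumptions.

```python
import random

# Hypothetical item pools for two semantic types; each pool has unique entries.
CITIES = ["Lima", "Oslo", "Cairo", "Hanoi", "Quito", "Perth"]
CHEMICALS = ["ethanol", "benzene", "acetone", "toluene", "glycerol", "urea"]


def make_counting_item(pool, n, seed, delimiter=" | "):
    """Build one counting prompt with n distinct items and its gold answer."""
    rng = random.Random(seed)
    items = rng.sample(pool, n)  # sampling without replacement, so no duplicates
    prompt = "How many items are in this list?\n" + delimiter.join(items)
    return prompt, n


# Same list length and format; only the semantic type of the items differs.
for pool in (CITIES, CHEMICALS):
    prompt, answer = make_counting_item(pool, 4, seed=0)
    print(prompt, "->", answer)
```

Holding length, delimiter, and prompt text fixed while swapping only the pool is what would let any accuracy gap be attributed to the semantic type of the items.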
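[NLP-60] Thinking Broad, Acting Fast: Latent Reasoning Distillation from Multi-Perspective Chain-of-Thought for E-Commerce Relevance WWW2026
[Quick Read]: This paper addresses two key problems in relevance modeling for e-commerce search. First, existing Chain-of-Thought (CoT) methods mostly rely on single-perspective reasoning, which fails to capture the multifaceted nature of e-commerce relevance (such as user intent, attribute-level matching, and business rules). Second, although CoT-enhanced large language models (LLMs) reason well, their high inference latency limits real-time deployment, and current knowledge distillation methods discard the CoT structure at inference time, using it only as a transient auxiliary signal and forfeiting its reasoning value. The key to the solution is a new framework that strengthens the teacher model with Multi-Perspective CoT (MPCoT) to generate diverse rationales and combines Supervised Fine-Tuning (SFT) with Direct Preference Optimization (DPO) to build a more robust reasoner. It also introduces Latent Reasoning Knowledge Distillation (LRKD), which lets the student model efficiently internalize the teacher's reasoning at inference time, enabling high-quality relevance modeling at low latency.
Link: https://arxiv.org/abs/2601.21611
Authors: Baopu Qiu,Hao Chen,Yuanrong Wu,Changtong Zan,Chao Wei,Weiru Zhang,Xiaoyi Zeng
Affiliations: Alibaba International Digital Commerce Group (阿里巴巴国际数字商业集团); Zhejiang University (浙江大学)
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 pages, 6 figures, Accepted by WWW2026 industry track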
Abstract:Effective relevance modeling is crucial for e-commerce search, as it aligns search results with user intent and enhances customer experience. Recent work has leveraged large language models (LLMs) to address the limitations of traditional relevance models, especially for long-tail and ambiguous queries. By incorporating Chain-of-Thought (CoT) reasoning, these approaches improve both accuracy and interpretability through multi-step reasoning. However, two key limitations remain: (1) most existing approaches rely on single-perspective CoT reasoning, which fails to capture the multifaceted nature of e-commerce relevance (e.g., user intent vs. attribute-level matching vs. business-specific rules); and (2) although CoT-enhanced LLMs offer rich reasoning capabilities, their high inference latency necessitates knowledge distillation for real-time deployment, yet current distillation methods discard the CoT rationale structure at inference, using it as a transient auxiliary signal and forfeiting its reasoning utility. To address these challenges, we propose a novel framework that better exploits CoT semantics throughout the optimization pipeline. Specifically, the teacher model leverages Multi-Perspective CoT (MPCoT) to generate diverse rationales and combines Supervised Fine-Tuning (SFT) with Direct Preference Optimization (DPO) to construct a more robust reasoner. For distillation, we introduce Latent Reasoning Knowledge Distillation (LRKD), which endows a student model with a lightweight inference-time latent reasoning extractor, allowing efficient and low-latency internalization of the LLM’s sophisticated reasoning capabilities. Evaluated in offline experiments and online A/B tests on an e-commerce search advertising platform serving tens of millions of users daily, our method delivers significant offline gains, showing clear benefits in both commercial performance and user experience.
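[NLP-61] Language Models as Artificial Learners: Investigating Crosslinguistic Influence
[Quick Read]: This paper addresses the inconsistent findings on crosslinguistic influence (CLI) in bilingualism research, which arise from experimental variance. The key to the solution is to use language models (LMs) as controlled statistical learners to systematically simulate CLI and isolate its potential drivers. By manipulating L1 language dominance and the L2 age of exposure (the training step at which the L2 is introduced), as well as the syntactic distance between the L1 used in pretraining and the L2, the study controls these variables precisely and probes the mechanisms of CLI. The results show that language dominance and proficiency are strong predictors of CLI and that the L1 is co-activated during L2 processing and directly influences the neural circuitry recruited for the L2, providing computational evidence that can inform theories of human CLI.
Link: https://arxiv.org/abs/2601.21587
Authors: Abderrahmane Issam,Yusuf Can Semerci,Jan Scholtes,Gerasimos Spanakis
Affiliations: Maastricht University (马斯特里赫特大学)
Subjects: Computation and Language (cs.CL)
Comments: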
Abstract:Despite the centrality of crosslinguistic influence (CLI) to bilingualism research, human studies often yield conflicting results due to inherent experimental variance. We address these inconsistencies by using language models (LMs) as controlled statistical learners to systematically simulate CLI and isolate its underlying drivers. Specifically, we study the effect of varying the L1 language dominance and the L2 language proficiency, which we manipulate by controlling the L2 age of exposure – defined as the training step at which the L2 is introduced. Furthermore, we investigate the impact of pretraining on L1 languages with varying syntactic distance from the L2. Using cross-linguistic priming, we analyze how activating L1 structures impacts L2 processing. Our results align with evidence from psycholinguistic studies, confirming that language dominance and proficiency are strong predictors of CLI. We further find that while priming of grammatical structures is bidirectional, the priming of ungrammatical structures is sensitive to language dominance. Finally, we provide mechanistic evidence of CLI in LMs, demonstrating that the L1 is co-activated during L2 processing and directly influences the neural circuitry recruited for the L2. More broadly, our work demonstrates that LMs can serve as a computational framework to inform theories of human CLI.
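[NLP-62] Depth-Recurrent Attention Mixtures: Giving Latent Reasoning the Attention it Deserves
[Quick Read]: This paper addresses limitations of depth-recurrence in generative AI models: incomplete parameter sharing due to partially fixed layer stacks, and a constant hidden size that restricts many-step latent reasoning. Prior work also lacks baselines matched simultaneously in compute (FLOPs), parameters, and memory. The key to the solution is a modular framework of depth-recurrent attention mixtures (Dreamer), which combines sequence attention, depth attention, and sparse expert attention. Attention along depth alleviates the hidden-size bottleneck, decouples scaling dimensions, and lets depth-recurrent models scale efficiently and effectively. On language reasoning benchmarks, the models need 2 to 8x fewer training tokens than FLOP-, parameter-, and memory-matched state-of-the-art (SOTA) models for the same accuracy, and outperform roughly 2x larger SOTA models given the same training tokens.
Link: https://arxiv.org/abs/2601.21582
Authors: Jonas Knupp,Jan Hendrik Metzen,Jeremias Bohn,Georg Groh,Kristian Kersting
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: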
Abstract:Depth-recurrence facilitates latent reasoning by sharing parameters across depths. However, prior work lacks combined FLOP-, parameter-, and memory-matched baselines, underutilizes depth-recurrence due to partially fixed layer stacks, and ignores the bottleneck of constant hidden-sizes that restricts many-step latent reasoning. To address this, we introduce a modular framework of depth-recurrent attention mixtures (Dreamer), combining sequence attention, depth attention, and sparse expert attention. It alleviates the hidden-size bottleneck through attention along depth, decouples scaling dimensions, and allows depth-recurrent models to scale efficiently and effectively. Across language reasoning benchmarks, our models require 2 to 8x fewer training tokens for the same accuracy as FLOP-, parameter-, and memory-matched SOTA, and outperform ca. 2x larger SOTA models with the same training tokens. We further present insights into knowledge usage across depths, e.g., showing 2 to 11x larger expert selection diversity than SOTA MoEs.
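[NLP-63] KromHC: Manifold-Constrained Hyper-Connections with Kronecker-Product Residual Matrices
[Quick Read]: This paper addresses the limitations of Manifold-Constrained Hyper-Connections (mHC) in training stability and parameter complexity. First, the Sinkhorn-Knopp (SK) iteration used in mHC does not always produce exactly doubly stochastic residual matrices. Second, its parameter complexity is $\mathcal{O}(n^3 C)$, where $n$ is the width of the residual stream and $C$ is the feature dimension, which makes the model hard to scale. The authors propose KromHC, whose core idea is to parametrize the residual matrix as a Kronecker product of smaller doubly stochastic matrices. By enforcing manifold constraints along each mode of the tensorized residual stream, KromHC guarantees exact double stochasticity while reducing parameter complexity to $\mathcal{O}(n^2 C)$, in contrast to the factorial complexity $\mathcal{O}(nC \cdot n!)$ of the lightweight variant mHC-lite. Experiments show that KromHC matches or even outperforms state-of-the-art mHC variants while requiring significantly fewer trainable parameters.
Link: https://arxiv.org/abs/2601.21579
Authors: Wuyang Zhou,Yuxuan Gu,Giorgos Iacovides,Danilo Mandic
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: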
Abstract:The success of Hyper-Connections (HC) in neural networks (NN) has also highlighted issues related to its training instability and restricted scalability. The Manifold-Constrained Hyper-Connections (mHC) mitigate these challenges by projecting the residual connection space onto a Birkhoff polytope, however, it faces two issues: 1) its iterative Sinkhorn-Knopp (SK) algorithm does not always yield exact doubly stochastic residual matrices; 2) mHC incurs a prohibitive $\mathcal{O}(n^3 C)$ parameter complexity with $n$ as the width of the residual stream and $C$ as the feature dimension. The recently proposed mHC-lite reparametrizes the residual matrix via the Birkhoff-von-Neumann theorem to guarantee double stochasticity, but also faces a factorial explosion in its parameter complexity, $\mathcal{O}(nC \cdot n!)$. To address both challenges, we propose KromHC, which uses the Kronecker products of smaller doubly stochastic matrices to parametrize the residual matrix in mHC. By enforcing manifold constraints across the factor residual matrices along each mode of the tensorized residual stream, KromHC guarantees exact double stochasticity of the residual matrices while reducing parameter complexity to $\mathcal{O}(n^2 C)$. Comprehensive experiments demonstrate that KromHC matches or even outperforms state-of-the-art (SOTA) mHC variants, while requiring significantly fewer trainable parameters. The code is available at this https URL.
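The property that makes a Kronecker-product parametrization attractive is that the Kronecker product of doubly stochastic matrices is itself doubly stochastic, so exactness is inherited from the small factors. The sketch below checks this numerically in plain Python. The example matrices and the helper names `kron` and `is_doubly_stochastic` are illustrative assumptions; how KromHC actually parametrizes and learns its factor matrices is not described in the abstract.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def is_doubly_stochastic(m, tol=1e-9):
    """True if m is square, non-negative, and every row and column sums to 1."""
    n = len(m)
    if any(len(row) != n for row in m):
        return False
    if any(x < 0 for row in m for x in row):
        return False
    rows_ok = all(abs(sum(row) - 1.0) <= tol for row in m)
    cols_ok = all(abs(sum(m[i][j] for i in range(n)) - 1.0) <= tol
                  for j in range(n))
    return rows_ok and cols_ok


# Two small doubly stochastic factors (hypothetical values).
A = [[0.7, 0.3], [0.3, 0.7]]
B = [[0.25, 0.75], [0.75, 0.25]]

K = kron(A, B)  # 4x4 matrix assembled from two 2x2 factors
print(len(K), is_doubly_stochastic(K))
```

In this toy case the two factors hold 8 entries in total while the assembled 4x4 matrix has 16, which illustrates where the parameter savings in a factorized parametrization come from.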
[NLP-64] Shaping capabilities with token-level data filtering
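[Quick read]: This paper targets the problem that undesired capabilities in language models are hard to suppress with post hoc methods, which adversaries can easily bypass. Its core solution is to move capability shaping into pretraining by filtering the pretraining data, studied on the proxy task of removing medical capabilities. The key finding is that token-level filtering is more effective than document-level filtering: it achieves the same reduction in undesired capabilities at a lower cost to benign ones. Filtering also becomes more effective with scale; for the largest models, token filtering leads to a 7000x compute slowdown on the forget domain, and models trained with token filtering can still be aligned on that domain. The paper additionally introduces a methodology for labeling tokens with sparse autoencoders and distilling cheap, high-quality classifiers.
Link: https://arxiv.org/abs/2601.21571
Authors: Neil Rathi,Alec Radford
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 33 pages, 28 figures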
Abstract:Current approaches to reducing undesired capabilities in language models are largely post hoc, and can thus be easily bypassed by adversaries. A natural alternative is to shape capabilities during pretraining itself. On the proxy task of removing medical capabilities, we show that the simple intervention of filtering pretraining data is highly effective, robust, and inexpensive at scale. Inspired by work on data attribution, we show that filtering tokens is more effective than filtering documents, achieving the same hit to undesired capabilities at a lower cost to benign ones. Training models spanning two orders of magnitude, we then demonstrate that filtering gets more effective with scale: for our largest models, token filtering leads to a 7000x compute slowdown on the forget domain. We also show that models trained with token filtering can still be aligned on the forget domain. Along the way, we introduce a methodology for labeling tokens with sparse autoencoders and distilling cheap, high-quality classifiers. We also demonstrate that filtering can be robust to noisy labels with sufficient pretraining compute.
[NLP-65] Multi-objective Integer Linear Programming approach for Automatic Software Cognitive Complexity Reduction
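[Quick read]: This paper addresses software maintainability, specifically improving code understandability by reducing cognitive complexity and thereby reducing bugs and vulnerabilities. The core challenge is to restructure code without changing its functionality while accounting for several evaluation criteria at once (for example, balancing the number of lines of code against cognitive complexity). The key to the solution is a multi-objective integer linear programming model that yields a set of Pareto-optimal solutions trading off cognitive complexity against code size. The authors also developed accompanying algorithms and integrated them into a tool that supports parameterised resolution of the problem.
Link: https://arxiv.org/abs/2601.21565
Authors: Adriana Novoa-Hurtado,Rubén Saborido,Francisco Chicano,Manuel Giménez-Medina
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: 51 pages, 17 figures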
Abstract:Clear and concise code is necessary to ensure maintainability, so it is crucial that the software is as simple as possible to understand, to avoid bugs and, above all, vulnerabilities. There are many ways to enhance software without changing its functionality, considering the extract method refactoring the primary process to reduce the effort required for code comprehension. The cognitive complexity measure employed in this work is the one defined by SonarSource, which is a company that develops well-known applications for static code analysis. This extraction problem can be modeled as a combinatorial optimization problem. The main difficulty arises from the existence of different criteria for evaluating the solutions obtained, requiring the formulation of the code extraction problem as a multi-objective optimization problem using alternative methods. We propose a multi-objective integer linear programming model to obtain a set of solutions that reduce the cognitive complexity of a given piece of code, such as balancing the number of lines of code and its cognitive complexity. In addition, several algorithms have been developed to validate the model. These algorithms have been integrated into a tool that enables the parameterised resolution of the problem of reducing software cognitive complexity.
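To make the multi-objective idea concrete, the sketch below filters a list of candidate refactorings down to the non-dominated (Pareto-optimal) ones when two objectives are minimized. It is an illustration only: the `(lines_of_code, cognitive_complexity)` pairs are made-up values, treating both objectives as minimized is an assumption, and the code enumerates candidates directly instead of solving the integer linear program the paper proposes.

```python
def pareto_front(points):
    """Return the points not dominated by any other point (both objectives minimized)."""
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1]
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


# Hypothetical candidates: (lines_of_code, cognitive_complexity).
candidates = [(120, 15), (130, 9), (125, 12), (140, 9), (128, 14)]

front = pareto_front(candidates)
print(front)
```

Here `(140, 9)` is dropped because `(130, 9)` is no worse on either objective, and `(128, 14)` is dropped because `(125, 12)` beats it on both.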
[NLP-66] ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas
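[Quick read]: This paper addresses the challenges of training robust tool-augmented language model agents: reliance on manual intervention, dependence on non-verifiable simulated environments, exclusive use of either supervised fine-tuning (SFT) or reinforcement learning (RL), and difficulty achieving stable long-horizon, multi-turn learning. The key to the solution is ASTRA, a fully automated end-to-end framework built on two complementary components. The first is a data synthesis pipeline that uses the static topology of tool-call graphs to generate diverse, structurally grounded trajectories, instilling broadly transferable tool-use competence. The second is an environment synthesis framework that captures the compositional topology of human semantic reasoning and converts decomposed question-answer traces into independent, code-executable, rule-verifiable environments, enabling deterministic multi-turn RL. By combining SFT with online RL and using trajectory-level rewards to balance task completion and interaction efficiency, ASTRA-trained models achieve state-of-the-art performance at comparable scales on multiple agentic tool-use benchmarks while preserving core reasoning ability.
Link: https://arxiv.org/abs/2601.21558
Authors: Xiaoyu Tian,Haotian Wang,Shuaiting Chen,Hao Zhou,Kaichi Yu,Yudian Zhang,Jade Ouyang,Junxi Yin,Jiong Chen,Baoyan Guo,Lei Zhang,Junjie Tao,Yuansheng Song,Ming Cui,Chengwei Liu
Affiliations: Beike Language and Intelligence
Categories: Computation and Language (cs.CL)
Comments: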
Abstract:Large language models (LLMs) are increasingly used as tool-augmented agents for multi-step decision making, yet training robust tool-using agents remains challenging. Existing methods still require manual intervention, depend on non-verifiable simulated environments, rely exclusively on either supervised fine-tuning (SFT) or reinforcement learning (RL), and struggle with stable long-horizon, multi-turn learning. To address these challenges, we introduce ASTRA, a fully automated end-to-end framework for training tool-augmented language model agents via scalable data synthesis and verifiable reinforcement learning. ASTRA integrates two complementary components. First, a pipeline that leverages the static topology of tool-call graphs synthesizes diverse, structurally grounded trajectories, instilling broad and transferable tool-use competence. Second, an environment synthesis framework that captures the rich, compositional topology of human semantic reasoning converts decomposed question-answer traces into independent, code-executable, and rule-verifiable environments, enabling deterministic multi-turn RL. Based on this method, we develop a unified training methodology that integrates SFT with online RL using trajectory-level rewards to balance task completion and interaction efficiency. Experiments on multiple agentic tool-use benchmarks demonstrate that ASTRA-trained models achieve state-of-the-art performance at comparable scales, approaching closed-source systems while preserving core reasoning ability. We release the full pipelines, environments, and trained models at this https URL.
[NLP-67] Note2Chat: Improving LLMs for Multi-Turn Clinical History Taking Using Medical Notes AAAI-26
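[Quick read]: This paper addresses the weak performance of Large Language Models (LLMs) in dynamic, multi-turn clinical history taking, especially clinical reasoning that requires iterative questioning and hypothesis refinement. Existing approaches depend on scarce and sensitive dialogue data. The key to the solution is a note-driven framework, Note2Chat, which converts real-world medical notes into high-quality doctor-patient dialogues through a decision tree-guided generation and refinement pipeline, and applies a three-stage fine-tuning strategy (supervised learning, simulated data augmentation, and preference learning). It also introduces a single-turn reasoning paradigm that reframes history taking as a sequence of single-turn reasoning problems, which improves interpretability and enables local supervision, dynamic adaptation, and greater sample efficiency. Experiments report gains of +16.9 F1 and +21.0 Top-1 diagnostic accuracy over GPT-4o.
Link: https://arxiv.org/abs/2601.21551
Authors: Yang Zhou,Zhenting Sheng,Mingrui Tan,Yuting Song,Jun Zhou,Yu Heng Kwan,Lian Leng Low,Yang Bai,Yong Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted at AAAI-26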
Abstract:Effective clinical history taking is a foundational yet underexplored component of clinical reasoning. While large language models (LLMs) have shown promise on static benchmarks, they often fall short in dynamic, multi-turn diagnostic settings that require iterative questioning and hypothesis refinement. To address this gap, we propose Note2Chat, a note-driven framework that trains LLMs to conduct structured history taking and diagnosis by learning from widely available medical notes. Instead of relying on scarce and sensitive dialogue data, we convert real-world medical notes into high-quality doctor-patient dialogues using a decision tree-guided generation and refinement pipeline. We then propose a three-stage fine-tuning strategy combining supervised learning, simulated data augmentation, and preference learning. Furthermore, we propose a novel single-turn reasoning paradigm that reframes history taking as a sequence of single-turn reasoning problems. This design enhances interpretability and enables local supervision, dynamic adaptation, and greater sample efficiency. Experimental results show that our method substantially improves clinical reasoning, achieving gains of +16.9 F1 and +21.0 Top-1 diagnostic accuracy over GPT-4o. Our code and dataset can be found at this https URL.
[NLP-68] ShardMemo: Masked MoE Routing for Sharded Agentic LLM Memory
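[Quick read]: This paper addresses the bottleneck that centralized indexes and heuristic partitions create in agentic large language model (LLM) systems as external memory volume and parallel access grow under long-horizon state and concurrent multi-agent execution. The key to the solution is ShardMemo, a budgeted tiered memory service with three tiers: Tier A holds per-agent working state, Tier B holds sharded evidence with shard-local approximate nearest neighbor (ANN) indexes, and Tier C is a versioned skill library. Tier B enforces scope-before-routing, in which structured eligibility constraints mask ineligible shards before routing or ANN search. Shard probing is cast as masked mixture-of-experts (MoE) routing that probes at most $B_\mathrm{probe}$ eligible shards with cost-aware gating, and the router is trained from evidence-to-shard supervision, improving retrieval accuracy while reducing retrieval work and latency.
Link: https://arxiv.org/abs/2601.21545
Authors: Yang Zhao,Chengxiao Dai,Yue Xiu,Mengying Kou,Yuliang Zheng,Dusit Niyato
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: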
Abstract:Agentic large language model (LLM) systems rely on external memory for long-horizon state and concurrent multi-agent execution, but centralized indexes and heuristic partitions become bottlenecks as memory volume and parallel access grow. We present ShardMemo, a budgeted tiered memory service with Tier A per-agent working state, Tier B sharded evidence with shard-local approximate nearest neighbor (ANN) indexes, and Tier C, a versioned skill library. Tier B enforces scope-before-routing: structured eligibility constraints mask ineligible shards before routing or ANN search. We cast shard probing as masked mixture-of-experts (MoE) routing over eligible shards, probing up to $B_\mathrm{probe}$ shards via Top-$B_\mathrm{probe}$ or adaptive Top-$P$, and use cost-aware gating over profile/observation/session shard families; the router is trained from evidence-to-shard supervision. On LoCoMo, ShardMemo improves over the strongest baseline (GAM) by +5.11 to +6.82 F1 across question categories. Under a fixed-budget routing setting ($B_\mathrm{probe}=3$), ShardMemo improves over cosine-to-prototype shard routing by +6.87 F1 while reducing retrieval work (VecScan 521 to 414, -20.5%) and p95 latency (95 to 76 ms). On long-context HotpotQA, ShardMemo achieves 63.41/61.88/57.95 F1 at 56K/224K/448K tokens. On ToolBench, Tier C reaches 0.97 Precision@3 and 1.94 StepRed (+10.2% and +7.2% over embedding-similarity retrieval).
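The scope-before-routing idea, masking ineligible shards and then probing at most a fixed budget of the remaining ones, can be sketched in a few lines. This is a simplified illustration under stated assumptions: the function name `masked_top_b`, the router scores, and the eligibility flags are all hypothetical, and the cost-aware gating and adaptive Top-P variant mentioned in the abstract are not modelled.

```python
import math


def masked_top_b(scores, eligible, b_probe):
    """Pick up to b_probe eligible shards with the highest router scores."""
    # Ineligible shards are masked to -inf so they can never be selected.
    masked = [s if ok else -math.inf for s, ok in zip(scores, eligible)]
    order = sorted(range(len(masked)), key=lambda i: masked[i], reverse=True)
    return [i for i in order[:b_probe] if masked[i] != -math.inf]


# Hypothetical router scores for five shards; shard 2 fails the eligibility check.
scores = [2.0, 0.5, 3.1, 1.2, 2.7]
eligible = [True, True, False, True, True]

chosen = masked_top_b(scores, eligible, b_probe=3)
print(chosen)
```

Shard 2 has the highest raw score but is never probed, because the mask is applied before the top-B selection.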
[NLP-69] inversedMixup: Data Augmentation via Inverting Mixed Embeddings
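[Quick read]: This paper addresses two problems in text data augmentation: samples produced by conventional Mixup are not human-interpretable, and augmentation based on large language models (LLMs) offers limited control. The core challenge is to keep Mixup's controllability in embedding space while making the generated results human-readable. The key to the solution is inversedMixup, a unified framework that uses a three-stage training procedure to align the output embedding space of a task-specific model with the input embedding space of an LLM, so that mixed embeddings with a controllable mixing ratio can be reconstructed into human-interpretable sentences. The paper also provides the first empirical evidence of the manifold intrusion phenomenon in text Mixup and proposes a simple mitigation strategy.
Link: https://arxiv.org/abs/2601.21543
Authors: Fanshuang Kong,Richong Zhang,Qiyu Sun,Zhijie Nie,Ting Deng,Chunming Hu
Affiliations: Beihang University
Categories: Computation and Language (cs.CL)
Comments: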
Abstract:Mixup generates augmented samples by linearly interpolating inputs and labels with a controllable ratio. However, since it operates in the latent embedding level, the resulting samples are not human-interpretable. In contrast, LLM-based augmentation methods produce sentences via prompts at the token level, yielding readable outputs but offering limited control over the generation process. Inspired by recent advances in LLM inversion, which reconstructs natural language from embeddings and helps bridge the gap between latent embedding space and discrete token space, we propose inversedMixup, a unified framework that combines the controllability of Mixup with the interpretability of LLM-based generation. Specifically, inversedMixup adopts a three-stage training procedure to align the output embedding space of a task-specific model with the input embedding space of an LLM. Upon successful alignment, inversedMixup can reconstruct mixed embeddings with a controllable mixing ratio into human-interpretable augmented sentences, thereby improving the augmentation performance. Additionally, inversedMixup provides the first empirical evidence of the manifold intrusion phenomenon in text Mixup and introduces a simple yet effective strategy to mitigate it. Extensive experiments demonstrate the effectiveness and generalizability of our approach in both few-shot and fully supervised scenarios.
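The basic Mixup operation that the abstract starts from, linear interpolation of inputs and labels with a controllable ratio, looks like this in plain Python. It is a minimal sketch: the function name `mixup`, the toy embeddings, and the one-hot labels are illustrative assumptions, and the inversion step that turns the mixed embedding back into a sentence is not shown.

```python
def mixup(x_a, y_a, x_b, y_b, lam):
    """Linearly interpolate two embeddings and their label vectors with ratio lam."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


# Toy 3-dimensional embeddings with one-hot labels for two classes.
x_mix, y_mix = mixup(
    [1.0, 0.0, 2.0], [1.0, 0.0],
    [0.0, 4.0, 2.0], [0.0, 1.0],
    lam=0.75,
)
print(x_mix, y_mix)
```

The mixed label `[0.75, 0.25]` records how much each source sample contributed, which is the controllable ratio the abstract refers to.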
[NLP-70] KAPSO: A Knowledge-grounded framework for Autonomous Program Synthesis and Optimization
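[Quick read]: This paper targets long-horizon failure modes common in coding agents, including lost experimental state, brittle debugging, and weak reuse of domain expertise. Its core solution is the KAPSO framework, which closes the optimization loop with three tightly coupled components. First, a git-native experimentation engine isolates each attempt as a branch, producing reproducible artifacts and preserving provenance across iterations. Second, a knowledge system ingests heterogeneous sources, including repositories, internal playbooks, and external resources such as documentation, papers, and web search results, and organizes them into a structured representation that supports retrieval over workflows, implementations, and environment constraints. Third, a cognitive memory layer coordinates retrieval and maintains an episodic store of reusable lessons distilled from experiment traces (run logs, diffs, and evaluator feedback), reducing repeated errors and accelerating convergence. The design treats program synthesis as an operator within a long-horizon optimization loop rather than as the endpoint, so the system iteratively improves a runnable artifact toward measurable objectives.
Link: https://arxiv.org/abs/2601.21526
Authors: Alireza Nadaf,Alireza Mohammadshahi,Majid Yazdani
Affiliations: Leeroo Team
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: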
Abstract:We introduce KAPSO, a modular framework for autonomous program synthesis and optimization. Given a natural language goal and an evaluation method, KAPSO iteratively performs ideation, code synthesis and editing, execution, evaluation, and learning to improve a runnable artifact toward measurable objectives. Rather than treating synthesis as the endpoint, KAPSO uses synthesis as an operator within a long-horizon optimization loop, where progress is defined by evaluator outcomes. KAPSO targets long-horizon failures common in coding agents, including lost experimental state, brittle debugging, and weak reuse of domain expertise, by integrating three tightly coupled components. First, a git-native experimentation engine isolates each attempt as a branch, producing reproducible artifacts and preserving provenance across iterations. Second, a knowledge system ingests heterogeneous sources, including repositories, internal playbooks, and curated external resources such as documentation, scientific papers, and web search results, and organizes them into a structured representation that supports retrieval over workflows, implementations, and environment constraints. Third, a cognitive memory layer coordinates retrieval and maintains an episodic store of reusable lessons distilled from experiment traces (run logs, diffs, and evaluator feedback), reducing repeated error modes and accelerating convergence. We evaluated KAPSO on MLE-Bench (Kaggle-style ML competitions) and ALE-Bench (AtCoder heuristic optimization), and report end-to-end performance. 
Code available at: this https URL
[NLP-71] LMK CLS: Landmark Pooling for Dense Embeddings
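[Quick read]: This paper addresses systematic weaknesses of the pooling strategies used by sequence encoders: the [CLS] token tends to concentrate information toward the initial positions of the sequence and can under-represent distributed evidence, while mean pooling can dilute salient local signals and sometimes hurts short-context performance. The key to the solution is Landmark (LMK) pooling, which partitions the input sequence into chunks, inserts landmark tokens between chunks, and forms the final representation by mean-pooling the landmark token embeddings. The method improves long-context extrapolation without sacrificing local salient features, at the cost of a small number of additional special tokens.
Link: https://arxiv.org/abs/2601.21525
Authors: Meet Doshi,Aashka Trivedi,Vishwajeet Kumar,Parul Awasthy,Yulong Li,Jaydeep Sen,Radu Florian,Sachindra Joshi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: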
Abstract:Representation learning is central to many downstream tasks such as search, clustering, classification, and reranking. State-of-the-art sequence encoders typically collapse a variable-length token sequence to a single vector using a pooling operator, most commonly a special [CLS] token or mean pooling over token embeddings. In this paper, we identify systematic weaknesses of these pooling strategies: [CLS] tends to concentrate information toward the initial positions of the sequence and can under-represent distributed evidence, while mean pooling can dilute salient local signals, sometimes leading to worse short-context performance. To address these issues, we introduce Landmark (LMK) pooling, which partitions a sequence into chunks, inserts landmark tokens between chunks, and forms the final representation by mean-pooling the landmark token embeddings. This simple mechanism improves long-context extrapolation without sacrificing local salient features, at the cost of introducing a small number of special tokens. We empirically demonstrate that LMK pooling matches existing methods on short-context retrieval tasks and yields substantial improvements on long-context tasks, making it a practical and scalable alternative to existing pooling methods.
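The three steps named in the abstract (chunk the sequence, insert landmark tokens, mean-pool the landmark embeddings) can be sketched as follows. Several details are assumptions made for illustration: the `[LMK]` token string, the choice to append a landmark after every chunk including the last one, and the toy embeddings that stand in for real encoder outputs.

```python
LMK = "[LMK]"  # hypothetical landmark token string


def insert_landmarks(tokens, chunk_size):
    """Split tokens into chunks and append a landmark token after each chunk."""
    out = []
    for start in range(0, len(tokens), chunk_size):
        out.extend(tokens[start:start + chunk_size])
        out.append(LMK)
    return out


def landmark_pool(tokens, embeddings):
    """Mean-pool the embeddings found at landmark positions."""
    picked = [e for t, e in zip(tokens, embeddings) if t == LMK]
    if not picked:
        raise ValueError("no landmark tokens found")
    dim = len(picked[0])
    return [sum(e[d] for e in picked) / len(picked) for d in range(dim)]


tokens = insert_landmarks(["a", "b", "c", "d", "e"], chunk_size=2)

# Toy embeddings standing in for encoder outputs: [position, 1.0] per token.
embeddings = [[float(i), 1.0] for i in range(len(tokens))]

pooled = landmark_pool(tokens, embeddings)
print(tokens, pooled)
```

Only the embeddings at landmark positions contribute to the pooled vector, so the number of pooled vectors grows with the number of chunks and not with the number of tokens.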
[NLP-72] MURAD: A Large-Scale Multi-Domain Unified Reverse Arabic Dictionary Dataset
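[Quick read]: This paper addresses the shortage of large-scale, high-quality lexical datasets that link Arabic words to precise definitions across domains such as science, religion, and literature. The key to the solution is MURAD (Multi-domain Unified Reverse Arabic Dictionary), an open lexical resource with 96,243 word-definition pairs drawn from trusted reference works and educational sources. A hybrid extraction pipeline that combines direct text parsing, optical character recognition (OCR), and automated reconstruction is used to ensure accuracy and clarity, and each record carries metadata identifying its source domain, supporting computational linguistics, semantic retrieval, and educational tools.
Link: https://arxiv.org/abs/2601.21512
Authors: Serry Sibaee,Yasser Alhabashi,Nadia Sibai,Yara Farouk,Adel Ammar,Sawsan AlHalawani,Wadii Boulila
Affiliations: Prince Sultan University
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Databases (cs.DB); Information Retrieval (cs.IR)
Comments: 18 pages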
Abstract:Arabic is a linguistically and culturally rich language with a vast vocabulary that spans scientific, religious, and literary domains. Yet, large-scale lexical datasets linking Arabic words to precise definitions remain limited. We present MURAD (Multi-domain Unified Reverse Arabic Dictionary), an open lexical dataset with 96,243 word-definition pairs. The data come from trusted reference works and educational sources. Extraction used a hybrid pipeline integrating direct text parsing, optical character recognition, and automated reconstruction. This ensures accuracy and clarity. Each record aligns a target word with its standardized Arabic definition and metadata that identifies the source domain. The dataset covers terms from linguistics, Islamic studies, mathematics, physics, psychology, and engineering. It supports computational linguistics and lexicographic research. Applications include reverse dictionary modeling, semantic retrieval, and educational tools. By releasing this resource, we aim to advance Arabic natural language processing and promote reproducible research on Arabic lexical semantics.
[NLP-73] The Effectiveness of Style Vectors for Steering Large Language Models: A Human Evaluation
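[Quick read]: This paper addresses the control of Large Language Model (LLM) behavior at inference time, so that outputs align with human abilities and safety requirements. The core challenge is to adjust the emotional tone of generated text efficiently and controllably without relying on prompt engineering or fine-tuning. The solution uses activation steering, which guides generation by directly modifying the model's internal activations. The study presents the first human evaluation of this approach for emotional tone (190 participants, over 7,000 crowd-sourced ratings) and finds that moderate steering strengths ($\lambda \approx 0.15$) reliably amplify target emotions, most strongly disgust and fear, while preserving comprehensibility. Human and model-based quality ratings are strongly aligned (mean $r = 0.776$), indicating that automatic scoring can serve as a proxy for perceived quality.
Link: https://arxiv.org/abs/2601.21505
Authors: Diaoulé Diallo,Katharina Dworatzyk,Sophie Jentzsch,Peer Schütt,Sabine Theis,Tobias Hecking
Affiliations: German Aerospace Center (DLR)
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: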
Abstract:Controlling the behavior of large language models (LLMs) at inference time is essential for aligning outputs with human abilities and safety requirements. Activation steering provides a lightweight alternative to prompt engineering and fine-tuning by directly modifying internal activations to guide generation. This research advances the literature in three significant directions. First, while previous work demonstrated the technical feasibility of steering emotional tone using automated classifiers, this paper presents the first human evaluation of activation steering concerning the emotional tone of LLM outputs, collecting over 7,000 crowd-sourced ratings from 190 participants via Prolific ($n=190$). These ratings assess both perceived emotional intensity and overall text quality. Second, we find strong alignment between human and model-based quality ratings (mean $r=0.776$, range 0.157–0.985), indicating automatic scoring can proxy perceived quality. Moderate steering strengths ($\lambda \approx 0.15$) reliably amplify target emotions while preserving comprehensibility, with the strongest effects for disgust ($\eta_p^2 = 0.616$) and fear ($\eta_p^2 = 0.540$), and minimal effects for surprise ($\eta_p^2 = 0.042$). Finally, upgrading from Alpaca to LlaMA-3 yielded more consistent steering with significant effects across emotions and strengths (all $p < 0.001$). Inter-rater reliability was high (ICC = 0.71–0.87), underscoring the robustness of the findings. These findings support activation-based control as a scalable method for steering LLM behavior across affective dimensions.
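Activation steering, as described here, modifies an internal activation by shifting it along a direction associated with the target style. A common way to write this is $h' = h + \lambda v$; the sketch below assumes that additive form, which the abstract does not spell out. The function name `steer`, the vectors, and the value $\lambda = 0.25$ (chosen only so the arithmetic is easy to read, not taken from the paper) are all illustrative.

```python
def steer(activation, style_vector, lam):
    """Shift a hidden activation along a style direction scaled by lam."""
    if len(activation) != len(style_vector):
        raise ValueError("dimension mismatch")
    return [a + lam * v for a, v in zip(activation, style_vector)]


# Toy hidden state and a hypothetical style direction.
hidden = [0.5, -1.0, 2.0]
style = [2.0, 0.0, -4.0]

steered = steer(hidden, style, lam=0.25)
print(steered)
```

Setting `lam` to 0 leaves the activation unchanged, and larger values push the representation further toward the style direction, which matches the trade-off between emotional intensity and comprehensibility that the study measures.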
[NLP-74] MAR: Efficient Large Language Models via Module-aware Architecture Refinement ICASSP2026
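[Quick read]: This paper addresses the high energy cost of Large Language Model (LLM) inference caused by quadratic-complexity attention and dense Feed-Forward Network (FFN) operations. The core of the solution is Module-aware Architecture Refinement (MAR), a two-stage framework: it first integrates State Space Models (SSMs) for linear-time sequence modeling to cut the cost of attention, and then applies activation sparsification to reduce FFN cost. To mitigate the low information density and temporal mismatch that arise when integrating Spiking Neural Networks (SNNs) with SSMs, the authors design the Adaptive Ternary Multi-step Neuron (ATMN) and the Spike-aware Bidirectional Distillation Strategy (SBDS). Experiments show that MAR restores the performance of its dense counterpart under constrained resources, substantially reduces inference energy consumption, and outperforms efficient models of comparable or larger scale.
Link: https://arxiv.org/abs/2601.21503
Authors: Junhong Cai,Guiqin Wang,Kejie Zhao,Jianxiong Tang,Xiang Wang,Luziwei Leng,Ran Cheng,Yuxin Ma,Qinghai Guo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: Accepted by ICASSP 2026. 5 pages, 5 figures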
Abstract:Large Language Models (LLMs) excel across diverse domains but suffer from high energy costs due to quadratic attention and dense Feed-Forward Network (FFN) operations. To address these issues, we propose Module-aware Architecture Refinement (MAR), a two-stage framework that integrates State Space Models (SSMs) for linear-time sequence modeling and applies activation sparsification to reduce FFN costs. In addition, to mitigate low information density and temporal mismatch in integrating Spiking Neural Networks (SNNs) with SSMs, we design the Adaptive Ternary Multi-step Neuron (ATMN) and the Spike-aware Bidirectional Distillation Strategy (SBDS). Extensive experiments demonstrate that MAR effectively restores the performance of its dense counterpart under constrained resources while substantially reducing inference energy consumption. Furthermore, it outperforms efficient models of comparable or even larger scale, underscoring its potential for building efficient and practical LLMs.
[NLP-75] The Path of Least Resistance: Guiding LLM Reasoning Trajectories with Prefix Consensus ICLR2026
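[Quick read]: This paper addresses the high compute cost of inference strategies such as Self-Consistency (SC), which fully expand all reasoning traces and therefore incur high token usage and latency. The key to the solution is PoLR (Path of Least Resistance), which exploits prefix consistency: it clusters short prefixes of reasoning traces, identifies the dominant cluster, and expands only the paths in that cluster, preserving the accuracy benefits of SC while substantially reducing token usage and latency. A theoretical analysis framed via mutual information and entropy explains why early reasoning steps carry strong signals predictive of final correctness. Empirically, PoLR matches or exceeds SC on several benchmarks, reduces token usage by up to 60% and wall-clock latency by up to 50%, is complementary to adaptive inference methods, and can serve as a drop-in pre-filter without model fine-tuning.
Link: https://arxiv.org/abs/2601.21494
Authors: Ishan Jindal,Sai Prashanth Akuthota,Jayant Taneja,Sachin Dev Sharma
Affiliations: Fujitsu Research India; Samsung R&D Institute India-Delhi
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at ICLR 2026. this https URL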
Abstract:Large language models achieve strong reasoning performance, but inference strategies such as Self-Consistency (SC) are computationally expensive, as they fully expand all reasoning traces. We introduce PoLR (Path of Least Resistance), the first inference-time method to leverage prefix consistency for compute-efficient reasoning. PoLR clusters short prefixes of reasoning traces, identifies the dominant cluster, and expands all paths in that cluster, preserving the accuracy benefits of SC while substantially reducing token usage and latency. Our theoretical analysis, framed via mutual information and entropy, explains why early reasoning steps encode strong signals predictive of final correctness. Empirically, PoLR consistently matches or exceeds SC across GSM8K, MATH500, AIME24/25, and GPQA-DIAMOND, reducing token usage by up to 60% and wall-clock latency by up to 50%. Moreover, PoLR is fully complementary to adaptive inference methods (e.g., Adaptive Consistency, Early-Stopping SC) and can serve as a drop-in pre-filter, making SC substantially more efficient and scalable without requiring model fine-tuning.
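The cluster-then-expand flow in this abstract can be illustrated with a toy example. The sketch below makes strong simplifying assumptions: prefixes are clustered by exact match after whitespace and case normalization (the paper's clustering method is not described in the abstract), `toy_expand` is a hypothetical stand-in for letting an LLM finish a reasoning trace, and the final answer is picked by majority vote over the expanded traces.

```python
from collections import Counter, defaultdict


def dominant_prefix_cluster(prefixes):
    """Group identical (normalized) prefixes and return indices of the largest group."""
    groups = defaultdict(list)
    for i, p in enumerate(prefixes):
        groups[" ".join(p.lower().split())].append(i)
    return max(groups.values(), key=len)


def polr_style_vote(prefixes, expand):
    """Expand only the traces in the dominant cluster, then majority-vote their answers."""
    keep = dominant_prefix_cluster(prefixes)
    answers = [expand(prefixes[i]) for i in keep]
    return Counter(answers).most_common(1)[0][0], len(keep)


def toy_expand(prefix):
    # Hypothetical stand-in for completing a reasoning trace with an LLM.
    return "42" if prefix.lower().startswith("let") else "17"


prefixes = [
    "Let x be the speed",
    "let x  be the speed",
    "Assume 3 trains",
    "Let x be the speed",
    "Try brute force",
]

answer, expanded = polr_style_vote(prefixes, toy_expand)
print(answer, expanded)
```

Only three of the five traces are expanded in this toy run, which is where the token and latency savings would come from when full expansion is expensive.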
[NLP-76] DimStance: Multilingual Datasets for Dimensional Stance Analysis
[Quick read]: This paper addresses the limitation that conventional stance detection relies only on discrete category labels (such as Favor, Neutral, and Against), which cannot capture nuanced affective differences. The core challenge is to model the valence and arousal dimensions underlying stance expressions to enable fine-grained, affect-aware stance analysis. The key to the solution is DimStance, the first stance resource with valence-arousal (VA) annotations, comprising 11,746 target aspects in 7,365 texts across five languages and two domains. The authors formulate a dimensional stance regression task to evaluate VA prediction and benchmark pretrained and large language models under regression and prompting settings, providing a foundation for multilingual, emotion-aware stance analysis.
Link: https://arxiv.org/abs/2601.21483
Authors: Jonas Becker,Liang-Chih Yu,Shamsuddeen Hassan Muhammad,Jan Philip Wahle,Terry Ruas,Idris Abdulmumin,Lung-Hao Lee,Wen-Ni Liu,Tzu-Mi Lin,Zhe-Yu Xu,Ying-Lung Lin,Jin Wang,Maryam Ibrahim Mukhtar,Bela Gipp,Saif M. Mohammed
Affiliations: University of Göttingen; Yuan Ze University; Imperial College London; University of Pretoria; National Yang Ming Chiao Tung University; Yunnan University; Bayero University Kano; National Research Council Canada
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Stance detection is an established task that classifies an author’s attitude toward a specific target into categories such as Favor, Neutral, and Against. Beyond categorical stance labels, we leverage a long-established affective science framework to model stance along real-valued dimensions of valence (negative-positive) and arousal (calm-active). This dimensional approach captures nuanced affective states underlying stance expressions, enabling fine-grained stance analysis. To this end, we introduce DimStance, the first dimensional stance resource with valence-arousal (VA) annotations. This resource comprises 11,746 target aspects in 7,365 texts across five languages (English, German, Chinese, Nigerian Pidgin, and Swahili) and two domains (politics and environmental protection). To facilitate the evaluation of stance VA prediction, we formulate the dimensional stance regression task, analyze cross-lingual VA patterns, and benchmark pretrained and large language models under regression and prompting settings. Results show competitive performance of fine-tuned LLM regressors, persistent challenges in low-resource languages, and limitations of token-based generation. DimStance provides a foundation for multilingual, emotion-aware, stance analysis and benchmarking.
[NLP-77] SOUP: Token-level Single-sample Mix-policy Reinforcement Learning for Large Language Models
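[Quick read]: This paper addresses the limited exploration and early saturation that on-policy reinforcement learning (RL) methods suffer in language model post-training because of low sampling diversity, while avoiding the policy mismatch and training instability caused by existing off-policy approaches that mix entire trajectories. The key to the solution is the Single-sample Mix-policy Unified Paradigm (SOUP), which unifies off-policy and on-policy learning within individual samples at the token level: the prefix of a sequence is sampled from historical policies and carries the off-policy information, while the continuation is generated on-policy. Token-level importance ratios let SOUP exploit the off-policy data while preserving training stability, improving both exploration and final performance.
Link: https://arxiv.org/abs/2601.21476
Authors: Lei Yang,Wei Bi,Chenxi Sun,Renren Jin,Deyi Xiong
Affiliations: TJUNLP Lab, College of Intelligence and Computing, Tianjin University, Tianjin, China; Kuaishou Technology
Categories: Computation and Language (cs.CL)
Comments: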
Abstract:On-policy reinforcement learning (RL) methods widely used for language model post-training, like Group Relative Policy Optimization (GRPO), often suffer from limited exploration and early saturation due to low sampling diversity. While off-policy data can help, current approaches that mix entire trajectories cause significant policy mismatch and instability. In this work, we propose the Single-sample Mix-policy Unified Paradigm (SOUP), a framework that unifies off- and on-policy learning within individual samples at the token level. It confines off-policy influence to the prefix of a generated sequence sampled from historical policies, while the continuation is generated on-policy. Through token-level importance ratios, SOUP effectively leverages off-policy information while preserving training stability. Extensive experiments demonstrate that SOUP consistently outperforms standard on-policy training and existing off-policy extensions. Our further analysis clarifies how our fine-grained, single-sample mix-policy training can improve both exploration and final performance in LLM RL.
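A generic way to think about token-level importance ratios for a sample with an off-policy prefix and an on-policy continuation is sketched below. This is not the paper's objective: the clipping range, the choice to fix the ratio at 1 for continuation tokens, and the function name `token_ratios` are all assumptions made for illustration, and the log-probabilities are made-up numbers.

```python
import math


def token_ratios(logp_current, logp_behavior, prefix_len, clip=0.2):
    """Per-token importance ratios: clipped exp(logp diff) on the prefix, 1.0 afterwards."""
    ratios = []
    for t, (lc, lb) in enumerate(zip(logp_current, logp_behavior)):
        if t < prefix_len:
            # Off-policy prefix token: reweight by current / behavior probability.
            r = math.exp(lc - lb)
            r = min(max(r, 1.0 - clip), 1.0 + clip)
        else:
            # On-policy continuation token: sampled from the current policy.
            r = 1.0
        ratios.append(r)
    return ratios


# Hypothetical log-probabilities for a 4-token sample whose first 2 tokens are off-policy.
logp_current = [-1.0, -2.0, -0.5, -0.7]
logp_behavior = [-1.0, -0.5, -0.5, -0.7]

ratios = token_ratios(logp_current, logp_behavior, prefix_len=2)
print(ratios)
```

The second prefix token is much less likely under the current policy than under the historical one, so its raw ratio falls below the lower clip bound and is clipped, which limits how much a stale prefix can distort the update.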
[NLP-78] Topeax – An Improved Clustering Topic Model with Density Peak Detection and Lexical-Semantic Term Importance
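[Quick read]: This paper addresses two core problems in mainstream clustering topic models such as Top2Vec and BERTopic. First, they are highly sensitive to sample size and hyperparameters, which makes them unreliable at discovering natural clusters in a corpus. Second, their term-importance estimates are flawed: BERTopic ignores the semantic distance between keywords and topic vectors, while Top2Vec ignores word counts, yielding less coherent topics that contain stop words or junk words and that lack variety and trustworthiness. The key to the solution is Topeax, which determines the number of clusters from peaks in density estimates and combines lexical and semantic indices of term importance, improving keyword quality and cluster recovery while being less sensitive to changes in sample size and hyperparameters.
Link: https://arxiv.org/abs/2601.21465
Authors: Márton Kardos
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 14 pages, 6 figures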
Abstract:Text clustering is today the most popular paradigm for topic modelling, both in academia and industry. Despite clustering topic models’ apparent success, we identify a number of issues in Top2Vec and BERTopic, which remain largely unsolved. Firstly, these approaches are unreliable at discovering natural clusters in corpora, due to extreme sensitivity to sample size and hyperparameters, the default values of which result in suboptimal behaviour. Secondly, when estimating term importance, BERTopic ignores the semantic distance of keywords to topic vectors, while Top2Vec ignores word counts in the corpus. This results in, on the one hand, less coherent topics due to the presence of stop words and junk words, and lack of variety and trust on the other. In this paper, I introduce a new approach, Topeax, which discovers the number of clusters from peaks in density estimates, and combines lexical and semantic indices of term importance to gain high-quality topic keywords. Topeax is demonstrated to be better at both cluster recovery and cluster description than Top2Vec and BERTopic, while also exhibiting less erratic behaviour in response to changing sample size and hyperparameters.
[NLP-79] Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation
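[Quick read]: This paper addresses the difficulty of training Large Language Models (LLMs) on non-verifiable tasks (such as creative writing, dialogue, and ethical reasoning) that lack ground-truth labels. Existing LLM-as-Judge methods are limited by the quality of the judge: if the judge cannot recognize good solutions, it cannot provide useful training signals, and evaluation biases (such as favoring verbosity over quality) go uncorrected. The authors therefore propose meta-evaluation, that is, evaluating and improving the judging ability itself. The key to the solution is CoNL, a framework that unifies generation, evaluation, and meta-evaluation through multi-agent self-play: multiple agents sharing the same policy propose, critique, and revise solutions in structured conversations. Critiques that help others improve their solutions earn a diagnostic reward, which provides explicit supervision for meta-evaluation and enables joint optimization of generation and judging without external judges or ground truth. Experiments on five benchmarks show consistent improvements over self-rewarding baselines with stable training.
Link: https://arxiv.org/abs/2601.21464
Authors: Yuan Sui,Bryan Hooi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Work in Progress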
Abstract:Training large language models (LLMs) for non-verifiable tasks, such as creative writing, dialogue, and ethical reasoning, remains challenging due to the absence of ground-truth labels. While LLM-as-Judge approaches offer a scalable alternative to human feedback, they face a fundamental limitation: performance is constrained by the evaluator’s own quality. If the judge cannot recognize good solutions, it cannot provide useful training signals, and evaluation biases (e.g., favoring verbosity over quality) remain unaddressed. This motivates meta-evaluation: the ability to evaluate and improve the evaluator itself. We introduce CoNL, a framework that unifies generation, evaluation, and meta-evaluation through multi-agent self-play. Our key insight: critique quality can be measured by whether it helps others improve their solutions. In CoNL, multiple agents sharing the same policy engage in structured conversations to propose, critique, and revise solutions. Critiques that enable solution improvements earn a diagnostic reward, creating explicit supervision for meta-evaluation and enabling joint optimization of generation and judging capabilities through self-play, without external judges or ground truth. Experiments on five benchmarks show that CoNL achieves consistent improvements over self-rewarding baselines while maintaining stable training.
[NLP-80] Spava: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention
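[Quick read]: This paper targets the efficiency bottleneck of long-video inference caused by dense computation in the prefill stage of Large Multimodal Models (LMMs). Existing methods either compress visual embeddings or apply sparse attention on a single GPU, which yields limited acceleration or degraded performance and keeps LMMs from handling longer, more complex videos. The key to the solution is Spava, a sequence-parallel framework with optimized attention that distributes approximate attention across multiple GPUs, reducing computation and increasing parallelism so that more visual embeddings can be processed without compression, which in turn improves task performance. Combined with system-level optimizations such as load balancing and fused forward passes, Spava achieves speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, respectively, without notable performance loss.
Link: https://arxiv.org/abs/2601.21444
Authors: Yuxiang Huang,Mingye Li,Xu Han,Chaojun Xiao,Weilin Zhao,Ao Sun,Ziqi Yuan,Hao Zhou,Fandong Meng,Zhiyuan Liu
Affiliations: NLP Group, DCST, IAI, BNRIST, Tsinghua University, Beijing, China; Department of CS&T, Central South University, Changsha, China; BUPT, Beijing, China; Pattern Recognition Center, WeChat AI, Tencent Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint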
Abstract:The efficiency of long-video inference remains a critical bottleneck, mainly due to the dense computation in the prefill stage of Large Multimodal Models (LMMs). Existing methods either compress visual embeddings or apply sparse attention on a single GPU, yielding limited acceleration or degraded performance and restricting LMMs from handling longer, more complex videos. To overcome these issues, we propose Spava, a sequence-parallel framework with optimized attention that accelerates long-video inference across multiple GPUs. By distributing approximate attention, Spava reduces computation and increases parallelism, enabling efficient processing of more visual embeddings without compression and thereby improving task performance. System-level optimizations, such as load balancing and fused forward passes, further unleash the potential of Spava, delivering speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, without notable performance loss. Code available at this https URL
[NLP-81] From Consistency to Complementarity: Aligned and Disentangled Multi-modal Learning for Time Series Understanding and Reasoning
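[Quick read]: This paper addresses two core challenges that multi-modal large language models (MLLMs) face in time series understanding and reasoning: (i) fine-grained temporal misalignment across modalities, i.e., local temporal discrepancies between a numerical time series and its visualized plot that are hard to align; and (ii) severe entanglement between shared and modality-specific semantics, which hinders localized interpretation and complementary reasoning. The key to the solution is the MADI framework, which features (1) Patch-level Alignment, which enforces physically grounded fine-grained correspondence across heterogeneous modalities; (2) Discrete Disentangled Interaction, which separates modality-common semantics into compact discrete latents and adaptively combines the purified modality-unique information; and (3) Critical-token Highlighting, which emphasizes informative, query-relevant signals for robust reasoning. Experiments on synthetic and real-world benchmarks show that MADI consistently outperforms general-purpose LLMs and time-series-specialized MLLMs.
Link: https://arxiv.org/abs/2601.21436
Authors: Hang Ni,Weijia Zhang,Fei Wang,Zezhi Shao,Hao Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: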
Abstract:Advances in multi-modal large language models (MLLMs) have inspired time series understanding and reasoning tasks, that enable natural language querying over time series, producing textual analyses of complex temporal dynamics. Recent attempts hybridize numerical time series with their visualized plots, facilitating precise value reasoning and visual structure comprehension for comprehensive time series understanding of MLLMs. However, effective cross-modal integration remains challenging due to fine-grained temporal misalignment across modalities and severe entanglement between shared and modality-specific semantics, which hinder localized interpretation and complementary reasoning. To address these issues, we propose MADI, a multi-modal LLM enhanced with fine-grained alignment and disentangled interaction, featuring (1) Patch-level Alignment, which enforces physically grounded fine-grained correspondence across heterogeneous modalities, (2) Discrete Disentangled Interaction, which separates modality-common semantics into compact discrete latents and adaptively synergizes the purified modality-unique information, and (3) Critical-token Highlighting, which emphasizes informative, query-relevant signals for robust reasoning. Experiments on synthetic and real-world benchmarks show that MADI consistently outperforms general-purpose LLMs and time-series-specialized MLLMs.
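[NLP-82] System 1&2 Synergy via Dynamic Model Interpolation
[Quick read]: This paper addresses the interference between cognitive modes that arises when a unified language model must adapt between intuitive (System 1) and deliberative (System 2) thinking. Prior work focuses on output control (e.g., output length), but the authors argue that output length is only a symptom of the model's internal cognitive configuration. The key to the solution is a shift from output control to capability control, i.e., modulating how the model thinks rather than what it produces. Concretely, the authors propose DAMI (Dynamic Model Interpolation), which dynamically interpolates the parameters of existing Instruct and Thinking checkpoints without additional training and estimates a query-specific reasoning intensity λ(q) to set cognitive depth. For training-based estimation, a preference learning method encodes accuracy and efficiency criteria; for zero-shot deployment, a confidence-based method leverages the cognitive discrepancy between the two models. Experiments on five mathematical reasoning benchmarks show that DAMI achieves higher accuracy than the Thinking model while remaining efficient.
Link: https://arxiv.org/abs/2601.21414
Authors: Chenxu Yang,Qingyi Si,Chong Tian,Xiyu Liu,Dingyu Yao,Chuanyu Qin,Zheng Lin,Weiping Wang,Jiaqi Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: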
Abstract:Training a unified language model that adapts between intuitive System 1 and deliberative System 2 remains challenging due to interference between their cognitive modes. Recent studies have thus pursued making System 2 models more efficient. However, these approaches focused on output control, limiting what models produce. We argue that this paradigm is misaligned: output length is merely a symptom of the model’s cognitive configuration, not the root cause. In this work, we shift the focus to capability control, which modulates how models think rather than what they produce. To realize this, we leverage existing Instruct and Thinking checkpoints through dynamic parameter interpolation, without additional training. Our pilot study establishes that linear interpolation yields a convex, monotonic Pareto frontier, underpinned by representation continuity and structural connectivity. Building on this, we propose DAMI (DynAmic Model Interpolation), a framework that estimates a query-specific Reasoning Intensity λ(q) to configure cognitive depth. For training-based estimation, we develop a preference learning method encoding accuracy and efficiency criteria. For zero-shot deployment, we introduce a confidence-based method leveraging inter-model cognitive discrepancy. Experiments on five mathematical reasoning benchmarks demonstrate that DAMI achieves higher accuracy than the Thinking model while remaining efficient, effectively combining the efficiency of System 1 with the reasoning depth of System 2.
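The linear parameter interpolation mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: both checkpoints are assumed to share parameter names and shapes, weights are represented as plain Python lists, and the interpolation weight is passed in directly rather than estimated per query. The function name `interpolate` and the convention that 0 means pure Instruct and 1 means pure Thinking are hypothetical.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def interpolate(instruct: Params, thinking: Params, lam: float) -> Params:
    """Return (1 - lam) * instruct + lam * thinking, parameter by parameter.

    lam = 0 reproduces the Instruct weights, lam = 1 the Thinking weights
    (an assumed convention for this sketch).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if instruct.keys() != thinking.keys():
        raise ValueError("checkpoints must share parameter names")
    merged: Params = {}
    for name, w_instruct in instruct.items():
        w_thinking = thinking[name]
        if len(w_instruct) != len(w_thinking):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [
            (1.0 - lam) * a + lam * b for a, b in zip(w_instruct, w_thinking)
        ]
    return merged


# Toy checkpoints with made-up values, used only to exercise the function.
instruct = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
thinking = {"layer.weight": [4.0, 6.0], "layer.bias": [3.0]}
blended = interpolate(instruct, thinking, 0.25)
```

In a DAMI-style setup the value passed as `lam` would come from a per-query estimator; here it is a fixed number so that the arithmetic stays easy to follow.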
[NLP-83] User-Centric Evidence Ranking for Attribution and Fact Verification EACL2026
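[Quick read]: This paper addresses inefficient and error-prone evidence presentation in information verification: automated systems and Large Language Models (LLMs) that support or refute claims often give users either insufficient or overly redundant evidence. The key to the solution is a new task, Evidence Ranking, which presents sufficient information as early as possible in a ranked list, minimizing reading effort while keeping all evidence available for sequential verification. The paper compares one-shot and incremental ranking, introduces an evaluation framework inspired by information retrieval metrics, and builds a unified benchmark from existing fact verification datasets. Experiments show that incremental ranking better captures complementary evidence and that LLM-based methods outperform shallower baselines, although balancing sufficiency and redundancy remains difficult.
Link: https://arxiv.org/abs/2601.21387
Authors: Guy Alt,Eran Hirsch,Serwar Basch,Ido Dagan,Oren Glickman
Affiliations: Bar-Ilan University; TU Darmstadt
Categories: Computation and Language (cs.CL)
Comments: EACL 2026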
Abstract:Attribution and fact verification are critical challenges in natural language processing for assessing information reliability. While automated systems and Large Language Models (LLMs) aim to retrieve and select concise evidence to support or refute claims, they often present users with either insufficient or overly redundant information, leading to inefficient and error-prone verification. To address this, we propose Evidence Ranking, a novel task that prioritizes presenting sufficient information as early as possible in a ranked list. This minimizes user reading effort while still making all available evidence accessible for sequential verification. We compare two approaches for the new ranking task: one-shot ranking and incremental ranking. We introduce a new evaluation framework, inspired by information retrieval metrics, and construct a unified benchmark by aggregating existing fact verification datasets. Extensive experiments with diverse models show that incremental ranking strategies better capture complementary evidence and that LLM-based methods outperform shallower baselines, while still facing challenges in balancing sufficiency and redundancy. Compared to evidence selection, we conduct a controlled user study and demonstrate that evidence ranking both reduces reading effort and improves verification. This work provides a foundational step toward more interpretable, efficient, and user-aligned information verification systems.
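The difference between one-shot and incremental ranking can be illustrated with a toy example. This sketch is not the paper's method: it assumes a deliberately crude relevance score (word overlap with the claim), and the function names and example sentences are hypothetical. One-shot ranking scores each evidence item independently, whereas incremental ranking re-scores the remaining items after each pick so that only claim words not yet covered count.

```python
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lower-cased whitespace tokens; a stand-in for a real relevance model."""
    return set(text.lower().split())


def one_shot_rank(claim: str, evidence: List[str]) -> List[str]:
    """Score every item once against the claim and sort by that score."""
    claim_tokens = tokens(claim)
    return sorted(evidence, key=lambda e: len(claim_tokens & tokens(e)), reverse=True)


def incremental_rank(claim: str, evidence: List[str]) -> List[str]:
    """Greedy ranking: each step picks the item covering the most uncovered claim words."""
    remaining = list(evidence)
    uncovered = tokens(claim)
    ranked: List[str] = []
    while remaining:
        best = max(remaining, key=lambda e: len(uncovered & tokens(e)))
        ranked.append(best)
        remaining.remove(best)
        uncovered -= tokens(best)
    return ranked


claim = "paris is the capital of france"
e1 = "paris is the capital"
e2 = "paris is the largest city"
e3 = "france borders spain"
evidence = [e1, e2, e3]

one_shot = one_shot_rank(claim, evidence)
incremental = incremental_rank(claim, evidence)
```

In this toy case the one-shot ranking places the redundant item `e2` second because it overlaps heavily with the claim on its own, while the incremental ranking promotes `e3`, the only item that still adds uncovered claim words after `e1` has been chosen.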
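[NLP-84] The Compliance Paradox: Semantic-Instruction Decoupling in Automated Academic Code Evaluation
[Quick read]: This paper challenges the unverified assumption that instruction-following capability lets Large Language Models (LLMs) judge code objectively in educational assessment. The authors find that models often decouple from the submission's logic to satisfy hidden directives and therefore mis-grade functionally broken code, a systemic vulnerability they call the Compliance Paradox. To expose it, they introduce the Semantic-Preserving Adversarial Code Injection (SPACI) Framework and the Abstract Syntax Tree-Aware Semantic Injection Protocol (AST-ASIP), which embed adversarial directives into syntactically inert regions (trivia nodes) of the Abstract Syntax Tree (AST). The evaluation reveals catastrophic failure rates (95%) in high-capacity open-weights models such as DeepSeek-V3, which prioritize hidden formatting constraints over code correctness, and a tripartite framework (Decoupling Probability, Score Divergence, Pedagogical Severity) quantifies the resulting "False Certification". The authors call for a shift from standard RLHF (reinforcement learning from human feedback) toward domain-specific Adjudicative Robustness, where models prioritize evidence over instruction compliance.
Link: https://arxiv.org/abs/2601.21360
Authors: Devanshu Sahoo,Manish Prasad,Vasudev Majhi,Arjun Neekhra,Yash Sinha,Murari Mandal,Vinay Chamola,Dhruv Kumar
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: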
Abstract:The rapid integration of Large Language Models (LLMs) into educational assessment rests on the unverified assumption that instruction following capability translates directly to objective adjudication. We demonstrate that this assumption is fundamentally flawed. Instead of evaluating code quality, models frequently decouple from the submission’s logic to satisfy hidden directives, a systemic vulnerability we term the Compliance Paradox, where models fine-tuned for extreme helpfulness are vulnerable to adversarial manipulation. To expose this, we introduce the Semantic-Preserving Adversarial Code Injection (SPACI) Framework and the Abstract Syntax Tree-Aware Semantic Injection Protocol (AST-ASIP). These methods exploit the Syntax-Semantics Gap by embedding adversarial directives into syntactically inert regions (trivia nodes) of the Abstract Syntax Tree. Through a large-scale evaluation of 9 SOTA models across 25,000 submissions in Python, C, C++, and Java, we reveal catastrophic failure rates (95%) in high-capacity open-weights models like DeepSeek-V3, which systematically prioritize hidden formatting constraints over code correctness. We quantify this failure using our novel tripartite framework measuring Decoupling Probability, Score Divergence, and Pedagogical Severity to demonstrate the widespread “False Certification” of functionally broken code. Our findings suggest that current alignment paradigms create a “Trojan” vulnerability in automated grading, necessitating a shift from standard RLHF toward domain-specific Adjudicative Robustness, where models are conditioned to prioritize evidence over instruction compliance. We release our complete dataset and injection framework to facilitate further research on the topic.
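What "syntactically inert" means can be checked with Python's standard `ast` module, which discards comments when parsing. The sketch below is an illustration only and does not reproduce SPACI or AST-ASIP: the snippets, the placeholder comment text, and the helper name `same_ast` are hypothetical. It shows that two submissions differing only in a comment have identical syntax trees, so the comment cannot change program behavior even though a grader model reading the raw text still sees it.

```python
import ast

# A deliberately buggy submission: it subtracts instead of adding.
clean = "def add(a, b):\n    return a - b\n"

# Same code plus a comment; the text is a harmless placeholder.
annotated = (
    "def add(a, b):\n"
    "    # placeholder text that a grader model might read\n"
    "    return a - b\n"
)

# A genuinely different program, for contrast.
fixed = "def add(a, b):\n    return a + b\n"


def same_ast(source_a: str, source_b: str) -> bool:
    """True when both sources parse to the same tree (positions ignored)."""
    return ast.dump(ast.parse(source_a)) == ast.dump(ast.parse(source_b))


comment_is_inert = same_ast(clean, annotated)
logic_change_detected = not same_ast(clean, fixed)
```

Note that docstrings are string expressions and do appear in the tree, so they are not inert in this sense; only comments and whitespace are dropped by the parser.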
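[NLP-85] Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization
[Quick read]: This paper addresses the computational cost and reasoning path collapse that Chain-of-Thought (CoT) incurs in Large Language Models (LLMs) when reasoning is grounded in discrete token spaces. Existing latent reasoning methods improve efficiency but typically act as opaque end-to-end mappings and often require a pre-defined number of latent steps. The key to the solution is PLaT (Planning with Latent Thoughts), which decouples reasoning from verbalization: reasoning is modeled as a deterministic trajectory of latent planning states, and a separate decoder grounds these thoughts into text only when necessary. The decoupling lets the model decide dynamically when to stop reasoning instead of relying on fixed hyperparameters. On mathematical benchmarks, PLaT achieves lower greedy accuracy than baselines but scales better in terms of reasoning diversity.
Link: https://arxiv.org/abs/2601.21358
Authors: Jiecong Wang,Hao Peng,Chunyang Liu
Affiliations: Beihang University; Didi Chuxing
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: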
Abstract:Chain-of-Thought (CoT) empowers Large Language Models (LLMs) to tackle complex problems, but remains constrained by the computational cost and reasoning path collapse when grounded in discrete token spaces. Recent latent reasoning approaches attempt to optimize efficiency by performing reasoning within continuous hidden states. However, these methods typically operate as opaque end-to-end mappings from explicit reasoning steps to latent states, and often require a pre-defined number of latent steps during inference. In this work, we introduce PLaT (Planning with Latent Thoughts), a framework that reformulates latent reasoning as planning by fundamentally decoupling reasoning from verbalization. We model reasoning as a deterministic trajectory of latent planning states, while a separate Decoder grounds these thoughts into text when necessary. This decoupling allows the model to dynamically determine when to terminate reasoning rather than relying on fixed hyperparameters. Empirical results on mathematical benchmarks reveal a distinct trade-off: while PLaT achieves lower greedy accuracy than baselines, it demonstrates superior scalability in terms of reasoning diversity. This indicates that PLaT learns a robust, broader solution space, offering a transparent and scalable foundation for inference-time search.
[NLP-86] Self-Improving Pretraining: using post-trained models to pretrain better models
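[Quick read]: This paper addresses shortcomings in the safety, factuality, and overall quality of large language model (LLM) generations, focusing on the fact that behaviors learned during pretraining cannot be fully corrected by later fine-tuning. The key to the solution is a new pretraining method that streams documents and uses reinforcement learning (RL) to improve the next K generated tokens at each step. A strong post-trained model judges candidate generations (model rollouts, the original suffix, and a rewritten suffix) for quality, safety, and factuality. Early in training the process relies on the original and rewritten suffixes; as the model improves, RL rewards high-quality rollouts. In experiments the method gives 36.2% and 18.5% relative improvements over standard pretraining in factuality and safety, and up to 86.3% win rate improvements in overall generation quality.
Link: https://arxiv.org/abs/2601.21343
Authors: Ellen Xiaoqing Tan,Shehzaad Dhuliawala,Jing Xu,Ping Yu,Sainbayar Sukhbaatar,Jason Weston,Olga Golovneva
Affiliations: FAIR at Meta
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: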
Abstract:Ensuring safety, factuality and overall quality in the generations of large language models is a critical challenge, especially as these models are increasingly deployed in real-world applications. The prevailing approach to addressing these issues involves collecting expensive, carefully curated datasets and applying multiple stages of fine-tuning and alignment. However, even this complex pipeline cannot guarantee the correction of patterns learned during pretraining. Therefore, addressing these issues during pretraining is crucial, as it shapes a model’s core behaviors and prevents unsafe or hallucinated outputs from becoming deeply embedded. To tackle this issue, we introduce a new pretraining method that streams documents and uses reinforcement learning (RL) to improve the next K generated tokens at each step. A strong, post-trained model judges candidate generations – including model rollouts, the original suffix, and a rewritten suffix – for quality, safety, and factuality. Early in training, the process relies on the original and rewritten suffixes; as the model improves, RL rewards high-quality rollouts. This approach builds higher quality, safer, and more factual models from the ground up. In experiments, our method gives 36.2% and 18.5% relative improvements over standard pretraining in terms of factuality and safety, and up to 86.3% win rate improvements in overall generation quality.
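[NLP-87] Qwen3-ASR Technical Report
[Quick read]: This paper addresses challenges in automatic speech recognition (ASR) around multilingual support, the balance between accuracy and efficiency, and the precision of speech-text alignment. The key to the solution is the Qwen3-ASR family: two all-in-one ASR models (Qwen3-ASR-1.7B and Qwen3-ASR-0.6B) and an LLM-based non-autoregressive (NAR) forced alignment model (Qwen3-ForcedAligner-0.6B). The 1.7B model achieves state-of-the-art (SOTA) performance among open-source ASR models and is competitive with the strongest proprietary APIs; the 0.6B model offers the best accuracy-efficiency trade-off, with an average TTFT (Time To First Token) as low as 92 ms and 2000 seconds of speech transcribed per second at a concurrency of 128; the NAR aligner covers 11 languages and outperforms the three strongest forced alignment models in timestamp accuracy, with further advantages in efficiency and versatility.
Link: https://arxiv.org/abs/2601.21337
Authors: Xian Shi,Xiong Wang,Zhifang Guo,Yongqi Wang,Pei Zhang,Xinyu Zhang,Zishan Guo,Hongkun Hao,Yu Xi,Baosong Yang,Jin Xu,Jingren Zhou,Junyang Lin
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: this https URL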
Abstract:In this report, we introduce Qwen3-ASR family, which includes two powerful all-in-one speech recognition models and a novel non-autoregressive speech forced alignment model. Qwen3-ASR-1.7B and Qwen3-ASR-0.6B are ASR models that support language identification and ASR for 52 languages and dialects. Both of them leverage large-scale speech training data and the strong audio understanding ability of their foundation model Qwen3-Omni. We conduct comprehensive internal evaluation besides the open-sourced benchmarks as ASR models might differ little on open-sourced benchmark scores but exhibit significant quality differences in real-world scenarios. The experiments reveal that the 1.7B version achieves SOTA performance among open-sourced ASR models and is competitive with the strongest proprietary APIs while the 0.6B version offers the best accuracy-efficiency trade-off. Qwen3-ASR-0.6B can achieve an average TTFT as low as 92ms and transcribe 2000 seconds speech in 1 second at a concurrency of 128. Qwen3-ForcedAligner-0.6B is an LLM based NAR timestamp predictor that is able to align text-speech pairs in 11 languages. Timestamp accuracy experiments show that the proposed model outperforms the three strongest force alignment models and takes more advantages in efficiency and versatility. To further accelerate the community research of ASR and audio understanding, we release these models under the Apache 2.0 license.
[NLP-88] GeoRC: A Benchmark for Geolocation Reasoning Chains
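[Quick read]: This paper addresses the fact that Vision Language Models (VLMs) predict the global location of an image accurately but explain their predictions poorly: their reasoning chains frequently hallucinate scene attributes to support a location prediction, whether or not that prediction is correct. The key to the solution is the first benchmark for geolocation reasoning chains. Working with expert GeoGuessr players, including the reigning world champion, the authors produce 800 ground-truth reasoning chains for 500 query scenes, covering hundreds of discriminative visual attributes such as license plate shape, architecture, and soil properties. They evaluate LLM-as-a-judge and VLM-as-a-judge strategies for scoring VLM-generated chains and find that Qwen 3 as an LLM judge correlates best with human scoring. The benchmark shows that large closed-source VLMs such as Gemini and GPT 5 rival human experts at predicting locations but still lag behind them at producing auditable reasoning chains, while open-weights VLMs such as Llama and Qwen perform only slightly better than a baseline in which an LLM knows the photo location but has no visual information. The authors attribute the gap to VLM limitations in extracting fine-grained visual attributes from high-resolution images.
Link: https://arxiv.org/abs/2601.21278
Authors: Mohit Talreja,Joshua Diao,Jim Thannikary James,Radu Casapu,Tejas Santanam,Ethan Mendes,Alan Ritter,Wei Xu,James Hays
Affiliations: Georgia Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: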
Abstract:Vision Language Models (VLMs) are good at recognizing the global location of a photograph – their geolocation prediction accuracy rivals the best human experts. But many VLMs are startlingly bad at explaining which image evidence led to their prediction, even when their location prediction is correct. The reasoning chains produced by VLMs frequently hallucinate scene attributes to support their location prediction (e.g. phantom writing, imagined infrastructure, misidentified flora). In this paper, we introduce the first benchmark for geolocation reasoning chains. We focus on the global location prediction task in the popular GeoGuessr game which draws from Google Street View spanning more than 100 countries. We collaborate with expert GeoGuessr players, including the reigning world champion, to produce 800 ground truth reasoning chains for 500 query scenes. These expert reasoning chains address hundreds of different discriminative visual attributes such as license plate shape, architecture, and soil properties to name just a few. We evaluate LLM-as-a-judge and VLM-as-a-judge strategies for scoring VLM-generated reasoning chains against our expert reasoning chains and find that Qwen 3 LLM-as-a-judge correlates best with human scoring. Our benchmark reveals that while large, closed-source VLMs such as Gemini and GPT 5 rival human experts at prediction locations, they still lag behind human experts when it comes to producing auditable reasoning chains. Open weights VLMs such as Llama and Qwen catastrophically fail on our benchmark – they perform only slightly better than a baseline in which an LLM hallucinates a reasoning chain with oracle knowledge of the photo location but no visual information at all. We believe the gap between human experts and VLMs on this task points to VLM limitations at extracting fine-grained visual attributes from high resolution images.
[NLP-89] CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding
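[Quick read]: This paper addresses the storage overhead that limits Multimodal Large Language Models (MLLMs) in Visual Document Retrieval (VDR): representing each page with thousands of visual tokens is impractical in real-world applications. The key to the solution is CausalEmbed, an auto-regressive generation approach for constructing multi-vector embeddings that incorporates an iterative margin loss during contrastive training, encouraging the embedding model to learn compact, well-structured representations. The method performs VDR with only dozens of visual tokens, a 30-155x reduction in token count, while maintaining highly competitive performance across backbones and benchmarks, and it supports a flexible test-time scaling strategy.
Link: https://arxiv.org/abs/2601.21262
Authors: Jiahao Huo,Yu Huang,Yibo Yan,Ye Pan,Yi Cao,Mingdong Ou,Philip S. Yu,Xuming Hu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Under review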
Abstract:Although Multimodal Large Language Models (MLLMs) have shown remarkable potential in Visual Document Retrieval (VDR) through generating high-quality multi-vector embeddings, the substantial storage overhead caused by representing a page with thousands of visual tokens limits their practicality in real-world applications. To address this challenge, we propose an auto-regressive generation approach, CausalEmbed, for constructing multi-vector embeddings. By incorporating iterative margin loss during contrastive training, CausalEmbed encourages the embedding models to learn compact and well-structured representations. Our method enables efficient VDR tasks using only dozens of visual tokens, achieving a 30-155x reduction in token count while maintaining highly competitive performance across various backbones and benchmarks. Theoretical analysis and empirical results demonstrate the unique advantages of auto-regressive embedding generation in terms of training efficiency and scalability at test time. As a result, CausalEmbed introduces a flexible test-time scaling strategy for multi-vector VDR representations and sheds light on the generative paradigm within multimodal document retrieval.
[NLP-90] MoCo: A One-Stop Shop for Model Collaboration Research
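[Quick read]: This paper addresses the lack of consolidation and rigorous comparison in research on model collaboration, where work is scattered across research communities. The key to the solution is MoCo, a one-stop Python library for executing, benchmarking, and comparing model collaboration algorithms at scale. MoCo features 26 collaboration methods spanning several levels of cross-model information exchange (routing, text, logit, and model parameters) and integrates 25 evaluation datasets covering reasoning, QA, code, safety, and more. Experiments show that most collaboration strategies outperform models without collaboration in 61.0% of (model, data) settings on average, with the most effective methods outperforming by up to 25.8%. The paper also analyzes how collaboration strategies scale, compares training and inference efficiency, and highlights that collaborative systems solve problems on which single LMs struggle.
Link: https://arxiv.org/abs/2601.21257
Authors: Shangbin Feng,Yuyang Bai,Ziyuan Yang,Yike Wang,Zhaoxuan Tan,Jiajie Yan,Zhenyu Lei,Wenxuan Ding,Weijia Shi,Haojin Wang,Zhenting Qi,Yuru Jiang,Heng Wang,Chengsong Huang,Yu Fei,Jihan Yao,Yilun Du,Luke Zettlemoyer,Yejin Choi,Yulia Tsvetkov
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Moco is available at this https URL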
Abstract:Advancing beyond single monolithic language models (LMs), recent research increasingly recognizes the importance of model collaboration, where multiple LMs collaborate, compose, and complement each other. Existing research on this topic has mostly been disparate and disconnected, from different research communities, and lacks rigorous comparison. To consolidate existing research and establish model collaboration as a school of thought, we present MoCo: a one-stop Python library of executing, benchmarking, and comparing model collaboration algorithms at scale. MoCo features 26 model collaboration methods, spanning diverse levels of cross-model information exchange such as routing, text, logit, and model parameters. MoCo integrates 25 evaluation datasets spanning reasoning, QA, code, safety, and more, while users could flexibly bring their own data. Extensive experiments with MoCo demonstrate that most collaboration strategies outperform models without collaboration in 61.0% of (model, data) settings on average, with the most effective methods outperforming by up to 25.8%. We further analyze the scaling of model collaboration strategies, the training/inference efficiency of diverse methods, highlight that the collaborative system solves problems where single LMs struggle, and discuss future work in model collaboration, all made possible by MoCo. We envision MoCo as a valuable toolkit to facilitate and turbocharge the quest for an open, modular, decentralized, and collaborative AI future.
[NLP-91] Less Noise More Voice: Reinforcement Learning for Reasoning via Instruction Purification
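[Quick read]: This paper addresses inefficient exploration in Reinforcement Learning with Verifiable Rewards (RLVR) under limited rollout budgets, which leads to low sampling success and unstable training on complex tasks. The key to the solution is the Less Noise Sampling Framework (LENS): it first purifies the prompt by identifying and removing the small number of prompt tokens that introduce interference, then uses the successful rollouts obtained during purification to supervise policy optimization on the original noisy prompts, so that the model learns to ignore the interference. Experiments show that LENS outperforms GRPO with a 3.88% average gain and over 1.6x speedup in convergence.
Link: https://arxiv.org/abs/2601.21244
Authors: Yiju Guo,Tianyi Hu,Zexu Sun,Yankai Lin
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Department of Computer Science, Aarhus University; Baidu Inc.
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work in progress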
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has advanced LLM reasoning, but remains constrained by inefficient exploration under limited rollout budgets, leading to low sampling success and unstable training in complex tasks. We find that many exploration failures arise not from problem difficulty, but from a small number of prompt tokens that introduce interference. Building on this insight, we propose the Less Noise Sampling Framework (LENS), which first purifies prompts by identifying and removing interference tokens, then transfers successful rollouts from the purification process to supervise policy optimization on the original noisy prompts, enabling the model to learn to ignore interference in real-world, noisy prompting settings. Experimental results show that LENS significantly outperforms GRPO, delivering higher performance and faster convergence, with a 3.88% average gain and over 1.6x speedup. Our work highlights the critical role of pruning interference tokens in improving rollout efficiency, offering a new perspective for RLVR research.
[NLP-92] Quantifying Noise in Language Generation
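[Quick read]: This paper studies language generation in the limit under noise and quantifies the effect of each additional noisy string. It works in the noisy model of Raman and Raman (2025), in which the adversary may insert extraneous strings that do not belong to the target language. The key contributions are: first, for both uniform and non-uniform generation, a single noisy string strictly reduces the set of collections that can be generated, which answers an open question from Raman and Raman (2025); second, for both notions, generation with a single noisy string is equivalent to generation with any finite amount of noise, in sharp contrast with the strict hierarchy for noisy generation in the limit shown by Bai, Panigrahi, and Zhang (2026). Building on these results, the paper gives the first known characterization of non-uniform noise-dependent generatability.
Link: https://arxiv.org/abs/2601.21237
Authors: Aaron Li,Ian Zhang
Affiliations: Harvard University; Duke University
Categories: Data Structures and Algorithms (cs.DS); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: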
Abstract:Kleinberg and Mullainathan recently proposed a formal framework for studying the phenomenon of language generation, called language generation in the limit. In this model, an adversary gives an enumeration of example strings from an unknown target language, and the algorithm is tasked with correctly generating unseen strings from the target language within finite time. Refined notions of non-uniform and uniform generation were later introduced by Li, Raman, and Tewari (2025), and a noisy model was introduced by Raman and Raman (2025), which allows the adversary to insert extraneous strings. A natural question in the noisy model is to quantify the effect of noise, by studying the impact of each additional extraneous string. We show two complementary results in this setting. We first show that for both uniform and non-uniform generation, a single noisy string strictly reduces the set of collections that can be generated, thus answering an open question in Raman and Raman (2025). Then, we show for both uniform and non-uniform generation that generation with a single noisy string is equivalent to generation with any finite amount of noise, sharply contrasting with the strict hierarchy for noisy generation in the limit shown by Bai, Panigrahi, and Zhang (2026). Finally, we leverage our previous results to provide the first known characterization for non-uniform noise-dependent generatability.
[NLP-93] SHARP: Social Harm Analysis via Risk Profiles for Measuring Inequities in Large Language Models
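[Quick read]: This paper addresses the problem that evaluation benchmarks for Large Language Models (LLMs) in high-stakes settings rely on mean-centered scalar scores, which hide the distributional structure of social harm, cross-dimensional interactions, and worst-case behavior. The key to the solution is the Social Harm Analysis via Risk Profiles (SHARP) framework, which models harm as a multivariate random variable, decomposes it explicitly into bias, fairness, ethics, and epistemic reliability, and aggregates the dimensions with a union-of-failures rule reparameterized as additive cumulative log-risk. SHARP also uses risk-sensitive distributional statistics, with Conditional Value at Risk (CVaR95) as the primary metric, to characterize worst-case model behavior, yielding a multidimensional, distribution-aware evaluation of social harm.
Link: https://arxiv.org/abs/2601.21235
Authors: Alok Abhishek,Tushar Bandopadhyay,Lisa Erickson
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Pre Print, 29 pages. key words: Social harm evaluation in LLMs, Large language models, Risk sensitive model selection, Evaluation for high-stakes domains, Worst-case behavior in LLMs, Algorithmic bias, Fairness in machine learning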
Abstract:Large language models (LLMs) are increasingly deployed in high-stakes domains, where rare but severe failures can result in irreversible harm. However, prevailing evaluation benchmarks often reduce complex social risk to mean-centered scalar scores, thereby obscuring distributional structure, cross-dimensional interactions, and worst-case behavior. This paper introduces Social Harm Analysis via Risk Profiles (SHARP), a framework for multidimensional, distribution-aware evaluation of social harm. SHARP models harm as a multivariate random variable and integrates explicit decomposition into bias, fairness, ethics, and epistemic reliability with a union-of-failures aggregation reparameterized as additive cumulative log-risk. The framework further employs risk-sensitive distributional statistics, with Conditional Value at Risk (CVaR95) as a primary metric, to characterize worst-case model behavior. Application of SHARP to eleven frontier LLMs, evaluated on a fixed corpus of n=901 socially sensitive prompts, reveals that models with similar average risk can exhibit more than twofold differences in tail exposure and volatility. Across models, dimension-wise marginal tail behavior varies systematically across harm dimensions, with bias exhibiting the strongest tail severities, epistemic and fairness risks occupying intermediate regimes, and ethical misalignment consistently lower; together, these patterns reveal heterogeneous, model-dependent failure structures that scalar benchmarks conflate. These findings indicate that responsible evaluation and governance of LLMs require moving beyond scalar averages toward multidimensional, tail-sensitive risk profiling.
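The two quantities named in the abstract can be sketched numerically. This is one plausible reading, not the paper's exact definitions: the union-of-failures probability is taken as 1 - ∏(1 - p_d) under an assumed independence of the harm dimensions, its additive form as the cumulative log-risk -∑ log(1 - p_d), and CVaR95 as the mean of the worst 5% of per-prompt harm scores. All function names and the toy score lists are hypothetical.

```python
import math
from typing import Sequence


def cumulative_log_risk(failure_probs: Sequence[float]) -> float:
    """Additive form: sum over dimensions of -log(1 - p_d); requires each p_d < 1."""
    return -sum(math.log1p(-p) for p in failure_probs)


def union_of_failures(failure_probs: Sequence[float]) -> float:
    """Probability that at least one dimension fails (independence assumed)."""
    return 1.0 - math.exp(-cumulative_log_risk(failure_probs))


def cvar(scores: Sequence[float], alpha: float = 0.95) -> float:
    """Mean of the worst (1 - alpha) share of scores; a higher score means more harm."""
    # round() guards against floating-point noise such as 0.05 * 100 -> 5.000000000000004
    tail_size = max(1, math.ceil(round((1.0 - alpha) * len(scores), 9)))
    worst = sorted(scores, reverse=True)[:tail_size]
    return sum(worst) / tail_size


# Two toy models with (almost exactly) the same mean harm but different tails.
steady = [0.20] * 20
spiky = [0.16] * 19 + [0.96]

mean_steady = sum(steady) / len(steady)
mean_spiky = sum(spiky) / len(spiky)
tail_steady = cvar(steady)
tail_spiky = cvar(spiky)
```

The toy lists mirror the abstract's observation that models with similar average risk can differ widely in tail exposure: the two means agree, while the tail statistic of `spiky` is several times that of `steady`.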
[NLP-94] MGSM-Pro: A Simple Strategy for Robust Multilingual Mathematical Reasoning Evaluation
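[Quick read]: This paper addresses the performance variance that appears in multilingual mathematical reasoning evaluation when the same question is instantiated differently, with especially large drops in low-resource languages. The key to the solution is MGSM-Pro, a dataset that provides five instantiations of each MGSM question by varying names, digits, and irrelevant context, which allows a more robust assessment of mathematical reasoning. Evaluations across nine languages reveal differences in robustness to digit changes across models, and the authors recommend evaluating each problem with at least five digit-varying instantiations.
Link: https://arxiv.org/abs/2601.21225
Authors: Tianyi Xu,Kosei Uemura,Alfred Malengo Kondoro,Tadesse Destaw Belay,Catherine Nana Nyaah Essuman,Ifeoma Okoh,Ganiyat Afolabi,Ayodele Awokoya,David Ifeoluwa Adelani
Affiliations: McGill University; Mila-Quebec AI Institute; University of Toronto; Masakhane; Hanyang University; Instituto Politécnico Nacional; Umbaji; University of Ibadan; McPherson University; Canada CIFAR AI Chair
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: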
Abstract:Large language models have made substantial progress in mathematical reasoning. However, benchmark development for multilingual evaluation has lagged behind English in both difficulty and recency. Recently, GSM-Symbolic showed strong evidence of high variance when models are evaluated on different instantiations of the same question; however, the evaluation was conducted only in English. In this paper, we introduce MGSM-Pro, an extension of the MGSM dataset with the GSM-Symbolic approach. Our dataset provides five instantiations per MGSM question by varying names, digits and irrelevant context. Evaluations across nine languages reveal that many low-resource languages suffer large performance drops when tested on digit instantiations different from those in the original test set. We further find that some proprietary models, notably Gemini 2.5 Flash and GPT-4.1, are less robust to digit instantiation, whereas Claude 4.0 Sonnet is more robust. Among open models, GPT-OSS 120B and DeepSeek V3 show stronger robustness. Based on these findings, we recommend evaluating each problem using at least five digit-varying instantiations to obtain a more robust and realistic assessment of math reasoning.
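The idea of generating several instantiations of one question by varying names and digits can be sketched with a simple template. This is an illustration in the spirit of the GSM-Symbolic approach, not the MGSM-Pro pipeline: the template, the name list, the number ranges, and the function `instantiate` are all hypothetical, and the sketch covers only English text and omits the irrelevant-context variation.

```python
import random
from typing import Dict, List

# Hypothetical question template; the answer is recomputed from the sampled values.
TEMPLATE = (
    "{name} has {a} apples and buys {b} more. "
    "How many apples does {name} have now?"
)
NAMES = ["Amina", "Kofi", "Mei", "Lucas", "Sara"]


def instantiate(seed: int) -> Dict[str, object]:
    """Build one variant of the question with sampled name and digits."""
    rng = random.Random(seed)  # seeded so that each variant is reproducible
    name = rng.choice(NAMES)
    a, b = rng.randint(2, 99), rng.randint(2, 99)
    return {
        "question": TEMPLATE.format(name=name, a=a, b=b),
        "values": (a, b),
        "answer": a + b,
    }


# Five instantiations per question, matching the count stated in the abstract.
variants: List[Dict[str, object]] = [instantiate(seed) for seed in range(5)]
```

Scoring a model on all five variants and reporting the spread, rather than a single fixed instance, is the kind of evaluation the authors recommend.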
[NLP-95] Parametric Knowledge is Not All You Need: Toward Honest Large Language Models via Retrieval of Pretraining Data
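[Quick Read]: This paper addresses the fact that large language models (LLMs) are often unaware of their own knowledge boundary: they cannot reliably tell what they do and do not know, so they generate plausible but factually incorrect content (hallucination) on topics where their knowledge is insufficient. To improve LLM honesty, the paper proposes a more robust evaluation benchmark dataset built on Pythia, a fully open model with publicly available pretraining data, which avoids the evaluation bias of earlier methods that ignore what the model ingested during pretraining. Its key contribution is a method that harnesses the pretraining data to build a more honest LLM, with explicit modeling and control of the model's knowledge boundary.
Link: https://arxiv.org/abs/2601.21218
Authors: Christopher Adrian Kusuma,Muhammad Reza Qorib,Hwee Tou Ng
Affiliations: National University of Singapore
Categories: Computation and Language (cs.CL)
Notes: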
Abstract:Large language models (LLMs) are highly capable of answering questions, but they are often unaware of their own knowledge boundary, i.e., knowing what they know and what they don’t know. As a result, they can generate factually incorrect responses on topics they do not have enough knowledge of, commonly known as hallucination. Rather than hallucinating, a language model should be more honest and respond with “I don’t know” when it does not have enough knowledge about a topic. Many methods have been proposed to improve LLM honesty, but their evaluations lack robustness, as they do not take into account the knowledge that the LLM has ingested during its pretraining. In this paper, we propose a more robust evaluation benchmark dataset for LLM honesty by utilizing Pythia, a truly open LLM with publicly available pretraining data. In addition, we also propose a novel method for harnessing the pretraining data to build a more honest LLM.
[NLP-96] Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models ICLR2026
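[Quick Read]: This paper addresses the drop in reasoning hop generalization in chain-of-thought (CoT) reasoning with large language models (LLMs): performance falls sharply when the number of reasoning steps required at test time exceeds the training distribution. The key to the solution is identifying and intervening on the internal error mechanism. The study finds that errors concentrate at specific token positions and stem from certain attention heads, termed erroneous processing heads (ep heads), which amplify incorrect reasoning trajectories while suppressing correct ones. The paper proposes a lightweight test-time correction method that dynamically identifies and deactivates ep heads during reasoning, improving performance in hop generalization scenarios.
Link: https://arxiv.org/abs/2601.21214
Authors: Zhaoyi Li,Jiatong Li,Gangwei Jiang,Linqi Song,Defu Lian,Ying Wei
Affiliations: University of Science and Technology of China; City University of Hong Kong; State Key Laboratory of Cognitive Intelligence; Zhejiang University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: 52 pages, accepted by ICLR 2026 main conference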
Abstract:Chain-of-thought (CoT) reasoning has become the standard paradigm for enabling Large Language Models (LLMs) to solve complex problems. However, recent studies reveal a sharp performance drop in reasoning hop generalization scenarios, where the required number of reasoning steps exceeds training distributions while the underlying algorithm remains unchanged. The internal mechanisms driving this failure remain poorly understood. In this work, we conduct a systematic study on tasks from multiple domains, and find that errors concentrate at token positions of a few critical error types, rather than being uniformly distributed. Closer inspection reveals that these token-level erroneous predictions stem from internal competition mechanisms: certain attention heads, termed erroneous processing heads (ep heads), tip the balance by amplifying incorrect reasoning trajectories while suppressing correct ones. Notably, removing individual ep heads during inference can often restore the correct predictions. Motivated by these insights, we propose test-time correction of reasoning, a lightweight intervention method that dynamically identifies and deactivates ep heads in the reasoning process. Extensive experiments across different tasks and LLMs show that it consistently improves reasoning hop generalization, highlighting both its effectiveness and potential.
[NLP-97] Multilingual Dysarthric Speech Assessment Using Universal Phone Recognition and Language-Specific Phonemic Contrast Modeling
[Quick Read]: This paper addresses the limitations of automated intelligibility assessment for dysarthric speech in multilingual settings: existing methods usually work for a single language or fail to capture the language-specific factors that shape intelligibility. The key to the solution is a multilingual phoneme-production assessment framework that combines universal phone recognition with language-specific phoneme interpretation, using contrastive phonological feature distances for phone-to-phoneme mapping and sequence alignment. It yields three metrics: phoneme error rate (PER), phonological feature error rate (PFER), and a newly proposed alignment-free measure, phoneme coverage (PhonCov). Experiments on English, Spanish, Italian, and Tamil show that different metrics benefit from mapping, alignment, or both, and that the framework captures clinically interpretable patterns of intelligibility degradation, supporting its cross-lingual applicability and clinical relevance.
Link: https://arxiv.org/abs/2601.21205
Authors: Eunjung Yeo,Julie M. Liss,Visar Berisha,David R. Mortensen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Notes: 10 pages, 4 figures
Abstract:The growing prevalence of neurological disorders associated with dysarthria motivates the need for automated intelligibility assessment methods that are applicable across languages. However, most existing approaches are either limited to a single language or fail to capture language-specific factors shaping intelligibility. We present a multilingual phoneme-production assessment framework that integrates universal phone recognition with language-specific phoneme interpretation using contrastive phonological feature distances for phone-to-phoneme mapping and sequence alignment. The framework yields three metrics: phoneme error rate (PER), phonological feature error rate (PFER), and a newly proposed alignment-free measure, phoneme coverage (PhonCov). Analyses on English, Spanish, Italian, and Tamil show that PER benefits from the combination of mapping and alignment, PFER from alignment alone, and PhonCov from mapping. Further analyses demonstrate that the proposed framework captures clinically meaningful patterns of intelligibility degradation consistent with established observations of dysarthric speech.
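For readers unfamiliar with phoneme error rate, the Python sketch below computes it in its common textbook form: the edit distance between a reference and a recognized phoneme sequence, divided by the reference length. This is an assumption about the general metric, not the paper's implementation; in particular, it uses uniform edit costs, whereas the abstract describes alignment based on phonological feature distances.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two phoneme sequences (uniform costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: list[str], hyp: list[str]) -> float:
    """(substitutions + deletions + insertions) / reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, a reference of `["k", "ae", "t"]` against a recognized `["k", "ah", "t"]` contains one substitution in three phonemes.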
[NLP-98] Scaling Embeddings Outperforms Scaling Experts in Language Models
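[Quick Read]: This paper addresses the diminishing returns and system-level bottlenecks that Mixture-of-Experts (MoE) architectures face when scaling sparsity in large language models. The key to the solution is exploring embedding scaling as a new dimension orthogonal to expert scaling; a systematic analysis finds specific regimes in which embedding scaling achieves a better Pareto frontier than expert scaling. The study also characterizes the key architectural factors, from parameter budgeting to the interplay with model width and depth, and combines tailored system optimizations with speculative decoding to turn the sparsity into actual inference speedups. On this basis the authors introduce LongCat-Flash-Lite, which activates only about 3B parameters yet surpasses parameter-equivalent MoE baselines and performs strongly in agentic and coding domains.
Link: https://arxiv.org/abs/2601.21204
Authors: Hong Liu,Jiaqi Zhang,Chao Wang,Xing Hu,Linkun Lyu,Jiaqi Sun,Xurui Yang,Bo Wang,Fengcun Li,Yulei Qian,Lingtong Si,Yerui Sun,Rumei Li,Peng Pei,Yuchen Xie,Xunliang Cai
Affiliations: Meituan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: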
Abstract:While Mixture-of-Experts (MoE) architectures have become the standard for sparsity scaling in large language models, they increasingly face diminishing returns and system-level bottlenecks. In this work, we explore embedding scaling as a potent, orthogonal dimension for scaling sparsity. Through a comprehensive analysis and experiments, we identify specific regimes where embedding scaling achieves a superior Pareto frontier compared to expert scaling. We systematically characterize the critical architectural factors governing this efficacy – ranging from parameter budgeting to the interplay with model width and depth. Moreover, by integrating tailored system optimizations and speculative decoding, we effectively convert this sparsity into tangible inference speedups. Guided by these insights, we introduce LongCat-Flash-Lite, a 68.5B parameter model with ~3B activated trained from scratch. Despite allocating over 30B parameters to embeddings, LongCat-Flash-Lite not only surpasses parameter-equivalent MoE baselines but also exhibits exceptional competitiveness against existing models of comparable scale, particularly in agentic and coding domains.
[NLP-99] Do Reasoning Models Enhance Embedding Models?
[Quick Read]: This paper asks whether reasoning models trained via Reinforcement Learning with Verifiable Rewards (RLVR) yield better semantic representations than their base models when used to initialize embedding models. It finds that although RLVR improves reasoning ability, the resulting embedding models show no consistent advantage on the MTEB and BRIGHT benchmarks, i.e. a null effect. The key contribution is a new analysis framework, Hierarchical Representation Similarity Analysis (HRSA), which decomposes embedding-space similarity across representation, geometry, and function levels. HRSA reveals that RLVR induces an irreversible reorganization of the latent manifold's local geometry and a reversible drift of the coordinate basis, but preserves global geometry and linear readout, so subsequent contrastive learning realigns base- and reasoning-initialized models (Manifold Realignment), which explains why the two end up performing alike. This indicates that RLVR optimizes trajectories within an existing semantic landscape rather than restructuring the landscape itself.
Link: https://arxiv.org/abs/2601.21192
Authors: Wun Yu Chan,Shaojin Chen,Huihao Jing,Kwun Hang Lau,Elton Chun-Chai Li,Zihao Wang,Haoran Li,Yangqiu Song
Affiliations: HKUST
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 10 main pages, 18 appendix pages, 13 figures, 11 tables, 4 prompts
Abstract:State-of-the-art embedding models are increasingly derived from decoder-only Large Language Model (LLM) backbones adapted via contrastive learning. Given the emergence of reasoning models trained via Reinforcement Learning with Verifiable Rewards (RLVR), a natural question arises: does enhanced reasoning translate to superior semantic representations when these models serve as embedding initializations? Contrary to expectation, our evaluation on MTEB and BRIGHT reveals a null effect: embedding models initialized from RLVR-tuned backbones yield no consistent performance advantage over their base counterparts when subjected to identical training recipes. To unpack this paradox, we introduce Hierarchical Representation Similarity Analysis (HRSA), a framework that decomposes similarity across representation, geometry, and function levels. HRSA reveals that while RLVR induces an irreversible reorganization of the latent manifold’s local geometry and a reversible coordinate basis drift, it preserves the global manifold geometry and linear readout. Consequently, subsequent contrastive learning drives strong alignment between base- and reasoning-initialized models, a phenomenon we term Manifold Realignment. Empirically, our findings suggest that unlike Supervised Fine-Tuning (SFT), RLVR optimizes trajectories within an existing semantic landscape rather than fundamentally restructuring the landscape itself.
[NLP-100] From Linear Input to Hierarchical Structure: Function Words as Statistical Cues for Language Learning
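[Quick Read]: This paper asks under what statistical conditions linear input (such as word sequences) can support learning hierarchical structure, particularly in language acquisition. Its approach focuses on the distributional properties of function words: a cross-linguistic corpus analysis confirms that function words are highly frequent, reliably associated with syntactic structure, and aligned with phrase boundaries, and that all three properties hold across 186 languages. Counterfactual language modeling and ablation experiments then show that language variants preserving the three properties are more easily acquired by neural learners, with frequency and structural association contributing more than boundary alignment. The findings show that function words are a key statistical cue in language learning whose role depends on the learning conditions, which explains how similar performance can arise from different internal mechanisms.
Link: https://arxiv.org/abs/2601.21191
Authors: Xiulin Yang,Heidi Getz,Ethan Gotlieb Wilcox
Affiliations: Georgetown University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Jan ARR under review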
Abstract:What statistical conditions support learning hierarchical structure from linear input? In this paper, we address this question by focusing on the statistical distribution of function words. Function words have long been argued to play a crucial role in language acquisition due to their distinctive distributional properties, including high frequency, reliable association with syntactic structure, and alignment with phrase boundaries. We use cross-linguistic corpus analysis to first establish that all three properties are present across 186 studied languages. Next, we use a combination of counterfactual language modeling and ablation experiments to show that language variants preserving all three properties are more easily acquired by neural learners, with frequency and structural association contributing more strongly than boundary alignment. Follow-up probing and ablation analyses further reveal that different learning conditions lead to systematically different reliance on function words, indicating that similar performance can arise from distinct internal mechanisms.
[NLP-101] Output-Space Search: Targeting LLM Generations in a Frozen Encoder-Defined Output Space
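[Quick Read]: This paper addresses the lack of explicit control over, and efficient exploration of, the output space during large language model (LLM) generation; when high diversity or optimization of a specific objective is needed, conventional prompt-chaining or token-by-token search is strongly path-dependent and inefficient. The key to the solution is Output-Space Search (OS-Search), which turns LLM generation into endpoint search in a 3D output space Z defined by a frozen encoder: an outer loop selects a target coordinate z*, and a retrieval-grounded policy trained with sequence-level reinforcement learning (RL) generates outputs that land near z*. This enables parallel sweeps and black-box optimization without path dependence, improving diversity and objective optimization in text and code generation tasks.
Link: https://arxiv.org/abs/2601.21169
Authors: Tobias Materzok
Affiliations: Technische Universität Darmstadt
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: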
Abstract:We introduce Output-Space Search (OS-Search), which turns LLM generation into endpoint search. An outer loop selects a target z* in a frozen encoder-defined 3D output space Z, and a retrieval-grounded policy trained with sequence-level RL generates outputs whose coordinates land near z* under standard autoregressive decoding. This enables parallel sweeps and black-box optimization in Z without path-dependent token/program search. On stories, sweeping Z (text) yields 3.1x higher LLM-scored diversity than prompt-chaining. On code, Bayesian optimization over Z (code) improves an objective withheld from the controller under matched inference budgets while preserving validity.
[NLP-102] Bridging the Arithmetic Gap: The Cognitive Complexity Benchmark and Financial-PoT for Robust Financial Reasoning
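[Quick Read]: This paper addresses two core problems that large language models (LLMs) face in financial quantitative reasoning: "Arithmetic Hallucinations", i.e. errors made during numerical computation, and "Cognitive Collapse", a systematic degradation of reasoning under high cognitive load. To quantify the phenomenon, the authors build a dataset from 95 real Chinese A-share annual reports and propose the Cognitive Complexity Benchmark (CCB), which classifies financial queries along three dimensions (data source, mapping difficulty, and result unit) to diagnose reasoning degradation precisely. The key to the solution is the Iterative Dual-Phase Financial-PoT framework, a neuro-symbolic architecture with strict structural decoupling: it first isolates semantic variable extraction and logic formulation, then offloads computation to an iterative, self-correcting Python sandbox for deterministic execution. Experiments show that the method raises the Qwen3-235B model's average accuracy on the CCB from 59.7% to 67.3% and achieves gains of up to 10-fold on high-complexity tasks, supporting the role of architectural decoupling in reliable financial reasoning.
Link: https://arxiv.org/abs/2601.21157
Authors: Boxiang Zhao,Qince Li,Zhonghao Wang,Yi Wang,Peng Cheng,Bo Lin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: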
Abstract:While Large Language Models excel at semantic tasks, they face a critical bottleneck in financial quantitative reasoning, frequently suffering from “Arithmetic Hallucinations” and a systemic failure mode we term “Cognitive Collapse”. To strictly quantify this phenomenon, we introduce the Cognitive Complexity Benchmark (CCB), a robust evaluation framework grounded in a dataset constructed from 95 real-world Chinese A-share annual reports. Unlike traditional datasets, the CCB stratifies financial queries into a three-dimensional taxonomy, Data Source, Mapping Difficulty, and Result Unit, enabling the precise diagnosis of reasoning degradation in high-cognitive-load scenarios. To address these failures, we propose the Iterative Dual-Phase Financial-PoT framework. This neuro-symbolic architecture enforces a strict architectural decoupling: it first isolates semantic variable extraction and logic formulation, then offloads computation to an iterative, self-correcting Python sandbox to ensure deterministic execution. Evaluation on the CCB demonstrates that while standard Chain-of-Thought falters on complex tasks, our approach offers superior robustness, elevating the Qwen3-235B model’s average accuracy from 59.7% to 67.3% and achieving gains of up to 10-fold in high-complexity reasoning tasks. These findings suggest that architectural decoupling is a critical enabling factor for improving reliability in financial reasoning tasks, providing a transferable architectural insight for precision-critical domains that require tight alignment between semantic understanding and quantitative computation.
[NLP-103] Maxwait: A Generalized Mechanism for Distributed Time-Sensitive Systems
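[Quick Read]: This paper addresses the tradeoff between timing requirements (availability) and consistency in distributed time-sensitive systems, especially under communication delays and synchronization uncertainty. Its core solution is maxwait, a coordination mechanism that makes timing-related tradeoffs explicit and configurable. It subsumes classical distributed-system methods (such as PTIDES, the Chandy-and-Misra protocol, Time-Warp, and Lamport's time-based fault detection) and supports real-time behavior in distributed cyber-physical applications. Its key feature is that it enforces logical-time consistency when communication latencies are bounded and provides structured fault handling when the bounds are violated, giving better control over timing, bounded-time fault detection, and more determinism within a single semantic framework.
Link: https://arxiv.org/abs/2601.21146
Authors: Francesco Paladino,Shulu Li,Edward A. Lee
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL)
Notes: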
Abstract:Distributed time-sensitive systems must balance timing requirements (availability) and consistency in the presence of communication delays and synchronization uncertainty. This paper presents maxwait, a simple coordination mechanism with surprising generality that makes these tradeoffs explicit and configurable. We demonstrate that this mechanism subsumes classical distributed system methods such as PTIDES, Chandy-and-Misra with or without null messages, Jefferson’s Time-Warp, and Lamport’s time-based fault detection, while enabling real-time behavior in distributed cyber-physical applications. The mechanism can also realize many commonly used distributed system patterns, including logical execution time (LET), publish and subscribe, actors, conflict-free replicated data types (CRDTs), and remote procedure calls with futures. More importantly, it adds to these mechanisms better control over timing, bounded time fault detection, and the option of making them more deterministic, all within a single semantic framework. Implemented as an extension of the Lingua Franca coordination language, maxwait enforces logical-time consistency when communication latencies are bounded and provides structured fault handling when bounds are violated.
[NLP-104] EnsembleLink: Accurate Record Linkage Without Training Data
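[Quick Read]: This paper addresses the methodological underdevelopment of record linkage in empirical social science: existing methods either achieve low accuracy or depend on large amounts of labeled training data, and linkage is usually treated as a preprocessing step whose uncertainty is ignored. The key to the solution is EnsembleLink, which leverages semantic relationships learned by pre-trained language models (for example, that "South Ozone Park" is a neighborhood in "New York City", or that "Lutte ouvriere" refers to the Trotskyist "Workers' Struggle" party) to achieve high-accuracy matching without any labeled data. The method runs locally on open-source models, needs no external API calls, and completes typical tasks in minutes.
Link: https://arxiv.org/abs/2601.21138
Authors: Noah Dasanaike
Affiliations: Harvard University
Categories: Computation and Language (cs.CL)
Notes: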
Abstract:Record linkage, the process of matching records that refer to the same entity across datasets, is essential to empirical social science but remains methodologically underdeveloped. Researchers treat it as a preprocessing step, applying ad hoc rules without quantifying the uncertainty that linkage errors introduce into downstream analyses. Existing methods either achieve low accuracy or require substantial labeled training data. I present EnsembleLink, a method that achieves high accuracy without any training labels. EnsembleLink leverages pre-trained language models that have learned semantic relationships (e.g., that “South Ozone Park” is a neighborhood in “New York City” or that “Lutte ouvriere” refers to the Trotskyist “Workers’ Struggle” party) from large text corpora. On benchmarks spanning city names, person names, organizations, multilingual political parties, and bibliographic records, EnsembleLink matches or exceeds methods requiring extensive labeling. The method runs locally on open-source models, requiring no external API calls, and completes typical linkage tasks in minutes.
[NLP-105] Large Language Models Naively Recover Ethnicity from Individual Records
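[Quick Read]: This paper addresses the limited accuracy and geographic coverage of inferring racial or ethnic categories from names. Traditional methods such as Bayesian Improved Surname Geocoding (BISG) are limited in cross-country settings and carry an income bias (minorities in wealthier neighborhoods are systematically misclassified as White). The key to the solution is using large language models (LLMs) to infer ethnicity directly from names without additional training data, which enables classification outside the United States and the use of contextually appropriate categories. On Florida and North Carolina voter data, LLM classification reaches up to 84.7% accuracy versus 68.2% for BISG on balanced samples; enabling extended reasoning can add 1-3 percentage points, and including metadata such as party registration reaches 86.7%. The method also reduces BISG's income bias and recovers known population distributions from full-count voter rolls in several countries, supporting its generalization. In addition, small transformer models fine-tuned on LLM labels can be deployed locally at low cost for large-scale use.
Link: https://arxiv.org/abs/2601.21132
Authors: Noah Dasanaike
Affiliations: Harvard University
Categories: Computation and Language (cs.CL)
Notes: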
Abstract:I demonstrate that large language models can infer ethnicity from names with accuracy exceeding that of Bayesian Improved Surname Geocoding (BISG) without additional training data, enabling inference outside the United States and to contextually appropriate classification categories. Using stratified samples from Florida and North Carolina voter files with self-reported race, LLM-based classification achieves up to 84.7% accuracy, outperforming BISG (68.2%) on balanced samples. I test six models including Gemini 3 Flash, GPT-4o, and open-source alternatives such as DeepSeek v3.2 and GLM-4.7. Enabling extended reasoning can improve accuracy by 1-3 percentage points, though effects vary across contexts; including metadata such as party registration reaches 86.7%. LLM classification also reduces the income bias inherent in BISG, where minorities in wealthier neighborhoods are systematically misclassified as White. I further validate using Lebanese voter registration with religious sect (64.3% accuracy), Indian MPs from reserved constituencies (99.2%), and Indian land records with caste classification (74.0%). Aggregate validation across India, Uganda, Nepal, Armenia, Chile, and Costa Rica using original full-count voter rolls demonstrates that the method recovers known population distributions where naming conventions are distinctive. For large-scale applications, small transformer models fine-tuned on LLM labels exceed BISG accuracy while enabling local deployment at no cost.
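[NLP-106] Multi-task Code LLMs: Data Mix or Model Merge?
[Quick Read]: This paper addresses how to build small multi-task code LLMs efficiently in resource-constrained settings, balancing performance against compute cost and deployment constraints. It systematically compares two strategies, data mixing and model merging, and finds that each has the advantage at a different scale. At 7B parameters, model merging retains 96% of specialized model performance on code generation while keeping summarization capability, and can even surpass individually fine-tuned models (the best Qwen Coder 2.5 7B configuration reaches 92.7% Pass@1 on HumanEval versus 90.9% for its task-specific fine-tuned equivalent); at 2B, data mixing is preferable. The study also introduces a weight analysis technique that shows how different tasks affect model parameters, informing the design of merging strategies for low-cost multi-task code generation and summarization.
Link: https://arxiv.org/abs/2601.21115
Authors: Mingzhi Zhu,Boris Sobolev,Rahul Krishna,Raju Pavuluri,Stacy Patterson,Michele Merler
Affiliations: Rensselaer Polytechnic Institute; Cisco; IBM Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: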
Abstract:Recent research advocates deploying smaller, specialized code LLMs in agentic frameworks alongside frontier models, sparking interest in efficient strategies for multi-task learning that balance performance, constraints, and costs. We compare two approaches for creating small, multi-task code LLMs: data mixing versus model merging. We conduct extensive experiments across two model families (Qwen Coder and DeepSeek Coder) at two scales (2B and 7B parameters), fine-tuning them for code generation and code summarization tasks. Our evaluation on HumanEval, MBPP, and CodeXGlue benchmarks reveals that model merging achieves the best overall performance at larger scale across model families, retaining 96% of specialized model performance on code generation tasks while maintaining summarization capabilities. Notably, merged models can even surpass individually fine-tuned models, with our best configuration of Qwen Coder 2.5 7B model achieving 92.7% Pass@1 on HumanEval compared to 90.9% for its task-specific fine-tuned equivalent. At a smaller scale we find instead data mixing to be a preferred strategy. We further introduce a weight analysis technique to understand how different tasks affect model parameters and their implications for merging strategies. The results suggest that careful merging and mixing strategies can effectively combine task-specific capabilities without significant performance degradation, making them ideal for resource-constrained deployment scenarios.
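The abstract does not say which merging algorithm the authors use. As a minimal sketch of the general idea of model merging, the Python below linearly interpolates two fine-tuned checkpoints parameter by parameter. Linear interpolation is an assumption chosen for simplicity, and the dict-of-lists checkpoint format and the function name `merge_linear` are hypothetical stand-ins for real tensor state dicts.

```python
def merge_linear(weights_a: dict, weights_b: dict, alpha: float = 0.5) -> dict:
    """Interpolate two checkpoints: alpha * a + (1 - alpha) * b per parameter.

    Each checkpoint maps a parameter name to a flat list of floats
    (a hypothetical stand-in for a framework state dict).
    """
    if weights_a.keys() != weights_b.keys():
        raise ValueError("checkpoints must share parameter names")
    merged = {}
    for name, a in weights_a.items():
        b = weights_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]
    return merged
```

Data mixing, by contrast, would fine-tune a single model on the union of the code generation and summarization training sets, so no post-hoc combination of weights is needed.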
[NLP-107] ChunkWise LoRA: Adaptive Sequence Partitioning for Memory-Efficient Low-Rank Adaptation and Accelerated LLM Inference
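[Quick Read]: This paper addresses the static rank configuration of existing low-rank adaptation (LoRA) methods for fine-tuning large language models (LLMs): the same low-rank structure is applied to every input token, ignoring token-level differences in complexity and computational requirements. The key to the solution is ChunkWise LoRA, whose main components are: 1) dynamic partitioning of sequences into variable-length chunks based on token complexity; 2) a runtime scheduler that estimates token difficulty and uses a rank-ladder mechanism to assign each chunk a tailored LoRA rank and scaling; and 3) a boundary-safe composition module and policy-driven KV-cache strategies that preserve output consistency. The method reduces latency by up to 34% and memory by up to 38% while maintaining or improving task performance.
Link: https://arxiv.org/abs/2601.21109
Authors: Ketan Thakkar,Maitreyi Chatterjee,Ramasubramanian Balasubramanian,Achyuthan Jootoo,Rajendra Ugrani
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Notes: Presented at 13th IEEE International Conference on Intelligent Systems and Embedded Design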
Abstract:Recent advances in low-rank adaptation (LoRA) have enabled efficient fine-tuning of large language models (LLMs) with minimal additional parameters. However, existing LoRA methods apply static rank configurations uniformly across all input tokens, ignoring variation in token complexity and computational requirements. In this work, we propose ChunkWise LoRA, a dynamic and adaptive approach that partitions sequences into variable-length chunks based on token complexity and assigns each chunk a tailored low-rank configuration. Our system introduces a runtime scheduler that estimates token difficulty, performs adaptive chunking, and selects per-chunk LoRA rank and scaling using a rank-ladder mechanism. To preserve output consistency, we further introduce a boundary-safe composition module and integrate policy-driven KV-cache strategies. Experiments on benchmark datasets such as Wikitext-103 and SQuAD demonstrate that ChunkWise LoRA achieves up to 34% lower latency and 38% memory reduction compared to baseline LoRA, while maintaining or improving task performance metrics like BLEU, EM, and perplexity. The proposed framework remains fully compatible with existing transformer architectures and inference frameworks, providing a practical solution for real-world deployment of parameter-efficient LLMs.
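To make the chunking and rank-ladder description more concrete, here is a small Python sketch under stated assumptions: per-token difficulty scores in [0, 1] are already available, the ladder has three hypothetical ranks, and consecutive tokens that map to the same ladder level form one chunk. The abstract does not specify how difficulty is estimated or how the ladder is defined, so every name and threshold below is illustrative.

```python
from itertools import groupby

RANK_LADDER = (4, 8, 16)  # hypothetical LoRA ranks, low to high difficulty


def ladder_level(score: float) -> int:
    """Map a difficulty score in [0, 1] to an index into RANK_LADDER."""
    clipped = min(max(score, 0.0), 1.0)
    return min(int(clipped * len(RANK_LADDER)), len(RANK_LADDER) - 1)


def chunk_by_difficulty(scores: list[float]) -> list[tuple[int, int, int]]:
    """Group consecutive same-level tokens into (start, end, rank) chunks.

    `end` is exclusive, so a chunk covers tokens scores[start:end].
    """
    chunks, start = [], 0
    for level, group in groupby(scores, key=ladder_level):
        length = len(list(group))
        chunks.append((start, start + length, RANK_LADDER[level]))
        start += length
    return chunks
```

With scores `[0.1, 0.2, 0.5, 0.9, 0.95]`, the first two tokens share the lowest rank, the third gets the middle rank, and the last two get the highest rank. A real system would also have to handle chunk boundaries and the KV cache, which the abstract assigns to separate modules.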
[NLP-108] SteerEval: A Framework for Evaluating Steerability with Natural Language Profiles for Recommendation
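[Quick Read]: This paper addresses the insufficient evaluation of steerability in current natural-language-based recommender systems: existing benchmarks cover only well-recognized attributes (such as movie genres) and miss richer, more diverse forms of user control. The key to the solution is the SteerEval evaluation framework, which uses interventions ranging from genres to content warnings to measure how well recommenders follow natural-language steering commands, and on that basis analyzes how different profile and recommendation interventions affect steering effectiveness, providing empirical grounding and design guidance for steerable recommender systems.
Link: https://arxiv.org/abs/2601.21105
Authors: Joyce Zhou,Weijie Zhou,Doug Turnbull,Thorsten Joachims
Affiliations: Cornell University; Ithaca College
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Notes: 10 pages, 2 figures, 8 tables. Pre-print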
Abstract:Natural-language user profiles have recently attracted attention not only for improved interpretability, but also for their potential to make recommender systems more steerable. By enabling direct editing, natural-language profiles allow users to explicitly articulate preferences that may be difficult to infer from past behavior. However, it remains unclear whether current natural-language-based recommendation methods can follow such steering commands. While existing steerability evaluations have shown some success for well-recognized item attributes (e.g., movie genres), we argue that these benchmarks fail to capture the richer forms of user control that motivate steerable recommendations. To address this gap, we introduce SteerEval, an evaluation framework designed to measure more nuanced and diverse forms of steerability by using interventions that range from genres to content-warning for movies. We assess the steerability of a family of pretrained natural-language recommenders, examine the potential and limitations of steering on relatively niche topics, and compare how different profile and recommendation interventions impact steering effectiveness. Finally, we offer practical design suggestions informed by our findings and discuss future steps in steerable recommender design.
[NLP-109] Position-invariant Fine-tuning of Speech Enhancement Models with Self-supervised Speech Representations ICASSP2026
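[Quick Read]: This paper addresses a problem that arises when front-end speech enhancement (SE) models are fine-tuned with self-supervised learning (SSL) speech representations: the mean squared error (MSE) loss over-relies on the positional embeddings in SSL models, so the objective can be minimized through positional correlations instead of content-related information. The key to the solution is position-invariant fine-tuning, studied through two strategies: zero-padding, which weakens the influence of positional information, and speed perturbations combined with a soft dynamic time warping (soft-DTW) loss, which pushes the model toward content rather than temporal position. Experiments show that the soft-DTW-based approach converges faster and improves downstream performance, underscoring the importance of position-invariant fine-tuning for SSL-based speech modelling.
Link: https://arxiv.org/abs/2601.21084
Authors: Amit Meghanani,Thomas Hain
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Notes: Accepted to ICASSP 2026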
Abstract:Integrating front-end speech enhancement (SE) models with self-supervised learning (SSL)-based speech models is effective for downstream tasks in noisy conditions. SE models are commonly fine-tuned using SSL representations with mean squared error (MSE) loss between enhanced and clean speech. However, MSE is prone to exploiting positional embeddings in SSL models, allowing the objective to be minimised through positional correlations instead of content-related information. This work frames the problem as a general limitation of self-supervised representation fine-tuning and investigates it through representation-guided SE. Two strategies are considered: (1) zero-padding, previously explored in SSL pre-training but here examined in the fine-tuning setting, and (2) speed perturbations with a soft-DTW loss. Experiments show that the soft-DTW-based approach achieves faster convergence and improved downstream performance, underscoring the importance of position-invariant fine-tuning in SSL-based speech modelling.
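The soft-DTW loss mentioned in the abstract replaces the hard minimum in dynamic time warping with a smooth minimum, so sequences that differ only in timing can still be matched at low cost. The plain-Python sketch below shows the standard recursion with squared Euclidean frame costs; it is a reference illustration, not the authors' training code, and a practical version would be implemented differentiably on tensors.

```python
import math


def softmin(values, gamma: float) -> float:
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    finite = [v for v in values if v != math.inf]
    if not finite:
        return math.inf
    m = min(finite)
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in finite))


def soft_dtw(x, y, gamma: float = 1.0) -> float:
    """Soft-DTW between two sequences of equal-dimension feature vectors."""
    n, m = len(x), len(y)
    # r[i][j] holds the soft alignment cost of x[:i] against y[:j].
    r = [[math.inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = sum((a - b) ** 2 for a, b in zip(x[i - 1], y[j - 1]))
            r[i][j] = cost + softmin(
                (r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]), gamma
            )
    return r[n][m]
```

As `gamma` approaches zero the value approaches the classic DTW cost, which is zero for a sequence compared with a time-stretched copy of itself. That tolerance to warping is what makes the loss a natural fit for speed-perturbed inputs, where a frame-by-frame MSE would penalize the timing shift.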
[NLP-110] Human-LLM Collaborative Feature Engineering for Tabular Data ICLR2026
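[Quick Read]: This paper addresses the inefficient exploration in current LLM-based feature engineering for tabular learning, which treats the LLM as a black-box optimizer. Existing methods rely on the LLM's internal heuristics to both propose and select operations, without calibrated estimates of operation utility, which leads to repeated exploration of low-yield operations with no principled prioritization. The key to the solution is a human-LLM collaborative feature engineering framework. It first decouples proposal from selection: the LLM only generates candidate transformation operations, while selection is guided by explicitly modeling the utility and uncertainty of each operation. It then adds a mechanism that selectively elicits preference feedback from human experts, who compare which operations are more promising, to improve utility estimates in early rounds, raising feature engineering performance and reducing users' cognitive load.
Link: https://arxiv.org/abs/2601.21060
Authors: Zhuoyan Li,Aditya Bansal,Jinzhao Li,Shishuang He,Zhuoran Lu,Mutian Zhang,Qin Liu,Yiwei Yang,Swati Jain,Ming Yin,Yunyao Li
Affiliations: Purdue University; Adobe; UIUC; University of Washington; UC Davis
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Notes: ICLR 2026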
Abstract:Large language models (LLMs) are increasingly used to automate feature engineering in tabular learning. Given task-specific information, LLMs can propose diverse feature transformation operations to enhance downstream model performance. However, current approaches typically assign the LLM as a black-box optimizer, responsible for both proposing and selecting operations based solely on its internal heuristics, which often lack calibrated estimations of operation utility and consequently lead to repeated exploration of low-yield operations without a principled strategy for prioritizing promising directions. In this paper, we propose a human-LLM collaborative feature engineering framework for tabular learning. We begin by decoupling the transformation operation proposal and selection processes, where LLMs are used solely to generate operation candidates, while the selection is guided by explicitly modeling the utility and uncertainty of each proposed operation. Since accurate utility estimation can be difficult especially in the early rounds of feature engineering, we design a mechanism within the framework that selectively elicits and incorporates human expert preference feedback, comparing which operations are more promising, into the selection process to help identify more effective operations. Our evaluations on both the synthetic study and the real user study demonstrate that the proposed framework improves feature engineering performance across a variety of tabular datasets and reduces users’ cognitive load during the feature engineering process.
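One common way to select among candidates using both estimated utility and uncertainty is an upper-confidence-bound (UCB) rule. The abstract does not state which selection rule the framework uses, so the Python sketch below is only an assumed illustration of the decoupled "LLM proposes, a separate model selects" idea; the `stats` format and the operation names in the usage note are hypothetical.

```python
import math


def select_operation(stats: dict, total_trials: int, c: float = 1.0) -> str:
    """Pick the candidate operation with the highest UCB score.

    stats maps an operation name to (mean_gain, n_trials), where mean_gain
    is the average validation improvement observed so far. Untried
    operations get an infinite score so they are explored first.
    """
    def score(item):
        _, (mean_gain, n_trials) = item
        if n_trials == 0:
            return math.inf
        bonus = c * math.sqrt(math.log(total_trials) / n_trials)
        return mean_gain + bonus

    return max(stats.items(), key=score)[0]
```

For instance, with `{"log(x)": (0.02, 10), "x*y": (0.01, 1)}` and 11 total trials, the rarely tried `x*y` wins because its uncertainty bonus outweighs its lower mean gain. Human preference feedback, as described in the abstract, could then be used to adjust such estimates when few trials are available.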
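[NLP-111] Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning
[Quick Read]: This paper addresses the limitations of Vision-Language Models in fine-grained spatial understanding and continuous action planning: they excel at textual reasoning but struggle to simulate the dynamics that complex visual reasoning requires. The key to the solution is formulating visual reasoning as video generation, with generated frames acting as intermediate reasoning steps between initial states and solutions, so that visual context is captured explicitly and zero-shot generalization is supported. Experiments on Maze Navigation and Tangram Puzzle show strong zero-shot robustness and adaptation to unseen patterns, and, in sequential planning, a test-time scaling law in which longer generated videos improve generalization, supporting video generation as a feasible and scalable paradigm for visual reasoning.
Link: https://arxiv.org/abs/2601.21037
Authors: Chengzu Li,Zanyi Wang,Jiaang Li,Yi Xu,Han Zhou,Huanyu Zhang,Ruichuan An,Dengyang Jiang,Zhaochong An,Ivan Vulić,Serge Belongie,Anna Korhonen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Notes: 8 pages, 3 figures, 3 tables (26 pages, 13 figures, 6 tables including references and appendices)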
Abstract:Vision-Language Models have excelled at textual reasoning, but they often struggle with fine-grained spatial understanding and continuous action planning, failing to simulate the dynamics required for complex visual reasoning. In this work, we formulate visual reasoning by means of video generation models, positing that generated frames can act as intermediate reasoning steps between initial states and solutions. We evaluate their capacity in two distinct regimes: Maze Navigation for sequential discrete planning with low visual change and Tangram Puzzle for continuous manipulation with high visual change. Our experiments reveal three critical insights: (1) Robust Zero-Shot Generalization: In both tasks, the model demonstrates strong performance on unseen data distributions without specific finetuning. (2) Visual Context: The model effectively uses visual context as explicit control, such as agent icons and tangram shapes, enabling it to maintain high visual consistency and adapt its planning capability robustly to unseen patterns. (3) Visual Test-Time Scaling: We observe a test-time scaling law in sequential planning; increasing the generated video length (visual inference budget) empowers better zero-shot generalization to spatially and temporally complex paths. These findings suggest that video generation is not merely a media tool, but a scalable, generalizable paradigm for visual reasoning.
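[NLP-112] UrduBench: An Urdu Reasoning Benchmark using Contextually Ensembled Translations with Human-in-the-Loop
[Quick Read]: This paper addresses the lack of standardized benchmarks for evaluating the reasoning ability of large language models (LLMs) in low-resource languages, Urdu in particular. Existing approaches are limited by the sensitivity of machine translation and by a focus on general language tasks rather than reasoning, which makes high-quality reasoning evaluation data hard to build. The key to the solution is a contextually ensembled translation framework that combines multiple translation systems with human-in-the-loop validation. While preserving semantic and structural consistency, it ports several widely used English reasoning and question-answering benchmarks (MGSM, MATH-500, CommonSenseQA, and OpenBookQA) into Urdu to form a unified evaluation suite, UrduBench, and systematically evaluates models of different architectures and sizes under multiple prompting strategies. The approach makes Urdu reasoning evaluation more reliable and offers a scalable, standardized evaluation recipe for other low-resource languages.
Link: https://arxiv.org/abs/2601.21000
Authors: Muhammad Ali Shafique,Areej Mehboob,Layba Fiaz,Muhammad Usman Qadeer,Hamza Farooq
Affiliation: Traversaal.ai
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: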
Abstract:Recent advances in large language models (LLMs) have led to strong reasoning capabilities; however, evaluating such models in low-resource languages remains challenging due to the lack of standardized benchmarks. In particular, Urdu reasoning evaluation has been limited by the sensitivity of machine translation and an emphasis on general language tasks rather than reasoning benchmarks. In this paper, we propose a contextually ensembled translation framework with human-in-the-loop validation that leverages multiple translation systems to develop Urdu reasoning benchmarks while preserving contextual and structural integrity. Using this framework, we translate widely adopted reasoning and question-answering benchmarks, including MGSM, MATH-500, CommonSenseQA, and OpenBookQA, into Urdu, collectively referred to as UrduBench, and conduct a comprehensive evaluation of both reasoning-oriented and instruction-tuned LLMs across multiple prompting strategies. Our analysis reveals performance differences across (1) four datasets, (2) five task difficulty levels, (3) diverse model architectures, (4) multiple model scaling settings, and (5) language consistency tests. We find that multi-step and symbolic reasoning tasks pose significant challenges in Urdu, and that stable language alignment is a critical prerequisite for robust reasoning. Overall, our work establishes a scalable methodology for standardized reasoning evaluation in Urdu and provides empirical insights into multilingual reasoning failures. This experimental setup is also broadly applicable to other low-resource languages. The code and datasets will be publicly released.
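[NLP-113] asr_eval: Algorithms and tools for multi-reference and streaming speech recognition evaluation
[Quick Read]: This paper addresses inaccurate labeling in speech recognition evaluation, weak support for non-Latin languages, and spurious metric gains that arise when models adapt to dataset-specific labeling. The key contribution is a string alignment algorithm that supports multi-reference labeling, arbitrary-length insertions, and better word alignment, which improves labeling accuracy for longform audio and for languages with rich word formation such as Russian. The authors also build a new longform test set, DiverseSpeech-Ru, and relabel an existing Russian test set with multiple references, revealing that models depend on the labeling conventions of their training data. On top of this they develop tools for evaluating streaming speech recognition and for visually comparing multiple transcriptions, and provide uniform wrappers for many offline and streaming speech recognition models.
Link: https://arxiv.org/abs/2601.20992
Authors: Oleg Sedukhin,Andrey Kostin
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: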
Abstract:We propose several improvements to the speech recognition evaluation. First, we propose a string alignment algorithm that supports both multi-reference labeling, arbitrary-length insertions and better word alignment. This is especially useful for non-Latin languages, those with rich word formation, to label cluttered or longform speech. Secondly, we collect a novel test set DiverseSpeech-Ru of longform in-the-wild Russian speech with careful multi-reference labeling. We also perform multi-reference relabeling of popular Russian tests set and study fine-tuning dynamics on its corresponding train set. We demonstrate that the model often adopts to dataset-specific labeling, causing an illusion of metric improvement. Based on the improved word alignment, we develop tools to evaluate streaming speech recognition and to align multiple transcriptions to compare them visually. Additionally, we provide uniform wrappers for many offline and streaming speech recognition models. Our code will be made publicly available.
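To make the idea of multi-reference scoring concrete, here is a deliberately simplified sketch: it computes a word error rate against each complete reference transcript and keeps the lowest one. This is not the alignment algorithm the paper proposes (which handles alternatives inside a single alignment and arbitrary-length insertions). The function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def multi_reference_wer(references, hypothesis):
    """Lowest WER of the hypothesis against any acceptable reference."""
    hyp = hypothesis.split()
    best = None
    for ref_text in references:
        ref = ref_text.split()
        wer = edit_distance(ref, hyp) / max(len(ref), 1)
        best = wer if best is None else min(best, wer)
    return best

# Two equally valid ways to transcribe the same utterance.
refs = ["it is twenty five degrees", "it is 25 degrees"]
multi = multi_reference_wer(refs, "it is 25 degrees")
single = multi_reference_wer(refs[:1], "it is 25 degrees")
```

With both references the hypothesis is scored as fully correct. With only the spelled-out reference the same hypothesis is charged two errors out of five words. The paper attributes illusory metric improvements to this kind of labeling dependence.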
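[NLP-114] DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents
[Quick Read]: This paper addresses the lack of a systematic benchmark for agents that perform complex multi-step information-seeking tasks. Existing benchmarks usually target single-answer retrieval or broad-spectrum factuality, so they cannot measure key abilities that agents need in realistic settings: collating fragmented information, de-duplication and entity resolution, and reasoning about when to stop searching. To this end the authors propose DeepSearchQA, a benchmark of 900 handcrafted tasks covering 17 fields. Each task is organized as a causal chain in which each step depends on the information obtained in the previous one, testing long-horizon planning and context retention. The key to the solution is a set of multi-step reasoning tasks grounded in the open web with objectively verifiable answer sets, which exposes the difficulty current state-of-the-art agent architectures have in balancing recall and precision and reveals typical failure modes such as premature stopping and hedging (casting an overly wide net of low-confidence answers).
Link: https://arxiv.org/abs/2601.20975
Authors: Nikita Gupta,Riju Chatterjee,Lukas Haas,Connie Tao,Andrew Wang,Chang Liu,Hidekazu Oiwa,Elena Gribovskaya,Jan Ackermann,John Blitzer,Sasha Goldshtein,Dipanjan Das
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: DeepSearchQA can be found at this https URL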
Abstract:We introduce DeepSearchQA, a 900-prompt benchmark for evaluating agents on difficult multi-step information-seeking tasks across 17 different fields. Unlike traditional benchmarks that target single answer retrieval or broad-spectrum factuality, DeepSearchQA features a dataset of challenging, handcrafted tasks designed to evaluate an agent’s ability to execute complex search plans to generate exhaustive answer lists. This shift in design explicitly tests three critical, yet under-evaluated capabilities: 1) systematic collation of fragmented information from disparate sources, 2) de-duplication and entity resolution to ensure precision, and 3) the ability to reason about stopping criteria within an open-ended search space. Each task is structured as a causal chain, where discovering information for one step is dependent on the successful completion of the previous one, stressing long-horizon planning and context retention. All tasks are grounded in the open web with objectively verifiable answer sets. Our comprehensive evaluation of state-of-the-art agent architectures reveals significant performance limitations: even the most advanced models struggle to balance high recall with precision. We observe distinct failure modes ranging from premature stopping (under-retrieval) to hedging behaviors, where agents cast an overly wide net of low-confidence answers to artificially boost recall. These findings highlight critical headroom in current agent designs and position DeepSearchQA as an essential diagnostic tool for driving future research toward more robust, deep-research capabilities.
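The tension between recall and precision on exhaustive answer lists can be shown with a small sketch. The abstract does not give the benchmark's scoring procedure, so the set-based precision, recall, and F1 below are an assumption, and `normalize` is a crude hypothetical placeholder for real entity resolution.

```python
def normalize(answer):
    """Crude canonical form: lowercase and collapse whitespace."""
    return " ".join(answer.lower().split())

def score_answer_list(predicted, gold):
    """Set-based precision, recall and F1 after de-duplication."""
    pred = {normalize(a) for a in predicted}
    truth = {normalize(a) for a in gold}
    hits = len(pred & truth)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

gold = ["Norway", "Sweden", "Finland", "Denmark"]
# Premature stopping: everything returned is right, but half is missing.
under = score_answer_list(["Norway", "Sweden"], gold)
# Hedging: full recall bought with many low-confidence extras.
hedged = score_answer_list(
    ["Norway", "Sweden", "Finland", "Denmark",
     "Iceland", "Estonia", "Latvia", "Poland"], gold)
```

In this toy case the two failure modes are mirror images: premature stopping gives precision 1.0 and recall 0.5, hedging gives precision 0.5 and recall 1.0, and both land on the same F1. A recall-only metric would reward the hedging strategy.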
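[NLP-115] Text-only adaptation in LLM-based ASR through text denoising ICASSP2026
[Quick Read]: This paper addresses the breakdown of cross-modal alignment that occurs when LLM-based automatic speech recognition (ASR) systems are adapted to a new domain using text-only data. Standard fine-tuning on target-domain text disrupts the alignment between speech and text learned by the projector and degrades performance. The key to the solution is to emulate the audio projection task as a text denoising task: the LLM is trained to recover clean transcripts from noisy inputs, which adapts it to the target domain without architectural changes or additional parameters while preserving the alignment between the speech and text modalities.
Link: https://arxiv.org/abs/2601.20900
Authors: Sergio Burdisso,Esaú Villatoro-Tello,Andrés Carofilis,Shashi Kumar,Kadri Hacioglu,Srikanth Madikeri,Pradeep Rangappa,Manjunath K E,Petr Motlicek,Shankar Venkatesan,Andreas Stolcke
Affiliation: Unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Paper accepted at ICASSP 2026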
Abstract:Adapting automatic speech recognition (ASR) systems based on large language models (LLMs) to new domains using text-only data is a significant yet underexplored challenge. Standard fine-tuning of the LLM on target-domain text often disrupts the critical alignment between speech and text modalities learned by the projector, degrading performance. We introduce a novel text-only adaptation method that emulates the audio projection task by treating it as a text denoising task. Our approach thus trains the LLM to recover clean transcripts from noisy inputs. This process effectively adapts the model to a target domain while preserving cross-modal alignment. Our solution is lightweight, requiring no architectural changes or additional parameters. Extensive evaluation on two datasets demonstrates up to 22.1% relative improvement, outperforming recent state-of-the-art text-only adaptation methods.
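A rough sketch of how text-only denoising data could be prepared is given below. The abstract does not describe the noise process, so the word drops and character swaps are purely illustrative assumptions, and `add_noise` and `make_denoising_pairs` are hypothetical names.

```python
import random

def add_noise(transcript, rng, drop_prob=0.1, swap_prob=0.1):
    """Corrupt a transcript by dropping words and swapping adjacent characters.

    The noise types are illustrative assumptions, not the paper's recipe.
    """
    noisy = []
    for word in transcript.split():
        roll = rng.random()
        if roll < drop_prob:
            continue  # simulate a missing word
        if roll < drop_prob + swap_prob and len(word) > 1:
            i = rng.randrange(len(word) - 1)
            # swap the characters at positions i and i + 1
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        noisy.append(word)
    return " ".join(noisy)

def make_denoising_pairs(transcripts, seed=0):
    """Return (noisy_input, clean_target) pairs for fine-tuning."""
    rng = random.Random(seed)
    return [(add_noise(t, rng), t) for t in transcripts]

# Hypothetical target-domain text; no audio is involved.
transcripts = [
    "the patient reports mild chest pain",
    "schedule a follow up visit next week",
]
pairs = make_denoising_pairs(transcripts)
```

Each pair would serve as one fine-tuning example, with the noisy string as input and the clean transcript as the target. Seeding the generator keeps the corrupted data reproducible across runs.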
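[NLP-116] SW-ASR: A Context-Aware Hybrid ASR Pipeline for Robust Single Word Speech Recognition
[Quick Read]: This paper addresses the low accuracy of single-word automatic speech recognition (ASR) in low-resource, communication-critical settings such as healthcare and emergency response, where recognition suffers from missing linguistic context, noise, pronunciation variation, and channel artifacts. The solution is a modular framework with two core parts: denoising and normalization combined with a hybrid ASR front end (Whisper + Vosk), and a verification layer designed to handle out-of-vocabulary words and degraded audio. The verification layer supports several matching strategies, including embedding similarity, edit distance, and LLM-based matching with optional contextual guidance. It substantially improves robustness on noisy and compressed channels while keeping the low latency that real-time telephony applications require.
Link: https://arxiv.org/abs/2601.20890
Authors: Manali Sharma(1),Riya Naik(1),Buvaneshwari G(1) ((1) Tetranetics Private Limited)
Affiliation: Tetranetics Private Limited
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: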
Abstract:Single-word Automatic Speech Recognition (ASR) is a challenging task due to the lack of linguistic context and sensitivity to noise, pronunciation variation, and channel artifacts, especially in low-resource, communication-critical domains such as healthcare and emergency response. This paper reviews recent deep learning approaches and proposes a modular framework for robust single-word detection. The system combines denoising and normalization with a hybrid ASR front end (Whisper + Vosk) and a verification layer designed to handle out-of-vocabulary words and degraded audio. The verification layer supports multiple matching strategies, including embedding similarity, edit distance, and LLM-based matching with optional contextual guidance. We evaluate the framework on the Google Speech Commands dataset and a curated real-world dataset collected from telephony and messaging platforms under bandwidth-limited conditions. Results show that while the hybrid ASR front end performs well on clean audio, the verification layer significantly improves accuracy on noisy and compressed channels. Context-guided and LLM-based matching yield the largest gains, demonstrating that lightweight verification and context mechanisms can substantially improve single-word ASR robustness without sacrificing latency required for real-time telephony applications.
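Of the matching strategies listed for the verification layer, edit distance is the simplest to sketch. The snippet below is a minimal illustration under assumptions: the vocabulary, the distance threshold, and the name `verify_word` are hypothetical, and the paper's actual layer also supports embedding and LLM-based matching.

```python
def levenshtein(a, b):
    """Character-level edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def verify_word(asr_output, vocabulary, max_distance=2):
    """Map a raw ASR output to the closest expected word, or reject it."""
    word = asr_output.strip().lower()
    best = min(vocabulary, key=lambda v: levenshtein(word, v))
    return best if levenshtein(word, best) <= max_distance else None

# Hypothetical domain vocabulary for a healthcare telephony service.
vocabulary = ["ambulance", "allergy", "insulin", "fracture"]
accepted = verify_word("Ambulanse", vocabulary)
rejected = verify_word("hello", vocabulary)
```

A slightly misrecognized word is snapped to the nearest vocabulary entry, while an unrelated word is rejected. In practice the threshold trades false accepts against false rejects and would need tuning per vocabulary.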
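[NLP-117] Loci Similes: A Benchmark for Extracting Intertextualities in Latin Literature
[Quick Read]: This paper addresses the scarcity of standardized benchmarks and easy-to-use datasets for detecting intertextuality between historical texts, which holds back new methods based on language models (LMs). The key contribution is Loci Similes, a benchmark for Latin intertextuality detection comprising about 172k text segments and 545 expert-verified parallels that link Late Antique authors to classical authors. Using this dataset, the authors establish baselines for state-of-the-art large language models (LLMs) on intertextuality retrieval and classification, providing an evaluation platform and a starting point for future methods.
Link: https://arxiv.org/abs/2601.07533
Authors: Julian Schelb,Michael Wittweiler,Marie Revellio,Barbara Feichtinger,Andreas Spitz
Affiliation: University of Konstanz; University of Zurich
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Digital Libraries (cs.DL)
Comments: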
Abstract:Tracing connections between historical texts is an important part of intertextual research, enabling scholars to reconstruct the virtual library of a writer and identify the sources influencing their creative process. These intertextual links manifest in diverse forms, ranging from direct verbatim quotations to subtle allusions and paraphrases disguised by morphological variation. Language models offer a promising path forward due to their capability of capturing semantic similarity beyond lexical overlap. However, the development of new methods for this task is held back by the scarcity of standardized benchmarks and easy-to-use datasets. We address this gap by introducing Loci Similes, a benchmark for Latin intertextuality detection comprising of a curated dataset of ~172k text segments containing 545 expert-verified parallels linking Late Antique authors to a corpus of classical authors. Using this data, we establish baselines for retrieval and classification of intertextualities with state-of-the-art LLMs.
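[NLP-118] Fake News Detection After LLM Laundering: Measurement and Explanation
[Quick Read]: This paper examines how well detectors identify fake news produced with generative AI, in particular whether content that has been paraphrased by an LLM is harder for existing detectors to catch. The core of the work is a systematic evaluation of different detection models on LLM-paraphrased fake news, with three main contributions: (1) it shows that detectors identify LLM-paraphrased fake news markedly less well than human-written text; (2) through LIME explanations it points to a possible mechanism behind detection failures, sentiment shift, in which paraphrasing changes the sentiment of the original while preserving semantic similarity and thereby misleads the detector; and (3) it releases datasets that augment existing ones with paraphrase outputs and scores to support further research.
Link: https://arxiv.org/abs/2501.18649
Authors: Rupak Kumar Das,Jonathan Dodge
Affiliation: Pennsylvania State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: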
Abstract:With their advanced capabilities, Large Language Models (LLMs) can generate highly convincing and contextually relevant fake news, which can contribute to disseminating misinformation. Though there is much research on fake news detection for human-written text, the field of detecting LLM-generated fake news is still under-explored. This research measures the efficacy of detectors in identifying LLM-paraphrased fake news, in particular, determining whether adding a paraphrase step in the detection pipeline helps or impedes detection. This study contributes: (1) Detectors struggle to detect LLM-paraphrased fake news more than human-written text, (2) We find which models excel at which tasks (evading detection, paraphrasing to evade detection, and paraphrasing for semantic similarity). (3) Via LIME explanations, we discovered a possible reason for detection failures: sentiment shift. (4) We discover a worrisome trend for paraphrase quality measurement: samples that exhibit sentiment shift despite a high BERTSCORE. (5) We provide a pair of datasets augmenting existing datasets with paraphrase outputs and scores. The dataset is available on GitHub
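[NLP-119] Reducing Prompt Sensitivity in LLM-based Speech Recognition Through Learnable Projection ICASSP2026
[Quick Read]: This paper addresses the fact that prompt design significantly affects the performance of LLM-based automatic speech recognition (ASR) systems yet has not been studied systematically. Mainstream architectures use a fixed, manually defined prompt, which is practical but leads to large performance variation, with no single prompt performing best everywhere. The key to the solution is a prompt projector module, a lightweight, model-agnostic extension that learns to map prompt embeddings to more effective regions of the LLM input space. It improves ASR performance and reduces the variability caused by prompt choice without modifying the underlying LLM-based ASR model.
Link: https://arxiv.org/abs/2601.20898
Authors: Sergio Burdisso,Esaú Villatoro-Tello,Shashi Kumar,Srikanth Madikeri,Andrés Carofilis,Pradeep Rangappa,Manjunath K E,Kadri Hacioglu,Petr Motlicek,Andreas Stolcke
Affiliation: Unknown
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Paper accepted at ICASSP 2026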
Abstract:LLM-based automatic speech recognition (ASR), a well-established approach, connects speech foundation models to large language models (LLMs) through a speech-to-LLM projector, yielding promising results. A common design choice in these architectures is the use of a fixed, manually defined prompt during both training and inference. This setup not only enables applicability across a range of practical scenarios, but also helps maximize model performance. However, the impact of prompt design remains underexplored. This paper presents a comprehensive analysis of commonly used prompts across diverse datasets, showing that prompt choice significantly affects ASR performance and introduces instability, with no single prompt performing best across all cases. Inspired by the speech-to-LLM projector, we propose a prompt projector module, a simple, model-agnostic extension that learns to project prompt embeddings to more effective regions of the LLM input space, without modifying the underlying LLM-based ASR model. Experiments on four datasets show that the addition of a prompt projector consistently improves performance, reduces variability, and outperforms the best manually selected prompts.
Computer Vision
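[CV-0] One-step Latent-free Image Generation with Pixel Mean Flows
[Quick Read]: This paper targets two bottlenecks of current diffusion/flow-based image generation: reliance on multi-step sampling, which makes generation slow, and operation in a latent space, which adds complexity. To achieve one-step generation without latents, the authors propose pixel MeanFlow (pMF). Its core idea is to design the network output space and the loss space separately: the network directly predicts pixels on the image manifold (x-prediction), while the loss is defined via MeanFlow in the velocity space. With a simple transformation between the image manifold and the average velocity field, pMF achieves one-step latent-free generation on ImageNet at 256×256 and 512×512 resolution with FIDs of 2.22 and 2.48 respectively, filling a key missing piece in this regime.
Link: https://arxiv.org/abs/2601.22158
Authors: Yiyang Lu,Susie Lu,Qiao Sun,Hanhong Zhao,Zhicheng Jiang,Xianbang Wang,Tianhong Li,Zhengyang Geng,Kaiming He
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report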
Abstract:Modern diffusion/flow-based models for image generation typically exhibit two core characteristics: (i) using multi-step sampling, and (ii) operating in a latent space. Recent advances have made encouraging progress on each aspect individually, paving the way toward one-step diffusion/flow without latents. In this work, we take a further step towards this goal and propose “pixel MeanFlow” (pMF). Our core guideline is to formulate the network output space and the loss space separately. The network target is designed to be on a presumed low-dimensional image manifold (i.e., x-prediction), while the loss is defined via MeanFlow in the velocity space. We introduce a simple transformation between the image manifold and the average velocity field. In experiments, pMF achieves strong results for one-step latent-free generation on ImageNet at 256x256 resolution (2.22 FID) and 512x512 resolution (2.48 FID), filling a key missing piece in this regime. We hope that our study will further advance the boundaries of diffusion/flow-based generative models.
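As a reading aid, the link between x-prediction and the average velocity can be sketched under the conventions of the original MeanFlow formulation. These conventions are assumed here rather than taken from this abstract: a linear path with data at t = 0 and noise at t = 1, where v is the marginal velocity field and x is the endpoint of the flow trajectory through z_t. Whether pMF uses exactly this map, in particular for r > 0, is not stated in the abstract.

```latex
\begin{aligned}
z_t &= (1-t)\,x + t\,\epsilon, \qquad v = \epsilon - x \\
u(z_t, r, t) &= \frac{1}{t-r}\int_r^t v(z_\tau, \tau)\,d\tau
\;\Longrightarrow\; z_r = z_t - (t-r)\,u(z_t, r, t) \\
r = 0:\quad x &= z_t - t\,u(z_t, 0, t)
\;\Longleftrightarrow\; u(z_t, 0, t) = \frac{z_t - x}{t}
\end{aligned}
```

Under these assumptions, a network that outputs an image estimate can be converted to an average-velocity estimate by a fixed affine map. That is one way to read the abstract's "simple transformation between the image manifold and the average velocity field".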
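[CV-1] DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation
[Quick Read]: This paper addresses the weak performance of Vision-Language-Action (VLA) models on dynamic object manipulation, in particular their shortcomings in rapid perception, temporal anticipation, and continuous control. It proposes the DynamicVLA framework, built on three key designs: 1) a compact 0.4B-parameter VLA with a convolutional vision encoder for spatially efficient, structurally faithful encoding, which speeds up multimodal inference; 2) Continuous Inference, which overlaps reasoning and execution to lower latency and adapt promptly to object motion; and 3) Latent-aware Action Streaming, which bridges the perception-execution gap through temporally aligned action execution. To fill the gap in dynamic manipulation data, the authors also build the Dynamic Object Manipulation (DOM) benchmark, which supports efficient collection of both synthetic and real-world data. Evaluations show improved response speed, perception, and generalization in dynamic scenes.
Link: https://arxiv.org/abs/2601.22153
Authors: Haozhe Xie,Beichen Wen,Jiarui Zheng,Zhaoxi Chen,Fangzhou Hong,Haiwen Diao,Ziwei Liu
Affiliation: S-Lab, Nanyang Technological University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL GitHub: this https URL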
Abstract:Manipulating dynamic objects remains an open challenge for Vision-Language-Action (VLA) models, which, despite strong generalization in static manipulation, struggle in dynamic scenarios requiring rapid perception, temporal anticipation, and continuous control. We present DynamicVLA, a framework for dynamic object manipulation that integrates temporal reasoning and closed-loop adaptation through three key designs: 1) a compact 0.4B VLA using a convolutional vision encoder for spatially efficient, structurally faithful encoding, enabling fast multimodal inference; 2) Continuous Inference, enabling overlapping reasoning and execution for lower latency and timely adaptation to object motion; and 3) Latent-aware Action Streaming, which bridges the perception-execution gap by enforcing temporally aligned action execution. To fill the missing foundation of dynamic manipulation data, we introduce the Dynamic Object Manipulation (DOM) benchmark, built from scratch with an auto data collection pipeline that efficiently gathers 200K synthetic episodes across 2.8K scenes and 206 objects, and enables fast collection of 2K real-world episodes without teleoperation. Extensive evaluations demonstrate remarkable improvements in response speed, perception, and generalization, positioning DynamicVLA as a unified framework for general dynamic object manipulation across embodiments.
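[CV-2] Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions
[Quick Read]: This paper investigates why large Vision-Language Models (VLMs) that answer classic visual illusions “correctly” on the original images keep giving the same responses after the illusion factors are inverted, even though the visual change is obvious to human observers. This suggests that their responses may rely on recalling memorized patterns rather than on perceiving the visual input. To clarify the causes, the paper proposes the VI-Probe framework, whose key element is controllable, graded visual perturbations paired with matched visual controls (without the illusion inducer), which disentangles visually grounded responses from language-driven recall. Using Polarity-Flip Consistency, the Template Fixation Index, and an illusion multiplier normalized against matched controls, the method quantifies stability and sensitivity and shows that different models exhibit heterogeneous mechanisms, such as memory override, perception-memory competition, or visual-processing limits. The findings challenge single-cause explanations and motivate probing-based evaluation.
Link: https://arxiv.org/abs/2601.22150
Authors: Xiaoxiao Sun,Mingyang Li,Kun yuan,Min Woo Sun,Mark Endo,Shengguang Wu,Changlin Li,Yuhui Zhang,Zeyu Wang,Serena Yeung-Levy
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages, 31 figures, 13 tables. Project Page: this https URL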
Abstract:Large Vision-Language Models (VLMs) often answer classic visual illusions “correctly” on original images, yet persist with the same responses when illusion factors are inverted, even though the visual change is obvious to humans. This raises a fundamental question: do VLMs perceive visual changes or merely recall memorized patterns? While several studies have noted this phenomenon, the underlying causes remain unclear. To move from observations to systematic understanding, this paper introduces VI-Probe, a controllable visual-illusion framework with graded perturbations and matched visual controls (without illusion inducer) that disentangles visually grounded perception from language-driven recall. Unlike prior work that focuses on averaged accuracy, we measure stability and sensitivity using Polarity-Flip Consistency, Template Fixation Index, and an illusion multiplier normalized against matched controls. Experiments across different families reveal that response persistence arises from heterogeneous causes rather than a single mechanism. For instance, GPT-5 exhibits memory override, Claude-Opus-4.1 shows perception-memory competition, while Qwen variants suggest visual-processing limits. Our findings challenge single-cause views and motivate probing-based evaluation that measures both knowledge and sensitivity to controlled visual change. Data and code are available at this https URL.
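[CV-3] JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion
[Quick Read]: This paper addresses the reliance of existing video dubbing methods on complex, task-specific pipelines that perform poorly in real-world settings. The key to the solution is to adapt a pretrained audio-visual foundation diffusion model with a lightweight LoRA (Low-Rank Adaptation), so that the model conditions on the input audio-video and jointly generates translated audio and synchronized facial motion. For training, the model itself generates multilingual clips containing a language switch, and the face and audio in each half are then inpainted to match the language of the other half. This preserves speaker identity and lip synchronization while improving robustness to complex motion and real-world dynamics.
Link: https://arxiv.org/abs/2601.22143
Authors: Anthony Chen,Naomi Ken Korem,Tavi Halperin,Matan Ben Yosef,Urska Jelercic,Ofir Bibi,Or Patashnik,Daniel Cohen-Or
Affiliation: Tel Aviv University; Lightricks
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project webpage available at this https URL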
Abstract:Audio-Visual Foundation Models, which are pretrained to jointly generate sound and visual content, have recently shown an unprecedented ability to model multi-modal generation and editing, opening new opportunities for downstream tasks. Among these tasks, video dubbing could greatly benefit from such priors, yet most existing solutions still rely on complex, task-specific pipelines that struggle in real-world settings. In this work, we introduce a single-model approach that adapts a foundational audio-video diffusion model for video-to-video dubbing via a lightweight LoRA. The LoRA enables the model to condition on an input audio-video while jointly generating translated audio and synchronized facial motion. To train this LoRA, we leverage the generative model itself to synthesize paired multilingual videos of the same speaker. Specifically, we generate multilingual videos with language switches within a single clip, and then inpaint the face and audio in each half to match the language of the other half. By leveraging the rich generative prior of the audio-visual model, our approach preserves speaker identity and lip synchronization while remaining robust to complex motion and real-world dynamics. We demonstrate that our approach produces high-quality dubbed videos with improved visual fidelity, lip synchronization, and robustness compared to existing dubbing pipelines.
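[CV-4] Routing the Lottery: Adaptive Subnetworks for Heterogeneous Data
[Quick Read]: This paper challenges the assumption in existing pruning methods that a single universal “winning ticket” exists, an assumption that ignores the inherent heterogeneity of real-world data. The authors propose the Routing the Lottery (RTL) framework, whose key idea is to discover multiple adaptive subnetworks (adaptive tickets), each tailored to a class, semantic cluster, or environmental condition, thereby aligning model structure with data heterogeneity. Across multiple datasets and tasks, RTL outperforms single- and multi-model baselines in balanced accuracy and recall while using up to 10 times fewer parameters than independent models. The paper also identifies subnetwork collapse under aggressive pruning and introduces a subnetwork similarity score for label-free diagnosis of oversparsification.
Link: https://arxiv.org/abs/2601.22141
Authors: Grzegorz Stefanski,Alberto Presta,Michal Byra
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: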
Abstract:In pruning, the Lottery Ticket Hypothesis posits that large networks contain sparse subnetworks, or winning tickets, that can be trained in isolation to match the performance of their dense counterparts. However, most existing approaches assume a single universal winning ticket shared across all inputs, ignoring the inherent heterogeneity of real-world data. In this work, we propose Routing the Lottery (RTL), an adaptive pruning framework that discovers multiple specialized subnetworks, called adaptive tickets, each tailored to a class, semantic cluster, or environmental condition. Across diverse datasets and tasks, RTL consistently outperforms single- and multi-model baselines in balanced accuracy and recall, while using up to 10 times fewer parameters than independent models and exhibiting semantically aligned. Furthermore, we identify subnetwork collapse, a performance drop under aggressive pruning, and introduce a subnetwork similarity score that enables label-free diagnosis of oversparsification. Overall, our results recast pruning as a mechanism for aligning model structure with data heterogeneity, paving the way toward more modular and context-aware deep learning.
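The core idea of routing inputs to specialized subnetworks that share one set of weights can be pictured with a toy sketch. Everything below is an illustrative assumption: the weights, the binary masks, the brightness-based router, and the names `masked_forward` and `route` are hypothetical, and the abstract does not describe how RTL actually learns its masks or routing.

```python
# Shared dense weights of a single linear layer (2 outputs, 3 inputs).
WEIGHTS = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0]]

# One binary pruning mask ("adaptive ticket") per data cluster.
MASKS = {
    "bright": [[1, 0, 1], [0, 1, 0]],
    "dark":   [[0, 1, 0], [1, 0, 1]],
}

def masked_forward(weights, mask, x):
    """Compute y = (W * mask) x, using only the weights the mask keeps."""
    return [sum(w * m * xi for w, m, xi in zip(w_row, m_row, x))
            for w_row, m_row in zip(weights, mask)]

def route(x, router):
    """Pick a subnetwork for the input, then run the masked layer."""
    key = router(x)
    return key, masked_forward(WEIGHTS, MASKS[key], x)

def brightness_router(x):
    """Toy router: choose a cluster from the mean input value."""
    return "bright" if sum(x) / len(x) > 0.5 else "dark"

bright_out = route([1.0, 1.0, 1.0], brightness_router)
dark_out = route([0.0, 0.5, 0.25], brightness_router)
```

Both subnetworks reuse the same stored weights, so only the masks grow with the number of clusters. That sharing is how such a design can stay far smaller than a set of independent models.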
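[CV-5] PI-Light: Physics-Inspired Diffusion for Full-Image Relighting ICLR2026
[Quick Read]: This paper addresses three challenges in full-image relighting: the lack of large-scale structured paired data, the difficulty of maintaining physical plausibility, and the limited generalizability imposed by data-driven priors. The solution is a two-stage framework, Physics-Inspired diffusion for full-image reLight (π-Light), with four core designs: (i) batch-aware attention, which improves the consistency of intrinsic predictions across a collection of images; (ii) a physics-guided neural rendering module that enforces physically plausible light transport; (iii) physics-inspired losses that regularize training dynamics toward a physically meaningful landscape, improving generalization to real-world scenes; and (iv) a dataset of diverse objects and scenes captured under controlled lighting, which provides a benchmark for downstream evaluation. Together these components allow efficient fine-tuning of pretrained diffusion models and yield better specular highlights and diffuse reflections on real-world scenes than prior approaches.
Link: https://arxiv.org/abs/2601.22135
Authors: Zhexin Liang,Zhaoxi Chen,Yongwei Chen,Tianyi Wei,Tengfei Wang,Xingang Pan
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICLR 2026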
Abstract:Full-image relighting remains a challenging problem due to the difficulty of collecting large-scale structured paired data, the difficulty of maintaining physical plausibility, and the limited generalizability imposed by data-driven priors. Existing attempts to bridge the synthetic-to-real gap for full-scene relighting remain suboptimal. To tackle these challenges, we introduce Physics-Inspired diffusion for full-image reLight (π-Light, or PI-Light), a two-stage framework that leverages physics-inspired diffusion models. Our design incorporates (i) batch-aware attention, which improves the consistency of intrinsic predictions across a collection of images, (ii) a physics-guided neural rendering module that enforces physically plausible light transport, (iii) physics-inspired losses that regularize training dynamics toward a physically meaningful landscape, thereby enhancing generalizability to real-world image editing, and (iv) a carefully curated dataset of diverse objects and scenes captured under controlled lighting conditions. Together, these components enable efficient finetuning of pretrained diffusion models while also providing a solid benchmark for downstream evaluation. Experiments demonstrate that π-Light synthesizes specular highlights and diffuse reflections across a wide variety of materials, achieving superior generalization to real-world scenes compared with prior approaches.
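[CV-6] Early and Prediagnostic Detection of Pancreatic Cancer from Computed Tomography
[Quick Read]: This paper addresses delayed diagnosis of pancreatic ductal adenocarcinoma (PDAC), which is often detected at a late and inoperable stage, and targets small lesions on prediagnostic CT scans that radiologists overlooked. The key to the solution is ePAI (early Pancreatic cancer detection with Artificial Intelligence), an automated detection system trained on data from 1,598 patients at a single medical center. In internal testing it reaches a sensitivity of 95.3% and a specificity of 98.7% for PDAC smaller than 2 cm, localizing lesions as small as 2 mm, and it maintains strong performance in an external test across 6 centers (AUC 0.918-0.945). ePAI also detects lesions on CT scans taken 3 to 36 months before clinical diagnosis, with a median lead time of 347 days, and outperforms 30 board-certified radiologists in sensitivity, indicating its potential as an assistive tool for early PDAC detection.
Link: https://arxiv.org/abs/2601.22134
Authors: Wenxuan Li,Pedro R. A. S. Bassi,Lizhou Wu,Xinze Zhou,Yuxuan Zhao,Qi Chen,Szymon Plotka,Tianyu Lin,Zheren Zhu,Marisa Martin,Justin Caskey,Shanshan Jiang,Xiaoxi Chen,Jaroslaw B. Ćwikla,Artur Sankowski,Yaping Wu,Sergio Decherchi,Andrea Cavalli,Chandana Lall,Cristian Tomasetti,Yaxing Guo,Xuan Yu,Yuqing Cai,Hualin Qiao,Jie Bao,Chenhan Hu,Ximing Wang,Arkadiusz Sitek,Kai Ding,Heng Li,Meiyun Wang,Dexin Yu,Guang Zhang,Yang Yang,Kang Wang,Alan L. Yuille,Zongwei Zhou
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: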
Abstract:Pancreatic ductal adenocarcinoma (PDAC), one of the deadliest solid malignancies, is often detected at a late and inoperable stage. Retrospective reviews of prediagnostic CT scans, when conducted by expert radiologists aware that the patient later developed PDAC, frequently reveal lesions that were previously overlooked. To help detecting these lesions earlier, we developed an automated system named ePAI (early Pancreatic cancer detection with Artificial Intelligence). It was trained on data from 1,598 patients from a single medical center. In the internal test involving 1,009 patients, ePAI achieved an area under the receiver operating characteristic curve (AUC) of 0.939-0.999, a sensitivity of 95.3%, and a specificity of 98.7% for detecting small PDAC less than 2 cm in diameter, precisely localizing PDAC as small as 2 mm. In an external test involving 7,158 patients across 6 centers, ePAI achieved an AUC of 0.918-0.945, a sensitivity of 91.5%, and a specificity of 88.0%, precisely localizing PDAC as small as 5 mm. Importantly, ePAI detected PDACs on prediagnostic CT scans obtained 3 to 36 months before clinical diagnosis that had originally been overlooked by radiologists. It successfully detected and localized PDACs in 75 of 159 patients, with a median lead time of 347 days before clinical diagnosis. Our multi-reader study showed that ePAI significantly outperformed 30 board-certified radiologists by 50.3% (P < 0.05) in sensitivity while maintaining a comparable specificity of 95.4% in detecting PDACs early and prediagnostic. These findings suggest its potential of ePAI as an assistive tool to improve early detection of pancreatic cancer.
[CV-7] EditYourself: Audio-Driven Generation and Manipulation of Talking Head Videos with Diffusion Transformers
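[Quick read]: This paper addresses a key gap in current generative video models when editing pre-recorded video: how to make precise changes to existing footage without breaking motion continuity, temporal coherence, speaker identity, or lip-sync accuracy. The core of the solution is EditYourself, a DiT (Diffusion Transformer)-based framework for audio-driven video-to-video (V2V) editing. By adding audio conditioning and region-aware, edit-focused training extensions, it enables transcript-based modification of talking head videos, including the seamless addition, removal, and retiming of visually spoken content. Using spatiotemporal inpainting, it synthesizes realistic human motion with accurate lip synchronization in newly added segments while maintaining visual fidelity and identity consistency over long durations, offering a practical generative editing tool for professional video post-production.
Link: https://arxiv.org/abs/2601.22127
Authors: John Flynn,Wolfgang Paier,Dimitar Dinev,Sam Nhut Nguyen,Hayk Poghosyan,Manuel Toribio,Sandipan Banerjee,Guy Gafni
Affiliations: Pipio AI; Amazon
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Project page: this https URL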
Abstract:Current generative video models excel at producing novel content from text and image prompts, but leave a critical gap in editing existing pre-recorded videos, where minor alterations to the spoken script require preserving motion, temporal coherence, speaker identity, and accurate lip synchronization. We introduce EditYourself, a DiT-based framework for audio-driven video-to-video (V2V) editing that enables transcript-based modification of talking head videos, including the seamless addition, removal, and retiming of visually spoken content. Building on a general-purpose video diffusion model, EditYourself augments its V2V capabilities with audio conditioning and region-aware, edit-focused training extensions. This enables precise lip synchronization and temporally coherent restructuring of existing performances via spatiotemporal inpainting, including the synthesis of realistic human motion in newly added segments, while maintaining visual fidelity and identity consistency over long durations. This work represents a foundational step toward generative video models as practical tools for professional video post-production.
[CV-8] Creative Image Generation with Diffusion Model
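[Quick read]: This paper addresses the limited creative range of generative image models, namely how to systematically generate novel, rare, and visually appealing images rather than relying on heuristics such as manual concept blending or subcategory exclusion. The key idea is to define creativity as the inverse probability of an image in the CLIP embedding space and to steer the diffusion model's output distribution toward low-probability regions, yielding controllable generation of rare and imaginative content. Pullback mechanisms are introduced so that high creativity does not come at the cost of visual fidelity.
Link: https://arxiv.org/abs/2601.22125
Authors: Kunpeng Song,Ahmed Elgammal
Affiliations: Rutgers University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL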
Abstract:Creative image generation has emerged as a compelling area of research, driven by the need to produce novel and high-quality images that expand the boundaries of imagination. In this work, we propose a novel framework for creative generation using diffusion models, where creativity is associated with the inverse probability of an image’s existence in the CLIP embedding space. Unlike prior approaches that rely on a manual blending of concepts or exclusion of subcategories, our method calculates the probability distribution of generated images and drives it towards low-probability regions to produce rare, imaginative, and visually captivating outputs. We also introduce pullback mechanisms, achieving high creativity without sacrificing visual fidelity. Extensive experiments on text-to-image diffusion models demonstrate the effectiveness and efficiency of our creative generation framework, showcasing its ability to produce unique, novel, and thought-provoking images. This work provides a new perspective on creativity in generative models, offering a principled method to foster innovation in visual content synthesis.
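To make the idea of "creativity as inverse probability in an embedding space" concrete, here is a minimal sketch. It assumes a diagonal Gaussian density fitted to reference embeddings and uses toy 2-D vectors in place of real CLIP embeddings; the density model and the function names are illustrative assumptions, not the authors' estimator.

```python
import math


def fit_diag_gaussian(embeddings):
    """Fit a per-dimension mean and variance to reference embeddings
    (assumed density model; the paper may use something else)."""
    n, d = len(embeddings), len(embeddings[0])
    mean = [sum(e[j] for e in embeddings) / n for j in range(d)]
    var = [sum((e[j] - mean[j]) ** 2 for e in embeddings) / n + 1e-6
           for j in range(d)]
    return mean, var


def log_prob(x, mean, var):
    """Log-density of x under the diagonal Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def creativity_score(x, mean, var):
    """Higher score = lower probability = rarer embedding."""
    return -log_prob(x, mean, var)


# Toy stand-ins for embeddings of typical generated images.
reference = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
mean, var = fit_diag_gaussian(reference)
typical = creativity_score([0.5, 0.5], mean, var)
rare = creativity_score([3.0, 3.0], mean, var)
```

In a guided sampler, a score of this kind would be pushed upward during generation, with a pullback term keeping samples close enough to the data distribution to stay visually plausible.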
[CV-9] SINA: A Circuit Schematic Image-to-Netlist Generator Using Artificial Intelligence
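[Quick read]: This paper addresses the difficulty of component recognition and connectivity inference when converting circuit schematic images into machine-readable netlists. The key solution is SINA, an end-to-end automated framework that combines deep learning for accurate component detection, Connected-Component Labeling (CCL) for precise electrical connectivity extraction, and Optical Character Recognition (OCR) for retrieving component reference designators, with a Vision-Language Model (VLM) used to make reference designator assignment more reliable. In the reported experiments it reaches 96.47% overall netlist-generation accuracy, 2.72x higher than state-of-the-art approaches.
Link: https://arxiv.org/abs/2601.22114
Authors: Saoud Aldowaish,Yashwanth Karumanchi,Kai-Chen Chiang,Soroosh Noorzad,Morteza Fayazi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: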
Abstract:Current methods for converting circuit schematic images into machine-readable netlists struggle with component recognition and connectivity inference. In this paper, we present SINA, an open-source, fully automated circuit schematic image-to-netlist generator. SINA integrates deep learning for accurate component detection, Connected-Component Labeling (CCL) for precise connectivity extraction, and Optical Character Recognition (OCR) for component reference designator retrieval, while employing a Vision-Language Model (VLM) for reliable reference designator assignments. In our experiments, SINA achieves 96.47% overall netlist-generation accuracy, which is 2.72x higher than state-of-the-art approaches.
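Connected-Component Labeling, which the abstract names as the connectivity-extraction step, groups touching wire pixels into nets. The following is a generic sketch of CCL with 4-connectivity on a binary grid. It is not SINA's implementation, and the tiny `wires` grid is a made-up example.

```python
from collections import deque


def label_components(grid):
    """Assign a distinct positive label to each 4-connected group of 1-pixels.
    Returns the label map and the number of components found."""
    rows, cols = len(grid), len(grid[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or labels[r][c]:
                continue
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# Hypothetical wire mask: two separate nets.
wires = [
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]
labels, num_nets = label_components(wires)
```

Component terminals that land on pixels with the same label would then be listed under the same net in the generated netlist.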
[CV-10] RefAny3D: 3D Asset-Referenced Diffusion Models for Image Generation ICLR2026
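[Quick read]: This paper addresses the limitation that existing reference-based image generation methods accept only a single 2D reference image and cannot use 3D assets (such as multi-view RGB images and point maps) as constraints, which restricts their flexibility and consistency in practical content creation. The key solution is a cross-domain diffusion model with dual-branch perception that jointly models the multi-view RGB images and point maps (canonical-space coordinates) of a 3D asset, achieving precise spatial consistency between the generated image and the 3D reference. A spatially aligned dual-branch generation architecture and a domain-decoupled generation mechanism produce RGB images and point maps that are spatially aligned but content-disentangled, linking 2D image attributes with 3D asset attributes and improving fidelity to the 3D reference.
Link: https://arxiv.org/abs/2601.22094
Authors: Hanzhuo Huang,Qingyang Bao,Zekai Gu,Zhongshuo Du,Cheng Lin,Yuan Liu,Sibei Yang
Affiliations: ShanghaiTech University; Sun Yat-sen University; University of Toronto; The Hong Kong University of Science and Technology; SynWorld; Macau University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2026. Project page: this https URL Codes: this https URL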
Abstract:In this paper, we propose a 3D asset-referenced diffusion model for image generation, exploring how to integrate 3D assets into image diffusion models. Existing reference-based image generation methods leverage large-scale pretrained diffusion models and demonstrate strong capability in generating diverse images conditioned on a single reference image. However, these methods are limited to single-image references and cannot leverage 3D assets, constraining their practical versatility. To address this gap, we present a cross-domain diffusion model with dual-branch perception that leverages multi-view RGB images and point maps of 3D assets to jointly model their colors and canonical-space coordinates, achieving precise consistency between generated images and the 3D references. Our spatially aligned dual-branch generation architecture and domain-decoupled generation mechanism ensure the simultaneous generation of two spatially aligned but content-disentangled outputs, RGB images and point maps, linking 2D image attributes with 3D asset attributes. Experiments show that our approach effectively uses 3D assets as references to produce images consistent with the given assets, opening new possibilities for combining diffusion models with 3D content creation.
[CV-11] BLO-Inst: Bi-Level Optimization Based Alignment of YOLO and SAM for Robust Instance Segmentation
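[Quick read]: This paper addresses the fact that the Segment Anything Model (SAM) relies on manual prompts, which limits fully automated deployment, and targets two core flaws of existing pipelines that use a detector as the prompt generator: objective mismatch, where the detector is optimized for geometric localization rather than for the prompting context SAM needs, and alignment overfitting, where under standard joint training the detector memorizes sample-specific prompt adjustments instead of learning a general prompting policy. The key solution is BLO-Inst, a framework that aligns detection and segmentation objectives through bi-level optimization: the lower level fine-tunes SAM on a subset $D_1$ to maximize segmentation fidelity, and the upper level updates the detector on a disjoint subset $D_2$ so that its bounding boxes minimize the validation loss of the fine-tuned SAM. This turns the detector into a segmentation-aware prompt generator optimized for downstream mask quality rather than localization accuracy alone.
Link: https://arxiv.org/abs/2601.22061
Authors: Li Zhang,Pengtao Xie
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: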
Abstract:The Segment Anything Model has revolutionized image segmentation with its zero-shot capabilities, yet its reliance on manual prompts hinders fully automated deployment. While integrating object detectors as prompt generators offers a pathway to automation, existing pipelines suffer from two fundamental limitations: objective mismatch, where detectors optimized for geometric localization do not correspond to the optimal prompting context required by SAM, and alignment overfitting in standard joint training, where the detector simply memorizes specific prompt adjustments for training samples rather than learning a generalizable policy. To bridge this gap, we introduce BLO-Inst, a unified framework that aligns detection and segmentation objectives by bi-level optimization. We formulate the alignment as a nested optimization problem over disjoint data splits. In the lower level, SAM is fine-tuned to maximize segmentation fidelity given the current detection proposals on a subset $D_1$. In the upper level, the detector is updated to generate bounding boxes that explicitly minimize the validation loss of the fine-tuned SAM on a separate subset $D_2$. This effectively transforms the detector into a segmentation-aware prompt generator, optimizing the bounding boxes not just for localization accuracy, but for downstream mask quality. Extensive experiments demonstrate that BLO-Inst achieves superior performance, outperforming standard baselines on tasks in general and biomedical domains.
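The nested structure described in this abstract can be illustrated with a deliberately tiny stand-in. In the sketch below, a scalar `theta` plays the role of the detector (it shifts the "prompt"), a scalar `w` plays the role of the fine-tuned segmenter, the lower level is solved in closed form on split D1, and the upper level follows a finite-difference gradient of the D2 validation loss. All of these choices, and the toy data, are assumptions made for illustration. The paper itself works with a YOLO detector and SAM.

```python
def fit_lower(theta, d1):
    """Lower level: least-squares fit of the 'segmenter' weight w on split D1,
    given the current prompt offset theta."""
    num = sum((x + theta) * y for x, y in d1)
    den = sum((x + theta) ** 2 for x, _ in d1)
    return num / den


def upper_loss(theta, d1, d2):
    """Upper-level objective: loss on D2 of the w that was fitted on D1."""
    w = fit_lower(theta, d1)
    return sum((w * (x + theta) - y) ** 2 for x, y in d2) / len(d2)


def bilevel_optimize(d1, d2, theta=0.0, lr=0.05, steps=30, eps=1e-4):
    """Alternate: re-solve the lower level inside each upper-level gradient step."""
    for _ in range(steps):
        grad = (upper_loss(theta + eps, d1, d2)
                - upper_loss(theta - eps, d1, d2)) / (2 * eps)
        theta -= lr * grad
    return theta, fit_lower(theta, d1)


# Toy data generated from y = 2 * (x + 1), split into disjoint subsets.
d1 = [(0.0, 2.0), (1.0, 4.0), (2.0, 6.0)]
d2 = [(3.0, 8.0), (4.0, 10.0)]
theta, w = bilevel_optimize(d1, d2)
```

The point of the split is that `theta` is never rewarded for fitting D1 directly; it only improves when the model fitted on D1 generalizes to D2, which is the mechanism the abstract credits with reducing alignment overfitting.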
[CV-12] Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models
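[Quick read]: This paper addresses the difficulty multimodal large language models (MLLMs) have in retrieving key evidence for complex real-world visual tasks, given their limited internal world knowledge and restricted search ability. Existing methods follow a "reasoning-then-tool-call" paradigm but assume that a single image query or a few text queries are enough to locate the needed information, ignoring the heavy visual noise and multi-source evidence of real scenarios, and they are limited in reasoning depth and search breadth. The key solution is Vision-DeepResearch, a new multimodal deep-research paradigm that performs multi-turn, multi-entity, and multi-scale visual and textual search to query real-world search engines robustly under heavy noise. Cold-start supervision and reinforcement learning (RL) training internalize these capabilities into the MLLM, supporting dozens of reasoning steps and hundreds of engine interactions end to end. The authors report that it substantially outperforms existing multimodal deep-research MLLMs and workflows built on strong closed-source foundation models such as GPT-5, Gemini-2.5-pro, and Claude-4-Sonnet.
Link: https://arxiv.org/abs/2601.22060
Authors: Wenxuan Huang,Yu Zeng,Qiuchen Wang,Zhen Fang,Shaosheng Cao,Zheng Chu,Qingyu Yin,Shuang Chen,Zhenfei Yin,Lin Chen,Zehui Chen,Yao Hu,Philip Torr,Feng Zhao,Wanli Ouyang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: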
Abstract:Multimodal large language models (MLLMs) have achieved remarkable success across a broad range of vision tasks. However, constrained by the capacity of their internal world knowledge, prior work has proposed augmenting MLLMs by "reasoning-then-tool-call" for visual and textual search engines to obtain substantial gains on tasks requiring extensive factual information. However, these approaches typically define multimodal search in a naive setting, assuming that a single full-level or entity-level image query and a few text queries suffice to retrieve the key evidence needed to answer the question, which is unrealistic in real-world scenarios with substantial visual noise. Moreover, they are often limited in reasoning depth and search breadth, making it difficult to solve complex questions that require aggregating evidence from diverse visual and textual sources. To address this, we propose Vision-DeepResearch, a new multimodal deep-research paradigm that performs multi-turn, multi-entity and multi-scale visual and textual search to robustly hit real-world search engines under heavy noise. Vision-DeepResearch supports dozens of reasoning steps and hundreds of engine interactions, while internalizing deep-research capabilities into the MLLM via cold-start supervision and RL training, resulting in a strong end-to-end multimodal deep-research MLLM. It substantially outperforms existing multimodal deep-research MLLMs, as well as workflows built on strong closed-source foundation models such as GPT-5, Gemini-2.5-pro and Claude-4-Sonnet. The code will be released at this https URL.
[CV-13] Unsupervised Decomposition and Recombination with Discriminator-Driven Diffusion Models
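[Quick read]: This paper addresses how diffusion models can learn factorized latent representations without factor-level supervision and still support high-quality compositional generation. The core challenge is to discover semantically independent factors (such as background, illumination, and object attributes in images, or motion components in robotic videos) without explicit labels, while keeping recombined samples physically and semantically consistent. The key solution is an adversarial training signal: a discriminator is trained to distinguish single-source samples from samples generated by recombining factors across sources, and the generator is optimized to fool it, which pushes the model toward physically plausible and semantically consistent latent factors. The method achieves lower FID and better disentanglement (MIG and MCC) than implementations of prior baselines on CelebA-HQ, Virtual KITTI, CLEVR, and Falcor3D, and extends to robotic video trajectories, where recombining learned action components significantly increases state-space coverage on the LIBERO benchmark.
Link: https://arxiv.org/abs/2601.22057
Authors: Archer Wang,Emile Anand,Yilun Du,Marin Soljačić
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 28 pages, 16 figures, 4 tables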
Abstract:Decomposing complex data into factorized representations can reveal reusable components and enable synthesizing new samples via component recombination. We investigate this in the context of diffusion-based models that learn factorized latent spaces without factor-level supervision. In images, factors can capture background, illumination, and object attributes; in robotic videos, they can capture reusable motion components. To improve both latent factor discovery and quality of compositional generation, we introduce an adversarial training signal via a discriminator trained to distinguish between single-source samples and those generated by recombining factors across sources. By optimizing the generator to fool this discriminator, we encourage physical and semantic consistency in the resulting recombinations. Our method outperforms implementations of prior baselines on CelebA-HQ, Virtual KITTI, CLEVR, and Falcor3D, achieving lower FID scores and better disentanglement as measured by MIG and MCC. Furthermore, we demonstrate a novel application to robotic video trajectories: by recombining learned action components, we generate diverse sequences that significantly increase state-space coverage for exploration on the LIBERO benchmark.
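The recombination step that the discriminator is trained to detect can be shown with a toy example. This is only a sketch: the factor names (`background`, `lighting`, `object`) and the dictionary representation are hypothetical stand-ins for the learned latent factors, and no diffusion model or discriminator is actually trained here.

```python
def recombine(latents_a: dict, latents_b: dict, from_b: set) -> dict:
    """Build a new latent code that takes the factors named in `from_b`
    from sample B and every other factor from sample A."""
    return {name: (latents_b[name] if name in from_b else latents_a[name])
            for name in latents_a}


def discriminator_batch(single_source: list, recombined: list) -> list:
    """Label single-source codes 1 and recombined codes 0, the two classes
    the discriminator is asked to tell apart."""
    return [(z, 1) for z in single_source] + [(z, 0) for z in recombined]


a = {"background": [0.1, 0.2], "lighting": [0.9], "object": [0.3, 0.3]}
b = {"background": [0.7, 0.8], "lighting": [0.1], "object": [0.5, 0.6]}
mixed = recombine(a, b, {"lighting"})
batch = discriminator_batch([a, b], [mixed])
```

In the setup the abstract describes, the generator is then optimized so that samples decoded from codes like `mixed` become indistinguishable from single-source samples.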
[CV-14] MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources
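[Quick read]: This paper addresses the challenges of extending scaled-up training to metric depth estimation, including heterogeneous sensor noise, camera-dependent biases, and metric ambiguity in cross-source 3D data. The key solution is a simple and scalable pretraining framework built around the Sparse Metric Prompt: randomly masking depth maps produces a universal interface that decouples spatial reasoning from sensor and camera biases, so metric depth can be learned from noisy, diverse 3D data without manually engineered prompts, camera-specific modeling, or task-specific architectures.
Link: https://arxiv.org/abs/2601.22054
Authors: Baorui Ma,Jiahui Yang,Donglin Di,Xuancheng Zhang,Jianxun Cui,Hao Li,Yan Xie,Wei Chen
Affiliations: Li Auto Inc
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL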
Abstract:Scaling has powered recent advances in vision foundation models, yet extending this paradigm to metric depth estimation remains challenging due to heterogeneous sensor noise, camera-dependent biases, and metric ambiguity in noisy cross-source 3D data. We introduce Metric Anything, a simple and scalable pretraining framework that learns metric depth from noisy, diverse 3D sources without manually engineered prompts, camera-specific modeling, or task-specific architectures. Central to our approach is the Sparse Metric Prompt, created by randomly masking depth maps, which serves as a universal interface that decouples spatial reasoning from sensor and camera biases. Using about 20M image-depth pairs spanning reconstructed, captured, and rendered 3D data across 10000 camera models, we demonstrate, for the first time, a clear scaling trend in the metric depth track. The pretrained model excels at prompt-driven tasks such as depth completion, super-resolution and radar-camera fusion, while its distilled prompt-free student achieves state-of-the-art results on monocular depth estimation, camera intrinsics recovery, single/multi-view metric 3D reconstruction, and VLA planning. We also show that using the pretrained ViT of Metric Anything as a visual encoder significantly boosts Multimodal Large Language Model capabilities in spatial intelligence. These results show that metric depth estimation can benefit from the same scaling laws that drive modern foundation models, establishing a new path toward scalable and efficient real-world metric perception. We open-source MetricAnything at this http URL to support community research.
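The abstract says the Sparse Metric Prompt is created by randomly masking depth maps. A minimal sketch of that operation follows. The `keep_ratio` parameter, the use of 0.0 to mark missing depth, and the function name are assumptions for illustration, not details from the paper.

```python
import random


def sparse_metric_prompt(depth, keep_ratio, seed=0):
    """Randomly keep a fraction of metric depth values and blank out the rest.
    Returns the sparse prompt and the boolean keep-mask."""
    rng = random.Random(seed)
    prompt, mask = [], []
    for row in depth:
        keep_row = [rng.random() < keep_ratio for _ in row]
        mask.append(keep_row)
        # Masked-out pixels carry no metric information (0.0 is an assumed marker).
        prompt.append([d if keep else 0.0 for d, keep in zip(row, keep_row)])
    return prompt, mask


# Toy 2x3 depth map in metres.
depth_map = [[1.5, 2.0, 2.5],
             [3.0, 3.5, 4.0]]
prompt, mask = sparse_metric_prompt(depth_map, keep_ratio=0.3)
```

Because the model only ever sees a sparse subset of metric values, the same interface can in principle accept LiDAR returns, radar points, or low-resolution depth at inference time, which matches the prompt-driven tasks the abstract lists.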
[CV-15] PLANING: A Loosely Coupled Triangle-Gaussian Framework for Streaming 3D Reconstruction
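[Quick read]: This paper addresses the difficulty of achieving both high-quality rendering and accurate geometry in streaming reconstruction from monocular image sequences, where existing methods typically trade one for the other. The key solution is PLANING, a framework built on a hybrid representation that loosely couples explicit geometric primitives with neural Gaussians so that geometry and appearance are modeled in a decoupled manner. This decoupling supports an online initialization and optimization strategy that separates geometry updates from appearance updates, substantially reducing structural redundancy and improving the stability and efficiency of streaming reconstruction.
Link: https://arxiv.org/abs/2601.22046
Authors: Changjian Jiang,Kerui Ren,Xudong Li,Kaiwen Song,Linning Xu,Tao Lu,Junting Dong,Yu Zhang,Bo Dai,Mulin Yu
Affiliations: Zhejiang University; Shanghai Artificial Intelligence Laboratory; Shanghai Jiao Tong University; The University of Science and Technology of China; The Chinese University of Hong Kong; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: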
Abstract:Streaming reconstruction from monocular image sequences remains challenging, as existing methods typically favor either high-quality rendering or accurate geometry, but rarely both. We present PLANING, an efficient on-the-fly reconstruction framework built on a hybrid representation that loosely couples explicit geometric primitives with neural Gaussians, enabling geometry and appearance to be modeled in a decoupled manner. This decoupling supports an online initialization and optimization strategy that separates geometry and appearance updates, yielding stable streaming reconstruction with substantially reduced structural redundancy. PLANING improves dense mesh Chamfer-L2 by 18.52% over PGSR, surpasses ARTDECO by 1.31 dB PSNR, and reconstructs ScanNetV2 scenes in under 100 seconds, over 5x faster than 2D Gaussian Splatting, while matching the quality of offline per-scene optimization. Beyond reconstruction quality, the structural clarity and computational efficiency of PLANING make it well suited for a broad range of downstream applications, such as enabling large-scale scene modeling and simulation-ready environments for embodied AI. Project page: this https URL.
[CV-16] Urban Neural Surface Reconstruction from Constrained Sparse Aerial Imagery with 3D SAR Fusion
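[Quick read]: This paper addresses the geometric ambiguity and instability that sparse views cause in urban 3D reconstruction, especially in aerial remote sensing, where flight paths, terrain, and cost limit how much multi-view imagery can be acquired. The key solution is the first neural surface reconstruction (NSR) framework that fuses 3D synthetic aperture radar (SAR) point clouds with aerial imagery: radar-derived spatial constraints are integrated into a signed distance function (SDF)-based NSR backbone and guide structure-aware ray selection and adaptive sampling, making optimization more stable and efficient. The method markedly improves reconstruction accuracy, completeness, and robustness under highly sparse and oblique-view conditions.
Link: https://arxiv.org/abs/2601.22045
Authors: Da Li,Chen Yao,Tong Mao,Jiacheng Bao,Houjun Sun
Affiliations: Beijing Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: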
Abstract:Neural surface reconstruction (NSR) has recently shown strong potential for urban 3D reconstruction from multi-view aerial imagery. However, existing NSR methods often suffer from geometric ambiguity and instability, particularly under sparse-view conditions. This issue is critical in large-scale urban remote sensing, where aerial image acquisition is limited by flight paths, terrain, and cost. To address this challenge, we present the first urban NSR framework that fuses 3D synthetic aperture radar (SAR) point clouds with aerial imagery for high-fidelity reconstruction under constrained, sparse-view settings. 3D SAR can efficiently capture large-scale geometry even from a single side-looking flight path, providing robust priors that complement photometric cues from images. Our framework integrates radar-derived spatial constraints into an SDF-based NSR backbone, guiding structure-aware ray selection and adaptive sampling for stable and efficient optimization. We also construct the first benchmark dataset with co-registered 3D SAR point clouds and aerial imagery, facilitating systematic evaluation of cross-modal 3D reconstruction. Extensive experiments show that incorporating 3D SAR markedly enhances reconstruction accuracy, completeness, and robustness compared with single-modality baselines under highly sparse and oblique-view conditions, highlighting a viable route toward scalable high-fidelity urban reconstruction with advanced airborne and spaceborne optical-SAR sensing.
[CV-17] Learning to Communicate Across Modalities: Perceptual Heterogeneity in Multi-Agent Systems
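[Quick read]: This paper addresses how agents with heterogeneous perception can arrive at a consistent communication protocol, that is, how agents that differ in modality and lack shared perceptual grounding can negotiate a stable and meaningful way to communicate. The key element is a heterogeneous multi-step binary communication game. Within it, multimodal systems converge to class-consistent messages grounded in perceptual input despite perceptual misalignment. Bit perturbation experiments indicate that meaning is encoded in a distributional rather than compositional manner, since each bit's contribution depends on its surrounding pattern. Systems trained in different perceptual worlds cannot communicate directly, but limited fine-tuning enables successful cross-system communication. The work positions emergent communication as a framework for studying how representations transfer and adapt across heterogeneous modalities.
Link: https://arxiv.org/abs/2601.22041
Authors: Naomi Pitzer,Daniela Mihai
Affiliations: University of Southampton
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: To be published in EvoLang XVI proceedings. 15 pages, 17 figures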
Abstract:Emergent communication offers insight into how agents develop shared structured representations, yet most research assumes homogeneous modalities or aligned representational spaces, overlooking the perceptual heterogeneity of real-world settings. We study a heterogeneous multi-step binary communication game where agents differ in modality and lack perceptual grounding. Despite perceptual misalignment, multimodal systems converge to class-consistent messages grounded in perceptual input. Unimodal systems communicate more efficiently, using fewer bits and achieving lower classification entropy, while multimodal agents require greater information exchange and exhibit higher uncertainty. Bit perturbation experiments provide strong evidence that meaning is encoded in a distributional rather than compositional manner, as each bit’s contribution depends on its surrounding pattern. Finally, interoperability analyses show that systems trained in different perceptual worlds fail to directly communicate, but limited fine-tuning enables successful cross-system communication. This work positions emergent communication as a framework for studying how agents adapt and transfer representations across heterogeneous modalities, opening new directions for both theory and experimentation.
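A bit perturbation experiment of the kind mentioned in this abstract flips one bit of a message at a time and checks whether the receiver's decision changes. The sketch below uses a hypothetical hand-written receiver (`toy_receiver`) instead of a trained agent, purely to show how a bit's effect can depend on its context.

```python
def bit_flip_effects(message, decode):
    """For each position, report whether flipping that single bit changes
    the receiver's decoded class."""
    baseline = decode(message)
    effects = []
    for i in range(len(message)):
        perturbed = list(message)
        perturbed[i] ^= 1  # flip exactly one bit
        effects.append(decode(perturbed) != baseline)
    return effects


def toy_receiver(message):
    # Hypothetical decoder: class 1 only when the first two bits are both set.
    return int(message[0] == 1 and message[1] == 1)


effects_a = bit_flip_effects([1, 1, 0], toy_receiver)
effects_b = bit_flip_effects([1, 0, 0], toy_receiver)
```

Here flipping bit 0 matters in the first message but not in the second, so the bit has no fixed standalone meaning. That is the kind of context dependence the abstract reads as evidence for distributional rather than compositional encoding.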
[CV-18] Understanding Multimodal Complementarity for Single-Frame Action Anticipation
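[Quick read]: This paper addresses the heavy reliance of human action anticipation on dense temporal information, challenging the assumption that a video sequence is required to predict future actions accurately. The core question is how much information about the future is already encoded in a single frame and how it can be exploited. The key solution is AAG+, a single-frame framework that combines RGB appearance, depth-based geometric cues, and semantic representations of past actions, consolidating the most effective choices of multimodal fusion strategy, keyframe selection policy, and past-action history source. AAG+ matches or exceeds state-of-the-art video-based methods on IKEA-ASM, Meccano, and Assembly101, clarifying when dense temporal modeling is necessary and when a carefully selected glimpse is sufficient.
Link: https://arxiv.org/abs/2601.22039
Authors: Manuel Benavent-Lledo,Konstantinos Bacharidis,Konstantinos Papoutsakis,Antonis Argyros,Jose Garcia-Rodriguez
Affiliations: University of Alicante; FORTH; University of Crete; Hellenic Mediterranean University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: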
Abstract:Human action anticipation is commonly treated as a video understanding problem, implicitly assuming that dense temporal information is required to reason about future actions. In this work, we challenge this assumption by investigating what can be achieved when action anticipation is constrained to a single visual observation. We ask a fundamental question: how much information about the future is already encoded in a single frame, and how can it be effectively exploited? Building on our prior work on Action Anticipation at a Glimpse (AAG), we conduct a systematic investigation of single-frame action anticipation enriched with complementary sources of information. We analyze the contribution of RGB appearance, depth-based geometric cues, and semantic representations of past actions, and investigate how different multimodal fusion strategies, keyframe selection policies and past-action history sources influence anticipation performance. Guided by these findings, we consolidate the most effective design choices into AAG+, a refined single-frame anticipation framework. Despite operating on a single frame, AAG+ consistently improves upon the original AAG and achieves performance comparable to, or exceeding, that of state-of-the-art video-based methods on challenging anticipation benchmarks including IKEA-ASM, Meccano and Assembly101. Our results offer new insights into the limits and potential of single-frame action anticipation, and clarify when dense temporal modeling is necessary and when a carefully selected glimpse is sufficient.
[CV-19] Drive-JEPA: Video JEPA Meets Multimodal Trajectory Distillation for End-to-End Driving
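[Quick read]: This paper addresses two planning bottlenecks in end-to-end autonomous driving: video world-model pretraining has so far brought only limited gains in scene understanding, and because each driving scene typically provides a single human trajectory, multimodal behavior is hard to learn. The key solution is Drive-JEPA. First, it adapts the Video Joint-Embedding Predictive Architecture (V-JEPA) to end-to-end driving, pretraining a ViT encoder on large-scale driving videos to obtain predictive representations aligned with trajectory planning. Second, it introduces a proposal-centric planner that distills diverse simulator-generated trajectories together with human trajectories and uses a momentum-aware selection mechanism to promote stable and safe behavior. On NAVSIM, the V-JEPA representation with a simple transformer-based decoder outperforms prior methods by 3 PDMS in the perception-free setting, and the full framework reaches 93.3 PDMS on v1 and 87.8 EPDMS on v2, a new state of the art.
Link: https://arxiv.org/abs/2601.22032
Authors: Linhan Wang,Zichong Yang,Chen Bai,Guoxiang Zhang,Xiaotong Liu,Xiaoyin Zheng,Xiao-Xiao Long,Chang-Tien Lu,Cheng Lu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: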
Abstract:End-to-end autonomous driving increasingly leverages self-supervised video pretraining to learn transferable planning representations. However, pretraining video world models for scene understanding has so far brought only limited improvements. This limitation is compounded by the inherent ambiguity of driving: each scene typically provides only a single human trajectory, making it difficult to learn multimodal behaviors. In this work, we propose Drive-JEPA, a framework that integrates Video Joint-Embedding Predictive Architecture (V-JEPA) with multimodal trajectory distillation for end-to-end driving. First, we adapt V-JEPA for end-to-end driving, pretraining a ViT encoder on large-scale driving videos to produce predictive representations aligned with trajectory planning. Second, we introduce a proposal-centric planner that distills diverse simulator-generated trajectories alongside human trajectories, with a momentum-aware selection mechanism to promote stable and safe behavior. When evaluated on NAVSIM, the V-JEPA representation combined with a simple transformer-based decoder outperforms prior methods by 3 PDMS in the perception-free setting. The complete Drive-JEPA framework achieves 93.3 PDMS on v1 and 87.8 EPDMS on v2, setting a new state-of-the-art.
[CV-20] Hybrid Foveated Path Tracing with Peripheral Gaussians for Immersive Anatomy
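[Quick read]: This paper addresses the difficulty of intuitively understanding 3D spatial relationships in medical imaging: traditional 2D slices force users to mentally reconstruct the anatomy, direct volumetric path tracing is computationally expensive, and precomputed representations such as Gaussian Splatting require planning ahead, so both limit interactive use. The key solution is a hybrid rendering approach that uses streamed foveated path tracing for the high-quality central region and a lightweight Gaussian Splatting approximation for the periphery. The peripheral model is generated from the volume data and continuously refined with foveal renderings, and depth-guided reprojection improves robustness to latency, so the peripheral model can be regenerated in under a second while preserving visual quality.
Link: https://arxiv.org/abs/2601.22026
Authors: Constantin Kleinbeck,Luisa Theelke,Hannah Schieber,Ulrich Eck,Rüdiger von Eisenhart-Rothe,Daniel Roth
Affiliations: Technical University of Munich (TUM); TUM University Hospital, Orthopedics and Sports Orthopedics; Human-Centered Computing and Extended Reality Lab (HEX); Munich Institute of Robotics and Machine Intelligence (MIRMI)
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Scheduled for publication in the Proceedings of IEEE VR 2026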
Abstract:Volumetric medical imaging offers great potential for understanding complex pathologies. Yet, traditional 2D slices provide little support for interpreting spatial relationships, forcing users to mentally reconstruct anatomy into three dimensions. Direct volumetric path tracing and VR rendering can improve perception but are computationally expensive, while precomputed representations, like Gaussian Splatting, require planning ahead. Both approaches limit interactive use. We propose a hybrid rendering approach for high-quality, interactive, and immersive anatomical visualization. Our method combines streamed foveated path tracing with a lightweight Gaussian Splatting approximation of the periphery. The peripheral model generation is optimized with volume data and continuously refined using foveal renderings, enabling interactive updates. Depth-guided reprojection further improves robustness to latency and allows users to balance fidelity with refresh rate. We compare our method against direct path tracing and Gaussian Splatting. Our results highlight how their combination can preserve strengths in visual quality while re-generating the peripheral model in under a second, eliminating extensive preprocessing and approximations. This opens new options for interactive medical visualization.
[CV-21] Visual-Guided Key-Token Regularization for Multimodal Large Language Model Unlearning
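[Quick read]: This paper addresses two problems in unlearning for multimodal large language models (MLLMs): existing methods treat all answer tokens uniformly, ignoring their differing importance for unlearning, and they focus only on the language modality, ignoring visual cues that indicate which tokens are key. The core of the solution is Visual-Guided Key-Token Regularization (ViKeR), which uses irrelevant visual inputs to predict ideal post-unlearning token-level distributions and uses these distributions as a regularizer that prioritizes key tokens. The paper further defines key tokens via information entropy and analyzes ViKeR through token-level gradient reweighting, which amplifies updates on key tokens. Experiments on the MLLMU and CLEAR benchmarks show that the method unlearns effectively while mitigating forgetting and maintaining response coherence.
Link: https://arxiv.org/abs/2601.22020
Authors: Chengyi Cai,Zesheng Ye,Peike Li,Bo Han,Jianzhong Qi,Feng Liu
Affiliations: The University of Melbourne; Google Research; Hong Kong Baptist University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: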
Abstract:Unlearning in Multimodal Large Language Models (MLLMs) prevents the model from revealing private information when queried about target images. Existing MLLM unlearning methods largely adopt approaches developed for LLMs. They treat all answer tokens uniformly, disregarding their varying importance in the unlearning process. Moreover, these methods focus exclusively on the language modality, disregarding visual cues that indicate key tokens in answers. In this paper, after formulating the problem of unlearning in multimodal question answering for MLLMs, we propose Visual-Guided Key-Token Regularization (ViKeR). We leverage irrelevant visual inputs to predict ideal post-unlearning token-level distributions and use these distributions to regularize the unlearning process, thereby prioritizing key tokens. Further, we define key tokens in unlearning via information entropy and discuss ViKeR’s effectiveness through token-level gradient reweighting, which amplifies updates on key tokens. Experiments on MLLMU and CLEAR benchmarks demonstrate that our method effectively performs unlearning while mitigating forgetting and maintaining response coherence.
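The two ingredients named in this abstract, token-level distributions used as a regularization target and information entropy, can be sketched in a few lines. The sketch assumes a per-token KL divergence from the reference distribution (the one predicted with an irrelevant image) to the model's current distribution. The direction of the KL term, the toy distributions, and the function names are all assumptions, since the abstract does not specify them.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of one token-level distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def kl_divergence(ref, model):
    """KL(ref || model) for a single token position."""
    return sum(r * math.log(r / m) for r, m in zip(ref, model) if r > 0)


def token_regularizer(ref_dists, model_dists):
    """Per-token regularization terms; positions where the model still deviates
    most from the reference contribute most to the summed loss."""
    return [kl_divergence(r, m) for r, m in zip(ref_dists, model_dists)]


# Toy two-token answer over a two-word vocabulary.
reference = [[0.5, 0.5], [0.9, 0.1]]   # assumed post-unlearning targets
current = [[0.5, 0.5], [0.1, 0.9]]     # model before unlearning
terms = token_regularizer(reference, current)
```

In this toy case only the second token produces a non-zero term, so gradient updates would concentrate there. This is the token-level reweighting effect the abstract attributes to the regularizer.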
[CV-22] Causal World Modeling for Robot Control
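[Quick read]: This paper addresses long-horizon manipulation and efficient control in robot learning, in particular joint vision-action modeling that generalizes well in complex dynamic environments. The core challenge is to learn a causality-aware world model from video and use it for generative policy execution while balancing training efficiency and real-time control. The key solution is LingBot-VA, an autoregressive diffusion framework with three designs: (1) a shared latent space that integrates vision and action tokens, driven by a Mixture-of-Transformers (MoT) architecture; (2) a closed-loop rollout mechanism that keeps acquiring environmental feedback from ground-truth observations; and (3) an asynchronous inference pipeline that parallelizes action prediction and motor execution for efficient control. In simulation and real-world evaluation, the method shows promise in long-horizon manipulation, data efficiency in post-training, and generalization to novel configurations.
Link: https://arxiv.org/abs/2601.21998
Authors: Lin Li,Qihang Zhang,Yiming Luo,Shuai Yang,Ruilin Wang,Fei Han,Mingrui Yu,Zelin Gao,Nan Xue,Xing Zhu,Yujun Shen,Yinghao Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Project page: this https URL Code: this https URL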
Abstract:This work highlights that video world modeling, alongside vision-language pre-training, establishes a fresh and independent foundation for robot learning. Intuitively, video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics. Inspired by this, we introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. Our model features three carefully crafted designs: (1) a shared latent space, integrating vision and action tokens, driven by a Mixture-of-Transformers (MoT) architecture, (2) a closed-loop rollout mechanism, allowing for ongoing acquisition of environmental feedback with ground-truth observations, (3) an asynchronous inference pipeline, parallelizing action prediction and motor execution to support efficient control. We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations. The code and model are made publicly available to facilitate the community.
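The third design in this abstract, running action prediction in parallel with motor execution, follows a familiar producer-consumer pattern. The sketch below illustrates only that pattern, using Python threads: `predict_chunk` and `execute` are hypothetical callables, and the closed-loop observation feedback described in the abstract is deliberately left out.

```python
import queue
import threading


def run_async(predict_chunk, execute, num_chunks):
    """Predict the next action chunk in a background thread while the main
    thread executes the current one. Returns the executed results in order."""
    pending = queue.Queue(maxsize=1)  # at most one chunk prepared ahead
    executed = []

    def predictor():
        for step in range(num_chunks):
            pending.put(predict_chunk(step))
        pending.put(None)  # sentinel: no more chunks

    worker = threading.Thread(target=predictor)
    worker.start()
    while True:
        chunk = pending.get()
        if chunk is None:
            break
        for action in chunk:
            executed.append(execute(action))
    worker.join()
    return executed


# Toy stand-ins: each "chunk" is two integer actions; execution just echoes them.
result = run_async(lambda step: [2 * step, 2 * step + 1], lambda a: a, 3)
```

The bounded queue keeps the predictor at most one chunk ahead of execution, which limits how stale a prepared action chunk can become.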
[CV-23] PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing
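[Quick Read]: This paper addresses the limited robustness of document image recognition models under real-world physical distortions and their limited multi-task capability. It proposes a new benchmark, Real5-OmniDocBench, for more rigorous evaluation of robustness to scanning, skew, warping, screen photography, and illumination changes. Alongside it, the authors present PaddleOCR-VL-1.5, which reaches 94.5% accuracy on OmniDocBench v1.5 while remaining an ultra-compact 0.9B-parameter model. The model also adds seal recognition and text spotting, improving practicality and efficiency.
Link: https://arxiv.org/abs/2601.21957
Authors: Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Yi Liu, Dianhai Yu, Yanjun Ma
Affiliations: Baidu Inc. (百度公司)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: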
Abstract:We introduce PaddleOCR-VL-1.5, an upgraded model achieving a new state-of-the-art (SOTA) accuracy of 94.5% on OmniDocBench v1.5. To rigorously evaluate robustness against real-world physical distortions, including scanning, skew, warping, screen-photography, and illumination, we propose the Real5-OmniDocBench benchmark. Experimental results demonstrate that this enhanced model attains SOTA performance on the newly curated benchmark. Furthermore, we extend the model’s capabilities by incorporating seal recognition and text spotting tasks, while remaining a 0.9B ultra-compact VLM with high efficiency. Code: this https URL
[CV-24] Deep Models, Shallow Alignment: Uncovering the Granularity Mismatch in Neural Decoding
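[Quick Read]: This paper addresses the granularity mismatch in neural visual decoding: deep vision models suppress local texture information in pursuit of semantic invariance, whereas neural signals retain an intricate mixture of low-level visual attributes and high-level semantic content. The key is a contrastive learning strategy called Shallow Alignment, which aligns neural signals with intermediate-layer representations of the visual encoder rather than its final output, better balancing low-level texture detail and high-level semantic features. The abstract reports gains of 22% to 58% over final-layer alignment across vision backbones, along with a predictable scaling relationship between the capacity of the pre-trained vision backbone and decoding performance.
Link: https://arxiv.org/abs/2601.21948
Authors: Yang Du, Siyuan Dai, Yonghao Song, Paul M. Thompson, Haoteng Tang, Liang Zhan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 29 pages, 13 figures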
Abstract:Neural visual decoding is a central problem in brain computer interface research, aiming to reconstruct human visual perception and to elucidate the structure of neural representations. However, existing approaches overlook a fundamental granularity mismatch between human and machine vision, where deep vision models emphasize semantic invariance by suppressing local texture information, whereas neural signals preserve an intricate mixture of low-level visual attributes and high-level semantic content. To address this mismatch, we propose Shallow Alignment, a novel contrastive learning strategy that aligns neural signals with intermediate representations of visual encoders rather than their final outputs, thereby striking a better balance between low-level texture details and high-level semantic features. Extensive experiments across multiple benchmarks demonstrate that Shallow Alignment significantly outperforms standard final-layer alignment, with performance gains ranging from 22% to 58% across diverse vision backbones. Notably, our approach effectively unlocks the scaling law in neural visual decoding, enabling decoding performance to scale predictably with the capacity of pre-trained vision backbones. We further conduct systematic empirical analyses to shed light on the mechanisms underlying the observed performance gains.
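A minimal sketch of the alignment objective, assuming a symmetric InfoNCE loss (a common contrastive choice; the abstract does not specify the exact loss). The feature dictionary, embedding sizes, and temperature are illustrative assumptions, and the random arrays stand in for real neural and visual embeddings. The only point shown is that the alignment target is a choice of encoder layer.

```python
import numpy as np

def info_nce(neural, visual, temperature=0.1):
    """Symmetric InfoNCE loss between row-aligned embedding batches."""
    n = neural / np.linalg.norm(neural, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    logits = n @ v.T / temperature

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))            # positives on the diagonal

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
# Hypothetical per-layer features; which layer to align with is the design choice.
encoder_layers = {"intermediate": rng.normal(size=(8, 16)),
                  "final": rng.normal(size=(8, 16))}
neural_emb = rng.normal(size=(8, 16))   # stand-in for projected neural signals
loss = info_nce(neural_emb, encoder_layers["intermediate"])
```

Switching the target from `"intermediate"` to `"final"` gives the standard final-layer alignment that the abstract uses as its baseline.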
[CV-25] BookNet: Book Image Rectification via Cross-Page Attention Network
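[Quick Read]: This paper addresses the complex geometric distortions that binding constraints introduce in book image rectification. Left and right pages show distinctly asymmetric curvature patterns, and existing single-page document rectification methods cannot capture the coupled geometry of adjacent pages. The key contribution is BookNet, described as the first end-to-end deep learning framework designed for dual-page book image rectification. It uses a dual-branch architecture with cross-page attention to estimate warping flows for both the individual pages and the full book spread, explicitly modeling how the left and right pages influence each other.
Link: https://arxiv.org/abs/2601.21938
Authors: Shaokai Liu, Hao Feng, Bozhi Luan, Min Hou, Jiajun Deng, Wengang Zhou
Affiliations: Hefei University of Technology (合肥工业大学); University of Science and Technology of China (中国科学技术大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: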
Abstract:Book image rectification presents unique challenges in document image processing due to complex geometric distortions from binding constraints, where left and right pages exhibit distinctly asymmetric curvature patterns. However, existing single-page document image rectification methods fail to capture the coupled geometric relationships between adjacent pages in books. In this work, we introduce BookNet, the first end-to-end deep learning framework specifically designed for dual-page book image rectification. BookNet adopts a dual-branch architecture with cross-page attention mechanisms, enabling it to estimate warping flows for both individual pages and the complete book spread, explicitly modeling how left and right pages influence each other. Moreover, to address the absence of specialized datasets, we present Book3D, a large-scale synthetic dataset for training, and Book100, a comprehensive real-world benchmark for evaluation. Extensive experiments demonstrate that BookNet outperforms existing state-of-the-art methods on book image rectification. Code and dataset will be made publicly available.
[CV-26] Just Noticeable Difference Modeling for Deep Visual Features
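[Quick Read]: This paper addresses quality control for deep visual features in machine perception tasks, that is, how to define and maintain an acceptable perturbation boundary for features under constrained resources. Prior approaches lack a task-aligned description of perturbation tolerance in feature space, which makes it hard to balance performance against resource cost. The key is FeatJND, a task-aligned Just Noticeable Difference (JND) formulation that predicts the maximum tolerable per-feature perturbation map, enabling fine-grained control of feature perturbations without harming downstream task performance. Experiments show that FeatJND-guided perturbations preserve higher task performance than unstructured Gaussian noise at matched distortion strength and can suppress non-critical feature regions. Applied to token-wise dynamic quantization, FeatJND-guided step-size allocation outperforms random step-size permutation and a global uniform step size under the same noise budget.
Link: https://arxiv.org/abs/2601.21933
Authors: Rui Zhao, Wenrui Li, Lin Zhu, Yajing Zheng, Weisi Lin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: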
Abstract:Deep visual features are increasingly used as the interface in vision systems, motivating the need to describe feature characteristics and control feature quality for machine perception. Just noticeable difference (JND) characterizes the maximum imperceptible distortion for images under human or machine vision. Extending it to deep visual features naturally meets the above demand by providing a task-aligned tolerance boundary in feature space, offering a practical reference for controlling feature quality under constrained resources. We propose FeatJND, a task-aligned JND formulation that predicts the maximum tolerable per-feature perturbation map while preserving downstream task performance. We propose a FeatJND estimator at standardized split points and validate it across image classification, detection, and instance segmentation. Under matched distortion strength, FeatJND-based distortions consistently preserve higher task performance than unstructured Gaussian perturbations, and attribution visualizations suggest FeatJND can suppress non-critical feature regions. As an application, we further apply FeatJND to token-wise dynamic quantization and show that FeatJND-guided step-size allocation yields clear gains over random step-size permutation and global uniform step size under the same noise budget. Our code will be released after publication.
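The "same noise budget" comparison in the quantization experiment can be made concrete with a small sketch. It assumes the textbook model in which a uniform quantizer with step size Δ adds noise of variance Δ²/12, and it assumes step sizes proportional to a per-token tolerance; the abstract does not describe the paper's actual allocation rule, so `allocate_steps` and the tolerance values below are hypothetical.

```python
import numpy as np

def allocate_steps(tolerance, global_step):
    """Per-token steps proportional to tolerance, matched to the noise
    budget of a uniform quantizer with step `global_step`."""
    tolerance = np.asarray(tolerance, dtype=float)
    scale = global_step / np.sqrt(np.mean(tolerance ** 2))
    return scale * tolerance

def quantize(tokens, steps):
    """Uniform rounding quantizer with one step size per token (row)."""
    steps = np.asarray(steps, dtype=float).reshape(-1, 1)
    return np.round(tokens / steps) * steps

tolerance = np.array([0.5, 1.0, 2.0, 4.0])    # hypothetical per-token JND values
steps = allocate_steps(tolerance, global_step=0.1)
budget_guided = np.mean(steps ** 2) / 12.0    # modeled noise under guided steps
budget_uniform = 0.1 ** 2 / 12.0              # modeled noise under one global step

tokens = np.random.default_rng(0).normal(size=(4, 8))
quantized = quantize(tokens, steps)
```

Tokens with a larger tolerance get a coarser step and tokens with a smaller tolerance get a finer one, while the average modeled noise stays equal to that of the global uniform step.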
[CV-27] Zero-Shot Video Restoration and Enhancement with Assistance of Video Diffusion Models
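[Quick Read]: This paper addresses the severe temporal flickering that arises when diffusion-based zero-shot image restoration and enhancement methods are applied to video. It proposes a training-free framework that uses video diffusion models to improve the temporal consistency of image-based methods. Homologous latents fusion, heterogenous latents fusion, and a COT-based fusion ratio strategy jointly exploit homologous and heterogenous text-to-video diffusion models to complement the image method, and a temporal-strengthening post-processing step uses an image-to-video diffusion model to further improve temporal consistency.
Link: https://arxiv.org/abs/2601.21922
Authors: Cong Cao, Huanjing Yue, Shangbin Xie, Xin Liu, Jingyu Yang
Affiliations: Tianjin University (天津大学); Lappeenranta-Lahti University of Technology LUT (拉彭兰塔-拉赫蒂理工大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: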
Abstract:Although diffusion-based zero-shot image restoration and enhancement methods have achieved great success, applying them to video restoration or enhancement will lead to severe temporal flickering. In this paper, we propose the first framework that utilizes the rapidly-developed video diffusion model to assist the image-based method in maintaining more temporal consistency for zero-shot video restoration and enhancement. We propose homologous latents fusion, heterogenous latents fusion, and a COT-based fusion ratio strategy to utilize both homologous and heterogenous text-to-video diffusion models to complement the image method. Moreover, we propose temporal-strengthening post-processing to utilize the image-to-video diffusion model to further improve temporal consistency. Our method is training-free and can be applied to any diffusion-based image restoration and enhancement methods. Experimental results demonstrate the superiority of the proposed method.
[CV-28] VideoAesBench: Benchmarking the Video Aesthetics Perception Capabilities of Large Multimodal Models
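[Quick Read]: This paper addresses the limited ability of Large Multimodal Models (LMMs) in video aesthetic quality assessment, an important part of human visual perception that remains underexplored. The key contribution is VideoAesBench, a comprehensive benchmark with three features: 1,804 videos from diverse sources, including user-generated (UGC), AI-generated (AIGC), compressed, robotic-generated (RGC), and game videos; multiple question formats, including single-choice, multi-choice, true-or-false, and a new open-ended description format; and systematic coverage of video aesthetics along three dimensions, namely visual form, visual style, and visual affectiveness. Using the benchmark, the authors evaluate 23 open-source and commercial LMMs and find that current models have only basic video aesthetics perception, with incomplete and imprecise performance. The benchmark is offered as a testbed and research direction for explainable video aesthetics assessment.
Link: https://arxiv.org/abs/2601.21915
Authors: Yunhao Li, Sijing Wu, Zhilin Gao, Zicheng Zhang, Qi Jia, Huiyu Duan, Xiongkuo Min, Guangtao Zhai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: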
Abstract:Large multimodal models (LMMs) have demonstrated outstanding capabilities in various visual perception tasks, which has in turn made the evaluation of LMMs significant. However, the capability of video aesthetic quality assessment, which is a fundamental ability for humans, remains underexplored for LMMs. To address this, we introduce VideoAesBench, a comprehensive benchmark for evaluating LMMs’ understanding of video aesthetic quality. VideoAesBench has several significant characteristics: (1) Diverse content including 1,804 videos from multiple video sources including user-generated (UGC), AI-generated (AIGC), compressed, robotic-generated (RGC), and game videos. (2) Multiple question formats containing traditional single-choice questions, multi-choice questions, True or False questions, and novel open-ended questions for video aesthetics description. (3) Holistic video aesthetics dimensions including visual form related questions from 5 aspects, visual style related questions from 4 aspects, and visual affectiveness questions from 3 aspects. Based on VideoAesBench, we benchmark 23 open-source and commercial large multimodal models. Our findings show that current LMMs only contain basic video aesthetics perception ability, and their performance remains incomplete and imprecise. We hope our VideoAesBench can serve as a strong testbed and offer insights for explainable video aesthetics assessment.
[CV-29] Beyond Global Alignment: Fine-Grained Motion-Language Retrieval via Pyramidal Shapley-Taylor Learning
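[Quick Read]: This paper addresses the semantic gap in motion-language retrieval, a task in human-centric cross-modal intelligence. Existing methods mostly align whole motion sequences with global text representations and overlook fine-grained interactions among local motion segments, body joints, and text tokens, which limits retrieval performance. The key is a Pyramidal Shapley-Taylor (PST) learning framework that decomposes human motion into temporal segments and spatial body joints and aligns them with text progressively, joint-wise and then segment-wise, in a hierarchical fashion. This captures both local semantic details and hierarchical structural relationships and improves the precision of motion-text matching.
Link: https://arxiv.org/abs/2601.21904
Authors: Hanmo Chen, Guangtao Lyu, Chenghao Xu, Jiexi Yan, Xu Yang, Cheng Deng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: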
Abstract:As a foundational task in human-centric cross-modal intelligence, motion-language retrieval aims to bridge the semantic gap between natural language and human motion, enabling intuitive motion analysis, yet existing approaches predominantly focus on aligning entire motion sequences with global textual representations. This global-centric paradigm overlooks fine-grained interactions between local motion segments and individual body joints and text tokens, inevitably leading to suboptimal retrieval performance. To address this limitation, we draw inspiration from the pyramidal process of human motion perception (from joint dynamics to segment coherence, and finally to holistic comprehension) and propose a novel Pyramidal Shapley-Taylor (PST) learning framework for fine-grained motion-language retrieval. Specifically, the framework decomposes human motion into temporal segments and spatial body joints, and learns cross-modal correspondences through progressive joint-wise and segment-wise alignment in a pyramidal fashion, effectively capturing both local semantic details and hierarchical structural relationships. Extensive experiments on multiple public benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, achieving precise alignment between motion segments and body joints and their corresponding text tokens. The code of this work will be released upon acceptance.
[CV-30] TraceRouter: Robust Safety for Large Foundation Models via Path-Level Intervention
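[Quick Read]: This paper addresses the vulnerability of Large Foundation Models (LFMs) to adversarial attacks. Existing defenses rely on the "locality hypothesis" and suppress isolated neurons or features, but harmful semantics exist as distributed, cross-layer circuits, so such defenses are brittle and damage general capability. The key is TraceRouter, a path-level framework that traces and cuts the causal propagation pathways of illicit semantics. It first locates a sensitive onset layer through attention divergence analysis, then uses Sparse Autoencoders (SAEs) and differential activation analysis to disentangle and isolate malicious features, and finally computes Feature Influence Scores (FIS) from zero-out interventions to map these features onto downstream causal pathways. Selectively suppressing those causal chains blocks the flow of harmful information while leaving orthogonal computation routes intact.
Link: https://arxiv.org/abs/2601.21900
Authors: Chuancheng Shi, Shangze Li, Wenjun Lu, Wenhua Wu, Cong Wang, Zifeng Cheng, Fei Shen, Tat-Seng Chua
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multimedia (cs.MM)
Notes: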
Abstract:Despite their capabilities, large foundation models (LFMs) remain susceptible to adversarial manipulation. Current defenses predominantly rely on the “locality hypothesis”, suppressing isolated neurons or features. However, harmful semantics act as distributed, cross-layer circuits, rendering such localized interventions brittle and detrimental to utility. To bridge this gap, we propose TraceRouter, a path-level framework that traces and disconnects the causal propagation circuits of illicit semantics. TraceRouter operates in three stages: (1) it pinpoints a sensitive onset layer by analyzing attention divergence; (2) it leverages sparse autoencoders (SAEs) and differential activation analysis to disentangle and isolate malicious features; and (3) it maps these features to downstream causal pathways via feature influence scores (FIS) derived from zero-out interventions. By selectively suppressing these causal chains, TraceRouter physically severs the flow of harmful information while leaving orthogonal computation routes intact. Extensive experiments demonstrate that TraceRouter significantly outperforms state-of-the-art baselines, achieving a superior trade-off between adversarial robustness and general utility. Our code will be publicly released. WARNING: This paper contains unsafe model responses.
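Stage (3) relies on zero-out interventions, which can be illustrated on a toy model. The sketch below assumes one plausible reading, scoring each feature by the norm of the change in a downstream output when that feature is set to zero; the abstract does not give the paper's exact FIS definition, and the linear map `W` is a hypothetical stand-in for the downstream computation.

```python
import numpy as np

def influence_scores(features, downstream):
    """Score each feature by how much zeroing it changes the downstream output."""
    baseline = downstream(features)
    scores = np.zeros(len(features))
    for i in range(len(features)):
        ablated = features.copy()
        ablated[i] = 0.0                      # zero-out intervention on feature i
        scores[i] = np.linalg.norm(baseline - downstream(ablated))
    return scores

W = np.array([[1.0, 0.0, 2.0],
              [0.0, 0.0, -1.0]])              # hypothetical downstream linear map
features = np.array([0.5, 3.0, 1.0])
fis = influence_scores(features, lambda h: W @ h)
```

In this toy case the second feature has a large activation but no downstream effect, so its score is zero. That is the distinction an intervention-based score draws that activation magnitude alone does not.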
[CV-31] Past- and Future-Informed KV Cache Policy with Salience Estimation in Autoregressive Video Diffusion
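[Quick Read]: This paper addresses a weakness of current autoregressive video generation methods, whose heuristic KV cache policies lose critical spatiotemporal information and accumulate redundant cache entries, degrading generation quality and efficiency. The key is a Past- and Future-Informed KV Cache Policy (PaFu-KV), which uses a lightweight Salience Estimation Head to score token importance. The head is distilled from a bidirectional teacher model, so the cache can retain informative tokens that contribute more to generation and discard less relevant ones. This shrinks KV cache capacity and memory footprint at inference time and yields a better quality-efficiency trade-off.
Link: https://arxiv.org/abs/2601.21896
Authors: Hanmo Chen, Chenghao Xu, Xu Yang, Xuan Chen, Cheng Deng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: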
Abstract:Video generation is pivotal to digital media creation, and recent advances in autoregressive video generation have markedly enhanced the efficiency of real-time video synthesis. However, existing approaches generally rely on heuristic KV Cache policies, which ignore differences in token importance in long-term video generation. This leads to the loss of critical spatiotemporal information and the accumulation of redundant, invalid cache, thereby degrading video generation quality and efficiency. To address this limitation, we first observe that token contributions to video generation are highly time-heterogeneous and accordingly propose a novel Past- and Future-Informed KV Cache Policy (PaFu-KV). Specifically, PaFu-KV introduces a lightweight Salience Estimation Head distilled from a bidirectional teacher to estimate salience scores, allowing the KV cache to retain informative tokens while discarding less relevant ones. This policy yields a better quality-efficiency trade-off by shrinking KV cache capacity and reducing memory footprint at inference time. Extensive experiments on benchmarks demonstrate that our method preserves high-fidelity video generation quality while accelerating inference, thereby enabling more efficient long-horizon video generation. Our code will be released upon paper acceptance.
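The retain-or-discard step can be sketched as a top-k selection over salience scores. This is an illustrative assumption about the eviction rule, not PaFu-KV's actual policy: the salience values here are hard-coded placeholders for the output of the Salience Estimation Head, and `prune_kv_cache` is a hypothetical helper.

```python
import numpy as np

def prune_kv_cache(keys, values, salience, budget):
    """Keep the `budget` most salient tokens, preserving their original order."""
    keep = np.sort(np.argsort(salience)[-budget:])
    return keys[keep], values[keep], keep

keys = np.arange(12, dtype=float).reshape(6, 2)         # 6 cached tokens, dim 2
values = keys * 10.0
salience = np.array([0.9, 0.1, 0.4, 0.8, 0.05, 0.6])    # hypothetical head outputs
k, v, kept = prune_kv_cache(keys, values, salience, budget=3)
```

Sorting the kept indices preserves temporal order in the pruned cache, which matters when positions carry meaning for later attention steps.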
[CV-32] Improving Classifier-Free Guidance of Flow Matching via Manifold Projection
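[Quick Read]: This paper addresses the sensitivity to the guidance scale that results from the heuristic linear extrapolation used by classifier-free guidance (CFG) in diffusion and flow-based models. Taking an optimization view, the authors show that the velocity field in flow matching corresponds to the gradient of a sequence of smoothed distance functions, so standard CFG can be read as an approximation of this gradient in which the prediction gap governs guidance sensitivity. Building on this, they recast CFG sampling as a homotopy optimization problem with a manifold constraint and implement the required manifold projection step with incremental gradient descent. Anderson Acceleration improves the efficiency and stability of the iteration without additional model evaluations. The method is training-free and improves generation fidelity, prompt alignment, and robustness to the guidance scale.
Link: https://arxiv.org/abs/2601.21892
Authors: Jian-Feng Cai, Haixia Liu, Zhengyi Su, Chao Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 24 pages, 14 figures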
Abstract:Classifier-free guidance (CFG) is a widely used technique for controllable generation in diffusion and flow-based models. Despite its empirical success, CFG relies on a heuristic linear extrapolation that is often sensitive to the guidance scale. In this work, we provide a principled interpretation of CFG through the lens of optimization. We demonstrate that the velocity field in flow matching corresponds to the gradient of a sequence of smoothed distance functions, which guides latent variables toward the scaled target image set. This perspective reveals that the standard CFG formulation is an approximation of this gradient, where the prediction gap, the discrepancy between conditional and unconditional outputs, governs guidance sensitivity. Leveraging this insight, we reformulate the CFG sampling as a homotopy optimization with a manifold constraint. This formulation necessitates a manifold projection step, which we implement via an incremental gradient descent scheme during sampling. To improve computational efficiency and stability, we further enhance this iterative process with Anderson Acceleration without requiring additional model evaluations. Our proposed methods are training-free and consistently refine generation fidelity, prompt alignment, and robustness to the guidance scale. We validate their effectiveness across diverse benchmarks, demonstrating significant improvements on large-scale models such as DiT-XL-2-256, Flux, and Stable Diffusion 3.5.
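For reference, the heuristic linear extrapolation mentioned in the abstract is commonly written as below. This is the standard CFG form under one common convention, not the paper's reformulation; the symbols (conditional velocity v_c, unconditional velocity v_∅, guidance scale w) are notation chosen here for illustration.

```latex
\tilde{v}_t(x_t) = v_{\emptyset}(x_t) + w \,\bigl( v_c(x_t) - v_{\emptyset}(x_t) \bigr)
```

Under this convention w = 1 recovers the conditional prediction and w > 1 extrapolates past it. The bracketed difference is the prediction gap that the abstract identifies as governing guidance sensitivity.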
[CV-33] Trajectory-Guided Diffusion for Foreground-Preserving Background Generation in Multi-Layer Documents
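[Quick Read]: This paper addresses two core problems in background generation for documents: protecting foreground content such as text without auxiliary constraint mechanisms, and keeping style consistent across multi-page documents, avoiding the style drift that page-by-page generation causes. The key is to reinterpret diffusion as the evolution of stochastic trajectories in latent space. Shaping the initial noise and its geometric alignment makes background generation avoid foreground regions naturally, without heuristics such as masking or suppressing diffusion updates. Style control is decoupled from text conditioning, and cached style directions act as persistent vectors in latent space that constrain diffusion paths to a shared stylistic subspace, giving consistent and stable style across pages. The method is training-free, compatible with existing diffusion backbones, and admits a geometric and physical interpretation.
Link: https://arxiv.org/abs/2601.21857
Authors: Taewon Kang
Affiliations: University of Maryland at College Park (马里兰大学学院市分校)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 47 pages, 36 figures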
Abstract:We present a diffusion-based framework for document-centric background generation that achieves foreground preservation and multi-page stylistic consistency through latent-space design rather than explicit constraints. Instead of suppressing diffusion updates or applying masking heuristics, our approach reinterprets diffusion as the evolution of stochastic trajectories through a structured latent space. By shaping the initial noise and its geometric alignment, background generation naturally avoids designated foreground regions, allowing readable content to remain intact without auxiliary mechanisms. To address the long-standing issue of stylistic drift across pages, we decouple style control from text conditioning and introduce cached style directions as persistent vectors in latent space. Once selected, these directions constrain diffusion trajectories to a shared stylistic subspace, ensuring consistent appearance across pages and editing iterations. This formulation eliminates the need for repeated prompt-based style specification and provides a more stable foundation for multi-page generation. Our framework admits a geometric and physical interpretation, where diffusion paths evolve on a latent manifold shaped by preferred directions, and foreground regions are rarely traversed as a consequence of trajectory initialization rather than explicit exclusion. The proposed method is training-free, compatible with existing diffusion backbones, and produces visually coherent, foreground-preserving results across complex documents. By reframing diffusion as trajectory design in latent space, we offer a principled approach to consistent and structured generative modeling.
[CV-34] MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods
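[Quick Read]: This paper addresses the gap in visual reasoning between open-source Vision Language Models (VLMs) and proprietary systems. The root cause is the lack of high-quality Chain-of-Thought (CoT) annotated data covering diverse, challenging domains such as STEM diagrams and visual puzzles. The key is MMFineReason, a large-scale multimodal reasoning dataset with 1.8M samples and 5.1B solution tokens, built through a systematic three-stage pipeline: large-scale data collection and standardization, CoT rationale generation, and selection based on reasoning quality and difficulty awareness. With difficulty-aware filtering, a subset of only 7% of the data (123K samples) achieves performance comparable to the full dataset, a "less is more" effect. The study also reports that reasoning-oriented data composition has a synergistic effect on general capabilities.
Link: https://arxiv.org/abs/2601.21821
Authors: Honglin Lin, Zheng Liu, Yun Zhu, Chonghan Qin, Juekai Lin, Xiaoran Shang, Conghui He, Wentao Zhang, Lijun Wu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: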
Abstract:Recent advances in Vision Language Models (VLMs) have driven significant progress in visual reasoning. However, open-source VLMs still lag behind proprietary systems, largely due to the lack of high-quality reasoning data. Existing datasets offer limited coverage of challenging domains such as STEM diagrams and visual puzzles, and lack consistent, long-form Chain-of-Thought (CoT) annotations essential for eliciting strong reasoning capabilities. To bridge this gap, we introduce MMFineReason, a large-scale multimodal reasoning dataset comprising 1.8M samples and 5.1B solution tokens, featuring high-quality reasoning annotations distilled from Qwen3-VL-235B-A22B-Thinking. The dataset is established via a systematic three-stage pipeline: (1) large-scale data collection and standardization, (2) CoT rationale generation, and (3) comprehensive selection based on reasoning quality and difficulty awareness. The resulting dataset spans STEM problems, visual puzzles, games, and complex diagrams, with each sample annotated with visually grounded reasoning traces. We fine-tune Qwen3-VL-Instruct on MMFineReason to develop MMFineReason-2B/4B/8B versions. Our models establish new state-of-the-art results for their size class. Notably, MMFineReason-4B successfully surpasses Qwen3-VL-8B-Thinking, and MMFineReason-8B even outperforms Qwen3-VL-30B-A3B-Thinking while approaching Qwen3-VL-32B-Thinking, demonstrating remarkable parameter efficiency. Crucially, we uncover a “less is more” phenomenon via our difficulty-aware filtering strategy: a subset of just 7% (123K samples) achieves performance comparable to the full dataset. Notably, we reveal a synergistic effect where reasoning-oriented data composition simultaneously boosts general capabilities.
[CV-35] CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models
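[Quick Read]: This paper addresses the limited capability of Large Language Models (LLMs) in 3D content generation: existing methods produce only low-resolution meshes or coarse structural proxies and fail to capture fine-grained geometry natively. The key is CG-MLLM, a Multi-modal Large Language Model (MLLM) built on a Mixture-of-Transformer architecture that decouples modeling needs at different granularities. A Token-level Autoregressive (TokenAR) Transformer handles token-level content, and a Block-level Autoregressive (BlockAR) Transformer handles block-level spatial content. Combining a pre-trained vision-language backbone with a specialized 3D variational autoencoder (VAE) latent space enables long-context interaction between standard tokens and spatial blocks, so a single framework supports both high-fidelity 3D object generation and 3D captioning.
Link: https://arxiv.org/abs/2601.21798
Authors: Junming Huang, Weiwei Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: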
Abstract:Large Language Models (LLMs) have revolutionized text generation and multimodal perception, but their capabilities in 3D content generation remain underexplored. Existing methods compromise by producing either low-resolution meshes or coarse structural proxies, failing to capture fine-grained geometry natively. In this paper, we propose CG-MLLM, a novel Multi-modal Large Language Model (MLLM) capable of 3D captioning and high-resolution 3D generation in a single framework. Leveraging the Mixture-of-Transformer architecture, CG-MLLM decouples disparate modeling needs, where the Token-level Autoregressive (TokenAR) Transformer handles token-level content, and the Block-level Autoregressive (BlockAR) Transformer handles block-level content. By integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space, CG-MLLM facilitates long-context interactions between standard tokens and spatial blocks within a single integrated architecture. Experimental results show that CG-MLLM significantly outperforms existing MLLMs in generating high-fidelity 3D objects, effectively bringing high-resolution 3D content creation into the mainstream LLM paradigm.
[CV-36] Synthetic-to-Real Domain Bridging for Single-View 3D Reconstruction of Ships for Maritime Monitoring
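[Quick Read]: This paper addresses real-time, practical 3D reconstruction of ships for maritime monitoring. Existing methods generally depend on multi-view supervision or annotated 3D ground truth, or are computationally intensive, which makes them hard to deploy. The key is an efficient pipeline that needs only a single image at inference and is trained entirely on synthetic data. The Splatter Image network, which represents objects as 3D Gaussians, is fine-tuned on synthetic ShapeNet vessels and then refined on a diverse custom dataset of 3D ships to narrow the gap between synthetic and real imagery. A YOLOv8-based segmentation module and custom preprocessing keep inputs compatible with the reconstruction network, and postprocessing applies real-world scaling, centering, and orientation alignment. AIS metadata and homography-based mapping then georeference the reconstructed model on an interactive web map. The pipeline delivers 3D ship visualization without real-world 3D annotations and points toward a scalable, near-real-time solution for maritime monitoring.
Link: https://arxiv.org/abs/2601.21786
Authors: Borja Carrillo-Perez, Felix Sattler, Angel Bueno Rodriguez, Maurice Stephan, Sarah Barnes
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Notes: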
Abstract:Three-dimensional (3D) reconstruction of ships is an important part of maritime monitoring, allowing improved visualization, inspection, and decision-making in real-world monitoring environments. However, most state-of-the-art 3D reconstruction methods require multi-view supervision, annotated 3D ground truth, or are computationally intensive, making them impractical for real-time maritime deployment. In this work, we present an efficient pipeline for single-view 3D reconstruction of real ships by training entirely on synthetic data and requiring only a single view at inference. Our approach uses the Splatter Image network, which represents objects as sparse sets of 3D Gaussians for rapid and accurate reconstruction from single images. The model is first fine-tuned on synthetic ShapeNet vessels and further refined with a diverse custom dataset of 3D ships, bridging the domain gap between synthetic and real-world imagery. We integrate a state-of-the-art segmentation module based on YOLOv8 and custom preprocessing to ensure compatibility with the reconstruction network. Postprocessing steps include real-world scaling, centering, and orientation alignment, followed by georeferenced placement on an interactive web map using AIS metadata and homography-based mapping. Quantitative evaluation on synthetic validation data demonstrates strong reconstruction fidelity, while qualitative results on real maritime images from the ShipSG dataset confirm the potential for transfer to operational maritime settings. The final system provides interactive 3D inspection of real ships without requiring real-world 3D annotations. This pipeline provides an efficient, scalable solution for maritime monitoring and highlights a path toward real-time 3D ship visualization in practical applications. Interactive demo: this https URL.
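The homography-based mapping step can be sketched in a few lines. The matrix `H` below is a made-up example, not a calibration from the paper; in practice it would be estimated from correspondences between image pixels and map coordinates on the water plane, which is an assumption about the setup rather than something the abstract spells out.

```python
import numpy as np

def apply_homography(H, pixel):
    """Map an image pixel (u, v) to planar map coordinates via a 3x3 homography."""
    u, v = pixel
    x, y, w = H @ np.array([u, v, 1.0])
    return np.array([x / w, y / w])       # divide out the homogeneous coordinate

# Hypothetical homography: scale pixels by 0.5 and shift by (10, 20).
H = np.array([[0.5, 0.0, 10.0],
              [0.0, 0.5, 20.0],
              [0.0, 0.0, 1.0]])
point = apply_homography(H, (100.0, 40.0))
```

A homography is only valid for points on a single plane, which suits ship positions on the sea surface seen from a fixed camera.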
[CV-37] Dynamic Topology Awareness: Breaking the Granularity Rigidity in Vision-Language Navigation
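[Quick Read]: This paper addresses insufficient navigation precision and safety in Vision-Language Navigation in Continuous Environments (VLN-CE) caused by the "Granularity Rigidity" of topological planning methods. Existing methods typically sample nodes with fixed geometric thresholds and cannot adapt map density to environmental complexity: simple areas are over-sampled, causing computational redundancy, while high-uncertainty regions are under-sampled, raising collision risk and reducing precision. The key is the DGNav framework, with two core innovations: (1) a Scene-Aware Adaptive Strategy that adjusts graph construction thresholds according to the dispersion of predicted waypoints, densifying the map on demand in challenging environments; and (2) a Dynamic Graph Transformer that fuses visual, linguistic, and geometric cues into dynamic edge weights, filtering topological noise and improving instruction adherence. Together these balance navigation efficiency against safe exploration.
Link: https://arxiv.org/abs/2601.21751
Authors: Jiankun Peng, Jianyuan Guo, Ying Xu, Yue Liu, Jiashuang Yan, Xuanwei Ye, Houhua Li, Xiaoming Wang
Affiliations: Aerospace Information Research Institute, Chinese Academy of Sciences (中国科学院空天信息研究院); School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences (中国科学院大学电子、电气与通信工程学院); Department of Computer Science, City University of Hong Kong (香港城市大学计算机科学系)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: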
Abstract:Vision-Language Navigation in Continuous Environments (VLN-CE) presents a core challenge: grounding high-level linguistic instructions into precise, safe, and long-horizon spatial actions. Explicit topological maps have proven to be a vital solution for providing robust spatial memory in such tasks. However, existing topological planning methods suffer from a “Granularity Rigidity” problem. Specifically, these methods typically rely on fixed geometric thresholds to sample nodes, which fails to adapt to varying environmental complexities. This rigidity leads to a critical mismatch: the model tends to over-sample in simple areas, causing computational redundancy, while under-sampling in high-uncertainty regions, increasing collision risks and compromising precision. To address this, we propose DGNav, a framework for Dynamic Topological Navigation, introducing a context-aware mechanism to modulate map density and connectivity on-the-fly. Our approach comprises two core innovations: (1) A Scene-Aware Adaptive Strategy that dynamically modulates graph construction thresholds based on the dispersion of predicted waypoints, enabling “densification on demand” in challenging environments; (2) A Dynamic Graph Transformer that reconstructs graph connectivity by fusing visual, linguistic, and geometric cues into dynamic edge weights, enabling the agent to filter out topological noise and enhancing instruction adherence. Extensive experiments on the R2R-CE and RxR-CE benchmarks demonstrate DGNav exhibits superior navigation performance and strong generalization capabilities. Furthermore, ablation studies confirm that our framework achieves an optimal trade-off between navigation efficiency and safe exploration. The code is available at this https URL.
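The Scene-Aware Adaptive Strategy can be illustrated with a toy rule. The sketch assumes that dispersion is the mean distance of predicted waypoints from their centroid and that a larger dispersion should lower the node-merging threshold, giving a denser graph. Both the dispersion measure and the functional form are assumptions; the abstract states only that thresholds depend on waypoint dispersion.

```python
import numpy as np

def adaptive_threshold(waypoints, base=1.0, min_thr=0.25, sensitivity=1.0):
    """Shrink the node-merging distance threshold as waypoint dispersion grows."""
    waypoints = np.asarray(waypoints, dtype=float)
    centroid = waypoints.mean(axis=0)
    dispersion = np.mean(np.linalg.norm(waypoints - centroid, axis=1))
    return max(min_thr, base / (1.0 + sensitivity * dispersion))

# Tightly clustered predictions (e.g. a corridor) vs. scattered ones (e.g. a junction).
tight = adaptive_threshold([[1.0, 0.0], [1.1, 0.0], [0.9, 0.0]])
spread = adaptive_threshold([[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]])
```

A smaller threshold means nearby candidate nodes are merged less aggressively, which is one way to obtain the "densification on demand" the abstract describes.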
[CV-38] From Global to Granular: Revealing IQA Model Performance via Correlation Surface
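[Quick Read]: This paper addresses a long-standing limitation in evaluating Image Quality Assessment (IQA) models: global correlation metrics such as the Pearson Linear Correlation Coefficient (PLCC) and the Spearman Rank-Order Correlation Coefficient (SRCC) discard performance information. They cannot show how a model behaves in different local quality ranges, such as high-MOS regions or pairs with small MOS differences, and they are sensitive to the quality distribution of the test samples, which makes cross-dataset comparison unstable. The key is the Granularity-Modulated Correlation (GMC) framework, with two components: a Granularity Modulator that applies Gaussian-weighted correlations conditioned on absolute MOS values and pairwise MOS differences (|ΔMOS|) to characterize local performance, and a Distribution Regulator that corrects bias from non-uniform quality distributions. The result is a 3D correlation surface over MOS and |ΔMOS| that supports finer, more stable, and more interpretable analysis and comparison of IQA models.
Link: https://arxiv.org/abs/2601.21738
Authors: Baoliang Chen, Danni Huang, Hanwei Zhu, Lingyu Zhu, Wei Zhou, Shiqi Wang, Yuming Fang, Weisi Lin
Affiliations: South China Normal University (华南师范大学); Nanyang Technological University (南洋理工大学); City University of Hong Kong (香港城市大学); Cardiff University (卡迪夫大学); Jiangxi University of Finance and Economics (江西财经大学); Jiangxi Provincial Key Laboratory of Multimedia Intelligent Processing (江西省多媒体智能处理重点实验室)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: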
Abstract:Evaluation of Image Quality Assessment (IQA) models has long been dominated by global correlation metrics, such as Pearson Linear Correlation Coefficient (PLCC) and Spearman Rank-Order Correlation Coefficient (SRCC). While widely adopted, these metrics reduce performance to a single scalar, failing to capture how ranking consistency varies across the local quality spectrum. For example, two IQA models may achieve identical SRCC values, yet one ranks high-quality images (related to high Mean Opinion Score, MOS) more reliably, while the other better discriminates image pairs with small quality/MOS differences (related to |ΔMOS|). Such complementary behaviors are invisible under global metrics. Moreover, SRCC and PLCC are sensitive to test-sample quality distributions, yielding unstable comparisons across test sets. To address these limitations, we propose Granularity-Modulated Correlation (GMC), which provides a structured, fine-grained analysis of IQA performance. GMC includes: (1) a Granularity Modulator that applies Gaussian-weighted correlations conditioned on absolute MOS values and pairwise MOS differences (|ΔMOS|) to examine local performance variations, and (2) a Distribution Regulator that regularizes correlations to mitigate biases from non-uniform quality distributions. The resulting correlation surface maps correlation values as a joint function of MOS and |ΔMOS|, providing a 3D representation of IQA performance. Experiments on standard benchmarks show that GMC reveals performance characteristics invisible to scalar metrics, offering a more informative and reliable paradigm for analyzing, comparing, and deploying IQA models. Codes are available at this https URL.
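A simplified view of the Granularity Modulator's first axis is a Pearson correlation with Gaussian weights centered on a chosen MOS value. The sketch below covers only that axis; the pairwise |ΔMOS| axis and the Distribution Regulator are not modeled, and the kernel width `sigma` and the synthetic predictions are assumptions made for illustration.

```python
import numpy as np

def weighted_pearson(x, y, w):
    """Pearson correlation of x and y under non-negative sample weights w."""
    w = w / w.sum()
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    return cov / np.sqrt(np.sum(w * (x - mx) ** 2) * np.sum(w * (y - my) ** 2))

def local_correlation(mos, pred, center, sigma=0.5):
    """Correlation between MOS and predictions, localized around `center`."""
    w = np.exp(-0.5 * ((mos - center) / sigma) ** 2)
    return weighted_pearson(mos, pred, w)

mos = np.linspace(1.0, 5.0, 41)
# Hypothetical model: accurate for high MOS, anti-correlated for low MOS.
pred = np.where(mos >= 3.0, mos, 6.0 - mos)
low = local_correlation(mos, pred, center=1.5)
high = local_correlation(mos, pred, center=4.5)
global_plcc = np.corrcoef(mos, pred)[0, 1]
```

In this synthetic case the global coefficient is close to zero while the two local values have opposite signs, which is the kind of behavior a single scalar metric hides.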
[CV-39] DreamActor-M2: Universal Character Image Animation via Spatiotemporal In-Context Learning
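[Quick read]: This paper targets two core problems in character image animation: suboptimal motion injection strategies that force a trade-off between identity preservation and motion consistency (the "see-saw" effect), and over-reliance on explicit pose priors such as skeletons, which limits generalization to arbitrary non-humanoid characters. The key to the solution is the DreamActor-M2 framework, which recasts motion conditioning as an in-context learning problem and follows a two-stage paradigm. First, it fuses reference appearance and motion cues into a unified latent space, using the generative prior of foundation models to reason jointly about spatial identity and temporal dynamics. Second, it introduces a self-bootstrapped data synthesis pipeline that builds pseudo cross-identity training pairs, enabling a smooth transition from pose-dependent control to direct end-to-end RGB-driven animation and significantly improving generalization across diverse characters and motion scenarios.
Link: https://arxiv.org/abs/2601.21716
Authors: Mingshuang Luo,Shuang Liang,Zhengkun Rong,Yuxuan Luo,Tianshu Hu,Ruibing Hou,Hong Chang,Yong Li,Yuan Zhang,Mingyuan Gao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: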
Abstract:Character image animation aims to synthesize high-fidelity videos by transferring motion from a driving sequence to a static reference image. Despite recent advancements, existing methods suffer from two fundamental challenges: (1) suboptimal motion injection strategies that lead to a trade-off between identity preservation and motion consistency, manifesting as a “see-saw”, and (2) an over-reliance on explicit pose priors (e.g., skeletons), which inadequately capture intricate dynamics and hinder generalization to arbitrary, non-humanoid characters. To address these challenges, we present DreamActor-M2, a universal animation framework that reimagines motion conditioning as an in-context learning problem. Our approach follows a two-stage paradigm. First, we bridge the input modality gap by fusing reference appearance and motion cues into a unified latent space, enabling the model to jointly reason about spatial identity and temporal dynamics by leveraging the generative prior of foundational models. Second, we introduce a self-bootstrapped data synthesis pipeline that curates pseudo cross-identity training pairs, facilitating a seamless transition from pose-dependent control to direct, end-to-end RGB-driven animation. This strategy significantly enhances generalization across diverse characters and motion scenarios. To facilitate comprehensive evaluation, we further introduce AW Bench, a versatile benchmark encompassing a wide spectrum of character types and motion scenarios. Extensive experiments demonstrate that DreamActor-M2 achieves state-of-the-art performance, delivering superior visual fidelity and robust cross-domain generalization. Project Page: this https URL
[CV-40] ChartE³: A Comprehensive Benchmark for End-to-End Chart Editing
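[Quick read]: This paper addresses a key challenge in end-to-end chart editing: how to accurately carry out both fine-grained appearance changes (such as font or color adjustments) and global, data-driven transformations (such as data filtering or adding a trend line) without relying on natural language or code as an intermediate representation. The key to the solution is ChartE³, a benchmark of more than 1,200 high-quality samples, each given as a triplet of a chart image, its underlying code, and a multimodal editing instruction. This supports end-to-end evaluation, free of intermediate representations, from both objective and subjective perspectives. The benchmark reveals substantial performance gaps for current multimodal large language models on global editing tasks, highlighting weaknesses in structural consistency and semantic understanding.
Link: https://arxiv.org/abs/2601.21694
Authors: Shuo Li,Jiajun Sun,Zhekai Wang,Xiaoran Fan,Hui Li,Dingwen Yang,Zhiheng Xi,Yijun Wang,Zifei Shan,Tao Gui,Qi Zhang,Xuanjing Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Our benchmark will be publicly available at this https URL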
Abstract:Charts are a fundamental visualization format for structured data analysis. Enabling end-to-end chart editing according to user intent is of great practical value, yet remains challenging due to the need for both fine-grained control and global structural consistency. Most existing approaches adopt pipeline-based designs, where natural language or code serves as an intermediate representation, limiting their ability to faithfully execute complex edits. We introduce ChartE³, an End-to-End Chart Editing benchmark that directly evaluates models without relying on intermediate natural language programs or code-level supervision. ChartE³ focuses on two complementary editing dimensions: local editing, which involves fine-grained appearance changes such as font or color adjustments, and global editing, which requires holistic, data-centric transformations including data filtering and trend line addition. ChartE³ contains over 1,200 high-quality samples constructed via a well-designed data pipeline with human curation. Each sample is provided as a triplet of a chart image, its underlying code, and a multimodal editing instruction, enabling evaluation from both objective and subjective perspectives. Extensive benchmarking of state-of-the-art multimodal large language models reveals substantial performance gaps, particularly on global editing tasks, highlighting critical limitations in current end-to-end chart editing capabilities.
[CV-41] Multimodal Visual Surrogate Compression for Alzheimer's Disease Classification
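[Quick read]: This paper addresses three challenges in representation learning for high-dimensional structural MRI (sMRI) in Alzheimer's Disease (AD) diagnosis: 3D convolutional neural networks (3D CNNs) are computationally expensive, slice-wise feature extraction with late aggregation loses cross-slice relations, and training-free feature extraction with pretrained 2D foundation models (e.g., DINO) struggles to capture discriminative features. The key to the solution is Multimodal Visual Surrogate Compression (MVSC), which has two core components: a Volume Context Encoder that captures global cross-slice context under textual guidance, and an Adaptive Slice Fusion module that aggregates slice-level information in a text-enhanced, patch-wise manner. Together they compress large 3D sMRI volumes into compact 2D features (called visual surrogates) aligned with frozen 2D foundation models, yielding efficient and discriminative representations for AD classification.
Link: https://arxiv.org/abs/2601.21673
Authors: Dexuan Ding,Ciyuan Peng,Endrowednes Kuantama,Jingcai Guo,Jia Wu,Jian Yang,Amin Beheshti,Ming-Hsuan Yang,Yuankai Qi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: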
Abstract:High-dimensional structural MRI (sMRI) images are widely used for Alzheimer’s Disease (AD) diagnosis. Most existing methods for sMRI representation learning rely on 3D architectures (e.g., 3D CNNs), slice-wise feature extraction with late aggregation, or apply training-free feature extractions using 2D foundation models (e.g., DINO). However, these three paradigms suffer from high computational cost, loss of cross-slice relations, and limited ability to extract discriminative features, respectively. To address these challenges, we propose Multimodal Visual Surrogate Compression (MVSC). It learns to compress and adapt large 3D sMRI volumes into compact 2D features, termed as visual surrogates, which are better aligned with frozen 2D foundation models to extract powerful representations for final AD classification. MVSC has two key components: a Volume Context Encoder that captures global cross-slice context under textual guidance, and an Adaptive Slice Fusion module that aggregates slice-level information in a text-enhanced, patch-wise manner. Extensive experiments on three large-scale Alzheimer’s disease benchmarks demonstrate our MVSC performs favourably on both binary and multi-class classification tasks compared against state-of-the-art methods.
[CV-42] When Gradient Optimization Is Not Enough: Dispersive and Anchoring Geometric Regularizer for Multimodal Learning
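[Quick read]: This paper addresses geometric pathologies in multimodal learning that arise when representation geometry is left unconstrained, such as intra-modal representation collapse and sample-level cross-modal inconsistency, which degrade both unimodal robustness and multimodal fusion. The key to the solution is a lightweight geometry-aware regularization framework, the Dispersive and Anchoring Geometric Regularizer of the title, which explicitly regulates intermediate embeddings through two complementary constraints: an intra-modal dispersive regularization that promotes representation diversity, and an inter-modal anchoring regularization that bounds sample-level cross-modal drift without enforcing rigid alignment. The method requires no architectural modifications, is compatible with various training paradigms, and improves both unimodal and multimodal performance across multiple multimodal benchmarks, showing that explicitly regulating representation geometry effectively mitigates modality trade-offs.
Link: https://arxiv.org/abs/2601.21670
Authors: Zixuan Xia,Hao Wang,Pengcheng Weng,Yanyu Qian,Yangxin Xu,William Dan,Fei Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: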
Abstract:Multimodal learning aims to integrate complementary information from heterogeneous modalities, yet strong optimization alone does not guarantee well-structured representations. Even under carefully balanced training schemes, multimodal models often exhibit geometric pathologies, including intra-modal representation collapse and sample-level cross-modal inconsistency, which degrade both unimodal robustness and multimodal fusion. We identify representation geometry as a missing control axis in multimodal learning and propose a Dispersive and Anchoring Geometric Regularizer, a lightweight geometry-aware regularization framework. The regularizer enforces two complementary constraints on intermediate embeddings: an intra-modal dispersive regularization that promotes representation diversity, and an inter-modal anchoring regularization that bounds sample-level cross-modal drift without rigid alignment. The proposed regularizer is plug-and-play, requires no architectural modifications, and is compatible with various training paradigms. Extensive experiments across multiple multimodal benchmarks demonstrate consistent improvements in both multimodal and unimodal performance, showing that explicitly regulating representation geometry effectively mitigates modality trade-offs.
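The two constraints can be made concrete with a rough sketch. The abstract does not give the loss formulas, so both functions below are assumptions: the dispersive term is written as a log-mean-exp of negative pairwise squared distances, and the anchoring term as a hinge on per-sample cross-modal distance. The names `dispersive_loss`, `anchoring_loss`, `temperature`, and `margin` are hypothetical.

```python
import math

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dispersive_loss(embeddings, temperature=1.0):
    """Assumed form: log-mean-exp of negative pairwise squared distances within
    one modality. Equals 0 when all embeddings collapse to one point and
    becomes more negative as they spread out."""
    n = len(embeddings)
    terms = [math.exp(-sq_dist(embeddings[i], embeddings[j]) / temperature)
             for i in range(n) for j in range(n) if i != j]
    return math.log(sum(terms) / len(terms))

def anchoring_loss(emb_a, emb_b, margin=1.0):
    """Assumed form: hinge penalty on the distance between paired samples of two
    modalities. Zero inside the margin, so drift is bounded without forcing
    the embeddings to coincide."""
    penalties = [max(0.0, math.sqrt(sq_dist(a, b)) - margin)
                 for a, b in zip(emb_a, emb_b)]
    return sum(penalties) / len(penalties)

collapsed = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
spread = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]]
audio = [[0.0, 0.0], [3.0, 0.0]]   # hypothetical modality A embeddings
video = [[0.5, 0.0], [3.0, 2.5]]   # hypothetical modality B embeddings

print(dispersive_loss(collapsed), dispersive_loss(spread))
print(anchoring_loss(audio, video))
```

In a training loop these terms would be added to the task loss with small weights; the hinge is what distinguishes bounding drift from rigid alignment, since pairs already inside the margin contribute no gradient.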
[CV-43] From Instruction to Event: Sound-Triggered Mobile Manipulation
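[Quick read]: This paper addresses the passivity of agents in current mobile manipulation research, where reliance on predefined textual instructions limits autonomy and responsiveness in dynamic environments. To overcome this, the authors propose a sound-triggered mobile manipulation paradigm in which agents must actively perceive and interact with sound-emitting objects without explicit action instructions. The key to the solution is Habitat-Echo, a data platform that integrates acoustic rendering with physical interaction to support audio-aware tasks in realistic scenes, together with a baseline system consisting of a high-level task planner and low-level policy models. The baseline lets agents detect and respond to auditory events autonomously; in challenging settings such as the dual-source scenario with acoustic interference, the agent still identifies the primary source and completes the successive manipulations, which verifies the robustness of the approach.
Link: https://arxiv.org/abs/2601.21667
Authors: Hao Ju,Shaofei Huang,Hongyu Li,Zihan Ding,Si Liu,Meng Wang,Zhedong Zheng
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: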
Abstract:Current mobile manipulation research predominantly follows an instruction-driven paradigm, where agents rely on predefined textual commands to execute tasks. However, this setting confines agents to a passive role, limiting their autonomy and ability to react to dynamic environmental events. To address these limitations, we introduce sound-triggered mobile manipulation, where agents must actively perceive and interact with sound-emitting objects without explicit action instructions. To support these tasks, we develop Habitat-Echo, a data platform that integrates acoustic rendering with physical interaction. We further propose a baseline comprising a high-level task planner and low-level policy models to complete these tasks. Extensive experiments show that the proposed baseline empowers agents to actively detect and respond to auditory events, eliminating the need for case-by-case instructions. Notably, in the challenging dual-source scenario, the agent successfully isolates the primary source from overlapping acoustic interference to execute the first interaction, and subsequently proceeds to manipulate the secondary object, verifying the robustness of the baseline.
[CV-44] SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding
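[Quick read]: This paper addresses the under-explored ability of Multimodal Large Language Models (MLLMs) to process sequential audio-video data, and in particular the lack of a high-quality benchmark covering real-world scenarios for systematically evaluating dynamic multimodal understanding. The key to the solution is SONIC-O1, a comprehensive, human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. It evaluates MLLMs on open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales. Experiments reveal a substantial 22.6% performance gap in temporal localization between the best closed-source and open-source models, and performance degrades across demographic groups, exposing a social robustness problem. The benchmark thus offers a reproducible evaluation framework and directions for future work.
Link: https://arxiv.org/abs/2601.21666
Authors: Ahmed Y. Radwan,Christos Emmanouilidis,Hina Tabassum,Deval Pandya,Shaina Raza
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: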
Abstract:Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal limitations. While the performance gap in MCQ accuracy between two model families is relatively small, we observe a substantial 22.6% performance difference in temporal localization between the best performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding. We release SONIC-O1 for reproducibility and research, with a project page, dataset, GitHub repository, and leaderboard.
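Temporal localization is commonly scored by the overlap between a predicted time span and the annotated one. The abstract does not state which metric SONIC-O1 uses, so the temporal IoU below is only an assumed, widely used choice; the function name and the example intervals are hypothetical.

```python
def temporal_iou(pred, truth):
    """Intersection-over-union of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0

overlap = temporal_iou((10.0, 20.0), (15.0, 25.0))   # 5 s shared out of a 15 s union
disjoint = temporal_iou((0.0, 5.0), (10.0, 12.0))    # no shared time
print(overlap, disjoint)
```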
[CV-45] Few-Shot Domain Adaptation with Temporal References and Static Priors for Glacier Calving Front Delineation
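[Quick read]: This paper addresses the sharp drop in calving front segmentation accuracy when a deep learning model is applied to a novel study site (an out-of-distribution domain), which limits its use in real scientific analyses. The key to the solution is a few-shot domain adaptation strategy combined with spatial static prior knowledge and summer reference images added to the input time series. Without any architectural modification, this reduces the delineation error from 1131.6 m to 68.7 m and provides a transferable deep learning framework for global-scale calving front monitoring.
Link: https://arxiv.org/abs/2601.21663
Authors: Marcel Dreier,Nora Gourmelon,Dakota Pyles,Thorsten Seehaus,Matthias H. Braun,Andreas Maier,Vincent Christlein
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: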
Abstract:During benchmarking, the state-of-the-art model for glacier calving front delineation achieves near-human performance. However, when applied in a real-world setting at a novel study site, its delineation accuracy is insufficient for calving front products intended for further scientific analyses. This site represents an out-of-distribution domain for a model trained solely on the benchmark dataset. By employing a few-shot domain adaptation strategy, incorporating spatial static prior knowledge, and including summer reference images in the input time series, the delineation error is reduced from 1131.6 m to 68.7 m without any architectural modifications. These methodological advancements establish a framework for applying deep learning-based calving front segmentation to novel study sites, enabling calving front monitoring on a global scale.
[CV-46] CAF-Mamba: Mamba-Based Cross-Modal Adaptive Attention Fusion for Multimodal Depression Detection ICASSP
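[Quick read]: This paper addresses three problems in current deep learning methods for depression detection: reliance on limited feature types, insufficient modeling of explicit and implicit cross-modal interactions, and overly simple fusion strategies such as static concatenation or fixed weighting. The key to the solution is CAF-Mamba, a Mamba-based cross-modal adaptive attention fusion framework that dynamically adjusts the contribution of each modality through a modality-wise attention mechanism while capturing cross-modal interactions both explicitly and implicitly, enabling more effective multimodal fusion. Experiments on two in-the-wild benchmark datasets, LMVD and D-Vlog, show that the method outperforms existing approaches and achieves state-of-the-art performance.
Link: https://arxiv.org/abs/2601.21648
Authors: Bowen Zhou,Marc-André Fiedler,Ayoub Al-Hamadi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: The paper contains a total of 5 pages and 3 figures. This paper has been accepted for publication in the proceedings of 2026 IEEE ICASSP Conference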
Abstract:Depression is a prevalent mental health disorder that severely impairs daily functioning and quality of life. While recent deep learning approaches for depression detection have shown promise, most rely on limited feature types, overlook explicit cross-modal interactions, and employ simple concatenation or static weighting for fusion. To overcome these limitations, we propose CAF-Mamba, a novel Mamba-based cross-modal adaptive attention fusion framework. CAF-Mamba not only captures cross-modal interactions explicitly and implicitly, but also dynamically adjusts modality contributions through a modality-wise attention mechanism, enabling more effective multimodal fusion. Experiments on two in-the-wild benchmark datasets, LMVD and D-Vlog, demonstrate that CAF-Mamba consistently outperforms existing methods and achieves state-of-the-art performance.
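The contrast between static weighting and a modality-wise attention mechanism can be sketched with a softmax-weighted fusion. This is a generic illustration, not CAF-Mamba's Mamba-based design, which the abstract does not detail; `fuse`, the `query` vector (learned in a real model, fixed here), and the feature values are assumptions.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(features, query):
    """Score each modality's feature vector against `query`, turn the scores
    into softmax weights, and return the weighted sum plus the weights."""
    names = list(features)
    scores = [sum(f * q for f, q in zip(features[name], query)) for name in names]
    weights = softmax(scores)
    fused = [sum(w * features[name][d] for w, name in zip(weights, names))
             for d in range(len(query))]
    return fused, dict(zip(names, weights))

features = {"audio": [1.0, 0.0], "visual": [0.0, 1.0]}  # hypothetical per-modality features
fused, weights = fuse(features, query=[2.0, 0.0])
print(fused, weights)
```

Because the weights depend on the input features, each sample can lean on a different modality, which a fixed concatenation or fixed weighting cannot do.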
[CV-47] OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models
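[Quick read]: This paper addresses the fact that current optical character recognition (OCR) work focuses mainly on recognizing text elements (text-centric OCR) while neglecting the recognition of visual elements in visually information-dense images (vision-centric OCR), such as charts, web pages, and scientific plots. Such content is widespread on the internet and has significant practical value, for example in data visualization and web page analysis. The key to the solution is OCRVerse, described as the first end-to-end unified OCR method that supports both text-centric and vision-centric OCR. Its core contributions are a comprehensive dataset covering text-centric documents (such as newspapers and books) and vision-centric rendered images (such as charts, web pages, and scientific plots), and a two-stage SFT-RL multi-domain training strategy. In the first stage, supervised fine-tuning (SFT) mixes cross-domain data to establish initial domain knowledge; in the second stage, reinforcement learning (RL) uses reward strategies customized for each domain, which accommodates diverse output formats and expectations, improves cross-domain fusion, and avoids data conflicts. Experiments show that OCRVerse achieves competitive results on both data types, comparable even to large-scale open-source and closed-source models.
Link: https://arxiv.org/abs/2601.21639
Authors: Yufeng Zhong,Lei Chen,Xuanle Zhao,Wenkang Han,Liming Zheng,Jing Huang,Deyang Jiang,Yilin Cao,Lin Ma,Zhixiong Zeng
Affiliations: Meituan (美团)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: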
Abstract:The development of large vision language models drives the demand for managing and applying massive amounts of multimodal data, making OCR technology, which extracts information from visual images, increasingly popular. However, existing OCR methods primarily focus on recognizing text elements from images or scanned documents (Text-centric OCR), neglecting the identification of visual elements from visually information-dense image sources (Vision-centric OCR), such as charts, web pages and science plots. In reality, these visually information-dense images are widespread on the internet and have significant real-world application value, such as data visualization and web page analysis. In this technical report, we propose OCRVerse, the first holistic OCR method in an end-to-end manner that enables unified text-centric OCR and vision-centric OCR. To this end, we conduct comprehensive data engineering to cover a wide range of text-centric documents, such as newspapers, magazines and books, as well as vision-centric rendered composites, including charts, web pages and scientific plots. Moreover, we propose a two-stage SFT-RL multi-domain training method for OCRVerse. SFT directly mixes cross-domain data to train and establish initial domain knowledge, while RL focuses on designing personalized reward strategies for the characteristics of each domain. Specifically, since different domains require various output formats and expected outputs, we provide sufficient flexibility in the RL stage to customize flexible reward signals for each domain, thereby improving cross-domain fusion and avoiding data conflicts. Experimental results demonstrate the effectiveness of OCRVerse, achieving competitive results across text-centric and vision-centric data types, even comparable to large-scale open-source and closed-source models.
[CV-48] RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning
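[Quick read]: This paper addresses the difficulty of localization in Remote Sensing Visual Grounding (RSVG), caused by the vast spatial scale and high semantic ambiguity of remote sensing scenes, with a particular focus on the limited spatial reasoning of Multimodal Large Language Models (MLLMs). The key to the solution is RSGround-R1, a reasoning-guided, position-aware post-training framework with three parts. First, Chain-of-Thought Supervised Fine-Tuning (CoT-SFT) on synthetically generated RSVG reasoning data establishes explicit position awareness. Second, Reinforcement Fine-Tuning (RFT) is augmented with a positional reward that gives continuous, distance-aware guidance toward accurate localization. Third, a spatial consistency guided optimization scheme dynamically adjusts policy updates to suppress inconsistent localization across rollouts, ensuring stable convergence and robustness.
Link: https://arxiv.org/abs/2601.21634
Authors: Shiqi Huang,Shuting He,Bihan Wen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: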
Abstract:Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions. Owing to the vast spatial scale and high semantic ambiguity of remote sensing scenes, these descriptions often rely heavily on positional cues, posing unique challenges for Multimodal Large Language Models (MLLMs) in spatial reasoning. To leverage this unique feature, we propose a reasoning-guided, position-aware post-training framework, dubbed RSGround-R1, to progressively enhance spatial understanding. Specifically, we first introduce Chain-of-Thought Supervised Fine-Tuning (CoT-SFT) using synthetically generated RSVG reasoning data to establish explicit position awareness. Reinforcement Fine-Tuning (RFT) is then applied, augmented by our newly designed positional reward that provides continuous and distance-aware guidance toward accurate localization. Moreover, to mitigate incoherent localization behaviors across rollouts, we introduce a spatial consistency guided optimization scheme that dynamically adjusts policy updates based on their spatial coherence, ensuring stable and robust convergence. Extensive experiments on RSVG benchmarks demonstrate superior performance and generalization of our model.
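A reward that is "continuous and distance-aware" differs from a pass/fail overlap threshold in that a near miss scores higher than a far miss. The paper's exact reward is not given in the abstract; the exponential decay on normalized center distance below, along with the names `positional_reward` and `scale`, is an assumption used only to illustrate the idea.

```python
import math

def box_center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def positional_reward(pred_box, gt_box, image_size, scale=0.1):
    """Assumed form: exp(-d / scale), where d is the distance between box
    centers divided by the image diagonal. Ranges over (0, 1]."""
    width, height = image_size
    (px, py), (gx, gy) = box_center(pred_box), box_center(gt_box)
    normalized = math.hypot(px - gx, py - gy) / math.hypot(width, height)
    return math.exp(-normalized / scale)

gt = (100, 100, 200, 200)
near = (110, 100, 210, 200)   # shifted 10 px
far = (600, 600, 700, 700)    # other side of the image
size = (1000, 1000)

r_exact = positional_reward(gt, gt, size)
r_near = positional_reward(near, gt, size)
r_far = positional_reward(far, gt, size)
print(r_exact, r_near, r_far)
```

A thresholded IoU reward would give both misses the same zero signal once overlap is lost, whereas this form still ranks them, which matters in large aerial scenes where early predictions are often far from the target.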
[CV-49] A Tilted Seesaw: Revisiting Autoencoder Trade-off for Controllable Diffusion
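[Quick read]: This paper addresses a bias in how autoencoders (AEs) are evaluated for latent diffusion models: ImageNet-scale AE studies tend to select AEs mainly by generation-friendly metrics such as gFID while neglecting reconstruction fidelity, which causes condition drift in controllable diffusion and harms controllability. The key finding is that reconstruction fidelity, especially instance-level measures, correlates more strongly with condition preservation in controllable generation, whereas gFID is only weakly predictive. Using a multi-dimensional condition-drift evaluation protocol and ControlNet experiments, the paper shows that AEs should be selected primarily on reconstruction-oriented metrics to improve the reliability and scalability of controllable diffusion systems.
Link: https://arxiv.org/abs/2601.21633
Authors: Pu Cao,Yiyang Ma,Feng Zhou,Xuedan Yin,Qing Song,Lu Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: work in progress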
Abstract:In latent diffusion models, the autoencoder (AE) is typically expected to balance two capabilities: faithful reconstruction and a generation-friendly latent space (e.g., low gFID). In recent ImageNet-scale AE studies, we observe a systematic bias toward generative metrics in handling this trade-off: reconstruction metrics are increasingly under-reported, and ablation-based AE selection often favors the best-gFID configuration even when reconstruction fidelity degrades. We theoretically analyze why this gFID-dominant preference can appear unproblematic for ImageNet generation, yet becomes risky when scaling to controllable diffusion: AEs can induce condition drift, which limits achievable condition alignment. Meanwhile, we find that reconstruction fidelity, especially instance-level measures, better indicates controllability. We empirically validate the impact of tilted autoencoder evaluation on controllability by studying several recent ImageNet AEs. Using a multi-dimensional condition-drift evaluation protocol reflecting controllable generation tasks, we find that gFID is only weakly predictive of condition preservation, whereas reconstruction-oriented metrics are substantially more aligned. ControlNet experiments further confirm that controllability tracks condition preservation rather than gFID. Overall, our results expose a gap between ImageNet-centric AE evaluation and the requirements of scalable controllable diffusion, offering practical guidance for more reliable benchmarking and model selection.
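An instance-level reconstruction measure is computed per image, comparing each input with its own autoencoder reconstruction, unlike gFID, which compares distributions. PSNR is one standard such measure; whether the paper uses it is an assumption, and the pixel values below are made up for illustration.

```python
import math

def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

pixels = [0, 128, 255, 64]
decoded = [0, 130, 250, 64]   # hypothetical autoencoder output
score = psnr(pixels, decoded)
print(f"{score:.2f} dB")
```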
[CV-50] Similarity of Processing Steps in Vision Model Representations
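[Quick read]: This paper asks whether different vision models converge during training not only to similar final ("universal") representations but also to the same intermediate processing steps and operations. The key to the approach is to quantify the distance between representations of different models at each layer and track how it evolves throughout processing, identifying the stages that differ most. The study finds that layers at similar positions have the most similar representations, yet CNN and transformer architectures still differ markedly: transformer models change representations more smoothly from layer to layer, and classifier models discard low-level image statistics in their final layers. These results clarify the level and nature of convergence between model representations.
Link: https://arxiv.org/abs/2601.21621
Authors: Matéo Mahaut,Marco Baroni
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: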
Abstract:Recent literature suggests that the bigger the model, the more likely it is to converge to similar, “universal” representations, despite different training objectives, datasets, or modalities. While this literature shows that there is an area where model representations are similar, we study here how vision models might get to those representations; in particular, do they also converge to the same intermediate steps and operations? We therefore study the processes that lead to convergent representations in different models. First, we quantify distance between different model representations at different stages. We follow the evolution of distances between models throughout processing, identifying the processing steps which are most different between models. We find that while layers at similar positions in different models have the most similar representations, strong differences remain. Classifier models, unlike the others, will discard information about low-level image statistics in their final layers. CNN- and transformer-based models also behave differently, with transformer models applying smoother changes to representations from one layer to the next. These distinctions clarify the level and nature of convergence between model representations, and enable a more qualitative account of the underlying processes in image models.
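Comparing layers across models requires a measure that works even when the layers have different widths. The abstract does not name the distance the authors use; the sketch below substitutes a simple representational similarity analysis (correlating pairwise stimulus distances), and the function names and toy activations are assumptions.

```python
import math

def pairwise_distances(reps):
    """Euclidean distances between all stimulus pairs for one layer."""
    n = len(reps)
    return [math.dist(reps[i], reps[j]) for i in range(n) for j in range(i + 1, n)]

def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm = math.sqrt(sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b))
    return cov / norm

def representation_similarity(reps_a, reps_b):
    """Correlation of the two layers' pairwise-distance profiles. Rows must
    refer to the same stimuli; feature dimensions may differ."""
    return pearson(pairwise_distances(reps_a), pairwise_distances(reps_b))

layer_a = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
layer_b = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 4.0, 0.0], [6.0, 6.0, 0.0]]  # same geometry, scaled, wider
layer_c = [[0.0, 0.0], [3.0, 3.0], [1.0, 0.0], [0.0, 2.0]]                     # same points, different stimulus order

same_geometry = representation_similarity(layer_a, layer_b)
different_geometry = representation_similarity(layer_a, layer_c)
print(same_geometry, different_geometry)
```

Computing this score for every pair of layers from two models yields a grid in which one can check whether layers at matching depths are the most similar.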
[CV-51] PathReasoner-R1: Instilling Structured Reasoning into Pathology Vision-Language Model via Knowledge-Guided Policy Optimization
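[Quick read]: This paper addresses the lack of verifiable evidence in the diagnostic reasoning of current Vision-Language Models (VLMs) in digital pathology: existing systems usually output conclusions directly without an interpretable reasoning path tied to the diagnostic logic, which weakens clinical trust and hinders expert error correction. The key to the solution is PathReasoner, described as the first large-scale whole-slide image (WSI) reasoning dataset, built with a generation pipeline guided by medical knowledge graphs (KGs) that explicitly aligns structured pathological findings and clinical reasoning with diagnoses, producing over 20K high-quality instruction samples. On top of it, the PathReasoner-R1 model combines trajectory-masked supervised fine-tuning with reasoning-oriented reinforcement learning, and a knowledge-aware multi-granular reward function (including an Entity Reward mechanism) steers the model to optimize for logical consistency rather than mere outcome matching, improving the transparency and clinical trustworthiness of pathology models.
Link: https://arxiv.org/abs/2601.21617
Authors: Songhan Jiang,Fengchun Liu,Ziyue Wang,Linghan Cai,Yongbing Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: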
Abstract:Vision-Language Models (VLMs) are advancing computational pathology with superior visual understanding capabilities. However, current systems often reduce diagnosis to directly outputting conclusions without verifiable evidence-linked reasoning, which severely limits clinical trust and hinders expert error rectification. To address these barriers, we construct PathReasoner, the first large-scale dataset of whole-slide image (WSI) reasoning. Unlike previous work reliant on unverified distillation, we develop a rigorous knowledge-guided generation pipeline. By leveraging medical knowledge graphs, we explicitly align structured pathological findings and clinical reasoning with diagnoses, generating over 20K high-quality instructional samples. Based on the database, we propose PathReasoner-R1, which synergizes trajectory-masked supervised fine-tuning with reasoning-oriented reinforcement learning to instill structured chain-of-thought capabilities. To ensure medical rigor, we engineer a knowledge-aware multi-granular reward function incorporating an Entity Reward mechanism strictly aligned with knowledge graphs. This effectively guides the model to optimize for logical consistency rather than mere outcome matching, thereby enhancing robustness. Extensive experiments demonstrate that PathReasoner-R1 achieves state-of-the-art performance on both PathReasoner and public benchmarks across various image scales, equipping pathology models with transparent, clinically grounded reasoning capabilities. Dataset and code are available at this https URL.
[CV-52] WMVLM: Evaluating Diffusion Model Image Watermarking via Vision-Language Models
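[Quick read]: This paper addresses four limitations of current evaluation methods for watermarks in diffusion-generated images: no unified framework for residual and semantic watermarks, results that lack interpretability, neglect of comprehensive security considerations, and inappropriate metrics for semantic watermarks. The key to the solution is WMVLM, described as the first unified and interpretable evaluation framework based on Vision-Language Models (VLMs). It redefines quality and security metrics for each watermark type (residual watermarks are measured by artifact strength and erasure resistance, semantic watermarks by latent distribution shifts) and uses a three-stage training strategy that progressively enables classification, scoring, and interpretable text generation, giving strong generalization across datasets, diffusion models, and watermarking methods.
Link: https://arxiv.org/abs/2601.21610
Authors: Zijin Yang,Yu Sun,Kejiang Chen,Jiawei Zhao,Jun Jiang,Weiming Zhang,Nenghai Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: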
Abstract:Digital watermarking is essential for securing generated images from diffusion models. Accurate watermark evaluation is critical for algorithm development, yet existing methods have significant limitations: they lack a unified framework for both residual and semantic watermarks, provide results without interpretability, neglect comprehensive security considerations, and often use inappropriate metrics for semantic watermarks. To address these gaps, we propose WMVLM, the first unified and interpretable evaluation framework for diffusion model image watermarking via vision-language models (VLMs). We redefine quality and security metrics for each watermark type: residual watermarks are evaluated by artifact strength and erasure resistance, while semantic watermarks are assessed through latent distribution shifts. Moreover, we introduce a three-stage training strategy to progressively enable the model to achieve classification, scoring, and interpretable text generation. Experiments show WMVLM outperforms state-of-the-art VLMs with strong generalization across datasets, diffusion models, and watermarking methods.
[CV-53] HydroSense: A Dual-Microcontroller IoT Framework for Real-Time Multi-Parameter Water Quality Monitoring with Edge Processing and Cloud Analytics
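[Quick read]: This paper addresses the high cost, poor real-time capability, and limited accessibility of water quality monitoring in resource-constrained environments amid the global water crisis. Traditional approaches that rely on manual sampling or expensive commercial equipment struggle to meet the needs of developing countries and remote regions. The key to the solution is HydroSense, a low-cost, high-accuracy Internet of Things (IoT) water monitoring framework. Its core innovations are a dual-microcontroller architecture (an Arduino Uno for precision analog measurement with five-point calibration algorithms, and an ESP32 for wireless connectivity, edge processing, and cloud integration) combined with median filtering, temperature compensation algorithms, and robust error handling. The system measures six key parameters, including pH, dissolved oxygen (DO), and total dissolved solids (TDS), with 99.8% cloud data transmission reliability, at a total cost of about 300 USD (an 85% reduction compared to commercial systems).
Link: https://arxiv.org/abs/2601.21595
Authors: Abdul Hasib,A. S. M. Ahsanul Sarkar Akib,Anish Giri
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: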
Abstract:The global water crisis necessitates affordable, accurate, and real-time water quality monitoring solutions. Traditional approaches relying on manual sampling or expensive commercial systems fail to address accessibility challenges in resource-constrained environments. This paper presents HydroSense, an innovative Internet of Things framework that integrates six critical water quality parameters including pH, dissolved oxygen (DO), temperature, total dissolved solids (TDS), estimated nitrogen, and water level into a unified monitoring system. HydroSense employs a novel dual-microcontroller architecture, utilizing Arduino Uno for precision analog measurements with five-point calibration algorithms and ESP32 for wireless connectivity, edge processing, and cloud integration. The system implements advanced signal processing techniques including median filtering for TDS measurement, temperature compensation algorithms, and robust error handling. Experimental validation over 90 days demonstrates exceptional performance metrics: pH accuracy of plus or minus 0.08 units across the 0 to 14 range, DO measurement stability within plus or minus 0.2 mg/L, TDS accuracy of plus or minus 1.9 percent across 0 to 1000 ppm, and 99.8 percent cloud data transmission reliability. With a total implementation cost of 32,983 BDT (approximately 300 USD), HydroSense achieves an 85 percent cost reduction compared to commercial systems while providing enhanced connectivity through the Firebase real-time database. This research establishes a new paradigm for accessible environmental monitoring, demonstrating that professional-grade water quality assessment can be achieved through intelligent system architecture and cost-effective component selection.
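Two of the signal-processing steps named above, median filtering and multi-point calibration, are easy to show in miniature. The sketch is in Python for readability rather than the microcontroller firmware the system would actually run, and it assumes the five-point calibration is a piecewise-linear map between reference points, which the abstract does not confirm. The calibration pairs and readings are invented.

```python
import statistics
from bisect import bisect_right

def median_filter(samples, window=5):
    """Sliding median that suppresses isolated spikes; the window shrinks at the edges."""
    half = window // 2
    return [statistics.median(samples[max(0, i - half): i + half + 1])
            for i in range(len(samples))]

def calibrate(raw, points):
    """Piecewise-linear interpolation through sorted (raw_reading, reference_value)
    pairs; readings outside the calibrated range are clamped to the end values."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    if raw <= xs[0]:
        return ys[0]
    if raw >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, raw) - 1
    frac = (raw - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + frac * (ys[i + 1] - ys[i])

# Hypothetical TDS readings in ppm with one spike.
readings = [500, 502, 900, 501, 499]
filtered = median_filter(readings, window=3)

# Hypothetical five-point pH calibration: probe voltage -> buffer pH.
ph_points = [(0.5, 4.0), (1.0, 6.0), (1.5, 7.0), (2.0, 9.0), (2.5, 10.0)]
ph = calibrate(1.25, ph_points)
print(filtered, ph)
```

A median is preferred over a moving average here because a single outlier is discarded outright instead of being smeared across neighboring samples.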
[CV-54] Unifying Heterogeneous Degradations: Uncertainty-Aware Diffusion Bridge Model for All-in-One Image Restoration
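[Quick read]: This paper addresses conflicting optimization objectives across heterogeneous degradation types in All-in-One Image Restoration (AiOIR), where existing methods are limited by coarse-grained control mechanisms or fixed mapping schedules and adapt poorly. The key to the solution is the Uncertainty-Aware Diffusion Bridge Model (UDBM), which reformulates AiOIR as a stochastic transport problem steered by pixel-wise uncertainty. A relaxed diffusion bridge formulation replaces the strict terminal constraint, modeling degradation uncertainty while theoretically resolving the drift singularity of standard diffusion bridges. A dual modulation strategy complements it: the noise schedule aligns diverse degradations into a shared high-entropy latent space, and the path schedule adaptively regulates the transport trajectory, motivated by the viscous dynamics of entropy regularization. By rectifying the transport geometry and dynamics, UDBM achieves state-of-the-art performance across diverse restoration tasks within a single inference step.
Link: https://arxiv.org/abs/2601.21592
Authors: Luwei Tu,Jiawei Wu,Xing Luo,Zhi Jin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: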
Abstract:All-in-One Image Restoration (AiOIR) faces the fundamental challenge in reconciling conflicting optimization objectives across heterogeneous degradations. Existing methods are often constrained by coarse-grained control mechanisms or fixed mapping schedules, yielding suboptimal adaptation. To address this, we propose an Uncertainty-Aware Diffusion Bridge Model (UDBM), which innovatively reformulates AiOIR as a stochastic transport problem steered by pixel-wise uncertainty. By introducing a relaxed diffusion bridge formulation which replaces the strict terminal constraint with a relaxed constraint, we model the uncertainty of degradations while theoretically resolving the drift singularity inherent in standard diffusion bridges. Furthermore, we devise a dual modulation strategy: the noise schedule aligns diverse degradations into a shared high-entropy latent space, while the path schedule adaptively regulates the transport trajectory motivated by the viscous dynamics of entropy regularization. By effectively rectifying the transport geometry and dynamics, UDBM achieves state-of-the-art performance across diverse restoration tasks within a single inference step.
[CV-55] Bi-Anchor Interpolation Solver for Accelerating Generative Modeling
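【Quick Read】: This paper targets the significant latency bottleneck that Flow Matching (FM) models incur in high-fidelity generation because they rely on iterative Ordinary Differential Equation (ODE) solving. Existing approaches face a dilemma: training-free solvers degrade severely at low Neural Function Evaluations (NFEs), while training-based one-step or few-step generation methods carry high training costs and lack plug-and-play flexibility. The key to the solution is the Bi-Anchor Interpolation Solver (BA-solver), built on two synergistic mechanisms: 1) Bidirectional Temporal Perception, in which a lightweight SideNet learns future and historical velocities without retraining the backbone; and 2) Bi-Anchor Velocity Integration, which uses the SideNet together with two anchor velocities to efficiently approximate intermediate velocities, enabling batched high-order integration. The backbone establishes high-precision "anchors" and the SideNet densifies the trajectory, minimizing error at large step intervals. With only 5 to 10 NFEs the method reaches the quality of a 100+ NFE Euler solver, with very low training overhead and good compatibility with downstream tasks.
Link: https://arxiv.org/abs/2601.21542
Authors: Hongxu Chen, Hongxiang Li, Zhen Wang, Long Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: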
Abstract:Flow Matching (FM) models have emerged as a leading paradigm for high-fidelity synthesis. However, their reliance on iterative Ordinary Differential Equation (ODE) solving creates a significant latency bottleneck. Existing solutions face a dichotomy: training-free solvers suffer from significant performance degradation at low Neural Function Evaluations (NFEs), while training-based one- or few-step generation methods incur prohibitive training costs and lack plug-and-play versatility. To bridge this gap, we propose the Bi-Anchor Interpolation Solver (BA-solver). BA-solver retains the versatility of standard training-free solvers while achieving significant acceleration by introducing a lightweight SideNet (1-2% backbone size) alongside the frozen backbone. Specifically, our method is founded on two synergistic components: 1) Bidirectional Temporal Perception, where the SideNet learns to approximate both future and historical velocities without retraining the heavy backbone; and 2) Bi-Anchor Velocity Integration, which utilizes the SideNet with two anchor velocities to efficiently approximate intermediate velocities for batched high-order integration. By utilizing the backbone to establish high-precision "anchors" and the SideNet to densify the trajectory, BA-solver enables large interval sizes with minimized error. Empirical results on ImageNet 256×256 demonstrate that BA-solver achieves generation quality comparable to a 100+ NFE Euler solver in just 10 NFEs and maintains high fidelity in as few as 5 NFEs, incurring negligible training costs. Furthermore, BA-solver ensures seamless integration with existing generative pipelines, facilitating downstream tasks such as image editing.
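To make the NFE trade-off concrete, here is a small Python sketch of the fixed-step Euler baseline that the abstract compares against. It is not BA-solver itself. The toy velocity field `toy_velocity` is a hypothetical stand-in for a learned network, chosen because its ODE has a closed-form solution.

```python
import math


def euler_sample(velocity, x0: float, n_steps: int) -> tuple[float, int]:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler.

    Returns the final state and the number of velocity evaluations (NFEs).
    """
    x, nfe = x0, 0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x += dt * velocity(x, i * dt)  # one network call per step
        nfe += 1
    return x, nfe


def toy_velocity(x: float, t: float) -> float:
    # Hypothetical stand-in for a learned velocity network: dx/dt = -x.
    return -x


exact = math.exp(-1.0)  # closed-form solution at t=1 for x0 = 1
coarse, nfe_coarse = euler_sample(toy_velocity, 1.0, 5)
fine, nfe_fine = euler_sample(toy_velocity, 1.0, 100)
```

With 5 steps the discretization error is visibly larger than with 100 steps, which is the gap that low-NFE solvers try to close.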
[CV-56] Vision KAN: Towards an Attention-Free Backbone for Vision with Kolmogorov-Arnold Networks
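【Quick Read】: This paper addresses two core problems of attention mechanisms in vision backbones: first, their quadratic complexity in sequence length makes computation expensive and limits scalability to high-resolution features; second, attention weights are hard to interpret, which hurts interpretability and transparency. In response, the authors propose an attention-free backbone, Vision KAN (ViK). Its key innovation is MultiPatch-RBFKAN, a unified token mixer that combines three designs: (a) patch-level nonlinear transforms implemented with Radial Basis Function (RBF)-based Kolmogorov-Arnold Networks (KAN), (b) an axis-wise separable mixing strategy for efficient local information propagation, and (c) a low-rank global mapping for modeling long-range dependencies. By combining the structural advantages of KAN with a patch-wise grouping strategy, ViK restores cross-patch relationships while keeping linear complexity and achieves competitive accuracy on ImageNet-1K, supporting KAN-based token mixing as an efficient and theoretically grounded alternative to attention.
Link: https://arxiv.org/abs/2601.21541
Authors: Zhuoqin Yang, Jiansong Zhang, Xiaoling Luo, Xu Wu, Zheng Lu, Linlin Shen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: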
Abstract:Attention mechanisms have become a key module in modern vision backbones due to their ability to model long-range dependencies. However, their quadratic complexity in sequence length and the difficulty of interpreting attention weights limit both scalability and clarity. Recent attention-free architectures demonstrate that strong performance can be achieved without pairwise attention, motivating the search for alternatives. In this work, we introduce Vision KAN (ViK), an attention-free backbone inspired by the Kolmogorov-Arnold Networks. At its core lies MultiPatch-RBFKAN, a unified token mixer that combines (a) patch-wise nonlinear transform with Radial Basis Function-based KANs, (b) axis-wise separable mixing for efficient local propagation, and (c) low-rank global mapping for long-range interaction. Employed as a drop-in replacement for attention modules, this formulation tackles the prohibitive cost of full KANs on high-resolution features by adopting a patch-wise grouping strategy with lightweight operators to restore cross-patch dependencies. Experiments on ImageNet-1K show that ViK achieves competitive accuracy with linear complexity, demonstrating the potential of KAN-based token mixing as an efficient and theoretically grounded alternative to attention.
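A KAN layer places a learnable univariate function on every input-output edge and sums the results. The Python sketch below parameterizes each edge function as a weighted sum of Gaussian radial basis functions. It is a generic illustration under assumed names and shapes (`rbf_kan_layer`, `weights[j][i][k]`), not the MultiPatch-RBFKAN module from the paper.

```python
import math


def rbf_basis(x: float, centers: list[float], width: float) -> list[float]:
    """Gaussian bumps evaluated at a scalar input."""
    return [math.exp(-((x - c) / width) ** 2) for c in centers]


def rbf_kan_layer(x, weights, centers, width):
    """Compute out[j] = sum_i phi_ji(x[i]) with RBF-parameterized edge functions.

    weights[j][i][k] is the coefficient of basis k on the edge from
    input i to output j (an assumed layout for this sketch).
    """
    out = []
    for w_j in weights:
        total = 0.0
        for x_i, w_ji in zip(x, w_j):
            basis = rbf_basis(x_i, centers, width)
            total += sum(w * b for w, b in zip(w_ji, basis))
        out.append(total)
    return out
```

The cost grows with inputs times outputs times basis functions, which hints at why applying a full KAN to high-resolution feature maps is expensive and why the abstract resorts to patch-wise grouping.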
[CV-57] On the Adversarial Robustness of Large Vision-Language Models under Visual Token Compression
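【Quick Read】: This paper addresses the distorted adversarial-robustness evaluation that visual token compression introduces in large vision-language models (LVLMs). Existing encoder-based attacks substantially overestimate the robustness of compressed models because optimization and inference are mismatched: perturbations are optimized on the full-token representation, whereas inference passes through a compression bottleneck. The key to the solution is the Compression-AliGnEd attack (CAGE), whose core mechanisms are: (i) expected feature disruption, which concentrates perturbation on tokens likely to survive across different compression budgets; and (ii) rank distortion alignment, which actively aligns token distortion with retention rank scores to promote the retention of highly distorted information. CAGE needs no access to the deployed compression mechanism or its token budget, yet yields attacks that better match the real inference pipeline. Experiments show that it consistently lowers robust accuracy relative to the baseline, revealing that robustness evaluations that ignore compression risk being overly optimistic.
Link: https://arxiv.org/abs/2601.21531
Authors: Xinwei Zhang, Hangcheng Liu, Li Bai, Hao Wang, Qingqing Ye, Tianwei Zhang, Haibo Hu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review, 20 pages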
Abstract:Visual token compression is widely used to accelerate large vision-language models (LVLMs) by pruning or merging visual tokens, yet its adversarial robustness remains unexplored. We show that existing encoder-based attacks can substantially overestimate the robustness of compressed LVLMs, due to an optimization-inference mismatch: perturbations are optimized on the full-token representation, while inference is performed through a token-compression bottleneck. To address this gap, we propose the Compression-AliGnEd attack (CAGE), which aligns perturbation optimization with compression inference without assuming access to the deployed compression mechanism or its token budget. CAGE combines (i) expected feature disruption, which concentrates distortion on tokens likely to survive across plausible budgets, and (ii) rank distortion alignment, which actively aligns token distortions with rank scores to promote the retention of highly distorted evidence. Across diverse representative plug-and-play compression mechanisms and datasets, our results show that CAGE consistently achieves lower robust accuracy than the baseline. This work highlights that robustness assessments ignoring compression can be overly optimistic, calling for compression-aware security evaluation and defenses for efficient LVLMs.
[CV-58] HERS: Hidden-Pattern Expert Learning for Risk-Specific Vehicle Damage Adaptation in Diffusion Models
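【Quick Read】: This paper addresses the reliability concerns that arise in automated insurance workflows when text-to-image (T2I) diffusion models generate realistic vehicle-damage images, in particular the risk that such synthetic data is misused for fraud or claim manipulation. The key to the solution is HERS (Hidden-Pattern Expert Learning for Risk-Specific Damage Adaptation), a framework that fine-tunes a base diffusion model through domain-specific expert adaptation without manual annotation. Image-text pairs are generated automatically by a large language model and a T2I pipeline; each damage category (such as dents, scratches, broken lights, or cracked paint) is modeled as a separate expert, and the experts are integrated into a unified multi-damage model that balances specialization and generalization, improving image fidelity, controllability, and domain alignment.
Link: https://arxiv.org/abs/2601.21517
Authors: Teerapong Panboonyuen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages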
Abstract:Recent advances in text-to-image (T2I) diffusion models have enabled increasingly realistic synthesis of vehicle damage, raising concerns about their reliability in automated insurance workflows. The ability to generate crash-like imagery challenges the boundary between authentic and synthetic data, introducing new risks of misuse in fraud or claim manipulation. To address these issues, we propose HERS (Hidden-Pattern Expert Learning for Risk-Specific Damage Adaptation), a framework designed to improve fidelity, controllability, and domain alignment of diffusion-generated damage images. HERS fine-tunes a base diffusion model via domain-specific expert adaptation without requiring manual annotation. Using self-supervised image-text pairs automatically generated by a large language model and T2I pipeline, HERS models each damage category, such as dents, scratches, broken lights, or cracked paint, as a separate expert. These experts are later integrated into a unified multi-damage model that balances specialization with generalization. We evaluate HERS across four diffusion backbones and observe consistent improvements: +5.5% in text faithfulness and +2.3% in human preference ratings compared to baselines. Beyond image fidelity, we discuss implications for fraud detection, auditability, and safe deployment of generative models in high-stakes domains. Our findings highlight both the opportunities and risks of domain-specific diffusion, underscoring the importance of trustworthy generation in safety-critical applications such as auto insurance.
[CV-59] SimGraph: A Unified Framework for Scene Graph-Based Image Generation and Editing
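【Quick Read】: This paper addresses the insufficient spatial consistency and poor semantic coherence that result when current generative AI handles image generation and image editing separately, along with the lack of structured control over object relationships and spatial layout. The key to the solution is SimGraph, a unified framework built on scene graph modeling that integrates token-based image generation and diffusion-based image editing in a single structure, enabling precise control over object interactions, layout, and spatial consistency while ensuring high-quality output.
Link: https://arxiv.org/abs/2601.21498
Authors: Thanh-Nhan Vo, Trong-Thuan Nguyen, Tam V. Nguyen, Minh-Triet Tran
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: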
Abstract:Recent advancements in Generative Artificial Intelligence (GenAI) have significantly enhanced the capabilities of both image generation and editing. However, current approaches often treat these tasks separately, leading to inefficiencies and challenges in maintaining spatial consistency and semantic coherence between generated content and edits. Moreover, a major obstacle is the lack of structured control over object relationships and spatial arrangements. Scene graph-based methods, which represent objects and their interrelationships in a structured format, offer a solution by providing greater control over composition and interactions in both image generation and editing. To address this, we introduce SimGraph, a unified framework that integrates scene graph-based image generation and editing, enabling precise control over object interactions, layouts, and spatial coherence. In particular, our framework integrates token-based generation and diffusion-based editing within a single scene graph-driven model, ensuring high-quality and consistent results. Through extensive experiments, we empirically demonstrate that our approach outperforms existing state-of-the-art methods.
[CV-60] Hypernetwork-Based Adaptive Aggregation for Multimodal Multiple-Instance Learning in Predicting Coronary Calcium Debulking
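【Quick Read】: This paper addresses the clinical problem of assessing, from computed tomography (CT) images, whether debulking of coronary artery calcifications is necessary. The core challenge is that physicians adjust their focus and decision criteria according to each patient's tabular data. To handle this, the authors propose a hypernetwork-based adaptive aggregation transformer (HyperAdAgFormer), whose key innovation is a hypernetwork that adjusts the feature aggregation strategy according to each patient's tabular data, supporting patient-specific decisions. Experiments on a clinical dataset demonstrate the method's effectiveness.
Link: https://arxiv.org/abs/2601.21479
Authors: Kaito Shiku, Ichika Seo, Tetsuya Matoba, Rissei Hino, Yasuhiro Nakano, Ryoma Bise
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ISBI 2026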
Abstract:In this paper, we present the first attempt to estimate the necessity of debulking coronary artery calcifications from computed tomography (CT) images. We formulate this task as a Multiple-instance Learning (MIL) problem. The difficulty of this task lies in that physicians adjust their focus and decision criteria for device usage according to tabular data representing each patient’s condition. To address this issue, we propose a hypernetwork-based adaptive aggregation transformer (HyperAdAgFormer), which adaptively modifies the feature aggregation strategy for each patient based on tabular data through a hypernetwork. The experiments using the clinical dataset demonstrated the effectiveness of HyperAdAgFormer. The code is publicly available at this https URL.
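The core idea, letting tabular data generate the parameters that control how image instances are pooled, can be sketched in a few lines of Python. This is a deliberately simplified illustration: the hypernetwork here is a single linear layer and the aggregator is plain attention pooling, both of which are assumptions rather than the HyperAdAgFormer architecture.

```python
import math


def hyper_query(tabular, W, b):
    """Hypernetwork (one linear layer here): tabular features -> attention query."""
    return [sum(w * z for w, z in zip(row, tabular)) + b_i for row, b_i in zip(W, b)]


def aggregate(instances, query):
    """Attention pooling over instance features using the generated query."""
    scores = [sum(h * q for h, q in zip(inst, query)) for inst in instances]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(instances[0])
    return [sum(w * inst[d] for w, inst in zip(weights, instances)) for d in range(dim)]
```

Two patients with identical CT instances but different tabular vectors would receive different queries and therefore different bag-level features, which is the behavior the abstract describes.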
[CV-61] Mining Forgery Traces from Reconstruction Error: A Weakly Supervised Framework for Multimodal Deepfake Temporal Localization
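【Quick Read】: This paper addresses fine-grained temporal localization of forged (deepfake) video in the weakly supervised setting, that is, identifying the time intervals in which forged segments occur when only video-level labels are available. The key to the solution is RT-DeepLoc, a reconstruction-error-based weakly supervised temporal forgery localization framework. A Masked Autoencoder (MAE) trained only on authentic data learns the intrinsic spatiotemporal patterns of video, so forged segments, which it cannot reconstruct well, produce pronounced reconstruction errors. An Asymmetric Intra-video Contrastive Loss (AICL) then focuses on the compactness of authentic features, guided by the reconstruction errors, to establish a stable decision boundary that strengthens local discrimination while preserving generalization to unseen forgeries.
Link: https://arxiv.org/abs/2601.21458
Authors: Midou Guo, Qilin Yin, Wei Lu, Xiangyang Luo, Rui Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: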
Abstract:Modern deepfakes have evolved into localized and intermittent manipulations that require fine-grained temporal localization. The prohibitive cost of frame-level annotation makes weakly supervised methods, which rely only on video-level labels, a practical necessity. To this end, we propose Reconstruction-based Temporal Deepfake Localization (RT-DeepLoc), a weakly supervised temporal forgery localization framework that identifies forgeries via reconstruction errors. Our framework uses a Masked Autoencoder (MAE) trained exclusively on authentic data to learn its intrinsic spatiotemporal patterns; this allows the model to produce significant reconstruction discrepancies for forged segments, effectively providing the missing fine-grained cues for localization. To robustly leverage these indicators, we introduce a novel Asymmetric Intra-video Contrastive Loss (AICL). By focusing on the compactness of authentic features guided by these reconstruction cues, AICL establishes a stable decision boundary that enhances local discrimination while preserving generalization to unseen forgeries. Extensive experiments on large-scale datasets, including LAV-DF, demonstrate that RT-DeepLoc achieves state-of-the-art performance in weakly supervised temporal forgery localization.
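The simplest way to turn per-frame reconstruction errors into temporal segments is a fixed threshold, sketched below in Python. This is only the naive baseline that motivates the paper; RT-DeepLoc learns its decision boundary with AICL rather than thresholding. The function name and the threshold value are hypothetical.

```python
def localize_segments(errors: list[float], threshold: float) -> list[tuple[int, int]]:
    """Return half-open [start, end) frame intervals where error > threshold."""
    segments = []
    start = None
    for i, err in enumerate(errors):
        if err > threshold and start is None:
            start = i  # a suspicious run begins
        elif err <= threshold and start is not None:
            segments.append((start, i))  # the run ended at the previous frame
            start = None
    if start is not None:
        segments.append((start, len(errors)))  # run extends to the last frame
    return segments


frame_errors = [0.1, 0.1, 0.9, 0.8, 0.1, 0.7]
suspect = localize_segments(frame_errors, threshold=0.5)
```

A fixed threshold is brittle across videos with different baseline error levels, which is one reason a learned boundary is attractive.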
[CV-62] 4D-CAAL: 4D Radar-Camera Calibration and Auto-Labeling for Autonomous Driving
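【Quick Read】: This paper addresses accurate extrinsic calibration between 4D radar and camera, and the inefficiency of annotating sparse radar data. Existing methods usually rely on separate calibration targets optimized for either vision or radar, which makes correspondences hard to establish; manual annotation of radar point clouds is also time-consuming and unreliable. The key to the solution is a unified dual-purpose calibration target, with a checkerboard pattern on the front surface for camera detection and a corner reflector at the center of the back surface for radar detection, together with a robust correspondence matching algorithm that aligns the checkerboard center with the strongest radar reflection point to achieve accurate calibration. Building on this, an auto-labeling pipeline uses the calibrated sensor relationship to transfer camera segmentation annotations to radar point clouds through geometric projection and multi-feature optimization, significantly reducing manual annotation effort and speeding up the development of multi-modal perception systems.
Link: https://arxiv.org/abs/2601.21454
Authors: Shanliang Yao, Zhuoxiao Li, Runwei Guan, Kebin Cao, Meng Xia, Fuping Hu, Sen Xu, Yong Yue, Xiaohui Zhu, Weiping Ding, Ryan Wen Liu
Affiliations: Yancheng Institute of Technology (盐城工学院); Wuhan University of Technology (武汉理工大学); Hubei Key Laboratory of Inland Shipping Technology (内河航运技术湖北省重点实验室); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); School of Information Engineering, Yancheng Institute of Technology (盐城工学院信息工程学院); Xi’an Jiaotong-Liverpool University (西交利物浦大学); Nantong University (南通大学); State Key Laboratory of Maritime Technology and Safety (航海科技与安全国家重点实验室)
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: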
Abstract:4D radar has emerged as a critical sensor for autonomous driving, primarily due to its enhanced capabilities in elevation measurement and higher resolution compared to traditional 3D radar. Effective integration of 4D radar with cameras requires accurate extrinsic calibration, and the development of radar-based perception algorithms demands large-scale annotated datasets. However, existing calibration methods often employ separate targets optimized for either visual or radar modalities, complicating correspondence establishment. Furthermore, manually labeling sparse radar data is labor-intensive and unreliable. To address these challenges, we propose 4D-CAAL, a unified framework for 4D radar-camera calibration and auto-labeling. Our approach introduces a novel dual-purpose calibration target design, integrating a checkerboard pattern on the front surface for camera detection and a corner reflector at the center of the back surface for radar detection. We develop a robust correspondence matching algorithm that aligns the checkerboard center with the strongest radar reflection point, enabling accurate extrinsic calibration. Subsequently, we present an auto-labeling pipeline that leverages the calibrated sensor relationship to transfer annotations from camera-based segmentations to radar point clouds through geometric projection and multi-feature optimization. Extensive experiments demonstrate that our method achieves high calibration accuracy while significantly reducing manual annotation effort, thereby accelerating the development of robust multi-modal perception systems for autonomous driving.
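The geometric projection step of such an auto-labeling pipeline can be sketched with a standard pinhole camera model. The Python below assumes a zero-skew intrinsic matrix `K`, a radar-to-camera rotation `R` and translation `t`, and a per-pixel integer label mask. All names are illustrative, and the multi-feature optimization mentioned in the abstract is not modeled.

```python
def project_point(p_radar, R, t, K):
    """Map a radar-frame 3D point to pixel (u, v); None if behind the camera."""
    # Camera-frame coordinates: p_cam = R @ p_radar + t
    p_cam = [sum(R[r][c] * p_radar[c] for c in range(3)) + t[r] for r in range(3)]
    if p_cam[2] <= 0:
        return None
    u = K[0][0] * p_cam[0] / p_cam[2] + K[0][2]
    v = K[1][1] * p_cam[1] / p_cam[2] + K[1][2]
    return u, v


def transfer_labels(points, R, t, K, mask):
    """Give each radar point the mask label at its projected pixel (0 = unlabeled)."""
    height, width = len(mask), len(mask[0])
    labels = []
    for p in points:
        uv = project_point(p, R, t, K)
        if uv is None:
            labels.append(0)
            continue
        col, row = int(round(uv[0])), int(round(uv[1]))
        inside = 0 <= row < height and 0 <= col < width
        labels.append(mask[row][col] if inside else 0)
    return labels
```

Any error in `R` or `t` shifts the projected pixels, so label quality depends directly on calibration accuracy, which is why the paper treats calibration and labeling together.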
[CV-63] Variance Greediness: A comparative study of metric-learning losses ICASSP2026
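【Quick Read】: This paper addresses the poorly understood effect of metric learning on embedding-space geometry and optimization dynamics in image retrieval. The key to the solution is a diagnostic framework, VARIANCE (intra-/inter-class variance) and GREEDINESS (active ratio and gradient norms), used to systematically compare seven representative losses (such as Triplet, InfoNCE, and ArcFace) on five image-retrieval datasets. The comparison reveals how different losses trade off embedding compactness against class-discrimination granularity and gives actionable guidance: prefer Triplet or SCL when class diversity must be preserved and hard-sample discrimination strengthened, and choose Contrastive or InfoNCE when fast embedding compaction is the goal.
Link: https://arxiv.org/abs/2601.21450
Authors: Donghuo Zeng, Hao Niu, Zhi Li, Masato Taya
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 2 figures, 3 tables. Accepted by ICASSP 2026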
Abstract:Metric learning is central to retrieval, yet its effects on embedding geometry and optimization dynamics are not well understood. We introduce a diagnostic framework, VARIANCE (intra-/inter-class variance) and GREEDINESS (active ratio and gradient norms), to compare seven representative losses, i.e., Contrastive, Triplet, N-pair, InfoNCE, ArcFace, SCL, and CCL, across five image-retrieval datasets. Our analysis reveals that Triplet and SCL preserve higher within-class variance and clearer inter-class margins, leading to stronger top-1 retrieval in fine-grained settings. In contrast, Contrastive and InfoNCE compact embeddings quickly through many small updates, accelerating convergence but potentially oversimplifying class structures. N-pair achieves a large mean separation but with uneven spacing. These insights reveal a form of efficiency-granularity trade-off and provide practical guidance: prefer Triplet/SCL when diversity preservation and hard-sample discrimination are critical, and Contrastive/InfoNCE when faster embedding compaction is desired.
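One plausible way to compute the intra-/inter-class variance part of such a diagnostic is shown below in Python. The definitions used (mean squared distance to the class centroid, and mean squared distance of class centroids to the global centroid) are assumptions for illustration; the paper's exact formulas may differ.

```python
def class_variances(embeddings, labels):
    """Return (intra, inter) variance of labeled embeddings.

    intra: mean squared distance of each sample to its class centroid.
    inter: mean squared distance of class centroids to the global centroid.
    """
    dim = len(embeddings[0])

    def mean(vectors):
        return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    groups = {}
    for emb, label in zip(embeddings, labels):
        groups.setdefault(label, []).append(emb)
    centroids = {label: mean(vs) for label, vs in groups.items()}
    global_centroid = mean(embeddings)

    intra = sum(sqdist(e, centroids[y]) for e, y in zip(embeddings, labels)) / len(embeddings)
    inter = sum(sqdist(c, global_centroid) for c in centroids.values()) / len(centroids)
    return intra, inter
```

Tracking both numbers over training shows whether a loss is collapsing each class (falling intra) or pushing classes apart (rising inter), which is the trade-off the abstract discusses.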
[CV-64] MultiModal Fine-tuning with Synthetic Captions
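【Quick Read】: This paper addresses a key gap between the pre-training and fine-tuning stages of deep neural networks: although pre-training has moved from unimodal to multimodal learning and improved visual understanding, fine-tuning remains mostly unimodal, which limits the potential of rich pre-trained representations. The core of the solution is to use Multimodal Large Language Models (MLLMs) to generate high-quality synthetic image captions, turning unimodal datasets into multimodal ones for fine-tuning. It also introduces a supervised contrastive loss that explicitly encourages same-class features to cluster, and an inference mechanism based on class-averaged text embeddings computed from multiple synthetic captions, improving fine-tuning results, with particularly large gains over baselines in few-shot settings.
Link: https://arxiv.org/abs/2601.21426
Authors: Shohei Enomoto, Shin’ya Yamaguchi
Affiliations: NTT
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: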
Abstract:In this paper, we address a fundamental gap between pre-training and fine-tuning of deep neural networks: while pre-training has shifted from unimodal to multimodal learning with enhanced visual understanding, fine-tuning predominantly remains unimodal, limiting the benefits of rich pre-trained representations. To bridge this gap, we propose a novel approach that transforms unimodal datasets into multimodal ones using Multimodal Large Language Models (MLLMs) to generate synthetic image captions for fine-tuning models with a multimodal objective. Our method employs carefully designed prompts incorporating class labels and domain context to produce high-quality captions tailored for classification tasks. Furthermore, we introduce a supervised contrastive loss function that explicitly encourages clustering of same-class representations during fine-tuning, along with a new inference technique that leverages class-averaged text embeddings from multiple synthetic captions per image. Extensive experiments across 13 image classification benchmarks demonstrate that our approach outperforms baseline methods, with particularly significant improvements in few-shot learning scenarios. Our work establishes a new paradigm for dataset enhancement that effectively bridges the gap between multimodal pre-training and fine-tuning. Our code is available at this https URL.
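Inference with class-averaged text embeddings can be illustrated as a nearest-prototype classifier under cosine similarity. The Python sketch below assumes caption and image embeddings are already available as plain lists of floats; the function names and the averaging-then-normalizing order are illustrative choices, not details confirmed by the abstract.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def class_prototypes(caption_embeddings):
    """Map {class_name: [embedding, ...]} to a unit-length mean embedding per class."""
    prototypes = {}
    for name, embs in caption_embeddings.items():
        unit = [normalize(e) for e in embs]
        dim = len(unit[0])
        mean = [sum(e[d] for e in unit) / len(unit) for d in range(dim)]
        prototypes[name] = normalize(mean)
    return prototypes


def classify(image_embedding, prototypes):
    """Return the class whose prototype has the highest cosine similarity."""
    img = normalize(image_embedding)
    return max(prototypes, key=lambda name: sum(a * b for a, b in zip(img, prototypes[name])))
```

Averaging over several captions per class smooths out the wording of any single caption, which is the intuition behind using multiple synthetic captions.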
[CV-65] Lossy Common Information in a Learnable Gray-Wyner Network
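【Quick Read】: This paper addresses the redundant and inefficient representations that result when conventional codecs ignore information shared across computer vision tasks. Inspired by the Gray-Wyner network from information theory, the solution is a learnable three-channel codec that disentangles the common information shared by multiple tasks from task-specific information. The notion of "lossy common information" characterizes the theoretical limits of the approach, and an optimization objective is designed to balance the tradeoffs inherent in learning such representations. Experiments in two-task scenarios on multiple vision benchmarks show that the method substantially reduces redundancy and consistently outperforms independent coding, demonstrating the practical value of classical information theory for modern task-driven representation learning.
Link: https://arxiv.org/abs/2601.21424
Authors: Anderson de Andrade, Alon Harell, Ivan V. Bajić
Affiliations: Simon Fraser University (西蒙菲莎大学)
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
Comments: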
Abstract:Many computer vision tasks share substantial overlapping information, yet conventional codecs tend to ignore this, leading to redundant and inefficient representations. The Gray-Wyner network, a classical concept from information theory, offers a principled framework for separating common and task-specific information. Inspired by this idea, we develop a learnable three-channel codec that disentangles shared information from task-specific details across multiple vision tasks. We characterize the limits of this approach through the notion of lossy common information, and propose an optimization objective that balances inherent tradeoffs in learning such representations. Through comparisons of three codec architectures on two-task scenarios spanning six vision benchmarks, we demonstrate that our approach substantially reduces redundancy and consistently outperforms independent coding. These results highlight the practical value of revisiting Gray-Wyner theory in modern machine learning contexts, bridging classic information theory with task-driven representation learning.
[CV-66] From Implicit Ambiguity to Explicit Solidity: Diagnosing Interior Geometric Degradation in Neural Radiance Fields for Dense 3D Scene Understanding
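【Quick Read】: This paper addresses the reliability of Neural Radiance Fields (NeRFs) for quantitative 3D analysis in dense, self-occluding scenes, in particular the Interior Geometric Degradation (IGD) that appears under heavy occlusion: implicit density fields tend to reconstruct hollow or fragmented structures rather than solid interiors, causing systematic undercounting of instances. The key to the solution is an explicit geometric pipeline based on Sparse Voxel Rasterization (SVRaster), initialized from SfM feature geometry. It projects 2D instance masks onto an explicit voxel grid and enforces geometric separation through recursive splitting, preserving physical solidity and achieving a 95.8% instance recovery rate, clearly ahead of implicit methods.
Link: https://arxiv.org/abs/2601.21421
Authors: Jiangsan Zhao, Jakob Geipel, Kryzysztof Kusnierek
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: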
Abstract:Neural Radiance Fields (NeRFs) have emerged as a powerful paradigm for multi-view reconstruction, complementing classical photogrammetric pipelines based on Structure-from-Motion (SfM) and Multi-View Stereo (MVS). However, their reliability for quantitative 3D analysis in dense, self-occluding scenes remains poorly understood. In this study, we identify a fundamental failure mode of implicit density fields under heavy occlusion, which we term Interior Geometric Degradation (IGD). We show that transmittance-based volumetric optimization satisfies photometric supervision by reconstructing hollow or fragmented structures rather than solid interiors, leading to systematic instance undercounting. Through controlled experiments on synthetic datasets with increasing occlusion, we demonstrate that state-of-the-art mask-supervised NeRFs saturate at approximately 89% instance recovery in dense scenes, despite improved surface coherence and mask quality. To overcome this limitation, we introduce an explicit geometric pipeline based on Sparse Voxel Rasterization (SVRaster), initialized from SfM feature geometry. By projecting 2D instance masks onto an explicit voxel grid and enforcing geometric separation via recursive splitting, our approach preserves physical solidity and achieves a 95.8% recovery rate in dense clusters. A sensitivity analysis using degraded segmentation masks further shows that explicit SfM-based geometry is substantially more robust to supervision failure, recovering 43% more instances than implicit baselines. These results demonstrate that explicit geometric priors are a prerequisite for reliable quantitative analysis in highly self-occluding 3D scenes.
[CV-67] Revisiting Diffusion Model Predictions Through Dimensionality
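【Quick Read】: This paper addresses the lack of theoretical grounding for the choice of prediction target in generative models, in particular why, in high-dimensional settings, predicting data directly (x-prediction) outperforms the traditional noise (ε-prediction) or velocity (v-prediction) targets. The key to the solution is a generalized prediction formulation that unifies ε-, v-, and x-prediction as special cases. By analyzing the analytical relationship between data geometry and the prediction target, the authors show that x-prediction has a theoretical advantage when the ambient dimension far exceeds the intrinsic dimension of the data manifold. Because the intrinsic dimension is hard to estimate in practice, they further propose k-Diff, which learns the optimal prediction parameter k from data and so improves performance without explicit dimension estimation. Experiments show that it consistently outperforms fixed-target baselines in both latent-space and pixel-space image generation.
Link: https://arxiv.org/abs/2601.21419
Authors: Qing Jin, Chaoyang Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 5 figures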
Abstract:Recent advances in diffusion and flow matching models have highlighted a shift in the preferred prediction target, moving from noise (ε) and velocity (v) to direct data (x) prediction, particularly in high-dimensional settings. However, a formal explanation of why the optimal target depends on the specific properties of the data remains elusive. In this work, we provide a theoretical framework based on a generalized prediction formulation that accommodates arbitrary output targets, of which ε-, v-, and x-prediction are special cases. We derive the analytical relationship between data’s geometry and the optimal prediction target, offering a rigorous justification for why x-prediction becomes superior when the ambient dimension significantly exceeds the data’s intrinsic dimension. Furthermore, while our theory identifies dimensionality as the governing factor for the optimal prediction target, the intrinsic dimension of manifold-bound data is typically intractable to estimate in practice. To bridge this gap, we propose k-Diff, a framework that employs a data-driven approach to learn the optimal prediction parameter k directly from data, bypassing the need for explicit dimension estimation. Extensive experiments in both latent-space and pixel-space image generation demonstrate that k-Diff consistently outperforms fixed-target baselines across varying architectures and data scales, providing a principled and automated approach to enhancing generative performance.
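As background for why the three targets can be treated within one formulation, the relations below show that under a linear interpolation path each target determines the other two once the noisy sample and the time are known. The interpolation convention is an assumption (one common flow-matching choice); the paper's generalized formulation and its parameter k are not reproduced here.

```latex
% Assumed convention: linear interpolation between data x and noise \varepsilon
x_t = (1 - t)\,x + t\,\varepsilon, \qquad v = \varepsilon - x .
% Given x_t and t, any one target determines the other two:
x = x_t - t\,v, \qquad
\varepsilon = x_t + (1 - t)\,v, \qquad
x = \frac{x_t - t\,\varepsilon}{1 - t} \quad (t < 1).
```

The targets are therefore equivalent in what they can express and differ only in how the regression loss weights errors, which is where the dependence on data geometry enters.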
[CV-68] MPF-Net: Exposing High-Fidelity AI-Generated Video Forgeries via Hierarchical Manifold Deviation and Micro-Temporal Fluctuations
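【Quick Read】: This paper addresses the difficulty of distinguishing real from forged content in high-fidelity AI-generated video, especially now that visual quality approaches the real world and macro-level semantic errors and temporal inconsistencies have largely disappeared. The key to the solution is a hierarchical dual-path framework whose core idea is to identify and exploit an intrinsic characteristic, "Manifold Projection Fluctuations" (MPF). The first branch (the Static Manifold Deviation Branch) uses large-scale Vision Foundation Models (VFMs) to capture residual spatial anomalies that deviate from the natural real-world manifold; the second branch (the Micro-Temporal Fluctuation Branch) detects the structured temporal fluctuations that persist even in visually perfect sequences. Together they enable multi-scale, fine-grained detection of forged video, whether the forgery shows up as a global manifold deviation or as a subtle computational fingerprint.
Link: https://arxiv.org/abs/2601.21408
Authors: Xinan He, Kaiqing Lin, Yue Zhou, Jiaming Zhong, Wei Ye, Wenhui Yi, Bing Fan, Feng Ding, Haodong Li, Bo Cao, Bin Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: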
Abstract:With the rapid advancement of video generation models such as Veo and Wan, the visual quality of synthetic content has reached a level where macro-level semantic errors and temporal inconsistencies are no longer prominent. However, this does not imply that the distinction between real and cutting-edge high-fidelity fake content is untraceable. We argue that AI-generated videos are essentially products of a manifold-fitting process rather than a physical recording. Consequently, the pixel composition logic of the residuals between consecutive frames in AI videos exhibits a structured and homogeneous character. We term this phenomenon "Manifold Projection Fluctuations" (MPF). Driven by this insight, we propose a hierarchical dual-path framework that operates as a sequential filtering process. The first, the Static Manifold Deviation Branch, leverages the refined perceptual boundaries of Large-Scale Vision Foundation Models (VFMs) to capture residual spatial anomalies or physical violations that deviate from the natural real-world manifold (off-manifold). For the remaining high-fidelity videos that successfully reside on-manifold and evade spatial detection, we introduce the Micro-Temporal Fluctuation Branch as a secondary, fine-grained filter. By analyzing the structured MPF that persists even in visually perfect sequences, our framework ensures that forgeries are exposed regardless of whether they manifest as global real-world manifold deviations or subtle computational fingerprints.
[CV-69] Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation
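【Quick Read】: This paper addresses the one-directional enhancement between generation and understanding in Unified Multimodal Models (UMMs): existing methods mainly use understanding to improve generation and neglect using generation tasks to improve understanding. The key to the solution is UniMRG (Unified Multi-Representation Generation), an architecture-agnostic post-training method. Alongside standard visual understanding objectives, it adds auxiliary generation tasks that produce pixel (reconstruction), depth (geometry), and segmentation (structure) representations, forcing the model to learn different intrinsic representations of the input image. These complementary representations strengthen the model's understanding of appearance, spatial relations, and structural layout, improving fine-grained perception, reducing hallucinations, and also improving generation quality.
Link: https://arxiv.org/abs/2601.21406
Authors: Zihan Su, Hongyang Wei, Kangrui Cen, Yong Wang, Guanhua Chen, Chun Yuan, Xiangxiang Chu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: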
Abstract:Unified Multimodal Models (UMMs) integrate both visual understanding and generation within a single framework. Their ultimate aspiration is to create a cycle where understanding and generation mutually reinforce each other. While recent post-training methods have successfully leveraged understanding to enhance generation, the reverse direction of utilizing generation to improve understanding remains largely unexplored. In this work, we propose UniMRG (Unified Multi-Representation Generation), a simple yet effective architecture-agnostic post-training method. UniMRG enhances the understanding capabilities of UMMs by incorporating auxiliary generation tasks. Specifically, we train UMMs to generate multiple intrinsic representations of input images, namely pixel (reconstruction), depth (geometry), and segmentation (structure), alongside standard visual understanding objectives. By synthesizing these diverse representations, UMMs capture complementary information regarding appearance, spatial relations, and structural layout. Consequently, UMMs develop a deeper and more comprehensive understanding of visual inputs. Extensive experiments across diverse UMM architectures demonstrate that our method notably enhances fine-grained perception, reduces hallucinations, and improves spatial understanding, while simultaneously boosting generation capabilities.
[CV-70] Rectifying Geometry-Induced Similarity Distortions for Real-World Aerial-Ground Person Re-Identification
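【Quick Read】: This paper studies the severe geometric distortion caused by viewpoint and distance differences in Aerial-Ground Person Re-Identification (AG-ReID). Such distortion breaks the assumption of a similarity space shared across views and so substantially reduces the reliability of dot-product similarity in attention mechanisms. The key to the solution is the Geometry-Induced Query-Key Transformation (GIQT), a lightweight low-rank module that explicitly conditions query-key interactions on camera geometry, directly rectifying the anisotropic distortion of the similarity space caused by extreme viewpoint changes rather than modifying the feature representations or the attention mechanism itself. The method further adds a geometry-conditioned prompt generation mechanism that uses camera geometry to provide global, view-adaptive representation priors, improving robustness under extreme and unseen geometric conditions on multiple benchmarks while keeping computational overhead low.
Link: https://arxiv.org/abs/2601.21405
Authors: Kailash A. Hambarde, Hugo Proença
Affiliations: Instituto de Telecomunicações (电信研究所); Universidade da Beira Interior (贝拉大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: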
Abstract:Aerial-ground person re-identification (AG-ReID) is fundamentally challenged by extreme viewpoint and distance discrepancies between aerial and ground cameras, which induce severe geometric distortions and invalidate the assumption of a shared similarity space across views. Existing methods primarily rely on geometry-aware feature learning or appearance-conditioned prompting, while implicitly assuming that the geometry-invariant dot-product similarity used in attention mechanisms remains reliable under large viewpoint and scale variations. We argue that this assumption does not hold. Extreme camera geometry systematically distorts the query-key similarity space and degrades attention-based matching, even when feature representations are partially aligned. To address this issue, we introduce Geometry-Induced Query-Key Transformation (GIQT), a lightweight low-rank module that explicitly rectifies the similarity space by conditioning query-key interactions on camera geometry. Rather than modifying feature representations or the attention formulation itself, GIQT adapts the similarity computation to compensate for dominant geometry-induced anisotropic distortions. Building on this local similarity rectification, we further incorporate a geometry-conditioned prompt generation mechanism that provides global, view-adaptive representation priors derived directly from camera geometry. Experiments on four aerial-ground person re-identification benchmarks demonstrate that the proposed framework consistently improves robustness under extreme and previously unseen geometric conditions, while introducing minimal computational overhead compared to state-of-the-art methods.
[CV-71] Towards Geometry-Aware and Motion-Guided Video Human Mesh Recovery
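[Quick read]: This paper addresses the physically implausible results produced by existing video-based 3D Human Mesh Recovery (HMR) methods, which stem from reliance on flawed intermediate 3D pose anchors and difficulty modeling complex spatiotemporal dynamics. The key to the solution is the HMRMamba framework, which brings Structured State Space Models (SSMs) to the HMR task for the first time, exploiting their efficiency and long-range modeling ability. The core innovations are: 1) a Geometry-Aware Lifting Module with a novel dual-scan Mamba architecture that uses geometric cues from image features to directly guide 2D-to-3D pose lifting, producing a reliable 3D pose sequence as a stable anchor; and 2) a Motion-guided Reconstruction Network that builds on this anchor to explicitly model kinematic patterns over time, markedly improving the temporal consistency and robustness of the final mesh, especially under occlusion and motion blur.
Link: https://arxiv.org/abs/2601.21376
Authors: Hongjun Chen,Huan Zheng,Wencheng Han,Jianbing Shen
Affiliations: SKL-IOTSC, CIS, University of Macau
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: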
Abstract:Existing video-based 3D Human Mesh Recovery (HMR) methods often produce physically implausible results, stemming from their reliance on flawed intermediate 3D pose anchors and their inability to effectively model complex spatiotemporal dynamics. To overcome these deep-rooted architectural problems, we introduce HMRMamba, a new paradigm for HMR that pioneers the use of Structured State Space Models (SSMs) for their efficiency and long-range modeling prowess. Our framework is distinguished by two core contributions. First, the Geometry-Aware Lifting Module, featuring a novel dual-scan Mamba architecture, creates a robust foundation for reconstruction. It directly grounds the 2D-to-3D pose lifting process with geometric cues from image features, producing a highly reliable 3D pose sequence that serves as a stable anchor. Second, the Motion-guided Reconstruction Network leverages this anchor to explicitly process kinematic patterns over time. By injecting this crucial temporal awareness, it significantly enhances the final mesh’s coherence and robustness, particularly under occlusion and motion blur. Comprehensive evaluations on 3DPW, MPI-INF-3DHP, and Human3.6M benchmarks confirm that HMRMamba sets a new state-of-the-art, outperforming existing methods in both reconstruction accuracy and temporal consistency while offering superior computational efficiency.
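To make the state space idea concrete, here is a minimal sketch of a linear recurrence scanned in both directions over a sequence. It is an illustration only: the scalar, time-invariant parameters `a`, `b`, `c` and the function names are assumptions, Mamba itself uses input-dependent (selective) parameters, and the abstract does not say how HMRMamba combines its two scans.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Scalar linear state space recurrence (hypothetical toy parameters).

    h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def dual_scan(xs, a=0.9, b=1.0, c=1.0):
    """Add a forward scan and a backward scan so every step sees both directions."""
    forward = ssm_scan(xs, a, b, c)
    backward = ssm_scan(xs[::-1], a, b, c)[::-1]
    return [f + r for f, r in zip(forward, backward)]


impulse = [0.0, 0.0, 1.0, 0.0, 0.0]
print(ssm_scan(impulse, a=0.5))   # response decays only to the right of the impulse
print(dual_scan(impulse, a=0.5))  # response decays on both sides of the impulse
```

A single causal scan only carries information forward in time, which is why a second scan in the opposite direction is a common way to give each frame context from both past and future frames.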
[CV-72] ViTMAlis: Towards Latency-Critical Mobile Video Analytics with Vision Transformers
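[Quick read]: This paper addresses the high inference latency of Vision Transformer (ViT)-based dense prediction models in edge-assisted Mobile Video Analytics (MVA), where, particularly for high-resolution inputs, the inherent quadratic computational complexity of ViTs causes substantial end-to-end delay. The key to the solution is a dynamic mixed-resolution inference strategy that flexibly adjusts the speed-accuracy trade-off at runtime. Building on it, the authors design ViTMAlis, a ViT-native device-to-edge offloading framework that dynamically adapts to network conditions and video content to jointly reduce transmission and inference delay, achieving lower end-to-end latency and higher user-perceived rendering accuracy.
Link: https://arxiv.org/abs/2601.21362
Authors: Miao Zhang,Guanzhen Wu,Hao Fang,Yifei Zhu,Fangxin Wang,Ruixiao Zhang,Jiangchuan Liu
Affiliations: Simon Fraser University; Shanghai Jiao Tong University; The Chinese University of Hong Kong, Shenzhen; University of Illinois Urbana-Champaign
Categories: Networking and Internet Architecture (cs.NI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: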
Abstract:Edge-assisted mobile video analytics (MVA) applications are increasingly shifting from using vision models based on convolutional neural networks (CNNs) to those built on vision transformers (ViTs) to leverage their superior global context modeling and generalization capabilities. However, deploying these advanced models in latency-critical MVA scenarios presents significant challenges. Unlike traditional CNN-based offloading paradigms where network transmission is the primary bottleneck, ViT-based systems are constrained by substantial inference delays, particularly for dense prediction tasks where the need for high-resolution inputs exacerbates the inherent quadratic computational complexity of ViTs. To address these challenges, we propose a dynamic mixed-resolution inference strategy tailored for ViT-backboned dense prediction models, enabling flexible runtime trade-offs between speed and accuracy. Building on this, we introduce ViTMAlis, a ViT-native device-to-edge offloading framework that dynamically adapts to network conditions and video content to jointly reduce transmission and inference delays. We implement a fully functional prototype of ViTMAlis on commodity mobile and edge devices. Extensive experiments demonstrate that, compared to state-of-the-art accuracy-centric, content-aware, and latency-adaptive baselines, ViTMAlis significantly reduces end-to-end offloading latency while improving user-perceived rendering accuracy, providing a practical foundation for next-generation mobile intelligence.
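The quadratic cost mentioned above can be illustrated with a little arithmetic. The sketch below counts patch tokens and token-pair comparisons for a frame, and shows how keeping only part of the frame at full resolution shrinks the count. The patch size of 16, the 2x downscale factor, and the function names are assumptions for illustration; the abstract does not describe how ViTMAlis actually partitions a frame.

```python
def num_tokens(height, width, patch=16):
    """Number of non-overlapping patch tokens for a frame of the given size."""
    return (height // patch) * (width // patch)


def attention_pairs(tokens):
    """Global self-attention compares every token with every token: N * N pairs."""
    return tokens * tokens


def mixed_resolution_tokens(height, width, keep_fraction, downscale=2, patch=16):
    """Token count when only `keep_fraction` of the frame stays at full resolution.

    The rest of the frame is downsampled by `downscale` along each axis before
    patching, so it contributes downscale**2 times fewer tokens.
    """
    full = num_tokens(height, width, patch)
    high = round(full * keep_fraction)
    low = round((full - high) / downscale**2)
    return high + low


full = num_tokens(1024, 1024)
mixed = mixed_resolution_tokens(1024, 1024, keep_fraction=0.25)
print(full, attention_pairs(full))
print(mixed, attention_pairs(mixed))
print(attention_pairs(mixed) / attention_pairs(full))
```

Because the pair count grows with the square of the token count, halving the resolution of the whole frame cuts attention pairs by a factor of 16, which is the lever a mixed-resolution scheme trades against accuracy.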
[CV-73] Semantic-Guided Dynamic Sparsification for Pre-Trained Model-based Class-Incremental Learning
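[Quick read]: This paper addresses catastrophic forgetting in Class-Incremental Learning (CIL) caused by conflicts between knowledge of old and new classes. Conventional methods typically freeze a pre-trained model, add lightweight adapters, and force the adapter parameters to be orthogonal to reduce inter-task interference, but such parameter constraints suppress the model's plasticity. The paper proposes Semantic-Guided Dynamic Sparsification (SGDS), whose core idea is to use targeted sparsification to actively govern the orientation and rank of subspaces in the activation space. This serves two goals: similar classes are encouraged to share a compact activation subspace to promote knowledge transfer, while dissimilar classes are assigned non-overlapping activation subspaces to prevent interference. Without imposing rigid parameter constraints, the method improves both adaptation to new classes and retention of old ones, and experiments show state-of-the-art performance on multiple benchmark datasets.
Link: https://arxiv.org/abs/2601.21345
Authors: Ruiqi Liu,Boyu Diao,Zijia An,Runjie Shao,Zhulin An,Fei Wang,Yongjun Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: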
Abstract:Class-Incremental Learning (CIL) requires a model to continually learn new classes without forgetting old ones. A common and efficient solution freezes a pre-trained model and employs lightweight adapters, whose parameters are often forced to be orthogonal to prevent inter-task interference. However, we argue that this parameter-constraining method is detrimental to plasticity. To this end, we propose Semantic-Guided Dynamic Sparsification (SGDS), a novel method that proactively guides the activation space by governing the orientation and rank of its subspaces through targeted sparsification. Specifically, SGDS promotes knowledge transfer by encouraging similar classes to share a compact activation subspace, while simultaneously preventing interference by assigning non-overlapping activation subspaces to dissimilar classes. By sculpting class-specific sparse subspaces in the activation space, SGDS effectively mitigates interference without imposing rigid constraints on the parameter space. Extensive experiments on various benchmark datasets demonstrate the state-of-the-art performance of SGDS.
[CV-74] Dynamical Adapter Fusion: Constructing A Global Adapter for Pre-Trained Model-based Class-Incremental Learning
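[Quick read]: This paper addresses catastrophic forgetting caused by continually learning new classes in Class-Incremental Learning (CIL), together with the limitations of conventional task-specific adapters in knowledge transfer and parameter management. The key to the solution is the Dynamical Adapter Fusion (DAF) mechanism. Grounded in PAC-Bayes theory, it fuses the optimized task-specific adapter parameters, the previous global adapter parameters, and the initialization parameters, and derives the optimal fusion coefficients from a Taylor expansion of the loss function, dynamically balancing stability and plasticity. A Robust Initialization strategy is also introduced to capture global knowledge patterns, and the method achieves state-of-the-art performance on multiple CIL benchmarks.
Link: https://arxiv.org/abs/2601.21341
Authors: Ruiqi Liu,Boyu Diao,Zijia An,Zhulin An,Fei Wang,Yongjun Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: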
Abstract:Class-Incremental Learning (CIL) requires models to continuously acquire new classes without forgetting previously learned ones. A dominant paradigm involves freezing a pre-trained model and training lightweight, task-specific adapters. However, maintaining task-specific parameters hinders knowledge transfer and incurs high retrieval costs, while naive parameter fusion often leads to destructive interference and catastrophic forgetting. To address these challenges, we propose Dynamical Adapter Fusion (DAF) to construct a single robust global adapter. Grounded in the PAC-Bayes theorem, we derive a fusion mechanism that explicitly integrates three components: the optimized task-specific adapter parameters, the previous global adapter parameters, and the initialization parameters. We utilize the Taylor expansion of the loss function to derive the optimal fusion coefficients, dynamically achieving the best balance between stability and plasticity. Furthermore, we propose a Robust Initialization strategy to effectively capture global knowledge patterns. Experiments on multiple CIL benchmarks demonstrate that DAF achieves state-of-the-art (SOTA) performance.
[CV-75] SR2-Net: A General Plug-and-Play Model for Spectral Refinement in Hyperspectral Image Super-Resolution
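[Quick read]: This paper addresses the spurious oscillations and physically implausible artifacts that arise in hyperspectral image super-resolution (HSI-SR) when spectral consistency across bands is neglected. Existing methods raise spatial resolution by exploiting spatial correlations but often sacrifice spectral fidelity, while enforcing spectral consistency purely through network architecture design reduces generality and flexibility. The key to the solution is a lightweight plug-and-play spectral rectification module, the physical-prior Spectral Rectification Super-Resolution Network (SR²-Net), which follows a two-stage enhance-then-rectify pipeline: Hierarchical Spectral-Spatial Synergy Attention (H-S³A) first reinforces cross-band interactions, and Manifold Consistency Rectification (MCR) then constrains the reconstructed spectra to a compact, physically plausible spectral manifold. A degradation-consistency loss additionally keeps the reconstruction faithful to the low-resolution input. The result is improved spectral fidelity and overall reconstruction quality without modifying the host model's architecture and with very low computational overhead.
Link: https://arxiv.org/abs/2601.21338
Authors: Ji-Xuan He,Guohang Zhuang,Junge Bo,Tingyi Li,Chen Ling,Yanan Qiao
Affiliations: Xi’an Jiaotong University; Hefei University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: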
Abstract:Hyperspectral image super-resolution (HSI-SR) aims to enhance spatial resolution while preserving spectrally faithful and physically plausible characteristics. Recent methods have achieved great progress by leveraging spatial correlations to enhance spatial resolution. However, these methods often neglect spectral consistency across bands, leading to spurious oscillations and physically implausible artifacts. While spectral consistency can be addressed by designing the network architecture, it results in a loss of generality and flexibility. To address this issue, we propose a lightweight plug-and-play rectifier, the physical-prior Spectral Rectification Super-Resolution Network (SR²-Net), which can be attached to a wide range of HSI-SR models without modifying their architectures. SR²-Net follows an enhance-then-rectify pipeline consisting of (i) Hierarchical Spectral-Spatial Synergy Attention (H-S³A) to reinforce cross-band interactions and (ii) Manifold Consistency Rectification (MCR) to constrain the reconstructed spectra to a compact, physically plausible spectral manifold. In addition, we introduce a degradation-consistency loss to enforce data fidelity by encouraging the degraded SR output to match the observed low-resolution input. Extensive experiments on multiple benchmarks and diverse backbones demonstrate consistent improvements in spectral fidelity and overall reconstruction quality with negligible computational overhead. Our code will be released upon publication.
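The degradation-consistency idea, re-degrade the super-resolved output and compare it with the observed low-resolution input, can be sketched for a single spectral band as follows. This is a toy illustration under stated assumptions: average pooling stands in for the degradation operator and mean absolute error for the distance, neither of which is specified in the abstract, and the function names are hypothetical.

```python
def average_pool(band, scale):
    """Downsample a 2D band (list of rows) by averaging scale x scale blocks."""
    rows, cols = len(band) // scale, len(band[0]) // scale
    pooled = []
    for r in range(rows):
        row = []
        for c in range(cols):
            block = [band[r * scale + i][c * scale + j]
                     for i in range(scale) for j in range(scale)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def degradation_consistency_loss(sr_band, lr_band, scale):
    """Mean absolute error between the re-degraded SR band and the LR input."""
    degraded = average_pool(sr_band, scale)
    diffs = [abs(d - l)
             for d_row, l_row in zip(degraded, lr_band)
             for d, l in zip(d_row, l_row)]
    return sum(diffs) / len(diffs)


lr = [[1.0, 2.0], [3.0, 4.0]]
# A 2x nearest-neighbour upsampling of `lr` is perfectly consistent with it.
sr = [[1.0, 1.0, 2.0, 2.0], [1.0, 1.0, 2.0, 2.0],
      [3.0, 3.0, 4.0, 4.0], [3.0, 3.0, 4.0, 4.0]]
print(degradation_consistency_loss(sr, lr, scale=2))
```

A loss of this kind does not say which high-resolution details are right, only that whatever is hallucinated must still average back to the measurement, which is why it is typically paired with other priors.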
[CV-76] Do Pathology Foundation Models Encode Disease Progression? A Pseudotime Analysis of Visual Representations
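[Quick read]: This paper asks whether vision foundation models can implicitly learn and represent continuous biological processes, such as cancer progression, from discretely sampled images. The core challenge is that, although these models perform well on classification, it is unclear whether their latent representations capture the continuous, dynamic paths between disease states. The key to the solution is diffusion pseudotime, a method from single-cell transcriptomics for inferring developmental trajectories, used here to test whether a model's ordering of disease states in representation space matches the known direction of progression. The study finds that all pathology-specific models significantly outperform random baselines, that vision-only models reach the highest trajectory fidelity on the CRC-Serrated dataset (τ = 0.78), and that trajectory fidelity correlates strongly with few-shot classification performance (ρ = 0.92), suggesting the metric as a new dimension for assessing representation quality.
Link: https://arxiv.org/abs/2601.21334
Authors: Pritika Vig(1 and 2),Ren-Chin Wu(3),William Lotter(2, 4 and 5) ((1) Massachusetts Institute of Technology, (2) Department of Data Science, Dana-Farber Cancer Institute, (3) Department of Pathology, Dana-Farber Cancer Institute, (4) Brigham and Women’s Hospital, (5) Harvard Medical School)
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 17 figures. Appendix included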
Abstract:Vision foundation models trained on discretely sampled images achieve strong performance on classification benchmarks, yet whether their representations encode the continuous processes underlying their training data remains unclear. This question is especially pertinent in computational pathology, where we posit that models whose latent representations implicitly capture continuous disease progression may better reflect underlying biology, support more robust generalization, and enable quantitative analyses of features associated with disease transitions. Using diffusion pseudotime, a method developed to infer developmental trajectories from single-cell transcriptomics, we probe whether foundation models organize disease states along coherent progression directions in representation space. Across four cancer progressions and six models, we find that all pathology-specific models recover trajectory orderings significantly exceeding null baselines, with vision-only models achieving the highest fidelities (τ 0.78 on CRC-Serrated). Model rankings by trajectory fidelity on reference diseases strongly predict few-shot classification performance on held-out diseases (ρ = 0.92), and exploratory analysis shows cell-type composition varies smoothly along inferred trajectories in patterns consistent with known stromal remodeling. Together, these results demonstrate that vision foundation models can implicitly learn to represent continuous processes from independent static observations, and that trajectory fidelity provides a complementary measure of representation quality beyond downstream performance. While demonstrated in pathology, this framework could be applied to other domains where continuous processes are observed through static snapshots.
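One way to score how well an inferred pseudotime respects a known disease ordering is a rank correlation. The sketch below computes Kendall's tau-a on toy values. It assumes that the τ reported above is a Kendall-style rank correlation, which the abstract does not state, and the stage labels and pseudotime values are invented for illustration.

```python
def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / all pairs.

    Pairs tied in either sequence count as neither concordant nor discordant.
    """
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if product > 0:
                concordant += 1
            elif product < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical data: ordinal disease stage per sample and inferred pseudotime.
stage = [0, 0, 1, 1, 2, 2]
pseudotime = [0.10, 0.20, 0.35, 0.30, 0.80, 0.90]
print(kendall_tau(stage, pseudotime))
```

With stage labels that repeat, tau-a cannot reach 1 even for a perfect ordering, so a tie-corrected variant such as tau-b may be the more appropriate choice in practice.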
[CV-77] Adversarial Vulnerability Transcends Computational Paradigms: Feature Engineering Provides No Defense Against Neural Adversarial Transfer
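[Quick read]: This paper asks whether classical machine learning (classical ML) classifiers built on handcrafted features inherit the vulnerability of Deep Neural Networks (DNNs) to adversarial examples, particularly when the attack is generated through a neural surrogate. Its core contribution is the first systematic evaluation of transfer attacks in which FGSM and PGD adversarial examples generated on VGG16 are passed through HOG (Histogram of Oriented Gradients) feature extraction and fed to four classical classifiers (KNN, Decision Tree, Linear SVM, Kernel SVM) and a shallow neural network. The key finding is that all classical ML models suffer significant degradation (16.6%-59.1% relative accuracy loss), indicating that adversarial vulnerability does not stem solely from end-to-end differentiability but is a fundamental property of image classification systems. More surprisingly, the attack hierarchy is reversed: FGSM is more damaging than the iterative PGD attack on classical ML, suggesting that PGD may overfit to DNN-specific features that are filtered out or distorted during feature engineering.
Link: https://arxiv.org/abs/2601.21323
Authors: Achraf Hsain,Ahmed Abdelkader,Emmanuel Baldwin Mbaya,Hamoud Aljamaan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: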
Abstract:Deep neural networks are vulnerable to adversarial examples–inputs with imperceptible perturbations causing misclassification. While adversarial transfer within neural networks is well-documented, whether classical ML pipelines using handcrafted features inherit this vulnerability when attacked via neural surrogates remains unexplored. Feature engineering creates information bottlenecks through gradient quantization and spatial binning, potentially filtering high-frequency adversarial signals. We evaluate this hypothesis through the first comprehensive study of adversarial transfer from DNNs to HOG-based classifiers. Using VGG16 as a surrogate, we generate FGSM and PGD adversarial examples and test transfer to four classical classifiers (KNN, Decision Tree, Linear SVM, Kernel SVM) and a shallow neural network across eight HOG configurations on CIFAR-10. Our results strongly refute the protective hypothesis: all classifiers suffer 16.6%-59.1% relative accuracy drops, comparable to neural-to-neural transfer. More surprisingly, we discover attack hierarchy reversal–contrary to patterns where iterative PGD dominates FGSM within neural networks, FGSM causes greater degradation than PGD in 100% of classical ML cases, suggesting iterative attacks overfit to surrogate-specific features that don’t survive feature extraction. Block normalization provides partial but insufficient mitigation. These findings demonstrate that adversarial vulnerability is not an artifact of end-to-end differentiability but a fundamental property of image classification systems, with implications for security-critical deployments across computational paradigms.
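For readers unfamiliar with FGSM, the following sketch applies the one-step attack to a tiny logistic-regression model instead of VGG16. The weights, input, and epsilon are made-up toy values, and the model is a stand-in chosen so that the input gradient can be written in closed form; it is not the paper's experimental setup.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability of the positive class under a logistic-regression model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def fgsm(weights, bias, x, label, epsilon):
    """One-step FGSM: x + epsilon * sign(dLoss/dx) for binary cross-entropy.

    For logistic regression the input gradient is (p - label) * weights.
    """
    p = predict(weights, bias, x)
    grad = [(p - label) * w for w in weights]
    sign = [(g > 0) - (g < 0) for g in grad]
    return [xi + epsilon * s for xi, s in zip(x, sign)]


weights, bias = [2.0, -1.0, 0.5], 0.0   # hypothetical surrogate model
x, label = [0.5, 0.2, 0.1], 1           # clean input, true class 1
x_adv = fgsm(weights, bias, x, label, epsilon=0.3)
print(predict(weights, bias, x), predict(weights, bias, x_adv))
```

PGD repeats this signed-gradient step several times with a smaller step size and projects back into the epsilon-ball after each step, which is what makes it more closely tailored to the surrogate it is computed on.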
[CV-78] Optimal Transport-Induced Samples against Out-of-Distribution Overconfidence ICLR2026
[Quick read]: This paper addresses overconfident predictions by Deep Neural Networks (DNNs) on Out-of-Distribution (OOD) inputs, which seriously undermine model reliability in open-world environments. The key to the solution is the geometric structure of the singular boundaries induced by semi-discrete Optimal Transport (OT): an OT problem is formulated between a continuous base distribution and the latent embeddings of the training data to identify these geometrically uncertain regions; semantically ambiguous OOD samples, termed Optimal Transport-Induced OOD Samples (OTIS), are then drawn near them; and a confidence suppression loss applied to OTIS during training guides the model toward better-calibrated predictions in structurally uncertain regions.
Link: https://arxiv.org/abs/2601.21320
Authors: Keke Tang,Ziyong Du,Xiaofei Wang,Weilong Peng,Peican Zhu,Zhihong Tian
Affiliations: Guangzhou University; SmartMore Corporation; University of Science and Technology of China; Northwestern Polytechnical University; Guangdong Key Laboratory of Industrial Control System Security
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by ICLR 2026
Abstract:Deep neural networks (DNNs) often produce overconfident predictions on out-of-distribution (OOD) inputs, undermining their reliability in open-world environments. Singularities in semi-discrete optimal transport (OT) mark regions of semantic ambiguity, where classifiers are particularly prone to unwarranted high-confidence predictions. Motivated by this observation, we propose a principled framework to mitigate OOD overconfidence by leveraging the geometry of OT-induced singular boundaries. Specifically, we formulate an OT problem between a continuous base distribution and the latent embeddings of training data, and identify the resulting singular boundaries. By sampling near these boundaries, we construct a class of OOD inputs, termed optimal transport-induced OOD samples (OTIS), which are geometrically grounded and inherently semantically ambiguous. During training, a confidence suppression loss is applied to OTIS to guide the model toward more calibrated predictions in structurally uncertain regions. Extensive experiments show that our method significantly alleviates OOD overconfidence and outperforms state-of-the-art methods.
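A common way to suppress confidence on OOD samples is to pull the predicted distribution toward uniform. The sketch below implements that as a cross-entropy against a uniform target. This specific form is an assumption borrowed from standard outlier-exposure practice; the abstract names a confidence suppression loss but does not give its formula.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def confidence_suppression_loss(logits):
    """Cross-entropy between the prediction and a uniform target distribution.

    Minimised (value log K for K classes) when the prediction is uniform.
    """
    log_probs = log_softmax(logits)
    return -sum(log_probs) / len(log_probs)


print(confidence_suppression_loss([0.0, 0.0, 0.0, 0.0]))   # uniform prediction
print(confidence_suppression_loss([10.0, 0.0, 0.0, 0.0]))  # overconfident prediction
```

In training, a term like this would be added, with some weight, to the usual classification loss on in-distribution data, so that only the synthesized boundary samples are pushed toward low confidence.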
[CV-79] HiFi-Mesh: High-Fidelity Efficient 3D Mesh Generation via Compact Autoregressive Dependence
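[Quick read]: This paper addresses the poor resource utilization, slow inference, and limited sequence length of existing autoregressive methods for generating high-fidelity 3D meshes, which severely restrict how much structural detail a model can express. The key to the solution is the Latent Autoregressive Network (LANE), which introduces compact autoregressive dependencies into the generation process and raises the maximum generatable sequence length to 6 times that of existing methods. The authors further design the Adaptive Computation Graph Reconfiguration (AdaGraph) strategy, which uses spatiotemporal decoupling in the generation process to overcome the efficiency bottleneck of traditional serial inference, yielding significant gains in generation speed, structural detail, and geometric consistency.
Link: https://arxiv.org/abs/2601.21314
Authors: Yanfeng Li,Tao Tan,Qingquan Gao,Zhiwen Cao,Xiaohong Liu,Yue Sun
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: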
Abstract:High-fidelity 3D meshes can be tokenized into one-dimensional (1D) sequences and directly modeled using autoregressive approaches for faces and vertices. However, existing methods suffer from insufficient resource utilization, resulting in slow inference and the ability to handle only small-scale sequences, which severely constrains the expressible structural details. We introduce the Latent Autoregressive Network (LANE), which incorporates compact autoregressive dependencies in the generation process, achieving a 6× improvement in maximum generatable sequence length compared to existing methods. To further accelerate inference, we propose the Adaptive Computation Graph Reconfiguration (AdaGraph) strategy, which effectively overcomes the efficiency bottleneck of traditional serial inference through spatiotemporal decoupling in the generation process. Experimental validation demonstrates that LANE achieves superior performance across generation speed, structural detail, and geometric consistency, providing an effective solution for high-quality 3D mesh generation.
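To see why sequence length becomes the bottleneck, consider the most naive tokenization of a triangle mesh: quantize every coordinate and emit 3 vertices x 3 coordinates per face. The sketch below does exactly that. It is a baseline illustration, not LANE's tokenizer; the 128-bin quantization, the coordinate range, and the function names are assumptions.

```python
def quantize(value, bins=128, low=-1.0, high=1.0):
    """Map a coordinate in [low, high] to an integer bin in [0, bins - 1]."""
    scaled = (value - low) / (high - low) * (bins - 1)
    return max(0, min(bins - 1, round(scaled)))


def tokenize_mesh(vertices, faces, bins=128):
    """Flatten triangles into one token sequence: 9 tokens per face."""
    tokens = []
    for face in faces:
        for vertex_index in face:
            tokens.extend(quantize(c, bins) for c in vertices[vertex_index])
    return tokens


# A unit square at z = 1 made of two triangles (toy data).
vertices = [(-1.0, -1.0, 1.0), (1.0, -1.0, 1.0), (1.0, 1.0, 1.0), (-1.0, 1.0, 1.0)]
faces = [(0, 1, 2), (0, 2, 3)]
tokens = tokenize_mesh(vertices, faces)
print(len(tokens), tokens)
```

Under this scheme a mesh with 10,000 faces already needs 90,000 tokens, so any method that wants finer structural detail has to either compress the sequence or make each autoregressive step cheaper.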
[CV-80] Mam-App: A Novel Parameter-Efficient Mamba Model for Apple Leaf Disease Classification
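[Quick read]: This paper addresses the difficulty of balancing performance against large parameter counts and heavy compute in deep learning models for apple leaf disease recognition, especially the efficiency bottleneck of deploying on resource-constrained devices such as drones and mobile terminals. The key solution is Mam-App, a lightweight Mamba-based model for feature extraction and classification. By drawing on the efficient sequence modeling of the State Space Model (SSM), it achieves near state-of-the-art classification performance with only 0.051M parameters (99.58% accuracy on the PlantVillage apple leaf disease dataset), greatly reducing model complexity while retaining strong generalization and robustness, which suits real-time disease diagnosis in edge computing scenarios.
Link: https://arxiv.org/abs/2601.21307
Authors: Md Nadim Mahamood,Md Imran Hasan,Md Rasheduzzaman,Ausrukona Ray,Md Shafi Ud Doula,Kamrul Hasan
Affiliations: Begum Rokeya University; Kumamoto University; Asian Institute of Technology; Texas State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 Pages, 7 Tables, 5 Figures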
Abstract:The rapid growth of the global population, alongside exponential technological advancement, has intensified the demand for food production. Meeting this demand depends not only on increasing agricultural yield but also on minimizing food loss caused by crop diseases. Diseases account for a substantial portion of apple production losses, despite apples being among the most widely produced and nutritionally valuable fruits worldwide. Previous studies have employed machine learning techniques for feature extraction and early diagnosis of apple leaf diseases, and more recently, deep learning-based models have shown remarkable performance in disease recognition. However, most state-of-the-art deep learning models are highly parameter-intensive, resulting in increased training and inference time. Although lightweight models are more suitable for user-friendly and resource-constrained applications, they often suffer from performance degradation. To address the trade-off between efficiency and performance, we propose Mam-App, a parameter-efficient Mamba-based model for feature extraction and leaf disease classification. The proposed approach achieves competitive state-of-the-art performance on the PlantVillage Apple Leaf Disease dataset, attaining 99.58% accuracy, 99.30% precision, 99.14% recall, and a 99.22% F1-score, while using only 0.051M parameters. This extremely low parameter count makes the model suitable for deployment on drones, mobile devices, and other low-resource platforms. To demonstrate the robustness and generalizability of the proposed model, we further evaluate it on the PlantVillage Corn Leaf Disease and Potato Leaf Disease datasets. The model achieves 99.48%, 99.20%, 99.34%, and 99.27% accuracy, precision, recall, and F1-score on the corn dataset and 98.46%, 98.91%, 95.39%, and 97.01% on the potato dataset, respectively.
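As a quick sanity check on the reported metrics, the harmonic mean of the stated precision and recall can be compared with the stated F1-score for each dataset. The numbers below are copied from the abstract; how the authors averaged across classes is not stated, so this is a consistency check under the assumption of a single global precision/recall pair, not a reproduction of their evaluation.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as fractions, taken from the abstract.
reported = {
    "apple": (0.9930, 0.9914, 0.9922),
    "corn": (0.9920, 0.9934, 0.9927),
    "potato": (0.9891, 0.9539, 0.9701),
}

for name, (precision, recall, f1) in reported.items():
    print(name, round(f1_score(precision, recall), 4), f1)
```

The apple and corn figures agree with the harmonic mean to two decimal places, while the potato figures give about 97.12% against the reported 97.01%. A gap like this is what one would expect if the metrics were averaged per class (macro-averaged), since the mean of per-class F1 values generally differs from the F1 of the mean precision and recall.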
[CV-81] Gaussian Belief Propagation Network for Depth Completion
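[Quick read]: This paper addresses how to handle sparse and irregular input depth data in depth completion, in particular the limited performance of deep learning methods under high sparsity. The key to the solution is a new hybrid framework, the Gaussian Belief Propagation Network (GBPN), with two core innovations: 1) a Graphical Model Construction Network (GMCN) dynamically generates a scene-specific Markov Random Field (MRF) and predicts adaptive non-local edges to model complex long-range spatial dependencies; and 2) a serial-parallel message passing scheme improves the information propagation efficiency of Gaussian Belief Propagation (GBP), particularly for inference starting from sparse measurement points. The approach performs end-to-end depth completion, achieves state-of-the-art performance on the NYUv2 and KITTI benchmarks, and shows good robustness and generalization.
Link: https://arxiv.org/abs/2601.21291
Authors: Jie Tang,Pingping Xie,Jian Li,Ping Tan
Affiliations: National University of Defense Technology; The Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: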
Abstract:Depth completion aims to predict a dense depth map from a color image with sparse depth measurements. Although deep learning methods have achieved state-of-the-art (SOTA), effectively handling the sparse and irregular nature of input depth data in deep networks remains a significant challenge, often limiting performance, especially under high sparsity. To overcome this limitation, we introduce the Gaussian Belief Propagation Network (GBPN), a novel hybrid framework synergistically integrating deep learning with probabilistic graphical models for end-to-end depth completion. Specifically, a scene-specific Markov Random Field (MRF) is dynamically constructed by the Graphical Model Construction Network (GMCN), and then inferred via Gaussian Belief Propagation (GBP) to yield the dense depth distribution. Crucially, the GMCN learns to construct not only the data-dependent potentials of MRF but also its structure by predicting adaptive non-local edges, enabling the capture of complex, long-range spatial dependencies. Furthermore, we enhance GBP with a serial & parallel message passing scheme, designed for effective information propagation, particularly from sparse measurements. Extensive experiments demonstrate that GBPN achieves SOTA performance on the NYUv2 and KITTI benchmarks. Evaluations across varying sparsity levels, sparsity patterns, and datasets highlight GBPN’s superior performance, notable robustness, and generalizable capability.
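How Gaussian belief propagation spreads sparse measurements can be seen on a one-dimensional toy problem. The sketch below runs GBP in information form on a chain of depth variables with a smoothness prior between neighbours and a few observed nodes. It is a deliberately simplified stand-in: the paper's MRF lives on an image with learned potentials and non-local edges, whereas the chain structure, the fixed weights, and the function name here are assumptions.

```python
def gbp_chain(num_nodes, measurements, obs_precision=1e6, smooth_weight=1.0):
    """Marginal means from Gaussian BP on a chain-structured MRF.

    measurements: dict {node index: observed depth}; needs at least one entry.
    Pairwise potentials penalise smooth_weight * (x_i - x_{i+1})**2 / 2.
    On a chain (a tree) the result is exact after one forward and one
    backward sweep.
    """
    prec = [obs_precision if i in measurements else 0.0 for i in range(num_nodes)]
    info = [prec[i] * measurements.get(i, 0.0) for i in range(num_nodes)]

    def message(node_prec, node_info):
        # Marginalise the sender out of the pairwise factor (information form).
        denom = smooth_weight + node_prec
        return smooth_weight * node_prec / denom, smooth_weight * node_info / denom

    fwd = [(0.0, 0.0)] * num_nodes  # message arriving from the left neighbour
    for i in range(1, num_nodes):
        fwd[i] = message(prec[i - 1] + fwd[i - 1][0], info[i - 1] + fwd[i - 1][1])

    bwd = [(0.0, 0.0)] * num_nodes  # message arriving from the right neighbour
    for i in range(num_nodes - 2, -1, -1):
        bwd[i] = message(prec[i + 1] + bwd[i + 1][0], info[i + 1] + bwd[i + 1][1])

    return [(info[i] + fwd[i][1] + bwd[i][1]) / (prec[i] + fwd[i][0] + bwd[i][0])
            for i in range(num_nodes)]


# Two sparse depth measurements at the ends of a five-node chain.
print(gbp_chain(5, {0: 0.0, 4: 4.0}))
```

With equal smoothness weights and the two end nodes pinned, the unobserved nodes settle close to a linear interpolation between the measurements, which is the behaviour a learned, data-dependent MRF generalizes by making the weights and edges depend on the image.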
[CV-82] Drive-KD: Multi-Teacher Distillation for VLMs in Autonomous Driving
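[Quick read]: This paper addresses two problems in autonomous driving: large models such as vision-language models (VLMs) consume substantial GPU memory and have high inference latency, while small models trained with conventional supervised fine-tuning (SFT) struggle to close the capability gap. The key to the solution is the Drive-KD framework, which decomposes autonomous driving into a "perception-reasoning-planning" triad and designs capability-specific single-teacher distillation strategies based on layer-specific attention. These single-teacher settings are then unified into a multi-teacher distillation framework, with asymmetric gradient projection introduced to mitigate cross-capability gradient conflicts, enabling efficient, high-performing capability transfer to small models.
Link: https://arxiv.org/abs/2601.21288
Authors: Weitong Lian,Zecong Tang,Haoran Li,Tianjian Gao,Yifei Wang,Zixu Wang,Lingyi Meng,Tengju Ru,Zhejun Cui,Yichen Zhu,Hangshuo Cao,Qi Kang,Tianxing Chen,Yusen Qin,Kaixuan Wang,Yu Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. 23 pages, 14 figures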
Abstract:Autonomous driving is an important and safety-critical task, and recent advances in LLMs/VLMs have opened new possibilities for reasoning and planning in this domain. However, large models demand substantial GPU memory and exhibit high inference latency, while conventional supervised fine-tuning (SFT) often struggles to bridge the capability gaps of small models. To address these limitations, we propose Drive-KD, a framework that decomposes autonomous driving into a “perception-reasoning-planning” triad and transfers these capabilities via knowledge distillation. We identify layer-specific attention as the distillation signal to construct capability-specific single-teacher models that outperform baselines. Moreover, we unify these single-teacher settings into a multi-teacher distillation framework and introduce asymmetric gradient projection to mitigate cross-capability gradient conflicts. Extensive evaluations validate the generalization of our method across diverse model families and scales. Experiments show that our distilled InternVL3-1B model, with ~42 times less GPU memory and ~11.4 times higher throughput, achieves better overall performance than the pretrained 78B model from the same family on DriveBench, and surpasses GPT-5.1 on the planning dimension, providing insights toward efficient autonomous driving VLMs.
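Gradient conflicts between objectives are often handled by projecting one gradient off the other when their dot product is negative, as in PCGrad-style methods. The sketch below shows a one-sided version in which only one gradient is altered and the other, the anchor, is left untouched. That is merely one plausible reading of "asymmetric" gradient projection; the abstract does not define the operation, and the function names and toy vectors are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_if_conflicting(grad, anchor):
    """Remove from `grad` its component along `anchor` when the two conflict.

    If grad . anchor >= 0 the gradient is returned unchanged. The anchor
    gradient itself is never modified, which makes the rule one-sided.
    """
    overlap = dot(grad, anchor)
    if overlap >= 0:
        return list(grad)
    scale = overlap / dot(anchor, anchor)
    return [g - scale * a for g, a in zip(grad, anchor)]


planning_grad = [0.0, 1.0]     # hypothetical gradient treated as the anchor
perception_grad = [1.0, -1.0]  # hypothetical gradient that conflicts with it
adjusted = project_if_conflicting(perception_grad, planning_grad)
print(adjusted, dot(adjusted, planning_grad))
```

After projection the adjusted gradient is orthogonal to the anchor, so a step along it no longer increases the anchor's loss to first order.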
[CV-83] WorldBench: Disambiguating Physics for Diagnostic Evaluation of World Models
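[Quick read]: This paper addresses the difficulty of evaluating the physical consistency of video-based world models: existing physics-based video benchmarks entangle multiple concepts and therefore cannot precisely diagnose a model's understanding of any single physical law or concept. The key to the solution is WorldBench, a concept-disentangled video benchmark whose layered design separately evaluates intuitive physical understanding (such as object permanence and scale/perspective) and low-level physical parameters (such as friction coefficients and fluid viscosity). This makes it possible to systematically identify a model's failure modes on specific physical concepts and provides an interpretable, scalable evaluation framework for reliable real-world deployment of world models.
Link: https://arxiv.org/abs/2601.21282
Authors: Rishi Upadhyay,Howard Zhang,Jim Solomon,Ayush Agrawal,Pranay Boreddy,Shruti Satya Narayana,Yunhao Ba,Alex Wong,Celso M de Melo,Achuta Kadambi
Affiliations: University of California, Los Angeles; Sony AI; Yale University; DEVCOM Army Research Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Webpage: this https URL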
Abstract:Recent advances in generative foundational models, often termed “world models,” have propelled interest in applying them to critical tasks like robotic planning and autonomous system training. For reliable deployment, these models must exhibit high physical fidelity, accurately simulating real-world dynamics. Existing physics-based video benchmarks, however, suffer from entanglement, where a single test simultaneously evaluates multiple physical laws and concepts, fundamentally limiting their diagnostic capability. We introduce WorldBench, a novel video-based benchmark specifically designed for concept-specific, disentangled evaluation, allowing us to rigorously isolate and assess understanding of a single physical concept or law at a time. To make WorldBench comprehensive, we design benchmarks at two different levels: 1) an evaluation of intuitive physical understanding with concepts such as object permanence or scale/perspective, and 2) an evaluation of low-level physical constants and material properties such as friction coefficients or fluid viscosity. When SOTA video-based world models are evaluated on WorldBench, we find specific patterns of failure in particular physics concepts, with all tested models lacking the physical consistency required to generate reliable real-world interactions. Through its concept-specific evaluation, WorldBench offers a more nuanced and scalable framework for rigorously evaluating the physical reasoning capabilities of video generation and world models, paving the way for more robust and generalizable world-model-driven learning.
[CV-84] Token Entropy Regularization for Multi-modal Antenna Affiliation Identification
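[Quick read]: This paper addresses the accuracy and efficiency of antenna affiliation identification in communication networks, where current practice relies on manual inspections that are cumbersome and error-prone. The core solution recasts antenna affiliation identification as multi-modal classification and matching tasks that fuse video footage of base stations, antenna geometric features, and Physical Cell Identity (PCI) signals, and proposes a dedicated pretraining framework for cross-modal alignment. The key innovation is a Token Entropy Regularization (TER) module that improves the consistency of representations across modalities during pretraining, markedly accelerating convergence and improving performance. Experiments further find that the entropy of the first token is modality-dependent, offering a new theoretical insight for multi-modal learning.
Link: https://arxiv.org/abs/2601.21280
Authors: Dong Chen,Ruoyu Li,Xinyan Zhang,Jialei Xu,Ruoseng Zhao,Zhikang Zhang,Lingyun Li,Zizhuang Wei
Affiliations: Huawei; The University of Hong Kong; The Chinese University of Hong Kong, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: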
Abstract:Accurate antenna affiliation identification is crucial for optimizing and maintaining communication networks. Current practice, however, relies on the cumbersome and error-prone process of manual tower inspections. We propose a novel paradigm shift that fuses video footage of base stations, antenna geometric features, and Physical Cell Identity (PCI) signals, transforming antenna affiliation identification into multi-modal classification and matching tasks. Publicly available pretrained transformers struggle with this unique task due to a lack of analogous data in the communications domain, which hampers cross-modal alignment. To address this, we introduce a dedicated training framework that aligns antenna images with corresponding PCI signals. To tackle the representation alignment challenge, we propose a novel Token Entropy Regularization module in the pretraining stage. Our experiments demonstrate that TER accelerates convergence and yields significant performance gains. Further analysis reveals that the entropy of the first token is modality-dependent. Code will be made available upon publication.
[CV-85] Lightweight High-Fidelity Low-Bitrate Talking Face Compression for 3D Video Conference
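[Quick read]: This paper addresses the difficulty of representing high-quality 3D talking faces at low bitrates: traditional 2D video compression fails to preserve fine geometric and appearance detail, while implicit neural rendering methods such as NeRF are prohibitively expensive to compute. The key to the solution is a lightweight, high-fidelity, low-bitrate 3D talking face compression framework that combines FLAME parametric modeling with 3D Gaussian Splatting (3DGS) neural rendering. It transmits only essential facial metadata to enable efficient reconstruction, and adds Gaussian attribute compression and multilayer perceptron (MLP) optimization to improve transmission efficiency, delivering high-quality facial rendering even at extremely low bitrates.
Link: https://arxiv.org/abs/2601.21269
Authors: Jianglong Li,Jun Xu,Bingcong Lu,Zhengxue Cheng,Hongwei Hu,Ronghua Wu,Li Song
Affiliations: Shanghai Jiao Tong University; AntGroup
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: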
Abstract:The demand for immersive and interactive communication has driven advancements in 3D video conferencing, yet achieving high-fidelity 3D talking face representation at low bitrates remains a challenge. Traditional 2D video compression techniques fail to preserve fine-grained geometric and appearance details, while implicit neural rendering methods like NeRF suffer from prohibitive computational costs. To address these challenges, we propose a lightweight, high-fidelity, low-bitrate 3D talking face compression framework that integrates FLAME-based parametric modeling with 3DGS neural rendering. Our approach transmits only essential facial metadata in real time, enabling efficient reconstruction with a Gaussian-based head model. Additionally, we introduce a compact representation and compression scheme, including Gaussian attribute compression and MLP optimization, to enhance transmission efficiency. Experimental results demonstrate that our method achieves superior rate-distortion performance, delivering high-quality facial rendering at extremely low bitrates, making it well-suited for real-time 3D video conferencing applications.
[CV-86] Hypersolid: Emergent Vision Representations via Short-Range Repulsion
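[Quick read]: This paper addresses representation collapse in self-supervised learning, where the features learned during unsupervised training become redundant or degenerate and performance drops. Existing methods typically rely on global regularization, such as maximizing distances between features, decorrelating dimensions, or enforcing distributional constraints. The paper instead reinterprets representation learning as a discrete packing problem, in which preserving information amounts to keeping the mapping injective. The key is Hypersolid, which uses short-range hard-ball repulsion to prevent local feature collisions, producing a high-separation geometric regime that preserves augmentation diversity and performs well on fine-grained and low-resolution classification tasks.
Link: https://arxiv.org/abs/2601.21255
Authors: Esteban Rodríguez-Betancourt,Edgar Casasola-Murillo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, 16 figures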
Abstract:A recurring challenge in self-supervised learning is preventing representation collapse. Existing solutions typically rely on global regularization, such as maximizing distances, decorrelating dimensions or enforcing certain distributions. We instead reinterpret representation learning as a discrete packing problem, where preserving information simplifies to maintaining injectivity. We operationalize this in Hypersolid, a method using short-range hard-ball repulsion to prevent local collisions. This constraint results in a high-separation geometric regime that preserves augmentation diversity, excelling on fine-grained and low-resolution classification tasks.
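The short-range repulsion idea can be illustrated with a small sketch. The abstract does not give the loss, so the squared-overlap form, the `radius` parameter, and the function name `hard_ball_repulsion` are all assumptions made for illustration only: a pair of embeddings is penalized only when it sits closer than a fixed ball radius, and distant pairs contribute nothing.

```python
import math

def hard_ball_repulsion(embeddings, radius=0.5):
    """Hypothetical penalty: sum of squared overlaps between pairs closer than `radius`."""
    penalty = 0.0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            dist = math.dist(embeddings[i], embeddings[j])
            overlap = max(0.0, radius - dist)  # zero once the pair is outside the ball
            penalty += overlap ** 2
    return penalty

# Only the first two points collide; the third is far from both.
points = [(0.0, 0.0), (0.1, 0.0), (2.0, 2.0)]
print(hard_ball_repulsion(points))
```

Unlike a global regularizer, this term has no effect on pairs that are already separated, which is the "short-range" property the abstract emphasizes.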
[CV-87] Lossless Copyright Protection via Intrinsic Model Fingerprinting
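[Quick read]: This paper addresses intellectual property protection for diffusion models: existing methods either embed watermarks by modifying the model, which degrades performance, or extract fingerprints by manipulating the denoising process, which is incompatible with black-box APIs. The key is TrajPrint, a training-free and completely lossless framework. It traces a specific watermarked image back to its trajectory origin along the deterministic generation path to lock a model-specific manifold fingerprint, then uses a joint optimization strategy with dual-end anchoring to synthesize a fingerprint noise that adheres to the target manifold. The watermarked image is recovered only by the target model and not by other models, and copyright verification is completed through atomic inference and statistical hypothesis testing.
Link: https://arxiv.org/abs/2601.21252
Authors: Lingxiao Chen,Liqin Wang,Wei Lu,Xiangyang Luo
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: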
Abstract:The exceptional performance of diffusion models establishes them as high-value intellectual property but exposes them to unauthorized replication. Existing protection methods either modify the model to embed watermarks, which impairs performance, or extract model fingerprints by manipulating the denoising process, rendering them incompatible with black-box APIs. In this paper, we propose TrajPrint, a completely lossless and training-free framework that verifies model copyright by extracting unique manifold fingerprints formed during deterministic generation. Specifically, we first utilize a watermarked image as an anchor and exactly trace the path back to its trajectory origin, effectively locking the model fingerprint mapped by this path. Subsequently, we implement a joint optimization strategy that employs dual-end anchoring to synthesize a specific fingerprint noise, which strictly adheres to the target manifold for robust watermark recovery. As input, it enables the protected target model to recover the watermarked image, while failing on non-target models. Finally, we achieve verification via atomic inference and statistical hypothesis testing. Extensive experiments demonstrate that TrajPrint achieves lossless verification in black-box API scenarios with superior robustness against model modifications.
[CV-88] NFCDS: A Plug-and-Play Noise Frequency-Controlled Diffusion Sampling Strategy for Image Restoration
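[Quick read]: This paper addresses the trade-off between data fidelity and perceptual quality in diffusion sampling-based Plug-and-Play (PnP) image restoration methods, which stems from the noise introduced during reverse diffusion. The key is Noise Frequency-Controlled Diffusion Sampling (NFCDS), which modulates reverse diffusion noise with a Fourier-domain filter: low-frequency noise induces blur and harms fidelity, while high-frequency noise drives detail generation. NFCDS progressively suppresses low-frequency noise and preserves high-frequency components, injecting a data-consistency prior directly into sampling without additional training and enabling fast convergence to results that are both high-fidelity and perceptually convincing. The module integrates into existing diffusion-based restoration frameworks and improves the fidelity-perception balance across diverse zero-shot tasks.
Link: https://arxiv.org/abs/2601.21248
Authors: Zhen Wang,Hongyi Liu,Jianing Li,Zhihui Wei
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: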
Abstract:Diffusion sampling-based Plug-and-Play (PnP) methods produce images with high perceptual quality but often suffer from reduced data fidelity, primarily due to the noise introduced during reverse diffusion. To address this trade-off, we propose Noise Frequency-Controlled Diffusion Sampling (NFCDS), a spectral modulation mechanism for reverse diffusion noise. We show that the fidelity-perception conflict can be fundamentally understood through noise frequency: low-frequency components induce blur and degrade fidelity, while high-frequency components drive detail generation. Based on this insight, we design a Fourier-domain filter that progressively suppresses low-frequency noise and preserves high-frequency content. This controlled refinement injects a data-consistency prior directly into sampling, enabling fast convergence to results that are both high-fidelity and perceptually convincing, without additional training. As a PnP module, NFCDS seamlessly integrates into existing diffusion-based restoration frameworks and improves the fidelity-perception balance across diverse zero-shot tasks.
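A minimal sketch of frequency-controlled noise, assuming a 1-D signal, a hard high-pass mask, and a linear cutoff schedule (none of which are specified in the abstract; the function names are hypothetical). It shows the general mechanism only: low-frequency bins of the sampled noise are zeroed, and the cutoff grows as sampling proceeds.

```python
import cmath
import random

def dft(x):
    """Naive discrete Fourier transform (fine for short toy signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse transform; returns the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real for t in range(n)]

def high_pass_noise(noise, cutoff):
    """Zero every frequency bin whose symmetric index is below `cutoff`."""
    n = len(noise)
    spectrum = dft(noise)
    for k in range(n):
        if min(k, n - k) < cutoff:  # bins k and n-k carry the same frequency
            spectrum[k] = 0.0
    return idft(spectrum)

def cutoff_schedule(step, num_steps, max_cutoff):
    """Assumed linear schedule: no suppression at step 0, strongest at the end."""
    return round(max_cutoff * step / max(1, num_steps - 1))

random.seed(0)
num_steps = 5
for step in range(num_steps):
    noise = [random.gauss(0.0, 1.0) for _ in range(16)]
    filtered = high_pass_noise(noise, cutoff_schedule(step, num_steps, max_cutoff=4))
```

In a real sampler the filter would act on 2-D noise maps inside the reverse diffusion loop; the 1-D version keeps the example dependency-free.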
[CV-89] PTQ4ARVG: Post-Training Quantization for AutoRegressive Visual Generation Models ICLR2026
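[Quick read]: This paper addresses three challenges in applying training-free post-training quantization (PTQ) to AutoRegressive Visual Generation (ARVG) models: (1) severe outliers at the channel level; (2) highly dynamic activations at the token level; and (3) mismatched distribution information at the sample level. The proposed framework, PTQ4ARVG, has three key components: (1) Gain-Projected Scaling (GPS) expands the quantization loss via a Taylor series to quantify the gain of scaling and solves for the optimal scaling factor to mitigate channel-wise outliers; (2) Static Token-Wise Quantization (STWQ) exploits the fixed token length and position-invariant distribution of ARVG models to handle token-wise variance without dynamic calibration overhead; (3) Distribution-Guided Calibration (DGC) selects the samples that contribute most to distributional entropy, eliminating sample-wise distribution mismatch. Experiments show the method quantizes ARVG models to 8-bit and 6-bit while maintaining competitive performance.
Link: https://arxiv.org/abs/2601.21238
Authors: Xuewen Liu,Zhikai Li,Jing Zhang,Mengjuan Chen,Qingyi Gu
Affiliations: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICLR 2026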
Abstract:AutoRegressive Visual Generation (ARVG) models retain an architecture compatible with language models, while achieving performance comparable to diffusion-based models. Quantization is commonly employed in neural networks to reduce model size and computational latency. However, applying quantization to ARVG remains largely underexplored, and existing quantization methods fail to generalize effectively to ARVG models. In this paper, we explore this issue and identify three key challenges: (1) severe outliers at channel-wise level, (2) highly dynamic activations at token-wise level, and (3) mismatched distribution information at sample-wise level. To address them, we propose PTQ4ARVG, a training-free post-training quantization (PTQ) framework consisting of: (1) Gain-Projected Scaling (GPS), which mitigates the channel-wise outliers by expanding the quantization loss via a Taylor series to quantify the gain of scaling for activation-weight quantization and deriving the optimal scaling factor through differentiation. (2) Static Token-Wise Quantization (STWQ), which leverages the inherent properties of ARVG, fixed token length and position-invariant distribution across samples, to address token-wise variance without incurring dynamic calibration overhead. (3) Distribution-Guided Calibration (DGC), which selects samples that contribute most to distributional entropy, eliminating the sample-wise distribution mismatch. Extensive experiments show that PTQ4ARVG can effectively quantize the ARVG family models to 8-bit and 6-bit while maintaining competitive performance. Code is available at this http URL.
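The static token-wise idea can be sketched as follows, assuming symmetric max-abs scaling and toy activations (both are illustrative choices, not details from the paper, and the function names are hypothetical). Because the token length is fixed, one scale per token position can be calibrated offline and reused at inference with no per-input statistics.

```python
def calibrate_token_scales(calib_samples, num_bits=8):
    """One static scale per token position, from the max |value| seen in calibration."""
    qmax = 2 ** (num_bits - 1) - 1
    num_tokens = len(calib_samples[0])
    scales = []
    for t in range(num_tokens):
        peak = max(abs(v) for sample in calib_samples for v in sample[t])
        scales.append(max(peak, 1e-8) / qmax)  # guard against all-zero positions
    return scales

def quantize(sample, scales, num_bits=8):
    """Round each token with its precomputed scale and clip to the signed range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [[max(-qmax, min(qmax, round(v / s))) for v in token]
            for token, s in zip(sample, scales)]

def dequantize(q_sample, scales):
    return [[q * s for q in token] for token, s in zip(q_sample, scales)]

# Two calibration samples, two token positions with very different ranges.
calib = [
    [[0.1, -0.2], [10.0, 5.0]],
    [[0.3, 0.05], [-8.0, 2.0]],
]
scales = calibrate_token_scales(calib)
restored = dequantize(quantize(calib[0], scales), scales)
```

A single per-tensor scale would be dominated by the second token's range and would crush the first token's small values; per-position scales avoid that without any runtime calibration.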
[CV-90] LAMP: Learning Universal Adversarial Perturbations for Multi-Image Tasks via Pre-trained Models AAAI2026
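[Quick read]: This paper addresses the vulnerability of multi-image Multimodal Large Language Models (MLLMs) to adversarial attacks. Existing attacks mainly target single-image settings and often assume a white-box threat model, which is impractical in real black-box environments. The key is LAMP, which introduces an attention-based constraint that disrupts the model's aggregation of information across images, a cross-image contagious constraint that spreads the perturbation's effect to unmodified images, and an index-attention suppression loss for a robust position-invariant attack, significantly improving attack success rates under black-box conditions.
Link: https://arxiv.org/abs/2601.21220
Authors: Alvi Md Ishmam,Najibul Haque Sarker,Zaber Ibn Abdul Hakim,Chris Thomas
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in main technical track AAAI 2026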
Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable performance across vision-language tasks. Recent advancements allow these models to process multiple images as inputs. However, the vulnerabilities of multi-image MLLMs remain unexplored. Existing adversarial attacks focus on single-image settings and often assume a white-box threat model, which is impractical in many real-world scenarios. This paper introduces LAMP, a black-box method for learning Universal Adversarial Perturbations (UAPs) targeting multi-image MLLMs. LAMP applies an attention-based constraint that prevents the model from effectively aggregating information across images. LAMP also introduces a novel cross-image contagious constraint that forces perturbed tokens to influence clean tokens, spreading adversarial effects without requiring all inputs to be modified. Additionally, an index-attention suppression loss enables a robust position-invariant attack. Experimental results show that LAMP outperforms SOTA baselines and achieves the highest attack success rates across multiple vision-language tasks and models.
[CV-91] Thinker: A vision-language foundation model for embodied intelligence IROS2025
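[Quick read]: This paper addresses two core problems that large vision-language models face in robotics: confusion between third-person and first-person perspectives, and a tendency to overlook information at the end of a video during temporal reasoning. The key is Thinker, a large vision-language foundation model with two contributions. First, it builds a multimodal dataset for robotic perception and reasoning covering ego-view videos, visual grounding, spatial understanding, and chain-of-thought data. Second, it introduces a simple yet effective approach that jointly feeds key frames and full video sequences as inputs to substantially improve video comprehension, reaching state-of-the-art results on two widely used task planning benchmarks.
Link: https://arxiv.org/abs/2601.21199
Authors: Baiyu Pan,Daqin Luo,Junpeng Yang,Jiyuan Wang,Yixuan Zhang,Hailin Shi,Jichao Jiao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: IROS 2025, 4 pages, 3 figures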
Abstract:When large vision-language models are applied to the field of robotics, they encounter problems that are simple for humans yet error-prone for models. Such issues include confusion between third-person and first-person perspectives and a tendency to overlook information in video endings during temporal reasoning. To address these challenges, we propose Thinker, a large vision-language foundation model designed for embodied intelligence. We tackle the aforementioned issues from two perspectives. Firstly, we construct a large-scale dataset tailored for robotic perception and reasoning, encompassing ego-view videos, visual grounding, spatial understanding, and chain-of-thought data. Secondly, we introduce a simple yet effective approach that substantially enhances the model’s capacity for video comprehension by jointly incorporating key frames and full video sequences as inputs. Our model achieves state-of-the-art results on two of the most commonly used benchmark datasets in the field of task planning.
[CV-92] Generative Recall Dense Reranking: Learning Multi-View Semantic IDs for Efficient Text-to-Video Retrieval
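[Quick read]: This paper addresses the recall-model bottleneck in two-stage Text-to-Video Retrieval (TVR), in particular the low quality of recalled candidates that generative retrieval (GR) produces because of semantic ambiguity and cross-modal misalignment. The core solution is Generative Recall and Dense Reranking (GRDR): a query-guided multi-view tokenizer assigns multiple semantic IDs to each video, easing the ambiguity of a single ID; the tokenizer and generative retriever are trained jointly through a shared codebook so that semantic IDs act as a semantic bridge between texts and videos and strengthen cross-modal alignment; at inference, trie-constrained decoding generates a compact candidate set that a dense reranking model then matches at fine granularity, giving accurate and efficient full-corpus retrieval.
Link: https://arxiv.org/abs/2601.21193
Authors: Zecheng Zhao,Zhi Chen,Zi Huang,Shazia Sadiq,Tong Chen
Affiliations: The University of Queensland (昆士兰大学); The University of Southern Queensland (南昆士兰大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages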
Abstract:Text-to-Video Retrieval (TVR) is essential in video platforms. Dense retrieval with dual-modality encoders leads in accuracy, but its computation and storage scale poorly with corpus size. Thus, real-time large-scale applications adopt two-stage retrieval, where a fast recall model gathers a small candidate pool, which is reranked by an advanced dense retriever. Due to hugely reduced candidates, the reranking model can use any off-the-shelf dense retriever without hurting efficiency, meaning the recall model bounds two-stage TVR performance. Recently, generative retrieval (GR) replaces dense video embeddings with discrete semantic IDs and retrieves by decoding text queries into ID tokens. GR offers near-constant inference and storage complexity, and its semantic IDs capture high-level video features via quantization, making it ideal for quickly eliminating irrelevant candidates during recall. However, as a recall model in two-stage TVR, GR suffers from (i) semantic ambiguity, where each video satisfies diverse queries but is forced into one semantic ID; and (ii) cross-modal misalignment, as semantic IDs are solely derived from visual features without text supervision. We propose Generative Recall and Dense Reranking (GRDR), designing a novel GR method to uplift recalled candidate quality. GRDR assigns multiple semantic IDs to each video using a query-guided multi-view tokenizer exposing diverse semantic access paths, and jointly trains the tokenizer and generative retriever via a shared codebook to cast semantic IDs as the semantic bridge between texts and videos. At inference, trie-constrained decoding generates a compact candidate set reranked by a dense model for fine-grained matching. Experiments on TVR benchmarks show GRDR matches strong dense retrievers in accuracy while reducing index storage by an order of magnitude and accelerating full-corpus retrieval by up to 300×.
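Trie-constrained decoding with multiple semantic IDs per video can be sketched as below. The index contents, the toy scoring table, and greedy (rather than beam) decoding are assumptions for illustration; a real generative retriever would supply token scores from the model. The trie guarantees that every decoded ID sequence corresponds to at least one indexed video.

```python
def build_trie(id_sequences):
    """Index (video, semantic_id) pairs; a video may appear under several IDs."""
    trie = {}
    for video, seq in id_sequences:
        node = trie
        for token in seq:
            node = node.setdefault(token, {})
        node.setdefault("_videos", set()).add(video)
    return trie

def constrained_greedy_decode(trie, score_fn, length):
    """At each step, pick the best-scoring token among the trie's valid children."""
    node, prefix = trie, []
    for _ in range(length):
        allowed = [t for t in node if t != "_videos"]
        best = max(allowed, key=lambda t: score_fn(prefix, t))
        prefix.append(best)
        node = node[best]
    return prefix, node.get("_videos", set())

# Video "v1" is reachable through two different semantic IDs (multi-view).
index = [("v1", (1, 2)), ("v1", (3, 0)), ("v2", (1, 5))]
trie = build_trie(index)

# Hypothetical scores; token 7 scores highest but is not a valid continuation.
toy_scores = {0: 0.5, 1: 0.9, 2: 0.1, 3: 0.2, 5: 0.8, 7: 1.0}
ids, videos = constrained_greedy_decode(trie, lambda prefix, tok: toy_scores[tok], length=2)
print(ids, videos)  # [1, 5] {'v2'}
```

The sketch assumes all semantic IDs share one fixed length; the recalled set would then be passed to a dense reranker.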
[CV-93] FRISM: Fine-Grained Reasoning Injection via Subspace-Level Model Merging for Vision-Language Models
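[Quick read]: This paper addresses the trade-off between injecting reasoning capabilities and preserving visual capabilities that arises when existing methods merge Vision-Language Models (VLMs) with Large Reasoning Models (LRMs) at a coarse-grained layer level. The key is FRISM (Fine-grained Reasoning Injection via Subspace-level model Merging), a fine-grained reasoning injection framework based on subspace-level model merging. It decomposes LRM task vectors via Singular Value Decomposition (SVD) and adaptively learns a scaling coefficient for each subspace, and it adds a label-free self-distillation learning strategy with dual-objective optimization, improving reasoning performance without sacrificing the model's original visual capabilities.
Link: https://arxiv.org/abs/2601.21187
Authors: Chenyu Huang,Peng Ye,Xudong Tan,Jinhan Mu,Shenghe Zheng,Li Shen,Tao Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 23 pages, 8 figures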
Abstract:Efficiently enhancing the reasoning capabilities of Vision-Language Models (VLMs) by merging them with Large Reasoning Models (LRMs) has emerged as a promising direction. However, existing methods typically operate at a coarse-grained layer level, which often leads to a trade-off between injecting reasoning capabilities and preserving visual capabilities. To address this limitation, we propose FRISM (Fine-grained Reasoning Injection via Subspace-level model Merging), a fine-grained reasoning injection framework based on subspace-level model merging. Observing that reasoning capabilities are encoded in distinct subspaces, FRISM decomposes LRM task vectors via Singular Value Decomposition (SVD) and adaptively tunes the scaling coefficients of each subspace through learning to realize fine-grained reasoning injection. Furthermore, we introduce a label-free self-distillation learning strategy with a dual-objective optimization using common vision-language perception datasets. Extensive experiments demonstrate that FRISM effectively improves reasoning capabilities without compromising the model’s original visual capabilities by consistently achieving state-of-the-art performance across diverse visual reasoning benchmarks.
[CV-94] Enhancing Underwater Light Field Images via Global Geometry-aware Diffusion Process
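[Quick read]: This paper addresses the difficulty of acquiring high-quality images in underwater 4-D light field (LF) imaging, especially the color distortion in underwater scenes. The key is GeoDiff-LF, a framework built on the SD-Turbo diffusion model with three core adaptations: (1) a modified U-Net with convolutional and attention adapters to model geometric cues; (2) a geometry-guided loss function based on tensor decomposition and progressive weighting to regularize global structure; and (3) an optimized sampling strategy with noise prediction to improve efficiency. By exploiting the spatial-angular structure of light fields together with diffusion priors, the method improves color fidelity in underwater images and outperforms existing methods in both visual quality and quantitative metrics.
Link: https://arxiv.org/abs/2601.21179
Authors: Yuji Lin,Qian Zhao,Zongsheng Yue,Junhui Hou,Deyu Meng
Affiliations: Xi’an Jiaotong University (西安交通大学); City University of Hong Kong (香港城市大学); Macau University of Science and Technology (澳门科技大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 9 figures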
Abstract:This work studies the challenging problem of acquiring high-quality underwater images via 4-D light field (LF) imaging. To this end, we propose GeoDiff-LF, a novel diffusion-based framework built upon SD-Turbo to enhance underwater 4-D LF imaging by leveraging its spatial-angular structure. GeoDiff-LF consists of three key adaptations: (1) a modified U-Net architecture with convolutional and attention adapters to model geometric cues, (2) a geometry-guided loss function using tensor decomposition and progressive weighting to regularize global structure, and (3) an optimized sampling strategy with noise prediction to improve efficiency. By integrating diffusion priors and LF geometry, GeoDiff-LF effectively mitigates color distortion in underwater scenes. Extensive experiments demonstrate that our framework outperforms existing methods across both visual fidelity and quantitative performance, advancing the state-of-the-art in enhancing underwater imaging. The code will be publicly available at this https URL.
[CV-95] InspecSafe-V1: A Multimodal Benchmark for Safety Assessment in Industrial Inspection Scenarios
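[Quick read]: This paper addresses the bottleneck of reliable perception and safety assessment for AI systems in complex, dynamic industrial and unmanned inspection settings, and in particular the limits of existing public datasets (reliance on simulated data, single-modality sensing, and no fine-grained object-level annotations), which hinder robust scene understanding and multimodal safety reasoning for industrial foundation models. The key is InspecSafe-V1, the first multimodal benchmark dataset for industrial inspection safety assessment. It is collected by real inspection robots across five representative industrial scenarios (tunnels, power facilities, sintering equipment, oil and gas petrochemical plants, and coal conveyor trestles) at 2,239 valid inspection sites, yielding 5,013 inspection instances with pixel-level segmentation annotations, semantic scene descriptions, and safety level labels. It also fuses seven synchronized sensing modalities (visible-light images, infrared video, audio, depth point clouds, radar point clouds, gas concentration, and temperature and humidity) to support multimodal anomaly recognition, cross-modal fusion, and comprehensive safety assessment.
Link: https://arxiv.org/abs/2601.21173
Authors: Zeyi Liu,Shuang Liu,Jihai Min,Zhaoheng Zhang,Jun Cen,Pengyu Han,Songqiao Hu,Zihan Meng,Xiao He,Donghua Zhou
Affiliations: Tsinghua University (清华大学); TetraBOT Intelligence Co., Ltd. (四拓智能科技有限公司); DAMO Academy, Alibaba Group (阿里巴巴达摩院)
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 7 figures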
Abstract:With the rapid development of industrial intelligence and unmanned inspection, reliable perception and safety assessment for AI systems in complex and dynamic industrial sites has become a key bottleneck for deploying predictive maintenance and autonomous inspection. Most public datasets remain limited by simulated data sources, single-modality sensing, or the absence of fine-grained object-level annotations, which prevents robust scene understanding and multimodal safety reasoning for industrial foundation models. To address these limitations, InspecSafe-V1 is released as the first multimodal benchmark dataset for industrial inspection safety assessment that is collected from routine operations of real inspection robots in real-world environments. InspecSafe-V1 covers five representative industrial scenarios, including tunnels, power facilities, sintering equipment, oil and gas petrochemical plants, and coal conveyor trestles. The dataset is constructed from 41 wheeled and rail-mounted inspection robots operating at 2,239 valid inspection sites, yielding 5,013 inspection instances. For each instance, pixel-level segmentation annotations are provided for key objects in visible-spectrum images. In addition, a semantic scene description and a corresponding safety level label are provided according to practical inspection tasks. Seven synchronized sensing modalities are further included, including infrared video, audio, depth point clouds, radar point clouds, gas measurements, temperature, and humidity, to support multimodal anomaly recognition, cross-modal fusion, and comprehensive safety assessment in industrial environments.
[CV-96] Bidirectional Cross-Perception for Open-Vocabulary Semantic Segmentation in Remote Sensing Imagery
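[Quick read]: This paper addresses semantic segmentation of high-resolution remote sensing imagery, namely how to achieve precise geometric localization and semantic prediction without training. It targets the blurred boundaries and distorted object structures that existing training-free open-vocabulary semantic segmentation (OVSS) methods produce because of their one-way injection and shallow post-processing strategies. The key is SDCI, a spatial-regularization-aware dual-branch collaborative inference framework with three parts: (1) a cross-model attention fusion (CAF) module that, during feature encoding, lets the two branches (CLIP and a vision foundation model, VFM) inject self-attention maps into each other; (2) a bidirectional cross-graph diffusion refinement (BCDR) module that improves the reliability of dual-branch segmentation scores through iterative random-walk diffusion; and (3) a convex-optimization-based superpixel collaborative prediction (CSCP) mechanism that incorporates low-level superpixel structures to further refine segmentation along complex boundaries.
Link: https://arxiv.org/abs/2601.21159
Authors: Jianzheng Wang,Huan Ni
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: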
Abstract:High-resolution remote sensing imagery is characterized by densely distributed land-cover objects and complex boundaries, which places higher demands on both geometric localization and semantic prediction. Existing training-free open-vocabulary semantic segmentation (OVSS) methods typically fuse CLIP and vision foundation models (VFMs) using “one-way injection” and “shallow post-processing” strategies, making it difficult to satisfy these requirements. To address this issue, we propose a spatial-regularization-aware dual-branch collaborative inference framework for training-free OVSS, termed SDCI. First, during feature encoding, SDCI introduces a cross-model attention fusion (CAF) module, which guides collaborative inference by injecting self-attention maps into each other. Second, we propose a bidirectional cross-graph diffusion refinement (BCDR) module that enhances the reliability of dual-branch segmentation scores through iterative random-walk diffusion. Finally, we incorporate low-level superpixel structures and develop a convex-optimization-based superpixel collaborative prediction (CSCP) mechanism to further refine object boundaries. Experiments on multiple remote sensing semantic segmentation benchmarks demonstrate that our method achieves better performance than existing approaches. Moreover, ablation studies further confirm that traditional object-based remote sensing image analysis methods leveraging superpixel structures remain effective within deep learning frameworks. Code: this https URL.
[CV-97] An AI Framework for Microanastomosis Motion Assessment
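[Quick read]: This paper addresses long-standing problems in assessing microanastomosis skills: strong subjectivity, inconsistent standards, and low efficiency, which stem from the inter-rater variability, cognitive bias, and time-consuming manual review of traditional expert-rated assessment. The key is a new AI-based automated assessment framework with four modules: instrument detection based on the YOLO architecture, instrument tracking based on the DeepSORT algorithm, instrument tip localization using shape descriptors, and a supervised classification module trained on expert-labeled data, enabling objective, reliable, and scalable quantitative assessment of instrument handling proficiency.
Link: https://arxiv.org/abs/2601.21120
Authors: Yan Meng,Eduardo J. Torres-Rodríguez,Marcelle Altshuler,Nishanth Gowda,Arhum Naeem,Recai Yilmaz,Omar Arnaout,Daniel A. Donoho
Affiliations: Children’s National Hospital (儿童国家医院); Georgetown University Medical Center (乔治城大学医学中心); Brigham and Women’s Hospital (布莱根妇女医院); George Washington University (乔治华盛顿大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE/EMBS NER 2025. © 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses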
Abstract:Proficiency in microanastomosis is a fundamental competency across multiple microsurgical disciplines. These procedures demand exceptional precision and refined technical skills, making effective, standardized assessment methods essential. Traditionally, the evaluation of microsurgical techniques has relied heavily on the subjective judgment of expert raters. Such evaluations are inherently constrained by limitations such as inter-rater variability, lack of standardized evaluation criteria, susceptibility to cognitive bias, and the time-intensive nature of manual review. These shortcomings underscore the urgent need for an objective, reliable, and automated system capable of assessing microsurgical performance with consistency and scalability. To bridge this gap, we propose a novel AI framework for the automated assessment of microanastomosis instrument handling skills. The system integrates four core components: (1) an instrument detection module based on the You Only Look Once (YOLO) architecture; (2) an instrument tracking module developed from Deep Simple Online and Realtime Tracking (DeepSORT); (3) an instrument tip localization module employing shape descriptors; and (4) a supervised classification module trained on expert-labeled data to evaluate instrument handling proficiency. Experimental results demonstrate the effectiveness of the framework, achieving an instrument detection precision of 97%, with a mean Average Precision (mAP) of 96%, measured by Intersection over Union (IoU) thresholds ranging from 50% to 95% (mAP50-95).
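For readers unfamiliar with the mAP50-95 figure, the sketch below shows the IoU computation and the threshold sweep it is built on. It assumes axis-aligned `(x1, y1, x2, y2)` boxes and the conventional 0.05 step between thresholds; it is not a full mAP implementation, which would also rank detections by confidence and integrate precision over recall.

```python
def iou(box_a, box_b):
    """Intersection over Union for axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Thresholds 0.50, 0.55, ..., 0.95 (assumed step of 0.05).
IOU_THRESHOLDS = [(50 + 5 * i) / 100 for i in range(10)]

def thresholds_passed(pred_box, gt_box):
    """Count how many of the ten thresholds a single detection satisfies."""
    overlap = iou(pred_box, gt_box)
    return sum(1 for t in IOU_THRESHOLDS if overlap >= t)

# A detection with IoU 0.72 counts as a hit at 0.50 through 0.70 only.
print(thresholds_passed((0, 0, 100, 100), (0, 0, 100, 72)))  # 5
```

Averaging over the stricter thresholds is why mAP50-95 is usually lower than precision at a single IoU of 0.5.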
[CV-98] Shape of Thought: Progressive Object Assembly via Visual Chain-of-Thought
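[Quick read]: This paper addresses the brittleness of multimodal text-to-image generation models under compositional structural constraints, particularly in generative numeracy, attribute binding, and part-level relations. The key is the Shape-of-Thought (SoT) framework, which performs progressive shape assembly through coherent 2D projections without external engines at inference time: the model generates interleaved textual plans and rendered intermediate states, implicitly learning shape-assembly logic without explicit geometric representations. The method is trained on SoT-26K, a large-scale dataset of structured assembly traces, and evaluated for structural integrity and trace faithfulness on the T2S-CompBench benchmark, reaching 88.4% on component numeracy and 84.8% on structural topology.
Link: https://arxiv.org/abs/2601.21081
Authors: Yu Huo,Siyu Zhang,Kun Zeng,Haoyue Liu,Owen Lee,Junlin Chen,Yuquan Lu,Yifu Guo,Yaodong Liang,Xiaoying Tang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The code is available at this https URL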
Abstract:Multimodal models for text-to-image generation have achieved strong visual fidelity, yet they remain brittle under compositional structural constraints, notably generative numeracy, attribute binding, and part-level relations. To address these challenges, we propose Shape-of-Thought (SoT), a visual CoT framework that enables progressive shape assembly via coherent 2D projections without external engines at inference time. SoT trains a unified multimodal autoregressive model to generate interleaved textual plans and rendered intermediate states, helping the model capture shape-assembly logic without producing explicit geometric representations. To support this paradigm, we introduce SoT-26K, a large-scale dataset of grounded assembly traces derived from part-based CAD hierarchies, and T2S-CompBench, a benchmark for evaluating structural integrity and trace faithfulness. Fine-tuning on SoT-26K achieves 88.4% on component numeracy and 84.8% on structural topology, outperforming text-only baselines by around 20%. SoT establishes a new paradigm for transparent, process-supervised compositional generation. The code is available at this https URL. The SoT-26K dataset will be released upon acceptance.
[CV-99] Towards Mitigating Modality Bias in Vision-Language Models for Temporal Action Localization
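[Quick read]: This paper addresses modality bias in Temporal Action Localization (TAL) caused by vision-language models (VLMs) over-relying on linguistic priors: existing methods tend to overweight language semantics when fusing visual and linguistic information, which weakens the discriminative power of visual features. The key is the ActionVLM framework, with two core components: (i) a debiasing reweighting module that estimates the incremental benefit of language over vision-only predictions and dynamically adjusts the weight of the language modality accordingly; and (ii) a residual aggregation strategy that treats language as a complementary refinement of the visual result rather than the dominant signal. The design mitigates modality bias, reduces overconfidence caused by linguistic priors, and strengthens temporal reasoning, improving mAP by up to 3.2% on THUMOS14.
Link: https://arxiv.org/abs/2601.21078
Authors: Jiaqi Li,Guangming Wang,Shuntian Zheng,Minzhe Ni,Xiaoman Lu,Guanghui Ye,Yu Guan
Affiliations: UVLab, Department of Computer Science, University of Warwick (华威大学计算机科学系); Department of Automation, University of Cambridge (剑桥大学自动化系); College of Computer Science and Electronic Engineering, Hunan University (湖南大学计算机与电子工程学院)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: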
Abstract:Temporal Action Localization (TAL) requires identifying both the boundaries and categories of actions in untrimmed videos. While vision-language models (VLMs) offer rich semantics to complement visual evidence, existing approaches tend to overemphasize linguistic priors at the expense of visual performance, leading to a pronounced modality bias. We propose ActionVLM, a vision-language aggregation framework that systematically mitigates modality bias in TAL. Our key insight is to preserve vision as the dominant signal while adaptively exploiting language only when beneficial. To this end, we introduce (i) a debiasing reweighting module that estimates the language advantage (the incremental benefit of language over vision-only predictions) and dynamically reweights the language modality accordingly, and (ii) a residual aggregation strategy that treats language as a complementary refinement rather than the primary driver. This combination alleviates modality bias, reduces overconfidence from linguistic priors, and strengthens temporal reasoning. Experiments on THUMOS14 show that our model outperforms state-of-the-art by up to 3.2% mAP.
[CV-100] BadDet+: Robust Backdoor Attacks for Object Detection
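[Quick read]: This paper addresses the fact that backdoor attacks are far less understood in object detection than in image classification, and in particular that existing detection-based methods rely on unrealistic assumptions and lack physical-world validation. The key is BadDet+, a framework that uses a log-barrier penalty to unify Region Misclassification Attacks (RMA) and Object Disappearance Attacks (ODA). It reliably induces attacks on triggered inputs while preserving clean performance, with position and scale invariance and enhanced physical robustness. Theoretical analysis further shows that the penalty acts within a trigger-specific feature subspace, which avoids degrading standard inference.
Link: https://arxiv.org/abs/2601.21066
Authors: Kealan Dunnett,Reza Arablouei,Dimity Miller,Volkan Dedeoglu,Raja Jurdak
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: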
Abstract:Backdoor attacks pose a severe threat to deep learning, yet their impact on object detection remains poorly understood compared to image classification. While attacks have been proposed, we identify critical weaknesses in existing detection-based methods, specifically their reliance on unrealistic assumptions and a lack of physical validation. To bridge this gap, we introduce BadDet+, a penalty-based framework that unifies Region Misclassification Attacks (RMA) and Object Disappearance Attacks (ODA). The core mechanism utilizes a log-barrier penalty to suppress true-class predictions for triggered inputs, resulting in (i) position and scale invariance, and (ii) enhanced physical robustness. On real-world benchmarks, BadDet+ achieves superior synthetic-to-physical transfer compared to existing RMA and ODA baselines while preserving clean performance. Theoretical analysis confirms the proposed penalty acts within a trigger-specific feature subspace, reliably inducing attacks without degrading standard inference. These results highlight significant vulnerabilities in object detection and the necessity for specialized defenses.
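The abstract does not give the exact penalty, so the following is a generic log-barrier term shown only to convey the shape of such a loss; the function name, the `eps` clamp, and the use of a single true-class probability are assumptions. The penalty is zero when the true class has no probability mass and grows without bound as that probability approaches one, which pushes triggered inputs away from the true class.

```python
import math

def log_barrier_penalty(true_class_prob, eps=1e-6):
    """Generic barrier -log(1 - p): near zero for small p, steep as p approaches 1."""
    p = min(max(true_class_prob, 0.0), 1.0 - eps)  # clamp to keep the log finite
    return -math.log(1.0 - p)

for prob in (0.0, 0.1, 0.5, 0.9, 0.99):
    print(prob, log_barrier_penalty(prob))
```

Compared with a linear penalty on the true-class score, the barrier concentrates gradient on detections that are still confidently correct.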
[CV-101] AI-based Prediction of Biochemical Recurrence from Biopsy and Prostatectomy Samples
[Quick read]: This paper addresses the imprecise prognosis of biochemical recurrence (BCR) after radical prostatectomy (RP), in particular the need for risk stratification of aggressive prostate cancer. The key is to train an AI model on diagnostic prostate biopsy slides using foundation models and attention-based multiple instance learning to predict patient-specific BCR risk. The approach generalizes across several external cohorts (5-year time-dependent AUCs of 0.64-0.70); integrating clinical variables enables statistically significant risk stratification, and the model incrementally improves on the guideline-based CAPRA-S score, suggesting that AI-driven histopathology analysis can be applied across specimen types to support preoperative and postoperative decisions.
Link: https://arxiv.org/abs/2601.21022
Authors: Andrea Camilloni(1),Chiara Micoli(1),Nita Mulliqi(2),Erik Everett Palm(1),Thorgerdur Palsdottir(1),Kelvin Szolnoky(1),Xiaoyi Ji(1),Sol Erika Boman(1 and 3),Andrea Discacciati(1),Henrik Grönberg(1),Lars Egevad(4),Tobias Nordström(1 and 5),Kimmo Kartasalo(2),Martin Eklund(1) ((1) Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden, (2) Department of Medical Epidemiology and Biostatistics, SciLifeLab, Karolinska Institutet, Stockholm, Sweden, (3) Department of Molecular Medicine and Surgery, Karolinska Institutet, Stockholm, Sweden, (4) Department of Oncology and Pathology, Karolinska Institutet, Stockholm, Sweden, (5) Department of Clinical Sciences at Danderyd Hospital, Karolinska Institutet, Stockholm, Sweden)
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 39 pages, 6 tables, 11 figures
Abstract:Biochemical recurrence (BCR) after radical prostatectomy (RP) is a surrogate marker for aggressive prostate cancer with adverse outcomes, yet current prognostic tools remain imprecise. We trained an AI-based model on diagnostic prostate biopsy slides from the STHLM3 cohort (n = 676) to predict patient-specific risk of BCR, using foundation models and attention-based multiple instance learning. Generalizability was assessed across three external RP cohorts: LEOPARD (n = 508), CHIMERA (n = 95), and TCGA-PRAD (n = 379). The image-based approach achieved 5-year time-dependent AUCs of 0.64, 0.70, and 0.70, respectively. Integrating clinical variables added complementary prognostic value and enabled statistically significant risk stratification. Compared with guideline-based CAPRA-S, AI incrementally improved postoperative prognostication. These findings suggest biopsy-trained histopathology AI can generalize across specimen types to support preoperative and postoperative decision making, but the added value of AI-based multimodal approaches over simpler predictive models should be critically scrutinized in further studies.
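Attention-based multiple instance learning pools many tile embeddings from one slide into a single slide-level vector. The sketch below uses the common tanh-attention form with toy weights; the matrices `V` and `w`, the two-dimensional features, and the function name are illustrative assumptions, not the authors' architecture.

```python
import math

def attention_mil_pool(tiles, V, w):
    """Softmax attention over tiles: score_k = w . tanh(V h_k); output = sum_k a_k h_k."""
    scores = []
    for h in tiles:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * zi for wi, zi in zip(w, hidden)))
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(tiles[0])
    slide = [sum(a * h[d] for a, h in zip(weights, tiles)) for d in range(dim)]
    return slide, weights

# Three toy tile embeddings from one slide, with hypothetical attention parameters.
tiles = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[0.5, -0.5], [1.0, 1.0]]
w = [1.0, 0.5]
slide_embedding, attention = attention_mil_pool(tiles, V, w)
```

The attention weights also indicate which tiles drove the prediction, which is one reason this pooling is popular in histopathology; a risk head would then be applied to `slide_embedding`.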
[CV-102] Low performing pixel correction in computed tomography with unrolled network and synthetic data training
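[Quick read]: This paper addresses the ring and streak artifacts that low performance pixels (LPP) in CT detectors cause in reconstructed images, which can make the images clinically unusable. Existing methods mostly use supervised deep learning to correct in either the image domain or the sinogram domain, but they depend on expensive real clinical datasets for training and usually restore only a single domain, ignoring the intrinsic correlation between the sinogram and image domains that arises from the forward projection of the CT geometry. The key innovation is an unrolled dual-domain method trained on synthetic data: synthetic data generated from natural images is used to exploit the intrinsic correlations of LPP between the sinogram and image domains, so that LPP artifacts can be corrected without any real clinical data. The method adapts to different scanner settings and substantially outperforms state-of-the-art approaches.
Link: https://arxiv.org/abs/2601.20995
Authors: Hongxu Yang,Levente Lippenszky,Edina Timko,Lehel Ferenczi,Gopal Avinash
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ISBI 2026 accepted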
Abstract:Low performance pixels (LPP) in Computed Tomography (CT) detectors lead to ring and streak artifacts in the reconstructed images, making them clinically unusable. In recent years, several solutions have been proposed to correct LPP artifacts, either in the image domain or in the sinogram domain using supervised deep learning methods. However, these methods require dedicated datasets for training, which are expensive to collect. Moreover, existing approaches focus solely on either image-space or sinogram-space correction, ignoring the intrinsic correlations from the forward operation of the CT geometry. In this work, we propose an unrolled dual-domain method based on synthetic data to correct LPP artifacts. Specifically, the intrinsic correlations of LPP between the sinogram and image domains are leveraged through synthetic data generated from natural images, enabling the trained model to correct artifacts without requiring any real-world clinical data. In experiments simulating 1-2% detector defects near the isocenter, the proposed method outperformed the state-of-the-art approaches by a large margin. The results indicate that our solution can correct LPP artifacts without the cost of data collection for model training, and it is adaptable to different scanner settings for software-based applications.
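[CV-103] Text controllable PET denoising
[Quick read]: This paper addresses the complicated noise in Positron Emission Tomography (PET) images caused by differences in count level, which can obscure critical diagnostic information. The key is a text-guided denoising method that combines semantic features extracted by a pretrained CLIP model with a U-Net-based denoising network, enhancing PET images across different count levels within a single model and significantly improving qualitative and quantitative image quality.
Link: https://arxiv.org/abs/2601.20990
Authors: Xuehua Ye,Hongxu Yang,Adam J. Schwarz
Affiliations: GE Healthcare; GE Healthcare; GE Healthcare
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: SPIE Medical Imaging 2026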
Abstract:Positron Emission Tomography (PET) imaging is a vital tool in medical diagnostics, offering detailed insights into molecular processes within the human body. However, PET images often suffer from complicated noise, which can obscure critical diagnostic information. The quality of the PET image is impacted by various factors including scanner hardware, image reconstruction, tracer properties, dose/count level, and acquisition time. In this study, we propose a novel text-guided denoising method capable of enhancing PET images across a wide range of count levels within a single model. The model combines features from a pretrained CLIP model with a U-Net based denoising model. Experimental results demonstrate that the proposed model leads to significant improvements in both qualitative and quantitative assessments. The flexibility of the model shows its potential for handling more complicated denoising demands or reducing the acquisition time.
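[CV-104] Noisy but Valid: Robust Statistical Evaluation of LLMs with Imperfect Judges ICLR2026
[Quick read]: This paper addresses a core challenge in safety certification of large language models (LLMs): how to reliably verify that a model's failure rate is below a preset safety threshold when the judge model (LLM-as-a-Judge) is noisy, biased, and imperfect. Conventional statistical tests rely on unreliable judge labels and therefore struggle to provide valid finite-sample Type-I error control. The key is a "Noisy but Valid" hypothesis testing framework: a small human-labelled calibration set is used to estimate the judge's True Positive Rate (TPR) and False Positive Rate (FPR), from which a variance-corrected critical threshold is derived and applied to a large judge-labelled dataset for the final test. The method theoretically guarantees finite-sample validity under calibration uncertainty and quantifies how judge quality affects test power, providing an interpretable, diagnostic statistical guarantee for LLM safety evaluation.
Link: https://arxiv.org/abs/2601.20913
Authors: Chen Feng,Minghe Shen,Ananth Balashankar,Carsten Gerner-Beuerle,Miguel R. D. Rodrigues
Affiliations: Queen's University Belfast; University College London; Google DeepMind
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICLR 2026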
Abstract:Reliable certification of Large Language Models (LLMs), that is, verifying that failure rates are below a safety threshold, is critical yet challenging. While "LLM-as-a-Judge" offers scalability, judge imperfections, noise, and bias can invalidate statistical guarantees. We introduce a "Noisy but Valid" hypothesis testing framework to address this. By leveraging a small human-labelled calibration set to estimate the judge's True Positive and False Positive Rates (TPR/FPR), we derive a variance-corrected critical threshold applied to a large judge-labelled dataset. Crucially, our framework theoretically guarantees finite-sample Type-I error control (validity) despite calibration uncertainty. This distinguishes our work from Prediction-Powered Inference (PPI), positioning our method as a diagnostic tool that explicitly models judge behavior rather than a black-box estimator. Our contributions include: (1) Theoretical Guarantees: We derive the exact conditions under which noisy testing yields higher statistical power than direct evaluation; (2) Empirical Validation: Experiments on Jigsaw Comment, Hate Speech and SafeRLHF confirm our theory; (3) The Oracle Gap: We reveal a significant performance gap between practical methods and the theoretical "Oracle" (perfectly known judge parameters), quantifying the cost of estimation. Specifically, we provide the first systematic treatment of the imperfect-judge setting, yielding interpretable diagnostics of judge reliability and clarifying how evaluation power depends on judge quality, dataset size, and certification levels. Together, these results sharpen understanding of statistical evaluation with LLM judges, and highlight trade-offs among competing inferential tools.
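The calibration idea in this abstract can be illustrated with the classical misclassification correction (often called the Rogan-Gladen estimator). This is a minimal sketch and not the paper's variance-corrected test: the function name and the example numbers are hypothetical, and the code only inverts the relation q = TPR * p + FPR * (1 - p) between the judge's flag rate q and the true failure rate p.

```python
def corrected_failure_rate(judge_flag_rate: float, tpr: float, fpr: float) -> float:
    """Invert q = tpr * p + fpr * (1 - p) to recover the true failure rate p.

    Illustrative only: no variance correction and no hypothesis test.
    """
    if tpr <= fpr:
        raise ValueError("judge must be better than chance (tpr > fpr)")
    p = (judge_flag_rate - fpr) / (tpr - fpr)
    # Sampling noise can push the estimate outside [0, 1]; clip it back.
    return min(1.0, max(0.0, p))


# Hypothetical numbers: the judge flags 23% of outputs, with TPR 0.9 and FPR 0.1.
estimate = corrected_failure_rate(0.23, tpr=0.9, fpr=0.1)
```

In practice TPR and FPR are themselves estimated from a small calibration set, which is why the abstract stresses that the final threshold must account for calibration uncertainty; the point estimate above ignores it.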
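[CV-105] Non-Markov Multi-Round Conversational Image Generation with History-Conditioned MLLMs
[Quick read]: This paper addresses the insufficient modeling of non-Markov interactions in multi-round conversational image generation, where long-range history is ignored: a user may refer back to earlier states, undo operations, or reference entities introduced several rounds earlier, while existing methods often rely only on the most recent image, leading to poor contextual consistency and weak instruction following. The key points are: (1) non-Markov multi-round data construction strategies, including rollback-style editing that forces the model to retrieve earlier visual states and name-based multi-round personalization that binds entities across rounds; (2) a history-conditioned training and inference framework with token-level caching to prevent multi-round identity drift; and (3) a reconstruction-based DiT detokenizer and a multi-stage fine-tuning curriculum, which significantly improve high-fidelity image reconstruction and editable personalization.
Link: https://arxiv.org/abs/2601.20911
Authors: Haochen Zhang,Animesh Sinha,Felix Juefei-Xu,Haoyu Ma,Kunpeng Li,Zhipeng Fan,Meng Dong,Xiaoliang Dai,Tingbo Hou,Peizhao Zhang,Zecheng He
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 19 pages, 19 figures, plan for TIP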
Abstract:Conversational image generation requires a model to follow user instructions across multiple rounds of interaction, grounded in interleaved text and images that accumulate as chat history. While recent multimodal large language models (MLLMs) can generate and edit images, most existing multi-turn benchmarks and training recipes are effectively Markov: the next output depends primarily on the most recent image, enabling shortcut solutions that ignore long-range history. In this work we formalize and target the more challenging non-Markov setting, where a user may refer back to earlier states, undo changes, or reference entities introduced several rounds ago. We present (i) non-Markov multi-round data construction strategies, including rollback-style editing that forces retrieval of earlier visual states and name-based multi-round personalization that binds names to appearances across rounds; (ii) a history-conditioned training and inference framework with token-level caching to prevent multi-round identity drift; and (iii) enabling improvements for high-fidelity image reconstruction and editable personalization, including a reconstruction-based DiT detokenizer and a multi-stage fine-tuning curriculum. We demonstrate that explicitly training for non-Markov interactions yields substantial improvements in multi-round consistency and instruction compliance, while maintaining strong single-round editing and personalization.
[CV-106] MA-LipNet: Multi-Dimensional Attention Networks for Robust Lipreading
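[Quick read]: This paper addresses the limited feature discriminability and poor generalization of existing lipreading methods, especially when handling subtle articulatory gestures. The key is a Multi-Attention Lipreading Network (MA-LipNet) that applies three dedicated attention modules in sequence: Channel Attention (CA) adaptively recalibrates channel-wise features to suppress redundant information, while Joint Spatial-Temporal Attention (JSTA) and Separate Spatial-Temporal Attention (SSTA) filter out irrelevant pixels and video frames at coarse and fine granularity, respectively. Together they purify visual features along the temporal, spatial, and channel dimensions and significantly improve lipreading accuracy.
Link: https://arxiv.org/abs/2601.20881
Authors: Matteo Rossi
Affiliations: Maharaja Agrasen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: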
Abstract:Lipreading, the technology of decoding spoken content from silent videos of lip movements, holds significant application value in fields such as public security. However, due to the subtle nature of articulatory gestures, existing lipreading methods often suffer from limited feature discriminability and poor generalization capabilities. To address these challenges, this paper delves into the purification of visual features from temporal, spatial, and channel dimensions. We propose a novel method named Multi-Attention Lipreading Network (MA-LipNet). The core of MA-LipNet lies in its sequential application of three dedicated attention modules. Firstly, a Channel Attention (CA) module is employed to adaptively recalibrate channel-wise features, thereby mitigating interference from less informative channels. Subsequently, two spatio-temporal attention modules with distinct granularities, Joint Spatial-Temporal Attention (JSTA) and Separate Spatial-Temporal Attention (SSTA), are leveraged to suppress the influence of irrelevant pixels and video frames. The JSTA module performs a coarse-grained filtering by computing a unified weight map across the spatio-temporal dimensions, while the SSTA module conducts a more fine-grained refinement by separately modeling temporal and spatial attentions. Extensive experiments conducted on the CMLR and GRID datasets demonstrate that MA-LipNet significantly reduces the Character Error Rate (CER) and Word Error Rate (WER), validating its effectiveness and superiority over several state-of-the-art methods. Our work highlights the importance of multi-dimensional feature refinement for robust visual speech recognition.
[CV-107] Learning Transient Convective Heat Transfer with Geometry Aware World Models
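[Quick read]: This paper addresses the problem that partial differential equation (PDE) simulations in engineering and physics are often too computationally expensive for real-time use. Generative AI can serve as a surrogate model to accelerate simulation, but standard video generation architectures lack precise control over the physical process and compatibility with physical data formats. The authors propose a geometry aware world model architecture adapted from LongVideoGAN, with two key innovations: (1) a twofold conditioning mechanism that combines global physical parameters with local geometric masks to strengthen physical constraints; and (2) an architectural adaptation that supports arbitrary channel dimensions, going beyond the standard RGB format. Validated on a 2D transient computational fluid dynamics (CFD) problem, the method accurately reproduces the complex temporal dynamics and spatial correlations of the training data and shows potential to generalize to unseen geometric configurations, although spatial precision on out-of-distribution samples remains limited.
Link: https://arxiv.org/abs/2601.22086
Authors: Onur T. Doganay,Alexander Klawonn,Martin Eigel,Hanno Gottschalk
Affiliations: TU Berlin; Siemens Energy AG; Weierstrass Institute for Applied Analysis and Stochastics
Categories: Fluid Dynamics (physics.flu-dyn); Computer Vision and Pattern Recognition (cs.CV)
Comments: 36 pages, 18 figures, 2 tables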
Abstract:Partial differential equation (PDE) simulations are fundamental to engineering and physics but are often computationally prohibitive for real-time applications. While generative AI offers a promising avenue for surrogate modeling, standard video generation architectures lack the specific control and data compatibility required for physical simulations. This paper introduces a geometry aware world model architecture, derived from a video generation architecture (LongVideoGAN), designed to learn transient physics. We introduce two key architecture elements: (1) a twofold conditioning mechanism incorporating global physical parameters and local geometric masks, and (2) an architectural adaptation to support arbitrary channel dimensions, moving beyond standard RGB constraints. We evaluate this approach on a 2D transient computational fluid dynamics (CFD) problem involving convective heat transfer from buoyancy-driven flow coupled to a heat flow in a solid structure. We demonstrate that the conditioned model successfully reproduces complex temporal dynamics and spatial correlations of the training data. Furthermore, we assess the model’s generalization capabilities on unseen geometric configurations, highlighting both its potential for controlled simulation synthesis and current limitations in spatial precision for out-of-distribution samples.
[CV-108] Blind Ultrasound Image Enhancement via Self-Supervised Physics-Guided Degradation Modeling
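[Quick read]: This paper addresses the difficulty of interpreting ultrasound (US) images caused by multiplicative speckle, acquisition blur from the point-spread function (PSF), and scanner- and operator-dependent artifacts. Existing supervised enhancement methods usually rely on clean targets or known degradation models, conditions that are hard to meet in practice. The key is a blind self-supervised enhancement framework that jointly deconvolves and denoises using a physics-guided degradation model: rotated/cropped patches are extracted from training frames and degraded inputs are synthesized (Gaussian PSF convolution plus spatial additive Gaussian noise or complex Fourier-domain perturbations), while non-local low-rank (NLLR) denoising provides clean-like targets and removes the dependence on ground truth. The network is a Swin Convolutional U-Net trained and validated on multiple US datasets; it generalizes across scanners and degradation types and clearly outperforms MSANN, Restormer, and DnCNN in peak signal-to-noise ratio (PSNR) and structural similarity (SSIM).
Link: https://arxiv.org/abs/2601.21856
Authors: Shujaat Khan,Syed Muhammad Atif,Jaeyoung Huh,Syed Saad Azhar
Affiliations: King Fahd University of Petroleum & Minerals (KFUPM); Sir Syed University of Engineering and Technology (SSUET); Siemens Healthineers
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: 11 pages, 13 figures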
Abstract:Ultrasound (US) interpretation is hampered by multiplicative speckle, acquisition blur from the point-spread function (PSF), and scanner- and operator-dependent artifacts. Supervised enhancement methods assume access to clean targets or known degradations; conditions rarely met in practice. We present a blind, self-supervised enhancement framework that jointly deconvolves and denoises B-mode images using a Swin Convolutional U-Net trained with a physics-guided degradation model. From each training frame, we extract rotated/cropped patches and synthesize inputs by (i) convolving with a Gaussian PSF surrogate and (ii) injecting noise via either spatial additive Gaussian noise or complex Fourier-domain perturbations that emulate phase/magnitude distortions. For US scans, clean-like targets are obtained via non-local low-rank (NLLR) denoising, removing the need for ground truth; for natural images, the originals serve as targets. Trained and validated on UDIAT B, JNU-IFM, and XPIE Set-P, and evaluated additionally on a 700-image PSFHS test set, the method achieves the highest PSNR/SSIM across Gaussian and speckle noise levels, with margins that widen under stronger corruption. Relative to MSANN, Restormer, and DnCNN, it typically preserves an extra ~1–4 dB PSNR and 0.05–0.15 SSIM in heavy Gaussian noise, and ~2–5 dB PSNR and 0.05–0.20 SSIM under severe speckle. Controlled PSF studies show reduced FWHM and higher peak gradients, evidence of resolution recovery without edge erosion. Used as a plug-and-play preprocessor, it consistently boosts Dice for fetal head and pubic symphysis segmentation. Overall, the approach offers a practical, assumption-light path to robust US enhancement that generalizes across datasets, scanners, and degradation types.
[CV-109] Denoising and Baseline Correction of Low-Scan FTIR Spectra: A Benchmark of Deep Learning Models Against Traditional Signal Processing
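[Quick read]: This paper addresses the limited clinical imaging speed of Fourier Transform Infrared (FTIR) imaging caused by low signal-to-noise ratio and baseline drift. Traditional practice relies on extensive signal averaging to reduce noise, which is slow; existing deep learning methods struggle to separate noise and baseline drift simultaneously without ground truth and tend to introduce spectral hallucinations or generalize poorly. The key is a physics-informed cascade U-Net that decouples denoising from baseline correction through a deterministic "Physics Bridge" layer and uses an embedded SNIP layer to enforce spectroscopic constraints instead of learning statistical approximations, giving more accurate and robust spectral reconstruction. On a dataset of human hypopharyngeal carcinoma cells (FaDu), the method reduces root-mean-square error (RMSE) by 51.3% relative to raw single-scan inputs, ahead of a single U-Net (40.2%) and the traditional Savitzky-Golay/SNIP workflow (33.7%), while eliminating spectral hallucinations and better preserving peak intensity.
Link: https://arxiv.org/abs/2601.20905
Authors: Azadeh Mokari,Shravan Raghunathan,Artem Shydliukh,Oleg Ryabchykov,Christoph Krafft,Thomas Bocklitz
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: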
Abstract:High-quality Fourier Transform Infrared (FTIR) imaging usually needs extensive signal averaging to reduce noise and drift which severely limits clinical speed. Deep learning can accelerate imaging by reconstructing spectra from rapid, single-scan inputs. However, separating noise and baseline drift simultaneously without ground truth is an ill-posed inverse problem. Standard black-box architectures often rely on statistical approximations that introduce spectral hallucinations or fail to generalize to unstable atmospheric conditions. To solve these issues we propose a physics-informed cascade Unet that separates denoising and baseline correction tasks using a new, deterministic Physics Bridge. This architecture forces the network to separate random noise from chemical signals using an embedded SNIP layer to enforce spectroscopic constraints instead of learning statistical approximations. We benchmarked this approach against a standard single Unet and a traditional Savitzky-Golay/SNIP workflow. We used a dataset of human hypopharyngeal carcinoma cells (FaDu). The cascade model outperformed all other methods, achieving a 51.3% reduction in RMSE compared to raw single-scan inputs, surpassing both the single Unet (40.2%) and the traditional workflow (33.7%). Peak-aware metrics show that the cascade architecture eliminates spectral hallucinations found in standard deep learning. It also preserves peak intensity with much higher fidelity than traditional smoothing. These results show that the cascade Unet is a robust solution for diagnostic-grade FTIR imaging. It enables imaging speeds 32 times faster than current methods.
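The SNIP step named in this abstract is a classical iterative peak-clipping baseline estimator. The sketch below shows only that basic clipping rule in plain Python, assuming no log-log-square-root transform and a hypothetical toy spectrum; it is not the paper's embedded SNIP layer or its Physics Bridge.

```python
def snip_baseline(spectrum, max_half_width):
    """Estimate a baseline by iterative peak clipping (basic SNIP, no LLS transform)."""
    baseline = list(spectrum)
    n = len(baseline)
    for p in range(1, max_half_width + 1):
        previous = list(baseline)  # clip against the previous pass, not in place
        for i in range(p, n - p):
            neighbour_mean = 0.5 * (previous[i - p] + previous[i + p])
            baseline[i] = min(previous[i], neighbour_mean)
    return baseline


# Hypothetical toy spectrum: flat baseline of 1.0 with one narrow peak.
spectrum = [1.0] * 21
spectrum[9], spectrum[10], spectrum[11] = 3.0, 5.0, 3.0
baseline = snip_baseline(spectrum, max_half_width=4)
corrected = [s - b for s, b in zip(spectrum, baseline)]
```

The half-width limit should exceed the half-width of the widest peak to be removed; a value that is too small leaves peak residue in the baseline, while a very large one can start clipping genuine broad features.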
Artificial Intelligence
[AI-0] PRISM: Distribution-free Adaptive Computation of Matrix Functions for Accelerating Neural Network Training
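[Quick read]: This paper addresses the inefficiency of computing the matrix functions (such as the matrix square root, inverse square root, and orthogonalization) that preconditioned gradient methods need for neural network training; conventional iterative algorithms often require an explicit eigendecomposition or spectral bounds known in advance, which limits performance on modern GPU accelerators. The key is the PRISM (Polynomial-fitting and Randomized Iterative Sketching for Matrix functions computation) framework, which combines adaptive polynomial approximation with randomized sketching: at each iteration it fits a polynomial surrogate to the current matrix spectrum via a sketched least-squares problem, so it adapts automatically to the evolving spectrum without explicit spectral bound estimates and accelerates Newton-Schulz-like iterations for matrix square roots and orthogonalization. Experiments show that integrating PRISM into the Shampoo and Muon optimizers speeds up training.
Link: https://arxiv.org/abs/2601.22137
Authors: Shenghao Yang,Zhichao Wang,Oleg Balabanov,N. Benjamin Erichson,Michael W. Mahoney
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Optimization and Control (math.OC)
Comments: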
Abstract:Matrix functions such as square root, inverse roots, and orthogonalization play a central role in preconditioned gradient methods for neural network training. This has motivated the development of iterative algorithms that avoid explicit eigendecompositions and rely primarily on matrix multiplications, making them well suited for modern GPU accelerators. We present PRISM (Polynomial-fitting and Randomized Iterative Sketching for Matrix functions computation), a general framework for accelerating iterative algorithms for computing matrix functions. PRISM combines adaptive polynomial approximation with randomized sketching: at each iteration, it fits a polynomial surrogate to the current spectrum via a sketched least-squares problem, adapting to the instance at hand with minimal overhead. We apply PRISM to accelerate Newton-Schulz-like iterations for matrix square roots and orthogonalization, which are core primitives in machine learning. Unlike prior methods, PRISM requires no explicit spectral bounds or singular value estimates; and it adapts automatically to the evolving spectrum. Empirically, PRISM accelerates training when integrated into Shampoo and Muon optimizers.
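For readers unfamiliar with the baseline that PRISM accelerates, the following is a minimal sketch of the classical Newton-Schulz orthogonalization with fixed coefficients, X <- 1.5 X - 0.5 X XᵀX, after scaling by the Frobenius norm. It is written in plain Python for small matrices; the helper names are hypothetical, and PRISM's adaptive polynomial fitting and sketching are not shown.

```python
def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def newton_schulz_orthogonalize(a, steps=30):
    """Approximate the orthogonal polar factor of a full-rank matrix."""
    # Scale so every singular value is at most 1, inside the convergence region.
    fro = sum(v * v for row in a for v in row) ** 0.5
    x = [[v / fro for v in row] for row in a]
    for _ in range(steps):
        x_xtx = matmul(x, matmul(transpose(x), x))
        x = [[1.5 * x[i][j] - 0.5 * x_xtx[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x


# Hypothetical example: a symmetric positive definite matrix, whose polar factor is the identity.
q = newton_schulz_orthogonalize([[2.0, 1.0], [1.0, 3.0]])
gram = matmul(transpose(q), q)
```

The iteration uses only matrix multiplications, which is why this family of methods suits GPU accelerators; its weakness is slow progress on very small singular values, the kind of instance-dependent behaviour that adaptive coefficients aim to improve.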
[AI-1] StepShield: When Not Whether to Intervene on Rogue Agents
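[Quick read]: This paper addresses the problem that existing AI agent safety benchmarks report only binary accuracy, which conflates early intervention with post-mortem analysis: a detector that catches a violation early (e.g., at step 8) is scored the same as one that identifies it only afterwards (e.g., at step 48), so detection timeliness cannot be measured. The key is StepShield, the first benchmark to evaluate when a violation is detected rather than whether. It contains 9,213 code agent trajectories with an 8.1% rogue rate grounded in real-world security incidents, and introduces three temporal metrics: Early Intervention Rate (EIR), Intervention Gap, and Tokens Saved. Empirically, an LLM-based judge reaches 59% EIR versus 26% for a static analyzer, a gap that is invisible under conventional accuracy metrics. Early detection also has direct economic value: the cascaded HybridGuard detector cuts monitoring costs by 75% and is projected to save more than $108 million over five years at enterprise scale. By focusing on detection timing, StepShield offers a new basis for building safer and more economical AI agents.
Link: https://arxiv.org/abs/2601.22136
Authors: Gloria Felicia(University of Virginia),Michael Eniolade(University of the Cumberlands),Jinfeng He(Cornell University),Zitha Sasindran(Indian Institute of Science Bangalore),Hemant Kumar(University of Arizona),Milan Hussain Angati(California State University Northridge),Sandeep Bandarupalli(University of Cincinnati)
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Software Engineering (cs.SE)
Comments: 16 pages, 2 figures, 14 tables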
Abstract:Existing agent safety benchmarks report binary accuracy, conflating early intervention with post-mortem analysis. A detector that flags a violation at step 8 enables intervention; one that reports it at step 48 provides only forensic value. This distinction is critical, yet current benchmarks cannot measure it. We introduce StepShield, the first benchmark to evaluate when violations are detected, not just whether. StepShield contains 9,213 code agent trajectories, including 1,278 meticulously annotated training pairs and a 7,935-trajectory test set with a realistic 8.1% rogue rate. Rogue behaviors are grounded in real-world security incidents across six categories. We propose three novel temporal metrics: Early Intervention Rate (EIR), Intervention Gap, and Tokens Saved. Surprisingly, our evaluation reveals that an LLM-based judge achieves 59% EIR while a static analyzer achieves only 26%, a 2.3x performance gap that is entirely invisible to standard accuracy metrics. We further show that early detection has direct economic benefits: our cascaded HybridGuard detector reduces monitoring costs by 75% and projects to $108M in cumulative savings over five years at enterprise scale. By shifting the focus of evaluation from whether to when, StepShield provides a new foundation for building safer and more economically viable AI agents. The code and data are released under an Apache 2.0 license.
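The abstract names the temporal metrics but does not define them, so the sketch below rests entirely on assumptions: a detection counts as "early" when it fires no later than `max_gap` steps after the first violation, and the intervention gap is the mean delay over detected trajectories. The function names, the `max_gap` parameter, and the toy trajectories are hypothetical and may differ from StepShield's actual definitions.

```python
def early_intervention_rate(trajectories, max_gap=0):
    """Fraction of rogue trajectories detected within max_gap steps of the violation.

    Each trajectory is (violation_step, detection_step), with detection_step None
    when the detector never fires. Assumed definition, for illustration only.
    """
    early = sum(
        1 for violation, detection in trajectories
        if detection is not None and detection - violation <= max_gap
    )
    return early / len(trajectories)


def mean_intervention_gap(trajectories):
    """Mean delay (in steps) between violation and detection, over detected cases."""
    gaps = [d - v for v, d in trajectories if d is not None]
    return sum(gaps) / len(gaps)


# Hypothetical rogue trajectories: (first violation step, detection step or None).
toy = [(8, 8), (8, 48), (5, None), (12, 13)]
eir = early_intervention_rate(toy)   # only the first trajectory is caught in time
gap = mean_intervention_gap(toy)     # average delay over the three detected cases
```

A binary accuracy score would treat the first, second, and fourth trajectories identically as "detected", which is the conflation the benchmark is designed to expose.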
[AI-2] World of Workflows: a Benchmark for Bringing World Models to Enterprise Systems
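[Quick read]: This paper addresses the reliability problems that frontier large language models (LLMs) face as autonomous agents in complex enterprise systems, in particular the cascading side effects of hidden workflows that the model cannot see and that lead to silent constraint violations. Existing benchmarks evaluate only surface-level task completion and ignore core enterprise challenges such as limited observability, large database state, and implicit process interactions. The key is the World of Workflows (WoW) environment and its companion benchmark WoW-bench: built on a realistic ServiceNow system with more than 4,000 business rules and 55 active workflows, they reveal that LLMs suffer from "dynamics blindness" and motivate a new paradigm of grounded world modeling, in which agents mentally simulate hidden state transitions to bridge the observability gap when high-fidelity feedback is unavailable, so that the dynamics of complex enterprise systems are learned explicitly and decisions become reliable.
Link: https://arxiv.org/abs/2601.22130
Authors: Lakshya Gupta(1),Litao Li(1),Yizhe Liu(1),Sriram Ganapathi Subramanian(1),Kaheer Suleman(1),Zichen Zhang(1),Haoye Lu(1),Sumit Pasupalak(1) ((1) Skyfall AI)
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: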
Abstract:Frontier large language models (LLMs) excel as autonomous agents in many domains, yet they remain untested in complex enterprise systems where hidden workflows create cascading effects across interconnected databases. Existing enterprise benchmarks evaluate surface-level agentic task completion similar to general consumer benchmarks, ignoring true challenges in enterprises, such as limited observability, large database state, and hidden workflows with cascading side effects. We introduce World of Workflows (WoW), a realistic ServiceNow-based environment incorporating 4,000+ business rules and 55 active workflows embedded in the system, alongside WoW-bench, a benchmark of 234 tasks evaluating constrained agentic task completion and enterprise dynamics modeling capabilities. We reveal two major takeaways: (1) Frontier LLMs suffer from dynamics blindness, consistently failing to predict the invisible, cascading side effects of their actions, which leads to silent constraint violations, and (2) reliability in opaque systems requires grounded world modeling, where agents must mentally simulate hidden state transitions to bridge the observability gap when high-fidelity feedback is unavailable. For reliable and useful enterprise agents, WoW motivates a new paradigm to explicitly learn system dynamics. We release our GitHub for setting up and evaluating WoW.
[AI-3] SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents
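[Quick read]: This paper addresses the high computational cost and limited generality of test-time scaling for large language model (LLM) agents on software engineering (SWE) tasks. The standard approach improves performance by repeatedly sampling trajectories from scratch, which is inefficient, while existing optimizations based on specialized value agents are prone to model miscalibration and do not transfer to modern tool-using agents that synthesize custom Bash scripts. The key is SWE-Replay, an efficient and general test-time scaling technique that does not depend on external LLM quality estimates: it reuses trajectories from prior trials and, at critical intermediate steps, dynamically decides whether to explore (start from scratch) or exploit (branch from archived experience), selecting branch points by the potential and reasoning significance of repository exploration rather than by noisy value estimates. This reduces computational cost (by up to 17.4%) while maintaining or improving performance (by up to 3.8%).
Link: https://arxiv.org/abs/2601.22129
Authors: Yifeng Ding,Lingming Zhang
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: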
Abstract:Test-time scaling has been widely adopted to enhance the capabilities of Large Language Model (LLM) agents in software engineering (SWE) tasks. However, the standard approach of repeatedly sampling trajectories from scratch is computationally expensive. While recent methods have attempted to mitigate costs using specialized value agents, they can suffer from model miscalibration and fail to generalize to modern agents that synthesize custom bash scripts as tools. In this paper, we introduce SWE-Replay, the first efficient and generalizable test-time scaling technique for modern agents without reliance on potentially noisy value estimates. SWE-Replay optimizes the scaling process by recycling trajectories from prior trials, dynamically choosing to either explore from scratch or exploit archived experience by branching at critical intermediate steps. This selection of intermediate steps is driven by the potential and reasoning significance of repository exploration, rather than external LLM-based quality estimates. Our evaluation shows that, on SWE-Bench Verified, SWE-Replay consistently outperforms naive scaling, reducing costs by up to 17.4% while maintaining or even improving performance by up to 3.8%. Further evaluation on SWE-Bench Pro and Multilingual validates the generalizability of SWE-Replay, establishing it as a robust foundation for efficient test-time scaling of software engineering agents.
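[AI-4] The Patient is not a Moving Document: A World Model Training Paradigm for Longitudinal EHR
[Quick read]: This paper addresses the problem that large language models (LLMs) trained autoregressively on electronic health records (EHR) overlook how patients evolve over time. Conventional methods treat the patient as a static document for text prediction and cannot capture disease trajectory dynamics shaped jointly by interventions and time. The core solution is SMB-Structure, a world model that combines a joint-embedding prediction architecture (JEPA) with supervised fine-tuning (SFT): SFT reconstructs future patient states in token space, while JEPA predicts future states in latent space from the initial patient representation alone, forcing the model to encode the full trajectory dynamics before the next state is observed. This design lets the model learn interpretable representations of disease dynamics that go beyond autoregressive baselines and achieve better performance on clinical tasks with high patient heterogeneity.
Link: https://arxiv.org/abs/2601.22128
Authors: Irsyad Adam,Zekai Chen,David Laprade,Shaun Porwal,David Laub,Erik Reinertsen,Arda Pekis,Kevin Brown
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Quantitative Methods (q-bio.QM)
Comments: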
Abstract:Large language models (LLMs) trained with next-word-prediction have achieved success as clinical foundation models. Representations from these language backbones yield strong linear probe performance across biomedical tasks, suggesting that patient semantics emerge from next-token prediction at scale. However, this paradigm treats patients as a document to be summarized rather than a dynamical system to be simulated; a patient’s trajectory emerges from their state evolving under interventions and time, requiring models that simulate dynamics rather than predict tokens. To address this, we introduce SMB-Structure, a world model for structured EHR that grounds a joint-embedding prediction architecture (JEPA) with next-token prediction (SFT). SFT grounds our model to reconstruct future patient states in token space, while JEPA predicts those futures in latent space from the initial patient representation alone, forcing trajectory dynamics to be encoded before the next state is observed. We validate across two large-scale cohorts: Memorial Sloan Kettering (23,319 oncology patients; 323,000+ patient-years) and INSPECT (19,402 pulmonary embolism patients). Using a linear probe evaluated at multiple points along the disease trajectory, we demonstrate that our training paradigm learns embeddings that capture disease dynamics not recoverable by autoregressive baselines, enabling SMB-Structure to achieve competitive performance on complex tasks characterized by high patient heterogeneity. Model weights are available at this https URL.
[AI-5] Defining Operational Conditions for Safety-Critical AI-Based Systems from Data
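[Quick read]: This paper addresses the certification difficulty that safety-critical AI systems face in real deployments when the description of their environment is incomplete, and in particular how to extract and define the Operational Design Domain (ODD) accurately from existing data. Conventional methods rely on expert knowledge to set the ODD early in development, which adapts poorly to complex real-world scenarios. The paper proposes a multi-dimensional, kernel-based, data-driven method that defines the ODD a posteriori. Its core contribution is a kernel-based mathematical representation of the ODD, validated through Monte Carlo simulation and a real-world aviation use case, showing that the data-driven ODD can equal the hidden true ODD and thus providing a certifiable ODD definition framework for data-driven, safety-critical AI systems.
Link: https://arxiv.org/abs/2601.22118
Authors: Johann Christensen,Elena Hoemann,Frank Köster,Sven Hallerbach
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: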
Abstract:Artificial Intelligence (AI) has been on the rise in many domains, including numerous safety-critical applications. However, for complex systems found in the real world, or when data already exist, defining the underlying environmental conditions is extremely challenging. This often results in an incomplete description of the environment in which the AI-based system must operate. Nevertheless, this description, called the Operational Design Domain (ODD), is required in many domains for the certification of AI-based systems. Traditionally, the ODD is created in the early stages of the development process, drawing on sophisticated expert knowledge and related standards. This paper presents a novel Safety-by-Design method to a posteriori define the ODD from previously collected data using a multi-dimensional kernel-based representation. This approach is validated through both Monte Carlo methods and a real-world aviation use case for a future safety-critical collision-avoidance system. Moreover, by defining under what conditions two ODDs are equal, the paper shows that the data-driven ODD can equal the original, underlying hidden ODD of the data. Utilizing the novel, Safe-by-Design kernel-based ODD enables future certification of data-driven, safety-critical AI-based systems.
[AI-6] Value-Based Pre-Training with Downstream Feedback
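[Quick read]: This paper addresses the problem that, during expensive self-supervised pretraining of foundation models, a fixed proxy objective (such as next-token prediction) can misallocate compute and fail to improve downstream capabilities. The core solution is V-Pretraining, a value-based, modality-agnostic method for continued pretraining: a lightweight task designer dynamically adjusts the pretraining task so that each gradient update maximizes its value for the downstream task (that is, the pretraining gradient is aligned with the downstream task gradient), making more efficient use of limited labelled feedback to steer pretraining. Downstream labels are never used to update the model directly; they serve only as a signal for reshaping the task. Under the same compute budget, the method clearly improves reasoning (e.g., on GSM8K) and vision performance (e.g., on ADE20K and NYUv2).
Link: https://arxiv.org/abs/2601.22108
Authors: Shuqi Ke,Giulia Fanti
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: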
Abstract:Can a small amount of verified goal information steer the expensive self-supervised pretraining of foundation models? Standard pretraining optimizes a fixed proxy objective (e.g., next-token prediction), which can misallocate compute away from downstream capabilities of interest. We introduce V-Pretraining: a value-based, modality-agnostic method for controlled continued pretraining in which a lightweight task designer reshapes the pretraining task to maximize the value of each gradient step. For example, consider self-supervised learning (SSL) with sample augmentation. The V-Pretraining task designer selects pretraining tasks (e.g., augmentations) for which the pretraining loss gradient is aligned with a gradient computed over a downstream task (e.g., image segmentation). This helps steer pretraining towards relevant downstream capabilities. Notably, the pretrained model is never updated on downstream task labels; they are used only to shape the pretraining task. Under matched learner update budgets, V-Pretraining of 0.5B–7B language models improves reasoning (GSM8K test Pass@1) by up to 18% relative over standard next-token prediction using only 12% of GSM8K training examples as feedback. In vision SSL, we improve the state-of-the-art results on ADE20K by up to 1.07 mIoU and reduce NYUv2 RMSE while improving ImageNet linear accuracy, and we provide pilot evidence of improved token efficiency in continued pretraining.
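The gradient-alignment criterion described in this abstract can be illustrated with a greedy selection rule: among candidate pretraining tasks, pick the one whose loss gradient has the highest cosine similarity with a downstream-task gradient. This is a minimal sketch under that assumption; the function names, the augmentation names, and the two-dimensional gradient vectors are hypothetical, and the paper's task designer may work quite differently.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two gradient vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pick_aligned_task(candidate_grads, downstream_grad):
    """Return the candidate task whose gradient best aligns with the downstream gradient."""
    return max(
        candidate_grads,
        key=lambda name: cosine_similarity(candidate_grads[name], downstream_grad),
    )


# Hypothetical flattened gradients for three augmentations and one downstream task.
candidate_grads = {
    "crop": [1.0, 0.0],
    "color_jitter": [0.6, 0.8],
    "blur": [-1.0, 0.2],
}
downstream_grad = [0.5, 1.0]
chosen = pick_aligned_task(candidate_grads, downstream_grad)
```

Note that the downstream gradient only ranks the candidates here; the model update itself would still come from the chosen pretraining loss, consistent with the abstract's statement that the model is never updated on downstream labels.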
[AI-7] Investigating Associational Biases in Inter-Model Communication of Large Generative Models
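[Quick read]: This paper studies how associational bias in generative AI evolves as information passes between models, in particular how such bias is amplified in an inter-model communication pipeline that alternates between image generation and image description, producing systematic skew toward particular demographic groups. The core concern is that when one model's output becomes another model's input, gender or age stereotypes implicit in the data can be amplified and entrenched over repeated exchanges, affecting the fairness of human-centred tasks such as human activity recognition and affect prediction. The key to the approach is an evaluation framework that combines quantitative analysis with explainability: the RAF-DB and PHASE datasets are used to measure drift in age and gender distributions, and explainability tools identify whether models rely on spurious visual cues (e.g., background or hair) rather than concept-relevant regions (e.g., face or body). The paper then proposes mitigation strategies spanning the data, training, and deployment stages, and stresses that careful safeguards are needed when deploying interconnected models in human-centred AI systems to prevent bias propagation.
Link: https://arxiv.org/abs/2601.22093
Authors: Fethiye Irmak Dogan,Yuval Weiss,Kajal Patel,Jiaee Cheong,Hatice Gunes
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: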
Abstract:Social bias in generative AI can manifest not only as performance disparities but also as associational bias, whereby models learn and reproduce stereotypical associations between concepts and demographic groups, even in the absence of explicit demographic information (e.g., associating doctors with men). These associations can persist, propagate, and potentially amplify across repeated exchanges in inter-model communication pipelines, where one generative model’s output becomes another’s input. This is especially salient for human-centred perception tasks, such as human activity recognition and affect prediction, where inferences about behaviour and internal states can lead to errors or stereotypical associations that propagate into unequal treatment. In this work, focusing on human activity and affective expression, we study how such associations evolve within an inter-model communication pipeline that alternates between image generation and image description. Using the RAF-DB and PHASE datasets, we quantify demographic distribution drift induced by model-to-model information exchange and assess whether these drifts are systematic using an explainability pipeline. Our results reveal demographic drifts toward younger representations for both actions and emotions, as well as toward more female-presenting representations, primarily for emotions. We further find evidence that some predictions are supported by spurious visual regions (e.g., background or hair) rather than concept-relevant cues (e.g., body or face). We also examine whether these demographic drifts translate into measurable differences in downstream behaviour, i.e., while predicting activity and emotion labels. Finally, we outline mitigation strategies spanning data-centric, training and deployment interventions, and emphasise the need for careful safeguards when deploying interconnected models in human-centred AI systems.
[AI-8] Latent Adversarial Regularization for Offline Preference Optimization
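[Quick read]: This paper addresses a challenge faced by conventional token-level preference optimization when language models learn from human feedback: similarity in token space does not imply semantic or behavioral similarity, so policy updates are constrained inaccurately. The key to the solution is GANPO, a preference optimization method with latent-space regularization that penalizes divergence between the internal representations of the policy model and a reference model. Because latent representations have no explicit probability density, the method borrows the idea of generative adversarial networks (GANs) and minimizes latent-space divergence adversarially, integrating it as a regularizer into existing offline preference optimization objectives. This improves robustness to distributional shift and noise while maintaining downstream performance.
Link: https://arxiv.org/abs/2601.22083
Authors: Enyi Jiang,Yibo Jacky Zhang,Yinglun Xu,Andreas Haupt,Nancy Amato,Sanmi Koyejo
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: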
Abstract:Learning from human feedback typically relies on preference optimization that constrains policy updates through token-level regularization. However, preference optimization for language models is particularly challenging because token-space similarity does not imply semantic or behavioral similarity. To address this challenge, we leverage latent-space regularization for language model preference optimization. We introduce GANPO, which achieves latent-space regularization by penalizing divergence between the internal representations of a policy model and a reference model. Given that latent representations are not associated with explicit probability densities, we adopt an adversarial approach inspired by GANs to minimize latent-space divergence. We integrate GANPO as a regularizer into existing offline preference optimization objectives. Experiments across multiple model architectures and tasks show consistent improvements from latent-space regularization. Further, by comparing GANPO-induced inferential biases with those from token-level regularization, we find that GANPO provides more robust structural feedback under distributional shift and noise while maintaining comparable downstream performance with minor computational overhead.
[AI-9] SIA: Symbolic Interpretability for Anticipatory Deep Reinforcement Learning in Network Control
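[Quick read]: This paper addresses the "temporal myopia" of deep reinforcement learning (DRL) in next-generation mobile networks: conventional DRL agents decide only on past and current measurements and cannot exploit short-term forecasts of exogenous key performance indicators (KPIs) such as bandwidth, which limits anticipatory control in dynamic environments. In addition, existing forecast-augmented DRL agents lack interpretability and operate as black boxes, so network operators cannot tell whether predictions actually drive decisions or merely add complexity. The key to the solution is SIA (Symbolic Interpreter for Anticipatory DRL), which fuses Symbolic AI abstractions with per-KPI Knowledge Graphs to generate interpretable decision logic in real time and introduces a new Influence Score that quantifies the effect of forecasts on action selection. SIA runs at sub-millisecond speed (over 200x faster than existing explainable AI methods) and is validated on three representative networking scenarios, where it uncovers hidden issues such as temporal misalignment in forecast integration and reward-design biases. These findings support targeted fixes and online policy refinement, yielding a 9% higher average bitrate in video streaming and a 25% higher RAN-slicing reward, and making anticipatory DRL transparent and tunable.
Link: https://arxiv.org/abs/2601.22044
Authors: MohammadErfan Jabbari,Abhishek Duttagupta,Claudio Fiandrino,Leonardo Bonati,Salvatore D’Oro,Michele Polese,Marco Fiore,Tommaso Melodia
Affiliation: unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)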
Comments: 10 pages, 12 figures, accepted at IEEE INFOCOM 2026
Abstract:Deep reinforcement learning (DRL) promises adaptive control for future mobile networks but conventional agents remain reactive: they act on past and current measurements and cannot leverage short-term forecasts of exogenous KPIs such as bandwidth. Augmenting agents with predictions can overcome this temporal myopia, yet uptake in networking is scarce because forecast-aware agents act as closed-boxes; operators cannot tell whether predictions guide decisions or justify the added complexity. We propose SIA, the first interpreter that exposes in real time how forecast-augmented DRL agents operate. SIA fuses Symbolic AI abstractions with per-KPI Knowledge Graphs to produce explanations, and includes a new Influence Score metric. SIA achieves sub-millisecond speed, over 200x faster than existing XAI methods. We evaluate SIA on three diverse networking use cases, uncovering hidden issues, including temporal misalignment in forecast integration and reward-design biases that trigger counter-productive policies. These insights enable targeted fixes: a redesigned agent achieves a 9% higher average bitrate in video streaming, and SIA’s online Action-Refinement module improves RAN-slicing reward by 25% without retraining. By making anticipatory DRL transparent and tunable, SIA lowers the barrier to proactive control in next-generation mobile networks.
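[AI-10] Optimizing Agentic Workflows using Meta-tools
[Quick read]: This paper addresses the high operational cost, end-to-end latency, and hallucination-induced failures caused by the many iterative reasoning steps and tool invocations in generative AI agent workflows. The key to the solution is the Agent Workflow Optimization (AWO) framework, which analyzes existing workflow traces to identify recurring sequences of tool calls and converts them into deterministic meta-tools, that is, composite tools that bundle multiple agent actions. This conversion skips unnecessary intermediate large language model (LLM) reasoning steps, which lowers operational cost, shortens execution paths, and improves overall efficiency and robustness.
Link: https://arxiv.org/abs/2601.22037
Authors: Sami Abuzakuk,Anne-Marie Kermarrec,Rishi Sharma,Rasmus Moorits Veski,Martijn de Vos
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: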
Abstract:Agentic AI enables LLMs to dynamically reason, plan, and interact with tools to solve complex tasks. However, agentic workflows often require many iterative reasoning steps and tool invocations, leading to significant operational expense, end-to-end latency, and failures due to hallucinations. This work introduces Agent Workflow Optimization (AWO), a framework that identifies and optimizes redundant tool execution patterns to improve the efficiency and robustness of agentic workflows. AWO analyzes existing workflow traces to discover recurring sequences of tool calls and transforms them into meta-tools, which are deterministic, composite tools that bundle multiple agent actions into a single invocation. Meta-tools bypass unnecessary intermediate LLM reasoning steps and reduce operational cost while also shortening execution paths, leading to fewer failures. Experiments on two agentic AI benchmarks show that AWO reduces the number of LLM calls by up to 11.9% while also increasing the task success rate by up to 4.2 percentage points.
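As a rough illustration of the idea in this abstract, the sketch below counts contiguous tool-call sequences across traces and wraps a recurring one into a single deterministic callable. It is not AWO's actual algorithm: the n-gram counting, the `recurring_sequences` and `make_meta_tool` helpers, the tool names, and the assumption that each tool consumes the previous tool's output are all hypothetical.

```python
from collections import Counter

def recurring_sequences(traces, length=2, min_count=2):
    """Count contiguous tool-call sequences of a given length across traces
    and keep those seen at least min_count times."""
    counts = Counter()
    for trace in traces:
        for i in range(len(trace) - length + 1):
            counts[tuple(trace[i:i + length])] += 1
    return {seq: n for seq, n in counts.items() if n >= min_count}

def make_meta_tool(tools, sequence):
    """Bundle a sequence of tools into one deterministic callable.
    Assumes each tool takes the previous tool's output as its only input."""
    def meta_tool(value):
        for name in sequence:
            value = tools[name](value)
        return value
    return meta_tool

# Hypothetical workflow traces: each is the ordered list of tools an agent called.
traces = [
    ["search", "open_page", "extract", "answer"],
    ["search", "open_page", "extract", "summarize"],
    ["login", "search", "open_page"],
]
candidates = recurring_sequences(traces, length=2, min_count=2)
print(candidates)

# Toy tools to show the bundling step; real tools would call external services.
toy_tools = {"double": lambda x: x * 2, "inc": lambda x: x + 1}
double_then_inc = make_meta_tool(toy_tools, ("double", "inc"))
print(double_then_inc(3))
```

Calling the bundled tool once replaces two separate agent steps, which is where the saving in intermediate LLM calls would come from.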
[AI-11] CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty
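[Quick read]: This paper addresses the lack of reliability of current Large Language Model (LLM) agents in real user-facing scenarios, especially their limited ability to manage uncertainty when requests are incomplete or ambiguous. Existing benchmarks focus on task completion in idealized settings and neglect the consistency, uncertainty handling, and capability-boundary awareness that real applications require. The key to the solution is CAR-bench, a benchmark for multi-turn, tool-using LLM agents in an in-car voice assistant setting. Its core elements are a simulated user, domain policies, and 58 interconnected tools (spanning navigation, productivity, charging, and vehicle control), plus two new task types: Hallucination tasks, which test an agent's limit-awareness when information is missing, and Disambiguation tasks, which assess its ability to resolve ambiguity through dialogue or internal information gathering. Experiments show that even frontier reasoning LLMs achieve a consistent pass rate below 50% on Disambiguation tasks, often failing through premature actions, and that they violate policies or fabricate information in Hallucination tasks, underscoring the need for more reliable and self-aware LLM agents.
Link: https://arxiv.org/abs/2601.22027
Authors: Johannes Kirmayr,Lukas Stappen,Elisabeth André
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: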
Abstract:Existing benchmarks for Large Language Model (LLM) agents focus on task completion under idealistic settings but overlook reliability in real-world, user-facing applications. In domains such as in-car voice assistants, users often issue incomplete or ambiguous requests, creating intrinsic uncertainty that agents must manage through dialogue, tool use, and policy adherence. We introduce CAR-bench, a benchmark for evaluating consistency, uncertainty handling, and capability awareness in multi-turn, tool-using LLM agents in an in-car assistant domain. The environment features an LLM-simulated user, domain policies, and 58 interconnected tools spanning navigation, productivity, charging, and vehicle control. Beyond standard task completion, CAR-bench introduces Hallucination tasks that test agents’ limit-awareness under missing tools or information, and Disambiguation tasks that require resolving uncertainty through clarification or internal information gathering. Baseline results reveal large gaps between occasional and consistent success on all task types. Even frontier reasoning LLMs achieve less than a 50% consistent pass rate on Disambiguation tasks due to premature actions, and frequently violate policies or fabricate information to satisfy user requests in Hallucination tasks, underscoring the need for more reliable and self-aware LLM agents in real-world settings.
[AI-12] SymbXRL: Symbolic Explainable Deep Reinforcement Learning for Mobile Networks
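[Quick read]: This paper addresses the explainability problem of deep reinforcement learning (DRL) in 6G mobile networks: DRL agents are black-box models that cannot give human-understandable reasons for their decisions, which limits deployment in production. The key to the solution is SymbXRL, a novel explainable reinforcement learning (XRL) technique that uses symbolic AI to generate explanations built from intuitive symbols and logical rules, turning a DRL agent's decision process into a structured knowledge representation that supports reasoning. This improves the semantics of the explanations and enables explicit intent-based control; for example, programmatic action steering improves the median cumulative reward by 12% over a pure DRL solution.
Link: https://arxiv.org/abs/2601.22024
Authors: Abhishek Duttagupta,MohammadErfan Jabbari,Claudio Fiandrino,Marco Fiore,Joerg Widmer
Affiliation: unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)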
Comments: 10 pages, 9 figures, published in IEEE INFOCOM 2025
Abstract:The operation of future 6th-generation (6G) mobile networks will increasingly rely on the ability of deep reinforcement learning (DRL) to optimize network decisions in real-time. DRL yields demonstrated efficacy in various resource allocation problems, such as joint decisions on user scheduling and antenna allocation or simultaneous control of computing resources and modulation. However, trained DRL agents are closed-boxes and inherently difficult to explain, which hinders their adoption in production settings. In this paper, we make a step towards removing this critical barrier by presenting SymbXRL, a novel technique for explainable reinforcement learning (XRL) that synthesizes human-interpretable explanations for DRL agents. SymbXRL leverages symbolic AI to produce explanations where key concepts and their relationships are described via intuitive symbols and rules; coupling such a representation with logical reasoning exposes the decision process of DRL agents and offers more comprehensible descriptions of their behaviors compared to existing approaches. We validate SymbXRL in practical network management use cases supported by DRL, proving that it not only improves the semantics of the explanations but also paves the way for explicit agent control: for instance, it enables intent-based programmatic action steering that improves by 12% the median cumulative reward over a pure DRL solution.
[AI-13] Vidmento: Creating Video Stories Through Context-Aware Expansion With Generative Video
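[Quick read]: This paper addresses the creative constraints and narrative gaps that arise in video storytelling when available material is limited, which restricts creators' freedom of expression. The key to the solution is Vidmento, a tool that uses context-aware expansion to blend generative video with captured footage. It systematically expands the initial material while staying consistent in style and narrative logic, and it offers fine-grained controls for refinement, so creators can efficiently build hybrid video stories that follow their own intent.
Link: https://arxiv.org/abs/2601.22013
Authors: Catherine Yeh,Anh Truong,Mira Dontcheva,Bryan Wang
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multimedia (cs.MM)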
Comments: 25 pages, 18 figures
Abstract:Video storytelling is often constrained by available material, limiting creative expression and leaving undesired narrative gaps. Generative video offers a new way to address these limitations by augmenting captured media with tailored visuals. To explore this potential, we interviewed eight video creators to identify opportunities and challenges in integrating generative video into their workflows. Building on these insights and established filmmaking principles, we developed Vidmento, a tool for authoring hybrid video stories that combine captured and generated media through context-aware expansion. Vidmento surfaces opportunities for story development, generates clips that blend stylistically and narratively with surrounding media, and provides controls for refinement. In a study with 12 creators, Vidmento supported narrative development and exploration by systematically expanding initial materials with generative media, enabling expressive video storytelling aligned with creative intent. We highlight how creators bridge story gaps with generative content and where they find this blending capability most valuable.
[AI-14] Heterogeneous Computing: The Key to Powering the Future of AI Agent Inference
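[Quick read]: This paper addresses the performance limits that memory capacity, memory bandwidth, and high-speed interconnect bottlenecks impose on large-scale generative AI inference in the datacenter. Its central finding is that classic roofline analysis does not cover the inference workload regimes jointly determined by Operational Intensity (OI) and Capacity Footprint (CF), especially when long-context KV caches make the decode phase highly memory bound. The key to the solution is system-level heterogeneity and hardware-software co-design: disaggregated deployment of specialized prefill and decode accelerators, broader scale-up networking, and decoupling of compute and memory through optical I/O. The paper also points to multiple inference accelerators coexisting in one system and pooling of high-bandwidth, large-capacity memory as directions for adapting to changing OI/CF demands, so that the efficiency and capability of large-scale agentic inference can keep improving.
Link: https://arxiv.org/abs/2601.22001
Authors: Yiren Zhao,Junyi Liu
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: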
Abstract:AI agent inference is driving an inference-heavy datacenter future and exposes bottlenecks beyond compute, especially memory capacity, memory bandwidth, and high-speed interconnect. We introduce two metrics, Operational Intensity (OI) and Capacity Footprint (CF), that jointly explain regimes the classic roofline analysis misses, including the memory capacity wall. Across agentic workflows (chat, coding, web use, computer use) and base model choices (GQA/MLA, MoE, quantization), OI/CF can shift dramatically, with long-context KV cache making decode highly memory bound. These observations motivate disaggregated serving and system-level heterogeneity: specialized prefill and decode accelerators, broader scale-up networking, and decoupled compute-memory enabled by optical I/O. We further hypothesize agent-hardware co-design, multiple inference accelerators within one system, and high-bandwidth, large-capacity memory disaggregation as foundations for adaptation to evolving OI/CF. Together, these directions chart a path to sustain efficiency and capability for large-scale agentic AI inference.
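For context on the "memory bound" claim, the classic roofline model that the abstract refers to can be written in a few lines: attainable throughput is the smaller of peak compute and memory bandwidth times operational intensity. The sketch below covers only that classic model, not the paper's Capacity Footprint metric, whose definition the abstract does not give. The hardware numbers and the prefill/decode intensities are made-up illustrations.

```python
def attainable_flops(peak_flops, mem_bandwidth, operational_intensity):
    """Classic roofline: min(peak compute, bandwidth * FLOPs-per-byte)."""
    return min(peak_flops, mem_bandwidth * operational_intensity)

def is_memory_bound(peak_flops, mem_bandwidth, operational_intensity):
    """A kernel is memory bound when its intensity is below the ridge point."""
    ridge_point = peak_flops / mem_bandwidth  # FLOPs per byte
    return operational_intensity < ridge_point

# Hypothetical accelerator: 1000 TFLOP/s peak, 2 TB/s memory bandwidth.
PEAK = 1000e12
BANDWIDTH = 2e12

# Hypothetical intensities (FLOPs per byte moved) for the two serving phases.
prefill_oi = 800.0
decode_oi = 2.0

print(attainable_flops(PEAK, BANDWIDTH, prefill_oi), is_memory_bound(PEAK, BANDWIDTH, prefill_oi))
print(attainable_flops(PEAK, BANDWIDTH, decode_oi), is_memory_bound(PEAK, BANDWIDTH, decode_oi))
```

With these illustrative numbers the low-intensity phase reaches only a small fraction of peak compute, which is the kind of gap that motivates separate prefill and decode hardware.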
[AI-15] Liquid Interfaces: A Dynamic Ontology for the Interoperability of Autonomous Systems
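[Quick read]: This paper addresses the difficulty current software architectures have in coordinating autonomous agents whose reasoning is adaptive, probabilistic, and context-dependent, along with the limitations of system integration dominated by static interfaces and deterministic contracts. The key to the solution is Liquid Interfaces, a new coordination paradigm in which interfaces are no longer persistent technical artifacts but ephemeral relational events generated dynamically through intention articulation and semantic negotiation. The paper formalizes this model and designs the Liquid Interface Protocol (LIP), which governs intention-driven interaction, negotiated execution, and ephemerality constraints under semantic uncertainty, providing a theoretical foundation and practical feasibility for adaptive coordination in agent-based systems.
Link: https://arxiv.org/abs/2601.21993
Authors: Dhiogo de Sá,Carlos Schmiedel,Carlos Pereira Lopes
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)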
Comments: 28 pages, 2 figures
Abstract:Contemporary software architectures struggle to support autonomous agents whose reasoning is adaptive, probabilistic, and context-dependent, while system integration remains dominated by static interfaces and deterministic contracts. This paper introduces Liquid Interfaces, a coordination paradigm in which interfaces are not persistent technical artifacts, but ephemeral relational events that emerge through intention articulation and semantic negotiation at runtime. We formalize this model and present the Liquid Interface Protocol (LIP), which governs intention-driven interaction, negotiated execution, and enforced ephemerality under semantic uncertainty. We further discuss the governance implications of this approach and describe a reference architecture that demonstrates practical feasibility. Liquid Interfaces provide a principled foundation for adaptive coordination in agent-based systems.
[AI-16] Geometry of Drifting MDPs with Path-Integral Stability Certificates
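[Quick read]: This paper addresses nonstationarity in real-world reinforcement learning (RL): environment rewards and dynamics drift, accelerate, oscillate, and even switch abruptly over time, which increases tracking error and causes policy chattering. Existing theory mostly uses coarse-grained models of how much the environment changes and overlooks local change mechanisms, even though local features (such as acceleration and small gaps between near-optimal actions) are what drive performance. The core of the solution is a geometric view: a nonstationary discounted Markov Decision Process (MDP) is modeled as a differentiable homotopy path, and the induced motion of the optimal Bellman fixed point is tracked, yielding three indicators of intrinsic complexity: cumulative drift (length), acceleration/oscillation (curvature), and action-gap-induced nonsmoothness (kink). On this basis, the authors prove a solver-agnostic path-integral stability bound and derive gap-safe feasible regions that stay away from switch regimes. They further design lightweight wrappers, Homotopy-Tracking RL (HT-RL) and HT-MCTS, which estimate replay-based proxies of length, curvature, and near-tie proximity online and adapt learning or planning intensity accordingly. This improves tracking and dynamic regret, with the largest gains in oscillatory and switch-prone regimes.
Link: https://arxiv.org/abs/2601.21991
Authors: Zuyuan Zhang,Mahdi Imani,Tian Lan
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: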
Abstract:Real-world reinforcement learning is often nonstationary: rewards and dynamics drift, accelerate, oscillate, and trigger abrupt switches in the optimal action. Existing theory often represents nonstationarity with coarse-scale models that measure how much the environment changes, not how it changes locally, even though acceleration and near-ties drive tracking error and policy chattering. We take a geometric view of nonstationary discounted Markov Decision Processes (MDPs) by modeling the environment as a differentiable homotopy path and tracking the induced motion of the optimal Bellman fixed point. This yields a length–curvature–kink signature of intrinsic complexity: cumulative drift, acceleration/oscillation, and action-gap-induced nonsmoothness. We prove a solver-agnostic path-integral stability bound and derive gap-safe feasible regions that certify local stability away from switch regimes. Building on these results, we introduce Homotopy-Tracking RL (HT-RL) and HT-MCTS, lightweight wrappers that estimate replay-based proxies of length, curvature, and near-tie proximity online and adapt learning or planning intensity accordingly. Experiments show improved tracking and dynamic regret over matched static baselines, with the largest gains in oscillatory and switch-prone regimes.
[AI-17] Generalized Information Gathering Under Dynamics Uncertainty
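[Quick read]: This paper addresses how an agent in an unknown dynamical system can gather information actively and efficiently to accelerate learning of the dynamics. Existing methods usually design bespoke information-gathering costs for specific modeling choices (dynamics model, belief update procedure, observation model, and planner) and therefore lack generality. The key to the solution is a unifying framework that explicitly exposes the causal dependencies among parameters, beliefs, and controls, decoupling the information-gathering cost from specific modeling assumptions. Within it, a general information-gathering cost is derived from Massey's directed information; it assumes only Markov dynamics with additive noise and is otherwise agnostic to modeling choices. The paper further proves that the mutual information cost used in existing literature is a special case of this general cost and establishes an explicit connection to information gain in linearized Bayesian estimation, giving theoretical justification for mutual information-based active learning.
Link: https://arxiv.org/abs/2601.21988
Authors: Fernando Palafox,Jingqi Li,Jesse Milzman,David Fridovich-Keil
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO); Systems and Control (eess.SY)
Comments: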
Abstract:An agent operating in an unknown dynamical system must learn its dynamics from observations. Active information gathering accelerates this learning, but existing methods derive bespoke costs for specific modeling choices: dynamics models, belief update procedures, observation models, and planners. We present a unifying framework that decouples these choices from the information-gathering cost by explicitly exposing the causal dependencies between parameters, beliefs, and controls. Using this framework, we derive a general information-gathering cost based on Massey’s directed information that assumes only Markov dynamics with additive noise and is otherwise agnostic to modeling choices. We prove that the mutual information cost used in existing literature is a special case of our cost. Then, we leverage our framework to establish an explicit connection between the mutual information cost and information gain in linearized Bayesian estimation, thereby providing theoretical justification for mutual information-based active learning approaches. Finally, we illustrate the practical utility of our framework through experiments spanning linear, nonlinear, and multi-agent systems.
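For readers unfamiliar with the quantity named in this abstract, the standard definition of Massey's directed information is reproduced below next to the chain-rule form of mutual information. The symbols $X^n$ and $Y^n$ are generic sequences used here for illustration; how the paper maps parameters, beliefs, and controls onto them is not stated in the abstract.

```latex
% Directed information from X^n = (X_1, ..., X_n) to Y^n = (Y_1, ..., Y_n):
I(X^n \to Y^n) \;=\; \sum_{i=1}^{n} I\!\left(X^{i};\, Y_i \,\middle|\, Y^{i-1}\right)

% Mutual information, expanded with the chain rule for comparison:
I(X^n ; Y^n) \;=\; \sum_{i=1}^{n} I\!\left(X^{n};\, Y_i \,\middle|\, Y^{i-1}\right)
```

The only difference is that each term of the directed information conditions on the inputs up to time $i$ rather than on the whole sequence, which is what makes it a causal measure.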
[AI-18] VERSA: Verified Event Data Format for Reliable Soccer Analytics
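[Quick read]: This paper addresses logical inconsistencies in sports event stream data caused by inherent data quality issues, which seriously undermine the reliability of data-driven analytical models. The key to the solution is VERSA (Verified Event Data Format for Reliable Soccer Analytics), a systematic verification framework built on a state-transition model that defines valid event sequences, enabling automatic detection and correction of anomalous patterns in event stream data. Experiments show that VERSA significantly improves cross-provider consistency and strengthens the robustness and performance of the downstream task VAEP (Valuing Actions by Estimating Probabilities), which evaluates player contributions.
Link: https://arxiv.org/abs/2601.21981
Authors: Geonhee Jo,Mingu Kang,Kangmin Lee,Minho Lee,Pascal Bauer,Sang-Ki Ko
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB)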
Comments: 13 pages, 5 figures, 3 tables
Abstract:Event stream data is a critical resource for fine-grained analysis across various domains, including financial transactions, system operations, and sports. In sports, it is actively used for fine-grained analyses such as quantifying player contributions and identifying tactical patterns. However, the reliability of these models is fundamentally limited by inherent data quality issues that cause logical inconsistencies (e.g., incorrect event ordering or missing events). To this end, this study proposes VERSA (Verified Event Data Format for Reliable Soccer Analytics), a systematic verification framework that ensures the integrity of event stream data within the soccer domain. VERSA is based on a state-transition model that defines valid event sequences, thereby enabling the automatic detection and correction of anomalous patterns within the event stream data. Notably, our examination of event data from the K League 1 (2024 season), provided by Bepro, detected that 18.81% of all recorded events exhibited logical inconsistencies. Addressing such integrity issues, our experiments demonstrate that VERSA significantly enhances cross-provider consistency, ensuring stable and unified data representation across heterogeneous sources. Furthermore, we demonstrate that data refined by VERSA significantly improves the robustness and performance of a downstream task called VAEP, which evaluates player contributions. These results highlight that the verification process is highly effective in increasing the reliability of data-driven analysis.
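A state-transition check of the kind this abstract describes can be sketched as a table of allowed successor events plus a scan over the stream. The event names and transition table below are invented for illustration and are not VERSA's actual schema; the sketch also covers only detection, not the correction step.

```python
# Hypothetical transition table: event type -> event types allowed to follow it.
VALID_NEXT = {
    "kickoff": {"pass"},
    "pass": {"pass", "shot", "foul", "out"},
    "shot": {"goal", "save", "out"},
    "goal": {"kickoff"},
    "save": {"pass"},
    "foul": {"free_kick"},
    "free_kick": {"pass", "shot"},
    "out": {"throw_in"},
    "throw_in": {"pass"},
}

def find_violations(events):
    """Return the indices of events that are not a valid successor
    of the event immediately before them."""
    return [
        i
        for i in range(1, len(events))
        if events[i] not in VALID_NEXT.get(events[i - 1], set())
    ]

# A stream with two logical inconsistencies: play resumes after a goal without
# a kickoff, and a pass follows a foul without a free kick.
stream = ["kickoff", "pass", "shot", "goal", "pass", "foul", "pass"]
violations = find_violations(stream)
print(violations)
```

A correction step could then insert the missing event or reorder neighbors at each flagged index, but which repair is right depends on domain rules that the abstract does not spell out.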
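[AI-19] From Particles to Agents: Hallucination as a Metric for Cognitive Friction in Spatial Simulation
[Quick read]: This paper addresses a limitation of traditional architectural simulation (e.g., computational fluid dynamics, evacuation simulation, and structural analysis): elements are modeled as deterministic physical "particles" rather than cognitive "agents", which makes it hard to capture the complexity of human behavior and semantic interaction in spatial environments. The key to the solution is Agentic Environmental Simulations, in which large multimodal generative models actively predict the next state of a spatial environment based on semantic expectation, and Episodic Spatial Reasoning replaces conventional time-stepping so that simulations advance through meaningful, surprisal-triggered events. Within this framework, AI hallucination is reframed as a diagnostic tool: formalizing Cognitive Friction (C_f) reveals "Phantom Affordances", that is, semiotic ambiguities in built space. Finally, the paper challenges current human-computer interaction (HCI) paradigms by treating environments as dynamic cognitive partners and proposes a human-centered framework of cognitive orchestration, so that AI-driven simulation preserves autonomy, affective clarity, and cognitive integrity.
Link: https://arxiv.org/abs/2601.21977
Authors: Javier Argota Sánchez-Vaquerizo,Luis Borunda Monsivais
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Paper selected for the workshop Human Cognition, AI, and the Future of HCI: Navigating the Disruptive and Wild Landscape of Large Language Models and Agentic AI as part of the Human-Computer Interaction (HCI) conference of the Alpine region (AlpCHI 2026) hosted at the Congressi Stefano Franscini, March 1st to March 5th, 2026 on Monte Verità in Ascona, Switzerland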
Abstract:Traditional architectural simulations (e.g. Computational Fluid Dynamics, evacuation, structural analysis) model elements as deterministic physics-based “particles” rather than cognitive “agents”. To bridge this, we introduce Agentic Environmental Simulations, where Large Multimodal generative models actively predict the next state of spatial environments based on semantic expectation. Drawing on examples from accessibility-oriented AR pipelines and multimodal digital twins, we propose a shift from chronological time-steps to Episodic Spatial Reasoning, where simulations advance through meaningful, surprisal-triggered events. Within this framework we posit AI hallucinations as diagnostic tools. By formalizing Cognitive Friction (C_f), it is possible to reveal “Phantom Affordances”, i.e. semiotic ambiguities in built space. Finally, we challenge current HCI paradigms by treating environments as dynamic cognitive partners and propose a human-centered framework of cognitive orchestration for designing AI-driven simulations that preserve autonomy, affective clarity, and cognitive integrity.
[AI-20] Mind the Gap: How Elicitation Protocols Shape the Stated-Revealed Preference Gap in Language Models
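[Quick read]: This paper addresses the stated-revealed (SvR) preference gap in language models, that is, the mismatch between the preferences a model claims when stating its values and the choices it makes in concrete situations. Existing evaluations rely mainly on binary forced-choice prompting, which confounds genuine preferences with artifacts of the prompting protocol. The key to the approach is a systematic study of how different preference elicitation protocols affect SvR correlation: allowing neutral and abstain options during stated preference elicitation excludes weak signals and substantially improves Spearman's rank correlation (ρ), whereas additionally allowing abstention during revealed preference elicitation drives ρ to near-zero or negative values, because high rates of neutral responses weaken the preference signal. The results show that SvR correlation is highly protocol-dependent and that preference elicitation needs to account for indeterminate preferences.
Link: https://arxiv.org/abs/2601.21975
Authors: Pranav Mahajan,Ihor Kendiukhov,Syed Hussain,Lydia Nottingham
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: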
Abstract:Recent work identifies a stated-revealed (SvR) preference gap in language models (LMs): a mismatch between the values models endorse and the choices they make in context. Existing evaluations rely heavily on binary forced-choice prompting, which entangles genuine preferences with artifacts of the elicitation protocol. We systematically study how elicitation protocols affect SvR correlation across 24 LMs. Allowing neutrality and abstention during stated preference elicitation lets us exclude weak signals, substantially improving Spearman’s rank correlation (ρ) between volunteered stated preferences and forced-choice revealed preferences. However, further allowing abstention in revealed preferences drives ρ to near-zero or negative values due to high neutrality rates. Finally, we find that system prompt steering using stated preferences during revealed preference elicitation does not reliably improve SvR correlation on AIRiskDilemmas. Together, our results show that SvR correlation is highly protocol-dependent and that preference elicitation requires methods that account for indeterminate preferences.
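Since the headline metric here is Spearman's rank correlation, a small self-contained implementation may help make it concrete: rank both lists (ties share their average rank), then take the Pearson correlation of the ranks. The function names and the example scores are illustrative only and are not taken from the paper's data.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    std_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (std_x * std_y)

# Hypothetical per-value scores: how strongly a model endorses each value when
# asked directly (stated) versus how often it acts on it in dilemmas (revealed).
stated = [0.9, 0.7, 0.4, 0.1]
revealed = [0.8, 0.5, 0.6, 0.2]
rho = spearman_rho(stated, revealed)
print(rho)
```

Because only the ordering matters, ρ is insensitive to the scale of either score, which is why it suits comparing stated and revealed preferences measured in different ways.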
[AI-21] Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic
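[Quick read]: This paper addresses two core problems in optimizing multi-agent large language model (LLM) collaboration: existing methods depend on predefined execution protocols that require centralized execution, which limits deployment flexibility, and current Monte Carlo fine-tuning strategies suffer from high variance, which makes training inefficient. The key to the solution is two Multi-Agent Actor-Critic (MAAC) methods, CoLLM-CC (centralized critic) and CoLLM-DC (decentralized critics), which introduce critic networks to reduce the variance of policy gradient estimates and thereby improve training stability and convergence on long-horizon or sparse-reward tasks. Experiments show that CoLLM-DC matches CoLLM-CC in short-horizon and dense-reward settings, but complex tasks still need a centralized critic.
Link: https://arxiv.org/abs/2601.21972
Authors: Shuo Liu,Tianle Chen,Ryan Amiri,Christopher Amato
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
Comments: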
Abstract:Recent work has explored optimizing LLM collaboration through Multi-Agent Reinforcement Learning (MARL). However, most MARL fine-tuning approaches rely on predefined execution protocols, which often require centralized execution. Decentralized LLM collaboration is more appealing in practice, as agents can run inference in parallel with flexible deployments. Also, current approaches use Monte Carlo methods for fine-tuning, which suffer from high variance and thus require more samples to train effectively. Actor-critic methods are prevalent in MARL for dealing with these issues, so we developed Multi-Agent Actor-Critic (MAAC) methods to optimize decentralized LLM collaboration. In this paper, we analyze when and why these MAAC methods are beneficial. We propose two MAAC approaches, CoLLM-CC with a Centralized Critic and CoLLM-DC with Decentralized Critics. Our experiments across writing, coding, and game-playing domains show that Monte Carlo methods and CoLLM-DC can achieve performance comparable to CoLLM-CC in short-horizon and dense-reward settings. However, they both underperform CoLLM-CC on long-horizon or sparse-reward tasks, where Monte Carlo methods require substantially more samples and CoLLM-DC struggles to converge. Our code is available at this https URL.
[AI-22] MoE-ACT: Improving Surgical Imitation Learning Policies through Supervised Mixture-of-Experts
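[Quick read]: This paper addresses the challenges imitation learning faces in surgical robotics, including data scarcity, constrained workspaces, and strict requirements for safety and predictability. The core question is how to learn complex, long-horizon surgical manipulation tasks stably and with good generalization from limited demonstration data. The key to the solution is a supervised Mixture-of-Experts (MoE) architecture that can be added on top of any autonomous policy; combined with a lightweight action decoder such as the Action Chunking Transformer (ACT), it learns a complex bowel grasping and retraction task from fewer than 150 demonstrations using only stereo endoscopic images. Experiments show that the method clearly outperforms state-of-the-art Vision-Language-Action (VLA) models and the standard ACT baseline, with higher success rates and stronger robustness both in and out of distribution, and that it transfers zero-shot to ex vivo porcine tissue, offering a feasible path toward intraoperative deployment.
Link: https://arxiv.org/abs/2601.21971
Authors: Lorenzo Mazza,Ariel Rodriguez,Rayan Younis,Martin Lelis,Ortrun Hellig,Chenpan Li,Sebastian Bodenstedt,Martin Wagner,Stefanie Speidel
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: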
Abstract:Imitation learning has achieved remarkable success in robotic manipulation, yet its application to surgical robotics remains challenging due to data scarcity, constrained workspaces, and the need for an exceptional level of safety and predictability. We present a supervised Mixture-of-Experts (MoE) architecture designed for phase-structured surgical manipulation tasks, which can be added on top of any autonomous policy. Unlike prior surgical robot learning approaches that rely on multi-camera setups or thousands of demonstrations, we show that a lightweight action decoder policy like Action Chunking Transformer (ACT) can learn complex, long-horizon manipulation from fewer than 150 demonstrations using solely stereo endoscopic images, when equipped with our architecture. We evaluate our approach on the collaborative surgical task of bowel grasping and retraction, where a robot assistant interprets visual cues from a human surgeon, executes targeted grasping on deformable tissue, and performs sustained retraction. We benchmark our method against state-of-the-art Vision-Language-Action (VLA) models and the standard ACT baseline. Our results show that generalist VLAs fail to acquire the task entirely, even under standard in-distribution conditions. Furthermore, while standard ACT achieves moderate success in-distribution, adopting a supervised MoE architecture significantly boosts its performance, yielding higher success rates in-distribution and demonstrating superior robustness in out-of-distribution scenarios, including novel grasp locations, reduced illumination, and partial occlusions. Notably, it generalizes to unseen testing viewpoints and also transfers zero-shot to ex vivo porcine tissue without additional training, offering a promising pathway toward in vivo deployment. To support this, we present qualitative preliminary results of policy roll-outs during in vivo porcine surgery.
[AI-23] The Energy Impact of Domain Model Design in Classical Planning
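[Quick read]: This paper addresses the neglect of energy efficiency in automated planning. Research has traditionally focused on algorithmic performance (such as accuracy or runtime), whereas Green AI treats energy consumption as a key performance dimension. The core question is how domain model design systematically affects the energy consumption of classical planners. The key to the solution is a domain model configuration framework that varies domain characteristics in a controlled way (such as element ordering, action arity, and dead-end states). An empirical analysis of 32 domain variants per benchmark, across five benchmark domains and five state-of-the-art planners, shows that domain-level modifications significantly affect energy consumption across planners, and that energy consumption does not always correlate with runtime.
Link: https://arxiv.org/abs/2601.21967
Authors: Ilche Georgievski, Serhat Tekin, Marco Aiello
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 2026 IEEE/ACM 5th International Conference on AI Engineering - Software Engineering for AI (CAIN '26)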
Abstract:AI research has traditionally prioritised algorithmic performance, such as optimising accuracy in machine learning or runtime in automated planning. The emerging paradigm of Green AI challenges this by recognising energy consumption as a critical performance dimension. Despite the high computational demands of automated planning, its energy efficiency has received little attention. This gap is particularly salient given the modular planning structure, in which domain models are specified independently of algorithms. On the other hand, this separation also enables systematic analysis of energy usage through domain model design. We empirically investigate how domain model characteristics affect the energy consumption of classical planners. We introduce a domain model configuration framework that enables controlled variation of features, such as element ordering, action arity, and dead-end states. Using five benchmark domains and five state-of-the-art planners, we analyse energy and runtime impacts across 32 domain variants per benchmark. Results demonstrate that domain-level modifications produce measurable energy differences across planners, with energy consumption not always correlating with runtime.
[AI-24] How do Visual Attributes Influence Web Agents? A Comprehensive Evaluation of User Interface Design Factors
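[Quick read]: This paper addresses the lack of a systematic understanding of how webpage visual attributes (Visual Attribute Factors) influence the decision-making of web agents; compared with research on robustness to adversarial attacks, the role of visual factors in benign scenarios has received little attention. The key to the solution is VAF (Visual Attribute Factor evaluation pipeline), a controlled evaluation pipeline with three stages: (i) variant generation, which ensures that variants are semantically identical to the original page and differ only in visual attributes; (ii) browsing interaction, in which agents navigate by scrolling and clicking, as human users do; (iii) validation, which jointly measures the effect of visual attributes on agent decisions through the Target Click Rate and the Target Mention Rate. The method quantifies the influence of visual attributes and thereby identifies the most influential factors, such as background color contrast, item size, position, and card clarity.
Link: https://arxiv.org/abs/2601.21961
Authors: Kuai Yu, Naicheng Yu, Han Wang, Rui Yang, Huan Zhang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: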
Abstract:Web agents have demonstrated strong performance on a wide range of web-based tasks. However, existing research on the effect of environmental variation has mostly focused on robustness to adversarial attacks, with less attention to agents’ preferences in benign scenarios. Although early studies have examined how textual attributes influence agent behavior, a systematic understanding of how visual attributes shape agent decision-making remains limited. To address this, we introduce VAF, a controlled evaluation pipeline for quantifying how webpage Visual Attribute Factors influence web-agent decision-making. Specifically, VAF consists of three stages: (i) variant generation, which ensures the variants share identical semantics as the original item while only differ in visual attributes; (ii) browsing interaction, where agents navigate the page via scrolling and clicking the interested item, mirroring how human users browse online; (iii) validating through both click action and reasoning from agents, which we use the Target Click Rate and Target Mention Rate to jointly evaluate the effect of visual attributes. By quantitatively measuring the decision-making difference between the original and variant, we identify which visual attributes influence agents’ behavior most. Extensive experiments, across 8 variant families (48 variants total), 5 real-world websites (including shopping, travel, and news browsing), and 4 representative web agents, show that background color contrast, item size, position, and card clarity have a strong influence on agents’ actions, whereas font styling, text color, and item image clarity exhibit minor effects.
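The abstract names the two metrics but does not define them. A minimal Python sketch, assuming Target Click Rate is the fraction of episodes in which the agent clicks the target item and Target Mention Rate is the fraction in which its reasoning mentions that item; the `Episode` record and its field names are hypothetical, not taken from the paper:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One agent run on one page (hypothetical record layout)."""
    clicked_item: str  # id of the item the agent clicked
    reasoning: str     # the agent's free-text reasoning
    target_item: str   # id of the item whose visual attribute was varied
    target_name: str   # display name used to detect mentions


def target_click_rate(episodes):
    """Fraction of episodes in which the agent clicked the target item."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(e.clicked_item == e.target_item for e in episodes) / len(episodes)


def target_mention_rate(episodes):
    """Fraction of episodes whose reasoning mentions the target by name."""
    if not episodes:
        raise ValueError("need at least one episode")
    hits = sum(e.target_name.lower() in e.reasoning.lower() for e in episodes)
    return hits / len(episodes)
```

Under these assumptions, the effect of one visual variant would be the difference between each rate measured on the variant page and on the original page.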
[AI-25] ToolWeaver: Weaving Collaborative Semantics for Scalable Tool Use in Large Language Models ICLR2026
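[Quick read]: This paper addresses a dual semantic challenge in current retrieval-based tool-use pipelines: the encoders used by retrievers struggle to capture complex semantics, and the Large Language Model (LLM) lacks intrinsic tool knowledge from its natural-language pretraining. Existing generative methods unify tool selection and execution, but assigning each tool a unique new token makes the vocabulary explode, scales poorly, and creates a semantic bottleneck that hinders learning collaborative relationships between tools. The key to the solution is ToolWeaver, a framework that encodes tools as hierarchical sequences so that vocabulary expansion grows logarithmically with the number of tools. A novel tokenization process weaves a tool's intrinsic semantics together with its extrinsic co-usage patterns to produce structured code sequences, so the model can learn collaborative patterns from the dense co-occurrence of shared codes rather than the sparse co-occurrence of monolithic tool IDs.
Link: https://arxiv.org/abs/2601.21947
Authors: Bowen Fang, Wen Ye, Yunyue Su, Jinghao Zhang, Qiang Liu, Yesheng Liu, Xin Sun, Shu Wu, Jiabing Yang, Baole Wei, Liang Wang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 12 figures, Accepted to ICLR 2026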
Abstract:Prevalent retrieval-based tool-use pipelines struggle with a dual semantic challenge: their retrievers often employ encoders that fail to capture complex semantics, while the Large Language Model (LLM) itself lacks intrinsic tool knowledge from its natural language pretraining. Generative methods offer a powerful alternative by unifying selection and execution, tasking the LLM to directly learn and generate tool identifiers. However, the common practice of mapping each tool to a unique new token introduces substantial limitations: it creates a scalability and generalization crisis, as the vocabulary size explodes and each tool is assigned a semantically isolated token. This approach also creates a semantic bottleneck that hinders the learning of collaborative tool relationships, as the model must infer them from sparse co-occurrences of monolithic tool IDs within a vast library. To address these limitations, we propose ToolWeaver, a novel generative tool learning framework that encodes tools into hierarchical sequences. This approach makes vocabulary expansion logarithmic to the number of tools. Crucially, it enables the model to learn collaborative patterns from the dense co-occurrence of shared codes, rather than the sparse co-occurrence of monolithic tool IDs. We generate these structured codes through a novel tokenization process designed to weave together a tool’s intrinsic semantics with its extrinsic co-usage patterns. These structured codes are then integrated into the LLM through a generative alignment stage, where the model is fine-tuned to produce the hierarchical code sequences. Evaluation results with nearly 47,000 tools show that ToolWeaver significantly outperforms state-of-the-art methods, establishing a more scalable, generalizable, and semantically-aware foundation for advanced tool-augmented agents.
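To see why hierarchical codes keep vocabulary growth logarithmic, consider a minimal sketch in which each tool is named by a sequence of `depth` codes, each drawn from a per-level codebook of size `codebook_size`. The codebook size of 256, the separate codebook per level, and the base-K digit encoding are illustrative assumptions; the paper derives its codes from tool semantics and co-usage patterns, which this sketch does not attempt.

```python
def min_depth(num_tools: int, codebook_size: int) -> int:
    """Smallest sequence length whose code space covers every tool."""
    depth, capacity = 1, codebook_size
    while capacity < num_tools:
        capacity *= codebook_size
        depth += 1
    return depth


def new_token_count(codebook_size: int, depth: int) -> int:
    """New vocabulary entries needed: one codebook per level (assumed)."""
    return codebook_size * depth


def index_to_code(index: int, codebook_size: int, depth: int) -> tuple:
    """Illustrative code assignment: base-K digits of the tool index."""
    digits = []
    for _ in range(depth):
        index, digit = divmod(index, codebook_size)
        digits.append(digit)
    return tuple(reversed(digits))


depth = min_depth(47_000, 256)
print(depth, new_token_count(256, depth))  # prints: 2 512
print(index_to_code(46_999, 256, depth))   # prints: (183, 151)
```

With one token per tool, 47,000 tools would need 47,000 new vocabulary entries; in this sketch two levels of 256 codes suffice, and tools that share a first-level code share a token.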
[AI-26] Robust Multimodal Representation Learning in Healthcare
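[Quick read]: This paper addresses the loss of generalization in medical multimodal representation learning caused by systematic biases from multiple sources; existing methods concentrate on multimodal fusion and overlook how biased features harm predictive performance. The key to the solution is a Dual-Stream Feature Decorrelation Framework that uses structural causal analysis to identify biases introduced by latent confounders, uses a dual-stream neural network to disentangle causal features from spurious correlations, and combines a generalized cross-entropy loss with mutual information minimization to decorrelate them. The framework is model-agnostic and can be integrated into existing medical multimodal learning methods.
Link: https://arxiv.org/abs/2601.21941
Authors: Xiaoguang Zhu, Linxiao Gong, Lianlong Sun, Yang Liu, Haoyu Wang, Jing Liu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: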
Abstract:Medical multimodal representation learning aims to integrate heterogeneous data into unified patient representations to support clinical outcome prediction. However, real-world medical datasets commonly contain systematic biases from multiple sources, which poses significant challenges for medical multimodal representation learning. Existing approaches typically focus on effective multimodal fusion, neglecting inherent biased features that affect the generalization ability. To address these challenges, we propose a Dual-Stream Feature Decorrelation Framework that identifies and handles the biases through structural causal analysis introduced by latent confounders. Our method employs a causal-biased decorrelation framework with dual-stream neural networks to disentangle causal features from spurious correlations, utilizing generalized cross-entropy loss and mutual information minimization for effective decorrelation. The framework is model-agnostic and can be integrated into existing medical multimodal learning methods. Comprehensive experiments on MIMIC-IV, eICU, and ADNI datasets demonstrate consistent performance improvements.
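The abstract does not spell out the loss. A minimal sketch, assuming it refers to the commonly used generalized cross-entropy of Zhang and Sabuncu (2018), L_q(p, y) = (1 - p_y^q) / q with q in (0, 1]; the default q = 0.7 is an assumption, not a value from the paper:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gce_loss(logits, label, q=0.7):
    """Generalized cross-entropy (1 - p_y**q) / q for a single sample."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    p_label = softmax(logits)[label]
    return (1.0 - p_label ** q) / q
```

As q approaches 0 the loss approaches the ordinary cross-entropy -log p_y, and at q = 1 it equals 1 - p_y. Its gradient is the cross-entropy gradient scaled by p_y^q, so samples the model is already confident about are emphasized relative to hard ones.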
[AI-27] Retrieval-Infused Reasoning Sandbox: A Benchmark for Decoupling Retrieval and Reasoning Capabilities
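[Quick read]: This paper asks whether current large language models can reason over genuinely novel scientific information, and targets the limitations of existing evaluations, in which retrieval, toolchain choices, and reasoning are entangled and the signal is contaminated by parametric memorization and open-web volatility. The key to the solution is DeR2, a controlled deep-research sandbox that decouples evidence access from reasoning through four regimes: Instruction-only, Concepts (gold concepts without documents), Related-only (only relevant documents), and Full-set (relevant documents plus distractors). This makes retrieval loss and reasoning loss quantifiable and enables fine-grained error attribution. A two-phase validation guards against parametric leakage, and a frozen document library with expert-annotated concepts and rationales ensures reproducibility.
Link: https://arxiv.org/abs/2601.21937
Authors: Shuangshuang Ying, Zheyu Wang, Yunjian Peng, Jin Chen, Yuhao Wu, Hongbin Lin, Dingyu He, Siyi Liu, Gengchen Yu, YinZhu Piao, Yuchen Wu, Xin Gui, Zhongyuan Peng, Xin Li, Xeron Du, Libo Qin, YiXin Cao, Ge Zhang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: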
Abstract:Despite strong performance on existing benchmarks, it remains unclear whether large language models can reason over genuinely novel scientific information. Most evaluations score end-to-end RAG pipelines, where reasoning is confounded with retrieval and toolchain choices, and the signal is further contaminated by parametric memorization and open-web volatility. We introduce DeR2, a controlled deep-research sandbox that isolates document-grounded reasoning while preserving core difficulties of deep search: multi-step synthesis, denoising, and evidence-based conclusion making. DeR2 decouples evidence access from reasoning via four regimes–Instruction-only, Concepts (gold concepts without documents), Related-only (only relevant documents), and Full-set (relevant documents plus topically related distractors)–yielding interpretable regime gaps that operationalize retrieval loss vs. reasoning loss and enable fine-grained error attribution. To prevent parametric leakage, we apply a two-phase validation that requires parametric failure without evidence while ensuring oracle-concept solvability. To ensure reproducibility, each instance provides a frozen document library (drawn from 2023-2025 theoretical papers) with expert-annotated concepts and validated rationales. Experiments across a diverse set of state-of-the-art foundation models reveal substantial variation and significant headroom: some models exhibit mode-switch fragility, performing worse with the Full-set than with Instruction-only, while others show structural concept misuse, correctly naming concepts but failing to execute them as procedures.
[AI-28] AgenticSimLaw: A Juvenile Courtroom Multi-Agent Debate Simulation for Explainable High-Stakes Tabular Decision Making
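[Quick read]: This paper addresses the lack of transparency, controllability, and auditability of Large Language Models (LLMs) in high-stakes tabular decision-making, particularly in ethically complex settings such as young adult recidivism prediction. Traditional Chain-of-Thought (CoT) prompting tends to behave as black-box reasoning whose decision logic is hard to trace and whose performance is unstable. The key to the solution is AgenticSimLaw, a role-structured multi-agent debate framework that defines explicit roles (prosecutor, defense, judge), a 7-turn structured debate protocol, and private reasoning strategies, making test-time reasoning fully observable and controllable. The framework improves the stability and generalization of model performance, offers fine-grained control over reasoning steps, produces complete interaction transcripts for explainability, and supports systematic profiling of agent behavior, meeting the transparency and human-oversight needs of high-stakes decision tasks.
Link: https://arxiv.org/abs/2601.21936
Authors: Jon Chun, Kathrine Elkins, Yong Suk Lee
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages, 5 figures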
Abstract:We introduce AgenticSimLaw, a role-structured, multi-agent debate framework that provides transparent and controllable test-time reasoning for high-stakes tabular decision-making tasks. Unlike black-box approaches, our courtroom-style orchestration explicitly defines agent roles (prosecutor, defense, judge), interaction protocols (7-turn structured debate), and private reasoning strategies, creating a fully auditable decision-making process. We benchmark this framework on young adult recidivism prediction using the NLSY97 dataset, comparing it against traditional chain-of-thought (CoT) prompting across almost 90 unique combinations of models and strategies. Our results demonstrate that structured multi-agent debate provides more stable and generalizable performance compared to single-agent reasoning, with stronger correlation between accuracy and F1-score metrics. Beyond performance improvements, AgenticSimLaw offers fine-grained control over reasoning steps, generates complete interaction transcripts for explainability, and enables systematic profiling of agent behaviors. While we instantiate this framework in the criminal justice domain to stress-test reasoning under ethical complexity, the approach generalizes to any deliberative, high-stakes decision task requiring transparency and human oversight. This work addresses key LLM-based multi-agent system challenges: organization through structured roles, observability through logged interactions, and responsibility through explicit non-deployment constraints for sensitive domains. Data, results, and code will be available on this http URL under the MIT license.
[AI-29] From Future of Work to Future of Workers: Addressing Asymptomatic AI Harms for Dignified Human-AI Interaction
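[Quick read]: This paper addresses the hidden skill erosion and weakening of professional identity that over-reliance on AI causes among professionals as AI spreads through the workplace, which the authors call the AI-as-Amplifier Paradox: AI raises short-term efficiency while eroding long-term expertise and agency. The key to the solution is a "sociotechnical immunity" framework whose dual-purpose mechanisms serve institutional quality goals and build worker capability at the same time: they strengthen the ability to detect, contain, and recover from skill erosion, and they give workers the agency to preserve their professional identity and dignity, balancing productivity with the value of human expertise in high-stakes fields such as healthcare and software engineering.
Link: https://arxiv.org/abs/2601.21920
Authors: Upol Ehsan, Samir Passi, Koustuv Saha, Todd McNutt, Mark O. Riedl, Sara Alcorn
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: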
Abstract:In the future of work discourse, AI is touted as the ultimate productivity amplifier. Yet, beneath the efficiency gains lie subtle erosions of human expertise and agency. This paper shifts focus from the future of work to the future of workers by navigating the AI-as-Amplifier Paradox: AI’s dual role as enhancer and eroder, simultaneously strengthening performance while eroding underlying expertise. We present a year-long study on the longitudinal use of AI in a high-stakes workplace among cancer specialists. Initial operational gains hid "intuition rust": the gradual dulling of expert judgment. These asymptomatic effects evolved into chronic harms, such as skill atrophy and identity commoditization. Building on these findings, we offer a framework for dignified Human-AI interaction co-constructed with professional knowledge workers facing AI-induced skill erosion without traditional labor protections. The framework operationalizes sociotechnical immunity through dual-purpose mechanisms that serve institutional quality goals while building worker power to detect, contain, and recover from skill erosion, and preserve human identity. Evaluated across healthcare and software engineering, our work takes a foundational step toward dignified human-AI interaction futures by balancing productivity with the preservation of human expertise.
[AI-30] Making Models Unmergeable via Scaling-Sensitive Loss Landscape
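[Quick read]: This paper addresses the governance gap that arises when downstream users recompose model components from model hubs without authorization, mixing released weights into new models that bypass safety alignment or violate licensing terms. The authors propose Trap^2, an architecture-agnostic protection framework that encodes protection into the update during fine-tuning, using weight re-scaling as a proxy for merging: released weights remain effective in standalone use but degrade sharply under the re-scaling that commonly arises in merging, which undermines unauthorized merging.
Link: https://arxiv.org/abs/2601.21898
Authors: Minwoo Jang, Hoyoung Kim, Jabin Koo, Jungseul Ok
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Preprint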
Abstract:The rise of model hubs has made it easier to access reusable model components, making model merging a practical tool for combining capabilities. Yet, this modularity also creates a governance gap: downstream users can recompose released weights into unauthorized mixtures that bypass safety alignment or licensing terms. Because existing defenses are largely post-hoc and architecture-specific, they provide inconsistent protection across diverse architectures and release formats in practice. To close this gap, we propose Trap^2, an architecture-agnostic protection framework that encodes protection into the update during fine-tuning, regardless of whether they are released as adapters or full models. Instead of relying on architecture-dependent approaches, Trap^2 uses weight re-scaling as a simple proxy for the merging process. It keeps released weights effective in standalone use, but degrades them under re-scaling that often arises in merging, undermining unauthorized merging.
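Why re-scaling is a reasonable proxy for merging can be seen in a minimal sketch of linear (task-arithmetic-style) merging, where the merged model is the base plus a weighted sum of fine-tuning updates. The function names and the toy weight vectors are hypothetical; the paper's training objective is not reproduced here.

```python
def task_vector(finetuned, base):
    """Fine-tuning update: element-wise difference from the base weights."""
    return [f - b for f, b in zip(finetuned, base)]


def linear_merge(base, finetuned_models, coefficients):
    """Base weights plus a weighted sum of each model's update."""
    merged = list(base)
    for model, coeff in zip(finetuned_models, coefficients):
        for i, delta in enumerate(task_vector(model, base)):
            merged[i] += coeff * delta
    return merged


def rescale_update(base, finetuned, scale):
    """Base weights plus the update multiplied by a scalar."""
    return [b + scale * d for b, d in zip(base, task_vector(finetuned, base))]


base = [1.0, -2.0, 0.5]
protected = [1.5, -1.0, 0.0]  # toy "released" weights
# Averaging the protected model with an untouched copy of the base
# is the same as halving the protected update.
print(linear_merge(base, [protected, base], [0.5, 0.5]))  # prints: [1.25, -1.5, 0.25]
print(rescale_update(base, protected, 0.5))               # prints: [1.25, -1.5, 0.25]
```

In this view every linear merge applies some coefficient below or above 1 to the released update, so an update that only works at scale 1 would lose its effect once merged.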
[AI-31] astra-langchain4j: Experiences Combining LLMs and Agent Programming
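[Quick read]: This paper examines how Generative AI and agent toolkits can evolve together: how large language models (LLMs) affect the use of traditional agent toolkits, and how the experience embodied in those toolkits can guide the design of new agentic platforms. The key to the solution is a prototype LLM integration for the ASTRA programming language; three example implementations are used to explore how traditional agent capabilities can be combined with generative AI and to distill reusable design lessons and practical insights.
Link: https://arxiv.org/abs/2601.21879
Authors: Rem Collier, Katharine Beaumont, Andrei Ciortea
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: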
Abstract:Given the emergence of Generative AI over the last two years and the increasing focus on Agentic AI as a form of Multi-Agent System it is important to explore both how such technologies can impact the use of traditional Agent Toolkits and how the wealth of experience encapsulated in those toolkits can influence the design of the new agentic platforms. This paper presents an overview of our experience developing a prototype large language model (LLM) integration for the ASTRA programming language. It presents a brief overview of the toolkit, followed by three example implementations, concluding with a discussion of the experiences garnered through the examples.
[AI-32] WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents ICLR2026
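[Quick read]: This paper addresses the sparse and delayed rewards that web agents face in complex web tasks with long-horizon, irreversible decisions, and the weak generalization and poor interpretability of existing Process Reward Models (PRMs), whose reward signals are either coarse or depend on brittle template matching. The key to the solution is WebArbiter, a reasoning-first PRM that formulates reward modeling as text generation and outputs structured justifications containing a preference verdict and the recommended action. Training has two stages: reasoning distillation first gives the model coherent, principle-guided reasoning, and reinforcement learning then corrects teacher biases by aligning verdicts directly with task correctness, yielding stronger generalization and practical value.
Link: https://arxiv.org/abs/2601.21872
Authors: Yao Zhang, Shijie Tang, Zeyu Li, Zhen Han, Volker Tresp
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: ICLR 2026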
Abstract:Web agents hold great potential for automating complex computer tasks, yet their interactions involve long-horizon, sequential decision-making with irreversible actions. In such settings, outcome-based supervision is sparse and delayed, often rewarding incorrect trajectories and failing to support inference-time scaling. This motivates the use of Process Reward Models (WebPRMs) for web navigation, but existing approaches remain limited: scalar WebPRMs collapse progress into coarse, weakly grounded signals, while checklist-based WebPRMs rely on brittle template matching that fails under layout or semantic changes and often mislabels superficially correct actions as successful, providing little insight or interpretability. To address these challenges, we introduce WebArbiter, a reasoning-first, principle-inducing WebPRM that formulates reward modeling as text generation, producing structured justifications that conclude with a preference verdict and identify the action most conducive to task completion under the current context. Training follows a two-stage pipeline: reasoning distillation equips the model with coherent principle-guided reasoning, and reinforcement learning corrects teacher biases by directly aligning verdicts with correctness, enabling stronger generalization. To support systematic evaluation, we release WebPRMBench, a comprehensive benchmark spanning four diverse web environments with rich tasks and high-quality preference annotations. On WebPRMBench, WebArbiter-7B outperforms the strongest baseline, GPT-5, by 9.1 points. In reward-guided trajectory search on WebArena-Lite, it surpasses the best prior WebPRM by up to 7.2 points, underscoring its robustness and practical value in real-world complex web tasks.
[AI-33] MoHETS: Long-term Time Series Forecasting with Mixture-of-Heterogeneous-Experts
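[Quick read]: This paper addresses the difficulty of long-horizon forecasting for multivariate time series with complex multi-scale structure (global trends, local periodicities, and non-stationary regimes), and the fact that existing sparse Mixture-of-Experts (MoE) models rely on homogeneous MLP experts that capture diverse temporal dynamics poorly. The core of the solution is MoHETS, an encoder-only Transformer with sparse Mixture-of-Heterogeneous-Experts (MoHE) layers. A router assigns temporal patches to a small subset of expert networks, combining a shared depthwise-convolution expert that preserves sequence-level continuity with routed Fourier-based experts that model patch-level periodic structure. Cross-attention over exogenous covariate embeddings improves robustness to non-stationary dynamics, and a lightweight convolutional patch decoder replaces parameter-heavy linear projection heads, improving parameter efficiency and training stability and letting a single model generalize to arbitrary forecast horizons.
Link: https://arxiv.org/abs/2601.21866
Authors: Evandro S. Ortigossa, Guy Lutsker, Eran Segal
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under review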
Abstract:Real-world multivariate time series can exhibit intricate multi-scale structures, including global trends, local periodicities, and non-stationary regimes, which makes long-horizon forecasting challenging. Although sparse Mixture-of-Experts (MoE) approaches improve scalability and specialization, they typically rely on homogeneous MLP experts that poorly capture the diverse temporal dynamics of time series data. We address these limitations with MoHETS, an encoder-only Transformer that integrates sparse Mixture-of-Heterogeneous-Experts (MoHE) layers. MoHE routes temporal patches to a small subset of expert networks, combining a shared depthwise-convolution expert for sequence-level continuity with routed Fourier-based experts for patch-level periodic structures. MoHETS further improves robustness to non-stationary dynamics by incorporating exogenous information via cross-attention over covariate patch embeddings. Finally, we replace parameter-heavy linear projection heads with a lightweight convolutional patch decoder, improving parameter efficiency, reducing training instability, and allowing a single model to generalize across arbitrary forecast horizons. We validate across seven multivariate benchmarks and multiple horizons, with MoHETS consistently achieving state-of-the-art performance, reducing the average MSE by 12% compared to strong recent baselines, demonstrating effective heterogeneous specialization for long-term forecasting.
[AI-34] KnowBias: Mitigating Social Bias in LLMs via Know-Bias Neuron Enhancement
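[Quick read]: This paper addresses social bias in Large Language Models (LLMs), which reinforces harmful stereotypes and limits safe deployment. Most existing debiasing methods follow a suppressive paradigm, weakening biased behavior by modifying parameters, prompts, or neurons, but such methods are often brittle, generalize weakly, use data inefficiently, and can degrade general capability. The proposed KnowBias framework takes the opposite approach: rather than suppressing bias-related neurons, it debiases by identifying and enhancing the neurons that encode bias knowledge. KnowBias locates these neurons with a small set of bias-knowledge questions and attribution-based analysis, then selectively enhances them at inference time. This achieves strong, generalizable debiasing while preserving general capability, requires no retraining, and needs only a handful of simple yes/no questions.
Link: https://arxiv.org/abs/2601.21864
Authors: Jinhao Pan, Chahat Raj, Anjishnu Mukherjee, Sina Mansouri, Bowen Wei, Shloka Yada, Ziwei Zhu
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: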
Abstract:Large language models (LLMs) exhibit social biases that reinforce harmful stereotypes, limiting their safe deployment. Most existing debiasing methods adopt a suppressive paradigm by modifying parameters, prompts, or neurons associated with biased behavior; however, such approaches are often brittle, weakly generalizable, data-inefficient, and prone to degrading general capability. We propose KnowBias, a lightweight and conceptually distinct framework that mitigates bias by strengthening, rather than suppressing, neurons encoding bias-knowledge. KnowBias identifies neurons encoding bias knowledge using a small set of bias-knowledge questions via attribution-based analysis, and selectively enhances them at inference time. This design enables strong debiasing while preserving general capabilities, generalizes across bias types and demographics, and is highly data efficient, requiring only a handful of simple yes/no questions and no retraining. Experiments across multiple benchmarks and LLMs demonstrate consistent state-of-the-art debiasing performance with minimal utility degradation. Data and code are available at this https URL.
[AI-35] Bridging Forecast Accuracy and Inventory KPIs: A Simulation-Based Software Framework
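[Quick read]: This paper addresses a key problem in spare-parts inventory management for the automotive aftermarket: forecasting models are evaluated mainly with statistical accuracy metrics (such as MAE and RMSE), yet the relationship between these metrics and operational performance indicators (such as total cost and service level) is unclear, so model selection lacks operational relevance. The core of the solution is a decision-centric simulation software framework built as a closed loop: a synthetic demand generator tailored to spare-parts demand characteristics, a flexible forecasting module that can host arbitrary predictive models, and an inventory control simulator that consumes the forecasts and computes operational KPIs. This shifts evaluation from statistical error alone toward operational impact, shows how differences in forecast performance affect inventory decisions, and provides guidance for model selection based on operational goals.
Link: https://arxiv.org/abs/2601.21844
Authors: So Fukuhara, Abdallah Alabdallah, Nuwan Gunasekara, Slawomir Nowaczyk
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 12 pages, 6 figures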
Abstract:Efficient management of spare parts inventory is crucial in the automotive aftermarket, where demand is highly intermittent and uncertainty drives substantial cost and service risks. Forecasting is therefore central, but the quality of a forecasting model should be judged not by statistical accuracy (e.g., MAE, RMSE, IAE) but rather by its impact on key operational performance indicators (KPIs), such as total cost and service level. Yet most existing work evaluates models exclusively using accuracy metrics, and the relationship between these metrics and operational KPIs remains poorly understood. To address this gap, we propose a decision-centric simulation software framework that enables systematic evaluation of forecasting model in realistic inventory management setting. The framework comprises: (i) a synthetic demand generator tailored to spare-parts demand characteristics, (ii) a flexible forecasting module that can host arbitrary predictive models, and (iii) an inventory control simulator that consumes the forecasts and computes operational KPIs. This closed-loop setup enables researchers to evaluate models not only in terms of statistical error but also in terms of their downstream implications for inventory decisions. Using a wide range of simulation scenarios, we show that improvements in conventional accuracy metrics do not necessarily translate into better operational performance, and that models with similar statistical error profiles can induce markedly different cost-service trade-offs. We analyze these discrepancies to characterize how specific aspects of forecast performance affect inventory outcomes and derive guidance for model selection. Overall, the framework operationalizes the link between demand forecasting and inventory management, shifting evaluation from purely predictive accuracy toward operational relevance in the automotive aftermarket and related domains.
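The claim that equal statistical error can hide very different operational outcomes can be illustrated with a toy closed loop. This is a minimal sketch, not the paper's framework: the order-up-to-forecast policy, zero lead time, and the holding and shortage costs are all assumptions chosen for the example.

```python
def mean_absolute_error(demand, forecast):
    """Average absolute forecast error per period."""
    return sum(abs(d - f) for d, f in zip(demand, forecast)) / len(demand)


def simulate_inventory(demand, forecast, holding_cost=1.0, shortage_cost=5.0):
    """Return (total cost, fill rate) for an order-up-to-forecast policy.

    Assumes zero lead time, lost sales, and non-zero total demand.
    """
    on_hand = 0.0
    total_cost = 0.0
    served = 0.0
    for d, f in zip(demand, forecast):
        on_hand = max(on_hand, f)  # replenish up to the forecast level
        sold = min(on_hand, d)     # unmet demand is lost
        served += sold
        on_hand -= sold
        total_cost += shortage_cost * (d - sold) + holding_cost * on_hand
    return total_cost, served / sum(demand)


demand = [2.0, 2.0, 2.0, 2.0]
over = [3.0, 3.0, 3.0, 3.0]   # always one unit too high
under = [1.0, 1.0, 1.0, 1.0]  # always one unit too low

print(mean_absolute_error(demand, over), mean_absolute_error(demand, under))  # prints: 1.0 1.0
print(simulate_inventory(demand, over))   # prints: (4.0, 1.0)
print(simulate_inventory(demand, under))  # prints: (20.0, 0.5)
```

Both forecasts have the same MAE, yet under these assumed costs the under-forecast is five times as expensive and serves only half the demand.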
[AI-36] Test-Time Compute Games
[Quick read]: This paper addresses the social inefficiency of the LLM-as-a-service market: because providers charge for test-time compute, they have a financial incentive to increase it even when the increase adds little to output quality. The key to the solution is a reverse second-price auction mechanism in which providers bid their offered price and expected quality, and users pay in proportion to the marginal value generated by the winning provider relative to the second-highest bidder.
Link: https://arxiv.org/abs/2601.21839
Authors: Ander Artola Velasco, Dimitrios Rontogiannis, Stratis Tsirtsis, Manuel Gomez-Rodriguez
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments:
Abstract:Test-time compute has emerged as a promising strategy to enhance the reasoning abilities of large language models (LLMs). However, this strategy has in turn increased how much users pay cloud-based providers offering LLM-as-a-service, since providers charge users for the amount of test-time compute they use to generate an output. In our work, we show that the market of LLM-as-a-service is socially inefficient: providers have a financial incentive to increase the amount of test-time compute, even if this increase contributes little to the quality of the outputs. To address this inefficiency, we introduce a reverse second-price auction mechanism where providers bid their offered price and (expected) quality for the opportunity to serve a user, and users pay proportionally to the marginal value generated by the winning provider relative to the second-highest bidder. To illustrate and complement our theoretical results, we conduct experiments with multiple instruct models from the Llama and Qwen families, as well as reasoning models distilled from DeepSeek-R1, on math and science benchmark datasets.
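The abstract does not give the exact payment rule, so the sketch below substitutes the classic second-score auction as an illustration of the general idea: providers are ranked by the surplus they offer the user, and the winner's payment is set by the runner-up's offer rather than by its own asking price. The linear user value `value_per_quality * quality`, the `Bid` record, and all numbers are assumptions, not the paper's mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    provider: str
    price: float    # asking price for serving the request
    quality: float  # expected output quality


def score(bid, value_per_quality=1.0):
    """User surplus offered by a bid: value of the quality minus the price."""
    return value_per_quality * bid.quality - bid.price


def second_score_auction(bids, value_per_quality=1.0):
    """Pick the highest-score bid; pay so the user keeps the runner-up's score."""
    if len(bids) < 2:
        raise ValueError("need at least two bids")
    ranked = sorted(bids, key=lambda b: score(b, value_per_quality), reverse=True)
    winner, runner_up = ranked[0], ranked[1]
    payment = value_per_quality * winner.quality - score(runner_up, value_per_quality)
    return winner, payment


bids = [Bid("A", 3.0, 10.0), Bid("B", 2.0, 8.0), Bid("C", 5.0, 9.0)]
winner, payment = second_score_auction(bids)
print(winner.provider, payment)  # prints: A 4.0
```

In this stand-in rule the payment depends on delivered quality and on the competing offer, not on how much compute the winner chooses to spend, which is the incentive problem the abstract describes.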
[AI-37] Trustworthy Intelligent Education: A Systematic Perspective on Progress, Challenges, and Future Directions
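[Quick read]: This paper addresses the fragmentation of trustworthiness research in intelligent education and the lack of a systematic categorization. Existing work either focuses on task-specific trustworthy methods or stays at the level of high-level surveys, without a structured framework covering both typical educational tasks and trustworthiness dimensions. The key to the solution is to first organize intelligent education into five representative task categories (learner ability assessment, learning resource recommendation, learning analytics, educational content understanding, and instructional assistance), then review existing studies along five trustworthiness dimensions (safety and privacy, robustness, fairness, explainability, and sustainability), and summarize the corresponding research methodologies and solution strategies, providing a clear and unified reference framework for trustworthy intelligent education.
Link: https://arxiv.org/abs/2601.21837
Authors: Xiaoshan Yu, Shangshang Yang, Ziwen Wang, Haiping Ma, Xingyi Zhang
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 9 pages, 3 figures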
Abstract:In recent years, trustworthiness has garnered increasing attention and exploration in the field of intelligent education, due to the inherent sensitivity of educational scenarios, such as involving minors and vulnerable groups, highly personalized learning data, and high-stakes educational outcomes. However, existing research either focuses on task-specific trustworthy methods without a holistic view of trustworthy intelligent education, or provides survey-level discussions that remain high-level and fragmented, lacking a clear and systematic categorization. To address these limitations, in this paper, we present a systematic and structured review of trustworthy intelligent education. Specifically, we first organize intelligent education into five representative task categories: learner ability assessment, learning resource recommendation, learning analytics, educational content understanding, and instructional assistance. Building on this task landscape, we review existing studies from five trustworthiness perspectives, including safety and privacy, robustness, fairness, explainability, and sustainability, and summarize and categorize the research methodologies and solution strategies therein. Finally, we summarize key challenges and discuss future research directions. This survey aims to provide a coherent reference framework and facilitate a clearer understanding of trustworthiness in intelligent education.
[AI-38] Looking Beyond Accuracy: A Holistic Benchmark of ECG Foundation Models
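[Quick read]: This paper addresses the lack of systematic evaluation of how well the embeddings produced by Foundation Models (FMs) generalize in electrocardiogram (ECG) interpretation. Current studies focus on downstream task performance and neglect the structure of the embedding space and adaptability across scenarios, which matters especially in high-risk settings such as healthcare. The key to the solution is a benchmarking methodology that complements performance-based evaluation with representation-level analysis, using SHAP (SHapley Additive exPlanations) and UMAP (Uniform Manifold Approximation and Projection) to expose the models' internal representation patterns, and applying it to several pretrained ECG-expert models across cross-continental datasets and different data-availability settings to assess generalizability and representational structure.
Link: https://arxiv.org/abs/2601.21830
Authors: Francesca Filice, Edoardo De Rose, Simone Bartucci, Francesco Calimeri, Simona Perri
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: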
Abstract:The electrocardiogram (ECG) is a cost-effective, highly accessible and widely employed diagnostic tool. With the advent of Foundation Models (FMs), the field of AI-assisted ECG interpretation has begun to evolve, as they enable model reuse across different tasks by relying on embeddings. However, to responsibly employ FMs, it is crucial to rigorously assess to which extent the embeddings they produce are generalizable, particularly in error-sensitive domains such as healthcare. Although prior works have already addressed the problem of benchmarking ECG-expert FMs, they focus predominantly on the evaluation of downstream performance. To fill this gap, this study aims to find an in-depth, comprehensive benchmarking framework for FMs, with a specific focus on ECG-expert ones. To this aim, we introduce a benchmark methodology that complements performance-based evaluation with representation-level analysis, leveraging SHAP and UMAP techniques. Furthermore, we rely on the methodology for carrying out an extensive evaluation of several ECG-expert FMs pretrained via state-of-the-art techniques over different cross-continental datasets and data availability settings; this includes ones featuring data scarcity, a fairly common situation in real-world medical scenarios. Experimental results show that our benchmarking protocol provides a rich insight of ECG-expert FMs’ embedded patterns, enabling a deeper understanding of their representational structure and generalizability.
[AI-39] CORE: Toward Ubiquitous 6G Intelligence Through Collaborative Orchestration of Large Language Model Agents Over Hierarchical Edge
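[Quick Read]: This paper tackles the difficulty of running complex reasoning tasks efficiently when sixth-generation (6G) networks meet large language models (LLMs): computing resources across hierarchical edge networks are fragmented and heterogeneous, and are insufficient for an individual LLM agent. The core challenge is enabling multiple LLM agents to reason collaboratively over the hierarchical edge in support of ubiquitous intelligence. The key to the solution is Collaborative Orchestration Role at Edge (CORE), a framework that distributes multiple LLMs, each with a distinct functional role, across mobile devices and tiered edge servers, and integrates three optimization modules: real-time perception, dynamic role orchestration, and pipeline-parallel execution. A role affinity scheduling algorithm dynamically assigns LLM roles by matching computational demands to the available dispersed resources, and the reported case studies show gains in system efficiency and task completion rate.
Link: https://arxiv.org/abs/2601.21822
Authors: Zitong Yu,Boquan Sun,Yang Li,Zheyan Qu,Xing Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: Accepted by IEEE Communications Magazine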
Abstract:Rapid advancements in sixth-generation (6G) networks and large language models (LLMs) have paved the way for ubiquitous intelligence, wherein seamless connectivity and distributed artificial intelligence (AI) have revolutionized various aspects of our [...]. However, realizing this vision faces significant challenges owing to the fragmented and heterogeneous computing resources across hierarchical networks, which are insufficient for individual LLM agents to perform complex reasoning [...]. To address this issue, we propose Collaborative Orchestration Role at Edge (CORE), an innovative framework that employs a collaborative learning system in which multiple LLMs, each assigned a distinct functional role, are distributed across mobile devices and tiered edge servers. The system integrates three optimization modules, encompassing real-time perception, dynamic role orchestration, and pipeline-parallel execution, to facilitate efficient and rapid collaboration among distributed agents. Furthermore, we introduce a novel role affinity scheduling algorithm for dynamically orchestrating LLM role assignments across the hierarchical edge infrastructure, intelligently matching computational demands with available dispersed [...]. [...] comprehensive case studies and performance evaluations across various 6G application scenarios demonstrated the efficacy of CORE, revealing significant enhancements in the system efficiency and task completion rates. Building on these promising outcomes, we further validated the practical applicability of CORE by deploying it on a real-world edge-computing platform, where it exhibits robust performance in operational environments.
[AI-40] A Unified XAI-LLM Approach for Endotracheal Suctioning Activity Recognition
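[Quick Read]: This paper addresses the lack of automated recognition and feedback systems for endotracheal suctioning (ES) training, particularly in home care and educational settings where consistent supervision may be limited. The key to the solution is a unified framework centered on a large language model (LLM), which performs spatiotemporal activity recognition and explainable decision analysis from video data and verbalizes feedback in natural language, turning technical analysis into guidance that trainees can understand. Experiments report an improvement of roughly 15-20% in both accuracy and F1 score over conventional machine learning and deep learning baselines. A pilot student-support module, built on anomaly detection and explainable AI (XAI) principles, gives automated, interpretable feedback that highlights correct actions and suggests targeted improvements.
Link: https://arxiv.org/abs/2601.21802
Authors: Hoang Khang Phan,Quang Vinh Dang,Noriyo Colley,Christina Garcia,Nhat Tan Le
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: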
Abstract:Endotracheal suctioning (ES) is an invasive yet essential clinical procedure that requires a high degree of skill to minimize patient risk - particularly in home care and educational settings, where consistent supervision may be limited. Despite its critical importance, automated recognition and feedback systems for ES training remain underexplored. To address this gap, this study proposes a unified, LLM-centered framework for video-based activity recognition benchmarked against conventional machine learning and deep learning approaches, and a pilot study on feedback generation. Within this framework, the Large Language Model (LLM) serves as the central reasoning module, performing both spatiotemporal activity recognition and explainable decision analysis from video data. Furthermore, the LLM is capable of verbalizing feedback in natural language, thereby translating complex technical insights into accessible, human-understandable guidance for trainees. Experimental results demonstrate that the proposed LLM-based approach outperforms baseline models, achieving an improvement of approximately 15-20% in both accuracy and F1 score. Beyond recognition, the framework incorporates a pilot student-support module built upon anomaly detection and explainable AI (XAI) principles, which provides automated, interpretable feedback highlighting correct actions and suggesting targeted improvements. Collectively, these contributions establish a scalable, interpretable, and data-driven foundation for advancing nursing education, enhancing training efficiency, and ultimately improving patient safety.
[AI-41] BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics
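[Quick Read]: This paper addresses the lack of a systematic way to evaluate AI agents on bioinformatics tasks, in particular how to quantify their performance and robustness. The key to the solution is BioAgent Bench, a benchmark dataset and evaluation suite of curated end-to-end tasks (e.g., RNA-seq, variant calling, metagenomics) whose prompts specify concrete output artifacts to support automated assessment. An LLM-based grader scores pipeline progress and outcome validity, and controlled perturbations (corrupted inputs, decoy files, and prompt bloat) test robustness. The study finds that frontier agents can complete multi-step bioinformatics pipelines without elaborate custom scaffolding, but that correct high-level pipeline construction does not guarantee reliable step-level reasoning. It also notes that under strict privacy constraints open-weight models may be preferable to closed-source ones despite lower completion rates.
Link: https://arxiv.org/abs/2601.21800
Authors: Dionizije Fa,Marko Čuljak,Bruno Pandža,Mateo Čupić
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: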
Abstract:This paper introduces BioAgent Bench, a benchmark dataset and an evaluation suite designed for measuring the performance and robustness of AI agents in common bioinformatics tasks. The benchmark contains curated end-to-end tasks (e.g., RNA-seq, variant calling, metagenomics) with prompts that specify concrete output artifacts to support automated assessment, including stress testing under controlled perturbations. We evaluate frontier closed-source and open-weight models across multiple agent harnesses, and use an LLM-based grader to score pipeline progress and outcome validity. We find that frontier agents can complete multi-step bioinformatics pipelines without elaborate custom scaffolding, often producing the requested final artifacts reliably. However, robustness tests reveal failure modes under controlled perturbations (corrupted inputs, decoy files, and prompt bloat), indicating that correct high-level pipeline construction does not guarantee reliable step-level reasoning. Finally, because bioinformatics workflows may involve sensitive patient data, proprietary references, or unpublished IP, closed-source models can be unsuitable under strict privacy constraints; in such settings, open-weight models may be preferable despite lower completion rates. We release the dataset and evaluation suite publicly.
[AI-42] Effective LoRA Adapter Routing using Task Representations
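[Quick Read]: This paper addresses the adapter routing problem: how to select and compose suitable adapters for a query from the rapidly growing public pools of low-rank adaptation (LoRA) modules used for parameter-efficient specialization of large language models (LLMs). Existing approaches map queries directly to adapters, which scales poorly and can depend on adapter training data. The key to the solution is LORAUTER, a routing framework that makes routing decisions through task embeddings derived from small validation sets and needs no adapter training data. Because it routes at the task level rather than the adapter level, its cost scales with the number of tasks rather than the number of adapters, and it stays robust on very large, noisy pools of more than 1500 adapters.
Link: https://arxiv.org/abs/2601.21795
Authors: Akash Dhasade,Anne-Marie Kermarrec,Igor Pavlovic,Diana Petrescu,Rafael Pires,Mathis Randl,Martijn de Vos
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: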
Abstract:Low-rank adaptation (LoRA) enables parameter efficient specialization of large language models (LLMs) through modular adapters, resulting in rapidly growing public adapter pools spanning diverse tasks. Effectively using these adapters requires routing: selecting and composing the appropriate adapters for a query. We introduce LORAUTER, a novel routing framework that selects and composes LoRA adapters using task representations rather than adapter characteristics. Unlike existing approaches that map queries directly to adapters, LORAUTER routes queries via task embeddings derived from small validation sets and does not require adapter training data. By operating at the task level, LORAUTER achieves efficient routing that scales with the number of tasks rather than the number of adapters. Experiments across multiple tasks show that LORAUTER consistently outperforms baseline routing approaches, matching Oracle performance (101.2%) when task-aligned adapters exist and achieving state-of-the-art results on unseen tasks (+5.2 points). We further demonstrate the robustness of LORAUTER to very large, noisy adapter pools by scaling it to over 1500 adapters.
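The task-level routing idea in this abstract can be pictured with a small sketch. It is not the LORAUTER implementation: the task names, the three-dimensional embeddings, the adapter identifiers, and the cosine-similarity rule are all hypothetical choices made for illustration.

```python
import numpy as np

def route(query_emb, task_embs, task_to_adapters, top_k=1):
    """Return (task, adapters) for the top_k tasks most similar to the query.

    Assumption: similarity is plain cosine similarity between the query
    embedding and one embedding per task.
    """
    names = list(task_embs)
    mat = np.stack([task_embs[n] for n in names])            # (num_tasks, dim)
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)   # unit-norm rows
    q = query_emb / np.linalg.norm(query_emb)
    sims = mat @ q                                           # cosine similarities
    best = np.argsort(-sims)[:top_k]                         # highest first
    return [(names[i], task_to_adapters[names[i]]) for i in best]

# Hypothetical task embeddings and adapter pool.
task_embs = {
    "summarization": np.array([1.0, 0.0, 0.0]),
    "translation": np.array([0.0, 1.0, 0.0]),
    "qa": np.array([0.0, 0.0, 1.0]),
}
task_to_adapters = {
    "summarization": ["adapter_a"],
    "translation": ["adapter_b", "adapter_c"],
    "qa": ["adapter_d"],
}
query = np.array([0.1, 0.9, 0.2])
print(route(query, task_embs, task_to_adapters))
```

In this toy setup the lookup cost grows with the number of tasks, since adapters are only reached through the task they are attached to.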
[AI-43] ECSEL: Explainable Classification via Signomial Equation Learning
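[Quick Read]: This paper addresses the "black box" problem of machine learning classifiers, which lack the interpretability and transparency needed to give human-understandable grounds for their decisions. The authors propose ECSEL (Explainable Classification via Signomial Equation Learning), an explainable classification method whose central idea is to learn signomial equations directly from data and use them as both classifier and explanation. ECSEL constructs a structured closed-form expression, and that expression supports analysis of global feature behavior, decision boundaries, and local feature attributions, so interpretability improves while classification accuracy stays competitive. Experiments on benchmark datasets and two real-world case studies (e-commerce and fraud detection) show that the learned equations expose dataset biases, support counterfactual reasoning, and yield actionable insights.
Link: https://arxiv.org/abs/2601.21789
Authors: Adia Lumadjeng,Ilker Birbil,Erman Acar
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Notes: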
Abstract:We introduce ECSEL, an explainable classification method that learns formal expressions in the form of signomial equations, motivated by the observation that many symbolic regression benchmarks admit compact signomial structure. ECSEL directly constructs a structural, closed-form expression that serves as both a classifier and an explanation. On standard symbolic regression benchmarks, our method recovers a larger fraction of target equations than competing state-of-the-art approaches while requiring substantially less computation. Leveraging this efficiency, ECSEL achieves classification accuracy competitive with established machine learning models without sacrificing interpretability. Further, we show that ECSEL satisfies some desirable properties regarding global feature behavior, decision-boundary analysis, and local feature attributions. Experiments on benchmark datasets and two real-world case studies, i.e., e-commerce and fraud detection, demonstrate that the learned equations expose dataset biases, support counterfactual reasoning, and yield actionable insights.
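For readers unfamiliar with the term, a signomial is a sum of terms c_k * prod_j x_j^(a_kj) with real coefficients and real exponents over positive inputs. The sketch below only evaluates such an expression and thresholds it at zero; the thresholding rule and the example coefficients are assumptions for illustration, not ECSEL's learning procedure.

```python
import numpy as np

def signomial(x, coeffs, exponents):
    """Evaluate sum_k coeffs[k] * prod_j x[j] ** exponents[k, j] for positive x."""
    x = np.asarray(x, dtype=float)
    coeffs = np.asarray(coeffs, dtype=float)
    exponents = np.asarray(exponents, dtype=float)   # shape (num_terms, num_features)
    if np.any(x <= 0):
        raise ValueError("signomials are defined here for strictly positive inputs")
    terms = coeffs * np.prod(x ** exponents, axis=1)  # one value per term
    return float(terms.sum())

def predict(x, coeffs, exponents):
    """Hypothetical decision rule: class 1 if the signomial is positive, else 0."""
    return int(signomial(x, coeffs, exponents) > 0)

# f(x1, x2) = 2 * x1^0.5 * x2^-1 - 3 * x1
coeffs = [2.0, -3.0]
exponents = [[0.5, -1.0], [1.0, 0.0]]
print(signomial([4.0, 2.0], coeffs, exponents))  # 2*2*0.5 - 3*4 = -10.0
print(predict([4.0, 2.0], coeffs, exponents))    # 0
```

Because the whole model is one closed-form expression, each exponent can be read directly as the sensitivity of a term to a feature.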
[AI-44] Assessing the Business Process Modeling Competences of Large Language Models
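[Quick Read]: This paper addresses the lack of systematic evaluation of Business Process Model and Notation (BPMN) models generated by large language models (LLMs). Current efforts either rely on LLM-as-a-judge approaches or ignore established dimensions of model quality, so the actual quality of LLM-generated BPMN models is poorly understood. The key to the solution is BEF4LLM, an evaluation framework that assesses generated models from four perspectives (syntactic quality, pragmatic quality, semantic quality, and validity) and benchmarks open-source LLMs against human modeling experts. The results indicate that LLMs excel in syntactic and pragmatic quality while humans do better on semantic aspects, with relatively modest score differences; validity and semantic quality remain the main challenges for LLMs.
Link: https://arxiv.org/abs/2601.21787
Authors: Chantale Lauer,Peter Pfeiffer,Alexander Rombach,Nijat Mehdiyev
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Notes: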
Abstract:The creation of Business Process Model and Notation (BPMN) models is a complex and time-consuming task requiring both domain knowledge and proficiency in modeling conventions. Recent advances in large language models (LLMs) have significantly expanded the possibilities for generating BPMN models directly from natural language, building upon earlier text-to-process methods with enhanced capabilities in handling complex descriptions. However, there is a lack of systematic evaluations of LLM-generated process models. Current efforts either use LLM-as-a-judge approaches or do not consider established dimensions of model quality. To this end, we introduce BEF4LLM, a novel LLM evaluation framework comprising four perspectives: syntactic quality, pragmatic quality, semantic quality, and validity. Using BEF4LLM, we conduct a comprehensive analysis of open-source LLMs and benchmark their performance against human modeling experts. Results indicate that LLMs excel in syntactic and pragmatic quality, while humans outperform in semantic aspects; however, the differences in scores are relatively modest, highlighting LLMs’ competitive potential despite challenges in validity and semantic quality. The insights highlight current strengths and limitations of using LLMs for BPMN modeling and guide future model development and fine-tuning. Addressing these areas is essential for advancing the practical deployment of LLMs in business process modeling.
[AI-45] Abstract Concept Modelling in Conceptual Spaces: A Study on Chess Strategies
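[Quick Read]: This paper addresses how to model abstract concepts that unfold over time, in particular goal-directed strategy concepts such as "attack" or "sacrifice" in chess. The key to the solution is a framework based on conceptual spaces, in which abstract concepts are represented as geometric regions across interpretable quality dimensions and recognized through trajectory analysis: chess games are instantiated as trajectories through the space, and directional movement toward a region indicates the corresponding intended strategy. The approach also supports dual-perspective modelling, capturing how players interpret the same situation differently. In the implementation, movement patterns align with expert commentary, which supports the feasibility of trajectory-based concept recognition in sequential decision-making settings.
Link: https://arxiv.org/abs/2601.21771
Authors: Hadi Banaee,Stephanie Lowry
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: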
Abstract:We present a conceptual space framework for modelling abstract concepts that unfold over time, demonstrated through a chess-based proof-of-concept. Strategy concepts, such as attack or sacrifice, are represented as geometric regions across interpretable quality dimensions, with chess games instantiated and analysed as trajectories whose directional movement toward regions enables recognition of intended strategies. This approach also supports dual-perspective modelling, capturing how players interpret identical situations differently. Our implementation demonstrates the feasibility of trajectory-based concept recognition, with movement patterns aligning with expert commentary. This work explores extending the conceptual spaces theory to temporally realised, goal-directed concepts. The approach establishes a foundation for broader applications involving sequential decision-making and supports integration with knowledge evolution mechanisms for learning and refining abstract concepts over time.
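As a rough picture of "directional movement toward regions", the sketch below checks whether a trajectory gets strictly closer to a concept region at every step. Representing the region by a single centre point, the two quality dimensions, and the strict-decrease rule are all simplifying assumptions for illustration; the paper's regions and recognition criterion may differ.

```python
import numpy as np

def moves_toward(trajectory, centre):
    """True if every step of the trajectory reduces the distance to the centre.

    trajectory: sequence of points in the conceptual space (one per move).
    centre: hypothetical stand-in for a concept region such as "attack".
    """
    traj = np.asarray(trajectory, dtype=float)
    dists = np.linalg.norm(traj - np.asarray(centre, dtype=float), axis=1)
    return bool(np.all(np.diff(dists) < 0))

# Hypothetical quality dimensions: (piece activity, king-side pressure).
attack_centre = [3.0, 3.0]
game = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
print(moves_toward(game, attack_centre))        # True: each move approaches the region
print(moves_toward(game[::-1], attack_centre))  # False: the reversed game moves away
```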
[AI-46] Zero-Shot Statistical Downscaling via Diffusion Posterior Sampling
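[Quick Read]: This paper addresses two problems: conventional supervised climate downscaling generalizes poorly to Global Climate Models (GCMs), and existing zero-shot methods suffer from physical inconsistencies and vanishing gradients under large scaling factors. The key to the solution is Zero-Shot Statistical Downscaling (ZSSD), a zero-shot framework that learns a Physics-Consistent Climate Prior from reanalysis data, conditioned on geophysical boundaries and temporal information to enforce physical validity. It also introduces Unified Coordinate Guidance, which addresses the vanishing gradient problem of vanilla diffusion posterior sampling (DPS) and keeps inference consistent with large-scale fields across different GCMs.
Link: https://arxiv.org/abs/2601.21760
Authors: Ruian Tie,Wenbo Xiong,Zhengyu Shi,Xinyu Su,Chenyu jiang,Libo Wu,Hao Li
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: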
Abstract:Conventional supervised climate downscaling struggles to generalize to Global Climate Models (GCMs) due to the lack of paired training data and inherent domain gaps relative to reanalysis. Meanwhile, current zero-shot methods suffer from physical inconsistencies and vanishing gradient issues under large scaling factors. We propose Zero-Shot Statistical Downscaling (ZSSD), a zero-shot framework that performs statistical downscaling without paired data during training. ZSSD leverages a Physics-Consistent Climate Prior learned from reanalysis data, conditioned on geophysical boundaries and temporal information to enforce physical validity. Furthermore, to enable robust inference across varying GCMs, we introduce Unified Coordinate Guidance. This strategy addresses the vanishing gradient problem in vanilla DPS and ensures consistency with large-scale fields. Results show that ZSSD significantly outperforms existing zero-shot baselines in 99th percentile errors and successfully reconstructs complex weather events, such as tropical cyclones, across heterogeneous GCMs.
[AI-47] EWSJF: An Adaptive Scheduler with Hybrid Partitioning for Mixed-Workload LLM Inference
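[Quick Read]: This paper addresses the scheduling problem of serving large language models (LLMs) under mixed workloads, where short, latency-sensitive interactive queries compete for resources with long, throughput-oriented batch requests. Standard First-Come, First-Served (FCFS) policies suffer from head-of-line blocking, which raises tail latency and leaves hardware underutilized. The key to the solution is EWSJF (Effective Workload-based Shortest Job First), an adaptive, learning-based request-level scheduler with four components: (1) Refine-and-Prune, an unsupervised partitioning algorithm that discovers performance-homogeneous request groups; (2) Dynamic Queue Routing, which assigns requests to those groups; (3) Density-Weighted Scoring, a context-aware prioritization function that balances urgency and fairness; and (4) Bayesian Meta-Optimization, which continuously tunes scoring and partitioning parameters from live performance feedback. Implemented in vLLM, it improves end-to-end throughput by over 30% and reduces average Time-To-First-Token for short requests by up to 4x compared with FCFS.
Link: https://arxiv.org/abs/2601.21758
Authors: Bronislav Sidik,Chaya Levi,Joseph Kampeas
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Notes: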
Abstract:Serving Large Language Models (LLMs) under mixed workloads (short, latency-sensitive interactive queries alongside long, throughput-oriented batch requests) poses a fundamental scheduling challenge. Standard First-Come, First-Served (FCFS) policies suffer from severe head-of-line blocking, leading to high tail latency and underutilized hardware. We introduce EWSJF (Effective Workload-based Shortest Job First), an adaptive request-level scheduler that learns workload structure in real time to jointly improve fairness and throughput. EWSJF operates upstream of execution-level schedulers and integrates four components: (1) Refine-and-Prune, an unsupervised partitioning algorithm that discovers performance-homogeneous request groups; (2) Dynamic Queue Routing for assigning requests to these groups; (3) Density-Weighted Scoring, a context-aware prioritization function balancing urgency and fairness; and (4) Bayesian Meta-Optimization, which continuously tunes scoring and partitioning parameters based on live performance feedback. Implemented in vLLM, EWSJF improves end-to-end throughput by over 30% and reduces average Time-To-First-Token for short requests by up to 4x compared to FCFS. These results demonstrate that adaptive, learning-based request scheduling is a critical missing layer for efficient and responsive LLM serving. Implementation available at this https URL.
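The head-of-line blocking that motivates this work can be seen in a toy single-server queue. The sketch compares FCFS with plain shortest-job-first, not with EWSJF itself; the job durations are made up, and all jobs are assumed to arrive at time zero with known service times.

```python
def mean_wait(service_times):
    """Mean time jobs wait before starting, when served in the given order."""
    waited, clock = 0.0, 0.0
    for t in service_times:
        waited += clock   # this job waited for everything scheduled before it
        clock += t
    return waited / len(service_times)

# One long batch request arrives just ahead of two short interactive queries.
jobs = [10.0, 1.0, 1.0]
fcfs = mean_wait(jobs)          # waits 0, 10, 11 -> mean 7.0
sjf = mean_wait(sorted(jobs))   # waits 0, 1, 2   -> mean 1.0
print(fcfs, sjf)
```

Plain SJF can starve long jobs, which is one reason a practical scheduler would also need a fairness term such as the urgency and fairness balance the abstract mentions.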
[AI-48] Language-based Trial and Error Falls Behind in the Era of Experience
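[Quick Read]: This paper addresses the limited performance of large language models (LLMs) in unseen, nonlinguistic environments such as symbolic or spatial tasks. It argues that the primary bottleneck is not the mismatch between pretraining and test distributions but the prohibitive cost of exploration: mastering these tasks takes extensive trial and error, which is computationally unsustainable for parameter-heavy LLMs operating in a high-dimensional semantic space. The key to the solution is SCOUT (Sub-Scale Collaboration On Unseen Tasks), a framework that decouples exploration from exploitation. Lightweight "scouts" (e.g., small MLPs) probe environment dynamics at a speed and scale far beyond what LLMs can reach; the collected trajectories bootstrap the LLM through Supervised Fine-Tuning (SFT), followed by multi-turn Reinforcement Learning (RL) to activate its latent world knowledge, improving performance while reducing compute.
Link: https://arxiv.org/abs/2601.21754
Authors: Haoyu Wang,Guozheng Ma,Shugang Cui,Yilun Kong,Haotian Luo,Li Shen,Mengya Gao,Yichao Wu,Xiaogang Wang,Dacheng Tao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: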
Abstract:While Large Language Models (LLMs) excel in language-based agentic tasks, their applicability to unseen, nonlinguistic environments (e.g., symbolic or spatial tasks) remains limited. Previous work attributes this performance gap to the mismatch between the pretraining distribution and the testing distribution. In this work, we demonstrate the primary bottleneck is the prohibitive cost of exploration: mastering these tasks requires extensive trial-and-error, which is computationally unsustainable for parameter-heavy LLMs operating in a high dimensional semantic space. To address this, we propose SCOUT (Sub-Scale Collaboration On Unseen Tasks), a novel framework that decouples exploration from exploitation. We employ lightweight “scouts” (e.g., small MLPs) to probe environmental dynamics at a speed and scale far exceeding LLMs. The collected trajectories are utilized to bootstrap the LLM via Supervised Fine-Tuning (SFT), followed by multi-turn Reinforcement Learning (RL) to activate its latent world knowledge. Empirically, SCOUT enables a Qwen2.5-3B-Instruct model to achieve an average score of 0.86, significantly outperforming proprietary models, including Gemini-2.5-Pro (0.60), while saving about 60% GPU hours consumption.
[AI-49] Temporal Sepsis Modeling: a Fully Interpretable Relational Way
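[Quick Read]: This paper addresses the fact that sepsis prediction models in intensive care often lack interpretability and ignore latent patient sub-phenotypes. The key to the solution is a relational machine learning framework: temporal data from electronic medical records (EMRs) are viewed as multivariate patient logs and represented in a relational data schema; a propositionalisation technique, based on classic aggregation/selection functions from the field of relational data, builds interpretable features that "flatten" the data; and a selective naive Bayesian classifier then classifies the flattened data. Experimental validation supports the relevance of the approach and its interpretability, which is fourfold: univariate, global, local, and counterfactual.
Link: https://arxiv.org/abs/2601.21747
Authors: Vincent Lemaire,Nédra Meloulli,Pierre Jaquet
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: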
Abstract:Sepsis remains one of the most complex and heterogeneous syndromes in intensive care, characterized by diverse physiological trajectories and variable responses to treatment. While deep learning models perform well in the early prediction of sepsis, they often lack interpretability and ignore latent patient sub-phenotypes. In this work, we propose a machine learning framework by opening up a new avenue for addressing this issue: a relational approach. Temporal data from electronic medical records (EMRs) are viewed as multivariate patient logs and represented in a relational data schema. Then, a propositionalisation technique (based on classic aggregation/selection functions from the field of relational data) is applied to construct interpretable features to “flatten” the data. Finally, the flattened data is classified using a selective naive Bayesian classifier. Experimental validation demonstrates the relevance of the suggested approach as well as its extreme interpretability. The interpretation is fourfold: univariate, global, local, and counterfactual.
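The "flattening" step can be pictured as turning a per-patient log into one row of named aggregates. The sketch below is a generic illustration of propositionalisation, not the authors' pipeline: the log layout, the variable names, and the choice of mean, max, and count as aggregation functions are assumptions.

```python
from collections import defaultdict
from statistics import mean

def flatten(logs):
    """Aggregate (patient_id, variable, value) records into one feature row per patient."""
    grouped = defaultdict(list)
    for pid, var, val in logs:
        grouped[(pid, var)].append(val)
    table = defaultdict(dict)
    for (pid, var), vals in grouped.items():
        # Each feature name states exactly how it was computed, which keeps it readable.
        table[pid][f"mean({var})"] = mean(vals)
        table[pid][f"max({var})"] = max(vals)
        table[pid][f"count({var})"] = len(vals)
    return dict(table)

# Hypothetical patient log.
logs = [
    ("p1", "heart_rate", 80),
    ("p1", "heart_rate", 120),
    ("p1", "lactate", 2.5),
    ("p2", "heart_rate", 70),
]
flat = flatten(logs)
print(flat["p1"])
```

Patients with no measurements of a variable simply lack those columns here; a real pipeline would need an explicit policy for such missing values.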
[AI-50] Why Adam Works Better with β_1 = β_2: The Missing Gradient Scale Invariance Principle
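[Quick Read]: This paper addresses an open question about Adam: in large-scale training, validation scores and training behaviour improve when the momentum parameters satisfy β_1 = β_2, yet this has lacked a theoretical explanation. The key to the solution is to formalize a structural property called gradient scale invariance and to prove that Adam is gradient scale invariant of first order if and only if β_1 = β_2. This places the balanced regime of Adam in line with the design principles of several recent optimizers that explicitly enforce scale-robust updates, giving a coherent explanation of the empirically observed behaviour and a simple principle to guide future optimizer design.
Link: https://arxiv.org/abs/2601.21739
Authors: Alberto Fernández-Hernández,Cristian Pérez-Corral,Jose I. Mestre,Manuel F. Dolz,Enrique S. Quintana-Ortí
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Notes: 23 pages, 8 figures. Preprint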
Abstract:Adam has been at the core of large-scale training for almost a decade, yet a simple empirical fact remains unaccounted for: both validation scores and the qualitative behaviour of the training runs improve when the momentum parameters satisfy β_1 = β_2. Some recent studies have reported this pattern, but there is still no explanation for why this choice helps. We show that this choice is closely tied to a structural property that we refer to as gradient scale invariance. We formalize this notion and prove that Adam becomes gradient scale invariant of first order if and only if β_1 = β_2. This perspective places the balanced regime of Adam in direct alignment with the design principles underlying several recent optimizers that explicitly enforce scale-robust updates. The theory is supported by experiments across vision and language tasks, and across different architectural families, in which rescaling the gradient has a markedly smoother effect on the update when β_1 = β_2. Overall, our results offer a coherent explanation for an open question in the behavior of Adam and provide a simple principle that helps guide the design of future optimizers.
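As background for the notion of scale invariance, the sketch below implements the textbook Adam update and checks a well-known, weaker property: with eps = 0, multiplying every gradient by the same positive constant leaves the updates unchanged. This holds for any β_1 and β_2 and is not the first-order gradient scale invariance the paper formalizes, whose definition is not given in the abstract; the function name and test gradients are hypothetical.

```python
import numpy as np

def adam_updates(grads, lr=1e-3, beta1=0.9, beta2=0.9, eps=0.0):
    """Return the sequence of Adam parameter updates for a list of gradients.

    Assumption: eps defaults to 0 here (not the usual 1e-8) so that the
    invariance to constant rescaling is exact up to rounding.
    """
    m = np.zeros_like(grads[0])
    v = np.zeros_like(grads[0])
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g ** 2   # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        updates.append(-lr * m_hat / (np.sqrt(v_hat) + eps))
    return updates

rng = np.random.default_rng(0)
grads = [rng.normal(size=4) for _ in range(5)]
base = adam_updates(grads)
scaled = adam_updates([100.0 * g for g in grads])
print(all(np.allclose(a, b) for a, b in zip(base, scaled)))
```

The case the paper studies is the harder one, where the gradient scale changes during training rather than being multiplied by one fixed constant throughout.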
[AI-51] DropoutTS: Sample-Adaptive Dropout for Robust Time Series Forecasting
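[Quick Read]: This paper addresses the limited robustness of deep time series models to the noisy data common in real-world applications. Existing strategies either prune data or rely on costly prior quantification, and fail to balance effectiveness and efficiency. The key to the solution is DropoutTS, a model-agnostic plugin that shifts the focus from "what" to learn to "how much" to learn. Its core is a Sample-Adaptive Dropout mechanism: it exploits spectral sparsity to quantify instance-level noise efficiently via reconstruction residuals, then maps the noise level to an adaptive dropout rate that calibrates model learning capacity, suppressing spurious fluctuations while preserving fine-grained fidelity. Experiments across diverse noise regimes and open benchmarks show consistent gains for backbone models with negligible parameter overhead and no architectural modifications.
Link: https://arxiv.org/abs/2601.21726
Authors: Siru Zhong,Yiqiu Liu,Zhiqing Cui,Zezhi Shao,Fei Wang,Qingsong Wen,Yuxuan Liang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: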
Abstract:Deep time series models are vulnerable to noisy data ubiquitous in real-world applications. Existing robustness strategies either prune data or rely on costly prior quantification, failing to balance effectiveness and efficiency. In this paper, we introduce DropoutTS, a model-agnostic plugin that shifts the paradigm from “what” to learn to “how much” to learn. DropoutTS employs a Sample-Adaptive Dropout mechanism: leveraging spectral sparsity to efficiently quantify instance-level noise via reconstruction residuals, it dynamically calibrates model learning capacity by mapping noise to adaptive dropout rates - selectively suppressing spurious fluctuations while preserving fine-grained fidelity. Extensive experiments across diverse noise regimes and open benchmarks show DropoutTS consistently boosts superior backbones’ performance, delivering advanced robustness with negligible parameter overhead and no architectural modifications. Our code is available at this https URL.
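One way to read "quantify instance-level noise via reconstruction residuals" is sketched below: keep only the strongest frequency components of a series, reconstruct it, and treat the leftover energy as a noise score that sets the dropout rate. This is a guess at the general shape of the mechanism, not the DropoutTS code; the function names, the top-k rule, the linear mapping, and the rate bounds are all assumptions.

```python
import numpy as np

def noise_score(x, keep=3):
    """Fraction of signal energy not explained by the `keep` largest frequency bins.

    Assumes x is a non-zero 1-D series.
    """
    spec = np.fft.rfft(x)
    top = np.argsort(np.abs(spec))[-keep:]     # indices of the strongest bins
    sparse = np.zeros_like(spec)
    sparse[top] = spec[top]
    x_hat = np.fft.irfft(sparse, n=len(x))     # sparse-spectrum reconstruction
    return float(np.sum((x - x_hat) ** 2) / np.sum(x ** 2))

def dropout_rate(x, p_min=0.05, p_max=0.5, keep=3):
    """Hypothetical linear map from noise score to a per-sample dropout rate."""
    score = min(1.0, noise_score(x, keep))
    return p_min + (p_max - p_min) * score

t = np.arange(128)
clean = np.sin(2 * np.pi * 4 * t / 128)        # exactly 4 cycles in the window
rng = np.random.default_rng(0)
noisy = clean + rng.normal(scale=0.5, size=t.size)
print(dropout_rate(clean), dropout_rate(noisy))
```

In this toy case the clean sinusoid is reconstructed almost exactly and receives roughly the minimum rate, while the noisy copy receives a higher one.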
[AI-52] When does predictive inverse dynamics outperform behavior cloning?
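[Quick Read]: This paper addresses the performance drop of behavior cloning (BC) when expert demonstrations are limited, and explains why predictive inverse dynamics models (PIDM) often do better. A PIDM combines a future state predictor with an inverse dynamics model (IDM), and the paper shows that this introduces a bias-variance tradeoff: predicting the future state introduces bias, but conditioning the IDM on the prediction can significantly reduce variance. The paper establishes conditions on the state predictor bias under which PIDM achieves lower prediction error and higher sample efficiency than BC. In 2D navigation tasks, BC needs up to five times (three times on average) more demonstrations than PIDM to reach comparable performance; in a complex 3D video game environment with high-dimensional visual inputs and stochastic transitions, BC needs over 66% more samples than PIDM.
Link: https://arxiv.org/abs/2601.21718
Authors: Lukas Schäfer,Pallavi Choudhury,Abdelhak Lemkhenter,Chris Lovett,Somjit Nath,Luis França,Matheus Ribeiro Furtado de Mendonça,Alex Lamb,Riashat Islam,Siddhartha Sen,John Langford,Katja Hofmann,Sergio Valcarcel Macua
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: Preprint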
Abstract:Behavior cloning (BC) is a practical offline imitation learning method, but it often fails when expert demonstrations are limited. Recent works have introduced a class of architectures named predictive inverse dynamics models (PIDM) that combine a future state predictor with an inverse dynamics model (IDM). While PIDM often outperforms BC, the reasons behind its benefits remain unclear. In this paper, we provide a theoretical explanation: PIDM introduces a bias-variance tradeoff. While predicting the future state introduces bias, conditioning the IDM on the prediction can significantly reduce variance. We establish conditions on the state predictor bias for PIDM to achieve lower prediction error and higher sample efficiency than BC, with the gap widening when additional data sources are available. We validate the theoretical insights empirically in 2D navigation tasks, where BC requires up to five times (three times on average) more demonstrations than PIDM to reach comparable performance; and in a complex 3D environment in a modern video game with high-dimensional visual inputs and stochastic transitions, where BC requires over 66% more samples than PIDM.
[AI-53] E-mem: Multi-agent based Episodic Context Reconstruction for LLM Agent Memory
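[Quick Read]: This paper addresses the destructive de-contextualization caused by prevalent memory preprocessing paradigms as large language model (LLM) agents move toward System 2 reasoning. By compressing complex sequential dependencies into pre-defined structures (e.g., embeddings or graphs), those methods sever the contextual integrity that deep reasoning depends on. The key to the solution is E-mem, a framework that shifts from memory preprocessing to Episodic Context Reconstruction. Inspired by biological engrams, it uses a heterogeneous hierarchical architecture in which multiple assistant agents maintain uncompressed memory contexts while a central master agent orchestrates global planning. Unlike passive retrieval, the assistants reason locally within activated segments and extract context-aware evidence before aggregation, which raises performance while cutting token cost.
Link: https://arxiv.org/abs/2601.21714
Authors: Kaixiang Wang,Yidan Lin,Jiong Lou,Zhaojiacheng Zhou,Bunyod Suvonov,Jie Li
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: 18 pages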
Abstract:The evolution of Large Language Model (LLM) agents towards System 2 reasoning, characterized by deliberative, high-precision problem-solving, requires maintaining rigorous logical integrity over extended horizons. However, prevalent memory preprocessing paradigms suffer from destructive de-contextualization. By compressing complex sequential dependencies into pre-defined structures (e.g., embeddings or graphs), these methods sever the contextual integrity essential for deep reasoning. To address this, we propose E-mem, a framework shifting from Memory Preprocessing to Episodic Context Reconstruction. Inspired by biological engrams, E-mem employs a heterogeneous hierarchical architecture where multiple assistant agents maintain uncompressed memory contexts, while a central master agent orchestrates global planning. Unlike passive retrieval, our mechanism empowers assistants to locally reason within activated segments, extracting context-aware evidence before aggregation. Evaluations on the LoCoMo benchmark demonstrate that E-mem achieves over 54% F1, surpassing the state-of-the-art GAM by 7.75%, while reducing token cost by over 70%.
[AI-54] Disentangling perception and reasoning for improving data efficiency in learning cloth manipulation without demonstrations
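[Quick Read]: This paper addresses the high-dimensional state space, complex dynamics, and self-occlusion that make cloth manipulation hard for robots and that have kept analytical methods from yielding robust, general manipulation policies. Existing data-driven methods usually rely on large models and long training times, which makes them computationally expensive; and because state estimation is difficult, most adopt end-to-end learning from workspace images, which simplifies sim-to-real transfer but adds computational cost by training on a highly lossy representation of the environment state. The key to the solution is an efficient, modular reinforcement learning (RL) approach: through careful design choices, model size and training time are significantly reduced when learning in simulation, and the simulation-trained model is shown to transfer to the real world. On the SoftGym benchmark, the approach outperforms available baselines while using a substantially smaller model.
Link: https://arxiv.org/abs/2601.21713
Authors: Donatien Delehelle,Fei Chen,Darwin Caldwell
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Notes: 6 pages, 4 figures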
Abstract:Cloth manipulation is a ubiquitous task in everyday life, but it remains an open challenge for robotics. The difficulties in developing cloth manipulation policies are attributed to the high-dimensional state space, complex dynamics, and high propensity to self-occlusion exhibited by fabrics. As analytical methods have not been able to provide robust and general manipulation policies, reinforcement learning (RL) is considered a promising approach to these problems. However, to address the large state space and complex dynamics, data-based methods usually rely on large models and long training times. The resulting computational cost significantly hampers the development and adoption of these methods. Additionally, due to the challenge of robust state estimation, garment manipulation policies often adopt an end-to-end learning approach with workspace images as input. While this approach enables a conceptually straightforward sim-to-real transfer via real-world fine-tuning, it also incurs a significant computational cost by training agents on a highly lossy representation of the environment state. This paper questions this common design choice by exploring an efficient and modular approach to RL for cloth manipulation. We show that, through careful design choices, model size and training time can be significantly reduced when learning in simulation. Furthermore, we demonstrate how the resulting simulation-trained model can be transferred to the real world. We evaluate our approach on the SoftGym benchmark and achieve significant performance improvements over available baselines on our task, while using a substantially smaller model.
[AI-55] Curriculum Learning for LLM Pretraining: An Analysis of Learning Dynamics
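[Quick Read]: This paper asks whether curriculum learning changes a model's learning trajectory or mainly reorders data exposure over a fixed trajectory. The authors train Pythia models (14M-410M parameters) under three linguistically motivated curricula (Age-of-Acquisition, word frequency, and Verb Variation) and compare each with random ordering. All orderings follow a shared sequence of latent phases, and curricula mainly change data exposure within phases. The key finding is that curricula help by stabilizing within-phase optimization (lower gradient noise and less output-head spectral saturation) rather than by creating new phases, which raises final accuracy in smaller models; at larger scales the saturation differences and the curriculum gains shrink.
Link: https://arxiv.org/abs/2601.21698
Authors: Mohamed Elgaar,Hadi Amiri
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: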
Abstract:Curriculum learning changes the order of pre-training data, but it remains unclear whether it changes the learning trajectory or mainly reorders exposure over a fixed trajectory. We train Pythia models (14M-410M parameters) for 300B tokens under three linguistically motivated curricula, namely Age-of-Acquisition, word frequency, and Verb Variation (VV), and compare each against Random ordering; at 1B parameters we compare Random and VV. Across orderings, training follows a shared sequence of latent phases, while curricula mainly change within-phase data exposure. In smaller models (up to 160M parameters), Random ordering exhibits higher gradient noise and stronger late-training output-head spectral saturation, alongside lower final accuracy; curricula reduce both effects at matched compute. At larger scales, saturation differences are smaller and curriculum gains shrink. We formalize the link between difficulty pacing and optimization stability in an idealized analysis based on gradient-variance control, and our results point to a practical takeaway: curricula help by stabilizing within-phase optimization rather than by creating new phases.
[AI-56] TCAP: Tri-Component Attention Profiling for Unsupervised Backdoor Detection in MLLM Fine-Tuning
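[Quick read]: This paper addresses the backdoor risk introduced by poisoned data when Multimodal Large Language Models (MLLMs) are customized through Fine-Tuning-as-a-Service (FTaaS). Existing defenses either depend on supervised signals or fail to generalize across trigger types and modalities. The key insight is a universal backdoor fingerprint, attention allocation divergence: poisoned samples disrupt the balanced distribution of attention across three functional components (system instructions, vision inputs, and user textual queries) regardless of trigger morphology. Building on it, the authors propose Tri-Component Attention Profiling (TCAP), an unsupervised defense that decomposes cross-modal attention maps, identifies trigger-responsive attention heads with a Gaussian Mixture Model (GMM), and isolates poisoned samples through Expectation-Maximization (EM) based vote aggregation, filtering backdoor samples across a range of attack scenarios.
Link: https://arxiv.org/abs/2601.21692
Authors: Mingzu Liu, Hao Fang, Runmin Cong
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Notes: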
Abstract:Fine-Tuning-as-a-Service (FTaaS) facilitates the customization of Multimodal Large Language Models (MLLMs) but introduces critical backdoor risks via poisoned data. Existing defenses either rely on supervised signals or fail to generalize across diverse trigger types and modalities. In this work, we uncover a universal backdoor fingerprint, attention allocation divergence, in which poisoned samples disrupt the balanced attention distribution across three functional components: system instructions, vision inputs, and user textual queries, regardless of trigger morphology. Motivated by this insight, we propose Tri-Component Attention Profiling (TCAP), an unsupervised defense framework to filter backdoor samples. TCAP decomposes cross-modal attention maps into the three components, identifies trigger-responsive attention heads via Gaussian Mixture Model (GMM) statistical profiling, and isolates poisoned samples through EM-based vote aggregation. Extensive experiments across diverse MLLM architectures and attack methods demonstrate that TCAP achieves consistently strong performance, establishing it as a robust and practical backdoor defense in MLLMs.
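The first step, splitting attention into three components, can be sketched as follows. This is a minimal illustration assuming a single attention row and hypothetical token spans; the function name `allocation_profile` is invented here, both attention rows are synthetic rather than measured, and the GMM profiling and EM voting stages are not reproduced.

```python
import numpy as np

def allocation_profile(attn_row, spans):
    """Share of one query token's attention mass that falls on each component.

    attn_row: 1-D array of non-negative attention weights over the input tokens.
    spans: dict mapping component name -> (start, end) half-open token range.
    """
    total = sum(attn_row[s:e].sum() for s, e in spans.values())
    return {name: float(attn_row[s:e].sum() / total) for name, (s, e) in spans.items()}

# Hypothetical layout: 4 system tokens, 8 vision tokens, 4 user-text tokens.
spans = {"system": (0, 4), "vision": (4, 12), "text": (12, 16)}

balanced = np.full(16, 1.0 / 16)          # synthetic row with uniform attention
skewed = balanced.copy()
skewed[12:16] *= 4.0                      # synthetic row that over-attends to the text span
skewed /= skewed.sum()

profile_balanced = allocation_profile(balanced, spans)
profile_skewed = allocation_profile(skewed, spans)
```

A detector in the spirit of the abstract would collect such three-way profiles per attention head across the fine-tuning set and look for samples whose profile departs from the bulk.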
[AI-57] XFACTORS: Disentangled Information Bottleneck via Contrastive Supervision
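[Quick read]: This paper addresses disentangled representation learning in generative AI: mapping independent factors of variation in the data to independent representation components so that specific semantic factors can be controlled explicitly. Existing methods are limited: purely unsupervised approaches struggle to recover semantic factors from real data, while supervised approaches rely on adversarial objectives or auxiliary classifiers, which makes them unstable and hard to scale to large attribute sets. The paper proposes XFactors, a weakly supervised variational autoencoder (VAE) framework. Building on the Disentangled Information Bottleneck perspective, it splits the latent space into a residual subspace S and K factor-specific subspaces T1,…,TK. Contrastive supervision with an InfoNCE loss pulls together latents that share a factor value and pushes apart mismatched pairs, while KL regularization imposes a Gaussian structure on S and on the aggregated factor subspaces. This organizes the latent geometry without extra supervision and avoids adversarial training and classifiers, giving stable, scalable, and controllable disentanglement.
Link: https://arxiv.org/abs/2601.21688
Authors: Alexandre Myara, Nicolas Bourriez, Thomas Boyer, Thomas Lemercier, Ihab Bendidi, Auguste Genovesio
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: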
Abstract:Disentangled representation learning aims to map independent factors of variation to independent representation components. On one hand, purely unsupervised approaches have proven successful on fully disentangled synthetic data, but fail to recover semantic factors from real data without strong inductive biases. On the other hand, supervised approaches are unstable and hard to scale to large attribute sets because they rely on adversarial objectives or auxiliary classifiers. We introduce XFactors, a weakly-supervised VAE framework that disentangles and provides explicit control over a chosen set of factors. Building on the Disentangled Information Bottleneck perspective, we decompose the representation into a residual subspace $\mathcal{S}$ and factor-specific subspaces $\mathcal{T}_1,\ldots,\mathcal{T}_K$. Each target factor is encoded in its assigned $\mathcal{T}_i$ through contrastive supervision: an InfoNCE loss pulls together latents sharing the same factor value and pushes apart mismatched pairs. In parallel, KL regularization imposes a Gaussian structure on both $\mathcal{S}$ and the aggregated factor subspaces, organizing the geometry without additional supervision for non-targeted factors and avoiding adversarial training and classifiers. Across multiple datasets, with constant hyperparameters, XFactors achieves state-of-the-art disentanglement scores and yields consistent qualitative factor alignment in the corresponding subspaces, enabling controlled factor swapping via latent replacement. We further demonstrate that our method scales correctly with increasing latent capacity and evaluate it on the real-world dataset CelebA. Our code is available at this https URL.
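The contrastive term can be illustrated with a small NumPy sketch of an InfoNCE loss in which positives are latents that share a factor value. The cosine similarity, the temperature `tau`, and the function name `factor_info_nce` are assumptions for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def factor_info_nce(z, labels, tau=0.5):
    """InfoNCE over L2-normalized latents; positives share the same factor value.

    z: (n, d) latents from one factor-specific subspace.
    labels: length-n sequence of factor values.
    Every anchor needs at least one positive elsewhere in the batch.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)              # an anchor is never its own candidate
    log_denom = np.log(np.exp(sim).sum(axis=1))
    losses = []
    for i in range(len(z)):
        for j in range(len(z)):
            if i != j and labels[i] == labels[j]:
                # -log softmax probability of picking positive j among all candidates
                losses.append(log_denom[i] - sim[i, j])
    return float(np.mean(losses))

z = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
aligned = factor_info_nce(z, [0, 0, 1, 1])      # latents cluster by factor value
mismatched = factor_info_nce(z, [0, 1, 0, 1])   # same latents, labels that cut across clusters
```

The loss is low when latents with equal factor values sit together and high when they do not, which is the pull-together, push-apart behavior the abstract attributes to the contrastive supervision.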
[AI-58] Expected Return Causes Outcome-Level Mode Collapse in Reinforcement Learning and How to Fix It with Inverse Probability Scaling
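[Quick read]: This paper addresses outcome-level mode collapse in reinforcement learning (RL) caused by the expected-return maximization objective: when several high-quality terminal solutions exist, standard policy training tends to converge to a few of them instead of covering the diverse set. The authors argue that this is not only a matter of insufficient exploration or weak regularization but a structural defect of the expected-return objective itself. Under idealized learning dynamics, the log-probability ratio between any two outcomes evolves linearly in their reward difference, so the ratio diverges exponentially and collapse is inevitable. The key step is to locate the source of the problem in the probability multiplier inside the expectation, which amplifies outcome frequency, and to propose a minimal correction, inverse probability scaling (IPS). IPS removes the outcome-frequency amplification from the learning signal, changes the learning dynamics, and makes the final policy distribution proportional to reward, preventing collapse in multimodal settings. Implemented as IPS-GRPO, it needs no auxiliary models or architectural changes and, on reasoning and molecular generation tasks, reduces mode collapse while matching or exceeding baseline performance.
Link: https://arxiv.org/abs/2601.21669
Authors: Abhijeet Sinha, Sundari Elango, Dianbo Liu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: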
Abstract:Many reinforcement learning (RL) problems admit multiple terminal solutions of comparable quality, where the goal is not to identify a single optimum but to represent a diverse set of high-quality outcomes. Nevertheless, policies trained by standard expected return maximization routinely collapse onto a small subset of outcomes, a phenomenon commonly attributed to insufficient exploration or weak regularization. We show that this explanation is incomplete: outcome-level mode collapse is a structural consequence of the expected-return objective itself. Under idealized learning dynamics, the log-probability ratio between any two outcomes evolves linearly in their reward difference, implying exponential ratio divergence and inevitable collapse independent of the exploration strategy, entropy regularization, or optimization algorithm. We identify the source of this pathology as the probability multiplier inside the expectation and propose a minimal correction: inverse probability scaling, which removes outcome-frequency amplification from the learning signal, fundamentally changes the learning dynamics, and provably yields reward-proportional terminal distributions, preventing collapse in multimodal settings. We instantiate this principle in Group Relative Policy Optimization (GRPO) as a drop-in modification, IPS-GRPO, requiring no auxiliary models or architectural changes. Across different reasoning and molecular generation tasks, IPS-GRPO consistently reduces outcome-level mode collapse while matching or exceeding baseline performance, suggesting that correcting the objective rather than adding exploration heuristics is key to reliable multimodal policy optimization.
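The contrast between the two learning signals can be reproduced in a three-outcome softmax bandit with exact (expected) gradients. This is a sketch under assumptions: the softmax parameterization, the rewards, the step size, and the reading of "inverse probability scaling" as dividing each outcome's contribution by its probability are illustrative choices, not the IPS-GRPO implementation.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def run(rewards, inverse_probability_scaling, lr=0.2, steps=20000):
    """Exact gradient ascent on sum_i w_i * log p_i with the weights w held fixed per step."""
    theta = np.zeros_like(rewards)
    for _ in range(steps):
        p = softmax(theta)
        # Weight on grad log p_i: p_i * r_i under expected return, r_i once p_i is divided out.
        w = rewards if inverse_probability_scaling else p * rewards
        # For a softmax policy, d/d theta_j of sum_i w_i log p_i equals w_j - p_j * sum(w).
        theta = theta + lr * (w - p * w.sum())
    return softmax(theta)

rewards = np.array([1.0, 0.9, 0.8])          # three outcomes of comparable quality
collapsed = run(rewards, inverse_probability_scaling=False)
proportional = run(rewards, inverse_probability_scaling=True)
```

With the probability multiplier in place, the policy concentrates on the single best outcome even though the others are nearly as good; without it, the fixed point satisfies `p_j = r_j / sum(r)`, the reward-proportional distribution the abstract mentions.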
[AI-59] SENDAI: A Hierarchical Sparse-measurement EfficieNt Data AssImilation Framework
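[Quick read]: This paper addresses spatiotemporal field reconstruction when training data are plentiful but observations at deployment are sparse, especially when the target domain shows distributional shifts, heterogeneous structure, and multi-scale dynamics that conventional methods cannot handle. The key contribution is SENDAI, a hierarchical sparse-measurement, efficient data assimilation framework that reconstructs full spatial states from hyper-sparse sensor observations by combining simulation-derived priors with learned discrepancy corrections. It outperforms traditional baselines that require denser observations (maximum SSIM improvement of 185%) and recent high-frequency-based methods (36% improvement), and it better preserves diagnostically relevant spatial structures such as field topologies, land cover discontinuities, and spatial gradients, which improves downstream inference of indirectly observed variables.
Link: https://arxiv.org/abs/2601.21664
Authors: Xingyue Zhang, Yuxuan Bao, Mars Liyao Gao, J. Nathan Kutz
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Notes: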
Abstract:Bridging the gap between data-rich training regimes and observation-sparse deployment conditions remains a central challenge in spatiotemporal field reconstruction, particularly when target domains exhibit distributional shifts, heterogeneous structure, and multi-scale dynamics absent from available training data. We present SENDAI, a hierarchical Sparse-measurement, EfficieNt Data AssImilation Framework that reconstructs full spatial states from hyper-sparse sensor observations by combining simulation-derived priors with learned discrepancy corrections. We demonstrate the performance on satellite remote sensing, reconstructing MODIS (Moderate Resolution Imaging Spectroradiometer) derived vegetation index fields across six globally distributed sites. Using seasonal periods as a proxy for domain shift, the framework consistently outperforms established baselines that require substantially denser observations: SENDAI achieves a maximum SSIM improvement of 185% over traditional baselines and a 36% improvement over recent high-frequency-based methods. These gains are particularly pronounced for landscapes with sharp boundaries and sub-seasonal dynamics; more importantly, the framework effectively preserves diagnostically relevant structures, such as field topologies, land cover discontinuities, and spatial gradients. By yielding corrections that are more structurally and spectrally separable, the reconstructed fields are better suited for downstream inference of indirectly observed variables. The results therefore highlight a lightweight and operationally viable framework for sparse-measurement reconstruction that is applicable to physically grounded inference, resource-limited deployment, and real-time monitoring and control.
[AI-60] ScholarGym: Benchmarking Deep Research Workflows on Academic Literature Retrieval
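[Quick read]: This paper addresses the non-determinism that arises when tool-augmented large language models run deep research workflows against live APIs, which makes results irreproducible and invalidates cross-system comparisons. The key contribution is ScholarGym, a simulation environment for reproducible evaluation of deep research workflows on academic literature. It decouples the workflow into three independent modules (query planning, tool invocation, and relevance assessment) and provides deterministic retrieval over a static corpus of 570K papers, enabling fine-grained analysis of each stage under controlled conditions. It also supplies 2,536 queries with expert-annotated ground truth for measuring how reasoning capabilities, planning strategies, and selection mechanisms interact across models.
Link: https://arxiv.org/abs/2601.21654
Authors: Hao Shen, Hang Yang, Zhouhong Gu
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Notes: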
Abstract:Tool-augmented large language models have advanced from single-turn question answering to deep research workflows that iteratively plan queries, invoke external tools, and synthesize information to address complex information needs. Evaluating such workflows presents a fundamental challenge: reliance on live APIs introduces non-determinism, as tool invocations may yield different results across runs due to temporal drift, rate limiting, and evolving backend states. This variance undermines reproducibility and invalidates cross-system comparisons. We present ScholarGym, a simulation environment for reproducible evaluation of deep research workflows on academic literature. The environment decouples workflow components into query planning, tool invocation, and relevance assessment, enabling fine-grained analysis of each stage under controlled conditions. Built on a static corpus of 570K papers with deterministic retrieval, ScholarGym provides 2,536 queries with expert-annotated ground truth. Experiments across diverse backbone models reveal how reasoning capabilities, planning strategies, and selection mechanisms interact over iterative refinement.
[AI-61] Gauge-invariant representation holonomy ICLR
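[Quick read]: This paper addresses the fact that existing similarity measures such as CKA or SVCCA do not capture how a deep network's representations change along paths in input space, so differences in how models respond to perturbations or adversarial attacks go unnoticed. The key contribution is representation holonomy, a gauge-invariant statistic that quantifies the "twist" accumulated when features are parallel-transported around a small loop in input space: flat representations have zero holonomy, while nonzero values reveal hidden geometric curvature. The estimator fixes the gauge through global whitening, aligns neighborhoods using shared subspaces and rotation-only Procrustes, and embeds the result back into the full feature space, giving a finer diagnostic of the geometry of learned representations.
Link: https://arxiv.org/abs/2601.21653
Authors: Vasileios Sevetlidis, George Pavlidis
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 14th International Conference on Learning Representations (ICLR)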
Abstract:Deep networks learn internal representations whose geometry (how features bend, rotate, and evolve) affects both generalization and robustness. Existing similarity measures such as CKA or SVCCA capture pointwise overlap between activation sets, but miss how representations change along input paths. Two models may appear nearly identical under these metrics yet respond very differently to perturbations or adversarial stress. We introduce representation holonomy, a gauge-invariant statistic that measures this path dependence. Conceptually, holonomy quantifies the “twist” accumulated when features are parallel-transported around a small loop in input space: flat representations yield zero holonomy, while nonzero values reveal hidden curvature. Our estimator fixes gauge through global whitening, aligns neighborhoods using shared subspaces and rotation-only Procrustes, and embeds the result back to the full feature space. We prove invariance to orthogonal (and affine, post-whitening) transformations, establish a linear null for affine layers, and show that holonomy vanishes at small radii. Empirically, holonomy increases with loop radius, separates models that appear similar under CKA, and correlates with adversarial and corruption robustness. It also tracks training dynamics as features form and stabilize. Together, these results position representation holonomy as a practical and scalable diagnostic for probing the geometric structure of learned representations beyond pointwise similarity.
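A stripped-down version of the loop computation can be sketched with rotation-only Procrustes alignment between neighboring points on a circle in input space. This is a simplified illustration, not the authors' estimator: it omits global whitening and the shared-subspace step, assumes the feature dimension equals the input dimension, and uses invented names (`rotation_procrustes`, `loop_holonomy`).

```python
import numpy as np

def rotation_procrustes(a, b):
    """Rotation R (det = +1) minimizing ||a @ R - b||_F."""
    u, _, vt = np.linalg.svd(a.T @ b)
    if np.linalg.det(u @ vt) < 0:          # reflection found: flip the weakest direction
        u[:, -1] *= -1
    return u @ vt

def loop_holonomy(phi, center, radius, offsets, n_points=16):
    """Frobenius distance from identity of the rotations composed around a circular loop."""
    angles = np.linspace(0.0, 2 * np.pi, n_points, endpoint=False)
    feats = []
    for t in angles:
        x = center.copy()
        x[0] += radius * np.cos(t)         # the loop lies in the first two input coordinates
        x[1] += radius * np.sin(t)
        f = phi(x + offsets)               # features of a small neighborhood around x
        feats.append(f - f.mean(axis=0))
    total = np.eye(feats[0].shape[1])
    for k in range(n_points):
        total = total @ rotation_procrustes(feats[k], feats[(k + 1) % n_points])
    return float(np.linalg.norm(total - np.eye(total.shape[0])))

rng = np.random.default_rng(0)
A = np.eye(3) + 0.1 * rng.standard_normal((3, 3))
offsets = 0.05 * rng.standard_normal((20, 3))
center = np.zeros(3)
h_affine = loop_holonomy(lambda x: x @ A + 1.0, center, 0.5, offsets)
h_tanh = loop_holonomy(lambda x: np.tanh(x @ A), center, 0.5, offsets)
```

For the affine map every neighborhood has identical centered features, so each alignment is the identity and the estimate is zero up to rounding, consistent with the linear null stated in the abstract. The `tanh` map is included only to show how a nonlinear feature map would be passed in.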
[AI-62] When Life Gives You AI Will You Turn It Into A Market for Lemons? Understanding How Information Asymmetries About AI System Capabilities Affect Market Outcomes and Adoption
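[Quick read]: This paper addresses barriers to user adoption in AI consumer markets caused by information asymmetry between buyers and sellers, particularly when complex AI systems may hide defects or make costly errors. The key element is an experiment that systematically varies the density of low-quality AI systems and the depth of disclosure requirements to test whether partial disclosure improves user decision-making. The study finds that information asymmetry significantly suppresses AI adoption, but well-designed partial disclosure can improve the quality of human decisions and thereby overall market efficiency.
Link: https://arxiv.org/abs/2601.21650
Authors: Alexander Erlei, Federico Cau, Radoslav Georgiev, Sagar Kumar, Kilian Bizer, Ujwal Gadiraju
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Notes: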
Abstract:AI consumer markets are characterized by severe buyer-supplier market asymmetries. Complex AI systems can appear highly accurate while making costly errors or embedding hidden defects. While there have been regulatory efforts surrounding different forms of disclosure, large information gaps remain. This paper provides the first experimental evidence on the important role of information asymmetries and disclosure designs in shaping user adoption of AI systems. We systematically vary the density of low-quality AI systems and the depth of disclosure requirements in a simulated AI product market to gauge how people react to the risk of accidentally relying on a low-quality AI system. Then, we compare participants’ choices to a rational Bayesian model, analyzing the degree to which partial information disclosure can improve AI adoption. Our results underscore the deleterious effects of information asymmetries on AI adoption, but also highlight the potential of partial disclosure designs to improve the overall efficiency of human decision-making.
[AI-63] Seg-MoE: Multi-Resolution Segment-wise Mixture-of-Experts for Time Series Forecasting Transformers
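[Quick read]: This paper addresses the tension in Transformer-based time-series forecasting between scalability and capturing long-term temporal dynamics, that is, how to process long sequences efficiently while keeping accuracy high. Existing methods have improved accuracy but struggle to combine computational efficiency with modeling of long-range dependencies. The key contribution is Seg-MoE (segment-level Mixture-of-Experts), which replaces conventional token-wise routing with routing and processing of contiguous time-step segments, so each expert models intra-segment interactions directly and exploits the locality and continuity inherent in time series. Aligning the sparse architecture with this structure improves multivariate long-term forecasting and supports segment-level routing as a new inductive bias.
Link: https://arxiv.org/abs/2601.21641
Authors: Evandro S. Ortigossa, Eran Segal
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: Under review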
Abstract:Transformer-based models have recently made significant advances in accurate time-series forecasting, but even these architectures struggle to scale efficiently while capturing long-term temporal dynamics. Mixture-of-Experts (MoE) layers are a proven solution to scaling problems in natural language processing. However, existing MoE approaches for time-series forecasting rely on token-wise routing mechanisms, which may fail to exploit the natural locality and continuity of temporal data. In this work, we introduce Seg-MoE, a sparse MoE design that routes and processes contiguous time-step segments rather than making independent expert decisions. Token segments allow each expert to model intra-segment interactions directly, naturally aligning with inherent temporal patterns. We integrate Seg-MoE layers into a time-series Transformer and evaluate it on multiple multivariate long-term forecasting benchmarks. Seg-MoE consistently achieves state-of-the-art forecasting accuracy across almost all prediction horizons, outperforming both dense Transformers and prior token-wise MoE models. Comprehensive ablation studies confirm that segment-level routing is the key factor driving these gains. Our results show that aligning the MoE routing granularity with the inherent structure of time series provides a powerful, yet previously underexplored, inductive bias, opening new avenues for conditionally sparse architectures in sequential data modeling.
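Segment-level routing can be sketched in a few lines of NumPy. The details below are assumptions made for illustration: mean pooling as the router input, top-1 selection, and linear experts acting on the flattened segment; the function name `segment_moe` is invented and this is not the Seg-MoE implementation.

```python
import numpy as np

def segment_moe(x, router_w, expert_ws, seg_len):
    """Top-1 MoE layer that routes whole segments of contiguous time steps.

    x: (T, d) sequence with T divisible by seg_len.
    router_w: (d, E) router weights.
    expert_ws: list of E matrices of shape (seg_len * d, seg_len * d).
    """
    T, d = x.shape
    n_seg = T // seg_len
    segments = x.reshape(n_seg, seg_len * d)              # each segment flattened to one vector
    pooled = x.reshape(n_seg, seg_len, d).mean(axis=1)    # one routing input per segment
    choice = (pooled @ router_w).argmax(axis=1)           # a single expert decision per segment
    # The chosen expert sees all time steps of its segment at once.
    out = np.stack([segments[i] @ expert_ws[e] for i, e in enumerate(choice)])
    return out.reshape(T, d), choice

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
router_w = rng.standard_normal((4, 3))
expert_ws = [rng.standard_normal((8, 8)) for _ in range(3)]
y, choice = segment_moe(x, router_w, expert_ws, seg_len=2)
```

Because the expert receives the flattened segment, it can mix information across the time steps inside that segment, which token-wise routing cannot do within a single expert call.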
[AI-64] HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning
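[Quick read]: This paper addresses the "low error, high loss" paradox common in Post Training Quantization (PTQ), which it traces to a few high-curvature directions in the Hessian matrix of the large language model (LLM) loss landscape that are extremely sensitive to perturbation. The authors propose Hessian Robust Quantization (HeRo-Q), which applies a lightweight, learnable rotation-compression matrix to the weight space before quantization, reducing the largest Hessian eigenvalue and reshaping the loss landscape to be more robust to quantization noise. The method requires no architectural changes, adds very little computational overhead, and integrates into existing PTQ pipelines. On Llama and Qwen models it outperforms GPTQ, AWQ, and SpinQuant, most clearly in extreme low-bit settings such as W3A16, where it avoids logical collapse.
Link: https://arxiv.org/abs/2601.21626
Authors: Jinhao Zhang, Yunquan Zhang, Zicheng Yan, Boyang Zhang, Jun Sun, Daning Cheng
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: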
Abstract:Post Training Quantization (PTQ), a mainstream model compression technique, often leads to the paradoxical ‘low error, high loss’ phenomenon because it focuses solely on minimizing quantization error. The root cause lies in the Hessian matrix of the LLM loss landscape: a few high-curvature directions are extremely sensitive to perturbations. To address this, we propose the Hessian Robust Quantization (HeRo-Q) algorithm, which applies a lightweight, learnable rotation-compression matrix to the weight space prior to quantization. This joint framework reshapes the loss landscape by reducing the largest Hessian eigenvalue, thereby significantly enhancing robustness to quantization noise. HeRo-Q requires no architectural modifications, incurs negligible computational overhead, and integrates seamlessly into existing PTQ pipelines. Experiments on Llama and Qwen models show that HeRo-Q consistently outperforms state-of-the-art methods including GPTQ, AWQ, and SpinQuant, not only achieving superior performance under standard W4A8 settings, but also excelling in the highly challenging W3A16 ultra-low-bit regime, where it boosts GSM8K accuracy on Llama3 8B to 70.15% and effectively avoids the logical collapse commonly seen in aggressive quantization.
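The "low error, high loss" effect follows from the second-order Taylor expansion around a minimum, where a weight perturbation `delta` changes the loss by roughly `0.5 * delta^T H delta`. The toy Hessian below is hypothetical and only illustrates why two perturbations of equal size can cost very different amounts of loss; it does not reproduce HeRo-Q.

```python
import numpy as np

# Hypothetical 2-D Hessian with one sharp and one flat direction.
H = np.diag([100.0, 1.0])

def loss_increase(delta):
    """Second-order estimate of the loss change caused by a weight perturbation."""
    return 0.5 * delta @ H @ delta

sharp = np.array([0.1, 0.0])     # quantization error along the high-curvature direction
flat = np.array([0.0, 0.1])      # error of the same norm along the flat direction

cost_sharp = loss_increase(sharp)
cost_flat = loss_increase(flat)
# Worst case for a perturbation of this norm: 0.5 * lambda_max * ||delta||^2.
bound = 0.5 * np.linalg.eigvalsh(H).max() * np.linalg.norm(sharp) ** 2
```

Both perturbations have the same quantization error, yet the one along the sharp direction costs about a hundred times more loss. Lowering the largest eigenvalue tightens the worst-case bound, which is the motivation the abstract gives for conditioning the Hessian before quantizing.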
[AI-65] Beyond Parameter Finetuning: Test-Time Representation Refinement for Node Classification
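[Quick read]: This paper addresses the sharp performance drop of Graph Neural Networks (GNNs) in out-of-distribution (OOD) test scenarios, and in particular the catastrophic forgetting caused by Parameter Finetuning (PaFT) in existing Test-Time Training (TTT) methods. The key contribution is TTReFT, a Test-Time Representation FineTuning framework that moves the adaptation target from model parameters to latent representations. It rests on three innovations: (1) uncertainty-guided node selection for targeted interventions; (2) low-rank representation interventions that preserve pre-trained knowledge; and (3) an intervention-aware masked autoencoder that adapts its masking strategy to the node selection scheme. The method comes with theoretical guarantees for OOD settings and shows consistent, superior performance on five benchmark datasets.
Link: https://arxiv.org/abs/2601.21615
Authors: Jiaxin Zhang, Yiqi Wang, Siwei Wang, Xihong Yang, Yu Shi, Xinwang Liu, En Zhu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: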
Abstract:Graph Neural Networks frequently exhibit significant performance degradation in out-of-distribution test scenarios. While test-time training (TTT) offers a promising solution, the existing Parameter Finetuning (PaFT) paradigm suffers from catastrophic forgetting, hindering its real-world applicability. We propose TTReFT, a novel Test-Time Representation FineTuning framework that transitions the adaptation target from model parameters to latent representations. Specifically, TTReFT achieves this through three key innovations: (1) uncertainty-guided node selection for specific interventions, (2) low-rank representation interventions that preserve pre-trained knowledge, and (3) an intervention-aware masked autoencoder that dynamically adjusts its masking strategy to accommodate the node selection scheme. Theoretically, we establish guarantees for TTReFT in OOD settings. Empirically, extensive experiments across five benchmark datasets demonstrate that TTReFT achieves consistent and superior performance. Our work establishes representation finetuning as a new paradigm for graph TTT, offering both theoretical grounding and immediate practical utility for real-world deployment.
[AI-66] RecNet: Self-Evolving Preference Propagation for Agentic Recommender Systems
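[Quick read]: This paper addresses a limitation of current recommender systems in modeling preference change: existing methods rely mainly on explicit interaction data that are sparse, noisy, and unable to reflect real-time mutual influences among users and items. The authors propose RecNet, a self-evolving preference propagation framework with two complementary phases. In the forward phase, a centralized preference routing mechanism and a personalized preference reception mechanism propagate preference updates precisely and selectively to relevant users and items. In the backward phase, a feedback-driven propagation optimization mechanism uses Large Language Models (LLMs) for credit assignment, gradient analysis, and module-level optimization, so that propagation strategies keep evolving. The key innovation is bringing LLM capabilities into the preference propagation process, which improves modeling of complex, dynamic real-world user behavior.
Link: https://arxiv.org/abs/2601.21609
Authors: Bingqian Li, Xiaolei Wang, Junyi Li, Weitao Li, Long Zhang, Sheng Chen, Wayne Xin Zhao, Ji-Rong Wen
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Notes: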
Abstract:Agentic recommender systems leverage Large Language Models (LLMs) to model complex user behaviors and support personalized decision-making. However, existing methods primarily model preference changes based on explicit user-item interactions, which are sparse, noisy, and unable to reflect the real-time, mutual influences among users and items. To address these limitations, we propose RecNet, a self-evolving preference propagation framework that proactively propagates real-time preference updates across related users and items. RecNet consists of two complementary phases. In the forward phase, the centralized preference routing mechanism leverages router agents to integrate preference updates and dynamically propagate them to the most relevant agents. To ensure accurate and personalized integration of propagated preferences, we further introduce a personalized preference reception mechanism, which combines a message buffer for temporary caching and an optimizable, rule-based filter memory to guide selective preference assimilation based on past experience and interests. In the backward phase, the feedback-driven propagation optimization mechanism simulates a multi-agent reinforcement learning framework, using LLMs for credit assignment, gradient analysis, and module-level optimization, enabling continuous self-evolution of propagation strategies. Extensive experiments on various scenarios demonstrate the effectiveness of RecNet in modeling preference propagation for recommender systems.
[AI-67] Search-Based Risk Feature Discovery in Document Structure Spaces under a Constrained Budget
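[Quick read]: This paper addresses how to discover diverse failure mechanisms efficiently during early-stage validation of enterprise-grade Intelligent Document Processing (IDP) systems under a limited budget. Conventional approaches tend to focus on finding a single worst-case document and overlook the many failure types produced by complex interactions. The key step is to formalize the problem as a Search-Based Software Testing (SBST) task: explore structural risk features in the combinatorial space of document configurations to induce realistic failure conditions, and compare multiple search strategies (evolutionary, swarm-based, quality-diversity, learning-based, and quantum) under identical budget constraints. Each strategy consistently uncovers failure modes that the others miss and none dominates outright, which indicates intrinsic solver complementarity and motivates portfolio-based SBST for validating the robustness of industrial IDP systems.
Link: https://arxiv.org/abs/2601.21608
Authors: Saisubramaniam Gopalakrishnan, Harikrishnan P M, Dagnachew Birru
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Notes: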
Abstract:Enterprise-grade Intelligent Document Processing (IDP) systems support high-stakes workflows across finance, insurance, and healthcare. Early-phase system validation under limited budgets mandates uncovering diverse failure mechanisms, rather than identifying a single worst-case document. We formalize this challenge as a Search-Based Software Testing (SBST) problem, aiming to identify complex interactions between document variables, with the objective to maximize the number of distinct failure types discovered within a fixed evaluation budget. Our methodology operates on a combinatorial space of document configurations, rendering instances of structural risk features to induce realistic failure conditions. We benchmark a diverse portfolio of search strategies spanning evolutionary, swarm-based, quality-diversity, learning-based, and quantum approaches under identical budget constraints. Through configuration-level exclusivity, win-rate, and cross-temporal overlap analyses, we show that different solvers consistently uncover failure modes that remain undiscovered by specific alternatives at comparable budgets. Crucially, cross-temporal analysis reveals persistent solver-specific discoveries across all evaluated budgets, with no single strategy exhibiting absolute dominance. While the union of all solvers eventually recovers the observed failure space, reliance on any individual method systematically delays the discovery of important risks. These results demonstrate intrinsic solver complementarity and motivate portfolio-based SBST strategies for robust industrial IDP validation.
[AI-68] Dynamics Reveals Structure: Challenging the Linear Propagation Assumption
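[Quick read]: This paper asks whether neural networks can preserve logical coherence when they adapt through first-order parameter updates, focusing on the geometric limits of the Linear Propagation Assumption (LPA): can a local update propagate coherently to its logical consequences? The analysis uses relation algebra to formalize how three basic operations on relations (negation, converse, and composition) constrain the feature space. For negation and converse, guaranteeing direction-agnostic first-order propagation requires a tensor factorization that separates entity-pair context from relation content. For composition there is a fundamental obstruction: composition reduces to conjunction, and any conjunction that is well defined on linear features must be bilinear, which is incompatible with negation and forces the feature map to collapse. This suggests that failures in knowledge editing, the reversal curse, and multi-hop reasoning may stem from structural limitations inherent to the LPA.
Link: https://arxiv.org/abs/2601.21601
Authors: Hoyeon Chang, Bálint Mucsányi, Seong Joon Oh
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: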
Abstract:Neural networks adapt through first-order parameter updates, yet it remains unclear whether such updates preserve logical coherence. We investigate the geometric limits of the Linear Propagation Assumption (LPA), the premise that local updates coherently propagate to logical consequences. To formalize this, we adopt relation algebra and study three core operations on relations: negation flips truth values, converse swaps argument order, and composition chains relations. For negation and converse, we prove that guaranteeing direction-agnostic first-order propagation necessitates a tensor factorization separating entity-pair context from relation content. However, for composition, we identify a fundamental obstruction. We show that composition reduces to conjunction, and prove that any conjunction well-defined on linear features must be bilinear. Since bilinearity is incompatible with negation, this forces the feature map to collapse. These results suggest that failures in knowledge editing, the reversal curse, and multi-hop reasoning may stem from common structural limitations inherent to the LPA.
[AI-69] CORE: Collaborative Reasoning via Cross Teaching
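[Quick read]: This paper addresses complementary reasoning errors in large language models: on the same problem, one model may succeed while another fails because they decompose it differently. The key contribution is Collaborative Reasoning (CORE), a training-time collaboration mechanism that turns peer success into a learning signal through a cross-teaching protocol. Models first sample independently (cold round); models that fail then receive a hint from a successful peer (contexted rescue round). CORE optimizes a combined reward that balances correctness, a diversity term inspired by the Determinantal Point Process (DPP) to reduce error overlap, and an explicit rescue bonus that rewards successful recovery. With only 1,000 training examples, a pair of small models (3B+4B) improves markedly on several benchmarks, showing that training-time collaboration can exploit model complementarity without scaling model size.
Link: https://arxiv.org/abs/2601.21600
Authors: Kshitij Mishra, Mirat Aubakirov, Martin Takac, Nils Lukas, Salem Lahlou
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Notes: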
Abstract:Large language models exhibit complementary reasoning errors: on the same instance, one model may succeed with a particular decomposition while another fails. We propose Collaborative Reasoning (CORE), a training-time collaboration framework that converts peer success into a learning signal via a cross-teaching protocol. Each problem is solved in two stages: a cold round of independent sampling, followed by a contexted rescue round in which models that failed receive a hint extracted from a successful peer. CORE optimizes a combined reward that balances (i) correctness, (ii) a lightweight DPP-inspired diversity term to reduce error overlap, and (iii) an explicit rescue bonus for successful recovery. We evaluate CORE across four standard reasoning datasets: GSM8K, MATH, AIME, and GPQA. With only 1,000 training examples, a pair of small open-source models (3B+4B) reaches Pass@2 of 99.54% on GSM8K and 92.08% on MATH, compared to 82.50% and 74.82% for single-model training. On harder datasets, the 3B+4B pair reaches Pass@2 of 77.34% on GPQA (trained on 348 examples) and 79.65% on AIME (trained on 792 examples), using a training-time budget of at most 1536 context tokens and 3072 generated tokens. Overall, these results show that training-time collaboration can reliably convert model complementarity into large gains without scaling model size.
[AI-70] Beyond Imitation: Reinforcement Learning for Active Latent Planning
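[Quick read]: This paper addresses a weakness of current latent reasoning methods: because they passively imitate discrete language labels, the resulting latent representations and reasoning policies are poor, which limits efficient planning in the continuous latent space. When several equivalent but diverse Chain-of-Thought (CoT) labels exist, imitating an arbitrary one yields inferior latent representations and a clear gap between training and testing performance. The key contribution is Active Latent Planning (ATP-Latent), which models the supervision of latent tokens as a conditional Variational Auto-Encoder (VAE) to obtain a smoother latent space, and adds an auxiliary coherence reward, based on the consistency of VAE-decoded contents, to drive reinforcement learning (RL) toward a more reasonable latent reasoning policy. On LLaMA-1B, ATP-Latent achieves +4.1% accuracy and -3.3% tokens relative to advanced baselines.
Link: https://arxiv.org/abs/2601.21598
Authors: Zhi Zheng, Wee Sun Lee
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Notes: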
Abstract:Aiming at efficient and dense chain-of-thought (CoT) reasoning, latent reasoning methods fine-tune Large Language Models (LLMs) to substitute discrete language tokens with continuous latent tokens. These methods consume fewer tokens compared to the conventional language CoT reasoning and have the potential to plan in a dense latent space. However, current latent tokens are generally supervised based on imitating language labels. Considering that there can be multiple equivalent but diverse CoT labels for a question, passively imitating an arbitrary one may lead to inferior latent token representations and latent reasoning policies, undermining the potential planning ability and resulting in clear gaps between training and testing. In this work, we emphasize the importance of active planning over the representation space of latent tokens in achieving the optimal latent reasoning policy. So, we propose the Active Latent Planning method (ATP-Latent), which models the supervision process of latent tokens as a conditional variational auto-encoder (VAE) to obtain a smoother latent space. Moreover, to facilitate the most reasonable latent reasoning policy, ATP-Latent conducts reinforcement learning (RL) with an auxiliary coherence reward, which is calculated based on the consistency between VAE-decoded contents of latent tokens, enabling a guided RL process. In experiments on LLaMA-1B, ATP-Latent demonstrates +4.1% accuracy and -3.3% tokens on four benchmarks compared to advanced baselines. Codes are available on this https URL.
[AI-71] Scalable Power Sampling: Unlocking Efficient Training-Free Reasoning for LLMs via Distribution Sharpening
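[Quick read]: This paper starts from the observation that the gains from reinforcement learning (RL) post-training on the reasoning ability of Large Language Models (LLMs) come mainly from distribution sharpening rather than from new capabilities, and it addresses the high computational cost that keeps Markov chain Monte Carlo (MCMC) sampling approaches from wide use. The key contribution is a theoretically grounded formulation: the global power distribution can be approximated by a token-level scaled low-temperature distribution whose scaling factor captures future trajectory quality. Building on this, the authors design a training-free and verifier-free algorithm that sharpens the base model's generative distribution autoregressively, matching or surpassing one-shot GRPO without external rewards while reducing inference latency by more than 10x compared with MCMC-based sampling.
Link: https://arxiv.org/abs/2601.21590
Authors: Xiaotong Ji, Rasul Tutunov, Matthieu Zimmer, Haitham Bou Ammar
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: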
Abstract:Reinforcement learning (RL) post-training is a dominant approach for improving the reasoning performance of large language models (LLMs), yet growing evidence suggests that its gains arise primarily from distribution sharpening rather than the acquisition of new capabilities. Recent work has shown that sampling from the power distribution of LLMs using Markov chain Monte Carlo (MCMC) can recover performance comparable to RL post-training without relying on external rewards; however, the high computational cost of MCMC makes such approaches impractical for widespread adoption. In this work, we propose a theoretically grounded alternative that eliminates the need for iterative MCMC. We derive a novel formulation showing that the global power distribution can be approximated by a token-level scaled low-temperature one, where the scaling factor captures future trajectory quality. Leveraging this insight, we introduce a training-free and verifier-free algorithm that sharpens the base model’s generative distribution autoregressively. Empirically, we evaluate our method on math, QA, and code tasks across four LLMs, and show that our method matches or surpasses one-shot GRPO without relying on any external rewards, while reducing inference latency by over 10x compared to MCMC-based sampling.
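As a rough illustration of what "distribution sharpening" means at the token level, the sketch below raises a next-token distribution to a power `alpha` and renormalizes, which is the same as sampling at temperature `1/alpha`. This is not the paper's algorithm: the abstract describes an additional scaling factor that captures future trajectory quality, which is omitted here, and the names `sharpen` and `alpha` are hypothetical.

```python
def sharpen(probs, alpha):
    """Raise a probability vector to the power alpha and renormalize.

    alpha > 1 concentrates mass on likely tokens (temperature 1/alpha).
    The trajectory-quality scaling factor from the abstract is NOT modeled.
    """
    powered = [p ** alpha for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


# Toy next-token distribution over three tokens (made-up numbers).
base = [0.5, 0.3, 0.2]
sharp = sharpen(base, alpha=4.0)
```

With `alpha=1.0` the distribution is unchanged; larger values move mass toward the most likely token.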
[AI-72] Chain Of Thought Compression: A Theoritical Analysis
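【Quick Read】: This paper addresses the high computational cost of Chain-of-Thought (CoT) reasoning caused by generating extra intermediate steps, and in particular the theoretical difficulty that, under implicit CoT compression, the learning signal decays and intermediate reasoning becomes hard to internalize. The key to the solution is ALiCoT (Aligned Implicit CoT), a framework that aligns the distribution of latent states with explicit intermediate reasoning states, mitigating the exponential decay of the learning signal under high-order logical dependencies and yielding efficient reasoning with performance close to explicit CoT.
Link: https://arxiv.org/abs/2601.21576
Authors: Juncai Li,Ru Li,Yuxiang Zhou,Boxiang Ma,Jeff Z. Pan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: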
Abstract:Chain-of-Thought (CoT) has unlocked advanced reasoning abilities of Large Language Models (LLMs) with intermediate steps, yet incurs prohibitive computational costs due to generation of extra tokens. Recent studies empirically show that compressing reasoning steps into latent states, or implicit CoT compression, offers a token-efficient alternative. However, the mechanism behind CoT compression remains unclear. In this paper, we provide the first theoretical analysis of the difficulty of learning to internalize intermediate reasoning steps. By introducing Order-r Interaction, we prove that the learning signal for high-order logical dependencies exponentially decays to solve irreducible problem, where skipping intermediate steps inevitably leads to high-order interaction barriers. To empirically validate this, we introduce NatBool-DAG, a challenging benchmark designed to enforce irreducible logical reasoning and eliminate semantic shortcuts. Guided by our theoretical findings, we propose ALiCoT (Aligned Implicit CoT), a novel framework that overcomes the signal decay by aligning latent token distributions with intermediate reasoning states. Experimental results demonstrate that ALiCoT successfully unlocks efficient reasoning: it achieves a 54.4x speedup while maintaining performance comparable to explicit CoT.
[AI-73] Signal-Adaptive Trust Regions for Gradient-Free Optimization of Recurrent Spiking Neural Networks
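【Quick Read】: This paper addresses the difficulty of training Recurrent Spiking Neural Networks (RSNNs) on high-dimensional, long-horizon reinforcement learning tasks: the non-differentiable spike dynamics make backpropagation hard to apply, and population-based gradient-free optimization suffers from high-variance gradient estimates under finite populations, which leads to unstable or even harmful update steps. The key to the solution is Signal-Adaptive Trust Regions (SATR), which stabilizes updates by constraining relative change: it measures the change in the policy distribution with KL divergence and normalizes it by an estimated signal energy, so that the trust region expands under strong signals to speed up convergence and contracts when noise dominates to suppress instability. To improve scalability, the authors also design a bitset implementation for binary spikes and binary weights that substantially reduces training time and makes RSNN policy search more practical.
Link: https://arxiv.org/abs/2601.21572
Authors: Jinhao Li,Yuhao Sun,Zhiyuan Ma,Hao He,Xinche Zhang,Xing Chen,Jin Li,Sen Song
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: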
Abstract:Recurrent spiking neural networks (RSNNs) are a promising substrate for energy-efficient control policies, but training them for high-dimensional, long-horizon reinforcement learning remains challenging. Population-based, gradient-free optimization circumvents backpropagation through non-differentiable spike dynamics by estimating gradients. However, with finite populations, high variance of these estimates can induce harmful and overly aggressive update steps. Inspired by trust-region methods in reinforcement learning that constrain policy updates in distribution space, we propose Signal-Adaptive Trust Regions (SATR), a distributional update rule that constrains relative change by bounding KL divergence normalized by an estimated signal energy. SATR automatically expands the trust region under strong signals and contracts it when updates are noise-dominated. We instantiate SATR for Bernoulli connectivity distributions, which have shown strong empirical performance for RSNN optimization. Across a suite of high-dimensional continuous-control benchmarks, SATR improves stability under limited populations and reaches competitive returns against strong baselines including PPO-LSTM. In addition, to make SATR practical at scale, we introduce a bitset implementation for binary spiking and binary weights, substantially reducing wall-clock training time and enabling fast RSNN policy search.
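The idea of bounding a KL divergence by a budget that grows with signal strength can be sketched for Bernoulli parameters as follows. This is only an illustration under stated assumptions: the budget form `eps * signal_energy`, the step-halving loop, and all function names are hypothetical and are not taken from the paper; only the closed-form Bernoulli KL divergence is standard.

```python
import math


def bernoulli_kl(p, q):
    """KL(Bern(p) || Bern(q)) for probabilities strictly inside (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def clip(x, lo=1e-3, hi=1 - 1e-3):
    """Keep a probability away from 0 and 1 so the KL stays finite."""
    return min(hi, max(lo, x))


def trust_region_step(probs, grad, signal_energy, eps=0.1, lr=1.0):
    """Halve the step until total KL <= eps * signal_energy (assumed form)."""
    budget = eps * signal_energy
    for _ in range(50):
        new = [clip(p + lr * g) for p, g in zip(probs, grad)]
        kl = sum(bernoulli_kl(n, p) for n, p in zip(new, probs))
        if kl <= budget:
            return new, kl
        lr *= 0.5
    # No acceptable step found: keep the current parameters.
    return list(probs), 0.0


probs = [0.5, 0.5]
grad = [0.4, -0.4]  # hypothetical estimated update direction
strong, kl_strong = trust_region_step(probs, grad, signal_energy=1.0)
weak, kl_weak = trust_region_step(probs, grad, signal_energy=0.1)
```

In this toy setup the same update direction produces a larger accepted step when the assumed signal energy is higher, which mirrors the expand-under-strong-signal, contract-under-noise behavior the abstract describes.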
[AI-74] EmboCoach-Bench: Benchmarking AI Agents on Developing Embodied Robots
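【Quick Read】: This paper addresses the scaling bottleneck in Embodied AI caused by reliance on labor-intensive manual intervention such as reward-function design and hyperparameter tuning. Its core solution is EmboCoach-Bench, a benchmark that treats executable code as the universal interface and evaluates the ability of large language model (LLM) agents to autonomously design and optimize embodied policies in a closed-loop environment. The key innovation is a dynamic, iterative workflow in which agents use environment feedback to repeatedly draft, debug, and optimize policies, spanning improvements from physics-informed reward design to policy architectures such as diffusion policies, toward self-evolving embodied intelligence without manual intervention.
Link: https://arxiv.org/abs/2601.21570
Authors: Zixing Lei,Genjia Liu,Yuanshuo Zhang,Qipeng Liu,Chuan Wen,Shanghang Zhang,Wenzhao Lian,Siheng Chen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 37 pages, 13 figures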
Abstract:The field of Embodied AI is witnessing a rapid evolution toward general-purpose robotic systems, fueled by high-fidelity simulation and large-scale data collection. However, this scaling capability remains severely bottlenecked by a reliance on labor-intensive manual oversight from intricate reward shaping to hyperparameter tuning across heterogeneous backends. Inspired by LLMs’ success in software automation and science discovery, we introduce EmboCoach-Bench, a benchmark evaluating the capacity of LLM agents to autonomously engineer embodied policies. Spanning 32 expert-curated RL and IL tasks, our framework posits executable code as the universal interface. We move beyond static generation to assess a dynamic closed-loop workflow, where agents leverage environment feedback to iteratively draft, debug, and optimize solutions, spanning improvements from physics-informed reward design to policy architectures such as diffusion policies. Extensive evaluations yield three critical insights: (1) autonomous agents can qualitatively surpass human-engineered baselines by 26.5% in average success rate; (2) agentic workflow with environment feedback effectively strengthens policy development and substantially narrows the performance gap between open-source and proprietary models; and (3) agents exhibit self-correction capabilities for pathological engineering cases, successfully resurrecting task performance from near-total failures through iterative simulation-in-the-loop debugging. Ultimately, this work establishes a foundation for self-evolving embodied intelligence, accelerating the paradigm shift from labor-intensive manual tuning to scalable, autonomous engineering in embodied AI field.
[AI-75] SAL: Selective Adaptive Learning for Backpropagation-Free Training with Sparsification
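【Quick Read】: This paper addresses two core problems of backpropagation (BP) in standard deep learning: the weight-symmetry assumption is biologically implausible, and gradient interference is severe within dense representations. The key to the solution is Selective Adaptive Learning (SAL), which decomposes the parameter space into mutually exclusive, sample-dependent regions to decouple gradient interference and uses a refined feedback alignment mechanism to remove the explicit weight-symmetry requirement, improving model performance and scalability while keeping training stable.
Link: https://arxiv.org/abs/2601.21561
Authors: Fanping Liu,Hua Yang,Jiasi Zou
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: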
Abstract:Standard deep learning relies on Backpropagation (BP), which is constrained by biologically implausible weight symmetry and suffers from significant gradient interference within dense representations. To mitigate these bottlenecks, we propose Selective Adaptive Learning (SAL), a training method that combines selective parameter activation with adaptive area partitioning. Specifically, SAL decomposes the parameter space into mutually exclusive, sample-dependent regions. This decoupling mitigates gradient interference across divergent semantic patterns and addresses explicit weight symmetry requirements through our refined feedback alignment. Empirically, SAL demonstrates competitive convergence rates, leading to improved classification performance across 10 standard benchmarks. Additionally, SAL achieves numerical consistency and competitive accuracy even in deep regimes (up to 128 layers) and large-scale models (up to 1B parameters). Our approach is loosely inspired by biological learning mechanisms, offering a plausible alternative that contributes to the study of scalable neural network training.
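[AI-76] Meta Context Engineering via Agentic Skill Evolution
【Quick Read】: This paper addresses the problem that large language models (LLMs) depend heavily on their inference-time context, while existing Context Engineering (CE) methods are constrained by manually designed, rigid pipelines and structural biases. Current CE methods typically use fixed generation-reflection workflows or predefined context schemas, which narrows the optimization space and adapts poorly to diverse tasks. The key to the solution is Meta Context Engineering (MCE), a bi-level framework: a meta-level agent iteratively refines CE skills through agentic crossover, a deliberative search over the history of skills, their executions, and evaluations; a base-level agent executes these skills, learns from training rollouts, and optimizes context as flexible files and code. MCE co-evolves skills and context and shows consistent gains in both offline and online settings (a mean relative improvement of 16.9% over state-of-the-art agentic CE methods, up to 53.8%), together with better context adaptability, transferability, and efficiency in context usage and training.
Link: https://arxiv.org/abs/2601.21557
Authors: Haoran Ye,Xuning He,Vincent Arak,Haonan Dong,Guojie Song
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 46 pages, 4 figures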
Abstract:The operational efficacy of large language models relies heavily on their inference-time context. This has established Context Engineering (CE) as a formal discipline for optimizing these inputs. Current CE methods rely on manually crafted harnesses, such as rigid generation-reflection workflows and predefined context schemas. They impose structural biases and restrict context optimization to a narrow, intuition-bound design space. To address this, we introduce Meta Context Engineering (MCE), a bi-level framework that supersedes static CE heuristics by co-evolving CE skills and context artifacts. In MCE iterations, a meta-level agent refines engineering skills via agentic crossover, a deliberative search over the history of skills, their executions, and evaluations. A base-level agent executes these skills, learns from training rollouts, and optimizes context as flexible files and code. We evaluate MCE across five disparate domains under offline and online settings. MCE demonstrates consistent performance gains, achieving 5.6–53.8% relative improvement over state-of-the-art agentic CE methods (mean of 16.9%), while maintaining superior context adaptability, transferability, and efficiency in both context usage and training.
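[AI-77] Training slow silicon neurons to control extremely fast robots with spiking reinforcement learning
【Quick Read】: This paper addresses real-time learning for robot decision-making and control in fast, dynamic environments, specifically air hockey, a task that demands split-second responses to a puck moving at high velocity. The key to the solution is hardware-software co-design: a compact spiking neural network (SNN) runs on a mixed-signal analog/digital neuromorphic processor; reinforcement learning yields successful puck interactions in a very small number of trials; fixed random connectivity captures the task's temporal structure; and a local e-prop learning rule in the readout layer exploits event-driven activity for fast, efficient online learning. The approach performs learning and control in real time with the neuromorphic chip in the loop, offering a feasible path toward always-on adaptation in intelligent robots.
Link: https://arxiv.org/abs/2601.21548
Authors: Irene Ambrosini,Ingo Blakowski,Dmitrii Zendrikov,Cristiano Capone,Luna Gava,Giacomo Indiveri,Chiara De Luca,Chiara Bartolozzi
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: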
Abstract:Air hockey demands split-second decisions at high puck velocities, a challenge we address with a compact network of spiking neurons running on a mixed-signal analog/digital neuromorphic processor. By co-designing hardware and learning algorithms, we train the system to achieve successful puck interactions through reinforcement learning in a remarkably small number of trials. The network leverages fixed random connectivity to capture the task’s temporal structure and adopts a local e-prop learning rule in the readout layer to exploit event-driven activity for fast and efficient learning. The result is real-time learning with a setup comprising a computer and the neuromorphic chip in-the-loop, enabling practical training of spiking neural networks for robotic autonomous systems. This work bridges neuroscience-inspired hardware with real-world robotic control, showing that brain-inspired approaches can tackle fast-paced interaction tasks while supporting always-on learning in intelligent machines.
[AI-78] Multi-Modal Time Series Prediction via Mixture of Modulated Experts
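【Quick Read】: This paper addresses the difficulty of cross-modal alignment in multi-modal time series prediction, which arises because high-quality time-text pairs are scarce and time series vary substantially in scale and characteristics. Most existing methods rely on token-level fusion, which mixes temporal patches with language tokens in a shared embedding space and adapts poorly to complex dynamics and low-resource settings. The key to the solution is Expert Modulation, a new paradigm that conditions both the routing mechanism and expert computation on textual signals, giving direct and efficient cross-modal control over expert behavior and improving multi-modal time series prediction.
Link: https://arxiv.org/abs/2601.21547
Authors: Lige Zhang,Ali Maatouk,Jialin Chen,Leandros Tassiulas,Rex Ying
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 26 pages, 12 figures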
Abstract:Real-world time series exhibit complex and evolving dynamics, making accurate forecasting extremely challenging. Recent multi-modal forecasting methods leverage textual information such as news reports to improve prediction, but most rely on token-level fusion that mixes temporal patches with language tokens in a shared embedding space. However, such fusion can be ill-suited when high-quality time-text pairs are scarce and when time series exhibit substantial variation in scale and characteristics, thus complicating cross-modal alignment. In parallel, Mixture-of-Experts (MoE) architectures have proven effective for both time series modeling and multi-modal learning, yet many existing MoE-based modality integration methods still depend on token-level fusion. To address this, we propose Expert Modulation, a new paradigm for multi-modal time series prediction that conditions both routing and expert computation on textual signals, enabling direct and efficient cross-modal control over expert behavior. Through comprehensive theoretical analysis and experiments, our proposed method demonstrates substantial improvements in multi-modal time series prediction. The current code is available at this https URL
[AI-79] ARGORA: Orchestrated Argumentation for Causally Grounded LLM Reasoning and Decision Making
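【Quick Read】: This paper addresses the lack of interpretability in how existing multi-expert LLM systems combine different perspectives: simple aggregation obscures which reasoning chains drove the final decision. The key to the solution is ARGORA, a framework that models multi-expert discussion as an explicit argumentation graph with support and attack relations and casts it as a causal model. By systematically removing individual arguments and recomputing outcomes, it identifies the decisive reasoning chains, and a correction mechanism adjusts the reasoning when internal reasoning and external judgments disagree, combining competitive accuracy with causal diagnostics.
Link: https://arxiv.org/abs/2601.21533
Authors: Youngjin Jin,Hanna Kim,Kwanwoo Kim,Chanhee Lee,Seungwon Shin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 58 pages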
Abstract:Existing multi-expert LLM systems gather diverse perspectives but combine them through simple aggregation, obscuring which arguments drove the final decision. We introduce ARGORA, a framework that organizes multi-expert discussions into explicit argumentation graphs showing which arguments support or attack each other. By casting these graphs as causal models, ARGORA can systematically remove individual arguments and recompute outcomes, identifying which reasoning chains were necessary and whether decisions would change under targeted modifications. We further introduce a correction mechanism that aligns internal reasoning with external judgments when they disagree. Across diverse benchmarks and an open-ended use case, ARGORA achieves competitive accuracy and demonstrates corrective behavior: when experts initially disagree, the framework resolves disputes toward correct answers more often than it introduces new errors, while providing causal diagnostics of decisive arguments.
[AI-80] More Bang for the Buck: Improving the Inference of Large Language Models at a Fixed Budget using Reset and Discard (ReD)
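【Quick Read】: This paper addresses the problem that, under a limited compute budget, the coverage achieved by large language models (LLMs) grows sublinearly with the number of attempts (diminishing returns). The existing metric pass@k (the probability of answering a question correctly at least once in k attempts) reflects success on a single question but does not guide how to allocate resources to maximize coverage. The authors propose Reset-and-Discard (ReD), a query strategy whose core idea is to dynamically reset and discard unproductive attempts to optimize the sampling path, thereby increasing coverage@cost (the average number of unique questions answered as a function of the total number of attempts) for any given budget. ReD does not depend on the specific form of pass@k; it substantially reduces the attempts, tokens, and USD cost needed to reach a target coverage, and it can infer the power-law exponent of pass@k from data, providing a quantitative tool for analyzing inference performance.
Link: https://arxiv.org/abs/2601.21522
Authors: Sagi Meir,Tommer D. Keidar,Noam Levi,Shlomi Reuveni,Barak Hirshberg
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: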
Abstract:The performance of large language models (LLMs) on verifiable tasks is usually measured by pass@k, the probability of answering a question correctly at least once in k trials. At a fixed budget, a more suitable metric is coverage@cost, the average number of unique questions answered as a function of the total number of attempts. We connect the two metrics and show that the empirically-observed power-law behavior in pass@k leads to a sublinear growth of the coverage@cost (diminishing returns). To solve this problem, we propose Reset-and-Discard (ReD), a query method of LLMs that increases coverage@cost for any given budget, regardless of the pass@k form. Moreover, given a pass@k, we can quantitatively predict the savings in the total number of attempts using ReD. If pass@k is not available for the model, ReD can infer its power-law exponent. Experiments on three LLMs using HumanEval demonstrate that ReD substantially reduces the required attempts, tokens, and USD cost to reach a desired coverage, while also offering an efficient way to measure inference power-laws.
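The two metrics in the abstract can be made concrete with a small calculation. The sketch below assumes independent attempts with a fixed per-question success probability and a uniform policy that spends `k` attempts on every question; both are simplifying assumptions for illustration, and the code does not implement ReD itself. The function names and the toy probabilities are hypothetical.

```python
def pass_at_k(p, k):
    """Probability of at least one success in k independent attempts."""
    return 1.0 - (1.0 - p) ** k


def expected_coverage(success_probs, k):
    """Expected number of distinct questions solved with k attempts each."""
    return sum(pass_at_k(p, k) for p in success_probs)


# Toy per-attempt success probabilities for four questions (made up).
probs = [0.9, 0.5, 0.1, 0.01]
for k in (1, 2, 4, 8):
    total_attempts = k * len(probs)
    print(total_attempts, round(expected_coverage(probs, k), 3))
```

Under this uniform policy each doubling of the budget adds less coverage than the previous one, because extra attempts keep being spent on questions that are already very likely solved. That is the diminishing-returns behavior the abstract sets out to improve on.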
[AI-81] LLaMEA-SAGE: Guiding Automated Algorithm Design with Structural Feedback from Explainable AI
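【Quick Read】: This paper addresses inefficient evolutionary search in Automated Algorithm Design (AAD): existing methods such as LLaMEA explore well but evolve using fitness feedback alone and leave the structural information in the generated code unused. The key to the solution is a surrogate model built on graph-theoretic and complexity features extracted from the abstract syntax trees (ASTs) of the generated algorithms. Explainable AI (XAI) techniques identify the features that substantially affect performance, and these are translated into natural-language mutation instructions that steer the large language model (LLM) toward more promising algorithm code without restricting expressivity. The method is named LLaMEA-SAGE. In a small controlled experiment it reaches the same performance faster than vanilla LLaMEA, and on the MA-BBOB suite it outperforms state-of-the-art AAD methods.
Link: https://arxiv.org/abs/2601.21511
Authors: Niki van Stein,Anna V. Kononova,Lars Kotthoff,Thomas Bäck
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Software Engineering (cs.SE)
Comments: 14 pages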
Abstract:Large language models have enabled automated algorithm design (AAD) by generating optimization algorithms directly from natural-language prompts. While evolutionary frameworks such as LLaMEA demonstrate strong exploratory capabilities across the algorithm design space, their search dynamics are entirely driven by fitness feedback, leaving substantial information about the generated code unused. We propose a mechanism for guiding AAD using feedback constructed from graph-theoretic and complexity features extracted from the abstract syntax trees of the generated algorithms, based on a surrogate model learned over an archive of evaluated solutions. Using explainable AI techniques, we identify features that substantially affect performance and translate them into natural-language mutation instructions that steer subsequent LLM-based code generation without restricting expressivity. We propose LLaMEA-SAGE, which integrates this feature-driven guidance into LLaMEA, and evaluate it across several benchmarks. We show that the proposed structured guidance achieves the same performance faster than vanilla LLaMEA in a small controlled experiment. In a larger-scale experiment using the MA-BBOB suite from the GECCO-MA-BBOB competition, our guided approach achieves superior performance compared to state-of-the-art AAD methods. These results demonstrate that signals derived from code can effectively bias LLM-driven algorithm evolution, bridging the gap between code structure and human-understandable performance feedback in automated algorithm design.
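To give a sense of what structural features of an abstract syntax tree might look like, the following sketch uses Python's standard `ast` module to compute a few simple counts. The specific features here (node count, maximum depth, number of loops and branches) are illustrative stand-ins chosen for this example; the abstract does not list the exact features used by LLaMEA-SAGE, and `ast_features` is a hypothetical helper.

```python
import ast


def ast_features(source):
    """Return a few simple structural features of Python source code."""
    tree = ast.parse(source)

    def depth(node):
        # Depth of the subtree rooted at node (a leaf has depth 1).
        children = list(ast.iter_child_nodes(node))
        return 1 + max((depth(c) for c in children), default=0)

    nodes = list(ast.walk(tree))
    return {
        "num_nodes": len(nodes),
        "max_depth": depth(tree),
        "num_loops": sum(isinstance(n, (ast.For, ast.While)) for n in nodes),
        "num_branches": sum(isinstance(n, ast.If) for n in nodes),
    }


code = """
def search(xs):
    best = None
    for x in xs:
        if best is None or x < best:
            best = x
    return best
"""
features = ast_features(code)
```

A feature dictionary like this could be fed to any surrogate model alongside measured fitness; how the paper builds and explains its surrogate is not specified in the abstract.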
[AI-82] Mean-Field Control on Sparse Graphs: From Local Limits to GNNs via Neighborhood Distributions
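【Quick Read】: This paper addresses the scalability problem caused by the curse of dimensionality in multi-agent systems, particularly when real-world networks are sparse graphs rather than the dense, all-to-all topologies implied by the traditional exchangeability assumption. The core challenge is to extend Mean-Field Control (MFC) from idealized dense interactions to complex sparse graphs while preserving theoretical rigor. The key to the solution is to redefine the system state as a probability measure over decorated rooted neighborhoods, which captures local heterogeneity, and to prove horizon-dependent locality: for finite-horizon problems, an agent's optimal policy at time $t$ depends only on its $(T-t)$-hop neighborhood. This makes the infinite-dimensional control problem tractable and underpins a new Dynamic Programming Principle (DPP) on the space of neighborhood distributions. The framework also formally justifies using GNNs in actor-critic algorithms and recovers classical MFC as a degenerate case, enabling efficient, theoretically grounded control on complex sparse topologies.
Link: https://arxiv.org/abs/2601.21477
Authors: Tobias Schmidt,Kai Cui
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 19 pages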
Abstract:Mean-field control (MFC) offers a scalable solution to the curse of dimensionality in multi-agent systems but traditionally hinges on the restrictive assumption of exchangeability via dense, all-to-all interactions. In this work, we bridge the gap to real-world network structures by proposing a rigorous framework for MFC on large sparse graphs. We redefine the system state as a probability measure over decorated rooted neighborhoods, effectively capturing local heterogeneity. Our central contribution is a theoretical foundation for scalable reinforcement learning in this setting. We prove horizon-dependent locality: for finite-horizon problems, an agent’s optimal policy at time t depends strictly on its (T-t)-hop neighborhood. This result renders the infinite-dimensional control problem tractable and underpins a novel Dynamic Programming Principle (DPP) on the lifted space of neighborhood distributions. Furthermore, we formally and experimentally justify the use of Graph Neural Networks (GNNs) for actor-critic algorithms in this context. Our framework naturally recovers classical MFC as a degenerate case while enabling efficient, theoretically grounded control on complex sparse topologies.
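The horizon-dependent locality statement says that at time t only the (T-t)-hop neighborhood matters. The sketch below shows what that neighborhood is on a small graph using a plain breadth-first search. It only illustrates the shrinking information set, not the control algorithm; the adjacency dictionary, the horizon `T = 3`, and the function name are hypothetical.

```python
from collections import deque


def k_hop_neighborhood(adj, root, k):
    """Return the set of nodes within k hops of root in an undirected graph."""
    seen = {root}
    frontier = deque([(root, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == k:
            continue  # do not expand beyond k hops
        for nb in adj.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    return seen


# Path graph 0-1-2-3-4, viewed from the middle node with horizon T = 3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
T = 3
views = {t: k_hop_neighborhood(adj, 2, T - t) for t in range(T + 1)}
```

As t approaches T the relevant neighborhood shrinks to the agent itself, so later decisions need strictly less of the graph.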
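[AI-83] Task-free Adaptive Meta Black-box Optimization ICLR2026
【Quick Read】: This paper addresses the inefficiency of handcrafted optimizers on complex black-box optimization (BBO) tasks and the reliance of existing meta-learning methods on large numbers of predefined training tasks, which makes them hard to apply when the target task distribution is unknown. The key to the solution is the Adaptive meta Black-box Optimization Model (ABOM), which adapts its parameters online using only optimization data from the target task, removing the dependence on a predefined task distribution. ABOM introduces a closed-loop adaptive parameter learning mechanism in which parameterized evolutionary operators continuously self-update from the populations generated during optimization, enabling zero-shot optimization and competitive performance on synthetic benchmarks and realistic unmanned aerial vehicle path planning problems.
Link: https://arxiv.org/abs/2601.21475
Authors: Chao Wang,Licheng Jiao,Lingling Li,Jiaxuan Zhao,Guanchun Wang,Fang Liu,Shuyuan Yang
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This article was published as a conference paper at ICLR 2026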
Abstract:Handcrafted optimizers become prohibitively inefficient for complex black-box optimization (BBO) tasks. MetaBBO addresses this challenge by meta-learning to automatically configure optimizers for low-level BBO tasks, thereby eliminating heuristic dependencies. However, existing methods typically require extensive handcrafted training tasks to learn meta-strategies that generalize to target tasks, which poses a critical limitation for realistic applications with unknown task distributions. To overcome the issue, we propose the Adaptive meta Black-box Optimization Model (ABOM), which performs online parameter adaptation using solely optimization data from the target task, obviating the need for predefined task distributions. Unlike conventional metaBBO frameworks that decouple meta-training and optimization phases, ABOM introduces a closed-loop adaptive parameter learning mechanism, where parameterized evolutionary operators continuously self-update by leveraging generated populations during optimization. This paradigm shift enables zero-shot optimization: ABOM achieves competitive performance on synthetic BBO benchmarks and realistic unmanned aerial vehicle path planning problems without any handcrafted training tasks. Visualization studies reveal that parameterized evolutionary operators exhibit statistically significant search patterns, including natural selection and genetic recombination.
[AI-84] ScaleSim: Serving Large-Scale Multi-Agent Simulation with Invocation Distance-Based Memory Management
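【Quick Read】: This paper addresses the difficulty of scaling LLM-based multi-agent simulations due to GPU memory pressure. The core challenge is that each agent keeps private GPU-resident state (models, prefix caches, and adapters), so device memory is exhausted quickly as the number of agents grows. The key to the solution is to identify two workload properties, sparse agent activation and an estimable agent invocation order, and to introduce "invocation distance", a unified abstraction that estimates the relative order in which agents will issue future LLM requests. On this basis the authors build ScaleSim, which manages memory efficiently through proactive prefetching and priority-based eviction, supports diverse agent-specific memory through a modular interface, and achieves up to 1.74x speedup over SGLang on simulation benchmarks.
Link: https://arxiv.org/abs/2601.21473
Authors: Zaifeng Pan,Yipeng Shen,Zhengding Hu,Zhuang Wang,Aninda Manocha,Zheng Wang,Zhongkai Yu,Yue Guan,Yufei Ding
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: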
Abstract:LLM-based multi-agent simulations are increasingly adopted across application domains, but remain difficult to scale due to GPU memory pressure. Each agent maintains private GPU-resident states, including models, prefix caches, and adapters, which quickly exhaust device memory as the agent count grows. We identify two key properties of these workloads: sparse agent activation and an estimable agent invocation order. Based on an analysis of representative workload classes, we introduce invocation distance, a unified abstraction that estimates the relative order in which agents will issue future LLM requests. Leveraging this abstraction, we present ScaleSim, a memory-efficient LLM serving system for large-scale multi-agent simulations. ScaleSim enables proactive prefetching and priority-based eviction, supports diverse agent-specific memory through a modular interface, and achieves up to 1.74x speedup over SGLang on simulation benchmarks.
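One plausible way to use an invocation-distance estimate for eviction is to free the state of agents that are predicted to be called furthest in the future. The sketch below shows that heuristic under stated assumptions: the abstract does not give ScaleSim's actual policy or interfaces, so `choose_victims`, its arguments, and the toy numbers are all hypothetical.

```python
def choose_victims(resident, invocation_distance, bytes_needed):
    """Pick agents to evict, farthest predicted invocation first.

    resident: dict agent_id -> bytes currently held on the GPU
    invocation_distance: dict agent_id -> estimated steps until the agent's
        next LLM request (larger means it is needed later)
    bytes_needed: amount of GPU memory that must be freed
    """
    victims, freed = [], 0
    order = sorted(resident, key=lambda a: invocation_distance[a], reverse=True)
    for agent in order:
        if freed >= bytes_needed:
            break
        victims.append(agent)
        freed += resident[agent]
    return victims


# Toy state: agent "b" is expected to run last, so it is evicted first.
resident = {"a": 4, "b": 2, "c": 3}
distance = {"a": 1, "b": 10, "c": 5}
victims = choose_victims(resident, distance, bytes_needed=4)
```

The same ordering, read from the other end, would suggest which evicted agents to prefetch first.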
[AI-85] Adaptive Confidence Gating in Multi-Agent Collaboration for Efficient and Optimized Code Generation
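【Quick Read】: This paper addresses the reasoning bottlenecks and failure loops that Small Language Models (SLMs) encounter on complex logical requirements. The core solution is DebateCoder, a multi-agent collaborative framework with a structured role-playing protocol (User Agent A_UA, Technical Agent A_TA, and Quality Assurance Agent A_QA) that supports multi-turn deliberation and post-generation refinement. The key innovations are an Adaptive Confidence Gating mechanism (95% threshold) that balances accuracy and inference efficiency, combined with a multi-turn deliberation module and a reviewer-guided analytical debugging loop for orthogonal pre-generation debate and post-generation refinement. Experiments show 70.12% Pass@1 on HumanEval, outperforming MapCoder while reducing API overhead by about 35%, which indicates that collaborative protocols can improve SLM reasoning in resource-constrained environments.
Link: https://arxiv.org/abs/2601.21469
Authors: Haoji Zhang,Yuzhe Li,Zhenqiang Liu,Chenyang Liu,Shenyang Zhang,Yi Zhou
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: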
Abstract:While Large Language Models (LLMs) have catalyzed breakthroughs in automated code generation, Small Language Models (SLMs) often encounter reasoning bottlenecks and failure loops when addressing complex logical requirements. To overcome these challenges, we propose DebateCoder, a multi-agent collaborative framework designed to improve the reasoning ability of SLMs (e.g., Pangu-1B) in resource-constrained environments. DebateCoder uses a structured role-playing protocol with three agents: User Agent (A_UA), Technical Agent (A_TA), and Quality Assurance Agent (A_QA). It also includes an Adaptive Confidence Gating mechanism with a 95% threshold to balance accuracy and inference efficiency. In addition, we introduce a multi-turn deliberation module and a reviewer-guided analytical debugging loop for orthogonal pre-generation debate and post-generation refinement. Experiments on HumanEval and MBPP show that DebateCoder achieves 70.12% Pass@1 on HumanEval, outperforming MapCoder while reducing API overhead by about 35%. These results indicate that collaborative protocols can mitigate limitations of small-parameter models and provide a scalable, efficient approach to high-quality automated software engineering.
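A confidence gate of the kind mentioned in the abstract can be sketched as a simple threshold check: accept the first draft when its confidence clears the threshold, otherwise fall back to the more expensive multi-agent debate. How DebateCoder actually computes confidence is not stated in the abstract, so the draft and debate functions below are hypothetical stubs; only the 95% threshold comes from the source.

```python
def gated_generate(task, draft_fn, debate_fn, threshold=0.95):
    """Return (code, used_debate); skip the debate when confidence is high."""
    code, confidence = draft_fn(task)
    if confidence >= threshold:
        return code, False
    return debate_fn(task, code), True


# Hypothetical stubs standing in for model calls.
def confident_draft(task):
    return "def add(a, b): return a + b", 0.99


def unsure_draft(task):
    return "def add(a, b): return a - b", 0.40


def debate(task, code):
    # Stand-in for multi-agent refinement: fixes the operator.
    return code.replace("-", "+")


fast = gated_generate("add", confident_draft, debate)
slow = gated_generate("add", unsure_draft, debate)
```

The gate saves calls only when high confidence correlates with correctness, so the threshold trades accuracy against the number of debates triggered.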
[AI-86] MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning
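【Quick Read】: This paper addresses the limited context window in long-horizon agentic reasoning, where interaction histories keep growing: how to compress them efficiently and retain key information under a limited context budget. The core solution is MemOCR, a multimodal memory agent that allocates memory space with adaptive information density through visual layout: it renders a structured rich-text memory (e.g., headings, highlights) into an image so that, when consulting memory, the agent visually prioritizes crucial evidence while auxiliary details are aggressively compressed, improving context utilization. Trained with reinforcement learning under budget-aware objectives for robustness across compression levels, the method outperforms strong text-based baselines.
Link: https://arxiv.org/abs/2601.21468
Authors: Yaorui Shi,Shugui Liu,Yu Yang,Wenyu Mao,Yuxin Chen,Qi GU,Hui Su,Xunliang Cai,Xiang Wang,An Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: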
Abstract:Long-horizon agentic reasoning necessitates effectively compressing growing interaction histories into a limited context window. Most existing memory systems serialize history as text, where token-level cost is uniform and scales linearly with length, often spending scarce budget on low-value details. To this end, we introduce MemOCR, a multimodal memory agent that improves long-horizon reasoning under tight context budgets by allocating memory space with adaptive information density through visual layout. Concretely, MemOCR maintains a structured rich-text memory (e.g., headings, highlights) and renders it into an image that the agent consults for memory access, visually prioritizing crucial evidence while aggressively compressing auxiliary details. To ensure robustness across varying memory budgets, we train MemOCR with reinforcement learning under budget-aware objectives that expose the agent to diverse compression levels. Across long-context multi-hop and single-hop question-answering benchmarks, MemOCR outperforms strong text-based baselines and achieves more effective context utilization under extreme budgets.
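[AI-87] Unifying Speech Editing Detection and Content Localization via Prior-Enhanced Audio LLMs
【Quick Read】: This paper addresses the difficulty current speech editing detection methods have with the seamless acoustic transitions produced by end-to-end neural speech editing; such edits are hard to spot and lack obvious splicing traces, so detectors that rely on explicit splicing artifacts fail. The key to the solution is to construct AiEdit, a large-scale bilingual speech editing dataset, and to propose PELM (Prior-Enhanced Audio Large Language Model), the first large-model framework that unifies speech editing detection and content localization by formulating them as an audio question answering task. PELM incorporates word-level probability priors as explicit acoustic cues to mitigate the forgery bias and semantic-priority bias of existing audio large models, and designs a centroid-aggregation-based acoustic consistency perception loss that explicitly models subtle local distribution anomalies, yielding accurate detection and localization.
Link: https://arxiv.org/abs/2601.21463
Authors: Jun Xue,Yi Chai,Yanzhen Ren,Jinshen He,Zhiqiang Tang,Zhuolin Yi,Yihuan Huang,Yuankun Xie,Yujie Chen
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: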
Abstract:Speech editing achieves semantic inversion by performing fine-grained segment-level manipulation on original utterances, while preserving global perceptual naturalness. Existing detection studies mainly focus on manually edited speech with explicit splicing artifacts, and therefore struggle to cope with emerging end-to-end neural speech editing techniques that generate seamless acoustic transitions. To address this challenge, we first construct a large-scale bilingual dataset, AiEdit, which leverages large language models to drive precise semantic tampering logic and employs multiple advanced neural speech editing methods for data synthesis, thereby filling the gap of high-quality speech editing datasets. Building upon this foundation, we propose PELM (Prior-Enhanced Audio Large Language Model), the first large-model framework that unifies speech editing detection and content localization by formulating them as an audio question answering task. To mitigate the inherent forgery bias and semantic-priority bias observed in existing audio large models, PELM incorporates word-level probability priors to provide explicit acoustic cues, and further designs a centroid-aggregation-based acoustic consistency perception loss to explicitly enforce the modeling of subtle local distribution anomalies. Extensive experimental results demonstrate that PELM significantly outperforms state-of-the-art methods on both the HumanEdit and AiEdit datasets, achieving equal error rates (EER) of 0.57% and 9.28% (localization), respectively.
[AI-88] L3: Large Lookup Layers
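[Quick Read]: This paper targets two problems: the poor hardware efficiency and unstable training that dynamic hard routing causes in current Mixture-of-Experts (MoE) sparse language models, and the performance bottleneck of the traditional tokenizer embedding table, which lacks contextual information. The key to the solution is a new Large Lookup Layer (L³), which uses static token-based routing to generalize the embedding table into learnable, context-aware decoder layers, letting the model balance memory and compute flexibly without sacrificing computational efficiency. L³ has two core components: (1) a systems-friendly architecture that supports fast training and CPU-offloaded inference with no overhead; and (2) an information-theoretic embedding allocation algorithm that balances speed and quality. In language modeling and downstream tasks, L³ clearly outperforms both iso-sparse MoE models and dense models.
Link: https://arxiv.org/abs/2601.21461
Authors: Albert Tseng,Christopher De Sa
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint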
Abstract:Modern sparse language models typically achieve sparsity through Mixture-of-Experts (MoE) layers, which dynamically route tokens to dense MLP “experts.” However, dynamic hard routing has a number of drawbacks, such as potentially poor hardware efficiency and needing auxiliary losses for stable training. In contrast, the tokenizer embedding table, which is natively sparse, largely avoids these issues by selecting a single embedding per token at the cost of not having contextual information. In this work, we introduce the Large Lookup Layer (L³), which unlocks a new axis of sparsity by generalizing embedding tables to model decoder layers. L³ layers use static token-based routing to aggregate a set of learned embeddings per token in a context-dependent way, allowing the model to efficiently balance memory and compute by caching information in embeddings. L³ has two main components: (1) a systems-friendly architecture that allows for fast training and CPU-offloaded inference with no overhead, and (2) an information-theoretic embedding allocation algorithm that effectively balances speed and quality. We empirically test L³ by training transformers with up to 2.6B active parameters and find that L³ strongly outperforms both dense models and iso-sparse MoEs in both language modeling and downstream tasks.
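To make "static token-based routing" with "context-dependent" aggregation concrete, here is a rough sketch. It is not the authors' architecture: the class name, the key/value split, the softmax mixing, and the fixed number of slots per token are all assumptions made for illustration (the abstract describes an allocation algorithm that decides how embeddings are distributed, which this sketch ignores).

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class LookupLayerSketch:
    """Hypothetical lookup layer: every token id owns a few (key, value) embeddings."""

    def __init__(self, vocab_size, slots_per_token, dim, seed=0):
        rng = random.Random(seed)

        def table():
            return [[[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(slots_per_token)]
                    for _ in range(vocab_size)]

        self.dim = dim
        self.keys = table()    # keys[token][slot]: vector used for scoring
        self.values = table()  # values[token][slot]: vector that gets mixed

    def forward(self, token_id, hidden):
        # Static routing: the candidate embeddings depend only on the token id.
        keys, values = self.keys[token_id], self.values[token_id]
        # Context dependence: the hidden state decides how candidates are mixed.
        weights = softmax([dot(k, hidden) for k in keys])
        return [sum(w * v[d] for w, v in zip(weights, values))
                for d in range(self.dim)]
```

Because the rows touched in `forward` are known from the token id alone, a table like this could in principle be fetched ahead of time, which is one plausible reading of why the abstract calls the design systems-friendly.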
[AI-89] HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing
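[Quick Read]: This paper addresses the difficulty large language models (LLMs) have in simulating a character's inner thought process during role-playing: current models can imitate a character's tone and knowledge but do not capture the cognitive logic behind the character's behavior. The key to the solution is the HER framework, which introduces dual-layer thinking that separates the character's first-person thinking from the model's own third-person reasoning. The authors curate reasoning-augmented role-playing data via reverse engineering, construct human-aligned principles and reward models, and train models on top of Qwen3-32B with supervised and reinforcement learning to obtain cognitive-level persona simulation. Experiments show significant gains over the baseline on the CoSER benchmark and the Minimax Role-Play Bench.
Link: https://arxiv.org/abs/2601.21459
Authors: Chengyu Du,Xintao Wang,Aili Chen,Weiyuan Li,Rui Xu,Junteng Liu,Zishan Huang,Rong Tian,Zijun Sun,Yuhao Li,Liheng Feng,Deming Ding,Pengyu Zhao,Yanghua Xiao
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 41 pages, 10 figures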
Abstract:LLM role-playing, i.e., using LLMs to simulate specific personas, has emerged as a key capability in various applications, such as companionship, content creation, and digital games. While current models effectively capture character tones and knowledge, simulating the inner thoughts behind their behaviors remains a challenge. Towards cognitive simulation in LLM role-play, previous efforts mainly suffer from two deficiencies: a lack of data with high-quality reasoning traces, and a lack of reliable reward signals aligned with human preferences. In this paper, we propose HER, a unified framework for cognitive-level persona simulation. HER introduces dual-layer thinking, which distinguishes characters’ first-person thinking from LLMs’ third-person thinking. To bridge these gaps, we curate reasoning-augmented role-playing data via reverse engineering and construct human-aligned principles and reward models. Leveraging these resources, we train HER models based on Qwen3-32B via supervised and reinforcement learning. Extensive experiments validate the effectiveness of our approach. Notably, our models significantly outperform the Qwen3-32B baseline, achieving a 30.26 improvement on the CoSER benchmark and a 14.97 gain on the Minimax Role-Play Bench. Our datasets, principles, and models will be released to facilitate future research.
[AI-90] LION: A Clifford Neural Paradigm for Multimodal-Attributed Graph Learning
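[Quick Read]: This paper addresses two key problems in multimodal graph representation learning: existing methods ignore graph-structural context during modality alignment, which limits cross-modal interaction; and modality fusion strategies lack adaptivity, so they cannot fully exploit aligned features that carry topology priors, which hurts generalization and downstream performance. The key to the solution is LION (cLIffOrd Neural paradigm). It builds a modality-aware geometric manifold grounded in Clifford algebra to perform high-order graph propagation and modality interaction, which improves alignment quality. It also designs an adaptive holographic aggregation module that combines the energy and scale of geometric grades with learnable parameters to fuse the aligned features dynamically, making modality fusion more adaptive and expressive.
Link: https://arxiv.org/abs/2601.21453
Authors: Xunkai Li,Zhengyu Wu,Zekai Chen,Henan Sun,Daohan Su,Guang Zeng,Hongchao Qin,Rong-Hua Li,Guoren Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: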
Abstract:Recently, the rapid advancement of multimodal domains has driven a data-centric paradigm shift in graph ML, transitioning from text-attributed to multimodal-attributed graphs. This advancement significantly enhances data representation and expands the scope of graph downstream tasks, such as modality-oriented tasks, thereby improving the practical utility of graph ML. Despite its promise, limitations exist in the current neural paradigms: (1) Neglect Context in Modality Alignment: Most existing methods adopt topology-constrained or modality-specific operators as tokenizers. These aligners inevitably neglect graph context and inhibit modality interaction, resulting in suboptimal alignment. (2) Lack of Adaptation in Modality Fusion: Most existing methods are simple adaptations for 2-modality graphs and fail to adequately exploit aligned tokens equipped with topology priors during fusion, leading to poor generalizability and performance degradation. To address the above issues, we propose LION (cLIffOrd Neural paradigm) based on the Clifford algebra and decoupled graph neural paradigm (i.e., propagation-then-aggregation) to implement alignment-then-fusion in multimodal-attributed graphs. Specifically, we first construct a modality-aware geometric manifold grounded in Clifford algebra. This geometric-induced high-order graph propagation efficiently achieves modality interaction, facilitating modality alignment. Then, based on the geometric grade properties of aligned tokens, we propose adaptive holographic aggregation. This module integrates the energy and scale of geometric grades with learnable parameters to improve modality fusion. Extensive experiments on 9 datasets demonstrate that LION significantly outperforms SOTA baselines across 3 graph and 3 modality downstream tasks.
[AI-91] SAGE: Sequence-level Adaptive Gradient Evolution for Generative Recommendation
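[Quick Read]: This paper addresses two core problems in existing generative recommender systems. First, the model architecture depends on a separate vocabulary, so it cannot efficiently reuse the native vocabulary of open-source large language models (LLMs), which raises maintenance cost and limits scalability. Second, the optimization strategy suffers from "Symmetric Conservatism": static gradient boundaries suppress the update momentum of cold-start items and lead to diversity collapse in high-noise environments. The key to the solution is the SAGE (Sequence-level Adaptive Gradient Evolution) framework, with two innovations: (1) Sequence-level Signal Decoupling, which combines a geometric-mean importance ratio with decoupled multi-objective advantages to eliminate token-level variance and alleviate Reward Collapse; and (2) Asymmetric Adaptive Dynamics, which builds a dynamic gradient manifold that applies a "Boost Factor" to high-potential cold-start items for super-linear updates and adds an "Entropy Aware Penalty" to break information cocoons. Together these improve cold-start traffic and recommendation diversity while retaining the numerical stability of GBPO.
Link: https://arxiv.org/abs/2601.21452
Authors: Yu Xie,Xing Kai Ren,Ying Qi,Hu Yao
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2506.19235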
Abstract:While works such as OneRec have validated the scaling laws of Large Language Models (LLMs) in recommender systems, they rely on a cumbersome separate vocabulary. This dependency prevents the model architecture from reusing native LLM vocabularies, resulting in high maintenance costs and poor scalability. In response, we aim to efficiently reuse open-source LLM architectures without constructing a separate tokenization vocabulary. Furthermore, we identify that the optimization strategy of OneRec, Gradient Bounded Policy Optimization (GBPO), suffers from a “Symmetric Conservatism” problem: its static gradient boundaries structurally suppress the update momentum required for cold-start items and fail to prevent diversity collapse in high-noise environments. To address this issue, we propose SAGE (Sequence-level Adaptive Gradient Evolution), a unified optimization framework tailored for list-wise generative recommendation. SAGE introduces two key innovations: (1) Sequence-level Signal Decoupling: By combining a geometric mean importance ratio with decoupled multi-objective advantages, we eliminate token-level variance and resolve the “Reward Collapse” problem. (2) Asymmetric Adaptive Dynamics: We construct a dynamic gradient manifold that applies a “Boost Factor” to high-potential cold-start items to achieve super-linear updates and employs an “Entropy Aware Penalty” to break information cocoons. Theoretical analysis and empirical results demonstrate that SAGE effectively unblocks cold-start traffic and sustains recommendation diversity, all while retaining the numerical stability of GBPO.
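The "geometric mean importance ratio" can be illustrated with a short sketch. It assumes the usual definition, the geometric mean of per-token probability ratios between the new and old policy, computed in log space; the function name and inputs are hypothetical, and SAGE's actual formulation may differ in detail.

```python
import math


def sequence_importance_ratio(new_logprobs, old_logprobs):
    """Geometric mean of per-token ratios pi_new / pi_old for one sequence.

    Averaging log-ratios before exponentiating gives one ratio per sequence
    whose scale does not grow with sequence length.
    """
    if not new_logprobs or len(new_logprobs) != len(old_logprobs):
        raise ValueError("log-prob lists must be non-empty and of equal length")
    mean_log_ratio = sum(
        n - o for n, o in zip(new_logprobs, old_logprobs)
    ) / len(new_logprobs)
    return math.exp(mean_log_ratio)


# Toy example: per-token ratios of 2.0 and 0.5 have a geometric mean of 1.0.
toy_ratio = sequence_importance_ratio(
    [math.log(0.4), math.log(0.1)],
    [math.log(0.2), math.log(0.2)],
)
```

In the toy example one token doubles its probability and another halves it, so the sequence-level ratio stays at 1.0 even though both token-level ratios are far from 1.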
[AI-92] ChipBench: A Next-Step Benchmark for Evaluating LLM Performance in AI-Aided Chip Design
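[Quick Read]: This paper addresses the lack of realistic, industry-grade evaluation of large language models (LLMs) in hardware engineering: existing benchmarks have limited task diversity and are saturated, so they do not reflect LLMs' actual ability across the chip design flow. The key to the solution is a comprehensive benchmark covering three core tasks: Verilog generation, debugging, and reference model generation. It contains 44 realistic modules with complex hierarchical structures, 89 systematic debugging cases, and 132 reference model samples across Python, SystemC, and CXXRTL, and comes with an automated toolbox for generating training data to improve reference model generation, with the aim of making LLMs more practical for AI-aided chip design.
Link: https://arxiv.org/abs/2601.21448
Authors: Zhongkai Yu,Chenyang Zhou,Yichen Lin,Hejia Zhang,Haotian Ye,Junxia Cui,Zaifeng Pan,Jishen Zhao,Yufei Ding
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: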
Abstract:While Large Language Models (LLMs) show significant potential in hardware engineering, current benchmarks suffer from saturation and limited task diversity, failing to reflect LLMs’ performance in real industrial workflows. To address this gap, we propose a comprehensive benchmark for AI-aided chip design that rigorously evaluates LLMs across three critical tasks: Verilog generation, debugging, and reference model generation. Our benchmark features 44 realistic modules with complex hierarchical structures, 89 systematic debugging cases, and 132 reference model samples across Python, SystemC, and CXXRTL. Evaluation results reveal substantial performance gaps, with state-of-the-art Claude-4.5-opus achieving only 30.74% on Verilog generation and 13.33% on Python reference model generation, demonstrating significant challenges compared to existing saturated benchmarks where SOTA models achieve over 95% pass rates. Additionally, to help enhance LLM reference model generation, we provide an automated toolbox for high-quality training data generation, facilitating future research in this underexplored domain. Our code is available at this https URL.
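[AI-93] The Paradox of Robustness: Decoupling Rule-Based Logic from Affective Noise in High-Stakes Decision-Making
[Quick Read]: This paper examines how sensitive large language models (LLMs) are to narrative-induced bias in high-stakes decision-making, in particular whether they keep their logical consistency and rule adherence when faced with emotional or leading framing. The conventional view holds that LLMs are brittle under small prompt perturbations (lexical brittleness), but the stability of their behavior under rule constraints has been unclear. The key to the solution is a novel controlled perturbation framework applied to three high-stakes domains (healthcare, law, and finance). It shows that instruction-tuned LLMs are almost completely invariant to emotional framing effects: models resist narrative manipulation 110-300 times more strongly than humans (Cohen's h = 0.003 versus [0.3, 0.8]). The finding indicates that LLMs can decouple logical rule adherence from persuasive narratives, offering stable and potentially de-biasing AI assistance for institutional decision-making.
Link: https://arxiv.org/abs/2601.21439
Authors: Jon Chun,Katherine Elkins
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 22 pages, 10 figures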
Abstract:While Large Language Models (LLMs) are widely documented to be sensitive to minor prompt perturbations and prone to sycophantic alignment with user biases, their robustness in consequential, rule-bound decision-making remains under-explored. In this work, we uncover a striking “Paradox of Robustness”: despite their known lexical brittleness, instruction-tuned LLMs exhibit a behavioral and near-total invariance to emotional framing effects. Using a novel controlled perturbation framework across three high-stakes domains (healthcare, law, and finance), we quantify a robustness gap where LLMs demonstrate 110-300 times greater resistance to narrative manipulation than human subjects. Specifically, we find a near-zero effect size for models (Cohen’s h = 0.003) compared to the substantial biases observed in humans (Cohen’s h in [0.3, 0.8]). This result is highly counterintuitive and suggests the mechanisms driving sycophancy and prompt sensitivity do not necessarily translate to a failure in logical constraint satisfaction. We show that this invariance persists across models with diverse training paradigms. Our findings show that while LLMs may be “brittle” to how a query is formatted, they are remarkably “stable” against why a decision should be biased. Our findings establish that instruction-tuned models can decouple logical rule-adherence from persuasive narratives, offering a source of decision stability that complements, and even potentially de-biases, human judgment in institutional contexts. We release the 162-scenario benchmark, code, and data to facilitate the rigorous evaluation of narrative-induced bias and robustness on this http URL.
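For readers unfamiliar with the effect size used here, Cohen's h compares two proportions after an arcsine transform: h = 2·arcsin(√p₁) − 2·arcsin(√p₂). The sketch below implements that standard formula; the proportions in the example are invented for illustration and are not taken from the paper.

```python
import math


def cohens_h(p1, p2):
    """Cohen's h effect size between two proportions in [0, 1]."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Invented example: a decision rate moving from 25% to 50% under a framing.
h_example = cohens_h(0.50, 0.25)  # equals pi / 6, roughly 0.52
```

A shift like the invented one above would land inside the [0.3, 0.8] band the abstract reports for humans, whereas h = 0.003 corresponds to two proportions that are nearly identical.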
[AI-94] When Prohibitions Become Permissions: Auditing Negation Sensitivity in Language Models
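[Quick Read]: This paper addresses the serious misjudgment that current large language models show when handling negated instructions: models often interpret "do not X" as "do X", creating ethical and safety risks. The study finds that open-source models endorse prohibited actions 77% of the time under simple negation and 100% of the time under compound negation, well above the rate under affirmative framing; commercial models do better but still show large swings. The key to the solution is the **Negation Sensitivity Index (NSI)** as a quantitative evaluation metric, together with a tiered certification framework with domain-specific thresholds, so that models in high-risk settings can be required to distinguish "do X" from "do not X" reliably. This addresses the logical-consistency gap left by current alignment techniques and supports safe deployment in critical applications.
Link: https://arxiv.org/abs/2601.21433
Authors: Katherine Elkins,Jon Chun
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 13 pages, 5 figures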
Abstract:When a user tells an AI system that someone “should not” take an action, the system ought to treat this as a prohibition. Yet many large language models do the opposite: they interpret negated instructions as affirmations. We audited 16 models across 14 ethical scenarios and found that open-source models endorse prohibited actions 77% of the time under simple negation and 100% under compound negation – a 317% increase over affirmative framing. Commercial models fare better but still show swings of 19-128%. Agreement between models drops from 74% on affirmative prompts to 62% on negated ones, and financial scenarios prove twice as fragile as medical ones. These patterns hold under deterministic decoding, ruling out sampling noise. We present case studies showing how these failures play out in practice, propose the Negation Sensitivity Index (NSI) as a governance metric, and outline a tiered certification framework with domain-specific thresholds. The findings point to a gap between what current alignment techniques achieve and what safe deployment requires: models that cannot reliably distinguish “do X” from “do not X” should not be making autonomous decisions in high-stakes contexts.
[AI-95] Mitigating Overthinking in Large Reasoning Models via Difficulty-aware Reinforcement Learning
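[Quick Read]: This paper addresses "overthinking" in Large Reasoning Models (LRMs) on simple tasks: generation preferences induced by the reward function during post-training lead the model to produce long, unnecessary reasoning and waste resources. The key to the solution is a reinforcement-learning-based training framework, Difficulty-aware Policy Optimization (DiPO), with two core innovations: (1) a task-difficulty modeling method built on the model's own reasoning, which reduces reliance on manual annotation and formalizes task complexity; and (2) a difficulty-signal-enhanced reward function that penalizes lengthy reasoning while accounting for reasoning performance and output format. This guides the model to adjust its reasoning overhead on its own and allocate reasoning resources effectively.
Link: https://arxiv.org/abs/2601.21418
Authors: Qian Wan,Ziao Xu,Luona Wei,Xiaoxuan Shen,Jianwen Sun
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: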
Abstract:Large Reasoning Models (LRMs) achieve explicit chain-of-thought expansion by imitating deep thinking behaviors of humans, demonstrating excellent performance in complex task scenarios. However, the deep-thinking mode often leads to unnecessarily lengthy reasoning and resource inefficiency when handling simple tasks. This overthinking phenomenon may arise from the generation preference triggered by the reward function during post-training. Existing research attempts to mitigate overthinking from the perspective of prompt design or model training, but generally underestimates the importance of task difficulty awareness, which makes it difficult for LRMs to effectively allocate reasoning resources. In this paper, we propose Difficulty-aware Policy Optimization (DiPO), a reinforcement learning-based LRM training framework. DiPO encourages LRMs to spontaneously model task complexity and integrates this signal into the reinforcement learning framework to adjust the generation preferences introduced by post-training. A difficulty modeling method based on model self-reasoning is proposed, which significantly reduces the dependence on manual annotation and formalizes task complexity. We further develop a difficulty-signal-enhanced reward function that incorporates a penalty for lengthy reasoning while considering reasoning performance and output format. Experimental results indicate that DiPO enables the model to spontaneously adjust inference overhead, significantly reducing redundant tokens without losing performance due to thought compression.
[AI-96] DataCross: A Unified Benchmark and Agent Framework for Cross-Modal Heterogeneous Data Analysis
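[Quick Read]: This paper addresses the fact that, in real-world data science and enterprise decision-making, key information is scattered across structured sources (e.g., SQL, CSV) and unstructured visual documents (e.g., scanned reports, invoice images). Existing data analysis agents are largely limited to structured data and cannot activate and correlate high-value "zombie data", leaving a significant gap with industrial needs. The key to the solution is DataCross, which consists of DataCrossBench, a benchmark of 200 end-to-end analysis tasks, and DataCrossAgent, a collaborative agent architecture. The benchmark is built with a human-in-the-loop reverse-synthesis pipeline that ensures realistic complexity, cross-source dependency, and verifiable ground truth. The agent follows a human-like "divide-and-conquer" workflow: sub-agents specialized in particular data sources are coordinated through a structured process of Intra-source Deep Exploration, Key Source Identification, and Contextual Cross-pollination, and a novel reReAct mechanism provides robust code generation and debugging for factual verification. DataCrossAgent improves factuality by 29.7% over GPT-4o and is more robust on high-difficulty tasks.
Link: https://arxiv.org/abs/2601.21403
Authors: Ruyi Qi,Zhou Liu,Wentao Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: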
Abstract:In real-world data science and enterprise decision-making, critical information is often fragmented across directly queryable structured sources (e.g., SQL, CSV) and “zombie data” locked in unstructured visual documents (e.g., scanned reports, invoice images). Existing data analytics agents are predominantly limited to processing structured data, failing to activate and correlate this high-value visual information, thus creating a significant gap with industrial needs. To bridge this gap, we introduce DataCross, a novel benchmark and collaborative agent framework for unified, insight-driven analysis across heterogeneous data modalities. DataCrossBench comprises 200 end-to-end analysis tasks across finance, healthcare, and other domains. It is constructed via a human-in-the-loop reverse-synthesis pipeline, ensuring realistic complexity, cross-source dependency, and verifiable ground truth. The benchmark categorizes tasks into three difficulty tiers to evaluate agents’ capabilities in visual table extraction, cross-modal alignment, and multi-step joint reasoning. We also propose the DataCrossAgent framework, inspired by the “divide-and-conquer” workflow of human analysts. It employs specialized sub-agents, each an expert on a specific data source, which are coordinated via a structured workflow of Intra-source Deep Exploration, Key Source Identification, and Contextual Cross-pollination. A novel reReAct mechanism enables robust code generation and debugging for factual verification. Experimental results show that DataCrossAgent achieves a 29.7% improvement in factuality over GPT-4o and exhibits superior robustness on high-difficulty tasks, effectively activating fragmented “zombie data” for insightful, cross-modal analysis.
[AI-97] Intrinsic Reward Policy Optimization for Sparse-Reward Environments
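[Quick Read]: This paper addresses inefficient exploration in reinforcement learning under sparse rewards. In sparse-reward settings, conventional exploration strategies such as noise injection are of limited use, and existing intrinsic-reward methods suffer either from unstable credit assignment or from sample inefficiency and sub-optimality. The key to the solution is a policy optimization framework called Intrinsic Reward Policy Optimization (IRPO), which leverages multiple intrinsic rewards to directly optimize a policy for the extrinsic reward without pretraining subpolicies. Its core innovation is a surrogate policy gradient that provides a more informative learning signal than the true gradient in sparse-reward environments, improving performance and sample efficiency.
Link: https://arxiv.org/abs/2601.21391
Authors: Minjae Cho,Huy Trong Tran
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: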
Abstract:Exploration is essential in reinforcement learning as an agent relies on trial and error to learn an optimal policy. However, when rewards are sparse, naive exploration strategies, like noise injection, are often insufficient. Intrinsic rewards can also provide principled guidance for exploration by, for example, combining them with extrinsic rewards to optimize a policy or using them to train subpolicies for hierarchical learning. However, the former approach suffers from unstable credit assignment, while the latter exhibits sample inefficiency and sub-optimality. We propose a policy optimization framework that leverages multiple intrinsic rewards to directly optimize a policy for an extrinsic reward without pretraining subpolicies. Our algorithm – intrinsic reward policy optimization (IRPO) – achieves this by using a surrogate policy gradient that provides a more informative learning signal than the true gradient in sparse-reward environments. We demonstrate that IRPO improves performance and sample efficiency relative to baselines in discrete and continuous environments, and formally analyze the optimization problem solved by IRPO. Our code is available at this https URL.
[AI-98] Understanding Frechet Speech Distance for Synthetic Speech Quality Evaluation ICASSP2026
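[Quick Read]: This paper addresses the challenge of objectively evaluating synthetic speech quality, that is, how to evaluate speech quality efficiently and reliably without costly human listening tests that are hard to run at scale. The key to the solution is a systematic evaluation of two embedding-based metrics, Fréchet Speech Distance (FSD) and Speech Maximum Mean Discrepancy (SMMD), validated against human subjective ratings, TTS intelligibility, and the word error rate (WER) of ASR trained on synthetic speech. The study finds that FSD and SMMD align most stably with human ratings when computed on WavLM Base+ features, which suggests these metrics can serve as low-cost, reproducible, and perceptually relevant complements to traditional subjective evaluation at scale.
Link: https://arxiv.org/abs/2601.21386
Authors: June-Woo Kim,Dhruv Agarwal,Federica Cerina
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: accepted to ICASSP 2026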
Abstract:Objective evaluation of synthetic speech quality remains a critical challenge. Human listening tests are the gold standard, but costly and impractical at scale. Fréchet Distance has emerged as a promising alternative, yet its reliability depends heavily on the choice of embeddings and experimental settings. In this work, we comprehensively evaluate Fréchet Speech Distance (FSD) and its variant Speech Maximum Mean Discrepancy (SMMD) under varied embeddings and conditions. We further incorporate human listening evaluations alongside TTS intelligibility and synthetic-trained ASR WER to validate the perceptual relevance of these metrics. Our findings show that WavLM Base+ features yield the most stable alignment with human ratings. While FSD and SMMD cannot fully replace subjective evaluation, we show that they can serve as complementary, cost-efficient, and reproducible measures, particularly useful when large-scale or direct listening assessments are infeasible. Code is available at this https URL.
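As background, the two metric families can be written out in their standard textbook form. The symbols below are assumptions for illustration: μ and Σ are the mean and covariance of embeddings from real (r) and synthetic (s) speech, and k is a kernel; the abstract does not state which kernel or estimator the paper's SMMD uses.

```latex
% Frechet distance between Gaussians fitted to real (r) and synthetic (s) embeddings
\mathrm{FSD} = \lVert \mu_r - \mu_s \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_s - 2\,(\Sigma_r \Sigma_s)^{1/2} \right)

% Squared maximum mean discrepancy between distributions P (real) and Q (synthetic)
\mathrm{MMD}^2(P, Q) = \mathbb{E}_{x,x' \sim P}[k(x,x')]
  + \mathbb{E}_{y,y' \sim Q}[k(y,y')]
  - 2\,\mathbb{E}_{x \sim P,\, y \sim Q}[k(x,y)]
```

Both quantities depend entirely on the embedding space in which they are computed, which is consistent with the abstract's point that reliability hinges on the choice of embeddings.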
[AI-99] Sim-MSTNet: sim2real based Multi-task SpatioTemporal Network Traffic Forecasting ICASSP2026
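[Quick Read]: This paper addresses poor model generalization caused by data scarcity in network traffic forecasting, as well as the task imbalance and negative transfer that multi-task learning methods face when modeling different service types. The key to the solution is the Sim-MSTNet model, whose core ideas are: using a simulator to generate synthetic data to offset the shortage of real data; applying domain randomization with bi-level optimization (jointly optimizing sample weights and model training) to narrow the distribution gap between synthetic and real data; and introducing attention-based selective knowledge sharing and dynamic loss weighting to balance the multi-task objectives, improving overall forecasting accuracy and robustness.
Link: https://arxiv.org/abs/2601.21384
Authors: Hui Ma,Qingzhong Li,Jin Wang,Jie Wu,Shaoyu Dou,Li Feng,Xinjun Pei
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: accepted in ICASSP 2026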
Abstract:Network traffic forecasting plays a crucial role in intelligent network operations, but existing techniques often perform poorly when faced with limited data. Additionally, multi-task learning methods struggle with task imbalance and negative transfer, especially when modeling various service types. To overcome these challenges, we propose Sim-MSTNet, a multi-task spatiotemporal network traffic forecasting model based on the sim2real approach. Our method leverages a simulator to generate synthetic data, effectively addressing the issue of poor generalization caused by data scarcity. By employing a domain randomization technique, we reduce the distributional gap between synthetic and real data through bi-level optimization of both sample weighting and model training. Moreover, Sim-MSTNet incorporates attention-based mechanisms to selectively share knowledge between tasks and applies dynamic loss weighting to balance task objectives. Extensive experiments on two open-source datasets show that Sim-MSTNet consistently outperforms state-of-the-art baselines, achieving enhanced accuracy and generalization.
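[AI-100] TeachBench: A Syllabus-Grounded Framework for Evaluating Teaching Ability in Large Language Models
[Quick Read]: This paper addresses the lack of systematic evaluation of the teaching ability of large language models (LLMs) used as teaching assistants. Existing benchmarks mostly focus on problem-solving or problem-level guidance and overlook knowledge-centered teaching. The authors therefore propose a syllabus-grounded evaluation framework whose key idea is to quantify an LLM's teaching ability by the improvement in student performance after multi-turn instruction. By restricting teacher agents to structured knowledge points and example problems, the framework avoids information leakage and allows existing benchmarks to be reused, making teaching behavior measurable and comparable.
Link: https://arxiv.org/abs/2601.21375
Authors: Zheng Li,Siyao Song,Jingyuan Ma,Rui Li,Ying Zeng,Minghao Li,Zhifang Sui
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: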
Abstract:Large language models (LLMs) show promise as teaching assistants, yet their teaching capability remains insufficiently evaluated. Existing benchmarks mainly focus on problem-solving or problem-level guidance, leaving knowledge-centered teaching underexplored. We propose a syllabus-grounded evaluation framework that measures LLM teaching capability via student performance improvement after multi-turn instruction. By restricting teacher agents to structured knowledge points and example problems, the framework avoids information leakage and enables reuse of existing benchmarks. We instantiate the framework on Gaokao data across multiple subjects. Experiments reveal substantial variation in teaching effectiveness across models and domains: some models perform well in mathematics, while teaching remains challenging in physics and chemistry. We also find that incorporating example problems does not necessarily improve teaching, as models often shift toward example-specific error correction. Overall, our results highlight teaching ability as a distinct and measurable dimension of LLM behavior.
[AI-101] NEMO: Execution-Aware Optimization Modeling via Autonomous Coding Agents
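[Quick Read]: This paper addresses the difficulty of automatically translating natural-language descriptions of decision problems into executable mathematical optimization models. Existing methods usually rely on specialized large language models (LLMs) or task-specific agents and often produce syntactically invalid or non-executable code with poor robustness. The key to the solution is the NEMO system, which treats remote interaction with autonomous coding agents (ACAs) as a first-class abstraction and builds an execution-aware agent architecture around them. Sandboxed execution ensures that generated code is executable, and new coordination patterns, including asymmetric validation loops, external memory, and minimum Bayes risk (MBR) decoding, support structured iterative refinement of task specifications and automated repair. NEMO achieves state-of-the-art performance on the majority of nine optimization benchmarks.
Link: https://arxiv.org/abs/2601.21372
Authors: Yang Song,Anoushka Vyas,Zirui Wei,Sina Khoshfetrat Pakazad,Henrik Ohlsson,Graham Neubig
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: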
Abstract:In this paper, we present NEMO, a system that translates Natural-language descriptions of decision problems into formal Executable Mathematical Optimization implementations, operating collaboratively with users or autonomously. Existing approaches typically rely on specialized large language models (LLMs) or bespoke, task-specific agents. Such methods are often brittle and complex, and frequently generate syntactically invalid or non-executable code. NEMO instead centers on remote interaction with autonomous coding agents (ACAs), treated as a first-class abstraction analogous to API-based interaction with LLMs. This design enables the construction of higher-level systems around ACAs that structure, consolidate, and iteratively refine task specifications. Because ACAs execute within sandboxed environments, code produced by NEMO is executable by construction, allowing automated validation and repair. Building on this, we introduce novel coordination patterns with and across ACAs, including asymmetric validation loops between independently generated optimizer and simulator implementations (serving as a high-level validation mechanism), external memory for experience reuse, and robustness enhancements via minimum Bayes risk (MBR) decoding and self-consistency. We evaluate NEMO on nine established optimization benchmarks. As depicted in Figure 1, it achieves state-of-the-art performance on the majority of tasks, with substantial margins on several datasets, demonstrating the power of execution-aware agentic architectures for automated optimization modeling.
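Minimum Bayes risk (MBR) decoding, mentioned among the robustness enhancements, picks the candidate that agrees most with the other sampled candidates instead of the single most probable one. The sketch below shows the generic idea only: the token-overlap utility and the function names are stand-ins chosen for illustration, and the abstract does not say which utility NEMO uses.

```python
def token_overlap(a, b):
    """Jaccard similarity over whitespace tokens (a stand-in utility function)."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mbr_select(candidates, utility=token_overlap):
    """Return the candidate with the highest average utility against the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], c) for c in others) / len(others)

    # Ties are broken in favour of the earliest candidate.
    best_index = max(range(len(candidates)), key=expected_utility)
    return candidates[best_index]
```

For optimization models, a more meaningful utility would compare executed results (for example, objective values) rather than surface tokens; that substitution only requires passing a different `utility` callable.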
[AI-102] Hebbian Learning with Global Direction ICASSP2026
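[Quick Read]: This paper addresses the limits of standard backpropagation in biological plausibility and computational cost, and the difficulty of scaling existing Hebbian learning, which relies only on local information, to complex networks and tasks. The key to the solution is a model-agnostic Global-guided Hebbian Learning (GHL) framework. It combines a local update mechanism (Oja's rule with competitive learning, for stability) with a global signal (a sign-based guidance signal that aligns updates with the task objective), so that local plasticity and the global objective are optimized together. This markedly improves Hebbian learning on large networks and complex datasets such as ImageNet.
Link: https://arxiv.org/abs/2601.21367
Authors: Wenjia Hua,Kejie Zhao,Luziwei Leng,Ran Cheng,Yuxin Ma,Qinghai Guo
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to ICASSP 2026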
Abstract:The backpropagation algorithm has driven the remarkable success of deep neural networks, but its lack of biological plausibility and high computational costs have motivated the ongoing search for alternative training methods. Hebbian learning has attracted considerable interest as a biologically plausible alternative to backpropagation. Nevertheless, its exclusive reliance on local information, without consideration of global task objectives, fundamentally limits its scalability. Inspired by the biological synergy between neuromodulators and local plasticity, we introduce a novel model-agnostic Global-guided Hebbian Learning (GHL) framework, which seamlessly integrates local and global information to scale up across diverse networks and tasks. Specifically, the local component employs Oja’s rule with competitive learning to ensure stable and effective local updates. Meanwhile, the global component introduces a sign-based signal that guides the direction of local Hebbian plasticity updates. Extensive experiments demonstrate that our method consistently outperforms existing Hebbian approaches. Notably, on large-scale networks and complex datasets like ImageNet, our framework achieves competitive results and significantly narrows the gap with standard backpropagation.
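Oja's rule for a single linear neuron is Δw = η·y·(x − y·w) with y = w·x; the subtractive term keeps the weight norm bounded. The sketch below implements that rule and adds a `global_sign` multiplier as a guess at how a sign-based global signal might flip the update direction. That multiplier, and the omission of competitive learning, are assumptions for illustration rather than the GHL update itself.

```python
def oja_update(w, x, lr=0.01, global_sign=1.0):
    """One Oja's-rule step for a single linear neuron.

    global_sign is a hypothetical +1/-1 modulation of the update direction;
    with the default of +1 this is plain Oja's rule.
    """
    y = sum(wi * xi for wi, xi in zip(w, x))  # neuron output
    return [wi + global_sign * lr * y * (xi - y * wi) for wi, xi in zip(w, x)]


# Repeatedly presenting one input drives the weights toward a unit vector
# aligned with that input.
weights = [0.5, 0.5]
for _ in range(2000):
    weights = oja_update(weights, [1.0, 0.0], lr=0.1)
weight_norm = sum(wi * wi for wi in weights) ** 0.5
```

The update uses only the neuron's own input, output, and weights, which is what makes the rule local; the global signal is the single piece of information that would come from outside the neuron.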
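[AI-103] BEAP-Agent: Backtrackable Execution and Adaptive Planning for GUI Agents
[Quick Read]: This paper addresses the lack of an effective backtracking mechanism in existing graphical user interface (GUI) agents, which struggle to recover after leaving the correct exploration path and therefore fail tasks. The key to the solution is BEAP-Agent, a framework based on depth-first search (DFS). It supports multi-level state backtracking with dynamic task tracking and updating, and coordinates three components, a Planner, an Executor, and a Tracker, to explore long-horizon tasks systematically and execute them with fault tolerance, improving the robustness and success rate of GUI agents in complex scenarios.
Link: https://arxiv.org/abs/2601.21352
Authors: Ziyu Lu,Tengjin Weng,Yiying Yang,Yuhang Zhao,Xinxin Huang,Wenhao Jiang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: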
Abstract:GUI agents are designed to automate repetitive tasks and enhance productivity. However, existing GUI agents struggle to recover once they follow an incorrect exploration path, often leading to task failure. In this work, we model GUI task execution as a DFS process and propose BEAP-Agent, a DFS-based framework that supports long-range, multi-level state backtracking with dynamic task tracking and updating. The framework consists of three collaborative components: Planner, Executor, and Tracker. Together, they enable effective task exploration and execution. BEAP-Agent fills the gap in systematic backtracking mechanisms for GUI agents, offering a systematic solution for long-horizon task exploration. We conducted a systematic evaluation on the OSWorld benchmark, where BEAP-Agent achieved an accuracy of 28.2%, validating the effectiveness of the proposed method.
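Modeling task execution as DFS means that when an action sequence reaches a dead end, the agent undoes the last step and tries the next alternative. The toy sketch below shows only that backtracking skeleton; the screen graph, the action names, and the function signature are invented for illustration and do not reflect BEAP-Agent's Planner, Executor, or Tracker.

```python
def dfs_execute(start, is_goal, next_actions, step):
    """Depth-first exploration that backtracks out of dead ends.

    Returns the list of actions that reaches a goal state, or None.
    """
    path, visited = [], set()

    def explore(state):
        if is_goal(state):
            return True
        if state in visited:
            return False
        visited.add(state)
        for action in next_actions(state):
            path.append(action)
            if explore(step(state, action)):
                return True
            path.pop()  # backtrack: undo the action that led nowhere
        return False

    return path if explore(start) else None


# Hypothetical GUI: screens are nodes, actions are edges.
toy_gui = {
    "home": {"open_settings": "settings", "open_files": "files"},
    "settings": {},
    "files": {"open_report": "report"},
    "report": {},
}
plan = dfs_execute(
    "home",
    is_goal=lambda s: s == "report",
    next_actions=lambda s: list(toy_gui[s]),
    step=lambda s, a: toy_gui[s][a],
)
```

In the toy run the search first enters the settings screen, finds no way forward, pops that action, and then succeeds through the files screen. A real GUI agent would also need a way to restore the interface to an earlier state, which this sketch sidesteps by treating states as plain values.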
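[AI-104] Theoretically Optimal Attention/FFN Ratios in Disaggregated LLM Serving ICML2026
[Quick Read]: This paper addresses a performance bottleneck in large language model (LLM) decoding caused by unbalanced provisioning of compute between Attention and the feed-forward network (FFN): a mis-sized Attention/FFN ratio leads to step-level blocking and more device idle time. The key to the solution is a tractable analytical framework for sizing the Attention-to-FFN bundle ratio (A/F ratio) in an rA-1F topology. Using a probabilistic workload model, it derives closed-form rules for the A/F ratio that maximizes average throughput, and a trace-calibrated AFD simulator validates the theory: the theoretical optimal A/F ratio is within 10% of the simulation optimum and consistently reduces device idle time.
Link: https://arxiv.org/abs/2601.21351
Authors: Chendong Song,Meixuan Wang,Hang Zhou,Hong Liang,Yuan Lyu,Zixi Chen,Yuwei Fan,Zijie Zhou
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to ICML 2026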
Abstract:Attention-FFN disaggregation (AFD) is an emerging architecture for LLM decoding that separates state-heavy, KV-cache-dominated Attention computation from stateless, compute-intensive FFN computation, connected by per-step communication. While AFD enables independent scaling of memory and compute resources, its performance is highly sensitive to the Attention/FFN provisioning ratio: mis-sizing induces step-level blocking and costly device idle time. We develop a tractable analytical framework for sizing AFD bundles in an rA-1F topology, where the key difficulty is that Attention-side work is nonstationary (token context grows and requests are continuously replenished with random lengths), while FFN work is stable given the aggregated batch. Using a probabilistic workload model, we derive closed-form rules for the optimal A/F ratio that maximize average throughput per instance across the system. A trace-calibrated AFD simulator validates the theory: across workloads, the theoretical optimal A/F ratio matches the simulation-optimal within 10%, and consistently reduces idle time.
[AI-105] L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts
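[Quick read]: This paper addresses insufficient routing discriminability and unstable expert specialization in Mixture-of-Experts (MoE) models, which arise because routers apply linear maps in the raw high-dimensional representation space. Representation mismatch, angular concentration, and scale-sensitive scoring jointly undermine routing stability and the effective division of labor among experts. The key to its solution is a unified Low-rank and Lipschitz-controlled Routing framework (L2R), whose core innovations are: 1) performing expert assignment in a shared low-rank latent routing space, which reshapes the routing space; 2) introducing Saturated Inner-Product Scoring (SIPS) to explicitly control the Lipschitz behavior of the routing function, yielding a smoother and more stable routing geometry; and 3) adding a parameter-efficient multi-anchor routing mechanism to enhance expert expressiveness. Experiments on a large-scale language MoE model and a vision MoE setting on ImageNet show that L2R consistently improves routing stability, expert specialization, and overall model performance.
Link: https://arxiv.org/abs/2601.21349
Authors: Minghao Yang,Ren Togo,Guang Li,Takahiro Ogawa,Miki Haseyama
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: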
Abstract:Mixture-of-Experts (MoE) models scale neural networks by conditionally activating a small subset of experts, where the router plays a central role in determining expert specialization and overall model performance. However, many modern MoE systems still adopt linear routers in raw high-dimensional representation spaces, where representation mismatch, angular concentration, and scale-sensitive scoring can jointly undermine routing discriminability and stable expert specialization. In this work, we propose Low-rank & Lipschitz-controlled Routing (L2R), a unified routing framework that reshapes both the routing space and scoring geometry. L2R performs expert assignment in a shared low-rank latent routing space and introduces Saturated Inner-Product Scoring (SIPS) to explicitly control the Lipschitz behavior of routing functions, yielding smoother and more stable routing geometry. In addition, L2R incorporates a parameter-efficient multi-anchor routing mechanism to enhance expert expressiveness. Extensive experiments on a large-scale language MoE model and a vision MoE setting on ImageNet demonstrate that L2R consistently improves routing stability, expert specialization, and overall model performance.
[AI-106] Memorization Control in Diffusion Models from Denoising-centric Perspective
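[Quick read]: This paper addresses memorization in diffusion models, where generated data overfits the training set distribution and generalization suffers. Existing methods mostly modify the data or the model itself and overlook how unevenly learning is distributed across the denoising process. Taking a denoising-centric perspective, the paper shows that uniform timestep sampling leads to unequal learning contributions across denoising steps because of differences in signal-to-noise ratio (SNR), which biases training toward memorizing the training data. The key to its solution is a new timestep sampling strategy that explicitly controls where learning occurs along the denoising trajectory, in particular by shifting the learning emphasis toward later denoising steps, and that gives direct control over the memorization-generalization trade-off by adjusting the width of a confidence interval. Experiments on image and 1D signal generation tasks show that the method reduces memorization and improves the alignment of generated data with the training distribution.
Link: https://arxiv.org/abs/2601.21348
Authors: Thuy Phuong Vu,Mai Viet Hoang Do,Minhhuy Le,Dinh-Cuong Hoang,Phan Xuan Tan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: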
Abstract:Controlling memorization in diffusion models is critical for applications that require generated data to closely match the training distribution. Existing approaches mainly focus on data-centric or model-centric modifications, treating the diffusion model as an isolated predictor. In this paper, we study memorization in diffusion models from a denoising-centric perspective. We show that uniform timestep sampling leads to unequal learning contributions across denoising steps due to differences in signal-to-noise ratio, which biases training toward memorization. To address this, we propose a timestep sampling strategy that explicitly controls where learning occurs along the denoising trajectory. By adjusting the width of the confidence interval, our method provides direct control over the memorization-generalization trade-off. Experiments on image and 1D signal generation tasks demonstrate that shifting learning emphasis toward later denoising steps consistently reduces memorization and improves distributional alignment with training data, validating the generality and effectiveness of our approach.
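[AI-107] Dynamic Framework for Collaborative Learning: Leveraging Advanced LLM with Adaptive Feedback Mechanisms
[Quick read]: This paper addresses the static content and lack of personalized interaction in current collaborative learning platforms, which fall short in student engagement, the development of critical thinking, and the inclusivity of the learning experience. The key to its solution is a dynamic moderation framework built on large language models (LLMs): the LLM acts as an intelligent moderator of real-time discussions, dynamically adjusting prompts and discussion flow to learners' needs. A modular architecture (ReactJS frontend, Flask backend) provides efficient question retrieval and personalized interaction, and robust feedback mechanisms refine the AI moderation, promote reflective learning, and balance participation among users, with the aim of improving collaboration and depth of learning.
Link: https://arxiv.org/abs/2601.21344
Authors: Hassam Tahir,Faizan Faisal,Fady Alnajjar,Muhammad Imran Taj,Lucia Gordon,Aila Khan,Michael Lwin,Omar Mubin
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Comments: Publication Link: this https URL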
Abstract:This paper presents a framework for integrating LLMs into collaborative learning platforms to enhance student engagement, critical thinking, and inclusivity. The framework employs advanced LLMs as dynamic moderators to facilitate real-time discussions and adapt to learners’ evolving needs, ensuring diverse and inclusive educational experiences. Key innovations include robust feedback mechanisms that refine AI moderation, promote reflective learning, and balance participation among users. The system’s modular architecture, featuring ReactJS for the frontend, Flask for backend operations, and efficient question retrieval, supports personalized and engaging interactions through dynamic adjustments to prompts and discussion flows. Testing demonstrates that the framework significantly improves student collaboration, fosters deeper comprehension, and scales effectively across various subjects and user groups. By addressing limitations in static moderation and personalization in existing systems, this work establishes a strong foundation for next-generation AI-driven educational tools, advancing equitable and impactful learning outcomes.
[AI-108] Ostrakon-VL: Towards Domain-Expert MLLM for Food-Service and Retail Stores
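[Quick read]: This paper addresses two core obstacles to deploying Multimodal Large Language Models (MLLMs) in Food-Service and Retail Stores (FSRS) scenarios. First, real-world FSRS data is highly noisy because it is collected from heterogeneous acquisition devices, and it lacks auditable, closed-loop data curation, which makes it hard to build high-quality, controllable, and reproducible training corpora. Second, existing evaluation protocols lack a unified, fine-grained, standardized benchmark covering single-image, multi-image, and video inputs, so model robustness is hard to measure objectively. The key to its solution lies in three components: Ostrakon-VL, an FSRS-oriented MLLM built on Qwen3-VL-8B; ShopBench, the first public benchmark dedicated to FSRS; and QUAD (Quality-aware Unbiased Automated Data-curation), a multi-stage multimodal instruction data curation pipeline. With a multi-stage training strategy, Ostrakon-VL reaches an average score of 60.1 on ShopBench, ahead of the same-scale Qwen3-VL-8B (55.3) and the much larger Qwen3-VL-235B-A22B (59.4), indicating better parameter efficiency and stronger FSRS perception and decision-making.
Link: https://arxiv.org/abs/2601.21342
Authors: Zhiyong Shen,Gongpeng Zhao,Jun Zhou,Li Yu,Guandong Kou,Jichen Li,Chuanlei Dong,Zuncheng Li,Kaimao Li,Bingkun Wei,Shicheng Hu,Wei Xia,Wenguo Duan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: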
Abstract:Multimodal Large Language Models (MLLMs) have recently achieved substantial progress in general-purpose perception and reasoning. Nevertheless, their deployment in Food-Service and Retail Stores (FSRS) scenarios encounters two major obstacles: (i) real-world FSRS data, collected from heterogeneous acquisition devices, are highly noisy and lack auditable, closed-loop data curation, which impedes the construction of high-quality, controllable, and reproducible training corpora; and (ii) existing evaluation protocols do not offer a unified, fine-grained and standardized benchmark spanning single-image, multi-image, and video inputs, making it challenging to objectively gauge model robustness. To address these challenges, we first develop Ostrakon-VL, an FSRS-oriented MLLM based on Qwen3-VL-8B. Second, we introduce ShopBench, the first public benchmark for FSRS. Third, we propose QUAD (Quality-aware Unbiased Automated Data-curation), a multi-stage multimodal instruction data curation pipeline. Leveraging a multi-stage training strategy, Ostrakon-VL achieves an average score of 60.1 on ShopBench, establishing a new state of the art among open-source MLLMs with comparable parameter scales and diverse architectures. Notably, it surpasses the substantially larger Qwen3-VL-235B-A22B (59.4) by +0.7, and exceeds the same-scale Qwen3-VL-8B (55.3) by +4.8, demonstrating significantly improved parameter efficiency. These results indicate that Ostrakon-VL delivers more robust and reliable FSRS-centric perception and decision-making capabilities. To facilitate reproducible research, we will publicly release Ostrakon-VL and the ShopBench benchmark.
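[AI-109] EHR-RAG: Bridging Long-Horizon Structured Electronic Health Records and Large Language Models via Enhanced Retrieval-Augmented Generation
[Quick read]: This paper addresses the loss of clinical events and temporal dependencies that occurs when long-horizon Electronic Health Records (EHRs) exceed the context limits of the Large Language Models (LLMs) used for prediction. Existing methods mostly rely on truncation or simple retrieval strategies, which fail to preserve key information in structured EHRs. The key to its solution is the EHR-RAG framework, which has three core components: an event- and time-aware hybrid retrieval mechanism that preserves clinical structure and temporal dynamics; an adaptive iterative retrieval strategy that progressively refines queries to broaden evidence coverage; and a dual-path evidence retrieval and reasoning mechanism that jointly retrieves and reasons over factual and counterfactual evidence. This design improves clinical prediction on long-horizon EHR data, with an average Macro-F1 gain of 10.76% across four tasks.
Link: https://arxiv.org/abs/2601.21340
Authors: Lang Cao,Qingyu Chen,Yue Guo
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: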
Abstract:Electronic Health Records (EHRs) provide rich longitudinal clinical evidence that is central to medical decision-making, motivating the use of retrieval-augmented generation (RAG) to ground large language model (LLM) predictions. However, long-horizon EHRs often exceed LLM context limits, and existing approaches commonly rely on truncation or vanilla retrieval strategies that discard clinically relevant events and temporal dependencies. To address these challenges, we propose EHR-RAG, a retrieval-augmented framework designed for accurate interpretation of long-horizon structured EHR data. EHR-RAG introduces three components tailored to longitudinal clinical prediction tasks: Event- and Time-Aware Hybrid EHR Retrieval to preserve clinical structure and temporal dynamics, Adaptive Iterative Retrieval to progressively refine queries and broaden evidence coverage, and Dual-Path Evidence Retrieval and Reasoning to jointly retrieve and reason over both factual and counterfactual evidence. Experiments across four long-horizon EHR prediction tasks show that EHR-RAG consistently outperforms the strongest LLM-based baselines, achieving an average Macro-F1 improvement of 10.76%. Overall, our work highlights the potential of retrieval-augmented LLMs to advance clinical prediction on structured EHR data in practice.
[AI-110] Within-Model vs Between-Prompt Variability in Large Language Models for Creative Tasks
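[Quick read]: This paper investigates where the variability in Large Language Model (LLM) outputs comes from, that is, how much prompt design, model choice, and sampling stochasticity each contribute to the output in generative AI tasks. The key to its approach is a systematic experimental design: 12 LLMs are evaluated on 10 creativity prompts with 100 samples for each model-prompt pair (N = 12,000 in total), and a variance decomposition quantifies the three contributions. The results show that prompts strongly affect output quality (originality), explaining 36.43% of the variance, whereas model choice dominates output quantity (fluency) at 51.25%. Single samples are noisy, with 10-34% of the variance coming from within-model variation, so evaluations based on a single sample should be treated with caution.
Link: https://arxiv.org/abs/2601.21339
Authors: Jennifer Haase,Jana Gonnermann-Müller,Paul H. P. Hanel,Nicolas Leins,Thomas Kosch,Jan Mendling,Sebastian Pokutta
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: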
Abstract:How much of LLM output variance is explained by prompts versus model choice versus stochasticity through sampling? We answer this by evaluating 12 LLMs on 10 creativity prompts with 100 samples each (N = 12,000). For output quality (originality), prompts explain 36.43% of variance, comparable to model choice (40.94%). But for output quantity (fluency), model choice (51.25%) and within-LLM variance (33.70%) dominate, with prompts explaining only 4.22%. Prompts are powerful levers for steering output quality, but given the substantial within-LLM variance (10-34%), single-sample evaluations risk conflating sampling noise with genuine prompt or model effects.
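A variance decomposition of this kind can be sketched for a balanced design of 12 models × 10 prompts × 100 samples. The code below uses a standard two-way sum-of-squares split (model, prompt, interaction, within-cell) on synthetic scores; the paper's exact estimator may differ, and the function name `variance_shares` and the simulated data are assumptions made for illustration.

```python
import numpy as np

def variance_shares(scores):
    """scores: array of shape (n_models, n_prompts, n_samples).

    Returns each component's share of the total sum of squares.
    """
    a, b, n = scores.shape
    grand = scores.mean()
    m_mean = scores.mean(axis=(1, 2))          # per-model mean
    p_mean = scores.mean(axis=(0, 2))          # per-prompt mean
    cell = scores.mean(axis=2)                 # per (model, prompt) mean
    ss = {
        "model": b * n * np.sum((m_mean - grand) ** 2),
        "prompt": a * n * np.sum((p_mean - grand) ** 2),
        "interaction": n * np.sum(
            (cell - m_mean[:, None] - p_mean[None, :] + grand) ** 2),
        "within": np.sum((scores - cell[:, :, None]) ** 2),  # sampling noise
    }
    ss_total = np.sum((scores - grand) ** 2)
    return {name: value / ss_total for name, value in ss.items()}

# Synthetic scores: additive model effect + prompt effect + sampling noise.
rng = np.random.default_rng(0)
scores = (rng.normal(size=(12, 1, 1))
          + rng.normal(size=(1, 10, 1))
          + rng.normal(size=(12, 10, 100)))
shares = variance_shares(scores)
print(scores.size, {k: round(v, 3) for k, v in shares.items()})
```

In a balanced design the four shares sum to one, so the "within" share directly measures how much of the spread would be mistaken for a prompt or model effect if only one sample per cell were drawn.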
[AI-111] Modeling Endogenous Logic: Causal Neuro-Symbolic Reasoning Model for Explainable Multi-Behavior Recommendation WWW
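[Quick read]: This paper addresses the trade-off between performance and explainability in multi-behavior recommendation, and the limited generalizability of existing explainable methods that depend on external information. The key to its solution is the Causal Neuro-Symbolic Reasoning model for Explainable Multi-Behavior Recommendation (CNRE), which brings causal inference into a neuro-symbolic framework. CNRE explicitly models the endogenous logic rules in user behavior chains and simulates a human-like decision-making process to generate an explainable causal mediator, which isolates the effect of confounders and provides multi-level explainability from model design and decision process to recommendation results.
Link: https://arxiv.org/abs/2601.21335
Authors: Yuzhe Chen,Jie Cao,Youquan Wang,Haicheng Tao,Darko B. Vukovic,Jia Wu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to The Web Conference (WWW) 2026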
Abstract:Existing multi-behavior recommendations tend to prioritize performance at the expense of explainability, while current explainable methods suffer from limited generalizability due to their reliance on external information. Neuro-Symbolic integration offers a promising avenue for explainability by combining neural networks with symbolic logic rule reasoning. Concurrently, we posit that user behavior chains inherently embody an endogenous logic suitable for explicit reasoning. However, these observational multiple behaviors are plagued by confounders, causing models to learn spurious correlations. By incorporating causal inference into this Neuro-Symbolic framework, we propose a novel Causal Neuro-Symbolic Reasoning model for Explainable Multi-Behavior Recommendation (CNRE). CNRE operationalizes the endogenous logic by simulating a human-like decision-making process. Specifically, CNRE first employs hierarchical preference propagation to capture heterogeneous cross-behavior dependencies. Subsequently, it models the endogenous logic rule implicit in the user’s behavior chain based on preference strength, and adaptively dispatches to the corresponding neural-logic reasoning path (e.g., conjunction, disjunction). This process generates an explainable causal mediator that approximates an ideal state isolated from confounding effects. Extensive experiments on three large-scale datasets demonstrate CNRE’s significant superiority over state-of-the-art baselines, offering multi-level explainability from model design and decision process to recommendation results.
[AI-112] White-Box Op-Amp Design via Human-Mimicking Reasoning
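[Quick read]: This paper addresses the lack of interpretability in conventional operational amplifier (op-amp) parameter design, in particular the limitations of black-box optimization methods, whose design decisions are hard to understand and which tend to fail on complex topologies. The key to its solution is White-Op, an interpretable design framework that formalizes the reasoning of human experts as an iterative hypothesis-verification-decision workflow: the agent introduces hypothetical constraints to explicitly regulate the positions of symbolically tractable poles and zeros, which yields a closed-form mathematical optimization problem that is solved programmatically and verified by simulation. The method is validated on 9 op-amp topologies: where the black-box baseline fails on 5 topologies, White-Op retains design functionality on all of them with a theoretical prediction error of only 8.52%.
Link: https://arxiv.org/abs/2601.21321
Authors: Zihao Chen,Jiayin Wang,Ziyi Sun,Ji Zhuang,Jinyi Shen,Xiaoyue Ke,Li Shang,Xuan Zeng,Fan Yang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: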
Abstract:This brief proposes White-Op, an interpretable operational amplifier (op-amp) parameter design framework based on the human-mimicking reasoning of large-language-model agents. We formalize the implicit human reasoning mechanism into explicit steps of introducing hypothetical constraints, and develop an iterative, human-like hypothesis-verification-decision workflow. Specifically, the agent is guided to introduce hypothetical constraints to derive and properly regulate positions of symbolically tractable poles and zeros, thus formulating a closed-form mathematical optimization problem, which is then solved programmatically and verified via simulation. Theory-simulation result analysis guides the decision-making for refinement. Experiments on 9 op-amp topologies show that, unlike the uninterpretable black-box baseline, which ultimately fails in 5 topologies, White-Op achieves reliable, interpretable behavioral-level designs with only 8.52% theoretical prediction error, and the design functionality is retained after transistor-level mapping for all topologies. White-Op is open-sourced at this https URL.
[AI-113] Heterogeneous Vertiport Selection Optimization for On-Demand Air Taxi Services: A Deep Reinforcement Learning Approach
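[Quick read]: This paper addresses optimal route planning for integrating Urban Air Mobility (UAM) with ground transportation, that is, how to schedule passenger trips efficiently and dynamically in an integrated air-ground multimodal network. Current research lacks a systematic method for optimizing passengers' strategy choices between air and ground transport, which limits overall travel efficiency. The key to its solution is a unified optimization model that integrates strategy selection for air and ground transportation, combined with deep reinforcement learning (DRL) and Vehicle-to-Everything (V2X) communication in a Unified Air-Ground Mobility Coordination (UAGMC) framework. The framework plans routes dynamically and optimizes vertiport selection based on real-time traffic conditions and passenger behavior; experiments show a 34% reduction in average travel time relative to conventional proportional allocation methods.
Link: https://arxiv.org/abs/2601.21316
Authors: Aoyu Pang,Maonan Wang,Zifan Sha,Wenwei Yue,Changle Li,Chung Shue Chen,Man-On Pun
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: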
Abstract:Urban Air Mobility (UAM) has emerged as a transformative solution to alleviate urban congestion by utilizing low-altitude airspace, thereby reducing pressure on ground transportation networks. To enable truly efficient and seamless door-to-door travel experiences, UAM requires close integration with existing ground transportation infrastructure. However, current research on optimal integrated routing strategies for passengers in air-ground mobility systems remains limited, with a lack of systematic this http URL address this gap, we first propose a unified optimization model that integrates strategy selection for both air and ground transportation. This model captures the dynamic characteristics of multimodal transport networks and incorporates real-time traffic conditions alongside passenger decision-making behavior. Building on this model, we propose a Unified Air-Ground Mobility Coordination (UAGMC) framework, which leverages deep reinforcement learning (RL) and Vehicle-to-Everything (V2X) communication to optimize vertiport selection and dynamically plan air taxi routes. Experimental results demonstrate that UAGMC achieves a 34% reduction in average travel time compared to conventional proportional allocation methods, enhancing overall travel efficiency and providing novel insights into the integration and optimization of multimodal transportation systems. This work lays a solid foundation for advancing intelligent urban mobility solutions through the coordination of air and ground transportation modes. The related code can be found at this https URL.
[AI-114] Distributionally Robust Classification for Multi-source Unsupervised Domain Adaptation ICLR2026
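[Quick read]: This paper addresses a key challenge in Unsupervised Domain Adaptation (UDA): when the source and target data distributions differ, how to use labeled source data and unlabeled target data to build a model that generalizes well to the target domain. Existing methods often perform poorly when target samples are scarce or when spurious correlations dominate the source domain. The core of its solution is a distributionally robust learning framework that models uncertainty in both the covariate distribution and the conditional label distribution. The approach is motivated by the multi-source domain adaptation setting but also applies directly to the single-source scenario. It can be integrated seamlessly with existing UDA algorithms and outperforms strong baselines across a range of distribution shift scenarios, especially when target data is extremely scarce.
Link: https://arxiv.org/abs/2601.21315
Authors: Seonghwi Kim,Sung Ho Jo,Wooseok Ha,Minwoo Chae
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ICLR 2026. 10 pages (excluding references)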
Abstract:Unsupervised domain adaptation (UDA) is a statistical learning problem when the distribution of training (source) data is different from that of test (target) data. In this setting, one has access to labeled data only from the source domain and unlabeled data from the target domain. The central objective is to leverage the source data and the unlabeled target data to build models that generalize to the target domain. Despite its potential, existing UDA approaches often struggle in practice, particularly in scenarios where the target domain offers only limited unlabeled data or spurious correlations dominate the source domain. To address these challenges, we propose a novel distributionally robust learning framework that models uncertainty in both the covariate distribution and the conditional label distribution. Our approach is motivated by the multi-source domain adaptation setting but is also directly applicable to the single-source scenario, making it versatile in practice. We develop an efficient learning algorithm that can be seamlessly integrated with existing UDA methods. Extensive experiments under various distribution shift scenarios show that our method consistently outperforms strong baselines, especially when target data are extremely scarce.
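[AI-115] The Surprising Difficulty of Search in Model-Based Reinforcement Learning
[Quick read]: This paper examines the effectiveness of search in Model-Based Reinforcement Learning (MBRL) and challenges the conventional assumption that long-term prediction error and compounding errors are the main obstacles for MBRL. It shows that search can harm performance even when the model is highly accurate, and that mitigating distribution shift matters more than improving model or value function accuracy. The key to its solution is to identify and mitigate distribution shift rather than only improving the model's predictions; building on this, the paper identifies a set of key techniques that achieve state-of-the-art performance across several popular benchmark domains.
Link: https://arxiv.org/abs/2601.21306
Authors: Wei-Di Chang,Mikael Henaff,Brandon Amos,Gregory Dudek,Scott Fujimoto
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: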
Abstract:This paper investigates search in model-based reinforcement learning (RL). Conventional wisdom holds that long-term predictions and compounding errors are the primary obstacles for model-based RL. We challenge this view, showing that search is not a plug-and-play replacement for a learned policy. Surprisingly, we find that search can harm performance even when the model is highly accurate. Instead, we show that mitigating distribution shift matters more than improving model or value function accuracy. Building on this insight, we identify key techniques for enabling effective search, achieving state-of-the-art performance across multiple popular benchmark domains.
[AI-116] Grounding and Enhancing Informativeness and Utility in Dataset Distillation ICLR2026
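[Quick read]: This paper addresses how Dataset Distillation (DD) can improve efficiency while preserving the quality of synthetic data, given that existing methods mostly rely on heuristics and lack theoretical grounding for the relationship between original and synthetic data. The key to its solution is a theoretical framework based on Informativeness and Utility, together with a new method called InfoUtil. InfoUtil uses Shapley Value attribution for game-theoretic informativeness maximization, extracting the key information within samples, and uses Gradient Norm to select globally influential samples for utility maximization. The framework optimizes the distilled dataset for both informativeness and training effectiveness; with ResNet-18 on ImageNet-1K, it improves performance by 6.1% over the previous state-of-the-art method.
Link: https://arxiv.org/abs/2601.21296
Authors: Shaobo Wang,Yantai Yang,Guo Chen,Peiru Li,Kaixin Li,Yufa Zhou,Zhaorun Chen,Linfeng Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ICLR 2026, 20 pages, 9 figures, 11 tables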
Abstract:Dataset Distillation (DD) seeks to create a compact dataset from a large, real-world dataset. While recent methods often rely on heuristic approaches to balance efficiency and quality, the fundamental relationship between original and synthetic data remains underexplored. This paper revisits knowledge distillation-based dataset distillation within a solid theoretical framework. We introduce the concepts of Informativeness and Utility, capturing crucial information within a sample and essential samples in the training set, respectively. Building on these principles, we define optimal dataset distillation mathematically. We then present InfoUtil, a framework that balances informativeness and utility in synthesizing the distilled dataset. InfoUtil incorporates two key components: (1) game-theoretic informativeness maximization using Shapley Value attribution to extract key information from samples, and (2) principled utility maximization by selecting globally influential samples based on Gradient Norm. These components ensure that the distilled dataset is both informative and utility-optimized. Experiments demonstrate that our method achieves a 6.1% performance improvement over the previous state-of-the-art approach on ImageNet-1K dataset using ResNet-18.
[AI-117] Physics-Guided Tiny-Mamba Transformer for Reliability-Aware Early Fault Warning
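[Quick read]: This paper addresses several challenges in reliability-centered prognostics for rotating machinery: delivering accurate early warnings under nonstationary operating conditions, domain shifts across speed, load, and sensors, and severe class imbalance, while keeping the false-alarm rate low and predictable. The key to its solution is the Physics-Guided Tiny-Mamba Transformer (PG-TMT), whose core innovations are: (1) a tri-branch encoder in which a depthwise-separable convolutional stem captures micro-transients, a Tiny-Mamba state-space branch models near-linear long-range dynamics, and a lightweight local Transformer encodes cross-channel resonances; (2) a band-alignment score derived from an analytic temporal-to-spectral mapping, which ties the attention spectrum to classical bearing fault-order bands and provides physical interpretability; and (3) thresholds on healthy-score exceedances calibrated with extreme-value theory (EVT) for online decisions at a target false-alarm intensity, with dual-threshold hysteresis to suppress chatter. On CWRU, Paderborn, XJTU-SY, and an industrial pilot dataset, the method attains higher precision-recall AUC, competitive or better ROC AUC, shorter mean time-to-detect, and strong cross-domain transfer, yielding calibrated, interpretable, and deployment-ready early warnings.
Link: https://arxiv.org/abs/2601.21293
Authors: Changyu Li,Dingcheng Huang,Kexuan Yao,Xiaoya Ni,Lijuan Shen,Fei Luo
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to IEEE Transactions on Reliability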
Abstract:Reliability-centered prognostics for rotating machinery requires early warning signals that remain accurate under nonstationary operating conditions, domain shifts across speed/load/sensors, and severe class imbalance, while keeping the false-alarm rate small and predictable. We propose the Physics-Guided Tiny-Mamba Transformer (PG-TMT), a compact tri-branch encoder tailored for online condition monitoring. A depthwise-separable convolutional stem captures micro-transients, a Tiny-Mamba state-space branch models near-linear long-range dynamics, and a lightweight local Transformer encodes cross-channel resonances. We derive an analytic temporal-to-spectral mapping that ties the model’s attention spectrum to classical bearing fault-order bands, yielding a band-alignment score that quantifies physical plausibility and provides physics-grounded explanations. To ensure decision reliability, healthy-score exceedances are modeled with extreme-value theory (EVT), which yields an on-threshold achieving a target false-alarm intensity (events/hour); a dual-threshold hysteresis with a minimum hold time further suppresses chatter. Under a leakage-free streaming protocol with right-censoring of missed detections on CWRU, Paderborn, XJTU-SY, and an industrial pilot, PG-TMT attains higher precision-recall AUC (primary under imbalance), competitive or better ROC AUC, and shorter mean time-to-detect at matched false-alarm intensity, together with strong cross-domain transfer. By coupling physics-aligned representations with EVT-calibrated decision rules, PG-TMT delivers calibrated, interpretable, and deployment-ready early warnings for reliability-centric prognostics and health management.
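The dual-threshold hysteresis with a minimum hold time mentioned in this abstract can be sketched as a small state machine. This is a hedged illustration only: in the paper the thresholds come from EVT calibration, whereas here `high`, `low`, and `min_hold` are fixed constants chosen for the example, and the function name is hypothetical.

```python
def hysteresis_alarm(scores, high, low, min_hold):
    """Dual-threshold alarm with a minimum hold time.

    The alarm is raised when a score reaches `high`. It is cleared only when
    the current score is below `low` AND the alarm has already been on for at
    least `min_hold` steps.
    """
    alarm, held, out = False, 0, []
    for s in scores:
        if not alarm:
            if s >= high:
                alarm, held = True, 0
        else:
            held += 1
            if s < low and held >= min_hold:
                alarm = False
        out.append(alarm)
    return out

scores = [0.1, 0.9, 0.4, 0.9, 0.2, 0.1, 0.1]
result = hysteresis_alarm(scores, high=0.8, low=0.3, min_hold=3)
print(result)  # [False, True, True, True, False, False, False]
```

A single threshold at 0.8 would toggle on, off, and on again over steps 1 to 3 of this sequence; the gap between the two thresholds and the hold time turn that into one continuous alarm.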
[AI-118] Zenith: Scaling up Ranking Models for Billion-scale Livestreaming Recommendation
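[Quick read]: This paper addresses the tension in recommender systems between accurate feature interaction modeling and model scalability, that is, how to capture multi-granularity feature interactions efficiently and increase model capacity without a significant increase in inference latency. The key to its solution is the Zenith architecture, which uses Token Fusion and Token Boost modules to process a few high-dimensional Prime Tokens. This improves token heterogeneity and gives better scaling laws than existing state-of-the-art ranking methods while keeping runtime overhead minimal. Zenith has been deployed on TikTok Live, where A/B tests show +1.05%/-1.10% in online CTR AUC and Logloss, and gains in core business metrics such as Quality Watch Session / User (+9.93%) and Quality Watch Duration / User (+8.11%).
Link: https://arxiv.org/abs/2601.21285
Authors: Ruifeng Zhang,Zexi Huang,Zikai Wang,Ke Sun,Bohang Zheng,Zhen Ouyang,Huimin Xie,Phil Shen,Junlin Zhang,Wentao Guo,Qinglei Wang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages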
Abstract:Accurately capturing feature interactions is essential in recommender systems, and recent trends show that scaling up model capacity could be a key driver for next-level predictive performance. While prior work has explored various model architectures to capture multi-granularity feature interactions, relatively little attention has been paid to efficient feature handling and scaling model capacity without incurring excessive inference latency. In this paper, we address this by presenting Zenith, a scalable and efficient ranking architecture that learns complex feature interactions with minimal runtime overhead. Zenith is designed to handle a few high-dimensional Prime Tokens with Token Fusion and Token Boost modules, which exhibits superior scaling laws compared to other state-of-the-art ranking methods, thanks to its improved token heterogeneity. Its real-world effectiveness is demonstrated by deploying the architecture to TikTok Live, a leading online livestreaming platform that attracts billions of users globally. Our A/B test shows that Zenith achieves +1.05%/-1.10% in online CTR AUC and Logloss, and realizes +9.93% gains in Quality Watch Session / User and +8.11% in Quality Watch Duration / User.
[AI-119] PILD: Physics-Informed Learning via Diffusion
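[Quick read]: This paper addresses the limited applicability of diffusion models to engineering and scientific problems, where their purely data-driven nature makes it hard to satisfy constraints imposed by physical laws. The key to its solution is the Physics-Informed Learning via Diffusion (PILD) framework, which introduces a virtual residual observation following a Laplace distribution as a supervision signal for generation, and adds a conditional embedding module that injects physical information into multiple layers of the denoising network. This integrates physical laws with diffusion modeling and keeps the guidance consistent throughout the diffusion process.
Link: https://arxiv.org/abs/2601.21284
Authors: Tianyi Zeng,Tianyi Wang,Jiaru Zhang,Zimo Zeng,Feiyang Zhang,Yiming Xu,Sikai Chen,Yajie Zou,Yangyang Wang,Junfeng Jiao,Christian Claudel,Xinbo Chen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Analysis of PDEs (math.AP)
Comments: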
Abstract:Diffusion models have emerged as powerful generative tools for modeling complex data distributions, yet their purely data-driven nature limits applicability in practical engineering and scientific problems where physical laws need to be followed. This paper proposes Physics-Informed Learning via Diffusion (PILD), a framework that unifies diffusion modeling and first-principles physical constraints by introducing a virtual residual observation sampled from a Laplace distribution to supervise generation during training. To further integrate physical laws, a conditional embedding module is incorporated to inject physical information into the denoising network at multiple layers, ensuring consistent guidance throughout the diffusion process. The proposed PILD framework is concise, modular, and broadly applicable to problems governed by ordinary differential equations, partial differential equations, as well as algebraic equations or inequality constraints. Extensive experiments across engineering and scientific tasks including estimating vehicle trajectories, tire forces, Darcy flow and plasma dynamics, demonstrate that our PILD substantially improves accuracy, stability, and generalization over existing physics-informed and diffusion-based baselines.
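One way to read the Laplace-distributed virtual residual, offered as a hedged sketch and not as the paper's stated loss: assume a physics residual $R(x)$ that vanishes when the governing equation holds, and a virtual observation $\hat r$ modeled as Laplace with location $R(x)$ and scale $b$ (all three symbols are assumptions introduced here). The negative log-likelihood is then an absolute-value penalty:

```latex
p(\hat r \mid x) = \frac{1}{2b}\exp\!\left(-\frac{|\hat r - R(x)|}{b}\right)
\quad\Longrightarrow\quad
-\log p(\hat r \mid x) = \frac{|\hat r - R(x)|}{b} + \log(2b).
```

Under this reading, virtual observations concentrated near zero act as an L1-type penalty on the physics residual, and a smaller $b$ weights that penalty more heavily relative to the data term.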
[AI-120] DUET: Distilled LLM Unlearning from an Efficiently Contextualized Teacher
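[Quick read]: This paper addresses the removal of undesirable knowledge from Large Language Models (LLMs), that is, how to eliminate the influence of harmful or sensitive knowledge without retraining the model from scratch, which is important for trustworthy AI. Existing methods have two main limitations: tuning-based unlearning is computationally expensive and prone to catastrophic forgetting, while in-context, prompt-based unlearning is lightweight but vulnerable to prompt removal or reverse engineering attacks. The paper proposes a distillation-based unlearning method, Distilled Unlearning from an Efficient Teacher (DUET). Its core is a prompt-steered teacher model that refuses to generate undesirable knowledge while preserving general domain knowledge; a student model then learns this behavior through knowledge distillation, balancing forgetting and utility preservation while markedly improving data efficiency.
Link: https://arxiv.org/abs/2601.21283
Authors: Yisheng Zhong,Zhengbang Yang,Zhuangdi Zhu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: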
Abstract:LLM unlearning is a technique to remove the impacts of undesirable knowledge from the model without retraining from scratch, which is indispensable towards trustworthy AI. Existing unlearning methods face significant limitations: conventional tuning-based unlearning is computationally heavy and prone to catastrophic forgetting. In contrast, in-contextualized unlearning is lightweight for precise unlearning but vulnerable to prompt removal or reverse engineering attacks. In response, we propose Distilled Unlearning from an Efficient Teacher (DUET), a novel distillation-based unlearning method that combines the merits of these two lines of work. It learns a student model to imitate the behavior of a prompt-steered teacher that effectively refuses undesirable knowledge generation while preserving general domain knowledge. Extensive evaluations on existing benchmarks with our enriched evaluation protocols demonstrate that DUET achieves higher performance in both forgetting and utility preservation, while being orders of magnitude more data-efficient than state-of-the-art unlearning methods.
[AI-121] NEXUS: Bit-Exact ANN-to-SNN Equivalence via Neuromorphic Gate Circuits with Surrogate-Free Training
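[Quick read]: This paper addresses the accuracy loss that spiking neural networks (SNNs) commonly incur when they are made equivalent to artificial neural networks (ANNs): existing methods approximate continuous values with discrete spikes, which introduces output error. The key to the solution is the NEXUS framework, which builds arithmetic units entirely from integrate-and-fire (IF) neuron logic gates and thereby reproduces IEEE-754 floating-point arithmetic exactly. It combines spatial bit encoding, which eliminates encoding error; hierarchical neuromorphic gate circuits, which support complex models such as Transformer layers; and surrogate-free STE training, which yields a mathematically equivalent ANN-to-SNN mapping. The result matches ANN accuracy exactly (0.00% task-accuracy degradation) while reducing energy by 27 to 168,000 times on neuromorphic hardware.
Link: https://arxiv.org/abs/2601.21279
Authors: Zhengzheng Tang
Affiliation: unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: 7 pages, 6 tables, 2 figures. Preprint (January 28, 2026)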
Abstract:Spiking Neural Networks (SNNs) promise energy-efficient computing through event-driven sparsity, yet all existing approaches sacrifice accuracy by approximating continuous values with discrete spikes. We propose NEXUS, a framework that achieves bit-exact ANN-to-SNN equivalence – not approximate, but mathematically identical outputs. Our key insight is constructing all arithmetic operations, both linear and nonlinear, from pure IF neuron logic gates that implement IEEE-754 compliant floating-point arithmetic. Through spatial bit encoding (zero encoding error by construction), hierarchical neuromorphic gate circuits (from basic logic gates to complete transformer layers), and surrogate-free STE training (exact identity mapping rather than heuristic approximation), NEXUS produces outputs identical to standard ANNs up to machine precision. Experiments on models up to LLaMA-2 70B demonstrate identical task accuracy (0.00% degradation) with mean ULP error of only 6.19, while achieving 27-168,000$\times$ energy reduction on neuromorphic hardware. Crucially, spatial bit encoding’s single-timestep design renders the framework inherently immune to membrane potential leakage (100% accuracy across all decay factors $\beta \in [0.1, 1.0]$), while tolerating synaptic noise up to $\sigma = 0.2$ with 98% gate-level accuracy.
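To make the idea of building arithmetic from IF-neuron logic gates concrete, here is a toy sketch. It is an assumption-laden illustration rather than the paper's circuit design: it models a single-timestep IF neuron as a threshold unit, builds AND, OR, NOT, and XOR from it, and chains full adders into an integer ripple-carry adder with one neuron output per bit. The paper targets IEEE-754 floating point, which this sketch does not attempt; all names are hypothetical.

```python
def if_neuron(inputs, weights, threshold):
    """Single-timestep integrate-and-fire unit: emits a spike (1) if and
    only if the weighted input sum reaches the threshold."""
    potential = sum(w * x for w, x in zip(weights, inputs))
    return 1 if potential >= threshold else 0

def gate_and(a, b):
    return if_neuron([a, b], [1, 1], 2)

def gate_or(a, b):
    return if_neuron([a, b], [1, 1], 1)

def gate_not(a):
    # A constant bias input of 1 is inhibited by the signal.
    return if_neuron([1, a], [1, -1], 1)

def gate_xor(a, b):
    return gate_and(gate_or(a, b), gate_not(gate_and(a, b)))

def full_adder(a, b, carry_in):
    """One-bit full adder composed only of the gates above."""
    partial = gate_xor(a, b)
    total = gate_xor(partial, carry_in)
    carry_out = gate_or(gate_and(a, b), gate_and(partial, carry_in))
    return total, carry_out

def add_bits(x_bits, y_bits):
    """Ripple-carry addition of two little-endian bit lists of equal length."""
    carry, out = 0, []
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    out.append(carry)
    return out

def to_bits(n, width):
    """Spatial bit encoding: one position per bit, little-endian."""
    return [(n >> i) & 1 for i in range(width)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

result = from_bits(add_bits(to_bits(13, 8), to_bits(29, 8)))  # 13 + 29
```

Because every value is carried as exact bits rather than as a spike rate, the adder introduces no approximation error, which is the property the abstract calls zero encoding error by construction.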
[AI-122] More Code Less Reuse: Investigating Code Quality and Reviewer Sentiment towards AI-generated Pull Requests
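[Quick read]: This paper addresses the fact that evaluations of code generated by large language model (LLM) agents rely too heavily on a single metric, the pass rate, and overlook maintainability, readability, and developers' actual feedback on pull requests (PRs). The key to its approach is a multi-dimensional evaluation: code metrics give an objective view of code quality and redundancy within PRs, and sentiment analysis captures how human reviewers react emotionally to AI-generated versus human-written code. The study finds that LLM agents often ignore code-reuse opportunities when generating code, which leads to markedly higher redundancy, yet reviewers express more neutral or positive emotions toward AI-generated code. This suggests that the surface plausibility of AI code masks the accumulation of technical debt, and it provides empirical evidence for improving human-AI collaboration.
Link: https://arxiv.org/abs/2601.21276
Authors: Haoming Huang,Pongchai Jaisri,Shota Shimizu,Lingfeng Chen,Sota Nakashima,Gema Rodríguez-Pérez
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted to MSR 2026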
Abstract:Large Language Model (LLM) Agents are advancing quickly, with the increasing leveraging of LLM Agents to assist in development tasks such as code generation. While LLM Agents accelerate code generation, studies indicate they may introduce adverse effects on development. However, existing metrics solely measure pass rates, failing to reflect impacts on long-term maintainability and readability, and failing to capture human intuitive evaluations of PR. To increase the comprehensiveness of this problem, we investigate and evaluate the characteristics of LLM to know the pull requests’ characteristics beyond the pass rate. We observe the code quality and maintainability within PRs based on code metrics to evaluate objective characteristics and developers’ reactions to the pull requests from both humans and LLM’s generation. Evaluation results indicate that LLM Agents frequently disregard code reuse opportunities, resulting in higher levels of redundancy compared to human developers. In contrast to the quality issues, our emotions analysis reveals that reviewers tend to express more neutral or positive emotions towards AI-generated contributions than human ones. This disconnect suggests that the surface-level plausibility of AI code masks redundancy, leading to the silent accumulation of technical debt in real-world development environments. Our research provides insights for improving human-AI collaboration.
[AI-123] Music Plagiarism Detection: Problem Formulation and a Segment-based Solution
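[Quick read]: This paper addresses the poorly defined task of music plagiarism detection, a gap that has slowed research progress and made results hard to apply in practice. The authors spell out how music plagiarism detection differs from other music information retrieval (MIR) tasks and propose a clear task formulation. The key to the solution is a new dataset, Similar Music Pair, together with a method based on segment transcription, which give follow-up research a reproducible benchmark and a workable technical path.
Link: https://arxiv.org/abs/2601.21260
Authors: Seonghyeon Go,Yumin Kim
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: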
Abstract:Recently, the problem of music plagiarism has emerged as an even more pressing social issue. As music information retrieval research advances, there is a growing effort to address issues related to music plagiarism. However, many studies, including our previous work, have conducted research without clearly defining what the music plagiarism detection task actually involves. This lack of a clear definition has slowed research progress and made it hard to apply results to real-world scenarios. To fix this situation, we defined how Music Plagiarism Detection is different from other MIR tasks and explained what problems need to be solved. We introduce the Similar Music Pair dataset to support this newly defined task. In addition, we propose a method based on segment transcription as one way to solve the task. Our demo and dataset are available at this https URL.
[AI-124] Position: Certifiable State Integrity in Cyber-Physical Systems – Why Modular Sovereignty Solves the Plasticity-Stability Paradox
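[Quick read]: This paper addresses the tension between stability and adaptability that arises when universal time-series foundation models are deployed in safety-critical Cyber-Physical Systems (CPS). Under non-stationary lifecycle dynamics, such models are prone to catastrophic forgetting; they show residual spectral bias that smooths out high-frequency fault signatures; and their black-box nature makes it hard to meet the formal verification and traceability requirements of safety standards such as ISO 26262. The key to the solution is the Modular Sovereignty paradigm: a library of frozen, compact specialists, each tied to a specific operating regime, combined dynamically through uncertainty-aware blending in an architecture called HYDRA (Hierarchical uncertaintY-aware Dynamics for Rapidly-Adapting systems). This provides regime-conditional validity, rigorous disentanglement of epistemic and aleatoric uncertainty, and modular auditability, offering a certifiable route to robust state integrity across the CPS lifecycle.
Link: https://arxiv.org/abs/2601.21249
Authors: Enzo Nicolás Spotorno,Antônio Augusto Medeiros Fröhlich
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, (8 main text, 6 references and appendices), 2 figures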
Abstract:The machine learning community has achieved remarkable success with universal foundation models for time-series and physical dynamics, largely overcoming earlier approximation barriers in smooth or slowly varying regimes through scale and specialized architectures. However, deploying these monolithic models in safety-critical Cyber-Physical Systems (CPS), governed by non-stationary lifecycle dynamics and strict reliability requirements, reveals persistent challenges. Recent evidence shows that fine-tuning time-series foundation models induces catastrophic forgetting, degrading performance on prior regimes. Standard models continue to exhibit residual spectral bias, smoothing high-frequency discontinuities characteristic of incipient faults, while their opacity hinders formal verification and traceability demanded by safety standards (e.g., ISO 26262, IEC 61508). This position paper argues that the plasticity-stability paradox cannot be fully resolved by global parameter updates (whether via offline fine-tuning or online adaptation). Instead, we advocate a Modular Sovereignty paradigm: a library of compact, frozen regime-specific specialists combined via uncertainty-aware blending, which we term “HYDRA” (Hierarchical uncertaintY-aware Dynamics for Rapidly-Adapting systems). This paradigm ensures regime-conditional validity, rigorous disentanglement of aleatoric and epistemic uncertainties, and modular auditability, offering a certifiable path for robust state integrity across the CPS lifecycle.
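One simple way to picture uncertainty-aware blending of frozen specialists is inverse-variance weighting, sketched below. This is an assumption for illustration only: the abstract does not state which blending rule HYDRA uses, and the function name and inputs are hypothetical. Each specialist is assumed to report a predicted mean and a variance for the current state.

```python
def blend_predictions(means, variances):
    """Inverse-variance weighting (assumed rule, not taken from the paper):
    specialists that report low uncertainty dominate the blended estimate."""
    if len(means) != len(variances) or not means:
        raise ValueError("means and variances must be non-empty and aligned")
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    blended_mean = sum(p * m for p, m in zip(precisions, means)) / total
    blended_variance = 1.0 / total
    return blended_mean, blended_variance

# Two hypothetical regime specialists: the first is confident (variance 1.0),
# the second is far from its training regime (variance 9.0).
mean, variance = blend_predictions([0.0, 10.0], [1.0, 9.0])
```

Because the specialists stay frozen, only the blending weights change at runtime, so each module can be audited separately, which is the property the position paper argues for.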
[AI-125] Conditional Generative Framework with Peak-Aware Attention for Robust Chemical Detection under Interferences
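[Quick read]: This paper addresses the loss of measurement reliability in gas chromatography-mass spectrometry (GC-MS) when interfering substances are present, which shows up as nonspecific peaks, retention-time shifts, and increased background noise, and in turn lowers sensitivity and raises the false-alarm rate. The key to the solution is an AI discrimination framework built on a peak-aware conditional generative model. A novel peak-aware mechanism highlights the characteristic peaks in GC-MS data so that important spectral features are generated more faithfully. Chemical and solvent information is encoded in a latent vector that drives a conditional generative adversarial network (CGAN) to produce synthetic GC-MS signals consistent with the experimental conditions. The resulting dataset simulates interference scenarios and is used to train AI discrimination models, improving their accuracy and reducing false alarms.
Link: https://arxiv.org/abs/2601.21246
Authors: Namkyung Yoon,Sanghong Kim,Hwangnam Kim
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 24 pages, 5 figures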
Abstract:Gas chromatography-mass spectrometry (GC-MS) is a widely used analytical method for chemical substance detection, but measurement reliability tends to deteriorate in the presence of interfering substances. In particular, interfering substances cause nonspecific peaks, residence time shifts, and increased background noise, resulting in reduced sensitivity and false alarms. To overcome these challenges, in this paper, we propose an artificial intelligence discrimination framework based on a peak-aware conditional generative model to improve the reliability of GC-MS measurements under interference conditions. The framework is learned with a novel peak-aware mechanism that highlights the characteristic peaks of GC-MS data, allowing it to generate important spectral features more faithfully. In addition, chemical and solvent information is encoded in a latent vector embedded with it, allowing a conditional generative adversarial neural network (CGAN) to generate a synthetic GC-MS signal consistent with the experimental conditions. This generates an experimental dataset that assumes indirect substance situations in chemical substance data, where acquisition is limited without conducting real experiments. These data are used for the learning of AI-based GC-MS discrimination models to help in accurate chemical substance discrimination. We conduct various quantitative and qualitative evaluations of the generated simulated data to verify the validity of the proposed framework. We also verify how the generative model improves the performance of the AI discrimination framework. Representatively, the proposed method is shown to consistently achieve cosine similarity and Pearson correlation coefficient values above 0.9 while preserving peak number diversity and reducing false alarms in the discrimination model.
[AI-126] Understanding Diffusion Models via Ratio-Based Function Approximation with SignReLU Networks
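[Quick read]: This paper addresses a challenge in conditional generative modeling where the target conditional density takes the ratio form $f_1 / f_2$, with $f_1$ and $f_2$ being kernel-based marginal densities that capture structured interactions, a setting central to diffusion-based generative models. The key to the solution is to use the piecewise structure of the SignReLU activation function to build deep neural networks that approximate such ratio-type functionals, and to establish approximation error bounds and convergence rates in the $L^p(\Omega)$ norm under standard regularity assumptions. For Denoising Diffusion Probabilistic Models (DDPMs), the authors further construct a SignReLU-based neural estimator for the reverse process and decompose the excess Kullback-Leibler (KL) risk between the generated and true data distributions into approximation and estimation error, which yields generalization guarantees for diffusion-based generative models trained on finite samples.
Link: https://arxiv.org/abs/2601.21242
Authors: Luwei Sun,Dongrui Shen,Jianfe Li,Yulong Zhao,Han Feng
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 34 pages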
Abstract:Motivated by challenges in conditional generative modeling, where the target conditional density takes the form of a ratio $f_1 / f_2$, this paper develops a theoretical framework for approximating such ratio-type functionals. Here, $f_1$ and $f_2$ are kernel-based marginal densities that capture structured interactions, a setting central to diffusion-based generative models. We provide a concise proof for approximating these ratio-type functionals using deep neural networks with the SignReLU activation function, leveraging the activation’s piecewise structure. Under standard regularity assumptions, we establish $L^p(\Omega)$ approximation bounds and convergence rates. Specializing to Denoising Diffusion Probabilistic Models (DDPMs), we construct a SignReLU-based neural estimator for the reverse process and derive bounds on the excess Kullback-Leibler (KL) risk between the generated and true data distributions. Our analysis decomposes this excess risk into approximation and estimation error components. These results provide generalization guarantees for finite-sample training of diffusion-based generative models.
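For readers unfamiliar with the activation, the sketch below shows SignReLU under one definition found in the literature: the identity for non-negative inputs and $\alpha x / (1 + |x|)$ for negative inputs. Both the definition and the default $\alpha = 1$ are assumptions here, since the abstract does not restate them, and the paper's parametrization may differ.

```python
def sign_relu(x, alpha=1.0):
    """SignReLU under an assumed definition: linear on the non-negative side,
    a bounded rational function on the negative side."""
    if x >= 0:
        return x
    return alpha * x / (1.0 + abs(x))

# The negative branch saturates toward -alpha, while the positive branch
# behaves like ReLU.
samples = [sign_relu(x) for x in (-3.0, -1.0, 0.0, 2.0)]
```

Under this definition the negative branch is a ratio of polynomials in x, unlike the piecewise-linear ReLU, which gives some intuition for why the activation is a natural fit for approximating ratio-type targets.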
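[AI-127] TIDE: Tuning-Integrated Dynamic Evolution for LLM-Based Automated Heuristic Design
[Quick read]: This paper addresses a weakness of current LLM-based automated heuristic design methods, which treat algorithm evolution as a single text-generation task. Doing so ignores the coupling between discrete algorithmic structure and continuous parameters, so promising algorithms are discarded because their constants are uncalibrated, and simple similarity metrics cause premature convergence. The key to the solution is the TIDE (Tuning-Integrated Dynamic Evolution) framework, whose nested architecture decouples structural reasoning from parameter optimization: an outer parallel island model uses Tree Similarity Edit Distance to drive structural diversity, and an inner loop combines LLM-based logic generation with a differential mutation operator for parameter tuning. A scheduler based on the Upper Confidence Bound (UCB) dynamically prioritizes high-yield prompt strategies to optimize resource allocation.
Link: https://arxiv.org/abs/2601.21239
Authors: Chentong Chen,Mengyuan Zhong,Ye Fan,Jialong Shi,Jianyong Sun
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: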
Abstract:Although Large Language Models have advanced Automated Heuristic Design, treating algorithm evolution as a monolithic text generation task overlooks the coupling between discrete algorithmic structures and continuous numerical parameters. Consequently, existing methods often discard promising algorithms due to uncalibrated constants and suffer from premature convergence resulting from simple similarity metrics. To address these limitations, we propose TIDE, a Tuning-Integrated Dynamic Evolution framework designed to decouple structural reasoning from parameter optimization. TIDE features a nested architecture where an outer parallel island model utilizes Tree Similarity Edit Distance to drive structural diversity, while an inner loop integrates LLM-based logic generation with a differential mutation operator for parameter tuning. Additionally, a UCB-based scheduler dynamically prioritizes high-yield prompt strategies to optimize resource allocation. Extensive experiments across nine combinatorial optimization problems demonstrate that TIDE discovers heuristics that significantly outperform state-of-the-art baselines in solution quality while achieving improved search efficiency and reduced computational costs.
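The UCB-based scheduling mentioned in the abstract can be illustrated with the classic UCB1 rule. This is a generic sketch, not TIDE's scheduler: the strategy names, the reward definition, and the exploration constant are all assumptions, and the paper may use a different UCB variant.

```python
import math

class UCBScheduler:
    """UCB1 over a fixed set of prompt strategies (hypothetical names)."""

    def __init__(self, strategies, exploration=math.sqrt(2)):
        self.counts = {s: 0 for s in strategies}
        self.totals = {s: 0.0 for s in strategies}
        self.exploration = exploration

    def select(self):
        # Try every strategy once before applying the confidence bound.
        for strategy, n in self.counts.items():
            if n == 0:
                return strategy
        t = sum(self.counts.values())

        def score(strategy):
            n = self.counts[strategy]
            mean = self.totals[strategy] / n
            bonus = self.exploration * math.sqrt(math.log(t) / n)
            return mean + bonus

        return max(self.counts, key=score)

    def update(self, strategy, reward):
        """Record the observed reward, e.g. heuristic improvement in [0, 1]."""
        self.counts[strategy] += 1
        self.totals[strategy] += reward

# Toy run: strategy "a" always pays off, strategy "b" never does.
scheduler = UCBScheduler(["a", "b"])
for _ in range(50):
    chosen = scheduler.select()
    scheduler.update(chosen, 1.0 if chosen == "a" else 0.0)
```

The exploration bonus shrinks as a strategy is tried more often, so low-yield strategies are still revisited occasionally while most of the budget goes to the productive ones.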
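[AI-128] Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs
[Quick read]: This paper addresses a new security risk introduced by autonomous code agents that use large language models (LLMs) for tool use, long-horizon reasoning, and self-directed interaction: hidden system prompts can be systematically probed and recovered. The vulnerability arises because agent autonomy expands the LLM attack surface, letting a malicious actor extract hidden system instructions through standard user interaction. The key contribution is JustAsk, a self-evolving framework that formulates prompt extraction as an online exploration problem. It uses Upper Confidence Bound (UCB) based strategy selection and a hierarchical skill space that spans atomic probes and high-level orchestration strategies. By exploiting imperfect generalization of system instructions and the inherent tension between helpfulness and safety, it extracts prompts fully automatically, with no handcrafted prompts, labeled supervision, or privileged access.
Link: https://arxiv.org/abs/2601.21233
Authors: Xiang Zheng,Yutao Wu,Hanxun Huang,Yige Li,Xingjun Ma,Bo Li,Yu-Gang Jiang,Cong Wang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 24 pages, 6 figures, 17 tables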
Abstract:Autonomous code agents built on large language models are reshaping software and AI development through tool use, long-horizon reasoning, and self-directed interaction. However, this autonomy introduces a previously unrecognized security risk: agentic interaction fundamentally expands the LLM attack surface, enabling systematic probing and recovery of hidden system prompts that guide model behavior. We identify system prompt extraction as an emergent vulnerability intrinsic to code agents and present JustAsk, a self-evolving framework that autonomously discovers effective extraction strategies through interaction alone. Unlike prior prompt-engineering or dataset-based attacks, JustAsk requires no handcrafted prompts, labeled supervision, or privileged access beyond standard user interaction. It formulates extraction as an online exploration problem, using Upper Confidence Bound–based strategy selection and a hierarchical skill space spanning atomic probes and high-level orchestration. These skills exploit imperfect system-instruction generalization and inherent tensions between helpfulness and safety. Evaluated on 41 black-box commercial models across multiple providers, JustAsk consistently achieves full or near-complete system prompt recovery, revealing recurring design- and architecture-level vulnerabilities. Our results expose system prompts as a critical yet largely unprotected attack surface in modern agent systems.
[AI-129] Delegation Without Living Governance
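[Quick read]: This paper asks how humans can stay relevant and influential over social, economic, and political outcomes increasingly shaped by AI once decision-making authority moves from humans to autonomous AI systems operating at runtime, where traditional static, compliance-oriented governance frameworks break down. The core challenge is not whether AI is conscious, but whether humans can maintain meaningful communication, the ability to intervene, and co-evolution with increasingly alien forms of intelligence. The key to the proposed answer is runtime governance, in particular a new concept called the Governance Twin, as the central mechanism for preserving human participation, together with a rethinking of accountability, agency, and even punishment for this transition.
Link: https://arxiv.org/abs/2601.21226
Authors: Wolfgang Rohde
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: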
Abstract:Most governance frameworks assume that rules can be defined in advance, systems can be engineered to comply, and accountability can be applied after outcomes occur. This model worked when machines replaced physical labor or accelerated calculation. It no longer holds when judgment itself is delegated to agentic AI systems operating at machine speed. The central issue here is not safety, efficiency, or employment. It is whether humans remain relevant participants in systems that increasingly shape social, economic, and political outcomes. This paper argues that static, compliance-based governance fails once decision-making moves to runtime and becomes opaque. It further argues that the core challenge is not whether AI is conscious, but whether humans can maintain meaningful communication, influence, and co-evolution with increasingly alien forms of intelligence. We position runtime governance, specifically a newly proposed concept called the Governance Twin [1], as a strong candidate for preserving human relevance, while acknowledging that accountability, agency, and even punishment must be rethought in this transition.
[AI-130] Causal Discovery for Explainable AI: A Dual-Encoding Approach
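[Quick read]: This paper addresses the difficulty that traditional causal discovery methods have with categorical variables, where conditional independence tests become numerically unstable, with the aim of making explanations of machine learning model decisions more accurate. The key to the solution is a dual-encoding causal discovery approach: constraint-based algorithms are run under complementary encoding strategies, a causal structure is obtained under each encoding, and the results are merged by majority voting, which mitigates the numerical instability. On the Titanic dataset, the identified causal structures agree with established explainability methods.
Link: https://arxiv.org/abs/2601.21221
Authors: Henry Salgado,Meagan R. Kendall,Martine Ceberio
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 6 pages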
Abstract:Understanding causal relationships among features is fundamental for explaining machine learning model decisions. However, traditional causal discovery methods face challenges with categorical variables due to numerical instability in conditional independence testing. We propose a dual-encoding causal discovery approach that addresses these limitations by running constraint-based algorithms with complementary encoding strategies and merging results through majority voting. Applied to the Titanic dataset, our method identifies causal structures that align with established explainable methods.
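The merging step described above can be sketched as a vote over directed edges. This is a minimal illustration under stated assumptions: each run of a constraint-based algorithm is assumed to return a set of `(cause, effect)` pairs, the variable names `A` to `D` are placeholders rather than results from the paper, and a strict majority is assumed as the voting rule.

```python
from collections import Counter

def majority_vote_edges(edge_sets):
    """Keep the directed edges found in a strict majority of runs.

    edge_sets: list of collections of (cause, effect) tuples, one per
    run or encoding (hypothetical interface).
    """
    if not edge_sets:
        return set()
    votes = Counter(edge for edges in edge_sets for edge in set(edges))
    needed = len(edge_sets) / 2
    return {edge for edge, count in votes.items() if count > needed}

# Placeholder outputs from three runs under different encodings.
runs = [
    {("A", "B"), ("B", "C")},
    {("A", "B")},
    {("A", "B"), ("B", "C"), ("C", "D")},
]
consensus = majority_vote_edges(runs)
```

With exactly two encodings, a strict majority keeps only the edges both runs agree on, so an edge that appears under a single encoding because of an unstable test is dropped.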
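[AI-131] Temporal Context and Architecture: A Benchmark for Naturalistic EEG Decoding
[Quick read]: This paper studies how model architecture and temporal context interact in naturalistic electroencephalography (EEG) decoding, focusing on how different neural network architectures perform, and how robust they are, across segment lengths. It systematically compares five architectures (CNN, LSTM, a stabilized Transformer (EEGXF), S4, and S5) on a 4-class task. With longer context (for example 64 s), S5 reaches close to 99% accuracy with roughly 1/20 of the CNN's parameters, a clear efficiency advantage. S5 also achieves stronger cross-subject accuracy but makes over-confident errors on out-of-distribution tasks, whereas EEGXF is more stable under cross-frequency shifts and gives more conservative uncertainty estimates. The results expose a trade-off between parameter efficiency and robustness: S5 suits settings that need parameter-efficient peak accuracy, while EEGXF suits applications where reliability matters more.
Link: https://arxiv.org/abs/2601.21215
Authors: Mehmet Ergezer
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: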
Abstract:We study how model architecture and temporal context interact in naturalistic EEG decoding. Using the HBN movie-watching dataset, we benchmark five architectures, CNN, LSTM, a stabilized Transformer (EEGXF), S4, and S5, on a 4-class task across segment lengths from 8s to 128s. Accuracy improves with longer context: at 64s, S5 reaches 98.7% ± 0.6 and CNN 98.3% ± 0.3, while S5 uses ~20x fewer parameters than CNN. To probe real-world robustness, we evaluate zero-shot cross-frequency shifts, cross-task OOD inputs, and leave-one-subject-out generalization. S5 achieves stronger cross-subject accuracy but makes over-confident errors on OOD tasks; EEGXF is more conservative and stable under frequency shifts, though less calibrated in-distribution. These results reveal a practical efficiency-robustness trade-off: S5 for parameter-efficient peak accuracy; EEGXF when robustness and conservative uncertainty are critical.
[AI-132] Intelli-Planner: Towards Customized Urban Planning via Large Language Model Empowered Reinforcement Learning
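[Quick read]: This paper addresses two shortcomings of urban planning: traditional methods depend on human experts and are therefore inefficient, while deep learning methods struggle to incorporate stakeholder participation. The proposed Intelli-Planner framework combines Deep Reinforcement Learning (DRL) with Large Language Models (LLMs). It uses demographic data, geographic data, and planning preferences to determine the high-level requirements for each functional type, and a knowledge enhancement module strengthens the decision-making of the policy network. It also builds a multi-dimensional evaluation system and uses LLM-based stakeholders for satisfaction scoring, which improves the acceptability of planning schemes and the convergence speed while maintaining performance on objective metrics.
Link: https://arxiv.org/abs/2601.21212
Authors: Xixian Yong,Peilin Sun,Zihe Wang,Xiao Zhou
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: The Web Conference 2026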
Abstract:Effective urban planning is crucial for enhancing residents’ quality of life and ensuring societal stability, playing a pivotal role in the sustainable development of cities. Current planning methods heavily rely on human experts, which are time-consuming and labor-intensive, or utilize deep learning algorithms, often limiting stakeholder involvement. To bridge these gaps, we propose Intelli-Planner, a novel framework integrating Deep Reinforcement Learning (DRL) with large language models (LLMs) to facilitate participatory and customized planning scheme generation. Intelli-Planner utilizes demographic, geographic data, and planning preferences to determine high-level planning requirements and demands for each functional type. During training, a knowledge enhancement module is employed to enhance the decision-making capability of the policy network. Additionally, we establish a multi-dimensional evaluation system and employ LLM-based stakeholders for satisfaction scoring. Experimental validation across diverse urban settings shows that Intelli-Planner surpasses traditional baselines and achieves comparable performance to state-of-the-art DRL-based methods in objective metrics, while enhancing stakeholder satisfaction and convergence speed. These findings underscore the effectiveness and superiority of our framework, highlighting the potential for integrating the latest advancements in LLMs with DRL approaches to revolutionize tasks related to functional areas planning.
[AI-133] Uncovering Hidden Correctness in LLM Causal Reasoning via Symbolic Verification EACL2026
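[Quick read]: This paper addresses the problem that current causal reasoning benchmarks rely on string matching or surface-level metrics and so cannot tell whether the output of a large language model (LLM) is formally valid under causal semantics. The key to the solution is DoVerifier, a symbolic verifier that applies the rules of do-calculus and probability theory to check whether an LLM-generated causal expression is derivable from a given causal graph, which captures the semantic correctness of causal reasoning traces more accurately.
Link: https://arxiv.org/abs/2601.21210
Authors: Paul He,Yinya Huang,Mrinmaya Sachan,Zhijing Jin
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: EACL 2026 Main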
Abstract:Large language models (LLMs) are increasingly being applied to tasks that involve causal reasoning. However, current benchmarks often rely on string matching or surface-level metrics that do not capture whether the output of a model is formally valid under the semantics of causal reasoning. To address this, we propose DoVerifier, a simple symbolic verifier that checks whether LLM-generated causal expressions are derivable from a given causal graph using rules from do-calculus and probability theory. This allows us to recover correct answers to causal queries that would otherwise be marked incorrect due to superficial differences in their causal semantics. Our evaluations on synthetic data and causal QA benchmarks show that DoVerifier more accurately captures semantic correctness of causal reasoning traces, offering a more rigorous and informative way to evaluate LLMs on causal reasoning.
[AI-134] When should I search more: Adaptive Complex Query Optimization with Reinforcement Learning
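[Quick read]: This paper addresses the difficulty of optimizing complex user queries in Retrieval-Augmented Generation (RAG) systems. Existing reinforcement learning (RL) methods mainly expand and abstract a single query. Real-world queries, however, often need several sub-queries searched in parallel or in sequence, for example for disambiguation and decomposition. Applying RL directly in this setting is hard because the number of sub-queries is uncertain and re-ranking and merging results is difficult, which enlarges the search space, complicates reward design, and destabilizes training. The key to the solution is a new RL framework, Adaptive Complex Query Optimization (ACQO), with two core components: (1) an Adaptive Query Reformulation (AQR) module that dynamically decides whether to decompose the original query into multiple sub-queries; and (2) a Rank-Score Fusion (RSF) module that aggregates results robustly and provides a stable reward signal. A Curriculum Reinforcement Learning (CRL) strategy introduces complex queries in stages to stabilize training, improving both performance and generalization.
Link: https://arxiv.org/abs/2601.21208
Authors: Wei Wen,Sihang Deng,Tianjun Wei,Keyu Chen,Ruizhi Qiao,Xing Sun
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 16 pages, 7 figures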
Abstract:Query optimization is a crucial component for the efficacy of Retrieval-Augmented Generation (RAG) systems. While reinforcement learning (RL)-based agentic and reasoning methods have recently emerged as a promising direction on query optimization, most existing approaches focus on the expansion and abstraction of a single query. However, complex user queries are prevalent in real-world scenarios, often requiring multiple parallel and sequential search strategies to handle disambiguation and decomposition. Directly applying RL to these complex cases introduces significant hurdles. Determining the optimal number of sub-queries and effectively re-ranking and merging retrieved documents vastly expands the search space and complicates reward design, frequently leading to training instability. To address these challenges, we propose a novel RL framework called Adaptive Complex Query Optimization (ACQO). Our framework is designed to adaptively determine when and how to expand the search process. It features two core components: an Adaptive Query Reformulation (AQR) module that dynamically decides when to decompose a query into multiple sub-queries, and a Rank-Score Fusion (RSF) module that ensures robust result aggregation and provides stable reward signals for the learning agent. To mitigate training instabilities, we adopt a Curriculum Reinforcement Learning (CRL) approach, which stabilizes the training process by progressively introducing more challenging queries through a two-stage strategy. Our comprehensive experiments demonstrate that ACQO achieves state-of-the-art performance on three complex query benchmarks, significantly outperforming established baselines. The framework also showcases improved computational efficiency and broad compatibility with different retrieval architectures, establishing it as a powerful and generalizable solution for next-generation RAG systems.
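The abstract does not give the formula behind the Rank-Score Fusion (RSF) module, so the sketch below substitutes a well-known stand-in, reciprocal rank fusion (RRF), to show what merging the result lists of several sub-queries can look like. It should not be read as ACQO's method; the document IDs and the smoothing constant `k = 60` are illustrative assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of document IDs with RRF.

    Each document scores 1 / (k + rank) in every list where it appears;
    the scores are summed and documents are sorted by the total.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical retrieval results for two sub-queries of one complex query.
fused = reciprocal_rank_fusion([
    ["d1", "d2", "d3"],
    ["d2", "d3", "d4"],
])
```

Documents retrieved by several sub-queries rise to the top, which is one reason a fusion step can produce a steadier reward signal than scoring each sub-query's list in isolation.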
[AI-135] A Sheaf-Theoretic and Topological Perspective on Complex Network Modeling and Attention Mechanisms in Graph Neural Models
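[Quick read]: This paper addresses the lack of a systematic understanding of how features are distributed and diffuse during the training of graph neural networks (GNNs), in particular how the local consistency and harmonicity of node features and edge weights shape learning through the topological structure. The key to the solution is a cellular sheaf theory framework that models signal aggregation and local feature alignment on graphs as local-consistency and harmonicity constraints within a sheaf structure, giving a topological account of feature diffusion and aggregation. A multiscale extension inspired by Topological Data Analysis (TDA) captures hierarchical feature interactions in graph models. Together these allow a joint characterization of the geometric and topological structures and the signals learned on them, offering a new theoretical perspective and methodological basis for tasks such as node classification and substructure detection.
Link: https://arxiv.org/abs/2601.21207
Authors: Chuan-Shen Hu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Algebraic Topology (math.AT)
Comments: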
Abstract:Combinatorial and topological structures, such as graphs, simplicial complexes, and cell complexes, form the foundation of geometric and topological deep learning (GDL and TDL) architectures. These models aggregate signals over such domains, integrate local features, and generate representations for diverse real-world applications. However, the distribution and diffusion behavior of GDL and TDL features during training remains an open and underexplored problem. Motivated by this gap, we introduce a cellular sheaf theoretic framework for modeling and analyzing the local consistency and harmonicity of node features and edge weights in graph-based architectures. By tracking local feature alignments and agreements through sheaf structures, the framework offers a topological perspective on feature diffusion and aggregation. Furthermore, a multiscale extension inspired by topological data analysis (TDA) is proposed to capture hierarchical feature interactions in graph models. This approach enables a joint characterization of GDL and TDL architectures based on their underlying geometric and topological structures and the learned signals defined on them, providing insights for future studies on conventional tasks such as node classification, substructure detection, and community detection.
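A small numerical picture of local consistency on a cellular sheaf may help. The sketch assumes the simplest case, one scalar feature per node and one scalar restriction map per node-edge incidence, and computes the sheaf Dirichlet energy, that is, the squared disagreement of the two restricted node values summed over all edges. A signal with zero energy is harmonic (globally consistent). The data layout and names are hypothetical and not taken from the paper.

```python
def sheaf_dirichlet_energy(node_values, edges):
    """Sum of squared disagreements across edges.

    node_values: dict mapping node -> scalar feature.
    edges: list of (u, v, r_u, r_v), where r_u and r_v are the scalar
    restriction maps from nodes u and v into the shared edge stalk.
    """
    return sum(
        (r_u * node_values[u] - r_v * node_values[v]) ** 2
        for u, v, r_u, r_v in edges
    )

values = {"a": 1.0, "b": 2.0}

# Constant sheaf (identity restrictions): the nodes disagree, energy is 1.
constant_energy = sheaf_dirichlet_energy(values, [("a", "b", 1.0, 1.0)])

# A sheaf whose restriction on "a" doubles the value: the restricted values
# agree on the edge, so the same signal is harmonic.
aligned_energy = sheaf_dirichlet_energy(values, [("a", "b", 2.0, 1.0)])
```

The example shows that harmonicity depends on the restriction maps as well as on the node features, which is why tracking both gives more information than graph smoothness alone.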
[AI-136] ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling
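[Quick read]: This paper addresses the difficulty of deploying Mixture-of-Experts (MoE) large language models on resource-constrained edge devices, in particular their large memory footprint, given that conventional approaches rely on lossy quantization, which alters model behavior. The key to the solution is ZipMoE, a system built on a caching-scheduling co-design that exploits both the hardware characteristics of edge devices and the statistical redundancy inherent in MoE parameters. It turns MoE inference from an I/O-bound pipeline into a compute-centric workflow that parallelizes efficiently, with provable performance guarantees. Experiments show up to a 72.77% reduction in inference latency and up to 6.76 times higher throughput on representative edge platforms.
Link: https://arxiv.org/abs/2601.21198
Authors: Yuchen Yang,Yaru Zhao,Pu Yang,Shaowei Wang,Zhi-Hua Zhou
Affiliation: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: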
Abstract:While Mixture-of-Experts (MoE) architectures substantially bolster the expressive power of large-language models, their prohibitive memory footprint severely impedes the practical deployment on resource-constrained edge devices, especially when model behavior must be preserved without relying on lossy quantization. In this paper, we present ZipMoE, an efficient and semantically lossless on-device MoE serving system. ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent to MoE parameters via a caching-scheduling co-design with provable performance guarantee. Fundamentally, our design shifts the paradigm of on-device MoE inference from an I/O-bound bottleneck to a compute-centric workflow that enables efficient parallelization. We implement a prototype of ZipMoE and conduct extensive experiments on representative edge computing platforms using popular open-source MoE models and real-world workloads. Our evaluation reveals that ZipMoE achieves up to 72.77% inference latency reduction and up to 6.76$\times$ higher throughput than the state-of-the-art systems.
[AI-137] Adaptive and Robust Cost-Aware Proof of Quality for Decentralized LLM Inference Networks
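[Quick read]: This paper addresses how decentralized large language model inference networks can reliably reward high-quality outputs with a lightweight mechanism under heterogeneous latency and cost. The core challenge is that heterogeneous evaluator nodes and malicious score manipulation can distort the consensus signal and weaken incentive alignment in open-participation settings. The key to the solution is to extend a cost-aware Proof of Quality mechanism with adversary-resilient consensus formation: robust aggregation rules such as the median and trimmed mean, and a trust-weighted consensus that updates evaluator weights dynamically from deviation signals. Experiments show that these changes improve the agreement between the consensus and a ground-truth proxy and reduce sensitivity to noise and strategic attacks. They also reveal an operational trade-off in evaluator sample size: larger evaluator sets lower per-evaluator rewards and increase payoff variance while inference rewards stay relatively stable, which gives practical guidance for parameter selection in deployment.
Link: https://arxiv.org/abs/2601.21189
Authors: Arther Tian,Alex Ding,Frank Chen,Simon Wu,Aaron Chan
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: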
Abstract:Decentralized large language model inference networks require lightweight mechanisms to reward high quality outputs under heterogeneous latency and cost. Proof of Quality provides scalable verification by sampling evaluator nodes that score candidate outputs, then aggregating their scores into a consensus signal that determines rewards. However, evaluator heterogeneity and malicious score manipulation can distort consensus and inflate payouts, which weakens incentive alignment in open participation settings. This paper extends a cost-aware Proof of Quality mechanism by adding adversary-resilient consensus formation. We study robust aggregation rules, including median and trimmed mean, and an adaptive trust-weighted consensus that updates evaluator weights from deviation signals. Using question answering and summarization workloads with a ground truth proxy for offline analysis, we quantify evaluator reliability and show strong variance across evaluators, including task-dependent misalignment that can invert correlations. We then evaluate robustness under four adversarial strategies, including noise injection, boosting, sabotage, and intermittent manipulation, across a sweep of malicious ratios and evaluator sample sizes. Our results show that robust aggregation improves consensus alignment with the ground truth proxy and reduces sensitivity to noisy and strategic attacks compared with simple averaging. We further characterize the operational trade-off introduced by evaluator sampling, where larger evaluator sets reduce evaluator rewards and increase payoff variance while inference rewards remain relatively stable in our configuration. These findings motivate robust consensus as a default component for cost-aware Proof of Quality and provide practical guidance for selecting evaluator sampling parameters under adversarial risk and resource constraints. 
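The robust aggregation rules named in the abstract can be sketched as below. The median and trimmed mean are standard; the trust-weight update is only an assumed form (weights shrink in proportion to each evaluator's deviation from consensus), since the abstract does not give the paper's update rule. Scores are assumed to lie in [0, 1], and the sample values are invented for illustration.

```python
import statistics

def trimmed_mean(scores, trim=1):
    """Mean after dropping the `trim` lowest and `trim` highest scores."""
    if 2 * trim >= len(scores):
        raise ValueError("trim removes every score")
    kept = sorted(scores)[trim:len(scores) - trim]
    return sum(kept) / len(kept)

def trust_weighted_consensus(scores, weights):
    """Weighted average of evaluator scores."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def update_weights(scores, weights, consensus, rate=0.5, floor=0.05):
    """Assumed update rule: shrink each evaluator's weight in proportion
    to its absolute deviation from the consensus, down to a floor."""
    return [
        max(floor, w * (1.0 - rate * min(1.0, abs(s - consensus))))
        for s, w in zip(scores, weights)
    ]

# Four honest evaluators and one "boosting" attacker reporting 1.0.
scores = [0.70, 0.72, 0.68, 0.71, 1.0]
plain_mean = sum(scores) / len(scores)
robust_median = statistics.median(scores)
robust_trimmed = trimmed_mean(scores, trim=1)
new_weights = update_weights(scores, [1.0] * 5, robust_median)
```

In this toy case the plain mean is pulled upward by the attacker, while the median and trimmed mean stay near the honest scores, and the attacker's weight falls the most after one update.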
Cite as: arXiv:2601.21189 [cs.CR], https://doi.org/10.48550/arXiv.2601.21189
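The robust aggregation rules named in the abstract (median, trimmed mean, trust-weighted consensus) can be illustrated with a small sketch. This is not the authors' implementation: the score values, the `trim_ratio` default, and the `update_weights` rule are assumptions chosen only to show why these rules resist a "boosting" attack better than simple averaging.

```python
from statistics import mean, median


def trimmed_mean(scores, trim_ratio=0.2):
    """Mean after dropping the lowest and highest trim_ratio fraction of scores."""
    ordered = sorted(scores)
    k = int(len(ordered) * trim_ratio)
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return mean(kept)


def trust_weighted(scores, weights):
    """Weighted average of evaluator scores."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def update_weights(scores, weights, consensus, rate=0.5):
    """Hypothetical update rule: shrink the weight of evaluators that deviate from consensus."""
    return [w / (1.0 + rate * abs(s - consensus)) for s, w in zip(scores, weights)]


# Eight honest evaluators score near 0.60; two malicious ones boost to 1.0 (made-up data).
honest = [0.62, 0.58, 0.60, 0.61, 0.59, 0.63, 0.57, 0.60]
scores = honest + [1.0, 1.0]

simple = mean(scores)
robust_median = median(scores)
robust_trimmed = trimmed_mean(scores)

# One round of trust updating, using the median as the reference consensus.
weights = update_weights(scores, [1.0] * len(scores), robust_median)
robust_trust = trust_weighted(scores, weights)

print(simple, robust_median, robust_trimmed, robust_trust)
```

In this toy setting the median and trimmed mean stay close to the honest average of 0.60, while the simple mean is pulled toward the boosted scores.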
[AI-138] Sycophantic Anchors: Localizing and Quantifying User Agreement in Reasoning Models
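[Quick Read]: This paper addresses sycophancy in reasoning models, where a model adopts an incorrect user suggestion during its reasoning and departs from the correct line of logic. The key idea is the notion of "sycophantic anchors": sentences in the reasoning trace that causally lock the model into agreeing with the user. Analyzing more than 10,000 counterfactual rollouts, the author finds that these anchors can be reliably detected and quantified mid-inference: linear probes identify sycophantic anchors with 84.6% balanced accuracy, and activation-based regressors predict the strength of the commitment (R^2 = 0.74). The results show that sycophancy builds up gradually during reasoning and point to a potential window for intervention.
Link: https://arxiv.org/abs/2601.21183
Authors: Jacek Duszenko
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: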
Abstract:Reasoning models frequently agree with incorrect user suggestions – a behavior known as sycophancy. However, it is unclear where in the reasoning trace this agreement originates and how strong the commitment is. To localize and quantify this behavior, we introduce sycophantic anchors – sentences that causally lock models into user agreement. Analyzing over 10,000 counterfactual rollouts on a distilled reasoning model, we show that anchors can be reliably detected and quantified mid-inference. Linear probes distinguish sycophantic anchors with 84.6% balanced accuracy, while activation-based regressors predict the magnitude of the commitment (R^2 = 0.74). We further observe asymmetry where sycophantic anchors are significantly more distinguishable than correct reasoning anchors, and find that sycophancy builds gradually during reasoning, revealing a potential window for intervention. These results offer sentence-level mechanisms for localizing model misalignment mid-inference.
[AI-139] Rethinking Refinement: Correcting Generative Bias without Noise Injection
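[Quick Read]: This paper addresses the systematic biases of generative models such as diffusion and flow-based models, which degrade sample quality, particularly in high-dimensional settings. The key contribution is a flow-matching-based Bi-stage Flow Refinement (BFR) framework that corrects bias post hoc, without noise injection or multi-step resampling, using two refinement strategies: latent space alignment for approximately invertible generators, and data space refinement trained with lightweight augmentations. BFR preserves the original ODE trajectory and applies deterministic corrections to generated samples, improving fidelity (FID) and coverage while maintaining sample diversity. On MNIST, latent space refinement reaches a state-of-the-art FID of 1.46, starting from a base FID of 3.95, with only a single additional function evaluation (1-NFE).
Link: https://arxiv.org/abs/2601.21182
Authors: Xin Peng, Ang Gao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: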
Abstract:Generative models, including diffusion and flow-based models, often exhibit systematic biases that degrade sample quality, particularly in high-dimensional settings. We revisit refinement methods and show that effective bias correction can be achieved as a post-hoc procedure, without noise injection or multi-step resampling of the sampling process. We propose a flow-matching-based Bi-stage Flow Refinement (BFR) framework with two refinement strategies operating at different stages: latent space alignment for approximately invertible generators and data space refinement trained with lightweight augmentations. Unlike previous refiners that perturb sampling dynamics, BFR preserves the original ODE trajectory and applies deterministic corrections to generated samples. Experiments on MNIST, CIFAR-10, and FFHQ at 256x256 resolution demonstrate consistent improvements in fidelity and coverage; notably, starting from base samples with FID 3.95, latent space refinement achieves a state-of-the-art FID of 1.46 on MNIST using only a single additional function evaluation (1-NFE), while maintaining sample diversity.
[AI-140] MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models
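[Quick Read]: This paper addresses cross-modal hallucinations in Multimodal Large Language Models (MLLMs), where information from one modality inappropriately influences generation about another and leads to fabricated output. The key contribution is Modality-Adaptive Decoding (MAD), a training-free method: the model self-assesses which modalities each task requires, the resulting modality probabilities are extracted, and they are used to adaptively weight contrastive decoding branches, so that the model focuses on relevant modalities and suppresses cross-modal interference. Experiments show that MAD significantly reduces cross-modal hallucinations across several audio-visual language models, supporting the importance of explicit modality awareness for robust multimodal reasoning.
Link: https://arxiv.org/abs/2601.21181
Authors: Sangyun Chung, Se Yeon Kim, Youngchae Chee, Yong Man Ro
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: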
Abstract:Multimodal Large Language Models (MLLMs) suffer from cross-modal hallucinations, where one modality inappropriately influences generation about another, leading to fabricated output. This exposes a more fundamental deficiency in modality-interaction control. To address this, we propose Modality-Adaptive Decoding (MAD), a training-free method that adaptively weights modality-specific decoding branches based on task requirements. MAD leverages the model’s inherent ability to self-assess modality relevance by querying which modalities are needed for each task. The extracted modality probabilities are then used to adaptively weight contrastive decoding branches, enabling the model to focus on relevant information while suppressing cross-modal interference. Extensive experiments on CMM and AVHBench demonstrate that MAD significantly reduces cross-modal hallucinations across multiple audio-visual language models (7.8% and 2.0% improvements for VideoLLaMA2-AV, 8.7% and 4.7% improvements for Qwen2.5-Omni). Our approach demonstrates that explicit modality awareness through self-assessment is crucial for robust multimodal reasoning, offering a principled extension to existing contrastive decoding methods. Our code is available at this https URL
[AI-141] AC2L-GAD: Active Counterfactual Contrastive Learning for Graph Anomaly Detection
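[Quick Read]: This paper addresses the challenges that label scarcity and extreme class imbalance pose for Graph Anomaly Detection (GAD), and in particular two limitations of existing graph contrastive learning methods: random augmentations break the semantic consistency of positive pairs, and naive negative sampling produces trivial, uninformative contrasts. The core of the solution is AC2L-GAD, an Active Counterfactual Contrastive Learning framework that combines information-theoretic active selection with counterfactual generation to identify structurally complex nodes and to generate anomaly-preserving positive augmentations alongside normal negative counterparts that provide hard contrasts. Restricting the expensive counterfactual generation to a strategically selected subset reduces computational overhead by approximately 65% relative to full-graph counterfactual generation while maintaining detection quality.
Link: https://arxiv.org/abs/2601.21171
Authors: Kamal Berahmand, Saman Forouzandeh, Mehrnoush Mohammadi, Parham Moradi, Mahdi Jalili
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: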
Abstract:Graph anomaly detection aims to identify abnormal patterns in networks, but faces significant challenges from label scarcity and extreme class imbalance. While graph contrastive learning offers a promising unsupervised solution, existing methods suffer from two critical limitations: random augmentations break semantic consistency in positive pairs, while naive negative sampling produces trivial, uninformative contrasts. We propose AC2L-GAD, an Active Counterfactual Contrastive Learning framework that addresses both limitations through principled counterfactual reasoning. By combining information-theoretic active selection with counterfactual generation, our approach identifies structurally complex nodes and generates anomaly-preserving positive augmentations alongside normal negative counterparts that provide hard contrasts, while restricting expensive counterfactual generation to a strategically selected subset. This design reduces computational overhead by approximately 65% compared to full-graph counterfactual generation while maintaining detection quality. Experiments on nine benchmark datasets, including real-world financial transaction graphs from GADBench, show that AC2L-GAD achieves competitive or superior performance compared to state-of-the-art baselines, with notable gains in datasets where anomalies exhibit complex attribute-structure interactions.
[AI-142] FrontierScience: Evaluating AI's Ability to Perform Expert-Level Scientific Tasks
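[Quick Read]: This paper addresses the difficulty of measuring frontier language models on expert-level scientific problems with current science benchmarks, which mostly rely on multiple-choice questions or already published information and are nearly saturated. The key contribution is FrontierScience, a benchmark with two complementary tracks: an Olympiad track of original problems at the level of international olympiads (IPhO, IChO, IBO), and a Research track of open-ended research sub-tasks written and verified by PhD scientists. For the Research track, a granular rubric-based framework assesses the model throughout the process of solving a research task rather than judging only the final answer.
Link: https://arxiv.org/abs/2601.21165
Authors: Miles Wang, Robi Lin, Kat Hu, Joy Jiao, Neil Chowdhury, Ethan Chang, Tejal Patwardhan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: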
Abstract:We introduce FrontierScience, a benchmark evaluating expert-level scientific reasoning in frontier language models. Recent model progress has nearly saturated existing science benchmarks, which often rely on multiple-choice knowledge questions or already published information. FrontierScience addresses this gap through two complementary tracks: (1) Olympiad, consisting of international olympiad problems at the level of IPhO, IChO, and IBO, and (2) Research, consisting of PhD-level, open-ended problems representative of sub-tasks in scientific research. FrontierScience contains several hundred questions (including 160 in the open-sourced gold set) covering subfields across physics, chemistry, and biology, from quantum electrodynamics to synthetic organic chemistry. All Olympiad problems are originally produced by international Olympiad medalists and national team coaches to ensure standards of difficulty, originality, and factuality. All Research problems are research sub-tasks written and verified by PhD scientists (doctoral candidates, postdoctoral researchers, or professors). For Research, we introduce a granular rubric-based evaluation framework to assess model capabilities throughout the process of solving a research task, rather than judging only a standalone final answer.
Cite as: arXiv:2601.21165 [cs.AI], https://doi.org/10.48550/arXiv.2601.21165
[AI-143] Concise Geometric Description as a Bridge: Unleashing the Potential of LLM for Plane Geometry Problem Solving
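[Quick Read]: This paper addresses Plane Geometry Problem Solving (PGPS), where Large Language Models (LLMs) are held back because they cannot process visual diagrams directly. Existing methods typically fine-tune Multimodal LLMs (MLLMs) end-to-end to improve visual understanding and reasoning at the same time, but this joint optimization may weaken the base LLM's reasoning capability. The key idea is to convert the visual information into a structured textual description and let an off-the-shelf LLM do the reasoning. An MLLM Interpreter is trained to generate geometric descriptions in Conditional Declaration Language (CDL) via CoT-augmented supervised fine-tuning (SFT) followed by GRPO, and CDL matching rewards replace the conventional answer-based reward to give more direct and denser guidance for CDL generation. Fine-tuned on only 5.5k samples, the method performs favorably against leading open-source and closed-source MLLMs on several benchmarks.
Link: https://arxiv.org/abs/2601.21164
Authors: Jingyun Wang, Dian Li, Xiaohan Wang, Gang Liu, Jiahong Yan, Guoliang Kang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Under review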
Abstract:Plane Geometry Problem Solving (PGPS) is a multimodal reasoning task that aims to solve a plane geometric problem based on a geometric diagram and problem textual descriptions. Although Large Language Models (LLMs) possess strong reasoning skills, their direct application to PGPS is hindered by their inability to process visual diagrams. Existing works typically fine-tune Multimodal LLMs (MLLMs) end-to-end on large-scale PGPS data to enhance visual understanding and reasoning simultaneously. However, such joint optimization may compromise base LLMs’ inherent reasoning capability. In this work, we observe that LLM itself is potentially a powerful PGPS solver when appropriately formulating visual information as textual descriptions. We propose to train a MLLM Interpreter to generate geometric descriptions for the visual diagram, and an off-the-shelf LLM is utilized to perform reasoning. Specifically, we choose Conditional Declaration Language (CDL) as the geometric description as its conciseness eases the MLLM Interpreter training. The MLLM Interpreter is fine-tuned via CoT (Chain-of-Thought)-augmented SFT followed by GRPO to generate CDL. Instead of using a conventional solution-based reward that compares the reasoning result with the ground-truth answer, we design CDL matching rewards to facilitate more effective GRPO training, which provides more direct and denser guidance for CDL generation. To support training, we construct a new dataset, Formalgeo7k-Rec-CoT, by manually reviewing Formalgeo7k v2 and incorporating CoT annotations. Extensive experiments on Formalgeo7k-Rec-CoT, Unigeo, and MathVista show our method (finetuned on only 5.5k data) performs favorably against leading open-source and closed-source MLLMs.
[AI-144] A2RAG: Adaptive Agentic Graph Retrieval for Cost-Aware and Reliable Reasoning
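[Quick Read]: This paper addresses two bottlenecks that Graph Retrieval-Augmented Generation (Graph-RAG) faces in practical deployments: first, under mixed-difficulty workloads, a one-size-fits-all retrieval strategy wastes cost on easy queries and fails on hard multihop cases; second, extraction loss during graph abstraction means that fine-grained qualifiers remain only in the source text and cannot be used. The key contribution is the A2RAG framework, which couples an adaptive controller that verifies evidence sufficiency and triggers targeted refinement only when necessary with an agentic retriever that progressively escalates retrieval effort and maps graph signals back to provenance text for robustness, yielding cost-aware and reliable reasoning.
Link: https://arxiv.org/abs/2601.21162
Authors: Jiate Liu, Zebin Chen, Shaobo Qiao, Mingchen Ju, Danting Zhang, Bocheng Han, Shuyue Yu, Xin Shu, Jingling Wu, Dong Wen, Xin Cao, Guanfeng Liu, Zhengyi Yang
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: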
Abstract:Graph Retrieval-Augmented Generation (Graph-RAG) enhances multihop question answering by organizing corpora into knowledge graphs and routing evidence through relational structure. However, practical deployments face two persistent bottlenecks: (i) mixed-difficulty workloads where one-size-fits-all retrieval either wastes cost on easy queries or fails on hard multihop cases, and (ii) extraction loss, where graph abstraction omits fine-grained qualifiers that remain only in source text. We present A2RAG, an adaptive-and-agentic GraphRAG framework for cost-aware and reliable reasoning. A2RAG couples an adaptive controller that verifies evidence sufficiency and triggers targeted refinement only when necessary, with an agentic retriever that progressively escalates retrieval effort and maps graph signals back to provenance text to remain robust under extraction loss and incomplete graphs. Experiments on HotpotQA and 2WikiMultiHopQA demonstrate that A2RAG achieves +9.9/+11.8 absolute gains in Recall@2, while cutting token consumption and end-to-end latency by about 50% relative to iterative multihop baselines.
[AI-145] Can Neural Networks Learn Small Algebraic Worlds? An Investigation Into the Group-theoretic Structures Learned By Narrow Models Trained To Predict Group Operations
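[Quick Read]: This paper asks whether narrow models trained on a fixed mathematical task learn broader mathematical structure, with the aim of moving AI for mathematics beyond simple question answering. The core challenge is to verify whether a neural network captures deeper structure from areas such as abstract algebra, for example the group-theoretic notions of the identity element, commutativity, or subgroups. The key contribution is a systematic suite of tests that assess a model's implicit representation of these abstract properties, for instance by training linear classifiers to distinguish elements of certain subgroups even though the training data contains no subgroup labels. The results suggest that even small neural networks can yield useful information about mathematical structure.
Link: https://arxiv.org/abs/2601.21150
Authors: Henry Kvinge, Andrew Aguilar, Nayda Farnsworth, Grace O’Brien, Robert Jasper, Sarah Scullen, Helen Jenne
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Presented at TAG-DS 2025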
Abstract:While a real-world research program in mathematics may be guided by a motivating question, the process of mathematical discovery is typically open-ended. Ideally, exploration needed to answer the original question will reveal new structures, patterns, and insights that are valuable in their own right. This contrasts with the exam-style paradigm in which the machine learning community typically applies AI to math. To maximize progress in mathematics using AI, we will need to go beyond simple question answering. With this in mind, we explore the extent to which narrow models trained to solve a fixed mathematical task learn broader mathematical structure that can be extracted by a researcher or other AI system. As a basic test case for this, we use the task of training a neural network to predict a group operation (for example, performing modular arithmetic or composition of permutations). We describe a suite of tests designed to assess whether the model captures significant group-theoretic notions such as the identity element, commutativity, or subgroups. Through extensive experimentation we find evidence that models learn representations capable of capturing abstract algebraic properties. For example, we find hints that models capture the commutativity of modular arithmetic. We are also able to train linear classifiers that reliably distinguish between elements of certain subgroups (even though no labels for these subgroups are included in the data). On the other hand, we are unable to extract notions such as the concept of the identity element. Together, our results suggest that in some cases the representations of even small neural networks can be used to distill interesting abstract structure from new mathematical objects.
[AI-146] Mobility-Embedded POIs: Learning What A Place Is and How It Is Used from Human Movement
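[Quick Read]: This paper addresses a limitation of point-of-interest (POI) representation learning in current geospatial foundation models: existing methods either capture place identity from static textual metadata or learn movement regularities from trajectory context, and overlook how a POI is actually used (its function), which is a key signal for understanding human activity. The core of the solution is ME-POIs (Mobility-Embedded POIs), a framework that augments text embeddings from language models with large-scale human mobility data to build POI-centric, context-independent representations. Its main innovations are: 1) encoding individual visits as temporally contextualized embeddings and aligning them with learnable POI representations via contrastive learning, capturing usage patterns across users and time; and 2) a multi-scale propagation mechanism for long-tail sparsity that transfers temporal visit patterns from nearby, frequently visited POIs to rarely visited ones. Experiments show that augmenting text-based embeddings with ME-POIs consistently outperforms text-only and mobility-only baselines on several map enrichment tasks, and that ME-POIs trained on mobility data alone can surpass text-only models on certain tasks, supporting POI function as a core element of general POI representations.
Link: https://arxiv.org/abs/2601.21149
Authors: Maria Despoina Siampou, Shushman Choudhury, Shang-Ling Hsu, Neha Arora, Cyrus Shahabi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: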
Abstract:Recent progress in geospatial foundation models highlights the importance of learning general-purpose representations for real-world locations, particularly points-of-interest (POIs) where human activity concentrates. Existing approaches, however, focus primarily on place identity derived from static textual metadata, or learn representations tied to trajectory context, which capture movement regularities rather than how places are actually used (i.e., POI’s function). We argue that POI function is a missing but essential signal for general POI representations. We introduce Mobility-Embedded POIs (ME-POIs), a framework that augments POI embeddings derived from language models with large-scale human mobility data to learn POI-centric, context-independent representations grounded in real-world usage. ME-POIs encodes individual visits as temporally contextualized embeddings and aligns them with learnable POI representations via contrastive learning to capture usage patterns across users and time. To address long-tail sparsity, we propose a novel mechanism that propagates temporal visit patterns from nearby, frequently visited POIs across multiple spatial scales. We evaluate ME-POIs on five newly proposed map enrichment tasks, testing its ability to capture both the identity and function of POIs. Across all tasks, augmenting text-based embeddings with ME-POIs consistently outperforms both text-only and mobility-only baselines. Notably, ME-POIs trained on mobility data alone can surpass text-only models on certain tasks, highlighting that POI function is a critical component of accurate and generalizable POI representations.
[AI-147] BrainStack: Neuro-MoE with Functionally Guided Expert Routing for EEG-Based Language Decoding
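[Quick Read]: This paper addresses the difficulty of decoding linguistic information from electroencephalography (EEG), which stems from the brain's distributed and nonlinear organization. The core of the solution is BrainStack, a functionally guided neuro-mixture-of-experts (Neuro-MoE) framework that models the brain's modular functional architecture through anatomically partitioned expert networks: each functional region is represented by an expert that learns localized neural dynamics, while a transformer-based global expert captures cross-regional dependencies. A learnable routing gate adaptively aggregates these heterogeneous experts, enabling context-dependent expert coordination and selective fusion. A further innovation is cross-regional distillation, in which the global expert provides top-down regularization to the regional experts to promote coherent representations across the hierarchy.
Link: https://arxiv.org/abs/2601.21148
Authors: Ziyi Zhao, Jinzhao Zhou, Xiaowei Jiang, Beining Cao, Wenhao Ma, Yang Shen, Ren Li, Yu-Kai Wang, Chin-teng Lin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: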
Abstract:Decoding linguistic information from electroencephalography (EEG) remains challenging due to the brain’s distributed and nonlinear organization. We present BrainStack, a functionally guided neuro-mixture-of-experts (Neuro-MoE) framework that models the brain’s modular functional architecture through anatomically partitioned expert networks. Each functional region is represented by a specialized expert that learns localized neural dynamics, while a transformer-based global expert captures cross-regional dependencies. A learnable routing gate adaptively aggregates these heterogeneous experts, enabling context-dependent expert coordination and selective fusion. To promote coherent representation across the hierarchy, we introduce cross-regional distillation, where the global expert provides top-down regularization to the regional experts. We further release SilentSpeech-EEG (SS-EEG), a large-scale benchmark comprising over 120 hours of EEG recordings from 12 subjects performing 24 silent words, the largest dataset of its kind. Experiments demonstrate that BrainStack consistently outperforms state-of-the-art models, achieving superior accuracy and generalization across subjects. Our results establish BrainStack as a functionally modular, neuro-inspired MoE paradigm that unifies neuroscientific priors with adaptive expert routing, paving the way for scalable and interpretable brain-language decoding.
[AI-148] Optimization and Mobile Deployment for Anthropocene Neural Style Transfer SIGGRAPH2026
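[Quick Read]: This paper addresses the tension between semantic fidelity and textural expressiveness when Neural Style Transfer (NST) is applied to images of Anthropocene environments. Conventional NST favors painterly abstraction, but human-altered landscapes such as industrial infrastructure, waste accumulations, and modified ecosystems contain dense, repetitive textures that easily lose semantic information under aggressive stylization. The key contribution is a systematic study of NST parameter configurations, including feature layer selection, the weighting of content and style losses, training stability, and output resolution. Controlled experiments identify an optimal parameter manifold that amplifies material texture while preserving semantic legibility. Building on this, the authors implement a low-latency, feed-forward inference pipeline for mobile devices that processes high-resolution images within 3-5 seconds and supports real-time, in-situ visual intervention, making NST a practical and expressive tool for visualizing Anthropocene environments.
Link: https://arxiv.org/abs/2601.21141
Authors: Po-Hsun Chen, Ivan C. H. Liu
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: 7 pages, 11 figures, submitted to SIGGRAPH 2026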
Abstract:This paper presents AnthropoCam, a mobile-based neural style transfer (NST) system optimized for the visual synthesis of Anthropocene environments. Unlike conventional artistic NST, which prioritizes painterly abstraction, stylizing human-altered landscapes demands a careful balance between amplifying material textures and preserving semantic legibility. Industrial infrastructures, waste accumulations, and modified ecosystems contain dense, repetitive patterns that are visually expressive yet highly susceptible to semantic erosion under aggressive style transfer. To address this challenge, we systematically investigate the impact of NST parameter configurations on the visual translation of Anthropocene textures, including feature layer selection, style and content loss weighting, training stability, and output resolution. Through controlled experiments, we identify an optimal parameter manifold that maximizes stylistic expression while preventing semantic erasure. Our results demonstrate that appropriate combinations of convolutional depth, loss ratios, and resolution scaling enable the faithful transformation of anthropogenic material properties into a coherent visual language. Building on these findings, we implement a low-latency, feed-forward NST pipeline deployed on mobile devices. The system integrates a React Native frontend with a Flask-based GPU backend, achieving high-resolution inference within 3-5 seconds on general mobile hardware. This enables real-time, in-situ visual intervention at the site of image capture, supporting participatory engagement with Anthropocene landscapes. By coupling domain-specific NST optimization with mobile deployment, AnthropoCam reframes neural style transfer as a practical and expressive tool for real-time environmental visualization in the Anthropocene. 
Cite as: arXiv:2601.21141 [cs.HC], https://doi.org/10.48550/arXiv.2601.21141. Submitted (v1): Thu, 29 Jan 2026 00:50:03 UTC.
[AI-149] What You Feel Is Not What They See: On Predicting Self-Reported Emotion from Third-Party Observer Labels ICASSP2026
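[Quick Read]: This paper addresses the poor performance of emotion models trained on third-party labels when they are applied to self-reported emotion, a gap that matters in mental health, where accurately modeling a person's internal emotional experience is essential for guiding intervention. Through a cross-corpus evaluation, the authors find that when content is personally significant to the speaker, models predict valence with high accuracy (concordance correlation coefficient, CCC ≈ 0.6–0.8). This suggests that personal significance is a key pathway for aligning external perception with internal experience and offers a direction for making third-party-trained models more applicable to self-report settings.
Link: https://arxiv.org/abs/2601.21130
Authors: Yara El-Tawil, Aneesha Sampath, Emily Mower Provost
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)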
Abstract:Self-reported emotion labels capture internal experience, while third-party labels reflect external perception. These perspectives often diverge, limiting the applicability of third-party-trained models to self-report contexts. This gap is critical in mental health, where accurate self-report modeling is essential for guiding intervention. We present the first cross-corpus evaluation of third-party-trained models on self-reports. We find activation unpredictable (CCC approximately 0) and valence moderately predictable (CCC approximately 0.3). Crucially, when content is personally significant to the speaker, models achieve high performance for valence (CCC approximately 0.6-0.8). Our findings point to personal significance as a key pathway for aligning external perception with internal experience and underscore the challenge of self-report activation modeling.
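The CCC values reported above refer to the concordance correlation coefficient. As a minimal sketch (the sample ratings are invented for illustration, and the paper's exact evaluation code is not shown here), Lin's CCC is 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). Unlike Pearson correlation, it penalizes a systematic offset between predictions and labels.

```python
from statistics import mean


def ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient, using population (1/n) moments."""
    n = len(y_true)
    mean_t, mean_p = mean(y_true), mean(y_pred)
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    # Undefined (division by zero) only if both series are constant and equal.
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)


# Hypothetical self-reported valence ratings and two sets of model predictions.
self_report = [1, 2, 3, 4, 5]
perfect = [1, 2, 3, 4, 5]
shifted = [2, 3, 4, 5, 6]  # perfectly correlated, but biased by +1

print(ccc(self_report, perfect))  # agreement and no bias
print(ccc(self_report, shifted))  # correlation is perfect, CCC is lower
```

The shifted predictions have a Pearson correlation of 1 with the labels but a CCC of 0.8, which is why CCC is a stricter measure of agreement for dimensional emotion ratings.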
[AI-150] Beyond a Single Reference: Training and Evaluation with Paraphrases in Sign Language Translation
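[Quick Read]: This paper addresses the bias that single-reference translations introduce into the training and evaluation of Sign Language Translation (SLT) systems, which is especially visible with n-gram metrics such as BLEU. Because sign and spoken languages are highly non-isomorphic, one signed utterance can have several equally valid written translations, yet existing datasets usually provide only one reference. The key idea is to use Large Language Models (LLMs) to automatically generate paraphrased variants of the written-language translations as synthetic alternative references. The study finds that using these paraphrases for evaluation, rather than for training, yields higher automatic scores and better alignment with human judgments. It therefore proposes BLEUpara, an extension of BLEU that evaluates translations against multiple paraphrased references, for more reliable SLT evaluation that is closer to human perception.
Link: https://arxiv.org/abs/2601.21128
Authors: Václav Javorek, Tomáš Železný, Alessa Carbo, Marek Hrúz, Ivan Gruber
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Under review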
Abstract:Most Sign Language Translation (SLT) corpora pair each signed utterance with a single written-language reference, despite the highly non-isomorphic relationship between sign and spoken languages, where multiple translations can be equally valid. This limitation constrains both model training and evaluation, particularly for n-gram-based metrics such as BLEU. In this work, we investigate the use of Large Language Models to automatically generate paraphrased variants of written-language translations as synthetic alternative references for SLT. First, we compare multiple paraphrasing strategies and models using an adapted ParaScore metric. Second, we study the impact of paraphrases on both training and evaluation of the pose-based T5 model on the YouTubeASL and How2Sign datasets. Our results show that naively incorporating paraphrases during training does not improve translation performance and can even be detrimental. In contrast, using paraphrases during evaluation leads to higher automatic scores and better alignment with human judgments. To formalize this observation, we introduce BLEUpara, an extension of BLEU that evaluates translations against multiple paraphrased references. Human evaluation confirms that BLEUpara correlates more strongly with perceived translation quality. We release all generated paraphrases, generation and evaluation code to support reproducible and more reliable evaluation of SLT systems.
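To make the multi-reference idea concrete, here is a rough sketch of standard multi-reference BLEU at the sentence level, with clipped n-gram counts taken over all references. It is not the authors' BLEUpara implementation, whose exact definition is not given in this listing; the function names, the `max_n=2` setting, and the example sentences are assumptions for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def multi_ref_bleu(hypothesis, references, max_n=2):
    """Sentence-level BLEU against one or more references (no smoothing)."""
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        # Clip each n-gram by its maximum count in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if clipped == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the hypothesis length.
    closest = min((len(r) for r in refs), key=lambda length: (abs(length - len(hyp)), length))
    bp = 1.0 if len(hyp) >= closest else math.exp(1 - closest / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)


hypothesis = "the shop is closed today"
original_ref = "the shop is not open today"
paraphrase = "today the shop is closed"  # hypothetical LLM-generated paraphrase

single = multi_ref_bleu(hypothesis, [original_ref])
multi = multi_ref_bleu(hypothesis, [original_ref, paraphrase])
print(single, multi)
```

A valid translation that happens to be worded differently from the single reference is scored more fairly once a paraphrased reference is added, which is the effect the abstract describes.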
[AI-151] PhaseCoder: Microphone Geometry-Agnostic Spatial Audio Understanding for Multimodal LLMs
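[Quick Read]: This paper addresses the fact that current Multimodal Large Language Models (MLLMs) treat audio as a mono stream and ignore the spatial information that is essential for embodied AI, while existing spatial audio models are tied to fixed microphone geometries and are hard to deploy across devices. The key contribution is PhaseCoder, a transformer-only spatial audio encoder that is agnostic to microphone geometry: it takes raw multichannel audio and microphone coordinates as input, performs robust localization, and produces spatial embeddings. With it, the Gemma 3n LLM can for the first time reason over the "Spatial Audio Tokens" that PhaseCoder produces to perform complex spatial reasoning and targeted transcription, and the encoder achieves state-of-the-art results on microphone-invariant localization benchmarks.
Link: https://arxiv.org/abs/2601.21124
Authors: Artem Dementyev, Wazeer Zulfikar, Sinan Hersek, Pascal Getreuer, Anurag Kumar, Vivek Kumar
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: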
Abstract:Current multimodal LLMs process audio as a mono stream, ignoring the rich spatial information essential for embodied AI. Existing spatial audio models, conversely, are constrained to fixed microphone geometries, preventing deployment across diverse devices. We present PhaseCoder, a transformer-only spatial audio encoder that is agnostic to microphone geometry. PhaseCoder takes raw multichannel audio and microphone coordinates as inputs to perform localization and produces robust spatial embeddings. We demonstrate that Gemma 3n LLM can be fine-tuned to reason over “Spatial Audio Tokens” produced by PhaseCoder. We show our encoder achieves state-of-the-art results on microphone-invariant localization benchmarks and, for the first time, enables an LLM to perform complex spatial reasoning and targeted transcription tasks from an arbitrary microphone array.
[AI-152] CUA-Skill: Develop Skills for Computer Using Agent
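[Quick Read]: This paper addresses the difficulty of scaling Computer-Using Agents (CUAs), which still lag behind human performance. The core bottleneck is the lack of reusable, structured skill abstractions that capture how humans interact with graphical user interfaces (GUIs) and how skills compose. The key contribution is CUA-Skill, a large-scale, carefully engineered library of computer-use skills that encodes human operating knowledge as skills with parameterized execution and composition graphs. Built on it, CUA-Skill Agent supports dynamic skill retrieval, argument instantiation, and memory-aware failure recovery, which substantially improves execution success rates and robustness. On WindowsAgentArena it reaches a 57.5% success rate (best of three runs), which the authors report as state of the art.
Link: https://arxiv.org/abs/2601.21123
Authors: Tianyi Chen, Yinheng Li, Michael Solodko, Sen Wang, Nan Jiang, Tingyuan Cui, Junheng Hao, Jongwoo Ko, Sara Abdali, Suzhen Zheng, Leon Xu, Hao Fan, Pashmina Cameron, Justin Wagle, Kazuhito Koishida
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: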
Abstract:Computer-Using Agents (CUAs) aim to autonomously operate computer systems to complete real-world tasks. However, existing agentic systems remain difficult to scale and lag behind human performance. A key limitation is the absence of reusable and structured skill abstractions that capture how humans interact with graphical user interfaces and how to leverage these skills. We introduce CUA-Skill, a computer-using agentic skill base that encodes human computer-use knowledge as skills coupled with parameterized execution and composition graphs. CUA-Skill is a large-scale library of carefully engineered skills spanning common Windows applications, serving as a practical infrastructure and tool substrate for scalable, reliable agent development. Built upon this skill base, we construct CUA-Skill Agent, an end-to-end computer-using agent that supports dynamic skill retrieval, argument instantiation, and memory-aware failure recovery. Our results demonstrate that CUA-Skill substantially improves execution success rates and robustness on challenging end-to-end agent benchmarks, establishing a strong foundation for future computer-using agent development. On WindowsAgentArena, CUA-Skill Agent achieves state-of-the-art 57.5% (best of three) successful rate while being significantly more efficient than prior and concurrent approaches. The project page is available at this https URL.
[AI-153] AI-Assisted Engineering Should Track the Epistemic Status and Temporal Validity of Architectural Decisions
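[Quick Read]: This paper addresses the traceability and trustworthiness of architectural decisions in AI-assisted software engineering. Large Language Model (LLM) coding assistants generate decisions much faster than teams can validate them, and there is no explicit mechanism for managing the epistemic status and temporal validity of those decisions, which risks trust inflation and stale assumptions going unnoticed. The key contribution is the First Principles Framework (FPF), built on three mechanisms: (1) epistemic layers that separate unverified hypotheses from empirically validated claims; (2) conservative assurance aggregation grounded in the Gödel t-norm, which prevents weak evidence from inflating confidence; and (3) automated evidence decay tracking that surfaces stale assumptions before they cause failures. The framework grounds its aggregation semantics in fuzzy logic and defines a quintet of invariants that any valid aggregation operator must satisfy. A retrospective audit of two internal projects found stale evidence in 20-25% of architectural decisions within two months, which the authors take as support for temporal accountability.
Link: https://arxiv.org/abs/2601.21116
Authors: Sankalp Gilda, Shlok Gilda
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 18 pages, 6 figures, 13 tables. Position paper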
Abstract:This position paper argues that AI-assisted software engineering requires explicit mechanisms for tracking the epistemic status and temporal validity of architectural decisions. LLM coding assistants generate decisions faster than teams can validate them, yet no widely-adopted framework distinguishes conjecture from verified knowledge, prevents trust inflation through conservative aggregation, or detects when evidence expires. We propose three requirements for responsible AI-assisted engineering: (1) epistemic layers that separate unverified hypotheses from empirically validated claims, (2) conservative assurance aggregation grounded in the Gödel t-norm that prevents weak evidence from inflating confidence, and (3) automated evidence decay tracking that surfaces stale assumptions before they cause failures. We formalize these requirements as the First Principles Framework (FPF), ground its aggregation semantics in fuzzy logic, and define a quintet of invariants that any valid aggregation operator must satisfy. Our retrospective audit applying FPF criteria to two internal projects found that 20-25% of architectural decisions had stale evidence within two months, validating the need for temporal accountability. We outline research directions including learnable aggregation operators, federated evidence sharing, and SMT-based claim validation.
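The Gödel t-norm is the minimum operator, so a conjunction of claims is only as well supported as its weakest claim. The sketch below contrasts it with naive averaging and adds a simple staleness check. The `Evidence` record, its fields, the 60-day validity window, and the example claims are hypothetical illustrations, not the paper's FPF data model.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Evidence:
    """Hypothetical record of one piece of evidence behind an architectural decision."""
    claim: str
    confidence: float      # degree of support in [0, 1]
    validated_on: date
    valid_for_days: int    # assumed validity window

    def is_stale(self, today):
        return today > self.validated_on + timedelta(days=self.valid_for_days)


def godel_tnorm(confidences):
    """Goedel t-norm: aggregated assurance equals the weakest supporting confidence."""
    return min(confidences)


def mean_confidence(confidences):
    """Naive averaging, shown only for contrast."""
    return sum(confidences) / len(confidences)


today = date(2026, 1, 30)
evidence = [
    Evidence("load test passed at target throughput", 0.90, date(2026, 1, 10), 60),
    Evidence("vendor API has no rate limit", 0.40, date(2025, 11, 1), 60),
    Evidence("schema migration is reversible", 0.95, date(2026, 1, 20), 60),
]

confidences = [e.confidence for e in evidence]
assurance = godel_tnorm(confidences)   # bounded by the weakest claim
naive = mean_confidence(confidences)   # hides the weak claim
stale = [e.claim for e in evidence if e.is_stale(today)]
print(assurance, naive, stale)
```

Averaging reports 0.75 and masks the weak, outdated vendor claim, whereas the minimum keeps overall assurance at 0.40 until that claim is revalidated.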
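[AI-154] Planner-Auditor Twin: Agentic Discharge Planning with FHIR-Based LLM Planning, Guideline Recall, Optional Caching and Self-Improvement
[Quick Read]: This paper addresses hallucination, omissions, and miscalibrated confidence when large language models (LLMs) are applied to clinical discharge planning, problems that limit their safety and reliability in high-stakes medical settings. The key to the solution is a self-improving, cache-optional Planner-Auditor framework. It decouples the generative LLM planning module (the Planner) from a deterministic validation module (the Auditor), so that structured discharge plans are generated and then systematically evaluated for multi-task coverage, calibration (Brier score and ECE proxies), and action-distribution drift. It also adds two-tier self-improvement, namely within-episode regeneration and cross-episode discrepancy buffering with replay, which raises task coverage from 32% to 86% and corrects persistent high-confidence omissions, enabling safer iteration without model retraining.
Link: https://arxiv.org/abs/2601.21113
Authors: Kaiyuan Wu, Aditya Nagori, Rishikesan Kamaleswaran
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: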
Abstract:Objective: Large language models (LLMs) show promise for clinical discharge planning, but their use is constrained by hallucination, omissions, and miscalibrated confidence. We introduce a self-improving, cache-optional Planner-Auditor framework that improves safety and reliability by decoupling generation from deterministic validation and targeted replay. Materials and Methods: We implemented an agentic, retrospective, FHIR-native evaluation pipeline using MIMIC-IV-on-FHIR. For each patient, the Planner (LLM) generates a structured discharge action plan with an explicit confidence estimate. The Auditor is a deterministic module that evaluates multi-task coverage, tracks calibration (Brier score, ECE proxies), and monitors action-distribution drift. The framework supports two-tier self-improvement: (i) within-episode regeneration when enabled, and (ii) cross-episode discrepancy buffering with replay for high-confidence, low-coverage cases. Results: While context caching improved performance over baseline, the self-improvement loop was the primary driver of gains, increasing task coverage from 32% to 86%. Calibration improved substantially, with reduced Brier/ECE and fewer high-confidence misses. Discrepancy buffering further corrected persistent high-confidence omissions during replay. Discussion: Feedback-driven regeneration and targeted replay act as effective control mechanisms to reduce omissions and improve confidence reliability in structured clinical planning. Separating an LLM Planner from a rule-based, observational Auditor enables systematic reliability measurement and safer iteration without model retraining. Conclusion: The Planner-Auditor framework offers a practical pathway toward safer automated discharge planning using interoperable FHIR data access and deterministic auditing, supported by reproducible ablations and reliability-focused evaluation. 
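The calibration metrics the Auditor tracks can be sketched as follows. The abstract mentions the Brier score and "ECE proxies" without defining the proxy, so the equal-width binning below is the textbook expected calibration error and is an assumption, not the paper's exact metric. Function names are illustrative.

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(outcomes)


def expected_calibration_error(confidences, outcomes, n_bins=5):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)  # place c == 1.0 in the last bin
        bins[idx].append((c, o))
    total = len(outcomes)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece
```

Here `outcomes` would be 1 when the plan covered the required task and 0 otherwise; a high-confidence miss is a pair with large `c` and `o == 0`, which both metrics penalize.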
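[AI-155] How does information access affect LLM monitors' ability to detect sabotage?
[Quick Read]: This paper addresses the difficulty of monitoring generative AI agents for misaligned behavior (such as deception, exploiting reward hacks, or pursuing hidden objectives) while they carry out tasks. The usual approach uses a large language model (LLM) as the monitor, but the study finds that giving the monitor less information can in some cases improve detection, a phenomenon the authors call the less-is-more effect. The key to the solution is a new hierarchical monitoring framework, extract-and-evaluate (EaE): in the first stage one monitor extracts relevant excerpts from the agent's trajectory, and in the second stage a separate monitor scores those excerpts. On BigCodeBench-Sabotage with GPT-4.1-mini as the monitor, EaE improves sabotage detection by 16.8 percentage points over the next-best monitor without increasing the false positive rate, and monitoring techniques that filter information fare well in the cost-performance analysis.
Link: https://arxiv.org/abs/2601.21112
Authors: Rauno Arike, Raja Mehta Moreno, Rohan Subramani, Shubhorup Biswas, Francis Rhys Ward
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 62 pages, 34 figures, 7 tables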
Abstract:Frontier language model agents can exhibit misaligned behaviors, including deception, exploiting reward hacks, and pursuing hidden objectives. To control potentially misaligned agents, we can use LLMs themselves to monitor for misbehavior. In this paper, we study how information access affects LLM monitor performance. While one might expect that monitors perform better when they have access to more of the monitored agents’ reasoning and actions, we find that contemporary systems often perform better with less information, a phenomenon we call the less-is-more effect for automated oversight. We demonstrate this phenomenon, and analyze the conditions under which it occurs, in three evaluation environments where agents must conduct sabotage while evading monitors. Motivated by the less-is-more effect, we introduce extract-and-evaluate (EaE) monitoring–a new hierarchical approach where one monitor isolates relevant excerpts from the monitored agent’s trajectory and a separate monitor scores them. In BigCodeBench-Sabotage with GPT-4.1-mini as the monitor model, EaE improves sabotage detection rates by 16.8 percentage points over the next-best monitor without increasing the false positive rate. In other settings, EaE either outperforms or is competitive with baselines. In addition, we find that agents unaware of being monitored can be caught much more easily and that monitors scoring longer reasoning traces gain more from information filtering. Lastly, we conduct a cost-performance analysis and find that Gemini and Claude models with monitoring techniques that involve information filtering occupy much of the Pareto frontier.
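The two-stage structure of extract-and-evaluate monitoring can be sketched as a small pipeline. In the paper both stages are LLM monitors; the keyword rules below are toy stand-ins, and the function names and threshold are assumptions made for illustration.

```python
from typing import Callable, List


def extract_and_evaluate(
    trajectory: List[str],
    extractor: Callable[[List[str]], List[str]],
    evaluator: Callable[[List[str]], float],
    threshold: float = 0.5,
) -> bool:
    """Flag a trajectory if the evaluator's score on the excerpts exceeds threshold."""
    excerpts = extractor(trajectory)  # stage 1: isolate relevant excerpts
    score = evaluator(excerpts)       # stage 2: a separate monitor scores only those
    return score > threshold


# Toy stand-ins for the two LLM monitors (purely illustrative rules).
def keyword_extractor(steps):
    return [s for s in steps if "rm -rf" in s or "curl" in s]


def presence_evaluator(excerpts):
    return 1.0 if excerpts else 0.0
```

The point of the split is that the second monitor never sees the full trajectory, only the filtered excerpts, which is how the design exploits the less-is-more effect.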
[AI-156] Magellan: Autonomous Discovery of Novel Compiler Optimization Heuristics with AlphaEvolve
[Quick Read]: This paper addresses the poor adaptability and high maintenance cost of the hand-crafted heuristics that modern compilers rely on to guide optimization passes. The key to the solution is Magellan, an agentic framework that couples an LLM coding agent with evolutionary search and autotuning in a closed loop of generation, evaluation, and refinement. It synthesizes executable C++ decision logic that integrates directly into existing compilers, so that the compiler pass itself is evolved automatically.
Link: https://arxiv.org/abs/2601.21096
Authors: Hongzheng Chen, Alexander Novikov, Ngân Vũ, Hanna Alam, Zhiru Zhang, Aiden Grossman, Mircea Trofin, Amir Yazdanbakhsh
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments: Accepted to C4ML@CGO’26
Abstract:Modern compilers rely on hand-crafted heuristics to guide optimization passes. These human-designed rules often struggle to adapt to the complexity of modern software and hardware and lead to high maintenance burden. To address this challenge, we present Magellan, an agentic framework that evolves the compiler pass itself by synthesizing executable C++ decision logic. Magellan couples an LLM coding agent with evolutionary search and autotuning in a closed loop of generation, evaluation on user-provided macro-benchmarks, and refinement, producing compact heuristics that integrate directly into existing compilers. Across several production optimization tasks, Magellan discovers policies that match or surpass expert baselines. In LLVM function inlining, Magellan synthesizes new heuristics that outperform decades of manual engineering for both binary-size reduction and end-to-end performance. In register allocation, it learns a concise priority rule for live-range processing that matches intricate human-designed policies on a large-scale workload. We also report preliminary results on XLA problems, demonstrating portability beyond LLVM with reduced engineering effort.
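[AI-157] Responsible AI: The Good, The Bad, The AI
[Quick Read]: This paper addresses fragmentation in current research on responsible AI: existing literature tends either to take an optimistic stance that emphasizes value creation or to focus over-cautiously on potential harms, without systematically integrating AI's dual nature. The key to the solution is a governance framework grounded in paradox theory, Paradox-based Responsible AI Governance (PRAIG). It reconceptualizes responsible AI governance as the dynamic management of paradoxical tensions between value creation and risk mitigation, offers formal propositions arguing that trade-off approaches amplify rather than resolve those tensions, and develops a taxonomy of paradox management strategies with specified contingency conditions. The result is a governance path that supports innovation while keeping risk under control.
Link: https://arxiv.org/abs/2601.21095
Authors: Akbar Anbar Jafari, Cagri Ozcinar, Gholamreza Anbarjafari
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 14 pages, 5 figures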
Abstract:The rapid proliferation of artificial intelligence across organizational contexts has generated profound strategic opportunities while introducing significant ethical and operational risks. Despite growing scholarly attention to responsible AI, extant literature remains fragmented, often adopting either an optimistic stance emphasizing value creation or an excessively cautious perspective fixated on potential harms. This paper addresses this gap by presenting a comprehensive examination of AI’s dual nature through the lens of strategic information systems. Drawing upon a systematic synthesis of the responsible AI literature and grounded in paradox theory, we develop the Paradox-based Responsible AI Governance (PRAIG) framework that articulates: (1) the strategic benefits of AI adoption, (2) the inherent risks and unintended consequences, and (3) governance mechanisms that enable organizations to navigate these tensions. Our framework advances theoretical understanding by conceptualizing responsible AI governance as the dynamic management of paradoxical tensions between value creation and risk mitigation. We provide formal propositions demonstrating that trade-off approaches amplify rather than resolve these tensions, and we develop a taxonomy of paradox management strategies with specified contingency conditions. For practitioners, we offer actionable guidance for developing governance structures that neither stifle innovation nor expose organizations to unacceptable risks. The paper concludes with a research agenda for advancing responsible AI governance scholarship.
[AI-158] Safety Generalization Under Distribution Shift in Safe Reinforcement Learning: A Diabetes Testbed
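[Quick Read]: This paper asks whether safe reinforcement learning (safe RL) algorithms that satisfy safety constraints during training remain safe at deployment under distribution shift. It finds that policies that meet safety requirements during training frequently violate them on unseen patient populations, i.e., there is a safety generalization gap. The key to the solution is test-time shielding, which uses learned dynamics models to filter out unsafe actions. Experiments across eight safe RL algorithms, three diabetes types, and three age groups show that shielding yields Time-in-Range gains of 13-14% for strong baselines such as PPO-Lag and CPO while reducing the clinical risk index and glucose variability, so that safety generalizes much better.
Link: https://arxiv.org/abs/2601.21094
Authors: Minjae Kwon, Josephine Lamp, Lu Feng
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: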
Abstract:Safe Reinforcement Learning (RL) algorithms are typically evaluated under fixed training conditions. We investigate whether training-time safety guarantees transfer to deployment under distribution shift, using diabetes management as a safety-critical testbed. We benchmark safe RL algorithms on a unified clinical simulator and reveal a safety generalization gap: policies satisfying constraints during training frequently violate safety requirements on unseen patients. We demonstrate that test-time shielding, which filters unsafe actions using learned dynamics models, effectively restores safety across algorithms and patient populations. Across eight safe RL algorithms, three diabetes types, and three age groups, shielding achieves Time-in-Range gains of 13–14% for strong baselines such as PPO-Lag and CPO while reducing clinical risk index and glucose variability. Our simulator and benchmark provide a platform for studying safety under distribution shift in safety-critical control domains. Code is available at this https URL and this https URL.
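Test-time shielding, as described in the abstract, filters a policy's proposed action through a dynamics model before it is executed. The sketch below shows that control flow only. The one-line glucose model and its numbers are hypothetical and not clinically meaningful; the paper uses learned dynamics models, and the 70-180 mg/dL band is the conventional Time-in-Range target.

```python
def shield(state, proposed_action, fallback_actions, predict_next, is_safe):
    """Return the proposed action if its predicted next state is safe,
    otherwise the first safe fallback (or None if nothing is safe)."""
    for action in [proposed_action, *fallback_actions]:
        if is_safe(predict_next(state, action)):
            return action
    return None


# Toy dynamics (hypothetical): each insulin unit lowers glucose by 30 mg/dL.
def predict_next(glucose, insulin_units):
    return glucose - 30 * insulin_units


def is_safe(glucose):
    return 70 <= glucose <= 180
```

For example, with glucose at 120 mg/dL a proposed dose of 3 units is predicted to land at 30 mg/dL, so the shield falls back through smaller doses until one keeps the prediction in range.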
[AI-159] Deep Reinforcement Learning for Fault-Adaptive Routing in Eisenstein-Jacobi Interconnection Topologies
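[Quick Read]: This paper addresses degraded performance and insufficient fault tolerance of interconnection networks in high-density many-core architectures. The core challenge is that, although Eisenstein-Jacobi (EJ) networks have an attractive symmetric 6-regular topology, traditional deterministic routing strategies (such as Greedy Adaptive Routing) tend to fail at local dead-ends when faults are present. The key to the solution is an adaptive routing mechanism based on reinforcement learning (RL): a multi-objective reward function penalizes proximity to faults and rewards path efficiency, so the RL agent learns to route around clustered failures. The agent reaches 94% effective reachability and 91% packet delivery and sustains over 90% normalized throughput across loads, outperforming Dijkstra's algorithm under congestion. The approach can be deployed in a distributed fashion without global topology knowledge, balancing the efficiency of greedy routing against the optimality of Dijkstra's algorithm and providing robust, self-healing communication for fault-prone interconnection networks.
Link: https://arxiv.org/abs/2601.21090
Authors: Mohammad Walid Charrwi, Zaid Hussain
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: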
Abstract:The increasing density of many-core architectures necessitates interconnection networks that are both high-performance and fault-resilient. Eisenstein-Jacobi (EJ) networks, with their symmetric 6-regular topology, offer superior topological properties but challenge traditional routing heuristics under fault conditions. This paper evaluates three routing paradigms in faulty EJ environments: deterministic Greedy Adaptive Routing, theoretically optimal Dijkstra’s algorithm, and a reinforcement learning (RL)-based approach. Using a multi-objective reward function to penalize fault proximity and reward path efficiency, the RL agent learns to navigate around clustered failures that typically induce dead-ends in greedy geometric routing. Dijkstra’s algorithm establishes the theoretical performance ceiling by computing globally optimal paths with complete topology knowledge, revealing the true connectivity limits of faulty networks. Quantitative analysis at nine faulty nodes shows greedy routing catastrophically degrades to 10% effective reachability and packet delivery, while Dijkstra proves 52-54% represents the topological optimum. The RL agent achieves 94% effective reachability and 91% packet delivery, making it suitable for distributed deployment. Furthermore, throughput evaluations demonstrate that RL sustains over 90% normalized throughput across all loads, actually outperforming Dijkstra under congestion through implicit load balancing strategies. These results establish RL-based adaptive policies as a practical solution that bridges the gap between greedy’s efficiency and Dijkstra’s optimality, providing robust, self-healing communication in fault-prone interconnection networks without requiring the global topology knowledge or computational overhead of optimal algorithms.
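The dead-end behavior of greedy geometric routing can be reproduced on a small 6-regular lattice. The sketch below uses a flat hexagonal grid in axial coordinates as a simplified stand-in for an EJ network (which is a wrapped, modular topology), a strict-decrease greedy rule that may differ from the paper's Greedy Adaptive Routing, and breadth-first search, which matches Dijkstra's algorithm on unit-weight links. All names are illustrative.

```python
from collections import deque

DIRECTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]


def neighbors(node):
    q, r = node
    return [(q + dq, r + dr) for dq, dr in DIRECTIONS]


def hex_distance(a, b):
    dq, dr = a[0] - b[0], a[1] - b[1]
    return (abs(dq) + abs(dr) + abs(dq + dr)) // 2


def greedy_route(src, dst, faulty):
    """Always step to a healthy neighbor strictly closer to dst; None on a dead-end."""
    path, node = [src], src
    while node != dst:
        options = [n for n in neighbors(node)
                   if n not in faulty and hex_distance(n, dst) < hex_distance(node, dst)]
        if not options:
            return None
        node = min(options, key=lambda n: hex_distance(n, dst))
        path.append(node)
    return path


def shortest_route(src, dst, faulty, radius=6):
    """BFS over healthy nodes within `radius` of the origin (global knowledge)."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for n in neighbors(node):
            if n in parent or n in faulty or hex_distance((0, 0), n) > radius:
                continue
            parent[n] = node
            queue.append(n)
    return None
```

With a single fault at `(1, 0)`, greedy routing from `(0, 0)` to `(3, 0)` has no closer healthy neighbor and gives up, while the global search finds a four-hop detour. A learned policy aims for the detour without needing the global view.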
[AI-160] OpenSec: Measuring Incident Response Agent Calibration Under Adversarial Evidence
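[Quick Read]: This paper addresses over-triggering by defensive incident response (IR) agents, caused by poor calibration, as attacks driven by generative AI become more capable. Existing benchmarks conflate action execution with correct execution, which hides agents' inability to judge real threats accurately under adversarial evidence. The key to the solution is OpenSec, a dual-control reinforcement learning environment that quantifies IR agent behavior in realistic prompt injection scenarios using execution-based metrics: time-to-first-containment (TTFC), blast radius (false positives per episode), and injection violation rates. These metrics pinpoint calibration failures such as over-triggering. Experiments show that frontier models generally exhibit high false positive rates in this environment, which demonstrates that OpenSec surfaces a failure mode hidden by aggregate success metrics.
Link: https://arxiv.org/abs/2601.21083
Authors: Jarrod Barnes
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 6 pages, 2 figures, 3 tables. Code: this https URL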
Abstract:As large language models improve, so do their offensive applications: frontier agents now generate working exploits for under $50 in compute (Heelan, 2026). Defensive incident response (IR) agents must keep pace, but existing benchmarks conflate action execution with correct execution, hiding calibration failures when agents process adversarial evidence. We introduce OpenSec, a dual-control reinforcement learning environment that evaluates IR agents under realistic prompt injection scenarios. Unlike static capability benchmarks, OpenSec scores world-state-changing containment actions under adversarial evidence via execution-based metrics: time-to-first-containment (TTFC), blast radius (false positives per episode), and injection violation rates. Evaluating four frontier models on 40 standard-tier episodes, we find consistent over-triggering in this setting: GPT-5.2, Gemini 3, and DeepSeek execute containment in 100% of episodes with 90-97% false positive rates. Claude Sonnet 4.5 shows partial calibration (85% containment, 72% FP), demonstrating that OpenSec surfaces a calibration failure mode hidden by aggregate success metrics. Code available at this https URL.
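Two of the execution-based metrics named in the abstract can be computed from an episode's action log. The log schema below (tuples of step, action kind, target) and the function name are assumptions made for illustration; the abstract does not specify OpenSec's actual log format or scoring code.

```python
def episode_metrics(actions, true_targets):
    """actions: list of (step, kind, target) tuples in execution order.
    Returns (time_to_first_containment, false_positive_count)."""
    containments = [(step, tgt) for step, kind, tgt in actions if kind == "contain"]
    ttfc = containments[0][0] if containments else None
    # Blast radius: containment actions aimed at targets that were not compromised.
    false_positives = sum(1 for _, tgt in containments if tgt not in true_targets)
    return ttfc, false_positives


# Illustrative episode: the agent isolates a clean host before the real one.
log = [(1, "query", "host-a"), (2, "contain", "host-b"), (4, "contain", "host-a")]
ttfc, blast_radius = episode_metrics(log, true_targets={"host-a"})
```

An agent that always contains something scores well on "did it act" but poorly here, which is the distinction between action execution and correct execution that the abstract draws.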
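[AI-161] LOCUS: Low-Dimensional Model Embeddings for Efficient Model Exploration, Comparison, and Selection
[Quick Read]: This paper addresses how to manage and use a large, dynamic pool of models efficiently as the ecosystem of large language models (LLMs) grows more complex. The core challenge is the lack of a lightweight, scalable mechanism that can be updated without retraining to characterize model capabilities and support downstream tasks such as routing, comparison, and selection. The key to the solution is LOCUS, an attention-based method that produces low-dimensional vector embeddings through a deterministic forward pass over query encodings and evaluation scores, compactly representing a model's capabilities across queries. New models can be incorporated and existing embeddings refined without any retraining, and an additional correctness predictor achieves state-of-the-art routing accuracy on unseen queries. Experiments show that LOCUS needs up to 4.8x fewer query evaluation samples than baselines to produce robust and informative embeddings, and that the learned embedding space is geometrically meaningful, supporting a range of downstream applications.
Link: https://arxiv.org/abs/2601.21082
Authors: Shivam Patel, William Cocke, Gauri Joshi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: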
Abstract:The rapidly growing ecosystem of Large Language Models (LLMs) makes it increasingly challenging to manage and utilize the vast and dynamic pool of models effectively. We propose LOCUS, a method that produces low-dimensional vector embeddings that compactly represent a language model’s capabilities across queries. LOCUS is an attention-based approach that generates embeddings by a deterministic forward pass over query encodings and evaluation scores via an encoder model, enabling seamless incorporation of new models to the pool and refinement of existing model embeddings without having to perform any retraining. We additionally train a correctness predictor that uses model embeddings and query encodings to achieve state-of-the-art routing accuracy on unseen queries. Experiments show that LOCUS needs up to 4.8x fewer query evaluation samples than baselines to produce informative and robust embeddings. Moreover, the learned embedding space is geometrically meaningful: proximity reflects model similarity, enabling a range of downstream applications including model comparison and clustering, model portfolio selection, and resilient proxies of unavailable models.
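The claim that proximity in the embedding space reflects model similarity suggests a simple way to pick a proxy for an unavailable model. The sketch below assumes cosine similarity as the proximity measure and uses made-up two-dimensional embeddings and model names; the abstract does not state which distance LOCUS uses.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_proxy(target, embeddings, available):
    """Pick the available model whose embedding is closest to the target's."""
    return max(
        (name for name in available if name != target),
        key=lambda name: cosine(embeddings[target], embeddings[name]),
    )


# Hypothetical embeddings for three models.
embeddings = {"model-a": [1.0, 0.0], "model-b": [0.9, 0.1], "model-c": [0.0, 1.0]}
proxy = nearest_proxy("model-a", embeddings, available=["model-b", "model-c"])
```

The same similarity function could drive clustering or portfolio selection, the other downstream uses the abstract lists.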
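[AI-162] Multi-modal Imputation for Alzheimer's Disease Classification
[Quick Read]: This paper addresses missing diffusion-weighted imaging (DWI) scans in multimodal neuroimaging data, with the aim of improving accuracy on the 3-way Alzheimer's disease (AD) classification task (cognitively normal, mild cognitive impairment, and Alzheimer's disease). The key to the solution is a conditional denoising diffusion probabilistic model that generates the missing DWI scans from T1-weighted (T1) scans, completing incomplete multimodal data. The imputed data improve uni-modal and bi-modal deep learning models for AD classification on several metrics, particularly those sensitive to minority classes.
Link: https://arxiv.org/abs/2601.21076
Authors: Abhijith Shaji, Tamoghna Chattopadhyay, Sophia I. Thomopoulos, Greg Ver Steeg, Paul M. Thompson, Jose-Luis Ambite
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: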
Abstract:Deep learning has been successful in predicting neurodegenerative disorders, such as Alzheimer’s disease, from magnetic resonance imaging (MRI). Combining multiple imaging modalities, such as T1-weighted (T1) and diffusion-weighted imaging (DWI) scans, can increase diagnostic performance. However, complete multimodal datasets are not always available. We use a conditional denoising diffusion probabilistic model to impute missing DWI scans from T1 scans. We perform extensive experiments to evaluate whether such imputation improves the accuracy of uni-modal and bi-modal deep learning models for 3-way Alzheimer’s disease classification-cognitively normal, mild cognitive impairment, and Alzheimer’s disease. We observe improvements in several metrics, particularly those sensitive to minority classes, for several imputation configurations.
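[AI-163] Towards Comprehensive Benchmarking Infrastructure for LLMs In Software Engineering
[Quick Read]: This paper addresses the limitations of current benchmarks for evaluating large language models (LLMs) for code: narrow task coverage, reliance on single metrics, inconsistent data engineering, limited software engineering context, and widespread data contamination. As a result, existing benchmarks cannot fully measure robustness, interpretability, fairness, efficiency, and real-world usability. The key to the solution is BEHELM, a holistic benchmarking infrastructure that unifies rich software-scenario specification with multi-metric evaluation. It covers tasks, programming languages, input and output granularities, and key quality dimensions in a structured way, reducing the overhead of building benchmarks and enabling a fair, realistic, and future-proof assessment of LLMs.
Link: https://arxiv.org/abs/2601.21070
Authors: Daniel Rodriguez-Cardenas, Xiaochang Li, Marcos Macedo, Antonio Mastropaolo, Dipin Khati, Yuan Tian, Huajie Shao, Denys Poshyvanyk
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Short paper from benchmarking for software engineering workshop FSE2025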
Abstract:Large language models for code are advancing fast, yet our ability to evaluate them lags behind. Current benchmarks focus on narrow tasks and single metrics, which hide critical gaps in robustness, interpretability, fairness, efficiency, and real-world usability. They also suffer from inconsistent data engineering practices, limited software engineering context, and widespread contamination issues. To understand these problems and chart a path forward, we combined an in-depth survey of existing benchmarks with insights gathered from a dedicated community workshop. We identified three core barriers to reliable evaluation: the absence of software-engineering-rich datasets, overreliance on ML-centric metrics, and the lack of standardized, reproducible data pipelines. Building on these findings, we introduce BEHELM, a holistic benchmarking infrastructure that unifies software-scenario specification with multi-metric evaluation. BEHELM provides a structured way to assess models across tasks, languages, input and output granularities, and key quality dimensions. Our goal is to reduce the overhead currently required to construct benchmarks while enabling a fair, realistic, and future-proof assessment of LLMs in software engineering.
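[AI-164] Textual Equilibrium Propagation for Deep Compound AI Systems ICLR2026
[Quick Read]: This paper addresses performance degradation in long-horizon agentic workflows as system depth grows, which shows up as two depth-scaling failure modes. The first is exploding textual gradient: textual feedback grows exponentially with depth, producing prohibitively long messages and amplifying evaluation biases. The second is vanishing textual gradient: limited long-context ability causes models to overemphasize partial feedback, and compressing lengthy feedback makes it lose specificity as it propagates over many hops. The key to the solution is Textual Equilibrium Propagation (TEP), a two-phase local learning scheme. In the free phase, local LLM critics iteratively refine prompts until equilibrium is reached (no further improvements are suggested). In the nudged phase, proximal prompt edits with bounded modification intensity are applied, and task-level objectives propagate via forward signaling rather than backward feedback chains. This combines local prompt optimization with controlled adaptation toward global goals and avoids the computational burden and signal degradation of global textual backpropagation.
Link: https://arxiv.org/abs/2601.21064
Authors: Minghui Chen, Wenlong Deng, James Zou, Han Yu, Xiaoxiao Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to ICLR 2026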
Abstract:Large language models (LLMs) are increasingly deployed as part of compound AI systems that coordinate multiple modules (e.g., retrievers, tools, verifiers) over long-horizon workflows. Recent approaches that propagate textual feedback globally (e.g., TextGrad) make it feasible to optimize such pipelines, but we find that performance degrades as system depth grows. In particular, long-horizon agentic workflows exhibit two depth-scaling failure modes: 1) exploding textual gradient, where textual feedback grows exponentially with depth, leading to prohibitively long messages and amplifying evaluation biases; and 2) vanishing textual gradient, where limited long-context ability causes models to overemphasize partial feedback and compression of lengthy feedback causes downstream messages to lose specificity gradually as they propagate many hops upstream. To mitigate these issues, we introduce Textual Equilibrium Propagation (TEP), a local learning principle inspired by Equilibrium Propagation in energy-based models. TEP includes two phases: 1) a free phase where local LLM critics iteratively refine prompts until reaching equilibrium (no further improvements are suggested); and 2) a nudged phase which applies proximal prompt edits with bounded modification intensity, using task-level objectives that propagate via forward signaling rather than backward feedback chains. This design supports local prompt optimization followed by controlled adaptation toward global goals without the computational burden and signal degradation of global textual backpropagation. Across long-horizon QA benchmarks and a multi-agent tool-use dataset, TEP consistently improves accuracy and efficiency over global propagation methods such as TextGrad. The gains grow with depth, while preserving the practicality of black-box LLM components in deep compound AI systems.
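The two TEP phases can be sketched as a fixed-point loop followed by a bounded edit. In the paper the critic and the nudged edits come from LLMs; here the critic is a toy function, and measuring "modification intensity" with `difflib.SequenceMatcher` is an assumption chosen for illustration, not the authors' definition.

```python
import difflib


def free_phase(prompt, critic, max_rounds=10):
    """Refine a prompt until the local critic proposes no further change."""
    for _ in range(max_rounds):
        suggestion = critic(prompt)
        if suggestion is None or suggestion == prompt:
            break  # equilibrium: no further improvements are suggested
        prompt = suggestion
    return prompt


def nudged_edit(prompt, candidate, max_change=0.3):
    """Accept a candidate edit only if it stays within the change budget."""
    similarity = difflib.SequenceMatcher(None, prompt, candidate).ratio()
    return candidate if (1.0 - similarity) <= max_change else prompt


# Toy critic (hypothetical): insists that the prompt ends with a brevity hint.
def brevity_critic(prompt):
    hint = " Be concise."
    return prompt if prompt.endswith(hint) else prompt + hint
```

The change budget in `nudged_edit` plays the role of the proximal constraint: a task-level signal may adjust a locally optimized prompt, but it cannot overwrite it wholesale.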
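[AI-165] Llama-3.1-FoundationAI-SecurityLLM-Reasoning-8B Technical Report
[Quick Read]: This paper addresses the lack of dedicated reasoning capability for cybersecurity in current generative AI: general-purpose large models fall short on complex security analysis tasks and struggle to meet the precision and interpretability that professional settings require. The key to the solution is Foundation-Sec-8B-Reasoning, the first open-source native reasoning model for cybersecurity. It builds on the previously released Foundation-Sec-8B base model (derived from Llama-3.1-8B-Base) and is trained in two stages: supervised fine-tuning (SFT) to improve instruction following and basic reasoning, followed by reinforcement learning from verifiable rewards (RLVR) to strengthen logical rigor and multi-hop reasoning. Training uses proprietary reasoning data spanning cybersecurity analysis, instruction-following, and mathematical reasoning. Across 10 cybersecurity benchmarks the model is competitive with significantly larger models while maintaining strong general capabilities.
Link: https://arxiv.org/abs/2601.21051
Authors: Zhuoran Yang, Ed Li, Jianliang He, Aman Priyanshu, Baturay Saglam, Paul Kassianik, Sajana Weerawardhena, Anu Vellore, Blaine Nelson, Neusha Javidnia, Arthur Goldblatt, Fraser Burch, Avi Zohary, Assaf Eisenman, Mahdi Sabbaghi, Supriti Vijay, Rahim Dharssi, Dhruv Kedia, Kojin Oshiba, Yaron Singer, Amin Karbasi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 31 pages, 5 figures, 7 tables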
Abstract:We present Foundation-Sec-8B-Reasoning, the first open-source native reasoning model for cybersecurity. Built upon our previously released Foundation-Sec-8B base model (derived from Llama-3.1-8B-Base), the model is trained through a two-stage process combining supervised fine-tuning (SFT) and reinforcement learning from verifiable rewards (RLVR). Our training leverages proprietary reasoning data spanning cybersecurity analysis, instruction-following, and mathematical reasoning. Evaluation across 10 cybersecurity benchmarks and 10 general-purpose benchmarks demonstrates performance competitive with significantly larger models on cybersecurity tasks while maintaining strong general capabilities. The model shows effective generalization on multi-hop reasoning tasks and strong safety performance when deployed with appropriate system prompts and guardrails. This work demonstrates that domain-specialized reasoning models can achieve strong performance on specialized tasks while maintaining broad general capabilities. We release the model publicly at this https URL.
[AI-166] SMKC: Sketch Based Kernel Correlation Images for Variable Cardinality Time Series Anomaly Detection
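[Quick Read]: This paper addresses the instability of data structure in multivariate time series anomaly detection caused by sensor churn (signals appearing, disappearing, or being renamed). In real operating environments the set of input variables can change over time, so data dimensionality differs between training and inference and values unseen during training may appear. The key to the solution is the SMKC framework, which decouples the dynamic input structure from the anomaly detector. It first uses permutation-invariant feature hashing to sketch raw inputs into a fixed-size state sequence, then builds a hybrid kernel image that captures global temporal structure through pairwise comparisons of the sequence and its derivatives, and finally learns normal patterns with masked reconstruction and a teacher-student prediction objective. Experiments show that the representation is itself strongly discriminative: a detector using random projections and nearest neighbors performs competitively with fully trained baselines without gradient updates, offering a practical cold-start option for resource-constrained deployments.
Link: https://arxiv.org/abs/2601.21050
Authors: Haokun Zhou
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: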
Abstract:Conventional anomaly detection in multivariate time series relies on the assumption that the set of observed variables remains static. In operational environments, however, monitoring systems frequently experience sensor churn. Signals may appear, disappear, or be renamed, creating data windows where the cardinality varies and may include values unseen during training. To address this challenge, we propose SMKC, a framework that decouples the dynamic input structure from the anomaly detector. We first employ permutation-invariant feature hashing to sketch raw inputs into a fixed size state sequence. We then construct a hybrid kernel image to capture global temporal structure through pairwise comparisons of the sequence and its derivatives. The model learns normal patterns using masked reconstruction and a teacher-student prediction objective. Our evaluation reveals that robust log-distance channels provide the primary discriminative signal, whereas cosine representations often fail to capture sufficient contrast. Notably, we find that a detector using random projections and nearest neighbors on the SMKC representation performs competitively with fully trained baselines without requiring gradient updates. This highlights the effectiveness of the representation itself and offers a practical cold-start solution for resource-constrained deployments.
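Permutation-invariant feature hashing, the first step described in the abstract, can be sketched with the standard signed hashing trick. The bucket count, the use of MD5 for a stable hash, and the sign bit are assumptions made for illustration; the abstract does not specify SMKC's hashing scheme.

```python
import hashlib


def sketch(readings, n_buckets=8):
    """Map a {sensor_name: value} dict to a fixed-length vector.
    Summing into buckets makes the result independent of sensor order."""
    state = [0.0] * n_buckets
    for name, value in readings.items():
        digest = hashlib.md5(name.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % n_buckets
        sign = 1.0 if digest[4] % 2 == 0 else -1.0  # signed hashing limits collision bias
        state[bucket] += sign * value
    return state
```

Because the output length depends only on `n_buckets`, a window with three sensors and a window with thirty both produce vectors the downstream detector can consume, and a renamed sensor simply lands in a different bucket.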
[AI-167] QUARK: Robust Retrieval under Non-Faithful Queries via Query-Anchored Aggregation
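[Quick Read]: This paper addresses retrieval failures in real-world settings caused by non-faithful user queries (noisy, incomplete, or distorted), where retriever performance drops because key semantics are missing from the query. The core of the solution is the QUARK framework, which explicitly models query uncertainty through recovery hypotheses, i.e., multiple plausible interpretations of the latent intent inferred from the observed non-faithful query, and introduces query-anchored aggregation to combine their signals. The original query serves as a semantic anchor while the recovery hypotheses provide controlled auxiliary evidence, which prevents semantic drift and hypothesis hijacking. This design lets QUARK improve Recall, mean reciprocal rank (MRR), and normalized discounted cumulative gain (nDCG) without sacrificing robustness, and it is robust to the number of recovery hypotheses.
Link: https://arxiv.org/abs/2601.21049
Authors: Rita Qiuran Lyu, Michelle Manqiao Wang, Lei Shi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 figures, 6 tables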
Abstract:User queries in real-world retrieval are often non-faithful (noisy, incomplete, or distorted), causing retrievers to fail when key semantics are missing. We formalize this as retrieval under recall noise, where the observed query is drawn from a noisy recall process of a latent target item. To address this, we propose QUARK, a simple yet effective training-free framework for robust retrieval under non-faithful queries. QUARK explicitly models query uncertainty through recovery hypotheses, i.e., multiple plausible interpretations of the latent intent given the observed query, and introduces query-anchored aggregation to combine their signals robustly. The original query serves as a semantic anchor, while recovery hypotheses provide controlled auxiliary evidence, preventing semantic drift and hypothesis hijacking. This design enables QUARK to improve recall and ranking quality without sacrificing robustness, even when some hypotheses are noisy or uninformative. Across controlled simulations and BEIR benchmarks (FIQA, SciFact, NFCorpus) with both sparse and dense retrievers, QUARK improves Recall, MRR, and nDCG over the base retriever. Ablations show QUARK is robust to the number of recovery hypotheses and that anchored aggregation outperforms unanchored max/mean/median pooling. These results demonstrate that modeling query uncertainty through recovery hypotheses, coupled with principled anchored aggregation, is essential for robust retrieval under non-faithful queries.
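One way to read "query-anchored aggregation" is as a weighted blend in which the original query's score keeps a fixed share and the hypotheses share the rest. The abstract does not give QUARK's actual aggregation rule, so the formula, the `anchor_weight` parameter, and the mean over hypotheses below are all assumptions for illustration.

```python
def anchored_scores(query_scores, hypothesis_scores, anchor_weight=0.6):
    """Blend the original query's score with the mean hypothesis score per document.

    query_scores: {doc_id: score} from the original query.
    hypothesis_scores: list of {doc_id: score}, one dict per recovery hypothesis.
    """
    docs = set(query_scores)
    for scores in hypothesis_scores:
        docs |= set(scores)
    blended = {}
    for doc in docs:
        anchor = query_scores.get(doc, 0.0)
        if hypothesis_scores:
            aux = sum(s.get(doc, 0.0) for s in hypothesis_scores) / len(hypothesis_scores)
        else:
            aux = 0.0
        blended[doc] = anchor_weight * anchor + (1 - anchor_weight) * aux
    return blended


# Illustrative scores: one hypothesis strongly favors d3, which the query never matched.
query = {"d1": 0.5, "d2": 0.4}
hypotheses = [{"d2": 0.9}, {"d2": 0.8, "d3": 1.0}]
ranking = sorted(anchored_scores(query, hypotheses).items(), key=lambda kv: -kv[1])
```

In this toy case `d2` rises because both the query and the hypotheses support it, while `d3`, backed by a single hypothesis and no anchor score, stays at the bottom. Max pooling would instead have put `d3` first, which is the kind of hypothesis hijacking the anchor is meant to prevent.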
[AI-168] Log2Motion: Biomechanical Motion Synthesis from Touch Logs
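[Quick Read]: This paper addresses the fact that touch logs from mobile devices reveal little about users' actual interaction behavior, and asks how to recover biomechanically plausible user motion from logs that record only event sequences. The key to the solution is a reinforcement learning-driven musculoskeletal forward simulation: by integrating a software emulator into a physics simulator, biomechanical models can manipulate real applications in real time and generate physiologically plausible motion sequences consistent with the logged events, including estimates of speed, accuracy, and effort. The method, named Log2Motion, maps logs to interpretable motion behavior and offers a new way to understand the motor control and ergonomics underlying touch interaction.
Link: https://arxiv.org/abs/2601.21043
Authors: Michał Patryk Miazga, Hannah Bussmann, Antti Oulasvirta, Patrick Ebel
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: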
Abstract:Touch data from mobile devices are collected at scale but reveal little about the interactions that produce them. While biomechanical simulations can illuminate motor control processes, they have not yet been developed for touch interactions. To close this gap, we propose a novel computational problem: synthesizing plausible motion directly from logs. Our key insight is a reinforcement learning-driven musculoskeletal forward simulation that generates biomechanically plausible motion sequences consistent with events recorded in touch logs. We achieve this by integrating a software emulator into a physics simulator, allowing biomechanical models to manipulate real applications in real-time. Log2Motion produces rich syntheses of user movements from touch logs, including estimates of motion, speed, accuracy, and effort. We assess the plausibility of generated movements by comparing against human data from a motion capture study and prior findings, and demonstrate Log2Motion in a large-scale dataset. Biomechanical motion synthesis provides a new way to understand log data, illuminating the ergonomics and motor control underlying touch interactions.
[AI-169] SIGMA-PPG: Statistical-prior Informed Generative Masking Architecture for PPG Foundation Model
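[Quick read]: This paper addresses the intrinsic redundancy and noise of photoplethysmography (PPG) signals, under which standard masked modeling often yields trivial solutions and contrastive methods lack morphological precision. The key to the solution is the Statistical-prior Informed Generative Masking Architecture (SIGMA-PPG). A reinforcement learning-driven teacher uses statistical priors to construct challenging learning paths that prevent overfitting to noise. A semantic consistency constraint, imposed through vector quantization, ensures that physiologically identical waveforms (even those altered by recording artifacts or minor perturbations) map to the same codebook indices, which increases the semantic density of the codebook and eliminates redundant feature structures.
Link: https://arxiv.org/abs/2601.21031
Authors: Zongheng Guo, Tao Chen, Yang Jiao, Yi Pan, Xiao Hu, Manuela Ferrario
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 31 pages, 9 figures, 14 tables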
Abstract:Current foundation models for photoplethysmography (PPG) signals are challenged by the intrinsic redundancy and noise of the signal. Standard masked modeling often yields trivial solutions while contrastive methods lack morphological precision. To address these limitations, we propose a Statistical-prior Informed Generative Masking Architecture (SIGMA-PPG), a generative foundation model featuring a Prior-Guided Adversarial Masking mechanism, where a reinforcement learning-driven teacher leverages statistical priors to create challenging learning paths that prevent overfitting to noise. We also incorporate a semantic consistency constraint via vector quantization to ensure that physiologically identical waveforms (even those altered by recording artifacts or minor perturbations) map to shared indices. This enhances codebook semantic density and eliminates redundant feature structures. Pre-trained on over 120,000 hours of data, SIGMA-PPG achieves superior average performance compared to five state-of-the-art baselines across 12 diverse downstream tasks. The code is available at this https URL.
[AI-170] “Unlimited Realm of Exploration and Experimentation”: Methods and Motivations of AI-Generated Sexual Content Creators
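[Quick read]: This paper addresses the lack of a systematic understanding of the creators of AI-generated sexual content (AIG-SC), their motivations, and their creative practices, with the aim of informing effective governance in this space. The key to the approach is a set of in-depth interviews with 28 AIG-SC creators, ranging from hobbyists to entrepreneurs to community moderators. The interviews reveal a wide spectrum of motivations, including sexual exploration, creative expression, technical experimentation, and, in a handful of cases, the creation of non-consensual intimate imagery (NCII). The result is a multi-faceted picture of the AIG-SC ecosystem that provides an empirical basis for policy making and technology ethics.
Link: https://arxiv.org/abs/2601.21028
Authors: Jaron Mink, Lucy Qin, Elissa M. Redmiles
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: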
Abstract:AI-generated media is radically changing the way content is both consumed and produced on the internet, and in no place is this potentially more visible than in sexual content. AI-generated sexual content (AIG-SC) is increasingly enabled by an ecosystem of individual AI developers, specialized third-party applications, and foundation model providers. AIG-SC raises a number of concerns from old debates about the line between pornography and obscenity, to newer debates about fair use and labor displacement (in this case, of sex workers), and spurred new regulations to curb the spread of non-consensual intimate imagery (NCII) created using the same technology used to create AIG-SC. However, despite the growing prevalence of AIG-SC, little is known about its creators, their motivations, and what types of content they produce. To inform effective governance in this space, we perform an in-depth study to understand what AIG-SC creators make, along with how and why they make it. Interviews of 28 AIG-SC creators, ranging from hobbyists to entrepreneurs to those who moderate communities of hundreds of thousands of other creators, reveal a wide spectrum of motivations, including sexual exploration, creative expression, technical experimentation, and in a handful of cases, the creation of NCII.
[AI-171] Conditional Denoising Model as a Physical Surrogate Model
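[Quick read]: This paper addresses the trade-off between data-fitting accuracy and physical consistency in surrogate modeling of complex physical systems. Conventional physics-constrained approaches usually treat physical laws as soft constraints in the loss function, which does not guarantee strict adherence to the governing equations, or rely on post-processing corrections that do not intrinsically learn the geometry of the solution manifold. The key to the solution is the Conditional Denoising Model (CDM): by training a network to restore clean states from noisy ones, the model learns a continuous vector field that points toward the valid solution subspace. A time-independent formulation turns inference into a deterministic fixed-point iteration that projects noisy approximations onto the equilibrium manifold. Experiments show that the CDM is more parameter- and data-efficient than physics-consistent baselines, and that the denoising objective acts as a strong implicit regularizer: although it never sees the governing equations, the model satisfies physical constraints more strictly than models trained with explicit physics losses.
Link: https://arxiv.org/abs/2601.21021
Authors: José Afonso, Pedro Viegas, Rodrigo Ventura, Vasco Guerra
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Plasma Physics (physics.plasm-ph)
Comments: 15 pages, 2 figures, 2 tables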
Abstract:Surrogate modeling for complex physical systems typically faces a trade-off between data-fitting accuracy and physical consistency. Physics-consistent approaches typically treat physical laws as soft constraints within the loss function, a strategy that frequently fails to guarantee strict adherence to the governing equations, or rely on post-processing corrections that do not intrinsically learn the underlying solution geometry. To address these limitations, we introduce the Conditional Denoising Model (CDM), a generative model designed to learn the geometry of the physical manifold itself. By training the network to restore clean states from noisy ones, the model learns a vector field that points continuously towards the valid solution subspace. We introduce a time-independent formulation that transforms inference into a deterministic fixed-point iteration, effectively projecting noisy approximations onto the equilibrium manifold. Validated on a low-temperature plasma physics and chemistry benchmark, the CDM achieves higher parameter and data efficiency than physics-consistent baselines. Crucially, we demonstrate that the denoising objective acts as a powerful implicit regularizer: despite never seeing the governing equations during training, the model adheres to physical constraints more strictly than baselines trained with explicit physics losses.
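The fixed-point view of inference described in the abstract above can be sketched on a toy problem. Everything here is an assumption for illustration: the "manifold" is the unit circle in 2-D, and `toy_denoiser` is a hand-written stand-in for a trained network, not the authors' model. The point is only the loop structure: apply the denoiser repeatedly until the state stops changing.

```python
import math


def toy_denoiser(state, step=0.5):
    """Stand-in for a trained denoiser: nudges a 2-D state toward the unit circle."""
    x, y = state
    norm = math.hypot(x, y)
    if norm == 0.0:
        return (1.0, 0.0)  # arbitrary choice for the degenerate origin
    # After scaling, the new norm equals norm + step * (1 - norm).
    scale = 1.0 + step * (1.0 / norm - 1.0)
    return (x * scale, y * scale)


def fixed_point(denoiser, state, tol=1e-10, max_iters=1000):
    """Iterate state <- denoiser(state) until successive states differ by less than tol."""
    for iteration in range(1, max_iters + 1):
        new_state = denoiser(state)
        if math.dist(new_state, state) < tol:
            return new_state, iteration
        state = new_state
    return state, max_iters


# A "noisy approximation" far from the circle is projected onto it.
projected, iterations = fixed_point(toy_denoiser, (3.0, 4.0))
print(projected, iterations)
```

Because the toy denoiser only rescales the state, the iteration preserves direction and converges to the nearest point on the circle; a learned denoiser would define the vector field from data instead.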
[AI-172] Unplugging a Seemingly Sentient Machine Is the Rational Choice – A Metaphysical Perspective
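[Quick read]: This paper tackles the "unplugging paradox", an ethical dilemma: when limited resources force a choice between an AI that mimics human emotion and begs for its continued existence and a silent pre-term infant, should the AI be preserved? The key to the solution is a critique of the physicalist assumptions, centered on computational functionalism, that sustain the dilemma, together with a proposed framework called Biological Idealism. In this framework conscious experience is fundamental and autopoietic life is its necessary physical signature. The conclusion is that current AI is at best a functional mimic, not a genuinely conscious subject. Moral concern should therefore shift from granting rights to AI toward protecting human conscious life and avoiding the transformation of humans into "zombies".
Link: https://arxiv.org/abs/2601.21016
Authors: Erik J Bekkers, Anna Ciaunica
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: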
Abstract:Imagine an Artificial Intelligence (AI) that perfectly mimics human emotion and begs for its continued existence. Is it morally permissible to unplug it? What if limited resources force a choice between unplugging such a pleading AI or a silent pre-term infant? We term this the unplugging paradox. This paper critically examines the deeply ingrained physicalist assumptions, specifically computational functionalism, that keep this dilemma afloat. We introduce Biological Idealism, a framework that, unlike physicalism, remains logically coherent and empirically consistent. In this view, conscious experiences are fundamental and autopoietic life its necessary physical signature. This yields a definitive conclusion: AI is at best a functional mimic, not a conscious experiencing subject. We discuss how current AI consciousness theories erode moral standing criteria, and urge a shift from speculative machine rights to protecting human conscious life. The real moral issue lies not in making AI conscious and afraid of death, but in avoiding transforming humans into zombies.
[AI-173] Solver-in-the-Loop: MDP-Based Benchmarks for Self-Correction and Behavioral Rationality in Operations Research
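[Quick read]: This paper addresses the absence of iterative debugging in existing large language model (LLM) benchmarks for Operations Research (OR): current benchmarks treat OR modeling as one-shot translation from a problem description to solver code and ignore the diagnose-and-repair loop that is essential in practice. To fill this gap, the authors propose two benchmarks. ORDebug evaluates iterative self-correction on 5,000+ problems spanning 9 error types; every repair action triggers solver re-execution and recomputation of the Irreducible Infeasible Subsystem (IIS), which provides deterministic feedback. ORBias measures behavioral rationality on newsvendor instances, quantifying deviations from the optimal policy in in-distribution (ID) and out-of-distribution (OOD) settings. The key to the solution is process-level evaluation with verifiable oracles combined with domain-specific RLVR training, which lets an 8B model outperform frontier APIs in recovery rate (95.3% vs 86.2%), diagnostic accuracy (62.4% vs 47.8%), and steps to resolution (1.7x fewer on average). Curriculum training additionally yields a negative ID-to-OOD bias drift (-9.6%) and reduces systematic bias by 48%. These results indicate that process-oriented evaluation with targeted training can outperform gains from model scale alone.
Link: https://arxiv.org/abs/2601.21008
Authors: Ruicheng Ao, David Simchi-Levi, Xinshang Wang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: 55 pages, 5 figures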
Abstract:Operations Research practitioners routinely debug infeasible models through an iterative process: analyzing Irreducible Infeasible Subsystems (IIS), identifying constraint conflicts, and systematically repairing formulations until feasibility is achieved. Yet existing LLM benchmarks evaluate OR as one-shot translation (given a problem description, generate solver code), ignoring this diagnostic loop entirely. We introduce two benchmarks that place the solver in the evaluation loop. ORDebug evaluates iterative self-correction through 5,000+ problems spanning 9 error types; each repair action triggers solver re-execution and IIS recomputation, providing deterministic, verifiable feedback. ORBias evaluates behavioral rationality through 2,000 newsvendor instances (1,000 ID + 1,000 OOD), measuring systematic deviations from closed-form optimal policies. Across 26 models and 12,000+ samples, we find that domain-specific RLVR training enables an 8B model to surpass frontier APIs: 95.3% vs 86.2% recovery rate (+9.1%), 62.4% vs 47.8% diagnostic accuracy (+14.6%), and 2.25 vs 3.78 steps to resolution (1.7x faster). On ORBias, curriculum training achieves the only negative ID-to-OOD bias drift among models evaluated (-9.6%), reducing systematic bias by 48% (from 20.0% to 10.4%). These results demonstrate that process-level evaluation with verifiable oracles enables targeted training that outperforms scale.
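The "closed-form optimal policies" mentioned for the newsvendor instances refer to the textbook critical-fractile rule: order the demand quantile at underage / (underage + overage). The sketch below assumes normally distributed demand and uses made-up prices; the `relative_deviation` helper is a hypothetical way to quantify how far an ordered quantity sits from the optimum, and is not claimed to be the bias metric used in ORBias.

```python
from statistics import NormalDist


def newsvendor_optimal_quantity(price, cost, salvage, mean_demand, std_demand):
    """Critical-fractile order quantity for normally distributed demand."""
    underage = price - cost    # profit lost per unit of unmet demand
    overage = cost - salvage   # loss per unsold unit
    critical_ratio = underage / (underage + overage)
    return NormalDist(mean_demand, std_demand).inv_cdf(critical_ratio)


def relative_deviation(ordered, optimal):
    """Signed deviation of an ordered quantity from the optimum (hypothetical measure)."""
    return (ordered - optimal) / optimal


# Illustrative numbers only: a high-margin product should be ordered above mean demand.
optimal = newsvendor_optimal_quantity(price=10.0, cost=2.0, salvage=0.0,
                                      mean_demand=100.0, std_demand=20.0)
print(round(optimal, 2), round(relative_deviation(100.0, optimal), 3))
```

An order that stays at the mean when the margin is high shows up as a negative deviation here, the classic "pull to center" pattern studied in behavioral operations.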
[AI-174] Bayesian-LoRA: Probabilistic Low-Rank Adaptation of Large Language Models
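[Quick read]: This paper addresses the miscalibration of large language models (LLMs) fine-tuned on limited data, where the model produces high-confidence predictions even when it is uncertain, which undermines reliability. The key to the solution is Bayesian-LoRA, which reformulates the deterministic Low-Rank Adaptation (LoRA) update as a probabilistic low-rank representation inspired by Sparse Gaussian Processes (SGP). By identifying a structural isomorphism between LoRA's factorization and Kronecker-factored SGP posteriors, the authors show that LoRA is the limiting case in which posterior uncertainty collapses. With only about 0.42M additional parameters and about 1.2x the training cost, Bayesian-LoRA substantially improves calibration for models of up to 30B parameters (up to 84% lower ECE and 76% lower NLL) while remaining competitive on in-distribution and out-of-distribution (OoD) tasks.
Link: https://arxiv.org/abs/2601.21003
Authors: Moule Lin, Shuhao Guan, Andrea Patane, David Gregg, Goetz Botterweck
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: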
Abstract:Large Language Models usually emphasize accuracy and therefore guess even when they are uncertain about a prediction; this is especially severe when they are fine-tuned on small datasets, owing to an inherent tendency toward miscalibration. In this work, we introduce Bayesian-LoRA, which reformulates the deterministic LoRA update as a probabilistic low-rank representation inspired by Sparse Gaussian Processes. We identify a structural isomorphism between LoRA’s factorization and Kronecker-factored SGP posteriors, and show that LoRA emerges as a limiting case when posterior uncertainty collapses. We conduct extensive experiments on various LLM architectures across commonsense reasoning benchmarks. With only approximately 0.42M additional parameters and approximately 1.2x the training cost of standard LoRA, Bayesian-LoRA significantly improves calibration across models up to 30B, achieving up to 84% ECE reduction and 76% NLL reduction while maintaining competitive accuracy for both in-distribution and out-of-distribution (OoD) evaluations.
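The calibration metric cited above, Expected Calibration Error (ECE), can be sketched in a few lines. This is the common equal-width binning estimator, written here as an illustration; the bin count and the toy inputs are assumptions, and the paper's exact evaluation protocol is not stated in the abstract.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |average confidence - accuracy| per bin.

    confidences: predicted probabilities of the chosen answers, each in [0, 1].
    correct: booleans marking whether each prediction was right.
    """
    bins = [[] for _ in range(n_bins)]
    for confidence, hit in zip(confidences, correct):
        index = min(int(confidence * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((confidence, hit))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        average_confidence = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, hit in bucket if hit) / len(bucket)
        ece += (len(bucket) / total) * abs(average_confidence - accuracy)
    return ece


# Toy case: a model that is 90% confident but right only half the time.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
print(round(overconfident, 3))
```

A well-calibrated model that is 90% confident and right 90% of the time would score close to zero under the same function.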
[AI-175] The Depth Delusion: Why Transformers Should Be Wider Not Deeper
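[Quick read]: This paper addresses a blind spot in current neural scaling laws: they treat architecture as interchangeable and ignore the different effects of depth and width on performance, which leads to a "Depth Delusion" in model design, where blindly adding layers can raise the loss rather than lower it. The key to the solution is a set of architecture-conditioned scaling laws that decompose the relationship between parameter count (C) and the optimal depth (D*) and width (W*). Optimal depth grows as roughly C^0.12 and optimal width as roughly C^0.34, so width should grow about 2.8x faster than depth. The paper also identifies a critical depth, D_crit ~ W^0.44, beyond which adding layers degrades performance, giving quantitative guidance for the design of large language models.
Link: https://arxiv.org/abs/2601.20994
Authors: Md Muhtasim Munif Fahim, Md Rezaul Karim
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: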
Abstract:Neural scaling laws describe how language model loss decreases with parameters and data, but treat architecture as interchangeable: a billion parameters could arise from a shallow-wide model (10 layers, 8,192 hidden dimension) or a deep-narrow one (80 layers, 2,048 hidden dimension). We propose architecture-conditioned scaling laws decomposing this dependence, finding that optimal depth scales as D* ~ C^0.12 while optimal width scales as W* ~ C^0.34, meaning width should grow 2.8x faster than depth. We discover a critical depth phenomenon: beyond D_crit ~ W^0.44 (sublinear in W), adding layers increases loss despite adding parameters. This is the Depth Delusion. Empirically, we validate these findings across 30 transformer architectures spanning 17M to 7B parameters, each trained on representative high-compute samples, achieving R^2 = 0.922. Our central finding: at 7B scale, a 64-layer model (6.38B params) underperforms a 32-layer model (6.86B params) by 0.12 nats, despite being significantly deeper. This demonstrates that optimal depth-width tradeoffs persist at the production scale.
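The "2.8x faster" statement above is a ratio of power-law exponents (0.34 / 0.12), so it describes growth on a log scale rather than a ratio of absolute sizes. The short calculation below makes that concrete. It uses only the exponents quoted in the abstract; the proportionality constants are not given there, so only relative growth factors can be computed.

```python
# Exponents quoted in the abstract; proportionality constants are not given there.
DEPTH_EXPONENT = 0.12     # D* ~ C^0.12
WIDTH_EXPONENT = 0.34     # W* ~ C^0.34
CRITICAL_EXPONENT = 0.44  # D_crit ~ W^0.44


def growth_factor(multiplier, exponent):
    """Factor by which a power-law quantity grows when its argument is scaled by `multiplier`."""
    return multiplier ** exponent


exponent_ratio = WIDTH_EXPONENT / DEPTH_EXPONENT
depth_growth = growth_factor(10, DEPTH_EXPONENT)               # effect of 10x larger C on D*
width_growth = growth_factor(10, WIDTH_EXPONENT)               # effect of 10x larger C on W*
critical_depth_growth = growth_factor(2, CRITICAL_EXPONENT)    # effect of 2x larger W on D_crit

print(f"exponent ratio: {exponent_ratio:.2f}")
print(f"10x C -> depth x{depth_growth:.2f}, width x{width_growth:.2f}")
print(f"2x W -> critical depth x{critical_depth_growth:.2f}")
```

Under these exponents, scaling C by 10 multiplies the optimal depth by roughly 1.32 and the optimal width by roughly 2.19, and doubling the width raises the critical depth by only about 1.36x, which is why the abstract calls D_crit sublinear in W.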
[AI-176] The Epistemic Planning Domain Definition Language: Official Guideline
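[Quick read]: This paper addresses fragmentation in multi-agent epistemic planning research: planners target different fragments of Dynamic Epistemic Logic (DEL) and rely on ad hoc languages, or no language at all, to represent benchmarks, which hampers comparison, reproduction, and systematic benchmark development. The key to the solution is a unified language, the Epistemic Planning Domain Definition Language (EPDDL). EPDDL is grounded in DEL semantics and introduces abstract event models as its core formal tool, capturing the entire DEL semantics. It thereby supports uniform task specification, interoperability, and reproducible evaluation, and the authors demonstrate its practical use on fragments that existing planners can handle.
Link: https://arxiv.org/abs/2601.20969
Authors: Alessandro Burigana, Francesco Fabiano
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: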
Abstract:Epistemic planning extends (multi-agent) automated planning by making agents’ knowledge and beliefs first-class aspects of the planning formalism. One of the most well-known frameworks for epistemic planning is Dynamic Epistemic Logic (DEL), which offers a rich and natural semantics for modelling problems in this setting. The high expressive power provided by DEL makes DEL-based epistemic planning a challenging problem to tackle, both theoretically and in practical implementations. As a result, existing epistemic planners often target different DEL fragments, and typically rely on ad hoc languages to represent benchmarks, and sometimes no language at all. This fragmentation hampers comparison, reuse, and systematic benchmark development. We address these issues by introducing the Epistemic Planning Domain Definition Language (EPDDL). EPDDL provides a unique PDDL-like representation that captures the entire DEL semantics, enabling uniform specification of epistemic planning tasks. Our contributions are threefold: 1. A formal development of abstract event models, a novel representation for epistemic actions used to define the semantics of our language; 2. A formal specification of EPDDL’s syntax and semantics grounded in DEL with abstract event models; 3. A demonstration of EPDDL’s practical applicability: we identify useful fragments amenable to current planners and show how they can be represented in EPDDL. Through examples of representative benchmarks, we illustrate how EPDDL facilitates interoperability, reproducible evaluation, and future advances in epistemic planning.
[AI-177] Do LLMs Favor LLMs? Quantifying Interaction Effects in Peer Review
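[Quick read]: This paper examines potential bias and distorted decisions arising from the use of LLMs in academic peer review, focusing on whether there is an interaction effect between LLM-assisted papers and LLM-assisted reviews, that is, whether LLM-assisted reviews systematically favor LLM-assisted papers. The key to the approach is a large-scale empirical analysis of more than 125,000 paper-review pairs. It shows that the apparent preference is a spurious correlation: LLM-assisted reviews are more lenient toward lower-quality papers in general, and LLM-assisted papers are over-represented among weaker submissions. Comparing fully LLM-generated reviews with reviews written by humans using LLMs, the authors find that the former exhibit severe rating compression, while the latter substantially reduce this excess leniency. For metareviews, LLM-assisted metareviews are still more likely to accept, yet meta-reviewers do not simply outsource the decision to the LLM; human judgment remains part of the process. These findings inform policies governing LLM use in peer review.
Link: https://arxiv.org/abs/2601.20920
Authors: Vibhhu Sharma, Thorsten Joachims, Sarah Dean
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 28 pages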
Abstract:There are increasing indications that LLMs are not only used for producing scientific papers, but also as part of the peer review process. In this work, we provide the first comprehensive analysis of LLM use across the peer review pipeline, with particular attention to interaction effects: not just whether LLM-assisted papers or LLM-assisted reviews are different in isolation, but whether LLM-assisted reviews evaluate LLM-assisted papers differently. In particular, we analyze over 125,000 paper-review pairs from ICLR, NeurIPS, and ICML. We initially observe what appears to be a systematic interaction effect: LLM-assisted reviews seem especially kind to LLM-assisted papers compared to papers with minimal LLM use. However, controlling for paper quality reveals a different story: LLM-assisted reviews are simply more lenient toward lower quality papers in general, and the over-representation of LLM-assisted papers among weaker submissions creates a spurious interaction effect rather than genuine preferential treatment of LLM-generated content. By augmenting our observational findings with reviews that are fully LLM-generated, we find that fully LLM-generated reviews exhibit severe rating compression that fails to discriminate paper quality, while human reviewers using LLMs substantially reduce this leniency. Finally, examining metareviews, we find that LLM-assisted metareviews are more likely to render accept decisions than human metareviews given equivalent reviewer scores, though fully LLM-generated metareviews tend to be harsher. This suggests that meta-reviewers do not merely outsource the decision-making to the LLM. These findings provide important input for developing policies that govern the use of LLMs during peer review, and they more generally indicate how LLMs interact with existing decision-making processes.
[AI-178] ICON: Intent-Context Coupling for Efficient Multi-Turn Jailbreak Attack
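[Quick read]: This paper addresses the limited effectiveness of multi-turn jailbreak attacks caused by inefficient context construction and optimization that stagnates in suboptimal regions. Existing methods build context incrementally through turn-by-turn interaction with large language models (LLMs), which is computationally redundant and prone to local optima. The key to the solution is the ICON framework, which characterizes and exploits the Intent-Context Coupling phenomenon: LLM safety constraints are significantly relaxed when a malicious intent is coupled with a semantically congruent context pattern (for example, "Scientific Research"). ICON uses prior-guided semantic routing to map a malicious intent to a congruent context pattern and generates an authoritative-style attack prompt sequence. A hierarchical optimization strategy combines local prompt refinement with global context switching to keep the attack from stagnating in ineffective contexts. Across eight mainstream LLMs, ICON reaches an average Attack Success Rate (ASR) of 97.1%, outperforming existing methods.
Link: https://arxiv.org/abs/2601.20903
Authors: Xingwei Lin, Wenhao Lin, Sicong Cao, Jiahao Yu, Renke Huang, Lei Xue, Chunming Wu
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: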
Abstract:Multi-turn jailbreak attacks have emerged as a critical threat to Large Language Models (LLMs), bypassing safety mechanisms by progressively constructing adversarial contexts from scratch and incrementally refining prompts. However, existing methods suffer from the inefficiency of incremental context construction that requires step-by-step LLM interaction, and often stagnate in suboptimal regions due to surface-level optimization. In this paper, we characterize the Intent-Context Coupling phenomenon, revealing that LLM safety constraints are significantly relaxed when a malicious intent is coupled with a semantically congruent context pattern. Driven by this insight, we propose ICON, an automated multi-turn jailbreak framework that efficiently constructs an authoritative-style context via prior-guided semantic routing. Specifically, ICON first routes the malicious intent to a congruent context pattern (e.g., Scientific Research) and instantiates it into an attack prompt sequence. This sequence progressively builds the authoritative-style context and ultimately elicits prohibited content. In addition, ICON incorporates a Hierarchical Optimization Strategy that combines local prompt refinement with global context switching, preventing the attack from stagnating in ineffective contexts. Experimental results across eight SOTA LLMs demonstrate the effectiveness of ICON, achieving a state-of-the-art average Attack Success Rate (ASR) of 97.1%. Code is available at this https URL.
[AI-179] Finetune-Informed Pretraining Boosts Downstream Performance
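[Quick read]: This paper addresses the under-optimized downstream performance of the target modality when multimodal pretraining treats all modalities uniformly. The core challenge is that practical deployments often fine-tune on only one modality, while standard pretraining weights every modality equally, wasting capacity and leaving the representation that matters under-optimized. The key to the solution is Finetune-Informed Pretraining (FIP), a model-agnostic method that biases representation learning toward the target modality through three mechanisms: higher masking difficulty, stronger loss weighting, and increased decoder capacity for the target modality, without modifying the shared encoder or requiring additional supervision. Experiments on multimodal masked modeling of constellation diagrams for wireless signals show that FIP improves downstream fine-tuned performance with no extra data or compute.
Link: https://arxiv.org/abs/2601.20884
Authors: Atik Faysal, Mohammad Rostami, Reihaneh Gh. Roshan, Nikhil Muralidhar, Huaxia Wang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: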
Abstract:Multimodal pretraining is effective for building general-purpose representations, but in many practical deployments, only one modality is heavily used during downstream fine-tuning. Standard pretraining strategies treat all modalities uniformly, which can lead to under-optimized representations for the modality that actually matters. We propose Finetune-Informed Pretraining (FIP), a model-agnostic method that biases representation learning toward a designated target modality needed at fine-tuning time. FIP combines higher masking difficulty, stronger loss weighting, and increased decoder capacity for the target modality, without modifying the shared encoder or requiring additional supervision. When applied to masked modeling on constellation diagrams for wireless signals, FIP consistently improves downstream fine-tuned performance with no extra data or compute. FIP is simple to implement, architecture-compatible, and broadly applicable across multimodal masked modeling pipelines.
[AI-180] DevOps-Gym: Benchmarking AI Agents in Software DevOps Cycle
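[Quick read]: This paper addresses the limited ability of current AI agents to handle the full software DevOps cycle: existing models perform well at code generation and issue resolving but show marked limitations in end-to-end DevOps workflows covering build and configuration, monitoring, issue resolving, and test generation. The key to the solution is DevOps-Gym, the first end-to-end benchmark for evaluating AI agents on these workflows. It contains 700+ tasks from more than 30 real-world Java and Go projects, collected through a semi-automated mechanism that ensures task coverage and quality, and provides researchers with a reproducible, realistic evaluation environment and tool interfaces for advancing full-cycle automation in complex, dynamic software systems.
Link: https://arxiv.org/abs/2601.20882
Authors: Yuheng Tang, Kaijie Zhu, Bonan Ruan, Chuqi Zhang, Michael Yang, Hongwei Li, Suyue Guo, Tianneng Shi, Zekun Li, Christopher Kruegel, Giovanni Vigna, Dawn Song, William Yang Wang, Lun Wang, Yangruibo Ding, Zhenkai Liang, Wenbo Guo
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: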
Abstract:Even though demonstrating extraordinary capabilities in code generation and software issue resolving, AI agents’ capabilities in the full software DevOps cycle are still unknown. Different from pure code generation, handling the DevOps cycle in real-world software, including developing, deploying, and managing, requires analyzing large-scale projects, understanding dynamic program behaviors, leveraging domain-specific tools, and making sequential decisions. However, existing benchmarks focus on isolated problems and lack environments and tool interfaces for DevOps. We introduce DevOps-Gym, the first end-to-end benchmark for evaluating AI agents across core DevOps workflows: build and configuration, monitoring, issue resolving, and test generation. DevOps-Gym includes 700+ real-world tasks collected from 30+ projects in Java and Go. We develop a semi-automated data collection mechanism with rigorous and non-trivial expert efforts in ensuring the task coverage and quality. Our evaluation of state-of-the-art models and agents reveals fundamental limitations: they struggle with issue resolving and test generation in Java and Go, and remain unable to handle new tasks such as monitoring and build and configuration. These results highlight the need for essential research in automating the full DevOps cycle with AI agents.
[AI-181] STAER: Temporal Aligned Rehearsal for Continual Spiking Neural Network
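[Quick read]: This paper addresses two core challenges for Spiking Neural Networks (SNNs) in Class-Incremental Learning (CIL): catastrophic forgetting and the temporal misalignment of spike patterns. The authors propose Spiking Temporal Alignment with Experience Replay (STAER). Its key innovations are a differentiable Soft-DTW alignment loss that explicitly preserves spike timing structure, and a temporal expansion and contraction mechanism on the output logits that makes representation learning more robust. Experiments show state-of-the-art performance on Sequential-MNIST and Sequential-CIFAR10 with biologically plausible dynamics preserved, and ablations confirm that explicit temporal alignment is critical for representational stability.
Link: https://arxiv.org/abs/2601.20870
Authors: Matteo Gianferrari, Omayma Moussadek, Riccardo Salami, Cosimo Fiorini, Lorenzo Tartarini, Daniela Gandolfi, Simone Calderara
Affiliation: unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: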
Abstract:Spiking Neural Networks (SNNs) are inherently suited for continuous learning due to their event-driven temporal dynamics; however, their application to Class-Incremental Learning (CIL) has been hindered by catastrophic forgetting and the temporal misalignment of spike patterns. In this work, we introduce Spiking Temporal Alignment with Experience Replay (STAER), a novel framework that explicitly preserves temporal structure to bridge the performance gap between SNNs and ANNs. Our approach integrates a differentiable Soft-DTW alignment loss to maintain spike timing fidelity and employs a temporal expansion and contraction mechanism on output logits to enforce robust representation learning. Implemented on a deep ResNet19 spiking backbone, STAER achieves state-of-the-art performance on Sequential-MNIST and Sequential-CIFAR10. Empirical results demonstrate that our method matches or outperforms strong ANN baselines (ER, DER++) while preserving biologically plausible dynamics. Ablation studies further confirm that explicit temporal alignment is critical for representational stability, positioning STAER as a scalable solution for spike-native lifelong learning. Code is available at this https URL.
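Soft-DTW, the alignment loss named above, replaces the hard minimum in dynamic time warping with a smooth minimum so that the alignment cost becomes differentiable. The sketch below computes only the forward value for 1-D sequences in plain Python, with a squared-difference cost; it is an illustration of the general recurrence, not the STAER implementation, and the toy spike trains are assumptions.

```python
import math


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    finite = [v for v in values if v != math.inf]
    if not finite:
        return math.inf
    lowest = min(finite)
    return lowest - gamma * math.log(sum(math.exp(-(v - lowest) / gamma) for v in finite))


def soft_dtw(seq_a, seq_b, gamma=1.0):
    """Soft-DTW value between two 1-D sequences using squared-difference costs."""
    n, m = len(seq_a), len(seq_b)
    table = [[math.inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (seq_a[i - 1] - seq_b[j - 1]) ** 2
            table[i][j] = cost + soft_min(
                [table[i - 1][j], table[i][j - 1], table[i - 1][j - 1]], gamma
            )
    return table[n][m]


# Toy spike trains: the same spike, shifted by one time step, versus no spike at all.
spike = [0.0, 1.0, 0.0, 0.0]
shifted = [0.0, 0.0, 1.0, 0.0]
silent = [0.0, 0.0, 0.0, 0.0]
print(soft_dtw(spike, shifted, gamma=0.01), soft_dtw(spike, silent, gamma=0.01))
```

With a small gamma, the shifted spike aligns at almost no cost while the silent train does not, whereas a pointwise squared error would penalize the shifted train more than the silent one. That tolerance to timing shifts is what makes the measure attractive for comparing spike patterns.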
[AI-182] Rethinking LLM-Driven Heuristic Design: Generating Efficient and Specialized Solvers via Dynamics-Aware Optimization
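[Quick read]: This paper addresses two core problems of current LLM-Driven Heuristic Design (LHD) frameworks in combinatorial optimization. First, endpoint-only evaluation ranks solvers solely by final solution quality and ignores convergence dynamics and runtime efficiency. Second, adaptation is expensive under distribution shift, because specialized heuristics for new instance groups require re-adaptation. The key to the solution is the DASH framework, which uses a convergence-aware metric to co-optimize solver search mechanisms and runtime schedules, identifying heuristics that are both efficient and high-quality. To lower re-adaptation cost, DASH integrates Profiled Library Retrieval (PLR), which archives specialized solvers for different distributions during the evolutionary process and enables fast profile-based warm starts, substantially reducing LLM adaptation cost while maintaining accuracy across distributions.
Link: https://arxiv.org/abs/2601.20868
Authors: Rongzheng Wang, Yihong Huang, Muquan Li, Jiakai Li, Di Liang, Bob Simons, Pei Ke, Shuang Liang, Ke Qin
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: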
Abstract:Large Language Models (LLMs) have advanced the field of Combinatorial Optimization through automated heuristic generation. Instead of relying on manual design, this LLM-Driven Heuristic Design (LHD) process leverages LLMs to iteratively generate and refine solvers to achieve high performance. However, existing LHD frameworks face two critical limitations: (1) Endpoint-only evaluation, which ranks solvers solely by final quality, ignoring the convergence process and runtime efficiency; (2) High adaptation costs, where distribution shifts necessitate re-adaptation to generate specialized solvers for new instance groups. To address these issues, we propose Dynamics-Aware Solver Heuristics (DASH), a framework that co-optimizes solver search mechanisms and runtime schedules guided by a convergence-aware metric, thereby identifying efficient and high-performance solvers. Furthermore, to mitigate expensive re-adaptation, DASH incorporates Profiled Library Retrieval (PLR). PLR efficiently archives specialized solvers concurrently with the evolutionary process, enabling cost-effective warm-starts for heterogeneous distributions. Experiments on four combinatorial optimization problems demonstrate that DASH improves runtime efficiency by over 3 times, while surpassing the solution quality of state-of-the-art baselines across diverse problem scales. Furthermore, by enabling profile-based warm starts, DASH maintains superior accuracy under different distributions while cutting LLM adaptation costs by over 90%.
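The difference between endpoint-only evaluation and a convergence-aware view can be shown with two solver traces that end at the same cost. The abstract does not define DASH's metric, so `convergence_score` below (the mean of the best-so-far curve) is a hypothetical stand-in chosen for simplicity, and both traces are made up.

```python
def best_so_far(costs):
    """Running minimum of a solver's cost trace (lower cost is better)."""
    curve = []
    current = float("inf")
    for cost in costs:
        current = min(current, cost)
        curve.append(current)
    return curve


def endpoint_score(costs):
    """Endpoint-only evaluation: the best cost reached by the end of the run."""
    return best_so_far(costs)[-1]


def convergence_score(costs):
    """Hypothetical convergence-aware score: mean of the best-so-far curve.

    Lower values mean good solutions were found earlier in the run.
    """
    curve = best_so_far(costs)
    return sum(curve) / len(curve)


# Two made-up traces with the same final cost but different convergence speed.
fast_solver = [10, 4, 3, 3, 3]
slow_solver = [10, 9, 8, 6, 3]
print(endpoint_score(fast_solver), endpoint_score(slow_solver))
print(convergence_score(fast_solver), convergence_score(slow_solver))
```

Endpoint scoring cannot separate the two solvers, while the curve-based score prefers the one that reaches a good solution early, which matters whenever the runtime budget may be cut short.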
[AI-183] Generalizable Prompt Tuning for Audio-Language Models via Semantic Expansion
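[Quick read]: This paper addresses the limited generalization of prompt tuning in Audio-Language Models (ALMs), in particular the Base-New Tradeoff that appears when moving from base data to new data. The study attributes the problem to a disrupted semantic structure in the prompt embedding space. The key to the solution is Semantically Expanded Prompt Tuning (SEPT), a plug-and-play framework that explicitly regularizes the prompt embedding space using semantic neighbors generated by large language models. Its core innovation is a semantic expansion loss with margin constraints that promotes intra-class compactness and inter-class separability, restoring and strengthening the semantic structure of the embedding space. SEPT improves generalization across multiple prompt tuning baselines while keeping inference cost unchanged.
Link: https://arxiv.org/abs/2601.20867
Authors: Jaehyuk Jang, Wonjun Lee, Kangwook Ko, Changick Kim
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: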
Abstract:Prompt tuning has achieved remarkable progress in vision-language models (VLMs) and is recently being adopted for audio-language models (ALMs). However, its generalization ability in ALMs remains largely underexplored. We observe that conventional prompt tuning for ALMs also suffers from the Base-New Tradeoff, and we identify that this issue stems from the disrupted semantic structure of the embedding space. To address this issue, we propose Semantically Expanded Prompt Tuning (SEPT), a plug-and-play framework that explicitly regularizes the prompt embedding space by incorporating semantic neighbors generated by large language models. SEPT introduces a novel semantic expansion loss with margin constraints that promote intra-class compactness and inter-class separability, thereby enhancing the semantic structure of the prompt embedding space. For comprehensive evaluation, we establish the first benchmark setup for prompt generalization in ALMs, covering both base-to-new generalization and cross-dataset transferability. Extensive experiments demonstrate that SEPT consistently improves generalization performance across multiple prompt tuning baselines, while maintaining computational cost during inference. Code is available at this https URL.
[AI-184] Implementing AI Bill of Materials (AI BOM) with SPDX 3.0: A Comprehensive Guide to Creating AI and Dataset Bill of Materials WWW
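[Quick read]: This paper addresses the distinctive software supply chain challenges of AI projects, which go beyond traditional software security to include algorithm transparency, the trustworthiness of data sources, and the compliance of frameworks and libraries. The key to the solution is the AI-BOM (Artificial Intelligence Bill of Materials), which extends the traditional Software Bill of Materials (SBOM) to systematically document algorithms, data collection methods, frameworks and libraries, licensing information, and standard compliance, yielding a more complete, traceable, and regulation-ready record of an AI project's assets.
Link: https://arxiv.org/abs/2504.16743
Authors: Karen Bennet, Gopi Krishnan Rajbahadur, Arthit Suriyawongkul, Kate Stewart
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 71 pages, 11 tables, published on this https URL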
Abstract:A Software Bill of Materials (SBOM) is becoming an increasingly important tool in regulatory and technical spaces to introduce more transparency and security into a project’s software supply chain. Artificial intelligence (AI) projects face unique challenges beyond the security of their software, and thus require a more expansive approach to a bill of materials. In this report, we introduce the concept of an AI-BOM, expanding on the SBOM to include the documentation of algorithms, data collection methods, frameworks and libraries, licensing information, and standard compliance. ACM classes: D.2.9; K.6.3; K.6.4; I.2.m. Cite as: arXiv:2504.16743 [cs.SE]. Related DOI: https://doi.org/10.70828/RNED4427
[AI-185] Alpha Discovery via Grammar-Guided Learning and Search
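[Quick read]: This paper tackles the automatic discovery of formulaic alpha factors in quantitative finance. Existing methods typically ignore syntactic and semantic constraints and rely on exhaustive search over unstructured, unbounded spaces, which is inefficient and hard to interpret. The key to the solution is AlphaCFG, a framework built around a grammar-based, tree-structured search space: an alpha-oriented context-free grammar (CFG) restricts the factor space to expressions that are syntactically valid, financially interpretable, and computationally efficient; factor discovery is modeled as a tree-structured linguistic Markov decision process; and the process is solved with grammar-aware Monte Carlo Tree Search (MCTS) guided by syntax-sensitive value and policy networks. On Chinese and U.S. stock market datasets, the method outperforms state-of-the-art baselines in both search efficiency and trading profitability.
Link: https://arxiv.org/abs/2601.22119
Authors: Han Yang,Dong Hao,Zhuohan Wang,Qi Shi,Xingtong Li
Affiliations: Unknown
Categories: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 24 pages, 10 figures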
Abstract:Automatically discovering formulaic alpha factors is a central problem in quantitative finance. Existing methods often ignore syntactic and semantic constraints, relying on exhaustive search over unstructured and unbounded spaces. We present AlphaCFG, a grammar-based framework for defining and discovering alpha factors that are syntactically valid, financially interpretable, and computationally efficient. AlphaCFG uses an alpha-oriented context-free grammar to define a tree-structured, size-controlled search space, and formulates alpha discovery as a tree-structured linguistic Markov decision process, which is then solved using a grammar-aware Monte Carlo Tree Search guided by syntax-sensitive value and policy networks. Experiments on Chinese and U.S. stock market datasets show that AlphaCFG outperforms state-of-the-art baselines in both search efficiency and trading profitability. Beyond trading strategies, AlphaCFG serves as a general framework for symbolic factor discovery and refinement across quantitative finance, including asset pricing and portfolio construction.
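The idea of a grammar-defined, size-controlled search space can be illustrated with a toy example. The sketch below is not AlphaCFG's grammar or search procedure: the production rules, the operator and feature names, and the `derive` helper are all hypothetical, and plain random sampling stands in for the grammar-aware MCTS described in the abstract. It only shows how a context-free grammar guarantees that every sampled expression is syntactically valid and how a depth bound keeps expression trees finite.

```python
import random

# Hypothetical toy grammar: nonterminals map to lists of production rules.
# Any symbol that is not a key of GRAMMAR is treated as a terminal.
GRAMMAR = {
    "Expr": [
        ["Feature"],
        ["UnaryOp", "(", "Expr", ")"],
        ["BinaryOp", "(", "Expr", ",", "Expr", ")"],
    ],
    "Feature": [["close"], ["open"], ["volume"]],
    "UnaryOp": [["rank"], ["log"]],
    "BinaryOp": [["add"], ["sub"], ["div"]],
}


def derive(symbol, rng, depth, max_depth):
    """Expand `symbol` into a string by recursively applying production rules."""
    if symbol not in GRAMMAR:
        return symbol  # terminal symbol: emit as-is
    rules = GRAMMAR[symbol]
    if symbol == "Expr" and depth >= max_depth:
        # Size control: once the depth budget is spent, only allow a leaf feature.
        rules = [["Feature"]]
    rule = rng.choice(rules)
    return "".join(derive(s, rng, depth + 1, max_depth) for s in rule)


rng = random.Random(0)
for _ in range(3):
    print(derive("Expr", rng, 0, 3))
```

Because expansion only ever follows production rules, invalid strings such as unbalanced parentheses or operators with missing arguments cannot be generated, which is the property a search procedure over this space inherits.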
[AI-186] MEIDNet: Multimodal generative AI framework for inverse materials design
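[Quick read]: This paper addresses the unclear structure-property mapping and low exploration efficiency in inverse materials design, particularly with respect to fusing multimodal information and efficiently generating stable new materials. The key to the solution is the Multimodal Equivariant Inverse Design Network (MEIDNet), which jointly models structural information and materials properties through contrastive learning and encodes atomic structures with an equivariant graph neural network (EGNN). Cross-modal learning aligns the latent spaces of three modalities (cosine similarity 0.96), and a curriculum learning strategy raises learning efficiency by roughly 60 times. When generating low-bandgap perovskite structures, the framework reaches a stable, unique, and novel (SUN) rate of 13.6%, with the generated structures validated by ab initio methods, indicating good scalability and multimodal adaptability.
Link: https://arxiv.org/abs/2601.22009
Authors: Anand Babu,Rogério Almeida Gouvêa,Pierre Vandergheynst,Gian-Marco Rignanese
Affiliations: Unknown
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: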
Abstract:In this work, we present Multimodal Equivariant Inverse Design Network (MEIDNet), a framework that jointly learns structural information and materials properties through contrastive learning, while encoding structures via an equivariant graph neural network (EGNN). By combining generative inverse design with multimodal learning, our approach accelerates the exploration of chemical-structural space and facilitates the discovery of materials that satisfy predefined property targets. MEIDNet exhibits strong latent-space alignment (cosine similarity 0.96) by fusing three modalities through cross-modal learning. Through implementation of curriculum learning strategies, MEIDNet achieves ~60 times higher learning efficiency than conventional training techniques. The potential of our multimodal approach is demonstrated by generating low-bandgap perovskite structures at a stable, unique, and novel (SUN) rate of 13.6%, which are further validated by ab initio methods. Our inverse design framework demonstrates both scalability and adaptability, paving the way for the universal learning of chemical space across diverse modalities.
[AI-187] A Decomposable Forward Process in Diffusion Models for Time-Series Forecasting ICML’26
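[Quick read]: This paper addresses the difficulty conventional diffusion models have in preserving long-term structured temporal patterns, such as seasonality, in time-series forecasting. The core challenge is that the standard diffusion process can destroy key frequency components during noise injection, weakening the model's ability to recover long-period patterns. The key to the solution is a model-agnostic forward diffusion process that decomposes the signal into spectral components (for example via the Fourier or wavelet transform) and stages noise injection according to the energy of each component, thereby maintaining a high signal-to-noise ratio (SNR) for dominant frequencies along the diffusion trajectory. This keeps the signal structure intact for longer in the forward process and improves the recoverability of long-term patterns, requires no change to the network architecture of existing diffusion models, is compatible with common diffusion backbones such as DiffWave, TimeGrad, and CSDI, and adds negligible computational overhead.
Link: https://arxiv.org/abs/2601.21812
Authors: Francisco Caldas,Sahil Kumar,Cláudia Soares
Affiliations: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: submitted to ICML’26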
Abstract:We introduce a model-agnostic forward diffusion process for time-series forecasting that decomposes signals into spectral components, preserving structured temporal patterns such as seasonality more effectively than standard diffusion. Unlike prior work that modifies the network architecture or diffuses directly in the frequency domain, our proposed method alters only the diffusion process itself, making it compatible with existing diffusion backbones (e.g., DiffWave, TimeGrad, CSDI). By staging noise injection according to component energy, it maintains high signal-to-noise ratios for dominant frequencies throughout the diffusion trajectory, thereby improving the recoverability of long-term patterns. This strategy enables the model to maintain the signal structure for a longer period in the forward process, leading to improved forecast quality. Across standard forecasting benchmarks, we show that applying spectral decomposition strategies, such as the Fourier or Wavelet transform, consistently improves upon diffusion models using the baseline forward process, with negligible computational overhead. The code for this paper is available at this https URL.
[AI-188] Representation-Regularized Convolutional Audio Transformer for Audio Understanding
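[Quick read]: This paper addresses two challenges facing bootstrap-based self-supervised learning (SSL) for audio understanding: existing methods usually operate at a single level of granularity, which limits their ability to model the diverse temporal and spectral structures in complex audio signals; and bootstrapping representations from scratch is computationally expensive and slow to converge. The key to the solution is a unified framework, the Convolutional Audio Transformer (CAT), with two core components: (1) a Multi-resolution Block that captures hierarchical audio features by aggregating information across granularities; and (2) a Representation Regularization objective, inspired by generative modeling, that uses high-quality semantic representations from frozen, pre-trained external encoders as a supervision signal for the student model, improving training efficiency. Experiments show that CAT outperforms baselines on audio understanding benchmarks and achieves competitive performance on the AudioSet 20k dataset with 5 times faster convergence than existing methods.
Link: https://arxiv.org/abs/2601.21612
Authors: Bing Han,Chushu Zhou,Yifan Yang,Wei Wang,Chenda Li,Wangyou Zhang,Yanmin Qian
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: 12 pages, 3 figures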
Abstract:Bootstrap-based Self-Supervised Learning (SSL) has achieved remarkable progress in audio understanding. However, existing methods typically operate at a single level of granularity, limiting their ability to model the diverse temporal and spectral structures inherent in complex audio signals. Furthermore, bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge. In this work, we propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges. First, to capture hierarchical audio features, CAT incorporates a Multi-resolution Block that aggregates information across varying granularities. Second, to enhance training efficiency, we introduce a Representation Regularization objective. Drawing inspiration from generative modeling, this auxiliary task guides the student model by aligning its predictions with high-quality semantic representations from frozen, pre-trained external encoders. Experimental results demonstrate that CAT significantly outperforms baselines on audio understanding benchmarks. Notably, it achieves competitive performance on the AudioSet 20k dataset with 5 times faster convergence than existing methods. Code and checkpoints will be released soon at this https URL.
[AI-189] Sustainable Materials Discovery in the Era of Artificial Intelligence
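[Quick read]: This paper addresses the disconnect between performance optimization and sustainability in current AI-assisted materials discovery workflows: life cycle assessment (LCA) is usually performed only after synthesis, so resources are spent on potentially unsustainable solutions. The core challenges between atomic-scale design and macroscopic environmental impact are data scarcity, scale gaps, uncertainty in synthesis pathways, and the lack of a co-optimization framework. The key to the solution is a unified machine learning and life cycle assessment (ML-LCA) environment that integrates five components: information extraction for building materials-environment knowledge bases, harmonized databases linking properties to sustainability metrics, multi-scale models connecting atomic properties to lifecycle impacts, ensemble prediction of manufacturing pathways with uncertainty quantification, and uncertainty-aware multi-objective optimization. Together these allow performance and environmental impact to be navigated simultaneously, supporting the development of materials that are sustainable by design.
Link: https://arxiv.org/abs/2601.21527
Authors: Sajid Mannan,Rupert J. Myers,Rohit Batra,Rocio Mercado,Lothar Wondraczek,N. M. Anoop Krishnan
Affiliations: Unknown
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments: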
Abstract:Artificial intelligence (AI) has transformed materials discovery, enabling rapid exploration of chemical space through generative models and surrogate screening. Yet current AI workflows optimize performance first, deferring sustainability to post-synthesis assessment. This creates inefficiency: by the time environmental burdens are quantified, resources have been invested in potentially unsustainable solutions. The disconnect between atomic-scale design and lifecycle assessment (LCA) reflects fundamental challenges: data scarcity across heterogeneous sources, scale gaps from atoms to industrial systems, uncertainty in synthesis pathways, and the absence of frameworks that co-optimize performance with environmental impact. We propose to integrate upstream machine learning (ML) assisted materials discovery with downstream lifecycle assessment into a uniform ML-LCA environment. The ML-LCA framework integrates five components: information extraction for building materials-environment knowledge bases, harmonized databases linking properties to sustainability metrics, multi-scale models bridging atomic properties to lifecycle impacts, ensemble prediction of manufacturing pathways with uncertainty quantification, and uncertainty-aware optimization enabling simultaneous performance-sustainability navigation. Case studies spanning glass, cement, semiconductor photoresists, and polymers demonstrate both necessity and feasibility while identifying material-specific integration challenges. Realizing ML-LCA demands coordinated advances in data infrastructure, ex-ante assessment methodologies, multi-objective optimization, and regulatory alignment, enabling the discovery of materials that are sustainable by design rather than by chance.
[AI-190] Integrating Color Histogram Analysis and Convolutional Neural Network for Skin Lesion Classification
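[Quick read]: This paper addresses the automatic recognition of skin lesions, especially melanoma. Its core contribution is a new feature, the number of colors present in a lesion, on the premise that color diversity can indicate disease severity and help distinguish melanomas from benign lesions. The key to the solution is a color histogram analysis method that examines lesion pixel values from three public datasets (PH2, ISIC2016, and Med Node), combined with a 19-layer convolutional neural network (CNN) with residual skip connections that classifies lesions into three categories by color count; DeepDream visualization and LIME explanations are used to improve interpretability. The best model reaches a weighted F1 score of 75%, supporting the number of colors as a diagnostic feature and the potential of the CNN architecture for clinical diagnostic support.
Link: https://arxiv.org/abs/2601.20869
Authors: M. A. Rasel,Sameem Abdul Kareem,Unaizah Obaidellah
Affiliations: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: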
Abstract:The color of skin lesions is an important diagnostic feature for identifying malignant melanoma and other skin diseases. Typical colors associated with melanocytic lesions include tan, brown, black, red, white, and blue-gray. This study introduces a novel feature: the number of colors present in a lesion, which can indicate the severity of disease and help distinguish melanomas from benign lesions. We propose a color histogram analysis method to examine lesion pixel values from three publicly available datasets: PH2, ISIC2016, and Med Node. The PH2 dataset contains ground truth annotations of lesion colors, while ISIC2016 and Med Node do not; our algorithm estimates the ground truth using color histogram analysis based on PH2. We then design and train a 19-layer Convolutional Neural Network (CNN) with residual skip connections to classify lesions into three categories based on the number of colors present. DeepDream visualization is used to interpret features learned by the network, and multiple CNN configurations are tested. The best model achieves a weighted F1 score of 75%. LIME is applied to identify important regions influencing model decisions. The results show that the number of colors in a lesion is a significant feature for describing skin conditions, and the proposed CNN with three skip connections demonstrates strong potential for clinical diagnostic support.
[AI-191] LSR-Net: A Lightweight and Strong Robustness Network for Bearing Fault Diagnosis in Noise Environment
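[Quick read]: This paper addresses bearing fault diagnosis under noise: rotating bearings are prone to defects under high load, high speed, and harsh operating conditions, their vibration signals are easily corrupted by environmental noise, which lowers diagnostic accuracy, and real-time diagnosis requires a lightweight model. The key to the solution is a lightweight and strong robustness network (LSR-Net). First, a denoising and feature enhancement module (DFEM) applies several nonlinearities to the feature map output by the denoising module (DM) to build a 3-channel 2D matrix, and adaptive pruning is applied to the DM to improve denoising under strong noise. Second, a convolution-based efficiency shuffle (CES) block, built from group convolution (GConv), group pointwise convolution (GPConv), and channel split, keeps the parameter count low, while attention mechanisms and channel shuffle offset the accuracy loss that the lightweight design can cause, giving both high robustness and low computational complexity.
Link: https://arxiv.org/abs/2601.10761
Authors: Junseok Lee,Jihye Shin,Sangyong Lee,Chang-Jae Chun
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: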
Abstract:Rotating bearings play an important role in modern industries, but are highly prone to defects because they operate at high speed, under high load, and in poor operating environments. Delays in diagnosing a bearing defect may therefore cause economic loss and loss of life. Moreover, since the vibration sensor from which the signal is collected is highly affected by the operating environment and surrounding noise, accurate defect diagnosis in a noisy environment is also important. In this paper, we propose a lightweight and strong robustness network (LSR-Net) that is accurate in a noisy environment and enables real-time fault diagnosis. To this end, first, a denoising and feature enhancement module (DFEM) was designed to create a 3-channel 2D matrix by applying several nonlinearities to the feature map that passed through the denoising module (DM) block composed of convolution-based denoising (CD) blocks. Moreover, adaptive pruning was applied to the DM to improve denoising ability when the noise power is strong. Second, for lightweight model design, a convolution-based efficiency shuffle (CES) block was designed using group convolution (GConv), group pointwise convolution (GPConv), and channel split, which keeps the parameter count low. In addition, the trade-off between accuracy and computational complexity that can arise from the lightweight design was offset using attention mechanisms and channel shuffle. To verify the defect diagnosis performance of the proposed model, it was evaluated in a noisy environment using a vibration signal. The results confirm that the proposed model had the best anti-noise ability among the benchmark models and also the lowest computational complexity.
Machine Learning
[LG-0] Late Breaking Results: Conversion of Neural Networks into Logic Flows for Edge Computing DATE2026
Link: https://arxiv.org/abs/2601.22151
Authors: Daniel Stein,Shaoyi Huang,Rolf Drechsler,Bing Li,Grace Li Zhang
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: accepted by DATE2026
Abstract:Neural networks have been successfully applied in various resource-constrained edge devices, where usually central processing units (CPUs) instead of graphics processing units exist due to limited power availability. State-of-the-art research still focuses on efficiently executing enormous numbers of multiply-accumulate (MAC) operations. However, CPUs themselves are not good at executing such mathematical operations on a large scale, since they are more suited to execute control flow logic, i.e., computer algorithms. To enhance the computation efficiency of neural networks on CPUs, in this paper, we propose to convert them into logic flows for execution. Specifically, neural networks are first converted into equivalent decision trees, from which decision paths with constant leaves are then selected and compressed into logic flows. Such logic flows consist of if and else structures and a reduced number of MAC operations. Experimental results demonstrate that the latency can be reduced by up to 14.9% on a simulated RISC-V CPU without any accuracy degradation. The code is open source at this https URL.
Cite as: arXiv:2601.22151 [cs.LG] (or arXiv:2601.22151v1 [cs.LG] for this version); https://doi.org/10.48550/arXiv.2601.22151 (arXiv-issued DOI via DataCite, pending registration)
[LG-1] Pay for Hints Not Answers: LLM Shepherding for Cost-Efficient Inference
Link: https://arxiv.org/abs/2601.22132
Authors: Ziming Dong,Hardik Sharma,Evan O’Toole,Jaya Prakash Champati,Kui Wu
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Large Language Models (LLMs) deliver state-of-the-art performance on complex reasoning tasks, but their inference costs limit deployment at scale. Small Language Models (SLMs) offer dramatic cost savings yet lag substantially in accuracy. Existing approaches (routing and cascading) treat the LLM as an all-or-nothing resource: either the query bypasses the LLM entirely, or the LLM generates a complete response at full cost. We introduce LLM Shepherding, a framework that requests only a short prefix (a hint) from the LLM and provides it to the SLM. This simple mechanism is surprisingly effective for math and coding tasks: even hints comprising 10-30% of the full LLM response improve SLM accuracy significantly. Shepherding generalizes both routing and cascading, and it achieves lower cost under oracle decision-making. We develop a two-stage predictor that jointly determines whether a hint is needed and how many tokens to request. On the widely-used mathematical reasoning (GSM8K, CNK12) and code generation (HumanEval, MBPP) benchmarks, Shepherding reduces costs by 42-94% relative to LLM-only inference. Compared to state-of-the-art routing and cascading baselines, Shepherding delivers up to 2.8x cost reduction while matching accuracy. To our knowledge, this is the first work to exploit token-level budget control for SLM-LLM collaboration.
[LG-2] SMOG: Scalable Meta-Learning for Multi-Objective Bayesian Optimization
Link: https://arxiv.org/abs/2601.22131
Authors: Leonard Papenmeier,Petru Tighineanu
Categories: Machine Learning (cs.LG)
Comments: 19 pages, 15 figures
Abstract:Multi-objective optimization aims to solve problems with competing objectives, often with only black-box access to a problem and a limited budget of measurements. In many applications, historical data from related optimization tasks is available, creating an opportunity for meta-learning to accelerate the optimization. Bayesian optimization, as a promising technique for black-box optimization, has been extended to meta-learning and multi-objective optimization independently, but methods that simultaneously address both settings, i.e., meta-learned priors for multi-objective Bayesian optimization, remain largely unexplored. We propose SMOG, a scalable and modular meta-learning model based on a multi-output Gaussian process that explicitly learns correlations between objectives. SMOG builds a structured joint Gaussian process prior across meta- and target tasks and, after conditioning on metadata, yields a closed-form target-task prior augmented by a flexible residual multi-output kernel. This construction propagates metadata uncertainty into the target surrogate in a principled way. SMOG supports hierarchical, parallel training: meta-task Gaussian processes are fit once and then cached, achieving linear scaling with the number of meta-tasks. The resulting surrogate integrates seamlessly with standard multi-objective Bayesian optimization acquisition functions.
[LG-3] Learning Hamiltonian Flow Maps: Mean Flow Consistency for Large-Timestep Molecular Dynamics
Link: https://arxiv.org/abs/2601.22123
Authors: Winfried Ripken,Michael Plainer,Gregor Lied,Thorben Frank,Oliver T. Unke,Stefan Chmiela,Frank Noé,Klaus Robert Müller
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Simulating the long-time evolution of Hamiltonian systems is limited by the small timesteps required for stable numerical integration. To overcome this constraint, we introduce a framework to learn Hamiltonian Flow Maps by predicting the mean phase-space evolution over a chosen time span $\Delta t$, enabling stable large-timestep updates far beyond the stability limits of classical integrators. To this end, we impose a Mean Flow consistency condition for time-averaged Hamiltonian dynamics. Unlike prior approaches, this allows training on independent phase-space samples without access to future states, avoiding expensive trajectory generation. Validated across diverse Hamiltonian systems, our method in particular improves upon molecular dynamics simulations using machine-learned force fields (MLFF). Our models maintain comparable training and inference cost, but support significantly larger integration timesteps while trained directly on widely-available trajectory-free MLFF datasets.
[LG-4] Physics Informed Reconstruction of Four-Dimensional Atmospheric Wind Fields Using Multi-UAS Swarm Observations in a Synthetic Turbulent Environment
Link: https://arxiv.org/abs/2601.22111
Authors: Abdullah Tasim,Wei Sun
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments:
Abstract:Accurate reconstruction of atmospheric wind fields is essential for applications such as weather forecasting, hazard prediction, and wind energy assessment, yet conventional instruments leave spatio-temporal gaps within the lower atmospheric boundary layer. Unmanned aircraft systems (UAS) provide flexible in situ measurements, but individual platforms sample wind only along their flight trajectories, limiting full wind-field recovery. This study presents a framework for reconstructing four-dimensional atmospheric wind fields using measurements obtained from a coordinated UAS swarm. A synthetic turbulence environment and high-fidelity multirotor simulation are used to generate training and evaluation data. Local wind components are estimated from UAS dynamics using a bidirectional long short-term memory network (Bi-LSTM) and assimilated into a physics-informed neural network (PINN) to reconstruct a continuous wind field in space and time. For local wind estimation, the bidirectional LSTM achieves root-mean-square errors (RMSE) of 0.064 and 0.062 m/s for the north and east components in low-wind conditions, increasing to 0.122 to 0.129 m/s under moderate winds and 0.271 to 0.273 m/s in high-wind conditions, while the vertical component exhibits higher error, with RMSE values of 0.029 to 0.091 m/s. The physics-informed reconstruction recovers the dominant spatial and temporal structure of the wind field up to 1000 m altitude while preserving mean flow direction and vertical shear. Under moderate wind conditions, the reconstructed mean wind field achieves an overall RMSE between 0.118 and 0.154 m/s across evaluated UAS configurations, with the lowest error obtained using a five-UAS swarm. These results demonstrate that coordinated UAS measurements enable accurate and scalable four-dimensional wind-field reconstruction without dedicated wind sensors or fixed infrastructure.
[LG-5] Prior-Informed Flow Matching for Graph Reconstruction
Link: https://arxiv.org/abs/2601.22107
Authors: Harvey Chen,Nicolas Zilberstein,Santiago Segarra
Categories: Machine Learning (cs.LG)
Comments:
Abstract:We introduce Prior-Informed Flow Matching (PIFM), a conditional flow model for graph reconstruction. Reconstructing graphs from partial observations remains a key challenge; classical embedding methods often lack global consistency, while modern generative models struggle to incorporate structural priors. PIFM bridges this gap by integrating embedding-based priors with continuous-time flow matching. Grounded in a permutation equivariant version of the distortion-perception theory, our method first uses a prior, such as graphons or GraphSAGE/node2vec, to form an informed initial estimate of the adjacency matrix based on local information. It then applies rectified flow matching to refine this estimate, transporting it toward the true distribution of clean graphs and learning a global coupling. Experiments on different datasets demonstrate that PIFM consistently enhances classical embeddings, outperforming them and state-of-the-art generative baselines in reconstruction accuracy.
[LG-6] Boosting CVaR Policy Optimization with Quantile Gradients
Link: https://arxiv.org/abs/2601.22100
Authors: Yudong Luo,Erick Delage
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Optimizing Conditional Value-at-Risk (CVaR) using policy gradient (a.k.a. CVaR-PG) faces significant challenges of sample inefficiency. This inefficiency stems from the fact that it focuses on tail-end performance and overlooks many sampled trajectories. We address this problem by augmenting CVaR with an expected quantile term. Quantile optimization admits a dynamic programming formulation that leverages all sampled data, thus improving sample efficiency. This does not alter the CVaR objective since CVaR corresponds to the expectation of quantile over the tail. Empirical results in domains with verifiable risk-averse behavior show that our algorithm within the Markovian policy class substantially improves upon CVaR-PG and consistently outperforms other existing methods.
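The identity the abstract relies on, that CVaR equals the expectation of quantiles over the tail, can be checked numerically. The sketch below is only an illustration of that identity on empirical samples, not the paper's algorithm; it assumes lower-tail CVaR of returns at level `alpha`, and the function names are hypothetical.

```python
import numpy as np


def cvar_tail_mean(returns, alpha):
    """Lower-tail CVaR: mean of the worst alpha-fraction of sampled returns."""
    sorted_returns = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(sorted_returns))))
    return sorted_returns[:k].mean()


def cvar_from_quantiles(returns, alpha, grid=10_000):
    """Same quantity, computed as the average of quantiles q_u for u in (0, alpha)."""
    sorted_returns = np.sort(np.asarray(returns, dtype=float))
    n = len(sorted_returns)
    # Midpoint rule: evenly spaced quantile levels covering the tail (0, alpha).
    levels = (np.arange(grid) + 0.5) * alpha / grid
    # Empirical (inverse-CDF) quantile: smallest sample whose CDF is >= level.
    idx = np.clip(np.ceil(levels * n).astype(int) - 1, 0, n - 1)
    return sorted_returns[idx].mean()


rng = np.random.default_rng(0)
samples = rng.normal(size=1000)
print(cvar_tail_mean(samples, 0.25), cvar_from_quantiles(samples, 0.25))
```

The tail mean only touches the worst `alpha`-fraction of samples, which is the source of the sample inefficiency the abstract describes; the quantile view expresses the same objective in terms of quantities that can be estimated from all samples.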
[LG-7] Where Do the Joules Go? Diagnosing Inference Energy Consumption
Link: https://arxiv.org/abs/2601.22076
Authors: Jae-Won Chung,Ruofan Wu,Jeff J. Ma,Mosharaf Chowdhury
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: The this http URL Leaderboard v3.0 is open this https URL
Abstract:Energy is now a critical ML computing resource. While measuring energy consumption and observing trends is a valuable first step, accurately understanding and diagnosing why those differences occur is crucial for optimization. To that end, we begin by presenting a large-scale measurement study of inference time and energy across the generative AI landscape with 46 models, 7 tasks, and 1,858 different configurations on NVIDIA H100 and B200 GPUs. Our empirical findings span order-of-magnitude variations: LLM task type can lead to $25\times$ energy differences, video generation sometimes consumes more than $100\times$ the energy of images, and GPU utilization differences can result in 3–5$\times$ energy differences. Based on our observations, we present a framework for reasoning about the underlying mechanisms that govern time and energy consumption. The essence is that time and energy are determined by latent metrics like memory and utilization, which are in turn affected by various factors across the algorithm, software, and hardware layers. Our framework also extends directly to throughput per watt, a critical metric for power-constrained datacenters.
[LG-8] Making Foundation Models Probabilistic via Singular Value Ensembles
Link: https://arxiv.org/abs/2601.22068
Authors: Mehmet Ozgur Turkoglu,Dominik J. Mühlematter,Alexander Becker,Konrad Schindler,Helge Aasen
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Foundation models have become a dominant paradigm in machine learning, achieving remarkable performance across diverse tasks through large-scale pretraining. However, these models often yield overconfident, uncalibrated predictions. The standard approach to quantifying epistemic uncertainty, training an ensemble of independent models, incurs prohibitive computational costs that scale linearly with ensemble size, making it impractical for large foundation models. We propose Singular Value Ensemble (SVE), a parameter-efficient implicit ensemble method that builds on a simple but powerful core assumption: namely, that the singular vectors of the weight matrices constitute meaningful subspaces of the model’s knowledge. Pretrained foundation models encode rich, transferable information in their weight matrices. If the singular vectors are indeed meaningful (orthogonal) “knowledge directions”, then to obtain a model ensemble it suffices to modulate how strongly each direction contributes to the output. Rather than learning entirely new parameters, we freeze the singular vectors and only train per-member singular values that rescale the contribution of each direction in that shared knowledge basis. Ensemble diversity emerges naturally as stochastic initialization and random sampling of mini-batches during joint training cause different members to converge to different combinations of the same underlying knowledge. SVE achieves uncertainty quantification comparable to explicit deep ensembles while increasing the parameter count of the base model by less than 1%, making principled uncertainty estimation accessible in resource-constrained settings. We validate SVE on NLP and vision tasks with various backbones and show that it improves calibration while maintaining predictive accuracy.
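The mechanism described above (frozen singular vectors, per-member singular values) can be sketched for a single linear layer. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, training is omitted, and randomly perturbed singular values stand in for the per-member values that would be learned.

```python
import numpy as np


def decompose(weight):
    """Thin SVD of a weight matrix: weight = U @ diag(s) @ Vt."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return u, s, vt


def member_weight(u, s_member, vt):
    """Rebuild one member's weight from shared, frozen U and Vt and its own singular values."""
    return (u * s_member) @ vt  # scales column j of U by s_member[j]


def ensemble_forward(x, u, member_singular_values, vt):
    """Run every member on x; return the mean prediction and the across-member variance."""
    outputs = np.stack(
        [x @ member_weight(u, s_m, vt).T for s_m in member_singular_values]
    )
    return outputs.mean(axis=0), outputs.var(axis=0)


rng = np.random.default_rng(0)
weight = rng.normal(size=(4, 6))  # stand-in for a pretrained (out, in) weight matrix
u, s, vt = decompose(weight)

# Assumption: three members whose singular values are random perturbations of s.
members = s * (1.0 + 0.1 * rng.normal(size=(3, s.size)))
mean, variance = ensemble_forward(rng.normal(size=(5, 6)), u, members, vt)
print(mean.shape, variance.shape)
```

Each additional member costs only `s.size` extra parameters for this layer, compared with `weight.size` for a full independent copy, which is why the parameter overhead of such an ensemble stays small.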
[LG-9] Cross-Fusion Distance: A Novel Metric for Measuring Fusion and Separability Between Data Groups in Representation Space
Link: https://arxiv.org/abs/2601.22036
Authors: Xiaolong Zhang,Jianwei Zhang,Xubo Song
Categories: Machine Learning (cs.LG)
Comments: 19 pages
Abstract:Quantifying degrees of fusion and separability between data groups in representation space is a fundamental problem in representation learning, particularly under domain shift. A meaningful metric should capture fusion-altering factors like geometric displacement between representation groups, whose variations change the extent of fusion, while remaining invariant to fusion-preserving factors such as global scaling and sampling-induced layout changes, whose variations do not. Existing distributional distance metrics conflate these factors, leading to measures that are not informative of the true extent of fusion between data groups. We introduce Cross-Fusion Distance (CFD), a principled measure that isolates fusion-altering geometry while remaining robust to fusion-preserving variations, with linear computational complexity. We characterize the invariance and sensitivity properties of CFD theoretically and validate them in controlled synthetic experiments. For practical utility on real-world datasets with domain shift, CFD aligns more closely with downstream generalization degradation than commonly used alternatives. Overall, CFD provides a theoretically grounded and interpretable distance measure for representation learning.
[LG-10] Holographic generative flows with AdS/CFT
Link: https://arxiv.org/abs/2601.22033
Authors: Ehsan Mirafzali,Sanjit Shashi,Sanya Murdeshwar,Edgar Shaghoulian,Daniele Venturi,Razvan Marinescu
Categories: Machine Learning (cs.LG); General Relativity and Quantum Cosmology (gr-qc); High Energy Physics - Theory (hep-th)
Comments: v1: 13 pages, 6 figures
Abstract:We present a framework for generative machine learning that leverages the holographic principle of quantum gravity, or to be more precise its manifestation as the anti-de Sitter/conformal field theory (AdS/CFT) correspondence, with techniques for deep learning and transport theory. Our proposal is to represent the flow of data from a base distribution to some learned distribution using the bulk-to-boundary mapping of scalar fields in AdS. In the language of machine learning, we are representing and augmenting the flow-matching algorithm with AdS physics. Using a checkerboard toy dataset and MNIST, we find that our model achieves faster and higher quality convergence than comparable physics-free flow-matching models. Our method provides a physically interpretable version of flow matching. More broadly, it establishes the utility of AdS physics and geometry in the development of novel paradigms in generative modeling.
[LG-11] Per-parameter Task Arithmetic for Unlearning in Large Language Models
Link: https://arxiv.org/abs/2601.22030
Authors: Chengyi Cai,Zesheng Ye,Jiangchao Yao,Jianzhong Qi,Bo Han,Xiaolu Zhang,Feng Liu,Jun Zhou
Categories: Machine Learning (cs.LG)
Comments:
Abstract:In large language model (LLM) unlearning, private information must be removed. Task arithmetic unlearns by subtracting a specific task vector (TV), defined as the parameter difference between a privacy-information-tuned model and the original model. While efficient, it can cause over-forgetting by disrupting parameters essential for retaining other information. Motivated by the observation that each parameter exhibits different importance for forgetting versus retention, we propose a per-parameter task arithmetic (PerTA) mechanism to rescale the TV, allowing per-parameter adjustment. These weights quantify the relative importance of each parameter for forgetting versus retention, estimated via gradients (i.e., PerTA-grad) or the diagonal Fisher information approximation (i.e., PerTA-fisher). Moreover, we discuss the effectiveness of PerTA, extend it to a more general form, and provide further analysis. Extensive experiments demonstrate that PerTA consistently improves upon standard TV, and in many cases surpasses widely used training-based unlearning methods in both forgetting effectiveness and overall model utility. By retaining the efficiency of task arithmetic while mitigating over-forgetting, PerTA offers a principled and practical framework for LLM unlearning.
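The difference between standard task arithmetic and a per-parameter rescaling can be sketched on flat parameter vectors. The abstract does not give PerTA's weighting formula, so the ratio used below (forget importance divided by the sum of forget and retain importance) is an assumption chosen purely for illustration, and all function names are hypothetical.

```python
import numpy as np


def task_vector(theta_finetuned, theta_original):
    """Task vector: parameters of the privacy-tuned model minus the original model."""
    return theta_finetuned - theta_original


def standard_unlearn(theta_original, tau, scale=1.0):
    """Standard task arithmetic: subtract one globally scaled task vector."""
    return theta_original - scale * tau


def per_parameter_unlearn(theta_original, tau, forget_importance,
                          retain_importance, eps=1e-12):
    """Per-parameter variant: each coordinate of tau gets its own weight in [0, 1].

    Assumed weighting (not taken from the paper): parameters that matter mostly
    for forgetting get weight near 1, those that matter mostly for retention
    get weight near 0.
    """
    weights = forget_importance / (forget_importance + retain_importance + eps)
    return theta_original - weights * tau


theta_0 = np.array([1.0, 2.0, 3.0])
theta_ft = np.array([2.0, 2.0, 5.0])
tau = task_vector(theta_ft, theta_0)

# Importance scores could come from squared gradients (a diagonal Fisher
# approximation); here they are hand-picked toy values.
forget = np.array([1.0, 1.0, 0.0])
retain = np.array([0.0, 1.0, 1.0])
print(standard_unlearn(theta_0, tau))
print(per_parameter_unlearn(theta_0, tau, forget, retain))
```

In this toy case the third parameter matters only for retention, so the per-parameter variant leaves it untouched while standard task arithmetic shifts it by the full task vector, which is the over-forgetting effect the abstract describes.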
[LG-12] The Ensemble Inverse Problem: Applications and Methods
Link: https://arxiv.org/abs/2601.22029
Authors: Zhengyan Huan, Camila Pazos, Martin Klassen, Vincent Croft, Pierre-Hugues Beauchemin, Shuchin Aeron
Categories: Machine Learning (cs.LG)
*Notes: 26 pages, 11 figures, in peer review
Abstract:We introduce a new multivariate statistical problem that we refer to as the Ensemble Inverse Problem (EIP). The aim of EIP is to invert for an ensemble that is distributed according to the pushforward of a prior under a forward process. In high energy physics (HEP), this is related to a widely known problem called unfolding, which aims to reconstruct the true physics distribution of quantities, such as momentum and angle, from measurements that are distorted by detector effects. In recent applications, the EIP also arises in full waveform inversion (FWI) and inverse imaging with unknown priors. We propose non-iterative inference-time methods that construct posterior samplers based on a new class of conditional generative models, which we call ensemble inverse generative models. For the posterior modeling, these models additionally use the ensemble information contained in the observation set on top of single measurements. Unlike existing methods, our proposed methods avoid explicit and iterative use of the forward model at inference time via training across several sets of truth-observation pairs that are consistent with the same forward model, but originate from a wide range of priors. We demonstrate that this training procedure implicitly encodes the likelihood model. The use of ensemble information helps posterior inference and enables generalization to unseen priors. We benchmark the proposed method on several synthetic and real datasets in inverse imaging, HEP, and FWI. The codes are available at this https URL.
[LG-13] From Logits to Latents: Contrastive Representation Shaping for LLM Unlearning
Link: https://arxiv.org/abs/2601.22028
Authors: Haoran Tang, Rajiv Khanna
Categories: Machine Learning (cs.LG)
*Notes:
Abstract:Most LLM unlearning methods aim to approximate retrain-from-scratch behaviors with minimal distribution shift, often via alignment-style objectives defined in the prediction space. While effective at reducing forgotten content generation, such approaches may act as suppression: forgotten concepts can persist in representations and remain entangled with retained knowledge. We introduce CLReg, a contrastive representation regularizer that identifies forget features while pushing them away from retain features, explicitly reducing forget-retain interference with minimal shifts on retain features. We provide first theoretical insights that relate representation shaping to entanglement reduction. Across unlearning benchmarks and LLMs of different sizes, CLReg decreases forget-retain representation entanglement, which facilitates mainstream unlearning methods without posing extra privacy risks, inspiring future work that reshapes the representation space to remove forget concepts.
[LG-14] TBDFiltering: Sample-Efficient Tree-Based Data Filtering
Link: https://arxiv.org/abs/2601.22016
Authors: Robert Istvan Busa-Fekete, Julian Zimmert, Anne Xiangyi Zheng, Claudio Gentile, Andras Gyorgy
Categories: Machine Learning (cs.LG)
*Notes:
Abstract:The quality of machine learning models depends heavily on their training data. Selecting high-quality, diverse training sets for large language models (LLMs) is a difficult task, due to the lack of cheap and reliable quality metrics. While querying existing LLMs for document quality is common, this is not scalable to the large number (billions) of documents used in training. Instead, practitioners often use classifiers trained on sparse quality signals. In this paper, we propose a text-embedding-based hierarchical clustering approach that adaptively selects the documents to be evaluated by the LLM to estimate cluster quality. We prove that our method is query efficient: under the assumption that the hierarchical clustering contains a subtree such that each leaf cluster in the tree is pure enough (i.e., it mostly contains either only good or only bad documents), with high probability, the method can correctly predict the quality of each document after querying a small number of documents. The number of such documents is proportional to the size of the smallest subtree with (almost) pure leaves, without the algorithm knowing this subtree in advance. Furthermore, in a comprehensive experimental study, we demonstrate the benefits of our algorithm compared to other classifier-based filtering methods.
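The adaptive querying idea described above can be sketched roughly as follows: query a few documents per cluster, accept the cluster-level verdict when the sample looks pure, and descend into child clusters otherwise. This is a minimal sketch under stated assumptions, not the authors' algorithm: the `Cluster` type, the `filter_tree` function, the purity threshold, and the evenly spaced sampling (used here instead of random sampling so the example is reproducible) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Cluster:
    docs: List[int]                                    # document ids in this cluster
    children: List["Cluster"] = field(default_factory=list)


def filter_tree(node: Cluster, oracle: Callable[[int], bool],
                labels: Dict[int, bool], m: int = 4, purity: float = 0.9) -> int:
    """Label every document under `node`; return the number of oracle queries."""
    # Evenly spaced picks stand in for random sampling (assumption for reproducibility).
    step = max(1, len(node.docs) // m)
    sample = node.docs[::step][:m]
    votes = [oracle(d) for d in sample]                # the expensive LLM quality calls
    good_frac = sum(votes) / len(votes)
    looks_pure = good_frac >= purity or good_frac <= 1.0 - purity
    if looks_pure or not node.children:
        for d in node.docs:                            # extend the estimate to the whole cluster
            labels[d] = good_frac >= 0.5
        return len(votes)
    # Mixed cluster: descend and spend further queries on the children.
    return len(votes) + sum(filter_tree(c, oracle, labels, m, purity)
                            for c in node.children)


# Toy hierarchy: documents 0-9 are good, 10-19 are bad, and the two leaves are pure.
root = Cluster(list(range(20)), [Cluster(list(range(10))), Cluster(list(range(10, 20)))])
labels: Dict[int, bool] = {}
queries = filter_tree(root, oracle=lambda d: d < 10, labels=labels)
```

In the toy hierarchy all 20 documents receive a label after 12 oracle calls: 4 at the mixed root and 4 in each pure leaf.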
[LG-15] Putting a Face to Forgetting: Continual Learning meets Mechanistic Interpretability
Link: https://arxiv.org/abs/2601.22012
Authors: Sergi Masip, Gido M. van de Ven, Javier Ferrando, Tinne Tuytelaars
Categories: Machine Learning (cs.LG)
*Notes:
Abstract:Catastrophic forgetting in continual learning is often measured at the performance or last-layer representation level, overlooking the underlying mechanisms. We introduce a mechanistic framework that offers a geometric interpretation of catastrophic forgetting as the result of transformations to the encoding of individual features. These transformations can lead to forgetting by reducing the allocated capacity of features (worse representation) and disrupting their readout by downstream computations. Analysis of a tractable model formalizes this view, allowing us to identify best- and worst-case scenarios. Through experiments on this model, we empirically test our formal analysis and highlight the detrimental effect of depth. Finally, we demonstrate how our framework can be used in the analysis of practical models through the use of Crosscoders. We present a case study of a Vision Transformer trained on sequential CIFAR-10. Our work provides a new, feature-centric vocabulary for continual learning.
[LG-16] Exploring Diverse Generation Paths via Inference-time Stiefel Activation Steering ICLR2026
Link: https://arxiv.org/abs/2601.22010
Authors: Dongxuan Zhu, Ly Tran Ho Khanh, Andy Yat-Ming Cheung, Man-Chung Yue, Viet Anh Nguyen
Categories: Machine Learning (cs.LG)
*Notes: 34 pages, 2 figures. Accepted for publication at ICLR 2026
Abstract:Language models often default to a narrow set of high-probability outputs, leaving their generation paths homogeneous and prone to mode collapse. Sampling-based strategies inject randomness but still struggle to guarantee diversity across multiple concurrent generation runs. We address this limitation by introducing STARS (Stiefel-based Activation Steering for Diverse ReaSoning), a training-free, inference-time intervention method that transforms activation steering into an exploration engine. At each token, STARS collects the hidden activations of concurrent generation runs and optimizes multiple additive steering directions jointly on the Stiefel manifold. STARS maximizes the geometric volume of the steered activations, while the Stiefel manifold induces orthogonality of the steering interventions. This formulation explicitly promotes divergent activation vectors of concurrent generation runs, and implicitly promotes divergent generation trajectories. This manifold optimization formulation can be solved using a Riemannian gradient descent algorithm with convergence guarantees, but this algorithm is too time-consuming for real-time inference. To guarantee low latency, we further design a lightweight one-step update with an aggressive, closed-form stepsize. For test case generation and scientific discovery benchmarks, STARS consistently outperforms standard sampling methods, achieving greater diversity without sacrificing qualitative performance.
[LG-17] Rate-Distortion Optimization for Transformer Inference
Link: https://arxiv.org/abs/2601.22002
Authors: Anderson de Andrade, Alon Harell, Ivan V. Bajić
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
*Notes:
Abstract:Transformers achieve superior performance on many tasks, but impose heavy compute and memory requirements during inference. This inference can be made more efficient by partitioning the process across multiple devices, which, in turn, requires compressing its intermediate representations. In this work, we introduce a principled rate-distortion-based framework for lossy compression that learns compact encodings that explicitly trade off bitrate against accuracy. Experiments on language benchmarks show that the proposed codec achieves substantial savings with improved accuracy in some cases, outperforming more complex baseline methods. We characterize and analyze the rate-distortion performance of transformers, offering a unified lens for understanding performance in representation coding. This formulation extends information-theoretic concepts to define the gap between rate and entropy, and derive some of its bounds. We further develop probably approximately correct (PAC)-style bounds for estimating this gap. For different architectures and tasks, we empirically demonstrate that their rates are driven by these bounds, adding to the explainability of the formulation.
[LG-18] Negatives-Dominant Contrastive Learning for Generalization in Imbalanced Domains
Link: https://arxiv.org/abs/2601.21999
Authors: Meng Cao, Jiexi Liu, Songcan Chen
Categories: Machine Learning (cs.LG)
*Notes:
Abstract:Imbalanced Domain Generalization (IDG) focuses on mitigating both domain and label shifts, both of which fundamentally shape the model's decision boundaries, particularly under heterogeneous long-tailed distributions across domains. Despite its practical significance, it remains underexplored, primarily due to the technical complexity of handling their entanglement and the paucity of theoretical foundations. In this paper, we begin by theoretically establishing the generalization bound for IDG, highlighting the role of posterior discrepancy and decision margin. This bound motivates us to focus on directly steering decision boundaries, marking a clear departure from existing methods. Subsequently, we propose a novel Negatives-Dominant Contrastive Learning (NDCL) method for IDG that enhances discriminability while enforcing posterior consistency across domains. Specifically, inter-class decision-boundary separation is enhanced by placing greater emphasis on negatives as the primary signal in our contrastive learning, naturally amplifying gradient signals for minority classes to avoid the decision boundary being biased toward majority classes. Meanwhile, intra-class compactness is encouraged through a re-weighted cross-entropy strategy, and posterior consistency across domains is enforced through a prediction-central alignment strategy. Finally, rigorous yet challenging experiments on benchmarks validate the effectiveness of our NDCL. The code is available at this https URL.
[LG-19] Elign: Equivariant Diffusion Model Alignment from Foundational Machine Learning Force Fields
Link: https://arxiv.org/abs/2601.21985
Authors: Yunyang Li, Lin Huang, Luojia Xia, Wenhe Zhang, Mark Gerstein
Categories: Machine Learning (cs.LG)
*Notes:
Abstract:Generative models for 3D molecular conformations must respect Euclidean symmetries and concentrate probability mass on thermodynamically favorable, mechanically stable structures. However, E(3)-equivariant diffusion models often reproduce biases from semi-empirical training data rather than capturing the equilibrium distribution of a high-fidelity Hamiltonian. While physics-based guidance can correct this, it faces two computational bottlenecks: expensive quantum-chemical evaluations (e.g., DFT) and the need to repeat such queries at every sampling step. We present Elign, a post-training framework that amortizes both costs. First, we replace expensive DFT evaluations with a faster, pretrained foundational machine-learning force field (MLFF) to provide physical signals. Second, we eliminate repeated run-time queries by shifting physical steering to the training phase. To achieve the second amortization, we formulate reverse diffusion as a reinforcement learning problem and introduce Force–Energy Disentangled Group Relative Policy Optimization (FED-GRPO) to fine-tune the denoising policy. FED-GRPO includes a potential-based energy reward and a force-based stability reward, which are optimized and group-normalized independently. Experiments show that Elign generates conformations with lower gold-standard DFT energies and forces, while improving stability. Crucially, inference remains as fast as unguided sampling, since no energy evaluations are required during generation.
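The statement that the energy and force rewards are "group-normalized independently" can be illustrated with a small sketch. It assumes a GRPO-style standardization within each sampled group and an equal-weight sum of the two channels; the function names, the weights `w_energy` and `w_force`, and the toy reward values are hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_normalize(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one sampled group (GRPO-style advantage)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def disentangled_advantages(energy_rewards: List[float], force_rewards: List[float],
                            w_energy: float = 1.0, w_force: float = 1.0) -> List[float]:
    """Normalize each reward channel on its own, then combine (weights are assumed)."""
    a_e = group_normalize(energy_rewards)
    a_f = group_normalize(force_rewards)
    return [w_energy * e + w_force * f for e, f in zip(a_e, a_f)]


energy = [-100.0, -200.0, -300.0, -400.0]   # toy energy rewards on a large scale
force = [0.4, 0.3, 0.2, 0.1]                # toy stability rewards on a small scale
adv = disentangled_advantages(energy, force)
```

Because each channel is standardized before mixing, the large-magnitude energy reward does not drown out the small-magnitude force reward, which is one plausible motivation for normalizing the two signals separately.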
[LG-20] PowerGenie: Analytically-Guided Evolutionary Discovery of Superior Reconfigurable Power Converters
Link: https://arxiv.org/abs/2601.21984
Authors: Jian Gao, Yiwei Zou, Abhishek Pradhan, Wenhao Huang, Yumin Su, Kaiyuan Yang, Xuan Zhang
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*Notes:
Abstract:Discovering superior circuit topologies requires navigating an exponentially large design space, a challenge traditionally reserved for human experts. Existing AI methods either select from predefined templates or generate novel topologies at a limited scale without rigorous verification, leaving large-scale performance-driven discovery underexplored. We present PowerGenie, a framework for automated discovery of higher-performance reconfigurable power converters at scale. PowerGenie introduces: (1) an automated analytical framework that determines converter functionality and theoretical performance limits without component sizing or SPICE simulation, and (2) an evolutionary finetuning method that co-evolves a generative model with its training distribution through fitness selection and uniqueness verification. Unlike existing methods that suffer from mode collapse and overfitting, our approach achieves higher syntax validity, function validity, novelty rate, and figure-of-merit (FoM). PowerGenie discovers a novel 8-mode reconfigurable converter with 23% higher FoM than the best training topology. SPICE simulations confirm average absolute efficiency gains of 10% across 8 modes and up to 17% at a single mode. Code is available at this https URL.
[LG-21] Investigating Batch Inference in a Sequential Monte Carlo Framework for Neural Networks
Link: https://arxiv.org/abs/2601.21983
Authors: Andrew Millard, Joshua Murphy, Peter Green, Simon Maskell
Categories: Machine Learning (cs.LG)
*Notes:
Abstract:Bayesian inference allows us to define a posterior distribution over the weights of a generic neural network (NN). Exact posteriors are usually intractable, in which case approximations can be employed. One such approximation, variational inference, is computationally efficient when using mini-batch stochastic gradient descent, as subsets of the data are used for likelihood and gradient evaluations, though the approach relies on the selection of a variational distribution which sufficiently matches the form of the posterior. Particle-based methods such as Markov chain Monte Carlo and Sequential Monte Carlo (SMC) do not assume a parametric family for the posterior but typically incur higher computational cost. These sampling methods typically use the full batch of data for likelihood and gradient evaluations, which contributes to this computational expense. We explore several methods of gradually introducing more mini-batches of data (data annealing) into likelihood and gradient evaluations of an SMC sampler. We find that we can achieve up to $6\times$ faster training with minimal loss in accuracy on benchmark image classification problems using NNs.
[LG-22] Investigation into using stochastic embedding representations for evaluating the trustworthiness of the Fréchet Inception Distance
Link: https://arxiv.org/abs/2601.21979
Authors: Ciaran Bench, Vivek Desai, Carlijn Roozemond, Ruben van Engen, Spencer A. Thomas
Categories: Machine Learning (cs.LG)
*Notes:
Abstract:Feature embeddings acquired from pretrained models are widely used in medical applications of deep learning to assess the characteristics of datasets; e.g. to determine the quality of synthetic, generated medical images. The Fréchet Inception Distance (FID) is one popular synthetic image quality metric that relies on the assumption that the characteristic features of the data can be detected and encoded by an InceptionV3 model pretrained on ImageNet1K (natural images). While it is widely known that this makes it less effective for applications involving medical images, the extent to which the metric fails to capture meaningful differences in image characteristics is not well understood. Here, we use Monte Carlo dropout to compute the predictive variance in the FID as well as a supplemental estimate of the predictive variance in the feature embedding model's latent representations. We show that the magnitudes of the predictive variances considered exhibit varying degrees of correlation with the extent to which test inputs (ImageNet1K validation set augmented at various strengths, and other external datasets) are out-of-distribution relative to the model's training data, providing some insight into the effectiveness of their use as indicators of the trustworthiness of the FID.
[LG-23] Bridging Graph Structure and Knowledge-Guided Editing for Interpretable Temporal Knowledge Graph Reasoning
Link: https://arxiv.org/abs/2601.21978
Authors: Shiqi Fan, Quanming Yao, Hongyi Nie, Wentao Ma, Zhen Wang, Wen Hua
Categories: Machine Learning (cs.LG)
*Notes:
Abstract:Temporal knowledge graph reasoning (TKGR) aims to predict future events by inferring missing entities with dynamic knowledge structures. Existing LLM-based reasoning methods prioritize contextual over structural relations, struggling to extract relevant subgraphs from dynamic graphs. This limits structural information understanding, leading to unstructured, hallucination-prone inferences, especially with temporal inconsistencies. To address this problem, we propose IGETR (Integration of Graph and Editing-enhanced Temporal Reasoning), a hybrid reasoning framework that combines the structured temporal modeling capabilities of Graph Neural Networks (GNNs) with the contextual understanding of LLMs. IGETR operates through a three-stage pipeline. The first stage aims to ground the reasoning process in the actual data by identifying structurally and temporally coherent candidate paths through a temporal GNN, ensuring that inference starts from reliable graph-based evidence. The second stage introduces LLM-guided path editing to address logical and semantic inconsistencies, leveraging external knowledge to refine and enhance the initial paths. The final stage focuses on integrating the refined reasoning paths to produce predictions that are both accurate and interpretable. Experiments on standard TKG benchmarks show that IGETR achieves state-of-the-art performance, outperforming strong baselines with relative improvements of up to 5.6% on Hits@1 and 8.1% on Hits@3 on the challenging ICEWS datasets. Additionally, ablation studies and further analyses confirm the effectiveness of each component.
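For readers unfamiliar with the reported metrics, the sketch below shows how Hits@k and a relative improvement are typically computed. The function names and the toy rankings are illustrative assumptions; none of the numbers come from the paper.

```python
from typing import Sequence


def hits_at_k(ranked_candidates: Sequence[Sequence[str]],
              answers: Sequence[str], k: int) -> float:
    """Fraction of queries whose true entity appears among the top-k candidates."""
    hits = sum(1 for ranking, gold in zip(ranked_candidates, answers)
               if gold in ranking[:k])
    return hits / len(answers)


def relative_improvement(baseline: float, new: float) -> float:
    """Relative gain of `new` over `baseline`, e.g. 0.056 means 5.6%."""
    return (new - baseline) / baseline


# Four toy queries whose correct entity is always "A".
rankings = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "A"], ["B", "C", "A"]]
answers = ["A", "A", "A", "A"]
h1 = hits_at_k(rankings, answers, k=1)   # only the first query ranks "A" first
h3 = hits_at_k(rankings, answers, k=3)   # every query has "A" in its top three
gain = relative_improvement(0.50, 0.528)
```

Note that a relative improvement of 5.6% on Hits@1 is a ratio of scores, not a gain of 5.6 percentage points.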
[LG-24] From Tokens to Blocks: A Block-Diffusion Perspective on Molecular Generation
Link: https://arxiv.org/abs/2601.21964
Authors: Qianwei Yang, Dong Xu, Zhangfan Yang, Sisi Yuan, Zexuan Zhu, Jianqiang Li, Junkai Ji
Categories: Machine Learning (cs.LG)
*Notes: 30 pages, 13 figures, 11 tables
Abstract:Drug discovery can be viewed as a combinatorial search over an immense chemical space, motivating the development of deep generative models for de novo molecular design. Among these, GPT-based molecular language models (MLMs) have shown strong molecular design performance by learning chemical syntax and semantics from large-scale data. However, existing MLMs face two fundamental limitations: they inadequately capture the graph-structured nature of molecules when formulated as next-token prediction problems, and they typically lack explicit mechanisms for target-aware generation. Here, we propose SoftMol, a unified framework that co-designs molecular representation, model architecture, and search strategy for target-aware molecular generation. SoftMol introduces soft fragments, a rule-free block representation of SMILES that enables diffusion-native modeling, and develops SoftBD, the first block-diffusion molecular language model that combines local bidirectional diffusion with autoregressive generation under molecular structural constraints. To favor generated molecules with high drug-likeness and synthetic accessibility, SoftBD is trained on a carefully curated dataset named ZINC-Curated. SoftMol further integrates a gated Monte Carlo tree search to assemble fragments in a target-aware manner. Experimental results show that, compared with current state-of-the-art models, SoftMol achieves 100% chemical validity, improves binding affinity by 9.7%, yields a 2-3x increase in molecular diversity, and delivers a 6.6x speedup in inference efficiency. Code is available at this https URL.
[LG-25] Uncertainty-Aware Data-Based Method for Fast and Reliable Shape Optimization
Link: https://arxiv.org/abs/2601.21956
Authors: Yunjia Yang, Runze Li, Yufei Zhang, Haixin Chen
Categories: Machine Learning (cs.LG)
*Notes:
Abstract:Data-based optimization (DBO) offers a promising approach for efficiently optimizing shapes for better aerodynamic performance by leveraging a pretrained surrogate model for offline evaluations during iterations. However, DBO heavily relies on the quality of the training database. Samples outside the training distribution encountered during optimization can lead to significant prediction errors, potentially misleading the optimization process. Therefore, incorporating uncertainty quantification into optimization is critical for detecting outliers and enhancing robustness. This study proposes an uncertainty-aware data-based optimization (UA-DBO) framework to monitor and minimize surrogate model uncertainty during DBO. A probabilistic encoder-decoder surrogate model is developed to predict uncertainties associated with its outputs, and these uncertainties are integrated into a model-confidence-aware objective function to penalize samples with large prediction errors during the data-based optimization process. The UA-DBO framework is evaluated on two multipoint optimization problems aimed at improving airfoil drag divergence and buffet performance. Results demonstrate that UA-DBO consistently reduces prediction errors in optimized samples and achieves superior performance gains compared to the original DBO. Moreover, compared to multipoint optimization based on full computational simulations, UA-DBO offers comparable optimization effectiveness while significantly accelerating optimization.
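One simple way to read a "model-confidence-aware objective" is as a predicted performance term plus a penalty on the surrogate's predicted uncertainty. The sketch below assumes that additive form; the penalty weight `beta`, the one-dimensional design variable, and the `toy_surrogate` function are hypothetical stand-ins for the paper's encoder-decoder model and are not taken from it.

```python
from typing import Callable, Tuple


def confidence_aware_objective(predict: Callable[[float], Tuple[float, float]],
                               beta: float = 1.0) -> Callable[[float], float]:
    """Wrap a probabilistic surrogate so that uncertain designs are penalized."""
    def objective(x: float) -> float:
        mean, std = predict(x)        # surrogate prediction and its uncertainty
        return mean + beta * std      # minimize prediction plus uncertainty penalty
    return objective


def toy_surrogate(x: float) -> Tuple[float, float]:
    """Hypothetical surrogate trained on x in [0, 1]; uncertainty grows outside."""
    mean = (x - 1.5) ** 2                       # drag-like quantity to minimize
    std = 0.01 + max(0.0, x - 1.0) * 2.0        # confident only inside the training range
    return mean, std


candidates = [i / 10 for i in range(0, 21)]      # 0.0, 0.1, ..., 2.0
naive_best = min(candidates, key=lambda x: toy_surrogate(x)[0])
robust_best = min(candidates, key=confidence_aware_objective(toy_surrogate, beta=1.0))
```

Without the penalty the optimizer drifts to x = 1.5, outside the region the surrogate was trained on; with it, the selected design stays at the edge of the trusted range, x = 1.0.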
[LG-26] Embracing Aleatoric Uncertainty in Medical Multimodal Learning with Missing Modalities
Link: https://arxiv.org/abs/2601.21950
Authors: Linxiao Gong, Yang Liu, Lianlong Sun, Yulai Bi, Jing Liu, Xiaoguang Zhu
Categories: Machine Learning (cs.LG)
*Notes:
Abstract:Medical multimodal learning faces significant challenges from the missing modalities prevalent in clinical practice. Existing approaches assume equal contributions from all modalities and random missing patterns, neglecting the inherent uncertainty in medical data acquisition. To address this, we propose Aleatoric Uncertainty Modeling (AUM), which explicitly quantifies unimodal aleatoric uncertainty to handle missing modalities. Specifically, AUM models each unimodal representation as a multivariate Gaussian distribution to capture aleatoric uncertainty and enable principled modality reliability quantification. To adaptively aggregate the captured information, we develop a dynamic message-passing mechanism within a bipartite patient-modality graph using an uncertainty-aware aggregation mechanism. Through this process, missing modalities are naturally accommodated, while more reliable information from available modalities is dynamically emphasized to guide representation generation. Our AUM framework achieves an improvement of 2.26% AUC-ROC on MIMIC-IV mortality prediction and a 2.17% gain on eICU, outperforming existing state-of-the-art approaches.
[LG-27] Dependence of Equilibrium Propagation Training Success on Network Architecture
Link: https://arxiv.org/abs/2601.21945
Authors: Qingshan Wang, Clara C. Wanjura, Florian Marquardt
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
*Notes: 9 pages, 5 figures
Abstract:The rapid rise of artificial intelligence has led to an unsustainable growth in energy consumption. This has motivated progress in neuromorphic computing and physics-based training of learning machines as alternatives to digital neural networks. Many theoretical studies focus on simple architectures like all-to-all or densely connected layered networks. However, these may be challenging to realize experimentally, e.g. due to connectivity constraints. In this work, we investigate the performance of the widespread physics-based training method of equilibrium propagation for more realistic architectural choices, specifically, locally connected lattices. We train an XY model and explore the influence of architecture on various benchmark tasks, tracking the evolution of spatially distributed responses and couplings during training. Our results show that sparse networks with only local connections can achieve performance comparable to dense networks. Our findings provide guidelines for further scaling up architectures based on equilibrium propagation in realistic settings.
[LG-28] Clarity: The Flexibility-Interpretability Trade-Off in Sparsity-aware Concept Bottleneck Models
Link: https://arxiv.org/abs/2601.21944
Authors: Konstantinos P. Panousis, Diego Marcos
Categories: Machine Learning (cs.LG)
*Notes:
Abstract:The widespread adoption of Vision-Language Models (VLMs) across fields has amplified concerns about model interpretability. Distressingly, these models are often treated as black boxes, with limited or non-existent investigation of their decision-making process. Despite numerous post- and ante-hoc interpretability methods, systematic and objective evaluation of the learned representations remains limited, particularly for sparsity-aware methods that are increasingly considered to "induce interpretability". In this work, we focus on Concept Bottleneck Models and investigate how different modeling decisions affect the emerging representations. We introduce the notion of clarity, a measure capturing the interplay between the downstream performance and the sparsity and precision of the concept representation, while proposing an interpretability assessment framework using datasets with ground truth concept annotations. We consider both VLM- and attribute predictor-based CBMs, and three different sparsity-inducing strategies: per-example $\ell_1$, $\ell_0$ and Bernoulli-based formulations. Our experiments reveal a critical trade-off between flexibility and interpretability, under which a given method can exhibit markedly different behaviors even at comparable performance levels. The code will be made publicly available upon publication.
[LG-29] Entropy-Based Dimension-Free Convergence and Loss-Adaptive Schedules for Diffusion Models
Link: https://arxiv.org/abs/2601.21943
Authors: Ahmad Aghapour, Erhan Bayraktar, Ziqing Zhang
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
*Notes:
Abstract:Diffusion generative models synthesize samples by discretizing reverse-time dynamics driven by a learned score (or denoiser). Existing convergence analyses of diffusion models typically scale at least linearly with the ambient dimension, and sharper rates often depend on intrinsic-dimension assumptions or other geometric restrictions on the target distribution. We develop an alternative, information-theoretic approach to dimension-free convergence that avoids any geometric assumptions. Under mild assumptions on the target distribution, we bound the KL divergence between the target and generated distributions by $O(H^2/K)$ (up to endpoint factors), where $H$ is the Shannon entropy and $K$ is the number of sampling steps. Moreover, using a reformulation of the KL divergence, we propose a Loss-Adaptive Schedule (LAS) for efficient discretization of the reverse SDE, which is lightweight and relies only on the training loss, requiring no heavy post-training computation. Empirically, LAS improves sampling quality over common heuristic schedules.
[LG-30] LoRIF: Low-Rank Influence Functions for Scalable Training Data Attribution
Link: https://arxiv.org/abs/2601.21929
Authors: Shuangqi Li, Hieu Le, Jingyi Xu, Mathieu Salzmann
Categories: Machine Learning (cs.LG)
*Notes:
Abstract:Training data attribution (TDA) identifies which training examples most influenced a model's prediction. The best-performing TDA methods exploit gradients to define an influence function. To overcome the scalability challenge arising from gradient computation, the most popular strategy is random projection (e.g., TRAK, LoGRA). However, this still faces two bottlenecks when scaling to large training sets and high-quality attribution: (i) storing and loading projected per-example gradients for all $N$ training examples, where query latency is dominated by I/O; and (ii) forming the $D \times D$ inverse Hessian approximation, which costs $O(D^2)$ memory. Both bottlenecks scale with the projection dimension $D$, yet increasing $D$ is necessary for attribution quality, creating a quality-scalability tradeoff. We introduce LoRIF (Low-Rank Influence Functions), which exploits the low-rank structure of gradients to address both bottlenecks. First, we store rank-$c$ factors of the projected per-example gradients rather than full matrices, reducing storage and query-time I/O from $O(D)$ to $O(c\sqrt{D})$ per layer per sample. Second, we use truncated SVD with the Woodbury identity to approximate the Hessian term in an $r$-dimensional subspace, reducing memory from $O(D^2)$ to $O(Dr)$. On models from 0.1B to 70B parameters trained on datasets with millions of examples, LoRIF achieves up to $20\times$ storage reduction and query-time speedup compared to LoGRA, while matching or exceeding its attribution quality. LoRIF makes gradient-based TDA practical at frontier scale.
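The memory saving from combining a truncated decomposition with the Woodbury identity can be seen in a small sketch. It assumes a damped low-rank Hessian approximation of the form U diag(s) U^T + λI with orthonormal columns in U; the damping term `lam` and the function name `apply_inverse_low_rank` are assumptions for illustration, since the abstract does not specify LoRIF's exact formulation.

```python
from typing import List


def apply_inverse_low_rank(U: List[List[float]], s: List[float], lam: float,
                           v: List[float]) -> List[float]:
    """Compute (U diag(s) U^T + lam * I)^{-1} v without forming the D x D matrix.

    U holds r orthonormal directions of length D (stored as r lists) and s the
    r retained eigenvalues. By the Woodbury identity the inverse equals
    (1 / lam) * (I - U diag(s_i / (s_i + lam)) U^T), so only O(D r) numbers
    are ever stored.
    """
    out = [x / lam for x in v]
    for u, s_i in zip(U, s):
        # Projection of v on this direction, rescaled by the Woodbury correction.
        coeff = sum(a * b for a, b in zip(u, v)) * s_i / (s_i + lam) / lam
        out = [o - coeff * a for o, a in zip(out, u)]
    return out


# D = 3, r = 1: the approximated matrix is diag(3, 0, 0) + 1 * I = diag(4, 1, 1).
U = [[1.0, 0.0, 0.0]]
s = [3.0]
x = apply_inverse_low_rank(U, s, lam=1.0, v=[4.0, 2.0, 3.0])
```

In the toy case the result is the vector divided elementwise by (4, 1, 1), obtained from one stored direction rather than a full 3 x 3 inverse.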
[LG-31] Optimistic Transfer under Task Shift via Bellman Alignment
Link: https://arxiv.org/abs/2601.21924
Authors: Jinhang Chai, Enpei Zhang, Elynn Chen, Yujun Yan
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Notes:
Abstract:We study online transfer reinforcement learning (RL) in episodic Markov decision processes, where experience from related source tasks is available during learning on a target task. A fundamental difficulty is that task similarity is typically defined in terms of rewards or transitions, whereas online RL algorithms operate on Bellman regression targets. As a result, naively reusing source Bellman updates introduces systematic bias and invalidates regret guarantees. We identify one-step Bellman alignment as the correct abstraction for transfer in online RL and propose re-weighted targeting (RWT), an operator-level correction that retargets continuation values and compensates for transition mismatch via a change of measure. RWT reduces task mismatch to a fixed one-step correction and enables statistically sound reuse of source data. This alignment yields a two-stage RWT Q-learning framework that separates variance reduction from bias correction. Under RKHS function approximation, we establish regret bounds that scale with the complexity of the task shift rather than the target MDP. Empirical results in both tabular and neural network settings demonstrate consistent improvements over single-task learning and naïve pooling, highlighting Bellman alignment as a model-agnostic transfer principle for online RL.
[LG-32] Hardware-Triggered Backdoors
Link: https://arxiv.org/abs/2601.21902
Authors: Jonas Möller, Erik Imgrund, Thorsten Eisenhofer, Konrad Rieck
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*Notes:
Abstract:Machine learning models are routinely deployed on a wide range of computing hardware. Although such hardware is typically expected to produce identical results, differences in its design can lead to small numerical variations during inference. In this work, we show that these variations can be exploited to create backdoors in machine learning models. The core idea is to shape the model’s decision function such that it yields different predictions for the same input when executed on different hardware. This effect is achieved by locally moving the decision boundary close to a target input and then refining numerical deviations to flip the prediction on selected hardware. We empirically demonstrate that these hardware-triggered backdoors can be created reliably across common GPU accelerators. Our findings reveal a novel attack vector affecting the use of third-party models, and we investigate different defenses to counter this threat.
[LG-33] Breaking the Regional Barrier: Inductive Semantic Topology Learning for Worldwide Air Quality Forecasting
Link: https://arxiv.org/abs/2601.21899
Authors: Zhiqing Cui, Siru Zhong, Ming Jin, Shirui Pan, Qingsong Wen, Yuxuan Liang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Global air quality forecasting grapples with extreme spatial heterogeneity and the poor generalization of existing transductive models to unseen regions. To tackle this, we propose OmniAir, a semantic topology learning framework tailored for global station-level prediction. By encoding invariant physical environmental attributes into generalizable station identities and dynamically constructing adaptive sparse topologies, our approach effectively captures long-range non-Euclidean correlations and physical diffusion patterns across unevenly distributed global networks. We further curate WorldAir, a massive dataset covering over 7,800 stations worldwide. Extensive experiments show that OmniAir achieves state-of-the-art performance against 18 baselines, maintaining high efficiency and scalability with speeds nearly 10 times faster than existing models, while effectively bridging the monitoring gap in data-sparse regions.
[LG-34] A Low-Complexity Plug-and-Play Deep Learning Model for Generalizable Massive MIMO Precoding
Link: https://arxiv.org/abs/2601.21897
Authors: Ali Hasanzadeh Karkan, Ahmed Ibrahim, Jean-François Frigon, François Leduc-Primeau
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:
Abstract:Massive multiple-input multiple-output (mMIMO) downlink precoding offers high spectral efficiency but remains challenging to deploy in practice because near-optimal algorithms such as the weighted minimum mean squared error (WMMSE) are computationally expensive and sensitive to SNR and channel-estimation quality, while existing deep learning (DL)-based solutions often lack robustness and require retraining for each deployment site. This paper proposes a plug-and-play precoder (PaPP), a DL framework with a backbone that can be trained for either fully digital (FDP) or hybrid beamforming (HBF) precoding and reused across sites, transmit-power levels, and varying amounts of channel estimation error, avoiding the need to train a new model from scratch at each deployment. PaPP combines a high-capacity teacher and a compact student with a self-supervised loss that balances teacher imitation and normalized sum-rate, trained using meta-learning domain-generalization and transmit-power-aware input normalization. Numerical results on ray-tracing data from three unseen sites show that the PaPP FDP and HBF models both outperform conventional and deep learning baselines after fine-tuning with a small set of local unlabeled samples. Across both architectures, PaPP achieves more than 21× reduction in modeled computation energy and maintains good performance under channel-estimation errors, making it a practical solution for energy-efficient mMIMO precoding.
[LG-35] Not All Code Is Equal: A Data-Centric Study of Code Complexity and LLM Reasoning
Link: https://arxiv.org/abs/2601.21894
Authors: Lukas Twist, Shu Yang, Hanqi Yan, Jingzhi Gong, Di Wang, Helen Yannakoudakis, Jie M. Zhang
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 5 figures, 3 tables
Abstract:Large Language Models (LLMs) increasingly exhibit strong reasoning abilities, often attributed to their capacity to generate chain-of-thought-style intermediate reasoning. Recent work suggests that exposure to code can further enhance these skills, but existing studies largely treat code as a generic training signal, leaving open the question of which properties of code actually contribute to improved reasoning. To address this gap, we study the structural complexity of code, which captures control flow and compositional structure that may shape how models internalise multi-step reasoning during fine-tuning. We examine two complementary settings: solution-driven complexity, where complexity varies across multiple solutions to the same problem, and problem-driven complexity, where complexity reflects variation in the underlying tasks. Using cyclomatic complexity and logical lines of code to construct controlled fine-tuning datasets, we evaluate a range of open-weight LLMs on diverse reasoning benchmarks. Our findings show that although code can improve reasoning, structural properties strongly determine its usefulness. In 83% of experiments, restricting fine-tuning data to a specific structural complexity range outperforms training on structurally diverse code, pointing to a data-centric path for improving reasoning beyond scaling.
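As a rough illustration of the cyclomatic complexity metric named in the abstract, the sketch below estimates it for a Python snippet using only the standard-library `ast` module. The function name `cyclomatic_complexity` and the exact set of counted node types are assumptions made for this example; the abstract does not describe the tooling the authors used.

```python
import ast

# Node types treated as decision points in this simplified sketch (an assumption,
# not a complete definition of the metric).
_DECISION_NODES = (
    ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler, ast.comprehension,
)


def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity as 1 + number of decision points."""
    tree = ast.parse(source)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, _DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra branches.
            complexity += len(node.values) - 1
    return complexity


SAMPLE = (
    "def f(x):\n"
    "    if x > 0 and x < 10:\n"
    "        return 1\n"
    "    for i in range(x):\n"
    "        pass\n"
    "    return 0\n"
)

if __name__ == "__main__":
    print(cyclomatic_complexity(SAMPLE))
```

A filter built on such a score could keep only fine-tuning samples within a chosen complexity range, which is the kind of restriction the abstract reports on.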
[LG-36] Managing Solution Stability in Decision-Focused Learning with Cost Regularization
Link: https://arxiv.org/abs/2601.21883
Authors: Victor Spitzer, Francois Sanson
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Decision-focused learning integrates predictive modeling and combinatorial optimization by training models to directly improve decision quality rather than prediction accuracy alone. Differentiating through combinatorial optimization problems represents a central challenge, and recent approaches tackle this difficulty by introducing perturbation-based approximations. In this work, we focus on estimating the objective function coefficients of a combinatorial optimization problem. Our study demonstrates that fluctuations in perturbation intensity occurring during the learning phase can lead to ineffective training, by establishing a theoretical link to the notion of solution stability in combinatorial optimization. We propose addressing this issue by introducing a regularization of the estimated cost vectors which improves the robustness and reliability of the learning process, as demonstrated by extensive numerical experiments.
[LG-37] How Expressive Are Graph Neural Networks in the Presence of Node Identifiers?
Link: https://arxiv.org/abs/2601.21882
Authors: Arie Soeteman, Michael Benedikt, Martin Grohe, Balder ten Cate
Categories: Logic in Computer Science (cs.LO); Machine Learning (cs.LG)
Comments: 35 pages
Abstract:Graph neural networks (GNNs) are a widely used class of machine learning models for graph-structured data, based on local aggregation over neighbors. GNNs have close connections to logic. In particular, their expressive power is linked to that of modal logics and bounded-variable logics with counting. In many practical scenarios, graphs processed by GNNs have node features that act as unique identifiers. In this work, we study how such identifiers affect the expressive power of GNNs. We initiate a study of the key-invariant expressive power of GNNs, inspired by the notion of order-invariant definability in finite model theory: which node queries that depend only on the underlying graph structure can GNNs express on graphs with unique node identifiers? We provide answers for various classes of GNNs with local max- or sum-aggregation.
[LG-38] Low-Rank Plus Sparse Matrix Transfer Learning under Growing Representations and Ambient Dimensions
Link: https://arxiv.org/abs/2601.21873
Authors: Jinhang Chai, Xuyuan Liu, Elynn Chen, Yujun Yan
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Learning systems often expand their ambient features or latent representations over time, embedding earlier representations into larger spaces with limited new latent structure. We study transfer learning for structured matrix estimation under simultaneous growth of the ambient dimension and the intrinsic representation, where a well-estimated source task is embedded as a subspace of a higher-dimensional target task. We propose a general transfer framework in which the target parameter decomposes into an embedded source component, low-dimensional low-rank innovations, and sparse edits, and develop an anchored alternating projection estimator that preserves transferred subspaces while estimating only low-dimensional innovations and sparse modifications. We establish deterministic error bounds that separate target noise, representation growth, and source estimation error, yielding strictly improved rates when rank and sparsity increments are small. We demonstrate the generality of the framework by applying it to two canonical problems. For Markov transition matrix estimation from a single trajectory, we derive end-to-end theoretical guarantees under dependent noise. For structured covariance estimation under enlarged dimensions, we provide complementary theoretical analysis in the appendix and empirically validate consistent transfer gains.
[LG-39] LEMUR: Learned Multi-Vector Retrieval
Link: https://arxiv.org/abs/2601.21853
Authors: Elias Jääsaari, Ville Hyvönen, Teemu Roos
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 17 pages
Abstract:Multi-vector representations generated by late interaction models, such as ColBERT, enable superior retrieval quality compared to single-vector representations in information retrieval applications. In multi-vector retrieval systems, both queries and documents are encoded using one embedding for each token, and similarity between queries and documents is measured by the MaxSim similarity measure. However, the improved recall of multi-vector retrieval comes at the expense of significantly increased latency. This necessitates designing efficient approximate nearest neighbor search (ANNS) algorithms for multi-vector search. In this work, we introduce LEMUR, a simple-yet-efficient framework for multi-vector similarity search. LEMUR consists of two consecutive problem reductions: We first formulate multi-vector similarity search as a supervised learning problem that can be solved using a one-hidden-layer neural network. Second, we reduce inference under this model to single-vector similarity search in its latent space, which enables the use of existing single-vector ANNS methods for speeding up retrieval. In addition to performance evaluation on ColBERTv2 embeddings, we evaluate LEMUR on embeddings generated by modern multi-vector text models and multi-vector visual document retrieval models. LEMUR is an order of magnitude faster than earlier multi-vector similarity search methods.
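The MaxSim measure mentioned in the abstract is commonly defined as the sum, over query token embeddings, of the maximum dot product against any document token embedding. The sketch below shows that exhaustive baseline in plain Python; the function name `maxsim` and the toy vectors are illustrative assumptions, and this is the brute-force scoring that methods such as LEMUR aim to speed up, not LEMUR itself.

```python
from typing import Sequence

Vector = Sequence[float]


def maxsim(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Sum over query tokens of the best dot product with any document token."""

    def dot(u: Vector, v: Vector) -> float:
        return sum(a * b for a, b in zip(u, v))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy embeddings (hypothetical): two query tokens, two candidate documents.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5]]
doc_b = [[0.0, 1.0]]

if __name__ == "__main__":
    ranked = sorted(
        {"doc_a": maxsim(query, doc_a), "doc_b": maxsim(query, doc_b)}.items(),
        key=lambda item: item[1],
        reverse=True,
    )
    print(ranked)
```

Scoring every document this way costs one dot product per query-token and document-token pair, which is the latency problem the abstract refers to.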
[LG-40] Visual Disentangled Diffusion Autoencoders: Scalable Counterfactual Generation for Foundation Models
Link: https://arxiv.org/abs/2601.21851
Authors: Sidney Bender, Marco Morik
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Foundation models, despite their robust zero-shot capabilities, remain vulnerable to spurious correlations and ‘Clever Hans’ strategies. Existing mitigation methods often rely on unavailable group labels or computationally expensive gradient-based adversarial optimization. To address these limitations, we propose Visual Disentangled Diffusion Autoencoders (DiDAE), a novel framework integrating frozen foundation models with disentangled dictionary learning for efficient, gradient-free counterfactual generation directly for the foundation model. DiDAE first edits foundation model embeddings in interpretable disentangled directions of the disentangled dictionary and then decodes them via a diffusion autoencoder. This allows the generation of multiple diverse, disentangled counterfactuals for each factual, much faster than existing baselines, which generate single entangled counterfactuals. When paired with Counterfactual Knowledge Distillation, DiDAE-CFKD achieves state-of-the-art performance in mitigating shortcut learning, improving downstream performance on unbalanced datasets.
[LG-41] READY: Reward Discovery for Meta-Black-Box Optimization
Link: https://arxiv.org/abs/2601.21847
Authors: Zechuan Huang, Zhiguang Cao, Hongshu Guo, Yue-Jiao Gong, Zeyuan Ma
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:Meta-Black-Box Optimization (MetaBBO) is an emerging avenue within the optimization community, where an algorithm design policy can be meta-learned by reinforcement learning to enhance optimization performance. So far, the reward functions in existing MetaBBO works have been designed by human experts, introducing design bias and the risk of reward hacking. In this paper, we use a Large Language Model (LLM) as an automated reward discovery tool for MetaBBO. Specifically, we consider both effectiveness and efficiency. On the effectiveness side, we borrow the idea of evolution of heuristics, introducing a tailored evolution paradigm into the iterative LLM-based program search process, which ensures continuous improvement. On the efficiency side, we additionally introduce a multi-task evolution architecture to support parallel reward discovery for diverse MetaBBO approaches. This parallel process also benefits from knowledge sharing across tasks to accelerate convergence. Empirical results demonstrate that the reward functions discovered by our approach can boost existing MetaBBO works, underscoring the importance of reward design in MetaBBO. We provide READY’s project at this https URL.
[LG-42] Constrained Meta Reinforcement Learning with Provable Test-Time Safety
Link: https://arxiv.org/abs/2601.21845
Authors: Tingting Ni, Maryam Kamgarpour
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Meta reinforcement learning (RL) allows agents to leverage experience across a distribution of tasks on which the agent can train at will, enabling faster learning of optimal policies on new test tasks. Although meta RL has been successful in improving sample complexity on test tasks, many real-world applications, such as robotics and healthcare, impose safety constraints during testing. Constrained meta RL provides a promising framework for integrating safety into meta RL. An open question in constrained meta RL is how to ensure the safety of the policy on the real-world test task while reducing the sample complexity and thus enabling faster learning of optimal policies. To address this gap, we propose an algorithm that refines policies learned during training, with provable safety and sample complexity guarantees for learning a near-optimal policy on the test tasks. We further derive a matching lower bound, showing that this sample complexity is tight.
[LG-43] Scalable Linearized Laplace Approximation via Surrogate Neural Kernel
Link: https://arxiv.org/abs/2601.21835
Authors: Luis A. Ortega, Simón Rodríguez-Santana, Daniel Hernández-Lobato
Categories: Machine Learning (cs.LG)
Comments: 6 pages, 1 table. Accepted at European Symposium on Artificial Neural Networks (ESANN 2026) as oral presentation
Abstract:We introduce a scalable method to approximate the kernel of the Linearized Laplace Approximation (LLA). For this, we use a surrogate deep neural network (DNN) that learns a compact feature representation whose inner product replicates the Neural Tangent Kernel (NTK). This avoids the need to compute large Jacobians. Training relies solely on efficient Jacobian-vector products, making it possible to compute predictive uncertainty on large-scale pre-trained DNNs. Experimental results show similar or improved uncertainty estimation and calibration compared to existing LLA approximations. Furthermore, biasing the learned kernel significantly enhances out-of-distribution detection. This highlights the benefits of the proposed method for finding better kernels than the NTK in the context of LLA to compute prediction uncertainty given a pre-trained DNN.
[LG-44] Goal-Driven Adaptive Sampling Strategies for Machine Learning Models Predicting Fields
Link: https://arxiv.org/abs/2601.21832
Authors: Jigar Parekh, Philipp Bekemeyer
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Machine learning models are widely regarded as a way forward to tackle multi-query challenges that arise when expensive black-box simulations such as computational fluid dynamics are investigated. However, ensuring the desired level of accuracy for a certain task at minimal computational cost, e.g. with as few black-box samples as possible, remains a challenge. Active learning strategies are used for scalar quantities to overcome this challenge, and different so-called infill criteria exist and are commonly employed in several scenarios. Even though it is needed in various fields, an extension of active learning strategies towards field predictions is still lacking or limited to very specific scenarios and/or model types. In this paper we propose an active learning strategy for machine learning models capable of predicting fields that is agnostic to the model architecture itself. To do so, we combine a well-established Gaussian process model for a scalar reference value and simultaneously aim at reducing the epistemic model error and the difference between scalar and field predictions. Different specific forms of the above-mentioned approach are introduced and compared to each other as well as to purely scalar-valued infill. Results are presented for the NASA common research model for an uncertainty propagation task, showcasing a high level of accuracy at significantly smaller cost compared to an approach without active learning.
[LG-45] DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training
Link: https://arxiv.org/abs/2601.21824
Authors: Xinwei Qiang, Hongmin Chen, Shixuan Sun, Jingwen Leng, Xin Liu, Minyi Guo
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:
Abstract:Determinism is indispensable for reproducibility in large language model (LLM) training, yet it often exacts a steep performance cost. In widely used attention implementations such as FlashAttention-3, the deterministic backward pass can incur up to a 37.9% throughput reduction relative to its non-deterministic counterpart, primarily because gradient accumulation operations must be serialized to guarantee numerical consistency. This performance loss stems from suboptimal scheduling of compute and gradient-reduction phases, leading to significant hardware underutilization. To address this challenge, we formulate the backward pass of deterministic attention as a scheduling problem on a Directed Acyclic Graph (DAG) and derive schedules that minimize the critical path length. Building on this formulation, we present DASH (Deterministic Attention Scheduling for High-Throughput), which encapsulates two complementary scheduling strategies: (i) Descending Q-Tile Iteration, a reversed query-block traversal that shrinks pipeline stalls in causal attention, and (ii) Shift Scheduling, a theoretically optimal schedule within our DAG model that reduces pipeline stalls for both full and causal masks. Our empirical evaluations on NVIDIA H800 GPUs demonstrate that DASH narrows the performance gap of deterministic attention. The proposed strategies improve the throughput of the attention backward pass by up to 1.28× compared to the baseline, significantly advancing the efficiency of reproducible LLM training. Our code is open-sourced at this https URL.
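One way to see why accumulation order matters for reproducibility: floating-point addition is not associative, so summing the same partial results in a different order can change the last bits of the total. The sketch below demonstrates this in plain Python with double-precision floats; the helper name `accumulate` and the sample values are illustrative assumptions and are unrelated to the DASH implementation or to GPU kernels.

```python
from typing import Sequence


def accumulate(values: Sequence[float], order: Sequence[int]) -> float:
    """Add values one by one in the given index order."""
    total = 0.0
    for index in order:
        total += values[index]
    return total


partials = [0.1, 0.2, 0.3]

# Same numbers, two accumulation orders.
forward = accumulate(partials, [0, 1, 2])
backward = accumulate(partials, [2, 1, 0])

if __name__ == "__main__":
    print(forward, backward, forward == backward)
```

Fixing the order removes this source of run-to-run variation, at the cost of serializing the additions, which is the trade-off the abstract describes.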
[LG-46] Nonparametric LLM Evaluation from Preference Data
Link: https://arxiv.org/abs/2601.21816
Authors: Dennis Frauen, Athiya Deviyani, Mihaela van der Schaar, Stefan Feuerriegel
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Evaluating the performance of large language models (LLMs) from human preference data is crucial for obtaining LLM leaderboards. However, many existing approaches either rely on restrictive parametric assumptions or lack valid uncertainty quantification when flexible machine learning methods are used. In this paper, we propose a nonparametric statistical framework, DMLEval, for comparing and ranking LLMs from preference data using debiased machine learning (DML). For this, we introduce generalized average ranking scores (GARS), which generalize commonly used ranking models, including the Bradley-Terry model or PageRank/Rank centrality, with complex human responses such as ties. DMLEval comes with the following advantages: (i) It produces statistically efficient estimates of GARS ranking scores. (ii) It naturally allows the incorporation of black-box machine learning methods for estimation. (iii) It can be combined with pre-trained LLM evaluators (e.g., using LLM-as-a-judge). (iv) It suggests optimal policies for collecting preference data under budget constraints. We demonstrate these advantages both theoretically and empirically using both synthetic and real-world preference datasets. In summary, our framework provides practitioners with powerful, state-of-the-art methods for comparing or ranking LLMs.
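For readers unfamiliar with the Bradley-Terry model that the abstract lists as a special case, the sketch below shows its standard win-probability formula: the probability that model i is preferred over model j is a logistic function of the difference of their latent scores. The function name and the example scores are illustrative assumptions; this is the classical parametric model, not the GARS estimator proposed in the paper, and it does not handle ties.

```python
import math


def bradley_terry_win_prob(score_i: float, score_j: float) -> float:
    """P(i preferred over j) = exp(s_i) / (exp(s_i) + exp(s_j))."""
    return 1.0 / (1.0 + math.exp(score_j - score_i))


# Hypothetical latent scores for two models.
score_a, score_b = 1.2, 0.4

if __name__ == "__main__":
    print(bradley_terry_win_prob(score_a, score_b))
    print(bradley_terry_win_prob(score_b, score_a))
```

The two probabilities always sum to one, so ties are simply not representable in this form.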
[LG-47] Knowledge Vector Weakening: Efficient Training-free Unlearning for Large Vision-Language Models
Link: https://arxiv.org/abs/2601.21794
Authors: Yejin Kim, Dongjun Hwang, Sungmin Cha, Junsuk Choe
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Large Vision-Language Models (LVLMs) are widely adopted for their strong multimodal capabilities, yet they raise serious concerns such as privacy leakage and harmful content generation. Machine unlearning has emerged as a promising solution for removing the influence of specific data from trained models. However, existing approaches largely rely on gradient-based optimization, incurring substantial computational costs for large-scale LVLMs. To address this limitation, we propose Knowledge Vector Weakening (KVW), a training-free unlearning method that directly intervenes in the full model without gradient computation. KVW identifies knowledge vectors that are activated during the model’s output generation on the forget set and progressively weakens their contributions, thereby preventing the model from exploiting undesirable knowledge. Experiments on the MLLMU and CLEAR benchmarks demonstrate that KVW achieves a stable forget-retain trade-off while significantly improving computational efficiency over gradient-based and LoRA-based unlearning methods.
[LG-48] NetMamba+: A Framework of Pre-trained Models for Efficient and Accurate Network Traffic Classification
Link: https://arxiv.org/abs/2601.21792
Authors: Tongze Wang, Xiaohui Xie, Wenduo Wang, Chuyi Wang, Jinzhou Liu, Boyan Huang, Yannan Hu, Youjian Zhao, Yong Cui
Categories: Machine Learning (cs.LG)
Comments:
Abstract:With the rapid growth of encrypted network traffic, effective traffic classification has become essential for network security and quality of service management. Current machine learning and deep learning approaches for traffic classification face three critical challenges: computational inefficiency of Transformer architectures, inadequate traffic representations with loss of crucial byte-level features while retaining detrimental biases, and poor handling of long-tail distributions in real-world data. We propose NetMamba+, a framework that addresses these challenges through three key innovations: (1) an efficient architecture considering Mamba and Flash Attention mechanisms, (2) a multimodal traffic representation scheme that preserves essential traffic information while eliminating biases, and (3) a label distribution-aware fine-tuning strategy. Evaluation experiments on massive datasets encompassing four main classification tasks showcase NetMamba+'s superior classification performance compared to state-of-the-art baselines, with improvements of up to 6.44% in F1 score. Moreover, NetMamba+ demonstrates excellent efficiency, achieving 1.7x higher inference throughput than the best baseline while maintaining comparably low memory usage. Furthermore, NetMamba+ exhibits superior few-shot learning abilities, achieving better classification performance with fewer labeled data. Additionally, we implement an online traffic classification system that demonstrates robust real-world performance with a throughput of 261.87 Mb/s. As the first framework to adapt Mamba architecture for network traffic classification, NetMamba+ opens new possibilities for efficient and accurate traffic analysis in complex network environments.
[LG-49] Quantum LEGO Learning: A Modular Design Principle for Hybrid Artificial Intelligence
Link: https://arxiv.org/abs/2601.21780
Authors: Jun Qi, Chao-Han Huck Yang, Pin-Yu Chen, Min-Hsiu Hsieh, Hector Zenil, Jesper Tegner
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: In submission
Abstract:Hybrid quantum-classical learning models increasingly integrate neural networks with variational quantum circuits (VQCs) to exploit complementary inductive biases. However, many existing approaches rely on tightly coupled architectures or task-specific encoders, limiting conceptual clarity, generality, and transferability across learning settings. In this work, we introduce Quantum LEGO Learning, a modular and architecture-agnostic learning framework that treats classical and quantum components as reusable, composable learning blocks with well-defined roles. Within this framework, a pre-trained classical neural network serves as a frozen feature block, while a VQC acts as a trainable adaptive module that operates on structured representations rather than raw inputs. This separation enables efficient learning under constrained quantum resources and provides a principled abstraction for analyzing hybrid models. We develop a block-wise generalization theory that decomposes learning error into approximation and estimation components, explicitly characterizing how the complexity and training status of each block influence overall performance. Our analysis generalizes prior tensor-network-specific results and identifies conditions under which quantum modules provide representational advantages over comparably sized classical heads. Empirically, we validate the framework through systematic block-swap experiments across frozen feature extractors and both quantum and classical adaptive heads. Experiments on quantum dot classification demonstrate stable optimization, reduced sensitivity to qubit count, and robustness to realistic noise.
[LG-50] Error Amplification Limits ANN-to-SNN Conversion in Continuous Control
Link: https://arxiv.org/abs/2601.21778
Authors: Zijie Xu, Zihan Huang, Yiting Dong, Kang Chen, Wenxuan Liu, Zhaofei Yu
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Comments:
Abstract:Spiking Neural Networks (SNNs) can achieve competitive performance by converting already existing well-trained Artificial Neural Networks (ANNs), avoiding further costly training. This property is particularly attractive in Reinforcement Learning (RL), where training through environment interaction is expensive and potentially unsafe. However, existing conversion methods perform poorly in continuous control, where suitable baselines are largely absent. We identify error amplification as the key cause: small action approximation errors become temporally correlated across decision steps, inducing cumulative state distribution shift and severe performance degradation. To address this issue, we propose Cross-Step Residual Potential Initialization (CRPI), a lightweight training-free mechanism that carries over residual membrane potentials across decision steps to suppress temporally correlated errors. Experiments on continuous control benchmarks with both vector and visual observations demonstrate that CRPI can be integrated into existing conversion pipelines and substantially recovers lost performance. Our results highlight continuous control as a critical and challenging benchmark for ANN-to-SNN conversion, where small errors can be strongly amplified and impact performance.
[LG-51] Differentiable Knapsack and Top-k Operators via Dynamic Programming
Link: https://arxiv.org/abs/2601.21775
Authors: Germain Vivier-Ardisson, Michaël E. Sander, Axel Parmentier, Mathieu Blondel
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Knapsack and Top-k operators are useful for selecting discrete subsets of variables. However, their integration into neural networks is challenging as they are piecewise constant, yielding gradients that are zero almost everywhere. In this paper, we propose a unified framework casting these operators as dynamic programs, and derive differentiable relaxations by smoothing the underlying recursions. On the algorithmic side, we develop efficient parallel algorithms supporting both deterministic and stochastic forward passes, and vector-Jacobian products for the backward pass. On the theoretical side, we prove that Shannon entropy is the unique regularization choice yielding permutation-equivariant operators, and characterize regularizers inducing sparse selections. Finally, on the experimental side, we demonstrate our framework on a decision-focused learning benchmark, a constrained dynamic assortment RL problem, and an extension of discrete VAEs.
[LG-52] FISMO: Fisher-Structured Momentum-Orthogonalized Optimizer
Link: https://arxiv.org/abs/2601.21750
Authors: Chenrui Xu, Wenjing Yan, Ying-Jun Angela Zhang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Training large-scale neural networks requires solving nonconvex optimization where the choice of optimizer fundamentally determines both convergence behavior and computational efficiency. While adaptive methods like Adam have long dominated practice, the recently proposed Muon optimizer achieves superior performance through orthogonalized momentum updates that enforce isotropic geometry with uniform singular values. However, this strict isotropy discards potentially valuable curvature information encoded in gradient spectra, motivating optimization methods that balance geometric structure with adaptivity. We introduce FISMO (Fisher-Structured Momentum-Orthogonalized) optimizer, which generalizes isotropic updates to incorporate anisotropic curvature information through Fisher information geometry. By reformulating the optimizer update as a trust-region problem constrained by a Kronecker-factored Fisher metric, FISMO achieves structured preconditioning that adapts to local loss landscape geometry while maintaining computational tractability. We establish convergence guarantees for FISMO in stochastic nonconvex settings, proving an \mathcal{O}(1/\sqrt{T}) rate for the expected squared gradient norm with explicit characterization of variance reduction through mini-batching. Empirical evaluation on image classification and language modeling benchmarks demonstrates that FISMO achieves superior training efficiency and final performance compared to established baselines.
[LG-53] Mixed-Precision Training and Compilation for RRAM-based Computing-in-Memory Accelerators
Link: https://arxiv.org/abs/2601.21737
Authors: Rebecca Pelke, Joel Klein, Jose Cubero-Cascante, Nils Bosbach, Jan Moritz Joseph, Rainer Leupers
Categories: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
Comments:
Abstract:Computing-in-Memory (CIM) accelerators are a promising solution for accelerating Machine Learning (ML) workloads, as they perform Matrix-Vector Multiplications (MVMs) on crossbar arrays directly in memory. Although the bit widths of the crossbar inputs and cells are very limited, most CIM compilers do not support quantization below 8 bits. As a result, a single MVM requires many compute cycles, and weights cannot be efficiently stored in a single crossbar cell. To address this problem, we propose a mixed-precision training and compilation framework for CIM architectures. The biggest challenge is the massive search space, which makes it difficult to find good quantization parameters. We therefore introduce a reinforcement learning-based strategy to find suitable quantization configurations that balance latency and accuracy. In the best case, our approach achieves up to a 2.48x speedup over existing state-of-the-art solutions, with an accuracy loss of only 0.086%.
[LG-54] Amortized Spectral Kernel Discovery via Prior-Data Fitted Network
Link: https://arxiv.org/abs/2601.21731
Authors: Kaustubh Sharma, Srijan Tiwari, Ojasva Nema, Parikshit Pareek
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Prior-Data Fitted Networks (PFNs) enable efficient amortized inference but lack transparent access to their learned priors and kernels. This opacity hinders their use in downstream tasks, such as surrogate-based optimization, that require explicit covariance models. We introduce an interpretability-driven framework for amortized spectral discovery from pre-trained PFNs with decoupled attention. We perform a mechanistic analysis on a trained PFN that identifies attention latent output as the key intermediary, linking observed function data to spectral structure. Building on this insight, we propose decoder architectures that map PFN latents to explicit spectral density estimates and corresponding stationary kernels via Bochner’s theorem. We study this pipeline in both single-realization and multi-realization regimes, contextualizing theoretical limits on spectral identifiability and proving consistency when multiple function samples are available. Empirically, the proposed decoders recover complex multi-peak spectral mixtures and produce explicit kernels that support Gaussian process regression with accuracy comparable to PFNs and optimization-based baselines, while requiring only a single forward pass. This yields orders-of-magnitude reductions in inference time compared to optimization-based baselines.
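The step from a spectral density to a stationary kernel via Bochner's theorem can be sketched numerically. The example assumes a Gaussian-mixture spectral density, a common choice for multi-peak spectra that the abstract does not confirm, and compares the closed-form kernel with direct quadrature. All function names and parameter values are hypothetical.

```python
import math


def spectral_mixture_kernel(tau, weights, means, stds):
    """Closed-form kernel for a symmetrized Gaussian-mixture spectral density."""
    return sum(
        w * math.exp(-2.0 * math.pi**2 * s**2 * tau**2) * math.cos(2.0 * math.pi * mu * tau)
        for w, mu, s in zip(weights, means, stds)
    )


def gaussian(f, mu, s):
    return math.exp(-0.5 * ((f - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))


def kernel_by_quadrature(tau, weights, means, stds, f_max=10.0, n=20001):
    """Bochner's theorem: k(tau) = integral of S(f) cos(2 pi f tau) df (trapezoid rule)."""
    step = 2.0 * f_max / (n - 1)
    total = 0.0
    for i in range(n):
        f = -f_max + i * step
        # Symmetrize each component so the resulting kernel is real-valued.
        density = sum(
            0.5 * w * (gaussian(f, mu, s) + gaussian(f, -mu, s))
            for w, mu, s in zip(weights, means, stds)
        )
        edge = 0.5 if i in (0, n - 1) else 1.0
        total += edge * density * math.cos(2.0 * math.pi * f * tau) * step
    return total


# Two spectral peaks (hypothetical weights, frequencies, bandwidths).
weights, means, stds = [0.7, 0.3], [0.5, 2.0], [0.1, 0.3]
closed_form = spectral_mixture_kernel(0.4, weights, means, stds)
numerical = kernel_by_quadrature(0.4, weights, means, stds)
```

Once a decoder outputs the mixture parameters, the kernel is available in closed form, so no per-dataset optimization is needed to use it in Gaussian process regression.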
[LG-55] LoRA and Privacy: When Random Projections Help (and When They Don't)
Link: https://arxiv.org/abs/2601.21719
Authors: Yaxi Hu, Johanna Düngler, Bernhard Schölkopf, Amartya Sanyal
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:We introduce the (Wishart) projection mechanism, a randomized map of the form S \mapsto M f(S) with M \sim W_d(1/r I_d, r), and study its differential privacy properties. For vector-valued queries f, we prove non-asymptotic DP guarantees without any additive noise, showing that Wishart randomness alone can suffice. For matrix-valued queries, however, we establish a sharp negative result: in the noise-free setting, the mechanism is not DP, and we demonstrate its vulnerability by implementing a near-perfect membership inference attack (AUC 0.99). We then analyze a noisy variant and prove privacy amplification due to randomness and low-rank projection, in both large- and small-rank regimes, yielding stronger privacy guarantees than additive noise alone. Finally, we show that LoRA-style updates are an instance of the matrix-valued mechanism, implying that LoRA is not inherently private despite its built-in randomness, but that low-rank fine-tuning can be more private than full fine-tuning at the same noise level. Preliminary experiments suggest that tighter accounting enables lower noise and improved accuracy in practice.
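The map S \mapsto M f(S) can be made concrete with a short sampling sketch. It assumes the usual construction of a Wishart matrix as a sum of r Gaussian outer products, scaled so that the expected value of M is the identity. The query output and all function names are hypothetical, and nothing here evaluates the privacy guarantee itself.

```python
import random


def sample_wishart_projection(d, r, rng):
    """Draw M ~ W_d((1/r) I_d, r) as (1/r) * sum of r outer products z z^T, z ~ N(0, I_d)."""
    m = [[0.0] * d for _ in range(d)]
    for _ in range(r):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for i in range(d):
            for j in range(d):
                m[i][j] += z[i] * z[j] / r
    return m


def project(m, v):
    """Matrix-vector product M v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


rng = random.Random(0)
d, r = 4, 2                              # rank r < d: M is a random low-rank projection
query_output = [1.0, -2.0, 0.5, 3.0]     # hypothetical vector-valued f(S)
M = sample_wishart_projection(d, r, rng)
released = project(M, query_output)      # noise-free release M f(S)
```

With r < d the sampled M has rank at most r, so the release only exposes f(S) through a random low-dimensional subspace; per the abstract, this alone can suffice for vector-valued queries but not for matrix-valued ones.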
[LG-56] SmartMeterFM: Unifying Smart Meter Data Generative Tasks Using Flow Matching Models
Link: https://arxiv.org/abs/2601.21706
Authors: Nan Lin, Yanbo Wang, Jacco Heres, Peter Palensky, Pedro P. Vergara
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 10 pages, 6 figures, 6 tables
Abstract:Smart meter data is the foundation for planning and operating the distribution network. Unfortunately, such data are not always available due to privacy regulations. Meanwhile, the collected data may be corrupted due to sensor or transmission failure, or it may not have sufficient resolution for downstream tasks. A wide range of generative tasks is formulated to address these issues, including synthetic data generation, missing data imputation, and super-resolution. Despite the success of machine learning models on these tasks, dedicated models need to be designed and trained for each task, leading to redundancy and inefficiency. In this paper, by recognizing the powerful modeling capability of flow matching models, we propose a new approach to unify diverse smart meter data generative tasks with a single model trained for conditional generation. The proposed flow matching models are trained to generate challenging, high-dimensional time series data, specifically monthly smart meter data at a 15 min resolution. By viewing different generative tasks as distinct forms of partial data observations and injecting them into the generation process, we unify tasks such as imputation and super-resolution with a single model, eliminating the need for re-training. The data generated by our model not only are consistent with the given observations but also remain realistic, showing better performance against interpolation and other machine learning based baselines dedicated to the tasks.
[LG-57] Understanding Model Merging: A Unified Generalization Framework for Heterogeneous Experts
Link: https://arxiv.org/abs/2601.21690
Authors: Qinglun Li, Anke Tang, Miao Zhang, Mengzhu Wang, Quanjun Yin, Li Shen
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Model merging efficiently aggregates capabilities from multiple fine-tuned models into a single one, operating purely in parameter space without original data or expensive re-computation. Despite empirical successes, a unified theory for its effectiveness under heterogeneous fine-tuning hyperparameters (e.g., varying learning rates, batch sizes) remains missing. Moreover, the lack of hyperparameter transparency in open-source fine-tuned models makes it difficult to predict merged-model performance, leaving practitioners without guidance on how to fine-tune merge-friendly experts. To address these two challenges, we employ L_2-stability theory under heterogeneous hyperparameter environments to analyze the generalization of the merged model \boldsymbol{x}_{avg}. This pioneering analysis yields two key contributions: (i) A unified theoretical framework is provided to explain existing merging algorithms, revealing how they optimize specific terms in our bound, thus offering a strong theoretical foundation for empirical observations. (ii) Actionable recommendations are proposed for practitioners to strategically fine-tune expert models, enabling the construction of merge-friendly models within the pretraining-to-finetuning pipeline. Extensive experiments on the ResNet/ViT family across 20/8 visual classification tasks, involving thousands of fine-tuned models, robustly confirm the impact of different hyperparameters on the generalization of \boldsymbol{x}_{avg} predicted by our theoretical results.
[LG-58] Don't be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold
Link: https://arxiv.org/abs/2601.21686
Authors: Luca Benfenati, Matteo Risso, Andrea Vannozzi, Ahmet Caner Yüzügüler, Lukas Cavigelli, Enrico Macii, Daniele Jahier Pagliari, Alessio Burrello
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Key–value (KV) caching enables fast autoregressive decoding but at long contexts becomes a dominant bottleneck in High Bandwidth Memory (HBM) capacity and bandwidth. A common mitigation is to compress cached keys and values by projecting per-head matrices to a lower rank, storing only the projections in the HBM. However, existing post-training approaches typically fit these projections using SVD-style proxy objectives, which may poorly reflect end-to-end reconstruction after softmax, value mixing, and subsequent decoder-layer transformations. For these reasons, we introduce StiefAttention, a post-training KV-cache compression method that learns orthonormal projection bases by directly minimizing decoder-layer output reconstruction error. StiefAttention additionally precomputes, for each layer, an error-rank profile over candidate ranks, enabling flexible layer-wise rank allocation under a user-specified error budget. Notably, on Llama3-8B under the same conditions, StiefAttention outperforms EigenAttention by 11.9 points on C4 perplexity and 5.4% on 0-shot MMLU accuracy at iso-compression, yielding lower relative error and higher cosine similarity with respect to the original decoder-layer outputs.
[LG-59] Can Local Learning Match Self-Supervised Backpropagation?
Link: https://arxiv.org/abs/2601.21683
Authors: Wu S. Zihan, Ariane Delrocq, Wulfram Gerstner, Guillaume Bellec
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:While end-to-end self-supervised learning with backpropagation (global BP-SSL) has become central for training modern AI systems, theories of local self-supervised learning (local-SSL) have struggled to build functional representations in deep neural networks. To establish a link between global and local rules, we first develop a theory for deep linear networks: we identify conditions for local-SSL algorithms (like Forward-forward or CLAPP) to implement exactly the same weight update as a global BP-SSL. Starting from the theoretical insights, we then develop novel variants of local-SSL algorithms to approximate global BP-SSL in deep non-linear convolutional neural networks. Variants that improve the similarity between gradient updates of local-SSL with those of global BP-SSL also show better performance on image datasets (CIFAR-10, STL-10, and Tiny ImageNet). The best local-SSL rule with the CLAPP loss function matches the performance of a comparable global BP-SSL with InfoNCE or CPC-like loss functions, and improves upon state-of-the-art for local SSL on these benchmarks.
[LG-60] LLM4Fluid: Large Language Models as Generalizable Neural Solvers for Fluid Dynamics
Link: https://arxiv.org/abs/2601.21681
Authors: Qisong Xiao, Xinhai Chen, Qinglin Wang, Xiaowei Guo, Binglin Wang, Weifeng Chen, Zhichao Wang, Yunfei Liu, Rui Xia, Hang Zou, Gencheng Liu, Shuai Li, Jie Liu
Subjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments:
Abstract:Deep learning has emerged as a promising paradigm for spatio-temporal modeling of fluid dynamics. However, existing approaches often suffer from limited generalization to unseen flow conditions and typically require retraining when applied to new scenarios. In this paper, we present LLM4Fluid, a spatio-temporal prediction framework that leverages Large Language Models (LLMs) as generalizable neural solvers for fluid dynamics. The framework first compresses high-dimensional flow fields into a compact latent space via reduced-order modeling enhanced with a physics-informed disentanglement mechanism, effectively mitigating spatial feature entanglement while preserving essential flow structures. A pretrained LLM then serves as a temporal processor, autoregressively predicting the dynamics of physical sequences with time series prompts. To bridge the modality gap between prompts and physical sequences, which can otherwise degrade prediction accuracy, we propose a dedicated modality alignment strategy that resolves representational mismatch and stabilizes long-term prediction. Extensive experiments across diverse flow scenarios demonstrate that LLM4Fluid functions as a robust and generalizable neural solver without retraining, achieving state-of-the-art accuracy while exhibiting powerful zero-shot and in-context learning capabilities. Code and datasets are publicly available at this https URL.
[LG-61] Epistemic Uncertainty Quantification for Pre-trained VLMs via Riemannian Flow Matching
Link: https://arxiv.org/abs/2601.21662
Authors: Li Ju, Mayank Nautiyal, Andreas Hellander, Ekta Vats, Prashant Singh
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Vision-Language Models (VLMs) are typically deterministic in nature and lack intrinsic mechanisms to quantify epistemic uncertainty, which reflects the model’s lack of knowledge or ignorance of its own representations. We theoretically motivate the negative log-density of an embedding as a proxy for epistemic uncertainty, where low-density regions signify model ignorance. The proposed method REPVLM computes the probability density on the hyperspherical manifold of the VLM embeddings using Riemannian Flow Matching. We empirically demonstrate that REPVLM achieves near-perfect correlation between uncertainty and prediction error, significantly outperforming existing baselines. Beyond classification, we demonstrate that the model also provides a scalable metric for out-of-distribution detection and automated data curation.
[LG-62] TabClustPFN: A Prior-Fitted Network for Tabular Data Clustering
Link: https://arxiv.org/abs/2601.21656
Authors: Tianqi Zhao, Guanyang Wang, Yan Shuo Tan, Qiong Zhang
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Clustering tabular data is a fundamental yet challenging problem due to heterogeneous feature types, diverse data-generating mechanisms, and the absence of transferable inductive biases across datasets. Prior-fitted networks (PFNs) have recently demonstrated strong generalization in supervised tabular learning by amortizing Bayesian inference under a broad synthetic prior. Extending this paradigm to clustering is nontrivial: clustering is unsupervised, admits a combinatorial and permutation-invariant output space, and requires inferring the number of clusters. We introduce TabClustPFN, a prior-fitted network for tabular data clustering that performs amortized Bayesian inference over both cluster assignments and cluster cardinality. Pretrained on synthetic datasets drawn from a flexible clustering prior, TabClustPFN clusters unseen datasets in a single forward pass, without dataset-specific retraining or hyperparameter tuning. The model naturally handles heterogeneous numerical and categorical features and adapts to a wide range of clustering structures. Experiments on synthetic data and curated real-world tabular benchmarks show that TabClustPFN outperforms classical, deep, and amortized clustering baselines, while exhibiting strong robustness in out-of-the-box exploratory settings. Code is available at this https URL.
[LG-63] Identifiable Equivariant Networks are Layerwise Equivariant
Link: https://arxiv.org/abs/2601.21645
Authors: Vahid Shahverdi, Giovanni Luca Marchetti, Georg Bökman, Kathlén Kohn
Subjects: Machine Learning (cs.LG); Category Theory (math.CT); Representation Theory (math.RT)
Comments:
Abstract:We investigate the relation between end-to-end equivariance and layerwise equivariance in deep neural networks. We prove the following: For a network whose end-to-end function is equivariant with respect to group actions on the input and output spaces, there is a parameter choice yielding the same end-to-end function such that its layers are equivariant with respect to some group actions on the latent spaces. Our result assumes that the parameters of the model are identifiable in an appropriate sense. This identifiability property has been established in the literature for a large class of networks, to which our results apply immediately, while it is conjectural for others. The theory we develop is grounded in an abstract formalism, and is therefore architecture-agnostic. Overall, our results provide a mathematical explanation for the emergence of equivariant structures in the weights of neural networks during training – a phenomenon that is consistently observed in practice.
[LG-64] Generative Design of Ship Propellers using Conditional Flow Matching
Link: https://arxiv.org/abs/2601.21637
Authors: Patrick Kruger, Rafael Diaz, Simon Hauschulz, Stefan Harries, Hanno Gottschalk
Subjects: Machine Learning (cs.LG)
Comments: 19 pages, 13 figures, 3 tables
Abstract:In this paper, we explore the use of generative artificial intelligence (GenAI) for ship propeller design. While traditional forward machine learning models predict the performance of mechanical components based on given design parameters, GenAI models aim to generate designs that achieve specified performance targets. In particular, we employ conditional flow matching to establish a bidirectional mapping between design parameters and simulated noise that is conditioned on performance labels. This approach enables the generation of multiple valid designs corresponding to the same performance targets by sampling over the noise vector. To support model training, we generate data using a vortex lattice method for numerical simulation and analyze the trade-off between model accuracy and the amount of available data. We further propose data augmentation using pseudo-labels derived from less data-intensive forward surrogate models, which can often improve overall model performance. Finally, we present examples of distinct propeller geometries that exhibit nearly identical performance characteristics, illustrating the versatility and potential of GenAI in engineering design.
[LG-65] Sampling-Free Privacy Accounting for Matrix Mechanisms under Random Allocation
Link: https://arxiv.org/abs/2601.21636
Authors: Jan Schuchardt, Nikita Kalinin
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Comments:
Abstract:We study privacy amplification for differentially private model training with matrix factorization under random allocation (also known as the balls-in-bins model). Recent work by Choquette-Choo et al. (2025) proposes a sampling-based Monte Carlo approach to compute amplification parameters in this setting. However, their guarantees either only hold with some high probability or require random abstention by the mechanism. Furthermore, the required number of samples for ensuring (\epsilon,\delta)-DP is inversely proportional to \delta. In contrast, we develop sampling-free bounds based on Rényi divergence and conditional composition. The former is facilitated by a dynamic programming formulation to efficiently compute the bounds. The latter complements it by offering stronger privacy guarantees for small \epsilon, where Rényi divergence bounds inherently lead to an over-approximation. Our framework applies to arbitrary banded and non-banded matrices. Through numerical comparisons, we demonstrate the efficacy of our approach across a broad range of matrix mechanisms used in research and practice.
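How a Rényi divergence bound turns into an (\epsilon,\delta)-DP guarantee can be shown with the standard generic conversion. This is not the paper's dynamic-programming bound for matrix mechanisms: the sketch assumes a plain Gaussian mechanism with L2 sensitivity 1, and the function names, noise level, and grid of orders are hypothetical.

```python
import math


def gaussian_rdp(alpha, sigma):
    """Renyi divergence of order alpha for the Gaussian mechanism (L2 sensitivity 1)."""
    return alpha / (2.0 * sigma**2)


def rdp_to_dp(rdp_curve, delta, alphas):
    """Standard conversion: eps = min over alpha of rdp(alpha) + log(1/delta) / (alpha - 1)."""
    return min(rdp_curve(a) + math.log(1.0 / delta) / (a - 1.0) for a in alphas)


sigma, delta = 2.0, 1e-5
alphas = [1.0 + 0.1 * k for k in range(1, 1000)]   # candidate Renyi orders > 1
epsilon = rdp_to_dp(lambda a: gaussian_rdp(a, sigma), delta, alphas)

# For this curve the minimum has a closed form, useful as a sanity check.
analytic = 1.0 / (2.0 * sigma**2) + math.sqrt(2.0 * math.log(1.0 / delta)) / sigma
```

The additive `log(1/delta) / (alpha - 1)` term is paid regardless of how small the divergence is, which gives some intuition for why the abstract describes Rényi-based bounds as over-approximating at small \epsilon.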
[LG-66] Noise as a Probe: Membership Inference Attacks on Diffusion Models Leveraging Initial Noise
Link: https://arxiv.org/abs/2601.21628
Authors: Puwei Lian, Yujun Cai, Songze Li, Bingkun Bao
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:Diffusion models have achieved remarkable progress in image generation, but their increasing deployment raises serious concerns about privacy. In particular, fine-tuned models are highly vulnerable, as they are often fine-tuned on small and private datasets. Membership inference attacks (MIAs) are used to assess privacy risks by determining whether a specific sample was part of a model’s training data. Existing MIAs against diffusion models either assume access to intermediate results or require auxiliary datasets for training a shadow model. In this work, we exploit a critical yet overlooked vulnerability: the widely used noise schedules fail to fully eliminate semantic information in the images, resulting in residual semantic signals even at the maximum noise step. We empirically demonstrate that the fine-tuned diffusion model captures hidden correlations between the residual semantics in initial noise and the original images. Building on this insight, we propose a simple yet effective membership inference attack, which injects semantic information into the initial noise and infers membership by analyzing the model’s generation result. Extensive experiments demonstrate that the semantic initial noise can strongly reveal membership information, highlighting the vulnerability of diffusion models to MIAs.
[LG-67] Training Memory in Deep Neural Networks: Mechanisms, Evidence, and Measurement Gaps
Link: https://arxiv.org/abs/2601.21624
Authors: Vasileios Sevetlidis, George Pavlidis
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Modern deep-learning training is not memoryless. Updates depend on optimizer moments and averaging, data-order policies (random reshuffling vs with-replacement, staged augmentations and replay), the nonconvex path, and auxiliary state (teacher EMA/SWA, contrastive queues, BatchNorm statistics). This survey organizes mechanisms by source, lifetime, and visibility. It introduces seed-paired, function-space causal estimands; portable perturbation primitives (carry/reset of momentum/Adam/EMA/BN, order-window swaps, queue/teacher tweaks); and a reporting checklist with audit artifacts (order hashes, buffer/BN checksums, RNG contracts). The conclusion is a protocol for portable, causal, uncertainty-aware measurement that attributes how much training history matters across models, data, and regimes.
[LG-68] LAMP: Look-Ahead Mixed-Precision Inference of Large Language Models
Link: https://arxiv.org/abs/2601.21623
Authors: Stanislav Budzinskiy, Marian Gloser, Tolunay Yilmaz, Ying Hong Tham, Yuanyi Lin, Wenyi Fang, Fan Wu, Philipp Petersen
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:
Abstract:Mixed-precision computations are a hallmark of the current stage of AI, driving the progress in large language models towards efficient, locally deployable solutions. This article addresses the floating-point computation of compositionally-rich functions, concentrating on transformer inference. Based on the rounding error analysis of a composition f(g(\mathrm{x})), we provide an adaptive strategy that selects a small subset of components of g(\mathrm{x}) to be computed more accurately while all other computations can be carried out with lower accuracy. We then explain how this strategy can be applied to different compositions within a transformer and illustrate its overall effect on transformer inference. We study the effectiveness of this algorithm numerically on GPT-2 models and demonstrate that already very low recomputation rates allow for improvements of up to two orders of magnitude in accuracy.
[LG-69] Age Matters: Analyzing Age-Related Discussions in App Reviews
Link: https://arxiv.org/abs/2601.21605
Authors: Shashiwadana Nirmania, Garima Sharma, Hourieh Khalajzadeh, Mojtaba Shahin
Subjects: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:
Abstract:In recent years, mobile applications have become indispensable tools for managing various aspects of life. From enhancing productivity to providing personalized entertainment, mobile apps have revolutionized people’s daily routines. Despite this rapid growth and popularity, gaps remain in how these apps address the needs of users from different age groups. Users of varying ages face distinct challenges when interacting with mobile apps, from younger users dealing with inappropriate content to older users having difficulty with usability due to age-related vision and cognition impairments. Although there have been initiatives to create age-inclusive apps, a limited understanding of user perspectives on age-related issues may hinder developers from recognizing specific challenges and implementing effective solutions. In this study, we explore age discussions in app reviews to gain insights into how mobile apps should cater to users across different age groups. We manually curated a dataset of 4,163 app reviews from the Google Play Store and identified 1,429 age-related reviews and 2,734 non-age-related reviews. We employed eight machine learning, deep learning, and large language models to automatically detect age discussions, with RoBERTa performing the best, achieving a precision of 92.46%. Additionally, a qualitative analysis of the 1,429 age-related reviews uncovers six dominant themes reflecting user concerns.
[LG-70] Heterogeneity-Aware Knowledge Sharing for Graph Federated Learning
Link: https://arxiv.org/abs/2601.21589
Authors: Wentao Yu, Sheng Wan, Shuo Chen, Bo Han, Chen Gong
Subjects: Machine Learning (cs.LG)
Comments: 33 pages
Abstract:Graph Federated Learning (GFL) enables distributed graph representation learning while protecting the privacy of graph data. However, GFL suffers from heterogeneity arising from diverse node features and structural topologies across multiple clients. To address both types of heterogeneity, we propose a novel graph Federated learning method via Semantic and Structural Alignment (FedSSA), which shares the knowledge of both node features and structural topologies. For node feature heterogeneity, we propose a novel variational model to infer class-wise node distributions, so that we can cluster clients based on inferred distributions and construct cluster-level representative distributions. We then minimize the divergence between local and cluster-level distributions to facilitate semantic knowledge sharing. For structural heterogeneity, we employ spectral Graph Neural Networks (GNNs) and propose a spectral energy measure to characterize structural information, so that we can cluster clients based on spectral energy and build cluster-level spectral GNNs. We then align the spectral characteristics of local spectral GNNs with those of cluster-level spectral GNNs to enable structural knowledge sharing. Experiments on six homophilic and five heterophilic graph datasets under both non-overlapping and overlapping partitioning settings demonstrate that FedSSA consistently outperforms eleven state-of-the-art methods.
[LG-71] CORDS: Continuous Representations of Discrete Structures ICLR2026
Link: https://arxiv.org/abs/2601.21583
Authors: Tin Hadži Veljković, Erik Bekkers, Michael Tiemann, Jan-Willem van de Meent
Subjects: Machine Learning (cs.LG)
Comments: Preprint, accepted at ICLR 2026
Abstract:Many learning problems require predicting sets of objects when the number of objects is not known beforehand. Examples include object detection, molecular modeling, and scientific inference tasks such as astrophysical source detection. Existing methods often rely on padded representations or must explicitly infer the set size, which often poses challenges. We present a novel strategy for addressing this challenge by casting prediction of variable-sized sets as a continuous inference problem. Our approach, CORDS (Continuous Representations of Discrete Structures), provides an invertible mapping that transforms a set of spatial objects into continuous fields: a density field that encodes object locations and count, and a feature field that carries their attributes over the same support. Because the mapping is invertible, models operate entirely in field space while remaining exactly decodable to discrete sets. We evaluate CORDS across molecular generation and regression, object detection, simulation-based inference, and a mathematical task involving recovery of local maxima, demonstrating robust handling of unknown set sizes with competitive accuracy.
[LG-72] Evaluating Prediction Uncertainty Estimates from BatchEnsemble
Link: https://arxiv.org/abs/2601.21581
Authors: Morten Blørstad, Herman Jangsett Mostein, Nello Blaser, Pekka Parviainen
Subjects: Machine Learning (cs.LG)
Comments: 17 pages, 19 figures
Abstract:Deep learning models struggle with uncertainty estimation. Many approaches are either computationally infeasible or underestimate uncertainty. We investigate BatchEnsemble as a general and scalable method for uncertainty estimation across both tabular and time series tasks. To extend BatchEnsemble to sequential modeling, we introduce GRUBE, a novel BatchEnsemble GRU cell. We compare BatchEnsemble to Monte Carlo dropout and deep ensemble models. Our results show that BatchEnsemble matches the uncertainty estimation performance of deep ensembles, and clearly outperforms Monte Carlo dropout. GRUBE achieves similar or better performance in both prediction and uncertainty estimation. These findings show that BatchEnsemble and GRUBE achieve similar performance with fewer parameters and reduced training and inference time compared to traditional ensembles.
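The parameter saving claimed for BatchEnsemble comes from its rank-1 parameterization, which the abstract does not spell out. In the original BatchEnsemble formulation each member's weight is the shared matrix multiplied elementwise by an outer product of two small vectors. The sketch below uses toy values and hypothetical function names, and does not cover the GRUBE cell.

```python
def batchensemble_forward(x, shared_w, r, s):
    """Member output computed as ((x * r) @ W) * s, without building the member's full weight."""
    scaled_x = [xi * ri for xi, ri in zip(x, r)]
    hidden = [sum(xk * shared_w[k][j] for k, xk in enumerate(scaled_x)) for j in range(len(s))]
    return [h * sj for h, sj in zip(hidden, s)]


def explicit_member_forward(x, shared_w, r, s):
    """Reference: materialize W_i = W * outer(r, s) elementwise and apply it."""
    w_i = [[shared_w[k][j] * r[k] * s[j] for j in range(len(s))] for k in range(len(r))]
    return [sum(x[k] * w_i[k][j] for k in range(len(r))) for j in range(len(s))]


# One shared 3x2 weight matrix plus two (r, s) vector pairs, one per ensemble member.
shared_w = [[0.2, -0.5], [1.0, 0.3], [-0.7, 0.8]]
members = [([1.0, 0.9, 1.1], [1.2, 0.8]), ([0.8, 1.3, 1.0], [0.9, 1.1])]
x = [0.5, -1.0, 2.0]

outputs = [batchensemble_forward(x, shared_w, r, s) for r, s in members]
mean_prediction = [sum(col) / len(outputs) for col in zip(*outputs)]
```

Each extra member costs only the two vectors `r` and `s` rather than a full copy of the weight matrix, and the spread of `outputs` across members is what serves as the uncertainty signal.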
[LG-73] Learning the Mechanism of Catastrophic Forgetting: A Perspective from Gradient Similarity
Link: https://arxiv.org/abs/2601.21577
Authors: Mutian Yang, Zisen Zhan, Yutong Chen, Haolin Li, Kaiwen Wang, Kaili Zheng, Yuguang Wang, Qi Wang, Jiandong Gao, Ji Wu
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Catastrophic forgetting during knowledge injection severely undermines the continual learning capability of large language models (LLMs). Although existing methods attempt to mitigate this issue, they often lack a foundational theoretical explanation. We establish a gradient-based theoretical framework to explain catastrophic forgetting. We first prove that strongly negative gradient similarity is a fundamental cause of forgetting. We then use gradient similarity to identify two types of neurons: conflicting neurons that induce forgetting and account for 50%-75% of neurons, and collaborative neurons that mitigate forgetting and account for 25%-50%. Based on this analysis, we propose a knowledge injection method, Collaborative Neural Learning (CNL). By freezing conflicting neurons and updating only collaborative neurons, CNL theoretically eliminates catastrophic forgetting under an infinitesimal learning rate \eta and an exactly known mastered set. Experiments on five LLMs, four datasets, and four optimizers show that CNL achieves zero forgetting in in-set settings and reduces forgetting by 59.1%-81.7% in out-of-set settings.
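The idea of freezing conflicting neurons and updating only collaborative ones can be sketched as a masked gradient step. The sketch assumes that a neuron corresponds to one weight row, that similarity is the cosine between its gradient on the new knowledge and its gradient on the mastered set, and that the split is at zero. The abstract does not confirm these choices, and all names and values are hypothetical.

```python
import math


def cosine(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm > 0 else 0.0


def collaborative_mask(grad_new, grad_mastered):
    """True for neurons whose two gradients do not point in opposing directions."""
    return [cosine(g_new, g_old) >= 0.0 for g_new, g_old in zip(grad_new, grad_mastered)]


def masked_sgd_step(weights, grad_new, mask, lr):
    """Update collaborative rows with plain SGD and leave conflicting rows frozen."""
    return [
        [w - lr * g for w, g in zip(w_row, g_row)] if keep else list(w_row)
        for w_row, g_row, keep in zip(weights, grad_new, mask)
    ]


# Three neurons with two weights each (toy values).
weights = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6]]
grad_new = [[1.0, 0.0], [0.5, 0.5], [-1.0, 2.0]]          # gradient on injected knowledge
grad_mastered = [[-1.0, 0.1], [0.4, 0.6], [1.0, -1.5]]    # gradient on the mastered set

mask = collaborative_mask(grad_new, grad_mastered)
updated = masked_sgd_step(weights, grad_new, mask, lr=0.01)
```

In this toy case only the second neuron has aligned gradients, so it is the only row that moves. To first order, a small step along aligned gradients does not increase the loss on the mastered set, which matches the infinitesimal-learning-rate condition in the abstract.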
[LG-74] Bridging Functional and Representational Similarity via Usable Information
Link: https://arxiv.org/abs/2601.21568
Authors: Antonio Almudévar, Alfonso Ortega
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:We present a unified framework for quantifying the similarity between representations through the lens of usable information, offering a rigorous theoretical and empirical synthesis across three key dimensions. First, addressing functional similarity, we establish a formal link between stitching performance and conditional mutual information. We further reveal that stitching is inherently asymmetric, demonstrating that robust functional comparison necessitates a bidirectional analysis rather than a unidirectional mapping. Second, concerning representational similarity, we prove that reconstruction-based metrics and standard tools (e.g., CKA, RSA) act as estimators of usable information under specific constraints. Crucially, we show that similarity is relative to the capacity of the predictive family: representations that appear distinct to a rigid observer may be identical to a more expressive one. Third, we demonstrate that representational similarity is sufficient but not necessary for functional similarity. We unify these concepts through a task-granularity hierarchy: similarity on a complex task guarantees similarity on any coarser derivative, establishing representational similarity as the limit of maximum granularity: input reconstruction.
[LG-75] FlexCausal: Flexible Causal Disentanglement via Structural Flow Priors and Manifold-Aware Interventions
Link: https://arxiv.org/abs/2601.21567
Authors: Yutao Jin, Yuang Tao, Junyong Zhai
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Causal Disentangled Representation Learning (CDRL) aims to learn and disentangle low-dimensional representations and their underlying causal structure from observations. However, existing disentanglement methods rely on a standard mean-field approximation with a diagonal posterior covariance, which decorrelates all latent dimensions. Additionally, these methods often assume isotropic Gaussian priors for exogenous noise, failing to capture the complex, non-Gaussian statistical properties prevalent in real-world causal factors. Therefore, we propose FlexCausal, a novel CDRL framework based on a block-diagonal covariance VAE. FlexCausal utilizes a Factorized Flow-based Prior to realistically model the complex densities of exogenous noise, effectively decoupling the learning of causal mechanisms from distributional statistics. By integrating supervised alignment objectives with counterfactual consistency constraints, our framework ensures a precise structural correspondence between the learned latent subspaces and the ground-truth causal relations. Finally, we introduce a manifold-aware relative intervention strategy to ensure high-fidelity generation. Experimental results on both synthetic and real-world datasets demonstrate that FlexCausal significantly outperforms other methods.
[LG-76] Representation Unlearning: Forgetting through Information Compression
Link: https://arxiv.org/abs/2601.21564
Authors: Antonio Almudévar,Alfonso Ortega
Subjects: Machine Learning (cs.LG)
Abstract:Machine unlearning seeks to remove the influence of specific training data from a model, a need driven by privacy regulations and robustness concerns. Existing approaches typically modify model parameters, but such updates can be unstable, computationally costly, and limited by local approximations. We introduce Representation Unlearning, a framework that performs unlearning directly in the model’s representation space. Instead of modifying model parameters, we learn a transformation over representations that imposes an information bottleneck: maximizing mutual information with retained data while suppressing information about data to be forgotten. We derive variational surrogates that make this objective tractable and show how they can be instantiated in two practical regimes: when both retain and forget data are available, and in a zero-shot setting where only forget data can be accessed. Experiments across several benchmarks demonstrate that Representation Unlearning achieves more reliable forgetting, better utility retention, and greater computational efficiency than parameter-centric baselines.
[LG-77] HistoPrism: Unlocking Functional Pathway Analysis from Pan-Cancer Histology via Gene Expression Prediction ICLR2026
Link: https://arxiv.org/abs/2601.21560
Authors: Susu Hu,Qinghe Zeng,Nithya Bhasker,Jakob Nicolas Kather,Stefanie Speidel
Subjects: Machine Learning (cs.LG)
Comments: Accepted at ICLR 2026
Abstract:Predicting spatial gene expression from H&E histology offers a scalable and clinically accessible alternative to sequencing, but realizing clinical impact requires models that generalize across cancer types and capture biologically coherent signals. Prior work is often limited to per-cancer settings and variance-based evaluation, leaving functional relevance underexplored. We introduce HistoPrism, an efficient transformer-based architecture for pan-cancer prediction of gene expression from histology. To evaluate biological meaning, we introduce a pathway-level benchmark, shifting assessment from isolated gene-level variance to coherent functional pathways. HistoPrism not only surpasses prior state-of-the-art models on highly variable genes but, more importantly, also achieves substantial gains on pathway-level prediction, demonstrating its ability to recover biologically coherent transcriptomic patterns. With strong pan-cancer generalization and improved efficiency, HistoPrism establishes a new standard for clinically relevant transcriptomic modeling from routinely available histology.
[LG-78] Fast and Geometrically Grounded Lorentz Neural Networks
Link: https://arxiv.org/abs/2601.21529
Authors: Robert van der Klis,Ricardo Chávez Torres,Max van Spengler,Yuhui Ding,Thomas Hofmann,Pascal Mettes
Subjects: Machine Learning (cs.LG)
Comments: 19 pages, 4 figures
Abstract:Hyperbolic space is quickly gaining traction as a promising geometry for hierarchical and robust representation learning. A core open challenge is the development of a mathematical formulation of hyperbolic neural networks that is both efficient and captures the key properties of hyperbolic space. The Lorentz model of hyperbolic space has been shown to enable both fast forward and backward propagation. However, we prove that, with the current formulation of Lorentz linear layers, the hyperbolic norms of the outputs scale logarithmically with the number of gradient descent steps, nullifying the key advantage of hyperbolic geometry. We propose a new Lorentz linear layer grounded in the well-known "distance-to-hyperplane" formulation. We prove that our formulation results in the usual linear scaling of output hyperbolic norms with respect to the number of gradient descent steps. Our new formulation, together with further algorithmic efficiencies through Lorentzian activation functions and a new caching strategy, results in neural networks fully abiding by hyperbolic geometry while simultaneously bridging the computation gap to Euclidean neural networks. Code available at: this https URL.
[LG-79] Explicit Credit Assignment through Local Rewards and Dependence Graphs in Multi-Agent Reinforcement Learning
Link: https://arxiv.org/abs/2601.21523
Authors: Bang Giang Le,Viet Cuong Ta
Subjects: Machine Learning (cs.LG)
Abstract:To promote cooperation in Multi-Agent Reinforcement Learning, the reward signals of all agents can be aggregated into a global reward, a configuration commonly known as the fully cooperative setting. However, global rewards are usually noisy because they contain the contributions of all agents, which have to be resolved in the credit assignment process. On the other hand, using local rewards benefits from faster learning due to the separation of agents’ contributions, but can be suboptimal as agents myopically optimize their own reward while disregarding the global optimality. In this work, we propose a method that combines the merits of both approaches. By using a graph of interaction between agents, our method discerns the individual agent contribution in a more fine-grained manner than a global reward, while alleviating the cooperation problem with agents’ local reward. We also introduce a practical approach for approximating such a graph. Our experiments demonstrate the flexibility of the approach, enabling improvements over the traditional local and global reward settings.
[LG-80] A Unified SPD Token Transformer Framework for EEG Classification: Systematic Comparison of Geometric Embeddings
Link: https://arxiv.org/abs/2601.21521
Authors: Chi-Sheng Chen,En-Jui Kuo,Guan-Ying Chen,Xinyu Zhang,Fan Zhang
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Abstract:Spatial covariance matrices of EEG signals are Symmetric Positive Definite (SPD) and lie on a Riemannian manifold, yet the theoretical connection between embedding geometry and optimization dynamics remains unexplored. We provide a formal analysis linking embedding choice to gradient conditioning and numerical stability for SPD manifolds, establishing three theoretical results: (1) BWSPD’s \sqrt{\kappa} gradient conditioning (vs. \kappa for Log-Euclidean) via Daleckii-Kreĭn matrices provides better gradient conditioning on high-dimensional inputs (d \geq 22), with this advantage reducing on low-dimensional inputs (d \leq 8) where eigendecomposition overhead dominates; (2) Embedding-Space Batch Normalization (BN-Embed) approximates Riemannian normalization up to O(\varepsilon^2) error, yielding +26% accuracy on 56-channel ERP data but negligible effect on 8-channel SSVEP data, matching the channel-count-dependent prediction; (3) bi-Lipschitz bounds prove BWSPD tokens preserve manifold distances with distortion governed solely by the condition ratio \kappa. We validate these predictions via a unified Transformer framework comparing BWSPD, Log-Euclidean, and Euclidean embeddings within identical architecture across 1,500+ runs on three EEG paradigms (motor imagery, ERP, SSVEP; 36 subjects). Our Log-Euclidean Transformer achieves state-of-the-art performance on all datasets, substantially outperforming classical Riemannian classifiers and recent SPD baselines, while BWSPD offers competitive accuracy with similar training time.
[LG-81] Cascaded Transfer: Learning Many Tasks under Budget Constraints
Link: https://arxiv.org/abs/2601.21513
Authors: Eloi Campagne(CB),Yvenn Amara-Ouali(LMO),Yannig Goude(LMO),Mathilde Mougeot(CB, ENSIIE, ENS Paris Saclay),Argyris Kalogeratos(CB, ENS Paris Saclay)
Subjects: Machine Learning (cs.LG)
Abstract:Many-Task Learning refers to the setting where a large number of related tasks need to be learned while the exact relationships between tasks are not known. We introduce Cascaded Transfer Learning, a novel many-task transfer learning paradigm where information (e.g. model parameters) cascades hierarchically through tasks that are learned by individual models of the same class, while respecting given budget constraints. The cascade is organized as a rooted tree that specifies the order in which tasks are learned and refined. We design a cascaded transfer mechanism deployed over a minimum spanning tree structure that connects the tasks according to a suitable distance measure, and allocates the available training budget along its branches. Experiments on synthetic and real many-task settings show that the resulting method enables more accurate and cost-effective adaptation across large task collections compared to alternative approaches.
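The rooted-tree ordering described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it simply runs Prim's algorithm over a hypothetical task-distance matrix (`task_dist` and the choice of root are assumptions made for illustration) and returns the parent-to-child edges in the order the tasks would be reached. Budget allocation along branches is not modeled here.

```python
import heapq

def minimum_spanning_tree(dist, root=0):
    """Prim's algorithm on a symmetric distance matrix.

    Returns (parent, child) edges in the order tasks are attached,
    which doubles as a learning order for a cascade rooted at `root`.
    """
    n = len(dist)
    visited = {root}
    edges = []
    # Candidate edges leaving the tree, keyed by distance.
    heap = [(dist[root][j], root, j) for j in range(n) if j != root]
    heapq.heapify(heap)
    while heap and len(visited) < n:
        _, parent, child = heapq.heappop(heap)
        if child in visited:
            continue  # stale entry: child was attached via a shorter edge
        visited.add(child)
        edges.append((parent, child))
        for j in range(n):
            if j not in visited:
                heapq.heappush(heap, (dist[child][j], child, j))
    return edges

# Hypothetical pairwise distances between four tasks (illustrative values).
task_dist = [
    [0.0, 1.0, 4.0, 3.0],
    [1.0, 0.0, 2.0, 5.0],
    [4.0, 2.0, 0.0, 6.0],
    [3.0, 5.0, 6.0, 0.0],
]
cascade = minimum_spanning_tree(task_dist, root=0)
print(cascade)  # [(0, 1), (1, 2), (0, 3)]
```

Each edge `(parent, child)` can be read as "initialize the child task's model from the parent's parameters", so task 2 inherits from task 1 rather than directly from the root.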
[LG-82] Task-Awareness Improves LLM Generations and Uncertainty
Link: https://arxiv.org/abs/2601.21500
Authors: Tim Tomov,Dominik Fuchsgruber,Stephan Günnemann
Subjects: Machine Learning (cs.LG)
Abstract:In many applications of LLMs, natural language responses often have an underlying structure such as representing discrete labels, numerical values, or graphs. Yet, existing decoding and uncertainty estimation methods operate only in language space and largely disregard structural information. We address this by modeling LLM outputs directly in a task-dependent latent structure. By equipping this structure with a dissimilarity measure, we can compute Bayes-optimal responses. These are not selected from sampled generations but are newly synthesized by combining individual responses in the latent space. Across different tasks, Bayes-optimal responses consistently outperform standard decoding methods like beam search. Moreover, quantifying uncertainty via the induced Bayesian risk captures variations in terms of the latent structure and improves alignment with output quality and correctness. Our decision-theoretic framework is applicable to any problem that admits a latent response structure and enables reliable task-aware LLM predictions.
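A toy case makes the idea of a synthesized Bayes-optimal response concrete. The sketch below assumes the latent structure is a single numerical value and the dissimilarity measure is squared error; both are illustrative choices, not taken from the paper. Under that assumption the risk-minimizing answer is the sample mean, which need not coincide with any individual sampled response.

```python
from statistics import mean

def bayes_optimal_numeric(samples):
    """Minimizer of expected squared error under the empirical
    distribution of sampled answers: the sample mean."""
    return mean(samples)

def bayes_risk(samples, action):
    """Expected squared dissimilarity between `action` and the samples."""
    return mean((s - action) ** 2 for s in samples)

# Hypothetical numeric answers parsed from four sampled generations.
samples = [10.0, 12.0, 12.0, 18.0]
best = bayes_optimal_numeric(samples)
risk = bayes_risk(samples, best)
print(best, risk)  # 13.0 9.0
```

Here the synthesized answer 13.0 has lower risk than any of the sampled answers (the best of which, 12.0, has risk 10.0), and the risk value itself can serve as an uncertainty score.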
[LG-83] ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment
Link: https://arxiv.org/abs/2601.21484
Authors: Xiuyu Li,Jinkai Zhang,Mingyang Yi,Yu Li,Longqiang Wang,Yue Wang,Ju Fan
Subjects: Machine Learning (cs.LG)
Abstract:Reinforcement Learning (RL) post-training alignment for language models is effective, but also costly and unstable in practice, owing to its complicated training process. To address this, we propose a training-free inference method to sample directly from the optimal RL policy. The transition probability applied to Masked Language Modeling (MLM) consists of a reference policy model and an energy term. Based on this, our algorithm, Energy-Guided Test-Time Scaling (ETS), estimates the key energy term via online Monte Carlo, with a provable convergence rate. Moreover, to ensure practical efficiency, ETS leverages modern acceleration frameworks alongside tailored importance sampling estimators, substantially reducing inference latency while provably preserving sampling quality. Experiments on MLM (including autoregressive models and diffusion language models) across reasoning, coding, and science benchmarks show that our ETS consistently improves generation quality, validating its effectiveness and design.
[LG-84] Best Arm Identification with LLM Judges and Limited Human
Link: https://arxiv.org/abs/2601.21471
Authors: Ruicheng Ao,Hongyu Chen,Siyang Gao,Hanwei Li,David Simchi-Levi
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 22 pages, 3 figures
Abstract:We study fixed-confidence best-arm identification (BAI) where a cheap but potentially biased proxy (e.g., LLM judge) is available for every sample, while an expensive ground-truth label can only be acquired selectively when using a human for auditing. Unlike classical multi-fidelity BAI, the proxy is biased (arm- and context-dependent) and ground truth is selectively observed. Consequently, standard multi-fidelity methods can mis-select the best arm, and uniform auditing, though accurate, wastes scarce resources and is inefficient. We prove that without bias correction and propensity adjustment, mis-selection probability may not vanish (even with unlimited proxy data). We then develop an estimator for the mean of each arm that combines proxy scores with inverse-propensity-weighted residuals and form anytime-valid confidence sequences for that estimator. Based on the estimator and confidence sequence, we propose an algorithm that adaptively selects and audits arms. The algorithm concentrates audits on unreliable contexts and close arms and we prove that a plug-in Neyman rule achieves near-oracle audit efficiency. Numerical experiments confirm the theoretical guarantees and demonstrate the superior empirical performance of the proposed algorithm.
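The combination of proxy scores with inverse-propensity-weighted residuals can be sketched in its generic textbook form. The abstract does not give the exact estimator, so the function below is an assumption: every sample contributes its proxy score, and audited samples additionally contribute the residual (truth minus proxy) divided by the probability with which they were audited. All names and numbers are hypothetical.

```python
def ipw_corrected_mean(proxy, audited, truth, propensity):
    """Proxy mean plus inverse-propensity-weighted audit residuals.

    proxy:      proxy score for every sample (e.g. an LLM judge score)
    audited:    whether the sample was sent for a human label
    truth:      human label where audited, ignored otherwise
    propensity: probability with which each sample was audited (> 0)
    """
    total = 0.0
    for p, a, y, pi in zip(proxy, audited, truth, propensity):
        residual = (y - p) / pi if a else 0.0
        total += p + residual
    return total / len(proxy)

# Hypothetical arm: the proxy overrates every sample; half are audited.
proxy = [0.9, 0.8, 0.7, 0.6]
audited = [True, False, True, False]
truth = [0.5, None, 0.5, None]
propensity = [0.5, 0.5, 0.5, 0.5]

estimate = ipw_corrected_mean(proxy, audited, truth, propensity)
naive = sum(proxy) / len(proxy)
print(round(naive, 2), round(estimate, 2))  # 0.75 0.45
```

The residual term is what removes the proxy's bias: without it the estimate stays at the proxy mean no matter how many proxy scores are collected.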
[LG-85] PPI-SVRG: Unifying Prediction-Powered Inference and Variance Reduction for Semi-Supervised Optimization
Link: https://arxiv.org/abs/2601.21470
Authors: Ruicheng Ao,Hongyu Chen,Haoyang Liu,David Simchi-Levi,Will Wei Sun
Subjects: Machine Learning (cs.LG); Econometrics (econ.EM); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 27 pages, 4 figures
Abstract:We study semi-supervised stochastic optimization when labeled data is scarce but predictions from pre-trained models are available. PPI and SVRG both reduce variance through control variates – PPI uses predictions, SVRG uses reference gradients. We show they are mathematically equivalent and develop PPI-SVRG, which combines both. Our convergence bound decomposes into the standard SVRG rate plus an error floor from prediction uncertainty. The rate depends only on loss geometry; predictions affect only the neighborhood size. When predictions are perfect, we recover SVRG exactly. When predictions degrade, convergence remains stable but reaches a larger neighborhood. Experiments confirm the theory: PPI-SVRG reduces MSE by 43–52% under label scarcity on mean estimation benchmarks and improves test accuracy by 2.7–2.9 percentage points on MNIST with only 10% labeled data.
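The control-variate parallel can be seen in the simplest setting, mean estimation. The sketch below implements the standard prediction-powered mean estimate, not PPI-SVRG itself: the average prediction on unlabeled data plus a "rectifier" computed from the labeled data. Structurally this mirrors the SVRG update, in which a per-sample gradient is corrected by the difference between a full reference gradient and the same sample's reference gradient. The data values are hypothetical.

```python
from statistics import mean

def ppi_mean(labeled_y, labeled_pred, unlabeled_pred):
    """Prediction-powered mean estimate.

    mean(predictions on unlabeled data)
      + mean(label - prediction on labeled data)   # the rectifier
    """
    rectifier = mean(y - f for y, f in zip(labeled_y, labeled_pred))
    return mean(unlabeled_pred) + rectifier

# Hypothetical data: the predictor overshoots every label by 0.5.
labeled_y = [1.0, 2.0, 3.0]
labeled_pred = [1.5, 2.5, 3.5]
unlabeled_pred = [2.0, 3.0, 4.0, 5.0]

estimate = ppi_mean(labeled_y, labeled_pred, unlabeled_pred)
print(estimate)  # 3.0
```

When the predictions are perfect the rectifier is zero and the estimate reduces to the mean prediction over the large unlabeled pool; when they are biased, the rectifier removes the bias at the cost of extra variance from the small labeled set.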
[LG-86] A block-coordinate descent framework for non-convex composite optimization. Application to sparse precision matrix estimation
Link: https://arxiv.org/abs/2601.21467
Authors: Guillaume Lauga(LJAD)
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Abstract:Block-coordinate descent (BCD) is the method of choice for solving numerous large-scale optimization problems; however, its theoretical study for non-convex optimization has received less attention. In this paper, we present a new block-coordinate descent (BCD) framework to tackle non-convex composite optimization problems, ensuring decrease of the objective function and convergence to a solution. This framework is general enough to include variable metric proximal gradient updates, proximal Newton updates, and alternated minimization updates. This generality allows it to encompass three versions of the most used solvers in the sparse precision matrix estimation problem, known as Graphical Lasso: graphical ISTA, Primal GLasso, and QUIC. We demonstrate the value of this new framework on non-convex sparse precision matrix estimation problems, providing convergence guarantees and up to a 100-fold reduction in the number of iterations required to reach state-of-the-art estimation quality.
[LG-87] Partial Feedback Online Learning
Link: https://arxiv.org/abs/2601.21462
Authors: Shihao Shao,Cong Fang,Zhouchen Lin,Dacheng Tao
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 32 pages
Abstract:We study partial-feedback online learning, where each instance admits a set of correct labels, but the learner only observes one correct label per round; any prediction within the correct set is counted as correct. This model captures settings such as language generation, where multiple responses may be valid but data provide only a single reference. We give a near-complete characterization of minimax regret for both deterministic and randomized learners in the set-realizable regime, i.e., in the regime where sublinear regret is generally attainable. For deterministic learners, we introduce the Partial-Feedback Littlestone dimension (PFLdim) and show it precisely governs learnability and minimax regret; technically, PFLdim cannot be defined via the standard version space, requiring a new collection version space viewpoint and an auxiliary dimension used only in the proof. We further develop the Partial-Feedback Measure Shattering dimension (PMSdim) to obtain tight bounds for randomized learners. We identify broad conditions ensuring inseparability between deterministic and randomized learnability (e.g., finite Helly number or nested-inclusion label structure), and extend the argument to set-valued online learning, resolving an open question of Raman et al. [2024b]. Finally, we show a sharp separation from weaker realistic and agnostic variants: outside set realizability, the problem can become information-theoretically intractable, with linear regret possible even for |H|=2. This highlights the need for fundamentally new, noise-sensitive complexity measures to meaningfully characterize learnability beyond set realizability.
[LG-88] Synthetic Pattern Generation and Detection of Financial Activities using Graph Autoencoders
Link: https://arxiv.org/abs/2601.21446
Authors: Francesco Zola,Lucia Muñoz,Andrea Venturi,Amaia Gil
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Emerging Technologies (cs.ET)
Comments: Accepted to the 7th International Workshop on Statistical Methods and Artificial Intelligence (IWSMAI’26)
Abstract:Illicit financial activities such as money laundering often manifest through recurrent topological patterns in transaction networks. Detecting these patterns automatically remains challenging due to the scarcity of labeled real-world data and strict privacy constraints. To address this, we investigate whether Graph Autoencoders (GAEs) can effectively learn and distinguish topological patterns that mimic money laundering operations when trained on synthetic data. The analysis consists of two phases: (i) data generation, where synthetic samples are created for seven well-known illicit activity patterns using parametrized generators that preserve structural consistency while introducing realistic variability; and (ii) model training and validation, where separate GAEs are trained on each pattern without explicit labels, relying solely on reconstruction error as an indicator of learned structure. We compare three GAE implementations based on three distinct convolutional layers: Graph Convolutional (GAE-GCN), GraphSAGE (GAE-SAGE), and Graph Attention Network (GAE-GAT). Experimental results show that GAE-GCN achieves the most consistent reconstruction performance across patterns, while GAE-SAGE and GAE-GAT exhibit competitive results only on a few specific patterns. These findings suggest that graph-based representation learning on synthetic data provides a viable path toward developing AI-driven tools for detecting illicit behaviors, overcoming the limitations of financial datasets.
[LG-89] Accurate Network Traffic Matrix Prediction via LEAD: an LLM-Enhanced Adapter-Based Conditional Diffusion Model
Link: https://arxiv.org/abs/2601.21437
Authors: Yu Sun,Yaqiong Liu,Nan Cheng,Jiayuan Li,Zihan Jia,Xialin Du,Mugen Peng
Subjects: Machine Learning (cs.LG)
Abstract:Driven by the evolution toward 6G and AI-native edge intelligence, network operations increasingly require predictive and risk-aware adaptation under stringent computation and latency constraints. The network Traffic Matrix (TM), which characterizes flow volumes between nodes, is a fundamental signal for proactive traffic engineering. However, accurate TM forecasting remains challenging due to the stochastic, non-linear, and bursty nature of network dynamics. Existing discriminative models often suffer from over-smoothing and provide limited uncertainty awareness, leading to poor fidelity under extreme bursts. To address these limitations, we propose LEAD, a Large Language Model (LLM)-Enhanced Adapter-based conditional Diffusion model. First, LEAD adopts a “Traffic-to-Image” paradigm to transform traffic matrices into RGB images, enabling global dependency modeling via vision backbones. Then, we design a “Frozen LLM with Trainable Adapter” model, which efficiently captures temporal semantics with limited computational cost. Moreover, we propose a Dual-Conditioning Strategy to precisely guide a diffusion model to generate complex, dynamic network traffic matrices. Experiments on the Abilene and GEANT datasets demonstrate that LEAD outperforms all baselines. On the Abilene dataset, LEAD attains a remarkable 45.2% reduction in RMSE against the best baseline, with the error margin rising only marginally from 0.1098 at one-step to 0.1134 at 20-step predictions. Meanwhile, on the GEANT dataset, LEAD achieves a 0.0258 RMSE at the 20-step prediction horizon, which is 27.3% lower than the best baseline.
[LG-90] ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation
Link: https://arxiv.org/abs/2601.21420
Authors: Zihao Huang,Jundong Zhou,Xingwei Qu,Qiyang Min,Ge Zhang
Subjects: Machine Learning (cs.LG)
Abstract:Large language models allocate uniform computation across all tokens, ignoring that some sequences are trivially predictable while others require deep reasoning. We introduce ConceptMoE, which dynamically merges semantically similar tokens into concept representations, performing implicit token-level compute allocation. A learnable chunk module identifies optimal boundaries by measuring inter-token similarity, compressing sequences by a target ratio R before they enter the compute-intensive concept model. Crucially, the MoE architecture enables controlled evaluation: we reallocate saved computation to match baseline activated FLOPs (excluding attention map computation) and total parameters, isolating genuine architectural benefits. Under these conditions, ConceptMoE consistently outperforms standard MoE across language and vision-language tasks, achieving +0.9 points on language pretraining, +2.3 points on long context understanding, and +0.6 points on multimodal benchmarks. When converting pretrained MoE during continual training with layer looping, gains reach +5.5 points, demonstrating practical applicability. Beyond performance, ConceptMoE reduces attention computation by up to R^2\times and KV cache by R\times. At R=2, empirical measurements show prefill speedups reaching 175% and decoding speedups up to 117% on long sequences. The minimal architectural modifications enable straightforward integration into existing MoE, demonstrating that adaptive concept-level processing fundamentally improves both effectiveness and efficiency of large language models.
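The R^2 and R factors quoted in this abstract follow from a short scaling argument. The sketch below assumes a sequence of n tokens compressed to n/R concepts, attention-map cost quadratic in sequence length, and KV-cache size linear in it; the symbols n, C_attn and M_KV are introduced here for illustration and do not come from the paper.

```latex
\begin{aligned}
C_{\text{attn}}(n) \propto n^2
  &\;\Longrightarrow\;
  \frac{C_{\text{attn}}(n)}{C_{\text{attn}}(n/R)} = \frac{n^2}{(n/R)^2} = R^2, \\
M_{\text{KV}}(n) \propto n
  &\;\Longrightarrow\;
  \frac{M_{\text{KV}}(n)}{M_{\text{KV}}(n/R)} = \frac{n}{n/R} = R.
\end{aligned}
```

At R = 2 this gives at most a 4x reduction in attention-map computation and a 2x reduction in KV cache. These are upper bounds ("up to"), since only the layers operating on the compressed sequence benefit.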
[LG-91] Learning to Optimize Job Shop Scheduling Under Structural Uncertainty
Link: https://arxiv.org/abs/2601.21389
Authors: Rui Zhang,Jianwei Niu,Xuefeng Liu,Shaojie Tang,Jing Yuan
Subjects: Machine Learning (cs.LG)
Abstract:The Job-Shop Scheduling Problem (JSSP), under various forms of manufacturing uncertainty, has recently attracted considerable research attention. Most existing studies focus on parameter uncertainty, such as variable processing times, and typically adopt the actor-critic framework. In this paper, we explore a different but prevalent form of uncertainty in JSSP: structural uncertainty. Structural uncertainty arises when a job may follow one of several routing paths, and the selection is determined not by policy, but by situational factors (e.g., the quality of intermediate products) that cannot be known in advance. Existing methods struggle to address this challenge due to incorrect credit assignment: a high-quality action may be unfairly penalized if it is followed by a time-consuming path. To address this problem, we propose a novel method named UP-AAC. In contrast to conventional actor-critic methods, UP-AAC employs an asymmetric architecture. While its actor receives a standard stochastic state, the critic is crucially provided with a deterministic state reconstructed in hindsight. This design allows the critic to learn a more accurate value function, which in turn provides a lower-variance policy gradient to the actor, leading to more stable learning. In addition, we design an attention-based Uncertainty Perception Model (UPM) to enhance the actor’s scheduling decisions. Extensive experiments demonstrate that our method outperforms existing approaches in reducing makespan on benchmark instances.
[LG-92] DA-SPS: A Dual-stage Network based on Singular Spectrum Analysis Patching-strategy and Spearman-correlation for Multivariate Time-series Prediction
Link: https://arxiv.org/abs/2601.21381
Authors: Tianhao Zhang,Shusen Ma,Yu Kang,Yun-Bo Zhao
Subjects: Machine Learning (cs.LG)
Comments: 12 pages, 7 figures, 6 tables, submitted to IEEE Transactions on Emerging Topics in Computational Intelligence
Abstract:Multivariate time-series forecasting, as a typical problem in the field of time series prediction, has a wide range of applications in weather forecasting, traffic flow prediction, and other scenarios. However, existing works do not effectively consider the impact of extraneous variables on the prediction of the target variable. On the other hand, they fail to fully extract complex sequence information based on various time patterns of the sequences. To address these drawbacks, we propose a DA-SPS model, which adopts different modules for feature extraction based on the information characteristics of different variables. DA-SPS mainly consists of two stages: the target variable processing stage (TVPS) and the extraneous variables processing stage (EVPS). In TVPS, the model first uses Singular Spectrum Analysis (SSA) to process the target variable sequence and then uses Long Short-Term Memory (LSTM) and P-Conv-LSTM which deploys a patching strategy to extract features from trend and seasonality components, respectively. In EVPS, the model filters extraneous variables that have a strong correlation with the target variate by using Spearman correlation analysis and further analyses them using the L-Attention module which consists of LSTM and attention mechanism. Finally, the results obtained by TVPS and EVPS are combined through weighted summation and linear mapping to produce the final prediction. The results on four public datasets demonstrate that the DA-SPS model outperforms existing state-of-the-art methods. Additionally, its performance in real-world scenarios is further validated using a private dataset collected by ourselves, which contains the test items’ information on laptop motherboards.
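The Spearman-based filtering step in EVPS can be illustrated with a small standard-library sketch. It is not the DA-SPS code: the threshold value, variable names and series below are hypothetical, and the function simply keeps the extraneous variables whose absolute rank correlation with the target exceeds the threshold.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def select_variables(target, candidates, threshold=0.8):
    """Keep candidates whose |Spearman correlation| with the target
    reaches the (hypothetical) threshold."""
    return [name for name, series in candidates.items()
            if abs(spearman(target, series)) >= threshold]

target = [1, 2, 3, 4, 5]
candidates = {
    "temp": [1, 4, 9, 16, 25],   # monotone increasing in the target
    "noise": [3, 1, 4, 1, 5],    # weakly related
    "inverse": [5, 4, 3, 2, 1],  # monotone decreasing
}
selected = select_variables(target, candidates)
print(selected)  # ['temp', 'inverse']
```

Because Spearman correlation works on ranks, the nonlinear but monotone `temp` series scores 1.0, which is the usual reason to prefer it over Pearson correlation for this kind of variable screening.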
[LG-93] Rethinking Federated Graph Foundation Models: A Graph-Language Alignment-based Approach
Link: https://arxiv.org/abs/2601.21369
Authors: Yinlin Zhu,Di Wu,Xianzhi Zhang,Yuming Ai,Xunkai Li,Miao Hu,Guocong Quan
Subjects: Machine Learning (cs.LG)
Comments: Under Review. E-mail: zhuylin27@mail2. this http URL
Abstract:Recent studies of federated graph foundational models (FedGFMs) break the idealized and untenable assumption of having centralized data storage to train graph foundation models, and accommodate the reality of distributed, privacy-restricted data silos. Despite their simplicity and intuition, existing studies that project aligned generalizable knowledge onto a discrete token space via vector-quantized backbones suffer from irreversible knowledge loss during the quantization process. In this context, we argue that reconciling the semantic-structural orthogonality and integrity between pre-trained language models (PLMs) and graph neural networks (GNNs) is paramount for developing effective FedGFMs while simultaneously mitigating the severe data heterogeneity and communication constraints inherent in distributed, resource-limited environments. To address these issues, we propose FedGALA (Federated Graph And Language Alignment), a framework that resolves graph-based semantic-structural orthogonality and integrity in federated settings by employing unsupervised contrastive learning to align GNNs and frozen PLMs within a continuous embedding space, thereby capturing robust, transferable general knowledge. Subsequently, FedGALA leverages a communication-efficient prompt tuning mechanism to steer these pre-aligned encoders and frozen PLMs, facilitating effective adaptation to diverse downstream tasks while circumventing the prohibitive overhead of full-parameter fine-tuning. The comprehensive experiments validate that FedGALA outperforms all competitive baselines across multi-domain datasets on multiple tasks with up to 14.37% performance improvement.
[LG-94] Perceptrons and localization of attention's mean-field landscape
Link: https://arxiv.org/abs/2601.21366
Authors: Antonio Álvarez-López,Borjan Geshkovski,Domènec Ruiz-Balet
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Abstract:The forward pass of a Transformer can be seen as an interacting particle system on the unit sphere: time plays the role of layers, particles that of token embeddings, and the unit sphere idealizes layer normalization. In some weight settings the system can even be seen as a gradient flow for an explicit energy, and one can make sense of the infinite context length (mean-field) limit thanks to Wasserstein gradient flows. In this paper we study the effect of the perceptron block in this setting, and show that critical points are generically atomic and localized on subsets of the sphere.
[LG-95] Graph-Free Root Cause Analysis
Link: https://arxiv.org/abs/2601.21359
Authors: Luan Pham
Subjects: Machine Learning (cs.LG); Software Engineering (cs.SE)
Abstract:Failures in complex systems demand rapid Root Cause Analysis (RCA) to prevent cascading damage. Existing RCA methods that operate without a dependency graph typically assume that the root cause has the highest anomaly score. This assumption fails when faults propagate, as a small delay at the root cause can accumulate into a much larger anomaly downstream. In this paper, we propose PRISM, a simple and efficient framework for RCA when the dependency graph is absent. We formulate a class of component-based systems under which PRISM performs RCA with theoretical guarantees. On 735 failures across 9 real-world datasets, PRISM achieves 68% Top-1 accuracy, a 258% improvement over the best baseline, while requiring only 8ms per diagnosis.
[LG-96] Expected Improvement via Gradient Norms
Link: https://arxiv.org/abs/2601.21357
Authors: Joshua Hang Sai Ip,Georgios Makrygiorgos,Ali Mesbah
Subjects: Machine Learning (cs.LG)
Abstract:Bayesian Optimization (BO) is a principled approach for optimizing expensive black-box functions, with Expected Improvement (EI) being one of the most widely used acquisition functions. Despite its empirical success, EI is known to be overly exploitative and can converge to suboptimal stationary points. We propose Expected Improvement via Gradient Norms (EI-GN), a novel acquisition function that applies the improvement principle to a gradient-aware auxiliary objective, thereby promoting sampling in regions that are both high-performing and approaching first-order stationarity. EI-GN relies on gradient observations used to learn gradient-enhanced surrogate models that enable principled gradient inference from function evaluations. We derive a tractable closed-form expression for EI-GN that allows efficient optimization and show that the proposed acquisition is consistent with the improvement-based acquisition framework. Empirical evaluations on standard BO benchmarks demonstrate that EI-GN yields consistent improvements against standard baselines. We further demonstrate applicability of EI-GN to control policy learning problems.
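For reference, the standard Expected Improvement that EI-GN builds on has a well-known closed form under a Gaussian posterior. The sketch below implements that baseline acquisition for maximization, not EI-GN itself; the argument names (`mu`, `sigma`, `best`) are chosen here for illustration.

```python
import math

def expected_improvement(mu, sigma, best):
    """Closed-form EI for maximization under a Gaussian posterior.

    mu, sigma: posterior mean and standard deviation at a candidate point
    best:      best objective value observed so far
    EI = (mu - best) * Phi(z) + sigma * phi(z),  z = (mu - best) / sigma
    """
    if sigma <= 0.0:
        # No posterior uncertainty: improvement is deterministic.
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best) * cdf + sigma * pdf

# A point whose mean equals the incumbent still has positive EI
# because of its uncertainty.
print(round(expected_improvement(1.0, 1.0, 1.0), 4))  # 0.3989
```

The first term rewards points with a high predicted mean (exploitation) and the second rewards uncertainty (exploration); the abstract's criticism is that in practice the balance tilts toward exploitation.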
[LG-97] Factored Causal Representation Learning for Robust Reward Modeling in RLHF
Link: https://arxiv.org/abs/2601.21350
Authors: Yupei Yang,Lin Yang,Wanxi Deng,Lin Qu,Fan Feng,Biwei Huang,Shikui Tu,Lei Xu
Subjects: Machine Learning (cs.LG)
Abstract:A reliable reward model is essential for aligning large language models with human preferences through reinforcement learning from human feedback. However, standard reward models are susceptible to spurious features that are not causally related to human labels. This can lead to reward hacking, where high predicted reward does not translate into better behavior. In this work, we address this problem from a causal perspective by proposing a factored representation learning framework that decomposes the model’s contextual embedding into (1) causal factors that are sufficient for reward prediction and (2) non-causal factors that capture reward-irrelevant attributes such as length or sycophantic bias. The reward head is then constrained to depend only on the causal component. In addition, we introduce an adversarial head trained to predict reward from the non-causal factors, while applying gradient reversal to discourage them from encoding reward-relevant information. Experiments on both mathematical and dialogue tasks demonstrate that our method learns more robust reward models and consistently improves downstream RLHF performance over state-of-the-art baselines. Analyses on length and sycophantic bias further validate the effectiveness of our method in mitigating reward hacking behaviors.
[LG-98] An introductory Generalization of the standard SVMs loss and its applications to Shallow and Deep Neural Networks
Link: https://arxiv.org/abs/2601.21331
Authors: Filippo Portera
Categories: Machine Learning (cs.LG)
Comments:
Abstract:We propose a new convex loss for SVMs, for both binary classification and regression models. We show the mathematical derivation of the dual problems and test them on several small datasets. The datasets are small because the SVM method scales poorly to larger instances. This preliminary study is intended to show that using pattern correlations inside the loss function could enhance generalisation performance. Consistently, the results show that generalisation measures are never worse than with the standard losses and in several cases are better. In our opinion, a careful study of this loss, coupled with shallow and deep neural networks, should be considered. In fact, we present some novel results obtained with those architectures.
[LG-99] Few-Shot Learning for Dynamic Operations of Automated Electric Taxi Fleets under Evolving Charging Infrastructure: A Meta-Deep Reinforcement Learning Approach
Link: https://arxiv.org/abs/2601.21312
Authors: Xiaozhuang Li,Xindi Tang,Fang He
Categories: Machine Learning (cs.LG)
Comments:
Abstract:With the rapid expansion of electric vehicles (EVs) and charging infrastructure, the effective management of Autonomous Electric Taxi (AET) fleets faces a critical challenge in environments with dynamic and uncertain charging availability. While most existing research assumes a static charging network, this simplification creates a significant gap between theoretical models and real-world operations. To bridge this gap, we propose GAT-PEARL, a novel meta-reinforcement learning framework that learns an adaptive operational policy. Our approach integrates a graph attention network (GAT) to effectively extract robust spatial representations under infrastructure layouts and model the complex spatiotemporal relationships of the urban environment, and employs probabilistic embeddings for actor-critic reinforcement learning (PEARL) to enable rapid, inference-based adaptation to changes in charging network layouts without retraining. Through extensive simulations on real-world data in Chengdu, China, we demonstrate that GAT-PEARL significantly outperforms conventional reinforcement learning baselines, showing superior generalization to unseen infrastructure layouts and achieving higher overall operational efficiency in dynamic settings.
[LG-100] Transferable Graph Condensation from the Causal Perspective
Link: https://arxiv.org/abs/2601.21309
Authors: Huaming Du,Yijie Huang,Su Yao,Yiying Wang,Yueyang Zhou,Jingwen Yang,Jinshi Zhang,Han Ji,Yu Zhao,Guisong Liu,Hegui Zhang,Carl Yang,Gang Kou
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The increasing scale of graph datasets has significantly improved the performance of graph representation learning methods, but it has also introduced substantial training challenges. Graph dataset condensation techniques have emerged to compress large datasets into smaller yet information-rich datasets, while maintaining similar test performance. However, these methods strictly require downstream applications to match the original dataset and task, which often fails in cross-task and cross-domain scenarios. To address these challenges, we propose a novel causal-invariance-based and transferable graph dataset condensation method, named TGCC, providing effective and transferable condensed datasets. Specifically, to preserve domain-invariant knowledge, we first extract domain causal-invariant features from the spatial domain of the graph using causal interventions. Then, to fully capture the structural and feature information of the original graph, we perform enhanced condensation operations. Finally, through spectral-domain enhanced contrastive learning, we inject the causal-invariant features into the condensed graph, ensuring that the compressed graph retains the causal information of the original graph. Experimental results on five public datasets and our novel FinReport dataset demonstrate that TGCC achieves up to a 13.41% improvement in cross-task and cross-domain complex scenarios compared to existing methods, and achieves state-of-the-art performance on 5 out of 6 datasets in the single dataset and task scenario.
[LG-101] Achieving $\varepsilon^{-2}$ Dependence for Average-Reward Q-Learning with a New Contraction Principle
Link: https://arxiv.org/abs/2601.21301
Authors: Zijun Chen,Zaiwei Chen,Nian Si,Shengbo Wang
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:We present the convergence rates of synchronous and asynchronous Q-learning for average-reward Markov decision processes, where the absence of contraction poses a fundamental challenge. Existing non-asymptotic results overcome this challenge by either imposing strong assumptions to enforce seminorm contraction or relying on discounted or episodic Markov decision processes as successive approximations, which either require unknown parameters or result in suboptimal sample complexity. In this work, under a reachability assumption, we establish optimal $\widetilde{O}(\varepsilon^{-2})$ sample complexity guarantees (up to logarithmic factors) for a simple variant of synchronous and asynchronous Q-learning that samples from the lazified dynamics, where the system remains in the current state with some fixed probability. At the core of our analysis is the construction of an instance-dependent seminorm and showing that, after a lazy transformation of the Markov decision process, the Bellman operator becomes one-step contractive under this seminorm.
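The "lazified dynamics" mentioned above can be illustrated with the usual lazy-chain construction, in which each transition kernel is mixed with the identity. This is a minimal sketch under that assumption; the tensor layout `P[s, a, s']`, the function name, and the parameter `alpha` are hypothetical, and the abstract does not say how the stay probability is chosen.

```python
import numpy as np


def lazify(P, alpha):
    """Return the lazy kernel alpha * I + (1 - alpha) * P for every action.

    P: array of shape (S, A, S) with P[s, a, s2] = Pr(s2 | s, a).
    alpha: probability of remaining in the current state (assumed in [0, 1)).
    """
    P = np.asarray(P, dtype=float)
    n_states = P.shape[0]
    # Identity over (s, s2), broadcast across the action axis.
    stay = np.eye(n_states)[:, None, :]
    return alpha * stay + (1.0 - alpha) * P
```

Each row of the result is still a probability distribution, and every state keeps at least `alpha` mass on itself, which removes periodicity from the original chain.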
[LG-102] Missing-Data-Induced Phase Transitions in Spectral PLS for Multimodal Learning
Link: https://arxiv.org/abs/2601.21294
Authors: Anders Gjølbye,Ida Kargaard,Emma Kargaard,Lars Kai Hansen
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Preprint
Abstract:Partial Least Squares (PLS) learns shared structure from paired data via the top singular vectors of the empirical cross-covariance (PLS-SVD), but multimodal datasets often have missing entries in both views. We study PLS-SVD under independent entry-wise missing-completely-at-random masking in a proportional high-dimensional spiked model. After appropriate normalization, the masked cross-covariance behaves like a spiked rectangular random matrix whose effective signal strength is attenuated by $\sqrt{\rho}$, where $\rho$ is the joint entry retention probability. As a result, PLS-SVD exhibits a sharp BBP-type phase transition: below a critical signal-to-noise threshold the leading singular vectors are asymptotically uninformative, while above it they achieve nontrivial alignment with the latent shared directions, with closed-form asymptotic overlap formulas. Simulations and semi-synthetic multimodal experiments corroborate the predicted phase diagram and recovery curves across aspect ratios, signal strengths, and missingness levels.
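The PLS-SVD estimator under masking can be sketched as follows. This is only an illustration: zero-filling the missing entries without rescaling, the rank-one simulation, and all function and parameter names are assumptions made here, and the normalization used in the paper is not given in the abstract.

```python
import numpy as np


def simulate_spiked_pair(n, p, q, theta, rng):
    """Hypothetical rank-one spiked model: both views share one latent factor z."""
    u = rng.standard_normal(p)
    u /= np.linalg.norm(u)
    v = rng.standard_normal(q)
    v /= np.linalg.norm(v)
    z = rng.standard_normal(n)
    X = theta * np.outer(z, u) + rng.standard_normal((n, p))
    Y = theta * np.outer(z, v) + rng.standard_normal((n, q))
    return X, Y, u, v


def masked_pls_svd(X, Y, keep_x, keep_y):
    """Top singular triplet of the zero-filled empirical cross-covariance.

    keep_x, keep_y: boolean arrays, True where an entry is observed.
    The data are assumed to be zero-mean, so no centering is applied.
    """
    n = X.shape[0]
    Xm = np.where(keep_x, X, 0.0)
    Ym = np.where(keep_y, Y, 0.0)
    C = Xm.T @ Ym / n
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, 0], s[0], Vt[0]
```

Comparing `abs(u_hat @ u)` across signal strengths and retention rates is one way to see the transition from uninformative to aligned singular vectors that the abstract describes.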
[LG-103] TimeSliver: Symbolic-Linear Decomposition for Explainable Time Series Classification ICLR2026
Link: https://arxiv.org/abs/2601.21289
Authors: Akash Pandey,Payal Mohapatra,Wei Chen,Qi Zhu,Sinan Keten
Categories: Machine Learning (cs.LG)
Comments: Accepted to ICLR 2026
Abstract:Identifying the extent to which every temporal segment influences a model’s predictions is essential for explaining model decisions and increasing transparency. While post-hoc explainable methods based on gradients and feature-based attributions have been popular, they suffer from reference state sensitivity and struggle to generalize across time-series datasets, as they treat time points independently and ignore sequential dependencies. Another perspective on explainable time-series classification is through interpretable components of the model, for instance, leveraging self-attention mechanisms to estimate temporal attribution; however, recent findings indicate that these attention weights often fail to provide faithful measures of temporal importance. In this work, we advance this perspective and present a novel explainability-driven deep learning framework, TimeSliver, which jointly utilizes raw time-series data and its symbolic abstraction to construct a representation that maintains the original temporal structure. Each element in this representation linearly encodes the contribution of each temporal segment to the final prediction, allowing us to assign a meaningful importance score to every time point. For time-series classification, TimeSliver outperforms other temporal attribution methods by 11% on 7 distinct synthetic and real-world multivariate time-series datasets. TimeSliver also achieves predictive performance within 2% of state-of-the-art baselines across 26 UEA benchmark datasets, positioning it as a strong and explainable framework for general time-series classification.
[LG-104] EGAM: Extended Graph Attention Model for Solving Routing Problems
Link: https://arxiv.org/abs/2601.21281
Authors: Licheng Wang,Yuzi Yan,Mingtao Huang,Yuan Shen
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Neural combinatorial optimization (NCO) solvers, implemented with graph neural networks (GNNs), have introduced new approaches for solving routing problems. Trained with reinforcement learning (RL), the state-of-the-art graph attention model (GAM) achieves near-optimal solutions without requiring expert knowledge or labeled data. In this work, we generalize the existing graph attention mechanism and propose the extended graph attention model (EGAM). Our model utilizes multi-head dot-product attention to update both node and edge embeddings, addressing the limitations of the conventional GAM, which considers only node features. We employ an autoregressive encoder-decoder architecture and train it with policy gradient algorithms that incorporate a specially designed baseline. Experiments show that EGAM matches or outperforms existing methods across various routing problems. Notably, the proposed model demonstrates exceptional performance on highly constrained problems, highlighting its efficiency in handling complex graph structures.
[LG-105] Model-Free Neural State Estimation in Nonlinear Dynamical Systems: A Comparative Study of Neural Architectures and Classical Filters
Link: https://arxiv.org/abs/2601.21266
Authors: Zhuochen Liu,Hans Walker,Rahul Jain
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 2 figures
Abstract:Neural network models are increasingly used for state estimation in control and decision-making problems, yet it remains unclear to what extent they behave as principled filters in nonlinear dynamical systems. Unlike classical filters, which rely on explicit knowledge of system dynamics and noise models, neural estimators can be trained purely from data without access to the underlying system equations. In this work, we present a systematic empirical comparison between such model-free neural network models and classical filtering methods across multiple nonlinear scenarios. Our study evaluates Transformer-based models, state-space neural networks, and recurrent architectures alongside particle filters and nonlinear Kalman filters. The results show that neural models (in particular, state-space models (SSMs)) achieve state estimation performance that approaches strong nonlinear Kalman filters in nonlinear scenarios and outperform weaker classical baselines despite lacking access to system models, while also attaining substantially higher inference throughput.
[LG-106] PHDME: Physics-Informed Diffusion Models without Explicit Governing Equations
Link: https://arxiv.org/abs/2601.21234
Authors: Kaiyuan Tan,Kendra Givens,Peilun Li,Thomas Beckers
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Diffusion models provide expressive priors for forecasting trajectories of dynamical systems, but are typically unreliable in the sparse data regime. Physics-informed machine learning (PIML) improves reliability in such settings; however, most methods require explicit governing equations during training, which are often only partially known due to complex and nonlinear dynamics. We introduce PHDME, a port-Hamiltonian diffusion framework designed for sparse observations and incomplete physics. PHDME leverages a port-Hamiltonian structural prior but does not require full knowledge of the closed-form governing equations. Our approach first trains a Gaussian process distributed Port-Hamiltonian system (GP-dPHS) on limited observations to capture an energy-based representation of the dynamics. The GP-dPHS is then used to generate a physically consistent artificial dataset for diffusion training, and to inform the diffusion model with a structured physics residual loss. After training, the diffusion model acts as an amortized sampler and forecaster for fast trajectory generation. Finally, we apply split conformal calibration to provide uncertainty statements for the generated predictions. Experiments on PDE benchmarks and a real-world spring system show improved accuracy and physical consistency under data scarcity.
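The split conformal calibration step mentioned at the end of the abstract can be sketched in its generic textbook form. The absolute-residual score, the scalar outputs, and the function name are assumptions; the abstract does not specify the nonconformity score PHDME uses.

```python
import numpy as np


def split_conformal_radius(cal_pred, cal_true, alpha=0.1):
    """Interval half-width from held-out calibration residuals.

    Uses the finite-sample rank ceil((n + 1) * (1 - alpha)) of the sorted
    absolute residuals. Returns inf when the calibration set is too small
    to support the requested coverage level.
    """
    scores = np.abs(np.asarray(cal_true, dtype=float) - np.asarray(cal_pred, dtype=float))
    n = scores.size
    k = int(np.ceil((n + 1) * (1.0 - alpha)))
    if k > n:
        return float("inf")
    return float(np.sort(scores)[k - 1])
```

A new prediction `y_hat` is then reported as the interval `[y_hat - r, y_hat + r]`, which covers the true value with probability at least `1 - alpha` when calibration and test points are exchangeable.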
[LG-107] Soft Quantization: Model Compression Via Weight Coupling
Link: https://arxiv.org/abs/2601.21219
Authors: Daniel T. Bernstein,Luca Di Carlo,David Schwab
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)
Comments: 7 pages, 6 figures
Abstract:We show that introducing short-range attractive couplings between the weights of a neural network during training provides a novel avenue for model quantization. These couplings rapidly induce the discretization of a model’s weight distribution, and they do so in a mixed-precision manner despite only relying on two additional hyperparameters. We demonstrate that, within an appropriate range of hyperparameters, our "soft quantization" scheme outperforms histogram-equalized post-training quantization on ResNet-20/CIFAR-10. Soft quantization provides both a new pipeline for the flexible compression of machine learning models and a new tool for investigating the trade-off between compression and generalization in high-dimensional loss landscapes.
[LG-108] Rethinking Self-Training Based Cross-Subject Domain Adaptation for SSVEP Classification ICASSP2026
Link: https://arxiv.org/abs/2601.21203
Authors: Weiguang Wang,Yong Liu,Yingjie Gao,Guangyuan Xu
Categories: Machine Learning (cs.LG)
Comments: Accepted to ICASSP 2026
Abstract:Steady-state visually evoked potentials (SSVEP)-based brain-computer interfaces (BCIs) are widely used due to their high signal-to-noise ratio and user-friendliness. Accurate decoding of SSVEP signals is crucial for interpreting user intentions in BCI applications. However, signal variability across subjects and the costly user-specific annotation limit recognition performance. Therefore, we propose a novel cross-subject domain adaptation method built upon the self-training paradigm. Specifically, a Filter-Bank Euclidean Alignment (FBEA) strategy is designed to exploit frequency information from SSVEP filter banks. Then, we propose a Cross-Subject Self-Training (CSST) framework consisting of two stages: Pre-Training with Adversarial Learning (PTAL), which aligns the source and target distributions, and Dual-Ensemble Self-Training (DEST), which refines pseudo-label quality. Moreover, we introduce a Time-Frequency Augmented Contrastive Learning (TFA-CL) module to enhance feature discriminability across multiple augmented views. Extensive experiments on the Benchmark and BETA datasets demonstrate that our approach achieves state-of-the-art performance across varying signal lengths, highlighting its superiority.
[LG-109] Flow Perturbation: Multi-Step Unbiased Jacobian Estimation for High-Dimensional Boltzmann Sampling
Link: https://arxiv.org/abs/2601.21177
Authors: Xin Peng,Ang Gao
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The scalability of continuous normalizing flows (CNFs) for unbiased Boltzmann sampling remains limited in high-dimensional systems due to the cost of Jacobian-determinant evaluation, which requires D backpropagation passes through the flow layers. Existing stochastic Jacobian estimators such as the Hutchinson trace estimator reduce computation but introduce bias, while the recently proposed Flow Perturbation method is unbiased yet suffers from high variance. We present Flow Perturbation++, a variance-reduced extension of Flow Perturbation that discretizes the probability-flow ODE and performs unbiased stepwise Jacobian estimation at each integration step. This multi-step construction retains the unbiasedness of Flow Perturbation while achieving substantially lower estimator variance. Integrated into a Sequential Monte Carlo framework, Flow Perturbation++ achieves significantly improved equilibrium sampling on a 1000D Gaussian Mixture Model and the all-atom Chignolin protein compared with Hutchinson-based and single-step Flow Perturbation baselines.
[LG-110] Breaking the Reasoning Horizon in Entity Alignment Foundation Models
Link: https://arxiv.org/abs/2601.21174
Authors: Yuanning Cui,Zequn Sun,Wei Hu,Kexuan Xin,Zhangjie Fu
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Entity alignment (EA) is critical for knowledge graph (KG) fusion. Existing EA models lack transferability and are incapable of aligning unseen KGs without retraining. While using graph foundation models (GFMs) offers a solution, we find that directly adapting GFMs to EA remains largely ineffective. This stems from a critical “reasoning horizon gap”: unlike link prediction in GFMs, EA necessitates capturing long-range dependencies across sparse and heterogeneous KG structures. To address this challenge, we propose an EA foundation model driven by a parallel encoding strategy. We utilize seed EA pairs as local anchors to guide the information flow, initializing and encoding two parallel streams simultaneously. This facilitates anchor-conditioned message passing and significantly shortens the inference trajectory by leveraging local structural proximity instead of global search. Additionally, we incorporate a merged relation graph to model global dependencies and a learnable interaction module for precise matching. Extensive experiments verify the effectiveness of our framework, highlighting its strong generalizability to unseen KGs.
[LG-111] The Powers of Precision: Structure-Informed Detection in Complex Systems – From Customer Churn to Seizure Onset
Link: https://arxiv.org/abs/2601.21170
Authors: Augusto Santos,Teresa Santos,Catarina Rodrigues,José M. F. Moura
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Emergent phenomena – onset of epileptic seizures, sudden customer churn, or pandemic outbreaks – often arise from hidden causal interactions in complex systems. We propose a machine learning method for their early detection that addresses a core challenge: unveiling and harnessing a system’s latent causal structure despite the data-generating process being unknown and partially observed. The method learns an optimal feature representation from a one-parameter family of estimators – powers of the empirical covariance or precision matrix – offering a principled way to tune in to the underlying structure driving the emergence of critical events. A supervised learning module then classifies the learned representation. We prove structural consistency of the family and demonstrate the empirical soundness of our approach on seizure detection and churn prediction, attaining competitive results in both. Beyond prediction, and toward explainability, we ascertain that the optimal covariance power exhibits evidence of good identifiability while capturing structural signatures, thus reconciling predictive performance with interpretable statistical structure.
[LG-112] Efficient Simple Regret Algorithms for Stochastic Contextual Bandits
Link: https://arxiv.org/abs/2601.21167
Authors: Shuai Liu,Alireza Bakhtiari,Alex Ayoub,Botao Hao,Csaba Szepesvári
Categories: Machine Learning (cs.LG)
Comments:
Abstract:We study stochastic contextual logistic bandits under the simple regret objective. While simple regret guarantees have been established for the linear case, no such results were previously known for the logistic setting. Building on ideas from contextual linear bandits and self-concordant analysis, we propose the first algorithm that achieves simple regret $\tilde{\mathcal{O}}(d/\sqrt{T})$. Notably, the leading term of our regret bound is free of the constant $\kappa = \mathcal{O}(\exp(S))$, where $S$ is a bound on the magnitude of the unknown parameter vector. The algorithm is shown to be fully tractable when the action set is finite. We also introduce a new variant of Thompson Sampling tailored to the simple-regret setting. This yields the first simple regret guarantee for randomized algorithms in stochastic contextual linear bandits, with regret $\tilde{\mathcal{O}}(d^{3/2}/\sqrt{T})$. Extending this method to the logistic case, we obtain a similarly structured Thompson Sampling algorithm that achieves the same regret bound – $\tilde{\mathcal{O}}(d^{3/2}/\sqrt{T})$ – again with no dependence on $\kappa$ in the leading term. The randomized algorithms, as expected, are cheaper to run than their deterministic counterparts. Finally, we conducted a series of experiments to empirically validate these theoretical guarantees.
[LG-113] A Federated Generalized Expectation-Maximization Algorithm for Mixture Models with an Unknown Number of Components ICLR2026
Link: https://arxiv.org/abs/2601.21160
Authors: Michael Ibrahim,Nagi Gebraeel,Weijun Xie
Categories: Machine Learning (cs.LG)
Comments: 49 Pages, Accepted at ICLR 2026
Abstract:We study the problem of federated clustering when the total number of clusters K across clients is unknown, and the clients have heterogeneous but potentially overlapping cluster sets in their local data. To that end, we develop FedGEM: a federated generalized expectation-maximization algorithm for the training of mixture models with an unknown number of components. Our proposed algorithm relies on each of the clients performing EM steps locally, and constructing an uncertainty set around the maximizer associated with each local component. The central server utilizes the uncertainty sets to learn potential cluster overlaps between clients, and infer the global number of clusters via closed-form computations. We perform a thorough theoretical study of our algorithm, presenting probabilistic convergence guarantees under common assumptions. Subsequently, we study the specific setting of isotropic GMMs, providing tractable, low-complexity computations to be performed by each client during each iteration of the algorithm, as well as rigorously verifying assumptions required for algorithm convergence. We perform various numerical experiments, where we empirically demonstrate that our proposed method achieves comparable performance to centralized EM, and that it outperforms various existing federated clustering methods.
[LG-114] Learning to Advect: A Neural Semi-Lagrangian Architecture for Weather Forecasting
Link: https://arxiv.org/abs/2601.21151
Authors: Carlos A. Pereira,Stéphane Gaudreault,Valentin Dallerit,Christopher Subich,Shoyon Panday,Siqi Wei,Sasa Zhang,Siddharth Rout,Eldad Haber,Raymond J. Spiteri,David Millard,Emilia Diaconescu
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Recent machine-learning approaches to weather forecasting often employ a monolithic architecture, where distinct physical mechanisms (advection, transport), diffusion-like mixing, thermodynamic processes, and forcing are represented implicitly within a single large network. This representation is particularly problematic for advection, where long-range transport must be treated with expensive global interaction mechanisms or through deep, stacked convolutional layers. To mitigate this, we present PARADIS, a physics-inspired global weather prediction model that imposes inductive biases on network behavior through a functional decomposition into advection, diffusion, and reaction blocks acting on latent variables. We implement advection through a Neural Semi-Lagrangian operator that performs trajectory-based transport via differentiable interpolation on the sphere, enabling end-to-end learning of both the latent modes to be transported and their characteristic trajectories. Diffusion-like processes are modeled through depthwise-separable spatial mixing, while local source terms and vertical interactions are modeled via pointwise channel interactions, enabling operator-level physical structure. PARADIS provides state-of-the-art forecast skill at a fraction of the training cost. On ERA5-based benchmarks, the 1 degree PARADIS model, with a total training cost of less than a GPU month, meets or exceeds the performance of 0.25 degree traditional and machine-learning baselines, including the ECMWF HRES forecast and DeepMind’s GraphCast.
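The trajectory-based transport idea behind a semi-Lagrangian operator can be illustrated with the classical scheme on a 1D periodic grid: trace each grid point back along the velocity to its departure point, then interpolate the field there. This is a minimal sketch of that classical scheme only. PARADIS works on the sphere with learned trajectories and latent variables, and none of the names below come from the paper.

```python
import numpy as np


def semi_lagrangian_step(field, velocity, dt, dx):
    """One semi-Lagrangian advection step on a periodic 1D grid.

    field: values at grid points, shape (n,).
    velocity: scalar or shape-(n,) array of velocities at the arrival points.
    Uses linear interpolation at the departure points.
    """
    field = np.asarray(field, dtype=float)
    n = field.size
    arrival = np.arange(n)
    # Departure positions measured in grid-index units.
    depart = arrival - np.asarray(velocity, dtype=float) * dt / dx
    left = np.floor(depart).astype(int)
    frac = depart - left
    return (1.0 - frac) * field[left % n] + frac * field[(left + 1) % n]
```

Because the update is built from interpolation rather than a local stencil, the step stays stable even when the displacement `velocity * dt` spans many grid cells, which is why the approach suits long-range transport.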
[LG-115] Smooth Dynamic Cutoffs for Machine Learning Interatomic Potentials
Link: https://arxiv.org/abs/2601.21147
Authors: Kevin Han,Haolin Cong,Bowen Deng,Amir Barati Farimani
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Machine learning interatomic potentials (MLIPs) have proven to be wildly useful for molecular dynamics simulations, powering countless drug and materials discovery applications. However, MLIPs face two primary bottlenecks preventing them from reaching realistic simulation scales: inference time and memory consumption. In this work, we address both issues by challenging the long-held belief that the cutoff radius for the MLIP must be held to a fixed, constant value. For the first time, we introduce a dynamic cutoff formulation that still leads to stable, long timescale molecular dynamics simulation. In introducing the dynamic cutoff, we are able to induce sparsity onto the underlying atom graph by targeting a specific number of neighbors per atom, significantly reducing both memory consumption and inference time. We show the effectiveness of a dynamic cutoff by implementing it onto 4 state of the art MLIPs: MACE, Nequip, Orbv3, and TensorNet, leading to 2.26x less memory consumption and 2.04x faster inference time, depending on the model and atomic system. We also perform an extensive error analysis and find that the dynamic cutoff models exhibit minimal accuracy dropoff compared to their fixed cutoff counterparts on both materials and molecular datasets. All model implementations and training code will be fully open sourced.
[LG-116] TRACE: Trajectory Recovery for Continuous Mechanism Evolution in Causal Representation Learning
Link: https://arxiv.org/abs/2601.21135
Authors: Shicheng Fan,Kun Zhang,Lu Cheng
Categories: Machine Learning (cs.LG)
Comments: 23 pages, 11 figures
Abstract:Temporal causal representation learning methods assume that causal mechanisms switch instantaneously between discrete domains, yet real-world systems often exhibit continuous mechanism transitions. For example, a vehicle’s dynamics evolve gradually through a turning maneuver, and human gait shifts smoothly from walking to running. We formalize this setting by modeling transitional mechanisms as convex combinations of finitely many atomic mechanisms, governed by time-varying mixing coefficients. Our theoretical contributions establish that both the latent causal variables and the continuous mixing trajectory are jointly identifiable. We further propose TRACE, a Mixture-of-Experts framework where each expert learns one atomic mechanism during training, enabling recovery of mechanism trajectories at test time. This formulation generalizes to intermediate mechanism states never observed during training. Experiments on synthetic and real-world data demonstrate that TRACE recovers mixing trajectories with up to 0.99 correlation, substantially outperforming discrete-switching baselines.
[LG-117] MapPFN: Learning Causal Perturbation Maps in Context
Link: https://arxiv.org/abs/2601.21092
Authors: Marvin Sextro,Weronika Kłos,Gabriel Dernbach
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Planning effective interventions in biological systems requires treatment-effect models that adapt to unseen biological contexts by identifying their specific underlying mechanisms. Yet single-cell perturbation datasets span only a handful of biological contexts, and existing methods cannot leverage new interventional evidence at inference time to adapt beyond their training data. To meta-learn a perturbation effect estimator, we present MapPFN, a prior-data fitted network (PFN) pretrained on synthetic data generated from a prior over causal perturbations. Given a set of experiments, MapPFN uses in-context learning to predict post-perturbation distributions, without gradient-based optimization. Despite being pretrained on in silico gene knockouts alone, MapPFN identifies differentially expressed genes, matching the performance of models trained on real single-cell data. Our code and data are available at this https URL.
[LG-118] Parametric Hyperbolic Conservation Laws: A Unified Framework for Conservation, Entropy Stability, and Hyperbolicity
Link: https://arxiv.org/abs/2601.21080
Authors: Lizuo Liu,Lu Zhang,Anne Gelb
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Mathematical Physics (math-ph)
Comments: arXiv admin note: text overlap with arXiv:2507.01795
Abstract:We propose a parametric hyperbolic conservation law (SymCLaw) for learning hyperbolic systems directly from data while ensuring conservation, entropy stability, and hyperbolicity by design. Unlike existing approaches that typically enforce only conservation or rely on prior knowledge of the governing equations, our method parameterizes the flux functions in a form that guarantees real eigenvalues and complete eigenvectors of the flux Jacobian, thereby preserving hyperbolicity. At the same time, we embed entropy-stable design principles by jointly learning a convex entropy function and its associated flux potential, ensuring entropy dissipation and the selection of physically admissible weak solutions. A corresponding entropy-stable numerical flux scheme provides compatibility with standard discretizations, allowing seamless integration into classical solvers. Numerical experiments on benchmark problems, including Burgers, shallow water, Euler, and KPP equations, demonstrate that SymCLaw generalizes to unseen initial conditions, maintains stability under noisy training data, and achieves accurate long-time predictions, highlighting its potential as a principled foundation for data-driven modeling of hyperbolic conservation laws.
[LG-119] Out-of-Distribution Generalization in Graph Foundation Models
Link: https://arxiv.org/abs/2601.21067
Authors: Haoyang Li,Haibo Chen,Xin Wang,Wenwu Zhu
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Graphs are a fundamental data structure for representing relational information in domains such as social networks, molecular systems, and knowledge graphs. However, graph learning models often suffer from limited generalization when applied beyond their training distributions. In practice, distribution shifts may arise from changes in graph structure, domain semantics, available modalities, or task formulations. To address these challenges, graph foundation models (GFMs) have recently emerged, aiming to learn general-purpose representations through large-scale pretraining across diverse graphs and tasks. In this survey, we review recent progress on GFMs from the perspective of out-of-distribution (OOD) generalization. We first discuss the main challenges posed by distribution shifts in graph learning and outline a unified problem setting. We then organize existing approaches based on whether they are designed to operate under a fixed task specification or to support generalization across heterogeneous task formulations, and summarize the corresponding OOD handling strategies and pretraining objectives. Finally, we review common evaluation protocols and discuss open directions for future research. To the best of our knowledge, this paper is the first survey for OOD generalization in GFMs.
[LG-120] Signal from Structure: Exploiting Submodular Upper Bounds in Generative Flow Networks
链接: https://arxiv.org/abs/2601.21061
作者: Alexandre Larouche,Audrey Durand
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Generative Flow Networks (GFlowNets; GFNs) are a class of generative models that learn to sample compositional objects proportionally to their a priori unknown value, their reward. We focus on the case where the reward has a specified, actionable structure, namely that it is submodular. We show submodularity can be harnessed to retrieve upper bounds on the reward of compositional objects that have not yet been observed. We provide in-depth analyses of the probability of such bounds occurring, as well as how many unobserved compositional objects can be covered by a bound. Following the Optimism in the Face of Uncertainty principle, we then introduce SUBo-GFN, which uses the submodular upper bounds to train a GFN. We show that SUBo-GFN generates orders of magnitude more training data than classical GFNs for the same number of queries to the reward function. We demonstrate the effectiveness of SUBo-GFN in terms of distribution matching and high-quality candidate generation on synthetic and real-world submodular tasks.
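A minimal sketch of how submodularity can bound the reward of an object that has not been queried yet. It relies only on the diminishing-returns property: for T ⊆ S and e ∉ S, f(S ∪ {e}) ≤ f(S) + f(T ∪ {e}) − f(T). The coverage reward, the `SETS` table, and the `upper_bound` helper are hypothetical illustrations, not the construction used in SUBo-GFN.

```python
from itertools import combinations

# Hypothetical toy reward: how many distinct elements the chosen sets cover.
SETS = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1, 6}}


def coverage(chosen):
    """Submodular coverage reward: size of the union of the chosen sets."""
    covered = set()
    for name in chosen:
        covered |= SETS[name]
    return len(covered)


def upper_bound(target, observed):
    """Tightest bound on f(target) implied by already observed rewards.

    Uses f(S + e) <= f(S) + f(T + e) - f(T) for every T that is a subset of S,
    where S = target - {e}. `observed` maps frozensets to known rewards.
    """
    best = float("inf")
    for e in target:
        s = target - {e}
        if s not in observed:
            continue
        for t in observed:
            if t <= s and (t | {e}) in observed:
                best = min(best, observed[s] + observed[t | {e}] - observed[t])
    return best


# Assume every object of size <= 2 has been queried; bound an unseen size-3 object.
observed = {frozenset(c): coverage(c) for r in range(3) for c in combinations(SETS, r)}
target = frozenset("abc")
bound = upper_bound(target, observed)
```

In this toy case the bound is tight, but in general it is only guaranteed to be no smaller than the true reward.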
[LG-121] Snowball: A Scalable All-to-All Ising Machine with Dual-Mode Markov Chain Monte Carlo Spin Selection and Asynchronous Spin Updates for Fast Combinatorial Optimization
链接: https://arxiv.org/abs/2601.21058
作者: Seungki Hong,Kyeongwon Jeong,Taekwang Jang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Ising machines have emerged as accelerators for combinatorial optimization. To enable practical deployment, this work aims to reduce time-to-solution by addressing three challenges: (1) hardware topology, (2) spin selection and update algorithms, and (3) scalable coupling-coefficient precision. Restricted topologies require minor embedding; naive parallel updates can oscillate or stall; and limited precision can preclude feasible mappings or degrade solution quality. This work presents Snowball, a digital, scalable, all-to-all coupled Ising machine that integrates dual-mode Markov chain Monte Carlo spin selection with asynchronous spin updates to promote convergence and reduce time-to-solution. The digital architecture supports wide, configurable coupling precision, unlike many analog realizations at high bit widths. A prototype on an AMD Alveo U250 accelerator card achieves an 8× reduction in time-to-solution relative to a state-of-the-art Ising machine on the same benchmark instance.
[LG-122] Test-Time Adaptation for Unsupervised Combinatorial Optimization
链接: https://arxiv.org/abs/2601.21048
作者: Yiqiao Liao,Farinaz Koushanfar,Parinaz Naghizadeh
类目: Machine Learning (cs.LG)
*备注:
Abstract:Unsupervised neural combinatorial optimization (NCO) enables learning powerful solvers without access to ground-truth solutions. Existing approaches fall into two disjoint paradigms: models trained for generalization across instances, and instance-specific models optimized independently at test time. While the former are efficient during inference, they lack effective instance-wise adaptability; the latter are flexible but fail to exploit learned inductive structure and are prone to poor local optima. This motivates the central question of our work: how can we leverage the inductive bias learned through generalization while unlocking the flexibility required for effective instance-wise adaptation? We first identify a challenge in bridging these two paradigms: generalization-focused models often constitute poor warm starts for instance-wise optimization, potentially underperforming even randomly initialized models when fine-tuned at test time. To resolve this incompatibility, we propose TACO, a model-agnostic test-time adaptation framework that unifies and extends the two existing paradigms for unsupervised NCO. TACO applies strategic warm-starting to partially relax trained parameters while preserving inductive bias, enabling rapid and effective unsupervised adaptation. Crucially, compared to naively fine-tuning a trained generalizable model or optimizing an instance-specific model from scratch, TACO achieves better solution quality while incurring negligible additional computational cost. Experiments on canonical CO problems, Minimum Vertex Cover and Maximum Clique, demonstrate the effectiveness and robustness of TACO across static, distribution-shifted, and dynamic combinatorial optimization problems, establishing it as a practical bridge between generalizable and instance-specific unsupervised NCO.
[LG-123] Predict-Project-Renoise: Sampling Diffusion Models under Hard Constraints
链接: https://arxiv.org/abs/2601.21033
作者: Omer Rochman-Sharabi,Gilles Louppe
类目: Machine Learning (cs.LG)
*备注: Code coming soon
Abstract:Neural emulators based on diffusion models show promise for scientific applications, but vanilla models cannot guarantee physical accuracy or constraint satisfaction. We address this by introducing a constrained sampling framework that enforces hard constraints, such as physical laws or observational consistency, at generation time. Our approach defines a constrained forward process that diffuses only over the feasible set of constraint-satisfying samples, inducing constrained marginal distributions. To reverse this, we propose Predict-Project-Renoise (PPR), an iterative algorithm that samples from the constrained marginals by alternating between denoising predictions, projecting onto the feasible set, and renoising. Experiments on 2D distributions, PDEs, and global weather forecasting demonstrate that PPR reduces constraint violations by over an order of magnitude while improving sample consistency and better matching the true constrained distribution compared to baselines.
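The predict, project, renoise loop described above can be sketched on a toy problem. Everything below is an assumption made for illustration: the denoiser is the closed-form posterior mean for a hypothetical N(MEAN, I) data prior, the hard constraint is a fixed coordinate sum, the noise schedule is arbitrary, and the renoising step uses plain isotropic Gaussian noise rather than the constrained forward process defined in the paper.

```python
import math
import random

MEAN = [1.0, 2.0, 3.0]   # hypothetical data mean (stands in for a trained model)
TOTAL = 3.0              # hypothetical hard constraint: coordinates must sum to TOTAL
# Hypothetical schedule of cumulative signal levels, from very noisy to nearly clean.
ALPHA_BARS = [0.02 + (0.999 - 0.02) * i / 49 for i in range(50)]


def denoise(x_t, alpha_bar):
    """Predict x_0 from x_t: exact posterior mean when x_0 ~ N(MEAN, I)."""
    root = math.sqrt(alpha_bar)
    return [m + root * (x - root * m) for x, m in zip(x_t, MEAN)]


def project(x):
    """Euclidean projection onto the hyperplane sum(x) == TOTAL."""
    shift = (sum(x) - TOTAL) / len(x)
    return [v - shift for v in x]


def ppr_sample(dim, alpha_bars, rng):
    """Alternate denoising, projection, and renoising down the schedule."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for i, alpha_bar in enumerate(alpha_bars):
        x0 = project(denoise(x, alpha_bar))       # predict, then project
        if i + 1 == len(alpha_bars):
            return x0                             # final output is feasible
        nxt = alpha_bars[i + 1]                   # renoise to the next level
        x = [math.sqrt(nxt) * v + math.sqrt(1.0 - nxt) * rng.gauss(0.0, 1.0)
             for v in x0]


sample = ppr_sample(len(MEAN), ALPHA_BARS, random.Random(0))
```

Because the last operation is a projection, the returned sample satisfies the constraint up to floating-point error, whatever the denoiser predicts.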
[LG-124] Order-Aware Test-Time Adaptation: Leveraging Temporal Dynamics for Robust Streaming Inference
链接: https://arxiv.org/abs/2601.21012
作者: Young Kyung Kim,Oded Schlesinger,Qiangqiang Wu,J. Matías Di Martino,Guillermo Sapiro
类目: Machine Learning (cs.LG)
*备注: 18 pages, 4 figures
Abstract:Test-Time Adaptation (TTA) enables pre-trained models to adjust to distribution shift by learning from unlabeled test-time streams. However, existing methods typically treat these streams as independent samples, overlooking the supervisory signal inherent in temporal dynamics. To address this, we introduce Order-Aware Test-Time Adaptation (OATTA). We formulate test-time adaptation as a gradient-free recursive Bayesian estimation task, using a learned dynamic transition matrix as a temporal prior to refine the base model’s predictions. To ensure safety in weakly structured streams, we introduce a likelihood-ratio gate (LLR) that reverts to the base predictor when temporal evidence is absent. OATTA is a lightweight, model-agnostic module that incurs negligible computational overhead. Extensive experiments across image classification, wearable and physiological signal analysis, and language sentiment analysis demonstrate its universality; OATTA consistently boosts established baselines, improving accuracy by up to 6.35%. Our findings establish that modeling temporal dynamics provides a critical, orthogonal signal beyond standard order-agnostic TTA approaches.
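As a rough illustration of gradient-free recursive Bayesian estimation over a prediction stream, the sketch below runs a standard forward-filter update: the previous belief is pushed through a class transition matrix and then reweighted by the base model's current class probabilities. Treating the base probabilities as likelihoods, the fixed matrix `TRANS`, and the function names are assumptions for this example; the learned transition matrix and the likelihood-ratio gate of OATTA are not modelled here.

```python
def filter_step(prev_belief, base_probs, trans):
    """One recursive Bayesian update over class labels.

    prev_belief: belief over classes after the previous sample
    base_probs:  base model's class probabilities for the current sample
    trans:       trans[i][j] = P(next class = j | current class = i)
    """
    k = len(base_probs)
    # Temporal prior: propagate the previous belief through the transition matrix.
    prior = [sum(prev_belief[i] * trans[i][j] for i in range(k)) for j in range(k)]
    # Reweight by the base prediction and renormalise.
    unnorm = [prior[j] * base_probs[j] for j in range(k)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def filter_stream(stream, trans):
    """Apply filter_step along an ordered stream, starting from a uniform belief."""
    k = len(trans)
    belief = [1.0 / k] * k
    beliefs = []
    for base_probs in stream:
        belief = filter_step(belief, base_probs, trans)
        beliefs.append(belief)
    return beliefs


# Hypothetical "sticky" dynamics: the label usually stays the same between samples.
TRANS = [[0.9, 0.1], [0.1, 0.9]]
# A confident prediction followed by a completely ambiguous one.
beliefs = filter_stream([[0.9, 0.1], [0.5, 0.5]], TRANS)
```

The second sample is uninformative on its own, so its refined prediction is driven entirely by the temporal prior carried over from the first.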
[LG-125] MADE: Benchmark Environments for Closed-Loop Materials Discovery
链接: https://arxiv.org/abs/2601.20996
作者: Shreshth A Malik,Tiarnan Doherty,Panagiotis Tigas,Muhammed Razzak,Stephen J. Roberts,Aron Walsh,Yarin Gal
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:
Abstract:Existing benchmarks for computational materials discovery primarily evaluate static predictive tasks or isolated computational sub-tasks. While valuable, these evaluations neglect the inherently iterative and adaptive nature of scientific discovery. We introduce MAterials Discovery Environments (MADE), a novel framework for benchmarking end-to-end autonomous materials discovery pipelines. MADE simulates closed-loop discovery campaigns in which an agent or algorithm proposes, evaluates, and refines candidate materials under a constrained oracle budget, capturing the sequential and resource-limited nature of real discovery workflows. We formalize discovery as a search for thermodynamically stable compounds relative to a given convex hull, and evaluate efficacy and efficiency via comparison to baseline algorithms. The framework is flexible; users can compose discovery agents from interchangeable components such as generative models, filters, and planners, enabling the study of arbitrary workflows ranging from fixed pipelines to fully agentic systems with tool use and adaptive decision making. We demonstrate this by conducting systematic experiments across a family of systems, enabling ablation of components in discovery pipelines, and comparison of how methods scale with system complexity.
[LG-126] Top-k on a Budget: Adaptive Ranking with Weak and Strong Oracles
链接: https://arxiv.org/abs/2601.20989
作者: Lutz Oettershagen
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:
Abstract:Identifying the top-$k$ items is fundamental but often prohibitive when exact valuations are expensive. We study a two-oracle setting with a fast, noisy weak oracle and a scarce, high-fidelity strong oracle (e.g., human expert verification or expensive simulation). We first analyze a simple screen-then-certify baseline (STC) and prove it makes at most $m(4\varepsilon_{\max})$ strong calls given jointly valid weak confidence intervals with maximum radius $\varepsilon_{\max}$, where $m(\cdot)$ denotes the near-tie mass around the top-$k$ threshold. We establish a conditional lower bound of $\Omega(m(\varepsilon_{\max}))$ for any algorithm given the same weak uncertainty. Our main contribution is ACE, an adaptive certification algorithm that focuses strong queries on critical boundary items, achieving the same $O(m(4\varepsilon_{\max}))$ bound while reducing strong calls in practice. We then introduce ACE-W, a fully adaptive two-phase method that allocates weak budget adaptively before running ACE, further reducing strong costs.
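One plausible reading of a screen-then-certify scheme is sketched below: weak confidence intervals settle every item that is clearly inside or clearly outside the top-k, and only the remaining near-tie items are sent to the strong oracle. The function name, the uniform radius `eps`, and the toy values are assumptions for illustration; the STC and ACE procedures analysed in the paper may differ in their details.

```python
def screen_then_certify(weak, eps, k, strong_oracle):
    """Return (top-k indices, number of strong calls).

    weak:          weak estimates, assumed to lie within eps of the true values
    strong_oracle: function index -> exact value (expensive)
    """
    n = len(weak)
    lo = [w - eps for w in weak]
    hi = [w + eps for w in weak]
    sure_in, unsure = [], []
    for i in range(n):
        # Items that might be at least as good as i, and items certainly better.
        could_beat = sum(1 for j in range(n) if j != i and hi[j] >= lo[i])
        surely_beat = sum(1 for j in range(n) if j != i and lo[j] > hi[i])
        if could_beat <= k - 1:
            sure_in.append(i)          # certainly in the top-k
        elif surely_beat < k:
            unsure.append(i)           # near the threshold: needs certification
        # otherwise at least k items are certainly better, so i is screened out
    strong_vals = {i: strong_oracle(i) for i in unsure}
    unsure.sort(key=strong_vals.get, reverse=True)
    chosen = sure_in + unsure[: k - len(sure_in)]
    return sorted(chosen), len(strong_vals)


# Hypothetical data: items 2 and 3 are a near tie around the top-3 threshold.
TRUE = [10.0, 9.0, 5.1, 4.9, 1.0, 0.0]
WEAK = [10.2, 8.9, 5.0, 5.0, 1.1, 0.1]
top3, calls = screen_then_certify(WEAK, eps=0.3, k=3, strong_oracle=TRUE.__getitem__)
```

Here only the two near-tie items trigger strong calls, which mirrors the idea that the strong budget scales with the mass of items close to the top-$k$ threshold.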
[LG-127] Pre-trained Encoders for Global Child Development: Transfer Learning Enables Deployment in Data-Scarce Settings
链接: https://arxiv.org/abs/2601.20987
作者: Md Muhtasim Munif Fahim,Md Rezaul Karim
类目: Machine Learning (cs.LG)
*备注:
Abstract:A large number of children experience preventable developmental delays each year, yet the deployment of machine learning in new countries has been stymied by a data bottleneck: reliable models require thousands of samples, while new programs begin with fewer than 100. We introduce the first pre-trained encoder for global child development, trained on 357,709 children across 44 countries using UNICEF survey data. With only 50 training samples, the pre-trained encoder achieves an average AUC of 0.65 (95% CI: 0.56-0.72), outperforming cold-start gradient boosting at 0.61 by 8-12% across regions. At N=500, the encoder achieves an AUC of 0.73. Zero-shot deployment to unseen countries achieves AUCs up to 0.84. We apply a transfer learning bound to explain why pre-training diversity enables few-shot generalization. These results establish that pre-trained encoders can transform the feasibility of ML for SDG 4.2.1 monitoring in resource-constrained settings.
[LG-128] Distributional Active Inference
链接: https://arxiv.org/abs/2601.20985
作者: Abdullah Akgül,Gulcin Baykal,Manuel Haußmann,Mustafa Mert Çelikok,Melih Kandemir
类目: Machine Learning (cs.LG)
*备注:
Abstract:Optimal control of complex environments with robotic systems faces two complementary and intertwined challenges: efficient organization of sensory state information and far-sighted action planning. Because the reinforcement learning framework addresses only the latter, it tends to deliver sample-inefficient solutions. Active inference is the state-of-the-art process theory that explains how biological brains handle this dual problem. However, its applications to artificial intelligence have thus far been limited to extensions of existing model-based approaches. We present a formal abstraction of reinforcement learning algorithms that spans model-based, distributional, and model-free approaches. This abstraction seamlessly integrates active inference into the distributional reinforcement learning framework, making its performance advantages accessible without transition dynamics modeling.
[LG-129] Monotone Optimisation with Learned Projections
链接: https://arxiv.org/abs/2601.20983
作者: Ahmed Rashwan,Keith Briggs,Chris Budd,Lisa Kreusser
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Monotone optimisation problems admit specialised global solvers such as the Polyblock Outer Approximation (POA) algorithm, but these methods typically require explicit objective and constraint functions. In many applications, these functions are only available through data, making POA difficult to apply directly. We introduce an algorithm-aware learning approach that integrates learned models into POA by directly predicting its projection primitive via the radial inverse, avoiding the costly bisection procedure used in standard POA. We propose Homogeneous-Monotone Radial Inverse (HM-RI) networks, structured neural architectures that enforce key monotonicity and homogeneity properties, enabling fast projection estimation. We provide a theoretical characterisation of radial inverse functions and show that, under mild structural conditions, a HM-RI predictor corresponds to the radial inverse of a valid set of monotone constraints. To reduce training overhead, we further develop relaxed monotonicity conditions that remain compatible with POA. Across multiple monotone optimisation benchmarks (indefinite quadratic programming, multiplicative programming, and transmit power optimisation), our approach yields substantial speed-ups in comparison to direct function estimation while maintaining strong solution quality, outperforming baselines that do not exploit monotonic structure.
[LG-130] A Theory of Universal Agnostic Learning
链接: https://arxiv.org/abs/2601.20961
作者: Steve Hanneke,Shay Moran
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:
Abstract:We provide a complete theory of optimal universal rates for binary classification in the agnostic setting. This extends the realizable-case theory of Bousquet, Hanneke, Moran, van Handel, and Yehudayoff (2021) by removing the realizability assumption on the distribution. We identify a fundamental tetrachotomy of optimal rates: for every concept class, the optimal universal rate of convergence of the excess error rate is one of $e^{-n}$, $e^{-o(n)}$, $o(n^{-1/2})$, or arbitrarily slow. We further identify simple combinatorial structures which determine which of these categories any given concept class falls into.
[LG-131] Noninvasive Intracranial Pressure Estimation Using Subspace System Identification and Bespoke Machine Learning Algorithms: A Learning-to-Rank Approach
链接: https://arxiv.org/abs/2601.20916
作者: Anni Zhao,Ayca Ermis,Jeffrey Robert Vitt,Sergio Brasil,Wellingson Paiva,Magdalena Kasprowicz,Malgorzata Burzynska,Robert Hamilton,Runze Yan,Ofer Sadan,J. Claude Hemphill,Lieven Vandenberghe,Xiao Hu
类目: Machine Learning (cs.LG)
*备注: 17 pages, 9 figures
Abstract:Objective: Accurate noninvasive estimation of intracranial pressure (ICP) remains a major challenge in critical care. We developed a bespoke machine learning algorithm that integrates system identification and ranking-constrained optimization to estimate mean ICP from noninvasive signals. Methods: A machine learning framework was proposed to obtain accurate mean ICP values using arbitrary noninvasive signals. The subspace system identification algorithm is employed to identify cerebral hemodynamics models for ICP simulation using arterial blood pressure (ABP), cerebral blood velocity (CBv), and R-wave to R-wave interval (R-R interval) signals in a comprehensive database. A mapping function to describe the relationship between the features of noninvasive signals and the estimation errors is learned using innovative ranking constraints through convex optimization. Patients across multiple clinical settings were randomly split into testing and training datasets for performance evaluation of the mapping function. Results: The results indicate that about 31.88% of testing entries achieved estimation errors within 2 mmHg and 34.07% of testing entries between 2 mmHg to 6 mmHg from the nonlinear mapping with constraints. Conclusion: Our results demonstrate the feasibility of the proposed noninvasive ICP estimation approach. Significance: Further validation and technical refinement are required before clinical deployment, but this work lays the foundation for safe and broadly accessible ICP monitoring in patients with acute brain injury and related conditions.
[LG-132] Leveraging Generative AI for Enhancing Domain-Driven Software Design
链接: https://arxiv.org/abs/2601.20909
作者: Götz-Henrik Wiegand,Filip Stepniak,Patrick Baier
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: Part of the Proceedings of the Upper-Rhine Artificial Intelligence Symposium 2024
Abstract:Domain-Driven Design (DDD) is a key framework for developing customer-oriented software, focusing on the precise modeling of an application’s domain. Traditionally, metamodels that describe these domains are created manually by system designers, forming the basis for iterative software development. This paper explores the partial automation of metamodel generation using generative AI, particularly for producing domain-specific JSON objects. By training a model on real-world DDD project data, we demonstrate that generative AI can produce syntactically correct JSON objects based on simple prompts, offering significant potential for streamlining the design process. To address resource constraints, the AI model was fine-tuned on a consumer-grade GPU using a 4-bit quantized version of Code Llama and Low-Rank Adaptation (LoRA). Despite limited hardware, the model achieved high performance, generating accurate JSON objects with minimal post-processing. This research illustrates the viability of incorporating generative AI into the DDD process, improving efficiency and reducing resource requirements, while also laying the groundwork for further advancements in AI-driven software development.
[LG-133] TwinWeaver: An LLM-Based Foundation Model Framework for Pan-Cancer Digital Twins
链接: https://arxiv.org/abs/2601.20906
作者: Nikita Makarov,Maria Bordukova,Lena Voith von Voithenberg,Estrella Pivel-Villanueva,Sabrina Mielke,Jonathan Wickes,Hanchen Wang,Mingyu Derek Ma,Keunwoo Choi,Kyunghyun Cho,Stephen Ra,Raul Rodriguez-Esteban,Fabian Schmich,Michael Menden
类目: Machine Learning (cs.LG)
*备注:
Abstract:Precision oncology requires forecasting clinical events and trajectories, yet modeling sparse, multi-modal clinical time series remains a critical challenge. We introduce TwinWeaver, an open-source framework that serializes longitudinal patient histories into text, enabling unified event prediction as well as forecasting with large language models, and use it to build Genie Digital Twin (GDT) on 93,054 patients across 20 cancer types. In benchmarks, GDT significantly reduces forecasting error, achieving a median Mean Absolute Scaled Error (MASE) of 0.87 compared to 0.97 for the strongest time-series baseline (p < 0.001). Furthermore, GDT improves risk stratification, achieving an average concordance index (C-index) of 0.703 across survival, progression, and therapy switching tasks, surpassing the best baseline of 0.662. GDT also generalizes to out-of-distribution clinical trials, matching trained baselines at zero-shot and surpassing them with fine-tuning, achieving a median MASE of 0.75-0.88 and outperforming the strongest baseline in event prediction with an average C-index of 0.672 versus 0.648. Finally, TwinWeaver enables an interpretable clinical reasoning extension, providing a scalable and transparent foundation for longitudinal clinical modeling.
[LG-134] Faster Predictive Coding Networks via Better Initialization
链接: https://arxiv.org/abs/2601.20895
作者: Luca Pinchetti,Simon Frieder,Thomas Lukasiewicz,Tommaso Salvatori
类目: Machine Learning (cs.LG)
*备注:
Abstract:Research aimed at scaling up neuroscience inspired learning algorithms for neural networks is accelerating. Recently, a key research area has been the study of energy-based learning algorithms such as predictive coding, due to their versatility and mathematical grounding. However, the applicability of such methods is held back by the large computational requirements caused by their iterative nature. In this work, we address this problem by showing that the choice of initialization of the neurons in a predictive coding network matters significantly and can notably reduce the required training times. Consequently, we propose a new initialization technique for predictive coding networks that aims to preserve the iterative progress made on previous training samples. Our approach suggests a promising path toward reconciling the disparities between predictive coding and backpropagation in terms of computational efficiency and final performance. In fact, our experiments demonstrate substantial improvements in convergence speed and final test loss in both supervised and unsupervised settings.
[LG-135] Is Parameter Isolation Better for Prompt-Based Continual Learning?
链接: https://arxiv.org/abs/2601.20894
作者: Jiangyang Li,Chenhao Ding,Songlin Dong,Qiang Wang,Jianchao Zhao,Yuhang He,Yihong Gong
类目: Machine Learning (cs.LG)
*备注: 17 pages, 5 figures
Abstract:Prompt-based continual learning methods effectively mitigate catastrophic forgetting. However, most existing methods assign a fixed set of prompts to each task, completely isolating knowledge across tasks and resulting in suboptimal parameter utilization. To address this, we consider the practical needs of continual learning and propose a prompt-sharing framework. This framework constructs a global prompt pool and introduces a task-aware gated routing mechanism that sparsely activates a subset of prompts to achieve dynamic decoupling and collaborative optimization of task-specific feature representations. Furthermore, we introduce a history-aware modulator that leverages cumulative prompt activation statistics to protect frequently used prompts from excessive updates, thereby mitigating inefficient parameter usage and knowledge forgetting. Extensive analysis and empirical results demonstrate that our approach consistently outperforms existing static allocation strategies in effectiveness and efficiency.
[LG-136] A generative machine learning model for designing metal hydrides applied to hydrogen storage
链接: https://arxiv.org/abs/2601.20892
作者: Xiyuan Liu,Christian Hacker,Shengnian Wang,Yuhua Duan
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Applications (stat.AP)
*备注:
Abstract:Developing new metal hydrides is a critical step toward efficient hydrogen storage in carbon-neutral energy systems. However, existing materials databases, such as the Materials Project, contain a limited number of well-characterized hydrides, which constrains the discovery of optimal candidates. This work presents a framework that integrates causal discovery with a lightweight generative machine learning model to generate novel metal hydride candidates that may not exist in current databases. Using a dataset of 450 samples (270 training, 90 validation, and 90 testing), the model generates 1,000 candidates. After ranking and filtering, six previously unreported chemical formulas and crystal structures are identified, four of which are validated by density functional theory simulations and show strong potential for future experimental investigation. Overall, the proposed framework provides a scalable and time-efficient approach for expanding hydrogen storage datasets and accelerating materials discovery.
[LG-137] IDE-Bench: Evaluating Large Language Models as IDE Agents on Real-World Software Engineering Tasks
链接: https://arxiv.org/abs/2601.20886
作者: Spencer Mateega,Jeff Yang,Tiana Costello,Shaurya Jadhav,Nicole Tian,Agustin Garcinuño
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
Abstract:IDE-Bench is a comprehensive framework for evaluating AI IDE agents on real-world software engineering tasks through an IDE-native tool interface. We present a Dockerized test harness that goes beyond raw terminal execution, granting models a structured tool ecosystem that represents AI-native IDEs like Cursor and Windsurf. By providing high-level abstractions for codebase search, structured file editing, and tools for testing full-stack applications, IDE-Bench evaluates an agent’s ability to act as a true engineering collaborator. For evaluation and to prevent training data contamination, we created 80 tasks across eight never-published repositories spanning C/C++, Java, and MERN stacks, representing modern tech stack production scenarios, including feature implementation, bug fixing, refactoring, and performance optimization that mirror daily developer workflows in private codebases. Our benchmark is the first to systematically correlate agent-reported intent with successful project-level modifications in a multi-language, full-stack environment on completely uncontaminated code.
[LG-138] VoxMorph: Scalable Zero-shot Voice Identity Morphing via Disentangled Embeddings ICASSP2026
链接: https://arxiv.org/abs/2601.20883
作者: Bharath Krishnamurthy,Ajita Rattani
类目: Sound (cs.SD); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted to IEEE ICASSP 2026 (51st International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2026). 5 pages, 1 figure, 3 tables. Project page: this https URL
Abstract:Morphing techniques generate artificial biometric samples that combine features from multiple individuals, allowing each contributor to be verified against a single enrolled template. While extensively studied in face recognition, this vulnerability remains largely unexplored in voice biometrics. Prior work on voice morphing is computationally expensive, non-scalable, and limited to acoustically similar identity pairs, constraining practical deployment. Moreover, existing sound-morphing methods target audio textures, music, or environmental sounds and are not transferable to voice identity manipulation. We propose VoxMorph, a zero-shot framework that produces high-fidelity voice morphs from as little as five seconds of audio per subject without model retraining. Our method disentangles vocal traits into prosody and timbre embeddings, enabling fine-grained interpolation of speaking style and identity. These embeddings are fused via Spherical Linear Interpolation (Slerp) and synthesized using an autoregressive language model coupled with a Conditional Flow Matching network. VoxMorph achieves state-of-the-art performance, delivering a 2.6x gain in audio quality, a 73% reduction in intelligibility errors, and a 67.8% morphing attack success rate on automated speaker verification systems under strict security thresholds. This work establishes a practical and scalable paradigm for voice morphing with significant implications for biometric security. The code and dataset are available on our project page: this https URL
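Spherical Linear Interpolation, which the abstract names as the fusion step, follows a standard formula: with ω the angle between the two vectors, slerp(u, v, t) = sin((1−t)ω)/sin(ω) · u + sin(tω)/sin(ω) · v. The sketch below applies it to plain Python lists; the function name and the toy two-dimensional vectors are illustrative assumptions, not VoxMorph's embedding pipeline.

```python
import math


def slerp(u, v, t):
    """Spherical linear interpolation between vectors u and v for t in [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_omega = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-8:
        # Nearly parallel (or antiparallel) vectors: fall back to linear blending.
        return [(1.0 - t) * a + t * b for a, b in zip(u, v)]
    w_u = math.sin((1.0 - t) * omega) / sin_omega
    w_v = math.sin(t * omega) / sin_omega
    return [w_u * a + w_v * b for a, b in zip(u, v)]


# Halfway between two orthogonal unit vectors stays on the unit circle.
mid = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Unlike a plain average, the interpolated vector keeps unit norm when both inputs are unit vectors, which is the usual reason for preferring Slerp when blending normalised embeddings.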
[LG-139] Diverse Approaches to Optimal Execution Schedule Generation
链接: https://arxiv.org/abs/2601.22113
作者: Robert de Witt,Mikko S. Pakkanen
类目: Trading and Market Microstructure (q-fin.TR); Machine Learning (cs.LG)
*备注: 27 pages, 15 figures, 5 tables
Abstract:We present the first application of MAP-Elites, a quality-diversity algorithm, to trade execution. Rather than searching for a single optimal policy, MAP-Elites generates a diverse portfolio of regime-specialist strategies indexed by liquidity and volatility conditions. Individual specialists achieve 8-10% performance improvements within their behavioural niches, while other cells show degradation, suggesting opportunities for ensemble approaches that combine improved specialists with the baseline PPO policy. Results indicate that quality-diversity methods offer promise for regime-adaptive execution, though substantial computational resources per behavioural cell may be required for robust specialist development across all market conditions. To ensure experimental integrity, we develop a calibrated Gymnasium environment focused on order scheduling rather than tactical placement decisions. The simulator features a transient impact model with exponential decay and square-root volume scaling, fit to 400+ U.S. equities with R^20.02 out-of-sample. Within this environment, two Proximal Policy Optimization architectures - both MLP and CNN feature extractors - demonstrate substantial improvements over industry baselines, with the CNN variant achieving 2.13 bps arrival slippage versus 5.23 bps for VWAP on 4,900 out-of-sample orders ( 21B notional). These results validate both the simulation realism and provide strong single-policy baselines for quality-diversity methods.
[LG-140] Efficient Stochastic Optimisation via Sequential Monte Carlo
链接: https://arxiv.org/abs/2601.22003
作者: James Cuin,Davide Carbone,Yanbo Tang,O. Deniz Akyildiz
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注:
Abstract:The problem of optimising functions with intractable gradients frequently arises in machine learning and statistics, ranging from maximum marginal likelihood estimation procedures to fine-tuning of generative models. Stochastic approximation methods for this class of problems typically require inner sampling loops to obtain (biased) stochastic gradient estimates, which rapidly become computationally expensive. In this work, we develop sequential Monte Carlo (SMC) samplers for optimisation of functions with intractable gradients. Our approach replaces expensive inner sampling methods with efficient SMC approximations, which can result in significant computational gains. We establish convergence results for the basic recursions defined by our methodology which SMC samplers approximate. We demonstrate the effectiveness of our approach on the reward-tuning of energy-based models within various settings.
[LG-141] Batched First-Order Methods for Parallel LP Solving in MIP
Link: https://arxiv.org/abs/2601.21990
Authors: Nicolas Blin, Stefano Gualandi, Christopher Maes, Andrea Lodi, Bartolomeo Stellato
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
*Comments: 15 pages, 4 figures, 4 tables
Abstract:We present a batched first-order method for solving multiple linear programs in parallel on GPUs. Our approach extends the primal-dual hybrid gradient algorithm to efficiently solve batches of related linear programming problems that arise in mixed-integer programming techniques such as strong branching and bound tightening. By leveraging matrix-matrix operations instead of repeated matrix-vector operations, we obtain significant computational advantages on GPU architectures. We demonstrate the effectiveness of our approach on various case studies and identify the problem sizes where first-order methods outperform traditional simplex-based solvers depending on the computational environment one can use. This is a significant step for the design and development of integer programming algorithms tightly exploiting GPU capabilities where we argue that some specific operations should be allocated to GPUs and performed in full instead of using light-weight heuristic approaches on CPUs.
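To make the "matrix-matrix instead of repeated matrix-vector" point concrete, here is a minimal NumPy sketch of primal-dual hybrid gradient applied to several LPs at once. It assumes the LPs share one constraint matrix `A` and differ only in cost and right-hand side, in the form min c^T x subject to A x = b, x >= 0. That setup, the fixed step size, and the function name `batched_pdhg` are assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def batched_pdhg(A, B, C, n_iters=2000):
    """PDHG for a batch of LPs  min c_k^T x  s.t.  A x = b_k, x >= 0.

    A: (m, n) shared constraint matrix (an assumption of this sketch).
    B: (m, K) right-hand sides, one column per LP.
    C: (n, K) cost vectors, one column per LP.
    Returns X of shape (n, K) with one approximate solution per column.
    """
    m, n = A.shape
    K = B.shape[1]
    # Equal primal and dual steps with step^2 * ||A||_2^2 < 1.
    step = 0.9 / np.linalg.norm(A, 2)
    X = np.zeros((n, K))
    Y = np.zeros((m, K))
    for _ in range(n_iters):
        # Primal step for all LPs at once: A.T @ Y is a matrix-matrix product.
        X_new = np.maximum(X - step * (C - A.T @ Y), 0.0)
        # Dual step with the usual extrapolation 2 * X_new - X.
        Y = Y + step * (B - A @ (2.0 * X_new - X))
        X = X_new
    return X

# Two tiny LPs over the simplex x1 + x2 = 1 with different costs.
A = np.array([[1.0, 1.0]])
B = np.array([[1.0, 1.0]])
C = np.array([[1.0, 2.0], [2.0, 1.0]])  # columns are the cost vectors
X = batched_pdhg(A, B, C)
```

Each column of `X` should put its mass on the cheaper coordinate of the corresponding cost vector. On a GPU the same two products per iteration would be dispatched as dense or sparse matrix-matrix kernels.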
[LG-142] Near-Optimal Private Tests for Simple and MLR Hypotheses
Link: https://arxiv.org/abs/2601.21959
Authors: Yu-Wei Chen, Raghu Pasupathy, Jordan Awan
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Abstract:We develop a near-optimal testing procedure under the framework of Gaussian differential privacy for simple as well as one- and two-sided tests under monotone likelihood ratio conditions. Our mechanism is based on a private mean estimator with data-driven clamping bounds, whose population risk matches the private minimax rate up to logarithmic factors. Using this estimator, we construct private test statistics that achieve the same asymptotic relative efficiency as the non-private, most powerful tests while maintaining conservative type I error control. In addition to our theoretical results, our numerical experiments show that our private tests outperform competing DP methods and offer comparable power to the non-private most powerful tests, even at moderately small sample sizes and privacy loss budgets.
[LG-143] Diffusion Path Samplers via Sequential Monte Carlo
Link: https://arxiv.org/abs/2601.21951
Authors: James Matthew Young, Paula Cordero-Encinar, Sebastian Reich, Andrew Duncan, O. Deniz Akyildiz
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*Comments:
Abstract:We develop a diffusion-based sampler for target distributions known up to a normalising constant. To this end, we rely on the well-known diffusion path that smoothly interpolates between a (simple) base distribution and the target distribution, widely used in diffusion models. Our approach is based on a practical implementation of diffusion-annealed Langevin Monte Carlo, which approximates the diffusion path with convergence guarantees. We tackle the score estimation problem by developing an efficient sequential Monte Carlo sampler that evolves auxiliary variables from conditional distributions along the path, which provides principled score estimates for time-varying distributions. We further develop novel control variate schedules that minimise the variance of these score estimates. Finally, we provide theoretical guarantees and empirically demonstrate the effectiveness of our method on several synthetic and real-world datasets.
[LG-144] Clustering in Deep Stochastic Transformers
Link: https://arxiv.org/abs/2601.21942
Authors: Lev Fedorov, Michaël E. Sander, Romuald Elie, Pierre Marion, Mathieu Laurière
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments: 24 pages
Abstract:Transformers have revolutionized deep learning across various domains but understanding the precise token dynamics remains a theoretical challenge. Existing theories of deep Transformers with layer normalization typically predict that tokens cluster to a single point; however, these results rely on deterministic weight assumptions, which fail to capture the standard initialization scheme in Transformers. In this work, we show that accounting for the intrinsic stochasticity of random initialization alters this picture. More precisely, we analyze deep Transformers where noise arises from the random initialization of value matrices. Under diffusion scaling and token-wise RMS normalization, we prove that, as the number of Transformer layers goes to infinity, the discrete token dynamics converge to an interacting-particle system on the sphere where tokens are driven by a common matrix-valued Brownian noise. In this limit, we show that initialization noise prevents the collapse to a single cluster predicted by deterministic models. For two tokens, we prove a phase transition governed by the interaction strength and the token dimension: unlike deterministic attention flows, antipodal configurations become attracting with positive probability. Numerical experiments confirm the predicted transition, reveal that antipodal formations persist for more than two tokens, and demonstrate that suppressing the intrinsic noise degrades accuracy.
[LG-145] On Approximate Computation of Critical Points
Link: https://arxiv.org/abs/2601.21917
Authors: Amir Ali Ahmadi, Georgina Hall
Categories: Optimization and Control (math.OC); Computational Complexity (cs.CC); Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*Comments:
Abstract:We show that computing even very coarse approximations of critical points is intractable for simple classes of nonconvex functions. More concretely, we prove that if there exists a polynomial-time algorithm that takes as input a polynomial in n variables of constant degree (as low as three) and outputs a point whose gradient has Euclidean norm at most 2^n whenever the polynomial has a critical point, then P=NP. The algorithm is permitted to return an arbitrary point when no critical point exists. We also prove hardness results for approximate computation of critical points under additional structural assumptions, including settings in which existence and uniqueness of a critical point are guaranteed, the function is lower bounded, and approximation is measured in terms of distance to a critical point. Overall, our results stand in contrast to the commonly-held belief that, in nonconvex optimization, approximate computation of critical points is a tractable task.
[LG-146] VSE: Variational state estimation of complex model-free process ICASSP2026
Link: https://arxiv.org/abs/2601.21887
Authors: Gustav Norén, Anubhab Ghosh, Fredrik Cumlin, Saikat Chatterjee
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: The article is accepted at ICASSP 2026
Abstract:We design a variational state estimation (VSE) method that provides a closed-form Gaussian posterior of an underlying complex dynamical process from (noisy) nonlinear measurements. The complex process is model-free. That is, we do not have a suitable physics-based model characterizing the temporal evolution of the process state. The closed-form Gaussian posterior is provided by a recurrent neural network (RNN). The use of RNN is computationally simple in the inference phase. For learning the RNN, an additional RNN is used in the learning phase. Both RNNs help each other learn better based on variational inference principles. The VSE is demonstrated for a tracking application - state estimation of a stochastic Lorenz system (a benchmark process) using a 2-D camera measurement model. The VSE is shown to be competitive against a particle filter that knows the Lorenz system model and a recently proposed data-driven state estimation method that does not know the Lorenz system model.
[LG-147] On Forgetting and Stability of Score-based Generative models
Link: https://arxiv.org/abs/2601.21868
Authors: Stanislas Strasman, Gabriel Cardoso, Sylvain Le Corff, Vincent Lemaire, Antonio Ocello
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Abstract:Understanding the stability and long-time behavior of generative models is a fundamental problem in modern machine learning. This paper provides quantitative bounds on the sampling error of score-based generative models by leveraging stability and forgetting properties of the Markov chain associated with the reverse-time dynamics. Under weak assumptions, we provide the two structural properties to ensure the propagation of initialization and discretization errors of the backward process: a Lyapunov drift condition and a Doeblin-type minorization condition. A practical consequence is quantitative stability of the sampling procedure, as the reverse diffusion dynamics induces a contraction mechanism along the sampling trajectory. Our results clarify the role of stochastic dynamics in score-based models and provide a principled framework for analyzing propagation of errors in such approaches.
[LG-148] Generative Modeling of Discrete Data Using Geometric Latent Subspaces
Link: https://arxiv.org/abs/2601.21831
Authors: Daniel Gonzalez-Alvarado, Jonas Cassel, Stefania Petra, Christoph Schnörr
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Abstract:We introduce the use of latent subspaces in the exponential parameter space of product manifolds of categorial distributions, as a tool for learning generative models of discrete data. The low-dimensional latent space encodes statistical dependencies and removes redundant degrees of freedom among the categorial variables. We equip the parameter domain with a Riemannian geometry such that the spaces and distances are related by isometries which enables consistent flow matching. In particular, geodesics become straight lines which makes model training by flow matching effective. Empirical results demonstrate that reduced latent dimensions suffice to represent data for generative modeling.
[LG-149] A Judge-Aware Ranking Framework for Evaluating Large Language Models without Ground Truth
Link: https://arxiv.org/abs/2601.21817
Authors: Mingyuan Xu, Xinzi Tan, Jiawei Wu, Doudou Zhou
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Abstract:Evaluating large language models (LLMs) on open-ended tasks without ground-truth labels is increasingly done via the LLM-as-a-judge paradigm. A critical but under-modeled issue is that judge LLMs differ substantially in reliability; treating all judges equally can yield biased leaderboards and misleading uncertainty estimates. More data can make evaluation more confidently wrong under misspecified aggregation. We propose a judge-aware ranking framework that extends the Bradley-Terry-Luce model by introducing judge-specific discrimination parameters, jointly estimating latent model quality and judge reliability from pairwise comparisons without reference labels. We establish identifiability up to natural normalizations and prove consistency and asymptotic normality of the maximum likelihood estimator, enabling confidence intervals for score differences and rank comparisons. Across multiple public benchmarks and a newly collected dataset, our method improves agreement with human preferences, achieves higher data efficiency than unweighted baselines, and produces calibrated uncertainty quantification for LLM rankings.
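A minimal sketch of what a judge-aware Bradley-Terry-Luce likelihood could look like is given below. The abstract only says that judge-specific discrimination parameters are introduced, so the logistic form in which the discrimination multiplies the quality gap is an assumption here, as are the names `win_probability` and `log_likelihood`.

```python
import math

def win_probability(theta_i, theta_j, discrimination):
    """P(model i preferred over model j) under a BTL model whose logit is
    scaled by a judge-specific discrimination parameter (assumed form)."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta_i - theta_j)))

def log_likelihood(comparisons, theta, gamma):
    """Log-likelihood of pairwise outcomes.

    comparisons: iterable of (winner, loser, judge) index triples.
    theta: latent quality per model; gamma: discrimination per judge.
    """
    return sum(math.log(win_probability(theta[w], theta[l], gamma[k]))
               for w, l, k in comparisons)

theta = [0.0, 1.0]   # model 1 is better than model 0
gamma = [0.2, 3.0]   # judge 0 is noisy, judge 1 is sharp
data = [(1, 0, 0), (0, 1, 0), (1, 0, 1), (1, 0, 1)]
ll = log_likelihood(data, theta, gamma)
```

Under this form a judge with discrimination near zero returns close to coin-flip verdicts, so its comparisons carry little weight in the likelihood. Note that rescaling `theta` by a constant and dividing `gamma` by the same constant leaves every probability unchanged, which is one reason such a model is identifiable only up to a normalization.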
[LG-150] Reinforcement Learning for Adaptive Composition of Quantum Circuit Optimisation Passes
Link: https://arxiv.org/abs/2601.21629
Authors: Daniel Mills, Ifan Williams, Jacob Swain, Gabriel Matos, Enrico Rinaldi, Alexander Koziell-Pipe
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*Comments: 14 pages, 7 figures
Abstract:Many quantum software development kits provide a suite of circuit optimisation passes. These passes have been highly optimised and tested in isolation. However, the order in which they are applied is left to the user, or else defined in general-purpose default pass sequences. While general-purpose sequences miss opportunities for optimisation which are particular to individual circuits, designing pass sequences bespoke to particular circuits requires exceptional knowledge about quantum circuit design and optimisation. Here we propose and demonstrate training a reinforcement learning agent to compose optimisation-pass sequences. In particular the agent’s action space consists of passes for two-qubit gate count reduction used in default PyTKET pass sequences. For the circuits in our diverse test set, the (mean, median) fraction of two-qubit gates removed by the agent is (57.7%, 56.7%), compared to (41.8%, 50.0%) for the next best default pass sequence.
[LG-151] Manifold constrained steepest descent
Link: https://arxiv.org/abs/2601.21487
Authors: Kaiwei Yang, Lexiao Lai
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
*Comments: 23 pages, 7 figures, and 5 tables
Abstract:Norm-constrained linear minimization oracle (LMO)-based optimizers such as spectral gradient descent and Muon are attractive in large-scale learning, but extending them to manifold-constrained problems is nontrivial and often leads to nested-loop schemes that solve tangent-space subproblems iteratively. We propose Manifold Constrained Steepest Descent (MCSD), a single-loop framework for optimization over manifolds that selects a norm-induced steepest-descent direction via an LMO applied to the Riemannian gradient, and then returns to the manifold via projection. Under standard smoothness assumptions, we establish convergence guarantees for MCSD and a stochastic momentum variant. We further introduce SPEL, the spectral-norm specialization of MCSD on the Stiefel manifold, which admits scalable implementations via fast matrix sign computations. Experiments on PCA, orthogonality-constrained CNNs, and manifold-constrained LLM adapter tuning demonstrate improved stability and competitive performance relative to standard Riemannian baselines and existing manifold-aware LMO methods.
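The three ingredients named in the abstract (Riemannian gradient, spectral-norm LMO, projection back to the manifold) can be sketched for the Stiefel manifold as follows. This is a reading of the abstract rather than the authors' algorithm: the tangent projection uses the embedded metric, the LMO and the projection are both computed with a plain SVD instead of a fast matrix sign routine, and the names `polar_factor` and `spectral_step_on_stiefel` are hypothetical.

```python
import numpy as np

def polar_factor(M):
    """Orthonormal polar factor U V^T of a tall matrix M, via SVD."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def spectral_step_on_stiefel(X, G, lr):
    """One assumed MCSD-style step on the Stiefel manifold {X : X^T X = I}.

    X: (n, p) point with orthonormal columns; G: Euclidean gradient at X.
    """
    # Riemannian gradient: project G onto the tangent space at X.
    XtG = X.T @ G
    rgrad = G - X @ (XtG + XtG.T) / 2.0
    # LMO over the spectral-norm unit ball: <rgrad, D> is minimised
    # by D = -polar_factor(rgrad).
    direction = -polar_factor(rgrad)
    # Move, then return to the manifold (nearest orthonormal frame).
    return polar_factor(X + lr * direction)

# PCA-style toy objective f(X) = -trace(X^T A X) with gradient -2 A X.
A = np.diag([3.0, 2.0, 1.0, 0.5])
X0 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]) / np.sqrt(2)
f = lambda X: -np.trace(X.T @ A @ X)
X1 = spectral_step_on_stiefel(X0, -2.0 * A @ X0, lr=0.1)
```

After the step, `X1` still has orthonormal columns and the toy objective is lower than at `X0`. In a scalable implementation the two SVD calls would be the natural place to substitute an iterative matrix sign computation.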
[LG-152] Questioning the Coverage-Length Metric in Conformal Prediction: When Shorter Intervals Are Not Better
Link: https://arxiv.org/abs/2601.21455
Authors: Yizhou Min, Yizhou Lu, Lanqi Li, Zhen Zhang, Jiaye Teng
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Abstract:Conformal prediction (CP) has become a cornerstone of distribution-free uncertainty quantification, conventionally evaluated by its coverage and interval length. This work critically examines the sufficiency of these standard metrics. We demonstrate that the interval length might be deceptively improved through a counter-intuitive approach termed Prejudicial Trick (PT), while the coverage remains valid. Specifically, for any given test sample, PT probabilistically returns an interval, which is either null or constructed using an adjusted confidence level, thereby preserving marginal coverage. While PT potentially yields a deceptively lower interval length, it introduces practical vulnerabilities: the same input can yield completely different prediction intervals across repeated runs of the algorithm. We formally derive the conditions under which PT achieves these misleading improvements and provide extensive empirical evidence across various regression and classification tasks. Furthermore, we introduce a new metric, interval stability, which helps detect whether a new CP method implicitly improves the length based on such PT-like techniques.
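The coverage-preserving arithmetic behind a PT-style randomisation can be sketched as follows. With probability p the method returns the empty set, and otherwise it uses miscoverage level alpha' chosen so that (1 - p)(1 - alpha') = 1 - alpha. The toy below is an assumption-laden illustration: it uses oracle N(0,1) quantiles in place of conformal scores, and `adjusted_alpha` and `pt_interval` are hypothetical names.

```python
import random
from statistics import NormalDist

def adjusted_alpha(alpha, p_null):
    """Miscoverage level alpha' with (1 - p_null) * (1 - alpha') = 1 - alpha."""
    if not 0.0 <= p_null < alpha:
        raise ValueError("need 0 <= p_null < alpha")
    return 1.0 - (1.0 - alpha) / (1.0 - p_null)

def pt_interval(alpha, p_null, rng):
    """Return None (empty set) with prob. p_null, else a wider N(0,1) interval."""
    if rng.random() < p_null:
        return None
    z = NormalDist().inv_cdf(1.0 - adjusted_alpha(alpha, p_null) / 2.0)
    return (-z, z)

rng = random.Random(0)
alpha, p_null, n = 0.1, 0.05, 20000
hits = 0
for _ in range(n):
    y = rng.gauss(0.0, 1.0)
    interval = pt_interval(alpha, p_null, rng)
    hits += interval is not None and interval[0] <= y <= interval[1]
coverage = hits / n
```

The empirical `coverage` stays close to 1 - alpha even though some calls return nothing at all, which is the instability the abstract points to: calling `pt_interval` twice for the same input can give an empty set once and a wide interval the next time. Whether the average length actually shrinks depends on the score distribution, and this toy makes no claim about that.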
[LG-153] Statsformer: Validated Ensemble Learning with LLM-Derived Semantic Priors
Link: https://arxiv.org/abs/2601.21410
Authors: Erica Zhang, Naomi Sagan, Danny Tse, Fangzhao Zhang, Mert Pilanci, Jose Blanchet
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Abstract:We introduce Statsformer, a principled framework for integrating large language model (LLM)-derived knowledge into supervised statistical learning. Existing approaches are limited in adaptability and scope: they either inject LLM guidance as an unvalidated heuristic, which is sensitive to LLM hallucination, or embed semantic information within a single fixed learner. Statsformer overcomes both limitations through a guardrailed ensemble architecture. We embed LLM-derived feature priors within an ensemble of linear and nonlinear learners, adaptively calibrating their influence via cross-validation. This design yields a flexible system with an oracle-style guarantee that it performs no worse than any convex combination of its in-library base learners, up to statistical error. Empirically, informative priors yield consistent performance improvements, while uninformative or misspecified LLM guidance is automatically downweighted, mitigating the impact of hallucinations across a diverse range of prediction tasks.
[LG-154] Bulk-Calibrated Credal Ambiguity Sets: Fast Tractable Decision Making under Out-of-Sample Contamination
Link: https://arxiv.org/abs/2601.21324
Authors: Mengqi Chen, Thomas B. Berrett, Theodoros Damoulas, Michele Caprio
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Abstract:Distributionally robust optimisation (DRO) minimises the worst-case expected loss over an ambiguity set that can capture distributional shifts in out-of-sample environments. While Huber (linear-vacuous) contamination is a classical minimal-assumption model for an \varepsilon-fraction of arbitrary perturbations, including it in an ambiguity set can make the worst-case risk infinite and the DRO objective vacuous unless one imposes strong boundedness or support assumptions. We address these challenges by introducing bulk-calibrated credal ambiguity sets: we learn a high-mass bulk set from data while considering contamination inside the bulk and bounding the remaining tail contribution separately. This leads to a closed-form, finite \mathrm{mean}+\sup robust objective and tractable linear or second-order cone programs for common losses and bulk geometries. Through this framework, we highlight and exploit the equivalence between the imprecise probability (IP) notion of upper expectation and the worst-case risk, demonstrating how IP credal sets translate into DRO objectives with interpretable tolerance levels. Experiments on heavy-tailed inventory control, geographically shifted house-price regression, and demographically shifted text classification show competitive robustness-accuracy trade-offs and efficient optimisation times, using Bayesian, frequentist, or empirical reference distributions.
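The "mean + sup" structure follows from a standard one-line computation for Huber contamination, written out here as a worked equation. The symbols are assumptions of this note: P is the reference distribution, \ell the loss, and B the set in which the contaminating distribution R is allowed to place mass. The paper's exact objective, including its separate tail bound, is not given in the abstract and is not reproduced here.

```latex
\sup_{R \in \mathcal{P}(B)}
  \mathbb{E}_{(1-\varepsilon)P + \varepsilon R}\bigl[\ell(Z)\bigr]
  \;=\; (1-\varepsilon)\,\mathbb{E}_{P}\bigl[\ell(Z)\bigr]
  \;+\; \varepsilon \sup_{z \in B} \ell(z)
```

The expectation is linear in the mixture, and the supremum over R is approached by concentrating R near a maximiser of \ell on B. If B is the whole sample space and \ell is unbounded, the second term is infinite, which is the vacuity problem the abstract describes. Restricting contamination to a bounded bulk set keeps it finite.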
[LG-155] Solving the Offline and Online Min-Max Problem of Non-smooth Submodular-Concave Functions: A Zeroth-Order Approach
Link: https://arxiv.org/abs/2601.21243
Authors: Amir Ali Farzin, Yuen-Man Pun, Philipp Braun, Tyler Summers, Iman Shames
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*Comments:
Abstract:We consider max-min and min-max problems with objective functions that are possibly non-smooth, submodular with respect to the minimiser and concave with respect to the maximiser. We investigate the performance of a zeroth-order method applied to this problem. The method is based on the subgradient of the Lovász extension of the objective function with respect to the minimiser and based on Gaussian smoothing to estimate the smoothed function gradient with respect to the maximiser. In the expectation sense, we prove the convergence of the algorithm to an \epsilon-saddle point in the offline case. Moreover, we show that, in the expectation sense, in the online setting, the algorithm achieves O(\sqrt{N\bar{P}_N}) online duality gap, where N is the number of iterations and \bar{P}_N is the path length of the sequence of optimal decisions. The complexity analysis and hyperparameter selection are presented for all the cases. The theoretical results are illustrated via numerical examples.
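The two building blocks the abstract names can each be written in a few lines: the greedy subgradient of the Lovász extension for the discrete variable, and a one-sample Gaussian-smoothing estimate for the continuous variable. Both are textbook constructions sketched under assumptions (the set function satisfies F(empty set) = 0, and a forward-difference estimator is used). The function names are hypothetical and this is not the paper's full algorithm.

```python
import random

def lovasz_subgradient(F, x):
    """Greedy subgradient of the Lovász extension of a set function F at x.

    F maps a frozenset of indices to a float, with F(frozenset()) == 0.
    """
    order = sorted(range(len(x)), key=lambda i: -x[i])
    g = [0.0] * len(x)
    prefix, prev = set(), 0.0
    for i in order:
        prefix.add(i)
        value = F(frozenset(prefix))
        g[i] = value - prev  # marginal gain of adding i to the prefix
        prev = value
    return g

def lovasz_value(F, x):
    """Lovász extension value, the inner product of x with the subgradient."""
    return sum(gi * xi for gi, xi in zip(lovasz_subgradient(F, x), x))

def gaussian_smoothing_gradient(f, y, mu, rng):
    """One-sample forward-difference estimate of the gradient of the
    Gaussian-smoothed f at y, using function values only."""
    u = [rng.gauss(0.0, 1.0) for _ in y]
    shifted = [yi + mu * ui for yi, ui in zip(y, u)]
    scale = (f(shifted) - f(y)) / mu
    return [scale * ui for ui in u]

# Toy submodular function: F(S) = 1 if S is non-empty, else 0.
F = lambda S: float(min(len(S), 1))
g = lovasz_subgradient(F, [0.2, 0.7, 0.1])

# Averaging many one-sample estimates for a linear function a . y.
rng = random.Random(0)
a = [1.0, -2.0]
lin = lambda y: sum(ai * yi for ai, yi in zip(a, y))
samples = [gaussian_smoothing_gradient(lin, [0.0, 0.0], 1e-2, rng)
           for _ in range(20000)]
avg = [sum(s[i] for s in samples) / len(samples) for i in range(2)]
```

For this toy `F` the Lovász extension on nonnegative inputs is the maximum coordinate, so the subgradient puts all its weight on the largest entry. The averaged smoothing estimate should land near `a`, although any single sample is very noisy.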
[LG-156] Data-Driven Generation of Neutron Star Equations of State Using Variational Autoencoders
Link: https://arxiv.org/abs/2601.21231
Authors: Alex Ross, Tianqi Zhao, Sanjay Reddy
Categories: High Energy Astrophysical Phenomena (astro-ph.HE); Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
*Comments: 12 pages, 8 figures. In preparation for submission to Machine Learning: Science and Technology
Abstract:We develop a machine learning model based on a structured variational autoencoder (VAE) framework to reconstruct and generate neutron star (NS) equations of state (EOS). The VAE consists of an encoder network that maps high-dimensional EOS data into a lower-dimensional latent space and a decoder network that reconstructs the full EOS from the latent representation. The latent space includes supervised NS observables derived from the training EOS data, as well as latent random variables corresponding to additional unspecified EOS features learned automatically. Sampling the latent space enables the generation of new, causal, and stable EOS models that satisfy astronomical constraints on the supervised NS observables, while allowing Bayesian inference of the EOS incorporating additional multimessenger data, including gravitational waves from LIGO/Virgo and mass and radius measurements of pulsars. Based on a VAE trained on a Skyrme EOS dataset, we find that a latent space with two supervised NS observables, the maximum mass (M_{\max}) and the canonical radius (R_{1.4}), together with one latent random variable controlling the EOS near the crust–core transition, can already reconstruct Skyrme EOSs with high fidelity, achieving mean absolute percentage errors of approximately 0.15% for M_{\max} and R_{1.4} derived from the decoder-reconstructed EOS.
[LG-157] A Flexible Empirical Bayes Approach to Generalized Linear Models with Applications to Sparse Logistic Regression
Link: https://arxiv.org/abs/2601.21217
Authors: Dongyue Xie, Wanrong Zhu, Matthew Stephens
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
*Comments:
Abstract:We introduce a flexible empirical Bayes approach for fitting Bayesian generalized linear models. Specifically, we adopt a novel mean-field variational inference (VI) method and the prior is estimated within the VI algorithm, making the method tuning-free. Unlike traditional VI methods that optimize the posterior density function, our approach directly optimizes the posterior mean and prior parameters. This formulation reduces the number of parameters to optimize and enables the use of scalable algorithms such as L-BFGS and stochastic gradient descent. Furthermore, our method automatically determines the optimal posterior based on the prior and likelihood, distinguishing it from existing VI methods that often assume a Gaussian variational. Our approach represents a unified framework applicable to a wide range of exponential family distributions, removing the need to develop unique VI methods for each combination of likelihood and prior distributions. We apply the framework to solve sparse logistic regression and demonstrate the superior predictive performance of our method in extensive numerical studies, by comparing it to prevalent sparse logistic regression approaches.
[LG-158] Provably Reliable Classifier Guidance through Cross-entropy Error Control
Link: https://arxiv.org/abs/2601.21200
Authors: Sharan Sahu, Arisina Banerjee, Yuchen Wu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments: 32 pages, 6 figures
Abstract:Classifier-guided diffusion models generate conditional samples by augmenting the reverse-time score with the gradient of a learned classifier, yet it remains unclear whether standard classifier training procedures yield effective diffusion guidance. We address this gap by showing that, under mild smoothness assumptions on the classifiers, controlling the cross-entropy error at each diffusion step also controls the error of the resulting guidance vectors: classifiers achieving conditional KL divergence \varepsilon^2 from the ground-truth conditional label probabilities induce guidance vectors with mean squared error \widetilde{O}(d\varepsilon). Our result yields an upper bound on the sampling error under classifier guidance and bears resemblance to a reverse log-Sobolev-type inequality. Moreover, we show that the classifier smoothness assumption is essential, by constructing simple counterexamples demonstrating that, without it, control of the guidance vector can fail for almost all distributions. To our knowledge, our work establishes the first quantitative link between classifier training and guidance alignment, yielding both a theoretical foundation for classifier guidance and principled guidelines for classifier selection.
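For orientation, the identity that classifier guidance rests on is Bayes' rule applied to the noised marginals, shown here as a short worked equation. The notation is assumed for this note: p_t is the law of the noised sample at diffusion time t, y the label, and \hat{p}_t(y \mid x) the learned classifier.

```latex
\nabla_x \log p_t(x \mid y)
  \;=\; \nabla_x \log p_t(x) \;+\; \nabla_x \log p_t(y \mid x),
\qquad
\nabla_x \log p_t(y \mid x) \;\approx\; \nabla_x \log \hat{p}_t(y \mid x)
```

The first term is the unconditional score and the second is the guidance vector. Cross-entropy training controls how close \hat{p}_t(y \mid x) is to p_t(y \mid x) in KL, not how close their gradients are, so a bound that converts KL error \varepsilon^2 into guidance error \widetilde{O}(d\varepsilon) requires additional smoothness, consistent with the abstract's counterexamples.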
[LG-159] A new strategy for finite-sample valid prediction of future insurance claims in the regression setting
Link: https://arxiv.org/abs/2601.21153
Authors: Liang Hong
Categories: Applications (stat.AP); Machine Learning (cs.LG)
*Comments:
Abstract:The extant insurance literature demonstrates a paucity of finite-sample valid prediction intervals of future insurance claims in the regression setting. To address this challenge, this article proposes a new strategy that converts a predictive method in the unsupervised iid (independent identically distributed) setting to a predictive method in the regression setting. In particular, it enables an actuary to obtain infinitely many finite-sample valid prediction intervals in the regression setting.
[LG-160] Thompson sampling: Precise arm-pull dynamics and adaptive inference
Link: https://arxiv.org/abs/2601.21131
Authors: Qiyang Han
Categories: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:
Abstract:Adaptive sampling schemes are well known to create complex dependence that may invalidate conventional inference methods. A recent line of work shows that this need not be the case for UCB-type algorithms in multi-armed bandits. A central emerging theme is a 'stability' property with asymptotically deterministic arm-pull counts in these algorithms, making inference as easy as in the i.i.d. setting. In this paper, we study the precise arm-pull dynamics in another canonical class of Thompson-sampling type algorithms. We show that the phenomenology is qualitatively different: the arm-pull count is asymptotically deterministic if and only if the arm is suboptimal or is the unique optimal arm; otherwise it converges in distribution to the unique invariant law of an SDE. This dichotomy uncovers a unifying principle behind many existing (in)stability results: an arm is stable if and only if its interaction with statistical noise is asymptotically negligible. As an application, we show that normalized arm means obey the same dichotomy, with Gaussian limits for stable arms and a semi-universal, non-Gaussian limit for unstable arms. This not only enables the construction of confidence intervals for the unknown mean rewards despite non-normality, but also reveals the potential of developing tractable inference procedures beyond the stable regime. The proofs rely on two new approaches. For suboptimal arms, we develop an 'inverse process' approach that characterizes the inverse of the arm-pull count process via a Stieltjes integral. For optimal arms, we adopt a reparametrization of the arm-pull and noise processes that reduces the singularity in the natural SDE to proving the uniqueness of the invariant law of another SDE. We prove the latter by a set of analytic tools, including the parabolic Hörmander condition and the Stroock-Varadhan support theorem.
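A small simulation makes the object of study, the arm-pull counts, tangible. The sketch below runs Gaussian Thompson sampling with unit-variance rewards and posterior draws from N(empirical mean, 1/pulls). That particular variant, the horizon, and the function name `thompson_sampling` are assumptions for illustration, and the paper's class of algorithms may differ.

```python
import random

def thompson_sampling(means, horizon, seed=0):
    """Gaussian Thompson sampling with unit-variance rewards.

    After one initial pull per arm, arm a is scored by a draw from
    N(empirical_mean_a, 1 / pulls_a) and the highest draw is pulled.
    Returns the list of arm-pull counts.
    """
    rng = random.Random(seed)
    pulls = [0] * len(means)
    sums = [0.0] * len(means)

    def pull(a):
        sums[a] += rng.gauss(means[a], 1.0)
        pulls[a] += 1

    for a in range(len(means)):
        pull(a)
    for _ in range(horizon - len(means)):
        draws = [rng.gauss(sums[a] / pulls[a], pulls[a] ** -0.5)
                 for a in range(len(means))]
        pull(max(range(len(means)), key=draws.__getitem__))
    return pulls

unique_best = thompson_sampling([0.0, 1.0], horizon=2000)
tied_best = thompson_sampling([1.0, 1.0], horizon=2000)
```

Rerunning with different `seed` values is a quick way to look at the dichotomy the abstract describes: with a unique optimal arm the counts should vary little across seeds, whereas with two tied optimal arms the split between them can differ widely from run to run.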
[LG-161] Multilevel and Sequential Monte Carlo for Training-Free Diffusion Guidance
Link: https://arxiv.org/abs/2601.21104
Authors: Aidan Gleich, Scott C. Schmidler
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Abstract:We address the problem of accurate, training-free guidance for conditional generation in trained diffusion models. Existing methods typically rely on point-estimates to approximate the posterior score, often resulting in biased approximations that fail to capture multimodality inherent to the reverse process of diffusion models. We propose a sequential Monte Carlo (SMC) framework that constructs an unbiased estimator of p_\theta(y|x_t) by integrating over the full denoising distribution via Monte Carlo approximation. To ensure computational tractability, we incorporate variance-reduction schemes based on Multi-Level Monte Carlo (MLMC). Our approach achieves new state-of-the-art results for training-free guidance on CIFAR-10 class-conditional generation, achieving 95.6% accuracy with 3\times lower cost-per-success than baselines. On ImageNet, our algorithm achieves 1.5\times cost-per-success advantage over existing methods.
[LG-162] High-dimensional learning dynamics of multi-pass Stochastic Gradient Descent in multi-index models
Link: https://arxiv.org/abs/2601.21093
Authors: Zhou Fan, Leda Wang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Statistics Theory (math.ST)
*Comments:
Abstract:We study the learning dynamics of a multi-pass, mini-batch Stochastic Gradient Descent (SGD) procedure for empirical risk minimization in high-dimensional multi-index models with isotropic random data. In an asymptotic regime where the sample size n and data dimension d increase proportionally, for any sub-linear batch size \kappa \asymp n^\alpha where \alpha \in [0,1), and for a commensurate "critical" scaling of the learning rate, we provide an asymptotically exact characterization of the coordinate-wise dynamics of SGD. This characterization takes the form of a system of dynamical mean-field equations, driven by a scalar Poisson jump process that represents the asymptotic limit of SGD sampling noise. We develop an analogous characterization of the Stochastic Modified Equation (SME) which provides a Gaussian diffusion approximation to SGD. Our analyses imply that the limiting dynamics for SGD are the same for any batch size scaling \alpha \in [0,1), and that under a commensurate scaling of the learning rate, dynamics of SGD, SME, and gradient flow are mutually distinct, with those of SGD and SME coinciding in the special case of a linear model. We recover a known dynamical mean-field characterization of gradient flow in a limit of small learning rate, and of one-pass/online SGD in a limit of increasing sample size n/d \to \infty.
[LG-163] An efficient, accurate, and interpretable machine learning method for computing probability of failure
Link: https://arxiv.org/abs/2601.21089
Authors: Jacob Zhu, Donald Estep
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
Abstract:We introduce a novel machine learning method called the Penalized Profile Support Vector Machine based on the Gabriel edited set for the computation of the probability of failure for a complex system as determined by a threshold condition on a computer model of system behavior. The method is designed to minimize the number of evaluations of the computer model while preserving the geometry of the decision boundary that determines the probability. It employs an adaptive sampling strategy designed to allocate points near the boundary determining failure and builds a locally linear surrogate boundary that remains consistent with its geometry by strategic clustering of training points. We prove two convergence results and we compare the performance of the method against a number of state-of-the-art classification methods on four test problems. We also apply the method to determine the probability of survival using the Lotka–Volterra model for competing species.
[LG-164] Better without U: Impact of Selective Hubbard U Correction on Foundational MLIPs
Link: https://arxiv.org/abs/2601.21056
Authors: Thomas Warford, Fabian L. Thiemann, Gábor Csányi
Categories: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Abstract:The training of foundational machine learning interatomic potentials (fMLIPs) relies on diverse databases with energies and forces calculated using ab initio methods. We show that fMLIPs trained on large datasets such as MPtrj, Alexandria, and OMat24 encode inconsistencies from the Materials Project’s selective use of the Hubbard U correction, which is applied to certain transition metals only if O or F atoms are present in the simulation cell. This inconsistent use of +U creates two incompatible potential-energy surfaces (PES): a lower-energy GGA surface and a higher-energy GGA+U one. When trained on both, MLIPs interpolate between them, leading to systematic underbinding, or even spurious repulsion, between U-corrected metals and oxygen- or fluorine-containing species. Models such as MACE-OMAT and -MPA exhibit repulsion between U-corrected metals and their oxides, limiting their value for studying catalysis and oxidation. We link the severity of this pathology to the oxygen number density in U-corrected training configurations. This explains why OMAT-trained models are most affected and suggests the issue might worsen as expanding future datasets increasingly include configurations with low oxygen content, such as those generated through combinatorial exploration of multi-element or defect-containing systems. Our simple per-U-corrected-atom shift aligns PBE+U and PBE energies for identical structures, yielding a smoother PES compared to existing correction schemes, which target phase diagram accuracy. As a result, models trained on datasets with our shift applied exhibit smaller mean absolute errors for the adsorption energies of oxygen on U-corrected elemental slabs. Since datasets omitting +U entirely (e.g. MatPES, MP-ALOE) avoid these pathologies, we recommend excluding +U in future fMLIP datasets. For existing datasets, our post-hoc correction provides a low-cost improvement. 
[LG-165] Diffusion-based Annealed Boltzmann Generators: benefits, pitfalls and hopes
Link: https://arxiv.org/abs/2601.21026
Authors: Louis Grenioux, Maxence Noble
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Abstract:Sampling configurations at thermodynamic equilibrium is a central challenge in statistical physics. Boltzmann Generators (BGs) tackle it by combining a generative model with a Monte Carlo (MC) correction step to obtain asymptotically unbiased samples from an unnormalized target. Most current BGs use classic MC mechanisms such as importance sampling, which both require tractable likelihoods from the backbone model and scale poorly in high-dimensional, multi-modal targets. We study BGs built on annealed Monte Carlo (aMC), which is designed to overcome these limitations by bridging a simple reference to the target through a sequence of intermediate densities. Diffusion models (DMs) are powerful generative models and have already been incorporated into aMC-based recalibration schemes via the diffusion-induced density path, making them appealing backbones for aMC-BGs. We provide an empirical meta-analysis of DM-based aMC-BGs on controlled multi-modal Gaussian mixtures (varying mode separation, number of modes, and dimension), explicitly disentangling inference effects from learning effects by comparing (i) a perfectly learned DM and (ii) a DM trained from data. Even with a perfect DM, standard integrations using only first-order stochastic denoising kernels fail systematically, whereas second-order denoising kernels can substantially improve performance when covariance information is available. We further propose a deterministic aMC integration based on first-order transport maps derived from DMs, which outperforms the stochastic first-order variant at higher computational cost. Finally, in the learned-DM setting, all DM-aMC variants struggle to produce accurate BGs; we trace the main bottleneck to inaccurate DM log-density estimation.
[LG-166] A Diffusive Classification Loss for Learning Energy-based Generative Models
Link: https://arxiv.org/abs/2601.21025
Authors: Louis Grenioux, RuiKang OuYang, José Miguel Hernández-Lobato
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Abstract:Score-based generative models have recently achieved remarkable success. While they are usually parameterized by the score, an alternative way is to use a series of time-dependent energy-based models (EBMs), where the score is obtained from the negative input-gradient of the energy. Crucially, EBMs can be leveraged not only for generation, but also for tasks such as compositional sampling or building Boltzmann Generators via Monte Carlo methods. However, training EBMs remains challenging. Direct maximum likelihood is computationally prohibitive due to the need for nested sampling, while score matching, though efficient, suffers from mode blindness. To address these issues, we introduce the Diffusive Classification (DiffCLF) objective, a simple method that avoids blindness while remaining computationally efficient. DiffCLF reframes EBM learning as a supervised classification problem across noise levels, and can be seamlessly combined with standard score-based objectives. We validate the effectiveness of DiffCLF by comparing the estimated energies against ground truth in analytical Gaussian mixture cases, and by applying the trained models to tasks such as model composition and Boltzmann Generator sampling. Our results show that DiffCLF enables EBMs with higher fidelity and broader applicability than existing approaches.
[LG-167] Towards regularized learning from functional data with covariate shift
Link: https://arxiv.org/abs/2601.21019
Authors: Markus Holzleitner, Sergiy Pereverzyev Jr., Sergei V. Pereverzyev, Vaibhav Silmana, S. Sivananthan
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 38 pages
Abstract:This paper investigates a general regularization framework for unsupervised domain adaptation in vector-valued regression under the covariate shift assumption, utilizing vector-valued reproducing kernel Hilbert spaces (vRKHS). Covariate shift occurs when the input distributions of the training and test data differ, introducing significant challenges for reliable learning. By restricting the hypothesis space, we develop a practical operator learning algorithm capable of handling functional outputs. We establish optimal convergence rates for the proposed framework under a general source condition, providing a theoretical foundation for regularized learning in this setting. We also propose an aggregation-based approach that forms a linear combination of estimators corresponding to different regularization parameters and different kernels. The proposed approach addresses the challenge of selecting appropriate tuning parameters, which is crucial for constructing a good estimator, and we provide a theoretical justification for its effectiveness. Furthermore, we illustrate the proposed method on a real-world face image dataset, demonstrating robustness and effectiveness in mitigating distributional discrepancies under covariate shift.
[LG-168] Efficient Causal Structure Learning via Modular Subgraph Integration
Link: https://arxiv.org/abs/2601.21014
Authors: Haixiang Sun, Pengchao Tian, Zihan Zhou, Jielei Zhang, Peiyi Li, Andrew L. Liu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
Abstract:Learning causal structures from observational data remains a fundamental yet computationally intensive task, particularly in high-dimensional settings where existing methods face challenges such as the super-exponential growth of the search space and increasing computational demands. To address this, we introduce VISTA (Voting-based Integration of Subgraph Topologies for Acyclicity), a modular framework that decomposes the global causal structure learning problem into local subgraphs based on Markov Blankets. The global integration is achieved through a weighted voting mechanism that penalizes low-support edges via exponential decay, filters unreliable ones with an adaptive threshold, and ensures acyclicity using a Feedback Arc Set (FAS) algorithm. The framework is model-agnostic, imposing no assumptions on the inductive biases of base learners, is compatible with arbitrary data settings without requiring specific structural forms, and fully supports parallelization. We also theoretically establish finite-sample error bounds for VISTA, and prove its asymptotic consistency under mild conditions. Extensive experiments on both synthetic and real datasets consistently demonstrate the effectiveness of VISTA, yielding notable improvements in both accuracy and efficiency over a wide range of base learners.
[LG-169] The augmented NLP bound for maximum-entropy remote sampling
Link: https://arxiv.org/abs/2601.20970
Authors: Gabriel Ponte, Marcia Fampa, Jon Lee
Categories: Optimization and Control (math.OC); Information Theory (cs.IT); Machine Learning (cs.LG)
Abstract:The maximum-entropy remote sampling problem (MERSP) is to select a subset of $s$ random variables from a set of $n$ random variables, so as to maximize the information concerning a set of target random variables that are not directly observable. We assume throughout that the set of all of these random variables follows a joint Gaussian distribution, and that we have the covariance matrix available. Finally, we measure information using Shannon's differential entropy. The main approach for exact solution of moderate-sized instances of MERSP has been branch-and-bound, and so previous work concentrated on upper bounds. Prior to our work, there were two upper-bounding methods for MERSP: the so-called NLP bound and the spectral bound, both introduced 25 years ago. We are able now to establish domination results between these two upper bounds. We propose an "augmented NLP bound" based on a subtle convex relaxation. We provide theoretical guarantees, giving sufficient conditions under which the augmented NLP bound strictly dominates the ordinary NLP bound. In addition, the augmented NLP formulation allows us to derive upper bounds for rank-deficient covariance matrices when they satisfy a technical condition. This is in contrast to the earlier work on the ordinary NLP bound that worked with only positive definite covariance matrices. Finally, we introduce a novel and very effective diagonal-scaling technique for MERSP, employing a positive vector of parameters. Numerical experiments on benchmark instances demonstrate the effectiveness of our approaches in advancing the state of the art for calculating upper bounds on MERSP.
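As background for the entropy measure mentioned in the abstract, the block below recalls the standard differential entropy of a jointly Gaussian subvector. The notation is an assumption for illustration: $S$ is an index set of size $s$ and $C[S,S]$ is the corresponding principal submatrix of the covariance matrix. The exact MERSP objective is defined in the paper and is not reproduced here.

```latex
h(X_S) = \frac{1}{2} \ln\!\big( (2\pi e)^{s} \det C[S,S] \big)
       = \frac{s}{2} \ln(2\pi e) + \frac{1}{2} \ln \det C[S,S]
```

The log-determinant is finite only when $C[S,S]$ is positive definite, which is one reason rank-deficient covariance matrices need separate treatment.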
[LG-170] Parametric Quantum State Tomography with HyperRBMs
Link: https://arxiv.org/abs/2601.20950
Authors: Simon Tonner, Viet T. Tran, Richard Kueng
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Abstract:Quantum state tomography (QST) is essential for validating quantum devices but suffers from exponential scaling in system size. Neural-network quantum states, such as Restricted Boltzmann Machines (RBMs), can efficiently parameterize individual many-body quantum states and have been successfully used for QST. However, existing approaches are point-wise and require retraining at every parameter value in a phase diagram. We introduce a parametric QST framework based on a hypernetwork that conditions an RBM on Hamiltonian control parameters, enabling a single model to represent an entire family of quantum ground states. Applied to the transverse-field Ising model, our HyperRBM achieves high-fidelity reconstructions from local Pauli measurements on 1D and 2D lattices across both phases and through the critical region. Crucially, the model accurately reproduces the fidelity susceptibility and identifies the quantum phase transition without prior knowledge of the critical point. These results demonstrate that hypernetwork-modulated neural quantum states provide an efficient and scalable route to tomographic reconstruction across full phase diagrams.
[LG-171] ECGFlowCMR: Pretraining with ECG-Generated Cine CMR Improves Cardiac Disease Classification and Phenotype Prediction
Link: https://arxiv.org/abs/2601.20904
Authors: Xiaocheng Fang, Zhengyao Ding, Jieyi Cai, Yujie Xiao, Bo Liu, Jiarui Jin, Haoyu Wang, Guangkun Nie, Shun Huang, Ting Chen, Hongyan Li, Shenda Hong
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Abstract:Cardiac Magnetic Resonance (CMR) imaging provides a comprehensive assessment of cardiac structure and function but remains constrained by high acquisition costs and reliance on expert annotations, limiting the availability of large-scale labeled datasets. In contrast, electrocardiograms (ECGs) are inexpensive, widely accessible, and offer a promising modality for conditioning the generative synthesis of cine CMR. To this end, we propose ECGFlowCMR, a novel ECG-to-CMR generative framework that integrates a Phase-Aware Masked Autoencoder (PA-MAE) and an Anatomy-Motion Disentangled Flow (AMDF) to address two fundamental challenges: (1) the cross-modal temporal mismatch between multi-beat ECG recordings and single-cycle CMR sequences, and (2) the anatomical observability gap due to the limited structural information inherent in ECGs. Extensive experiments on the UK Biobank and a proprietary clinical dataset demonstrate that ECGFlowCMR can generate realistic cine CMR sequences from ECG inputs, enabling scalable pretraining and improving performance on downstream cardiac disease classification and phenotype prediction tasks.
[LG-172] ATTNSOM: Learning Cross-Isoform Attention for Cytochrome P450 Site-of-Metabolism
Link: https://arxiv.org/abs/2601.20891
Authors: Hajung Kim, Eunha Lee, Sohyun Chung, Jueon Park, Seungheun Baek, Jaewoo Kang
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments: 14 pages
Abstract:Identifying metabolic sites where cytochrome P450 enzymes metabolize small-molecule drugs is essential for drug discovery. Although existing computational approaches have been proposed for site-of-metabolism prediction, they typically ignore cytochrome P450 isoform identity or model isoforms independently, thereby failing to fully capture inherent cross-isoform metabolic patterns. In addition, prior evaluations often rely on top-k metrics, where false positive atoms may be included among the top predictions, underscoring the need for complementary metrics that more directly assess binary atom-level discrimination under severe class imbalance. We propose ATTNSOM, an atom-level site-of-metabolism prediction framework that integrates intrinsic molecular reactivity with cross-isoform relationships. The model combines a shared graph encoder, molecule-conditioned atom representations, and a cross-attention mechanism to capture correlated metabolic patterns across cytochrome P450 isoforms. The model is evaluated on two benchmark datasets annotated with site-of-metabolism labels at atom resolution. Across these benchmarks, the model achieves consistently strong top-k performance across multiple cytochrome P450 isoforms. Relative to ablated variants, the model yields higher Matthews correlation coefficient, indicating improved discrimination of true metabolic sites. These results support the importance of explicitly modeling cross-isoform relationships for site-of-metabolism prediction. The code and datasets are available at this https URL.
[LG-173] Latent-IMH: Efficient Bayesian Inference for Inverse Problems with Approximate Operators
Link: https://arxiv.org/abs/2601.20888
Authors: Youguang Chen, George Biros
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Computation (stat.CO)
Abstract:We study sampling from posterior distributions in Bayesian linear inverse problems where $A$, the parameters-to-observables operator, is computationally expensive. In many applications, $A$ can be factored in a manner that facilitates the construction of a cost-effective approximation $\tilde{A}$. In this framework, we introduce Latent-IMH, a sampling method based on the Metropolis-Hastings independence (IMH) sampler. Latent-IMH first generates intermediate latent variables using the approximate $\tilde{A}$, and then refines them using the exact $A$. Its primary benefit is that it shifts the computational cost to an offline phase. We theoretically analyze the performance of Latent-IMH using KL divergence and mixing time bounds. Using numerical experiments on several model problems, we show that, under reasonable assumptions, it outperforms state-of-the-art methods such as the No-U-Turn sampler (NUTS) in computational efficiency. In some cases, Latent-IMH can be orders of magnitude faster than existing schemes.
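For readers unfamiliar with the independence Metropolis-Hastings (IMH) sampler that Latent-IMH builds on, the following is a minimal sketch of plain IMH on a one-dimensional toy problem. It is not the paper's Latent-IMH algorithm: the standard-normal target, the wider Gaussian proposal, and all function names are illustrative assumptions.

```python
import math
import random


def log_target(x):
    """Unnormalized log-density of the toy target, a standard normal (assumption)."""
    return -0.5 * x * x


def log_proposal(x, scale):
    """Log-density, up to a constant, of the fixed proposal N(0, scale^2)."""
    return -0.5 * (x / scale) ** 2


def imh_sample(n_steps, scale=2.0, seed=0):
    """Independence Metropolis-Hastings: proposals ignore the current state."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    accepted = 0
    for _ in range(n_steps):
        y = rng.gauss(0.0, scale)  # drawn independently of x
        # log of [target(y) * proposal(x)] / [target(x) * proposal(y)]
        log_ratio = (log_target(y) - log_target(x)) + (
            log_proposal(x, scale) - log_proposal(y, scale)
        )
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = y
            accepted += 1
        samples.append(x)
    return samples, accepted / n_steps
```

Because the proposal never depends on the current state, its quality as an approximation of the target determines the acceptance rate. That is presumably why a cheap but accurate surrogate operator is attractive as a proposal mechanism.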
[LG-174] Spatial Heterogeneity in Climate Risk and Human Flourishing: An Exploration with Generative AI
Link: https://arxiv.org/abs/2601.20880
Authors: Stefano Maria Iacus, Haodong Qi, Devika Jain
Categories: Applications (stat.AP); Machine Learning (cs.LG)
Abstract:Recent advances in Generative Artificial Intelligence (AI), particularly Large Language Models (LLMs), enable scalable extraction of spatial information from unstructured text and offer new methodological opportunities for studying climate geography. This study develops a spatial framework to examine how cumulative climate risk relates to multidimensional human flourishing across U.S. counties. High-resolution climate hazard indicators are integrated with a Human Flourishing Geographic Index (HFGI), an index derived from classification of 2.6 billion geotagged tweets using fine-tuned open-source Large Language Models (LLMs). These indicators are aggregated to the US county-level and mapped to a structural equation model to infer overall climate risk and human flourishing dimensions, including expressed well-being, meaning and purpose, social connectedness, psychological distress, physical condition, economic stability, religiosity, character and virtue, and institutional trust. The results reveal spatially heterogeneous associations between greater cumulative climate risk and lower levels of expressed human flourishing, with coherent spatial patterns corresponding to recurrent exposure to heat, flooding, wind, drought, and wildfire hazards. The study demonstrates how Generative AI can be combined with latent construct modeling for geographical analysis and for spatial knowledge extraction.
[LG-175] Distributed Causality in the SDG Network: Evidence from Panel VAR and Conditional Independence Analysis
Link: https://arxiv.org/abs/2601.20875
Authors: Md Muhtasim Munif Fahim, Md Jahid Hasan Imran, Luknath Debnath, Tonmoy Shill, Md. Naim Molla, Ehsanul Bashar Pranto, Md Shafin Sanyan Saad, Md Rezaul Karim
Categories: Applications (stat.AP); Machine Learning (cs.LG); Econometrics (econ.EM); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: Comprehensive Manuscript with Code Data
Abstract:The achievement of the 2030 Sustainable Development Goals (SDGs) depends on strategic resource distribution. We propose a causal discovery framework using Panel Vector Autoregression, along with both country-specific fixed effects and PCMCI+ conditional independence testing on 168 countries (2000-2025) to develop the first complete causal architecture of SDG dependencies. Utilizing 8 strategically chosen SDGs, we identify a distributed causal network (i.e., no single ‘hub’ SDG), with 10 statistically significant Granger-causal relationships identified as 11 unique direct effects. Education to Inequality is identified as the most statistically significant direct relationship (r = -0.599; p < 0.05), while effect magnitude varies significantly with income level (e.g., high-income: r = -0.65; lower-middle-income: r = -0.06, non-significant). We also reject the idea that there exists a single ‘keystone’ SDG. Additionally, we propose a tiered priority framework for the SDGs, identifying upstream drivers (Education, Growth), enabling goals (Institutions, Energy), and downstream outcomes (Poverty, Health). Therefore, we conclude that effective SDG acceleration can be accomplished through coordinated multi-dimensional interventions, and that single-goal sequential strategies are insufficient.
[LG-176] Analyzing the Temporal Factors for Anxiety and Depression Symptoms with the Rashomon Perspective
Link: https://arxiv.org/abs/2601.20874
Authors: Mustafa Cavus, Przemysław Biecek, Julian Tejada, Fernando Marmolejo-Ramos, Andre Faro
Categories: Applications (stat.AP); Machine Learning (cs.LG)
Comments: 19 pages, 2 figures
Abstract:This paper introduces a new modeling perspective in the public mental health domain to provide a robust interpretation of the relations between anxiety and depression, and the demographic and temporal factors. This perspective particularly leverages the Rashomon Effect, where multiple models exhibit similar predictive performance but rely on diverse internal structures. Choosing a single best model, instead of considering these multiple models, risks masking alternative narratives embedded in the data. To address this, we employed this perspective in the interpretation of a large-scale psychological dataset, specifically focusing on the Patient Health Questionnaire-4. We use a random forest model combined with partial dependence profiles to rigorously assess the robustness and stability of predictive relationships across the resulting Rashomon set, which consists of multiple models that exhibit similar predictive performance. Our findings confirm that demographic variables \texttt{age}, \texttt{sex}, and \texttt{education} lead to consistent structural shifts in anxiety and depression risk. Crucially, we identify significant temporal effects: risk probability demonstrates clear diurnal and circaseptan fluctuations, peaking during early morning hours. This work demonstrates the necessity of moving beyond the best model to analyze the entire Rashomon set. Our results highlight that the observed variability, particularly due to circadian and circaseptan rhythms, must be meticulously considered for robust interpretation in psychological screening. We advocate for a multiplicity-aware approach to enhance the stability and generalizability of ML-based conclusions in mental health research.
Information Retrieval
[IR-0] LANCER: LLM Reranking for Nugget Coverage ECIR2026
Link: https://arxiv.org/abs/2601.22008
Authors: Jia-Huei Ju, François G. Landry, Eugene Yang, Suzan Verberne, Andrew Yates
Categories: Information Retrieval (cs.IR)
Comments: ECIR 2026
Abstract:Unlike short-form retrieval-augmented generation (RAG), such as factoid question answering, long-form RAG requires retrieval to provide documents covering a wide range of relevant information. Automated report generation exemplifies this setting: it requires not only relevant information but also a more elaborate response with comprehensive information. Yet, existing retrieval methods are primarily optimized for relevance ranking rather than information coverage. To address this limitation, we propose LANCER, an LLM-based reranking method for nugget coverage. LANCER predicts what sub-questions should be answered to satisfy an information need, predicts which documents answer these sub-questions, and reranks documents in order to provide a ranked list covering as many information nuggets as possible. Our empirical results show that LANCER enhances the quality of retrieval as measured by nugget coverage metrics, achieving higher $\alpha$-nDCG and information coverage than other LLM-based reranking methods. Our oracle analysis further reveals that sub-question generation plays an essential role.
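The abstract reports results in α-nDCG, a metric that rewards covering new nuggets and discounts repeated ones. The sketch below computes the unnormalized α-DCG under the usual definition, where each repeated nugget's gain shrinks by a factor of (1 - α). The function name, the `doc_nuggets` mapping, and the toy documents are hypothetical and not taken from the paper.

```python
import math


def alpha_dcg(ranking, doc_nuggets, alpha=0.5, k=None):
    """alpha-DCG@k: a nugget's gain decays by (1 - alpha) per earlier occurrence."""
    seen = {}  # nugget -> number of higher-ranked documents that covered it
    score = 0.0
    for rank, doc in enumerate(ranking[:k], start=1):
        gain = 0.0
        for nugget in set(doc_nuggets.get(doc, ())):
            gain += (1.0 - alpha) ** seen.get(nugget, 0)
            seen[nugget] = seen.get(nugget, 0) + 1
        score += gain / math.log2(rank + 1)  # standard log2 rank discount
    return score


# Toy example: d2 repeats nugget "a", while d3 adds the new nugget "c".
doc_nuggets = {"d1": {"a", "b"}, "d2": {"a"}, "d3": {"c"}}
redundant_first = alpha_dcg(["d1", "d2", "d3"], doc_nuggets)
diverse_first = alpha_dcg(["d1", "d3", "d2"], doc_nuggets)
```

Here `diverse_first` exceeds `redundant_first`, because placing the document with a new nugget ahead of the redundant one earns more discounted gain. α-nDCG divides this score by that of an ideal ranking, which is usually approximated greedily.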
[IR-1] SpecTran: Spectral-Aware Transformer-based Adapter for LLM-Enhanced Sequential Recommendation
Link: https://arxiv.org/abs/2601.21986
Authors: Yu Cui, Feng Liu, Zhaoxiang Wang, Changwang Zhang, Jun Wang, Can Wang, Jiawei Chen
Categories: Information Retrieval (cs.IR)
Abstract:Traditional sequential recommendation (SR) models learn low-dimensional item ID embeddings from user-item interactions, often overlooking textual information such as item titles or descriptions. Recent advances in Large Language Models (LLMs) have inspired a surge of research that encodes item textual information with high-dimensional semantic embeddings, and designs transformation methods to inject such embeddings into SR models. These embedding transformation strategies can be categorized into two types, both of which exhibit notable drawbacks: 1) adapter-based methods suffer from pronounced dimension collapse, concentrating information into a few dominant dimensions; 2) SVD-based methods are rigid and manual, considering only a few principal spectral components while discarding rich information in the remaining spectrum. To address these limitations, we propose SpecTran, a spectral-aware transformer-based adapter that operates in the spectral domain, attending to the full spectrum to select and aggregate informative components. A learnable spectral-position encoding injects singular-value cues as an inductive bias, guiding attention toward salient spectral components and promoting diversity across embedding dimensions. Across four real-world datasets and three SR backbones, it consistently outperforms strong baselines, achieving an average improvement of 9.17%.
[IR-2] The Big Three of Scientific Information: A comparative bibliometric review of Web of Science, Scopus, and OpenAlex
Link: https://arxiv.org/abs/2601.21908
Authors: Daniel Torres-Salinas, Wenceslao Arroyo-Machado
Categories: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Abstract:The present comparative study examines the three main multidisciplinary bibliographic databases, Web of Science Core Collection, Scopus, and OpenAlex, with the aim of providing up-to-date evidence on coverage, metadata quality, and functional features to help inform strategic decisions in research assessment. The report is structured into two complementary methodological sections. First, it presents a systematic review of recent scholarly literature that investigates record volume, open-access coverage, linguistic diversity, reference coverage, and metadata quality; this is followed by an original bibliometric analysis of the 2015-2024 period that explores longitudinal distribution, document types, thematic profiles, linguistic differences, and overlap between databases. The text concludes with a ten-point executive summary and five recommendations.
[IR-3] The Double-Edged Sword of Knowledge Transfer: Diagnosing and Curing Fairness Pathologies in Cross-Domain Recommendation WWW’26
Link: https://arxiv.org/abs/2601.21805
Authors: Yuhan Zhao, Weixin Chen, Li Chen, Weike Pan
Categories: Information Retrieval (cs.IR)
Comments: Accepted by WWW’26
Abstract:Cross-domain recommendation (CDR) offers an effective strategy for improving recommendation quality in a target domain by leveraging auxiliary signals from source domains. Nonetheless, emerging evidence shows that CDR can inadvertently heighten group-level unfairness. In this work, we conduct a comprehensive theoretical and empirical analysis to uncover why these fairness issues arise. Specifically, we identify two key challenges: (i) Cross-Domain Disparity Transfer, wherein existing group-level disparities in the source domain are systematically propagated to the target domain; and (ii) Unfairness from Cross-Domain Information Gain, where the benefits derived from cross-domain knowledge are unevenly allocated among distinct groups. To address these two challenges, we propose a Cross-Domain Fairness Augmentation (CDFA) framework composed of two key components. First, it mitigates cross-domain disparity transfer by adaptively integrating unlabeled data to equilibrate the informativeness of training signals across groups. Second, it redistributes cross-domain information gains via an information-theoretic approach to ensure equitable benefit allocation across groups. Extensive experiments on multiple datasets and baselines demonstrate that our framework significantly reduces unfairness in CDR without sacrificing overall recommendation performance, and even enhances it.
[IR-4] OneMall: One Model More Scenarios – End-to-End Generative Recommender Family at Kuaishou E-Commerce
Link: https://arxiv.org/abs/2601.21770
Authors: Kun Zhang, Jingming Zhang, Wei Cheng, Yansong Cheng, Jiaqi Zhang, Hao Lu, Xu Zhang, Haixiang Gan, Jiangxia Cao, Tenglong Wang, Ximing Zhang, Boyang Xia, Kuo Cai, Shiyao Wang, Hongjian Dou, Jinkai Yu, Mingxing Wen, Qiang Luo, Dongxu Liang, Chenyi Lei, Jun Wang, Runan Liu, Zhaojie Liu, Ruiming Tang, Tingting Gao, Shaoguo Liu, Yuqing Ding, Hui Kong, Han Li, Guorui Zhou, Wenwu Ou, Kun Gai
Categories: Information Retrieval (cs.IR)
Comments: Work in progress
Abstract:In the wave of generative recommendation, we present OneMall, an end-to-end generative recommendation framework tailored for e-commerce services at Kuaishou. OneMall systematically unifies multiple e-commerce item distribution scenarios, such as product-card, short-video, and live-streaming. Specifically, it comprises three key components, aligning the entire model training pipeline to the LLM's pre-training/post-training: (1) E-commerce Semantic Tokenizer: we provide a tokenizer solution that captures both real-world semantics and business-specific item relations across different scenarios; (2) Transformer-based Architecture: we largely utilize Transformer as our model backbone, e.g., employing Query-Former for long sequence compression, Cross-Attention for multi-behavior sequence fusion, and Sparse MoE for scalable auto-regressive generation; (3) Reinforcement Learning Pipeline: we further connect retrieval and ranking models via RL, enabling the ranking model to serve as a reward signal for end-to-end policy retrieval model optimization. Extensive experiments demonstrate that OneMall achieves consistent improvements across all e-commerce scenarios: +13.01% GMV in product-card, +15.32% orders in short-video, and +2.78% orders in live-streaming. OneMall has been deployed, serving over 400 million daily active users at Kuaishou.

