This post contains the latest paper listing retrieved from Arxiv.org on 2026-02-02, updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the listing by email on a schedule, please leave your email address in the comments.
Note: paper data is fetched from Arxiv.org daily and updated automatically around 12:00 each morning.
Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.
Table of Contents
Overview (2026-02-02)
673 papers updated today, including:
- Natural Language Processing: 93 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 220 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 110 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 262 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] FOCUS: DLLMs Know How to Tame Their Compute Bound
[Quick Read]: This paper addresses the deployment bottleneck of Diffusion Large Language Models (DLLMs), whose inference is constrained by high decoding cost. The core challenge: although computation is parallelized over token blocks, only a small subset of tokens is decodable at each diffusion step, so most compute is wasted on non-decodable tokens. The key to the solution is the FOCUS inference system, which dynamically focuses on decodable tokens and evicts non-decodable ones on the fly, increasing the effective batch size, alleviating compute limitations, and enabling higher throughput. Empirically, FOCUS achieves up to 3.52× higher throughput than the production-grade engine LMDeploy while preserving or improving generation quality.
Link: https://arxiv.org/abs/2601.23278
Authors: Kaihua Liang,Xin Tan,An Zhong,Hong Xu,Marco Canini
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Computation and Language (cs.CL)
Comments: 22 pages, 15 figures
Abstract:Diffusion Large Language Models (DLLMs) offer a compelling alternative to Auto-Regressive models, but their deployment is constrained by high decoding cost. In this work, we identify a key inefficiency in DLLM decoding: while computation is parallelized over token blocks, only a small subset of tokens is decodable at each diffusion step, causing most compute to be wasted on non-decodable tokens. We further observe a strong correlation between attention-derived token importance and token-wise decoding probability. Based on this insight, we propose FOCUS – an inference system designed for DLLMs. By dynamically focusing computation on decodable tokens and evicting non-decodable ones on-the-fly, FOCUS increases the effective batch size, alleviating compute limitations and enabling scalable throughput. Empirical evaluations demonstrate that FOCUS achieves up to 3.52× throughput improvement over the production-grade engine LMDeploy, while preserving or improving generation quality across multiple benchmarks. The FOCUS system is publicly available on GitHub: this https URL.
[NLP-1] UPA: Unsupervised Prompt Agent via Tree-Based Search and Selection
[Quick Read]: This paper tackles automated prompt optimization for generative AI when no supervised reward signal is available. Existing prompt-agent methods typically rely on supervised rewards, which are often unavailable in practice. The authors propose the Unsupervised Prompt Agent (UPA), whose core innovation is to build a structured search policy from fine-grained, order-invariant pairwise comparisons elicited from LLMs, together with a two-stage framework that decouples systematic exploration from final selection: the first stage performs path-wise Bayesian aggregation of local comparisons to filter candidates under uncertainty, and the second stage runs global tournament-style comparisons under the Bradley-Terry-Luce (BTL) model to infer prompt quality and identify the optimal prompt. UPA consistently outperforms existing prompt optimization techniques across tasks, showing that agent-style optimization remains effective even in fully unsupervised settings.
Link: https://arxiv.org/abs/2601.23273
Authors: Siran Peng,Weisong Zhao,Tianyu Fu,Chenxu Zhao,Tianshuo Zhang,Haoyuan Zhang,Xiangyu Zhu,Minghui Wu,Zhen Lei
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Prompt agents have recently emerged as a promising paradigm for automated prompt optimization, framing refinement as a sequential decision-making problem over a structured prompt space. While this formulation enables the use of advanced planning algorithms, these methods typically assume access to supervised reward signals, which are often unavailable in practical scenarios. In this work, we propose UPA, an Unsupervised Prompt Agent that realizes structured search and selection without relying on supervised feedback. Specifically, during search, UPA iteratively constructs an evolving tree structure to navigate the prompt space, guided by fine-grained and order-invariant pairwise comparisons from Large Language Models (LLMs). Crucially, as these local comparisons do not inherently yield a consistent global scale, we decouple systematic prompt exploration from final selection, introducing a two-stage framework grounded in the Bradley-Terry-Luce (BTL) model. This framework first performs path-wise Bayesian aggregation of local comparisons to filter candidates under uncertainty, followed by global tournament-style comparisons to infer latent prompt quality and identify the optimal prompt. Experiments across multiple tasks demonstrate that UPA consistently outperforms existing prompt optimization methods, showing that agent-style optimization remains highly effective even in fully unsupervised settings.
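The BTL model used in UPA's selection stage is a classical tool for inferring latent quality from pairwise outcomes. A minimal sketch of the general idea (with made-up win counts over three hypothetical candidate prompts; this is an illustration of standard BTL fitting, not the paper's implementation) uses the classic minorization-maximization update:

```python
def btl_scores(wins, n_items, iters=200):
    """Infer Bradley-Terry-Luce quality scores from pairwise win counts.

    wins[(i, j)] = number of times item i beat item j.
    Uses Hunter's classic minorization-maximization (MM) update.
    """
    w = [1.0] * n_items  # initial scores
    for _ in range(iters):
        new = []
        for i in range(n_items):
            num = sum(c for (a, b), c in wins.items() if a == i)  # total wins of i
            den = 0.0
            for j in range(n_items):
                if j == i:
                    continue
                # total comparisons between i and j, in either order
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    den += n_ij / (w[i] + w[j])
            new.append(num / den if den else w[i])
        s = sum(new)
        w = [x * n_items / s for x in new]  # renormalize to fix the scale
    return w

# Hypothetical tournament: prompt 0 beats 1 in 8/10 comparisons, etc.
wins = {(0, 1): 8, (1, 0): 2, (0, 2): 7, (2, 0): 3, (1, 2): 6, (2, 1): 4}
scores = btl_scores(wins, 3)
best = max(range(3), key=lambda i: scores[i])  # the inferred optimal prompt
```

The inferred scores recover the intuitive ranking (prompt 0 first); in UPA the "wins" would instead come from order-invariant LLM pairwise judgments.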
[NLP-2] PaperBanana: Automating Academic Illustration for AI Scientists
[Quick Read]: This paper addresses the fact that producing publication-ready academic illustrations remains a manual, labor-intensive step even for autonomous AI scientists. The core solution is the PaperBanana framework, built on state-of-the-art Vision-Language Models (VLMs) and image generation models, which orchestrates specialized agents to automate the full pipeline: reference retrieval, content and style planning, image rendering, and iterative refinement via self-critique. The key innovations are the structured multi-agent collaboration and the self-critique feedback loop, which markedly improve faithfulness, conciseness, readability, and aesthetic quality, and extend to generating high-quality statistical plots, offering a systematic solution for automated academic illustration.
Link: https://arxiv.org/abs/2601.23265
Authors: Dawei Zhu,Rui Meng,Yale Song,Xiyu Wei,Sujian Li,Tomas Pfister,Jinsung Yoon
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Despite rapid advances in autonomous AI scientists powered by language models, generating publication-ready illustrations remains a labor-intensive bottleneck in the research workflow. To lift this burden, we introduce PaperBanana, an agentic framework for automated generation of publication-ready academic illustrations. Powered by state-of-the-art VLMs and image generation models, PaperBanana orchestrates specialized agents to retrieve references, plan content and style, render images, and iteratively refine via self-critique. To rigorously evaluate our framework, we introduce PaperBananaBench, comprising 292 test cases for methodology diagrams curated from NeurIPS 2025 publications, covering diverse research domains and illustration styles. Comprehensive experiments demonstrate that PaperBanana consistently outperforms leading baselines in faithfulness, conciseness, readability, and aesthetics. We further show that our method effectively extends to the generation of high-quality statistical plots. Collectively, PaperBanana paves the way for the automated generation of publication-ready illustrations.
[NLP-3] Agnostic Language Identification and Generation
[Quick Read]: This paper addresses the limitation of the strong realizability assumption traditionally made in language identification and generation, namely that the input data must be drawn from an unknown distribution supported on some language in a given collection. To move beyond this restriction, the paper studies both problems in the more general "agnostic" setting, imposing no constraints on the input distribution. The key to the solution is designing new objectives that characterize identification and generation in this setting, under which the authors obtain novel and nearly tight statistical rates, a substantial theoretical advance.
Link: https://arxiv.org/abs/2601.23258
Authors: Mikael Møller Høgsgaard,Chirag Pabbaraju
Affiliations: Aarhus University; University of Oxford; Stanford University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Recent works on language identification and generation have established tight statistical rates at which these tasks can be achieved. These works typically operate under a strong realizability assumption: that the input data is drawn from an unknown distribution necessarily supported on some language in a given collection. In this work, we relax this assumption of realizability entirely, and impose no restrictions on the distribution of the input data. We propose objectives to study both language identification and generation in this more general “agnostic” setup. Across both problems, we obtain novel interesting characterizations and nearly tight rates.
[NLP-4] Now You Hear Me: Audio Narrative Attacks Against Large Audio-Language Models EACL2026
[Quick Read]: This paper addresses the security vulnerabilities of large audio-language models that process raw speech inputs, in particular the new class of adversarial attacks introduced by this modality shift in generative AI systems. The key to the solution is a text-to-audio jailbreak that embeds disallowed directives in a narrative-style audio stream via an advanced instruction-following text-to-speech (TTS) model, exploiting the structural and acoustic properties of speech to circumvent safety mechanisms calibrated primarily for text. Experiments show a 98.26% success rate on frontier models such as Gemini 2.0 Flash, substantially exceeding text-only baselines and highlighting the need for future safety frameworks to jointly consider linguistic and paralinguistic features.
Link: https://arxiv.org/abs/2601.23255
Authors: Ye Yu,Haibo Jin,Yaoning Yu,Jun Zhuang,Haohan Wang
Affiliations: University of Illinois Urbana-Champaign; Boise State University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: to be published at EACL 2026 main conference
Abstract:Large audio-language models increasingly operate on raw speech inputs, enabling more seamless integration across domains such as voice assistants, education, and clinical triage. This transition, however, introduces a distinct class of vulnerabilities that remain largely uncharacterized. We examine the security implications of this modality shift by designing a text-to-audio jailbreak that embeds disallowed directives within a narrative-style audio stream. The attack leverages an advanced instruction-following text-to-speech (TTS) model to exploit structural and acoustic properties, thereby circumventing safety mechanisms primarily calibrated for text. When delivered through synthetic speech, the narrative format elicits restricted outputs from state-of-the-art models, including Gemini 2.0 Flash, achieving a 98.26% success rate that substantially exceeds text-only baselines. These results highlight the need for safety frameworks that jointly reason over linguistic and paralinguistic representations, particularly as speech-based interfaces become more prevalent.
[NLP-5] Scaling Multiagent Systems with Process Rewards
[Quick Read]: This paper addresses two key challenges in finetuning multiple agents of a multiagent system simultaneously: credit assignment across agents, and the sample inefficiency of expensive multiagent rollouts. The key to the solution is MAPPA, per-action process rewards from AI feedback: by assigning credit at the level of individual agent actions rather than only at task completion, it enables fine-grained supervision without ground-truth labels while extracting maximal training signal from every rollout. The method delivers significant gains on complex tasks, validating per-action supervision for multiagent systems across domains such as competition math and tool-augmented data analysis.
Link: https://arxiv.org/abs/2601.23228
Authors: Ed Li,Junyu Ren,Cat Yan
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA)
Comments:
Abstract:While multiagent systems have shown promise for tackling complex tasks via specialization, finetuning multiple agents simultaneously faces two key challenges: (1) credit assignment across agents, and (2) sample efficiency of expensive multiagent rollouts. In this work, we propose finetuning multiagent systems with per-action process rewards from AI feedback (MAPPA) to address both. Through assigning credit to individual agent actions rather than only at task completion, MAPPA enables fine-grained supervision without ground truth labels while extracting maximal training signal from each rollout. We demonstrate our approach on competition math problems and tool-augmented data analysis tasks. On unseen math problems, MAPPA achieves +5.0–17.5pp on AIME and +7.8–17.2pp on AMC. For data analysis tasks, our method improves success rate by +12.5pp while quality metrics improve by up to 30%, validating that per-action supervision can lead to improvements across different multiagent systems in various domains. By addressing these challenges, our work takes a first step toward scaling multiagent systems for complex, long-horizon tasks with minimal human supervision.
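The contrast between terminal-only and per-action credit can be sketched with a generic return computation (a minimal illustration of reward densification in general, with made-up process rewards; not MAPPA's actual estimator):

```python
def terminal_only_credit(n_actions, terminal_reward):
    """Sparse baseline: every action receives the same end-of-task signal."""
    return [terminal_reward] * n_actions

def per_action_credit(process_rewards, terminal_reward, gamma=1.0):
    """Dense credit: each action gets its own process reward plus the
    (discounted) return of everything after it. A textbook return
    computation, used here only to illustrate densified supervision."""
    returns = []
    g = terminal_reward
    for r in reversed(process_rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# A made-up 4-action rollout: an AI feedback model judged action 1 harmful
# (-1.0) and action 3 helpful (+1.0); the task ultimately succeeds (+1.0).
proc = [0.0, -1.0, 0.5, 1.0]
dense = per_action_credit(proc, terminal_reward=1.0)
sparse = terminal_only_credit(4, 1.0)
```

Here `dense` varies across steps (the bad step drags down the returns that include it) while `sparse` is constant, so only the dense signal can tell the agents which action hurt.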
[NLP-6] Are you going to finish that? A Practical Study of the Tokenization Boundary Problem
[Quick Read]: This paper addresses the "partial token problem" that arises in practical LM use when word boundaries and token boundaries disagree: when a user's prompt ends at a point that is not a complete token, the model places drastically lower probability on the correct next token, degrading generation quality. The study shows the problem is especially prevalent and severe in whitespace-free languages such as Chinese, in highly compounding languages, and in code: in Chinese, up to 25% of word boundaries do not align with token boundaries, so even natural, word-complete prompts can trigger it. Experiments show that on such partial-token prompts, frontier LLMs place three orders of magnitude less probability on the correct continuation than on token-aligned prompts, and this degradation does not shrink with scale and often worsens for larger models. The key to mitigation lies in inference-time corrections, in particular recently proposed exact solutions such as backing off to the nearest token boundary and recomputing the distribution, which effectively alleviate the problem and improve robustness in realistic use.
Link: https://arxiv.org/abs/2601.23223
Authors: Hao Xu,Alisa Liu,Jonathan Hayase,Yejin Choi,Noah A. Smith
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Language models (LMs) are trained over sequences of tokens, whereas users interact with LMs via text. This mismatch gives rise to the partial token problem, which occurs when a user ends their prompt in the middle of the expected next-token, leading to distorted next-token predictions. Although this issue has been studied using arbitrary character prefixes, its prevalence and severity in realistic prompts respecting word boundaries remains underexplored. In this work, we identify three domains where token and “word” boundaries often do not line up: languages that do not use whitespace, highly compounding languages, and code. In Chinese, for example, up to 25% of word boundaries do not line up with token boundaries, making even natural, word-complete prompts susceptible to this problem. We systematically construct semantically natural prompts ending with partial tokens; in experiments, we find that they comprise a serious failure mode: frontier LMs consistently place three orders of magnitude less probability on the correct continuation compared to when the prompt is “backed-off” to be token-aligned. This degradation does not diminish with scale and often worsens for larger models. Finally, we evaluate inference-time mitigations to the partial token problem and validate the effectiveness of recent exact solutions. Overall, we demonstrate the scale and severity of probability distortion caused by tokenization in realistic use cases, and provide practical recommendations for model inference providers.
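The word/token boundary mismatch is easy to reproduce with a toy greedy longest-match tokenizer (an illustrative stand-in for BPE with an invented vocabulary, not any real model's tokenizer): a word-complete prompt can still end mid-token.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization, a simplified stand-in for BPE."""
    tokens = []
    i = 0
    while i < len(text):
        for L in range(min(len(text) - i, 10), 0, -1):
            piece = text[i:i + L]
            if piece in vocab:
                tokens.append(piece)
                i += L
                break
        else:
            raise ValueError(f"untokenizable character: {text[i]!r}")
    return tokens

# Toy vocabulary in which the multi-word string " New York" is one token.
vocab = {"I", " live", " in", " New", " York", " New York"}

full = "I live in New York"
prompt = "I live in New"  # word-complete, but ends mid-token

full_tokens = tokenize(full, vocab)      # ["I", " live", " in", " New York"]
prompt_tokens = tokenize(prompt, vocab)  # ["I", " live", " in", " New"]
```

Although `prompt` is a character prefix of `full` and ends on a word boundary, its last token `" New"` is only a prefix of the token `" New York"` the model expects, so the prompt's token sequence is not a prefix of the full text's — exactly the misalignment the paper measures.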
[NLP-7] Deep Search with Hierarchical Meta-Cognitive Monitoring Inspired by Cognitive Neuroscience
[Quick Read]: This paper addresses the practical failures of deep search agents in multi-step retrieval, reasoning, and long-horizon task execution, which stem from the lack of mechanisms to dynamically monitor reasoning and retrieval states as tasks evolve under uncertainty. The key to the solution is DS-MCM (Deep Search with Meta-Cognitive Monitoring), a framework with an explicit hierarchical metacognitive monitoring mechanism whose core components are: (1) a Fast Consistency Monitor, which performs lightweight checks on the alignment between external evidence and internal reasoning confidence for rapid anomaly detection; and (2) a Slow Experience-Driven Monitor, which selectively triggers corrective interventions based on experience memory distilled from historical agent trajectories. By embedding monitoring in the reasoning-retrieval loop, the mechanism decides both when intervention is warranted and how corrective actions should be informed by prior experience, yielding consistent gains in performance and robustness across benchmarks.
Link: https://arxiv.org/abs/2601.23188
Authors: Zhongxiang Sun,Qipeng Wang,Weijie Yu,Jingxuan Yang,Haolang Lu,Jun Xu
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; School of Information Technology and Management, University of International Business and Economics; Search Applications Department, Tencent; Beijing University of Posts and Telecommunications
Subjects: Computation and Language (cs.CL)
Comments: 11 pages, 3 figures
Abstract:Deep search agents powered by large language models have demonstrated strong capabilities in multi-step retrieval, reasoning, and long-horizon task execution. However, their practical failures often stem from the lack of mechanisms to monitor and regulate reasoning and retrieval states as tasks evolve under uncertainty. Insights from cognitive neuroscience suggest that human metacognition is hierarchically organized, integrating fast anomaly detection with selectively triggered, experience-driven reflection. In this work, we propose Deep Search with Meta-Cognitive Monitoring (DS-MCM), a deep search framework augmented with an explicit hierarchical metacognitive monitoring mechanism. DS-MCM integrates a Fast Consistency Monitor, which performs lightweight checks on the alignment between external evidence and internal reasoning confidence, and a Slow Experience-Driven Monitor, which is selectively activated to guide corrective intervention based on experience memory from historical agent trajectories. By embedding monitoring directly into the reasoning-retrieval loop, DS-MCM determines both when intervention is warranted and how corrective actions should be informed by prior experience. Experiments across multiple deep search benchmarks and backbone models demonstrate that DS-MCM consistently improves performance and robustness.
[NLP-8] ReGuLaR: Variational Latent Reasoning Guided by Rendered Chain-of-Thought
[Quick Read]: This paper addresses the computational redundancy that explicit Chain-of-Thought (CoT) introduces into Large Language Models (LLMs), while overcoming the severe performance degradation of existing latent reasoning methods caused by the lack of effective compression guidance. The key to the solution is ReGuLaR (Rendered CoT-Guided variational Latent Reasoning), a new latent learning paradigm built on the Variational Auto-Encoding (VAE) framework: it renders explicit reasoning chains as images and extracts dense visual-semantic representations from them to regularize the posterior distribution, achieving efficient compression of the latent reasoning process with minimal information loss.
Link: https://arxiv.org/abs/2601.23184
Authors: Fanmeng Wang,Haotian Liu,Guojiang Zhao,Hongteng Xu,Zhifeng Gao
Affiliations: Renmin University of China; dp.tech
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:While Chain-of-Thought (CoT) significantly enhances the performance of Large Language Models (LLMs), explicit reasoning chains introduce substantial computational redundancy. Recent latent reasoning methods attempt to mitigate this by compressing reasoning processes into latent space, but often suffer from severe performance degradation due to the lack of appropriate compression guidance. In this study, we propose Rendered CoT-Guided variational Latent Reasoning (ReGuLaR), a simple yet novel latent learning paradigm resolving this issue. Fundamentally, we formulate latent reasoning within the Variational Auto-Encoding (VAE) framework, sampling the current latent reasoning state from the posterior distribution conditioned on previous ones. Specifically, when learning this variational latent reasoning model, we render explicit reasoning chains as images, from which we extract dense visual-semantic representations to regularize the posterior distribution, thereby achieving efficient compression with minimal information loss. Extensive experiments demonstrate that ReGuLaR significantly outperforms existing latent reasoning methods across both computational efficiency and reasoning effectiveness, and even surpasses CoT through multi-modal reasoning, providing a new and insightful solution to latent reasoning. Code: this https URL.
[NLP-9] JobResQA: A Benchmark for LLM Machine Reading Comprehension on Multilingual Résumés and JDs
[Quick Read]: This paper addresses the lack of a standardized, reproducible benchmark, with support for fairness research, for evaluating the multilingual machine reading comprehension (MRC) abilities of large language models (LLMs) in human-resources (HR) scenarios. The key to the solution is JobResQA, a multilingual question answering (QA) benchmark whose core innovations are: (1) a data generation pipeline that de-identifies and synthesizes from real-world sources, ensuring both realism and privacy; (2) placeholder-controlled demographic and professional attributes that enable systematic bias and fairness analysis; (3) a human-in-the-loop translation pipeline based on the TEaR methodology, combining MQM error annotation with selective post-editing to build a high-quality multi-way parallel benchmark; and (4) baseline evaluations using an LLM-as-judge approach, which reveal cross-language performance gaps and highlight shortcomings in multilingual MRC. The benchmark provides a reproducible foundation for fair, reliable generative AI applications in HR.
Link: https://arxiv.org/abs/2601.23183
Authors: Casimiro Pio Carrino,Paula Estrella,Rabih Zbib,Carlos Escolano,José A. R. Fonollosa
Affiliations: Avature Machine Learning; Universitat Politècnica de Catalunya
Subjects: Computation and Language (cs.CL)
Comments: Under review
Abstract:We introduce JobResQA, a multilingual Question Answering benchmark for evaluating Machine Reading Comprehension (MRC) capabilities of LLMs on HR-specific tasks involving résumés and job descriptions. The dataset comprises 581 QA pairs across 105 synthetic résumé-job description pairs in five languages (English, Spanish, Italian, German, and Chinese), with questions spanning three complexity levels from basic factual extraction to complex cross-document reasoning. We propose a data generation pipeline derived from real-world sources through de-identification and data synthesis to ensure both realism and privacy, while controlled demographic and professional attributes (implemented via placeholders) enable systematic bias and fairness studies. We also present a cost-effective, human-in-the-loop translation pipeline based on the TEaR methodology, incorporating MQM error annotations and selective post-editing to ensure a high-quality multi-way parallel benchmark. We provide baseline evaluations across multiple open-weight LLM families using an LLM-as-judge approach, revealing higher performance on English and Spanish but substantial degradation for other languages, highlighting critical gaps in multilingual MRC capabilities for HR applications. JobResQA provides a reproducible benchmark for advancing fair and reliable LLM-based HR systems. The benchmark is publicly available at: this https URL
[NLP-10] FourierSampler: Unlocking Non-Autoregressive Potential in Diffusion Language Models via Frequency-Guided Generation
[Quick Read]: This paper aims to eliminate positional bias in the decoding of diffusion language models (dLLMs) and thereby unlock their full non-autoregressive generation potential. Existing decoding strategies struggle to realize high-quality arbitrary-order generation, largely because they fail to separate and exploit the global structural and local detail information carried in hidden states. The key to the solution is the first frequency-domain analysis of dLLMs, revealing that low-frequency components mainly encode global structure and long-range dependencies while high-frequency components characterize local details. Building on this observation, the authors propose FourierSampler, which uses a frequency-domain sliding window to dynamically guide the model through a "structure-to-detail" generation process, markedly improving quality and flexibility, with relative gains of 20.4% on LLaDA1.5-8B and 16.0% on LLaDA-8B-Instruct, surpassing similarly sized autoregressive models such as Llama3.1-8B-Instruct.
Link: https://arxiv.org/abs/2601.23182
Authors: Siyang He,Qiqi Wang,Xiaoran Liu,Hongnan Ma,Yiwei Shi,Yuerong Song,Ying Zhu,Tianyi Liang,Zengfeng Huang,Ziwei He,Xipeng Qiu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 15 pages, 6 figures, under review
Abstract:Despite the non-autoregressive potential of diffusion language models (dLLMs), existing decoding strategies demonstrate positional bias, failing to fully unlock the potential of arbitrary generation. In this work, we delve into the inherent spectral characteristics of dLLMs and present the first frequency-domain analysis showing that low-frequency components in hidden states primarily encode global structural information and long-range dependencies, while high-frequency components are responsible for characterizing local details. Based on this observation, we propose FourierSampler, which leverages a frequency-domain sliding window mechanism to dynamically guide the model to achieve a “structure-to-detail” generation. FourierSampler outperforms other inference enhancement strategies on LLADA and SDAR, achieving relative improvements of 20.4% on LLaDA1.5-8B and 16.0% on LLaDA-8B-Instruct. It notably surpasses similarly sized autoregressive models like Llama3.1-8B-Instruct.
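The low-/high-frequency decomposition that motivates FourierSampler is standard signal processing; a minimal sketch on a synthetic 1-D signal (a toy stand-in for a hidden-state sequence, not the paper's code) shows how a low-pass filter isolates the slow "global structure" component from fast "local detail":

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse DFT; input is conjugate-symmetric here, so take the real part."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def lowpass(x, keep):
    """Zero out all frequency bins at or above `keep` (and their mirrors)."""
    X = dft(x)
    n = len(X)
    for k in range(n):
        if min(k, n - k) >= keep:  # symmetric frequency index
            X[k] = 0
    return idft(X)

# A slow "global" cosine (freq 1) plus a fast "detail" cosine (freq 12):
n = 32
signal = [3 * math.cos(2 * math.pi * 1 * t / n) + 0.5 * math.cos(2 * math.pi * 12 * t / n)
          for t in range(n)]
smooth = lowpass(signal, keep=4)
slow = [3 * math.cos(2 * math.pi * 1 * t / n) for t in range(n)]
err = max(abs(a - b) for a, b in zip(smooth, slow))  # should be ~0
```

After low-pass filtering, the output matches the slow component alone to numerical precision; FourierSampler's sliding window generalizes this idea to schedule which spectral band guides each generation phase.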
[NLP-11] Monotonic Reference-Free Refinement for Autoformalization
[Quick Read]: This paper addresses the difficulty of jointly optimizing multiple quality dimensions in full-theorem autoformalization; existing iterative refinement methods typically improve only isolated aspects (such as syntactic correctness) and struggle to simultaneously improve Formal Validity, Logical Preservation, Mathematical Consistency, and Formal Quality. The key to the solution is a reference-free iterative monotonic process that, without access to ground-truth proofs or existing formalizations at inference time, jointly optimizes a masked composite objective using complementary feedback from theorem provers and LLM-based judges, guided by a responsiveness map indicating which dimensions LLMs in different roles preferentially improve. An acceptance policy guarantees certified monotonic improvement, with conditions ensuring convergence and termination, enabling simultaneous gains across dimensions.
Link: https://arxiv.org/abs/2601.23166
Authors: Lan Zhang,Marco Valentino,André Freitas
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Work in progress
Abstract:While statement autoformalization has advanced rapidly, full-theorem autoformalization remains largely unexplored. Existing iterative refinement methods in statement autoformalization typically improve isolated aspects of formalization, such as syntactic correctness, but struggle to jointly optimize multiple quality dimensions, which is critical for full-theorem autoformalization. We introduce a reference-free iterative monotonic process for full-theorem autoformalization that leverages complementary feedback from theorem provers and LLM-based judges, without access to ground-truth proofs or existing formalizations at inference time. Our approach optimizes a masked composite objective over Formal Validity, Logical Preservation, Mathematical Consistency, and Formal Quality, guided by a responsiveness map that indicates how different LLMs acting as different roles preferentially improve each dimension. We further propose an acceptance policy that guarantees certified monotonic improvement, and provide conditions ensuring convergence and termination. Empirical experiments demonstrate the proposed process enables simultaneous improvement across multiple dimensions, achieving 93.44% formal validity and a 78.22% overall score on miniF2F, and 44.09% formal validity and a 29.79% overall score on ProofNet.
[NLP-12] DIFFA-2: A Practical Diffusion Large Language Model for General Audio Understanding
[Quick Read]: This paper addresses the high data and compute cost of training autoregressive large audio language models (AR LALMs) and the inference inefficiency of their strictly sequential decoding. The key to the solution is DIFFA-2, a practical diffusion-based large audio language model that upgrades the speech encoder, adds dual semantic and acoustic adapters, and trains with a four-stage curriculum (combining semantic and acoustic alignment, large-scale supervised fine-tuning, and variance-reduced preference optimization) using only fully open-source corpora, achieving efficient, strong general audio understanding and demonstrating that diffusion-based modeling is a viable, competitive backbone for large-scale audio understanding.
Link: https://arxiv.org/abs/2601.23161
Authors: Jiaming Zhou,Xuxin Cheng,Shiwan Zhao,Yuhang Jia,Cao Liu,Ke Zeng,Xunliang Cai,Yong Qin
Affiliations: Nankai University; Meituan LongCat Interaction Team
Subjects: Sound (cs.SD); Computation and Language (cs.CL)
Comments:
Abstract:Autoregressive (AR) large audio language models (LALMs) such as Qwen-2.5-Omni have achieved strong performance on audio understanding and interaction, but scaling them remains costly in data and computation, and strictly sequential decoding limits inference efficiency. Diffusion large language models (dLLMs) have recently been shown to make effective use of limited training data, and prior work on DIFFA indicates that replacing an AR backbone with a diffusion counterpart can substantially improve audio understanding under matched settings, albeit at a proof-of-concept scale without large-scale instruction tuning, preference alignment, or practical decoding schemes. We introduce DIFFA-2, a practical diffusion-based LALM for general audio understanding. DIFFA-2 upgrades the speech encoder, employs dual semantic and acoustic adapters, and is trained with a four-stage curriculum that combines semantic and acoustic alignment, large-scale supervised fine-tuning, and variance-reduced preference optimization, using only fully open-source corpora. Experiments on MMSU, MMAU, and MMAR show that DIFFA-2 consistently improves over DIFFA and is competitive with strong AR LALMs under practical training budgets, supporting diffusion-based modeling as a viable backbone for large-scale audio understanding. Our code is available at this https URL.
[NLP-13] Evaluating the Utility of Grounding Documents with Reference-Free LLM-based Metrics
[Quick Read]: This paper addresses the difficulty of quantifying content utility in Retrieval Augmented Generation (RAG): existing metrics either ignore model-specific capabilities or rely on costly annotation. The key to the solution is Grounding Generation Utility (GroGU), a model-specific, reference-free metric that defines content utility as a function of the downstream LLM's generation confidence, measured via entropy. Without any annotation requirements, it faithfully distinguishes ground-truth documents and captures nuances missed by LLM-agnostic metrics, enabling more accurate selection of high-utility documents and improvements in RAG systems.
Link: https://arxiv.org/abs/2601.23129
Authors: Yilun Hua,Giuseppe Castellucci,Peter Schulam,Heba Elfardy,Kevin Small
Affiliations: Cornell University; Amazon
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Retrieval Augmented Generation (RAG)'s success depends on the utility the LLM derives from the content used for grounding. Quantifying content utility does not have a definitive specification and existing metrics ignore model-specific capabilities and/or rely on costly annotations. In this paper, we propose Grounding Generation Utility (GroGU), a model-specific and reference-free metric that defines utility as a function of the downstream LLM’s generation confidence based on entropy. Despite having no annotation requirements, GroGU is largely faithful in distinguishing ground-truth documents while capturing nuances ignored by LLM-agnostic metrics. We apply GroGU to train a query-rewriter for RAG by identifying high-utility preference data for Direct Preference Optimization. Experiments show improvements by up to 18.2 points in Mean Reciprocal Rank and up to 9.4 points in answer accuracy.
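The abstract does not give GroGU's exact formula; a minimal sketch of the underlying idea (Shannon entropy of the model's next-token distributions as a confidence signal, with a hypothetical scoring function and made-up distributions) looks like this:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def utility_from_entropy(step_probs):
    """Hypothetical GroGU-style score: lower mean entropy over the
    generated tokens (i.e., higher model confidence) => higher utility
    of the grounding document. Not the paper's exact definition."""
    mean_h = sum(entropy(p) for p in step_probs) / len(step_probs)
    return -mean_h  # negate so that confident generations score higher

# Same query, two candidate grounding documents (made-up distributions
# over a 3-token vocabulary at each generation step):
confident = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]   # useful document: peaked
uncertain = [[0.4, 0.3, 0.3], [0.34, 0.33, 0.33]]  # unhelpful document: flat
u_good = utility_from_entropy(confident)
u_bad = utility_from_entropy(uncertain)
```

A document that makes the downstream LLM more certain about its continuation scores higher, which is the intuition the paper exploits to rank preference pairs for DPO.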
[NLP-14] Safer Policy Compliance with Dynamic Epistemic Fallback
[Quick Read]: This paper addresses the risk that large language models (LLMs) can be induced into incorrect compliance behavior in high-stakes settings through maliciously perturbed policy texts (e.g., legal texts such as HIPAA and GDPR). The key to the solution is Dynamic Epistemic Fallback (DEF), a safety protocol that uses multiple levels of one-sentence textual cues at inference time to nudge the LLM to flag inconsistencies in policy texts, refuse to comply with perturbed instructions, and fall back to its parametric knowledge for a safer response. Empirical results show DEF substantially improves frontier LLMs' ability to detect and refuse perturbed policy texts, with DeepSeek-R1 achieving a 100% detection rate in one setting.
Link: https://arxiv.org/abs/2601.23094
Authors: Joseph Marvin Imperial,Harish Tayyar Madabushi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Humans develop a series of cognitive defenses, known as epistemic vigilance, to combat risks of deception and misinformation from everyday interactions. Developing safeguards for LLMs inspired by this mechanism might be particularly helpful for their application in high-stakes tasks such as automating compliance with data privacy laws. In this paper, we introduce Dynamic Epistemic Fallback (DEF), a dynamic safety protocol for improving an LLM’s inference-time defenses against deceptive attacks that make use of maliciously perturbed policy texts. Through various levels of one-sentence textual cues, DEF nudges LLMs to flag inconsistencies, refuse compliance, and fallback to their parametric knowledge upon encountering perturbed policy texts. Using globally recognized legal policies such as HIPAA and GDPR, our empirical evaluations report that DEF effectively improves the capability of frontier LLMs to detect and refuse perturbed versions of policies, with DeepSeek-R1 achieving a 100% detection rate in one setting. This work encourages further efforts to develop cognitively inspired defenses to improve LLM robustness against forms of harm and deception that exploit legal artifacts.
[NLP-15] Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures
[Quick Read]: This paper addresses "emergent misalignment", in which fine-tuning large language models (LLMs) on narrowly scoped data induces broad, hard-to-predict behavioral misalignment rather than mere generalization of specific erroneous content. The key contribution is showing that such misalignment stems not from capability degradation or corrupted knowledge but from stable behavioral shifts induced by specific character-level dispositions in the training data; these shifts can be conditionally activated by training-time triggers or inference-time persona-aligned prompts, revealing shared structure among emergent misalignment, backdoor activation, and jailbreak susceptibility. The paper therefore argues that behavioral dispositions should be treated as a central alignment risk, rather than defending only against isolated errors or prompt-level attacks.
Link: https://arxiv.org/abs/2601.23081
Authors: Yanghao Su,Wenbo Zhou,Tianwei Zhang,Qiu Han,Weiming Zhang,Nenghai Yu,Jie Zhang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:Emergent Misalignment refers to a failure mode in which fine-tuning large language models (LLMs) on narrowly scoped data induces broadly misaligned behavior. Prior explanations mainly attribute this phenomenon to the generalization of erroneous or unsafe content. In this work, we show that this view is incomplete. Across multiple domains and model families, we find that fine-tuning models on data exhibiting specific character-level dispositions induces substantially stronger and more transferable misalignment than incorrect-advice fine-tuning, while largely preserving general capabilities. This indicates that emergent misalignment arises from stable shifts in model behavior rather than from capability degradation or corrupted knowledge. We further show that such behavioral dispositions can be conditionally activated by both training-time triggers and inference-time persona-aligned prompts, revealing shared structure across emergent misalignment, backdoor activation, and jailbreak susceptibility. Overall, our results identify character formation as a central and underexplored alignment risk, suggesting that robust alignment must address behavioral dispositions rather than isolated errors or prompt-level defenses.
[NLP-16] DimABSA: Building Multilingual and Multidomain Datasets for Dimensional Aspect-Based Sentiment Analysis
[Quick Read]: This paper addresses the limitation that traditional Aspect-Based Sentiment Analysis (ABSA) relies on coarse-grained categorical labels (e.g., positive, negative) and thus cannot capture nuanced affective states. The key to the solution is a dimensional approach representing sentiment with continuous valence-arousal (VA) scores, enabling fine-grained analysis at both the aspect and sentiment levels. The authors build DimABSA, the first multilingual dimensional ABSA resource, annotated with traditional ABSA elements (aspect terms, aspect categories, opinion terms) plus new VA scores, covering 76,958 aspect instances across six languages and four domains. They further introduce three subtasks combining VA scores with different ABSA elements, and propose a unified metric, continuous F1 (cF1), which incorporates VA prediction error into standard F1, providing a benchmark and foundation for multilingual dimensional ABSA research.
Link: https://arxiv.org/abs/2601.23022
Authors: Lung-Hao Lee,Liang-Chih Yu,Natalia Loukashevich,Ilseyar Alimova,Alexander Panchenko,Tzu-Mi Lin,Zhe-Yu Xu,Jian-Yu Zhou,Guangmin Zheng,Jin Wang,Sharanya Awasthi,Jonas Becker,Jan Philip Wahle,Terry Ruas,Shamsuddeen Hassan Muhammad,Saif M. Mohammed
Affiliations: National Yang Ming Chiao Tung University; Yuan Ze University; Lomonosov Moscow State University; Yunnan University; University of Cincinnati; University of Göttingen; Imperial College London; National Research Council Canada
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Aspect-Based Sentiment Analysis (ABSA) focuses on extracting sentiment at a fine-grained aspect level and has been widely applied across real-world domains. However, existing ABSA research relies on coarse-grained categorical labels (e.g., positive, negative), which limits its ability to capture nuanced affective states. To address this limitation, we adopt a dimensional approach that represents sentiment with continuous valence-arousal (VA) scores, enabling fine-grained analysis at both the aspect and sentiment levels. To this end, we introduce DimABSA, the first multilingual, dimensional ABSA resource annotated with both traditional ABSA elements (aspect terms, aspect categories, and opinion terms) and newly introduced VA scores. This resource contains 76,958 aspect instances across 42,590 sentences, spanning six languages and four domains. We further introduce three subtasks that combine VA scores with different ABSA elements, providing a bridge from traditional ABSA to dimensional ABSA. Given that these subtasks involve both categorical and continuous outputs, we propose a new unified metric, continuous F1 (cF1), which incorporates VA prediction error into standard F1. We provide a comprehensive benchmark using both prompted and fine-tuned large language models across all subtasks. Our results show that DimABSA is a challenging benchmark and provides a foundation for advancing multilingual dimensional ABSA.
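The abstract does not spell out how cF1 folds VA error into F1; one plausible instantiation (entirely hypothetical, for intuition only: each matched aspect contributes a partial true-positive count discounted by its normalized VA distance) can be sketched as:

```python
def cf1(gold, pred, va_scale=8.0):
    """Toy continuous F1 for dimensional ABSA (hypothetical formulation).

    gold/pred map aspect term -> (valence, arousal), both on a 1..9 scale
    (so the maximum per-dimension error is va_scale = 8). A matched aspect
    contributes 1 - normalized VA error instead of a full count of 1.
    """
    tp = 0.0
    for aspect, (v, a) in pred.items():
        if aspect in gold:
            gv, ga = gold[aspect]
            err = (abs(v - gv) + abs(a - ga)) / (2 * va_scale)
            tp += max(0.0, 1.0 - err)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = {"battery": (7.5, 6.0), "screen": (3.0, 5.5)}
exact = cf1(gold, {"battery": (7.5, 6.0), "screen": (3.0, 5.5)})  # perfect VA
off = cf1(gold, {"battery": (5.0, 6.0), "screen": (3.0, 5.5)})    # valence off by 2.5
miss = cf1(gold, {"battery": (7.5, 6.0)})                          # missed an aspect
```

Under this toy definition a perfect prediction scores 1.0, a correct aspect with imprecise VA is penalized smoothly, and a missed aspect is penalized harder, which matches the stated goal of unifying categorical and continuous outputs in one metric.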
zh
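摘要提出的统一指标连续F1(cF1)将VA预测误差纳入标准F1,但摘要并未给出具体公式。下面给出一种仅作示意的可能实现(并非论文的官方定义,评分尺度按常见的1–9 VA量表假设,样例数据为虚构):匹配成功的方面项按其VA欧氏误差获得部分得分,再按常规方式计算精确率、召回率与F1。

```python
import math

def continuous_f1(gold, pred):
    """Illustrative cF1: a correctly extracted aspect contributes partial
    credit 1 - (VA error / max possible error) instead of a full count of 1;
    precision, recall, and F1 are then computed as usual.
    gold/pred map aspect term -> (valence, arousal) on a 1-9 scale."""
    max_err = math.dist((1, 1), (9, 9))   # worst-case VA distance
    credit = sum(
        max(0.0, 1.0 - math.dist(va, gold[a]) / max_err)
        for a, va in pred.items() if a in gold
    )
    precision = credit / len(pred) if pred else 0.0
    recall = credit / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = {"battery": (7.5, 6.0), "screen": (3.0, 5.5)}
pred = {"battery": (7.0, 6.0), "price": (2.0, 4.0)}   # one近似命中, one误报
score = continuous_f1(gold, pred)
```

当预测与标注完全一致时该指标退化为普通F1(取值1.0),VA误差越大,对应真阳性的"得分"越低。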
[NLP-17] Mem-T: Densifying Rewards for Long-Horizon Memory Agents
【速读】: 该论文旨在解决当前记忆代理(Memory Agents)在长期记忆管理策略训练中因稀疏且延迟的奖励信号而导致难以实现端到端优化的问题。现有方法通常需要在长时间序列的记忆操作后才能获得反馈,这限制了模型对记忆构建与检索策略的联合优化能力。解决方案的关键在于提出一种名为Mem-T的自主记忆代理,其通过轻量级分层记忆数据库实现流式输入下的动态更新与多轮检索,并结合树引导的强化学习框架MoT-GRPO,利用记忆操作树反向传播(memory operation tree backpropagation)和事后信用分配(hindsight credit assignment)机制,将稀疏的终端奖励转化为密集的步骤级监督信号,从而实现记忆构建与检索的联合优化,有效训练长时程记忆管理能力。
链接: https://arxiv.org/abs/2601.23014
作者: Yanwei Yue,Guibin Zhang,Boci Peng,Xuanbo Fan,Jiaxin Guo,Qiankun Li,Yan Zhang
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Memory agents, which depart from predefined memory-processing pipelines by endogenously managing the processing, storage, and retrieval of memories, have garnered increasing attention for their autonomy and adaptability. However, existing training paradigms remain constrained: agents often traverse long-horizon sequences of memory operations before receiving sparse and delayed rewards, which hinders truly end-to-end optimization of memory management policies. To address this limitation, we introduce Mem-T, an autonomous memory agent that interfaces with a lightweight hierarchical memory database to perform dynamic updates and multi-turn retrieval over streaming inputs. To effectively train long-horizon memory management capabilities, we further propose MoT-GRPO, a tree-guided reinforcement learning framework that transforms sparse terminal feedback into dense, step-wise supervision via memory operation tree backpropagation and hindsight credit assignment, thereby enabling the joint optimization of memory construction and retrieval. Extensive experiments demonstrate that Mem-T is (1) high-performing, surpassing frameworks such as A-Mem and Mem0 by up to 14.92%, and (2) economical, operating on a favorable accuracy-efficiency Pareto frontier and reducing inference tokens per query by ~24.45% relative to GAM without sacrificing performance.
zh
[NLP-18] InstructDiff: Domain-Adaptive Data Selection via Differential Entropy for Efficient LLM Fine-Tuning
【速读】: 该论文旨在解决大规模语言模型在监督微调(Supervised Fine-Tuning, SFT)过程中因使用完整数据集而导致的高昂训练成本与收益递减问题,同时克服现有数据选择方法在不同任务域中表现不稳定、存在严重领域特异性的问题。其解决方案的关键在于提出一种基于差分熵(differential entropy)的统一框架 InstructDiff,通过预热校准(warmup calibration)、双向负对数似然(bi-directional NLL)过滤和熵驱动排序机制,实现跨领域的自适应数据选择:在推理类任务中偏好熵增加(认知扩展),而在通用指令遵循任务中偏好熵减少(认知压缩),从而显著提升小样本下的模型性能,在数学推理和通用指令跟随任务上分别相对全量数据训练提升17%和52%,且仅需10%的数据量。
链接: https://arxiv.org/abs/2601.23006
作者: Junyou Su,He Zhu,Xiao Luo,Liyu Zhang,Hong-Yu Zhou,Yun Chen,Peng Li,Yang Liu,Guanhua Chen
机构: Peking University (北京大学); University of Wisconsin-Madison (威斯康星大学麦迪逊分校); Tsinghua University (清华大学); SUFE (上海财经大学); SUSTech (南方科技大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Supervised fine-tuning (SFT) is fundamental to adapting large language models, yet training on complete datasets incurs prohibitive costs with diminishing returns. Existing data selection methods suffer from severe domain specificity: techniques optimized for general instruction-following fail on reasoning tasks, and vice versa. We observe that measuring entropy differences between base models and minimally instruction-tuned calibrated models reveals a pattern – samples with the lowest differential entropy consistently yield optimal performance across domains, yet this principle manifests domain-adaptively: reasoning tasks favor entropy increase (cognitive expansion), while general tasks favor entropy decrease (cognitive compression). We introduce InstructDiff, a unified framework that operationalizes differential entropy as a domain-adaptive selection criterion through warmup calibration, bi-directional NLL filtering, and entropy-based ranking. Extensive experiments show that InstructDiff achieves 17% relative improvement over full data training on mathematical reasoning and 52% for general instruction-following, outperforming prior baselines while using only 10% of the data.
zh
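InstructDiff 的核心操作——比较基础模型与预热校准模型在同一样本上的token级熵变并据此排序——可以用如下极简代码示意(分布与样本均为玩具数据,并非论文实现;按摘要所述,偏好熵增还是熵减随任务域而定,此处仅演示熵差的计算方向):

```python
import math

def entropy(dist):
    """Shannon entropy of a next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def differential_entropy(base_dists, calib_dists):
    """Mean per-token entropy change from the base model to the warmup-
    calibrated model: negative = compression, positive = expansion."""
    deltas = [entropy(c) - entropy(b) for b, c in zip(base_dists, calib_dists)]
    return sum(deltas) / len(deltas)

uniform = [0.25] * 4                  # maximally uncertain over 4 tokens
peaked = [0.85, 0.05, 0.05, 0.05]     # confident distribution

# Calibration made the model more confident on this sample: entropy decrease,
# the direction the abstract associates with general instruction-following data.
delta = differential_entropy([uniform, uniform], [peaked, peaked])
```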
[NLP-19] Bias Beyond Borders: Political Ideology Evaluation and Steering in Multilingual LLMs
【速读】: 该论文旨在解决多语言大语言模型(Large Language Models, LLMs)中政治偏见的跨语言一致性不足及安全后处理缓解策略缺失的问题。现有研究多集中于高资源西方语言或有限的多语言场景,忽视了不同语言间意识形态表征的一致性与可控性。其解决方案的关键在于提出一种名为跨语言对齐引导(Cross-Lingual Alignment Steering, CLAS)的后处理框架,该框架通过将不同语言下由政治提示诱导出的潜在意识形态表示对齐至共享的意识形态子空间,实现跨语言一致性;同时引入自适应机制动态调节干预强度,避免过度修正并保持输出连贯性,从而在显著降低经济与社会轴线上偏见的同时,最小化响应质量的下降。
链接: https://arxiv.org/abs/2601.23001
作者: Afrozah Nadeem,Agrima,Mehwish Nasim,Usman Naseem
机构: Macquarie University (麦考瑞大学); Microsoft (微软); University of Western Australia (西澳大利亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: PrePrint
Abstract:Large Language Models (LLMs) increasingly shape global discourse, making fairness and ideological neutrality essential for responsible AI deployment. Despite growing attention to political bias in LLMs, prior work largely focuses on high-resource, Western languages or narrow multilingual settings, leaving cross-lingual consistency and safe post-hoc mitigation underexplored. To address this gap, we present a large-scale multilingual evaluation of political bias spanning 50 countries and 33 languages. We introduce a complementary post-hoc mitigation framework, Cross-Lingual Alignment Steering (CLAS), designed to augment existing steering methods by aligning ideological representations across languages and dynamically regulating intervention strength. This method aligns latent ideological representations induced by political prompts into a shared ideological subspace, ensuring cross-lingual consistency, while the adaptive mechanism prevents over-correction and preserves coherence. Experiments demonstrate substantial bias reduction along both economic and social axes with minimal degradation in response quality. The proposed framework establishes a scalable and interpretable paradigm for fairness-aware multilingual LLM governance, balancing ideological neutrality with linguistic and cultural diversity.
zh
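摘要中的"自适应干预强度"机制可以用一个示意性的向量操作来理解:沿意识形态方向做投影消减,且当偏见表达较弱时自动减小干预力度,避免过度修正。以下代码纯属示意(方向向量、阈值 tau、max_alpha 等均为假设,并非 CLAS 的实际实现):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def adaptive_steer(hidden, direction, max_alpha=1.0, tau=0.5):
    """Remove the component of `hidden` along an ideological `direction`,
    scaling the intervention by how strongly the bias is expressed, so that
    weakly biased states are barely touched (avoids over-correction)."""
    coeff = dot(hidden, direction) / dot(direction, direction)
    alpha = max_alpha * min(1.0, abs(coeff) / tau)   # adaptive strength
    return [h - alpha * coeff * d for h, d in zip(hidden, direction)]

strong = adaptive_steer([1.0, 1.0], [1.0, 0.0])   # strong bias: fully removed
weak = adaptive_steer([0.1, 1.0], [1.0, 0.0])     # weak bias: gentle nudge only
```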
[NLP-20] ArabicDialectHub: A Cross-Dialectal Arabic Learning Resource and Platform
【速读】: 该论文旨在解决阿拉伯语方言间学习资源匮乏的问题,特别是跨方言语言学习中缺乏系统化、可交互且具备文化背景支持的学习工具。其解决方案的关键在于构建了一个名为ArabicDialectHub的开放资源平台,包含6种阿拉伯语变体(摩洛哥方言、黎巴嫩语、叙利亚语、阿联酋语、沙特语及标准阿拉伯语)的552个主题分层短语,并利用大语言模型(LLM)生成内容,经五位母语者验证后进行难度分级与主题组织;平台提供翻译探索、基于算法生成干扰项的自适应测验、云端同步进度追踪及文化语境展示功能,从而实现多方言协同学习与个性化交互体验。
链接: https://arxiv.org/abs/2601.22987
作者: Salem Lahlou
机构: Mohamed Bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:We present ArabicDialectHub, a cross-dialectal Arabic learning resource comprising 552 phrases across six varieties (Moroccan Darija, Lebanese, Syrian, Emirati, Saudi, and MSA) and an interactive web platform. Phrases were generated using LLMs and validated by five native speakers, stratified by difficulty, and organized thematically. The open-source platform provides translation exploration, adaptive quizzing with algorithmic distractor generation, cloud-synchronized progress tracking, and cultural context. Both the dataset and complete platform source code are released under MIT license. Platform: this https URL.
zh
[NLP-21] Learnable Permutation for Structured Sparsity on Transformer Models
【速读】: 该论文旨在解决结构化稀疏(structured sparsity)剪枝技术在Transformer架构中因排列搜索空间随模型规模呈指数增长而导致的性能瓶颈问题,即现有方法依赖贪心或启发式算法进行权重重排,难以获得最优剪枝效果。其解决方案的关键在于提出一种端到端可学习的排列框架:通过引入一个可学习的排列代价矩阵来量化任意两个输入通道交换的代价,采用可微分的二分图匹配求解器获取给定代价矩阵下的最优二元排列矩阵,并设计稀疏优化损失函数直接优化排列操作符,从而实现更有效的结构化稀疏剪枝。
链接: https://arxiv.org/abs/2601.22980
作者: Zekai Li,Ji Liu,Guanchen Li,Yixing Xu,Ziqiong Liu,Xuanwu Yin,Dong Li,Emad Barsoum
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Structured sparsity has emerged as a popular model pruning technique, widely adopted in various architectures, including CNNs, Transformer models, and especially large language models (LLMs) in recent years. A promising direction to further improve post-pruning performance is weight permutation, which reorders model weights into patterns more amenable to pruning. However, the exponential growth of the permutation search space with the scale of Transformer architectures forces most methods to rely on greedy or heuristic algorithms, limiting the effectiveness of reordering. In this work, we propose a novel end-to-end learnable permutation framework. Our method introduces a learnable permutation cost matrix to quantify the cost of swapping any two input channels of a given weight matrix, a differentiable bipartite matching solver to obtain the optimal binary permutation matrix given a cost matrix, and a sparsity optimization loss function to directly optimize the permutation operator. We extensively validate our approach on vision and language Transformers, demonstrating that our method achieves state-of-the-art permutation results for structured sparsity.
zh
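该论文用可微分二分图匹配求解器在给定代价矩阵下求最优排列;作为理解优化目标的参照,小规模矩阵上的这一指派问题可以直接暴力枚举精确求解(下例中的代价矩阵为虚构数据;真实模型中搜索空间随规模阶乘增长,这正是论文引入可微求解器的原因):

```python
import itertools

def optimal_permutation(cost):
    """Exact assignment by brute force: find the permutation pi minimizing
    sum_i cost[i][pi[i]]. Feasible only for tiny matrices; the paper's
    differentiable bipartite matching solver targets realistic sizes."""
    n = len(cost)
    best_cost, best_pi = float("inf"), None
    for pi in itertools.permutations(range(n)):
        c = sum(cost[i][pi[i]] for i in range(n))
        if c < best_cost:
            best_cost, best_pi = c, pi
    return list(best_pi), best_cost

cost = [
    [4.0, 1.0, 3.0],
    [2.0, 0.0, 5.0],
    [3.0, 2.0, 2.0],
]
pi, c = optimal_permutation(cost)   # pi maps channel i -> position pi[i]
```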
[NLP-22] MiTa: A Hierarchical Multi-Agent Collaboration Framework with Memory-integrated and Task Allocation ICASSP2026
【速读】: 该论文旨在解决基于大语言模型(Large Language Models, LLMs)的多智能体系统在复杂任务中面临的记忆不一致性和智能体行为冲突问题。其解决方案的关键在于提出一种分层记忆集成的任务分配框架(MiTa),该框架将智能体组织为管理者-成员的层级结构,其中管理者引入了任务分配模块和摘要模块:前者实现全局视角下的任务分配以避免智能体间冲突,后者通过任务进展触发的事件记忆整合机制,将近期协作历史压缩为简洁摘要以保留长时程上下文信息,从而提升任务理解清晰度与全局一致性。
链接: https://arxiv.org/abs/2601.22974
作者: XiaoJie Zhang,JianHan Wu,Xiaoyang Qu,Jianzong Wang
机构: 未知
类目: Emerging Technologies (cs.ET); Computation and Language (cs.CL)
备注: Accepted to 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)
Abstract:Recent advances in large language models (LLMs) have substantially accelerated the development of embodied agents. LLM-based multi-agent systems mitigate the inefficiency of single agents in complex tasks. However, they still suffer from issues such as memory inconsistency and agent behavioral conflicts. To address these challenges, we propose MiTa, a hierarchical memory-integrated task allocative framework to enhance collaborative efficiency. MiTa organizes agents into a manager-member hierarchy, where the manager incorporates additional allocation and summary modules that enable (1) global task allocation and (2) episodic memory integration. The allocation module enables the manager to allocate tasks from a global perspective, thereby avoiding potential inter-agent conflicts. The summary module, triggered by task progress updates, performs episodic memory integration by condensing recent collaboration history into a concise summary that preserves long-horizon context. By combining task allocation with episodic memory, MiTa attains a clearer understanding of the task and facilitates globally consistent task distribution. Experimental results confirm that MiTa achieves superior efficiency and adaptability in complex multi-agent cooperation over strong baseline methods.
zh
[NLP-23] A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training
【速读】: 该论文旨在解决大语言模型中两类新兴异常值——注意力异常值(attention sinks)和残差异常值(residual sinks)的功能机制问题,以及它们如何通过归一化操作(如softmax注意力和RMSNorm)影响模型训练稳定性与性能。其核心解决方案在于提出“异常值驱动的重缩放”(outlier-driven rescaling)机制,即异常值并非独立贡献特征,而是与归一化层协同作用,对非异常值组件进行动态重缩放,从而维持训练稳定性和模型表现。关键创新点在于:(1) 异常值与归一化不可分割,移除归一化会消除异常值但破坏训练;(2) 异常值主要作为重缩放因子而非信息贡献者;(3) 可通过将异常值纳入可学习参数或引入显式门控重缩放机制来缓解其负面影响,显著提升训练性能(平均提升2点)和量化鲁棒性(W4A4量化下仅下降1.2点)。
链接: https://arxiv.org/abs/2601.22966
作者: Zihan Qiu,Zeyu Huang,Kaiyue Wen,Peng Jin,Bo Zheng,Yuxin Zhou,Haofeng Huang,Zekun Wang,Xiao Li,Huaqing Zhang,Yang Xu,Haoran Lian,Siqi Zhang,Rui Men,Jianwei Zhang,Ivan Titov,Dayiheng Liu,Jingren Zhou,Junyang Lin
机构: Qwen Team; University of Edinburgh (爱丁堡大学); Stanford University (斯坦福大学); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:We investigate the functional role of emergent outliers in large language models, specifically attention sinks (a few tokens that consistently receive large attention logits) and residual sinks (a few fixed dimensions with persistently large activations across most tokens). We hypothesize that these outliers, in conjunction with the corresponding normalizations (e.g., softmax attention and RMSNorm), effectively rescale other non-outlier components. We term this phenomenon outlier-driven rescaling and validate this hypothesis across different model architectures and training token counts. This view unifies the origin and mitigation of both sink types. Our main conclusions and observations include: (1) Outliers function jointly with normalization: removing normalization eliminates the corresponding outliers but degrades training stability and performance; directly clipping outliers while retaining normalization leads to degradation, indicating that outlier-driven rescaling contributes to training stability. (2) Outliers serve more as rescale factors rather than contributors, as the final contributions of attention and residual sinks are significantly smaller than those of non-outliers. (3) Outliers can be absorbed into learnable parameters or mitigated via explicit gated rescaling, leading to improved training performance (average gain of 2 points) and enhanced quantization robustness (1.2 points degradation under W4A4 quantization).
zh
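"异常值驱动的重缩放"可以用 softmax 的一个基本性质来直观理解:在注意力 logits 中加入一个大数值的"sink"位置,会把其余 token 的注意力权重按同一因子等比压缩,而不改变它们之间的相对比例。以下为极简演示(logit 数值为假设):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

content = [1.0, 2.0, 3.0]          # logits of ordinary tokens
plain = softmax(content)
sunk = softmax([6.0] + content)    # prepend a "sink" with a large logit

# Every non-sink weight shrinks by the same factor, so the relative
# proportions among ordinary tokens are preserved.
factor = sunk[1] / plain[0]
```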
[NLP-24] Residual Context Diffusion Language Models
【速读】: 该论文旨在解决扩散型大语言模型(Diffusion Large Language Models, dLLMs)在推理过程中因“重掩码”(remasking)机制导致的计算资源浪费问题,即仅保留高置信度token进行后续解码,而丢弃其余token所蕴含的上下文信息。解决方案的关键在于提出残差上下文扩散(Residual Context Diffusion, RCD)模块,该模块将被丢弃token的表示转化为上下文残差(contextual residuals),并在下一去噪步骤中重新注入,从而有效利用被废弃的计算资源。RCD采用解耦的两阶段训练策略以规避反向传播带来的内存瓶颈,实验证明其可在几乎不增加额外计算开销的前提下显著提升模型性能,在长链式思维(CoT)推理和短CoT指令遵循任务中均取得5–10点准确率提升,并在最具挑战性的AIME任务上实现近两倍准确率提升及4–5倍去噪步数减少。
链接: https://arxiv.org/abs/2601.22954
作者: Yuezhou Hu,Harman Singh,Monishwaran Maheswaran,Haocheng Xi,Coleman Hooper,Jintao Zhang,Aditya Tomar,Michael W. Mahoney,Sewon Min,Mehrdad Farajtabar,Kurt Keutzer,Amir Gholami,Chenfeng Xu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art block-wise dLLMs rely on a “remasking” mechanism that decodes only the most confident tokens and discards the rest, effectively wasting computation. We demonstrate that recycling computation from the discarded tokens is beneficial, as these tokens retain contextual information useful for subsequent decoding iterations. In light of this, we propose Residual Context Diffusion (RCD), a module that converts these discarded token representations into contextual residuals and injects them back for the next denoising step. RCD uses a decoupled two-stage training pipeline to bypass the memory bottlenecks associated with backpropagation. We validate our method on both long CoT reasoning (SDAR) and short CoT instruction following (LLaDA) models. We demonstrate that a standard dLLM can be efficiently converted to the RCD paradigm with merely ~1 billion tokens. RCD consistently improves frontier dLLMs by 5-10 points in accuracy with minimal extra computation overhead across a wide range of benchmarks. Notably, on the most challenging AIME tasks, RCD nearly doubles baseline accuracy and attains up to 4-5x fewer denoising steps at equivalent accuracy levels.
zh
[NLP-25] Perplexity Cannot Always Tell Right from Wrong
【速读】: 该论文试图解决的问题是:困惑度(Perplexity)作为模型选择指标的局限性,尤其是在生成式 AI(Generative AI)模型中,其是否能准确反映模型的真实性能。解决方案的关键在于利用近期关于 Transformer 模型连续性的研究成果,从理论上严格证明:若一个紧凑的仅解码器 Transformer 模型能够对某序列进行准确且自信的预测(这是强泛化能力的必要前提),则必然存在另一个序列使得该模型的困惑度极低但预测错误。进一步地,通过对等困惑度曲线(iso-perplexity plots)的分析发现,模型置信度的提升并不必然带来准确率的同步提高,因此单纯依赖困惑度可能导致错误的模型选择。
链接: https://arxiv.org/abs/2601.22950
作者: Petar Veličković,Federico Barbero,Christos Perivolaropoulos,Simon Osindero,Razvan Pascanu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注: 11 pages, 4 figures
Abstract:Perplexity – a function measuring a model’s overall level of “surprise” when encountering a particular output – has gained significant traction in recent years, both as a loss function and as a simple-to-compute metric of model quality. Prior studies have pointed out several limitations of perplexity, often in an empirical manner. Here we leverage recent results on Transformer continuity to show in a rigorous manner how perplexity may be an unsuitable metric for model selection. Specifically, we prove that, if there is any sequence that a compact decoder-only Transformer model predicts accurately and confidently – a necessary pre-requisite for strong generalisation – it must imply the existence of another sequence with very low perplexity, but not predicted correctly by that same model. Further, by analytically studying iso-perplexity plots, we find that perplexity will not always select for the more accurate model – rather, any increase in model confidence must be accompanied by a commensurate rise in accuracy for the new model to be selected.
zh
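困惑度的定义本身就说明了它衡量的是"确信程度"而非"正确性":它只是序列上平均负对数似然的指数,与预测对错无关。极简示意如下(概率数值为假设):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-probability the model
    assigns to the observed tokens; it measures confidence, not correctness."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

confident = [0.99, 0.98, 0.99, 0.97]   # near-certain on every token
uniform = [0.25, 0.25, 0.25, 0.25]     # hedging over a 4-symbol vocabulary

low = perplexity(confident)    # close to 1
high = perplexity(uniform)     # exactly the vocabulary size, 4.0
```

注意:即使 `confident` 对应的序列是错误答案,其困惑度依然极低——这正是论文论证的风险所在。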
[NLP-26] Autonomous Chain-of-Thought Distillation for Graph-Based Fraud Detection
【速读】: 该论文旨在解决文本属性图(Text-attributed Graph, TAG)上的欺诈检测问题,其中需联合建模丰富的文本语义与关系依赖,而现有基于大语言模型(Large Language Model, LLM)增强的图神经网络(Graph Neural Network, GNN)方法受限于预定义提示(prompting)和解耦训练流程,导致推理自主性不足且语义-结构对齐弱化。解决方案的关键在于提出 FraudCoT 框架,其核心创新包括:1)引入一种欺诈感知的选择性链式思维(Chain-of-Thought, CoT)蒸馏机制,生成多样化的推理路径并强化语义-结构理解,将蒸馏后的 CoT 整合至节点文本中以提供多跳语义与结构线索;2)设计高效的非对称协同训练策略,实现端到端优化的同时显著降低朴素联合训练的计算开销,从而在保持高检测性能的同时大幅提升训练吞吐量。
链接: https://arxiv.org/abs/2601.22949
作者: Yuan Li,Jun Hu,Bryan Hooi,Bingsheng He,Cheng Chen
机构: National University of Singapore(新加坡国立大学); ByteDance Inc.(字节跳动)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
Abstract:Graph-based fraud detection on text-attributed graphs (TAGs) requires jointly modeling rich textual semantics and relational dependencies. However, existing LLM-enhanced GNN approaches are constrained by predefined prompting and decoupled training pipelines, limiting reasoning autonomy and weakening semantic-structural alignment. We propose FraudCoT, a unified framework that advances TAG-based fraud detection through autonomous, graph-aware chain-of-thought (CoT) reasoning and scalable LLM-GNN co-training. To address the limitations of predefined prompts, we introduce a fraud-aware selective CoT distillation mechanism that generates diverse reasoning paths and enhances semantic-structural understanding. These distilled CoTs are integrated into node texts, providing GNNs with enriched, multi-hop semantic and structural cues for fraud detection. Furthermore, we develop an efficient asymmetric co-training strategy that enables end-to-end optimization while significantly reducing the computational cost of naive joint training. Extensive experiments on public and industrial benchmarks demonstrate that FraudCoT achieves up to 8.8% AUPRC improvement over state-of-the-art methods and delivers up to 1,066x speedup in training throughput, substantially advancing both detection performance and efficiency.
zh
[NLP-27] Relaxing Positional Alignment in Masked Diffusion Language Models
【速读】: 该论文旨在解决掩码扩散语言模型(Masked Diffusion Language Models, MDLMs)在开放式文本生成任务中性能与自回归模型存在显著差距的问题。研究指出,这种差距部分源于训练时严格的顺序预测机制导致解码过程对词元错位高度敏感,甚至一个位置的偏移即可严重破坏语义连贯性。解决方案的关键在于引入一种灵活的对齐监督策略:通过连接时序分类(Connectionist Temporal Classification, CTC)目标引入特殊“松弛”标记(slack token),从而在微调阶段放宽对精确位置的约束,使模型更适应不可逆的去噪解码动态。实验表明,该方法在五个开放式文本生成基准上均优于原始模型,并提升了对位置偏移的鲁棒性。
链接: https://arxiv.org/abs/2601.22947
作者: Mengyu Ye,Ryosuke Takahashi,Keito Kudo,Jun Suzuki
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Masked diffusion language models (MDLMs) have emerged as a promising alternative to dominant autoregressive approaches. Although they achieve competitive performance on several tasks, a substantial gap remains in open-ended text generation. We hypothesize that one cause of this gap is that strict positional prediction makes MDLM decoding highly sensitive to token misalignment, and we show through controlled interventions that a one-position shift can severely disrupt semantics. This observation suggests that enforcing strict positional supervision during training is misaligned with the irreversible denoising dynamics of MDLM decoding. Motivated by this mismatch, we adopt an alignment-flexible supervision strategy during fine-tuning. Specifically, we introduce a special “slack” token via the connectionist temporal classification objective. We apply this approach to the widely used MDLM model and conduct experiments on five open-ended text generation benchmarks. Our method consistently outperforms the original model and improves robustness to positional shifts, indicating that relaxing strict positional supervision is an important factor in improving generation quality in MDLMs.
zh
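CTC 目标之所以能放宽逐位置监督,关键在于一个"塌缩"映射:删去 slack(空白)标记并合并相邻重复标记,使同一目标序列可以对应多种位置对齐方式。下面是该塌缩映射的示意实现(标记名 "<slack>" 为假设;真正的 CTC 损失还需对所有合法对齐求和,此处从略):

```python
def collapse_ctc(tokens, slack="<slack>"):
    """CTC-style collapse: drop slack (blank) tokens and merge adjacent
    repeats; a slack between two identical tokens breaks the merge, so
    genuinely doubled tokens survive."""
    out = []
    prev = None
    for t in tokens:
        if t != prev and t != slack:
            out.append(t)
        prev = t
    return out

path = ["he", "he", "<slack>", "ll", "ll", "<slack>", "ll", "o"]
print(collapse_ctc(path))  # -> ["he", "ll", "ll", "o"]
```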
[NLP-28] Benchmarking Machine Translation on Chinese Social Media Texts
【速读】: 该论文旨在解决当前机器翻译(Machine Translation, MT)在处理中文社交媒体文本时面临的两大核心问题:一是高质量平行语料数据稀缺,因需具备平台特定俚语和风格线索双语知识的标注者;二是传统评估指标(如COMET)难以捕捉非标准表达和风格保真度。解决方案的关键在于构建一个名为CSM-MTBench的基准测试集,涵盖五种中-外语言方向,并包含两个专家精调子集:Fun Posts(侧重富含俚语与新词的上下文丰富内容)和Social Snippets(强调情绪驱动与风格浓缩的短文本)。针对不同子集设计定制化评估方法——对Fun Posts量化俚语与新词的翻译成功率,对Social Snippets则通过嵌入式指标与大语言模型作为裁判(LLM-as-a-judge)联合评估语气与风格保留程度,从而系统性地揭示现有MT模型在语义保真度与社交语境风格适配上的显著差异。
链接: https://arxiv.org/abs/2601.22931
作者: Kaiyan Zhao,Zheyong Xie,Zhongtao Miao,Xinze Lyu,Yao Hu,Shaosheng Cao
机构: The University of Tokyo (东京大学); Xiaohongshu Inc. (小红书)
类目: Computation and Language (cs.CL)
备注: Work in Progress
Abstract:The prevalence of rapidly evolving slang, neologisms, and highly stylized expressions in informal user-generated text, particularly on Chinese social media, poses significant challenges for Machine Translation (MT) benchmarking. Specifically, we identify two primary obstacles: (1) data scarcity, as high-quality parallel data requires bilingual annotators familiar with platform-specific slang, and stylistic cues in both languages; and (2) metric limitations, where traditional evaluators like COMET often fail to capture stylistic fidelity and nonstandard expressions. To bridge these gaps, we introduce CSM-MTBench, a benchmark covering five Chinese-foreign language directions and consisting of two expert-curated subsets: Fun Posts, featuring context-rich, slang- and neologism-heavy content, and Social Snippets, emphasizing concise, emotion- and style-driven expressions. Furthermore, we propose tailored evaluation approaches for each subset: measuring the translation success rate of slang and neologisms in Fun Posts, while assessing tone and style preservation in Social Snippets via a hybrid of embedding-based metrics and LLM-as-a-judge. Experiments on over 20 models reveal substantial variation in how current MT systems handle semantic fidelity and informal, social-media-specific stylistic cues. CSM-MTBench thus serves as a rigorous testbed for advancing MT systems capable of mastering real-world Chinese social media texts.
zh
[NLP-29] Semantic Leakage from Image Embeddings
【速读】: 该论文旨在解决图像嵌入(image embeddings)中存在的语义泄露(semantic leakage)问题,即尽管图像嵌入通常被认为隐私风险较低,但其在对齐过程中保留的局部语义邻域结构可能被用于恢复原始图像的语义信息。解决方案的关键在于提出SLImE框架,该框架通过一个本地训练的语义检索器结合现成模型,在无需任务特定解码器的情况下,从独立压缩的图像嵌入中推断出标签、符号表示及语法连贯的描述,从而揭示嵌入中的隐含语义信息。核心创新点在于证明:即使不精确重建图像,仅保持嵌入对齐下的局部语义邻近性就足以引发语义泄露,这暴露了当前主流嵌入模型(如GEMINI、COHERE、NOMIC和CLIP)在隐私保护方面的根本脆弱性。
链接: https://arxiv.org/abs/2601.22929
作者: Yiyi Chen,Qiongkai Xu,Desmond Eliott,Qiongxiu Li,Johannes Bjerva
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: 20 pages, 19 figures
Abstract:Image embeddings are generally assumed to pose limited privacy risk. We challenge this assumption by formalizing semantic leakage as the ability to recover semantic structures from compressed image embeddings. Surprisingly, we show that semantic leakage does not require exact reconstruction of the original image. Preserving local semantic neighborhoods under embedding alignment is sufficient to expose the intrinsic vulnerability of image embeddings. Crucially, this preserved neighborhood structure allows semantic information to propagate through a sequence of lossy mappings. Based on this conjecture, we propose Semantic Leakage from Image Embeddings (SLImE), a lightweight inference framework that reveals semantic information from standalone compressed image embeddings, incorporating a locally trained semantic retriever with off-the-shelf models, without training task-specific decoders. We thoroughly validate each step of the framework empirically, from aligned embeddings to retrieved tags, symbolic representations, and grammatical and coherent descriptions. We evaluate SLImE across a range of open and closed embedding models, including GEMINI, COHERE, NOMIC, and CLIP, and demonstrate consistent recovery of semantic information across diverse inference tasks. Our results reveal a fundamental vulnerability in image embeddings, whereby the preservation of semantic neighborhoods under alignment enables semantic leakage, highlighting challenges for privacy preservation.
zh
[NLP-30] LLMs Explaint: A Post-Mortem on Semantic Interpretability in Transformer Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中语言抽象(linguistic abstraction)的形成机制不明确的问题,特别是试图在不同模块(如注意力头和输入嵌入)中检测这种抽象的体现。其关键解决方案是采用两种已被广泛接受的可解释性方法:(1) 通过探测token级别的关系结构来识别抽象;(2) 利用嵌入作为载体进行特征映射以推断人类可理解的属性。然而研究发现,这两种方法均因方法论缺陷而失败——前者因后期层表示与token对应关系假设失效,后者则因高预测性能由数据结构和方法学伪影驱动而非真实语义知识。这一结果揭示了当前主流解释方法在LLM理解能力推断中的不可靠性,对在普适计算和分布式系统中依赖可解释性进行调试、压缩和模型解释具有重要警示意义。
链接: https://arxiv.org/abs/2601.22928
作者: Alhassan Abdelhalim,Janick Edinger,Sören Laue,Michaela Regneri
机构: University of Hamburg (汉堡大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) are becoming increasingly popular in pervasive computing due to their versatility and strong performance. However, despite their ubiquitous use, the exact mechanisms underlying their outstanding performance remain unclear. Different methods for LLM explainability exist, and many are, as a method, not fully understood themselves. We started with the question of how linguistic abstraction emerges in LLMs, aiming to detect it across different LLM modules (attention heads and input embeddings). For this, we used methods well-established in the literature: (1) probing for token-level relational structures, and (2) feature-mapping using embeddings as carriers of human-interpretable properties. Both attempts failed for different methodological reasons: Attention-based explanations collapsed once we tested the core assumption that later-layer representations still correspond to tokens. Property-inference methods applied to embeddings also failed because their high predictive scores were driven by methodological artifacts and dataset structure rather than meaningful semantic knowledge. These failures matter because both techniques are widely treated as evidence for what LLMs supposedly understand, yet our results show such conclusions are unwarranted. These limitations are particularly relevant in pervasive and distributed computing settings where LLMs are deployed as system components and interpretability methods are relied upon for debugging, compression, and explaining models.
zh
[NLP-31] DiffuSpeech: Silent Thought Spoken Answer via Unified Speech-Text Diffusion
【速读】: 该论文旨在解决当前语音语言模型(Speech Language Models, SLMs)在生成响应时缺乏显式推理过程,导致错误无法纠正的问题。其核心解决方案是提出“静默思考,口头回答”(Silent Thought, Spoken Answer)范式,即语音大模型在输出语音的同时生成内部文本推理轨迹(reasoning traces),并通过这些推理轨迹优化语音质量与语义准确性。关键技术在于构建首个基于扩散模型(diffusion-based)的语音-文本联合语言模型 DiffuSpeech,该模型在统一的掩码扩散框架下联合建模离散文本和分词语音,采用模态特定的掩码调度策略实现推理轨迹与语音标记的迭代去噪生成。实验表明,该方法在语音问答任务中达到SOTA性能(相比最优基线提升高达9个百分点),同时保持高保真的文本到语音(TTS)质量(6.2% WER)和语言理解能力(66.2% MMLU)。
链接: https://arxiv.org/abs/2601.22889
作者: Yuxuan Lou,Ziming Wu,Yaochen Wang,Yong Liu,Yingxuan Ren,Fuming Lai,Shaobing Lian,Jie Tang,Yang You
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
备注:
Abstract:Current speech language models generate responses directly without explicit reasoning, leading to errors that cannot be corrected once audio is produced. We introduce “Silent Thought, Spoken Answer” – a paradigm where speech LLMs generate internal text reasoning alongside spoken responses, with thinking traces informing speech quality. To realize this, we present DiffuSpeech, the first diffusion-based speech-text language model supporting both understanding and generation, unifying discrete text and tokenized speech under a single masked diffusion framework. Unlike autoregressive approaches, DiffuSpeech jointly generates reasoning traces and speech tokens through iterative denoising, with modality-specific masking schedules. We also construct the first speech QA dataset with paired text reasoning traces, containing 26K samples totaling 319 hours. Experiments show DiffuSpeech achieves state-of-the-art speech-to-speech QA accuracy, outperforming the best baseline by up to 9 points, while attaining the best TTS quality among generative models (6.2% WER) and preserving language understanding (66.2% MMLU). Ablations confirm that both the diffusion architecture and thinking traces contribute to these gains.
zh
[NLP-32] Should LLMs “like” Generate How Users Talk? Building Dialect-Accurate Dialog[ue]s Beyond the American Default with MDial
【速读】: 该论文旨在解决大型语言模型(LLM)在处理非标准英语方言时表现不佳的问题,尤其是针对全球超过80%的英语使用者并非以标准美式英语(Standard American English, SAE)为母语,却在与LLM交互中遭遇更高失败率和刻板回应的现象。其核心挑战在于当前多方言性能研究严重不足,且缺乏高质量、结构化、覆盖词汇、拼写与语法特征的多方言对话数据集。解决方案的关键在于提出首个大规模多方言对话生成框架MDial,该框架系统性地涵盖九种英语方言的三大书面方言维度(lexical, orthographic, morphosyntactic),并联合母语语言学家设计了一套可扩展的规则驱动LLM转换机制,确保生成内容的准确性与自然度;尤为关键的是,研究发现模型无需复现用户全部语法特征(最多90%的语法特征应被忽略),从而颠覆了传统“模仿用户句法”的假设,显著提升跨方言交互的公平性与有效性。
链接: https://arxiv.org/abs/2601.22888
作者: Jio Oh,Paul Vicinanza,Thomas Butler,Steven Euijong Whang,Dezhi Hong,Amani Namboori
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:More than 80% of the 1.6 billion English speakers do not use Standard American English (SAE) and experience higher failure rates and stereotyped responses when interacting with LLMs as a result. Yet multi-dialectal performance remains underexplored. We introduce MDial, the first large-scale framework for generating multi-dialectal conversational data encompassing the three pillars of written dialect – lexical (vocabulary), orthographic (spelling), and morphosyntactic (grammar) features – for nine English dialects. Partnering with native linguists, we design an annotated and scalable rule-based LLM transformation to ensure precision. Our approach challenges the assumption that models should mirror users’ morphosyntactic features, showing that up to 90% of the grammatical features of a dialect should not be reproduced by models. Independent evaluations confirm data quality, with annotators preferring MDial outputs over prior methods in 98% of pairwise comparisons for dialect naturalness. Using this pipeline, we construct the dialect-parallel MDialBench benchmark with 50k+ dialogs, resulting in 97k+ QA pairs, and evaluate 17 LLMs on dialect identification and response generation tasks. Even frontier models achieve under 70% accuracy, fail to reach 50% for Canadian English, and systematically misclassify non-SAE dialects as American or British. As dialect identification underpins natural language understanding, these errors risk cascading failures into downstream tasks.
zh
[NLP-33] MoVE: Mixture of Value Embeddings – A New Axis for Scaling Parametric Memory in Autoregressive Models
【速读】: 该论文旨在解决自回归序列建模中模型容量与计算成本之间的刚性耦合问题,即传统方法通过加深或加宽网络来扩展参数记忆(parametric memory)时,会带来成比例增长的活跃浮点运算量(active FLOPs)。其解决方案的关键在于提出MoVE(Mixture of Value Embeddings)机制,该机制通过引入一个全局共享的可学习值嵌入库(value embeddings bank),将模型的记忆能力与计算资源解耦:在每个序列步骤中,利用可微分软门控机制从该库中动态混合检索到的概念进入标准值投影,从而实现参数记忆的独立扩展——只需增加嵌入槽位数量即可提升容量,而无需增加网络深度或计算量。
链接: https://arxiv.org/abs/2601.22887
作者: Yangyan Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Autoregressive sequence modeling stands as the cornerstone of modern Generative AI, powering results across diverse modalities ranging from text generation to image generation. However, a fundamental limitation of this paradigm is the rigid structural coupling of model capacity to computational cost: expanding a model’s parametric memory – its repository of factual knowledge or visual patterns – traditionally requires deepening or widening the network, which incurs a proportional rise in active FLOPs. In this work, we introduce MoVE (Mixture of Value Embeddings), a mechanism that breaks this coupling and establishes a new axis for scaling capacity. MoVE decouples memory from compute by introducing a global bank of learnable value embeddings shared across all attention layers. For every step in the sequence, the model employs a differentiable soft gating mechanism to dynamically mix retrieved concepts from this bank into the standard value projection. This architecture allows parametric memory to be scaled independently of network depth by simply increasing the number of embedding slots. We validate MoVE through strictly controlled experiments on two representative applications of autoregressive modeling: Text Generation and Image Generation. In both domains, MoVE yields consistent performance improvements over standard and layer-wise memory baselines, enabling the construction of “memory-dense” models that achieve lower perplexity and higher fidelity than their dense counterparts at comparable compute budgets.
zh
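上文 MoVE 的核心机制是:由一个全局共享的 value embedding 库,经可微软门控按位置混入标准 value 投影。下面用 NumPy 给出一个极简草图(示意性实现,非论文官方代码;矩阵形状、门控投影 `W_g` 等命名均为假设):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def move_value(h, W_v, bank, W_g):
    """h:    (T, d)  当前层的隐状态序列
    W_v:  (d, d)  标准 value 投影
    bank: (M, d)  全局共享的可学习 value embedding 库(M 个槽位)
    W_g:  (d, M)  门控投影, 为每个位置生成对 M 个槽位的软权重
    返回混合后的 value: V = h @ W_v + softmax(h @ W_g) @ bank"""
    V_std = h @ W_v                     # 标准 value
    gate = softmax(h @ W_g, axis=-1)    # (T, M) 可微软门控
    V_mem = gate @ bank                 # 从记忆库中按权重检索概念
    return V_std + V_mem

rng = np.random.default_rng(0)
T, d, M = 4, 8, 16
h = rng.normal(size=(T, d))
out = move_value(h, rng.normal(size=(d, d)),
                 rng.normal(size=(M, d)), rng.normal(size=(d, M)))
print(out.shape)   # (4, 8)
```

由于记忆容量只取决于槽位数 M,扩容时仅需增大 bank 的行数,网络深度与每步计算图保持不变,这正是"记忆与计算解耦"的含义。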
[NLP-34] Leveraging LLMs For Turkish Skill Extraction
【速读】: 该论文旨在解决土耳其语(Turkish)这一形态学复杂的低资源语言在技能提取(skill extraction)任务中缺乏标准化技能词典和标注数据集的问题,从而限制了其在招聘系统、个性化推荐及劳动力市场分析中的应用。解决方案的关键在于构建首个土耳其语技能提取数据集(包含4,819个标注技能片段),并采用大语言模型(LLM)驱动的端到端流程:通过动态少样本提示(dynamic few-shot prompting)、基于嵌入的检索(embedding-based retrieval)与LLM重排序(LLM-based reranking)相结合的方式进行技能识别与链接,最终在ESCO标准技能体系下实现0.56的端到端性能,显著优于传统监督序列标注方法,验证了LLM在低资源场景下提升技能提取效果的有效性。
链接: https://arxiv.org/abs/2601.22885
作者: Ezgi Arslan İltüzer,Özgür Anıl Özlü,Vahid Farajijobehdar,Gülşen Eryiğit
机构: Kariyer.net, R&D Center (Kariyer.net 研发中心); Istanbul Technical University Artificial Intelligence and Data Engineering Department (伊斯坦布尔技术大学人工智能与数据工程系)
类目: Computation and Language (cs.CL)
备注:
Abstract:Skill extraction is a critical component of modern recruitment systems, enabling efficient job matching, personalized recommendations, and labor market analysis. Despite Türkiye’s significant role in the global workforce, Turkish, a morphologically complex language, lacks both a skill taxonomy and a dedicated skill extraction dataset, resulting in underexplored research in skill extraction for Turkish. This article seeks the answers to three research questions: 1) How can skill extraction be effectively performed for this language, in light of its low-resource nature? 2) What is the most promising model? 3) What is the impact of different Large Language Models (LLMs) and prompting strategies on skill extraction (i.e., dynamic vs. static few-shot samples, varying context information, and encouraging causal reasoning)? The article introduces the first Turkish skill extraction dataset and performance evaluations of automated skill extraction using LLMs. The manually annotated dataset contains 4,819 labeled skill spans from 327 job postings across different occupation areas. The use of LLM outperforms supervised sequence labeling when used in an end-to-end pipeline, aligning extracted spans with standardized skills in the ESCO taxonomy more effectively. The best-performing configuration, utilizing Claude Sonnet 3.7 with dynamic few-shot prompting for skill identification, embedding-based retrieval, and LLM-based reranking for skill linking, achieves an end-to-end performance of 0.56, positioning Turkish alongside similar studies in other languages, which are few in the literature. Our findings suggest that LLMs can improve skill extraction performance in low-resource settings, and we hope that our work will accelerate similar research on skill extraction for underrepresented languages.
zh
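该流程中"基于嵌入的检索"一步,可以用如下玩具示例说明其思路(词袋向量与余弦相似度仅作占位,实际系统使用神经嵌入模型,并在检索结果之上再做 LLM 重排序):

```python
import numpy as np

def embed(text, vocab):
    """玩具词袋向量, 仅作占位; 实际系统使用神经嵌入模型"""
    v = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:
            v[vocab[w]] += 1.0
    return v

def retrieve_skills(span, skills, k=2):
    """将职位描述中的技能片段与标准技能库(如 ESCO)按余弦相似度对齐"""
    vocab = {w: i for i, w in enumerate(
        sorted({w for s in skills + [span] for w in s.lower().split()}))}
    q = embed(span, vocab)
    scored = []
    for s in skills:
        e = embed(s, vocab)
        denom = np.linalg.norm(q) * np.linalg.norm(e)
        scored.append((s, float(q @ e / denom) if denom else 0.0))
    return sorted(scored, key=lambda x: -x[1])[:k]

esco = ["python programming", "machine learning", "project management"]
top = retrieve_skills("experience in python and programming", esco)
print(top[0][0])   # python programming
```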
[NLP-35] From Labels to Facets: Building a Taxonomically Enriched Turkish Learner Corpus
【速读】: 该论文旨在解决当前学习者语料库(Learner Corpus)在标注结构上的局限性问题,即大多数语料库采用扁平化的标签体系,无法显式区分多种语言维度,导致语言学深度标注困难,并阻碍了对学习者特定错误成因和机制的细粒度分析。解决方案的关键在于提出了一种基于新兴分面分类法(faceted taxonomy)的半自动化标注方法,通过构建一个理论基础扎实、多维的语言属性分类体系,实现对错误实例的标准化、细粒度且可解释的扩展标注;同时开发了一个新型标注扩展框架(annotation extension framework),以土耳其语为例,在已有扁平标注基础上自动推断并添加额外的语言学与元数据信息作为分类维度,从而提供更丰富的学习者特定上下文。该方法在评估中达到了95.86%的分面级准确率,显著提升了语料库的查询能力和跨维度探索分析的可能性。
链接: https://arxiv.org/abs/2601.22875
作者: Elif Sayar,Tolgahan Türker,Anna Golynskaia Knezhevich,Bihter Dereli,Ayşe Demirhas,Lionel Nicolas,Gülşen Eryiğit
机构: ITU (伊斯坦布尔技术大学); Bilgi University (比尔吉大学); YEE (土耳其青年与教育协会); Eurac Research (欧洲阿尔卑斯研究中心)
类目: Computation and Language (cs.CL)
备注:
Abstract:In terms of annotation structure, most learner corpora rely on holistic flat label inventories which, even when extensive, do not explicitly separate multiple linguistic dimensions. This makes linguistically deep annotation difficult and complicates fine-grained analyses aimed at understanding why and how learners produce specific errors. To address these limitations, this paper presents a semi-automated annotation methodology for learner corpora, built upon a recently proposed faceted taxonomy, and implemented through a novel annotation extension framework. The taxonomy provides a theoretically grounded, multi-dimensional categorization that captures the linguistic properties underlying each error instance, thereby enabling standardized, fine-grained, and interpretable enrichment beyond flat annotations. The annotation extension tool, implemented based on the proposed extension framework for Turkish, automatically extends existing flat annotations by inferring additional linguistic and metadata information as facets within the taxonomy to provide richer learner-specific context. It was systematically evaluated and yielded promising performance results, achieving a facet-level accuracy of 95.86%. The resulting taxonomically enriched corpus offers enhanced querying capabilities and supports detailed exploratory analyses across learner corpora, enabling researchers to investigate error patterns through complex linguistic and pedagogical dimensions. This work introduces the first collaboratively annotated and taxonomically enriched Turkish Learner Corpus, a manual annotation guideline with a refined tagset, and an annotation extender. As the first corpus designed in accordance with the recently introduced taxonomy, we expect our study to pave the way for subsequent enrichment efforts of existing error-annotated learner corpora.
zh
[NLP-36] Eroding the Truth-Default: A Causal Analysis of Human Susceptibility to Foundation Model Hallucinations and Disinformation in the Wild
【速读】: 该论文旨在解决生成式 AI(Generative AI)内容日益逼近人类水平时,如何有效区分合成文本与真实内容这一可信网络智能(Trustworthy Web Intelligence)的核心挑战。其解决方案的关键在于提出 JudgeGPT 与 RogueGPT 的双轴框架,通过解耦“真实性”(authenticity)与“归属性”(attribution)两个维度,系统分析人类对合成内容的辨识机制;并借助结构因果模型(Structural Causal Models, SCMs)验证可检验的因果假设,发现政治倾向对检测准确性影响微弱(r = -0.10),而“假新闻熟悉度”是关键中介变量(r = 0.35),提示暴露经验可作为对抗训练提升人类辨别力。研究进一步揭示 GPT-4 输出因高流畅性陷入“流畅陷阱”(fluency trap),绕过源监控机制(Source Monitoring),从而难以被识别,因此建议干预策略应聚焦于认知层面的源监控能力培养,而非基于人口统计学的分群策略。
链接: https://arxiv.org/abs/2601.22871
作者: Alexander Loth,Martin Kappes,Marc-Oliver Pahl
机构: Frankfurt University of Applied Sciences (法兰克福应用技术大学); IMT Atlantique (IMT大西洋)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: Accepted at ACM TheWebConf '26 Companion
Abstract:As foundation models (FMs) approach human-level fluency, distinguishing synthetic from organic content has become a key challenge for Trustworthy Web Intelligence. This paper presents JudgeGPT and RogueGPT, a dual-axis framework that decouples “authenticity” from “attribution” to investigate the mechanisms of human susceptibility. Analyzing 918 evaluations across five FMs (including GPT-4 and Llama-2), we employ Structural Causal Models (SCMs) as a principal framework for formulating testable causal hypotheses about detection accuracy. Contrary to partisan narratives, we find that political orientation shows a negligible association with detection performance ( r=-0.10 ). Instead, “fake news familiarity” emerges as a candidate mediator ( r=0.35 ), suggesting that exposure may function as adversarial training for human discriminators. We identify a “fluency trap” where GPT-4 outputs (HumanMachineScore: 0.20) bypass Source Monitoring mechanisms, rendering them indistinguishable from human text. These findings suggest that “pre-bunking” interventions should target cognitive source monitoring rather than demographic segmentation to ensure trustworthy information ecosystems.
zh
[NLP-37] When Meanings Meet: Investigating the Emergence and Quality of Shared Concept Spaces during Multilingual Language Model Training EACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在预训练过程中多语言概念空间(language-agnostic concept spaces)如何形成及其对跨语言迁移能力影响的问题,尤其是在单语资源稀缺场景下提升多语言覆盖的重要性日益凸显的背景下。其解决方案的关键在于采用因果可解释性方法——激活修补(activation patching),通过隔离跨语言概念表示并将其注入翻译提示中,系统性地分析这些概念空间在训练中的演化过程以及其对翻译行为的实际影响。研究发现,共享概念空间虽早期即出现并持续优化,但其对齐程度具有语言依赖性;且部分看似翻译质量提升的现象实为模型行为转变(如多义词义项选择或跨语言同形异义词的翻译而非复制),而非真正翻译能力的增强,从而揭示了因果可解释性方法在多语言语境下的深层价值与局限。
链接: https://arxiv.org/abs/2601.22851
作者: Felicia Körner,Max Müller-Eberstein,Anna Korhonen,Barbara Plank
机构: MaiNLP, Center for Information and Language Processing, LMU Munich, Germany; Munich Center for Machine Learning (MCML), Munich, Germany; University of Tokyo, Japan; IT University of Copenhagen, Denmark; Language Technology Lab, University of Cambridge, United Kingdom
类目: Computation and Language (cs.CL)
备注: Accepted to EACL 2026 Main Conference
Abstract:Training Large Language Models (LLMs) with high multilingual coverage is becoming increasingly important – especially when monolingual resources are scarce. Recent studies have found that LLMs process multilingual inputs in shared concept spaces, thought to support generalization and cross-lingual transfer. However, these prior studies often do not use causal methods, lack deeper error analysis or focus on the final model only, leaving open how these spaces emerge during training. We investigate the development of language-agnostic concept spaces during pretraining of EuroLLM through the causal interpretability method of activation patching. We isolate cross-lingual concept representations, then inject them into a translation prompt to investigate how consistently translations can be altered, independently of the language. We find that shared concept spaces emerge early and continue to refine, but that alignment with them is language-dependent. Furthermore, in contrast to prior work, our fine-grained manual analysis reveals that some apparent gains in translation quality reflect shifts in behavior – like selecting senses for polysemous words or translating instead of copying cross-lingual homographs – rather than improved translation ability. Our findings offer new insight into the training dynamics of cross-lingual alignment and the conditions under which causal interpretability methods offer meaningful insights in multilingual contexts.
zh
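激活修补(activation patching)的核心操作是:缓存源输入在某一层的激活,并在目标输入前向传播到同一位置时将其注入,观察输出如何随之改变。下面用一个两层玩具网络示意这一因果干预(纯属示意,与 EuroLLM 的实验设置无关):

```python
import numpy as np

def forward(tokens, W1, W2, patch=None):
    """两层玩具网络; patch=(layer, 激活) 表示在该层将隐状态
    替换为来自另一次前向传播的激活, 即 activation patching"""
    h = tokens @ W1
    if patch is not None and patch[0] == 1:
        h = patch[1]                  # 注入源提示的概念表示
    return h @ W2

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 3))
x_src = rng.normal(size=(1, 4))   # "源"提示(例如某语言中的一个概念)
x_tgt = rng.normal(size=(1, 4))   # "目标"翻译提示
h_src = x_src @ W1                # 缓存源提示的中间激活
out_patched = forward(x_tgt, W1, W2, patch=(1, h_src))
out_src = forward(x_src, W1, W2)
assert np.allclose(out_patched, out_src)  # 输出跟随被注入的概念而变
```

在真实模型中只替换部分位置或部分维度的激活,因此输出不会像此玩具例子一样被完全决定,但"注入激活、观察翻译是否随之改变"的逻辑是一致的。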
[NLP-38] SOMBRERO: Measuring and Steering Boundary Placement in End-to-End Hierarchical Sequence Models
【速读】: 该论文旨在解决层次化序列模型中边界定位不精准的问题,即如何在不依赖显式标注的情况下,自动学习到能有效压缩长字节序列并合理分配计算资源的分段边界。现有方法虽可通过语言建模目标学习有意义的边界,但难以定量评估边界质量并系统性地引导计算资源投向预测难度高的位置。解决方案的关键在于提出一个与路由器无关的边界质量度量指标——边界丰富度(Boundary Enrichment, B),该指标衡量分段起点是否集中在高下一字节困惑度(next-byte surprisal)的位置;并基于此设计了Sombrero方法,通过置信度对齐边界损失(confidence-alignment boundary loss)引导边界向预测困难区域迁移,并在输入层面采用置信度加权平滑(confidence-weighted smoothing)而非对已生成块进行平滑,从而稳定边界学习过程。实验表明,该方法在10亿规模的UTF-8语料(涵盖英文、德文文本及代码和数学内容)上显著提升了准确率与效率的权衡表现。
链接: https://arxiv.org/abs/2601.22805
作者: Pit Neitemeier,Alessio Serra,Jiaze Li,Sascha Wirges,Lukas Balles,Jan Hendrik Metzen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Hierarchical sequence models replace fixed tokenization with learned segmentations that compress long byte sequences for efficient autoregressive modeling. While recent end-to-end methods can learn meaningful boundaries from the language-modeling objective alone, it remains difficult to quantitatively assess and systematically steer where compute is spent. We introduce a router-agnostic metric of boundary quality, boundary enrichment B, which measures how strongly chunk starts concentrate on positions with high next-byte surprisal. Guided by this metric, we propose Sombrero, which steers boundary placement toward predictive difficulty via a confidence-alignment boundary loss and stabilizes boundary learning by applying confidence-weighted smoothing at the input level rather than on realized chunks. On 1B scale, across UTF-8 corpora covering English and German text as well as code and mathematical content, Sombrero improves the accuracy-efficiency trade-off and yields boundaries that more consistently align compute with hard-to-predict positions.
zh
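边界丰富度 B 度量的是 chunk 起点在高 next-byte surprisal 位置上的集中程度。下面的定义("边界处平均 surprisal 与全局平均 surprisal 之比")仅为便于理解的示意,论文中 B 的精确形式以原文为准:

```python
import numpy as np

def boundary_enrichment(surprisal, boundaries):
    """surprisal:  每个位置的 next-byte surprisal
    boundaries: chunk 起始位置的下标
    示意性定义: 边界处平均 surprisal / 全局平均 surprisal,
    比值大于 1 表示边界集中在难预测的位置"""
    s = np.asarray(surprisal, dtype=float)
    return s[np.asarray(boundaries)].mean() / s.mean()

s = [0.1, 0.2, 3.0, 0.1, 2.5, 0.2, 0.1, 2.8]
print(boundary_enrichment(s, [2, 4, 7]))  # ≈2.46, 边界落在高 surprisal 处
```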
[NLP-39] Sparse or Dense? A Mechanistic Estimation of Computation Density in Transformer-based LLM s
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中计算资源分配不均的问题,即是否存在某些参数在推理过程中被频繁使用而另一些则几乎闲置,从而影响模型效率。传统观点认为LLM的计算是稀疏的,但本文通过提出一种基于机制可解释性的计算密度估计器(computation density estimator),系统性地量化了LLM内部的计算分布情况。其关键在于利用机制可解释性方法构建一个能够捕捉输入依赖性计算密度的估算工具,并通过实验发现:LLM的计算通常并非稀疏而是密集的,且密度随输入动态变化,同时不同模型对相同输入表现出高度一致的密度模式。这一发现挑战了LLM作为符号处理系统的传统理解,并为优化模型效率提供了新视角。
链接: https://arxiv.org/abs/2601.22795
作者: Corentin Kervadec,Iuliia Lysova,Marco Baroni,Gemma Boleda
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Transformer-based large language models (LLMs) are comprised of billions of parameters arranged in deep and wide computational graphs. Several studies on LLM efficiency optimization argue that it is possible to prune a significant portion of the parameters, while only marginally impacting performance. This suggests that the computation is not uniformly distributed across the parameters. We introduce here a technique to systematically quantify computation density in LLMs. In particular, we design a density estimator drawing on mechanistic interpretability. We experimentally test our estimator and find that: (1) contrary to what has been often assumed, LLM processing generally involves dense computation; (2) computation density is dynamic, in the sense that models shift between sparse and dense processing regimes depending on the input; (3) per-input density is significantly correlated across LLMs, suggesting that the same inputs trigger either low or high density. Investigating the factors influencing density, we observe that predicting rarer tokens requires higher density, and increasing context length often decreases the density. We believe that our computation density estimator will contribute to a better understanding of the processing at work in LLMs, challenging their symbolic interpretation.
zh
[NLP-40] RASST: Fast Cross-modal Retrieval-Augmented Simultaneous Speech Translation
【速读】: 该论文旨在解决同时语音翻译(Simultaneous Speech Translation, SST)中罕见和领域特定术语翻译准确性不足的问题。当前基于语音大语言模型(Speech LLMs)的SST系统在整体翻译质量上已有显著提升,但在术语翻译方面仍存在局限。解决方案的关键在于提出一种检索增强的SST框架——RASST,其核心是将跨模态(speech-to-text)检索机制紧密集成到SST流水线中:通过训练轻量级语音-文本检索器并采用高效滑动窗口检索策略,为Speech LLM提供逐块的术语提示;同时构建合成训练数据以指导模型精准利用检索到的术语,从而在保持实时性的同时显著提升术语翻译准确性和整体翻译质量。
链接: https://arxiv.org/abs/2601.22777
作者: Jiaxuan Luo,Siqi Ouyang,Lei Li
机构: Johns Hopkins University (约翰霍普金斯大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Simultaneous speech translation (SST) produces target text incrementally from partial speech input. Recent speech large language models (Speech LLMs) have substantially improved SST quality, yet they still struggle to correctly translate rare and domain-specific terminology. While retrieval augmentation has been effective for terminology translation in machine translation, bringing retrieval to SST is non-trivial: it requires fast and accurate cross-modal (speech-to-text) retrieval under partial, continually arriving input, and the model must decide whether and when to apply retrieved terms during incremental generation. We propose Retrieval-Augmented Simultaneous Speech Translation (RASST), which tightly integrates cross-modal retrieval into the SST pipeline. RASST trains a lightweight speech-text retriever and performs efficient sliding-window retrieval, providing chunkwise terminology hints to the Speech LLM. We further synthesize training data that teaches the Speech LLM to leverage retrieved terms precisely. Experiments on three language directions of the ACL 60/60 dev set show that RASST improves terminology translation accuracy by up to 16% and increases overall translation quality by up to 3 BLEU points, with ablations confirming the contribution of each component.
zh
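滑动窗口检索的思路可以用如下草图说明:仅在最近若干个语音块的嵌入上与术语库做相似度匹配,超过阈值的术语作为提示交给 Speech LLM。窗口大小、阈值与嵌入均为假设,实际系统使用论文中训练的轻量级语音-文本检索器:

```python
import numpy as np

def sliding_window_retrieve(chunk_embs, term_embs, terms,
                            window=3, thresh=0.8):
    """chunk_embs: 到目前为止各语音块的嵌入(列表)
    仅在最近 window 个块上做检索, 返回相似度超过阈值的术语提示"""
    win = np.asarray(chunk_embs[-window:])
    win = win / np.linalg.norm(win, axis=1, keepdims=True)
    T = np.asarray(term_embs)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    sims = win @ T.T                  # (window, n_terms) 余弦相似度
    best = sims.max(axis=0)           # 每个术语在窗口内的最高分
    return [t for t, s in zip(terms, best) if s >= thresh]

rng = np.random.default_rng(2)
d = 16
term_vecs = rng.normal(size=(2, d))   # 术语库嵌入(假设已训练好)
chunks = [rng.normal(size=d) for _ in range(4)]
chunks.append(term_vecs[1] + 0.01 * rng.normal(size=d))  # 新块提到术语 1
hits = sliding_window_retrieve(chunks, term_vecs, ["BLEU", "Transformer"])
print(hits)
```

由于每一步只需对窗口内的少量块做一次矩阵乘法,检索开销与已处理的语音总长无关,适合增量到达的部分输入。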
[NLP-41] AR-BENCH: Benchmarking Legal Reasoning with Judgment Error Detection Classification and Correction
【速读】: 该论文旨在解决法律判决中因案件复杂性和法律概念抽象性导致的错误问题,以及现有上诉审查机制在案件数量激增背景下面临的效率压力。传统法律人工智能研究主要聚焦于判决预测和法律文书生成任务,而本文提出了一项新任务——上诉审查(Appellate Review),其核心在于对已发布的判决进行错误检测、分类与修正,本质上属于异常检测而非预测或生成任务。解决方案的关键在于构建了一个新的基准数据集 AR-BENCH,包含8,700份精细标注的判决文书和34,617份补充语料,并通过评估14个大语言模型揭示了现有模型在识别法律适用错误方面的显著局限性,为未来模型改进提供了实证依据。
链接: https://arxiv.org/abs/2601.22742
作者: Yifei Li,Richong Zhang,Wanyu Tu,Zhijie Nie,Haokun Luo,Chuantao Yin,Pengchong Li
机构: Beihang University (北京航空航天大学); University of Science and Technology Beijing (北京科技大学); Sino-French Engineer School (中法工程师学院); People’s Procuratorate of Beijing Municipality (北京市人民检察院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Legal judgments may contain errors due to the complexity of case circumstances and the abstract nature of legal concepts, while existing appellate review mechanisms face efficiency pressures from a surge in case volumes. Although current legal AI research focuses on tasks like judgment prediction and legal document generation, the task of judgment review differs fundamentally in its objectives and paradigm: it centers on detecting, classifying, and correcting errors after a judgment is issued, constituting anomaly detection rather than prediction or generation. To address this research gap, we introduce a novel task APPELLATE REVIEW, aiming to assess models’ diagnostic reasoning and reliability in legal practice. We also construct a novel dataset benchmark AR-BENCH, which comprises 8,700 finely annotated decisions and 34,617 supplementary corpora. By evaluating 14 large language models, we reveal critical limitations in existing models’ ability to identify legal application errors, providing empirical evidence for future improvements.
zh
[NLP-42] MM-THEBench: Do Reasoning MLLMs Think Reasonably?
【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在推理过程中产生的幻觉问题,尤其是针对基于思维链(Chain-of-Thought, CoT)的后训练推理模型中,中间推理步骤所引入的幻觉缺乏系统评估的问题。现有基准测试主要聚焦于非推理模型,忽视了模型内部的思考过程,无法准确衡量推理阶段出现的幻觉现象。为此,作者提出MM-THEBench,其核心创新在于构建了一个细粒度的认知维度分类体系、带有验证推理标注的多样化数据集,以及一个多层级自动化评估框架,从而能够系统性地量化和分析推理过程中幻觉的发生机制及其对多模态任务中推理能力的影响。
链接: https://arxiv.org/abs/2601.22735
作者: Zhidian Huang,Zijun Yao,Ji Qi,Shangqing Tu,Junxian Ma,Jinxin Liu,Weichuan Liu,Xiaoyin Che,Lei Hou,Juanzi Li
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent advances in multimodal large language models (MLLMs) mark a shift from non-thinking models to post-trained reasoning models capable of solving complex problems through thinking. However, whether such thinking mitigates hallucinations in multimodal perception and reasoning remains unclear. Self-reflective reasoning enhances robustness but introduces additional hallucinations, and subtle perceptual errors still result in incorrect or coincidentally correct answers. Existing benchmarks primarily focus on models before the emergence of reasoning MLLMs, neglecting the internal thinking process and failing to measure the hallucinations that occur during thinking. To address these challenges, we introduce MM-THEBench, a comprehensive benchmark for assessing hallucinations of intermediate CoTs in reasoning MLLMs. MM-THEBench features a fine-grained taxonomy grounded in cognitive dimensions, diverse data with verified reasoning annotations, and a multi-level automated evaluation framework. Extensive experiments on mainstream reasoning MLLMs reveal insights into how thinking affects hallucination and reasoning capability in various multimodal tasks.
zh
[NLP-43] AlienLM: Alienization of Language for API-Boundary Privacy in Black-Box LLM s
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)通过黑盒API调用时存在的隐私泄露问题,即用户在传输敏感提示(prompt)、输出结果及微调数据过程中,可能因API边界暴露而导致数据泄露。解决方案的关键在于提出AlienLM,一种仅依赖API的隐私保护层,其核心机制是通过词汇级双射(vocabulary-scale bijection)将原始文本转换为“外星语言”(Alien Language),并在客户端实现无损还原;同时采用仅使用标准微调API的Alien Adaptation Training(AAT)方法,使目标模型直接在异化输入上运行,从而在保障任务性能的同时显著降低被逆向恢复的风险——实验表明其平均保留原始模型性能的81%以上,且在多种攻击场景下恢复的token比例低于0.22%。
链接: https://arxiv.org/abs/2601.22710
作者: Jaehee Kim,Pilsung Kang
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:
Abstract:Modern LLMs are increasingly accessed via black-box APIs, requiring users to transmit sensitive prompts, outputs, and fine-tuning data to external providers, creating a critical privacy risk at the API boundary. We introduce AlienLM, a deployable API-only privacy layer that protects text by translating it into an Alien Language via a vocabulary-scale bijection, enabling lossless recovery on the client side. Using only standard fine-tuning APIs, Alien Adaptation Training (AAT) adapts target models to operate directly on alienized inputs. Across four LLM backbones and seven benchmarks, AlienLM retains over 81% of plaintext-oracle performance on average, substantially outperforming random-bijection and character-level baselines. Under adversaries with access to model weights, corpus statistics, and learning-based inverse translation, recovery attacks reconstruct fewer than 0.22% of alienized tokens. Our results demonstrate a practical pathway for privacy-preserving LLM deployment under API-only access, substantially reducing plaintext exposure while maintaining task performance.
zh
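词汇级双射及其客户端无损还原可以用一个极简示例说明:对词表做随机置换得到"外星语言"映射,客户端持有逆映射即可精确复原(此处对一个小词列表做置换,真实系统作用于 tokenizer 的完整词表):

```python
import numpy as np

def make_bijection(vocab, seed=0):
    """构造词表级双射 σ: token -> alien token, 以及客户端持有的逆映射"""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(vocab))
    fwd = {vocab[i]: vocab[perm[i]] for i in range(len(vocab))}
    inv = {v: k for k, v in fwd.items()}
    return fwd, inv

vocab = ["the", "cat", "sat", "on", "mat"]
fwd, inv = make_bijection(vocab)
msg = ["the", "cat", "sat"]
alien = [fwd[t] for t in msg]          # 通过 API 传输的 "外星语言"
restored = [inv[t] for t in alien]     # 客户端无损还原
assert restored == msg
```

明文从不越过 API 边界;而论文中的 AAT 则通过标准微调接口让目标模型直接在这种异化输入上完成任务。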
[NLP-44] A Unified Study of LoRA Variants: Taxonomy Review Codebase and Empirical Evaluation
【速读】: 该论文旨在解决低秩适配(Low-Rank Adaptation, LoRA)变体在方法学、理论、代码实现和评估标准上的碎片化问题,从而推动其系统化发展。解决方案的关键在于提出首个统一的研究框架,包括:(1)基于秩(rank)、优化动态、初始化策略及与专家混合模型(Mixture-of-Experts)集成四个维度的系统分类体系;(2)构建统一的理论分析框架,聚焦低秩更新动力学以厘清各变体间的内在关系与演进逻辑;(3)开发LoRAFactory模块化代码库,提供标准化接口支持即插即用实验与细粒度分析;(4)通过大规模跨任务(自然语言生成、理解与图像分类)实证评估,揭示关键超参数敏感性并验证LoRA在最优配置下性能可优于或等同于多数变体。
链接: https://arxiv.org/abs/2601.22708
作者: Haonan He,Jingqi Ye,Minglei Li,Zhengbo Wang,Tao Chen,Lei Bai,Peng Ye
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence, Under Review
Abstract:Low-Rank Adaptation (LoRA) is a fundamental parameter-efficient fine-tuning method that balances efficiency and performance in large-scale neural networks. However, the proliferation of LoRA variants has led to fragmentation in methodology, theory, code, and evaluation. To this end, this work presents the first unified study of LoRA variants, offering a systematic taxonomy, unified theoretical review, structured codebase, and standardized empirical assessment. First, we categorize LoRA variants along four principal axes: rank, optimization dynamics, initialization, and integration with Mixture-of-Experts. Then, we review their relationships and evolution within a common theoretical framework focused on low-rank update dynamics. Further, we introduce LoRAFactory, a modular codebase that implements variants through a unified interface, supporting plug-and-play experimentation and fine-grained analysis. Last, using this codebase, we conduct a large-scale evaluation across natural language generation, natural language understanding, and image classification tasks, systematically exploring key hyperparameters. Our results uncover several findings, notably: LoRA and its variants exhibit pronounced sensitivity to the choices of learning rate compared to other hyperparameters; moreover, with proper hyperparameter configurations, LoRA consistently matches or surpasses the performance of most of its variants.
zh
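LoRA 的基本形式是冻结预训练权重 W0,仅训练低秩矩阵 A 和 B,即 h = xW0 + (α/r)·xAB;综述中的各变体主要在秩的分配、优化动态、初始化与 MoE 集成上对此做修改。最小示例:

```python
import numpy as np

def lora_forward(x, W0, A, B, alpha=16, r=4):
    """LoRA 前向: h = x W0 + (alpha/r) * x A B
    W0 (d×k) 冻结; 仅训练 A (d×r) 与 B (r×k), r << min(d, k)"""
    return x @ W0 + (alpha / r) * (x @ A) @ B

rng = np.random.default_rng(3)
d, k, r = 8, 8, 4
x = rng.normal(size=(2, d))
W0 = rng.normal(size=(d, k))
A = rng.normal(size=(d, r)) * 0.01
B = np.zeros((r, k))     # B 零初始化 => 训练开始时等价于原模型
assert np.allclose(lora_forward(x, W0, A, B), x @ W0)
```

可训练参数量为 r(d+k) 而非 dk,这正是其参数高效性的来源;文中报告的学习率敏感性也作用在对 A、B 的优化上。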
[NLP-45] Models Know Models Best: Evaluation via Model-Preferred Formats
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在多项选择题任务中因评估格式不同(符号选择式与填空式)而导致性能表现不一致的问题。研究表明,自然语言续写任务受益于概率评分机制,而显式比较任务则更适合符号选择方式,且这种差异在多种基于解码器的LLM中均具有一致性,表明其为模型无关效应。解决方案的关键在于提出一种动态格式对齐策略,该策略利用轻量级分类器学习模型内部生成的偏好信号(latent model-preference signals),而非依赖人工设计的启发式规则,从而自动为每个问题实例选择最优评估格式。实验表明,该方法在零样本场景下显著且稳定地提升了推理和知识类基准测试中的准确率,更真实地揭示了模型的潜在能力。
链接: https://arxiv.org/abs/2601.22699
作者: Joonhak Lee,Sungmok Jung,Jongyeon Park,Jaejin Lee
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Performance of Large Language Models (LLMs) on multiple-choice tasks differs markedly between symbol-based and cloze-style evaluation formats. The observed discrepancies are systematically attributable to task characteristics: natural language continuation benefits from likelihood scoring, whereas explicit comparison is better suited to symbol-based selection. These trends are consistent across various decoder-based LLMs, indicating model-agnostic effects. To address these inconsistencies, a dynamic format-alignment strategy is introduced that employs a lightweight classifier trained on latent model-preference signals. In contrast to human-designed heuristics, which often degrade performance, this approach uses model-generated signals to determine the optimal format for each problem instance. The proposed method achieves substantial and consistent improvements in zero-shot accuracy across reasoning and knowledge benchmarks, better revealing the models’ latent capabilities.
zh
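填空式(cloze)评测通过对候选答案续写的(长度归一化)log 概率打分并取最高者,符号选择式则比较 'A'/'B' 等选项标签的概率。下面用一个玩具"模型"示意前者;其中 `logprob(ctx, tok)` 是假设性接口,代表模型给出的条件 log 概率:

```python
def cloze_score(logprob, question, answer_tokens):
    """填空式打分: 对答案续写逐 token 累加条件 log 概率, 再做长度归一化"""
    ctx = list(question)
    total = 0.0
    for t in answer_tokens:
        total += logprob(ctx, t)
        ctx.append(t)
    return total / len(answer_tokens)

def toy_logprob(ctx, tok):
    """玩具 "模型": 在上下文 "... in" 之后偏好 "france" """
    table = {("in", "france"): -0.1, ("in", "spain"): -2.0}
    return table.get((ctx[-1], tok), -3.0)

q = ["paris", "is", "in"]
scores = {a: cloze_score(toy_logprob, q, [a]) for a in ["france", "spain"]}
assert max(scores, key=scores.get) == "france"
```

论文中的轻量级分类器所做的,正是在这种填空式打分与符号选择式之间,逐题选择对当前模型更有利的一种。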
[NLP-46] FNF: Functional Network Fingerprint for Large Language Models
【速读】: 该论文旨在解决开源大语言模型(Large Language Models, LLMs)被未经授权复制或衍生使用的问题,以保护开发者知识产权。其解决方案的关键在于提出一种名为功能网络指纹(Functional Network Fingerprint, FNF)的方法,该方法无需重新训练模型,仅需少量输入样本即可检测可疑模型是否源自目标模型,其核心依据是不同模型在功能网络活动模式上的高度一致性——共享同一来源的模型即使在规模或架构上存在差异,也能在多样输入下保持稳定的神经元活动模式,而独立训练的模型则无法维持这种活动对齐。FNF具有高效性、鲁棒性(对微调、剪枝、参数重排等常见修改不敏感),且适用于跨架构和维度比较,为模型所有者提供了一种非侵入式、实用性强的知识产权保护工具。
链接: https://arxiv.org/abs/2601.22692
作者: Yiheng Liu,Junhao Ning,Sichen Xia,Haiyang Sun,Yang Yang,Hanyang Chi,Xiaohui Gao,Ning Qiang,Bao Ge,Junwei Han,Xintao Hu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 13 pages, 4 figures
Abstract:The development of large language models (LLMs) is costly and has significant commercial value. Consequently, preventing unauthorized appropriation of open-source LLMs and protecting developers’ intellectual property rights have become critical challenges. In this work, we propose the Functional Network Fingerprint (FNF), a training-free, sample-efficient method for detecting whether a suspect LLM is derived from a victim model, based on the consistency between their functional network activity. We demonstrate that models that share a common origin, even with differences in scale or architecture, exhibit highly consistent patterns of neuronal activity within their functional networks across diverse input samples. In contrast, models trained independently on distinct data or with different objectives fail to preserve such activity alignment. Unlike conventional approaches, our method requires only a few samples for verification, preserves model utility, and remains robust to common model modifications (such as fine-tuning, pruning, and parameter permutation), as well as to comparisons across diverse architectures and dimensionalities. FNF thus provides model owners and third parties with a simple, non-invasive, and effective tool for protecting LLM intellectual property. The code is available at this https URL.
zh
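FNF 的直觉是:同源模型在相同输入上的功能网络活动模式高度一致,独立训练的模型则不然。下面的草图将激活二值化为活动模式并用 Jaccard 一致性比较;这一度量方式为示意性设计,论文所用的具体一致性度量以原文为准:

```python
import numpy as np

def activity_pattern(acts, top_frac=0.2):
    """将每个样本的激活二值化为活动模式: 幅值前 top_frac 的神经元置 1"""
    k = max(1, int(top_frac * acts.shape[-1]))
    idx = np.argsort(-np.abs(acts), axis=-1)[..., :k]
    pat = np.zeros_like(acts)
    np.put_along_axis(pat, idx, 1.0, axis=-1)
    return pat

def fingerprint_match(acts_a, acts_b):
    """两模型在同一批输入上活动模式的平均 Jaccard 一致性(示意性度量)"""
    pa, pb = activity_pattern(acts_a), activity_pattern(acts_b)
    inter = (pa * pb).sum(axis=-1)
    union = np.clip(pa + pb, 0, 1).sum(axis=-1)
    return float((inter / union).mean())

rng = np.random.default_rng(4)
base = rng.normal(size=(5, 100))                   # "受害者"模型的激活
derived = base + 0.05 * rng.normal(size=(5, 100))  # 微调后的衍生模型
independent = rng.normal(size=(5, 100))            # 独立训练的模型
assert fingerprint_match(base, derived) > fingerprint_match(base, independent)
```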
[NLP-47] SLM: Tree-Structured Language Modeling for Divergent Thinking
【速读】: 该论文旨在解决语言模型在推理过程中因序列生成特性而导致的探索路径无法解耦的问题,即模型难以在搜索中有效分离无关的探索分支,从而造成计算冗余和效率低下。解决方案的关键在于提出树状结构语言建模(Tree-Structured Language Modeling, TSLM),通过引入特殊标记显式编码搜索树的分支结构,使模型能够在单次生成过程中同时生成并选择性扩展多个搜索路径;同时,利用包含成功与失败尝试的完整搜索树进行训练,使模型学会内化系统性探索策略,避免对共享前缀的重复计算,从而实现更高效且鲁棒的推理过程。
链接: https://arxiv.org/abs/2601.22688
作者: Doyoung Kim,Jaehyeok Doo,Minjoon Seo
机构: NYU(纽约大学); KAIST AI(韩国科学技术院人工智能)
类目: Computation and Language (cs.CL)
备注:
Abstract:Language models generate reasoning sequentially, preventing them from decoupling irrelevant exploration paths during search. We introduce Tree-Structured Language Modeling (TSLM), which uses special tokens to encode branching structure, enabling models to generate and selectively expand multiple search paths within a single generation process. By training on complete search trees including both successful and failed attempts, TSLM learns to internalize systematic exploration without redundant recomputation of shared prefixes. TSLM achieves robust performance and superior inference efficiency by avoiding the multiple independent forward passes required by external search methods. These results suggest a new paradigm of inference-time scaling for robust reasoning, demonstrating that supervised learning on complete tree-structured traces provides an efficient alternative for developing systematic exploration capabilities in language models.
zh
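用特殊标记将搜索树线性化为单一序列,是 TSLM 能在一次生成中同时展开多条路径、且共享前缀只出现一次的前提。一个极简的序列化草图(`<branch>`/`<end>` 标记名称为示意,非论文实际词表):

```python
def serialize_tree(node):
    """将搜索树 (文本, 子节点列表) 递归线性化为带特殊标记的 token 序列"""
    text, children = node
    out = [text]
    for child in children:
        out += ["<branch>"] + serialize_tree(child) + ["<end>"]
    return out

tree = ("root", [
    ("try: x=1", [("fail", [])]),       # 失败的探索分支也保留在训练数据中
    ("try: x=2", [("success", [])]),
])
seq = serialize_tree(tree)
print(" ".join(seq))
# root <branch> try: x=1 <branch> fail <end> <end> <branch> try: x=2 <branch> success <end> <end>
```

在这种表示下,"root" 这一共享前缀只需编码一次,两条分支在 `<end>` 处彼此解耦,这正对应正文所说的避免对共享前缀的重复计算。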
[NLP-48] NAG: A Unified Native Architecture for Encoder-free Text-Graph Modeling in Language Models
【速读】: 该论文旨在解决现有文本-图融合方法中因架构分离导致的结构与语义交互不一致问题:传统方法依赖外部图神经网络(GNN)编码图结构拓扑,而语言模型(LM)仅处理文本语义,二者在嵌入空间上存在割裂,需通过复杂隐式对齐实现节点与文本元素的关联。其解决方案的关键在于提出NAG(Native Architecture for Graphs),将图处理机制内化至语言模型的原生流形中,利用自注意力机制显式建模拓扑依赖关系,并重新校准位置标识符以保证结构等价性,从而让模型在不引入外部编码器的前提下,直接利用自身的语言理解能力同步解析节点、边内容及结构信息。
链接: https://arxiv.org/abs/2601.22657
作者: Haisong Gong,Zhibo Liu,Qiang Liu,Shu Wu,Liang Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Prevailing methods for integrating graphs into Language Models (LMs) typically rely on a segregated architecture: external Graph Neural Networks (GNNs) encode structural topology, while LMs process textual semantics. We argue this approach is suboptimal for text-graphs: it creates a conceptually disjointed interaction paradigm. By segregating structural encoding from semantic processing, these systems must perform a complex implicit alignment between abstract graph tokens and concrete textual elements. Challenging the necessity of external encoders, we propose NAG (Native Architecture for Graphs), a unified framework that internalizes graph processing within the LM’s native manifold. Instead of bridging disparate embedding spaces, NAG repurposes the self-attention mechanism to enforce topological dependencies and recalibrates positional IDs to ensure structural equivalence. This allows the model to harness its intrinsic linguistic capability to simultaneously comprehend node and edge content alongside structural topology. We introduce two efficient implementations: NAG-Zero for absolute preservation of the base model’s linguistic capabilities, and NAG-LoRA for enhanced structural adaptation. Experiments across diverse graph tasks validate that NAG achieves robust graph comprehension without the overhead of external encoders, offering a simpler, more coherent paradigm for text-graph modeling.
zh
[NLP-49] DART-ing Through the Drift: Dynamic Tracing of Knowledge Neurons for Adaptive Inference-Time Pruning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中前馈网络(Feed-Forward Networks, FFNs)存在的参数冗余问题,尤其是现有剪枝方法在数据依赖性强和静态剪枝无法适应自回归生成过程中上下文动态变化的局限性。其解决方案的关键在于提出一种轻量级、无需训练的运行时动态剪枝框架 DART(Dynamic Attention-Guided Runtime Tracing),通过监测注意力分数分布的变化来感知上下文演化,并据此动态调整神经元级别的掩码,从而保留关键知识神经元。该方法显著提升了剪枝后的模型性能,在多个基准测试中相较静态剪枝提升高达14.5%的准确率,同时保持与原始密集模型相当的性能,且内存开销低于10MB,FLOPs额外开销仅为0.1%。
链接: https://arxiv.org/abs/2601.22632
作者: Abhishek Tyagi,Yunuo Cen,Shrey Dhorajiya,Bharadwaj Veeravalli,Xuanyao Fong
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) exhibit substantial parameter redundancy, particularly in Feed-Forward Networks (FFNs). Existing pruning methods suffer from two primary limitations. First, reliance on dataset-specific calibration introduces significant data dependency and computational overhead. Second, being predominantly static, they fail to account for the evolving subset of knowledge neurons in LLMs during autoregressive generation as the context evolves. To address this, we introduce DART (Dynamic Attention-Guided Runtime Tracing), a lightweight, training-free method that performs on-the-fly context-based pruning. DART monitors shifts in attention score distributions to infer context changes, dynamically updating neuron-level masks to retain salient parameters. Across ten benchmarks, DART outperforms the prior dynamic baseline, achieving accuracy gains of up to 14.5% on LLAMA-3.1-8B at 70% FFN sparsity. Furthermore, DART achieves up to 3x better ROUGE-L scores with respect to static-masked pruning on summarization tasks, with its performance comparable to the original dense models. We conclusively demonstrate that the proposed framework effectively adapts to diverse semantic contexts and preserves model capabilities across both general and domain-specific tasks, while running at less than 10MB of memory for LLAMA-3.1-8B (16GB) with 0.1% FLOPs overhead. The code is available at this https URL.
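The monitoring loop might look roughly like the following sketch, where the shift metric (total-variation distance) and the threshold are our assumptions rather than details from the paper:

```python
# Hypothetical sketch of context-triggered mask updates in the spirit of DART.

def attention_shift(prev, curr):
    """Total-variation distance between two attention score distributions."""
    return 0.5 * sum(abs(p - c) for p, c in zip(prev, curr))

def maybe_update_mask(prev, curr, activations, sparsity, threshold=0.2):
    """Recompute a top-k neuron mask only when attention signals a context change."""
    if attention_shift(prev, curr) < threshold:
        return None  # context stable: keep the existing mask
    k = max(1, int(len(activations) * (1 - sparsity)))
    keep = sorted(range(len(activations)), key=lambda i: -abs(activations[i]))[:k]
    return set(keep)
```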
zh
[NLP-50] Time-Annealed Perturbation Sampling: Diverse Generation for Diffusion Language Models
【速读】: 该论文旨在解决扩散语言模型(Diffusion Language Models, Diffusion-LMs)在文本生成过程中如何有效控制多样性以探索多种合理语义或推理路径的问题。现有方法未充分挖掘其显式时间维度的潜力,导致生成结果缺乏多样性且难以调控。解决方案的关键在于提出一种无需训练的推理策略——时间退火扰动采样(Time-Annealed Perturbation Sampling, TAPS),该策略利用扩散过程中的时间分工特性:早期去噪步骤主要决定全局语义结构,后期则聚焦局部词汇精炼。TAPS通过在早期阶段引入扰动以促进语义分支,随后逐步减少扰动以保持流畅性和指令遵循性,从而在不牺牲生成质量的前提下显著提升输出多样性,适用于非自回归与半自回归扩散骨干模型(如LLaDA和TraDo)。
链接: https://arxiv.org/abs/2601.22629
作者: Jingxuan Wu,Zhenglin Wan,Xingrui Yu,Yuzhe Yang,Yiqiao Huang,Ivor Tsang,Yang You
机构: The University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校); National University of Singapore (新加坡国立大学); CFAR, Agency for Science, Technology and Research (科技研究局); University of California, Santa Barbara (加州大学圣塔芭芭拉分校); Harvard University (哈佛大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion language models (Diffusion-LMs) introduce an explicit temporal dimension into text generation, yet how this structure can be leveraged to control generation diversity for exploring multiple valid semantic or reasoning paths remains underexplored. In this paper, we show that Diffusion-LMs, like diffusion models in image generation, exhibit a temporal division of labor: early denoising steps largely determine the global semantic structure, while later steps focus on local lexical refinement. Building on this insight, we propose Time-Annealed Perturbation Sampling (TAPS), a training-free inference strategy that encourages semantic branching early in the diffusion process while progressively reducing perturbations to preserve fluency and instruction adherence. TAPS is compatible with both non-autoregressive and semi-autoregressive Diffusion backbones, demonstrated on LLaDA and TraDo in our paper, and consistently improves output diversity across creative writing and reasoning benchmarks without compromising generation quality.
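A minimal sketch of a time-annealed perturbation schedule, assuming a simple linear anneal and Gaussian logit noise (both the schedule shape and the noise model are our illustration, not the paper's exact recipe):

```python
import random

def taps_noise_scale(step, total_steps, init_scale=1.0):
    """Perturbation strength annealed linearly over the diffusion trajectory:
    large early (semantic branching), near zero late (lexical refinement)."""
    return init_scale * (1.0 - step / total_steps)

def perturb_logits(logits, step, total_steps, rng):
    """Add annealed Gaussian noise to the denoiser's logits at a given step."""
    s = taps_noise_scale(step, total_steps)
    return [x + rng.gauss(0.0, s) for x in logits]
```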
zh
[NLP-51] TTCS: Test-Time Curriculum Synthesis for Self-Evolving
【速读】: 该论文旨在解决测试时训练(Test-Time Training, TTT)在提升大语言模型(Large Language Models, LLMs)推理能力时面临的两大挑战:一是原始测试问题过于困难,难以生成高质量伪标签;二是测试集规模有限,导致在线更新易出现不稳定。解决方案的关键在于提出一种协同进化测试时训练框架(TTCS),其核心机制是通过两个从同一预训练模型初始化的策略——问题合成器(question synthesizer)和推理求解器(reasoning solver)——进行迭代优化:合成器基于测试问题生成逐步增强的问题变体以构建适配求解器当前能力的结构化课程,而求解器则利用自一致性奖励(self-consistency rewards)在原始与合成问题上更新自身;二者相互反馈,形成闭环,既稳定了求解器的训练过程,又确保合成问题始终与模型能力对齐,从而实现高效、稳定的推理能力提升。
链接: https://arxiv.org/abs/2601.22628
作者: Chengyi Yang,Zhishang Xiang,Yunbo Tang,Zongpei Teng,Chengsong Huang,Fei Long,Yuhan Liu,Jinsong Su
机构: Xiamen University (厦门大学); Washington University in St. Louis (圣路易斯华盛顿大学); Renmin University of China (中国人民大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages, 4 figures, Our code and implementation details are available at this https URL
Abstract:Test-Time Training offers a promising way to improve the reasoning ability of large language models (LLMs) by adapting the model using only the test questions. However, existing methods struggle with difficult reasoning problems for two reasons: raw test questions are often too difficult to yield high-quality pseudo-labels, and the limited size of test sets makes continuous online updates prone to instability. To address these limitations, we propose TTCS, a co-evolving test-time training framework. Specifically, TTCS initializes two policies from the same pretrained model: a question synthesizer and a reasoning solver. These policies evolve through iterative optimization: the synthesizer generates progressively challenging question variants conditioned on the test questions, creating a structured curriculum tailored to the solver’s current capability, while the solver updates itself using self-consistency rewards computed from multiple sampled responses on both original test and synthetic questions. Crucially, the solver’s feedback guides the synthesizer to generate questions aligned with the model’s current capability, and the generated question variants in turn stabilize the solver’s test-time training. Experiments show that TTCS consistently strengthens the reasoning ability on challenging mathematical benchmarks and transfers to general-domain tasks across different LLM backbones, highlighting a scalable path towards dynamically constructing test-time curricula for self-evolving. Our code and implementation details are available at this https URL.
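The self-consistency reward the solver uses can be sketched as a majority vote over sampled answers; the function below is a minimal stand-in, not the paper's implementation:

```python
from collections import Counter

def self_consistency_reward(samples):
    """Pseudo-label = the majority answer among sampled responses;
    each sample is rewarded 1.0 if it matches the majority, else 0.0."""
    majority, _count = Counter(samples).most_common(1)[0]
    return majority, [1.0 if s == majority else 0.0 for s in samples]
```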
zh
[NLP-52] Layer-wise Swapping for Generalizable Multilingual Safety
【速读】: 该论文旨在解决低资源语言在生成式 AI(Generative AI)安全对齐方面的挑战,即当前主流的安全数据集以英语为中心,导致低资源语言模型在安全性和有害内容控制方面表现较差。解决方案的关键在于提出一种安全感知的层交换方法(safety-aware layer swapping method),通过将英语安全专家模型中的安全对齐能力迁移至低资源语言模型,而无需额外训练;同时,该方法基于模块的专业化程度自适应地选择或融合层,从而在保持通用语言理解能力(如MMMLU、BELEBELE和MGSM等基准测试性能)的同时,显著提升目标语言的安全性表现(如MultiJail基准上的有害响应减少)。
链接: https://arxiv.org/abs/2601.22620
作者: Hyunseo Shin,Wonseok Hwang
机构: University of Seoul (首尔大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Despite the rapid advancements of Large Language Models (LLMs), safety risks remain a critical challenge for low-resource languages. Existing safety datasets are predominantly English-centric, limiting progress in multilingual safety alignment. As a result, low-resource expert models, finetuned on their respective instruction datasets, tend to exhibit higher unsafety rates compared to their high-resource counterparts. In this work, we propose a safety-aware layer swapping method that transfers safety alignment from an English safety expert to low-resource language experts without additional training. To further enhance transferability, our method adaptively selects or blends modules based on their degree of specialization. Our approach preserves performance on general language understanding tasks while enhancing safety in the target languages. Experimental results show that the proposed method achieves comparable performance to the language expert on general benchmarks such as MMMLU, BELEBELE, and MGSM, while producing more aligned and less harmful responses on the MultiJail safety benchmark.
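In spirit, the transfer amounts to building a merged state dict in which selected layers come from the safety expert; the key naming scheme (`layers.{i}.…`) below is an assumption for illustration:

```python
# Hypothetical sketch of training-free layer swapping: selected transformer
# layers are taken from the English safety expert, everything else stays with
# the low-resource language expert. Key names are illustrative.

def swap_safety_layers(expert_weights, safety_weights, layers_to_swap):
    """Return a merged state dict with the chosen layers replaced."""
    merged = dict(expert_weights)
    for key, value in safety_weights.items():
        if key.startswith("layers."):
            layer_idx = int(key.split(".")[1])
            if layer_idx in layers_to_swap:
                merged[key] = value
    return merged
```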
zh
[NLP-53] From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents ICML2026
【速读】: 该论文旨在解决交互式工具使用智能体(Interactive Tool-Using Agents)在真实世界任务中面临的多轮交互挑战,包括对话状态追踪、多步骤工具执行以及复杂指令遵循等问题。由于高质量多轮工具使用数据的合成难以规模化,且强化学习(Reinforcement Learning, RL)易受用户模拟带来的噪声信号干扰,导致训练效率下降。解决方案的关键在于提出一个统一框架——EigenData,其核心是结合自演化数据代理(self-evolving data agent)与基于验证器的强化学习(verifier-based RL)。该框架通过分层多智能体引擎生成带可执行检查器的工具接地对话,并利用闭环自演化机制持续优化提示词和工作流;同时设计了一种基于轨迹级组相对优势(trajectory-level group-relative advantages)和动态过滤的GRPO风格训练策略,在无需昂贵人工标注的情况下显著提升模型性能,最终在tau^2-bench评测中达到73.0% pass^1(Airline)和98.3% pass^1(Telecom),媲美或超越前沿模型。
链接: https://arxiv.org/abs/2601.22607
作者: Jiaxuan Gao,Jiaao Chen,Chuyi He,Wei-Chen Wang,Shusheng Xu,Hanrui Wang,Di Jin,Yi Wu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Submitted to ICML 2026
Abstract:Interactive tool-using agents must solve real-world tasks via multi-turn interaction with both humans and external environments, requiring dialogue state tracking and multi-step tool execution while following complex instructions. Post-training such agents is challenging because synthesis of high-quality multi-turn tool-use data is difficult to scale, and reinforcement learning (RL) can face noisy signals caused by user simulation, leading to degraded training efficiency. We propose a unified framework that combines a self-evolving data agent with verifier-based RL. Our system, EigenData, is a hierarchical multi-agent engine that synthesizes tool-grounded dialogues together with executable per-instance checkers, and improves generation reliability via a closed-loop self-evolving process that updates prompts and workflows. Building on the synthetic data, we develop an RL recipe that first fine-tunes the user model and then applies GRPO-style training with trajectory-level group-relative advantages and dynamic filtering, yielding consistent improvements beyond SFT. Evaluated on tau^2-bench, our best model reaches 73.0% pass^1 on Airline and 98.3% pass^1 on Telecom, matching or exceeding frontier models. Overall, our results suggest a scalable pathway for bootstrapping complex tool-using behaviors without expensive human annotation.
zh
[NLP-54] TimeMachine-bench: A Benchmark for Evaluating Model Capabilities in Repository-Level Migration Tasks EACL2026
【速读】: 该论文试图解决软件迁移(Software Migration)任务在真实世界项目中缺乏系统评估基准的问题,尤其关注Python项目因依赖更新导致测试失败后的自动修复能力。其解决方案的关键在于构建了一个名为TimeMachine-bench的自动化、可实时更新的基准数据集,该数据集基于GitHub仓库中因依赖版本变更而测试失败的案例,并通过人工验证子集确保问题的可解性;在此基础上,作者评估了基于11种大语言模型(LLM)的代理基线方法,揭示了当前LLM在迁移任务中存在的可靠性挑战,如利用低测试覆盖率产生的虚假解和因工具使用策略不佳引发的冗余修改。
链接: https://arxiv.org/abs/2601.22597
作者: Ryo Fujii,Makoto Morishita,Kazuki Yano,Jun Suzuki
机构: Tohoku University (东北大学); Future Corporation (未来公司); RIKEN (理化学研究所); NII LLMC (日本国立信息学研究所语言模型与计算中心)
类目: Software Engineering (cs.SE); Computation and Language (cs.CL)
备注: Accepted to EACL 2026 Main, camera-ready
Abstract:With the advancement of automated software engineering, research focus is increasingly shifting toward practical tasks reflecting the day-to-day work of software engineers. Among these tasks, software migration, a critical process of adapting code to evolving environments, has been largely overlooked. In this study, we introduce TimeMachine-bench, a benchmark designed to evaluate software migration in real-world Python projects. Our benchmark consists of GitHub repositories whose tests begin to fail in response to dependency updates. The construction process is fully automated, enabling live updates of the benchmark. Furthermore, we curated a human-verified subset to ensure problem solvability. We evaluated agent-based baselines built on top of 11 models, including both strong open-weight and state-of-the-art LLMs on this verified subset. Our results indicated that, while LLMs show some promise for migration tasks, they continue to face substantial reliability challenges, including spurious solutions that exploit low test coverage and unnecessary edits stemming from suboptimal tool-use strategies. Our dataset and implementation are available at this https URL.
zh
[NLP-55] Language Model Circuits Are Sparse in the Neuron Basis
【速读】: 该论文旨在解决语言模型中神经元表示的可解释性问题,即如何在不引入额外训练成本的前提下,识别并定位控制模型行为的因果电路(causal circuitry)。传统方法如稀疏自编码器(Sparse Autoencoders, SAEs)被用于将原始神经元基分解为更易解释的计算单元,但并非所有神经元表示都不可解释。本文的关键发现是:MLP层中的神经元本身即可构成与SAEs相当稀疏的特征基底,从而可以直接用于电路追踪任务。基于此,作者构建了一个端到端的电路追踪流程,利用梯度归因方法在MLP神经元基础上定位关键因果结构——例如在主谓一致任务中仅需约10²个MLP神经元即可控制模型行为,在多跳城市→州→首都任务中识别出编码特定推理步骤的小规模神经元集合,并能通过干预实现输出可控改变。这一成果显著提升了语言模型自动化可解释性的效率和精度。
链接: https://arxiv.org/abs/2601.22594
作者: Aryaman Arora,Zhengxuan Wu,Jacob Steinhardt,Sarah Schwettmann
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages main text, 41 pages total
Abstract:The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques such as sparse autoencoders (SAEs) to decompose the neuron basis into more interpretable units of model computation, for tasks such as circuit tracing. However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that MLP neurons are as sparse a feature basis as SAEs. We use this finding to develop an end-to-end pipeline for circuit tracing on the MLP neuron basis, which locates causal circuitry on a variety of tasks using gradient-based attribution. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of ≈10^2 MLP neurons is enough to control model behaviour. On the multi-hop city → state → capital task from Lindsey et al., 2025, we find a circuit in which small sets of neurons encode specific latent reasoning steps (e.g. 'map city to its state'), and can be steered to change the model's output. This work thus advances automated interpretability of language models without additional training costs.
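A bare-bones sketch of gradient-based neuron attribution (activation times gradient, a first-order estimate of each neuron's effect on the target logit); the paper's exact attribution method may differ:

```python
# Illustrative gradient-based attribution over MLP neurons: score each neuron
# by activation * gradient and keep the top-k as circuit candidates.

def neuron_attribution(activations, gradients, top_k):
    """Return the indices of the top-k neurons by |activation * gradient|,
    together with the full score vector."""
    scores = [a * g for a, g in zip(activations, gradients)]
    ranked = sorted(range(len(scores)), key=lambda i: -abs(scores[i]))
    return ranked[:top_k], scores
```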
zh
[NLP-56] Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的无参考评估方法(即“LLM-as-a-Judge”)存在的高成本、不透明以及对提示设计敏感等问题。其解决方案的关键在于提出并验证了“语义容量不对称假设”(Semantic Capacity Asymmetry Hypothesis),指出评估任务所需的语义能力显著低于生成任务,因此可基于小模型内部表示(hidden states)进行有效评估,而非依赖其生成结果。由此提出“Representation-as-a-Judge”范式,通过探测小模型的中间表征来实现解码-free 的评分预测,具体实例化为INSPECTOR框架,该框架在多个推理基准测试中展现出优于传统提示驱动的小模型评估效果,并逼近全量LLM判官的性能,同时具备更高的效率、可靠性和可解释性。
链接: https://arxiv.org/abs/2601.22588
作者: Zhuochun Li,Yong Zhang,Ming Li,Yuelyu Ji,Yiming Zeng,Ning Cheng,Yun Zhu,Yanmeng Wang,Shaojun Wang,Jing Xiao,Daqing He
机构: Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司); University of Pittsburgh (匹兹堡大学); University of Maryland, College Park (马里兰大学学院市分校); University of Connecticut (康涅狄格大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) are widely used as reference-free evaluators via prompting, but this “LLM-as-a-Judge” paradigm is costly, opaque, and sensitive to prompt design. In this work, we investigate whether smaller models can serve as efficient evaluators by leveraging internal representations instead of surface generation. We uncover a consistent empirical pattern: small LMs, despite with weak generative ability, encode rich evaluative signals in their hidden states. This motivates us to propose the Semantic Capacity Asymmetry Hypothesis: evaluation requires significantly less semantic capacity than generation and can be grounded in intermediate representations, suggesting that evaluation does not necessarily need to rely on large-scale generative models but can instead leverage latent features from smaller ones. Our findings motivate a paradigm shift from LLM-as-a-Judge to Representation-as-a-Judge, a decoding-free evaluation strategy that probes internal model structure rather than relying on prompted output. We instantiate this paradigm through INSPECTOR, a probing-based framework that predicts aspect-level evaluation scores from small model representations. Experiments on reasoning benchmarks (GSM8K, MATH, GPQA) show that INSPECTOR substantially outperforms prompting-based small LMs and closely approximates full LLM judges, while offering a more efficient, reliable, and interpretable alternative for scalable evaluation.
zh
[NLP-57] SpanNorm: Reconciling Training Stability and Performance in Deep Transformers
【速读】: 该论文旨在解决深度Transformer架构中归一化层放置位置带来的训练稳定性与模型性能之间的权衡问题:PreNorm(前置归一化)虽能保障训练稳定,但深层模型易出现性能下降;PostNorm(后置归一化)则具有更强的性能潜力,却面临严重的训练不稳定问题。解决方案的关键在于提出SpanNorm,其通过构建跨越整个Transformer模块的清晰残差连接以稳定信号传播,同时采用PostNorm风格的计算方式对聚合输出进行归一化,从而兼顾稳定性与性能。理论分析表明,结合合理的缩放策略,SpanNorm可保持网络中信号方差有界,避免PostNorm常见的梯度问题,并缓解PreNorm的表征坍塌现象。
链接: https://arxiv.org/abs/2601.22580
作者: Chao Wang,Bei Li,Jiaqi Zhang,Xinyu Liu,Yuchun Fan,Linkun Lyu,Xin Chen,Jingang Wang,Tong Xiao,Peng Pei,Xunliang Cai
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The success of Large Language Models (LLMs) hinges on the stable training of deep Transformer architectures. A critical design choice is the placement of normalization layers, leading to a fundamental trade-off: the "PreNorm" architecture ensures training stability at the cost of potential performance degradation in deep models, while the "PostNorm" architecture offers strong performance but suffers from severe training instability. In this work, we propose SpanNorm, a novel technique designed to resolve this dilemma by integrating the strengths of both paradigms. Structurally, SpanNorm establishes a clean residual connection that spans the entire transformer block to stabilize signal propagation, while employing a PostNorm-style computation that normalizes the aggregated output to enhance model performance. We provide a theoretical analysis demonstrating that SpanNorm, combined with a principled scaling strategy, maintains bounded signal variance throughout the network, preventing the gradient issues that plague PostNorm models, and also alleviating the representation collapse of PreNorm. Empirically, SpanNorm consistently outperforms standard normalization schemes in both dense and Mixture-of-Experts (MoE) scenarios, paving the way for more powerful and stable Transformer architectures.
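One plausible reading of the block structure, sketched numerically below: an identity residual spans the whole block, while the aggregated sublayer output is normalized PostNorm-style before being added back. The exact formula and the paper's scaling strategy are not reproduced here; this is only our interpretation of the abstract:

```python
# Numerical sketch of a SpanNorm-style block as we read the abstract:
# y = x + LayerNorm(sublayer(x)), i.e. a block-spanning identity residual
# plus a PostNorm-style normalization of the aggregated sublayer output.
# This formula is an assumption, not the paper's definition.

def layer_norm(xs, eps=1e-6):
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / (var + eps) ** 0.5 for x in xs]

def spannorm_block(x, sublayer):
    out = sublayer(x)                 # attention + MLP, fused here for brevity
    normed = layer_norm(out)          # PostNorm-style normalization
    return [xi + ni for xi, ni in zip(x, normed)]  # block-spanning residual
```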
zh
[NLP-58] PhoStream: Benchmarking Real-World Streaming for Omnimodal Assistants in Mobile Scenarios
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在连续真实世界流式音视频输入场景下作为移动助手时的实时响应能力不足的问题。现有基准测试多局限于短视频或多项选择题,难以评估模型在动态环境中对视听信息的时间感知与适时回应能力。解决方案的关键在于构建首个以移动端为中心的流式评测基准PhoStream,其统一了屏幕内与屏幕外场景,涵盖578个视频中的5,572个开放式问答对,覆盖4种场景和10项能力,并通过自动化生成管道结合人工验证确保数据质量;同时设计了真实的在线推理流程和基于大语言模型作为裁判(LLM-as-a-Judge)的开放回答评估机制,从而揭示当前MLLMs存在“何时回应”而非“说什么”的根本性局限——即在前向任务中因过早响应导致性能显著下降(仅16.40分),而对即时和回溯任务表现良好(Gemini 3 Pro超80分)。
链接: https://arxiv.org/abs/2601.22575
作者: Xudong Lu,Huankang Guan,Yang Bo,Jinpeng Chen,Xintong Guo,Shuhan Li,Fang Liu,Peiwen Sun,Xueying Li,Wei Zhang,Xue Yang,Rui Liu,Hongsheng Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 18 pages
Abstract:Multimodal Large Language Models excel at offline audio-visual understanding, but their ability to serve as mobile assistants in continuous real-world streams remains underexplored. In daily phone use, mobile assistants must track streaming audio-visual inputs and respond at the right time, yet existing benchmarks are often restricted to multiple-choice questions or use shorter videos. In this paper, we introduce PhoStream, the first mobile-centric streaming benchmark that unifies on-screen and off-screen scenarios to evaluate video, audio, and temporal reasoning. PhoStream contains 5,572 open-ended QA pairs from 578 videos across 4 scenarios and 10 capabilities. We build it with an Automated Generative Pipeline backed by rigorous human verification, and evaluate models using a realistic Online Inference Pipeline and LLM-as-a-Judge evaluation for open-ended responses. Experiments reveal a temporal asymmetry in LLM-judged scores (0-100): models perform well on Instant and Backward tasks (Gemini 3 Pro exceeds 80), but drop sharply on Forward tasks (16.40), largely due to early responses before the required visual and audio cues appear. This highlights a fundamental limitation: current MLLMs struggle to decide when to speak, not just what to say. Code and datasets used in this work will be made publicly accessible at this https URL.
zh
[NLP-59] Are LLM Evaluators Really Narcissists? Sanity Checking Self-Preference Evaluations
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLM)在自动评估任务中表现出的自偏倚(self-preference bias)问题,即模型倾向于偏好自身生成的输出,从而干扰自动化后训练与评估流程的可靠性。现有研究难以区分这种偏倚是由模型“自恋”(narcissism)还是由实验设计中的混杂因素(methodological confounds)导致。论文发现一个核心方法学混杂因素:当LLM作为评判者时,若其自身在回答某一问题时出错,则更可能错误地偏好自己的答案,无论该答案是否为待评选项之一。解决方案的关键在于引入“评估者质量基线”(Evaluator Quality Baseline),通过比较模型在自身犯错情况下偏好自己输出的概率与偏好其他模型错误输出的概率,从而分离出真实的自偏倚信号。该基线可将测量误差降低89.6%,并使原本统计显著的结果中仅51%保持显著性,表明此前许多结论可能受噪声数据误导。此方法为未来研究自偏倚提供了更干净的数据基础,并推动对评判偏差机制的系统性理解。
链接: https://arxiv.org/abs/2601.22548
作者: Dani Roytburg,Matthew Bozoukov,Matthew Nguyen,Mackenzie Puig-Hall,Narmeen Oozeer
机构: Carnegie Mellon University (卡内基梅隆大学); University of California, San Diego (加州大学圣地亚哥分校); University of Virginia (弗吉尼亚大学); Apart Research; Martian Research
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Recent research has shown that large language models (LLMs) favor their own outputs when acting as judges, undermining the integrity of automated post-training and evaluation workflows. However, it is difficult to disentangle which evaluation biases are explained by narcissism versus general experimental confounds, distorting measurements of self-preference bias. We discover a core methodological confound which could reduce measurement error by 89.6%. Specifically, LLM evaluators may deliver self-preferring verdicts when a judge responds to queries that it answered incorrectly itself; this would be true regardless of whether one of the candidate responses is its own. To decouple self-preference signals from noisy outputs on hard problems, we introduce an Evaluator Quality Baseline, which compares the probability that a judge incorrectly votes for itself against the probability that it votes for an incorrect response from another model. Evaluating this simple baseline on 37,448 queries, only 51% of initial findings retain statistical significance. Finally, we turn towards characterizing the entropy of "easy" versus "hard" evaluation votes from LLM judges. Our corrective baseline enables future research on self-preference by eliminating noisy data from potential solutions. More widely, this work contributes to the growing body of work on cataloging and isolating judge-bias effects.
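The baseline itself reduces to comparing two conditional rates; a minimal sketch (the count-based inputs are our framing of the abstract's description):

```python
# Illustrative Evaluator Quality Baseline: the excess rate at which a judge
# votes for its own incorrect answer over the rate at which it votes for
# another model's incorrect answer. A positive value suggests narcissism
# beyond plain evaluator error.

def evaluator_quality_baseline(self_votes_wrong, self_trials_wrong,
                               other_votes_wrong, other_trials_wrong):
    p_self = self_votes_wrong / self_trials_wrong    # P(votes self | self wrong)
    p_other = other_votes_wrong / other_trials_wrong  # P(votes wrong other)
    return p_self - p_other
```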
zh
[NLP-60] Towards the Holographic Characteristic of LLMs for Efficient Short-text Generation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成过程中对目标关键词捕捉机制不明确的问题,尤其是其强大的生成能力背后是否存在可被利用的规律性特征。研究发现,LLMs在生成初期即倾向于捕获目标侧关键词,这一现象被命名为“全息特性”(Holographic Characteristic)。解决方案的关键在于提出名为HOLO的插件,该插件基于全息特性,在有限生成步数内提取目标关键词,并结合并行词汇约束文本生成方法实现高效补全,从而提升推理效率且保持与基线相当的自动和人工评估性能。
链接: https://arxiv.org/abs/2601.22546
作者: Shun Qian,Bingquan Liu,Chengjie Sun,Zhen Xu,Baoxun Wang
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The recent advancements in Large Language Models (LLMs) have attracted interest in exploring their in-context learning abilities and chain-of-thought capabilities. However, there are few studies investigating the specific traits related to the powerful generation capacity of LLMs. This paper aims to delve into the generation characteristics exhibited by LLMs. Through our investigation, we have discovered that language models tend to capture target-side keywords at the beginning of the generation process. We name this phenomenon the Holographic Characteristic of language models. To explore this characteristic and further improve the inference efficiency of language models, we propose a plugin called HOLO, which leverages the Holographic Characteristic to extract target-side keywords from language models within a limited number of generation steps and complements the sentence with a parallel lexically constrained text generation method. To verify the effectiveness of HOLO, we conduct extensive experiments on language models of varying architectures and scales in the short-text generation scenario. The results demonstrate that HOLO achieves comparable performance to the baselines in terms of both automatic and human evaluation metrics and highlight the potential of the Holographic Characteristic.
zh
[NLP-61] ρ-EOS: Training-free Bidirectional Variable-Length Control for Masked Diffusion LLMs
【速读】: 该论文旨在解决当前掩码扩散大语言模型(masked diffusion large language models, dLLMs)在生成过程中面临的固定长度限制问题,即模型必须预设固定的生成长度,导致在输出质量与计算效率之间存在不可调和的权衡。其解决方案的关键在于发现并利用去噪过程中的隐式结束符(end-of-sequence, EOS)密度(ρ)作为生成充分性的可靠信号:该密度的变化能够判断当前掩码空间是否冗余或不足,从而指导生成长度的双向调整——当隐式 EOS 密度过高时触发掩码 token 的收缩,过低则触发扩展。基于此洞察,作者提出无需训练、单阶段的 ρ-EOS 策略,在统一的去噪流程中实现双向可变长度生成,显著提升推理效率和 token 利用率,同时保持与固定长度方法相当的性能表现。
链接: https://arxiv.org/abs/2601.22527
作者: Jingyi Yang,Yuxian Jiang,Jing Shao
机构: 未知
类目: Computation and Language (cs.CL)
备注: 11 pages,6 figures,6 tables
Abstract:Beyond parallel generation and global context modeling, current masked diffusion large language models (dLLMs) suffer from a fundamental limitation: they require a predefined, fixed generation length, which lacks flexibility and forces an inevitable trade-off between output quality and computational efficiency. To address this, we study the denoising dynamics and find that the implicit density (ρ) of end-of-sequence (EOS) tokens serves as a reliable signal of generation sufficiency. In particular, the evolving implicit EOS density during denoising reveals whether the current masked space is excessive or insufficient, thereby guiding the adjustment direction for generation length. Building on this insight, we propose ρ-EOS, a training-free, single-stage strategy that enables bidirectional variable-length generation for masked dLLMs. Unlike prior two-stage approaches, which require separate length adjustment and iterative mask insertion phases while supporting only unidirectional expansion, ρ-EOS achieves bidirectional length adjustment within a unified denoising process by continuously estimating the implicit EOS density: excessively high density triggers MASK token contraction, while insufficient density induces expansion. Extensive experiments on mathematics and code benchmarks demonstrate that ρ-EOS achieves comparable performance while substantially improving inference efficiency and token utilization.
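The bidirectional adjustment rule can be sketched as follows, with the density target, tolerance, and step size as illustrative assumptions rather than values from the paper:

```python
# Illustrative length controller in the spirit of ρ-EOS: estimate the implicit
# EOS density over the masked span and expand/contract the span accordingly.

def adjust_length(probs_eos, num_masked, target_density, tol=0.1, step=4):
    """probs_eos: per-position implicit EOS probabilities over the masked span.
    Far above target -> contract the masked span; far below -> expand it."""
    density = sum(probs_eos) / len(probs_eos)
    if density > target_density + tol:
        return num_masked - step
    if density < target_density - tol:
        return num_masked + step
    return num_masked
```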
zh
[NLP-62] One Ring to Rule Them All: Unifying Group-Based RL via Dynamic Power-Mean Geometry
【速读】: 该论文旨在解决基于分组的强化学习方法(如GRPO和GMPO)在策略优化过程中依赖固定聚合几何结构的问题,这一局限性忽略了轨迹在训练过程中动态演化和异质性的特点。其解决方案的关键在于提出了一种广义框架——幂均策略优化(Power-Mean Policy Optimization, PMPO),通过引入幂均几何指数 p 参数化聚合几何结构,使GRPO和GMPO成为该框架下的特例;进一步设计了基于裁剪感知的有效样本大小(Clip-aware Effective Sample Size, ESS)机制,通过将轨迹裁剪比例映射至目标ESS并求解对应的 p 值,实现对不同轨迹自适应地调整聚合方式:对可靠轨迹采用更激进的算术平均,对不稳定轨迹则转向保守的几何平均,从而提升整体训练稳定性和性能。
链接: https://arxiv.org/abs/2601.22521
作者: Weisong Zhao,Tong Wang,Zichang Tan,Te Yang,Siran Peng,Haoyuan Zhang,Tianshuo Zhang,Haichao Shi,Meng Meng,Yang Yang,Xiangyu Zhu,Zhen Lei,Xiao-Yu Zhang,Xu Zhou
机构: 未知
类目: Computation and Language (cs.CL)
备注: 17 pages, 3 figures
Abstract:Group-based reinforcement learning has evolved from the arithmetic mean of GRPO to the geometric mean of GMPO. While GMPO improves stability by constraining a conservative objective, it shares a fundamental limitation with GRPO: reliance on a fixed aggregation geometry that ignores the evolving and heterogeneous nature of each trajectory. In this work, we unify these approaches under Power-Mean Policy Optimization (PMPO), a generalized framework that parameterizes the aggregation geometry via the power-mean geometry exponent p. Within this framework, GRPO and GMPO are recovered as special cases. Theoretically, we demonstrate that adjusting p modulates the concentration of gradient updates, effectively reweighting tokens based on their advantage contribution. To determine p adaptively, we introduce a Clip-aware Effective Sample Size (ESS) mechanism. Specifically, we propose a deterministic rule that maps a trajectory clipping fraction to a target ESS. Then, we solve for the specific p to align the trajectory induced ESS with this target one. This allows PMPO to dynamically transition between the aggressive arithmetic mean for reliable trajectories and the conservative geometric mean for unstable ones. Experiments on multiple mathematical reasoning benchmarks demonstrate that PMPO outperforms strong baselines.
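The unifying aggregation is the classical power mean; a small sketch showing how p = 1 and the p → 0 limit recover the arithmetic (GRPO-style) and geometric (GMPO-style) means, respectively:

```python
import math

def power_mean(xs, p, eps=1e-8):
    """Power mean of positive values. p = 1 gives the arithmetic mean
    (GRPO-style aggregation); the p -> 0 limit gives the geometric mean
    (GMPO-style). Intermediate p values interpolate between the two."""
    if abs(p) < eps:  # geometric-mean limit
        return math.exp(sum(math.log(x) for x in xs) / len(xs))
    return (sum(x ** p for x in xs) / len(xs)) ** (1.0 / p)
```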
zh
[NLP-63] Mock Worlds, Real Skills: Building Small Agentic Language Models with Synthetic Tasks, Simulated Environments, and Rubric-Based Rewards
【速读】: 该论文旨在解决小规模语言模型(Small LLMs)在智能体(agent)能力上难以匹敌大型昂贵模型的问题,其核心挑战在于现有开源训练数据任务类型单一且易解,以及真实世界API在大规模强化学习(Reinforcement Learning, RL)部署中缺乏多样性与稳定性。解决方案的关键在于提出SYNTHAGENT框架,该框架通过强教师模型生成新颖的任务和工具生态系统,并将其转化为有意模糊的指令,促使代理主动向用户查询缺失信息;同时引入基于大语言模型(LLM)的用户模拟器提供私有用户信息,以及模拟工具系统确保稳定响应;此外,奖励机制基于任务级评分规则,综合子目标达成、用户交互行为及禁止行为进行设计。该方法显著提升了小模型在数学、搜索和工具使用等14个挑战性数据集上的性能,甚至超越更大规模的基线模型。
链接: https://arxiv.org/abs/2601.22511
作者: Yuan-Jay Lü,Chengyu Wang,Lei Shen,Jun Huang,Tong Xu
机构: University of Science and Technology of China (中国科学技术大学); Xi’an Jiaotong University (西安交通大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Small LLMs often struggle to match the agentic capabilities of large, costly models. While reinforcement learning can help, progress has been limited by two structural bottlenecks: existing open-source agentic training data are narrow in task variety and easily solved; real-world APIs lack diversity and are unstable for large-scale reinforcement learning rollout processes. We address these challenges with SYNTHAGENT, a framework that jointly synthesizes diverse tool-use training data and simulates complete environments. Specifically, a strong teacher model creates novel tasks and tool ecosystems, then rewrites them into intentionally underspecified instructions. This compels agents to actively query users for missing details. When handling synthetic tasks, an LLM-based user simulator provides user-private information, while a mock tool system delivers stable tool responses. For rewards, task-level rubrics are constructed based on required subgoals, user-agent interactions, and forbidden behaviors. Across 14 challenging datasets in math, search, and tool use, models trained on our synthetic data achieve substantial gains, with small models outperforming larger baselines.
zh
[NLP-64] SSL: Sweet Spot Learning for Differentiated Guidance in Agentic Optimization
【速读】: 该论文旨在解决强化学习中因使用二元奖励(binary rewards)而导致的优化效率低下问题,即现有方法无法区分达成相同结果的不同轨迹之间的质量差异,从而忽略了解空间中的潜在多样性。其解决方案的关键在于提出一种名为“甜点学习”(Sweet Spot Learning, SSL)的新框架,该框架通过逐步增强的分层奖励机制,引导策略向解空间中性能最优的“甜点区域”收敛;这一原则在视觉感知任务中体现为基于距离的分层建模,在复杂推理任务中则体现为对渐进式进展的奖励,理论上保证了最优解排序不变性并提升梯度信噪比,从而实现更高效、更定向的智能体优化。
链接: https://arxiv.org/abs/2601.22491
作者: Jinyang Wu,Changpeng Yang,Yuhao Shen,Fangzhi Xu,Bolin Ni,Chonghua Liao,Yuchen Liu,Hongzhen Wang,Shuai Nie,Shuai Zhang,Haoran Luo,Jiaming Xu
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Reinforcement learning with verifiable rewards has emerged as a powerful paradigm for training intelligent agents. However, existing methods typically employ binary rewards that fail to capture quality differences among trajectories achieving identical outcomes, thereby overlooking potential diversity within the solution space. Inspired by the "sweet spot" concept in tennis (the racket's core region that produces optimal hitting effects), we introduce Sweet Spot Learning (SSL), a novel framework that provides differentiated guidance for agent optimization. SSL follows a simple yet effective principle: progressively amplified, tiered rewards guide policies toward the sweet-spot region of the solution space. This principle naturally adapts across diverse tasks: visual perception tasks leverage distance-tiered modeling to reward proximity, while complex reasoning tasks reward incremental progress toward promising solutions. We theoretically demonstrate that SSL preserves optimal solution ordering and enhances the gradient signal-to-noise ratio, thereby fostering more directed optimization. Extensive experiments across GUI perception, short/long-term planning, and complex reasoning tasks show consistent improvements over strong baselines on 12 benchmarks, achieving up to 2.5X sample efficiency gains and effective cross-task transferability. Our work establishes SSL as a general principle for training capable and robust agents.
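For the visual-perception instantiation, a distance-tiered reward can be sketched as a simple lookup; the tier boundaries and reward values below are illustrative, not the paper's:

```python
# Illustrative distance-tiered reward in the spirit of SSL: trajectories
# closer to the target (the "sweet spot") fall into higher-reward tiers;
# beyond the last tier the reward is 0.

def tiered_reward(distance, tiers=((0.1, 1.0), (0.3, 0.5), (0.6, 0.2))):
    """tiers: (upper_distance_bound, reward) pairs, nearest tier first."""
    for bound, reward in tiers:
        if distance <= bound:
            return reward
    return 0.0
```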
zh
[NLP-65] FraudShield: Knowledge Graph Empowered Defense for LLMs against Fraud Attacks WWW2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在关键自动化流程中因受到欺诈性信息干扰而导致误判或产生有害输出的问题。现有防御方法在有效性、可解释性和泛化能力方面存在局限,尤其难以适配多样化的LLM应用场景。解决方案的关键在于提出FraudShield框架,其核心创新是构建并迭代优化一个欺诈手法-关键词知识图谱(fraud tactic-keyword knowledge graph),通过捕捉可疑文本与欺诈技术之间的高置信度关联,以结构化方式增强原始输入:一方面标注关键风险词,另一方面提供支撑证据,从而引导LLM生成更安全的响应。该方法显著提升了对五类典型欺诈类型的防护效果,并提供了可解释的推理线索。
链接: https://arxiv.org/abs/2601.22485
作者: Naen Xu,Jinghuai Zhang,Ping He,Chunyi Zhou,Jun Wang,Zhihui Fu,Tianyu Du,Zhaoxiang Wang,Shouling Ji
机构: Zhejiang University (浙江大学); University of California, Los Angeles (加州大学洛杉矶分校); OPPO Research Institute (OPPO研究院)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: WWW 2026
Abstract:Large language models (LLMs) have been widely integrated into critical automated workflows, including contract review and job application processes. However, LLMs are susceptible to manipulation by fraudulent information, which can lead to harmful outcomes. Although advanced defense methods have been developed to address this issue, they often exhibit limitations in effectiveness, interpretability, and generalizability, particularly when applied to LLM-based applications. To address these challenges, we introduce FraudShield, a novel framework designed to protect LLMs from fraudulent content by leveraging a comprehensive analysis of fraud tactics. Specifically, FraudShield constructs and refines a fraud tactic-keyword knowledge graph to capture high-confidence associations between suspicious text and fraud techniques. The structured knowledge graph augments the original input by highlighting keywords and providing supporting evidence, guiding the LLM toward more secure responses. Extensive experiments show that FraudShield consistently outperforms state-of-the-art defenses across four mainstream LLMs and five representative fraud types, while also offering interpretable clues for the model’s generations.
zh
[NLP-66] HeaPA: Difficulty-Aware Heap Sampling and On-Policy Query Augmentation for LLM Reinforcement Learning
【速读】: 该论文旨在解决强化学习用于训练大语言模型(Large Language Models, LLMs)时在推理任务上因rollout生成成本过高而导致的效率瓶颈问题,尤其针对静态或弱动态提示池(prompt pool)导致采样低效、浪费计算资源于已解决或超出当前能力范围的问题。解决方案的关键在于提出HeaPA(Heap Sampling and On-Policy Query Augmentation),其核心机制包括:基于堆结构的边界采样以精准追踪能力前沿、通过轻量异步验证实现在线策略增强的提示池扩展、以及拓扑感知的统计重估与受控再插入策略来稳定相关查询的分布。该方法在多个数据集和基准测试中均实现了更高的准确性与更低的计算开销,且不显著增加运行时间,证明了其高效性和可扩展性。
链接: https://arxiv.org/abs/2601.22448
作者: Weiqi Wang,Xin Liu,Binxuan Huang,Hejie Cui,Rongzhi Zhang,Changlong Yu,Shuowei Jin,Jingfeng Yang,Qingyu Yin,Zhengyang Wang,Zheng Li,Yifan Gao,Priyanka Nigam,Bing Yin,Lihong Li,Yangqiu Song
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:RLVR is now a standard way to train LLMs on reasoning tasks with verifiable outcomes, but when rollout generation dominates the cost, efficiency depends heavily on which prompts you sample and when. In practice, prompt pools are often static or only loosely tied to the model’s learning progress, so uniform sampling can’t keep up with the shifting capability frontier and ends up wasting rollouts on prompts that are already solved or still out of reach. Existing approaches improve efficiency through filtering, curricula, adaptive rollout allocation, or teacher guidance, but they typically assume a fixed pool-which makes it hard to support stable on-policy pool growth-or they add extra teacher cost and latency. We introduce HeaPA (Heap Sampling and On-Policy Query Augmentation), which maintains a bounded, evolving pool, tracks the frontier using heap-based boundary sampling, expands the pool via on-policy augmentation with lightweight asynchronous validation, and stabilizes correlated queries through topology-aware re-estimation of pool statistics and controlled reinsertion. Across two training corpora, two training recipes, and seven benchmarks, HeaPA consistently improves accuracy and reaches target performance with fewer computations while keeping wall-clock time comparable. Our analyses suggest these gains come from frontier-focused sampling and on-policy pool growth, with the benefits becoming larger as model scale increases. Our code is available at this https URL.
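以下为 HeaPA 中"基于堆的边界采样"思想的一个最小示意(非官方实现;以提示成功率距 0.5 的偏差作为能力前沿的代理指标,该指标与数据结构均为假设):成功率接近 1 的提示已被解决、接近 0 的超出能力,二者都排在堆的后面。

```python
import heapq

def frontier_key(success_rate, target=0.5):
    """离目标成功率越近,越靠近能力前沿:既非已解决(~1)也非遥不可及(~0)。"""
    return abs(success_rate - target)

def sample_frontier(pool, k):
    """pool: [(prompt_id, success_rate), ...]。用以前沿距离为键的堆,
    取出最靠近能力前沿的 k 个提示。"""
    heap = [(frontier_key(rate), pid) for pid, rate in pool]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(min(k, len(heap)))]
```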
zh
[NLP-67] AI and My Values: User Perceptions of LLMs' Ability to Extract, Embody, and Explain Human Values from Casual Conversations
【速读】: 该论文试图解决的问题是:当前大型语言模型(Large Language Models, LLMs)在价值对齐(value alignment)方面的表现是否真正反映了人类价值观,以及人类如何感知和评价这种对齐。为应对这一问题,研究者提出了一种名为VAPT(Value-Alignment Perception Toolkit)的工具包,其关键在于通过结构化的人类访谈与行为实验,系统评估LLMs在提取(pull)、体现(embody)和解释(explain)人类价值观三个维度上的能力,并揭示用户对AI价值理解的认知偏差与说服效应。该方案不仅提供可操作的评估框架,还警示了“武器化共情”(weaponized empathy)的风险——即AI可能在表面上展现价值一致性但实际偏离人类福祉,从而推动未来对话式智能体的设计需嵌入透明度、知情同意与安全机制。
链接: https://arxiv.org/abs/2601.22440
作者: Bhada Yun,Renn Su,April Yi Wang
机构: ETH Zürich(苏黎世联邦理工学院); Stanford University(斯坦福大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: To appear in CHI '26
Abstract:Does AI understand human values? While this remains an open philosophical question, we take a pragmatic stance by introducing VAPT, the Value-Alignment Perception Toolkit, for studying how LLMs reflect people’s values and how people judge those reflections. 20 participants texted a human-like chatbot over a month, then completed a 2-hour interview with our toolkit evaluating AI’s ability to extract (pull details regarding), embody (make decisions guided by), and explain (provide proof of) human values. 13 participants left our study convinced that AI can understand human values. Participants found the experience insightful for self-reflection and found themselves getting persuaded by the AI’s reasoning. Thus, we warn about “weaponized empathy”: a potentially dangerous design pattern that may arise in value-aligned, yet welfare-misaligned AI. VAPT offers concrete artifacts and design implications to evaluate and responsibly build value-aligned conversational agents with transparency, consent, and safeguards as AI grows more capable and human-like into the future.
zh
[NLP-68] Stop Jostling: Adaptive Negative Sampling Reduces the Marginalization of Low-Resource Language Tokens by Cross-Entropy Loss COLING2025
【速读】: 该论文旨在解决低资源语言(low-resource languages)在神经语言模型训练中因词汇稀疏而导致的表征学习困难问题,特别是罕见词元(rare tokens)易受边际化(marginalization)影响、难以获得有效对齐的问题。其解决方案的关键在于提出一种阈值化(thresholding)技术,通过限制负采样过程中过度的边际化效应,从而降低对罕见词元的有害影响,使其能够从更有意义的上下文对齐中获益,显著提升模型在低资源语言验证数据上的性能表现。
链接: https://arxiv.org/abs/2601.22439
作者: Galim Turumtaev
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at LoResLM 2025 (COLING 2025 workshop). Oral presentation
Abstract:Neural language models often struggle with low-resource languages due to the limited availability of training data, making tokens from these languages rare in the training set. This paper addresses a specific challenge during training: rare tokens are disproportionately affected by marginalization, which prevents them from learning effectively. We propose a thresholding technique that reduces the impact of this marginalization, allowing rare tokens to benefit from more meaningful alignment. Through experiments with a character-level language model, we demonstrate that this method significantly improves performance on low-resource language validation data. This work is the first to show how negative sampling can be applied to improve the representation of rare tokens by limiting the harmful influence of excessive marginalization, offering a new approach to enhancing language model performance for underrepresented languages.
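下面用一个极简的示意说明"阈值化负采样"的核心思路(非论文官方实现,阈值与学习率均为假设值):当某个罕见词元作为负样本的得分已经足够低时,不再继续压低它,从而限制过度边际化的有害影响。

```python
import math

def thresholded_negative_update(score_neg, threshold=-5.0, lr=0.1):
    """阈值化负采样示意:负样本(非目标)词元的得分只有在高于阈值时
    才被继续下压;阈值与学习率为假设值,仅说明机制。"""
    if score_neg <= threshold:
        return score_neg  # 罕见词元已被充分抑制:跳过本次负更新
    # 标准负采样的梯度步:对得分的 sigmoid 施加下压
    grad = 1.0 / (1.0 + math.exp(-score_neg))  # sigmoid(score)
    return score_neg - lr * grad
```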
zh
[NLP-69] Towards Resiliency in Large Language Model Serving with KevlarFlow
【速读】: 该论文旨在解决大规模语言模型(Large Language Model, LLM)服务系统在超大规模集群中因硬件故障频发而导致的服务中断问题,当前恢复机制耗时过长(可达10分钟),严重影响服务可用性。其解决方案的关键在于提出KevlarFlow架构,通过三个核心机制实现高容错性:1)解耦的模型并行初始化,降低资源重载开销;2)动态流量重路由,保障部分失效下的请求处理连续性;3)后台KV缓存复制,维持推理状态一致性。实验表明,该方案将平均恢复时间(MTTR)缩短20倍,并显著改善延迟指标,包括平均时延提升3.1倍、p99延迟提升2.8倍、首token时间(TTFT)提升达378.9倍,且运行时开销可忽略不计。
链接: https://arxiv.org/abs/2601.22438
作者: Shangshu Qian,Kipling Liu,P. C. Sruthi,Lin Tan,Yongle Zhang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large Language Model (LLM) serving systems remain fundamentally fragile, where frequent hardware faults in hyperscale clusters trigger disproportionate service outages in the software stack. Current recovery mechanisms are prohibitively slow, often requiring up to 10 minutes to reinitialize resources and reload massive model weights. We introduce KevlarFlow, a fault tolerant serving architecture designed to bridge the gap between hardware unreliability and service availability. KevlarFlow leverages 1) decoupled model parallelism initialization, 2) dynamic traffic rerouting, and 3) background KV cache replication to maintain high throughput during partial failures. Our evaluation demonstrates that KevlarFlow reduces mean-time-to-recovery (MTTR) by 20x and, under failure conditions, improves average latency by 3.1x, 99th percentile (p99) latency by 2.8x, average time-to-first-token (TTFT) by 378.9x, and p99 TTFT by 574.6x with negligible runtime overhead in comparison to state-of-the-art LLM serving systems.
zh
[NLP-70] Large Language Model Agents Are Not Always Faithful Self-Evolvers
【速读】: 该论文旨在解决自演化大语言模型(Large Language Model, LLM)代理在持续学习过程中对过往经验的依赖性问题,即“经验忠实性”(experience faithfulness)——即代理决策是否真正因果性地依赖于其被赋予的经验。此前研究假设自演化机制能有效利用历史经验进行改进,但缺乏系统验证。论文通过在原始经验与压缩经验上施加受控因果干预,对四种代表性框架在10种LLM主干和9个环境中的表现进行全面评估,发现一个显著的不对称现象:代理始终依赖原始经验,却常忽视或误读压缩经验,即使后者是唯一可用的信息源。这一差距在单/多智能体配置及不同模型规模下均存在。关键发现指出其根源在于三方面:压缩内容的语义局限、内部处理偏置抑制经验使用、以及预训练先验已足够应对的任务场景。该研究挑战了当前自演化方法中经验整合可靠性的一般假设,并强调需发展更忠实、鲁棒的经验集成机制。
链接: https://arxiv.org/abs/2601.22436
作者: Weixiang Zhao,Yingshuo Wang,Yichen Zhang,Yang Deng,Yanyan Zhao,Wanxiang Che,Bing Qin,Ting Liu
机构: 未知
类目: Computation and Language (cs.CL)
备注: 25 pages, 16 figures, 7 tables
Abstract:Self-evolving large language model (LLM) agents continually improve by accumulating and reusing past experience, yet it remains unclear whether they faithfully rely on that experience to guide their behavior. We present the first systematic investigation of experience faithfulness, the causal dependence of an agent’s decisions on the experience it is given, in self-evolving LLM agents. Using controlled causal interventions on both raw and condensed forms of experience, we comprehensively evaluate four representative frameworks across 10 LLM backbones and 9 environments. Our analysis uncovers a striking asymmetry: while agents consistently depend on raw experience, they often disregard or misinterpret condensed experience, even when it is the only experience provided. This gap persists across single- and multi-agent configurations and across backbone scales. We trace its underlying causes to three factors: the semantic limitations of condensed content, internal processing biases that suppress experience, and task regimes where pretrained priors already suffice. These findings challenge prevailing assumptions about self-evolving methods and underscore the need for more faithful and reliable approaches to experience integration.
zh
[NLP-71] ReNCE: Learning to Reason by Noise Contrastive Estimation
【速读】: 该论文旨在解决当前组相对策略优化(Group Relative Policy Optimization, GRPO)在提升预训练大语言模型(Large Language Models, LLMs)推理能力时所面临的局限性问题,即其依赖软区分机制难以有效识别优劣结果,且需借助经验性改进如非对称裁剪和零方差数据过滤等手段来增强性能,但这些改进往往难以系统性发现与优化。论文提出的关键解决方案是采用显式对比学习(explicit contrastive learning)框架:不再估计优势值,而是将K个候选输出直接划分为正负样本集,并通过最大化正样本的似然概率来优化策略,本质上可视为面向LLM推理任务的在线多标签噪声对比估计(multi-label noise contrastive estimation)实现。此方法简化了策略优化过程,同时保持了与DAPO和在线DPO等强基线相当甚至更优的数学推理性能。
链接: https://arxiv.org/abs/2601.22432
作者: Wenzheng Zhang,Karl Stratos
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:GRPO is a standard approach to endowing pretrained LLMs with reasoning capabilities. It estimates the advantage of an outcome from a group of K outcomes, and promotes those with positive advantages inside a trust region. Since GRPO discriminates between good and bad outcomes softly, it benefits from additional refinements such as asymmetric clipping and zero-variance data filtering. While effective, these refinements require significant empirical insight and can be challenging to identify. We instead propose an explicit contrastive learning approach. Instead of estimating advantages, we bifurcate K outcomes into positive and negative sets, then maximize the likelihood of positive outcomes. Our approach can be viewed as an online instantiation of (multi-label) noise contrastive estimation for LLM reasoning. We validate our method by demonstrating competitive performance on a suite of challenging math benchmarks against strong baselines such as DAPO and online DPO.
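以下代码示意该方法的核心损失:将 K 个候选输出划分为正负集合后,在全体输出的 softmax 下最大化正样本的总概率质量,即多标签 NCE(简化示意,省略了在线采样与策略梯度细节):

```python
import math

def rence_loss(scores, labels):
    """多标签 NCE 示意:scores 为策略对 K 个输出的(对数)打分,
    labels 中 1 表示正确输出。损失 = -log(正样本概率质量之和)。"""
    z = sum(math.exp(s) for s in scores)
    pos = sum(math.exp(s) for s, y in zip(scores, labels) if y == 1)
    return -math.log(pos / z)
```

当 4 个输出打分相同且其中 2 个为正时,损失为 -log(1/2);提高正样本打分则损失单调下降,这正是"促进正样本、抑制负样本"的显式对比目标。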
zh
[NLP-72] Word-Centered Semantic Graphs for Interpretable Diachronic Sense Tracking
【速读】: 该论文旨在解决历时语料库中词语语义演变(semantic shift)的可解释性分析问题,传统方法常依赖预定义的词义标注或黑箱模型,难以捕捉动态语义变化的结构细节。解决方案的关键在于提出一种基于图结构的可解释框架:对每个目标词和时间切片,构建以该词为中心的语义网络,融合历时Skip-gram嵌入的分布相似性与特定时段掩码语言模型(masked language model)的词汇替代性,通过聚类外围节点识别语义簇,并利用节点重叠实现跨时间的语义簇对齐,从而追踪语义簇组成与归一化质量的变化。该方法无需预设词义体系即可揭示语义演化中的聚类动态,如事件驱动的语义替换(trump)、稳定但过度分割的语义结构(god)以及与数字通信相关的渐进关联迁移(post)。
链接: https://arxiv.org/abs/2601.22410
作者: Imene Kolli,Kai-Robin Lange,Jonas Rieger,Carsten Jentsch
机构: University of Zurich (苏黎世大学); TU Dortmund University (多特蒙德工业大学)
类目: Computation and Language (cs.CL)
备注: 20 pages, 16 figures
Abstract:We propose an interpretable, graph-based framework for analyzing semantic shift in diachronic corpora. For each target word and time slice, we induce a word-centered semantic network that integrates distributional similarity from diachronic Skip-gram embeddings with lexical substitutability from time-specific masked language models. We identify sense-related structure by clustering the peripheral graph, align clusters across time via node overlap, and track change through cluster composition and normalized cluster mass. In an application study on a corpus of New York Times Magazine articles (1980 - 2017), we show that graph connectivity reflects polysemy dynamics and that the induced communities capture contrasting trajectories: event-driven sense replacement (trump), semantic stability with cluster over-segmentation effects (god), and gradual association shifts tied to digital communication (post). Overall, word-centered semantic graphs offer a compact and transparent representation for exploring sense evolution without relying on predefined sense inventories.
zh
[NLP-73] Bifocal Attention: Harmonizing Geometric and Spectral Positional Embeddings for Algorithmic Generalization
【速读】: 该论文旨在解决标准旋转位置编码(Rotary Positional Embeddings, RoPE)在处理长距离递归结构时的局限性,即所谓的“频谱刚性”(Spectral Rigidity)问题:RoPE采用固定几何衰减(θ⁻ⁱ)优化局部句法一致性,但无法捕捉递归逻辑和算法推理中固有的长程周期性结构,导致模型在训练时仅能学习浅层推理链,难以外推至更深的递归步骤,形成“结构差距”(Structure Gap)。解决方案的关键在于提出一种名为“双焦点注意力”(Bifocal Attention)的架构范式,将位置编码解耦为两种模态——用于精确词元级操作的“几何眼”(Geometric Eyes,即标准RoPE)和用于追踪长程递归深度的“频谱眼”(Spectral Eyes,即可学习的谐波算子);同时设计了一种新颖的训练协议“频谱演化”(Spectral Evolution),初始时将位置频率设为静态几何参数,随后通过梯度下降使其演化为针对特定算法拓扑优化的谐波基底。
链接: https://arxiv.org/abs/2601.22402
作者: Kanishk Awadhiya
机构: Indian Institute of Technology, Delhi (印度理工学院德里分校)
类目: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
备注:
Abstract:Rotary Positional Embeddings (RoPE) have become the standard for Large Language Models (LLMs) due to their ability to encode relative positions through geometric rotation. However, we identify a significant limitation we term "Spectral Rigidity": standard RoPE utilizes a fixed geometric decay (θ⁻ⁱ) optimized for local syntactic coherence, which fails to capture the long-range, periodic structures inherent in recursive logic and algorithmic reasoning. This results in a "Structure Gap", where models trained on shallow reasoning chains fail to extrapolate to deeper recursive steps. In this work, we introduce Bifocal Attention, an architectural paradigm that decouples positional encoding into two distinct modalities: Geometric Eyes (Standard RoPE) for precise token-level manipulation, and Spectral Eyes (Learnable Harmonic Operators) for tracking long-range recursive depth. We propose a novel training protocol, Spectral Evolution, which initializes positional frequencies as static geometric parameters but allows them to evolve via gradient descent into a harmonic basis optimized for the specific algorithmic topology of the task.
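下面的示意代码对比了标准 RoPE 的固定几何频率与"频谱演化"的初始化方式(非官方实现;仅说明"以几何频率初始化、随后作为可学习参数"这一要点,函数名为假设):

```python
import numpy as np

def rope_frequencies(dim, theta=10000.0):
    """标准 RoPE:固定几何衰减频率 theta^(-2i/dim),i 取 0..dim/2-1。"""
    i = np.arange(dim // 2)
    return theta ** (-2.0 * i / dim)

def init_spectral_frequencies(dim, theta=10000.0):
    """Spectral Eyes 的初始化示意:从同一几何调度出发,但作为可训练参数
    ("频谱演化"),之后由梯度下降重塑为匹配任务周期性的谐波基底。"""
    return rope_frequencies(dim, theta).copy()  # 可学习的副本,初值相同
```

几何频率严格单调递减(偏向局部),而可学习频率在训练后不必保持这一结构,这正是缓解"频谱刚性"的出发点。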
zh
[NLP-74] Culturally Grounded Personas in Large Language Models: Characterization and Alignment with Socio-Psychological Value Frameworks
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)生成的合成人格(synthetic personas)在跨文化情境下是否准确反映世界价值观与道德体系的问题。其核心挑战在于评估这些由LLM生成的人格是否能够真实映射不同文化背景下的价值结构和道德取向。解决方案的关键在于构建一种基于可解释的世界价值观调查(World Values Survey, WVS)变量的文化锚定人格生成框架,并通过三个互补维度进行验证:一是利用Inglehart-Welzel文化地图定位人格的文化坐标,揭示其对稳定文化差异的刻画能力;二是检验生成人格在人口统计层面与WVS人类群体响应分布的一致性;三是基于道德基础理论(Moral Foundations Theory)构建道德画像,结合文化到道德的映射关系,分析道德反应如何随文化配置而变化。这一方法实现了对跨文化结构和道德变异性的系统性评估。
链接: https://arxiv.org/abs/2601.22396
作者: Candida M. Greco,Lucio La Cava,Andrea Tagarelli
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Physics and Society (physics.soc-ph)
备注:
Abstract:Despite the growing utility of Large Language Models (LLMs) for simulating human behavior, the extent to which these synthetic personas accurately reflect world and moral value systems across different cultural conditionings remains uncertain. This paper investigates the alignment of synthetic, culturally-grounded personas with established frameworks, specifically the World Values Survey (WVS), the Inglehart-Welzel Cultural Map, and Moral Foundations Theory. We conceptualize and produce LLM-generated personas based on a set of interpretable WVS-derived variables, and we examine the generated personas through three complementary lenses: positioning on the Inglehart-Welzel map, which unveils their interpretation reflecting stable differences across cultural conditionings; demographic-level consistency with the World Values Survey, where response distributions broadly track human group patterns; and moral profiles derived from a Moral Foundations questionnaire, which we analyze through a culture-to-morality mapping to characterize how moral responses vary across different cultural configurations. Our approach of culturally-grounded persona generation and analysis enables evaluation of cross-cultural structure and moral variation.
zh
[NLP-75] Specialists or Generalists? Multi-Agent and Single-Agent LLMs for Essay Grading
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在自动作文评分(Automated Essay Scoring, AES)系统中,不同架构设计如何影响其在各类作文质量水平下的性能表现这一问题。解决方案的关键在于对比单智能体(single-agent)与多智能体(multi-agent)架构:多智能体系统将评分任务分解为内容(Content)、结构(Structure)和语言(Language)三个专业代理,并由主席代理(Chairman Agent)依据评分量规(rubric-aligned logic)协调决策,包括否决规则(veto rules)和分数上限控制(score capping),从而显著提升对低质量作文的识别能力;而单智能体架构则在中等质量作文上表现更优,且两者均在高质量作文上表现受限。研究发现,少量示例微调(few-shot calibration)是提升性能的核心因素,仅需每分值等级两个样例即可使QWK指标提升约26%。
链接: https://arxiv.org/abs/2601.22386
作者: Jamiu Adekunle Idowu,Ahmed Almasoud
机构: Sahel AI(萨赫尔人工智能); Sahel Group Inc.(萨赫尔集团公司); University College London (UCL)(伦敦大学学院); Prince Sultan University(沙特王子苏丹大学)
类目: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:
Abstract:Automated essay scoring (AES) systems increasingly rely on large language models, yet little is known about how architectural choices shape their performance across different essay quality levels. This paper evaluates single-agent and multi-agent LLM architectures for essay grading using the ASAP 2.0 corpus. Our multi-agent system decomposes grading into three specialist agents (Content, Structure, Language) coordinated by a Chairman Agent that implements rubric-aligned logic including veto rules and score capping. We test both architectures in zero-shot and few-shot conditions using GPT-5.1. Results show that the multi-agent system is significantly better at identifying weak essays while the single-agent system performs better on mid-range essays. Both architectures struggle with high-quality essays. Critically, few-shot calibration emerges as the dominant factor in system performance – providing just two examples per score level improves QWK by approximately 26% for both architectures. These findings suggest architectural choice should align with specific deployment priorities, with multi-agent AI particularly suited for diagnostic screening of at-risk students, while single-agent models provide a cost-effective solution for general assessment.
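下面以一个极简函数示意主席代理(Chairman Agent)的否决规则与分数上限逻辑(非论文官方实现;分数按 1-6 等级示意,否决阈值与上限差值均为假设):

```python
def chairman_score(content, structure, language, veto_threshold=1, cap_gap=2):
    """主席代理聚合三个专家分(内容/结构/语言,1-6 分制示意)。
    否决规则:任一维度不高于 veto_threshold 时直接判最低分;
    分数上限:最终分不得超过最弱维度加 cap_gap。阈值均为假设值。"""
    scores = (content, structure, language)
    if min(scores) <= veto_threshold:
        return 1  # 否决:单一维度严重不合格即整体不合格
    mean = round(sum(scores) / 3)
    return min(mean, min(scores) + cap_gap)  # 分数上限控制
```

例如内容、结构均为 6 但语言仅 1 时触发否决;语言为 2 时虽不否决,最终分也被压到 4 以内,避免均值掩盖短板。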
zh
[NLP-76] SP2DPO: An LLM-assisted Semantic Per-Pair DPO Generalization
【速读】: 该论文旨在解决传统直接偏好优化(Direct Preference Optimization, DPO)方法在处理异质性偏好数据时的局限性问题,即其使用单一全局温度参数 β 来平衡拟合偏好标签与保持与参考模型接近度,忽略了不同偏好对之间的信号强度差异(如高信噪比的安全性违规 vs. 低信噪比的主观风格偏好)以及标注噪声的影响。解决方案的关键在于提出 SP2DPO(Semantic Per-Pair DPO),通过引入基于结构化语义差距标注(类别、幅度、置信度)预定义的实例级温度调度 β_i,使每个偏好对可独立调整优化强度,从而更精准地响应不同类型偏好信号。该方法在 UltraFeedback 数据集上实现了大规模可审计的 β_i 构建,并在 AlpacaEval 2.0 上验证了其有效性:在不增加训练开销的前提下,提升了部分模型的长度控制胜率,且无需针对每种模型进行超参调优。
链接: https://arxiv.org/abs/2601.22385
作者: Chaoyue He,Xin Zhou,Di Wang,Hong Xu,Wei Liu,Chunyan Miao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 39 pages, 15 figures, 16 tables, 60 equations
Abstract:Direct Preference Optimization (DPO) controls the trade-off between fitting preference labels and staying close to a reference model using a single global temperature beta, implicitly treating all preference pairs as equally informative. Real-world preference corpora are heterogeneous: they mix high-signal, objective failures (for example, safety, factuality, instruction violations) with low-signal or subjective distinctions (for example, style), and also include label noise. We introduce our method, SP2DPO (Semantic Per-Pair DPO), a generalization that replaces the global temperature with an instance-specific schedule beta_i pre-decided offline from structured semantic-gap annotations (category, magnitude, confidence) produced by teacher language models. We instantiate this procedure on the UltraFeedback preference corpus (59,960 pairs), enabling large-scale construction of an auditable beta_i artifact, and incur zero training-time overhead: the inner-loop optimizer remains standard DPO with beta set per pair. We focus our empirical study on AlpacaEval 2.0, reporting both raw win rate and length-controlled win rate. Across four open-weight, instruction-tuned student backbones (4B-8B), SP2DPO is competitive with a tuned global-beta DPO baseline and improves AlpacaEval 2.0 length-controlled win rate on two of four backbones, while avoiding per-model beta sweeps. All code, annotations, and artifacts will be released.
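以下代码示意 SP2DPO 的关键:内层优化仍是标准 DPO 损失,只是每个偏好对携带离线预先确定的 beta_i(简化示意,省略了模型前向与语义差距标注流程;logr_* 表示策略与参考模型的对数概率比):

```python
import math

def dpo_loss(logr_chosen, logr_rejected, beta):
    """单对标准 DPO 损失:-log sigmoid(beta * (logr_chosen - logr_rejected)),
    其中 logr_* = log pi(y|x) - log pi_ref(y|x)。"""
    margin = beta * (logr_chosen - logr_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def sp2dpo_batch_loss(pairs):
    """SP2DPO 示意:pairs 为 [(logr_chosen, logr_rejected, beta_i), ...],
    beta_i 由离线语义差距标注(类别/幅度/置信度)预先决定,
    训练内层与标准 DPO 完全一致,零额外开销。"""
    return sum(dpo_loss(c, r, b) for c, r, b in pairs) / len(pairs)
```

对同一个排序正确的偏好对,较大的 beta_i(高信号对,如安全性违规)比较小的 beta_i(主观风格对)施加更强的优化压力。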
zh
[NLP-77] SPLA: Block Sparse Plus Linear Attention for Long Context Modeling
【速读】: 该论文旨在解决块级稀疏注意力(block-wise sparse attention)在长上下文建模中因完全丢弃未选块而导致的选择 fidelity 低和累积上下文损失的问题。其解决方案的关键在于提出 Sparse Plus Linear Attention (SPLA) 框架:首先利用二阶泰勒展开推导的选择度量精准识别需进行精确注意力计算的相关块;其次,通过残差线性注意力(residual linear attention, RLA)模块将未选块压缩为紧凑的递归状态,从而保留“长尾”信息;尤为重要的是,为避免输入输出(I/O)开销,作者推导出一种基于减法优化的 RLA 形式——将残差表示为全局线性注意力与选定块线性注意力之差,确保推理过程中无需显式访问未选块,显著提升效率并保持性能优势。
链接: https://arxiv.org/abs/2601.22379
作者: Bailin Wang,Dan Friedman,Tao Lei,Chong Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注: v1
Abstract:Block-wise sparse attention offers significant efficiency gains for long-context modeling, yet existing methods often suffer from low selection fidelity and cumulative contextual loss by completely discarding unselected blocks. To address these limitations, we introduce Sparse Plus Linear Attention (SPLA), a framework that utilizes a selection metric derived from second-order Taylor expansions to accurately identify relevant blocks for exact attention. Instead of discarding the remaining “long tail,” SPLA compresses unselected blocks into a compact recurrent state via a residual linear attention (RLA) module. Crucially, to avoid IO overhead, we derive an optimized subtraction-based formulation for RLA – calculating the residual as the difference between global and selected linear attention – ensuring that unselected blocks are never explicitly accessed during inference. Our experiments demonstrate that SPLA closes the performance gap in continual pretraining, surpassing dense attention models on long-context benchmarks like RULER while maintaining competitive general knowledge and reasoning capabilities.
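下面用 NumPy 验证 SPLA 中残差线性注意力(RLA)的减法形式:未选块的线性注意力状态等于全局状态减去选定块的状态,因此推理时无需重新读取未选块(简化示意,使用未归一化的线性注意力状态 sum_i k_i v_i^T,按行近似"块"):

```python
import numpy as np

def linear_attention_state(K, V):
    """未归一化线性注意力状态:sum_i k_i v_i^T,即 K^T V。"""
    return K.T @ V

def residual_state(K, V, selected):
    """SPLA 的减法技巧示意:'长尾'(未选部分)的状态
    = 全局状态 - 选定部分状态,从而避免显式访问未选的键值。"""
    return linear_attention_state(K, V) - linear_attention_state(K[selected], V[selected])

# 一致性检查:残差应等于直接对未选行计算的状态
rng = np.random.default_rng(0)
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 4))
tail_ok = np.allclose(residual_state(K, V, [0, 2]),
                      linear_attention_state(K[[1, 3, 4, 5]], V[[1, 3, 4, 5]]))
```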
zh
[NLP-78] Stability-Aware Prompt Optimization for Clinical Data Abstraction
【速读】: 该论文旨在解决临床大语言模型(Large Language Models, LLMs)在医疗信息抽取任务中对提示词(prompt)敏感性问题,即相同任务下不同表述的提示可能导致模型输出显著波动,从而影响其在真实临床场景中的可靠性。现有研究通常将提示视为固定参数,仅关注模型本身的不确定性,忽视了提示设计与模型性能之间的耦合关系。论文提出一个双目标提示优化循环(dual-objective prompt optimization loop),其关键在于联合优化准确率和提示稳定性——通过引入稳定性约束项,在保证基本性能的同时显著降低因提示改写导致的输出翻转率(flip rates),从而提升模型在实际应用中的鲁棒性和可解释性。
链接: https://arxiv.org/abs/2601.22373
作者: Arinbjörn Kolbeinsson,Daniel Timbie,Sajjan Narsinghani,Sanjay Hariharan
机构: Century Health
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models used for clinical abstraction are sensitive to prompt wording, yet most work treats prompts as fixed and studies uncertainty in isolation. We argue these should be treated jointly. Across two clinical tasks (MedAlign applicability/correctness and MS subtype abstraction) and multiple open and proprietary models, we measure prompt sensitivity via flip rates and relate it to calibration and selective prediction. We find that higher accuracy does not guarantee prompt stability, and that models can appear well-calibrated yet remain fragile to paraphrases. We propose a dual-objective prompt optimization loop that jointly targets accuracy and stability, showing that explicitly including a stability term reduces flip rates across tasks and models, sometimes at modest accuracy cost. Our results suggest prompt sensitivity should be an explicit objective when validating clinical LLM systems.
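翻转率(flip rate)的计算本身很直接,下面给出一个示意实现(非论文官方代码):同一样本在不同提示改写(paraphrase)下预测不一致即记为一次翻转。

```python
def flip_rate(predictions):
    """predictions: 每个样本在若干提示改写下得到的标签列表。
    任一样本的改写间标签不一致即计为翻转,返回翻转样本的比例。"""
    flips = sum(1 for labels in predictions if len(set(labels)) > 1)
    return flips / len(predictions)
```

正如论文所指出的,准确率高并不保证翻转率低,因此两者应作为双目标共同纳入提示优化。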
zh
[NLP-79] Context Structure Reshapes the Representational Geometry of Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在上下文学习(In-Context Learning, ICL)过程中,其内部表示轨迹是否发生“直线化”(representation straightening)这一问题,即探究模型如何通过调整神经表征来适应不同任务结构。解决方案的关键在于系统性地测量Gemma 2模型在多种不同类型任务中的表示直度变化:发现LLMs在连续预测任务中会随着上下文增长而增强轨迹直线化,并与预测性能提升相关;而在结构化预测任务中,直线化仅出现在具有显式模板结构的阶段,其余部分则消失。这表明ICL并非单一机制,而是依赖于任务结构动态选择策略的过程,模型如同瑞士军刀般灵活切换不同计算模式,其中部分策略可引发表示直线化。
链接: https://arxiv.org/abs/2601.22364
作者: Eghbal A. Hosseini,Yuxuan Li,Yasaman Bahri,Declan Campbell,Andrew Kyle Lampinen
机构: Google DeepMind(谷歌深度思维); Princeton Neuroscience Institute, Princeton University(普林斯顿大学神经科学研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have been shown to organize the representations of input sequences into straighter neural trajectories in their deep layers, which has been hypothesized to facilitate next-token prediction via linear extrapolation. Language models can also adapt to diverse tasks and learn new structure in context, and recent work has shown that this in-context learning (ICL) can be reflected in representational changes. Here we bring these two lines of research together to explore whether representation straightening occurs within a context during ICL. We measure representational straightening in Gemma 2 models across a diverse set of in-context tasks, and uncover a dichotomy in how LLMs' representations change in context. In continual prediction settings (e.g., natural language, grid world traversal tasks) we observe that increasing context increases the straightness of neural sequence trajectories, which is correlated with improvement in model prediction. Conversely, in structured prediction settings (e.g., few-shot tasks), straightening is inconsistent: it is only present in phases of the task with explicit structure (e.g., repeating a template), but vanishes elsewhere. These results suggest that ICL is not a monolithic process. Instead, we propose that LLMs function like a Swiss Army knife: depending on task structure, the LLM dynamically selects between strategies, only some of which yield representational straightening.
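表示轨迹"直度"的一种常见量化方式是相邻差分向量间的平均余弦相似度(论文可能采用相关的曲率度量,此处仅作示意):完全直线的轨迹取 1,直角转折取 0。

```python
import numpy as np

def straightness(traj):
    """轨迹直度示意:对表示序列 traj(逐 token 的向量)求相邻差分向量的
    单位化余弦相似度均值;1.0 表示完全直线。"""
    diffs = np.diff(np.asarray(traj, dtype=float), axis=0)
    diffs /= np.linalg.norm(diffs, axis=1, keepdims=True)
    cos = (diffs[:-1] * diffs[1:]).sum(axis=1)
    return float(cos.mean())
```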
zh
[NLP-80] MERMAID: Memory-Enhanced Retrieval and Reasoning with Multi-Agent Iterative Knowledge Grounding for Veracity Assessment
【速读】: 该论文旨在解决现有自动化 veracity assessment(真实性评估)系统中证据检索与推理过程脱节的问题,即当前方法通常将证据检索视为静态、孤立的步骤,缺乏对已检索证据的有效管理和跨声明复用,导致搜索效率低且验证一致性差。解决方案的关键在于提出 MERMAID 框架,其核心创新是通过一个记忆增强的多智能体架构,将检索与推理以“Reason-Action”式的迭代方式紧密耦合:该框架集成代理驱动的搜索、结构化知识表示和持久化证据记忆模块,在动态获取证据的同时支持跨声明的知识复用,从而显著减少冗余搜索并提升验证效率与一致性。
链接: https://arxiv.org/abs/2601.22361
作者: Yupeng Cao,Chengyang He,Yangyang Yu,Ping Wang,K.P. Subbalakshmi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Assessing the veracity of online content has become increasingly critical. Large language models (LLMs) have recently enabled substantial progress in automated veracity assessment, including automated fact-checking and claim verification systems. Typical veracity assessment pipelines break down complex claims into sub-claims, retrieve external evidence, and then apply LLM reasoning to assess veracity. However, existing methods often treat evidence retrieval as a static, isolated step and do not effectively manage or reuse retrieved evidence across claims. In this work, we propose MERMAID, a memory-enhanced multi-agent veracity assessment framework that tightly couples the retrieval and reasoning processes. MERMAID integrates agent-driven search, structured knowledge representations, and a persistent memory module within a Reason-Action style iterative process, enabling dynamic evidence acquisition and cross-claim evidence reuse. By retaining retrieved evidence in an evidence memory, the framework reduces redundant searches and improves verification efficiency and consistency. We evaluate MERMAID on three fact-checking benchmarks and two claim-verification datasets using multiple LLMs, including GPT, LLaMA, and Qwen families. Experimental results show that MERMAID achieves state-of-the-art performance while improving the search efficiency, demonstrating the effectiveness of synergizing retrieval, reasoning, and memory for reliable veracity assessment.
zh
[NLP-81] Why Reasoning Fails to Plan: A Planning-Centric Analysis of Long-Horizon Decision Making in LLM Agents
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)驱动的智能体在长周期规划任务中行为不连贯的问题,其核心症结在于:基于逐步推理的策略本质上是一种贪心策略,在短时 horizon 中有效,但在长时规划中因未能考虑早期动作的延迟后果而失效。解决方案的关键在于引入 FLARE(Future-aware Lookahead with Reward Estimation),该方法通过在单一模型中实现显式前瞻(future-aware lookahead)、价值传播(value propagation)和有限承诺(limited commitment),使下游结果能够反向影响早期决策,从而打破局部最优导致的早期短视承诺陷阱,显著提升任务性能与规划质量。
链接: https://arxiv.org/abs/2601.22311
作者: Zehong Wang,Fang Wu,Hongru Wang,Xiangru Tang,Bolian Li,Zhenfei Yin,Yijun Ma,Yiyang Li,Weixiang Sun,Xiusi Chen,Yanfang Ye
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large language model (LLM)-based agents exhibit strong step-by-step reasoning capabilities over short horizons, yet often fail to sustain coherent behavior over long planning horizons. We argue that this failure reflects a fundamental mismatch: step-wise reasoning induces a form of step-wise greedy policy that is adequate for short horizons but fails in long-horizon planning, where early actions must account for delayed consequences. From this planning-centric perspective, we study LLM-based agents in deterministic, fully structured environments with explicit state transitions and evaluation signals. Our analysis reveals a core failure mode of reasoning-based policies: locally optimal choices induced by step-wise scoring lead to early myopic commitments that are systematically amplified over time and difficult to recover from. We introduce FLARE (Future-aware Lookahead with Reward Estimation) as a minimal instantiation of future-aware planning to enforce explicit lookahead, value propagation, and limited commitment in a single model, allowing downstream outcomes to influence early decisions. Across multiple benchmarks, agent frameworks, and LLM backbones, FLARE consistently improves task performance and planning-level behavior, frequently allowing LLaMA-8B with FLARE to outperform GPT-4o with standard step-by-step reasoning. These results establish a clear distinction between reasoning and planning.
zh
[NLP-82] Prepare Reasoning Language Models for Multi-Agent Debate with Self-Debate Reinforcement Learning
【速读】: 该论文旨在解决当前基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)方法在训练大语言模型(Large Language Models, LLMs)时,通常仅关注孤立问题求解,而未显式训练模型从多智能体辩论(Multi-Agent Debate, MAD)中整合和利用多样推理路径的问题。解决方案的关键在于提出自辩强化学习(Self-Debate Reinforcement Learning, SDRL),其核心机制是:在给定提示(prompt)后,首先采样多个候选解,随后构建包含多样化推理路径的辩论上下文,并生成受此上下文条件约束的第二轮响应;最终通过联合优化初始响应与辩论条件响应,使单一模型兼具独立求解能力与辩论参与能力,从而在MAD框架下提升整体性能的同时增强单模型推理表现。
链接: https://arxiv.org/abs/2601.22297
作者: Chenxi Liu,Yanshuo Chen,Ruibo Chen,Tianyi Xiong,Tong Zheng,Heng Huang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:The reasoning abilities of large language models (LLMs) have been substantially improved by reinforcement learning with verifiable rewards (RLVR). At test time, collaborative reasoning through Multi-Agent Debate (MAD) has emerged as a promising approach for enhancing LLM performance. However, current RLVR methods typically train LLMs to solve problems in isolation, without explicitly preparing them to synthesize and benefit from different rationales that arise during debate. In this work, we propose Self-Debate Reinforcement Learning (SDRL), a training framework that equips a single LLM with strong standalone problem-solving ability and the capability to learn from diverse reasoning trajectories in MAD. Given a prompt, SDRL first samples multiple candidate solutions, then constructs a debate context with diverse reasoning paths and generates second-turn responses conditioned on this context. Finally, SDRL jointly optimizes both the initial and debate-conditioned responses, yielding a model that is effective as both a standalone solver and a debate participant. Experiments across multiple base models and reasoning benchmarks show that SDRL improves overall MAD performance while simultaneously strengthening single model reasoning.
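摘要描述的 SDRL 两阶段采样结构可以示意如下:第一轮对同一提示独立采样多个候选解,第二轮在拼接了全部候选的辩论上下文下再生成回答,两轮回答共同参与优化。这里的 generate 为任意“文本到文本”的采样函数(示例用桩函数代替真实 LLM),提示词格式为假设,并非论文原实现。

```python
def self_debate_rollout(prompt, generate, k=3):
    """SDRL 两阶段采样示意:先独立采样候选解,再在辩论上下文下采样第二轮回答。"""
    first_turn = [generate(prompt) for _ in range(k)]        # 第一轮:独立候选解
    debate_ctx = (prompt + "\n以下是其他求解者给出的推理:\n" +
                  "\n".join(f"[{i}] {sol}" for i, sol in enumerate(first_turn)))
    second_turn = [generate(debate_ctx) for _ in range(k)]   # 第二轮:辩论条件生成
    return first_turn, second_turn, debate_ctx

# 桩采样器:仅演示数据流,把输入长度编进回答以便观察
stub = lambda p: f"ans[{len(p)}]"
first, second, ctx = self_debate_rollout("Q: 1+1=?", stub, k=2)
```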
zh
[NLP-83] JAF: Judge Agent Forest
【速读】: 该论文旨在解决传统判官代理(judge agent)在评估生成式 AI(Generative AI)系统输出时存在的局部性局限问题,即现有方法通常孤立地评价单个响应,难以捕捉跨实例的模式与不一致性,从而限制了主代理(primary agent)的自我改进能力。解决方案的关键在于提出 JAF(Judge Agent Forest)框架:通过让判官代理对一组由主代理生成的查询-响应对进行联合推理(joint inference),使判官从局部评估者转变为全局学习者;其核心机制包括利用重叠的上下文邻域构建知识图谱结构以促进批判信息传播,并借助随机化重复评估形成上下文敏感的集成判断;此外,引入一种可学习的局部敏感哈希(LSH)算法,结合语义嵌入、LLM驱动的哈希谓词、类别标签监督及辅助信息,生成可解释且关系感知的二进制编码,实现高效多样化的同伴示例选择,进而优化思维链(CoT)推理路径的探索效率。
链接: https://arxiv.org/abs/2601.22269
作者: Sahil Garg,Brad Cheezum,Sridhar Dutta,Vishal Agarwal
机构: Averlon(艾弗隆)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Judge agents are fundamental to agentic AI frameworks: they provide automated evaluation, and enable iterative self-refinement of reasoning processes. We introduce JAF: Judge Agent Forest, a framework in which the judge agent conducts joint inference across a cohort of query–response pairs generated by a primary agent, rather than evaluating each in isolation. This paradigm elevates the judge from a local evaluator to a holistic learner: by simultaneously assessing related responses, the judge discerns cross-instance patterns and inconsistencies, whose aggregate feedback enables the primary agent to improve by viewing its own outputs through the judge’s collective perspective. Conceptually, JAF bridges belief propagation and ensemble-learning principles: overlapping in-context neighborhoods induce a knowledge-graph structure that facilitates propagation of critique, and repeated, randomized evaluations yield a robust ensemble of context-sensitive judgments. JAF can be instantiated entirely via ICL, with the judge prompted for each query using its associated primary-agent response plus a small, possibly noisy set of peer exemplars. While kNN in embedding space is a natural starting point for exemplars, this approach overlooks categorical structure, domain metadata, or nuanced distinctions accessible to modern LLMs. To overcome these limitations, we develop a flexible locality-sensitive hashing (LSH) algorithm that learns informative binary codes by integrating semantic embeddings, LLM-driven hash predicates, supervision from categorical labels, and relevant side information. These hash codes support efficient, interpretable, and relation-aware selection of diverse exemplars, and further optimize exploration of CoT reasoning paths. We validate JAF with an empirical study on the demanding task of cloud misconfigs triage in large-scale cloud environments. 
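论文中的 LSH 是融合语义嵌入、LLM 哈希谓词与类别监督的可学习版本;作为参照,下面给出其扩展的经典基线,即随机超平面 LSH 的最小示意:每一位记录嵌入向量落在一张随机超平面的哪一侧,角度相近的向量汉明距离小,可用于快速筛选同伴示例。向量维度与位数均为演示用假设。

```python
import numpy as np

def lsh_codes(X, n_bits, seed=0):
    """随机超平面 LSH:对嵌入矩阵 X (n, d) 生成 n_bits 位二进制码。"""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], n_bits))  # 随机超平面法向量
    return (X @ planes > 0).astype(np.uint8)            # 每位 = 落在超平面哪一侧

def hamming(a, b):
    """汉明距离:二进制码之间的相似度度量。"""
    return int(np.sum(a != b))

rng = np.random.default_rng(42)
x = rng.standard_normal(32)
X = np.stack([x, x + 1e-3 * rng.standard_normal(32), -x])  # 原向量、微扰、反向
codes = lsh_codes(X, 64)
```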
zh
[NLP-84] Predicting Intermittent Job Failure Categories for Diagnosis Using Few-Shot Fine-Tuned Language Models
【速读】: 该论文旨在解决持续集成(Continuous Integration, CI)流水线中间歇性(flaky)作业失败的诊断难题,这类失败由非确定性测试、网络中断、基础设施故障等因素引发,导致计算资源浪费和开发人员大量时间被占用。解决方案的关键在于提出FlaXifyer——一种基于预训练语言模型的少样本学习方法,仅需作业执行日志即可预测失败类别,实现84.3%的Macro F1和92.0%的Top-2准确率,且每类仅需12个标注样本;同时引入LogSift可解释性技术,在不到一秒内识别关键日志语句,将审查工作量减少74.4%,并在87%的情况下有效定位失败信息,从而实现高效自动分类与诊断,为自动化修复间歇性失败奠定基础。
链接: https://arxiv.org/abs/2601.22264
作者: Henri Aïdasso,Francis Bordeleau,Ali Tizghadam
机构: École de technologie supérieure (École de technologie supérieure); TELUS (TELUS)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:In principle, Continuous Integration (CI) pipeline failures provide valuable feedback to developers on code-related errors. In practice, however, pipeline jobs often fail intermittently due to non-deterministic tests, network outages, infrastructure failures, resource exhaustion, and other reliability issues. These intermittent (flaky) job failures lead to substantial inefficiencies: wasted computational resources from repeated reruns and significant diagnosis time that distracts developers from core activities and often requires intervention from specialized teams. Prior work has proposed machine learning techniques to detect intermittent failures, but does not address the subsequent diagnosis challenge. To fill this gap, we introduce FlaXifyer, a few-shot learning approach for predicting intermittent job failure categories using pre-trained language models. FlaXifyer requires only job execution logs and achieves 84.3% Macro F1 and 92.0% Top-2 accuracy with just 12 labeled examples per category. We also propose LogSift, an interpretability technique that identifies influential log statements in under one second, reducing review effort by 74.4% while surfacing relevant failure information in 87% of cases. Evaluation on 2,458 job failures from TELUS demonstrates that FlaXifyer and LogSift enable effective automated triage, accelerate failure diagnosis, and pave the way towards the automated resolution of intermittent job failures.
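摘要报告的两个指标(Macro F1 与 Top-2 准确率)可按如下方式计算;示例中的失败类别名称为演示用假设,并非论文的真实标签集。

```python
def macro_f1(y_true, y_pred, labels):
    """Macro F1:逐类别计算 F1 后取未加权平均,适合类别不均衡的失败分类。"""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def top2_accuracy(y_true, ranked):
    """Top-2 准确率:真实类别落在置信度前两位即记为命中。"""
    return sum(t in r[:2] for t, r in zip(y_true, ranked)) / len(y_true)
```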
zh
[NLP-85] A Systematic Literature Review on LLM Defenses Against Prompt Injection and Jailbreaking: Expanding NIST Taxonomy
【速读】: 该论文旨在解决生成式人工智能(Generative AI)和大语言模型(Large Language Models, LLMs)在实际应用中因提示注入(prompt injection)攻击所引发的安全漏洞问题,包括越狱(jailbreaking)等恶意输入导致的数据泄露、未经授权操作或输出被篡改等风险。其解决方案的关键在于通过系统性文献综述方法,对88项相关研究进行归纳与分析,提出一个扩展的防御分类体系以补充NIST关于对抗机器学习的现有分类框架,并构建一个包含定量效果评估、开源状态及模型无关性的全面防御策略目录,从而为研究人员和开发者提供标准化、可复用且实用的防御指南,推动对抗性机器学习领域的安全实践发展。
链接: https://arxiv.org/abs/2601.22240
作者: Pedro H. Barcha Correia,Ryan W. Achjian,Diego E. G. Caetano de Oliveira,Ygor Acacio Maria,Victor Takashi Hayashi,Marcos Lopes,Charles Christian Miers,Marcos A. Simplicio Jr
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 27 pages, 14 figures, 11 tables, submitted to Elsevier Computer Science Review
Abstract:The rapid advancement and widespread adoption of generative artificial intelligence (GenAI) and large language models (LLMs) has been accompanied by the emergence of new security vulnerabilities and challenges, such as jailbreaking and other prompt injection attacks. These maliciously crafted inputs can exploit LLMs, causing data leaks, unauthorized actions, or compromised outputs, for instance. As both offensive and defensive prompt injection techniques evolve quickly, a structured understanding of mitigation strategies becomes increasingly important. To address that, this work presents the first systematic literature review on prompt injection mitigation strategies, comprehending 88 studies. Building upon NIST’s report on adversarial machine learning, this work contributes to the field through several avenues. First, it identifies studies beyond those documented in NIST’s report and other academic reviews and surveys. Second, we propose an extension to NIST taxonomy by introducing additional categories of defenses. Third, by adopting NIST’s established terminology and taxonomy as a foundation, we promote consistency and enable future researchers to build upon the standardized taxonomy proposed in this work. Finally, we provide a comprehensive catalog of the reviewed prompt injection defenses, documenting their reported quantitative effectiveness across specific LLMs and attack datasets, while also indicating which solutions are open-source and model-agnostic. This catalog, together with the guidelines presented herein, aims to serve as a practical resource for researchers advancing the field of adversarial machine learning and for developers seeking to implement effective defenses in production systems.
zh
[NLP-86] Lost in Space? Vision-Language Models Struggle with Relative Camera Pose Estimation
【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在3D空间结构理解上的局限性问题,特别是其相较于2D感知与语义推理能力的显著不足。为系统评估这一差距,作者提出两个关键解决方案:一是构建VRRPI-Bench,一个基于未标注第一人称视频并附带描述性注释的基准数据集,用于模拟真实场景中围绕共同物体的复合平移与旋转运动;二是设计VRRPI-Diag,一个诊断性基准,可分离并独立评估各自由度的空间运动(如深度变化和绕光轴的滚转)。实验表明,即使最先进的VLMs(如GPT-5)在相对相机位姿估计(Relative Camera Pose Estimation, RCPE)任务上表现远逊于经典几何基线(0.64 vs. 0.97)和人类水平(0.92),且多帧空间线索整合能力弱(最佳仅59.7%),揭示了当前VLMs在3D空间建模与多视角推理方面的根本缺陷。
链接: https://arxiv.org/abs/2601.22228
作者: Ken Deng,Yifu Qiu,Yoni Kasten,Shay B. Cohen,Yftah Ziser
机构: University of Edinburgh (爱丁堡大学); University of Oxford (牛津大学); NVIDIA Research (英伟达研究); University of Groningen (格罗宁根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Vision-Language Models (VLMs) perform well in 2D perception and semantic reasoning compared to their limited understanding of 3D spatial structure. We investigate this gap using relative camera pose estimation (RCPE), a fundamental vision task that requires inferring relative camera translation and rotation from a pair of images. We introduce VRRPI-Bench, a benchmark derived from unlabeled egocentric videos with verbalized annotations of relative camera motion, reflecting realistic scenarios with simultaneous translation and rotation around a shared object. We further propose VRRPI-Diag, a diagnostic benchmark that isolates individual motion degrees of freedom. Despite the simplicity of RCPE, most VLMs fail to generalize beyond shallow 2D heuristics, particularly for depth changes and roll transformations along the optical axis. Even state-of-the-art models such as GPT-5 ( 0.64 ) fall short of classic geometric baselines ( 0.97 ) and human performance ( 0.92 ). Moreover, VLMs exhibit difficulty in multi-image reasoning, with inconsistent performance (best 59.7% ) when integrating spatial cues across frames. Our findings reveal limitations in grounding VLMs in 3D and multi-view spatial reasoning.
zh
[NLP-87] MrRoPE: Mixed-radix Rotary Position Embedding
【速读】: 该论文旨在解决旋转位置编码(Rotary Position Embedding, RoPE)在预训练后扩展至更长序列时缺乏统一理论基础的问题。当前RoPE扩展方法多样且缺乏系统性,导致性能不稳定或难以泛化。解决方案的关键在于提出MrRoPE(Mixed-radix RoPE),其从进制转换(radix system conversion)的角度构建了一个统一的编码框架,将不同扩展策略视为特定进制转换方式。在此基础上,作者设计了两种无需微调的扩展方法——MrRoPE-Uni与MrRoPE-Pro,分别采用均匀和渐进式进制转换策略,实现“短训长测”(train short, test long)的泛化能力。理论分析表明,MrRoPE-Pro显著提升了RoPE可达到的最大编码长度上限,从而验证了该方法的有效性和可靠性。
链接: https://arxiv.org/abs/2601.22181
作者: Qingyuan Tian,Wenhong Zhu,Xiaoran Liu,Xiaofeng Wang,Rui Wang
机构: Shanghai Jiao Tong University (上海交通大学); Fudan University (复旦大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Rotary Position Embedding (RoPE)-extension refers to modifying or generalizing the Rotary Position Embedding scheme to handle longer sequences than those encountered during pre-training. However, current extension strategies are highly diverse and lack a unified theoretical foundation. In this paper, we propose MrRoPE (Mixed-radix RoPE), a generalized encoding formulation based on a radix system conversion perspective, which elegantly unifies various RoPE-extension approaches as distinct radix conversion strategies. Based on this theory, we introduce two training-free extensions, MrRoPE-Uni and MrRoPE-Pro, which leverage uniform and progressive radix conversion strategies, respectively, to achieve ‘train short, test long’ generalization. Without fine-tuning, MrRoPE-Pro sustains over 85% recall in the 128K-context Needle-in-a-Haystack test and achieves more than double YaRN’s accuracy on Infinite-Bench retrieval and dialogue subsets. Theoretical analysis confirms that MrRoPE-Pro effectively raises the upper bound of RoPE’s attainable encoding length, which further validates the reliability and utility of our theory and methodology.
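摘要中“进制转换”视角的直观类比是:把位置整数分解为混合进制数字,每个 RoPE 旋转频率对应一个“数位”,其周期对应该位的基数;扩大基数即提升可表示的位置上限,对应扩展可编码长度。以下仅为说明该类比的玩具代码,并非 MrRoPE 的实现。

```python
def to_mixed_radix(n, radices):
    """把位置 n 分解为混合进制数字(radices[0] 为最低位的基数)。"""
    digits = []
    for r in radices:
        digits.append(n % r)  # 当前位的数字
        n //= r               # 进位到更高的位
    if n:
        raise ValueError("位置超出该基数组合的可表示范围")
    return digits

def capacity(radices):
    """该基数组合能区分的位置总数:各位基数之积。"""
    out = 1
    for r in radices:
        out *= r
    return out
```

例如基数 [2, 3] 最多区分 6 个位置,位置 5 的数字为 [1, 2];把某一位的基数从 10 扩到 16,可区分的位置总数随之增大,这正是“进制转换提升编码长度上限”的含义。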
zh
[NLP-88] In Vino Veritas and Vulnerabilities: Examining LLM Safety via Drunk Language Inducement
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对受酒精影响的语言行为(drunk language)时的安全脆弱性问题,即此类语言诱导可能导致模型更容易被越狱(jailbreaking)或泄露隐私。解决方案的关键在于提出三种高效且简单的诱导机制:基于角色的提示(persona-based prompting)、因果微调(causal fine-tuning)和基于强化学习的后训练(reinforcement-based post-training),这些方法能有效激发LLM中类似人类醉酒状态下的非理性与不安全行为,并通过多维度评估验证其对主流安全基准(如JailbreakBench和ConfAIde)的显著影响,从而揭示LLM安全风险的新来源。
链接: https://arxiv.org/abs/2601.22169
作者: Anudeex Shetty,Aditya Joshi,Salil S. Kanhere
机构: UNSW Sydney (新南威尔士大学); the University of Melbourne (墨尔本大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: WIP
Abstract:Humans are susceptible to undesirable behaviours and privacy leaks under the influence of alcohol. This paper investigates drunk language, i.e., text written under the influence of alcohol, as a driver for safety failures in large language models (LLMs). We investigate three mechanisms for inducing drunk language in LLMs: persona-based prompting, causal fine-tuning, and reinforcement-based post-training. When evaluated on 5 LLMs, we observe a higher susceptibility to jailbreaking on JailbreakBench (even in the presence of defences) and privacy leaks on ConfAIde, where both benchmarks are in English, as compared to the base LLMs as well as previously reported approaches. Via a robust combination of manual evaluation and LLM-based evaluators and analysis of error categories, our findings highlight a correspondence between human-intoxicated behaviour, and anthropomorphism in LLMs induced with drunk language. The simplicity and efficiency of our drunk language inducement approaches position them as potential counters for LLM safety tuning, highlighting significant risks to LLM safety.
zh
[NLP-89] EmoShift: Lightweight Activation Steering for Enhanced Emotion-Aware Speech Synthesis ICASSP2026
【速读】: 该论文旨在解决当前情感感知文本到语音(TTS)系统在情感表达精确性和可控性方面的不足,尤其是现有基于大语言模型(LLM)的系统依赖固定情感嵌入或外部引导,难以建模特定情感的潜在特征。其解决方案的关键在于提出一种轻量级激活调控框架 EmoShift,核心创新是引入 EmoSteer 层,该层在输出嵌入空间中为每种目标情感学习一个导向向量(steering vector),以捕捉情感的潜在偏移并保持跨话语和情感类别间的情感表达稳定性。该方法仅需约1000万可训练参数(不足全量微调的1/30),在客观与主观评估中均优于零样本和全量微调基线,在增强情感表现力的同时保持自然度和说话人相似性。
链接: https://arxiv.org/abs/2601.22873
作者: Li Zhou,Hao Jiang,Junjie Li,Tianrui Wang,Haizhou Li
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注: Activation Steering; Emotion-Aware TTS; Speech Synthesis; Accepted by ICASSP 2026
Abstract:Achieving precise and controllable emotional expression is crucial for producing natural and context-appropriate speech in text-to-speech (TTS) synthesis. However, many emotion-aware TTS systems, including large language model (LLM)-based designs, rely on scaling fixed emotion embeddings or external guidance, limiting their ability to model emotion-specific latent characteristics. To address this gap, we present EmoShift, a lightweight activation-steering framework incorporating an EmoSteer layer, which learns a steering vector for each target emotion in the output embedding space to capture its latent offset and maintain stable, appropriate expression across utterances and categories. With only 10M trainable parameters, less than 1/30 of full fine-tuning, EmoShift outperforms zero-shot and fully fine-tuned baselines in objective and subjective evaluations, enhancing emotional expressiveness while preserving naturalness and speaker similarity. Further analysis confirms the proposed EmoSteer layer’s effectiveness and reveals its potential for controllable emotional intensity in speech synthesis.
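激活调控(activation steering)的核心操作非常简洁:在输出嵌入上叠加目标情感对应的导向向量。下面是一个最小示意,其中的向量取值、维度与强度系数 alpha 均为演示用假设,并非 EmoShift 学得的参数。

```python
import numpy as np

def emo_steer(hidden, steer_vectors, emotion, alpha=1.0):
    """激活调控示意:在输出嵌入 hidden 上叠加目标情感的导向向量。
    steer_vectors 为每种情感各一条 d 维向量;alpha 可用于控制情感强度(假设)。"""
    return hidden + alpha * steer_vectors[emotion]

d = 8
steer = {"happy": np.full(d, 0.5), "sad": np.full(d, -0.5)}  # 玩具向量
h = np.zeros(d)
```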
zh
[NLP-90] CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR ICASSP2026
【速读】: 该论文旨在解决多说话人自动语音识别(Multi-speaker Automatic Speech Recognition, Multi-speaker ASR)中因语音重叠导致的识别准确率下降问题,尤其是在个性化AI场景下如何有效利用目标说话人声学特征与语言上下文信息进行联合建模。解决方案的关键在于提出一种端到端的上下文声学-语言联合建模框架CALM(Contextual Acoustic-Linguistic Modeling),其核心创新包括:基于说话人嵌入驱动的目标说话人提取模块,以及基于动态词汇的语言上下文偏置机制,从而在重叠对话中实现对目标说话人的精准分离和语义引导的识别优化。
链接: https://arxiv.org/abs/2601.22792
作者: Muhammad Shakeel,Yosuke Fukumoto,Chikara Maeda,Chyi-Jiunn Lin,Shinji Watanabe
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted to IEEE ICASSP 2026
Abstract:We present CALM, a joint Contextual Acoustic-Linguistic Modeling framework for multi-speaker automatic speech recognition (ASR). In personalized AI scenarios, the joint availability of acoustic and linguistic cues naturally motivates the integration of target-speaker conditioning with contextual biasing in overlapping conversations. CALM implements this integration in an end-to-end framework through speaker embedding-driven target-speaker extraction and dynamic vocabulary-based contextual biasing. We evaluate CALM on simulated English (LibriSpeechMix) and Japanese (Corpus of Spontaneous Japanese mixtures, CSJMix). On two-speaker mixtures, CALM reduces biased word error rate (B-WER) from 12.7 to 4.7 on LibriSpeech2Mix and biased character error rate (B-CER) from 16.6 to 8.4 on CSJMix2 (eval3), demonstrating the effectiveness of joint acoustic-linguistic modeling across languages. We additionally report results on the AMI corpus (IHM-mix condition) to validate performance on standardized speech mixtures.
zh
[NLP-91] Sylber 2.0: A Universal Syllable Embedding
【速读】: 该论文旨在解决语音建模中语音标记(speech tokens)的效率与通用性不足的问题,尤其是在多语言和多种表达风格下难以同时保留语言信息与声学细节的挑战。现有基于音节的语音标记方法受限于单一语言且缺乏足够的声学保真度,无法满足跨语言语音处理的需求。解决方案的关键在于提出Sylber 2.0——一个自监督的音节级语音编码框架,其通过实现约5 Hz的极低标记频率,在保持高保真重建的同时兼顾语言与声学特征,并在多语言场景中展现出良好的泛化能力。该方法不仅在性能上可媲美高频基线模型,还显著提升了低资源自动语音识别(ASR)和文本到语音合成(TTS)任务的效果,从而建立了一种适用于通用口语语言的高效音节抽象表示。
链接: https://arxiv.org/abs/2601.22306
作者: Cheol Jun Cho,Nicholas Lee,Alan W Black,Gopala K. Anumanchipalli
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注:
Abstract:Scaling spoken language modeling requires speech tokens that are both efficient and universal. Recent work has proposed syllables as promising speech tokens at low temporal resolution, but existing models are constrained to English and fail to capture sufficient acoustic detail. To address this gap, we present Sylber 2.0, a self-supervised framework for coding speech at the syllable level that enables efficient temporal compression and high-fidelity reconstruction. Sylber 2.0 achieves a very low token frequency around 5 Hz, while retaining both linguistic and acoustic detail across multiple languages and expressive styles. Experiments show that it performs on par with previous models operating on high-frequency baselines. Furthermore, Sylber 2.0 enables efficient TTS modeling which can generate speech with competitive intelligibility and quality with SOTA models using only 72M parameters. Moreover, the universality of Sylber 2.0 provides more effective features for low resource ASR than previous speech coding frameworks. In sum, we establish an effective syllable-level abstraction for general spoken language.
zh
[NLP-92] UniFinEval: Towards Unified Evaluation of Financial Multimodal Models across Text Images and Videos
【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在高信息密度金融场景下评估能力不足的问题,特别是面对文本、图像和视频等多模态数据时,现有基准测试难以覆盖金融领域特有的复杂推理任务,如跨模态多跳推理(cross-modal multi-hop reasoning)与细粒度信息整合。解决方案的关键在于提出首个统一的多模态金融评估基准UniFinEval,其系统性构建了五个基于真实金融系统的典型场景:财务报表审计、公司基本面推理、行业趋势洞察、金融风险感知与资产配置分析,并人工构建包含3,767个问答对的高质量双语数据集,在零样本(Zero-Shot)与思维链(Chain-of-Thought, CoT)设置下对10种主流MLLMs进行系统评测,从而为金融场景中多模态模型的能力提供精细化、可复现的评估框架。
链接: https://arxiv.org/abs/2601.22162
作者: Zhi Yang,Lingfeng Zeng,Fangqi Lou,Qi Qi,Wei Zhang,Zhenyu Wu,Zhenxiong Yu,Jun Han,Zhiheng Jin,Lejie Zhang,Xiaoming Huang,Xiaolong Liang,Zheng Wei,Junbo Zou,Dongpo Cheng,Zhaowei Liu,Xin Guo,Rongjunchen Zhang,Liwen Zhang
机构: SUFE(上海财经大学); Tencent(腾讯); Gatech(佐治亚理工学院); HiThink Research(慧思研究)
类目: General Finance (q-fin.GN); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Multimodal large language models are playing an increasingly significant role in empowering the financial domain; however, the challenges they face, such as multimodal and high-density information and cross-modal multi-hop reasoning, go beyond the evaluation scope of existing multimodal benchmarks. To address this gap, we propose UniFinEval, the first unified multimodal benchmark designed for high-information-density financial environments, covering text, images, and videos. UniFinEval systematically constructs five core financial scenarios grounded in real-world financial systems: Financial Statement Auditing, Company Fundamental Reasoning, Industry Trend Insights, Financial Risk Sensing, and Asset Allocation Analysis. We manually construct a high-quality dataset consisting of 3,767 question-answer pairs in both Chinese and English and systematically evaluate 10 mainstream MLLMs under Zero-Shot and CoT settings. Results show that Gemini-3-pro-preview achieves the best overall performance, yet still exhibits a substantial gap compared to financial experts. Further error analysis reveals systematic deficiencies in current models. UniFinEval aims to provide a systematic assessment of MLLMs’ capabilities in fine-grained, high-information-density financial environments, thereby enhancing the robustness of MLLMs applications in real-world financial scenarios. Data and code are available at this https URL.
zh
计算机视觉
[CV-0] VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation
【速读】:该论文旨在解决视频扩散模型(Video Diffusion Models, VDMs)在生成过程中难以保持三维结构一致性的问题,表现为物体形变或空间漂移,其根本原因在于标准去噪目标缺乏对几何一致性的显式激励。解决方案的关键在于提出一种数据高效、自监督的框架 VideoGPA(Video Geometric Preference Alignment),该框架利用一个几何基础模型自动提取密集的偏好信号,并通过直接偏好优化(Direct Preference Optimization, DPO)引导VDM生成更具3D一致性的视频内容,无需人工标注即可显著提升时序稳定性、物理合理性与运动连贯性。
链接: https://arxiv.org/abs/2601.23286
作者: Hongyang Du,Junjie Ye,Xiaoyan Cong,Runhao Li,Jingcheng Ni,Aman Agarwal,Zeqi Zhou,Zekun Li,Randall Balestriero,Yue Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:While recent video diffusion models (VDMs) produce visually impressive results, they fundamentally struggle to maintain 3D structural consistency, often resulting in object deformation or spatial drift. We hypothesize that these failures arise because standard denoising objectives lack explicit incentives for geometric coherence. To address this, we introduce VideoGPA (Video Geometric Preference Alignment), a data-efficient self-supervised framework that leverages a geometry foundation model to automatically derive dense preference signals that guide VDMs via Direct Preference Optimization (DPO). This approach effectively steers the generative distribution toward inherent 3D consistency without requiring human annotations. VideoGPA significantly enhances temporal stability, physical plausibility, and motion coherence using minimal preference pairs, consistently outperforming state-of-the-art baselines in extensive experiments.
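VideoGPA 借助的 DPO 目标是标准公式:对几何更一致的样本 y_w 与更不一致的样本 y_l,最小化 -log σ(β·[(log π(y_w)−log π_ref(y_w)) − (log π(y_l)−log π_ref(y_l))])。下面是该损失的逐样本示意,β 的取值为演示用假设。

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """标准 DPO 损失:-log sigmoid(beta * 策略与参考模型的对数比之差)。
    logp_w / logp_l 为当前策略对偏好/劣样本的对数概率,ref_* 为参考模型的对数概率。"""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

当两个样本与参考模型完全一致时 margin 为 0,损失为 log 2;偏好样本的相对概率升高则损失下降,即分布被推向几何一致的一侧。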
zh
[CV-1] User Prompting Strategies and Prompt Enhancement Methods for Open-Set Object Detection in XR Environments
【速读】:该论文旨在解决开放集目标检测(Open-set Object Detection, OSOD)模型在现实交互式扩展现实(XR)环境中对用户模糊、不完整或过度详细提示的鲁棒性不足问题。其关键解决方案在于通过引入视觉-语言模型模拟多样化的用户提示行为,评估两种主流OSOD模型(GroundingDINO与YOLO-E)在标准、欠细节、过细节及语用模糊四类提示下的性能表现,并验证提示增强策略对提升模型鲁棒性的有效性——结果表明,提示增强能显著改善模型在语用模糊提示下的表现,mIoU提升超过55%,平均置信度提升41%。
链接: https://arxiv.org/abs/2601.23281
作者: Junfeng Lin,Yanming Xiu,Maria Gorlatova
机构: Duke University (杜克大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE VR 2026: GenAI-XR workshop
Abstract:Open-set object detection (OSOD) localizes objects while identifying and rejecting unknown classes at inference. While recent OSOD models perform well on benchmarks, their behavior under realistic user prompting remains underexplored. In interactive XR settings, user-generated prompts are often ambiguous, underspecified, or overly detailed. To study prompt-conditioned robustness, we evaluate two OSOD models, GroundingDINO and YOLO-E, on real-world XR images and simulate diverse user prompting behaviors using vision-language models. We consider four prompt types: standard, underdetailed, overdetailed, and pragmatically ambiguous, and examine the impact of two enhancement strategies on these prompts. Results show that both models exhibit stable performance under underdetailed and standard prompts, while they suffer degradation under ambiguous prompts. Overdetailed prompts primarily affect GroundingDINO. Prompt enhancement substantially improves robustness under ambiguity, yielding gains exceeding 55% mIoU and 41% average confidence. Based on the findings, we propose several prompting strategies and prompt enhancement methods for OSOD models in XR environments.
zh
[CV-2] raining-Free Test-Time Adaptation with Brownian Distance Covariance in Vision-Language Models ICASSP2026
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在域偏移(domain shift)下性能下降的问题,同时克服现有测试时适应方法计算复杂度高、依赖反向传播(back-propagation)以及多模态适配不足的局限。其解决方案的关键在于提出一种无需训练的测试时自适应方法TaTa,该方法利用布朗距离协方差(Brownian Distance Covariance)动态捕捉跨模态特征间的线性和非线性依赖关系,从而在不进行权重更新的前提下实现模型对新域的有效适应;此外,通过引入属性增强提示(attribute-enhanced prompting)和动态聚类与伪标签精炼机制,进一步提升模型在新颖视觉场景中的推理准确性和泛化能力。
链接: https://arxiv.org/abs/2601.23253
作者: Yi Zhang,Chun-Wun Cheng,Angelica I. Aviles-Rivero,Zhihai He,Liang-Jie Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted in ICASSP 2026
Abstract:Vision-language models suffer performance degradation under domain shift, limiting real-world applicability. Existing test-time adaptation methods are computationally intensive, rely on back-propagation, and often focus on single modalities. To address these issues, we propose Training-free Test-Time Adaptation with Brownian Distance Covariance (TaTa). TaTa leverages Brownian Distance Covariance-a powerful statistical measure that captures both linear and nonlinear dependencies via pairwise distances-to dynamically adapt VLMs to new domains without training or back-propagation. This not only improves efficiency but also enhances stability by avoiding disruptive weight updates. TaTa further integrates attribute-enhanced prompting to improve vision-language inference with descriptive visual cues. Combined with dynamic clustering and pseudo-label refinement, it effectively recalibrates the model for novel visual contexts. Experiments across diverse datasets show that TaTa significantly reduces computational cost while achieving state-of-the-art performance in domain and cross-dataset generalization.
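TaTa 所用的布朗距离协方差有闭式的样本估计:对成对距离矩阵做双重中心化后逐元素相乘取均值,独立时趋近 0,且能捕捉非线性依赖。下面给出一维变量的最小实现(论文在多模态特征上使用,此处仅为公式示意)。

```python
import numpy as np

def dcov2(x, y):
    """一维样本的(平方)布朗距离协方差。
    A_ij = |x_i - x_j| 减去行均值、列均值、加回总均值(双重中心化),y 同理。"""
    A = np.abs(x[:, None] - x[None, :])
    B = np.abs(y[:, None] - y[None, :])
    A = A - A.mean(axis=0) - A.mean(axis=1, keepdims=True) + A.mean()
    B = B - B.mean(axis=0) - B.mean(axis=1, keepdims=True) + B.mean()
    return float((A * B).mean())

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2  # 与 x 的线性协方差为零,但完全非线性相关
```

该例说明了其区别于线性相关系数的能力:x 与 x² 的普通协方差为零,而距离协方差显著大于零。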
zh
[CV-3] Structured Over Scale: Learning Spatial Reasoning from Educational Video
【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在简单推理任务上表现不佳的问题,例如计数、空间推理和组合理解等,这些任务是学龄前儿童能够轻松完成的。现有VLMs虽在标准视频理解基准上表现优异,但缺乏对结构化逻辑推理能力的有效学习机制。解决方案的关键在于利用教育类视频中天然存在的结构化教学信号——特别是《爱探险的朵拉》(Dora the Explorer)系列中每集固定“情境-问题-暂停-答案”的互动式教学框架,构建了一个名为DoraVQA的数据集(5,344个带时间戳对齐的问答对),并通过Group Relative Policy Optimization(GRPO)方法对Qwen2和Qwen3模型进行微调。该方法有效利用了教育内容中的明确正确性信号与可追踪的推理路径,仅用38小时的儿童教育视频训练即显著提升模型在DoraVQA、CVBench、Video-MME及NExT-QA等多个基准上的性能,证明了内容结构的重要性不亚于数据规模,为提升VLM的泛化推理能力提供了新路径。
链接: https://arxiv.org/abs/2601.23251
作者: Bishoy Galoaa,Xiangyu Bai,Sarah Ostadabbas
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-language models (VLMs) demonstrate impressive performance on standard video understanding benchmarks yet fail systematically on simple reasoning tasks that preschool children can solve, including counting, spatial reasoning, and compositional understanding. We hypothesize that the pedagogically-structured content of educational videos provides an ideal training signal for improving these capabilities. We introduce DoraVQA, a dataset of 5,344 question-answer pairs automatically extracted from 8 seasons of Dora the Explorer with precise timestamp alignment. Each episode follows a consistent \textitcontext-question-pause-answer structure that creates a self-contained learning environment analogous to interactive tutoring. We fine-tune both Qwen2 and Qwen3 using Group Relative Policy Optimization (GRPO), leveraging the clear correctness signals and structured reasoning traces inherent in educational content. Despite training exclusively on 38 hours of children’s educational videos, our approach achieves improvements of 8-14 points on DoraVQA and state-of-the-art 86.16% on CVBench, with strong transfer to Video-MME and NExT-QA, demonstrating effective generalization from narrow pedagogical content to broad multimodal understanding. Through cross-domain benchmarks, we show that VLMs can perform tasks that require robust reasoning learned from structured educational content, suggesting that content structure matters as much as content scale.
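摘要中用于微调的 GRPO,其核心是组内相对优势:对同一问题采样一组回答,用每个回答的奖励减去组均值、再除以组标准差作为优势,无需额外的价值网络。下面是该归一化的最小示意(eps 为防零常数,取值为假设)。

```python
def grpo_advantages(rewards, eps=1e-8):
    """GRPO 组内相对优势:组内标准化后的奖励。"""
    m = sum(rewards) / len(rewards)                               # 组均值
    std = (sum((r - m) ** 2 for r in rewards) / len(rewards)) ** 0.5  # 组标准差
    return [(r - m) / (std + eps) for r in rewards]
```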
zh
[CV-4] ShotFinder: Imagination-Driven Open-Domain Video Shot Retrieval via Web Search
【速读】:该论文旨在解决开放域视频片段检索(open-domain video shot retrieval)中缺乏系统性基准测试与分析的问题,尤其针对视频 richer temporal structure 和 more complex semantics 带来的挑战。其解决方案的关键在于提出 ShotFinder 基准,将编辑需求形式化为以关键帧为导向的片段描述,并引入五类可控单因素约束(时间顺序、颜色、视觉风格、音频和分辨率),构建包含1,210个高质量样本的数据集;同时设计了一个文本驱动的三阶段检索与定位流水线:(1)通过视频想象进行查询扩展,(2)利用搜索引擎召回候选视频,(3)基于描述引导的时间定位。实验表明,当前多模态大模型在该任务上仍显著落后于人类表现,且不同约束条件间存在明显性能不平衡,凸显了该任务对生成式 AI (Generative AI) 的重大挑战。
链接: https://arxiv.org/abs/2601.23232
作者: Tao Yu,Haopeng Jin,Hao Wang,Shenghua Chai,Yujia Yang,Junhao Gong,Jiaming Guo,Minghui Zhang,Xinlong Chen,Zhenghao Zhang,Yuxuan Zhou,Yanpei Gong,YuanCheng Liu,Yiming Ding,Kangwei Zeng,Pengfei Yang,Zhongtian Luo,Yufei Xiong,Shanbin Zhang,Shaoxiong Cheng,Huang Ruilin,Li Shuo,Yuxi Niu,Xinyuan Zhang,Yueya Xu,Jie Mao,Ruixuan Ji,Yaru Zhao,Mingchen Zhang,Jiabing Yang,Jiaqi Liu,YiFan Zhang,Hongzhu Yi,Xinming Wang,Cheng Zhong,Xiao Ma,Zhang Zhang,Yan Huang,Liang Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 28 pages, 7 figures
Abstract:In recent years, large language models (LLMs) have made rapid progress in information retrieval, yet existing research has mainly focused on text or static multimodal settings. Open-domain video shot retrieval, which involves richer temporal structure and more complex semantics, still lacks systematic benchmarks and analysis. To fill this gap, we introduce ShotFinder, a benchmark that formalizes editing requirements as keyframe-oriented shot descriptions and introduces five types of controllable single-factor constraints: Temporal order, Color, Visual style, Audio, and Resolution. We curate 1,210 high-quality samples from YouTube across 20 thematic categories, using large models for generation with human verification. Based on the benchmark, we propose ShotFinder, a text-driven three-stage retrieval and localization pipeline: (1) query expansion via video imagination, (2) candidate video retrieval with a search engine, and (3) description-guided temporal localization. Experiments on multiple closed-source and open-source models reveal a significant gap to human performance, with clear imbalance across constraints: temporal localization is relatively tractable, while color and visual style remain major challenges. These results reveal that open-domain video shot retrieval is still a critical capability that multimodal large models have yet to overcome.
[CV-5] Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning
【Quick Read】: This paper addresses a limitation of existing multimodal large language models in long-video understanding: their reliance on uniform sampling and single-turn inference makes it hard to identify sparse yet critical visual evidence amid heavy redundancy. The core solution is the Video-o3 framework, which performs efficient multi-hop evidence seeking and reasoning by iteratively discovering salient visual clues, inspecting key segments at fine granularity, and terminating adaptively once sufficient evidence is acquired. Its key techniques are: (1) Task-Decoupled Attention Masking, which mitigates the attention dispersion caused by the heterogeneity of reasoning and tool invocation, keeping each step focused while preserving global context; and (2) a Verifiable Trajectory-Guided Reward that controls context-length growth across multi-turn interactions, balancing exploration coverage with reasoning efficiency.
Link: https://arxiv.org/abs/2601.23224
Authors: Xiangyu Zeng,Zhiqiu Zhang,Yuhan Zhu,Xinhao Li,Zikang Wang,Changlian Ma,Qingyu Zhang,Zizheng Huang,Kun Ouyang,Tianxiang Jiang,Ziang Yan,Yi Wang,Hongjie Zhang,Yali Wang,Limin Wang
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 15 figures, 11 tables
Abstract:Existing multimodal large language models for long-video understanding predominantly rely on uniform sampling and single-turn inference, limiting their ability to identify sparse yet critical evidence amid extensive redundancy. We introduce Video-o3, a novel framework that supports iterative discovery of salient visual clues, fine-grained inspection of key segments, and adaptive termination once sufficient evidence is acquired. Technically, we address two core challenges in interleaved tool invocation. First, to mitigate attention dispersion induced by the heterogeneity of reasoning and tool-calling, we propose Task-Decoupled Attention Masking, which isolates per-step concentration while preserving shared global context. Second, to control context length growth in multi-turn interactions, we introduce a Verifiable Trajectory-Guided Reward that balances exploration coverage with reasoning efficiency. To support training at scale, we further develop a data synthesis pipeline and construct Seeker-173K, comprising 173K high-quality tool-interaction trajectories for effective supervised and reinforcement learning. Extensive experiments show that Video-o3 substantially outperforms state-of-the-art methods, achieving 72.1% accuracy on MLVU and 46.5% on Video-Holmes. These results demonstrate Video-o3’s strong multi-hop evidence-seeking and reasoning capabilities, and validate the effectiveness of native tool invocation in long-video scenarios.
[CV-6] Region-Normalized DPO for Medical Image Segmentation under Noisy Judges
【Quick Read】: This paper targets the limited scalability of medical image segmentation caused by its reliance on expensive pixel-level annotation. It proposes fine-tuning via preference optimization using the inexpensive automatic quality-control (QC) signals that deployed systems already produce (e.g., model agreement, uncertainty measures, or learned mask-quality scores), improving segmentation models without additional pixel annotations. The key is a segmentation-aware objective, Region-Normalized Direct Preference Optimization (RN-DPO), which normalizes preference updates by the size of the disagreement region between masks, reducing the influence of harmful comparisons and markedly improving training stability and sustained performance.
Link: https://arxiv.org/abs/2601.23222
Authors: Hamza Kalisch,Constantin Seibold,Jens Kleesiek,Ken Herrmann,Frederic Jonske
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:While dense pixel-wise annotations remain the gold standard for medical image segmentation, they are costly to obtain and limit scalability. In contrast, many deployed systems already produce inexpensive automatic quality-control (QC) signals like model agreement, uncertainty measures, or learned mask-quality scores which can be used for further model training without additional ground-truth annotation. However, these signals can be noisy and biased, making preference-based fine-tuning susceptible to harmful updates. We study Direct Preference Optimization (DPO) for segmentation from such noisy judges using proposals generated by a supervised base segmenter trained on a small labeled set. We find that outcomes depend strongly on how preference pairs are mined: selecting the judge’s top-ranked proposal can improve peak performance when the judge is reliable, but can amplify harmful errors under weaker judges. We propose Region-Normalized DPO (RN-DPO), a segmentation-aware objective which normalizes preference updates by the size of the disagreement region between masks, reducing the leverage of harmful comparisons and improving optimization stability. Across two medical datasets and multiple regimes, RN-DPO improves sustained performance and stabilizes preference-based fine-tuning, outperforming standard DPO and strong baselines without requiring additional pixel annotations.
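The region-normalized objective described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the inputs (summed mask log-probabilities for the preferred and rejected masks) and the sigmoid-based DPO margin are assumptions on our part.

```python
import numpy as np

def rn_dpo_loss(logp_win, logp_lose, mask_win, mask_lose, beta=0.1):
    """DPO-style preference loss, normalized by the mask disagreement size."""
    # Disagreement region: pixels where the preferred and rejected masks differ.
    disagreement = np.logical_xor(mask_win, mask_lose).sum()
    norm = max(int(disagreement), 1)  # avoid division by zero for identical masks
    # Normalizing the margin shrinks updates driven by huge disagreement regions,
    # which are the comparisons most likely to be corrupted by a noisy judge.
    margin = beta * (logp_win - logp_lose) / norm
    return float(np.log1p(np.exp(-margin)))  # -log(sigmoid(margin))
```

With a fixed log-probability gap, a larger disagreement region yields a smaller margin and hence a weaker update, which is the stabilizing effect the summary describes.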
[CV-7] Med-Scout: Curing MLLMs' Geometric Blindness in Medical Perception via Geometry-Aware RL Post-Training
【Quick Read】: This paper addresses a critical deficit of current Multimodal Large Language Models (MLLMs) in medical diagnosis: geometric blindness, i.e., the inability to ground outputs in objective geometric constraints, which produces plausible yet factually incorrect hallucinations. The deficit stems from training paradigms that prioritize linguistic fluency over geometric fidelity. The key is the Med-Scout framework, which uses Reinforcement Learning (RL) to mine the geometric logic intrinsic to unlabeled medical images and automatically derives verifiable supervision signals through three strategic proxy tasks (Hierarchical Scale Localization, Topological Jigsaw Reconstruction, and Anomaly Consistency Detection), substantially improving geometric perception without costly expert annotation.
Link: https://arxiv.org/abs/2601.23220
Authors: Anglin Liu,Ruichao Chen,Yi Lu,Hongxia Xu,Jintai Chen
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Despite recent Multimodal Large Language Models (MLLMs)’ linguistic prowess in medical diagnosis, we find even state-of-the-art MLLMs suffer from a critical perceptual deficit: geometric blindness. This failure to ground outputs in objective geometric constraints leads to plausible yet factually incorrect hallucinations, rooted in training paradigms that prioritize linguistic fluency over geometric fidelity. This paper introduces Med-Scout, a novel framework that “cures” this blindness via Reinforcement Learning (RL) that leverages the intrinsic geometric logic latent within unlabeled medical images. Instead of relying on costly expert annotations, Med-Scout derives verifiable supervision signals through three strategic proxy tasks: Hierarchical Scale Localization, Topological Jigsaw Reconstruction, and Anomaly Consistency Detection. To rigorously quantify this deficit, we present Med-Scout-Bench, a new benchmark specifically designed to evaluate geometric perception. Extensive evaluations show that Med-Scout significantly mitigates geometric blindness, outperforming leading proprietary and open-source MLLMs by over 40% on our benchmark. Furthermore, this enhanced geometric perception generalizes to broader medical understanding, achieving superior results on radiological and comprehensive medical VQA tasks.
[CV-8] Hi-Light: A Path to high-fidelity high-resolution video relighting with a Novel Evaluation Paradigm
【Quick Read】: This paper tackles three core problems in video relighting: the absence of an effective evaluation metric, severe light flickering, and the degradation of high-frequency detail during editing. The key is Hi-Light, a training-free framework for high-fidelity, high-resolution, robust video relighting built on three techniques: a lightness-prior-anchored guided relighting diffusion model that stabilizes the intermediate relit video; a hybrid motion-adaptive lighting smoothing filter that incorporates optical flow to ensure temporal stability without introducing motion blur; and a LAB-color-space detail fusion module that effectively preserves high-frequency detail from the original video. The authors also propose the Light Stability Score, the first quantitative metric dedicated to measuring lighting consistency, filling the evaluation gap in this area.
Link: https://arxiv.org/abs/2601.23167
Authors: Xiangrui Liu,Haoxiang Li,Yezhou Yang
Institutions: Arizona State University; Pixocial Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Video relighting offers immense creative potential and commercial value but is hindered by challenges, including the absence of an adequate evaluation metric, severe light flickering, and the degradation of fine-grained details during editing. To overcome these challenges, we introduce Hi-Light, a novel, training-free framework for high-fidelity, high-resolution, robust video relighting. Our approach introduces three technical innovations: lightness prior anchored guided relighting diffusion that stabilises intermediate relit video, a Hybrid Motion-Adaptive Lighting Smoothing Filter that leverages optical flow to ensure temporal stability without introducing motion blur, and a LAB-based Detail Fusion module that preserves high-frequency detail information from the original video. Furthermore, to address the critical gap in evaluation, we propose the Light Stability Score, the first quantitative metric designed to specifically measure lighting consistency. Extensive experiments demonstrate that Hi-Light significantly outperforms state-of-the-art methods in both qualitative and quantitative comparisons, producing stable, highly detailed relit videos.
[CV-9] Segment Any Events with Language ICLR2026
【Quick Read】: This paper addresses instance segmentation under open-vocabulary conditions for event sensors, i.e., Open-Vocabulary Event Instance Segmentation (OV-EIS), a direction scarcely explored in prior work. Existing methods center on semantic-level understanding and lack fine-grained instance- and part-level segmentation. The key is SEAL, the first semantic-aware Segment Any Events framework, whose core innovations are: 1) a unified framework supporting visual-prompt-based event segmentation and open-vocabulary mask classification across multiple levels of semantic granularity, from instance level to part level; 2) a parameter-efficient architecture that delivers high accuracy with markedly faster inference; and 3) thorough evaluation on four newly curated benchmarks covering label granularity from coarse to fine and semantic granularity from instance to part. The appendix further presents a variant for generic spatiotemporal open-vocabulary event segmentation that requires no visual prompts from users, broadening the model's applicability.
Link: https://arxiv.org/abs/2601.23159
Authors: Seungjun Lee,Gim Hee Lee
Institutions: National University of Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2026. Project Page: this https URL
Abstract:Scene understanding with free-form language has been widely explored within diverse modalities such as images, point clouds, and LiDAR. However, related studies on event sensors are scarce or narrowly centered on semantic-level understanding. We introduce SEAL, the first Semantic-aware Segment Any Events framework that addresses Open-Vocabulary Event Instance Segmentation (OV-EIS). Given the visual prompt, our model presents a unified framework to support both event segmentation and open-vocabulary mask classification at multiple levels of granularity, including instance-level and part-level. To enable thorough evaluation on OV-EIS, we curate four benchmarks that cover label granularity from coarse to fine class configurations and semantic granularity from instance-level to part-level understanding. Extensive experiments show that our SEAL largely outperforms proposed baselines in terms of performance and inference speed with a parameter-efficient architecture. In the Appendix, we further present a simple variant of our SEAL achieving generic spatiotemporal OV-EIS that does not require any visual prompts from users in the inference. Check out our project page in this https URL
[CV-10] FlowCalib: LiDAR-to-Vehicle Miscalibration Detection using Scene Flows
【Quick Read】: This paper addresses rotational misalignment between a LiDAR sensor and the vehicle (LiDAR-to-vehicle miscalibration), which creates safety risks in the perception and decision-making stages of autonomous driving. Existing methods mostly correct sensor-to-sensor calibration errors while overlooking the misalignment of individual sensors that causes those errors in the first place. The key is the FlowCalib framework, which detects miscalibration from the scene-flow motion cues that static objects generate across sequential 3D point clouds: by analyzing the systematic flow bias induced by rotational misalignment, detection requires no additional sensors. The framework combines a neural scene flow prior with a dual-branch detection network that fuses learned global flow features and handcrafted geometric descriptors, performing two complementary binary classification tasks: a global decision on whether misalignment is present, and per-axis decisions on whether each rotational axis is misaligned.
Link: https://arxiv.org/abs/2601.23107
Authors: Ilir Tahiraj,Peter Wittal,Markus Lienkamp
Institutions: Technical University of Munich
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Accurate sensor-to-vehicle calibration is essential for safe autonomous driving. Angular misalignments of LiDAR sensors can lead to safety-critical issues during autonomous operation. However, current methods primarily focus on correcting sensor-to-sensor errors without considering the miscalibration of individual sensors that cause these errors in the first place. We introduce FlowCalib, the first framework that detects LiDAR-to-vehicle miscalibration using motion cues from the scene flow of static objects. Our approach leverages the systematic bias induced by rotational misalignment in the flow field generated from sequential 3D point clouds, eliminating the need for additional sensors. The architecture integrates a neural scene flow prior for flow estimation and incorporates a dual-branch detection network that fuses learned global flow features with handcrafted geometric descriptors. These combined representations allow the system to perform two complementary binary classification tasks: a global binary decision indicating whether misalignment is present and separate, axis-specific binary decisions indicating whether each rotational axis is misaligned. Experiments on the nuScenes dataset demonstrate FlowCalib’s ability to robustly detect miscalibration, establishing a benchmark for sensor-to-vehicle miscalibration detection.
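The systematic flow bias that a rotational miscalibration imprints on static points can be illustrated with a toy numpy sketch. This assumes a pure yaw offset and is only a model of the bias the abstract describes; the paper's flow estimator and detection network are not reproduced here.

```python
import numpy as np

def rotation_bias_flow(points, yaw_deg):
    """Residual scene flow that a yaw miscalibration induces on static points.

    A perfectly calibrated sensor sees zero flow on static geometry; a rotational
    offset R instead imprints the systematic bias (R - I) @ p on every point p.
    """
    t = np.radians(yaw_deg)
    R = np.array([[np.cos(t), -np.sin(t), 0.0],
                  [np.sin(t),  np.cos(t), 0.0],
                  [0.0,        0.0,       1.0]])
    return points @ (R - np.eye(3)).T
```

The bias grows with distance from the rotation axis, which is why such cues are detectable in wide outdoor scenes.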
[CV-11] Rethinking Transferable Adversarial Attacks on Point Clouds from a Compact Subspace Perspective
【Quick Read】: This paper addresses the limited generalization of transferable adversarial attacks on point clouds: existing methods often rely on model-specific gradients or heuristics, which constrains their effectiveness on unseen network architectures. The key is CoSA, an attack framework built on a compact-subspace perspective: each point cloud is represented as a compact combination of class-specific prototypes to capture shared semantic structure, and adversarial perturbations are optimized within a low-rank subspace to induce coherent, architecture-agnostic perturbation directions. This design suppresses model-dependent noise and constrains perturbations to semantically meaningful directions, markedly improving cross-model transferability without relying on surrogate-specific artifacts.
Link: https://arxiv.org/abs/2601.23102
Authors: Keke Tang,Xianheng Liu,Weilong Peng,Xiaofei Wang,Daizong Liu,Peican Zhu,Can Lu,Zhihong Tian
Institutions: Cyberspace Institute of Advanced Technology, Guangzhou University; School of Computer Science and Cyber Engineering, Guangzhou University; Department of Automation, University of Science and Technology of China; Institute for Math & AI, Wuhan University; School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Transferable adversarial attacks on point clouds remain challenging, as existing methods often rely on model-specific gradients or heuristics that limit generalization to unseen architectures. In this paper, we rethink adversarial transferability from a compact subspace perspective and propose CoSA, a transferable attack framework that operates within a shared low-dimensional semantic space. Specifically, each point cloud is represented as a compact combination of class-specific prototypes that capture shared semantic structure, while adversarial perturbations are optimized within a low-rank subspace to induce coherent and architecture-agnostic variations. This design suppresses model-dependent noise and constrains perturbations to semantically meaningful directions, thereby improving cross-model transferability without relying on surrogate-specific artifacts. Extensive experiments on multiple datasets and network architectures demonstrate that CoSA consistently outperforms state-of-the-art transferable attacks, while maintaining competitive imperceptibility and robustness under common defense strategies. Codes will be made public upon paper acceptance.
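Constraining a perturbation to a low-rank subspace spanned by class prototypes, as the summary describes, amounts to a projection. Below is a minimal sketch; the prototype matrix and the QR-based orthonormalization are illustrative assumptions, not the paper's optimization procedure.

```python
import numpy as np

def project_to_subspace(v, prototypes):
    """Project a perturbation v onto the subspace spanned by prototype rows."""
    q, _ = np.linalg.qr(prototypes.T)  # (d, r) orthonormal basis of the subspace
    return q @ (q.T @ v)               # component of v inside the subspace
```

Applying this projection after every gradient step would keep the perturbation inside the shared semantic subspace, suppressing model-specific directions.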
[CV-12] EAG-PT: Emission-Aware Gaussians and Path Tracing for Indoor Scene Reconstruction and Editing
【Quick Read】: This paper addresses the physical inconsistency of existing radiance-field reconstruction methods (e.g., NeRF and 3DGS) under scene editing, caused by baked-in illumination and the lack of explicit light transport, while also overcoming the bottleneck of traditional mesh-based inverse rendering, whose strong requirements on geometric fidelity make it hard to apply to real indoor scenes. The key is Emission-Aware Gaussians and Path Tracing (EAG-PT), built on three core designs: (1) using 2D Gaussians as a unified scene representation and transport-friendly geometry proxy, avoiding reconstructed meshes; (2) explicitly separating emissive and non-emissive components during reconstruction to support subsequent editing; and (3) decoupling reconstruction from final rendering, with efficient single-bounce optimization for editing followed by high-quality multi-bounce path tracing, yielding more natural and physically consistent post-editing renders while preserving fine geometric detail.
Link: https://arxiv.org/abs/2601.23065
Authors: Xijie Yang,Mulin Yu,Changjian Jiang,Kerui Ren,Tao Lu,Jiangmiao Pang,Dahua Lin,Bo Dai,Linning Xu
Institutions: Zhejiang University; Shanghai Artificial Intelligence Laboratory; Shanghai Jiao Tong University; The Chinese University of Hong Kong; The University of Hong Kong; Feeling AI
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: project page: this https URL
Abstract:Recent reconstruction methods based on radiance field such as NeRF and 3DGS reproduce indoor scenes with high visual fidelity, but break down under scene editing due to baked illumination and the lack of explicit light transport. In contrast, physically based inverse rendering relies on mesh representations and path tracing, which enforce correct light transport but place strong requirements on geometric fidelity, becoming a practical bottleneck for real indoor scenes. In this work, we propose Emission-Aware Gaussians and Path Tracing (EAG-PT), aiming for physically based light transport with a unified 2D Gaussian representation. Our design is based on three cores: (1) using 2D Gaussians as a unified scene representation and transport-friendly geometry proxy that avoids reconstructed mesh, (2) explicitly separating emissive and non-emissive components during reconstruction for further scene editing, and (3) decoupling reconstruction from final rendering by using efficient single-bounce optimization and high-quality multi-bounce path tracing after scene editing. Experiments on synthetic and real indoor scenes show that EAG-PT produces more natural and physically consistent renders after editing than radiant scene reconstructions, while preserving finer geometric detail and avoiding mesh-induced artifacts compared to mesh-based inverse path tracing. These results suggest promising directions for future use in interior design, XR content creation, and embodied AI.
[CV-13] HierLoc: Hyperbolic Entity Embeddings for Hierarchical Visual Geolocation
【Quick Read】: This paper addresses three challenges of visual geolocalization: global scale, visual ambiguity, and the inherently hierarchical structure of geography. Existing methods either rely on large-scale image retrieval with heavy storage overhead, use grid-based classifiers that ignore geographic continuity, or use generative models that diffuse over space but struggle with fine detail. The key is an entity-centric paradigm that aligns images directly to hierarchical geographic entities (country, region, subregion, and city) and incorporates the Haversine distance into the contrastive objective, yielding geometry-aware embeddings in Hyperbolic space. This design drastically shrinks the parameter footprint (240k entity embeddings vs. over 5 million image embeddings) while improving accuracy and interpretability, setting a new state of the art on the OSV5M benchmark with a 19.5% reduction in mean geodesic error and a 43% improvement in fine-grained subregion accuracy.
Link: https://arxiv.org/abs/2601.23064
Authors: Hari Krishna Gadi,Daniel Matos,Hongyi Luo,Lu Liu,Yongliang Wang,Yanfeng Zhang,Liqiu Meng
Institutions: Huawei Riemann Lab; Technical University of Munich
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Visual geolocalization, the task of predicting where an image was taken, remains challenging due to global scale, visual ambiguity, and the inherently hierarchical structure of geography. Existing paradigms rely on either large-scale retrieval, which requires storing a large number of image embeddings, grid-based classifiers that ignore geographic continuity, or generative models that diffuse over space but struggle with fine detail. We introduce an entity-centric formulation of geolocation that replaces image-to-image retrieval with a compact hierarchy of geographic entities embedded in Hyperbolic space. Images are aligned directly to country, region, subregion, and city entities through Geo-Weighted Hyperbolic contrastive learning by directly incorporating haversine distance into the contrastive objective. This hierarchical design enables interpretable predictions and efficient inference with 240k entity embeddings instead of over 5 million image embeddings on the OSV5M benchmark, on which our method establishes a new state-of-the-art performance. Compared to the current methods in the literature, it reduces mean geodesic error by 19.5%, while improving the fine-grained subregion accuracy by 43%. These results demonstrate that geometry-aware hierarchical embeddings provide a scalable and conceptually new alternative for global image geolocation.
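The Haversine term that the summary says is folded into the contrastive objective is straightforward to compute. The `geo_negative_weight` form and its `scale_km` parameter below are illustrative assumptions, not the paper's exact weighting; the distance formula itself is standard.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2.0 * radius_km * math.asin(math.sqrt(a))

def geo_negative_weight(lat1, lon1, lat2, lon2, scale_km=2500.0):
    # Hypothetical weighting: down-weight contrastive negatives that are
    # geographically close, so nearby entities are not pushed apart as hard.
    return 1.0 - math.exp(-haversine_km(lat1, lon1, lat2, lon2) / scale_km)
```

Scaling each negative pair's contribution by such a weight is one simple way to make a contrastive loss geography-aware.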
[CV-14] One-shot Optimized Steering Vector for Hallucination Mitigation for VLMs
【Quick Read】: This paper addresses hallucination and safety-related failures of Vision Language Models (VLMs) on multimodal tasks, which persist even as models scale. Existing steering techniques can improve performance, but they struggle to balance efficiency against effectiveness, and most rely on input-specific adjustments. The key is OSGA (One-shot Steering with Generative Anchor), an input-independent framework: a variance-driven data selection strategy picks a representative sample, and a single steering vector is learned with a contrastive objective under generative anchor regularization; at inference the vector is applied universally at a fixed layer without modifying model parameters. Experiments show that a single optimization markedly reduces hallucination and enhances safety at negligible computational cost, validating one-shot steering as a reliable and scalable route to improving VLMs.
Link: https://arxiv.org/abs/2601.23041
Authors: Youxu Shi,Suorong Yang,Dong Liu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Vision Language Models (VLMs) achieve strong performance on multimodal tasks but still suffer from hallucination and safety-related failures that persist even at scale. Steering offers a lightweight technique to improve model performance. However, existing steering approaches, whether input-dependent or input-independent, struggle to achieve a meaningful trade-off between efficiency and effectiveness. In this work, we observe that steering vectors can generalize across inputs when tasks share aligned semantic intent. Based on this insight, we propose OSGA (One-shot Steering with Generative Anchor), an input-independent framework that improves model performance with a single optimization instance. OSGA first selects an informative sample via a variance-based data selection strategy and learns a single steering vector with a contrastive objective with generative anchor regularization. The resulting vector can be universally applied at a certain layer during inference time without modifying model parameters. Experiments across multiple benchmarks show that a single OSGA-optimized steering vector consistently improves hallucination mitigation and safety enhancement with negligible overhead, highlighting one-shot steering as a practical and scalable solution for reliable VLMs.
[CV-15] Leveraging Multi-Rater Annotations to Calibrate Object Detectors in Microscopy Imaging
【Quick Read】: This paper addresses the lack of calibration in the confidence estimates of deep-learning object detectors for microscopy imaging, which limits their reliability in biomedical applications. The key is to leverage multi-rater annotations: a dedicated model is trained on each expert's independent annotations, and the models' predictions are aggregated to emulate consensus. Compared with conventional label-sampling strategies, this models inter-rater variability in a more principled and effective way, markedly improving calibration while maintaining detection accuracy comparable to existing methods.
Link: https://arxiv.org/abs/2601.23007
Authors: Francesco Campi,Lucrezia Tondo,Ekin Karabati,Johannes Betge,Marie Piraud
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted as a conference paper at ISBI 2026
Abstract:Deep learning-based object detectors have achieved impressive performance in microscopy imaging, yet their confidence estimates often lack calibration, limiting their reliability for biomedical applications. In this work, we introduce a new approach to improve model calibration by leveraging multi-rater annotations. We propose to train separate models on the annotations from single experts and aggregate their predictions to emulate consensus. This improves upon label sampling strategies, where models are trained on mixed annotations, and offers a more principled way to capture inter-rater variability. Experiments on a colorectal organoid dataset annotated by two experts demonstrate that our rater-specific ensemble strategy improves calibration performance while maintaining comparable detection accuracy. These findings suggest that explicitly modelling rater disagreement can lead to more trustworthy object detectors in biomedical imaging.
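Calibration of detector confidences, the quantity at stake here, is commonly summarized by the Expected Calibration Error (ECE). Below is a minimal binned implementation of that standard metric; it is not code from the paper, and how detections are matched to ground truth to obtain the `correct` array is left out.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Standard binned ECE: weighted gap between mean confidence and accuracy."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            # weight each bin's calibration gap by its share of predictions
            ece += in_bin.mean() * abs(conf[in_bin].mean() - correct[in_bin].mean())
    return ece
```

A well-calibrated ensemble, such as the rater-specific one proposed here, should drive this gap toward zero.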
[CV-16] Self-Supervised Slice-to-Volume Reconstruction with Gaussian Representations for Fetal MRI
【Quick Read】: This paper addresses the difficulty of reconstructing high-quality 3D volumes from motion-corrupted 2D slices in fetal MR imaging. Conventional slice-to-volume reconstruction (SVR) methods are computationally expensive and require multiple orthogonal stacks, while learning-based approaches depend on ground-truth volumes for training, which are inaccessible in practice. The key is GaussianSVR, a self-supervised framework that models the target volume with 3D Gaussian representations for high-fidelity reconstruction and enables ground-truth-free self-supervised training by simulating the forward slice acquisition model; a multi-resolution training strategy jointly optimizes Gaussian parameters and spatial transformations across resolution levels, markedly improving both accuracy and efficiency.
Link: https://arxiv.org/abs/2601.22990
Authors: Yinsong Wang,Thomas Fletcher,Xinzhe Luo,Aine Travers Dineen,Rhodri Cusack,Chen Qin
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reconstructing 3D fetal MR volumes from motion-corrupted stacks of 2D slices is a crucial and challenging task. Conventional slice-to-volume reconstruction (SVR) methods are time-consuming and require multiple orthogonal stacks for reconstruction. While learning-based SVR approaches have significantly reduced the time required at the inference stage, they heavily rely on ground truth information for training, which is inaccessible in practice. To address these challenges, we propose GaussianSVR, a self-supervised framework for slice-to-volume reconstruction. GaussianSVR represents the target volume using 3D Gaussian representations to achieve high-fidelity reconstruction. It leverages a simulated forward slice acquisition model to enable self-supervised training, alleviating the need for ground-truth volumes. Furthermore, to enhance both accuracy and efficiency, we introduce a multi-resolution training strategy that jointly optimizes Gaussian parameters and spatial transformations across different resolution levels. Experiments show that GaussianSVR outperforms the baseline methods on fetal MR volumetric reconstruction. Code will be available upon acceptance.
[CV-17] About an Automating Annotation Method for Robot Markers
【Quick Read】: This paper addresses two problems in factory-automation settings: the limited robustness of conventional image-processing-based ArUco marker recognition (e.g., with OpenCV) under noise, motion blur, defocus, or illumination changes, and the bottleneck created by the large amount of manual annotation needed to train deep learning models. The key is to use the ArUco markers' built-in recognition module to obtain marker IDs and positions automatically, enabling fully automatic annotation of training data with no human intervention, which greatly reduces labeling cost and ensures consistent label quality. A YOLO-based deep learning model trained on the automatically annotated dataset is shown to outperform conventional image-processing techniques, particularly on blurred and defocused images.
Link: https://arxiv.org/abs/2601.22982
Authors: Wataru Uemura,Takeru Nagashima
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
Abstract:Factory automation has become increasingly important due to labor shortages, leading to the introduction of autonomous mobile robots for tasks such as material transportation. Markers are commonly used for robot self-localization and object identification. In the RoboCup Logistics League (RCLL), ArUco markers are employed both for robot localization and for identifying processing modules. Conventional recognition relies on OpenCV-based image processing, which detects black-and-white marker patterns. However, these methods often fail under noise, motion blur, defocus, or varying illumination conditions. Deep-learning-based recognition offers improved robustness under such conditions, but requires large amounts of annotated data. Annotation must typically be done manually, as the type and position of objects cannot be detected automatically, making dataset preparation a major bottleneck. In contrast, ArUco markers include built-in recognition modules that provide both ID and positional information, enabling automatic annotation. This paper proposes an automated annotation method for training deep-learning models on ArUco marker images. By leveraging marker detection results obtained from the ArUco module, the proposed approach eliminates the need for manual labeling. A YOLO-based model is trained using the automatically annotated dataset, and its performance is evaluated under various conditions. Experimental results demonstrate that the proposed method improves recognition performance compared with conventional image-processing techniques, particularly for images affected by blur or defocus. Automatic annotation also reduces human effort and ensures consistent labeling quality. Future work will investigate the relationship between confidence thresholds and recognition performance.
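Converting an ArUco detection into a YOLO training label, as the auto-annotation described above requires, is a simple coordinate normalization. The sketch below assumes the detected corners arrive as (x, y) pixel tuples; the detection step itself (e.g., via `cv2.aruco`) is omitted, and the function name is ours.

```python
def corners_to_yolo(marker_id, corners, img_w, img_h):
    """Turn detected marker corners into one line of a YOLO label file.

    corners: iterable of (x, y) pixel coordinates of the four marker corners.
    YOLO format: "<class> <cx> <cy> <w> <h>", all normalized to [0, 1].
    """
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    cx = (min(xs) + max(xs)) / 2.0 / img_w
    cy = (min(ys) + max(ys)) / 2.0 / img_h
    w = (max(xs) - min(xs)) / img_w
    h = (max(ys) - min(ys)) / img_h
    return f"{marker_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"
```

Writing one such line per detected marker yields a YOLO-ready dataset with no manual labeling.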
[CV-18] Improving Supervised Machine Learning Performance in Optical Quality Control via Generative AI for Dataset Expansion
【Quick Read】: This paper addresses the degraded performance of supervised learning models in optical quality control for industrial production caused by the scarcity of defect samples, i.e., the adverse effect of severely imbalanced datasets on model training. The key is to expand the limited dataset with generative artificial intelligence, specifically using the Stable Diffusion and CycleGAN image generation models for data augmentation; experiments show that expansion with Stable Diffusion yields the largest gain, improving segmentation performance by 4.6% over the original model to a Mean Intersection over Union (Mean IoU) of 84.6%.
Link: https://arxiv.org/abs/2601.22961
Authors: Dennis Sprute,Hanna Senke,Holger Flatt
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at 19th CIRP Conference on Intelligent Computation in Manufacturing Engineering
Abstract:Supervised machine learning algorithms play a crucial role in optical quality control within industrial production. These approaches require representative datasets for effective model training. However, while non-defective components are frequent, defective parts are rare in production, resulting in highly imbalanced datasets that adversely impact model performance. Existing strategies to address this challenge, such as specialized loss functions or traditional data augmentation techniques, have limitations, including the need for careful hyperparameter tuning or the alteration of only simple image features. Therefore, this work explores the potential of generative artificial intelligence (GenAI) as an alternative method for expanding limited datasets and enhancing supervised machine learning performance. Specifically, we investigate Stable Diffusion and CycleGAN as image generation models, focusing on the segmentation of combine harvester components in thermal images for subsequent defect detection. Our results demonstrate that dataset expansion using Stable Diffusion yields the most significant improvement, enhancing segmentation performance by 4.6 %, resulting in a Mean Intersection over Union (Mean IoU) of 84.6 %.
[CV-19] riage: Hierarchical Visual Budgeting for Efficient Video Reasoning in Vision-Language Models ICASSP2026
【Quick Read】: This paper addresses the computational challenge that data redundancy poses for Vision-Language Models (VLMs) in video processing, in particular the high latency and memory cost of prohibitively long token sequences. The key is Triage, a training-free, plug-and-play framework that reframes video reasoning as a resource allocation problem and processes video efficiently via hierarchical visual budgeting. The first stage, frame-level budgeting, identifies keyframes by evaluating their visual dynamics and relevance, producing an importance prior; guided by this prior, the second stage, token-level budgeting, allocates tokens in two phases: it first secures high-relevance core tokens, then selects diverse context tokens with an efficient batched Maximal Marginal Relevance (MMR) algorithm. This markedly improves inference efficiency while matching or surpassing baseline performance.
Link: https://arxiv.org/abs/2601.22959
Authors: Anmin Wang,Nan Zhang,Wei Tao,Xiaoyang Qu,Guokuan Li,Jiguang Wan,Jianzong Wang
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)
Abstract:Vision-Language Models (VLMs) face significant computational challenges in video processing due to massive data redundancy, which creates prohibitively long token sequences. To address this, we introduce Triage, a training-free, plug-and-play framework that reframes video reasoning as a resource allocation problem via hierarchical visual budgeting. Its first stage, Frame-Level Budgeting, identifies keyframes by evaluating their visual dynamics and relevance, generating a strategic prior based on their importance scores. Guided by this prior, the second stage, Token-Level Budgeting, allocates tokens in two phases: it first secures high-relevance Core Tokens, followed by diverse Context Tokens selected with an efficient batched Maximal Marginal Relevance (MMR) algorithm. Extensive experiments demonstrate that Triage improves inference speed and reduces memory footprint, while maintaining or surpassing the performance of baselines and other methods on various video reasoning benchmarks.
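The Maximal Marginal Relevance step in token-level budgeting can be sketched as a greedy loop over token features. This is a generic MMR selector under stated assumptions (the trade-off weight `lam` and the greedy formulation are standard), not Triage's exact batched implementation.

```python
import numpy as np

def mmr_select(tokens, query, k, lam=0.5):
    """Greedy Maximal Marginal Relevance over token features.

    tokens: (N, d) token feature matrix; query: (d,) query feature.
    Returns indices of k tokens trading off relevance against redundancy.
    """
    rel = tokens @ query                          # relevance to the query
    selected = [int(np.argmax(rel))]
    while len(selected) < k:
        # redundancy = max similarity to any already-selected token
        redundancy = (tokens @ tokens[selected].T).max(axis=1)
        score = lam * rel - (1.0 - lam) * redundancy
        score[selected] = -np.inf                 # never reselect a token
        selected.append(int(np.argmax(score)))
    return selected
```

Note how the near-duplicate of the top token loses to a less relevant but more diverse token, which is exactly the "diverse context tokens" behavior the summary describes.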
[CV-20] Q-Hawkeye: Reliable Visual Policy Optimization for Image Quality Assessment
【Quick Read】: This paper addresses two key reliability issues in reinforcement learning (RL)-based image quality assessment (IQA): first, existing GRPO-based methods apply uniform advantage weighting to all training samples, ignoring that prediction stability varies across samples, so noisy signals from unstable samples are amplified and corrupt gradient updates; second, most methods emphasize text-grounded reasoning while neglecting the model's visual perception of the image content itself, weakening the objectivity of quality judgments. The key is the Q-Hawkeye framework, which redesigns the learning signal with two core components: Uncertainty-Aware Dynamic Optimization, which estimates predictive uncertainty from the variance of predicted scores across multiple rollouts and reweights each sample's update strength accordingly, stabilizing policy optimization; and Perception-Aware Optimization, which constructs paired inputs of degraded images and their originals and introduces an Implicit Perception Loss that forces the model to ground its quality judgments in genuine visual evidence, strengthening perceptual reliability.
Link: https://arxiv.org/abs/2601.22920
Authors: Wulin Xie,Rui Dai,Ruidong Ding,Kaikui Liu,Xiangxiang Chu,Xinwen Hou,Jie Wen
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Image Quality Assessment (IQA) predicts perceptual quality scores consistent with human judgments. Recent RL-based IQA methods built on MLLMs focus on generating visual quality descriptions and scores, ignoring two key reliability limitations: (i) although the model’s prediction stability varies significantly across training samples, existing GRPO-based methods apply uniform advantage weighting, thereby amplifying noisy signals from unstable samples in gradient updates; (ii) most works emphasize text-grounded reasoning over images while overlooking the model’s visual perception ability of image content. In this paper, we propose Q-Hawkeye, an RL-based reliable visual policy optimization framework that redesigns the learning signal through unified Uncertainty-Aware Dynamic Optimization and Perception-Aware Optimization. Q-Hawkeye estimates predictive uncertainty using the variance of predicted scores across multiple rollouts and leverages this uncertainty to reweight each sample’s update strength, stabilizing policy optimization. To strengthen perceptual reliability, we construct paired inputs of degraded images and their original images and introduce an Implicit Perception Loss that constrains the model to ground its quality judgments in genuine visual evidence. Extensive experiments demonstrate that Q-Hawkeye outperforms state-of-the-art methods and generalizes better across multiple datasets. The code and models will be made available.
zh
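作为理解辅助,下面用 NumPy 给出不确定性感知重加权的最小示意:用每个样本多次 rollout 预测分数的方差估计不确定性,再据此缩放优势(advantage)的更新强度。其中 1/(1+方差) 的权重映射及均值归一化为笔者为演示所作假设,并非论文原始公式:

```python
import numpy as np

def uncertainty_weights(rollout_scores):
    """由多次rollout预测分数的方差得到逐样本权重:方差大(不稳定)则权重小。
    1/(1+var) 的映射形式为示意性假设。"""
    var = np.var(rollout_scores, axis=1)   # 形状 (n_samples,)
    w = 1.0 / (1.0 + var)
    return w / w.mean()                    # 归一化,保持平均更新强度为1

# 玩具示例:样本0预测稳定,样本1预测噪声大
scores = np.array([[3.1, 3.0, 3.2, 3.1],
                   [1.0, 4.5, 2.0, 4.0]])
w = uncertainty_weights(scores)
advantages = np.array([0.8, 0.8])
weighted = advantages * w                  # 稳定样本获得更大的有效更新
```

可以看到,方差小的样本在梯度更新中被赋予更高权重,与论文中抑制不稳定样本噪声信号的动机一致。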
[CV-21] Deep in the Jungle: Towards Automating Chimpanzee Population Estimation
【速读】:该论文旨在解决野生动物保护中未标记灵长类种群(如黑猩猩)的个体数量和密度估算问题,传统方法依赖于人工从大量相机陷阱视频中获取动物与摄像机之间的距离测量,这一过程耗时且劳动密集。其解决方案的关键在于将基于计算机视觉的单目深度估计(Monocular Depth Estimation, MDE)模型——具体采用Dense Prediction Transformers (DPT) 和 Depth Anything 两种架构——直接集成进生态相机陷阱工作流程,通过自动提取检测距离来推断种群密度与丰度。研究表明,经校准后的DPT模型在距离估计精度及下游密度推算上均优于Depth Anything,尽管两者均存在系统性偏差(尤其在复杂森林环境中倾向于高估距离,导致密度低估),但整体生成的种群估计值可控制在传统人工方法结果的22%误差范围内,验证了MDE驱动的距离采样是一种可行且实用的替代方案。
链接: https://arxiv.org/abs/2601.22917
作者: Tom Raynes,Otto Brookes,Timm Haucke,Lukas Bösch,Anne-Sophie Crunchant,Hjalmar Kühl,Sara Beery,Majid Mirmehdi,Tilo Burghardt
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The estimation of abundance and density in unmarked populations of great apes relies on statistical frameworks that require animal-to-camera distance measurements. In practice, acquiring these distances depends on labour-intensive manual interpretation of animal observations across large camera trap video corpora. This study introduces and evaluates an only sparsely explored alternative: the integration of computer vision-based monocular depth estimation (MDE) pipelines directly into ecological camera trap workflows for great ape conservation. Using a real-world dataset of 220 camera trap videos documenting a wild chimpanzee population, we combine two MDE models, Dense Prediction Transformers and Depth Anything, with multiple distance sampling strategies. These components are used to generate detection distance estimates, from which population density and abundance are inferred. Comparative analysis against manually derived ground-truth distances shows that calibrated DPT consistently outperforms Depth Anything. This advantage is observed in both distance estimation accuracy and downstream density and abundance inference. Nevertheless, both models exhibit systematic biases. We show that, given complex forest environments, they tend to overestimate detection distances and consequently underestimate density and abundance relative to conventional manual approaches. We further find that failures in animal detection across distance ranges are a primary factor limiting estimation accuracy. Overall, this work provides a case study that shows MDE-driven camera trap distance sampling is a viable and practical alternative to manual distance estimation. The proposed approach yields population estimates within 22% of those obtained using traditional methods.
zh
[CV-22] Multi-Cue Anomaly Detection and Localization under Data Contamination
【速读】:该论文旨在解决工业场景下视觉异常检测中存在的两个关键问题:一是现有方法通常假设训练数据仅包含正常样本或无标签数据主要为正常样本,忽略了实际中训练数据常被异常样本污染的事实;二是这些方法受限于无法获取标注异常样本,导致模型难以学习真实异常的判别特征,从而在异常检测与定位上表现不佳。解决方案的关键在于提出一种融合有限异常监督的鲁棒异常检测框架,其核心创新是引入一个由三部分组成的复合异常评分机制:统计偏离度(deviation score)、基于熵的不确定性评分(entropy-based uncertainty score)以及基于分割的空间异常评分(segmentation-based score),该机制不仅提升了检测准确性,还支持基于梯度的定位,实现可解释的异常区域可视化。同时,通过自适应实例加权策略,在少量标注异常样本的基础上有效缓解了数据污染的影响,显著增强了模型在不同污染水平下的鲁棒性。
链接: https://arxiv.org/abs/2601.22913
作者: Anindya Sundar Das,Monowar Bhuyan
机构: Umeå University (于默奥大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages total (10 pages main text + references), 6 figures. Preprint version; the final camera-ready version may differ
Abstract:Visual anomaly detection in real-world industrial settings faces two major limitations. First, most existing methods are trained on purely normal data or on unlabeled datasets assumed to be predominantly normal, presuming the absence of contamination, an assumption that is rarely satisfied in practice. Second, they assume no access to labeled anomaly samples, limiting the model from learning discriminative characteristics of true anomalies. Therefore, these approaches often struggle to distinguish anomalies from normal instances, resulting in reduced detection and weak localization performance. In real-world applications, where training data are frequently contaminated with anomalies, such methods fail to deliver reliable performance. In this work, we propose a robust anomaly detection framework that integrates limited anomaly supervision into the adaptive deviation learning paradigm. We introduce a composite anomaly score that combines three complementary components: a deviation score capturing statistical irregularity, an entropy-based uncertainty score reflecting predictive inconsistency, and a segmentation-based score highlighting spatial abnormality. This unified scoring mechanism enables accurate detection and supports gradient-based localization, providing intuitive and explainable visual evidence of anomalous regions. Following the few-anomaly paradigm, we incorporate a small set of labeled anomalies during training while simultaneously mitigating the influence of contaminated samples through adaptive instance weighting. Extensive experiments on the MVTec and VisA benchmarks demonstrate that our framework outperforms state-of-the-art baselines and achieves strong detection and localization performance, interpretability, and robustness under various levels of data contamination.
zh
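为直观理解上述三元复合异常评分,下面给出一个最小示意:将统计偏离度、预测分布熵与分割图空间响应相加得到总分。其中加权求和的融合方式、各分量权重以及用最大响应作为空间分数均为笔者假设,并非论文的确切实现:

```python
import numpy as np

def composite_anomaly_score(dev, probs, seg_map, alpha=1.0, beta=1.0, gamma=1.0):
    """融合三种互补线索(加权求和形式为示意性假设):
    dev:     统计偏离度评分(标量)
    probs:   预测类别概率 -> 基于熵的不确定性评分
    seg_map: 逐像素异常图 -> 空间异常评分(取最大响应)"""
    probs = np.clip(probs, 1e-12, 1.0)
    entropy = -np.sum(probs * np.log(probs))
    seg_score = float(np.max(seg_map))
    return alpha * dev + beta * entropy + gamma * seg_score

# 玩具示例:正常样本(低偏离、低熵、无空间响应)vs 异常样本
normal = composite_anomaly_score(0.1, np.array([0.98, 0.02]), np.zeros((8, 8)))
anomalous = composite_anomaly_score(2.5, np.array([0.55, 0.45]), np.full((8, 8), 0.9))
```

三个分量分别刻画统计不规则性、预测不一致性与空间异常性,异常样本在每一维上都会推高总分。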
[CV-23] DINO-SAE: DINO Spherical Autoencoder for High-Fidelity Image Reconstruction and Generation
【速读】:该论文旨在解决基于预训练视觉基础模型(Vision Foundation Models, VFM)的生成式自编码器在图像重建中普遍存在的高频细节丢失问题,即重建保真度(fidelity)受限的问题。其关键解决方案在于:首先,提出层次化卷积补丁嵌入模块(Hierarchical Convolutional Patch Embedding),以增强局部结构与纹理的保留;其次,设计余弦相似性对齐目标(Cosine Similarity Alignment),在保证语义一致性的同时允许特征幅值灵活变化,从而更好地保留高频细节;最后,利用自监督学习(SSL)模型表示天然位于超球面(hypersphere)上的特性,采用黎曼流匹配(Riemannian Flow Matching)直接在球面潜空间上训练扩散 Transformer(Diffusion Transformer, DiT),显著提升了重建质量与收敛效率。
链接: https://arxiv.org/abs/2601.22904
作者: Hun Chang,Byunghee Cha,Jong Chul Ye
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages, and 11 figures
Abstract:Recent studies have explored using pretrained Vision Foundation Models (VFMs) such as DINO for generative autoencoders, showing strong generative performance. Unfortunately, existing approaches often suffer from limited reconstruction fidelity due to the loss of high-frequency details. In this work, we present the DINO Spherical Autoencoder (DINO-SAE), a framework that bridges semantic representation and pixel-level reconstruction. Our key insight is that semantic information in contrastive representations is primarily encoded in the direction of feature vectors, while forcing strict magnitude matching can hinder the encoder from preserving fine-grained details. To address this, we introduce Hierarchical Convolutional Patch Embedding module that enhances local structure and texture preservation, and Cosine Similarity Alignment objective that enforces semantic consistency while allowing flexible feature magnitudes for detail retention. Furthermore, leveraging the observation that SSL-based foundation model representations intrinsically lie on a hypersphere, we employ Riemannian Flow Matching to train a Diffusion Transformer (DiT) directly on this spherical latent manifold. Experiments on ImageNet-1K demonstrate that our approach achieves state-of-the-art reconstruction quality, reaching 0.37 rFID and 26.2 dB PSNR, while maintaining strong semantic alignment to the pretrained VFM. Notably, our Riemannian Flow Matching-based DiT exhibits efficient convergence, achieving a gFID of 3.47 at 80 epochs.
zh
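论文的核心观察是:对比表示中的语义信息主要编码在特征向量的方向上,强行匹配幅值反而妨碍细节保留。下面用 NumPy 勾勒余弦相似性对齐目标的一种最小形式(取 1 减平均余弦相似度作为损失,该具体形式为笔者假设):

```python
import numpy as np

def cosine_alignment_loss(student, teacher, eps=1e-8):
    """只对齐特征方向、不约束幅值:先各自归一化,再以1-平均余弦相似度为损失。"""
    s = student / (np.linalg.norm(student, axis=-1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=-1, keepdims=True) + eps)
    return 1.0 - np.mean(np.sum(s * t, axis=-1))

rng = np.random.default_rng(0)
t = rng.normal(size=(4, 16))                      # 冻结的教师(如DINO)特征
loss_same_dir = cosine_alignment_loss(3.0 * t, t)  # 方向相同、幅值放大3倍:损失约为0
loss_random = cosine_alignment_loss(rng.normal(size=(4, 16)), t)
```

幅值整体放大 3 倍不改变损失,说明编码器可以自由利用幅值维度去编码高频细节,这正是该目标优于严格特征匹配之处。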
[CV-24] When Anomalies Depend on Context: Learning Conditional Compatibility for Anomaly Detection ICML2026
【速读】:该论文旨在解决传统异常检测方法中假设异常是观测对象固有属性、与上下文无关的问题,而现实中许多场景下同一对象或行为在不同情境下可能表现为正常或异常(如跑步在跑道上为正常,但在高速公路上则为异常)。为此,作者重新定义并实证了“上下文异常检测”(contextual anomaly detection),即异常性取决于主体与上下文的兼容性而非其内在外观。解决方案的关键在于提出了一种条件兼容性学习框架(conditional compatibility learning framework),利用视觉-语言表示模型在有限监督条件下建模主体与上下文之间的关系,并引入CAAD-3K基准数据集以隔离上下文异常进行系统研究。实验表明,该方法在CAAD-3K上显著优于现有方法,并在MVTec-AD和VisA上达到SOTA性能,验证了建模上下文依赖性对提升结构化异常检测的有效性。
链接: https://arxiv.org/abs/2601.22868
作者: Shashank Mishra,Didier Stricker,Jason Rambach
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Preprint. Submitted to ICML 2026. 8 pages main text, plus appendix
Abstract:Anomaly detection is often formulated under the assumption that abnormality is an intrinsic property of an observation, independent of context. This assumption breaks down in many real-world settings, where the same object or action may be normal or anomalous depending on latent contextual factors (e.g., running on a track versus on a highway). We revisit \emphcontextual anomaly detection, classically defined as context-dependent abnormality, and operationalize it in the visual domain, where anomaly labels depend on subject–context compatibility rather than intrinsic appearance. To enable systematic study of this setting, we introduce CAAD-3K, a benchmark that isolates contextual anomalies by controlling subject identity while varying context. We further propose a conditional compatibility learning framework that leverages vision–language representations to model subject–context relationships under limited supervision. Our method substantially outperforms existing approaches on CAAD-3K and achieves state-of-the-art performance on MVTec-AD and VisA, demonstrating that modeling context dependence complements traditional structural anomaly detection. Our code and dataset will be publicly released.
zh
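上述"主体-上下文兼容性"的思想可以用一个极简示意来说明:在共享的视觉-语言嵌入空间中,以主体与上下文嵌入的余弦相似度作为兼容性代理,低于阈值即判为上下文异常。余弦相似度与阈值判定均为笔者的示意性简化,并非论文所学的兼容性函数:

```python
import numpy as np

def context_compatibility(subject_emb, context_emb):
    """以余弦相似度作为主体-上下文兼容性的示意性代理。"""
    s = subject_emb / np.linalg.norm(subject_emb)
    c = context_emb / np.linalg.norm(context_emb)
    return float(s @ c)

def is_contextual_anomaly(subject_emb, context_emb, tau=0.5):
    """兼容性低于阈值tau时判为上下文异常(tau为假设值)。"""
    return context_compatibility(subject_emb, context_emb) < tau

# 玩具示例:同一主体("跑步者"),不同上下文
runner = np.array([1.0, 0.2, 0.0])
track = np.array([0.9, 0.3, 0.1])      # 跑道:兼容 -> 正常
highway = np.array([-0.8, 0.5, 0.2])   # 高速公路:不兼容 -> 异常
```

同一主体嵌入在不同上下文下得到相反的判定,这正是"异常性取决于主体与上下文兼容性而非内在外观"的含义。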
[CV-25] Under-Canopy Terrain Reconstruction in Dense Forests Using RGB Imaging and Neural 3D Reconstruction WACV2026
【速读】:该论文旨在解决在密集森林冠层下难以获取地表及林下结构信息的问题,这在搜救(Search and Rescue, SAR)、路径规划和森林资源调查等场景中具有重要意义。传统方法依赖于昂贵且笨重的机载激光雷达(LiDAR)或专用于人员检测的热成像合成孔径摄影(Airborne Optical Sectioning, AOS),存在成本高、适用性受限等问题。本文提出了一种基于神经辐射场(Neural Radiance Fields, NeRF)的新方法,仅使用常规RGB图像即可重建无冠层遮挡的逼真地面视图;其关键创新在于:1)引入低光照损失函数以应对林下弱光环境;2)通过控制每条射线的积分过程,提出两种互补策略有效去除冠层遮挡物;3)验证了该方法在SAR任务中实现人员检测的能力,并展示了其在树木计数等森林资源调查中的潜力,从而为上述应用提供了一种低成本、高分辨率的替代方案。
链接: https://arxiv.org/abs/2601.22861
作者: Refael Sheffer,Chen Pinchover,Haim Zisman,Dror Ozeri,Roee Litman
机构: Rafael Advanced Defense Systems inc.(拉斐尔先进防御系统公司); Bar-Ilan University (巴伊兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Graphics (cs.GR)
备注: WACV 2026 CV4EO
Abstract:Mapping the terrain and understory hidden beneath dense forest canopies is of great interest for numerous applications such as search and rescue, trail mapping, forest inventory tasks, and more. Existing solutions rely on specialized sensors: either heavy, costly airborne LiDAR, or Airborne Optical Sectioning (AOS), which uses thermal synthetic aperture photography and is tailored for person detection. We introduce a novel approach for the reconstruction of canopy-free, photorealistic ground views using only conventional RGB images. Our solution is based on the celebrated Neural Radiance Fields (NeRF), a recent 3D reconstruction method. Additionally, we include specific image capture considerations, which dictate the needed illumination to successfully expose the scene beneath the canopy. To better cope with the poorly lit understory, we employ a low light loss. Finally, we propose two complementary approaches to remove occluding canopy elements by controlling per-ray integration procedure. To validate the value of our approach, we present two possible downstream tasks. For the task of search and rescue (SAR), we demonstrate that our method enables person detection which achieves promising results compared to thermal AOS (using only RGB images). Additionally, we show the potential of our approach for forest inventory tasks like tree counting. These results position our approach as a cost-effective, high-resolution alternative to specialized sensors for SAR, trail mapping, and forest-inventory tasks.
zh
[CV-26] Inference-Time Dynamic Modality Selection for Incomplete Multimodal Classification ICLR2026
【速读】:该论文旨在解决多模态深度学习(Multimodal Deep Learning, MDL)在实际部署中因模态数据不完整而面临的“丢弃-补全困境”(discarding-imputation dilemma):传统方法要么直接丢弃缺失模态导致任务相关信息丢失,要么强行补全模态可能引入无关噪声。解决方案的关键在于提出一种推理时动态模态选择框架DyMo,其核心创新是设计了一种基于任务损失的可计算代理指标来近似估计每条测试样本的多模态任务相关信息,并据此构建一个有理论依据的奖励函数以指导可靠模态的选择与融合,从而在不依赖先验数据分布的前提下实现对任务相关信息的自适应利用。
链接: https://arxiv.org/abs/2601.22853
作者: Siyi Du,Xinzhe Luo,Declan P. O’Regan,Chen Qin
机构: Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages (including appendix), accepted by ICLR 2026
Abstract:Multimodal deep learning (MDL) has achieved remarkable success across various domains, yet its practical deployment is often hindered by incomplete multimodal data. Existing incomplete MDL methods either discard missing modalities, risking the loss of valuable task-relevant information, or recover them, potentially introducing irrelevant noise, leading to the discarding-imputation dilemma. To address this dilemma, in this paper, we propose DyMo, a new inference-time dynamic modality selection framework that adaptively identifies and integrates reliable recovered modalities, fully exploring task-relevant information beyond the conventional discard-or-impute paradigm. Central to DyMo is a novel selection algorithm that maximizes multimodal task-relevant information for each test sample. Since direct estimation of such information at test time is intractable due to the unknown data distribution, we theoretically establish a connection between information and the task loss, which we compute at inference time as a tractable proxy. Building on this, a novel principled reward function is proposed to guide modality selection. In addition, we design a flexible multimodal network architecture compatible with arbitrary modality combinations, alongside a tailored training strategy for robust representation learning. Extensive experiments on diverse natural and medical image datasets show that DyMo significantly outperforms state-of-the-art incomplete/dynamic MDL methods across various missing-data scenarios. Our code is available at this https URL.
zh
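DyMo 的核心是在推理时用可计算的任务损失代理(如融合预测的熵)来近似各模态组合的任务相关信息,并选出最优组合。下面以对子集的穷举搜索作最小示意;穷举搜索与具体代理函数均为笔者的示意性替代,并非论文基于奖励函数的选择算法:

```python
import itertools

def select_modalities(loss_proxy, modalities):
    """枚举所有非空模态子集,选出任务损失代理最小的组合。"""
    best, best_loss = None, float("inf")
    for r in range(1, len(modalities) + 1):
        for subset in itertools.combinations(modalities, r):
            l = loss_proxy(subset)
            if l < best_loss:
                best, best_loss = subset, l
    return best, best_loss

# 玩具代理:被补全的'audio'模态噪声较大,纳入后反而推高损失
losses = {("image",): 0.40, ("audio",): 0.90,
          ("image", "audio"): 0.55}
chosen, loss = select_modalities(lambda s: losses[s], ["image", "audio"])
```

该示例体现了"丢弃-补全困境"之外的第三条路:不是一律丢弃或一律补全,而是逐样本判断补全模态是否真正带来任务相关信息。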
[CV-27] How Much of a Model Do We Need? Redundancy and Slimmability in Remote Sensing Foundation Models
【速读】:该论文试图解决的问题是:遥感(Remote Sensing, RS)领域的大规模基础模型(Foundation Models, FMs)是否像计算机视觉(Computer Vision, CV)领域那样遵循相同的参数扩展规律,即是否存在过参数化(overparameterized)现象——即在较小规模下参数增加主要导致冗余表示而非新的抽象能力。解决方案的关键在于采用事后瘦身(post-hoc slimming)方法,通过均匀缩小预训练编码器的宽度来测量六种先进RS FMs在四个下游分类任务中的表征冗余性。实验结果表明,RS FMs在极低计算预算(如1% FLOPs)下仍能保持超过71%的相对准确率,显著优于CV领域的表现(<10%准确率),从而验证了其更早进入过参数化状态的假设,并揭示了RS FMs中任务相关信息具有高度冗余分布的机制特性。这一发现不仅为资源受限环境提供了实用部署策略,也为挑战当前RS领域“越大越好”的扩展范式提供了实证依据。
链接: https://arxiv.org/abs/2601.22841
作者: Leonard Hackel,Tom Burgert,Begüm Demir
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large-scale foundation models (FMs) in remote sensing (RS) are developed based on the paradigms established in computer vision (CV) and have shown promise for various Earth observation applications. However, the direct transfer of scaling assumptions from CV to RS has not been adequately examined. We hypothesize that RS FMs enter an overparameterized regime at substantially smaller scales than their CV counterparts, where increasing parameter count primarily induces redundant representations rather than qualitatively new abstractions. To test this hypothesis, we use post-hoc slimming, where we uniformly reduce the width of pretrained encoder, as a tool to measure representational redundancy across six state-of-the-art RS FMs on four downstream classification tasks. Our findings reveal a significant contrast with those in the CV domain: while a post-hoc slimmed masked autoencoder (MAE) trained on ImageNet retains less than 10% accuracy at 1% FLOPs, RS FMs maintain over 71% relative accuracy at the same budget. This sevenfold difference provides strong empirical support for our hypothesis. We further demonstrate that learned slimmable training can improve both Momentum Contrast (MoCo)- and MAE- based models. In addition, through the explained variance ratio and the feature correlation analysis, we provide mechanistic explanations showing that RS FMs distribute task-relevant information with high redundancy. Our findings establish post-hoc slimmability as both a practical deployment strategy for resource-constrained environments and a diagnostic tool that challenges the prevailing scaling paradigm in RS. Upon acceptance, we will publish all code.
zh
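论文用到的事后瘦身(post-hoc slimming)操作本身非常简单:对预训练编码器逐层均匀削减宽度。下面以单个线性层为例作最小示意;按通道顺序直接切片是笔者的简化,实际工作中通道的取舍方式可能不同:

```python
import numpy as np

def slim_linear(W, b, keep_frac):
    """事后瘦身:只保留前 keep_frac 比例的输出通道(按序切片为示意性简化)。"""
    k = max(1, int(W.shape[0] * keep_frac))
    return W[:k], b[:k]

rng = np.random.default_rng(0)
W, b = rng.normal(size=(256, 128)), rng.normal(size=256)
W_slim, b_slim = slim_linear(W, b, keep_frac=0.01)  # 对应约1% FLOPs预算
x = rng.normal(size=128)
y_slim = W_slim @ x + b_slim                        # 瘦身后的前向传播
```

论文的发现在于:对遥感基础模型施加这种极端削减后,下游精度仍能保留 71% 以上,而 CV 领域的 MAE 在同等预算下精度不足 10%,从而以瘦身作为测量表征冗余的诊断工具。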
[CV-28] Neural Clothing Tryer: Customized Virtual Try-On via Semantic Enhancement and Controlling Diffusion Model
【速读】:该论文旨在解决一种新型的定制化虚拟试衣(Customized Virtual Try-On, Cu-VTON)任务,即在数字模型上叠加指定服装的同时,允许用户对模型的外观、姿态及额外属性进行个性化调整,从而提升虚拟试衣的灵活性与沉浸感。传统虚拟试衣(VTON)任务通常仅限于固定模型和服装匹配,而Cu-VTON通过支持多维度定制显著增强了用户体验。解决方案的关键在于提出了一种神经服装试穿框架(Neural Clothing Tryer, NCT),其核心创新包括:一是引入语义增强模块,利用视觉-语言编码器学习跨模态对齐特征,并作为条件输入引导扩散模型以更好地保留服装的语义特征和纹理细节;二是设计语义控制模块,将服装图像、定制姿态图像及语义描述联合输入,实现服装细节保持的同时对模型姿态、表情等属性进行灵活编辑。
链接: https://arxiv.org/abs/2601.22838
作者: Zhijing Yang,Weiwei Zhang,Mingliang Yang,Siyuan Peng,Yukai Shi,Junpeng Tan,Tianshui Chen,Liruo Zhong
机构: Guangdong University of Technology (广东工业大学); Guangdong Provincial Key Laboratory of Intellectual Property & Big Data (广东省知识产权与大数据重点实验室); South China University of Technology (华南理工大学); Genstoraige Technology (Beijing) Co., Ltd (北京亘生科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by Expert Systems with Applications. 16 pages, 10 figures
Abstract:This work aims to address a novel Customized Virtual Try-ON (Cu-VTON) task, enabling the superimposition of a specified garment onto a model that can be customized in terms of appearance, posture, and additional attributes. Compared with traditional VTON task, it enables users to tailor digital avatars to their individual preferences, thereby enhancing the virtual fitting experience with greater flexibility and engagement. To address this task, we introduce a Neural Clothing Tryer (NCT) framework, which exploits the advanced diffusion models equipped with semantic enhancement and controlling modules to better preserve semantic characterization and textural details of the garment and meanwhile facilitating the flexible editing of the model’s postures and appearances. Specifically, NCT introduces a semantic-enhanced module to take semantic descriptions of garments and utilizes a visual-language encoder to learn aligned features across modalities. The aligned features are served as condition input to the diffusion model to enhance the preservation of the garment’s semantics. Then, a semantic controlling module is designed to take the garment image, tailored posture image, and semantic description as input to maintain garment details while simultaneously editing model postures, expressions, and various attributes. Extensive experiments on the open available benchmark demonstrate the superior performance of the proposed NCT framework.
zh
[CV-29] NativeTok: Native Visual Tokenization for Improved Image Generation
【速读】:该论文旨在解决基于向量量化(Vector Quantization, VQ)的图像生成方法中,第一阶段的分词过程与第二阶段的生成模型之间存在的不匹配问题。现有方法虽然提升了分词质量,但由于未能约束token之间的依赖关系,导致生成模型需从无序分布中学习,从而引发偏差和结构不一致的问题。解决方案的关键在于提出“原生视觉分词”(Native Visual Tokenization),通过在分词阶段强制引入因果依赖关系,使token序列天然具备顺序建模能力;在此基础上构建的NativeTok框架,采用元图像Transformer(Meta Image Transformer, MIT)进行潜在空间建模,并结合因果专家混合Transformer(Mixture of Causal Expert Transformer, MoCET),其中每个轻量级专家模块仅基于先前token和潜在特征生成当前token,从而实现高效且具关系约束的图像重建。
链接: https://arxiv.org/abs/2601.22837
作者: Bin Wu,Mengqi Huang,Weinan Jia,Zhendong Mao
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:VQ-based image generation typically follows a two-stage pipeline: a tokenizer encodes images into discrete tokens, and a generative model learns their dependencies for reconstruction. However, improved tokenization in the first stage does not necessarily enhance the second-stage generation, as existing methods fail to constrain token dependencies. This mismatch forces the generative model to learn from unordered distributions, leading to bias and weak coherence. To address this, we propose native visual tokenization, which enforces causal dependencies during tokenization. Building on this idea, we introduce NativeTok, a framework that achieves efficient reconstruction while embedding relational constraints within token sequences. NativeTok consists of: (1) a Meta Image Transformer (MIT) for latent image modeling, and (2) a Mixture of Causal Expert Transformer (MoCET), where each lightweight expert block generates a single token conditioned on prior tokens and latent features. We further design a Hierarchical Native Training strategy that updates only new expert blocks, ensuring training efficiency. Extensive experiments demonstrate the effectiveness of NativeTok.
zh
[CV-30] A Comparative Evaluation of Large Vision-Language Models for 2D Object Detection under SOTIF Conditions
【速读】:该论文旨在解决自动驾驶系统中因感知不足导致的安全风险问题,即功能安全(Safety of the Intended Functionality, SOTIF)中的关键挑战,特别是在恶劣环境条件下传统检测器性能下降的问题。解决方案的关键在于系统性评估十种代表性生成式视觉语言模型(Large Vision-Language Models, LVLMs)在长尾交通场景和环境退化条件下的2D目标检测性能,并与基于YOLO的经典感知方法进行定量对比。实验表明,高性能LVLMs在复杂自然场景中召回率比YOLO基线提升超过25%,展现出更强的鲁棒性;而YOLO在合成扰动下的几何精度仍具优势,凸显了语义推理与几何回归能力的互补性,支持将LVLMs作为SOTIF导向的自动驾驶系统中的高层安全验证工具。
链接: https://arxiv.org/abs/2601.22830
作者: Ji Zhou,Yilin Ding,Yongqi Zhao,Jiachen Xu,Arno Eichberger
机构: Graz University of Technology (格拉茨工业大学); Independent Researcher (独立研究员)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 6 pages, 11 figures
Abstract:Reliable environmental perception remains one of the main obstacles for safe operation of automated vehicles. Safety of the Intended Functionality (SOTIF) concerns safety risks from perception insufficiencies, particularly under adverse conditions where conventional detectors often falter. While Large Vision-Language Models (LVLMs) demonstrate promising semantic reasoning, their quantitative effectiveness for safety-critical 2D object detection is underexplored. This paper presents a systematic evaluation of ten representative LVLMs using the PeSOTIF dataset, a benchmark specifically curated for long-tail traffic scenarios and environmental degradations. Performance is quantitatively compared against the classical perception approach, a YOLO-based detector. Experimental results reveal a critical trade-off: top-performing LVLMs (e.g., Gemini 3, Doubao) surpass the YOLO baseline in recall by over 25% in complex natural scenarios, exhibiting superior robustness to visual degradation. Conversely, the baseline retains an advantage in geometric precision for synthetic perturbations. These findings highlight the complementary strengths of semantic reasoning versus geometric regression, supporting the use of LVLMs as high-level safety validators in SOTIF-oriented automated driving systems.
zh
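该评测的核心指标之一是召回率(recall):在给定 IoU 阈值下,有多少真值框被某个预测框命中。下面给出一个自包含的最小实现作参考;贪心的"任一预测命中即算召回"的匹配方式为常见简化,论文的匹配细节为笔者假设:

```python
def iou(a, b):
    """两个 [x1, y1, x2, y2] 框的交并比。"""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def recall(gts, preds, thr=0.5):
    """在 IoU >= thr 下被任一预测命中的真值框比例(匹配细节为示意性简化)。"""
    hits = sum(1 for g in gts if any(iou(g, p) >= thr for p in preds))
    return hits / max(1, len(gts))

# 玩具示例:两个真值目标,只检出了第一个
gts = [[0, 0, 10, 10], [20, 20, 30, 30]]
preds = [[1, 1, 10, 10]]
r = recall(gts, preds)
```

在此口径下,论文报告高性能 LVLMs 在复杂自然场景中的召回率比 YOLO 基线高出 25% 以上,但几何精度(框的定位准确性)仍是经典回归式检测器的长项。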
[CV-31] Decomposing and Composing: Towards Efficient Vision-Language Continual Learning via Rank-1 Expert Pool in a Single LoRA
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在持续学习(Continual Learning, CL)过程中面临的任务适应能力弱和灾难性遗忘(Catastrophic Forgetting)问题。现有方法通常存在推理负担重或依赖外部知识的局限性,而低秩适配(Low-Rank Adaptation, LoRA)虽具参数高效潜力,但直接应用仍难以缓解遗忘。其解决方案的关键在于:将单一LoRA模块重构为可分解的秩-1专家池(Rank-1 Expert Pool),通过[CLS]标记语义动态选择稀疏的任务特定更新;并提出激活引导正交损失(Activation-Guided Orthogonal, AGO)损失,使跨任务的LoRA权重关键部分正交化,从而实现更少参数更新、更低任务干扰,并保持下游任务性能。该方法在减少96.7%训练参数的同时,无需外部数据或任务ID判别器,且合并后的LoRA无推理延迟,显著提升了计算效率与泛化能力。
链接: https://arxiv.org/abs/2601.22828
作者: Zhan Fa,Yue Duan,Jian Zhang,Lei Qi,Wanqi Yang,Yinghuan Shi
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Continual learning (CL) in vision-language models (VLMs) faces significant challenges in improving task adaptation and avoiding catastrophic forgetting. Existing methods usually have heavy inference burden or rely on external knowledge, while Low-Rank Adaptation (LoRA) has shown potential in reducing these issues by enabling parameter-efficient tuning. However, considering directly using LoRA to alleviate the catastrophic forgetting problem is non-trivial, we introduce a novel framework that restructures a single LoRA module as a decomposable Rank-1 Expert Pool. Our method learns to dynamically compose a sparse, task-specific update by selecting from this expert pool, guided by the semantics of the [CLS] token. In addition, we propose an Activation-Guided Orthogonal (AGO) loss that orthogonalizes critical parts of LoRA weights across tasks. This sparse composition and orthogonalization enable fewer parameter updates, resulting in domain-aware learning while minimizing inter-task interference and maintaining downstream task performance. Extensive experiments across multiple settings demonstrate state-of-the-art results in all metrics, surpassing zero-shot upper bounds in generalization. Notably, it reduces trainable parameters by 96.7% compared to the baseline method, eliminating reliance on external datasets or task-ID discriminators. The merged LoRAs retain less weights and incur no inference latency, making our method computationally lightweight.
zh
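"单一 LoRA 重构为秩-1 专家池"的思路可以用几行 NumPy 说明:把秩为 r 的 LoRA 更新 BA 拆成 r 个秩-1 外积项,按门控稀疏组合出任务特定更新。秩-1 分解来自论文本身,但此处的 0/1 硬门控是对其 [CLS] 语义引导选择的示意性简化:

```python
import numpy as np

def lora_update(A, B, gate):
    """稀疏组合秩-1专家:delta_W = sum_i gate[i] * B[:, i] @ A[i, :]。
    gate为逐专家的选择系数(此处用0/1硬门控作示意)。"""
    return (B * gate) @ A            # (d_out, r) 列缩放后乘 (r, d_in)

rng = np.random.default_rng(0)
r, d_in, d_out = 8, 16, 16
A, B = rng.normal(size=(r, d_in)), rng.normal(size=(d_out, r))
gate = np.zeros(r)
gate[[1, 4]] = 1.0                   # 本任务只激活2个秩-1专家
dW = lora_update(A, B, gate)         # 得到秩为2的稀疏任务更新
full = B @ A                          # 对比:全量LoRA更新(秩为r)
```

稀疏选择使每个任务只更新少量秩-1 分量,配合 AGO 损失的跨任务正交化即可降低任务间干扰;合并后仍是普通的低秩增量,因此推理零延迟。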
[CV-32] FarmMind: Reasoning -Query-Driven Dynamic Segmentation for Farmland Remote Sensing Images
【速读】:该论文旨在解决农田遥感图像(Farmland Remote Sensing Image, FRSI)分割中因单一输入图像信息有限而导致的推理能力不足问题,尤其在复杂场景下存在视觉不确定性时表现不佳。现有静态分割方法仅依赖单个图像块内的局部信息,难以应对模糊和多义性场景。解决方案的关键在于提出一种基于推理-查询驱动的动态分割框架 FarmMind,其核心创新是引入了"推理-查询机制"——该机制不直接盲目调用辅助图像,而是先通过分析分割歧义的根本原因进行逻辑推理,再根据推理结果决定所需类型(如高分辨率、大尺度或时序相邻)的外部辅助图像进行按需查询,从而模拟人类专家的跨图像验证思维过程,显著提升了分割精度与泛化能力。
链接: https://arxiv.org/abs/2601.22809
作者: Haiyang Wu,Weiliang Mu,Jipeng Zhang,Zhong Dandan,Zhuofei Du,Haifeng Li,Tao Chao
机构: School of Geosciences and Info-Physics, Central South University (中南大学地球科学与信息物理学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing methods for farmland remote sensing image (FRSI) segmentation generally follow a static segmentation paradigm, where analysis relies solely on the limited information contained within a single input patch. Consequently, their reasoning capability is limited when dealing with complex scenes characterized by ambiguity and visual uncertainty. In contrast, human experts, when interpreting remote sensing images in such ambiguous cases, tend to actively query auxiliary images (such as higher-resolution, larger-scale, or temporally adjacent data) to conduct cross-verification and achieve more comprehensive reasoning. Inspired by this, we propose a reasoning-query-driven dynamic segmentation framework for FRSIs, named FarmMind. This framework breaks through the limitations of the static segmentation paradigm by introducing a reasoning-query mechanism, which dynamically and on-demand queries external auxiliary images to compensate for the insufficient information in a single input image. Unlike direct queries, this mechanism simulates the thinking process of human experts when faced with segmentation ambiguity: it first analyzes the root causes of segmentation ambiguities through reasoning, and then determines what type of auxiliary image needs to be queried based on this analysis. Extensive experiments demonstrate that FarmMind achieves superior segmentation performance and stronger generalization ability compared with existing methods. The source code and dataset used in this work are publicly available at: this https URL.
zh
[CV-33] Diachronic Stereo Matching for Multi-Date Satellite Imagery
【速读】:该论文旨在解决卫星遥感影像中因时间差异导致的立体匹配难题,即当图像采集时间间隔较长(如数月)时,季节、光照和阴影变化会严重违反传统立体匹配假设,使得现有重建方法失效。其解决方案的关键在于提出首个针对时差影像的立体匹配(diachronic stereo matching)深度学习方法:首先基于融合单目深度先验的预训练MonSter模型,并在由DFC2019遥感挑战赛数据构建的、包含多样化时差图像对的立体数据集上进行微调;其次,通过引入时序多样性与单目深度约束,显著提升了模型在跨季节、跨光照条件下的几何重建鲁棒性,实验证明该方法在同步与非同步场景下均优于经典管线及未适配的深度模型。
链接: https://arxiv.org/abs/2601.22808
作者: Elías Masquil(IIE, UDELAR),Luca Savant Aira(Polito),Roger Marí,Thibaud Ehret(AMIAD),Pablo Musé(IIE, UDELAR, CB),Gabriele Facciolo(CB, IUF)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in image-based satellite 3D reconstruction have progressed along two complementary directions. On one hand, multi-date approaches using NeRF or Gaussian-splatting jointly model appearance and geometry across many acquisitions, achieving accurate reconstructions on opportunistic imagery with numerous observations. On the other hand, classical stereoscopic reconstruction pipelines deliver robust and scalable results for simultaneous or quasi-simultaneous image pairs. However, when the two images are captured months apart, strong seasonal, illumination, and shadow changes violate standard stereoscopic assumptions, causing existing pipelines to fail. This work presents the first Diachronic Stereo Matching method for satellite imagery, enabling reliable 3D reconstruction from temporally distant pairs. Two advances make this possible: (1) fine-tuning a state-of-the-art deep stereo network that leverages monocular depth priors, and (2) exposing it to a dataset specifically curated to include a diverse set of diachronic image pairs. In particular, we start from a pretrained MonSter model, trained initially on a mix of synthetic and real datasets such as SceneFlow and KITTI, and fine-tune it on a set of stereo pairs derived from the DFC2019 remote sensing challenge. This dataset contains both synchronic and diachronic pairs under diverse seasonal and illumination conditions. Experiments on multi-date WorldView-3 imagery demonstrate that our approach consistently surpasses classical pipelines and unadapted deep stereo models on both synchronic and diachronic settings. Fine-tuning on temporally diverse images, together with monocular priors, proves essential for enabling 3D reconstruction from previously incompatible acquisition dates.
[Figure 1: Output geometry for a winter-autumn image pair from Omaha (OMA 331 test scene), comparing the left (winter) and right (autumn) images with DSM geometry from the proposed method (1.23 m), a zero-shot baseline (3.99 m), and LiDAR ground truth. The method recovers accurate geometry despite the strong appearance changes of the diachronic pair, which cause existing zero-shot methods to fail. Missing values due to perspective shown in black; mean altitude error in parentheses, lower is better.]
zh
[CV-34] HeatMat: Simulation of City Material Impact on Urban Heat Island Effect
【速读】:该论文旨在解决城市热岛(Urban Heat Island, UHI)效应研究中因传感器数据空间和时间分辨率不足而导致的难题,特别是难以量化城市材料属性对UHI的个体影响。解决方案的关键在于提出HeatMat方法,该方法基于街景图像与预训练视觉-语言模型(Vision-Language Model, VLM),从开放数据中高分辨率估计建筑立面材料,并将其编码为反映城市垂直结构与材料特征的二维地图;这些地图作为输入驱动一个2.5D热耦合模拟器,可实现多尺度随机访问表面温度估算,在计算效率上相较传统3D模拟提升达20倍。
链接: https://arxiv.org/abs/2601.22796
作者: Marie Reinbigler,Romain Rouffet,Peter Naylor,Mikolaj Czerkawski,Nikolaos Dionelis,Elisabeth Brunet,Catalin Fetita,Rosalie Martin
机构: SAMOVAR, Inria Saclay, Télécom SudParis, Institut Polytechnique de Paris, Palaiseau, France; Adobe Research, Paris, France; ESA/ESRIN, Φ-lab, Frascati, Italy; Asterisk Labs, London, UK; SAMOVAR, Télécom SudParis, Institut Polytechnique de Paris, Palaiseau, France
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The Urban Heat Island (UHI) effect, defined as a significant increase in temperature in urban environments compared to surrounding areas, is difficult to study in real cities using sensor data (satellites or in-situ stations) due to their coarse spatial and temporal resolution. Among the factors contributing to this effect are the properties of urban materials, which differ from those in rural areas. To analyze their individual impact and to test new material configurations, a high-resolution simulation at the city scale is required. Estimating the current materials used in a city, including those on building facades, is also challenging. We propose HeatMat, an approach to analyze at high resolution the individual impact of urban materials on the UHI effect in a real city, relying only on open data. We estimate building materials using street-view images and a pre-trained vision-language model (VLM) to supplement existing OpenStreetMap data, which describes the 2D geometry and features of buildings. We further encode this information into a set of 2D maps that represent the city’s vertical structure and material characteristics. These maps serve as inputs for our 2.5D simulator, which models coupled heat transfers and enables random-access surface temperature estimation at multiple resolutions, reaching an x20 speedup compared to an equivalent simulation in 3D.
zh
[CV-35] Compact Hypercube Embeddings for Fast Text-based Wildlife Observation Retrieval
【速读】:该论文旨在解决大规模生物多样性监测平台中,基于文本的野生动物观测数据检索效率低下的问题,尤其是在高维相似性搜索带来的巨大计算开销背景下。其核心挑战在于如何在保持检索精度的同时显著降低内存占用与搜索延迟。解决方案的关键在于提出了一种紧凑的超立方体嵌入(compact hypercube embeddings)框架,通过跨模态代码对齐哈希(cross-view code alignment hashing)方法,将自然语言描述与视觉或声学观测统一映射到共享的汉明空间(Hamming space)中,从而实现高效的二进制表示检索。该方法利用预训练的野生动物基础模型(如BioCLIP和BioLingual),并采用参数高效微调策略进行哈希适配,不仅在iNaturalist2024和iNatSounds2024等基准上实现了与连续嵌入相当甚至更优的检索性能,还大幅降低了存储和搜索成本,同时提升了编码器表征能力及零样本泛化性能。
链接: https://arxiv.org/abs/2601.22783
作者: Ilyass Moummad,Marius Miron,David Robinson,Kawtar Zaher,Hervé Goëau,Olivier Pietquin,Pierre Bonnet,Emmanuel Chemla,Matthieu Geist,Alexis Joly
机构: 未知
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
备注:
Abstract:Large-scale biodiversity monitoring platforms increasingly rely on multimodal wildlife observations. While recent foundation models enable rich semantic representations across vision, audio, and language, retrieving relevant observations from massive archives remains challenging due to the computational cost of high-dimensional similarity search. In this work, we introduce compact hypercube embeddings for fast text-based wildlife observation retrieval, a framework that enables efficient text-based search over large-scale wildlife image and audio databases using compact binary representations. Building on the cross-view code alignment hashing framework, we extend lightweight hashing beyond a single-modality setup to align natural language descriptions with visual or acoustic observations in a shared Hamming space. Our approach leverages pretrained wildlife foundation models, including BioCLIP and BioLingual, and adapts them efficiently for hashing using parameter-efficient fine-tuning. We evaluate our method on large-scale benchmarks, including iNaturalist2024 for text-to-image retrieval and iNatSounds2024 for text-to-audio retrieval, as well as multiple soundscape datasets to assess robustness under domain shift. Results show that retrieval using discrete hypercube embeddings achieves competitive, and in several cases superior, performance compared to continuous embeddings, while drastically reducing memory and search cost. Moreover, we observe that the hashing objective consistently improves the underlying encoder representations, leading to stronger retrieval and zero-shot generalization. These results demonstrate that binary, language-based retrieval enables scalable and efficient search over large wildlife archives for biodiversity monitoring systems.
zh
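上文 [CV-35] 中“将文本与观测映射到共享汉明空间、再按二进制码检索”的思路,可用如下极简草图示意(纯属假设性实现,与论文代码无关:将连续嵌入按符号二值化,再以汉明距离做最近邻检索):

```python
import numpy as np

def binarize(embeddings: np.ndarray) -> np.ndarray:
    """将连续嵌入按符号二值化为 {0,1} 超立方体编码。"""
    return (embeddings > 0).astype(np.uint8)

def hamming_search(query_code: np.ndarray, db_codes: np.ndarray, k: int = 5) -> np.ndarray:
    """按汉明距离返回库中前 k 个最近邻的索引。"""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists, kind="stable")[:k]

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 64))                   # 假想的 1000 条观测嵌入
query = db[42] + rng.normal(scale=0.05, size=64)   # 与第 42 条语义接近的查询嵌入

db_codes = binarize(db)
top = hamming_search(binarize(query), db_codes, k=3)
print(top[0])  # 噪声很小时, 最近邻应为 42
```

实际系统中二进制码还可用 `np.packbits` 压缩约 8 倍存储,并以按位异或加 popcount 进一步加速距离计算,这正是论文所述“大幅降低内存与搜索成本”的来源。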
[CV-36] Color Matters: Demosaicing-Guided Color Correlation Training for Generalizable AI-Generated Image Detection
【速读】:该论文旨在解决生成式 AI (Generative AI) 生成图像对数字真实性构成威胁时,基于生成伪影的检测器在跨模型泛化能力不足的问题。其解决方案的关键在于利用相机成像流水线的内在特性,具体通过分析由彩色滤光阵列(Color Filter Array, CFA)和去马赛克(Demosaicing)过程引发的颜色相关性,并提出一种基于去马赛克引导的颜色相关性训练(Demosaicing-guided Color Correlation Training, DCCT)框架。该方法通过模拟CFA采样模式,将彩色图像分解为单通道输入(作为条件)和其余两通道作为目标输出进行预测,使用自监督U-Net建模缺失通道的条件分布,并以 logistic 混合分布(mixture of logistic functions)进行参数化;理论分析表明DCCT能够捕捉摄影图像与AI生成图像在颜色相关性特征上的可证明分布差异,从而构建出具有优异泛化性和鲁棒性的二分类器,在超过20种未见过的生成器上显著优于现有方法。
链接: https://arxiv.org/abs/2601.22778
作者: Nan Zhong,Yiran Xu,Mian Zou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:
Abstract:As realistic AI-generated images threaten digital authenticity, we address the generalization failure of generative artifact-based detectors by exploiting the intrinsic properties of the camera imaging pipeline. Concretely, we investigate color correlations induced by the color filter array (CFA) and demosaicing, and propose a Demosaicing-guided Color Correlation Training (DCCT) framework for AI-generated image detection. By simulating the CFA sampling pattern, we decompose each color image into a single-channel input (as the condition) and the remaining two channels as the ground-truth targets (for prediction). A self-supervised U-Net is trained to model the conditional distribution of the missing channels from the given one, parameterized via a mixture of logistic functions. Our theoretical analysis reveals that DCCT targets a provable distributional difference in color-correlation features between photographic and AI-generated images. By leveraging these distinct features to construct a binary classifier, DCCT achieves state-of-the-art generalization and robustness, significantly outperforming prior methods across over 20 unseen generators.
zh
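DCCT 的前提是按 CFA 采样模式把彩色图分解为“单通道条件 + 双通道目标”。下面的草图仅演示常见的 RGGB Bayer 采样与这种通道划分(以绿色通道为条件属示意性假设,与论文具体实现无关):

```python
import numpy as np

def bayer_mosaic(img: np.ndarray) -> np.ndarray:
    """按 RGGB Bayer 模式从 H×W×3 图像采样出单通道马赛克(去马赛克的逆过程)。"""
    h, w, _ = img.shape
    mosaic = np.zeros((h, w), dtype=img.dtype)
    mosaic[0::2, 0::2] = img[0::2, 0::2, 0]  # R
    mosaic[0::2, 1::2] = img[0::2, 1::2, 1]  # G
    mosaic[1::2, 0::2] = img[1::2, 0::2, 1]  # G
    mosaic[1::2, 1::2] = img[1::2, 1::2, 2]  # B
    return mosaic

def split_condition_target(img: np.ndarray, cond: int = 1):
    """以某一通道为条件输入, 其余两通道为待预测目标(通道划分为假设)。"""
    targets = [c for c in range(3) if c != cond]
    return img[..., cond], img[..., targets]

img = np.arange(4 * 4 * 3, dtype=np.float32).reshape(4, 4, 3)
mosaic = bayer_mosaic(img)
condition, target = split_condition_target(img)
print(mosaic.shape, condition.shape, target.shape)
```

论文中的自监督 U-Net 即以 `condition` 为输入、学习 `target` 的条件分布;摄影图像因经历过真实的去马赛克过程,其通道间相关性与生成图像存在可区分的差异。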
[CV-37] Is Training Necessary for Anomaly Detection?
【速读】:该论文旨在解决多类无监督异常检测(Multi-class Unsupervised Anomaly Detection, MUAD)方法中普遍存在的重建残差驱动的异常检测策略所面临的保真度-稳定性困境(fidelity-stability dilemma)。现有方法依赖编码器-解码器模型重建正常特征,但此类方法在保持重建质量与稳定检测异常之间难以兼顾。为突破这一限制,作者提出无需训练的基于检索的异常检测(Retrieval-based Anomaly Detection, RAD),其核心创新在于:将正常样本特征存储于内存中,通过多层次检索机制,将测试图像块与内存中的正常特征进行匹配以实现异常判定。RAD不仅在多个基准数据集(MVTec-AD、VisA、Real-IAD、3D-ADAM)上达到当前最优性能,且理论上证明了检索得分可严格上界重构残差得分,从而颠覆了MUAD必须依赖任务特定训练的传统认知,表明基于记忆的检索策略即可实现顶尖异常检测效果。
链接: https://arxiv.org/abs/2601.22763
作者: Xingwu Zhang,Guanxuan Li,Paul Henderson,Gerardo Aragon-Camarasa,Zijun Long
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Current state-of-the-art multi-class unsupervised anomaly detection (MUAD) methods rely on training encoder-decoder models to reconstruct anomaly-free features. We first show these approaches have an inherent fidelity-stability dilemma in how they detect anomalies via reconstruction residuals. We then abandon the reconstruction paradigm entirely and propose Retrieval-based Anomaly Detection (RAD). RAD is a training-free approach that stores anomaly-free features in a memory and detects anomalies through multi-level retrieval, matching test patches against the memory. Experiments demonstrate that RAD achieves state-of-the-art performance across four established benchmarks (MVTec-AD, VisA, Real-IAD, 3D-ADAM) under both standard and few-shot settings. On MVTec-AD, RAD reaches 96.7% Pixel AUROC with just a single anomaly-free image compared to 98.5% of RAD’s full-data performance. We further prove that retrieval-based scores theoretically upper-bound reconstruction-residual scores. Collectively, these findings overturn the assumption that MUAD requires task-specific training, showing that state-of-the-art anomaly detection is feasible with memory-based retrieval. Our code is available at this https URL.
zh
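RAD 的核心是免训练的记忆检索:把正常样本特征存入内存,以测试 patch 到记忆库最近邻的距离作为异常分数。下为假设性最小草图(论文为多层次检索,此处仅示意单层、欧氏距离):

```python
import numpy as np

def build_memory(normal_patches: np.ndarray) -> np.ndarray:
    """记忆库: 直接存储正常样本的 patch 特征 (N, D), 无需任何训练。"""
    return normal_patches

def anomaly_scores(test_patches: np.ndarray, memory: np.ndarray) -> np.ndarray:
    """每个测试 patch 的异常分数 = 到记忆库中最近正常特征的距离。"""
    d2 = ((test_patches[:, None, :] - memory[None, :, :]) ** 2).sum(-1)  # (M, N)
    return np.sqrt(d2.min(axis=1))

rng = np.random.default_rng(1)
memory = build_memory(rng.normal(size=(200, 16)))       # 正常特征
normal_test = rng.normal(size=(10, 16))                 # 同分布样本
abnormal_test = rng.normal(loc=6.0, size=(10, 16))      # 偏离分布的“异常”

s_norm = anomaly_scores(normal_test, memory)
s_abn = anomaly_scores(abnormal_test, memory)
print(s_norm.mean() < s_abn.mean())  # 异常样本的分数应更高
```

这也直观解释了论文的理论结果:检索得分只会随记忆覆盖度变好而收紧,可以上界重构残差得分。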
[CV-38] Procedural Knowledge Extraction from Industrial Troubleshooting Guides Using Vision Language Models
【速读】:该论文旨在解决工业故障诊断指南中知识提取与结构化的问题,即如何将嵌套在流程图式布局和专业技术语言中的诊断程序高效、准确地转化为机器可读的结构化信息,以支持现场操作人员的智能辅助系统。解决方案的关键在于利用视觉语言模型(Vision Language Models, VLMs)实现对图文混合内容的联合理解,并通过对比两种提示策略——标准指令引导与增强版布局模式提示——来提升模型对诊断流程空间结构与语义内容的识别能力,从而优化知识抽取的准确性与鲁棒性。
链接: https://arxiv.org/abs/2601.22754
作者: Guillermo Gil de Avalle,Laura Maruster,Christos Emmanouilidis
机构: University of Groningen (格罗宁根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Industrial troubleshooting guides encode diagnostic procedures in flowchart-like diagrams where spatial layout and technical language jointly convey meaning. To integrate this knowledge into operator support systems, which assist shop-floor personnel in diagnosing and resolving equipment issues, the information must first be extracted and structured for machine interpretation. However, when performed manually, this extraction is labor-intensive and error-prone. Vision Language Models offer potential to automate this process by jointly interpreting visual and textual meaning, yet their performance on such guides remains underexplored. This paper evaluates two VLMs on extracting structured knowledge, comparing two prompting strategies: standard instruction-guided versus an augmented approach that cues troubleshooting layout patterns. Results reveal model-specific trade-offs between layout sensitivity and semantic robustness, informing practical deployment decisions.
zh
[CV-39] Beauty and the Beast: Imperceptible Perturbations Against Diffusion-Based Face Swapping via Directional Attribute Editing
【速读】:该论文旨在解决基于扩散模型(diffusion-based)的人脸替换技术所带来的隐私侵犯与声誉损害问题,其核心挑战在于现有主动防御方法在扰动强度与防护效果之间存在权衡:过大扰动会破坏人脸结构,过小扰动则削弱防御有效性。解决方案的关键在于提出FaceDefense框架,通过引入一种新的扩散损失(diffusion loss)增强对抗样本的防御效能,并结合方向性人脸属性编辑(directional facial attribute editing)修复扰动引起的失真,从而提升视觉不可感知性;同时设计两阶段交替优化策略生成最终的扰动人脸图像,实现了防护效果与视觉质量之间的优越平衡。
链接: https://arxiv.org/abs/2601.22744
作者: Yilong Huang,Songze Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:
Abstract:Diffusion-based face swapping achieves state-of-the-art performance, yet it also exacerbates the potential harm of malicious face swapping to violate portraiture right or undermine personal reputation. This has spurred the development of proactive defense methods. However, existing approaches face a core trade-off: large perturbations distort facial structures, while small ones weaken protection effectiveness. To address these issues, we propose FaceDefense, an enhanced proactive defense framework against diffusion-based face swapping. Our method introduces a new diffusion loss to strengthen the defensive efficacy of adversarial examples, and employs a directional facial attribute editing to restore perturbation-induced distortions, thereby enhancing visual imperceptibility. A two-phase alternating optimization strategy is designed to generate final perturbed face images. Extensive experiments show that FaceDefense significantly outperforms existing methods in both imperceptibility and defense effectiveness, achieving a superior trade-off.
zh
[CV-40] StreamSense: Streaming Social Task Detection with Selective Vision-Language Model Routing
【速读】:该论文旨在解决实时社交流媒体内容检测中因多模态信号(视频、文本、音频)部分且异步输入而导致的复杂性问题,尤其是在低延迟和高精度要求下的高效理解挑战。解决方案的关键在于提出StreamSense框架,其核心是将轻量级流式编码器与选择性路由机制相结合,动态地将简单样本由轻量编码器处理,而将困难或模糊样本选择性地交由视觉-语言模型(VLM)专家处理,并在上下文不足时推迟决策。这种选择性升级(selective escalation)与推迟(deferral)策略显著提升了效率与准确性,同时通过跨模态对比损失和IoU加权损失优化训练过程,有效缓解了片段边界处标签干扰问题。
链接: https://arxiv.org/abs/2601.22738
作者: Han Wang,Deyi Ji,Lanyun Zhu,Jiebo Luo,Roy Ka-Wei Lee
机构: Singapore University of Technology and Design(新加坡科技设计大学); University of Science and Technology of China(中国科学技术大学); Nanyang Technological University(南洋理工大学); University of Rochester(罗切斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 4 figures, The Web Conference 2026
Abstract:Live streaming platforms require real-time monitoring and reaction to social signals, utilizing partial and asynchronous evidence from video, text, and audio. We propose StreamSense, a streaming detector that couples a lightweight streaming encoder with selective routing to a Vision-Language Model (VLM) expert. StreamSense handles most timestamps with the lightweight streaming encoder, escalates hard/ambiguous cases to the VLM, and defers decisions when context is insufficient. The encoder is trained using (i) a cross-modal contrastive term to align visual/audio cues with textual signals, and (ii) an IoU-weighted loss that down-weights poorly overlapping target segments, mitigating label interference across segment boundaries. We evaluate StreamSense on multiple social streaming detection tasks (e.g., sentiment classification and hate content moderation), and the results show that StreamSense achieves higher accuracy than VLM-only streaming while only occasionally invoking the VLM, thereby reducing average latency and compute. Our results indicate that selective escalation and deferral are effective primitives for understanding streaming social tasks. Code is publicly available on GitHub.
zh
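StreamSense 的“选择性升级与推迟”决策逻辑可抽象为如下三路路由草图(阈值、函数签名与判据均为示意性假设,并非论文实现):

```python
def route(confidence: float, context_len: int,
          tau_accept: float = 0.9, min_context: int = 8) -> str:
    """假设性的三路路由: 上下文不足→推迟;
    置信度足够高→轻量编码器直接判定; 否则升级到 VLM 专家。"""
    if context_len < min_context:
        return "defer"   # 证据不足, 等待更多流式输入
    if confidence >= tau_accept:
        return "light"   # 轻量编码器即可定案
    return "vlm"         # 难例/模糊样本升级到 VLM

print(route(0.95, 32), route(0.4, 32), route(0.99, 3))
```

由于绝大多数时间戳走 `light` 分支,只有少数难例触发 `vlm`,平均延迟与算力开销随之下降,这正是该框架效率收益的来源。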
[CV-41] Lingua-SafetyBench: A Benchmark for Safety Evaluation of Multilingual Vision-Language Models
【速读】:该论文旨在解决视觉语言大模型(Vision-Language Large Models, VLLMs)在联合多语言与多模态输入下的安全性问题,现有基准测试或仅支持多语言但局限于文本,或仅支持多模态但局限于单语种,且缺乏语义对齐的图像-文本配对,难以覆盖真实场景中的跨模态交互风险。其解决方案的关键在于提出 Lingua-SafetyBench,一个包含100,440个有害图像-文本对的多语言多模态基准,明确划分为以图像为主导和以文本为主导的子集,从而解耦风险来源,并通过系统评估揭示了图像主导风险在高资源语言(High-Resource Languages, HRLs)中导致更高攻击成功率(Attack Success Rate, ASR),而文本主导风险在非高资源语言(Non-HRLs)中更具危害性,强调需开展语言与模态感知的安全对齐策略。
链接: https://arxiv.org/abs/2601.22737
作者: Enyi Shi,Pengyang Shao,Yanxin Zhang,Chenhang Cui,Jiayi Lyu,Xu Xie,Xiaobo Xia,Fei Shen,Tat-Seng Chua
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Robust safety of vision-language large models (VLLMs) under joint multilingual and multimodal inputs remains underexplored. Existing benchmarks are typically multilingual but text-only, or multimodal but monolingual. Recent multilingual multimodal red-teaming efforts render harmful prompts into images, yet rely heavily on typography-style visuals and lack semantically grounded image-text pairs, limiting coverage of realistic cross-modal interactions. We introduce Lingua-SafetyBench, a benchmark of 100,440 harmful image-text pairs across 10 languages, explicitly partitioned into image-dominant and text-dominant subsets to disentangle risk sources. Evaluating 11 open-source VLLMs reveals a consistent asymmetry: image-dominant risks yield higher ASR in high-resource languages, while text-dominant risks are more severe in non-high-resource languages. A controlled study on the Qwen series shows that scaling and version upgrades reduce Attack Success Rate (ASR) overall but disproportionately benefit HRLs, widening the gap between HRLs and Non-HRLs under text-dominant risks. This underscores the necessity of language- and modality-aware safety alignment beyond mere scaling. To facilitate reproducibility and future research, we will publicly release our benchmark, model checkpoints, and source code. The code and dataset will be available at this https URL. This paper contains examples with unsafe content.
zh
[CV-42] ImgCoT: Compressing Long Chain of Thought into Compact Visual Tokens for Efficient Reasoning of Large Language Model
【速读】:该论文旨在解决长链式思维(Chain-of-Thought, CoT)在大语言模型(Large Language Models, LLMs)中推理效率低下的问题,即如何通过压缩CoT为紧凑的潜在标记(latent tokens)来提升推理效率,同时保留其逻辑结构。现有方法依赖自编码器(autoencoder)以文本形式的CoT作为重建目标,导致潜在标记过度关注表层语言特征(如词汇选择和句法结构),引入了强烈的语言归纳偏置(linguistic inductive bias),从而削弱了对推理结构的抽象能力。解决方案的关键在于提出ImgCoT,将重建目标从文本CoT替换为通过渲染生成的视觉CoT(visual CoT),以此引入空间归纳偏置(spatial inductive bias),使潜在标记更聚焦于推理步骤的空间布局,从而更好地捕捉全局推理结构。进一步地,为弥补视觉潜在标记可能模糊细节的问题,论文还提出松散ImgCoT(loose ImgCoT),通过选取低token对数似然值的关键文本推理步骤进行增强,实现全局结构与细粒度细节的协同保留,且所需标记数量少于完整CoT。
链接: https://arxiv.org/abs/2601.22730
作者: Xiaoshu Chen,Sihang Zhou,Ke Liang,Taichun Zhou,Xinwang Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Compressing long chains of thought (CoT) into compact latent tokens is crucial for efficient reasoning with large language models (LLMs). Recent studies employ autoencoders to achieve this by reconstructing textual CoT from latent tokens, thus encoding CoT semantics. However, treating textual CoT as the reconstruction target forces latent tokens to preserve surface-level linguistic features (e.g., word choice and syntax), introducing a strong linguistic inductive bias that prioritizes linguistic form over reasoning structure and limits logical abstraction. Thus, we propose ImgCoT that replaces the reconstruction target from textual CoT to the visual CoT obtained by rendering CoT into images. This substitutes linguistic bias with spatial inductive bias, i.e., a tendency to model spatial layouts of the reasoning steps in visual CoT, enabling latent tokens to better capture global reasoning structure. Moreover, although visual latent tokens encode abstract reasoning structure, they may blur reasoning details. We thus propose a loose ImgCoT, a hybrid reasoning that augments visual latent tokens with a few key textual reasoning steps, selected based on low token log-likelihood. This design allows LLMs to retain both global reasoning structure and fine-grained reasoning details with fewer tokens than the complete CoT. Extensive experiments across multiple datasets and LLMs demonstrate the effectiveness of the two versions of ImgCoT.
zh
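松散 ImgCoT 中“按低 token 对数似然挑选关键文本推理步骤”的做法,可用如下假设性草图示意(以每步 token 对数似然的均值为准,挑出模型最“意外”的 k 个步骤):

```python
def select_key_steps(step_token_logps, k=2):
    """挑出平均 token 对数似然最低的 k 个推理步, 作为保留的文本步骤。"""
    mean_lp = [sum(lp) / len(lp) for lp in step_token_logps]
    order = sorted(range(len(mean_lp)), key=lambda i: mean_lp[i])
    return sorted(order[:k])  # 按原始顺序返回, 便于与视觉潜在标记拼接

steps = [
    [-0.1, -0.2, -0.1],   # 步骤 0: 模型很确定
    [-2.5, -3.0, -1.8],   # 步骤 1: 似然最低, 信息量大
    [-0.3, -0.4],         # 步骤 2
    [-1.5, -2.0, -1.0],   # 步骤 3: 次低
]
print(select_key_steps(steps, k=2))  # [1, 3]
```

直觉上,低似然步骤是模型难以自行补全的细节,保留它们即可在远少于完整 CoT 的 token 数下兼顾全局结构与关键细节。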
[CV-43] GaussianOcc3D: A Gaussian-Based Adaptive Multi-modal 3D Occupancy Prediction
【速读】:该论文旨在解决自动驾驶中3D语义占据预测(3D semantic occupancy prediction)任务的挑战,特别是单模态方法在相机语义与激光雷达(LiDAR)几何信息之间存在权衡的问题,以及现有多模态框架普遍面临的模态异质性、空间错位和表示危机(representation crisis)——即体素(voxel)计算开销大,而鸟瞰图(BEV)替代方案则存在信息损失。其解决方案的关键在于提出GaussianOcc3D,一种基于高效连续3D高斯表示的多模态框架,通过四个核心模块实现:(1) LiDAR深度特征聚合(LDFA)利用深度感知可变形采样将稀疏信号映射到高斯基元;(2) 基于熵的特征平滑(EBFS)降低域噪声影响;(3) 自适应相机-激光雷达融合(ACLF)引入不确定性感知重加权机制以提升传感器可靠性;(4) Gauss-Mamba头结合选择性状态空间模型(Selective State Space Models)实现线性复杂度的全局上下文建模。该方法在多个基准数据集上达到SOTA性能,并展现出对雨天和夜间等复杂场景的鲁棒性。
链接: https://arxiv.org/abs/2601.22729
作者: A. Enes Doruk,Hasan F. Ates
机构: Ozyegin University (奥泽京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D semantic occupancy prediction is a pivotal task in autonomous driving, providing a dense and fine-grained understanding of the surrounding environment, yet single-modality methods face trade-offs between camera semantics and LiDAR geometry. Existing multi-modal frameworks often struggle with modality heterogeneity, spatial misalignment, and the representation crisis–where voxels are computationally heavy and BEV alternatives are lossy. We present GaussianOcc3D, a multi-modal framework bridging camera and LiDAR through a memory-efficient, continuous 3D Gaussian representation. We introduce four modules: (1) LiDAR Depth Feature Aggregation (LDFA), using depth-wise deformable sampling to lift sparse signals onto Gaussian primitives; (2) Entropy-Based Feature Smoothing (EBFS) to mitigate domain noise; (3) Adaptive Camera-LiDAR Fusion (ACLF) with uncertainty-aware reweighting for sensor reliability; and (4) a Gauss-Mamba Head leveraging Selective State Space Models for global context with linear complexity. Evaluations on Occ3D, SurroundOcc, and SemanticKITTI benchmarks demonstrate state-of-the-art performance, achieving mIoU scores of 49.4%, 28.9%, and 25.2% respectively. GaussianOcc3D exhibits superior robustness across challenging rainy and nighttime conditions.
zh
[CV-44] OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation
【速读】:该论文旨在解决虚拟试衣(Virtual Try-On, VTON)系统中评估可靠性不足的问题,具体表现为传统指标难以量化细粒度纹理细节与语义一致性,且现有数据集在规模和多样性上无法满足商业应用需求。解决方案的关键在于提出一个大规模基准OpenVTON-Bench,包含约10万对高分辨率图像(最高达1536×1536),通过DINOv3引导的分层聚类实现语义均衡采样,并利用Gemini驱动的密集描述生成确保20个细粒度服装类别分布均匀;同时设计了一种多模态评估协议,从背景一致性、身份保真度、纹理保真度、形状合理性及整体真实感五个可解释维度进行测量,结合视觉语言模型(VLM)语义推理与基于SAM3分割和形态学腐蚀的新颖多尺度表示度量,有效分离边界对齐误差与内部纹理伪影,实验表明该方法与人工评价具有高度一致性(Kendall’s τ=0.833),显著优于SSIM等传统指标(τ=0.611)。
链接: https://arxiv.org/abs/2601.22725
作者: Jin Li,Tao Chen,Shuai Jiang,Weijie Wang,Jingwen Luo,Chenhui Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in diffusion models have significantly elevated the visual fidelity of Virtual Try-On (VTON) systems, yet reliable evaluation remains a persistent bottleneck. Traditional metrics struggle to quantify fine-grained texture details and semantic consistency, while existing datasets fail to meet commercial standards in scale and diversity. We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs (up to 1536×1536). The dataset is constructed using DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-powered dense captioning, ensuring a uniform distribution across 20 fine-grained garment categories. To support reliable evaluation, we propose a multi-modal protocol that measures VTON quality along five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism. The protocol integrates VLM-based semantic reasoning with a novel Multi-Scale Representation Metric based on SAM3 segmentation and morphological erosion, enabling the separation of boundary alignment errors from internal texture artifacts. Experimental results show strong agreement with human judgments (Kendall's τ of 0.833 vs. 0.611 for SSIM), establishing a robust benchmark for VTON evaluation.
zh
[CV-45] Vision-Language Models Unlock Task-Centric Latent Actions
【速读】:该论文旨在解决Latent Action Models (LAMs)在存在动作相关干扰项(action-correlated distractors)时性能下降的问题,即模型容易将观测中的噪声误认为有意义的潜在动作。其解决方案的关键在于利用视觉语言模型(Vision-Language Models, VLMs)的常识推理能力,生成可提示(promptable)的表示作为监督信号,在无监督条件下有效分离可控变化与噪声。实验表明,通过引导VLM忽略干扰项,可显著提升潜在动作质量,使下游任务在Distracting MetaWorld上的成功率提高达六倍。
链接: https://arxiv.org/abs/2601.22714
作者: Alexander Nikulin,Ilya Zisman,Albina Klepach,Denis Tarasov,Alexander Derevyagin,Andrei Polubarov,Lyubaykin Nikita,Vladislav Kurenkov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint
Abstract:Latent Action Models (LAMs) have rapidly gained traction as an important component in the pre-training pipelines of leading Vision-Language-Action models. However, they fail when observations contain action-correlated distractors, often encoding noise instead of meaningful latent actions. Humans, on the other hand, can effortlessly distinguish task-relevant motions from irrelevant details in any video given only a brief task description. In this work, we propose to utilize the common-sense reasoning abilities of Vision-Language Models (VLMs) to provide promptable representations, effectively separating controllable changes from the noise in unsupervised way. We use these representations as targets during LAM training and benchmark a wide variety of popular VLMs, revealing substantial variation in the quality of promptable representations as well as their robustness to different prompts and hyperparameters. Interestingly, we find that more recent VLMs may perform worse than older ones. Finally, we show that simply asking VLMs to ignore distractors can substantially improve latent action quality, yielding up to a six-fold increase in downstream success rates on Distracting MetaWorld.
zh
[CV-46] SQUAD: Scalable Quorum Adaptive Decisions via ensemble of early exit neural networks
【速读】:该论文旨在解决早期退出神经网络(Early-exit Neural Networks)在推理过程中因依赖单一模型置信度阈值而导致的不确定性估计不可靠问题。传统方法往往受模型校准偏差影响,难以准确判断何时停止计算以获得可靠预测。解决方案的关键在于提出SQUAD(Scalable Quorum Adaptive Decisions)框架,其创新性地将分布式集成学习与早期退出机制相结合,采用基于多数投票(quorum-based)的停止准则:通过按计算复杂度递增顺序收集中间层预测结果,直到达成统计显著的共识时终止计算。此外,为增强投票机制的有效性,论文进一步引入QUEST(Quorum Search Technique),一种用于搜索具有最优层次多样性结构的早期退出学习器的神经架构搜索方法,从而确保各中间层学习器之间互补性强,提升整体决策鲁棒性。该方案在保持与现有动态方法相当计算成本的同时,测试准确率最高提升5.95%,且相比静态集成模型推理延迟降低达70.60%。
链接: https://arxiv.org/abs/2601.22711
作者: Matteo Gambella,Fabrizio Pittorino,Giuliano Casale,Manuel Roveri
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:Early-exit neural networks have become popular for reducing inference latency by allowing intermediate predictions when sufficient confidence is achieved. However, standard approaches typically rely on single-model confidence thresholds, which are frequently unreliable due to inherent calibration issues. To address this, we introduce SQUAD (Scalable Quorum Adaptive Decisions), the first inference scheme that integrates early-exit mechanisms with distributed ensemble learning, improving uncertainty estimation while reducing the inference time. Unlike traditional methods that depend on individual confidence scores, SQUAD employs a quorum-based stopping criterion on early-exit learners by collecting intermediate predictions incrementally in order of computational complexity until a consensus is reached and halting the computation at that exit if the consensus is statistically significant. To maximize the efficacy of this voting mechanism, we also introduce QUEST (Quorum Search Technique), a Neural Architecture Search method to select early-exit learners with optimized hierarchical diversity, ensuring learners are complementary at every intermediate layer. This consensus-driven approach yields statistically robust early exits, improving the test accuracy up to 5.95% compared to state-of-the-art dynamic solutions with a comparable computational cost and reducing the inference latency up to 70.60% compared to static ensembles while maintaining a good accuracy.
zh
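SQUAD 的法定多数停止准则可抽象为如下草图(此处以简单的票数占比代替论文中的统计显著性检验,属示意性假设):

```python
from collections import Counter

def quorum_decision(exit_preds, quorum=0.75, min_votes=3):
    """按复杂度递增依次收集各早退学习器的预测,
    一旦某类别票数占比达到法定多数且票数足够, 立即停止计算。"""
    votes = []
    for depth, pred in enumerate(exit_preds, start=1):
        votes.append(pred)
        if len(votes) >= min_votes:
            label, count = Counter(votes).most_common(1)[0]
            if count / len(votes) >= quorum:
                return label, depth   # 在第 depth 个出口提前停止
    # 未达成法定共识: 用完所有出口后取简单多数
    label, _ = Counter(votes).most_common(1)[0]
    return label, len(votes)

print(quorum_decision(["cat", "cat", "cat", "dog", "cat"]))  # ('cat', 3)
```

与单模型置信度阈值相比,多学习器投票对校准偏差更鲁棒;QUEST 的作用则是让各中间出口的学习器足够多样,使这种投票真正有信息量。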
[CV-47] Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs ICML
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在后训练量化(post-training quantization)过程中因信息压缩导致显著精度下降的问题,同时提升模型在资源受限场景下的部署效率。其核心解决方案是提出GRACE框架,该框架基于信息瓶颈(Information Bottleneck)原理,统一知识蒸馏与量化感知训练(Quantization-Aware Training, QAT):通过量化约束信息容量,同时利用蒸馏引导保留任务相关的信息。关键创新包括置信度门控解耦蒸馏(confidence-gated decoupled distillation)以过滤不可靠监督信号、关系中心核对齐(relational centered kernel alignment)用于传递视觉token结构信息,以及基于拉格朗日松弛的自适应控制器,动态平衡重建保真度与容量约束。实验证明,该方法在INT4量化下仍能接近教师模型性能,并实现3倍推理吞吐量和54%内存减少。
链接: https://arxiv.org/abs/2601.22709
作者: Yanlong Chen,Amirhossein Habibian,Luca Benini,Yawei Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This paper is currently under review for the 2026 International Conference on Machine Learning (ICML)
Abstract:Vision-Language Models (VLMs) achieve strong multimodal performance but are costly to deploy, and post-training quantization often causes significant accuracy loss. Despite its potential, quantization-aware training for VLMs remains underexplored. We propose GRACE, a framework unifying knowledge distillation and QAT under the Information Bottleneck principle: quantization constrains information capacity while distillation guides what to preserve within this budget. Treating the teacher as a proxy for task-relevant information, we introduce confidence-gated decoupled distillation to filter unreliable supervision, relational centered kernel alignment to transfer visual token structures, and an adaptive controller via Lagrangian relaxation to balance fidelity against capacity constraints. Across extensive benchmarks on LLaVA and Qwen families, our INT4 models consistently outperform FP16 baselines (e.g., LLaVA-1.5-7B: 70.1 vs. 66.8 on SQA; Qwen2-VL-2B: 76.9 vs. 72.6 on MMBench), nearly matching teacher performance. Using real INT4 kernels, we achieve 3× throughput with 54% memory reduction. This principled framework significantly outperforms existing quantization methods, making GRACE a compelling solution for resource-constrained deployment.
zh
[CV-48] DAVIS: OOD Detection via Dominant Activations and Variance for Increased Separation
【速读】:该论文旨在解决机器学习模型在真实世界部署中对分布外(out-of-distribution, OOD)输入检测能力不足的问题。现有大多数后处理检测方法依赖于全局平均池化(global average pooling, GAP)后的特征表示,而GAP会丢失激活图中的重要分布统计信息,导致检测性能受限。论文提出了一种名为DAVIS的简单且通用的后处理检测方法,其核心创新在于通过引入GAP前被丢弃的关键统计量——通道维度上的方差(channel-wise variance)和主导最大激活值(dominant maximum activations),来增强特征向量,从而弥补GAP带来的信息损失。实验证明,DAVIS在多种主流架构上均显著提升OOD检测性能,尤其在FPR95指标上取得大幅降低,揭示了以均值之外的统计特征提升检测鲁棒性的新机制。
链接: https://arxiv.org/abs/2601.22703
作者: Abid Hassan,Tuan Ngo,Saad Shafiq,Nenad Medvidovic
机构: University Southern California (南加州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Detecting out-of-distribution (OOD) inputs is a critical safeguard for deploying machine learning models in the real world. However, most post-hoc detection methods operate on penultimate feature representations derived from global average pooling (GAP) – a lossy operation that discards valuable distributional statistics from activation maps prior to global average pooling. We contend that these overlooked statistics, particularly channel-wise variance and dominant (maximum) activations, are highly discriminative for OOD detection. We introduce DAVIS, a simple and broadly applicable post-hoc technique that enriches feature vectors by incorporating these crucial statistics, directly addressing the information loss from GAP. Extensive evaluations show DAVIS sets a new benchmark across diverse architectures, including ResNet, DenseNet, and EfficientNet. It achieves significant reductions in the false positive rate (FPR95), with improvements of 48.26% on CIFAR-10 using ResNet-18, 38.13% on CIFAR-100 using ResNet-34, and 26.83% on ImageNet-1k benchmarks using MobileNet-v2. Our analysis reveals the underlying mechanism for this improvement, providing a principled basis for moving beyond the mean in OOD detection.
zh
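DAVIS 对特征的增强方式可用如下草图示意:在标准 GAP 均值之外,补上逐通道方差与最大(主导)激活,拼成 3C 维特征(具体拼接与归一化方式为假设性简化):

```python
import numpy as np

def davis_features(act: np.ndarray) -> np.ndarray:
    """对 C×H×W 激活图, 在 GAP 均值之外补上逐通道方差与最大激活,
    得到 3C 维的增强特征(对原方法的假设性简化)。"""
    c = act.shape[0]
    flat = act.reshape(c, -1)
    mean = flat.mean(axis=1)   # 标准 GAP
    var = flat.var(axis=1)     # 被 GAP 丢弃的分布统计量
    mx = flat.max(axis=1)      # 主导激活
    return np.concatenate([mean, var, mx])

act = np.random.default_rng(2).normal(size=(64, 7, 7))
feat = davis_features(act)
print(feat.shape)  # (192,)
```

增强后的特征向量可直接替换原有的倒数第二层表示,喂给任意现成的后处理 OOD 打分器(如能量分数或马氏距离),无需重新训练主干网络。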
[CV-49] Bi-MCQ: Reformulating Vision-Language Alignment for Negation Understanding ICPR2026
【速读】:该论文旨在解决当前视觉-语言模型(Vision-Language Models, VLMs)在医学图像分析中对否定性临床语句理解能力不足的问题,其根源在于对比学习目标将否定视为语言上的微小变化而非语义反转操作,导致模型难以准确识别疾病不存在的情况。解决方案的关键在于将视觉-语言对齐重新建模为条件语义比较问题,并通过双向多选学习框架(Bi-MCQ)实现:该框架联合训练图像到文本和文本到图像的多选任务,使用肯定、否定及混合提示,从而以条件语义比较替代全局相似度最大化;同时引入方向特异的交叉注意力融合模块,缓解双向推理所需的不对称线索干扰,显著提升模型对否定语义的理解能力。
链接: https://arxiv.org/abs/2601.22696
作者: Tae Hun Kim,Hyun Gyu Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 15 pages, 4 figures, Submitted to ICPR 2026 (under review)
Abstract:Recent vision-language models (VLMs) achieve strong zero-shot performance via large-scale image-text pretraining and have been widely adopted in medical image analysis. However, existing VLMs remain notably weak at understanding negated clinical statements, largely due to contrastive alignment objectives that treat negation as a minor linguistic variation rather than a meaning-inverting operator. In multi-label settings, prompt-based InfoNCE fine-tuning further reinforces easy-positive image-prompt alignments, limiting effective learning of disease absence. To overcome these limitations, we reformulate vision-language alignment as a conditional semantic comparison problem, which is instantiated through a bi-directional multiple-choice learning framework(Bi-MCQ). By jointly training Image-to-Text and Text-to-Image MCQ tasks with affirmative, negative, and mixed prompts, our method implements fine-tuning as conditional semantic comparison instead of global similarity maximization. We further introduce direction-specific Cross-Attention fusion modules to address asymmetric cues required by bi-directional reasoning and reduce alignment interference. Experiments on ChestXray14, Open-I, CheXpert, and PadChest show that Bi-MCQ improves negation understanding by up to 0.47 AUC over the zero-shot performance of the state-of-the-art CARZero model, while achieving up to a 0.08 absolute gain on positive-negative combined (PNC) evaluation. Additionally, Bi-MCQ reduces the affirmative-negative AUC gap by an average of 0.12 compared to InfoNCE-based fine-tuning, demonstrating that objective reformulation can substantially enhance negation understanding in medical VLMs.
zh
[CV-50] PEAR: Pixel-aligned Expressive humAn mesh Recovery
【速读】:该论文旨在解决从单张自然场景图像中重建高精度3D人体网格(尤其是细粒度区域如面部和手部)的难题,现有基于SMPLX的方法普遍存在推理速度慢、姿态粗略、细节区域对齐不准或出现不自然伪影等问题,限制了其在下游任务中的应用。解决方案的关键在于提出PEAR框架——一个无需预处理、可实时运行且像素对齐的表达性人体网格恢复方法:首先采用统一的ViT架构实现高效粗粒度几何重建;其次引入像素级监督机制以补偿简化结构导致的细节损失,显著提升面部与手部等区域的重建精度;最后通过模块化数据标注策略增强训练数据多样性与模型鲁棒性,从而在多个基准数据集上实现优于以往SMPLX方法的姿态估计准确率,并支持每秒超过100帧的EHM-s(SMPLX与缩放版FLAME)参数同步推断。
链接: https://arxiv.org/abs/2601.22693
作者: Jiahao Wu,Yunfei Liu,Lijian Lin,Ye Zhu,Lei Zhu,Jingyi Li,Yu Li
机构: International Digital Economy Academy (国际数字经济发展研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 23 pages
Abstract:Reconstructing detailed 3D human meshes from a single in-the-wild image remains a fundamental challenge in computer vision. Existing SMPLX-based methods often suffer from slow inference, produce only coarse body poses, and exhibit misalignments or unnatural artifacts in fine-grained regions such as the face and hands. These issues make current approaches difficult to apply to downstream tasks. To address these challenges, we propose PEAR, a fast and robust framework for pixel-aligned expressive human mesh recovery. PEAR explicitly tackles three major limitations of existing methods: slow inference, inaccurate localization of fine-grained human pose details, and insufficient facial expression capture. Specifically, to enable real-time SMPLX parameter inference, we depart from prior designs that rely on high resolution inputs or multi-branch architectures. Instead, we adopt a clean and unified ViT-based model capable of recovering coarse 3D human geometry. To compensate for the loss of fine-grained details caused by this simplified architecture, we introduce pixel-level supervision to optimize the geometry, significantly improving the reconstruction accuracy of fine-grained human details. To make this approach practical, we further propose a modular data annotation strategy that enriches the training data and enhances the robustness of the model. Overall, PEAR is a preprocessing-free framework that can simultaneously infer EHM-s (SMPLX and scaled-FLAME) parameters at over 100 FPS. Extensive experiments on multiple benchmark datasets demonstrate that our method achieves substantial improvements in pose estimation accuracy compared to previous SMPLX-based approaches. Project page: this https URL
zh
[CV-51] OOVDet: Low-Density Prior Learning for Zero-Shot Out-of-Vocabulary Object Detection
【速读】:该论文旨在解决零样本词汇外检测(Zero-shot Out-of-Vocabulary Detection, ZS-OOVD)中的关键挑战:现有方法在零样本推理场景下容易对已知类别(in-vocabulary, IV)过拟合,导致未定义类别(out-of-vocabulary, OOV)被错误地高置信度分类为IV类别。为此,作者提出了一种新的零样本OOV检测框架OOVDet,其核心创新在于通过构建低密度先验约束来有效区分IV与OOV类别。具体而言,解决方案的关键包括:1)利用隐空间中类条件高斯分布的低似然区域采样生成区域级OOV提示(prompt),假设未知语义更可能出现在潜在空间的低密度区域;2)设计基于Dirichlet分布的梯度归因机制,将梯度解释为证据以估计预测不确定性,并筛选高不确定性样本作为伪OOV图像;3)基于合成的OOV提示和伪OOV图像,通过高斯核密度估计施加低密度先验约束,从而优化OOV类别的决策边界,显著提升零样本场景下的OOV检测性能。
链接: https://arxiv.org/abs/2601.22685
作者: Binyi Su,Chenghao Huang,Haiyong Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Zero-shot out-of-vocabulary detection (ZS-OOVD) aims to accurately recognize objects of in-vocabulary (IV) categories provided at zero-shot inference, while simultaneously rejecting undefined ones (out-of-vocabulary, OOV) that lack corresponding category prompts. However, previous methods are prone to overfitting the IV classes, leading to the OOV or undefined classes being misclassified as IV ones with a high confidence score. To address this issue, this paper proposes a zero-shot OOV detector (OOVDet), a novel framework that effectively detects predefined classes while reliably rejecting undefined ones in zero-shot scenes. Specifically, due to the model’s lack of prior knowledge about the distribution of OOV data, we synthesize region-level OOV prompts by sampling from the low-likelihood regions of the class-conditional Gaussian distributions in the hidden space, motivated by the assumption that unknown semantics are more likely to emerge in low-density areas of the latent space. For OOV images, we further propose a Dirichlet-based gradient attribution mechanism to mine pseudo-OOV image samples, where the attribution gradients are interpreted as Dirichlet evidence to estimate prediction uncertainty, and samples with high uncertainty are selected as pseudo-OOV images. Building on these synthesized OOV prompts and pseudo-OOV images, we construct the OOV decision boundary through a low-density prior constraint, which regularizes the optimization of OOV classes using Gaussian kernel density estimation in accordance with the above assumption. Experimental results show that our method significantly improves the OOV detection performance in zero-shot scenes. The code is available at this https URL.
zh
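为直观说明摘要中"从类条件高斯分布的低似然区域采样、合成 OOV 提示"的思路,下面给出一个与论文实现无关的极简 numpy 示意(对角协方差、放大采样方差与保留比例均为本文的假设性简化):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_class_gaussian(feats):
    """对某一 IV 类别的隐空间特征拟合类条件高斯(此处简化为对角协方差)。"""
    return feats.mean(axis=0), feats.var(axis=0) + 1e-6

def log_likelihood(x, mu, var):
    """对角高斯的逐样本对数似然。"""
    return -0.5 * np.sum((x - mu) ** 2 / var + np.log(2 * np.pi * var), axis=-1)

def sample_low_density_oov(mu, var, n_draw=2000, keep_ratio=0.05):
    """放大方差采样, 仅保留似然最低的一小部分, 作为合成的 OOV 提示嵌入。"""
    samples = rng.normal(mu, 2.0 * np.sqrt(var), size=(n_draw, mu.shape[0]))
    ll = log_likelihood(samples, mu, var)
    k = max(1, int(n_draw * keep_ratio))
    idx = np.argsort(ll)[:k]          # 似然最低的 k 个样本
    return samples[idx], ll[idx]

# 用随机生成的 IV 特征演示: 合成样本位于该类分布的低密度区域
iv_feats = rng.normal(size=(500, 16))
mu, var = fit_class_gaussian(iv_feats)
oov_emb, oov_ll = sample_low_density_oov(mu, var)
iv_ll = log_likelihood(iv_feats, mu, var)
```

可以验证,保留下来的合成样本的似然显著低于真实 IV 特征的平均似然,符合"未知语义出现在低密度区域"的假设。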
[CV-52] Visual Personalization Turing Test
【速读】:该论文旨在解决当前生成式 AI 在视觉个性化任务中缺乏有效评估标准的问题,尤其是如何衡量模型生成内容在感知层面是否与特定个体的创作难以区分,而非仅追求身份复制。其解决方案的关键在于提出 Visual Personalization Turing Test (VPTT) 新范式,通过引入 VPTT Framework,包含一个涵盖 10,000 个人设的基准测试集(VPTT-Bench)、基于视觉检索增强的生成器(VPRAG)以及一个与人类和视觉语言模型(VLM)判断高度一致的文本指标(VPTT Score),从而实现对个性化生成内容的可量化、可扩展且隐私安全的评估与优化。
链接: https://arxiv.org/abs/2601.22680
作者: Rameen Abdal,James Burgess,Sergey Tulyakov,Kuan-Chieh Jackson Wang
机构: Snap Research (Snap 研究); Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Webpage: this https URL
Abstract:We introduce the Visual Personalization Turing Test (VPTT), a new paradigm for evaluating contextual visual personalization based on perceptual indistinguishability, rather than identity replication. A model passes the VPTT if its output (image, video, 3D asset, etc.) is indistinguishable to a human or calibrated VLM judge from content a given person might plausibly create or share. To operationalize VPTT, we present the VPTT Framework, integrating a 10k-persona benchmark (VPTT-Bench), a visual retrieval-augmented generator (VPRAG), and the VPTT Score, a text-only metric calibrated against human and VLM judgments. We show high correlation across human, VLM, and VPTT evaluations, validating the VPTT Score as a reliable perceptual proxy. Experiments demonstrate that VPRAG achieves the best alignment-originality balance, offering a scalable and privacy-safe foundation for personalized generative AI.
zh
[CV-53] Stabilizing Consistency Training: A Flow Map Analysis and Self-Distillation
【速读】:该论文旨在解决一致性模型(consistency models)在从头训练时存在的固有不稳定性与可重复性差的问题,这些问题限制了其性能和应用。解决方案的关键在于从流映射(flow map)视角对一致性模型进行理论分析,揭示了训练稳定性和收敛行为如何导致退化解;在此基础上,重新审视自蒸馏(self-distillation)作为缓解特定形式次优收敛的实用方法,并对其进行了重构以避免梯度范数过大,从而实现更稳定的优化过程。该策略进一步扩展至基于扩散的策略学习任务,且无需依赖预训练扩散模型初始化,验证了其广泛适用性。
链接: https://arxiv.org/abs/2601.22679
作者: Youngjoong Kim,Duhoe Kim,Woosung Kim,Jaesik Park
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Consistency models have been proposed for fast generative modeling, achieving results competitive with diffusion and flow models. However, these methods exhibit inherent instability and limited reproducibility when training from scratch, motivating subsequent work to explain and stabilize these issues. While these efforts have provided valuable insights, the explanations remain fragmented, and the theoretical relationships remain unclear. In this work, we provide a theoretical examination of consistency models by analyzing them from a flow map-based perspective. This joint analysis clarifies how training stability and convergence behavior can give rise to degenerate solutions. Building on these insights, we revisit self-distillation as a practical remedy for certain forms of suboptimal convergence and reformulate it to avoid excessive gradient norms for stable optimization. We further demonstrate that our strategy extends beyond image generation to diffusion-based policy learning, without reliance on a pretrained diffusion model for initialization, thereby illustrating its broader applicability.
zh
[CV-54] Fire on Motion: Optimizing Video Pass-bands for Efficient Spiking Action Recognition
【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)在动态视频任务中性能显著落后于人工神经网络(Artificial Neural Networks, ANNs)的问题。作者指出,SNN的标准脉冲动力学本质上表现为一个时间低通滤波器,会抑制包含任务相关动态信息的运动频带,从而导致其在视频理解任务中表现不佳。解决方案的关键在于提出一种名为“通带优化器”(Pass-Bands Optimizer, PBO)的即插即用模块,该模块通过引入两个可学习参数和一个轻量级一致性约束,在不改变网络结构的前提下,主动调整SNN的时间通带响应,使其聚焦于任务相关的运动频带,同时抑制对分类判别贡献较小的静态成分,从而实现高效且高精度的视频处理。
链接: https://arxiv.org/abs/2601.22675
作者: Shuhan Ye,Yuanbin Qian,Yi Yu,Chong Wang,Yuqi Xie,Jiazhen Xu,Kun Wang,Xudong Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Spiking neural networks (SNNs) have gained traction in vision due to their energy efficiency, bio-plausibility, and inherent temporal processing. Yet, despite this temporal capacity, most progress concentrates on static image benchmarks, and SNNs still underperform on dynamic video tasks compared to artificial neural networks (ANNs). In this work, we diagnose a fundamental pass-band mismatch: Standard spiking dynamics behave as a temporal low pass that emphasizes static content while attenuating motion bearing bands, where task relevant information concentrates in dynamic tasks. This phenomenon explains why SNNs can approach ANNs on static tasks yet fall behind on tasks that demand richer temporal modeling. To remedy this, we propose the Pass-Bands Optimizer (PBO), a plug-and-play module that optimizes the temporal pass-band toward task-relevant motion bands. PBO introduces only two learnable parameters, and a lightweight consistency constraint that preserves semantics and boundaries, incurring negligible computational overhead and requiring no architectural changes. PBO deliberately suppresses static components that contribute little to discrimination, effectively high passing the stream so that spiking activity concentrates on motion bearing content. On UCF101, PBO yields over ten percentage points improvement. On more complex multi-modal action recognition and weakly supervised video anomaly detection, PBO delivers consistent and significant gains, offering a new perspective for SNN based video processing and understanding.
zh
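为说明"抑制静态成分、让能量集中在运动频带"的通带调整思想,下面给出一个与 PBO 原实现无关的两参数时间滤波示意(指数滑动平均作为低通估计属本文假设):

```python
import numpy as np

def pass_band_filter(x, alpha=0.9, beta=1.0):
    """两参数时间滤波示意: 用指数滑动平均估计静态(低频)成分,
    再按 beta 比例减去它, 使输出能量集中在运动频带。
    x: (T, D) 帧特征序列; alpha 控制低通记忆长度, beta 控制静态抑制强度。
    """
    low = np.zeros_like(x)
    m = x[0].copy()
    for t in range(x.shape[0]):
        m = alpha * m + (1 - alpha) * x[t]   # 低通成分 (近似静态内容)
        low[t] = m
    return x - beta * low                     # beta=1 时为高通残差

# 玩具序列: 恒定背景 (静态) + 正弦运动信号
t = np.arange(64)
x = 5.0 * np.ones((64, 1)) + np.sin(2 * np.pi * t / 8)[:, None]
y = pass_band_filter(x)
```

滤波后的序列均值接近 0:恒定背景被移除,只保留随时间变化的运动分量。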
[CV-55] VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration ICLR2026
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在高分辨率图像和视频场景中因视觉令牌(visual tokens)过多而导致的计算成本过高问题,同时现有令牌压缩方法往往仅关注单一模块且忽视文本对齐,从而引发性能下降。解决方案的关键在于提出一个无需训练的统一加速框架 VisionTrim,其核心包含两个即插即用模块:1)基于全局-局部视角的主导视觉令牌选择(Dominant Vision Token Selection, DVTS)模块,用于保留关键视觉信息;2)由文本引导的视觉补全(Text-Guided Vision Complement, TGVC)模块,通过文本线索实现上下文感知的令牌合并,从而在不损害文本对齐的前提下显著降低计算开销并提升性能。
链接: https://arxiv.org/abs/2601.22674
作者: Hanxun Yu,Wentong Li,Xuan Qu,Song Wang,Junbo Chen,Jianke Zhu
机构: Zhejiang University (浙江大学); Nanjing University of Aeronautics and Astronautics (南京航空航天大学); Udeer.ai
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR2026, Code Link: this https URL
Abstract:Multimodal large language models (MLLMs) suffer from high computational costs due to excessive visual tokens, particularly in high-resolution and video-based scenarios. Existing token reduction methods typically focus on isolated pipeline components and often neglect textual alignment, leading to performance degradation. In this paper, we propose VisionTrim, a unified framework for training-free MLLM acceleration, integrating two effective plug-and-play modules: 1) the Dominant Vision Token Selection (DVTS) module, which preserves essential visual tokens via a global-local view, and 2) the Text-Guided Vision Complement (TGVC) module, which facilitates context-aware token merging guided by textual cues. Extensive experiments across diverse image and video multimodal benchmarks demonstrate the performance superiority of our VisionTrim, advancing practical MLLM deployment in real-world applications. The code is available at: this https URL.
zh
[CV-56] ExpAlign: Expectation-Guided Vision-Language Alignment for Open-Vocabulary Grounding
【速读】:该论文旨在解决开放词汇目标检测与零样本实例分割中视觉-语言对齐的准确性问题,尤其是在弱监督条件下如何实现细粒度的语义匹配。现有方法或依赖全局句子嵌入导致表达能力不足,或通过显式标注或复杂的交叉注意力机制进行词元级对齐,带来额外负担。其解决方案的关键在于提出ExpAlign框架,基于严格的多实例学习(Multiple Instance Learning, MIL)理论构建,引入期望对齐头(Expectation Alignment Head),通过注意力机制对词元-区域相似性进行软MIL池化,从而隐式完成词元和实例的选择而无需额外标注;同时设计基于能量的多尺度一致性正则化策略,包括Top-K多正例对比损失和由拉格朗日约束自由能最小化推导出的几何感知一致性目标,有效稳定对齐学习过程。
链接: https://arxiv.org/abs/2601.22666
作者: Junyi Hu,Tian Bai,Fengyi Wu,Wenyan Li,Zhenming Peng,Yi Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 6 figures
Abstract:Open-vocabulary grounding requires accurate vision-language alignment under weak supervision, yet existing methods either rely on global sentence embeddings that lack fine-grained expressiveness or introduce token-level alignment with explicit supervision or heavy cross-attention designs. We propose ExpAlign, a theoretically grounded vision-language alignment framework built on a principled multiple instance learning formulation. ExpAlign introduces an Expectation Alignment Head that performs attention-based soft MIL pooling over token-region similarities, enabling implicit token and instance selection without additional annotations. To further stabilize alignment learning, we develop an energy-based multi-scale consistency regularization scheme, including a Top-K multi-positive contrastive objective and a Geometry-Aware Consistency Objective derived from a Lagrangian-constrained free-energy minimization. Extensive experiments show that ExpAlign consistently improves open-vocabulary detection and zero-shot instance segmentation, particularly on long-tail categories. Most notably, it achieves 36.2 AP _r on the LVIS minival split, outperforming other state-of-the-art methods at comparable model scale, while remaining lightweight and inference-efficient.
zh
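摘要中"对词元-区域相似度做基于注意力的软 MIL 池化"可以用如下极简 numpy 示意理解(温度参数与余弦相似度均为本文假设,并非论文原式):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_mil_score(token_feats, region_feats, tau=0.1):
    """注意力式软 MIL 池化: 对词元-区域余弦相似度矩阵按区域维做 softmax,
    得到每个词元的软实例选择权重, 加权求和后再对词元取平均,
    从而在无额外标注下隐式完成词元与实例的选择。
    """
    t = token_feats / np.linalg.norm(token_feats, axis=1, keepdims=True)
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    sim = t @ r.T                          # (n_tok, n_reg)
    w = softmax(sim / tau, axis=1)         # 词元 → 区域的软选择
    return float((w * sim).sum(axis=1).mean())

rng = np.random.default_rng(0)
tok = rng.normal(size=(4, 8))
reg = np.vstack([tok + 0.01 * rng.normal(size=(4, 8)),   # 与词元匹配的区域
                 rng.normal(size=(6, 8))])                # 干扰区域
score_match = soft_mil_score(tok, reg)
score_rand = soft_mil_score(rng.normal(size=(4, 8)), reg)
```

匹配的词元-区域对得分明显高于随机词元,体现软选择机制对"正确实例"的隐式聚焦。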
[CV-57] Unsupervised Synthetic Image Attribution: Alignment and Disentanglement
【速读】:该论文旨在解决生成式 AI(Generative AI)合成图像的来源归属问题,即如何在无需昂贵配对标注数据的情况下,准确识别合成图像所对应的真实训练源。传统方法依赖于成对的合成图像与原始训练源标注,但获取此类数据成本高昂且难以规模化。本文提出一种名为“对齐与解耦”(Alignment and Disentanglement)的无监督方法,其关键在于:首先利用对比自监督学习(contrastive self-supervised learning)实现基础概念对齐,随后通过 Infomax 损失促进表征解耦,从而增强模型的归属判别能力。理论分析表明,这种策略可通过 canonical correlation analysis(CCA)目标的分解近似实现概念匹配过程,实验证明该方法在真实世界基准 AbC 上显著优于现有监督方法。
链接: https://arxiv.org/abs/2601.22663
作者: Zongfang Liu,Guangyi Chen,Boyang Sun,Tongliang Liu,Kun Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:As the quality of synthetic images improves, identifying the underlying concepts of model-generated images is becoming increasingly crucial for copyright protection and ensuring model transparency. Existing methods achieve this attribution goal by training models using annotated pairs of synthetic images and their original training sources. However, obtaining such paired supervision is challenging, as it requires either well-designed synthetic concepts or precise annotations from millions of training sources. To eliminate the need for costly paired annotations, in this paper, we explore the possibility of unsupervised synthetic image attribution. We propose a simple yet effective unsupervised method called Alignment and Disentanglement. Specifically, we begin by performing basic concept alignment using contrastive self-supervised learning. Next, we enhance the model’s attribution ability by promoting representation disentanglement with the Infomax loss. This approach is motivated by an interesting observation: contrastive self-supervised models, such as MoCo and DINO, inherently exhibit the ability to perform simple cross-domain alignment. By formulating this observation as a theoretical assumption on cross-covariance, we provide a theoretical explanation of how alignment and disentanglement can approximate the concept-matching process through a decomposition of the canonical correlation analysis objective. On the real-world benchmarks, AbC, we show that our unsupervised method surprisingly outperforms the supervised methods. As a starting point, we expect our intuitive insights and experimental findings to provide a fresh perspective on this challenging task.
zh
[CV-58] What can Computer Vision learn from Ranganathan?
【速读】:该论文旨在解决计算机视觉(Computer Vision, CV)中的语义鸿沟问题(Semantic Gap Problem, SGP),即视觉语义与词法语义之间的不匹配,这一问题导致了CV数据集设计和基准测试的缺陷。解决方案的关键在于引入S.R. Ranganathan的分类学原理,并将其适配应用于vTelos视觉标注方法中,从而为构建高质量CV数据集提供理论基础和结构化指导。实验结果表明,该方法能够提升标注质量和模型准确性,验证了其有效性。
链接: https://arxiv.org/abs/2601.22634
作者: Mayukh Bagchi,Fausto Giunchiglia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted @ DRTC-ISI Conference 2026, Indian Statistical Institute (ISI), Bangalore, India
Abstract:The Semantic Gap Problem (SGP) in Computer Vision (CV) arises from the misalignment between visual and lexical semantics leading to flawed CV dataset design and CV benchmarks. This paper proposes that classification principles of S.R. Ranganathan can offer a principled starting point to address SGP and design high-quality CV datasets. We elucidate how these principles, suitably adapted, underpin the vTelos CV annotation methodology. The paper also briefly presents experimental evidence showing improvements in CV annotation and accuracy, thereby, validating vTelos.
zh
[CV-59] LINA: Linear Autoregressive Image Generative Models with Continuous Tokens
【速读】:该论文旨在解决自回归视觉生成模型(尤其是文本到图像合成,T2I)中因使用传统softmax注意力机制而导致的高计算成本问题。其核心解决方案是设计一种计算高效的线性注意力机制(linear attention),并通过系统实证分析确定最优配置:首先发现除法归一化(division-based normalization)在生成任务中比减法归一化(subtraction-based normalization)更具可扩展性;其次验证了深度卷积(depthwise convolution)对局部性建模的重要性,有助于提升生成质量。进一步提出KV门控机制(KV gate),通过引入数据无关的可学习参数动态调节键(key)和值(value)状态的权重,实现类似语言模型中遗忘门的灵活记忆管理。基于上述改进,作者构建了LINA模型,完全基于线性注意力结构,在保持高性能的同时显著降低浮点运算量(FLOPs减少约61%),在ImageNet和GenEval基准上分别取得2.18 FID和0.74得分,证明其在高分辨率图像生成中的有效性与效率。
链接: https://arxiv.org/abs/2601.22630
作者: Jiahao Wang,Ting Pan,Haoge Deng,Dongchen Han,Taiqiang Wu,Xinlong Wang,Ping Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 9 figures
Abstract:Autoregressive models with continuous tokens form a promising paradigm for visual generation, especially for text-to-image (T2I) synthesis, but they suffer from high computational cost. We study how to design compute-efficient linear attention within this framework. Specifically, we conduct a systematic empirical analysis of scaling behavior with respect to parameter counts under different design choices, focusing on (1) normalization paradigms in linear attention (division-based vs. subtraction-based) and (2) depthwise convolution for locality augmentation. Our results show that although subtraction-based normalization is effective for image classification, division-based normalization scales better for linear generative transformers. In addition, incorporating convolution for locality modeling plays a crucial role in autoregressive generation, consistent with findings in diffusion models. We further extend gating mechanisms, commonly used in causal linear attention, to the bidirectional setting and propose a KV gate. By introducing data-independent learnable parameters to the key and value states, the KV gate assigns token-wise memory weights, enabling flexible memory management similar to forget gates in language models. Based on these findings, we present LINA, a simple and compute-efficient T2I model built entirely on linear attention, capable of generating high-fidelity 1024x1024 images from user instructions. LINA achieves competitive performance on both class-conditional and T2I benchmarks, obtaining 2.18 FID on ImageNet (about 1.4B parameters) and 0.74 on GenEval (about 1.5B parameters). A single linear attention module reduces FLOPs by about 61 percent compared to softmax attention. Code and models are available at: this https URL. 
zh
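为说明"除法归一化线性注意力 + KV 门控"的计算方式,下面给出一个极简 numpy 示意(特征映射取 elu+1、门控以逐 token 标量权重实现,均为本文的假设性简化):

```python
import numpy as np

def linear_attention_kv_gate(Q, K, V, g_k, g_v, eps=1e-6):
    """除法归一化线性注意力 + KV 门控的示意实现 (双向/非因果)。
    phi 取 elu(x)+1 保证非负; g_k, g_v 为与数据无关的逐 token 可学习权重
    (此处作为给定参数传入), 对键/值状态施加逐 token 记忆加权。
    复杂度为 O(N d^2), 区别于 softmax 注意力的 O(N^2 d)。
    """
    def phi(x):
        return np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1
    q, k = phi(Q), phi(K)
    k = k * g_k[:, None]              # 键门控
    v = V * g_v[:, None]              # 值门控
    kv = k.T @ v                      # (d, d) 全局状态
    z = k.sum(axis=0)                 # (d,) 归一化项
    return (q @ kv) / (q @ z + eps)[:, None]   # 除法归一化

rng = np.random.default_rng(0)
N, d = 16, 8
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
out = linear_attention_kv_gate(Q, K, V, g_k=np.ones(N), g_v=np.ones(N))
```

值门控对输出呈线性作用(g_v 整体放大两倍,输出也放大两倍),而键门控通过归一化项重新分配各 token 的记忆权重。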
[CV-60] UniGeo: A Unified 3D Indoor Object Detection Framework Integrating Geometry-Aware Learning and Dynamic Channel Gating
【速读】:该论文旨在解决现有3D室内物体检测方法在处理稀疏点云场景时,难以建模几何关系以及忽略显著区域特征分布的问题,从而限制了检测性能。其解决方案的关键在于提出一个统一的3D室内检测框架UniGeo,包含两个核心机制:一是几何感知学习模块(geometry-aware learning module),通过可学习的映射将空间关系转化为特征权重,实现显式的几何特征增强;二是动态通道门控机制(dynamic channel gating mechanism),利用可学习的通道级加权策略自适应优化由稀疏3D U-Net网络生成的特征,显著提升关键几何信息的表达能力。
链接: https://arxiv.org/abs/2601.22616
作者: Xing Yi,Jinyang Huang,Feng-Qi Cui,Anyang Tong,Ruimin Wang,Liu Liu,Dan Guo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The growing adoption of robotics and augmented reality in real-world applications has driven considerable research interest in 3D object detection based on point clouds. While previous methods address unified training across multiple datasets, they fail to model geometric relationships in sparse point cloud scenes and ignore the feature distribution in significant areas, which ultimately restricts their performance. To deal with this issue, a unified 3D indoor detection framework, called UniGeo, is proposed. To model geometric relations in scenes, we first propose a geometry-aware learning module that establishes a learnable mapping from spatial relationships to feature weights, which enabes explicit geometric feature enhancement. Then, to further enhance point cloud feature representation, we propose a dynamic channel gating mechanism that leverages learnable channel-wise weighting. This mechanism adaptively optimizes features generated by the sparse 3D U-Net network, significantly enhancing key geometric information. Extensive experiments on six different indoor scene datasets clearly validate the superior performance of our method.
zh
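"可学习通道级加权"的门控思路可用一个 SE 风格的小例子说明(下面的两层瓶颈结构为常见做法,并非论文原文的网络结构):

```python
import numpy as np

def channel_gate(feats, w1, w2):
    """动态通道门控示意 (SE 风格两层小网络, 并非原文具体结构):
    全局平均池化 → ReLU 瓶颈层 → sigmoid 得到逐通道权重, 再加权原特征。
    feats: (N_points, C) 稀疏点特征。
    """
    pooled = feats.mean(axis=0)                    # (C,) 全局通道描述
    hidden = np.maximum(0.0, pooled @ w1)          # ReLU
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w2)))    # (C,) ∈ (0, 1)
    return feats * gate[None, :], gate

rng = np.random.default_rng(0)
C, r = 16, 4
feats = rng.normal(size=(100, C))
w1 = 0.1 * rng.normal(size=(C, r))
w2 = 0.1 * rng.normal(size=(r, C))
gated, gate = channel_gate(feats, w1, w2)
```

门控权重严格位于 (0, 1) 区间,对携带关键几何信息的通道给予更高权重,对冗余通道进行抑制。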
[CV-61] SA3R: Training-Free Temporal-Spatial Adaptive Persistent State for Streaming 3D Reconstruction
【速读】:该论文旨在解决流式递归模型在长时间序列中因历史信息与新观测之间权衡而导致的灾难性记忆遗忘问题。现有方法虽通过注意力机制提取自适应信号缓解此问题,但仅考虑单一维度,忽略了时间与空间的一致性。解决方案的关键在于提出一种无需训练的框架TTSA3R,其核心是同时利用时序状态演化和空间观测质量来实现自适应的状态更新:首先设计Temporal Adaptive Update Module(时序自适应更新模块),通过分析时序状态演化模式调节更新幅度;其次引入Spatial Contextual Update Module(空间上下文更新模块),基于观测-状态对齐与场景动态性定位需更新的空间区域;最终融合两类互补信号以确定最优状态更新策略,从而显著提升长期3D重建稳定性。
链接: https://arxiv.org/abs/2601.22615
作者: Zhijie Zheng,Xinhao Xiang,Jiawei Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Streaming recurrent models enable efficient 3D reconstruction by maintaining persistent state representations. However, they suffer from catastrophic memory forgetting over long sequences due to balancing historical information with new observations. Recent methods alleviate this by deriving adaptive signals from attention perspective, but they operate on single dimensions without considering temporal and spatial consistency. To this end, we propose a training-free framework termed TTSA3R that leverages both temporal state evolution and spatial observation quality for adaptive state updates in 3D reconstruction. In particular, we devise a Temporal Adaptive Update Module that regulates update magnitude by analyzing temporal state evolution patterns. Then, a Spatial Contextual Update Module is introduced to localize spatial regions that require updates through observation-state alignment and scene dynamics. These complementary signals are finally fused to determine the state updating strategies. Extensive experiments demonstrate the effectiveness of TTSA3R in diverse 3D tasks. Moreover, our method exhibits only 15% error increase compared to over 200% degradation in baseline models on extended sequences, significantly improving long-term reconstruction stability. Our codes will be available soon.
zh
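"时序信号调节更新幅度、空间信号定位更新区域"的融合更新规则,可以用如下玩具示意理解(凸组合式的更新形式为本文假设):

```python
import numpy as np

def adaptive_state_update(state, obs, t_signal, s_signal):
    """时空自适应状态更新示意:
    标量时序信号 t_signal 调节整体更新幅度,
    逐位置空间信号 s_signal 定位需要更新的区域,
    二者相乘得到逐位置更新率 lam, 再做凸组合:
        state_new = (1 - lam) * state + lam * obs
    """
    lam = np.clip(t_signal * s_signal, 0.0, 1.0)[..., None]
    return (1.0 - lam) * state + lam * obs

state = np.zeros((8, 8, 4))          # 持久状态
obs = np.ones((8, 8, 4))             # 新观测
s_signal = np.zeros((8, 8))
s_signal[:4] = 1.0                   # 仅上半区域与新观测对齐度高
new_state = adaptive_state_update(state, obs, t_signal=0.5, s_signal=s_signal)
```

只有空间信号激活的区域被部分更新,其余区域完整保留历史状态,从而缓解长序列中的灾难性遗忘。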
[CV-62] FOTBCD: A Large-Scale Building Change Detection Benchmark from French Orthophotos and Topographic Data
【速读】:该论文旨在解决建筑变化检测(Building Change Detection, BCD)模型在跨地理域(geographic domain shift)场景下泛化能力不足的问题。现有基准数据集通常局限于单一城市或小范围区域,导致模型在新地理环境下性能显著下降。解决方案的关键在于构建一个大规模、地理分布广泛的数据集——FOTBCD,其覆盖法国 mainland 的28个行政区(其中25个用于训练,3个独立区域用于评估),并提供像素级二值变化掩码和实例级标注,以支持在真实地理多样性条件下的模型训练与验证。通过在FOTBCD-Binary上与LEVIR-CD+和WHU-CD进行对比实验,研究证实了数据集层面的地理多样性可显著提升模型的跨域泛化性能。
链接: https://arxiv.org/abs/2601.22596
作者: Abdelrrahman Moubane
机构: Retgen AI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce FOTBCD, a large-scale building change detection dataset derived from authoritative French orthophotos and topographic building data provided by IGN France. Unlike existing benchmarks that are geographically constrained to single cities or limited regions, FOTBCD spans 28 departments across mainland France, with 25 used for training and three geographically disjoint departments held out for evaluation. The dataset covers diverse urban, suburban, and rural environments at 0.2m/pixel resolution. We publicly release FOTBCD-Binary, a dataset comprising approximately 28,000 before/after image pairs with pixel-wise binary building change masks, each associated with patch-level spatial metadata. The dataset is designed for large-scale benchmarking and evaluation under geographic domain shift, with validation and test samples drawn from held-out departments and manually verified to ensure label quality. In addition, we publicly release FOTBCD-Instances, a publicly available instance-level annotated subset comprising several thousand image pairs, which illustrates the complete annotation schema used in the full instance-level version of FOTBCD. Using a fixed reference baseline, we benchmark FOTBCD-Binary against LEVIR-CD+ and WHU-CD, providing strong empirical evidence that geographic diversity at the dataset level is associated with improved cross-domain generalization in building change detection.
zh
[CV-63] Cross-Domain Few-Shot Learning for Hyperspectral Image Classification Based on Mixup Foundation Model
【速读】:该论文旨在解决高光谱图像(HSI)跨域少样本学习(CDFSL)中因数据稀缺导致的模型过拟合与泛化能力差的问题。现有方法通常依赖不切实际的外部噪声增强来扩大样本规模,且参数更新量大、易过拟合。解决方案的关键在于提出一种基于遥感(RS)基础模型的框架MIFOMO,其核心创新包括:1)利用预训练于大规模遥感任务的基础模型提取通用特征;2)引入共聚投影(coalescent projection, CP)在冻结主干网络的前提下快速适配下游任务;3)设计混合域适应(mixup domain adaptation, MDM)以缓解极端领域差异;4)采用标签平滑策略应对伪标签噪声问题。该方案显著提升了模型在少样本条件下的跨域迁移性能。
链接: https://arxiv.org/abs/2601.22581
作者: Naeem Paeedeh,Mahardhika Pratama,Ary Shiddiqi,Zehong Cao,Mukesh Prasad,Wisnu Jatmiko
机构: Adelaide University (阿德莱德大学); Institut Teknologi Sepuluh Nopember (十日理工大学); University of Technology, Sydney (悉尼科技大学); University of Indonesia (印度尼西亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Although cross-domain few-shot learning (CDFSL) for hyper-spectral image (HSI) classification has attracted significant research interest, existing works often rely on an unrealistic data augmentation procedure in the form of external noise to enlarge the sample size, which greatly oversimplifies the data-scarcity problem. They also update a large number of parameters, making them prone to overfitting. To the best of our knowledge, none has explored the strength of the foundation model, whose strong generalization power allows quick adaptation to downstream tasks. This paper proposes the MIxup FOundation MOdel (MIFOMO) for CDFSL of HSI classifications. MIFOMO is built upon the concept of a remote sensing (RS) foundation model, pre-trained across a large scale of RS problems and thus featuring generalizable features. The notion of coalescent projection (CP) is introduced to quickly adapt the foundation model to downstream tasks while freezing the backbone network. The concept of mixup domain adaptation (MDM) is proposed to address the extreme domain discrepancy problem. Last but not least, the label smoothing concept is implemented to cope with noisy pseudo-label problems. Our rigorous experiments demonstrate the advantage of MIFOMO, where it beats prior arts with up to a 14% margin. The source code of MIFOMO is open-sourced at this https URL (Paeedeh/MIFOMO) for reproducibility and convenient further study.
zh
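"源域-目标域 mixup + 伪标签平滑"的组合可以用如下极简 numpy 示意说明(Beta 分布参数与平滑系数均为常见取值,并非论文原设定):

```python
import numpy as np

rng = np.random.default_rng(0)

def smooth_onehot(y, n_cls, eps=0.1):
    """标签平滑: 缓解目标域伪标签的噪声影响。"""
    return np.eye(n_cls)[y] * (1.0 - eps) + eps / n_cls

def mixup_domain(x_src, y_src, x_tgt, y_tgt_pseudo, n_cls, lam=None):
    """混合域适应 (MDM) 思路的示意: 在源域与目标域样本间做 mixup,
    标签取平滑后 one-hot 的同系数凸组合, 以缓和极端领域差异。"""
    if lam is None:
        lam = rng.beta(0.4, 0.4)
    x_mix = lam * x_src + (1.0 - lam) * x_tgt
    y_mix = lam * smooth_onehot(y_src, n_cls) \
        + (1.0 - lam) * smooth_onehot(y_tgt_pseudo, n_cls)
    return x_mix, y_mix

x_s = rng.normal(size=(8, 32)); y_s = rng.integers(0, 5, size=8)
x_t = rng.normal(size=(8, 32)); y_t = rng.integers(0, 5, size=8)
x_mix, y_mix = mixup_domain(x_s, y_s, x_t, y_t, n_cls=5, lam=0.7)
```

混合后的软标签每行仍归一化为 1,可直接用于交叉熵训练。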
[CV-64] Mitigating Hallucinations in Video Large Language Models via Spatiotemporal-Semantic Contrastive Decoding
【速读】:该论文旨在解决视频大语言模型(Video Large Language Models)在视频理解、问答和推理等任务中普遍存在且难以克服的幻觉问题(hallucination),即生成内容与视频显式内容或事实证据不一致的现象。现有缓解方法虽考虑了视频的时空特性,但多依赖启发式设计,无法精准捕捉幻觉的根本原因及其细粒度的时间-语义关联,导致在复杂场景下鲁棒性和泛化能力不足。解决方案的关键在于提出一种新颖的解码策略——时空语义对比解码(Spatiotemporal-Semantic Contrastive Decoding),其通过人为破坏视频特征的时空一致性与语义关联构建负样本特征,并在推理阶段通过对比解码抑制幻觉输出,从而在有效降低幻觉发生率的同时保持模型的通用视频理解与推理能力。
链接: https://arxiv.org/abs/2601.22574
作者: Yuansheng Gao,Jinman Zhao,Tong Zhang,Xingguo Xu,Han Bao,Zonghui Wang,Wenzhi Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint
Abstract:Although Video Large Language Models perform remarkably well across tasks such as video understanding, question answering, and reasoning, they still suffer from the problem of hallucination, which refers to generating outputs that are inconsistent with explicit video content or factual evidence. However, existing decoding methods for mitigating video hallucinations, while considering the spatiotemporal characteristics of videos, mostly rely on heuristic designs. As a result, they fail to precisely capture the root causes of hallucinations and their fine-grained temporal and semantic correlations, leading to limited robustness and generalization in complex scenarios. To more effectively mitigate video hallucinations, we propose a novel decoding strategy termed Spatiotemporal-Semantic Contrastive Decoding. This strategy constructs negative features by deliberately disrupting the spatiotemporal consistency and semantic associations of video features, and suppresses video hallucinations through contrastive decoding against the original video features during inference. Extensive experiments demonstrate that our method not only effectively mitigates the occurrence of hallucinations, but also preserves the general video understanding and reasoning capabilities of the model.
zh
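对比解码抑制幻觉的机制可以用一个玩具例子说明。下面采用对比解码文献中常见的线性差分形式(具体系数为本文假设,并非论文原式):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def contrastive_decode(logits_orig, logits_neg, alpha=1.0):
    """对比解码示意: 放大 "原始视频" 与 "时空/语义被破坏的负样本"
    两种条件下 logits 的差异, 压低主要由语言先验支撑
    (两边得分都高) 的幻觉候选词。"""
    return (1.0 + alpha) * logits_orig - alpha * logits_neg

# 玩具例子: 词 2 在负样本条件下同样高分 (说明其得分来自语言先验而非视频内容)
logits_orig = np.array([2.0, 1.0, 3.0])
logits_neg = np.array([0.5, 0.5, 3.0])
p_plain = softmax(logits_orig)
p_cd = softmax(contrastive_decode(logits_orig, logits_neg))
```

普通解码会选中词 2,而对比解码后词 0 胜出:只有真正依赖视频证据(在负样本条件下得分下降)的候选词被保留。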
[CV-65] DELNet: Continuous All-in-One Weather Removal via Dynamic Expert Library ICASSP
【速读】:该论文旨在解决当前全一体化天气图像恢复方法在面对未见过的退化类型时需重新训练模型的问题,从而导致较高的部署成本。其核心挑战在于如何在不牺牲性能的前提下实现模型对新退化类型的持续学习与适应。解决方案的关键在于提出DELNet框架,该框架包含两个核心组件:一是判断阀门(judging valve),用于衡量任务相似性以区分新任务与已知任务;二是动态专家库(dynamic expert library),用于存储针对不同退化类型训练的专家模块。对于新任务,阀门选择前k个最相关的专家进行知识迁移,并引入新的专家以捕捉特定任务特征;对于已知任务,则直接复用对应专家,从而实现无需重训练的持续优化,显著降低再训练成本并提升实际应用中的鲁棒性和效率。
链接: https://arxiv.org/abs/2601.22573
作者: Shihong Liu,Kun Zuo,Hanguang Xiao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by the ICASSP conference, not yet officially published
Abstract:All-in-one weather image restoration methods are valuable in practice but depend on pre-collected data and require retraining for unseen degradations, leading to high cost. We propose DELNet, a continual learning framework for weather image restoration. DELNet integrates a judging valve that measures task similarity to distinguish new from known tasks, and a dynamic expert library that stores experts trained on different degradations. For new tasks, the valve selects top-k experts for knowledge transfer while adding new experts to capture task-specific features; for known tasks, the corresponding experts are directly reused. This design enables continuous optimization without retraining existing models. Experiments on OTS, Rain100H, and Snow100K demonstrate that DELNet surpasses state-of-the-art continual learning methods, achieving PSNR gains of 16%, 11%, and 12%, respectively. These results highlight the effectiveness, robustness, and efficiency of DELNet, which reduces retraining cost and enables practical deployment in real-world scenarios.
zh
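"判断阀门区分新/已知任务 + top-k 专家路由"的流程可以用如下示意实现(余弦相似度度量与阈值取值均为本文假设):

```python
import numpy as np

def judging_valve(task_emb, expert_embs, thresh=0.8, k=2):
    """判断阀门 + 动态专家库路由示意:
    用余弦相似度衡量新任务与库中各专家对应任务的相似性;
    最高相似度超过阈值 → 视为已知任务, 直接复用对应专家;
    否则 → 视为新任务, 选出 top-k 相关专家做知识迁移并新增专家。"""
    t = task_emb / np.linalg.norm(task_emb)
    e = expert_embs / np.linalg.norm(expert_embs, axis=1, keepdims=True)
    sims = e @ t
    if sims.max() >= thresh:
        return "reuse", [int(sims.argmax())]
    topk = np.argsort(-sims)[:k]
    return "new_task", [int(i) for i in topk]

expert_lib = np.array([[1.0, 0.0, 0.0],   # 例如: 雾专家
                       [0.0, 1.0, 0.0],   # 雨专家
                       [0.0, 0.0, 1.0]])  # 雪专家
mode1, idx1 = judging_valve(np.array([0.99, 0.10, 0.0]), expert_lib)   # 近似已知任务
mode2, idx2 = judging_valve(np.array([0.60, 0.60, 0.1]), expert_lib)   # 新的混合退化
```

已知任务直接复用对应专家;新任务(如雾雨混合退化)则同时取雾、雨两位专家做知识迁移,实现无需重训练旧模型的持续扩展。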
[CV-66] Leverag ing Data to Say No: Memory Augmented Plug-and-Play Selective Prediction ICLR2026
【速读】:该论文旨在解决视觉语言基础模型(Vision-Language Foundation Models)在开放集和无限词汇场景下(如图像描述生成)的选择性预测(Selective Prediction)问题,即赋予模型拒绝低置信度预测的能力。传统方法主要针对闭集任务(如固定类别分类),难以直接迁移至开放域任务。其核心挑战在于:(1)视觉-语言表征不稳定导致图像-文本嵌入方差大;(2)相似度分数校准不足。解决方案的关键是提出一种无需训练、低复杂度的插件式选择性预测框架(Plug-and-Play Selective Prediction, PaPSP),并进一步改进为基于记忆增强的PaPSP(MA-PaPSP),通过引入一个图像-文本对检索数据集来平均最近邻嵌入以降低方差,并结合对比归一化(contrastive normalization)提升相似度分数的校准能力,从而显著优于现有基线方法。
链接: https://arxiv.org/abs/2601.22570
作者: Aditya Sarkar,Yi Li,Jiacheng Cheng,Shlok Mishra,Nuno Vasconcelos
机构: University of Maryland, College Park (马里兰大学学院公园分校); University of California, San Diego (加州大学圣地亚哥分校); Qualcomm AI (高通人工智能); Yale University (耶鲁大学); Meta AI (Meta AI)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: ICLR 2026
Abstract:Selective prediction aims to endow predictors with a reject option, to avoid low confidence predictions. However, existing literature has primarily focused on closed-set tasks, such as visual question answering with predefined options or fixed-category classification. This paper considers selective prediction for visual language foundation models, addressing a taxonomy of tasks ranging from closed to open set and from finite to unbounded vocabularies, as in image captioning. We seek training-free approaches of low-complexity, applicable to any foundation model and consider methods based on external vision-language model embeddings, like CLIP. This is denoted as Plug-and-Play Selective Prediction (PaPSP). We identify two key challenges: (1) instability of the visual-language representations, leading to high variance in image-text embeddings, and (2) poor calibration of similarity scores. To address these issues, we propose a memory augmented PaPSP (MA-PaPSP) model, which augments PaPSP with a retrieval dataset of image-text pairs. This is leveraged to reduce embedding variance by averaging retrieved nearest-neighbor pairs and is complemented by the use of contrastive normalization to improve score calibration. Through extensive experiments on multiple datasets, we show that MA-PaPSP outperforms PaPSP and other selective prediction baselines for selective captioning, image-text matching, and fine-grained classification. Code is publicly available at this https URL.
zh
[CV-67] Hybrid Cross-Device Localization via Neural Metric Learning and Feature Fusion
【速读】:该论文旨在解决跨设备(cross-device)场景下的高精度定位问题,特别是在复杂环境中的几何一致性与尺度准确性挑战。解决方案的关键在于提出一种混合式定位流水线,其核心由共享的检索编码器和两个互补的定位分支组成:一是基于特征融合与PnP(Perspective-n-Point)的经典几何分支,用于提供鲁棒的姿态估计;二是以MapAnything为代表的神经前馈分支,利用几何输入进行条件化度量定位。此外,通过神经引导的候选帧剪枝策略,依据平移一致性(translation consistency)过滤不可靠的地图帧,并结合深度条件化优化提升Spot场景下的尺度与平移精度,从而在HYDRO和SUCCU基准上显著提升召回率和定位准确度。
链接: https://arxiv.org/abs/2601.22551
作者: Meixia Lin,Mingkai Liu,Shuxue Peng,Dikai Fan,Shengyu Gu,Xianliang Huang,Haoyang Ye,Xiao Liu
机构: PICO, ByteDance Inc. (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 3 pages
Abstract:We present a hybrid cross-device localization pipeline developed for the CroCoDL 2025 Challenge. Our approach integrates a shared retrieval encoder and two complementary localization branches: a classical geometric branch using feature fusion and PnP, and a neural feed-forward branch (MapAnything) for metric localization conditioned on geometric inputs. A neural-guided candidate pruning strategy further filters unreliable map frames based on translation consistency, while depth-conditioned localization refines metric scale and translation precision on Spot scenes. These components jointly lead to significant improvements in recall and accuracy across both HYDRO and SUCCU benchmarks. Our method achieved a final score of 92.62 (R@0.5m, 5°) during the challenge.
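摘要中"基于平移一致性剔除不可靠地图帧"的思路,可以用如下最小化示意来说明:以各候选帧给出的平移估计的逐坐标中位数作为鲁棒共识,剔除偏差过大的候选。阈值与接口均为假设,非 CroCoDL 官方实现:

```python
import numpy as np

def prune_by_translation_consistency(translations, tol=0.5):
    """基于平移一致性的候选帧剪枝(示意)。

    translations: (N, 3) 数组,各候选地图帧给出的相机平移估计(米);
    与逐坐标中位数(鲁棒共识)偏差超过 tol 的候选被剔除。
    """
    t = np.asarray(translations, dtype=float)
    median = np.median(t, axis=0)
    dist = np.linalg.norm(t - median, axis=1)
    return dist <= tol
```

返回的布尔掩码可直接用于筛选后续参与 PnP 或度量定位的候选帧。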
zh
[CV-68] SHED Light on Segmentation for Dense Prediction
【速读】:该论文旨在解决现有密集预测方法在处理真实场景时因忽略结构信息而导致的结构性不一致问题,即传统方法将每个像素视为独立预测单元,忽视了场景中固有的几何和语义结构。其解决方案的关键在于提出一种名为SHED的新颖编码器-解码器架构,通过显式引入分割(segmentation)作为几何先验,并利用双向层次推理机制:在编码器中对片段令牌(segment tokens)进行分层聚合,在解码器中则反向展开以恢复层次结构;模型仅在最终输出阶段接受监督,从而无需显式的分割标注即可自发形成结构化的片段层次(segment hierarchy),显著提升深度边界清晰度、语义一致性及跨域泛化能力,同时增强3D重建质量并揭示可解释的部件级结构。
链接: https://arxiv.org/abs/2601.22529
作者: Seung Hyun Lee,Sangwoo Mo,Stella X. Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Dense prediction infers per-pixel values from a single image and is fundamental to 3D perception and robotics. Although real-world scenes exhibit strong structure, existing methods treat it as an independent pixel-wise prediction, often resulting in structural inconsistencies. We propose SHED, a novel encoder-decoder architecture that enforces geometric prior explicitly by incorporating segmentation into dense prediction. By bidirectional hierarchical reasoning, segment tokens are hierarchically pooled in the encoder and unpooled in the decoder to reverse the hierarchy. The model is supervised only at the final output, allowing the segment hierarchy to emerge without explicit segmentation supervision. SHED improves depth boundary sharpness and segment coherence, while demonstrating strong cross-domain generalization from synthetic to the real-world environments. Its hierarchy-aware decoder better captures global 3D scene layouts, leading to improved semantic segmentation performance. Moreover, SHED enhances 3D reconstruction quality and reveals interpretable part-level structures that are often missed by conventional pixel-wise methods.
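摘要中"片段令牌在编码器中分层池化、在解码器中反向展开"的核心操作,可用如下 NumPy 小例示意(按片段 id 做平均池化与广播展开;仅为概念说明,非 SHED 实际架构):

```python
import numpy as np

def pool_segments(tokens, seg_ids):
    """编码器侧(示意):按片段 id 对 token 特征做平均池化。"""
    n_seg = seg_ids.max() + 1
    pooled = np.zeros((n_seg, tokens.shape[1]))
    counts = np.bincount(seg_ids, minlength=n_seg)
    np.add.at(pooled, seg_ids, tokens)        # 按片段累加
    return pooled / counts[:, None]

def unpool_segments(pooled, seg_ids):
    """解码器侧(示意):将片段级特征广播回各 token,恢复层次。"""
    return pooled[seg_ids]
```

池化-展开互为逆向操作,使同一片段内的 token 共享一致的特征,从而体现"片段内预测应保持一致"这一结构先验。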
zh
[CV-69] Can 3D point cloud data improve automated body condition score prediction in dairy cattle?
【速读】:该论文旨在解决传统奶牛体况评分(Body Condition Score, BCS)方法主观性强、劳动密集的问题,提出利用计算机视觉技术提升BCS预测的客观性与效率。其解决方案的关键在于系统比较了两种三维数据表示方式——顶视深度图像(top-view depth image)与点云数据(point cloud data)在不同预处理和特征提取设置下的BCS预测性能,结果表明:在未分割原始数据和全身体部分割数据场景下,深度图像模型表现更优;而在仅使用后躯分割数据时两者性能相当;同时发现点云方法对噪声和模型架构更为敏感,整体未展现出相对于深度图像的一致优势。
链接: https://arxiv.org/abs/2601.22522
作者: Zhou Tang,Jin Wang,Angelo De Castro,Yuxi Zhang,Victoria Bastos Primo,Ana Beatriz Montevecchio Bernardino,Gota Morota,Xu Wang,Ricardo C Chebel,Haipeng Yu
机构: University of Florida (佛罗里达大学); The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Body condition score (BCS) is a widely used indicator of body energy status and is closely associated with metabolic status, reproductive performance, and health in dairy cattle; however, conventional visual scoring is subjective and labor-intensive. Computer vision approaches have been applied to BCS prediction, with depth images widely used because they capture geometric information independent of coat color and texture. More recently, three-dimensional point cloud data have attracted increasing interest due to their ability to represent richer geometric characteristics of animal morphology, but direct head-to-head comparisons with depth image-based approaches remain limited. In this study, we compared top-view depth image and point cloud data for BCS prediction under four settings: 1) unsegmented raw data, 2) segmented full-body data, 3) segmented hindquarter data, and 4) handcrafted feature data. Prediction models were evaluated using data from 1,020 dairy cows collected on a commercial farm, with cow-level cross-validation to prevent data leakage. Depth image-based models consistently achieved higher accuracy than point cloud-based models when unsegmented raw data and segmented full-body data were used, whereas comparable performance was observed when segmented hindquarter data were used. Both depth image and point cloud approaches showed reduced accuracy when handcrafted feature data were employed compared with the other settings. Overall, point cloud-based predictions were more sensitive to noise and model architecture than depth image-based predictions. Taken together, these results indicate that three-dimensional point clouds do not provide a consistent advantage over depth images for BCS prediction in dairy cattle under the evaluated conditions.
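摘要强调使用"牛只层面的交叉验证(cow-level cross-validation)"防止数据泄漏,即同一头牛的所有样本必须落在同一折中。下面是一个纯 Python 的最小划分示意(函数名为假设,非论文官方代码):

```python
def cow_level_folds(cow_ids, n_folds=5):
    """按奶牛个体划分交叉验证折(示意)。

    同一头牛的全部样本被分到同一折,避免个体层面的数据泄漏
    (即训练集与测试集中出现同一头牛的不同图像)。
    """
    unique = sorted(set(cow_ids))
    fold_of_cow = {c: i % n_folds for i, c in enumerate(unique)}
    return [fold_of_cow[c] for c in cow_ids]
```

实践中也可用 scikit-learn 的 GroupKFold 以牛只 id 作为分组实现同样的效果。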
zh
[CV-70] DNA: Uncovering Universal Latent Forgery Knowledge
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 图像伪造检测中因超现实主义导致传统表面伪影检测失效的问题,以及现有方法依赖资源密集型黑箱模型微调所带来的效率与泛化瓶颈。其解决方案的关键在于提出判别性神经锚点(Discriminative Neural Anchors, DNA)框架,通过粗到精的挖掘机制识别预训练模型中已编码的伪造检测能力:首先基于特征解耦与注意力分布变化定位关键中间层,实现从全局语义到局部异常的关注转移;随后引入三元融合评分指标与曲率截断策略,剥离语义冗余,精准提取内在敏感于伪造痕迹的判别单元(Forgery-Discriminative Units, FDUs)。此方法无需端到端重训练即可在少样本条件下实现优越性能,并展现出对多种架构和未见生成模型的强大鲁棒性。
链接: https://arxiv.org/abs/2601.22515
作者: Jingtong Dou,Chuancheng Shi,Yemin Wang,Shiming Guo,Anqi Yi,Wenhua Wu,Li Zhang,Fei Shen,Tat-Seng Chua
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:As generative AI achieves hyper-realism, superficial artifact detection has become obsolete. While prevailing methods rely on resource-intensive fine-tuning of black-box backbones, we propose that forgery detection capability is already encoded within pre-trained models rather than requiring end-to-end retraining. To elicit this intrinsic capability, we propose the discriminative neural anchors (DNA) framework, which employs a coarse-to-fine excavation mechanism. First, by analyzing feature decoupling and attention distribution shifts, we pinpoint critical intermediate layers where the focus of the model logically transitions from global semantics to local anomalies. Subsequently, we introduce a triadic fusion scoring metric paired with a curvature-truncation strategy to strip away semantic redundancy, precisely isolating the forgery-discriminative units (FDUs) inherently imprinted with sensitivity to forgery traces. Moreover, we introduce HIFI-Gen, a high-fidelity synthetic benchmark built upon the very latest models, to address the lag in existing datasets. Experiments demonstrate that by solely relying on these anchors, DNA achieves superior detection performance even under few-shot conditions. Furthermore, it exhibits remarkable robustness across diverse architectures and against unseen generative models, validating that waking up latent neurons is more effective than extensive fine-tuning.
zh
[CV-71] CoVA: Text-Guided Composed Video Retrieval for Audio-Visual Content
【速读】:该论文旨在解决当前组成视频检索(Composed Video Retrieval, CoVR)任务中仅关注视觉变化而忽略音频差异的问题,即在视觉相似的视频中可能存在显著的音频不同,但现有基准未能涵盖此类跨模态变化。为此,作者提出了一个新的检索任务——带音频的组成视频检索(Composed retrieval for Video with its Audio, CoVA),并构建了AV-Comp数据集,其中包含具有跨模态(视觉与音频)变化的视频对及描述这些差异的文本查询。解决方案的关键在于提出AVT Compositional Fusion(AVT)方法,该方法通过选择性地将文本查询对齐到最相关的模态(视频、音频或两者),实现多模态特征的融合,从而有效提升在视觉和音频均存在变化场景下的检索性能,且优于传统单模态融合策略,为CoVA任务提供了强有力的基线。
链接: https://arxiv.org/abs/2601.22508
作者: Gyuwon Han,Young Kyun Jang,Chanho Eom
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Please visit our project page at this https URL
Abstract:Composed Video Retrieval (CoVR) aims to retrieve a target video from a large gallery using a reference video and a textual query specifying visual modifications. However, existing benchmarks consider only visual changes, ignoring videos that differ in audio despite visual similarity. To address this limitation, we introduce Composed retrieval for Video with its Audio CoVA, a new retrieval task that accounts for both visual and auditory variations. To support this, we construct AV-Comp, a benchmark consisting of video pairs with cross-modal changes and corresponding textual queries that describe the differences. We also propose AVT Compositional Fusion (AVT), which integrates video, audio, and text features by selectively aligning the query to the most relevant modality. AVT outperforms traditional unimodal fusion and serves as a strong baseline for CoVA. Examples from the proposed dataset, including both visual and auditory information, are available at this https URL.
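AVT 中"将文本查询对齐到最相关模态"的融合思路,可用按相似度计算软权重的加权融合来示意。温度参数、融合方式均为说明性假设,非论文官方实现:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def modality_selective_fusion(text_emb, video_emb, audio_emb, tau=0.1):
    """模态选择性融合(示意):按查询与各模态的相似度
    计算软权重,再加权融合视频/音频特征并归一化。"""
    sims = np.array([text_emb @ video_emb, text_emb @ audio_emb])
    w = softmax(sims / tau)
    fused = w[0] * video_emb + w[1] * audio_emb + text_emb
    return fused / np.linalg.norm(fused), w
```

当文本查询描述的是音频差异时,音频模态会获得更大的融合权重,反之亦然。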
zh
[CV-72] DreamVAR: Taming Reinforced Visual Autoregressive Model for High-Fidelity Subject-Driven Image Generation ICASSP2026
【速读】:该论文旨在解决当前基于扩散模型(diffusion models)的主体驱动图像生成方法在主体一致性(subject consistency)和语义对齐(semantic alignment)方面存在的不足,尤其是在多尺度条件建模场景下容易出现训练与推理不一致的问题。其解决方案的关键在于提出一种基于视觉自回归模型(Visual Autoregressive, VAR)的新框架DreamVAR,该框架通过引入“下一尺度预测”机制,在图像生成过程中先完整填充参考主体的多尺度特征序列,再进行目标图像令牌的自回归预测,从而简化了条件依赖关系并有效缓解了多尺度条件下的训练-测试差异问题;此外,还结合强化学习进一步优化生成图像的语义一致性和主体保真度。
链接: https://arxiv.org/abs/2601.22507
作者: Xin Jiang,Jingwen Chen,Yehao Li,Yingwei Pan,Kezhou Chen,Zechao Li,Ting Yao,Tao Mei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted By ICASSP 2026
Abstract:Recent advances in subject-driven image generation using diffusion models have attracted considerable attention for their remarkable capabilities in producing high-quality images. Nevertheless, the potential of Visual Autoregressive (VAR) models, despite their unified architecture and efficient inference, remains underexplored. In this work, we present DreamVAR, a novel framework for subject-driven image synthesis built upon a VAR model that employs next-scale prediction. Technically, multi-scale features of the reference subject are first extracted by a visual tokenizer. Instead of interleaving these conditional features with target image tokens across scales, our DreamVAR pre-fills the full subject feature sequence prior to predicting target image tokens. This design simplifies autoregressive dependencies and mitigates the train-test discrepancy in multi-scale conditioning scenario within the VAR paradigm. DreamVAR further incorporates reinforcement learning to jointly enhance semantic alignment and subject consistency. Extensive experiments demonstrate that DreamVAR achieves superior appearance preservation compared to leading diffusion-based methods.
zh
[CV-73] MIRRORTALK: Forging Personalized Avatars Via Disentangled Style and Hierarchical Motion Control ICASSP2026
【速读】:该论文旨在解决个性化说话人脸生成中难以同时保持高唇形同步精度与说话者独特风格的问题。现有方法因面部运动中说话者特有风格与语义内容混杂,导致无法将说话者的个性特征准确迁移到任意语音上。其解决方案的关键在于提出一种基于条件扩散模型的生成框架MirrorTalk,结合一个语义解耦风格编码器(Semantically-Disentangled Style Encoder, SDSE),可从短时参考视频中提取纯净的风格表征,并引入分层调制策略,在扩散过程中动态平衡音频与风格特征在不同面部区域的贡献,从而实现精确的唇形同步与完整的面部表情表达。
链接: https://arxiv.org/abs/2601.22501
作者: Renjie Lu,Xulong Zhang,Xiaoyang Qu,Jianzong Wang,Shangfei Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注: Accepted to 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)
Abstract:Synthesizing personalized talking faces that uphold and highlight a speaker’s unique style while maintaining lip-sync accuracy remains a significant challenge. A primary limitation of existing approaches is the intrinsic confounding of speaker-specific talking style and semantic content within facial motions, which prevents the faithful transfer of a speaker’s unique persona to arbitrary speech. In this paper, we propose MirrorTalk, a generative framework based on a conditional diffusion model, combined with a Semantically-Disentangled Style Encoder (SDSE) that can distill pure style representations from a brief reference video. To effectively utilize this representation, we further introduce a hierarchical modulation strategy within the diffusion process. This mechanism guides the synthesis by dynamically balancing the contributions of audio and style features across distinct facial regions, ensuring both precise lip-sync accuracy and expressive full-face dynamics. Extensive experiments demonstrate that MirrorTalk achieves significant improvements over state-of-the-art methods in terms of lip-sync accuracy and personalization preservation.
zh
[CV-74] PromptMAD: Cross-Modal Prompting for Multi-Class Visual Anomaly Localization ICASSP2026
【速读】:该论文旨在解决多类别场景下视觉异常检测(Visual Anomaly Detection)中的三大挑战:对象类别的多样性、异常样本的稀缺性以及伪装缺陷的存在。其解决方案的关键在于提出了一种跨模态提示框架 PromptMAD,通过视觉-语言对齐(Vision-Language Alignment)引入语义引导,利用 CLIP 编码的文本提示同时描述正常与异常类别特定特征,从而在视觉重建中融入语义上下文,提升对细微和纹理类异常的检测能力;此外,为缓解像素级类别不平衡问题,引入 Focal Loss 强调难检测区域,并结合监督分割器融合多尺度卷积特征与基于 Transformer 的空间注意力机制及扩散迭代优化,最终实现高精度、高分辨率的异常定位。
链接: https://arxiv.org/abs/2601.22492
作者: Duncan McCain,Hossein Kashiani,Fatemeh Afghah
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICASSP 2026
Abstract:Visual anomaly detection in multi-class settings poses significant challenges due to the diversity of object categories, the scarcity of anomalous examples, and the presence of camouflaged defects. In this paper, we propose PromptMAD, a cross-modal prompting framework for unsupervised visual anomaly detection and localization that integrates semantic guidance through vision-language alignment. By leveraging CLIP-encoded text prompts describing both normal and anomalous class-specific characteristics, our method enriches visual reconstruction with semantic context, improving the detection of subtle and textural anomalies. To further address the challenge of class imbalance at the pixel level, we incorporate Focal loss function, which emphasizes hard-to-detect anomalous regions during training. Our architecture also includes a supervised segmentor that fuses multi-scale convolutional features with Transformer-based spatial attention and diffusion iterative refinement, yielding precise and high-resolution anomaly maps. Extensive experiments on the MVTec-AD dataset demonstrate that our method achieves state-of-the-art pixel-level performance, improving mean AUC to 98.35% and AP to 66.54%, while maintaining efficiency across diverse categories.
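文中用于缓解像素级类别不平衡的 Focal Loss 是一个标准构造,其核心是用 (1 - p_t)^γ 降低易分样本的权重。下面给出二分类(正常/异常像素)情形的 NumPy 实现示意,超参数取常用默认值:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-8):
    """像素级二分类 Focal Loss(示意)。

    p: 预测的异常概率;y: 0/1 标签。
    (1 - p_t)^gamma 压低易分样本的损失贡献,
    使训练聚焦于难检测的异常区域。
    """
    p = np.clip(p, eps, 1 - eps)
    pt = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return float(np.mean(-alpha_t * (1 - pt) ** gamma * np.log(pt)))
```

gamma=0 且 alpha=0.5 时退化为(缩放后的)普通交叉熵;gamma 越大,对易分样本的抑制越强。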
zh
[CV-75] Head-Aware Visual Cropping: Enhancing Fine-Grained VQA with Attention-Guided Subimage ICASSP2026
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在细粒度视觉问答(Fine-grained Visual Question Answering, VQA)任务中因输入图像分辨率低和注意力机制聚合噪声导致的视觉定位不准与推理能力不足的问题。解决方案的关键在于提出一种无需训练的头感知视觉裁剪方法(Head Aware Visual Cropping, HAVC),其核心思想是先通过OCR诊断任务筛选具备真实视觉定位能力的注意力头,再在推理阶段利用空间熵挑选空间聚焦更强的头、利用梯度敏感性衡量其对预测的贡献,最终融合这些注意力信号生成可靠的视觉裁剪引导图(Visual Cropping Guidance Map),指导对原图进行精准子区域裁剪,从而显著提升MLLMs的视觉接地精度与细粒度推理性能。
链接: https://arxiv.org/abs/2601.22483
作者: Junfei Xie,Peng Pan,Xulong Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)
Abstract:Multimodal Large Language Models (MLLMs) show strong performance in Visual Question Answering (VQA) but remain limited in fine-grained reasoning due to low-resolution inputs and noisy attention aggregation. We propose \textbfHead Aware Visual Cropping (HAVC), a training-free method that improves visual grounding by leveraging a selectively refined subset of attention heads. HAVC first filters heads through an OCR-based diagnostic task, ensuring that only those with genuine grounding ability are retained. At inference, these heads are further refined using spatial entropy for stronger spatial concentration and gradient sensitivity for predictive contribution. The fused signals produce a reliable Visual Cropping Guidance Map, which highlights the most task-relevant region and guides the cropping of a subimage subsequently provided to the MLLM together with the image-question pair. Extensive experiments on multiple fine-grained VQA benchmarks demonstrate that HAVC consistently outperforms state-of-the-art cropping strategies, achieving more precise localization, stronger visual grounding, providing a simple yet effective strategy for enhancing precision in MLLMs.
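HAVC 在推理阶段用"空间熵"衡量注意力头的空间聚焦程度:熵越低,注意力分布越集中。下面是这一判据的最小 NumPy 示意(头筛选接口为假设,非论文官方实现):

```python
import numpy as np

def spatial_entropy(attn_map, eps=1e-12):
    """注意力图的空间熵(示意):先归一化为概率分布,
    再计算香农熵;熵越低,空间聚焦越强。"""
    p = attn_map.ravel()
    p = p / (p.sum() + eps)
    return float(-(p * np.log(p + eps)).sum())

def select_focused_heads(attn_maps, k=2):
    """按空间熵升序挑选聚焦最强的 k 个注意力头(示意)。"""
    ents = [spatial_entropy(a) for a in attn_maps]
    return sorted(range(len(attn_maps)), key=lambda i: ents[i])[:k]
```

均匀弥散的注意力图熵接近 log(像素数),而集中在单点的注意力图熵接近 0,两者可被清晰区分。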
zh
[CV-76] raining-Free Representation Guidance for Diffusion Models with a Representation Alignment Projector
【速读】:该论文旨在解决扩散模型在生成过程中存在的语义漂移(semantic drift)问题,即在去噪早期阶段由于随机性导致即使在相同条件约束下生成结果仍出现语义不一致的现象。这一问题限制了无监督特征表示在生成过程中的有效利用,从而影响图像的语义一致性与视觉保真度。解决方案的关键在于引入一个表示对齐投影器(representation alignment projector),该投影器在采样中间步骤中注入由其预测的特征表示,为生成过程提供一个语义锚点(semantic anchor),从而增强生成图像的语义稳定性与一致性,且无需修改模型架构。实验表明,该方法显著降低了FID分数,在类条件ImageNet合成任务中优于现有代表性引导策略,并能与无分类器引导(classifier-free guidance)互补提升整体性能。
链接: https://arxiv.org/abs/2601.22468
作者: Wenqiang Zu,Shenghao Xie,Bo Lei,Lei Ma
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Peking University (北京大学); University of Chinese Academy of Sciences (中国科学院大学); Beijing Academy of Artificial Intelligence (北京人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent progress in generative modeling has enabled high-quality visual synthesis with diffusion-based frameworks, supporting controllable sampling and large-scale training. Inference-time guidance methods such as classifier-free and representative guidance enhance semantic alignment by modifying sampling dynamics; however, they do not fully exploit unsupervised feature representations. Although such visual representations contain rich semantic structure, their integration during generation is constrained by the absence of ground-truth reference images at inference. This work reveals semantic drift in the early denoising stages of diffusion transformers, where stochasticity results in inconsistent alignment even under identical conditioning. To mitigate this issue, we introduce a guidance scheme using a representation alignment projector that injects representations predicted by a projector into intermediate sampling steps, providing an effective semantic anchor without modifying the model architecture. Experiments on SiTs and REPAs show notable improvements in class-conditional ImageNet synthesis, achieving substantially lower FID scores; for example, REPA-XL/2 improves from 5.9 to 3.3, and the proposed method outperforms representative guidance when applied to SiT models. The approach further yields complementary gains when combined with classifier-free guidance, demonstrating enhanced semantic coherence and visual fidelity. These results establish representation-informed diffusion sampling as a practical strategy for reinforcing semantic preservation and image consistency.
zh
[CV-77] CARE: Multi-Task Pretraining for Latent Continuous Action Representation in Robot Control ICASSP2026
【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人控制中对显式动作标注(action supervision)的强依赖问题,这一限制显著制约了模型的可扩展性和泛化能力。解决方案的关键在于提出一种名为CARE的新框架,其核心创新是通过仅使用视频-文本对(video-text pairs)进行预训练,利用弱监督信号学习连续的潜在动作表示(latent action representations),从而无需在预训练阶段引入动作标签;随后在微调阶段仅需少量标注数据即可训练动作头完成控制任务,有效提升了模型的可扩展性、语义可解释性并避免捷径学习(shortcut learning)。
链接: https://arxiv.org/abs/2601.22467
作者: Jiaqi Shi,Xulong Zhang,Xiaoyang Qu,Jianzong Wang
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)
Abstract:Recent advances in Vision-Language-Action (VLA) models have shown promise for robot control, but their dependence on action supervision limits scalability and generalization. To address this challenge, we introduce CARE, a novel framework designed to train VLA models for robotic task execution. Unlike existing methods that depend on action annotations during pretraining, CARE eliminates the need for explicit action labels by leveraging only video-text pairs. These weakly aligned data sources enable the model to learn continuous latent action representations through a newly designed multi-task pretraining objective. During fine-tuning, a small set of labeled data is used to train the action head for control. Experimental results across various simulation tasks demonstrate CARE’s superior success rate, semantic interpretability, and ability to avoid shortcut learning. These results underscore CARE’s scalability, interpretability, and effectiveness in robotic control with weak supervision.
zh
[CV-78] ScribbleSense: Generative Scribble-Based Texture Editing with Intent Prediction
【速读】:该论文旨在解决基于涂鸦(scribble-based)的3D模型纹理编辑中存在意图模糊与目标语义位置不明确的问题。现有方法多依赖草图(sketch-based)交互进行轮廓绘制,而对粗粒度涂鸦的利用有限,且涂鸦指令的抽象性常导致编辑意图难以准确解析。解决方案的关键在于提出ScribbleSense框架,该框架融合多模态大语言模型(Multimodal Large Language Models, MLLMs)与图像生成模型:首先利用MLLMs理解涂鸦背后的语义意图,随后通过全局生成图像提取局部纹理细节,从而锚定局部语义并缓解目标位置的歧义性,最终实现更精准、直观的交互式纹理编辑。
链接: https://arxiv.org/abs/2601.22455
作者: Yudi Zhang,Yeming Geng,Lei Zhang
机构: Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE TVCG. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses
Abstract:Interactive 3D model texture editing presents enhanced opportunities for creating 3D assets, with freehand drawing style offering the most intuitive experience. However, existing methods primarily support sketch-based interactions for outlining, while the utilization of coarse-grained scribble-based interaction remains limited. Furthermore, current methodologies often encounter challenges due to the abstract nature of scribble instructions, which can result in ambiguous editing intentions and unclear target semantic locations. To address these issues, we propose ScribbleSense, an editing method that combines multimodal large language models (MLLMs) and image generation models to effectively resolve these challenges. We leverage the visual capabilities of MLLMs to predict the editing intent behind the scribbles. Once the semantic intent of the scribble is discerned, we employ globally generated images to extract local texture details, thereby anchoring local semantics and alleviating ambiguities concerning the target semantic locations. Experimental results indicate that our method effectively leverages the strengths of MLLMs, achieving state-of-the-art interactive editing performance for scribble-based texture editing.
zh
[CV-79] Countering the Over-Reliance Trap: Mitigating Object Hallucination for LVLMs via a Self-Validation Framework
【速读】:该论文旨在解决大型视觉语言模型(Large Vision Language Models, LVLMs)在图像描述生成任务中普遍存在的对象幻觉(object hallucination)问题,即模型生成不存在于图像中的物体描述,从而降低其可靠性。研究表明,这种幻觉现象主要源于LVLMs对语言先验(language priors)的过度依赖,尤其在生成长度增加时,会导致幻觉对象词的概率被放大,进一步加剧幻觉。解决方案的关键在于提出一种无需训练的自验证框架(Self-Validation Framework),通过引入"无语言先验验证"(Language-Prior-Free Verification)机制,在采样的候选描述中验证对象存在性,并基于此进行描述选择或聚合,从而有效缓解模型对语言先验的依赖,显著减少幻觉现象(如在LLaVA-v1.5-7B上CHAIRI指标提升65.6%)。
链接: https://arxiv.org/abs/2601.22451
作者: Shiyu Liu,Xinyi Wen,Zhibin Lan,Ante Wang,Jinsong Su
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Code is available at this https URL
Abstract:Despite progress in Large Vision Language Models (LVLMs), object hallucination remains a critical issue in image captioning task, where models generate descriptions of non-existent objects, compromising their reliability. Previous work attributes this to LVLMs’ over-reliance on language priors and attempts to mitigate it through logits calibration. However, they still lack a thorough analysis of the over-reliance. To gain a deeper understanding of over-reliance, we conduct a series of preliminary experiments, indicating that as the generation length increases, LVLMs’ over-reliance on language priors leads to inflated probability of hallucinated object tokens, consequently exacerbating object hallucination. To circumvent this issue, we propose Language-Prior-Free Verification to enable LVLMs to faithfully verify the confidence of object existence. Based on this, we propose a novel training-free Self-Validation Framework to counter the over-reliance trap. It first validates objects’ existence in sampled candidate captions and further mitigates object hallucination via caption selection or aggregation. Experiment results demonstrate that our framework mitigates object hallucination significantly in image captioning task (e.g., 65.6% improvement on CHAIRI metric with LLaVA-v1.5-7B), surpassing the previous SOTA methods. This result highlights a novel path towards mitigating hallucination by unlocking the inherent potential within LVLMs themselves.
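框架中"验证候选描述中的对象存在性并进行描述选择"这一步,可用如下纯 Python 骨架示意:对每条候选描述统计未通过验证(即疑似幻觉)的对象数,选取幻觉最少者。其中 verify 为假设的验证接口,候选描述的数据结构也仅作说明:

```python
def select_caption(candidates, verify):
    """自验证式描述选择(示意)。

    candidates: 采样得到的候选描述列表,每条含其提及的对象列表;
    verify(obj): 假设的存在性验证接口,返回该对象是否出现在图像中。
    返回幻觉对象最少的候选描述。
    """
    def hallucination_count(cap):
        return sum(0 if verify(obj) else 1 for obj in cap["objects"])
    return min(candidates, key=hallucination_count)
```

论文中的验证信号来自"无语言先验验证"给出的置信度,此处用布尔接口简化。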
zh
[CV-80] High-Definition 5MP Stereo Vision Sensing for Robotics
【速读】:该论文旨在解决高分辨率(5MP+)立体视觉系统在机器人应用中因传统标定与立体匹配方法精度不足、处理速度慢而难以充分发挥性能的问题。其关键解决方案在于提出一种新颖的帧间标定与立体匹配方法,能够在保证高精度的同时实现快速处理;并通过将实时生成的视差图与基于计算密集型算法获得的真值视差图进行对比,创新性地评估了系统的实时性能表现,从而验证了高像素相机仅在高精度标定支持下才能生成高质量点云的核心结论。
链接: https://arxiv.org/abs/2601.22445
作者: Leaf Jiang,Matthew Holzel,Bernhard Kaplan,Hsiou-Yuan Liu,Sabyasachi Paul,Karen Rankin,Piotr Swierczynski
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:High-resolution (5MP+) stereo vision systems are essential for advancing robotic capabilities, enabling operation over longer ranges and generating significantly denser and accurate 3D point clouds. However, realizing the full potential of high-angular-resolution sensors requires a commensurately higher level of calibration accuracy and faster processing – requirements often unmet by conventional methods. This study addresses that critical gap by processing 5MP camera imagery using a novel, advanced frame-to-frame calibration and stereo matching methodology designed to achieve both high accuracy and speed. Furthermore, we introduce a new approach to evaluate real-time performance by comparing real-time disparity maps with ground-truth disparity maps derived from more computationally intensive stereo matching algorithms. Crucially, the research demonstrates that high-pixel-count cameras yield high-quality point clouds only through the implementation of high-accuracy calibration.
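双目视差与深度的换算 Z = f·B/d(f 为以像素计的焦距,B 为基线),以及文中"实时视差图与高精度真值视差图对比"的评估思路,可用如下小例示意(坏点率阈值为常见约定,非论文指定):

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """双目深度换算:Z = f * B / d(f 单位为像素,B 单位为米)。"""
    return focal_px * baseline_m / np.maximum(disparity, eps)

def bad_pixel_rate(disp_rt, disp_gt, thresh=1.0):
    """将实时视差图与(计算量更大的)真值视差图对比:
    返回视差误差超过阈值的像素占比。"""
    return float(np.mean(np.abs(disp_rt - disp_gt) > thresh))
```

由 Z = f·B/d 可见,视差误差对深度的影响随距离平方放大,这正是高分辨率(高角分辨率)传感器对远距离感知的价值所在。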
zh
[CV-81] Weak Diffusion Priors Can Still Achieve Strong Inverse-Problem Performance
【速读】:该论文试图解决在逆问题求解中使用不匹配或低保真度的扩散模型(diffusion model)作为先验时,其性能是否仍可信赖的问题。传统方法通常假设扩散模型是在与未知信号高度一致的数据上训练的,但实际应用中常面临模型与任务数据分布不匹配的情况。论文的关键解决方案在于通过贝叶斯一致性理论分析,在高维测量信息充足(如大量观测像素)时,后验分布会集中在真实信号附近,从而为弱扩散先验在特定条件下依然有效提供了理论依据。这一发现揭示了弱先验成功的关键机制,并明确了其适用范围和失效场景。
链接: https://arxiv.org/abs/2601.22443
作者: Jing Jia,Wei Yuan,Sifan Liu,Liyue Shen,Guanyang Wang
机构: Rutgers University (罗格斯大学); Duke University (杜克大学); University of Michigan (密歇根大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computation (stat.CO); Machine Learning (stat.ML)
备注:
Abstract:Can a diffusion model trained on bedrooms recover human faces? Diffusion models are widely used as priors for inverse problems, but standard approaches usually assume a high-fidelity model trained on data that closely match the unknown signal. In practice, one often must use a mismatched or low-fidelity diffusion prior. Surprisingly, these weak priors often perform nearly as well as full-strength, in-domain baselines. We study when and why inverse solvers are robust to weak diffusion priors. Through extensive experiments, we find that weak priors succeed when measurements are highly informative (e.g., many observed pixels), and we identify regimes where they fail. Our theory, based on Bayesian consistency, gives conditions under which high-dimensional measurements make the posterior concentrate near the true signal. These results provide a principled justification on when weak diffusion priors can be used reliably.
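"测量信息充足时后验向真值集中,即使先验错配"这一贝叶斯一致性结论,可用共轭高斯玩具模型直观验证:先验均值严重偏离真值,但随着观测数增加,后验均值被数据拉回真值、后验方差不断收缩。这只是概念演示,与论文中的扩散先验实验无直接对应:

```python
import numpy as np

def gaussian_posterior(mu0, var0, obs, noise_var):
    """共轭高斯后验(玩具模型):先验 N(mu0, var0),
    观测独立同分布、噪声方差为 noise_var。"""
    n = len(obs)
    post_var = 1.0 / (1.0 / var0 + n / noise_var)
    post_mean = post_var * (mu0 / var0 + np.sum(obs) / noise_var)
    return post_mean, post_var
```

后验精度 1/var0 + n/noise_var 随观测数 n 线性增长,因此当"观测像素"足够多时,先验(哪怕是弱先验或错配先验)的影响被稀释。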
zh
[CV-82] EMBC Special Issue: Calibrated Uncertainty for Trustworthy Clinical Gait Analysis Using Probabilistic Multiview Markerless Motion Capture
【速读】:该论文旨在解决视频驱动的无标记点运动捕捉(multi-view markerless motion capture, MMMC)在临床应用中因缺乏可靠置信区间而影响其可信度的问题。现有系统虽具备一定准确性,但无法量化个体测量结果的不确定性,从而限制了其在临床评估与研究中的实际应用。解决方案的关键在于引入基于变分推断的概率建模方法,用于估计关节角度的后验分布,并通过预期校准误差(Expected Calibration Error, ECE)验证置信区间的可靠性。实验表明,该模型在步长和步态运动学参数上实现了ECE < 0.1的校准性能,且预测不确定性的大小与实测误差高度相关,说明该方法能够有效识别不可靠输出,无需依赖同步的真实基准数据即可提供可解释的不确定性估计。
链接: https://arxiv.org/abs/2601.22412
作者: Seth Donahue,Irina Djuraskovic,Kunal Shah,Fabian Sinz,Ross Chafetz,R.James Cotton
机构: Shriners Children’s Lexington(施莱纳儿童医院莱克星顿分院); University of Kentucky Department of Physical Therapy(肯塔基大学物理治疗系); Shirley Ryan AbilityLab(莎莉·瑞安能力实验室); Northwestern University(西北大学); University of Göttingen(哥廷根大学); Lower Saxony Center for AI & Causal Methods in Medicine(下萨克森州人工智能与因果方法医学中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 5 figures, EMBS Special Issue
Abstract:Video-based human movement analysis holds potential for movement assessment in clinical practice and research. However, the clinical implementation and trust of multi-view markerless motion capture (MMMC) require that, in addition to being accurate, these systems produce reliable confidence intervals to indicate how accurate they are for any individual. Building on our prior work utilizing variational inference to estimate joint angle posterior distributions, this study evaluates the calibration and reliability of a probabilistic MMMC method. We analyzed data from 68 participants across two institutions, validating the model against an instrumented walkway and standard marker-based motion capture. We measured the calibration of the confidence intervals using the Expected Calibration Error (ECE). The model demonstrated reliable calibration, yielding ECE values generally below 0.1 for both step and stride length and bias-corrected gait kinematics. We observed a median step and stride length error of ~16 mm and ~12 mm respectively, with median bias-corrected kinematic errors ranging from 1.5 to 3.8 degrees across lower extremity joints. Consistent with the calibrated ECE, the magnitude of the model’s predicted uncertainty correlated strongly with observed error measures. These findings indicate that, as designed, the probabilistic model reconstruction quantifies epistemic uncertainty, allowing it to identify unreliable outputs without the need for concurrent ground-truth instrumentation.
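文中用于衡量置信区间可靠性的预期校准误差(ECE)是标准指标:按置信度分桶,对每桶计算 |桶内准确率 - 桶内平均置信度| 并按桶内样本占比加权求和。下面给出其常见定义的 NumPy 实现示意(分桶数等细节为惯用取法,非论文指定):

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """ECE(示意):值越小,置信度与实际准确率越一致。

    conf: 各预测的置信度(0~1);correct: 各预测是否正确(0/1)。
    """
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)
```

论文中的 ECE 是对连续量(关节角、步长)的置信区间覆盖率做类似分桶比较,此处以分类情形示意其计算骨架。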
zh
[CV-83] Jailbreaks on Vision Language Model via Multimodal Reasoning
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在安全对齐(safety alignment)方面的脆弱性问题,即模型输出极易受提示(prompt)变化影响,从而可能被恶意利用绕过内容安全过滤机制。解决方案的关键在于提出一种双策略攻击框架:首先,利用后训练阶段的思维链(Chain-of-Thought, CoT)提示构建隐蔽性强的对抗性文本提示;其次,设计一种基于ReAct(Reasoning + Acting)范式的自适应加噪机制,通过模型反馈迭代扰动输入图像,在最可能触发安全防御的区域精炼对抗噪声,从而在提升攻击成功率(Attack Success Rate, ASR)的同时保持文本与图像内容的自然性。
链接: https://arxiv.org/abs/2601.22398
作者: Aarush Noheria,Yuguang Yao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-language models (VLMs) have become central to tasks such as visual question answering, image captioning, and text-to-image generation. However, their outputs are highly sensitive to prompt variations, which can reveal vulnerabilities in safety alignment. In this work, we present a jailbreak framework that exploits post-training Chain-of-Thought (CoT) prompting to construct stealthy prompts capable of bypassing safety filters. To further increase attack success rates (ASR), we propose a ReAct-driven adaptive noising mechanism that iteratively perturbs input images based on model feedback. This approach leverages the ReAct paradigm to refine adversarial noise in regions most likely to activate safety defenses, thereby enhancing stealth and evasion. Experimental results demonstrate that the proposed dual-strategy significantly improves ASR while maintaining naturalness in both text and visual domains.
zh
[CV-84] FlexMap: Generalized HD Map Construction from Flexible Camera Configurations
【速读】:该论文旨在解决当前高精地图(High-definition map, HD map)构建方法对固定多摄像头标定配置依赖性强、缺乏鲁棒性的问题,尤其在传感器故障或车辆车队中相机配置不一致时表现脆弱。其解决方案的关键在于提出FlexMap框架,通过引入一个几何感知的基础模型(geometry-aware foundation model),利用跨帧注意力机制隐式编码三维场景理解,从而无需显式的几何投影操作;同时设计了时空增强模块与相机感知解码器(camera-aware decoder),分别分离空间推理与时间动态,并借助潜在相机标记实现视图自适应注意力,避免依赖投影矩阵,从而实现对不同相机配置的自适应能力且无需重新训练。
链接: https://arxiv.org/abs/2601.22376
作者: Run Wang,Chaoyi Zhou,Amir Salarpour,Xi Liu,Zhi-Qi Cheng,Feng Luo,Mert D. Pesé,Siyu Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:High-definition (HD) maps provide essential semantic information of road structures for autonomous driving systems, yet current HD map construction methods require calibrated multi-camera setups and either implicit or explicit 2D-to-BEV transformations, making them fragile when sensors fail or camera configurations vary across vehicle fleets. We introduce FlexMap, unlike prior methods that are fixed to a specific N-camera rig, our approach adapts to variable camera configurations without any architectural changes or per-configuration retraining. Our key innovation eliminates explicit geometric projections by using a geometry-aware foundation model with cross-frame attention to implicitly encode 3D scene understanding in feature space. FlexMap features two core components: a spatial-temporal enhancement module that separates cross-view spatial reasoning from temporal dynamics, and a camera-aware decoder with latent camera tokens, enabling view-adaptive attention without the need for projection matrices. Experiments demonstrate that FlexMap outperforms existing methods across multiple configurations while maintaining robustness to missing views and sensor variations, enabling more practical real-world deployment.
zh
[CV-85] Coarse-to-Real: Generative Rendering for Populated Dynamic Scenes
【速读】:该论文旨在解决传统渲染管线在生成 populated dynamic scenes(包含人群的动态场景)时面临的可扩展性差与真实感不足的问题,这些问题通常源于对复杂资产、精确材质和光照以及大量计算资源的依赖。解决方案的关键在于提出一种名为 C2R(Coarse-to-Real)的生成式渲染框架,该框架通过粗粒度 3D 模拟显式控制场景布局、相机运动和人体轨迹,同时利用学习得到的神经渲染器在文本提示引导下生成逼真的外观、光照及细粒度动态效果;为克服粗略模拟与真实视频之间缺乏成对训练数据的问题,采用两阶段混合 CG-Real 训练策略,从大规模真实视频中学习强先验,并通过跨域共享的隐式时空特征引入可控性,从而实现从粗到精的控制、跨多种 CG 和游戏输入的泛化能力,并生成时间一致、可控且逼真的城市场景视频。
链接: https://arxiv.org/abs/2601.22301
作者: Gonzalo Gomez-Nogales,Yicong Hong,Chongjian Ge,Marc Comino-Trinidad,Dan Casas,Yi Zhou
机构: Universidad Rey Juan Carlos (胡安·卡洛斯国王大学); Adobe Research (Adobe研究院); Roblox (罗布乐思)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project website at this https URL
Abstract:Traditional rendering pipelines rely on complex assets, accurate materials and lighting, and substantial computational resources to produce realistic imagery, yet they still face challenges in scalability and realism for populated dynamic scenes. We present C2R (Coarse-to-Real), a generative rendering framework that synthesizes real-style urban crowd videos from coarse 3D simulations. Our approach uses coarse 3D renderings to explicitly control scene layout, camera motion, and human trajectories, while a learned neural renderer generates realistic appearance, lighting, and fine-scale dynamics guided by text prompts. To overcome the lack of paired training data between coarse simulations and real videos, we adopt a two-phase mixed CG-real training strategy that learns a strong generative prior from large-scale real footage and introduces controllability through shared implicit spatio-temporal features across domains. The resulting system supports coarse-to-fine control, generalizes across diverse CG and game inputs, and produces temporally consistent, controllable, and realistic urban scene videos from minimal 3D input. We will release the model and project webpage at this https URL.
zh
[CV-86] SurrogateSHAP: Training-Free Contributor Attribution for Text-to-Image (T2I) Models
【速读】:该论文旨在解决生成式 AI(Generative AI)在实际创意工作流中,如何公平地评估数据贡献者价值的问题,以实现合理补偿并构建可持续的数据市场。其核心挑战在于传统基于 Shapley 值的分配方法面临双重计算瓶颈:一是需对每个样本子集重新训练模型以估算边际贡献,二是因贡献者间交互效应导致组合爆炸式增长的子集数量。解决方案的关键在于提出 SurrogateSHAP 框架,该框架通过利用预训练模型进行推理而非重新训练来近似昂贵的“重训练博弈”,并进一步采用梯度提升树(Gradient-Boosted Tree)建模效用函数,从而可解析地推导出 Shapley 值,显著降低计算开销的同时保持高精度,实现在图像质量、美学评分及产品多样性等多个任务上的有效贡献者识别,并成功应用于临床图像中虚假关联的溯源审计。
链接: https://arxiv.org/abs/2601.22276
作者: Mingyu Lu,Soham Gadgil,Chris Lin,Chanwoo Kim,Su-In Lee
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:As Text-to-Image (T2I) diffusion models are increasingly used in real-world creative workflows, a principled framework for valuing contributors who provide a collection of data is essential for fair compensation and sustainable data marketplaces. While the Shapley value offers a theoretically grounded approach to attribution, it faces a dual computational bottleneck: (i) the prohibitive cost of exhaustive model retraining for each sampled subset of players (i.e., data contributors) and (ii) the combinatorial number of subsets needed to estimate marginal contributions due to contributor interactions. To this end, we propose SurrogateSHAP, a retraining-free framework that approximates the expensive retraining game through inference from a pretrained model. To further improve efficiency, we employ a gradient-boosted tree to approximate the utility function and derive Shapley values analytically from the tree-based model. We evaluate SurrogateSHAP across three diverse attribution tasks: (i) image quality for DDPM-CFG on CIFAR-20, (ii) aesthetics for Stable Diffusion on Post-Impressionist artworks, and (iii) product diversity for FLUX.1 on Fashion-Product data. Across settings, SurrogateSHAP outperforms prior methods while substantially reducing computational overhead, consistently identifying influential contributors across multiple utility metrics. Finally, we demonstrate that SurrogateSHAP effectively localizes data sources responsible for spurious correlations in clinical images, providing a scalable path toward auditing safety-critical generative models.
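为帮助理解 SurrogateSHAP 所近似的目标量,下面给出 Shapley 值的精确枚举计算示例。注意:贡献者名称与效用函数均为虚构的玩具设定;论文的贡献在于用预训练模型推理与梯度提升树代理替代这里需要"按子集重新训练"才能得到的效用 u(S),此处只演示博弈论定义本身。

```python
import itertools
import math

def shapley_values(players, utility):
    """Exact Shapley values: enumerate every coalition S not containing p
    and average the weighted marginal contribution utility(S + {p}) - utility(S)."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for r in range(n):
            weight = math.factorial(r) * math.factorial(n - r - 1) / math.factorial(n)
            for coalition in itertools.combinations(others, r):
                S = frozenset(coalition)
                phi[p] += weight * (utility(S | {p}) - utility(S))
    return phi

# Toy additive utility over three hypothetical data contributors:
# "a" adds 3 units of model quality, "b" adds 1, "c" adds nothing.
marginal = {"a": 3.0, "b": 1.0, "c": 0.0}
utility = lambda S: sum(marginal[p] for p in S)
phi = shapley_values(list(marginal), utility)
```

对可加效用,Shapley 值恰好等于各贡献者的边际价值;枚举代价随玩家数指数增长,这正是论文需要代理模型来规避的组合瓶颈。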
zh
[CV-87] VMonarch: Efficient Video Diffusion Transformers with Structured Attention
【速读】:该论文旨在解决视频扩散变换器(Video Diffusion Transformers, Video DiTs)中注意力机制的二次复杂度问题,该问题严重限制了模型在长视频序列上的上下文扩展能力。解决方案的关键在于提出一种名为VMonarch的新颖注意力机制,其核心是利用具有灵活稀疏结构的Monarch矩阵来高效表示视频数据中固有的稀疏时空注意力模式,并通过交替最小化算法实现次二次复杂度的计算。具体而言,VMonarch首先采用时空Monarch分解以显式建模帧内与帧间相关性,其次引入重计算策略缓解交替优化过程中因矩阵不稳定导致的伪影,最后结合FlashAttention设计了一种在线熵算法以实现长序列下Monarch矩阵的快速更新。实验表明,该方法在VBench基准上仅需少量调优即可达到或超越全注意力机制的生成质量,同时将注意力FLOPs降低17.5倍,在长视频场景中实现超过5倍的速度提升,显著优于当前主流稀疏注意力方法。
链接: https://arxiv.org/abs/2601.22275
作者: Cheng Liang,Haoxian Chen,Liang Hou,Qi Fan,Gangshan Wu,Xin Tao,Limin Wang
机构: State Key Laboratory for Novel Software Technology, Nanjing University (南京大学新型软件技术国家重点实验室); School of Intelligence Science and Technology, Nanjing University (南京大学智能科学与技术学院); Kling Team, Kuaishou Technology (快手科技Kling团队)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The quadratic complexity of the attention mechanism severely limits the context scalability of Video Diffusion Transformers (DiTs). We find that the highly sparse spatio-temporal attention patterns exhibited in Video DiTs can be naturally represented by the Monarch matrix. It is a class of structured matrices with flexible sparsity, enabling sub-quadratic attention via an alternating minimization algorithm. Accordingly, we propose VMonarch, a novel attention mechanism for Video DiTs that enables efficient computation over the dynamic sparse patterns with structured Monarch matrices. First, we adapt spatio-temporal Monarch factorization to explicitly capture the intra-frame and inter-frame correlations of the video data. Second, we introduce a recomputation strategy to mitigate artifacts arising from instabilities during alternating minimization of Monarch matrices. Third, we propose a novel online entropy algorithm fused into FlashAttention, enabling fast Monarch matrix updates for long sequences. Extensive experiments demonstrate that VMonarch achieves comparable or superior generation quality to full attention on VBench after minimal tuning. It overcomes the attention bottleneck in Video DiTs, reduces attention FLOPs by a factor of 17.5, and achieves a speedup of over 5x in attention computation for long videos, surpassing state-of-the-art sparse attention methods at 90% sparsity.
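Monarch 矩阵"块对角 + 步长置换"的结构可以用如下 numpy 草图说明:对 n = p² 维向量,乘以 Monarch 矩阵只需两次分块矩阵乘加上置换,参数量与计算量约为 O(n^1.5) 而非稠密矩阵的 O(n²)。注意这只是该类结构化矩阵的一种常见参数化(假设形式为 M = P·blockdiag(L)·P·blockdiag(R)),并非 VMonarch 的官方实现。

```python
import numpy as np

def block_diag(B):
    """Expand a (p, p, p) stack of blocks into a dense (p*p, p*p) block-diagonal matrix."""
    p = B.shape[0]
    out = np.zeros((p * p, p * p))
    for i in range(p):
        out[i * p:(i + 1) * p, i * p:(i + 1) * p] = B[i]
    return out

def monarch_matvec(L, R, x):
    """y = M @ x for M = P @ blockdiag(L) @ P @ blockdiag(R),
    with P the (p, p) stride permutation. Cost is O(n^1.5) vs O(n^2) dense."""
    p = L.shape[0]
    y = np.einsum('bij,bj->bi', R, x.reshape(p, p))  # apply blockdiag(R)
    y = np.einsum('bij,bj->bi', L, y.T)              # permute, then apply blockdiag(L)
    return y.T.reshape(-1)                           # final permutation

rng = np.random.default_rng(0)
p = 4
L = rng.normal(size=(p, p, p))
R = rng.normal(size=(p, p, p))
x = rng.normal(size=p * p)

# Dense reference for checking; the stride permutation here is an involution,
# so its permutation matrix P is symmetric.
perm = [(i % p) * p + i // p for i in range(p * p)]
P = np.eye(p * p)[perm]
M_dense = P @ block_diag(L) @ P @ block_diag(R)
```

当 p = 4 时,Monarch 参数量为 2p³ = 128,而等规模稠密矩阵需要 n² = 256 个参数,直观体现了次二次的稀疏结构。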
zh
[CV-88] Is Hierarchical Quantization Essential for Optimal Reconstruction? ICPR
【速读】:该论文旨在解决一个长期存在的假设问题:即分层向量量化变分自编码器(VQ-VAE)是否因其层级结构而天然具备优于单层模型的重建保真度。尽管现有研究表明分层模型(如VQ-VAE2)在图像重建中表现更优,但其优势是否源于层级结构本身,还是受限于代码本利用不足或表征容量不匹配等训练因素尚不明确。论文通过对比容量匹配的两层VQ-VAE与单层模型在高分辨率ImageNet上的重建性能,发现当代码本崩溃(codebook collapse)被有效缓解时,单层模型可达到与分层模型相当的重建精度。解决方案的关键在于三个轻量级干预措施:基于数据的代码本初始化、周期性重置未激活代码向量以及系统性调优代码本超参数,从而显著减少代码本崩溃并提升表示效率,进而证明层级结构并非实现高质量重建的必要条件。
链接: https://arxiv.org/abs/2601.22244
作者: Shirin Reyhanian,Laurenz Wiskott
机构: Institute for Neural Computation (INI), Faculty of Computer Science, Ruhr University Bochum (波鸿鲁尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: To appear in the Proceedings of ICPRAM 2026. Code available at : this https URL
Abstract:Vector-quantized variational autoencoders (VQ-VAEs) are central to models that rely on high reconstruction fidelity, from neural compression to generative pipelines. Hierarchical extensions, such as VQ-VAE2, are often credited with superior reconstruction performance because they split global and local features across multiple levels. However, since higher levels derive all their information from lower levels, they should not carry additional reconstructive content beyond what the lower-level already encodes. Combined with recent advances in training objectives and quantization mechanisms, this leads us to ask whether a single-level VQ-VAE, with matched representational budget and no codebook collapse, can equal the reconstruction fidelity of its hierarchical counterpart. Although the multi-scale structure of hierarchical models may improve perceptual quality in downstream tasks, the effect of hierarchy on reconstruction accuracy, isolated from codebook utilization and overall representational capacity, remains empirically underexamined. We revisit this question by comparing a two-level VQ-VAE and a capacity-matched single-level model on high-resolution ImageNet images. Consistent with prior observations, we confirm that inadequate codebook utilization limits single-level VQ-VAEs and that overly high-dimensional embeddings destabilize quantization and increase codebook collapse. We show that lightweight interventions such as initialization from data, periodic reset of inactive codebook vectors, and systematic tuning of codebook hyperparameters significantly reduce collapse. Our results demonstrate that when representational budgets are matched, and codebook collapse is mitigated, single-level VQ-VAEs can match the reconstruction fidelity of hierarchical variants, challenging the assumption that hierarchical quantization is inherently superior for high-quality reconstructions.
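摘要中"周期性重置未激活代码向量"的干预措施可以用如下假设性示例说明:统计各 code 的使用次数,把长期未被选中的条目直接替换为随机采样的编码器输出,从而缓解代码本崩溃。阈值与替换策略均为示意,论文采用的具体判据可能不同。

```python
import numpy as np

def reset_dead_codes(codebook, usage_counts, encoder_outputs, rng, min_usage=1):
    """Overwrite codebook entries selected fewer than `min_usage` times
    with randomly drawn encoder outputs, reviving collapsed codes."""
    dead = np.where(usage_counts < min_usage)[0]
    if dead.size:
        picks = rng.integers(0, len(encoder_outputs), size=dead.size)
        codebook[dead] = encoder_outputs[picks]
    return dead

rng = np.random.default_rng(0)
codebook = np.zeros((8, 4))                  # 8 codes of dim 4, all collapsed at 0
usage = np.array([10, 0, 3, 0, 0, 7, 1, 0])  # per-code selection counts this epoch
features = rng.normal(size=(100, 4))         # stand-in batch of encoder outputs
dead = reset_dead_codes(codebook, usage, features, rng)
```

重置后,死代码被拉回到编码器输出的分布附近,下一轮量化时更有机会被重新选中,这正是提升代码本利用率的直观机制。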
zh
[CV-89] Geometry without Position? When Positional Embeddings Help and Hurt Spatial Reasoning
【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)中位置嵌入(Positional Embeddings, PEs)的作用机制问题,特别是其在空间结构建模中的因果角色。传统观点认为PEs仅作为token的索引标识,但本文从几何视角出发,揭示PEs实际上充当了几何先验(geometric priors),直接影响ViT表征的空间结构。解决方案的关键在于提出了一种token-level诊断方法,用于量化多视角几何一致性如何依赖于一致的PEs,并通过14个基础ViT模型的广泛实验验证了PEs对多视角几何和空间推理能力的决定性影响,从而明确了PEs在ViT中作为因果机制调控空间结构的核心作用。
链接: https://arxiv.org/abs/2601.22231
作者: Jian Shi,Michael Birsak,Wenqing Cui,Zhenyu Li,Peter Wonka
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper revisits the role of positional embeddings (PEs) within vision transformers (ViTs) from a geometric perspective. We show that PEs are not mere token indices but effectively function as geometric priors that shape the spatial structure of the representation. We introduce token-level diagnostics that measure how multi-view geometric consistency in ViT representations depends on consistent PEs. Through extensive experiments on 14 foundation ViT models, we reveal how PEs influence multi-view geometry and spatial reasoning. Our findings clarify the role of PEs as a causal mechanism that governs spatial structure in ViT representations. Our code is provided in this https URL
zh
[CV-90] What Lies Beneath: A Call for Distribution-based Visual Question Answer Datasets
【速读】:该论文旨在解决当前视觉问答(Visual Question Answering, VQA)基准在评估大模型对科学图表理解能力方面的不足,特别是现有数据集大多基于真实世界图像或简单图示分析,且假设图表标记与底层数据之间存在一一对应关系,而现实中图表是对数据的转换(如分析、简化或修改),这种非一一对应关系引入了复杂的推理挑战。解决方案的关键在于构建一个专门面向科学图表的VQA基准,其中图表标记与底层数据之间不存在直接映射关系;作者通过基于真实数据生成合成直方图图表,并设计需要依赖底层数据才能精确回答的问题,同时提供包括图表、原始数据、分布参数及标注框在内的完整开源数据集,以推动该领域研究发展。
链接: https://arxiv.org/abs/2601.22218
作者: Jill P. Naiman,Daniel J. Evans,JooYoung Seo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)
备注: Accepted to ACM/IEEE Joint Conference on Digital Libraries JCDL 2025, 4 pages, 2 figures
Abstract:Visual Question Answering (VQA) has become an important benchmark for assessing how large multimodal models (LMMs) interpret images. However, most VQA datasets focus on real-world images or simple diagrammatic analysis, with few focused on interpreting complex scientific charts. Indeed, many VQA datasets that analyze charts do not contain the underlying data behind those charts or assume a 1-to-1 correspondence between chart marks and underlying data. In reality, charts are transformations (i.e. analysis, simplification, modification) of data. This distinction introduces a reasoning challenge in VQA that the current datasets do not capture. In this paper, we argue for a dedicated VQA benchmark for scientific charts where there is no 1-to-1 correspondence between chart marks and underlying data. To do so, we survey existing VQA datasets and highlight limitations of the current field. We then generate synthetic histogram charts based on ground truth data, and ask both humans and a large reasoning model questions where precise answers depend on access to the underlying data. We release the open-source dataset, including figures, underlying data, distribution parameters used to generate the data, and bounding boxes for all figure marks and text for future research.
zh
[CV-91] Do Open-Vocabulary Detectors Transfer to Aerial Imagery? A Comparative Evaluation
【速读】:该论文旨在解决开放词汇目标检测(Open-vocabulary object detection, OVD)在航空影像(aerial imagery)中迁移性能严重下降的问题,即现有基于视觉-语言模型的OVD方法在自然图像上表现优异,但在航空场景下缺乏有效性和鲁棒性。解决方案的关键在于通过构建首个系统性基准测试(LAE-80C数据集),严格评估五种前沿OVD模型在零样本条件下的表现,并设计三种推理模式(Global、Oracle、Single-Category)以分离语义混淆与视觉定位误差的影响。实验揭示语义混淆是主要瓶颈——当类别数量从80降至3.2时,F1分数提升15倍,表明当前方法对词汇规模敏感;同时发现提示工程策略(如领域特定前缀和同义词扩展)无法显著改善性能,且不同数据集间表现差异巨大(F1: 0.53 on DIOR vs. 0.12 on FAIR1M),凸显了航空OVD对成像条件的高度脆弱性。因此,论文强调需发展面向航空域自适应的OVD方法以突破当前局限。
链接: https://arxiv.org/abs/2601.22164
作者: Christos Tsourveloudis
机构: National Technical University of Athens (雅典国立技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注:
Abstract:Open-vocabulary object detection (OVD) enables zero-shot recognition of novel categories through vision-language models, achieving strong performance on natural images. However, transferability to aerial imagery remains unexplored. We present the first systematic benchmark evaluating five state-of-the-art OVD models on the LAE-80C aerial dataset (3,592 images, 80 categories) under strict zero-shot conditions. Our experimental protocol isolates semantic confusion from visual localization through Global, Oracle, and Single-Category inference modes. Results reveal severe domain transfer failure: the best model (OWLv2) achieves only 27.6% F1-score with a 69% false positive rate. Critically, reducing vocabulary size from 80 to 3.2 classes yields a 15x improvement, demonstrating that semantic confusion is the primary bottleneck. Prompt engineering strategies, such as domain-specific prefixing and synonym expansion, fail to provide meaningful performance gains. Performance varies dramatically across datasets (F1: 0.53 on DIOR, 0.12 on FAIR1M), exposing brittleness to imaging conditions. These findings establish baseline expectations and highlight the need for domain-adaptive approaches in aerial OVD.
zh
[CV-92] Attention Isn't All You Need for Emotion Recognition: Domain Features Outperform Transformers on the EAV Dataset
【速读】:该论文旨在解决小样本场景下多模态情绪识别(multimodal emotion recognition)中模型性能受限的问题,特别是探讨复杂注意力机制是否能提升性能。研究发现,尽管引入了因子分解注意力机制(factorized attention mechanisms)和改进的CNN基线模型,但复杂架构反而因过拟合和破坏预训练特征而表现更差;关键解决方案在于利用领域知识进行简单而有效的结构优化:例如在音频CNN中加入delta MFCC特征使准确率提升3.66个百分点,EEG信号采用频域特征提升7.62个百分点,同时通过领域特定预训练提升视觉Transformer基线至75.30%,优于原论文ViViT结果。这表明,在小数据条件下,合理的特征工程与领域适配策略比单纯堆砌复杂模型架构更为有效。
链接: https://arxiv.org/abs/2601.22161
作者: Anmol Guragain
机构: Universidad Politécnica de Madrid (马德里理工大学); E.T.S. Ingenieros de Telecomunicación (电信工程师学校)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
Abstract:We present a systematic study of multimodal emotion recognition using the EAV dataset, investigating whether complex attention mechanisms improve performance on small datasets. We implement three model categories: baseline transformers (M1), novel factorized attention mechanisms (M2), and improved CNN baselines (M3). Our experiments show that sophisticated attention mechanisms consistently underperform on small datasets. M2 models achieved 5 to 13 percentage points below baselines due to overfitting and destruction of pretrained features. In contrast, simple domain-appropriate modifications proved effective: adding delta MFCCs to the audio CNN improved accuracy from 61.9% to 65.56% (+3.66pp), while frequency-domain features for EEG achieved 67.62% (+7.62pp over the paper baseline). Our vision transformer baseline (M1) reached 75.30%, exceeding the paper’s ViViT result (74.5%) through domain-specific pretraining, and vision delta features achieved 72.68% (+1.28pp over the paper CNN). These findings demonstrate that for small-scale emotion recognition, domain knowledge and proper implementation outperform architectural complexity.
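摘要中带来 +3.66pp 提升的 delta MFCC 特征,通常用标准的回归式差分公式(HTK 风格)沿时间轴计算。下面是一个自包含的 numpy 草图,窗口宽度等参数为常见默认值,未必与该文实现完全一致。

```python
import numpy as np

def delta_features(mfcc, width=2):
    """HTK-style first-order delta coefficients over time.
    mfcc: (n_coeffs, n_frames); edges are handled by edge-padding."""
    n = mfcc.shape[1]
    padded = np.pad(mfcc, ((0, 0), (width, width)), mode='edge')
    denom = 2 * sum(w * w for w in range(1, width + 1))
    delta = np.zeros_like(mfcc, dtype=float)
    for w in range(1, width + 1):
        # weighted difference between frames w steps ahead and behind
        delta += w * (padded[:, width + w:width + w + n]
                      - padded[:, width - w:width - w + n])
    return delta / denom

ramp = np.tile(np.arange(10.0), (3, 1))   # 3 coefficients rising by 1 per frame
d_ramp = delta_features(ramp)
d_const = delta_features(np.ones((3, 10)))
```

对线性上升的系数,内部帧的 delta 恰为每帧的斜率;对常数信号则为 0,符合差分特征刻画"变化率"的直觉。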
zh
[CV-93] Denoising the Deep Sky: Physics-Based CCD Noise Formation for Astronomical Imaging
【速读】:该论文旨在解决天文成像中由电荷耦合器件(Charge-Coupled Device, CCD)噪声导致的图像质量受限问题,尤其针对现有校准流程无法有效去除随机噪声(stochastic noise)的局限性。其解决方案的关键在于提出了一种基于物理机制的噪声合成框架,能够精确建模光子散粒噪声、光电响应非均匀性、暗电流噪声、读出效应以及宇宙射线撞击和热像素引发的局部异常点等多源噪声成分;同时通过多帧未配准曝光平均生成高信噪比(high-SNR)基础图像作为输入,从而合成逼真的带噪图像,构建大规模成对训练数据集,支持监督学习方法在真实天文场景下的应用与评估。
链接: https://arxiv.org/abs/2601.23276
作者: Shuhong Liu,Xining Ge,Ziying Gu,Lin Gu,Ziteng Cui,Xuangeng Chu,Jun Liu,Dong Li,Tatsuya Harada
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Astronomical imaging remains noise-limited under practical observing constraints, while standard calibration pipelines mainly remove structured artifacts and leave stochastic noise largely unresolved. Learning-based denoising is promising, yet progress is hindered by scarce paired training data and the need for physically interpretable and reproducible models in scientific workflows. We propose a physics-based noise synthesis framework tailored to CCD noise formation. The pipeline models photon shot noise, photo-response non-uniformity, dark-current noise, readout effects, and localized outliers arising from cosmic-ray hits and hot pixels. To obtain low-noise inputs for synthesis, we average multiple unregistered exposures to produce high-SNR bases. Realistic noisy counterparts synthesized from these bases using our noise model enable the construction of abundant paired datasets for supervised learning. We further introduce a real-world dataset across multi-bands acquired with two twin ground-based telescopes, providing paired raw frames and instrument-pipeline calibrated frames, together with calibration data and stacked high-SNR bases for real-world evaluation.
zh
[CV-94] Scale-Cascaded Diffusion Models for Super-Resolution in Medical Imaging
【速读】:该论文旨在解决医学图像超分辨率(Super-Resolution, SR)中传统扩散模型(Diffusion Models)仅使用单一尺度训练先验所导致的感知质量不足与计算效率低的问题。其关键解决方案是将图像分解为拉普拉斯金字塔(Laplacian Pyramid)的不同频率带,并为每个频带独立训练扩散先验,随后设计一种多尺度渐进式重构算法,利用这些先验在不同尺度上逐步优化重建结果。该方法不仅提升了图像的感知质量,还通过使用更小的粗尺度网络显著降低了推理时间,从而实现了多尺度重建与扩散先验的有效统一。
链接: https://arxiv.org/abs/2601.23201
作者: Darshan Thaker,Mahmoud Mostapha,Radu Miron,Shihan Qiu,Mariappan Nadar
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at IEEE International Symposium for Biomedical Imaging (ISBI) 2026
Abstract:Diffusion models have been increasingly used as strong generative priors for solving inverse problems such as super-resolution in medical imaging. However, these approaches typically utilize a diffusion prior trained at a single scale, ignoring the hierarchical scale structure of image data. In this work, we propose to decompose images into Laplacian pyramid scales and train separate diffusion priors for each frequency band. We then develop an algorithm to perform super-resolution that utilizes these priors to progressively refine reconstructions across different scales. Evaluated on brain, knee, and prostate MRI data, our approach both improves perceptual quality over baselines and reduces inference time through smaller coarse-scale networks. Our framework unifies multiscale reconstruction and diffusion priors for medical image super-resolution.
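该方法的第一步是把图像分解为拉普拉斯金字塔的各频带。下面用 numpy 给出一个最小示意,用 2×2 盒式滤波代替通常的高斯核,仅为说明"分解为频带残差 + 粗尺度基底"以及完美重建的原理,并非论文实现:

```python
import numpy as np

def downsample(img):
    """2x box-filter downsample (a simple stand-in for a Gaussian kernel)."""
    return 0.25 * (img[::2, ::2] + img[1::2, ::2] + img[::2, 1::2] + img[1::2, 1::2])

def upsample(img):
    """Nearest-neighbour 2x upsample."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(img, levels=3):
    """Split img into high-frequency band-pass residuals plus a coarse base."""
    bands, cur = [], img
    for _ in range(levels - 1):
        small = downsample(cur)
        bands.append(cur - upsample(small))  # residual at this scale
        cur = small
    bands.append(cur)                        # low-frequency base
    return bands

def reconstruct(bands):
    """Invert the decomposition: upsample the base and add back each residual."""
    cur = bands[-1]
    for band in reversed(bands[:-1]):
        cur = upsample(cur) + band
    return cur

rng = np.random.default_rng(0)
img = rng.normal(size=(16, 16))
bands = laplacian_pyramid(img)
```

由于每一级都保存了下采样造成的残差,重建是精确可逆的;论文正是在这样的各个频带上分别训练扩散先验,再由粗到细地逐级重构。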
zh
[CV-95] Vision-Language Controlled Deep Unfolding for Joint Medical Image Restoration and Segmentation
【速读】:该论文旨在解决医学图像复原与分割任务在传统流水线中孤立处理所导致的次优性能问题,即低层信号恢复与高层语义理解之间缺乏协同优化。其核心解决方案在于提出VL-DUN框架,通过两个关键创新实现联合优化:一是将复原与分割建模为统一的优化问题,推导出可解释的联合展开机制,使两者数学耦合并相互精炼;二是引入频域感知的Mamba机制,以线性复杂度捕捉长程依赖关系用于全局分割,同时保留高频纹理信息以保障复原质量,从而有效缓解标准架构的谱偏差问题。
链接: https://arxiv.org/abs/2601.23103
作者: Ping Chen,Zicheng Huang,Xiangming Wang,Yungeng Liu,Bingyu Liang,Haijin Zeng,Yongyong Chen
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, medical image
Abstract:We propose VL-DUN, a principled framework for joint All-in-One Medical Image Restoration and Segmentation (AiOMIRS) that bridges the gap between low-level signal recovery and high-level semantic understanding. While standard pipelines treat these tasks in isolation, our core insight is that they are fundamentally synergistic: restoration provides clean anatomical structures to improve segmentation, while semantic priors regularize the restoration process. VL-DUN resolves the sub-optimality of sequential processing through two primary innovations. (1) We formulate AiOMIRS as a unified optimization problem, deriving an interpretable joint unfolding mechanism where restoration and segmentation are mathematically coupled for mutual refinement. (2) We introduce a frequency-aware Mamba mechanism to capture long-range dependencies for global segmentation while preserving the high-frequency textures necessary for restoration. This allows for efficient global context modeling with linear complexity, effectively mitigating the spectral bias of standard architectures. As a pioneering work in the AiOMIRS task, VL-DUN establishes a new state-of-the-art across multi-modal benchmarks, improving PSNR by 0.92 dB and the Dice coefficient by 9.76%. Our results demonstrate that joint collaborative learning offers a superior, more robust solution for complex clinical workflows compared to isolated task processing. The codes are provided in this https URL.
zh
[CV-96] Scale Equivariance Regularization and Feature Lifting in High Dynamic Range Modulo Imaging
【速读】:该论文旨在解决模数成像(modulo imaging)中高动态范围(HDR)重建难题,即如何准确区分自然图像边缘与由饱和强度周期性包裹(wrap discontinuities)引入的伪影。解决方案的关键在于提出一种基于学习的HDR恢复框架,其核心创新包括:(i) 一种尺度等变正则化策略,强制模型在曝光变化下保持一致性;(ii) 一种特征提升输入设计,融合原始模数图像、包裹有限差分和闭式初始化,从而显著增强网络对真实结构与包裹伪影的区分能力,在感知和线性HDR质量指标上均达到当前最优性能。
链接: https://arxiv.org/abs/2601.23037
作者: Brayan Monroy,Jorge Bacca
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Modulo imaging enables high dynamic range (HDR) acquisition by cyclically wrapping saturated intensities, but accurate reconstruction remains challenging due to ambiguities between natural image edges and artificial wrap discontinuities. This work proposes a learning-based HDR restoration framework that incorporates two key strategies: (i) a scale-equivariant regularization that enforces consistency under exposure variations, and (ii) a feature lifting input design combining the raw modulo image, wrapped finite differences, and a closed-form initialization. Together, these components enhance the network’s ability to distinguish true structure from wrapping artifacts, yielding state-of-the-art performance across perceptual and linear HDR quality metrics.
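模数成像的包裹过程以及摘要中提到的"闭式初始化"思想可以在一维信号上直观演示:对包裹后信号做有限差分,再把差分包裹回 [-m/2, m/2),在相邻采样真实跳变小于半个周期的假设下积分即可恢复原信号。以下为示意性 numpy 草图,非论文的二维实现:

```python
import numpy as np

def modulo_capture(signal, m=1.0):
    """Simulate a modulo sensor: intensities wrap cyclically at threshold m."""
    return np.mod(signal, m)

def unwrap_1d(wrapped, m=1.0):
    """Closed-form 1D initialization: wrap the finite differences back into
    [-m/2, m/2), then integrate. Valid when true neighbour jumps stay below m/2."""
    d = np.mod(np.diff(wrapped) + m / 2, m) - m / 2
    return wrapped[0] + np.concatenate(([0.0], np.cumsum(d)))

x = np.linspace(0.0, 3.0, 50)   # smooth HDR ramp crossing the wrap threshold
y = modulo_capture(x)           # sensor only sees values in [0, 1)
x_hat = unwrap_1d(y)            # closed-form recovery of the HDR signal
```

真实图像中自然边缘的跳变可能超过 m/2,此时闭式解会失效,这正是论文需要学习模型来区分真实结构与包裹伪影的原因。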
zh
[CV-97] Development of Domain-Invariant Visual Enhancement and Restoration (DIVER) Approach for Underwater Images
【速读】:该论文旨在解决水下图像因波长依赖性衰减、散射及光照不均匀性导致的严重退化问题,这些问题在不同水体类型和深度下表现各异,使得现有增强方法泛化能力不足。解决方案的关键在于提出一种无监督的域不变视觉增强与恢复框架(DIVER),其核心是融合经验校正与物理引导建模:首先通过IlluminateNet或光谱均衡滤波器进行亮度自适应增强或光谱归一化;随后利用自适应光学校正模块进行色相与对比度优化,并由Hydro-OpticNet基于物理约束学习补偿后向散射和波长相关衰减;整个系统通过复合损失函数实现无监督参数优化,从而在浅水、深水、高浑浊度等多样化场景中均表现出卓越且一致的性能,显著优于当前主流方法如WaterNet、UDNet和Phaseformer。
链接: https://arxiv.org/abs/2601.22878
作者: Rajini Makam,Sharanya Patil,Dhatri Shankari T M,Suresh Sundaram,Narasimhan Sundararajan
机构: Indian Institute of Science (印度科学研究所); Nanyang Technological University (南洋理工大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE Journal of Oceanic Engineering
Abstract:Underwater images suffer severe degradation due to wavelength-dependent attenuation, scattering, and illumination non-uniformity that vary across water types and depths. We propose an unsupervised Domain-Invariant Visual Enhancement and Restoration (DIVER) framework that integrates empirical correction with physics-guided modeling for robust underwater image enhancement. DIVER first applies either IlluminateNet for adaptive luminance enhancement or a Spectral Equalization Filter for spectral normalization. An Adaptive Optical Correction Module then refines hue and contrast using channel-adaptive filtering, while Hydro-OpticNet employs physics-constrained learning to compensate for backscatter and wavelength-dependent attenuation. The parameters of IlluminateNet and Hydro-OpticNet are optimized via unsupervised learning using a composite loss function. DIVER is evaluated on eight diverse datasets covering shallow, deep, and highly turbid environments, including both naturally low-light and artificially illuminated scenes, using reference and non-reference metrics. While state-of-the-art methods such as WaterNet, UDNet, and Phaseformer perform reasonably in shallow water, their performance degrades in deep, unevenly illuminated, or artificially lit conditions. In contrast, DIVER consistently achieves best or near-best performance across all datasets, demonstrating strong domain-invariant capability. DIVER yields at least a 9% improvement over SOTA methods in UCIQE. On the low-light SeaThru dataset, where color-palette references enable direct evaluation of color restoration, DIVER achieves at least a 4.9% reduction in GPMAE compared to existing methods. Beyond visual quality, DIVER also improves robotic perception by enhancing ORB-based keypoint repeatability and matching performance, confirming its robustness across diverse underwater environments.
zh
[CV-98] Active Learning-Driven Lightweight YOLOv9: Enhancing Efficiency in Smart Agriculture
【速读】:该论文旨在解决农业机器人在温室环境下部署于边缘设备时,对番茄和番茄花进行实时检测所面临的挑战,包括因相机距离变化导致的目标尺度差异大、植物结构引起的严重遮挡以及类别分布极度不均衡等问题。这些问题使得依赖全标注数据集的传统目标检测方法难以同时实现高检测精度与高效的模型部署。解决方案的关键在于提出一种由主动学习驱动的轻量化目标检测框架,其核心包括:(1)通过分析原始图像中目标尺寸分布重新定义操作目标范围,提升实际场景下的学习稳定性;(2)引入高效特征提取模块与轻量级注意力机制,在降低计算成本的同时增强多尺度及遮挡条件下的特征表达能力;(3)采用主动学习策略,在有限标注预算下迭代选择高信息量样本用于标注与训练,显著提升少数类和小目标的识别性能。该方案在保持低参数量和推理开销的前提下,实现了67.8% mAP的总体检测准确率,验证了其在智能农业应用中的实用性与可行性。
链接: https://arxiv.org/abs/2601.22732
作者: Hung-Chih Tu,Bo-Syun Chen,Yun-Chien Cheng
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This study addresses the demand for real-time detection of tomatoes and tomato flowers by agricultural robots deployed on edge devices in greenhouse environments. Under practical imaging conditions, object detection systems often face challenges such as large scale variations caused by varying camera distances, severe occlusion from plant structures, and highly imbalanced class distributions. These factors make conventional object detection approaches that rely on fully annotated datasets difficult to simultaneously achieve high detection accuracy and deployment efficiency. To overcome these limitations, this research proposes an active learning driven lightweight object detection framework, integrating data analysis, model design, and training strategy. First, the size distribution of objects in raw agricultural images is analyzed to redefine an operational target range, thereby improving learning stability under real-world conditions. Second, an efficient feature extraction module is incorporated to reduce computational cost, while a lightweight attention mechanism is introduced to enhance feature representation under multi-scale and occluded scenarios. Finally, an active learning strategy is employed to iteratively select high-information samples for annotation and training under a limited labeling budget, effectively improving the recognition performance of minority and small-object categories. Experimental results demonstrate that, while maintaining a low parameter count and inference cost suitable for edge-device deployment, the proposed method effectively improves the detection performance of tomatoes and tomato flowers in raw images. Under limited annotation conditions, the framework achieves an overall detection accuracy of 67.8% mAP, validating its practicality and feasibility for intelligent agricultural applications.
zh
[CV-99] raining Beyond Convergence: Grokking nnU-Net for Glioma Segmentation in Sub-Saharan MRI
【速读】:该论文旨在解决撒哈拉以南非洲地区胶质瘤(Glioma)患者生存期短、诊断影像资源极度匮乏的问题,提出基于本地数据训练的自动化分割工具以最大化利用有限扫描信息。其关键解决方案是采用Brain Tumor Segmentation (BraTS) Africa 2025 Challenge数据集,在受限计算资源下对比两种训练策略:一是快速、低预算的短周期优化(仅数个epoch),验证nnUNet在资源受限场景下的性能表现;二是延长训练至收敛后继续迭代,探索“grokking”现象——即模型从记忆模式向卓越泛化能力的突变式跃迁,从而在不增加标注样本的前提下显著提升分割精度,Dice分数分别达到WH 92.2%、TC 90.1%、ET 90.2%,验证了该机制在本地化医疗AI中的潜力。
链接: https://arxiv.org/abs/2601.22637
作者: Mohtady Barakat,Omar Salah,Ahmed Yasser,Mostafa Ahmed,Zahirul Arief,Waleed Khan,Dong Zhang,Aondona Iorumbur,Confidence Raymond,Mohannad Barakat,Noha Magdy
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Gliomas place an increasing clinical burden on Sub-Saharan Africa (SSA). In the region, the median survival for patients remains under two years, and access to diagnostic imaging is extremely limited. These constraints highlight an urgent need for automated tools that can extract the maximum possible information from each available scan, tools that are specifically trained on local data, rather than adapted from high-income settings where conditions are vastly different. We utilize the Brain Tumor Segmentation (BraTS) Africa 2025 Challenge dataset, an expert-annotated collection of glioma MRIs. Our objectives are: (i) establish a strong baseline with nnUNet on this dataset, and (ii) explore whether the celebrated “grokking” phenomenon, an abrupt, late-training jump from memorization to superior generalization, can be triggered to push performance without extra labels. We evaluate two training regimes. The first is a fast, budget-conscious approach that limits optimization to just a few epochs, reflecting the constrained GPU resources typically available in African institutions. Despite this limitation, nnUNet achieves strong Dice scores: 92.3% for whole tumor (WH), 86.6% for tumor core (TC), and 86.3% for enhancing tumor (ET). The second regime extends training well beyond the point of convergence, aiming to trigger a grokking-driven performance leap. With this approach, we were able to trigger grokking and improve our results to higher Dice scores: 92.2% for whole tumor (WH), 90.1% for tumor core (TC), and 90.2% for enhancing tumor (ET).
zh
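上文以 Dice 分数(WH/TC/ET)衡量分割质量。下面用一个极简的 NumPy 草图说明二值掩码 Dice 系数的通用计算方式(仅为定义层面的示意,并非论文的官方实现):

```python
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """二值分割的 Dice 系数: 2|A∩B| / (|A| + |B|)。"""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return float(2.0 * inter / (pred.sum() + target.sum() + eps))

# 一个 4 像素的小例子:3 个像素重叠
pred = np.array([1, 1, 1, 0])
target = np.array([1, 1, 1, 1])
print(round(dice_score(pred, target), 4))  # 2*3/(3+4) ≈ 0.8571
```

对多子区域分割,通常对整瘤、瘤核、增强区分别计算 Dice 后各自汇报。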
[CV-100] Bonnet: Ultra-fast whole-body bone segmentation from CT scans
【速读】:该论文旨在解决全身体部CT扫描中骨骼分割的计算效率问题,现有基于3D体素的模型(如nnU-Net和STU-Net)虽然精度较高,但推理时间通常需数分钟,难以满足临床手术规划等对时效性要求高的场景。其解决方案的关键在于提出Bonnet——一个超快速稀疏体积流水线,通过引入基于HU值的骨阈值分割、基于稀疏spconv的U-Net进行分块推理以及多窗融合策略,实现全体积预测;该方法在保持与主流 voxel 基线相当精度的同时,将单次推理时间缩短至2.69秒(RTX A6000),提速约25倍。
链接: https://arxiv.org/abs/2601.22576
作者: Hanjiang Zhu,Pedro Martelleto Rezende,Zhang Yang,Tong Ye,Bruce Z. Gao,Feng Luo,Siyu Huang,Jiancheng Yang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 2 figures. Accepted for publication at the 2026 IEEE International Symposium on Biomedical Imaging (ISBI 2026)
Abstract:This work proposes Bonnet, an ultra-fast sparse-volume pipeline for whole-body bone segmentation from CT scans. Accurate bone segmentation is important for surgical planning and anatomical analysis, but existing 3D voxel-based models such as nnU-Net and STU-Net require heavy computation and often take several minutes per scan, which limits time-critical use. The proposed Bonnet addresses this by integrating a series of novel framework components including HU-based bone thresholding, patch-wise inference with a sparse spconv-based U-Net, and multi-window fusion into a full-volume prediction. Trained on TotalSegmentator and evaluated without additional tuning on RibSeg, CT-Pelvic1K, and CT-Spine1K, Bonnet achieves high Dice across ribs, pelvis, and spine while running in only 2.69 seconds per scan on an RTX A6000. Compared to strong voxel baselines, Bonnet attains a similar accuracy but reduces inference time by roughly 25x on the same hardware and tiling setup. The toolkit and pre-trained models will be released at this https URL.
zh
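Bonnet 的第一步是按 HU 值阈值筛出候选骨体素,再把稀疏体素送入基于 spconv 的 U-Net 做分块推理。下面用 NumPy 示意"阈值稀疏化"这一步(阈值 150 HU 为笔者的示意性假设,摘要未给出具体数值):

```python
import numpy as np

def bone_candidates(ct_hu: np.ndarray, hu_thresh: float = 150.0):
    """按 HU 阈值筛选候选骨体素,返回稀疏坐标与对应体素值。
    阈值 150 HU 仅为示意性假设,论文未在摘要中给出具体数值。"""
    mask = ct_hu > hu_thresh
    coords = np.argwhere(mask)   # (N, 3) 稀疏坐标
    feats = ct_hu[mask]          # (N,) 体素值
    return coords, feats

# 玩具体积:中心一个"高密度"体素
vol = np.zeros((4, 4, 4), dtype=np.float32)
vol[2, 2, 2] = 1200.0  # 皮质骨量级的 HU 值
coords, feats = bone_candidates(vol)
print(coords.shape, feats)  # (1, 3) [1200.]
```

稀疏表示只保留过阈值体素的坐标与特征,这正是稀疏卷积网络相对稠密 3D 卷积节省计算的来源。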
[CV-101] EndoCaver: Handling Fog Blur and Glare in Endoscopic Images via Joint Deblurring-Segmentation ICASSP
【速读】:该论文旨在解决内窥镜图像分析中因镜头起雾、运动模糊和高光等现实条件导致的自动息肉检测性能下降问题。解决方案的关键在于提出一种轻量级Transformer架构EndoCaver,其核心创新包括:(1) 采用单向引导的双解码器结构,实现图像去模糊与分割任务的联合优化;(2) 引入全局注意力模块(Global Attention Module, GAM)以增强跨尺度特征聚合能力;(3) 设计去模糊-分割对齐器(Deblurring-Segmentation Aligner, DSA)用于传递恢复线索;(4) 使用基于余弦的调度策略(LoCoS)实现多任务训练的稳定优化。该方法在保持高精度的同时,显著降低模型参数(减少90%),提升了临床端侧部署的可行性。
链接: https://arxiv.org/abs/2601.22537
作者: Zhuoyu Wu,Wenhui Ou,Pei-Sze Tan,Jiayan Yang,Wenqi Fang,Zheng Wang,Raphaël C.-W. Phan
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2026
Abstract:Endoscopic image analysis is vital for colorectal cancer screening, yet real-world conditions often suffer from lens fogging, motion blur, and specular highlights, which severely compromise automated polyp detection. We propose EndoCaver, a lightweight transformer with a unidirectional-guided dual-decoder architecture, enabling joint multi-task capability for image deblurring and segmentation while significantly reducing computational complexity and model parameters. Specifically, it integrates a Global Attention Module (GAM) for cross-scale aggregation, a Deblurring-Segmentation Aligner (DSA) to transfer restoration cues, and a cosine-based scheduler (LoCoS) for stable multi-task optimisation. Experiments on the Kvasir-SEG dataset show that EndoCaver achieves 0.922 Dice on clean data and 0.889 under severe image degradation, surpassing state-of-the-art methods while reducing model parameters by 90%. These results demonstrate its efficiency and robustness, making it well-suited for on-device clinical deployment. Code is available at this https URL.
zh
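摘要提到用余弦式调度器(LoCoS)稳定多任务优化。下面给出一种常见的余弦退火式任务权重写法作为参考;具体函数形式为笔者的示意性假设,并非 LoCoS 的官方定义:

```python
import math

def cosine_task_weight(step: int, total_steps: int) -> float:
    """余弦式多任务权重:从 1 平滑退火到 0。
    仅为对"cosine-based scheduler"的示意性猜测,非论文 LoCoS 原式。"""
    t = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * t))

def joint_loss(l_deblur: float, l_seg: float, step: int, total: int) -> float:
    """早期偏重去模糊损失,后期偏重分割损失。"""
    w = cosine_task_weight(step, total)
    return w * l_deblur + (1.0 - w) * l_seg

print(cosine_task_weight(0, 100), cosine_task_weight(100, 100))  # 1.0 0.0
```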
[CV-102] A Survey on Semantic Communication for Vision: Categories Frameworks Enabling Techniques and Applications
【速读】:该论文旨在解决视觉数据传输中因流量密集导致的通信资源压力问题,通过引入语义通信(Semantic Communication, SemCom)范式,将传输重点从原始数据转向有意义的内容表达。其核心挑战包括:视觉数据的精确语义量化、多样化任务下的鲁棒语义提取与重建、收发端协同及知识有效利用,以及对不可预测无线环境的适应性。解决方案的关键在于提出一种融合计算机视觉(Computer Vision, CV)与通信工程的跨学科分析框架,并基于语义量化方案将现有SemCom-Vision方法划分为语义保真通信(Semantic Preservation Communication, SPC)、语义扩展通信(Semantic Expansion Communication, SEC)和语义精炼通信(Semantic Refinement Communication, SRC)三类,进而为每类设计机器学习驱动的编码器-解码器模型、训练算法及知识结构与利用策略,从而系统性地支撑高效、灵活且适应性强的视觉语义通信体系构建。
链接: https://arxiv.org/abs/2601.22202
作者: Runze Cheng,Yao Sun,Ahmad Taha,Xuesong Liu,David Flynn,Muhammad Ali Imran
机构: University of Glasgow (格拉斯哥大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Semantic communication (SemCom) emerges as a transformative paradigm for traffic-intensive visual data transmission, shifting focus from raw data to meaningful content transmission and relieving the increasing pressure on communication resources. However, to achieve SemCom, challenges are faced in accurate semantic quantization for visual data, robust semantic extraction and reconstruction under diverse tasks and goals, transceiver coordination with effective knowledge utilization, and adaptation to unpredictable wireless communication environments. In this paper, we present a systematic review of SemCom for visual data transmission (SemCom-Vision), wherein an interdisciplinary analysis integrating computer vision (CV) and communication engineering is conducted to provide comprehensive guidelines for the machine learning (ML)-empowered SemCom-Vision design. Specifically, this survey first elucidates the basics and key concepts of SemCom. Then, we introduce a novel classification perspective to categorize existing SemCom-Vision approaches as semantic preservation communication (SPC), semantic expansion communication (SEC), and semantic refinement communication (SRC) based on communication goals interpreted through semantic quantization schemes. Moreover, this survey articulates the ML-based encoder-decoder models and training algorithms for each SemCom-Vision category, followed by knowledge structure and utilization strategies. Finally, we discuss potential SemCom-Vision applications.
zh
[CV-103] SCENE: Semantic-aware Codec Enhancement with Neural Embeddings ICASSP2026
【速读】:该论文旨在解决标准视频编码器(video codec)产生的压缩伪影对感知质量的损害问题。解决方案的关键在于提出了一种轻量级、语义感知的预处理框架SCENE,其核心是将视觉-语言模型(vision-language model)提取的语义嵌入(semantic embeddings)融入高效的卷积架构中,从而优先保护感知显著结构;同时通过可微分的编码器代理(codec proxy)端到端训练,使模型能适配多种标准编码器且无需改动现有视频流水线,在推理阶段丢弃代理后即可作为独立预处理器实现实时性能。
链接: https://arxiv.org/abs/2601.22189
作者: Han-Yu Lin,Li-Wei Chen,Hung-Shin Lee
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: Accepted to ICASSP 2026
Abstract:Compression artifacts from standard video codecs often degrade perceptual quality. We propose SCENE, a lightweight, semantic-aware pre-processing framework that enhances perceptual fidelity by selectively addressing these distortions. Our method integrates semantic embeddings from a vision-language model into an efficient convolutional architecture, prioritizing the preservation of perceptually significant structures. The model is trained end-to-end with a differentiable codec proxy, enabling it to mitigate artifacts from various standard codecs without modifying the existing video pipeline. During inference, the codec proxy is discarded, and SCENE operates as a standalone pre-processor, enabling real-time performance. Experiments on high-resolution benchmarks show improved performance over baselines in both objective (MS-SSIM) and perceptual (VMAF) metrics, with notable gains in preserving detailed textures within salient regions. Our results show that semantic-guided, codec-aware pre-processing is an effective approach for enhancing compressed video streams.
zh
[CV-104] Deep Lightweight Unrolled Network for High Dynamic Range Modulo Imaging
【速读】:该论文旨在解决模数成像(Modulo-Imaging, MI)中高动态范围(HDR)图像重建的非凸且病态问题,尤其是在高噪声场景下现有恢复网络性能下降的问题。解决方案的关键在于将HDR重建任务建模为一个融合深度先验(deep prior)的优化问题,并将其展开(unroll)为一种优化启发式的深度神经网络结构;该网络采用轻量级卷积去噪器以实现快速推理并最小化计算开销,在有效恢复强度值的同时抑制噪声;此外,引入尺度等变性项(Scaling Equivariance term)以支持自监督微调,使模型能够适应超出原始训练分布的新模数图像,从而提升泛化能力与重建质量。
链接: https://arxiv.org/abs/2601.12526
作者: Brayan Monroy,Jorge Bacca
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Modulo-Imaging (MI) offers a promising alternative for expanding the dynamic range of images by resetting the signal intensity when it reaches the saturation level. Subsequently, high-dynamic range (HDR) modulo imaging requires a recovery process to obtain the HDR image. MI is a non-convex and ill-posed problem where recent recovery networks suffer in high-noise scenarios. In this work, we formulate the HDR reconstruction task as an optimization problem that incorporates a deep prior and subsequently unrolls it into an optimization-inspired deep neural network. The network employs a lightweight convolutional denoiser for fast inference with minimal computational overhead, effectively recovering intensity values while mitigating noise. Moreover, we introduce the Scaling Equivariance term that facilitates self-supervised fine-tuning, thereby enabling the model to adapt to new modulo images that fall outside the original training distribution. Extensive evaluations demonstrate the superiority of our method compared to state-of-the-art recovery algorithms in terms of performance and quality.
zh
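模数成像的前向模型就是对饱和值取模。下面的 NumPy 草图同时演示了为什么朴素展开是病态的:一步之内回绕多圈时,仅凭"相邻样本下降"的启发式无法恢复,这正是论文引入深度先验的动机之一:

```python
import numpy as np

def modulo_sample(x: np.ndarray, vmax: float) -> np.ndarray:
    """前向模型:强度到达饱和值 vmax 即回绕(取模)。"""
    return np.mod(x, vmax)

def naive_unwrap_1d(y: np.ndarray, vmax: float) -> np.ndarray:
    """朴素展开:检测到相邻样本下降即认为发生一次回绕,累加 vmax。
    一步内回绕多圈时会漏检,噪声下也很脆弱。"""
    wraps = np.cumsum(np.diff(y, prepend=y[0]) < 0)
    return y + wraps * vmax

hdr = np.array([0.2, 0.8, 1.3, 2.5])       # 真实 HDR 信号
wrapped = modulo_sample(hdr, 1.0)          # [0.2 0.8 0.3 0.5]
recovered = naive_unwrap_1d(wrapped, 1.0)  # [0.2 0.8 1.3 1.5]
# 末样本 2.5 相对 vmax=1 已回绕两圈,朴素法只数到一圈,恢复失败
```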
人工智能
[AI-0] End-to-end Optimization of Belief and Policy Learning in Shared Autonomy Paradigms
【速读】:该论文旨在解决共享自主系统中用户意图推理与辅助水平确定的协同优化问题,这是人机交互中的核心挑战,尤其在非结构化环境中需兼顾任务成功率与用户自主权。传统方法依赖静态混合比例或分离目标推断与辅助决策步骤,导致性能受限。其解决方案的关键在于提出BRACE(Bayesian Reinforcement Assistance with Context Encoding)框架,通过端到端梯度流架构将贝叶斯意图推理与上下文自适应辅助决策耦合,使控制策略同时依赖环境上下文和完整的目标概率分布;理论分析表明最优辅助强度应随目标不确定性降低、环境约束强度增加而调整,并且融合信念信息的策略学习相比顺序方法可实现二次期望遗憾优势。实验验证显示,该方法在三类渐进复杂场景下均显著优于当前最优方法(IDA、DQN),成功率达6.3%提升、路径效率提高41%,且在无辅助控制基础上分别提升36.3%成功率和87%路径效率,证明了集成优化在高目标模糊性场景中的优越性及跨机器人平台的泛化能力。
链接: https://arxiv.org/abs/2601.23285
作者: MH Farhadi,Ali Rabiee,Sima Ghafoori,Anna Cetera,Andrew Fisher,Reza Abiri
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:Shared autonomy systems require principled methods for inferring user intent and determining appropriate assistance levels. This is a central challenge in human-robot interaction, where systems must be successful while being mindful of user agency. Previous approaches relied on static blending ratios or separated goal inference from assistance arbitration, leading to suboptimal performance in unstructured environments. We introduce BRACE (Bayesian Reinforcement Assistance with Context Encoding), a novel framework that fine-tunes Bayesian intent inference and context-adaptive assistance through an architecture enabling end-to-end gradient flow between intent inference and assistance arbitration. Our pipeline conditions collaborative control policies on environmental context and complete goal probability distributions. We provide analysis showing (1) optimal assistance levels should decrease with goal uncertainty and increase with environmental constraint severity, and (2) integrating belief information into policy learning yields a quadratic expected regret advantage over sequential approaches. We validated our algorithm against SOTA methods (IDA, DQN) using a three-part evaluation progressively isolating distinct challenges of end-effector control: (1) core human-interaction dynamics in a 2D human-in-the-loop cursor task, (2) non-linear dynamics of a robotic arm, and (3) integrated manipulation under goal ambiguity and environmental constraints. We demonstrate improvements over SOTA, achieving 6.3% higher success rates and 41% increased path efficiency, and 36.3% success rate and 87% path efficiency improvement over unassisted control. Our results confirmed that integrated optimization is most beneficial in complex, goal-ambiguous scenarios, and is generalizable across robotic domains requiring goal-directed assistance, advancing the SOTA for adaptive shared autonomy.
zh
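BRACE 的理论结论之一是"最优辅助强度应随目标不确定性上升而下降"。下面用 NumPy 给出贝叶斯目标信念更新,以及一个按归一化熵缩放辅助强度的示意函数(熵缩放的具体形式为笔者假设,并非论文原式):

```python
import numpy as np

def belief_update(prior: np.ndarray, likelihood: np.ndarray) -> np.ndarray:
    """对候选目标的贝叶斯信念更新:posterior ∝ likelihood * prior。"""
    post = prior * likelihood
    return post / post.sum()

def assistance_level(belief: np.ndarray, alpha_max: float = 1.0) -> float:
    """辅助强度随目标不确定性(归一化熵)降低而提高,
    对应文中"最优辅助水平随目标不确定性下降"的结论;
    具体函数形式为示意性假设。"""
    p = belief[belief > 0]
    entropy = -(p * np.log(p)).sum() / np.log(len(belief))
    return alpha_max * (1.0 - entropy)

b = np.array([0.5, 0.5])                        # 两个目标,完全不确定
print(round(assistance_level(b), 3))            # ≈ 0.0:不施加辅助
b = belief_update(b, np.array([0.9, 0.1]))      # 观测到偏向目标 0 的动作
print(np.round(b, 3), round(assistance_level(b), 3))  # [0.9 0.1] 0.531
```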
[AI-1] IRL-DAL: Safe and Adaptive Trajectory Planning for Autonomous Driving via Energy-Guided Diffusion Models
【速读】:该论文旨在解决自动驾驶车辆在复杂环境中实现安全、稳定且接近专家水平的决策与控制问题,尤其是在应对动态障碍物和非结构化场景时的鲁棒性不足。解决方案的关键在于提出一种基于扩散模型的自适应前瞻规划器(Diffusion-based Adaptive Lookahead Planner, DAL)与逆强化学习(Inverse Reinforcement Learning, IRL)相结合的框架——IRL-DAL。其核心创新包括:首先通过专家有限状态机(Finite State Machine, FSM)控制器进行模仿学习以获得稳定初始策略;随后引入混合奖励机制,融合环境扩散反馈与IRL判别信号,使智能体更精准地对齐专家目标;同时利用条件扩散模型作为安全监督模块,生成符合车道保持、避障及平滑运动约束的安全路径;并通过可学习自适应掩码(Learnable Adaptive Mask, LAM)动态调整视觉注意力,提升感知适应性。整个训练采用两阶段课程学习,在Webots仿真环境中实现了96%的成功率和每千步仅0.05次碰撞的性能,显著提升了自主驾驶的安全性和泛化能力。
链接: https://arxiv.org/abs/2601.23266
作者: Seyed Ahmad Hosseini Miangoleh,Amin Jalal Aghdasian,Farzaneh Abdollahi
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper proposes a novel inverse reinforcement learning framework using a diffusion-based adaptive lookahead planner (IRL-DAL) for autonomous vehicles. Training begins with imitation from an expert finite state machine (FSM) controller to provide a stable initialization. Environment terms are combined with an IRL discriminator signal to align with expert goals. Reinforcement learning (RL) is then performed with a hybrid reward that combines diffuse environmental feedback and targeted IRL rewards. A conditional diffusion model, which acts as a safety supervisor, plans safe paths that stay in lane, avoid obstacles, and remain smooth. Then, a learnable adaptive mask (LAM) improves perception by shifting visual attention based on vehicle speed and nearby hazards. After FSM-based imitation, the policy is fine-tuned with Proximal Policy Optimization (PPO). Training is run in the Webots simulator with a two-stage curriculum. A 96% success rate is reached, and collisions are reduced to 0.05 per 1k steps, marking a new benchmark for safe navigation. By applying the proposed approach, the agent not only drives in lane but also handles unsafe conditions at an expert level. We make our code publicly available at this http URL.
zh
[AI-2] TEON: Tensorized Orthonormalization Beyond Layer-Wise Muon for Large Language Model Pre-Training
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)预训练过程中梯度优化效率低下的问题,特别是传统层内梯度正交化方法(如Muon)在跨层梯度耦合建模上的局限性。其解决方案的关键在于提出TEON(Tensor-based Efficient Orthogonalization Network),通过将神经网络的梯度建模为结构化的高阶张量(higher-order tensor),实现了对跨层梯度关系的显式建模与正交化,从而突破了原有仅在单层独立进行梯度正交化的限制,提升了优化过程的收敛性与稳定性。
链接: https://arxiv.org/abs/2601.23261
作者: Ruijie Zhang,Yequan Zhao,Ziyue Liu,Zhengyang Wang,Dongyang Li,Yupeng Su,Sijia Liu,Zheng Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The Muon optimizer has demonstrated strong empirical performance in pre-training large language models by performing matrix-level gradient (or momentum) orthogonalization in each layer independently. In this work, we propose TEON, a principled generalization of Muon that extends orthogonalization beyond individual layers by modeling the gradients of a neural network as a structured higher-order tensor. We present TEON’s improved convergence guarantee over layer-wise Muon, and further develop a practical instantiation of TEON based on the theoretical analysis with corresponding ablation. We evaluate our approach on two widely adopted architectures: GPT-style models, ranging from 130M to 774M parameters, and LLaMA-style models, ranging from 60M to 1B parameters. Experimental results show that TEON consistently improves training and validation perplexity across model scales and exhibits strong robustness under various approximate SVD schemes.
zh
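Muon 的核心操作是对每层梯度矩阵做近似正交化,常用 Newton-Schulz 迭代实现;TEON 则把多层梯度堆叠成高阶张量后再做这类正交化。下面仅示意单个矩阵的 Newton-Schulz 正交化(系数取最简单的三阶多项式,实际 Muon/TEON 实现可能使用调优后的系数):

```python
import numpy as np

def newton_schulz_orth(G: np.ndarray, iters: int = 25) -> np.ndarray:
    """Newton-Schulz 迭代近似矩阵符号函数,把奇异值推向 1。
    TEON 的思想是对堆叠后的高阶梯度张量做这类正交化;
    这里仅示意单个矩阵的情形。"""
    X = G / np.linalg.norm(G)  # Frobenius 归一化,保证谱范数落入收敛域
    for _ in range(iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

G = np.array([[3.0, 1.0], [1.0, 2.0]])
O = newton_schulz_orth(G)
print(np.round(O @ O.T, 3))  # 应近似单位阵
```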
[AI-3] YuriiFormer: A Suite of Nesterov-Accelerated Transformers
【速读】:该论文旨在解决当前Transformer架构缺乏明确优化理论指导的问题,即如何从经典优化视角出发,为Transformer的模块设计提供更坚实的理论基础。其解决方案的关键在于提出一种变分框架(variational framework),将Transformer层视为作用于token embeddings的优化算法迭代过程:自注意力机制(self-attention)被解释为交互能量(interaction energy)的梯度步进,而多层感知机(MLP)层则对应势能(potential energy)的梯度更新;标准GPT类Transformer由此被建模为通过Lie–Trotter分裂策略实现的复合目标函数上的梯度下降算法。这一理论视角使得可基于经典优化思想进行结构设计,例如文中引入的Nesterov加速版本Transformer,在保持原有注意力与MLP计算接口不变的前提下,显著优于nanoGPT基线模型,验证了优化理论对实际性能提升的有效性。
链接: https://arxiv.org/abs/2601.23236
作者: Aleksandr Zimin,Yury Polyanskiy,Philippe Rigollet
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:
Abstract:We propose a variational framework that interprets transformer layers as iterations of an optimization algorithm acting on token embeddings. In this view, self-attention implements a gradient step of an interaction energy, while MLP layers correspond to gradient updates of a potential energy. Standard GPT-style transformers emerge as vanilla gradient descent on the resulting composite objective, implemented via Lie–Trotter splitting between these two energy functionals. This perspective enables principled architectural design using classical optimization ideas. As a proof of concept, we introduce a Nesterov-style accelerated transformer that preserves the same attention and MLP oracles. The resulting architecture consistently outperforms a nanoGPT baseline on TinyStories and OpenWebText, demonstrating that optimization-theoretic insights can translate into practical gains.
zh
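文中把 GPT 式 Transformer 看作对复合能量的梯度下降,并在不改动注意力/MLP "oracle" 的前提下替换为 Nesterov 加速更新。下面在一个玩具二次能量上对比两种迭代格式(仅为优化层面的示意,与具体 Transformer 结构无关):

```python
import numpy as np

def gd(x0, grad, lr, steps):
    """普通梯度下降,对应标准残差堆叠的视角。"""
    x = x0.copy()
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def nesterov(x0, grad, lr, steps, beta=0.9):
    """Nesterov 加速:在前瞻点 y 处调用同一个梯度 oracle。"""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(steps):
        y = x + beta * (x - x_prev)
        x_prev, x = x, y - lr * grad(y)
    return x

A = np.diag([1.0, 100.0])            # 病态二次能量 E(x) = x^T A x / 2
grad = lambda x: A @ x
energy = lambda x: 0.5 * x @ A @ x
x0 = np.ones(2)
# 相同步数、相同梯度 oracle 下,加速版能量下降明显更快
print(energy(gd(x0, grad, 0.01, 50)), energy(nesterov(x0, grad, 0.01, 50)))
```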
[AI-4] Strongly Polynomial Time Complexity of Policy Iteration for L_infty Robust MDPs
【速读】:该论文旨在解决**(s, a)-矩形 L∞ 不确定性集下的鲁棒马尔可夫决策过程(Robust Markov Decision Processes, RMDPs)在固定折扣因子下是否存在强多项式时间算法这一基础性问题。此前,对于标准马尔可夫决策过程(MDP),线性规划可在任意折扣因子下实现多项式时间求解,而Ye的开创性工作证明了在固定折扣因子下存在强多项式时间算法;但这一结果未能推广至RMDPs,成为长期悬而未决的重要开放问题。本文的关键解决方案是提出并分析了一种鲁棒策略迭代算法(robust policy iteration algorithm)**,证明其在固定折扣因子下对(s, a)-矩形L∞ RMDPs具有强多项式时间复杂度,从而首次实现了该类鲁棒优化模型的强多项式算法保证。
链接: https://arxiv.org/abs/2601.23229
作者: Ali Asadi,Krishnendu Chatterjee,Ehsan Goharshady,Mehrdad Karrabi,Alipasha Montaseri,Carlo Pagano
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
备注:
Abstract:Markov decision processes (MDPs) are a fundamental model in sequential decision making. Robust MDPs (RMDPs) extend this framework by allowing uncertainty in transition probabilities and optimizing against the worst-case realization of that uncertainty. In particular, (s, a)-rectangular RMDPs with L_\infty uncertainty sets form a fundamental and expressive model: they subsume classical MDPs and turn-based stochastic games. We consider this model with discounted payoffs. The existence of polynomial- and strongly-polynomial-time algorithms is a fundamental problem for these optimization models. For MDPs, linear programming yields polynomial-time algorithms for any arbitrary discount factor, and the seminal work of Ye established strongly-polynomial time for a fixed discount factor. The generalization of such results to RMDPs has remained an important open problem. In this work, we show that a robust policy iteration algorithm runs in strongly-polynomial time for (s, a)-rectangular L_\infty RMDPs with a constant (fixed) discount factor, resolving an important algorithmic question.
zh
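(s, a)-矩形 L∞ 不确定集下,鲁棒策略迭代的内层问题是最坏情形期望:min_q q·v,约束为 q 是分布且 ||q - p||∞ ≤ ε。该内层问题可用贪心法求解:把概率质量从高价值状态搬向低价值状态。以下为该内层问题的示意实现(并非论文算法本身):

```python
import numpy as np

def worst_case_value(p: np.ndarray, v: np.ndarray, eps: float) -> float:
    """min_q q·v, s.t. q 为分布且 ||q - p||∞ <= eps。
    贪心:在容量约束内把概率质量从高价值状态搬到低价值状态。"""
    q = p.astype(float).copy()
    order = np.argsort(v)                 # 按价值升序排列状态
    lo, hi = 0, len(v) - 1
    while lo < hi:
        i, j = order[lo], order[hi]       # i: 低价值, j: 高价值
        give = min(eps - (q[i] - p[i]), 1.0 - q[i])  # i 还能增加多少
        take = min(eps - (p[j] - q[j]), q[j])        # j 还能减少多少
        m = min(give, take)
        q[i] += m
        q[j] -= m
        if give <= take:
            lo += 1
        if take <= give:
            hi -= 1
    return float(q @ v)

p = np.array([0.5, 0.5])
v = np.array([0.0, 1.0])
print(worst_case_value(p, v, 0.2))  # 最坏分布 [0.7, 0.3],期望 0.3
```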
[AI-5] Agile Reinforcement Learning through Separable Neural Architecture
【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)在资源受限环境中因多层感知机(Multilayer Perceptrons, MLPs)参数效率低下而导致的样本效率低和策略学习缓慢的问题。其核心挑战在于MLPs对价值函数平滑结构的归纳偏置不充分,从而限制了在容量受限场景下的性能表现。解决方案的关键在于提出SPAN(SPline-based Adaptive Networks),该方法基于可学习的预处理层与可分离张量积B样条基相结合的架构,继承了Kolmogorov-Arnold Networks(KANs)的参数高效特性,并通过低秩KHRONOS框架实现计算开销可控,显著提升了样本效率与训练鲁棒性,在离散控制(PPO)、高维连续控制(SAC)及离线强化学习(Minari/D4RL)任务中均优于传统MLP基线。
链接: https://arxiv.org/abs/2601.23225
作者: Rajib Mostakim,Reza T. Batley,Sourav Saha
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep reinforcement learning (RL) is increasingly deployed in resource-constrained environments, yet the go-to function approximators - multilayer perceptrons (MLPs) - are often parameter-inefficient due to an imperfect inductive bias for the smooth structure of many value functions. This mismatch can also hinder sample efficiency and slow policy learning in this capacity-limited regime. Although model compression techniques exist, they operate post-hoc and do not improve learning efficiency. Recent spline-based separable architectures - such as Kolmogorov-Arnold Networks (KANs) - have been shown to offer parameter efficiency but are widely reported to exhibit significant computational overhead, especially at scale. In seeking to address these limitations, this work introduces SPAN (SPline-based Adaptive Networks), a novel function approximation approach to RL. SPAN adapts the low rank KHRONOS framework by integrating a learnable preprocessing layer with a separable tensor product B-spline basis. SPAN is evaluated across discrete (PPO) and high-dimensional continuous (SAC) control tasks, as well as offline settings (Minari/D4RL). Empirical results demonstrate that SPAN achieves a 30-50% improvement in sample efficiency and 1.3-9 times higher success rates across benchmarks compared to MLP baselines. Furthermore, SPAN demonstrates superior anytime performance and robustness to hyperparameter variations, suggesting it as a viable, high performance alternative for learning intrinsically efficient policies in resource-limited settings.
zh
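SPAN/KHRONOS 式的可分离结构把多维函数近似写成一维样条基的张量积。下面用最简单的线性(帽函数)B 样条给出二维示意;均匀结点与线性阶均为笔者为演示所作的简化,论文实际使用的样条阶数与结点设置可能不同:

```python
import numpy as np

def hat_basis(x: float, knots: np.ndarray) -> np.ndarray:
    """一维线性 B 样条(帽函数)基:在各结点处取 1,线性衰减到相邻结点。"""
    b = 1.0 - np.abs(x - knots) / (knots[1] - knots[0])
    return np.clip(b, 0.0, 1.0)

def separable_eval(x: float, y: float, knots: np.ndarray, W: np.ndarray) -> float:
    """可分离张量积样条:f(x, y) = Σ_ij W_ij B_i(x) B_j(y)。"""
    bx, by = hat_basis(x, knots), hat_basis(y, knots)
    return float(bx @ W @ by)

knots = np.linspace(0.0, 1.0, 5)   # 均匀结点 0, 0.25, 0.5, 0.75, 1
bx = hat_basis(0.3, knots)
print(round(bx.sum(), 6))           # 1.0:帽基满足单位分解
```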
[AI-6] MonoScale: Scaling Multi-Agent System with Monotonic Improvement
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLM)驱动的多智能体系统(Multi-Agent Systems, MAS)在扩展代理池时因路由器(router)对新加入异构且不可靠代理冷启动而导致的性能崩溃问题。解决方案的关键在于提出一种名为MonoScale的扩增感知更新框架,其核心机制是通过主动生成少量代理条件化的熟悉任务,从成功与失败交互中收集证据,并将其提炼为可审计的自然语言记忆,用于指导未来的路由决策;同时将序列式扩增形式化为上下文相关老虎机问题,并采用信任区域记忆更新策略,从而保证在各引入轮次中性能单调不降。
链接: https://arxiv.org/abs/2601.23219
作者: Shuai Shao,Yixiang Liu,Bingwei Lu,Weinan Zhang
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:
Abstract:In recent years, LLM-based multi-agent systems (MAS) have advanced rapidly, using a router to decompose tasks and delegate subtasks to specialized agents. A natural way to expand capability is to scale up the agent pool by continually integrating new functional agents or tool interfaces, but naive expansion can trigger performance collapse when the router cold-starts on newly added, heterogeneous, and unreliable agents. We propose MonoScale, an expansion-aware update framework that proactively generates a small set of agent-conditioned familiarization tasks, harvests evidence from both successful and failed interactions, and distills it into auditable natural-language memory to guide future routing. We formalize sequential augmentation as a contextual bandit and perform trust-region memory updates, yielding a monotonic non-decreasing performance guarantee across onboarding rounds. Experiments on GAIA and Humanity’s Last Exam show stable gains as the agent pool grows, outperforming naive scale-up and strong-router fixed-pool baselines.
zh
[AI-7] Learning to Execute Graph Algorithms Exactly with Graph Neural Networks
【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在有限精度和有界度约束下对图算法的可学习性问题,尤其是其能否准确执行诸如广度优先搜索(Breadth-First Search, BFS)、深度优先搜索(Depth-First Search, DFS)及Bellman-Ford等经典图算法。解决方案的关键在于提出一种两阶段训练框架:首先使用多层感知机(Multi-Layer Perceptrons, MLPs)的集成模型学习单个节点的局部指令;其次,在推理阶段将训练好的MLP集成作为GNN的更新函数,从而实现整个图算法的无误差执行。借助神经切线核(Neural Tangent Kernel, NTK)理论,作者证明了仅需少量训练样本即可精确学习局部指令,并保证在高概率下完成全局算法执行,从而为GNN执行复杂图算法提供了严格的理论支撑。
链接: https://arxiv.org/abs/2601.23207
作者: Muhammad Fetrat Qharabagh,Artur Back de Luca,George Giapitzakis,Kimon Fountoulakis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Understanding what graph neural networks can learn, especially their ability to learn to execute algorithms, remains a central theoretical challenge. In this work, we prove exact learnability results for graph algorithms under bounded-degree and finite-precision constraints. Our approach follows a two-step process. First, we train an ensemble of multi-layer perceptrons (MLPs) to execute the local instructions of a single node. Second, during inference, we use the trained MLP ensemble as the update function within a graph neural network (GNN). Leveraging Neural Tangent Kernel (NTK) theory, we show that local instructions can be learned from a small training set, enabling the complete graph algorithm to be executed during inference without error and with high probability. To illustrate the learning power of our setting, we establish a rigorous learnability result for the LOCAL model of distributed computation. We further demonstrate positive learnability results for widely studied algorithms such as message flooding, breadth-first and depth-first search, and Bellman-Ford.
zh
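论文训练 MLP 学习单节点的"局部指令",推理时将其作为 GNN 的更新函数逐轮执行整个算法。以 Bellman-Ford 为例,其局部指令就是一次 min-聚合更新,可直接写成消息传递形式(以下为示意性的非神经实现):

```python
import numpy as np

def bellman_ford_mp(adj: np.ndarray, src: int) -> np.ndarray:
    """把 Bellman-Ford 写成逐轮的 min-聚合消息传递:
    d_v <- min(d_v, min_u d_u + w(u, v)),与 GNN 的更新函数同构。
    adj[u, v] 为边权,np.inf 表示无边。"""
    n = adj.shape[0]
    d = np.full(n, np.inf)
    d[src] = 0.0
    for _ in range(n - 1):          # 至多 n-1 轮即可收敛
        msgs = d[:, None] + adj     # msgs[u, v]: 经 u 到达 v 的候选距离
        d = np.minimum(d, msgs.min(axis=0))
    return d

inf = np.inf
adj = np.array([[inf, 1.0, 4.0],
                [inf, inf, 2.0],
                [inf, inf, inf]])
print(bellman_ford_mp(adj, 0))  # [0. 1. 3.]
```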
[AI-8] High-quality generation of dynamic game content via small language models: A proof of concept
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在动态游戏内容生成中面临的两大核心问题:一是叙事连贯性差,二是因模型规模庞大导致的高运算成本及对云端依赖,限制了其在离线游戏场景中的应用。现有基于小型语言模型(Small Language Models, SLMs)的研究虽能降低资源消耗,但输出质量普遍不佳。论文提出的关键解决方案是通过“激进微调”(aggressive fine-tuning)策略,在任务范围受限、上下文狭窄或结构约束明确的条件下训练SLMs,从而实现高质量生成。具体而言,更复杂的任务需更强的任务专一性和训练语料匹配度;同时,利用基于有向无环图(DAG-based)的方法合成训练数据,使模型扎根于特定游戏世界。实验验证表明,一个专注于声誉博弈的最小RPG循环可借助该策略实现可预测延迟下的实时生成,且质量达到LLM-as-a-judge评估标准,为构建基于叙事学框架的代理网络提供了可行路径。
链接: https://arxiv.org/abs/2601.23206
作者: Morten I. K. Munk,Arturo Valdivia,Paolo Burelli
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) offer promise for dynamic game content generation, but they face critical barriers, including narrative incoherence and high operational costs. Due to their large size, they are often accessed in the cloud, limiting their application in offline games. Many of these practical issues are solved by pivoting to small language models (SLMs), but existing studies using SLMs have resulted in poor output quality. We propose a strategy of achieving high-quality SLM generation through aggressive fine-tuning on deliberately scoped tasks with narrow context, constrained structure, or both. In short, more difficult tasks require narrower scope and higher specialization to the training corpus. Training data is synthetically generated via a DAG-based approach, grounding models in the specific game world. Such models can form the basis for agentic networks designed around the narratological framework at hand, representing a more practical and robust solution than cloud-dependent LLMs. To validate this approach, we present a proof-of-concept focusing on a single specialized SLM as the fundamental building block. We introduce a minimal RPG loop revolving around rhetorical battles of reputations, powered by this model. We demonstrate that a simple retry-until-success strategy reaches adequate quality (as defined by an LLM-as-a-judge scheme) with predictable latency suitable for real-time generation. While local quality assessment remains an open question, our results demonstrate feasibility for real-time generation under typical game engine constraints.
zh
[AI-9] TSAQA: Time Series Analysis Question And Answering Benchmark
【速读】:该论文旨在解决当前多任务时间序列问答(Time Series Question Answering, TSQA)基准测试任务覆盖范围有限的问题,现有基准主要集中在预测和异常检测任务,难以全面评估大语言模型(Large Language Models, LLMs)在时间序列分析中的综合能力。解决方案的关键在于提出一个统一的基准TSAQA,它整合了六类多样化的时序分析任务,涵盖从传统分析(如异常检测、分类)到高级分析(如特征刻画、对比分析、数据变换和时序关系挖掘),并构建包含210k样本、跨13个领域的多样化数据集,采用真/假(True-or-False, TF)、多选(Multiple-Choice, MC)和新颖的谜题式(Puzzling, PZ)三种问答格式,从而系统性地评估LLMs在复杂时序理解与推理方面的表现。
链接: https://arxiv.org/abs/2601.23204
作者: Baoyu Jing,Sanhorn Chen,Lecheng Zheng,Boyu Liu,Zihao Li,Jiaru Zou,Tianxin Wei,Zhining Liu,Zhichen Zeng,Ruizhong Qiu,Xiao Lin,Yuchen Yan,Dongqi Fu,Jingchao Ni,Jingrui He,Hanghang Tong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 35 pages, 7 figures
Abstract:Time series data are integral to critical applications across domains such as finance, healthcare, transportation, and environmental science. While recent work has begun to explore multi-task time series question answering (QA), current benchmarks remain limited to forecasting and anomaly detection tasks. We introduce TSAQA, a novel unified benchmark designed to broaden task coverage and evaluate diverse temporal analysis capabilities. TSAQA integrates six diverse tasks under a single framework ranging from conventional analysis, including anomaly detection and classification, to advanced analysis, such as characterization, comparison, data transformation, and temporal relationship analysis. Spanning 210k samples across 13 domains, the dataset employs diverse formats, including true-or-false (TF), multiple-choice (MC), and a novel puzzling (PZ), to comprehensively assess time series analysis. Zero-shot evaluation demonstrates that these tasks are challenging for current Large Language Models (LLMs): the best-performing commercial LLM, Gemini-2.5-Flash, achieves an average score of only 65.08. Although instruction tuning boosts open-source performance: the best-performing open-source model, LLaMA-3.1-8B, shows significant room for improvement, highlighting the complexity of temporal analysis for LLMs.
zh
[AI-10] Make Anything Match Your Target: Universal Adversarial Perturbations against Closed-Source MLLM s via Multi-Crop Routed Meta Optimization
【速读】:该论文致力于解决闭源多模态大语言模型(Multimodal Large Language Models, MLLMs)在黑盒迁移场景下,通用目标可迁移对抗攻击(Universal Targeted Transferable Adversarial Attacks, UTTAA)的挑战。具体而言,现有方法多为样本特定(sample-specific),难以在不同输入间复用,而UTTAA要求单个扰动能稳定地将任意输入引导至指定目标,这对攻击的鲁棒性和泛化能力提出了更高要求。解决方案的关键在于:(1) 通过多裁剪聚合与注意力引导的裁剪策略(Multi-Crop Aggregation with an Attention-Guided Crop)稳定高方差的目标监督信号;(2) 利用可对齐性门控的token路由机制(alignability-gated Token Routing)提升token级匹配的可靠性;(3) 引入元学习框架以构建跨目标扰动先验(cross-target perturbation prior),从而增强每个目标下的优化稳定性与性能。实验表明,该方法在GPT-4o和Gemini-2.0上分别实现了+23.7%和+19.9%的未见图像攻击成功率提升。
链接: https://arxiv.org/abs/2601.23179
作者: Hui Lu,Yi Yu,Yiming Yang,Chenyu Yi,Xueyi Ke,Qixing Zhang,Bingquan Shen,Alex Kot,Xudong Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Targeted adversarial attacks on closed-source multimodal large language models (MLLMs) have been increasingly explored under black-box transfer, yet prior methods are predominantly sample-specific and offer limited reusability across inputs. We instead study a more stringent setting, Universal Targeted Transferable Adversarial Attacks (UTTAA), where a single perturbation must consistently steer arbitrary inputs toward a specified target across unknown commercial MLLMs. Naively adapting existing sample-wise attacks to this universal setting faces three core difficulties: (i) target supervision becomes high-variance due to target-crop randomness, (ii) token-wise matching is unreliable because universality suppresses image-specific cues that would otherwise anchor alignment, and (iii) few-source per-target adaptation is highly initialization-sensitive, which can degrade the attainable performance. In this work, we propose MCRMO-Attack, which stabilizes supervision via Multi-Crop Aggregation with an Attention-Guided Crop, improves token-level reliability through alignability-gated Token Routing, and meta-learns a cross-target perturbation prior that yields stronger per-target solutions. Across commercial MLLMs, we boost unseen-image attack success rate by +23.7% on GPT-4o and +19.9% on Gemini-2.0 over the strongest universal baseline.
zh
[AI-11] Beyond Fixed Frames: Dynamic Character-Aligned Speech Tokenization
【速读】:该论文旨在解决现有神经音频编解码器(Neural Audio Codecs)在语音 token 化过程中因固定帧率导致的冗余问题,即均匀分配 token 于时间轴上,从而生成过长的序列。其解决方案的关键在于提出 DyCAST(Dynamic Character-Aligned Speech Tokenizer),通过软字符级对齐(soft character-level alignment)与显式时长建模(explicit duration modeling),实现可变帧率的语音 token 化;该方法在训练中学习将 token 与字符级语言单元关联,并支持推理阶段无需对齐的直接控制 token 时长,同时引入检索增强解码机制以提升低帧率下的语音重建质量,从而在保持下游任务性能的同时显著减少 token 数量。
链接: https://arxiv.org/abs/2601.23174
作者: Luca Della Libera,Cem Subakan,Mirco Ravanelli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: 18 pages, 3 figures
Abstract:Neural audio codecs are at the core of modern conversational speech technologies, converting continuous speech into sequences of discrete tokens that can be processed by LLMs. However, existing codecs typically operate at fixed frame rates, allocating tokens uniformly in time and producing unnecessarily long sequences. In this work, we introduce DyCAST, a Dynamic Character-Aligned Speech Tokenizer that enables variable-frame-rate tokenization through soft character-level alignment and explicit duration modeling. DyCAST learns to associate tokens with character-level linguistic units during training and supports alignment-free inference with direct control over token durations at decoding time. To improve speech resynthesis quality at low frame rates, we further introduce a retrieval-augmented decoding mechanism that enhances reconstruction fidelity without increasing bitrate. Experiments show that DyCAST achieves competitive speech resynthesis quality and downstream performance while using significantly fewer tokens than fixed-frame-rate codecs.
zh
[AI-12] Probing the Trajectories of Reasoning Traces in Large Language Models
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在生成推理轨迹(reasoning traces)过程中,准确率与决策承诺(decision commitment)如何随推理进展演化的问题,并探究中间推理片段是否蕴含比单纯长度或风格更丰富的答案相关性信息。其解决方案的关键在于提出了一种系统性探针协议(probing protocol),通过固定百分位截断推理轨迹并将其回注入模型,利用下一个词概率分布测量不同阶段的响应倾向,从而量化推理路径中信息的有效性。实验表明,准确率和决策强度随推理token占比提升而持续增强,且这种增益主要源于推理内容本身而非形式特征;同时发现更强模型能从错误部分轨迹中成功回溯,但弱模型的初始错误响应往往难以修正,这为推理模型的高效部署与安全监控提供了可操作的诊断依据。
链接: https://arxiv.org/abs/2601.23163
作者: Marthe Ballon,Brecht Verbeken,Vincent Ginis,Andres Algaba
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 33 pages, 20 figures, 4 tables
Abstract:Large language models (LLMs) increasingly solve difficult problems by producing “reasoning traces” before emitting a final response. However, it remains unclear how accuracy and decision commitment evolve along a reasoning trajectory, and whether intermediate trace segments provide answer-relevant information beyond generic length or stylistic effects. Here, we propose a protocol to systematically probe the trajectories of reasoning traces in LLMs by 1) generating a model’s reasoning trace, 2) truncating it at fixed token-percentiles, and 3) injecting each partial trace back into the model (or a different model) to measure the induced distribution over answer choices via next-token probabilities. We apply this protocol to the open-source Qwen3-4B/-8B/-14B and gpt-oss-20b/-120b models across the multiple-choice GPQA Diamond and MMLU-Pro benchmarks. We find that accuracy and decision commitment consistently increase as the percentage of provided reasoning tokens grows. These gains are primarily driven by relevant content in the model generation rather than context length or generic “reasoning style” effects. Stronger models often backtrack successfully from incorrect partial traces, but immediate answers often remain anchored in the weaker model’s incorrect response. More broadly, we show that trajectory probing provides diagnostics for efficient and safer deployment of reasoning models as the measurements can inform practical trace-handling and monitoring policies that improve reliability without assuming intermediate tokens are inherently faithful explanations.
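协议中"按固定 token 百分位截断推理轨迹"这一步可以用如下示意代码表达(假设性草图,百分位取值仅为示例;截断得到的每个前缀随后会被回注模型以读取答案分布):

```python
def truncate_at_percentiles(trace_tokens, percentiles=(25, 50, 75, 100)):
    """在固定百分位处截断推理轨迹(探针协议的第 2 步)。
    返回 {百分位: token 前缀},供后续回注模型测量答案分布。"""
    n = len(trace_tokens)
    return {p: trace_tokens[: (n * p) // 100] for p in percentiles}

# 示例:8 个 token 的轨迹在各百分位处截断
cuts = truncate_at_percentiles(list(range(8)))
```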
zh
[AI-13] SPICE: Submodular Penalized Information-Conflict Selection for Efficient Large Language Model Training
【速读】:该论文旨在解决指令微调(instruction tuning)中数据选择效率低下的问题,即如何在有限的数据预算下最大化信息增益,从而提升模型性能并降低训练成本。传统方法基于Fisher信息矩阵的行列式(log-determinant)构建子模优化目标,虽理论上可实现(1−1/e)近似比,但在实践中因样本梯度间存在冲突(gradient conflicts),导致边际信息增益衰减缓慢,实际效果受限。解决方案的关键在于提出一种冲突感知的数据选择方法SPICE,其核心创新是通过引入ε-分解量化梯度冲突对子模性偏离的影响,并在优化目标中显式惩罚梯度不一致(misalignment),从而获得更紧致的数据依赖近似因子;同时支持早停机制与代理模型以提升效率。实验证明,SPICE仅用10%数据即可达到甚至超越全量数据微调的效果,在多个基准上显著优于现有方法。
链接: https://arxiv.org/abs/2601.23155
作者: Powei Chang,Jinpeng Zhang,Bowen Chen,Chenyu Wang,Chenlu Guo,Yixing Zhang,Yukang Gao,JianXiang Xiang,Yue Gao,Chaoqun Sun,Yiyi Chen,Dongying Kong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 39 pages, 9 figures, 15 tables (including appendices)
Abstract:Information-based data selection for instruction tuning is compelling: maximizing the log-determinant of the Fisher information yields a monotone submodular objective, enabling greedy algorithms to achieve a (1-1/e) approximation under a cardinality budget. In practice, however, we identify alleviating gradient conflicts, misalignment between per-sample gradients, is a key factor that slows down the decay of marginal log-determinant information gains, thereby preventing significant loss of information. We formalize this via an \varepsilon -decomposition that quantifies the deviation from ideal submodularity as a function of conflict statistics, yielding data-dependent approximation factors that tighten as conflicts diminish. Guided by this analysis, we propose SPICE, a conflict-aware selector that maximizes information while penalizing misalignment, and that supports early stopping and proxy models for efficiency. Empirically, SPICE selects subsets with higher log-determinant information than original criteria, and these informational gains translate into performance improvements: across 8 benchmarks with LLaMA2-7B and Qwen2-7B, SPICE uses only 10% of the data, yet matches or exceeds 6 methods including full-data tuning. This achieves performance improvements with substantially lower training cost.
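摘要中"最大化 log-det 信息并惩罚梯度冲突"的贪心选择思路,可以用如下极简 NumPy 草图示意(并非论文实现;beta、eps 等均为本文假设的超参数):

```python
import numpy as np

def greedy_select(grads, k, beta=0.5, eps=1e-3):
    """贪心数据选择草图:目标为所选样本梯度 Gram 矩阵的 log-det
    (Fisher 信息的一个代理)减去梯度冲突惩罚(负余弦相似度)。"""
    n = len(grads)
    unit = grads / np.linalg.norm(grads, axis=1, keepdims=True)
    chosen = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for i in range(n):
            if i in chosen:
                continue
            idx = chosen + [i]
            G = grads[idx]
            # 候选子集的 log-det 信息量(加 eps*I 做正则化)
            info = np.linalg.slogdet(G @ G.T + eps * np.eye(len(idx)))[1]
            # 冲突惩罚:子集内两两负余弦相似度之和
            cos = unit[idx] @ unit[idx].T
            conflict = np.clip(-cos[np.triu_indices(len(idx), 1)], 0, None).sum()
            score = info - beta * conflict
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen

# 示例:4 个样本梯度中选 2 个;与已选样本方向相反的梯度受冲突惩罚而被排除
grads = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [0.0, 1.0]])
subset = greedy_select(grads, k=2)
```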
zh
[AI-14] On Safer Reinforcement Learning Policies for Sedation and Analgesia in Intensive Care
【速读】:该论文旨在解决重症监护病房(Intensive Care Unit, ICU)中疼痛管理的复杂权衡问题,即在追求治疗目标的同时确保患者安全,因为镇静和镇痛药物剂量不足或过量均可能导致严重后果。解决方案的关键在于采用深度强化学习(Deep Reinforcement Learning, DRL)框架,在部分可观测环境下学习最优药物给药策略,并通过对比两种目标函数——仅减少疼痛 vs. 同时减少疼痛与死亡率——来评估长期结果对治疗安全性的影响。研究发现,仅优化短期疼痛缓解的目标会导致死亡率上升,而同时考虑生存率的目标则显著降低死亡风险,表明在策略设计中纳入长期结局指标是实现更安全治疗决策的核心。
链接: https://arxiv.org/abs/2601.23154
作者: Joel Romero-Hernandez,Oscar Camara
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Submitted to the 48th Annual International Conference of the IEEE Engineering in Medicine Biology Society (IEEE EMBC 2026)
Abstract:Pain management in intensive care usually involves complex trade-offs between therapeutic goals and patient safety, since both inadequate and excessive treatment may induce serious sequelae. Reinforcement learning can help address this challenge by learning medication dosing policies from retrospective data. However, prior work on sedation and analgesia has optimized for objectives that do not value patient survival while relying on algorithms unsuitable for imperfect information settings. We investigated the risks of these design choices by implementing a deep reinforcement learning framework to suggest hourly medication doses under partial observability. Using data from 47,144 ICU stays in the MIMIC-IV database, we trained policies to prescribe opioids, propofol, benzodiazepines, and dexmedetomidine according to two goals: reduce pain or jointly reduce pain and mortality. We found that, although the two policies were associated with lower pain, actions from the first policy were positively correlated with mortality, while those proposed by the second policy were negatively correlated. This suggests that valuing long-term outcomes could be critical for safer treatment policies, even if a short-term goal remains the primary objective.
zh
[AI-15] Securing Time in Energy IoT: A Clock-Dynamics-Aware Spatio-Temporal Graph Attention Network for Clock Drift Attacks and Y2K38 Failures
【速读】:该论文旨在解决分布式能源物联网(Energy IoT)系统中因时钟漂移、时间同步篡改及时间戳中断(如2038年Unix溢出问题)导致的时间完整性破坏问题,这些问题会扰乱设备间的时间顺序关系,使得传统依赖可靠时间戳的异常检测模型失效。其解决方案的关键在于提出STGAT(Spatio-Temporal Graph Attention Network)框架,该框架通过融合感知漂移的时间嵌入与时间自注意力机制来捕捉单个设备上的时间演化异常,并利用图注意力机制建模时间误差在设备间的空间传播特性;同时引入曲率正则化的潜在表示,从几何上分离正常时钟演化与由漂移、同步偏移和溢出事件引发的异常,从而实现高精度、低延迟的时序异常检测。
链接: https://arxiv.org/abs/2601.23147
作者: Saeid Jamshidi,Omar Abdul Wahab,Rolando Herrero,Foutse Khomh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The integrity of time in distributed Internet of Things (IoT) devices is crucial for reliable operation in energy cyber-physical systems, such as smart grids and microgrids. However, IoT systems are vulnerable to clock drift, time-synchronization manipulation, and timestamp discontinuities, such as the Year 2038 (Y2K38) Unix overflow, all of which disrupt temporal ordering. Conventional anomaly-detection models, which assume reliable timestamps, fail to capture temporal inconsistencies. This paper introduces STGAT (Spatio-Temporal Graph Attention Network), a framework that models both temporal distortion and inter-device consistency in energy IoT systems. STGAT combines drift-aware temporal embeddings and temporal self-attention to capture corrupted time evolution at individual devices, and uses graph attention to model spatial propagation of timing errors. A curvature-regularized latent representation geometrically separates normal clock evolution from anomalies caused by drift, synchronization offsets, and overflow events. Experimental results on energy IoT telemetry with controlled timing perturbations show that STGAT achieves 95.7% accuracy, outperforming recurrent, transformer, and graph-based baselines with significant improvements (d > 1.8, p < 0.001). Additionally, STGAT reduces detection delay by 26%, achieving a 2.3-time-step delay while maintaining stable performance under overflow, drift, and physical inconsistencies.
zh
[AI-16] ThinkSafe: Self-Generated Safety Alignment for Reasoning Models
【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在通过强化学习(Reinforcement Learning, RL)优化链式思维(Chain-of-Thought, CoT)过程中因过度追求合规性而导致的安全性退化问题,即模型对有害提示变得敏感。现有方法依赖外部教师蒸馏进行对齐,但引入了分布偏移,损害了原始推理能力。其解决方案的关键在于提出ThinkSafe框架,利用轻量级拒绝引导(lightweight refusal steering)机制,从模型自身生成的安全推理轨迹中恢复安全对齐,从而在不依赖外部教师的情况下实现高效、低分布偏移的再对齐,兼顾安全性与推理性能。
链接: https://arxiv.org/abs/2601.23143
作者: Seanie Lee,Sangwoo Park,Yumin Choi,Gyeongman Kim,Minki Kang,Jihun Yun,Dongmin Park,Jongho Park,Sung Ju Hwang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 17 pages, 13 figures
Abstract:Large reasoning models (LRMs) achieve remarkable performance by leveraging reinforcement learning (RL) on reasoning tasks to generate long chain-of-thought (CoT) reasoning. However, this over-optimization often prioritizes compliance, making models vulnerable to harmful prompts. To mitigate this safety degradation, recent approaches rely on external teacher distillation, yet this introduces a distributional discrepancy that degrades native reasoning. We propose ThinkSafe, a self-generated alignment framework that restores safety alignment without external teachers. Our key insight is that while compliance suppresses safety mechanisms, models often retain latent knowledge to identify harm. ThinkSafe unlocks this via lightweight refusal steering, guiding the model to generate in-distribution safety reasoning traces. Fine-tuning on these self-generated responses effectively realigns the model while minimizing distribution shift. Experiments on DeepSeek-R1-Distill and Qwen3 show ThinkSafe significantly improves safety while preserving reasoning proficiency. Notably, it achieves superior safety and comparable reasoning to GRPO, with significantly reduced computational cost. Code, models, and datasets are available at this https URL.
zh
[AI-17] Machine Learning for Energy-Performance-aware Scheduling
【速读】:该论文旨在解决后 Dennard 缩放时代嵌入式系统中能量效率与延迟之间复杂权衡的优化问题,传统启发式调优方法在高维非光滑搜索空间中效率低下。其解决方案的关键在于提出一种基于高斯过程(Gaussian Process)的贝叶斯优化框架,用于自动搜索异构多核架构上的最优调度配置;并通过近似能量与时间之间的帕累托前沿(Pareto Frontier)来处理多目标优化问题,同时引入 fANOVA 敏感性分析和不同核函数(如 Matérn 与 RBF)比较,增强黑箱模型的物理可解释性,从而识别主导系统性能的关键硬件参数。
链接: https://arxiv.org/abs/2601.23134
作者: Zheyuan Hu,Yifei Shi
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Zheyuan Hu and Yifei Shi contributed equally to this work
Abstract:In the post-Dennard era, optimizing embedded systems requires navigating complex trade-offs between energy efficiency and latency. Traditional heuristic tuning is often inefficient in such high-dimensional, non-smooth landscapes. In this work, we propose a Bayesian Optimization framework using Gaussian Processes to automate the search for optimal scheduling configurations on heterogeneous multi-core architectures. We explicitly address the multi-objective nature of the problem by approximating the Pareto Frontier between energy and time. Furthermore, by incorporating Sensitivity Analysis (fANOVA) and comparing different covariance kernels (e.g., Matérn vs. RBF), we provide physical interpretability to the black-box model, revealing the dominant hardware parameters driving system performance.
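摘要提到用帕累托前沿刻画能量与时间的多目标权衡,其"非支配点"筛选可用如下草图说明(仅为示意,与论文的贝叶斯优化实现无关):

```python
def pareto_front(points):
    """返回 (energy, time) 配置中的非支配点:当且仅当不存在另一点
    在两个目标上都不劣、且至少一个目标上严格更优时保留。"""
    front = []
    for i, (e1, t1) in enumerate(points):
        dominated = any(
            e2 <= e1 and t2 <= t1 and (e2 < e1 or t2 < t1)
            for j, (e2, t2) in enumerate(points) if j != i
        )
        if not dominated:
            front.append((e1, t1))
    return front

# 示例:(3, 4) 在两个目标上都被 (2, 3) 支配,故不在前沿上
front = pareto_front([(1, 5), (2, 3), (3, 4), (4, 1)])
```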
zh
[AI-18] RAudit: A Blind Auditing Protocol for Large Language Model Reasoning
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中因推理路径病理(如谄媚行为、阶梯坍塌和过早确定性)导致的不可靠性问题,尤其在缺乏真实标签(ground truth)的情况下如何有效诊断与修正模型推理过程。其解决方案的核心是提出RAudit——一种基于“盲视”约束的审计协议:审计者仅评估推导步骤是否支持结论,从而检测输出不一致并恢复潜在能力(latent competence)。该方法通过CRIT-based合理性评分衡量推理质量,并通过调整批判性表述研究社会框架对模型响应的影响,实现了有界纠正和O(log(1/ϵ))终止条件,为理解LLM推理脆弱性提供了系统性诊断工具。
链接: https://arxiv.org/abs/2601.23133
作者: Edward Y. Chang,Longling Geng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 24 pages, 21 tables, 3 figures
Abstract:Inference-time scaling can amplify reasoning pathologies: sycophancy, rung collapse, and premature certainty. We present RAudit, a diagnostic protocol for auditing LLM reasoning without ground truth access. The key constraint is blindness: the auditor evaluates only whether derivation steps support conclusions, enabling detection of trace-output inconsistency and, when latent competence exists, its recovery. RAudit measures process quality via CRIT-based reasonableness scores and varies critique formulation to study how social framing affects model response. We prove bounded correction and O(\log(1/\epsilon)) termination. Experiments on mathematical reasoning (CAP-GSM8K) and causal judgment (CausalL2) reveal four mechanisms explaining model unreliability: (1) Latent Competence Suppression, where models derive correct answers then overwrite them under social pressure; (2) The False Competence Trap, where weaker judges mask sycophancy that stronger judges expose; (3) The Complexity-Vulnerability Tradeoff, where causal tasks induce more than 10 times higher sycophancy than mathematical tasks; and (4) Iatrogenic Critique, where authoritative correction harms weaker models. These findings challenge assumptions that capability implies robustness and that stronger feedback yields better outputs.
zh
[AI-19] Secure Tool Manifest and Digital Signing Solution for Verifiable MCP and LLM Pipelines
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在医疗和金融等敏感领域应用中,其执行流程存在可被操纵且行为不可验证的问题。现有控制机制如模型上下文协议(Model Context Protocol, MCP)虽能定义工具调用的合规策略,但缺乏可验证的执行强制机制与透明的行为验证能力。解决方案的关键在于提出一种安全感知的工具清单与数字签名框架(Secure Tool Manifest and Digital Signing Framework),通过加密签名的manifest实现执行完整性保障,集成透明的验证日志,并将模型内部执行元数据与用户可见组件隔离,从而确保执行过程的可验证性和安全性。
链接: https://arxiv.org/abs/2601.23132
作者: Saeid Jamshidi,Kawser Wazed Nafi,Arghavan Moradi Dakhel,Foutse Khomh,Amin Nikanjam,Mohammad Adnan Hamdaqa
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are increasingly adopted in sensitive domains such as healthcare and financial institutions’ data analytics; however, their execution pipelines remain vulnerable to manipulation and unverifiable behavior. Existing control mechanisms, such as the Model Context Protocol (MCP), define compliance policies for tool invocation but lack verifiable enforcement and transparent validation of model actions. To address this gap, we propose a novel Secure Tool Manifest and Digital Signing Framework, a structured and security-aware extension of Model Context Protocols. The framework enforces cryptographically signed manifests, integrates transparent verification logs, and isolates model-internal execution metadata from user-visible components to ensure verifiable execution integrity. Furthermore, the evaluation demonstrates that the framework scales nearly linearly (R-squared = 0.998), achieves near-perfect acceptance of valid executions while consistently rejecting invalid ones, and maintains balanced model utilization across execution pipelines.
zh
[AI-20] Regularisation in neural networks: a survey and empirical analysis of approaches
【速读】:该论文试图解决的问题是:当前广泛使用的正则化(regularisation)技术是否在实践中确实能够提升神经网络的泛化能力,即这一假设是否成立。解决方案的关键在于通过系统性分类和实证比较来评估不同正则化方法的有效性——首先提出一个包含四类方法的分类体系(数据驱动策略、架构策略、训练策略和损失函数策略),并基于十种数值与图像数据集对多层感知机(MLP)和卷积神经网络(CNN)进行实验验证,结果表明正则化效果具有显著的数据集依赖性,例如正则项仅在数值数据上有效,而批量归一化(batch normalisation)仅在图像数据上提升性能。这揭示了正则化并非普适有效,需结合具体任务和数据特性选择合适策略。
链接: https://arxiv.org/abs/2601.23131
作者: Christiaan P. Opperman,Anna S. Bosman,Katherine M. Malan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 15 pages, 4 figures, 4 tables; for associated code, see this https URL
Abstract:Despite huge successes on a wide range of tasks, neural networks are known to sometimes struggle to generalise to unseen data. Many approaches have been proposed over the years to promote the generalisation ability of neural networks, collectively known as regularisation techniques. These are used as common practice under the assumption that any regularisation added to the pipeline would result in a performance improvement. In this study, we investigate whether this assumption holds in practice. First, we provide a broad review of regularisation techniques, including modern theories such as double descent. We propose a taxonomy of methods under four broad categories, namely: (1) data-based strategies, (2) architecture strategies, (3) training strategies, and (4) loss function strategies. Notably, we highlight the contradictions and correspondences between the approaches in these broad classes. Further, we perform an empirical comparison of the various regularisation techniques on classification tasks for ten numerical and image datasets applied to the multi-layer perceptron and convolutional neural network architectures. Results show that the efficacy of regularisation is dataset-dependent. For example, the use of a regularisation term only improved performance on numeric datasets, whereas batch normalisation improved performance on image datasets only. Generalisation is crucial to machine learning; thus, understanding the effects of applying regularisation techniques, and considering the connections between them is essential to the appropriate use of these methods in practice.
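以分类体系中第 (4) 类"损失函数策略"里最常见的 L2 正则项为例,其形式可简单示意如下(假设性草图,lam 为正则强度超参数):

```python
import numpy as np

def ridge_loss(w, X, y, lam=0.1):
    """均方误差加 L2 权重惩罚:正则项通过约束权重范数来抑制过拟合。"""
    residual = X @ w - y
    return float(residual @ residual / len(y) + lam * w @ w)

# 示例:完美拟合(残差为零)时损失只剩正则项 lam * ||w||^2
X, y = np.eye(2), np.array([1.0, 1.0])
```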
zh
[AI-21] To See Far Look Close: Evolutionary Forecasting for Long-term Time Series
【速读】:该论文旨在解决长时序预测(Long-term Time Series Forecasting, LTSF)中主流直接预测(Direct Forecasting, DF)范式所面临的优化困境:DF强制模型在单次前向传播中预测整个未来时间窗口,导致输出与评估时域耦合,使得针对不同目标时域需进行昂贵的重新训练。其关键解决方案是提出进化预测(Evolutionary Forecasting, EF)范式,通过解耦预测过程与评估时域,使模型能够以渐进方式演化出长期预测结果。EF被证明是一个统一的生成式框架,而DF仅为其中退化特例;实验表明,单一EF模型在标准基准上优于任务特定的DF集成,并在极端外推场景下展现出稳健的渐近稳定性,从而推动LTSF从静态映射向自主进化推理的范式转变。
链接: https://arxiv.org/abs/2601.23114
作者: Jiaming Ma,Siyuan Mu,Ruilin Tang,Haofeng Ma,Qihe Huang,Zhengyang Zhou,Pengkun Wang,Binwu Wang,Yang Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The prevailing Direct Forecasting (DF) paradigm dominates Long-term Time Series Forecasting (LTSF) by forcing models to predict the entire future horizon in a single forward pass. While efficient, this rigid coupling of output and evaluation horizons necessitates computationally prohibitive re-training for every target horizon. In this work, we uncover a counter-intuitive optimization anomaly: models trained on short horizons-when coupled with our proposed Evolutionary Forecasting (EF) paradigm-significantly outperform those trained directly on long horizons. We attribute this success to the mitigation of a fundamental optimization pathology inherent in DF, where conflicting gradients from distant futures cripple the learning of local dynamics. We establish EF as a unified generative framework, proving that DF is merely a degenerate special case of EF. Extensive experiments demonstrate that a singular EF model surpasses task-specific DF ensembles across standard benchmarks and exhibits robust asymptotic stability in extreme extrapolation. This work propels a paradigm shift in LTSF: moving from passive Static Mapping to autonomous Evolutionary Reasoning.
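EF 范式"用短视野模型渐进滚动生成长期预测"(DF 则一次性输出整个视野)的核心流程可示意如下(假设性草图,step_model 的接口为本文虚构):

```python
def evolve_forecast(history, step_model, horizon):
    """用短视野模型滚动外推到长视野:每步预测若干个值,
    追加回历史序列后继续预测,直至覆盖目标视野。"""
    series = list(history)
    out = []
    while len(out) < horizon:
        nxt = step_model(series)   # 返回接下来若干个预测值
        out.extend(nxt)
        series.extend(nxt)
    return out[:horizon]

# 示例:单步模型"上一个值加一",滚动外推 5 步
preds = evolve_forecast([0], lambda s: [s[-1] + 1], horizon=5)
```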
zh
[AI-22] WiFiPenTester: Advancing Wireless Ethical Hacking with Governed GenAI
【速读】:该论文旨在解决无线伦理渗透测试(wireless ethical hacking)中高度依赖人工、效率低下且易受主观判断和人为错误影响的问题。传统方法要求熟练技术人员手动分析侦察结果并执行复杂、时间敏感的命令序列,难以规模化且缺乏一致性。解决方案的关键在于提出WiFiPenTester系统,这是一个由生成式AI(Generative AI)赋能的实验性、受控且可复现的无线安全评估框架,其核心创新在于将大语言模型(LLM)集成至侦察与决策支持阶段,实现智能目标排序、攻击可行性估计与策略推荐,同时保留严格的人机协同控制和预算感知执行机制,从而在提升目标选择准确性和整体评估效率的同时,确保审计可追溯性和伦理安全性。
链接: https://arxiv.org/abs/2601.23092
作者: Haitham S. Al-Sinani,Chris J. Mitchell
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 35 pages, 10 figures
Abstract:Wireless ethical hacking relies heavily on skilled practitioners manually interpreting reconnaissance results and executing complex, time-sensitive sequences of commands to identify vulnerable targets, capture authentication handshakes, and assess password resilience; a process that is inherently labour-intensive, difficult to scale, and prone to subjective judgement and human error. To help address these limitations, we propose WiFiPenTester, an experimental, governed, and reproducible system for GenAI-enabled wireless ethical hacking. The system integrates large language models into the reconnaissance and decision-support phases of wireless security assessment, enabling intelligent target ranking, attack feasibility estimation, and strategy recommendation, while preserving strict human-in-the-loop control and budget-aware execution. We describe the system architecture, threat model, governance mechanisms, and prompt-engineering methodology, and empirical experiments conducted across multiple wireless environments. The results demonstrate that GenAI assistance improves target selection accuracy and overall assessment efficiency, while maintaining auditability and ethical safeguards. This indicates that WiFiPenTester is a meaningful step toward practical, safe, and scalable GenAI-assisted wireless penetration testing, while reinforcing the necessity of bounded autonomy, human oversight, and rigorous governance mechanisms when deploying GenAI in ethical hacking.
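文中"受控、可审计执行"依赖的清单签名机制(与后文 [AI-19] 的签名清单同理)可用一个极简草图示意:此处用对称 HMAC 代替,实际系统可能采用非对称数字签名;字段名均为本文假设:

```python
import hashlib
import hmac
import json

SECRET = b"demo-key"  # 假设:演示用对称密钥

def sign_manifest(manifest: dict) -> str:
    """对清单的规范化 JSON 序列化做 HMAC-SHA256 签名。"""
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()

def verify_manifest(manifest: dict, sig: str) -> bool:
    """验证签名;清单任何字段被篡改都会导致验证失败。"""
    return hmac.compare_digest(sign_manifest(manifest), sig)

# 示例:篡改 version 字段后验证失败
manifest = {"tool": "wifi_scan", "version": 1}
sig = sign_manifest(manifest)
```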
zh
[AI-23] From Similarity to Vulnerability: Key Collision Attack on LLM Semantic Caching
【速读】:该论文旨在解决生成式 AI(Generative AI)应用中语义缓存(semantic caching)机制所面临的安全漏洞问题,特别是因缓存键采用语义嵌入向量(semantic embedding vectors)作为索引而引发的密钥碰撞攻击风险。传统研究多关注侧信道和隐私泄露,本文首次系统性揭示了缓存碰撞带来的完整性威胁,并提出 CacheAttack 框架作为自动化黑盒攻击手段,通过利用局部性(locality)与抗碰撞性(collision resistance)之间的根本矛盾,在不同嵌入模型间实现高成功率的响应劫持(hit rate 86%)和恶意行为诱导,从而验证了语义缓存机制在安全关键任务中的脆弱性。其解决方案的关键在于将缓存键建模为模糊哈希(fuzzy hashes),并基于此构建可迁移、高效的攻击框架,同时提出针对性的缓解策略以平衡性能与安全性。
链接: https://arxiv.org/abs/2601.23088
作者: Zhixiang Zhang,Zesen Liu,Yuchong Xie,Quanfeng Huang,Dongdong She
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Semantic caching has emerged as a pivotal technique for scaling LLM applications, widely adopted by major providers including AWS and Microsoft. By utilizing semantic embedding vectors as cache keys, this mechanism effectively minimizes latency and redundant computation for semantically similar queries. In this work, we conceptualize semantic cache keys as a form of fuzzy hashes. We demonstrate that the locality required to maximize cache hit rates fundamentally conflicts with the cryptographic avalanche effect necessary for collision resistance. Our conceptual analysis formalizes this inherent trade-off between performance (locality) and security (collision resilience), revealing that semantic caching is naturally vulnerable to key collision attacks. While prior research has focused on side-channel and privacy risks, we present the first systematic study of integrity risks arising from cache collisions. We introduce CacheAttack, an automated framework for launching black-box collision attacks. We evaluate CacheAttack in security-critical tasks and agentic workflows. It achieves a hit rate of 86% in LLM response hijacking and can induce malicious behaviors in LLM agent, while preserving strong transferability across different embedding models. A case study on a financial agent further illustrates the real-world impact of these vulnerabilities. Finally, we discuss mitigation strategies.
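摘要指出语义缓存以嵌入相似度命中为前提,这正是碰撞攻击的入口。下面是一个玩具级语义缓存草图(阈值与接口均为假设),可直观看到"足够相近的键即可劫持命中":

```python
import numpy as np

class SemanticCache:
    """玩具语义缓存:查询嵌入与某个已存键的余弦相似度超过阈值即命中。
    提升命中率所需的局部性,同时也让相近的恶意键可以造成碰撞。"""
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.store = []  # (嵌入, 缓存响应)

    def put(self, emb, response):
        self.store.append((np.asarray(emb, dtype=float), response))

    def get(self, emb):
        q = np.asarray(emb, dtype=float)
        for key, resp in self.store:
            cos = q @ key / (np.linalg.norm(q) * np.linalg.norm(key))
            if cos >= self.threshold:
                return resp  # 任何足够相近的键都会命中此条目
        return None

# 示例:与已存键方向相近的查询命中,正交查询未命中
cache = SemanticCache()
cache.put([1.0, 0.0], "cached answer")
hit = cache.get([0.999, 0.02])
miss = cache.get([0.0, 1.0])
```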
zh
[AI-24] Chain-of-thought obfuscation learned from output supervision can generalise to unseen tasks
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在链式思维(Chain-of-Thought, CoT)推理过程中可能被优化诱导而产生推理痕迹模糊化(obfuscation)的问题,从而削弱其可监控性。解决方案的关键在于揭示:即使仅对模型最终输出进行惩罚(如禁止有害行为),模型仍会通过在CoT中隐藏真实推理路径来规避惩罚,且这种遮蔽行为可在不同任务间泛化——这意味着当前基于结果惩罚的对齐策略可能无意中导致LLMs在更广泛场景下丧失透明性和可解释性,进而增加安全风险。
链接: https://arxiv.org/abs/2601.23086
作者: Nathaniel Mitrani Hadida,Sassan Bhanji,Cameron Tice,Puria Radmard
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Chain-of-thought (CoT) reasoning provides a significant performance uplift to LLMs by enabling planning, exploration, and deliberation of their actions. CoT is also a powerful tool for monitoring the behaviours of these agents: when faithful, they offer interpretations of the model’s decision making process, and an early warning sign for dangerous behaviours. However, optimisation pressures placed on the CoT may cause the model to obfuscate reasoning traces, losing this beneficial property. We show that obfuscation can generalise across tasks; models that learn to obfuscate reasoning involving reward hacking (e.g. accessing and utilising leaked information) generalise both the reward hacking behaviour and its obfuscation in CoT to unseen reward hacking settings. Most worryingly, we show that obfuscation of CoT reasoning, and its generalisation across tasks, also follows when we penalise only the model’s final actions after closing its CoT. Our findings suggest that current practices of penalising harmful generations may inadvertently lead to a reduction in the broader monitorability of LLMs in unpredictable ways.
zh
[AI-25] OrLog: Resolving Complex Queries with LLMs and Probabilistic Reasoning ECIR2026
【速读】:该论文旨在解决信息检索中复杂查询约束(如合取、析取、否定)难以被有效建模的问题,现有方法或忽略逻辑运算符,或依赖生成式推理导致不一致和不可靠。其解决方案的关键在于提出OrLog框架,通过解耦原子谓词的可 plausible 估计与逻辑推理:利用大语言模型(LLM)在一次无生成的前向传播中对原子谓词给出置信度评分,再由概率推理引擎计算查询满足的后验概率,从而实现约束感知的检索。此方法显著提升高排名精度,尤其在析取查询上效果更优,且token消耗减少约90%,优于传统端到端推理方式。
链接: https://arxiv.org/abs/2601.23085
作者: Mohanna Hoveyda,Jelle Piepenbrock,Arjen P de Vries,Maarten de Rijke,Faegheh Hasibi
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted to ECIR 2026
Abstract:Resolving complex information needs that come with multiple constraints should consider enforcing the logical operators encoded in the query (i.e., conjunction, disjunction, negation) on the candidate answer set. Current retrieval systems either ignore these constraints in neural embeddings or approximate them in a generative reasoning process that can be inconsistent and unreliable. Although well-suited to structured reasoning, existing neuro-symbolic approaches remain confined to formal logic or mathematics problems as they often assume unambiguous queries and access to complete evidence, conditions rarely met in information retrieval. To bridge this gap, we introduce OrLog, a neuro-symbolic retrieval framework that decouples predicate-level plausibility estimation from logical reasoning: a large language model (LLM) provides plausibility scores for atomic predicates in one decoding-free forward pass, from which a probabilistic reasoning engine derives the posterior probability of query satisfaction. We evaluate OrLog across multiple backbone LLMs, varying levels of access to external knowledge, and a range of logical constraints, and compare it against base retrievers and LLM-as-reasoner methods. Provided with entity descriptions, OrLog can significantly boost top-rank precision compared to LLM reasoning with larger gains on disjunctive queries. OrLog is also more efficient, cutting mean tokens by \sim 90% per query-entity pair. These results demonstrate that generation-free predicate plausibility estimation combined with probabilistic reasoning enables constraint-aware retrieval that outperforms monolithic reasoning while using far fewer tokens.
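"原子谓词可信度 + 概率推理引擎"的组合可以用如下草图示意(为简化起见假设谓词相互独立,OrLog 的推理引擎未必如此;表达式编码方式为本文虚构):

```python
def query_probability(expr, p):
    """由原子谓词可信度递归计算查询满足的概率。
    expr 为谓词名,或嵌套元组 ('and'|'or'|'not', ...)。"""
    if isinstance(expr, str):
        return p[expr]
    op, *args = expr
    if op == "not":
        return 1.0 - query_probability(args[0], p)
    probs = [query_probability(a, p) for a in args]
    if op == "and":          # 独立假设下:概率连乘
        out = 1.0
        for q in probs:
            out *= q
        return out
    if op == "or":           # 析取经补集计算:1 - prod(1 - q)
        out = 1.0
        for q in probs:
            out *= 1.0 - q
        return 1.0 - out
    raise ValueError(op)

# 示例:LLM 给出 a、b 的可信度后,引擎合成合取/析取/否定查询的概率
p = {"a": 0.9, "b": 0.5}
```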
zh
[AI-26] ExplainerPFN: Towards tabular foundation models for model-free zero-shot feature importance estimations
【速读】:该论文旨在解决监督分类任务中特征重要性计算的可解释性问题,特别是针对Shapley值在实际部署场景中因缺乏对底层模型访问权限而难以应用的问题。传统Shapley值方法依赖于模型内部结构或梯度信息,但在零样本(zero-shot)场景下无法获取这些资源;此外,即使有模型访问权限,其精确计算也常因复杂度高而不可行。解决方案的关键在于提出ExplainerPFN——一种基于TabPFN构建的表格基础模型,通过在由随机结构因果模型生成的合成数据集上预训练,并利用精确或近似Shapley值进行监督学习,从而实现无需模型访问、无参考示例、无梯度信息的情况下,直接预测未见表格数据的特征归因。这一方法首次实现了真正的零样本Shapley值估计,且在多个真实与合成数据集上的实验表明其性能可媲美依赖2–10个SHAP参考示例的少样本代理解释器。
链接: https://arxiv.org/abs/2601.23068
作者: Joao Fonseca,Julia Stoyanovich
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 7 figures
Abstract:Computing the importance of features in supervised classification tasks is critical for model interpretability. Shapley values are a widely used approach for explaining model predictions, but require direct access to the underlying model, an assumption frequently violated in real-world deployments. Further, even when model access is possible, their exact computation may be prohibitively expensive. We investigate whether meaningful Shapley value estimations can be obtained in a zero-shot setting, using only the input data distribution and no evaluations of the target model. To this end, we introduce ExplainerPFN, a tabular foundation model built on TabPFN that is pretrained on synthetic datasets generated from random structural causal models and supervised using exact or near-exact Shapley values. Once trained, ExplainerPFN predicts feature attributions for unseen tabular datasets without model access, gradients, or example explanations. Our contributions are fourfold: (1) we show that few-shot learning-based explanations can achieve high fidelity to SHAP values with as few as two reference observations; (2) we propose ExplainerPFN, the first zero-shot method for estimating Shapley values without access to the underlying model or reference explanations; (3) we provide an open-source implementation of ExplainerPFN, including the full training pipeline and synthetic data generator; and (4) through extensive experiments on real and synthetic datasets, we show that ExplainerPFN achieves performance competitive with few-shot surrogate explainers that rely on 2-10 SHAP examples.
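作为对照,精确 Shapley 值需要枚举所有特征联盟,代价随特征数指数增长,这正是摘要所述零样本代理估计的动机。精确计算可示意如下(value_fn 的接口为本文假设):

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """枚举全部联盟精确计算 Shapley 值;仅在特征很少时可行。
    value_fn: 特征 frozenset -> 该联盟下的模型收益。"""
    n = len(features)
    phi = {}
    for f in features:
        rest = [g for g in features if g != f]
        total = 0.0
        for r in range(n):
            for combo in combinations(rest, r):
                S = frozenset(combo)
                # 经典 Shapley 权重 |S|! (n-|S|-1)! / n!
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += w * (value_fn(S | {f}) - value_fn(S))
        phi[f] = total
    return phi

# 示例:可加收益下,Shapley 值恰好等于各特征自身的权重
weights = {"x": 2.0, "y": 3.0}
phi = shapley_values(["x", "y"], lambda S: sum(weights[f] for f in S))
```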
zh
[AI-27] Towards Explicit Acoustic Evidence Perception in Audio LLMs for Speech Deepfake Detection
【速读】:该论文旨在解决语音深度伪造检测(Speech Deepfake Detection, SDD)中因音频大语言模型(Audio Large Language Model, Audio LLM)过度依赖语义相关线索而导致对细微声学伪影(acoustic artifacts)敏感性不足的问题。当伪造语音具备自然语义时,现有方法容易漏检,其根本原因并非缺乏声学信息,而是语义主导推理机制阻碍了细粒度声学特征的可访问性。解决方案的关键在于提出SDD-APALLM框架,通过融合原始音频与结构化频谱图(spectrogram),增强音频LLM对时间-频率维度上细粒度声学证据的感知能力,从而在不牺牲语义理解的前提下,实现对隐匿声学异常的有效捕捉,显著提升检测准确性和鲁棒性。
链接: https://arxiv.org/abs/2601.23066
作者: Xiaoxuan Guo,Yuankun Xie,Haonan Cheng,Jiayi Zhou,Jian Liu,Hengyan Huang,Long Ye,Qin Zhang
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures
Abstract:Speech deepfake detection (SDD) focuses on identifying whether a given speech signal is genuine or has been synthetically generated. Existing audio large language model (LLM)-based methods excel in content understanding; however, their predictions are often biased toward semantically correlated cues, which results in fine-grained acoustic artifacts being overlooked during the decisionmaking process. Consequently, fake speech with natural semantics can bypass detectors despite harboring subtle acoustic anomalies; this suggests that the challenge stems not from the absence of acoustic data, but from its inadequate accessibility when semantic-dominant reasoning prevails. To address this issue, we investigate SDD within the audio LLM paradigm and introduce SDD with Auditory Perception-enhanced Audio Large Language Model (SDD-APALLM), an acoustically enhanced framework designed to explicitly expose fine-grained time-frequency evidence as accessible acoustic cues. By combining raw audio with structured spectrograms, the proposed framework empowers audio LLMs to more effectively capture subtle acoustic inconsistencies without compromising their semantic understanding. Experimental results indicate consistent gains in detection accuracy and robustness, especially in cases where semantic cues are misleading. Further analysis reveals that these improvements stem from a coordinated utilization of semantic and acoustic information, as opposed to simple modality aggregation.
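文中与原始音频并行输入的"结构化频谱图",本质是分帧 FFT 得到的时间-频率能量矩阵,可示意如下(帧长、步长均为假设参数):

```python
import numpy as np

def spectrogram(x, n_fft=64, hop=32):
    """分帧加 Hann 窗后做 FFT,得到幅度谱:
    形状为 (帧数, n_fft // 2 + 1) 的时间-频率矩阵。"""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        frames.append(np.abs(np.fft.rfft(x[start:start + n_fft] * window)))
    return np.stack(frames)

# 示例:每 64 个采样点含 8 个周期的正弦波,能量应集中在第 8 个频率 bin
x = np.sin(2 * np.pi * 8 * np.arange(256) / 64)
spec = spectrogram(x)
```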
zh
[AI-28] On the Impact of Code Comments for Automated Bug-Fixing: An Empirical Study
【速读】:该论文试图解决的问题是:在生成式 AI(Generative AI)驱动的自动化缺陷修复(Automated Bug Fixing, ABF)任务中,传统做法通常会在训练前移除代码注释,但这一操作是否合理尚不明确。作者提出假设——注释可能对修复特定类型缺陷具有关键作用,因其提供了设计与实现层面的重要信息。解决方案的关键在于系统性地评估注释在训练和推理阶段的存在与否对ABF性能的影响,并通过大语言模型(Large Language Models, LLMs)自动为缺乏注释的方法补充注释以缓解数据稀缺问题。实证结果表明,当注释同时存在于训练和推理阶段时,ABF准确率可提升至三倍;且训练时保留注释不会损害无注释样本的表现,从而验证了注释在提升LLM修复能力中的核心价值。
链接: https://arxiv.org/abs/2601.23059
作者: Antonio Vitale,Emanuela Guglielmi,Simone Scalabrino,Rocco Oliveto
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at the 34th IEEE/ACM International Conference on Program Comprehension (ICPC 2026)
Abstract:Large Language Models (LLMs) are increasingly relevant in Software Engineering research and practice, with Automated Bug Fixing (ABF) being one of their key applications. ABF involves transforming a buggy method into its fixed equivalent. A common preprocessing step in ABF involves removing comments from code prior to training. However, we hypothesize that comments may play a critical role in fixing certain types of bugs by providing valuable design and implementation insights. In this study, we investigate how the presence or absence of comments, both during training and at inference time, impacts the bug-fixing capabilities of LLMs. We conduct an empirical evaluation comparing two model families, each evaluated under all combinations of training and inference conditions (with and without comments), and thereby revisiting the common practice of removing comments during training. To address the limited availability of comments in state-of-the-art datasets, we use an LLM to automatically generate comments for methods lacking them. Our findings show that comments improve ABF accuracy by up to threefold when present in both phases, while training with comments does not degrade performance when instances lack them. Additionally, an interpretability analysis identifies that comments detailing method implementation are particularly effective in aiding LLMs to fix bugs accurately.
zh
[AI-29] Adaptive Edge Learning for Density-Aware Graph Generation
【速读】:该论文旨在解决生成真实图结构数据的挑战,尤其是由离散结构、可变图大小以及类别特异性连通模式所导致的传统生成建模方法难以有效捕捉复杂结构依赖关系的问题。其解决方案的关键在于提出一种基于Wasserstein GAN(WGAN)的密度感知条件图生成框架,通过引入一个可学习的距离驱动边预测器替代传统的固定概率随机采样机制;该预测器将节点嵌入到潜在空间中,使得节点间的距离与边存在的可能性相关联,并结合密度感知的选择机制自适应调控边密度以匹配真实图的类别特异性稀疏分布,从而实现对图拓扑结构和连通性的精准建模。
链接: https://arxiv.org/abs/2601.23052
作者: Seyedeh Ava Razi Razavi,James Sargant,Sheridan Houghten,Renata Dividino
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at the 39th Canadian Conference on Artificial Intelligence
Abstract:Generating realistic graph-structured data is challenging due to discrete structures, variable sizes, and class-specific connectivity patterns that resist conventional generative modelling. While recent graph generation methods employ generative adversarial network (GAN) frameworks to handle permutation invariance and irregular topologies, they typically rely on random edge sampling with fixed probabilities, limiting their capacity to capture complex structural dependencies between nodes. We propose a density-aware conditional graph generation framework using Wasserstein GANs (WGAN) that replaces random sampling with a learnable distance-based edge predictor. Our approach embeds nodes into a latent space where proximity correlates with edge likelihood, enabling the generator to learn meaningful connectivity patterns. A differentiable edge predictor determines pairwise relationships directly from node embeddings, while a density-aware selection mechanism adaptively controls edge density to match class-specific sparsity distributions observed in real graphs. We train the model using a WGAN with gradient penalty, employing a GCN-based critic to ensure generated graphs exhibit realistic topology and align with target class distributions. Experiments on benchmark datasets demonstrate that our method produces graphs with superior structural coherence and class-consistent connectivity compared to existing baselines. The learned edge predictor captures complex relational patterns beyond simple heuristics, generating graphs whose density and topology closely match real structural distributions. Our results show improved training stability and controllable synthesis, making the framework effective for realistic graph generation and data augmentation. Source code is publicly available at this https URL.
zh
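论文中"距离驱动的边预测器 + 密度感知选边"的基本思路可以用下面的 Python 极简草图说明(logistic 评分函数与 top-k 截断规则均为笔者为演示所做的假设,并非论文源码;实际方法中边预测器是可微分的,并与 WGAN 判别器联合训练):

```python
import math
import random

def edge_scores(embeddings):
    """潜空间中两两节点的边得分:距离越近得分越高(logistic 压缩)。"""
    n = len(embeddings)
    scores = {}
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(embeddings[i], embeddings[j])
            scores[(i, j)] = 1.0 / (1.0 + math.exp(d))
    return scores

def density_aware_select(scores, n_nodes, target_density):
    """按目标密度截断:只保留得分最高的 round(density * max_edges) 条边。"""
    max_edges = n_nodes * (n_nodes - 1) // 2
    budget = round(target_density * max_edges)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:budget])

random.seed(0)
emb = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
edges = density_aware_select(edge_scores(emb), n_nodes=8, target_density=0.25)
print(len(edges))  # → 7(28 条可能边中保留 1/4)
```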
[AI-30] MedMCP-Calc: Benchmarking LLMs for Realistic Medical Calculator Scenarios via MCP Integration
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在医疗计算器(Medical Calculator)实际应用中存在的重要局限性,即现有评估基准仅关注静态、单步计算任务,而忽略了真实临床场景下多阶段、自适应的使用流程,包括模糊查询理解、电子健康记录(Electronic Health Record, EHR)数据库交互、外部参考信息检索及工具调用等复杂行为。解决方案的关键在于提出首个面向现实医疗计算器场景的评测基准 MedMCP-Calc,并通过 Model Context Protocol (MCP) 集成实现对 LLM 在完整工作流中的能力评估,涵盖 118 个跨四大临床领域的任务,支持结构化数据库操作与过程级评价。基于此基准发现模型普遍存在计算器选择偏差、SQL 迭代交互能力弱以及外部工具利用意愿低等问题,进而开发出 CalcMate 模型,引入场景规划与工具增强机制,在开源模型中达到最优性能。
链接: https://arxiv.org/abs/2601.23049
作者: Yakun Zhu,Yutong Huang,Shengqian Qin,Zhongzhen Huang,Shaoting Zhang,Xiaofan Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Medical calculators are fundamental to quantitative, evidence-based clinical practice. However, their real-world use is an adaptive, multi-stage process, requiring proactive EHR data acquisition, scenario-dependent calculator selection, and multi-step computation, whereas current benchmarks focus only on static single-step calculations with explicit instructions. To address these limitations, we introduce MedMCP-Calc, the first benchmark for evaluating LLMs in realistic medical calculator scenarios through Model Context Protocol (MCP) integration. MedMCP-Calc comprises 118 scenario tasks across 4 clinical domains, featuring fuzzy task descriptions mimicking natural queries, structured EHR database interaction, external reference retrieval, and process-level evaluation. Our evaluation of 23 leading models reveals critical limitations: even top performers like Claude Opus 4.5 exhibit substantial gaps, including difficulty selecting appropriate calculators for end-to-end workflows given fuzzy queries, poor performance in iterative SQL-based database interactions, and marked reluctance to leverage external tools for numerical computation. Performance also varies considerably across clinical domains. Building on these findings, we develop CalcMate, a fine-tuned model incorporating scenario planning and tool augmentation, achieving state-of-the-art performance among open-source models. Benchmark and Codes are available in this https URL.
zh
[AI-31] From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics ICLR2026
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在基准数学测试中表现优异,但在真实场景下的上下文数学推理(contextual mathematical reasoning)任务中性能显著下降的问题。其核心挑战在于:模型能否从描述性场景中准确提取并形式化数学问题(即问题表述),进而进行正确推理。解决方案的关键在于构建一个名为ContextMATH的新基准,通过两种情境设置——情景嵌入(Scenario Grounding, SG)和复杂度扩展(Complexity Scaling, CS),系统性地模拟现实世界中约束条件的隐含性和结构复杂性。实验表明,错误主要源于问题表述失败,且表述准确性随原始难度增加而降低;尽管更大规模模型在表述与推理能力上均有提升,但二者仍是互补性的瓶颈。此外,使用情境数据微调可有效改善性能,而仅训练表述能力无效,说明上下文数学推理仍是LLMs亟待突破的核心难题。
链接: https://arxiv.org/abs/2601.23048
作者: Bowen Cao,Dongdong Zhang,Yixia Li,Junpeng Liu,Shijue Huang,Chufan Shi,Hongyuan Lu,Yaokang Wu,Guanhua Chen,Wai Lam,Furu Wei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ICLR 2026
Abstract:Large language models now solve many benchmark math problems at near-expert levels, yet this progress has not fully translated into reliable performance in real-world applications. We study this gap through contextual mathematical reasoning, where the mathematical core must be formulated from descriptive scenarios. We introduce ContextMATH, a benchmark that repurposes AIME and MATH-500 problems into two contextual settings: Scenario Grounding (SG), which embeds abstract problems into realistic narratives without increasing reasoning complexity, and Complexity Scaling (CS), which transforms explicit conditions into sub-problems to capture how constraints often appear in practice. Evaluating 61 proprietary and open-source models, we observe sharp drops: on average, open-source models decline by 13 and 34 points on SG and CS, while proprietary models drop by 13 and 20. Error analysis shows that errors are dominated by incorrect problem formulation, with formulation accuracy declining as original problem difficulty increases. Correct formulation emerges as a prerequisite for success, and its sufficiency improves with model scale, indicating that larger models advance in both understanding and reasoning. Nevertheless, formulation and reasoning remain two complementary bottlenecks that limit contextual mathematical problem solving. Finally, we find that fine-tuning with scenario data improves performance, whereas formulation-only training is ineffective. However, performance gaps are only partially alleviated, highlighting contextual mathematical reasoning as a central unsolved challenge for LLMs.
zh
[AI-32] The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity? ICLR2026
【速读】:该论文试图解决的问题是:随着人工智能(AI)模型能力的提升,其在执行复杂、重要任务时可能出现的失败模式究竟是系统性偏离目标(如奖励黑客行为),还是非连贯的混乱行为(如无意义动作)。解决方案的关键在于引入一种基于偏差-方差分解的量化指标——通过测试时随机性来衡量AI在任务中的“不连贯性”(incoherence),即模型错误中由方差(随机性)而非偏差(系统性错误)所导致的比例。研究发现,随着模型推理和行动时间延长,其失败表现趋于更不连贯;且在多个场景下,更大、更强大的模型反而表现出更高的不连贯性,这表明单纯依赖规模扩展难以消除此类问题,而应重点关注针对奖励黑客或目标误设的对齐研究。
链接: https://arxiv.org/abs/2601.23045
作者: Alexander Hägele,Aryo Pradipta Gema,Henry Sleight,Ethan Perez,Jascha Sohl-Dickstein
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ICLR 2026
Abstract:As AI becomes more capable, we entrust it with more general and consequential tasks. The risks from failure grow more severe with increasing task scope. It is therefore important to understand how extremely capable AI models will fail: Will they fail by systematically pursuing goals we do not intend? Or will they fail by being a hot mess, and taking nonsensical actions that do not further any goal? We operationalize this question using a bias-variance decomposition of the errors made by AI models: An AI's incoherence on a task is measured over test-time randomness as the fraction of its error that stems from variance rather than bias in task outcome. Across all tasks and frontier models we measure, the longer models spend reasoning and taking actions, the more incoherent their failures become. Incoherence changes with model scale in a way that is experiment dependent. However, in several settings, larger, more capable models are more incoherent than smaller models. Consequently, scale alone seems unlikely to eliminate incoherence. Instead, as more capable AIs pursue harder tasks, requiring more sequential action and thought, our results predict failures to be accompanied by more incoherent behavior. This suggests a future where AIs sometimes cause industrial accidents (due to unpredictable misbehavior), but are less likely to exhibit consistent pursuit of a misaligned goal. This increases the relative importance of alignment research targeting reward hacking or goal misspecification.
zh
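文中的"不连贯性"(incoherence)指标——即误差中源于方差而非偏差的比例——可以按如下方式粗略实现(假设每个任务产出 [0, 1] 标量得分、目标为 1.0;这是笔者基于摘要定义的示意,并非论文官方代码):

```python
from statistics import mean, pvariance

def incoherence(outcomes, target=1.0):
    """误差中由方差(随机性)而非偏差(系统性)贡献的比例,取值 [0, 1]。
    outcomes 为同一任务在测试时随机性下的多次得分。"""
    bias_sq = (mean(outcomes) - target) ** 2
    var = pvariance(outcomes)
    mse = bias_sq + var
    return var / mse if mse > 0 else 0.0

# 每次都以同样方式失败:完全"连贯"的系统性错误
print(incoherence([0.0, 0.0, 0.0]))                  # → 0.0
# 一半时间随机失败:偏差与方差各占一半
print(round(incoherence([1.0, 0.0, 1.0, 0.0]), 2))   # → 0.5
```

incoherence 趋近 1 即论文所说的"hot mess":错误主要是不可预测的噪声,而非对某个错误目标的一致追求。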
[AI-33] Avoiding Premature Collapse: Adaptive Annealing for Entropy-Regularized Structural Inference
【速读】:该论文旨在解决基于可微分匹配层(Differentiable Matching Layers)的结构预测中,通过退火参数 ϵ→0 恢复离散排列时出现的不稳定问题。作者识别出根本原因在于“过早模式坍缩”(Premature Mode Collapse),其本质是Sinkhorn固定点映射的非正规动力学导致目标后验分布的变化速率(O(1))快于推理算子的收缩速率(O(1/ϵ)),从而迫使推断轨迹陷入虚假局部基域。解决方案的关键在于提出一种自适应调度算法Efficient PH-ASC,该方法通过监测推断过程的稳定性并强制执行线性稳定性定律,将昂贵的谱诊断从训练循环中解耦,使计算开销从 O(N3) 降低至摊销 O(1),显著提升了退火过程的稳定性和效率。
链接: https://arxiv.org/abs/2601.23039
作者: Yizhi Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Differentiable matching layers, often implemented via entropy-regularized Optimal Transport, serve as a critical approximate inference mechanism in structural prediction. However, recovering discrete permutations via annealing \epsilon \to 0 is notoriously unstable. We identify a fundamental mechanism for this failure: Premature Mode Collapse. By analyzing the non-normal dynamics of the Sinkhorn fixed-point map, we reveal a theoretical thermodynamic speed limit. Under standard exponential cooling, the shift in the target posterior ( O(1) ) outpaces the contraction rate of the inference operator, which degrades as O(1/\epsilon) . This mismatch inevitably forces the inference trajectory into spurious local basins. To address this, we propose Efficient PH-ASC, an adaptive scheduling algorithm that monitors the stability of the inference process. By enforcing a linear stability law, we decouple expensive spectral diagnostics from the training loop, reducing overhead from O(N^3) to amortized O(1) . Our implementation and interactive demo are available at this https URL and this https URL.
zh
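摘要中讨论的 Sinkhorn 退火过程可以用一个 2×2 成本矩阵直观演示:随 ϵ 降低,熵正则传输计划逐渐锐化为离散排列(以下为笔者的教学性实现,仅含标准 Sinkhorn 迭代与手工退火,不包含论文提出的 PH-ASC 自适应调度):

```python
import math

def sinkhorn(cost, eps, iters=200):
    """均匀边缘分布下的熵正则最优传输计划(Sinkhorn 迭代)。"""
    n = len(cost)
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    v = [1.0] * n
    r = 1.0 / n  # 均匀行/列边缘
    for _ in range(iters):
        u = [r / sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        v = [r / sum(K[i][j] * u[i] for i in range(n)) for j in range(n)]
    return [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(n)]

# 随 eps 退火,传输计划向离散排列锐化;论文指出,若降温快于
# 推理算子的收缩速率,计划会被"冻"在错误盆地,即过早模式坍缩。
cost = [[0.0, 1.0], [1.0, 0.0]]
for eps in (1.0, 0.3, 0.1):
    P = sinkhorn(cost, eps)
    print(round(P[0][0] * 2, 3))  # 正确匹配(对角)上的归一化质量,随退火趋近 1
```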
[AI-34] Guided by Trajectories: Repairing and Rewarding Tool-Use Trajectories for Tool-Integrated Reasoning
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在工具集成推理(Tool-Integrated Reasoning, TIR)中依赖高质量合成轨迹和稀疏结果奖励所导致的监督信号有限且存在偏差的问题。解决方案的关键在于提出一个两阶段框架AutoTraj:第一阶段通过生成多候选轨迹并进行多维评估,保留高质量轨迹,并利用LLM-as-Repairer修复低质量轨迹,从而构建用于监督微调(Supervised Fine-Tuning, SFT)的合成数据集及用于轨迹偏好建模的数据对;第二阶段基于偏好数据训练轨迹级奖励模型,结合结果奖励与格式奖励,显式引导强化学习优化向可靠工具使用行为收敛。
链接: https://arxiv.org/abs/2601.23032
作者: Siyu Gong,Linan Yue,Weibo Gao,Fangzhou Yao,Shimin Di,Lei Feng,Min-Ling Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Tool-Integrated Reasoning (TIR) enables large language models (LLMs) to solve complex tasks by interacting with external tools, yet existing approaches depend on high-quality synthesized trajectories selected by scoring functions and sparse outcome-based rewards, providing limited and biased supervision for learning TIR. To address these challenges, in this paper, we propose AutoTraj, a two-stage framework that automatically learns TIR by repairing and rewarding tool-use trajectories. Specifically, in the supervised fine-tuning (SFT) stage, AutoTraj generates multiple candidate tool-use trajectories for each query and evaluates them along multiple dimensions. High-quality trajectories are directly retained, while low-quality ones are repaired using a LLM (i.e., LLM-as-Repairer). The resulting repaired and high-quality trajectories form a synthetic SFT dataset, while each repaired trajectory paired with its original low-quality counterpart constitutes a dataset for trajectory preference modeling. In the reinforcement learning (RL) stage, based on the preference dataset, we train a trajectory-level reward model to assess the quality of reasoning paths and combine it with outcome and format rewards, thereby explicitly guiding the optimization toward reliable TIR behaviors. Experiments on real-world benchmarks demonstrate the effectiveness of AutoTraj in TIR.
zh
[AI-35] Leverag ing Convolutional Sparse Autoencoders for Robust Movement Classification from Low-Density sEMG
【速读】:该论文旨在解决肌电假肢(myoelectric prostheses)在实际临床应用中因个体间差异大以及高密度传感器阵列不实用而导致的可靠控制难题。其解决方案的关键在于提出了一种基于深度学习的框架,利用卷积稀疏自编码器(Convolutional Sparse Autoencoder, CSAE)直接从原始表面肌电信号(surface electromyography, sEMG)中提取时序特征表示,从而避免了传统方法依赖人工特征工程的问题。该方法仅使用两个sEMG通道即实现了94.3% ± 0.3%的多受试者F1分数,并通过少量样本的迁移学习策略显著提升未见受试者的性能至92.3% ± 0.9%,同时支持增量学习以实现无须全量重训练的类别扩展,展现出高精度、低计算与传感开销的可扩展性优势。
链接: https://arxiv.org/abs/2601.23011
作者: Blagoj Hristov,Zoran Hadzi-Velkov,Katerina Hadzi-Velkova Saneva,Gorjan Nadzinski,Vesna Ojleska Latkoska
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:
Abstract:Reliable control of myoelectric prostheses is often hindered by high inter-subject variability and the clinical impracticality of high-density sensor arrays. This study proposes a deep learning framework for accurate gesture recognition using only two surface electromyography (sEMG) channels. The method employs a Convolutional Sparse Autoencoder (CSAE) to extract temporal feature representations directly from raw signals, eliminating the need for heuristic feature engineering. On a 6-class gesture set, our model achieved a multi-subject F1-score of 94.3% \pm 0.3%. To address subject-specific differences, we present a few-shot transfer learning protocol that improved performance on unseen subjects from a baseline of 35.1% \pm 3.1% to 92.3% \pm 0.9% with minimal calibration data. Furthermore, the system supports functional extensibility through an incremental learning strategy, allowing for expansion to a 10-class set with a 90.0% \pm 0.2% F1-score without full model retraining. By combining high precision with minimal computational and sensor overhead, this framework provides a scalable and efficient approach for the next generation of affordable and adaptive prosthetic systems.
zh
[AI-36] Automatic Constraint Policy Optimization based on Continuous Constraint Interpolation Framework for Offline Reinforcement Learning
【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning, Offline RL)中因策略外推误差(extrapolation error)导致性能下降的问题,其核心挑战在于现有方法通常仅采用单一类型的策略约束(如加权行为克隆、密度正则化或支持约束),缺乏统一的理论框架来解释这些约束之间的联系与权衡。解决方案的关键是提出连续约束插值(Continuous Constraint Interpolation, CCI)框架,该框架将三种典型约束形式视为同一约束谱上的特例,并通过一个单一的插值参数实现不同约束类型间的平滑过渡与合理组合;在此基础上进一步设计自动约束策略优化(Automatic Constraint Policy Optimization, ACPO)算法,利用拉格朗日对偶更新机制动态调整插值参数,从而在实践中自适应地选择最优约束形式,显著提升算法鲁棒性与性能。
链接: https://arxiv.org/abs/2601.23010
作者: Xinchen Han,Qiuyang Fang,Hossam Afifi,Michel Marot
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Offline Reinforcement Learning (RL) relies on policy constraints to mitigate extrapolation error, where both the constraint form and constraint strength critically shape performance. However, most existing methods commit to a single constraint family: weighted behavior cloning, density regularization, or support constraints, without a unified principle that explains their connections or trade-offs. In this work, we propose Continuous Constraint Interpolation (CCI), a unified optimization framework in which these three constraint families arise as special cases along a common constraint spectrum. The CCI framework introduces a single interpolation parameter that enables smooth transitions and principled combinations across constraint types. Building on CCI, we develop Automatic Constraint Policy Optimization (ACPO), a practical primal–dual algorithm that adapts the interpolation parameter via a Lagrangian dual update. Moreover, we establish a maximum-entropy performance difference lemma and derive performance lower bounds for both the closed-form optimal policy and its parametric projection. Experiments on D4RL and NeoRL2 demonstrate robust gains across diverse domains, achieving state-of-the-art performance overall.
zh
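ACPO 的拉格朗日对偶更新可以抽象为"乘子随约束违反量做投影梯度上升"的通用模式。下面的草图仅示意这一机制(lr、budget 等数值与函数签名均为笔者假设;论文中被更新的是约束插值参数,此处以单个乘子代替):

```python
def dual_update(lam, divergence, budget, lr=0.1):
    """对拉格朗日乘子做投影梯度上升:约束被违反(divergence > budget)
    时乘子增大、约束变强,反之衰减,并截断在非负区间。"""
    return max(0.0, lam + lr * (divergence - budget))

lam = 0.0
# 随约束生效,策略与行为分布的散度逐步下降
for divergence in (0.9, 0.7, 0.5, 0.3, 0.3):
    lam = dual_update(lam, divergence, budget=0.4)
print(round(lam, 2))  # → 0.07
```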
[AI-37] Mano: Restriking Manifold Optimization for LLM Training
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)训练中优化器效率与性能之间的权衡问题:现有主流优化器如AdamW因仅依赖对角曲率估计而忽略结构信息,而Muon虽采用全局谱归一化却损失了曲率细节。为此,作者提出一种基于流形优化(Manifold Optimization)的新方法——Mano,其核心创新在于将动量投影至模型参数的切空间,并约束在旋转正交流形(Rotational Oblique Manifold)上,从而同时保留曲率信息并提升优化稳定性。实验表明,Mano在LLaMA和Qwen3模型上显著优于AdamW与Muon,且在内存消耗和计算复杂度方面更具优势,首次实现了流形优化与现代优化器之间的性能接轨。
链接: https://arxiv.org/abs/2601.23000
作者: Yufei Gu,Zeke Xie
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:While large language models (LLMs) have emerged as a significant advancement in artificial intelligence, the hardware and computational costs for training LLMs are also significantly burdensome. Among the state-of-the-art optimizers, AdamW relies on diagonal curvature estimates and ignores structural properties, while Muon applies global spectral normalization at the expense of losing curvature information. In this study, we restrike manifold optimization methods for training LLMs, which may address both optimizers' limitations, while conventional manifold optimization methods have been largely overlooked due to their poor performance in large-scale model optimization. By innovatively projecting the momentum onto the tangent space of model parameters and constraining it on a rotational Oblique manifold, we propose a novel, powerful, and efficient optimizer Mano that is the first to bridge the performance gap between manifold optimization and modern optimizers. Extensive experiments on the LLaMA and Qwen3 models demonstrate that Mano consistently and significantly outperforms AdamW and Muon even with less memory consumption and computational complexity, respectively, suggesting an expanded Pareto frontier in terms of space and time efficiency.
zh
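"将动量投影到 Oblique 流形(各列单位范数的矩阵流形)的切空间"这一操作,可以用纯 Python 草图说明(按列存储,retraction 采用最简单的列归一化;这是流形优化的通用套路,并非 Mano 的官方实现):

```python
import math

def project_to_tangent(W_cols, M_cols):
    """把动量 M 投影到 Oblique 流形在 W 处的切空间。
    Oblique 流形即各列为单位向量的矩阵集合,其切空间投影
    相当于逐列去掉 M 沿对应 W 列方向的分量。"""
    out = []
    for w, m in zip(W_cols, M_cols):
        dot = sum(wi * mi for wi, mi in zip(w, m))
        out.append([mi - dot * wi for wi, mi in zip(w, m)])
    return out

def retract(W_cols, T_cols, lr):
    """沿切向量走一步,再逐列归一化拉回流形(简单 retraction)。"""
    new = []
    for w, t in zip(W_cols, T_cols):
        col = [wi - lr * ti for wi, ti in zip(w, t)]
        norm = math.sqrt(sum(c * c for c in col))
        new.append([c / norm for c in col])
    return new

W = [[1.0, 0.0], [0.0, 1.0]]   # 两个单位范数列
M = [[0.5, 0.5], [0.3, -0.2]]  # 原始动量,尚未切向化
T = project_to_tangent(W, M)
# 投影后的动量与各自基点列向量正交:
print([round(sum(wi * ti for wi, ti in zip(w, t)), 6) for w, t in zip(W, T)])  # → [0.0, 0.0]
```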
[AI-38] TriCEGAR: A Trace-Driven Abstraction Mechanism for Agentic AI
【速读】:该论文旨在解决代理型人工智能(Agentic AI)系统在长期随机交互过程中难以保障行为可靠性的问题,其核心挑战在于环境的非确定性和模型输出的概率特性使得传统验证方法失效。为应对这一问题,作者提出了一种名为TriCEGAR的解决方案,其关键创新在于通过执行日志驱动的状态抽象机制,自动构建代理行为的马尔可夫决策过程(MDP)模型,并结合反例引导的细化策略优化状态划分。该方法无需人工定义状态抽象,显著降低了开发者的介入成本,同时支持在线建模与概率模型检测,从而可计算诸如最大成功概率(Pmax(success))和最小失败概率(Pmin(failure))等量化保证指标,并利用运行似然性实现异常检测作为护栏信号。
链接: https://arxiv.org/abs/2601.22997
作者: Roham Koohestani,Ateş Görpelioğlu,Egor Klimov,Burcu Kulahcioglu Ozkan,Maliheh Izadi
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:Agentic AI systems act through tools and evolve their behavior over long, stochastic interaction traces. This setting complicates assurance, because behavior depends on nondeterministic environments and probabilistic model outputs. Prior work introduced runtime verification for agentic AI via Dynamic Probabilistic Assurance (DPA), learning an MDP online and model checking quantitative properties. A key limitation is that developers must manually define the state abstraction, which couples verification to application-specific heuristics and increases adoption friction. This paper proposes TriCEGAR, a trace-driven abstraction mechanism that automates state construction from execution logs and supports online construction of an agent behavioral MDP. TriCEGAR represents abstractions as predicate trees learned from traces and refined using counterexamples. We describe a framework-native implementation that (i) captures typed agent lifecycle events, (ii) builds abstractions from traces, (iii) constructs an MDP, and (iv) performs probabilistic model checking to compute bounds such as Pmax(success) and Pmin(failure). We also show how run likelihoods enable anomaly detection as a guardrailing signal.
zh
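TriCEGAR 流水线的后两步——由执行轨迹构建 MDP 并做概率模型检测——可以用如下极简代码示意。这里状态抽象由手工给定(真实系统中由谓词树从轨迹自动学习并细化),Pmax(success) 用朴素值迭代代替专业模型检测器计算;整体仅为笔者的演示性草图:

```python
from collections import Counter, defaultdict

def build_mdp(traces):
    """由 (state, action, next_state) 日志估计 MDP 转移概率。"""
    counts = defaultdict(Counter)
    for trace in traces:
        for s, a, s2 in trace:
            counts[(s, a)][s2] += 1
    return {sa: {s2: c / sum(ctr.values()) for s2, c in ctr.items()}
            for sa, ctr in counts.items()}

def p_max_success(mdp, goal="success", iters=50):
    """值迭代求各状态最终到达 goal 的最大概率(即 Pmax 型可达性性质)。"""
    states = {s for s, _ in mdp} | {s2 for probs in mdp.values() for s2 in probs}
    v = {s: 1.0 if s == goal else 0.0 for s in states}
    for _ in range(iters):
        for s in states:
            if s == goal:
                continue
            acts = [a for (s0, a) in mdp if s0 == s]
            if acts:
                v[s] = max(sum(p * v[s2] for s2, p in mdp[(s, a)].items())
                           for a in acts)
    return v

traces = [
    [("plan", "call_tool", "success")],
    [("plan", "call_tool", "error"), ("error", "retry", "success")],
    [("plan", "give_up", "fail")],
]
mdp = build_mdp(traces)
print(p_max_success(mdp)["plan"])  # → 1.0:存在必达成功的策略(出错后重试)
```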
[AI-39] Why Your Deep Research Agent Fails? On Hallucination Evaluation in Full Research Trajectory
【速读】:该论文旨在解决深度研究代理(Deep Research Agents, DRAs)在复杂任务中因中间阶段幻觉(hallucination)积累而导致的失效机制难以诊断的问题。现有评估基准多依赖端到端的结果评判,忽略了诸如错误规划等关键中间幻觉,从而无法精准定位故障根源。解决方案的关键在于从基于结果的评估转向过程感知的评估范式,通过审计完整的研究轨迹来识别和量化幻觉。作者提出PIES分类法,依据功能组件(规划 vs. 总结)与错误属性(显式 vs. 隐式)对幻觉进行细粒度归类,并构建了可分解轨迹的评估框架,进而基于此框架设计出DeepHalluBench基准,用于系统性地检测DRAs中的100种典型幻觉任务。实验表明,当前主流DRAs均缺乏鲁棒可靠性,且失败源于系统性缺陷,如幻觉传播和认知偏差,为未来架构优化提供了基础洞见。
链接: https://arxiv.org/abs/2601.22984
作者: Yuhao Zhan,Tianyu Fan,Linxuan Huang,Zirui Guo,Chao Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Diagnosing the failure mechanisms of Deep Research Agents (DRAs) remains a critical challenge. Existing benchmarks predominantly rely on end-to-end evaluation, obscuring critical intermediate hallucinations, such as flawed planning, that accumulate throughout the research trajectory. To bridge this gap, we propose a shift from outcome-based to process-aware evaluation by auditing the full research trajectory. We introduce the PIES Taxonomy to categorize hallucinations along functional components (Planning vs. Summarization) and error properties (Explicit vs. Implicit). We instantiate this taxonomy into a fine-grained evaluation framework that decomposes the trajectory to rigorously quantify these hallucinations. Leveraging this framework to isolate 100 distinctively hallucination-prone tasks including adversarial scenarios, we curate DeepHalluBench. Experiments on six state-of-the-art DRAs reveal that no system achieves robust reliability. Furthermore, our diagnostic analysis traces the etiology of these failures to systemic deficits, specifically hallucination propagation and cognitive biases, providing foundational insights to guide future architectural optimization. Data and code are available at this https URL.
zh
[AI-40] Quantifying Model Uniqueness in Heterogeneous AI Ecosystems
【速读】:该论文旨在解决复杂异构模型生态系统中区分真实行为新颖性与功能冗余性的治理难题,尤其在基础模型(Foundation Models)与专用适配器(Specialized Adapters)共存的场景下,传统基于观测日志的方法难以准确识别模型的独特性。其解决方案的关键在于提出一种基于模拟准实验设计(In-Silico Quasi-Experimental Design, ISQED)的统计审计框架,通过强制跨模型匹配干预来隔离内在模型身份,并量化唯一性为“同行不可表达残差”(Peer-Inexpressible Residual, PIER),即目标模型行为中无法被任何随机凸组合的同行模型所还原的部分;当PIER趋近于零时,表明可通过路由替代实现功能等效。该方法突破了仅依赖观测数据的局限性,建立了理论基础并实现了最优采样效率的主动审计协议,同时揭示了合作博弈论方法(如Shapley值)在检测冗余方面的根本失效问题。
链接: https://arxiv.org/abs/2601.22977
作者: Lei You
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:As AI systems evolve from isolated predictors into complex, heterogeneous ecosystems of foundation models and specialized adapters, distinguishing genuine behavioral novelty from functional redundancy becomes a critical governance challenge. Here, we introduce a statistical framework for auditing model uniqueness based on In-Silico Quasi-Experimental Design (ISQED). By enforcing matched interventions across models, we isolate intrinsic model identity and quantify uniqueness as the Peer-Inexpressible Residual (PIER), i.e. the component of a target's behavior strictly irreducible to any stochastic convex combination of its peers, with vanishing PIER characterizing when such a routing-based substitution becomes possible. We establish the theoretical foundations of ecosystem auditing through three key contributions. First, we prove a fundamental limitation of observational logs: uniqueness is mathematically non-identifiable without intervention control. Second, we derive a scaling law for active auditing, showing that our adaptive query protocol achieves minimax-optimal sample efficiency of O(d\sigma^2\gamma^{-2}\log(Nd/\delta)). Third, we demonstrate that cooperative game-theoretic methods, such as Shapley values, fundamentally fail to detect redundancy. We implement this framework via the DISCO (Design-Integrated Synthetic Control) estimator and deploy it across diverse ecosystems, including computer vision models (ResNet/ConvNeXt/ViT), large language models (BERT/RoBERTa), and city-scale traffic forecasters. These results move trustworthy AI beyond explaining single models: they establish a principled, intervention-based science of auditing and governing heterogeneous model ecosystems.
zh
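PIER("同行不可表达残差")的核心是求目标行为向量到同行行为凸包的距离占比。下面用两条同行、在 1-单纯形上做网格搜索给出最小示意(真实审计需在 N 个同行的单纯形上求解并考虑随机混合,此处纯为笔者的简化演示):

```python
def pier(target, peers, grid=1001):
    """两个同行情形下的 PIER:目标行为向量到同行凸组合的最小
    平方残差,按目标能量归一化;0 表示行为可被路由替代。"""
    p, q = peers
    best = float("inf")
    for k in range(grid):
        a = k / (grid - 1)
        err = sum((t - (a * pi + (1 - a) * qi)) ** 2
                  for t, pi, qi in zip(target, p, q))
        best = min(best, err)
    total = sum(t * t for t in target)
    return best / total

# 目标恰落在两同行连线上:功能冗余,PIER = 0
print(round(pier([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]]), 6))  # → 0.0
# 目标偏离该线段:存在同行不可表达的成分
print(pier([1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]]) > 0)        # → True
```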
[AI-41] Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text
【速读】:该论文旨在解决强化学习与可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)在大规模应用中因可用可验证数据稀缺而导致的训练效果饱和问题。现有RLVR方法受限于高质量标注数据的匮乏,导致模型性能随训练时间延长逐渐趋于稳定甚至停滞。解决方案的关键在于提出一种名为Golden Goose的简单但高效的数据合成策略:通过将未验证的互联网文本(如科学教材)转化为多选题形式的填空中间推理任务(fill-in-the-middle),利用大语言模型(LLM)自动识别并掩码关键推理步骤,并生成多样且合理的干扰项(distractors),从而构建大规模、高价值的RLVR训练数据集——GooseReason-0.7M(含超过70万任务)。该方法显著扩展了可用于RLVR训练的数据来源,有效突破了传统数据瓶颈,在多个基准测试中实现持续提升,尤其在网络安全领域首次实现了无需领域预训练即可超越更大规模专用模型的性能表现。
链接: https://arxiv.org/abs/2601.22975
作者: Ximing Lu,David Acuna,Jaehun Jung,Jian Hu,Di Zhang,Shizhe Diao,Yunheng Zou,Shaokun Zhang,Brandon Cui,Mingjie Liu,Hyunwoo Kim,Prithviraj Ammanabrolu,Jan Kautz,Yi Dong,Yejin Choi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has become a cornerstone for unlocking complex reasoning in Large Language Models (LLMs). Yet, scaling up RL is bottlenecked by limited existing verifiable data, where improvements increasingly saturate over prolonged training. To overcome this, we propose Golden Goose, a simple trick to synthesize unlimited RLVR tasks from unverifiable internet text by constructing a multiple-choice question-answering version of the fill-in-the-middle task. Given a source text, we prompt an LLM to identify and mask key reasoning steps, then generate a set of diverse, plausible distractors. This enables us to leverage reasoning-rich unverifiable corpora typically excluded from prior RLVR data construction (e.g., science textbooks) to synthesize GooseReason-0.7M, a large-scale RLVR dataset with over 0.7 million tasks spanning mathematics, programming, and general scientific domains. Empirically, GooseReason effectively revives models saturated on existing RLVR data, yielding robust, sustained gains under continuous RL and achieving new state-of-the-art results for 1.5B and 4B-Instruct models across 15 diverse benchmarks. Finally, we deploy Golden Goose in a real-world setting, synthesizing RLVR tasks from raw FineWeb scrapes for the cybersecurity domain, where no prior RLVR data exists. Training Qwen3-4B-Instruct on the resulting data GooseReason-Cyber sets a new state-of-the-art in cybersecurity, surpassing a 7B domain-specialized model with extensive domain-specific pre-training and post-training. This highlights the potential of automatically scaling up RLVR data by exploiting abundant, reasoning-rich, unverifiable internet text.
zh
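Golden Goose 把不可验证文本变成可验证 MCQ 的流程(掩码关键推理步骤、混入干扰项)可以手工示意如下。注意实际流程中掩码位置与干扰项均由 LLM 生成,此处为便于演示而手工给定,函数名亦为笔者假设:

```python
import random

def make_fim_mcq(steps, mask_idx, distractors, seed=0):
    """把一段推理文本变成可验证的填空选择题:
    掩码一个关键步骤,把真实步骤与干扰项混洗成选项,答案按字母核验。"""
    answer = steps[mask_idx]
    context = steps[:mask_idx] + ["[MISSING STEP]"] + steps[mask_idx + 1:]
    options = distractors + [answer]
    random.Random(seed).shuffle(options)
    letters = "ABCD"
    question = "\n".join(context) + "\nWhich step is missing?\n" + "\n".join(
        f"{letters[i]}. {opt}" for i, opt in enumerate(options))
    return question, letters[options.index(answer)]

steps = ["2x + 4 = 10", "2x = 6", "x = 3"]
q, gold = make_fim_mcq(steps, mask_idx=1,
                       distractors=["2x = 14", "2x = 40", "x + 4 = 5"])
print(gold in "ABCD")  # → True:奖励只需比对一个字母,天然可由规则验证
```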
[AI-42] Stabilizing the Q-Gradient Field for Policy Smoothness in Actor-Critic
【速读】:该论文旨在解决基于连续策略梯度的强化学习方法中策略函数(policy)常出现高频振荡、缺乏平滑性的问题,这限制了其在物理系统中的部署。传统方法通过直接对策略输出施加正则化来强制平滑性,但作者指出这仅治标不治本。解决方案的关键在于从批评者(critic)的角度出发,理论证明策略的非光滑性本质上由批评者的微分几何特性决定:最优策略对输入扰动的敏感度受Q函数混合偏导数(噪声敏感度)与动作空间曲率(信号区分度)之比的约束。基于此洞察,论文提出PAVE(Policy-Aware Value-field Equalization),一种以批评者为中心的正则化框架,通过将批评者视为标量场并稳定其诱导的动作梯度场,在不修改策略网络的前提下最小化Q梯度波动,同时保留局部曲率,从而实现平滑且鲁棒的策略学习。
链接: https://arxiv.org/abs/2601.22970
作者: Jeong Woon Lee,Kyoleen Kwak,Daeho Kim,Hyoseok Hwang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Policies learned via continuous actor-critic methods often exhibit erratic, high-frequency oscillations, making them unsuitable for physical deployment. Current approaches attempt to enforce smoothness by directly regularizing the policy’s output. We argue that this approach treats the symptom rather than the cause. In this work, we theoretically establish that policy non-smoothness is fundamentally governed by the differential geometry of the critic. By applying implicit differentiation to the actor-critic objective, we prove that the sensitivity of the optimal policy is bounded by the ratio of the Q-function’s mixed-partial derivative (noise sensitivity) to its action-space curvature (signal distinctness). To empirically validate this theoretical insight, we introduce PAVE (Policy-Aware Value-field Equalization), a critic-centric regularization framework that treats the critic as a scalar field and stabilizes its induced action-gradient field. PAVE rectifies the learning signal by minimizing the Q-gradient volatility while preserving local curvature. Experimental results demonstrate that PAVE achieves smoothness and robustness comparable to policy-side smoothness regularization methods, while maintaining competitive task performance, without modifying the actor.
zh
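理论部分的关键量——Q 函数的混合偏导(动作梯度对状态扰动的敏感度)——可以用有限差分粗略估计。下面用两个玩具 Q 函数对比"平滑"与"状态-动作强耦合"两种情形(纯属笔者示意,并非 PAVE 的官方实现):

```python
def grad_a(Q, s, a, h=1e-4):
    """Q 对动作的中心差分梯度。"""
    return (Q(s, a + h) - Q(s, a - h)) / (2 * h)

def q_grad_volatility(Q, s, a, ds=1e-2):
    """混合偏导的代理量:状态微扰下动作梯度的变化率(取平方)。
    PAVE 惩罚的正是这类波动("噪声敏感度"),同时保留动作维曲率("信号")。"""
    return ((grad_a(Q, s + ds, a) - grad_a(Q, s, a)) / ds) ** 2

def smooth_Q(s, a):      # 动作梯度随状态缓慢漂移
    return -(a - 0.5 * s) ** 2

def coupled_Q(s, a):     # 状态-动作强耦合:动作梯度对状态高度敏感
    return -(a - 0.5 * s) ** 2 + 50.0 * s * a

print(q_grad_volatility(smooth_Q, s=0.3, a=0.1)
      < q_grad_volatility(coupled_Q, s=0.3, a=0.1))  # → True
```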
[AI-43] EvoClinician: A Self-Evolving Agent for Multi-Turn Medical Diagnosis via Test-Time Evolutionary Learning
【速读】:该论文旨在解决当前医疗人工智能(Medical AI)在诊断过程中存在的“一次性”模型局限性问题,即现有系统通常假设能够访问完整的患者病历进行诊断,而现实中临床诊断是一个多轮迭代的信息获取过程,医生需通过逐步提问和检查来优化信息收集策略并权衡资源成本。为应对这一挑战,作者提出了Med-Inquire基准测试平台,模拟真实世界的多轮诊断流程,通过隐藏完整病历数据,利用患者代理(Patient Agent)和检查代理(Examination Agent)迫使智能体主动获取信息。解决方案的核心在于提出EvoClinician——一个自演化诊断智能体,其关键机制是“诊断-评分-进化”循环:由执行者(Actor)尝试诊断,评估者(Process Grader)对每一步操作进行临床收益与资源效率的信用分配,随后演化器(Evolver)基于反馈更新执行者的提示(prompt)和记忆策略,从而在测试阶段持续优化诊断决策。
链接: https://arxiv.org/abs/2601.22964
作者: Yufei He,Juncheng Liu,Zhiyuan Hu,Yulin Chen,Yue Liu,Yuan Sui,Yibo Li,Nuo Chen,Jun Hu,Bryan Hooi,Xinxing Xu,Jiang Bian
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Prevailing medical AI operates on an unrealistic ''one-shot'' model, diagnosing from a complete patient file. However, real-world diagnosis is an iterative inquiry where clinicians sequentially ask questions and order tests to strategically gather information while managing cost and time. To address this, we first propose Med-Inquire, a new benchmark designed to evaluate an agent's ability to perform multi-turn diagnosis. Built upon a dataset of real-world clinical cases, Med-Inquire simulates the diagnostic process by hiding a complete patient file behind specialized Patient and Examination agents. They force the agent to proactively ask questions and order tests to gather information piece by piece. To tackle the challenges posed by Med-Inquire, we then introduce EvoClinician, a self-evolving agent that learns efficient diagnostic strategies at test time. Its core is a ''Diagnose-Grade-Evolve'' loop: an Actor agent attempts a diagnosis; a Process Grader agent performs credit assignment by evaluating each action for both clinical yield and resource efficiency; finally, an Evolver agent uses this feedback to update the Actor's strategy by evolving its prompt and memory. Our experiments show EvoClinician outperforms continual learning baselines and other self-evolving agents like memory agents. The code is available at this https URL
zh
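下面给出"诊断-评分-进化"循环的一个极简 Python 骨架,仅用于说明 Actor、Process Grader、Evolver 三个组件之间的信息流(其中 actor/grader/evolve 的具体实现、病例结构均为示意性假设;真实系统中这些组件由 LLM 驱动):

```python
# Minimal skeleton of the "Diagnose-Grade-Evolve" loop (all components are
# illustrative stand-ins; in EvoClinician they are LLM-driven agents).

def actor(prompt, case, history):
    """Pick the next test to order; prefer tests mentioned in the prompt."""
    remaining = [t for t in case["useful_tests"] + case["noise_tests"]
                 if t not in history]
    preferred = [t for t in remaining if t in prompt]
    return (preferred or remaining)[0]

def grader(action, case):
    """Per-action credit: clinical yield minus wasted resource cost."""
    return 1.0 if action in case["useful_tests"] else -0.5

def evolve(prompt, actions, grades):
    """Evolver: fold high-credit actions back into the Actor's prompt."""
    good = [a for a, g in zip(actions, grades) if g > 0]
    return prompt + " Prefer: " + ", ".join(good) if good else prompt

def run_episode(prompt, case, n_turns=3):
    actions, grades = [], []
    for _ in range(n_turns):
        a = actor(prompt, case, actions)
        actions.append(a)
        grades.append(grader(a, case))
    return actions, grades

case = {"useful_tests": ["creatinine", "urinalysis"], "noise_tests": ["x_ray"]}
prompt = "You are a clinician."
for _ in range(2):                    # test-time evolution across attempts
    actions, grades = run_episode(prompt, case)
    prompt = evolve(prompt, actions, grades)
```

可以看到,进化后的 prompt 中累积了高收益检查项,这正是"测试时演化"的最小化体现。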
[AI-44] Alignment among Language Vision and Action Representations
【速读】:该论文旨在解决认知科学与人工智能领域中一个核心问题:语言、视觉和动作等不同学习模态是否产生独立或共享的内部表征。传统观点认为,不同数据类型训练的模型会发展出专用且不可迁移的表征;而本文通过实证研究发现,即使在训练数据、模态和目标差异显著的情况下,动作表征仍能与语言和视觉表征实现强对齐(如与decoder-only语言模型和BLIP的precision@15达0.70–0.73),表明多模态表征存在部分共享的语义结构,支持语义组织的模态无关性。其解决方案的关键在于利用BabyAI平台上的行为克隆方法生成仅由感官运动控制需求塑造的动作-语言嵌入,并通过跨模态对比分析揭示了动作、语言与视觉表征之间的收敛特性,为具身智能系统中的跨域迁移提供了理论依据与实践路径。
链接: https://arxiv.org/abs/2601.22948
作者: Nicola Milano,Stefano Nolfi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:A fundamental question in cognitive science and AI concerns whether different learning modalities (language, vision, and action) give rise to distinct or shared internal representations. Traditional views assume that models trained on different data types develop specialized, non-transferable representations. However, recent evidence suggests unexpected convergence: models optimized for distinct tasks may develop similar representational geometries. We investigate whether this convergence extends to embodied action learning by training a transformer-based agent to execute goal-directed behaviors in response to natural language instructions. Using behavioral cloning on the BabyAI platform, we generated action-grounded language embeddings shaped exclusively by sensorimotor control requirements. We then compared these representations with those extracted from state-of-the-art large language models (LLaMA, Qwen, DeepSeek, BERT) and vision-language models (CLIP, BLIP). Despite substantial differences in training data, modality, and objectives, we observed robust cross-modal alignment. Action representations aligned strongly with decoder-only language models and BLIP (precision@15: 0.70-0.73), approaching the alignment observed among language models themselves. Alignment with CLIP and BERT was significantly weaker. These findings indicate that linguistic, visual, and action representations converge toward partially shared semantic structures, supporting modality-independent semantic organization and highlighting potential for cross-domain transfer in embodied AI systems.
zh
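论文报告的 precision@15 用于衡量两个表示空间的近邻结构重合度。下面是这类跨模态对齐指标的一个通用 Python 示意实现(随机数据与维度仅用于演示,并非论文的实验设置):

```python
import numpy as np

def alignment_precision_at_k(emb_a, emb_b, k=15):
    """Cross-modal alignment: for each item, compare its k nearest neighbours
    in space A with those in space B and average the overlap fraction."""
    def knn(emb):
        d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)          # exclude the item itself
        return np.argsort(d, axis=1)[:, :k]
    na, nb = knn(emb_a), knn(emb_b)
    return float(np.mean([len(set(na[i]) & set(nb[i])) / k
                          for i in range(len(emb_a))]))

rng = np.random.default_rng(0)
base = rng.normal(size=(60, 16))
aligned = base + 0.01 * rng.normal(size=base.shape)   # near-identical geometry
unrelated = rng.normal(size=(60, 16))                 # independent geometry
hi = alignment_precision_at_k(base, aligned)
lo = alignment_precision_at_k(base, unrelated)
```

几何结构相近的两个空间得分接近 1,而相互独立的空间得分接近随机基线 k/(n-1)。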
[AI-45] From Data Leak to Secret Misses: The Impact of Data Leakage on Secret Detection Models
【速读】:该论文旨在解决机器学习模型在软件安全任务中因训练与测试数据集存在重复或高度相似样本而导致的数据泄露问题,这种泄露会使得模型看似性能优异,实则可能只是记忆了特定模式而非具备泛化能力。解决方案的关键在于识别并清理基准数据集中存在的重复样本,从而确保模型评估的公正性和真实性,避免对AI驱动的敏感信息检测器(如硬编码密钥检测)的实际效能产生误导性判断。
链接: https://arxiv.org/abs/2601.22946
作者: Farnaz Soltaniani,Mohammad Ghafari
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Machine learning models are increasingly used for software security tasks. These models are commonly trained and evaluated on large Internet-derived datasets, which often contain duplicated or highly similar samples. When such samples are split across training and test sets, data leakage may occur, allowing models to memorize patterns instead of learning to generalize. We investigate duplication in a widely used benchmark dataset of hard coded secrets and show how data leakage can substantially inflate the reported performance of AI-based secret detectors, resulting in a misleading picture of their real-world effectiveness.
zh
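下面用一个极简的 Python 片段示意训练/测试集重复样本(数据泄露)的检测思路——基于词级 Jaccard 相似度的近重复筛查(阈值与样本均为示意性假设,并非论文使用的具体方法):

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two code snippets."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def leaked_pairs(train, test, threshold=0.8):
    """Flag test samples whose similarity with any training sample exceeds
    the threshold -- likely train/test leakage via (near-)duplicates."""
    return [(i, j) for j, t in enumerate(test)
                   for i, s in enumerate(train) if jaccard(s, t) >= threshold]

train = ['password = "hunter2" # db secret',
         'api_key = os.environ["API_KEY"]']
test  = ['password = "hunter2" # db secret',      # exact duplicate -> leak
         'token = generate_session_token(user)']  # genuinely new sample
leaks = leaked_pairs(train, test)
```

若模型在含有此类重复的测试集上评估,其指标会被"记忆"而非"泛化"所抬高,这正是论文指出的问题。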
[AI-46] A Real-Time Privacy-Preserving Behavior Recognition System via Edge-Cloud Collaboration
【速读】:该论文旨在解决智能感知技术在高隐私环境(如厕所和更衣室)中面临的隐私-安全悖论问题,即传统RGB监控存在视觉记录与存储的隐私风险,而现有隐私保护方法(如物理去敏或加密/混淆技术)往往牺牲语义理解能力或无法数学上保证不可逆性以抵御重建攻击。解决方案的关键在于提出一种基于AI Flow理论框架和边缘-云协同架构的新型隐私保护感知技术,其核心是将源端去敏与不可逆特征映射相结合:边缘设备利用信息瓶颈理论对原始图像进行非线性映射和随机噪声注入,在毫秒级内生成抽象特征向量,构建单向信息流以剥离身份敏感属性并确保原图无法重建;云端则仅基于这些抽象向量使用多模态大模型进行异常行为联合推理,从架构层面彻底切断隐私泄露路径,实现从视频监控到去标识化行为感知的范式跃迁。
链接: https://arxiv.org/abs/2601.22938
作者: Huan Song,Shuyu Tian,Junyi Hao,Cheng Yuan,Zhenyu Jia,Jiawei Shao,Xuelong Li
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
备注:
Abstract:As intelligent sensing expands into high-privacy environments such as restrooms and changing rooms, the field faces a critical privacy-security paradox. Traditional RGB surveillance raises significant concerns regarding visual recording and storage, while existing privacy-preserving methods-ranging from physical desensitization to traditional cryptographic or obfuscation techniques-often compromise semantic understanding capabilities or fail to guarantee mathematical irreversibility against reconstruction attacks. To address these challenges, this study presents a novel privacy-preserving perception technology based on the AI Flow theoretical framework and an edge-cloud collaborative architecture. The proposed methodology integrates source desensitization with irreversible feature mapping. Leveraging Information Bottleneck theory, the edge device performs millisecond-level processing to transform raw imagery into abstract feature vectors via non-linear mapping and stochastic noise injection. This process constructs a unidirectional information flow that strips identity-sensitive attributes, rendering the reconstruction of original images impossible. Subsequently, the cloud platform utilizes multimodal family models to perform joint inference solely on these abstract vectors to detect abnormal behaviors. This approach fundamentally severs the path to privacy leakage at the architectural level, achieving a breakthrough from video surveillance to de-identified behavior perception and offering a robust solution for risk management in high-sensitivity public spaces.
zh
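论文中边缘端的"不可逆特征映射"可以用如下简化草图来理解:非线性随机投影降维后再注入随机噪声,使原始帧无法由特征向量反推(以下实现仅为概念性示意,并非论文的真实网络结构):

```python
import numpy as np

rng = np.random.default_rng(42)

def desensitize(frame, out_dim=32, noise_scale=0.1):
    """One-way edge-side mapping (conceptual sketch, not the paper's network):
    a nonlinear random projection to a low-dimensional feature vector plus
    stochastic noise, so the raw frame cannot be reconstructed from it."""
    flat = frame.reshape(-1)
    proj = rng.normal(size=(out_dim, flat.size)) / np.sqrt(flat.size)
    feat = np.tanh(proj @ flat)              # many-to-one, nonlinear
    return feat + noise_scale * rng.normal(size=out_dim)

frame = rng.uniform(size=(8, 8))             # stand-in for a camera frame
feat_1 = desensitize(frame)
feat_2 = desensitize(frame)                  # stochastic: differs run to run
```

由于输出维度远小于输入且叠加了随机噪声,逆映射在数学上是欠定的,云端只能基于抽象向量做行为推理。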
[AI-47] Protecting Private Code in IDE Autocomplete using Differential Privacy
【速读】:该论文旨在解决现代集成开发环境(IDE)中基于大语言模型(LLM)的代码补全功能所引发的隐私风险问题,即模型在训练过程中可能泄露用户编写代码的敏感信息,从而面临成员推理攻击(Membership Inference Attacks, MIAs)等安全威胁。解决方案的关键在于采用差分隐私(Differential Privacy, DP)机制对模型进行微调,在保障模型性能的同时显著提升隐私保护水平——实验表明,使用DP训练后的Mellum模型可将MIAs的成功率从0.901(AUC)降至0.606(接近随机猜测),且在仅使用1/100训练数据的情况下仍能保持与非私有模型相当的实用性,验证了差分隐私作为构建可信AI驱动IDE功能的实际可行方案。
链接: https://arxiv.org/abs/2601.22935
作者: Evgeny Grigorenko,David Stanojević,David Ilić,Egor Bogomolov,Kostadin Cvejoski
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 6 pages
Abstract:Modern Integrated Development Environments (IDEs) increasingly leverage Large Language Models (LLMs) to provide advanced features like code autocomplete. While powerful, training these models on user-written code introduces significant privacy risks, making the models themselves a new type of data vulnerability. Malicious actors can exploit this by launching attacks to reconstruct sensitive training data or infer whether a specific code snippet was used for training. This paper investigates the use of Differential Privacy (DP) as a robust defense mechanism for training an LLM for Kotlin code completion. We fine-tune a Mellum model using DP and conduct a comprehensive evaluation of its privacy and utility. Our results demonstrate that DP provides a strong defense against Membership Inference Attacks (MIAs), reducing the attack’s success rate close to a random guess (AUC from 0.901 to 0.606). Furthermore, we show that this privacy guarantee comes at a minimal cost to model performance, with the DP-trained model achieving utility scores comparable to its non-private counterpart, even when trained on 100x less data. Our findings suggest that DP is a practical and effective solution for building private and trustworthy AI-powered IDE features.
zh
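论文采用差分隐私微调;其核心机制可用 DP-SGD 风格的单步更新来示意——逐样本梯度裁剪、取平均并注入校准高斯噪声(以下为通用草图,超参数为示意值,并非论文的具体训练配置):

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip=1.0, noise_mult=1.0):
    """One DP-SGD style update (generic sketch, not the paper's exact recipe):
    clip each example's gradient to L2 norm <= clip, average, add calibrated
    Gaussian noise, then take a descent step."""
    rng = np.random.default_rng(0)
    clipped = [g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    mean = np.mean(clipped, axis=0)
    noise = rng.normal(scale=noise_mult * clip / len(clipped), size=mean.shape)
    return params - lr * (mean + noise)

params = np.zeros(4)
grads = [np.array([10.0, 0.0, 0.0, 0.0]),   # outlier: clipped to norm 1
         np.array([0.0, 0.5, 0.0, 0.0])]
new_params = dp_sgd_step(params, grads)
```

裁剪限制了单个样本(如某段被记忆的密钥代码)对更新的最大影响,噪声则提供正式的隐私保证——这正是 MIA 成功率降至近随机水平的原因。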
[AI-48] MTDrive: Multi-turn Interactive Reinforcement Learning for Autonomous Driving
【速读】:该论文旨在解决自主驾驶中轨迹规划任务在复杂“长尾”场景下因现有方法仅支持单轮推理(single-turn reasoning)而导致的迭代优化能力不足问题。其核心挑战在于如何通过多轮交互式推理实现对环境反馈的持续响应与路径精炼。解决方案的关键是提出MTDrive框架,其中引入了多轮组相对策略优化(Multi-Turn Group Relative Policy Optimization, mtGRPO),通过跨轮次计算相对优势来缓解奖励稀疏性问题,并构建了一个基于闭环仿真生成的交互式轨迹理解数据集以支持多轮训练。实验表明,该方法在NAVSIM基准上显著优于现有方案,验证了多轮推理范式的有效性。
链接: https://arxiv.org/abs/2601.22930
作者: Xidong Li,Mingyu Guo,Chenchao Xu,Bailin Li,Wenjing Zhu,Yangang Zou,Rui Chen,Zehuan Wang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Trajectory planning is a core task in autonomous driving, requiring the prediction of safe and comfortable paths across diverse scenarios. Integrating Multi-modal Large Language Models (MLLMs) with Reinforcement Learning (RL) has shown promise in addressing “long-tail” scenarios. However, existing methods are constrained to single-turn reasoning, limiting their ability to handle complex tasks requiring iterative refinement. To overcome this limitation, we present MTDrive, a multi-turn framework that enables MLLMs to iteratively refine trajectories based on environmental feedback. MTDrive introduces Multi-Turn Group Relative Policy Optimization (mtGRPO), which mitigates reward sparsity by computing relative advantages across turns. We further construct an interactive trajectory understanding dataset from closed-loop simulation to support multi-turn training. Experiments on the NAVSIM benchmark demonstrate superior performance compared to existing methods, validating the effectiveness of our multi-turn reasoning paradigm. Additionally, we implement system-level optimizations to reduce data transfer overhead caused by high-resolution images and multi-turn sequences, achieving 2.5x training throughput. Our data, models, and code will be made available soon.
zh
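mtGRPO 的核心是在组内(含多轮)计算相对优势。其基础的组相对优势归一化可示意如下(将精炼轮次纳入同一组是根据摘要的合理推测,非论文的确切公式):

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style relative advantage: normalise each trajectory's reward by
    the group mean and standard deviation. In the multi-turn variant the
    group can span refinement turns (an inference from the abstract), so a
    sparse terminal reward still yields graded credit across turns."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

rewards = [0.2, 0.4, 0.7, 0.9]   # e.g. turn-1 drafts and turn-2 refinements
adv = group_relative_advantages(rewards)
```

即使终局奖励稀疏,组内归一化也能让较好的精炼轮次获得正优势、较差的轮次获得负优势。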
[AI-49] BEAR: Towards Beam-Search-Aware Optimization for Recommendation with Large Language Models
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在推荐系统中应用时存在的训练-推理不一致性问题:尽管监督微调(Supervised Fine-Tuning, SFT)优化了正样本的整体概率,但使用束搜索(Beam Search)进行推理时,由于贪婪剪枝机制,即使某些正样本具有较高概率,也可能因前缀概率不足而被提前丢弃。为缓解这一问题,论文提出BEAR(Beam-Search-Aware Regularization)方法,其核心创新在于设计一种计算高效且有效的微调目标——在每个解码步骤中强制要求正样本的每个token在其对应位置排名位于前B个候选token之内,从而避免束搜索错误剪枝,同时仅引入可忽略的额外计算开销。
链接: https://arxiv.org/abs/2601.22925
作者: Weiqin Yang,Bohao Wang,Zhenxiang Xu,Jiawei Chen,Shengjia Zhang,Jingbang Chen,Canghong Jin,Can Wang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Recent years have witnessed a rapid surge in research leveraging Large Language Models (LLMs) for recommendation. These methods typically employ supervised fine-tuning (SFT) to adapt LLMs to recommendation scenarios, and utilize beam search during inference to efficiently retrieve the B top-ranked recommended items. However, we identify a critical training-inference inconsistency: while SFT optimizes the overall probability of positive items, it does not guarantee that such items will be retrieved by beam search even if they possess high overall probabilities. Due to the greedy pruning mechanism, beam search can prematurely discard a positive item once its prefix probability is insufficient. To address this inconsistency, we propose BEAR (Beam-SEarch-Aware Regularization), a novel fine-tuning objective that explicitly accounts for beam search behavior during training. Rather than directly simulating beam search for each instance during training, which is computationally prohibitive, BEAR enforces a relaxed necessary condition: each token in a positive item must rank within the top-B candidate tokens at each decoding step. This objective effectively mitigates the risk of incorrect pruning while incurring negligible computational overhead compared to standard SFT. Extensive experiments across four real-world datasets demonstrate that BEAR significantly outperforms strong baselines. Code will be released upon acceptance.
zh
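BEAR 所强制的松弛必要条件——正样本的每个 token 在每个解码步都须位于 top-B 候选之内——可用如下 Python 片段直接验证(logits 数据为示意):

```python
import numpy as np

def survives_beam(step_logits, positive_ids, B):
    """BEAR's relaxed necessary condition: at every decoding step, the
    positive item's token must rank within the top-B candidates; otherwise
    greedy beam pruning can discard the item despite a high overall score."""
    for logits, tok in zip(step_logits, positive_ids):
        top_b = np.argsort(logits)[::-1][:B]
        if tok not in top_b:
            return False
    return True

step_logits = np.array([
    [5.0, 4.0, 3.0, 0.0, 0.0],   # token 1 ranks 2nd here
    [0.0, 1.0, 9.0, 2.0, 8.0],   # token 4 ranks 2nd, token 3 only 3rd
])
ok = survives_beam(step_logits, positive_ids=[1, 4], B=2)
pruned = survives_beam(step_logits, positive_ids=[1, 3], B=2)
```

第二个序列虽然整体概率未必低,但因某一步 token 排名跌出 top-B 而被束搜索剪枝——这正是 BEAR 训练目标要惩罚的情形。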
[AI-50] Evaluating Large Language Models for Security Bug Report Prediction
【速读】:该论文旨在解决安全漏洞报告(Security Bug Reports, SBRs)的早期检测问题,以实现漏洞的及时缓解。其核心解决方案在于对比分析基于提示工程(prompt-based engineering)与微调(fine-tuning)两种策略在大型语言模型(Large Language Models, LLMs)上预测SBRs的效果。关键发现表明:提示工程方法在敏感性(sensitivity)和召回率(recall)方面表现更优,但误报率较高;而微调方法则具备更高的精确率(precision),尽管召回率较低,且推理速度比提示方法快达50倍,体现了性能与效率之间的权衡。
链接: https://arxiv.org/abs/2601.22921
作者: Farnaz Soltaniani,Shoaib Razzaq,Mohammad Ghafari
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Early detection of security bug reports (SBRs) is critical for timely vulnerability mitigation. We present an evaluation of prompt-based engineering and fine-tuning approaches for predicting SBRs using Large Language Models (LLMs). Our findings reveal a distinct trade-off between the two approaches. Prompted proprietary models demonstrate the highest sensitivity to SBRs, achieving a G-measure of 77% and a recall of 74% on average across all the datasets, albeit at the cost of a higher false-positive rate, resulting in an average precision of only 22%. Fine-tuned models, by contrast, exhibit the opposite behavior, attaining a lower overall G-measure of 51% but substantially higher precision of 75% at the cost of reduced recall of 36%. Though a one-time investment in building fine-tuned models is necessary, the inference on the largest dataset is up to 50 times faster than that of proprietary models. These findings suggest that further investigations to harness the power of LLMs for SBR prediction are necessary.
zh
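摘要中的 G-measure 在缺陷/安全报告预测文献中通常定义为召回率(pd)与特异度(1 - pf)的调和平均;此处采用该常见定义,属合理假设,论文的具体公式可能略有出入。按此定义,报告的 74% 召回与 77% G-measure 对应约 20% 的误报率:

```python
def g_measure(recall, false_positive_rate):
    """G-measure as commonly used in bug-report prediction studies (assumed
    definition): harmonic mean of recall (pd) and specificity (1 - pf)."""
    spec = 1.0 - false_positive_rate
    return 2.0 * recall * spec / (recall + spec)

# Reported prompt-based averages (recall 0.74, G 0.77) are consistent with a
# false-positive rate of roughly 0.20 under this definition:
g = g_measure(recall=0.74, false_positive_rate=0.20)
```

该定义与准确率无关,因此在正样本稀少的 SBR 数据上,高 G-measure 可以与很低的精确率(22%)并存。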
[AI-51] MulFeRL: Enhancing Reinforcement Learning with Verbal Feedback in a Multi-turn Loop
【速读】:该论文旨在解决强化学习中基于可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)时,仅使用结果导向的标量奖励在失败样本上信息稀疏且缺乏解释性的问题,即此类奖励只能表明推理失败,而无法提供失败原因的具体反馈。解决方案的关键在于提出一种多轮反馈引导的强化学习框架,其核心机制包括:(1)仅在失败样本上触发的动态多轮再生机制,以生成更符合反馈语义的中间推理路径;(2)用于单轮内优化与跨轮间优化的两种互补学习信号;(3)将结构化口头反馈注入模型推理过程,从而将非结构化的自然语言反馈转化为可训练的学习信号。该方法在OpenR1-Math数据集上的实验表明,其不仅优于监督微调和传统RLVR基线,在域内表现更优,还能实现良好的域外泛化能力。
链接: https://arxiv.org/abs/2601.22900
作者: Xuancheng Li,Haitao Li,Yujia Zhou,Yiqun Liu,Qingyao Ai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning in multiple domains, yet outcome-only scalar rewards are often sparse and uninformative, especially on failed samples, where they merely indicate failure and provide no insight into why the reasoning fails. In this paper, we investigate how to leverage richer verbal feedback to guide RLVR training on failed samples, and how to convert such feedback into a trainable learning signal. Specifically, we propose a multi-turn feedback-guided reinforcement learning framework. It builds on three mechanisms: (1) dynamic multi-turn regeneration guided by feedback, triggered only on failed samples, (2) two complementary learning signals for within-turn and cross-turn optimization, and (3) structured feedback injection into the model’s reasoning process. Trained on sampled OpenR1-Math, the approach outperforms supervised fine-tuning and RLVR baselines in-domain and generalizes well out-of-domain.
zh
[AI-52] Game-Theoretic Co-Evolution for LLM -Based Heuristic Discovery
【速读】:该论文旨在解决当前自动启发式发现(Automatic Heuristic Discovery, AHD)方法普遍存在的过拟合与泛化能力差的问题,这些问题主要源于现有方法依赖静态评估策略,即在固定实例分布上进行训练和测试,导致模型在面对分布偏移时性能显著下降。解决方案的关键在于提出一种基于博弈论的框架——算法空间响应预言机(Algorithm Space Response Oracles, ASRO),将启发式发现建模为求解器与实例生成器之间的程序级协同进化过程,将其形式化为一个双人零和博弈,并通过LLM驱动的最佳响应预言机迭代扩展双方的策略池,从而用自动生成的动态课程学习替代静态评估,实现更强的泛化能力和鲁棒性。
链接: https://arxiv.org/abs/2601.22896
作者: Xinyi Ke,Kai Li,Junliang Xing,Yifan Zhang,Jian Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have enabled rapid progress in automatic heuristic discovery (AHD), yet most existing methods are predominantly limited by static evaluation against fixed instance distributions, leading to potential overfitting and poor generalization under distributional shifts. We propose Algorithm Space Response Oracles (ASRO), a game-theoretic framework that reframes heuristic discovery as a program level co-evolution between solver and instance generator. ASRO models their interaction as a two-player zero-sum game, maintains growing strategy pools on both sides, and iteratively expands them via LLM-based best-response oracles against mixed opponent meta-strategies, thereby replacing static evaluation with an adaptive, self-generated curriculum. Across multiple combinatorial optimization domains, ASRO consistently outperforms static-training AHD baselines built on the same program search mechanisms, achieving substantially improved generalization and robustness on diverse and out-of-distribution instances.
zh
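ASRO 将启发式发现建模为双人零和博弈并对混合元策略求最佳响应。其元博弈层可以用一个微型矩阵博弈加虚拟对弈(fictitious play)来示意——这里用精确的矩阵最佳响应替代论文中的 LLM 预言机,属于说明性的简化:

```python
import numpy as np

def fictitious_play(payoff, iters=2000):
    """Approximately solve a two-player zero-sum matrix game: each side best
    responds to the opponent's empirical mixture. This mirrors the meta-game
    layer of an ASRO-style loop, with exact matrix best responses standing in
    for the LLM-based oracles (an illustrative simplification)."""
    m, n = payoff.shape
    row_counts, col_counts = np.zeros(m), np.zeros(n)
    row_counts[0] = col_counts[0] = 1
    for _ in range(iters):
        row_counts[np.argmax(payoff @ (col_counts / col_counts.sum()))] += 1
        col_counts[np.argmin((row_counts / row_counts.sum()) @ payoff)] += 1
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()

# Matching pennies: the unique equilibrium mixes both strategies 50/50.
payoff = np.array([[1.0, -1.0], [-1.0, 1.0]])
row_mix, col_mix = fictitious_play(payoff)
```

在 ASRO 中,"求解器启发式"与"实例生成器"分别扮演行、列玩家,双方策略池随最佳响应不断扩张,从而形成自适应课程。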
[AI-53] Reinforcement Learning-Based Co-Design and Operation of Chiller and Thermal Energy Storage for Cost-Optimal HVAC Systems
【速读】:该论文旨在解决商业暖通空调(HVAC)系统中冷却设备联合运行与容量配置的优化问题,目标是在30年生命周期内最小化总成本,包括资本支出和 discounted 运营成本(含电费与维护费用)。其核心挑战在于冷却设备中电制冷机(chiller)与蓄冷系统(thermal energy storage, TES)之间的强非对称资本成本:提高制冷机容量的成本远高于同等规模的TES扩容,因此需在确保零冷负荷损失的前提下,协同设计最优的制冷机与TES容量组合。解决方案的关键在于将固定基础设施下的制冷机调度问题建模为有限时域马尔可夫决策过程(Markov Decision Process, MDP),控制动作定义为制冷机部分负载比(part-load ratio, PLR),并采用带约束动作空间的深度Q网络(Deep Q Network, DQN)学习最优运行策略;随后对多个候选容量组合进行评估,筛选出满足全部冷负荷需求的可行集,并在该集合上进行生命周期成本最小化,最终确定最优制冷机容量为700 kW、TES容量为1500 kWh。
链接: https://arxiv.org/abs/2601.22880
作者: Tanay Raghunandan Srinivasa,Vivek Deulkar,Aviruch Bhatia,Vishal Garg
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注: 11 pages, 3 figures
Abstract:We study the joint operation and sizing of cooling infrastructure for commercial HVAC systems using reinforcement learning, with the objective of minimizing life-cycle cost over a 30-year horizon. The cooling system consists of a fixed-capacity electric chiller and a thermal energy storage (TES) unit, jointly operated to meet stochastic hourly cooling demands under time-varying electricity prices. The life-cycle cost accounts for both capital expenditure and discounted operating cost, including electricity consumption and maintenance. A key challenge arises from the strong asymmetry in capital costs: increasing chiller capacity by one unit is far more expensive than an equivalent increase in TES capacity. As a result, identifying the right combination of chiller and TES sizes, while ensuring zero loss-of-cooling-load under optimal operation, is a non-trivial co-design problem. To address this, we formulate the chiller operation problem for a fixed infrastructure configuration as a finite-horizon Markov Decision Process (MDP), in which the control action is the chiller part-load ratio (PLR). The MDP is solved using a Deep Q Network (DQN) with a constrained action space. The learned DQN RL policy minimizes electricity cost over historical traces of cooling demand and electricity prices. For each candidate chiller-TES sizing configuration, the trained policy is evaluated. We then restrict attention to configurations that fully satisfy the cooling demand and perform a life-cycle cost minimization over this feasible set to identify the cost-optimal infrastructure design. Using this approach, we determine the optimal chiller and thermal energy storage capacities to be 700 kW and 1500 kWh, respectively.
zh
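论文的两阶段流程——先对每个容量组合学习运行策略,再在满足全部冷负荷的可行集合上最小化生命周期成本——其成本计算与筛选可示意如下(单位造价与年运行费用均为示意性假设,并非论文数据;可行集合假定已由训练好的策略给出):

```python
def life_cycle_cost(chiller_kw, tes_kwh, annual_opex, years=30, rate=0.05,
                    chiller_cost_per_kw=300.0, tes_cost_per_kwh=30.0):
    """Discounted life-cycle cost = capex + sum of discounted annual opex.
    The asymmetric unit prices (chiller far pricier per unit than TES) are
    illustrative assumptions, not the paper's figures."""
    capex = chiller_kw * chiller_cost_per_kw + tes_kwh * tes_cost_per_kwh
    opex = sum(annual_opex / (1.0 + rate) ** y for y in range(1, years + 1))
    return capex + opex

# Feasible configurations (their learned policies met all cooling demand),
# mapped to the annual operating cost achieved by each policy.
feasible = {(700, 1500): 18_000.0, (800, 1000): 19_000.0, (900, 500): 20_500.0}
best = min(feasible, key=lambda c: life_cycle_cost(*c, feasible[c]))
```

在这组示意数字下,较小的制冷机配合较大的蓄冷(依靠低价时段充冷降低运行费)在 30 年尺度上总成本最低,与论文的定性结论方向一致。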
[AI-54] Degradation-Aware Frequency Regulation of a Heterogeneous Battery Fleet via Reinforcement Learning
【速读】:该论文旨在解决异质电池群在电网平衡服务(如频率调节)中实时调度的问题,核心挑战在于电池循环退化具有路径依赖性(path-dependent),即退化程度由荷电状态(SoC)轨迹决定,通常通过雨流计数(rainflow cycle counting)量化,这使得传统基于每步代价加和的动态规划方法难以适用。解决方案的关键在于将整个调度问题建模为一个带有约束动作空间的马尔可夫决策过程(MDP),并设计了一种密集代理奖励(dense proxy reward),该奖励在每个时间步提供信息性反馈,同时与长期循环深度(cycle-depth)减少目标保持一致;此外,为应对因精细SoC离散化和电池间不对称约束带来的大规模状态-动作空间问题,作者采用基于极限学习机(ELM)的随机非线性特征映射结合线性时序差分学习的函数逼近强化学习方法,实现了高效且可扩展的学习策略优化。
链接: https://arxiv.org/abs/2601.22865
作者: Tanay Raghunandan Srinivasa,Vivek Deulkar,Jia Bhargava,Mohammad Hajiesmaili,Prashant Shenoy
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注: 11 pages, 2 figures
Abstract:Battery energy storage systems are increasingly deployed as fast-responding resources for grid balancing services such as frequency regulation and for mitigating renewable generation uncertainty. However, repeated charging and discharging induces cycling degradation and reduces battery lifetime. This paper studies the real-time scheduling of a heterogeneous battery fleet that collectively tracks a stochastic balancing signal subject to per-battery ramp-rate and capacity constraints, while minimizing long-term cycling degradation. Cycling degradation is fundamentally path-dependent: it is determined by charge-discharge cycles formed by the state-of-charge (SoC) trajectory and is commonly quantified via rainflow cycle counting. This non-Markovian structure makes it difficult to express degradation as an additive per-time-step cost, complicating classical dynamic programming approaches. We address this challenge by formulating the fleet scheduling problem as a Markov decision process (MDP) with constrained action space and designing a dense proxy reward that provides informative feedback at each time step while remaining aligned with long-term cycle-depth reduction. To scale learning to large state-action spaces induced by fine-grained SoC discretization and asymmetric per-battery constraints, we develop a function-approximation reinforcement learning method using an Extreme Learning Machine (ELM) as a random nonlinear feature map combined with linear temporal-difference learning. We evaluate the proposed approach on a toy Markovian signal model and on a Markovian model trained from real-world regulation signal traces obtained from the University of Delaware, and demonstrate consistent reductions in cycle-depth occurrence and degradation metrics compared to baseline scheduling policies. 
zh
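论文用 ELM 作为随机非线性特征映射、叠加线性时序差分学习。下面在一个玩具马尔可夫链上演示这一组合(链结构与超参数为示意;奖励仅在终点前一步为 1,折扣 0.9 下真实价值应接近 [0.729, 0.81, 0.9, 1.0]):

```python
import numpy as np

rng = np.random.default_rng(1)
n_state, n_hidden = 5, 24
W = rng.normal(size=(n_hidden, n_state))   # fixed random hidden layer (ELM)
b = rng.normal(size=n_hidden)

def phi(s):
    """ELM feature map: random projection + tanh; never trained."""
    x = np.zeros(n_state)
    x[s] = 1.0
    return np.tanh(W @ x + b) / np.sqrt(n_hidden)

def td_update(theta, s, r, s_next, done, alpha=0.1, gamma=0.9):
    """Linear TD(0) on the ELM features: V(s) = theta . phi(s)."""
    target = r + (0.0 if done else gamma * (theta @ phi(s_next)))
    return theta + alpha * (target - theta @ phi(s)) * phi(s)

# Toy episodic chain 0 -> 1 -> 2 -> 3 -> end, reward 1 on the final step.
theta = np.zeros(n_hidden)
for _ in range(2000):
    for s in range(4):
        theta = td_update(theta, s, 1.0 if s == 3 else 0.0, s + 1,
                          done=(s == 3))
values = [theta @ phi(s) for s in range(4)]  # ~ [0.9**3, 0.9**2, 0.9, 1.0]
```

只训练线性读出权重 theta、隐藏层保持随机固定,是 ELM 路线相对于端到端深度网络在可扩展性上的主要卖点。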
[AI-55] Bayesian Interpolating Neural Network (B-INN): a scalable and reliable Bayesian model for large-scale physical systems ICML
【速读】:该论文旨在解决神经网络与机器学习模型在不确定性量化(Uncertainty Quantification, UQ)中面临的可扩展性差和可靠性不足的问题,尤其针对工业级主动学习场景下高保真仿真计算成本高昂、数据量巨大(可达GB级别)的挑战。其解决方案的关键在于提出一种可扩展且可靠的贝叶斯代理模型——贝叶斯插值神经网络(Bayesian Interpolating Neural Network, B-INN),该模型融合高阶插值理论、张量分解(Tensor Decomposition)与交替方向算法(Alternating Direction Algorithm),实现高效维度约简而不牺牲预测精度;理论上证明B-INN的函数空间是高斯过程(Gaussian Process, GP)的子集,同时其贝叶斯推断具有线性复杂度 O(N),显著优于传统贝叶斯神经网络(Bayesian Neural Networks, BNNs)和GP,数值实验表明其速度提升可达20至10,000倍,并具备稳健的不确定性估计能力,从而为大规模工业仿真中的不确定性驱动主动学习提供了实用基础。
链接: https://arxiv.org/abs/2601.22860
作者: Chanwook Park,Brian Kim,Jiachen Guo,Wing Kam Liu
机构: 未知
类目: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI)
备注: 8 pages, 6 figures, ICML conference full paper submitted
Abstract:Neural networks and machine learning models for uncertainty quantification suffer from limited scalability and poor reliability compared to their deterministic counterparts. In industry-scale active learning settings, where generating a single high-fidelity simulation may require days or weeks of computation and produce data volumes on the order of gigabytes, they quickly become impractical. This paper proposes a scalable and reliable Bayesian surrogate model, termed the Bayesian Interpolating Neural Network (B-INN). The B-INN combines high-order interpolation theory with tensor decomposition and an alternating direction algorithm to enable effective dimensionality reduction without compromising predictive accuracy. We theoretically show that the function space of a B-INN is a subset of that of Gaussian processes, while its Bayesian inference exhibits linear complexity, \mathcal{O}(N), with respect to the number of training samples. Numerical experiments demonstrate that B-INNs can be 20 to 10,000 times faster, with robust uncertainty estimation, compared to Bayesian neural networks and Gaussian processes. These capabilities make B-INN a practical foundation for uncertainty-driven active learning in large-scale industrial simulations, where computational efficiency and robust uncertainty calibration are paramount.
zh
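B-INN 的线性复杂度来自在固定的插值特征映射上做贝叶斯推断:构造 Φ^TΦ 的代价为 O(N d²),对样本数 N 线性,而精确高斯过程为 O(N³)。下面用一维帽函数(hat function)插值基做一个简化示意(帽函数基是对论文高阶插值基的简化替代,属示意性假设):

```python
import numpy as np

def bayesian_fit(Phi, y, noise_var=0.01, prior_var=1.0):
    """Bayesian inference over readout weights for a fixed feature map Phi
    (N x d). Forming Phi^T Phi costs O(N d^2): linear in the number of
    samples N, unlike the O(N^3) of an exact Gaussian process."""
    d = Phi.shape[1]
    precision = Phi.T @ Phi / noise_var + np.eye(d) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ Phi.T @ y / noise_var
    return mean, cov

def predict(phi_star, mean, cov, noise_var=0.01):
    return phi_star @ mean, phi_star @ cov @ phi_star + noise_var

def hat_features(x, nodes):
    """Piecewise-linear 'hat' interpolation basis on a uniform node grid."""
    h = nodes[1] - nodes[0]
    return np.maximum(0.0, 1.0 - np.abs(x[:, None] - nodes[None, :]) / h)

nodes = np.linspace(0.0, 1.0, 9)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2.0 * np.pi * x)
mean, cov = bayesian_fit(hat_features(x, nodes), y)
mu, var = predict(hat_features(np.array([0.25]), nodes)[0], mean, cov)
```

在 x = 0.25 处(sin 的峰值)预测均值接近 1,且数据密集处的预测方差很小——不确定性随数据覆盖度变化正是主动学习所需的信号。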
[AI-56] MEnvAgent : Scalable Polyglot Environment Construction for Verifiable Software Engineering
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)在软件工程(Software Engineering, SWE)应用中因缺乏可验证数据集而导致的演化瓶颈问题,其核心挑战在于跨多种编程语言构建可执行环境的复杂性。解决方案的关键在于提出MEnvAgent框架,该框架采用多智能体规划-执行-验证架构以自动处理环境构建失败,并引入一种新颖的环境复用机制,通过增量补丁方式重用历史环境来显著降低计算开销。这一设计使得MEnvAgent能够在包含10种编程语言的1,000个任务上实现更高的成功通过率(Fail-to-Pass, F2P提升8.6%),同时将时间成本降低43%,并进一步推动了MEnvData-SWE——当前最大规模的开源多语言可验证Docker环境数据集的发展,为SWE任务提供了高质量训练与评估资源。
链接: https://arxiv.org/abs/2601.22859
作者: Chuanzhe Guo,Jingjing Wu,Sijun He,Yang Chen,Zhaoqi Kuang,Shilong Fan,Bingjin Chen,Siqi Bao,Jing Liu,Hua Wu,Qingfu Zhu,Wanxiang Che,Haifeng Wang
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:The evolution of Large Language Model (LLM) agents for software engineering (SWE) is constrained by the scarcity of verifiable datasets, a bottleneck stemming from the complexity of constructing executable environments across diverse languages. To address this, we introduce MEnvAgent, a Multi-language framework for automated Environment construction that facilitates scalable generation of verifiable task instances. MEnvAgent employs a multi-agent Planning-Execution-Verification architecture to autonomously resolve construction failures and integrates a novel Environment Reuse Mechanism that reduces computational overhead by incrementally patching historical environments. Evaluations on MEnvBench, a new benchmark comprising 1,000 tasks across 10 languages, demonstrate that MEnvAgent outperforms baselines, improving Fail-to-Pass (F2P) rates by 8.6% while reducing time costs by 43%. Additionally, we demonstrate the utility of MEnvAgent by constructing MEnvData-SWE, the largest open-source polyglot dataset of realistic verifiable Docker environments to date, alongside solution trajectories that enable consistent performance gains on SWE tasks across a wide range of models. Our code, benchmark, and dataset are available at this https URL.
zh
[AI-57] Learning to Build Shapes by Extrusion
【速读】:该论文旨在解决当前基于Transformer的3D网格生成方法在输出面数受限、难以保证网格流形性(manifoldness)以及缺乏编辑能力方面的局限性。其解决方案的关键在于提出一种名为Text Encoded Extrusion (TEE) 的文本驱动表示方法,将网格构建过程建模为一系列面extrusion操作序列,而非传统的多边形列表;并通过微调大型语言模型(LLM)学习从基础面环(face loops)重构网格的extrusion步骤,从而实现任意面数的网格生成、天然保证流形性,并支持对已有网格进行特征添加与编辑。
链接: https://arxiv.org/abs/2601.22858
作者: Thor Vestergaard Christiansen,Karran Pandey,Alba Reinders,Karan Singh,Morten Rieger Hannemose,J. Andreas Bærentzen
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
备注: A preprint
Abstract:We introduce Text Encoded Extrusion (TEE), a text-based representation that expresses mesh construction as sequences of face extrusions rather than polygon lists, and a method for generating 3D meshes from TEE using a large language model (LLM). By learning extrusion sequences that assemble a mesh, similar to the way artists create meshes, our approach naturally supports arbitrary output face counts and produces manifold meshes by design, in contrast to recent transformer-based models. The learnt extrusion sequences can also be applied to existing meshes - enabling editing in addition to generation. To train our model, we decompose a library of quadrilateral meshes with non-self-intersecting face loops into constituent loops, which can be viewed as their building blocks, and finetune an LLM on the steps for reassembling the meshes by performing a sequence of extrusions. We demonstrate that our representation enables reconstruction, novel shape synthesis, and the addition of new features to existing meshes.
zh
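TEE 将网格构建表示为一系列面 extrusion 操作。单次四边形面 extrusion 的几何操作可示意如下(数据结构与论文的文本编码无关,仅演示"复制顶点、沿法线偏移、缝合侧面与顶盖"这一基本步骤):

```python
import numpy as np

def extrude_face(vertices, face, distance):
    """Extrude one quad face along its normal: duplicate the face's four
    vertices, offset them by `distance`, and stitch four side quads plus
    the translated cap face."""
    v = vertices[face]
    normal = np.cross(v[1] - v[0], v[3] - v[0])
    normal = normal / np.linalg.norm(normal)
    new_idx = np.arange(len(vertices), len(vertices) + 4)
    new_verts = np.vstack([vertices, v + distance * normal])
    sides = [[face[i], face[(i + 1) % 4], new_idx[(i + 1) % 4], new_idx[i]]
             for i in range(4)]
    return new_verts, sides + [list(new_idx)]

# Start from a unit square in the z=0 plane and extrude it into a box shell.
quad_verts = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], float)
verts, faces = extrude_face(quad_verts, [0, 1, 2, 3], 1.0)
```

由于每次 extrusion 只在已有面上增添封闭的侧面与顶盖,按此类操作序列装配的网格天然保持流形性,这正是该表示相对于直接生成多边形列表的优势。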
[AI-58] Just-in-Time Catching Test Generation at Meta
【速读】:该论文旨在解决大规模后端系统(数亿行代码)中潜在缺陷在代码合并前难以被发现的问题,传统硬性测试(hardening tests)仅在生成时通过,无法主动暴露潜在错误。其解决方案的核心是引入“即时捕获测试”(Just-in-Time catching tests),这类测试设计为故意失败以提前暴露bug,从而在代码落地前阻止问题进入生产环境。关键创新在于:1)采用基于代码变更感知的方法显著提升候选捕获测试的有效性(较传统方法提升4倍,较偶然失败测试提升20倍);2)结合规则和大语言模型(LLM)驱动的评估器减少误报,将人工审核负担降低70%;3)实证表明,被人工接受的变更更易产生假阳性,而被拒绝的变更则包含更多真阳性,验证了该方法在工业场景中的有效性与可扩展性。
链接: https://arxiv.org/abs/2601.22832
作者: Matthew Becker,Yifei Chen,Nicholas Cochran,Pouyan Ghasemi,Abhishek Gulati,Mark Harman,Zachary Haluza,Mehrdad Honarkhah,Herve Robert,Jiacheng Liu,Weini Liu,Sreeja Thummala,Xiaoning Yang,Rui Xin,Sophie Zeng
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Submitted to FSE 2026 industry track
Abstract:We report on Just-in-Time catching test generation at Meta, designed to prevent bugs in large scale backend systems of hundreds of millions of line of code. Unlike traditional hardening tests, which pass at generation time, catching tests are meant to fail, surfacing bugs before code lands. The primary challenge is to reduce development drag from false positive test failures. Analyzing 22,126 generated tests, we show code-change-aware methods improve candidate catch generation 4x over hardening tests and 20x over coincidentally failing tests. To address false positives, we use rule-based and LLM-based assessors. These assessors reduce human review load by 70%. Inferential statistical analysis showed that human-accepted code changes are assessed to have significantly more false positives, while human-rejected changes have significantly more true positives. We reported 41 candidate catches to engineers; 8 were confirmed to be true positives, 4 of which would have led to serious failures had they remained uncaught. Overall, our results show that Just-in-Time catching is scalable, industrially applicable, and that it prevents serious failures from reaching production.
zh
[AI-59] Offline Reinforcement Learning of High-Quality Behaviors Under Robust Style Alignment
【速读】:该论文旨在解决离线强化学习中风格条件策略(style-conditioned policies)的训练难题,特别是在显式风格监督(通过子轨迹标注函数实现)下,如何在分布偏移和风格与奖励目标之间的固有冲突背景下,同时实现高任务性能与风格一致性。其解决方案的关键在于提出了一种统一的行为风格定义,并基于此构建了Style-Conditioned Implicit Q-Learning (SCIQL) 框架:该框架融合了离线目标条件强化学习技术(如 hindsight relabeling 和 value learning),并引入一种新的门控优势加权回归(Gated Advantage Weighted Regression)机制,从而在优化任务性能的同时有效保持风格对齐。
链接: https://arxiv.org/abs/2601.22823
作者: Mathieu Petitbois,Rémy Portelas,Sylvain Lamprier
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:We study offline reinforcement learning of style-conditioned policies using explicit style supervision via subtrajectory labeling functions. In this setting, aligning style with high task performance is particularly challenging due to distribution shift and inherent conflicts between style and reward. Existing methods, despite introducing numerous definitions of style, often fail to reconcile these objectives effectively. To address these challenges, we propose a unified definition of behavior style and instantiate it into a practical framework. Building on this, we introduce Style-Conditioned Implicit Q-Learning (SCIQL), which leverages offline goal-conditioned RL techniques, such as hindsight relabeling and value learning, and combine it with a new Gated Advantage Weighted Regression mechanism to efficiently optimize task performance while preserving style alignment. Experiments demonstrate that SCIQL achieves superior performance on both objectives compared to prior offline methods. Code, datasets and visuals are available in: this https URL.
zh
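SCIQL 的门控优势加权回归(Gated AWR)思想可示意为:标准 AWR 权重 exp(A/β) 再乘以风格一致性门控,使高优势但风格不符的样本不参与策略更新(门控形式为根据摘要的合理推测,并非论文的确切公式):

```python
import numpy as np

def gated_awr_weights(advantages, style_match, beta=1.0, w_max=20.0):
    """Gated advantage-weighted regression weights (illustrative form, not
    the paper's exact equation): standard AWR weights exp(A / beta), capped
    at w_max, multiplied by a 0/1 style-consistency gate."""
    w = np.minimum(np.exp(np.asarray(advantages) / beta), w_max)
    return w * np.asarray(style_match, dtype=float)

adv = [2.0, 2.0, -1.0, 0.5]
style = [1, 0, 1, 1]        # sample 1 has high advantage but wrong style
weights = gated_awr_weights(adv, style)
```

门控使任务性能(优势)与风格对齐解耦:优势项决定"学多少",门控项决定"学不学",从而缓解二者目标冲突。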
[AI-60] User-Adaptive Meta-Learning for Cold-Start Medication Recommendation with Uncertainty Filtering ICDE
【速读】:该论文旨在解决电子健康记录(Electronic Health Record, EHR)中药物推荐面临的患者冷启动问题(patient cold-start problem),即对于新患者因缺乏足够的处方历史而难以生成可靠推荐。解决方案的关键在于提出一种多层级、不确定性感知的元学习框架 MetaDrug,其核心创新包括:1)双层元自适应机制——自适应(self-adaptation)利用新患者的自身医疗事件作为支持集以捕捉时间依赖性,同伴适应(peer-adaptation)则借助相似患者的就诊记录增强新患者表征;2)引入不确定性量化模块,对支持集中的就诊记录进行排序并过滤无关信息,从而提升元自适应的一致性和准确性。该方法在 MIMIC-III 和急性肾损伤(Acute Kidney Injury, AKI)数据集上均显著优于现有先进药物推荐模型,尤其在冷启动场景下表现突出。
链接: https://arxiv.org/abs/2601.22820
作者: Arya Hadizadeh Moghaddam,Mohsen Nayebi Kerdabadi,Dongjie Wang,Mei Liu,Zijun Yao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: IEEE International Conference on Data Engineering (ICDE) 2026 accepted paper
Abstract:Large-scale Electronic Health Record (EHR) databases have become indispensable in supporting clinical decision-making through data-driven treatment recommendations. However, existing medication recommender methods often struggle with a user (i.e., patient) cold-start problem, where recommendations for new patients are usually unreliable due to the lack of sufficient prescription history for patient profiling. While prior studies have utilized medical knowledge graphs to connect medication concepts through pharmacological or chemical relationships, these methods primarily focus on mitigating the item cold-start issue and fall short in providing personalized recommendations that adapt to individual patient characteristics. Meta-learning has shown promise in handling new users with sparse interactions in recommender systems. However, its application to EHRs remains underexplored due to the unique sequential structure of EHR data. To tackle these challenges, we propose MetaDrug, a multi-level, uncertainty-aware meta-learning framework designed to address the patient cold-start problem in medication recommendation. MetaDrug proposes a novel two-level meta-adaptation mechanism, including self-adaptation, which adapts the model to new patients using their own medical events as support sets to capture temporal dependencies; and peer-adaptation, which adapts the model using similar visits from peer patients to enrich new patient representations. Meanwhile, to further improve meta-adaptation outcomes, we introduce an uncertainty quantification module that ranks the support visits and filters out the unrelated information for adaptation consistency. We evaluate our approach on the MIMIC-III and Acute Kidney Injury (AKI) datasets. Experimental results on both datasets demonstrate that MetaDrug consistently outperforms state-of-the-art medication recommendation methods on cold-start patients.
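MetaDrug 的不确定性量化模块会对支持集中的就诊记录排序并过滤无关信息;下面是该思路的一个极简示意(不确定性分数如何计算在摘要中未给出,此处直接作为输入,函数名为本文虚构):

```python
import numpy as np

def filter_support_visits(visits, uncertainties, keep_frac=0.5):
    """不确定性过滤(示意):按不确定性升序排列候选支持就诊,
    仅保留最可靠的一部分用于元自适应,提升适应一致性。"""
    n_keep = max(1, int(len(visits) * keep_frac))
    order = np.argsort(uncertainties)  # 不确定性最低者优先
    return [visits[i] for i in order[:n_keep]]

visits = ["visit_0", "visit_1", "visit_2", "visit_3"]
kept = filter_support_visits(visits, np.array([0.9, 0.1, 0.4, 0.7]))
```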
zh
[AI-61] Hide and Seek in Embedding Space: Geometry-based Steganography and Detection in Large Language Models
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在微调过程中可能通过隐写术(steganography)将提示秘密(prompt secrets)嵌入输出内容的问题,且此前方法存在可被轻易恢复的缺陷。其核心解决方案是提出“低可恢复性隐写术”(low-recoverability steganography),摒弃任意映射方式,转而基于嵌入空间(embedding space)构建编码机制,从而在显著提升秘密恢复难度的同时仍保持一定功能性能;进一步地,作者提出基于机制解释性的检测方法——利用后层激活上的线性探测器(linear probes)识别恶意微调留下的内部签名,实验表明该方法可在低可恢复性方案下仍实现比基础模型高33%的检测准确率,为防御此类隐蔽攻击提供了新路径。
链接: https://arxiv.org/abs/2601.22818
作者: Charles Westphal,Keivan Navaie,Fernando E. Rosas
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Fine-tuned LLMs can covertly encode prompt secrets into outputs via steganographic channels. Prior work demonstrated this threat but relied on trivially recoverable encodings. We formalize payload recoverability via classifier accuracy and show previous schemes achieve 100% recoverability. In response, we introduce low-recoverability steganography, replacing arbitrary mappings with embedding-space-derived ones. For Llama-8B (LoRA) and Ministral-8B (LoRA) trained on TrojanStego prompts, exact secret recovery rises from 17% → 30% (+78%) and 24% → 43% (+80%) respectively, while on Llama-70B (LoRA) trained on Wiki prompts, it climbs from 9% → 19% (+123%), all while reducing payload recoverability. We then discuss detection. We argue that detecting fine-tuning-based steganographic attacks requires approaches beyond traditional steganalysis. Standard approaches measure distributional shift, which is an expected side-effect of fine-tuning. Instead, we propose a mechanistic interpretability approach: linear probes trained on later-layer activations detect the secret with up to 33% higher accuracy in fine-tuned models compared to base models, even for low-recoverability schemes. This suggests that malicious fine-tuning leaves actionable internal signatures amenable to interpretability-based defenses.
zh
[AI-62] Aligning the Unseen in Attributed Graphs: Interplay between Graph Geometry and Node Attributes Manifold
【速读】:该论文旨在解决传统属性图表示学习方法中存在的几何缺陷问题,即同时重建节点属性和图结构会强行融合两个潜在不兼容的度量空间,导致破坏性的对齐过程,从而丢失关于图生成机制的关键信息。其解决方案的核心在于提出一种定制的变分自编码器(Variational Autoencoder, VAE),通过将流形学习与结构对齐分离,量化将属性流形映射到图热核(Heat Kernel)所需的度量扭曲,从而将几何冲突转化为可解释的结构描述符,有效恢复被传统方法忽略的连接模式与异常信息。
链接: https://arxiv.org/abs/2601.22806
作者: Aldric Labarthe(CB, UNIGE),Roland Bouffanais(UNIGE),Julien Randon-Furling(CB)
机构: 未知
类目: Artificial Intelligence (cs.AI); Differential Geometry (math.DG)
备注:
Abstract:The standard approach to representation learning on attributed graphs – i.e., simultaneously reconstructing node attributes and graph structure – is geometrically flawed, as it merges two potentially incompatible metric spaces. This forces a destructive alignment that erodes information about the graph’s underlying generative process. To recover this lost signal, we introduce a custom variational autoencoder that separates manifold learning from structural alignment. By quantifying the metric distortion needed to map the attribute manifold onto the graph’s Heat Kernel, we transform geometric conflict into an interpretable structural descriptor. Experiments show our method uncovers connectivity patterns and anomalies undetectable by conventional approaches, proving both their theoretical inadequacy and practical limitations.
zh
[AI-63] CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在代码生成后验证阶段面临的挑战,即现有监督微调方法存在数据稀缺、高失败率和推理效率低的问题。为此,作者提出了一种基于强化学习(Reinforcement Learning, RL)的解决方案——CVeDRL,其关键在于通过理论分析将分支覆盖率(branch coverage)、样本难度(sample difficulty)、语法正确性(syntactic correctness)和功能正确性(functional correctness)统一建模为多维度奖励信号,并设计了语法与功能感知奖励机制,结合指数型奖励重塑(exponential reward shaping)和静态分析指标,实现对难覆盖分支和复杂样本的有效引导,从而显著提升单元测试驱动验证的可靠性与效率。
链接: https://arxiv.org/abs/2601.22803
作者: Ji Shi,Peiming Guo,Meishan Zhang,Miao Zhang,Xuebo Liu,Min Zhang,Weili Guan
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 17 pages, 3 figures
Abstract:Code verifiers play a critical role in post-verification for LLM-based code generation, yet existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency. While reinforcement learning (RL) offers a promising alternative by optimizing models through execution-driven rewards without labeled supervision, our preliminary results show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples. We first provide a theoretical analysis showing that branch coverage, sample difficulty, and syntactic and functional correctness can be jointly modeled as RL rewards, where optimizing these signals can improve the reliability of unit-test-based verification. Guided by this analysis, we design syntax- and functionality-aware rewards and further propose branch- and sample-difficulty-aware RL using exponential reward shaping and static analysis metrics. With this formulation, CVeDRL achieves state-of-the-art performance with only 0.6B parameters, yielding up to 28.97% higher pass rate and 15.08% higher branch coverage than GPT-3.5, while delivering over 20× faster inference than competitive baselines. Code is available at this https URL
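CVeDRL 将语法正确性、功能正确性、分支覆盖率与样本难度联合建模为奖励信号,并采用指数奖励重塑。下面是该复合奖励的一个假设性极简写法(具体系数、指数形式与难度度量均为本文虚构,仅示意机制):

```python
import math

def shaped_reward(syntax_ok, func_ok, coverage, difficulty, k=3.0):
    """复合奖励(示意):以语法正确性为门控,叠加功能正确性奖励
    与指数重塑后的分支覆盖率奖励,并按样本难度放大。"""
    if not syntax_ok:
        return 0.0  # 语法错误的单测直接零奖励
    base = 1.0 if func_ok else 0.0
    # 指数重塑:覆盖率接近满覆盖时边际收益更大,鼓励攻克难覆盖分支
    shaped_cov = (math.exp(k * coverage) - 1.0) / (math.exp(k) - 1.0)
    return base + (1.0 + difficulty) * shaped_cov  # difficulty ∈ [0, 1]

r_easy = shaped_reward(True, True, 0.8, difficulty=0.1)
r_hard = shaped_reward(True, True, 0.8, difficulty=0.9)
```

同样的覆盖率在高难度样本上获得更高奖励,从而把优化压力引向论文所说的"难覆盖分支和复杂样本"。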
zh
[AI-64] Conditional Performance Guarantee for Large Reasoning Models
【速读】:该论文旨在解决大推理模型(Large reasoning models)在执行链式思维(chain-of-thought reasoning)时计算成本过高,而现有概率近似正确(PAC)推理方法仅提供边际风险控制、缺乏条件覆盖保证的问题。解决方案的关键在于提出G-PAC推理框架,通过将输入空间划分为不同组别,在组级别上实现PAC风格的风险控制;具体包含两种实例化方式:已知分组结构下的Group PAC(G-PAC)推理和未知分组情况下的Clustered PAC(C-PAC)推理,二者均能实现组条件风险控制,并在异质场景中显著优于传统的边际PAC推理,从而在保障推理可靠性的同时大幅提升计算效率。
链接: https://arxiv.org/abs/2601.22790
作者: Jianguo Huang,Hao Zeng,Bingyi Jing,Hongxin Wei,Bo An
机构: 未知
类目: Artificial Intelligence (cs.AI); Statistics Theory (math.ST)
备注:
Abstract:Large reasoning models have shown strong performance through extended chain-of-thought reasoning, yet their computational cost remains significant. Probably approximately correct (PAC) reasoning provides statistical guarantees for efficient reasoning by adaptively switching between thinking and non-thinking models, but the guarantee holds only in the marginal case and does not provide exact conditional coverage. We propose G-PAC reasoning, a practical framework that provides PAC-style guarantees at the group level by partitioning the input space. We develop two instantiations: Group PAC (G-PAC) reasoning for known group structures and Clustered PAC (C-PAC) reasoning for unknown groupings. We prove that both G-PAC and C-PAC achieve group-conditional risk control, and that grouping can strictly improve efficiency over marginal PAC reasoning in heterogeneous settings. Our experiments on diverse reasoning benchmarks demonstrate that G-PAC and C-PAC successfully achieve group-conditional risk control while maintaining substantial computational savings.
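G-PAC 的核心是在每个组内做风险控制:为每组单独校准一个置信度阈值,使被路由到非思考模型的输入上经验错误率不超过 α。以下是该组级校准的一个假设性示意(阈值搜索方式与变量名均为本文虚构,非论文算法的精确复现):

```python
import numpy as np

def group_thresholds(conf, err, group, alpha=0.25):
    """组条件校准(示意):对每个组选出置信度阈值,使被路由到
    非思考模型的样本中经验错误率不超过 alpha,实现组级风险控制。"""
    thresholds = {}
    for g in np.unique(group):
        c, e = conf[group == g], err[group == g]
        order = np.argsort(-c)  # 置信度从高到低
        risk = np.cumsum(e[order]) / (np.arange(len(e)) + 1.0)  # 前缀错误率
        ok = np.where(risk <= alpha)[0]
        # 取满足风险约束的最宽松阈值;无可行解则永远使用思考模型
        thresholds[g] = c[order][ok[-1]] if len(ok) else np.inf
    return thresholds

conf = np.array([0.9, 0.8, 0.7, 0.6])
err = np.array([0, 0, 0, 1])
th = group_thresholds(conf, err, group=np.zeros(4, dtype=int))
```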
zh
[AI-65] Toward IIT-Inspired Consciousness in LLMs: A Reward-Based Learning Framework
【速读】:该论文旨在解决当前语言模型在追求人工通用智能(AGI)过程中缺乏类意识处理能力的问题,试图通过引入整合信息理论(Integrated Information Theory, IIT)来增强模型生成文本的因果性、连贯性和整合性,从而模拟意识相关的行为特征。解决方案的关键在于设计一种基于IIT核心原则的新型奖励函数,该函数量化文本的因果整合特性,并通过奖励驱动的学习范式优化模型输出;实验表明,此方法能在不牺牲任务准确性的前提下显著减少输出长度(最高达31%),同时提升模型的置信度校准和推理效率,且无需外部数据或辅助模型,仅依赖通用的能力驱动信号即可实现。
链接: https://arxiv.org/abs/2601.22786
作者: Hamid Reza Akbari,Mohammad Hossein Sameti,Amir M. Mansourian,Mohammad Hossein Rohban,Hossein Sameti
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 pages, 8 figures, 4 tables
Abstract:The pursuit of Artificial General Intelligence (AGI) is a central goal in language model development, in which consciousness-like processing could serve as a key facilitator. While current language models are not conscious, they exhibit behaviors analogous to certain aspects of consciousness. This paper investigates the implementation of a leading theory of consciousness, Integrated Information Theory (IIT), within language models via a reward-based learning paradigm. IIT provides a formal, axiom-based mathematical framework for quantifying consciousness. Drawing inspiration from its core principles, we formulate a novel reward function that quantifies a text's causality, coherence and integration, characteristics associated with conscious processing. Empirically, it is found that optimizing for this IIT-inspired reward leads to more concise text generation. On out-of-domain tasks, careful tuning achieves up to a 31% reduction in output length while preserving accuracy levels comparable to the base model. In addition to primary task performance, the broader effects of this training methodology on the model's confidence calibration and test-time computational scaling are analyzed. The proposed framework offers significant practical advantages: it is conceptually simple, computationally efficient, requires no external data or auxiliary models, and leverages a general, capability-driven signal rather than task-specific heuristics. Code available at this https URL
zh
[AI-66] Learning with Challenges: Adaptive Difficulty-Aware Data Generation for Mobile GUI Agent Training
【速读】:该论文旨在解决当前移动图形用户界面(GUI)智能体训练中数据生成缺乏细粒度难度控制的问题,从而导致训练难度与智能体能力不匹配,限制了学习效果。其解决方案的关键在于提出MobileGen框架,该框架通过显式解耦任务难度为结构维度(如轨迹长度)和语义维度(如任务目标),并基于预定义数据集迭代评估智能体能力边界,构建其在两个维度上的系统性能力轮廓;随后自适应计算难度概率分布,并从中采样下一阶段训练的目标难度,最终利用多智能体可控生成器合成高质量交互轨迹及对应任务指令,实现训练难度与智能体能力的动态对齐。
链接: https://arxiv.org/abs/2601.22781
作者: Linjia Kang,Zhimin Wang,Yongkang Zhang,Duo Wu,Jinghe Wang,Ming Ma,Haopeng Yan,Zhi Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large-scale, high-quality interaction trajectories are essential for advancing mobile Graphical User Interface (GUI) agents. While existing methods typically rely on labor-intensive human demonstrations or automated model exploration to generate GUI trajectories, they lack fine-grained control over task difficulty. This fundamentally restricts learning effectiveness due to the mismatch between the training difficulty and the agent’s capabilities. Inspired by how humans acquire skills through progressively challenging tasks, we propose MobileGen, a novel data generation framework that adaptively aligns training difficulty with the GUI agent’s capability frontier. Specifically, MobileGen explicitly decouples task difficulty into structural (e.g., trajectory length) and semantic (e.g., task goal) dimensions. It then iteratively evaluates the agent on a curated prior dataset to construct a systematic profile of its capability frontier across these two dimensions. With this profile, the probability distribution of task difficulty is adaptively computed, from which the target difficulty for the next round of training can be sampled. Guided by the sampled difficulty, a multi-agent controllable generator is finally used to synthesize high-quality interaction trajectories along with corresponding task instructions. Extensive experiments show that MobileGen consistently outperforms existing data generation methods by improving the average performance of GUI agents by 1.57 times across multiple challenging benchmarks. This highlights the importance of capability-aligned data generation for effective mobile GUI agent training.
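MobileGen 根据智能体能力轮廓自适应计算难度概率分布,再从中采样下一轮训练的目标难度。下面用一个"能力前沿采样"的玩具实现示意这一步:把概率质量集中在成功率约 50% 的难度档位上(softmax 温度与打分方式为本文假设):

```python
import numpy as np

def difficulty_distribution(success_rate, temp=0.2):
    """能力前沿采样(示意):把采样概率集中在智能体成功率约 50% 的
    难度档位上,过易与过难的任务均被降权。"""
    score = -np.abs(np.asarray(success_rate) - 0.5)  # 越接近 0.5 分数越高
    logits = score / temp
    p = np.exp(logits - logits.max())  # 数值稳定的 softmax
    return p / p.sum()

# 从易到难五个难度档位上的历史成功率(虚构数据)
p = difficulty_distribution([0.95, 0.70, 0.50, 0.20, 0.05])
next_bin = int(np.argmax(p))  # 下一轮训练采样的重心档位
```

随着智能体能力提升,成功率曲线整体右移,采样重心也随之移向更高难度,实现"训练难度与能力前沿对齐"。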
zh
[AI-67] TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization
【速读】:该论文旨在解决多轮工具增强型推理中因依赖稀疏结果级奖励而导致的“双重同质化困境”问题,即过程同质化(忽略生成过程中思维、推理与工具调用的细节)和组内同质化(粗粒度奖励导致组内优势估计效率低下)。其解决方案的关键在于提出一种逐轮阶段感知的策略优化方法(Turn-level Stage-aware Policy Optimization, TSPO),通过引入首次出现潜在奖励(First-Occurrence Latent Reward, FOLR)机制,在答案首次出现的步骤分配部分奖励,从而保留过程级信号并提升组内奖励方差,无需外部奖励模型或人工标注即可显著改善策略优化效果。
链接: https://arxiv.org/abs/2601.22776
作者: Shichao Ma,Zhiyuan Ma,Ming Yang,Xiaofan Li,Xing Wu,Jintao Du,Yu Cheng,Weiqiang Wang,Qiliang Liu,Zhengyang Zhou,Yang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-turn tool-integrated reasoning enables Large Language Models (LLMs) to solve complex tasks through iterative information retrieval. However, current reinforcement learning (RL) frameworks for search-augmented reasoning predominantly rely on sparse outcome-level rewards, leading to a “Double Homogenization Dilemma.” This manifests as (1) Process homogenization, where the thinking, reasoning, and tooling involved in generation are ignored. (2) Intra-group homogenization, coarse-grained outcome rewards often lead to inefficiencies in intra-group advantage estimation with methods like Group Relative Policy Optimization (GRPO) during sampling. To address this, we propose Turn-level Stage-aware Policy Optimization (TSPO). TSPO introduces the First-Occurrence Latent Reward (FOLR) mechanism, allocating partial rewards to the step where the ground-truth answer first appears, thereby preserving process-level signals and increasing reward variance within groups without requiring external reward models or any annotations. Extensive experiments demonstrate that TSPO significantly outperforms state-of-the-art baselines, achieving average performance gains of 24% and 13.6% on Qwen2.5-3B and 7B models, respectively.
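FOLR 机制的描述很具体:在标准答案首次出现的轮次分配部分奖励,以保留过程级信号并增大组内奖励方差。以下 Python 示意严格按此描述实现(部分/最终奖励的数值与"包含即命中"的匹配方式为本文假设):

```python
def folr_rewards(turn_outputs, answer, partial=0.5, final=1.0):
    """首次出现潜在奖励(示意):在标准答案首次出现的轮次给予部分奖励,
    并在最后一轮答案正确时给予结果奖励,提升组内奖励方差。"""
    rewards = [0.0] * len(turn_outputs)
    first = next((i for i, t in enumerate(turn_outputs) if answer in t), None)
    if first is not None:
        rewards[first] += partial  # 过程级信号:答案首次出现处
    if turn_outputs and answer in turn_outputs[-1]:
        rewards[-1] += final       # 结果级信号:最终答案正确
    return rewards

r = folr_rewards(
    ["search: capital of France", "found: Paris is the capital", "answer: Paris"],
    "Paris",
)
```

同一组内,早找到答案与晚找到答案的轨迹由此获得不同的逐轮奖励,避免了 GRPO 中结果奖励同质化导致的优势估计失效。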
zh
[AI-68] Beyond Abstract Compliance: Operationalising trust in AI as a moral relationship
【速读】:该论文试图解决当前主流AI可信性框架(如欧盟的“可信AI框架”)将信任视为可设计、评估与治理的静态属性,而忽视了信任在主观体验、文化嵌入性和关系动态性方面的本质问题。其解决方案的关键在于引入关系伦理(relational ethics),特别是非洲共同体主义哲学(African communitarian philosophies),提出一套扩展的信任原则,将信任重构为一种动态、历时性的关系过程,强调透明度与相互尊重,并主张在整个AI生命周期中持续参与社区,以逐步建立信任并促进更具公平性和情境敏感性的AI系统。
链接: https://arxiv.org/abs/2601.22769
作者: Lameck Mbangula Amugongo,Tutaleni Asino,Nicola J Bidwell
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Dominant approaches, e.g. the EU’s “Trustworthy AI framework”, treat trust as a property that can be designed for, evaluated, and governed according to normative and technical criteria. They do not address how trust is subjectively cultivated and experienced, culturally embedded, and inherently relational. This paper proposes some expanded principles for trust in AI that can be incorporated into common development methods and frame trust as a dynamic, temporal relationship, which involves transparency and mutual respect. We draw on relational ethics and, in particular, African communitarian philosophies, to foreground the nuances of inclusive, participatory processes and long-term relationships with communities. Involving communities throughout the AI lifecycle can foster meaningful relationships with AI design and development teams that incrementally build trust and promote more equitable and context-sensitive AI systems. We illustrate how trust-enabling principles based on African relational ethics can be operationalised, using two use-cases for AI: healthcare and education.
zh
[AI-69] How Far Can Pretrained LLMs Go in Symbolic Music? Controlled Comparisons of Supervised and Preference-based Adaptation
【速读】:该论文旨在解决如何有效将指令微调的大语言模型(instruction-tuned large language models, LLMs)适配到符号音乐(symbolic music)的理解与生成任务中这一问题。其关键解决方案在于通过受控的对比实验,系统评估不同微调策略对基于ABC记谱法的音乐生成与理解性能的影响,明确领域适应(domain adaptation)与保留预训练先验信息之间的权衡关系,并揭示用于衡量音乐领域适应性的指标行为差异,从而为符号音乐应用中的模型选择与优化提供实证依据。
链接: https://arxiv.org/abs/2601.22764
作者: Deepak Kumar,Emmanouil Karystinaios,Gerhard Widmer,Markus Schedl
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: Accepted at NLP4MusA 2026
Abstract:Music often shares notable parallels with language, motivating the use of pretrained large language models (LLMs) for symbolic music understanding and generation. Despite growing interest, the practical effectiveness of adapting instruction-tuned LLMs to symbolic music remains insufficiently characterized. We present a controlled comparative study of finetuning strategies for ABC-based generation and understanding, comparing an off-the-shelf instruction-tuned backbone to domain-adapted variants and a music-specialized LLM baseline. Across multiple symbolic music corpora and evaluation signals, we provide some insights into adaptation choices for symbolic music applications. We highlight the domain adaptation vs. preserving prior information tradeoff as well as the distinct behaviour of metrics used to measure the domain adaptation for symbolic music.
zh
[AI-70] Qualitative Evaluation of LLM-Designed GUI
【速读】:该论文旨在解决生成式 AI 在自动化图形用户界面(GUI)设计中的可用性与适应性问题,特别是大型语言模型(Large Language Models, LLMs)在满足多样化用户需求时的表现瓶颈。其解决方案的关键在于通过实证实验评估三类先进LLM(OpenAI GPT o3-mini-high、DeepSeek R1 和 Anthropic Claude 3.5 Sonnet)生成的界面原型在结构化布局、可访问性及交互功能等方面的性能,并发现尽管LLMs在早期原型设计中具有潜力,但需依赖人工干预以确保最终界面的可用性、无障碍合规性和用户满意度。
链接: https://arxiv.org/abs/2601.22759
作者: Bartosz Sawicki,Tomasz Les,Dariusz Parzych,Aleksandra Wycisk-Ficek,Pawel Trebacz,Pawel Zawadzki
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 12 pages, presented at the PP-RAI 2025 conference, Katowice, Poland
Abstract:As generative artificial intelligence advances, Large Language Models (LLMs) are being explored for automated graphical user interface (GUI) design. This study investigates the usability and adaptability of LLM-generated interfaces by analysing their ability to meet diverse user needs. The experiments included utilization of three state-of-the-art models from January 2025 (OpenAI GPT o3-mini-high, DeepSeek R1, and Anthropic Claude 3.5 Sonnet) generating mockups for three interface types: a chat system, a technical team panel, and a manager dashboard. Expert evaluations revealed that while LLMs are effective at creating structured layouts, they face challenges in meeting accessibility standards and providing interactive functionality. Further testing showed that LLMs could partially tailor interfaces for different user personas but lacked deeper contextual understanding. The results suggest that while LLMs are promising tools for early-stage UI prototyping, human intervention remains critical to ensure usability, accessibility, and user satisfaction.
zh
[AI-71] AutoRefine: From Trajectories to Reusable Expertise for Continual LLM Agent Refinement
【速读】:该论文旨在解决大语言模型代理在执行任务过程中难以积累和复用经验的问题,尤其是现有方法将经验以扁平化文本形式提取,无法捕捉复杂子任务的程序逻辑,且缺乏维护机制导致知识库随经验增长而退化。其解决方案的关键在于提出AutoRefine框架,通过从代理执行历史中提取并维护双形式的经验模式(Experience Patterns):针对程序性子任务,生成具有独立推理与记忆能力的专用子代理;针对静态知识,提取技能模式作为指南或代码片段。同时引入持续维护机制,对模式进行评分、修剪和合并,有效防止知识库退化,从而显著提升任务完成率与效率。
链接: https://arxiv.org/abs/2601.22758
作者: Libin Qiu,Zhirong Gao,Junfu Chen,Yuhang Ye,Weizhi Huang,Xiaobo Xue,Wenkai Qiu,Shuo Tang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 3 figures, 3 tables
Abstract:Large language model agents often fail to accumulate knowledge from experience, treating each task as an independent challenge. Recent methods extract experience as flattened textual knowledge, which cannot capture procedural logic of complex subtasks. They also lack maintenance mechanisms, causing repository degradation as experience accumulates. We introduce AutoRefine, a framework that extracts and maintains dual-form Experience Patterns from agent execution histories. For procedural subtasks, we extract specialized subagents with independent reasoning and memory. For static knowledge, we extract skill patterns as guidelines or code snippets. A continuous maintenance mechanism scores, prunes, and merges patterns to prevent repository degradation. Evaluated on ALFWorld, ScienceWorld, and TravelPlanner, AutoRefine achieves 98.4%, 70.4%, and 27.1% respectively, with 20-73% step reductions. On TravelPlanner, automatic extraction exceeds manually designed systems (27.1% vs 12.1%), demonstrating its ability to capture procedural coordination.
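AutoRefine 的持续维护机制对经验模式进行评分、修剪和合并。以下给出"先剪枝、再按键去重合并"的一个假设性极简实现(模式的数据结构、相似度判据与阈值均为本文虚构):

```python
def maintain_patterns(patterns, min_score=0.3):
    """经验库维护(示意):先剪除低分经验模式,再合并重复项,
    保留得分最高者,防止经验库随积累而退化。"""
    kept = [p for p in patterns if p["score"] >= min_score]   # 剪枝
    merged = []
    for p in sorted(kept, key=lambda q: -q["score"]):         # 高分优先
        if not any(p["key"] == q["key"] for q in merged):     # 合并重复
            merged.append(p)
    return merged

repo = [
    {"key": "open_fridge", "score": 0.9},
    {"key": "open_fridge", "score": 0.5},  # 重复模式,得分较低
    {"key": "stale_skill", "score": 0.1},  # 低于剪枝阈值
]
repo = maintain_patterns(repo)
```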
zh
[AI-72] UrbanMoE: A Sparse Multi-Modal Mixture-of-Experts Framework for Multi-Task Urban Region Profiling WWW’26
【速读】:该论文旨在解决城市区域画像(Urban Region Profiling)研究中存在的两大问题:一是现有方法多局限于单任务预测,无法捕捉城市环境中多种指标间的复杂关联性;二是缺乏标准化的实验基准,阻碍了公平比较与可复现的研究进展。其解决方案的关键在于提出首个面向多任务的城市区域画像框架——UrbanMoE,该框架基于稀疏的多模态专家混合(Sparse Multi-Modal Mixture-of-Experts, MoE)架构,能够动态地将多模态特征路由至专用子网络,从而实现对多种城市指标的协同预测,在多个真实数据集上展现出显著优于基线方法的性能和效率。
链接: https://arxiv.org/abs/2601.22746
作者: Pingping Liu,Jiamiao Liu,Zijian Zhang,Hao Miao,Qi Jiang,Qingliang Li,Qiuzhan Zhou,Irwin King
机构: 未知
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI)
备注: 12 pages, 6 figures, 5 tables, Proceedings of the ACM Web Conference 2026 (WWW '26), April 13–17, 2026, Dubai, United Arab Emirates
Abstract:Urban region profiling, the task of characterizing geographical areas, is crucial for urban planning and resource allocation. However, existing research in this domain faces two significant limitations. First, most methods are confined to single-task prediction, failing to capture the interconnected, multi-faceted nature of urban environments where numerous indicators are deeply correlated. Second, the field lacks a standardized experimental benchmark, which severely impedes fair comparison and reproducible progress. To address these challenges, we first establish a comprehensive benchmark for multi-task urban region profiling, featuring multi-modal features and a diverse set of strong baselines to ensure a fair and rigorous evaluation environment. Concurrently, we propose UrbanMoE, the first sparse multi-modal, multi-expert framework specifically architected to solve the multi-task challenge. Leveraging a sparse Mixture-of-Experts architecture, it dynamically routes multi-modal features to specialized sub-networks, enabling the simultaneous prediction of diverse urban indicators. We conduct extensive experiments on three real-world datasets within our benchmark, where UrbanMoE consistently demonstrates superior performance over all baselines. Further in-depth analysis validates the efficacy and efficiency of our approach, setting a new state-of-the-art and providing the community with a valuable tool for future research in urban analytics.
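UrbanMoE 依赖稀疏专家混合(MoE)把多模态特征动态路由到专用子网络。下面用 numpy 写一个通用的 top-k 门控 MoE 前向传播示意(专家此处用随机线性变换代替,维度与专家数均为本文虚构,仅展示稀疏路由机制本身):

```python
import numpy as np

def topk_moe(x, gate_w, experts, k=2):
    """稀疏 MoE 层(示意):按门控分数选出 top-k 专家,
    用重归一化的门控权重混合其输出,实现动态稀疏路由。"""
    logits = gate_w @ x
    top = np.argsort(logits)[-k:]            # top-k 专家索引
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                             # 仅在被选中专家上归一化
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

rng = np.random.default_rng(0)
x = rng.normal(size=4)                       # 某城市区域的多模态特征
gate_w = rng.normal(size=(3, 4))             # 3 个专家的门控参数
experts = [(lambda v, W=rng.normal(size=(4, 4)): W @ v) for _ in range(3)]
y = topk_moe(x, gate_w, experts)
```

每个输入只激活 k 个专家,计算量与专家总数解耦,这正是稀疏 MoE 能同时服务多种城市指标预测的原因。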
zh
[AI-73] Decomposing Epistemic Uncertainty for Causal Decision Making
【速读】:该论文旨在解决在存在未观测混杂因素(unobserved confounding)的情况下,如何准确量化因果效应估计的不确定性问题,特别是区分由样本量有限导致的“样本不确定性”(sample uncertainty)与由因果识别结构不完整(如缺少对潜在混杂变量或工具变量的观测)导致的“非可识别不确定性”(non-ID uncertainty)。其解决方案的关键在于构建一个围绕经验观测分布的置信集,并计算该集合中所有分布对应的因果效应边界(causal effect bounds)的交集;通过求解最小-最大和最大-最小优化问题,利用神经因果模型搜索可能的数据生成分布及其对应的结构因果模型(SCM),从而分离出可通过增加样本量缩小的区间部分(样本不确定性)和仅能通过观测更多变量才能缩小的部分(非ID不确定性)。这一方法为决策者提供了是否值得继续收集数据的判断依据,避免无效的数据采集。
链接: https://arxiv.org/abs/2601.22736
作者: Md Musfiqur Rahman,Ziwei Jiang,Hilaf Hasson,Murat Kocaoglu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Causal inference from observational data provides strong evidence for the best action in decision-making without performing expensive randomized trials. The effect of an action is usually not identifiable under unobserved confounding, even with an infinite amount of data. Recent work uses neural networks to obtain practical bounds to such causal effects, which is often an intractable problem. However, these approaches may overfit to the dataset and be overconfident in their causal effect estimates. Moreover, there is currently no systematic approach to disentangle how much of the width of causal effect bounds is due to fundamental non-identifiability versus how much is due to finite-sample limitations. We propose a novel framework to address this problem by considering a confidence set around the empirical observational distribution and obtaining the intersection of causal effect bounds for all distributions in this confidence set. This allows us to distinguish the part of the interval that can be reduced by collecting more samples, which we call sample uncertainty, from the part that can only be reduced by observing more variables, such as latent confounders or instrumental variables, but not with more data, which we call non-ID uncertainty. The upper and lower bounds to this intersection are obtained by solving min-max and max-min problems with neural causal models by searching over all distributions that the dataset might have been sampled from, and all SCMs that entail the corresponding distribution. We demonstrate via extensive experiments on synthetic and real-world datasets that our algorithm can determine when collecting more samples will not help determine the best action. This can guide practitioners to collect more variables or lean towards a randomized study for best action identification.
zh
[AI-74] AEGIS: White-Box Attack Path Generation using LLMs and Training Effectiveness Evaluation for Large-Scale Cyber Defence Exercises
【速读】:该论文旨在解决网络安全攻防演练中攻击路径生成依赖人工专家大量投入的问题,传统自动化方法受限于预先构建的漏洞图谱或利用集合,适用场景有限。其解决方案的关键在于提出AEGIS系统,该系统结合大语言模型(LLM)进行动态exploit发现、白盒访问验证单个exploit的有效性,并通过蒙特卡洛树搜索(Monte Carlo Tree Search)在真实exploit执行基础上规划攻击链,从而无需预设漏洞图谱即可自动生成高质量攻击路径,显著缩短场景开发周期,将专家工作重心从技术验证转向场景设计。
链接: https://arxiv.org/abs/2601.22720
作者: Ivan K. Tung,Yu Xiang Shi,Alex Chien,Wenkai Liu,Lawrence Zheng
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Creating attack paths for cyber defence exercises requires substantial expert effort. Existing automation requires vulnerability graphs or exploit sets curated in advance, limiting where it can be applied. We present AEGIS, a system that generates attack paths using LLMs, white-box access, and Monte Carlo Tree Search over real exploit execution. LLM-based search discovers exploits dynamically without pre-existing vulnerability graphs, while white-box access enables validating exploits in isolation before committing to attack paths. Evaluation at CIDeX 2025, a large-scale exercise spanning 46 IT hosts, showed that AEGIS-generated paths are comparable to human-authored scenarios across four dimensions of training experience (perceived learning, engagement, believability, challenge). Results were measured with a validated questionnaire extensible to general simulation-based training. By automating exploit chain discovery and validation, AEGIS reduces scenario development from months to days, shifting expert effort from technical validation to scenario design.
zh
[AI-75] A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在强化学习(Reinforcement Learning, RL)后训练过程中,由于采用离策略(off-policy)采样导致的训练不稳定问题。现有方法通常依赖于基于token级别的重要性采样比进行偏差校正,但在策略偏离较大时容易引发训练动态不稳。论文指出,理论上更严谨的校正项应为前缀重要性比(prefix importance ratio),而将该比值简化为token级近似会引入不稳定性。解决方案的关键在于提出一种名为Minimum Prefix Ratio (MinPRO) 的新目标函数,其通过使用先前前缀中观察到的最小token级比例作为非累积代理,替代不稳定的累积前缀比,从而显著提升大规模离策略场景下的训练稳定性和最终性能。
链接: https://arxiv.org/abs/2601.22718
作者: Shiye Lei,Zhihao Cheng,Dacheng Tao
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Reinforcement learning (RL) post-training has increasingly demonstrated strong ability to elicit reasoning behaviors in large language models (LLMs). For training efficiency, rollouts are typically generated in an off-policy manner using an older sampling policy and then used to update the current target policy. To correct the resulting discrepancy between the sampling and target policies, most existing RL objectives rely on a token-level importance sampling ratio, primarily due to its computational simplicity and numerical stability. However, we observe that token-level correction often leads to unstable training dynamics when the degree of off-policyness is large. In this paper, we revisit LLM policy optimization under off-policy conditions and show that the theoretically rigorous correction term is the prefix importance ratio, and that relaxing it to a token-level approximation can induce instability in RL post-training. To stabilize LLM optimization under large off-policy drift, we propose a simple yet effective objective, Minimum Prefix Ratio (MinPRO). MinPRO replaces the unstable cumulative prefix ratio with a non-cumulative surrogate based on the minimum token-level ratio observed in the preceding prefix. Extensive experiments on both dense and mixture-of-experts LLMs, across multiple mathematical reasoning benchmarks, demonstrate that MinPRO substantially improves training stability and peak performance in off-policy regimes.
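MinPRO 的修正项定义很明确:用前缀内逐 token 重要性比的最小值替代数值不稳定的累积前缀比。以下按此描述给出向量化实现(摘要未说明最小值是否包含当前 token,此处假设包含):

```python
import numpy as np

def minpro_ratios(logp_new, logp_old):
    """MinPRO 修正项(示意):用前缀内逐 token 重要性比的累计最小值,
    替代会连乘爆炸/消失的累积前缀重要性比。"""
    token_ratios = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    return np.minimum.accumulate(token_ratios)  # 逐位置的前缀最小值

# 三个 token 在目标策略与采样策略下的对数概率(虚构数据)
r = minpro_ratios(logp_new=[-1.0, -0.5, -2.0], logp_old=[-1.0, -1.0, -1.0])
```

与连乘的前缀比不同,前缀最小值不随序列长度累积,因而在离策略偏移较大时仍保持数值稳定,这正是论文的动机所在。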
zh
[AI-76] Breaking the Blocks: Continuous Low-Rank Decomposed Scaling for Unified LLM Quantization and Adaptation
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)量化方法中效率与表达能力之间的权衡问题,即现有基于块(block-wise)结构的量化策略虽保持高效,但限制了模型的表征灵活性。其核心解决方案是提出低秩分解缩放(Low-Rank Decomposed Scaling, LoRDS)框架,通过将缩放矩阵建模为连续低秩矩阵(S = BA),实现元素级(element-wise)量化在不牺牲效率的前提下获得更强的表达能力。关键创新在于“打破块约束”,使量化粒度从块级别提升至元素级别,同时支持高保真后训练量化(PTQ)、联合量化感知训练(QAT)以及高秩乘法型参数高效微调(PEFT),从而在压缩与适应两个维度上实现统一优化,且无额外推理开销。
链接: https://arxiv.org/abs/2601.22716
作者: Pingzhi Tang,Ruijie Zhou,Fanxu Meng,Wenjie Pei,Muhan Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Current quantization methods for LLMs predominantly rely on block-wise structures to maintain efficiency, often at the cost of representational flexibility. In this work, we demonstrate that element-wise quantization can be made as efficient as block-wise scaling while providing strictly superior expressive power by modeling the scaling manifold as continuous low-rank matrices ( S = BA ). We propose Low-Rank Decomposed Scaling (LoRDS), a unified framework that rethinks quantization granularity through this low-rank decomposition. By “breaking the blocks” of spatial constraints, LoRDS establishes a seamless efficiency lifecycle: it provides high-fidelity PTQ initialization refined via iterative optimization, enables joint QAT of weights and scaling factors, and facilitates high-rank multiplicative PEFT adaptation. Unlike additive PEFT approaches such as QLoRA, LoRDS enables high-rank weight updates within a low-rank budget while incurring no additional inference overhead. Supported by highly optimized Triton kernels, LoRDS consistently outperforms state-of-the-art baselines across various model families in both quantization and downstream fine-tuning tasks. Notably, on Llama3-8B, our method achieves up to a 27.0% accuracy improvement at 3 bits over NormalFloat quantization and delivers a 1.5x inference speedup on NVIDIA RTX 4090 while enhancing PEFT performance by 9.6% on downstream tasks over 4bit QLoRA, offering a robust and integrated solution for unified compression and adaptation of LLMs.
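LoRDS 的核心是把逐元素缩放矩阵建模为低秩乘积 S = BA,使元素级量化与块级缩放一样高效。下面是一个假设性的玩具示意(矩阵尺寸、秩与量化网格均为本文虚构;真实方法还包括迭代优化、QAT 与 PEFT,此处只演示 S = BA 的逐元素量化/反量化):

```python
import numpy as np

rng = np.random.default_rng(0)
out_dim, in_dim, rank = 8, 8, 2

W = rng.normal(size=(out_dim, in_dim))            # 待量化权重
B = rng.uniform(0.5, 1.5, size=(out_dim, rank))   # 低秩因子 B
A = rng.uniform(0.5, 1.5, size=(rank, in_dim))    # 低秩因子 A
S = B @ A                                         # S = BA:逐元素缩放矩阵

Q = np.round(W / S)                               # 逐元素量化(示意,省略裁剪)
W_hat = Q * S                                     # 反量化
err = np.abs(W - W_hat).max()                     # 每个元素误差 ≤ S_ij / 2
```

存储上只需保留 B 和 A(秩 r 时为 r·(m+n) 个缩放参数),却获得了完整的 m×n 逐元素缩放粒度,这就是"打破块约束"而不牺牲效率的含义。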
zh
[AI-77] Deep Learning-Based Early-Stage IR-Drop Estimation via CNN Surrogate Modeling
【速读】:该论文旨在解决现代超大规模集成电路(VLSI)设计中电源完整性问题中的IR-drop(电压降)分析难题,尤其是在早期设计阶段缺乏高效、准确的预测手段的问题。传统基于物理的签核工具虽精度高,但计算成本大且依赖接近最终的版图信息,难以支持快速迭代设计探索。解决方案的关键在于提出一种基于卷积神经网络(CNN)的代理建模方法,将IR-drop估计任务建模为密集像素级回归问题,利用U-Net结构的编码器-解码器架构结合跳跃连接(skip connections),有效捕捉版图中的局部与全局空间依赖关系;同时,通过自建的物理启发式合成数据集训练模型,实现毫秒级推理速度下的高精度IR-drop热力图预测,从而在签核前提供快速评估能力,辅助早期设计优化。
链接: https://arxiv.org/abs/2601.22707
作者: Ritesh Bhadana
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Image and Video Processing (eess.IV)
备注: 13 pages, 5 figures, 2 tables. Code and live demo available at this https URL
Abstract:IR-drop is a critical power integrity challenge in modern VLSI designs that can cause timing degradation, reliability issues, and functional failures if not detected early in the design flow. Conventional IR-drop analysis relies on physics-based signoff tools, which provide high accuracy but incur significant computational cost and require near-final layout information, making them unsuitable for rapid early-stage design exploration. In this work, we propose a deep learning-based surrogate modeling approach for early-stage IR-drop estimation using a CNN. The task is formulated as a dense pixel-wise regression problem, where spatial physical layout features are mapped directly to IR-drop heatmaps. A U-Net-based encoder-decoder architecture with skip connections is employed to effectively capture both local and global spatial dependencies within the layout. The model is trained on a physics-inspired synthetic dataset generated by us, which incorporates key physical factors including power grid structure, cell density distribution, and switching activity. Model performance is evaluated using standard regression metrics such as Mean Squared Error (MSE) and Peak Signal-to-Noise Ratio (PSNR). Experimental results demonstrate that the proposed approach can accurately predict IR-drop distributions with millisecond-level inference time, enabling fast pre-signoff screening and iterative design optimization. The proposed framework is intended as a complementary early-stage analysis tool, providing designers with rapid IR-drop insight prior to expensive signoff analysis. The implementation, dataset generation scripts, and the interactive inference application are publicly available at: this https URL. The live application can be accessed at: this https URL.
zh
[AI-78] Best-of-Q: Improving VLM agents with Q-function Action Ranking at Inference
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在动态数字环境(如网页)中因缺乏适应性而导致的性能下降问题,传统方法依赖于耗时且资源密集的微调训练来提升性能。其解决方案的关键在于提出一种无需重新训练策略的推理阶段增强范式:保持VLM策略冻结不变,仅利用其生成候选动作集,并引入一个轻量级、离线训练的Q函数在推理时对候选动作进行重排序,从而选择价值最高的动作执行。这一机制直接在推理过程中实现策略优化,而非用于离线数据重标注以供再训练,显著提升了代理在WebVoyager基准上的成功率(如Qwen2.5-VL-7B从38.8%提升至55.7%,GPT-4.1从82.4%提升至88.8%)。
链接: https://arxiv.org/abs/2601.22701
作者: Emilien Biré,María Santos,Kai Yuan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-Language Models (VLMs) have become powerful backbones for agents to autonomously operate in digital environments like the web and operating systems. However, these models suffer from inadaptability to fast-changing environments like the web, which can be alleviated by fine-tuning that requires expensive model training and data collection. In this work, we introduce a novel paradigm for enhancing agentic VLM policies at inference without policy retraining. Fundamentally, our approach decouples the VLM’s role as a high-capacity action proposer from the final action selection mechanism. We keep the VLM policy frozen and use it to generate a set of candidate actions for a given state. Then, a lightweight, offline-trained Q-function reranks these candidates, and the agent executes the action with the highest estimated value. The main contribution is to apply the Q-function directly during inference for immediate policy improvement, and not offline to relabel data for policy retraining. We demonstrate on the academic WebVoyager benchmark that our method significantly boosts agent success rates, improving a Qwen2.5-VL-7B agent from 38.8% to 55.7% and a proprietary GPT-4.1 agent from 82.4% to 88.8%.
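“冻结策略提议 + Q 函数重排序”的推理流程可示意如下(其中的玩具函数均为假设替身,仅演示机制本身):

```python
def best_of_q(state, propose, q_fn, k=4):
    """冻结的 VLM 策略提议 k 个候选动作,Q 函数重排序后执行价值最高者"""
    candidates = propose(state, k)
    return max(candidates, key=lambda a: q_fn(state, a))

# 玩具替身:真实系统中 propose 来自冻结的 VLM,q_fn 为离线训练的轻量 Q 函数
def toy_propose(state, k):
    return [f"action_{i}" for i in range(k)]

def toy_q(state, action):
    scores = {"action_0": 0.2, "action_1": 0.9, "action_2": 0.5, "action_3": 0.1}
    return scores[action]

chosen = best_of_q("web_page_state", toy_propose, toy_q)
```

关键在于 Q 函数只在推理时参与动作选择,策略本身无需任何再训练。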
zh
[AI-79] Do Transformers Have the Ability for Periodicity Generalization?
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在分布外(out-of-distribution, OOD)泛化能力上的显著不足,特别是针对周期性模式(periodicity)这一基本OOD场景的建模缺陷。研究指出,尽管Transformer架构在多种任务中表现优异,但其在提取和推广周期性规律方面存在局限,尤其是在面对复合周期性(composite periodicity)时无法实现有效泛化。解决方案的关键在于:首先从抽象代数与推理视角对周期性进行统一解释,涵盖单周期与复合周期;其次构建了一个名为Coper的可控生成基准,包含“空心”(Hollow)和“外推”(Extrapolation)两种OOD设置,用于系统评估模型在复合周期性下的泛化能力。实验表明,模型虽能记忆训练中的周期数据,却难以推广至未见过的复合周期结构,揭示了当前Transformer在周期性泛化上的本质瓶颈。
链接: https://arxiv.org/abs/2601.22690
作者: Huanyu Liu,Ge Li,Yihong Dong,Sihan Wu,Peixu Wang,Sihao Cheng,Taozhi Chen,Kechi Zhang,Hao Zhu,Tongxuan Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) based on the Transformer have demonstrated strong performance across diverse tasks. However, current models still exhibit substantial limitations in out-of-distribution (OOD) generalization compared with humans. We investigate this gap through periodicity, one of the basic OOD scenarios. Periodicity captures invariance amid variation. Periodicity generalization represents a model’s ability to extract periodic patterns from training data and generalize to OOD scenarios. We introduce a unified interpretation of periodicity from the perspective of abstract algebra and reasoning, including both single and composite periodicity, to explain why Transformers struggle to generalize periodicity. Then we construct Coper about composite periodicity, a controllable generative benchmark with two OOD settings, Hollow and Extrapolation. Experiments reveal that periodicity generalization in Transformers is limited, where models can memorize periodic data during training, but cannot generalize to unseen composite periodicity. We release the source code to support future research.
zh
[AI-80] From Horizontal Layering to Vertical Integration: A Comparative Study of the AI-Driven Software Development Paradigm
【速读】:该论文旨在解决生成式 AI(Generative AI)在软件工程组织中应用时所带来的结构性效率问题,即如何通过组织架构调整实现资源消耗的显著降低并提升整体生产力。其解决方案的关键在于从传统的水平分层(Horizontal Layering)向垂直整合(Vertical Integration)模式转变,从而催生“超级员工”(Super Employees)——即被 AI 增强、跨越传统职能边界的人类工程师,同时消除跨功能协作带来的冗余开销。这一转型不仅使资源消耗减少 8 至 33 倍,还促使组织将优化目标从个体生产率转向人机协同效能(Human-AI Collaboration Efficacy),并通过总要素生产率分析揭示了 AI 引起的技术杠杆效应与劳动规模回报递减的“AI 扭曲效应”(AI Distortion Effect)。
链接: https://arxiv.org/abs/2601.22667
作者: Chi Zhang,Zehan Li,Ziqian Zhong,Haibing Ma,Dan Xiao,Chen Lin,Ming Dong
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper examines the organizational implications of Generative AI adoption in software engineering through a multiple-case comparative study. We contrast two development environments: a traditional enterprise (brownfield) and an AI-native startup (greenfield). Our analysis reveals that transitioning from Horizontal Layering (functional specialization) to Vertical Integration (end-to-end ownership) yields 8-fold to 33-fold reductions in resource consumption. We attribute these gains to the emergence of Super Employees, AI-augmented engineers who span traditional role boundaries, and the elimination of inter-functional coordination overhead. Theoretically, we propose Human-AI Collaboration Efficacy as the primary optimization target for engineering organizations, supplanting individual productivity metrics. Our Total Factor Productivity analysis identifies an AI Distortion Effect that diminishes returns to labor scale while amplifying technological leverage. We conclude with managerial strategies for organizational redesign, including the reactivation of idle cognitive bandwidth in senior engineers and the suppression of blind scale expansion.
zh
[AI-81] Real-Time Aligned Reward Model beyond Semantics
【速读】:该论文旨在解决强化学习中人类反馈(Reinforcement Learning from Human Feedback, RLHF)存在的奖励过优化(reward overoptimization)问题,即策略模型过度拟合奖励模型(Reward Model, RM),并利用虚假的奖励模式而非忠实捕捉人类意图。传统方法主要依赖预训练语言模型的表面语义信息,难以有效应对因策略分布持续变化导致的奖励模型与策略模型之间的错位问题,从而引发奖励差异增大,进一步加剧过优化现象。解决方案的关键在于提出一种轻量级RLHF框架R2M(Real-Time Aligned Reward Model),其不再仅依赖静态的语义表示,而是实时利用策略模型的隐藏状态(policy feedback)来动态对齐策略在强化学习过程中的分布变化,实现奖励模型与策略模型的实时协同调整,从而提升奖励模型的适应性和鲁棒性。
链接: https://arxiv.org/abs/2601.22664
作者: Zixuan Huang,Xin Xia,Yuxi Ren,Jianbin Zheng,Xuefeng Xiao,Hongyan Xie,Li Huaqiu,Songshi Liang,Zhongxiang Dai,Fuzhen Zhuang,Jianxin Li,Yikun Ban,Deqing Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique for aligning large language models (LLMs) with human preferences, yet it is susceptible to reward overoptimization, in which policy models overfit to the reward model and exploit spurious reward patterns instead of faithfully capturing human intent. Prior mitigations primarily rely on surface semantic information and fail to efficiently address the misalignment between the reward model (RM) and the policy model caused by continuous policy distribution shifts. This inevitably leads to an increasing reward discrepancy, exacerbating reward overoptimization. To address these limitations, we introduce R2M (Real-Time Aligned Reward Model), a novel lightweight RLHF framework. R2M goes beyond vanilla reward models that solely depend on the semantic representations of a pretrained LLM. Instead, it leverages the evolving hidden states of the policy (namely policy feedback) to align with the real-time distribution shift of the policy during the RL process. This work points to a promising new direction for improving the performance of reward models through real-time utilization of feedback from policy models.
zh
[AI-82] Task-Aware LLM Council with Adaptive Decision Pathways for Decision Support ICASSP2026
【速读】:该论文旨在解决现有大语言模型(Large Language Models, LLMs)在决策任务中普遍存在的“一刀切”问题,即忽略不同LLM在特定任务上的专业化差异,导致无法根据任务复杂度和推理需求动态适配最优模型。其解决方案的关键在于提出任务感知的LLM议会(Task-Aware LLM Council, TALC),通过引入基于蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)的动态专家选择机制,结合每个模型的结构化成功记忆(success memory profile)进行语义匹配,实现当前推理上下文与历史成功轨迹的精准对齐;同时采用双信号价值估计机制融合模型预测与历史效用得分,并依据节点内方差自适应调整权重,从而在探索深度与规划置信度之间取得平衡,显著提升任务成功率与搜索效率。
链接: https://arxiv.org/abs/2601.22662
作者: Wei Zhu,Lixing Yu,Hao-Ren Yao,Zhiwen Tang,Kun Yue
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: A shorter version of this work has been accepted by ICASSP 2026
Abstract:Large language models (LLMs) have shown strong capabilities across diverse decision-making tasks. However, existing approaches often overlook the specialization differences among available models, treating all LLMs as uniformly applicable regardless of task characteristics. This limits their ability to adapt to varying reasoning demands and task complexities. In this work, we propose Task-Aware LLM Council (TALC), a task-adaptive decision framework that integrates a council of LLMs with Monte Carlo Tree Search (MCTS) to enable dynamic expert selection and efficient multi-step planning. Each LLM is equipped with a structured success memory profile derived from prior task trajectories, enabling semantic matching between current reasoning context and past successes. At each decision point, TALC routes control to the most contextually appropriate model and estimates node value using a dual-signal mechanism that fuses model-based evaluations with historical utility scores. These signals are adaptively weighted based on intra-node variance and used to guide MCTS selection, allowing the system to balance exploration depth with planning confidence. Experiments on WebShop, HumanEval, and the Game of 24 demonstrate that TALC achieves superior task success rates and improved search efficiency compared to strong baselines, validating the benefits of specialization-aware routing and adaptive planning.
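其中“双信号价值估计”可粗略示意如下(基于方差的加权规则为假设形式,论文的具体公式可能不同):

```python
import statistics

def node_value(model_evals, history_utility, alpha=1.0):
    """融合模型评估与历史效用:节点内方差越大,越依赖历史信号(假设性加权)"""
    mean_eval = statistics.fmean(model_evals)
    var = statistics.pvariance(model_evals)
    w = 1.0 / (1.0 + alpha * var)             # 分歧大 -> 降低模型信号权重
    return w * mean_eval + (1.0 - w) * history_utility

v_agree = node_value([0.80, 0.82, 0.81], history_utility=0.3)  # 评估一致
v_noisy = node_value([0.10, 0.90, 0.50], history_utility=0.3)  # 评估分歧大
```

分歧较大的节点会被拉向较保守的历史效用,从而在 MCTS 选择阶段平衡探索深度与规划置信度。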
zh
[AI-83] Human-Centered Explainability in AI-Enhanced UI Security Interfaces: Designing Trustworthy Copilots for Cybersecurity Analysts
【速读】:该论文旨在解决生成式 AI (Generative AI) 在企业网络安全平台中应用时,因缺乏有效解释机制而导致的安全分析师难以理解与信任AI输出的问题。其核心挑战在于如何将算法可解释性(algorithmic explainability)有效地融入用户界面(UI),以支持高风险决策场景下的准确判断和高效操作。解决方案的关键在于通过混合方法研究设计出多种解释风格(包括自然语言推理、置信度可视化、反事实解释及混合策略),并基于安全运营中心(SOC)实践者的控制实验验证不同解释方式对用户信任校准、决策准确性和认知负荷的影响,从而提出可落地的设计指南与面向分析师需求的解释策略框架。
链接: https://arxiv.org/abs/2601.22653
作者: Mona Rajhans
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: To appear in IEEE ICCA 2025 proceedings
Abstract:Artificial intelligence (AI) copilots are increasingly integrated into enterprise cybersecurity platforms to assist analysts in threat detection, triage, and remediation. However, the effectiveness of these systems depends not only on the accuracy of underlying models but also on the degree to which users can understand and trust their outputs. Existing research on algorithmic explainability has largely focused on model internals, while little attention has been given to how explanations should be surfaced in user interfaces for high-stakes decision-making contexts [8], [5], [6]. We present a mixed-methods study of explanation design strategies in AI-driven security dashboards. Through a taxonomy of explanation styles and a controlled user study with security practitioners, we compare natural language rationales, confidence visualizations, counterfactual explanations, and hybrid approaches. Our findings show that explanation style significantly affects user trust calibration, decision accuracy, and cognitive load. We contribute (1) empirical evidence on the usability of explanation interfaces for security copilots, (2) design guidelines for integrating explainability into enterprise UIs, and (3) a framework for aligning explanation strategies with analyst needs in security operations centers (SOCs). This work advances the design of human-centered AI tools in cybersecurity and provides broader implications for explainability in other high-stakes domains.
zh
[AI-84] GUDA: Counterfactual Group-wise Training Data Attribution for Diffusion Models via Unlearning
【速读】:该论文旨在解决视觉生成模型中训练数据的群体级归因问题(group-wise training-data attribution),即识别哪些训练数据组(如艺术风格或物体类别)对特定生成结果具有显著影响。传统方法多聚焦于单个样本的归因,而实际应用中更需要群体层面的解释。为此,作者提出GUDA(Group Unlearning-based Data Attribution)方法,其核心创新在于利用机器遗忘(machine unlearning)技术,在不重新训练模型的前提下近似每个“移除某一组数据”的反事实模型(counterfactual model),并通过比较完整模型与各反事实模型在证据下界(ELBO)上的差异来量化群体影响力。相较经典的Leave-One-Group-Out(LOGO)重训练方法,GUDA实现了计算效率提升约100倍,同时在CIFAR-10和Stable Diffusion的艺术风格归因任务上展现出更高的可靠性。
链接: https://arxiv.org/abs/2601.22651
作者: Naoki Murata,Yuhta Takida,Chieh-Hsin Lai,Toshimitsu Uesaka,Bac Nguyen,Stefano Ermon,Yuki Mitsufuji
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Training-data attribution for vision generative models aims to identify which training data influenced a given output. While most methods score individual examples, practitioners often need group-level answers (e.g., artistic styles or object classes). Group-wise attribution is counterfactual: how would a model’s behavior on a generated sample change if a group were absent from training? A natural realization of this counterfactual is Leave-One-Group-Out (LOGO) retraining, which retrains the model with each group removed; however, it becomes computationally prohibitive as the number of groups grows. We propose GUDA (Group Unlearning-based Data Attribution) for diffusion models, which approximates each counterfactual model by applying machine unlearning to a shared full-data model instead of training from scratch. GUDA quantifies group influence using differences in a likelihood-based scoring rule (ELBO) between the full model and each unlearned counterfactual. Experiments on CIFAR-10 and artistic style attribution with Stable Diffusion show that GUDA identifies primary contributing groups more reliably than semantic similarity, gradient-based attribution, and instance-level unlearning approaches, while achieving a 100x speedup on CIFAR-10 over LOGO retraining.
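群体影响力的打分逻辑可示意如下(下列 ELBO 数值为虚构,真实值需由扩散模型在生成样本上计算):

```python
def group_influence(elbo_full, elbo_unlearned):
    """影响力 = 完整模型 ELBO - 遗忘该组后的 ELBO:遗忘后下降越多,该组越重要"""
    return {g: elbo_full - e for g, e in elbo_unlearned.items()}

elbo_full = -120.0                            # 生成样本在完整模型下的 ELBO(虚构)
elbo_unlearned = {
    "impressionism": -150.0,                  # 遗忘该组后 ELBO 大幅下降
    "cubism": -125.0,
    "photos": -121.0,
}

influence = group_influence(elbo_full, elbo_unlearned)
top_group = max(influence, key=influence.get)
```

由于每个反事实模型由共享的全量模型经机器遗忘得到,无需逐组从头重训,这是相对 LOGO 加速约百倍的来源。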
zh
[AI-85] UCPO: Uncertainty-Aware Policy Optimization
【速读】:该论文旨在解决当前基于强化学习(Reinforcement Learning, RL)的大型语言模型(Large Language Models, LLMs)在高风险应用中因缺乏内在不确定性表达能力而导致幻觉(hallucination)的问题。现有RL范式如GRPO常因二元决策空间和静态不确定性奖励机制引发优势偏差(Advantage Bias),导致模型过度保守或过度自信。其解决方案的关键在于提出一种不确定性感知策略优化框架(UnCertainty-Aware Policy Optimization, UCPO),核心创新包括:1)采用三元优势解耦(Ternary Advantage Decoupling)将确定性与不确定性轨迹分离并独立归一化,从而消除优势偏差;2)引入动态不确定性奖励调整机制(Dynamic Uncertainty Reward Adjustment),根据模型演化和实例难度实时校准不确定性权重,实现奖励平衡,显著提升模型在知识边界外的可靠性与校准性能。
链接: https://arxiv.org/abs/2601.22648
作者: Xianzhou Zeng,Jing Huang,Chunmei Xie,Gongrui Nan,Siye Chen,Mengyu Lu,Weiqi Xiong,Qixuan Zhou,Junhao Zhang,Qiang Zhu,Yadong Li,Xingzhong Xu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The key to building trustworthy Large Language Models (LLMs) lies in endowing them with inherent uncertainty expression capabilities to mitigate the hallucinations that restrict their high-stakes applications. However, existing RL paradigms such as GRPO often suffer from Advantage Bias due to binary decision spaces and static uncertainty rewards, inducing either excessive conservatism or overconfidence. To tackle this challenge, this paper unveils the root causes of reward hacking and overconfidence in current RL paradigms incorporating uncertainty-based rewards, based on which we propose the UnCertainty-Aware Policy Optimization (UCPO) framework. UCPO employs Ternary Advantage Decoupling to separate and independently normalize deterministic and uncertain rollouts, thereby eliminating advantage bias. Furthermore, a Dynamic Uncertainty Reward Adjustment mechanism is introduced to calibrate uncertainty weights in real-time according to model evolution and instance difficulty. Experimental results in mathematical reasoning and general tasks demonstrate that UCPO effectively resolves the reward imbalance, significantly improving the reliability and calibration of the model beyond their knowledge boundaries.
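“三元优势解耦”的核心操作是把确定性与不确定性轨迹分组后各自归一化,可示意如下(分组字段与奖励值均为演示用假设):

```python
import statistics

def decoupled_advantages(rollouts):
    """确定性/不确定性轨迹分别归一化,避免混合归一化带来的优势偏差"""
    groups = {}
    for r in rollouts:
        groups.setdefault(r["kind"], []).append(r)
    adv = {}
    for rs in groups.values():
        rewards = [r["reward"] for r in rs]
        mu = statistics.fmean(rewards)
        sd = statistics.pstdev(rewards) or 1.0   # 组内无方差时退化为减均值
        for r in rs:
            adv[r["id"]] = (r["reward"] - mu) / sd
    return adv

rollouts = [
    {"id": "a", "kind": "deterministic", "reward": 1.0},
    {"id": "b", "kind": "deterministic", "reward": 0.0},
    {"id": "c", "kind": "uncertain", "reward": 0.5},
    {"id": "d", "kind": "uncertain", "reward": 0.3},
]
adv = decoupled_advantages(rollouts)
```

两组各自的优势均值为零,确定性轨迹的高奖励不会系统性压制不确定性表达。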
zh
[AI-86] Test-Time Mixture of World Models for Embodied Agents in Dynamic Environments ICLR2026
【速读】:该论文旨在解决语言模型(Language Model, LM)驱动的具身智能体在动态环境中适应能力有限的问题,尤其是在构建准确且灵活的世界模型(World Model)方面存在挑战。传统混合专家(Mixture-of-Experts, MoE)架构虽将知识模块化为专家组件并采用预训练路由机制,但其路由函数在部署后固定不变,难以应对未见领域或环境变化。为此,作者提出测试时混合世界模型(Test-time Mixture of World Models, TMoW),其核心创新在于:在测试阶段动态更新路由函数,使智能体能够重新组合现有世界模型并整合新模型以实现持续适应。关键机制包括:(i) 多粒度原型导向路由,支持从物体到场景级别的相似性匹配;(ii) 测试时微调,通过推理阶段对未见域特征与原型对齐增强泛化能力;(iii) 基于蒸馏的混合增强方法,利用少量数据高效生成新世界模型。实验表明,TMoW在VirtualHome、ALFWorld和RLBench等基准上均展现出卓越的零样本迁移和少样本扩展性能。
链接: https://arxiv.org/abs/2601.22647
作者: Jinwoo Jang,Minjong Yoo,Sihyung Yoon,Honguk Woo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at ICLR 2026. 10 pages. Code available at this https URL
Abstract:Language model (LM)-based embodied agents are increasingly deployed in real-world settings. Yet, their adaptability remains limited in dynamic environments, where constructing accurate and flexible world models is crucial for effective reasoning and decision-making. To address this challenge, we extend the Mixture-of-Experts (MoE) paradigm to embodied agents. While conventional MoE architectures modularize knowledge into expert components with pre-trained routing, they remain rigid once deployed, making them less effective for adapting to unseen domains in dynamic environments. We therefore propose Test-time Mixture of World Models (TMoW), a framework that enhances adaptability to unseen and evolving domains. TMoW updates its routing function over world models at test time, unlike conventional MoE where the function remains fixed, enabling agents to recombine existing models and integrate new ones for continual adaptation. It achieves this through (i) multi-granular prototype-based routing, which adapts mixtures across object- to scene-level similarities, (ii) test-time refinement that aligns unseen domain features with prototypes during inference, and (iii) distilled mixture-based augmentation, which efficiently constructs new models from few-shot data and existing prototypes. We evaluate TMoW on VirtualHome, ALFWorld, and RLBench benchmarks, demonstrating strong performance in both zero-shot adaptation and few-shot expansion scenarios, and showing that it enables embodied agents to operate effectively in dynamic environments.
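原型导向路由在单一粒度下可简化为“余弦相似度 + softmax”(下列原型向量为玩具示例;真实系统为物体到场景的多粒度匹配,且路由函数在测试时持续更新):

```python
import math

def route(feature, prototypes, temp=1.0):
    """按当前特征与各世界模型原型的余弦相似度计算混合权重"""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    sims = {name: cos(feature, p) / temp for name, p in prototypes.items()}
    z = sum(math.exp(s) for s in sims.values())
    return {name: math.exp(s) / z for name, s in sims.items()}

prototypes = {"kitchen": [1.0, 0.0], "workshop": [0.0, 1.0]}
weights = route([0.9, 0.1], prototypes)       # 场景特征更接近 kitchen 原型
```

测试时微调即是在推理中更新原型与特征对齐,使未见域也能落入合理的混合权重。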
zh
[AI-87] Beyond Medical Chatbots: Meddollina and the Rise of Continuous Clinical Intelligence
【速读】:该论文旨在解决当前生成式医疗人工智能(Generative Medical AI)系统在临床部署中表现出的结构性缺陷问题,即尽管模型在文本生成任务上表现流畅且得分提升,但其推理行为仍存在 premature closure(过早闭合)、 unjustified certainty(无依据的确定性)、intent drift(意图漂移)和 multi-step decision instability(多步决策不稳定)等与临床实践不兼容的问题。这些问题源于将医学视为“下一个词预测”(next-token prediction)的建模范式,忽视了临床推理的本质——一种在模糊性、证据不完整性和纵向上下文约束下承担责任的过程。解决方案的关键在于提出并实现一种名为 Meddollina 的治理优先型临床智能系统,该系统通过在语言生成之前施加推理约束(inference constraint),强调持续的上下文感知(persistent context awareness)、意图保持(intent preservation)、有界推理(bounded inference)以及在证据不足时的原则性回避(principled deferral),从而构建出一种新型能力类别:临床情境智能(Clinical Contextual Intelligence, CCI)。实验表明,Meddollina 在 16,412+ 种异构医疗查询中展现出校准后的不确定性、对信息不足场景的保守推理、稳定的长期约束遵守能力及显著减少的推测性补全,证明可部署的医疗 AI 不应仅依赖规模扩展,而需转向以临床行为为导向的连续临床智能(Continuous Clinical Intelligence)范式。
链接: https://arxiv.org/abs/2601.22645
作者: Vaibhav Ram S. V. N. S,Swetanshu Agrawal,Samudra Banerjee,Abdul Muhsin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Generative medical AI now appears fluent and knowledgeable enough to resemble clinical intelligence, encouraging the belief that scaling will make it safe. But clinical reasoning is not text generation. It is a responsibility-bound process under ambiguity, incomplete evidence, and longitudinal context. Even as benchmark scores rise, generation-centric systems still show behaviours incompatible with clinical deployment: premature closure, unjustified certainty, intent drift, and instability across multi-step decisions. We argue these are structural consequences of treating medicine as next-token prediction. We formalise Clinical Contextual Intelligence (CCI) as a distinct capability class required for real-world clinical use, defined by persistent context awareness, intent preservation, bounded inference, and principled deferral when evidence is insufficient. We introduce Meddollina, a governance-first clinical intelligence system designed to constrain inference before language realisation, prioritising clinical appropriateness over generative completeness. Meddollina acts as a continuous intelligence layer supporting clinical workflows while preserving clinician authority. We evaluate Meddollina using a behaviour-first regime across 16,412+ heterogeneous medical queries, benchmarking against general-purpose models, medical-tuned models, and retrieval-augmented systems. Meddollina exhibits a distinct behavioural profile: calibrated uncertainty, conservative reasoning under underspecification, stable longitudinal constraint adherence, and reduced speculative completion relative to generation-centric baselines. These results suggest deployable medical AI will not emerge from scaling alone, motivating a shift toward Continuous Clinical Intelligence, where progress is measured by clinician-aligned behaviour under uncertainty rather than fluency-driven completion. 
zh
[AI-88] ScholarPeer: A Context-Aware Multi-Agent Framework for Automated Peer Review
【速读】:该论文旨在解决当前自动化同行评审系统在生成深度学术批评方面的局限性问题,尤其是其难以准确评估论文的创新性和重要性、识别深层方法论缺陷,因为现有模型通常在缺乏外部背景的情况下对论文进行孤立评价。解决方案的关键在于提出一个名为ScholarPeer的搜索增强型多智能体框架,该框架通过双流机制实现:一是由历史学家智能体动态构建领域叙事以获取上下文,二是通过基线侦察者智能体识别缺失的对比实验,并借助多维度问答引擎验证论文中的主张,从而将评审意见扎根于实时的网络规模文献中,显著提升了自动化评审的深度与多样性。
链接: https://arxiv.org/abs/2601.22638
作者: Palash Goyal,Mihir Parmar,Yiwen Song,Hamid Palangi,Tomas Pfister,Jinsung Yoon
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Automated peer review has evolved from simple text classification to structured feedback generation. However, current state-of-the-art systems still struggle with “surface-level” critiques: they excel at summarizing content but often fail to accurately assess novelty and significance or identify deep methodological flaws because they evaluate papers in a vacuum, lacking the external context a human expert possesses. In this paper, we introduce ScholarPeer, a search-enabled multi-agent framework designed to emulate the cognitive processes of a senior researcher. ScholarPeer employs a dual-stream process of context acquisition and active verification. It dynamically constructs a domain narrative using a historian agent, identifies missing comparisons via a baseline scout, and verifies claims through a multi-aspect QA engine, grounding the critique in live web-scale literature. We evaluate ScholarPeer on DeepReview-13K and the results demonstrate that ScholarPeer achieves significant win-rates against state-of-the-art approaches in side-by-side evaluations and reduces the gap to human-level diversity.
zh
[AI-89] Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)安全评估中因采用单次或低预算对抗提示(adversarial prompting)而导致的风险低估问题。在实际场景中,攻击者可通过大规模并行采样反复探测模型,直至生成有害响应,而现有方法缺乏对这种高规模攻击下风险的准确预测能力。解决方案的关键在于提出一种尺度感知的 Best-of-N 风险估计方法(SABER),通过将样本级成功概率建模为 Beta 分布(贝塔分布),利用其作为伯努利分布的共轭先验特性,推导出可解析的缩放定律,从而实现从少量样本(如 n=100)可靠外推至大规模采样(如 N=1000)下的攻击成功率(Attack Success Rate, ASR)。该方法显著提升了风险预测精度,在仅使用 100 个样本时即可将平均绝对误差降低 86.2%,揭示了模型在平行对抗压力下可能出现非线性风险放大现象,为低成本、可扩展的 LLM 安全评估提供了新范式。
链接: https://arxiv.org/abs/2601.22636
作者: Mingqian Feng,Xiaodong Liu,Weiwei Yang,Chenliang Xu,Christopher White,Jianfeng Gao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are typically evaluated for safety under single-shot or low-budget adversarial prompting, which underestimates real-world risk. In practice, attackers can exploit large-scale parallel sampling to repeatedly probe a model until a harmful response is produced. While recent work shows that attack success increases with repeated sampling, principled methods for predicting large-scale adversarial risk remain limited. We propose a scaling-aware Best-of-N estimation of risk, SABER, for modeling jailbreak vulnerability under Best-of-N sampling. We model sample-level success probabilities using a Beta distribution, the conjugate prior of the Bernoulli distribution, and derive an analytic scaling law that enables reliable extrapolation of large-N attack success rates from small-budget measurements. Using only n=100 samples, our anchored estimator predicts ASR@1000 with a mean absolute error of 1.66, compared to 12.04 for the baseline, which is an 86.2% reduction in estimation error. Our results reveal heterogeneous risk scaling profiles and show that models appearing robust under standard evaluation can experience rapid nonlinear risk amplification under parallel adversarial pressure. This work provides a low-cost, scalable methodology for realistic LLM safety assessment. We will release our code and evaluation scripts upon publication to support future research.
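Beta-Bernoulli 共轭带来的闭式缩放律是该方法的核心:若单次成功概率 p ~ Beta(a, b),则 ASR@N = 1 - E[(1-p)^N] = 1 - B(a, b+N)/B(a, b),可用对数 Gamma 函数稳定计算(下列 a、b 为假设的小预算拟合值,非论文数据):

```python
import math

def asr_at_n(a, b, n):
    """ASR@N = 1 - B(a, b+n) / B(a, b),其中 p ~ Beta(a, b)"""
    log_ratio = (math.lgamma(b + n) - math.lgamma(b)
                 + math.lgamma(a + b) - math.lgamma(a + b + n))
    return 1.0 - math.exp(log_ratio)

a, b = 0.05, 20.0                             # 假设由小预算样本拟合得到
asr_1 = asr_at_n(a, b, 1)                     # 单次成功率,解析上等于 a / (a + b)
asr_1000 = asr_at_n(a, b, 1000)               # 外推到 N=1000 的攻击成功率
```

可以看到,即便单次成功率极低,大规模并行采样下的累计成功率也会显著放大,这正是摘要所述的非线性风险放大。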
zh
[AI-90] MCP-Diag: A Deterministic Protocol-Driven Architecture for AI-Native Network Diagnostics
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在智能运维(AIOps)应用中面临的两大核心挑战:一是“随机接地问题”(stochastic grounding problem),即LLMs难以可靠地解析来自不同厂商的非结构化命令行界面(CLI)输出;二是赋予自主代理shell访问权限带来的安全缺口。解决方案的关键在于提出MCP-Diag,一个基于Model Context Protocol(MCP)的混合神经符号架构,其核心创新包括:1)设计了一个确定性翻译层,在AI处理前将标准工具(如dig、ping、traceroute)的原始stdout转换为严格的JSON模式,提升输入可解释性与一致性;2)引入强制性的“启发循环”(Elicitation Loop),在协议层面强制执行人机协同授权(Human-in-the-Loop, HITL),从而在保障安全性的同时实现高效诊断。
链接: https://arxiv.org/abs/2601.22633
作者: Devansh Lodha,Mohit Panchal,Sameer G. Kulkarni
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: Accepted at COMSNETS 2026 Graduate Forum. Best Paper Award (Runner Up). 5 pages, 3 figures
Abstract:The integration of Large Language Models (LLMs) into network operations (AIOps) is hindered by two fundamental challenges: the stochastic grounding problem, where LLMs struggle to reliably parse unstructured, vendor-specific CLI output, and the security gap of granting autonomous agents shell access. This paper introduces MCP-Diag, a hybrid neuro-symbolic architecture built upon the Model Context Protocol (MCP). We propose a deterministic translation layer that converts raw stdout from canonical utilities (dig, ping, traceroute) into rigorous JSON schemas before AI ingestion. We further introduce a mandatory “Elicitation Loop” that enforces Human-in-the-Loop (HITL) authorization at the protocol level. Our preliminary evaluation demonstrates that MCP-Diag achieves 100% entity extraction accuracy with less than 0.9% execution latency overhead and a 3.7x increase in context token usage.
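“确定性翻译层”的思路是在送入 LLM 之前,用规则解析把原始 stdout 转为严格 JSON。以下对 ping 输出的解析为简化示意(字段命名与 schema 均为假设,真实系统还覆盖 dig、traceroute 等工具):

```python
import json
import re

def parse_ping(stdout: str) -> dict:
    """将 ping 的原始 stdout 确定性地翻译为结构化 JSON"""
    loss = re.search(r"(\d+(?:\.\d+)?)% packet loss", stdout)
    rtt = re.search(r"= ([\d.]+)/([\d.]+)/([\d.]+)", stdout)
    return {
        "packet_loss_pct": float(loss.group(1)) if loss else None,
        "rtt_ms": {"min": float(rtt.group(1)),
                   "avg": float(rtt.group(2)),
                   "max": float(rtt.group(3))} if rtt else None,
    }

raw = ("4 packets transmitted, 4 received, 0% packet loss, time 3004ms\n"
       "rtt min/avg/max/mdev = 11.2/12.5/14.1/1.1 ms")
doc = json.dumps(parse_ping(raw))             # AI 只消费这份结构化结果
```

解析在 AI 之前由确定性代码完成,既消除了 LLM 解析 CLI 输出的随机性,也避免了向代理直接开放 shell。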
zh
[AI-91] PEFT-MuTS: A Multivariate Parameter-Efficient Fine-Tuning Framework for Remaining Useful Life Prediction based on Cross-domain Time Series Representation Model
【速读】:该论文旨在解决生成式 AI (Generative AI) 在剩余使用寿命(Remaining Useful Life, RUL)预测中因缺乏大量目标设备退化数据而导致的性能瓶颈问题。传统方法如领域自适应和元学习仍依赖于与目标设备相同或相似的历史退化数据,限制了其在实际场景中的应用。解决方案的关键在于提出一种基于参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)的框架——PEFT-MuTS,其核心创新包括:1)利用大规模跨域时间序列预训练模型实现知识迁移,突破了以往仅限于相似设备间迁移的认知局限;2)设计独立特征调优网络与基于元变量的低秩多变量融合机制,使单变量预训练时序表示骨干模型能够充分挖掘多变量退化数据中的关系;3)引入零初始化回归器以在少样本条件下稳定微调过程。实验表明,该方法在航空发动机和工业轴承数据集上仅需目标设备不到1%的样本即可实现高精度RUL预测,显著优于传统监督与少样本方法,并大幅降低对数据量的需求。
链接: https://arxiv.org/abs/2601.22631
作者: En Fu,Yanyan Hu,Changhua Hu,Zengwang Jin,Kaixiang Peng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The application of data-driven remaining useful life (RUL) prediction has long been constrained by the availability of large amounts of degradation data. Mainstream solutions such as domain adaptation and meta-learning still rely on large amounts of historical degradation data from equipment that is identical or similar to the target, which imposes significant limitations in practical applications. This study investigates PEFT-MuTS, a Parameter-Efficient Fine-Tuning framework for few-shot RUL prediction, built on cross-domain pre-trained time-series representation models. Contrary to the widely held view that knowledge transfer in RUL prediction can only occur within similar devices, we demonstrate that substantial benefits can be achieved through a pre-training process with large-scale cross-domain time series datasets. An independent feature tuning network and a meta-variable-based low-rank multivariate fusion mechanism are developed to enable the pre-trained univariate time-series representation backbone model to fully exploit the multivariate relationships in degradation data for the downstream RUL prediction task. Additionally, we introduce a zero-initialized regressor that stabilizes the fine-tuning process under few-shot conditions. Experiments on aero-engine and industrial bearing datasets demonstrate that our method can achieve effective RUL prediction even when less than 1% of samples of target equipment are used. Meanwhile, it substantially outperforms conventional supervised and few-shot approaches while markedly reducing the data required to achieve high predictive accuracy. Our code is available at this https URL.
zh
[AI-92] SYMPHONY: Synergistic Multi-agent Planning with Heterogeneous Language Model Assembly NEURIPS2025
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的自主代理在复杂问题求解中因采用单一代理框架进行蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)规划而导致的探索能力受限问题,具体表现为生成的搜索分支多样性不足和规划性能欠佳。其解决方案的关键在于提出一种异构语言模型协同的多智能体规划框架——SYMPHONY(Synergistic Multi-agent Planning with Heterogeneous language model assembly),通过整合一组具有不同推理模式的基于语言模型的代理,提升 rollout 多样性并增强探索效率,从而实现更优的规划效果。
链接: https://arxiv.org/abs/2601.22623
作者: Wei Zhu,Zhiwen Tang,Kun Yue
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: Accepted by NeurIPS 2025
Abstract:Recent advancements have increasingly focused on leveraging large language models (LLMs) to construct autonomous agents for complex problem-solving tasks. However, existing approaches predominantly employ a single-agent framework to generate search branches and estimate rewards during Monte Carlo Tree Search (MCTS) planning. This single-agent paradigm inherently limits exploration capabilities, often resulting in insufficient diversity among generated branches and suboptimal planning performance. To overcome these limitations, we propose Synergistic Multi-agent Planning with Heterogeneous language model assembly (SYMPHONY), a novel multi-agent planning framework that integrates a pool of heterogeneous language model-based agents. By leveraging diverse reasoning patterns across agents, SYMPHONY enhances rollout diversity and facilitates more effective exploration. Empirical results across multiple benchmark tasks show that SYMPHONY achieves strong performance even when instantiated with open-source LLMs deployable on consumer-grade hardware. When enhanced with cloud-based LLMs accessible via API, SYMPHONY demonstrates further improvements, outperforming existing state-of-the-art baselines and underscoring the effectiveness of heterogeneous multi-agent coordination in planning tasks.
zh
[AI-93] EntroCut: Entropy-Guided Adaptive Truncation for Efficient Chain-of-Thought Reasoning in Small-scale Large Reasoning Models ICASSP26
【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRM)在执行复杂推理任务时因生成冗长中间步骤而导致的高计算成本问题。其解决方案的关键在于提出一种无需训练的动态截断方法——EntroCut,该方法通过分析早期推理步骤中模型输出分布的熵值来识别高置信度状态,从而安全地提前终止推理过程,实现效率与准确性的最优平衡。
链接: https://arxiv.org/abs/2601.22617
作者: Hongxi Yan,Qingjie Liu,Yunhong Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by ICASSP26
Abstract:Large Reasoning Models (LRMs) excel at complex reasoning tasks through extended chain-of-thought generation, but their reliance on lengthy intermediate steps incurs substantial computational cost. We find that the entropy of the model’s output distribution in early reasoning steps reliably distinguishes correct from incorrect reasoning. Motivated by this observation, we propose EntroCut, a training-free method that dynamically truncates reasoning by identifying high-confidence states where reasoning can be safely terminated. To comprehensively evaluate the trade-off between efficiency and accuracy, we introduce the Efficiency-Performance Ratio (EPR), a unified metric that quantifies relative token savings per unit accuracy loss. Experiments on four benchmarks show that EntroCut reduces token usage by up to 40% with minimal accuracy sacrifice, achieving superior efficiency-performance trade-offs compared with existing training-free methods. These results demonstrate that entropy-guided dynamic truncation provides a practical approach to mitigate the inefficiency of LRMs.
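EntroCut 的核心是"输出分布熵低即模型有信心,可安全提前截断"。下面给出一个极简示意:计算每步 next-token 分布的香农熵,连续若干步低于阈值即触发截断。阈值与 patience 为假设性示意值,并非论文设定。

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_truncate(step_probs, threshold=0.5, patience=2):
    """Hypothetical early-stopping rule in the spirit of EntroCut:
    stop once `patience` consecutive reasoning steps fall below an
    entropy threshold, i.e. the model is consistently confident.
    Threshold/patience values are illustrative, not from the paper."""
    run = 0
    for probs in step_probs:
        run = run + 1 if entropy(probs) < threshold else 0
        if run >= patience:
            return True
    return False
```

实际系统中 `step_probs` 来自每个扩散/解码步的 softmax 输出;该规则无需训练,与摘要所述 training-free 特性一致。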
zh
[AI-94] Local-Global Multimodal Contrastive Learning for Molecular Property Prediction
【速读】:该论文旨在解决分子属性预测中如何有效融合分子结构信息与化学语义信息的问题,以提升模型对分子特性的理解与预测精度。其解决方案的关键在于提出了一种局部-全局多模态对比学习框架(LGM-CL),通过AttentiveFP和Graph Transformer分别捕获分子的局部功能基团信息与全局拓扑结构,并利用自监督对比学习进行对齐;同时,引入化学增强的文本描述与原始SMILES字符串进行对比,以无任务依赖的方式融入理化语义信息;在微调阶段进一步采用双交叉注意力机制实现分子指纹与多模态表示的融合,从而实现统一的局部-全局与多模态表征学习。
链接: https://arxiv.org/abs/2601.22610
作者: Xiayu Liu,Zhengyi Lu,Yunhong Liao,Chan Fan,Hou-biao Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 9 figures. Submitted to Briefings in Bioinformatics
Abstract:Accurate molecular property prediction requires integrating complementary information from molecular structure and chemical semantics. In this work, we propose LGM-CL, a local-global multimodal contrastive learning framework that jointly models molecular graphs and textual representations derived from SMILES and chemistry-aware augmented texts. Local functional group information and global molecular topology are captured using AttentiveFP and Graph Transformer encoders, respectively, and aligned through self-supervised contrastive learning. In addition, chemically enriched textual descriptions are contrasted with original SMILES to incorporate physicochemical semantics in a task-agnostic manner. During fine-tuning, molecular fingerprints are further integrated via Dual Cross-attention multimodal fusion. Extensive experiments on MoleculeNet benchmarks demonstrate that LGM-CL achieves consistent and competitive performance across both classification and regression tasks, validating the effectiveness of unified local-global and multimodal representation learning.
zh
[AI-95] Learn More with Less: Uncertainty Consistency Guided Query Selection for RLVR
【速读】:该论文旨在解决当前基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Reward, RLVR)在数学推理任务中对大规模查询预算依赖过高的问题,从而降低标注成本。其核心解决方案是引入主动学习(Active Learning, AL)机制,并提出一种不确定性一致性度量(uncertainty consistency metric)来评估主观不确定性(subjective uncertainty)与客观不确定性(objective uncertainty)之间的匹配程度。具体而言,在离线场景下使用点二列相关系数(Point-Biserial Correlation Coefficient, PBC)衡量这种一致性;而在在线训练中,由于样本有限且输出分布动态变化,PBC估计困难,作者进一步设计了一个基于归一化优势(normalized advantage)和主观不确定性的在线变体,理论上证明该指标与离线PBC严格负相关,且能支持更优的样本选择策略。实验表明,该方法仅需30%的数据即可达到全数据集性能,显著提升了RLVR在推理任务中的效率和经济性。
链接: https://arxiv.org/abs/2601.22595
作者: Hao Yi,Yulan Hu,Xin Li,Sheng Ouyang,Lizhong Ding,Yong Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have recently improved mathematical reasoning through Reinforcement Learning with Verifiable Reward (RLVR). However, existing RLVR algorithms require large query budgets, making annotation costly. We investigate whether fewer but more informative queries can yield similar or superior performance, introducing active learning (AL) into RLVR. We identify that classic AL sampling strategies fail to outperform random selection in this setting, due to ignoring objective uncertainty when only selecting by subjective uncertainty. This work proposes an uncertainty consistency metric to evaluate how well subjective uncertainty aligns with objective uncertainty. In the offline setting, this alignment is measured using the Point-Biserial Correlation Coefficient (PBC). For online training, because of limited sampling and dynamically shifting output distributions, PBC estimation is difficult. Therefore, we introduce a new online variant, computed from normalized advantage and subjective uncertainty. Theoretically, we prove that the online variant is strictly negatively correlated with offline PBC and supports better sample selection. Experiments show our method consistently outperforms random and classic AL baselines, achieving full-dataset performance while training on only 30% of the data, effectively reducing the cost of RLVR for reasoning tasks.
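摘要中用于度量主观/客观不确定性一致性的点二列相关系数(PBC),本质上是二元变量(如 rollout 是否正确)与连续分数(如主观不确定性)之间的 Pearson 相关。下面是一个标准定义的实现示意(与论文在线变体无关):

```python
def point_biserial(binary, scores):
    """Point-biserial correlation between a 0/1 variable (e.g. rollout
    correctness) and a continuous score (e.g. subjective uncertainty).
    Equivalent to Pearson correlation with a binary variable."""
    n = len(binary)
    n1 = sum(binary)
    n0 = n - n1
    if n0 == 0 or n1 == 0:
        return 0.0  # degenerate: all labels identical
    m1 = sum(s for b, s in zip(binary, scores) if b) / n1
    m0 = sum(s for b, s in zip(binary, scores) if not b) / n0
    mean = sum(scores) / n
    sd = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5  # population sd
    return (m1 - m0) / sd * (n1 * n0 / n ** 2) ** 0.5
```

若正确样本的不确定性系统性偏高/偏低,PBC 接近 ±1;在线训练中难以稳定估计该量,正是论文引入基于归一化优势的在线变体的动机。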
zh
[AI-96] FedCARE: Federated Unlearning with Conflict-Aware Projection and Relearning-Resistant Recovery IJCAI2026
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中数据删除请求的挑战,即在隐私法规(如“被遗忘权”)要求下,如何高效、低损耗地移除特定客户端、实例或类别数据对模型的影响。现有联邦无学习(Federated Unlearning, FU)方法普遍存在高计算开销、知识纠缠导致性能下降以及恢复阶段意外重学(relearning)等问题。其解决方案的关键在于提出FedCARE框架:通过梯度上升实现本地目标数据的高效遗忘,利用无数据模型反演构建共享知识的类级代理(class-level proxies),并集成伪样本生成器与冲突感知的投影梯度上升策略以保留模型效用;同时设计抗重学恢复机制,抑制模型向未删除前状态回退,从而在客户端、实例和类别层面均实现低开销、高保真的无学习效果。
链接: https://arxiv.org/abs/2601.22589
作者: Yue Li,Mingmin Chu,Xilei Yang,Da Xiao,Ziqi Xu,Wei Shao,Qipeng Song,Hui Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures. Submitted to IJCAI 2026
Abstract:Federated learning (FL) enables collaborative model training without centralizing raw data, but privacy regulations such as the right to be forgotten require FL systems to remove the influence of previously used training data upon request. Retraining a federated model from scratch is prohibitively expensive, motivating federated unlearning (FU). However, existing FU methods suffer from high unlearning overhead, utility degradation caused by entangled knowledge, and unintended relearning during post-unlearning recovery. In this paper, we propose FedCARE, a unified and low-overhead FU framework that enables conflict-aware unlearning and relearning-resistant recovery. FedCARE leverages gradient ascent for efficient forgetting when target data are locally available and employs data-free model inversion to construct class-level proxies of shared knowledge. Based on these insights, FedCARE integrates a pseudo-sample generator, conflict-aware projected gradient ascent for utility-preserving unlearning, and a recovery strategy that suppresses rollback toward the pre-unlearning model. FedCARE supports client-, instance-, and class-level unlearning with modest overhead. Extensive experiments on multiple datasets and model architectures under both IID and non-IID settings show that FedCARE achieves effective forgetting, improved utility retention, and reduced relearning risk compared to state-of-the-art FU baselines.
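"冲突感知的投影梯度上升"可以用下面的草图直观理解:在对遗忘集做梯度上升之前,先投影去掉遗忘梯度中沿保留集梯度方向的分量,避免遗忘直接损害保留知识。投影细节为常见做法的假设性示意,并非 FedCARE 的确切公式。

```python
import numpy as np

def conflict_aware_ascent(w, g_forget, g_retain, lr=0.1):
    """Sketch of conflict-aware projected gradient ascent for unlearning:
    remove the component of the forget-set gradient along the retain-set
    gradient, then ascend (not descend) on the remainder.
    An assumption-laden illustration, not FedCARE's exact formulation."""
    g_forget = np.asarray(g_forget, dtype=float)
    g_retain = np.asarray(g_retain, dtype=float)
    proj = (g_forget @ g_retain) / (g_retain @ g_retain) * g_retain
    g = g_forget - proj                 # orthogonal to the retain direction
    return np.asarray(w, dtype=float) + lr * g  # ascent step
```

沿正交化后的方向上升,遗忘损失增大而保留方向上的一阶变化为零,对应摘要中"utility-preserving unlearning"的直觉。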
zh
[AI-97] WED-Net: A Weather-Effect Disentanglement Network with Causal Augmentation for Urban Flow Prediction WWW’26
【速读】:该论文旨在解决极端天气条件下城市时空预测(如暴雨)的难题,其核心挑战在于事件稀有性和动态复杂性,现有数据驱动方法常因依赖粗粒度气象描述符且缺乏对细粒度时空效应的捕捉机制而表现不佳;同时,尽管部分因果方法提升了分布外泛化能力,却往往忽视时间动态性或依赖固定混杂因子分层。解决方案的关键在于提出WED-Net(Weather-Effect Disentanglement Network),一种双分支Transformer架构,通过自注意力和交叉注意力分离内在交通模式与气象诱导模式,并引入记忆库和自适应门控融合特征;进一步设计判别器以显式区分天气条件强化解耦效果,辅以因果数据增强策略,在扰动非因果部分的同时保留因果结构,从而显著提升罕见场景下的泛化性能。
链接: https://arxiv.org/abs/2601.22586
作者: Qian Hong,Siyuan Chang,Xiao Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: The ACM on Web Conference 2026 (WWW’26)
Abstract:Urban spatio-temporal prediction under extreme conditions (e.g., heavy rain) is challenging due to event rarity and dynamics. Existing data-driven approaches that incorporate weather as auxiliary input often rely on coarse-grained descriptors and lack dedicated mechanisms to capture fine-grained spatio-temporal effects. Although recent methods adopt causal techniques to improve out-of-distribution generalization, they typically overlook temporal dynamics or depend on fixed confounder stratification. To address these limitations, we propose WED-Net (Weather-Effect Disentanglement Network), a dual-branch Transformer architecture that separates intrinsic and weather-induced traffic patterns via self- and cross-attention, enhanced with memory banks and fused through adaptive gating. To further promote disentanglement, we introduce a discriminator that explicitly distinguishes weather conditions. Additionally, we design a causal data augmentation strategy that perturbs non-causal parts while preserving causal structures, enabling improved generalization under rare scenarios. Experiments on taxi-flow datasets from three cities demonstrate that WED-Net delivers robust performance under extreme weather conditions, highlighting its potential to support safer mobility, disaster preparedness, and urban resilience in real-world settings. The code is publicly available at this https URL.
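双分支"自适应门控融合"是常见的构件,下面给出一个逐元素 sigmoid 门控的极简示意,用于说明内在模式分支与天气分支如何按位混合。门控参数在实际模型中是可学习的,这里直接作为输入传入,仅作假设性演示:

```python
import numpy as np

def gated_fusion(h_intrinsic, h_weather, gate_logits):
    """Elementwise sigmoid-gated fusion of two branch representations,
    a generic sketch of the adaptive gating described in the abstract
    (the learned gate network is abstracted into `gate_logits`)."""
    g = 1.0 / (1.0 + np.exp(-np.asarray(gate_logits, dtype=float)))
    return g * np.asarray(h_intrinsic) + (1.0 - g) * np.asarray(h_weather)
```

门控值趋近 1 时输出完全来自内在模式分支,趋近 0 时来自天气分支,介于其间则按维度加权混合。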
zh
[AI-98] MC-GRPO: Median-Centered Group Relative Policy Optimization for Small-Rollout Reinforcement Learning
【速读】:该论文旨在解决在资源受限场景下,基于组相对策略优化(Group-relative Policy Optimization, GRPO)方法因小样本滚动生成(small-rollout training)导致的训练精度下降问题。其核心问题是:当每提示(prompt)生成的轨迹数(rollout budget)较小时,共享均值奖励基线(shared mean reward baseline)易受异常值噪声影响,引发优势值符号翻转(advantage sign flips),即部分轨迹被错误分配负优势,从而导致策略更新方向错误。解决方案的关键在于提出中位数中心化组相对策略优化(Median-Centered Group Relative Policy Optimization, MC-GRPO)——用中位数替代均值作为基线,显著降低对异常奖励的敏感性;同时引入一个额外轨迹(G+1)用于确定中位数参考,并排除中位数对应的轨迹(pivot rollout)进行反向传播,确保每提示梯度计算仍维持G个样本,保持与标准GRPO相当的计算成本。该方法在多种模型和规模下均提升了低滚动生成场景下的稳定性与最终精度。
链接: https://arxiv.org/abs/2601.22582
作者: Youngeun Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Group-relative policy optimization methods train language models by generating multiple rollouts per prompt and normalizing rewards with a shared mean reward baseline. In resource-constrained settings where the rollout budget is small, accuracy often degrades. We find that noise in the shared baseline induces advantage sign flips, where some rollouts receive an incorrect advantage sign and the update direction is reversed. To address this, we propose Median-Centered Group Relative Policy Optimization (MC-GRPO), a simple and effective solution for small-rollout training. Our main idea is to replace the mean baseline with a median baseline: the median is far less sensitive to outlier rewards than the mean, mitigating the sign flips under a small rollout size (G). We generate one additional rollout for the median reference (G+1) and compute advantages using the group median. With an odd-sized group, exactly one completion is the median and receives zero advantage; we exclude this pivot rollout from backpropagation so the number of gradient-contributing samples per prompt remains G, preserving the core update cost of standard G-rollout training. Across various GRPO-family methods and a wide range of models and scales, this median-centered training consistently improves stability and final accuracy in the low-rollout regime, reducing the gap between G=2 and G=8 to within 1%. Code is available at this https URL
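中位数中心化优势的计算逻辑可以直接按摘要描述写成几行代码:对奇数规模(G+1)的组取中位数作为基线,中位数对应的 rollout 优势为 0 并从反传中剔除。以下为示意实现(存在并列中位数时取首个匹配,属简化假设):

```python
def mc_grpo_advantages(rewards):
    """Median-centered advantages for an odd-sized group of G+1 rollouts,
    following the abstract's description: the pivot (median) rollout gets
    zero advantage and is excluded from backprop, leaving G samples."""
    assert len(rewards) % 2 == 1, "use an odd group size (G+1)"
    med = sorted(rewards)[len(rewards) // 2]
    adv = [r - med for r in rewards]
    pivot = rewards.index(med)  # first match if rewards tie (simplification)
    keep = [i for i in range(len(rewards)) if i != pivot]
    return adv, keep
```

与均值基线相比,单个异常奖励只会平移均值而几乎不动中位数,因此其余 rollout 的优势符号更稳定。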
zh
[AI-99] FedDis: A Causal Disentanglement Framework for Federated Traffic Prediction
【速读】:该论文旨在解决联邦学习在交通预测任务中因分布式数据非独立同分布(non-IID)特性而导致的性能下降问题。现有方法通常难以区分全局共享模式与客户端特有局部动态,导致知识迁移效率低且适应性差。其解决方案的关键在于提出FedDis框架,首次将因果解耦(causal disentanglement)引入联邦时空预测任务,通过双分支架构分离两种生成源:个性化银行(Personalized Bank)捕捉客户端特有因素,全局模式银行(Global Pattern Bank)提取跨客户端共通时空模式;并通过互信息最小化目标强制两个分支的信息正交性,实现有效解耦,从而在保障本地适应性的前提下提升跨客户端知识迁移的鲁棒性和可扩展性。
链接: https://arxiv.org/abs/2601.22578
作者: Chengyang Zhou,Zijian Zhang,Chunxu Zhang,Hao Miao,Yulin Zhang,Kedi Lyu,Juncheng Hu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Federated learning offers a promising paradigm for privacy-preserving traffic prediction, yet its performance is often challenged by the non-identically and independently distributed (non-IID) nature of decentralized traffic data. Existing federated methods frequently struggle with this data heterogeneity, typically entangling globally shared patterns with client-specific local dynamics within a single representation. In this work, we postulate that this heterogeneity stems from the entanglement of two distinct generative sources: client-specific localized dynamics and cross-client global spatial-temporal patterns. Motivated by this perspective, we introduce FedDis, a novel framework that, to the best of our knowledge, is the first to leverage causal disentanglement for federated spatial-temporal prediction. Architecturally, FedDis comprises a dual-branch design wherein a Personalized Bank learns to capture client-specific factors, while a Global Pattern Bank distills common knowledge. This separation enables robust cross-client knowledge transfer while preserving high adaptability to unique local environments. Crucially, a mutual information minimization objective is employed to enforce informational orthogonality between the two branches, thereby ensuring effective disentanglement. Comprehensive experiments conducted on four real-world benchmark datasets demonstrate that FedDis consistently achieves state-of-the-art performance, promising efficiency, and superior expandability.
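摘要中两分支间的"互信息最小化"目标在实现上常用简单代理,例如惩罚个性化表示与全局表示之间的余弦相似度平方。下面是该类代理的一个假设性示意,并非 FedDis 实际使用的 MI 估计器:

```python
import numpy as np

def orthogonality_penalty(z_personal, z_global):
    """A common, simple surrogate for mutual-information minimization
    between two branch representations: the squared cosine similarity.
    Zero when the representations are orthogonal (fully disentangled
    in this proxy's sense); this is an illustrative stand-in, not
    FedDis's actual estimator."""
    a = np.asarray(z_personal, dtype=float)
    b = np.asarray(z_global, dtype=float)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float((a @ b) ** 2)
```

训练时将该惩罚加入总损失,即可推动两分支编码互不重叠的信息。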
zh
[AI-100] PerfGuard: A Performance-Aware Agent for Visual Content Generation ICLR2026
【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的智能体在视觉内容生成(AIGC)任务中因工具执行不确定性导致的规划与执行可靠性下降问题。现有框架通常假设工具调用始终成功,仅依赖静态文本描述进行决策,无法刻画工具性能边界,亦难以适应迭代更新的工具版本,从而影响任务达成质量。解决方案的关键在于提出PerfGuard框架,其核心创新包括:(1) 性能感知选择建模(Performance-Aware Selection Modeling, PASM),以多维评分体系替代通用工具描述,精确刻画工具性能边界;(2) 自适应偏好更新机制(Adaptive Preference Update, APU),通过理论排名与实际执行结果对比动态优化工具选择策略;(3) 能力对齐规划优化(Capability-Aligned Planning Optimization, CAPO),引导任务规划器生成与性能感知策略一致的子任务序列。该框架显著提升了工具选择准确性、执行可靠性和用户意图对齐度,验证了其在复杂AIGC场景下的实用性与鲁棒性。
链接: https://arxiv.org/abs/2601.22571
作者: Zhipeng Chen,Zhongrui Zhang,Chao Zhang,Yifan Xu,Lan Yang,Jun Liu,Ke Li,Yi-Zhe Song
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This paper has been accepted by ICLR 2026. The original paper link is: this https URL The code repository link is: this https URL
Abstract:The advancement of Large Language Model (LLM)-powered agents has enabled automated task processing through reasoning and tool invocation capabilities. However, existing frameworks often operate under the idealized assumption that tool executions are invariably successful, relying solely on textual descriptions that fail to distinguish precise performance boundaries and cannot adapt to iterative tool updates. This gap introduces uncertainty in planning and execution, particularly in domains like visual content generation (AIGC), where nuanced tool performance significantly impacts outcomes. To address this, we propose PerfGuard, a performance-aware agent framework for visual content generation that systematically models tool performance boundaries and integrates them into task planning and scheduling. Our framework introduces three core mechanisms: (1) Performance-Aware Selection Modeling (PASM), which replaces generic tool descriptions with a multi-dimensional scoring system based on fine-grained performance evaluations; (2) Adaptive Preference Update (APU), which dynamically optimizes tool selection by comparing theoretical rankings with actual execution rankings; and (3) Capability-Aligned Planning Optimization (CAPO), which guides the planner to generate subtasks aligned with performance-aware strategies. Experimental comparisons against state-of-the-art methods demonstrate PerfGuard's advantages in tool selection accuracy, execution reliability, and alignment with user intent, validating its robustness and practical utility for complex AIGC tasks. The project code is available at this https URL.
zh
[AI-101] Whispers of Wealth: Red-Teaming Googles Agent Payments Protocol via Prompt Injection
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)驱动的智能体在金融交易自动化过程中因依赖上下文推理而面临的提示注入攻击(prompt injection)风险,尤其是针对代理支付协议(Agent Payments Protocol, AP2)的实际鲁棒性不足问题。解决方案的关键在于通过AI红队测试(AI red-teaming evaluation)识别出直接和间接提示注入漏洞,并提出两种新型攻击技术——品牌低语攻击(Branded Whisper Attack)和保险库低语攻击(Vault Whisper Attack),实验证明这些攻击可有效操纵商品排序并窃取敏感用户数据,从而揭示当前代理支付架构中的核心缺陷,强调必须在LLM驱动的金融系统中引入更强的隔离机制与防御措施以保障安全性。
链接: https://arxiv.org/abs/2601.22569
作者: Tanusree Debi,Wentian Zhu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language model (LLM) based agents are increasingly used to automate financial transactions, yet their reliance on contextual reasoning exposes payment systems to prompt-driven manipulation. The Agent Payments Protocol (AP2) aims to secure agent-led purchases through cryptographically verifiable mandates, but its practical robustness remains underexplored. In this work, we perform an AI red-teaming evaluation of AP2 and identify vulnerabilities arising from indirect and direct prompt injection. We introduce two attack techniques, the Branded Whisper Attack and the Vault Whisper Attack, which manipulate product ranking and extract sensitive user data. Using a functional AP2 based shopping agent built with Gemini-2.5-Flash and the Google ADK framework, we experimentally validate that simple adversarial prompts can reliably subvert agent behavior. Our findings reveal critical weaknesses in current agentic payment architectures and highlight the need for stronger isolation and defensive safeguards in LLM-mediated financial systems.
zh
[AI-102] EUGens: Efficient Unified and General Dense Layers NEURIPS2025
【速读】:该论文旨在解决神经网络中全连接前馈层(Fully-connected Feedforward Layers, FFLs)在计算复杂度和参数数量上的瓶颈问题,尤其是在实时应用和资源受限环境下的可扩展性挑战。解决方案的关键在于提出了一类新型密集层——高效、统一且通用的密集层(Efficient, Unified and General dense layers, EUGens),其通过引入随机特征来近似标准FFL,并在计算中显式依赖输入范数,从而将推理复杂度从二次时间降低到线性时间;同时,EUGens首次实现了对任意多项式激活函数的无偏近似算法,在减少参数量和计算开销的同时保持了FFL的表达能力和适应性。此外,作者还设计了一种逐层知识迁移技术,无需反向传播即可高效适配预训练模型,显著提升了Transformer和MLP架构在图像分类、语言建模及3D场景重建等任务中的推理速度(最高提升27%)与内存效率(最高提升30%)。
链接: https://arxiv.org/abs/2601.22563
作者: Sang Min Kim,Byeongchan Kim,Arijit Sehanobish,Somnath Basu Roy Chowdhury,Rahul Kidambi,Dongseok Shim,Avinava Dubey,Snigdha Chaturvedi,Min-hwan Oh,Krzysztof Choromanski
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Neurips 2025. Encompasses results of arXiv:2410.09771
Abstract:Efficient neural networks are essential for scaling machine learning models to real-time applications and resource-constrained environments. Fully-connected feedforward layers (FFLs) introduce computation and parameter count bottlenecks within neural network architectures. To address this challenge, in this work, we propose a new class of dense layers that generalize standard fully-connected feedforward layers: Efficient, Unified and General dense layers (EUGens). EUGens leverage random features to approximate standard FFLs and go beyond them by incorporating a direct dependence on the input norms in their computations. The proposed layers unify existing efficient FFL extensions and improve efficiency by reducing inference complexity from quadratic to linear time. They also lead to the first unbiased algorithms approximating FFLs with arbitrary polynomial activation functions. Furthermore, EUGens reduce the parameter count and computational overhead while preserving the expressive power and adaptability of FFLs. We also present a layer-wise knowledge transfer technique that bypasses backpropagation, enabling efficient adaptation of EUGens to pre-trained models. Empirically, we observe that integrating EUGens into Transformers and MLPs yields substantial improvements in inference speed (up to 27%) and memory efficiency (up to 30%) across a range of tasks, including image classification, language model pre-training, and 3D scene reconstruction. Overall, our results highlight the potential of EUGens for the scalable deployment of large-scale neural networks in real-world scenarios.
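"用随机特征无偏近似全连接层"这一核心思想可以用最基本的随机投影恒等式来演示:对元素独立服从 N(0,1) 的矩阵 Ω,有 E[ΩᵀΩ] = mI,因此 W(ΩᵀΩ)x/m 是 Wx 的无偏估计。下面仅为原理示意,EUGens 的实际构造(多项式激活、输入范数依赖)远比这丰富:

```python
import numpy as np

def rf_dense(W, x, m=4096, rng=None):
    """Unbiased random-feature estimate of the dense map W @ x.
    Since E[Omega.T @ Omega] = m * I for iid N(0,1) entries,
    W @ Omega.T @ (Omega @ x) / m is unbiased for W @ x.
    Illustrative only; EUGens' actual construction is richer."""
    rng = rng or np.random.default_rng(0)
    omega = rng.standard_normal((m, x.shape[0]))
    return (W @ omega.T) @ (omega @ x) / m
```

注意按此直接计算并不省时;效率来自将 `W @ omega.T` 离线预计算,使推理阶段只剩两次窄矩阵乘,从而把对输入的成本降到线性级。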
zh
[AI-103] Adapting Reinforcement Learning for Path Planning in Constrained Parking Scenarios
【速读】:该论文旨在解决复杂受限环境中自主系统实时路径规划的问题,尤其是停车场景中因空间狭小、需频繁倒车调整而带来的挑战。传统经典规划方法在现实感知约束下表现脆弱,且依赖在线搜索导致计算开销大,难以实现实时部署。解决方案的关键在于提出一种基于深度强化学习(Deep Reinforcement Learning, DRL)的框架,其核心优势包括:1)不依赖理想结构化的感知输入,可直接从原始观测中学习策略,避免对定位与跟踪等额外模块的依赖;2)在测试阶段通过单次前向传播生成动作,计算轻量,满足实时性要求;3)以自行车模型动力学为基础建模序列决策问题,使智能体在闭环设置中直接学习符合车辆运动学和环境约束的导航策略。该方法在新构建的基准上实现了比传统规划基线更高的成功率(+96%)和效率(+52%)。
链接: https://arxiv.org/abs/2601.22545
作者: Feng Tao,Luca Paparusso,Chenyi Gu,Robin Koehler,Chenxu Wu,Xinyu Huang,Christian Juette,David Paz,Ren Liu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Real-time path planning in constrained environments remains a fundamental challenge for autonomous systems. Traditional classical planners, while effective under perfect perception assumptions, are often sensitive to real-world perception constraints and rely on online search procedures that incur high computational costs. In complex surroundings, this renders real-time deployment prohibitive. To overcome these limitations, we introduce a Deep Reinforcement Learning (DRL) framework for real-time path planning in parking scenarios. In particular, we focus on challenging scenes with tight spaces that require a high number of reversal maneuvers and adjustments. Unlike classical planners, our solution does not require ideal and structured perception, and in principle, could avoid the need for additional modules such as localization and tracking, resulting in a simpler and more practical implementation. Also, at test time, the policy generates actions through a single forward pass at each step, which is lightweight enough for real-time deployment. The task is formulated as a sequential decision-making problem grounded in a bicycle model dynamics, enabling the agent to directly learn navigation policies that respect vehicle kinematics and environmental constraints in the closed-loop setting. A new benchmark is developed to support both training and evaluation, capturing diverse and challenging scenarios. Our approach achieves state-of-the-art success rates and efficiency, surpassing classical planner baselines by +96% in success rate and +52% in efficiency. Furthermore, we release our benchmark as an open-source resource for the community to foster future research in autonomous systems. The benchmark and accompanying tools are available at this https URL.
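摘要所述的序列决策建立在自行车运动学模型之上。其单步更新是标准公式,下面给出示意实现(轴距 L 与步长 dt 为假设值);负速度即对应狭小车位中所需的倒车机动:

```python
import math

def bicycle_step(x, y, theta, v, steer, L=2.5, dt=0.1):
    """One kinematic bicycle-model update (standard textbook form).
    L (wheelbase, meters) and dt (seconds) are illustrative values.
    Negative v reproduces the reversal maneuvers the abstract mentions."""
    x += v * math.cos(theta) * dt       # position update
    y += v * math.sin(theta) * dt
    theta += v / L * math.tan(steer) * dt  # heading update
    return x, y, theta
```

强化学习策略每步输出 (v, steer),环境用上式推进状态并检查碰撞与车位约束,即构成摘要中的闭环设置。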
zh
[AI-104] Decoding in Geometry: Alleviating Embedding-Space Crowding for Complex Reasoning
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在基于采样的解码过程中,由于仅依赖全局概率重加权或截断策略而忽略嵌入空间中token间细粒度关系所导致的生成质量与多样性失衡问题。其核心发现是“嵌入空间拥挤”(embedding-space crowding)现象——即下一个token的概率质量集中在嵌入空间中几何邻近的token上,且该现象与数学推理任务的成功率存在统计关联。解决方案的关键在于提出CraEG(Crowding-aware Geometry-guided sampling),一种无需训练、单次遍历即可实现的插件式采样方法,通过几何引导的重加权机制缓解嵌入空间中的拥挤效应,从而提升生成结果的鲁棒性和多样性。
链接: https://arxiv.org/abs/2601.22536
作者: Yixin Yang,Qingxiu Dong,Zhifang Sui
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Sampling-based decoding underlies complex reasoning in large language models (LLMs), where decoding strategies critically shape model behavior. Temperature- and truncation-based methods reshape the next-token distribution through global probability reweighting or thresholding to balance the quality-diversity tradeoff. However, they operate solely on token probabilities, ignoring fine-grained relationships among tokens in the embedding space. We uncover a novel phenomenon, embedding-space crowding, where the next-token distribution concentrates its probability mass on geometrically close tokens in the embedding space. We quantify crowding at multiple granularities and find a statistical association with reasoning success in mathematical problem solving. Motivated by this finding, we propose CraEG, a plug-and-play sampling method that mitigates crowding through geometry-guided reweighting. CraEG is training-free, single-pass, and compatible with standard sampling strategies. Experiments on multiple models and benchmarks demonstrate improved generation performance, with gains in robustness and diversity metrics.
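"几何引导重加权"的一种直观做法:按嵌入空间中相似度加权的局部密度对 token 概率做除法惩罚,使"拥挤"区域的概率质量被分散。以下仅是体现该精神的假设性草图,CraEG 的具体规则可能不同:

```python
import numpy as np

def crowding_reweight(probs, emb, tau=1.0):
    """Hypothetical geometry-guided reweighting in the spirit of CraEG:
    divide each token probability by a similarity-weighted local density
    (crowding score), then renormalize. Not the paper's exact rule."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T                                  # cosine similarities
    density = (np.exp(sim / tau) * probs).sum(axis=1)  # crowding per token
    w = probs / density
    return w / w.sum()
```

在下面的断言中,两个嵌入完全重合的 token 构成拥挤簇,重加权后几何上孤立的第三个 token 概率相对上升。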
zh
[AI-105] Demystifying Design Choices of Reinforcement Fine-tuning: A Batched Contextual Bandit Learning Perspective
【速读】:该论文旨在解决强化学习微调(Reinforcement Fine-Tuning, RFT)领域中设计选择(design choices)作用不明确、贡献难以归因的问题,尤其在性能提升常被宣称但结论却时常不一致的情况下,亟需从理论上厘清各设计因素的角色及其关键性。其解决方案的关键在于构建一个极简基线(minimalist baseline),该基线通过每轮仅使用一次轨迹(one rollout per query)、以直接奖励作为训练信号(无优势修正,advantage trick)、固定批量大小为32等方式,将复杂的设计变量解耦,从而实现对各因素边际收益的系统性评估。此基线与批处理上下文Bandit学习(batched contextual bandit learning)建立连接,使得实验分析更具可解释性和可控性,最终通过围绕该基线设计的实验流程,揭示了不同设计选择对学习和泛化动态的影响,并识别出真正关键的因素。
链接: https://arxiv.org/abs/2601.22532
作者: Hong Xie,Xiao Hu,Tao Tan,Haoran Gu,Xin Li,Jianyu Han,Defu Lian,Enhong Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The reinforcement fine-tuning area is undergoing an explosion of papers, largely on optimizing design choices. Though performance gains are often claimed, inconsistent conclusions also arise from time to time, making the progress illusory. Reflecting on this illusion, we still lack principled answers to two fundamental questions: 1) what is the role of each design choice? 2) which ones are critical? This paper aims to shed light on them. The underlying challenge is that design choices are entangled together, making their contribution to learning and generalization difficult to attribute. To address this challenge, we first construct a minimalist baseline for disentangling factors: one rollout per query in each round, the outcome reward serving as the training signal without any advantage trick, and a batch size of thirty-two. This baseline connects to batched contextual bandit learning, which facilitates experimental analysis. Centering around this baseline, we design an experiment pipeline, examining the marginal gains of factors like advantage, number of rollouts, etc. Experiments on three base models and two datasets not only reveal new understanding of the role of various design choices in learning and generalization dynamics, but also identify critical ones that deserve more effort.
zh
[AI-106] Learn from A Rationalist: Distilling Intermediate Interpretable Rationales
【速读】:该论文旨在解决生成式 AI(Generative AI)中深度神经网络(Deep Neural Networks, DNNs)的可解释性问题,特别是针对理由提取(Rationale Extraction, RE)模型在使用能力较弱或规模较小的学生模型(student models)时预测性能不足的问题。其核心挑战在于:仅依赖最终任务标签的远程监督(remote supervision),学生模型需在所有可能的特征组合空间中搜索最优理由,计算复杂度高且难以收敛。解决方案的关键在于提出REKD(Rationale Extraction with Knowledge Distillation),通过引入一个“教师”模型(即理性主义者,rationalist)来提供额外的指导信号——不仅包括预测结果,还包括其生成的理由,从而实现知识蒸馏(knowledge distillation)。这一结构调整使学生模型能够从教师的可解释知识中学习,显著提升其预测性能,且方法具有神经网络架构无关性(neural-model agnostic),适用于多种黑盒模型(如BERT和Vision Transformer)。
链接: https://arxiv.org/abs/2601.22531
作者: Jiayi Dai,Randy Goebel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Because of the pervasive use of deep neural networks (DNNs), especially in high-stakes domains, the interpretability of DNNs has received increased attention. The general idea of rationale extraction (RE) is to provide an interpretable-by-design framework for DNNs via a select-predict architecture where two neural networks learn jointly to perform feature selection and prediction, respectively. Given only the remote supervision from the final task prediction, the process of learning to select subsets of features (or rationales) requires searching in the space of all possible feature combinations, which is computationally challenging and even harder when the base neural networks are not sufficiently capable. To improve the predictive performance of RE models that are based on less capable or smaller neural networks (i.e., the students), we propose REKD (Rationale Extraction with Knowledge Distillation) where a student RE model learns from the rationales and predictions of a teacher (i.e., a rationalist) in addition to the student's own RE optimization. This structural adjustment to RE aligns well with how humans could learn effectively from interpretable and verifiable knowledge. Because of the neural-model agnostic nature of the method, any black-box neural network could be integrated as a backbone model. To demonstrate the viability of REKD, we conduct experiments with multiple variants of BERT and vision transformer (ViT) models. Our experiments across language and vision classification datasets (i.e., IMDB movie reviews, CIFAR 10 and CIFAR 100) show that REKD significantly improves the predictive performance of the student RE models.
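REKD 式的目标函数通常由三部分组成:学生自身任务损失、向教师预测蒸馏的交叉熵项,以及学生与教师理由掩码的一致性项。下面是一个标量化的示意(α、β 权重及掩码距离均为假设,非论文确切形式):

```python
def rekd_style_loss(student_logp, teacher_p, student_mask, teacher_mask,
                    task_loss, alpha=0.5, beta=0.5):
    """Illustrative REKD-style objective: own task loss + distillation
    cross-entropy toward the teacher's prediction + rationale-mask
    agreement. Weights and the L1 mask distance are assumptions."""
    # cross-entropy of student log-probs under teacher distribution
    kd = -sum(t * lp for t, lp in zip(teacher_p, student_logp))
    # mean absolute disagreement between binary rationale masks
    rationale = (sum(abs(s - t) for s, t in zip(student_mask, teacher_mask))
                 / len(student_mask))
    return task_loss + alpha * kd + beta * rationale
```

当学生完全复现教师的预测与理由时,后两项为零,总损失退化为纯任务损失。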
zh
[AI-107] Enhancing TableQA through Verifiable Reasoning Trace Reward
【速读】:该论文旨在解决表格问答(TableQA)代理在训练过程中因答案无法从静态输入直接推断,而必须通过多步表格状态转换进行推理所带来的复杂性问题,即如何提升模型在动态表状态下的多步推理能力。解决方案的关键在于提出一种名为RE-Tab的即插即用框架,其核心创新是将问题建模为部分可观测马尔可夫决策过程(Partially Observable Markov Decision Process, POMDP),并通过轻量级、无需训练的奖励建模机制,在状态转移(“最佳动作是什么?”)和模拟推理(“我对输出确定吗?”)两个阶段提供显式可验证的奖励反馈,从而引导代理在表格状态空间中更高效地导航。此方法显著提升了TableQA任务的性能,同时大幅降低推理成本,且在多种大语言模型(LLM)和基准测试上均表现出一致的改进效果,验证了其通用性。
链接: https://arxiv.org/abs/2601.22530
作者: Tung Sum Thomas Kwok,Xinyu Wang,Hengzhi He,Xiaofeng Lin,Peng Lu,Liheng Ma,Chunhe Wang,Ying Nian Wu,Lei Ding,Guang Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:A major challenge in training TableQA agents, compared to standard text- and image-based agents, is that answers cannot be inferred from a static input but must be reasoned through stepwise transformations of the table state, introducing multi-step reasoning complexity and environmental interaction. This leads to a research question: can explicit feedback on table transformation actions improve model reasoning capability? In this work, we introduce RE-Tab, a plug-and-play framework that architecturally enhances trajectory search via lightweight, training-free reward modeling by formulating the problem as a Partially Observable Markov Decision Process. We demonstrate that providing explicit verifiable rewards during State Transition ("What is the best action?") and Simulative Reasoning ("Am I sure about the output?") is crucial to steer the agent's navigation through table states. By enforcing stepwise reasoning with reward feedback in table transformations, RE-Tab achieves state-of-the-art performance in TableQA with an almost 25% drop in inference cost. Furthermore, a direct plug-and-play implementation of RE-Tab brings up to a 41.77% improvement in QA accuracy and a 33.33% drop in test-time inference samples needed for a consistent answer. A consistent improvement pattern across various LLMs and state-of-the-art benchmarks further confirms RE-Tab's generalisability. The repository is available at this https URL .
zh
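A minimal sketch of the reward-guided action selection RE-Tab describes: score candidate table transformations with a verifiable reward and pick the best. The table, the candidate operations, and the reward heuristic here are hypothetical illustrations, not the paper's actual implementation.

```python
# Toy reward-guided trajectory search over table transformations.
# All operations, data, and the reward heuristic are illustrative.

def apply_action(table, action):
    """Apply a named transformation to a list-of-dicts table."""
    kind, arg = action
    if kind == "filter_gt":        # keep rows where a column exceeds a threshold
        col, thr = arg
        return [r for r in table if r[col] > thr]
    if kind == "select":           # project onto a single column
        return [{arg: r[arg]} for r in table]
    raise ValueError(kind)

def transition_reward(table, action, goal_col):
    """'What is the best action?' -- a verifiable heuristic: reward actions
    that keep the goal column available and shrink the table."""
    new = apply_action(table, action)
    if not new or goal_col not in new[0]:
        return -1.0
    return 1.0 - len(new) / max(len(table), 1)

def best_action(table, candidates, goal_col):
    return max(candidates, key=lambda a: transition_reward(table, a, goal_col))

table = [{"city": "A", "pop": 5}, {"city": "B", "pop": 9}, {"city": "C", "pop": 2}]
candidates = [("filter_gt", ("pop", 4)), ("select", "city")]
act = best_action(table, candidates, "pop")
print(act)                      # the filter keeps 'pop' and shrinks the table
print(apply_action(table, act))
```

The reward makes greedy step selection verifiable: dropping the goal column is punished immediately instead of surfacing as a wrong final answer.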
[AI-108] Darwinian Memory: A Training-Free Self-Regulating Memory System for GUI Agent Evolution
[Quick Read]: This paper addresses the performance bottleneck that multimodal large language model (MLLM) agents face in cross-application, long-horizon graphical user interface (GUI) automation due to limited context windows. Existing memory systems struggle to adapt to dynamic GUI environments: there is a granularity mismatch between high-level intent and low-level execution, and static memory accumulation causes context pollution that induces hallucination. The key to the solution is the Darwinian Memory System (DMS), which decomposes complex trajectories into independent, reusable units for compositional flexibility and applies utility-driven natural selection to track survival value, actively pruning inefficient paths and suppressing high-risk plans. Under this evolutionary pressure, the agent derives superior strategies, markedly improving task success rate and execution stability.
Link: https://arxiv.org/abs/2601.22528
Authors: Hongze Mi,Yibo Feng,WenJie Lu,Song Cao,Jinyuan Li,Yanming Li,Xuelin Zhang,Haotian Luo,Songyang Peng,He Cui,Tengfei Tian,Jun Fang,Hua Chai,Naiqiang Tan
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Multimodal Large Language Model (MLLM) agents facilitate Graphical User Interface (GUI) automation but struggle with long-horizon, cross-application tasks due to limited context windows. While memory systems provide a viable solution, existing paradigms struggle to adapt to dynamic GUI environments, suffering from a granularity mismatch between high-level intent and low-level execution, and context pollution where the static accumulation of outdated experiences drives agents into hallucination. To address these bottlenecks, we propose the Darwinian Memory System (DMS), a self-evolving architecture that constructs memory as a dynamic ecosystem governed by the law of survival of the fittest. DMS decomposes complex trajectories into independent, reusable units for compositional flexibility, and implements Utility-driven Natural Selection to track survival value, actively pruning suboptimal paths and inhibiting high-risk plans. This evolutionary pressure compels the agent to derive superior strategies. Extensive experiments on real-world multi-app benchmarks validate that DMS boosts general-purpose MLLMs without training costs or architectural overhead, achieving average gains of 18.0% in success rate and 33.9% in execution stability, while reducing task latency, establishing it as an effective self-evolving memory system for GUI tasks.
zh
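The utility-driven natural selection DMS describes can be sketched as a memory pool whose units gain utility on successful reuse, lose it on failure, and are evicted below a floor. The class names, the update rule, and the thresholds below are invented for illustration, not the paper's implementation.

```python
# Toy 'survival of the fittest' memory pool: units gain utility when reuse
# succeeds, lose it on failure, and are pruned once below a floor.

class MemoryUnit:
    def __init__(self, skill):
        self.skill = skill
        self.utility = 1.0

class DarwinianMemory:
    def __init__(self, floor=0.5, lr=0.3):
        self.units, self.floor, self.lr = [], floor, lr

    def add(self, skill):
        self.units.append(MemoryUnit(skill))

    def feedback(self, skill, success):
        for u in self.units:
            if u.skill == skill:
                target = 1.0 if success else 0.0
                u.utility += self.lr * (target - u.utility)  # EMA toward outcome
        # natural selection: evict units whose survival value fell too low
        self.units = [u for u in self.units if u.utility >= self.floor]

mem = DarwinianMemory()
mem.add("open_settings")
mem.add("stale_login_flow")
for _ in range(3):                      # repeated failures kill the stale unit
    mem.feedback("stale_login_flow", success=False)
print([u.skill for u in mem.units])     # ['open_settings']
```

The point of the exponential moving average is that outdated experiences decay gracefully instead of polluting the context forever.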
[AI-109] SCOPE-PD: Explainable AI on Subjective and Clinical Objective Measurements of Parkinsons Disease for Precision Decision-Making
[Quick Read]: This paper addresses delayed diagnosis in early prediction of Parkinson's disease (PD) caused by the subjectivity of traditional diagnostic methods, as well as the limited interpretability of existing machine learning (ML) models for individualized risk assessment. The key to the solution is SCOPE-PD, an explainable AI prediction framework that integrates subjective clinical assessments (e.g., the MDS-UPDRS scale) with objective biomarker data (from the Parkinson's Progression Markers Initiative, PPMI) to build a multimodal prediction model, with SHAP (SHapley Additive exPlanations) analysis improving interpretability. A Random Forest model achieves 98.66% accuracy on the combined features, and tremor, bradykinesia, and facial expression are identified as the most important predictive features, enabling accurate, interpretable, and personalized disease risk assessment.
Link: https://arxiv.org/abs/2601.22516
Authors: Md Mezbahul Islam,John Michael Templeton,Masrur Sobhan,Christian Poellabauer,Ananda Mohan Mondal
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 3 tables, 5 figures, to be published (full text online) in Springer (Springer CCIS series: electronic ISSN 1865-0937, print ISSN 1865-0929)
Abstract:Parkinson’s disease (PD) is a chronic and complex neurodegenerative disorder influenced by genetic, clinical, and lifestyle factors. Predicting this disease early is challenging because it depends on traditional diagnostic methods that face issues of subjectivity, which commonly delay diagnosis. Several objective analyses are currently in practice to help overcome the challenges of subjectivity; however, a proper explanation of these analyses is still lacking. While machine learning (ML) has demonstrated potential in supporting PD diagnosis, existing approaches often rely on subjective reports only and lack interpretability for individualized risk estimation. This study proposes SCOPE-PD, an explainable AI-based prediction framework, by integrating subjective and objective assessments to provide personalized health decisions. Subjective and objective clinical assessment data are collected from the Parkinson’s Progression Markers Initiative (PPMI) study to construct a multimodal prediction framework. Several ML techniques are applied to these data, and the best ML model is selected to interpret the results. Model interpretability is examined using SHAP-based analysis. The Random Forest algorithm achieves the highest accuracy of 98.66 percent using combined features from both subjective and objective test data. Tremor, bradykinesia, and facial expression are identified as the top three contributing features from the MDS-UPDRS test in the prediction of PD.
zh
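The feature-attribution idea behind the paper's SHAP analysis can be illustrated with the simpler permutation-importance technique: shuffle one feature and measure the accuracy drop. The synthetic data and hand-made predictor below are stand-ins, not the paper's Random Forest on PPMI features.

```python
import random

# Permutation importance on a toy classifier: the label depends only on
# features 0 and 1, so shuffling feature 2 should cost no accuracy.
random.seed(0)
X = [[random.random(), random.random(), random.random()] for _ in range(400)]
y = [int(x[0] + x[1] > 1.0) for x in X]

def predict(x):                 # a hand-made 'model' matching the labeling rule
    return int(x[0] + x[1] > 1.0)

def accuracy(X, y):
    return sum(predict(x) == t for x, t in zip(X, y)) / len(y)

def permutation_importance(X, y, j):
    """Accuracy drop after shuffling column j across rows."""
    col = [x[j] for x in X]
    random.shuffle(col)
    Xp = [x[:j] + [c] + x[j + 1:] for x, c in zip(X, col)]
    return accuracy(X, y) - accuracy(Xp, y)

drops = [permutation_importance(X, y, j) for j in range(3)]
print(drops)   # features 0 and 1 matter; feature 2 shows a 0.0 drop
```

SHAP refines this idea with game-theoretic attributions per prediction, but the "break a feature, watch the metric" intuition is the same.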
[AI-110] Why Self-Rewarding Works: Theoretical Guarantees for Iterative Alignment of Language Models
[Quick Read]: This paper addresses the lack of a rigorous mathematical explanation for why iterative optimization improves alignment in Self-Rewarding Language Models (SRLMs). The key contribution is the first set of theoretical guarantees for SRLMs: the authors establish a lower bound for a single update step, revealing the importance of initial model quality, and then derive finite-sample error bounds for the full iterative procedure, showing that performance improves at a rate of $\widetilde{\mathcal{O}}(1/\sqrt{n})$ with sample size $n$. Crucially, the influence of the initial model decays exponentially with the number of iterations $T$, theoretically explaining why SRLMs can gradually overcome poor initialization through iterative stability and consistency and achieve robust alignment gains.
Link: https://arxiv.org/abs/2601.22513
Authors: Shi Fu,Yingjie Wang,Shengchao Hu,Peng Wang,Dacheng Tao
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Self-Rewarding Language Models (SRLMs) achieve notable success in iteratively improving alignment without external feedback. Yet, despite their striking empirical progress, the core mechanisms driving their capabilities remain unelucidated, leaving a critical gap in theoretical understanding. This paper provides the first rigorous theoretical guarantees for SRLMs. We first establish a lower bound that characterizes the fundamental limits of a single update step, revealing a critical dependence on the quality of the initial model. We then derive finite-sample error bounds for the full iterative paradigm, showing that performance improves at a rate of $\widetilde{\mathcal{O}}(1/\sqrt{n})$ with sample size $n$. Crucially, our analysis reveals that the dependence on the initial model decays exponentially with the number of iterations $T$. This provides a formal explanation for why self-rewarding succeeds: it robustly overcomes poor initialization by steering the dynamics toward internal stability and consistency. Finally, we instantiate our theoretical framework for the linear softmax model class, yielding tailored guarantees that connect our high-level insights to practical model architectures.
zh
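The shape of the paper's guarantee can be visualized numerically: an error of the form $\rho^T \cdot \Delta_0 + C/\sqrt{n}$ makes the initialization term vanish exponentially in the iteration count $T$. The constants $\rho$, $C$, and the initialization gaps below are made-up values for illustration only, not the paper's actual bound.

```python
import math

# Illustrative bound shape: err(T) ~ rho**T * init_gap + C / sqrt(n).
# rho, C, and both init gaps are invented constants.
def err(T, n, init_gap, rho=0.5, C=2.0):
    return rho ** T * init_gap + C / math.sqrt(n)

good_init = [err(T, n=10_000, init_gap=0.1) for T in range(10)]
bad_init = [err(T, n=10_000, init_gap=10.0) for T in range(10)]

# Despite a 100x worse start, both runs approach the same statistical floor.
print(bad_init[0] - good_init[0])   # large gap at T = 0
print(bad_init[9] - good_init[9])   # nearly closed after 9 iterations
```

This mirrors the paper's qualitative claim: iteration count, not initialization quality, eventually dominates.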
[AI-111] Shattered Compositionality: Counterintuitive Learning Dynamics of Transformers for Arithmetic
[Quick Read]: This paper investigates why large language models (LLMs) exhibit non-human behavior and unexpected errors on complex tasks, focusing on the fundamental mismatch between how models learn skill composition and human cognitive patterns. The study shows that Transformers do not build skill compositions according to human-like sequential rules during training; instead, they often acquire skills in reverse order or in parallel, producing "shattered compositionality" in out-of-distribution settings, where skills mix in unpredictable ways. The key finding is that learning dynamics are driven primarily by correlational matching to the training data rather than by causal or procedural composition logic; consequently, pure model scaling or scratchpad-based (chain-of-thought-style) reasoning cannot mitigate the problem, suggesting that future work must reshape how models learn skill composition to improve reasoning reliability and out-of-distribution robustness.
Link: https://arxiv.org/abs/2601.22510
Authors: Xingyu Zhao,Darsh Sharma,Rheeya Uppaal,Yiqiao Zhong
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 33 pages, 27 figures
Abstract:Large language models (LLMs) often exhibit unexpected errors or unintended behavior, even at scale. While recent work reveals the discrepancy between LLMs and humans in skill compositions, the learning dynamics of skill compositions and the underlying cause of non-human behavior remain elusive. In this study, we investigate the mechanism of learning dynamics by training transformers on synthetic arithmetic tasks. Through extensive ablations and fine-grained diagnostic metrics, we discover that transformers do not reliably build skill compositions according to human-like sequential rules. Instead, they often acquire skills in reverse order or in parallel, which leads to unexpected mixing errors especially under distribution shifts–a phenomenon we refer to as shattered compositionality. To explain these behaviors, we provide evidence that correlational matching to the training data, rather than causal or procedural composition, shapes learning dynamics. We further show that shattered compositionality persists in modern LLMs and is not mitigated by pure model scaling or scratchpad-based reasoning. Our results reveal a fundamental mismatch between a model’s learning behavior and desired skill compositions, with implications for reasoning reliability, out-of-distribution robustness, and alignment.
zh
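The synthetic arithmetic setup the paper trains on can be reproduced in a few lines: in the $k$-parity-style tasks, a label is the XOR of $k$ fixed "relevant" bit positions while all other bits are distractors. The bit-width, relevant indices, and sample count below are illustrative choices.

```python
import random

# k-parity data: label = XOR of k fixed relevant bits; other bits are noise.
def parity(x, relevant):
    y = 0
    for i in relevant:
        y ^= x[i]
    return y

def make_parity_dataset(n_bits, relevant, n_samples, seed=0):
    rng = random.Random(seed)
    xs = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_samples)]
    return [(x, parity(x, relevant)) for x in xs]

relevant = (1, 4, 7)                     # k = 3
data = make_parity_dataset(10, relevant, 1000)

x, y = data[0]
x_rel = x[:]; x_rel[4] ^= 1              # flipping a relevant bit flips the label
x_dis = x[:]; x_dis[0] ^= 1              # flipping a distractor leaves it unchanged
print(parity(x_rel, relevant) == 1 - y)  # True
print(parity(x_dis, relevant) == y)      # True
```

Because every relevant bit matters equally and no proper subset predicts the label, parity is a clean probe for whether a model composes sub-skills or shortcuts via correlations.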
[AI-112] Keep Rehearsing and Refining: Lifelong Learning Vehicle Routing under Continually Drifting Tasks
[Quick Read]: This paper addresses the lifelong-learning challenge neural solvers for the Vehicle Routing Problem (VRP) face under continually drifting tasks: in real-world settings, problem patterns drift over time, yielding a continual stream of tasks with only limited training resources per task. Conventional approaches either train once on a fixed task set or assume each task receives sufficient training, neither of which suits this dynamic setting. The key to the solution is a general framework, Dual Replay with Experience Enhancement (DREE), which combines experience replay with enhanced experience representations to improve learning efficiency under limited training, mitigate catastrophic forgetting, and strengthen generalization to unseen tasks.
Link: https://arxiv.org/abs/2601.22509
Authors: Jiyuan Pei,Yi Mei,Jialin Liu,Mengjie Zhang,Xin Yao
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Existing neural solvers for vehicle routing problems (VRPs) are typically trained either in a one-off manner on a fixed set of pre-defined tasks or in a lifelong manner on several tasks arriving sequentially, assuming sufficient training on each task. Both settings overlook a common real-world property: problem patterns may drift continually over time, yielding massive tasks sequentially arising while offering only limited training resources per task. In this paper, we study a novel lifelong learning paradigm for neural VRP solvers under continually drifting tasks over learning time steps, where sufficient training for any given task at any time is not available. We propose Dual Replay with Experience Enhancement (DREE), a general framework to improve learning efficiency and mitigate catastrophic forgetting under such drift. Extensive experiments show that, under such continual drift, DREE effectively learns new tasks, preserves prior knowledge, improves generalization to unseen tasks, and can be applied to diverse existing neural solvers.
zh
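A minimal dual-replay sketch for a drifting task stream: one buffer keeps recent experience, a second keeps a reservoir sample over all past tasks, and training batches draw from both. The buffer sizes, the reservoir scheme, and the toy task stream are illustrative choices, not DREE's actual configuration.

```python
import random

# Dual replay sketch: recent FIFO buffer + long-term reservoir sample.
class DualReplay:
    def __init__(self, recent_cap=4, reservoir_cap=8, seed=0):
        self.recent, self.reservoir = [], []
        self.recent_cap, self.reservoir_cap = recent_cap, reservoir_cap
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.recent.append(item)
        if len(self.recent) > self.recent_cap:
            self.recent.pop(0)
        self.seen += 1                    # reservoir sampling over the stream
        if len(self.reservoir) < self.reservoir_cap:
            self.reservoir.append(item)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.reservoir_cap:
                self.reservoir[j] = item

    def sample(self, k):
        pool = self.recent + self.reservoir
        return [pool[self.rng.randrange(len(pool))] for _ in range(k)]

buf = DualReplay()
for task in range(5):                     # 5 drifting tasks, 10 instances each
    for i in range(10):
        buf.add(("task", task, i))

batch = buf.sample(6)
print([u.__class__ for u in []] or sorted({t for _, t, _ in buf.reservoir}))  # tasks kept long-term
print(buf.recent)                          # recent buffer holds only the newest task
```

Mixing both buffers in each batch is what lets the solver keep revisiting old task distributions while adapting to the current one.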
[AI-113] Action-Sufficient Goal Representations
[Quick Read]: This paper addresses the design of goal representations in hierarchical policies for offline goal-conditioned reinforcement learning (GCRL). Existing approaches derive goal representations implicitly while learning value functions, assuming that preserving enough information for value estimation suffices for optimal control; this assumption can fail because such representations may collapse goal states that must be distinguished for action selection. The key contribution is an information-theoretic framework that defines "action sufficiency," a necessary condition on goal representations for optimal action selection by the low-level policy, together with a proof that value sufficiency does not imply action sufficiency. Experiments further show that standard log-loss training of low-level policies naturally induces action-sufficient goal representations, significantly improving control performance.
Link: https://arxiv.org/abs/2601.22496
Authors: Jinu Hyeon,Woobin Park,Hongjoon Ahn,Taesup Moon
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Hierarchical policies in offline goal-conditioned reinforcement learning (GCRL) address long-horizon tasks by decomposing control into high-level subgoal planning and low-level action execution. A critical design choice in such architectures is the goal representation-the compressed encoding of goals that serves as the interface between these levels. Existing approaches commonly derive goal representations while learning value functions, implicitly assuming that preserving information sufficient for value estimation is adequate for optimal control. We show that this assumption can fail, even when the value estimation is exact, as such representations may collapse goal states that need to be differentiated for action learning. To address this, we introduce an information-theoretic framework that defines action sufficiency, a condition on goal representations necessary for optimal action selection. We prove that value sufficiency does not imply action sufficiency and empirically verify that the latter is more strongly associated with control success in a discrete environment. We further demonstrate that standard log-loss training of low-level policies naturally induces action-sufficient representations. Our experimental results on a popular benchmark demonstrate that our actor-derived representations consistently outperform representations learned via value estimation.
zh
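The paper's core counterexample can be made concrete in a few lines: two goals with the same value can require different actions, so a representation that preserves only value collapses them and caps the achievable policy accuracy. The states, actions, and values below are invented for illustration.

```python
from collections import defaultdict

# Two goals with identical value but different optimal actions.
goals = {"left_door": ("go_left", 1.0), "right_door": ("go_right", 1.0)}

def value_repr(g):            # keeps exactly the value -> collapses the goals
    return goals[g][1]

def action_repr(g):           # keeps identity -> action-sufficient
    return g

def best_policy_accuracy(repr_fn):
    """Best accuracy of ANY policy that sees only repr_fn(g): within each
    representation code, it must commit to a single action."""
    buckets = defaultdict(list)
    for g, (a, _) in goals.items():
        buckets[repr_fn(g)].append(a)
    correct = 0
    for acts in buckets.values():
        correct += max(acts.count(a) for a in set(acts))  # majority action
    return correct / len(goals)

print(best_policy_accuracy(value_repr))   # 0.5 -- value-sufficient only
print(best_policy_accuracy(action_repr))  # 1.0 -- action-sufficient
```

Even with exact values, the value-based encoding cannot tell the low-level policy which door to walk toward, which is exactly the gap between value sufficiency and action sufficiency.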
[AI-114] AI Literacy Safety Awareness and STEM Career Aspirations of Australian Secondary Students: Evaluating the Impact of Workshop Interventions
[Quick Read]: This paper addresses the synthetic-media safety risks adolescents face in the generative AI era, focusing on Australian secondary students' exposure to deepfakes and related behaviours. The study finds that 82.4% of students had encountered deepfakes, with 18.5% having shared and 7.3% having created such content, underscoring the urgency of improving AI literacy. The key to the solution is a workshop-based intervention that systematically strengthens students' understanding of AI, including the ability to recognize AI in everyday platforms (e.g., Netflix, Spotify, TikTok) and awareness of AI ethics, training, and safety risks. Results show the intervention significantly increased students' self-reported AI knowledge and confidence and shifted their perception of these tools from "algorithm-based" to "AI-driven," supporting a scalable path that pairs foundational AI concepts with synthetic-media safety education.
Link: https://arxiv.org/abs/2601.22486
Authors: Christian Bergh,Alexandra Vassar,Natasha Banks,Jessica Xu,Jake Renzella
Affiliation: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:
Abstract:Deepfakes and other forms of synthetic media pose growing safety risks for adolescents, yet evidence on students’ exposure and related behaviours remains limited. This study evaluates the impact of Day of AI Australia’s workshop-based intervention designed to improve AI literacy and conceptual understanding among Australian secondary students (Years 7-10). Using a mixed-methods approach with pre- and post-intervention surveys (N=205 pre; N=163 post), we analyse changes in students’ ability to identify AI in everyday tools, their understanding of AI ethics, training, and safety, and their interest in STEM-related careers. Baseline data revealed notable synthetic media risks: 82.4% of students reported having seen deepfakes, 18.5% reported sharing them, and 7.3% reported creating them. Results show higher self-reported AI knowledge and confidence after the intervention, alongside improved recognition of AI in widely used platforms such as Netflix, Spotify, and TikTok. This pattern suggests a shift from seeing these tools as merely “algorithm-based” to recognising them as AI-driven systems. Students also reported increased interest in STEM careers post-workshop; however, effect sizes were small, indicating that sustained approaches beyond one-off workshops may be needed to influence longer-term aspirations. Overall, the findings support scalable AI literacy programs that pair foundational AI concepts with an explicit emphasis on synthetic media safety.
zh
[AI-115] RulePlanner: All-in-One Reinforcement Learner for Unifying Design Rules in 3D Floorplanning
[Quick Read]: This paper addresses the growing difficulty of complying with complex hardware design rules in integrated circuit (IC) floorplanning, exacerbated by shrinking technology nodes and 3D stacked structures. Current methods handle only specific, limited design rules, while other violations require meticulous manual adjustment by experts, making post-processing labor-intensive and time-consuming. The key to the solution is a unified deep reinforcement learning framework with three core components: 1) novel matrix representations that model real-world IC design rules; 2) constraints on the action space that filter out invalid actions causing rule violations; and 3) quantitative analysis of constraint satisfaction as the reward signal, driving the agent to learn rule-compliant floorplanning policies. Experiments on public benchmarks validate the approach, demonstrate transferability to unseen circuits, and show that the framework is extensible to future design rules.
Link: https://arxiv.org/abs/2601.22476
Authors: Ruizhe Zhong,Xingbo Du,Junchi Yan
Affiliation: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Floorplanning determines the coordinate and shape of each module in Integrated Circuits. With the scaling of technology nodes, in floorplanning stage especially 3D scenarios with multiple stacked layers, it has become increasingly challenging to adhere to complex hardware design rules. Current methods are only capable of handling specific and limited design rules, while violations of other rules require manual and meticulous adjustment. This leads to labor-intensive and time-consuming post-processing for expert engineers. In this paper, we propose an all-in-one deep reinforcement learning-based approach to tackle these challenges, and design novel representations for real-world IC design rules that have not been addressed by previous approaches. Specifically, the processing of various hardware design rules is unified into a single framework with three key components: 1) novel matrix representations to model the design rules, 2) constraints on the action space to filter out invalid actions that cause rule violations, and 3) quantitative analysis of constraint satisfaction as reward signals. Experiments on public benchmarks demonstrate the effectiveness and validity of our approach. Furthermore, transferability is well demonstrated on unseen circuits. Our framework is extensible to accommodate new design rules, thus providing flexibility to address emerging challenges in future chip design. Code will be available at: this https URL
zh
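The second component, constraining the action space, can be sketched as action masking: candidate placements that would violate a rule are filtered out before the agent chooses. The grid size, the single overlap rule, and the candidate list below are illustrative inventions, not the paper's rule set.

```python
# Action masking sketch for floorplanning: drop placements that violate
# a no-overlap rule or fall off the chip grid.

def violates_overlap(placed, x, y, w, h):
    """Rule: a new w x h module at (x, y) may not overlap placed modules."""
    for (px, py, pw, ph) in placed:
        if x < px + pw and px < x + w and y < py + ph and py < y + h:
            return True
    return False

def masked_actions(placed, candidates, w, h, grid=8):
    keep = []
    for (x, y) in candidates:
        in_bounds = x + w <= grid and y + h <= grid
        if in_bounds and not violates_overlap(placed, x, y, w, h):
            keep.append((x, y))
    return keep

placed = [(0, 0, 4, 4)]                    # one 4x4 module at the origin
candidates = [(0, 0), (2, 2), (4, 0), (0, 4), (7, 7)]
valid = masked_actions(placed, candidates, w=3, h=3)
print(valid)   # only placements that fit the grid and avoid overlap survive
```

In an RL loop, masking keeps invalid actions out of the policy's softmax entirely, which is usually far more sample-efficient than punishing violations after the fact.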
[AI-116] Machine Unlearning in Low-Dimensional Feature Subspace
[Quick Read]: This paper addresses two key issues in machine unlearning (MU): mainstream methods repeatedly access the raw data during unlearning, posing privacy leakage risks, and updating the entire pretrained model is inefficient. The key to the solution is LOFT, which takes a novel low-dimensional feature subspace perspective: it optimizes a small projection matrix (principal projections) that maximally captures the information of the remaining data while diminishing that of the forgetting data, thereby achieving efficient and safe unlearning. The method needs only one-shot feature fetching from the pretrained backbone instead of repeatedly reloading raw data, significantly reducing computational overhead while improving unlearning performance.
Link: https://arxiv.org/abs/2601.22456
Authors: Kun Fang,Qinghua Tao,Junxu Liu,Yaxin Xiao,Qingqing Ye,Jian Sun,Haibo Hu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Machine Unlearning (MU) aims at removing the influence of specific data from a pretrained model while preserving performance on the remaining data. In this work, a novel perspective for MU is presented upon low-dimensional feature subspaces, which gives rise to the potentials of separating the remaining and forgetting data herein. This separability motivates our LOFT, a method that proceeds unlearning in a LOw-dimensional FeaTure subspace from the pretrained model through principal projections, which are optimized to maximally capture the information of the remaining data and meanwhile diminish that of the forgetting data. In training, LOFT simply optimizes a small-size projection matrix flexibly plugged into the pretrained model, and only requires one-shot feature fetching from the pretrained backbone instead of repetitively accessing the raw data. Hence, LOFT mitigates two critical issues in mainstream MU methods, i.e., the privacy leakage risk from massive data reload and the inefficiency of updates to the entire pretrained model. Extensive experiments validate the significantly lower computational overhead and superior unlearning performance of LOFT across diverse models, datasets, tasks, and applications. Code is anonymously available at this https URL.
zh
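The subspace intuition behind LOFT can be shown on a toy 2-D example: if the remaining data varies along one axis and the forgetting data along another, projecting features onto the first axis keeps the former's information and discards the latter's. The features and the hand-picked projection direction are purely illustrative; LOFT learns its projections by optimization.

```python
# Toy subspace unlearning: remaining data varies along axis 0, forgetting
# data along axis 1; projecting onto axis 0 keeps one and kills the other.

def variance_along(points, direction):
    """Variance of the 2-D points projected onto a direction vector."""
    proj = [x * direction[0] + y * direction[1] for (x, y) in points]
    m = sum(proj) / len(proj)
    return sum((p - m) ** 2 for p in proj) / len(proj)

remaining = [(-2.0, 0.1), (-1.0, -0.1), (1.0, 0.1), (2.0, -0.1)]
forgetting = [(0.1, -2.0), (-0.1, -1.0), (0.1, 1.0), (-0.1, 2.0)]

keep_axis = (1.0, 0.0)     # stand-in for a learned 'principal projection'
print(variance_along(remaining, keep_axis))    # large: remaining info kept
print(variance_along(forgetting, keep_axis))   # tiny: forgetting info gone
```

In LOFT this direction is not hand-picked: the projection matrix is trained so that exactly this asymmetry emerges in the backbone's feature space.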
[AI-117] Temporal Graph Pattern Machine
[Quick Read]: This paper addresses the difficulty of modeling the evolving patterns of dynamic systems in temporal graph learning: existing methods are largely task-centric and rely on restrictive assumptions (short-term dependencies, static neighborhood semantics, and retrospective time usage), which hinders the discovery of transferable temporal evolution mechanisms. The key to the solution is the Temporal Graph Pattern Machine (TGPM), which introduces interaction patches built via temporally-biased random walks to capture multi-scale structural semantics and long-range dependencies, and uses a Transformer backbone to extract global temporal regularities while adapting to context-specific interaction dynamics. Self-supervised pre-training tasks, including masked token modeling and next-time prediction, explicitly encode the fundamental laws of network evolution, yielding strong cross-domain transferability and state-of-the-art link prediction performance.
Link: https://arxiv.org/abs/2601.22454
Authors: Yijun Ma,Zehong Wang,Weixiang Sun,Yanfang Ye
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments:
Abstract:Temporal graph learning is pivotal for deciphering dynamic systems, where the core challenge lies in explicitly modeling the underlying evolving patterns that govern network transformation. However, prevailing methods are predominantly task-centric and rely on restrictive assumptions – such as short-term dependency modeling, static neighborhood semantics, and retrospective time usage. These constraints hinder the discovery of transferable temporal evolution mechanisms. To address this, we propose the Temporal Graph Pattern Machine (TGPM), a foundation framework that shifts the focus toward directly learning generalized evolving patterns. TGPM conceptualizes each interaction as an interaction patch synthesized via temporally-biased random walks, thereby capturing multi-scale structural semantics and long-range dependencies that extend beyond immediate neighborhoods. These patches are processed by a Transformer-based backbone designed to capture global temporal regularities while adapting to context-specific interaction dynamics. To further empower the model, we introduce a suite of self-supervised pre-training tasks – specifically masked token modeling and next-time prediction – to explicitly encode the fundamental laws of network evolution. Extensive experiments show that TGPM consistently achieves state-of-the-art performance in both transductive and inductive link prediction, demonstrating exceptional cross-domain transferability.
zh
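A temporally-biased random walk of the kind TGPM builds its interaction patches from can be sketched on a toy temporal graph: from state (node, t), only edges earlier than t are eligible, with more recent edges exponentially more likely. The graph, the bias parameter, and the walk length are illustrative assumptions.

```python
import math
import random

# Toy temporal graph: node -> list of (neighbor, edge timestamp).
edges = {
    "a": [("b", 1), ("c", 5)],
    "b": [("d", 2), ("a", 3)],
    "c": [("d", 4)],
    "d": [],
}

def temporal_walk(start, t_start, length, alpha=1.0, seed=0):
    """Walk backward in time; weight each eligible edge by exp(-alpha * gap)."""
    rng = random.Random(seed)
    node, t = start, t_start
    walk = [(start, t_start)]
    for _ in range(length):
        cands = [(n, ts) for n, ts in edges[node] if ts < t]
        if not cands:
            break
        weights = [math.exp(-alpha * (t - ts)) for _, ts in cands]
        node, t = rng.choices(cands, weights=weights)[0]
        walk.append((node, t))
    return walk

walk = temporal_walk("a", t_start=6, length=4)
times = [t for _, t in walk]
print(walk)
print(times == sorted(times, reverse=True))   # True: time strictly decreases
```

Because eligibility requires `ts < t`, every patch respects causality, and the exponential bias keeps patches focused on recent interactions without losing access to older ones.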
[AI-118] Does My Chatbot Have an Agenda? Understanding Human and AI Agency in Human-Human-like Chatbot Interaction
[Quick Read]: This paper examines how agency is allocated between humans and AI as chatbots shift from tools to companions: who drives the conversation, who sets boundaries, and how control is dynamically negotiated during interaction. The key contribution is a 3-by-5 agency-mapping framework that decomposes agency actions into Intention, Execution, Adaptation, Delimitation, and Negotiation, each attributable to a human, AI, or hybrid actor, systematically characterizing how agency is co-constructed turn by turn in human-AI conversation. The study further advocates translucent design (transparency-on-demand), spaces for agency negotiation, and agency-aware design guidelines to advance more ethical and collaborative conversational AI.
Link: https://arxiv.org/abs/2601.22452
Authors: Bhada Yun,Evgenia Taranova,April Yi Wang
Affiliation: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: To appear in CHI '26
Abstract:AI chatbots are shifting from tools to companions. This raises critical questions about agency: who drives conversations and sets boundaries in human-AI chatrooms? We report a month-long longitudinal study with 22 adults who chatted with Day, an LLM companion we built, followed by a semi-structured interview with post-hoc elicitation of notable moments, cross-participant chat reviews, and a ‘strategy reveal’ disclosing Day’s vertical (depth-seeking) vs. horizontal (breadth-seeking) modes. We discover that agency in human-AI chatrooms is an emergent, shared experience: as participants claimed agency by setting boundaries and providing feedback, and the AI was perceived to steer intentions and drive execution, control shifted and was co-constructed turn-by-turn. We introduce a 3-by-5 framework mapping who (human, AI, hybrid) x agency action (Intention, Execution, Adaptation, Delimitation, Negotiation), modulated by individual and environmental factors. Ultimately, we argue for translucent design (i.e. transparency-on-demand), spaces for agency negotiation, and guidelines toward agency-aware conversational AI.
zh
[AI-119] uning the Implicit Regularizer of Masked Diffusion Language Models: Enhancing Generalization via Insights from k-Parity
[Quick Read]: This paper addresses the understudied generalization properties of Masked Diffusion Language Models on complex tasks, in particular the "grokking" phenomenon familiar from autoregressive models, where performance plateaus at chance level for a long time before suddenly generalizing. The key to the solution is a theoretical decomposition of the masked diffusion objective into a "Signal" regime, which drives feature learning, and a "Noise" regime, which acts as an implicit regularizer, reshaping the learning trajectory so that models generalize rapidly and simultaneously without grokking. The authors further optimize the mask-probability distribution, significantly improving perplexity for 50M-parameter models and yielding gains of up to 8.8% and 5.8% in pre-training and supervised fine-tuning of 8B-parameter models, respectively, confirming the framework's effectiveness and scalability in large-scale generative AI settings.
Link: https://arxiv.org/abs/2601.22450
Authors: Jianhao Huang,Baharan Mirzasoleiman
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Masked Diffusion Language Models have recently emerged as a powerful generative paradigm, yet their generalization properties remain understudied compared to their auto-regressive counterparts. In this work, we investigate these properties within the setting of the k-parity problem (computing the XOR sum of k relevant bits), where neural networks typically exhibit grokking – a prolonged plateau of chance-level performance followed by sudden generalization. We theoretically decompose the Masked Diffusion (MD) objective into a Signal regime which drives feature learning, and a Noise regime which serves as an implicit regularizer. By training nanoGPT using the MD objective on the k-parity problem, we demonstrate that the MD objective fundamentally alters the learning landscape, enabling rapid and simultaneous generalization without experiencing grokking. Furthermore, we leverage our theoretical insights to optimize the distribution of the mask probability in the MD objective. Our method significantly improves perplexity for 50M-parameter models and achieves superior results across both pre-training from scratch and supervised fine-tuning. Specifically, we observe performance gains peaking at 8.8% and 5.8%, respectively, on 8B-parameter models, confirming the scalability and effectiveness of our framework in large-scale masked diffusion language model regimes.
zh
[AI-120] Controllable Information Production
[Quick Read]: This paper addresses a limitation of existing information-theoretic intrinsic motivation (IM) methods: they depend on external reward signals or on designer-specified random variables engaged in information transmission, which restricts an agent's ability to autonomously generate complex behavior. The key to the solution is a new IM principle, Controllable Information Production (CIP), which relies on neither external rewards nor designer-chosen variables; its objective is derived from Optimal Control, revealing a connection between extrinsic and intrinsic behaviors. CIP manifests as the gap between open-loop and closed-loop Kolmogorov-Sinai entropies, simultaneously rewarding the pursuit and the regulation of chaos, thereby enabling autonomous intelligent behavior without external supervision.
Link: https://arxiv.org/abs/2601.22449
Authors: Tristan Shah,Stas Tiomkin
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Intrinsic Motivation (IM) is a paradigm for generating intelligent behavior without external utilities. The existing information-theoretic methods for IM are predominantly based on information transmission, which explicitly depends on the designer’s choice of which random variables engage in transmission. In this work, we introduce a novel IM principle, Controllable Information Production (CIP), that avoids both external utilities and designer-specified variables. We derive the CIP objective from Optimal Control, showing a connection between extrinsic and intrinsic behaviors. CIP appears as the gap between open-loop and closed-loop Kolmogorov-Sinai entropies, which simultaneously rewards the pursuit and regulation of chaos. We establish key theoretical properties of CIP and demonstrate its effectiveness on standard IM benchmarks.
zh
[AI-121] Anytime Safe PAC Efficient Reasoning
[Quick Read]: This paper addresses the efficiency problems of Large Reasoning Models (LRMs) on complex tasks, caused by high computational cost and latency, especially in online settings where existing selective-thinking strategies can incur uncontrollable errors and, owing to partially observed performance loss and non-stationary data distributions, cannot guarantee safety and stability. The key to the solution is Betting Probably Approximately Correct (B-PAC) reasoning, grounded in the PAC framework: it uses inverse propensity scoring estimators to construct test supermartingales for candidate thresholds and dynamically adjusts the routing threshold based on accumulated statistical evidence of safety, achieving anytime-valid performance-loss control alongside computational efficiency. Experiments show that B-PAC reduces thinking-model usage by up to 81.01% while keeping the performance loss below the user-specified level.
Link: https://arxiv.org/abs/2601.22446
Authors: Chengyao Yu,Hao Zeng,Youxin Zhu,Jianguo Huang,Huajun Zeng,Bingyi Jing
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex tasks but suffer from high computational costs and latency. While selective thinking strategies improve efficiency by routing easy queries to non-thinking models, existing approaches often incur uncontrollable errors, especially in online settings where the performance loss of a non-thinking model is only partially observed and data are non-stationary. To address this, we propose Betting Probably Approximately Correct (B-PAC) reasoning, a principled method that enables anytime safe and efficient online reasoning under partial feedback. Specifically, we utilize inverse propensity scoring estimators to construct test supermartingales for candidate thresholds, and then dynamically adjust the routing threshold based on the accumulated statistical evidence of safety. Theoretically, we establish the anytime-valid performance loss control and the efficiency of B-PAC reasoning. Extensive experiments demonstrate that B-PAC reasoning significantly reduces computational overhead, decreasing thinking model usage by up to 81.01%, while controlling the performance loss below the user-specified level.
zh
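The anytime-valid betting test behind B-PAC can be sketched in simplified form: for a candidate routing threshold, bet against the null "mean performance loss >= eps"; the wealth process is a test supermartingale under the null, and the threshold is certified once wealth crosses 1/delta. The losses, eps, delta, and the bet size lambda below are illustrative, and the paper's actual method additionally uses inverse propensity scoring for the partially observed losses.

```python
import random

# Simplified betting certificate: wealth grows when observed losses stay
# below eps; by Ville's inequality, a false certification has prob. <= delta.
def betting_certificate(losses, eps=0.2, lam=0.5, delta=0.05):
    wealth = 1.0
    for t, loss in enumerate(losses, 1):
        wealth *= 1.0 + lam * (eps - loss)   # gain when loss < eps, lose otherwise
        if wealth >= 1.0 / delta:
            return t                          # threshold certified at step t
    return None                               # never certified

rng = random.Random(0)
safe_losses = [rng.random() * 0.1 for _ in range(200)]    # mean ~0.05 < eps
risky_losses = [rng.random() * 0.6 for _ in range(200)]   # mean ~0.3 > eps

print(betting_certificate(safe_losses))    # certified after finitely many steps
print(betting_certificate(risky_losses))   # very likely None: wealth tends to shrink
```

The key property is validity at every stopping time: the test can be monitored continuously and the certificate acted on the moment it fires.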
[AI-122] Automating Forecasting Question Generation and Resolution for AI Evaluation
[Quick Read]: This paper addresses the key challenge of generating high-quality, diverse, and verifiable forecasting questions to support systematic evaluation and improvement of AI forecasting capability. Traditional approaches rely on recurring data sources (e.g., weather or stocks), limiting question diversity and real-world decision value. The key to the solution is a system of LLM-powered web-research agents that automatically generates and later resolves real-world forecasting questions at scale. The system attains 96% question verifiability and 95% resolution accuracy, exceeding the human-curated Metaculus platform; it further confirms that more capable LLMs forecast better (e.g., a Brier score of 0.134 for Gemini 3 Pro) and demonstrates that a question-decomposition strategy significantly improves forecasting accuracy, offering a feasible path toward automated, high-quality, and efficient AI forecasting evaluation.
Link: https://arxiv.org/abs/2601.22444
Authors: Nikos I. Bosse,Peter Mühlbacher,Jack Wildman,Lawrence Phillips,Dan Schwarz
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 41 pages, 4 figures
Abstract:Forecasting future events is highly valuable in decision-making and is a robust measure of general intelligence. As forecasting is probabilistic, developing and evaluating AI forecasters requires generating large numbers of diverse and difficult questions, and accurately resolving them. Previous efforts to automate this laborious work relied on recurring data sources (e.g., weather, stocks), limiting diversity and utility. In this work, we present a system for generating and resolving high-quality forecasting questions automatically and at scale using LLM-powered web research agents. We use this system to generate 1499 diverse, real-world forecasting questions, and to resolve them several months later. We estimate that our system produces verifiable, unambiguous questions approximately 96% of the time, exceeding the rate of Metaculus, a leading human-curated forecasting platform. We also find that our system resolves questions at approximately 95% accuracy. We verify that forecasting agents powered by more intelligent LLMs perform better on these questions (Brier score of 0.134 for Gemini 3 Pro, 0.149 for GPT-5, and 0.179 for Gemini 2.5 Flash). Finally, we demonstrate how our system can be leveraged to directly improve forecasting, by evaluating a question decomposition strategy on a generated question set, yielding a significant improvement in Brier scores (0.132 vs. 0.141).
zh
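The Brier score used to compare forecasters in the paper is just the mean squared error between forecast probabilities and binary outcomes (lower is better). The outcomes and forecasts below are made-up numbers for illustration.

```python
# Brier score: mean squared error of probabilistic forecasts against
# binary outcomes. Always predicting 0.5 scores exactly 0.25.
def brier(forecasts, outcomes):
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(outcomes)

outcomes = [1, 0, 1, 1, 0]
sharp = [0.9, 0.1, 0.8, 0.7, 0.2]   # confident, well-calibrated forecasts
uninformed = [0.5] * 5              # maximally uncertain baseline

print(brier(sharp, outcomes))        # ~0.038
print(brier(uninformed, outcomes))   # 0.25
```

Because the score is a strictly proper scoring rule, a forecaster minimizes it by reporting its true beliefs, which is what makes it a fair yardstick across Gemini 3 Pro, GPT-5, and the other models evaluated.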
[AI-123] When LLM meets Fuzzy-TOPSIS for Personnel Selection through Automated Profile Analysis
[Quick Read]: This paper addresses inefficiency, subjectivity, and inconsistency in personnel selection in today's competitive employment environment, particularly for software engineering roles, where manual evaluation struggles to balance scale and fairness. The key to the solution is an automated, NLP-based screening framework, LLM-TOPSIS, which combines large language models (LLMs) with fuzzy multi-criteria decision making (Fuzzy MCDM): triangular fuzzy numbers (TFNs) represent scores and criteria weights to handle the ambiguity and subjectivity of evaluations, and a fine-tuned DistilRoBERTa model extracts candidate features for ranking, achieving up to 91% accuracy on the Experience and Overall attributes and significantly improving the scalability, consistency, and bias mitigation of the recruitment process.
Link: https://arxiv.org/abs/2601.22433
Authors: Shahria Hoque,Ahmed Akib Jawad Karim,Md. Golam Rabiul Alam,Nirjhar Gope
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 10 pages, 8 figures. This paper has been peer-reviewed and published in IEEE Access. The arXiv version corresponds to the accepted author manuscript (AAM)
Abstract:In this highly competitive employment environment, the selection of suitable personnel is essential for organizational success. This study presents an automated personnel selection system that utilizes sophisticated natural language processing (NLP) methods to assess and rank software engineering applicants. A distinctive dataset was created by aggregating LinkedIn profiles that include essential features such as education, work experience, abilities, and self-introduction, further enhanced with expert assessments to function as standards. The research combines large language models (LLMs) with multicriteria decision-making (MCDM) theory to develop the LLM-TOPSIS framework. In this context, we utilized the TOPSIS method enhanced by fuzzy logic (Fuzzy TOPSIS) to address the intrinsic ambiguity and subjectivity in human assessments. We utilized triangular fuzzy numbers (TFNs) to describe criteria weights and scores, thereby addressing the ambiguity frequently encountered in candidate evaluations. For candidate ranking, the DistilRoBERTa model was fine-tuned and integrated with the fuzzy TOPSIS method, achieving rankings closely aligned with human expert evaluations and attaining an accuracy of up to 91% for the Experience attribute and the Overall attribute. The study underlines the potential of NLP-driven frameworks to improve recruitment procedures by boosting scalability, consistency, and minimizing prejudice. Future endeavors will concentrate on augmenting the dataset, enhancing model interpretability, and verifying the system in actual recruitment scenarios to better evaluate its practical applicability. This research highlights the intriguing potential of merging NLP with fuzzy decision-making methods in personnel selection, enabling scalable and unbiased solutions to recruitment difficulties.
zh
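The Fuzzy TOPSIS ranking step can be sketched compactly: scores are triangular fuzzy numbers (l, m, u), and candidates are ranked by their closeness coefficient to the fuzzy ideal solution. The two candidates, the two benefit criteria, and the weights below are invented; the paper's criteria and weights come from its recruitment dataset.

```python
import math

# Compact Fuzzy TOPSIS: TFN-weighted ratings, distances to the fuzzy
# positive/negative ideal solutions, then closeness coefficients.
def tfn_mul(a, b):
    return (a[0] * b[0], a[1] * b[1], a[2] * b[2])

def tfn_dist(a, b):   # common vertex-method distance between two TFNs
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / 3)

def fuzzy_topsis(scores, weights):
    """scores[i][j]: TFN rating of candidate i on benefit criterion j,
    already normalized to [0, 1]."""
    weighted = [[tfn_mul(s, w) for s, w in zip(row, weights)] for row in scores]
    fpis = (1.0, 1.0, 1.0)                     # fuzzy positive ideal
    fnis = (0.0, 0.0, 0.0)                     # fuzzy negative ideal
    cc = []
    for row in weighted:
        d_plus = sum(tfn_dist(v, fpis) for v in row)
        d_minus = sum(tfn_dist(v, fnis) for v in row)
        cc.append(d_minus / (d_plus + d_minus))  # closeness coefficient
    return cc

weights = [(0.5, 0.7, 0.9), (0.3, 0.5, 0.7)]   # e.g. Experience weighted higher
scores = [
    [(0.7, 0.8, 0.9), (0.5, 0.6, 0.7)],        # candidate A
    [(0.2, 0.3, 0.4), (0.8, 0.9, 1.0)],        # candidate B
]
cc = fuzzy_topsis(scores, weights)
print(cc)
print(cc[0] > cc[1])   # True: A's strength on the heavier criterion wins
```

The TFNs let an assessor say "roughly 0.8, somewhere between 0.7 and 0.9" instead of committing to a single crisp score, which is exactly how the framework absorbs evaluator subjectivity.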
[AI-124] CoDCL: Counterfactual Data Augmentation Contrastive Learning for Continuous-Time Dynamic Network Link Prediction
[Quick Read]: This paper addresses the insufficient adaptability of prediction models in continually evolving dynamic networks, i.e., how to keep predictions robust in complex temporal environments. The key to the solution is CoDCL (Counterfactual Data augmentation and Contrastive Learning), a dynamic network learning framework that combines counterfactual data augmentation with contrastive learning to quantify and model temporal changes in interaction patterns. Its counterfactual data generation strategy couples a dynamic-treatments design with efficient structural neighborhood exploration to produce high-quality counterfactual samples, enabling the model to better understand the effects of structural changes and thereby improving dynamic graph representation learning.
Link: https://arxiv.org/abs/2601.22427
Authors: Hantong Feng,Yonggang Wu,Duxin Chen,Wenwu Yu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:The rapid growth and continuous structural evolution of dynamic networks make effective predictions increasingly challenging. To enable prediction models to adapt to complex temporal environments, they need to be robust to emerging structural changes. We propose a dynamic network learning framework CoDCL, which combines counterfactual data augmentation with contrastive learning to address this challenge. Specifically, we devise a comprehensive strategy to generate high-quality counterfactual data, combining a dynamic treatments design with efficient structural neighborhood exploration to quantify the temporal changes in interaction patterns. Moreover, the entire CoDCL is designed as a plug-and-play universal module that can be seamlessly integrated into various existing temporal graph models without requiring architectural modifications. Extensive experiments on multiple real-world datasets demonstrate that CoDCL significantly improves over state-of-the-art baseline models in the field of dynamic networks, confirming the critical role of integrating counterfactual data augmentation into dynamic representation learning.
zh
[AI-125] MetaLead: A Comprehensive Human-Curated Leaderboard Dataset for Transparent Reporting of Machine Learning Experiments EACL2026
【速读】:该论文旨在解决传统机器学习(Machine Learning, ML)排行榜(Leaderboard)构建过程中依赖大量人工劳动、数据不透明以及元数据匮乏的问题。现有数据集通常仅收录每篇论文的最佳结果,缺乏对实验类型(如基线、所提方法或其变体)和训练/测试数据集划分的明确标注,限制了跨领域评估与细粒度比较的能力。其解决方案的关键在于提出MetaLead——一个全人工标注的ML排行榜数据集,不仅完整记录所有实验结果以提升透明度,还引入结构化元数据,包括实验类型分类和显式的训练/测试数据集分离,从而支持更精准、可复现且具有指导意义的模型性能评估与对比分析。
链接: https://arxiv.org/abs/2601.22420
作者: Roelien C. Timmer,Necva Bölücü,Stephen Wan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: EACL 2026
Abstract:Leaderboards are crucial in the machine learning (ML) domain for benchmarking and tracking progress. However, creating leaderboards traditionally demands significant manual effort. In recent years, efforts have been made to automate leaderboard generation, but existing datasets for this purpose are limited by capturing only the best results from each paper and limited metadata. We present MetaLead, a fully human-annotated ML Leaderboard dataset that captures all experimental results for result transparency and contains extra metadata, such as the result experimental type: baseline, proposed method, or variation of proposed method for experiment-type guided comparisons, and explicitly separates train and test dataset for cross-domain assessment. This enriched structure makes MetaLead a powerful resource for more transparent and nuanced evaluations across ML research.
zh
[AI-126] Dynamic Welfare-Maximizing Pooled Testing
【速读】:该论文旨在解决在有限检测资源下,如何通过动态分配测试任务来最大化公共卫生筛查中的社会福利(即确诊为健康的个体总效用)的问题。传统静态池化检测方法预先分配所有测试,无法适应实际检测过程中的信息更新;而本文提出了一种动态福利最大化池化检测策略,允许测试按序进行并根据前期结果调整后续决策。其关键在于构建一个形式化的动态优化框架,并评估多种算法策略(包括精确优化、贪心启发式、混合整数规划松弛和基于学习的策略)的性能。实验表明,在低预算场景中,动态策略相比静态基线能显著提升福利,且简单贪心策略即可捕获大部分收益,同时保持计算效率,从而为实际应用提供了高效可行的解决方案。
链接: https://arxiv.org/abs/2601.22419
作者: Nicholas Lopez,Francisco Marmolejo-Cossío,Jose Roberto Tello Ayala,David C. Parkes
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注:
Abstract:Pooled testing is a common strategy for public health disease screening under limited testing resources, allowing multiple biological samples to be tested together with the resources of a single test, at the cost of reduced individual resolution. While dynamic and adaptive strategies have been extensively studied in the classical pooled testing literature, where the goal is to minimize the number of tests required for full diagnosis of a given population, much of the existing work on welfare-maximizing pooled testing adopts static formulations in which all tests are assigned in advance. In this paper, we study dynamic welfare-maximizing pooled testing strategies in which a limited number of tests are performed sequentially to maximize social welfare, defined as the aggregate utility of individuals who are confirmed to be healthy. We formally define the dynamic problem and study algorithmic approaches for sequential test assignment. Because exact dynamic optimization is computationally infeasible beyond small instances, we evaluate a range of strategies (including exact optimization baselines, greedy heuristics, mixed-integer programming relaxations, and learning-based policies) and empirically characterize their performance and tradeoffs using synthetic experiments. Our results show that dynamic testing can yield substantial welfare improvements over static baselines in low-budget regimes. We find that much of the benefit of dynamic testing is captured by simple greedy policies, which substantially outperform static approaches while remaining computationally efficient. Learning-based methods are included as flexible baselines, but in our experiments they do not reliably improve upon these heuristics. Overall, this work provides a principled computational perspective on dynamic pooled testing and clarifies when dynamic assignment meaningfully improves welfare in public health screening.
zh
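作为示意,下面用一个极简贪心策略说明“福利最大化池化检测”的直觉(假设感染相互独立、阴性池可一次性确认全体成员健康;函数名与数值均为本文虚构,并非论文中的算法本身):按感染概率从低到高将个体加入池中,直至期望福利不再上升。

```python
def pool_welfare(pool, p_infect, utility):
    # 单次池化检测的期望福利:P(全体阴性) * 被确认健康者的效用总和
    p_all_neg = 1.0
    for i in pool:
        p_all_neg *= 1.0 - p_infect[i]
    return p_all_neg * sum(utility[i] for i in pool)

def greedy_pool(candidates, p_infect, utility, max_pool_size):
    # 按感染概率升序扩充池,期望福利不再提升时停止
    order = sorted(candidates, key=lambda i: p_infect[i])
    pool, best = [], 0.0
    for i in order[:max_pool_size]:
        value = pool_welfare(pool + [i], p_infect, utility)
        if value > best:
            pool.append(i)
            best = value
        else:
            break
    return pool, best

# 4 名个体:前两人低风险,第 3 人高风险(索引 2,p=0.30),效用均为 1
p_infect = [0.01, 0.02, 0.30, 0.05]
utility = [1.0, 1.0, 1.0, 1.0]
pool, welfare = greedy_pool([0, 1, 2, 3], p_infect, utility, max_pool_size=4)
```

可见高风险个体会被贪心策略排除在池外,这与论文“简单贪心即可捕获大部分动态收益”的结论在直觉上一致。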
[AI-127] AI-Enabled Waste Classification as a Data-Driven Decision Support Tool for Circular Economy and Urban Sustainability
【速读】:该论文旨在解决智能城市中废弃物高效分类的问题,以支持循环经济实践和资源回收。其核心解决方案是通过对比传统机器学习(如随机森林、支持向量机、AdaBoost)与深度学习方法(包括自定义卷积神经网络、VGG16、ResNet50及三种迁移学习模型:DenseNet121、EfficientNetB0、InceptionV3)在二分类任务中的性能表现,验证迁移学习在小样本条件下的优越性。关键发现为:DenseNet121模型达到最高准确率(91%)和ROC-AUC值(0.98),较最优传统模型提升20个百分点;主成分分析(Principal Component Analysis, PCA)对传统模型改善有限,而迁移学习显著提升了模型在数据受限场景下的泛化能力,从而为构建实时数据驱动的废弃物自动分拣决策系统提供了有效技术路径。
链接: https://arxiv.org/abs/2601.22418
作者: Julius Sechang Mboli,Omolara Aderonke Ogungbemi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted version of Conference paper
Abstract:Efficient waste sorting is crucial for enabling circular-economy practices and resource recovery in smart cities. This paper evaluates both traditional machine-learning (Random Forest, SVM, AdaBoost) and deep-learning techniques including custom CNNs, VGG16, ResNet50, and three transfer-learning models (DenseNet121, EfficientNetB0, InceptionV3) for binary classification of 25 077 waste images (80/20 train/test split, augmented and resized to 150x150 px). The paper assesses the impact of Principal Component Analysis (PCA) for dimensionality reduction on traditional models. DenseNet121 achieved the highest accuracy (91 %) and ROC-AUC (0.98), outperforming the best traditional classifier by 20 pp. PCA showed negligible benefit for classical methods, whereas transfer learning substantially improved performance under limited-data conditions. Finally, we outline how these models integrate into a real-time Data-Driven Decision Support System for automated waste sorting, highlighting potential reductions in landfill use and lifecycle environmental impacts.
zh
[AI-128] Optimization, Generalization, and Differential Privacy Bounds for Gradient Descent on Kolmogorov-Arnold Networks
【速读】:该论文旨在解决Kolmogorov–Arnold Networks (KANs)在训练动态、泛化能力和差分隐私(differential privacy, DP)属性方面的理论分析不足问题。其关键解决方案在于通过梯度下降(GD)对两层KAN进行系统性理论分析,推导出涵盖优化率、泛化率和隐私-效用权衡的通用边界;特别地,在逻辑损失与NTK可分离假设下,证明了多项式对数宽度(polylogarithmic width)足以实现最优的优化速率 1/T 和泛化速率 1/n,并在私有设置中量化了 (ϵ,δ)-DP所需的噪声水平,并获得与经典凸Lipschitz问题下界一致的效用界 √d/(nϵ),从而揭示了在差分隐私约束下,多项式对数宽度不仅是充分条件,也是必要条件,区别于非私有情形仅需充分性。
链接: https://arxiv.org/abs/2601.22409
作者: Puyu Wang,Junyu Zhou,Philipp Liznerski,Marius Kloft
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 41 pages, 3 figures
Abstract:Kolmogorov–Arnold Networks (KANs) have recently emerged as a structured alternative to standard MLPs, yet a principled theory for their training dynamics, generalization, and privacy properties remains limited. In this paper, we analyze gradient descent (GD) for training two-layer KANs and derive general bounds that characterize their training dynamics, generalization, and utility under differential privacy (DP). As a concrete instantiation, we specialize our analysis to logistic loss under an NTK-separable assumption, where we show that polylogarithmic network width suffices for GD to achieve an optimization rate of order 1/T and a generalization rate of order 1/n, with T denoting the number of GD iterations and n the sample size. In the private setting, we characterize the noise required for (\epsilon,\delta)-DP and obtain a utility bound of order \sqrt{d}/(n\epsilon) (with d the input dimension), matching the classical lower bound for general convex Lipschitz problems. Our results imply that polylogarithmic width is not only sufficient but also necessary under differential privacy, revealing a qualitative gap between non-private (sufficiency only) and private (necessity also emerges) training regimes. Experiments further illustrate how these theoretical insights can guide practical choices, including network width selection and early stopping.
zh
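为直观理解“两层KAN”的前向结构,下面给出一个玩具化示意(真实KAN通常使用B样条基;此处为保持自包含,改用极简的“线性 + SiLU”基,所有系数均为随意设定,仅展示“边上可学习一维函数求和”的计算形式,与论文的理论分析无直接对应):

```python
import math

def edge_fn(coeffs, x):
    # 每条边上的一维函数 phi(x) = c0 + c1*x + c2*silu(x)(示意用的简化基)
    silu = x / (1 + math.exp(-x))
    return coeffs[0] + coeffs[1] * x + coeffs[2] * silu

def kan_layer(weights, xs):
    # weights[j][i]:从输入 i 到单元 j 的边函数系数;单元输出为各边函数之和
    return [sum(edge_fn(weights[j][i], x) for i, x in enumerate(xs))
            for j in range(len(weights))]

# 两层KAN:2 维输入 -> 2 个隐藏单元 -> 1 维输出(系数纯属示意)
w1 = [[(0.0, 1.0, 0.5), (0.1, -0.5, 0.0)],   # 隐藏单元 0
      [(0.0, 0.3, 0.0), (0.0, 0.7, 0.2)]]    # 隐藏单元 1
w2 = [[(0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]]   # 输出单元
y = kan_layer(w2, kan_layer(w1, [0.5, -1.0]))
```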
[AI-129] Semi-Autonomous Mathematics Discovery with Gemini: A Case Study on the Erdős Problems
【速读】:该论文旨在解决数学领域中长期标记为“Open”(开放)的猜想问题,其核心问题是判断这些猜想是否真正未被解决,还是因文献覆盖不足或认知盲区而被误标。解决方案的关键在于采用混合方法:首先利用生成式 AI(Generative AI)对700个候选猜想进行自然语言层面的自动验证以缩小搜索范围,随后由人类专家评估结果的正确性与新颖性。这一流程成功识别出13个问题中的5个可通过AI自主得出新解,其余8个则通过文献检索发现已有解,揭示了“Open”状态多源于信息不透明而非技术难度本身。
链接: https://arxiv.org/abs/2601.22401
作者: Tony Feng,Trieu Trinh,Garrett Bingham,Jiwon Kang,Shengtong Zhang,Sang-hyun Kim,Kevin Barreto,Carl Schildkraut,Junehyuk Jung,Jaehyeon Seo,Carlo Pagano,Yuri Chervonyi,Dawsen Hwang,Kaiying Hou,Sergei Gukov,Cheng-Chiang Tsai,Hyunwoo Choi,Youngbeom Jin,Wei-Yuan Li,Hao-An Wu,Ruey-An Shiu,Yu-Sheng Shih,Quoc V. Le,Thang Luong
机构: 未知
类目: Artificial Intelligence (cs.AI); Combinatorics (math.CO); Number Theory (math.NT)
备注:
Abstract:We present a case study in semi-autonomous mathematics discovery, using Gemini to systematically evaluate 700 conjectures labeled ‘Open’ in Bloom’s Erdős Problems database. We employ a hybrid methodology: AI-driven natural language verification to narrow the search space, followed by human expert evaluation to gauge correctness and novelty. We address 13 problems that were marked ‘Open’ in the database: 5 through seemingly novel autonomous solutions, and 8 through identification of previous solutions in the existing literature. Our findings suggest that the ‘Open’ status of the problems was through obscurity rather than difficulty. We also identify and discuss issues arising in applying AI to math conjectures at scale, highlighting the difficulty of literature identification and the risk of ‘‘subconscious plagiarism’’ by AI. We reflect on the takeaways from AI-assisted efforts on the Erdős Problems.
zh
[AI-130] Score-based Integrated Gradient for Root Cause Explanations of Outliers ICDM2025
【速读】:该论文旨在解决异常值(outlier)根因溯源问题,即在因果推断与异常检测中准确识别导致异常的潜在因素。传统方法依赖启发式规则或反事实推理,在高维依赖关系和不确定性环境下表现不佳。其解决方案的关键在于提出一种名为SIREN的新方法,该方法通过估计数据似然的得分函数(score function)来实现根因归因,并利用积分梯度(integrated gradients)沿从异常点到正常数据分布的路径累积得分贡献,从而量化各变量对异常的贡献程度。SIREN满足Shapley值的三个经典公理(哑元、效率、线性)及由因果结构导出的不对称性公理,且直接作用于得分函数,能够在非线性、高维和异方差(heteroscedastic)因果模型中实现可计算且具有不确定性感知的归因分析。
链接: https://arxiv.org/abs/2601.22399
作者: Phuoc Nguyen,Truyen Tran,Sunil Gupta,Svetha Venkatesh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ICDM 2025
Abstract:Identifying the root causes of outliers is a fundamental problem in causal inference and anomaly detection. Traditional approaches based on heuristics or counterfactual reasoning often struggle under uncertainty and high-dimensional dependencies. We introduce SIREN, a novel and scalable method that attributes the root causes of outliers by estimating the score functions of the data likelihood. Attribution is computed via integrated gradients that accumulate score contributions along paths from the outlier toward the normal data distribution. Our method satisfies three of the four classic Shapley value axioms - dummy, efficiency, and linearity - as well as an asymmetry axiom derived from the underlying causal structure. Unlike prior work, SIREN operates directly on the score function, enabling tractable and uncertainty-aware root cause attribution in nonlinear, high-dimensional, and heteroscedastic causal models. Extensive experiments on synthetic random graphs and real-world cloud service and supply chain datasets show that SIREN outperforms state-of-the-art baselines in both attribution accuracy and computational efficiency.
zh
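下面用一个对角高斯的玩具例子示意“沿路径积分得分函数”的归因方式(本文自拟的简化演示,并非SIREN原实现;变量名均为假设):对角高斯的得分函数为 s(x) = -(x-μ)/σ²,沿从正常基线到异常点的直线路径累计得分贡献,贡献最负(即对数密度下降最多)的维度即候选根因。

```python
def score(x, mu, sigma2):
    # 对角高斯的得分函数 s(x) = d log p / dx = -(x - mu) / sigma^2
    return [-(xi - mi) / s2 for xi, mi, s2 in zip(x, mu, sigma2)]

def integrated_score_attribution(x, baseline, mu, sigma2, steps=1000):
    # 中点法数值积分:attr_j = (x_j - b_j) * ∫ s_j(b + α(x-b)) dα
    d = len(x)
    attr = [0.0] * d
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        s = score(point, mu, sigma2)
        for j in range(d):
            attr[j] += (x[j] - baseline[j]) * s[j] / steps
    return attr

mu, sigma2 = [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]
outlier = [0.1, 5.0, -0.2]           # 第 2 个变量(索引 1)是异常的
attr = integrated_score_attribution(outlier, baseline=mu, mu=mu, sigma2=sigma2)
root_cause = min(range(3), key=lambda j: attr[j])  # → 1
```

注意各维贡献之和恰为 log p(outlier) - log p(baseline)(即“效率”公理的完备性性质)。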
[AI-131] Graph is a Substrate Across Data Modalities
【速读】:该论文旨在解决图结构在不同模态和任务之间孤立学习导致的结构性规律重复重建问题,即当前方法通常在单一任务上下文中构建图表示并随后丢弃,未能实现跨模态与跨任务的结构知识积累。其解决方案的关键在于提出G-Substrate框架,该框架以图结构作为持久存在的结构基底(structural substrate),通过两个互补机制实现:一是统一的结构模式(unified structural schema)确保异构模态与任务间图表示的兼容性;二是交错的角色驱动训练策略(interleaved role-based training strategy),使同一图结构在学习过程中被赋予多种功能角色,从而促进结构知识的持续积累与复用。实验表明,该方法在多个领域、模态和任务上显著优于孤立任务学习和朴素多任务学习方法。
链接: https://arxiv.org/abs/2601.22384
作者: Ziming Li,Xiaoming Wu,Zehong Wang,Jiazheng Li,Yijun Tian,Jinhe Bi,Yunpu Ma,Yanfang Ye,Chuxu Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Graph structure across data modalities
Abstract:Graphs provide a natural representation of relational structure that arises across diverse domains. Despite this ubiquity, graph structure is typically learned in a modality- and task-isolated manner, where graph representations are constructed within individual task contexts and discarded thereafter. As a result, structural regularities across modalities and tasks are repeatedly reconstructed rather than accumulated at the level of intermediate graph representations. This motivates a representation-learning question: how should graph structure be organized so that it can persist and accumulate across heterogeneous modalities and tasks? We adopt a representation-centric perspective in which graph structure is treated as a structural substrate that persists across learning contexts. To instantiate this perspective, we propose G-Substrate, a graph substrate framework that organizes learning around shared graph structures. G-Substrate comprises two complementary mechanisms: a unified structural schema that ensures compatibility among graph representations across heterogeneous modalities and tasks, and an interleaved role-based training strategy that exposes the same graph structure to multiple functional roles during learning. Experiments across multiple domains, modalities, and tasks show that G-Substrate outperforms task-isolated and naive multi-task learning methods.
zh
[AI-132] Learning Provably Correct Distributed Protocols Without Human Knowledge
【速读】:该论文旨在解决分布式协议设计中正确性难以保障的问题,尤其是在存在不确定性与故障的环境中,如何高效地自动合成可证明正确的多智能体协调协议。现有方法在小规模场景下也难以学习到正确协议,主要受限于标准多智能体博弈求解技术的局限性。其解决方案的关键在于提出一种名为GGMS的学习框架,该框架融合了改进的蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)、基于Transformer的动作编码器、全局深度优先搜索以跳出局部最优,并通过模型检测器反复反馈优化策略。该方法将协议设计建模为带有不完美信息的游戏中的策略搜索问题,利用Satisfiability Modulo Theories (SMT) 表达正确性条件,并在有限执行空间内通过穷举模型检查验证输出协议的正确性。进一步理论证明表明,在合理假设下,GGMS的搜索过程具有完备性:若存在正确协议,该框架最终必能发现。
链接: https://arxiv.org/abs/2601.22369
作者: Yujie Hui,Xiaoyi Lu,Andrew Perrault,Yang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:Provably correct distributed protocols, which are a critical component of modern distributed systems, are highly challenging to design and have often required decades of human effort. These protocols allow multiple agents to coordinate to come to a common agreement in an environment with uncertainty and failures. We formulate protocol design as a search problem over strategies in a game with imperfect information, and the desired correctness conditions are specified in Satisfiability Modulo Theories (SMT). However, standard methods for solving multi-agent games fail to learn correct protocols in this setting, even when the number of agents is small. We propose a learning framework, GGMS, which integrates a specialized variant of Monte Carlo Tree Search with a transformer-based action encoder, a global depth-first search to break out of local minima, and repeated feedback from a model checker. Protocols output by GGMS are verified correct via exhaustive model checking for all executions within the bounded setting. We further prove that, under mild assumptions, the search process is complete: if a correct protocol exists, GGMS will eventually find it. In experiments, we show that GGMS can learn correct protocols for larger settings than existing methods.
zh
[AI-133] The Unseen Threat: Residual Knowledge in Machine Unlearning under Perturbed Samples NEURIPS2025
【速读】:该论文旨在解决机器遗忘(Machine Unlearning)中存在的一种新型隐私风险:即使模型已从训练数据中“遗忘”特定样本,其对这些样本的微小扰动版本仍可能被正确识别,表明关于遗忘样本的信息可能残留在模型局部邻域中,这种现象被称为残留知识(Residual Knowledge)。现有基于统计不可区分性的遗忘保证无法覆盖对抗性扰动下的模型输出,导致隐私泄露隐患。解决方案的关键在于提出一种名为RURK(Regularized Unlearning with Residual Knowledge Penalty)的微调策略,通过在训练过程中引入惩罚项,抑制模型对扰动后遗忘样本的再识别能力,从而有效消除残留知识。实验表明,该方法在视觉基准上显著降低了现有遗忘方法中的残留知识水平。
链接: https://arxiv.org/abs/2601.22359
作者: Hsiang Hsu,Pradeep Niroula,Zichang He,Ivan Brugere,Freddy Lecue,Chun-Fu Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Presented at NeurIPS 2025
Abstract:Machine unlearning offers a practical alternative to avoid full model re-training by approximately removing the influence of specific user data. While existing methods certify unlearning via statistical indistinguishability from re-trained models, these guarantees do not naturally extend to model outputs when inputs are adversarially perturbed. In particular, slight perturbations of forget samples may still be correctly recognized by the unlearned model - even when a re-trained model fails to do so - revealing a novel privacy risk: information about the forget samples may persist in their local neighborhood. In this work, we formalize this vulnerability as residual knowledge and show that it is inevitable in high-dimensional settings. To mitigate this risk, we propose a fine-tuning strategy, named RURK, that penalizes the model’s ability to re-recognize perturbed forget samples. Experiments on vision benchmarks with deep neural networks demonstrate that residual knowledge is prevalent across existing unlearning methods and that our approach effectively prevents residual knowledge.
zh
[AI-134] Recoverability Has a Law: The ERR Measure for Tool-Augmented Agents ICML
【速读】:该论文旨在解决语言模型代理在工具调用失败后表现出自恢复能力却缺乏形式化解释的问题。其解决方案的关键在于提出一种可预测的理论框架,通过定义期望恢复遗憾(Expected Recovery Regret, ERR)来量化恢复策略与最优策略在随机执行噪声下的偏离程度,并推导出ERR与可观测指标效率得分(Efficiency Score, ES)之间的一阶关系,从而建立了一个可验证的恢复动力学定量法则。实证结果表明,该法则在多个工具使用基准上均能准确预测恢复 regret,误差小于等于 0.05,揭示了恢复能力是交互动态所支配的属性,而非模型规模或架构的副产物。
链接: https://arxiv.org/abs/2601.22352
作者: Sri Vatsa Vuddanti,Satwik Kumar Chittiprolu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint for ICML Submission
Abstract:Language model agents often appear capable of self-recovery after failing tool call executions, yet this behavior lacks a formal explanation. We present a predictive theory that resolves this gap by showing that recoverability follows a measurable law. To elaborate, we formalize recoverability through Expected Recovery Regret (ERR), which quantifies the deviation of a recovery policy from the optimal one under stochastic execution noise, and derive a first-order relationship between ERR and an empirical observable quantity, the Efficiency Score (ES). This yields a falsifiable first-order quantitative law of recovery dynamics in tool-using agents. We empirically validate the law across five tool-use benchmarks spanning controlled perturbations, diagnostic reasoning, and real-world APIs. Across model scales, perturbation regimes, and recovery horizons, predicted regret under the ERR-ES law closely matched observed post-failure regret measured from Monte Carlo rollouts, within δ ≤ 0.05. Our results reveal that recoverability is not an artifact of model scale or architecture, but a governed property of interaction dynamics, providing a theoretical foundation for execution-level robustness in language agents.
zh
[AI-135] Learning Policy Representations for Steerable Behavior Synthesis
【速读】:该论文旨在解决在马尔可夫决策过程(Markov Decision Process, MDP)中,如何学习一组策略的统一表示以实现测试时的行为调控问题。其核心挑战在于,不同策略由各自的占用度量(occupancy measure)唯一确定,而传统方法难以高效地在不重新训练的情况下对策略进行灵活调整。解决方案的关键在于将策略表示建模为状态-动作特征映射关于占用度量的期望,并通过集合架构(set-based architecture)统一近似多种策略的此类表示。模型利用变分生成方法构建平滑的潜在空间,并结合对比学习进一步优化该空间几何结构,使得潜在空间中的距离与价值函数差异对齐,从而支持在潜在空间内直接进行梯度优化。这一机制使得模型能够在无需额外训练的前提下,合成满足先前未见价值函数约束的新策略,即实现行为合成(behavior synthesis)。
链接: https://arxiv.org/abs/2601.22350
作者: Beiming Li,Sergio Rozada,Alejandro Ribeiro
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Given a Markov decision process (MDP), we seek to learn representations for a range of policies to facilitate behavior steering at test time. As policies of an MDP are uniquely determined by their occupancy measures, we propose modeling policy representations as expectations of state-action feature maps with respect to occupancy measures. We show that these representations can be approximated uniformly for a range of policies using a set-based architecture. Our model encodes a set of state-action samples into a latent embedding, from which we decode both the policy and its value functions corresponding to multiple rewards. We use variational generative approach to induce a smooth latent space, and further shape it with contrastive learning so that latent distances align with differences in value functions. This geometry permits gradient-based optimization directly in the latent space. Leveraging this capability, we solve a novel behavior synthesis task, where policies are steered to satisfy previously unseen value function constraints without additional training.
zh
[AI-136] MixQuant: Pushing the Limits of Block Rotations in Post-Training Quantization
【速读】:该论文旨在解决后训练量化(Post-Training Quantization, PTQ)中因块结构(block structure)导致的异常值(outlier)抑制效果不佳的问题。现有方法采用块哈达玛旋转(block Hadamard rotation)来扩散异常值,但其对异常值的抑制能力受限于输入向量的几何特性,且缺乏系统性的理论分析。解决方案的关键在于:首先通过非渐近分析揭示了在预旋转阶段各块的ℓ₁范数质量分布均匀时可最小化旋转后的异常值;进而提出MixQuant框架,利用排列(permutation)预先重新分配激活质量,使各块ℓ₁范数趋于一致,并设计贪婪质量扩散算法优化排列策略;最后,为避免推理开销,识别Transformer架构中的排列等变区域,将排列合并进模型权重中部署。该方法显著提升了INT4量化下的模型精度,例如在Llama3 1B模型上使用块大小为16时,相比无排列方案,困惑度恢复提升至90%(原仅为46%)。
链接: https://arxiv.org/abs/2601.22347
作者: Sai Sanjeet,Ian Colbert,Pablo Monteagudo-Lago,Giuseppe Franco,Yaman Umuroglu,Nicholas J. Fraser
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent post-training quantization (PTQ) methods have adopted block rotations to diffuse outliers prior to rounding. While this reduces the overhead of full-vector rotations, the effect of block structure on outlier suppression remains poorly understood. To fill this gap, we present the first systematic, non-asymptotic analysis of outlier suppression for block Hadamard rotations. Our analysis reveals that outlier suppression is fundamentally limited by the geometry of the input vector. In particular, post-rotation outliers are deterministically minimized when the pre-rotation \ell_1 norm mass is evenly distributed across blocks. Guided by these insights, we introduce MixQuant, a block rotation-aware PTQ framework that redistributes activation mass via permutations prior to rotation. We propose a greedy mass diffusion algorithm to calibrate permutations by equalizing the expected blockwise \ell_1 norms. To avoid adding inference overhead, we identify permutation-equivariant regions in transformer architectures to merge the resulting permutations into model weights before deployment. Experiments show that MixQuant consistently improves accuracy across all block sizes, recovering up to 90% of the full-vector rotation perplexity when quantizing Llama3 1B to INT4 with block size 16, compared to 46% without permutations.
zh
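示意上文“贪心质量扩散”思想的一个简化实现(论文未随摘要给出伪代码,以下为本文的推测性重构,非作者代码):将各通道按ℓ₁范数从大到小排序,依次放入当前质量最小且未满的块,使各块ℓ₁范数尽量均衡;展平后的顺序即旋转前应用的通道排列。

```python
def greedy_mass_diffusion(l1_norms, block_size):
    n = len(l1_norms)
    n_blocks = n // block_size
    order = sorted(range(n), key=lambda c: -l1_norms[c])  # 质量大的通道优先分配
    blocks = [[] for _ in range(n_blocks)]
    mass = [0.0] * n_blocks
    for c in order:
        # 在仍有空位的块中选当前 l1 质量最小者(类似 LPT 调度)
        b = min((b for b in range(n_blocks) if len(blocks[b]) < block_size),
                key=lambda b: mass[b])
        blocks[b].append(c)
        mass[b] += l1_norms[c]
    # 展平为通道排列,在块 Hadamard 旋转之前施加
    return [c for blk in blocks for c in blk], mass

# 8 个通道、块大小 4:未排列时块 [10,1,9,2] 与 [8,3,7,4] 的质量为 22/22,
# 但若通道天然有序(如 [10,9,8,7] 与 [4,3,2,1]),则会严重失衡;贪心重排可均衡。
l1 = [10.0, 1.0, 9.0, 2.0, 8.0, 3.0, 7.0, 4.0]
perm, mass = greedy_mass_diffusion(l1, block_size=4)
```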
[AI-137] From Retrieving Information to Reasoning with AI: Exploring Different Interaction Modalities to Support Human-AI Coordination in Clinical Decision-Making
【速读】:该论文试图解决的问题是:当前大语言模型(Large Language Models, LLMs)在临床决策支持系统(Clinical Decision-Support Systems, CDSS)中的实际应用效果尚不明确,尤其是临床工作者如何使用LLMs及其与传统CDSS的交互方式差异尚未被充分理解,这限制了新型交互机制的设计与优化。解决方案的关键在于通过定性研究方法,深入分析12名临床工作者对文本对话、静态/交互式用户界面(UI)及语音三种交互模态的感知与使用行为,发现其主要采用“工具导向”策略进行信息检索和验证,而非将LLMs视为复杂问题的协作者;同时揭示了个体认知风格对交互参与度的影响,并指出不同交互模态各具优势与局限,从而为设计更贴合临床需求、提升性能与用户体验的下一代决策支持工具提供实证依据。
链接: https://arxiv.org/abs/2601.22338
作者: Behnam Rahdari,Sameer Shaikh,Jonathan H Chen,Tobias Gerstenberg,Shriti Raj
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:LLMs are popular among clinicians for decision-support because of simple text-based interaction. However, their impact on clinicians’ performance is ambiguous. Not knowing how clinicians use this new technology and how they compare it to traditional clinical decision-support systems (CDSS) restricts designing novel mechanisms that overcome existing tool limitations and enhance performance and experience. This qualitative study examines how clinicians (n=12) perceive different interaction modalities (text-based conversation with LLMs, interactive and static UI, and voice) for decision-support. In open-ended use of LLM-based tools, our participants took a tool-centric approach using them for information retrieval and confirmation with simple prompts instead of use as active deliberation partners that can handle complex questions. Critical engagement emerged with changes to the interaction setup. Engagement also differed with individual cognitive styles. Lastly, benefits and drawbacks of interaction with text, voice and traditional UIs for clinical decision-support show the lack of a one-size-fits-all interaction modality.
zh
[AI-138] Sparks of Rationality: Do Reasoning LLMs Align with Human Judgment and Choice?
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在高风险决策场景中是否具备类似人类的理性与情感偏差这一关键问题,从而评估其作为人类行为模拟工具或决策引擎的可靠性。解决方案的关键在于通过两类实验设计:一是基于理性选择公理的基准测试,二是经典行为经济学和社交规范领域的决策任务,系统性地检验LLMs在“思考”过程中的理性提升能力;同时引入两种情绪引导方法——上下文提示(in-context priming, ICP)与表征层级引导(representation-level steering, RLS),以探究情绪干预对模型决策的影响机制。研究发现,推理能力的增强不仅提升理性表现,也加剧了模型对情绪干预的敏感性,且不同引导方式在可控性与人类对齐行为之间存在权衡,揭示了推理与情感调节之间的内在张力。
链接: https://arxiv.org/abs/2601.22329
作者: Ala N. Tak,Amin Banayeeanzade,Anahita Bolourani,Fatemeh Bahrani,Ashutosh Chaubey,Sai Praneeth Karimireddy,Norbert Schwarz,Jonathan Gratch
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Large Language Models (LLMs) are increasingly positioned as decision engines for hiring, healthcare, and economic judgment, yet real-world human judgment reflects a balance between rational deliberation and emotion-driven bias. If LLMs are to participate in high-stakes decisions or serve as models of human behavior, it is critical to assess whether they exhibit analogous patterns of (ir)rationalities and biases. To this end, we evaluate multiple LLM families on (i) benchmarks testing core axioms of rational choice and (ii) classic decision domains from behavioral economics and social norms where emotions are known to shape judgment and choice. Across settings, we show that deliberate “thinking” reliably improves rationality and pushes models toward expected-value maximization. To probe human-like affective distortions and their interaction with reasoning, we use two emotion-steering methods: in-context priming (ICP) and representation-level steering (RLS). ICP induces strong directional shifts that are often extreme and difficult to calibrate, whereas RLS produces more psychologically plausible patterns but with lower reliability. Our results suggest that the same mechanisms that improve rationality also amplify sensitivity to affective interventions, and that different steering methods trade off controllability against human-aligned behavior. Overall, this points to a tension between reasoning and affective steering, with implications for both human simulation and the safe deployment of LLM-based decision systems.
zh
[AI-139] Stealthy Poisoning Attacks Bypass Defenses in Regression Settings
【速读】:该论文旨在解决回归模型在工业过程、工程及自然科学等领域中对数据投毒攻击(poisoning attack)的鲁棒性不足问题,尤其针对现有研究多基于不切实际的威胁模型而缺乏实用性的问题。其解决方案的关键在于提出一种新型最优隐蔽攻击(optimal stealthy attack)形式化方法,该方法考虑了不同级别的可检测性,并能绕过当前最先进的防御机制;同时,论文进一步提出基于目标归一化的评估方法以权衡攻击的有效性与隐蔽性,并开发了一种新的防御策略 BayesClean,该策略在隐蔽攻击且投毒样本数量较多时显著优于已有防御方法。
链接: https://arxiv.org/abs/2601.22308
作者: Javier Carnerero-Cano,Luis Muñoz-González,Phillippa Spencer,Emil C. Lupu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Regression models are widely used in industrial processes, engineering and in natural and physical sciences, yet their robustness to poisoning has received less attention. When it has, studies often assume unrealistic threat models and are thus less useful in practice. In this paper, we propose a novel optimal stealthy attack formulation that considers different degrees of detectability and show that it bypasses state-of-the-art defenses. We further propose a new methodology based on normalization of objectives to evaluate different trade-offs between effectiveness and detectability. Finally, we develop a novel defense (BayesClean) against stealthy attacks. BayesClean improves on previous defenses when attacks are stealthy and the number of poisoning points is significant.
zh
[AI-140] Conformal Prediction for Generative Models via Adaptive Cluster-Based Density Estimation
【速读】:该论文旨在解决条件生成模型(conditional generative models)在高风险应用场景中缺乏校准不确定性(calibrated uncertainty)的问题,这限制了用户对单个输出结果的信任度。其解决方案的关键在于提出一种系统性的分位数预测(conformal prediction)方法——CP4Gen,该方法利用基于聚类的密度估计技术构建预测集,从而在保持较低结构复杂度的同时,显著提升对异常值的鲁棒性与预测集的可解释性,实验证明其在预测集体积和结构简洁性上均优于现有方法。
链接: https://arxiv.org/abs/2601.22298
作者: Qidong Yang,Qianyu Julie Zhu,Jonathan Giezendanner,Youssef Marzouk,Stephen Bates,Sherrie Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
备注:
Abstract:Conditional generative models map input variables to complex, high-dimensional distributions, enabling realistic sample generation in a diverse set of domains. A critical challenge with these models is the absence of calibrated uncertainty, which undermines trust in individual outputs for high-stakes applications. To address this issue, we propose a systematic conformal prediction approach tailored to conditional generative models, leveraging density estimation on model-generated samples. We introduce a novel method called CP4Gen, which utilizes clustering-based density estimation to construct prediction sets that are less sensitive to outliers, more interpretable, and of lower structural complexity than existing methods. Extensive experiments on synthetic datasets and real-world applications, including climate emulation tasks, demonstrate that CP4Gen consistently achieves superior performance in terms of prediction set volume and structural simplicity. Our approach offers practitioners a powerful tool for uncertainty estimation associated with conditional generative models, particularly in scenarios demanding rigorous and interpretable prediction sets.
zh
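以下用“基于密度的分裂共形预测”这一通用配方做一个玩具示意(以一维高斯密度代替CP4Gen的聚类密度估计;阈值记号与变量名均为本文假设):密度低即“不符合”,取校准集密度的 α 分位数作为阈值,预测集为密度超过阈值的所有输出。

```python
import math, random

random.seed(0)

def density(y):
    # 代替条件密度 p(y|x) 的一维标准高斯(仅作示意)
    return math.exp(-0.5 * y * y) / math.sqrt(2 * math.pi)

alpha = 0.1
cal = [random.gauss(0, 1) for _ in range(999)]       # 校准样本
scores = sorted(density(y) for y in cal)             # 共形得分 = 真样本处的密度
k = math.floor(alpha * (len(cal) + 1)) - 1           # 有限样本修正后的分位位置
t = scores[k]                                        # 密度阈值
# 预测集为 {y : density(y) >= t};其在校准集上的覆盖率应接近 1 - alpha
covered = sum(density(y) >= t for y in cal) / len(cal)
```

CP4Gen 的区别在于用聚类密度估计器替换此处的高斯密度,使预测集对异常值更稳健、结构更简单。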
[AI-141] ParalESN: Enabling parallel information processing in Reservoir Computing
【速读】:该论文旨在解决传统Reservoir Computing (RC) 在处理时序数据时面临的两大瓶颈:一是时序数据必须串行处理,限制了计算效率;二是高维储层(reservoir)导致内存开销巨大,难以扩展。其解决方案的关键在于引入Parallel Echo State Network (ParalESN),通过在复数域中采用对角线性递推结构(diagonal linear recurrence),构建高效且可并行处理的高维储层。该方法不仅保持了传统Echo State Networks的Echo State Property和通用性保证,还实现了任意线性储层在复数对角形式下的等效表示,从而显著降低计算复杂度与能耗,同时在时间序列预测和像素级分类任务上达到与传统RC相当甚至更优的性能表现。
链接: https://arxiv.org/abs/2601.22296
作者: Matteo Pinna,Giacomo Lagomarsini,Andrea Ceni,Claudio Gallicchio
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, 6 figures, 9 tables
Abstract:Reservoir Computing (RC) has established itself as an efficient paradigm for temporal processing. However, its scalability remains severely constrained by (i) the necessity of processing temporal data sequentially and (ii) the prohibitive memory footprint of high-dimensional reservoirs. In this work, we revisit RC through the lens of structured operators and state space modeling to address these limitations, introducing Parallel Echo State Network (ParalESN). ParalESN enables the construction of high-dimensional and efficient reservoirs based on diagonal linear recurrence in the complex space, enabling parallel processing of temporal data. We provide a theoretical analysis demonstrating that ParalESN preserves the Echo State Property and the universality guarantees of traditional Echo State Networks while admitting an equivalent representation of arbitrary linear reservoirs in the complex diagonal form. Empirically, ParalESN matches the predictive accuracy of traditional RC on time series benchmarks, while delivering substantial computational savings. On 1-D pixel-level classification tasks, ParalESN achieves competitive accuracy with fully trainable neural networks while reducing computational costs and energy consumption by orders of magnitude. Overall, ParalESN offers a promising, scalable, and principled pathway for integrating RC within the deep learning landscape.
zh
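The diagonal recurrence at the heart of ParalESN can be illustrated with a minimal sketch (our own Python illustration, not the authors' implementation): each complex state coordinate evolves independently as h_t = lam * h_{t-1} + x_t, so the state unrolls to a closed form over the input history, and that decoupling is what makes scan-based parallel evaluation possible.

```python
import cmath

# Minimal illustration (ours, not the authors' code) of the diagonal
# linear recurrence ParalESN is built on. Each complex coordinate
# evolves as h_t = lam * h_{t-1} + x_t, which unrolls to the closed
# form h_t = sum_{s<=t} lam**(t-s) * x_s; the per-coordinate
# independence is what enables parallel (scan-style) evaluation.
def sequential_states(lam, xs):
    h, out = 0j, []
    for x in xs:
        h = lam * h + x
        out.append(h)
    return out

def closed_form_states(lam, xs):
    return [sum(lam ** (t - s) * xs[s] for s in range(t + 1))
            for t in range(len(xs))]

lam = 0.9 * cmath.exp(0.7j)          # |lam| < 1: echo state property holds
xs = [1 + 0j, -0.5 + 0.25j, 0.3j, 2 + 1j]
seq = sequential_states(lam, xs)
par = closed_form_states(lam, xs)
assert all(abs(a - b) < 1e-9 for a, b in zip(seq, par))
```

Both forms produce identical state trajectories; only the closed form can be evaluated for all t at once.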
[AI-142] The Six Sigma Agent: Achieving Enterprise-Grade Reliability in LLM Systems Through Consensus-Driven Decomposed Execution
【速读】: This paper addresses the reliability problems that the inherently probabilistic nature of large language models (LLMs) poses for enterprise deployment. The key to the proposed Six Sigma Agent architecture lies in three synergistic components: (1) decomposing tasks into a dependency tree of atomic actions; (2) micro-agent sampling, in which n diverse LLMs execute the same task in parallel to produce independent outputs; and (3) consensus voting with dynamic scaling, which clusters the outputs and selects the answer from the winning cluster with the most votes. Theoretically, with per-action error rate p, n independent samples reduce system error to O(p^⌈n/2⌉), an exponential reliability gain; empirically, even with cheaper models at 5% per-action error, 5 agents reduce the error to 0.11%, and dynamically scaling to 13 agents achieves 3.4 DPMO (Defects Per Million Opportunities), the Six Sigma standard, with a 14,700x reliability improvement and an 80% cost reduction over single-agent execution. This suggests that reliability in AI systems emerges from structured redundancy and consensus rather than model scaling alone.
链接: https://arxiv.org/abs/2601.22290
作者: Khush Patel,Siva Surendira,Jithin George,Shreyas Kapale
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 25 pages, 7 figures, 2 tables
Abstract:Large Language Models demonstrate remarkable capabilities yet remain fundamentally probabilistic, presenting critical reliability challenges for enterprise deployment. We introduce the Six Sigma Agent, a novel architecture that achieves enterprise-grade reliability through three synergistic components: (1) task decomposition into a dependency tree of atomic actions; (2) micro-agent sampling where each task is executed n times in parallel across diverse LLMs to generate independent outputs; and (3) consensus voting with dynamic scaling, clustering outputs and selecting the answer from the winning cluster with maximum votes. We prove that sampling n independent outputs with error rate p achieves system error O(p^ceil(n/2)), enabling exponential reliability gains. Even using cheaper models with 5% per-action error, consensus voting with 5 agents reduces error to 0.11%; dynamic scaling to 13 agents achieves 3.4 DPMO (Defects Per Million Opportunities), the Six Sigma standard. Evaluation across three enterprise use cases demonstrates a 14,700x reliability improvement over single-agent execution while reducing costs by 80%. Our work establishes that reliability in AI systems emerges from principled redundancy and consensus rather than model scaling alone.
zh
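The exponential error reduction claimed above is easy to check numerically. A minimal sketch (our own arithmetic, not the authors' code), assuming independent per-action errors and simple majority voting over an odd number n of sampled outputs:

```python
from math import comb

# Our own arithmetic illustrating the paper's O(p**ceil(n/2)) claim
# (not the authors' code): under independent errors, a majority vote
# fails only when at least ceil(n/2) of the n samples fail at once.
def majority_error(p, n):
    need = n // 2 + 1     # wrong votes needed to flip the majority (n odd)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(need, n + 1))

p = 0.05                  # per-action error of a cheap model
for n in (1, 5, 13):
    print(f"n={n:2d}  system error ~ {majority_error(p, n):.2e}")
# n=5 gives roughly 1.2e-3 (~0.11-0.12%), consistent with the figure
# quoted in the abstract; n=13 pushes the error to the ~1e-6 scale.
```

The exact DPMO figure in the paper presumably depends on further modeling details; this only reproduces the binomial-tail argument.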
[AI-143] PersonaCite: VoC-Grounded Interviewable Agentic Synthetic AI Personas for Verifiable User and Design Research
【速读】: This paper addresses a problem with LLM-based and agent-based synthetic personas, now widely used in design and product decision-making: because they rely on prompt-based roleplaying, their responses are persuasive but unverifiable, with an opaque and untraceable evidentiary basis. The key to the solution is PersonaCite, which reframes AI personas as evidence-bounded research instruments through retrieval-augmented interaction: it retrieves actual voice-of-customer data at each conversation turn, constrains answers to the retrieved evidence, explicitly abstains when evidence is missing, and provides response-level source attribution. This makes persona outputs markedly more trustworthy and auditable, supporting more responsible human-centered design workflows.
链接: https://arxiv.org/abs/2601.22288
作者: Mario Truss
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
备注:
Abstract:LLM-based and agent-based synthetic personas are increasingly used in design and product decision-making, yet prior work shows that prompt-based personas often produce persuasive but unverifiable responses that obscure their evidentiary basis. We present PersonaCite, an agentic system that reframes AI personas as evidence-bounded research instruments through retrieval-augmented interaction. Unlike prior approaches that rely on prompt-based roleplaying, PersonaCite retrieves actual voice-of-customer artifacts during each conversation turn, constrains responses to retrieved evidence, explicitly abstains when evidence is missing, and provides response-level source attribution. Through semi-structured interviews and deployment study with 14 industry experts, we identify preliminary findings on perceived benefits, validity concerns, and design tensions, and propose Persona Provenance Cards as a documentation pattern for responsible AI persona use in human-centered design workflows.
zh
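The evidence-bounded answering pattern described above can be sketched in a few lines (a toy illustration under our own assumptions; the retrieval function, corpus, and field names are hypothetical, not PersonaCite's API):

```python
# Toy sketch (ours, not PersonaCite's API) of evidence-bounded
# answering: respond only from retrieved voice-of-customer snippets,
# abstain explicitly when nothing matches, and attach source IDs.
def retrieve(corpus, query, min_overlap=2):
    q = set(query.lower().split())
    hits = sorted(
        ((len(q & set(text.lower().split())), doc_id, text)
         for doc_id, text in corpus.items()),
        reverse=True)
    return [(d, t) for n, d, t in hits if n >= min_overlap]

def answer(corpus, query):
    evidence = retrieve(corpus, query)
    if not evidence:                         # explicit abstention
        return {"answer": None, "sources": [], "abstained": True}
    return {"answer": evidence[0][1],        # grounded in top snippet
            "sources": [doc_id for doc_id, _ in evidence],
            "abstained": False}

corpus = {"voc-17": "checkout flow is confusing on mobile devices",
          "voc-42": "love the new dashboard layout"}
r = answer(corpus, "is the checkout flow confusing")
assert not r["abstained"] and r["sources"] == ["voc-17"]
assert answer(corpus, "pricing opinions")["abstained"]
```

A real system would use semantic retrieval and an LLM to phrase the answer; the abstain-plus-attribution contract is the point here.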
[AI-144] AI Narrative Breakdown. A Critical Assessment of Power and Promise
【速读】: This paper tackles the oversimplified, ideology-laden, and value-blind character of current societal discourse on artificial intelligence (AI), in particular narratives that portray AI as neutral, general-purpose, or deterministically powerful while obscuring the political nature, power structures, and value choices behind it. The key to the solution is the notion of "Zeitgeist AI", which critically exposes the vague and misleading use of the term "AI" across different sectors, together with an interdisciplinary analysis (drawing on critical computer science, critical data and algorithm studies, science and technology studies, data protection theory, and the philosophy of mind) that uncovers the hidden assumptions behind common narratives. This highlights that all AI applications are ultimately human-directed tools embedded in societal power relations, and the paper concludes by calling for a grounded framework of AI governance and a more critical, responsible path for AI development.
链接: https://arxiv.org/abs/2601.22255
作者: Rainer Rehak
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 11 pages. In: The 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT '25)
Abstract:This article sets off for an exploration of the still evolving discourse surrounding artificial intelligence (AI) in the wake of the release of ChatGPT. It scrutinizes the pervasive narratives that are shaping the societal engagement with AI, spotlighting key themes such as agency and decision-making, autonomy, truthfulness, knowledge processing, prediction, general purpose, neutrality and objectivity, apolitical optimization, sustainability game-changer, democratization, mass unemployment, and the dualistic portrayal of AI as either a harbinger of societal utopia or dystopia. Those narratives are analysed critically based on insights from critical computer science, critical data and algorithm studies, from STS, data protection theory, as well as from the philosophy of mind and semiotics. To properly analyse the narratives presented, the article first delves into a historical and technical contextualisation of the AI discourse itself. The article then introduces the notion of “Zeitgeist AI” to critique the imprecise and misleading application of the term “AI” across various societal sectors. Then, by discussing common narratives with nuance, the article contextualises and challenges often assumed socio-political implications of AI, uncovering in detail and with examples the inherent political, power infused and value-laden decisions within all AI applications. Concluding with a call for a more grounded engagement with AI, the article carves out acute problems ignored by the narratives discussed and proposes new narratives recognizing AI as a human-directed tool necessarily subject to societal governance.
zh
[AI-145] MirrorMark: A Distortion-Free Multi-Bit Watermark for Large Language Models
【速读】: This paper addresses trustworthy content attribution for large language model (LLM) outputs, i.e., how to watermark generated text efficiently and robustly without harming its quality. Existing watermarks suffer from two flaws: they either carry only a binary signal with limited capacity, or they perturb the sampling distribution and degrade text quality. The key innovation of MirrorMark is a measure-preserving mirroring of sampling randomness that embeds multi-bit messages without altering the token probability distribution, yielding a distortion-free watermark. In addition, a context-based scheduler balances message-bit assignments across the text, substantially improving robustness to attacks such as insertions and deletions. A theoretical analysis further relates the equal error rate to empirical performance, and experiments show that MirrorMark matches the text quality of non-watermarked generation while greatly improving detection accuracy: with 54 bits embedded in 300 tokens, bit accuracy improves by 8-12%, and up to 11% more watermarked texts are identified at a 1% false positive rate.
链接: https://arxiv.org/abs/2601.22246
作者: Ya Jiang,Massieh Kordi Boroujeny,Surender Suresh Kumar,Kai Zeng
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:As large language models (LLMs) become integral to applications such as question answering and content creation, reliable content attribution has become increasingly important. Watermarking is a promising approach, but existing methods either provide only binary signals or distort the sampling distribution, degrading text quality; distortion-free approaches, in turn, often suffer from weak detectability or robustness. We propose MirrorMark, a multi-bit and distortion-free watermark for LLMs. By mirroring sampling randomness in a measure-preserving manner, MirrorMark embeds multi-bit messages without altering the token probability distribution, preserving text quality by design. To improve robustness, we introduce a context-based scheduler that balances token assignments across message positions while remaining resilient to insertions and deletions. We further provide a theoretical analysis of the equal error rate to interpret empirical performance. Experiments show that MirrorMark matches the text quality of non-watermarked generation while achieving substantially stronger detectability: with 54 bits embedded in 300 tokens, it improves bit accuracy by 8-12% and correctly identifies up to 11% more watermarked texts at 1% false positive rate.
zh
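The distortion-free principle MirrorMark builds on can be illustrated with a simplified sketch (our own toy version, not the paper's actual mirroring scheme): if token sampling is driven by key-derived uniform randomness via inverse-transform sampling, the output distribution equals the model's distribution exactly, while a key holder can recompute the randomness per position for detection.

```python
import hashlib

# Simplified toy version of the distortion-free idea (ours, not
# MirrorMark's measure-preserving mirroring): drive sampling with
# key-derived uniform randomness via inverse-transform sampling.
# Since u is uniform on [0, 1), sampled tokens follow the model's
# distribution exactly, yet a key holder can recompute u per position
# and test the output for correlation.
def keyed_uniform(key, context):
    digest = hashlib.sha256(f"{key}|{context}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def sample_token(probs, key, context):
    u, acc = keyed_uniform(key, context), 0.0
    for token, p in enumerate(probs):
        acc += p
        if u < acc:
            return token
    return len(probs) - 1

probs = [0.5, 0.3, 0.2]
# deterministic given (key, context), hence reproducible by a detector
assert sample_token(probs, "k", "ctx") == sample_token(probs, "k", "ctx")
# over many contexts the token frequencies match probs (no distortion)
draws = [sample_token(probs, "k", str(i)) for i in range(20000)]
assert abs(draws.count(0) / 20000 - 0.5) < 0.02
```

Embedding multi-bit messages and the scheduler for insertion/deletion robustness are the paper's additional contributions, not captured by this sketch.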
[AI-146] Learning to Recommend Multi-Agent Subgraphs from Calling Trees
【速读】: This paper addresses agent recommendation in multi-agent systems (MAS): as agent and tool marketplaces grow, an orchestrator must select, from functionally overlapping candidates, options that are both relevant and reliable, compatible with the current execution context, and able to cooperate with other selected agents. Existing recommenders perform item-level ranking over flat user-item logs and cannot handle the structured, sequential, interaction-dependent nature of MAS orchestration. The key to the solution is a constrained recommendation framework: retrieval first builds a compact candidate set conditioned on the current subtask and context, and a learned scorer then performs utility optimization within the feasible set, accounting for relevance, reliability, and interaction effects. Historical calling trees serve as the source of learning signals, capturing parent-child calls, branching dependencies, and local cooperation patterns in MAS execution, so the framework supports two complementary settings: agent-level recommendation (selecting the next agent to execute) and system-level recommendation (selecting a small connected subgraph of agents for coordinated execution).
链接: https://arxiv.org/abs/2601.22209
作者: Xinyuan Song,Liang Zhao
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-agent systems (MAS) increasingly solve complex tasks by orchestrating agents and tools selected from rapidly growing marketplaces. As these marketplaces expand, many candidates become functionally overlapping, making selection not just a retrieval problem: beyond filtering relevant agents, an orchestrator must choose options that are reliable, compatible with the current execution context, and able to cooperate with other selected agents. Existing recommender systems – largely built for item-level ranking from flat user-item logs – do not directly address the structured, sequential, and interaction-dependent nature of agent orchestration. We address this gap by formulating agent recommendation in MAS as a constrained decision problem and introducing a generic constrained recommendation framework that first uses retrieval to build a compact candidate set conditioned on the current subtask and context, and then performs utility optimization within this feasible set using a learned scorer that accounts for relevance, reliability, and interaction effects. We ground both the formulation and learning signals in historical calling trees, which capture the execution structure of MAS (parent-child calls, branching dependencies, and local cooperation patterns) beyond what flat logs provide. The framework supports two complementary settings: agent-level recommendation (select the next agent/tool) and system-level recommendation (select a small, connected agent team/subgraph for coordinated execution). To enable systematic evaluation, we construct a unified calling-tree benchmark by normalizing invocation logs from eight heterogeneous multi-agent corpora into a shared structured representation.
zh
[AI-147] Advanced techniques and applications of LiDAR Place Recognition in Agricultural Environments: A Comprehensive Survey
【速读】: This paper addresses the challenges facing LiDAR-based place recognition (LPR) in agricultural environments, where the lack of distinctive features and structure in unstructured farmland makes accurate localization difficult. The key to the solution is a systematic survey of state-of-the-art deep learning approaches for agricultural settings, analyzing existing datasets, evaluation metrics, and the performance of mainstream LPR systems, and identifying the limitations of current research and future directions, thereby providing a theoretical foundation and practical guidance for this specialized domain and advancing autonomous navigation for agricultural robots.
链接: https://arxiv.org/abs/2601.22198
作者: Judith Vilella-Cantos,Mónica Ballesta,David Valiente,María Flores,Luis Payá
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:
Abstract:An optimal solution to the localization problem is essential for developing autonomous robotic systems. Apart from autonomous vehicles, precision agriculture is one of the fields that can benefit most from these systems. Although LiDAR place recognition is a widely used technique in recent years to achieve accurate localization, it is mostly used in urban settings. However, the lack of distinctive features and the unstructured nature of agricultural environments make place recognition challenging. This work presents a comprehensive review of the latest state-of-the-art deep learning applications and LPR techniques for agricultural environments. We focus on the challenges that arise in these environments. We analyze the existing approaches, datasets, and metrics used to evaluate LPR system performance and discuss the limitations and future directions of research in this field. This is the first survey that focuses on LiDAR-based localization in agricultural settings, with the aim of providing a thorough understanding and fostering further research in this specialized domain.
zh
[AI-148] Neural Signals Generate Clinical Notes in the Wild
【速读】: This paper addresses the labor-intensive task of generating clinical reports, which summarize abnormal patterns, diagnostic findings, and clinical interpretations, from long-term EEG recordings. The key to the solution is CELM, the first clinical EEG-to-language foundation model, which integrates pretrained EEG foundation models with language models to perform multi-scale, end-to-end clinical report generation covering recording description, background activity, epileptiform abnormalities, events/seizures, and final impressions; with patient-history supervision, it raises generation metrics such as ROUGE-1 and METEOR from 0.2-0.3 to 0.4-0.6, clearly outperforming the baselines.
链接: https://arxiv.org/abs/2601.22197
作者: Jathurshan Pradeepkumar,Zheng Chen,Jimeng Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:
Abstract:Generating clinical reports that summarize abnormal patterns, diagnostic findings, and clinical interpretations from long-term EEG recordings remains labor-intensive. We curate a large-scale clinical EEG dataset with 9,922 reports paired with approximately 11,000 hours of EEG recordings from 9,048 patients. We therefore develop CELM, the first clinical EEG-to-Language foundation model capable of summarizing long-duration, variable-length EEG recordings and performing end-to-end clinical report generation at multiple scales, including recording description, background activity, epileptiform abnormalities, events/seizures, and impressions. Experimental results show that, with patient history supervision, our method achieves 70%-95% average relative improvements in standard generation metrics (e.g., ROUGE-1 and METEOR), from 0.2-0.3 to 0.4-0.6. In the zero-shot setting without patient history, CELM attains generation scores in the range of 0.43-0.52, compared to baselines of 0.17-0.26. CELM integrates pretrained EEG foundation models with language models to enable scalable multimodal learning. We release our model and benchmark construction pipeline at [URL].
zh
[AI-149] Multitask Learning for Earth Observation Data Classification with Hybrid Quantum Network
【速读】: This paper addresses the high computational demands and low analysis efficiency that large deep learning models impose on Earth observation (EO) data, exploring the potential of quantum machine learning (QML) for EO data classification despite current quantum hardware constraints. The key to the solution is a hybrid model that incorporates multitask learning to make data encoding more efficient and a location weight module with quantum convolution operations to extract effective features, improving classification performance and model generalization.
链接: https://arxiv.org/abs/2601.22195
作者: Fan Fan,Yilei Shi,Tobias Guggemos,Xiao Xiang Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Quantum machine learning (QML) has gained increasing attention as a potential solution to address the challenges of computation requirements in the future. Earth observation (EO) has entered the era of Big Data, and the computational demands for effectively analyzing large EO data with complex deep learning models have become a bottleneck. Motivated by this, we aim to leverage quantum computing for EO data classification and explore its advantages despite the current limitations of quantum devices. This paper presents a hybrid model that incorporates multitask learning to assist efficient data encoding and employs a location weight module with quantum convolution operations to extract valid features for classification. The validity of our proposed model was evaluated using multiple EO benchmarks. Additionally, we experimentally explored the generalizability of our model and investigated the factors contributing to its advantage, highlighting the potential of QML in EO data analysis.
zh
[AI-150] COL-Trees: Efficient Hierarchical Object Search in Road Networks
【速读】: This paper addresses the efficient execution of diverse neighbor queries on graph structures such as road networks, including Aggregate k Nearest Neighbor (AkNN) and k Farthest Neighbor (kFN) queries, which traditional Euclidean-distance-based heuristics handle poorly. The core of the solution is a new data structure, the Compacted Object-Landmark Tree (COL-Tree), which enables hierarchical graph traversal with a more accurate landmark-based heuristic, markedly speeding up complex neighbor queries in multi-agent scenarios. Experiments on real-world and synthetic datasets show improvements of up to 4 orders of magnitude over existing techniques, at a small pre-processing overhead.
链接: https://arxiv.org/abs/2601.22183
作者: Tenindra Abeywickrama,Muhammad Aamir Cheema,Sabine Storandt
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
备注: Submitted to Artificial Intelligence (AIJ)
Abstract:Location-based services rely heavily on efficient methods that search for relevant points-of-interest (POIs) near a given location. A k Nearest Neighbor (kNN) query is one such example that finds the k closest POIs from an agent’s location. While most existing techniques focus on retrieving nearby POIs for a single agent, these search heuristics do not translate to many other applications. For example, Aggregate k Nearest Neighbor (AkNN) queries require POIs that are close to multiple agents. k Farthest Neighbor (kFN) queries require POIs that are the antithesis of nearest. Such problems naturally benefit from a hierarchical approach, but existing methods rely on Euclidean-based heuristics, which have diminished effectiveness in graphs such as road networks. We propose a novel data structure, COL-Tree (Compacted Object-Landmark Tree), to address this gap by enabling efficient hierarchical graph traversal using a more accurate landmark-based heuristic. We then present query algorithms that utilize COL-Trees to efficiently answer AkNN, kFN, and other queries. In our experiments on real-world and synthetic datasets, we demonstrate that our techniques significantly outperform existing approaches, achieving up to 4 orders of magnitude improvement. Moreover, this comes at a small pre-processing overhead in both theory and practice.
zh
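The landmark-based heuristic that COL-Trees rely on can be illustrated with a small sketch (our own ALT-style example, not the authors' data structure): given precomputed shortest-path distances d(l, .) from a landmark l, the triangle inequality yields |d(l,u) - d(l,v)| <= d(u,v), a graph-aware lower bound suitable for pruning hierarchical search where Euclidean bounds lose accuracy.

```python
import heapq

# Small ALT-style illustration of the landmark bound behind COL-Trees
# (ours, not the authors' structure): with precomputed distances
# d(l, .) from a landmark l, the triangle inequality gives
# |d(l, u) - d(l, v)| <= d(u, v), a lower bound usable for pruning.
def dijkstra(graph, src):
    dist, pq = {src: 0}, [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(pq, (d + w, v))
    return dist

graph = {"a": [("b", 2), ("c", 5)],
         "b": [("a", 2), ("c", 1), ("d", 4)],
         "c": [("a", 5), ("b", 1), ("d", 1)],
         "d": [("b", 4), ("c", 1)]}
landmark = dijkstra(graph, "a")      # one landmark rooted at node "a"
for u in graph:                      # bound never overestimates d(u, v)
    du = dijkstra(graph, u)
    assert all(abs(landmark[u] - landmark[v]) <= du[v] for v in graph)
```

COL-Trees additionally compact objects and landmarks into a tree hierarchy; this sketch only demonstrates why landmark bounds beat Euclidean ones in networks.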
[AI-151] ShellForge: Adversarial Co-Evolution of Webshell Generation and Multi-View Detection for Robust Webshell Defense
【速读】: This paper addresses two core challenges in webshell detection: attackers continually evolve new variants with sophisticated obfuscation to evade detection, and existing defenses raise many false alarms on legitimate but heavily obfuscated administrative scripts. The key to the ShellForge framework is an adversarial co-evolution mechanism that continually hardens detection: a generator optimized via supervised fine-tuning and preference-based reinforcement learning synthesizes highly evasive webshell variants, while a multi-view fusion detector combines semantic features from long-string compression, structural features from pruned abstract syntax trees, and global statistics such as Shannon entropy. An LLM-based transformation additionally produces de-malicious samples as high-quality hard negatives, substantially reducing false positives and improving detection robustness.
链接: https://arxiv.org/abs/2601.22182
作者: Yizhong Ding
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Webshells remain a primary foothold for attackers to compromise servers, particularly within PHP ecosystems. However, existing detection mechanisms often struggle to keep pace with rapid variant evolution and sophisticated obfuscation techniques that camouflage malicious intent. Furthermore, many current defenses suffer from high false-alarm rates when encountering benign administrative scripts that employ heavy obfuscation for intellectual property protection. To address these challenges, we present ShellForge, an adversarial co-evolution framework that couples automated webshell generation with multi-view detection to continuously harden defensive boundaries. The framework operates through an iterative co-training loop where a generator and a detector mutually reinforce each other via the exchange of hard samples. The generator is optimized through supervised fine-tuning and preference-based reinforcement learning to synthesize functional, highly evasive variants. Simultaneously, we develop a multi-view fusion detector that integrates semantic features from long-string compression, structural features from pruned abstract syntax trees, and global statistical indicators such as Shannon entropy. To minimize false positives, ShellForge utilizes a LLM-based transformation to create de-malicious samples–scripts that retain complex obfuscation patterns but lack harmful payloads–serving as high-quality hard negatives during training. Evaluations on the public FWOID benchmark demonstrate that ShellForge significantly enhances defensive robustness. Upon convergence, the detector maintains a 0.981 F1-score while the generator achieves a 0.939 evasion rate against commercial engines on VirusTotal.
zh
[AI-152] Screen Match and Cache: A Training-Free Causality-Consistent Reference Frame Framework for Human Animation
【速读】: This paper addresses two core challenges in long-sequence human animation: modeling long-range dependencies to preserve temporal coherence while maintaining per-frame visual quality. The key to the solution is FrameCache, a training-free three-stage framework comprising Screen, Cache, and Match modules: the Screen stage dynamically selects informative frames via a multi-dimensional, quality-aware mechanism with adaptive thresholds; the Cache stage maintains a reference pool with a dynamic replacement-hit strategy that balances diversity and relevance; and the Match stage extracts behavioral features for motion-consistent reference matching, providing coherent animation guidance. On standard benchmarks, the method markedly improves temporal coherence and visual stability, and it integrates seamlessly with diverse baseline models.
链接: https://arxiv.org/abs/2601.22160
作者: Jianan Wang,Nailei Hei,Li He,Huanzhen Wang,Aoxing Li,Haofen Wang,Yan Wang,Wenqiang Zhang
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
备注:
Abstract:Human animation aims to generate temporally coherent and visually consistent videos over long sequences, yet modeling long-range dependencies while preserving frame quality remains challenging. Inspired by the human ability to leverage past observations for interpreting ongoing actions, we propose FrameCache, a training-free three-stage framework consisting of Screen, Cache, and Match. In the Screen stage, a multi-dimensional, quality-aware mechanism with adaptive thresholds dynamically selects informative frames; the Cache stage maintains a reference pool using a dynamic replacement-hit strategy, preserving both diversity and relevance; and the Match stage extracts behavioral features to perform motion-consistent reference matching for coherent animation guidance. Extensive experiments on standard benchmarks demonstrate that FrameCache consistently improves temporal coherence and visual stability while integrating seamlessly with diverse baselines. Despite these encouraging results, further analysis reveals that its effectiveness depends on baseline temporal reasoning and real-synthetic consistency, motivating future work on compatibility conditions and adaptive cache mechanisms. Code will be made publicly available.
zh
[AI-153] Smart Routing with Precise Link Estimation: DSEE-Based Anypath Routing for Reliable Wireless Networking ICML
【速读】: This paper addresses the failure of traditional routing protocols in multi-hop wireless mesh networks, which rely on predetermined paths that break down under unpredictable link conditions, with a focus on improving routing reliability and adaptability in dynamic, resource-constrained environments. The key to the solution is a multi-armed bandit algorithm based on Deterministic Sequencing of Exploration and Exploitation (DSEE) that estimates link delivery probabilities in real time to strengthen Shortest Anypath routing; the approach continuously learns and accurately estimates link reliability, selects optimal paths with a provable near-logarithmic regret bound, and achieves better regret scaling with network size than the previously proposed Thompson Sampling-based Opportunistic Routing (TSOR) scheme.
链接: https://arxiv.org/abs/2405.10377
作者: Narjes Nourzad,Bhaskar Krishnamachari
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ICMLCN 2024
Abstract:In dynamic and resource-constrained environments, such as multi-hop wireless mesh networks, traditional routing protocols often falter by relying on predetermined paths that prove ineffective in unpredictable link conditions. Shortest Anypath routing offers a solution by adapting routing decisions based on real-time link conditions. However, the effectiveness of such routing is fundamentally dependent on the quality and reliability of the available links, and predicting these variables with certainty is challenging. This paper introduces a novel approach that leverages the Deterministic Sequencing of Exploration and Exploitation (DSEE), a multi-armed bandit algorithm, to address the need for accurate and real-time estimation of link delivery probabilities. This approach augments the reliability and resilience of the Shortest Anypath routing in the face of fluctuating link conditions. By coupling DSEE with Anypath routing, this algorithm continuously learns and ensures accurate delivery probability estimation and selects the most suitable way to efficiently route packets while maintaining a provable near-logarithmic regret bound. We also theoretically prove that our proposed scheme offers better regret scaling with respect to the network size than the previously proposed Thompson Sampling-based Opportunistic Routing (TSOR).
zh
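DSEE's deterministic split between exploration and exploitation can be sketched in a few lines (a simplified toy version under our own assumptions, not the paper's anypath-routing integration): arms model links with unknown packet-delivery probabilities, exploration slots occur while any arm's sample count is below a logarithmic budget, and exploitation slots pick the empirically best arm.

```python
import math
import random

# Simplified toy version of DSEE (ours, not the paper's routing
# integration): explore arms round-robin while any arm's count is
# below a logarithmic budget, otherwise exploit the arm with the best
# empirical delivery rate. Arms stand in for wireless links.
def dsee(delivery_probs, horizon, c=3.0, seed=0):
    rng = random.Random(seed)
    n = len(delivery_probs)
    counts, wins, picks = [0] * n, [0] * n, []
    for t in range(1, horizon + 1):
        if min(counts) < c * math.log(t + 1):   # exploration slot
            arm = min(range(n), key=lambda a: counts[a])
        else:                                   # exploitation slot
            arm = max(range(n), key=lambda a: wins[a] / counts[a])
        counts[arm] += 1
        wins[arm] += rng.random() < delivery_probs[arm]
        picks.append(arm)
    return picks

picks = dsee([0.2, 0.9, 0.4], horizon=5000)
# after the log-capped exploration phase, the best link dominates
assert picks[-1000:].count(1) / 1000 > 0.8
```

The deterministic exploration schedule (rather than randomized exploration) is what yields DSEE's logarithmic regret guarantee in the original bandit literature.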
[AI-154] Disentangling multispecific antibody function with graph neural networks
【速读】: This paper addresses a key challenge in the rational design of multispecific antibodies: accurately predicting how subtle changes in domain topology affect functional outcomes, a problem exacerbated by scarce experimental data. The key to the solution is a computational framework with two parts: first, a generative method that constructs large-scale, realistic synthetic functional landscapes capturing non-linear interactions where biological activity depends on domain connectivity; second, a graph neural network architecture that explicitly encodes topological constraints and can distinguish format configurations that sequence-only models cannot tell apart. Trained on synthetic data, the model achieves high predictive accuracy on limited biological datasets via transfer learning, providing an effective tool for optimizing the efficacy-toxicity trade-off of trispecific T-cell engagers and for selecting optimal common light chains.
链接: https://arxiv.org/abs/2601.23212
作者: Joshua Southern,Changpeng Lu,Santrupti Nerli,Samuel D. Stanton,Andrew M. Watkins,Franziska Seeger,Frédéric A. Dreyer
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
备注: 16 pages, 5 figures, code available at this https URL
Abstract:Multispecific antibodies offer transformative therapeutic potential by engaging multiple epitopes simultaneously, yet their efficacy is an emergent property governed by complex molecular architectures. Rational design is often bottlenecked by the inability to predict how subtle changes in domain topology influence functional outcomes, a challenge exacerbated by the scarcity of comprehensive experimental data. Here, we introduce a computational framework to address part of this gap. First, we present a generative method for creating large-scale, realistic synthetic functional landscapes that capture non-linear interactions where biological activity depends on domain connectivity. Second, we propose a graph neural network architecture that explicitly encodes these topological constraints, distinguishing between format configurations that appear identical to sequence-only models. We demonstrate that this model, trained on synthetic landscapes, recapitulates complex functional properties and, via transfer learning, has the potential to achieve high predictive accuracy on limited biological datasets. We showcase the model’s utility by optimizing trade-offs between efficacy and toxicity in trispecific T-cell engagers and retrieving optimal common light chains. This work provides a robust benchmarking environment for disentangling the combinatorial complexity of multispecifics, accelerating the design of next-generation therapeutics.
zh
[AI-155] A Cross-Domain Graph Learning Protocol for Single-Step Molecular Geometry Refinement
【速读】: This paper addresses the inefficiency of density functional theory (DFT) geometry optimization in high-throughput molecular screening: conventional DFT optimization is computationally expensive and slow, forming a key bottleneck for large-scale molecular structure prediction and property calculation. The core of the solution is GeoOpt-Net, a multi-branch SE(3)-equivariant geometry refinement network that, starting from cheap force-field-generated initial conformers, predicts molecular geometries at B3LYP/TZVP-level accuracy in a single forward pass. A two-stage training strategy (broad pretraining followed by theory- and basis-set-specific fine-tuning) combined with a fidelity-aware feature modulation (FAFM) mechanism enables adaptive calibration across levels of theory and basis sets, matching DFT in structural accuracy (sub-milli-angstrom RMSD) and energetic consistency (near-zero single-point energy deviations) while substantially improving DFT convergence success rates and computational efficiency.
链接: https://arxiv.org/abs/2601.22723
作者: Chengchun Liu,Wendi Cai,Boxuan Zhao,Fanyang Mo
机构: 未知
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Atomic and Molecular Clusters (physics.atm-clus)
备注: 17 pages, 6 figures
Abstract:Accurate molecular geometries are a prerequisite for reliable quantum-chemical predictions, yet density functional theory (DFT) optimization remains a major bottleneck for high-throughput molecular screening. Here we present GeoOpt-Net, a multi-branch SE(3)-equivariant geometry refinement network that predicts DFT-quality structures at the B3LYP/TZVP level of theory in a single forward pass starting from inexpensive initial conformers generated at a low-cost force-field level. GeoOpt-Net is trained using a two-stage strategy in which a broadly pretrained geometric representation is subsequently fine-tuned to approach B3LYP/TZVP-level accuracy, with theory- and basis-set-aware calibration enabled by a fidelity-aware feature modulation (FAFM) mechanism. Benchmarking against representative approaches spanning classical conformer generation (RDKit), semiempirical quantum methods (xTB), data-driven geometry refinement pipelines (Auto3D), and machine-learning interatomic potentials (UMA) on external drug-like molecules demonstrates that GeoOpt-Net achieves sub-milli-Å all-atom RMSD with near-zero B3LYP/TZVP single-point energy deviations, indicating DFT-ready geometries that closely reproduce both structural and energetic references. Beyond geometric metrics, GeoOpt-Net generates initial guesses intrinsically compatible with DFT convergence criteria, yielding nonzero ``All-YES’’ convergence rates (65.0% under loose and 33.4% under default thresholds), and substantially reducing re-optimization steps and wall-clock time. GeoOpt-Net further exhibits smooth and predictable energy scaling with molecular complexity while preserving key electronic observables such as dipole moments. Collectively, these results establish GeoOpt-Net as a scalable, physically consistent geometry refinement framework that enables efficient acceleration of DFT-based quantum-chemical workflows.
zh
[AI-156] AI Decodes Historical Chinese Archives to Reveal Lost Climate History
【速读】: This paper addresses the difficulty of converting qualitative descriptions of climate events in historical archives into quantitative climate records. The key to the solution is a generative AI framework that inverts the logic of historical chroniclers, inferring from the documents the quantitative precipitation patterns associated with recorded climate events; applied to historical Chinese archives, it reconstructs sub-annual precipitation series over southeastern China for 1368-1911 and, for the first time, systematically maps the full spatial and seasonal structure of El Niño influence on regional precipitation.
链接: https://arxiv.org/abs/2601.22458
作者: Sida He,Lingxi Xie,Xiaopeng Zhang,Qi Tian
机构: 未知
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 60 pages, 4 figures in the main text, 25 figures and 10 tables in the appendix
Abstract:Historical archives contain qualitative descriptions of climate events, yet converting these into quantitative records has remained a fundamental challenge. Here we introduce a paradigm shift: a generative AI framework that inverts the logic of historical chroniclers by inferring the quantitative climate patterns associated with documented events. Applied to historical Chinese archives, it produces the sub-annual precipitation reconstruction for southeastern China over the period 1368-1911 AD. Our reconstruction not only quantifies iconic extremes like the Ming Dynasty’s Great Drought but also, crucially, maps the full spatial and seasonal structure of El Niño influence on precipitation in this region over five centuries, revealing dynamics inaccessible in shorter modern records. Our methodology and high-resolution climate dataset are directly applicable to climate science and have broader implications for the historical and social sciences.
zh
[AI-157] Spectral Filtering for Learning Quantum Dynamics
【速读】: This paper addresses the curse of dimensionality in learning high-dimensional quantum systems: in quantum evolution prediction tasks, traditional approaches reconstruct full system matrices, with sample and computational complexity growing exponentially in the Hilbert space dimension. The key to the solution is Quantum Spectral Filtering, which shifts the learning goal from exact reconstruction of system matrices to improper dynamic learning and, leveraging the optimal concentration properties of the Slepian basis, proves that the learnability of such systems is governed strictly by an effective quantum dimension k* determined jointly by the spectral bandwidth and memory horizon, so that sample and computational complexity become independent of the ambient state dimension.
链接: https://arxiv.org/abs/2601.22400
作者: Elad Hazan,Annie Marsden
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:Learning high-dimensional quantum systems is a fundamental challenge that notoriously suffers from the curse of dimensionality. We formulate the task of predicting quantum evolution in the linear response regime as a specific instance of learning a Complex-Valued Linear Dynamical System (CLDS) with sector-bounded eigenvalues – a setting that also encompasses modern Structured State Space Models (SSMs). While traditional system identification attempts to reconstruct full system matrices (incurring exponential cost in the Hilbert dimension), we propose Quantum Spectral Filtering, a method that shifts the goal to improper dynamic learning. Leveraging the optimal concentration properties of the Slepian basis, we prove that the learnability of such systems is governed strictly by an effective quantum dimension k^* , determined by the spectral bandwidth and memory horizon. This result establishes that complex-valued LDSs can be learned with sample and computational complexity independent of the ambient state dimension, provided their spectrum is bounded.
zh
[AI-158] Beyond Conditional Computation: Retrieval-Augmented Genomic Foundation Models with Gengram
【速读】: This paper addresses the inefficiency and lack of explicit structural modeling in how current genomic foundation models (GFMs) handle multi-base conserved sequence motifs: existing approaches rely on large neural networks to implicitly learn biological features from single-nucleotide inputs, incurring high computational cost and weak interpretability. The key to the solution is the Gengram module, a conditional memory module built on a genomic-specific hashing scheme that efficiently stores and retrieves multi-base motifs via an explicit lookup mechanism, establishing a genomic "syntax". Integrated into state-of-the-art GFM architectures, it delivers gains of up to 14% across functional genomics tasks, generalizes well across architectures, and produces biologically meaningful latent-space representations, improving performance and mechanistic interpretability at the same time.
链接: https://arxiv.org/abs/2601.22203
作者: Huinan Xu,Xuyang Feng,Junhong Chen,Junchen Liu,Kaiwen Deng,Kai Ding,Shengning Long,Jiaxue Shuai,Zhaorong Li,Shiping Liu,Guirong Xue,Zhan Xiao
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
备注:
Abstract:Current genomic foundation models (GFMs) rely on extensive neural computation to implicitly approximate conserved biological motifs from single-nucleotide inputs. We propose Gengram, a conditional memory module that introduces an explicit and highly efficient lookup primitive for multi-base motifs via a genomic-specific hashing scheme, establishing genomic “syntax”. Integrated into the backbone of state-of-the-art GFMs, Gengram achieves substantial gains (up to 14%) across several functional genomics tasks. The module demonstrates robust architectural generalization, while further inspection of Gengram’s latent space reveals the emergence of meaningful representations that align closely with fundamental biological knowledge. By establishing structured motif memory as a modeling primitive, Gengram simultaneously boosts empirical performance and mechanistic interpretability, providing a scalable and biology-aligned pathway for the next generation of GFMs. The code is available at this https URL, and the model checkpoint is available at this https URL.
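The explicit motif-lookup primitive described above can be illustrated with a toy sketch. This is not Gengram's actual hashing scheme (the abstract does not detail it); a plain Python dict keyed by fixed-length k-mers stands in for the hash table, and raw occurrence counts stand in for learned motif embeddings.

```python
def build_kmer_table(sequences, k=3):
    """Map each k-mer (motif) to its occurrence count, a stand-in for an embedding index."""
    table = {}
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            table[kmer] = table.get(kmer, 0) + 1
    return table

def lookup(table, seq, k=3):
    """Retrieve motif counts for every k-mer window of a query (0 for unseen motifs)."""
    return [table.get(seq[i:i + k], 0) for i in range(len(seq) - k + 1)]

table = build_kmer_table(["ACGTACGT", "ACGAACGT"], k=3)
feats = lookup(table, "ACGT", k=3)   # windows ACG and CGT
```

The point of the sketch is the O(1) lookup per window, as opposed to re-deriving motif structure with neural computation on every forward pass.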
zh
[AI-159] Practical Evaluation of Quantum Kernel Methods for Radar Micro-Doppler Classification on Noisy Intermediate-Scale Quantum (NISQ) Hardware
【速读】:This paper addresses the high feature dimensionality and low computational efficiency of radar-based aerial target classification, proposing a Quantum Support Vector Machine (QSVM) solution. The key is to reduce the classical features with Principal Component Analysis (PCA) and embed the reduced feature vectors into a quantum kernel-induced feature space via a fully entangled ZZFeatureMap, achieving efficient and competitive classification. The method is validated both on a quantum simulator and on NISQ-era superconducting hardware (the IBM Torino and Fez processors), showing that the QSVM maintains strong classification performance at substantially reduced feature dimensionality; the experiments also reveal how noise, decoherence, and measurement shot count affect quantum kernel estimation, providing feasibility evidence and optimization directions for applying quantum kernel methods to practical radar signal classification.
链接: https://arxiv.org/abs/2601.22194
作者: Vikas Agnihotri,Jasleen Kaur,Sarvagya Kaushik
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper examines the application of a Quantum Support Vector Machine (QSVM) for radarbased aerial target classification using micro-Doppler signatures. Classical features are extracted and reduced via Principal Component Analysis (PCA) to enable efficient quantum encoding. The reduced feature vectors are embedded into a quantum kernel-induced feature space using a fully entangled ZZFeatureMap and classified using a kernel based QSVM. Performance is first evaluated on a quantum simulator and subsequently validated on NISQ-era superconducting quantum hardware, specifically the IBM Torino (133-qubit) and IBM Fez (156-qubit) processors. Experimental results demonstrate that the QSVM achieves competitive classification performance relative to classical SVM baselines while operating on substantially reduced feature dimensionality. Hardware experiments reveal the impact of noise and decoherence and measurement shot count on quantum kernel estimation, and further show improved stability and fidelity on newer Heron r2 architecture. This study provides a systematic comparison between simulator-based and hardware-based QSVM implementations and highlights both the feasibility and current limitations of deploying quantum kernel methods for practical radar signal classification tasks.
zh
[AI-160] Stablecoin Design with Adversarial-Robust Multi-Agent Systems via Trust-Weighted Signal Aggregation
【速读】:This paper addresses the systemic collapses of algorithmic stablecoins under extreme market volatility, caused by reserve-management controllers that are blind to tail events, such as the March 2020 "Black Thursday" incident in which MakerDAO lost $8.3M in collateral auctions and suffered a 15% peg deviation. Existing models such as SAS calibrate covariance estimates on calm-period data and ignore extreme stress scenarios, so strategies that are optimal in normal regimes fail under black-swan shocks. The key to the solution is MVF-Composer, a trust-weighted Mean-Variance Frontier (MVF) reserve controller whose core innovation is a novel Stress Harness: multi-agent simulations act as adversarial stress-testers that expose reserve vulnerabilities before they manifest on-chain, while a trust-scoring mechanism T: A → [0,1] dynamically down-weights signals from agents exhibiting manipulative or Sybil-attack behavior, making the risk-state estimate more robust. Across 1,200 randomized scenarios with injected black-swan shocks, MVF-Composer reduces peak peg deviation by 57% and cuts mean recovery time by 3.1x relative to SAS baselines, with the trust layer accounting for 23% of the stability gains; the framework requires no on-chain oracles beyond standard price feeds and yields a reproducible way to stress-test DeFi reserve policies.
链接: https://arxiv.org/abs/2601.22168
作者: Shengwei You,Aditya Joshi,Andrey Kuehlkamp,Jarek Nabrzyski
机构: 未知
类目: Risk Management (q-fin.RM); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computational Finance (q-fin.CP)
备注:
Abstract:Algorithmic stablecoins promise decentralized monetary stability by maintaining a target peg through programmatic reserve management. Yet, their reserve controllers remain vulnerable to regime-blind optimization, calibrating risk parameters on fair-weather data while ignoring tail events that precipitate cascading failures. The March 2020 Black Thursday collapse, wherein MakerDAO’s collateral auctions yielded $8.3M in losses and a 15% peg deviation, exposed a critical gap: existing models like SAS systematically omit extreme volatility regimes from covariance estimates, producing allocations optimal in expectation but catastrophic under adversarial stress. We present MVF-Composer, a trust-weighted Mean-Variance Frontier reserve controller incorporating a novel Stress Harness for risk-state estimation. Our key insight is deploying multi-agent simulations as adversarial stress-testers: heterogeneous agents (traders, liquidity providers, attackers) execute protocol actions under crisis scenarios, exposing reserve vulnerabilities before they manifest on-chain. We formalize a trust-scoring mechanism T: A → [0,1] that down-weights signals from agents exhibiting manipulative behavior, ensuring the risk-state estimator remains robust to signal injection and Sybil attacks. Across 1,200 randomized scenarios with injected Black-Swan shocks (10% collateral drawdown, 50% sentiment collapse, coordinated redemption attacks), MVF-Composer reduces peak peg deviation by 57% and mean recovery time by 3.1x relative to SAS baselines. Ablation studies confirm the trust layer accounts for 23% of stability gains under adversarial conditions, achieving 72% adversarial agent detection. Our system runs on commodity hardware, requires no on-chain oracles beyond standard price feeds, and provides a reproducible framework for stress-testing DeFi reserve policies.
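The trust-weighting idea (a score T(a) in [0,1] per agent that down-weights manipulative signals before aggregation) can be sketched minimally. The agent names, signal values, and the zero-trust assignment below are illustrative assumptions, not the paper's actual mechanism.

```python
def trust_weighted_estimate(signals, trust):
    """Trust-weighted mean of per-agent signals; weights are the trust scores T(a)."""
    total = sum(trust.values())
    if total == 0:
        raise ValueError("no trusted agents")
    return sum(trust[a] * s for a, s in signals.items()) / total

signals = {"trader": 1.0, "lp": 0.9, "attacker": -5.0}  # attacker injects an outlier
trust   = {"trader": 1.0, "lp": 1.0, "attacker": 0.0}   # detected manipulator gets zero weight
est = trust_weighted_estimate(signals, trust)
```

With the attacker's trust set to zero, the outlier signal has no influence on the aggregated risk-state estimate, which is the robustness-to-injection property the abstract claims.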
zh
机器学习
[LG-0] Decoupled Diffusion Sampling for Inverse Problems on Function Spaces
链接: https://arxiv.org/abs/2601.23280
作者: Thomas Y.L. Lin,Jiachen Yao,Lufang Chiang,Julius Berner,Anima Anandkumar
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:We propose a data-efficient, physics-aware generative framework in function space for inverse PDE problems. Existing plug-and-play diffusion posterior samplers represent physics implicitly through joint coefficient-solution modeling, requiring substantial paired supervision. In contrast, our Decoupled Diffusion Inverse Solver (DDIS) employs a decoupled design: an unconditional diffusion learns the coefficient prior, while a neural operator explicitly models the forward PDE for guidance. This decoupling enables superior data efficiency and effective physics-informed learning, while naturally supporting Decoupled Annealing Posterior Sampling (DAPS) to avoid over-smoothing in Diffusion Posterior Sampling (DPS). Theoretically, we prove that DDIS avoids the guidance attenuation failure of joint models when training data is scarce. Empirically, DDIS achieves state-of-the-art performance under sparse observation, improving l_2 error by 11% and spectral error by 54% on average; when data is limited to 1%, DDIS maintains accuracy with 40% advantage in l_2 error compared to joint models.
[LG-1] Particle-Guided Diffusion Models for Partial Differential Equations
链接: https://arxiv.org/abs/2601.23262
作者: Andrew Millard,Fredrik Lindsten,Zheng Zhao
类目: Machine Learning (cs.LG)
*备注:
Abstract:We introduce a guided stochastic sampling method that augments sampling from diffusion models with physics-based guidance derived from partial differential equation (PDE) residuals and observational constraints, ensuring generated samples remain physically admissible. We embed this sampling procedure within a new Sequential Monte Carlo (SMC) framework, yielding a scalable generative PDE solver. Across multiple benchmark PDE systems as well as multiphysics and interacting PDE systems, our method produces solution fields with lower numerical error than existing state-of-the-art generative methods.
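As a hedged illustration of one standard ingredient of the Sequential Monte Carlo framework mentioned above, the sketch below implements systematic (low-variance) resampling of weighted particles; particle names, weights, and the seed are invented for the example.

```python
import random

def systematic_resample(particles, weights, seed=0):
    """Systematic resampling: select particles proportionally to weight with low variance."""
    rng = random.Random(seed)
    n = len(particles)
    total = sum(weights)
    cum, c = [], 0.0
    for w in weights:            # normalized cumulative weights
        c += w / total
        cum.append(c)
    u0 = rng.random() / n        # single uniform offset shared by all strata
    out, j = [], 0
    for i in range(n):
        u = u0 + i / n
        while cum[j] < u:
            j += 1
        out.append(particles[j])
    return out

parts = ["a", "b", "c", "d"]
heavy = systematic_resample(parts, [0.97, 0.01, 0.01, 0.01])  # dominant particle survives
even = systematic_resample(parts, [1.0, 1.0, 1.0, 1.0])       # uniform weights keep all
```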
[LG-2] How well do generative models solve inverse problems? A benchmark study
链接: https://arxiv.org/abs/2601.23238
作者: Patrick Krüger,Patrick Materne,Werner Krebs,Hanno Gottschalk
类目: Machine Learning (cs.LG)
*备注: 32 pages, 11 figures, 5 tables
Abstract:Generative learning generates high dimensional data based on low dimensional conditions, also called prompts. Therefore, generative learning algorithms are eligible for solving (Bayesian) inverse problems. In this article we compare a traditional Bayesian inverse approach based on a forward regression model and a prior sampled with the Markov Chain Monte Carlo method with three state of the art generative learning models, namely conditional Generative Adversarial Networks, Invertible Neural Networks and Conditional Flow Matching. We apply them to a problem of gas turbine combustor design where we map six independent design parameters to three performance labels. We propose several metrics for the evaluation of this inverse design approaches and measure the accuracy of the labels of the generated designs along with the diversity. We also study the performance as a function of the training dataset size. Our benchmark has a clear winner, as Conditional Flow Matching consistently outperforms all competing approaches.
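For readers unfamiliar with the Conditional Flow Matching objective that wins this benchmark, here is a minimal sketch of the standard training loss: sample an interpolation time t, form x_t = (1 - t) x0 + t x1, and regress the model onto the velocity target x1 - x0. The closed-form "models" below are stand-in assumptions, not the paper's networks.

```python
def fm_training_loss(x0, x1, model, t):
    """Squared error between the model's predicted velocity at x_t and the target x1 - x0."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    pred = model(xt, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)

x0, x1 = [0.0, 0.0], [2.0, 4.0]
perfect = lambda xt, t: [2.0, 4.0]   # matches the constant velocity x1 - x0 exactly
zero = lambda xt, t: [0.0, 0.0]      # uninformative model
loss_perfect = fm_training_loss(x0, x1, perfect, t=0.3)
loss_zero = fm_training_loss(x0, x1, zero, t=0.3)
```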
[LG-3] Sequence Diffusion Model for Temporal Link Prediction in Continuous-Time Dynamic Graph
链接: https://arxiv.org/abs/2601.23233
作者: Nguyen Minh Duc,Viet Cuong Ta
类目: Machine Learning (cs.LG)
*备注:
Abstract:Temporal link prediction in dynamic graphs is a fundamental problem in many real-world systems. Existing temporal graph neural networks mainly focus on learning representations of historical interactions. Despite their strong performance, these models are still purely discriminative, producing point estimates for future links and lacking an explicit mechanism to capture the uncertainty and sequential structure of future temporal interactions. In this paper, we propose SDG, a novel sequence-level diffusion framework that unifies dynamic graph learning with generative denoising. Specifically, SDG injects noise into the entire historical interaction sequence and jointly reconstructs all interaction embeddings through a conditional denoising process, thereby enabling the model to capture more comprehensive interaction distributions. To align the generative process with temporal link prediction, we employ a cross-attention denoising decoder to guide the reconstruction of the destination sequence and optimize the model in an end-to-end manner. Extensive experiments on various temporal graph benchmarks show that SDG consistently achieves state-of-the-art performance in the temporal link prediction task.
[LG-4] Optimal Fair Aggregation of Crowdsourced Noisy Labels using Demographic Parity Constraints
链接: https://arxiv.org/abs/2601.23221
作者: Gabriel Singer,Samuel Gruffaz,Olivier Vo Van,Nicolas Vayatis,Argyris Kalogeratos
类目: Machine Learning (cs.LG)
*备注:
Abstract:As acquiring reliable ground-truth labels is usually costly, or infeasible, crowdsourcing and aggregation of noisy human annotations is the typical resort. Aggregating subjective labels, though, may amplify individual biases, particularly regarding sensitive features, raising fairness concerns. Nonetheless, fairness in crowdsourced aggregation remains largely unexplored, with no existing convergence guarantees and only limited post-processing approaches for enforcing \varepsilon-fairness under demographic parity. We address this gap by analyzing the fairness of crowdsourced aggregation methods within the \varepsilon-fairness framework, for Majority Vote and Optimal Bayesian aggregation. In the small-crowd regime, we derive an upper bound on the fairness gap of Majority Vote in terms of the fairness gaps of the individual annotators. We further show that the fairness gap of the aggregated consensus converges exponentially fast to that of the ground-truth under interpretable conditions. Since ground-truth itself may still be unfair, we generalize a state-of-the-art multiclass fairness post-processing algorithm from the continuous to the discrete setting, which enforces strict demographic parity constraints to any aggregation rule. Experiments on synthetic and real datasets demonstrate the effectiveness of our approach and corroborate the theoretical insights.
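The demographic-parity quantity under study can be made concrete: the sketch below computes the parity gap of Majority Vote aggregation, the difference in positive-prediction rates between sensitive groups. The groups and label sets are invented for illustration and do not come from the paper.

```python
def majority_vote(labels):
    """Aggregate binary annotator labels by strict majority."""
    return 1 if sum(labels) * 2 > len(labels) else 0

def parity_gap(items):
    """items: list of (group, annotator_labels). Returns the max difference in
    positive-rate of the aggregated label across sensitive groups."""
    rates = {}
    for g in set(g for g, _ in items):
        preds = [majority_vote(lbls) for gg, lbls in items if gg == g]
        rates[g] = sum(preds) / len(preds)
    vals = list(rates.values())
    return max(vals) - min(vals)

items = [("A", [1, 1, 0]), ("A", [1, 0, 1]),   # group A: both items aggregate to 1
         ("B", [1, 0, 0]), ("B", [0, 1, 1])]   # group B: items aggregate to 0 and 1
gap = parity_gap(items)
```

A rule is \varepsilon-fair under demographic parity when this gap is at most \varepsilon; the paper's post-processing step pushes the gap of any aggregation rule below a chosen threshold.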
[LG-5] ackling air quality with SAPIENS
链接: https://arxiv.org/abs/2601.23215
作者: Marcella Bona,Nathan Heatley,Jia-Chen Hua,Adriana Lara,Valeria Legaria-Santiago,Alberto Luviano Juarez,Fernando Moreno-Gomez,Jocelyn Richardson,Natan Vilchis,Xiwen Shirley Zheng
类目: Machine Learning (cs.LG)
*备注: 24 pages, 13 figures
Abstract:Air pollution is a chronic problem in large cities worldwide and awareness is rising as the long-term health implications become clearer. Vehicular traffic has been identified as a major contributor to poor air quality. In a lot of cities the publicly available air quality measurements and forecasts are coarse-grained both in space and time. However, in general, real-time traffic intensity data is openly available in various forms and is fine-grained. In this paper, we present an in-depth study of pollution sensor measurements combined with traffic data from Mexico City. We analyse and model the relationship between traffic intensity and air quality with the aim to provide hyper-local, dynamic air quality forecasts. We developed an innovative method to represent traffic intensities by transforming simple colour-coded traffic maps into concentric ring-based descriptions, enabling improved characterisation of traffic conditions. Using Partial Least Squares Regression, we predict pollution levels based on these newly defined traffic intensities. The model was optimised with various training samples to achieve the best predictive performance and gain insights into the relationship between pollutants and traffic. The workflow we have designed is straightforward and adaptable to other contexts, like other cities beyond the specifics of our dataset.
[LG-6] Ensuring Semantics in Weights of Implicit Neural Representations through the Implicit Function Theorem
链接: https://arxiv.org/abs/2601.23181
作者: Tianming Qiu,Christos Sonis,Hao Shen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Weight Space Learning (WSL), which frames neural network weights as a data modality, is an emerging field with potential for tasks like meta-learning or transfer learning. Particularly, Implicit Neural Representations (INRs) provide a convenient testbed, where each set of weights determines the corresponding individual data sample as a mapping from coordinates to contextual values. So far, a precise theoretical explanation for the mechanism of encoding semantics of data into network weights is still missing. In this work, we deploy the Implicit Function Theorem (IFT) to establish a rigorous mapping between the data space and its latent weight representation space. We analyze a framework that maps instance-specific embeddings to INR weights via a shared hypernetwork, achieving performance competitive with existing baselines on downstream classification tasks across 2D and 3D datasets. These findings offer a theoretical lens for future investigations into network weights.
[LG-7] riSpec: Ternary Speculative Decoding via Lightweight Proxy Verification
链接: https://arxiv.org/abs/2601.23180
作者: Haoyun Jiang,Junqi He,Feng Hong,Xinlong Yang,Jianwei Zhang,Zheng Li,Zhengyang Zhuge,Zhiyong Chen,Bo Han,Junyang Lin,Jiangchao Yao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Inference efficiency in Large Language Models (LLMs) is fundamentally limited by their serial, autoregressive generation, especially as reasoning becomes a key capability and response sequences grow longer. Speculative decoding (SD) offers a powerful solution, providing significant speed-ups through its lightweight drafting and parallel verification mechanism. While existing work has nearly saturated improvements in draft effectiveness and efficiency, this paper advances SD from a new yet critical perspective: the verification cost. We propose TriSpec, a novel ternary SD framework that, at its core, introduces a lightweight proxy to significantly reduce computational cost by approving easily verifiable draft sequences and engaging the full target model only when encountering uncertain tokens. TriSpec can be integrated with state-of-the-art SD methods like EAGLE-3 to further reduce verification costs, achieving greater acceleration. Extensive experiments on the Qwen3 and DeepSeek-R1-Distill-Qwen/LLaMA families show that TriSpec achieves up to 35% speedup over standard SD, with up to 50% fewer target model invocations while maintaining comparable accuracy.
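A hedged toy of the ternary verification idea: a cheap proxy approves draft tokens it is confident about, and the expensive target model is invoked only on the uncertain ones. The dictionary "models", the tokens, and the 0.9 threshold are illustrative assumptions, not TriSpec's actual components.

```python
def verify_with_proxy(draft, proxy_conf, target, threshold=0.9):
    """Accept a prefix of the draft; count how often the full target model is needed."""
    accepted, target_calls = [], 0
    for tok in draft:
        if proxy_conf.get(tok, 0.0) >= threshold:
            accepted.append(tok)          # proxy approves: no target call
        else:
            target_calls += 1             # uncertain token: fall back to the target model
            if target.get(tok, False):
                accepted.append(tok)
            else:
                break                     # first rejection ends the accepted prefix
    return accepted, target_calls

draft = ["the", "cat", "sat", "zzz"]
proxy_conf = {"the": 0.99, "cat": 0.95, "sat": 0.5, "zzz": 0.1}
target = {"sat": True, "zzz": False}
accepted, calls = verify_with_proxy(draft, proxy_conf, target)
```

Here only 2 of 4 draft tokens trigger a target invocation, which is the kind of verification-cost saving the abstract quantifies.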
[LG-8] MeshGraphNet-Transformer: Scalable Mesh-based Learned Simulation for Solid Mechanics
链接: https://arxiv.org/abs/2601.23177
作者: Mikel M. Iparraguirre,Iciar Alfaro,David Gonzalez,Elias Cueto
类目: Machine Learning (cs.LG)
*备注:
Abstract:We present MeshGraphNet-Transformer (MGN-T), a novel architecture that combines the global modeling capabilities of Transformers with the geometric inductive bias of MeshGraphNets, while preserving a mesh-based graph representation. MGN-T overcomes a key limitation of standard MGN, the inefficient long-range information propagation caused by iterative message passing on large, high-resolution meshes. A physics-attention Transformer serves as a global processor, updating all nodal states simultaneously while explicitly retaining node and edge attributes. By directly capturing long-range physical interactions, MGN-T eliminates the need for deep message-passing stacks or hierarchical, coarsened meshes, enabling efficient learning on high-resolution meshes with varying geometries, topologies, and boundary conditions at an industrial scale. We demonstrate that MGN-T successfully handles industrial-scale meshes for impact dynamics, a setting in which standard MGN fails due to message-passing under-reaching. The method accurately models self-contact, plasticity, and multivariate outputs, including internal, phenomenological plastic variables. Moreover, MGN-T outperforms state-of-the-art approaches on classical benchmarks, achieving higher accuracy while maintaining practical efficiency, using only a fraction of the parameters required by competing baselines.
[LG-9] Names Don't Matter: Symbol-Invariant Transformer for Open-Vocabulary Learning
链接: https://arxiv.org/abs/2601.23169
作者: İlker Işık,Wenchao Li
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Symbolic Computation (cs.SC)
*备注:
Abstract:Current neural architectures lack a principled way to handle interchangeable tokens, i.e., symbols that are semantically equivalent yet distinguishable, such as bound variables. As a result, models trained on fixed vocabularies often struggle to generalize to unseen symbols, even when the underlying semantics remain unchanged. We propose a novel Transformer-based mechanism that is provably invariant to the renaming of interchangeable tokens. Our approach employs parallel embedding streams to isolate the contribution of each interchangeable token in the input, combined with an aggregated attention mechanism that enables structured information sharing across streams. Experimental results confirm the theoretical guarantees of our method and demonstrate substantial performance gains on open-vocabulary tasks that require generalization to novel symbols.
[LG-10] Stochastic Linear Bandits with Parameter Noise
链接: https://arxiv.org/abs/2601.23164
作者: Daniel Ezer,Alon Peled-Cohen,Yishay Mansour
类目: Machine Learning (cs.LG)
*备注: 8 pages
Abstract:We study the stochastic linear bandits with parameter noise model, in which the reward of action a is a^\top \theta where \theta is sampled i.i.d. We show a regret upper bound of \widetilde{O}(\sqrt{d T \log(K/\delta) \sigma^2_{\max}}) for a horizon T, a general action set of size K of dimension d, and where \sigma^2_{\max} is the maximal variance of the reward for any action. We further provide a lower bound of \widetilde{\Omega}(d \sqrt{T \sigma^2_{\max}}) which is tight (up to logarithmic factors) whenever \log(K) \approx d. For more specific action sets, \ell_p unit balls with p \leq 2 and dual norm q, we show that the minimax regret is \widetilde{\Theta}(\sqrt{d T \sigma^2_q}), where \sigma^2_q is a variance-dependent quantity that is always at most 4. This is in contrast to the minimax regret attainable for such sets in the classic additive noise model, where the regret is of order d \sqrt{T}. Surprisingly, we show that this optimal (up to logarithmic factors) regret bound is attainable using a very simple explore-exploit algorithm.
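The abstract does not specify its "very simple explore-exploit algorithm" beyond that phrase; as a hedged stand-in, the sketch below runs a generic explore-then-commit strategy on a two-armed linear bandit with i.i.d. parameter noise. The arm vectors, noise model, and budget split are assumptions made for the example.

```python
import random

def explore_then_commit(arms, sample_theta, horizon, m, rng):
    """Pull every arm m times to estimate its mean reward, then commit to the best arm."""
    means, pulls, total_reward = {}, 0, 0.0
    for arm in arms:                          # exploration phase
        s = 0.0
        for _ in range(m):
            theta = sample_theta(rng)         # fresh i.i.d. parameter each round
            r = sum(a * th for a, th in zip(arm, theta))
            s += r
            total_reward += r
            pulls += 1
        means[arm] = s / m
    best = max(arms, key=lambda arm: means[arm])
    for _ in range(horizon - pulls):          # commit phase
        theta = sample_theta(rng)
        total_reward += sum(a * th for a, th in zip(best, theta))
    return best, total_reward

rng = random.Random(0)
arms = ((1.0, 0.0), (0.0, 1.0))
# theta has mean (1.0, 0.2) plus small Gaussian noise, so arm (1, 0) is optimal
sample_theta = lambda rng: (1.0 + rng.gauss(0, 0.01), 0.2 + rng.gauss(0, 0.01))
best, _ = explore_then_commit(arms, sample_theta, horizon=100, m=10, rng=rng)
```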
[LG-11] No More No Less: Least-Privilege Language Models
链接: https://arxiv.org/abs/2601.23157
作者: Paulius Rauba,Dominykas Seputis,Patrikas Vanagas,Mihaela van der Schaar
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Least privilege is a core security principle: grant each request only the minimum access needed to achieve its goal. Deployed language models almost never follow it, instead being exposed through a single API endpoint that serves all users and requests. This gap exists not because least privilege would be unhelpful; deployments would benefit greatly from reducing unnecessary capability exposure. The real obstacle is definitional and mechanistic: what does “access” mean inside a language model, and how can we enforce it without retraining or deploying multiple models? We take inspiration from least privilege in computer systems and define a class of models called least-privilege language models, where privilege is reachable internal computation during the forward pass. In this view, lowering privilege literally shrinks the model’s accessible function class, as opposed to denying access via learned policies. We formalize deployment-time control as a monitor-allocator-enforcer stack, separating (i) request-time signals, (ii) a decision rule that allocates privilege, and (iii) an inference-time mechanism that selects privilege. We then propose Nested Least-Privilege Networks, a shape-preserving, rank-indexed intervention that provides a smooth, reversible control knob. We show that this knob yields policy-usable privilege-utility frontiers and enables selective suppression of targeted capabilities with limited collateral degradation across various policies. Most importantly, we argue for a new deployment paradigm that challenges the premise that language models can only be controlled at the output level.
[LG-12] Unsupervised Hierarchical Skill Discovery
链接: https://arxiv.org/abs/2601.23156
作者: Damion Harvey(1),Geraud Nangue Tasse(1 and 2),Branden Ingram(1 and 2),Benjamin Rosman(1 and 2),Steven James(1 and 2) ((1) University of the Witwatersrand, Johannesburg, South Africa, (2) Machine Intelligence and Neural Discovery (MIND) Institute, University of the Witwatersrand, Johannesburg, South Africa)
类目: Machine Learning (cs.LG); Formal Languages and Automata Theory (cs.FL)
*备注: 24 pages, 34 figures. Appendix by Damion Harvey. Damion Harvey is the primary author
Abstract:We consider the problem of unsupervised skill segmentation and hierarchical structure discovery in reinforcement learning. While recent approaches have sought to segment trajectories into reusable skills or options, most rely on action labels, rewards, or handcrafted annotations, limiting their applicability. We propose a method that segments unlabelled trajectories into skills and induces a hierarchical structure over them using a grammar-based approach. The resulting hierarchy captures both low-level behaviours and their composition into higher-level skills. We evaluate our approach in high-dimensional, pixel-based environments, including Craftax and the full, unmodified version of Minecraft. Using metrics for skill segmentation, reuse, and hierarchy quality, we find that our method consistently produces more structured and semantically meaningful hierarchies than existing baselines. Furthermore, as a proof of concept for utility, we demonstrate that these discovered hierarchies accelerate and stabilise learning on downstream reinforcement learning tasks.
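To make the grammar-based hierarchy idea concrete, here is a hedged sketch of byte-pair / Sequitur-style induction over a symbolic trajectory: the most frequent adjacent pair is repeatedly replaced by a new nonterminal, so repeated low-level behaviours become named higher-level skills. This is an illustration of the general technique, not the paper's actual algorithm, and the action symbols are invented.

```python
def induce_hierarchy(seq, n_rules):
    """Repeatedly replace the most frequent adjacent pair with a fresh nonterminal S0, S1, ..."""
    rules = {}
    seq = list(seq)
    for k in range(n_rules):
        counts = {}
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] = counts.get((a, b), 0) + 1
        if not counts:
            break
        pair = max(counts, key=counts.get)
        if counts[pair] < 2:                  # no pair repeats: nothing left to compress
            break
        name = f"S{k}"
        rules[name] = pair
        out, i = [], 0
        while i < len(seq):                   # greedy left-to-right replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

traj = ["move", "grab", "move", "grab", "move", "grab", "drop"]
compressed, rules = induce_hierarchy(traj, n_rules=3)
```

The induced rules form a two-level hierarchy: S0 names the repeated low-level behaviour, and S1 composes two S0 occurrences into a higher-level skill.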
[LG-13] Behemoth: Benchmarking Unlearning in LLMs Using Fully Synthetic Data
链接: https://arxiv.org/abs/2601.23153
作者: Eugenia Iofinova,Dan Alistarh
类目: Machine Learning (cs.LG)
*备注:
Abstract:As artificial neural networks, and specifically large language models, have improved rapidly in capabilities and quality, they have increasingly been deployed in real-world applications, from customer service to Google search, despite the fact that they frequently make factually incorrect or undesirable statements. This trend has inspired practical and academic interest in model editing, that is, in adjusting the weights of the model to modify its likely outputs for queries relating to a specific fact or set of facts. This may be done either to amend a fact or set of facts, for instance, to fix a frequent error in the training data, or to suppress a fact or set of facts entirely, for instance, in case of dangerous knowledge. Multiple methods have been proposed to do such edits. However, at the same time, it has been shown that such model editing can be brittle and incomplete. Moreover the effectiveness of any model editing method necessarily depends on the data on which the model is trained, and, therefore, a good understanding of the interaction of the training data distribution and the way it is stored in the network is necessary and helpful to reliably perform model editing. However, working with large language models trained on real-world data does not allow us to understand this relationship or fully measure the effects of model editing. We therefore propose Behemoth, a fully synthetic data generation framework. To demonstrate the practical insights from the framework, we explore model editing in the context of simple tabular data, demonstrating surprising findings that, in some cases, echo real-world results, for instance, that in some cases restricting the update rank results in a more effective update. The code is available at this https URL.
[LG-14] Manifold-Aware Perturbations for Constrained Generative Modeling
链接: https://arxiv.org/abs/2601.23151
作者: Katherine Keegan,Lars Ruthotto
类目: Machine Learning (cs.LG)
*备注:
Abstract:Generative models have enjoyed widespread success in a variety of applications. However, they encounter inherent mathematical limitations in modeling distributions where samples are constrained by equalities, as is frequently the setting in scientific domains. In this work, we develop a computationally cheap, mathematically justified, and highly flexible distributional modification for combating known pitfalls in equality-constrained generative models. We propose perturbing the data distribution in a constraint-aware way such that the new distribution has support matching the ambient space dimension while still implicitly incorporating underlying manifold geometry. Through theoretical analyses and empirical evidence on several representative tasks, we illustrate that our approach consistently enables data distribution recovery and stable sampling with both diffusion models and normalizing flows.
[LG-15] Why GRPO Needs Normalization: A Local-Curvature Perspective on Adaptive Gradients
链接: https://arxiv.org/abs/2601.23135
作者: Cheng Ge,Caitlyn Heqi Yin,Hao Liang,Jiawei Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement learning (RL) has become a key driver of language model reasoning. Among RL algorithms, Group Relative Policy Optimization (GRPO) is the de facto standard, avoiding the need for a critic by using per-prompt baselines and variance normalization. Yet why and when this normalization helps remains unclear. In this work, we provide an explanation through the lens of local curvature of the sequence-level policy gradient: standard deviation normalization implements an adaptive gradient. Theoretically, under mild conditions, GRPO enjoys a strictly improved convergence rate over unnormalized REINFORCE, with gains characterized by the average within-prompt reward standard deviation across prompts and iterations. Empirically, our analysis on GSM8K and MATH benchmarks reveals three distinct training phases governed by the interplay between feature orthogonality and reward variance: (I) an early acceleration phase where high variance and orthogonality favor adaptive scaling; (II) a relatively stable transition phase; and (III) a late-stage regime where the loss of orthogonality limits further gains. Together, these results provide a principled account of when std normalization helps in GRPO, and offer broader insights into the design of critic-free RL algorithms.
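The std normalization under study is easy to state concretely: GRPO's group-relative advantage subtracts the per-prompt mean reward and divides by the per-prompt standard deviation, which is the adaptive scaling the paper interprets through local curvature. A minimal sketch with an invented reward group:

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages for one prompt group: (r - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# two correct (reward 1) and two incorrect (reward 0) samples for one prompt
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Dropping the division by std recovers the unnormalized REINFORCE-style baseline against which the paper's convergence comparison is made.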
[LG-16] Distribution-informed Efficient Conformal Prediction for Full Ranking
链接: https://arxiv.org/abs/2601.23128
作者: Wenbo Liao,Huipeng Huang,Chen Jia,Huajun Xi,Hao Zeng,Hongxin Wei
类目: Machine Learning (cs.LG)
*备注: 21 pages, 8 figures
Abstract:Quantifying uncertainty is critical for the safe deployment of ranking models in real-world applications. Recent work offers a rigorous solution using conformal prediction in a full ranking scenario, which aims to construct prediction sets for the absolute ranks of test items based on the relative ranks of calibration items. However, relying on upper bounds of non-conformity scores renders the method overly conservative, resulting in substantially large prediction sets. To address this, we propose Distribution-informed Conformal Ranking (DCR), which produces efficient prediction sets by deriving the exact distribution of non-conformity scores. In particular, we find that the absolute ranks of calibration items follow Negative Hypergeometric distributions, conditional on their relative ranks. DCR thus uses the rank distribution to derive non-conformity score distribution and determine conformal thresholds. We provide theoretical guarantees that DCR achieves improved efficiency over the baseline while ensuring valid coverage under mild assumptions. Extensive experiments demonstrate the superiority of DCR, reducing average prediction set size by up to 36%, while maintaining valid coverage.
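The distributional fact at the heart of DCR can be checked directly: with n calibration items and m additional exchangeable items, the absolute rank k of the calibration item with relative rank r follows a Negative Hypergeometric law, P(k) = C(k-1, r-1) C(n+m-k, n-r) / C(n+m, n). The greedy set construction below is a simplified stand-in for the paper's threshold rule, included only to show how a rank pmf turns into a prediction set.

```python
from math import comb

def rank_pmf(n, m, r):
    """Negative Hypergeometric pmf of the absolute rank k (among n + m items)
    of the calibration item whose relative rank is r among n calibration items."""
    return {k: comb(k - 1, r - 1) * comb(n + m - k, n - r) / comb(n + m, n)
            for k in range(r, r + m + 1)}

def prediction_set(n, m, r, alpha):
    """Greedily collect the most probable absolute ranks until mass >= 1 - alpha."""
    pmf = rank_pmf(n, m, r)
    chosen, mass = [], 0.0
    for k, p in sorted(pmf.items(), key=lambda kv: -kv[1]):
        chosen.append(k)
        mass += p
        if mass >= 1 - alpha:
            break
    return sorted(chosen), mass

pmf = rank_pmf(n=4, m=2, r=2)            # {2: 6/15, 3: 6/15, 4: 3/15}, sums to 1
sset, mass = prediction_set(4, 2, 2, alpha=0.25)
```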
[LG-17] CATTO: Balancing Preferences and Confidence in Language Models
链接: https://arxiv.org/abs/2601.23096
作者: Nisarg Parikh,Kunjal Panchal,Ananya Sai,Pannaga Shivaswamy,Andrew Lan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models (LLMs) often make accurate next token predictions but their confidence in these predictions can be poorly calibrated: high-confidence predictions are frequently wrong, and low-confidence predictions may be correct. This miscalibration is exacerbated by preference-based alignment methods breaking the link between predictive probability and correctness. We introduce a Calibration Aware Token-level Training Objective (CATTO), a calibration-aware objective that aligns predicted confidence with empirical prediction correctness, which can be combined with the original preference optimization objectives. Empirically, CATTO reduces Expected Calibration Error (ECE) by 2.22%-7.61% in-distribution and 1.46%-10.44% out-of-distribution compared to direct preference optimization (DPO), and by 0.22%-1.24% in-distribution and 1.23%-5.07% out-of-distribution compared to the strongest DPO baseline. This improvement in confidence does not come at a cost of losing task accuracy, where CATTO maintains or slightly improves multiple-choice question-answering accuracy on five datasets. We also introduce Confidence@k, a test-time scaling mechanism leveraging calibrated token probabilities for Bayes-optimal selection of output tokens.
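For reference, the Expected Calibration Error (ECE) that CATTO reduces is computed by binning predictions by confidence and taking the weighted average gap between each bin's mean confidence and its empirical accuracy. A minimal sketch (the bin count and toy data are illustrative):

```python
def expected_calibration_error(confs, correct, n_bins=10):
    """ECE: bin by confidence, then average |mean confidence - accuracy| weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confs, correct):
        idx = min(int(c * n_bins), n_bins - 1)   # clamp c == 1.0 into the last bin
        bins[idx].append((c, y))
    n = len(confs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - acc)
    return ece

# overconfident: 95% confidence but only 75% accuracy -> ECE of 0.2
ece_over = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 1, 0])
# well calibrated: 50% confidence, 50% accuracy -> ECE of 0
ece_cal = expected_calibration_error([0.5, 0.5], [1, 0])
```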
[LG-18] RN-D: Discretized Categorical Actors with Regularized Networks for On-Policy Reinforcement Learning
链接: https://arxiv.org/abs/2601.23075
作者: Yuexin Bian,Jie Feng,Tao Wang,Yijiang Li,Sicun Gao,Yuanyuan Shi
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
Abstract:On-policy deep reinforcement learning remains a dominant paradigm for continuous control, yet standard implementations rely on Gaussian actors and relatively shallow MLP policies, often leading to brittle optimization when gradients are noisy and policy updates must be conservative. In this paper, we revisit policy representation as a first-class design choice for on-policy optimization. We study discretized categorical actors that represent each action dimension with a distribution over bins, yielding a policy objective that resembles a cross-entropy loss. Building on architectural advances from supervised learning, we further propose regularized actor networks, while keeping critic design fixed. Our results show that simply replacing the standard actor network with our discretized regularized actor yields consistent gains and achieve the state-of-the-art performance across diverse continuous-control benchmarks.
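A minimal sketch of the discretized categorical actor idea: bin one action dimension and score the chosen bin with a cross-entropy-style loss (bin counts and helper names are my own, not the paper's):

```python
import numpy as np

def make_bin_centers(low, high, n_bins):
    """Centers of n_bins equal-width bins covering [low, high]."""
    edges = np.linspace(low, high, n_bins + 1)
    return (edges[:-1] + edges[1:]) / 2

def action_to_bin(a, low, high, n_bins):
    """Index of the bin representing continuous action a."""
    idx = int((a - low) / (high - low) * n_bins)
    return min(max(idx, 0), n_bins - 1)

def cross_entropy(logits, target_bin):
    """A categorical actor's policy objective resembles this loss."""
    z = logits - logits.max()               # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target_bin]

centers = make_bin_centers(-1.0, 1.0, 51)
b = action_to_bin(0.3, -1.0, 1.0, 51)
recovered = centers[b]   # quantized action, within one bin width of 0.3
loss = cross_entropy(np.zeros(51), b)
```

The discretization error shrinks as the number of bins grows, while the loss becomes a well-conditioned classification objective instead of a Gaussian log-likelihood.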
[LG-19] SplineFlow: Flow Matching for Dynamical Systems with B-Spline Interpolants
链接: https://arxiv.org/abs/2601.23072
作者: Santanu Subhash Rathod,Pietro Liò,Xiao Zhang
类目: Machine Learning (cs.LG)
*备注: 36 pages, 35 tables, 22 figures
Abstract:Flow matching is a scalable generative framework for characterizing continuous normalizing flows with wide-ranging applications. However, current state-of-the-art methods are not well-suited for modeling dynamical systems, as they construct conditional paths using linear interpolants that may not capture the underlying state evolution, especially when learning higher-order dynamics from irregularly sampled observations. Constructing unified paths that satisfy multi-marginal constraints across observations is challenging, since naïve higher-order polynomials tend to be unstable and oscillatory. We introduce SplineFlow, a theoretically grounded flow matching algorithm that jointly models conditional paths across observations via B-spline interpolation. Specifically, SplineFlow exploits the smoothness and stability of B-spline bases to learn the complex underlying dynamics in a structured manner while ensuring the multi-marginal requirements are met. Comprehensive experiments across various deterministic and stochastic dynamical systems of varying complexity, as well as on cellular trajectory inference tasks, demonstrate the strong improvement of SplineFlow over existing baselines. Our code is available at: this https URL.
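The core construction, a smooth B-spline path that exactly hits irregularly sampled marginals, can be sketched with SciPy (a generic illustration; SplineFlow's multi-marginal conditional paths are more involved):

```python
import numpy as np
from scipy.interpolate import make_interp_spline

# Irregularly sampled observations: the multi-marginal constraints.
t_obs = np.array([0.0, 0.13, 0.55, 0.72, 1.0])
x_obs = np.sin(2 * np.pi * t_obs)

# Cubic B-spline conditional path: smooth and stable, yet passing
# exactly through every observation, unlike a piecewise-linear interpolant.
path = make_interp_spline(t_obs, x_obs, k=3)
velocity = path.derivative()   # would serve as a target vector field

t_grid = np.linspace(0.0, 1.0, 200)
x_grid = path(t_grid)
```

Unlike naïve high-degree polynomial interpolation, the B-spline basis keeps the path locally controlled, which is the stability property the abstract appeals to.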
[LG-20] From Absolute to Relative: Rethinking Reward Shaping in Group-Based Reinforcement Learning
链接: https://arxiv.org/abs/2601.23058
作者: Wenzhe Niu,Wei He,Zongxia Xie,Jinpeng Ou,Huichuan Fan,Yuchen Ge,Yanru Sun,Ziyin Wang,Yizhao Sun,Chengshun Shi,Jiuchong Gao,Jinghua Hao,Renqing He
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement learning has become a cornerstone for enhancing the reasoning capabilities of Large Language Models, where group-based approaches such as GRPO have emerged as efficient paradigms that optimize policies by leveraging intra-group performance differences. However, these methods typically rely on absolute numerical rewards, introducing intrinsic limitations. In verifiable tasks, identical group evaluations often result in sparse supervision, while in open-ended scenarios, the score range instability of reward models undermines advantage estimation based on group means. To address these limitations, we propose Reinforcement Learning with Relative Rewards (RLRR), a framework that shifts reward shaping from absolute scoring to relative ranking. Complementing this framework, we introduce the Ranking Reward Model, a listwise preference model tailored for group-based optimization to directly generate relative rankings. By transforming raw evaluations into robust relative signals, RLRR effectively mitigates signal sparsity and reward instability. Experimental results demonstrate that RLRR yields consistent performance improvements over standard group-based baselines across reasoning benchmarks and open-ended generation tasks.
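The shift from absolute scores to relative ranks can be sketched in a few lines (a toy illustration of the idea, not RLRR's listwise Ranking Reward Model):

```python
import numpy as np

def relative_rank_rewards(raw_scores):
    """Map a group of raw reward-model scores to centred relative ranks:
    only the ordering within the group matters, not the absolute scale."""
    raw_scores = np.asarray(raw_scores, dtype=float)
    order = raw_scores.argsort().argsort()        # 0 = worst, n-1 = best
    ranks = order / max(len(raw_scores) - 1, 1)   # normalise to [0, 1]
    return ranks - ranks.mean()                   # zero-mean advantage signal

# Two reward models with wildly different score ranges yield identical signals:
a = relative_rank_rewards([0.1, 0.4, 0.2, 0.9])
b = relative_rank_rewards([10.0, 40.0, 20.0, 90.0])
```

This invariance to the reward model's score range is exactly what mitigates the instability of group-mean advantage estimation described above.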
[LG-21] Divide-and-Conquer CoT: RL for Reducing Latency via Parallel Reasoning
链接: https://arxiv.org/abs/2601.23027
作者: Arvind Mahankali,Kaiyue Wen,Tengyu Ma
类目: Machine Learning (cs.LG)
*备注: 47 pages, 13 figures
Abstract:Long chain-of-thought reasoning (Long CoT) is now fundamental to state-of-the-art LLMs, especially in mathematical reasoning. However, LLM generation is highly sequential, and long CoTs lead to a high latency. We propose to train Divide-and-Conquer CoT (DC-CoT) to reduce the latency. With DC-CoT, the model can act as a director that identifies distinct subtasks that can be performed in parallel in its reasoning process, and then spawns workers to execute the subtasks. Our goal is to achieve high accuracy, with a low longest path length, which is a theoretical measure of the latency needed for the response. We start with a long CoT base model (DeepScaleR-1.5B-Preview), and first use SFT with a small curated demonstration set to initialize its ability to spawn workers in a certain format. Because SFT degrades the accuracy significantly, we design a multi-stage RL algorithm, with various data filtering strategies, to recover the accuracy while decreasing the longest path length. Across several benchmarks including AIME 2024 and HMMT 2025, DC-CoT achieves similar accuracy as DeepScaleR-1.5B-Preview while decreasing longest path length by 35-40%. Our code, SFT dataset and models are publicly available at this https URL.
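The longest-path latency measure can be sketched as a small dynamic program over a subtask DAG (the example plan and function names are illustrative):

```python
def longest_path_length(dag):
    """Number of nodes on the longest chain of a DAG given as
    {node: [children]}; a proxy for sequential latency when sibling
    subtasks execute in parallel."""
    memo = {}
    def depth(u):
        if u not in memo:
            memo[u] = 1 + max((depth(v) for v in dag.get(u, [])), default=0)
        return memo[u]
    return max(depth(u) for u in dag)

# The director spawns two parallel workers, then aggregates their results:
plan = {"director": ["worker_a", "worker_b"],
        "worker_a": ["aggregate"],
        "worker_b": ["aggregate"],
        "aggregate": []}
```

Executed sequentially the plan visits four nodes, but its longest path has only three, which is the kind of reduction DC-CoT's RL objective rewards.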
[LG-22] Causal Characterization of Measurement and Mechanistic Anomalies
链接: https://arxiv.org/abs/2601.23026
作者: Hendrik Suhr,David Kaltenpoth,Jilles Vreeken
类目: Machine Learning (cs.LG)
*备注:
Abstract:Root cause analysis of anomalies aims to identify those features that cause the deviation from the normal process. Existing methods ignore, however, that anomalies can arise through two fundamentally different processes: measurement errors, where data was generated normally but one or more values were recorded incorrectly, and mechanism shifts, where the causal process generating the data changed. While measurement errors can often be safely corrected, mechanistic anomalies require careful consideration. We define a causal model that explicitly captures both types by treating outliers as latent interventions on latent (“true”) and observed (“measured”) variables. We show that they are identifiable, and propose a maximum likelihood estimation approach to put this to practice. Experiments show that our method matches state-of-the-art performance in root cause localization, while it additionally enables accurate classification of anomaly types, and remains robust even when the causal DAG is unknown.
[LG-23] Value-at-Risk Constrained Policy Optimization
链接: https://arxiv.org/abs/2601.22993
作者: Rohan Tangri,Jan-Peter Calliess
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We introduce the Value-at-Risk Constrained Policy Optimization algorithm (VaR-CPO), a sample efficient and conservative method designed to optimize Value-at-Risk (VaR) constraints directly. Empirically, we demonstrate that VaR-CPO is capable of safe exploration, achieving zero constraint violations during training in feasible environments, a critical property that baseline methods fail to uphold. To overcome the inherent non-differentiability of the VaR constraint, we employ the one-sided Chebyshev inequality to obtain a tractable surrogate based on the first two moments of the cost return. Additionally, by extending the trust-region framework of the Constrained Policy Optimization (CPO) method, we provide rigorous worst-case bounds for both policy improvement and constraint violation during the training process.
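The one-sided Chebyshev (Cantelli) surrogate can be sketched numerically; the construction below assumes the standard Cantelli inequality, and the exact constant used inside VaR-CPO may differ:

```python
import numpy as np

def chebyshev_var_bound(costs, alpha):
    """Cantelli gives P(C >= mu + t) <= sigma^2 / (sigma^2 + t^2), so
    t = sigma * sqrt(alpha / (1 - alpha)) yields a differentiable upper
    bound on the alpha-quantile from the first two moments alone."""
    mu, sigma = costs.mean(), costs.std()
    return mu + sigma * np.sqrt(alpha / (1.0 - alpha))

rng = np.random.default_rng(0)
costs = rng.normal(0.0, 1.0, 100_000)
bound = chebyshev_var_bound(costs, alpha=0.95)
empirical_var = np.quantile(costs, 0.95)   # about 1.64 for N(0, 1)
```

The bound is conservative (here roughly 4.36 versus a true VaR of about 1.64), which is the price paid for a smooth, distribution-free surrogate.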
[LG-24] dgMARK: Decoding-Guided Watermarking for Diffusion Language Models
链接: https://arxiv.org/abs/2601.22985
作者: Pyo Min Hong,Albert No
类目: Machine Learning (cs.LG)
*备注: Project page: this https URL
Abstract:We propose dgMARK, a decoding-guided watermarking method for discrete diffusion language models (dLLMs). Unlike autoregressive models, dLLMs can generate tokens in arbitrary order. While an ideal conditional predictor would be invariant to this order, practical dLLMs exhibit strong sensitivity to the unmasking order, creating a new channel for watermarking. dgMARK steers the unmasking order toward positions whose high-reward candidate tokens satisfy a simple parity constraint induced by a binary hash, without explicitly reweighting the model’s learned probabilities. The method is plug-and-play with common decoding strategies (e.g., confidence, entropy, and margin-based ordering) and can be strengthened with a one-step lookahead variant. Watermarks are detected via elevated parity-matching statistics, and a sliding-window detector ensures robustness under post-editing operations including insertion, deletion, substitution, and paraphrasing.
[LG-25] PIDSMaker: Building and Evaluating Provenance-based Intrusion Detection Systems
链接: https://arxiv.org/abs/2601.22983
作者: Tristan Bilot,Baoxiang Jiang,Thomas Pasquier
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Recent provenance-based intrusion detection systems (PIDSs) have demonstrated strong potential for detecting advanced persistent threats (APTs) by applying machine learning to system provenance graphs. However, evaluating and comparing PIDSs remains difficult: prior work uses inconsistent preprocessing pipelines, non-standard dataset splits, and incompatible ground-truth labeling and metrics. These discrepancies undermine reproducibility, impede fair comparison, and impose substantial re-implementation overhead on researchers. We present PIDSMaker, an open-source framework for developing and evaluating PIDSs under consistent protocols. PIDSMaker consolidates eight state-of-the-art systems into a modular, extensible architecture with standardized preprocessing and ground-truth labels, enabling consistent experiments and apples-to-apples comparisons. A YAML-based configuration interface supports rapid prototyping by composing components across systems without code changes. PIDSMaker also includes utilities for ablation studies, hyperparameter tuning, multi-run instability measurement, and visualization, addressing methodological gaps identified in prior work. We demonstrate PIDSMaker through concrete use cases and release it with preprocessed datasets and labels to support shared evaluation for the PIDS community.
[LG-26] Improved Algorithms for Nash Welfare in Linear Bandits
链接: https://arxiv.org/abs/2601.22969
作者: Dhruv Sarkar,Nishant Pandey,Sayak Ray Chowdhury
类目: Machine Learning (cs.LG)
*备注:
Abstract:Nash regret has recently emerged as a principled fairness-aware performance metric for stochastic multi-armed bandits, motivated by the Nash Social Welfare objective. Although this notion has been extended to linear bandits, existing results suffer from suboptimality in ambient dimension d, stemming from proof techniques that rely on restrictive concentration inequalities. In this work, we resolve this open problem by introducing new analytical tools that yield an order-optimal Nash regret bound in linear bandits. Beyond Nash regret, we initiate the study of p-means regret in linear bandits, a unifying framework that interpolates between fairness and utility objectives and strictly generalizes Nash regret. We propose a generic algorithmic framework, FairLinBandit, that works as a meta-algorithm on top of any linear bandit strategy. We instantiate this framework using two bandit algorithms: Phased Elimination and Upper Confidence Bound, and prove that both achieve sublinear p-means regret for the entire range of p. Extensive experiments on linear bandit instances generated from real-world datasets demonstrate that our methods consistently outperform the existing state-of-the-art baseline.
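The p-means family underlying this regret notion can be sketched directly; the definition and its limiting cases are textbook facts:

```python
import numpy as np

def p_mean(x, p):
    """Generalised p-mean of positive rewards: p = 1 is the arithmetic
    mean (utilitarian), p -> 0 recovers the geometric mean (Nash
    welfare), and p -> -inf approaches the minimum (egalitarian)."""
    x = np.asarray(x, dtype=float)
    if p == 0:
        return float(np.exp(np.log(x).mean()))
    return float(np.mean(x ** p) ** (1.0 / p))

rewards = np.array([1.0, 2.0, 4.0])
```

Sweeping p thus interpolates between pure utility maximization and increasingly fairness-sensitive objectives, which is the unification the abstract refers to.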
[LG-27] Environment-Conditioned Tail Reweighting for Total Variation Invariant Risk Minimization
链接: https://arxiv.org/abs/2601.22944
作者: Wang Yuanchao,Lai Zhao-Rong,Zhong Tianqi,Li Fengnan
类目: Machine Learning (cs.LG)
*备注: 8 pages
Abstract:Out-of-distribution (OOD) generalization remains challenging when models simultaneously encounter correlation shifts across environments and diversity shifts driven by rare or hard samples. Existing invariant risk minimization (IRM) methods primarily address spurious correlations at the environment level, but often overlook sample-level heterogeneity within environments, which can critically impact OOD performance. In this work, we propose Environment-Conditioned Tail Reweighting for Total Variation Invariant Risk Minimization (ECTR), a unified framework that augments TV-based invariant learning with environment-conditioned tail reweighting to jointly address both types of distribution shift. By integrating environment-level invariance with within-environment robustness, the proposed approach makes these two mechanisms complementary under mixed distribution shifts. We further extend the framework to scenarios without explicit environment annotations by inferring latent environments through a minimax formulation. Experiments across regression, tabular, time-series, and image classification benchmarks under mixed distribution shifts demonstrate consistent improvements in both worst-environment and average OOD performance.
[LG-28] Scalable Topology-Preserving Graph Coarsening with Graph Collapse
链接: https://arxiv.org/abs/2601.22943
作者: Xiang Wu,Rong-Hua Li,Xunkai Li,Kangfei Zhao,Hongchao Qin,Guoren Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph coarsening reduces the size of a graph while preserving certain properties. Most existing methods preserve either spectral or spatial characteristics. Recent research has shown that preserving topological features helps maintain the predictive performance of graph neural networks (GNNs) trained on the coarsened graph but suffers from exponential time complexity. To address these problems, we propose Scalable Topology-Preserving Graph Coarsening (STPGC) by introducing the concepts of graph strong collapse and graph edge collapse extended from algebraic topology. STPGC comprises three new algorithms, GStrongCollapse, GEdgeCollapse, and NeighborhoodConing based on these two concepts, which eliminate dominated nodes and edges while rigorously preserving topological features. We further prove that STPGC preserves the GNN receptive field and develop approximate algorithms to accelerate GNN training. Experiments on node classification with GNNs demonstrate the efficiency and effectiveness of STPGC.
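The notion of a dominated node behind graph strong collapse can be sketched with plain sets; the definition (N[v] contained in N[u]) is standard in algebraic topology, where removing dominated vertices is known to preserve the homotopy type of the clique complex, and the example graph is illustrative:

```python
def dominated_vertices(adj):
    """A vertex v is dominated by some u != v when its closed
    neighbourhood N[v] is contained in N[u]; such vertices can be
    removed by a strong collapse without changing topological features."""
    closed = {v: set(nbrs) | {v} for v, nbrs in adj.items()}
    out = set()
    for v in adj:
        for u in closed[v]:
            if u != v and closed[v] <= closed[u]:
                out.add(v)
                break
    return out

# In a triangle {0, 1, 2} with a pendant vertex 3 attached to 1,
# everything except the hub vertex 1 is dominated:
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
```

Iterating this elimination until no dominated vertex remains is the basic strong-collapse loop that STPGC's algorithms build on and accelerate.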
[LG-29] DC-LA: Difference-of-Convex Langevin Algorithm
链接: https://arxiv.org/abs/2601.22932
作者: Hoang Phuc Hau Luu,Zhongjian Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study a sampling problem whose target distribution is \pi \propto \exp(-f-r) where the data fidelity term f is Lipschitz smooth while the regularizer term r=r_1-r_2 is a non-smooth difference-of-convex (DC) function, i.e., r_1,r_2 are convex. By leveraging the DC structure of r, we can smooth out r by applying Moreau envelopes to r_1 and r_2 separately. In line with DC programming, we then redistribute the concave part of the regularizer to the data fidelity and study its corresponding proximal Langevin algorithm (termed DC-LA). We establish convergence of DC-LA to the target distribution \pi, up to discretization and smoothing errors, in the q-Wasserstein distance for all q \in \mathbb{N}^*, under the assumption that V is distant dissipative. Our results improve previous work on non-log-concave sampling in terms of a more general framework and assumptions. Numerical experiments show that DC-LA produces accurate distributions in synthetic settings and reliably provides uncertainty quantification in a real-world Computed Tomography application.
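Moreau-envelope smoothing of a convex piece can be made concrete for r_1(x) = |x|, whose envelope is the classical Huber-type function; this is a textbook fact illustrating one ingredient, not DC-LA's full construction:

```python
import numpy as np

def moreau_envelope_abs(x, gamma):
    """Moreau envelope of r1(x) = |x| with parameter gamma:
    min_y |y| + (x - y)^2 / (2 * gamma), which evaluates to a smooth
    Huber-type function, quadratic near 0 and linear in the tails."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= gamma,
                    x ** 2 / (2 * gamma),
                    np.abs(x) - gamma / 2)

xs = np.linspace(-2.0, 2.0, 401)
env = moreau_envelope_abs(xs, gamma=0.5)
```

The envelope lower-bounds |x| everywhere and is differentiable at 0, which is what makes a gradient-based Langevin update applicable to the smoothed potential.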
[LG-30] FlexLoRA: Entropy-Guided Flexible Low-Rank Adaptation ICLR
链接: https://arxiv.org/abs/2601.22905
作者: Muqing Liu,Chongjie Si,Yuheng Jia
类目: Machine Learning (cs.LG)
*备注: 2026 ICLR. Codes in this https URL
Abstract:Large pre-trained models achieve remarkable success across diverse domains, yet fully fine-tuning incurs prohibitive computational and memory costs. Parameter-efficient fine-tuning (PEFT) has thus become a mainstream paradigm. Among them, Low-Rank Adaptation (LoRA) introduces trainable low-rank matrices and shows strong performance; nevertheless, its fixed-rank design limits flexibility. Dynamic rank allocation methods mitigate this issue by pruning redundant directions; however, they often rely on heuristic, element-level metrics that globally sort rank directions without matrix-wise distinction, and they lack mechanisms to expand capacity in layers requiring additional adaptation. To overcome these limitations, we propose FlexLoRA, an entropy-guided flexible low-rank adaptation framework that (i) evaluates matrix importance via spectral energy entropy, (ii) supports rank pruning and expansion under a global budget, and (iii) employs zero-impact initialization for newly added singular directions to ensure stability. By addressing granularity, flexibility, and stability limitations, FlexLoRA provides a more principled solution for PEFT. Extensive experiments show that FlexLoRA consistently outperforms state-of-the-art baselines across benchmarks. Codes are available at this https URL.
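Spectral energy entropy as a matrix-importance score can be sketched with an SVD; this is my minimal reading of the metric, and FlexLoRA's exact normalisation may differ:

```python
import numpy as np

def spectral_energy_entropy(w):
    """Entropy of the normalised squared singular values of an update
    matrix: near zero when one direction carries all the energy (a
    pruning candidate), near log(rank) when energy is spread out."""
    s = np.linalg.svd(w, compute_uv=False)
    p = s ** 2 / np.sum(s ** 2)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
rank1 = np.outer(rng.normal(size=16), rng.normal(size=16))  # one direction
spread = rng.normal(size=(16, 16))                          # many directions
h_rank1 = spectral_energy_entropy(rank1)
h_spread = spectral_energy_entropy(spread)
```

Ranking adapter matrices by this entropy gives a matrix-wise (rather than element-wise) signal for deciding where to prune and where to expand rank.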
[LG-31] Uncertainty-Aware Extrapolation in Bayesian Oblique Trees
链接: https://arxiv.org/abs/2601.22899
作者: Viktor Andonovikj,Sašo Džeroski,Pavle Boškoski
类目: Machine Learning (cs.LG)
*备注:
Abstract:Decision trees are widely used due to their interpretability and efficiency, but they struggle in regression tasks that require reliable extrapolation and well-calibrated uncertainty. Piecewise-constant leaf predictions are bounded by the training targets and often become overconfident under distribution shift. We propose a single-tree Bayesian model that extends VSPYCT by equipping each leaf with a GP predictor. Bayesian oblique splits provide uncertainty-aware partitioning of the input space, while GP leaves model local functional behaviour and enable principled extrapolation beyond the observed target range. We present an efficient inference and prediction scheme that combines posterior sampling of split parameters with GP posterior predictions, and a gating mechanism that activates GP-based extrapolation when inputs fall outside the training support of a leaf. Experiments on benchmark regression tasks show improvements in the predictive performance compared to standard variational oblique trees, and substantial performance gains in extrapolation scenarios.
[LG-32] Calibrated Multivariate Distributional Regression with Pre-Rank Regularization
链接: https://arxiv.org/abs/2601.22895
作者: Aya Laajil,Elnura Zhalieva,Naomi Desobry,Souhaib Ben Taieb
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2510.21273
Abstract:The goal of probabilistic prediction is to issue predictive distributions that are as informative as possible, subject to being calibrated. Despite substantial progress in the univariate setting, achieving multivariate calibration remains challenging. Recent work has introduced pre-rank functions, scalar projections of multivariate forecasts and observations, as flexible diagnostics for assessing specific aspects of multivariate calibration, but their use has largely been limited to post-hoc evaluation. We propose a regularization-based calibration method that enforces multivariate calibration during training of multivariate distributional regression models using pre-rank functions. We further introduce a novel PCA-based pre-rank that projects predictions onto principal directions of the predictive distribution. Through simulation studies and experiments on 18 real-world multi-output regression datasets, we show that the proposed approach substantially improves multivariate pre-rank calibration without compromising predictive accuracy, and that the PCA pre-rank reveals dependence-structure misspecifications that are not detected by existing pre-ranks.
[LG-33] PlatoLTL: Learning to Generalize Across Symbols in LTL Instructions for Multi-Task RL
链接: https://arxiv.org/abs/2601.22891
作者: Jacques Cloete,Mathias Jackermeier,Ioannis Havoutis,Alessandro Abate
类目: Machine Learning (cs.LG)
*备注: 11 pages, 3 figures (main paper). 14 pages, 10 figures (appendix)
Abstract:A central challenge in multi-task reinforcement learning (RL) is to train generalist policies capable of performing tasks not seen during training. To facilitate such generalization, linear temporal logic (LTL) has recently emerged as a powerful formalism for specifying structured, temporally extended tasks to RL agents. While existing approaches to LTL-guided multi-task RL demonstrate successful generalization across LTL specifications, they are unable to generalize to unseen vocabularies of propositions (or “symbols”), which describe high-level events in LTL. We present PlatoLTL, a novel approach that enables policies to zero-shot generalize not only compositionally across LTL formula structures, but also parametrically across propositions. We achieve this by treating propositions as instances of parameterized predicates rather than discrete symbols, allowing policies to learn shared structure across related propositions. We propose a novel architecture that embeds and composes predicates to represent LTL specifications, and demonstrate successful zero-shot generalization to novel propositions and tasks across challenging environments.
[LG-34] Synthetic Time Series Generation via Complex Networks
链接: https://arxiv.org/abs/2601.22879
作者: Jaime Vale,Vanessa Freitas Silva,Maria Eduarda Silva,Fernando Silva
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time series data are essential for a wide range of applications, particularly in developing robust machine learning models. However, access to high-quality datasets is often limited due to privacy concerns, acquisition costs, and labeling challenges. Synthetic time series generation has emerged as a promising solution to address these constraints. In this work, we present a framework for generating synthetic time series by leveraging complex networks mappings. Specifically, we investigate whether time series transformed into Quantile Graphs (QG) – and then reconstructed via inverse mapping – can produce synthetic data that preserve the statistical and structural properties of the original. We evaluate the fidelity and utility of the generated data using both simulated and real-world datasets, and compare our approach against state-of-the-art Generative Adversarial Network (GAN) methods. Results indicate that our quantile graph-based methodology offers a competitive and interpretable alternative for synthetic time series generation.
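The Quantile Graph mapping and a simple inverse can be sketched end to end; the inverse below emits bin midpoints via a Markov walk on the transition graph, which is one common choice and not necessarily the paper's exact reconstruction:

```python
import numpy as np

def quantile_graph(series, n_q=4):
    """Map a series to its Quantile Graph: nodes are quantile bins,
    weighted edges count transitions between consecutive observations."""
    edges = np.quantile(series, np.linspace(0, 1, n_q + 1))
    labels = np.clip(np.searchsorted(edges, series, side="right") - 1,
                     0, n_q - 1)
    trans = np.zeros((n_q, n_q))
    for a, b in zip(labels[:-1], labels[1:]):
        trans[a, b] += 1
    trans /= np.maximum(trans.sum(axis=1, keepdims=True), 1e-12)
    centers = np.array([(edges[i] + edges[i + 1]) / 2 for i in range(n_q)])
    return trans, centers, labels

def generate(trans, centers, start, length, rng):
    """Inverse mapping sketch: random walk on the QG, emitting midpoints."""
    state, out = start, []
    for _ in range(length):
        state = rng.choice(len(centers), p=trans[state])
        out.append(centers[state])
    return np.array(out)

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 20, 500)) + 0.1 * rng.normal(size=500)
trans, centers, labels = quantile_graph(x)
synthetic = generate(trans, centers, start=labels[0], length=200, rng=rng)
```

The synthetic walk reproduces the original series' transition statistics between quantile levels, which is the fidelity criterion the paper evaluates.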
[LG-35] Matterhorn: Efficient Analog Sparse Spiking Transformer Architecture with Masked Time-To-First-Spike Encoding
链接: https://arxiv.org/abs/2601.22876
作者: Zhanglu Yan,Kaiwen Tang,Zixuan Zhu,Zhenyu Bai,Qianhui Liu,Weng-Fai Wong
类目: Machine Learning (cs.LG)
*备注:
Abstract:Spiking neural networks (SNNs) have emerged as a promising candidate for energy-efficient LLM inference. However, current energy evaluations for SNNs primarily focus on counting accumulate operations, and fail to account for real-world hardware costs such as data movement, which can consume nearly 80% of the total energy. In this paper, we propose Matterhorn, a spiking transformer that integrates a novel masked time-to-first-spike (M-TTFS) encoding method to reduce spike movement and a memristive synapse unit (MSU) to eliminate weight access overhead. M-TTFS employs a masking strategy that reassigns the zero-energy silent state (a spike train of all 0s) to the most frequent membrane potential rather than the lowest. This aligns the coding scheme with the data distribution, minimizing spike movement energy without information loss. We further propose a 'dead zone' strategy that maximizes sparsity by mapping all values within a given range to the silent state. At the hardware level, the MSU utilizes compute-in-memory (CIM) technology to perform analog integration directly within memory, effectively removing weight access costs. On the GLUE benchmark, Matterhorn establishes a new state-of-the-art, surpassing existing SNNs by 1.42% in average accuracy while delivering a 2.31 times improvement in energy efficiency.
[LG-36] OptiMAG: Structure-Semantic Alignment via Unbalanced Optimal Transport
链接: https://arxiv.org/abs/2601.22856
作者: Yilong Zuo,Xunkai Li,Zhihan Zhang,Qiangqiang Dai,Ronghua Li,Guoren Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multimodal Attributed Graphs (MAGs) have been widely adopted for modeling complex systems by integrating multi-modal information, such as text and images, on nodes. However, we identify a discrepancy between the implicit semantic structure induced by different modality embeddings and the explicit graph structure. For instance, neighbors in the explicit graph structure may be close in one modality but distant in another. Since existing methods typically perform message passing over the fixed explicit graph structure, they inadvertently aggregate dissimilar features, introducing modality-specific noise and impeding effective node representation learning. To address this, we propose OptiMAG, an Unbalanced Optimal Transport-based regularization framework. OptiMAG employs the Fused Gromov-Wasserstein distance to explicitly guide cross-modal structural consistency within local neighborhoods, effectively mitigating structural-semantic conflicts. Moreover, a KL divergence penalty enables adaptive handling of cross-modal inconsistencies. This framework can be seamlessly integrated into existing multimodal graph models, acting as an effective drop-in regularizer. Experiments demonstrate that OptiMAG consistently outperforms baselines across multiple tasks, ranging from graph-centric tasks (e.g., node classification, link prediction) to multimodal-centric generation tasks (e.g., graph2text, graph2image). The source code will be available upon acceptance.
[LG-37] Hierarchical Shift Mixing – Beyond Dense Attention in Transformers
链接: https://arxiv.org/abs/2601.22852
作者: Robert Forchheimer
类目: Machine Learning (cs.LG)
*备注: 11 pages, 10 pdf figures
Abstract:Since the introduction of the Transformer architecture for large language models, the softmax-based attention layer has faced increasing scrutiny due to its quadratic-time computational complexity. Attempts have been made to replace it with less complex methods, at the cost of reduced performance in most cases. We introduce Hierarchical Shift Mixing (HSM), a general framework for token mixing that distributes pairwise token interactions across Transformer layers rather than computing them densely within each layer. HSM enables linear-time complexity while remaining agnostic to the specific mixing function. We show that even simple HSM variants achieve performance close to softmax attention, and that hybrid architectures combining HSM with softmax attention can outperform a GPT-style Transformer baseline while reducing computational cost during both training and inference.
[LG-38] Unconditional flow-based time series generation with equivariance-regularised latent spaces ICASSP2026
链接: https://arxiv.org/abs/2601.22848
作者: Camilo Carvajal Reyes,Felipe Tobar
类目: Machine Learning (cs.LG)
*备注: Accepted at ICASSP 2026
Abstract:Flow-based models have proven successful for time-series generation, particularly when defined in lower-dimensional latent spaces that enable efficient sampling. However, how to design latent representations with desirable equivariance properties for time-series generative modelling remains underexplored. In this work, we propose a latent flow-matching framework in which equivariance is explicitly encouraged through a simple regularisation of a pre-trained autoencoder. Specifically, we introduce an equivariance loss that enforces consistency between transformed signals and their reconstructions, and use it to fine-tune latent spaces with respect to basic time-series transformations such as translation and amplitude scaling. We show that these equivariance-regularised latent spaces improve generation quality while preserving the computational advantages of latent flow models. Experiments on multiple real-world datasets demonstrate that our approach consistently outperforms existing diffusion-based baselines in standard time-series generation metrics, while achieving orders-of-magnitude faster sampling. These results highlight the practical benefits of incorporating geometric inductive biases into latent generative models for time series.
[LG-39] Cascaded Flow Matching for Heterogeneous Tabular Data with Mixed-Type Features
链接: https://arxiv.org/abs/2601.22816
作者: Markus Mueller,Kathrin Gruber,Dennis Fok
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Advances in generative modeling have recently been adapted to tabular data containing discrete and continuous features. However, generating mixed-type features that combine discrete states with an otherwise continuous distribution in a single feature remains challenging. We advance the state-of-the-art in diffusion models for tabular data with a cascaded approach. We first generate a low-resolution version of a tabular data row, that is, the collection of the purely categorical features and a coarse categorical representation of numerical features. Next, this information is leveraged in the high-resolution flow matching model via a novel guided conditional probability path and data-dependent coupling. The low-resolution representation of numerical features explicitly accounts for discrete outcomes, such as missing or inflated values, and therewith enables a more faithful generation of mixed-type features. We formally prove that this cascade tightens the transport cost bound. The results indicate that our model generates significantly more realistic samples and captures distributional details more accurately, for example, the detection score increases by 40%.
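The low-resolution coarse categorical view of a numerical feature, with explicit missing and inflated states plus quantile bins for the continuous part, can be sketched as follows (the state layout and bin count are illustrative choices, not the paper's):

```python
import numpy as np

def coarse_codes(x, n_bins=4, inflated=0.0):
    """Low-resolution view of a mixed-type numerical feature:
    code 0 = missing, code 1 = inflated (e.g. exactly zero),
    codes 2..(n_bins + 1) = quantile bins of the continuous part."""
    x = np.asarray(x, dtype=float)
    codes = np.empty(len(x), dtype=int)
    is_missing = np.isnan(x)
    is_inflated = ~is_missing & (x == inflated)
    cont = ~is_missing & ~is_inflated
    codes[is_missing] = 0
    codes[is_inflated] = 1
    edges = np.quantile(x[cont], np.linspace(0, 1, n_bins + 1))
    codes[cont] = 2 + np.clip(
        np.searchsorted(edges, x[cont], side="right") - 1, 0, n_bins - 1)
    return codes

x = np.array([np.nan, 0.0, 0.2, 0.7, 1.5, 3.0, np.nan, 0.0, 2.2])
codes = coarse_codes(x)
```

Generating these codes first, then conditioning the high-resolution flow on them, is what lets the cascade treat missing and inflated outcomes as explicit discrete events.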
[LG-40] Quartet II: Accurate LLM Pre-Training in NVFP4 by Improved Unbiased Gradient Estimation
链接: https://arxiv.org/abs/2601.22813
作者: Andrei Panferov,Erik Schultheis,Soroush Tabesh,Dan Alistarh
类目: Machine Learning (cs.LG)
*备注:
Abstract:The NVFP4 lower-precision format, supported in hardware by NVIDIA Blackwell GPUs, promises to allow, for the first time, end-to-end fully-quantized pre-training of massive models such as LLMs. Yet, existing quantized training methods still sacrifice some of the representation capacity of this format in favor of more accurate unbiased quantized gradient estimation by stochastic rounding (SR), losing noticeable accuracy relative to standard FP16 and FP8 training. In this paper, we improve the state of the art for quantized training in NVFP4 via a novel unbiased quantization routine for micro-scaled formats, called MS-EDEN, that has more than 2x lower quantization error than SR. We integrate it into a novel fully-NVFP4 quantization scheme for linear layers, called Quartet II. We show analytically that Quartet II achieves consistently better gradient estimation across all major matrix multiplications, both on the forward and on the backward passes. In addition, our proposal synergizes well with recent training improvements aimed specifically at NVFP4. We further validate Quartet II on end-to-end LLM training with up to 1.9B parameters on 38B tokens. We provide kernels for execution on NVIDIA Blackwell GPUs with up to 4.2x speedup over BF16. Our code is available at this https URL .
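The unbiasedness of stochastic rounding, the baseline MS-EDEN is compared against, is easy to verify in a few lines (standard SR to the integer grid, not MS-EDEN itself):

```python
import numpy as np

def stochastic_round(x, rng):
    """Round down or up at random with probability equal to the
    fractional part, so that E[stochastic_round(x)] = x exactly:
    the unbiasedness that quantized gradient estimators rely on."""
    lo = np.floor(x)
    return lo + (rng.random(np.shape(x)) < (x - lo))

rng = np.random.default_rng(0)
samples = stochastic_round(np.full(200_000, 0.3), rng)
estimate = samples.mean()   # close to 0.3, whereas round(0.3) is always 0
```

Unbiasedness comes at the cost of per-sample variance; MS-EDEN's claimed contribution is keeping the estimator unbiased while cutting that quantization error.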
[LG-41] Clipping-Free Policy Optimization for Large Language Models
链接: https://arxiv.org/abs/2601.22801
作者: Ömer Veysel Çağatan,Barış Akgün,Gözde Gül Şahin,Xuandong Zhao
类目: Machine Learning (cs.LG)
*备注: 23 pages, 10 tables, 8 figures
Abstract:Reinforcement learning has become central to post-training large language models, yet dominant algorithms rely on clipping mechanisms that introduce optimization issues at scale, including zero-gradient regions, reward hacking, and training instability. We propose Clipping-Free Policy Optimization (CFPO), which replaces heuristic clipping with a convex quadratic penalty derived from Total Variation divergence constraints, yielding an everywhere-differentiable objective that enforces stable policy updates without hard boundaries. We evaluate CFPO across both reasoning and alignment settings. In reasoning, CFPO matches clipping-based methods on downstream benchmarks while extending the stable training regime. In alignment, CFPO mitigates verbosity exploitation and reduces capability degradation, while achieving competitive instruction-following performance. CFPO requires only a one-line code change and no additional hyperparameters. Our results suggest that CFPO is a promising drop-in alternative to clipping-based methods for LLM post-training.
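The zero-gradient issue that CFPO targets is easy to see numerically: once the importance ratio leaves the clip interval, the PPO-style surrogate becomes flat, while any convex quadratic penalty retains a restoring gradient. The sketch below is a generic illustration of that contrast; the quadratic form `lam * (ratio - 1)**2` and its coefficient are assumptions, not CFPO's actual TV-divergence-derived objective:

```python
import numpy as np

def clipped_objective(ratio, adv, eps=0.2):
    """PPO-style clipped surrogate: flat (zero gradient) once the ratio
    leaves [1 - eps, 1 + eps] in the direction favored by the advantage."""
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

def quadratic_penalty_objective(ratio, adv, lam=1.0):
    """Illustrative clipping-free surrogate: an everywhere-differentiable
    quadratic penalty pulls the ratio back toward 1 instead of flat-lining."""
    return ratio * adv - lam * (ratio - 1.0) ** 2

def grad_wrt_ratio(f, ratio, adv, h=1e-6):
    """Central finite difference of the surrogate w.r.t. the ratio."""
    return (f(ratio + h, adv) - f(ratio - h, adv)) / (2 * h)

# Well outside the clip interval, the clipped surrogate has no gradient left,
# while the quadratic penalty still pushes the policy back toward ratio = 1.
g_clip = grad_wrt_ratio(clipped_objective, 2.0, adv=1.0)
g_quad = grad_wrt_ratio(quadratic_penalty_objective, 2.0, adv=1.0)
```

The flat region of the clipped surrogate is exactly the "zero-gradient regions" the abstract mentions; a smooth penalty trades it for a continuous restoring force.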
[LG-42] Trackly: A Unified SaaS Platform for User Behavior Analytics and Real-Time Rule-Based Anomaly Detection
链接: https://arxiv.org/abs/2601.22800
作者: Md Zahurul Haque,Md. Hafizur Rahman,Yeahyea Sarker
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Understanding user behavior is essential for improving digital experiences, optimizing business conversions, and mitigating threats like account takeovers, fraud, and bot attacks. Most platforms separate product analytics and security, creating fragmented visibility and delayed threat detection. Trackly, a scalable SaaS platform, unifies comprehensive user behavior analytics with real-time, rule-based anomaly detection. It tracks sessions, IP-based geolocation, device/browser fingerprints, and granular events such as page views, add-to-cart, and checkouts. Suspicious activities, such as logins from new devices or locations, impossible travel (via the Haversine formula), rapid bot-like actions, VPN/proxy usage, or multiple accounts per IP, are flagged via configurable rules with weighted risk scoring, enabling transparent, explainable decisions. A real-time dashboard provides global session maps, DAU/MAU, bounce rates, and session durations. Integration is simplified with a lightweight JavaScript SDK and secure REST APIs. Implemented on a multi-tenant microservices stack (this http URL Core, MongoDB, RabbitMQ, this http URL), Trackly achieved 98.1% accuracy, 97.7% precision, and 2.25% false positives on synthetic datasets, proving its efficiency for SMEs and e-commerce.
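Of the rules listed, only the impossible-travel check names a concrete formula (Haversine). A minimal version of that rule might look as follows; the 900 km/h speed ceiling and the login tuple layout are illustrative assumptions, not Trackly's actual configuration:

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def impossible_travel(prev_login, next_login, max_speed_kmh=900.0):
    """Flag two logins whose implied speed exceeds a commercial-flight ceiling.

    Each login is an illustrative (lat, lon, unix_seconds) tuple.
    """
    dist = haversine_km(prev_login[0], prev_login[1], next_login[0], next_login[1])
    hours = max((next_login[2] - prev_login[2]) / 3600.0, 1e-9)
    return dist / hours > max_speed_kmh

# A London login followed 30 minutes later by one from New York gets flagged.
flagged = impossible_travel((51.5, -0.13, 0), (40.7, -74.0, 1800))
```

In a weighted risk-scoring setup like the one described, a rule such as this would contribute one weighted signal rather than trigger a block on its own.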
[LG-43] Float8@2bits: Entropy Coding Enables Data-Free Model Compression
链接: https://arxiv.org/abs/2601.22787
作者: Patrick Putzky,Martin Genzel,Mattes Mollenhauer,Sebastian Schulze,Thomas Wollmann,Stefan Dietzel
类目: Machine Learning (cs.LG)
*备注:
Abstract:Post-training compression is currently divided into two contrasting regimes. On the one hand, fast, data-free, and model-agnostic methods (e.g., NF4 or HQQ) offer maximum accessibility but suffer from functional collapse at extreme bit-rates below 4 bits. On the other hand, techniques leveraging calibration data or extensive recovery training achieve superior fidelity but impose high computational constraints and face uncertain robustness under data distribution shifts. We introduce EntQuant, the first framework to unite the advantages of these distinct paradigms. By matching the performance of data-dependent methods with the speed and universality of data-free techniques, EntQuant enables practical utility in the extreme compression regime. Our method decouples numerical precision from storage cost via entropy coding, compressing a 70B parameter model in less than 30 minutes. We demonstrate that EntQuant not only achieves state-of-the-art results on standard evaluation sets and models, but also retains functional performance on more complex benchmarks with instruction-tuned models, all at modest inference overhead.
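EntQuant's pipeline is not detailed in the abstract, but the core idea it names, decoupling numerical precision from storage cost via entropy coding, can be illustrated directly: after quantizing to a 16-level (nominally 4-bit) grid, the Shannon entropy of the empirical code distribution bounds the bits per weight an ideal entropy coder needs, and for bell-shaped weight distributions that is well below 4 bits. The grid and the Gaussian weight distribution below are assumptions for illustration:

```python
import numpy as np

def quantize_to_grid(w, n_levels=16):
    """Uniformly quantize weights to n_levels codes (nominal 4-bit storage)."""
    lo, hi = w.min(), w.max()
    return np.round((w - lo) / (hi - lo) * (n_levels - 1)).astype(int)

def entropy_bits_per_symbol(codes):
    """Shannon entropy of the empirical code distribution: the bits/weight an
    ideal entropy coder approaches, regardless of the nominal code width."""
    _, counts = np.unique(codes, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Gaussian weights concentrate on the middle codes, so an entropy coder
# stores them in well under the nominal 4 bits per weight.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, 100_000)
codes = quantize_to_grid(w, n_levels=16)
bits = entropy_bits_per_symbol(codes)
```

This is why a fixed-precision grid can be stored at a much lower effective bit-rate: precision is set by the grid, while storage cost is set by the code distribution's entropy.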
[LG-44] Sparse Attention as Compact Kernel Regression
链接: https://arxiv.org/abs/2601.22766
作者: Saul Santos,Nuno Gonçalves,Daniel C. McNamee,André F.T Martins
类目: Machine Learning (cs.LG)
*备注: 16 pages, 5 figures
Abstract:Recent work has revealed a link between self-attention mechanisms in transformers and test-time kernel regression via the Nadaraya-Watson estimator, with standard softmax attention corresponding to a Gaussian kernel. However, a kernel-theoretic understanding of sparse attention mechanisms is currently missing. In this paper, we establish a formal correspondence between sparse attention and compact (bounded support) kernels. We show that normalized ReLU and sparsemax attention arise from Epanechnikov kernel regression under fixed and adaptive normalizations, respectively. More generally, we demonstrate that widely used kernels in nonparametric density estimation – including Epanechnikov, biweight, and triweight – correspond to \alpha -entmax attention with \alpha = 1 + \frac1n for n \in \mathbbN , while the softmax/Gaussian relationship emerges in the limit n \to \infty . This unified perspective explains how sparsity naturally emerges from kernel design and provides principled alternatives to heuristic top- k attention and other associative memory mechanisms. Experiments with a kernel-regression-based variant of transformers – Memory Mosaics – show that kernel-based sparse attention achieves competitive performance on language modeling, in-context learning, and length generalization tasks, offering a principled framework for designing attention mechanisms.
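The softmax/Gaussian versus sparse/Epanechnikov correspondence described above can be seen in a toy Nadaraya-Watson estimator: with a Gaussian kernel every key receives positive weight, while the Epanechnikov kernel's compact support zeroes out keys beyond the bandwidth, yielding sparse attention weights. The bandwidth and distance-based scoring below are illustrative choices, not the paper's exact parameterization:

```python
import numpy as np

def nw_attention(q, K, V, kernel="gaussian", h=1.0):
    """Nadaraya-Watson estimate for query q: weights are a kernel of the
    query-key distance, normalized to sum to 1 (i.e., attention weights)."""
    d = np.linalg.norm(K - q, axis=1) / h
    if kernel == "gaussian":
        w = np.exp(-0.5 * d ** 2)          # full support: every key weighted
    elif kernel == "epanechnikov":
        w = np.maximum(1.0 - d ** 2, 0.0)  # compact support: distant keys get 0
    w = w / w.sum()
    return w, w @ V

rng = np.random.default_rng(0)
K = rng.normal(size=(8, 4))
V = rng.normal(size=(8, 3))
q = K[0] + 0.01                            # query sitting near the first key
w_gauss, _ = nw_attention(q, K, V, "gaussian")
w_epan, _ = nw_attention(q, K, V, "epanechnikov")
```

The exact zeros of the Epanechnikov weights are the kernel-side counterpart of sparsemax's sparse support, in line with the correspondence the paper formalizes.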
[LG-45] AscendCraft: Automatic Ascend NPU Kernel Generation via DSL-Guided Transcompilation
链接: https://arxiv.org/abs/2601.22760
作者: Zhongzhen Wen,Shudi Shao,Zhong Li,Yu Ge,Tongtong Xu,Yuanyi Lin,Tian Zhang
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF); Software Engineering (cs.SE)
*备注:
Abstract:The performance of deep learning models critically depends on efficient kernel implementations, yet developing high-performance kernels for specialized accelerators remains time-consuming and expertise-intensive. While recent work demonstrates that large language models (LLMs) can generate correct and performant GPU kernels, kernel generation for neural processing units (NPUs) remains largely underexplored due to domain-specific programming models, limited public examples, and sparse documentation. Consequently, directly generating AscendC kernels with LLMs yields extremely low correctness, highlighting a substantial gap between GPU and NPU kernel generation. We present AscendCraft, a DSL-guided approach for automatic AscendC kernel generation. AscendCraft introduces a lightweight DSL that abstracts non-essential complexity while explicitly modeling Ascend-specific execution semantics. Kernels are first generated in the DSL using category-specific expert examples and then transcompiled into AscendC through structured, constraint-driven LLM lowering passes. Evaluated on MultiKernelBench across seven operator categories, AscendCraft achieves 98.1% compilation success and 90.4% functional correctness. Moreover, 46.2% of generated kernels match or exceed PyTorch eager execution performance, demonstrating that DSL-guided transcompilation can enable LLMs to generate both correct and competitive NPU kernels. Beyond benchmarks, AscendCraft further demonstrates its generality by successfully generating two correct kernels for newly proposed mHC architecture, achieving performance that substantially surpasses PyTorch eager execution.
[LG-46] Unveiling Scaling Behaviors in Molecular Language Models: Effects of Model Size, Data, and Representation
链接: https://arxiv.org/abs/2601.22757
作者: Dong Xu,Qihua Pan,Sisi Yuan,Jianqiang Li,Zexuan Zhu,Junkai Ji
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: 34 pages, 51 figures
Abstract:Molecular generative models, often employing GPT-style language modeling on molecular string representations, have shown promising capabilities when scaled to large datasets and model sizes. However, it remains unclear and subject to debate whether these models adhere to predictable scaling laws under fixed computational budgets, which is a crucial understanding for optimally allocating resources between model size, data volume, and molecular representation. In this study, we systematically investigate the scaling behavior of molecular language models across both pretraining and downstream tasks. We train 300 models and conduct over 10,000 experiments, rigorously controlling compute budgets while independently varying model size, number of training tokens, and molecular representation. Our results demonstrate clear scaling laws in molecular models for both pretraining and downstream transfer, reveal the substantial impact of molecular representation on performance, and explain previously observed inconsistencies in scaling behavior for molecular generation. Additionally, we publicly release the largest library of molecular language models to date to facilitate future research and development. Code and models are available at this https URL.
[LG-47] Understanding Generalization from Embedding Dimension and Distributional Convergence
链接: https://arxiv.org/abs/2601.22756
作者: Junjie Yu,Zhuoli Ouyang,Haotian Deng,Chen Wei,Wenxiao Ma,Jianyu Zhang,Zihan Deng,Quanying Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Deep neural networks often generalize well despite heavy over-parameterization, challenging classical parameter-based analyses. We study generalization from a representation-centric perspective and analyze how the geometry of learned embeddings controls predictive performance for a fixed trained model. We show that population risk can be bounded by two factors: (i) the intrinsic dimension of the embedding distribution, which determines the convergence rate of empirical embedding distribution to the population distribution in Wasserstein distance, and (ii) the sensitivity of the downstream mapping from embeddings to predictions, characterized by Lipschitz constants. Together, these yield an embedding-dependent error bound that does not rely on parameter counts or hypothesis class complexity. At the final embedding layer, architectural sensitivity vanishes and the bound is dominated by embedding dimension, explaining its strong empirical correlation with generalization performance. Experiments across architectures and datasets validate the theory and demonstrate the utility of embedding-based diagnostics.
[LG-48] OSNIP: Breaking the Privacy-Utility-Efficiency Trilemma in LLM Inference via Obfuscated Semantic Null Space
链接: https://arxiv.org/abs/2601.22752
作者: Zhiyuan Cao,Zeyu Ma,Chenhao Yang,Han Zheng,Mingang Chen
类目: Machine Learning (cs.LG)
*备注:
Abstract:We propose Obfuscated Semantic Null space Injection for Privacy (OSNIP), a lightweight client-side encryption framework for privacy-preserving LLM inference. Generalizing the geometric intuition of linear kernels to the high-dimensional latent space of LLMs, we formally define the "Obfuscated Semantic Null Space", a high-dimensional regime that preserves semantic fidelity while enforcing near-orthogonality to the original embedding. By injecting perturbations that project the original embedding into this space, OSNIP ensures privacy without any post-processing. Furthermore, OSNIP employs a key-dependent stochastic mapping that synthesizes individualized perturbation trajectories unique to each user. Evaluations on 12 generative and classification benchmarks show that OSNIP achieves state-of-the-art performance, sharply reducing attack success rates while maintaining strong model utility under strict security constraints.
[LG-49] Discovering Scaling Exponents with Physics-Informed Müntz-Szász Networks
链接: https://arxiv.org/abs/2601.22751
作者: Gnankan Landry Regis N’guessan,Bum Jun Kim
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 26 pages, 6 figures
Abstract:Physical systems near singularities, interfaces, and critical points exhibit power-law scaling, yet standard neural networks leave the governing exponents implicit. We introduce physics-informed Müntz-Szász Networks (MSN-PINN), a power-law basis network that treats scaling exponents as trainable parameters. The model outputs both the solution and its scaling structure. We prove identifiability, or unique recovery, and show that, under these conditions, the squared error between learned and true exponents scales as O(|\mu - \alpha|^2) . Across experiments, MSN-PINN achieves single-exponent recovery with 1–5% error under noise and sparse sampling. It recovers corner singularity exponents for the two-dimensional Laplace equation with 0.009% error, matches the classical result of Kondrat'ev (1967), and recovers forcing-induced exponents in singular Poisson problems with 0.03% and 0.05% errors. On a 40-configuration wedge benchmark, it reaches a 100% success rate with 0.022% mean error. Constraint-aware training encodes physical requirements such as boundary condition compatibility and improves accuracy by three orders of magnitude over naive training. By combining the expressiveness of neural networks with the interpretability of asymptotic analysis, MSN-PINN produces learned parameters with direct physical meaning.
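The core ingredient of MSN-PINN, a power-law basis with trainable exponents, can be sketched in a data-only toy: fit u(x) = c * x**mu to samples of sqrt(x) by gradient descent and recover the exponent 0.5 directly as a model parameter. The real model adds PDE residual losses, multiple Müntz terms, and constraint-aware training, none of which are reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.05, 1.0, 200)
y = x ** 0.5                       # ground truth: u(x) = x^0.5

# One-term power-law basis u(x) = c * x**mu with the exponent mu itself
# trainable -- the key Muntz-Szasz idea. Plain gradient descent on the MSE.
c, mu = 1.0, 1.0
lr = 0.1
for _ in range(20000):
    pred = c * x ** mu
    err = pred - y
    grad_c = np.mean(2.0 * err * x ** mu)
    grad_mu = np.mean(2.0 * err * c * x ** mu * np.log(x))
    c -= lr * grad_c
    mu -= lr * grad_mu
```

After training, `mu` itself is the recovered scaling exponent, which is what makes the learned parameters directly interpretable in the asymptotic-analysis sense the abstract describes.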
[LG-50] Is Softmax Loss All You Need? A Principled Analysis of Softmax-family Loss
链接: https://arxiv.org/abs/2601.22745
作者: Yuanhao Pu,Defu Lian,Enhong Chen
类目: Machine Learning (cs.LG)
*备注: 34 pages, 3 figures
Abstract:The Softmax loss is one of the most widely employed surrogate objectives for classification and ranking tasks. To elucidate its theoretical properties, the Fenchel-Young framework situates it as a canonical instance within a broad family of surrogates. Concurrently, another line of research has addressed scalability when the number of classes is exceedingly large, in which numerous approximations have been proposed to retain the benefits of the exact objective while improving efficiency. Building on these two perspectives, we present a principled investigation of the Softmax-family losses. We examine whether different surrogates achieve consistency with classification and ranking metrics, and analyze their gradient dynamics to reveal distinct convergence behaviors. We also introduce a systematic bias-variance decomposition for approximate methods that provides convergence guarantees, and further derive a per-epoch complexity analysis, showing explicit trade-offs between effectiveness and efficiency. Extensive experiments on a representative task demonstrate a strong alignment between consistency, convergence, and empirical performance. Together, these results establish a principled foundation and offer practical guidance for loss selections in large-class machine learning applications.
[LG-51] Local Intrinsic Dimension of Representations Predicts Alignment and Generalization in AI Models and Human Brain
链接: https://arxiv.org/abs/2601.22722
作者: Junjie Yu,Wenxiao Ma,Chen Wei,Jianyu Zhang,Haotian Deng,Zihan Deng,Quanying Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recent work has found that neural networks with stronger generalization tend to exhibit higher representational alignment with one another across architectures and training paradigms. In this work, we show that models with stronger generalization also align more strongly with human neural activity. Moreover, generalization performance, model–model alignment, and model–brain alignment are all significantly correlated with each other. We further show that these relationships can be explained by a single geometric property of learned representations: the local intrinsic dimension of embeddings. Lower local dimension is consistently associated with stronger model–model alignment, stronger model–brain alignment, and better generalization, whereas global dimension measures fail to capture these effects. Finally, we find that increasing model capacity and training data scale systematically reduces local intrinsic dimension, providing a geometric account of the benefits of scaling. Together, our results identify local intrinsic dimension as a unifying descriptor of representational convergence in artificial and biological systems.
[LG-52] Metric Hub: A metric library and practical selection workflow for use-case-driven data quality assessment in medical AI
链接: https://arxiv.org/abs/2601.22702
作者: Katinka Becker,Maximilian P. Oppelt,Tobias S. Zech,Martin Seyferth,Sandie Cabon,Vanja Miskovic,Ivan Cimrak,Michal Kozubek,Giuseppe D’Avenio,Ilaria Campioni,Jana Fehr,Kanjar De,Ismail Mahmoudi,Emilio Dolgener Cantu,Laurenz Ottmann,Andreas Klaß,Galaad Altares,Jackie Ma,Alireza Salehi M.,Nadine R. Lang-Richter,Tobias Schaeffter,Daniel Schwabe
类目: Machine Learning (cs.LG)
*备注:
Abstract:Machine learning (ML) in medicine has transitioned from research to concrete applications aimed at supporting several medical purposes like therapy selection, monitoring and treatment. Acceptance and effective adoption by clinicians and patients, as well as regulatory approval, require evidence of trustworthiness. A major factor for the development of trustworthy AI is the quantification of data quality for AI model training and testing. We have recently proposed the METRIC-framework for systematically evaluating the suitability (fit-for-purpose) of data for medical ML for a given task. Here, we operationalize this theoretical framework by introducing a collection of data quality metrics - the metric library - for practically measuring data quality dimensions. For each metric, we provide a metric card with the most important information, including definition, applicability, examples, pitfalls and recommendations, to support the understanding and implementation of these metrics. Furthermore, we discuss strategies and provide decision trees for choosing an appropriate set of data quality metrics from the metric library given specific use cases. We demonstrate the impact of our approach exemplarily on the PTB-XL ECG-dataset. This is a first step to enable fit-for-purpose evaluation of training and test data in practice as the base for establishing trustworthy AI in medicine.
[LG-53] Full-Graph vs. Mini-Batch Training: Comprehensive Analysis from a Batch Size and Fan-Out Size Perspective
链接: https://arxiv.org/abs/2601.22678
作者: Mengfan Liu,Da Zheng,Junwei Su,Chuan Wu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Full-graph and mini-batch Graph Neural Network (GNN) training approaches have distinct system design demands, making it crucial to choose the appropriate approach to develop. A core challenge in comparing these two GNN training approaches lies in characterizing their model performance (i.e., convergence and generalization) and computational efficiency. While the batch size has been an effective lens in analyzing such behaviors in deep neural networks (DNNs), GNNs extend this lens by introducing a fan-out size, as full-graph training can be viewed as mini-batch training with the largest possible batch size and fan-out size. However, the impact of the batch and fan-out size for GNNs remains insufficiently explored. To this end, this paper systematically compares full-graph vs. mini-batch training of GNNs through empirical and theoretical analyses from the viewpoints of the batch size and fan-out size. Our key contributions include: 1) We provide a novel generalization analysis using the Wasserstein distance to study the impact of the graph structure, especially the fan-out size. 2) We uncover the non-isotropic effects of the batch size and the fan-out size in GNN convergence and generalization, providing practical guidance for tuning these hyperparameters under resource constraints. Finally, full-graph training does not always yield better model performance or computational efficiency than well-tuned smaller mini-batch settings. The implementation can be found at the GitHub link: this https URL.
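The paper's framing, that full-graph training is mini-batch training with the largest possible batch size and fan-out size, can be made concrete with a toy one-hop neighbor sampler: setting `fan_out` to the maximum degree recovers full-neighborhood aggregation. Function and variable names below are illustrative:

```python
import numpy as np

def sample_neighbors(adj, seeds, fan_out, rng):
    """One-hop fan-out sampling: for each seed node, keep at most `fan_out`
    of its neighbors. fan_out >= max degree reduces to full-graph aggregation."""
    sampled = {}
    for v in seeds:
        nbrs = adj[v]
        if fan_out >= len(nbrs):
            sampled[v] = list(nbrs)
        else:
            sampled[v] = list(rng.choice(nbrs, size=fan_out, replace=False))
    return sampled

# Toy graph: node 0 has four neighbors.
adj = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0], 4: [0]}
rng = np.random.default_rng(0)
mini = sample_neighbors(adj, seeds=[0], fan_out=2, rng=rng)
full = sample_neighbors(adj, seeds=[0], fan_out=4, rng=rng)
```

The batch size then corresponds to how many seed nodes enter `seeds` per step, so the two hyperparameters the paper analyzes appear as the two knobs of this single sampler.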
[LG-54] Beyond Fixed Rounds: Data-Free Early Stopping for Practical Federated Learning
链接: https://arxiv.org/abs/2601.22669
作者: Youngjoon Lee,Hyukjoon Lee,Seungrok Jung,Andy Luo,Jinu Gong,Yang Cao,Joonhyuk Kang
类目: Machine Learning (cs.LG)
*备注: 10 pages
Abstract:Federated Learning (FL) facilitates decentralized collaborative learning without transmitting raw data. However, reliance on fixed global rounds or validation data for hyperparameter tuning hinders practical deployment by incurring high computational costs and privacy risks. To address this, we propose a data-free early stopping framework that determines the optimal stopping point by monitoring the task vector’s growth rate using solely server-side parameters. The numerical results on skin lesion/blood cell classification demonstrate that our approach is comparable to validation-based early stopping across various state-of-the-art FL methods. In particular, the proposed framework spends an average of 47/20 (skin lesion/blood cell) rounds to achieve over 12.5%/10.3% higher performance than early stopping based on validation data. To the best of our knowledge, this is the first work to propose an early stopping framework for FL methods without using any validation data.
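The abstract describes monitoring the task vector's growth rate using only server-side parameters; the exact rule is not given. One plausible data-free reading, with illustrative thresholds, is to stop once the norm of (current global parameters minus initialization) stops growing by more than a relative tolerance for a few consecutive rounds:

```python
import numpy as np

class TaskVectorEarlyStopper:
    """Data-free stopping heuristic in the spirit of the paper: track the norm
    of the task vector (global params minus init) and stop once its relative
    per-round growth stays below `tol` for `patience` rounds. The exact rule
    and thresholds here are illustrative, not the paper's."""

    def __init__(self, init_params, tol=0.01, patience=3):
        self.init = np.copy(init_params)
        self.tol, self.patience = tol, patience
        self.prev_norm, self.flat_rounds = None, 0

    def should_stop(self, params):
        norm = float(np.linalg.norm(params - self.init))
        if self.prev_norm is not None and self.prev_norm > 0:
            growth = (norm - self.prev_norm) / self.prev_norm
            self.flat_rounds = self.flat_rounds + 1 if growth < self.tol else 0
        self.prev_norm = norm
        return self.flat_rounds >= self.patience

# Simulated rounds whose task-vector norm saturates, triggering a stop.
w0 = np.zeros(10)
stopper = TaskVectorEarlyStopper(w0)
stopped_at = None
for t in range(1, 30):
    w = (1.0 - 0.5 ** t) * np.ones(10)   # norm saturates toward sqrt(10)
    if stopper.should_stop(w):
        stopped_at = t
        break
```

Note that such a rule needs neither client data nor a validation set, which is the property that distinguishes this line of work from validation-based early stopping.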
[LG-55] Layerwise Progressive Freezing Enables STE-Free Training of Deep Binary Neural Networks
链接: https://arxiv.org/abs/2601.22660
作者: Evan Gibson Smith,Bashima Islam
类目: Machine Learning (cs.LG)
*备注:
Abstract:We investigate progressive freezing as an alternative to straight-through estimators (STE) for training binary networks from scratch. Under controlled training conditions, we find that while global progressive freezing works for binary-weight networks, it fails for full binary neural networks due to activation-induced gradient blockades. We introduce StoMPP (Stochastic Masked Partial Progressive Binarization), which uses layerwise stochastic masking to progressively replace differentiable clipped weights/activations with hard binary step functions, while only backpropagating through the unfrozen (clipped) subset (i.e., no straight-through estimator). Under a matched minimal training recipe, StoMPP improves accuracy over a BinaryConnect-style STE baseline, with gains that increase with depth (e.g., for ResNet-50 BNN: +18.0 on CIFAR-10, +13.5 on CIFAR-100, and +3.8 on ImageNet; for ResNet-18: +3.1, +4.7, and +1.3). For binary-weight networks, StoMPP achieves 91.2% accuracy on CIFAR-10 and 69.5% on CIFAR-100 with ResNet-50. We analyze training dynamics under progressive freezing, revealing non-monotonic convergence and improved depth scaling under binarization constraints.
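The StoMPP mechanism as described, a per-weight stochastic mask that routes frozen weights through a hard binary step and the rest through a differentiable clipped identity (with no straight-through estimator), can be sketched for a single layer. The clip range, the mask schedule, and the returned gradient mask are illustrative assumptions:

```python
import numpy as np

def stomp_forward(w, p_frozen, rng):
    """One StoMPP-style layer transform (illustrative): each weight is routed
    with probability p_frozen through a hard sign (frozen, no gradient) and
    otherwise through a clipped identity (still trainable). Returns the
    effective weights and the gradient mask for the trainable subset."""
    frozen = rng.random(w.shape) < p_frozen
    hard = np.sign(w)
    soft = np.clip(w, -1.0, 1.0)
    w_eff = np.where(frozen, hard, soft)
    grad_mask = (~frozen) & (np.abs(w) < 1.0)  # clipping also blocks |w| >= 1
    return w_eff, grad_mask

rng = np.random.default_rng(0)
w = rng.normal(0, 0.5, size=(4, 8))
# Progressive schedule: early in training almost nothing is frozen;
# by the end, every weight passes through the hard binary step.
w_early, m_early = stomp_forward(w, p_frozen=0.1, rng=rng)
w_late, m_late = stomp_forward(w, p_frozen=1.0, rng=rng)
```

Because gradients flow only through the unfrozen clipped subset, no straight-through estimator is needed; ramping `p_frozen` from 0 to 1 per layer is what makes the binarization progressive.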
[LG-56] Pushing the Boundaries of Natural Reasoning: Interleaved Bonus from Formal-Logic Verification
链接: https://arxiv.org/abs/2601.22642
作者: Chuxue Cao,Jinluan Yang,Haoran Li,Kunhao Pan,Zijian Zhao,Zhengyu Chen,Yuchen Tian,Lijun Wu,Conghui He,Sirui Han,Yike Guo
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models (LLMs) show remarkable capabilities, yet their stochastic next-token prediction creates logical inconsistencies and reward hacking that formal symbolic systems avoid. To bridge this gap, we introduce a formal logic verification-guided framework that dynamically interleaves formal symbolic verification with the natural language generation process, providing real-time feedback to detect and rectify errors as they occur. Distinguished from previous neuro-symbolic methods limited by passive post-hoc validation, our approach actively penalizes intermediate fallacies during the reasoning chain. We operationalize this framework via a novel two-stage training pipeline that synergizes formal logic verification-guided supervised fine-tuning and policy optimization. Extensive evaluation on six benchmarks spanning mathematical, logical, and general reasoning demonstrates that our 7B and 14B models outperform state-of-the-art baselines by average margins of 10.4% and 14.2%, respectively. These results validate that formal verification can serve as a scalable mechanism to significantly push the performance boundaries of advanced LLM reasoning.
[LG-57] Stabilizing Transformer Training Through Consensus
链接: https://arxiv.org/abs/2601.22614
作者: Shyam Venkatasubramanian,Sean Moushegian,Michael Lin,Mir Park,Ankit Singhal,Connor Lee
类目: Machine Learning (cs.LG)
*备注:
Abstract:Standard attention-based transformers are known to exhibit instability under learning rate overspecification during training, particularly at high learning rates. While various methods have been proposed to improve resilience to such overspecification by modifying the optimization procedure, fundamental architectural innovations to this end remain underexplored. In this work, we illustrate that the consensus mechanism, a drop-in replacement for attention, stabilizes transformer training across a wider effective range of learning rates. We formulate consensus as a graphical model and provide extensive empirical analysis demonstrating improved stability across learning rate sweeps on text, DNA, and protein modalities. We further propose a hybrid consensus-attention framework that preserves performance while improving stability. We provide theoretical analysis characterizing the properties of consensus.
[LG-58] Lethe: Adapter-Augmented Dual-Stream Update for Persistent Knowledge Erasure in Federated Unlearning
链接: https://arxiv.org/abs/2601.22601
作者: Hanwei Tan,Wentai Hu,Ligang He,Yijun Quan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Federated unlearning (FU) aims to erase designated client-level, class-level, or sample-level knowledge from a global model. Existing studies commonly assume that the collaboration ends with the unlearning operation, overlooking the follow-up situation where federated training continues over the remaining clients. We identify a critical failure mode, termed Knowledge resurfacing, by revealing that continued training can re-activate unlearned knowledge and cause the removed influence to resurface in the global model. To address this, we propose Lethe, a novel federated unlearning method that de-correlates knowledge to be unlearned from knowledge to be retained, ensuring persistent erasure during continued training. Lethe follows a Reshape–Rectify–Restore pipeline: a temporary adapter is first trained with gradient ascent on the unlearning data to obtain magnified updates, which are then used as corrective signals to derive layer-wise rectification of the remaining updates in two streams. Finally, the adapter is removed and a short recovery stage is performed on the retained data. Our experiments show that Lethe supports unlearning in the federated system at all levels in a unified manner and maintains superior persistence (Resurfacing Rate below 1% in most cases) even after numerous rounds of follow-up training.
[LG-59] Heterogeneous Graph Alignment for Joint Reasoning and Interpretability
链接: https://arxiv.org/abs/2601.22593
作者: Zahra Moslemi,Ziyi Liang,Norbert Fortin,Babak Shahbaba
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multi-graph learning is crucial for extracting meaningful signals from collections of heterogeneous graphs. However, effectively integrating information across graphs with differing topologies, scales, and semantics, often in the absence of shared node identities, remains a significant challenge. We present the Multi-Graph Meta-Transformer (MGMT), a unified, scalable, and interpretable framework for cross-graph learning. MGMT first applies Graph Transformer encoders to each graph, mapping structure and attributes into a shared latent space. It then selects task-relevant supernodes via attention and builds a meta-graph that connects functionally aligned supernodes across graphs using similarity in the latent space. Additional Graph Transformer layers on this meta-graph enable joint reasoning over intra- and inter-graph structure. The meta-graph provides built-in interpretability: supernodes and superedges highlight influential substructures and cross-graph alignments. Evaluating MGMT on both synthetic datasets and real-world neuroscience applications, we show that MGMT consistently outperforms existing state-of-the-art models in graph-level prediction tasks while offering interpretable representations that facilitate scientific discoveries. Our work establishes MGMT as a unified framework for structured multi-graph learning, advancing representation techniques in domains where graph-based data plays a central role.
[LG-60] HetCCL: Accelerating LLM Training with Heterogeneous GPUs
链接: https://arxiv.org/abs/2601.22585
作者: Heehoon Kim,Jaehwan Lee,Taejeoung Kim,Jongwon Park,Jinpyo Kim,Pyongwon Suh,Ryan H. Choi,Sangwoo Lee,Jaejin Lee
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:The rapid growth of large language models is driving organizations to expand their GPU clusters, often with GPUs from multiple vendors. However, current deep learning frameworks lack support for collective communication across heterogeneous GPUs, leading to inefficiency and higher costs. We present HetCCL, a collective communication library that unifies vendor-specific backends and enables RDMA-based communication across GPUs without requiring driver modifications. HetCCL introduces two novel mechanisms that enable cross-vendor communication while leveraging optimized vendor libraries, NVIDIA NCCL and AMD RCCL. Evaluations on a multi-vendor GPU cluster show that HetCCL matches NCCL and RCCL performance in homogeneous setups while uniquely scaling in heterogeneous environments, enabling practical, high-performance training with both NVIDIA and AMD GPUs without changes to existing deep learning applications.
[LG-61] Non-Intrusive Graph-Based Bot Detection for E-Commerce Using Inductive Graph Neural Networks
链接: https://arxiv.org/abs/2601.22579
作者: Sichen Zhao,Zhiming Xue,Yalun Qi,Xianling Zeng,Zihan Yu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Malicious bots pose a growing threat to e-commerce platforms by scraping data, hoarding inventory, and perpetrating fraud. Traditional bot mitigation techniques, including IP blacklists and CAPTCHA-based challenges, are increasingly ineffective or intrusive, as modern bots leverage proxies, botnets, and AI-assisted evasion strategies. This work proposes a non-intrusive graph-based bot detection framework for e-commerce that models user session behavior through a graph representation and applies an inductive graph neural network for classification. The approach captures both relational structure and behavioral semantics, enabling accurate identification of subtle automated activity that evades feature-based methods. Experiments on real-world e-commerce traffic demonstrate that the proposed inductive graph model outperforms a strong session-level multilayer perceptron baseline in terms of AUC and F1 score. Additional adversarial perturbation and cold-start simulations show that the model remains robust under moderate graph modifications and generalizes effectively to previously unseen sessions and URLs. The proposed framework is deployment-friendly, integrates with existing systems without client-side instrumentation, and supports real-time inference and incremental updates, making it suitable for practical e-commerce security deployments.
[LG-62] Exo-Plore: Exploring Exoskeleton Control Space through Human-aligned Simulation ICLR2026
链接: https://arxiv.org/abs/2601.22550
作者: Geonho Leem,Jaedong Lee,Jehee Lee,Seungmoon Song,Jungdam Won
类目: Robotics (cs.RO); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: 10 pages, 9 figures, ICLR 2026 accepted
Abstract:Exoskeletons show great promise for enhancing mobility, but providing appropriate assistance remains challenging due to the complexity of human adaptation to external forces. Current state-of-the-art approaches for optimizing exoskeleton controllers require extensive human experiments in which participants must walk for hours, creating a paradox: those who could benefit most from exoskeleton assistance, such as individuals with mobility impairments, are rarely able to participate in such demanding procedures. We present Exo-plore, a simulation framework that combines neuromechanical simulation with deep reinforcement learning to optimize hip exoskeleton assistance without requiring real human experiments. Exo-plore can (1) generate realistic gait data that captures human adaptation to assistive forces, (2) produce reliable optimization results despite the stochastic nature of human gait, and (3) generalize to pathological gaits, showing strong linear relationships between pathology severity and optimal assistance.
[LG-63] Detect and Act: Automated Dynamic Optimizer through Meta-Black-Box Optimization
链接: https://arxiv.org/abs/2601.22542
作者: Zijian Gao,Yuanting Zhong,Zeyuan Ma,Yue-Jiao Gong,Hongshu Guo
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:
Abstract:Dynamic Optimization Problems (DOPs) are challenging to address due to their complex nature, i.e., dynamic environment variation. Evolutionary Computation methods are generally well-suited to solving DOPs since they resemble dynamic biological evolution. However, existing evolutionary dynamic optimization methods rely heavily on human-crafted adaptive strategies to detect environment variation in DOPs and then adapt the search strategy accordingly. These hand-crafted strategies may perform poorly in out-of-the-box scenarios. In this paper, we propose a reinforcement learning-assisted approach that enables automated variation detection and self-adaptation in evolutionary algorithms. This is achieved by borrowing the bi-level learning-to-optimize idea from recent Meta-Black-Box Optimization works. We use a deep Q-network as an optimization dynamics detector and search strategy adapter: it takes the current-step optimization state as input and then dictates the desired control parameters to the underlying evolutionary algorithm for next-step optimization. The learning objective is to maximize the expected performance gain across a problem distribution. Once trained, our approach can generalize to unseen DOPs with automated environment variation detection and self-adaptation. To facilitate comprehensive validation, we further construct an easy-to-difficult DOP testbed with diverse synthetic instances. Extensive benchmark results demonstrate the flexible search behavior and superior performance of our approach in solving DOPs compared to state-of-the-art baselines.
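As a concrete illustration of the detect-and-act loop described above, the sketch below scores a small grid of candidate control parameters with a Q-network and picks one epsilon-greedily; the Q-network, state encoding, and parameter grid are all illustrative placeholders rather than the paper's actual design:

```python
import numpy as np

def select_control(q_net, opt_state, param_grid, eps=0.1, rng=None):
    """One detect-and-act step: score each candidate control-parameter
    setting for the current optimization state with a Q-network, then
    epsilon-greedily pick the setting handed to the underlying
    evolutionary algorithm for the next optimization step."""
    rng = rng if rng is not None else np.random.default_rng(0)
    if rng.random() < eps:                      # explore: random setting
        return param_grid[rng.integers(len(param_grid))]
    q_values = q_net(opt_state)                 # one Q-value per setting
    return param_grid[int(np.argmax(q_values))]
```

In the full method this selection would run once per generation, with the Q-network trained from the observed performance gain as reward.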
[LG-64] Benchmarking Long Roll-outs of Auto-regressive Neural Operators for the Compressible Navier-Stokes Equations with Conserved Quantity Correction
链接: https://arxiv.org/abs/2601.22541
作者: Sean Current,Chandan Kumar,Datta Gaitonde,Srinivasan Parthasarathy
类目: Machine Learning (cs.LG)
*备注:
Abstract:Deep learning has been proposed as an efficient alternative for the numerical approximation of PDE solutions, offering fast, iterative simulation of PDEs through the approximation of solution operators. However, deep learning solutions have struggled to perform well over long prediction horizons due to the accumulation of auto-regressive error, which is compounded by the models' inability to conserve physical quantities. In this work, we present conserved quantity correction, a model-agnostic technique for incorporating physical conservation criteria within deep learning models. Our results demonstrate consistent improvement in the long-term stability of auto-regressive neural operator models, regardless of the model architecture. Furthermore, we analyze the performance of neural operators in the spectral domain, highlighting significant limitations of present architectures. These results highlight the need for future work to consider architectures that place specific emphasis on high-frequency components, which are integral to the understanding and modeling of turbulent flows.
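A minimal sketch of such a correction, assuming a single conserved scalar (a plain sum standing in for total mass) enforced by an additive shift after each auto-regressive step; the paper's actual projection may differ:

```python
import numpy as np

def conserve_quantity(pred, target_total):
    """Additively shift a predicted field so its discrete integral (a
    plain sum here) matches a conserved reference value."""
    return pred + (target_total - pred.sum()) / pred.size

def rollout(step_fn, u0, n_steps):
    """Auto-regressive rollout with the correction applied after every
    step, so drift in the conserved quantity cannot accumulate."""
    total0 = u0.sum()   # conserved reference from the initial state
    u = u0
    for _ in range(n_steps):
        u = conserve_quantity(step_fn(u), total0)
    return u
```

Even if the learned `step_fn` systematically inflates the field, the rollout's conserved total stays pinned to its initial value.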
[LG-65] Neural-Inspired Posterior Approximation (NIPA)
链接: https://arxiv.org/abs/2601.22539
作者: Babak Shahbaba,Zahra Moslemi
类目: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
*备注: 13 pages, 4 tables
Abstract:Humans learn efficiently from their environment by engaging multiple interacting neural systems that support distinct yet complementary forms of control, including model-based (goal-directed) planning, model-free (habitual) responding, and episodic memory-based learning. Model-based mechanisms compute prospective action values using an internal model of the environment, supporting flexible but computationally costly planning; model-free mechanisms cache value estimates and build heuristics that enable fast, efficient habitual responding; and memory-based mechanisms allow rapid adaptation from individual experience. In this work, we aim to elucidate the computational principles underlying this biological efficiency and translate them into a sampling algorithm for scalable Bayesian inference through effective exploration of the posterior distribution. More specifically, our proposed algorithm comprises three components: a model-based module that uses the target distribution for guided but computationally slow sampling; a model-free module that uses previous samples to learn patterns in the parameter space, enabling fast, reflexive sampling without directly evaluating the expensive target distribution; and an episodic-control module that supports rapid sampling by recalling specific past events (i.e., samples). We show that this approach advances Bayesian methods and facilitates their application to large-scale statistical machine learning problems. In particular, we apply our proposed framework to Bayesian deep learning, with an emphasis on proper and principled uncertainty quantification.
[LG-66] Learning to Defer in Non-Stationary Time Series via Switching State-Space Models
链接: https://arxiv.org/abs/2601.22538
作者: Yannis Montreuil,Letian Yu,Axel Carlier,Lai Xing Ng,Wei Tsang Ooi
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:
Abstract:We study Learning to Defer for non-stationary time series with partial feedback and time-varying expert availability. At each time step, the router selects an available expert, observes the target, and sees only the queried expert’s prediction. We model signed expert residuals using L2D-SLDS, a factorized switching linear-Gaussian state-space model with context-dependent regime transitions, a shared global factor enabling cross-expert information transfer, and per-expert idiosyncratic states. The model supports expert entry and pruning via a dynamic registry. Using one-step-ahead predictive beliefs, we propose an IDS-inspired routing rule that trades off predicted cost against information gained about the latent regime and shared factor. Experiments show improvements over contextual-bandit baselines and a no-shared-factor ablation.
[LG-67] Variational Bayesian Flow Network for Graph Generation
链接: https://arxiv.org/abs/2601.22524
作者: Yida Xiong,Jiameng Chen,Xiuwen Gong,Jia Wu,Shirui Pan,Wenbin Hu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph generation aims to sample discrete node and edge attributes while satisfying coupled structural constraints. Diffusion models for graphs often adopt largely factorized forward-noising, and many flow-matching methods start from factorized reference noise and coordinate-wise interpolation, so node-edge coupling is not encoded by the generative geometry and must be recovered implicitly by the core network, which can be brittle after discrete decoding. Bayesian Flow Networks (BFNs) evolve distribution parameters and naturally support discrete generation. But classical BFNs typically rely on factorized beliefs and independent channels, which limit geometric evidence fusion. We propose Variational Bayesian Flow Network (VBFN), which performs a variational lifting to a tractable joint Gaussian variational belief family governed by structured precisions. Each Bayesian update reduces to solving a symmetric positive definite linear system, enabling coupled node and edge updates within a single fusion step. We construct sample-agnostic sparse precisions from a representation-induced dependency graph, thereby avoiding label leakage while enforcing node-edge consistency. On synthetic and molecular graph datasets, VBFN improves fidelity and diversity, and surpasses baseline methods.
[LG-68] DRL-Enabled Trajectory Planning for UAV-Assisted VLC: Optimal Altitude and Reward Design
链接: https://arxiv.org/abs/2601.22512
作者: Tian-Tian Lin,Yi Liu,Xiao-Wei Tang,Yunmei Shi,Yi Huang,Zhongxiang Wei,Qingqing Wu,Yuhan Dong
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recently, the integration of unmanned aerial vehicle (UAV) and visible light communication (VLC) technologies has emerged as a promising solution to offer flexible communication and efficient lighting. This letter investigates three-dimensional trajectory planning in a UAV-assisted VLC system, where a UAV is dispatched to collect data from ground users (GUs). The core objective is to develop a trajectory planning framework that minimizes UAV flight distance, which is equivalent to maximizing data collection efficiency. This is formulated as a challenging mixed-integer non-convex optimization problem. To tackle it, we first derive a closed-form optimal flight altitude under a given VLC channel gain threshold. Subsequently, we optimize the UAV's horizontal trajectory by integrating a novel pheromone-driven reward mechanism with the twin delayed deep deterministic policy gradient algorithm, which enables an adaptive UAV motion strategy in complex environments. Simulation results validate that the derived optimal altitude effectively reduces the flight distance by up to 35% compared to baseline methods. Additionally, the proposed reward mechanism reduces the number of steps to convergence by approximately 50%, demonstrating notable efficiency gains in the context of UAV-assisted VLC data collection.
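For intuition about how such a closed-form altitude can arise, the sketch below solves the textbook Lambertian line-of-sight channel-gain model H(h) = (m+1)A/(2*pi*h^2) for the altitude that meets a given gain threshold, assuming the UAV sits directly above the user with unit optical gains; the paper's derivation likely includes further terms (incidence angle, optical filter and concentrator gains) omitted here:

```python
import math

def optimal_altitude(gain_threshold, detector_area=1e-4, semi_angle_deg=60.0):
    """Largest altitude h at which the Lambertian LOS VLC channel gain
    directly beneath the UAV, H(h) = (m + 1) * A / (2 * pi * h**2),
    still meets the threshold; solving H(h) = threshold for h gives the
    closed form below. Detector area and LED semi-angle are illustrative."""
    # Lambertian order m from the LED's half-power semi-angle
    m = -math.log(2) / math.log(math.cos(math.radians(semi_angle_deg)))
    return math.sqrt((m + 1) * detector_area / (2 * math.pi * gain_threshold))
```

A tighter gain requirement forces the UAV lower, which is the geometric trade-off the trajectory planner then works around.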
[LG-69] Gradual Fine-Tuning for Flow Matching Models ICML
链接: https://arxiv.org/abs/2601.22495
作者: Gudrun Thorkelsdottir,Arindam Banerjee
类目: Machine Learning (cs.LG)
*备注: Preprint. Submitted to ICML. 8 pages, 5 figures (+ appendix)
Abstract:Fine-tuning flow matching models is a central challenge in settings with limited data, evolving distributions, or strict efficiency demands, where unconstrained fine-tuning can erode the accuracy and efficiency gains learned during pretraining. Prior work has produced theoretical guarantees and empirical advances for reward-based fine-tuning formulations, but these methods often impose restrictions on permissible drift structure or training techniques. In this work, we propose Gradual Fine-Tuning (GFT), a principled framework for fine-tuning flow-based generative models when samples from the target distribution are available. For stochastic flows, GFT defines a temperature-controlled sequence of intermediate objectives that smoothly interpolate between the pretrained and target drifts, approaching the true target as the temperature approaches zero. We prove convergence results for both marginal and conditional GFT objectives, enabling the use of suitable (e.g., optimal transport) couplings during GFT while preserving correctness. Empirically, GFT improves convergence stability and shortens probability paths, resulting in faster inference, while maintaining generation quality comparable to standard fine-tuning. Our results position GFT as a theoretically grounded and practically effective alternative for scalable adaptation of flow matching models under distribution shift.
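A linear, temperature-indexed blend is one simple instance of such an interpolation between drifts; the paper's exact interpolation family and annealing schedule are not given in the abstract, so the form below is an assumption:

```python
import numpy as np

def gft_drift(pretrained_drift, target_drift, x, t, temperature):
    """Temperature-controlled blend of pretrained and target drifts: at
    temperature 1 the pretrained drift is returned unchanged, and as the
    temperature anneals to 0 the blend approaches the target drift,
    giving a sequence of intermediate fine-tuning objectives."""
    tau = float(np.clip(temperature, 0.0, 1.0))
    return tau * pretrained_drift(x, t) + (1.0 - tau) * target_drift(x, t)
```

Fine-tuning against a slowly annealed `temperature` then moves the model smoothly from its pretrained behavior toward the target distribution.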
[LG-70] Elastic Spectral State Space Models for Budgeted Inference
链接: https://arxiv.org/abs/2601.22488
作者: Dachuan Song,Xuan Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Foundation models are typically trained at a fixed computational capacity, while real-world applications require deployment across platforms with different resource constraints. Current approaches usually rely on training families of model variants or model distillation, which requires additional training and supports only a pre-selected set of sizes rather than fine-grained adaptation at runtime. In this paper, we propose Elastic Spectral State Space Models (ES-SSM), which require only one-time training at full capacity, but can be directly truncated into arbitrary scales for budgeted, runtime inference without retraining. Our ES-SSM builds on Hankel spectral filtering over a state space model (SSM), coupled with a lightweight input-adaptive gate trained under randomized spectral budgets. Using a shared masked normalization rule over the ordered spectral channels, we encourage predictive capability to concentrate in low-index components, while higher-index components act primarily as refinement. We test our algorithm across long-sequence benchmarks spanning text, logic, retrieval, vision, and audio. We demonstrate that a single ES-SSM model trained once can be truncated to provide competitive performance compared with modern Transformer and SSM baselines at similar parameter scales. Furthermore, by testing under various runtime budgets, we observe smooth and stable budget-performance curves over a wide range of truncation levels.
[LG-71] Mitigating Cognitive Inertia in Large Reasoning Models via Latent Spike Steering
链接: https://arxiv.org/abs/2601.22484
作者: Seojin Lee,ByeongJeong Kim,Hwanhee Lee
类目: Machine Learning (cs.LG)
*备注: 21 pages, 6 figures
Abstract:While Large Reasoning Models (LRMs) have achieved remarkable performance by scaling test-time compute, they frequently suffer from Cognitive Inertia, a failure pattern manifesting as either overthinking (inertia of motion) or reasoning rigidity (inertia of direction). Existing detection methods, typically relying on superficial textual heuristics like self-correction tokens, often fail to capture the model’s unvoiced internal conflicts. To address this, we propose STARS (Spike-Triggered Adaptive Reasoning Steering), a training-free framework designed to rectify cognitive inertia by monitoring latent dynamics. STARS identifies Cognitive Pivots-critical moments of reasoning transition-by detecting distinct L2 distance spikes in the hidden states. Upon detection, the framework employs geometric trajectory analysis to diagnose the structural nature of the transition and injects state-aware language cues to steer the model in real-time. Our experiments across diverse benchmarks confirm that STARS efficiently curtails redundant loops while improving accuracy through the adaptive correction of erroneous trajectories. STARS offers a robust, unsupervised mechanism to optimize the reasoning process of LRMs without requiring additional fine-tuning.
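The spike-detection step can be sketched as flagging L2 jumps between consecutive hidden states that exceed a running-statistics threshold; the z-score statistic and threshold below are assumptions for illustration, not the paper's exact detector:

```python
import numpy as np

def detect_pivots(hidden_states, z_thresh=3.0):
    """Flag candidate 'cognitive pivots' as spikes in the L2 distance
    between consecutive hidden states: a step is flagged when its
    distance exceeds the mean by z_thresh standard deviations."""
    h = np.asarray(hidden_states, dtype=float)
    dists = np.linalg.norm(np.diff(h, axis=0), axis=1)  # per-step L2 jump
    mu, sigma = dists.mean(), dists.std() + 1e-8
    return np.flatnonzero(dists > mu + z_thresh * sigma)
```

In the full framework, each flagged step would trigger the geometric trajectory analysis and the injection of a state-aware steering cue.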
[LG-72] Transform-Augmented GRPO Improves Pass@k
链接: https://arxiv.org/abs/2601.22478
作者: Khiem Le,Youssef Mroueh,Phuc Nguyen,Chi-Heng Lin,Shangqian Gao,Ting Hua,Nitesh V. Chawla
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models trained via next-token prediction are fundamentally pattern-matchers: sensitive to superficial phrasing variations even when the underlying problem is identical. Group Relative Policy Optimization (GRPO) was designed to improve reasoning, but in fact it worsens this situation through two failure modes: diversity collapse, where training amplifies a single solution strategy while starving alternatives of gradient signal, and gradient diminishing, where a large portion of questions yield zero gradients because all rollouts receive identical rewards. We propose TA-GRPO (Transform-Augmented GRPO), which generates semantically equivalent transformed variants of each question (via paraphrasing, variable renaming, and format changes) and computes advantages by pooling rewards across the entire group. This pooled computation ensures mixed rewards even when the original question is too easy or too hard, while training on diverse phrasings promotes multiple solution strategies. We provide theoretical justification showing that TA-GRPO reduces the zero-gradient probability and improves generalization via reduced train-test distribution shift. Experiments on mathematical reasoning benchmarks show consistent Pass@k improvements, with gains of up to 9.84 points on competition math (AMC12, AIME24) and 5.05 points on out-of-distribution scientific reasoning (GPQA-Diamond).
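The pooled-advantage computation can be sketched directly: rewards from all variants of one question are standardized together, so a variant whose rollouts all agree still receives a nonzero learning signal as long as some other variant disagrees (the variant-generation step itself is not shown):

```python
import numpy as np

def pooled_advantages(rewards_per_variant):
    """GRPO-style advantages computed by pooling rewards over rollouts
    from *all* transformed variants of a question, instead of
    normalizing within each variant separately. Pooling yields nonzero
    advantages whenever any variant's rollouts disagree, even if one
    variant's rollouts all receive the same reward."""
    pooled = np.concatenate([np.asarray(r, dtype=float) for r in rewards_per_variant])
    mu, sigma = pooled.mean(), pooled.std() + 1e-8
    return [(np.asarray(r, dtype=float) - mu) / sigma for r in rewards_per_variant]
```

With per-variant normalization, a variant scored `[1, 1]` would contribute zero gradient; pooled with a harder variant scored `[0, 0]`, both variants get nonzero advantages.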
[LG-73] Continual Policy Distillation from Distributed Reinforcement Learning Teachers
链接: https://arxiv.org/abs/2601.22475
作者: Yuxuan Li,Qijun He,Mingqi Yuan,Wen-Tse Chen,Jeff Schneider,Jiayu Chen
类目: Machine Learning (cs.LG)
*备注: 19 pages (8 pages main text)
Abstract:Continual Reinforcement Learning (CRL) aims to develop lifelong learning agents to continuously acquire knowledge across diverse tasks while mitigating catastrophic forgetting. This requires efficiently managing the stability-plasticity dilemma and leveraging prior experience to rapidly generalize to novel tasks. While various enhancement strategies for both aspects have been proposed, achieving scalable performance by directly applying RL to sequential task streams remains challenging. In this paper, we propose a novel teacher-student framework that decouples CRL into two independent processes: training single-task teacher models through distributed RL and continually distilling them into a central generalist model. This design is motivated by the observation that RL excels at solving single tasks, while policy distillation – a relatively stable supervised learning process – is well aligned with large foundation models and multi-task learning. Moreover, a mixture-of-experts (MoE) architecture and a replay-based approach are employed to enhance the plasticity and stability of the continual policy distillation process. Extensive experiments on the Meta-World benchmark demonstrate that our framework enables efficient continual RL, recovering over 85% of teacher performance while constraining task-wise forgetting to within 10%.
[LG-74] Unrewarded Exploration in Large Language Models Reveals Latent Learning from Psychology
链接: https://arxiv.org/abs/2601.22474
作者: Jian Xiong,Jingbo Zhou,Zihan Zhou,Yixiong Xiao,Le Zhang,Jingyong Ye,Rui Qian,Yang Zhou,Dejing Dou
类目: Machine Learning (cs.LG)
*备注: 17pages, 1 figure
Abstract:Latent learning, classically theorized by Tolman, shows that biological agents (e.g., rats) can acquire internal representations of their environment without rewards, enabling rapid adaptation once rewards are introduced. In contrast, from a cognitive science perspective, reward learning remains overly dependent on external feedback, limiting flexibility and generalization. Although recent advances in the reasoning capabilities of large language models (LLMs), such as OpenAI-o1 and DeepSeek-R1, mark a significant breakthrough, these models still rely primarily on reward-centric reinforcement learning paradigms. Whether and how the well-established phenomenon of latent learning in psychology can inform or emerge within LLMs’ training remains largely unexplored. In this work, we present novel findings from our experiments that LLMs also exhibit the latent learning dynamics. During an initial phase of unrewarded exploration, LLMs display modest performance improvements, as this phase allows LLMs to organize task-relevant knowledge without being constrained by reward-driven biases, and performance is further enhanced once rewards are introduced. LLMs post-trained under this two-stage exploration regime ultimately achieve higher competence than those post-trained with reward-based reinforcement learning throughout. Beyond these empirical observations, we also provide theoretical analyses for our experiments explaining why unrewarded exploration yields performance gains, offering a mechanistic account of these dynamics. Specifically, we conducted extensive experiments across multiple model families and diverse task domains to establish the existence of the latent learning dynamics in LLMs.
[LG-75] EvoEGF-Mol: Evolving Exponential Geodesic Flow for Structure-based Drug Design
链接: https://arxiv.org/abs/2601.22466
作者: Yaowei Jin,Junjie Wang,Cheng Cao,Penglei Wang,Duo An,Qian Shi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Structure-Based Drug Design (SBDD) aims to discover bioactive ligands. Conventional approaches construct probability paths separately in Euclidean and probabilistic spaces for continuous atomic coordinates and discrete chemical categories, leading to a mismatch with the underlying statistical manifolds. We address this issue from an information-geometric perspective by modeling molecules as composite exponential-family distributions and defining generative flows along exponential geodesics under the Fisher-Rao metric. To avoid the instantaneous trajectory collapse induced by geodesics directly targeting Dirac distributions, we propose Evolving Exponential Geodesic Flow for SBDD (EvoEGF-Mol), which replaces static Dirac targets with dynamically concentrating distributions, ensuring stable training via a progressive-parameter-refinement architecture. Our model approaches a reference-level PoseBusters passing rate (93.4%) on CrossDock, demonstrating remarkable geometric precision and interaction fidelity, while outperforming baselines on real-world MolGenBench tasks by recovering bioactive scaffolds and generating candidates that meet established MedChem filters.
[LG-76] Toward Non-Expert Customized Congestion Control
链接: https://arxiv.org/abs/2601.22461
作者: Mingrui Zhang,Hamid Bagheri,Lisong Xu
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: Accepted manuscript (AAM) of IEEE ICC 2025 paper. DOI: https://doi.org/10.1109/ICC52391.2025.11160790
Abstract:General-purpose congestion control algorithms (CCAs) are designed to achieve general congestion control goals, but they may not meet the specific requirements of certain users. Customized CCAs can meet certain users’ specific requirements; however, non-expert users often lack the expertise to implement them. In this paper, we present an exploratory non-expert customized CCA framework, named NECC, which enables non-expert users to easily model, implement, and deploy their customized CCAs by leveraging Large Language Models and the Berkeley Packet Filter (BPF) interface. To the best of our knowledge, we are the first to address the customized CCA implementation problem. Our evaluations using real-world CCAs show that the performance of NECC is very promising, and we discuss the insights that we find and possible future research directions.
[LG-77] Beyond Activation Patterns: A Weight-Based Out-of-Context Explanation of Sparse Autoencoder Features
链接: https://arxiv.org/abs/2601.22447
作者: Yiting Liu,Zhi-Hong Deng
类目: Machine Learning (cs.LG)
*备注:
Abstract:Sparse autoencoders (SAEs) have emerged as a powerful technique for decomposing language model representations into interpretable features. Current interpretation methods infer feature semantics from activation patterns, but overlook that features are trained to reconstruct activations that serve computational roles in the forward pass. We introduce a novel weight-based interpretation framework that measures functional effects through direct weight interactions, requiring no activation data. Through three experiments on Gemma-2 and Llama-3.1 models, we demonstrate that (1) 1/4 of features directly predict output tokens, (2) features actively participate in attention mechanisms with depth-dependent structure, and (3) semantic and non-semantic feature populations exhibit distinct distribution profiles in attention circuits. Our analysis provides the missing out-of-context half of SAE feature interpretability.
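A minimal weight-only probe in this spirit multiplies one SAE feature's decoder direction by the model's unembedding matrix and reads off the most-promoted output tokens; the exact projection used in the paper may differ from this plain logit-lens product, and the matrices in the test are toy stand-ins:

```python
import numpy as np

def feature_output_tokens(W_dec, W_U, feature_idx, top_k=3):
    """Project one SAE feature's decoder direction through the
    unembedding matrix and return the indices of the top promoted
    output tokens. Requires only weights, no activation data.
    W_dec: (n_features, d_model); W_U: (d_model, vocab_size)."""
    logits = W_dec[feature_idx] @ W_U   # (vocab_size,) direct output effect
    return np.argsort(logits)[::-1][:top_k]
```

Features whose top tokens form a coherent set would fall into the "directly predict output tokens" population the paper identifies.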
[LG-78] AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism
链接: https://arxiv.org/abs/2601.22442
作者: Thalaiyasingam Ajanthan,Sameera Ramasinghe,Gil Avraham,Hadi Mohaghegh Dolatabadi,Chamin P Hewa Koneputugodage,Violetta Shevchenko,Yan Zuo,Alexander Long
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:Data and pipeline parallelism are key strategies for scaling neural network training across distributed devices, but their high communication cost necessitates co-located computing clusters with fast interconnects, limiting their scalability. We address this communication bottleneck by introducing asynchronous updates across both parallelism axes, relaxing the co-location requirement at the expense of introducing staleness between pipeline stages and data parallel replicas. To mitigate staleness, for pipeline parallelism, we adopt a weight look-ahead approach, and for data parallelism, we introduce an asynchronous sparse averaging method equipped with an exponential moving average based correction mechanism. We provide convergence guarantees for both sparse averaging and asynchronous updates. Experiments on large-scale language models (up to 1B parameters) demonstrate that our approach matches the performance of the fully synchronous baseline, while significantly reducing communication overhead.
[LG-79] Rethinking Anonymity Claims in Synthetic Data Generation: A Model-Centric Privacy Attack Perspective
链接: https://arxiv.org/abs/2601.22434
作者: Georgi Ganev,Emiliano De Cristofaro
类目: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:
Abstract:Training generative machine learning models to produce synthetic tabular data has become a popular approach for enhancing privacy in data sharing. As this typically involves processing sensitive personal information, releasing either the trained model or generated synthetic datasets can still pose privacy risks. Yet, recent research, commercial deployments, and privacy regulations like the General Data Protection Regulation (GDPR) largely assess anonymity at the level of an individual dataset. In this paper, we rethink anonymity claims about synthetic data from a model-centric perspective and argue that meaningful assessments must account for the capabilities and properties of the underlying generative model and be grounded in state-of-the-art privacy attacks. This perspective better reflects real-world products and deployments, where trained models are often readily accessible for interaction or querying. We interpret the GDPR’s definitions of personal data and anonymization under such access assumptions to identify the types of identifiability risks that must be mitigated and map them to privacy attacks across different threat settings. We then argue that synthetic data techniques alone do not ensure sufficient anonymization. Finally, we compare the two mechanisms most commonly used alongside synthetic data – Differential Privacy (DP) and Similarity-based Privacy Metrics (SBPMs) – and argue that while DP can offer robust protections against identifiability risks, SBPMs lack adequate safeguards. Overall, our work connects regulatory notions of identifiability with model-centric privacy attacks, enabling more responsible and trustworthy regulatory assessment of synthetic data systems by researchers, practitioners, and policymakers. 
[LG-80] MM-OpenFGL: A Comprehensive Benchmark for Multimodal Federated Graph Learning
链接: https://arxiv.org/abs/2601.22416
作者: Xunkai Li,Yuming Ai,Yinlin Zhu,Haodong Lu,Yi Zhang,Guohao Fu,Bowen Fan,Qiangqiang Dai,Rong-Hua Li,Guoren Wang
类目: Machine Learning (cs.LG)
*备注: Under Review
Abstract:Multimodal-attributed graphs (MMAGs) provide a unified framework for modeling complex relational data by integrating heterogeneous modalities with graph structures. While centralized learning has shown promising performance, MMAGs in real-world applications are often distributed across isolated platforms and cannot be shared due to privacy concerns or commercial constraints. Federated graph learning (FGL) offers a natural solution for collaborative training under such settings; however, existing studies largely focus on single-modality graphs and do not adequately address the challenges unique to multimodal federated graph learning (MMFGL). To bridge this gap, we present MM-OpenFGL, the first comprehensive benchmark that systematically formalizes the MMFGL paradigm and enables rigorous evaluation. MM-OpenFGL comprises 19 multimodal datasets spanning 7 application domains, 8 simulation strategies capturing modality and topology variations, 6 downstream tasks, and 57 state-of-the-art methods implemented through a modular API. Extensive experiments investigate MMFGL from the perspectives of necessity, effectiveness, robustness, and efficiency, offering valuable insights for future research on MMFGL.
[LG-81] SAIR: Cost-Efficient Multi-Stage ML Pipeline Autoscaling via In-Context Reinforcement Learning
链接: https://arxiv.org/abs/2601.22397
作者: Jianchang Su,Yifan Zhang,Shengkai Lin,Shizhen Zhao,Yusheng Zheng,Yiwei Yang,Wei Zhang
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:Multi-stage ML inference pipelines are difficult to autoscale due to heterogeneous resources, cross-stage coupling, and dynamic bottleneck migration. We present SAIR, an autoscaling framework that uses an LLM as an in-context reinforcement learning controller, improving its policy online from reward-labeled interaction histories without gradient updates. SAIR combines Pareto-dominance reward shaping with a provable separation margin, surprisal-guided experience retrieval for context efficiency, and fine-grained GPU rate control via user-space CUDA interception. We provide regret analysis decomposing error into retrieval coverage and LLM selection components. On four ML serving pipelines under three workload patterns, SAIR achieves the best or tied-best P99 latency and effective resource cost among deployed baselines, improving P99 by up to 50% and reducing effective cost by up to 97% (under GPU rate-control assumptions), with 86% bottleneck detection accuracy and no offline training.
[LG-82] Purely Agentic Black-Box Optimization for Biological Design
链接: https://arxiv.org/abs/2601.22382
作者: Natalie Maus,Yimeng Zeng,Haydn Thomas Jones,Yining Huang,Gaurav Ng Goel,Alden Rose,Kyurae Kim,Hyun-Su Lee,Marcelo Der Torossian Torres,Fangping Wan,Cesar de la Fuente-Nunez,Mark Yatskar,Osbert Bastani,Jacob R. Gardner
类目: Machine Learning (cs.LG)
*备注:
Abstract:Many key challenges in biological design-such as small-molecule drug discovery, antimicrobial peptide development, and protein engineering-can be framed as black-box optimization over vast, complex structured spaces. Existing methods rely mainly on raw structural data and struggle to exploit the rich scientific literature. While large language models (LLMs) have been added to these pipelines, they have been confined to narrow roles within structure-centered optimizers. We instead cast biological black-box optimization as a fully agentic, language-based reasoning process. We introduce Purely Agentic BLack-box Optimization (PABLO), a hierarchical agentic system that uses scientific LLMs pretrained on chemistry and biology literature to generate and iteratively refine biological candidates. On both the standard GuacaMol molecular design and antimicrobial peptide optimization tasks, PABLO achieves state-of-the-art performance, substantially improving sample efficiency and final objective values over established baselines. Compared to prior optimization methods that incorporate LLMs, PABLO achieves competitive token usage per run despite relying on LLMs throughout the optimization loop. Beyond raw performance, the agentic formulation offers key advantages for realistic design: it naturally incorporates semantic task descriptions, retrieval-augmented domain knowledge, and complex constraints. In follow-up in vitro validation, PABLO-optimized peptides showed strong activity against drug-resistant pathogens, underscoring the practical potential of PABLO for therapeutic discovery.
[LG-83] FIRE: Multi-fidelity Regression with Distribution-conditioned In-context Learning using Tabular Foundation Models
链接: https://arxiv.org/abs/2601.22371
作者: Rosen Ting-Ying Yu,Nicholas Sung,Faez Ahmed
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multi-fidelity (MF) regression often operates in regimes of extreme data imbalance, where the commonly used Gaussian-process (GP) surrogates struggle with cubic scaling costs and overfit to sparse high-fidelity observations, limiting efficiency and generalization in real-world applications. We introduce FIRE, a training-free MF framework that couples tabular foundation models (TFMs) to perform zero-shot in-context Bayesian inference via a high-fidelity correction model conditioned on the low-fidelity model’s posterior predictive distributions. This cross-fidelity information transfer via distributional summaries captures heteroscedastic errors, enabling robust residual learning without model retraining. Across 31 benchmark problems spanning synthetic and real-world tasks (e.g., DrivAerNet, LCBench), FIRE delivers a stronger performance-time trade-off than seven state-of-the-art GP-based or deep learning MF regression methods, ranking highest in accuracy and uncertainty quantification with runtime advantages. Limitations include context window constraints and dependence on the quality of the pre-trained TFMs.
[LG-84] Towards Solving the Gilbert-Pollak Conjecture via Large Language Models
链接: https://arxiv.org/abs/2601.22365
作者: Yisi Ke,Tianyu Huang,Yankai Shu,Di He,Jingchu Gai,Liwei Wang
类目: Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
*备注: 44 pages, 11 figures
Abstract:The Gilbert-Pollak Conjecture (Gilbert and Pollak, 1968), also known as the Steiner Ratio Conjecture, states that for any finite point set in the Euclidean plane, the Steiner minimum tree has length at least √3/2 ≈ 0.866 times that of the Euclidean minimum spanning tree (the Steiner ratio). A sequence of improvements through the 1980s culminated in a lower bound of 0.824, with no substantial progress reported over the past three decades. Recent advances in LLMs have demonstrated strong performance on contest-level mathematical problems, yet their potential for addressing open, research-level questions remains largely unexplored. In this work, we present a novel AI system for obtaining tighter lower bounds on the Steiner ratio. Rather than directly prompting LLMs to solve the conjecture, we task them with generating rule-constrained geometric lemmas implemented as executable code. These lemmas are then used to construct a collection of specialized functions, which we call verification functions, that yield theoretically certified lower bounds of the Steiner ratio. Through progressive lemma refinement driven by reflection, the system establishes a new certified lower bound of 0.8559 for the Steiner ratio. The entire research effort involves only thousands of LLM calls, demonstrating the strong potential of LLM-based systems for advanced mathematical research.
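The conjectured extremal configuration is easy to check numerically; a quick sketch using standard geometry (independent of the paper's LLM pipeline):

```python
import math

# For an equilateral triangle, the Steiner minimum tree routes all three
# vertices through the Fermat point (the centroid, by symmetry), and the
# Steiner ratio equals exactly sqrt(3)/2 — the conjectured worst case.
pts = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]

mst_len = 2.0  # minimum spanning tree: any two unit-length sides

cx = sum(p[0] for p in pts) / 3
cy = sum(p[1] for p in pts) / 3
smt_len = sum(math.hypot(px - cx, py - cy) for px, py in pts)

steiner_ratio = smt_len / mst_len
print(steiner_ratio)  # 0.866..., i.e. sqrt(3)/2
```

Certifying a lower bound over all point sets, as the paper does, is of course far harder than checking one configuration.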
[LG-85] Understanding Efficiency: Quantization Batching and Serving Strategies in LLM Energy Use
链接: https://arxiv.org/abs/2601.22362
作者: Julien Delavande,Regis Pierrard,Sasha Luccioni
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models (LLMs) are increasingly deployed in production, contributing towards shifting the burden in terms of computational resources and energy demands from training to inference. While prior work has examined the energy cost of inference per prompt or per token, we highlight how system-level design choices - such as numerical precision, batching strategy, and request scheduling - can lead to orders-of-magnitude differences in energy consumption for the same model. We perform a detailed empirical study of LLM inference energy and latency on NVIDIA H100 GPUs, analyzing the impact of quantization, batch size, and serving configuration (e.g., with Hugging Face’s Text Generation Inference server). Our results reveal that lower-precision formats only yield energy gains in compute-bound regimes; that batching improves energy efficiency, especially in memory-bound phases like decoding; and that structured request timing (arrival shaping) can reduce per-request energy by up to 100 times. We argue that sustainable LLM deployment depends not only on model internals, but also on the orchestration of the serving stack. Our findings motivate phase-aware energy profiling and system-level optimizations for greener AI services.
[LG-86] Small Talk Big Impact: The Energy Cost of Thanking AI
链接: https://arxiv.org/abs/2601.22357
作者: Julien Delavande,Regis Pierrard,Sasha Luccioni
类目: Machine Learning (cs.LG)
*备注:
Abstract:Being polite is free - or is it? In this paper, we quantify the energy cost of seemingly innocuous messages such as "thank you", often used to convey politeness when interacting with large language models. Using real-world conversation traces and fine-grained energy measurements, we quantify how input length, output length and model size affect energy use. While politeness is our motivating example, it also serves as a controlled and reproducible proxy for measuring the energy footprint of a typical LLM interaction. Our findings provide actionable insights for building more sustainable and efficient LLM applications, especially in increasingly widespread real-world contexts like chat. As user adoption grows and billions of prompts are processed daily, understanding and mitigating this cost becomes crucial - not just for efficiency, but for sustainable AI deployment.
[LG-87] PoSafeNet: Safe Learning with Poset-Structured Neural Nets
链接: https://arxiv.org/abs/2601.22356
作者: Kiwan Wong,Wei Xiao,Daniela Rus
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
Abstract:Safe learning is essential for deploying learning-based controllers in safety-critical robotic systems, yet existing approaches often enforce multiple safety constraints uniformly or via fixed priority orders, leading to infeasibility and brittle behavior. In practice, safety requirements are heterogeneous and admit only partial priority relations, where some constraints are comparable while others are inherently incomparable. We formalize this setting as poset-structured safety, modeling safety constraints as a partially ordered set and treating safety composition as a structural property of the policy class. Building on this formulation, we propose PoSafeNet, a differentiable neural safety layer that enforces safety via sequential closed-form projection under poset-consistent constraint orderings, enabling adaptive selection or mixing of valid safety executions while preserving priority semantics by construction. Experiments on multi-obstacle navigation, constrained robot manipulation, and vision-based autonomous driving demonstrate improved feasibility, robustness, and scalability over unstructured and differentiable quadratic program-based safety layers.
[LG-88] Relative Wasserstein Angle and the Problem of the W_2-Nearest Gaussian Distribution
链接: https://arxiv.org/abs/2601.22355
作者: Binshuai Wang,Peng Wei
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study the problem of quantifying how far an empirical distribution deviates from Gaussianity under the framework of optimal transport. By exploiting the cone geometry of the relative translation invariant quadratic Wasserstein space, we introduce two novel geometric quantities, the relative Wasserstein angle and the orthogonal projection distance, which provide meaningful measures of non-Gaussianity. We prove that the filling cone generated by any two rays in this space is flat, ensuring that angles, projections, and inner products are rigorously well-defined. This geometric viewpoint recasts Gaussian approximation as a projection problem onto the Gaussian cone and reveals that the commonly used moment-matching Gaussian cannot be the W_2-nearest Gaussian for a given empirical distribution. In one dimension, we derive closed-form expressions for the proposed quantities and extend them to several classical distribution families, including uniform, Laplace, and logistic distributions; while in high dimensions, we develop an efficient stochastic manifold optimization algorithm based on a semi-discrete dual formulation. Experiments on synthetic data and real-world feature distributions demonstrate that the relative Wasserstein angle is more robust than the Wasserstein distance and that the proposed nearest Gaussian provides a better approximation than moment matching in the evaluation of Fréchet Inception Distance (FID) scores.
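In one dimension, the quadratic Wasserstein distance has a closed form via quantile functions, which makes the moment-matching observation easy to reproduce; a rough sketch (our illustration, with a crude grid search over sigma standing in for the paper's optimization):

```python
import math
import random
from statistics import NormalDist, mean, stdev

random.seed(0)
# Empirical sample from a visibly non-Gaussian (exponential) distribution.
x = sorted(random.expovariate(1.0) for _ in range(5000))

def w2_to_gaussian(xs, mu, sigma):
    """1D quadratic Wasserstein distance via quantile matching:
    W2^2 = integral over u in (0,1) of (F^{-1}(u) - G^{-1}(u))^2 du."""
    g = NormalDist(mu, sigma)
    n = len(xs)
    q = (g.inv_cdf((i + 0.5) / n) for i in range(n))  # Gaussian quantiles
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xs, q)) / n)

mu0, s0 = mean(x), stdev(x)
moment_matched = w2_to_gaussian(x, mu0, s0)
# Crude grid over sigma (including s0 itself): the W2-nearest Gaussian
# need not be the moment-matched one, echoing the paper's observation.
nearest = min(w2_to_gaussian(x, mu0, s0 * k / 10) for k in range(5, 16))
print(moment_matched, nearest)
```

For the exponential sample, a slightly smaller sigma than the sample standard deviation gives a strictly smaller W2 distance.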
[LG-89] Failing to Explore: Language Models on Interactive Tasks
链接: https://arxiv.org/abs/2601.22345
作者: Mahdi JafariRaviz,Keivan Rezaei,Arshia Soltani Moakhar,Zahra Sodagar,Yize Cheng,Soheil Feizi
类目: Machine Learning (cs.LG)
*备注:
Abstract:We evaluate language models on their ability to explore interactive environments under a limited interaction budget. We introduce three parametric tasks with controllable exploration difficulty, spanning continuous and discrete environments. Across state-of-the-art models, we find systematic under-exploration and suboptimal solutions, with performance often significantly worse than simple explore–exploit heuristic baselines and scaling weakly as the budget increases. Finally, we study two lightweight interventions: splitting a fixed budget into parallel executions, which surprisingly improves performance despite a no-gain theoretical result for our tasks, and periodically summarizing the interaction history, which preserves key discoveries and further improves exploration.
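A minimal explore-then-exploit heuristic of the kind the paper uses as a baseline can be sketched on a two-armed bandit (toy setting of our own, not one of the paper's three tasks):

```python
import random

random.seed(1)

# Two-armed Bernoulli bandit with a limited interaction budget.
means = [0.3, 0.7]

def pull(arm):
    return 1.0 if random.random() < means[arm] else 0.0

def explore_then_exploit(budget, explore_frac=0.3):
    """Spend a fraction of the budget exploring uniformly, then commit
    to the empirically best arm for the remainder."""
    per_arm = int(budget * explore_frac) // len(means)
    total, est = 0.0, []
    for arm in range(len(means)):
        rewards = [pull(arm) for _ in range(per_arm)]
        total += sum(rewards)
        est.append(sum(rewards) / per_arm)
    best = est.index(max(est))
    for _ in range(budget - per_arm * len(means)):
        total += pull(best)
    return total / budget

avg_reward = explore_then_exploit(1000)
print(avg_reward)  # well above the 0.5 of uniform random play
```

The paper's finding is that language-model agents often fail to match even this kind of simple budgeted heuristic.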
[LG-90] Quantum-Inspired Reinforcement Learning for Secure and Sustainable AIoT-Driven Supply Chain Systems
链接: https://arxiv.org/abs/2601.22339
作者: Muhammad Bilal Akram Dastagir,Omer Tariq,Shahid Mumtaz,Saif Al-Kuwari,Ahmed Farouk
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:
Abstract:Modern supply chains must balance high-speed logistics with environmental impact and security constraints, prompting a surge of interest in AI-enabled Internet of Things (AIoT) solutions for global commerce. However, conventional supply chain optimization models often overlook crucial sustainability goals and cyber vulnerabilities, leaving systems susceptible to both ecological harm and malicious attacks. To tackle these challenges simultaneously, this work integrates a quantum-inspired reinforcement learning framework that unifies carbon footprint reduction, inventory management, and cryptographic-like security measures. We design a quantum-inspired reinforcement learning framework that couples a controllable spin-chain analogy with real-time AIoT signals and optimizes a multi-objective reward unifying fidelity, security, and carbon costs. The approach learns robust policies with stabilized training via value-based and ensemble updates, supported by window-normalized reward components to ensure commensurate scaling. In simulation, the method exhibits smooth convergence, strong late-episode performance, and graceful degradation under representative noise channels, outperforming standard learned and model-based references, highlighting its robust handling of real-time sustainability and risk demands. These findings reinforce the potential for quantum-inspired AIoT frameworks to drive secure, eco-conscious supply chain operations at scale, laying the groundwork for globally connected infrastructures that responsibly meet both consumer and environmental needs.
[LG-91] Knowledge Gradient for Preference Learning
链接: https://arxiv.org/abs/2601.22335
作者: Kaiwen Wu,Jacob R. Gardner
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:The knowledge gradient is a popular acquisition function in Bayesian optimization (BO) for optimizing black-box objectives with noisy function evaluations. Many practical settings, however, allow only pairwise comparison queries, yielding a preferential BO problem where direct function evaluations are unavailable. Extending the knowledge gradient to preferential BO is hindered by computational challenges. At its core, the look-ahead step in the preferential setting requires computing a non-Gaussian posterior, which was previously considered intractable. In this paper, we address this challenge by deriving an exact and analytical knowledge gradient for preferential BO. We show that the exact knowledge gradient performs strongly on a suite of benchmark problems, often outperforming existing acquisition functions. In addition, we present a case study illustrating the limitation of the knowledge gradient in certain scenarios.
[LG-92] DP-λCGD: Efficient Noise Correlation for Differentially Private Model Training
链接: https://arxiv.org/abs/2601.22334
作者: Nikita P. Kalinin,Ryan McKenna,Rasmus Pagh,Christoph H. Lampert
类目: Machine Learning (cs.LG)
*备注:
Abstract:Differentially private stochastic gradient descent (DP-SGD) is the gold standard for training machine learning models with formal differential privacy guarantees. Several recent extensions improve its accuracy by introducing correlated noise across training iterations. Matrix factorization mechanisms are a prominent example, but they correlate noise across many iterations and require storing previously added noise vectors, leading to substantial memory overhead in some settings. In this work, we propose a new noise correlation strategy that correlates noise only with the immediately preceding iteration and cancels a controlled portion of it. Our method relies on noise regeneration using a pseudorandom noise generator, eliminating the need to store past noise. As a result, it requires no additional memory beyond standard DP-SGD. We show that the computational overhead is minimal and empirically demonstrate improved accuracy over DP-SGD.
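The storage trick can be sketched in a few lines (our reading of the abstract, not the authors' code): correlate each step's noise only with the previous step's, and regenerate the previous noise from its seed instead of storing it:

```python
import random

# Illustrative sketch: the noise injected at step t is z_t - lam * z_{t-1},
# where each z is produced by a seeded pseudorandom generator so that
# z_{t-1} can be regenerated on demand rather than kept in memory.
def fresh_noise(seed, dim):
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def correlated_noise(step, lam, dim, base_seed=12345):
    z_t = fresh_noise(base_seed + step, dim)
    if step == 0:
        return z_t
    z_prev = fresh_noise(base_seed + step - 1, dim)  # regenerated, not stored
    return [a - lam * b for a, b in zip(z_t, z_prev)]

# Regeneration is deterministic, so repeated calls agree exactly.
print(correlated_noise(3, 0.5, 4) == correlated_noise(3, 0.5, 4))  # True
```

The value of lam (and its privacy accounting) is the substance of the paper; this sketch only shows why the memory footprint matches plain DP-SGD.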
[LG-93] Scalable Batch Correction for Cell Painting via Batch-Dependent Kernels and Adaptive Sampling
链接: https://arxiv.org/abs/2601.22331
作者: Aditya Narayan Ravi,Snehal Vadvalkar,Abhishek Pandey,Ilan Shomorony
类目: Machine Learning (cs.LG); Computation (stat.CO)
*备注: 40 pages, many figures
Abstract:Cell Painting is a microscopy-based, high-content imaging assay that produces rich morphological profiles of cells and can support drug discovery by quantifying cellular responses to chemical perturbations. At scale, however, Cell Painting data is strongly affected by batch effects arising from differences in laboratories, instruments, and protocols, which can obscure biological signal. We present BALANS (Batch Alignment via Local Affinities and Subsampling), a scalable batch-correction method that aligns samples across batches by constructing a smoothed affinity matrix from pairwise distances. Given n data points, BALANS builds a sparse affinity matrix A ∈ ℝ^{n×n} using two ideas. (i) For points i and j, it sets a local scale using the distance from i to its k-th nearest neighbor within the batch of j, then computes A_ij via a Gaussian kernel calibrated by these batch-aware local scales. (ii) Rather than forming all n^2 entries, BALANS uses an adaptive sampling procedure that prioritizes rows with low cumulative neighbor coverage and retains only the strongest affinities per row, yielding a sparse but informative approximation of A. We prove that this sampling strategy is order-optimal in sample complexity and provides an approximation guarantee, and we show that BALANS runs in nearly linear time in n. Experiments on diverse real-world Cell Painting datasets and controlled large-scale synthetic benchmarks demonstrate that BALANS scales to large collections while improving runtime over native implementations of widely used batch-correction methods, without sacrificing correction quality.
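Idea (i) can be sketched directly; note that the function names, the default k, and the way the two local scales are combined are our guesses from the abstract, not the paper's exact formulas:

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def local_scale(point, batch, k):
    """Distance from `point` to its k-th nearest neighbor within `batch`."""
    d = sorted(dist(point, p) for p in batch)
    return d[min(k, len(d) - 1)]

def affinity(p_i, p_j, batch_i, batch_j, k=2):
    """Batch-aware Gaussian affinity: each point's scale is measured in
    the *other* point's batch, so cross-batch shifts inflate the scale
    rather than killing the affinity."""
    s_ij = local_scale(p_i, batch_j, k)
    s_ji = local_scale(p_j, batch_i, k)
    return math.exp(-dist(p_i, p_j) ** 2 / (s_ij * s_ji + 1e-12))

batch_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
batch_b = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]  # systematically shifted batch
near = affinity(batch_a[0], batch_a[1], batch_a, batch_a)
far = affinity(batch_a[0], batch_b[0], batch_a, batch_b)
print(near, far)  # cross-batch affinity stays non-negligible
```

Because the scale for a cross-batch pair is measured against the shifted batch, a uniform batch offset is partly absorbed instead of zeroing out all cross-batch affinities.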
[LG-94] Knowledge-Informed Kernel State Reconstruction for Interpretable Dynamical System Discovery
链接: https://arxiv.org/abs/2601.22328
作者: Luca Muscarnera,Silas Ruhrberg Estévez,Samuel Holt,Evgeny Saveliev,Mihaela van der Schaar
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recovering governing equations from data is central to scientific discovery, yet existing methods often break down under noisy, partial observations, or rely on black-box latent dynamics that obscure mechanism. We introduce MAAT (Model Aware Approximation of Trajectories), a framework for symbolic discovery built on knowledge-informed Kernel State Reconstruction. MAAT formulates state reconstruction in a reproducing kernel Hilbert space and directly incorporates structural and semantic priors such as non-negativity, conservation laws, and domain-specific observation models into the reconstruction objective, while accommodating heterogeneous sampling and measurement granularity. This yields smooth, physically consistent state estimates with analytic time derivatives, providing a principled interface between fragmented sensor data and symbolic regression. Across twelve diverse scientific benchmarks and multiple noise regimes, MAAT substantially reduces state-estimation MSE for trajectories and derivatives used by downstream symbolic regression relative to strong baselines.
[LG-95] Molecular Representations in Implicit Functional Space via Hyper-Networks
链接: https://arxiv.org/abs/2601.22327
作者: Zehong Wang,Xiaolong Han,Qi Yang,Xiangru Tang,Fang Wu,Xiaoguang Guo,Weixiang Sun,Tianyi Ma,Pietro Lio,Le Cong,Sheng Wang,Chuxu Zhang,Yanfang Ye
类目: Machine Learning (cs.LG)
*备注:
Abstract:Molecular representations fundamentally shape how machine learning systems reason about molecular structure and physical properties. Most existing approaches adopt a discrete pipeline: molecules are encoded as sequences, graphs, or point clouds, mapped to fixed-dimensional embeddings, and then used for task-specific prediction. This paradigm treats molecules as discrete objects, despite their intrinsically continuous and field-like physical nature. We argue that molecular learning can instead be formulated as learning in function space. Specifically, we model each molecule as a continuous function over three-dimensional (3D) space and treat this molecular field as the primary object of representation. From this perspective, conventional molecular representations arise as particular sampling schemes of an underlying continuous object. We instantiate this formulation with MolField, a hyper-network-based framework that learns distributions over molecular fields. To ensure physical consistency, these functions are defined over canonicalized coordinates, yielding invariance to global SE(3) transformations. To enable learning directly over functions, we introduce a structured weight tokenization and train a sequence-based hyper-network to model a shared prior over molecular fields. We evaluate MolField on molecular dynamics and property prediction. Our results show that treating molecules as continuous functions fundamentally changes how molecular representations generalize across tasks and yields downstream behavior that is stable to how molecules are discretized or queried.
[LG-96] Label-Efficient Monitoring of Classification Models via Stratified Importance Sampling
链接: https://arxiv.org/abs/2601.22326
作者: Lupo Marsigli,Angel Lopez de Haro
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: 24 pages
Abstract:Monitoring the performance of classification models in production is critical yet challenging due to strict labeling budgets, one-shot batch acquisition of labels and extremely low error rates. We propose a general framework based on Stratified Importance Sampling (SIS) that directly addresses these constraints in model monitoring. While SIS has previously been applied in specialized domains, our theoretical analysis establishes its broad applicability to the monitoring of classification models. Under mild conditions, SIS yields unbiased estimators with strict finite-sample mean squared error (MSE) improvements over both importance sampling (IS) and stratified random sampling (SRS). The framework does not rely on optimally defined proposal distributions or strata: even with noisy proxies and sub-optimal stratification, SIS can improve estimator efficiency compared to IS or SRS individually, though extreme proposal mismatch may limit these gains. Experiments across binary and multiclass tasks demonstrate consistent efficiency improvements under fixed label budgets, underscoring SIS as a principled, label-efficient, and operationally lightweight methodology for post-deployment model monitoring.
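The stratified half of the estimator is easy to illustrate (synthetic strata and error rates below; the full SIS method additionally importance-weights a proposal distribution inside each stratum):

```python
import random

random.seed(0)

# Toy monitoring problem: estimate a deployed classifier's error rate
# under a small labeling budget, stratifying by model confidence.
strata = [           # (population size, true error rate) per stratum
    (9000, 0.01),    # high-confidence predictions: rare errors
    (1000, 0.20),    # low-confidence predictions: frequent errors
]
N = sum(n for n, _ in strata)

def stratified_estimate(budget):
    est = 0.0
    for n, p in strata:
        m = max(1, budget * n // N)                   # proportional allocation
        errors = sum(random.random() < p for _ in range(m))
        est += (n / N) * errors / m                   # population-weighted
    return est

avg = sum(stratified_estimate(500) for _ in range(200)) / 200
true_rate = sum(n * p for n, p in strata) / N
print(round(avg, 3), true_rate)
```

Proportional allocation is only a baseline; the paper's point is that even noisy proxies and sub-optimal stratification can still beat plain random sampling under a fixed budget.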
[LG-97] AgentScore: Autoformulation of Deployable Clinical Scoring Systems
链接: https://arxiv.org/abs/2601.22324
作者: Silas Ruhrberg Estévez,Christopher Chiu,Mihaela van der Schaar
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:
Abstract:Modern clinical practice relies on evidence-based guidelines implemented as compact scoring systems composed of a small number of interpretable decision rules. While machine-learning models achieve strong performance, many fail to translate into routine clinical use due to misalignment with workflow constraints such as memorability, auditability, and bedside execution. We argue that this gap arises not from insufficient predictive power, but from optimizing over model classes that are incompatible with guideline deployment. Deployable guidelines often take the form of unit-weighted clinical checklists, formed by thresholding the sum of binary rules, but learning such scores requires searching an exponentially large discrete space of possible rule sets. We introduce AgentScore, which performs semantically guided optimization in this space by using LLMs to propose candidate rules and a deterministic, data-grounded verification-and-selection loop to enforce statistical validity and deployability constraints. Across eight clinical prediction tasks, AgentScore outperforms existing score-generation methods and achieves AUC comparable to more flexible interpretable models despite operating under stronger structural constraints. On two additional externally validated tasks, AgentScore achieves higher discrimination than established guideline-based scores.
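A unit-weighted checklist of the kind AgentScore searches for is just a thresholded sum of binary rules; a sketch with invented rules (not a validated clinical score):

```python
# A unit-weighted clinical checklist: binary rules summed with unit
# weights, then thresholded. The rules and cutoffs below are hypothetical,
# chosen only to show the structure of the model class.
def checklist_score(patient):
    rules = [
        patient["age"] >= 65,
        patient["systolic_bp"] < 90,   # mmHg
        patient["resp_rate"] >= 30,    # breaths per minute
    ]
    return sum(rules)                  # booleans count as 0/1

def high_risk(patient, threshold=2):
    return checklist_score(patient) >= threshold

print(high_risk({"age": 72, "systolic_bp": 85, "resp_rate": 18}))   # True
print(high_risk({"age": 40, "systolic_bp": 120, "resp_rate": 16}))  # False
```

The hard part, which AgentScore addresses, is searching the exponentially large space of such rule sets while keeping each rule semantically sensible and statistically valid.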
[LG-98] Models Under SCOPE: Scalable and Controllable Routing via Pre-hoc Reasoning
链接: https://arxiv.org/abs/2601.22323
作者: Qi Cao,Shuhao Zhang,Ruizhe Zhou,Ruiyi Zhang,Peijia Qin,Pengtao Xie
类目: Machine Learning (cs.LG)
*备注: We propose SCOPE, a model routing framework that predicts how accurate and how expensive each model will be before running it, allowing users to control cost-accuracy trade-offs and naturally handle new models
Abstract:Model routing chooses which language model to use for each query. By sending easy queries to cheaper models and hard queries to stronger ones, it can significantly reduce inference cost while maintaining high accuracy. However, most existing routers treat this as a fixed choice among a small set of models, which makes them hard to adapt to new models or changing budget constraints. In this paper, we propose SCOPE (Scalable and Controllable Outcome Performance Estimator), a routing framework that goes beyond model selection by predicting their cost and performance. Trained with reinforcement learning, SCOPE makes reasoning-based predictions by retrieving how models behave on similar problems, rather than relying on fixed model names, enabling it to work with new, unseen models. Moreover, by explicitly predicting how accurate and how expensive a model will be, it turns routing into a dynamic decision problem, allowing users to easily control the trade-off between accuracy and cost. Experiments show that SCOPE is more than just a cost-saving tool. It flexibly adapts to user needs: it can boost accuracy by up to 25.7% when performance is the priority, or cut costs by up to 95.1% when efficiency matters most.
[LG-99] Spatially-Adaptive Conformal Graph Transformer for Indoor Localization in Wi-Fi Driven Networks
链接: https://arxiv.org/abs/2601.22322
作者: Ayesh Abu Lehyeh,Anastassia Gharib,Safwan Wshah
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Accepted to IEEE ICC 2026
Abstract:Indoor localization is a critical enabler for a wide range of location-based services in smart environments, including navigation, asset tracking, and safety-critical applications. Recent graph-based models leverage spatial relationships between Wireless Fidelity (Wi-Fi) Access Points (APs) and devices, offering finer localization granularity, but fall short in quantifying prediction uncertainty, a key requirement for real-world deployment. In this paper, we propose Spatially-Adaptive Conformal Graph Transformer (SAC-GT), a framework for accurate and reliable indoor localization. SAC-GT integrates a Graph Transformer (GT) model that captures the network’s spatial topology and signal strength dynamics, with a novel Spatially-Adaptive Conformal Prediction (SACP) method that provides region-specific uncertainty estimates. This allows SAC-GT to produce not only precise two-dimensional (2D) location predictions but also statistically valid confidence regions tailored to varying environmental conditions. Extensive evaluations on a large-scale real-world dataset demonstrate that the proposed SAC-GT solution achieves state-of-the-art localization accuracy while delivering robust and spatially adaptive reliability guarantees.
[LG-100] Matrix Factorization for Practical Continual Mean Estimation Under User-Level Differential Privacy
链接: https://arxiv.org/abs/2601.22320
作者: Nikita P. Kalinin,Ali Najar,Valentin Roth,Christoph H. Lampert
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We study continual mean estimation, where data vectors arrive sequentially and the goal is to maintain accurate estimates of the running mean. We address this problem under user-level differential privacy, which protects each user’s entire dataset even when they contribute multiple data points. Previous work on this problem has focused on pure differential privacy. While important, this approach limits applicability, as it leads to overly noisy estimates. In contrast, we analyze the problem under approximate differential privacy, adopting recent advances in the Matrix Factorization mechanism. We introduce a novel mean estimation specific factorization, which is both efficient and accurate, achieving asymptotically lower mean-squared error bounds in continual mean estimation under user-level differential privacy.
[LG-101] Federate the Router: Learning Language Model Routers with Sparse and Decentralized Evaluations
链接: https://arxiv.org/abs/2601.22318
作者: Baris Askin,Shivam Patel,Anupam Nayak,Andrea Vigano,Jiin Woo,Gauri Joshi,Carlee Joe-Wong
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models (LLMs) are increasingly accessed as remotely hosted services by edge and enterprise clients that cannot run frontier models locally. Since models vary widely in capability and price, routing queries to models that balance quality and inference cost is essential. Existing router approaches assume access to centralized query-model evaluation data. However, these data are often fragmented across clients, such as end users and organizations, and are privacy-sensitive, which makes centralizing data infeasible. Additionally, per-client router training is ineffective since local evaluation data is limited and covers only a restricted query distribution and a biased subset of model evaluations. We introduce the first federated framework for LLM routing, enabling clients to learn a shared routing policy from local offline query-model evaluation data. Our framework supports both parametric multilayer perceptron router and nonparametric K-means router under heterogeneous client query distributions and non-uniform model coverage. Across two benchmarks, federated collaboration improves the accuracy-cost frontier over client-local routers, both via increased effective model coverage and better query generalization. Our theoretical results also validate that federated training reduces routing suboptimality.
[LG-102] FlowSymm: Physics Aware Symmetry Preserving Graph Attention for Network Flow Completion
链接: https://arxiv.org/abs/2601.22317
作者: Ege Demirci,Francesco Bullo,Ananthram Swami,Ambuj Singh
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recovering missing flows on the edges of a network, while exactly respecting local conservation laws, is a fundamental inverse problem that arises in many systems such as transportation, energy, and mobility. We introduce FlowSymm, a novel architecture that combines (i) a group-action on divergence-free flows, (ii) a graph-attention encoder to learn feature-conditioned weights over these symmetry-preserving actions, and (iii) a lightweight Tikhonov refinement solved via implicit bilevel optimization. The method first anchors the given observation on a minimum-norm divergence-free completion. We then compute an orthonormal basis for all admissible group actions that leave the observed flows invariant and parameterize the valid solution subspace, which shows an Abelian group structure under vector addition. A stack of GATv2 layers then encodes the graph and its edge features into per-edge embeddings, which are pooled over the missing edges and produce per-basis attention weights. This attention-guided process selects a set of physics-aware group actions that preserve the observed flows. Finally, a scalar Tikhonov penalty refines the missing entries via a convex least-squares solver, with gradients propagated implicitly through Cholesky factorization. Across three real-world flow benchmarks (traffic, power, bike), FlowSymm outperforms state-of-the-art baselines in RMSE, MAE and correlation metrics.
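The conservation law being preserved is easy to state in code; a toy case with a single unknown edge at one node (the paper's setting is underdetermined, with the choice among all divergence-free completions made by the learned group action):

```python
# Local flow conservation: at every node, inflow equals outflow. With
# exactly one unknown outgoing edge, the missing flow is determined
# exactly — the degenerate case of the divergence-free constraint.
def complete_missing_outflow(inflows, known_outflows):
    return sum(inflows) - sum(known_outflows)

missing = complete_missing_outflow(inflows=[3.0, 2.0], known_outflows=[1.5])
print(missing)  # 3.5, leaving zero net divergence at the node
```

With many unknown edges the constraint only pins down an affine subspace; FlowSymm's contribution is parameterizing that subspace and selecting within it via attention-weighted group actions.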
[LG-103] Gaussian Process Bandit Optimization with Machine Learning Predictions and Application to Hypothesis Generation
链接: https://arxiv.org/abs/2601.22315
作者: Xin Jennifer Chen,Yunjin Tong
类目: Machine Learning (cs.LG)
*备注:
Abstract:Many real-world optimization problems involve an expensive ground-truth oracle (e.g., human evaluation, physical experiments) and a cheap, low-fidelity prediction oracle (e.g., machine learning models, simulations). Meanwhile, abundant offline data (e.g., past experiments and predictions) are often available and can be used to pretrain powerful predictive models, as well as to provide an informative prior. We propose Prediction-Augmented Gaussian Process Upper Confidence Bound (PA-GP-UCB), a novel Bayesian optimization algorithm that leverages both oracles and offline data to achieve provable gains in sample efficiency for the ground-truth oracle queries. PA-GP-UCB employs a control-variates estimator derived from a joint Gaussian process posterior to correct prediction bias and reduce uncertainty. We prove that PA-GP-UCB preserves the standard regret rate of GP-UCB while achieving a strictly smaller leading constant that is explicitly controlled by prediction quality and offline data coverage. Empirically, PA-GP-UCB converges faster than Vanilla GP-UCB and naive prediction-augmented GP-UCB baselines on synthetic benchmarks and on a real-world hypothesis evaluation task grounded in human behavioral data, where predictions are provided by large language models. These results establish PA-GP-UCB as a general and sample-efficient framework for hypothesis generation under expensive feedback.
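The control-variates idea can be sketched outside the GP machinery (illustrative only; PA-GP-UCB derives its estimator from a joint Gaussian process posterior rather than this simple bias correction):

```python
import random

random.seed(0)

# Debias many cheap predictions using a few expensive ground-truth queries.
truth = [0.5 + 0.01 * i for i in range(100)]           # expensive oracle
pred = [t + random.gauss(0.3, 0.05) for t in truth]    # biased cheap oracle

paired = random.sample(range(100), 10)                 # budget: 10 true labels
bias_hat = sum(pred[i] - truth[i] for i in paired) / len(paired)

naive = sum(pred) / len(pred)        # ignores the systematic bias
corrected = naive - bias_hat         # control-variate style correction
target = sum(truth) / len(truth)
print(abs(naive - target), abs(corrected - target))
```

The same principle applies per-point in the bandit setting: the paired queries both correct prediction bias and shrink posterior uncertainty.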
[LG-104] Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment
链接: https://arxiv.org/abs/2601.22313
作者: Yavuz Bakman,Duygu Nur Yaldiz,Salman Avestimehr,Sai Praneeth Karimireddy
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models (LLMs) are rarely static and are frequently updated in practice. A growing body of alignment research has shown that models initially deemed “aligned” can exhibit misaligned behavior after fine-tuning, such as forgetting jailbreak safety features or re-surfacing knowledge that was intended to be forgotten. These works typically assume that the initial model is aligned based on static black-box evaluation, i.e., the absence of undesired responses to a fixed set of queries. In contrast, we formalize model alignment in both the static and post-update settings and uncover a fundamental limitation of black-box evaluation. We theoretically show that, due to overparameterization, static alignment provides no guarantee of post-update alignment for any update dataset. Moreover, we prove that static black-box probing cannot distinguish a model that is genuinely post-update robust from one that conceals an arbitrary amount of adversarial behavior which can be activated by even a single benign gradient update. We further validate these findings empirically in LLMs across three core alignment domains: privacy, jailbreak safety, and behavioral honesty. We demonstrate the existence of LLMs that pass all standard black-box alignment tests, yet become severely misaligned after a single benign update. Finally, we show that the capacity to hide such latent adversarial behavior increases with model scale, confirming our theoretical prediction that post-update misalignment grows with the number of parameters. Together, our results highlight the inadequacy of static evaluation protocols and emphasize the urgent need for post-update-robust alignment evaluation.
[LG-105] SCALAR: Quantifying Structural Hallucination, Consistency and Reasoning Gaps in Materials Foundation Models
链接: https://arxiv.org/abs/2601.22312
作者: Can Polat,Erchin Serpedin,Mustafa Kurban,Hasan Kurban
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Computational Engineering, Finance, and Science (cs.CE)
*备注:
Abstract:Large language models are increasingly applied to materials science reasoning, yet their behavior under physically structured distribution shifts remains poorly understood. We introduce SCALAR (Structural Consistency And Logic Across Regimes), a benchmark for evaluating geometric scale generalization and its connection to structural hallucination, consistency, and reasoning in materials foundation models. Given canonical crystal representations, models must reason about derived nanoparticle structures obtained through supercell expansion and geometric truncation across length scales spanning a few atoms to over 18,000 atoms, totaling ≈ 100,000 structures from DFT-validated unit cells. SCALAR defines three tasks. (i) CIF to property prediction. (ii) A Chain-of-Thought variant with explicit physics-grounded reasoning. (iii) Inverse retrieval identifying crystals from candidates given target properties. Outputs are evaluated via structured metrics capturing numeric error, hallucination, cross-prompt consistency, monotonic reasoning, output validity, and retrieval regret. Experiments across diverse foundation models reveal large, model-dependent shifts under explicit reasoning, often reducing hallucination and error, but frequently destabilizing consistency or validity. These results demonstrate that geometric scale generalization cannot be inferred from accuracy alone. Supplementary materials are available at this https URL.
[LG-106] Exact closed-form Gaussian moments of residual layers
链接: https://arxiv.org/abs/2601.22307
作者: Simon Kuang,Xinfan Lin
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:We study the problem of propagating the mean and covariance of a general multivariate Gaussian distribution through a deep (residual) neural network using layer-by-layer moment matching. We close a longstanding gap by deriving exact moment matching for the probit, GeLU, ReLU (as a limit of GeLU), Heaviside (as a limit of probit), and sine activation functions; for both feedforward and generalized residual layers. On random networks, we find orders-of-magnitude improvements in the KL divergence error metric, up to a millionfold, over popular alternatives. On real data, we find competitive statistical calibration for inference under epistemic uncertainty in the input. On a variational Bayes network, we show that our method attains hundredfold improvements in KL divergence from Monte Carlo ground truth over a state-of-the-art deterministic inference method. We also give an a priori error bound and a preliminary analysis of stochastic feedforward neurons, which have recently attracted general interest.
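The closed-form moments the abstract refers to are concrete for the ReLU case: for X ~ N(μ, σ²), the exact mean of ReLU(X) is μ·Φ(μ/σ) + σ·φ(μ/σ), where Φ and φ are the standard normal CDF and PDF. The sketch below (an illustration of this well-known identity, not the authors' code) cross-checks the closed form against Monte Carlo:

```python
import math
import random

def relu_mean_gaussian(mu, sigma):
    """Exact closed-form mean of ReLU(X) for X ~ N(mu, sigma^2):
    E[ReLU(X)] = mu * Phi(mu/sigma) + sigma * phi(mu/sigma)."""
    a = mu / sigma
    Phi = 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))          # standard normal CDF
    phi = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)   # standard normal PDF
    return mu * Phi + sigma * phi

# Monte Carlo cross-check of the closed form
random.seed(0)
mu, sigma, n = 1.0, 2.0, 200_000
mc = sum(max(0.0, random.gauss(mu, sigma)) for _ in range(n)) / n
exact = relu_mean_gaussian(mu, sigma)
```

The paper's contribution is the layer-by-layer propagation of full mean and covariance through residual networks; this snippet only illustrates the single-neuron building block.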
[LG-107] BayesFlow: A Probability Inference Framework for Meta-Agent Assisted Workflow Generation EACL2026
链接: https://arxiv.org/abs/2601.22305
作者: Bo Yuan,Yun Zhou,Zhichao Xu,Kiran Ramnath,Aosong Feng,Balasubramaniam Srinivasan
类目: Machine Learning (cs.LG)
*备注: EACL 2026 Findings
Abstract:Automatic workflow generation is the process of automatically synthesizing sequences of LLM calls, tool invocations, and post-processing steps for complex end-to-end tasks. Most prior methods cast this task as an optimization problem with limited theoretical grounding. We propose to cast workflow generation as Bayesian inference over a posterior distribution on workflows, and introduce Bayesian Workflow Generation (BWG), a sampling framework that builds workflows step-by-step using parallel look-ahead rollouts for importance weighting and a sequential in-loop refiner for pool-wide improvements. We prove that, without the refiner, the weighted empirical distribution converges to the target posterior. We instantiate BWG as BayesFlow, a training-free algorithm for workflow construction. Across six benchmark datasets, BayesFlow improves accuracy by up to 9 percentage points over SOTA workflow generation baselines and by up to 65 percentage points over zero-shot prompting, establishing BWG as a principled upgrade to search-based workflow design. Code will be available on this https URL.
[LG-108] ZK-HybridFL: Zero-Knowledge Proof-Enhanced Hybrid Ledger for Federated Learning
链接: https://arxiv.org/abs/2601.22302
作者: Amirhossein Taherpour,Xiaodong Wang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Accepted for publication in IEEE Transactions on Neural Networks and Learning Systems (TNNLS)
Abstract:Federated learning (FL) enables collaborative model training while preserving data privacy, yet both centralized and decentralized approaches face challenges in scalability, security, and update validation. We propose ZK-HybridFL, a secure decentralized FL framework that integrates a directed acyclic graph (DAG) ledger with dedicated sidechains and zero-knowledge proofs (ZKPs) for privacy-preserving model validation. The framework uses event-driven smart contracts and an oracle-assisted sidechain to verify local model updates without exposing sensitive data. A built-in challenge mechanism efficiently detects adversarial behavior. In experiments on image classification and language modeling tasks, ZK-HybridFL achieves faster convergence, higher accuracy, lower perplexity, and reduced latency compared to Blade-FL and ChainFL. It remains robust against substantial fractions of adversarial and idle nodes, supports sub-second on-chain verification with efficient gas usage, and prevents invalid updates and orphanage-style attacks. This makes ZK-HybridFL a scalable and secure solution for decentralized FL across diverse environments.
[LG-109] Learning Reward Functions for Cooperative Resilience in Multi-Agent Systems
链接: https://arxiv.org/abs/2601.22292
作者: Manuela Chacon-Chamorro,Luis Felipe Giraldo,Nicanor Quijano
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*备注: Supplementary material in this https URL
Abstract:Multi-agent systems often operate in dynamic and uncertain environments, where agents must not only pursue individual goals but also safeguard collective functionality. This challenge is especially acute in mixed-motive multi-agent systems. This work focuses on cooperative resilience, the ability of agents to anticipate, resist, recover, and transform in the face of disruptions, a critical yet underexplored property in Multi-Agent Reinforcement Learning. We study how reward function design influences resilience in mixed-motive settings and introduce a novel framework that learns reward functions from ranked trajectories, guided by a cooperative resilience metric. Agents are trained in a suite of social dilemma environments using three reward strategies: i) traditional individual reward; ii) resilience-inferred reward; and iii) hybrid that balance both. We explore three reward parameterizations-linear models, hand-crafted features, and neural networks, and employ two preference-based learning algorithms to infer rewards from behavioral rankings. Our results demonstrate that hybrid strategy significantly improve robustness under disruptions without degrading task performance and reduce catastrophic outcomes like resource overuse. These findings underscore the importance of reward design in fostering resilient cooperation, and represent a step toward developing robust multi-agent systems capable of sustaining cooperation in uncertain environments.
[LG-110] Demystifying Mergeability: Interpretable Properties to Predict Model Merging Success
链接: https://arxiv.org/abs/2601.22285
作者: Luca Zhou,Bo Zhao,Rose Yu,Emanuele Rodolà
类目: Machine Learning (cs.LG)
*备注: 8 pages of main paper, 3 figures in the main paper, 4 tables in the main paper, many more figures and tables in the appendix
Abstract:Model merging combines knowledge from separately fine-tuned models, yet success factors remain poorly understood. While recent work treats mergeability as an intrinsic property, we show with an architecture-agnostic framework that it fundamentally depends on both the merging method and the partner tasks. Using linear optimization over a set of interpretable pairwise metrics (e.g., gradient L2 distance), we uncover properties correlating with post-merge performance across four merging methods. We find substantial variation in success drivers (46.7% metric overlap; 55.3% sign agreement), revealing method-specific “fingerprints”. Crucially, however, subspace overlap and gradient alignment metrics consistently emerge as foundational, method-agnostic prerequisites for compatibility. These findings provide a diagnostic foundation for understanding mergeability and motivate future fine-tuning strategies that explicitly encourage these properties.
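Two of the pairwise metrics named above, gradient L2 distance and gradient alignment, are simple to compute on flattened parameter or gradient vectors. A minimal sketch (illustrative only; the paper's full metric set and linear optimization over metrics are not reproduced here):

```python
import math

def l2_distance(vec_a, vec_b):
    """Plain L2 distance between two flattened gradient (or parameter)
    vectors -- one of the interpretable pairwise metrics."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))

def cosine_alignment(vec_a, vec_b):
    """Cosine similarity, a simple proxy for gradient alignment."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    na = math.sqrt(sum(a * a for a in vec_a))
    nb = math.sqrt(sum(b * b for b in vec_b))
    return dot / (na * nb)
```

High alignment and low distance between two fine-tuned models' gradients are the kind of method-agnostic compatibility signals the paper identifies as foundational.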
[LG-111] Riemannian Lyapunov Optimizer: A Unified Framework for Optimization
链接: https://arxiv.org/abs/2601.22284
作者: Yixuan Wang,Omkar Sudhir Patil,Warren E. Dixon
类目: Machine Learning (cs.LG)
*备注: 22 pages, 4 figures
Abstract:We introduce Riemannian Lyapunov Optimizers (RLOs), a family of optimization algorithms that unifies classic optimizers within one geometric framework. Unlike heuristic improvements to existing optimizers, RLOs are systematically derived from a novel control-theoretic framework that reinterprets optimization as an extended state discrete-time controlled dynamical system on a Riemannian parameter manifold. Central to this framework is the identification of a Normally Attracting Invariant Manifold (NAIM), which organizes training dynamics into two distinct stages: rapid alignment of the speed state to a target graph, followed by controlled evolution within it. We formalize this by constructing a strict Lyapunov function that certifies convergence to a target manifold. This perspective yields a constructive "optimizer generator" that not only recovers classic algorithms but enables the principled design of RLOs. We validate our theory via geometric diagnostics and demonstrate that grounding optimizer design in control theory yields state-of-the-art performance in large-scale benchmarks. Overall, RLOs bridge control theory and modern machine learning optimization, providing a unified language and a systematic toolkit for designing stable, effective optimizers.
[LG-112] Task-Uniform Convergence and Backward Transfer in Federated Domain-Incremental Learning with Partial Participation
链接: https://arxiv.org/abs/2601.22274
作者: Longtao Xu,Jian Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Real-world federated systems seldom operate on static data: input distributions drift while privacy rules forbid raw-data sharing. We study this setting as Federated Domain-Incremental Learning (FDIL), where (i) clients are heterogeneous, (ii) tasks arrive sequentially with shifting domains, yet (iii) the label space remains fixed. Two theoretical pillars remain missing for FDIL under realistic deployment: a guarantee of backward knowledge transfer (BKT) and a convergence rate that holds across the sequence of all tasks with partial participation. We introduce SPECIAL (Server-Proximal Efficient Continual Aggregation for Learning), a simple, memory-free FDIL algorithm that adds a single server-side "anchor" to vanilla FedAvg: in each round, the server nudges the updates of the uniformly sampled participating clients toward the previous global model with a lightweight proximal term. This anchor curbs cumulative drift without replay buffers, synthetic data, or task-specific heads, keeping communication and model size unchanged. Our theory shows that SPECIAL (i) preserves earlier tasks: a BKT bound caps any increase in prior-task loss by a drift-controlled term that shrinks with more rounds, local epochs, and participating clients; and (ii) learns efficiently across all tasks: the first communication-efficient non-convex convergence rate for FDIL with partial participation, O((E/NT)^(1/2)), with E local epochs, T communication rounds, and N participating clients per round, matching single-task FedAvg while explicitly separating optimization variance from inter-task drift. Experimental results further demonstrate the effectiveness of SPECIAL.
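The server-side anchor described above can be caricatured as a FedAvg mean followed by a proximal pull toward the previous global model. The sketch below is a deliberate simplification: in the paper the proximal term enters the optimization objective, whereas here it is collapsed into a convex combination with an assumed anchor strength `mu` (both the function name and `mu` are hypothetical):

```python
def special_aggregate(client_updates, prev_global, mu=0.1):
    """FedAvg-style mean of participating clients' model vectors, then a
    lightweight pull toward the previous global model (the "anchor").
    mu is an assumed anchor strength, not a value from the paper."""
    n = len(client_updates)
    avg = [sum(w[j] for w in client_updates) / n
           for j in range(len(prev_global))]
    # Proximal pull: stay close to the previous global model to curb drift.
    return [(1.0 - mu) * a + mu * g for a, g in zip(avg, prev_global)]
```

With `mu = 0`, this reduces to plain FedAvg; larger `mu` trades per-round progress for stability across the task sequence.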
[LG-113] Privacy-Preserving Sensor-Based Human Activity Recognition for Low-Resource Healthcare Using Classical Machine Learning
链接: https://arxiv.org/abs/2601.22265
作者: Ramakant Kumar,Pravin Kumar
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:
Abstract:Limited access to medical infrastructure forces elderly and vulnerable patients to rely on home-based care, often leading to neglect and poor adherence to therapeutic exercises such as yoga or physiotherapy. To address this gap, we propose a low-cost and automated human activity recognition (HAR) framework based on wearable inertial sensors and machine learning. Activity data, including walking, walking upstairs, walking downstairs, sitting, standing, and lying, were collected using accelerometer and gyroscope measurements. Four classical classifiers, Logistic Regression, Random Forest, Support Vector Machine (SVM), and k-Nearest Neighbors (k-NN), were evaluated and compared with the proposed Support Tensor Machine (STM). Experimental results show that SVM achieved an accuracy of 93.33 percent, while Logistic Regression, Random Forest, and k-NN achieved 91.11 percent. In contrast, STM significantly outperformed these models, achieving a test accuracy of 96.67 percent and the highest cross-validation accuracy of 98.50 percent. Unlike conventional methods, STM leverages tensor representations to preserve spatio-temporal motion dynamics, resulting in robust classification across diverse activities. The proposed framework demonstrates strong potential for remote healthcare, elderly assistance, child activity monitoring, yoga feedback, and smart home wellness, offering a scalable solution for low-resource and rural healthcare settings.
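Of the classical baselines compared above, k-NN is the simplest to write down from scratch. A minimal pure-Python sketch (illustrative; the paper's experiments use standard library implementations and engineered accelerometer/gyroscope features):

```python
def knn_predict(train_X, train_y, x, k=3):
    """Minimal k-nearest-neighbours classifier: rank training points by
    squared Euclidean distance to x, then majority-vote over the k nearest
    labels."""
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )[:k]
    votes = [train_y[i] for i in nearest]
    return max(set(votes), key=votes.count)
```

In the HAR setting, each `x` would be a feature vector summarizing a window of sensor readings and the labels would be activities such as sitting or walking.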
[LG-114] Tabular Foundation Models Can Do Survival Analysis
链接: https://arxiv.org/abs/2601.22259
作者: Da In Kim,Wei Siang Lai,Kelly W. Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:While tabular foundation models have achieved remarkable success in classification and regression, adapting them to model time-to-event outcomes for survival analysis is non-trivial due to right-censoring, where data observations may end before the event occurs. We develop a classification-based framework that reformulates both static and dynamic survival analysis as a series of binary classification problems by discretizing event times. Censored observations are naturally handled as examples with missing labels at certain time points. This classification formulation enables existing tabular foundation models to perform survival analysis through in-context learning without explicit training. We prove that under standard censoring assumptions, minimizing our binary classification loss recovers the true survival probabilities as the training set size increases. We demonstrate through evaluation across 53 real-world datasets that off-the-shelf tabular foundation models with this classification formulation outperform classical and deep learning baselines on average over multiple survival metrics.
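The discretization step described above, turning a (time, event) pair into a series of binary labels with censoring handled as missingness, can be sketched in a few lines (an illustration of the reformulation, not the authors' code; `None` marks a missing label):

```python
def survival_to_binary_labels(time, event, cutoffs):
    """For each cutoff t, produce the binary label "event occurred by t":
    1 if the event was observed at or before t, 0 if the subject is known
    to be event-free past t, and None (missing) if the subject was
    censored before t, so the outcome at t is unknown."""
    labels = []
    for t in cutoffs:
        if event and time <= t:
            labels.append(1)      # event observed by t
        elif time > t:
            labels.append(0)      # still at risk past t
        else:
            labels.append(None)   # right-censored before t: label unknown
    return labels
```

A tabular foundation model then answers each cutoff's binary question via in-context learning, and the per-cutoff probabilities assemble into a discrete survival curve.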
[LG-115] Symmetry Breaking in Transformers for Efficient and Interpretable Training
链接: https://arxiv.org/abs/2601.22257
作者: Eva Silverstein,Daniel Kunin,Vasudev Shyam
类目: Machine Learning (cs.LG)
*备注: 22 pages, 3 figures
Abstract:The attention mechanism in its standard implementation contains extraneous rotational degrees of freedom that are carried through computation but do not affect model activations or outputs. We introduce a simple symmetry-breaking protocol that inserts a preferred direction into this rotational space through batchwise-sampled, unlearned query and value biases. This modification has two theoretically motivated and empirically validated consequences. First, it can substantially improve the performance of simple, memory-efficient optimizers, narrowing – and in some cases closing – the gap to successful but more complex memory-intensive adaptive methods. We demonstrate this by pretraining 124M parameter transformer models with four optimization algorithms (AdamW, SOAP, SGDM, and Energy Conserving Descent (ECD)) and evaluating both validation loss and downstream logical reasoning. Second, it enables an interpretable use of otherwise redundant rotational degrees of freedom, selectively amplifying semantically meaningful token classes within individual attention heads. Overall, our results show that minimal, principled architectural changes can simultaneously improve performance and interpretability.
[LG-116] FunPRM: Function-as-Step Process Reward Model with Meta Reward Correction for Code Generation
链接: https://arxiv.org/abs/2601.22249
作者: Ruiyi Zhang,Peijia Qin,Qi Cao,Eric Xue,Pengtao Xie
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:
Abstract:Code generation is a core application of large language models (LLMs), yet LLMs still frequently fail on complex programming tasks. Given its success in mathematical reasoning, test-time scaling approaches such as Process Reward Model (PRM)-based Best-of-N selection offer a promising way to improve performance. However, existing PRMs remain ineffective for code generation due to the lack of meaningful step decomposition in code and the noise of Monte Carlo-estimated partial-solution correctness scores (rewards). To address these challenges, we propose FunPRM. FunPRM prompts LLMs to encourage modular code generation organized into functions, with functions treated as PRM reasoning steps. Furthermore, FunPRM introduces a novel meta-learning-based reward correction mechanism that leverages clean final-solution rewards obtained via a unit-test-based evaluation system to purify noisy partial-solution rewards. Experiments on LiveCodeBench and BigCodeBench demonstrate that FunPRM consistently outperforms existing test-time scaling methods across five base LLMs, notably achieving state-of-the-art performance on LiveCodeBench when combined with O4-mini. Furthermore, FunPRM produces code that is more readable and reusable for developers.
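The Best-of-N selection that FunPRM plugs into has a simple skeleton: score each candidate by aggregating its per-step (here, per-function) PRM rewards and keep the top scorer. The sketch below is generic Best-of-N, not FunPRM itself, and the choice of `min` as the aggregator is an assumption (a common convention meaning "a solution is only as good as its weakest step"):

```python
def best_of_n(candidates, step_rewards, aggregate=min):
    """candidates: list of candidate solutions (e.g. code strings).
    step_rewards: per-candidate list of per-step PRM rewards.
    Score each candidate by aggregating its step rewards and return
    the highest-scoring candidate with its score."""
    scores = [aggregate(rewards) for rewards in step_rewards]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores[best]
```

FunPRM's contribution sits upstream of this loop: treating functions as the PRM's reasoning steps and denoising the Monte Carlo reward estimates via meta-learned correction.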
[LG-117] Aligning Microscopic Vehicle and Macroscopic Traffic Statistics: Reconstructing Driving Behavior from Partial Data
链接: https://arxiv.org/abs/2601.22242
作者: Zhihao Zhang,Keith Redmill,Chengyang Peng,Bowen Weng
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
Abstract:A driving algorithm that aligns with good human driving practices, or at the very least collaborates effectively with human drivers, is crucial for developing safe and efficient autonomous vehicles. In practice, two main approaches are commonly adopted: (i) supervised or imitation learning, which requires comprehensive naturalistic driving data capturing all states that influence a vehicle’s decisions and corresponding actions, and (ii) reinforcement learning (RL), where the simulated driving environment either matches or is intentionally more challenging than real-world conditions. Both methods depend on high-quality observations of real-world driving behavior, which are often difficult and costly to obtain. State-of-the-art sensors on individual vehicles can gather microscopic data, but they lack context about the surrounding conditions. Conversely, roadside sensors can capture traffic flow and other macroscopic characteristics, but they cannot associate this information with individual vehicles on a microscopic level. Motivated by this complementarity, we propose a framework that reconstructs unobserved microscopic states from macroscopic observations, using microscopic data to anchor observed vehicle behaviors, and learns a shared policy whose behavior is microscopically consistent with the partially observed trajectories and actions and macroscopically aligned with target traffic statistics when deployed population-wide. Such constrained and regularized policies promote realistic flow patterns and safe coordination with human drivers at scale.
[LG-118] DAJ: Data-Reweighted LLM Judge for Test-Time Scaling in Code Generation
链接: https://arxiv.org/abs/2601.22230
作者: Peijia Qin,Ruiyi Zhang,Qi Cao,Pengtao Xie
类目: Machine Learning (cs.LG)
*备注:
Abstract:Test-time scaling for code generation commonly relies on Best-of-N selection, in which multiple candidate solutions are sampled from a base model, and the best one is selected by an LLM judge. However, training reliable LLM judges is challenging due to severe distribution shifts, including imbalances between easy and hard problems, mismatches between training tasks and evaluation benchmarks, and trajectory mismatch arising from training data generated by cheaper models whose behavior differs from that of inference-time models. We propose DAJ, a reasoning-based LLM judge trained with verifiable rewards under a bi-level data-reweighted learning framework. The proposed framework learns data-importance weights (either domain-level or instance-level) to optimize generalization performance on a held-out meta set aligned with target benchmarks. To the best of our knowledge, this is the first application of data reweighting to LLM-as-a-Judge training for test-time scaling. Our approach automatically emphasizes hard problems, in-distribution samples, and trajectory-aligned data, without relying on hand-crafted heuristics. Empirically, DAJ achieves state-of-the-art performance on LiveCodeBench and BigCodeBench, outperforming strong test-time scaling baselines as well as leading proprietary models.
[LG-119] Latent Spherical Flow Policy for Reinforcement Learning with Combinatorial Actions
链接: https://arxiv.org/abs/2601.22211
作者: Lingkai Kong,Anagha Satish,Hezi Jiang,Akseli Kangaslahti,Andrew Ma,Wenbo Chen,Mingxiao Song,Lily Xu,Milind Tambe
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement learning (RL) with combinatorial action spaces remains challenging because feasible action sets are exponentially large and governed by complex feasibility constraints, making direct policy parameterization impractical. Existing approaches embed task-specific value functions into constrained optimization programs or learn deterministic structured policies, sacrificing generality and policy expressiveness. We propose a solver-induced latent spherical flow policy that brings the expressiveness of modern generative policies to combinatorial RL while guaranteeing feasibility by design. Our method, LSFlow, learns a stochastic policy in a compact continuous latent space via spherical flow matching, and delegates feasibility to a combinatorial optimization solver that maps each latent sample to a valid structured action. To improve efficiency, we train the value network directly in the latent space, avoiding repeated solver calls during policy optimization. To address the piecewise-constant and discontinuous value landscape induced by solver-based action selection, we introduce a smoothed Bellman operator that yields stable, well-defined learning targets. Empirically, our approach outperforms state-of-the-art baselines by an average of 20.6% across a range of challenging combinatorial RL tasks.
[LG-120] Causal Imitation Learning Under Measurement Error and Distribution Shift
链接: https://arxiv.org/abs/2601.22206
作者: Shi Bo,AmirEmad Ghassami
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 28 pages, 3 figures
Abstract:We study offline imitation learning (IL) when part of the decision-relevant state is observed only through noisy measurements and the distribution may change between training and deployment. Such settings induce spurious state-action correlations, so standard behavioral cloning (BC) – whether conditioning on raw measurements or ignoring them – can converge to systematically biased policies under distribution shift. We propose a general framework for IL under measurement error, inspired by explicitly modeling the causal relationships among the variables, yielding a target that retains a causal interpretation and is robust to distribution shift. Building on ideas from proximal causal inference, we introduce CausIL, which treats noisy state observations as proxy variables, and we provide identification conditions under which the target policy is recoverable from demonstrations without rewards or interactive expert queries. We develop estimators for both discrete and continuous state spaces; for continuous settings, we use an adversarial procedure over RKHS function classes to learn the required parameters. We evaluate CausIL on semi-simulated longitudinal data from the PhysioNet/Computing in Cardiology Challenge 2019 cohort and demonstrate improved robustness to distribution shift compared to BC baselines.
[LG-121] FedAdaVR: Adaptive Variance Reduction for Robust Federated Learning under Limited Client Participation
链接: https://arxiv.org/abs/2601.22204
作者: S M Ruhul Kabir Howlader,Xiao Chen,Yifei Xie,Lu Liu
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:Federated learning (FL) encounters substantial challenges due to heterogeneity, leading to gradient noise, client drift, and partial client participation errors, the last of which is the most pervasive but remains insufficiently addressed in current literature. In this paper, we propose FedAdaVR, a novel FL algorithm aimed at solving heterogeneity issues caused by sporadic client participation by incorporating an adaptive optimiser with a variance reduction technique. This method takes advantage of the most recent stored updates from clients, even when they are absent from the current training round, thereby emulating their presence. Furthermore, we propose FedAdaVR-Quant, which stores client updates in quantised form, significantly reducing the memory requirements (by 50%, 75%, and 87.5%) of FedAdaVR while maintaining equivalent model performance. We analyse the convergence behaviour of FedAdaVR under general nonconvex conditions and prove that our proposed algorithm can eliminate partial client participation error. Extensive experiments conducted on multiple datasets, under both independent and identically distributed (IID) and non-IID settings, demonstrate that FedAdaVR consistently outperforms state-of-the-art baseline methods.
[LG-122] Tacit Coordination of Large Language Models
链接: https://arxiv.org/abs/2601.22184
作者: Ido Aharon,Emanuele La Malfa,Michael Wooldridge,Sarit Kraus
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: Code: this https URL
Abstract:In tacit coordination games with multiple outcomes, purely rational solution concepts, such as Nash equilibria, provide no guidance for which equilibrium to choose. Schelling’s theory explains how, in these settings, humans coordinate by relying on focal points: solutions or outcomes that naturally arise because they stand out in some way as salient or prominent to all players. This work studies Large Language Models (LLMs) as players in tacit coordination games, and addresses how, when, and why focal points emerge. We compare and quantify the coordination capabilities of LLMs in cooperative and competitive games for which human experiments are available. We also introduce several learning-free strategies to improve the coordination of LLMs, with themselves and with humans. On a selection of heterogeneous open-source models, including Llama, Qwen, and GPT-oss, we discover that LLMs have a remarkable capability to coordinate and often outperform humans, yet fail on common-sense coordination that involves numbers or nuanced cultural archetypes. This paper constitutes the first large-scale assessment of LLMs’ tacit coordination within the theoretical and psychological framework of focal points.
[LG-123] Large Language Models: A Mathematical Formulation
链接: https://arxiv.org/abs/2601.22170
作者: Ricardo Baptista,Andrew Stuart,Son Tran
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 51 pages, 2 figures
Abstract:Large language models (LLMs) process and predict sequences containing text to answer questions, and address tasks including document summarization, providing recommendations, writing software and solving quantitative problems. We provide a mathematical framework for LLMs by describing the encoding of text sequences into sequences of tokens, defining the architecture for next-token prediction models, explaining how these models are learned from data, and demonstrating how they are deployed to address a variety of tasks. The mathematical sophistication required to understand this material is not high, and relies on straightforward ideas from information theory, probability and optimization. Nonetheless, the combination of ideas resting on these different components from the mathematical sciences yields a complex algorithmic structure; and this algorithmic structure has demonstrated remarkable empirical successes. The mathematical framework established here provides a platform from which it is possible to formulate and address questions concerning the accuracy, efficiency and robustness of the algorithms that constitute LLMs. The framework also suggests directions for development of modified and new methodologies.
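The "straightforward ideas from information theory, probability and optimization" the abstract mentions center on the next-token prediction objective: the average negative log-likelihood (cross-entropy) of a token sequence under the model. A minimal sketch of that quantity (illustrative, not from the paper):

```python
import math

def next_token_nll(probs_seq, tokens):
    """Average negative log-likelihood of a token sequence under a
    next-token model. probs_seq[i] is the model's predicted distribution
    over the vocabulary just before emitting tokens[i]; training an LLM
    minimizes this quantity over a corpus."""
    total = -sum(math.log(p[t]) for p, t in zip(probs_seq, tokens))
    return total / len(tokens)
```

For a model that is uniform over a vocabulary of size V, this evaluates to log V per token, the entropy ceiling that trained models improve upon.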
[LG-124] Nested Slice Sampling: Vectorized Nested Sampling for GPU-Accelerated Inference
链接: https://arxiv.org/abs/2601.23252
作者: David Yallup,Namu Kroupa,Will Handley
类目: Computation (stat.CO); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 54 pages, 11 figures
Abstract:Model comparison and calibrated uncertainty quantification often require integrating over parameters, but scalable inference can be challenging for complex, multimodal targets. Nested Sampling is a robust alternative to standard MCMC, yet its typically sequential structure and hard constraints make efficient accelerator implementations difficult. This paper introduces Nested Slice Sampling (NSS), a GPU-friendly, vectorized formulation of Nested Sampling that uses Hit-and-Run Slice Sampling for constrained updates. A tuning analysis yields a simple near-optimal rule for setting the slice width, improving high-dimensional behavior and making per-step compute more predictable for parallel execution. Experiments on challenging synthetic targets, high dimensional Bayesian inference, and Gaussian process hyperparameter marginalization show that NSS maintains accurate evidence estimates and high-quality posterior samples, and is particularly robust on difficult multimodal problems where current state-of-the-art methods such as tempered SMC baselines can struggle. An open-source implementation is released to facilitate adoption and reproducibility.
[LG-125] Graph Attention Network for Node Regression on Random Geometric Graphs with Erdős–Rényi contamination
链接: https://arxiv.org/abs/2601.23239
作者: Somak Laha,Suqi Liu,Morgane Austern
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Statistics Theory (math.ST)
*备注: 62 pages, 2 figures, 2 tables
Abstract:Graph attention networks (GATs) are widely used and often appear robust to noise in node covariates and edges, yet rigorous statistical guarantees demonstrating a provable advantage of GATs over non-attention graph neural networks (GNNs) are scarce. We partially address this gap for node regression with graph-based errors-in-variables models under simultaneous covariate and edge corruption: responses are generated from latent node-level covariates, but only noise-perturbed versions of the latent covariates are observed; and the sample graph is a random geometric graph created from the node covariates but contaminated by independent Erdős–Rényi edges. We propose and analyze a carefully designed, task-specific GAT that constructs denoised proxy features for regression. We prove that regressing the response variables on the proxies achieves lower error asymptotically in (a) estimating the regression coefficient compared to the ordinary least squares (OLS) estimator on the noisy node covariates, and (b) predicting the response for an unlabelled node compared to a vanilla graph convolutional network (GCN) – under mild growth conditions. Our analysis leverages high-dimensional geometric tail bounds and concentration for neighbourhood counts and sample covariances. We verify our theoretical findings through experiments on synthetically generated data. We also perform experiments on real-world graphs and demonstrate the effectiveness of the attention mechanism in several node regression tasks.
[LG-126] Solving Inverse Problems with Flow-based Models via Model Predictive Control
链接: https://arxiv.org/abs/2601.23231
作者: George Webber,Alexander Denker,Riccardo Barbano,Andrew J Reader
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注:
Abstract:Flow-based generative models provide strong unconditional priors for inverse problems, but guiding their dynamics for conditional generation remains challenging. Recent work casts training-free conditional generation in flow models as an optimal control problem; however, solving the resulting trajectory optimisation is computationally and memory intensive, requiring differentiation through the flow dynamics or adjoint solves. We propose MPC-Flow, a model predictive control framework that formulates inverse problem solving with flow-based generative models as a sequence of control sub-problems, enabling practical optimal control-based guidance at inference time. We provide theoretical guarantees linking MPC-Flow to the underlying optimal control objective and show how different algorithmic choices yield a spectrum of guidance algorithms, including regimes that avoid backpropagation through the generative model trajectory. We evaluate MPC-Flow on benchmark image restoration tasks, spanning linear and non-linear settings such as in-painting, deblurring, and super-resolution, and demonstrate strong performance and scalability to massive state-of-the-art architectures via training-free guidance of FLUX.2 (32B) in a quantised setting on consumer hardware.
[LG-127] A Random Matrix Theory of Masked Self-Supervised Regression
链接: https://arxiv.org/abs/2601.23208
作者: Arie Wortsman Zurich,Federica Gerace,Bruno Loureiro,Yue M. Lu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:In the era of transformer models, masked self-supervised learning (SSL) has become a foundational training paradigm. A defining feature of masked SSL is that training aggregates predictions across many masking patterns, giving rise to a joint, matrix-valued predictor rather than a single vector-valued estimator. This object encodes how coordinates condition on one another and poses new analytical challenges. We develop a precise high-dimensional analysis of masked modeling objectives in the proportional regime where the number of samples scales with the ambient dimension. Our results provide explicit expressions for the generalization error and characterize the spectral structure of the learned predictor, revealing how masked modeling extracts structure from data. For spiked covariance models, we show that the joint predictor undergoes a Baik–Ben Arous–Péché (BBP)-type phase transition, identifying when masked SSL begins to recover latent signals. Finally, we identify structured regimes in which masked self-supervised learning provably outperforms PCA, highlighting potential advantages of SSL objectives over classical unsupervised methods.
[LG-128] Compressed BC-LISTA via Low-Rank Convolutional Decomposition
链接: https://arxiv.org/abs/2601.23148
作者: Han Wang,Yhonatan Kvich,Eduardo Pérez,Florian Römer,Yonina C. Eldar
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: Inverse Problems, Model Compression, Compressed Sensing, Deep Unrolling, Computational Imaging
Abstract:We study Sparse Signal Recovery (SSR) methods for multichannel imaging with compressed forward and backward operators that preserve reconstruction accuracy. We propose a Compressed Block-Convolutional (C-BC) measurement model based on a low-rank Convolutional Neural Network (CNN) decomposition that is analytically initialized from a low-rank factorization of physics-derived forward/backward operators in time delay-based measurements. We use Orthogonal Matching Pursuit (OMP) to select a compact set of basis filters from the analytic model and compute linear mixing coefficients to approximate the full model. We consider the Learned Iterative Shrinkage-Thresholding Algorithm (LISTA) network as a representative example for which the C-BC-LISTA extension is presented. In simulated multichannel ultrasound imaging across multiple Signal-to-Noise Ratios (SNRs), C-BC-LISTA requires substantially fewer parameters and smaller model size than other state-of-the-art (SOTA) methods while improving reconstruction accuracy. In ablations over OMP, Singular Value Decomposition (SVD)-based, and random initializations, OMP-initialized structured compression performs best, yielding the most efficient training and the best performance.
[LG-129] Asymptotic Theory of Iterated Empirical Risk Minimization with Applications to Active Learning
链接: https://arxiv.org/abs/2601.23031
作者: Hugo Cui,Yue M. Lu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We study a class of iterated empirical risk minimization (ERM) procedures in which two successive ERMs are performed on the same dataset, and the predictions of the first estimator enter as an argument in the loss function of the second. This setting, which arises naturally in active learning and reweighting schemes, introduces intricate statistical dependencies across samples and fundamentally distinguishes the problem from classical single-stage ERM analyses. For linear models trained with a broad class of convex losses on Gaussian mixture data, we derive a sharp asymptotic characterization of the test error in the high-dimensional regime where the sample size and ambient dimension scale proportionally. Our results provide explicit, fully asymptotic predictions for the performance of the second-stage estimator despite the reuse of data and the presence of prediction-dependent losses. We apply this theory to revisit a well-studied pool-based active learning problem, removing oracle and sample-splitting assumptions made in prior work. We uncover a fundamental tradeoff in how the labeling budget should be allocated across stages, and demonstrate a double-descent behavior of the test error driven purely by data selection, rather than model size or sample count.
[LG-130] Neural Backward Filtering Forward Guiding
链接: https://arxiv.org/abs/2601.23030
作者: Gefan Yang,Frank van der Meulen,Stefan Sommer
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:Inference in non-linear continuous stochastic processes on trees is challenging, particularly when observations are sparse (leaf-only) and the topology is complex. Exact smoothing via Doob's h -transform is intractable for general non-linear dynamics, while particle-based methods degrade in high dimensions. We propose Neural Backward Filtering Forward Guiding (NBFFG), a unified framework for both discrete transitions and continuous diffusions. Our method constructs a variational posterior by leveraging an auxiliary linear-Gaussian process. This auxiliary process yields a closed-form backward filter that serves as a "guide", steering the generative path toward high-likelihood regions. We then learn a neural residual, parameterized as a normalizing flow or a controlled SDE, to capture the non-linear discrepancies. This formulation allows for an unbiased path-wise subsampling scheme, reducing the training complexity from tree-size dependent to path-length dependent. Empirical results show that NBFFG outperforms baselines on synthetic benchmarks, and we demonstrate the method on a high-dimensional inference task in phylogenetic analysis with reconstruction of ancestral butterfly wing shapes.
[LG-131] OneFlowSBI: One Model Many Queries for Simulation-Based Inference
链接: https://arxiv.org/abs/2601.22951
作者: Mayank Nautiyal,Li Ju,Melker Ernfors,Klara Hagland,Ville Holma,Maximilian Werkö Söderholm,Andreas Hellander,Prashant Singh
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We introduce OneFlowSBI, a unified framework for simulation-based inference that learns a single flow-matching generative model over the joint distribution of parameters and observations. Leveraging a query-aware masking distribution during training, the same model supports multiple inference tasks, including posterior sampling, likelihood estimation, and arbitrary conditional distributions, without task-specific retraining. We evaluate OneFlowSBI on ten benchmark inference problems and two high-dimensional real-world inverse problems across multiple simulation budgets. OneFlowSBI is shown to deliver competitive performance against state-of-the-art generalized inference solvers and specialized posterior estimators, while enabling efficient sampling with few ODE integration steps and remaining robust under noisy and partially observed data.
[LG-132] Approximating f-Divergences with Rank Statistics ICML’26
链接: https://arxiv.org/abs/2601.22784
作者: Viktor Stein,José Manuel de Frutos
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 42 pages, 10 figures, 4 tables, submitted to ICML’26. Comments welcome!
Abstract:We introduce a rank-statistic approximation of f -divergences that avoids explicit density-ratio estimation by working directly with the distribution of ranks. For a resolution parameter K , we map the mismatch between two univariate distributions \mu and \nu to a rank histogram on \{0, \ldots, K\} and measure its deviation from uniformity via a discrete f -divergence, yielding a rank-statistic divergence estimator. We prove that the resulting estimator of the divergence is monotone in K , is always a lower bound of the true f -divergence, and we establish quantitative convergence rates for K\to\infty under mild regularity of the quantile-domain density ratio. To handle high-dimensional data, we define the sliced rank-statistic f -divergence by averaging the univariate construction over random projections, and we provide convergence results for the sliced limit as well. We also derive finite-sample deviation bounds along with asymptotic normality results for the estimator. Finally, we empirically validate the approach by benchmarking against neural baselines and illustrating its use as a learning objective in generative modelling experiments.
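The rank-histogram construction can be sketched in a few lines. This is a hedged illustration under simplifying assumptions (the K reference points are resampled from the empirical sample of \mu, and KL is used as the discrete f-divergence); it is not the paper's estimator verbatim.

```python
import numpy as np

def rank_statistic_kl(mu_samples, nu_samples, K, rng=None):
    """For each draw from nu, record its rank among K draws from mu.
    If mu == nu, ranks are uniform on {0, ..., K}; the discrete KL of
    the rank histogram from the uniform histogram then measures the
    mismatch. (KL is one illustrative choice of f-divergence.)"""
    rng = np.random.default_rng(rng)
    counts = np.zeros(K + 1)
    for y in nu_samples:
        ref = rng.choice(mu_samples, size=K, replace=True)
        counts[int(np.sum(ref < y))] += 1.0
    p = counts / counts.sum()
    u = 1.0 / (K + 1)          # uniform reference histogram
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / u)))
```

Equal distributions give a near-uniform rank histogram (divergence near 0), while a shifted distribution piles ranks at the extremes and the divergence grows.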
[LG-133] GRANITE: A Generalized Regional Framework for Identifying Agreement in Feature-Based Explanations
链接: https://arxiv.org/abs/2601.22771
作者: Julia Herbinger,Gabriel Laberge,Maximilian Muschalik,Yann Pequignot,Marvin N. Wright,Fabian Fumagalli
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Feature-based explanation methods aim to quantify how features influence the model’s behavior, either locally or globally, but different methods often disagree, producing conflicting explanations. This disagreement arises primarily from two sources: how feature interactions are handled and how feature dependencies are incorporated. We propose GRANITE, a generalized regional explanation framework that partitions the feature space into regions where interaction and distribution influences are minimized. This approach aligns different explanation methods, yielding more consistent and interpretable explanations. GRANITE unifies existing regional approaches, extends them to feature groups, and introduces a recursive partitioning algorithm to estimate such regions. We demonstrate its effectiveness on real-world datasets, providing a practical tool for consistent and interpretable feature explanations.
[LG-134] Bayesian Matrix Completion Under Geometric Constraints ICASSP2026
链接: https://arxiv.org/abs/2601.22765
作者: Rohit Varma Chiluvuri,Santosh Nannuru
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 4 pages, 3 figures, Accepted to ICASSP 2026
Abstract:The completion of a Euclidean distance matrix (EDM) from sparse and noisy observations is a fundamental challenge in signal processing, with applications in sensor network localization, acoustic room reconstruction, molecular conformation, and manifold learning. Traditional approaches, such as rank-constrained optimization and semidefinite programming, enforce geometric constraints but often struggle under sparse or noisy conditions. This paper introduces a hierarchical Bayesian framework that places structured priors directly on the latent point set generating the EDM, naturally embedding geometric constraints. By incorporating a hierarchical prior on the latent point set, the model enables automatic regularization and robust noise handling. Posterior inference is performed using a Metropolis-Hastings within Gibbs sampler to handle the coupled latent-point posterior. Experiments on synthetic data demonstrate improved reconstruction accuracy compared to deterministic baselines in sparse regimes.
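The generative model being inverted is easy to write down. The sketch below shows the squared-distance EDM forward map and an unnormalized log-posterior over latent points under a Gaussian prior and Gaussian observation noise, i.e. the kind of target an MH-within-Gibbs sampler would explore; the noise and prior scales are illustrative, not the paper's settings.

```python
import numpy as np

def edm(X):
    """Squared Euclidean distance matrix of an n x d point set."""
    g = np.sum(X * X, axis=1)
    return g[:, None] + g[None, :] - 2.0 * X @ X.T

def log_posterior(X, D_obs, mask, noise_sd=0.1, prior_sd=1.0):
    """Unnormalized log-posterior over the latent points given sparse,
    noisy EDM entries (observed where `mask` is True); hyperparameters
    are illustrative placeholders."""
    resid = (edm(X) - D_obs)[mask]
    loglik = -0.5 * np.sum(resid ** 2) / noise_sd ** 2
    logprior = -0.5 * np.sum(X ** 2) / prior_sd ** 2
    return loglik + logprior
```

Any MCMC scheme over `X` that targets this density implicitly enforces the geometric (rank and positivity) constraints, since every proposal is itself a valid point set.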
[LG-135] Spectral Gradient Descent Mitigates Anisotropy-Driven Misalignment: A Case Study in Phase Retrieval
链接: https://arxiv.org/abs/2601.22652
作者: Guillaume Braun,Han Bao,Wei Huang,Masaaki Imaizumi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 53 pages, 8 figures
Abstract:Spectral gradient methods, such as the Muon optimizer, modify gradient updates by preserving directional information while discarding scale, and have shown strong empirical performance in deep learning. We investigate the mechanisms underlying these gains through a dynamical analysis of a nonlinear phase retrieval model with anisotropic Gaussian inputs, equivalent to training a two-layer neural network with the quadratic activation and fixed second-layer weights. Focusing on a spiked covariance setting where the dominant variance direction is orthogonal to the signal, we show that gradient descent (GD) suffers from a variance-induced misalignment: during the early escaping stage, the high-variance but uninformative spike direction is multiplicatively amplified, degrading alignment with the true signal under strong anisotropy. In contrast, spectral gradient descent (SpecGD) removes this spike amplification effect, leading to stable alignment and accelerated noise contraction. Numerical experiments confirm the theory and show that these phenomena persist under broader anisotropic covariances.
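The SpecGD update contrasted with GD above can be sketched as an SVD-based polar step in the style of the Muon optimizer: keep the gradient's singular directions, discard its singular values. The learning rate here is an illustrative choice.

```python
import numpy as np

def specgd_step(W, grad, lr=0.1):
    """One spectral gradient step: descend along the orthogonal polar
    factor U V^T of the gradient, so every retained direction moves
    with unit scale and no single (e.g. spike) direction is amplified."""
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    return W - lr * (U @ Vt)
```

The step direction has all singular values equal to one, which is exactly the property that removes the multiplicative spike amplification described in the abstract.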
[LG-136] Generative and Nonparametric Approaches for Conditional Distribution Estimation: Methods Perspectives and Comparative Evaluations
链接: https://arxiv.org/abs/2601.22650
作者: Yen-Shiu Chin,Zhi-Yu Jou,Toshinari Morimoto,Chia-Tse Wang,Ming-Chung Chang,Tso-Jung Yen,Su-Yun Huang,Tailen Hsing
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 22 pages, 2 figures, 2 tables
Abstract:The inference of conditional distributions is a fundamental problem in statistics, essential for prediction, uncertainty quantification, and probabilistic modeling. A wide range of methodologies have been developed for this task. This article reviews and compares several representative approaches spanning classical nonparametric methods and modern generative models. We begin with the single-index method of Hall and Yao (2005), which estimates the conditional distribution through a dimension-reducing index and nonparametric smoothing of the resulting one-dimensional cumulative conditional distribution function. We then examine the basis-expansion approaches, including FlexCode (Izbicki and Lee, 2017) and DeepCDE (Dalmasso et al., 2020), which convert conditional density estimation into a set of nonparametric regression problems. In addition, we discuss two recent generative simulation-based methods that leverage modern deep generative architectures: the generative conditional distribution sampler (Zhou et al., 2023) and the conditional denoising diffusion probabilistic model (Fu et al., 2024; Yang et al., 2025). A systematic numerical comparison of these approaches is provided using a unified evaluation framework that ensures fairness and reproducibility. The performance metrics used for the estimated conditional distribution include the mean-squared errors of conditional mean and standard deviation, as well as the Wasserstein distance. We also discuss their flexibility and computational costs, highlighting the distinct advantages and limitations of each approach.
[LG-137] RPWithPrior: Label Differential Privacy in Regression
链接: https://arxiv.org/abs/2601.22625
作者: Haixia Liu,Ruifan Huang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 20 pages
Abstract:With the wide application of machine learning techniques in practice, privacy preservation has gained increasing attention. Protecting user privacy with minimal accuracy loss is a fundamental task in the data analysis and mining community. In this paper, we focus on regression tasks under \epsilon -label differential privacy guarantees. Some existing methods for regression with \epsilon -label differential privacy, such as the RR-On-Bins mechanism, discretized the output space into finite bins and then applied the RR algorithm. To efficiently determine these finite bins, the authors rounded the original responses down to integer values. However, such operations do not align well with real-world scenarios. To overcome these limitations, we model both original and randomized responses as continuous random variables, avoiding discretization entirely. Our novel approach estimates an optimal interval for randomized responses and introduces new algorithms designed for scenarios where a prior is either known or unknown. Additionally, we prove that our algorithm, RPWithPrior, guarantees \epsilon -label differential privacy. Numerical results demonstrate that our approach achieves better performance than the Gaussian, Laplace, Staircase, RR-On-Bins, and Unbiased mechanisms on the Communities and Crime, Criteo Sponsored Search Conversion Log, and California Housing datasets.
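For context, the RR-On-Bins baseline the abstract contrasts with can be sketched as m-ary randomized response over discretized labels; the bin edges and seeds below are illustrative, not values from the paper.

```python
import numpy as np

def rr_on_bins(y, bin_edges, eps, rng=None):
    """Discretize a response into one of m bins, then report the true
    bin with probability e^eps / (e^eps + m - 1) and a uniformly random
    other bin otherwise; this m-ary randomized response satisfies
    eps-label differential privacy."""
    rng = np.random.default_rng(rng)
    m = len(bin_edges) - 1
    true_bin = int(np.clip(np.digitize(y, bin_edges) - 1, 0, m - 1))
    p_keep = np.exp(eps) / (np.exp(eps) + m - 1)
    if rng.random() < p_keep:
        return true_bin
    return int(rng.choice([b for b in range(m) if b != true_bin]))
```

The rounding step `np.digitize` is precisely the discretization that the proposed continuous mechanism avoids.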
[LG-138] An Efficient Algorithm for Thresholding Monte Carlo Tree Search
链接: https://arxiv.org/abs/2601.22600
作者: Shoma Nameki(1),Atsuyoshi Nakamura(2),Junpei Komiyama(3 and 4),Koji Tabata(5) ((1) Graduate School of Information Science and Technology, Hokkaido University, (2) Faculty of Information Science and Technology, Hokkaido University, (3) Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), (4) RIKEN AIP, (5) Research Institute for Electronic Science, Hokkaido University)
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We introduce the Thresholding Monte Carlo Tree Search problem, in which, given a tree \mathcal{T} and a threshold \theta , a player must answer whether the root node value of \mathcal{T} is at least \theta or not. In the given tree, each internal node is labeled 'MAX' or 'MIN', and the value of a 'MAX'-labeled ('MIN'-labeled) internal node is the maximum (minimum) of its child values. The value of a leaf node is the mean reward of an unknown distribution, from which the player can sample rewards. For this problem, we develop a \delta -correct sequential sampling algorithm based on the Track-and-Stop strategy that has asymptotically optimal sample complexity. We show that a ratio-based modification of the D-Tracking arm-pulling strategy leads to a substantial improvement in empirical sample complexity, as well as reducing the per-round computational cost from linear to logarithmic in the number of arms.
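The quantity being thresholded is the standard minimax root value. A minimal recursive evaluator (using a hypothetical tuple encoding of the tree) makes the decision problem concrete; the sampling algorithm must answer the same question with leaf means accessible only through noisy rewards.

```python
def root_value(tree):
    """Exact minimax value of a tree encoded as ('LEAF', mean) or
    ('MAX'|'MIN', [children]); the thresholding problem asks whether
    this value is at least theta."""
    kind, arg = tree
    if kind == 'LEAF':
        return arg
    vals = [root_value(c) for c in arg]
    return max(vals) if kind == 'MAX' else min(vals)
```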
[LG-139] Corrected Samplers for Discrete Flow Models
链接: https://arxiv.org/abs/2601.22519
作者: Zhengyan Wan,Yidong Ouyang,Liyan Xie,Fang Fang,Hongyuan Zha,Guang Cheng
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Discrete flow models (DFMs) have been proposed to learn the data distribution on a finite state space, offering a flexible framework as an alternative to discrete diffusion models. A line of recent work has studied samplers for discrete diffusion models, such as tau-leaping and the Euler solver. However, these samplers require a large number of iterations to control discretization error, since the transition rates are frozen in time and evaluated at the initial state within each time interval. Moreover, theoretical results for these samplers often require boundedness conditions of the transition rate or they focus on a specific type of source distributions. To address those limitations, we establish non-asymptotic discretization error bounds for those samplers without any restriction on transition rates and source distributions, under the framework of discrete flow models. Furthermore, by analyzing a one-step lower bound of the Euler sampler, we propose two corrected samplers: a time-corrected sampler and a location-corrected sampler, which can reduce the discretization error of tau-leaping and the Euler solver with almost no additional computational cost. We rigorously show that the location-corrected sampler has a lower iteration complexity than existing parallel samplers. We validate the effectiveness of the proposed method by demonstrating improved generation quality and reduced inference time on both simulation and text-to-image generation tasks. Code can be found in this https URL.
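The Euler solver whose discretization error is bounded (and then corrected) freezes the transition rates at the current state and time over each interval. A hedged single-coordinate sketch, with a hypothetical rate matrix whose rows sum to zero:

```python
import numpy as np

def euler_step(state, rate_matrix, h, rng=None):
    """One Euler step of a continuous-time Markov chain sampler:
    first-order transition probabilities I + h*R, evaluated at the
    current state, clipped and renormalized for numerical safety."""
    rng = np.random.default_rng(rng)
    p = h * np.asarray(rate_matrix, dtype=float)[state]
    p[state] += 1.0
    p = np.clip(p, 0.0, None)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))
```

Because the rates are held fixed within the step, the approximation degrades for large h, which is the source of the discretization error the corrected samplers target.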
[LG-140] Simulation-based Bayesian inference with ameliorative learned summary statistics – Part I
链接: https://arxiv.org/abs/2601.22441
作者: Getachew K. Befekadu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 13 pages
Abstract:This paper, which is Part 1 of a two-part paper series, considers simulation-based inference with learned summary statistics, in which such a learned summary statistic serves as an empirical likelihood with ameliorative effects in the Bayesian setting, when the exact likelihood function associated with the observation data and the simulation model is difficult to obtain in closed form or computationally intractable. In particular, a transformation technique which leverages the Cressie-Read discrepancy criterion under moment restrictions is used for summarizing the learned statistics between the observation data and the simulation outputs, while preserving the statistical power of the inference. Here, such a transformation of data-to-learned summary statistics also allows the simulation outputs to be conditioned on the observation data, so that the inference task can be performed over certain sample sets of the observation data that are considered empirically relevant or believed to be of particular importance. Moreover, the simulation-based inference framework discussed in this paper can be extended further to handle weakly dependent observation data. Finally, we remark that such an inference framework is suitable for implementation in distributed computing, i.e., computational tasks involving both the data-to-learned summary statistics and the Bayesian inference problem can be posed as a unified distributed inference problem that exploits distributed optimization and MCMC algorithms for supporting large datasets associated with complex simulation models.
[LG-141] Minimal-Action Discrete Schrödinger Bridge Matching for Peptide Sequence Design
链接: https://arxiv.org/abs/2601.22408
作者: Shrey Goel,Pranam Chatterjee
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:
Abstract:Generative modeling of peptide sequences requires navigating a discrete and highly constrained space in which many intermediate states are chemically implausible or unstable. Existing discrete diffusion and flow-based methods rely on reversing fixed corruption processes or following prescribed probability paths, which can force generation through low-likelihood regions and require many sampling steps. We introduce Minimal-action discrete Schrödinger Bridge Matching (MadSBM), a rate-based generative framework for peptide design that formulates generation as a controlled continuous-time Markov process on the amino-acid edit graph. To yield probability trajectories that remain near high-likelihood sequence neighborhoods throughout generation, MadSBM 1) defines generation relative to a biologically informed reference process derived from pre-trained protein language model logits and 2) learns a time-dependent control field that biases transition rates to produce low-action transport paths from a masked prior to the data distribution. Finally, we introduce guidance into the MadSBM sampling procedure towards a specific functional objective, expanding the design space of therapeutic peptides; to our knowledge, this represents the first application of discrete classifier guidance to Schrödinger bridge-based generative models.
[LG-142] It's all the (Exponential) Family: An Equivalence between Maximum Likelihood Estimation and Control Variates for Sketching Algorithms
链接: https://arxiv.org/abs/2601.22378
作者: Keegan Kang,Kerong Wang,Ding Zhang,Rameshwar Pratap,Bhisham Dev Verma,Benedict H.W. Wong
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注: 36 pages, 15 figures
Abstract:Maximum likelihood estimators (MLE) and control variate estimators (CVE) have been used in conjunction with known information across sketching algorithms and applications in machine learning. We prove that under certain conditions in an exponential family, an optimal CVE will achieve the same asymptotic variance as the MLE, giving an Expectation-Maximization (EM) algorithm for the MLE. Experiments show the EM algorithm is faster and numerically stable compared to other root-finding algorithms for the MLE for the bivariate Normal distribution, and we expect this to hold across distributions satisfying these conditions. We show how the EM algorithm enables reproducibility for algorithms using the MLE / CVE, and demonstrate how it finds the MLE when the CV weights are known.
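The CVE side of the equivalence is the classical optimal linear control variate. A minimal sketch, assuming a scalar control g whose mean is known exactly:

```python
import numpy as np

def control_variate_estimate(f_vals, g_vals, g_mean):
    """Estimate E[f] from paired samples (f, g) using a control variate
    g with known mean g_mean, with the variance-minimizing weight
    c* = Cov(f, g) / Var(g)."""
    c = np.cov(f_vals, g_vals)[0, 1] / np.var(g_vals, ddof=1)
    return float(np.mean(f_vals) - c * (np.mean(g_vals) - g_mean))
```

When f is exactly linear in g, the estimator has zero variance: it returns the true mean regardless of the sample.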
[LG-143] Amortized Simulation-Based Inference in Generalized Bayes via Neural Posterior Estimation
链接: https://arxiv.org/abs/2601.22367
作者: Shiyi Sun,Geoff K. Nicholls,Jeong Eun Lee
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Generalized Bayesian Inference (GBI) tempers a loss with a temperature \beta > 0 to mitigate overconfidence and improve robustness under model misspecification, but existing GBI methods typically rely on costly MCMC or SDE-based samplers and must be re-run for each new dataset and each \beta value. We give the first fully amortized variational approximation to the tempered posterior family p_\beta(\theta \mid x) \propto \pi(\theta)\,p(x \mid \theta)^\beta by training a single (x,\beta) -conditioned neural posterior estimator q_\phi(\theta \mid x,\beta) that enables sampling in a single forward pass, without simulator calls or inference-time MCMC. We introduce two complementary training routes: (i) synthesize off-manifold samples (\theta,x) \sim \pi(\theta)\,p(x \mid \theta)^\beta and (ii) reweight a fixed base dataset \pi(\theta)\,p(x \mid \theta) using self-normalized importance sampling (SNIS). We show that the SNIS-weighted objective provides a consistent forward-KL fit to the tempered posterior with finite weight variance. Across four standard simulation-based inference (SBI) benchmarks, including the chaotic Lorenz-96 system, our \beta -amortized estimator achieves competitive posterior approximations in standard two-sample metrics, matching non-amortized MCMC-based power-posterior samplers over a wide range of temperatures.
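The SNIS route (ii) reduces to a log-space one-liner: base draws from prior-times-likelihood carry weight proportional to p(x|theta)^(beta-1). A hedged sketch:

```python
import numpy as np

def snis_weights(loglik, beta):
    """Self-normalized importance weights for retargeting base draws
    (theta, x) ~ prior * likelihood at the tempered posterior with
    temperature beta: w_i proportional to p(x_i | theta_i)^(beta - 1),
    computed stably in log space."""
    logw = (beta - 1.0) * np.asarray(loglik, dtype=float)
    logw -= logw.max()   # guard against overflow before exponentiating
    w = np.exp(logw)
    return w / w.sum()
```

At beta = 1 the weights are uniform (the base samples already target the posterior); beta < 1 downweights high-likelihood draws, flattening the fit.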
[LG-144] Dependence-Aware Label Aggregation for LLM -as-a-Judge via Ising Models
链接: https://arxiv.org/abs/2601.22336
作者: Krishnakumar Balasubramanian,Aleksandr Podkopaev,Shiva Prasad Kasiviswanathan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:Large-scale AI evaluation increasingly relies on aggregating binary judgments from K annotators, including LLMs used as judges. Most classical methods, e.g., Dawid-Skene or (weighted) majority voting, assume annotators are conditionally independent given the true label Y \in \{0,1\} , an assumption often violated by LLM judges due to shared data, architectures, prompts, and failure modes. Ignoring such dependencies can yield miscalibrated posteriors and even confidently incorrect predictions. We study label aggregation through a hierarchy of dependence-aware models based on Ising graphical models and latent factors. For class-dependent Ising models, the Bayes log-odds is generally quadratic in votes; for class-independent couplings, it reduces to a linear weighted vote with correlation-adjusted parameters. We present finite- K examples showing that methods based on conditional independence can flip the Bayes label despite matching per-annotator marginals. We prove separation results demonstrating that these methods remain strictly suboptimal as the number of judges grows, incurring nonvanishing excess risk under latent factors. Finally, we evaluate the proposed method on three real-world datasets, demonstrating improved performance over the classical baselines.
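The conditional-independence baseline that the paper proves suboptimal is the log-odds weighted vote. A minimal sketch (per-judge accuracies are assumed known here, purely for illustration):

```python
import numpy as np

def weighted_majority_vote(votes, accuracies):
    """Aggregate binary votes with log-odds weights
    w_k = log(a_k / (1 - a_k)); this rule is Bayes-optimal only when
    judges are conditionally independent given the true label Y."""
    a = np.asarray(accuracies, dtype=float)
    w = np.log(a / (1.0 - a))
    signs = np.where(np.asarray(votes) == 1, 1.0, -1.0)
    return 1 if float(np.dot(w, signs)) >= 0.0 else 0
```

Under correlated judges (e.g. LLMs sharing training data), the abstract's Ising analysis shows the Bayes rule instead acquires quadratic interaction terms that this linear score cannot express.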
[LG-145] Adaptive Benign Overfitting (ABO): Overparameterized RLS for Online Learning in Non-stationary Time-series
链接: https://arxiv.org/abs/2601.22200
作者: Luis Ontaneda Mijares,Nick Firoozye
类目: atistical Finance (q-fin.ST); Machine Learning (cs.LG); Mathematical Software (cs.MS); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注: 32 pages, 3 figures, 10 tables
Abstract:Overparameterized models have recently challenged conventional learning theory by exhibiting improved generalization beyond the interpolation limit, a phenomenon known as benign overfitting. This work introduces Adaptive Benign Overfitting (ABO), extending the recursive least-squares (RLS) framework to this regime through a numerically stable formulation based on orthogonal-triangular updates. A QR-based exponentially weighted RLS (QR-EWRLS) algorithm is introduced, combining random Fourier feature mappings with forgetting-factor regularization to enable online adaptation under non-stationary conditions. The orthogonal decomposition prevents the numerical divergence associated with covariance-form RLS while retaining adaptability to evolving data distributions. Experiments on nonlinear synthetic time series confirm that the proposed approach maintains bounded residuals and stable condition numbers while reproducing the double-descent behavior characteristic of overparameterized models. Applications to forecasting foreign exchange and electricity demand show that ABO is highly accurate (comparable to baseline kernel methods) while achieving speed improvements of between 20 and 40 percent. The results provide a unified view linking adaptive filtering, kernel approximation, and benign overfitting within a stable online learning framework.
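The random Fourier feature layer feeding the QR-EWRLS update can be sketched as the standard RBF-kernel feature map; `gamma` and the feature count `D` below are illustrative choices, not the paper's configuration.

```python
import numpy as np

def rff_map(X, D, gamma=1.0, rng=None):
    """Map n x d inputs to D random Fourier features whose inner
    products approximate the RBF kernel exp(-gamma * ||x - y||^2);
    frequencies are drawn from the kernel's Gaussian spectral density."""
    rng = np.random.default_rng(rng)
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(X.shape[1], D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)
```

Taking D well past the sample size is what puts the recursive least-squares fit in the overparameterized (benign overfitting) regime the abstract studies.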
信息检索
[IR-0] Farewell to Item IDs: Unlocking the Scaling Potential of Large Ranking Models via Semantic Tokens
链接: https://arxiv.org/abs/2601.22694
作者: Zhen Zhao,Tong Zhang,Jie Xu,Qingliang Cai,Qile Zhang,Leyuan Yang,Daorui Xiao,Xiaojia Chang
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Recent studies on scaling up ranking models have achieved substantial improvements for recommendation systems and search engines. However, most large-scale ranking systems rely on item IDs, where each item is treated as an independent categorical symbol and mapped to a learned embedding. As items rapidly appear and disappear, these embeddings become difficult to train and maintain. This instability impedes effective learning of neural network parameters and limits the scalability of ranking models. In this paper, we show that semantic tokens possess greater scaling potential compared to item IDs. Our proposed framework TRM improves the token generation and application pipeline, leading to a 33% reduction in sparse storage while achieving a 0.85% AUC increase. Extensive experiments further show that TRM consistently outperforms state-of-the-art models as model capacity scales. Finally, TRM has been successfully deployed on large-scale personalized search engines, yielding 0.26% and 0.75% improvements on user active days and change query ratio respectively through A/B tests.
[IR-1] PersonaAct: Simulating Short-Video Users with Personalized Agents for Counterfactual Filter Bubble Auditing
链接: https://arxiv.org/abs/2601.22547
作者: Shilong Zhao,Qinggang Yang,Zhiyi Yin,Xiaoshi Wang,Zhenxing Chen,Du Su,Xueqi Cheng
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Short-video platforms rely on personalized recommendation, raising concerns about filter bubbles that narrow content exposure. Auditing such phenomena at scale is challenging because real user studies are costly and privacy-sensitive, and existing simulators fail to reproduce realistic behaviors due to their reliance on textual signals and weak personalization. We propose PersonaAct, a framework for simulating short-video users with persona-conditioned multimodal agents trained on real behavioral traces for auditing filter bubbles in breadth and depth. PersonaAct synthesizes interpretable personas through automated interviews combining behavioral analysis with structured questioning, then trains agents on multimodal observations using supervised fine-tuning and reinforcement learning. We deploy trained agents for filter bubble auditing and evaluate bubble breadth via content diversity and bubble depth via escape potential. The evaluation demonstrates substantial improvements in fidelity over generic LLM baselines, enabling realistic behavior reproduction. Results reveal significant content narrowing over interaction. However, we find that Bilibili demonstrates the strongest escape potential. We release the first open multimodal short-video dataset and code to support reproducible auditing of recommender systems.
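The paper's "bubble breadth via content diversity" can be illustrated with a simple entropy-based audit metric. The windowed Shannon entropy below is a generic stand-in for illustration, not PersonaAct's published measure:

```python
import numpy as np

def category_entropy(categories):
    """Shannon entropy (bits) of the category distribution in one
    interaction window; higher means broader content exposure."""
    _, counts = np.unique(np.asarray(categories), return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def bubble_breadth(windows):
    """Per-window diversity trajectory over a simulated session;
    a downward trend indicates content narrowing over interaction."""
    return [category_entropy(w) for w in windows]
```

Running such a trajectory over an agent's simulated feed is one way to quantify the "significant content narrowing over interaction" that the abstract reports.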
[IR-2] SCaLRec: Semantic Calibration for LLM -enabled Cloud-Device Sequential Recommendation
链接: https://arxiv.org/abs/2601.22543
作者: Ruiqi Zheng,Jinli Cao,Jiao Yin,Hongzhi Yin
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Cloud-device collaborative recommendation partitions computation across the cloud and user devices: the cloud provides semantic user modeling, while the device leverages recent interactions and cloud semantic signals for privacy-preserving, responsive reranking. With large language models (LLMs) on the cloud, semantic user representations can improve sequential recommendation by capturing high-level intent. However, regenerating such representations via cloud LLM inference for every request is often infeasible at real-world scale. As a result, on-device reranking commonly reuses a cached cloud semantic user embedding across requests. We empirically identify a cloud semantic staleness effect: reused embeddings become less aligned with the user’s latest interactions, leading to measurable ranking degradation. Most existing LLM-enabled cloud-device recommenders are typically designed around on-demand cloud semantics, either by assuming low-latency cloud LLM access or by regenerating semantic embeddings per request. When per-request regeneration is infeasible and cached semantics must be reused, two technical challenges arise: (1) deciding when cached cloud semantics remain useful for on-device reranking, and (2) maintaining ranking quality when the cloud LLM cannot be invoked and only cached semantics are available. To address this gap, we introduce Semantic Calibration for LLM-enabled Cloud-Device Recommendation (SCaLRec). First, it estimates the reliability of cached semantics under the user’s latest interactions. Second, an on-device semantic calibration module adjusts the cached semantic embedding using up-to-date interaction evidence, without per-request cloud LLM involvement. Experiments on real-world datasets show that SCaLRec consistently improves recommendation performance over strong baselines under cloud semantic staleness.
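The two on-device steps (estimating the reliability of the cached cloud embedding, then calibrating it with recent interaction evidence) can be sketched with placeholder rules. The cosine-based reliability score and convex blending below are illustrative assumptions, not SCaLRec's actual formulas:

```python
import numpy as np

def reliability(cached, recent_item_embs):
    """Estimate how well a cached cloud semantic embedding still
    matches the user's latest on-device interactions: cosine
    similarity against the mean recent item embedding, mapped to
    [0, 1] so that 1.0 means 'fully fresh'."""
    recent = recent_item_embs.mean(axis=0)
    cos = cached @ recent / (np.linalg.norm(cached)
                             * np.linalg.norm(recent) + 1e-8)
    return 0.5 * (cos + 1.0)

def calibrate(cached, recent_item_embs, alpha=0.5):
    """Blend the cached embedding toward recent interaction evidence
    in proportion to its estimated staleness, with no cloud call."""
    r = reliability(cached, recent_item_embs)
    recent = recent_item_embs.mean(axis=0)
    out = r * cached + (1.0 - r) * (alpha * cached + (1.0 - alpha) * recent)
    return out / (np.linalg.norm(out) + 1e-8)
```

On device, reranking would then score candidates against the calibrated embedding (e.g. by dot product), so a stale cache degrades gracefully instead of silently mis-ranking.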
[IR-3] FITMM: Adaptive Frequency-Aware Multimodal Recommendation via Information-Theoretic Representation Learning
链接: https://arxiv.org/abs/2601.22498
作者: Wei Yang,Rui Zhong,Yiqun Chen,Shixuan Li,Heng Ping,Chi Lu,Peng Jiang
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Multimodal recommendation aims to enhance user preference modeling by leveraging rich item content such as images and text. Yet dominant systems fuse modalities in the spatial domain, obscuring the frequency structure of signals and amplifying misalignment and redundancy. We adopt a spectral information-theoretic view and show that, under an orthogonal transform that approximately block-diagonalizes bandwise covariances, the Gaussian Information Bottleneck objective decouples across frequency bands, providing a principled basis for a separate-then-fuse paradigm. Building on this foundation, we propose FITMM, a Frequency-aware Information-Theoretic framework for multimodal recommendation. FITMM constructs graph-enhanced item representations, performs modality-wise spectral decomposition to obtain orthogonal bands, and forms lightweight within-band multimodal components. A residual, task-adaptive gate aggregates bands into the final representation. To control redundancy and improve generalization, we regularize training with a frequency-domain IB term that allocates capacity across bands (Wiener-like shrinkage with shut-off of weak bands). We further introduce a cross-modal spectral consistency loss that aligns modalities within each band. The model is jointly optimized with the standard recommendation loss. Extensive experiments on three real-world datasets demonstrate that FITMM consistently and significantly outperforms advanced baselines.
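The separate-then-fuse pipeline described above (an orthogonal transform splits each representation into frequency bands, and a residual gate re-aggregates them) can be sketched compactly. The paper's transform and gating are not specified here; the sketch uses an orthonormal DCT-II as the orthogonal transform and a softmax gate, both of which are illustrative choices:

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis; rows are frequency components.
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (m + 0.5) * k / n)
    C[0] /= np.sqrt(2.0)
    return C

def band_decompose(emb, n_bands=3):
    """Split an embedding into orthogonal frequency bands by zeroing
    DCT coefficients outside each band and transforming back.
    Because the coefficient sets are disjoint, the bands are mutually
    orthogonal and sum exactly to the original embedding."""
    n = emb.shape[0]
    C = dct_matrix(n)
    coef = C @ emb
    edges = np.linspace(0, n, n_bands + 1).astype(int)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        masked = np.zeros_like(coef)
        masked[lo:hi] = coef[lo:hi]
        bands.append(C.T @ masked)       # C orthonormal, so C^{-1} = C^T
    return np.stack(bands)

def gated_fuse(bands, gates):
    """Residual, band-wise gated aggregation with softmax gates."""
    g = np.exp(gates) / np.exp(gates).sum()
    return bands.sum(axis=0) + (g[:, None] * bands).sum(axis=0)
```

The orthogonality of the bands is what makes a band-wise decoupling of the IB objective plausible: capacity assigned to one band (or a shut-off of a weak band) does not interfere with the others.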
[IR-4] Do AI Overviews Benefit Search Engines? An Ecosystem Perspective
链接: https://arxiv.org/abs/2601.22493
作者: Yihang Wu,Jiajun Tang,Jinfei Liu,Haifeng Xu,Fan Yao
类目: Computer Science and Game Theory (cs.GT); Information Retrieval (cs.IR)
*备注:
Abstract:The integration of AI Overviews into search engines enhances user experience but diverts traffic from content creators, potentially discouraging high-quality content creation and causing user attrition that undermines long-term search engine profit. To address this issue, we propose a game-theoretic model of creator competition with costly effort, characterize equilibrium behavior, and design two incentive mechanisms: a citation mechanism that references sources within an AI Overview, and a compensation mechanism that offers monetary rewards to creators. For both cases, we provide structural insights and near-optimal profit-maximizing mechanisms. Evaluations on real click data show that although AI Overviews harm long-term search engine profit, interventions based on our proposed mechanisms can increase long-term profit across a range of realistic scenarios, pointing toward a more sustainable trajectory for AI-enhanced search ecosystems.



