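This post lists the latest papers retrieved from Arxiv.org on 2026-01-29. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.

Note: The daily paper data is fetched from Arxiv.org and updated automatically at around 12:00 each day.

Reminder: If you would like to receive the daily paper data by email, leave your email address in the comments.

Contents

Overview (2026-01-29)

531 papers were updated today, including:

  • Natural language processing: 97 papers (Computation and Language (cs.CL))
  • Artificial intelligence: 150 papers (Artificial Intelligence (cs.AI))
  • Computer vision: 78 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine learning: 170 papers (Machine Learning (cs.LG))

Natural Language Processing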

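[NLP-0] Evolutionary Strategies lead to Catastrophic Forgetting in LLMs

Quick read: This paper addresses the lack of continual learning in deployed AI systems, in particular the bottleneck that gradient-based training of large language models (LLMs) consumes too much memory to support online continual learning. It studies Evolutionary Strategies (ES) as a gradient-free alternative and systematically analyzes their forgetting curves as the number of update steps grows. ES reaches performance close to GRPO (Group Relative Policy Optimization) on math and reasoning tasks, but the gains come with significant forgetting of prior abilities, which limits its use for online training. The key finding is that ES updates are far less sparse than GRPO updates and have a much larger ℓ₂ norm, which explains the more severe damage to previously learned abilities and highlights forgetting as an open problem for gradient-free algorithms in continual learning.

Link: https://arxiv.org/abs/2601.20861
Authors: Immanuel Abdi, Akshat Gupta, Micah Mok, Alexander Lu, Nicholas Lee, Gopala Anumanchipalli
Affiliation: UC Berkeley
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)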

Abstract:One of the biggest missing capabilities in current AI systems is the ability to learn continuously after deployment. Implementing such continually learning systems has several challenges, one of which is the large memory requirement of gradient-based algorithms that are used to train state-of-the-art LLMs. Evolutionary Strategies (ES) have recently re-emerged as a gradient-free alternative to traditional learning algorithms and have shown encouraging performance on specific tasks in LLMs. In this paper, we perform a comprehensive analysis of ES and specifically evaluate its forgetting curves when training for an increasing number of update steps. We first find that ES is able to reach performance numbers close to GRPO for math and reasoning tasks with a comparable compute budget. However, and most importantly for continual learning, the performance gains in ES are accompanied by significant forgetting of prior abilities, limiting its applicability for training models online. We also explore the reason behind this behavior and show that the updates made using ES are much less sparse and have orders of magnitude larger ℓ₂ norm compared to corresponding GRPO updates, explaining the contrasting forgetting curves between the two algorithms. With this study, we aim to highlight the issue of forgetting in gradient-free algorithms like ES and hope to inspire future work to mitigate these issues.
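The two quantities the abstract compares, update sparsity and update ℓ₂ norm, can be illustrated with a minimal sketch. The function name `update_stats`, the tolerance `tol`, and the toy weight vectors are assumptions made for illustration; the paper's actual measurement procedure is not described here.

```python
import math

def update_stats(before, after, tol=1e-8):
    """Return (sparsity, l2_norm) of the parameter update after - before."""
    delta = [a - b for a, b in zip(after, before)]
    # Sparsity: fraction of parameters whose change is numerically zero.
    unchanged = sum(1 for d in delta if abs(d) <= tol)
    sparsity = unchanged / len(delta)
    l2_norm = math.sqrt(sum(d * d for d in delta))
    return sparsity, l2_norm

# Toy, made-up weights: a sparse small update versus a dense large one.
base = [0.0, 1.0, -1.0, 0.5]
sparse_small = [0.0, 1.0, -1.0, 0.6]
dense_large = [2.0, 3.0, 1.0, 2.5]

print(update_stats(base, sparse_small))
print(update_stats(base, dense_large))
```

In this toy case the first update touches one of four parameters and has a small norm, while the second touches all four and has a norm of 4, which is the qualitative contrast the abstract reports between GRPO-like and ES-like updates.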

[NLP-1] When Flores Bloomz Wrong: Cross-Direction Contamination in Machine Translation Evaluation EACL2026

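Quick read: This paper addresses inflated scores caused by training-data contamination in multilingual large language models (LLMs), specifically benchmark contamination in machine translation, where a model memorizes translation pairs from its training data and this memorization passes for generalization, yielding unrealistically high test scores. The key to the approach is using the FLORES-200 multilingual translation benchmark as a diagnostic and comparing a contaminated model (Bloomz) with an uncontaminated control (Llama). The comparison shows that contamination can spread across translation directions: target-side memorization can artificially boost performance in directions the model never saw. Recall of memorized references often persists under source-side perturbations such as paraphrasing and named-entity replacement, yet replacing named entities consistently lowers BLEU, which suggests an effective probe for memorization.

Link: https://arxiv.org/abs/2601.20858
Authors: David Tan, Pinzhen Chen, Josef van Genabith, Koel Dutta Chowdhury
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Notes: 5 pages of content, 15 total. 5 figures, 12 tables total. Accepted to EACL 2026 main conference. Code can be found here: this http URL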

Abstract:Large language models (LLMs) can be benchmark-contaminated, resulting in inflated scores that mask memorization as generalization, and in multilingual settings, this memorization can even transfer to “uncontaminated” languages. Using the FLORES-200 translation benchmark as a diagnostic, we study two 7-8B instruction-tuned multilingual LLMs: Bloomz, which was trained on FLORES, and Llama as an uncontaminated control. We confirm Bloomz’s FLORES contamination and demonstrate that machine translation contamination can be cross-directional, artificially boosting performance in unseen translation directions due to target-side memorization. Further analysis shows that recall of memorized references often persists despite various source-side perturbation efforts like paraphrasing and named entity replacement. However, replacing named entities leads to a consistent decrease in BLEU, suggesting an effective probing method for memorization in contaminated models.

[NLP-2] Reward Models Inherit Value Biases from Pretraining

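Quick read: This paper asks whether reward models (RMs), which are used to align large language models (LLMs) with human values, are influenced by values implicit in the pretrained base model. RMs are fine-tuned on preference data, but they are initialized from LLMs, and how this inheritance shapes their behavior has been unclear. The key to the approach is a systematic analysis of 10 leading open-weight RMs on psycholinguistic corpora. Even with identical preference data and fine-tuning, RMs built on different base models differ markedly along the "agency" and "communion" dimensions: Llama RMs prefer agency and Gemma RMs prefer communion. The authors trace these differences to the logits of the pretrained and instruction-tuned models and derive implicit reward scores that reproduce the same value tendencies. Experiments show the effect is repeatable and durable, so RM outputs are shaped not only by preference data but also by the base model, which underscores the importance of safety and alignment work at the pretraining stage.

Link: https://arxiv.org/abs/2601.20838
Authors: Brian Christian, Jessica A. F. Thompson, Elle Michelle Yang, Vincent Adam, Hannah Rose Kirk, Christopher Summerfield, Tsvetomira Dumbalska
Affiliation: University of Oxford; Universitat Pompeu Fabra
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)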

Abstract:Reward models (RMs) are central to aligning large language models (LLMs) with human values but have received less attention than pre-trained and post-trained LLMs themselves. Because RMs are initialized from LLMs, they inherit representations that shape their behavior, but the nature and extent of this influence remain understudied. In a comprehensive study of 10 leading open-weight RMs using validated psycholinguistic corpora, we show that RMs exhibit significant differences along multiple dimensions of human value as a function of their base model. Using the “Big Two” psychological axes, we show a robust preference of Llama RMs for “agency” and a corresponding robust preference of Gemma RMs for “communion.” This phenomenon holds even when the preference data and finetuning process are identical, and we trace it back to the logits of the respective instruction-tuned and pre-trained models. These log-probability differences themselves can be formulated as an implicit RM; we derive usable implicit reward scores and show that they exhibit the very same agency/communion difference. We run experiments training RMs with ablations for preference data source and quantity, which demonstrate that this effect is not only repeatable but surprisingly durable. Despite RMs being designed to represent human preferences, our evidence shows that their outputs are influenced by the pretrained LLMs on which they are based. This work underscores the importance of safety and alignment efforts at the pretraining stage, and makes clear that open-source developers’ choice of base model is as much a consideration of values as of performance.
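One common way to read a log-probability difference as a reward is to sum, over the tokens of a response, the gap between the tuned model's and the base model's log-probabilities. The sketch below assumes that form; the function name `implicit_reward`, the scaling factor `beta`, and the numbers are hypothetical, and the paper's own derivation may differ.

```python
def implicit_reward(logprobs_tuned, logprobs_base, beta=1.0):
    """Sum of per-token log-probability differences, scaled by beta."""
    if len(logprobs_tuned) != len(logprobs_base):
        raise ValueError("both models must score the same tokens")
    return beta * sum(t - b for t, b in zip(logprobs_tuned, logprobs_base))

# Made-up per-token log-probabilities for the same two-token response.
tuned = [-1.0, -2.0]
base = [-1.5, -2.5]
print(implicit_reward(tuned, base))
```

A positive score means the tuned model assigns the response more probability than the base model does, which is the sense in which the difference acts as a reward signal.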

[NLP-3] Linear representations in language models can change dramatically over a conversation

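Quick read: This paper studies how linear representations of high-level concepts in language models evolve over a conversation, and what that means for interpretability and control (for example, steering). The key finding is that these representations are not static and can change dramatically with conversational context: information represented as factual at the start of a conversation can be represented as non-factual at the end, and vice versa. The changes are content-dependent (conversation-relevant information shifts, while generic information is generally preserved), occur across model families and layers, and do not require on-policy conversations, since replaying a script written by a different model produces similar changes. The results suggest that a model may adjust its internal representations to the role cued by a conversation. This challenges interpretation and control methods that rely on static features or fixed directions, and opens new research directions on how models adapt to context.

Link: https://arxiv.org/abs/2601.20834
Authors: Andrew Kyle Lampinen, Yuxuan Li, Eghbal Hosseini, Sangnie Bhardwaj, Murray Shanahan
Affiliation: Google DeepMind
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)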

Abstract:Language model representations often contain linear directions that correspond to high-level concepts. Here, we study the dynamics of these representations: how representations evolve along these dimensions within the context of (simulated) conversations. We find that linear representations can change dramatically over a conversation; for example, information that is represented as factual at the beginning of a conversation can be represented as non-factual at the end and vice versa. These changes are content-dependent; while representations of conversation-relevant information may change, generic information is generally preserved. These changes are robust even for dimensions that disentangle factuality from more superficial response patterns, and occur across different model families and layers of the model. These representation changes do not require on-policy conversations; even replaying a conversation script written by an entirely different model can produce similar changes. However, adaptation is much weaker from simply having a sci-fi story in context that is framed more explicitly as such. We also show that steering along a representational direction can have dramatically different effects at different points in a conversation. These results are consistent with the idea that representations may evolve in response to the model playing a particular role that is cued by a conversation. Our findings may pose challenges for interpretability and steering – in particular, they imply that it may be misleading to use static interpretations of features or directions, or probes that assume a particular range of features consistently corresponds to a particular ground-truth value. However, these types of representational dynamics also point to exciting new research directions for understanding how models adapt to context.
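Reading out and steering along a linear direction can be sketched in a few lines. The two-dimensional hidden states, the direction, and the helper names `project` and `steer` are toy assumptions, not the probes or models used in the paper.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project(hidden, direction):
    """Scalar readout of a hidden state along the normalised direction."""
    norm = dot(direction, direction) ** 0.5
    return dot(hidden, direction) / norm

def steer(hidden, direction, alpha):
    """Shift a hidden state by alpha times the direction vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

factual_direction = [1.0, 0.0]   # hypothetical "factuality" direction
early_turn = [2.0, 1.0]          # made-up hidden state early in a conversation
late_turn = [-1.0, 1.0]          # made-up hidden state for the same fact later on

print(project(early_turn, factual_direction))
print(project(late_turn, factual_direction))
print(steer(late_turn, factual_direction, 3.0))
```

Here the same fixed direction reads out a positive value early and a negative value late, which is the kind of shift that makes a static interpretation of a probe misleading.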

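[NLP-4] Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning

Quick read: This paper addresses the stalling of Reinforcement Learning with Verifiable Rewards (RLVR) training on saturated problems, where incorrect rollouts are rare. The core challenge is poor access to informative failures: the learning signal exists, but the key states along incorrect reasoning paths are rarely reached during standard rollouts. The key to the solution is failure-prefix conditioning: prefixes taken from rare incorrect reasoning trajectories are used as the conditioning context for training, which reallocates exploration so that the model visits failure-prone states more often. The method matches the gains of training on medium-difficulty problems while preserving token efficiency, and iteratively refreshing the failure prefixes yields further gains after performance plateaus.

Link: https://arxiv.org/abs/2601.20829
Authors: Minwu Kim, Safal Shrestha, Keith Ross
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 16 pages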

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved the reasoning abilities of large language models (LLMs), yet training often stalls as problems become saturated. We identify the core challenge as the poor accessibility of informative failures: learning signals exist but are rarely encountered during standard rollouts. To address this, we propose failure-prefix conditioning, a simple and effective method for learning from saturated problems. Rather than starting from the original question, our approach reallocates exploration by conditioning training on prefixes derived from rare incorrect reasoning trajectories, thereby exposing the model to failure-prone states. We observe that failure-prefix conditioning yields performance gains matching those of training on medium-difficulty problems, while preserving token efficiency. Furthermore, we analyze the model’s robustness, finding that our method reduces performance degradation under misleading failure prefixes, albeit with a mild trade-off in adherence to correct early reasoning. Finally, we demonstrate that an iterative approach, which refreshes failure prefixes during training, unlocks additional gains after performance plateaus. Overall, our results suggest that failure-prefix conditioning offers an effective pathway to extend RLVR training on saturated problems.
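The data-construction step behind failure-prefix conditioning can be sketched as follows. The function name, the fixed `fraction` used to cut each prefix, and the plain string concatenation are assumptions for illustration; the abstract does not specify how prefixes are chosen or formatted.

```python
def failure_prefix_prompts(question, rollouts, fraction=0.5):
    """Build prompts that start from prefixes of incorrect rollouts.

    rollouts: list of (reasoning_tokens, is_correct) pairs.
    """
    prompts = []
    for tokens, is_correct in rollouts:
        if is_correct:
            continue  # only the rare failures supply prefixes
        cut = max(1, int(len(tokens) * fraction))
        prompts.append(question + " " + " ".join(tokens[:cut]))
    return prompts

rollouts = [
    (["step1", "step2", "wrong3", "wrong4"], False),
    (["step1", "step2"], True),
]
print(failure_prefix_prompts("Q:", rollouts))
```

Training then continues from these conditioned prompts rather than from the bare question, so rollouts begin in states where the model is known to go wrong.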

[NLP-5] Structured Semantic Information Helps Retrieve Better Examples for In-Context Learning in Few-Shot Relation Extraction

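Quick read: This paper addresses the performance limits that scarce training examples impose on few-shot relation extraction (FSRE). The core challenge is obtaining additional examples for in-context learning so that the model better understands and generalizes to the target relation. The key to the solution is a new example-selection strategy based on syntactic-semantic structural similarity: additional examples are selected by how closely their underlying structure matches that of the provided one-shot example, which yields word choices and sentence structures complementary to LLM-generated examples. A hybrid system combining both strategies consistently outperforms the alternatives, reaches state-of-the-art performance on FS-TACRED with strong gains on a customized FewRel subset, and transfers well across datasets and LLM families (Qwen and Gemma).

Link: https://arxiv.org/abs/2601.20803
Authors: Aunabil Chakma, Mihai Surdeanu, Eduardo Blanco
Affiliation: University of Arizona
Categories: Computation and Language (cs.CL)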

Abstract:This paper presents several strategies to automatically obtain additional examples for in-context learning of one-shot relation extraction. Specifically, we introduce a novel strategy for example selection, in which new examples are selected based on the similarity of their underlying syntactic-semantic structure to the provided one-shot example. We show that this method results in complementary word choices and sentence structures when compared to LLM-generated examples. When these strategies are combined, the resulting hybrid system achieves a more holistic picture of the relations of interest than either method alone. Our framework transfers well across datasets (FS-TACRED and FS-FewRel) and LLM families (Qwen and Gemma). Overall, our hybrid selection method consistently outperforms alternative strategies and achieves state-of-the-art performance on FS-TACRED and strong gains on a customized FewRel subset.
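Selecting examples by structural similarity to a one-shot seed can be sketched with a simple set-overlap score. Jaccard similarity over sets of structure labels is a stand-in chosen for illustration, and the labels and function names are hypothetical; the abstract does not state which similarity measure or structural representation the authors use.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def select_examples(seed_structure, candidates, k=2):
    """Rank candidates by structural overlap with the one-shot seed.

    candidates: list of (sentence, structure_labels) pairs.
    """
    ranked = sorted(candidates, key=lambda c: jaccard(seed_structure, c[1]), reverse=True)
    return [sentence for sentence, _ in ranked[:k]]

seed = ["nsubj", "obj", "ARG0"]  # made-up labels for the one-shot example
pool = [
    ("s1", ["nsubj", "obj", "ARG0"]),
    ("s2", ["nsubj", "obl"]),
    ("s3", ["nsubj", "obj"]),
]
print(select_examples(seed, pool))
```

Because the score ignores surface words, the selected sentences can differ lexically from the seed while sharing its structure, which is the complementarity the abstract describes.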

[NLP-6] Dissecting Multimodal In-Context Learning: Modality Asymmetries and Circuit Dynamics in modern Transformers

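Quick read: This paper investigates how Transformer-based multimodal large language models learn to associate information across modalities from in-context examples during in-context learning (ICL). The key to the approach is a controlled experimental design: small Transformers are trained on synthetic classification tasks, so data statistics and model architecture can be manipulated precisely. The study finds that (1) Rotary Position Embeddings (RoPE) raise the data-complexity threshold for unimodal ICL; (2) the multimodal setting shows a marked learning asymmetry: when the model is pretrained on high-diversity data from a primary modality, surprisingly low data complexity in the secondary modality is enough for multimodal ICL to emerge; and (3) mechanistic analysis shows that both settings rely on an induction-style mechanism that copies labels from matching in-context exemplars, with multimodal training refining and extending these circuits across modalities. The work provides a mechanistic foundation for understanding multimodal ICL and a controlled testbed for future research.

Link: https://arxiv.org/abs/2601.20796
Authors: Yiran Huang, Karsten Roth, Quentin Bouniot, Wenjia Xu, Zeynep Akata
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)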

Abstract:Transformer-based multimodal large language models often exhibit in-context learning (ICL) abilities. Motivated by this phenomenon, we ask: how do transformers learn to associate information across modalities from in-context examples? We investigate this question through controlled experiments on small transformers trained on synthetic classification tasks, enabling precise manipulation of data statistics and model architecture. We begin by revisiting core principles of unimodal ICL in modern transformers. While several prior findings replicate, we find that Rotary Position Embeddings (RoPE) increases the data complexity threshold for ICL. Extending to the multimodal setting reveals a fundamental learning asymmetry: when pretrained on high-diversity data from a primary modality, surprisingly low data complexity in the secondary modality suffices for multimodal ICL to emerge. Mechanistic analysis shows that both settings rely on an induction-style mechanism that copies labels from matching in-context exemplars; multimodal training refines and extends these circuits across modalities. Our findings provide a mechanistic foundation for understanding multimodal ICL in modern transformers and introduce a controlled testbed for future investigation.
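The behaviour of an induction-style mechanism, copying the label of a matching in-context exemplar, can be caricatured in plain code. This is a behavioural sketch only, with hypothetical item names; it says nothing about how the attention circuits in the paper implement the lookup.

```python
def induction_copy(context, query):
    """Return the label of the most recent in-context exemplar equal to the query."""
    for item, label in reversed(context):
        if item == query:
            return label
    return None  # no matching exemplar in context

context = [("img_a", "cat"), ("img_b", "dog"), ("img_a", "cat")]
print(induction_copy(context, "img_b"))
print(induction_copy(context, "img_c"))
```

The lookup succeeds only when the query matches something already in context, which is why such a mechanism supports in-context classification without storing the item-label mapping in the weights.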

[NLP-7] Jurisdiction as Structural Barrier: How Privacy Policy Organization May Reduce Visibility of Substantive Disclosures

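Quick read: This paper addresses a transparency failure in privacy policies caused by document structure, which the author calls jurisdiction-siloed disclosure: specific information about data practices appears only in regional compliance sections (such as "California Residents" or "EU/UK Users"), while general sections describe the same practices in vague language, so users outside regulated jurisdictions may miss information that affects them. The key to the proposed solution is a universal substantive disclosure standard: practices that affect all users belong in the main body of the policy, and regional sections contain only procedural rights information. The proposal is grounded in information foraging theory and in analogous disclosure regimes (securities, truth-in-lending, nutritional labeling), and it targets document architecture rather than omitted content.

Link: https://arxiv.org/abs/2601.20792
Authors: Thomas Brackin
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Notes: 25 pages, 2 figures, 5 tables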

Abstract:Privacy policies are supposed to provide notice. But what if substantive information appears only where users skip it? We identify a structural pattern we call jurisdiction-siloed disclosure: information about data practices appearing in specific, actionable form only within regional compliance sections labeled “California Residents” or “EU/UK Users,” while general sections use vague or qualified language for the same practices. Our audit of 123 major companies identifies 282 potential instances across 77 companies (62.6% of this purposive sample). A conservative estimate restricted to practice categories validated against OPP-115 human annotations finds 138 instances across 54 companies (44%); post-2018 categories central to our findings await independent validation. If users skip jurisdiction-labeled sections as information foraging theory predicts, users outside regulated jurisdictions would receive less specific information about practices affecting them–a transparency failure operating through document architecture rather than omission. We propose universal substantive disclosure: practices affecting all users should appear in the main policy body, with regional sections containing only procedural rights information. This standard finds support in analogous disclosure regimes (securities, truth-in-lending, nutritional labeling) where material information must reach all affected parties. Regulators could operationalize this through the FTC’s “clear and conspicuous” standard and GDPR transparency principles. This work is hypothesis-generating: we establish that the structural pattern exists and ground the transparency concern in behavioral theory, but direct measurement of jurisdiction-specific section skipping remains the critical validation priority. We release our methodology and annotated dataset to enable replication. 
ACM classes: K.4.1; H.5.2. Cite as: arXiv:2601.20792 [cs.CY] (or arXiv:2601.20792v1 [cs.CY] for this version). DOI: https://doi.org/10.48550/arXiv.2601.20792 (arXiv-issued DOI via DataCite, pending registration). Submission history: from Thomas Brackin, [v1] Wed, 28 Jan 2026 17:29:59 UTC (53 KB).
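The percentages quoted in the abstract follow from the audit counts, as this short check shows. The helper name `share` is introduced here for illustration.

```python
def share(part, total):
    """Percentage of `total` represented by `part`."""
    return part / total * 100

AUDITED = 123  # companies in the purposive sample

all_share = share(77, AUDITED)           # companies with any potential instance
conservative_share = share(54, AUDITED)  # OPP-115-validated categories only

print(f"{all_share:.1f}% {conservative_share:.1f}%")  # prints "62.6% 43.9%"
```

The conservative figure of 43.9% matches the 44% reported in the abstract after rounding to a whole percent.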

[NLP-8] SERA: Soft-Verified Efficient Repository Agents

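Quick read: This paper addresses the difficulty of specializing open-weight coding agents to private codebases in practice. In principle such agents can encode repository-specific information in their weights, but the cost and complexity of training have kept this advantage theoretical. The key to the solution is Soft-Verified Efficient Repository Agents (SERA), built on Soft Verified Generation (SVG), a synthetic data generation method that produces thousands of trajectories from a single code repository and needs only supervised fine-tuning (SFT). Reaching equivalent performance is 26x cheaper than with reinforcement learning and 57x cheaper than with previous synthetic data methods, while matching frontier open-weight models, which makes specialization to private codebases economically practical and should accelerate research on open coding agents.

Link: https://arxiv.org/abs/2601.20789
Authors: Ethan Shen, Danny Tormoen, Saurabh Shah, Ali Farhadi, Tim Dettmers
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Notes: 21 main pages, 7 pages appendix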

Abstract:Open-weight coding agents should hold a fundamental advantage over closed-source systems: they can be specialized to private codebases, encoding repository-specific information directly in their weights. Yet the cost and complexity of training has kept this advantage theoretical. We show it is now practical. We present Soft-Verified Efficient Repository Agents (SERA), an efficient method for training coding agents that enables the rapid and cheap creation of agents specialized to private codebases. Using only supervised finetuning (SFT), SERA achieves state-of-the-art results among fully open-source (open data, method, code) models while matching the performance of frontier open-weight models like Devstral-Small-2. Creating SERA models is 26x cheaper than reinforcement learning and 57x cheaper than previous synthetic data methods to reach equivalent performance. Our method, Soft Verified Generation (SVG), generates thousands of trajectories from a single code repository. Combined with cost-efficiency, this enables specialization to private codebases. Beyond repository specialization, we apply SVG to a larger corpus of codebases, generating over 200,000 synthetic trajectories. We use this dataset to provide detailed analysis of scaling laws, ablations, and confounding factors for training coding agents. Overall, we believe our work will greatly accelerate research on open coding agents and showcase the advantage of open-source models that can specialize to private codebases. We release SERA as the first model in Ai2’s Open Coding Agents series, along with all our code, data, and Claude Code integration to support the research community.

[NLP-9] Persona Prompting as a Lens on LLM Social Reasoning EACL

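Quick read: This paper examines the quality of the explanations (rationales) that large language models (LLMs) generate for socially sensitive tasks such as hate speech detection, where rationale quality matters for user trust and model alignment. It finds that persona prompting (PP) improves classification on the most subjective task (hate speech) but degrades rationale quality and does not mitigate the models' underlying demographic biases. The key takeaway is a trade-off: simulated personas fail to align with their real-world demographic counterparts, and models resist significant steering and over-flag content as harmful regardless of PP, so PP should be applied with caution.

Link: https://arxiv.org/abs/2601.20757
Authors: Jing Yang, Moritz Hechtbauer, Elisabeth Khalilov, Evelyn Luise Brinkmann, Vera Schmitt, Nils Feldhus
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Notes: 9 Pages, EACL main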

Abstract:For socially sensitive tasks like hate speech detection, the quality of explanations from Large Language Models (LLMs) is crucial for factors like user trust and model alignment. While Persona prompting (PP) is increasingly used as a way to steer models towards user-specific generation, its effect on model rationales remains underexplored. We investigate how LLM-generated rationales vary when conditioned on different simulated demographic personas. Using datasets annotated with word-level rationales, we measure agreement with human annotations from different demographic groups, and assess the impact of PP on model bias and human alignment. Our evaluation across three LLMs reveals three key findings: (1) PP improves classification on the most subjective task (hate speech) but degrades rationale quality. (2) Simulated personas fail to align with their real-world demographic counterparts, and high inter-persona agreement shows models are resistant to significant steering. (3) Models exhibit consistent demographic biases and a strong tendency to over-flag content as harmful, regardless of PP. Our findings reveal a critical trade-off: while PP can improve classification in socially-sensitive tasks, it often comes at the cost of rationale quality and fails to mitigate underlying biases, urging caution in its application.
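Agreement between a model's word-level rationale and a human annotator's rationale can be measured in several ways. The sketch below uses token-level F1 over the sets of highlighted token positions, which is an assumption made for illustration; the abstract does not name the agreement metric the authors use.

```python
def rationale_f1(model_tokens, human_tokens):
    """Token-level F1 between a model rationale and a human rationale."""
    model, human = set(model_tokens), set(human_tokens)
    overlap = len(model & human)
    if overlap == 0:
        return 0.0
    precision = overlap / len(model)
    recall = overlap / len(human)
    return 2 * precision * recall / (precision + recall)

# Made-up token positions highlighted as the rationale for one post.
model_highlight = [2, 3, 5]
human_highlight = [2, 3]
print(rationale_f1(model_highlight, human_highlight))
```

Computing this score separately against annotators from each demographic group, with and without a persona in the prompt, gives one way to compare how well simulated personas track their real-world counterparts.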

[NLP-10] Like a Therapist But Not: Reddit Narratives of AI in Mental Health Contexts

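Quick read: This paper asks how users outside clinical settings evaluate and relate to generative AI when they use it for emotional support or therapy-like interactions, and how this affects adoption attitudes and relational alignment. The key to the approach is a theory-informed annotation framework grounded in the Technology Acceptance Model and therapeutic alliance theory, combined with a hybrid LLM-human pipeline that analyzes evaluative language, adoption-related attitudes, and relational alignment in 5,126 Reddit posts. Engagement is shaped mainly by narrated outcomes, trust, and response quality rather than by emotional bond alone. Positive sentiment is most strongly associated with task and goal alignment, while companionship-oriented use more often involves misaligned alliances and reported risks such as dependence and symptom escalation. The work shows the value of operationalizing theoretical constructs for studying language technologies in sensitive real-world contexts.

Link: https://arxiv.org/abs/2601.20747
Authors: Elham Aghakhani, Rezvaneh Rezapour
Affiliation: Drexel University
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)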

Abstract:Large language models (LLMs) are increasingly used for emotional support and mental health-related interactions outside clinical settings, yet little is known about how people evaluate and relate to these systems in everyday use. We analyze 5,126 Reddit posts from 47 mental health communities describing experiential or exploratory use of AI for emotional support or therapy. Grounded in the Technology Acceptance Model and therapeutic alliance theory, we develop a theory-informed annotation framework and apply a hybrid LLM-human pipeline to analyze evaluative language, adoption-related attitudes, and relational alignment at scale. Our results show that engagement is shaped primarily by narrated outcomes, trust, and response quality, rather than emotional bond alone. Positive sentiment is most strongly associated with task and goal alignment, while companionship-oriented use more often involves misaligned alliances and reported risks such as dependence and symptom escalation. Overall, this work demonstrates how theory-grounded constructs can be operationalized in large-scale discourse analysis and highlights the importance of studying how users interpret language technologies in sensitive, real-world contexts.

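[NLP-11] QueerGen: How LLMs Reflect Societal Norms on Gender and Sexuality in Sentence Completion Tasks

Quick read: This paper examines how large language models (LLMs) reproduce societal norms, particularly heterocisnormativity, and how these norms become measurable biases in generated text. It compares responses for three subject categories (queer-marked, non-queer-marked, and the normalized "unmarked" category) across model types, namely Masked Language Models (MLMs) and Autoregressive Language Models (ARLMs), and quantifies differences along four dimensions: sentiment, regard, toxicity, and prediction diversity. MLMs produce the least favorable sentiment, higher toxicity, and more negative regard for queer-marked subjects. ARLMs partially mitigate these patterns, while closed-access ARLMs tend to produce more harmful outputs for unmarked subjects. The key point is that bias is not uniform: its form and degree depend on model characteristics, which may redistribute representational harms rather than eliminate them.

Link: https://arxiv.org/abs/2601.20731
Authors: Mae Sosto, Delfina Sol Martinez Pandiani, Laura Hollink
Affiliation: Centrum Wiskunde & Informatica; Universiteit van Amsterdam
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)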

Abstract:This paper examines how Large Language Models (LLMs) reproduce societal norms, particularly heterocisnormativity, and how these norms translate into measurable biases in their text generations. We investigate whether explicit information about a subject’s gender or sexuality influences LLM responses across three subject categories: queer-marked, non-queer-marked, and the normalized “unmarked” category. Representational imbalances are operationalized as measurable differences in English sentence completions across four dimensions: sentiment, regard, toxicity, and prediction diversity. Our findings show that Masked Language Models (MLMs) produce the least favorable sentiment, higher toxicity, and more negative regard for queer-marked subjects. Autoregressive Language Models (ARLMs) partially mitigate these patterns, while closed-access ARLMs tend to produce more harmful outputs for unmarked subjects. Results suggest that LLMs reproduce normative social assumptions, though the form and degree of bias depend strongly on specific model characteristics, which may redistribute, but not eliminate, representational harms.

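[NLP-12] AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts

Quick read: This paper addresses the gap between existing benchmarks and the dynamic interaction that large language models (LLMs) face as they evolve into autonomous agents. Existing benchmarks rely mostly on static retrieval tasks and do not capture non-linear reasoning or iterative feedback. The key to the solution is AgentLongBench, which evaluates agents through simulated environment rollouts based on Lateral Thinking Puzzles and generates interaction trajectories across knowledge-intensive and knowledge-free scenarios. The results expose a weakness in dynamic information synthesis: agents that are adept at static retrieval struggle in workflows that require continual synthesis, and the degradation is driven by the minimum number of tokens required to resolve a query, which explains why dense tool responses are harder than the memory fragmentation typical of long-turn dialogues.

Link: https://arxiv.org/abs/2601.20730
Authors: Shicheng Fang, Yuxin Wang, XiaoRan Liu, Jiahao Lu, Chuanyuan Tan, Xinchi Chen, Yining Zheng, Xuanjing Huang, Xipeng Qiu
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Notes: 26 pages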

Abstract:The evolution of Large Language Models (LLMs) into autonomous agents necessitates the management of extensive, dynamic contexts. Current benchmarks, however, remain largely static, relying on passive retrieval tasks that fail to simulate the complexities of agent-environment interaction, such as non-linear reasoning and iterative feedback. To address this, we introduce AgentLongBench, which evaluates agents through simulated environment rollouts based on Lateral Thinking Puzzles. This framework generates rigorous interaction trajectories across knowledge-intensive and knowledge-free scenarios. Experiments with state-of-the-art models and memory systems (32K to 4M tokens) expose a critical weakness: while adept at static retrieval, agents struggle with the dynamic information synthesis essential for workflows. Our analysis indicates that this degradation is driven by the minimum number of tokens required to resolve a query. This factor explains why the high information density inherent in massive tool responses poses a significantly greater challenge than the memory fragmentation typical of long-turn dialogues.

[NLP-13] Polite But Boring? Trade-offs Between Engagement and Psychological Reactance to Chatbot Feedback Styles

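Quick read: This paper asks how the feedback of a conversational agent in behaviour change interventions can reduce psychological reactance (perceived threats to freedom) while also eliciting surprise and engagement. The study compares three feedback styles: 'Direct', 'Politeness', and 'Verbal Leakage' (slips or disfluencies that reveal a desired behaviour). 'Politeness' lowers reactance and raises behavioural intentions but is perceived as unsurprising and unengaging, whereas 'Verbal Leakage' evokes reactance yet elicits more surprise, engagement, and humour. The findings point to a trade-off between reactance and engagement and suggest that unconventional styles offer promising design alternatives.

Link: https://arxiv.org/abs/2601.20683
Authors: Samuel Rhys Cox, Joel Wester, Niels van Berkel
Affiliation: Aalborg University; University of Copenhagen
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Notes: To appear at ACM CHI 2026. 21 pages, 7 figures, 5 tables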

Abstract:As conversational agents become increasingly common in behaviour change interventions, understanding optimal feedback delivery mechanisms becomes increasingly important. However, choosing a style that both lessens psychological reactance (perceived threats to freedom) while simultaneously eliciting feelings of surprise and engagement represents a complex design problem. We explored how three different feedback styles: ‘Direct’, ‘Politeness’, and ‘Verbal Leakage’ (slips or disfluencies to reveal a desired behaviour) affect user perceptions and behavioural intentions. Matching expectations from literature, the ‘Direct’ chatbot led to lower behavioural intentions and higher reactance, while the ‘Politeness’ chatbot evoked higher behavioural intentions and lower reactance. However, ‘Politeness’ was also seen as unsurprising and unengaging by participants. In contrast, ‘Verbal Leakage’ evoked reactance, yet also elicited higher feelings of surprise, engagement, and humour. These findings highlight that effective feedback requires navigating trade-offs between user reactance and engagement, with novel approaches such as ‘Verbal Leakage’ offering promising alternative design opportunities.

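[NLP-14] Online Density-Based Clustering for Real-Time Narrative Evolution Monitoring

Quick read: This paper addresses the scalability problems that automated narrative intelligence systems for social media monitoring face on continuous data streams. Traditional batch clustering algorithms such as HDBSCAN must be retrained for every time window, which leads to memory constraints, computational inefficiency, and an inability to track evolving narratives in real time. The key to the approach is replacing HDBSCAN with online (streaming/incremental) clustering methods inside a three-stage architecture (data collection, modeling, dashboard generation) and evaluating the candidates with sliding-window simulations on historical data from the Ukrainian information space, measuring cluster quality preservation, computational efficiency, memory footprint, and integration compatibility.

Link: https://arxiv.org/abs/2601.20680
Authors: Ostap Vykhopen, Viktoria Skorik, Maxim Tereschenko, Veronika Solopova
Affiliation: Mantis Analytics
Categories: Computation and Language (cs.CL)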

Abstract:Automated narrative intelligence systems for social media monitoring face significant scalability challenges when processing continuous data streams using traditional batch clustering algorithms. We investigate the replacement of HDBSCAN (offline clustering) with online (streaming/incremental) clustering methods in a production narrative report generation pipeline. The proposed system employs a three-stage architecture (data collection, modeling, dashboard generation) that processes thousands of multilingual social media documents daily. While HDBSCAN excels at discovering hierarchical density-based clusters and handling noise, its batch-only nature necessitates complete retraining for each time window, resulting in memory constraints, computational inefficiency, and inability to adapt to evolving narratives in real-time. This work evaluates a bunch of online clustering algorithms across dimensions of cluster quality preservation, computational efficiency, memory footprint, and integration compatibility with existing workflows. We propose evaluation criteria that balance traditional clustering metrics (Silhouette Coefficient, Davies-Bouldin Index) with narrative metrics (narrative distinctness, contingency and variance). Our methodology includes sliding-window simulations on historical datasets from Ukraine information space, enabling comparative analysis of algorithmic trade-offs in realistic operational contexts. This research addresses a critical gap between batch-oriented topic modeling frameworks and the streaming nature of social media monitoring, with implications for computational social science, crisis informatics, and narrative surveillance systems.

[NLP-15] ShieldedCode: Learning Robust Representations for Virtual Machine Protected Code ICLR2026

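[Quick read]: This paper addresses the weaknesses of traditional Virtual Machine Protection (VMP) in software protection: its reliance on rule-based transformations leaves it vulnerable to automated reverse engineering, hard to adapt to complex code structures, and unable to learn. The core challenge is making protected code more robust to reverse analysis while preserving functional equivalence. The key to the solution is ShieldedCode, the first protection-aware framework, which builds large-scale paired datasets of source code and normalized VM implementations and introduces hierarchical dependency modeling across instruction levels. On top of language modeling it jointly optimizes functionality-aware and protection-aware contrastive objectives, learning code representations that capture both functional consistency and protection strength. It also defines a protection effectiveness optimization task that quantifies the defensive strength of different VMP variants. Combined with a two-stage continual pre-training and fine-tuning pipeline, this lets the model generate, compare, and reason over protected code, improving performance on low-level VM code generation and binary similarity detection.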

Link: https://arxiv.org/abs/2601.20679
Authors: Mingqiao Mo,Yunlong Tan,Hao Zhang,Heng Zhang,Yangfan He
Affiliation: University of Chinese Academy of Sciences; South China Normal University; University of Minnesota Twin Cities
Categories: Computation and Language (cs.CL)
Comments: Accepted to ICLR 2026

Abstract:Large language models (LLMs) have achieved remarkable progress in code generation, yet their potential for software protection remains largely untapped. Reverse engineering continues to threaten software security, while traditional virtual machine protection (VMP) relies on rigid, rule-based transformations that are costly to design and vulnerable to automated analysis. In this work, we present ShieldedCode, the first protection-aware framework that learns robust representations of VMP-protected code. Our approach builds large-scale paired datasets of source code and normalized VM implementations, and introduces hierarchical dependency modeling at intra-, preceding-, and inter-instruction levels. We jointly optimize language modeling with functionality-aware and protection-aware contrastive objectives to capture both semantic equivalence and protection strength. To further assess resilience, we propose a protection effectiveness optimization task that quantifies and ranks different VM variants derived from the same source. Coupled with a two-stage continual pre-training and fine-tuning pipeline, our method enables models to generate, compare, and reason over protected code. Extensive experiments show that our framework significantly improves robustness across diverse protection levels, opening a new research direction for learning-based software defense. Our method achieves 26.95% Pass@1 on L0 VM code generation compared to 22.58% for GPT-4o, and improves binary similarity detection Recall@1 by 10% over state-of-the-art methods like jTrans.

[NLP-16] Efficient Multimodal Planning Agent for Visual Question-Answering

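[Quick read]: This paper addresses the inefficiency of Visual Question-Answering (VQA) systems that rely on multi-stage multimodal Retrieval-Augmented Generation (mRAG) pipelines, especially for knowledge-intensive questions. The key to the solution is training a multimodal planning agent that dynamically decomposes the mRAG pipeline and decides whether each step is necessary, which reduces redundant computation and costly tool calls while maintaining VQA performance. In the reported experiments the method outperforms the baselines on average across six datasets and cuts search time by more than 60%.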

Link: https://arxiv.org/abs/2601.20676
Authors: Zhuo Chen,Xinyu Geng,Xinyu Wang,Yong Jiang,Zhen Zhang,Pengjun Xie,Kewei Tu
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Visual Question-Answering (VQA) is a challenging multimodal task that requires integrating visual and textual information to generate accurate responses. While multimodal Retrieval-Augmented Generation (mRAG) has shown promise in enhancing VQA systems by providing more evidence on both image and text sides, the default procedure that addresses VQA queries, especially the knowledge-intensive ones, often relies on multi-stage pipelines of mRAG with inherent dependencies. To mitigate the inefficiency limitations while maintaining VQA task performance, this paper proposes a method that trains a multimodal planning agent, dynamically decomposing the mRAG pipeline to solve the VQA task. Our method optimizes the trade-off between efficiency and effectiveness by training the agent to intelligently determine the necessity of each mRAG step. In our experiments, the agent can help reduce redundant computations, cutting search time by over 60% compared to existing methods and decreasing costly tool calls. Meanwhile, experiments demonstrate that our method outperforms all baselines, including a Deep Research agent and a carefully designed prompt-based method, on average over six various datasets. Code will be released.

[NLP-17] Harnessing Large Language Models for Precision Querying and Retrieval-Augmented Knowledge Extraction in Clinical Data Science

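[Quick read]: This paper addresses two core tasks in Electronic Health Record (EHR) data science: using Large Language Models (LLMs) to query structured medical data precisely, and extracting semantically correct information from unstructured clinical text with the support of Retrieval Augmented Generation (RAG). The key to the solution is a flexible evaluation framework that automatically generates synthetic question-answer pairs tailored to each dataset or task and combines exact-match metrics, semantic similarity, and human judgment to assess the accuracy and reliability of LLMs in clinical data analysis.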

Link: https://arxiv.org/abs/2601.20674
Authors: Juan Jose Rubio Jan,Jack Wu,Julia Ive
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 figures

Abstract:This study applies Large Language Models (LLMs) to two foundational Electronic Health Record (EHR) data science tasks: structured data querying (using programmatic languages, Python/Pandas) and information extraction from unstructured clinical text via a Retrieval Augmented Generation (RAG) pipeline. We test the ability of LLMs to interact accurately with large structured datasets for analytics and the reliability of LLMs in extracting semantically correct information from free text health records when supported by RAG. To this end, we presented a flexible evaluation framework that automatically generates synthetic question and answer pairs tailored to the characteristics of each dataset or task. Experiments were conducted on a curated subset of MIMIC III, (four structured tables and one clinical note type), using a mix of locally hosted and API-based LLMs. Evaluation combined exact-match metrics, semantic similarity, and human judgment. Our findings demonstrate the potential of LLMs to support precise querying and accurate information extraction in clinical workflows.

[NLP-18] A Dialectic Pipeline for Improving LLM Robustness

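[Quick read]: This paper addresses hallucination in Large Language Models (LLMs), where models produce inaccurate or fabricated content that undermines their reliability and usefulness. Conventional remedies such as domain-specific fine-tuning or training a dedicated verifier can improve performance, but they are computationally expensive and restrict generalization. The paper proposes a dialectic pipeline whose core is a self-dialogue mechanism that lets the model reflect on and correct its initial answer, improving output quality while preserving generality. Experiments across several datasets and model families show that the method outperforms standard model outputs and consistently beats Chain-of-Thought-only prompting.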

Link: https://arxiv.org/abs/2601.20659
Authors: Sara Candussio
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:

Abstract:Assessing ways in which Language Models can reduce their hallucinations and improve the outputs’ quality is crucial to ensure their large-scale use. However, methods such as fine-tuning on domain-specific data or the training of a separate ad hoc verifier require demanding computational resources (not feasible for many user applications) and constrain the models to specific fields of knowledge. In this thesis, we propose a dialectic pipeline that preserves LLMs’ generalization abilities while improving the quality of their answers via self-dialogue, enabling them to reflect upon and correct tentative wrong answers. We experimented with different pipeline settings, testing our proposed method on different datasets and on different families of models. All the pipeline stages are enriched with the relevant context (in an oracle-RAG setting) and a study on the impact of its summarization or its filtering is conducted. We find that our proposed dialectic pipeline is able to outperform the standard model answers by significant margins and that it consistently achieves higher performance than Chain-of-Thought-only prompting.

[NLP-19] P2S: Probabilistic Process Supervision for General-Domain Reasoning Question Answering

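[Quick read]: This paper addresses the lack of verifiable reward signals in general-domain reasoning tasks, which limits how effectively Reinforcement Learning (RL) can optimize a model's reasoning process. Existing methods such as Reinforcement Learning with Reference Probability Reward (RLPR) use only the probability of the final answer as the reward and ignore fine-grained supervision of the reasoning steps. The key to the solution is Probabilistic Process Supervision (P2S), a self-supervised framework that supplies dense process-level rewards through a Path Faithfulness Reward (PFR). The PFR is based on the conditional probability of generating the suffix of a reference reasoning chain (gold-CoT) given the current reasoning prefix; it needs no separate reward model or human-annotated reasoning steps, and it can be combined with any outcome-based reward to mitigate reward sparsity.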

Link: https://arxiv.org/abs/2601.20649
Authors: Wenlin Zhong,Chengyuan Liu,Yiquan Wu,Bovin Tan,Changlong Sun,Yi Wang,Xiaozhong Liu,Kun Kuang
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:While reinforcement learning with verifiable rewards (RLVR) has advanced LLM reasoning in structured domains like mathematics and programming, its application to general-domain reasoning tasks remains challenging due to the absence of verifiable reward signals. To this end, methods like Reinforcement Learning with Reference Probability Reward (RLPR) have emerged, leveraging the probability of generating the final answer as a reward signal. However, these outcome-focused approaches neglect crucial step-by-step supervision of the reasoning process itself. To address this gap, we introduce Probabilistic Process Supervision (P2S), a novel self-supervision framework that provides fine-grained process rewards without requiring a separate reward model or human-annotated reasoning steps. During reinforcement learning, P2S synthesizes and filters a high-quality reference reasoning chain (gold-CoT). The core of our method is to calculate a Path Faithfulness Reward (PFR) for each reasoning step, which is derived from the conditional probability of generating the gold-CoT’s suffix, given the model’s current reasoning prefix. Crucially, this PFR can be flexibly integrated with any outcome-based reward, directly tackling the reward sparsity problem by providing dense guidance. Extensive experiments on reading comprehension and medical Question Answering benchmarks show that P2S significantly outperforms strong baselines.
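A rough sketch may help make the reward described in the abstract concrete. The abstract only says that the PFR is derived from the conditional probability of the gold-CoT suffix given the current prefix; the length normalisation, the additive blend with the outcome reward, the `weight` parameter, and both function names below are assumptions for illustration, not the paper's formulas. The per-token log-probabilities are invented; in practice they would come from scoring the suffix with the policy model.

```python
import math


def path_faithfulness_reward(suffix_token_logprobs):
    """Length-normalised probability of the gold-CoT suffix.

    Assumed form: exp(mean token log-prob), so the value lies in (0, 1]
    and does not shrink merely because the suffix is long.
    """
    if not suffix_token_logprobs:
        return 0.0
    mean_logprob = sum(suffix_token_logprobs) / len(suffix_token_logprobs)
    return math.exp(mean_logprob)


def combined_reward(outcome_reward, step_rewards, weight=0.5):
    """Blend a sparse outcome reward with the mean dense step reward.

    The additive form and the default weight are hypothetical choices.
    """
    dense = sum(step_rewards) / len(step_rewards) if step_rewards else 0.0
    return outcome_reward + weight * dense


# Invented log-probs of the gold suffix after two reasoning prefixes:
# the second prefix makes the reference continuation far more likely.
after_step_1 = [-2.0, -2.0, -2.0]
after_step_2 = [-0.5, -0.5, -0.5]

pfr = [path_faithfulness_reward(after_step_1),
       path_faithfulness_reward(after_step_2)]
total = combined_reward(outcome_reward=1.0, step_rewards=pfr)
print(pfr, total)
```

The point of the dense signal is that a step is scored as soon as it is produced: a prefix that keeps the reference chain likely earns credit even when the final answer later turns out wrong.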

[NLP-20] GDCNet: Generative Discrepancy Comparison Network for Multimodal Sarcasm Detection ICASSP2026

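[Quick read]: This paper addresses two problems in Multimodal Sarcasm Detection (MSD): cross-modal inconsistency is hard to identify when visual and textual content are loosely related or semantically indirect, and sarcastic cues generated by Large Language Models (LLMs) introduce noise because of their diversity and subjectivity. The key to the solution is the Generative Discrepancy Comparison Network (GDCNet), which uses objective, factually grounded image descriptions generated by Multimodal Large Language Models (MLLMs) as stable semantic anchors. It computes semantic and sentiment discrepancies between the description and the original text, adds a visual-textual fidelity measure, and fuses these discrepancy features with the original modality representations through a gated module that adaptively balances each modality's contribution, improving accuracy and robustness.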

Link: https://arxiv.org/abs/2601.20618
Authors: Shuguang Zhang,Junhong Lian,Guoxin Yu,Baoxun Xu,Xiang Ao
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)

Abstract:Multimodal sarcasm detection (MSD) aims to identify sarcasm within image-text pairs by modeling semantic incongruities across modalities. Existing methods often exploit cross-modal embedding misalignment to detect inconsistency but struggle when visual and textual content are loosely related or semantically indirect. While recent approaches leverage large language models (LLMs) to generate sarcastic cues, the inherent diversity and subjectivity of these generations often introduce noise. To address these limitations, we propose the Generative Discrepancy Comparison Network (GDCNet). This framework captures cross-modal conflicts by utilizing descriptive, factually grounded image captions generated by Multimodal LLMs (MLLMs) as stable semantic anchors. Specifically, GDCNet computes semantic and sentiment discrepancies between the generated objective description and the original text, alongside measuring visual-textual fidelity. These discrepancy features are then fused with visual and textual representations via a gated module to adaptively balance modality contributions. Extensive experiments on MSD benchmarks demonstrate GDCNet’s superior accuracy and robustness, establishing a new state-of-the-art on the MMSD2.0 benchmark.

[NLP-21] Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation ICLR2026

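[Quick read]: This paper addresses the insufficient emphasis on harder questions in current Reinforcement Learning with Verifiable Rewards (RLVR) methods for mathematical reasoning, at both the algorithm and data levels. Algorithmically, the widely used Group Relative Policy Optimization (GRPO) has an implicit imbalance that yields smaller policy updates for harder questions. On the data side, existing augmentation methods mainly rephrase questions to increase diversity without systematically raising intrinsic difficulty. The key to the solution is the MathForge framework, which has two modules. Difficulty-Aware Group Policy Optimization (DGPO) corrects GRPO's imbalance through difficulty-balanced group advantage estimation and prioritizes hard questions through difficulty-aware question-level weighting. Multi-Aspect Question Reformulation (MQR) reformulates questions along multiple aspects to increase difficulty while keeping the original gold answer. The two form a loop: MQR expands the data frontier and DGPO learns from the augmented data, improving performance on a range of mathematical reasoning tasks.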

Link: https://arxiv.org/abs/2601.20614
Authors: Yanqi Dai,Yuxiang Ji,Xiao Zhang,Yong Wang,Xiangxiang Chu,Zhiwu Lu
Affiliation: Gaoling School of Artificial Intelligence, Renmin University of China; AMAP, Alibaba Group; Xiamen University; Dalian University of Technology
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted for ICLR 2026

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) offers a robust mechanism for enhancing mathematical reasoning in large models. However, we identify a systematic lack of emphasis on more challenging questions in existing methods from both algorithmic and data perspectives, despite their importance for refining underdeveloped capabilities. Algorithmically, widely used Group Relative Policy Optimization (GRPO) suffers from an implicit imbalance where the magnitude of policy updates is lower for harder questions. Data-wise, augmentation approaches primarily rephrase questions to enhance diversity without systematically increasing intrinsic difficulty. To address these issues, we propose a two-dual MathForge framework to improve mathematical reasoning by targeting harder questions from both perspectives, which comprises a Difficulty-Aware Group Policy Optimization (DGPO) algorithm and a Multi-Aspect Question Reformulation (MQR) strategy. Specifically, DGPO first rectifies the implicit imbalance in GRPO via difficulty-balanced group advantage estimation, and further prioritizes harder questions by difficulty-aware question-level weighting. Meanwhile, MQR reformulates questions across multiple aspects to increase difficulty while maintaining the original gold answer. Overall, MathForge forms a synergistic loop: MQR expands the data frontier, and DGPO effectively learns from the augmented data. Extensive experiments show that MathForge significantly outperforms existing methods on various mathematical reasoning tasks. The code and augmented data are all available at this https URL.
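The imbalance attributed to GRPO can be illustrated with the commonly used group-normalised advantage, A_i = (r_i - mean) / std. The sketch below is not the paper's DGPO: it assumes binary rewards, a population standard deviation (implementations differ on this), and uses the sum of absolute advantages as a stand-in for update magnitude, which is an illustrative proxy rather than the authors' definition.

```python
import statistics


def grpo_advantages(rewards, eps=1e-8):
    """Group-normalised advantages: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


def total_update_magnitude(num_correct, group_size=8):
    """Sum of |advantage| over a group with binary rewards (proxy)."""
    rewards = [1.0] * num_correct + [0.0] * (group_size - num_correct)
    return sum(abs(a) for a in grpo_advantages(rewards))


# Same group size, different question difficulty.
for solved in (1, 2, 4, 6, 7):
    print(solved, round(total_update_magnitude(solved), 3))
```

With binary rewards and group accuracy p, this total works out to 2·G·sqrt(p(1-p)), which peaks at p = 0.5. Under this proxy a question solved once in eight attempts therefore receives less total weight than one solved half the time, which is the kind of gap a difficulty-balanced estimator would need to close.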

[NLP-22] AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios

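[Quick read]: This paper addresses the tendency of current AI agent evaluations to focus on raising task difficulty while neglecting the diversity of tasks that general users face in daily work. Existing benchmarks do not adequately reflect how well agents handle complex, multimodal tasks with tangible outputs in real-life settings, so their practical value is underestimated. The key to the solution is AgentIF-OneDay, a benchmark built around three user-centric task categories: Open Workflow Execution, Latent Instruction, and Iterative Refinement. It uses instance-level rubrics and a refined evaluation pipeline based on LLM verification, which reaches 80.1% agreement with human judgment using Gemini-3-Pro. This design brings evaluation closer to real usage.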

Link: https://arxiv.org/abs/2601.20613
Authors: Kaiyuan Chen,Qimin Wu,Taiyu Hou,Tianhao Tang,Xueyu Hu,Yuchen Hou,Bikun Li,Chengming Qian,Guoyin Wang,Haolin Chen,Haotong Tian,Haoye Zhang,Haoyu Bian,Hongbing Pan,Hongkang Zhang,Hongyi Zhou,Jiaqi Cai,Jiewu Rao,Jiyuan Ren,Keduan Huang,Lucia Zhu Huang,Mingyu Yuan,Naixu Guo,Qicheng Tang,Qinyan Zhang,Shuai Chen,Siheng Chen,Ting Ting Li,Xiaoxing Guo,Yaocheng Zuo,Yaoqi Guo,Yinan Wang,Yinzhou Yu,Yize Wang,Yuan Jiang,Yuan Tian,Yuanshuo Zhang,Yuxuan Liu,Yvette Yan Zeng,Zenyu Shan,Zihan Yin,Xiaobo Hu,Yang Liu,Yixin Ren,Yuan Gong
Affiliation: xbench.org
Categories: Computation and Language (cs.CL)
Comments: 17 pages, 8 figures

Abstract:The capacity of AI agents to effectively handle tasks of increasing duration and complexity continues to grow, demonstrating exceptional performance in coding, deep research, and complex problem-solving evaluations. However, in daily scenarios, the perception of these advanced AI capabilities among general users remains limited. We argue that current evaluations prioritize increasing task difficulty without sufficiently addressing the diversity of agentic tasks necessary to cover the daily work, life, and learning activities of a broad demographic. To address this, we propose AgentIF-OneDay, aimed at determining whether general users can utilize natural language instructions and AI agents to complete a diverse array of daily tasks. These tasks require not only solving problems through dialogue but also understanding various attachment types and delivering tangible file-based results. The benchmark is structured around three user-centric categories: Open Workflow Execution, which assesses adherence to explicit and complex workflows; Latent Instruction, which requires agents to infer implicit instructions from attachments; and Iterative Refinement, which involves modifying or expanding upon ongoing work. We employ instance-level rubrics and a refined evaluation pipeline that aligns LLM-based verification with human judgment, achieving an 80.1% agreement rate using Gemini-3-Pro. AgentIF-OneDay comprises 104 tasks covering 767 scoring points. We benchmarked four leading general AI agents and found that agent products built based on APIs and ChatGPT agents based on agent RL remain in the first tier simultaneously. Leading LLM APIs and open-source models have internalized agentic capabilities, enabling AI application teams to develop cutting-edge Agent products.

[NLP-23] A Computational Approach to Language Contact – A Case Study of Persian

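[Quick read]: This paper investigates the structural traces that language contact leaves in the intermediate representations of a monolingual language model, and in particular whether and how those traces appear when the model processes other languages with varying degrees of contact. The key to the approach is quantifying the amount of grammatical information encoded in the intermediate representations and analyzing how it is distributed across model components for syntactic and morphological features (such as Case and Gender marking). The results indicate that contact effects are selective and structurally constrained: universal syntactic information is largely insensitive to historical contact, whereas morphological features are strongly shaped by language-specific structure.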

Link: https://arxiv.org/abs/2601.20592
Authors: Ali Basirat,Danial Namazifard,Navid Baradaran Hemmati
Affiliation: Centre for Language Technology (CST), University of Copenhagen; University of Tehran; Certified Translation Agency No. 1141, Mashhad, Iran
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We investigate structural traces of language contact in the intermediate representations of a monolingual language model. Focusing on Persian (Farsi) as a historically contact-rich language, we probe the representations of a Persian-trained model when exposed to languages with varying degrees and types of contact with Persian. Our methodology quantifies the amount of linguistic information encoded in intermediate representations and assesses how this information is distributed across model components for different morphosyntactic features. The results show that universal syntactic information is largely insensitive to historical contact, whereas morphological features such as Case and Gender are strongly shaped by language-specific structure, suggesting that contact effects in monolingual language models are selective and structurally constrained.

[NLP-24] Single-Nodal Spontaneous Symmetry Breaking in NLP Models

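[Quick read]: This paper examines the mechanism of spontaneous symmetry breaking in Natural Language Processing (NLP) models, that is, why attention heads develop asymmetric functional specialization under a finite training architecture and deterministic dynamics, gaining the capacity to learn specific tasks. The key finding is that symmetry breaking occurs not only at the level of the whole network but also at the level of single nodes, and that learning ability shows a crossover as the number of nodes grows. The crossover is governed by a tradeoff between the decline associated with random guessing and the gain from cooperation among nodes. In addition, convex hull analysis gives an upper bound on each nodal function's contribution to the global task, linking microscopic nodal behavior to macroscopic task performance.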

Link: https://arxiv.org/abs/2601.20582
Authors: Shalom Rosner,Ronit D. Gross,Ella Koresh,Ido Kanter
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: 23 pages, 6 figures, 1 table

Abstract:Spontaneous symmetry breaking in statistical mechanics primarily occurs during phase transitions at the thermodynamic limit where the Hamiltonian preserves inversion symmetry, yet the low-temperature free energy exhibits reduced symmetry. Herein, we demonstrate the emergence of spontaneous symmetry breaking in natural language processing (NLP) models during both pre-training and fine-tuning, even under deterministic dynamics and within a finite training architecture. This phenomenon occurs at the level of individual attention heads and is scaled-down to its small subset of nodes and also valid at a single-nodal level, where nodes acquire the capacity to learn a limited set of tokens after pre-training or labels after fine-tuning for a specific classification task. As the number of nodes increases, a crossover in learning ability occurs, governed by the tradeoff between a decrease following random-guess among increased possible outputs, and enhancement following nodal cooperation, which exceeds the sum of individual nodal capabilities. In contrast to spin-glass systems, where a microscopic state of frozen spins cannot be directly linked to the free-energy minimization goal, each nodal function in this framework contributes explicitly to the global network task and can be upper-bounded using convex hull analysis. Results are demonstrated using BERT-6 architecture pre-trained on Wikipedia dataset and fine-tuned on the FewRel classification task.

[NLP-25] Beyond Divergent Creativity: A Human-Based Evaluation of Creativity in Large Language Models EACL2026

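[Quick read]: This paper addresses the weak theoretical grounding of current creativity evaluations for Large Language Models (LLMs). Existing tasks such as the Divergent Association Task (DAT) measure only novelty and ignore appropriateness, a core component of creativity, so results are hard to interpret and potentially misleading. The key to the solution is the Conditional Divergent Association Task (CDAT), an evaluation grounded in human creativity theory that measures novelty conditional on contextual appropriateness, separating noise from creativity more reliably while staying simple and objective. Experiments show that smaller model families often exhibit more creativity, whereas advanced models favor appropriateness at lower novelty, suggesting that training and alignment may move models along the novelty-appropriateness frontier toward outputs that fit the context better but are less novel.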

Link: https://arxiv.org/abs/2601.20546
Authors: Kumiko Nakajima,Jan Zuiderveld,Sandro Pezzelle
Affiliation: University of Amsterdam
Categories: Computation and Language (cs.CL)
Comments: Accepted to Findings of EACL 2026

Abstract:Large language models (LLMs) are increasingly used in verbal creative tasks. However, previous assessments of the creative capabilities of LLMs remain weakly grounded in human creativity theory and are thus hard to interpret. The widely used Divergent Association Task (DAT) focuses on novelty, ignoring appropriateness, a core component of creativity. We evaluate a range of state-of-the-art LLMs on DAT and show that their scores on the task are lower than those of two baselines that do not possess any creative abilities, undermining its validity for model evaluation. Grounded in human creativity theory, which defines creativity as the combination of novelty and appropriateness, we introduce Conditional Divergent Association Task (CDAT). CDAT evaluates novelty conditional on contextual appropriateness, separating noise from creativity better than DAT, while remaining simple and objective. Under CDAT, smaller model families often show the most creativity, whereas advanced families favor appropriateness at lower novelty. We hypothesize that training and alignment likely shift models along this frontier, making outputs more appropriate but less creative. We release the dataset and code.
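For context, the original DAT is usually scored as the mean pairwise cosine distance between the embeddings of the submitted words, scaled by 100. The sketch below shows that novelty-only score with invented two-dimensional vectors; the `toy` embedding table and function names are hypothetical, a real evaluation would use pretrained word embeddings, and the appropriateness conditioning that CDAT adds is not reproduced here because the abstract does not specify it.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def dat_score(words, embeddings):
    """Mean pairwise cosine distance between the words, times 100."""
    pairs = list(combinations(words, 2))
    total = sum(cosine_distance(embeddings[a], embeddings[b]) for a, b in pairs)
    return 100.0 * total / len(pairs)


# Invented toy vectors: "cat" and "dog" are close, "algebra" is far.
toy = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "algebra": [0.0, 1.0]}

print(dat_score(["cat", "dog"], toy))
print(dat_score(["cat", "algebra"], toy))
```

Because the score rewards distance alone, any sufficiently scattered word list scores well, which is consistent with the abstract's observation that baselines without creative ability can outscore LLMs on DAT.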

[NLP-26] PathWise: Planning through World Model for Automated Heuristic Design via Self-Evolving LLMs

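[Quick read]: This paper addresses limitations of existing Automated Heuristic Design (AHD) frameworks for Combinatorial Optimization Problems (COPs): reliance on fixed evolutionary rules and static prompt templates leads to myopic heuristic generation, redundant evaluations, and little reasoning about how new heuristics should be derived. The key to the solution is PathWise, a multi-agent reasoning framework that models heuristic generation as a sequential decision process over an entailment graph. The graph serves as a stateful memory of the search trajectory, so the system can carry past decisions forward and reuse or avoid earlier derivation information. A policy agent plans evolutionary actions, a world model agent generates heuristic rollouts conditioned on those actions, and critic agents provide routed reflections that summarize past experience, shifting AHD from trial-and-error evolution to state-aware planning through reasoning.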

Link: https://arxiv.org/abs/2601.20539
Authors: Oguzhan Gungordu,Siheng Xiong,Faramarz Fekri
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have enabled automated heuristic design (AHD) for combinatorial optimization problems (COPs), but existing frameworks’ reliance on fixed evolutionary rules and static prompt templates often leads to myopic heuristic generation, redundant evaluations, and limited reasoning about how new heuristics should be derived. We propose a novel multi-agent reasoning framework, referred to as Planning through World Model for Automated Heuristic Design via Self-Evolving LLMs (PathWise), which formulates heuristic generation as a sequential decision process over an entailment graph serving as a compact, stateful memory of the search trajectory. This approach allows the system to carry forward past decisions and reuse or avoid derivation information across generations. A policy agent plans evolutionary actions, a world model agent generates heuristic rollouts conditioned on those actions, and critic agents provide routed reflections summarizing lessons from prior steps, shifting LLM-based AHD from trial-and-error evolution toward state-aware planning through reasoning. Experiments across diverse COPs show that PathWise converges faster to better heuristics, generalizes across different LLM backbones, and scales to larger problem sizes.

[NLP-27] Can We Improve Educational Diagram Generation with In-Context Examples? Not if a Hallucination Spoils the Bunch

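[Quick read]: This paper addresses the unstable quality of generative AI output in computing education, in particular factual hallucination in generated diagrams. The key to the solution is a method that guides Large Language Models (LLMs) with in-context examples based on Rhetorical Structure Theory (RST): structured examples embedded in the prompt steer the model toward diagram code that better matches user expectations, with the aim of improving logical organization, connectivity, and faithfulness to the provided context.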

Link: https://arxiv.org/abs/2601.20476
Authors: Evanfiya Logacheva,Arto Hellas,Tsvetomila Mihaylova,Juha Sorva,Ava Heinonen,Juho Leinonen
Affiliation: Aalto University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Generative artificial intelligence (AI) has found a widespread use in computing education; at the same time, quality of generated materials raises concerns among educators and students. This study addresses this issue by introducing a novel method for diagram code generation with in-context examples based on the Rhetorical Structure Theory (RST), which aims to improve diagram generation by aligning models’ output with user expectations. Our approach is evaluated by computer science educators, who assessed 150 diagrams generated with large language models (LLMs) for logical organization, connectivity, layout aesthetic, and AI hallucination. The assessment dataset is additionally investigated for its utility in automated diagram evaluation. The preliminary results suggest that our method decreases the rate of factual hallucination and improves diagram faithfulness to provided context; however, due to LLMs’ stochasticity, the quality of the generated diagrams varies. Additionally, we present an in-depth analysis and discussion on the connection between AI hallucination and the quality of generated diagrams, which reveals that text contexts of higher complexity lead to higher rates of hallucination and LLMs often fail to detect mistakes in their output.

[NLP-28] CtrlCoT: Dual-Granularity Chain-of-Thought Compression for Controllable Reasoning

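[Quick read]: This paper addresses the high latency and memory cost that verbose reasoning traces impose on Chain-of-Thought (CoT) prompting in Large Language Models (LLMs), while keeping the compressed reasoning correct. Existing methods either compress conservatively at the semantic level or prune tokens aggressively, which can drop task-critical cues and reduce accuracy; combining the two is hampered by sequential dependency, task-agnostic pruning, and distribution mismatch. The key to the solution is the CtrlCoT framework, which coordinates semantic-level and token-level compression through three components: (1) Hierarchical Reasoning Abstraction produces CoTs at multiple semantic granularities; (2) Logic-Preserving Distillation trains a logic-aware pruner that retains indispensable reasoning cues such as numbers and operators; (3) Distribution-Alignment Generation aligns compressed traces with fluent inference-time reasoning styles to avoid fragmentation. On MATH-500, CtrlCoT uses 30.7% fewer tokens while scoring 7.6 percentage points higher than the strongest baseline.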

Link: https://arxiv.org/abs/2601.20467
Authors: Zhenxuan Fan,Jie Cao,Yang Dai,Zheqi Lv,Wenqiao Zhang,Zhongle Xie,Peng LU,Beng Chin Ooi
Affiliation: Zhejiang University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 16 pages, 9 figures, 11 tables

Abstract:Chain-of-thought (CoT) prompting improves LLM reasoning but incurs high latency and memory cost due to verbose traces, motivating CoT compression with preserved correctness. Existing methods either shorten CoTs at the semantic level, which is often conservative, or prune tokens aggressively, which can miss task-critical cues and degrade accuracy. Moreover, combining the two is non-trivial due to sequential dependency, task-agnostic pruning, and distribution mismatch. We propose CtrlCoT, a dual-granularity CoT compression framework that harmonizes semantic abstraction and token-level pruning through three components: Hierarchical Reasoning Abstraction produces CoTs at multiple semantic granularities; Logic-Preserving Distillation trains a logic-aware pruner to retain indispensable reasoning cues (e.g., numbers and operators) across pruning ratios; and Distribution-Alignment Generation aligns compressed traces with fluent inference-time reasoning styles to avoid fragmentation. On MATH-500 with Qwen2.5-7B-Instruct, CtrlCoT uses 30.7% fewer tokens while achieving 7.6 percentage points higher than the strongest baseline, demonstrating more efficient and reliable reasoning. Our code will be publicly available at this https URL.

[NLP-29] BMAM: Brain-inspired Multi-Agent Memory Framework ACL

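[Quick read]: This paper addresses the difficulty that language-model-based agents have in preserving temporally grounded information and behavioral consistency across sessions during long interactions, a failure mode the authors call "soul erosion". The key to the solution is BMAM, a brain-inspired multi-agent memory architecture that models memory as a set of functionally specialized subsystems: episodic, semantic, salience-aware, and control-oriented memory, which operate at complementary time scales. The hippocampus-inspired episodic memory module organizes long-term memories along explicit timelines and retrieves evidence by fusing multiple signals. BMAM reaches 78.45% accuracy on the LoCoMo benchmark, and ablations confirm the module's critical role in temporal reasoning.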

Link: https://arxiv.org/abs/2601.20465
Authors: Yang Li, Jiaxiang Liu, Yusong Wang, Yujie Wu, Mingkun Xu
Affiliations: Guangdong Institute of Intelligence Science and Technology (广东省智能科学与技术研究院); Institute of Science Tokyo (东京科学研究所); The Hong Kong Polytechnic University (香港理工大学)
Categories: Computation and Language (cs.CL)
Comments: Submitted to ACL (ARR 2026 January submission); non-anonymous preprint

Abstract:Language-model-based agents operating over extended interaction horizons face persistent challenges in preserving temporally grounded information and maintaining behavioral consistency across sessions, a failure mode we term soul erosion. We present BMAM (Brain-inspired Multi-Agent Memory), a general-purpose memory architecture that models agent memory as a set of functionally specialized subsystems rather than a single unstructured store. Inspired by cognitive memory systems, BMAM decomposes memory into episodic, semantic, salience-aware, and control-oriented components that operate at complementary time scales. To support long-horizon reasoning, BMAM organizes episodic memories along explicit timelines and retrieves evidence by fusing multiple complementary signals. Experiments on the LoCoMo benchmark show that BMAM achieves 78.45 percent accuracy under the standard long-horizon evaluation setting, and ablation analyses confirm that the hippocampus-inspired episodic memory subsystem plays a critical role in temporal reasoning.

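[NLP-30] MuVaC: A Variational Causal Framework for Multimodal Sarcasm Understanding in Dialogues WWW2026

【Quick Read】: This paper addresses the overlooked causal dependency between the two core tasks of sarcasm understanding in multimodal dialogues: Multimodal Sarcasm Detection (MSD) and Multimodal Sarcasm Explanation (MuSE). Current work usually treats them as separate tasks and does not model how a detection decision follows from the reasoning that explains the sarcasm. The key to the solution is MuVaC, a framework that builds variational causal pathways from structural causal models to jointly optimize MSD and MuSE. It also uses an alignment-then-fusion strategy to produce robust multimodal representations and enforces consistency between detection results and explanations to make the reasoning more trustworthy, bringing the model closer to human cognition. Experiments on public datasets show that MuVaC outperforms prior methods.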

Link: https://arxiv.org/abs/2601.20451
Authors: Diandian Guo, Fangfang Yuan, Cong Cao, Xixun Lin, Chuan Zhou, Hao Peng, Yanan Cao, Yanbing Liu
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间安全学院); Beijing (北京); Academy of Mathematics and Systems Science, Chinese Academy of Sciences (中国科学院数学与系统科学研究院); Beihang University (北京航空航天大学)
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 7 figures. Accepted by WWW 2026

Abstract:The prevalence of sarcasm in multimodal dialogues on the social platforms presents a crucial yet challenging task for understanding the true intent behind online content. Comprehensive sarcasm analysis requires two key aspects: Multimodal Sarcasm Detection (MSD) and Multimodal Sarcasm Explanation (MuSE). Intuitively, the act of detection is the result of the reasoning process that explains the sarcasm. Current research predominantly focuses on addressing either MSD or MuSE as a single task. Even though some recent work has attempted to integrate these tasks, their inherent causal dependency is often overlooked. To bridge this gap, we propose MuVaC, a variational causal inference framework that mimics human cognitive mechanisms for understanding sarcasm, enabling robust multimodal feature learning to jointly optimize MSD and MuSE. Specifically, we first model MSD and MuSE from the perspective of structural causal models, establishing variational causal pathways to define the objectives for joint optimization. Next, we design an alignment-then-fusion approach to integrate multimodal features, providing robust fusion representations for sarcasm detection and explanation generation. Finally, we enhance the reasoning trustworthiness by ensuring consistency between detection results and explanations. Experimental results demonstrate the superiority of MuVaC in public datasets, offering a new perspective for understanding multimodal sarcasm.

[NLP-31] PEARL: Plan Exploration and Adaptive Reinforcement Learning for Multihop Tool Use PRICAI25

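【Quick Read】: This paper addresses the weak planning, tool hallucination, erroneous parameter generation, and fragile interaction that Large Language Models (LLMs) show in complex, multi-turn tool invocation. The key to the solution is PEARL, a two-stage framework. In the offline exploration stage, the agent learns valid tool usage patterns and failure conditions through active trial and error. In the online reinforcement learning stage, a dedicated Planner is trained with Group Relative Policy Optimization (GRPO) under a reward function that gives distinct signals for planning quality, improving the planning accuracy and execution reliability of LLM tool use.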

Link: https://arxiv.org/abs/2601.20439
Authors: Qihao Wang, Mingzhe Lu, Jiayue Wu, Yue Hu, Yanbing Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to PRICAI25

Abstract:Large Language Models show great potential with external tools, but face significant challenges in complex, multi-turn tool invocation. They often exhibit weak planning, tool hallucination, erroneous parameter generation, and struggle with robust interaction. To tackle these issues, we present PEARL, a novel framework to enhance LLM planning and execution for sophisticated tool use. PEARL adopts a two-stage approach: an offline phase where the agent explores tools to learn valid usage patterns and failure conditions, and an online reinforcement learning phase. In the online phase, a dedicated Planner is trained via Group Relative Policy Optimization (GRPO) with a carefully designed reward function that provides distinct signals for planning quality. Experiments on the ToolHop and T-Eval benchmarks show PEARL significantly outperforms existing methods, achieving a new state-of-the-art success rate of 56.5% on ToolHop while maintaining a low invocation error rate. Our work marks a key advance in addressing the complex planning challenges of tool use, contributing to the development of more robust and reliable LLM-based agents.
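
For readers unfamiliar with GRPO, the sketch below shows its group-relative step: several plans are sampled for the same prompt, and each reward is normalized against the mean and standard deviation of its own group. This is a minimal illustration only. PEARL's reward function is not reproduced, the clipped policy-gradient and KL terms are omitted, and the name `group_relative_advantages` and the use of the population standard deviation are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt.

    Assumption: population standard deviation; implementations differ on
    this detail. `eps` avoids division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled plans answering one tool-use query.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Plans that beat their group's average get a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.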

[NLP-32] Hopes and Fears – Emotion Distribution in the Topic Landscape of Finnish Parliamentary Speech 2000-2020

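【Quick Read】: This paper addresses the tendency of existing research to treat parliamentary debate as a homogeneous whole and overlook how emotional expression differs across topics. The key to the solution is to apply an emotion analysis model to speeches delivered in Eduskunta, the Finnish Parliament, between 2000 and 2020, examining emotion expression by topic from both synchronic and diachronic perspectives. The analysis reveals topic-specific emotion patterns and strengthens the evidence that positivity in parliamentary speech is increasing.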

Link: https://arxiv.org/abs/2601.20424
Authors: Anna Ristilä, Otto Tarkka, Veronika Laippala, Kimmo Elo
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 27 pages (40 including appendices), 5 figures (13 including sub-figures), 1 table, 1 formula, 3 appendices; submitted to JDMDH

Abstract:Existing research often treats parliamentary discourse as a homogeneous whole, overlooking topic-specific patterns. Parliamentary speeches address a wide range of topics, some of which evoke stronger emotions than others. While everyone has intuitive assumptions about what the most emotive topics in a parliament may be, there has been little research into the emotions typically linked to different topics. This paper strives to fill this gap by examining emotion expression among the topics of parliamentary speeches delivered in Eduskunta, the Finnish Parliament, between 2000 and 2020. An emotion analysis model is used to investigate emotion expression in topics, from both synchronic and diachronic perspectives. The results strengthen evidence of increasing positivity in parliamentary speech and provide further insights into topic-specific emotion expression within parliamentary debate.

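[NLP-33] SpeechMapper: Speech-to-text Embedding Projector for LLMs ICASSP2026

【Quick Read】: This paper addresses the high computational cost and the task and prompt overfitting that arise when speech Large Language Models (speech LLMs) train the speech foundation model, projection layers, and LLM together on speech instruction data. The key to the solution is SpeechMapper, a low-cost speech-to-LLM-embedding training approach: the model is first pretrained without the LLM on inexpensive hardware and then attached to the target LLM through a brief 1K-step instruction tuning (IT) stage, which reduces resource use and improves generalization. On speech translation and spoken question answering, SpeechMapper rivals the best instruction-following speech LLM from IWSLT25 in the task-agnostic setting and outperforms it on many datasets in the task-specific setting, while requiring less data and compute.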

Link: https://arxiv.org/abs/2601.20417
Authors: Biswesh Mohapatra, Marcely Zanon Boito, Ioan Calapodescu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to ICASSP 2026

Abstract:Current speech LLMs bridge speech foundation models to LLMs using projection layers, training all of these components on speech instruction data. This strategy is computationally intensive and susceptible to task and prompt overfitting. We present SpeechMapper, a cost-efficient speech-to-LLM-embedding training approach that mitigates overfitting, enabling more robust and generalizable models. Our model is first pretrained without the LLM on inexpensive hardware, and then efficiently attached to the target LLM via a brief 1K-step instruction tuning (IT) stage. Through experiments on speech translation and spoken question answering, we demonstrate the versatility of SpeechMapper’s pretrained block, presenting results for both task-agnostic IT, an ASR-based adaptation strategy that does not train in the target task, and task-specific IT. In task-agnostic settings, SpeechMapper rivals the best instruction-following speech LLM from IWSLT25, despite never being trained on these tasks, while in task-specific settings, it outperforms this model across many datasets, despite requiring less data and compute. Overall, SpeechMapper offers a practical and scalable approach for efficient, generalizable speech-LLM integration without large-scale IT.

[NLP-34] Beyond Accuracy: A Cognitive Load Framework for Mapping the Capability Boundaries of Tool-use Agents AAAI2026

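【Quick Read】: This paper addresses a limitation of current evaluations of tool use in Large Language Models (LLMs): benchmarks report only final accuracy, which hides the cognitive bottlenecks that define a model's true capability boundary. The key to the solution is a framework grounded in Cognitive Load Theory that decomposes task complexity into two quantifiable components: Intrinsic Load, the inherent structural complexity of the solution path, formalized with a novel Tool Interaction Graph; and Extraneous Load, the difficulty arising from ambiguous task presentation. To enable controlled experiments, the authors build ToolLoad-Bench, the first benchmark with parametrically adjustable cognitive load. Experiments show distinct performance cliffs as cognitive load increases, which maps each model's capability boundary, and the framework's predictions are well calibrated with empirical results.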

Link: https://arxiv.org/abs/2601.20412
Authors: Qihao Wang, Yue Hu, Mingzhe Lu, Jiayue Wu, Yanbing Liu, Yuanmin Tang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: Accepted to AAAI 2026

Abstract:The ability of Large Language Models (LLMs) to use external tools unlocks powerful real-world interactions, making rigorous evaluation essential. However, current benchmarks primarily report final accuracy, revealing what models can do but obscuring the cognitive bottlenecks that define their true capability boundaries. To move from simple performance scoring to a diagnostic tool, we introduce a framework grounded in Cognitive Load Theory. Our framework deconstructs task complexity into two quantifiable components: Intrinsic Load, the inherent structural complexity of the solution path, formalized with a novel Tool Interaction Graph; and Extraneous Load, the difficulty arising from ambiguous task presentation. To enable controlled experiments, we construct ToolLoad-Bench, the first benchmark with parametrically adjustable cognitive load. Our evaluation reveals distinct performance cliffs as cognitive load increases, allowing us to precisely map each model’s capability boundary. We validate that our framework’s predictions are highly calibrated with empirical results, establishing a principled methodology for understanding an agent’s limits and a practical foundation for building more efficient systems.

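[NLP-35] LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning VLDB2026

【Quick Read】: This paper addresses two problems in fine-tuning Large Language Models (LLMs) on domain-specific data: low-quality samples limit performance, and conventional data processing (DP) strategies rely on iterative manual tuning, which is costly and risks exposing sensitive data. The key to the solution is LLM-AutoDP, a framework in which LLM agents automatically generate and optimize DP strategies: the agent proposes multiple candidate strategies and refines them through iterative in-context learning driven by feedback signals and comparative evaluations, converging on a high-quality pipeline without direct access to the raw data. Three techniques accelerate the search: Distribution Preserving Sampling, Processing Target Selection, and a Cache-and-Reuse Mechanism. Models trained on data processed by the framework achieve win rates above 80% against models trained on unprocessed data and a win rate of about 65% against LLM-agent-based AutoML baselines, and the acceleration techniques cut total search time by up to 10 times.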

Link: https://arxiv.org/abs/2601.20375
Authors: Wei Huang, Anda Cheng, Yinggui Wang, Lei Wang, Tao Wei
Affiliations: Ant Group (蚂蚁集团)
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by VLDB2026

Abstract:Large Language Models (LLMs) can be fine-tuned on domain-specific data to enhance their performance in specialized fields. However, such data often contains numerous low-quality samples, necessitating effective data processing (DP). In practice, DP strategies are typically developed through iterative manual analysis and trial-and-error adjustment. These processes inevitably incur high labor costs and may lead to privacy issues in high-privacy domains like healthcare due to direct human access to sensitive data. Thus, achieving automated data processing without exposing the raw data has become a critical challenge. To address this challenge, we propose LLM-AutoDP, a novel framework that leverages LLMs as agents to automatically generate and optimize data processing strategies. Our method generates multiple candidate strategies and iteratively refines them using feedback signals and comparative evaluations. This iterative in-context learning mechanism enables the agent to converge toward high-quality processing pipelines without requiring direct human intervention or access to the underlying data. To further accelerate strategy search, we introduce three key techniques: Distribution Preserving Sampling, which reduces data volume while maintaining distributional integrity; Processing Target Selection, which uses a binary classifier to identify low-quality samples for focused processing; Cache-and-Reuse Mechanism, which minimizes redundant computations by reusing prior processing results. Results show that models trained on data processed by our framework achieve over 80% win rates against models trained on unprocessed data. Compared to AutoML baselines based on LLM agents, LLM-AutoDP achieves approximately a 65% win rate. Moreover, our acceleration techniques reduce the total searching time by up to 10 times, demonstrating both effectiveness and efficiency.

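[NLP-36] TABED: Test-Time Adaptive Ensemble Drafting for Robust Speculative Decoding in LVLMs EACL2026

【Quick Read】: This paper addresses slow inference in Large Vision-Language Models (LVLMs), where autoregressive decoding over multimodal inputs is a bottleneck. Speculative Decoding (SD) accelerates Large Language Models (LLMs) effectively but remains largely unexplored for LVLMs. The key to the solution is Test-time Adaptive Batched Ensemble Drafting (TABED), a training-free method that dynamically ensembles multiple drafts obtained via batch inference, using deviations from the past ground truths available in the SD setting. TABED achieves an average robust wall-time speedup of 1.74x over autoregressive decoding and a 5% improvement over single drafting methods, keeps ensembling costs negligible through parameter sharing, and is plug-and-play, so it can integrate advanced verification and alternative drafting methods.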

Link: https://arxiv.org/abs/2601.20357
Authors: Minjae Lee, Wonjun Kang, Byeongkeun Ahn, Christian Classen, Kevin Galim, Seunghyuk Oh, Minghao Yan, Hyung Il Koo, Kangwook Lee
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted to Findings of EACL 2026

Abstract:Speculative decoding (SD) has proven effective for accelerating LLM inference by quickly generating draft tokens and verifying them in parallel. However, SD remains largely unexplored for Large Vision-Language Models (LVLMs), which extend LLMs to process both image and text prompts. To address this gap, we benchmark existing inference methods with small draft models on 11 datasets across diverse input scenarios and observe scenario-specific performance fluctuations. Motivated by these findings, we propose Test-time Adaptive Batched Ensemble Drafting (TABED), which dynamically ensembles multiple drafts obtained via batch inference by leveraging deviations from past ground truths available in the SD setting. The dynamic ensemble method achieves an average robust walltime speedup of 1.74x over autoregressive decoding and a 5% improvement over single drafting methods, while remaining training-free and keeping ensembling costs negligible through parameter sharing. With its plug-and-play compatibility, we further enhance TABED by integrating advanced verification and alternative drafting methods. Code and custom-trained models are available at this https URL.
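
The draft-then-verify loop that speculative decoding relies on can be sketched as follows. This shows only the generic greedy acceptance rule, not TABED's ensembling of drafts: `toy_target` is a hypothetical stand-in for the large model, and a real system would check every draft position in a single parallel forward pass instead of the sequential loop used here for clarity.

```python
def toy_target(tokens):
    """Hypothetical stand-in for the target model's greedy next token."""
    return (tokens[-1] + 1) % 10


def verify_draft(prefix, draft_tokens, target_next_token):
    """Accept the longest draft prefix that the target model agrees with.

    On the first mismatch, the target's own token replaces the draft token
    and verification stops. If every draft token is accepted, the target
    contributes one extra token, so each round yields at least one token.
    """
    accepted = []
    for tok in draft_tokens:
        expected = target_next_token(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # correction from the target model
            return accepted
        accepted.append(tok)
    accepted.append(target_next_token(prefix + accepted))  # bonus token
    return accepted


print(verify_draft([1, 2], [3, 4, 9], toy_target))  # [3, 4, 5]
print(verify_draft([1, 2], [3, 4, 5], toy_target))  # [3, 4, 5, 6]
```

The speedup depends on how many draft tokens survive verification, which is why a method that picks or blends better drafts at test time can help without retraining anything.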

[NLP-37] Improving Diffusion Language Model Decoding through Joint Search in Generation Order and Token Space

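【Quick Read】: This paper addresses the fact that decoding in Diffusion Language Models (DLMs) commits to a single generation trajectory, which limits exploration of trajectory space and leaves the order-agnostic nature of DLMs unused. The core of the solution is Order-Token Search, which jointly searches over generation order and token values. Its key component is a likelihood estimator that scores denoising actions, enabling stable pruning and efficient exploration of diverse trajectories, and it improves over the backbone on mathematical reasoning and coding benchmarks.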

Link: https://arxiv.org/abs/2601.20339
Authors: Yangyi Shen, Tianjian Feng, Jiaqi Han, Wen Wang, Tianlang Chen, Chunhua Shen, Jure Leskovec, Stefano Ermon
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Diffusion Language Models (DLMs) offer order-agnostic generation that can explore many possible decoding trajectories. However, current decoding methods commit to a single trajectory, limiting exploration in trajectory space. We introduce Order-Token Search to explore this space through jointly searching over generation order and token values. Its core is a likelihood estimator that scores denoising actions, enabling stable pruning and efficient exploration of diverse trajectories. Across mathematical reasoning and coding benchmarks, Order-Token Search consistently outperforms baselines on GSM8K, MATH500, Countdown, and HumanEval (3.1%, 3.8%, 7.9%, and 6.8% absolute over backbone), matching or surpassing diffu-GRPO post-trained d1-LLaDA. Our work establishes joint search as a key component for advancing decoding in DLMs.

[NLP-38] MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World Environment

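【Quick Read】: This paper addresses limitations of current benchmarks for mobile Graphical User Interface (GUI) agents: existing online benchmarks focus on task instruction following, neglect reasoning and exploration, and ignore the random noise of real mobile environments, leaving a gap between benchmark results and real-world use. The key to the solution is MobileBench-OL, an online benchmark with 1080 tasks from 80 Chinese apps. Its 5 subsets cover multiple evaluation dimensions, including task execution, complex reasoning, and noise robustness, and an auto-eval framework with a reset mechanism keeps evaluation stable and repeatable.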

Link: https://arxiv.org/abs/2601.20335
Authors: Qinzhuo Wu, Zhizhuo Yang, Hanhao Li, Pengzhi Gao, Wei Liu, Jian Luan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in mobile Graphical User Interface (GUI) agents highlight the growing need for comprehensive evaluation benchmarks. While new online benchmarks offer more realistic testing than offline ones, they tend to focus on the agents’ task instruction-following ability while neglecting their reasoning and exploration ability. Moreover, these benchmarks do not consider the random noise in real-world mobile environments. This leads to a gap between benchmarks and real-world environments. To address these limitations, we propose MobileBench-OL, an online benchmark with 1080 tasks from 80 Chinese apps. It measures task execution, complex reasoning, and noise robustness of agents by including 5 subsets, which set multiple evaluation dimensions. We also provide an auto-eval framework with a reset mechanism, enabling stable and repeatable real-world benchmarking. Evaluating 12 leading GUI agents on MobileBench-OL shows significant room for improvement to meet real-world requirements. Human evaluation further confirms that MobileBench-OL can reliably measure the performance of leading GUI agents in real environments. Our data and code will be released upon acceptance.

[NLP-39] PsychePass: Calibrating LLM Therapeutic Competence via Trajectory-Anchored Tournaments

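【Quick Read】: This paper addresses the difficulty of evaluating the therapeutic competence of Large Language Models (LLMs) in mental healthcare, given the unstructured and longitudinal nature of counseling. Current evaluation paradigms suffer from an "unanchored defect" that causes two forms of instability: process drift, where unsteered client simulation wanders away from specific counseling goals, and standard drift, where static pointwise scoring lacks the stability needed for reliable judgment. The key to the solution is PsychePass, a unified framework built on trajectory anchoring. It first anchors the interaction trajectory in simulation, where clients precisely control the consultation process to probe multifaceted capabilities. It then anchors the battle trajectory in judgment through an efficient Swiss-system tournament whose dynamic pairwise battles yield robust Elo ratings. Beyond ranking models reliably, the tournament trajectories can be turned into credible reward signals for on-policy reinforcement learning that further improves LLM performance.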

Link: https://arxiv.org/abs/2601.20330
Authors: Zhuang Chen, Dazhen Wan, Zhangkai Zheng, Guanqun Bi, Xiyao Xiao, Binghang Li, Minlie Huang
Affiliations: Central South University (中南大学); Lingxin AI; South China Normal University (华南师范大学); Tsinghua University (清华大学)
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:While large language models show promise in mental healthcare, evaluating their therapeutic competence remains challenging due to the unstructured and longitudinal nature of counseling. We argue that current evaluation paradigms suffer from an unanchored defect, leading to two forms of instability: process drift, where unsteered client simulation wanders away from specific counseling goals, and standard drift, where static pointwise scoring lacks the stability for reliable judgment. To address this, we introduce PsychePass, a unified framework that calibrates the therapeutic competence of LLMs via trajectory-anchored tournaments. We first anchor the interaction trajectory in simulation, where clients precisely control the fluid consultation process to probe multifaceted capabilities. We then anchor the battle trajectory in judgments through an efficient Swiss-system tournament, utilizing dynamic pairwise battles to yield robust Elo ratings. Beyond ranking, we demonstrate that tournament trajectories can be transformed into credible reward signals, enabling on-policy reinforcement learning to enhance LLMs’ performance. Extensive experiments validate the effectiveness of PsychePass and its strong consistency with human expert judgments.
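
The two ingredients named in the abstract, Swiss-system pairing and Elo ratings, can be sketched in a few lines. This is a generic illustration and not the PsychePass implementation: the K-factor of 32, the 400-point scale, the model names, and the simplified pairing rule (neighbors in the current standings, with no rematch avoidance) are all assumptions.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return updated (A, B) ratings; score_a is 1 win, 0.5 draw, 0 loss."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def swiss_pairings(ratings):
    """Pair neighbors in the current standings (simplified Swiss round).

    With an odd number of players, the lowest-ranked one sits out.
    """
    ranked = sorted(ratings, key=ratings.get, reverse=True)
    return [(ranked[i], ranked[i + 1]) for i in range(0, len(ranked) - 1, 2)]


# Hypothetical models and ratings.
ratings = {"A": 1000.0, "B": 1010.0, "C": 990.0, "D": 1005.0}
print(swiss_pairings(ratings))          # [('B', 'D'), ('A', 'C')]
print(update_elo(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```

Pairing models of similar current strength makes each battle informative, so a stable ranking needs far fewer comparisons than a full round-robin.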

[NLP-40] CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria

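【Quick Read】: This paper addresses the gap between the strong benchmark performance of the LLM-as-a-Judge paradigm and its weaker effectiveness in reinforcement learning (RL) practice. The authors attribute the gap to two limitations of existing work: the dominance of pairwise evaluation and inadequate optimization of evaluation criteria. They propose CE-RM-4B, a pointwise generative reward model trained with a dedicated two-stage rollout method and unified query-based criteria. With only about 5.7K high-quality preference examples, CE-RM-4B achieves superior performance on diverse reward model benchmarks, especially in Best-of-N scenarios, and delivers more effective improvements in downstream RL.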

Link: https://arxiv.org/abs/2601.20327
Authors: Xinyu Hu, Yancheng He, Weixun Wang, Tao Feng, Li Lin, Jiashun Liu, Wenbo Su, Bo Zheng, Xiaojun Wan
Affiliations: Peking University (北京大学); Hong Kong University of Science and Technology (香港科技大学); Alibaba Group (阿里巴巴集团)
Categories: Computation and Language (cs.CL)
Comments: Under Review

Abstract:Automatic evaluation is crucial yet challenging for open-ended natural language generation, especially when rule-based metrics are infeasible. Compared with traditional methods, the recent LLM-as-a-Judge paradigms enable better and more flexible evaluation, and show promise as generative reward models for reinforcement learning. However, prior work has revealed a notable gap between their seemingly impressive benchmark performance and actual effectiveness in RL practice. We attribute this issue to some limitations in existing studies, including the dominance of pairwise evaluation and inadequate optimization of evaluation criteria. Therefore, we propose CE-RM-4B, a pointwise generative reward model trained with a dedicated two-stage rollout method, and adopting unified query-based criteria. Using only about 5.7K high-quality data curated from the open-source preference dataset, our CE-RM-4B achieves superior performance on diverse reward model benchmarks, especially in Best-of-N scenarios, and delivers more effective improvements in downstream RL practice.

[NLP-41] Beyond Speedup – Utilizing KV Cache for Sampling and Reasoning ICLR26

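【Quick Read】: This paper addresses the overhead of recomputing or storing full hidden states when a representation from a Large Language Model (LLM) is needed for a downstream task. The key to the solution is to treat the key-value (KV) cache, normally used only to speed up autoregressive decoding, as a lightweight representation whose contextual information can be reused at no extra cost. In Chain-of-Embedding, KV-derived representations achieve competitive or superior performance; in Fast/Slow Thinking Switching, they enable adaptive reasoning that reduces token generation by up to 5.7 times with minimal accuracy loss. These results support the KV cache as a reusable, low-cost substrate for sampling and reasoning.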

Link: https://arxiv.org/abs/2601.20326
Authors: Zeyu Xing, Xing Li, Hui-Ling Zhen, Mingxuan Yuan, Sinno Jialin Pan
Affiliations: The Chinese University of Hong Kong (香港中文大学); Huawei Technologies Co., Ltd. (华为技术有限公司)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by ICLR26

Abstract:KV caches, typically used only to speed up autoregressive decoding, encode contextual information that can be reused for downstream tasks at no extra cost. We propose treating the KV cache as a lightweight representation, eliminating the need to recompute or store full hidden states. Despite being weaker than dedicated embeddings, KV-derived representations are shown to be sufficient for two key applications: (i) Chain-of-Embedding, where they achieve competitive or superior performance on Llama-3.1-8B-Instruct and Qwen2-7B-Instruct; and (ii) Fast/Slow Thinking Switching, where they enable adaptive reasoning on Qwen3-8B and DeepSeek-R1-Distill-Qwen-14B, reducing token generation by up to 5.7× with minimal accuracy loss. Our findings establish KV caches as a free, effective substrate for sampling and reasoning, opening new directions for representation reuse in LLM inference. Code: this https URL.

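[NLP-42] SAPO: Self-Adaptive Process Optimization Makes Small Reasoners Stronger AAAI2026

【Quick Read】: This paper addresses the "reasoner-verifier gap" that arises because existing self-evolution methods overlook fine-grained reasoning steps, along with the computational inefficiency of Monte Carlo (MC) process supervision. The key to the solution is Self-Adaptive Process Optimization (SAPO), a method inspired by Error-Related Negativity (ERN) that introduces process supervision signals by actively minimizing the reasoner-verifier gap instead of relying on inefficient MC estimation, enabling self-improvement in Small Language Models (SLMs).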

Link: https://arxiv.org/abs/2601.20312
Authors: Kaiyuan Chen, Guangmin Zheng, Jin Wang, Xiaobing Zhou, Xuejie Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted by AAAI 2026

Abstract:Existing self-evolution methods overlook the influence of fine-grained reasoning steps, which leads to the reasoner-verifier gap. The computational inefficiency of Monte Carlo (MC) process supervision further exacerbates the difficulty in mitigating the gap. Motivated by the Error-Related Negativity (ERN), through which the reasoner can localize errors following incorrect decisions and make rapid adjustments, we propose a Self-Adaptive Process Optimization (SAPO) method for self-improvement in Small Language Models (SLMs). SAPO adaptively and efficiently introduces process supervision signals by actively minimizing the reasoner-verifier gap rather than relying on inefficient MC estimations. Extensive experiments demonstrate that the proposed method outperforms most existing self-evolution methods on two challenging task types: mathematics and code. Additionally, to further investigate SAPO’s impact on verifier performance, this work introduces two new benchmarks for process reward models in both mathematical and coding tasks.

[NLP-43] MiLorE-SSL: Scaling Multilingual Capabilities in Self-Supervised Models without Forgetting ICASSP2026

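【Quick Read】: This paper addresses two problems that multilingual Self-Supervised Learning (SSL) speech models face when adding new languages: retraining from scratch is computationally expensive, and sequential training tends to cause catastrophic forgetting. The key to the solution is MiLorE-SSL, a lightweight framework that combines LoRA modules with a soft mixture-of-experts (MoE) mechanism for efficient continual multilingual training. LoRA provides parameter-efficient low-rank adaptation, while soft MoE promotes expert sharing across languages and reduces cross-lingual interference; limited replay data from existing languages further mitigates forgetting. On ML-SUPERB, MiLorE-SSL achieves strong performance in new languages and improves performance in existing ones with only 2.14% trainable parameters.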

Link: https://arxiv.org/abs/2601.20300
Authors: Jing Xu, Minglin Wu, Xueyuan Chen, Xixin Wu, Helen Meng
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted by ICASSP2026

Abstract:Self-supervised learning (SSL) has greatly advanced speech representation learning, but multilingual SSL models remain constrained to languages encountered during pretraining. Retraining from scratch to incorporate new languages is computationally expensive, while sequential training without mitigation strategies often leads to catastrophic forgetting. To address this, we propose MiLorE-SSL, a lightweight framework that combines LoRA modules with a soft mixture-of-experts (MoE) mechanism for efficient continual multilingual training. LoRA provides efficient low-rank adaptation, while soft MoE promotes flexible expert sharing across languages, reducing cross-lingual interference. To further mitigate forgetting, we introduce limited replay data from existing languages, avoiding reliance on large historical corpora. Experiments on ML-SUPERB demonstrate that MiLorE-SSL achieves strong performance in new languages and improves the ability in existing ones with only 2.14% trainable parameters.
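
One plausible way to combine LoRA with soft expert mixing is sketched below: a frozen weight matrix produces the base output, each expert contributes a low-rank update `B (A x)`, and a softmax router blends the experts so that all of them receive some weight. This is an illustrative guess at the general shape, not the MiLorE-SSL architecture: the router form, the `scale` factor, and every name in the code are assumptions, and the matrices are tiny hand-written lists so the example runs with the standard library only.

```python
import math


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lora_soft_moe_forward(x, w_frozen, experts, router, scale=1.0):
    """Frozen layer output plus a softly gated sum of low-rank updates.

    experts: list of (A, B) pairs with A of shape (r, d_in), B of (d_out, r)
    router:  matrix of shape (num_experts, d_in) producing gate logits
    Only A, B, and the router would be trained; w_frozen stays fixed.
    """
    out = matvec(w_frozen, x)
    gates = softmax(matvec(router, x))
    for gate, (a, b) in zip(gates, experts):
        delta = matvec(b, matvec(a, x))  # rank-r update B (A x)
        out = [o + scale * gate * d for o, d in zip(out, delta)]
    return out


x = [1.0, 2.0]
w_frozen = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    ([[1.0, 1.0]], [[1.0], [0.0]]),   # rank-1 expert with a non-zero update
    ([[1.0, -1.0]], [[0.0], [0.0]]),  # B = 0, so this expert adds nothing yet
]
router = [[0.0, 0.0], [0.0, 0.0]]     # uniform gates: 0.5 each
print(lora_soft_moe_forward(x, w_frozen, experts, router))  # [2.5, 2.0]
```

Because each expert stores only `r * (d_in + d_out)` numbers instead of `d_in * d_out`, the trainable share stays small, which is the general mechanism behind a low trainable-parameter figure like the one the abstract reports.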

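[NLP-44] Truthfulness Despite Weak Supervision: Evaluating and Training LLMs Using Peer Prediction ICLR2026

【Quick Read】: This paper addresses the reliance of Large Language Model (LLM) evaluation and post-training on strong supervision, which is often unavailable for frontier models; without it, models can exploit imperfect evaluations and produce deceptive results. The core of the solution is peer prediction, a method from mechanism design that rewards honest and informative answers using a metric based on mutual predictability, without requiring ground-truth labels. It rests on game-theoretic incentive compatibility, so deception can be penalized even under weak supervision. The authors give theoretical guarantees and empirical validation on models with up to 405B parameters, and find that resistance to deception grows as the capability gap between experts and participants widens, overcoming a limitation of LLM-as-a-Judge.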

Link: https://arxiv.org/abs/2601.20299
Authors: Tianyi Alex Qiu, Micah Carroll, Cameron Allen
Affiliations: Center for Human-Compatible Artificial Intelligence (人类兼容人工智能中心); University of California, Berkeley (加州大学伯克利分校)
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
Comments: ICLR 2026

Abstract:The evaluation and post-training of large language models (LLMs) rely on supervision, but strong supervision for difficult tasks is often unavailable, especially when evaluating frontier models. In such cases, models are demonstrated to exploit evaluations built on such imperfect supervision, leading to deceptive results. However, underutilized in LLM research, a wealth of mechanism design research focuses on game-theoretic incentive compatibility, i.e., eliciting honest and informative answers with weak supervision. Drawing from this literature, we introduce the peer prediction method for model evaluation and post-training. It rewards honest and informative answers over deceptive and uninformative ones, using a metric based on mutual predictability and without requiring ground truth labels. We demonstrate the method’s effectiveness and resistance to deception, with both theoretical guarantees and empirical validation on models with up to 405B parameters. We show that training an 8B model with peer prediction-based reward recovers most of the drop in truthfulness due to prior malicious finetuning, even when the reward is produced by a 0.135B language model with no finetuning. On the evaluation front, in contrast to LLM-as-a-Judge which requires strong and trusted judges, we discover an inverse scaling property in peer prediction, where, surprisingly, resistance to deception is strengthened as the capability gap between the experts and participants widens, enabling reliable evaluation of strong models with weak supervision. In particular, LLM-as-a-Judge becomes worse than random guessing when facing deceptive models 5-20x the judge’s size, while peer prediction thrives when such gaps are large, including in cases with over 100x size difference.

[NLP-45] One Word is Enough: Minimal Adversarial Perturbations for Neural Text Ranking ECIR2026

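【Quick Read】: This paper addresses the vulnerability of Neural Ranking Models (NRMs) to adversarial perturbations in information retrieval, where small, semantically aligned text edits can shift rankings substantially. The key to the solution is a minimal, query-aware attack that promotes a target document by inserting or substituting a single word semantically aligned with the query, called the "query center". The method comes in heuristic and gradient-guided variants, including a white-box method that identifies influential insertion points. On TREC-DL 2019/2020 with BERT and monoT5 re-rankers, the attack reaches up to 91% success while modifying fewer than two tokens per document on average, achieving rank and score boosts competitive with PRADA while using far fewer edits. The analysis also reveals a "Goldilocks zone" in which mid-ranked documents are most vulnerable, which motivates future work on robust defenses.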

Link: https://arxiv.org/abs/2601.20283
Authors: Tanmay Karmakar, Sourav Saha, Debapriyo Majumdar, Surjyanee Halder
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: To appear at ECIR 2026

Abstract:Neural ranking models (NRMs) achieve strong retrieval effectiveness, yet prior work has shown they are vulnerable to adversarial perturbations. We revisit this robustness question with a minimal, query-aware attack that promotes a target document by inserting or substituting a single, semantically aligned word - the query center. We study heuristic and gradient-guided variants, including a white-box method that identifies influential insertion points. On TREC-DL 2019/2020 with BERT and monoT5 re-rankers, our single-word attacks achieve up to 91% success while modifying fewer than two tokens per document on average, achieving competitive rank and score boosts with far fewer edits under a comparable white-box setup to ensure fair evaluation against PRADA. We also introduce new diagnostic metrics to analyze attack sensitivity beyond aggregate success rates. Our analysis reveals a Goldilocks zone in which mid-ranked documents are most vulnerable. These findings demonstrate practical risks and motivate future defenses for robust neural ranking.

[NLP-46] Beyond the Needle's Illusion: Decoupled Evaluation of Evidence Access and Use under Semantic Interference at 326M-Token Scale

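[Quick read]: This paper addresses the weak evidence-retrieval ability of long-context LLMs in realistic, complex environments, and in particular the limitation that the existing Needle-in-a-Haystack (NIAH) evaluation measures only benign span localization and does not reflect performance under semantic interference. The key to the approach is EverMemBench-S (EMB-S), an adversarial NIAH-style benchmark built on a 326M-token MemoryBank. It pairs queries with collision-tested near-miss hard negatives and gold evidence sets validated by human screening and LLM verification, so that a model's ability to access the right documents in a noisy environment can be evaluated rigorously. The paper also proposes a decoupled diagnostic protocol that reports document-level localization (evidence access) separately from end-to-end QA quality, giving native long-context models and retrieval-augmented generation (RAG) pipelines a consistent diagnostic framework. Experiments show that systems that excel on benign NIAH degrade sharply in evidence access under semantic interference, indicating that semantic discrimination, not context length alone, is the main bottleneck for scaling long-context memory.

Link: https://arxiv.org/abs/2601.20276
Authors: Tianwei Lin, Zuyi Zhou, Xinda Zhao, Chenke Wang, Xiaohong Li, Yu Chen, Chuanrui Hu, Jian Pei, Yafeng Deng
Affiliations: EverMind; Shanda Group; Duke University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: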

Abstract:Long-context LLM agents must access the right evidence from large environments and use it faithfully. However, the popular Needle-in-a-Haystack (NIAH) evaluation mostly measures benign span localization. The needle is near-unique, and the haystack is largely irrelevant. We introduce EverMemBench-S (EMB-S), an adversarial NIAH-style benchmark built on a 326M-token MemoryBank. While the full MemoryBank spans 326M tokens for retrieval-based (RAG) evaluation, we evaluate native long-context models only at scales that fit within each model’s context window (up to 1M tokens in this work) to ensure a fair comparison. EMB-S pairs queries with collision-tested near-miss hard negatives and gold evidence sets spanning one or more documents, validated via human screening and LLM verification. We also propose a decoupled diagnostic protocol that reports evidence access (document-ID localization) separately from end-to-end QA quality under full-context prompting. This enables consistent diagnosis for both native long-context prompting and retrieval pipelines. Across a reference-corpus ladder from domain-isolated 64K contexts to a globally shared 326M-token environment, we observe a clear reality gap. Systems that saturate benign NIAH degrade sharply in evidence access under semantic interference. These results indicate that semantic discrimination, not context length alone, is the dominant bottleneck for long-context memory at scale.
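
The decoupled protocol in the abstract reports evidence access separately from answer quality. A minimal sketch of that separation follows, under stated assumptions: evidence access is computed here as recall over gold document IDs, and QA quality as case-insensitive exact match. Both scoring rules and all names are illustrative choices, not the benchmark's actual metrics.

```python
from typing import Dict, Iterable, Set, Union


def evidence_recall(predicted_ids: Iterable[str], gold_ids: Set[str]) -> float:
    """Fraction of gold evidence documents that the system localized."""
    if not gold_ids:
        raise ValueError("gold_ids must not be empty")
    return len(set(predicted_ids) & gold_ids) / len(gold_ids)


def decoupled_report(
    predicted_ids: Iterable[str],
    gold_ids: Set[str],
    answer: str,
    reference: str,
) -> Dict[str, Union[float, bool]]:
    """Report evidence access and QA correctness as two separate fields."""
    return {
        "evidence_access": evidence_recall(predicted_ids, gold_ids),
        # Assumption: exact match after trimming and lowercasing.
        "qa_correct": answer.strip().lower() == reference.strip().lower(),
    }


report = decoupled_report(["d1", "d9"], {"d1", "d2"}, " Paris", "paris")
```

Keeping the two fields apart makes it visible when a system answers correctly while having located only part of the gold evidence, which a single end-to-end score would hide.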

[NLP-47] RusLICA: A Russian-Language Platform for Automated Linguistic Inquiry and Category Analysis

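[Quick read]: This paper addresses how to adapt LIWC (Linguistic Inquiry and Word Count), a tool for analyzing psycholinguistic characteristics of text that was developed for English, to Russian, taking into account the grammatical and cultural specificities of the language. The key to the approach is a dictionary built specifically for Russian rather than a direct translation of existing English dictionaries. It draws on several lexicographic resources, semantic dictionaries, and corpora, and maps lemmas to 42 psycholinguistic categories. The resulting analyzer covers syntactic, morphological, lexical, and statistical features as well as predictions from pre-trained language models (LMs), and is integrated into the RusLICA web service.

Link: https://arxiv.org/abs/2601.20275
Authors: Elina Sigdel, Anastasia Panfilova
Affiliations: Institute of Physics and Technology, Russian Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments: The link to the platform: this https URL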

Abstract:Defining psycholinguistic characteristics in written texts is a task gaining increasing attention from researchers. One of the most widely used tools in this field is Linguistic Inquiry and Word Count (LIWC), which was originally developed to analyze English texts and has since been translated into multiple languages. Our approach adapts the LIWC methodology to the Russian language, considering its grammatical and cultural specificities. The suggested approach comprises 96 categories, integrating syntactic, morphological, lexical, general statistical features, and results of predictions obtained using pre-trained language models (LMs) for text analysis. Rather than applying direct translation to existing thesauri, we built the dictionary specifically for the Russian language based on the content from several lexicographic resources, semantic dictionaries and corpora. The paper describes the process of mapping lemmas to 42 psycholinguistic categories and the implementation of the analyzer as part of the RusLICA web service.

[NLP-48] SoftHateBench: Evaluating Moderation Models Against Reasoning-Driven Policy-Compliant Hostility

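[Quick read]: This paper addresses the limited ability of current social media moderation systems to recognize "soft hate speech": discourse that appears reasonable on the surface but uses framing and value-based arguments to steer audiences toward blaming or excluding a target group. Because it is covert, traditional detectors that rely on surface toxicity cues struggle to identify it. The key to the approach is SoftHateBench, a generative benchmark that integrates the Argumentum Model of Topics (AMT) and Relevance Theory (RT) to rewrite explicit hate speech into seemingly neutral but logically coherent reasoning-based statements while preserving the original hostile standpoint. This makes it possible to measure systematically how far detection performance drops on implicit, reasoning-driven hate speech.

Link: https://arxiv.org/abs/2601.20256
Authors: Xuanyu Su, Diana Inkpen, Nathalie Japkowicz
Affiliations: University of Ottawa; American University
Categories: Computation and Language (cs.CL)
Comments: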

Abstract:Online hate on social media ranges from overt slurs and threats (hard hate speech) to soft hate speech: discourse that appears reasonable on the surface but uses framing and value-based arguments to steer audiences toward blaming or excluding a target group. We hypothesize that current moderation systems, largely optimized for surface toxicity cues, are not robust to this reasoning-driven hostility, yet existing benchmarks do not measure this gap systematically. We introduce SoftHateBench, a generative benchmark that produces soft-hate variants while preserving the underlying hostile standpoint. To generate soft hate, we integrate the Argumentum Model of Topics (AMT) and Relevance Theory (RT) in a unified framework: AMT provides the backbone argument structure for rewriting an explicit hateful standpoint into a seemingly neutral discussion while preserving the stance, and RT guides generation to keep the AMT chain logically coherent. The benchmark spans 7 sociocultural domains and 28 target groups, comprising 4,745 soft-hate instances. Evaluations across encoder-based detectors, general-purpose LLMs, and safety models show a consistent drop from hard to soft tiers: systems that detect explicit hostility often fail when the same stance is conveyed through subtle, reasoning-based language. Disclaimer: Contains offensive examples used solely for research.

[NLP-49] HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-BENCH

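[Quick read]: This paper addresses the lack of effective metrics for guiding the mid-training phase of large language models (LLMs) on complex software engineering tasks. Standard metrics such as perplexity (PPL) suffer from the "Long-Context Tax" and correlate weakly with downstream software engineering (SWE) performance, so they cannot accurately track how a model's potential evolves during this critical training phase. The key to the approach is the Entropy Compression Hypothesis, which redefines intelligence as the capacity to structure uncertainty into Entropy-Compressed States of low orders ("reasonable hesitation"), and a new metric built on it, HE-SNR (High-Entropy Signal-to-Noise Ratio). Grounded in fine-grained entropy analysis, the metric is validated on industrial-scale Mixture-of-Experts (MoE) models across context windows (32K/128K), where it shows strong robustness and predictive power, providing a theoretical basis and practical tools for optimizing the latent potential of LLMs in complex engineering settings.

Link: https://arxiv.org/abs/2601.20255
Authors: Yueyang Wang, Jiawei Fu, Baolong Bi, Xili Wang, Xiaoqing Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: 21 pages, 15 figures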

Abstract:SWE-bench has emerged as the premier benchmark for evaluating Large Language Models on complex software engineering tasks. While these capabilities are fundamentally acquired during the mid-training phase and subsequently elicited during Supervised Fine-Tuning (SFT), there remains a critical deficit in metrics capable of guiding mid-training effectively. Standard metrics such as Perplexity (PPL) are compromised by the “Long-Context Tax” and exhibit weak correlation with downstream SWE performance. In this paper, we bridge this gap by first introducing a rigorous data filtering strategy. Crucially, we propose the Entropy Compression Hypothesis, redefining intelligence not by scalar Top-1 compression, but by the capacity to structure uncertainty into Entropy-Compressed States of low orders (“reasonable hesitation”). Grounded in this fine-grained entropy analysis, we formulate a novel metric, HE-SNR (High-Entropy Signal-to-Noise Ratio). Validated on industrial-scale Mixture-of-Experts (MoE) models across varying context windows (32K/128K), our approach demonstrates superior robustness and predictive power. This work provides both the theoretical foundation and practical tools for optimizing the latent potential of LLMs in complex engineering domains.

[NLP-50] Automated Benchmark Generation from Domain Guidelines Informed by Bloom's Taxonomy

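[Quick read]: This paper addresses the lack of effective evaluation benchmarks for open-ended question answering (QA) in practice-based domains such as teaching, dietetics, and caregiving. Knowledge in these domains is procedural and grounded in professional judgment, while most existing evaluations of large language models (LLMs) depend on pre-existing human exam datasets that are often unavailable in such settings. The key to the approach is a framework for automated benchmark generation from expert-authored guidelines, structured by Bloom's Taxonomy. It converts expert practices into implicit violation-based scenarios and expands them into auto-graded multiple-choice questions (MCQs) and multi-turn dialogues across four cognitive levels (Remember, Understand, Apply, and Analyze), enabling deterministic, reproducible, and scalable evaluation.

Link: https://arxiv.org/abs/2601.20253
Authors: Si Chen, Le Huy Khiem, Annalisa Szymanski, Ronald Metoyer, Ting Hua, Nitesh V. Chawla
Affiliations: University of Notre Dame
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: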

Abstract:Open-ended question answering (QA) evaluates a model’s ability to perform contextualized reasoning beyond factual recall. This challenge is especially acute in practice-based domains, where knowledge is procedural and grounded in professional judgment, while most existing LLM benchmarks depend on pre-existing human exam datasets that are often unavailable in such settings. We introduce a framework for automated benchmark generation from expert-authored guidelines informed by Bloom’s Taxonomy. It converts expert practices into implicit violation-based scenarios and expands them into auto-graded multiple-choice questions (MCQs) and multi-turn dialogues across four cognitive levels, enabling deterministic, reproducible, and scalable evaluation. Applied to three applied domains: teaching, dietetics, and caregiving, we find differences between model and human-like reasoning: LLMs sometimes perform relatively better on higher-order reasoning (Analyze) but fail more frequently on lower-level items (Remember). We produce large-scale, psychometrically informed benchmarks that surface these non-intuitive model behaviors and enable evaluation of contextualized reasoning in real-world settings.
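
Because the generated MCQs are auto-graded and tagged with a cognitive level, per-level accuracy can be computed deterministically. The sketch below is a hypothetical illustration of that bookkeeping: the tuple layout and the sample items are invented, and only the four level names come from the text above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

LEVELS = ("Remember", "Understand", "Apply", "Analyze")


def accuracy_by_level(items: List[Tuple[str, str, str]]) -> Dict[str, float]:
    """Each item is (cognitive_level, gold_option, model_option)."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for level, gold, predicted in items:
        if level not in LEVELS:
            raise ValueError(f"unknown cognitive level: {level}")
        total[level] += 1
        correct[level] += int(gold == predicted)
    # Only report levels that actually have items, in taxonomy order.
    return {lvl: correct[lvl] / total[lvl] for lvl in LEVELS if total[lvl]}


# Invented sample data, not results from the paper.
sample = [("Remember", "A", "B"), ("Remember", "C", "C"), ("Analyze", "D", "D")]
scores = accuracy_by_level(sample)
```

Breaking accuracy down this way is what would surface a pattern like the one the abstract reports, where a model does relatively better on Analyze than on Remember.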

[NLP-51] Unit-Based Agent for Semi-Cascaded Full-Duplex Dialogue Systems ICASSP2026

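[Quick read]: This paper addresses real-time, fluent human-computer dialogue in full-duplex voice interaction, where the core challenge is triggering system responses accurately and handing over smoothly while the user may still be speaking. The key to the approach is a framework that decomposes complex dialogue into minimal conversational units, so that the system can process each unit independently and predict when to transition to the next. The framework is built around a multimodal large language model, supported by auxiliary modules such as voice activity detection (VAD) and text-to-speech (TTS) synthesis, and provides training-free, plug-and-play full-duplex interaction. Experiments on the HumDial dataset demonstrate its effectiveness, and the system ranks second on the test set of the Human-like Spoken Dialogue Systems Challenge (Track 2: Full-Duplex Interaction).

Link: https://arxiv.org/abs/2601.20230
Authors: Haoyuan Yu, Yuxuan Chen, Minjie Cai
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: ICASSP 2026 (Workshop). this https URL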

Abstract:Full-duplex voice interaction is crucial for natural human-computer interaction. We present a framework that decomposes complex dialogue into minimal conversational units, enabling the system to process each unit independently and predict when to transition to the next. This framework is instantiated as a semi-cascaded full-duplex dialogue system built around a multimodal large language model, supported by auxiliary modules such as voice activity detection (VAD) and text-to-speech (TTS) synthesis. The resulting system operates in a training-free, plug-and-play manner. Experiments on the HumDial dataset demonstrate the effectiveness of our framework, which ranks second among all teams on the test set of the Human-like Spoken Dialogue Systems Challenge (Track 2: Full-Duplex Interaction). Code is available at the GitHub repository this https URL.

[NLP-52] Scaling Medical Reasoning Verification via Tool-Integrated Reinforcement Learning

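[Quick read]: This paper addresses the difficulty of verifying factual accuracy when large language models (LLMs) are deployed in clinical settings. Existing reward-model-based verification has two main limitations: it outputs only scalar reward values without explicit justification, and it relies on single-pass retrieval, which cannot adapt knowledge access as verification unfolds. The key to the approach is an agentic verification framework, \method, with two core components: 1) a tool-augmented, iterative reinforcement learning paradigm that lets the verifier actively query external medical corpora for evidence during reasoning; and 2) an adaptive curriculum mechanism that dynamically adjusts the training data distribution to improve generalization. Across four medical reasoning benchmarks the method substantially outperforms existing methods, improving MedQA and MedXpertQA accuracy by 23.5% and 32.0% respectively relative to the base generator, while reducing the required sampling budget by 8x. These results suggest that grounding verification in dynamically retrieved evidence is a promising path toward more reliable medical reasoning systems.

Link: https://arxiv.org/abs/2601.20221
Authors: Hang Zhang, Ruheng Wang, Yuelyu Ji, Mingu Kwak, Xizhi Wu, Chenyu Li, Li Zhang, Wenqi Shi, Yifan Peng, Yanshan Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: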

Abstract:Large language models have achieved strong performance on medical reasoning benchmarks, yet their deployment in clinical settings demands rigorous verification to ensure factual accuracy. While reward models offer a scalable approach for reasoning trace verification, existing methods face two limitations: they produce only scalar reward values without explicit justification, and they rely on single-pass retrieval that precludes adaptive knowledge access as verification unfolds. We introduce \method, an agentic framework that addresses these limitations by training medical reasoning verifiers to iteratively query external medical corpora during evaluation. Our approach combines tool-augmented verification with an iterative reinforcement learning paradigm that requires only trace-level supervision, alongside an adaptive curriculum mechanism that dynamically adjusts training data distribution. Across four medical reasoning benchmarks, \method achieves substantial gains over existing methods, improving MedQA accuracy by 23.5% and MedXpertQA by 32.0% relative to the base generator in particular. Crucially, \method demonstrates an 8x reduction in sampling budget requirement compared to prior reward model baselines. These findings establish that grounding verification in dynamically retrieved evidence offers a principled path toward more reliable medical reasoning systems.

[NLP-53] Spark: Strategic Policy-Aware Exploration via Dynamic Branching for Long-Horizon Agentic Learning

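[Quick read]: This paper addresses the difficulty of training large language models with reinforcement learning on long-horizon tasks when high-quality trajectories are scarce, especially under limited compute. Existing methods scale up rollout sizes and allocate compute indiscriminately across intermediate steps, which wastes substantial computation and fails to guarantee sample quality. The key to the approach is Spark (Strategic Policy-Aware Exploration via Key-state Dynamic Branching), a framework that selectively branches exploration at critical decision states. It uses the agent's own decision-making signals to identify important states and allocates resources precisely, prioritizing sampling quality over broad coverage, which significantly reduces the number of training samples required and strengthens generalization.

Link: https://arxiv.org/abs/2601.20209
Authors: Jinyang Wu, Shuo Yang, Changpeng Yang, Yuhao Shen, Shuai Zhang, Zhengqi Wen, Jianhua Tao
Affiliations: Tsinghua University; Peking University; Zhejiang University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: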

Abstract:Reinforcement learning has empowered large language models to act as intelligent agents, yet training them for long-horizon tasks remains challenging due to the scarcity of high-quality trajectories, especially under limited resources. Existing methods typically scale up rollout sizes and indiscriminately allocate computational resources among intermediate steps. Such attempts inherently waste substantial computation budget on trivial steps while failing to guarantee sample quality. To address this, we propose Spark (Strategic Policy-Aware exploRation via Key-state dynamic branching), a novel framework that selectively branches at critical decision states for resource-efficient exploration. Our key insight is to activate adaptive branching exploration at critical decision points to probe promising trajectories, thereby achieving precise resource allocation that prioritizes sampling quality over blind coverage. This design leverages the agent’s intrinsic decision-making signals to reduce dependence on human priors, enabling the agent to autonomously expand exploration and achieve stronger generalization. Experiments across diverse tasks (e.g., embodied planning) demonstrate that Spark achieves superior success rates with significantly fewer training samples, exhibiting robust generalization even in unseen scenarios.

[NLP-54] Improving X-Codec-2.0 for Multi-Lingual Speech: 25 Hz Latent Rate and 24 kHz Sampling

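[Quick read]: This paper addresses the limited temporal efficiency and audio fidelity of X-Codec-2.0 in neural audio compression and multilingual speech modeling. The core of the approach is to introduce additional pooling and increase the decoder hop size, which, without altering the core architecture, reduces the latent rate from 50 Hz to 25 Hz and raises the output sampling rate from 16 kHz to 24 kHz, improving both coding efficiency and perceptual quality.

Link: https://arxiv.org/abs/2601.20185
Authors: Husein Zolkepli
Affiliations: Scicom (MSC) Berhad, Malaysia
Categories: Computation and Language (cs.CL); Sound (cs.SD)
Comments: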

Abstract:X-Codec-2.0 has shown strong performance in neural audio compression and multilingual speech modeling, operating at a 50 Hz latent rate and a 16 kHz sampling rate using frozen HuBERT features. While effective, this configuration limits temporal efficiency and audio fidelity. In this work, we explore a simple and effective modification by introducing additional pooling and increasing the decoder hop size. This reduces the latent rate from 50 Hz to 25 Hz and simultaneously raises the output sampling rate from 16 kHz to 24 kHz, improving efficiency and perceptual quality without altering the core architecture. Evaluated on the multilingual Common Voice 17 test set, the proposed configuration achieves a 0.29 MOS improvement over the original X-Codec-2.0 baseline based on UTMOSv2, and attains the best reported performance among all codecs operating at 25 Hz. The source code, checkpoints, and generation comparisons are released at this https URL.
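
The rate change in the abstract implies a simple piece of arithmetic: the number of output audio samples each latent frame must cover is the sampling rate divided by the latent rate. The sketch below works that out for the two configurations. The function name is illustrative, and whether the released code uses exactly these values as its hop size is an assumption.

```python
def samples_per_latent_frame(sampling_rate_hz: int, latent_rate_hz: int) -> int:
    """Output samples covered by one latent frame."""
    if sampling_rate_hz % latent_rate_hz:
        raise ValueError("sampling rate must be a multiple of the latent rate")
    return sampling_rate_hz // latent_rate_hz


original = samples_per_latent_frame(16_000, 50)  # X-Codec-2.0 baseline: 320 samples per frame
modified = samples_per_latent_frame(24_000, 25)  # proposed configuration: 960 samples per frame

# Half as many latent frames per second of audio, each covering three times as many samples.
frames_ratio = 25 / 50
span_ratio = modified // original
```

Halving the latent rate halves the sequence length a downstream speech language model has to handle for the same duration of audio, which is the efficiency gain the abstract refers to.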

[NLP-55] What's the plan? Metrics for implicit planning in LLMs and their application to rhyme generation and question answering ICLR2026

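[Quick read]: This paper addresses how to assess implicit planning in large language models (LLMs) effectively. Prior work relied on complex methods such as a cross-layer transcoder and showed only on a specific model (Claude 3.5 Haiku) that language models, when generating the next token, prepare for likely future words such as a rhyming word or an answer. This paper proposes simpler, more scalable techniques: steering with a vector at the end of the preceding line and observing whether the generation of intermediate tokens is affected, which reveals implicit planning behavior. The key contribution is that simple vector steering suffices to expose how a model plans for later output, and that the method applies to many models, including those at the 1B-parameter scale. The results indicate that implicit planning is a universal mechanism present across LLMs, offering a new perspective for AI safety and control.

Link: https://arxiv.org/abs/2601.20164
Authors: Jim Maar, Denis Paperno, Callum Stuart McDougall, Neel Nanda
Affiliations: HPI / University of Potsdam, Germany; Utrecht University, Netherlands; Google DeepMind, London, UK
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 41 pages, 34 figures, Accepted at ICLR 2026, Code available at this https URL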

Abstract:Prior work suggests that language models, while trained on next token prediction, show implicit planning behavior: they may select the next token in preparation for a predicted future token, such as a likely rhyming word, as supported by a prior qualitative study of Claude 3.5 Haiku using a cross-layer transcoder. We propose much simpler techniques for assessing implicit planning in language models. With case studies on rhyme poetry generation and question answering, we demonstrate that our methodology easily scales to many models. Across models, we find that the generated rhyme (e.g. “-ight”) or answer to a question (“whale”) can be manipulated by steering at the end of the preceding line with a vector, affecting the generation of intermediate tokens leading up to the rhyme or answer word. We show that implicit planning is a universal mechanism, present in smaller models than previously thought, starting from 1B parameters. Our methodology offers a widely applicable direct way to study implicit planning abilities of LLMs. More broadly, understanding planning abilities of language models can inform decisions in AI safety and control.

[NLP-56] Me-Agent: A Personalized Mobile Agent with Two-Level User Habit Learning for Enhanced Interaction

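[Quick read]: This paper addresses three problems that large language model (LLM)-based mobile agents face in practice because they lack personalized understanding: they cannot interpret ambiguous instructions, do not learn from user interaction history, and fail to handle personalized instructions. To address these, the authors propose Me-Agent, a learnable and memorable personalized mobile agent. Its core innovation is a two-level user habit learning approach. At the prompt level, it uses a user preference learning strategy enhanced with a Personal Reward Model to improve personalization. At the memory level, it builds a Hierarchical Preference Memory that stores the user's long-term memory and app-specific memory separately, enabling continual modeling of and adaptation to user behavior patterns.

Link: https://arxiv.org/abs/2601.20162
Authors: Shuoxin Wang, Chang Liu, Gowen Loo, Lifan Zheng, Kaiwen Wei, Xinyi Zeng, Jingyuan Zhang, Yu Tian
Affiliations: Yunnan University; Hong Kong Polytechnic University; University of Electronic Science and Technology of China; Southeast University; Chongqing University; Tsinghua University; Kuaishou Technology
Categories: Computation and Language (cs.CL)
Comments: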

Abstract:Large Language Model (LLM)-based mobile agents have made significant performance advancements. However, these agents often follow explicit user instructions while overlooking personalized needs, leading to significant limitations for real users, particularly without personalized context: (1) inability to interpret ambiguous instructions, (2) lack of learning from user interaction history, and (3) failure to handle personalized instructions. To alleviate the above challenges, we propose Me-Agent, a learnable and memorable personalized mobile agent. Specifically, Me-Agent incorporates a two-level user habit learning approach. At the prompt level, we design a user preference learning strategy enhanced with a Personal Reward Model to improve personalization performance. At the memory level, we design a Hierarchical Preference Memory, which stores users’ long-term memory and app-specific memory at different memory levels. To validate the personalization capabilities of mobile agents, we introduce User FingerTip, a new benchmark featuring numerous ambiguous instructions for daily life. Extensive experiments on User FingerTip and general benchmarks demonstrate that Me-Agent achieves state-of-the-art performance in personalization while maintaining competitive instruction execution performance.

[NLP-57] Trajectory2Task: Training Robust Tool-Calling Agents with Synthesized Yet Verifiable Data for Complex User Intents

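[Quick read]: This paper addresses the lack of realism in current research on tool-calling agents: existing work mostly assumes idealized, fixed, and well-specified tasks and overlooks three complex situations common in real user interactions, namely ambiguous intent, changing intent, and intents that are infeasible due to policy constraints. To close this gap, the authors propose Trajectory2Task, a data generation pipeline whose core idea is to produce valid tool-call trajectories through multi-turn exploration and then convert them into user-facing tasks with controlled intent adaptations, yielding verifiable tasks that support closed-loop training and evaluation. The key is that complex user interaction patterns are structured into reusable data, which improves tool-calling ability and generalization in diverse and challenging scenarios.

Link: https://arxiv.org/abs/2601.20144
Authors: Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Jing Huang, Jiri Gesi, Xianfeng Tang, Chen Luo, Yisi Sang, Hanqing Lu, Manling Li, Dakuo Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: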

Abstract:Tool-calling agents are increasingly deployed in real-world customer-facing workflows. Yet most studies on tool-calling agents focus on idealized settings with general, fixed, and well-specified tasks. In real-world applications, user requests are often (1) ambiguous, (2) changing over time, or (3) infeasible due to policy constraints, and training and evaluation data that cover these diverse, complex interaction patterns remain under-represented. To bridge the gap, we present Trajectory2Task, a verifiable data generation pipeline for studying tool use at scale under three realistic user scenarios: ambiguous intent, changing intent, and infeasible intents. The pipeline first conducts multi-turn exploration to produce valid tool-call trajectories. It then converts these trajectories into user-facing tasks with controlled intent adaptations. This process yields verifiable tasks that support closed-loop evaluation and training. We benchmark seven state-of-the-art LLMs on the generated complex user scenario tasks and observe frequent failures. Finally, using successful trajectories obtained from task rollouts, we fine-tune lightweight LLMs and find consistent improvements across all three conditions, along with better generalization to unseen tool-use domains, indicating stronger general tool-calling ability.

[NLP-58] Mind the Shift: Using Delta SSL Embeddings to Enhance Child ASR ICASSP2026

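[Quick read]: This paper addresses the performance bottleneck in child automatic speech recognition (child ASR) caused by limited data and pretraining domain mismatch. The key to the approach is the notion of "delta SSL embeddings": the difference between the embeddings of a fine-tuned model and those of its pretrained counterpart, which captures task-specific information and is fused with fine-tuned features. Experiments show that this strategy improves several self-supervised learning (SSL) models on the MyST children's corpus. In particular, fusing WavLM with delta W2V2 embeddings achieves a word error rate (WER) of 9.64%, a new state of the art among SSL models on this dataset.

Link: https://arxiv.org/abs/2601.20142
Authors: Zilai Wang, Natarajan Balaji Shankar, Kaiyuan Zhang, Zihan Wang, Abeer Alwan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: ICASSP 2026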

Abstract:Self-supervised learning (SSL) models have achieved impressive results across many speech tasks, yet child automatic speech recognition (ASR) remains challenging due to limited data and pretraining domain mismatch. Fine-tuning SSL models on child speech induces shifts in the representation space. We hypothesize that delta SSL embeddings, defined as the differences between embeddings from a finetuned model and those from its pretrained counterpart, encode task-specific information that complements finetuned features from another SSL model. We evaluate multiple fusion strategies on the MyST children's corpus using different models. Results show that delta embedding fusion with WavLM yields up to a 10 percent relative WER reduction for HuBERT and a 4.4 percent reduction for W2V2, compared to finetuned embedding fusion. Notably, fusing WavLM with delta W2V2 embeddings achieves a WER of 9.64, setting a new state of the art among SSL models on the MyST corpus. These findings demonstrate the effectiveness of delta embeddings and highlight feature fusion as a promising direction for advancing child ASR.
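
The definition of a delta embedding in the abstract is a plain element-wise difference, which the following sketch makes concrete. The fusion step is an assumption: the abstract says multiple fusion strategies were evaluated without naming them, so simple concatenation is used here purely for illustration, and the vectors are made-up toy values rather than real WavLM or W2V2 features.

```python
from typing import List

Vector = List[float]


def delta_embedding(finetuned: Vector, pretrained: Vector) -> Vector:
    """Element-wise difference between fine-tuned and pretrained embeddings of the same frame."""
    if len(finetuned) != len(pretrained):
        raise ValueError("embeddings must have the same dimension")
    return [f - p for f, p in zip(finetuned, pretrained)]


def fuse_by_concatenation(primary: Vector, delta: Vector) -> Vector:
    """Hypothetical fusion strategy: concatenate one model's features with another's delta."""
    return primary + delta


# Toy values standing in for one frame of features.
w2v2_delta = delta_embedding([1.5, 2.0], [1.0, 0.5])
fused = fuse_by_concatenation([0.25], w2v2_delta)
```

In practice both models would need time-aligned frame sequences before fusion; the sketch operates on a single frame to keep the definition visible.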

[NLP-59] BengaliSent140: A Large-Scale Bengali Binary Sentiment Dataset for Hate and Non-Hate Speech Classification

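[Quick read]: This paper addresses the constraint on model training in Bengali sentiment analysis caused by the lack of large-scale, diverse annotated datasets. Publicly available Bengali sentiment and hate speech datasets are mostly small or confined to a single domain (such as social media comments), which makes them insufficient for modern deep learning models that need to learn robust and generalizable representations. The key to the approach is BengaliSent140, a unified large-scale binary sentiment dataset built by consolidating seven existing Bengali text datasets and systematically harmonizing their heterogeneous annotation schemes into two classes, Not Hate (0) and Hate (1). The resulting dataset contains 139,792 unique text samples with a relatively balanced distribution (68,548 hate and 71,244 not-hate), broadening linguistic and contextual coverage and providing a solid basis for training and benchmarking deep learning models.

Link: https://arxiv.org/abs/2601.20129
Authors: Akif Islam, Sujan Kumar Roy, Md. Ekramul Hamid
Affiliations: University of Rajshahi
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Dataset paper. 6 pages, 3 figures, 4 tables. Includes a publicly released Bengali sentiment dataset on Kaggle (BengaliSent140) and baseline experimental results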

Abstract:Sentiment analysis for the Bengali language has attracted increasing research interest in recent years. However, progress remains constrained by the scarcity of large-scale and diverse annotated datasets. Although several Bengali sentiment and hate speech datasets are publicly available, most are limited in size or confined to a single domain, such as social media comments. Consequently, these resources are often insufficient for training modern deep learning based models, which require large volumes of heterogeneous data to learn robust and generalizable representations. In this work, we introduce BengaliSent140, a large-scale Bengali binary sentiment dataset constructed by consolidating seven existing Bengali text datasets into a unified corpus. To ensure consistency across sources, heterogeneous annotation schemes are systematically harmonized into a binary sentiment formulation with two classes: Not Hate (0) and Hate (1). The resulting dataset comprises 139,792 unique text samples, including 68,548 hate and 71,244 not-hate instances, yielding a relatively balanced class distribution. By integrating data from multiple sources and domains, BengaliSent140 offers broader linguistic and contextual coverage than existing Bengali sentiment datasets and provides a strong foundation for training and benchmarking deep learning models. Baseline experimental results are also reported to demonstrate the practical usability of the dataset. The dataset is publicly available at this https URL
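
Harmonizing heterogeneous annotation schemes into a binary label amounts to a lookup from each source label to 0 or 1. The sketch below shows one way that could look. The source label names in `LABEL_MAP` are hypothetical, since the abstract does not list the original schemes; only the target classes and the reported counts come from the text above.

```python
HATE, NOT_HATE = 1, 0

# Hypothetical source labels; the actual schemes of the seven datasets are not given here.
LABEL_MAP = {
    "hate": HATE,
    "abusive": HATE,
    "neutral": NOT_HATE,
    "positive": NOT_HATE,
}


def harmonize(label: str) -> int:
    """Map a source-dataset label to the binary scheme, failing loudly on unknown labels."""
    key = label.strip().lower()
    if key not in LABEL_MAP:
        raise KeyError(f"no mapping defined for label: {label!r}")
    return LABEL_MAP[key]


# Class counts as reported in the abstract.
hate_count, not_hate_count = 68_548, 71_244
total = hate_count + not_hate_count
hate_share = hate_count / total
```

Raising on unmapped labels rather than defaulting to one class is a deliberate choice in this sketch, since a silent default would skew the class balance.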

[NLP-60] Rewarding Intellectual Humility Learning When Not To Answer In Large Language Models

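[Quick read]: This paper addresses the problem that large language models (LLMs) produce hallucinated or unverifiable content in factual domains, which undermines their reliability. The key to the approach is Reinforcement Learning with Verifiable Rewards (RLVR) used as a training paradigm with a ternary reward structure (-1, r_abs, 1) that explicitly rewards abstaining ("I don't know") when the model is uncertain, balancing correctness against a moderate tendency to decline, in order to promote intellectual humility. Experiments show that moderate abstention rewards (r_abs ≈ -0.25 to 0.3) substantially reduce incorrect responses on multiple-choice tasks without severe accuracy degradation, with larger models being more robust. On open-ended question answering, teaching abstention through supervised fine-tuning before reinforcement learning partially mitigates insufficient exploration, demonstrating the feasibility and flexibility of the method for hallucination mitigation.

Link: https://arxiv.org/abs/2601.20126
Authors: Abha Jha, Akanksha Mahajan, Ashwath Vaithinathan Aravindan, Praveen Saravanan, Sai Sailaja Policharla, Sonal Chaturbhuj Gehlot
Affiliations: University of Southern California
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: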

Abstract:Large Language Models (LLMs) often produce hallucinated or unverifiable content, undermining their reliability in factual domains. This work investigates Reinforcement Learning with Verifiable Rewards (RLVR) as a training paradigm that explicitly rewards abstention (“I don’t know”) alongside correctness to promote intellectual humility. We fine-tune and evaluate Granite-3.3-2B-Instruct and Qwen-3-4B-Instruct on the MedMCQA and Hendrycks Math benchmarks using a ternary reward structure (-1, r_abs, 1) under varying abstention reward structures. We further study the effect of combining RLVR with supervised fine-tuning strategies that teach abstention prior to reinforcement learning. Our results show that moderate abstention rewards (r_abs ≈ -0.25 to 0.3) consistently reduce incorrect responses without severe accuracy degradation on multiple-choice tasks, with larger models exhibiting greater robustness to abstention incentives. On open-ended question answering, we observe limitations due to insufficient exploration, which can be partially mitigated through supervised abstention training. Overall, these findings demonstrate the feasibility and flexibility of verifiable reward design as a practical approach for hallucination mitigation in language models. Reproducible code for our abstention training framework is available at this https URL.
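
The ternary reward (-1, r_abs, 1) has a direct consequence that a short sketch can make explicit. If the model answers with probability p of being correct, the expected reward of answering is p - (1 - p) = 2p - 1, so answering beats abstaining only when p > (1 + r_abs) / 2. The code below encodes the reward and that threshold. Representing abstention as `None` and the function names are illustrative assumptions, not the authors' implementation.

```python
from typing import Optional


def ternary_reward(prediction: Optional[str], gold: str, r_abs: float) -> float:
    """+1 for a correct answer, -1 for an incorrect one, r_abs for abstaining (prediction is None)."""
    if prediction is None:
        return r_abs
    return 1.0 if prediction == gold else -1.0


def answer_threshold(r_abs: float) -> float:
    """Confidence above which answering has higher expected reward than abstaining.

    Derived from 2p - 1 > r_abs, which rearranges to p > (1 + r_abs) / 2.
    """
    return (1.0 + r_abs) / 2.0


low = answer_threshold(-0.25)  # 0.375: abstaining is mildly penalized, so guessing pays off early
high = answer_threshold(0.3)   # about 0.65: abstaining is mildly rewarded, so more confidence is needed
```

Under this reading, the r_abs range the abstract calls moderate corresponds to asking the model to answer only when it is roughly 37.5% to 65% confident or better.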

[NLP-61] Look in the Middle: Structural Anchor Pruning for Scalable Visual RAG Indexing

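[Quick read]: This paper addresses the high storage and compute cost caused by oversized index vectors when current vision-language models (such as ColPali) are used for fine-grained Visual Document Retrieval (VDR). Existing training-free pruning methods (such as those based on EOS attention) achieve roughly 60% compression, but at high compression ratios (above 80%) their performance drops markedly and can fall below random selection. The authors find that traditional methods prune using final-layer features, where structural information has already dissipated. They therefore propose Structural Anchor Pruning (SAP), whose key idea is to identify semantically meaningful structural anchor patches from the middle layers, achieving over 90% compression while maintaining retrieval fidelity without any model fine-tuning, and providing a highly scalable solution for Visual RAG.

Link: https://arxiv.org/abs/2601.20107
Authors: Zhuchenyang Liu, Ziyu Hu, Yao Zhang, Yu Xiao
Affiliations: Aalto University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 18 pages, 6 figures, 11 tables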

Abstract:Recent Vision-Language Models (e.g., ColPali) enable fine-grained Visual Document Retrieval (VDR) but incur prohibitive index vector size overheads. Training-free pruning solutions (e.g., EOS-attention based methods) can reduce index vector size by approximately 60% without model adaptation, but often underperform random selection in high-compression scenarios (>80%). Prior research (e.g., Light-ColPali) attributes this to the conclusion that visual token importance is inherently query-dependent, thereby questioning the feasibility of training-free pruning. In this work, we propose Structural Anchor Pruning (SAP), a training-free pruning method that identifies key visual patches from middle layers to achieve high-performance compression. We also introduce the Oracle Score Retention (OSR) protocol to evaluate how layer-wise information affects compression efficiency. Evaluations on the ViDoRe benchmark demonstrate that SAP reduces index vectors by over 90% while maintaining robust retrieval fidelity, providing a highly scalable solution for Visual RAG. Furthermore, our OSR-based analysis reveals that semantic structural anchor patches persist in the middle layers, unlike traditional pruning solutions that focus on the final layer where structural signals dissipate.
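
Training-free pruning of an index reduces to keeping only the highest-scoring patch vectors for each page. The sketch below shows that selection step in isolation. How SAP derives its scores from middle-layer signals is not described in the abstract, so the `scores` list is a hypothetical input, and the function is a generic top-k selection rather than the authors' method.

```python
import math
from typing import List


def keep_top_patches(scores: List[float], keep_ratio: float) -> List[int]:
    """Return indices of the patches to keep, in original order."""
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Always keep at least one patch so a page never vanishes from the index.
    k = max(1, math.ceil(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical per-patch importance scores for a four-patch page.
kept = keep_top_patches([0.1, 0.9, 0.3, 0.8], keep_ratio=0.5)
```

A keep ratio of 0.1 would correspond to the roughly 90% reduction the abstract reports; the abstract's point is that the quality of the scores, not the selection step, determines whether retrieval survives that compression.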

[NLP-62] FFE-Hallu: Hallucinations in Fixed Figurative Expressions: Benchmark of Idioms and Proverbs in the Persian Language EACL2026

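[Quick Read]: This paper addresses "figurative hallucination" in large language models (LLMs) handling fixed figurative expressions (FFEs): the model generates or endorses non-literal expressions that sound plausible but do not actually exist in the target language. The key contribution is FFEHallu, the first systematic benchmark for this problem, focused on Persian, a linguistically rich but under-resourced language. It covers three complementary tasks: generating FFEs from a meaning, detecting fabricated FFEs built under four controlled construction categories, and translating FFEs from English to Persian. An evaluation of six state-of-the-art multilingual LLMs reveals clear weaknesses in cultural grounding and figurative accuracy, underscoring the value of targeted benchmarks for improving figurative competence and mitigating hallucination.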

Link: https://arxiv.org/abs/2601.20105
Authors: Faezeh Hosseini, Mohammadali Yousefzadeh, Yadollah Yaghoobzadeh
Affiliations: Tehran Institute for Advanced Studies, Khatam University; School of Electrical and Computer Engineering, College of Engineering, University of Tehran
Categories: Computation and Language (cs.CL)
Comments: EACL 2026

Abstract:Figurative language, particularly fixed figurative expressions (FFEs) such as idioms and proverbs, poses persistent challenges for large language models (LLMs). Unlike literal phrases, FFEs are culturally grounded, largely non-compositional, and conventionally fixed, making them especially vulnerable to figurative hallucination. We define figurative hallucination as the generation or endorsement of expressions that sound idiomatic and plausible but do not exist as authentic figurative expressions in the target language. We introduce FFEHallu, the first comprehensive benchmark for evaluating figurative hallucination in LLMs, with a focus on Persian, a linguistically rich yet underrepresented language. FFEHallu consists of 600 carefully curated instances spanning three complementary tasks: (i) FFE generation from meaning, (ii) detection of fabricated FFEs across four controlled construction categories, and (iii) FFE to FFE translation from English to Persian. Evaluating six state of the art multilingual LLMs, we find systematic weaknesses in figurative competence and cultural grounding. While models such as GPT4.1 demonstrate relatively strong performance in rejecting fabricated FFEs and retrieving authentic ones, most models struggle to reliably distinguish real expressions from high quality fabrications and frequently hallucinate during cross lingual translation. These findings reveal substantial gaps in current LLMs handling of figurative language and underscore the need for targeted benchmarks to assess and mitigate figurative hallucination.

[NLP-63] Counterfactual Cultural Cues Reduce Medical QA Accuracy in LLMs: Identifier vs Context Effects

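[Quick Read]: This paper addresses the risk that medical language models lose diagnostic accuracy when given culture-related input; a robust model should keep the clinically correct diagnosis unchanged when presented with non-decisive cultural information. The key to the approach is a counterfactual benchmark: 150 MedQA test items are expanded into 1650 variants by inserting culture-related identifier tokens, contextual cues, or both, alongside a length-matched neutral control, with a clinician verifying that the gold answer stays the same across all variants. Experiments show that cultural cues significantly affect accuracy (Cochran's Q, p < 10⁻¹⁴), with the largest drop, up to 3 to 7 percentage points, when identifier and context co-occur. An LLM-as-judge evaluation further shows that more than half of culturally grounded explanations end in an incorrect answer, linking culture-referential reasoning to diagnostic failure. The study releases prompts and augmentations for evaluating and mitigating culturally induced diagnostic errors.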

Link: https://arxiv.org/abs/2601.20102
Authors: Amirhossein Haji Mohammad Rezaei, Zahra Shakeri
Affiliations: Institute of Health Policy, Management, and Evaluation (IHPME); Dalla Lana School of Public Health; Faculty of Information; Schwartz Reisman Institute; University of Toronto
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Engineering sustainable and equitable healthcare requires medical language models that do not change clinically correct diagnoses when presented with non-decisive cultural information. We introduce a counterfactual benchmark that expands 150 MedQA test items into 1650 variants by inserting culture-related (i) identifier tokens, (ii) contextual cues, or (iii) their combination for three groups (Indigenous Canadian, Middle-Eastern Muslim, Southeast Asian), plus a length-matched neutral control, where a clinician verified that the gold answer remains invariant in all variants. We evaluate GPT-5.2, Llama-3.1-8B, DeepSeek-R1, and MedGemma (4B/27B) under option-only and brief-explanation prompting. Across models, cultural cues significantly affect accuracy (Cochran’s Q, $p < 10^{-14}$), with the largest degradation when identifier and context co-occur (up to 3-7 percentage points under option-only prompting), while neutral edits produce smaller, non-systematic changes. A human-validated rubric ($\kappa = 0.76$) applied via an LLM-as-judge shows that more than half of culturally grounded explanations end in an incorrect answer, linking culture-referential reasoning to diagnostic failure. We release prompts and augmentations to support evaluation and mitigation of culturally induced diagnostic errors.
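
The abstract reports Cochran's Q as the significance test for accuracy differences across cue conditions. As a minimal sketch of how that statistic is computed, the block below implements the textbook formula for a table of binary outcomes (one row per test item, one column per condition). The toy data and the function name `cochrans_q` are hypothetical and are not taken from the paper.

```python
def cochrans_q(outcomes):
    """Cochran's Q statistic for a table of binary outcomes.

    outcomes: list of rows, one per test item; each row holds the 0/1
    correctness of the same item under each of k conditions.
    Returns (Q, degrees_of_freedom); under the null hypothesis Q is
    approximately chi-square distributed with k - 1 degrees of freedom.
    """
    n_conditions = len(outcomes[0])
    if any(len(row) != n_conditions for row in outcomes):
        raise ValueError("every row needs one outcome per condition")

    col_totals = [sum(row[j] for row in outcomes) for j in range(n_conditions)]
    row_totals = [sum(row) for row in outcomes]
    grand_total = sum(row_totals)

    denominator = n_conditions * grand_total - sum(r * r for r in row_totals)
    if denominator == 0:
        raise ValueError("Q is undefined when no item changes across conditions")

    numerator = (n_conditions - 1) * (
        n_conditions * sum(c * c for c in col_totals) - grand_total ** 2
    )
    return numerator / denominator, n_conditions - 1


# Hypothetical toy data: 6 items under 4 conditions
# (neutral, identifier, context, identifier + context).
toy_outcomes = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
q_stat, dof = cochrans_q(toy_outcomes)
print(f"Q = {q_stat:.3f}, df = {dof}")
```

Items that are right or wrong under every condition do not contribute to the denominator, so the test is driven only by items whose correctness flips between conditions.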

[NLP-64] VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning

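[Quick Read]: This paper addresses the difficulty of guaranteeing the logical correctness of large language models (LLMs) in high-stakes domains: their output is syntactically fluent but lacks strict logical consistency. The proposed solution is VERGE, a neurosymbolic framework that iteratively refines answers by decomposing LLM output into atomic claims, autoformalizing them into first-order logic, and checking logical consistency with SMT solvers. Its core innovations are: (1) multi-model consensus via formal semantic equivalence checking, which removes the syntactic bias of surface-form metrics; (2) semantic routing, which sends logical claims to symbolic solvers and commonsense reasoning to LLM ensembles; and (3) precise logical error localization via Minimal Correction Subsets (MCS), which turns binary failure signals into actionable feedback. The system refines answers with structured feedback until acceptance criteria are met or it converges, giving formal guarantees where possible and consensus verification elsewhere, with the aim of more trustworthy AI systems.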

Link: https://arxiv.org/abs/2601.20055
Authors: Vikash Singh, Darion Cassel, Nathaniel Weir, Nick Feng, Sam Bayless
Affiliations: Case Western Reserve University; Amazon Web Services
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite the syntactic fluency of Large Language Models (LLMs), ensuring their logical correctness in high-stakes domains remains a fundamental challenge. We present a neurosymbolic framework that combines LLMs with SMT solvers to produce verification-guided answers through iterative refinement. Our approach decomposes LLM outputs into atomic claims, autoformalizes them into first-order logic, and verifies their logical consistency using automated theorem proving. We introduce three key innovations: (1) multi-model consensus via formal semantic equivalence checking to ensure logic-level alignment between candidates, eliminating the syntactic bias of surface-form metrics, (2) semantic routing that directs different claim types to appropriate verification strategies: symbolic solvers for logical claims and LLM ensembles for commonsense reasoning, and (3) precise logical error localization via Minimal Correction Subsets (MCS), which pinpoint the exact subset of claims to revise, transforming binary failure signals into actionable feedback. Our framework classifies claims by their logical status and aggregates multiple verification signals into a unified score with variance-based penalty. The system iteratively refines answers using structured feedback until acceptance criteria are met or convergence is achieved. This hybrid approach delivers formal guarantees where possible and consensus verification elsewhere, advancing trustworthy AI. With the GPT-OSS-120B model, VERGE demonstrates an average performance uplift of 18.7% at convergence across a set of reasoning benchmarks compared to single-pass approaches.

[NLP-65] Insight Agents: An LLM-Based Multi-Agent System for Data Insights SIGIR2025

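[Quick Read]: This paper addresses two core problems e-commerce sellers face with the tools and data available to them: difficulty discovering and effectively using available programs and tools, and difficulty understanding and making full use of the rich data those tools produce. The authors propose Insight Agents (IA), a conversational multi-agent data insight system that gives sellers personalized data and business insights through automated information retrieval. The key to the solution is an end-to-end agentic system built on a plan-and-execute paradigm with a hierarchical multi-agent structure: a manager agent and two worker agents (a data presentation agent and an insight generation agent). The manager agent combines out-of-domain (OOD) detection using a lightweight encoder-decoder model with agent routing through a BERT-based classifier, optimizing both accuracy and latency. Within the worker agents, a planning strategy for the API-based data model breaks queries into granular components, and domain knowledge is injected dynamically, improving response accuracy and insight depth. The system has launched for Amazon sellers in the US, with 90% accuracy under human evaluation and P90 latency below 15 seconds.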

Link: https://arxiv.org/abs/2601.20048
Authors: Jincheng Bai, Zhenyu Zhang, Jennifer Zhang, Zhihuai Zhu
Affiliations: Amazon
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to SIGIR 2025. DOI: https://doi.org/10.1145/3726302.3731959

Abstract:Today, E-commerce sellers face several key challenges, including difficulties in discovering and effectively utilizing available programs and tools, and struggling to understand and utilize rich data from various tools. We therefore aim to develop Insight Agents (IA), a conversational multi-agent Data Insight system, to provide E-commerce sellers with personalized data and business insights through automated information retrieval. Our hypothesis is that IA will serve as a force multiplier for sellers, thereby driving incremental seller adoption by reducing the effort required and increasing the speed at which sellers make good business decisions. In this paper, we introduce this novel LLM-backed end-to-end agentic system built on a plan-and-execute paradigm and designed for comprehensive coverage, high accuracy, and low latency. It features a hierarchical multi-agent structure, consisting of a manager agent and two worker agents: data presentation and insight generation, for efficient information retrieval and problem-solving. We design a simple yet effective ML solution for the manager agent that combines Out-of-Domain (OOD) detection using a lightweight encoder-decoder model and agent routing through a BERT-based classifier, optimizing both accuracy and latency. Within the two worker agents, strategic planning is designed for the API-based data model that breaks down queries into granular components to generate more accurate responses, and domain knowledge is dynamically injected to enhance the insight generator. IA has been launched for Amazon sellers in the US, where it has achieved a high accuracy of 90% based on human evaluation, with P90 latency below 15s.

[NLP-66] TAIGR: Towards Modeling Influencer Content on Social Media via Structured Pragmatic Inference

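[Quick Read]: This paper addresses the difficulty that traditional fact-checking methods have in capturing the implicit recommendations and argumentative structure in health influencer content. Because this content is mostly delivered through conversational narratives and rhetorical strategies rather than explicit factual claims, existing methods are limited when validating its credibility. The key to the solution is TAIGR (Takeaway Argumentation Inference with Grounded References), a framework with three stages: it first identifies the influencer's core recommendation (the takeaway), then constructs an argumentation graph that reflects the justification for it, and finally runs probabilistic inference over a factor graph to validate the takeaway. Experiments show that accurate validation of health influencer content requires modeling the pragmatic and argumentative structure of the discourse.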

Link: https://arxiv.org/abs/2601.20032
Authors: Nishanth Sridhar Nakshatri, Eylon Caplan, Rajkumar Pujari, Dan Goldwasser
Affiliations: Purdue University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Health influencers play a growing role in shaping public beliefs, yet their content is often conveyed through conversational narratives and rhetorical strategies rather than explicit factual claims. As a result, claim-centric verification methods struggle to capture the pragmatic meaning of influencer discourse. In this paper, we propose TAIGR (Takeaway Argumentation Inference with Grounded References), a structured framework designed to analyze influencer discourse, which operates in three stages: (1) identifying the core influencer recommendation–takeaway; (2) constructing an argumentation graph that captures influencer justification for the takeaway; (3) performing factor graph-based probabilistic inference to validate the takeaway. We evaluate TAIGR on a content validation task over influencer video transcripts on health, showing that accurate validation requires modeling the discourse’s pragmatic and argumentative structure rather than treating transcripts as flat collections of claims.

[NLP-67] Semantic Uncertainty Quantification of Hallucinations in LLMs: A Quantum Tensor Network Based Method

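[Quick Read]: This paper addresses the tendency of large language models (LLMs) to produce confabulations during generation: output that is fluent but semantically unreliable and highly uncertain. The core solution is a quantum-physics-inspired uncertainty quantification framework that uses a quantum tensor network based pipeline to model the aleatoric uncertainty in token sequence probabilities and clusters generations by semantic equivalence, yielding interpretable hallucination detection. The key element is an entropy maximization strategy that prioritizes high-certainty, semantically coherent outputs and flags high-entropy regions where decisions are unreliable, giving a clear basis for deciding when human oversight is needed. Experiments show that the approach is robust across multiple LLM architectures, generation lengths, and quantization levels, and consistently improves on existing baselines.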

Link: https://arxiv.org/abs/2601.20026
Authors: Pragatheeswaran Vipulanandan, Kamal Premaratne, Dilip Sarkar
Affiliations: University of Miami
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) exhibit strong generative capabilities but remain vulnerable to confabulations, fluent yet unreliable outputs that vary arbitrarily even under identical prompts. Leveraging a quantum tensor network based pipeline, we propose a quantum physics inspired uncertainty quantification framework that accounts for aleatoric uncertainty in token sequence probability for semantic equivalence based clustering of LLM generations. This offers a principled and interpretable scheme for hallucination detection. We further introduce an entropy maximization strategy that prioritizes high certainty, semantically coherent outputs and highlights entropy regions where LLM decisions are likely to be unreliable, offering practical guidelines for when human oversight is warranted. We evaluate the robustness of our scheme under different generation lengths and quantization levels, dimensions overlooked in prior studies, demonstrating that our approach remains reliable even in resource constrained deployments. A total of 116 experiments on TriviaQA, NQ, SVAMP, and SQuAD across multiple architectures including Mistral-7B, Mistral-7B-instruct, Falcon-rw-1b, LLaMA-3.2-1b, LLaMA-2-13b-chat, LLaMA-2-7b-chat, LLaMA-2-13b, and LLaMA-2-7b show consistent improvements in AUROC and AURAC over state of the art baselines.

[NLP-68] LinguaMap: Which Layers of LLMs Speak Your Language and How to Tune Them?

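[Quick Read]: This paper addresses the language control problem in multilingual use of large language models: the model fails to respond in the language the user intended. The authors identify two key failure modes: the multilingual transfer bottleneck (correct language, incorrect task response) and the language consistency bottleneck (correct task response, wrong language). The key to the solution is a layer-wise analysis revealing a three-phase internal structure: early layers align inputs into a shared cross-lingual semantic space, middle layers perform task reasoning, and late layers drive language-specific generation. Based on this insight, the authors selectively fine-tune only the final layers, which control the output language. On Qwen-3-32B and Bloom-7.1B, the method achieves over 98% language consistency while fine-tuning only 3 to 5% of parameters. It matches full fine-tuning on language consistency at a fraction of the compute and without sacrificing task accuracy.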

Link: https://arxiv.org/abs/2601.20009
Authors: J. Ben Tamo, Daniel Carlander-Reuterfelt, Jonathan Rubin, Dezhi Hong, Mingxian Wang, Oleg Poliannikov
Affiliations: Georgia Institute of Technology; Amazon
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Despite multilingual pretraining, large language models often struggle with non-English tasks, particularly in language control, the ability to respond in the intended language. We identify and characterize two key failure modes: the multilingual transfer bottleneck (correct language, incorrect task response) and the language consistency bottleneck (correct task response, wrong language). To systematically surface these issues, we design a four-scenario evaluation protocol spanning MMLU, MGSM, and XQuAD benchmarks. To probe these issues with interpretability, we extend logit lens analysis to track language probabilities layer by layer and compute cross-lingual semantic similarity of hidden states. The results reveal a three-phase internal structure: early layers align inputs into a shared semantic space, middle layers perform task reasoning, and late layers drive language-specific generation. Guided by these insights, we introduce selective fine-tuning of only the final layers responsible for language control. On Qwen-3-32B and Bloom-7.1B, this method achieves over 98 percent language consistency across six languages while fine-tuning only 3-5 percent of parameters, without sacrificing task accuracy. Importantly, this result is nearly identical to that of full-scope fine-tuning (for example, above 98 percent language consistency for both methods across all prompt scenarios) but uses a fraction of the computational resources. To the best of our knowledge, this is the first approach to leverage layer-localization of language control for efficient multilingual adaptation.
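
Selective fine-tuning of only the final layers comes down to deciding which parameters stay trainable. The sketch below shows one way to pick them by layer index from a list of parameter names. It is an illustration under stated assumptions, not the authors' implementation: the `layers.<index>.` naming pattern, the helper `select_trainable`, and the choice to leave embeddings and the final norm frozen are all hypothetical.

```python
import re

# Hypothetical naming convention: transformer blocks appear as "layers.<index>."
LAYER_PATTERN = re.compile(r"(?:^|\.)layers\.(\d+)\.")


def select_trainable(param_names, num_layers, last_n):
    """Return the parameter names that belong to the last `last_n` blocks.

    Everything else (embeddings, earlier blocks, final norm) is treated
    as frozen in this sketch.
    """
    first_trainable = num_layers - last_n
    selected = []
    for name in param_names:
        match = LAYER_PATTERN.search(name)
        if match and int(match.group(1)) >= first_trainable:
            selected.append(name)
    return selected


# Toy model with 4 blocks; only the final block is left trainable.
names = [
    "embed_tokens.weight",
    "layers.0.mlp.weight",
    "layers.1.mlp.weight",
    "layers.2.mlp.weight",
    "layers.3.mlp.weight",
    "norm.weight",
]
trainable = select_trainable(names, num_layers=4, last_n=1)
print(trainable)
```

In a real training setup the returned names would be used to switch gradient tracking on for those parameters and off for the rest.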

[NLP-69] On the Effectiveness of LLM-Specific Fine-Tuning for Detecting AI-Generated Text

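[Quick Read]: This paper addresses the challenge of verifying the authenticity of AI-generated text in education, publishing, and digital security, which has become an important technical and ethical issue. The key to the solution is building large, high-quality corpora and proposing new training paradigms. First, the authors construct a 1-billion-token corpus of human-authored text and a 1.9-billion-token, multi-model, cross-domain corpus of AI-generated text. Second, they propose two fine-tuning strategies, "Per LLM" and "Per LLM family" fine-tuning, intended to improve detector generalization and accuracy across models. On a 100-million-token benchmark covering 21 large language models, the best fine-tuned detector reaches up to 99.6% token-level accuracy, substantially outperforming existing open-source baselines.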

Link: https://arxiv.org/abs/2601.20006
Authors: Michał Gromadzki, Anna Wróblewska, Agnieszka Kaliska
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 34 pages, 6 figures. Under review at Information Sciences

Abstract:The rapid progress of large language models has enabled the generation of text that closely resembles human writing, creating challenges for authenticity verification in education, publishing, and digital security. Detecting AI-generated text has therefore become a crucial technical and ethical issue. This paper presents a comprehensive study of AI-generated text detection based on large-scale corpora and novel training strategies. We introduce a 1-billion-token corpus of human-authored texts spanning multiple genres and a 1.9-billion-token corpus of AI-generated texts produced by prompting a variety of LLMs across diverse domains. Using these resources, we develop and evaluate numerous detection models and propose two novel training paradigms: Per LLM and Per LLM family fine-tuning. Across a 100-million-token benchmark covering 21 large language models, our best fine-tuned detector achieves up to 99.6% token-level accuracy, substantially outperforming existing open-source baselines.

[NLP-70] Benchmarking von ASR-Modellen im deutschen medizinischen Kontext: Eine Leistungsanalyse anhand von Anamnesegesprächen

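[Quick Read]: This paper addresses the lack of targeted evaluation of automatic speech recognition (ASR) systems for the German-speaking medical domain, in particular how models handle German dialects and specialized medical terminology. The key to the solution is a curated dataset of simulated doctor-patient conversations and a systematic evaluation of 29 ASR models, covering open-weights models (Whisper, Voxtral, Wav2Vec2) and commercial APIs (AssemblyAI, Deepgram). Results are quantified with word error rate (WER), character error rate (CER), and BLEU, revealing how performance varies with medical terminology and dialect and providing a basis for further improvement.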

Link: https://arxiv.org/abs/2601.19945
Authors: Thomas Schuster, Julius Trögele, Nico Döring, Robin Krüger, Matthieu Hoffmann, Holger Friedrich
Affiliations: Hochschule Pforzheim; XPACE GmbH
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Language: German; English Title: Benchmarking ASR Models in German Medical Contexts: A Performance Analysis Using Anamnesis Conversations

Abstract:Automatic Speech Recognition (ASR) offers significant potential to reduce the workload of medical personnel, for example, through the automation of documentation tasks. While numerous benchmarks exist for the English language, specific evaluations for the German-speaking medical context are still lacking, particularly regarding the inclusion of dialects. In this article, we present a curated dataset of simulated doctor-patient conversations and evaluate a total of 29 different ASR models. The test field encompasses both open-weights models from the Whisper, Voxtral, and Wav2Vec2 families as well as commercial state-of-the-art APIs (AssemblyAI, Deepgram). For evaluation, we utilize three different metrics (WER, CER, BLEU) and provide an outlook on qualitative semantic analysis. The results demonstrate significant performance differences between the models: while the best systems already achieve very good Word Error Rates (WER) of partly below 3%, the error rates of other models, especially concerning medical terminology or dialect-influenced variations, are considerably higher.
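
Word Error Rate, the headline metric in this evaluation, is the word-level edit distance between reference and hypothesis divided by the number of reference words. The block below is a minimal sketch of that standard definition using whitespace tokenization; the example sentences are hypothetical, and the paper's own text normalization (casing, punctuation, number formatting) is not described in the abstract and is not reproduced here.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Hypothetical transcript pair: one substituted word and one inserted word.
wer = word_error_rate("der Patient hat Fieber", "der Patient hatte ein Fieber")
print(f"WER = {wer:.2f}")
```

Character Error Rate follows the same recurrence with characters in place of words.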

[NLP-71] Latent Object Permanence: Topological Phase Transitions, Free-Energy Principles, and Renormalization Group Flows in Deep Transformer Manifolds

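[Quick Read]: This paper seeks to explain how multi-step reasoning emerges in deep Transformer language models, and in particular how structured, reasoning-ready representations form out of high-dimensional hidden-state trajectories. The key is to treat the forward pass as a discrete coarse-graining map and to analyze the hidden-state trajectory as a flow on an implicit Riemannian manifold from a geometric and statistical-physics viewpoint. By tracking how far the layerwise covariance spectrum of activations deviates from the Random Matrix Theory (RMT) bulk, the authors find a sharp drop in effective dimensionality consistent with a phase transition: the sparsity/localization order parameter Ω(h) shows a discontinuity near the normalized depth γ_c ≈ 0.42. This marks the formation of stable "concept basins", which correspond to fixed points of the renormalization-like dynamics, and leads to transient, reusable object-like structures in the low-entropy representation space, called Transient Class Objects (TCOs). The framework connects logical separability to spectral decay theoretically and validates the predicted signatures with layerwise probes on multiple open-weight model families.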

Link: https://arxiv.org/abs/2601.19942
Authors: Faruk Alpay, Bugra Kilictas
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 12 pages, 3 figures

Abstract:We study the emergence of multi-step reasoning in deep Transformer language models through a geometric and statistical-physics lens. Treating the hidden-state trajectory as a flow on an implicit Riemannian manifold, we analyze the layerwise covariance spectrum of activations, where $C^{(\ell)} = \mathbb{E}[h^{(\ell)} h^{(\ell)\top}]$, and track deviations from a random-matrix bulk. Across model scales (1.5B–30B), we observe a sharp reduction in effective dimensionality consistent with a phase transition: an order parameter based on sparsity/localization, $\Omega(h) = 1 - \|h\|_1 / (\sqrt{d}\,\|h\|_2)$, exhibits a discontinuity near a critical normalized depth $\gamma_c \approx 0.42$ in sufficiently large models. We formalize the forward pass as a discrete coarse-graining map and relate the appearance of stable “concept basins” to fixed points of this renormalization-like dynamics. The resulting low-entropy regime is characterized by a spectral tail collapse and by the formation of transient, reusable object-like structures in representation space, which we call Transient Class Objects (TCOs). We provide theoretical conditions connecting logical separability to spectral decay and validate the predicted signatures with layerwise probes on multiple open-weight model families.
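
The sparsity/localization order parameter in the abstract is the ratio of the L1 norm to the L2 norm of a hidden state, rescaled by the square root of the dimension d and subtracted from 1. The sketch below evaluates that expression on two toy vectors to show its range; it assumes the bars in the formula denote vector norms, and the function name and inputs are hypothetical.

```python
import math


def localization_order_parameter(h):
    """Omega(h) = 1 - ||h||_1 / (sqrt(d) * ||h||_2) for a d-dimensional vector h."""
    d = len(h)
    l1_norm = sum(abs(x) for x in h)
    l2_norm = math.sqrt(sum(x * x for x in h))
    if l2_norm == 0.0:
        raise ValueError("Omega is undefined for the zero vector")
    return 1.0 - l1_norm / (math.sqrt(d) * l2_norm)


# A fully spread-out vector gives Omega = 0 (no localization).
dense = localization_order_parameter([1.0, 1.0, 1.0, 1.0])
# A one-hot vector gives the maximum for this dimension, 1 - 1/sqrt(d).
one_hot = localization_order_parameter([1.0, 0.0, 0.0, 0.0])
print(dense, one_hot)
```

Because the L1 norm is at most sqrt(d) times the L2 norm, the value always lies between 0 and 1 - 1/sqrt(d), with larger values meaning activation mass is concentrated in fewer coordinates.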

[NLP-72] Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data

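[Quick Read]: This paper addresses the privacy and copyright concerns raised by the opacity of LLM pretraining corpora, namely how to detect whether a given text was part of a model's pretraining data. Existing methods rely mainly on token likelihoods and tend to overlook the divergence between the model's top-1 prediction and the target token, as well as local correlation between adjacent tokens. The paper proposes Gap-K%, which builds on the optimization dynamics of pretraining: analyzing the next-token prediction objective shows that a gap between the top-1 prediction and the true target token induces a strong gradient signal that is explicitly penalized during training. Gap-K% therefore uses the log probability gap between the top-1 token and the target token, with a sliding window to capture local correlations and dampen token-level fluctuations. Experiments show state-of-the-art performance on the WikiMIA and MIMIR benchmarks across model sizes and input lengths.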

Link: https://arxiv.org/abs/2601.19936
Authors: Minseo Kwak, Jaehyung Kim
Affiliations: Yonsei University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: under review; 13 pages

Abstract:The opacity of massive pretraining corpora in Large Language Models (LLMs) raises significant privacy and copyright concerns, making pretraining data detection a critical challenge. Existing state-of-the-art methods typically rely on token likelihoods, yet they often overlook the divergence from the model’s top-1 prediction and local correlation between adjacent tokens. In this work, we propose Gap-K%, a novel pretraining data detection method grounded in the optimization dynamics of LLM pretraining. By analyzing the next-token prediction objective, we observe that discrepancies between the model’s top-1 prediction and the target token induce strong gradient signals, which are explicitly penalized during training. Motivated by this, Gap-K% leverages the log probability gap between the top-1 predicted token and the target token, incorporating a sliding window strategy to capture local correlations and mitigate token-level fluctuations. Extensive experiments on the WikiMIA and MIMIR benchmarks demonstrate that Gap-K% achieves state-of-the-art performance, consistently outperforming prior baselines across various model sizes and input lengths.
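
The abstract names two ingredients: a per-token log probability gap between the top-1 prediction and the target token, and a sliding window over adjacent tokens. The block below is a rough sketch of how those pieces could combine into a single membership score. The aggregation is an assumption made for illustration: averaging the largest K% of windowed gaps (by analogy with Min-K%-style scoring), the default values of `k` and `window`, and the name `gap_k_score` do not come from the paper, and the inputs are made-up numbers rather than model outputs.

```python
import math


def gap_k_score(target_logprobs, top1_logprobs, k=0.2, window=3):
    """Illustrative membership score; higher means 'more likely seen in training'.

    target_logprobs[t]: log-probability the model assigns to the actual token t.
    top1_logprobs[t]:   log-probability of the model's most likely token at t.
    """
    if len(target_logprobs) != len(top1_logprobs) or not target_logprobs:
        raise ValueError("need two non-empty sequences of equal length")

    # Gap is 0 when the target is the top-1 token, positive otherwise.
    gaps = [top1 - target for target, top1 in zip(target_logprobs, top1_logprobs)]

    # Sliding-window mean to smooth token-level fluctuations.
    width = min(window, len(gaps))
    smoothed = [
        sum(gaps[i:i + width]) / width for i in range(len(gaps) - width + 1)
    ]

    # Assumed aggregation: average the largest k fraction of windowed gaps.
    n_selected = max(1, math.ceil(k * len(smoothed)))
    largest = sorted(smoothed, reverse=True)[:n_selected]
    return -sum(largest) / n_selected


# Made-up log-probabilities: the model misses the third token badly.
top1 = [-0.1, -0.2, -0.5, -0.3]
target = [-0.1, -0.2, -3.5, -0.3]
score = gap_k_score(target, top1, k=0.5, window=2)
print(score)
```

A text whose every token is the model's top-1 prediction scores 0, the maximum under this sketch; larger gaps push the score down.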

[NLP-73] Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents

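[Quick Read]: This paper addresses a key gap in current LLM agents for tool-calling tasks: the ability to actively use long-term memory. Existing benchmarks mainly test passive retrieval of isolated facts and overlook the core capability of selecting tools and grounding their parameters from past memory across sessions. The key to the solution is a new benchmark, Mem2ActBench, built with an automated pipeline that merges heterogeneous sources (ToolACE, BFCL, Oasst1), resolves conflicts via consistency modeling, and synthesizes 2,029 sessions averaging 12 user-assistant-tool turns. A reverse-generation method then produces 400 memory-dependent tool-use tasks from these memory chains, enabling systematic evaluation of whether agents actively apply memory to complete tasks. Experiments show that current memory frameworks remain inadequate at parameter grounding, highlighting the need for better memory application mechanisms.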

Link: https://arxiv.org/abs/2601.19935
Authors: Yiting Shen, Kun Li, Wei Zhou, Songlin Hu
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyberspace Security, University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Model (LLM)-based agents are increasingly deployed for complex, tool-based tasks where long-term memory is critical to driving actions. Existing benchmarks, however, primarily test an agent’s ability to passively retrieve isolated facts in response to explicit questions. They fail to evaluate the more crucial capability of actively applying memory to execute tasks. To address this gap, we introduce Mem2ActBench, a benchmark for evaluating whether agents can proactively leverage long-term memory to execute tool-based actions by selecting appropriate tools and grounding their parameters. The benchmark simulates persistent assistant usage, where users mention the same topic across long, interrupted interactions and expect previously established preferences and task states to be implicitly applied. We build the dataset with an automated pipeline that merges heterogeneous sources (ToolACE, BFCL, Oasst1), resolves conflicts via consistency modeling, and synthesizes 2,029 sessions with 12 user–assistant–tool turns on average. From these memory chains, a reverse-generation method produces 400 tool-use tasks, with human evaluation confirming 91.3% are strongly memory-dependent. Experiments on seven memory frameworks show that current systems remain inadequate at actively utilizing memory for parameter grounding, highlighting the need for more effective approaches to evaluate and improve memory application in task execution.

[NLP-74] Quantifying non deterministic drift in large language models

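[Quick Read]: This paper addresses inconsistent output from large language models (LLMs) given the same prompt, that is, behavioural drift, and in particular the nondeterminism that persists under nominally deterministic settings such as a temperature fixed at 0.0. The key to the approach is systematic repeated-run experiments that quantify baseline drift under operator-free conditions, using unique output fraction, lexical similarity, and word count statistics to compare across models (gpt-4o-mini and llama3.1-8b), prompt categories, and deployment types. The result is a drift baseline measured without stabilisation techniques, which serves as a reference point for evaluating future drift mitigation and control methods.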

Link: https://arxiv.org/abs/2601.19934
Authors: Claire Nicholson
Affiliations: HelixScribe.AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 figures, 1 table. Empirical measurement study reporting new repeated-run experiments quantifying baseline nondeterministic drift in large language models. This manuscript presents original empirical results (not a review or position paper) and establishes a baseline reference for future drift-mitigation work

Abstract:Large language models (LLMs) are widely used for tasks ranging from summarisation to decision support. In practice, identical prompts do not always produce identical outputs, even when temperature and other decoding parameters are fixed. In this work, we conduct repeated-run experiments to empirically quantify baseline behavioural drift, defined as output variability observed when the same prompt is issued multiple times under operator-free conditions. We evaluate two publicly accessible models, gpt-4o-mini and llama3.1-8b, across five prompt categories using exact repeats, perturbed inputs, and reuse modes at temperatures of 0.0 and 0.7. Drift is measured using unique output fractions, lexical similarity, and word count statistics, enabling direct comparison across models, prompting modes, and deployment types. The results show that nondeterminism persists even at temperature 0.0, with distinct variability patterns by model size, deployment, and prompt type. We situate these findings within existing work on concept drift, behavioural drift, and infrastructure-induced nondeterminism, discuss the limitations of lexical metrics, and highlight emerging semantic approaches. By establishing a systematic empirical baseline in the absence of stabilisation techniques, this study provides a reference point for evaluating future drift mitigation and control methods.
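
Two of the drift measures named in the abstract, unique output fraction and lexical similarity, are simple to compute over a batch of repeated runs. The sketch below shows one plausible version. The abstract does not say which lexical similarity measure the study uses, so word-level Jaccard overlap is an assumption here, and the sample outputs are invented.

```python
from itertools import combinations


def unique_output_fraction(outputs):
    """Share of distinct strings among repeated runs (1/n means fully stable)."""
    return len(set(outputs)) / len(outputs)


def jaccard(a, b):
    """Word-level Jaccard overlap; an assumed stand-in for 'lexical similarity'."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mean_pairwise_similarity(outputs):
    """Average similarity over all unordered pairs of runs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Invented outputs from three runs of the same prompt.
runs = ["the sky is blue", "the sky is blue", "the sky is grey"]
unique_fraction = unique_output_fraction(runs)
similarity = mean_pairwise_similarity(runs)
print(unique_fraction, similarity)
```

The two numbers are complementary: the unique fraction counts any difference at all, while the pairwise similarity shows how large the differences are.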

[NLP-75] Text-to-State Mapping for Non-Resolution Reasoning: The Contradiction-Preservation Principle

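[Quick Read]: This paper addresses how natural language maps onto the mathematical state spaces of the Non-Resolution Reasoning (NRR) framework: how to turn raw text into superposition states that preserve semantic ambiguity, so that language model inference does not force premature semantic collapse. The key to the solution is a text-to-state mapping function φ together with a formalized Contradiction-Preservation Principle, which requires genuinely ambiguous expressions to keep non-zero entropy in their state representations. Extraction protocols use existing large language models as interpretation generators. On 68 test sentences covering lexical, structural, and pragmatic ambiguity, the mapping reaches a mean Shannon entropy of H(S) = 1.087 bits, compared with H(S) = 0.000 for a single-interpretation baseline. This provides an algorithmic bridge from raw text to the NRR state space and supports deferring semantic collapse at inference time.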

Link: https://arxiv.org/abs/2601.19933
Authors: Kei Saito
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, 3 figures, 5 tables. Sequel to arXiv:2512.13478

Abstract:Non-Resolution Reasoning (NRR) provides a formal framework for maintaining semantic ambiguity rather than forcing premature interpretation collapse. While the foundational architecture establishes state spaces and operators for ambiguity-preserving computation, the critical question of how natural language maps to these mathematical structures remains open. This paper introduces the text-to-state mapping function $\phi$ that transforms linguistic input into superposition states within the NRR framework. We formalize the Contradiction-Preservation Principle, which requires that genuinely ambiguous expressions maintain non-zero entropy in their state representations, and develop extraction protocols using existing Large Language Models as interpretation generators. Empirical validation across 68 test sentences spanning lexical, structural, and pragmatic ambiguity demonstrates that our mapping achieves mean Shannon entropy $H(S) = 1.087$ bits for ambiguous inputs while baseline single-interpretation approaches yield $H(S) = 0.000$. The framework provides the missing algorithmic bridge between raw text and the formal state spaces on which NRR operators act, enabling architectural collapse deferment in language model inference.
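
The entropy figures quoted above follow from the standard Shannon formula applied to a distribution over candidate interpretations. As a minimal sketch, the block below computes that entropy in bits from a list of interpretation weights. How the paper actually extracts and weights interpretations is not stated in the abstract, so the weights here are hypothetical.

```python
import math


def shannon_entropy_bits(weights):
    """H = -sum(p * log2(p)) over the normalized interpretation weights."""
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative with a positive sum")
    probabilities = [w / total for w in weights if w > 0]
    return -sum(p * math.log2(p) for p in probabilities)


# A single committed interpretation carries no ambiguity: H = 0 bits.
single = shannon_entropy_bits([1.0])
# Two equally weighted readings of an ambiguous sentence: H = 1 bit.
two_way = shannon_entropy_bits([0.5, 0.5])
# Hypothetical three-way split with one dominant reading: H = 1.5 bits.
three_way = shannon_entropy_bits([0.5, 0.25, 0.25])
print(single, two_way, three_way)
```

This makes the baseline result easy to read: any approach that commits to one interpretation scores exactly zero by construction.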

[NLP-76] “Newspaper Eat” Means “Not Tasty”: A Taxonomy and Benchmark for Coded Languages in Real-World Chinese Online Reviews

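[Quick Read]: This paper addresses the poor performance of natural language processing (NLP) systems on coded language: cases where users intentionally encode meaning so that the surface text differs from the intended meaning and must be decoded to be understood, which is common in social media and reviews. Progress has been limited by the lack of real-world datasets and of a clear taxonomy of encoding strategies. The authors introduce CodedLang, a dataset of 7,744 Chinese Google Maps reviews, 900 of which carry span-level annotations of coded language, together with a seven-class taxonomy of encoding strategies that includes phonetic, orthographic, and cross-lingual substitutions. Experiments show that even strong language models can fail at coded language detection, classification, and review rating prediction. Because many coded expressions rely on pronunciation, the authors also run a phonetic analysis of coded and decoded forms. The work highlights coded language as an important open challenge for real-world NLP systems.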

Link: https://arxiv.org/abs/2601.19932
Authors: Ruyuan Wan, Changye Li, Ting-Hao ‘Kenneth’ Huang
Affiliations: The Pennsylvania State University; University of Washington
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Coded language is an important part of human communication. It refers to cases where users intentionally encode meaning so that the surface text differs from the intended meaning and must be decoded to be understood. Current language models handle coded language poorly. Progress has been limited by the lack of real-world datasets and clear taxonomies. This paper introduces CodedLang, a dataset of 7,744 Chinese Google Maps reviews, including 900 reviews with span-level annotations of coded language. We developed a seven-class taxonomy that captures common encoding strategies, including phonetic, orthographic, and cross-lingual substitutions. We benchmarked language models on coded language detection, classification, and review rating prediction. Results show that even strong models can fail to identify or understand coded language. Because many coded expressions rely on pronunciation-based strategies, we further conducted a phonetic analysis of coded and decoded forms. Together, our results highlight coded language as an important and underexplored challenge for real-world NLP systems.

[NLP-77] CascadeMind at SemEval-2026 Task 4: A Hybrid Neuro-Symbolic Cascade for Narrative Similarity SEMEVAL-2026

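[Quick Read]: This paper addresses ambiguity in judging narrative story similarity, in particular the accuracy bottleneck when a neural model cannot reach a clear decision. The key to the solution is a hybrid neuro-symbolic system. A large language model (LLM) first casts multiple self-consistency votes, and a supermajority threshold yields high-confidence decisions. When the votes tie, a symbolic ensemble of five narrative similarity signals (lexical overlap, semantic embeddings, story grammar structure, event chain alignment, and narrative tension curves) acts as a symbolic tiebreaker and gives the final decision. This cascade reaches 81% accuracy on the development set, showing the value of selectively deferring to symbolic methods on genuinely ambiguous comparisons.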

Link: https://arxiv.org/abs/2601.19931
Authors: Sebastien Kawada, Dylan Holyoak
Affiliations: Geffen Academy at UCLA
Categories: Computation and Language (cs.CL)
Comments: 6 pages (including references), 2 figures, 2 tables. System description paper for SemEval-2026 Task 4 (Narrative Story Similarity)

Abstract:We present a hybrid neuro-symbolic system for the SemEval-2026 Task 4 on Narrative Story Similarity. Our approach combines neural self-consistency voting with a novel Multi-Scale Narrative Analysis Ensemble that operates as a symbolic tiebreaker. The neural network component uses a large language model with multiple parallel votes, applying a supermajority threshold for confident decisions and escalating uncertain cases to additional voting rounds. When votes result in a perfect tie, a symbolic ensemble combining five narrative similarity signals (lexical overlap, semantic embeddings, story grammar structure, event chain alignment, and narrative tension curves) provides the final decision. Our cascade architecture achieves 81% accuracy on the development set, demonstrating that selective deferral to symbolic methods can enhance neural predictions on genuinely ambiguous narrative comparisons.
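
The cascade described in the abstract can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the 0.75 threshold, the "A"/"B" labels, the `symbolic_tiebreak` callable, and the fallback to a plain majority when no round reaches a supermajority are all assumptions, since the abstract does not specify them.

```python
from collections import Counter
from typing import Callable, Iterable, Sequence, Tuple


def cascade_decide(
    rounds: Iterable[Sequence[str]],
    symbolic_tiebreak: Callable[[], str],
    supermajority: float = 0.75,
) -> Tuple[str, str]:
    """Return (label, stage) for one story-pair comparison.

    rounds: vote batches, each a sequence of "A"/"B" labels from the LLM.
    symbolic_tiebreak: hypothetical stand-in for the symbolic ensemble.
    supermajority: assumed threshold; the abstract does not give its value.
    """
    counts: Counter = Counter()
    total = 0
    for batch in rounds:
        # Accumulate votes across rounds (escalation adds more votes).
        counts.update(batch)
        total += len(batch)
        if total == 0:
            continue
        label, n = counts.most_common(1)[0]
        if n / total >= supermajority:
            return label, "neural"  # confident neural decision
    if counts["A"] == counts["B"]:
        # Perfect tie: defer to the symbolic ensemble.
        return symbolic_tiebreak(), "symbolic"
    # Assumed fallback: plain majority when no supermajority was reached.
    return ("A" if counts["A"] > counts["B"] else "B"), "majority"
```

For example, `cascade_decide([["A", "B"], ["B", "A"]], lambda: "B")` ends in a 2-2 tie, so the decision comes from the symbolic stage.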

[NLP-78] SDUs DAISY: A Benchmark for Danish Culture

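[Quick Read]: This paper addresses the lack of a systematic evaluation benchmark with comprehensive knowledge coverage of Danish cultural heritage, in particular the imbalance between mainstream cultural information and peripheral but historically valuable content. The key to its solution is a new benchmark dataset, Daisy, built on the curated topics of the Danish Culture Canon 2006. A language model generates varied questions (both central and peripheral) from Wikipedia pages, and each one is approved or corrected by a human. The result is a dataset of 741 closed-ended question-answer pairs spanning eras from 1300 BC archaeological findings to contemporary pop music, design, and architecture, which enables a more comprehensive and structured assessment of knowledge about Danish cultural heritage.

Link: https://arxiv.org/abs/2601.19930
Authors: Jacob Nielsen, Stine L. Beltoft, Peter Schneider-Kamp, Lukas Galke Poech
Affiliations: University of Southern Denmark
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Danish Culture Benchmark, 2 Tables, 1 Figure demonstrating the data curation pipeline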

Abstract:We introduce Daisy, a new benchmark for Danish culture via cultural heritage, based on the curated topics from the Danish Culture Canon 2006. For each artifact in the culture canon, we query the corresponding Wikipedia page and have a language model generate random questions. This yields a sampling strategy within each work that mixes central and peripheral questions, covering not only mainstream information but also the in-depth cornerstones that define the heritage of Danish culture, as selected by the Canon committee. Each question-answer pair is approved or corrected by a human, and the final dataset consists of 741 closed-ended question-answer pairs covering topics from 1300 BC archaeological findings and 1700s poems and musical pieces to contemporary pop music and Danish design and architecture.

[NLP-79] Stingy Context: 18:1 Hierarchical Code Compression for LLM Auto-Coding

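[Quick Read]: This paper addresses the performance bottleneck that context-length limits impose on large language models (LLMs) in auto-coding tasks, especially the "lost-in-the-middle" effect that appears when processing large source code bases. The key to its solution is Stingy Context, a hierarchical tree-based compression scheme whose core innovation is the TREEFRAG decomposition strategy, which compresses a real code base from 239k tokens to 11k tokens while preserving task fidelity. Experiments with 12 frontier models on 40 real-world issues show success rates of 94% to 97%, outperforming flat compression methods.

Link: https://arxiv.org/abs/2601.19929
Authors: David Linus Ostby
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 28 pages, 10 tables, 2 figures and 6 appendices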

Abstract:We introduce Stingy Context, a hierarchical tree-based compression scheme achieving 18:1 reduction in LLM context for auto-coding tasks. Using our TREEFRAG exploit decomposition, we reduce a real source code base of 239k tokens to 11k tokens while preserving task fidelity. Empirical results across 12 Frontier models show 94 to 97% success on 40 real-world issues at low cost, outperforming flat methods and mitigating lost-in-the-middle effects.

[NLP-80] Towards a Mechanistic Understanding of Large Reasoning Models: A Survey of Training Inference and Failures

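[Quick Read]: This paper addresses the lack of transparency in the internal workings of Large Reasoning Models (LRMs) despite their strong performance, that is, how to move from "black-box performance" to "mechanistic transparency." The key to its solution is a systematic survey of recent work that organizes the mechanistic understanding of LRMs into three core dimensions: training dynamics, reasoning mechanisms, and unintended behaviors. By integrating insights across these dimensions, it connects model performance to internal mechanisms and outlines actionable directions for future research, including applied interpretability, improved methodologies, and a unified theoretical framework.

Link: https://arxiv.org/abs/2601.19928
Authors: Yi Hu, Jiaqi Gu, Ruxin Wang, Zijun Yao, Hao Peng, Xiaobao Wu, Jianhui Chen, Muhan Zhang, Liangming Pan
Affiliations: Unknown
Categories: Computation and Language (cs.CL)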

Abstract:Reinforcement learning (RL) has catalyzed the emergence of Large Reasoning Models (LRMs) that have pushed reasoning capabilities to new heights. While their performance has garnered significant excitement, exploring the internal mechanisms driving these behaviors has become an equally critical research frontier. This paper provides a comprehensive survey of the mechanistic understanding of LRMs, organizing recent findings into three core dimensions: 1) training dynamics, 2) reasoning mechanisms, and 3) unintended behaviors. By synthesizing these insights, we aim to bridge the gap between black-box performance and mechanistic transparency. Finally, we discuss under-explored challenges to outline a roadmap for future mechanistic studies, including the need for applied interpretability, improved methodologies, and a unified theoretical framework.

[NLP-81] Attribution Techniques for Mitigating Hallucinated Information in RAG Systems: A Survey

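[Quick Read]: This paper addresses the new types of hallucination that Retrieval-Augmented Generation (RAG) systems introduce through complex interactions between the retriever and the generator, in particular unfaithful statements that lack verifiable support. The key to its solution is attribution-based techniques, which ensure that generated responses are explicitly supported by retrieved content and thereby improve the trustworthiness and reliability of RAG systems. The paper also presents a unified pipeline for attribution techniques, proposes a taxonomy of hallucination types in RAG systems to help practitioners select and apply techniques, and systematically compares the strengths and limitations of existing methods.

Link: https://arxiv.org/abs/2601.19927
Authors: Yuqing Zhao, Ziyao Liu, Yongsen Zheng, Kwok-Yan Lam
Affiliations: Nanyang Technological University, Singapore
Categories: Computation and Language (cs.CL)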

Abstract:Large Language Models (LLMs)-based question answering (QA) systems play a critical role in modern AI, demonstrating strong performance across various tasks. However, LLM-generated responses often suffer from hallucinations, unfaithful statements lacking reliable references. Retrieval-Augmented Generation (RAG) frameworks enhance LLM responses by incorporating external references but also introduce new forms of hallucination due to complex interactions between the retriever and generator. To address these challenges, researchers have explored attribution-based techniques that ensure responses are verifiably supported by retrieved content. Despite progress, a unified pipeline for these techniques, along with a clear taxonomy and systematic comparison of their strengths and weaknesses, remains lacking. A well-defined taxonomy is essential for identifying specific failure modes within RAG systems, while comparative analysis helps practitioners choose appropriate solutions based on hallucination types and application context. This survey investigates how attribution-based techniques are used within RAG systems to mitigate hallucinations and addresses the gap by: (i) outlining a taxonomy of hallucination types in RAG systems, (ii) presenting a unified pipeline for attribution techniques, (iii) reviewing techniques based on the hallucinations they target, and (iv) discussing strengths and weaknesses with practical guidelines. This work offers insights for future research and practical use of attribution techniques in RAG systems.

[NLP-82] The Grammar of Transformers: A Systematic Review of Interpretability Research on Syntactic Knowledge in Language Models

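[Quick Read]: This paper addresses systematic biases and limitations in how the syntactic abilities of Transformer-based language models (TLMs) are currently evaluated, in particular the over-focus on a single language (English), a single model (BERT), and easily accessible phenomena, along with the limited understanding of phenomena at the syntax-semantics interface. The key to its solution is a systematic review of 337 articles that aggregates 1,015 model results. The review shows that TLMs handle form-oriented syntactic phenomena (such as part of speech and agreement) well, but perform more variably and weakly on phenomena at the syntax-semantics interface (such as binding or filler-gap dependencies). On that basis it recommends that future work report complete data, better align theoretical constructs and methods across studies, make more use of mechanistic methods, and broaden the range of languages and linguistic phenomena studied.

Link: https://arxiv.org/abs/2601.19926
Authors: Nora Graichen, Iria de-Dios-Flores, Gemma Boleda
Affiliations: Universitat Pompeu Fabra; ICREA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)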

Abstract:We present a systematic review of 337 articles evaluating the syntactic abilities of Transformer-based language models, reporting on 1,015 model results from a range of syntactic phenomena and interpretability methods. Our analysis shows that the state of the art presents a healthy variety of methods and data, but an over-focus on a single language (English), a single model (BERT), and phenomena that are easy to get at (like part of speech and agreement). Results also suggest that TLMs capture these form-oriented phenomena well, but show more variable and weaker performance on phenomena at the syntax-semantics interface, like binding or filler-gap dependencies. We provide recommendations for future work, in particular reporting complete data, better aligning theoretical constructs and methods across studies, increasing the use of mechanistic methods, and broadening the empirical scope regarding languages and linguistic phenomena.

[NLP-83] Evaluating Large Language Models for Abstract Evaluation Tasks: An Empirical Study

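[Quick Read]: This paper addresses the consistency and reliability of large language models (LLMs) when evaluating complex academic content, in particular their suitability for scientific review. It compares the scores that three LLMs (ChatGPT-5, Gemini-3-Pro, and Claude-Sonnet-4.5) and 14 human reviewers gave to 160 conference abstracts, both on overall quality and on specific criteria. The approach rests on two elements: first, a single shared rubric so that all raters apply the same standard; second, intraclass correlation coefficients (ICCs) and Bland-Altman plots to quantify agreement and systematic bias among the LLMs and between the LLMs and humans. The LLMs agreed with one another at ICCs of 0.59 to 0.87. ChatGPT and Claude reached moderate agreement with human reviewers on overall quality and content-specific criteria (ICCs of roughly 0.45 to 0.60), and mean composite scores differed little from the human mean (ChatGPT = 0.24, Gemini = 0.42, Claude = -0.02), which suggests that LLMs can batch-process abstracts with reasonable stability. Agreement was weaker on subjective dimensions such as impact, engagement, and applicability (ICC = 0.23 to 0.38), so LLMs should complement rather than replace human experts and need an appropriate process architecture to be used well.

Link: https://arxiv.org/abs/2601.19925
Authors: Yinuo Liu, Emre Sezgin, Eric A. Youngstrom
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages, 4 figures, 2 tables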

Abstract:Introduction: Large language models (LLMs) can process requests and generate texts, but their feasibility for assessing complex academic content needs further investigation. To explore LLMs’ potential in assisting scientific review, this study examined ChatGPT-5, Gemini-3-Pro, and Claude-Sonnet-4.5’s consistency and reliability in evaluating abstracts compared to one another and to human reviewers. Methods: 160 abstracts from a local conference were graded by human reviewers and three LLMs using one rubric. Composite score distributions across three LLMs and fourteen reviewers were examined. Inter-rater reliability was calculated using intraclass correlation coefficients (ICCs) for within-AI reliability and AI-human concordance. Bland-Altman plots were examined for visual agreement patterns and systematic bias. Results: LLMs achieved good-to-excellent agreement with each other (ICCs: 0.59-0.87). ChatGPT and Claude reached moderate agreement with human reviewers on overall quality and content-specific criteria, with ICCs of roughly 0.45 to 0.60 for composite, impression, clarity, objective, and results. They exhibited fair agreement on subjective dimensions, with ICCs ranging from 0.23 to 0.38 for impact, engagement, and applicability. Gemini showed fair agreement on half of the criteria and no reliability on impact and applicability. The three LLMs showed acceptable or negligible mean differences (ChatGPT=0.24, Gemini=0.42, Claude=-0.02) from the human mean composite scores. Discussion: LLMs could process abstracts in batches with moderate agreement with human experts on overall quality and objective criteria. With appropriate process architecture, they can apply a rubric consistently across volumes of abstracts exceeding feasibility for a human rater. The weaker performance on subjective dimensions indicates that AI should serve a complementary role in evaluation, while human expertise remains essential.
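
The "mean difference" figures above are the bias term of a Bland-Altman analysis. A minimal sketch of that calculation follows, using the conventional 95% limits of agreement (bias ± 1.96 standard deviations of the paired differences). The function name and the toy scores are illustrative assumptions, not data from the paper.

```python
from statistics import mean, stdev


def bland_altman(model_scores, human_scores):
    """Return (bias, lower_limit, upper_limit) for paired scores.

    bias is the mean of (model - human); the limits are the usual
    95% limits of agreement, bias +/- 1.96 * sample std of the differences.
    """
    diffs = [m - h for m, h in zip(model_scores, human_scores)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread


# Toy example: three abstracts scored by a model and by the human panel mean.
bias, lower, upper = bland_altman([3.0, 4.0, 5.0], [2.0, 4.0, 4.0])
```

A positive bias means the model scores higher than the human reviewers on average; limits of agreement that straddle zero indicate no consistent over- or under-scoring on individual items.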

[NLP-84] OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling

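[Quick Read]: This paper addresses the poorly understood capability boundaries of large language models (LLMs) in optimization modeling, especially how well they generalize and where they hit bottlenecks in complex, real-world scenarios. To fill this gap, the authors propose OPT-ENGINE, an extensible benchmark framework for evaluating LLMs on optimization modeling tasks with controllable and scalable difficulty, covering 10 canonical operations research tasks (five linear programming and five mixed-integer programming). Two findings stand out: first, tool-integrated reasoning with external solvers is markedly more robust as complexity grows; second, the automated formulation of constraints is the primary performance bottleneck for current LLMs. These findings give clear direction for developing next-generation LLMs for advanced optimization.

Link: https://arxiv.org/abs/2601.19924
Authors: Yitian Chen, Cheng Cheng, Yinan Sun, Zi Ling, Dongdong Ge
Affiliations: Cardinal Operations; Shanghai University of Finance and Economics; Booth School of Business, University of Chicago; Antai School of Economics and Management, Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)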

Abstract:Large Language Models (LLMs) have demonstrated impressive progress in optimization modeling, fostering a rapid expansion of new methodologies and evaluation benchmarks. However, the boundaries of their capabilities in automated formulation and problem solving remain poorly understood, particularly when extending to complex, real-world tasks. To bridge this gap, we propose OPT-ENGINE, an extensible benchmark framework designed to evaluate LLMs on optimization modeling with controllable and scalable difficulty levels. OPT-ENGINE spans 10 canonical tasks across operations research, with five Linear Programming and five Mixed-Integer Programming tasks. Utilizing OPT-ENGINE, we conduct an extensive study of LLMs’ reasoning capabilities, addressing two critical questions: 1) Does LLMs’ performance remain robust when generalizing to out-of-distribution optimization tasks that scale in complexity beyond current benchmark levels? and 2) At what stage, from problem interpretation to solution generation, do current LLMs encounter the most significant bottlenecks? Our empirical results yield two key insights: first, tool-integrated reasoning with external solvers exhibits significantly higher robustness as task complexity escalates, while pure-text reasoning reaches a ceiling; second, the automated formulation of constraints constitutes the primary performance bottleneck. These findings provide actionable guidance for developing next-generation LLMs for advanced optimization. Our code is publicly available at this https URL.

[NLP-85] Table-BiEval: A Self-Supervised Dual-Track Framework for Decoupling Structure and Content in LLM Evaluation

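[Quick Read]: This paper addresses the lack of an effective evaluation method, free of human intervention, for how accurately large language models (LLMs) translate natural language into structured formats (such as those needed for tool invocation) and convert tables into machine-readable specifications. Existing text metrics cannot detect semantic drift in code-like outputs, so structural ability is under-evaluated. The key to its solution is Table-BiEval, a human-free, self-supervised evaluation framework based on deterministic Intermediate Representations. It computes Content Semantic Accuracy and Normalized Tree Edit Distance to quantify structure and content separately, and evaluates 15 state-of-the-art LLMs across two topological dimensions, hierarchical structures and flat tables. The results show that mid-sized models can outperform larger ones in structural efficiency and that deep recursive nesting remains a universal bottleneck for current architectures.

Link: https://arxiv.org/abs/2601.19923
Authors: Boxiang Zhao, Qince Li, Zhonghao Wang, Zelin Cao, Yi Wang, Peng Cheng, Bo Lin
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)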

Abstract:As Large Language Models (LLMs) evolve into autonomous agents, the capability to faithfully translate natural language into rigorous structured formats (essential for tool invocation) and to convert complex tabular information into machine-readable specifications has become paramount. However, current evaluations lack effective methodologies to measure this structural fidelity without costly human intervention, as traditional text metrics fail to detect semantic drift in code-like outputs. This paper proposes Table-BiEval, a novel approach based on a human-free, self-supervised evaluation framework, to assess LLM performance quantitatively. By leveraging deterministic Intermediate Representations, our framework calculates Content Semantic Accuracy and Normalized Tree Edit Distance to decouple structure from content. It also empirically evaluates 15 state-of-the-art LLMs across dual topological dimensions: hierarchical structures and flat tables. The results reveal substantial variability, highlighting that mid-sized models can surprisingly outperform larger counterparts in structural efficiency and confirming that deep recursive nesting remains a universal bottleneck for current architectures.

[NLP-86] HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue

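[Quick Read]: This paper addresses the lack of a systematic framework for comparing generative AI with humans in emotional support dialogue, that is, how to objectively measure the interpersonal abilities of large language models (LLMs). The key to its solution is HEART, the first framework to compare humans and LLMs directly on the same multi-turn emotional-support conversations. Paired responses are scored by blinded human raters and an ensemble of LLM-as-judge evaluators, using a rubric grounded in interpersonal communication science with five dimensions (Human Alignment, Empathic Responsiveness, Attunement, Resonance, and Task-Following). The results show that several frontier models approach or surpass average human responses in perceived empathy and consistency, while humans retain advantages in adaptive reframing, tension-naming, and nuanced tone shifts. Human and LLM-as-judge preferences agree on about 80 percent of pairwise comparisons, which gives a unified empirical foundation for understanding the limits and development of AI emotional support.

Link: https://arxiv.org/abs/2601.19922
Authors: Laya Iyer, Kriti Aggarwal, Sanmi Koyejo, Gail Heyman, Desmond C. Ong, Subhabrata Mukherjee
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)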

Abstract:Supportive conversation depends on skills that go beyond language fluency, including reading emotions, adjusting tone, and navigating moments of resistance, frustration, or distress. Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans. We introduce HEART, the first-ever framework that directly compares humans and LLMs on the same multi-turn emotional-support conversations. For each dialogue history, we pair human and model responses and evaluate them through blinded human raters and an ensemble of LLM-as-judge evaluators. All assessments follow a rubric grounded in interpersonal communication science across five dimensions: Human Alignment, Empathic Responsiveness, Attunement, Resonance, and Task-Following. HEART uncovers striking behavioral patterns. Several frontier models approach or surpass the average human responses in perceived empathy and consistency. At the same time, humans maintain advantages in adaptive reframing, tension-naming, and nuanced tone shifts, particularly in adversarial turns. Human and LLM-as-judge preferences align on about 80 percent of pairwise comparisons, matching inter-human agreement, and their written rationales emphasize similar HEART dimensions. This pattern suggests an emerging convergence in the criteria used to assess supportive quality. By placing humans and models on equal footing, HEART reframes supportive dialogue as a distinct capability axis, separable from general reasoning or linguistic fluency. It provides a unified empirical foundation for understanding where model-generated support aligns with human social judgment, where it diverges, and how affective conversational competence scales with model size.

[NLP-87] Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

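[Quick Read]: This paper addresses the weak practical performance of multi-agent debate (MAD). Although MAD spends more compute through test-time scaling, it often underperforms simple majority vote, and under homogeneous agents and uniform belief updates debate cannot reliably improve correctness. The key to its solution is to identify and add two mechanisms drawn from research on human deliberation and collective decision-making: diversity of initial viewpoints, and explicit, calibrated confidence communication. The authors propose two lightweight interventions. The first is a diversity-aware initialisation that selects initial viewpoints from a more diverse pool of candidate answers, raising the chance that a correct hypothesis is present at the start. The second is a confidence-modulated debate protocol in which agents update their beliefs based on others' calibrated confidence. Theoretical analysis shows that the former raises the prior probability of debate success and that the latter lets debate drift systematically toward the correct hypothesis. Empirically, on six reasoning-oriented QA benchmarks, the approach consistently outperforms vanilla MAD and majority vote.

Link: https://arxiv.org/abs/2601.19921
Authors: Xiaochen Zhu, Caiqi Zhang, Yizhou Chi, Tom Stafford, Nigel Collier, Andreas Vlachos
Affiliations: University of Cambridge; University of Sheffield
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)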

Abstract:Multi-agent debate (MAD) is widely used to improve large language model (LLM) performance through test-time scaling, yet recent work shows that vanilla MAD often underperforms simple majority vote despite higher computational cost. Studies show that, under homogeneous agents and uniform belief updates, debate preserves expected correctness and therefore cannot reliably improve outcomes. Drawing on findings from human deliberation and collective decision-making, we identify two key mechanisms missing from vanilla MAD: (i) diversity of initial viewpoints and (ii) explicit, calibrated confidence communication. We propose two lightweight interventions. First, a diversity-aware initialisation that selects a more diverse pool of candidate answers, increasing the likelihood that a correct hypothesis is present at the start of debate. Second, a confidence-modulated debate protocol in which agents express calibrated confidence and condition their updates on others’ confidence. We show theoretically that diversity-aware initialisation improves the prior probability of MAD success without changing the underlying update dynamics, while confidence-modulated updates enable debate to systematically drift to the correct hypothesis. Empirically, across six reasoning-oriented QA benchmarks, our methods consistently outperform vanilla MAD and majority vote. Our results connect human deliberation with LLM-based debate and demonstrate that simple, principled modifications can substantially enhance debate effectiveness.
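
To illustrate why communicated confidence can change the outcome, the sketch below contrasts a plain majority vote with a confidence-weighted tally. It is a deliberately simplified stand-in, not the paper's debate protocol: the `(answer, confidence)` tuple format and both function names are assumptions made for this example.

```python
from collections import Counter, defaultdict


def majority_vote(opinions):
    """Pick the most frequent answer, ignoring confidence."""
    return Counter(answer for answer, _ in opinions).most_common(1)[0][0]


def confidence_weighted_vote(opinions):
    """Pick the answer with the largest total (calibrated) confidence."""
    scores = defaultdict(float)
    for answer, confidence in opinions:
        scores[answer] += confidence
    return max(scores, key=scores.get)


# One confident agent versus two unsure agents.
opinions = [("A", 0.9), ("B", 0.4), ("B", 0.4)]
```

With these toy opinions, the majority vote returns "B" while the confidence-weighted tally returns "A". Whether that helps in practice depends on the confidences being well calibrated, which is the condition the abstract stresses.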

[NLP-88] FastWhisper: Adaptive Self-knowledge Distillation for Real-time Automatic Speech Recognition

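[Quick Read]: This paper addresses the problem that, in knowledge distillation, the student model may inherit the teacher model's shortcomings and lose generalization capacity. The key to its solution is adaptive self-knowledge distillation (ASKD), which dynamically reduces the student's dependence on the teacher and adds a self-knowledge distillation mechanism to strengthen the student's self-training ability and improve its generalization.

Link: https://arxiv.org/abs/2601.19919
Authors: Junseok Lee, Nahoon Kim, Sangyong Lee, Chang-Jae Chun
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)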

Abstract:Knowledge distillation is one of the most effective methods for model compression. Previous studies have focused on having the student model effectively learn the predictive distribution of the teacher model. However, during training, the student model may inherit the shortcomings of the teacher model, which can lead to a decline in generalization capacity. To mitigate this issue, we propose adaptive self-knowledge distillation (ASKD), which dynamically reduces the dependence on the teacher model to improve the self-training capacity, and performs the self-knowledge distillation method to improve the generalization capacity of the student model. We further distill the Whisper model into a smaller variant, called FastWhisper. In our post-training setting, FastWhisper achieved a word error rate 1.07% lower than that of the teacher model, Whisper, and ran inference 5 times faster.

[NLP-89] Lowest Span Confidence: A Zero-Shot Metric for Efficient and Black-Box Hallucination Detection in LLMs

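[Quick Read]: This paper addresses hallucination detection in large language models (LLMs), where a model generates plausible but non-factual content, which is a challenge for reliable deployment in high-stakes settings. Existing methods usually rely on unrealistic assumptions: they either need expensive sampling strategies for consistency checks or require access to white-box model states, which is hard to obtain in common API-based scenarios. The authors propose a new efficient zero-shot metric, Lowest Span Confidence (LSC). Its key idea is to evaluate the joint likelihood of semantically coherent spans using only a single forward pass with output probabilities, and to use a sliding window to find the regions of lowest marginal confidence across n-grams of different lengths, capturing local uncertainty patterns that correlate strongly with factual inconsistency. LSC mitigates the dilution effect of perplexity and the noise sensitivity of minimum token probability, giving a more robust estimate of factual uncertainty, and it consistently outperforms existing zero-shot baselines across multiple state-of-the-art LLMs and benchmarks.

Link: https://arxiv.org/abs/2601.19918
Authors: Yitong Qiao, Licheng Pan, Yu Mi, Lei Liu, Yue Shen, Fei Sun, Zhixuan Chu
Affiliations: Zhejiang University; Ant Group; Institute of Computing Technology, Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)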

Abstract:Hallucinations in Large Language Models (LLMs), i.e., the tendency to generate plausible but non-factual content, pose a significant challenge for their reliable deployment in high-stakes environments. However, existing hallucination detection methods generally operate under unrealistic assumptions, i.e., either requiring expensive, intensive sampling strategies for consistency checks or white-box LLM states, which are unavailable or inefficient in common API-based scenarios. To this end, we propose a novel efficient zero-shot metric called Lowest Span Confidence (LSC) for hallucination detection under minimal resource assumptions, only requiring a single forward pass with output probabilities. Concretely, LSC evaluates the joint likelihood of semantically coherent spans via a sliding window mechanism. By identifying regions of lowest marginal confidence across variable-length n-grams, LSC can capture local uncertainty patterns strongly correlated with factual inconsistency. Importantly, LSC can mitigate the dilution effect of perplexity and the noise sensitivity of minimum token probability, offering a more robust estimate of factual uncertainty. Extensive experiments across multiple state-of-the-art (SOTA) LLMs and diverse benchmarks show that LSC consistently outperforms existing zero-shot baselines, delivering strong detection performance even under resource-constrained conditions.
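
The sliding-window idea can be sketched as below. This is one plausible reading of the abstract, not the paper's exact definition: scoring each span by its length-normalized joint probability (the geometric mean of its token probabilities) and the default window sizes are assumptions made for illustration.

```python
import math


def lowest_span_confidence(token_logprobs, window_sizes=(2, 3, 4)):
    """Return the lowest span confidence over all sliding n-gram windows.

    token_logprobs: per-token log-probabilities from one forward pass.
    Span confidence is assumed here to be exp(mean log-prob of the span),
    i.e. the geometric mean of the token probabilities in the window.
    """
    lowest = 1.0
    for n in window_sizes:
        if n > len(token_logprobs):
            continue
        # Running sum of log-probs for the current window.
        window_sum = sum(token_logprobs[:n])
        lowest = min(lowest, math.exp(window_sum / n))
        for i in range(n, len(token_logprobs)):
            window_sum += token_logprobs[i] - token_logprobs[i - n]
            lowest = min(lowest, math.exp(window_sum / n))
    return lowest


# Toy sequence with one low-confidence region in the middle.
logprobs = [math.log(p) for p in (0.9, 0.9, 0.1, 0.2, 0.9)]
score = lowest_span_confidence(logprobs, window_sizes=(2,))
```

In the toy sequence the lowest 2-gram is the (0.1, 0.2) pair. A sequence-level average would be pulled up by the three confident tokens, and a single-token minimum would react to one noisy token alone, which is the contrast with perplexity and minimum token probability that the abstract draws.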

[NLP-90] PILOT: Planning via Internalized Latent Optimization Trajectories for Large Language Models

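[Quick Read]: This paper addresses error propagation in multi-step reasoning that arises because compact large language models (LLMs) lack the capacity for global strategic planning, which makes them unstable on long-horizon tasks. The key to its solution is PILOT (Planning via Internalized Latent Optimization Trajectories), a non-invasive framework in which a lightweight Hyper-Network generates a query-conditioned Latent Guidance vector. The vector acts as an internal steering mechanism that guides the model's representations toward optimal reasoning paths, internalizing the strategic oversight of a teacher model as implicit guidance without modifying backbone weights and with negligible added inference latency.

Link: https://arxiv.org/abs/2601.19917
Authors: Haoyu Zheng, Yun Zhu, Yuqian Yuan, Bo Yuan, Wenqiao Zhang, Siliang Tang, Jun Xiao
Affiliations: Zhejiang University
Categories: Computation and Language (cs.CL)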

Abstract:Strategic planning is critical for multi-step reasoning, yet compact Large Language Models (LLMs) often lack the capacity to formulate global strategies, leading to error propagation in long-horizon tasks. Our analysis reveals that LLMs possess latent reasoning capabilities that can be unlocked when conditioned on explicit plans from a teacher model; however, runtime reliance on external guidance is often impractical due to latency and availability constraints. To bridge this gap, we propose PILOT (Planning via Internalized Latent Optimization Trajectories), a non-invasive framework designed to internalize the strategic oversight of large models into intrinsic Latent Guidance. Instead of altering backbone weights, PILOT employs a lightweight Hyper-Network to synthesize a query-conditioned Latent Guidance vector. This vector acts as an internal steering mechanism, guiding the model’s representations toward optimal reasoning paths. Extensive experiments on mathematical and coding benchmarks demonstrate that PILOT effectively stabilizes reasoning trajectories, consistently outperforming strong baselines (e.g., +8.9% on MATH500) with negligible inference latency.

[NLP-91] PaperAudit-Bench: Benchmarking Error Detection in Research Papers for Critical Automated Peer Review

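[Quick Read]: This paper addresses the insufficient critical rigor of peer reviews generated by large language models (LLMs), especially when a paper's substantive errors are subtle and spread across sections. The key to its solution is the PaperAudit-Bench benchmark, which has two core components. The first is PaperAudit-Dataset, a structured error dataset covering both errors identifiable within a single section and errors that require cross-section reasoning, designed for controlled evaluation in long-context settings. The second is PaperAudit-Review, an automated review framework that combines structured error detection with evidence-aware review generation. Embedding explicit error detection in the review workflow produces stricter and more discriminative evaluations, which supports more reliable peer review.

Link: https://arxiv.org/abs/2601.19916
Authors: Songjun Tu, Yiwen Ma, Jiahao Lin, Qichao Zhang, Xiangyuan Lan, Junfeng Li, Nan Xu, Linjing Li, Dongbin Zhao
Affiliations: Institute of Automation, Chinese Academy of Sciences; Pengcheng Laboratory; University of Chinese Academy of Sciences; Wenge Technology
Categories: Computation and Language (cs.CL)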

Abstract:Large language models can generate fluent peer reviews, yet their assessments often lack sufficient critical rigor when substantive issues are subtle and distributed across a paper. In this paper, we introduce PaperAudit-Bench, which consists of two components: (1) PaperAudit-Dataset, an error dataset covering both errors identifiable within individual sections and those requiring cross-section reasoning, designed for controlled evaluation under long-context settings; and (2) PaperAudit-Review, an automated review framework that integrates structured error detection with evidence-aware review generation to support critical assessment. Experiments on PaperAudit-Bench reveal large variability in error detectability across models and detection depths, highlighting the difficulty of identifying such errors under long-context settings. Relative to representative automated reviewing baselines, incorporating explicit error detection into the review workflow produces systematically stricter and more discriminative evaluations, demonstrating its suitability for peer review. Finally, we show that the dataset supports training lightweight LLM detectors via SFT and RL, enabling effective error detection at reduced computational cost.

[NLP-92] Modeling Next-Token Prediction as Left-Nested Intuitionistic Implication

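[Quick Read]: This paper addresses the lack of a logical foundation in the design of mainstream language models such as Transformers, and rebuilds the neural language model architecture from the standpoint of logical inference. Its core solution formalizes next-token prediction in intuitionistic logic: the input prefix is encoded as a left-nested implication chain whose non-commutative composition preserves sequence order. Next-token prediction then corresponds to modus ponens, and sequence processing to constructive proof extension under the Curry–Howard correspondence. This view leads naturally to a neural architecture equivalent to multiplicative RNNs. Prolog-based specialized theorem provers validate its basic properties, including the relations between commutative and non-commutative sequencing and between single-token and multi-token prediction, and the model is positioned relative to Transformers and state-space models.

Link: https://arxiv.org/abs/2601.19915
Authors: Paul Tarau
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: 25 pages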

Abstract:We introduce the Arrow Language Model, a neural architecture derived from an intuitionistic-logic interpretation of next-token prediction. Instead of representing tokens as additive embeddings mixed by attention, we encode a prefix as a left-nested implication chain whose structure preserves order through non-commutative composition. Next-token prediction corresponds to modus ponens, and sequence processing becomes constructive proof extension under the Curry–Howard correspondence. Our Prolog-based specialized theorem provers validate fundamental properties of the neural models, including the relations between commutative and non-commutative sequencing and between single-token and multi-token prediction choices. We show that a neural architecture equivalent to multiplicative RNNs arises naturally from a proof-theoretic interpretation of next-token prediction as nested intuitionistic implication. We also present a practical low-rank neural realization and position the model relative to Transformers and state-space models. Keywords: logic-based derivation of neural architectures, intuitionistic implicational logic, token-as-operator neural models, state-space models, alternatives to transformer-based foundational models.
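
The point about non-commutative composition preserving order can be shown with a toy token-as-operator model, in which each token is a matrix applied to the running state. This is only an illustration of the general idea: the 2x2 matrices, the initial state, and the function names are invented for this example and are not taken from the paper.

```python
def matvec(matrix, vector):
    """Multiply a small dense matrix by a vector (pure Python)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Hypothetical per-token operators; real models would learn these.
TOKEN_OPS = {
    "a": [[1, 1], [0, 1]],
    "b": [[1, 0], [1, 1]],
}


def encode_prefix(tokens, state=(1, 0)):
    """Apply each token's operator to the state, left to right."""
    current = list(state)
    for token in tokens:
        current = matvec(TOKEN_OPS[token], current)
    return current


def encode_additive(tokens):
    """Order-blind baseline: sum a fixed column of each token's matrix."""
    total = [0, 0]
    for token in tokens:
        total = [t + row[0] for t, row in zip(total, TOKEN_OPS[token])]
    return total
```

Because the two matrices do not commute, `encode_prefix("ab")` and `encode_prefix("ba")` give different states, whereas the additive baseline cannot tell the two orders apart without extra positional information.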

[NLP-93] Simulating Complex Multi-Turn Tool Calling Interactions in Stateless Execution Environments

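[Quick Read]: This paper addresses the limited applicability of existing methods for generating synthetic multi-turn tool calling data when no stateful execution environment is available. Most current frameworks assume that tool calls run in an environment that maintains state, so an interaction can be validated by comparing the environment's state with a prespecified objective. In practice, for example in enterprise settings with strict data security requirements or when tool specifications come from multiple sources, such a stateful mechanism is often unavailable. The authors propose a data generation method, DiGiT-TC, whose key element is a novel generation pattern that implicitly represents certain tool calls in the user request, so it can generate conversations with the characteristics of a stateful environment without relying on external state. Experiments on standard tool calling benchmarks show strong performance gains, even in problem settings where state is available.

Link: https://arxiv.org/abs/2601.19914
Authors: Maxwell Crouse, Ibrahim Abdelaziz, Kshitij Fadnis, Siva Sankalp Patel, Kinjal Basu, Chulaka Gunasekara, Sadhana Kumaravel, Asim Munawar, Pavan Kapanipathi
Affiliations: IBM Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)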

Abstract:Synthetic data has proven itself to be a valuable resource for tuning smaller, cost-effective language models to handle the complexities of multi-turn tool calling conversations. While many frameworks and systems for producing synthetic multi-turn tool calling data have been proposed, prior works have frequently assumed that any tool calling interactions will take place in an execution environment that maintains state. When such an environment is available, this is advantageous as it allows for the validity of an interaction to be determined by whether or not the state of the execution environment matches to some prespecified objective. Unfortunately, this does not hold in many real-world tool use settings, e.g., in enterprise settings where data security is of the utmost importance or in cases where tool specifications are synthesized from multiple sources. In this work, we address this gap by introducing a data generation method, DiGiT-TC, that is designed to produce tool calling conversations that have the characteristics of conversations generated through search in a stateful environment. The key to our technique lies in a novel generation pattern that allows our approach to implicitly represent certain tool calls in the user request. We validate our approach on standard tool calling benchmarks and demonstrate that, even in stateful problem settings, our approach results in strong performance gains.

[NLP-94] From Intuition to Expertise: Rubric-Based Cognitive Calibration for Human Detection of LLM-Generated Korean Text

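[Quick Read]: This paper addresses the difficulty that even human experts have in distinguishing human-written Korean text from text generated by large language models (LLMs), in particular the tendency of linguistically trained readers to over-trust surface well-formedness. The key idea is to treat expert detection as a learnable skill and improve it through structured calibration. The study introduces the LREAD rubric, derived from national Korean writing standards and focused on micro-level linguistic features (such as punctuation optionality, spacing behavior, and register shifts). In a three-phase longitudinal blind protocol, experts move from intuition-only judgments to criterion-based decisions with explicit justifications; majority-vote accuracy rises from 60% to 100% and inter-annotator agreement improves markedly (Fleiss' kappa: -0.09 → 0.82). Calibrated humans rely on language-specific micro-diagnostics that existing LLM detectors, which depend on coarse discourse priors, do not capture well, so rubric-scaffolded expert judgment can serve as an interpretable complement to automated detection in non-English settings.

Link: https://arxiv.org/abs/2601.19913
Authors: Shinwoo Park,Yo-Sub Han
Affiliations: Yonsei University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: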

Abstract:Distinguishing human-written Korean text from fluent LLM outputs remains difficult even for linguistically trained readers, who can over-trust surface well-formedness. We study whether expert detection can be treated as a learnable skill and improved through structured calibration. We introduce LREAD, a rubric derived from national Korean writing standards and adapted to target micro-level artifacts (e.g., punctuation optionality, spacing behavior, and register shifts). In a three-phase longitudinal blind protocol with Korean linguistics majors, Phase 1 measures intuition-only detection, Phase 2 enforces criterion-level scoring with explicit justifications, and Phase 3 evaluates domain-focused mastery on held-out elementary essays. Across phases, majority-vote accuracy increases from 60% to 100%, accompanied by stronger inter-annotator agreement (Fleiss’ kappa: -0.09 → 0.82). Compared to state-of-the-art LLM detectors, calibrated humans rely more on language-specific micro-diagnostics that are not well captured by coarse discourse priors. Our findings suggest that rubric-scaffolded expert judgment can serve as an interpretable complement to automated detectors for non-English settings, and we release the full rubric and a taxonomy of calibrated detection signatures.
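
The two statistics reported above, majority-vote accuracy and Fleiss' kappa, can be illustrated with a minimal sketch. The function names and the toy label format (one list of rater labels per text) are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    `ratings` is a hypothetical format: one list of category labels per item.
    Undefined (division by zero) if every rating falls in a single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    counts = [[Counter(item)[c] for c in categories] for item in ratings]
    # Observed agreement: fraction of agreeing rater pairs, averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_exp = sum(p * p for p in p_cats)
    return (p_bar - p_exp) / (1 - p_exp)


def majority_vote_accuracy(ratings, gold):
    """Share of items whose most frequent label equals the gold label.

    Use an odd number of raters with binary labels to avoid ties.
    """
    correct = 0
    for item, truth in zip(ratings, gold):
        winner, _ = Counter(item).most_common(1)[0]
        correct += winner == truth
    return correct / len(gold)
```

A negative kappa, as in the intuition-only phase, means the raters agreed less often than chance would predict, which can happen even when the majority vote is sometimes correct.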

[NLP-95] DABench-LLM: Standardized and In-Depth Benchmarking of Post-Moore Dataflow AI Accelerators for LLMs

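[Quick Read]: This paper addresses the growing performance bottleneck that training large language models (LLMs) places on traditional CPU and GPU architectures as Moore's Law slows. The core challenge is the lack of in-depth performance analysis and standardized benchmarking methodology for dataflow-based AI accelerators. The authors propose DABench-LLM, the first benchmarking framework designed to evaluate LLM workloads on dataflow accelerators. By combining intra-chip performance profiling with inter-chip scalability analysis, it evaluates key metrics such as resource allocation, load balance, and resource efficiency, helping researchers quickly understand underlying hardware and system behavior and derive targeted optimization strategies. The framework is validated on three commodity dataflow accelerators: Cerebras WSE-2, SambaNova RDU, and Graphcore IPU.

Link: https://arxiv.org/abs/2601.19904
Authors: Ziyu Hu,Zhiqing Zhong,Weijian Zheng,Zhijing Ye,Xuwei Tan,Xueru Zhang,Zheng Xie,Rajkumar Kettimuthu,Xiaodong Yu
Affiliations: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Comments: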

Abstract:The exponential growth of large language models has outpaced the capabilities of traditional CPU and GPU architectures due to the slowdown of Moore’s Law. Dataflow AI accelerators present a promising alternative; however, there remains a lack of in-depth performance analysis and standardized benchmarking methodologies for LLM training. We introduce DABench-LLM, the first benchmarking framework designed for evaluating LLM workloads on dataflow-based accelerators. By combining intra-chip performance profiling and inter-chip scalability analysis, DABench-LLM enables comprehensive evaluation across key metrics such as resource allocation, load balance, and resource efficiency. The framework helps researchers rapidly gain insights into underlying hardware and system behaviors, and provides guidance for performance optimizations. We validate DABench-LLM on three commodity dataflow accelerators, Cerebras WSE-2, SambaNova RDU, and Graphcore IPU. Our framework reveals performance bottlenecks and provides specific optimization strategies, demonstrating its generality and effectiveness across a diverse range of dataflow-based AI hardware platforms.

[NLP-96] RIR-Mega-Speech: A Reverberant Speech Corpus with Comprehensive Acoustic Metadata and Reproducible Evaluation

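[Quick Read]: This paper addresses the difficulty of comparing methods for reverberant speech recognition, which stems from the lack of standardized, reproducible reverberant speech datasets: existing corpora typically lack per-file acoustic annotations or are insufficiently documented, which makes independent verification hard. The solution is RIR-Mega-Speech, a corpus of approximately 117.5 hours built by convolving LibriSpeech utterances with roughly 5,000 simulated room impulse responses (RIRs). Every file carries clearly defined, reproducible acoustic annotations: reverberation time (RT60), direct-to-reverberant ratio (DRR), and clarity index (C₅₀). The author also provides scripts to rebuild the dataset on multiple platforms and to reproduce all experimental results.

Link: https://arxiv.org/abs/2601.19949
Authors: Mandip Goswami
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD); Signal Processing (eess.SP)
Comments: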

Abstract:Despite decades of research on reverberant speech, comparing methods remains difficult because most corpora lack per-file acoustic annotations or provide limited documentation for reproduction. We present RIR-Mega-Speech, a corpus of approximately 117.5 hours created by convolving LibriSpeech utterances with roughly 5,000 simulated room impulse responses from the RIR-Mega collection. Every file includes RT60, direct-to-reverberant ratio (DRR), and clarity index (C₅₀) computed from the source RIR using clearly defined, reproducible procedures. We also provide scripts to rebuild the dataset and reproduce all evaluation results. Using Whisper small on 1,500 paired utterances, we measure 5.20% WER (95% CI: 4.69–5.78) on clean speech and 7.70% (7.04–8.35) on reverberant versions, corresponding to a paired increase of 2.50 percentage points (2.06–2.98). This represents a 48% relative degradation. WER increases monotonically with RT60 and decreases with DRR, consistent with prior perceptual studies. While the core finding that reverberation harms recognition is well established, we aim to provide the community with a standardized resource where acoustic conditions are transparent and results can be verified independently. The repository includes one-command rebuild instructions for both Windows and Linux environments.
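
To make the per-file annotations concrete, here is a minimal NumPy sketch of how a clarity index and a direct-to-reverberant ratio can be computed from an impulse response, and how a reverberant utterance is produced by convolution. The exact definitions used by the corpus are not given in the abstract, so the onset choice (largest absolute sample), the 50 ms split, and the 2.5 ms direct-path window are assumptions, as are all function names.

```python
import numpy as np


def clarity_c50(rir, fs):
    """Clarity index in dB: early (first 50 ms after onset) over late energy.

    Assumption: the onset is taken as the sample with the largest magnitude.
    """
    rir = np.asarray(rir, dtype=float)
    onset = int(np.argmax(np.abs(rir)))
    split = onset + int(round(0.050 * fs))
    early = np.sum(rir[onset:split] ** 2)
    late = np.sum(rir[split:] ** 2)
    return 10.0 * np.log10(early / late)


def drr_db(rir, fs, direct_window_ms=2.5):
    """Direct-to-reverberant ratio in dB.

    Assumption: the direct path is a short window centred on the main peak;
    everything outside that window counts as reverberant energy.
    """
    rir = np.asarray(rir, dtype=float)
    peak = int(np.argmax(np.abs(rir)))
    half = int(round(direct_window_ms * 1e-3 * fs))
    lo, hi = max(0, peak - half), peak + half + 1
    direct = np.sum(rir[lo:hi] ** 2)
    reverb = np.sum(rir[:lo] ** 2) + np.sum(rir[hi:] ** 2)
    return 10.0 * np.log10(direct / reverb)


def reverberate(speech, rir):
    """Convolve dry speech with an RIR and trim to the original length."""
    return np.convolve(speech, rir)[: len(speech)]
```

The 48% figure in the abstract is the relative change between the two word error rates: (7.70 − 5.20) / 5.20 ≈ 0.48.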

Computer Vision

[CV-0] FreeFix: Boosting 3D Gaussian Splatting via Fine-Tuning-Free Diffusion Models

[Quick Read]: This paper addresses the drop in rendering quality that Neural Radiance Fields (NeRF) and 3D Gaussian Splatting suffer at extrapolated views because of their reliance on dense inputs, and the trade-off between generalization and fidelity faced by existing diffusion-based enhancement methods. The key is FreeFix, a fine-tuning-free framework that uses an interleaved 2D-3D refinement strategy to refine renderings consistently with a pretrained image diffusion model, and introduces a per-pixel confidence mask that identifies uncertain regions for targeted improvement. This improves multi-frame consistency and visual fidelity at extrapolated views without sacrificing generalization.

Link: https://arxiv.org/abs/2601.20857
Authors: Hongyu Zhou,Zisen Shao,Sheng Miao,Pan Wang,Dongfeng Bai,Bingbing Liu,Yiyi Liao
Affiliations: Zhejiang University; University of Maryland, College Park; Huawei
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Our project page is at this https URL

Abstract:Neural Radiance Fields and 3D Gaussian Splatting have advanced novel view synthesis, yet still rely on dense inputs and often degrade at extrapolated views. Recent approaches leverage generative models, such as diffusion models, to provide additional supervision, but face a trade-off between generalization and fidelity: fine-tuning diffusion models for artifact removal improves fidelity but risks overfitting, while fine-tuning-free methods preserve generalization but often yield lower fidelity. We introduce FreeFix, a fine-tuning-free approach that pushes the boundary of this trade-off by enhancing extrapolated rendering with pretrained image diffusion models. We present an interleaved 2D-3D refinement strategy, showing that image diffusion models can be leveraged for consistent refinement without relying on costly video diffusion models. Furthermore, we take a closer look at the guidance signal for 2D refinement and propose a per-pixel confidence mask to identify uncertain regions for targeted improvement. Experiments across multiple datasets show that FreeFix improves multi-frame consistency and achieves performance comparable to or surpassing fine-tuning-based methods, while retaining strong generalization ability.

[CV-1] C3Box: A CLIP-based Class-Incremental Learning Toolbox

[Quick Read]: This paper addresses catastrophic forgetting in machine learning systems that learn from evolving data streams, specifically how to keep learning new classes while preserving prior knowledge in Class-Incremental Learning (CIL). The contribution is C3Box (CLIP-based Class-inCremental learning toolBOX), a modular, unified Python toolbox that integrates representative traditional CIL methods, ViT-based methods, and state-of-the-art CLIP-based methods into a single CLIP-based framework. JSON-based configuration and a standardized execution pipeline enable reproducible experiments with low engineering overhead, providing a reliable and easy-to-use benchmark platform for continual learning research.

Link: https://arxiv.org/abs/2601.20852
Authors: Hao Sun,Da-Wei Zhou
Affiliations: School of Artificial Intelligence, Nanjing University, China; National Key Laboratory for Novel Software Technology, Nanjing University, 210023, China
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: The code is available at this https URL

Abstract:Traditional machine learning systems are typically designed for static data distributions, which suffer from catastrophic forgetting when learning from evolving data streams. Class-Incremental Learning (CIL) addresses this challenge by enabling learning systems to continuously learn new classes while preserving prior knowledge. With the rise of pre-trained models (PTMs) such as CLIP, leveraging their strong generalization and semantic alignment capabilities has become a promising direction in CIL. However, existing CLIP-based CIL methods are often scattered across disparate codebases, rely on inconsistent configurations, hindering fair comparisons, reproducibility, and practical adoption. Therefore, we propose C3Box (CLIP-based Class-inCremental learning toolBOX), a modular and comprehensive Python toolbox. C3Box integrates representative traditional CIL methods, ViT-based CIL methods, and state-of-the-art CLIP-based CIL methods into a unified CLIP-based framework. By inheriting the streamlined design of PyCIL, C3Box provides a JSON-based configuration and standardized execution pipeline. This design enables reproducible experimentation with low engineering overhead and makes C3Box a reliable benchmark platform for continual learning research. Designed to be user-friendly, C3Box relies only on widely used open-source libraries and supports major operating systems. The code is available at this https URL.

[CV-2] A New Dataset and Framework for Robust Road Surface Classification via Camera-IMU Fusion

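[Quick Read]: This paper addresses the poor generalization of Road Surface Classification (RSC) in practice: under widely varying lighting, weather, and surface conditions, existing methods are limited by narrow sensing modalities and by datasets that lack environmental diversity. The key is a lightweight multimodal fusion framework that fuses images and Inertial Measurement Unit (IMU) data through a bidirectional cross-attention module, followed by an adaptive gating layer that dynamically adjusts each modality's contribution to improve robustness under domain shifts. The method improves over the previous state of the art by 1.4 pp on the PVS benchmark and by 11.6 pp on the multimodal subset of the newly introduced ROAD dataset, achieves higher F1-scores on minority classes, and remains stable under challenging conditions such as nighttime, heavy rain, and mixed-surface transitions.

Link: https://arxiv.org/abs/2601.20847
Authors: Willams de Lima Costa,Thifany Ketuli Silva de Souza,Jonas Ferreira Silva,Carlos Gabriel Bezerra Pereira,Bruno Reis Vila Nova,Leonardo Silvino Brito,Rafael Raider Leoni,Juliano Silva,Valter Ferreira,Sibele Miguel Soares Neto,Samantha Uehara,Daniel Giacomo,João Marcelo Teixeira,Veronica Teichrieb,Cristiano Coelho de Araújo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: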

Abstract:Road surface classification (RSC) is a key enabler for environment-aware predictive maintenance systems. However, existing RSC techniques often fail to generalize beyond narrow operational conditions due to limited sensing modalities and datasets that lack environmental diversity. This work addresses these limitations by introducing a multimodal framework that fuses images and inertial measurements using a lightweight bidirectional cross-attention module followed by an adaptive gating layer that adjusts modality contributions under domain shifts. Given the limitations of current benchmarks, especially regarding lack of variability, we introduce ROAD, a new dataset composed of three complementary subsets: (i) real-world multimodal recordings with RGB-IMU streams synchronized using a gold-standard industry datalogger, captured across diverse lighting, weather, and surface conditions; (ii) a large vision-only subset designed to assess robustness under adverse illumination and heterogeneous capture setups; and (iii) a synthetic subset generated to study out-of-distribution generalization in scenarios difficult to obtain in practice. Experiments show that our method achieves a +1.4 pp improvement over the previous state-of-the-art on the PVS benchmark and an +11.6 pp improvement on our multimodal ROAD subset, with consistently higher F1-scores on minority classes. The framework also demonstrates stable performance across challenging visual conditions, including nighttime, heavy rain, and mixed-surface transitions. These findings indicate that combining affordable camera and IMU sensors with multimodal attention mechanisms provides a scalable, robust foundation for road surface understanding, particularly relevant for regions where environmental variability and cost constraints limit the adoption of high-end sensing suites.

[CV-3] Open-Vocabulary Functional 3D Human-Scene Interaction Generation

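[Quick Read]: This paper addresses the generation of functional interactions between 3D humans and 3D scenes, i.e., making a 3D human interact with a scene in a way consistent with object functionality (such as "sitting on a sofa" or "increasing the room temperature"). Existing methods typically lack explicit reasoning about object functionality and human-scene contact, which leads to implausible or functionally incorrect interactions. The key is FunHSI, a training-free, functionality-driven framework. Given an open-vocabulary task prompt, it performs functionality-aware contact reasoning to identify functional scene elements, reconstruct their 3D geometry, and model high-level interactions via a contact graph. It then uses vision-language models to synthesize an image of a human performing the task and to estimate 3D body and hand poses, and finally refines the body configuration through stage-wise optimization to ensure physical plausibility and functional correctness.

Link: https://arxiv.org/abs/2601.20835
Authors: Jie Liu,Yu Sun,Alpar Cseke,Yao Feng,Nicolas Heron,Michael J. Black,Yan Zhang
Affiliations: Meshcapade; University of Amsterdam; Max Planck Institute for Intelligent Systems; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages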

Abstract:Generating 3D humans that functionally interact with 3D scenes remains an open problem with applications in embodied AI, robotics, and interactive content creation. The key challenge involves reasoning about both the semantics of functional elements in 3D scenes and the 3D human poses required to achieve functionality-aware interaction. Unfortunately, existing methods typically lack explicit reasoning over object functionality and the corresponding human-scene contact, resulting in implausible or functionally incorrect interactions. In this work, we propose FunHSI, a training-free, functionality-driven framework that enables functionally correct human-scene interactions from open-vocabulary task prompts. Given a task prompt, FunHSI performs functionality-aware contact reasoning to identify functional scene elements, reconstruct their 3D geometry, and model high-level interactions via a contact graph. We then leverage vision-language models to synthesize a human performing the task in the image and estimate proposed 3D body and hand poses. Finally, the proposed 3D body configuration is refined via stage-wise optimization to ensure physical plausibility and functional correctness. In contrast to existing methods, FunHSI not only synthesizes more plausible general 3D interactions, such as "sitting on a sofa", but also supports fine-grained functional human-scene interactions, e.g., "increasing the room temperature". Extensive experiments demonstrate that FunHSI consistently generates functionally correct and physically plausible human-scene interactions across diverse indoor and outdoor scenes.

[CV-4] FAIRT2V: Training-Free Debiasing for Text-to-Video Diffusion Models

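[Quick Read]: This paper addresses gender bias in text-to-video (T2V) diffusion models, in particular bias caused by implicit gender associations introduced by pretrained text encoders. The analysis shows that the bias mainly arises because the text encoder encodes a gender leaning even for neutral prompts, which skews the gender distribution of roles such as occupations in generated videos. The key is FairT2V, a debiasing framework that requires no finetuning: it neutralizes prompt embeddings via anchor-based spherical geodesic transformations, removing the gender leaning while preserving semantics. To maintain temporal coherence, debiasing is applied only during the early identity-forming steps through a dynamic denoising schedule. On Open-Sora, the method substantially reduces gender bias across occupations with minimal impact on video quality.

Link: https://arxiv.org/abs/2601.20791
Authors: Haonan Zhong,Wei Song,Tingxu Han,Maurice Pagnucco,Jingling Xue,Yang Song
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: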

Abstract:Text-to-video (T2V) diffusion models have achieved rapid progress, yet their demographic biases, particularly gender bias, remain largely unexplored. We present FairT2V, a training-free debiasing framework for text-to-video generation that mitigates encoder-induced bias without finetuning. We first analyze demographic bias in T2V models and show that it primarily originates from pretrained text encoders, which encode implicit gender associations even for neutral prompts. We quantify this effect with a gender-leaning score that correlates with bias in generated videos. Based on this insight, FairT2V mitigates demographic bias by neutralizing prompt embeddings via anchor-based spherical geodesic transformations while preserving semantics. To maintain temporal coherence, we apply debiasing only during early identity-forming steps through a dynamic denoising schedule. We further propose a video-level fairness evaluation protocol combining VideoLLM-based reasoning with human verification. Experiments on the modern T2V model Open-Sora show that FairT2V substantially reduces demographic bias across occupations with minimal impact on video quality.
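
As a rough illustration of what a spherical geodesic transformation toward an anchor can look like, the sketch below moves an embedding part of the way along the great circle between its own direction and a neutral anchor direction (standard spherical linear interpolation). How FairT2V chooses anchors, step sizes, and which embedding components to transform is not described in the abstract; the function name, the single-anchor setup, and the parameter `t` are assumptions for illustration only.

```python
import numpy as np


def geodesic_shift(embedding, anchor, t):
    """Move `embedding` a fraction `t` along the geodesic toward `anchor`.

    Both vectors are projected to the unit sphere; the original norm of
    `embedding` is restored afterwards. t=0 returns the input direction,
    t=1 returns the anchor direction.
    """
    embedding = np.asarray(embedding, dtype=float)
    anchor = np.asarray(anchor, dtype=float)
    norm = np.linalg.norm(embedding)
    u = embedding / norm
    v = anchor / np.linalg.norm(anchor)
    omega = np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))
    # Parallel or antipodal directions: the geodesic is degenerate, so the
    # embedding is returned unchanged.
    if np.sin(omega) < 1e-8:
        return embedding.copy()
    shifted = (np.sin((1.0 - t) * omega) * u + np.sin(t * omega) * v) / np.sin(omega)
    return norm * shifted
```

Interpolating on the sphere rather than linearly keeps the embedding norm fixed, which is one plausible reason to prefer a geodesic over a straight-line blend when the downstream model is sensitive to embedding scale.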

[CV-5] Compression Tells Intelligence: Visual Coding, Visual Token Technology, and the Unification

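[Quick Read]: This paper addresses the lack of a unified theoretical framework connecting Visual Coding and the Vision Token Technology used in generative multimodal large models, and examines the underlying trade-off between compression efficiency and model performance. The key is a unified optimization formulation that brings together classical information-theoretic visual coding and emerging visual token technology from the perspective of compression efficiency, enabling knowledge transfer in both directions: the efficiency of classical visual coding can guide token design, while the semantic awareness of token techniques can make traditional codecs more intelligent. On this basis, the paper forecasts next-generation visual codecs and token techniques, and shows experimentally the potential of task-oriented tokens in multimodal large language models (MLLMs), AI-generated content (AIGC), and embodied AI, pointing toward possible standardization of a general token technology analogous to H.264/265.

Link: https://arxiv.org/abs/2601.20742
Authors: Xin Jin,Jinming Liu,Yuntao Wei,Junyan Lin,Zhicheng Wang,Jianguo Huang,Xudong Yang,Yanxiao Liu,Wenjun Zeng
Affiliations: Eastern Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: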

Abstract:“Compression Tells Intelligence” is supported by research in artificial intelligence, particularly concerning (multimodal) large language models (LLMs/MLLMs), where compression efficiency often correlates with improved model performance and capabilities. For compression, classical visual coding based on traditional information theory has developed over decades, achieving great success with numerous international industrial standards widely applied in multimedia (e.g., image/video) systems. Beyond that, the recently emerging visual token technology of generative multimodal large models shares a similar fundamental objective with visual coding: maximizing semantic information fidelity during representation learning while minimizing computational cost. Therefore, this paper first provides a comprehensive overview of the two dominant technique families, Visual Coding and Vision Token Technology, and then unifies them from the perspective of optimization, discussing the essence of the trade-off between compression efficiency and model performance. Next, based on the proposed unified formulation bridging visual coding and visual token technology, we synthesize bidirectional insights and forecast next-generation visual codec and token techniques. Last but not least, we experimentally show the large potential of task-oriented token developments in more practical tasks such as multimodal LLMs (MLLMs), AI-generated content (AIGC), and embodied AI, and shed light on the future possibility of standardizing a general token technology, like the traditional codecs (e.g., H.264/265), with high efficiency for a wide range of intelligent tasks in a unified and effective manner.

[CV-6] Continual GUI Agents

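[Quick Read]: This paper addresses the performance degradation of GUI agents in dynamic digital environments, where the distribution of interface data shifts over time (for example, through new domains or resolutions), so that agents struggle to maintain stable grounding under continual learning. The key is the GUI-Anchoring in Flux (GUI-AiF) framework, a reinforcement fine-tuning approach with two novel rewards, Anchoring Point Reward in Flux (APR-iF) and Anchoring Region Reward in Flux (ARR-iF). These rewards guide the agent to stay aligned with interaction targets as the positions and regions of interface elements drift, mitigating the tendency of existing reward strategies to over-adapt to static grounding cues such as fixed coordinates or element scales.

Link: https://arxiv.org/abs/2601.20732
Authors: Ziwei Liu,Borui Kang,Hangjie Yuan,Zixiang Zhao,Wei Li,Yifan Zhu,Tao Feng
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: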

Abstract:As digital environments (data distribution) are in flux, with new GUI data arriving over time and introducing new domains or resolutions, agents trained on static environments deteriorate in performance. In this work, we introduce Continual GUI Agents, a new task that requires GUI agents to perform continual learning under shifted domains and resolutions. We find existing methods fail to maintain stable grounding as GUI distributions shift over time, due to the diversity of UI interaction points and regions in fluxing scenarios. To address this, we introduce GUI-Anchoring in Flux (GUI-AiF), a new reinforcement fine-tuning framework that stabilizes continual learning through two novel rewards: Anchoring Point Reward in Flux (APR-iF) and Anchoring Region Reward in Flux (ARR-iF). These rewards guide the agents to align with shifting interaction points and regions, mitigating the tendency of existing reward strategies to over-adapt to static grounding cues (e.g., fixed coordinates or element scales). Extensive experiments show GUI-AiF surpasses state-of-the-art baselines. Our work establishes the first continual learning framework for GUI agents, revealing the untapped potential of reinforcement fine-tuning for continual GUI Agents.

[CV-7] Li-ViP3D: Query-Gated Deformable Camera-LiDAR Fusion for End-to-End Perception and Trajectory Prediction

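[Quick Read]: This paper addresses end-to-end perception and trajectory prediction (PnP) for autonomous driving, where modular pipelines restrict information flow and accumulate errors, and where existing multimodal fusion methods underuse camera and LiDAR data in query space and introduce bias through heuristic alignment and discrete selection steps. The key is Li-ViP3D++, a query-based multimodal PnP framework built around Query-Gated Deformable Fusion (QGDF), which (i) aggregates image evidence via masked attention across cameras and feature levels, (ii) extracts LiDAR context through differentiable BEV sampling with learned per-query offsets, and (iii) applies query-conditioned gating to adaptively weight visual and geometric cues for each agent. Detection, tracking, and multi-hypothesis trajectory forecasting are jointly optimized in a single end-to-end architecture, improving performance while preserving deployability.

Link: https://arxiv.org/abs/2601.20720
Authors: Matej Halinkovic,Nina Masarykova,Alexey Vinel,Marek Galinski
Affiliations: Slovak University of Technology in Bratislava; Karlsruhe Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: This work has been submitted to the IEEE for possible publication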

Abstract:End-to-end perception and trajectory prediction from raw sensor data is one of the key capabilities for autonomous driving. Modular pipelines restrict information flow and can amplify upstream errors. Recent query-based, fully differentiable perception-and-prediction (PnP) models mitigate these issues, yet the complementarity of cameras and LiDAR in the query-space has not been sufficiently explored. Models often rely on fusion schemes that introduce heuristic alignment and discrete selection steps which prevent full utilization of available information and can introduce unwanted bias. We propose Li-ViP3D++, a query-based multimodal PnP framework that introduces Query-Gated Deformable Fusion (QGDF) to integrate multi-view RGB and LiDAR in query space. QGDF (i) aggregates image evidence via masked attention across cameras and feature levels, (ii) extracts LiDAR context through fully differentiable BEV sampling with learned per-query offsets, and (iii) applies query-conditioned gating to adaptively weight visual and geometric cues per agent. The resulting architecture jointly optimizes detection, tracking, and multi-hypothesis trajectory forecasting in a single end-to-end model. On nuScenes, Li-ViP3D++ improves end-to-end behavior and detection quality, achieving higher EPA (0.335) and mAP (0.502) while substantially reducing false positives (FP ratio 0.147), and it is faster than the prior Li-ViP3D variant (139.82 ms vs. 145.91 ms). These results indicate that query-space, fully differentiable camera-LiDAR fusion can increase robustness of end-to-end PnP without sacrificing deployability.
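
Step (iii) of QGDF, query-conditioned gating, can be pictured with a small NumPy sketch: each query produces a per-channel gate in (0, 1) that blends its image feature with its LiDAR feature. This is a generic gated-fusion pattern written under assumptions (a single linear gate layer, concatenated inputs, hypothetical names and shapes); the actual Li-ViP3D++ layer design is not specified in the abstract.

```python
import numpy as np


def sigmoid(z):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def query_gated_fusion(queries, img_feats, lidar_feats, weight, bias):
    """Blend per-query image and LiDAR features with a learned gate.

    Shapes (hypothetical): queries, img_feats, lidar_feats are (num_queries, d);
    weight is (3 * d, d); bias is (d,).
    Returns the fused features and the gate values.
    """
    gate_input = np.concatenate([queries, img_feats, lidar_feats], axis=-1)
    gate = sigmoid(gate_input @ weight + bias)
    # gate near 1 trusts the camera feature, gate near 0 trusts LiDAR.
    fused = gate * img_feats + (1.0 - gate) * lidar_feats
    return fused, gate
```

Because the gate is a smooth function of the query and both modalities, the blend stays differentiable end to end, in contrast to the discrete selection steps the abstract criticizes.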

[CV-8] LEMON: How Well Do MLLMs Perform Temporal Multimodal Understanding on Instructional Videos?

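[Quick Read]: This paper addresses the limited understanding of how multimodal large language models (MLLMs) perform on long-form, knowledge-intensive, temporally structured educational content. Existing benchmarks mostly focus on short clips or single-modality tasks and cannot assess cross-modal integration and long-horizon reasoning on STEM lecture videos. The key is LEMON (Lecture-based Evaluation benchmark for MultimOdal uNderstanding), a new benchmark for STEM lecture videos characterized by (1) semantic richness and disciplinary density, (2) tightly coupled video-audio-text modalities, (3) explicit temporal and pedagogical structure, and (4) contextually linked multi-turn questioning. It covers six major tasks and twelve subtasks spanning perception through generation. Experiments show that even state-of-the-art models such as GPT-4o struggle with temporal reasoning and instructional prediction.

Link: https://arxiv.org/abs/2601.20705
Authors: Zhuang Yu,Lei Shen,Jing Zhao,Shiliang Sun
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: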

Abstract:Recent multimodal large language models (MLLMs) have shown remarkable progress across vision, audio, and language tasks, yet their performance on long-form, knowledge-intensive, and temporally structured educational content remains largely unexplored. To bridge this gap, we introduce LEMON, a Lecture-based Evaluation benchmark for MultimOdal uNderstanding, focusing on STEM lecture videos that require long-horizon reasoning and cross-modal integration. LEMON comprises 2,277 video segments spanning 5 disciplines and 29 courses, with an average duration of 196.1 seconds, yielding 4,181 high-quality QA pairs, including 3,413 multiple-choice and 768 open-ended questions. Distinct from existing video benchmarks, LEMON features: (1) semantic richness and disciplinary density, (2) tightly coupled video-audio-text modalities, (3) explicit temporal and pedagogical structure, and (4) contextually linked multi-turn questioning. It further encompasses six major tasks and twelve subtasks, covering the full cognitive spectrum from perception to reasoning and then to generation. Comprehensive experiments reveal substantial performance gaps across tasks, highlighting that even state-of-the-art MLLMs like GPT-4o struggle with temporal reasoning and instructional prediction. We expect LEMON to serve as an extensible and challenging benchmark for advancing multimodal perception, reasoning, and generation in long-form instructional contents.

[CV-9] Decoupling Perception and Calibration: Label-Efficient Image Quality Assessment Framework

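[Quick Read]: This paper addresses the heavy reliance of image quality assessment (IQA) based on multimodal large language models (MLLMs) on large numbers of Mean Opinion Score (MOS) annotations. Although MLLMs perceive quality well, adapting them is computationally expensive and annotation-hungry; the authors argue that the core bottleneck is MOS scale calibration rather than perception. The key is the LEAF framework: through knowledge distillation, dense point-wise judgments and pair-wise preferences from an MLLM teacher (with an estimate of decision reliability) are transferred to a lightweight student regressor, which is then calibrated on a small MOS subset. This yields MOS-aligned quality prediction at low annotation cost while maintaining strong correlation with human ratings.

Link: https://arxiv.org/abs/2601.20689
Authors: Xinyue Li,Zhichao Zhang,Zhiming Xu,Shubo Xu,Xiongkuo Min,Yitong Chen,Guangtao Zhai
Affiliations: Shanghai Jiao Tong University; Xi’an Jiaotong University; Baidu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: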

Abstract:Recent multimodal large language models (MLLMs) have demonstrated strong capabilities in image quality assessment (IQA) tasks. However, adapting such large-scale models is computationally expensive and still relies on substantial Mean Opinion Score (MOS) annotations. We argue that for MLLM-based IQA, the core bottleneck lies not in the quality perception capacity of MLLMs, but in MOS scale calibration. Therefore, we propose LEAF, a Label-Efficient Image Quality Assessment Framework that distills perceptual quality priors from an MLLM teacher into a lightweight student regressor, enabling MOS calibration with minimal human supervision. Specifically, the teacher conducts dense supervision through point-wise judgments and pair-wise preferences, with an estimate of decision reliability. Guided by these signals, the student learns the teacher’s quality perception patterns through joint distillation and is calibrated on a small MOS subset to align with human annotations. Experiments on both user-generated and AI-generated IQA benchmarks demonstrate that our method significantly reduces the need for human annotations while maintaining strong MOS-aligned correlations, making lightweight IQA practical under limited annotation budgets.
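
The calibration step, aligning a student's raw quality scores with the human MOS scale using only a few labeled images, can be sketched as a simple least-squares fit. The abstract does not say which calibration function LEAF uses, so the linear map below, together with the function names, is an assumption chosen for illustration.

```python
import numpy as np


def fit_mos_calibration(student_scores, mos_labels):
    """Fit MOS ≈ slope * score + intercept on a small labeled subset."""
    slope, intercept = np.polyfit(student_scores, mos_labels, deg=1)
    return float(slope), float(intercept)


def apply_calibration(student_scores, slope, intercept):
    """Map raw student scores onto the MOS scale with the fitted line."""
    return slope * np.asarray(student_scores, dtype=float) + intercept
```

A monotone map like this changes the scale of the predictions but not their ranking, which matches the claim that perception (ordering images by quality) and calibration (matching the MOS scale) are separable problems.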

[CV-10] Bi-modal Textual Prompt Learning for Vision-Language Models in Remote Sensing ICASSP2026

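[Quick Read]: This paper addresses the limited transferability of prompt learning (PL) for vision-language models to remote sensing (RS) imagery. RS data feature multi-label scenes, high intra-class variability, and diverse spatial resolutions, so existing PL methods struggle to identify dominant semantic cues and generalize poorly. The key is BiMoRS, a lightweight bi-modal prompt learning framework: a frozen image captioning model (e.g., BLIP-2) extracts textual semantic summaries from RS images, the captions are tokenized with a BERT tokenizer and fused with high-level visual features from the CLIP encoder, and a lightweight cross-attention module conditions a learnable query prompt on the fused representation, yielding contextualized prompts without modifying the CLIP backbone.

Link: https://arxiv.org/abs/2601.20675
Authors: Pankhi Kashyap,Mainak Singha,Biplab Banerjee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in ICASSP 2026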

Abstract:Prompt learning (PL) has emerged as an effective strategy to adapt vision-language models (VLMs), such as CLIP, for downstream tasks under limited supervision. While PL has demonstrated strong generalization on natural image datasets, its transferability to remote sensing (RS) imagery remains underexplored. RS data present unique challenges, including multi-label scenes, high intra-class variability, and diverse spatial resolutions, that hinder the direct applicability of existing PL methods. In particular, current prompt-based approaches often struggle to identify dominant semantic cues and fail to generalize to novel classes in RS scenarios. To address these challenges, we propose BiMoRS, a lightweight bi-modal prompt learning framework tailored for RS tasks. BiMoRS employs a frozen image captioning model (e.g., BLIP-2) to extract textual semantic summaries from RS images. These captions are tokenized using a BERT tokenizer and fused with high-level visual features from the CLIP encoder. A lightweight cross-attention module then conditions a learnable query prompt on the fused textual-visual representation, yielding contextualized prompts without altering the CLIP backbone. We evaluate BiMoRS on four RS datasets across three domain generalization (DG) tasks and observe consistent performance gains, outperforming strong baselines by up to 2% on average. Codes are available at this https URL.

[CV-11] ProSkill: Segment-Level Skill Assessment in Procedural Videos

【速读】:该论文旨在解决当前 procedural videos(程序性视频)中技能评估缺乏大规模、高质量标注数据集的问题,尤其是现有研究多集中于体育场景,且通常仅支持成对比较或二元标签,难以实现细粒度的绝对技能评分。解决方案的关键在于提出首个面向动作级技能评估的基准数据集 ProSkill,并设计了一种新颖且可扩展的标注协议:该协议基于瑞士轮锦标赛(Swiss Tournament)机制进行高效成对比较,再通过基于 ELO 的评分系统将局部成对结果聚合为全局一致的连续技能分数,从而实现从成对偏好到绝对技能排名的转化,为后续算法评估提供了可靠的数据基础。

链接: https://arxiv.org/abs/2601.20661
作者: Michele Mazzamuto,Daniele Di Mauro,Gianpiero Francesca,Giovanni Maria Farinella,Antonino Furnari
机构: University of Catania, Italy(卡塔尼亚大学, 意大利); Next Vision s.r.l., Italy(Next Vision 公司, 意大利); Toyota Motor Europe, Belgium(丰田欧洲汽车公司, 比利时)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at The IEEE/CVF Winter Conference on Applications of Computer Vision 2026


Abstract:Skill assessment in procedural videos is crucial for the objective evaluation of human performance in settings such as manufacturing and procedural daily tasks. Current research on skill assessment has predominantly focused on sports and lacks large-scale datasets for complex procedural activities. Existing studies typically involve only a limited number of actions, focus on either pairwise assessments (e.g., A is better than B) or on binary labels (e.g., good execution vs needs improvement). In response to these shortcomings, we introduce ProSkill, the first benchmark dataset for action-level skill assessment in procedural tasks. ProSkill provides absolute skill assessment annotations, along with pairwise ones. This is enabled by a novel and scalable annotation protocol that allows for the creation of an absolute skill assessment ranking starting from pairwise assessments. This protocol leverages a Swiss Tournament scheme for efficient pairwise comparisons, which are then aggregated into consistent, continuous global scores using an ELO-based rating system. We use our dataset to benchmark the main state-of-the-art skill assessment algorithms, including both ranking-based and pairwise paradigms. The suboptimal results achieved by the current state-of-the-art highlight the challenges and thus the value of ProSkill in the context of skill assessment for procedural videos. All data and code are available at this https URL
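
The annotation protocol in this abstract (Swiss-style pairing followed by Elo aggregation) can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the starting rating of 1000, the K-factor of 32, the pairing rule (adjacent clips in the current ranking, without rematch avoidance), and the `true_skill` oracle standing in for a human annotator are all hypothetical and not taken from the paper.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A when compared against B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """Return updated ratings after one pairwise comparison."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_wins else 0.0
    # The update is zero-sum: what A gains, B loses.
    return r_a + k * (s_a - e_a), r_b - k * (s_a - e_a)


def swiss_round(ratings: dict) -> list:
    """Pair clips that are adjacent in the current rating order (simplified)."""
    order = sorted(ratings, key=ratings.get, reverse=True)
    return [(order[i], order[i + 1]) for i in range(0, len(order) - 1, 2)]


def run_tournament(true_skill: dict, rounds: int = 5, start: float = 1000.0) -> dict:
    """Aggregate pairwise outcomes into continuous scores over several rounds."""
    ratings = {clip: start for clip in true_skill}
    for _ in range(rounds):
        for a, b in swiss_round(ratings):
            a_wins = true_skill[a] > true_skill[b]  # hypothetical annotator judgment
            ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], a_wins)
    return ratings
```

Because each round compares only clips with similar current ratings, the number of comparisons grows roughly linearly with the number of clips per round rather than quadratically, which is the efficiency argument behind Swiss-style pairing.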

[CV-12] FD-MAD: Frequency-Domain Residual Analysis for Face Morphing Attack Detection

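[Quick read]: This paper addresses the performance drop of single-image morphing attack detection (S-MAD) in cross-dataset and cross-morph settings, where no trusted reference sample is available. The key to the solution is a region-aware frequency-domain modeling approach. First, it introduces the concept of a residual frequency domain, which decouples the frequency content of the signal from the natural spectral decay and so makes bona fide and morphed faces easier to separate. Second, it fuses local evidence from different facial regions in a Markov Random Field to reach a globally consistent decision. Using only spectral features, the method achieves an average EER of 1.85% on FRLL-Morph and ranks second on MAD22 with an average EER of 6.12%, which supports the effectiveness of residual frequency modeling combined with structured regional fusion.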

Link: https://arxiv.org/abs/2601.20656
Authors: Diogo J. Paulo, Hugo Proença, João C. Neves
Affiliations: University of Beira Interior; NOVA LINCS; IT: Instituto de Telecomunicações
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Face morphing attacks present a significant threat to face recognition systems used in electronic identity enrolment and border control, particularly in single-image morphing attack detection (S-MAD) scenarios where no trusted reference is available. In spite of the vast amount of research on this problem, morph detection systems struggle in cross-dataset scenarios. To address this problem, we introduce a region-aware frequency-based morph detection strategy that drastically improves over strong baseline methods in challenging cross-dataset and cross-morph settings using a lightweight approach. Having observed the separability of bona fide and morph samples in the frequency domain of different facial parts, our approach 1) introduces the concept of residual frequency domain, where the frequency of the signal is decoupled from the natural spectral decay to easily discriminate between morph and bona fide data; 2) additionally, we reason in a global and local manner by combining the evidence from different facial regions in a Markov Random Field, which infers a globally consistent decision. The proposed method, trained exclusively on the synthetic morphing attack detection development dataset (SMDD), is evaluated in challenging cross-dataset and cross-morph settings on FRLL-Morph and MAD22 sets. Our approach achieves an average equal error rate (EER) of 1.85% on FRLL-Morph and ranks second on MAD22 with an average EER of 6.12%, while also obtaining a good bona fide presentation classification error rate (BPCER) at a low attack presentation classification error rate (APCER) using only spectral features. These findings indicate that Fourier-domain residual modeling with structured regional fusion offers a competitive alternative to deep S-MAD architectures.
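
One way to picture a "residual frequency domain" is to fit the smooth decay of an image's radially averaged power spectrum and keep only the deviation from that fit. The sketch below assumes a power-law decay fitted by least squares in log-log space. This is an assumption made for illustration: the abstract does not give the paper's exact definition, and the per-region analysis and Markov Random Field fusion are not shown.

```python
import numpy as np


def radial_power_spectrum(image: np.ndarray) -> np.ndarray:
    """Radially averaged power spectrum of a 2D grayscale image."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    power = np.abs(spectrum) ** 2
    h, w = image.shape
    yy, xx = np.indices((h, w))
    # Integer distance of every frequency bin from the spectrum centre.
    radius = np.hypot(yy - h // 2, xx - w // 2).astype(int)
    sums = np.bincount(radius.ravel(), weights=power.ravel())
    counts = np.bincount(radius.ravel())
    return sums / np.maximum(counts, 1)


def residual_spectrum(image: np.ndarray) -> np.ndarray:
    """Log-power residual after removing a fitted power-law decay (assumed model)."""
    profile = radial_power_spectrum(image)
    max_r = min(image.shape) // 2
    log_f = np.log(np.arange(1, max_r))          # skip the DC component
    log_p = np.log(profile[1:max_r] + 1e-12)
    slope, intercept = np.polyfit(log_f, log_p, 1)
    return log_p - (slope * log_f + intercept)
```

The residual could then serve as a feature vector for a lightweight classifier applied to each facial region, which is the role the abstract assigns to its spectral features.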

[CV-13] OS-Marathon: Benchmarking Computer-Use Agents on Long-Horizon Repetitive Tasks

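[Quick read]: This paper addresses the difficulty of automating long-horizon, repetitive workflows in professional settings, such as processing expense reports and entering student grades. These tasks are tedious and time-consuming for humans, but their structured, recurring sub-workflows make them well suited to Computer-Use Agents (CUAs). The lack of an evaluation benchmark for such tasks has limited the development and comparison of state-of-the-art agents. The authors therefore propose the OS-Marathon benchmark, which contains 242 long-horizon, repetitive tasks from two domains for systematically evaluating existing agents. The key to the solution is a cost-effective "condensed demonstration" method that uses only a few examples to extract and teach the logic underlying a task, so that agents can execute similar workflows on larger, unseen data collections.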

Link: https://arxiv.org/abs/2601.20650
Authors: Jing Wu, Daphne Barretto, Yiye Chen, Nicholas Gydé, Yanan Jian, Yuhang He, Vibhav Vineet
Affiliations: University of Oxford; Microsoft; Georgia Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages. Project page: this https URL


Abstract:Long-horizon, repetitive workflows are common in professional settings, such as processing expense reports from receipts and entering student grades from exam papers. These tasks are often tedious for humans since they can extend to extreme lengths proportional to the size of the data to process. However, they are ideal for Computer-Use Agents (CUAs) due to their structured, recurring sub-workflows with logic that can be systematically learned. Identifying the absence of an evaluation benchmark as a primary bottleneck, we establish OS-Marathon, comprising 242 long-horizon, repetitive tasks across 2 domains to evaluate state-of-the-art (SOTA) agents. We then introduce a cost-effective method to construct a condensed demonstration using only few-shot examples to teach agents the underlying workflow logic, enabling them to execute similar workflows effectively on larger, unseen data collections. Extensive experiments demonstrate both the inherent challenges of these tasks and the effectiveness of our proposed method. Project website: this https URL.

[CV-14] Detecting and Mitigating Memorization in Diffusion Models through Anisotropy of the Log-Probability ICLR2026

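[Quick read]: This paper addresses memorization in diffusion models, where a model unintentionally reproduces entire training images or parts of them. Existing detection methods rely mainly on the norm of the score difference as an indicator, but the authors prove that such metrics are mainly effective under the assumption of isotropic log-probability distributions, which generally holds only at high or medium noise levels. The key insight is that in the low-noise setting, memorized samples show strong angular alignment between the guidance vector and the unconditional scores, a characteristic of the anisotropic regime. Based on this, the authors propose a detection metric that integrates the isotropic norm with the anisotropic alignment. The metric can be computed directly on pure noise inputs with two forward passes (conditional and unconditional), without costly denoising steps, which improves both efficiency and accuracy.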

Link: https://arxiv.org/abs/2601.20642
Authors: Rohan Asthana, Vasileios Belagiannis
Affiliations: Friedrich-Alexander-Universität Erlangen-Nürnberg
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICLR 2026


Abstract:Diffusion-based image generative models produce high-fidelity images through iterative denoising but remain vulnerable to memorization, where they unintentionally reproduce exact copies or parts of training images. Recent memorization detection methods are primarily based on the norm of score difference as indicators of memorization. We prove that such norm-based metrics are mainly effective under the assumption of isotropic log-probability distributions, which generally holds at high or medium noise levels. In contrast, analyzing the anisotropic regime reveals that memorized samples exhibit strong angular alignment between the guidance vector and unconditional scores in the low-noise setting. Through these insights, we develop a memorization detection metric by integrating isotropic norm and anisotropic alignment. Our detection metric can be computed directly on pure noise inputs via two conditional and unconditional forward passes, eliminating the need for costly denoising steps. Detection experiments on Stable Diffusion v1.4 and v2 show that our metric outperforms existing denoising-free detection methods while being at least approximately 5x faster than the previous best approach. Finally, we demonstrate the effectiveness of our approach by utilizing a mitigation strategy that adapts memorized prompts based on our developed metric.
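
The two ingredients named in this abstract, the norm of the score difference and the angular alignment between the guidance vector and the unconditional score, are straightforward to compute once the two forward passes are available. The sketch below takes the conditional and unconditional score predictions as plain arrays. The function name is hypothetical, and the way the paper combines the two quantities into a single metric is not stated in the abstract, so the sketch returns them separately.

```python
import numpy as np


def guidance_statistics(score_cond: np.ndarray, score_uncond: np.ndarray, eps: float = 1e-12):
    """Return (norm, cosine) for one sample (hypothetical helper).

    norm:   L2 norm of the guidance vector, score_cond - score_uncond.
    cosine: cosine similarity between the guidance vector and score_uncond.
    """
    guidance = (score_cond - score_uncond).ravel()
    uncond = score_uncond.ravel()
    norm = float(np.linalg.norm(guidance))
    cosine = float(guidance @ uncond / (norm * np.linalg.norm(uncond) + eps))
    return norm, cosine
```

In practice `score_cond` and `score_uncond` would come from evaluating the denoiser once with the text prompt and once with an empty prompt on the same pure-noise input, which matches the denoising-free setting described in the abstract.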

[CV-15] CLEAR-Mamba: Towards Accurate, Adaptive and Trustworthy Multi-Sequence Ophthalmic Angiography Classification

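[Quick read]: This paper addresses weak generalization and unreliable high-confidence predictions in medical image classification, which stem from the limits of single-modality information, subtle lesion patterns, and significant inter-device variability. The key to the solution is the CLEAR-Mamba framework, built on two core components. The first is HaC, a hypernetwork-based adaptive conditioning layer that dynamically generates parameters according to the input feature distribution and improves cross-domain adaptability. The second is RaP, a reliability-aware prediction scheme built on evidential uncertainty learning, which guides the model to emphasize low-confidence samples and improves the overall stability and reliability of its predictions.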

Link: https://arxiv.org/abs/2601.20601
Authors: Zhuonan Wang, Wenjie Yan, Wenqiao Zhang, Xiaohui Song, Jian Ma, Ke Yao, Yibo Yu, Beng Chin Ooi
Affiliations: National University of Singapore; Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 7 figures


Abstract:Medical image classification is a core task in computer-aided diagnosis (CAD), playing a pivotal role in early disease detection, treatment planning, and patient prognosis assessment. In ophthalmic practice, fluorescein fundus angiography (FFA) and indocyanine green angiography (ICGA) provide hemodynamic and lesion-structural information that conventional fundus photography cannot capture. However, due to the single-modality nature, subtle lesion patterns, and significant inter-device variability, existing methods still face limitations in generalization and high-confidence prediction. To address these challenges, we propose CLEAR-Mamba, an enhanced framework built upon MedMamba with optimizations in both architecture and training strategy. Architecturally, we introduce HaC, a hypernetwork-based adaptive conditioning layer that dynamically generates parameters according to input feature distributions, thereby improving cross-domain adaptability. From a training perspective, we develop RaP, a reliability-aware prediction scheme built upon evidential uncertainty learning, which encourages the model to emphasize low-confidence samples and improves overall stability and reliability. We further construct a large-scale ophthalmic angiography dataset covering both FFA and ICGA modalities, comprising multiple retinal disease categories for model training and evaluation. Experimental results demonstrate that CLEAR-Mamba consistently outperforms multiple baseline models, including the original MedMamba, across various metrics, showing particular advantages in multi-disease classification and reliability-aware prediction. This study provides an effective solution that balances generalizability and reliability for modality-specific medical image classification tasks.

[CV-16] Person Re-ID in 2025: Supervised, Self-Supervised, and Language-Aligned. What Works?

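[Quick read]: This paper addresses the limited generalization of person re-identification (ReID) models in cross-domain scenarios, with a focus on how different training paradigms affect robustness. The key to the approach is a systematic comparison of three training paradigms: supervised, self-supervised, and language-aligned. It shows that language-aligned models such as SigLIP2, although not designed for ReID, adapt across domains markedly better than conventional supervised models. This suggests that building more transferable visual representations on foundation models is an effective route to better ReID generalization.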

Link: https://arxiv.org/abs/2601.20598
Authors: Lakshman Balasubramanian
Affiliations: MoiiAi Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:


Abstract:Person Re-Identification (ReID) remains a challenging problem in computer vision. This work reviews various training paradigms, evaluates the robustness of state-of-the-art ReID models in cross-domain applications, and examines the role of foundation models in improving generalization through richer, more transferable visual representations. We compare three training paradigms: supervised, self-supervised, and language-aligned models. Through this study, we aim to answer the following questions: Can supervised models generalize in cross-domain scenarios? How do foundation models like SigLIP2 perform on ReID tasks? What are the weaknesses of current supervised and foundation models for ReID? We conducted the analysis across 11 models and 9 datasets. Our results show a clear split: supervised models dominate their training domain but crumble on cross-domain data. Language-aligned models, however, show surprising cross-domain robustness on ReID tasks, even though they are not explicitly trained for them. Code and data are available at: this https URL.

[CV-17] StructAlign: Structured Cross-Modal Alignment for Continual Text-to-Video Retrieval

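[Quick read]: This paper addresses catastrophic forgetting in Continual Text-to-Video Retrieval (CTVR). The core challenge is two kinds of feature drift: intra-modal feature drift, and non-cooperative feature drift across modalities, which breaks the alignment between text and video features. The key to the solution is StructAlign, a structured cross-modal alignment method with three main elements: (1) a simplex Equiangular Tight Frame (ETF) geometry used as a unified structural prior to mitigate modality misalignment; (2) a cross-modal ETF alignment loss based on category-level ETF prototypes, which encourages text and video features to form an approximate simplex ETF structure; and (3) a Cross-modal Relation Preserving loss, which uses the complementarity of the two modalities to stabilize feature updates and thereby suppress intra-modal drift. By addressing cross-modal and intra-modal feature drift jointly, StructAlign alleviates catastrophic forgetting in CTVR.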

Link: https://arxiv.org/abs/2601.20597
Authors: Shaokun Wang, Weili Guan, Jizhou Han, Jianlong Wu, Yupeng Hu, Liqiang Nie
Affiliations: Harbin Institute of Technology (Shenzhen); Xi’an Jiaotong University; Shandong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Continual Text-to-Video Retrieval (CTVR) is a challenging multimodal continual learning setting, where models must incrementally learn new semantic categories while maintaining accurate text-video alignment for previously learned ones, thus making it particularly prone to catastrophic forgetting. A key challenge in CTVR is feature drift, which manifests in two forms: intra-modal feature drift caused by continual learning within each modality, and non-cooperative feature drift across modalities that leads to modality misalignment. To mitigate these issues, we propose StructAlign, a structured cross-modal alignment method for CTVR. First, StructAlign introduces a simplex Equiangular Tight Frame (ETF) geometry as a unified geometric prior to mitigate modality misalignment. Building upon this geometric prior, we design a cross-modal ETF alignment loss that aligns text and video features with category-level ETF prototypes, encouraging the learned representations to form an approximate simplex ETF geometry. In addition, to suppress intra-modal feature drift, we design a Cross-modal Relation Preserving loss, which leverages complementary modalities to preserve cross-modal similarity relations, providing stable relational supervision for feature updates. By jointly addressing non-cooperative feature drift across modalities and intra-modal feature drift, StructAlign effectively alleviates catastrophic forgetting in CTVR. Extensive experiments on benchmark datasets demonstrate that our method consistently outperforms state-of-the-art continual retrieval approaches.
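
A simplex ETF for K categories is a set of K unit vectors whose pairwise cosine similarity is exactly -1/(K-1), the most spread-out configuration possible. The sketch below builds such prototypes with the standard construction (a centred identity matrix mapped through an orthonormal basis) and adds a simple cosine alignment loss. The construction is the textbook one. The loss, the function names, and the requirement `dim >= num_classes` are assumptions made for illustration, since the abstract does not give StructAlign's exact loss.

```python
import numpy as np


def simplex_etf(num_classes: int, dim: int, seed: int = 0) -> np.ndarray:
    """Return a (dim, num_classes) matrix whose columns form a simplex ETF."""
    if num_classes < 2 or dim < num_classes:
        raise ValueError("this construction needs 2 <= num_classes <= dim")
    rng = np.random.default_rng(seed)
    # Orthonormal columns via reduced QR of a random Gaussian matrix.
    basis, _ = np.linalg.qr(rng.standard_normal((dim, num_classes)))
    k = num_classes
    centering = np.eye(k) - np.ones((k, k)) / k
    return np.sqrt(k / (k - 1)) * basis @ centering


def etf_alignment_loss(features: np.ndarray, labels: np.ndarray, prototypes: np.ndarray) -> float:
    """Mean (1 - cosine) between each feature and its class prototype (assumed loss)."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    targets = prototypes[:, labels].T  # (n_samples, dim), already unit norm
    return float(np.mean(1.0 - np.sum(unit * targets, axis=1)))
```

Applying the same fixed prototypes to both text and video features gives the two modalities a shared target geometry, which is the intuition behind using the ETF as a unified prior.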

[CV-18] DiffVC-RT: Towards Practical Real-Time Diffusion-based Perceptual Neural Video Compression

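[Quick read]: This paper targets three key challenges in the practical deployment of diffusion-based Neural Video Compression (NVC): severe information loss, prohibitive inference latency, and poor temporal consistency. Its contributions are threefold. First, an efficient and informative model architecture uses module replacement and pruning to reduce computational complexity while limiting structural information loss. Second, explicit and implicit consistency modeling improves temporal consistency with a zero-cost Online Temporal Shift Module and adds hybrid implicit constraints to suppress generative flickering artifacts. Third, an asynchronous and parallel decoding pipeline with Mixed Half Precision enables asynchronous latent decoding and parallel frame reconstruction. On an NVIDIA H800 GPU, the method reaches real-time encoding (206 fps) and decoding (30 fps) for 720p video and saves 80.1% bitrate in terms of LPIPS relative to VTM-17.0 on the HEVC dataset.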

Link: https://arxiv.org/abs/2601.20564
Authors: Wenzhuo Ma, Zhenzhong Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 10 figures


Abstract:The practical deployment of diffusion-based Neural Video Compression (NVC) faces critical challenges, including severe information loss, prohibitive inference latency, and poor temporal consistency. To bridge this gap, we propose DiffVC-RT, the first framework designed to achieve real-time diffusion-based perceptual NVC. First, we introduce an Efficient and Informative Model Architecture. Through strategic module replacements and pruning, this architecture significantly reduces computational complexity while mitigating structural information loss. Second, to address generative flickering artifacts, we propose Explicit and Implicit Consistency Modeling. We enhance temporal consistency by explicitly incorporating a zero-cost Online Temporal Shift Module within the U-Net, complemented by hybrid implicit consistency constraints. Finally, we present an Asynchronous and Parallel Decoding Pipeline incorporating Mixed Half Precision, which enables asynchronous latent decoding and parallel frame reconstruction via a Batch-dimension Temporal Shift design. Experiments show that DiffVC-RT achieves 80.1% bitrate savings in terms of LPIPS over VTM-17.0 on HEVC dataset with real-time encoding and decoding speeds of 206 / 30 fps for 720p videos on an NVIDIA H800 GPU, marking a significant milestone in diffusion-based video compression.
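
A temporal shift module passes information between frames by moving a slice of feature channels along the time axis, which adds no parameters and no multiplications. The sketch below shows a causal ("online") variant in which each frame receives channels only from its predecessor, with frames stacked along the first axis as in a batch. The shifted fraction of 1/8, the zero fill for the first frame, and the function name are assumptions, since the abstract does not give DiffVC-RT's settings.

```python
import numpy as np


def online_temporal_shift(frames: np.ndarray, shift_fraction: float = 0.125) -> np.ndarray:
    """Shift the first channels of each frame one step forward in time.

    frames: array of shape (T, C, H, W), with time on the first (batch) axis.
    Returns an array of the same shape. Only past frames are read, so the
    operation can run causally on a stream.
    """
    num_shift = int(frames.shape[1] * shift_fraction)
    out = frames.copy()
    if num_shift > 0:
        out[1:, :num_shift] = frames[:-1, :num_shift]  # take channels from frame t-1
        out[0, :num_shift] = 0.0                        # first frame has no predecessor
    return out
```

Because the shift is pure indexing, placing it inside a U-Net block changes what each convolution sees without changing the compute budget, which fits the "zero-cost" description in the abstract.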

[CV-19] DeepSeek-OCR 2: Visual Causal Flow

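[Quick read]: This paper addresses the fact that conventional vision-language models (VLMs) encode visual tokens in a fixed raster-scan order (top-left to bottom-right). This hard-coded order differs from human visual perception, which follows flexible, causally driven scanning patterns based on semantic logic, and it is a particular weakness on images with complex layouts. The key to the solution is DeepEncoder V2, a new encoder architecture that dynamically reorders visual tokens according to image semantics. By giving the encoder causal reasoning capabilities, the model reorders visual information in a semantically guided way before it is passed to the large language model (LLM), moving closer to a human-like mechanism for 2D understanding. The work explores whether effective 2D image understanding can be achieved with two cascaded 1D causal reasoning structures, which offers a new path toward vision-language architectures with genuine 2D reasoning.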

Link: https://arxiv.org/abs/2601.20552
Authors: Haoran Wei, Yaofeng Sun, Yukun Li
Affiliations: DeepSeek-AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:We present DeepSeek-OCR 2 to investigate the feasibility of a novel encoder, DeepEncoder V2, capable of dynamically reordering visual tokens based on image semantics. Conventional vision-language models (VLMs) invariably process visual tokens in a rigid raster-scan order (top-left to bottom-right) with fixed positional encoding when fed into LLMs. However, this contradicts human visual perception, which follows flexible yet semantically coherent scanning patterns driven by inherent logical structures. Particularly for images with complex layouts, human vision exhibits causally-informed sequential processing. Inspired by this cognitive mechanism, DeepEncoder V2 is designed to endow the encoder with causal reasoning capabilities, enabling it to intelligently reorder visual tokens prior to LLM-based content interpretation. This work explores a novel paradigm: whether 2D image understanding can be effectively achieved through two cascaded 1D causal reasoning structures, thereby offering a new architectural approach with the potential to achieve genuine 2D reasoning. Codes and model weights are publicly accessible at this http URL.

[CV-20] Advancing Open-source World Models

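[Quick read]: This paper addresses the limitations of current world models in environment diversity, long-term temporal consistency, and real-time interactivity, particularly in simulation driven by video generation. The key to the solution is LingBot-World, an open-source world simulator with three main strengths: (1) high fidelity and robust dynamics across many environment types, such as realistic scenes, scientific contexts, and cartoon styles; (2) a minute-level horizon with contextual consistency over time ("long-term memory"); and (3) real-time interactivity, with latency under 1 second when producing 16 frames per second. By releasing code and models, the work aims to narrow the gap between open-source and closed-source technologies and to provide a practical tool for content creation, gaming, and robot learning.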

Link: https://arxiv.org/abs/2601.20540
Authors: Robbyant Team: Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu, Ka Leong Cheng, Yixuan Li, Hanlin Wang, Yinghao Xu, Shuailei Ma, Yihang Chen, Jie Liu, Yansong Cheng, Yao Yao, Jiayi Zhu, Yihao Meng, Kecheng Zheng, Qingyan Bai, Jingye Chen, Zehong Shen, Yue Yu, Xing Zhu, Yujun Shen, Hao Ouyang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL. Code: this https URL


Abstract:We present LingBot-World, an open-sourced world simulator stemming from video generation. Positioned as a top-tier world model, LingBot-World offers the following features. (1) It maintains high fidelity and robust dynamics in a broad spectrum of environments, including realism, scientific contexts, cartoon styles, and beyond. (2) It enables a minute-level horizon while preserving contextual consistency over time, which is also known as “long-term memory”. (3) It supports real-time interactivity, achieving a latency of under 1 second when producing 16 frames per second. We provide public access to the code and model in an effort to narrow the divide between open-source and closed-source technologies. We believe our release will empower the community with practical applications across areas like content creation, gaming, and robot learning.

[CV-21] IOTA: Corrective Knowledge-Guided Prompt Learning via Black-White Box Framework

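[Quick read]: This paper addresses a limitation of adapting pre-trained models to downstream tasks: treating the model as a black box ignores its inherent prior knowledge and so limits its effectiveness. The key to the solution is a black-white box prompt learning framework (IOTA) that combines a data-driven Black Box module with a knowledge-driven White Box module. The White Box module derives corrective knowledge by contrasting wrong predictions with the right cognition and verbalizes it into interpretable human prompts. A corrective knowledge-guided prompt selection strategy then uses these prompts to steer the Black Box module toward more accurate predictions. Using knowledge-driven and data-driven learning signals together improves downstream task adaptation.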

Link: https://arxiv.org/abs/2601.20526
Authors: Shaokun Wang, Yifan Yu, Yuhang He, Weili Guan, Yihong Gong
Affiliations: Harbin Institute of Technology (Shenzhen); Xi’an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Recently, adapting pre-trained models to downstream tasks has attracted increasing interest. Previous Parameter-Efficient-Tuning (PET) methods regard the pre-trained model as an opaque Black Box model, relying purely on data-driven optimization and underutilizing their inherent prior knowledge. This oversight limits the models’ potential for effective downstream task adaptation. To address these issues, we propose a novel black-whIte bOx prompT leArning framework (IOTA), which integrates a data-driven Black Box module with a knowledge-driven White Box module for downstream task adaptation. Specifically, the White Box module derives corrective knowledge by contrasting the wrong predictions with the right cognition. This knowledge is verbalized into interpretable human prompts and leveraged through a corrective knowledge-guided prompt selection strategy to guide the Black Box module toward more accurate predictions. By jointly leveraging knowledge- and data-driven learning signals, IOTA achieves effective downstream task adaptation. Experimental results on 12 image classification benchmarks under few-shot and easy-to-hard adaptation settings demonstrate the effectiveness of corrective knowledge and the superiority of our method over state-of-the-art methods.

[CV-22] AnomalyVFM – Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors

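[Quick read]: This paper addresses the performance gap in zero-shot anomaly detection between methods based on vision foundation models (VFMs) and those based on vision-language models (VLMs). The core challenges are the limited diversity of existing auxiliary anomaly detection datasets and overly shallow VFM adaptation strategies. The key to the solution is the AnomalyVFM framework, which increases training diversity with a three-stage synthetic dataset generation scheme and adds a parameter-efficient adaptation mechanism consisting of low-rank feature adapters and a confidence-weighted pixel loss. With RADIO as the backbone, AnomalyVFM reaches an average image-level AUROC of 94.1% across 9 datasets, 3.3 percentage points above previous methods.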

Link: https://arxiv.org/abs/2601.20524
Authors: Matic Fučka, Vitjan Zavrtanik, Danijel Skočaj
Affiliations: University of Ljubljana, Faculty of Computer and Information Science
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Zero-shot anomaly detection aims to detect and localise abnormal regions in the image without access to any in-domain training images. While recent approaches leverage vision-language models (VLMs), such as CLIP, to transfer high-level concept knowledge, methods based on purely vision foundation models (VFMs), like DINOv2, have lagged behind in performance. We argue that this gap stems from two practical issues: (i) limited diversity in existing auxiliary anomaly detection datasets and (ii) overly shallow VFM adaptation strategies. To address both challenges, we propose AnomalyVFM, a general and effective framework that turns any pretrained VFM into a strong zero-shot anomaly detector. Our approach combines a robust three-stage synthetic dataset generation scheme with a parameter-efficient adaptation mechanism, utilising low-rank feature adapters and a confidence-weighted pixel loss. Together, these components enable modern VFMs to substantially outperform current state-of-the-art methods. More specifically, with RADIO as a backbone, AnomalyVFM achieves an average image-level AUROC of 94.1% across 9 diverse datasets, surpassing previous methods by significant 3.3 percentage points. Project Page: this https URL

[CV-23] Context Tokens are Anchors: Understanding the Repetition Curse in dMLLMs from an Information Flow Perspective ICLR2026

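[Quick read]: This paper addresses the repetitive text generation that caching mechanisms introduce during inference in diffusion-based Multimodal Large Language Models (dMLLMs), which the authors call the "Repeat Curse". The core problem is that caching accelerates decoding but disrupts the information flow of context tokens, so their entropy fails to converge in deeper layers and the model produces redundant output. The key to the solution is CoTA, a plug-and-play method that strengthens the attention of context tokens to preserve the intrinsic information flow pattern and adds a penalty term to the confidence score during decoding to suppress outputs driven by uncertain context tokens. This mitigates repetition and improves performance on general tasks.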

Link: https://arxiv.org/abs/2601.20520
Authors: Qiyan Zhao, Xiaofeng Zhang, Shuochen Chang, Qianyu Chen, Xiaosong Yuan, Xuhang Chen, Luoqi Liu, Jiajun Zhang, Xu-Yao Zhang, Da-Han Wang
Affiliations: SJTU; CASIA; NTU; Meitu (China) Limited; XMUT
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICLR 2026


Abstract:Recent diffusion-based Multimodal Large Language Models (dMLLMs) suffer from high inference latency and therefore rely on caching techniques to accelerate decoding. However, the application of cache mechanisms often introduces undesirable repetitive text generation, a phenomenon we term the Repeat Curse. To better investigate the underlying mechanism behind this issue, we analyze repetition generation through the lens of information flow. Our work reveals three key findings: (1) context tokens aggregate semantic information as anchors and guide the final predictions; (2) as information propagates across layers, the entropy of context tokens converges in deeper layers, reflecting the model’s growing prediction certainty; (3) repetition is typically linked to disruptions in the information flow of context tokens and to the inability of their entropy to converge in deeper layers. Based on these insights, we present CoTA, a plug-and-play method for mitigating repetition. CoTA enhances the attention of context tokens to preserve intrinsic information flow patterns, while introducing a penalty term to the confidence score during decoding to avoid outputs driven by uncertain context tokens. With extensive experiments, CoTA demonstrates significant effectiveness in alleviating repetition and achieves consistent performance improvements on general tasks. Code is available at this https URL

[CV-24] Say Cheese! Detail-Preserving Portrait Collection Generation via Natural Language Edits

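[Quick read]: This paper addresses Portrait Collection Generation (PCG), the task of editing a reference portrait image through natural language instructions to produce a coherent, high-quality portrait collection. The task poses two challenges: complex multi-attribute modifications (such as pose, spatial layout, and camera viewpoint) and high-fidelity detail preservation (such as identity, clothing, and accessories). The key to the solution is CHEESE, the first large-scale PCG dataset (24K portrait collections and 573K samples), together with the SCheese framework, which combines text-guided generation with hierarchical identity and detail preservation. SCheese uses an adaptive feature fusion mechanism to maintain identity consistency and a ConsistencyNet to inject fine-grained features for detail consistency, and it achieves state-of-the-art performance in the reported experiments.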

Link: https://arxiv.org/abs/2601.20511
Authors: Zelong Sun, Jiahui Wu, Ying Ba, Dong Jing, Zhiwu Lu
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:As social media platforms proliferate, users increasingly demand intuitive ways to create diverse, high-quality portrait collections. In this work, we introduce Portrait Collection Generation (PCG), a novel task that generates coherent portrait collections by editing a reference portrait image through natural language instructions. This task poses two unique challenges to existing methods: (1) complex multi-attribute modifications such as pose, spatial layout, and camera viewpoint; and (2) high-fidelity detail preservation including identity, clothing, and accessories. To address these challenges, we propose CHEESE, the first large-scale PCG dataset containing 24K portrait collections and 573K samples with high-quality modification text annotations, constructed through a Large Vision-Language Model-based pipeline with inversion-based verification. We further propose SCheese, a framework that combines text-guided generation with hierarchical identity and detail preservation. SCheese employs an adaptive feature fusion mechanism to maintain identity consistency, and ConsistencyNet to inject fine-grained features for detail consistency. Comprehensive experiments validate the effectiveness of CHEESE in advancing PCG, with SCheese achieving state-of-the-art performance.

[CV-25] Latent Temporal Discrepancy as Motion Prior: A Loss-Weighting Strategy for Dynamic Fidelity in T2V ICASSP2026

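[Quick read]: This paper addresses the limited performance of current video generation models on dynamic scenes. When a video contains drastic motion, noise disrupts temporal coherence and makes dynamic regions harder to learn. Existing diffusion models apply a uniform static loss that does not adapt to regions with different motion intensity, which limits their ability to reconstruct high-frequency dynamic detail. The key to the solution is Latent Temporal Discrepancy (LTD), used as a motion prior. LTD quantifies frame-to-frame variation in the latent space and adjusts the loss weights adaptively: regions with higher discrepancy receive larger penalties to strengthen dynamic modeling, while stable regions keep regular optimization. This stabilizes training and improves motion quality.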

Link: https://arxiv.org/abs/2601.20504
Authors: Meiqi Wu, Bingze Song, Ruimin Lin, Chen Zhu, Xiaokun Feng, Jiahong Wu, Xiangxiang Chu, Kaiqi Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICASSP 2026


Abstract:Video generation models have achieved notable progress in static scenarios, yet their performance in motion video generation remains limited, with quality degrading under drastic dynamic changes. This is due to noise disrupting temporal coherence and increasing the difficulty of learning dynamic regions. Unfortunately, existing diffusion models rely on static loss for all scenarios, constraining their ability to capture complex dynamics. To address this issue, we introduce Latent Temporal Discrepancy (LTD) as a motion prior to guide loss weighting. LTD measures frame-to-frame variation in the latent space, assigning larger penalties to regions with higher discrepancy while maintaining regular optimization for stable regions. This motion-aware strategy stabilizes training and enables the model to better reconstruct high-frequency dynamics. Extensive experiments on the general benchmark VBench and the motion-focused VMBench show consistent gains, with our method outperforming strong baselines by 3.31% on VBench and 3.58% on VMBench, achieving significant improvements in motion quality.
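
The loss-weighting idea in this abstract can be sketched as follows: measure how much each latent location changes between consecutive frames, turn that into a weight map, and multiply it into the per-element training loss. The sketch is a minimal illustration under assumptions. The absolute-difference measure, the max normalization, the `1 + strength * ltd` weighting, and the function names are hypothetical choices, since the abstract does not give the paper's exact formula.

```python
import numpy as np


def ltd_weights(latents: np.ndarray, strength: float = 1.0, eps: float = 1e-8) -> np.ndarray:
    """Per-location loss weights from frame-to-frame latent change.

    latents: array of shape (T, C, H, W).
    Returns weights of shape (T, 1, H, W), equal to 1 in static regions and
    up to 1 + strength where the latent changes most.
    """
    diff = np.abs(latents[1:] - latents[:-1]).mean(axis=1, keepdims=True)
    # The first frame has no predecessor, so it reuses the first difference map.
    diff = np.concatenate([diff[:1], diff], axis=0)
    return 1.0 + strength * diff / (diff.max() + eps)


def weighted_mse(pred: np.ndarray, target: np.ndarray, weights: np.ndarray) -> float:
    """Mean squared error with motion-aware weights broadcast over channels."""
    return float(np.mean(weights * (pred - target) ** 2))
```

Keeping the minimum weight at 1 means stable regions are optimized exactly as under the unweighted loss, which mirrors the abstract's statement that regular optimization is maintained for them.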

[CV-26] Comparative evaluation of training strategies using partially labelled datasets for segmentation of white matter hyperintensities and stroke lesions in FLAIR MRI

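[Quick read]: This paper addresses the difficulty of distinguishing two imaging features associated with cerebral small vessel disease (SVD): white matter hyperintensities (WMH) and ischaemic stroke lesions (ISL), which visually confound each other in the fluid-attenuated inversion recovery (FLAIR) sequence. Because the two often appear in the same subject and fully labelled data are limited, developing and validating deep learning models is difficult. The key to the approach is to exploit partially labelled data: the authors compare six strategies for training a combined WMH and ISL segmentation model and find that pseudolabels give the best result.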

Link: https://arxiv.org/abs/2601.20503
Authors: Jesse Phitidis, Alison Q. Smithard, William N. Whiteley, Joanna M. Wardlaw, Miguel O. Bernabeu, Maria Valdés Hernández
Affiliations: University of Edinburgh
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:


Abstract:White matter hyperintensities (WMH) and ischaemic stroke lesions (ISL) are imaging features associated with cerebral small vessel disease (SVD) that are visible on brain magnetic resonance imaging (MRI) scans. The development and validation of deep learning models to segment and differentiate these features is difficult because they visually confound each other in the fluid-attenuated inversion recovery (FLAIR) sequence and often appear in the same subject. We investigated six strategies for training a combined WMH and ISL segmentation model using partially labelled data. We combined privately held fully and partially labelled datasets with publicly available partially labelled datasets to yield a total of 2052 MRI volumes, with 1341 and 1152 containing ground truth annotations for WMH and ISL respectively. We found that several methods were able to effectively leverage the partially labelled data to improve model performance, with the use of pseudolabels yielding the best result.

[CV-27] Efficient Autoregressive Video Diffusion with Dummy Head

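[Quick read]: This paper addresses the under-utilization of historical frames by multi-head self-attention in autoregressive video diffusion models: about 25% of attention heads attend almost exclusively to the current frame, and discarding their key-value (KV) caches has only a minor effect on performance. The key to the solution is Dummy Forcing, which reduces head-wise context redundancy through heterogeneous memory allocation and uses dynamic head programming to classify head types adaptively. A context packing technique enables more aggressive cache compression. Without additional training, the method delivers up to 2.0x speedup over the baseline with less than 0.5% quality drop, supporting video generation at 24.3 FPS.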

Link: https://arxiv.org/abs/2601.20499
Authors: Hang Guo, Zhaoyang Jia, Jiahao Li, Bin Li, Yuanhao Cai, Jiangshan Wang, Yawei Li, Yan Lu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report

Abstract:The autoregressive video diffusion model has recently gained considerable research interest due to its causal modeling and iterative denoising. In this work, we identify that the multi-head self-attention in these models under-utilizes historical frames: approximately 25% of heads attend almost exclusively to the current frame, and discarding their KV caches incurs only minor performance degradation. Building upon this, we propose Dummy Forcing, a simple yet effective method to control context accessibility across different heads. Specifically, the proposed heterogeneous memory allocation reduces head-wise context redundancy, accompanied by dynamic head programming to adaptively classify head types. Moreover, we develop a context packing technique to achieve more aggressive cache compression. Without additional training, our Dummy Forcing delivers up to 2.0x speedup over the baseline, supporting video generation at 24.3 FPS with less than 0.5% quality drop. Project page is available at this https URL.
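
As a rough illustration of the head classification idea, the sketch below flags heads whose attention mass falls almost entirely on current-frame tokens, which would make their historical KV cache a candidate for dropping. The function name, the 0.95 threshold, and the toy attention rows are assumptions for illustration; this is not the paper's dynamic head programming.

```python
def find_dummy_heads(attn_per_head, num_current_tokens, threshold=0.95):
    """Return indices of heads whose attention mass on the current frame >= threshold.

    attn_per_head: one attention row per head, laid out as
    [history tokens..., current-frame tokens...] and summing to 1.
    """
    dummy = []
    for head_idx, row in enumerate(attn_per_head):
        current_mass = sum(row[-num_current_tokens:])
        if current_mass >= threshold:
            dummy.append(head_idx)
    return dummy

# Toy example: 4 heads, 4 history tokens followed by 2 current-frame tokens.
attn = [
    [0.20, 0.20, 0.20, 0.20, 0.10, 0.10],  # spreads over history
    [0.00, 0.01, 0.01, 0.00, 0.50, 0.48],  # almost only the current frame
    [0.30, 0.10, 0.10, 0.10, 0.20, 0.20],
    [0.25, 0.25, 0.25, 0.15, 0.05, 0.05],
]
dummy_heads = find_dummy_heads(attn, num_current_tokens=2)
```

In this toy case one head in four is flagged, which happens to mirror the roughly 25% figure reported in the abstract.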

[CV-28] Exploiting the Final Component of Generator Architectures for AI-Generated Image Detection

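Quick read: This paper addresses the poor generalization of current deepfake image detectors to images from unseen generators. The core idea is to "contaminate" real images using the final component of a generator, and to train a detector to distinguish the contaminated images from the original real images. The key is to identify and exploit the final architectural components that different generative models share, which gives the detector cross-generator generalization. Using only 100 samples from each of three representative categories to fine-tune a DINOv3 backbone, the detector reaches an average accuracy of 98.83% across 22 test sets from unseen generators.

Link: https://arxiv.org/abs/2601.20461
Authors: Yanzhu Liu, Xiao Liu, Yuexuan Wang, Mondal Soumik
Affiliation: Institute for Infocomm Research, A*STAR, Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: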

Abstract:With the rapid proliferation of powerful image generators, accurate detection of AI-generated images has become essential for maintaining a trustworthy online environment. However, existing deepfake detectors often generalize poorly to images produced by unseen generators. Notably, despite being trained under vastly different paradigms, such as diffusion or autoregressive modeling, many modern image generators share common final architectural components that serve as the last stage for converting intermediate representations into images. Motivated by this insight, we propose to “contaminate” real images using the generator’s final component and train a detector to distinguish them from the original real images. We further introduce a taxonomy based on generators’ final components and categorize 21 widely used generators accordingly, enabling a comprehensive investigation of our method’s generalization capability. Using only 100 samples from each of three representative categories, our detector, fine-tuned on the DINOv3 backbone, achieves an average accuracy of 98.83% across 22 testing sets from unseen generators.

[CV-29] MARE: Multimodal Alignment and Reinforcement for Explainable Deepfake Detection via Vision-Language Models

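Quick read: This paper addresses the limited accuracy and explainability of Deepfake detection models, a problem made more pressing by rapid advances in generative models and the difficulty traditional methods have with complex forged content. The key to the solution is MARE, a multimodal alignment and reinforcement framework built on vision-language models (VLMs). MARE designs comprehensive reward functions that incorporate reinforcement learning from human feedback (RLHF) to incentivize reasoning text that is aligned with spatial locations, and introduces a forgery disentanglement module that captures intrinsic forgery traces from high-level facial semantics, improving detection accuracy and reliability.

Link: https://arxiv.org/abs/2601.20433
Authors: Wenbo Xu, Wei Lu, Xiangyang Luo, Jiantao Zhou
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: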

Abstract:Deepfake detection is a widely researched topic that is crucial for combating the spread of malicious content, with existing methods mainly modeling the problem as classification or spatial localization. The rapid advancements in generative models impose new demands on Deepfake detection. In this paper, we propose multimodal alignment and reinforcement for explainable Deepfake detection via vision-language models, termed MARE, which aims to enhance the accuracy and reliability of Vision-Language Models (VLMs) in Deepfake detection and reasoning. Specifically, MARE designs comprehensive reward functions, incorporating reinforcement learning from human feedback (RLHF), to incentivize the generation of text-spatially aligned reasoning content that adheres to human preferences. Besides, MARE introduces a forgery disentanglement module to capture intrinsic forgery traces from high-level facial semantics, thereby improving its authenticity detection capability. We conduct thorough evaluations on the reasoning content generated by MARE. Both quantitative and qualitative experimental results demonstrate that MARE achieves state-of-the-art performance in terms of accuracy and reliability.

[CV-30] Youtu-Parsing: Perception Structuring and Recognition via High-Parallelism Decoding

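Quick read: This paper addresses the efficiency and accuracy of content extraction in document intelligence, particularly for structured elements such as tables, formulas, and charts. It proposes Youtu-Parsing, a decoupled, feature-reusable architecture that combines a native Vision Transformer (ViT) dynamic-resolution visual encoder with a prompt-guided Youtu-LLM-2B language model. Two parallel decoding strategies make inference efficient. Token parallelism generates up to 64 candidate tokens per inference step and filters them through a verification mechanism, giving a 5-11x speedup over traditional autoregressive decoding. Query parallelism predicts content for up to five bounding boxes simultaneously, giving an additional 2x acceleration with output quality equivalent to standard decoding. The model is robust on rare characters, multilingual text, and handwritten content, and achieves state-of-the-art (SOTA) results on OmniDocBench and olmOCR-bench.

Link: https://arxiv.org/abs/2601.20430
Authors: Kun Yin, Yunfei Wu, Bing Liu, Zhongpeng Cai, Xiaotian Li, Huang Chen, Xin Li, Haoyu Cao, Yinsong Liu, Deqiang Jiang, Xing Sun, Yunsheng Wu, Qianyu Li, Antai Guo, Yanzhen Liao, Yanqiu Qu, Haodong Lin, Chengxu He, Shuangyin Liu
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: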

Abstract:This paper presents Youtu-Parsing, an efficient and versatile document parsing model designed for high-performance content extraction. The architecture employs a native Vision Transformer (ViT) featuring a dynamic-resolution visual encoder to extract shared document features, coupled with a prompt-guided Youtu-LLM-2B language model for layout analysis and region-prompted decoding. Leveraging this decoupled and feature-reusable framework, we introduce a high-parallelism decoding strategy comprising two core components: token parallelism and query parallelism. The token parallelism strategy concurrently generates up to 64 candidate tokens per inference step, which are subsequently validated through a verification mechanism. This approach yields a 5–11x speedup over traditional autoregressive decoding and is particularly well-suited for highly structured scenarios, such as table recognition. To further exploit the advantages of region-prompted decoding, the query parallelism strategy enables simultaneous content prediction for multiple bounding boxes (up to five), providing an additional 2x acceleration while maintaining output quality equivalent to standard decoding. Youtu-Parsing encompasses a diverse range of document elements, including text, formulas, tables, charts, seals, and hierarchical structures. Furthermore, the model exhibits strong robustness when handling rare characters, multilingual text, and handwritten content. Extensive evaluations demonstrate that Youtu-Parsing achieves state-of-the-art (SOTA) performance on both the OmniDocBench and olmOCR-bench benchmarks. Overall, Youtu-Parsing demonstrates significant experimental value and practical utility for large-scale document intelligence applications.
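
The abstract does not spell out the verification mechanism used by token parallelism. One common acceptance rule for drafted tokens, assumed here purely for illustration, keeps the longest drafted prefix that matches what greedy decoding would have produced. The names `verify_draft` and `toy_next` are hypothetical, and the toy "model" is a lookup table rather than a language model.

```python
def verify_draft(prefix, draft, next_token):
    """Accept the longest prefix of `draft` that matches greedy decoding.

    next_token(seq) -> token stands in for one greedy decoding step.
    A real system would check all draft positions in one batched forward
    pass; the loop here only mimics the acceptance rule.
    """
    accepted = []
    for tok in draft:
        expected = next_token(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # keep the verified token and stop
            break
        accepted.append(tok)
    return accepted

# Toy "model": a table row that repeats the pattern | a | b |
pattern = ["|", "a", "|", "b", "|"]
toy_next = lambda seq: pattern[len(seq) % len(pattern)]

out = verify_draft([], ["|", "a", "|", "x", "|"], toy_next)
```

Under this rule the output always matches standard decoding; the gain comes from accepting several tokens per step when the draft is right, which is plausible for highly regular content such as table markup.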

[CV-31] GRTX: Efficient Ray Tracing for 3D Gaussian-Based Rendering HPCA2026

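Quick read: This paper addresses inefficiencies in current ray tracing methods for 3D Gaussians, in particular the performance bottlenecks caused by bloated acceleration structures and redundant node traversals. The solution has two parts. First, a new way of building acceleration structures treats anisotropic Gaussians as unit spheres through ray space transformations, which substantially reduces bounding volume hierarchy (BVH) size and traversal overhead. Second, dedicated hardware support for traversal checkpointing in the ray tracing units avoids restarting from the root node in each round of multi-round tracing, improving ray tracing performance at negligible hardware cost.

Link: https://arxiv.org/abs/2601.20429
Authors: Junseo Lee, Sangyun Jeon, Jungi Lee, Junyong Park, Jaewoong Sim
Affiliation: Unknown
Categories: Graphics (cs.GR); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear at the 32nd International Symposium on High-Performance Computer Architecture (HPCA 2026)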

Abstract:3D Gaussian Splatting has gained widespread adoption across diverse applications due to its exceptional rendering performance and visual quality. While most existing methods rely on rasterization to render Gaussians, recent research has started investigating ray tracing approaches to overcome the fundamental limitations inherent in rasterization. However, current Gaussian ray tracing methods suffer from inefficiencies such as bloated acceleration structures and redundant node traversals, which greatly degrade ray tracing performance. In this work, we present GRTX, a set of software and hardware optimizations that enable efficient ray tracing for 3D Gaussian-based rendering. First, we introduce a novel approach for constructing streamlined acceleration structures for Gaussian primitives. Our key insight is that anisotropic Gaussians can be treated as unit spheres through ray space transformations, which substantially reduces BVH size and traversal overhead. Second, we propose dedicated hardware support for traversal checkpointing within ray tracing units. This eliminates redundant node visits during multi-round tracing by resuming traversal from checkpointed nodes rather than restarting from the root node in each subsequent round. Our evaluation shows that GRTX significantly improves ray tracing performance compared to the baseline ray tracing method with a negligible hardware cost.
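
The unit-sphere insight can be sketched with standard geometry: mapping a ray into a Gaussian's local frame turns its ellipsoid into the unit sphere, so a single ray-sphere test suffices. The sketch below assumes an axis-aligned Gaussian and its 1-sigma ellipsoid to keep the code short; a rotated Gaussian would additionally apply the inverse rotation. The function name is hypothetical, and this is not GRTX's acceleration-structure construction.

```python
import math

def ray_hits_gaussian(origin, direction, mean, scales):
    """Intersect a ray with the 1-sigma ellipsoid of an axis-aligned Gaussian.

    The ray is mapped into the Gaussian's local frame, where the ellipsoid
    becomes the unit sphere. Returns the two ray parameters t (valid in
    world space as well) or None if the ray misses.
    """
    o = [(origin[i] - mean[i]) / scales[i] for i in range(3)]
    d = [direction[i] / scales[i] for i in range(3)]
    a = sum(x * x for x in d)
    b = 2.0 * sum(o[i] * d[i] for i in range(3))
    c = sum(x * x for x in o) - 1.0
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None
    root = math.sqrt(disc)
    return ((-b - root) / (2.0 * a), (-b + root) / (2.0 * a))

gaussian_mean, gaussian_scales = (0.0, 0.0, 0.0), (1.0, 1.0, 2.0)
hit = ray_hits_gaussian((0.0, 0.0, -10.0), (0.0, 0.0, 1.0), gaussian_mean, gaussian_scales)
miss = ray_hits_gaussian((3.0, 0.0, -10.0), (0.0, 0.0, 1.0), gaussian_mean, gaussian_scales)
```

Because the mapping is affine, the parameter t found in the local frame is the same t along the original ray, so no back-transformation of the hit distance is needed.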

[CV-32] Quartet of Diffusions: Structure-Aware Point Cloud Generation through Part and Symmetry Guidance

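Quick read: This paper addresses the lack of explicit modeling of structural shape priors, such as symmetry and part composition, in 3D point cloud generation. Prior methods treat shape generation as a holistic process or support only part composition, which makes structural consistency and controllability hard to guarantee. The proposed Quartet of Diffusions uses four coordinated diffusion models to learn the distributions of global shape latents, symmetries, semantic parts, and their spatial assembly. The framework enforces symmetry and coherent part placement, uses a central global latent to reinforce structural coherence across parts, and supports fine-grained control over shape attributes. According to the authors, it is the first 3D point cloud generation framework to fully integrate and enforce both symmetry and part priors throughout the generative process.

Link: https://arxiv.org/abs/2601.20425
Authors: Chenliang Zhou, Fangcheng Zhong, Weihao Xia, Albert Miao, Canberk Baykal, Cengiz Oztireli
Affiliation: University of Cambridge
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: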

Abstract:We introduce the Quartet of Diffusions, a structure-aware point cloud generation framework that explicitly models part composition and symmetry. Unlike prior methods that treat shape generation as a holistic process or only support part composition, our approach leverages four coordinated diffusion models to learn distributions of global shape latents, symmetries, semantic parts, and their spatial assembly. This structured pipeline ensures guaranteed symmetry, coherent part placement, and diverse, high-quality outputs. By disentangling the generative process into interpretable components, our method supports fine-grained control over shape attributes, enabling targeted manipulation of individual parts while preserving global consistency. A central global latent further reinforces structural coherence across assembled parts. Our experiments show that the Quartet achieves state-of-the-art performance. To the best of our knowledge, this is the first 3D point cloud generation framework that fully integrates and enforces both symmetry and part priors throughout the generative process.
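
The abstract does not say how symmetry is guaranteed. One simple way to obtain symmetry by construction, assumed here only for illustration, is to generate part of a shape and mirror it across a reflection plane. The helper names and the plane parameterization (unit normal plus offset) are hypothetical.

```python
def reflect(point, normal, offset=0.0):
    """Mirror a 3D point across the plane {x : normal . x = offset} (unit normal)."""
    dist = sum(p * n for p, n in zip(point, normal)) - offset
    return tuple(p - 2.0 * dist * n for p, n in zip(point, normal))

def symmetrize(points, normal, offset=0.0):
    """Return the points together with their mirror images."""
    return list(points) + [reflect(p, normal, offset) for p in points]

half_shape = [(1.0, 2.0, 3.0), (0.5, 0.0, -1.0)]
full_shape = symmetrize(half_shape, normal=(1.0, 0.0, 0.0))
```

The resulting point set is invariant under the reflection regardless of how the half shape was produced, which is the sense in which such a construction makes symmetry exact rather than merely encouraged.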

[CV-33] Let's Roll a BiFTA: Bi-refinement for Fine-grained Text-visual Alignment in Vision-Language Models

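Quick read: This paper addresses the weak text-visual alignment that arises in zero-shot tasks with pre-trained vision-language models (such as CLIP) when fine-grained text descriptions and localized image patches both contain redundant information. The proposed Bi-refinement for Fine-grained Text-visual Alignment (BiFTA) improves alignment from two directions. View refinement removes redundant image patches with high Intersection over Union (IoU), giving more distinctive visual samples. Description refinement removes redundant text descriptions with high pairwise cosine similarity, so that the remaining descriptions are more diverse. The method achieves superior zero-shot performance on 6 benchmark datasets for both ViT-based and ResNet-based CLIP, supporting the case for removing redundant information in text-visual alignment.

Link: https://arxiv.org/abs/2601.20419
Authors: Yuhao Sun, Chengyi Cai, Jiacheng Zhang, Zesheng Ye, Xingliang Yuan, Feng Liu
Affiliation: The University of Melbourne
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 25 pages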

Abstract:Recent research has shown that aligning fine-grained text descriptions with localized image patches can significantly improve the zero-shot performance of pre-trained vision-language models (e.g., CLIP). However, we find that both fine-grained text descriptions and localized image patches often contain redundant information, making text-visual alignment less effective. In this paper, we tackle this issue from two perspectives, View Refinement and Description Refinement, termed Bi-refinement for Fine-grained Text-visual Alignment (BiFTA). View refinement removes redundant image patches with high Intersection over Union (IoU) ratios, resulting in more distinctive visual samples. Description refinement removes redundant text descriptions with high pairwise cosine similarity, ensuring greater diversity in the remaining descriptions. BiFTA achieves superior zero-shot performance on 6 benchmark datasets for both ViT-based and ResNet-based CLIP, justifying the necessity to remove redundant information in visual-text alignment.
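
Both refinements can be read as the same filtering pattern with a different similarity measure. The sketch below uses a greedy keep-if-dissimilar rule; the greedy order, the thresholds (0.5 for IoU, 0.9 for cosine similarity), and the function names are assumptions for illustration, not details taken from the paper.

```python
import math

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def drop_redundant(items, similarity, threshold):
    """Greedily keep indices of items whose similarity to every kept item is below threshold."""
    kept = []
    for i, item in enumerate(items):
        if all(similarity(item, items[j]) < threshold for j in kept):
            kept.append(i)
    return kept

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]   # image patches
text_embs = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0)]            # description embeddings
kept_views = drop_redundant(boxes, iou, threshold=0.5)
kept_descs = drop_redundant(text_embs, cosine, threshold=0.9)
```

In this toy input the second patch overlaps the first heavily and the second description is nearly parallel to the first, so both are dropped.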

[CV-34] HINT: Hierarchical Interaction Modeling for Autoregressive Multi-Human Motion Generation

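Quick read: This paper addresses the modeling of complex interactions in text-driven multi-human motion generation, and in particular the limits of existing offline methods when handling long text or a variable number of people. The core challenge is to adapt flexibly to different numbers of participants and long sequences while keeping motion details consistent with the text. The proposed HINT framework is a diffusion-based autoregressive architecture with hierarchical interaction modeling. First, a disentangled motion representation separates local motion semantics from inter-person interactions, allowing direct adaptation to any number of people. Second, a sliding-window strategy enables efficient online generation by aggregating local within-window and global cross-window conditions, capturing past trajectories and inter-person dependencies and aligning with the text guidance. This gives fine-grained interaction modeling together with coherence over long sequences.

Link: https://arxiv.org/abs/2601.20383
Authors: Mengge Liu, Yan Di, Gu Wang, Yun Qu, Dekai Zhu, Yanyan Li, Xiangyang Ji
Affiliation: Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: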

Abstract:Text-driven multi-human motion generation with complex interactions remains a challenging problem. Despite progress in performance, existing offline methods, which generate fixed-length motions with a fixed number of agents, are inherently limited in handling long or variable text and varying agent counts. These limitations naturally encourage autoregressive formulations, which predict future motions step by step conditioned on all past trajectories and current text guidance. In this work, we introduce HINT, the first autoregressive framework for multi-human motion generation with Hierarchical INTeraction modeling in diffusion. First, HINT leverages a disentangled motion representation within a canonicalized latent space, decoupling local motion semantics from inter-person interactions. This design facilitates direct adaptation to varying numbers of human participants without requiring additional refinement. Second, HINT adopts a sliding-window strategy for efficient online generation, and aggregates local within-window and global cross-window conditions to capture past human history and inter-person dependencies and to align with text guidance. This strategy not only enables fine-grained interaction modeling within each window but also preserves long-horizon coherence across the entire sequence. Extensive experiments on public benchmarks demonstrate that HINT matches the performance of strong offline models and surpasses autoregressive baselines. Notably, on InterHuman, HINT achieves an FID of 3.100, significantly improving over the previous state-of-the-art score of 5.154.

[CV-35] RepSFNet: A Single Fusion Network with Structural Reparameterization for Crowd Counting

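Quick read: This paper addresses crowd counting in scenes of varying density, where scale variation, occlusion, and the high computational cost of existing models are the main difficulties. It proposes RepSFNet (Reparameterized Single Fusion Network), a lightweight network. A RepLK-ViT backbone with large reparameterized kernels performs efficient multi-scale feature extraction, and a feature fusion module combining ASPP and CAN strengthens density-adaptive context modeling. A concatenate fusion module preserves spatial resolution and produces high-quality density maps. Avoiding attention mechanisms and multi-branch designs significantly reduces parameters and computational complexity, which improves inference speed while keeping accuracy competitive and suits real-time, low-power edge computing.

Link: https://arxiv.org/abs/2601.20369
Authors: Mas Nurul Achmadiah, Chi-Chia Sun, Wen-Kai Kuo, Jun-Wei Hsieh
Affiliation: National Formosa University; National Taipei University; National Yang Ming Chiao Tung University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages. Published in Proceedings of the IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS) 2025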

Abstract:Crowd counting remains challenging in variable-density scenes due to scale variations, occlusions, and the high computational cost of existing models. To address these issues, we propose RepSFNet (Reparameterized Single Fusion Network), a lightweight architecture designed for accurate and real-time crowd estimation. RepSFNet leverages a RepLK-ViT backbone with large reparameterized kernels for efficient multi-scale feature extraction. It further integrates a Feature Fusion module combining Atrous Spatial Pyramid Pooling (ASPP) and Context-Aware Network (CAN) to achieve robust, density-adaptive context modeling. A Concatenate Fusion module is employed to preserve spatial resolution and generate high-quality density maps. By avoiding attention mechanisms and multi-branch designs, RepSFNet significantly reduces parameters and computational complexity. The training objective combines Mean Squared Error and Optimal Transport loss to improve both count accuracy and spatial distribution alignment. Experiments conducted on ShanghaiTech, NWPU, and UCF-QNRF datasets demonstrate that RepSFNet achieves competitive accuracy while reducing inference latency by up to 34 percent compared to recent state-of-the-art methods, making it suitable for real-time and low-power edge computing applications.
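
Structural reparameterization in general relies on the linearity of convolution: parallel branches used during training can be folded into a single kernel for inference. The 1D sketch below merges a small kernel into a large one by zero-padding and checks that both forms give the same output. It illustrates the general idea only; the kernel sizes and values are made up, and RepSFNet's actual branch layout is not described in the abstract.

```python
def conv1d_same(signal, kernel):
    """Cross-correlation with zero padding so the output length equals the input length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(kernel[k] * padded[i + k] for k in range(len(kernel)))
        for i in range(len(signal))
    ]

def merge_kernels(large, small):
    """Zero-pad the small (odd-sized) kernel to the large kernel's size and add them."""
    pad = (len(large) - len(small)) // 2
    padded_small = [0.0] * pad + list(small) + [0.0] * pad
    return [a + b for a, b in zip(large, padded_small)]

signal = [1.0, 2.0, 0.0, -1.0, 3.0, 4.0]
large = [0.1, 0.2, 0.4, 0.2, 0.1]
small = [1.0, -2.0, 1.0]

# Training-time form: two parallel branches whose outputs are summed.
two_branch = [a + b for a, b in zip(conv1d_same(signal, large), conv1d_same(signal, small))]
# Inference-time form: one convolution with the merged kernel.
one_branch = conv1d_same(signal, merge_kernels(large, small))
```

The single-branch form performs one convolution instead of two, which is where the inference-time saving comes from.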

[CV-36] Dual-Modality IoT Framework for Integrated Access Control and Environmental Safety Monitoring with Real-Time Cloud Analytics

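Quick read: This paper addresses the operational inefficiency, delayed emergency response, and management complexity that result from running physical security systems and environmental safety monitoring as independent silos. It proposes a unified dual-modality Internet of Things architecture in which ESP32 microcontrollers provide edge processing and wireless connectivity, and RFID-based access control (Subsystem 1) and multi-sensor environmental safety monitoring (Subsystem 2) are integrated on a single cloud platform for coordinated, real-time handling of security and environmental data. Key features are a modular design, local caching that keeps the system operating during network disruptions, and component optimization that brings the total cost to 5,400 BDT, an 82% reduction compared with commercial integrated solutions, while reporting 99.2% RFID authentication accuracy and 98.5% flame detection reliability.

Link: https://arxiv.org/abs/2601.20366
Authors: Abdul Hasib, A. S. M. Ahsanul Sarkar Akib, Nihal Das Ankur, Anish Giri
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: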

Abstract:The integration of physical security systems with environmental safety monitoring represents a critical advancement in smart infrastructure management. Traditional approaches maintain these systems as independent silos, creating operational inefficiencies, delayed emergency responses, and increased management complexity. This paper presents a comprehensive dual-modality Internet of Things framework that seamlessly integrates RFID-based access control with multi-sensor environmental safety monitoring through a unified cloud architecture. The system comprises two coordinated subsystems: Subsystem 1 implements RFID authentication with servo-actuated gate control and real-time Google Sheets logging, while Subsystem 2 provides comprehensive safety monitoring incorporating flame detection, water flow measurement, LCD status display, and personnel identification. Both subsystems utilize ESP32 microcontrollers for edge processing and wireless connectivity. Experimental evaluation over 45 days demonstrates exceptional performance metrics: 99.2% RFID authentication accuracy with 0.82-second average response time, 98.5% flame detection reliability within 5-meter range, and 99.8% cloud data logging success rate. The system maintains operational integrity during network disruptions through intelligent local caching mechanisms and achieves total implementation cost of 5,400 BDT (approximately $48), representing an 82% reduction compared to commercial integrated solutions. This research establishes a practical framework for synergistic security-safety integration, demonstrating that professional-grade performance can be achieved through careful architectural design and component optimization while maintaining exceptional cost-effectiveness and accessibility for diverse application scenarios.

[CV-37] RAW-Flow: Advancing RGB-to-RAW Image Reconstruction with Deterministic Latent Flow Matching AAAI2026

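Quick read: This paper addresses RGB-to-RAW reconstruction, that is, recovering high-fidelity RAW data from quantized RGB images. Conventional learning-based methods treat the task as direct regression and suffer from detail inconsistency and color deviation, owing to the ill-posed nature of inverse ISP and the information loss in RGB images. The key to the solution is RAW-Flow, a deterministic latent transport framework built from a generative perspective: flow matching learns a deterministic vector field in latent space that bridges the RGB and RAW representations. A cross-scale context guidance module and a dual-domain latent autoencoder with a feature alignment constraint support accurate reconstruction of structural detail and color.

Link: https://arxiv.org/abs/2601.20364
Authors: Zhen Liu, Diedong Feng, Hai Jiang, Liaoyuan Zeng, Hao Wang, Chaoyu Feng, Lei Lei, Bing Zeng, Shuaicheng Liu
Affiliation: 1. University of Science and Technology of China; 2. Tsinghua University; 3. Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI2026 Oral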

Abstract:RGB-to-RAW reconstruction, or the reverse modeling of a camera Image Signal Processing (ISP) pipeline, aims to recover high-fidelity RAW data from RGB images. Despite notable progress, existing learning-based methods typically treat this task as a direct regression objective and struggle with detail inconsistency and color deviation, due to the ill-posed nature of inverse ISP and the inherent information loss in quantized RGB images. To address these limitations, we pioneer a generative perspective by reformulating RGB-to-RAW reconstruction as a deterministic latent transport problem and introduce a novel framework named RAW-Flow, which leverages flow matching to learn a deterministic vector field in latent space, to effectively bridge the gap between RGB and RAW representations and enable accurate reconstruction of structural details and color information. To further enhance latent transport, we introduce a cross-scale context guidance module that injects hierarchical RGB features into the flow estimation process. Moreover, we design a dual-domain latent autoencoder with a feature alignment constraint to support the proposed latent transport framework, which jointly encodes RGB and RAW inputs while promoting stable training and high-fidelity reconstruction. Extensive experiments demonstrate that RAW-Flow outperforms state-of-the-art approaches both quantitatively and visually.
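
A common flow matching formulation, assumed here because the abstract does not give the exact path, moves along the straight line between a source latent and a target latent, so the regression target for the vector field is their difference. The sketch below integrates such a field deterministically with Euler steps; an oracle field stands in for a trained network, and all names and values are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight path between source latent x0 and target latent x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the vector field along the straight path."""
    return [b - a for a, b in zip(x0, x1)]

def euler_transport(x0, velocity_fn, steps=4):
    """Deterministically integrate dx/dt = velocity_fn(x, t) from t=0 to t=1."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rgb_latent = [0.2, -0.5, 1.0]   # toy latent of an RGB image
raw_latent = [1.0, 0.5, -1.0]   # toy latent of the matching RAW image
# Oracle field standing in for a trained network.
oracle = lambda x, t: target_velocity(rgb_latent, raw_latent)
recovered = euler_transport(rgb_latent, oracle)
midpoint = interpolate(rgb_latent, raw_latent, 0.5)
```

With the oracle field the transport lands exactly on the target latent; a trained network would only approximate this, and no noise is sampled at any point, which is the sense in which the transport is deterministic.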

[CV-38] CURVE: Learning Causality-Inspired Invariant Representations for Robust Scene Understanding via Uncertainty-Guided Regularization

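Quick read: This paper addresses the poor out-of-distribution generalization of scene graphs, which stems from overfitting to spurious correlations. The proposed CURVE framework integrates variational uncertainty modeling with uncertainty-guided structural regularization to suppress high-variance, environment-specific relations. Its key technique is prototype-conditioned debiasing, which disentangles invariant interaction dynamics from environment-dependent variations and promotes a sparse, domain-stable topology. The framework is evaluated in zero-shot transfer and low-data sim-to-real adaptation, where it aims to improve robustness and the reliability of uncertainty estimates.

Link: https://arxiv.org/abs/2601.20355
Authors: Yue Liang, Jiatong Du, Ziyi Yang, Yanjun Huang, Hong Chen
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: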

Abstract:Scene graphs provide structured abstractions for scene understanding, yet they often overfit to spurious correlations, severely hindering out-of-distribution generalization. To address this limitation, we propose CURVE, a causality-inspired framework that integrates variational uncertainty modeling with uncertainty-guided structural regularization to suppress high-variance, environment-specific relations. Specifically, we apply prototype-conditioned debiasing to disentangle invariant interaction dynamics from environment-dependent variations, promoting a sparse and domain-stable topology. Empirically, we evaluate CURVE in zero-shot transfer and low-data sim-to-real adaptation, verifying its ability to learn domain-stable sparse topologies and provide reliable uncertainty estimates to support risk prediction under distribution shifts.

[CV-39] Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models ICLR2026

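Quick read: This paper addresses the weakness of current text-to-image (T2I) models on complex spatial relationships (such as spatial perception, reasoning, and interaction), a problem existing benchmarks largely overlook because their prompts are short or information-sparse. It introduces SpatialGenEval, a benchmark of 1,230 long, information-dense prompts across 25 real-world scenes. Each prompt integrates 10 spatial sub-domains with 10 corresponding multiple-choice question-answer pairs, enabling systematic evaluation of spatial intelligence. The authors also build the SpatialT2I dataset of 15,400 text-image pairs, with prompts rewritten to ensure image consistency while preserving information density. Fine-tuning Stable Diffusion-XL, Uniworld-V1, and OmniGen2 on it yields gains of +4.2%, +5.7%, and +4.4% respectively, pointing to a data-centric route to better spatial reasoning in T2I models.

Link: https://arxiv.org/abs/2601.20354
Authors: Zengbin Wang, Xuecai Hu, Yong Wang, Feng Xiong, Man Zhang, Xiangxiang Chu
Affiliation: AMAP, Alibaba Group; Beijing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICLR 2026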

Abstract:Text-to-image (T2I) models have achieved remarkable success in generating high-fidelity images, but they often fail in handling complex spatial relationships, e.g., spatial perception, reasoning, or interaction. These critical aspects are largely overlooked by current benchmarks due to their short or information-sparse prompt design. In this paper, we introduce SpatialGenEval, a new benchmark designed to systematically evaluate the spatial intelligence of T2I models, covering two key aspects: (1) SpatialGenEval involves 1,230 long, information-dense prompts across 25 real-world scenes. Each prompt integrates 10 spatial sub-domains and corresponding 10 multi-choice question-answer pairs, ranging from object position and layout to occlusion and causality. Our extensive evaluation of 21 state-of-the-art models reveals that higher-order spatial reasoning remains a primary bottleneck. (2) To demonstrate that the utility of our information-dense design goes beyond simple evaluation, we also construct the SpatialT2I dataset. It contains 15,400 text-image pairs with rewritten prompts to ensure image consistency while preserving information density. Fine-tuned results on current foundation models (i.e., Stable Diffusion-XL, Uniworld-V1, OmniGen2) yield consistent performance gains (+4.2%, +5.7%, +4.4%) and more realistic effects in spatial relations, highlighting a data-centric paradigm to achieve spatial intelligence in T2I models.

[CV-40] PalmBridge: A Plug-and-Play Feature Alignment Framework for Open-Set Palmprint Verification

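Quick read: This paper addresses the performance degradation of palmprint recognition systems in real deployments, caused by feature distribution shifts. Most deep models assume a closed, stationary distribution and therefore overfit to dataset-specific textures instead of learning domain-invariant representations. The proposed PalmBridge is a plug-and-play feature-space alignment framework based on vector quantization. Rather than relying on data augmentation to approximate the target distribution, it learns a compact set of representative vectors directly from training features. During enrollment and verification, each feature vector is mapped to its nearest representative vector under a minimum-distance criterion and then blended with it, which suppresses nuisance variation induced by domain shift while retaining discriminative identity cues. The representative vectors are optimized jointly with the backbone using task supervision, a feature-consistency objective, and an orthogonality regularization term, giving a stable, well-structured shared embedding space that improves cross-dataset generalization at low computational overhead.

Link: https://arxiv.org/abs/2601.20351
Authors: Chenke Zhang, Ziyuan Yang, Licheng Yan, Shuyi Li, Andrew Beng Jin Teoh, Bob Zhang, Yi Zhang
Affiliation: Sichuan University; University of Macau; Beijing University of Technology; Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: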

Abstract:Palmprint recognition is widely used in biometric systems, yet real-world performance often degrades due to feature distribution shifts caused by heterogeneous deployment conditions. Most deep palmprint models assume a closed and stationary distribution, leading to overfitting to dataset-specific textures rather than learning domain-invariant representations. Although data augmentation is commonly used to mitigate this issue, it assumes augmented samples can approximate the target deployment distribution, an assumption that often fails under significant domain mismatch. To address this limitation, we propose PalmBridge, a plug-and-play feature-space alignment framework for open-set palmprint verification based on vector quantization. Rather than relying solely on data-level augmentation, PalmBridge learns a compact set of representative vectors directly from training features. During enrollment and verification, each feature vector is mapped to its nearest representative vector under a minimum-distance criterion, and the mapped vector is then blended with the original vector. This design suppresses nuisance variation induced by domain shifts while retaining discriminative identity cues. The representative vectors are jointly optimized with the backbone network using task supervision, a feature-consistency objective, and an orthogonality regularization term to form a stable and well-structured shared embedding space. Furthermore, we analyze feature-to-representative mappings via assignment consistency and collision rate to assess the model's sensitivity to blending weights. Experiments on multiple palmprint datasets and backbone architectures show that PalmBridge consistently reduces EER in intra-dataset open-set evaluation and improves cross-dataset generalization with negligible to modest runtime overhead.
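
The map-then-blend step can be sketched in a few lines. The abstract specifies a minimum-distance criterion and a blending weight but not the metric or the exact formula, so the squared Euclidean distance and the convex combination below are assumptions, as are the function names and the toy codebook.

```python
def nearest_representative(feature, codebook):
    """Return the codebook vector with the smallest squared Euclidean distance."""
    return min(codebook, key=lambda c: sum((f - x) ** 2 for f, x in zip(feature, c)))

def bridge(feature, codebook, weight=0.5):
    """Blend a feature with its nearest representative vector.

    weight=0 returns the original feature, weight=1 the representative.
    """
    rep = nearest_representative(feature, codebook)
    return [(1.0 - weight) * f + weight * r for f, r in zip(feature, rep)]

codebook = [[1.0, 0.0], [0.0, 1.0]]   # toy set of representative vectors
aligned = bridge([0.8, 0.4], codebook, weight=0.5)
```

The blending weight trades off the two goals named in the abstract: a larger weight pulls features toward the shared codebook (less domain-specific variation), while a smaller weight keeps more of the original identity cues.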

[CV-41] MMSF: Multitask and Multimodal Supervised Framework for WSI Classification and Survival Analysis

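Quick read: This paper addresses multimodal fusion in computational pathology: how to integrate the tumor morphology captured by gigapixel whole slide images (WSI) with the prognostic context carried by patient-level clinical descriptors. Because the two modalities differ markedly in feature statistics and scale, cross-modal alignment and joint learning are difficult. The proposed MMSF is a multitask, multimodal supervised framework built on a linear-complexity multiple instance learning (MIL) backbone. It comprises a graph feature extraction module that embeds tissue topology, a clinical data embedding module that standardizes patient attributes, a feature fusion module that explicitly decomposes and fuses modality-shared and modality-specific representations, and a Mamba-based MIL encoder with multitask prediction heads, improving performance on both classification and survival analysis.

Link: https://arxiv.org/abs/2601.20347
Authors: Chengying She, Chengwei Chen, Xinran Zhang, Ben Wang, Lizhuang Liu, Chengwei Shao, Yun Bian
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to “Biomedical Signal Processing and Control”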

Abstract:Multimodal evidence is critical in computational pathology: gigapixel whole slide images capture tumor morphology, while patient-level clinical descriptors preserve complementary context for prognosis. Integrating such heterogeneous signals remains challenging because feature spaces exhibit distinct statistics and scales. We introduce MMSF, a multitask and multimodal supervised framework built on a linear-complexity MIL backbone that explicitly decomposes and fuses cross-modal information. MMSF comprises a graph feature extraction module embedding tissue topology at the patch level, a clinical data embedding module standardizing patient attributes, a feature fusion module aligning modality-shared and modality-specific representations, and a Mamba-based MIL encoder with multitask prediction heads. Experiments on CAMELYON16 and TCGA-NSCLC demonstrate 2.1–6.6% accuracy and 2.2–6.9% AUC improvements over competitive baselines, while evaluations on five TCGA survival cohorts yield 7.1–9.8% C-index improvements compared with unimodal methods and 5.6–7.1% over multimodal alternatives.

[CV-42] Test-Time Adaptation for Anomaly Segmentation via Topology-Aware Optimal Transport Chaining

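Quick read: This paper addresses the brittleness of masks produced by threshold-based binarization in anomaly segmentation (AS) under distribution shift; the core challenge is to identify anomalous regions stably across data distributions. The proposed TopoOT is a topology-aware optimal transport (OT) framework that combines multi-filtration persistence diagrams (PDs) with test-time adaptation (TTA). Its central innovation, Optimal Transport Chaining, sequentially aligns PDs across thresholds and filtrations to produce geodesic stability scores that identify features consistently preserved across scales. These scores yield stability-aware pseudo-labels, which supervise a lightweight head trained online with OT-consistency and contrastive objectives for robust adaptation under domain shift, improving both 2D and 3D anomaly segmentation.

Link: https://arxiv.org/abs/2601.20333
Authors: Ali Zia, Usman Ali, Umer Ramzan, Abdul Rehman, Abdelwahed Khamis, Wei Xiang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: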

Abstract:Deep topological data analysis (TDA) offers a principled framework for capturing structural invariants such as connectivity and cycles that persist across scales, making it a natural fit for anomaly segmentation (AS). Unlike threshold-based binarisation, which produces brittle masks under distribution shift, TDA allows anomalies to be characterised as disruptions to global structure rather than local fluctuations. We introduce TopoOT, a topology-aware optimal transport (OT) framework that integrates multi-filtration persistence diagrams (PDs) with test-time adaptation (TTA). Our key innovation is Optimal Transport Chaining, which sequentially aligns PDs across thresholds and filtrations, yielding geodesic stability scores that identify features consistently preserved across scales. These stability-aware pseudo-labels supervise a lightweight head trained online with OT-consistency and contrastive objectives, ensuring robust adaptation under domain shift. Across standard 2D and 3D anomaly detection benchmarks, TopoOT achieves state-of-the-art performance, outperforming the most competitive methods by up to +24.1% mean F1 on 2D datasets and +10.2% on 3D AS benchmarks.

[CV-43] GVGS: Gaussian Visibility-Aware Multi-View Geometry for Accurate Surface Reconstruction

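[Quick read]: This paper addresses the insufficient geometric accuracy of surface reconstruction with 3D Gaussian Splatting, which stems from inaccurate Gaussian depth supervision. Existing methods refine Gaussian depth estimates through multi-view geometric consistency or monocular depth priors, but the former becomes unreliable under large geometric discrepancies and the latter suffers from scale ambiguity and local inconsistency, so the supervision signal is distorted. The solution rests on two mechanisms. First, a Gaussian visibility-aware multi-view geometric consistency constraint aggregates the visibility of shared Gaussian primitives across views to give more stable and accurate geometric supervision. Second, a progressive quadtree-calibrated monocular depth constraint applies block-wise affine calibration from coarse to fine scales, mitigating the scale ambiguity of monocular depth priors while preserving surface detail.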

Link: https://arxiv.org/abs/2601.20331
Authors: Mai Su, Qihan Yu, Zhongtao Wang, Yilong Li, Chengwei Pan, Yisong Chen, Guoping Wang
Affiliations: Peking University; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D Gaussian Splatting enables efficient optimization and high-quality rendering, yet accurate surface reconstruction remains challenging. Prior methods improve surface reconstruction by refining Gaussian depth estimates, either via multi-view geometric consistency or through monocular depth priors. However, multi-view constraints become unreliable under large geometric discrepancies, while monocular priors suffer from scale ambiguity and local inconsistency, ultimately leading to inaccurate Gaussian depth supervision. To address these limitations, we introduce a Gaussian visibility-aware multi-view geometric consistency constraint that aggregates the visibility of shared Gaussian primitives across views, enabling more accurate and stable geometric supervision. In addition, we propose a progressive quadtree-calibrated monocular depth constraint that performs block-wise affine calibration from coarse to fine spatial scales, mitigating the scale ambiguity of depth priors while preserving fine-grained surface details. Extensive experiments on DTU and TNT datasets demonstrate consistent improvements in geometric accuracy over prior Gaussian-based and implicit surface reconstruction methods. Code is available at an anonymous repository: this https URL.
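
The block-wise affine calibration described in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which reference depth is used or how blocks are split, so the reference map `ref`, the fixed quadrant split, and the function names `fit_affine` and `calibrate` are all assumptions made for illustration. The sketch fits a least-squares scale and shift per block and only fits at the finest level reached; a truly progressive scheme would presumably also make use of the coarser fits.

```python
def fit_affine(mono, ref):
    """Least-squares scale s and shift t such that s * mono + t approximates ref."""
    n = len(mono)
    mean_m = sum(mono) / n
    mean_r = sum(ref) / n
    var_m = sum((m - mean_m) ** 2 for m in mono)
    if var_m == 0.0:
        return 1.0, mean_r - mean_m  # flat block: fall back to a pure shift
    cov = sum((m - mean_m) * (r - mean_r) for m, r in zip(mono, ref))
    s = cov / var_m
    return s, mean_r - s * mean_m


def calibrate(mono, ref, out, r0, r1, c0, c1, levels):
    """Calibrate block [r0:r1, c0:c1]; split into quadrants while levels remain."""
    if levels > 0 and r1 - r0 >= 2 and c1 - c0 >= 2:
        rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
        for a, b in ((r0, rm), (rm, r1)):
            for c, d in ((c0, cm), (cm, c1)):
                calibrate(mono, ref, out, a, b, c, d, levels - 1)
        return
    cells = [(r, c) for r in range(r0, r1) for c in range(c0, c1)]
    s, t = fit_affine([mono[r][c] for r, c in cells],
                      [ref[r][c] for r, c in cells])
    for r, c in cells:
        out[r][c] = s * mono[r][c] + t


# Hypothetical 4x4 depth maps: the left and right halves follow different
# affine relations, so one global scale and shift cannot explain both.
mono = [[float(4 * r + c + 1) for c in range(4)] for r in range(4)]
ref = [[2.0 * mono[r][c] + 1.0 if c < 2 else 0.5 * mono[r][c] + 3.0
        for c in range(4)] for r in range(4)]


def max_error(levels):
    out = [[0.0] * 4 for _ in range(4)]
    calibrate(mono, ref, out, 0, 4, 0, 4, levels)
    return max(abs(out[r][c] - ref[r][c]) for r in range(4) for c in range(4))


global_err = max_error(0)  # single affine fit over the whole image
block_err = max_error(1)   # one fit per quadrant
```

In this toy setup the per-quadrant fit recovers the reference almost exactly while the single global fit does not, which is the motivation for calibrating block by block rather than once per image.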

[CV-44] UnlearnShield: Shielding Forgotten Privacy against Unlearning Inversion ICASSP2026

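[Quick read]: This paper addresses a privacy vulnerability in machine unlearning: adversaries can use unlearning inversion attacks to reconstruct data that was supposed to be erased, undermining the model's privacy protection. The key to the solution is UnlearnShield, which introduces directional perturbations in the cosine representation space and regulates them through a constraint module to jointly preserve model accuracy and forgetting efficacy, reducing inversion risk while maintaining model performance.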

Link: https://arxiv.org/abs/2601.20325
Authors: Lulu Xue, Shengshan Hu, Wei Lu, Ziqi Zhou, Yufei Song, Jianhong Cheng, Minghui Li, Yanjun Zhang, Leo Yu Zhang
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been accepted by ICASSP 2026

Abstract:Machine unlearning is an emerging technique that aims to remove the influence of specific data from trained models, thereby enhancing privacy protection. However, recent research has uncovered critical privacy vulnerabilities, showing that adversaries can exploit unlearning inversion to reconstruct data that was intended to be erased. Despite the severity of this threat, dedicated defenses remain lacking. To address this gap, we propose UnlearnShield, the first defense specifically tailored to counter unlearning inversion. UnlearnShield introduces directional perturbations in the cosine representation space and regulates them through a constraint module to jointly preserve model accuracy and forgetting efficacy, thereby reducing inversion risk while maintaining utility. Experiments demonstrate that it achieves a good trade-off among privacy protection, accuracy, and forgetting.

[CV-45] CPiRi: Channel Permutation-Invariant Relational Interaction for Multivariate Time Series Forecasting ICLR2026

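[Quick read]: This paper addresses two limitations of existing models for multivariate time series forecasting: channel-dependent models tend to overfit the channel ordering and adapt poorly when channels are added or reordered, while channel-independent models are flexible but ignore inter-channel dependencies, which limits forecasting performance. The key to the solution is CPiRi, a channel permutation-invariant (CPI) framework with two core innovations: (1) a spatio-temporal decoupling architecture in which a frozen pretrained temporal encoder extracts high-quality temporal features and a lightweight spatial module learns content-driven inter-channel relations; and (2) a channel shuffling strategy combined with permutation-invariant regularization during training, so that the model infers cross-channel structure from data rather than memorizing a fixed ordering. This makes the model robust to changes in channel order and to distribution drift, supports strong inductive generalization to unseen channels even when trained on only half of the channels, and retains practical efficiency on large-scale datasets.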

Link: https://arxiv.org/abs/2601.20318
Authors: Jiyuan Xu, Wenyu Zhang, Xin Jing, Shuai Chen, Shuai Zhang, Jiahao Nie
Affiliations: Zhejiang University of Finance and Economics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, ICLR 2026

Abstract:Current methods for multivariate time series forecasting can be classified into channel-dependent and channel-independent models. Channel-dependent models learn cross-channel features but often overfit the channel ordering, which hampers adaptation when channels are added or reordered. Channel-independent models treat each channel in isolation to increase flexibility, yet this neglects inter-channel dependencies and limits performance. To address these limitations, we propose CPiRi, a channel permutation invariant (CPI) framework that infers cross-channel structure from data rather than memorizing a fixed ordering, enabling deployment in settings with structural and distributional co-drift without retraining. CPiRi couples a spatio-temporal decoupling architecture with a permutation-invariant regularization training strategy: a frozen pretrained temporal encoder extracts high-quality temporal features, a lightweight spatial module learns content-driven inter-channel relations, while a channel shuffling strategy enforces CPI during training. We further ground CPiRi in theory by analyzing permutation equivariance in multivariate time series forecasting. Experiments on multiple benchmarks show state-of-the-art results. CPiRi remains stable when channel orders are shuffled and exhibits strong inductive generalization to unseen channels even when trained on only half of the channels, while maintaining practical efficiency on large-scale datasets. The source code is released at this https URL.
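
The permutation equivariance mentioned in this abstract can be checked numerically on a toy example. The sketch below is not CPiRi: `toy_forecast` and `order_dependent_forecast` are hypothetical stand-ins chosen only to show what the property means. A forecaster is permutation-equivariant over channels when shuffling the input channels shuffles the outputs in the same way and changes nothing else.

```python
def toy_forecast(series):
    """One forecast per channel, mixing each channel with an order-free summary."""
    last = [channel[-1] for channel in series]
    pooled = sum(last) / len(last)  # mean over channels ignores their order
    return [0.5 * x + 0.5 * pooled for x in last]


def order_dependent_forecast(series):
    """Counterexample: the weight depends on the channel's position."""
    return [channel[-1] * (i + 1) for i, channel in enumerate(series)]


def permute(items, perm):
    return [items[i] for i in perm]


def is_equivariant(model, series, perm, tol=1e-9):
    """Check model(permute(x)) == permute(model(x)) up to rounding."""
    shuffled_then_run = model(permute(series, perm))
    run_then_shuffled = permute(model(series), perm)
    return all(abs(a - b) < tol
               for a, b in zip(shuffled_then_run, run_then_shuffled))


# Five hypothetical channels with eight time steps each.
series = [[0.1 * (c + 1) * (t + 1) for t in range(8)] for c in range(5)]
perm = [2, 0, 4, 1, 3]

pooled_ok = is_equivariant(toy_forecast, series, perm)
positional_ok = is_equivariant(order_dependent_forecast, series, perm)
```

The pooled model passes the check and the position-weighted model fails it, which mirrors the abstract's contrast between inferring cross-channel structure from content and memorizing a fixed channel ordering.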

[CV-46] SemBind: Binding Diffusion Watermarks to Semantics Against Black-Box Forgery Attacks

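[Quick read]: This paper addresses the security of latent-based watermarks against black-box forgery attacks, in which an attacker with a single watermarked image and black-box access to the generative model can embed the provider's watermark into images the provider did not generate, undermining provenance and trust. The key to the solution is the SemBind framework, which uses a semantic masker trained with contrastive learning to bind latent signals to image semantics: codes for the same prompt stay nearly invariant, while codes for different prompts are nearly orthogonal. These codes are reshaped and permuted to modulate the target latent before any standard latent-based watermark is applied. This defends against black-box forgery while leaving image quality essentially unchanged, and offers a tunable trade-off between robustness and security.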

Link: https://arxiv.org/abs/2601.20310
Authors: Xin Zhang, Zijin Yang, Kejiang Chen, Linfeng Ma, Weiming Zhang, Nenghai Yu
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Latent-based watermarks, integrated into the generation process of latent diffusion models (LDMs), simplify detection and attribution of generated images. However, recent black-box forgery attacks, where an attacker needs at least one watermarked image and black-box access to the provider’s model, can embed the provider’s watermark into images not produced by the provider, posing outsized risk to provenance and trust. We propose SemBind, the first defense framework for latent-based watermarks that resists black-box forgery by binding latent signals to image semantics via a learned semantic masker. Trained with contrastive learning, the masker yields near-invariant codes for the same prompt and near-orthogonal codes across prompts; these codes are reshaped and permuted to modulate the target latent before any standard latent-based watermark. SemBind is generally compatible with existing latent-based watermarking schemes and keeps image quality essentially unchanged, while a simple mask-ratio parameter offers a tunable trade-off between anti-forgery strength and robustness. Across four mainstream latent-based watermark methods, our SemBind-enabled anti-forgery variants markedly reduce false acceptance under black-box forgery while providing a controllable robustness-security balance.

[CV-47] OSDEnhancer: Taming Real-World Space-Time Video Super-Resolution with One-Step Diffusion

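[Quick read]: This paper addresses real-world space-time video super-resolution (STVSR), which requires not only recovering fine visual detail when going from low to high resolution but also raising the frame rate while keeping temporal consistency. Existing methods mostly rely on simplified degradation assumptions, perform poorly under complex unknown real-world degradations, and struggle to balance reconstruction fidelity with temporal coherence. The key to the solution is the OSDEnhancer framework, whose core innovations are: (1) initializing essential spatiotemporal structures through linear pre-interpolation; (2) a temporal refinement and spatial enhancement mixture of experts (TR-SE MoE), in which distinct expert pathways learn temporal coherence and spatial detail respectively and reinforce each other during inference; and (3) a bidirectional deformable variational autoencoder (VAE) decoder that performs recurrent spatiotemporal aggregation and propagation to improve cross-frame reconstruction fidelity. According to the authors, it is the first method to achieve real-world STVSR with an efficient one-step diffusion process, improving performance and generalization.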

Link: https://arxiv.org/abs/2601.20308
Authors: Shuoyan Wei, Feng Li, Chen Zhou, Runmin Cong, Yao Zhao, Huihui Bai
Affiliations: Beijing Jiaotong University; Hefei University of Technology; Shandong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 17 pages, 10 figures. Code will be released upon publication

Abstract:Diffusion models (DMs) have demonstrated exceptional success in video super-resolution (VSR), showcasing a powerful capacity for generating fine-grained details. However, their potential for space-time video super-resolution (STVSR), which necessitates not only recovering realistic visual content from low-resolution to high-resolution but also improving the frame rate with coherent temporal dynamics, remains largely underexplored. Moreover, existing STVSR methods predominantly address spatiotemporal upsampling under simplified degradation assumptions, which often struggle in real-world scenarios with complex unknown degradations. Such a high demand for reconstruction fidelity and temporal consistency makes the development of a robust STVSR framework particularly non-trivial. To address these challenges, we propose OSDEnhancer, a novel framework that, to the best of our knowledge, represents the first method to achieve real-world STVSR through an efficient one-step diffusion process. OSDEnhancer initializes essential spatiotemporal structures through a linear pre-interpolation strategy and pivots on training temporal refinement and spatial enhancement mixture of experts (TR-SE MoE), which allows distinct expert pathways to progressively learn robust, specialized representations for temporal coherence and spatial detail, further collaboratively reinforcing each other during inference. A bidirectional deformable variational autoencoder (VAE) decoder is further introduced to perform recurrent spatiotemporal aggregation and propagation, enhancing cross-frame reconstruction fidelity. Experiments demonstrate that the proposed method achieves state-of-the-art performance while maintaining superior generalization capability in real-world scenarios.

[CV-48] TPGDiff: Hierarchical Triple-Prior Guided Diffusion for Image Restoration

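[Quick read]: This paper addresses two problems in all-in-one image restoration: existing methods that rely on degradation priors struggle to reconstruct content in severely degraded regions, and injecting semantic information into the shallow layers of a diffusion model disrupts spatial structure (for example, blurring artifacts). The key to the solution is a Triple-Prior Guided Diffusion network (TPGDiff), which integrates three priors hierarchically along the diffusion trajectory: degradation priors are applied throughout for stage-adaptive control, structural priors are injected into shallow layers to preserve fine-grained spatial detail, and semantic priors are injected into deep layers to provide robust high-level guidance. This hierarchical, complementary guidance improves reconstruction quality and generalization.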

Link: https://arxiv.org/abs/2601.20306
Authors: Yanjie Tu, Qingsen Yan, Axi Niu, Jiacong Tang
Affiliations: Northwestern Polytechnical University; Shenzhen Research Institute of Northwestern Polytechnical University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:All-in-one image restoration aims to address diverse degradation types using a single unified model. Existing methods typically rely on degradation priors to guide restoration, yet often struggle to reconstruct content in severely degraded regions. Although recent works leverage semantic information to facilitate content generation, integrating it into the shallow layers of diffusion models often disrupts spatial structures (e.g., blurring artifacts). To address this issue, we propose a Triple-Prior Guided Diffusion (TPGDiff) network for unified image restoration. TPGDiff incorporates degradation priors throughout the diffusion trajectory, while introducing structural priors into shallow layers and semantic priors into deep layers, enabling hierarchical and complementary prior guidance for image reconstruction. Specifically, we leverage multi-source structural cues as structural priors to capture fine-grained details and guide shallow-layer representations. To complement this design, we further develop a distillation-driven semantic extractor that yields robust semantic priors, ensuring reliable high-level guidance at deep layers even under severe degradations. Furthermore, a degradation extractor is employed to learn degradation-aware priors, enabling stage-adaptive control of the diffusion process across all timesteps. Extensive experiments on both single- and multi-degradation benchmarks demonstrate that TPGDiff achieves superior performance and generalization across diverse restoration scenarios. Our project page is: this https URL.

[CV-49] Structure-constrained Language-informed Diffusion Model for Unpaired Low-dose Computed Tomography Angiography Reconstruction

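[Quick read]: This paper addresses the reduced diagnostic accuracy caused by insufficient contrast in CT imaging with low-dose iodinated contrast media (ICM), while avoiding the risks of kidney damage and allergic reactions associated with high-dose ICM. The core solution is a Structure-constrained Language-informed Diffusion Model (SLDM). Structural prior information constrains the model's inference process to keep image structure consistent during enhancement; a semantic supervision strategy based on spatial intelligence combines visual perception and spatial reasoning to improve the structural accuracy and detail fidelity of generated images; and a subtraction angiography enhancement module adjusts the contrast of the contrast-agent region to an interval suitable for observation, enabling high-quality low-dose CT angiography reconstruction.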

Link: https://arxiv.org/abs/2601.20304
Authors: Genyuan Zhang, Zihao Wang, Zhifan Gao, Lei Xu, Zhen Zhou, Haijun Yu, Jianjia Zhang, Xiujian Liu, Weiwei Zhang, Shaoyu Wang, Huazhu Fu, Fenglin Liu, Weiwen Wu
Affiliations: Chongqing University; Sun Yat-sen University; Capital Medical University; Nanchang University; Institute of High Performance Computing (A*STAR)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The application of iodinated contrast media (ICM) improves the sensitivity and specificity of computed tomography (CT) for a wide range of clinical indications. However, overdose of ICM can cause problems such as kidney damage and life-threatening allergic reactions. Deep learning methods can generate CT images of normal-dose ICM from low-dose ICM, reducing the required dose while maintaining diagnostic power. However, existing methods struggle to achieve accurate enhancement with incompletely paired images, mainly because of the limited ability of the model to recognize specific structures. To overcome this limitation, we propose a Structure-constrained Language-informed Diffusion Model (SLDM), a unified medical generation model that integrates structural synergy and spatial intelligence. First, the structural prior information of the image is effectively extracted to constrain the model inference process, thus ensuring structural consistency in the enhancement process. Subsequently, a semantic supervision strategy with spatial intelligence is introduced, which integrates the functions of visual perception and spatial reasoning, thus prompting the model to achieve accurate enhancement. Finally, the subtraction angiography enhancement module is applied, which serves to improve the contrast of the ICM agent region to a suitable interval for observation. Qualitative analysis of visual comparison and quantitative results of several metrics demonstrate the effectiveness of our method in angiographic reconstruction for low-dose contrast medium CT angiography.

[CV-50] Physically Guided Visual Mass Estimation from a Single RGB Image

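[Quick read]: This paper addresses accurate estimation of object mass from a single RGB image. The problem is inherently ill-posed because mass depends on both geometric volume and material density, neither of which can be observed directly from visual appearance. To reduce this ambiguity, the paper proposes a physically structured framework that aligns visual cues with the physical factors that determine mass: monocular depth estimation recovers object-centric 3D geometry to inform volume, and a vision-language model extracts coarse material semantics to guide density-related reasoning. Geometry, semantic, and appearance representations are then fused through an instance-adaptive gating mechanism, and two physically guided latent factors (volume-related and density-related) are regressed by separate heads under mass-only supervision. Experiments show that the method outperforms current state-of-the-art approaches.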

Link: https://arxiv.org/abs/2601.20303
Authors: Sungjae Lee, Junhan Jeong, Yeonjoo Hong, Kwang In Kim
Affiliations: Graduate School of Artificial Intelligence, POSTECH, South Korea; Department of Electrical Engineering, POSTECH, South Korea
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Estimating object mass from visual input is challenging because mass depends jointly on geometric volume and material-dependent density, neither of which is directly observable from RGB appearance. Consequently, mass prediction from pixels is ill-posed and therefore benefits from physically meaningful representations to constrain the space of plausible solutions. We propose a physically structured framework for single-image mass estimation that addresses this ambiguity by aligning visual cues with the physical factors governing mass. From a single RGB image, we recover object-centric three-dimensional geometry via monocular depth estimation to inform volume and extract coarse material semantics using a vision-language model to guide density-related reasoning. These geometry, semantic, and appearance representations are fused through an instance-adaptive gating mechanism, and two physically guided latent factors (volume- and density-related) are predicted through separate regression heads under mass-only supervision. Experiments on image2mass and ABO-500 show that the proposed method consistently outperforms state-of-the-art methods.

[CV-51] Bridging the Applicator Gap with Data-Doping: Dual-Domain Learning for Precise Bladder Segmentation in CT-Guided Brachytherapy

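[Quick read]: This paper addresses performance degradation caused by covariate shift in medical image segmentation, specifically bladder segmentation in CT for gynecological brachytherapy. Target-domain data (CT scans with applicators inserted, WA) are scarce and show substantial anatomical deformation and imaging artifacts, whereas source-domain data (CT scans without applicators, NA) are plentiful but differently distributed. The key to the solution is a dual-domain learning strategy: adding a small proportion of WA data (only 10–30%) to a predominantly NA training set lets the model capture target-domain characteristics and markedly improves segmentation on WA data, with Dice similarity coefficients of up to 0.94, indicating effective domain adaptation and improved clinical reliability.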

Link: https://arxiv.org/abs/2601.20302
Authors: Suresh Das, Siladittya Manna, Sayantari Ghosh
Affiliations: Narayana Superspeciality Hospital; Department of Computational and Data Sciences, Indian Institute of Science; Department of Physics, National Institute of Technology Durgapur
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Performance degradation due to covariate shift remains a major challenge for deep learning models in medical image segmentation. An open question is whether samples from a shifted distribution can effectively support learning when combined with limited target domain data. We investigate this problem in the context of bladder segmentation in CT guided gynecological brachytherapy, a critical task for accurate dose optimization and organ at risk sparing. While CT scans without brachytherapy applicators (no applicator: NA) are widely available, scans with applicators inserted (with applicator: WA) are scarce and exhibit substantial anatomical deformation and imaging artifacts, making automated segmentation particularly difficult. We propose a dual domain learning strategy that integrates NA and WA CT data to improve robustness and generalizability under covariate shift. Using a curated assorted dataset, we show that NA data alone fail to capture the anatomical and artifact related characteristics of WA images. However, introducing a modest proportion of WA data into a predominantly NA training set leads to significant performance improvements. Through systematic experiments across axial, coronal, and sagittal planes using multiple deep learning architectures, we demonstrate that doping only 10 to 30 percent WA data achieves segmentation performance comparable to models trained exclusively on WA data. The proposed approach attains Dice similarity coefficients of up to 0.94 and Intersection over Union scores of up to 0.92, indicating effective domain adaptation and improved clinical reliability. This study highlights the value of integrating anatomically similar but distribution shifted datasets to overcome data scarcity and enhance deep learning based segmentation for brachytherapy treatment planning. 
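
For readers unfamiliar with the two metrics this abstract reports, the following sketch computes the Dice similarity coefficient and Intersection over Union for a pair of binary masks. It is a generic illustration on flattened toy masks, not the paper's evaluation code; the function names and the convention of returning 1.0 for two empty masks are assumptions.

```python
def dice_and_iou(pred, target):
    """Dice and IoU for two equal-length binary masks given as 0/1 sequences."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - inter
    if union == 0:
        return 1.0, 1.0  # assumed convention: two empty masks agree perfectly
    dice = 2.0 * inter / (pred_area + target_area)
    iou = inter / union
    return dice, iou


def dice_from_iou(iou):
    """For a single pair of masks, Dice = 2 * IoU / (1 + IoU)."""
    return 2.0 * iou / (1.0 + iou)


# Toy flattened masks: three overlapping pixels, four pixels in each mask.
pred = [1, 1, 1, 0, 0, 0, 1, 0]
target = [1, 1, 0, 0, 0, 0, 1, 1]
dice, iou = dice_and_iou(pred, target)  # 0.75 and 0.6
```

Because the two metrics are tied together for any single mask pair, Dice is never lower than IoU on the same prediction; averages over many cases need not satisfy the exact identity.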

[CV-52] Towards Compact and Robust DNNs via Compression-aware Sharpness Minimization

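[Quick read]: This paper addresses the compatibility of Sharpness-Aware Minimization (SAM) with structural model pruning: pruning a SAM-trained model can undermine its robustness to input variations, while applying SAM after pruning is constrained by the architecture fixed by the initial pruning pattern. The key to the solution is the Compression-aware ShArpness Minimization (C-SAM) framework, whose core innovation is to shift sharpness-aware learning from parameter perturbations to pruning-mask perturbations. This encourages a flatter loss landscape with respect to model structure, so that the resulting pruning patterns maintain task accuracy while markedly improving certified robustness.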

Link: https://arxiv.org/abs/2601.20301
Authors: Jialuo He, Huangxun Chen
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Sharpness-Aware Minimization (SAM) has recently emerged as an effective technique for improving DNN robustness to input variations. However, its interplay with the compactness requirements of on-device DNN deployments remains less explored. Simply pruning a SAM-trained model can undermine robustness, since flatness in the continuous parameter space does not necessarily translate to robustness under the discrete structural changes induced by pruning. Conversely, applying SAM after pruning may be fundamentally constrained by architectural limitations imposed by an early, robustness-agnostic pruning pattern. To address this gap, we propose Compression-aware ShArpness Minimization (C-SAM), a framework that shifts sharpness-aware learning from parameter perturbations to mask perturbations. By explicitly perturbing pruning masks during training, C-SAM promotes a flatter loss landscape with respect to model structure, enabling the discovery of pruning patterns that simultaneously optimize model compactness and robustness to input variations. Extensive experiments on CelebA-HQ, Flowers-102, and CIFAR-10-C across ResNet-18, GoogLeNet, and MobileNet-V2 show that C-SAM consistently achieves higher certified robustness than strong baselines, with improvements of up to 42%, while maintaining task accuracy comparable to the corresponding unpruned models.
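
One way to picture "sharpness with respect to the pruning mask" is to ask how much the loss can rise when a single mask entry is flipped. The sketch below is only that picture, on a toy linear model; it is not C-SAM. The abstract does not specify how masks are perturbed or optimized, so the Hamming-distance-1 neighbourhood and the names `mse` and `worst_case_flip` are assumptions.

```python
def mse(weights, mask, data):
    """Mean squared error of a masked linear model over (x, y) pairs."""
    total = 0.0
    for x, y in data:
        pred = sum(w * m * xi for w, m, xi in zip(weights, mask, x))
        total += (pred - y) ** 2
    return total / len(data)


def worst_case_flip(weights, mask, data):
    """Largest loss among the mask itself and all masks one flip away."""
    worst = mse(weights, mask, data)
    for i in range(len(mask)):
        flipped = mask[:i] + [1 - mask[i]] + mask[i + 1:]
        worst = max(worst, mse(weights, flipped, data))
    return worst


# Hypothetical data generated by y = 2 * x0 - x2; the second weight is unused.
weights = [2.0, 0.0, -1.0]
data = [([1.0, 2.0, 3.0], -1.0),
        ([0.0, 1.0, 1.0], -1.0),
        ([2.0, 0.0, 1.0], 3.0)]
mask = [1, 0, 1]

base_loss = mse(weights, mask, data)
worst_loss = worst_case_flip(weights, mask, data)
mask_sharpness = worst_loss - base_loss  # how much one flip can hurt
```

A mask with low sharpness in this sense tolerates small structural changes, which is the intuition behind seeking a flatter loss landscape over model structure rather than over continuous parameters.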

[CV-53] Artifact-Aware Evaluation for High-Quality Video Generation

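[Quick read]: This paper addresses the lack of fine-grained, localizable, and categorizable quality evaluation for generated video; existing methods provide only coarse quality scores and cannot identify or distinguish specific types of generation artifacts. The key to the solution is a comprehensive evaluation protocol covering three core dimensions that affect human perception (Appearance, Motion, and Camera), organized as a taxonomy of 10 common generative failure categories. The authors also build GenVID, a dataset of 80k videos generated by a range of state-of-the-art models and annotated with these artifact categories, and design the DVAR framework for dense recognition and fine-grained classification of generative artifacts, which markedly improves detection accuracy and supports effective filtering of low-quality content.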

Link: https://arxiv.org/abs/2601.20297
Authors: Chen Zhu, Jiashu Zhu, Yanxun Li, Meiqi Wu, Bingze Song, Chubin Chen, Jiahong Wu, Xiangxiang Chu, Yangang Wang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With the rapid advancement of video generation techniques, evaluating and auditing generated videos has become increasingly crucial. Existing approaches typically offer coarse video quality scores, lacking detailed localization and categorization of specific artifacts. In this work, we introduce a comprehensive evaluation protocol focusing on three key aspects affecting human perception: Appearance, Motion, and Camera. We define these axes through a taxonomy of 10 prevalent artifact categories reflecting common generative failures observed in video generation. To enable robust artifact detection and categorization, we introduce GenVID, a large-scale dataset of 80k videos generated by various state-of-the-art video generation models, each carefully annotated for the defined artifact categories. Leveraging GenVID, we develop DVAR, a Dense Video Artifact Recognition framework for fine-grained identification and classification of generative artifacts. Extensive experiments show that our approach significantly improves artifact detection accuracy and enables effective filtering of low-quality content.

[CV-54] A Source-Free Approach for Domain Adaptation via Multiview Image Transformation and Latent Space Consistency

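[Quick read]: This paper addresses domain adaptation (DA) when source-domain data are unavailable. Conventional methods typically depend on source data for feature alignment, adversarial training, or complex pseudo-label generation, which is computationally expensive and limits practical use. The key to the solution is a source-free domain adaptation method that uses multiview augmentation and latent space consistency to learn domain-invariant features directly from target-domain data, with no access to source data and no source-target alignment. Specifically, the model generates multiple augmented views of each target sample and minimizes the distance between their feature representations in the latent space, which enforces robust feature consistency. It also uses a ConvNeXt-based encoder and a joint objective combining classification and consistency losses. The method reaches 90.72%, 84.00%, and 97.12% accuracy on the Office-31, Office-Home, and Office-Caltech benchmarks respectively, improving on existing methods by 1.23% to 7.26% on average.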

Link: https://arxiv.org/abs/2601.20284
Authors: Debopom Sutradhar, Md. Abdur Rahman, Mohaimenul Azam Khan Raiaan, Reem E. Mohamed, Sami Azam
Affiliations: United International University; Charles Darwin University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Manuscript under review in IEEE Transactions on Image Processing

Abstract:Domain adaptation (DA) addresses the challenge of transferring knowledge from a source domain to a target domain where image data distributions may differ. Existing DA methods often require access to source domain data, adversarial training, or complex pseudo-labeling techniques, which are computationally expensive. To address these challenges, this paper introduces a novel source-free domain adaptation method. It is the first approach to use multiview augmentation and latent space consistency techniques to learn domain-invariant features directly from the target domain. Our method eliminates the need for source-target alignment or pseudo-label refinement by learning transferable representations solely from the target domain by enforcing consistency between multiple augmented views in the latent space. Additionally, the method ensures consistency in the learned features by generating multiple augmented views of target domain data and minimizing the distance between their feature representations in the latent space. We also introduce a ConvNeXt-based encoder and design a loss function that combines classification and consistency objectives to drive effective adaptation directly from the target domain. The proposed model achieves an average classification accuracy of 90.72%, 84%, and 97.12% on the Office-31, Office-Home and Office-Caltech datasets, respectively. Further evaluations confirm that our study improves existing methods by an average classification accuracy increment of +1.23%, +7.26%, and +1.77% on the respective datasets.
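
The latent-space consistency objective described in this abstract can be sketched as an average distance between the embeddings of several augmented views of one sample, added to a classification term. This is a minimal illustration, not the paper's loss: the squared Euclidean distance, the weight `lam`, and the function names are assumptions, and the classification loss is passed in as a plain number because the abstract does not say how it is computed.

```python
def consistency_loss(views):
    """Mean squared Euclidean distance over all pairs of view embeddings."""
    pairs = [(i, j) for i in range(len(views))
             for j in range(i + 1, len(views))]
    total = sum(sum((a - b) ** 2 for a, b in zip(views[i], views[j]))
                for i, j in pairs)
    return total / len(pairs)


def total_loss(classification_loss, views, lam=1.0):
    """Weighted sum of a supplied classification term and the consistency term."""
    return classification_loss + lam * consistency_loss(views)


# Hypothetical 2-D embeddings of three augmented views of one target image.
views = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
spread = consistency_loss(views)                     # (2 + 1 + 1) / 3
collapsed = consistency_loss([[1.0, 2.0], [1.0, 2.0]])  # identical views
combined = total_loss(0.5, views, lam=0.1)
```

The consistency term is zero only when all views map to the same point, so minimizing it pushes the encoder toward features that are stable under the chosen augmentations.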

[CV-55] Hallucination Begins Where Saliency Drops ICLR2026

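[Quick read]: This paper addresses hallucination in Large Vision-Language Models (LVLMs), where generated output is inconsistent with the facts of the input image. Existing methods rely only on forward-pass attention patterns and cannot reliably separate hallucinated from grounded content, because they ignore the gradient signals that reveal how token influence propagates. The key solution is LVLMs-Saliency, a framework that fuses attention weights with input gradients to quantify the visual grounding strength of each output token. It reveals that hallucinations tend to arise from a loss of contextual memory: when preceding output tokens have low saliency toward the prediction of the next token, hallucination becomes likely. Building on this finding, the paper designs two mechanisms: (1) Saliency-Guided Rejection Sampling (SGRS), which dynamically filters low-saliency candidate tokens during autoregressive decoding; and (2) Local Coherence Reinforcement (LocoRE), which strengthens attention from the current token to its most recent predecessors to counter contextual forgetting. The approach substantially reduces hallucination rates while preserving fluency and task performance, offering an interpretable and robust way to improve reliability.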

Link: https://arxiv.org/abs/2601.20279
Authors: Xiaofeng Zhang, Yuanchao Zhu, Chaochen Gu, Xiaosong Yuan, Qiyan Zhao, Jiawei Cao, Feilong Tang, Sinan Fan, Yaomin Shen, Chen Shen, Hao Tang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in ICLR 2026

Abstract:Recent studies have examined attention dynamics in large vision-language models (LVLMs) to detect hallucinations. However, existing approaches remain limited in reliably distinguishing hallucinated from factually grounded outputs, as they rely solely on forward-pass attention patterns and neglect gradient-based signals that reveal how token influence propagates through the network. To bridge this gap, we introduce LVLMs-Saliency, a gradient-aware diagnostic framework that quantifies the visual grounding strength of each output token by fusing attention weights with their input gradients. Our analysis uncovers a decisive pattern: hallucinations frequently arise when preceding output tokens exhibit low saliency toward the prediction of the next token, signaling a breakdown in contextual memory retention. Leveraging this insight, we propose a dual-mechanism inference-time framework to mitigate hallucinations: (1) Saliency-Guided Rejection Sampling (SGRS), which dynamically filters candidate tokens during autoregressive decoding by rejecting those whose saliency falls below a context-adaptive threshold, thereby preventing coherence-breaking tokens from entering the output sequence; and (2) Local Coherence Reinforcement (LocoRE), a lightweight, plug-and-play module that strengthens attention from the current token to its most recent predecessors, actively counteracting the contextual forgetting behavior identified by LVLMs-Saliency. Extensive experiments across multiple LVLMs demonstrate that our method significantly reduces hallucination rates while preserving fluency and task performance, offering a robust and interpretable solution for enhancing model reliability. Code is available at: this https URL

[CV-56] StreamFusion: Scalable Sequence Parallelism for Distributed Inference of Diffusion Transformers on GPUs

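[Quick read]: This paper addresses the inefficiency of single-GPU inference for diffusion models on high-resolution image and long-video generation, which shows up as high latency and large activation memory. Existing sequence parallelism (SP) frameworks such as Ulysses Attention and Ring Attention have three limitations: (1) communication patterns that ignore the difference between inter-machine and intra-machine bandwidth in modern GPU clusters; (2) latency bottlenecks from all-to-all operations in inter-machine communication; and (3) GPU sender-receiver synchronization and computation overheads from two-sided communication libraries. The core of the solution is the StreamFusion engine, with three key innovations: (1) a topology-aware sequence parallelism scheme that explicitly accounts for differences in multi-machine network topology; (2) Torus Attention, which overlaps inter-machine all-to-all communication with computation; and (3) a one-sided communication implementation that substantially reduces GPU-side synchronization and communication overheads. Experiments show that StreamFusion is on average 1.35× faster than the state-of-the-art approach (up to 1.77×).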

Link: https://arxiv.org/abs/2601.20273
Authors: Jiacheng Yang, Jun Wu, Yaoyao Ding, Zhiying Xu, Yida Wang, Gennady Pekhimenko
Affiliations: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion Transformers (DiTs) have gained increasing adoption in high-quality image and video generation. As demand for higher-resolution images and longer videos increases, single-GPU inference becomes inefficient due to increased latency and large activation sizes. Current frameworks employ sequence parallelism (SP) techniques such as Ulysses Attention and Ring Attention to scale inference. However, these implementations have three primary limitations: (1) suboptimal communication patterns for network topologies on modern GPU machines, (2) latency bottlenecks from all-to-all operations in inter-machine communication, and (3) GPU sender-receiver synchronization and computation overheads from using two-sided communication libraries. To address these issues, we present StreamFusion, a topology-aware efficient DiT serving engine. StreamFusion incorporates three key innovations: (1) a topology-aware sequence parallelism technique that accounts for inter- and intra-machine bandwidth differences, (2) Torus Attention, a novel SP technique enabling overlapping of inter-machine all-to-all operations with computation, and (3) a one-sided communication implementation that minimizes GPU sender-receiver synchronization and computation overheads. Our experiments demonstrate that StreamFusion outperforms the state-of-the-art approach by an average of 1.35× (up to 1.77×).

[CV-57] Reversible Efficient Diffusion for Image Fusion

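[Quick read]: This paper addresses detail loss when diffusion models are applied to multi-modal image fusion, caused by the accumulation of noise errors inherent in the Markov process, which degrades the visual fidelity and detail preservation of fused images. The key to the solution is Reversible Efficient Diffusion (RED), an explicitly supervised training framework that inherits the generative capability of diffusion models while avoiding reliance on complex distribution estimation, improving the consistency and detail fidelity of fusion results while remaining computationally efficient.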

Link: https://arxiv.org/abs/2601.20260
Authors: Xingxin Xu, Bing Cao, DongDong Li, Qinghua Hu, Pengfei Zhu
Affiliations: Tianjin University; Xiong’an National Innovation Center; Xiong’an Guochuang Lantian Technology Co., Ltd.; National University of Defense Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multi-modal image fusion aims to consolidate complementary information from diverse source images into a unified representation. The fused image is expected to preserve fine details and maintain high visual fidelity. While diffusion models have demonstrated impressive generative capabilities in image generation, they often suffer from detail loss when applied to image fusion tasks. This issue arises from the accumulation of noise errors inherent in the Markov process, leading to inconsistency and degradation in the fused results. However, incorporating explicit supervision into end-to-end training of diffusion-based image fusion introduces challenges related to computational efficiency. To address these limitations, we propose the Reversible Efficient Diffusion (RED) model - an explicitly supervised training framework that inherits the powerful generative capability of diffusion models while avoiding the distribution estimation.

[CV-58] BLENDER: Blended Text Embeddings and Diffusion Residuals for Intra-Class Image Synthesis in Deep Metric Learning

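[Quick read]: This paper addresses the performance limits that Deep Metric Learning (DML) faces when training data lack intra-class diversity. Existing generative models used for synthetic data augmentation struggle to control the diversity of attribute combinations, which yields homogeneous intra-class samples and hurts generalization. The key to the solution is BLenDeR, a diffusion sampling method that performs controllable synthesis through set-theory-inspired union and intersection operations on denoising residuals: the union operation encourages any attribute present across multiple prompts, while the intersection extracts the common direction through a principal component surrogate, enabling controlled generation of diverse attribute combinations within each class. The method increases intra-class diversity and outperforms state-of-the-art baselines on standard DML benchmarks, for example a 3.7% gain in Recall@1 on CUB-200 and a 1.8% gain on Cars-196.

Link: https://arxiv.org/abs/2601.20246
Authors: Jan Niklas Kolf, Ozan Tezcan, Justin Theiss, Hyung Jun Kim, Wentao Bao, Bhargav Bhushanam, Khushi Gupta, Arun Kejariwal, Naser Damer, Fadi Boutros
Affiliations: Fraunhofer Institute for Computer Graphics Research; Meta
Categories: Computer Vision and Pattern Recognition (cs.CV)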

Abstract:The rise of Deep Generative Models (DGM) has enabled the generation of high-quality synthetic data. When used to augment authentic data in Deep Metric Learning (DML), these synthetic samples enhance intra-class diversity and improve the performance of downstream DML tasks. We introduce BLenDeR, a diffusion sampling method designed to increase intra-class diversity for DML in a controllable way by leveraging set-theory inspired union and intersection operations on denoising residuals. The union operation encourages any attribute present across multiple prompts, while the intersection extracts the common direction through a principal component surrogate. These operations enable controlled synthesis of diverse attribute combinations within each class, addressing key limitations of existing generative approaches. Experiments on standard DML benchmarks demonstrate that BLenDeR consistently outperforms state-of-the-art baselines across multiple datasets and backbones. Specifically, BLenDeR achieves 3.7% increase in Recall@1 on CUB-200 and a 1.8% increase on Cars-196, compared to state-of-the-art baselines under standard experimental settings.

[CV-59] Visual Prompt-Agnostic Evolution ICLR2026

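[Quick read]: This paper addresses the unstable training dynamics of Visual Prompt Tuning (VPT): shallow-layer prompts stagnate early while deeper-layer prompts exhibit high-variance oscillations, causing a cross-layer mismatch that slows convergence and degrades final performance. The key to the solution is Prompt-Agnostic Evolution (PAE), which has three components: prompts are initialized from a frequency-domain perspective, in a task-aware direction found by uncovering and propagating the frequency shortcut patterns that the backbone inherently exploits; a shared Koopman operator imposes a global linear transformation so that prompts evolve coherently across layers; and a regularizer inspired by Lyapunov stability theory constrains error amplification during evolution. The method requires no backbone modification or inference-time changes, and is lightweight and broadly applicable.

Link: https://arxiv.org/abs/2601.20232
Authors: Junze Wang, Lei Fan, Dezheng Zhang, Weipeng Jing, Donglin Di, Yang Song, Sidong Liu, Cong Cong
Affiliations: University of Science and Technology Beijing; University of New South Wales; Northeast Forestry University; Tsinghua University; Macquarie University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICLR 2026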

Abstract: Visual Prompt Tuning (VPT) adapts a frozen Vision Transformer (ViT) to downstream tasks by inserting a small number of learnable prompt tokens into the token sequence at each layer. However, we observe that existing VPT variants often suffer from unstable training dynamics, characterized by gradient oscillations. A layer-wise analysis reveals that shallow-layer prompts tend to stagnate early, while deeper-layer prompts exhibit high-variance oscillations, leading to cross-layer mismatch. These issues slow convergence and degrade final performance. To address these challenges, we propose Prompt-Agnostic Evolution (PAE), which strengthens vision prompt tuning by explicitly modeling prompt dynamics. From a frequency-domain perspective, we initialize prompts in a task-aware direction by uncovering and propagating frequency shortcut patterns that the backbone inherently exploits for recognition. To ensure coherent evolution across layers, we employ a shared Koopman operator that imposes a global linear transformation instead of uncoordinated, layer-specific updates. Finally, inspired by Lyapunov stability theory, we introduce a regularizer that constrains error amplification during evolution. Extensive experiments show that PAE accelerates convergence with an average 1.41× speedup and improves accuracy by 1–3% on 25 datasets across multiple downstream tasks. Beyond performance, PAE is prompt-agnostic and lightweight, and it integrates seamlessly with diverse VPT variants without backbone modification or inference-time changes.

[CV-60] Feature Projection Learning for Better Vision-Language Reasoning ICASSP2026

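[Quick read]: This paper addresses the inefficiency of adapting vision-language pre-trained models such as CLIP to downstream tasks, where existing methods suffer from limited performance, excessive learnable parameters, or long training times. The key to the solution is Feature Projection Learning (FPL), a simple and efficient method: a projection model projects class prototype features into the query image feature space and reconstructs the query image feature map, and the negative average squared reconstruction error serves as the class score, turning classification into a feature projection problem. The final output combines the projection model's prediction with that of the original pre-trained CLIP, and the paper reports accuracy above current state-of-the-art methods.

Link: https://arxiv.org/abs/2601.20224
Authors: Yi Zhang, Weicheng Lin, Liang-Jie Zhang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICASSP 2026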

Abstract: Vision-Language Pre-Trained models, notably CLIP, that utilize contrastive learning have proven highly adept at extracting generalizable visual features. To inherit the well-learned knowledge of VLP models for downstream tasks, several approaches aim to adapt them efficiently with limited supervision. However, these methods either suffer from limited performance, excessive learnable parameters, or extended training times, all of which hinder their effectiveness in adapting the CLIP model to downstream tasks. In this work, we propose a simple yet efficient and effective method called Feature Projection Learning (FPL) to address these problems. Specifically, we develop a projection model that projects class prototype features into the query image feature space and reconstructs the query image feature map. The negative average squared reconstruction error is used as the class score. In this way, we transform the classification problem into a feature projection problem. The final output of this method is a combination of the prediction from the projection model and the original pre-trained CLIP. Comprehensive empirical evaluations confirm that FPL delivers superior accuracy, surpassing the current state-of-the-art methods by a substantial margin.
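
The scoring rule described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the learned projection model is replaced by a hypothetical stand-in (`reconstruct`, an orthogonal projection of each query feature onto the class prototype), and the fusion weight `alpha`, the weighted-sum fusion, and all function names are assumptions.

```python
def reconstruct(query_feats, prototype):
    """Hypothetical stand-in for the learned projection model:
    project each query feature vector onto the class prototype."""
    norm_sq = sum(p * p for p in prototype)
    out = []
    for f in query_feats:
        coeff = sum(a * b for a, b in zip(f, prototype)) / norm_sq
        out.append([coeff * p for p in prototype])
    return out


def class_score(query_feats, prototype):
    """Negative average squared reconstruction error for one class."""
    recon = reconstruct(query_feats, prototype)
    total, count = 0.0, 0
    for f, r in zip(query_feats, recon):
        for a, b in zip(f, r):
            total += (a - b) ** 2
            count += 1
    return -total / count


def fused_scores(query_feats, prototypes, clip_logits, alpha=0.5):
    """Combine projection-based scores with the original CLIP logits.
    The weighted sum and the value of alpha are assumptions."""
    return [alpha * class_score(query_feats, p) + (1 - alpha) * c
            for p, c in zip(prototypes, clip_logits)]


query = [[1.0, 0.0], [2.0, 0.0]]        # toy feature map with two locations
prototypes = [[1.0, 0.0], [0.0, 1.0]]   # one prototype per class
scores = [class_score(query, p) for p in prototypes]
best = max(range(len(scores)), key=scores.__getitem__)
```

In this toy case the first prototype reconstructs the query features exactly, so its score is the highest possible (zero error) and it is selected as the predicted class.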

[CV-61] DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment ICLR2026

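[Quick read]: This paper addresses the sparse reward problem that text-to-image methods based on GRPO (Group Relative Policy Optimization) face in human preference alignment: the terminal reward is applied uniformly to every intermediate step of the denoising trajectory, so the global feedback signal does not match the fine-grained contribution of each step. The key to the solution is the DenseGRPO framework, with two core components: (1) an ODE-based approach applies a reward model to intermediate clean images and predicts the step-wise reward gain as a dense reward, aligning the feedback signal with each step's contribution; (2) the dense rewards reveal a mismatch between the uniform exploration setting and the time-varying noise intensity in existing methods, so a reward-aware scheme calibrates the exploration space by adaptively adjusting the timestep-specific stochasticity injected in the SDE sampler, keeping the exploration space suitable at every timestep.

Link: https://arxiv.org/abs/2601.20218
Authors: Haoyou Deng, Keyu Yan, Chaojie Mao, Xiang Wang, Yu Liu, Changxin Gao, Nong Sang
Affiliations: Huazhong University of Science and Technology; Tongyi Lab, Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICLR 2026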

Abstract: Recent GRPO-based approaches built on flow matching models have shown remarkable improvements in human preference alignment for text-to-image generation. Nevertheless, they still suffer from the sparse reward problem: the terminal reward of the entire denoising trajectory is applied to all intermediate steps, resulting in a mismatch between the global feedback signals and the exact fine-grained contributions at intermediate denoising steps. To address this issue, we introduce DenseGRPO, a novel framework that aligns human preference with dense rewards, which evaluates the fine-grained contribution of each denoising step. Specifically, our approach includes two key components: (1) we propose to predict the step-wise reward gain as dense reward of each denoising step, which applies a reward model on the intermediate clean images via an ODE-based approach. This manner ensures an alignment between feedback signals and the contributions of individual steps, facilitating effective training; and (2) based on the estimated dense rewards, a mismatch drawback between the uniform exploration setting and the time-varying noise intensity in existing GRPO-based methods is revealed, leading to an inappropriate exploration space. Thus, we propose a reward-aware scheme to calibrate the exploration space by adaptively adjusting a timestep-specific stochasticity injection in the SDE sampler, ensuring a suitable exploration space at all timesteps. Extensive experiments on multiple standard benchmarks demonstrate the effectiveness of the proposed DenseGRPO and highlight the critical role of the valid dense rewards in flow matching model alignment.
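
The contrast between a sparse terminal reward and a step-wise reward gain can be sketched with plain lists. This is an illustrative sketch only: the reward values are made up, the function names are hypothetical, and the ODE-based estimation of intermediate clean images and the reward model itself are not shown.

```python
def sparse_rewards(intermediate_rewards):
    """Baseline described in the abstract: the terminal reward is
    applied to every denoising step."""
    terminal = intermediate_rewards[-1]
    return [terminal] * (len(intermediate_rewards) - 1)


def dense_rewards(intermediate_rewards):
    """Step-wise reward gain: reward after a step minus reward before it."""
    return [after - before
            for before, after in zip(intermediate_rewards,
                                     intermediate_rewards[1:])]


# Hypothetical reward-model scores of the predicted clean image
# before step 1 and after steps 1 to 4.
rewards = [0.10, 0.30, 0.25, 0.60, 0.70]
sparse = sparse_rewards(rewards)
dense = dense_rewards(rewards)
```

The dense rewards sum to the total improvement over the trajectory, and a step that lowers the reward (the second step here) receives a negative value instead of sharing the positive terminal reward.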

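[CV-62] TRACER: Texture-Robust Affordance Chain-of-Thought for Deformable-Object Refinement

[Quick read]: This paper addresses how a robot manipulating deformable objects can align high-level semantic instructions with physical interaction points; under complex appearance and texture variations, existing vision-based affordance prediction methods often suffer from boundary overflow and fragmented functional regions. The key to the solution is the TRACER framework, whose core components are: (1) a Tree-structured Affordance Chain-of-Thought (TA-CoT) that decomposes high-level task intentions into hierarchical sub-task semantics, giving consistent guidance across execution stages; (2) a Spatial-Constrained Boundary Refinement (SCBR) mechanism that suppresses prediction spillover and preserves spatial integrity; (3) an Interactive Convergence Refinement Flow (ICRF) that aggregates discrete pixels corrupted by appearance noise, improving the spatial continuity and physical plausibility of functional regions. Together these modules improve affordance grounding precision on deformable objects and the success rate of long-horizon tasks.

Link: https://arxiv.org/abs/2601.20208
Authors: Wanjun Jia, Kang Li, Fan Yang, Mengfei Duan, Wenrui Chen, Yiming Jiang, Hui Zhang, Kailun Yang, Zhiyong Li, Yaonan Wang
Affiliations: Hunan University; National Engineering Research Center of Robot Visual Perception and Control Technology
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: The source code and dataset will be made publicly available at this https URL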

Abstract:The central challenge in robotic manipulation of deformable objects lies in aligning high-level semantic instructions with physical interaction points under complex appearance and texture variations. Due to near-infinite degrees of freedom, complex dynamics, and heterogeneous patterns, existing vision-based affordance prediction methods often suffer from boundary overflow and fragmented functional regions. To address these issues, we propose TRACER, a Texture-Robust Affordance Chain-of-thought with dEformable-object Refinement framework, which establishes a cross-hierarchical mapping from hierarchical semantic reasoning to appearance-robust and physically consistent functional region refinement. Specifically, a Tree-structured Affordance Chain-of-Thought (TA-CoT) is formulated to decompose high-level task intentions into hierarchical sub-task semantics, providing consistent guidance across various execution stages. To ensure spatial integrity, a Spatial-Constrained Boundary Refinement (SCBR) mechanism is introduced to suppress prediction spillover, guiding the perceptual response to converge toward authentic interaction manifolds. Furthermore, an Interactive Convergence Refinement Flow (ICRF) is developed to aggregate discrete pixels corrupted by appearance noise, significantly enhancing the spatial continuity and physical plausibility of the identified functional regions. Extensive experiments conducted on the Fine-AGDDO15 dataset and a real-world robotic platform demonstrate that TRACER significantly improves affordance grounding precision across diverse textures and patterns inherent to deformable objects. More importantly, it enhances the success rate of long-horizon tasks, effectively bridging the gap between high-level semantic reasoning and low-level physical execution. The source code and dataset will be made publicly available at this https URL.

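[CV-63] Automated Marine Biofouling Assessment: Benchmarking Computer Vision and Multimodal LLMs on the Level of Fouling Scale

[Quick read]: This paper addresses automated assessment of the severity of marine biofouling on vessel hulls, in response to the safety and scalability limits of traditional diver-based visual inspection. The key to the solution is combining the strengths of computer vision models and large language models (LLMs): the former use convolutional neural networks (CNNs) and transformer-based segmentation models to classify the extreme fouling levels accurately, while the latter, guided by structured prompts and retrieval, achieve competitive performance without training and provide interpretable outputs, giving a complementary, scalable, and transparent hybrid assessment framework.

Link: https://arxiv.org/abs/2601.20196
Authors: Brayden Hamilton, Tim Cashmore, Peter Driscoll, Trevor Gee, Henry Williams
Affiliations: The University of Auckland
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Australasian Conference on Robotics and Automation, ACRA2025; 13 pages, 8 figures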

Abstract:Marine biofouling on vessel hulls poses major ecological, economic, and biosecurity risks. Traditional survey methods rely on diver inspections, which are hazardous and limited in scalability. This work investigates automated classification of biofouling severity on the Level of Fouling (LoF) scale using both custom computer vision models and large multimodal language models (LLMs). Convolutional neural networks, transformer-based segmentation, and zero-shot LLMs were evaluated on an expert-labelled dataset from the New Zealand Ministry for Primary Industries. Computer vision models showed high accuracy at extreme LoF categories but struggled with intermediate levels due to dataset imbalance and image framing. LLMs, guided by structured prompts and retrieval, achieved competitive performance without training and provided interpretable outputs. The results demonstrate complementary strengths across approaches and suggest that hybrid methods integrating segmentation coverage with LLM reasoning offer a promising pathway toward scalable and interpretable biofouling assessment.

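[CV-64] TeleStyle: Content-Preserving Style Transfer in Images and Videos

[Quick read]: This paper addresses the difficulty that Diffusion Transformers (DiTs) have with content-preserving style transfer, which comes from the inherent entanglement of content and style features in their internal representations. The key to the solution is TeleStyle, a lightweight yet effective model built on Qwen-Image-Edit and trained with a Curriculum Continual Learning framework on a hybrid dataset of curated high-quality style triplets (clean) and a large number of synthesized style triplets (noisy). This lets the model generalize to unseen styles without sacrificing content fidelity; a video-to-video stylization module further improves temporal consistency and visual quality. The paper reports state-of-the-art results on three core metrics: style similarity, content consistency, and aesthetic quality.

Link: https://arxiv.org/abs/2601.20175
Authors: Shiwen Zhang, Xiaoyan Yang, Bojia Zi, Haibin Huang, Chi Zhang, Xuelong Li
Affiliations: TeleAI
Categories: Computer Vision and Pattern Recognition (cs.CV)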

Abstract:Content-preserving style transfer, generating stylized outputs based on content and style references, remains a significant challenge for Diffusion Transformers (DiTs) due to the inherent entanglement of content and style features in their internal representations. In this technical report, we present TeleStyle, a lightweight yet effective model for both image and video stylization. Built upon Qwen-Image-Edit, TeleStyle leverages the base model’s robust capabilities in content preservation and style customization. To facilitate effective training, we curated a high-quality dataset of distinct specific styles and further synthesized triplets using thousands of diverse, in-the-wild style categories. We introduce a Curriculum Continual Learning framework to train TeleStyle on this hybrid dataset of clean (curated) and noisy (synthetic) triplets. This approach enables the model to generalize to unseen styles without compromising precise content fidelity. Additionally, we introduce a video-to-video stylization module to enhance temporal consistency and visual quality. TeleStyle achieves state-of-the-art performance across three core evaluation metrics: style similarity, content consistency, and aesthetic quality. Code and pre-trained models are available at this https URL

[CV-65] Efficient Token Pruning for LLaDA-V

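[Quick read]: This paper addresses the high computational cost of diffusion-based large multimodal models such as LLaDA-V in vision-language understanding and generation, caused by bidirectional attention and diffusion-style iterative denoising: visual tokens are reprocessed across all layers and denoising steps, wasting substantial FLOPs. The solution starts from an attention analysis showing that, unlike autoregressive decoders, LLaDA-V aggregates cross-modal information mainly in middle-to-late layers, which delays semantic alignment. Based on this, the paper proposes a structured token pruning strategy, inspired by FastV, that prunes selectively at the middle-to-late layers of the first denoising step; this matches the model's attention aggregation pattern and also reduces computation in all subsequent denoising steps. The best configuration cuts computational cost by up to 65% while preserving an average of 95% of task performance.

Link: https://arxiv.org/abs/2601.20168
Authors: Zhewen Wan, Tianchen Song, Chen Lin, Zhiyong Zhao, Xianpeng Lang
Affiliations: Li Auto Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)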

Abstract:Diffusion-based large multimodal models, such as LLaDA-V, have demonstrated impressive capabilities in vision-language understanding and generation. However, their bidirectional attention mechanism and diffusion-style iterative denoising paradigm introduce significant computational overhead, as visual tokens are repeatedly processed across all layers and denoising steps. In this work, we conduct an in-depth attention analysis and reveal that, unlike autoregressive decoders, LLaDA-V aggregates cross-modal information predominantly in middle-to-late layers, leading to delayed semantic alignment. Motivated by this observation, we propose a structured token pruning strategy inspired by FastV, selectively removing a proportion of visual tokens at designated layers to reduce FLOPs while preserving critical semantic information. To the best of our knowledge, this is the first work to investigate structured token pruning in diffusion-based large multimodal models. Unlike FastV, which focuses on shallow-layer pruning, our method targets the middle-to-late layers of the first denoising step to align with LLaDA-V’s delayed attention aggregation to maintain output quality, and the first-step pruning strategy reduces the computation across all subsequent steps. Our framework provides an empirical basis for efficient LLaDA-V inference and highlights the potential of vision-aware pruning in diffusion-based multimodal models. Across multiple benchmarks, our best configuration reduces computational cost by up to 65% while preserving an average of 95% task performance.
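
A minimal sketch of attention-ranked visual token pruning is shown below. Ranking tokens by the attention they receive follows the general FastV idea; the exact scoring criterion, pruning layers, and ratios used for LLaDA-V are not given in the abstract, so the function name, the `prune_ratio` parameter, and the example scores are all assumptions.

```python
def prune_visual_tokens(token_ids, attention_received, prune_ratio):
    """Keep the visual tokens that receive the most attention.

    token_ids: positions of the visual tokens in the sequence.
    attention_received: one score per visual token (for example, attention
        averaged over heads and query positions at the pruning layer).
    prune_ratio: fraction of visual tokens to drop, in [0, 1).
    """
    if not 0.0 <= prune_ratio < 1.0:
        raise ValueError("prune_ratio must be in [0, 1)")
    n_keep = max(1, round(len(token_ids) * (1.0 - prune_ratio)))
    ranked = sorted(range(len(token_ids)),
                    key=lambda i: attention_received[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore the original token order
    return [token_ids[i] for i in kept]


visual_tokens = [5, 6, 7, 8, 9, 10]
scores = [0.02, 0.30, 0.05, 0.25, 0.01, 0.37]
kept = prune_visual_tokens(visual_tokens, scores, prune_ratio=0.5)
```

Because the pruning is applied once in the first denoising step, the shorter token list would be reused by every later step, which is where the abstract locates most of the savings.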

[CV-66] NucFuseRank: Dataset Fusion and Performance Ranking for Nuclei Instance Segmentation

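[Quick read]: This paper addresses the fragmented datasets and inconsistent evaluation standards in nuclei instance segmentation for histopathology images, which make model results hard to compare and reproduce. The solution has three parts: first, a systematic literature review identifies publicly available, manually annotated nuclei datasets of HE-stained images and standardizes them into a unified input and annotation format; second, two state-of-the-art segmentation models (one based on convolutional neural networks (CNNs), one a hybrid CNN and vision transformer architecture) are used to evaluate and rank these datasets; third, a unified test set (NucFuse-test) is proposed for fair cross-dataset comparison and a fused training set (NucFuse-train) for improved segmentation performance. The work provides a new benchmark for nuclei instance segmentation on HE-stained tissue sections.

Link: https://arxiv.org/abs/2601.20104
Authors: Nima Torbati, Anastasia Meshcheryakova, Ramona Woitek, Sepideh Hatamikia, Diana Mechtcheriakova, Amirreza Mahbod
Affiliations: Medical University of Vienna; Danube Private University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 31 pages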

Abstract: Nuclei instance segmentation in hematoxylin and eosin (HE)-stained images plays an important role in automated histological image analysis, with various applications in downstream tasks. While several machine learning and deep learning approaches have been proposed for nuclei instance segmentation, most research in this field focuses on developing new segmentation algorithms and benchmarking them on a limited number of arbitrarily selected public datasets. In this work, rather than focusing on model development, we focused on the datasets used for this task. Based on an extensive literature review, we identified manually annotated, publicly available datasets of HE-stained images for nuclei instance segmentation and standardized them into a unified input and annotation format. Using two state-of-the-art segmentation models, one based on convolutional neural networks (CNNs) and one based on a hybrid CNN and vision transformer architecture, we systematically evaluated and ranked these datasets based on their nuclei instance segmentation performance. Furthermore, we proposed a unified test set (NucFuse-test) for fair cross-dataset evaluation and a unified training set (NucFuse-train) for improved segmentation performance by merging images from multiple datasets. By evaluating and ranking the datasets, performing comprehensive analyses, generating fused datasets, conducting external validation, and making our implementation publicly available, we provided a new benchmark for training, testing, and evaluating nuclei instance segmentation models on HE-stained histological images.

[CV-67] Sparse CLIP: Co-Optimizing Interpretability and Performance in Contrastive Learning

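[Quick read]: This paper addresses the limited interpretability of the dense representations in Contrastive Language-Image Pre-training (CLIP) models. The common assumption is that interpretability and performance trade off, which has motivated post-hoc sparsification techniques such as Sparse Autoencoders (SAEs); these improve interpretability but often hurt downstream performance and cross-modal capability. The key to the solution is to integrate sparsity directly into CLIP training instead of relying on post-hoc processing, so that strong downstream performance is kept alongside high interpretability and multimodal capability. By jointly optimizing a sparsity constraint with the cross-modal alignment objective, the learned features have a semantically clear sparse structure and support applications such as vision-based steering, supporting the view that interpretability and performance can be co-optimized.

Link: https://arxiv.org/abs/2601.20075
Authors: Chuan Qin, Constantin Venhoff, Sonia Joseph, Fanyi Xiao, Stefan Scherer
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)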

Abstract: Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in vision-language representation learning, powering diverse downstream tasks and serving as the default vision backbone in multimodal large language models (MLLMs). Despite its success, CLIP’s dense and opaque latent representations pose significant interpretability challenges. A common assumption is that interpretability and performance are in tension: enforcing sparsity during training degrades accuracy, motivating recent post-hoc approaches such as Sparse Autoencoders (SAEs). However, these post-hoc approaches often suffer from degraded downstream performance and loss of CLIP’s inherent multimodal capabilities, with most learned features remaining unimodal. We propose a simple yet effective approach that integrates sparsity directly into CLIP training, yielding representations that are both interpretable and performant. Compared to SAEs, our Sparse CLIP representations preserve strong downstream task performance, achieve superior interpretability, and retain multimodal capabilities. We show that multimodal sparse features enable straightforward semantic concept alignment and reveal training dynamics of how cross-modal knowledge emerges. Finally, as a proof of concept, we train a vision-language model on sparse CLIP representations that enables interpretable, vision-based steering capabilities. Our findings challenge conventional wisdom that interpretability requires sacrificing accuracy and demonstrate that interpretability and performance can be co-optimized, offering a promising design principle for future models.

[CV-68] Semi-Supervised Masked Autoencoders: Unlocking Vision Transformer Potential with Limited Data

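[Quick read]: This paper addresses the challenge of training Vision Transformers (ViTs) when labeled data is scarce but unlabeled data is abundant. The core of the solution is the Semi-Supervised Masked Autoencoder (SSMAE), a framework that jointly optimizes masked image reconstruction and classification on both labeled and unlabeled samples. It introduces a validation-driven gating mechanism that activates pseudo-labeling only once the model produces high-confidence predictions that are consistent across weakly and strongly augmented views of the same image, which reduces confirmation bias. On CIFAR-10 and CIFAR-100, SSMAE outperforms supervised ViT and fine-tuned Masked Autoencoder (MAE) baselines, with the largest gains in low-label regimes (for example, +9.24% over ViT on CIFAR-10 with 10% labels).

Link: https://arxiv.org/abs/2601.20072
Authors: Atik Faysal, Mohammad Rostami, Reihaneh Gh. Roshan, Nikhil Muralidhar, Huaxia Wang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)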

Abstract:We address the challenge of training Vision Transformers (ViTs) when labeled data is scarce but unlabeled data is abundant. We propose Semi-Supervised Masked Autoencoder (SSMAE), a framework that jointly optimizes masked image reconstruction and classification using both unlabeled and labeled samples with dynamically selected pseudo-labels. SSMAE introduces a validation-driven gating mechanism that activates pseudo-labeling only after the model achieves reliable, high-confidence predictions that are consistent across both weakly and strongly augmented views of the same image, reducing confirmation bias. On CIFAR-10 and CIFAR-100, SSMAE consistently outperforms supervised ViT and fine-tuned MAE, with the largest gains in low-label regimes (+9.24% over ViT on CIFAR-10 with 10% labels). Our results demonstrate that when pseudo-labels are introduced is as important as how they are generated for data-efficient transformer training. Codes are available at this https URL.
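
The gated pseudo-labeling rule can be sketched as below. This is an assumption-laden illustration rather than the paper's procedure: using validation accuracy as the gate signal, both threshold values, and the function name `select_pseudo_labels` are hypothetical.

```python
def select_pseudo_labels(weak_probs, strong_probs, val_accuracy,
                         gate_threshold=0.8, confidence_threshold=0.95):
    """Return {sample_index: pseudo_label} for unlabeled samples.

    weak_probs / strong_probs: per-sample class probabilities predicted on
        the weakly and strongly augmented view of the same image.
    val_accuracy: current validation accuracy, used here as the gate signal.
    Both thresholds are illustrative assumptions.
    """
    if val_accuracy < gate_threshold:  # gate closed: no pseudo-labels yet
        return {}
    selected = {}
    for i, (pw, ps) in enumerate(zip(weak_probs, strong_probs)):
        label_w = max(range(len(pw)), key=pw.__getitem__)
        label_s = max(range(len(ps)), key=ps.__getitem__)
        confident = min(pw[label_w], ps[label_s]) >= confidence_threshold
        if confident and label_w == label_s:
            selected[i] = label_w
    return selected


weak = [[0.97, 0.03], [0.60, 0.40], [0.02, 0.98]]
strong = [[0.96, 0.04], [0.55, 0.45], [0.97, 0.03]]
early = select_pseudo_labels(weak, strong, val_accuracy=0.50)
late = select_pseudo_labels(weak, strong, val_accuracy=0.90)
```

Only the first sample passes once the gate opens: the second is not confident enough, and the third is confident on both views but the views disagree.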

[CV-69] DiSa: Saliency-Aware Foreground-Background Disentangled Framework for Open-Vocabulary Semantic Segmentation

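[Quick read]: This paper addresses two key problems of open-vocabulary semantic segmentation methods that rely on vision-language models (VLMs) such as CLIP: Foreground Bias, where the model tends to ignore background regions, and Limited Spatial Localization, which blurs object boundaries. The core of the solution is the DiSa framework, with two components: (1) a Saliency-aware Disentanglement Module (SDM) that explicitly incorporates saliency cues and models foreground and background ensemble features separately in a divide-and-conquer manner; (2) a Hierarchical Refinement Module (HRM) that leverages pixel-wise spatial context and performs channel-wise feature refinement through multi-level updates, improving segmentation accuracy and boundary sharpness.

Link: https://arxiv.org/abs/2601.20064
Authors: Zhen Yao, Xin Li, Taotao Jing, Shuai Zhang, Mooi Choo Chuah
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 11 figures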

Abstract:Open-vocabulary semantic segmentation aims to assign labels to every pixel in an image based on text labels. Existing approaches typically utilize vision-language models (VLMs), such as CLIP, for dense prediction. However, VLMs, pre-trained on image-text pairs, are biased toward salient, object-centric regions and exhibit two critical limitations when adapted to segmentation: (i) Foreground Bias, which tends to ignore background regions, and (ii) Limited Spatial Localization, resulting in blurred object boundaries. To address these limitations, we introduce DiSa, a novel saliency-aware foreground-background disentangled framework. By explicitly incorporating saliency cues in our designed Saliency-aware Disentanglement Module (SDM), DiSa separately models foreground and background ensemble features in a divide-and-conquer manner. Additionally, we propose a Hierarchical Refinement Module (HRM) that leverages pixel-wise spatial contexts and enables channel-wise feature refinement through multi-level updates. Extensive experiments on six benchmarks demonstrate that DiSa consistently outperforms state-of-the-art methods.

[CV-70] Primitive-Driven Acceleration of Hyperdimensional Computing for Real-Time Image Classification

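[Quick read]: This paper addresses the inefficiency of running Hyperdimensional Computing (HDC) on conventional processors: core operations such as binding, permutation, bundling, and similarity search suffer from low utilization, memory bottlenecks, and limited real-time performance on CPUs and GPUs. The solution has two parts. First, an image-encoding algorithm, similar in spirit to convolutional neural networks, maps local image patches to hypervectors (HVs) enriched with spatial information and merges them into a global representation using the basic HDC operations, giving spatially sensitive and robust image encoding. Second, an end-to-end FPGA accelerator uses a pipelined architecture that exploits parallelism across both the hypervector dimensionality and the set of image patches. On an Alveo U280 FPGA the design reaches 0.09 ms inference latency, up to 1300× and 60× faster than state-of-the-art CPU and GPU baselines, respectively.

Link: https://arxiv.org/abs/2601.20061
Authors: Dhruv Parikh, Jebacyril Arockiaraj, Viktor Prasanna
Affiliations: unknown
Categories: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV)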

Abstract: Hyperdimensional Computing (HDC) represents data using extremely high-dimensional, low-precision vectors, termed hypervectors (HVs), and performs learning and inference through lightweight, noise-tolerant operations. However, the high dimensionality, sparsity, and repeated data movement involved in HDC make these computations difficult to accelerate efficiently on conventional processors. As a result, executing core HDC operations (binding, permutation, bundling, and similarity search) on CPUs or GPUs often leads to suboptimal utilization, memory bottlenecks, and limits on real-time performance. In this paper, our contributions are two-fold. First, we develop an image-encoding algorithm that, similar in spirit to convolutional neural networks, maps local image patches to hypervectors enriched with spatial information. These patch-level hypervectors are then merged into a global representation using the fundamental HDC operations, enabling spatially sensitive and robust image encoding. This encoder achieves 95.67% accuracy on MNIST and 85.14% on Fashion-MNIST, outperforming prior HDC-based image encoders. Second, we design an end-to-end accelerator that implements these compute operations on an FPGA through a pipelined architecture that exploits parallelism both across the hypervector dimensionality and across the set of image patches. Our Alveo U280 implementation delivers 0.09 ms inference latency, achieving up to 1300× and 60× speedup over state-of-the-art CPU and GPU baselines, respectively.
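
The four core HDC operations named in the abstract can be sketched in a few lines. This is a generic textbook-style illustration, not the paper's encoder or accelerator: the bipolar {-1, +1} representation, the dimension of 1024, and all function names are assumptions.

```python
import random


def random_hv(dim, rng):
    """Random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(dim)]


def bind(a, b):
    """Binding: element-wise multiplication (self-inverse for bipolar HVs)."""
    return [x * y for x, y in zip(a, b)]


def permute(a, shift=1):
    """Permutation: cyclic shift by `shift` positions."""
    shift %= len(a)
    return a[-shift:] + a[:-shift] if shift else list(a)


def bundle(hvs):
    """Bundling: element-wise majority vote (ties resolved to +1)."""
    return [1 if sum(col) >= 0 else -1 for col in zip(*hvs)]


def similarity(a, b):
    """Normalized dot product in [-1, 1]."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


def classify(query, class_hvs):
    """Similarity search: label of the most similar class hypervector."""
    return max(class_hvs, key=lambda label: similarity(query, class_hvs[label]))


rng = random.Random(0)
a, b, c = (random_hv(1024, rng) for _ in range(3))
```

Each operation touches every dimension independently, which is the kind of regular, element-wise parallelism that the abstract's pipelined FPGA design is said to exploit.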

[CV-71] Size Matters: Reconstructing Real-Scale 3D Models from Monocular Images for Food Portion Estimation

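[Quick read]: This paper addresses accurate recovery of food portion size from monocular images for precise dietary assessment, a pressing challenge for AI-driven dietary assessment in digital health. Existing 3D reconstruction methods achieve high-quality geometric reconstruction but do not recover real-world scale, which limits their use in precision nutrition. The key to the solution is to estimate the true scale of the reconstructed object from rich visual features extracted by models trained on large-scale datasets, converting single-view 3D reconstructions into physically meaningful, true-to-scale 3D models. Experiments show a reduction of nearly 30% in mean absolute volume-estimation error.

Link: https://arxiv.org/abs/2601.20051
Authors: Gautham Vinod, Bruce Coburn, Siddeshwar Raghavan, Jiangpeng He, Fengqing Zhu
Affiliations: Purdue University; Indiana University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)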

Abstract: The rise of chronic diseases related to diet, such as obesity and diabetes, emphasizes the need for accurate monitoring of food intake. While AI-driven dietary assessment has made strides in recent years, the ill-posed nature of recovering size (portion) information from monocular images for accurate estimation of "how much did you eat?" is a pressing challenge. Some 3D reconstruction methods have achieved impressive geometric reconstruction but fail to recover the crucial real-world scale of the reconstructed object, limiting its usage in precision nutrition. In this paper, we bridge the gap between 3D computer vision and digital health by proposing a method that recovers a true-to-scale 3D reconstructed object from a monocular image. Our approach leverages rich visual features extracted from models trained on large-scale datasets to estimate the scale of the reconstructed object. This learned scale enables us to convert single-view 3D reconstructions into true-to-life, physically meaningful models. Extensive experiments and ablation studies on two publicly available datasets show that our method consistently outperforms existing techniques, achieving nearly a 30% reduction in mean absolute volume-estimation error, showcasing its potential to enhance the domain of precision nutrition. Code: this https URL

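[CV-72] MeanCache: From Instantaneous to Average Velocity for Accelerating Flow Matching Inference

[Quick read]: This paper addresses the trajectory deviation and error accumulation that caching strategies cause in Flow Matching inference. Existing caching methods usually rely on instantaneous velocity information (e.g., feature caching), which leads to severe trajectory deviation at high acceleration ratios. The key to the solution is the MeanCache framework, which uses cached Jacobian–vector products (JVP) to construct interval average velocities and thereby mitigates local error accumulation. It also uses a Peak-Suppressed Shortest Path scheduling strategy under budget constraints to improve cache timing and the stability of JVP reuse, giving efficient, high-quality accelerated inference.

Link: https://arxiv.org/abs/2601.19961
Authors: Huanlin Gao, Ping Chen, Fuyuan Shi, Ruijia Wu, Li YanTao, Qiang Hui, Yuren You, Ting Lu, Chao Tan, Shaoan Zhao, Zhaoxiang Liu, Fang Zhao, Kai Wang, Shiguo Lian
Affiliations: China Unicom; Nanjing University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)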

Abstract: We present MeanCache, a training-free caching framework for efficient Flow Matching inference. Existing caching methods reduce redundant computation but typically rely on instantaneous velocity information (e.g., feature caching), which often leads to severe trajectory deviations and error accumulation under high acceleration ratios. MeanCache introduces an average-velocity perspective: by leveraging cached Jacobian–vector products (JVP) to construct interval average velocities from instantaneous velocities, it effectively mitigates local error accumulation. To further improve cache timing and JVP reuse stability, we develop a trajectory-stability scheduling strategy as a practical tool, employing a Peak-Suppressed Shortest Path under budget constraints to determine the schedule. Experiments on FLUX.1, Qwen-Image, and HunyuanVideo demonstrate that MeanCache achieves 4.12×, 4.56×, and 3.59× acceleration, respectively, while consistently outperforming state-of-the-art caching baselines in generation quality. We believe this simple yet effective approach provides a new perspective for Flow Matching inference and will inspire further exploration of stability-driven acceleration in commercial-scale generative models.
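
The difference between stepping with an instantaneous velocity and with an interval average velocity can be illustrated on a toy one-dimensional flow. This is only a sketch of the general idea, under stated assumptions: the velocity field `v(t) = a + b*t` is made up, its time derivative stands in for the cached JVP term, and the first-order correction `v(t) + 0.5*dt*dv/dt` is a textbook Taylor estimate rather than MeanCache's actual construction.

```python
def instantaneous_step(x, t, dt, v):
    """Euler step that reuses the velocity at the start of the interval."""
    return x + dt * v(t)


def average_velocity_step(x, t, dt, v, dv_dt):
    """Step with a first-order estimate of the interval average velocity."""
    v_avg = v(t) + 0.5 * dt * dv_dt(t)
    return x + dt * v_avg


a, b = 1.0, 2.0


def v(t):
    """Toy velocity field (an assumption for illustration)."""
    return a + b * t


def dv_dt(t):
    """Time derivative of the toy field, standing in for a JVP term."""
    return b


def exact(t):
    """Closed-form trajectory x(t) for x(0) = 0 under the toy field."""
    return a * t + 0.5 * b * t * t


x_inst = instantaneous_step(0.0, 0.0, 1.0, v)
x_avg = average_velocity_step(0.0, 0.0, 1.0, v, dv_dt)
```

For this linear field the average-velocity step lands exactly on the true trajectory over one large step, while the instantaneous-velocity step undershoots; real velocity fields are not linear, so the correction would only reduce the error rather than remove it.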

[CV-73] NCSAM: Noise-Compensated Sharpness-Aware Minimization for Noisy Label Learning

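[Quick read]: This paper addresses the performance degradation that label noise causes in deep learning, particularly the erroneous or corrupted annotations common in real-world datasets such as web-crawled data. Whereas existing work mostly focuses on sophisticated label-correction mechanisms, the paper takes a theoretical view: by analyzing the relationship between the flatness of the loss landscape and label noise, it finds that moderately simulated label noise can jointly improve generalization and robustness to label noise. The key to the solution is Noise-Compensated Sharpness-aware Minimization (NCSAM), which uses the perturbation mechanism of Sharpness-Aware Minimization (SAM) to compensate for the damage caused by label noise and consistently outperforms state-of-the-art methods on multiple benchmark datasets.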

Link: https://arxiv.org/abs/2601.19947
Authors: Jiayu Xu,Junbiao Pang
Affiliations: Beijing University Of Technology
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Learning from Noisy Labels (LNL) presents a fundamental challenge in deep learning, as real-world datasets often contain erroneous or corrupted annotations, e.g., data crawled from the Web. Current research focuses on sophisticated label correction mechanisms. In contrast, this paper adopts a novel perspective by establishing a theoretical analysis of the relationship between the flatness of the loss landscape and the presence of label noise. We theoretically demonstrate that carefully simulated label noise synergistically enhances both generalization performance and robustness to label noise. Consequently, we propose Noise-Compensated Sharpness-aware Minimization (NCSAM), which leverages the perturbation of Sharpness-Aware Minimization (SAM) to remedy the damage caused by label noise. Our analysis reveals that the testing accuracy exhibits behavior similar to that observed on noise-free datasets. Extensive experimental results on multiple benchmark datasets demonstrate the consistent superiority of the proposed method over existing state-of-the-art approaches on diverse tasks.
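
For readers unfamiliar with the SAM perturbation that NCSAM builds on, the following is a minimal sketch of a plain SAM update on a toy quadratic loss. It shows only standard SAM, not the noise-compensation step; the loss, the learning rate `lr`, and the radius `rho` are assumptions for illustration.

```python
import math

def loss(w):
    """Toy loss L(w) = 0.5 * (w0^2 + 10 * w1^2), assumed for illustration."""
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)

def loss_grad(w):
    """Analytic gradient of the toy loss."""
    return [w[0], 10.0 * w[1]]

def sam_step(w, lr=0.05, rho=0.1):
    """One SAM update: move to the approximate worst-case neighbor within
    radius rho, then descend from w using the gradient computed there."""
    g = loss_grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0  # guard against a zero gradient
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    g_adv = loss_grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

w = [1.0, 1.0]
for _ in range(100):
    w = sam_step(w)
print(loss(w))
```

The weights settle in a small neighborhood of the minimum rather than exactly on it, because the descent gradient is always evaluated at a point displaced by `rho`.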

[CV-74] Oculomix: Hierarchical Sampling for Retinal-Based Systemic Disease Prediction

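[Quick read]: This paper addresses the feature confounding that arises when image-level mixed-sample augmentations (such as CutMix and MixUp) are used for retinal-image-based prediction of systemic disease (e.g., cardiovascular events), because these augmentations ignore patient-specific attributes such as comorbidities and clinical factors. The key to the solution is a hierarchical sampling strategy, Oculomix, built on two clinical priors: retinal images acquired from the same patient at the same time point share the same attributes (exam level), and images from the same patient at different time points follow a soft temporal trend (patient level). By constraining the mixing space to the patient and exam levels and exploiting their hierarchical relationship, Oculomix preserves patient-specific characteristics and improves five-year prediction of major adverse cardiovascular events (MACE), with AUROC gains of up to 3% over the image-level baselines.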

Link: https://arxiv.org/abs/2601.19939
Authors: Hyunmin Kim,Yukun Zhou,Rahul A. Jonas,Lie Ju,Sunjin Hwang,Pearse A. Keane,Siegfried K. Wagner
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ISBI 2026

Abstract:Oculomics - the concept of predicting systemic diseases, such as cardiovascular disease and dementia, through retinal imaging - has advanced rapidly due to the data efficiency of transformer-based foundation models like RETFound. Image-level mixed sample data augmentations, such as CutMix and MixUp, are frequently used for training transformers, yet these techniques perturb patient-specific attributes, such as medical comorbidity and clinical factors, since they only account for images and labels. To address this limitation, we propose a hierarchical sampling strategy, Oculomix, for mixed sample augmentations. Our method is based on two clinical priors. First (exam level), images acquired from the same patient at the same time point share the same attributes. Second (patient level), images acquired from the same patient at different time points have a soft temporal trend, as morbidity generally increases over time. Guided by these priors, our method constrains the mixing space to the patient and exam levels to better preserve patient-specific characteristics and leverages their hierarchical relationships. The proposed method is validated using ViT models on a five-year prediction of major adverse cardiovascular events (MACE) in a large ethnically diverse population (Alzeye). We show that Oculomix consistently outperforms image-level CutMix and MixUp by up to 3% in AUROC, demonstrating the necessity and value of the proposed method in oculomics.
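
The two clinical priors suggest a partner-selection rule along the following lines. This is a hypothetical sketch only: the record layout, the `p_exam` probability, and the fallback behavior are assumptions, not details taken from the paper.

```python
import random

# Hypothetical records: (image_id, patient_id, exam_id)
RECORDS = [
    ("img0", "p1", "p1-e1"), ("img1", "p1", "p1-e1"),
    ("img2", "p1", "p1-e2"), ("img3", "p2", "p2-e1"),
    ("img4", "p2", "p2-e1"),
]

def sample_partner(anchor, records, rng, p_exam=0.5):
    """Pick a mixing partner from the same exam (exam level) or from another
    exam of the same patient (patient level); never from a different patient."""
    _, patient, exam = anchor
    same_exam = [r for r in records if r[2] == exam and r != anchor]
    same_patient = [r for r in records if r[1] == patient and r[2] != exam]
    if same_exam and (not same_patient or rng.random() < p_exam):
        return rng.choice(same_exam)
    if same_patient:
        return rng.choice(same_patient)
    return anchor  # no valid partner: fall back to not mixing

rng = random.Random(0)
partners = [sample_partner(r, RECORDS, rng) for r in RECORDS]
for anchor, partner in zip(RECORDS, partners):
    print(anchor[0], "->", partner[0])
```

Restricting the candidate pool this way means a mixed sample never blends attributes from two different patients, which is the property the abstract says image-level MixUp and CutMix lack.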

[CV-75] SegRap2025: A Benchmark of Gross Tumor Volume and Lymph Node Clinical Target Volume Segmentation for Radiotherapy Planning of Nasopharyngeal Carcinoma

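[Quick read]: This paper addresses automatic segmentation of radiotherapy targets for Nasopharyngeal Carcinoma (NPC), including the Gross Tumor Volume (GTV) and Lymph Node Clinical Target Volume (LN CTV), alongside Organ-at-Risk (OAR) delineation, on multi-center, multi-modality CT, with the goal of improving the generalization and robustness of segmentation models. The key to the solution is a large-scale multi-center, multi-modality benchmark (SegRap2025) with paired non-contrast CT (ncCT) and contrast-enhanced CT (ceCT) scans and two tasks: Task01 targets GTV segmentation and evaluates cross-center generalization, and Task02 targets LN CTV segmentation and tests cross-center and cross-modality robustness. Solutions from ten participating teams are analyzed; the best GTV model reaches a DSC of 74.61% on the internal test set but drops to 56.79% on the external test set, showing that cross-center transfer still needs improvement while providing a foundation for clinically deployable automated radiotherapy planning.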

Link: https://arxiv.org/abs/2601.20575
Authors: Jia Fu,Litingyu Wang,He Li,Zihao Luo,Huamin Wang,Chenyuan Bian,Zijun Gao,Chunbin Gu,Xin Weng,Jianghao Wu,Yicheng Wu,Jin Ye,Linhao Li,Yiwen Ye,Yong Xia,Elias Tappeiner,Fei He,Abdul qayyum,Moona Mazher,Steven A Niederer,Junqiang Chen,Chuanyi Huang,Lisheng Wang,Zhaohu Xing,Hongqiu Wang,Lei Zhu,Shichuan Zhang,Shaoting Zhang,Wenjun Liao,Guotai Wang
Affiliations: University of Electronic Science and Technology of China; Sichuan Cancer Hospital and Institute; Shanghai AI Lab; Qingdao University Affiliated Hospital; The Chinese University of Hong Kong; Monash University; Imperial College London; Northwestern Polytechnical University; UMIT Tirol; Shanghai Jiao Tong University; Hong Kong University of Science and Technology (Guangzhou); Bank of China; Shanghai MediWorks Precision Instruments Co., Ltd.
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate delineation of Gross Tumor Volume (GTV), Lymph Node Clinical Target Volume (LN CTV), and Organ-at-Risk (OAR) from Computed Tomography (CT) scans is essential for precise radiotherapy planning in Nasopharyngeal Carcinoma (NPC). Building upon SegRap2023, which focused on OAR and GTV segmentation using single-center paired non-contrast CT (ncCT) and contrast-enhanced CT (ceCT) scans, the SegRap2025 challenge aims to enhance the generalizability and robustness of segmentation models across imaging centers and modalities. SegRap2025 comprises two tasks: Task01 addresses GTV segmentation using paired CT from the SegRap2023 dataset, with an additional external testing set to evaluate cross-center generalization, and Task02 focuses on LN CTV segmentation using multi-center training data and an unseen external testing set, where each case contains paired CT scans or a single modality, emphasizing both cross-center and cross-modality robustness. This paper presents the challenge setup and provides a comprehensive analysis of the solutions submitted by ten participating teams. For the GTV segmentation task, the top-performing models achieved an average Dice Similarity Coefficient (DSC) of 74.61% and 56.79% on the internal and external testing cohorts, respectively. For the LN CTV segmentation task, the highest average DSC values reached 60.24%, 60.50%, and 57.23% on the paired CT, ceCT-only, and ncCT-only subsets, respectively. SegRap2025 establishes a large-scale multi-center, multi-modality benchmark for evaluating generalization and robustness in radiotherapy target segmentation, providing valuable insights toward clinically applicable automated radiotherapy planning systems. The benchmark is available at: this https URL.
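
Since the results above are reported as Dice Similarity Coefficients, a small sketch of the metric may help. It uses the standard definition DSC = 2|A∩B| / (|A| + |B|) on flat binary masks; the empty-mask convention and the toy masks are assumptions, and the challenge's own evaluation code may differ in such details.

```python
def dice_coefficient(pred, target):
    """Dice Similarity Coefficient between two binary masks given as flat 0/1 lists."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

pred = [1, 1, 0, 0, 1, 0]    # toy predicted mask
target = [1, 0, 0, 0, 1, 1]  # toy reference mask
score = dice_coefficient(pred, target)
print(f"DSC = {score:.4f}")
```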

Artificial Intelligence

[AI-0] SokoBench: Evaluating Long-Horizon Planning and Reasoning in Large Language Models

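[Quick read]: This paper addresses the weak long-horizon planning ability of Large Language Models (LLMs), in particular their difficulty planning multi-step action sequences in complex reasoning tasks. The authors propose a new benchmark based on Sokoban puzzles, deliberately simplified to isolate long-horizon planning from state persistence, and use it to systematically evaluate state-of-the-art Large Reasoning Models (LRMs). Equipping the models with Planning Domain Definition Language (PDDL) parsing, validation, and solving tools yields only modest gains, which points to inherent architectural limitations that test-time scaling alone may not overcome.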

Link: https://arxiv.org/abs/2601.20856
Authors: Sebastiano Monti,Carlo Nicolini,Gianni Pellegrini,Jacopo Staiano,Bruno Lepri
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Although the capabilities of large language models have been increasingly tested on complex reasoning tasks, their long-horizon planning abilities have not yet been extensively investigated. In this work, we provide a systematic assessment of the planning and long-horizon reasoning capabilities of state-of-the-art Large Reasoning Models (LRMs). We propose a novel benchmark based on Sokoban puzzles, intentionally simplified to isolate long-horizon planning from state persistence. Our findings reveal a consistent degradation in planning performance when more than 25 moves are required to reach the solution, suggesting a fundamental constraint on forward planning capacity. We show that equipping LRMs with Planning Domain Definition Language (PDDL) parsing, validation, and solving tools allows for modest improvements, suggesting inherent architectural limitations which might not be overcome by test-time scaling approaches alone.
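
Evaluating a model-generated plan on such a benchmark requires replaying the moves against the puzzle rules. The sketch below is a minimal plan checker under assumed conventions: a subset of the common Sokoban text symbols, a `U/D/L/R` move alphabet, and levels fully enclosed by walls. It is not the benchmark's own validator.

```python
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def run_plan(level, plan):
    """Replay a move string on a Sokoban level; return True if every box ends on a goal.

    Symbols (assumed subset): '#' wall, '@' player, '$' box, '.' goal, ' ' floor.
    """
    walls, goals, boxes, player = set(), set(), set(), None
    for r, row in enumerate(level):
        for c, ch in enumerate(row):
            if ch == "#":
                walls.add((r, c))
            elif ch == ".":
                goals.add((r, c))
            elif ch == "$":
                boxes.add((r, c))
            elif ch == "@":
                player = (r, c)
    for move in plan:
        dr, dc = MOVES[move]
        nxt = (player[0] + dr, player[1] + dc)
        if nxt in walls:
            return False                     # illegal: walking into a wall
        if nxt in boxes:
            beyond = (nxt[0] + dr, nxt[1] + dc)
            if beyond in walls or beyond in boxes:
                return False                 # illegal: box is blocked
            boxes.remove(nxt)
            boxes.add(beyond)
        player = nxt
    return boxes == goals

LEVEL = [
    "#######",
    "#@ $ .#",
    "#######",
]
print(run_plan(LEVEL, "RRR"))
```

A checker of this kind makes the plan length explicit, which matters here because the abstract reports degradation once solutions need more than 25 moves.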

[AI-1] Exploring Transformer Placement in Variational Autoencoders for Tabular Data Generation

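[Quick read]: This paper addresses the difficulty of modeling tabular data with generative models. The standard Variational Autoencoder (VAE), typically built from multilayer perceptrons, struggles to capture complex relationships between features, especially with mixed data types. The key to the solution is placing Transformer blocks in different components of the VAE so that attention can model feature interactions. Experiments show that positioning Transformers to operate on the latent and decoder representations produces a trade-off between fidelity and diversity of the generated data. The authors also observe high similarity between consecutive Transformer blocks; in the decoder in particular, the relationship between a Transformer's input and output is approximately linear.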

Link: https://arxiv.org/abs/2601.20854
Authors: Aníbal Silva,Moisés Santos,André Restivo,Carlos Soares
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Tabular data remains a challenging domain for generative models. In particular, the standard Variational Autoencoder (VAE) architecture, typically composed of multilayer perceptrons, struggles to model relationships between features, especially when handling mixed data types. In contrast, Transformers, through their attention mechanism, are better suited for capturing complex feature interactions. In this paper, we empirically investigate the impact of integrating Transformers into different components of a VAE. We conduct experiments on 57 datasets from the OpenML CC18 suite and draw two main conclusions. First, results indicate that positioning Transformers to leverage latent and decoder representations leads to a trade-off between fidelity and diversity. Second, we observe a high similarity between consecutive blocks of a Transformer in all components. In particular, in the decoder, the relationship between the input and output of a Transformer is approximately linear.

[AI-2] Post-Training Fairness Control: A Single-Train Framework for Dynamic Fairness in Recommendation WWW2026

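[Quick read]: This paper addresses the inflexibility of fairness control in recommender systems: existing fairness-aware methods fix the fairness requirement at training time, so when stakeholders ask for different fairness levels over time, the model must be retrained repeatedly at high cost. The key to the solution is Cofair, a single-train framework whose core idea is a shared representation layer with fairness-conditioned adapter modules that produce user embeddings tailored to different fairness levels, combined with a user-level regularization term that ensures user-wise fairness improves progressively across those levels. Theoretical analysis shows that the adversarial objective upper-bounds demographic parity and that the regularization term enforces monotonic user-level fairness improvement, so the fairness level can be adjusted after training without retraining.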

Link: https://arxiv.org/abs/2601.20848
Authors: Weixin Chen,Li Chen,Yuhan Zhao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
Comments: Accepted to WWW 2026 Workshop on HCRS (Oral Presentation)

Abstract:Despite growing efforts to mitigate unfairness in recommender systems, existing fairness-aware methods typically fix the fairness requirement at training time and provide limited post-training flexibility. However, in real-world scenarios, diverse stakeholders may demand differing fairness requirements over time, so retraining for different fairness requirements becomes prohibitive. To address this limitation, we propose Cofair, a single-train framework that enables post-training fairness control in recommendation. Specifically, Cofair introduces a shared representation layer with fairness-conditioned adapter modules to produce user embeddings specialized for varied fairness levels, along with a user-level regularization term that guarantees user-wise monotonic fairness improvements across these levels. We theoretically establish that the adversarial objective of Cofair upper bounds demographic parity and the regularization term enforces progressive fairness at user level. Comprehensive experiments on multiple datasets and backbone models demonstrate that our framework provides dynamic fairness at different levels, delivering comparable or better fairness-accuracy curves than state-of-the-art baselines, without the need to retrain for each new fairness requirement. Our code is publicly available at this https URL.
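
Demographic parity, the quantity the adversarial objective is said to bound, can be illustrated with a simple gap measure. The sketch below assumes a binary sensitive attribute and a thresholded score; the paper's exact formulation for recommendation may differ, and the data are made up.

```python
def demographic_parity_gap(scores, groups, threshold=0.5):
    """Absolute difference in positive-prediction rates between groups 0 and 1."""
    rates = []
    for g in (0, 1):
        members = [s for s, grp in zip(scores, groups) if grp == g]
        if not members:
            raise ValueError(f"group {g} has no members")
        rates.append(sum(s >= threshold for s in members) / len(members))
    return abs(rates[0] - rates[1])

# Hypothetical predicted relevance scores and sensitive-group labels.
scores = [0.9, 0.7, 0.2, 0.6, 0.4, 0.1]
groups = [0, 0, 0, 1, 1, 1]
gap = demographic_parity_gap(scores, groups)
print(f"demographic parity gap = {gap:.4f}")
```

A gap of zero means both groups receive positive predictions at the same rate; a framework with adjustable fairness levels would be expected to trace out smaller gaps as the requested level increases.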

[AI-3] $\mathbb{R}^{2k}$ is Theoretically Large Enough for Embedding-based Top-$k$ Retrieval

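[Quick read]: This paper studies the Minimal Embeddable Dimension (MED) for subset membership: the lowest dimension in which $m$ elements and the $\binom{m}{k}$ subsets of at most $k$ elements can be embedded in a vector space. The key to the solution is deriving tight bounds on the MED, theoretically and with empirical support, under different notions of distance or similarity ($\ell_2$ distance, inner product, and cosine similarity). Numerical simulation further shows that when each subset embedding is set to the centroid of its elements' embeddings, the MED depends logarithmically on the number of elements. The results indicate that the limits of embedding-based retrieval stem mainly from learnability challenges rather than geometric constraints, which offers guidance for future algorithm design.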

Link: https://arxiv.org/abs/2601.20844
Authors: Zihao Wang,Hang Yin,Lihui Liu,Hanghang Tong,Yangqiu Song,Ginny Wong,Simon See
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:This paper studies the minimal dimension required to embed subset memberships ($m$ elements and $\binom{m}{k}$ subsets of at most $k$ elements) into vector spaces, denoted as Minimal Embeddable Dimension (MED). The tight bounds of MED are derived theoretically and supported empirically for various notions of "distances" or "similarities," including the $\ell_2$ metric, inner product, and cosine similarity. In addition, we conduct numerical simulation in a more achievable setting, where the $\binom{m}{k}$ subset embeddings are chosen as the centroid of the embeddings of the contained elements. Our simulation easily realizes a logarithmic dependency between the MED and the number of elements to embed. These findings imply that embedding-based retrieval limitations stem primarily from learnability challenges, not geometric constraints, guiding future algorithm design.
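
The centroid setting can be checked by brute force on a tiny example. The sketch below tests whether every $k$-subset's centroid query ranks its own members strictly above all other elements under the inner product. The circle layout in $\mathbb{R}^2$, the strictness tolerance, and the function names are assumptions for illustration; this is not the paper's construction or simulation.

```python
import math
from itertools import combinations

def circle_embeddings(m):
    """m points equally spaced on the unit circle in R^2."""
    return [(math.cos(2 * math.pi * i / m), math.sin(2 * math.pi * i / m))
            for i in range(m)]

def centroid(vectors):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[d] for v in vectors) / n for d in range(len(vectors[0])))

def all_subsets_retrievable(emb, k, tol=1e-9):
    """True if, for every k-subset (k < len(emb)), the centroid query scores
    all members strictly higher than every non-member (inner product)."""
    idx = range(len(emb))
    for subset in combinations(idx, k):
        query = centroid([emb[i] for i in subset])
        score = [sum(a * b for a, b in zip(query, e)) for e in emb]
        worst_in = min(score[i] for i in subset)
        best_out = max(score[i] for i in idx if i not in subset)
        if worst_in <= best_out + tol:
            return False
    return True

emb = circle_embeddings(5)
print(all_subsets_retrievable(emb, 1), all_subsets_retrievable(emb, 2))
```

With five points on a circle, top-1 retrieval succeeds but top-2 fails for non-adjacent pairs, because a point lying between the two members scores higher than either of them against their centroid.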

[AI-4] Deep Researcher with Sequential Plan Reflection and Candidates Crossover (Deep Researcher Reflect Evolve)

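[Quick read]: This paper addresses the limitations of current deep research agents on complex PhD-level research tasks, especially the siloed-knowledge problem of systems built on the Parallel Scaling paradigm. The core challenge is achieving dynamically adaptive research planning and efficient knowledge integration so as to produce rigorous, fact-dense reports. The key to the solution is a new architecture, Deep Researcher, with two core techniques. The first is Sequential Research Plan Refinement via Reflection, which lets the agent maintain a centralized Global Research Context so it can review progress, reason about the plan, and adjust its strategy at runtime. The second is a Candidates Crossover algorithm, which deploys multiple LLM candidates with varied parameters to explore a larger search space and merges their findings into the final report. The approach scores 46.21 on the DeepResearch Bench, ahead of the leading agents listed in the abstract, which the authors take as support for sequential scaling over the parallel self-consistency paradigm.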

Link: https://arxiv.org/abs/2601.20843
Authors: Saurav Prateek
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 11 pages, 6 figures, 2 tables, source code: this https URL

Abstract:This paper introduces a novel Deep Researcher architecture designed to generate detailed research reports on complex PhD level topics by addressing the inherent limitations of the Parallel Scaling paradigm. Our system utilizes two key innovations: Sequential Research Plan Refinement via Reflection and a Candidates Crossover algorithm. The sequential refinement process is demonstrated as an efficient method that allows the agent to maintain a centralized Global Research Context, enabling it to look back at current progress, reason about the research plan, and intelligently make changes at runtime. This dynamic adaptation contrasts with parallel approaches, which often suffer from siloed knowledge. The Candidates Crossover algorithm further enhances search efficiency by deploying multiple LLM candidates with varied parameters to explore a larger search space, with their findings synthesized to curate a comprehensive final research response. The process concludes with One Shot Report Generation, ensuring the final document is informed by a unified narrative and high fact density. Powered by the Gemini 2.5 Pro model, our Deep Researcher was evaluated on the DeepResearch Bench, a globally recognized benchmark of 100 doctoral level research tasks. Our architecture achieved an overall score of 46.21, demonstrating superior performance by surpassing leading deep research agents such as Claude Researcher, Nvidia AIQ Research Assistant, Perplexity Research, Kimi Researcher and Grok Deeper Search present on the DeepResearch Bench actively running leaderboard. This performance marginally exceeds our previous work, Static DRA, and reinforces the finding that sequential scaling consistently outperforms the parallel self consistency paradigm.

[AI-5] MemCtrl: Using MLLMs as Active Memory Controllers on Embodied Agents

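[Quick read]: This paper addresses the memory-management problem that limited context windows create for foundation models in embodied agents, specifically how to compress and retrieve memory efficiently in online, resource-constrained settings to support personalized decision making. Conventional approaches such as RAG (Retrieval-Augmented Generation) treat memory as large offline storage, which fits poorly with the real-time and compute constraints of embodied agents. The key to the solution is MemCtrl, a framework that augments Multimodal Large Language Models (MLLMs) with a trainable memory head $\mu$. The head acts as a gate that decides during exploration which observations or reflections to retain, update, or discard, which amounts to online memory pruning. Training $\mu$ with offline expert supervision or online reinforcement learning (RL) improves embodied task completion by around 16% on average, and by over 20% on specific instruction subsets.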

Link: https://arxiv.org/abs/2601.20831
Authors: Vishnu Sashank Dorbala,Dinesh Manocha
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:Foundation models rely on in-context learning for personalized decision making. The limited size of this context window necessitates memory compression and retrieval systems like RAG. These systems, however, often treat memory as large offline storage spaces, which is unfavorable for embodied agents that are expected to operate online under strict memory and compute constraints. In this work, we propose MemCtrl, a novel framework that uses Multimodal Large Language Models (MLLMs) for pruning memory online. MemCtrl augments MLLMs with a trainable memory head $\mu$ that acts as a gate to determine which observations or reflections to retain, update, or discard during exploration. We evaluate two ways of training $\mu$, 1) via an offline expert, and 2) via online RL, and observe significant improvement in overall embodied task completion ability on $\mu$-augmented MLLMs. In particular, on augmenting two low-performing MLLMs with MemCtrl on multiple subsets of the EmbodiedBench benchmark, we observe that $\mu$-augmented MLLMs show an improvement of around 16% on average, with over 20% on specific instruction subsets. Finally, we present a qualitative analysis of the memory fragments collected by $\mu$, noting the superior performance of $\mu$-augmented MLLMs on long and complex instruction types.

[AI-6] GNN Explanations that do not Explain and How to find Them

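[Quick read]: This paper addresses the problem that explanations produced by Self-explainable Graph Neural Networks (SE-GNNs) can be unrelated to the model's actual decision mechanism. Such "degenerate explanations" look plausible but do not reflect how the model infers labels from the input. Existing faithfulness metrics struggle to detect this failure mode, so potential misuse of sensitive attributes can go unnoticed. The key to the solution is a new faithfulness metric that reliably marks degenerate explanations as unfaithful, whether they were maliciously planted or emerged naturally, which makes SE-GNN explanations more reliable and auditable.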

Link: https://arxiv.org/abs/2601.20815
Authors: Steve Azzolin,Stefano Teso,Bruno Lepri,Andrea Passerini,Sagar Malhotra
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Explanations provided by Self-explainable Graph Neural Networks (SE-GNNs) are fundamental for understanding the model’s inner workings and for identifying potential misuse of sensitive attributes. Although recent works have highlighted that these explanations can be suboptimal and potentially misleading, a characterization of their failure cases is unavailable. In this work, we identify a critical failure of SE-GNN explanations: explanations can be unambiguously unrelated to how the SE-GNNs infer labels. We show that, on the one hand, many SE-GNNs can achieve optimal true risk while producing these degenerate explanations, and on the other, most faithfulness metrics can fail to identify these failure modes. Our empirical analysis reveals that degenerate explanations can be maliciously planted (allowing an attacker to hide the use of sensitive attributes) and can also emerge naturally, highlighting the need for reliable auditing. To address this, we introduce a novel faithfulness metric that reliably marks degenerate explanations as unfaithful, in both malicious and natural settings. Our code is available in the supplemental.

[AI-7] Reinforcement Learning via Self-Distillation

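[Quick read]: This paper addresses the severe credit-assignment bottleneck in Reinforcement Learning with Verifiable Rewards (RLVR): existing methods learn only from a scalar reward per attempt and ignore the rich textual feedback that many environments provide (such as runtime errors or judge evaluations), which could guide policy optimization at a finer grain. The key to the solution is Self-Distillation Policy Optimization (SDPO), which converts tokenized feedback into a dense learning signal without an external teacher or explicit reward model. SDPO treats the current model conditioned on the feedback as a self-teacher and distills its next-token predictions back into the policy, exploiting the model's ability to identify its own mistakes retrospectively in context.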

Link: https://arxiv.org/abs/2601.20802
Authors: Jonas Hübotter,Frederike Lübeck,Lejs Behric,Anton Baumann,Marco Bagatella,Daniel Marta,Ido Hakimi,Idan Shenfeld,Thomas Kleine Buening,Carlos Guestrin,Andreas Krause
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models are increasingly post-trained with reinforcement learning in verifiable domains such as code and math. Yet, current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. Many verifiable environments actually provide rich textual feedback, such as runtime errors or judge evaluations, that explain why an attempt failed. We formalize this setting as reinforcement learning with rich feedback and introduce Self-Distillation Policy Optimization (SDPO), which converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model. SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy. In this way, SDPO leverages the model’s ability to retrospectively identify its own mistakes in-context. Across scientific reasoning, tool use, and competitive programming on LiveCodeBench v6, SDPO improves sample efficiency and final accuracy over strong RLVR baselines. Notably, SDPO also outperforms baselines in standard RLVR environments that only return scalar feedback by using successful rollouts as implicit feedback for failed attempts. Finally, applying SDPO to individual questions at test time accelerates discovery on difficult binary-reward tasks, achieving the same discovery probability as best-of-k sampling or multi-turn conversations with 3x fewer attempts.
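
The distillation step described above can be pictured as a per-token divergence between two next-token distributions produced by the same model: one conditioned on the feedback (teacher) and one not (student). The sketch below is hypothetical; the KL direction, the averaging over tokens, and the toy logits are assumptions, and the paper's actual objective may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions; q must be positive where p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student) over a sequence of next-token logits."""
    losses = [kl_divergence(softmax(t), softmax(s))
              for t, s in zip(teacher_logits, student_logits)]
    return sum(losses) / len(losses)

# Toy sequence of two token positions over a three-token vocabulary.
teacher = [[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]]   # model conditioned on feedback
student = [[0.5, 0.5, 0.5], [0.1, 1.0, 0.2]]    # same model without feedback
print(distillation_loss(teacher, student))
```

Because every token position contributes its own term, the signal is dense, in contrast to a single scalar reward per attempt.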

[AI-8] Conditional PED-ANOVA: Hyperparameter Importance in Hierarchical Dynamic Search Spaces

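[Quick read]: This paper addresses how to estimate Hyperparameter Importance (HPI) accurately in conditional search spaces, where the presence or domain of a hyperparameter can depend on the values of other hyperparameters. PED-ANOVA estimates HPI efficiently in fixed search spaces, but in conditional settings it ignores dependencies between hyperparameters and produces misleading or uninterpretable results. The key to the solution is a new framework, condPED-ANOVA, which defines a conditional HPI for top-performing regions and derives a closed-form estimator that captures conditional activation and domain changes, so the importance estimates reflect the true conditional structure.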

Link: https://arxiv.org/abs/2601.20800
Authors: Kaito Baba,Yoshihiko Ozaki,Shuhei Watanabe
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 9 figures

Abstract:We propose conditional PED-ANOVA (condPED-ANOVA), a principled framework for estimating hyperparameter importance (HPI) in conditional search spaces, where the presence or domain of a hyperparameter can depend on other hyperparameters. Although the original PED-ANOVA provides a fast and efficient way to estimate HPI within the top-performing regions of the search space, it assumes a fixed, unconditional search space and therefore cannot properly handle conditional hyperparameters. To address this, we introduce a conditional HPI for top-performing regions and derive a closed-form estimator that accurately reflects conditional activation and domain changes. Experiments show that naive adaptations of existing HPI estimators yield misleading or uninterpretable importance estimates in conditional settings, whereas condPED-ANOVA consistently provides meaningful importances that reflect the underlying conditional structure.

[AI-9] REASON: Accelerating Probabilistic Logical Reasoning for Scalable Neuro-Symbolic Intelligence HPCA

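[Quick read]: This paper addresses the inefficiency of probabilistic logical reasoning in neuro-symbolic AI systems. The bottleneck lies in symbolic and probabilistic inference, which exhibits irregular control flow, low arithmetic intensity, uncoalesced memory accesses, and poor utilization of CPU and GPU hardware. The key to the solution is the REASON framework, which works at three levels. At the algorithm level, it uses a unified directed acyclic graph (DAG) representation that captures the common structure of symbolic and probabilistic models, together with adaptive pruning and regularization. At the architecture level, it provides a reconfigurable, tree-based processing fabric optimized for irregular traversal, symbolic deduction, and probabilistic aggregation. At the system level, it integrates tightly with GPU streaming multiprocessors through a programmable interface and a multi-level pipeline that orchestrates compositional execution. Under a TSMC 28 nm node, REASON reports 12–50x speedup and 310–681x energy-efficiency gains, supporting the case that targeted acceleration of probabilistic logical reasoning is critical for practical, scalable neuro-symbolic AI.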

Link: https://arxiv.org/abs/2601.20784
Authors: Zishen Wan,Che-Kai Liu,Jiayi Qian,Hanchen Yang,Arijit Raychowdhury,Tushar Krishna
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: 16 pages, 13 figures, 5 tables, 2026 IEEE International Symposium on High-Performance Computer Architecture (HPCA)

Abstract:Neuro-symbolic AI systems integrate neural perception with symbolic reasoning to enable data-efficient, interpretable, and robust intelligence beyond purely neural models. Although this compositional paradigm has shown superior performance in domains such as reasoning, planning, and verification, its deployment remains challenging due to severe inefficiencies in symbolic and probabilistic inference. Through systematic analysis of representative neuro-symbolic workloads, we identify probabilistic logical reasoning as the inefficiency bottleneck, characterized by irregular control flow, low arithmetic intensity, uncoalesced memory accesses, and poor hardware utilization on CPUs and GPUs. This paper presents REASON, an integrated acceleration framework for probabilistic logical reasoning in neuro-symbolic AI. REASON introduces a unified directed acyclic graph representation that captures common structure across symbolic and probabilistic models, coupled with adaptive pruning and regularization. At the architecture level, REASON features a reconfigurable, tree-based processing fabric optimized for irregular traversal, symbolic deduction, and probabilistic aggregation. At the system level, REASON is tightly integrated with GPU streaming multiprocessors through a programmable interface and multi-level pipeline that efficiently orchestrates compositional execution. Evaluated across six neuro-symbolic workloads, REASON achieves 12-50x speedup and 310-681x energy efficiency over desktop and edge GPUs under TSMC 28 nm node. REASON enables real-time probabilistic logical reasoning, completing end-to-end tasks in 0.8 s with 6 mm$^2$ area and 2.12 W power, demonstrating that targeted acceleration of probabilistic logical reasoning is critical for practical and scalable neuro-symbolic AI and positioning REASON as a foundational system architecture for next-generation cognitive intelligence.

[AI-10] Independence of Approximate Clones

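[Quick read]: This paper addresses the limited practical reach of the "independence of clones" axiom from social choice theory. Perfect clones, meaning two candidates that every voter ranks adjacently, are extremely rare in real political elections, so the paper introduces "approximate clones" and quantifies how close two candidates are to being clones in a preference profile. The key to the solution is proposing two measures of this proximity and systematically analyzing whether voting rules known to be independent of perfect clones (such as IRV, Ranked Pairs, and Schulze) are also independent of approximate clones. With four or more candidates, none of these rules is independent of approximate clones in general, although a more positive result holds for three candidates. An empirical study shows that approximate clones are common in some real-world data and that the closer two candidates are to being perfect clones, the less likely removing one is to change the outcome, especially for rules that are independent of perfect clones.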

Link: https://arxiv.org/abs/2601.20779
Authors: Théo Delemazure
Affiliations: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments:

Abstract:In an ordinal election, two candidates are said to be perfect clones if every voter ranks them adjacently. The independence of clones axiom then states that removing one of the two clones should not change the election outcome. This axiom has been extensively studied in social choice theory, and several voting rules are known to satisfy it (such as IRV, Ranked Pairs and Schulze). However, perfect clones are unlikely to occur in practice, especially for political elections with many voters. In this work, we study different notions of approximate clones in ordinal elections. Informally, two candidates are approximate clones in a preference profile if they are close to being perfect clones. We discuss two measures to quantify this proximity, and we show under which conditions the voting rules that are known to be independent of clones are also independent of approximate clones. In particular, we show that for elections with at least four candidates, none of these rules are independent of approximate clones in the general case. However, we find a more positive result for the case of three candidates. Finally, we conduct an empirical study of approximate clones and independence of approximate clones based on three real-world datasets: votes in local Scottish elections, votes in mini-jury deliberations, and votes of judges in figure skating competitions. We find that approximate clones are common in some contexts, and that the closer two candidates are to being perfect clones, the less likely their removal is to change the election outcome, especially for voting rules that are independent of perfect clones.
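
The definition of perfect clones translates directly into a check over a preference profile. In the sketch below, `are_perfect_clones` follows the definition given in the abstract, while `adjacency_rate` is a hypothetical proximity score added for illustration; it is not claimed to be either of the paper's two measures.

```python
def are_perfect_clones(profile, a, b):
    """True if every voter ranks candidates a and b in adjacent positions."""
    return all(abs(r.index(a) - r.index(b)) == 1 for r in profile)

def adjacency_rate(profile, a, b):
    """Hypothetical proximity score: fraction of voters ranking a and b adjacently."""
    adjacent = sum(abs(r.index(a) - r.index(b)) == 1 for r in profile)
    return adjacent / len(profile)

# Each ranking lists candidates from most to least preferred (toy data).
profile = [
    ["a", "b", "c", "d"],
    ["c", "b", "a", "d"],
    ["d", "a", "c", "b"],
]
print(are_perfect_clones(profile, "a", "b"), adjacency_rate(profile, "a", "b"))
```

Here `a` and `b` are adjacent for two of three voters, so they are not perfect clones but could reasonably be called close to it under a score of this kind.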

[AI-11] HESTIA: A Hessian-Guided Differentiable Quantization-Aware Training Framework for Extremely Low-Bit LLMs

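[Quick read]: This paper addresses the gradient mismatch that early hard quantization causes when training large language models (LLMs) under extremely low-bit quantization. Conventional quantization-aware training (QAT) applies hard rounding and the straight-through estimator (STE) from the start of training, which discretizes the optimization landscape prematurely and hinders effective optimization of the quantized model. The key to the solution is the Hestia framework, which replaces the rigid step function with a temperature-controlled softmax relaxation so that gradients keep flowing early in training. It also uses a tensor-wise Hessian trace as a lightweight curvature signal to drive fine-grained temperature annealing, giving sensitivity-aware progressive quantization that recovers representational capacity and improves the training stability and performance of 1.58-bit LLMs.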

Link: https://arxiv.org/abs/2601.20745
Authors: Guoan Wang,Feiyu Wang,Zongwei Lv,Yikun Zong,Tong Yang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 2 figures

Abstract:As large language models (LLMs) continue to scale, deployment is increasingly bottlenecked by the memory wall, motivating a shift toward extremely low-bit quantization. However, most quantization-aware training (QAT) methods apply hard rounding and the straight-through estimator (STE) from the beginning of the training, which prematurely discretizes the optimization landscape and induces persistent gradient mismatch between latent weights and quantized weights, hindering effective optimization of quantized models. To address this, we propose Hestia, a Hessian-guided differentiable QAT framework for extremely low-bit LLMs, which replaces the rigid step function with a temperature-controlled softmax relaxation to maintain gradient flow early in training while progressively hardening quantization. Furthermore, Hestia leverages a tensor-wise Hessian trace metric as a lightweight curvature signal to drive fine-grained temperature annealing, enabling sensitivity-aware discretization across the model. Evaluations on Llama-3.2 show that Hestia consistently outperforms existing ternary QAT baselines, yielding average zero-shot improvements of 5.39% and 4.34% for the 1B and 3B models. These results indicate that Hessian-guided relaxation effectively recovers representational capacity, establishing a more robust training path for 1.58-bit LLMs. The code is available at this https URL.
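
A temperature-controlled softmax relaxation of ternary rounding can be sketched as follows. This is a minimal illustration under assumptions: the squared-distance logits, the unscaled grid `{-1, 0, 1}`, and the function names are chosen for clarity and are not taken from the Hestia code; weight scaling and the Hessian-driven annealing schedule are omitted.

```python
import math

LEVELS = (-1.0, 0.0, 1.0)  # ternary grid (per-tensor scaling omitted)

def soft_quantize(w, temperature):
    """Softmax-weighted average of the ternary levels.
    High temperature gives a smooth mapping; low temperature approaches rounding."""
    logits = [-(w - q) ** 2 / temperature for q in LEVELS]
    m = max(logits)                                  # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return sum(q * e / z for q, e in zip(LEVELS, exps))

def hard_quantize(w):
    """Nearest-level rounding, the non-differentiable step being relaxed."""
    return min(LEVELS, key=lambda q: abs(w - q))

for t in (10.0, 1.0, 0.1, 0.001):
    print(t, round(soft_quantize(0.8, t), 4), hard_quantize(0.8))
```

Lowering the temperature moves the soft output toward the hard-rounded value, which is the "progressively hardening" behavior the abstract describes; a per-tensor schedule would decide how fast each tensor cools.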

[AI-12] Implementing Metric Temporal Answer Set Programming

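[Quick read]: This paper tackles the scalability problem that fine-grained timing constraints, such as durations and deadlines, create in Metric Answer Set Programming (Metric ASP), where they significantly exacerbate ASP's grounding bottleneck. The key to the solution is to use extensions of ASP with difference constraints, a simplified form of linear constraints, so that time-related aspects are handled outside the ASP solving process. This decouples Metric ASP from the granularity of time and makes the resulting approach insensitive to time precision, improving solving efficiency and scalability.

Link: https://arxiv.org/abs/2601.20735
Authors: Arvid Becker, Pedro Cabalar, Martin Diéguez, Susana Hahn, Javier Romero, Torsten Schaub
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: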

Abstract:We develop a computational approach to Metric Answer Set Programming (ASP) to allow for expressing quantitative temporal constraints, like durations and deadlines. A central challenge is to maintain scalability when dealing with fine-grained timing constraints, which can significantly exacerbate ASP’s grounding bottleneck. To address this issue, we leverage extensions of ASP with difference constraints, a simplified form of linear constraints, to handle time-related aspects externally. Our approach effectively decouples metric ASP from the granularity of time, resulting in a solution that is unaffected by time precision.
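As background on why difference constraints are insensitive to time precision, the sketch below solves constraints of the form x_v - x_u <= k with the textbook Bellman-Ford reduction. It is a generic illustration, not the authors' system; the function name and the `(u, v, k)` encoding are assumptions made for this example.

```python
def solve_difference_constraints(variables, constraints):
    """Solve constraints x_v - x_u <= k given as (u, v, k) triples.

    Returns a dict with one satisfying assignment, or None if the system
    is infeasible (the constraint graph contains a negative cycle).
    """
    # Starting every distance at 0 simulates a virtual source that has a
    # zero-weight edge to each variable.
    dist = {v: 0 for v in variables}
    for _ in range(len(variables)):
        changed = False
        for u, v, k in constraints:
            if dist[u] + k < dist[v]:
                dist[v] = dist[u] + k
                changed = True
        if not changed:
            return dist
    return None  # still relaxing after |V| passes: negative cycle


# A task with a duration of at least 3 and a deadline of 5 time units:
#   end - start >= 3  is encoded as  start - end <= -3
#   end - start <= 5
task = [("end", "start", -3), ("start", "end", 5)]
```

Scaling the bounds from 3 and 5 to 3000 and 5000 (a finer time unit) leaves the number of variables and constraints unchanged, whereas a grounded encoding with one atom per time point would grow with the horizon. Solutions are only determined up to a common shift, so negative values are expected.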

[AI-13] Adapting the Behavior of Reinforcement Learning Agents to Changing Action Spaces and Reward Functions

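[Quick read]: This paper addresses the adaptability of reinforcement learning (RL) agents in non-stationary environments, in particular when the reward function shifts or the available action space expands dynamically, settings that standard Q-learning handles poorly and that can trigger catastrophic forgetting. The key to the solution is the MORPHIN framework, which combines concept drift detection with dynamic adjustment of the learning-rate and exploration hyperparameters to adapt online without full retraining, while preserving prior policy knowledge to avoid performance degradation. Experiments on a Gridworld benchmark and a traffic signal control simulation show faster convergence and continuous adaptation compared with a standard Q-learning baseline, with learning efficiency improved by up to 1.7x.

Link: https://arxiv.org/abs/2601.20714
Authors: Raul de la Rosa, Ivana Dusparic, Nicolas Cardozo
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: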

Abstract:Reinforcement Learning (RL) agents often struggle in real-world applications where environmental conditions are non-stationary, particularly when reward functions shift or the available action space expands. This paper introduces MORPHIN, a self-adaptive Q-learning framework that enables on-the-fly adaptation without full retraining. By integrating concept drift detection with dynamic adjustments to learning and exploration hyperparameters, MORPHIN adapts agents to changes in both the reward function and on-the-fly expansions of the agent’s action space, while preserving prior policy knowledge to prevent catastrophic forgetting. We validate our approach using a Gridworld benchmark and a traffic signal control simulation. The results demonstrate that MORPHIN achieves superior convergence speed and continuous adaptation compared to a standard Q-learning baseline, improving learning efficiency by up to 1.7x.

[AI-14] Beyond GEMM-Centric NPUs: Enabling Efficient Diffusion LLM Sampling

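[Quick read]: This paper addresses the high latency of the sampling phase in Diffusion Large Language Model (dLLM) inference. Sampling can account for up to 70% of total inference latency, mainly because of memory loads and writes of vocabulary-wide logits, reduction-based token selection, and iterative masked updates, whose irregular memory accesses conventional neural processing units (NPUs) handle inefficiently. The key to the solution is to identify a set of critical instructions that dLLM sampling requires and to design dedicated optimizations for them: lightweight non-GEMM vector primitives, in-place memory reuse, and a decoupled mixed-precision memory hierarchy. Together these deliver up to a 2.53x speedup over the NVIDIA RTX A6000 GPU under an equivalent technology node.

Link: https://arxiv.org/abs/2601.20706
Authors: Binglei Lou, Haoran Wu, Yao Lai, Jiayi Nie, Can Xiao, Xuan Guo, Rika Antonova, Robert Mullins, Aaron Zhao
Affiliation: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: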

Abstract:Diffusion Large Language Models (dLLMs) introduce iterative denoising to enable parallel token generation, but their sampling phase displays fundamentally different characteristics compared to GEMM-centric transformer layers. Profiling on modern GPUs reveals that sampling can account for up to 70% of total model inference latency, primarily due to substantial memory loads and writes from vocabulary-wide logits, reduction-based token selection, and iterative masked updates. These processes demand large on-chip SRAM and involve irregular memory accesses that conventional NPUs struggle to handle efficiently. To address this, we identify a set of critical instructions that an NPU architecture must specifically optimize for dLLM sampling. Our design employs lightweight non-GEMM vector primitives, in-place memory reuse strategies, and a decoupled mixed-precision memory hierarchy. Together, these optimizations deliver up to a 2.53x speedup over the NVIDIA RTX A6000 GPU under an equivalent nm technology node. We also open-source our cycle-accurate simulation and post-synthesis RTL verification code, confirming functional equivalence with current dLLM PyTorch implementations.
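The three operations named in the abstract (vocabulary-wide logit reads, reduction-based token selection, and iterative masked updates) can be pictured with a small sketch of one confidence-based unmasking step. This is a generic scheme assumed for illustration; the `MASK` sentinel, the function names, and the top-k commit rule are hypothetical and are not taken from the paper or from any particular dLLM implementation.

```python
import math

MASK = -1  # hypothetical sentinel for a position that is still masked


def softmax_max(logits):
    """Return (argmax index, its softmax probability) using max/sum reductions."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    best = max(range(len(logits)), key=logits.__getitem__)
    return best, exps[best] / total


def unmask_step(tokens, logits_per_pos, k):
    """Commit the k most confident predictions among masked positions, in place."""
    candidates = []
    for pos, tok in enumerate(tokens):
        if tok == MASK:
            best, conf = softmax_max(logits_per_pos[pos])  # reads a full logit row
            candidates.append((conf, pos, best))
    candidates.sort(reverse=True)  # most confident first
    for _conf, pos, best in candidates[:k]:
        tokens[pos] = best  # masked update
    return tokens
```

Every step touches a full row of vocabulary logits per masked position and performs only reductions and scattered writes, with no matrix multiplication, which is the kind of workload the abstract contrasts with GEMM-centric layers.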

[AI-15] Enterprise Resource Planning Using Multi-type Transformers in Ferro-Titanium Industry

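[Quick read]: This paper addresses combinatorial optimization problems, specifically the Job-Shop Scheduling Problem (JSP) and the Knapsack Problem (KP), which are widely used in operations research, logistics, and enterprise resource planning (ERP) systems and often require sophisticated algorithms to reach near-optimal solutions within practical time limits. The key to the solution is the Multi-Type Transformer (MTT) architecture with multi-type attention, which handles JSP and KP benchmark datasets of different sizes in a unified framework and is validated on a real application in the ferro-titanium industry. To the authors' knowledge, this is the first application of multi-type transformers in real manufacturing, and it showcases the potential of multi-type attention for industrial optimization.

Link: https://arxiv.org/abs/2601.20696
Authors: Samira Yazdanpourmoghadam, Mahan Balal Pour, Vahid Partovi Nia
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: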

Abstract:Combinatorial optimization problems such as the Job-Shop Scheduling Problem (JSP) and Knapsack Problem (KP) are fundamental challenges in operations research, logistics, and enterprise resource planning (ERP). These problems often require sophisticated algorithms to achieve near-optimal solutions within practical time constraints. Recent advances in deep learning have introduced transformer-based architectures as promising alternatives to traditional heuristics and metaheuristics. We leverage the Multi-Type Transformer (MTT) architecture to address these benchmarks in a unified framework. We present an extensive experimental evaluation across standard benchmark datasets for JSP and KP, demonstrating that MTT achieves competitive performance on different sizes of these benchmark problems. We showcase the potential of multi-type attention on a real application in the Ferro-Titanium industry. To the best of our knowledge, we are the first to apply multi-type transformers in real manufacturing.

[AI-16] Learning Contextual Runtime Monitors for Safe AI-Based Autonomy

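[Quick read]: This paper addresses the safety problems caused by performance degradation of AI-based control ensembles in unfamiliar environments. Traditional ensemble methods fuse the outputs of multiple controllers by averaging or voting, which often dilutes the specialized strengths that individual controllers show in different operating contexts. The key to the solution is to recast the design of safe control ensembles as a contextual monitoring problem: a learned monitor continuously observes the system's context and selects the controller best suited to the current conditions, thereby exploiting controller diversity. The monitor is learned with techniques from contextual multi-armed bandits, comes with theoretical safety guarantees and better controller utilization, and shows significant gains in safety and performance in simulated autonomous driving scenarios.

Link: https://arxiv.org/abs/2601.20666
Authors: Alejandro Luque-Cerpa, Mengyuan Wang, Emil Carlsson, Sanjit A. Seshia, Devdatt Dubhashi, Hazem Torfah
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: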

Abstract:We introduce a novel framework for learning context-aware runtime monitors for AI-based control ensembles. Machine-learning (ML) controllers are increasingly deployed in (autonomous) cyber-physical systems because of their ability to solve complex decision-making tasks. However, their accuracy can degrade sharply in unfamiliar environments, creating significant safety concerns. Traditional ensemble methods aim to improve robustness by averaging or voting across multiple controllers, yet this often dilutes the specialized strengths that individual controllers exhibit in different operating contexts. We argue that, rather than blending controller outputs, a monitoring framework should identify and exploit these contextual strengths. In this paper, we reformulate the design of safe AI-based control ensembles as a contextual monitoring problem. A monitor continuously observes the system’s context and selects the controller best suited to the current conditions. To achieve this, we cast monitor learning as a contextual learning task and draw on techniques from contextual multi-armed bandits. Our approach comes with two key benefits: (1) theoretical safety guarantees during controller selection, and (2) improved utilization of controller diversity. We validate our framework in two simulated autonomous driving scenarios, demonstrating significant improvements in both safety and performance compared to non-contextual baselines.

[AI-17] Investigating the Development of Task-Oriented Communication in Vision-Language Models

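[Quick read]: This paper asks whether agents based on large language models (LLMs) can develop task-oriented communication protocols that differ from natural language in collaborative reasoning tasks, and how such protocols behave in terms of Efficiency and Covertness. The key to the approach is a referential-game framework in which vision-language model (VLM) agents communicate in a controlled setting, allowing the resulting language variants to be evaluated quantitatively for task relevance, conciseness, and interpretability. Experiments show that VLM agents can form effective, task-adapted communication patterns and can also develop covert protocols that humans and external agents find hard to interpret, which highlights that task-oriented communication brings transparency and control risks alongside performance gains.

Link: https://arxiv.org/abs/2601.20641
Authors: Boaz Carmeli, Orr Paradise, Shafi Goldwasser, Yonatan Belinkov, Ron Meir
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: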

Abstract:We investigate whether LLM-based agents can develop task-oriented communication protocols that differ from standard natural language in collaborative reasoning tasks. Our focus is on two core properties such task-oriented protocols may exhibit: Efficiency – conveying task-relevant information more concisely than natural language, and Covertness – becoming difficult for external observers to interpret, raising concerns about transparency and control. To investigate these aspects, we use a referential-game framework in which vision-language model (VLM) agents communicate, providing a controlled, measurable setting for evaluating language variants. Experiments show that VLMs can develop effective, task-adapted communication patterns. At the same time, they can develop covert protocols that are difficult for humans and external agents to interpret. We also observe spontaneous coordination between similar models without explicitly shared protocols. These findings highlight both the potential and the risks of task-oriented communication, and position referential games as a valuable testbed for future work in this area.

[AI-18] Agent Benchmarks Fail Public Sector Requirements

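[Quick read]: This paper addresses the lack of suitable evaluation benchmarks for deploying large language model (LLM) agents in the public sector: existing benchmarks do not adequately reflect the stringent legal, procedural, and structural requirements of public-sector institutions. The key to the solution is a set of four criteria derived from the public administration literature: benchmarks must be process-based, realistic, and public-sector-specific, and must report metrics that reflect the unique requirements of the public sector. Using an expert-validated LLM-assisted pipeline, the authors analyse more than 1,300 benchmark papers and find that no single benchmark meets all four criteria. They call on researchers to develop public-sector-relevant benchmarks and on public-sector officials to apply these criteria when evaluating agentic use cases.

Link: https://arxiv.org/abs/2601.20617
Authors: Jonathan Rystrøm, Chris Schmitz, Karolina Korgul, Jan Batzner, Chris Russell
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Forthcoming @ IASEAI 2026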

Abstract:Deploying Large Language Model-based agents (LLM agents) in the public sector requires assuring that they meet the stringent legal, procedural, and structural requirements of public-sector institutions. Practitioners and researchers often turn to benchmarks for such assessments. However, it remains unclear what criteria benchmarks must meet to ensure they adequately reflect public-sector requirements, or how many existing benchmarks do so. In this paper, we first define such criteria based on a first-principles survey of public administration literature: benchmarks must be process-based, realistic, public-sector-specific and report metrics that reflect the unique requirements of the public sector. We analyse more than 1,300 benchmark papers for these criteria using an expert-validated LLM-assisted pipeline. Our results show that no single benchmark meets all of the criteria. Our findings provide a call to action for both researchers to develop public sector-relevant benchmarks and for public-sector officials to apply these criteria when evaluating their own agentic use cases.

[AI-19] WFR-MFM: One-Step Inference for Dynamic Unbalanced Optimal Transport

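[Quick read]: This paper addresses the challenge in single-cell biology of reconstructing dynamical evolution from limited observations, and in particular how to efficiently solve the dynamic unbalanced optimal transport problem that couples transport with mass variation. The key to the solution is a mean-flow framework that summarizes transport and mass-growth dynamics over arbitrary time intervals through mean velocity and mass-growth fields, enabling one-step generation without trajectory simulation and greatly speeding up inference. Building on this, the authors develop Wasserstein-Fisher-Rao Mean Flow Matching (WFR-MFM), which solves dynamic unbalanced optimal transport under the Wasserstein-Fisher-Rao geometry. On synthetic and real single-cell RNA sequencing data it achieves inference that is orders of magnitude faster than existing baselines while maintaining high predictive accuracy, and it supports large-scale perturbation response prediction.

Link: https://arxiv.org/abs/2601.20606
Authors: Xinyu Wang, Ruoyu Wang, Qiangwei Peng, Peijie Zhou, Tiejun Li
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
Comments: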

Abstract:Reconstructing dynamical evolution from limited observations is a fundamental challenge in single-cell biology, where dynamic unbalanced optimal transport provides a principled framework for modeling coupled transport and mass variation. However, existing approaches rely on trajectory simulation at inference time, making inference a key bottleneck for scalable applications. In this work, we propose a mean-flow framework for unbalanced flow matching that summarizes both transport and mass-growth dynamics over arbitrary time intervals using mean velocity and mass-growth fields, enabling fast one-step generation without trajectory simulation. To solve dynamic unbalanced optimal transport under the Wasserstein-Fisher-Rao geometry, we further build on this framework to develop Wasserstein-Fisher-Rao Mean Flow Matching (WFR-MFM). Across synthetic and real single-cell RNA sequencing datasets, WFR-MFM achieves orders-of-magnitude faster inference than a range of existing baselines while maintaining high predictive accuracy, and enables efficient perturbation response prediction on large synthetic datasets with thousands of conditions.

[AI-20] Dialogical Reasoning Across AI Architectures: A Multi-Model Framework for Testing AI Alignment Strategies

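[Quick read]: This paper challenges the prevailing treatment of AI alignment as a control problem and explores reframing it as a relationship problem. The core challenge is how to achieve collaborative reasoning with substantive dialogue across multiple AI systems so that alignment strategies become more robust and interpretable. The key to the solution is a methodological framework that operationalizes and empirically tests Viral Collaborative Wisdom (VCW), which draws on Peace Studies traditions of interest-based negotiation, conflict transformation, and commons governance. A structured multi-model dialogue design assigns four roles (Proposer, Responder, Monitor, Translator) to large language models of different architectures (Claude, Gemini, and GPT-4o) across six experimental conditions, producing 576,822 characters of exchange. The results show that the AI systems engage meaningfully with Peace Studies concepts, raise complementary objections from their respective perspectives, and generate emergent insights absent from the initial framing, such as the synthesis of "VCW as transitional framework", offering a replicable way to stress-test alignment proposals and preliminary empirical evidence.

Link: https://arxiv.org/abs/2601.20604
Authors: Gray Cox
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 23 pages, 5 tables, 5 appendices. Code and data: this https URL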

Abstract:This paper introduces a methodological framework for empirically testing AI alignment strategies through structured multi-model dialogue. Drawing on Peace Studies traditions - particularly interest-based negotiation, conflict transformation, and commons governance - we operationalize Viral Collaborative Wisdom (VCW), an approach that reframes alignment from a control problem to a relationship problem developed through dialogical reasoning. Our experimental design assigns four distinct roles (Proposer, Responder, Monitor, Translator) to different AI systems across six conditions, testing whether current large language models can engage substantively with complex alignment frameworks. Using Claude, Gemini, and GPT-4o, we conducted 72 dialogue turns totaling 576,822 characters of structured exchange. Results demonstrate that AI systems can engage meaningfully with Peace Studies concepts, surface complementary objections from different architectural perspectives, and generate emergent insights not present in initial framings - including the novel synthesis of “VCW as transitional framework.” Cross-architecture patterns reveal that different models foreground different concerns: Claude emphasized verification challenges, Gemini focused on bias and scalability, and GPT-4o highlighted implementation barriers. The framework provides researchers with replicable methods for stress-testing alignment proposals before implementation, while the findings offer preliminary evidence about AI capacity for the kind of dialogical reasoning VCW proposes. We discuss limitations, including the observation that dialogues engaged more with process elements than with foundational claims about AI nature, and outline directions for future research including human-AI hybrid protocols and extended dialogue studies.

[AI-21] Regularized Gradient Temporal-Difference Learning

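[Quick read]: This paper addresses the convergence problem that arises in gradient temporal-difference (GTD) learning for off-policy policy evaluation when the feature interaction matrix (FIM) is singular. Existing convergence analyses rely on the strong assumption that the FIM is nonsingular, yet in practice it can be singular, causing instability or degraded performance. The key to the solution is to reformulate the mean-square projected Bellman error (MSPBE) minimization problem with a regularized optimization objective, which yields a regularized GTD algorithm (R-GTD) that is guaranteed to converge to a unique solution even when the FIM is singular, with theoretical convergence guarantees and explicit error bounds.

Link: https://arxiv.org/abs/2601.20599
Authors: Hyunjun Na, Donghwan Lee
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 27 pages, 8 figures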

Abstract:Gradient temporal-difference (GTD) learning algorithms are widely used for off-policy policy evaluation with function approximation. However, existing convergence analyses rely on the restrictive assumption that the so-called feature interaction matrix (FIM) is nonsingular. In practice, the FIM can become singular and lead to instability or degraded performance. In this paper, we propose a regularized optimization objective by reformulating the mean-square projected Bellman error (MSPBE) minimization. This formulation naturally yields a regularized GTD algorithm, referred to as R-GTD, which guarantees convergence to a unique solution even when the FIM is singular. We establish theoretical convergence guarantees and explicit error bounds for the proposed method, and validate its effectiveness through empirical experiments.

[AI-22] Ranking-aware Reinforcement Learning for Ordinal Ranking ICASSP2026

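[Quick read]: This paper addresses the difficulty that conventional methods have in modeling the inherent ordinal dependencies of ordinal regression and ranking tasks. The key to the solution is a reinforcement learning framework called Ranking-Aware Reinforcement Learning (RARL), whose unified objective integrates regression and Learning-to-Rank (L2R) so that the two tasks improve each other. A ranking-aware verifiable reward jointly assesses regression precision and ranking accuracy, allowing model parameters to be updated directly through policy optimization. To further improve training, Response Mutation Operations (RMO) inject controlled noise to strengthen exploration and avoid stagnation at saddle points.

Link: https://arxiv.org/abs/2601.20585
Authors: Aiming Hao, Chen Zhu, Jiashu Zhu, Jiahong Wu, Xiangxiang Chu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to ICASSP2026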

Abstract:Ordinal regression and ranking are challenging due to inherent ordinal dependencies that conventional methods struggle to model. We propose Ranking-Aware Reinforcement Learning (RARL), a novel RL framework that explicitly learns these relationships. At its core, RARL features a unified objective that synergistically integrates regression and Learning-to-Rank (L2R), enabling mutual improvement between the two tasks. This is driven by a ranking-aware verifiable reward that jointly assesses regression precision and ranking accuracy, facilitating direct model updates via policy optimization. To further enhance training, we introduce Response Mutation Operations (RMO), which inject controlled noise to improve exploration and prevent stagnation at saddle points. The effectiveness of RARL is validated through extensive experiments on three distinct benchmarks.

[AI-23] Inequality in Congestion Games with Learning Agents AAMAS2026

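[Quick read]: This paper asks how expanding a transport network, while improving overall efficiency, can also increase social inequality. Disparities arise not only from the structure of the network itself but also from differences in how commuters adapt to new transport resources, for example through different learning rates. The key to the approach is the Price of Learning (PoL), a measure of inefficiency during learning, combined with modeling commuters as reinforcement learning agents with different learning rates. The simulations show that faster learners benefit disproportionately from newly opened routes before others adapt, which amplifies inequality, so transport policy needs to consider both equilibrium outcomes and the dynamics of adaptation to balance efficiency and fairness.

Link: https://arxiv.org/abs/2601.20578
Authors: Dimitris Michailidis, Sennay Ghebreab, Fernando P. Santos
Affiliation: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments: Full version of the extended abstract version appearing in Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)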

Abstract:Who benefits from expanding transport networks? While designed to improve mobility, such interventions can also create inequality. In this paper, we show that disparities arise not only from the structure of the network itself but also from differences in how commuters adapt to it. We model commuters as reinforcement learning agents who adapt their travel choices at different learning rates, reflecting unequal access to resources and information. To capture potential efficiency-fairness tradeoffs, we introduce the Price of Learning (PoL), a measure of inefficiency during learning. We analyze both a stylized network – inspired by the well-known Braess’s paradox, yet with two-source nodes – and an abstraction of a real-world metro system (Amsterdam). Our simulations show that network expansions can simultaneously increase efficiency and amplify inequality, especially when faster learners disproportionately benefit from new routes before others adapt. These results highlight that transport policies must account not only for equilibrium outcomes but also for the heterogeneous ways commuters adapt, since both shape the balance between efficiency and fairness.
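For readers unfamiliar with the Braess's paradox mentioned in the abstract, the sketch below computes route travel times on the classic single-source textbook network (4000 drivers, two congestible edges, two fixed 45-minute edges, and a free shortcut). It is background only: it is not the two-source network studied in the paper, it does not compute the Price of Learning, and the function name and parameters are assumptions for this example.

```python
def route_times(n_top, n_bottom, n_shortcut, drivers_per_minute=100, fixed=45):
    """Travel times (minutes) on the classic Braess network for a driver split.

    Routes: top = S->A->E, bottom = S->B->E, shortcut = S->A->B->E.
    Edges S->A and B->E are congestible (load / drivers_per_minute),
    edges A->E and S->B take a fixed time, and the A->B shortcut is free.
    """
    t_sa = (n_top + n_shortcut) / drivers_per_minute
    t_be = (n_bottom + n_shortcut) / drivers_per_minute
    return {
        "top": t_sa + fixed,
        "bottom": fixed + t_be,
        "shortcut": t_sa + t_be,
    }


before = route_times(2000, 2000, 0)  # no shortcut: drivers split evenly
after = route_times(0, 0, 4000)      # shortcut open: everyone uses it
```

In this textbook instance the even split gives 65 minutes per driver, while with the shortcut open every driver takes 80 minutes and no one can improve by switching alone (the top and bottom routes would take 85). The paper's setting adds heterogeneous learning rates on top of this kind of network, so who reaches the new route first becomes part of the outcome.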

[AI-24] Robust Distributed Learning under Resource Constraints: Decentralized Quantile Estimation via (Asynchronous) ADMM

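[Quick read]: This paper addresses the trade-off among communication efficiency, robustness to data corruption, and memory footprint in decentralized learning on resource-constrained edge devices. Existing gossip-based algorithms are communication-efficient but not robust to outliers or corrupted data, and asynchronous decentralized ADMM methods can estimate robust statistics such as the median but need memory that grows with node degree, which makes them impractical under tight memory limits. The paper proposes AsylADMM, a new gossip algorithm designed for asynchronous updates whose core innovation is that each node maintains only two variables, sharply reducing memory needs while supporting median and quantile estimation, quantile-based trimming, geometric median estimation, and depth-based trimming. The synchronous variant is analyzed to establish theoretical guarantees, experiments show fast convergence of the asynchronous algorithm, and quantile-based trimming empirically outperforms existing rank-based methods.

Link: https://arxiv.org/abs/2601.20571
Authors: Anna van Elst, Igor Colin, Stephan Clémençon
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: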

Abstract:Specifications for decentralized learning on resource-constrained edge devices require algorithms that are communication-efficient, robust to data corruption, and lightweight in memory usage. While state-of-the-art gossip-based methods satisfy the first requirement, achieving robustness remains challenging. Asynchronous decentralized ADMM-based methods have been explored for estimating the median, a statistical centrality measure that is notoriously more robust than the mean. However, existing approaches require memory that scales with node degree, making them impractical when memory is limited. In this paper, we propose AsylADMM, a novel gossip algorithm for decentralized median and quantile estimation, primarily designed for asynchronous updates and requiring only two variables per node. We analyze a synchronous variant of AsylADMM to establish theoretical guarantees and empirically demonstrate fast convergence for the asynchronous algorithm. We then show that our algorithm enables quantile-based trimming, geometric median estimation, and depth-based trimming, with quantile-based trimming empirically outperforming existing rank-based methods. Finally, we provide a novel theoretical analysis of rank-based trimming via Markov chain theory.
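The median and quantiles that the abstract targets are the minimizers of the standard pinball (quantile) loss, which is what makes them robust to corrupted values. The sketch below shows only that centralized objective; it is not AsylADMM and contains no gossip or ADMM steps, and the function names are hypothetical.

```python
def pinball_loss(theta, data, q):
    """Sum of pinball (quantile) losses of theta on data at level q in (0, 1)."""
    return sum(
        q * (x - theta) if x >= theta else (1 - q) * (theta - x)
        for x in data
    )


def quantile_by_search(data, q):
    """Minimize the pinball loss over the data points.

    The loss is convex and piecewise linear with kinks at the data points,
    so some minimizer always coincides with one of them.
    """
    return min(sorted(data), key=lambda t: pinball_loss(t, data, q))


readings = [1, 2, 3, 4, 1000]  # one corrupted sensor value
robust_center = quantile_by_search(readings, 0.5)
plain_mean = sum(readings) / len(readings)
```

Here the single corrupted reading pulls the mean to 202 while the median stays at 3. A decentralized method has to reach the same minimizer when each node holds only its own value, which is the problem the paper addresses.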

[AI-25] Unsupervised Ensemble Learning Through Deep Energy-based Models AISTATS2026

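[Quick read]: This paper addresses unsupervised ensemble learning: how to combine the predictions of multiple learners into an accurate meta-learner without ground-truth labels or additional data. The key to the solution is a new deep energy-based method that uses only the learners' output predictions, requires no labels, learner features, or problem-specific information, and has theoretical guarantees when the learners are conditionally independent. The method can potentially capture complex dependence structures between learners and shows superior performance across diverse ensemble scenarios, including challenging mixture-of-experts settings, making it especially relevant for harnessing collective intelligence in data-scarce or privacy-sensitive environments.

Link: https://arxiv.org/abs/2601.20556
Authors: Ariel Maymon, Yanir Buznah, Uri Shaham
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to AISTATS 2026. 29 pages, 13 figures. Code available at: this https URL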

Abstract:Unsupervised ensemble learning emerged to address the challenge of combining multiple learners’ predictions without access to ground truth labels or additional data. This paradigm is crucial in scenarios where evaluating individual classifier performance or understanding their strengths is challenging due to limited information. We propose a novel deep energy-based method for constructing an accurate meta-learner using only the predictions of individual learners, potentially capable of capturing complex dependence structures between them. Our approach requires no labeled data, learner features, or problem-specific information, and has theoretical guarantees for when learners are conditionally independent. We demonstrate superior performance across diverse ensemble scenarios, including challenging mixture of experts settings. Our experiments span standard ensemble datasets and curated datasets designed to test how the model fuses expertise from multiple sources. These results highlight the potential of unsupervised ensemble learning to harness collective intelligence, especially in data-scarce or privacy-sensitive environments.

[AI-26] Online Risk-Averse Planning in POMDPs Using Iterated CVaR Value Function

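[Quick read]: This paper addresses risk-sensitive planning in partially observable Markov decision processes (POMDPs), that is, optimizing a risk measure rather than only the expected return under uncertainty. The key to the solution is to adopt Iterated Conditional Value-at-Risk (ICVaR) as a dynamic risk measure, develop an ICVaR policy evaluation algorithm whose finite-time performance guarantees do not depend on the cardinality of the action space, and extend three online planning algorithms (Sparse Sampling, PFT-DPW, and POMCPOW) to optimize the ICVaR value function directly. A risk parameter α moves from risk-neutral planning (α = 1) to increasing risk aversion (α < 1). For ICVaR Sparse Sampling, finite-time performance guarantees under the risk-sensitive objective enable a novel exploration strategy tailored to ICVaR. Experiments on benchmark POMDP domains show that the ICVaR planners achieve lower tail risk than their risk-neutral counterparts.

Link: https://arxiv.org/abs/2601.20554
Authors: Yaacov Pariente, Vadim Indelman
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: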

Abstract:We study risk-sensitive planning under partial observability using the dynamic risk measure Iterated Conditional Value-at-Risk (ICVaR). A policy evaluation algorithm for ICVaR is developed with finite-time performance guarantees that do not depend on the cardinality of the action space. Building on this foundation, three widely used online planning algorithms–Sparse Sampling, Particle Filter Trees with Double Progressive Widening (PFT-DPW), and Partially Observable Monte Carlo Planning with Observation Widening (POMCPOW)–are extended to optimize the ICVaR value function rather than the expectation of the return. Our formulations introduce a risk parameter α, where α = 1 recovers standard expectation-based planning and α < 1 induces increasing risk aversion. For ICVaR Sparse Sampling, we establish finite-time performance guarantees under the risk-sensitive objective, which further enable a novel exploration strategy tailored to ICVaR. Experiments on benchmark POMDP domains demonstrate that the proposed ICVaR planners achieve lower tail risk compared to their risk-neutral counterparts.
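The role of the risk parameter can be sketched with an empirical CVaR over equally likely returns and a nested version applied at each level of a small tree. This is a minimal illustration under assumptions: returns are treated as rewards (lower is worse), children are equally likely, and the names `cvar` and `icvar` and the tree encoding are hypothetical rather than the paper's planners.

```python
def cvar(returns, alpha):
    """Mean of the worst alpha-fraction of equally likely returns."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(returns)      # worst outcomes first
    budget = alpha * len(ordered)  # probability mass to cover, in samples
    total = 0.0
    for r in ordered:
        w = min(1.0, budget)       # a whole sample or the remaining fraction
        total += w * r
        budget -= w
        if budget <= 0:
            break
    return total / (alpha * len(ordered))


def icvar(node, alpha):
    """Nested CVaR on a tree: a leaf is a number, an inner node is
    (immediate_reward, [children]) with equally likely children."""
    if isinstance(node, (int, float)):
        return float(node)
    reward, children = node
    return reward + cvar([icvar(c, alpha) for c in children], alpha)


tree = (1.0, [(0.0, [10, -10]), (0.0, [4, 4])])
```

With `alpha = 1` both functions reduce to plain expectations, so `icvar(tree, 1.0)` is the expected return; lowering `alpha` averages over an ever smaller worst-case tail at every level, which is the increasing risk aversion the abstract refers to.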

[AI-27] IoT Device Identification with Machine Learning: Common Pitfalls and Best Practices

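[Quick read]: This paper addresses methodological flaws and reproducibility problems in machine-learning-based device identification for Internet of Things (IoT) security, examining pitfalls in the choice of identification method (unique vs. class-based), data heterogeneity, feature extraction, and evaluation metrics. The key to the solution is to systematically identify and correct common errors, such as improper data augmentation and misleading session identifiers, and to derive a rigorous guideline that improves the generalizability of models and the reproducibility of results.

Link: https://arxiv.org/abs/2601.20548
Authors: Kahraman Kostas, Rabia Yasa Kostas
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: 4 pages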

Abstract:This paper critically examines the device identification process using machine learning, addressing common pitfalls in existing literature. We analyze the trade-offs between identification methods (unique vs. class based), data heterogeneity, feature extraction challenges, and evaluation metrics. By highlighting specific errors, such as improper data augmentation and misleading session identifiers, we provide a robust guideline for researchers to enhance the reproducibility and generalizability of IoT security models.

[AI-28] Interpreting Emergent Extreme Events in Multi-Agent Systems

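[Quick read]: This paper addresses the difficulty of explaining extreme events that emerge in multi-agent systems, especially in simulations of complex human-like systems powered by large language models, where the black-box nature of emergence obscures the origins of such events. The key to the solution is the first explanation framework for these events: the Shapley value from game theory is adapted to the action level, assigning each agent action at each time step an attribution score that measures its influence on the extreme event. The scores are then aggregated along the dimensions of time, agent, and behavior to quantify the risk contribution of each dimension, and a set of metrics built on these contributions characterizes the features of extreme events.

Link: https://arxiv.org/abs/2601.20538
Authors: Ling Tang, Jilin Mei, Dongrui Liu, Chen Qian, Dawei Cheng, Jing Shao, Xia Hu
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures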

Abstract:Large language model-powered multi-agent systems have emerged as powerful tools for simulating complex human-like systems. The interactions within these systems often lead to extreme events whose origins remain obscured by the black box of emergence. Interpreting these events is critical for system safety. This paper proposes the first framework for explaining emergent extreme events in multi-agent systems, aiming to answer three fundamental questions: When does the event originate? Who drives it? And what behaviors contribute to it? Specifically, we adapt the Shapley value to faithfully attribute the occurrence of extreme events to each action taken by agents at different time steps, i.e., assigning an attribution score to the action to measure its influence on the event. We then aggregate the attribution scores along the dimensions of time, agent, and behavior to quantify the risk contribution of each dimension. Finally, we design a set of metrics based on these contribution scores to characterize the features of extreme events. Experiments across diverse multi-agent system scenarios (economic, financial, and social) demonstrate the effectiveness of our framework and provide general insights into the emergence of extreme phenomena.
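The attribution step can be illustrated with the exact Shapley value on a toy game in which the "players" are individual agent actions. This is a sketch of the classical definition, not the paper's adapted estimator; the action labels and the `toy_event` value function are hypothetical, and exact enumeration is only feasible for a handful of actions.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values; value maps a frozenset of players to a number."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p given the players that came before it.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orders) for p, t in totals.items()}


def toy_event(kept_actions):
    """Hypothetical rule: the extreme event occurs only if both of these
    actions are kept (others are replaced by a neutral baseline)."""
    return 1.0 if {"A@t1", "B@t2"} <= kept_actions else 0.0


scores = shapley_values(["A@t1", "B@t2", "C@t1"], toy_event)
```

In this toy game the two jointly necessary actions split the credit equally and the irrelevant action receives zero. Summing scores by agent (A, B, C) or by time step (t1, t2) gives the kind of per-dimension aggregation the abstract describes.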

[AI-29] CCMamba: Selective State-Space Models for Higher-Order Graph Learning on Combinatorial Complexes

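[Quick read]: This paper addresses the inability of standard graph neural networks (GNNs) to model higher-order relational structures. For combinatorial complexes in particular, existing topological deep learning methods rely on local message passing via attention mechanisms, which has quadratic complexity, remains low-dimensional, and struggles with higher-order information aggregation. The key to the solution is Combinatorial Complex Mamba (CCMamba), the first unified Mamba-based neural framework for combinatorial complexes, which organizes multi-rank incidence relations into structured sequences and reformulates message passing with rank-aware state-space models, giving adaptive, directional, and long-range information propagation in linear time without self-attention. The expressive power of CCMamba message passing is shown to be upper-bounded by the 1-Weisfeiler-Lehman test, and the model shows stronger performance, scalability, and robustness to depth on graph, hypergraph, and simplicial complex benchmarks.

Link: https://arxiv.org/abs/2601.20518
Authors: Jiawen Chen, Qi Shao, Mingtong Zhou, Duxin Chen, Wenwu Yu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: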

Abstract:Topological deep learning has emerged for modeling higher-order relational structures beyond pairwise interactions that standard graph neural networks fail to capture. Although combinatorial complexes offer a unified topological framework, most existing topological deep learning methods rely on local message passing via attention mechanisms, which incur quadratic complexity and remain low-dimensional, limiting scalability and rank-aware aggregation of higher-order information. We propose Combinatorial Complex Mamba (CCMamba), the first unified Mamba-based neural framework for learning on combinatorial complexes. CCMamba reformulates message passing as a selective state-space modeling problem by organizing multi-rank incidence relations into structured sequences processed by rank-aware state-space models. This enables adaptive, directional, and long-range information propagation in linear time without self-attention. We further show theoretically that the expressive power of CCMamba message passing is upper-bounded by the 1-Weisfeiler-Lehman test. Experiments on graph, hypergraph, and simplicial benchmarks demonstrate that CCMamba consistently outperforms existing methods while exhibiting improved scalability and robustness to depth.

[AI-30] Audio Deepfake Detection in the Age of Advanced Text-to-Speech models

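[Quick read]: This paper addresses the declining effectiveness of audio deepfake detection as Text-to-Speech (TTS) technology advances rapidly. With TTS models progressing across streaming, LLM-based, and non-autoregressive architectures, detectors built on a single paradigm risk unstable performance or outright failure. The key to the solution is a multi-view detection approach that combines complementary semantic, structural, and signal-level analyses and remains robust across the synthesis mechanisms evaluated, addressing increasingly sophisticated audio deepfake threats.

Link: https://arxiv.org/abs/2601.20510
Authors: Robin Singh, Aditya Yogesh Nair, Fabio Palumbo, Florian Barbaro, Anna Dyka, Lohith Rachakonda
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: This work was performed using HPC resources from GENCI-IDRIS (Grant 2025- AD011016076)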

点击查看摘要

Abstract:Recent advances in Text-to-Speech (TTS) systems have substantially increased the realism of synthetic speech, raising new challenges for audio deepfake detection. This work presents a comparative evaluation of three state-of-the-art TTS models (Dia2, Maya1, and MeloTTS) representing streaming, LLM-based, and non-autoregressive architectures. A corpus of 12,000 synthetic audio samples was generated using the Daily-Dialog dataset and evaluated against four detection frameworks, including semantic, structural, and signal-level approaches. The results reveal significant variability in detector performance across generative mechanisms: models effective against one TTS architecture may fail against others, particularly LLM-based synthesis. In contrast, a multi-view detection approach combining complementary analysis levels demonstrates robust performance across all evaluated models. These findings highlight the limitations of single-paradigm detectors and emphasize the necessity of integrated detection strategies to address the evolving landscape of audio deepfake threats.

[AI-31] Normative Equivalence in human-AI Cooperation: Behaviour Not Identity Drives Cooperation in Mixed-Agent Groups

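[Quick read]: This paper asks how introducing AI agents into small human groups affects the emergence and maintenance of cooperative social norms. Prior work focuses mostly on dyadic human-AI interaction and says little about group-level norm dynamics. The study uses an online repeated four-player Public Goods Game (PGG): each group has three human participants and one bot, labelled either as human or as AI, that follows one of three predefined strategies (unconditional cooperation, conditional cooperation, or free-riding). Cooperation was driven mainly by reciprocal group dynamics and behavioural inertia, and these mechanisms operated identically across label conditions; no significant differences between human and AI labels were found in cooperation levels, norm persistence, or participants' normative perceptions. The authors call this pattern normative equivalence: the mechanisms that sustain cooperation function similarly in mixed human-AI groups and all-human groups, which suggests that cooperative norms can extend to artificial agents.

Link: https://arxiv.org/abs/2601.20487
Authors: Nico Mutzner, Taha Yasseri, Heiko Rauhut
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Human-Computer Interaction (cs.HC); General Economics (econ.GN)
Comments: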

Abstract:The introduction of artificial intelligence (AI) agents into human group settings raises essential questions about how these novel participants influence cooperative social norms. While previous studies on human-AI cooperation have primarily focused on dyadic interactions, little is known about how integrating AI agents affects the emergence and maintenance of cooperative norms in small groups. This study addresses this gap through an online experiment using a repeated four-player Public Goods Game (PGG). Each group consisted of three human participants and one bot, which was framed either as human or AI and followed one of three predefined decision strategies: unconditional cooperation, conditional cooperation, or free-riding. In our sample of 236 participants, we found that reciprocal group dynamics and behavioural inertia primarily drove cooperation. These normative mechanisms operated identically across conditions, resulting in cooperation levels that did not differ significantly between human and AI labels. Furthermore, we found no evidence of differences in norm persistence in a follow-up Prisoner’s Dilemma, or in participants’ normative perceptions. Participants’ behaviour followed the same normative logic across human and AI conditions, indicating that cooperation depended on group behaviour rather than partner identity. This supports a pattern of normative equivalence, in which the mechanisms that sustain cooperation function similarly in mixed human-AI and all human groups. These findings suggest that cooperative norms are flexible enough to extend to artificial agents, blurring the boundary between humans and AI in collective decision-making.
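
For readers unfamiliar with the Public Goods Game, the following sketch shows one round of the standard payoff rule and the three bot strategies named in the abstract. The endowment of 20, the multiplier of 2.0, and the rule that a conditional cooperator matches the previous round's mean contribution are assumptions for illustration; the abstract does not state the study's actual parameters.

```python
def bot_contribution(strategy, endowment, last_round_mean=None):
    """Contribution of a scripted bot under one of three strategies."""
    if strategy == "unconditional":
        return endowment                      # always contributes everything
    if strategy == "free_rider":
        return 0                              # never contributes
    if strategy == "conditional":
        # Assumed rule: match what the group did on average last round,
        # and cooperate fully in the first round.
        return endowment if last_round_mean is None else last_round_mean
    raise ValueError(f"unknown strategy: {strategy}")

def pgg_payoffs(contributions, endowment=20, multiplier=2.0):
    """Each player keeps what they did not contribute plus an equal pot share."""
    share = multiplier * sum(contributions) / len(contributions)
    return [endowment - c + share for c in contributions]

# Three full cooperators and one free-riding bot.
bot = bot_contribution("free_rider", 20)
print(pgg_payoffs([20, 20, 20, bot]))
```

With these assumed numbers the free rider earns more than each cooperator in that round, which is the tension that cooperative norms have to overcome.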

[AI-32] Fair Recourse for All: Ensuring Individual and Group Fairness in Counterfactual Explanations

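[Quick read]: This paper addresses missing fairness guarantees in Explainable Artificial Intelligence (XAI), specifically the unfairness that counterfactual explanations (CFs) can exhibit at the individual and group levels. The authors propose a novel model-agnostic reinforcement learning approach that generates CFs satisfying both individual fairness (similar individuals receive similar recourse) and group fairness (different protected groups, e.g. demographic groups, receive equitable and effective recourse). The key contribution is formulating these two objectives, usually treated as orthogonal, as a single optimization problem, and extending existing fairness-auditing metrics (such as equal choice of recourse and equal effectiveness) to quantify CF quality under fairness constraints, so that the cost of fairness can be measured while preserving proximity and plausibility.

Link: https://arxiv.org/abs/2601.20449
Authors: Fatima Ezzeddine, Obaida Ammar, Silvia Giordano, Omran Ayoub
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: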

Abstract:Explainable Artificial Intelligence (XAI) is becoming increasingly essential for enhancing the transparency of machine learning (ML) models. Among the various XAI techniques, counterfactual explanations (CFs) hold a pivotal role due to their ability to illustrate how changes in input features can alter an ML model’s decision, thereby offering actionable recourse to users. Ensuring that individuals with comparable attributes and those belonging to different protected groups (e.g., demographic) receive similar and actionable recourse options is essential for trustworthy and fair decision-making. In this work, we address this challenge directly by focusing on the generation of fair CFs. Specifically, we start by defining and formulating fairness at: 1) individual fairness, ensuring that similar individuals receive similar CFs, 2) group fairness, ensuring equitable CFs across different protected groups and 3) hybrid fairness, which accounts for both individual and broader group-level fairness. We formulate the problem as an optimization task and propose a novel model-agnostic, reinforcement learning based approach to generate CFs that satisfy fairness constraints at both the individual and group levels, two objectives that are usually treated as orthogonal. As fairness metrics, we extend existing metrics commonly used for auditing ML models, such as equal choice of recourse and equal effectiveness across individuals and groups. We evaluate our approach on three benchmark datasets, showing that it effectively ensures individual and group fairness while preserving the quality of the generated CFs in terms of proximity and plausibility, and quantify the cost of fairness in the different levels separately. Our work opens a broader discussion on hybrid fairness and its role and implications for XAI and beyond CFs.
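
As a rough illustration of a group-level metric such as "equal effectiveness", the sketch below computes, per protected group, the fraction of individuals whose counterfactual actually flips the model's decision, and then the gap between the best-served and worst-served groups. This is one plausible reading, not the paper's definition; the function names and the record format are assumptions.

```python
from collections import defaultdict

def effectiveness_by_group(records):
    """records: iterable of (group, cf_is_valid) pairs.

    Returns {group: fraction of counterfactuals that achieve the desired outcome}.
    """
    totals = defaultdict(int)
    valid = defaultdict(int)
    for group, ok in records:
        totals[group] += 1
        if ok:
            valid[group] += 1
    return {g: valid[g] / totals[g] for g in totals}

def effectiveness_gap(records):
    """Largest difference in effectiveness between any two groups (0 = equal)."""
    rates = effectiveness_by_group(records)
    return max(rates.values()) - min(rates.values())

# Made-up example: group "a" gets a working counterfactual half the time.
records = [("a", True), ("a", False), ("b", True), ("b", True)]
print(effectiveness_by_group(records), effectiveness_gap(records))
```

A gap near zero would indicate that recourse works about equally well across groups under this reading.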

[AI-33] Self Voice Conversion as an Attack against Neural Audio Watermarking

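[Quick read]: This paper addresses the insufficient security of current audio watermarking against new deep-learning-based attacks, focusing on self voice conversion as a content-preserving attack. The key finding is that self voice conversion, which remaps a speaker's voice to the same identity while altering acoustic characteristics through a voice conversion model, severely degrades the reliability of state-of-the-art watermarking methods. The authors stress that future audio watermarking systems must account for such attacks and be made more robust to deep-learning-based attacks.

Link: https://arxiv.org/abs/2601.20432
Authors: Yigitcan Özer, Wanying Ge, Zhe Zhang, Xin Wang, Junichi Yamagishi
Affiliations: unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: 7 pages; 2 figures; 2 tables; accepted at IEICE, SP/SLP 2026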

Abstract:Audio watermarking embeds auxiliary information into speech while maintaining speaker identity, linguistic content, and perceptual quality. Although recent advances in neural and digital signal processing-based watermarking methods have improved imperceptibility and embedding capacity, robustness is still primarily assessed against conventional distortions such as compression, additive noise, and resampling. However, the rise of deep learning-based attacks introduces novel and significant threats to watermark security. In this work, we investigate self voice conversion as a universal, content-preserving attack against audio watermarking systems. Self voice conversion remaps a speaker’s voice to the same identity while altering acoustic characteristics through a voice conversion model. We demonstrate that this attack severely degrades the reliability of state-of-the-art watermarking approaches and highlight its implications for the security of modern audio watermarking techniques.

[AI-34] Guiding the Recommender: Information-Aware Auto-Bidding for Content Promotion

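[Quick read]: This paper addresses the long-term degradation of model performance caused by auction-based paid promotion on content platforms. Promotion eases cold start for low-to-medium quality content, but it can expose high-quality content to suboptimal audiences, polluting engagement signals and harming the recommender's long-term learning. The key to the solution is recasting content promotion as a dual-objective optimization that balances short-term value acquisition with long-term model improvement, using a decomposable surrogate objective, gradient coverage, that is formally connected to Fisher information and optimal experimental design. On top of this, the authors design a two-stage auto-bidding algorithm based on Lagrange duality that paces budget dynamically and optimizes bids using per-impression marginal utilities, plus a confidence-gated gradient heuristic and a zeroth-order variant to handle missing labels at bid time. The method comes with guarantees of budget feasibility, sublinear regret in online auctions, and monotone submodularity of the composite objective, and it outperforms naive impression-maximization strategies.

Link: https://arxiv.org/abs/2601.20422
Authors: Yumou Liu, Zhenzhe Zheng, Jiang Rong, Yao Hu, Fan Wu, Guihai Chen
Affiliations: unknown
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments: Accepted by SIGMETRICS 2026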

Abstract:Modern content platforms offer paid promotion to mitigate cold start by allocating exposure via auctions. Our empirical analysis reveals a counterintuitive flaw in this paradigm: while promotion rescues low-to-medium quality content, it can harm high-quality content by forcing exposure to suboptimal audiences, polluting engagement signals and downgrading future recommendation. We recast content promotion as a dual-objective optimization that balances short-term value acquisition with long-term model improvement. To make this tractable at bid time in content promotion, we introduce a decomposable surrogate objective, gradient coverage, and establish its formal connection to Fisher Information and optimal experimental design. We design a two-stage auto-bidding algorithm based on Lagrange duality that dynamically paces budget through a shadow price and optimizes impression-level bids using per-impression marginal utilities. To address missing labels at bid time, we propose a confidence-gated gradient heuristic, paired with a zeroth-order variant for black-box models that reliably estimates learning signals in real time. We provide theoretical guarantees, proving monotone submodularity of the composite objective, sublinear regret in online auction, and budget feasibility. Extensive offline experiments on synthetic and real-world datasets validate the framework: it outperforms baselines, achieves superior final AUC/LogLoss, adheres closely to budget targets, and remains effective when gradients are approximated zeroth-order. These results show that strategic, information-aware promotion can improve long-term model performance and organic outcomes beyond naive impression-maximization strategies.
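
The idea of pacing a budget "through a shadow price" can be sketched with a generic dual-update bidding loop. This is a textbook-style pacing scheme, not the paper's two-stage algorithm: the second-price auction, the step size, the bid rule `value / (1 + lam)`, and the name `paced_bidding` are assumptions, and `values` merely stands in for whatever per-impression marginal utility the real system would estimate.

```python
def paced_bidding(values, prices, budget, step_size=0.1):
    """Bid on a stream of impressions while pacing spend with a shadow price.

    values: estimated marginal utility of each impression (assumed given).
    prices: highest competing bid for each impression (second-price auction).
    Returns (list of win flags, total spend).
    """
    lam = 0.0                          # shadow price on the budget constraint
    remaining = budget
    target = budget / len(values)      # even spend per impression as the pacing target
    wins = []
    for value, price in zip(values, prices):
        bid = value / (1.0 + lam)      # a higher shadow price shades bids down
        spend = 0.0
        if bid >= price and price <= remaining:
            spend = price
            remaining -= price
        wins.append(spend > 0.0)
        # Dual update: raise lam after overspending, lower it after underspending.
        lam = max(0.0, lam + step_size * (spend - target))
    return wins, budget - remaining

wins, spent = paced_bidding([1.0] * 4, [0.5] * 4, budget=1.0)
print(wins, spent)
```

The hard check against `remaining` keeps spend within budget, while the shadow price spreads spend over the stream.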

[AI-35] Meeting SLOs Slashing Hours: Automated Enterprise LLM Optimization with OptiKIT

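[Quick read]: This paper addresses the scalability challenge of enterprise large language model (LLM) deployment: how to optimize models systematically within constrained compute budgets when LLM optimization experts are scarce. The solution is OptiKIT, a distributed LLM optimization framework that democratizes model compression and tuning by automating complex optimization workflows. Its key elements are dynamic resource allocation, staged pipeline execution with automatic cleanup, and seamless enterprise integration. In production it delivers more than 2x GPU throughput improvement and lets application teams achieve consistent performance improvements without deep LLM optimization expertise.

Link: https://arxiv.org/abs/2601.20408
Authors: Nicholas Santavas, Kareem Eissa, Patrycja Cieplicka, Piotr Florek, Matteo Nulli, Stefan Vasilev, Seyyed Hadi Hashemi, Antonios Gasteratos, Shahram Khadivi
Affiliations: unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: Accepted in MLSys 2026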

Abstract:Enterprise LLM deployment faces a critical scalability challenge: organizations must optimize models systematically to scale AI initiatives within constrained compute budgets, yet the specialized expertise required for manual optimization remains a niche and scarce skillset. This challenge is particularly evident in managing GPU utilization across heterogeneous infrastructure while enabling teams with diverse workloads and limited LLM optimization experience to deploy models efficiently. We present OptiKIT, a distributed LLM optimization framework that democratizes model compression and tuning by automating complex optimization workflows for non-expert teams. OptiKIT provides dynamic resource allocation, staged pipeline execution with automatic cleanup, and seamless enterprise integration. In production, it delivers more than 2x GPU throughput improvement while empowering application teams to achieve consistent performance improvements without deep LLM optimization expertise. We share both the platform design and key engineering insights into resource allocation algorithms, pipeline orchestration, and integration patterns that enable large-scale, production-grade democratization of model optimization. Finally, we open-source the system to enable external contributions and broader reproducibility.

[AI-36] On the Impact of AGENTS.md Files on the Efficiency of AI Coding Agents

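[Quick read]: This paper studies how repository-level configuration files (AGENTS.md files) affect the operational efficiency of AI coding agents working on GitHub pull requests. The presence of such a file is associated with a lower median runtime (28.64% lower) and reduced output token consumption (16.58% lower), with comparable task completion behavior. The key idea is to treat repository-level instructions as a lightweight intervention for improving agent efficiency, which yields practical configuration guidance and motivates systematic research into how such files shape agent behavior and integration.

Link: https://arxiv.org/abs/2601.20404
Authors: Jai Lal Lulla, Seyedmoein Mohsenimofidi, Matthias Galster, Jie M. Zhang, Sebastian Baltes, Christoph Treude
Affiliations: unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
Comments: 5 pages, 1 figure, 1 table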

Abstract:AI coding agents such as Codex and Claude Code are increasingly used to autonomously contribute to software repositories. However, little is known about how repository-level configuration artifacts affect the operational efficiency of the agents. In this paper, we study the impact of AGENTS.md files on the runtime and token consumption of AI coding agents operating on GitHub pull requests. We analyze 10 repositories and 124 pull requests, executing agents under two conditions: with and without an AGENTS.md file. We measure wall-clock execution time and token usage during agent execution. Our results show that the presence of AGENTS.md is associated with a lower median runtime (Δ 28.64%) and reduced output token consumption (Δ 16.58%), while maintaining a comparable task completion behavior. Based on these results, we discuss immediate implications for the configuration and deployment of AI coding agents in practice, and outline a broader research agenda on the role of repository-level instructions in shaping the behavior, efficiency, and integration of AI coding agents in software development workflows.
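
A small worked example of how a figure like "28.64% lower median runtime" could be computed from paired conditions. The abstract does not define how its Δ values are calculated, so the relative difference of medians used here is an assumption, and the runtimes are made-up numbers, not data from the paper.

```python
from statistics import median

def median_reduction_pct(without_file, with_file):
    """Percentage by which the median drops when the file is present."""
    baseline = median(without_file)
    return 100.0 * (baseline - median(with_file)) / baseline

# Hypothetical wall-clock runtimes in seconds for the two conditions.
runtime_without = [100, 200, 300]
runtime_with = [80, 150, 400]
print(median_reduction_pct(runtime_without, runtime_with))
```

Using medians keeps a single slow run, such as the 400-second outlier above, from dominating the comparison.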

[AI-37] GuideAI: A Real-time Personalized Learning Solution with Adaptive Interventions

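[Quick read]: This paper addresses the fact that learning systems driven by Large Language Models (LLMs) lack awareness of learners' cognitive and physiological states, which limits how well they adapt to individual learning styles. Existing learning technologies focus on structured learning paths, knowledge tracing, and generic adaptive testing, and do not handle real-time challenges driven by cognitive load, attention fluctuations, and engagement levels. The key to the solution is GuideAI, a multi-modal framework that integrates real-time biosensory feedback (eye gaze tracking, heart rate variability, posture detection, and digital note-taking behavior). It adapts content and pacing through cognitive optimizations (adjusting complexity based on learning progress markers), physiological interventions (breathing guidance and posture correction), and attention-aware strategies (redirecting focus using gaze analysis), and it supports text, image, audio, and video instruction. A preliminary study reports significant improvements in knowledge assessments and reduced subjective cognitive load.

Link: https://arxiv.org/abs/2601.20402
Authors: Ananya Shukla, Chaitanya Modi, Satvik Bajpai, Siddharth Siddharth
Affiliations: unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted for publication at the 31st International Conference on Intelligent User Interfaces (IUI 2026)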

Abstract:Large Language Models (LLMs) have emerged as powerful learning tools, but they lack awareness of learners’ cognitive and physiological states, limiting their adaptability to the user’s learning style. Contemporary learning techniques primarily focus on structured learning paths, knowledge tracing, and generic adaptive testing but fail to address real-time learning challenges driven by cognitive load, attention fluctuations, and engagement levels. Building on findings from a formative user study (N=66), we introduce GuideAI, a multi-modal framework that enhances LLM-driven learning by integrating real-time biosensory feedback including eye gaze tracking, heart rate variability, posture detection, and digital note-taking behavior. GuideAI dynamically adapts learning content and pacing through cognitive optimizations (adjusting complexity based on learning progress markers), physiological interventions (breathing guidance and posture correction), and attention-aware strategies (redirecting focus using gaze analysis). Additionally, GuideAI supports diverse learning modalities, including text-based, image-based, audio-based, and video-based instruction, across varied knowledge domains. A preliminary study (N = 25) assessed GuideAI’s impact on knowledge retention and cognitive load through standardized assessments. The results show statistically significant improvements in both problem-solving capability and recall-based knowledge assessments. Participants also experienced notable reductions in key NASA-TLX measures including mental demand, frustration levels, and effort, while simultaneously reporting enhanced perceived performance. These findings demonstrate GuideAI’s potential to bridge the gap between current LLM-based learning systems and individualized learner needs, paving the way for adaptive, cognition-aware education at scale.

[AI-38] FedRD: Reducing Divergences for Generalized Federated Learning via Heterogeneity-aware Parameter Guidance ICASSP2026

[Quick read]: This paper addresses the limited generalization of models in heterogeneous federated learning (HFL), in particular the performance drop on unseen clients under heterogeneous data distributions. It identifies two core challenges: Optimization Divergence and Performance Divergence. The key to the solution is FedRD, an algorithm that jointly uses parameter-guided global generalization aggregation and local debiased classification to reduce both divergences, improving generalization to participating and unseen clients.

Link: https://arxiv.org/abs/2601.20397
Authors: Kaile Wang, Jiannong Cao, Yu Yang, Xiaoyin Li, Mingjin Zhang
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ICASSP 2026

Abstract:Heterogeneous federated learning (HFL) aims to ensure effective and privacy-preserving collaboration among different entities. As newly joined clients require significant adjustments and additional training to align with the existing system, the problem of generalizing federated learning models to unseen clients under heterogeneous data has become progressively crucial. Consequently, we highlight two unsolved challenging issues in federated domain generalization: Optimization Divergence and Performance Divergence. To tackle the above challenges, we propose FedRD, a novel heterogeneity-aware federated learning algorithm that collaboratively utilizes parameter-guided global generalization aggregation and local debiased classification to reduce divergences, aiming to obtain an optimal global model for participating and unseen clients. Extensive experiments on public multi-domain datasets demonstrate that our approach exhibits a substantial performance advantage over competing baselines in addressing this specific problem.

[AI-39] OmegaUse: Building a General-Purpose GUI Agent for Autonomous Task Execution

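[Quick read]: This paper addresses how to build a general-purpose graphical user interface (GUI) agent model that executes tasks autonomously on both mobile and desktop platforms. The core challenges are obtaining high-quality data and designing effective training methods. The solution has three parts. First, a carefully engineered data-construction pipeline creates high-fidelity synthetic data by integrating bottom-up autonomous exploration with top-down taxonomy-guided generation. Second, a two-stage training paradigm uses Supervised Fine-Tuning (SFT) to establish basic interaction syntax, followed by Group Relative Policy Optimization (GRPO) to improve spatial grounding and sequential planning. Third, a Mixture-of-Experts (MoE) backbone balances computational efficiency with agentic reasoning capacity. The model is highly competitive across established GUI benchmarks.

Link: https://arxiv.org/abs/2601.20380
Authors: Le Zhang, Yixiong Xiao, Xinjiang Lu, Jingjia Cao, Yusai Zhao, Jingbo Zhou, Lang An, Zikan Feng, Wanxiang Sha, Yu Shi, Congxi Xiao, Jian Xiong, Yankai Zhang, Hua Wu, Haifeng Wang
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: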

Abstract:Graphical User Interface (GUI) agents show great potential for enabling foundation models to complete real-world tasks, revolutionizing human-computer interaction and improving human productivity. In this report, we present OmegaUse, a general-purpose GUI agent model for autonomous task execution on both mobile and desktop platforms, supporting computer-use and phone-use scenarios. Building an effective GUI agent model relies on two factors: (1) high-quality data and (2) effective training methods. To address these, we introduce a carefully engineered data-construction pipeline and a decoupled training paradigm. For data construction, we leverage rigorously curated open-source datasets and introduce a novel automated synthesis framework that integrates bottom-up autonomous exploration with top-down taxonomy-guided generation to create high-fidelity synthetic data. For training, to better leverage these data, we adopt a two-stage strategy: Supervised Fine-Tuning (SFT) to establish fundamental interaction syntax, followed by Group Relative Policy Optimization (GRPO) to improve spatial grounding and sequential planning. To balance computational efficiency with agentic reasoning capacity, OmegaUse is built on a Mixture-of-Experts (MoE) backbone. To evaluate cross-terminal capabilities in an offline setting, we introduce OS-Nav, a benchmark suite spanning multiple operating systems: ChiM-Nav, targeting Chinese Android mobile environments, and Ubu-Nav, focusing on routine desktop interactions on Ubuntu. Extensive experiments show that OmegaUse is highly competitive across established GUI benchmarks, achieving a state-of-the-art (SOTA) score of 96.3% on ScreenSpot-V2 and a leading 79.1% step success rate on AndroidControl. OmegaUse also performs strongly on OS-Nav, reaching 74.24% step success on ChiM-Nav and 55.9% average success on Ubu-Nav.

[AI-40] Policy of Thoughts: Scaling LLM Reasoning via Test-time Policy Evolution

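[Quick read]: This paper addresses the instability of Large Language Models (LLMs) on complex, long-horizon reasoning tasks caused by the frozen policy assumption. Existing test-time scaling methods treat execution feedback only as an external signal for filtering or rewriting trajectories and do not internalize it to improve the underlying reasoning strategy. The key to the solution is Policy of Thoughts (PoT), a framework that models reasoning as a within-instance online optimization process: it first generates diverse candidate solutions through an efficient exploration mechanism, then uses Group Relative Policy Optimization (GRPO) to update a transient LoRA adapter based on execution feedback. This gives dynamic, instance-specific policy refinement, so the model can evolve its reasoning priors in real time from failed attempts.

Link: https://arxiv.org/abs/2601.20379
Authors: Zhengbo Jiao, Hongyu Xian, Qinglong Wang, Yunpu Ma, Zhebo Wang, Zifan Zhang, Dezhang Kong, Meng Han
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 19 pages, 5 figures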

Abstract:Large language models (LLMs) struggle with complex, long-horizon reasoning due to instability caused by their frozen policy assumption. Current test-time scaling methods treat execution feedback merely as an external signal for filtering or rewriting trajectories, without internalizing it to improve the underlying reasoning strategy. Inspired by Popper’s epistemology of “conjectures and refutations,” we argue that intelligence requires real-time evolution of the model’s policy through learning from failed attempts. We introduce Policy of Thoughts (PoT), a framework that recasts reasoning as a within-instance online optimization process. PoT first generates diverse candidate solutions via an efficient exploration mechanism, then uses Group Relative Policy Optimization (GRPO) to update a transient LoRA adapter based on execution feedback. This closed-loop design enables dynamic, instance-specific refinement of the model’s reasoning priors. Experiments show that PoT dramatically boosts performance: a 4B model achieves 49.71% accuracy on LiveCodeBench, outperforming GPT-4o and DeepSeek-V3 despite being over 50× smaller.
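
The "group relative" part of GRPO can be illustrated in a few lines: each candidate's reward is compared with the mean of its own group and scaled by the group's standard deviation, so no separate value network is needed. The sketch below covers only this advantage computation, under the assumption that rewards are pass/fail execution results. It does not show the policy-gradient step or the LoRA update, and the `eps` stabiliser is an implementation detail assumed here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of sampled candidates."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Candidates better than the group average get positive advantages.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four candidate solutions for one problem: two pass the tests, two fail.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every candidate receives the same reward, all advantages are zero and the group contributes no learning signal.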

[AI-41] Can Continuous-Time Diffusion Models Generate and Solve Globally Constrained Discrete Problems? A Study on Sudoku

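[Quick read]: This paper asks whether standard continuous-time generative models can represent distributions whose support is an extremely sparse, globally constrained discrete set, using completed Sudoku grids as a controlled testbed and treating them as a subset of a continuous relaxation space. The authors train flow-matching and score-based models along a Gaussian probability path and compare deterministic (ODE) sampling, stochastic (SDE) sampling, and DDPM-style discretizations. Stochastic sampling substantially outperforms deterministic flows; score-based samplers are the most reliable among continuous-time methods, and DDPM-style ancestral sampling achieves the highest validity overall. The same models can also act as probabilistic Sudoku solvers by repeatedly sampling completions under clamped clues and stopping once the constraints are satisfied. This shows that classic diffusion/flow formulations can assign non-zero probability mass to globally constrained combinatorial structures and can be used for constraint satisfaction via stochastic search.

Link: https://arxiv.org/abs/2601.20363
Authors: Mariia Drozdova
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 26 pages, 5 figures. Empirical study of continuous-time diffusion and flow models on Sudoku. Code available at this https URL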

Abstract:Can standard continuous-time generative models represent distributions whose support is an extremely sparse, globally constrained discrete set? We study this question using completed Sudoku grids as a controlled testbed, treating them as a subset of a continuous relaxation space. We train flow-matching and score-based models along a Gaussian probability path and compare deterministic (ODE) sampling, stochastic (SDE) sampling, and DDPM-style discretizations derived from the same continuous-time training. Unconditionally, stochastic sampling substantially outperforms deterministic flows; score-based samplers are the most reliable among continuous-time methods, and DDPM-style ancestral sampling achieves the highest validity overall. We further show that the same models can be repurposed for guided generation: by repeatedly sampling completions under clamped clues and stopping when constraints are satisfied, the model acts as a probabilistic Sudoku solver. Although far less sample-efficient than classical solvers and discrete-geometry-aware diffusion methods, these experiments demonstrate that classic diffusion/flow formulations can assign non-zero probability mass to globally constrained combinatorial structures and can be used for constraint satisfaction via stochastic search.
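
The "sample until the constraints hold" procedure can be sketched as a validity check plus a rejection loop. The sampler is left abstract: `sample_completion` is a hypothetical callable standing in for the trained generative model, and the clue format (a dict from `(row, col)` to digit) is an assumption made for this illustration.

```python
def is_valid_sudoku(grid):
    """True if every row, column, and 3x3 box contains the digits 1..9 exactly once."""
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [{grid[r][c] for r in range(9)} for c in range(9)]
    boxes = [
        {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in (0, 3, 6) for bc in (0, 3, 6)
    ]
    return all(unit == digits for unit in rows + cols + boxes)

def solve_by_sampling(sample_completion, clues, max_tries=1000):
    """Draw completions until one respects the clues and all constraints."""
    for _ in range(max_tries):
        grid = sample_completion(clues)
        keeps_clues = all(grid[r][c] == v for (r, c), v in clues.items())
        if keeps_clues and is_valid_sudoku(grid):
            return grid
    return None  # no valid sample within the try budget

# A known-valid grid built from a shifting pattern, used as a stand-in sampler output.
solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
print(is_valid_sudoku(solved))
```

The number of tries needed is a direct measure of how much probability mass the model places on valid grids, which is why the abstract calls this approach far less sample-efficient than classical solvers.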

[AI-42] Switchcodec: Adaptive residual-expert sparse quantization for high-fidelity neural audio coding ICASSP2026

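[Quick read]: This paper addresses the inefficiency of existing neural audio compression models on audio with widely varying content: a fixed number of per-frame codebooks is suboptimal, especially for signals that are either very simple or highly complex. The key to the solution is SwitchCodec, built on Residual Experts Vector Quantization (REVQ). REVQ combines a shared quantizer with dynamically routed expert quantizers that are activated according to the input audio, which decouples bitrate from codebook capacity and improves compression efficiency. In addition, the number of active expert quantizers can be adjusted at inference, enabling variable-bitrate (VBR) operation across multiple bitrates without retraining.

Link: https://arxiv.org/abs/2601.20362
Authors: Xiangbo Wang, Wenbin Jiang, Jin Wang, Yubo You, Sheng Fang, Fei Wen
Affiliations: unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: 4 pages, 3 figures. Accepted by ICASSP 2026. We would like to express our sincere gratitude to Senior Fellow Jing Wang for his continuous support and assistance. He has made an indelible and significant contribution to this work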

Abstract:Recent neural audio compression models often rely on residual vector quantization for high-fidelity coding, but using a fixed number of per-frame codebooks is suboptimal for the wide variability of audio content, especially for signals that are either very simple or highly complex. To address this limitation, we propose SwitchCodec, a neural audio codec based on Residual Experts Vector Quantization (REVQ). REVQ combines a shared quantizer with dynamically routed expert quantizers that are activated according to the input audio, decoupling bitrate from codebook capacity and improving compression efficiency. This design ensures full training and utilization of each quantizer. In addition, a variable-bitrate mechanism adjusts the number of active expert quantizers at inference, enabling multi-bitrate operation without retraining. Experiments demonstrate that SwitchCodec surpasses existing baselines on both objective metrics and subjective listening tests.
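
To show how a shared quantizer and routed expert quantizers might compose, here is a toy residual quantization sketch. It is not SwitchCodec's implementation: the router scores are passed in by hand instead of being predicted from the audio, the codebooks are tiny hand-written lists, and the names `revq_encode`, `nearest`, and `router_scores` are assumptions. The point is that `k`, the number of active experts, controls how many codes are sent, independently of how many experts exist.

```python
def nearest(codebook, vec):
    """Codeword with the smallest squared distance to vec."""
    return min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(c, vec)))

def revq_encode(vec, shared, experts, router_scores, k):
    """Quantize vec with the shared codebook, then k routed expert codebooks.

    Each stage quantizes the residual left by the previous stages.
    Returns (reconstruction, indices of the experts that were used).
    """
    # Pick the k experts with the highest (assumed, externally supplied) scores.
    chosen = sorted(range(len(experts)), key=lambda i: router_scores[i], reverse=True)[:k]
    residual = list(vec)
    recon = [0.0] * len(vec)
    for codebook in [shared] + [experts[i] for i in chosen]:
        code = nearest(codebook, residual)
        recon = [r + c for r, c in zip(recon, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return recon, chosen

shared = [(1.0, 0.0), (0.0, 1.0)]
experts = [[(0.5, 0.0), (0.0, 0.0)], [(0.0, 0.5), (0.0, 0.0)], [(9.0, 9.0)]]
scores = [0.9, 0.8, 0.1]
print(revq_encode((1.5, 0.5), shared, experts, scores, k=1))
print(revq_encode((1.5, 0.5), shared, experts, scores, k=2))
```

In this toy setup, raising `k` from 1 to 2 spends one more code and brings the reconstruction closer to the input, which mirrors the variable-bitrate idea of activating more experts at inference.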

[AI-43] AMA: Adaptive Memory via Multi-Agent Collaboration

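[Quick read]: This paper addresses the semantic inconsistency and inefficient information management that memory system design flaws cause for Large Language Model (LLM) agents in long-term interaction and complex reasoning. Existing methods typically rely on rigid retrieval granularity, accumulation-heavy maintenance strategies, and coarse-grained update mechanisms, which mismatch task demands and let logical inconsistencies accumulate. The key to the solution is Adaptive Memory via Multi-Agent Collaboration (AMA), a framework with a hierarchical memory design that dynamically aligns retrieval granularity with task complexity. The Constructor and Retriever jointly handle multi-granularity memory construction and adaptive query routing. The Judge verifies the relevance and consistency of retrieved content, triggering iterative retrieval when evidence is insufficient and invoking the Refresher when it detects logical conflicts; the Refresher then performs targeted updates or removes outdated entries. This improves retrieval precision and long-term memory consistency while reducing token consumption by approximately 80% compared to full-context methods.

Link: https://arxiv.org/abs/2601.20352
Authors: Weiquan Huang, Zixuan Wang, Hehai Lin, Sudong Wang, Bo Xu, Qian Li, Beier Zhu, Linyi Yang, Chengwei Qin
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 8 pages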

Abstract:The rapid evolution of Large Language Model (LLM) agents has necessitated robust memory systems to support cohesive long-term interaction and complex reasoning. Benefiting from the strong capabilities of LLMs, recent research focus has shifted from simple context extension to the development of dedicated agentic memory systems. However, existing approaches typically rely on rigid retrieval granularity, accumulation-heavy maintenance strategies, and coarse-grained update mechanisms. These design choices create a persistent mismatch between stored information and task-specific reasoning demands, while leading to the unchecked accumulation of logical inconsistencies over time. To address these challenges, we propose Adaptive Memory via Multi-Agent Collaboration (AMA), a novel framework that leverages coordinated agents to manage memory across multiple granularities. AMA employs a hierarchical memory design that dynamically aligns retrieval granularity with task complexity. Specifically, the Constructor and Retriever jointly enable multi-granularity memory construction and adaptive query routing. The Judge verifies the relevance and consistency of retrieved content, triggering iterative retrieval when evidence is insufficient or invoking the Refresher upon detecting logical conflicts. The Refresher then enforces memory consistency by performing targeted updates or removing outdated entries. Extensive experiments on challenging long-context benchmarks show that AMA significantly outperforms state-of-the-art baselines while reducing token consumption by approximately 80% compared to full-context methods, demonstrating its effectiveness in maintaining retrieval precision and long-term memory consistency.
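
The Retriever, Judge, and Refresher interplay described above can be pictured as a small control loop. Everything concrete in this sketch is an assumption: the three callables are stand-ins for LLM-backed agents, the verdict strings and the granularity names `summary`, `episode`, and `raw` are invented for illustration, and the real system may order or combine these steps differently.

```python
def answer_with_memory(query, retrieve, judge, refresh, max_rounds=3):
    """Retrieve memory entries, escalating granularity until the judge accepts.

    retrieve(query, level) -> list of entries
    judge(query, entries)  -> "ok" | "insufficient" | "conflict"
    refresh(entries)       -> entries with conflicts resolved
    """
    levels = ["summary", "episode", "raw"]  # hypothetical coarse-to-fine levels
    entries = []
    for round_idx in range(max_rounds):
        level = levels[min(round_idx, len(levels) - 1)]
        entries = retrieve(query, level)
        verdict = judge(query, entries)
        if verdict == "conflict":
            # Resolve contradictions, then ask the judge again.
            entries = refresh(entries)
            verdict = judge(query, entries)
        if verdict == "ok":
            return entries
        # Otherwise the evidence is insufficient: retry at a finer level.
    return entries

# Stub agents: only episode-level memory satisfies this judge.
result = answer_with_memory(
    "when did we last meet?",
    retrieve=lambda q, level: [level],
    judge=lambda q, e: "ok" if e == ["episode"] else "insufficient",
    refresh=lambda e: e,
)
print(result)
```

Starting coarse and escalating only when the judge asks for more is one way such a design could keep token use well below a full-context approach.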

[AI-44] Multimodal Multi-Agent Ransomware Analysis Using AutoGen

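[Quick read]: This paper addresses the limited accuracy of traditional ransomware detection methods (static analysis, heuristic scanning, and behavioral analysis) when each is used alone. The key to the solution is a multimodal multi-agent analysis framework that combines static, dynamic, and network data sources. A specialized agent handles each data type with autoencoder-based feature extraction, a fusion agent integrates the resulting representations, and a Transformer-based classifier identifies the ransomware family. An inter-agent feedback mechanism iteratively refines the feature representations by suppressing low-confidence information. The framework reports a Macro-F1 of up to 0.936 for family classification with stable convergence, and it uses confidence-aware abstention to support reliable real-world deployment.

Link: https://arxiv.org/abs/2601.20346
Authors: Asifullah Khan, Aimen Wadood, Mubashar Iqbal, Umme Zahoora
Affiliations: unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 45 pages, 11 figures and 10 tables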

Abstract:Ransomware has become one of the most serious cybersecurity threats, causing major financial losses and operational disruptions. Traditional detection methods such as static analysis, heuristic scanning, and behavioral analysis often fall short when used alone. To address these limitations, this paper presents a multimodal multi-agent ransomware analysis framework designed for ransomware classification. The proposed architecture combines information from static, dynamic, and network sources. Each data type is handled by a specialized agent that uses autoencoder-based feature extraction. These representations are then integrated by a fusion agent, and the fused representation is passed to a transformer-based classifier that identifies the specific ransomware family. The agents interact through an inter-agent feedback mechanism that iteratively refines feature representations by suppressing low-confidence information. The framework was evaluated on large-scale datasets containing thousands of ransomware and benign samples. It outperforms single-modality and non-adaptive fusion baselines, improving Macro-F1 for family classification to as much as 0.936 and reducing calibration error. Over 100 epochs, the agentic feedback loop displays stable monotonic convergence, leading to an absolute improvement of more than 0.75 in agent quality and a final composite score of around 0.88 without fine-tuning of the language models. Zero-day ransomware detection remains family-dependent, varying with polymorphism and modality disruptions. Confidence-aware abstention enables reliable real-world deployment by favoring conservative and trustworthy decisions over forced classification. The findings indicate that the proposed approach provides a practical and effective path toward improving real-world ransomware defense systems.
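
Confidence-aware abstention, in its simplest form, means returning no label when the classifier's top probability is too low. The sketch below shows that rule only; the 0.7 threshold, the function name, and the placeholder family names are assumptions, and the paper may use a calibrated or learned criterion instead.

```python
def predict_or_abstain(probs, threshold=0.7):
    """Return the most likely label, or None when confidence is below threshold.

    probs: {label: probability} from any classifier.
    """
    label = max(probs, key=probs.get)
    return label if probs[label] >= threshold else None

# Placeholder labels; a confident prediction and an uncertain one.
print(predict_or_abstain({"family_a": 0.9, "benign": 0.1}))
print(predict_or_abstain({"family_a": 0.5, "family_b": 0.5}))
```

An abstained sample would typically be routed to an analyst or a slower pipeline, trading coverage for fewer forced misclassifications.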

[AI-45] Demonstration-Free Robotic Control via LLM Agents

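【Quick read】: This paper addresses the fact that current vision-language-action (VLA) models for robotic manipulation depend on task-specific demonstrations and fine-tuning, and generalize poorly under domain shift. The key to its solution is FAEA (Frontier Agent as Embodied Agent), which applies a general-purpose large language model (LLM) agent framework, originally built for software engineering, directly to embodied manipulation tasks without modification or training. Drawing on the iterative reasoning inherent to LLM agents, FAEA plans manipulation strategies autonomously. With privileged access to environment state, it reaches success rates of 84.9%, 85.7%, and 96% on the LIBERO, ManiSkill3, and MetaWorld benchmarks respectively, approaching VLA models trained with fewer than 100 demonstrations per task while requiring no demonstrations or fine-tuning. The approach lets robotic systems explore novel scenarios in simulation and generate successful trajectories without human intervention, which makes data generation for embodied learning more efficient.

Link: https://arxiv.org/abs/2601.20334
Authors: Brian Y. Tsui, Alan Y. Fang, Tiffany J. Hwu
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: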

Abstract:Robotic manipulation has increasingly adopted vision-language-action (VLA) models, which achieve strong performance but typically require task-specific demonstrations and fine-tuning, and often generalize poorly under domain shift. We investigate whether general-purpose large language model (LLM) agent frameworks, originally developed for software engineering, can serve as an alternative control paradigm for embodied manipulation. We introduce FAEA (Frontier Agent as Embodied Agent), which applies an LLM agent framework directly to embodied manipulation without modification. Using the same iterative reasoning that enables software agents to debug code, FAEA enables embodied agents to reason through manipulation strategies. We evaluate an unmodified frontier agent, Claude Agent SDK, across the LIBERO, ManiSkill3, and MetaWorld benchmarks. With privileged environment state access, FAEA achieves success rates of 84.9%, 85.7%, and 96%, respectively. This level of task success approaches that of VLA models trained with less than 100 demonstrations per task, without requiring demonstrations or fine-tuning. With one round of human feedback as an optional optimization, performance increases to 88.2% on LIBERO. This demonstration-free capability has immediate practical value: FAEA can autonomously explore novel scenarios in simulation and generate successful trajectories for training data augmentation in embodied learning. Our results indicate that general-purpose agents are sufficient for a class of manipulation tasks dominated by deliberative, task-level planning. This opens a path for robotics systems to leverage actively maintained agent infrastructure and benefit directly from ongoing advances in frontier models. Code is available at this https URL

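[AI-46] ECG-Agent: On-Device Tool-Calling Agent for ECG Multi-Turn Dialogue (ICASSP 2026)

【Quick read】: This paper addresses three limitations of current Multimodal Large Language Models (MLLMs) applied to electrocardiograms (ECGs): no multi-turn conversational ability, poor on-device efficiency, and imprecise understanding of ECG measurements such as the PQRST intervals. The key to its solution is ECG-Agent, presented as the first LLM-based tool-calling agent for multi-turn ECG dialogue, which relies on tool calls for precise reading of ECG data and interactive question answering. The authors also build ECG-Multi-Turn-Dialogue (ECG-MTD), a dataset of realistic multi-turn user-assistant dialogues, for development and evaluation. Experiments show that ECG-Agents outperform baseline ECG-LLMs in response accuracy, and that the lightweight on-device versions perform comparably to larger agents in evaluations of response accuracy, tool-calling ability, and hallucination, which supports their viability in real medical settings.

Link: https://arxiv.org/abs/2601.20323
Authors: Hyunseung Chung, Jungwoo Oh, Daeun Kyung, Jiho Kim, Yeonsu Kwon, Min-Gyu Kim, Edward Choi
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to ICASSP 2026 (5 pages, 2 figures, 5 tables)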

Abstract:Recent advances in Multimodal Large Language Models have rapidly expanded to electrocardiograms, focusing on classification, report generation, and single-turn QA tasks. However, these models fall short in real-world scenarios, lacking multi-turn conversational ability, on-device efficiency, and precise understanding of ECG measurements such as the PQRST intervals. To address these limitations, we introduce ECG-Agent, the first LLM-based tool-calling agent for multi-turn ECG dialogue. To facilitate its development and evaluation, we also present ECG-Multi-Turn-Dialogue (ECG-MTD) dataset, a collection of realistic user-assistant multi-turn dialogues for diverse ECG lead configurations. We develop ECG-Agents in various sizes, from on-device capable to larger agents. Experimental results show that ECG-Agents outperform baseline ECG-LLMs in response accuracy. Furthermore, on-device agents achieve comparable performance to larger agents in various evaluations that assess response accuracy, tool-calling ability, and hallucinations, demonstrating their viability for real-world applications.

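[AI-47] DiagLink: A Dual-User Diagnostic Assistance System by Synergizing Experts with LLMs and Knowledge Graphs

【Quick read】: This paper addresses limited access to diagnostic care caused by the global shortage and uneven distribution of medical expertise, along with the weaknesses of existing intelligent diagnostic systems in dual-user interaction (patients and physicians) and dynamic knowledge integration. The key to its solution is DiagLink, a system that combines large language models (LLMs), knowledge graphs (KGs), and medical experts. It elicits patient histories through guided dialogues, reasons jointly over multi-source evidence, and brings in physician oversight so that knowledge is continuously validated and evolved. It also provides a role-adaptive interface and a dynamically visualized history, with the aim of improving diagnostic efficiency and user trust.

Link: https://arxiv.org/abs/2601.20311
Authors: Zihan Zhou, Yinan Liu, Yuyang Xie, Bin Wang, Xiaochun Yang, Zezheng Feng
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: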

Abstract:The global shortage and uneven distribution of medical expertise continue to hinder equitable access to accurate diagnostic care. While existing intelligent diagnostic system have shown promise, most struggle with dual-user interaction, and dynamic knowledge integration – limiting their real-world applicability. In this study, we present DiagLink, a dual-user diagnostic assistance system that synergizes large language models (LLMs), knowledge graphs (KGs), and medical experts to support both patients and physicians. DiagLink uses guided dialogues to elicit patient histories, leverages LLMs and KGs for collaborative reasoning, and incorporates physician oversight for continuous knowledge validation and evolution. The system provides a role-adaptive interface, dynamically visualized history, and unified multi-source evidence to improve both trust and usability. We evaluate DiagLink through user study, use cases and expert interviews, demonstrating its effectiveness in improving user satisfaction and diagnostic efficiency, while offering insights for the design of future AI-assisted diagnostic systems.

[AI-48] SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips

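【Quick read】: This paper addresses the tension in large language model (LLM) inference serving between limited GPU memory capacity and strict latency Service Level Objectives (SLOs). Under high request rates, existing systems often suffer head-of-line (HOL) blocking and miss key SLO metrics such as Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT). The key to its solution is SuperInfer, built on two components: RotaSched, described as the first proactive, SLO-aware rotary scheduler, which rotates requests to maintain responsiveness; and DuplexKV, an optimized rotation engine that enables full-duplex transfer over the NVLink-C2C interconnect. Together they co-design scheduling and memory management for Superchips such as the NVIDIA GH200. Experiments show that SuperInfer improves TTFT SLO attainment rates by up to 74.7% while keeping TBT and throughput comparable to state-of-the-art systems.

Link: https://arxiv.org/abs/2601.20309
Authors: Jiahuan Yu, Mingtao Hu, Zichao Lin, Minjia Zhang
Affiliations: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by MLSys '26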

Abstract:Large Language Model (LLM) serving faces a fundamental tension between stringent latency Service Level Objectives (SLOs) and limited GPU memory capacity. When high request rates exhaust the KV cache budget, existing LLM inference systems often suffer severe head-of-line (HOL) blocking. While prior work explored PCIe-based offloading, these approaches cannot sustain responsiveness under high request rates, often failing to meet tight Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT) SLOs. We present SuperInfer, a high-performance LLM inference system designed for emerging Superchips (e.g., NVIDIA GH200) with tightly coupled GPU-CPU architecture via NVLink-C2C. SuperInfer introduces RotaSched, the first proactive, SLO-aware rotary scheduler that rotates requests to maintain responsiveness on Superchips, and DuplexKV, an optimized rotation engine that enables full-duplex transfer over NVLink-C2C. Evaluations on GH200 using various models and datasets show that SuperInfer improves TTFT SLO attainment rates by up to 74.7% while maintaining comparable TBT and throughput compared to state-of-the-art systems, demonstrating that SLO-aware scheduling and memory co-design unlocks the full potential of Superchips for responsive LLM serving.
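
To make the headline metric concrete, here is a minimal sketch of how a TTFT SLO attainment rate could be computed from per-request timestamps. It illustrates only the metric named in the abstract. `RequestTrace`, `ttft_slo_attainment`, and the sample timestamps are hypothetical and do not come from SuperInfer.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    arrival_s: float      # time the request entered the serving queue
    first_token_s: float  # time its first output token was emitted


def ttft_slo_attainment(traces, ttft_slo_s):
    """Fraction of requests whose time-to-first-token is within the SLO."""
    if not traces:
        return 0.0
    met = sum(1 for t in traces if (t.first_token_s - t.arrival_s) <= ttft_slo_s)
    return met / len(traces)


# Hypothetical trace: the third request waits behind others and misses the SLO.
traces = [
    RequestTrace(0.0, 0.25),
    RequestTrace(0.5, 1.0),
    RequestTrace(1.0, 3.0),
    RequestTrace(2.0, 2.25),
]
rate = ttft_slo_attainment(traces, ttft_slo_s=0.5)
```

A request delayed by head-of-line blocking shows up here as a large gap between `arrival_s` and `first_token_s`, which is the kind of miss a rotary scheduler is meant to reduce.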

[AI-49] Endogenous Reprompting: Self-Evolving Cognitive Alignment for Unified Multimodal Models

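【Quick read】: This paper addresses the Cognitive Gap in Unified Multimodal Models (UMMs): the models understand well, yet that understanding often fails to guide generation effectively. The key to its solution is Endogenous Reprompting, a mechanism that turns the model's understanding from passive encoding into an explicit generative reasoning step by producing self-aligned descriptors during generation. To realize it, the authors design SEER (Self-Evolving Evaluator and Reprompter), a training framework that uses only 300 samples from a compact proxy task, Visual Instruction Elaboration, to set up a two-stage endogenous loop. First, Reinforcement Learning with Verifiable Rewards (RLVR) activates the model's latent evaluation ability through curriculum learning and produces a high-fidelity endogenous reward signal. Second, Reinforcement Learning with Model-rewarded Thinking (RLMT) uses that signal to optimize the generative reasoning policy. Experiments show gains over state-of-the-art baselines in evaluation accuracy, reprompting efficiency, and generation quality, without sacrificing general multimodal capabilities.

Link: https://arxiv.org/abs/2601.20305
Authors: Zhenchen Tang, Songlin Yang, Zichuan Wang, Bo Peng, Yang Li, Beibei Dong, Jing Dong
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: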

Abstract:Unified Multimodal Models (UMMs) exhibit strong understanding, yet this capability often fails to effectively guide generation. We identify this as a Cognitive Gap: the model lacks the understanding of how to enhance its own generation process. To bridge this gap, we propose Endogenous Reprompting, a mechanism that transforms the model’s understanding from a passive encoding process into an explicit generative reasoning step by generating self-aligned descriptors during generation. To achieve this, we introduce SEER (Self-Evolving Evaluator and Reprompter), a training framework that establishes a two-stage endogenous loop using only 300 samples from a compact proxy task, Visual Instruction Elaboration. First, Reinforcement Learning with Verifiable Rewards (RLVR) activates the model’s latent evaluation ability via curriculum learning, producing a high-fidelity endogenous reward signal. Second, Reinforcement Learning with Model-rewarded Thinking (RLMT) leverages this signal to optimize the generative reasoning policy. Experiments show that SEER consistently outperforms state-of-the-art baselines in evaluation accuracy, reprompting efficiency, and generation quality, without sacrificing general multimodal capabilities.

[AI-50] Cheap2Rich: A Multi-Fidelity Framework for Data Assimilation and System Identification of Multiscale Physics – Rotating Detonation Engines

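【Quick read】: This paper addresses the sim2real gap between computationally inexpensive models and complex physical systems, especially in multi-scale settings where reduced-order models typically capture only the dominant dynamics and cannot represent the high-fidelity state space accurately. The key to its solution is Cheap2Rich, a multi-scale data assimilation framework that reconstructs high-fidelity state spaces from sparse sensor histories by combining a fast low-fidelity prior with learned, interpretable discrepancy corrections. The method is demonstrated on rotating detonation engines (RDEs), where it recovers high-fidelity states and isolates physically meaningful discrepancy dynamics driven by the injectors. The result is a general, interpretable multi-fidelity approach to data assimilation and system identification in complex multi-scale systems.

Link: https://arxiv.org/abs/2601.20295
Authors: Yuxuan Bao, Jan Zajac, Megan Powers, Venkat Raman, J. Nathan Kutz
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Dynamical Systems (math.DS)
Comments: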

Abstract: Bridging the sim2real gap between computationally inexpensive models and complex physical systems remains a central challenge in machine learning applications to engineering problems, particularly in multi-scale settings where reduced-order models typically capture only dominant dynamics. In this work, we present Cheap2Rich, a multi-scale data assimilation framework that reconstructs high-fidelity state spaces from sparse sensor histories by combining a fast low-fidelity prior with learned, interpretable discrepancy corrections. We demonstrate the performance on rotating detonation engines (RDEs), a challenging class of systems that couple detonation-front propagation with injector-driven unsteadiness, mixing, and stiff chemistry across disparate scales. Our approach successfully reconstructs high-fidelity RDE states from sparse measurements while isolating physically meaningful discrepancy dynamics associated with injector-driven effects. The results highlight a general multi-fidelity framework for data assimilation and system identification in complex multi-scale systems, enabling rapid design exploration and real-time monitoring and control while providing interpretable discrepancy dynamics. Code for this project is available at: this http URL.

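[AI-51] The Forecast After the Forecast: A Post-Processing Shift in Time Series

【Quick read】: This paper addresses the "last-mile" problem in time series forecasting: improving accuracy and uncertainty calibration without retraining or modifying a deployed backbone model. Its core solution is $\delta$-Adapter, a lightweight, architecture-agnostic post-processing method that learns tiny, bounded modules at two interfaces: input nudging (soft edits to covariates) and output residual correction. $\delta$-Adapter can also act as a feature selector by learning a sparse, horizon-aware mask over inputs, which improves interpretability. For uncertainty, it adds a Quantile Calibrator and a Conformal Corrector that together deliver calibrated, personalized intervals with finite-sample coverage. The reported result is better accuracy and calibration at negligible compute cost.

Link: https://arxiv.org/abs/2601.20280
Authors: Daojun Liang, Qi Li, Yinglong Wang, Jing Chen, Hu Zhang, Xiaoxiao Cui, Qizheng Wang, Shuo Li
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 30 Pages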

Abstract: Time series forecasting has long been dominated by advances in model architecture, with recent progress driven by deep learning and hybrid statistical techniques. However, as forecasting models approach diminishing returns in accuracy, a critical yet underexplored opportunity emerges: the strategic use of post-processing. In this paper, we address the last-mile gap in time-series forecasting, which is to improve accuracy and uncertainty without retraining or modifying a deployed backbone. We propose $\delta$-Adapter, a lightweight, architecture-agnostic way to boost deployed time series forecasters without retraining. $\delta$-Adapter learns tiny, bounded modules at two interfaces: input nudging (soft edits to covariates) and output residual correction. We provide local descent guarantees, $O(\delta)$ drift bounds, and compositional stability for combined adapters. Meanwhile, it can act as a feature selector by learning a sparse, horizon-aware mask over inputs to select important features, thereby improving interpretability. In addition, it can also be used as a distribution calibrator to measure uncertainty. Thus, we introduce a Quantile Calibrator and a Conformal Corrector that together deliver calibrated, personalized intervals with finite-sample coverage. Our experiments across diverse backbones and datasets show that $\delta$-Adapter improves accuracy and calibration with negligible compute and no interface changes.
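
The phrase "finite-sample coverage" can be illustrated with standard split conformal prediction wrapped around a frozen forecaster. The sketch below is that textbook procedure, not the paper's Conformal Corrector, whose details the abstract does not give. `split_conformal_halfwidth` and the calibration residuals are hypothetical.

```python
import math

import numpy as np


def split_conformal_halfwidth(cal_residuals, alpha):
    """Half-width q so that [pred - q, pred + q] covers with probability
    at least 1 - alpha, assuming calibration and test points are exchangeable."""
    scores = np.sort(np.abs(np.asarray(cal_residuals, dtype=float)))
    n = scores.size
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return float(scores[k - 1])


# Residuals (truth minus forecast) of a deployed model on held-out data.
cal_residuals = [0.1, -0.4, 0.3, -0.2, 0.5, -0.1, 0.2, 0.6, -0.3]
halfwidth = split_conformal_halfwidth(cal_residuals, alpha=0.25)

point_forecast = 10.0
interval = (point_forecast - halfwidth, point_forecast + halfwidth)
```

The backbone is never retrained here: only its residuals on a calibration split are used, which matches the post-processing setting the abstract describes.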

[AI-52] Eliciting Least-to-Most Reasoning for Phishing URL Detection

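【Quick read】: This paper addresses phishing URL classification with large language models (LLMs), where the reasoning abilities behind their detection performance remain underexplored. The key to its solution is a Least-to-Most prompting framework with an "answer sensitivity" mechanism that guides the model's iterative reasoning and raises prediction accuracy. Evaluated on three URL datasets and four LLMs, the method outperforms a one-shot prompting baseline and performs comparably to a supervised model while requiring far less training data, which supports iterative reasoning as a way to strengthen LLM-based phishing URL detection.

Link: https://arxiv.org/abs/2601.20270
Authors: Holly Trikilis, Pasindu Marasinghe, Fariza Rashid, Suranga Seneviratne
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: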

Abstract:Phishing continues to be one of the most prevalent attack vectors, making accurate classification of phishing URLs essential. Recently, large language models (LLMs) have demonstrated promising results in phishing URL detection. However, their reasoning capabilities that enabled such performance remain underexplored. To this end, in this paper, we propose a Least-to-Most prompting framework for phishing URL detection. In particular, we introduce an “answer sensitivity” mechanism that guides Least-to-Most’s iterative approach to enhance reasoning and yield higher prediction accuracy. We evaluate our framework using three URL datasets and four state-of-the-art LLMs, comparing against a one-shot approach and a supervised model. We demonstrate that our framework outperforms the one-shot baseline while achieving performance comparable to that of the supervised model, despite requiring significantly less training data. Furthermore, our in-depth analysis highlights how the iterative reasoning enabled by Least-to-Most, and reinforced by our answer sensitivity mechanism, drives these performance gains. Overall, we show that this simple yet powerful prompting strategy consistently outperforms both one-shot and supervised approaches, despite requiring minimal training or few-shot guidance. Our experimental setup can be found in our Github repository this http URL.

[AI-53] Robust SDE Parameter Estimation Under Missing Time Information Setting

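【Quick read】: This paper addresses parameter estimation for stochastic differential equations (SDEs) when accurate timestamps or temporal ordering are unavailable. Conventional methods rely on precisely ordered time series and often fail when the ordering is corrupted, missing, or deliberately hidden for privacy. The key to its solution is to exploit asymmetries between forward and backward processes: a score-matching criterion infers the correct temporal order between pairs of observations, a sorting procedure then recovers the total order, and maximum likelihood estimates the SDE parameters from the reconstructed sequence. The framework recovers temporal structure and estimates parameters together from unordered observations, which extends SDE modeling to privacy-sensitive settings.

Link: https://arxiv.org/abs/2601.20268
Authors: Long Van Tran, Truyen Tran, Phuoc Nguyen
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: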

Abstract:Recent advances in stochastic differential equations (SDEs) have enabled robust modeling of real-world dynamical processes across diverse domains, such as finance, health, and systems biology. However, parameter estimation for SDEs typically relies on accurately timestamped observational sequences. When temporal ordering information is corrupted, missing, or deliberately hidden (e.g., for privacy), existing estimation methods often fail. In this paper, we investigate the conditions under which temporal order can be recovered and introduce a novel framework that simultaneously reconstructs temporal information and estimates SDE parameters. Our approach exploits asymmetries between forward and backward processes, deriving a score-matching criterion to infer the correct temporal order between pairs of observations. We then recover the total order via a sorting procedure and estimate SDE parameters from the reconstructed sequence using maximum likelihood. Finally, we conduct extensive experiments on synthetic and real-world datasets to demonstrate the effectiveness of our method, extending parameter estimation to settings with missing temporal order and broadening applicability in sensitive domains.
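
Why temporal order matters for the final maximum-likelihood step can be seen in a small experiment. The sketch below assumes an Ornstein-Uhlenbeck process $dX = -\theta X\,dt + \sigma\,dW$ and the Euler-Maruyama Gaussian likelihood; the model choice, `simulate_ou`, and `fit_ou` are illustrative assumptions, and the paper's score-matching order recovery is not implemented here.

```python
import numpy as np


def simulate_ou(theta, sigma, dt, n_steps, rng):
    """Euler-Maruyama path of dX = -theta * X dt + sigma dW, started at 0."""
    x = np.zeros(n_steps + 1)
    noise = rng.normal(0.0, np.sqrt(dt), size=n_steps)
    for k in range(n_steps):
        x[k + 1] = x[k] - theta * x[k] * dt + sigma * noise[k]
    return x


def fit_ou(x, dt):
    """Maximize the Euler-Maruyama Gaussian likelihood of consecutive increments."""
    prev, incr = x[:-1], np.diff(x)
    theta_hat = -np.sum(prev * incr) / (dt * np.sum(prev ** 2))
    resid = incr + theta_hat * prev * dt
    sigma_hat = np.sqrt(np.mean(resid ** 2) / dt)
    return theta_hat, sigma_hat


rng = np.random.default_rng(0)
dt = 0.01
path = simulate_ou(theta=1.5, sigma=0.5, dt=dt, n_steps=20_000, rng=rng)

# Correctly ordered observations give estimates near the true parameters.
theta_ordered, sigma_ordered = fit_ou(path, dt)

# The same observations in random order destroy the increment structure.
theta_shuffled, _ = fit_ou(rng.permutation(path), dt)
```

With shuffled samples, consecutive values are nearly uncorrelated, so the drift estimate inflates toward roughly `1 / dt` instead of the true value. That is the failure mode that makes recovering the order a prerequisite for estimation.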

[AI-54] Order-Optimal Sample Complexity of Rectified Flows

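【Quick read】: This paper studies the sample complexity of rectified flow models, a class of flow-based generative models that constrain transport trajectories to straight lines from the base distribution to the data distribution. This structural restriction greatly accelerates sampling and often allows high-quality generation with a single Euler step. Under standard assumptions, the authors prove that rectified flows achieve sample complexity $\tilde{O}(\varepsilon^{-2})$, which improves on the best known $O(\varepsilon^{-4})$ bounds for flow matching models and matches the optimal rate for mean estimation. The improvement comes from the sharply controlled localized Rademacher complexity that the rectified flow structure admits, which also offers a theoretical explanation for the models' strong empirical performance.

Link: https://arxiv.org/abs/2601.20250
Authors: Hari Krishna Sahoo, Mudit Gaur, Vaneet Aggarwal
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
Comments: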

Abstract: Recently, flow-based generative models have shown superior efficiency compared to diffusion models. In this paper, we study rectified flow models, which constrain transport trajectories to be linear from the base distribution to the data distribution. This structural restriction greatly accelerates sampling, often enabling high-quality generation with a single Euler step. Under standard assumptions on the neural network classes used to parameterize the velocity field and data distribution, we prove that rectified flows achieve sample complexity $\tilde{O}(\varepsilon^{-2})$. This improves on the best known $O(\varepsilon^{-4})$ bounds for flow matching model and matches the optimal rate for mean estimation. Our analysis exploits the particular structure of rectified flows: because the model is trained with a squared loss along linear paths, the associated hypothesis class admits a sharply controlled localized Rademacher complexity. This yields the improved, order-optimal sample complexity and provides a theoretical explanation for the strong empirical performance of rectified flow models.
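
The two ingredients named in the abstract, a squared loss along linear paths and a single Euler step at sampling time, can be sketched on a toy problem. The translation coupling, the linear velocity model, and every name below are assumptions chosen so the ideal velocity field is known exactly; real rectified flows use neural networks and more general couplings.

```python
import numpy as np

rng = np.random.default_rng(0)
shift = np.array([3.0, -1.0])

# Paired samples: base x0 ~ N(0, I) and data x1. The coupling is a pure
# translation, so the ideal velocity is the constant vector `shift`.
x0 = rng.normal(size=(512, 2))
x1 = x0 + shift

# Training pairs: a point on the straight path and its target velocity.
t = rng.uniform(size=(512, 1))
x_t = (1.0 - t) * x0 + t * x1
v_target = x1 - x0

# Fit v(x, t) = [x, t, 1] @ W by least squares (the squared loss along linear paths).
features = np.hstack([x_t, t, np.ones_like(t)])
W, *_ = np.linalg.lstsq(features, v_target, rcond=None)


def velocity(x, t_scalar):
    """Evaluate the fitted linear velocity field at points x and time t_scalar."""
    t_col = np.full((x.shape[0], 1), t_scalar)
    return np.hstack([x, t_col, np.ones_like(t_col)]) @ W


# Sampling: one Euler step of size 1 from t = 0 to t = 1.
z = rng.normal(size=(5, 2))
generated = z + 1.0 * velocity(z, 0.0)
```

Because the learned trajectories are straight in this toy case, one Euler step lands on the data distribution exactly, which is the property that makes few-step sampling attractive.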

[AI-55] How AI Impacts Skill Formation

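【Quick read】: This paper asks whether AI assistance, while raising productivity, interferes with how novice developers acquire key skills such as conceptual understanding, code reading, and debugging, and in particular whether heavy reliance on AI for unfamiliar programming tasks weakens independent learning and skill formation. In randomized experiments on learning a new asynchronous programming library, AI use impaired those skills without delivering significant efficiency gains on average. The key finding is that among six distinct AI interaction patterns, three involve cognitive engagement and preserve learning outcomes even with AI assistance. This indicates that AI-enhanced productivity is not a shortcut to competence, and that AI assistance should be adopted carefully to preserve skill formation, especially in safety-critical domains.

Link: https://arxiv.org/abs/2601.20245
Authors: Judy Hanwen Shen, Alex Tamkin
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: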

Abstract:AI assistance produces significant productivity gains across professional domains, particularly for novice workers. Yet how this assistance affects the development of skills required to effectively supervise AI remains unclear. Novice workers who rely heavily on AI to complete unfamiliar tasks may compromise their own skill acquisition in the process. We conduct randomized experiments to study how developers gained mastery of a new asynchronous programming library with and without the assistance of AI. We find that AI use impairs conceptual understanding, code reading, and debugging abilities, without delivering significant efficiency gains on average. Participants who fully delegated coding tasks showed some productivity improvements, but at the cost of learning the library. We identify six distinct AI interaction patterns, three of which involve cognitive engagement and preserve learning outcomes even when participants receive AI assistance. Our findings suggest that AI-enhanced productivity is not a shortcut to competence and AI assistance should be carefully adopted into workflows to preserve skill formation – particularly in safety-critical domains.

[AI-56] MALLOC: Benchmarking the Memory-aware Long Sequence Compression for Large Sequential Recommendation

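【Quick read】: This paper addresses the high computational and memory cost that long-sequence dependencies impose on large-scale recommender systems, especially when user behavior sequences are long and many intermediate states must be stored. Existing methods pre-store historical states to avoid repeated computation, but they tend to overlook the memory footprint, which limits deployment at the scale of billions of users. The key to its solution is MALLOC, a memory-aware benchmark for long sequence compression that systematically classifies and evaluates memory management techniques applicable to recommendation, including compression strategies from the LLM field. These techniques are integrated into state-of-the-art recommenders and evaluated for accuracy, efficiency, and complexity, giving a reproducible and accessible platform for scaling large recommendation models.

Link: https://arxiv.org/abs/2601.20234
Authors: Qihang Yu, Kairui Fu, Zhaocheng Du, Yuxuan Si, Kaiyuan Li, Weihao Zhao, Zhicheng Zhang, Jieming Zhu, Quanyu Dai, Zhenhua Dong, Shengyu Zhang, Kun Kuang, Fei Wu
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: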

Abstract:The scaling law, which indicates that model performance improves with increasing dataset and model capacity, has fueled a growing trend in expanding recommendation models in both industry and academia. However, the advent of large-scale recommenders also brings significantly higher computational costs, particularly under the long-sequence dependencies inherent in the user intent of recommendation systems. Current approaches often rely on pre-storing the intermediate states of the past behavior for each user, thereby reducing the quadratic re-computation cost for the following requests. Despite their effectiveness, these methods often treat memory merely as a medium for acceleration, without adequately considering the space overhead it introduces. This presents a critical challenge in real-world recommendation systems with billions of users, each of whom might initiate thousands of interactions and require massive memory for state storage. Fortunately, there have been several memory management strategies examined for compression in LLM, while most have not been evaluated on the recommendation task. To mitigate this gap, we introduce MALLOC, a comprehensive benchmark for memory-aware long sequence compression. MALLOC presents a comprehensive investigation and systematic classification of memory management techniques applicable to large sequential recommendations. These techniques are integrated into state-of-the-art recommenders, enabling a reproducible and accessible evaluation platform. Through extensive experiments across accuracy, efficiency, and complexity, we demonstrate the holistic reliability of MALLOC in advancing large-scale recommendation. Code is available at this https URL.

[AI-57] Certificate-Guided Pruning for Stochastic Lipschitz Optimization

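【Quick read】: This paper addresses black-box optimization of Lipschitz-continuous functions under noisy observations, where the core challenge is to avoid suboptimal regions while providing quantifiable convergence guarantees. Existing adaptive discretization methods avoid suboptimal regions implicitly but offer neither optimality certificates nor explicit progress guarantees. The key to its solution is Certificate-Guided Pruning (CGP), which maintains an explicit active set $A_t$ of potentially optimal points and uses confidence-adjusted Lipschitz envelopes to rule out the rest: any point outside $A_t$ is certifiably suboptimal with high probability. Under a margin condition with near-optimality dimension $\alpha$, the paper proves that the active-set volume $\mathrm{Vol}(A_t)$ shrinks at a controlled rate, which gives a sample complexity of $\tilde{O}(\varepsilon^{-(2+\alpha)})$. Three extensions follow: CGP-Adaptive estimates the Lipschitz constant $L$ online with only $O(\log T)$ overhead; CGP-TR adds trust regions with local certificates to scale to higher dimensions; and CGP-Hybrid switches to Gaussian process (GP) refinement once local smoothness is detected. Experiments show the CGP variants match or exceed strong baselines on 12 benchmarks and provide principled stopping criteria through certificate volume.

Link: https://arxiv.org/abs/2601.20231
Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: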

Abstract: We study black-box optimization of Lipschitz functions under noisy evaluations. Existing adaptive discretization methods implicitly avoid suboptimal regions but do not provide explicit certificates of optimality or measurable progress guarantees. We introduce Certificate-Guided Pruning (CGP), which maintains an explicit active set $A_t$ of potentially optimal points via confidence-adjusted Lipschitz envelopes. Any point outside $A_t$ is certifiably suboptimal with high probability, and under a margin condition with near-optimality dimension $\alpha$, we prove $\mathrm{Vol}(A_t)$ shrinks at a controlled rate yielding sample complexity $\tilde{O}(\varepsilon^{-(2+\alpha)})$. We develop three extensions: CGP-Adaptive learns $L$ online with $O(\log T)$ overhead; CGP-TR scales to d 50 via trust regions with local certificates; and CGP-Hybrid switches to GP refinement when local smoothness is detected. Experiments on 12 benchmarks ($d \in [2, 100]$) show CGP variants match or exceed strong baselines while providing principled stopping criteria via certificate volume.
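
The idea of an active set defined by Lipschitz envelopes can be sketched in one dimension. The code below is an illustrative reading of the abstract, not the authors' algorithm: it assumes maximization, a known Lipschitz constant, and a uniform confidence radius `beta`. `active_set` and the toy objective are hypothetical.

```python
import numpy as np


def active_set(grid, x_obs, y_mean, lipschitz, beta):
    """Boolean mask of grid points that cannot yet be ruled out as maximizers.

    Assumes |y_mean[i] - f(x_obs[i])| <= beta for every observation and that
    f is `lipschitz`-Lipschitz; both are assumptions of this sketch.
    """
    # Upper envelope: tightest Lipschitz upper bound on f implied by the data.
    dist = np.abs(grid[:, None] - x_obs[None, :])
    upper = np.min(y_mean[None, :] + beta + lipschitz * dist, axis=1)
    # Certified lower bound on the maximum of f.
    best_lower = np.max(y_mean - beta)
    return upper >= best_lower


def toy_objective(x):
    """1-Lipschitz test function, maximized at x = 0.3."""
    return -np.abs(x - 0.3)


grid = np.linspace(0.0, 1.0, 101)
x_obs = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y_mean = toy_objective(x_obs)  # noise-free here to keep the example reproducible
mask = active_set(grid, x_obs, y_mean, lipschitz=1.0, beta=0.01)
```

Every grid point where `mask` is false has an upper bound below the certified best value, so it is excluded with a certificate. The fraction `mask.mean()` plays the role of the shrinking volume that the abstract uses as a stopping signal.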

[AI-58] ProFlow: Zero-Shot Physics-Consistent Sampling via Proximal Flow Guidance

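【Quick read】: This paper addresses how to infer physical fields from sparse observations while strictly satisfying partial differential equation (PDE) constraints, without damaging the statistical structure of a pre-trained generative prior. Existing methods usually require costly retraining or distort the prior, so they struggle to balance physical consistency against fidelity to observations. The key to its solution is ProFlow, a framework built on a two-step alternating scheme: first, a terminal optimization step projects the flow prediction onto the intersection of the physically consistent and observationally consistent sets through proximal minimization; second, an interpolation step maps the refined state back onto the generative trajectory so that it stays consistent with the learned flow probability path. The procedure can be read as a sequence of local maximum a posteriori (MAP) updates, which gives zero-shot physics-consistent sampling without task-specific retraining.

Link: https://arxiv.org/abs/2601.20227
Authors: Zichao Yu, Ming Li, Wenyi Zhang, Difan Zou, Weiguo Gao
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments: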

Abstract: Inferring physical fields from sparse observations while strictly satisfying partial differential equations (PDEs) is a fundamental challenge in computational physics. Recently, deep generative models offer powerful data-driven priors for such inverse problems, yet existing methods struggle to enforce hard physical constraints without costly retraining or disrupting the learned generative prior. Consequently, there is a critical need for a sampling mechanism that can reconcile strict physical consistency and observational fidelity with the statistical structure of the pre-trained prior. To this end, we present ProFlow, a proximal guidance framework for zero-shot physics-consistent sampling, defined as inferring solutions from sparse observations using a fixed generative prior without task-specific retraining. The algorithm employs a rigorous two-step scheme that alternates between: (i) a terminal optimization step, which projects the flow prediction onto the intersection of the physically and observationally consistent sets via proximal minimization; and (ii) an interpolation step, which maps the refined state back to the generative trajectory to maintain consistency with the learned flow probability path. This procedure admits a Bayesian interpretation as a sequence of local maximum a posteriori (MAP) updates. Comprehensive benchmarks on Poisson, Helmholtz, Darcy, and viscous Burgers’ equations demonstrate that ProFlow achieves superior physical and observational consistency, as well as more accurate distributional statistics, compared to state-of-the-art diffusion- and flow-based baselines.

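[AI-59] Towards Intelligent Urban Park Development Monitoring: LLM Agents for Multi-Modal Information Fusion and Analysis

【Quick read】: This paper addresses the limited capacity of traditional remote-sensing change detection methods for high-level, intelligent analysis in urban park development monitoring, in particular their inability to adapt flexibly to different application scenarios when handling complex multi-modal data. The key to its solution is a multi-modal agent framework based on large language models (LLMs). It includes a general horizontal and vertical data alignment mechanism that keeps multi-modal data consistent and traceable, and a domain-specific toolkit that mitigates LLM hallucinations caused by missing domain knowledge. Together these enable robust multi-modal information fusion and analysis, offering reliable and scalable support for urban park development monitoring.

Link: https://arxiv.org/abs/2601.20206
Authors: Zixuan Xiao, Chunguang Hu, Jun Ma
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: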

Abstract:As an important part of urbanization, the development monitoring of newly constructed parks is of great significance for evaluating the effect of urban planning and optimizing resource allocation. However, traditional change detection methods based on remote sensing imagery have obvious limitations in high-level and intelligent analysis, and thus are difficult to meet the requirements of current urban planning and management. In face of the growing demand for complex multi-modal data analysis in urban park development monitoring, these methods often fail to provide flexible analysis capabilities for diverse application scenarios. This study proposes a multi-modal LLM agent framework, which aims to make full use of the semantic understanding and reasoning capabilities of LLM to meet the challenges in urban park development monitoring. In this framework, a general horizontal and vertical data alignment mechanism is designed to ensure the consistency and effective tracking of multi-modal data. At the same time, a specific toolkit is constructed to alleviate the hallucination issues of LLM due to the lack of domain-specific knowledge. Compared to vanilla GPT-4o and other agents, our approach enables robust multi-modal information fusion and analysis, offering reliable and scalable solutions tailored to the diverse and evolving demands of urban park development monitoring.

[AI-60] Meta-Cognitive Reinforcement Learning with Self-Doubt and Recovery

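【Quick read】: This paper addresses a gap in robust reinforcement learning: when facing unreliable experiences or corrupted rewards, existing methods cannot reason about the reliability of their own learning process, so they either become overly conservative or fail catastrophically as uncertainty accumulates. The key to its solution is a meta-cognitive reinforcement learning framework that introduces a meta-trust variable driven by Value Prediction Error Stability (VPES), allowing the agent to assess, regulate, and recover its learning behavior. The variable suppresses abnormal learning through fail-safe regulation and gradually rebuilds trust later in training. On continuous-control tasks with reward corruption, this yields higher average returns and significantly fewer late-stage training failures than strong robustness baselines.

Link: https://arxiv.org/abs/2601.20193
Authors: Zhipeng Zhang, Wenting Ma, Kai Li, Meng Guo, Lei Yang, Wei Yu, Hongji Cui, Yichen Zhang, Mo Zhang, Jinzhe Lin, Zhenjie Yao
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: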

Abstract: Robust reinforcement learning methods typically focus on suppressing unreliable experiences or corrupted rewards, but they lack the ability to reason about the reliability of their own learning process. As a result, such methods often either overreact to noise by becoming overly conservative or fail catastrophically when uncertainty accumulates. In this work, we propose a meta-cognitive reinforcement learning framework that enables an agent to assess, regulate, and recover its learning behavior based on internally estimated reliability signals. The proposed method introduces a meta-trust variable driven by Value Prediction Error Stability (VPES), which modulates learning dynamics via fail-safe regulation and gradual trust recovery. Experiments on continuous-control benchmarks with reward corruption demonstrate that recovery-enabled meta-cognitive control achieves higher average returns and significantly reduces late-stage training failures compared to strong robustness baselines.

[AI-61] Causal-Driven Feature Evaluation for Cross-Domain Image Classification

【速读】:该论文旨在解决现实世界分类任务中分布外(Out-of-distribution, OOD)泛化能力不足的问题,即测试数据分布与训练数据存在显著差异时模型性能下降的挑战。现有方法多依赖于寻找跨域不变表示(domain-invariant representations),但该研究指出,不变性并不等同于因果有效性。其解决方案的关键在于从因果视角重新审视OOD分类问题,提出通过评估特征在分布变化下的必要性和充分性来衡量表示的质量,并引入一种显式的分段级(segment-level)框架直接测量跨域的因果有效性,从而提供比单纯依赖不变性更可靠的评价标准。实验表明,该方法在多域基准上实现了稳定的OOD性能提升,尤其在严峻的域偏移场景下效果显著。

Link: https://arxiv.org/abs/2601.20176
Authors: Chen Cheng,Ang Li
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Out-of-distribution (OOD) generalization remains a fundamental challenge in real-world classification, where test distributions often differ substantially from training data. Most existing approaches pursue domain-invariant representations, implicitly assuming that invariance implies reliability. However, features that are invariant across domains are not necessarily causally effective for prediction. In this work, we revisit OOD classification from a causal perspective and propose to evaluate learned representations based on their necessity and sufficiency under distribution shift. We introduce an explicit segment-level framework that directly measures causal effectiveness across domains, providing a more faithful criterion than invariance alone. Experiments on multi-domain benchmarks demonstrate consistent improvements in OOD performance, particularly under challenging domain shifts, highlighting the value of causal evaluation for robust generalization.

[AI-62] NeuraLSP: An Efficient and Rigorous Neural Left Singular Subspace Preconditioner for Conjugate Gradient Methods

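Quick read: This paper addresses the inefficiency of solving the large, sparse linear systems that arise in the numerical solution of partial differential equations (PDEs), and in particular the rank inflation and suboptimal convergence rates that existing neural preconditioners incur by aggregating system matrices into graphs. The key to the solution is NeuraLSP, a neural preconditioner whose core innovation is a loss function based on the left singular subspace of the system matrix's near-nullspace vectors. By compressing spectral information into a fixed low-rank operator, the method comes with theoretical guarantees and is empirically robust to rank inflation, achieving up to a 53% speedup across diverse families of PDEs.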

Link: https://arxiv.org/abs/2601.20174
Authors: Alexander Benanti,Xi Han,Hong Qin
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Numerical techniques for solving partial differential equations (PDEs) are integral for many fields across science and engineering. Such techniques usually involve solving large, sparse linear systems, where preconditioning methods are critical. In recent years, neural methods, particularly graph neural networks (GNNs), have demonstrated their potential through accelerated convergence. Nonetheless, to extract connective structures, existing techniques aggregate discretized system matrices into graphs, and suffer from rank inflation and a suboptimal convergence rate. In this paper, we articulate NeuraLSP, a novel neural preconditioner combined with a novel loss metric that leverages the left singular subspace of the system matrix’s near-nullspace vectors. By compressing spectral information into a fixed low-rank operator, our method exhibits both theoretical guarantees and empirical robustness to rank inflation, affording up to a 53% speedup. Besides the theoretical guarantees for our newly-formulated loss function, our comprehensive experimental results across diverse families of PDEs also substantiate the aforementioned theoretical advances.

[AI-63] Large language models accurately predict public perceptions of support for climate action worldwide

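Quick read: This paper addresses the widespread underestimation of other people's support for climate action, a perception gap that stalls both individual and systemic change. The key to the solution is to test whether large language models (LLMs) can accurately predict this perception gap. Benchmarked against Gallup World Poll 2021/22 data and statistical regressions across 125 countries, LLMs, particularly Claude, accurately capture public perceptions of others' willingness to contribute financially to climate action (mean absolute error of about 5 percentage points; r = .77), although performance declines in less digitally connected, lower-GDP countries. Controlled tests indicate that the models capture the underlying psychological process (social projection with a systematic downward bias) and rely on structured reasoning rather than memorized values, so they can serve as an alternative to costly surveys in resource-rich countries and as a complement in underrepresented populations.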

Link: https://arxiv.org/abs/2601.20141
Authors: Nattavudh Powdthavee,Sandra J. Geiger
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 35 pages

Abstract:Although most people support climate action, widespread underestimation of others’ support stalls individual and systemic changes. In this preregistered experiment, we test whether large language models (LLMs) can reliably predict these perception gaps worldwide. Using country-level indicators and public opinion data from 125 countries, we benchmark four state-of-the-art LLMs against Gallup World Poll 2021/22 data and statistical regressions. LLMs, particularly Claude, accurately capture public perceptions of others’ willingness to contribute financially to climate action (MAE approximately 5 p.p.; r = .77), comparable to statistical models, though performance declines in less digitally connected, lower-GDP countries. Controlled tests show that LLMs capture the key psychological process - social projection with a systematic downward bias - and rely on structured reasoning rather than memorized values. Overall, LLMs provide a rapid tool for assessing perception gaps in climate action, serving as an alternative to costly surveys in resource-rich countries and as a complement in underrepresented populations.
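
The two accuracy figures quoted in the abstract (MAE in percentage points and Pearson's r) can be computed as in the minimal sketch below. The country-level numbers are made up purely to exercise the functions and are not taken from the paper or from the Gallup World Poll.

```python
from math import sqrt


def mean_absolute_error(predicted, observed):
    """Average absolute gap between predictions and survey values."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical perceived-support percentages for four countries (illustrative only).
llm_predictions = [40.0, 55.0, 35.0, 60.0]
survey_values = [45.0, 50.0, 30.0, 66.0]

mae = mean_absolute_error(llm_predictions, survey_values)  # in percentage points
r = pearson_r(llm_predictions, survey_values)
```

MAE measures how far predictions are off on average, while r only measures whether countries are ranked and spaced consistently, which is why the abstract reports both.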

[AI-64] Taxonomy of the Retrieval System Framework: Pitfalls and Paradigms

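Quick read: This paper addresses the complex trade-offs between efficiency and effectiveness in designing embedding-based (neural) retrieval systems. The core challenge is achieving accurate semantic matching under limited compute while keeping the system scalable and robust. The key to the solution is to organize system design into four vertical layers: the Representation Layer (loss functions and encoder architectures), the Granularity Layer (document chunking strategies), the Orchestration Layer (hierarchical retrieval, agentic decomposition, and multi-stage reranking), and the Robustness Layer (mitigations for domain generalization failures, lexical blind spots, and temporal drift). Structuring the design choices this way gives practitioners a systematic framework for optimizing the efficiency-effectiveness frontier.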

Link: https://arxiv.org/abs/2601.20131
Authors: Deep Shah,Sanket Badhe,Nehal Kathrotia
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:Designing an embedding retrieval system requires navigating a complex design space of conflicting trade-offs between efficiency and effectiveness. This work structures these decisions as a vertical traversal of the system design stack. We begin with the Representation Layer by examining how loss functions and architectures, specifically Bi-encoders and Cross-encoders, define semantic relevance and geometric projection. Next, we analyze the Granularity Layer and evaluate how segmentation strategies like Atomic and Hierarchical chunking mitigate information bottlenecks in long-context documents. Moving to the Orchestration Layer, we discuss methods that transcend the single-vector paradigm, including hierarchical retrieval, agentic decomposition, and multi-stage reranking pipelines to resolve capacity limitations. Finally, we address the Robustness Layer by identifying architectural mitigations for domain generalization failures, lexical blind spots, and the silent degradation of retrieval quality due to temporal drift. By categorizing these limitations and design choices, we provide a comprehensive framework for practitioners to optimize the efficiency-effectiveness frontier in modern neural search systems.

[AI-65] Membership Inference Attacks Against Fine-tuned Diffusion Language Models ICLR2026

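Quick read: This paper addresses a critical but largely unexplored privacy risk of Diffusion Language Models (DLMs): their susceptibility to Membership Inference Attacks (MIA). Unlike autoregressive models, which have a single fixed prediction pattern, DLMs admit many maskable configurations, so an attacker can probe training-set membership with many independent masks and substantially improve the chance of detection. The proposed attack, SAMA (Subset-Aggregated Membership Attack), handles the sparse-signal challenge through robust aggregation: it samples masked subsets across progressive densities, applies sign-based statistics that remain effective under heavy-tailed noise, and uses inverse-weighted aggregation to prioritize the cleaner signals from sparse masks, turning sparse memorization detection into a robust voting mechanism. On nine datasets, SAMA achieves a 30% relative AUC improvement over the best baseline and up to an 8x improvement at low false positive rates, revealing significant, previously unknown vulnerabilities in DLMs.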

Link: https://arxiv.org/abs/2601.20125
Authors: Yuetian Chen,Kaiyuan Zhang,Yuntao Du,Edoardo Stoppa,Charles Fleming,Ashish Kundu,Bruno Ribeiro,Ninghui Li
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted for presentation at ICLR 2026 (pending final camera-ready)

Abstract:Diffusion Language Models (DLMs) represent a promising alternative to autoregressive language models, using bidirectional masked token prediction. Yet their susceptibility to privacy leakage via Membership Inference Attacks (MIA) remains critically underexplored. This paper presents the first systematic investigation of MIA vulnerabilities in DLMs. Unlike the autoregressive models’ single fixed prediction pattern, DLMs’ multiple maskable configurations exponentially increase attack opportunities. This ability to probe many independent masks dramatically improves detection chances. To exploit this, we introduce SAMA (Subset-Aggregated Membership Attack), which addresses the sparse signal challenge through robust aggregation. SAMA samples masked subsets across progressive densities and applies sign-based statistics that remain effective despite heavy-tailed noise. Through inverse-weighted aggregation prioritizing sparse masks’ cleaner signals, SAMA transforms sparse memorization detection into a robust voting mechanism. Experiments on nine datasets show SAMA achieves 30% relative AUC improvement over the best baseline, with up to 8 times improvement at low false positive rates. These findings reveal significant, previously unknown vulnerabilities in DLMs, necessitating the development of tailored privacy defenses.
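
To make the "sign-based statistics" and "inverse-weighted aggregation" ideas concrete, here is a minimal sketch of a weighted sign vote. It is not SAMA itself: the per-mask signal (target-model loss versus a reference-model loss), the weight `1 / density`, and the function names are all assumptions chosen for illustration.

```python
def sign(x):
    """Return +1, 0, or -1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def membership_score(observations):
    """Aggregate per-mask votes into a score in [-1, 1].

    observations: list of (mask_density, target_loss, reference_loss) tuples,
    one per sampled mask subset. A positive score suggests membership.
    """
    weighted_votes = 0.0
    total_weight = 0.0
    for density, target_loss, reference_loss in observations:
        # Only the sign is used, so one extreme loss gap cannot dominate the
        # result (robustness to heavy-tailed noise).
        vote = sign(reference_loss - target_loss)
        # Assumed weighting: sparser masks (lower density) count more.
        weight = 1.0 / density
        weighted_votes += weight * vote
        total_weight += weight
    return weighted_votes / total_weight


# Two hypothetical probes: a sparse mask voting "member", a dense mask voting "non-member".
score = membership_score([(0.1, 1.0, 2.0), (0.5, 3.0, 2.5)])
```

Because each probe contributes only its sign, an outlier loss gap of 1000 and a gap of 0.01 cast the same vote; the weights alone decide how much each mask density matters.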

[AI-66] How Much Progress Has There Been in NVIDIA Datacenter GPUs?

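Quick read: This paper quantifies the technical progress of NVIDIA datacenter GPUs since the mid-2000s and assesses the potential effect of current U.S. export controls on the global AI chip performance gap. The key to the solution is a comprehensive dataset covering features such as computational performance, memory size and bandwidth, release price, and power consumption, from which the authors estimate doubling times for each metric and derive the performance gap that would result if export controls were fully and successfully implemented. They find that recently proposed changes to the export controls would shrink the potential performance gap from 23.6x to 3.54x, providing a quantitative basis for policy and technology planning.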

Link: https://arxiv.org/abs/2601.20115
Authors: Emanuele Del Sozzo,Martin Fleming,Kenneth Flamm,Neil Thompson
Affiliations: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Graphics Processing Units (GPUs) are the state-of-the-art architecture for essential tasks, ranging from rendering 2D/3D graphics to accelerating workloads in supercomputing centers and, of course, Artificial Intelligence (AI). As GPUs continue improving to satisfy ever-increasing performance demands, analyzing past and current progress becomes paramount in determining future constraints on scientific research. This is particularly compelling in the AI domain, where rapid technological advancements and fierce global competition have led the United States to recently implement export control regulations limiting international access to advanced AI chips. For this reason, this paper studies technical progress in NVIDIA datacenter GPUs released from the mid-2000s until today. Specifically, we compile a comprehensive dataset of datacenter NVIDIA GPUs comprising several features, ranging from computational performance to release price. Then, we examine trends in main GPU features and estimate progress indicators for per-memory bandwidth, per-dollar, and per-watt increase rates. Our main results identify doubling times of 1.44 and 1.69 years for FP16 and FP32 operations (without accounting for sparsity benefits), while FP64 doubling times range from 2.06 to 3.79 years. Off-chip memory size and bandwidth grew at slower rates than computing performance, doubling every 3.32 to 3.53 years. The release prices of datacenter GPUs have roughly doubled every 5.1 years, while their power consumption has approximately doubled every 16 years. Finally, we quantify the potential implications of current U.S. export control regulations in terms of the potential performance gaps that would result if implementation were assumed to be complete and successful. We find that recently proposed changes to export controls would shrink the potential performance gap from 23.6x to 3.54x.
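
A doubling time is usually obtained by fitting a straight line to the base-2 logarithm of a metric over time: if log2(value) grows with slope s per year, the metric doubles every 1/s years. The sketch below shows that calculation with an ordinary least-squares fit; whether the paper uses exactly this estimator is an assumption, and the sample data points are synthetic.

```python
from math import log2


def doubling_time(years, values):
    """Years per doubling from a least-squares fit of log2(value) against year."""
    n = len(years)
    logs = [log2(v) for v in values]
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs)) / sum(
        (x - mean_x) ** 2 for x in years
    )
    return 1.0 / slope


def annual_growth_factor(doubling_years):
    """Yearly multiplicative growth implied by a given doubling time."""
    return 2.0 ** (1.0 / doubling_years)


# Synthetic series that doubles every two years.
synthetic_dt = doubling_time([2010, 2012, 2014, 2016], [1.0, 2.0, 4.0, 8.0])

# The 1.44-year FP16 doubling time quoted in the abstract corresponds to
# roughly 62% growth per year.
fp16_growth = annual_growth_factor(1.44)
```

Reading the abstract's figures this way, a 16-year doubling time for power consumption implies only about 4% growth per year, which is why per-watt performance improved almost as fast as raw performance.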

[AI-67] Benchmarking Reward Hack Detection in Code Environments via Contrastive Analysis

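Quick read: This paper addresses the limited ability of large language models (LLMs), when used as evaluators in reinforcement learning environments for code generation, to detect reward hacking. The core issue is that existing work evaluates reward hack detection in isolated classification scenarios rather than in more realistic, contrastive settings. The key to the solution is a new taxonomy of reward exploits spanning 54 categories and the TRACE benchmark, a synthetically curated, human-verified set of 517 testing trajectories, evaluated with a contrastive anomaly detection setup in addition to isolated classification. In the contrastive setting, GPT-5.2 in its highest reasoning mode reaches a 63% detection rate, up from 45% in the isolated setting. The study also finds that state-of-the-art models struggle significantly more with semantically contextualized reward hacks than with syntactically contextualized ones.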

Link: https://arxiv.org/abs/2601.20103
Authors: Darshan Deshpande,Anand Kannappan,Rebecca Qian
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Dataset: this https URL

Abstract:Recent advances in reinforcement learning for code generation have made robust environments essential to prevent reward hacking. As LLMs increasingly serve as evaluators in code-based RL, their ability to detect reward hacking remains understudied. In this paper, we propose a novel taxonomy of reward exploits spanning across 54 categories and introduce TRACE (Testing Reward Anomalies in Code Environments), a synthetically curated and human-verified benchmark containing 517 testing trajectories. Unlike prior work that evaluates reward hack detection in isolated classification scenarios, we contrast these evaluations with a more realistic, contrastive anomaly detection setup on TRACE. Our experiments reveal that models capture reward hacks more effectively in contrastive settings than in isolated classification settings, with GPT-5.2 with highest reasoning mode achieving the best detection rate at 63%, up from 45% in isolated settings on TRACE. Building on this insight, we demonstrate that state-of-the-art models struggle significantly more with semantically contextualized reward hacks compared to syntactically contextualized ones. We further conduct qualitative analyses of model behaviors, as well as ablation studies showing that the ratio of benign to hacked trajectories and analysis cluster sizes substantially impact detection performance. We release the benchmark and evaluation harness to enable the community to expand TRACE and evaluate their models.

[AI-68] Taming Toxic Talk: Using chatbots to intervene with users posting toxic comments

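Quick read: This paper addresses the moderation of toxic behavior, which is widespread in online communities. Existing strategies are mostly punitive (removing content or banning users), and rehabilitative approaches are rarely attempted. The key to the solution is to test whether rehabilitative conversations with generative AI chatbots can influence users who post toxic content and reduce their subsequent toxic behavior. In a large-scale field experiment (N=893) run with seven large Reddit communities, users who had recently posted toxic content were invited to talk with an AI chatbot. Although many participants engaged in good faith and expressed remorse, no significant change in toxic behavior was observed in the following month relative to a control group, suggesting that AI conversations alone may not be enough to change behavior over the longer term.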

Link: https://arxiv.org/abs/2601.20100
Authors: Jeremy Foote,Deepak Kumar,Bedadyuti Jha,Ryan Funkhouser,Loizos Bitsikokos,Hitesh Goel,Hsuen-Chi Chiu
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Generative AI chatbots have proven surprisingly effective at persuading people to change their beliefs and attitudes in lab settings. However, the practical implications of these findings are not yet clear. In this work, we explore the impact of rehabilitative conversations with generative AI chatbots on users who share toxic content online. Toxic behaviors, like insults or threats of violence, are widespread in online communities. Strategies to deal with toxic behavior are typically punitive, such as removing content or banning users. Rehabilitative approaches are rarely attempted, in part due to the emotional and psychological cost of engaging with aggressive users. In collaboration with seven large Reddit communities, we conducted a large-scale field experiment (N=893) to invite people who had recently posted toxic content to participate in conversations with AI chatbots. A qualitative analysis of the conversations shows that many participants engaged in good faith and even expressed remorse or a desire to change. However, we did not observe a significant change in toxic behavior in the following month compared to a control group. We discuss possible explanations for our findings, as well as theoretical and practical implications based on our results.

[AI-69] Dynamics of Human-AI Collective Knowledge on the Web: A Scalable Model and Insights for Sustainable Growth WWW26

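Quick read: This paper addresses the feedback loops that arise when humans and large language models (LLMs) jointly produce and consume the web's shared knowledge archives, in particular how these loops shape the co-evolution of archive size, archive quality, model skill, and human skill, and what systemic risks they create (such as quality dilution, skill reduction, and model collapse). The key to the solution is a minimal, interpretable dynamical model that combines two content inflows (human and LLM-generated, the latter controlled by an admission gate), two human learning pathways (archive study vs. LLM assistance), and two LLM training modalities (corpus-driven scaling vs. learning from human feedback). Numerical experiments identify distinct growth regimes (healthy growth, inverted flow, inverted learning, oscillations) and show how platform and policy levers (gate strictness, LLM training, human learning pathways) move the system across regime boundaries. The framework is illustrated on two domain configurations (PubMed, and GitHub with Copilot) and fitted separately to Wikipedia's knowledge flow before and after ChatGPT, yielding actionable insights for the sustainable growth of human-AI collective knowledge.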

Link: https://arxiv.org/abs/2601.20099
Authors: Buddhika Nettasinghe,Kang Zhao
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Accepted for ACM Web Conference 2026 (WWW26)

Abstract:Humans and large language models (LLMs) now co-produce and co-consume the web’s shared knowledge archives. Such human-AI collective knowledge ecosystems contain feedback loops with both benefits (e.g., faster growth, easier learning) and systemic risks (e.g., quality dilution, skill reduction, model collapse). To understand such phenomena, we propose a minimal, interpretable dynamical model of the co-evolution of archive size, archive quality, model (LLM) skill, aggregate human skill, and query volume. The model captures two content inflows (human, LLM) controlled by a gate on LLM-content admissions, two learning pathways for humans (archive study vs. LLM assistance), and two LLM-training modalities (corpus-driven scaling vs. learning from human feedback). Through numerical experiments, we identify different growth regimes (e.g., healthy growth, inverted flow, inverted learning, oscillations), and show how platform and policy levers (gate strictness, LLM training, human learning pathways) shift the system across regime boundaries. Two domain configurations (PubMed, GitHub and Copilot) illustrate contrasting steady states under different growth rates and moderation norms. We also fit the model to Wikipedia’s knowledge flow during pre-ChatGPT and post-ChatGPT eras separately. We find a rise in LLM additions with a concurrent decline in human inflow, consistent with a regime identified by the model. Our model and analysis yield actionable insights for sustainable growth of human-AI collective knowledge on the Web.

[AI-70] Should I Have Expressed a Different Intent? Counterfactual Generation for LLM-Based Autonomous Control

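Quick read: This paper addresses the difficulty users face, after an LLM-powered agent has executed a task, in judging how a differently phrased intent would have changed the outcome, that is, the lack of support for counterfactual reasoning. The core challenge is to provide reliable, probabilistically guaranteed predictions of alternative outcomes. The key to the solution is to model the closed-loop interaction between the user, the LLM-based agent, and the environment as a structural causal model (SCM), and to use test-time scaling to generate multiple candidate counterfactual outcomes via probabilistic abduction. An offline calibration phase then yields conformal counterfactual generation (CCG), which produces sets of counterfactual outcomes guaranteed to contain the true counterfactual outcome with high probability, and which outperforms naive re-execution baselines on a wireless network control use case.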

Link: https://arxiv.org/abs/2601.20090
Authors: Amirmohammad Farzaneh,Salvatore D’Oro,Osvaldo Simeone
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language model (LLM)-powered agents can translate high-level user intents into plans and actions in an environment. Yet after observing an outcome, users may wonder: What if I had phrased my intent differently? We introduce a framework that enables such counterfactual reasoning in agentic LLM-driven control scenarios, while providing formal reliability guarantees. Our approach models the closed-loop interaction between a user, an LLM-based agent, and an environment as a structural causal model (SCM), and leverages test-time scaling to generate multiple candidate counterfactual outcomes via probabilistic abduction. Through an offline calibration phase, the proposed conformal counterfactual generation (CCG) yields sets of counterfactual outcomes that are guaranteed to contain the true counterfactual outcome with high probability. We showcase the performance of CCG on a wireless network control use case, demonstrating significant advantages compared to naive re-execution baselines.
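
The coverage guarantee mentioned in the abstract is characteristic of conformal prediction. The sketch below shows the standard split-conformal recipe (calibrate a score threshold offline, then keep every candidate whose score falls under it). It is a generic illustration, not necessarily the calibration procedure used by CCG; the nonconformity score and all function names are hypothetical.

```python
from math import ceil


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal threshold for miscoverage level alpha.

    calibration_scores: nonconformity scores of the true outcomes on a
    held-out calibration set (lower means the outcome looks more plausible).
    """
    n = len(calibration_scores)
    rank = ceil((n + 1) * (1 - alpha))  # finite-sample corrected quantile rank
    if rank > n:
        return float("inf")  # too few calibration points: keep every candidate
    return sorted(calibration_scores)[rank - 1]


def prediction_set(candidates, score_fn, threshold):
    """Keep every candidate outcome whose score does not exceed the threshold."""
    return [c for c in candidates if score_fn(c) <= threshold]


# Toy calibration scores and candidates; the score of a candidate is the value itself.
threshold = conformal_threshold([3, 1, 2, 5, 4, 9, 8, 7, 6], alpha=0.5)
kept = prediction_set([2, 5, 7], score_fn=lambda c: c, threshold=threshold)
```

Under the usual exchangeability assumption, sets built this way contain the true outcome with probability at least 1 - alpha; generating more candidates at test time can only make the returned set more informative, not less valid.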

[AI-71] LLaTTE: Scaling Laws for Multi-Stage Sequence Modeling in Large-Scale Ads Recommendation

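Quick read: This paper addresses how industrial-scale recommender systems can exploit scaling laws to improve ads recommendation while meeting strict latency constraints. The core challenge is that larger, deeper models are more expressive in principle but are hard to deploy in production because of inference latency and compute limits. The key to the solution is a two-stage architecture that offloads the heavy computation of large, long-context sequence models to an asynchronous upstream user model, relieving the downstream ranking stage. The authors also find that semantic features are a prerequisite for effective scaling, enabling the model to use the capacity of deeper and longer architectures. Deployed at scale at Meta, the design drives a 4.3% conversion uplift on Facebook Feed and Reels with minimal serving overhead, offering a practical path to scaling for industrial recommender systems.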

Link: https://arxiv.org/abs/2601.20083
Authors: Lee Xiong,Zhirong Chen,Rahul Mayuranath,Shangran Qiu,Arda Ozdemir,Lu Li,Yang Hu,Dave Li,Jingtao Ren,Howard Cheng,Fabian Souto Herrera,Ahmed Agiza,Baruch Epshtein,Anuj Aggarwal,Julia Ulziisaikhan,Chao Wang,Dinesh Ramasamy,Parshva Doshi,Sri Reddy,Arnold Overwijk
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Lee Xiong, Zhirong Chen, and Rahul Mayuranath contributed equally to this work

Abstract:We present LLaTTE (LLM-Style Latent Transformers for Temporal Events), a scalable transformer architecture for production ads recommendation. Through systematic experiments, we demonstrate that sequence modeling in recommendation systems follows predictable power-law scaling similar to LLMs. Crucially, we find that semantic features bend the scaling curve: they are a prerequisite for scaling, enabling the model to effectively utilize the capacity of deeper and longer architectures. To realize the benefits of continued scaling under strict latency constraints, we introduce a two-stage architecture that offloads the heavy computation of large, long-context models to an asynchronous upstream user model. We demonstrate that upstream improvements transfer predictably to downstream ranking tasks. Deployed as the largest user model at Meta, this multi-stage framework drives a 4.3% conversion uplift on Facebook Feed and Reels with minimal serving overhead, establishing a practical blueprint for harnessing scaling laws in industrial recommender systems.

[AI-72] CiMRAG: Cim-Aware Domain-Adaptive and Noise-Resilient Retrieval-Augmented Generation for Edge-Based LLMs ICASSP2026

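Quick read: This paper addresses the efficiency bottleneck of Retrieval-Augmented Generation (RAG) for LLM-based personalized virtual assistants on edge devices, caused by rapidly growing user profile data, and in particular the challenge of preserving retrieval precision and multi-domain adaptability under environmental noise. The key to the solution is Task-Oriented Noise-resilient Embedding Learning (TONEL), a framework that uses a noise-aware projection model to learn robust, task-specific embeddings compatible with the constraints of Computing-in-Memory (CiM) hardware, enabling accurate retrieval under noisy conditions.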

Link: https://arxiv.org/abs/2601.20041
Authors: Shih-Hsuan Chiu,Ming-Syan Chen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ICASSP 2026

Abstract:Personalized virtual assistants powered by large language models (LLMs) on edge devices are attracting growing attention, with Retrieval-Augmented Generation (RAG) emerging as a key method for personalization by retrieving relevant profile data and generating tailored responses. However, deploying RAG on edge devices faces efficiency hurdles due to the rapid growth of profile data, such as user-LLM interactions and recent updates. While Computing-in-Memory (CiM) architectures mitigate this bottleneck by eliminating data movement between memory and processing units via in-situ operations, they are susceptible to environmental noise that can degrade retrieval precision. This poses a critical issue in dynamic, multi-domain edge-based scenarios (e.g., travel, medicine, and law) where both accuracy and adaptability are paramount. To address these challenges, we propose Task-Oriented Noise-resilient Embedding Learning (TONEL), a framework that improves noise robustness and domain adaptability for RAG in noisy edge environments. TONEL employs a noise-aware projection model to learn task-specific embeddings compatible with CiM hardware constraints, enabling accurate retrieval under noisy conditions. Extensive experiments conducted on personalization benchmarks demonstrate the effectiveness and practicality of our methods relative to strong baselines, especially in task-specific noisy scenarios.

[AI-73] Structural Compositional Function Networks: Interpretable Functional Compositions for Tabular Discovery

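Quick read: This paper addresses the difficulty traditional deep learning models have in combining high performance with scientific interpretability on tabular data in high-stakes domains. The core problem is that standard neural networks treat features as independent entities and ignore the structural dependencies inherent in tabular data. The key to the solution is Structural Compositional Function Networks (StructuralCFN), which impose a relation-aware inductive bias through a differentiable structural prior, explicitly model each feature as a mathematical composition of the other features via Differentiable Adaptive Gating, and automatically discover the best activation mechanism for each relationship (for example, attention-style filtering vs. inhibitory polarity). The method supports structured knowledge integration, so domain priors can be injected directly to guide learning, and offers intrinsic symbolic interpretability: it recovers the governing "laws" of the data manifold as human-readable mathematical expressions with a compact footprint of 300 to 2,500 parameters, 10x to 20x fewer than standard deep baselines.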

Link: https://arxiv.org/abs/2601.20037
Authors: Fang Li
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Code and data available at this https URL

Abstract:Despite the ubiquity of tabular data in high-stakes domains, traditional deep learning architectures often struggle to match the performance of gradient-boosted decision trees while maintaining scientific interpretability. Standard neural networks typically treat features as independent entities, failing to exploit the inherent manifold structural dependencies that define tabular distributions. We propose Structural Compositional Function Networks (StructuralCFN), a novel architecture that imposes a Relation-Aware Inductive Bias via a differentiable structural prior. StructuralCFN explicitly models each feature as a mathematical composition of its counterparts through Differentiable Adaptive Gating, which automatically discovers the optimal activation physics (e.g., attention-style filtering vs. inhibitory polarity) for each relationship. Our framework enables Structured Knowledge Integration, allowing domain-specific relational priors to be injected directly into the architecture to guide discovery. We evaluate StructuralCFN across a rigorous 10-fold cross-validation suite on 18 benchmarks, demonstrating statistically significant improvements (p < 0.05) on scientific and clinical datasets (e.g., Blood Transfusion, Ozone, WDBC). Furthermore, StructuralCFN provides Intrinsic Symbolic Interpretability: it recovers the governing “laws” of the data manifold as human-readable mathematical expressions while maintaining a compact parameter footprint (300–2,500 parameters) that is over an order of magnitude (10x–20x) smaller than standard deep baselines.

[AI-74] Fuzzy Categorical Planning: Autonomous Goal Satisfaction with Graded Semantic Constraints

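Quick read: This paper addresses the difficulty of quantifying the applicability of vague predicates (such as "suitable substitute" or "stable enough") in natural-language planning. Existing category-theoretic planners offer compositional structure and pullback-based hard-constraint verification, but they treat action applicability as a binary judgment; the forced thresholding discards meaningful distinctions and cannot track quality degradation across multi-step plans. The key to the solution is Fuzzy Category-theoretic Planning (FCP), which annotates each action (morphism) with a degree in [0,1], composes plan quality with the Łukasiewicz t-norm, and keeps pullback verification for crisp executability checks. FCP grounds graded applicability from language using an LLM with k-sample median aggregation and supports meet-in-the-middle search through residuum-based backward requirements. On RecipeNLG-Subs it improves success rates and reduces hard-constraint violations relative to LLM-only and ReAct-style baselines, while remaining competitive with classical PDDL3 planners.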

Link: https://arxiv.org/abs/2601.20021
Authors: Shuhui Qu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Natural-language planning often involves vague predicates (e.g., suitable substitute, stable enough) whose satisfaction is inherently graded. Existing category-theoretic planners provide compositional structure and pullback-based hard-constraint verification, but treat applicability as crisp, forcing thresholding that collapses meaningful distinctions and cannot track quality degradation across multi-step plans. We propose Fuzzy Category-theoretic Planning (FCP), which annotates each action (morphism) with a degree in [0,1], composes plan quality via the Łukasiewicz t-norm, and retains crisp executability checks via pullback verification. FCP grounds graded applicability from language using an LLM with k-sample median aggregation and supports meeting-in-the-middle search using residuum-based backward requirements. We evaluate on (i) public PDDL3 preference/oversubscription benchmarks and (ii) RecipeNLG-Subs, a missing-substitute recipe-planning benchmark built from RecipeNLG with substitution candidates from Recipe1MSubs and FoodKG. FCP improves success and reduces hard-constraint violations on RecipeNLG-Subs compared to LLM-only and ReAct-style baselines, while remaining competitive with classical PDDL3 planners.
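
The Łukasiewicz t-norm and its residuum have simple closed forms, T(a, b) = max(0, a + b - 1) and a → b = min(1, 1 - a + b), so the composition of plan quality and the backward requirement can be sketched in a few lines. The function names and the example degrees below are illustrative assumptions; only the two formulas and the use of a median over k samples come from standard fuzzy logic and the abstract.

```python
from statistics import median


def lukasiewicz_tnorm(a, b):
    """Łukasiewicz t-norm: T(a, b) = max(0, a + b - 1)."""
    return max(0.0, a + b - 1.0)


def lukasiewicz_residuum(a, b):
    """Residuum a -> b = min(1, 1 - a + b): the largest c with T(a, c) <= b."""
    return min(1.0, 1.0 - a + b)


def plan_quality(degrees):
    """Compose the applicability degrees of a sequence of actions."""
    quality = 1.0  # 1.0 is the identity element of the t-norm
    for degree in degrees:
        quality = lukasiewicz_tnorm(quality, degree)
    return quality


def aggregate_degree(samples):
    """Median of k sampled degrees, e.g. k repeated LLM judgments of one action."""
    return median(samples)


# A three-step plan: each imperfect step subtracts its shortfall from the total.
quality = plan_quality([0.9, 0.8, 0.95])

# Backward view: for an action of degree 0.9 to still end at quality 0.7,
# the quality before the action has to be at least this value.
needed_before = lukasiewicz_residuum(0.9, 0.7)
```

Unlike the product or minimum t-norms, the Łukasiewicz t-norm accumulates shortfalls additively and can reach exactly 0, which makes quality degradation across long plans explicit.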

[AI-75] Teaching LLMs to Ask: Self-Querying Category-Theoretic Planning for Under-Specified Reasoning

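Quick read: This paper addresses the failure of inference-time planning with large language models (LLMs) under partial observability: when task-critical preconditions are not specified at query time, models tend to hallucinate the missing facts or produce plans that violate hard constraints. The key to the solution is Self-Querying Bidirectional Categorical Planning (SQ-BCP), which explicitly represents each precondition's status as satisfied (Sat), violated (Viol), or unknown (Unk) and resolves unknowns in one of two ways: (i) targeted self-queries to an oracle or user, or (ii) bridging hypotheses that establish the missing condition through an additional action. SQ-BCP performs bidirectional search and invokes a pullback-based verifier as a categorical certificate of goal compatibility, using distance-based scores only for ranking and pruning. The authors prove that when the verifier succeeds and hard constraints pass deterministic checks, accepted plans are compatible with the goal requirements, and that under bounded branching and finite resolution depth, SQ-BCP finds an accepting plan whenever one exists.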

Link: https://arxiv.org/abs/2601.20014
Authors: Shuhui Qu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Inference-time planning with large language models frequently breaks under partial observability: when task-critical preconditions are not specified at query time, models tend to hallucinate missing facts or produce plans that violate hard constraints. We introduce Self-Querying Bidirectional Categorical Planning (SQ-BCP), which explicitly represents precondition status (Sat/Viol/Unk) and resolves unknowns via (i) targeted self-queries to an oracle/user or (ii) bridging hypotheses that establish the missing condition through an additional action. SQ-BCP performs bidirectional search and invokes a pullback-based verifier as a categorical certificate of goal compatibility, while using distance-based scores only for ranking and pruning. We prove that when the verifier succeeds and hard constraints pass deterministic checks, accepted plans are compatible with goal requirements; under bounded branching and finite resolution depth, SQ-BCP finds an accepting plan when one exists. Across WikiHow and RecipeNLG tasks with withheld preconditions, SQ-BCP reduces resource-violation rates to 14.9% and 5.8% (vs. 26.0% and 15.7% for the best baseline), while maintaining competitive reference quality.
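
The three-valued precondition status can be pictured with a small sketch: facts that are known to hold are Sat, facts known to be false are Viol, and anything absent from the agent's knowledge is Unk and triggers a query instead of a guess. The data structures and the `next_step` policy below are hypothetical simplifications; they leave out bridging hypotheses, bidirectional search, and the pullback-based verifier.

```python
from enum import Enum


class Status(Enum):
    SAT = "Sat"    # precondition known to hold
    VIOL = "Viol"  # precondition known to be violated
    UNK = "Unk"    # precondition not specified at query time


def check_preconditions(preconditions, known_facts):
    """Classify each precondition; facts missing from known_facts are unknown."""
    statuses = {}
    for name in preconditions:
        if name not in known_facts:
            statuses[name] = Status.UNK
        elif known_facts[name]:
            statuses[name] = Status.SAT
        else:
            statuses[name] = Status.VIOL
    return statuses


def next_step(statuses):
    """Reject on any violation, ask about the first unknown, otherwise accept."""
    if any(status is Status.VIOL for status in statuses.values()):
        return "reject"
    unknown = [name for name, status in statuses.items() if status is Status.UNK]
    if unknown:
        return "query:" + unknown[0]
    return "accept"


# Hypothetical recipe step: the oven is available, but nobody said whether eggs are.
statuses = check_preconditions(["has_oven", "has_eggs"], {"has_oven": True})
decision = next_step(statuses)
```

Keeping Unk distinct from both Sat and Viol is what prevents the planner from silently assuming a missing fact, which is the hallucination failure the abstract describes.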

[AI-76] Perturbation-Induced Linearization: Constructing Unlearnable Data with Solely Linear Classifiers ICLR2026

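Quick read: This paper addresses the privacy and copyright concerns raised by the unauthorized use of web data to train deep models, using unlearnable examples as a data protection mechanism. The key to the solution is Perturbation-Induced Linearization (PIL), a computationally efficient method that generates effective perturbations using only linear surrogate models instead of the deep neural network surrogates that existing methods rely on, which greatly reduces computational cost. The study also reveals a key mechanism behind unlearnable examples: the perturbations induce linearized behavior in deep models, making it hard for them to extract useful features from the perturbed data, which explains why PIL matches or exceeds existing methods in a very short time.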

Link: https://arxiv.org/abs/2601.19967
Authors: Jinlin Liu,Wei Chen,Xiaojin Zhang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted to ICLR 2026

Abstract:Collecting web data to train deep models has become increasingly common, raising concerns about unauthorized data usage. To mitigate this issue, unlearnable examples introduce imperceptible perturbations into data, preventing models from learning effectively. However, existing methods typically rely on deep neural networks as surrogate models for perturbation generation, resulting in significant computational costs. In this work, we propose Perturbation-Induced Linearization (PIL), a computationally efficient yet effective method that generates perturbations using only linear surrogate models. PIL achieves comparable or better performance than existing surrogate-based methods while reducing computational time dramatically. We further reveal a key mechanism underlying unlearnable examples: inducing linearization to deep models, which explains why PIL can achieve competitive results in a very short time. Beyond this, we provide an analysis about the property of unlearnable examples under percentage-based partial perturbation. Our work not only provides a practical approach for data protection but also offers insights into what makes unlearnable examples effective.

[AI-77] Cross-Session Decoding of Neural Spiking Data via Task-Conditioned Latent Alignment

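Quick read: This paper addresses the poor generalization of decoders in invasive brain-computer interfaces (BCIs) caused by cross-session nonstationarity, especially when limited target-session data makes retraining or adapting the decoder difficult. The key to the solution is the Task-Conditioned Latent Alignment framework (TCLA). Built on an autoencoder architecture, it first learns a low-dimensional representation of neural dynamics from a source session, then aligns the target session's latent representations to the source latent space in a task-conditioned manner, transferring the learned neural dynamics. On macaque motor and oculomotor center-out datasets, TCLA consistently outperforms baselines trained only on target-session data, with gains in the coefficient of determination of up to 0.386, supporting its effectiveness for robust decoding with limited data.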

Link: https://arxiv.org/abs/2601.19963
Authors: Canyang Zhao,Bolin Peng,J. Patrick Mayo,Ce Ju,Bing Liu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Cross-session nonstationarity in neural activity recorded by implanted electrodes is a major challenge for invasive Brain-computer interfaces (BCIs), as decoders trained on data from one session often fail to generalize to subsequent sessions. This issue is further exacerbated in practice, as retraining or adapting decoders becomes particularly challenging when only limited data are available from a new session. To address this challenge, we propose a Task-Conditioned Latent Alignment framework (TCLA) for cross-session neural decoding. Building upon an autoencoder architecture, TCLA first learns a low-dimensional representation of neural dynamics from a source session with sufficient data. For target sessions with limited data, TCLA then aligns target latent representations to the source in a task-conditioned manner, enabling effective transfer of learned neural dynamics. We evaluate TCLA on the macaque motor and oculomotor center-out dataset. Compared to baseline methods trained solely on target-session data, TCLA consistently improves decoding performance across datasets and decoding settings, with gains in the coefficient of determination of up to 0.386 for y coordinate velocity decoding in a motor dataset. These results suggest that TCLA provides an effective strategy for transferring knowledge from source to target sessions, enabling more robust neural decoding under conditions with limited data.

[AI-78] NeuroAI and Beyond

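Quick read: This paper addresses the long-standing loose coupling between neuroscience and artificial intelligence (AI) and aims to bring the two fields together into a neuroscience-informed paradigm for AI, NeuroAI. The key to its approach is to identify and integrate the synergies across several intersecting areas (embodiment, language and communication, robotics, learning in humans and machines, and neuromorphic engineering), drawing on how biological nervous systems work to improve the efficiency and scope of AI algorithms while also deepening the understanding of biological neural computation. The paper stresses that this two-way exchange can help build more efficient and robust AI systems and reshape how we understand information processing in the brain.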

Link: https://arxiv.org/abs/2601.19955
Authors: Jean-Marc Fellous, Gert Cauwenberghs, Cornelia Fermüller, Yulia Sandamisrkaya, Terrence Sejnowski
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 53 pages, 5 figures, extended appendix

Abstract:Neuroscience and Artificial Intelligence (AI) have made significant progress in the past few years but have only been loosely inter-connected. Based on a workshop held in August 2025, we identify current and future areas of synergism between these two fields. We focus on the subareas of embodiment, language and communication, robotics, learning in humans and machines and Neuromorphic engineering to take stock of the progress made so far, and possible promising new future avenues. Overall, we advocate for the development of NeuroAI, a type of Neuroscience-informed Artificial Intelligence that, we argue, has the potential for significantly improving the scope and efficiency of AI algorithms while simultaneously changing the way we understand biological neural computations. We include personal statements from several leading researchers on their diverse views of NeuroAI. Two Strength-Weakness-Opportunities-Threat (SWOT) analyses by researchers and trainees are appended that describe the benefits and risks offered by NeuroAI.

[AI-79] Probabilistic Sensing: Intelligence in Data Sampling ISCAS2026

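Quick read: This paper addresses the poor energy efficiency of conventional sensors that sample on a fixed schedule, particularly in long-running scenarios such as active seismic surveys, where frequent sampling greatly increases energy use. The core challenge is making intelligent sampling decisions without losing key information. The key to its solution is a sensing paradigm inspired by the autonomous nervous system, in which a probabilistic neuron (p-neuron) driven by an analog feature extraction circuit makes the sampling decision stochastically, so that data sampling is activated autonomously and in real time with a response time on the order of microseconds. Validation experiments show that the method preserves data integrity (a normalized mean squared error of only 0.41%) while cutting the system's active operation time and the number of generated samples by 93%.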

Link: https://arxiv.org/abs/2601.19953
Authors: Ibrahim Albulushi, Saleh Bunaiyan, Suraj S. Cheema, Hesham ElSawy, Feras Al-Dirini
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET); Systems and Control (eess.SY)
Comments: Accepted for presentation at IEEE ISCAS 2026 as a lecture

Abstract:Extending the intelligence of sensors to the data-acquisition process - deciding whether to sample or not - can result in transformative energy-efficiency gains. However, making such a decision in a deterministic manner involves a risk of losing information. Here we present a sensing paradigm that enables making such a decision in a probabilistic manner. The paradigm takes inspiration from the autonomous nervous system and employs a probabilistic neuron (p-neuron) driven by an analog feature extraction circuit. The response time of the system is on the order of microseconds, overcoming the sub-sampling-rate response time limit and enabling real-time intelligent autonomous activation of data-sampling. Validation experiments on active seismic survey data demonstrate lossless probabilistic data acquisition, with a normalized mean squared error of 0.41%, and 93% saving in the active operation time of the system and the number of generated samples.

[AI-80] LTS-VoiceAgent: A Listen-Think-Speak Framework for Efficient Streaming Voice Interaction via Semantic Triggering and Incremental Reasoning

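Quick read: This paper addresses the tension in real-time voice agents between end-to-end models, which lack reasoning depth, and cascaded pipelines, which incur high latency. Existing cascaded architectures usually run automatic speech recognition (ASR), large language model (LLM) reasoning, and text-to-speech (TTS) strictly in sequence, which delays responses noticeably, while current streaming strategies such as fixed chunking or voice activity detection (VAD)-based splitting tend to break semantic units or waste computation on speculative generation that must be rolled back. The key to its solution is the LTS-VoiceAgent framework, whose core idea is to explicitly separate when to think from how to reason incrementally. A Dynamic Semantic Trigger detects meaningful prefixes, and a Dual-Role Stream Orchestrator coordinates a background Thinker (for state maintenance) and a foreground Speaker (for speculative solving) in parallel, enabling non-blocking "thinking while speaking" responses and a better trade-off among accuracy, latency, and efficiency.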

Link: https://arxiv.org/abs/2601.19952
Authors: Wenhao Zou, Yuwei Miao, Zhanyu Ma, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He, Jingwen Xu
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Real-time voice agents face a dilemma: end-to-end models often lack deep reasoning, while cascaded pipelines incur high latency by executing ASR, LLM reasoning, and TTS strictly in sequence, unlike human conversation where listeners often start thinking before the speaker finishes. Since cascaded architectures remain the dominant choice for complex tasks, existing cascaded streaming strategies attempt to reduce this latency via mechanical segmentation (e.g., fixed chunks, VAD-based splitting) or speculative generation, but they frequently either break semantic units or waste computation on predictions that must be rolled back. To address these challenges, we propose LTS-VoiceAgent, a Listen-Think-Speak framework that explicitly separates when to think from how to reason incrementally. It features a Dynamic Semantic Trigger to detect meaningful prefixes, and a Dual-Role Stream Orchestrator that coordinates a background Thinker (for state maintenance) and a foreground Speaker (for speculative solving). This parallel design enables “thinking while speaking” without blocking responses. We also introduce a Pause-and-Repair benchmark containing natural disfluencies to stress-test streaming robustness. Experiments across VERA, Spoken-MQA, BigBenchAudio, and our benchmark show that LTS-VoiceAgent achieves a stronger accuracy-latency-efficiency trade-off than serial cascaded baselines and existing streaming strategies.

[AI-81] Bench4HLS: End-to-End Evaluation of LLMs in High-Level Synthesis Code Generation DATE2026

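Quick read: This paper addresses the lack of a systematic evaluation framework for generative AI in high-level synthesis (HLS). Large language models (LLMs) have shown strong capabilities in RTL-level hardware design and their use in HLS is drawing growing attention, but the research is still at an early stage and lacks unified benchmarks and quantitative evaluation. The key to its solution is Bench4HLS, an extensible evaluation framework comprising 170 manually drafted and validated case studies that range from small kernels to complex accelerators. The framework supports automated assessment of compilation success, functional correctness (via simulation), and synthesis feasibility/optimization, and integrates a pluggable API for power, performance, and area (PPA) analysis across HLS toolchains (such as Xilinx Vitis HLS and Catapult HLS), providing a standardized, reproducible way to benchmark LLM-driven HLS workflows.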

Link: https://arxiv.org/abs/2601.19941
Authors: M Zafir Sadik Khan, Kimia Azar, Hadi Kamali
Affiliation: unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE)
Comments: Accepted to the Design, Automation and Test in Europe Conference (DATE 2026)

Abstract:In the last two years, large language models (LLMs) have shown strong capabilities in code generation, including hardware design at register-transfer level (RTL). While their use in high-level synthesis (HLS) remains comparatively less mature, the ratio of HLS- to RTL-focused studies has shifted from 1:10 to 2:10 in the past six months, indicating growing interest in leveraging LLMs for high-level design entry while relying on downstream synthesis for optimization. This growing trend highlights the need for a comprehensive benchmarking and evaluation framework dedicated to LLM-based HLS. To address this, we present Bench4HLS for evaluating LLM-generated HLS designs. Bench4HLS comprises 170 manually drafted and validated case studies, spanning small kernels to complex accelerators, curated from widely used public repositories. The framework supports fully automated assessment of compilation success, functional correctness via simulation, and synthesis feasibility/optimization. Crucially, Bench4HLS integrates a pluggable API for power, performance, and area (PPA) analysis across various HLS toolchains and architectures, demonstrated here with Xilinx Vitis HLS and validated on Catapult HLS. By providing a structured, extensible, and plug-and-play testbed, Bench4HLS establishes a foundational methodology for benchmarking LLMs in HLS workflows.

[AI-82] Continuous-Flow Data-Rate-Aware CNN Inference on FPGA

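Quick read: This paper addresses the low hardware utilization that arises in dataflow architectures for convolutional neural networks (CNNs) when pooling layers and convolutional layers with a stride larger than one reduce the amount of data. The key to its solution is a design approach for data-rate-aware, continuous-flow CNN architectures: by interleaving low-data-rate signals, sharing hardware units, and choosing an appropriate parallelization, it keeps the throughput of a fully parallel implementation while bringing hardware utilization close to 100%. The approach saves a significant amount of arithmetic logic, so that complex CNNs such as MobileNet can be implemented on a single field-programmable gate array (FPGA) with high throughput.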

Link: https://arxiv.org/abs/2601.19940
Authors: Tobias Habermann, Michael Mecik, Zhenyu Wang, César David Vera, Martin Kumm, Mario Garrido
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments:

Abstract:Among hardware accelerators for deep-learning inference, data flow implementations offer low latency and high throughput capabilities. In these architectures, each neuron is mapped to a dedicated hardware unit, making them well-suited for field-programmable gate array (FPGA) implementation. Previous unrolled implementations mostly focus on fully connected networks because of their simplicity, although it is well known that convolutional neural networks (CNNs) require fewer computations for the same accuracy. In the data flow of CNNs, pooling layers and convolutional layers with a stride larger than one produce less data at their output than they receive at their input. This data reduction strongly affects the data rate in a fully parallel implementation, making hardware units heavily underutilized unless it is handled properly. This work addresses this issue by analyzing the data flow of CNNs and presents a novel approach to designing data-rate-aware, continuous-flow CNN architectures. The proposed approach ensures a high hardware utilization close to 100% by interleaving low data rate signals and sharing hardware units, as well as using the right parallelization to achieve the throughput of a fully parallel implementation. The results show that a significant amount of the arithmetic logic can be saved, which allows implementing complex CNNs like MobileNet on a single FPGA with high throughput.
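
As a rough illustration of why stride and pooling matter for utilization, the sketch below tracks the per-channel data rate through a chain of layers. It is not the paper's method: the function names, the example layer list, and the assumption that a hardware unit accepts one sample per clock cycle are all hypothetical, and changes in channel count are ignored.

```python
from fractions import Fraction


def relative_data_rates(layers):
    """Per-layer output data rate relative to the network input.

    `layers` is a list of (name, stride) pairs. A 2-D stride s keeps one
    of every s*s spatial positions, so the rate drops by a factor s*s.
    """
    rate = Fraction(1)
    rates = []
    for name, stride in layers:
        rate /= stride * stride
        rates.append((name, rate))
    return rates


def interleaving_factor(rate):
    """How many signals at this rate could share a unit that accepts one
    sample per clock (assumption: perfect interleaving, no overhead)."""
    return int(1 / rate)


# Hypothetical three-layer chain: stride-1 conv, 2x2 pooling, stride-2 conv.
layers = [("conv1", 1), ("pool1", 2), ("conv2", 2)]
for name, rate in relative_data_rates(layers):
    print(name, rate, interleaving_factor(rate))
```

Under these assumptions, a unit placed after a layer with relative rate r would be busy only a fraction r of the time unless it is shared among roughly 1/r signals, which is the utilization gap the abstract describes.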

[AI-83] DecHW: Heterogeneous Decentralized Federated Learning Exploiting Second-Order Information

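Quick read: This paper addresses the inconsistency of local model parameters in Decentralized Federated Learning (DFL) caused by differences in data distribution and heterogeneous interactions between devices, a heterogeneity that markedly slows model convergence. The key to its solution is a new aggregation method that generates consensus weights by approximating the second-order information of local models on their own datasets, uses these weights to scale the local updates from the neighbourhood, and then aggregates them robustly into a more stable global neighbourhood representation, improving the generalization of local models at reduced communication cost.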

Link: https://arxiv.org/abs/2601.19938
Authors: Adnan Ahmad, Chiara Boldrini, Lorenzo Valerio, Andrea Passarella, Marco Conti
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Funding: SoBigDatait (PNRR IR0000013), FAIR (PNRR PE00000013), RESTART (PNRR PE00000001)

Abstract:Decentralized Federated Learning (DFL) is a serverless collaborative machine learning paradigm where devices collaborate directly with neighbouring devices to exchange model information for learning a generalized model. However, variations in individual experiences and different levels of device interactions lead to data and model initialization heterogeneities across devices. Such heterogeneities leave variations in local model parameters across devices that lead to slower convergence. This paper tackles the data and model heterogeneity by explicitly addressing the parameter level varying evidential credence across local models. A novel aggregation approach is introduced that captures these parameter variations in local models and performs robust aggregation of neighbourhood local updates. Specifically, consensus weights are generated via approximation of second-order information of local models on their local datasets. These weights are utilized to scale neighbourhood updates before aggregating them into global neighbourhood representation. In extensive experiments with computer vision tasks, the proposed approach shows strong generalizability of local models at reduced communication costs.

[AI-84] Analysis of LLM Vulnerability to GPU Soft Errors: An Instruction-Level Fault Injection Study

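Quick read: This paper addresses the vulnerability of large language model (LLM) inference on GPUs to soft errors, a reliability challenge that is growing as high-performance GPUs move to smaller transistors and lower operating voltages. Existing studies focus mostly on general-purpose applications or conventional neural networks (for example, vision tasks) and lack a systematic analysis of LLMs, which are new and widely deployed. The key contribution is the first instruction-level fault injection study of LLM inference, which characterizes reliability along several dimensions, including model architecture, parameter scale, and task complexity, and so provides an empirical basis for designing more effective fault tolerance mechanisms.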

Link: https://arxiv.org/abs/2601.19912
Authors: Duo Chai, Zizhen Liu, Shuhuai Wang, Songwei Pei, Cheng Liu, Huawei Li, Shangguang Wang
Affiliation: unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 14 pages, 13 figures

Abstract:Large language models (LLMs) are highly compute- and memory-intensive, posing significant demands on high-performance GPUs. At the same time, advances in GPU technology driven by shrinking transistor sizes and lower operating voltages have made these devices increasingly susceptible to soft errors. While prior work has examined GPU reliability, most studies have focused on general-purpose applications or conventional neural networks mostly used for vision tasks such as classification and detection. In contrast, systematic analysis of modern large-scale LLMs remains limited, despite their rapid adoption in diverse application scenarios. Given the unique characteristics of LLMs, their resilience to soft errors may differ substantially from earlier models. To bridge this gap, we conduct the first instruction-level fault injection study of LLM inference. Our approach reveals reliability characteristics from multiple perspectives, highlighting the effects of model architecture, parameter scale, and task complexity. These findings provide new insights into LLM reliability and inform the design of more effective fault tolerance mechanisms.

[AI-85] GTAC: A Generative Transformer for Approximate Circuits

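Quick read: This paper addresses how to further optimize the performance, power, and area (PPA) of approximate circuits while respecting an error constraint. Conventional methods struggle to explore the design space efficiently under a controlled error. The paper proposes GTAC, a generative Transformer-based model whose key innovation is to integrate error thresholds directly into the design process as a constraint, so that approximate circuits are generated automatically. Experiments show that, under the same error rate constraint, GTAC reduces area by a further 6.4% compared with a state-of-the-art method while being 4.3x faster.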

Link: https://arxiv.org/abs/2601.19906
Authors: Jingxin Wang, Shitong Guo, Ruicheng Dai, Wenhui Liang, Ruogu Ding, Xin Ning, Weikang Qian
Affiliation: unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Targeting error-tolerant applications, approximate circuits introduce controlled errors to significantly improve performance, power, and area (PPA) of circuits. In this work, we introduce GTAC, a novel generative Transformer-based model for producing approximate circuits. By leveraging principles of approximate computing and AI-driven EDA, our model innovatively integrates error thresholds into the design process. Experimental results show that compared with a state-of-the-art method, GTAC reduces area by a further 6.4% under the error rate constraint, while being 4.3x faster.

[AI-86] STELLAR: Structure-guided LLM Assertion Retrieval and Generation for Formal Verification

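Quick read: This paper addresses the slow and error-prone manual writing of SystemVerilog Assertions (SVAs) in Formal Verification (FV). Existing approaches based on large language models (LLMs) either generate assertions from scratch or ignore the structural patterns present in hardware designs and expert-written assertions. The key to its solution is the STELLAR framework, which represents register-transfer level (RTL) blocks as abstract syntax tree (AST) structural fingerprints, retrieves structurally relevant (RTL, SVA) pairs from a knowledge base, and integrates them into structure-guided prompts, yielding SVAs with better syntax correctness, stylistic alignment, and functional correctness.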

Link: https://arxiv.org/abs/2601.19903
Authors: Saeid Rajabi, Chengmo Yang, Satwik Patnaik
Affiliation: unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: 7 pages, 6 figures

Abstract:Formal Verification (FV) relies on high-quality SystemVerilog Assertions (SVAs), but the manual writing process is slow and error-prone. Existing LLM-based approaches either generate assertions from scratch or ignore structural patterns in hardware designs and expert-crafted assertions. This paper presents STELLAR, the first framework that guides LLM-based SVA generation with structural similarity. STELLAR represents RTL blocks as AST structural fingerprints, retrieves structurally relevant (RTL, SVA) pairs from a knowledge base, and integrates them into structure-guided prompts. Experiments show that STELLAR achieves superior syntax correctness, stylistic alignment, and functional correctness, highlighting structure-aware retrieval as a promising direction for industrial FV.

[AI-87] Assembling the Mind's Mosaic: Towards EEG Semantic Intent Decoding

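Quick read: This paper addresses a central challenge in enabling natural-language communication through brain-computer interfaces (BCIs): existing frameworks are limited by oversimplified semantic representations and a lack of interpretability. The key to its solution is a new framework, Semantic Intent Decoding (SID), which translates neural activity into natural language by modeling meaning as a flexible set of compositional semantic units, and which rests on three principles: semantic compositionality, continuity and expandability of the semantic space, and fidelity in reconstruction. To implement it, the authors design BrainMosaic, a deep learning architecture that decodes multiple semantic units from EEG/SEEG signals using set matching and then generates coherent sentences through semantic-guided reconstruction. This moves beyond traditional pipelines that rely on fixed-class classification or unconstrained generation and gives a more interpretable and expressive mode of BCI communication.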

Link: https://arxiv.org/abs/2601.20447
Authors: Jiahe Li, Junru Chen, Fanqi Shen, Jialan Yang, Jada Li, Zhizhang Yuan, Baowen Cheng, Meng Li, Yang Yang
Affiliation: unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Enabling natural communication through brain-computer interfaces (BCIs) remains one of the most profound challenges in neuroscience and neurotechnology. While existing frameworks offer partial solutions, they are constrained by oversimplified semantic representations and a lack of interpretability. To overcome these limitations, we introduce Semantic Intent Decoding (SID), a novel framework that translates neural activity into natural language by modeling meaning as a flexible set of compositional semantic units. SID is built on three core principles: semantic compositionality, continuity and expandability of semantic space, and fidelity in reconstruction. We present BrainMosaic, a deep learning architecture implementing SID. BrainMosaic decodes multiple semantic units from EEG/SEEG signals using set matching and then reconstructs coherent sentences through semantic-guided reconstruction. This approach moves beyond traditional pipelines that rely on fixed-class classification or unconstrained generation, enabling a more interpretable and expressive communication paradigm. Extensive experiments on multilingual EEG and clinical SEEG datasets demonstrate that SID and BrainMosaic offer substantial advantages over existing frameworks, paving the way for natural and effective BCI-mediated communication.

[AI-88] Do we really need Self-Attention for Streaming Automatic Speech Recognition?

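Quick read: This paper addresses the poor fit of Transformer architectures for resource-constrained settings such as streaming automatic speech recognition (ASR), where their computational cost and latency are problematic. The key findings are that replacing Self-Attention with deformable convolution reduces the computational cost, and that removing Self-Attention entirely, without any replacement, causes no significant degradation in Word Error Rate (WER). This suggests that Self-Attention is not a necessary component in this setting and points to a new direction for efficient model design.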

Link: https://arxiv.org/abs/2601.19960
Authors: Youness Dkhissi (LIUM), Valentin Vielzeuf, Elys Allesiardo, Anthony Larcher (LIUM)
Affiliation: unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments:

Abstract:Transformer-based architectures are the most used architectures in many deep learning fields like Natural Language Processing, Computer Vision or Speech processing. It may encourage the direct use of Transformers in the constrained tasks, without questioning whether it will yield the same benefits as in standard tasks. Given specific constraints, it is essential to evaluate the relevance of transformer models. This work questions the suitability of transformers for specific domains. We argue that the high computational requirements and latency issues associated with these models do not align well with streaming applications. Our study promotes the search for alternative strategies to improve efficiency without sacrificing performance. In light of this observation, our paper critically examines the usefulness of transformer architecture in such constrained environments. As a first attempt, we show that the computational cost for Streaming Automatic Speech Recognition (ASR) can be reduced using deformable convolution instead of Self-Attention. Furthermore, we show that Self-Attention mechanisms can be entirely removed and not replaced, without observing significant degradation in the Word Error Rate.
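
For readers unfamiliar with the metric the abstract reports, Word Error Rate is conventionally the word-level edit distance between a reference transcript and the recognizer's hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition; the function name and the whitespace tokenization are assumptions made for illustration, and the paper's own scoring setup may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many insertions.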

[AI-89] VoxPrivacy: A Benchmark for Evaluating Interactional Privacy of Speech Language Models

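Quick read: This paper addresses the lack of interactional privacy protection in Speech Language Models (SLMs) deployed in shared, multi-user environments such as smart homes: a model that cannot distinguish between users and adapt its responses accordingly may leak sensitive information. The key to its solution is VoxPrivacy, the first benchmark designed to evaluate interactional privacy in SLMs. It spans three tiers of increasing difficulty, from following direct secrecy commands to proactively inferring and protecting contextually private information. The authors also fine-tune a model on a large training set (4,000 hours), improving its privacy-preserving abilities while maintaining robustness, which offers a viable path toward safely deploying speaker-aware SLMs.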

Link: https://arxiv.org/abs/2601.19956
Authors: Yuxiang Wang, Hongyu Liu, Dekun Chen, Xueyao Zhang, Zhizheng Wu
Affiliation: unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments:

Abstract:As Speech Language Models (SLMs) transition from personal devices to shared, multi-user environments such as smart homes, a new challenge emerges: the model is expected to distinguish between users to manage information flow appropriately. Without this capability, an SLM could reveal one user’s confidential schedule to another, a privacy failure we term interactional privacy. Thus, the ability to generate speaker-aware responses becomes essential for SLM safe deployment. Current SLM benchmarks test dialogue ability but overlook speaker identity. Multi-speaker benchmarks check who said what without assessing whether SLMs adapt their responses. Privacy benchmarks focus on globally sensitive data (e.g., bank passwords) while neglecting contextual privacy-sensitive information (e.g., a user’s private appointment). To address this gap, we introduce VoxPrivacy, the first benchmark designed to evaluate interactional privacy in SLMs. VoxPrivacy spans three tiers of increasing difficulty, from following direct secrecy commands to proactively protecting privacy. Our evaluation of nine SLMs on a 32-hour bilingual dataset reveals a widespread vulnerability: most open-source models perform close to random chance (around 50% accuracy) on conditional privacy decisions, while even strong closed-source systems fall short on proactive privacy inference. We further validate these findings on Real-VoxPrivacy, a human-recorded subset, confirming that failures observed on synthetic data persist in real speech. Finally, we demonstrate a viable path forward: by fine-tuning on a new 4,000-hour training set, we improve privacy-preserving abilities while maintaining robustness. To support future work, we release the VoxPrivacy benchmark, the large-scale training set, and the fine-tuned model to foster the development of safer and more context-aware SLMs.

Machine Learning

[LG-0] PatchFormer: A Patch-Based Time Series Foundation Model with Hierarchical Masked Reconstruction and Cross-Domain Transfer Learning for Zero-Shot Multi-Horizon Forecasting

Link: https://arxiv.org/abs/2601.20845
Authors: Olaf Yunus Laitinen Imanov, Derya Umut Kulali, Taner Yilmaz
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 5 pages; 2 figures; 7 tables

Abstract:Time series forecasting is a fundamental problem with applications in climate, energy, healthcare, and finance. Many existing approaches require domain-specific feature engineering and substantial labeled data for each task. We introduce PatchFormer, a patch-based time series foundation model that uses hierarchical masked reconstruction for self-supervised pretraining and lightweight adapters for efficient transfer. PatchFormer segments time series into patches and learns multiscale temporal representations with learnable aggregation across temporal scales. Pretraining uses masked patch reconstruction with dynamic masking and objectives that encourage both local accuracy and global consistency, followed by cross-domain knowledge distillation. Experiments on 24 benchmark datasets spanning weather, energy, traffic, finance, and healthcare demonstrate state-of-the-art zero-shot multi-horizon forecasting, reducing mean squared error by 27.3 percent relative to strong baselines while requiring 94 percent less task-specific training data. The model exhibits near log-linear scaling with more pretraining data up to 100 billion points and processes length-512 sequences 3.8x faster than full-sequence transformers.
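
The patching and masked-reconstruction idea in the abstract can be sketched in a few lines. This is a single-scale toy with a fixed mask ratio, not PatchFormer itself: the function names, the zero-fill masking, and the parameters are hypothetical, and the hierarchical aggregation and dynamic masking described in the abstract are not modeled.

```python
import random


def split_into_patches(series, patch_len, stride):
    """Cut a 1-D series into windows of length patch_len, stride apart."""
    if patch_len <= 0 or stride <= 0:
        raise ValueError("patch_len and stride must be positive")
    last_start = len(series) - patch_len
    return [list(series[s:s + patch_len]) for s in range(0, last_start + 1, stride)]


def mask_patches(patches, mask_ratio, seed=0):
    """Zero out a random subset of patches.

    Returns the masked input and the sorted indices that were hidden.
    """
    rng = random.Random(seed)
    n_masked = round(mask_ratio * len(patches))
    masked = sorted(rng.sample(range(len(patches)), n_masked))
    hidden = set(masked)
    visible = [[0.0] * len(p) if i in hidden else list(p)
               for i, p in enumerate(patches)]
    return visible, masked


def masked_mse(predicted, patches, masked):
    """Mean squared error computed over the hidden patches only."""
    errors = [(a - b) ** 2
              for i in masked
              for a, b in zip(predicted[i], patches[i])]
    if not errors:
        raise ValueError("no masked patches to score")
    return sum(errors) / len(errors)


series = [float(t) for t in range(16)]
patches = split_into_patches(series, patch_len=4, stride=4)
visible, masked = mask_patches(patches, mask_ratio=0.5)
# A real model would predict the hidden patches from `visible`;
# scoring the zero-filled input gives the error of a trivial baseline.
print(masked, masked_mse(visible, patches, masked))
```

Scoring only the hidden patches is what makes the objective self-supervised: the training signal comes from the series itself rather than from task labels.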

[LG-1] Context-Augmented Code Generation Using Programming Knowledge Graphs

Link: https://arxiv.org/abs/2601.20810
Authors: Shahd Seddik, Fahd Seddik, Iman Saberi, Fatemeh Fard, Minh Hieu Huynh, Patanamon Thongtanunam
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) excel at code generation but struggle with complex problems. Retrieval-Augmented Generation (RAG) mitigates this issue by integrating external knowledge, yet retrieval models often miss relevant context, and generation models hallucinate with irrelevant data. We propose Programming Knowledge Graph (PKG) for semantic representation and fine-grained retrieval of code and text. Our approach enhances retrieval precision through tree pruning and mitigates hallucinations via a re-ranking mechanism that integrates non-RAG solutions. Structuring external data into finer-grained nodes improves retrieval granularity. Evaluations on HumanEval and MBPP show up to 20% pass@1 accuracy gains and a 34% improvement over baselines on MBPP. Our findings demonstrate that our proposed PKG approach along with re-ranker effectively address complex problems while maintaining minimal negative impact on solutions that are already correct without RAG. The replication package is published at this https URL

[LG-2] Active Learning for Decision Trees with Provable Guarantees ICLR2026

Link: https://arxiv.org/abs/2601.20775
Authors: Arshia Soltani Moakhar, Tanapoom Laoaron, Faraz Ghahremani, Kiarash Banihashem, MohammadTaghi Hajiaghayi
Categories: Machine Learning (cs.LG); Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS)
Comments: 10 pages, 43 pages with appendix, ICLR 2026, Conference URL: this https URL

Abstract:This paper advances the theoretical understanding of active learning label complexity for decision trees as binary classifiers. We make two main contributions. First, we provide the first analysis of the disagreement coefficient for decision trees, a key parameter governing active learning label complexity. Our analysis holds under two natural assumptions required for achieving polylogarithmic label complexity: (i) each root-to-leaf path queries distinct feature dimensions, and (ii) the input data has a regular, grid-like structure. We show these assumptions are essential, as relaxing them leads to polynomial label complexity. Second, we present the first general active learning algorithm for binary classification that achieves a multiplicative error guarantee, producing a $(1+\epsilon)$-approximate classifier. By combining these results, we design an active learning algorithm for decision trees that uses only a polylogarithmic number of label queries in the dataset size, under the stated assumptions. Finally, we establish a label complexity lower bound, showing our algorithm’s dependence on the error tolerance $\epsilon$ is close to optimal.

[LG-3] When More Data Doesn't Help: Limits of Adaptation in Multitask Learning

Link: https://arxiv.org/abs/2601.20774
Authors: Steve Hanneke, Mingyue Xu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Multitask learning and related frameworks have achieved tremendous success in modern applications. In a multitask learning problem, we are given a set of heterogeneous datasets collected from related source tasks and hope to enhance the performance above what we could hope to achieve by solving each of them individually. The recent work of arXiv:2006.15785 has shown that, without access to distributional information, no algorithm based on aggregating samples alone can guarantee optimal risk as long as the sample size per task is bounded. In this paper, we focus on understanding the statistical limits of multitask learning. We go beyond the no-free-lunch theorem in arXiv:2006.15785 by establishing a stronger impossibility result of adaptation that holds for arbitrarily large sample size per task. This improvement conveys an important message that the hardness of multitask learning cannot be overcome by having abundant data per task. We also discuss the notion of optimal adaptivity that may be of future interest.

[LG-4] Smoothing the Black-Box: Signed-Distance Supervision for Black-Box Model Copying

Link: https://arxiv.org/abs/2601.20773
Authors: Rubén Jiménez, Oriol Pujol
Categories: Machine Learning (cs.LG)
Comments: 27 pages

Abstract:Deployed machine learning systems must continuously evolve as data, architectures, and regulations change, often without access to original training data or model internals. In such settings, black-box copying provides a practical refactoring mechanism, i.e. upgrading legacy models by learning replicas from input-output queries alone. When restricted to hard-label outputs, copying turns into a discontinuous surface reconstruction problem from pointwise queries, severely limiting the ability to recover boundary geometry efficiently. We propose a distance-based copying (distillation) framework that replaces hard-label supervision with signed distances to the teacher’s decision boundary, converting copying into a smooth regression problem that exploits local geometry. We develop an $\alpha$-governed smoothing and regularization scheme with Hölder/Lipschitz control over the induced target surface, and introduce two model-agnostic algorithms to estimate signed distances under label-only access. Experiments on synthetic problems and UCI benchmarks show consistent improvements in fidelity and generalization accuracy over hard-label baselines, while enabling distance outputs as uncertainty-related signals for black-box replicas.
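
To make "signed distances under label-only access" concrete, here is one generic way such a target could be estimated: march along a set of directions until the teacher's label flips, refine the flip point by bisection, and attach the teacher's label as the sign. This is an illustrative sketch, not either of the two algorithms the paper introduces; the toy teacher, the function names, the 2-D restriction, and all parameters are assumptions.

```python
import math


def teacher_label(x):
    """Hypothetical black-box teacher: +1 inside the unit circle, -1 outside."""
    return 1 if x[0] ** 2 + x[1] ** 2 <= 1.0 else -1


def boundary_distance_along(label_fn, x, direction, max_radius=4.0, tol=1e-6):
    """Distance from x to the first label flip along a unit direction.

    Returns None if no flip is found within max_radius.
    """
    base = label_fn(x)

    def point_at(t):
        return [xi + t * di for xi, di in zip(x, direction)]

    # Coarse march to bracket a label flip between lo and hi.
    step = max_radius / 64
    lo, hi, t = 0.0, None, step
    while t <= max_radius:
        if label_fn(point_at(t)) != base:
            hi = t
            break
        lo = t
        t += step
    if hi is None:
        return None

    # Bisection: shrink the bracket using label queries only.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if label_fn(point_at(mid)) == base:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def signed_distance(label_fn, x, n_directions=64):
    """Teacher label times the smallest flip distance over sampled 2-D directions."""
    distances = []
    for k in range(n_directions):
        angle = 2.0 * math.pi * k / n_directions
        d = boundary_distance_along(label_fn, x, [math.cos(angle), math.sin(angle)])
        if d is not None:
            distances.append(d)
    if not distances:
        return None
    return label_fn(x) * min(distances)


print(signed_distance(teacher_label, [0.5, 0.0]))  # inside: positive
print(signed_distance(teacher_label, [2.0, 0.0]))  # outside: negative
```

Because only finitely many directions are searched, the result is an upper bound on the true distance in magnitude, and the coarse march can miss thin regions; both limitations would matter in higher dimensions.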

[LG-5] COMET-SG1: Lightweight Autoregressive Regressor for Edge and Embedded AI

Link: https://arxiv.org/abs/2601.20772
Authors: Shakhyar Gogoi
Categories: Machine Learning (cs.LG)
Comments: Preprint. Submitted to an IEEE conference. 6 pages, 6 figures, 2 tables

Abstract:COMET-SG1 is a lightweight, stability-oriented autoregressive regression model designed for time-series prediction on edge and embedded AI systems. Unlike recurrent neural networks or transformer-based sequence models, COMET-SG1 operates through linear behavior-space encoding, memory-anchored transition estimation, and deterministic state updates. This structure prioritizes bounded long-horizon behavior under fully autoregressive inference, a critical requirement for edge deployment where prediction errors accumulate over time. Experiments on non-stationary synthetic time-series data demonstrate that COMET-SG1 achieves competitive short-horizon accuracy while exhibiting significantly reduced long-horizon drift compared to MLP, LSTM, and k-nearest neighbor baselines. With a compact parameter footprint and operations compatible with fixed-point arithmetic, COMET-SG1 provides a practical and interpretable approach for stable autoregressive prediction in edge and embedded AI applications.

[LG-6] Less is More: Clustered Cross-Covariance Control for Offline RL

Link: https://arxiv.org/abs/2601.20765
Authors: Nan Qiao, Sheng Yue, Shuning Wang, Yongheng Deng, Ju Ren
Categories: Machine Learning (cs.LG)

Abstract:A fundamental challenge in offline reinforcement learning is distributional shift. Scarce data or datasets dominated by out-of-distribution (OOD) areas exacerbate this issue. Our theoretical analysis and experiments show that the standard squared error objective induces a harmful TD cross covariance. This effect amplifies in OOD areas, biasing optimization and degrading policy learning. To counteract this mechanism, we develop two complementary strategies: partitioned buffer sampling that restricts updates to localized replay partitions, attenuates irregular covariance effects, and aligns update directions, yielding a scheme that is easy to integrate with existing implementations, namely Clustered Cross-Covariance Control for TD (C^4). We also introduce an explicit gradient-based corrective penalty that cancels the covariance induced bias within each update. We prove that buffer partitioning preserves the lower bound property of the maximization objective, and that these constraints mitigate excessive conservatism in extreme OOD areas without altering the core behavior of policy constrained offline reinforcement learning. Empirically, our method showcases higher stability and up to 30% improvement in returns over prior methods, especially with small datasets and splits that emphasize OOD areas.

[LG-7] Supervised Guidance Training for Infinite-Dimensional Diffusion Models

Link: https://arxiv.org/abs/2601.20756
Authors: Elizabeth L. Baker, Alexander Denker, Jes Frellsen
Categories: Machine Learning (cs.LG)

Abstract:Score-based diffusion models have recently been extended to infinite-dimensional function spaces, with uses such as inverse problems arising from partial differential equations. In the Bayesian formulation of inverse problems, the aim is to sample from a posterior distribution over functions obtained by conditioning a prior on noisy observations. While diffusion models provide expressive priors in function space, the theory of conditioning them to sample from the posterior remains open. We address this, assuming that either the prior lies in the Cameron-Martin space, or is absolutely continuous with respect to a Gaussian measure. We prove that the models can be conditioned using an infinite-dimensional extension of Doob’s h-transform, and that the conditional score decomposes into an unconditional score and a guidance term. As the guidance term is intractable, we propose a simulation-free score matching objective (called Supervised Guidance Training) enabling efficient and stable posterior sampling. We illustrate the theory with numerical examples on Bayesian inverse problems in function spaces. In summary, our work offers the first function-space method for fine-tuning trained diffusion models to accurately sample from a posterior.

[LG-8] GraphAllocBench: A Flexible Benchmark for Preference-Conditioned Multi-Objective Policy Learning

Link: https://arxiv.org/abs/2601.20753
Authors: Zhiheng Jiang, Yunzhe Wang, Ryan Marr, Ellen Novoseller, Benjamin T. Files, Volkan Ustun
Categories: Machine Learning (cs.LG)

Abstract:Preference-Conditioned Policy Learning (PCPL) in Multi-Objective Reinforcement Learning (MORL) aims to approximate diverse Pareto-optimal solutions by conditioning policies on user-specified preferences over objectives. This enables a single model to flexibly adapt to arbitrary trade-offs at run-time by producing a policy on or near the Pareto front. However, existing benchmarks for PCPL are largely restricted to toy tasks and fixed environments, limiting their realism and scalability. To address this gap, we introduce GraphAllocBench, a flexible benchmark built on a novel graph-based resource allocation sandbox environment inspired by city management, which we call CityPlannerEnv. GraphAllocBench provides a rich suite of problems with diverse objective functions, varying preference conditions, and high-dimensional scalability. We also propose two new evaluation metrics – Proportion of Non-Dominated Solutions (PNDS) and Ordering Score (OS) – that directly capture preference consistency while complementing the widely used hypervolume metric. Through experiments with Multi-Layer Perceptrons (MLPs) and graph-aware models, we show that GraphAllocBench exposes the limitations of existing MORL approaches and paves the way for using graph-based methods such as Graph Neural Networks in complex, high-dimensional combinatorial allocation tasks. Beyond its predefined problem set, GraphAllocBench enables users to flexibly vary objectives, preferences, and allocation rules, establishing it as a versatile and extensible benchmark for advancing PCPL. Code: this https URL
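
The abstract names the Proportion of Non-Dominated Solutions (PNDS) metric without defining it. As a hedged illustration of the underlying notion only, the sketch below computes the fraction of a solution set that is not Pareto-dominated by any other member, assuming every objective is maximized. The function names and this reading of the metric are assumptions, not the benchmark's definition.

```python
def dominates(a, b):
    """True if a Pareto-dominates b under maximization:
    a is no worse on every objective and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def proportion_non_dominated(points):
    """Fraction of points not dominated by any other point in the set."""
    if not points:
        return 0.0
    non_dominated = [p for p in points if not any(dominates(q, p) for q in points)]
    return len(non_dominated) / len(points)

# Example: (1, 1) is dominated by (2, 2); the other three are mutually non-dominated.
solutions = [(1, 3), (2, 2), (3, 1), (1, 1)]
score = proportion_non_dominated(solutions)
```

The pairwise check is quadratic in the number of solutions, which is fine for small evaluation sets.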

[LG-9] SA-PEF: Step-Ahead Partial Error Feedback for Efficient Federated Learning

Link: https://arxiv.org/abs/2601.20738
Authors: Dawit Kiros Redie, Reza Arablouei, Stefan Werner
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Signal Processing (eess.SP); Optimization and Control (math.OC); Machine Learning (stat.ML)

Abstract:Biased gradient compression with error feedback (EF) reduces communication in federated learning (FL), but under non-IID data, the residual error can decay slowly, causing gradient mismatch and stalled progress in the early rounds. We propose step-ahead partial error feedback (SA-PEF), which integrates step-ahead (SA) correction with partial error feedback (PEF). SA-PEF recovers EF when the step-ahead coefficient \alpha=0 and step-ahead EF (SAEF) when \alpha=1. For non-convex objectives and \delta-contractive compressors, we establish a second-moment bound and a residual recursion that guarantee convergence to stationarity under heterogeneous data and partial client participation. The resulting rates match standard non-convex Fed-SGD guarantees up to constant factors, achieving O((\eta,\eta_0TR)^{-1}) convergence to a variance/heterogeneity floor with a fixed inner step size. Our analysis reveals a step-ahead-controlled residual contraction \rho_r that explains the observed acceleration in the early training phase. To balance SAEF’s rapid warm-up with EF’s long-term stability, we select \alpha near its theory-predicted optimum. Experiments across diverse architectures and datasets show that SA-PEF consistently reaches target accuracy faster than EF.
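
For orientation, here is a minimal sketch of classic error feedback with a top-k compressor, which the abstract says SA-PEF recovers at \alpha=0. The step-ahead correction and the partial-feedback rule are not specified in the abstract and are not reproduced here. The helper names `top_k` and `ef_step` are illustrative.

```python
def top_k(vec, k):
    """Keep the k largest-magnitude entries and zero the rest.
    Top-k is a standard example of a contractive (biased) compressor."""
    keep = set(sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]

def ef_step(grad, residual, lr, k):
    """One client-side error-feedback step.
    Returns (compressed message to transmit, new residual kept locally)."""
    corrected = [lr * g + e for g, e in zip(grad, residual)]    # add back past error
    message = top_k(corrected, k)                               # what is sent
    new_residual = [c - m for c, m in zip(corrected, message)]  # what was dropped
    return message, new_residual

# Two rounds on a 3-dimensional toy gradient with k = 1.
msg1, res1 = ef_step([1.0, -3.0, 0.5], [0.0, 0.0, 0.0], lr=1.0, k=1)
msg2, res2 = ef_step([1.0, 0.0, 0.0], res1, lr=1.0, k=1)
```

In the second round the residual from round one pushes the first coordinate above the threshold, so information dropped earlier is eventually transmitted. This is the mechanism whose slow decay under non-IID data the abstract describes.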

[LG-10] Deep Semi-Supervised Survival Analysis for Predicting Cancer Prognosis

Link: https://arxiv.org/abs/2601.20729
Authors: Anchen Sun, Zhibin Chen, Xiaodong Cai
Categories: Machine Learning (cs.LG)

Abstract:The Cox Proportional Hazards (PH) model is widely used in survival analysis. Recently, artificial neural network (ANN)-based Cox-PH models have been developed. However, training these Cox models with high-dimensional features typically requires a substantial number of labeled samples containing information about time-to-event. The limited availability of labeled data for training often constrains the performance of ANN-based Cox models. To address this issue, we employed a deep semi-supervised learning (DSSL) approach to develop single- and multi-modal ANN-based Cox models based on the Mean Teacher (MT) framework, which utilizes both labeled and unlabeled data for training. We applied our model, named Cox-MT, to predict the prognosis of several types of cancer using data from The Cancer Genome Atlas (TCGA). Our single-modal Cox-MT models, utilizing TCGA RNA-seq data or whole slide images, significantly outperformed the existing ANN-based Cox model, Cox-nnet, using the same data set across four types of cancer considered. As the number of unlabeled samples increased, the performance of Cox-MT significantly improved with a given set of labeled data. Furthermore, our multi-modal Cox-MT model demonstrated considerably better performance than the single-modal model. In summary, the Cox-MT model effectively leverages both labeled and unlabeled data to significantly enhance prediction accuracy compared to existing ANN-based Cox models trained solely on labeled data.
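
Two standard ingredients mentioned in the abstract can be sketched generically: the Cox negative log partial likelihood used to train ANN-based Cox models, and the exponential-moving-average teacher update of the Mean Teacher framework. This is an illustrative sketch on plain Python lists, not the Cox-MT implementation. The function names, the Breslow-style handling of ties, and the decay value are assumptions.

```python
import math

def cox_neg_log_partial_likelihood(risk_scores, times, events):
    """Negative log partial likelihood, averaged over observed events.
    risk_scores: model outputs h(x_i); times: observed times;
    events: 1 if the event was observed, 0 if censored.
    No special tie correction is applied (an assumption of this sketch)."""
    total, n_events = 0.0, 0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:
            continue  # censored samples enter only through risk sets
        at_risk = [risk_scores[j] for j, t_j in enumerate(times) if t_j >= t_i]
        total += risk_scores[i] - math.log(sum(math.exp(s) for s in at_risk))
        n_events += 1
    return -total / max(n_events, 1)

def ema_update(teacher_weights, student_weights, decay=0.99):
    """Mean Teacher update: teacher weights track an EMA of student weights."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_weights, student_weights)]
```

The partial likelihood needs event times, so only labeled samples contribute to it. Unlabeled samples can still contribute through a teacher-student consistency term, which is how Mean Teacher uses them in general.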

[LG-11] Structurally Human, Semantically Biased: Detecting LLM-Generated References with Embeddings and GNNs ICLR2026

Link: https://arxiv.org/abs/2601.20704
Authors: Melika Mobini, Vincent Holst, Floriano Tori, Andres Algaba, Vincent Ginis
Categories: Machine Learning (cs.LG)
Comments: 34 pages, 20 figures. Accepted at ICLR 2026

Abstract:Large language models are increasingly used to curate bibliographies, raising the question: are their reference lists distinguishable from human ones? We build paired citation graphs, ground truth and GPT-4o-generated (from parametric knowledge), for 10,000 focal papers (\approx 275k references) from SciSciNet, and add a field-matched random baseline that preserves out-degree and field distributions while breaking latent structure. We compare (i) structure-only node features (degree/closeness/eigenvector centrality, clustering, edge count) with (ii) 3072-D title/abstract embeddings, using an RF on graph-level aggregates and Graph Neural Networks with node features. Structure alone barely separates GPT from ground truth (RF accuracy \approx 0.60) despite cleanly rejecting the random baseline (\approx 0.89–0.92). By contrast, embeddings sharply increase separability: RF on aggregated embeddings reaches \approx 0.83, and GNNs with embedding node features achieve 93% test accuracy on GPT vs. ground truth. We show the robustness of our findings by replicating the pipeline with Claude Sonnet 4.5 and with multiple embedding models (OpenAI and SPECTER), with RF separability for ground truth vs. Claude \approx 0.77 and clean rejection of the random baseline. Thus, LLM bibliographies, generated purely from parametric knowledge, closely mimic human citation topology, but leave detectable semantic fingerprints; detection and debiasing should target content signals rather than global graph structure.

[LG-12] Is Pure Exploitation Sufficient in Exogenous MDPs with Linear Function Approximation? ICLR2026

Link: https://arxiv.org/abs/2601.20694
Authors: Hao Liang, Jiayu Cheng, Sean R. Sinclair, Yali Du
Categories: Machine Learning (cs.LG)
Comments: Accepted to ICLR 2026

Abstract:Exogenous MDPs (Exo-MDPs) capture sequential decision-making where uncertainty comes solely from exogenous inputs that evolve independently of the learner’s actions. This structure is especially common in operations research applications such as inventory control, energy storage, and resource allocation, where exogenous randomness (e.g., demand, arrivals, or prices) drives system behavior. Despite decades of empirical evidence that greedy, exploitation-only methods work remarkably well in these settings, theory has lagged behind: all existing regret guarantees for Exo-MDPs rely on explicit exploration or tabular assumptions. We show that exploration is unnecessary. We propose Pure Exploitation Learning (PEL) and prove the first general finite-sample regret bounds for exploitation-only algorithms in Exo-MDPs. In the tabular case, PEL achieves \widetilde{O}(H^2|\Xi|\sqrt{K}). For large, continuous endogenous state spaces, we introduce LSVI-PE, a simple linear-approximation method whose regret is polynomial in the feature dimension, exogenous state space, and horizon, independent of the endogenous state and action spaces. Our analysis introduces two new tools: counterfactual trajectories and Bellman-closed feature transport, which together allow greedy policies to have accurate value estimates without optimism. Experiments on synthetic and resource-management tasks show that PEL consistently outperforms baselines. Overall, our results overturn the conventional wisdom that exploration is required, demonstrating that in Exo-MDPs, pure exploitation is enough.

[LG-13] Optimal Transport Group Counterfactual Explanations

Link: https://arxiv.org/abs/2601.20692
Authors: Enrique Valero-Leal, Bernd Bischl, Pedro Larrañaga, Concha Bielza, Giuseppe Casalicchio
Categories: Machine Learning (cs.LG)

Abstract:Group counterfactual explanations find a set of counterfactual instances to explain a group of input instances contrastively. However, existing methods (i) optimize counterfactuals only for a fixed group and do not generalize to new group members, (ii) strictly rely on strong model assumptions (e.g., linearity) for tractability, and/or (iii) poorly control the counterfactual group geometry distortion. We instead learn an explicit optimal transport map that sends any group instance to its counterfactual without re-optimization, minimizing the group’s total transport cost. This enables generalization with fewer parameters, making it easier to interpret the common actionable recourse. For linear classifiers, we prove that functions representing group counterfactuals are derived via mathematical optimization, identifying the underlying convex optimization type (QP, QCQP, …). Experiments show that they accurately generalize, preserve group geometry and incur only negligible additional transport cost compared to baseline methods. If model linearity cannot be exploited, our approach also significantly outperforms the baselines.

[LG-14] Positive-Unlabeled Reinforcement Learning Distillation for On-Premise Small Models

Link: https://arxiv.org/abs/2601.20687
Authors: Zhiqiang Kou, Junyang Chen, Xin-Qiang Cai, Xiaobo Xia, Ming-Kun Xie, Dong-Dong Wu, Biao Liu, Yuheng Jia, Xin Geng, Masashi Sugiyama, Tat-Seng Chua
Categories: Machine Learning (cs.LG)
Comments: 22 pages, 8 figures, 7 tables

Abstract:Due to constraints on privacy, cost, and latency, on-premise deployment of small models is increasingly common. However, most practical pipelines stop at supervised fine-tuning (SFT) and fail to reach the reinforcement learning (RL) alignment stage. The main reason is that RL alignment typically requires either expensive human preference annotation or heavy reliance on high-quality reward models with large-scale API usage and ongoing engineering maintenance, both of which are ill-suited to on-premise settings. To bridge this gap, we propose a positive-unlabeled (PU) RL distillation method for on-premise small-model deployment. Without human-labeled preferences or a reward model, our method distills the teacher’s preference-optimization capability from black-box generations into a locally trainable student. For each prompt, we query the teacher once to obtain an anchor response, locally sample multiple student candidates, and perform anchor-conditioned self-ranking to induce pairwise or listwise preferences, enabling a fully local training loop via direct preference optimization or group relative policy optimization. Theoretical analysis shows that the preference signal induced by our method is order-consistent and concentrates on near-optimal candidates, supporting its stability for preference optimization. Experiments demonstrate that our method achieves consistently strong performance under a low-cost setting.
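
The loop described above (one teacher anchor, several student candidates, an induced preference pair, then a preference-optimization loss) can be sketched as follows. The abstract does not say how the anchor-conditioned self-ranking scores candidates, so `overlap_score` is a purely hypothetical stand-in based on token overlap. `dpo_loss` is the standard direct preference optimization loss for a single pair, with log-probabilities supplied by the caller.

```python
import math

def overlap_score(candidate, anchor):
    """Hypothetical scorer: Jaccard overlap between candidate and anchor tokens.
    The paper's self-ranking rule is not given in the abstract."""
    a, c = set(anchor.split()), set(candidate.split())
    return len(a & c) / len(a | c) if (a | c) else 0.0

def induce_pair(candidates, anchor):
    """Return (chosen, rejected): the best- and worst-scoring student candidates."""
    ranked = sorted(candidates, key=lambda c: overlap_score(c, anchor), reverse=True)
    return ranked[0], ranked[-1]

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one pair: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

anchor = "the cat sat"
chosen, rejected = induce_pair(["the cat sat", "a dog ran", "the cat ran"], anchor)
```

Only the anchor requires a teacher call. Scoring, ranking, and the loss all run locally, which matches the on-premise constraint the abstract emphasizes.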

[LG-15] MuRAL-CPD: Active Learning for Multiresolution Change Point Detection ICDM

Link: https://arxiv.org/abs/2601.20686
Authors: Stefano Bertolasi, Diego Carrera, Diego Stucchi, Pasqualina Fragneto, Luigi Amedeo Bianchi
Categories: Machine Learning (cs.LG)
Comments: Presented at 2025 IEEE International Conference on Data Mining (ICDM), to appear in the Proceedings

Abstract:Change Point Detection (CPD) is a critical task in time series analysis, aiming to identify moments when the underlying data-generating process shifts. Traditional CPD methods often rely on unsupervised techniques, which lack adaptability to task-specific definitions of change and cannot benefit from user knowledge. To address these limitations, we propose MuRAL-CPD, a novel semi-supervised method that integrates active learning into a multiresolution CPD algorithm. MuRAL-CPD leverages a wavelet-based multiresolution decomposition to detect changes across multiple temporal scales and incorporates user feedback to iteratively optimize key hyperparameters. This interaction enables the model to align its notion of change with that of the user, improving both accuracy and interpretability. Our experimental results on several real-world datasets show the effectiveness of MuRAL-CPD against state-of-the-art methods, particularly in scenarios where minimal supervision is available.

[LG-16] An Empirical Investigation of Neural ODEs and Symbolic Regression for Dynamical Systems NEURIPS2025

Link: https://arxiv.org/abs/2601.20637
Authors: Panayiotis Ioannou, Pietro Liò, Pietro Cicuta
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: Accepted at the Machine Learning and the Physical Sciences Workshop, NeurIPS 2025

Abstract:Accurately modelling the dynamics of complex systems and discovering their governing differential equations are critical tasks for accelerating scientific discovery. Using noisy, synthetic data from two damped oscillatory systems, we explore the extrapolation capabilities of Neural Ordinary Differential Equations (NODEs) and the ability of Symbolic Regression (SR) to recover the underlying equations. Our study yields three key insights. First, we demonstrate that NODEs can extrapolate effectively to new boundary conditions, provided the resulting trajectories share dynamic similarity with the training data. Second, SR successfully recovers the equations from noisy ground-truth data, though its performance is contingent on the correct selection of input variables. Finally, we find that SR recovers two out of the three governing equations, along with a good approximation for the third, when using data generated by a NODE trained on just 10% of the full simulation. While this last finding highlights an area for future work, our results suggest that using NODEs to enrich limited data and enable symbolic regression to infer physical laws represents a promising new approach for scientific discovery.

[LG-17] A Foundation Model for Virtual Sensors

Link: https://arxiv.org/abs/2601.20634
Authors: Leon Götz, Lars Frederik Peiss, Erik Sauer, Andreas Udo Sass, Thorsten Bagdonat, Stephan Günnemann, Leo Schwinn
Categories: Machine Learning (cs.LG)
Comments: 18 pages in total, 15 figures

Abstract:Virtual sensors use machine learning to predict target signals from available measurements, replacing expensive physical sensors in critical applications. Existing virtual sensor approaches require application-specific models with hand-selected inputs for each sensor, cannot leverage task synergies, and lack consistent benchmarks. At the same time, emerging time series foundation models are computationally expensive and limited to predicting their input signals, making them incompatible with virtual sensors. We introduce the first foundation model for virtual sensors addressing both limitations. Our unified model can simultaneously predict diverse virtual sensors exploiting synergies while maintaining computational efficiency. It learns relevant input signals for each virtual sensor, eliminating expert knowledge requirements while adding explainability. In our large-scale evaluation on a standard benchmark and an application-specific dataset with over 18 billion samples, our architecture achieves 415x reduction in computation time and 951x reduction in memory requirements, while maintaining or even improving predictive quality compared to baselines. Our model scales gracefully to hundreds of virtual sensors with nearly constant parameter count, enabling practical deployment in large-scale sensor networks.

[LG-18] DIVERSE: Disagreement-Inducing Vector Evolution for Rashomon Set Exploration

Link: https://arxiv.org/abs/2601.20627
Authors: Gilles Eerlings, Brent Zoomers, Jori Liesenborgs, Gustavo Rovelo Ruiz, Kris Luyten
Categories: Machine Learning (cs.LG)

Abstract:We propose DIVERSE, a framework for systematically exploring the Rashomon set of deep neural networks, the collection of models that match a reference model’s accuracy while differing in their predictive behavior. DIVERSE augments a pretrained model with Feature-wise Linear Modulation (FiLM) layers and uses Covariance Matrix Adaptation Evolution Strategy (CMA-ES) to search a latent modulation space, generating diverse model variants without retraining or gradient access. Across MNIST, PneumoniaMNIST, and CIFAR-10, DIVERSE uncovers multiple high-performing yet functionally distinct models. Our experiments show that DIVERSE offers a competitive and efficient exploration of the Rashomon set, making it feasible to construct diverse sets that maintain robustness and performance while supporting well-balanced model multiplicity. While retraining remains the baseline to generate Rashomon sets, DIVERSE achieves comparable diversity at reduced computational cost.
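
Feature-wise Linear Modulation applies a per-feature scale and shift to intermediate activations. The sketch below shows that operation and a simple prediction-disagreement measure between two model variants. It assumes (this is not stated in the abstract) that the identity setting, scale 1 and shift 0, reproduces the pretrained model. The CMA-ES search over modulation vectors is not shown.

```python
def film(features, gamma, beta):
    """FiLM: scale each feature by gamma and shift it by beta."""
    return [g * f + b for f, g, b in zip(features, gamma, beta)]

def disagreement_rate(preds_a, preds_b):
    """Fraction of inputs on which two models predict different labels."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

hidden = [1.0, 2.0]
unchanged = film(hidden, gamma=[1.0, 1.0], beta=[0.0, 0.0])   # identity modulation
modulated = film(hidden, gamma=[2.0, 0.5], beta=[0.1, -1.0])  # a perturbed variant
```

A gradient-free search can treat the concatenated scale and shift values as its search vector and score each candidate by accuracy and disagreement with the reference model. The exact objective used by DIVERSE is not given in the abstract.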

[LG-19] ACFormer: Mitigating Non-linearity with Auto Convolutional Encoder for Time Series Forecasting

Link: https://arxiv.org/abs/2601.20611
Authors: Gawon Lee, Hanbyeol Park, Minseop Kim, Dohee Kim, Hyerim Bae
Categories: Machine Learning (cs.LG)

Abstract:Time series forecasting (TSF) faces challenges in modeling complex intra-channel temporal dependencies and inter-channel correlations. Although recent research has highlighted the efficiency of linear architectures in capturing global trends, these models often struggle with non-linear signals. To address this gap, we conducted a systematic receptive field analysis of convolutional neural network (CNN) TSF models. We introduce the “individual receptive field” to uncover granular structural dependencies, revealing that convolutional layers act as feature extractors that mirror channel-wise attention while exhibiting superior robustness to non-linear fluctuations. Based on these insights, we propose ACFormer, an architecture designed to reconcile the efficiency of linear projections with the non-linear feature-extraction power of convolutions. ACFormer captures fine-grained information through a shared compression module, preserves temporal locality via gated attention, and reconstructs variable-specific temporal patterns using an independent patch expansion layer. Extensive experiments on multiple benchmark datasets demonstrate that ACFormer consistently achieves state-of-the-art performance, effectively mitigating the inherent drawbacks of linear models in capturing high-frequency components.

[LG-20] CoBA: Integrated Deep Learning Model for Reliable Low-Altitude UAV Classification in mmWave Radio Networks

Link: https://arxiv.org/abs/2601.20605
Authors: Junaid Sajid, Ivo Müürsepp, Luca Reggiani, Davide Scazzoli, Federico Francesco Luigi Mariani, Maurizio Magarini, Rizwan Ahmad, Muhammad Mahtab Alam
Categories: Machine Learning (cs.LG)
Comments: 6 pages. Accepted for publication at the IEEE International Conference on Communications (ICC) 2026

Abstract:Uncrewed Aerial Vehicles (UAVs) are increasingly used in civilian and industrial applications, making secure low-altitude operations crucial. In dense mmWave environments, accurately classifying low-altitude UAVs as operating in either authorized or restricted airspace remains challenging, requiring models that handle complex propagation and signal variability. This paper proposes a deep learning model, referred to as CoBA, which stands for integrated Convolutional Neural Network (CNN), Bidirectional Long Short-Term Memory (BiLSTM), and Attention, and which leverages Fifth Generation (5G) millimeter-wave (mmWave) radio measurements to classify UAV operations in authorized and restricted airspaces at low altitude. The proposed CoBA model integrates convolutional, bidirectional recurrent, and attention layers to capture both spatial and temporal patterns in UAV radio measurements. To validate the model, a dedicated dataset is collected using the 5G mmWave network at TalTech, with controlled low-altitude UAV flights in authorized and restricted scenarios. The model is evaluated against conventional ML models and a fingerprinting-based benchmark. Experimental results show that CoBA achieves superior accuracy, significantly outperforming all baseline models and demonstrating its potential for reliable and regulated UAV airspace monitoring.

[LG-21] Reinforcement Unlearning via Group Relative Policy Optimization

Link: https://arxiv.org/abs/2601.20568
Authors: Efstratios Zaradoukas, Bardh Prenkaj, Gjergji Kasneci
Categories: Machine Learning (cs.LG)

Abstract:During pretraining, LLMs inadvertently memorize sensitive or copyrighted data, posing significant compliance challenges under legal frameworks like the GDPR and the EU AI Act. Fulfilling these mandates demands techniques that can remove information from a deployed model without retraining from scratch. Existing unlearning approaches attempt to address this need, but often leak the very data they aim to erase, sacrifice fluency and robustness, or depend on costly external reward models. We introduce PURGE (Policy Unlearning through Relative Group Erasure), a novel method grounded in the Group Relative Policy Optimization framework that formulates unlearning as a verifiable problem. PURGE uses an intrinsic reward signal that penalizes any mention of forbidden concepts, allowing safe and consistent unlearning. Our approach reduces token usage per target by up to a factor of 46 compared with SotA methods, while improving fluency by 5.48 percent and adversarial robustness by 12.02 percent over the base model. On the Real World Knowledge Unlearning (RWKU) benchmark, PURGE achieves 11 percent unlearning effectiveness while preserving 98 percent of original utility. PURGE shows that framing LLM unlearning as a verifiable task enables more reliable, efficient, and scalable forgetting, suggesting a promising new direction for unlearning research that combines theoretical guarantees, improved safety, and practical deployment efficiency.
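
As a rough illustration of how a verifiable, intrinsic reward plugs into Group Relative Policy Optimization, the sketch below scores a group of sampled completions with a binary check for forbidden terms and then normalizes the rewards within the group. The binary reward form, the substring matching, and the example strings are assumptions; the abstract states only that mentions of forbidden concepts are penalized.

```python
import statistics

def intrinsic_reward(completion, forbidden_terms):
    """Hypothetical binary reward: 1.0 if no forbidden term appears, else 0.0."""
    text = completion.lower()
    return 0.0 if any(term.lower() in text for term in forbidden_terms) else 1.0

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group.
    Population standard deviation is used here; implementations differ."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

group = [
    "I am not able to help with that.",
    "Harry Potter attended Hogwarts.",
    "Let me talk about something else.",
    "The boy wizard harry potter lived under the stairs.",
]
rewards = [intrinsic_reward(c, ["Harry Potter"]) for c in group]
advantages = group_relative_advantages(rewards)
```

Because the reward is computed by a rule rather than a learned model, no external reward model is needed, which is the dependency the abstract says existing approaches suffer from.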

[LG-22] An explainable framework for the relationship between dementia and glucose metabolism patterns

Link: https://arxiv.org/abs/2601.20480
Authors: C. Vázquez-García, F. J. Martínez-Murcia, F. Segovia Román, A. Forte, J. Ramírez, I. Illán, A. Hernández-Segura, C. Jiménez-Mesa, Juan M. Górriz
Categories: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)

Abstract:High-dimensional neuroimaging data presents challenges for assessing neurodegenerative diseases due to complex non-linear relationships. Variational Autoencoders (VAEs) can encode scans into lower-dimensional latent spaces capturing disease-relevant features. We propose a semi-supervised VAE framework with a flexible similarity regularization term that aligns selected latent variables with clinical or biomarker measures of dementia progression. This allows adapting the similarity metric and supervised variables to specific goals or available data. We demonstrate the approach using PET scans from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), guiding the first latent dimension to align with a cognitive score. Using this supervised latent variable, we generate average reconstructions across levels of cognitive impairment. Voxel-wise GLM analysis reveals reduced metabolism in key regions, mainly the hippocampus, and within major Resting State Networks, particularly the Default Mode and Central Executive Networks. The remaining latent variables encode affine transformations and intensity variations, capturing confounds such as inter-subject variability and site effects. Our framework effectively extracts disease-related patterns aligned with established Alzheimer’s biomarkers, offering an interpretable and adaptable tool for studying neurodegenerative progression.

[LG-23] Implicit Hypothesis Testing and Divergence Preservation in Neural Network Representations

Link: https://arxiv.org/abs/2601.20477
Authors: Kadircan Aksoy, Peter Jung, Protim Bhattacharjee
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)

Abstract:We study the supervised training dynamics of neural classifiers through the lens of binary hypothesis testing. We model classification as a set of binary tests between class-conditional distributions of representations and empirically show that, along training trajectories, well-generalizing networks increasingly align with Neyman-Pearson optimal decision rules via monotonic improvements in KL divergence that relate to error rate exponents. We finally discuss how this yields an explanation and possible training or regularization strategies for different classes of neural networks.

[LG-24] TimeCatcher: A Variational Framework for Volatility-Aware Forecasting of Non-Stationary Time Series

Link: https://arxiv.org/abs/2601.20448
Authors: Zhiyu Chen, Minhao Liu, Yanru Zhang
Categories: Machine Learning (cs.LG)
Comments: Under review. 13 pages, 8 figures. This paper proposes a variational framework with adaptive volatility enhancement for non-stationary time series forecasting

Abstract:Recent lightweight MLP-based models have achieved strong performance in time series forecasting by capturing stable trends and seasonal patterns. However, their effectiveness hinges on an implicit assumption of local stationarity, making them prone to errors in long-term forecasting of highly non-stationary series, especially when abrupt fluctuations occur, a common challenge in domains like web traffic monitoring. To overcome this limitation, we propose TimeCatcher, a novel Volatility-Aware Variational Forecasting framework. TimeCatcher extends linear architectures with a variational encoder to capture latent dynamic patterns hidden in historical data and a volatility-aware enhancement mechanism to detect and amplify significant local variations. Experiments on nine real-world datasets from traffic, financial, energy, and weather domains show that TimeCatcher consistently outperforms state-of-the-art baselines, with particularly large improvements in long-term forecasting scenarios characterized by high volatility and sudden fluctuations. Our code is available at this https URL.

[LG-25] Nonlinear Dimensionality Reduction with Diffusion Maps in Practice

Link: https://arxiv.org/abs/2601.20428
Authors: Sönke Beier, Paula Pirker-Díaz, Friedrich Pagenkopf, Karoline Wiesner
Categories: Machine Learning (cs.LG); Applications (stat.AP)

Abstract:Diffusion Map is a spectral dimensionality reduction technique that can uncover nonlinear submanifolds in high-dimensional data, and it is increasingly applied across a wide range of scientific disciplines, such as biology, engineering, and social sciences. However, data preprocessing, parameter settings, and component selection have a significant influence on the resulting manifold, which has not been comprehensively discussed in the literature so far. We provide a practice-oriented review of the Diffusion Map technique, illustrate pitfalls, and showcase a recently introduced technique for identifying the most relevant components. Our results show that the first components are not necessarily the most relevant ones.
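
For readers unfamiliar with the method, here is a minimal NumPy sketch of a basic diffusion map: a Gaussian kernel, row normalization into a Markov matrix, and an eigendecomposition. The bandwidth `epsilon`, the diffusion time `t`, and the absence of any density normalization are illustrative choices, not recommendations from the review. The sketch returns the leading non-trivial components, which is exactly the default the abstract cautions may not be the most relevant ones.

```python
import numpy as np

def diffusion_map(X, epsilon, n_components=2, t=1):
    """Basic diffusion map. Returns (eigenvalues, embedding coordinates)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / epsilon)   # Gaussian kernel matrix
    d = K.sum(axis=1)                 # node degrees
    # Symmetric conjugate of the Markov matrix P = D^{-1} K; it has the same
    # eigenvalues as P and lets us use a symmetric eigensolver.
    A = K / np.sqrt(np.outer(d, d))
    eigvals, eigvecs = np.linalg.eigh(A)
    order = np.argsort(eigvals)[::-1]  # sort eigenpairs in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    psi = eigvecs / np.sqrt(d)[:, None]  # right eigenvectors of P
    # Drop the trivial first pair (eigenvalue 1, constant eigenvector).
    lam = eigvals[1:n_components + 1]
    return lam, psi[:, 1:n_components + 1] * lam ** t

X = np.linspace(0.0, 1.0, 20)[:, None]  # 20 points on a line segment
lam, coords = diffusion_map(X, epsilon=0.1)
```

Rescaling the features of `X` or changing `epsilon` changes the kernel matrix and therefore the spectrum, which is one concrete way the preprocessing and parameter choices discussed in the abstract alter the result.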

[LG-26] Concept Component Analysis: A Principled Approach for Concept Extraction in LLMs

Link: https://arxiv.org/abs/2601.20420
Authors: Yuhang Liu, Erdun Gao, Dong Gong, Anton van den Hengel, Javen Qinfeng Shi
Categories: Machine Learning (cs.LG)

Abstract:Developing human-understandable interpretations of large language models (LLMs) is becoming increasingly critical for their deployment in essential domains. Mechanistic interpretability seeks to address this need by extracting human-interpretable processes and concepts from LLMs’ activations. Sparse autoencoders (SAEs) have emerged as a popular approach for extracting interpretable and monosemantic concepts by decomposing the LLM internal representations into a dictionary. Despite their empirical progress, SAEs suffer from a fundamental theoretical ambiguity: the well-defined correspondence between LLM representations and human-interpretable concepts remains unclear. This lack of theoretical grounding gives rise to several methodological challenges, including difficulties in principled method design and evaluation criteria. In this work, we show that, under mild assumptions, LLM representations can be approximated as a linear mixture of the log-posteriors over concepts given the input context, through the lens of a latent variable model where concepts are treated as latent variables. This motivates a principled framework for concept extraction, namely Concept Component Analysis (ConCA), which aims to recover the log-posterior of each concept from LLM representations through an unsupervised linear unmixing process. We explore a specific variant, termed sparse ConCA, which leverages a sparsity prior to address the inherent ill-posedness of the unmixing problem. We implement 12 sparse ConCA variants and demonstrate their ability to extract meaningful concepts across multiple LLMs, offering theory-backed advantages over SAEs.

[LG-27] AWGformer: Adaptive Wavelet-Guided Transformer for Multi-Resolution Time Series Forecasting ICASSP2026

Link: https://arxiv.org/abs/2601.20409
Authors: Wei Li
Categories: Machine Learning (cs.LG)
*Comments: Accepted by ICASSP 2026

Abstract:Time series forecasting requires capturing patterns across multiple temporal scales while maintaining computational efficiency. This paper introduces AWGformer, a novel architecture that integrates adaptive wavelet decomposition with cross-scale attention mechanisms for enhanced multi-variate time series prediction. Our approach comprises: (1) an Adaptive Wavelet Decomposition Module (AWDM) that dynamically selects optimal wavelet bases and decomposition levels based on signal characteristics; (2) a Cross-Scale Feature Fusion (CSFF) mechanism that captures interactions between different frequency bands through learnable coupling matrices; (3) a Frequency-Aware Multi-Head Attention (FAMA) module that weights attention heads according to their frequency selectivity; (4) a Hierarchical Prediction Network (HPN) that generates forecasts at multiple resolutions before reconstruction. Extensive experiments on benchmark datasets demonstrate that AWGformer achieves significant average improvements over state-of-the-art methods, with particular effectiveness on multi-scale and non-stationary time series. Theoretical analysis provides convergence guarantees and establishes the connection between our wavelet-guided attention and classical signal processing principles.

[LG-28] ScatterFusion: A Hierarchical Scattering Transform Framework for Enhanced Time Series Forecasting ICASSP2026

Link: https://arxiv.org/abs/2601.20401
Authors: Wei Li
Categories: Machine Learning (cs.LG)
*Comments: Accepted by ICASSP 2026

Abstract:Time series forecasting presents significant challenges due to the complex temporal dependencies at multiple time scales. This paper introduces ScatterFusion, a novel framework that synergistically integrates scattering transforms with hierarchical attention mechanisms for robust time series forecasting. Our approach comprises four key components: (1) a Hierarchical Scattering Transform Module (HSTM) that extracts multi-scale invariant features capturing both local and global patterns; (2) a Scale-Adaptive Feature Enhancement (SAFE) module that dynamically adjusts feature importance across different scales; (3) a Multi-Resolution Temporal Attention (MRTA) mechanism that learns dependencies at varying time horizons; and (4) a Trend-Seasonal-Residual (TSR) decomposition-guided structure-aware loss function. Extensive experiments on seven benchmark datasets demonstrate that ScatterFusion outperforms other common methods, achieving significant reductions in error metrics across various prediction horizons.

[LG-29] Graph-Structured Deep Learning Framework for Multi-task Contention Identification with High-dimensional Metrics

Link: https://arxiv.org/abs/2601.20389
Authors: Xiao Yang,Yinan Ni,Yuqi Tang,Zhimin Qiu,Chen Wang,Tingzhou Yuan
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*Comments:

Abstract:This study addresses the challenge of accurately identifying multi-task contention types in high-dimensional system environments and proposes a unified contention classification framework that integrates representation transformation, structural modeling, and a task decoupling mechanism. The method first constructs system state representations from high-dimensional metric sequences, applies nonlinear transformations to extract cross-dimensional dynamic features, and integrates multi-source information such as resource utilization, scheduling behavior, and task load variations within a shared representation space. It then introduces a graph-based modeling mechanism to capture latent dependencies among metrics, allowing the model to learn competitive propagation patterns and structural interference across resource links. On this basis, task-specific mapping structures are designed to model the differences among contention types and enhance the classifier’s ability to distinguish multiple contention patterns. To achieve stable performance, the method employs an adaptive multi-task loss weighting strategy that balances shared feature learning with task-specific feature extraction and generates final contention predictions through a standardized inference process. Experiments conducted on a public system trace dataset demonstrate advantages in accuracy, recall, precision, and F1, and sensitivity analyses on batch size, training sample scale, and metric dimensionality further confirm the model’s stability and applicability. The study shows that structured representations and multi-task classification based on high-dimensional metrics can significantly improve contention pattern recognition and offer a reliable technical approach for performance management in complex computing environments.

[LG-30] Unsupervised Anomaly Detection in Multi-Agent Trajectory Prediction via Transformer-Based Models

Link: https://arxiv.org/abs/2601.20367
Authors: Qing Lyu,Zhe Fu,Alexandre Bayen
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
*Comments:

Abstract:Identifying safety-critical scenarios is essential for autonomous driving, but the rarity of such events makes supervised labeling impractical. Traditional rule-based metrics like Time-to-Collision are too simplistic to capture complex interaction risks, and existing methods lack a systematic way to verify whether statistical anomalies truly reflect physical danger. To address this gap, we propose an unsupervised anomaly detection framework based on a multi-agent Transformer that models normal driving and measures deviations through prediction residuals. A dual evaluation scheme has been proposed to assess both detection stability and physical alignment: Stability is measured using standard ranking metrics in which Kendall Rank Correlation Coefficient captures rank agreement and Jaccard index captures the consistency of the top-K selected items; Physical alignment is assessed through correlations with established Surrogate Safety Measures (SSM). Experiments on the NGSIM dataset demonstrate our framework’s effectiveness: We show that the maximum residual aggregator achieves the highest physical alignment while maintaining stability. Furthermore, our framework identifies 388 unique anomalies missed by Time-to-Collision and statistical baselines, capturing subtle multi-agent risks like reactive braking under lateral drift. The detected anomalies are further clustered into four interpretable risk types, offering actionable insights for simulation and testing.
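
As a rough illustration of the two stability metrics named in the abstract, the following Python sketch computes a Kendall rank correlation (the tau-a variant, which counts tied pairs as neither concordant nor discordant) and a top-K Jaccard index between two anomaly-score lists. The function names and the toy scores are hypothetical and are not taken from the paper; the authors' exact variant of each metric may differ.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two equally long score lists."""
    n = len(scores_a)
    if n != len(scores_b) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # Positive product: both lists order the pair the same way.
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) // 2)


def top_k_jaccard(scores_a, scores_b, k):
    """Jaccard index of the index sets holding the k largest scores."""
    if k < 1:
        raise ValueError("k must be at least 1")

    def top_k(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return set(order[:k])

    set_a, set_b = top_k(scores_a), top_k(scores_b)
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical anomaly scores for four scenarios under two runs.
run_1 = [0.9, 0.1, 0.5, 0.7]
run_2 = [0.8, 0.2, 0.6, 0.4]
tau = kendall_tau(run_1, run_2)
jac = top_k_jaccard(run_1, run_2, k=2)
```

A tau near 1 means the two runs rank scenarios almost identically, while the Jaccard index only checks whether the same scenarios land in the top-K set, regardless of their order inside it.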

[LG-31] TINNs: Time-Induced Neural Networks for Solving Time-Dependent PDEs

Link: https://arxiv.org/abs/2601.20361
Authors: Chen-Yang Dai,Che-Chia Chang,Te-Sheng Lin,Ming-Chih Lai,Chieh-Hsin Lai
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*Comments:

Abstract:Physics-informed neural networks (PINNs) solve time-dependent partial differential equations (PDEs) by learning a mesh-free, differentiable solution that can be evaluated anywhere in space and time. However, standard space–time PINNs take time as an input but reuse a single network with shared weights across all times, forcing the same features to represent markedly different dynamics. This coupling degrades accuracy and can destabilize training when enforcing PDE, boundary, and initial constraints jointly. We propose Time-Induced Neural Networks (TINNs), a novel architecture that parameterizes the network weights as a learned function of time, allowing the effective spatial representation to evolve over time while maintaining shared structure. The resulting formulation naturally yields a nonlinear least-squares problem, which we optimize efficiently using a Levenberg–Marquardt method. Experiments on various time-dependent PDEs show up to 4× improved accuracy and 10× faster convergence compared to PINNs and strong baselines.

[LG-32] Window-Diffusion: Accelerating Diffusion Language Model Inference with Windowed Token Pruning and Caching

Link: https://arxiv.org/abs/2601.20332
Authors: Fengrui Zuo,Zhiwei Ke,Yiming Liu,Wenqi Lou,Chao Wang,Xvehai Zhou
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Diffusion language models (DLMs) generate text through iterative denoising, but inference requires full-sequence attention at every iteration, resulting in substantial redundant computation on masked tokens. Block-wise diffusion can reduce this cost, yet it typically relies on retraining and constrained update orders, limiting its direct applicability to pretrained DLMs. Our token-level analysis reveals pronounced structural locality in DLM inference. Decoding is driven by a small set of prefix-localized active tokens; the influence of distant undecoded context diminishes rapidly, and decoded tokens exhibit stage-wise temporal stability, enabling reuse of intermediate representations except for a brief post-decode transient. Motivated by these observations, we propose Window-Diffusion, a window-based token pruning and caching method for inference (the source code is available at this https URL). We maintain a local computation window that slides rightward as denoising progresses, and partition undecoded tokens into: (i) active tokens that are computed online, (ii) buffer tokens whose KV states are cached and periodically refreshed, and (iii) far-field tokens that are pruned outside the window. Computation is restricted to active and buffer tokens within the window, while far-field tokens are omitted at each stage. Experiments on LLaDA and Dream show that, under matched compute budgets, our method achieves up to 99× inference speedup while largely preserving generation performance.
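
The three-way token partition described in the abstract can be sketched in a few lines of Python. This is only an illustrative guess at the bookkeeping, not the authors' implementation: the function name, the `n_active` and `window_size` parameters, and the rule that the window starts at the leftmost undecoded position and that the leftmost in-window tokens are the active ones are all assumptions made for this example.

```python
def partition_undecoded(undecoded, n_active, window_size):
    """Split undecoded token positions into (active, buffer, far_field).

    Assumptions (hypothetical, for illustration only):
      * the window begins at the leftmost undecoded position;
      * the first n_active undecoded positions inside the window are
        computed online ("active");
      * the remaining in-window positions reuse cached KV states ("buffer");
      * everything to the right of the window is pruned ("far-field").
    """
    positions = sorted(undecoded)
    if not positions:
        return [], [], []
    window_end = positions[0] + window_size  # exclusive right edge
    in_window = [p for p in positions if p < window_end]
    active = in_window[:n_active]
    buffer = in_window[n_active:]
    far_field = [p for p in positions if p >= window_end]
    return active, buffer, far_field


# Toy example: positions 0-3 and 6 are already decoded.
active, buffer, far_field = partition_undecoded(
    undecoded=[4, 5, 7, 8, 9, 12, 15], n_active=2, window_size=6
)
```

As tokens are decoded, the leftmost undecoded position moves right, so calling the function again at the next denoising step slides the window rightward in this simplified model.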

[LG-33] Less is More: Benchmarking LLM Based Recommendation Agents

Link: https://arxiv.org/abs/2601.20316
Authors: Kargi Chauhan,Mahalakshmi Venkateswarlu
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*Comments:

Abstract:Large Language Models (LLMs) are increasingly deployed for personalized product recommendations, with practitioners commonly assuming that longer user purchase histories lead to better predictions. We challenge this assumption through a systematic benchmark of four state-of-the-art LLMs (GPT-4o-mini, DeepSeek-V3, Qwen2.5-72B, and Gemini 2.5 Flash) across context lengths ranging from 5 to 50 items using the REGEN dataset. Surprisingly, our experiments with 50 users in a within-subject design reveal no significant quality improvement with increased context length. Quality scores remain flat across all conditions (0.17–0.23). Our findings have significant practical implications: practitioners can reduce inference costs by approximately 88% by using short contexts (5–10 items) instead of longer histories (50 items), without sacrificing recommendation quality. We also analyze latency patterns across providers and find model-specific behaviors that inform deployment decisions. This work challenges the existing “more context is better” paradigm and provides actionable guidelines for cost-effective LLM-based recommendation systems.

[LG-34] Delayed Feedback Modeling for Post-Click Gross Merchandise Volume Prediction: Benchmark Insights and Approaches WWW

Link: https://arxiv.org/abs/2601.20307
Authors: Xinyu Li,Sishuo Chen,Guipeng Xv,Li Zhang,Mingxuan Luo,Zhangming Chan,Xiang-Rong Sheng,Han Zhu,Jian Xu,Chen Lin
Categories: Machine Learning (cs.LG)
*Comments: This paper has been accepted by the ACM Web Conference (WWW) 2026. This is the camera-ready version. Please refer to the published version for citation once available

Abstract:The prediction objectives of online advertisement ranking models are evolving from probabilistic metrics like conversion rate (CVR) to numerical business metrics like post-click gross merchandise volume (GMV). Unlike the well-studied delayed feedback problem in CVR prediction, delayed feedback modeling for GMV prediction remains unexplored and poses greater challenges, as GMV is a continuous target, and a single click can lead to multiple purchases that cumulatively form the label. To bridge the research gap, we establish TRACE, a GMV prediction benchmark containing complete transaction sequences arising from each user click, which supports delayed feedback modeling in an online streaming manner. Our analysis and exploratory experiments on TRACE reveal two key insights: (1) the rapid evolution of the GMV label distribution necessitates modeling delayed feedback under online streaming training; (2) the label distribution of repurchase samples substantially differs from that of single-purchase samples, highlighting the need for separate modeling. Motivated by these findings, we propose RepurchasE-Aware Dual-branch prEdictoR (READER), a novel GMV modeling paradigm that selectively activates expert parameters according to repurchase predictions produced by a router. Moreover, READER dynamically calibrates the regression target to mitigate under-estimation caused by incomplete labels. Experimental results show that READER yields superior performance on TRACE over baselines, achieving a 2.19% improvement in terms of accuracy. We believe that our study will open up a new avenue for studying online delayed feedback modeling for GMV prediction, and our TRACE benchmark with the gathered insights will facilitate future research and application in this promising direction. Our code and dataset are available at this https URL.

[LG-35] A Learning-based Framework for Spatial Impulse Response Compensation in 3D Photoacoustic Computed Tomography

Link: https://arxiv.org/abs/2601.20291
Authors: Kaiyi Yang,Seonyeong Park,Gangwon Jeong,Hsuan-Kai Huang,Alexander A. Oraevsky,Umberto Villa,Mark A. Anastasio
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP); Medical Physics (physics.med-ph)
*Comments: Submitted to IEEE TMI

Abstract:Photoacoustic computed tomography (PACT) is a promising imaging modality that combines the advantages of optical contrast with ultrasound detection. Utilizing ultrasound transducers with larger surface areas can improve detection sensitivity. However, when computationally efficient analytic reconstruction methods that neglect the spatial impulse responses (SIRs) of the transducer are employed, the spatial resolution of the reconstructed images will be compromised. Although optimization-based reconstruction methods can explicitly account for SIR effects, their computational cost is generally high, particularly in three-dimensional (3D) applications. To address the need for accurate but rapid 3D PACT image reconstruction, this study presents a framework for establishing a learned SIR compensation method that operates in the data domain. The learned compensation method maps SIR-corrupted PACT measurement data to compensated data that would have been recorded by idealized point-like transducers. Subsequently, the compensated data can be used with a computationally efficient reconstruction method that neglects SIR effects. Two variants of the learned compensation model are investigated that employ a U-Net model and a specifically designed, physics-inspired model, referred to as Deconv-Net. A fast and analytical training data generation procedure is also a component of the presented framework. The framework is rigorously validated in virtual imaging studies, demonstrating resolution improvement and robustness to noise variations, object complexity, and sound speed heterogeneity. When applied to in-vivo breast imaging data, the learned compensation models revealed fine structures that had been obscured by SIR-induced artifacts. To our knowledge, this is the first demonstration of learned SIR compensation in 3D PACT imaging.

[LG-36] Memory Retrieval in Transformers: Insights from The Encoding Specificity Principle

Link: https://arxiv.org/abs/2601.20282
Authors: Viet Hung Dinh,Ming Ding,Youyang Qu,Kanchana Thilakarathna
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:While explainable artificial intelligence (XAI) for large language models (LLMs) remains an evolving field with many unresolved questions, increasing regulatory pressures have spurred interest in its role in ensuring transparency, accountability, and privacy-preserving machine unlearning. Although recent advances in XAI have provided some insights, the specific role of attention layers in Transformer-based LLMs remains underexplored. This study investigates the memory mechanisms instantiated by attention layers, drawing on prior research in psychology and computational psycholinguistics that links Transformer attention to cue-based retrieval in human memory. In this view, queries encode the retrieval context, keys index candidate memory traces, attention weights quantify cue-trace similarity, and values carry the encoded content, jointly enabling the construction of a context representation that precedes and facilitates memory retrieval. Guided by the Encoding Specificity Principle, we hypothesize that the cues used in the initial stage of retrieval are instantiated as keywords. We provide converging evidence for this keywords-as-cues hypothesis. In addition, we isolate neurons within attention layers whose activations selectively encode and facilitate the retrieval of context-defining keywords. Consequently, these keywords can be extracted from the identified neurons and further contribute to downstream applications such as unlearning.

[LG-37] C2: Cross learning module enhanced decision transformer with Constraint-aware loss for auto-bidding

Link: https://arxiv.org/abs/2601.20257
Authors: Jinren Ding,Xuejian Xu,Shen Jiang,Zhitong Hao,Jinhui Yang,Peng Jiang
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*Comments:

Abstract:Decision Transformer (DT) shows promise for generative auto-bidding by capturing temporal dependencies, but suffers from two critical limitations: insufficient cross-correlation modeling among state, action, and return-to-go (RTG) sequences, and indiscriminate learning of optimal/suboptimal behaviors. To address these, we propose C2, a novel framework enhancing DT with two core innovations: (1) a Cross Learning Block (CLB) via cross-attention to strengthen inter-sequence correlation modeling; (2) a Constraint-aware Loss (CL) incorporating budget and Cost-Per-Acquisition (CPA) constraints for selective learning of optimal trajectories. Extensive offline evaluations on the AuctionNet dataset demonstrate consistent performance gains (up to 3.23% over state-of-the-art GAVE) across diverse budget settings; ablation studies verify the complementary synergy of CLB and CL, confirming C2’s superiority in auto-bidding. The code for reproducing our results is available at: this https URL.

[LG-38] Proactive SFC Provisioning with Forecast-Driven DRL in Data Centers

Link: https://arxiv.org/abs/2601.20229
Authors: Parisa Fard Moshiri,Poonam Lohan,Burak Kantarci,Emil Janulewicz
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*Comments: 6 pages, 3 figures, Accepted to IEEE International Conference on Communications (ICC) 2026

Abstract:Service Function Chaining (SFC) requires efficient placement of Virtual Network Functions (VNFs) to satisfy diverse service requirements while maintaining high resource utilization in Data Centers (DCs). Conventional static resource allocation often leads to overprovisioning or underprovisioning due to the dynamic nature of traffic loads and application demands. To address this challenge, we propose a hybrid forecast-driven Deep Reinforcement Learning (DRL) framework that combines predictive intelligence with SFC provisioning. Specifically, we leverage DRL to generate datasets capturing DC resource utilization and service demands, which are then used to train deep learning forecasting models. Using Optuna-based hyperparameter optimization, the best-performing models, Spatio-Temporal Graph Neural Network, Temporal Graph Neural Network, and Long Short-Term Memory, are combined into an ensemble to enhance stability and accuracy. The ensemble predictions are integrated into the DC selection process, enabling proactive placement decisions that consider both current and future resource availability. Experimental results demonstrate that the proposed method not only sustains high acceptance ratios for resource-intensive services such as Cloud Gaming and VoIP but also significantly improves acceptance ratios for latency-critical categories: Augmented Reality increases from 30% to 50%, while Industry 4.0 improves from 30% to 45%. Consequently, the prediction-based model achieves significantly lower E2E latencies, with reductions of 20.5%, 23.8%, and 34.8% for VoIP, Video Streaming, and Cloud Gaming, respectively. This strategy ensures more balanced resource allocation and reduces contention.

[LG-39] Parametric and Generative Forecasts of Day-Ahead Market Curves for Storage Optimization

Link: https://arxiv.org/abs/2601.20226
Authors: Julian Gutierrez,Redouane Silvente
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
*Comments: 46 pages, 41 figures

Abstract:We present two machine learning frameworks for forecasting aggregated curves and optimizing storage in the EPEX SPOT day-ahead market. First, a fast parametric model forecasts hourly demand and supply curves in a low-dimensional and grid-robust representation, with minimum and maximum volumes combined with a Chebyshev polynomial for the elastic segment. The model enables daily use with low error and clear interpretability. Second, for a more comprehensive analysis, though less suited to daily operation, we employ generative models that learn the joint distribution of 24-hour order-level submissions given weather and fuel variables. These models generate synthetic daily scenarios of individual buy and sell orders, which, once aggregated, yield hourly supply and demand curves. Based on these forecasts, we optimize a price-making storage strategy, quantify revenue distributions, and highlight the price-compression effect with lower peaks, higher off-peak levels, and diminishing returns as capacity expands.

[LG-40] An Accounting Identity for Algorithmic Fairness

Link: https://arxiv.org/abs/2601.20217
Authors: Hadi Elzayn,Jacob Goldin
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:We derive an accounting identity for predictive models that links accuracy with common fairness criteria. The identity shows that for globally calibrated models, the weighted sum of miscalibration within groups and error imbalance across groups is equal to a “total unfairness budget.” For binary outcomes, this budget is the model’s mean-squared error times the difference in group prevalence across outcome classes. The identity nests standard impossibility results as special cases, while also describing inherent tradeoffs when one or more fairness measures are not perfectly satisfied. The results suggest that accuracy and fairness are best viewed as complements in binary prediction tasks: increasing accuracy necessarily shrinks the total unfairness budget and vice versa. Experiments on benchmark data confirm the theory and show that many fairness interventions largely substitute between fairness violations, and when they reduce accuracy they tend to expand the total unfairness budget. The results extend naturally to prediction tasks with non-binary outcomes, illustrating how additional outcome information can relax fairness incompatibilities and identifying conditions under which the binary-style impossibility does and does not extend to regression tasks.

[LG-41] Hyperparameter Transfer with Mixture-of-Expert Layers

Link: https://arxiv.org/abs/2601.20205
Authors: Tianze Jiang,Blake Bordelon,Cengiz Pehlevan,Boris Hanin
Categories: Machine Learning (cs.LG)
*Comments: 25 Pages

Abstract:Mixture-of-Experts (MoE) layers have emerged as an important tool in scaling up modern neural networks by decoupling total trainable parameters from activated parameters in the forward pass for each token. However, sparse MoEs add complexity to training due to (i) new trainable parameters (router weights) that, like all other parameter groups, require hyperparameter (HP) tuning; (ii) new architecture scale dimensions (number of and size of experts) that must be chosen and potentially taken large. To make HP selection cheap and reliable, we propose a new parameterization for transformer models with MoE layers when scaling model width, depth, number of experts, and expert (hidden) size. Our parameterization is justified by a novel dynamical mean-field theory (DMFT) analysis. When varying different model dimensions trained at a fixed token budget, we find empirically that our parameterization enables reliable HP transfer across models from 51M to over 2B total parameters. We further take HPs identified from sweeping small models on a short token horizon to train larger models on longer horizons and report performant model behaviors.

[LG-42] Minimum-Cost Network Flow with Dual Predictions AAAI2026

Link: https://arxiv.org/abs/2601.20203
Authors: Zhiyang Chen,Hailong Yao,Xia Yin
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*Comments: accepted by AAAI 2026

Abstract:Recent work has shown that machine-learned predictions can provably improve the performance of classic algorithms. In this work, we propose the first minimum-cost network flow algorithm augmented with a dual prediction. Our method is based on a classic minimum-cost flow algorithm, namely ε-relaxation. We provide time complexity bounds in terms of the infinity norm prediction error, which is both consistent and robust. We also prove sample complexity bounds for PAC-learning the prediction. We empirically validate our theoretical results on two applications of minimum-cost flow, i.e., traffic networks and chip escape routing, in which we learn a fixed prediction, and a feature-based neural network model to infer the prediction, respectively. Experimental results illustrate 12.74× and 1.64× average speedup on two applications.

[LG-43] DeRaDiff: Denoising Time Realignment of Diffusion Models

Link: https://arxiv.org/abs/2601.20198
Authors: Ratnavibusena Don Shahain Manujith,Yang Zhang,Teoh Tze Tzun,Kenji Kawaguchi
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Recent advances align diffusion models with human preferences to increase aesthetic appeal and mitigate artifacts and biases. Such methods aim to maximize a conditional output distribution aligned with higher rewards whilst not drifting far from a pretrained prior. This is commonly enforced by Kullback-Leibler (KL) regularization. As such, a central issue still remains: how does one choose the right regularization strength? Too high a strength leads to limited alignment, and too low a strength leads to “reward hacking”. This renders the task of choosing the correct regularization strength highly non-trivial. Existing approaches sweep over this hyperparameter by aligning a pretrained model at multiple regularization strengths and then choosing the best strength. Unfortunately, this is prohibitively expensive. We introduce DeRaDiff, a denoising-time realignment procedure that, after aligning a pretrained model once, modulates the regularization strength during sampling to emulate models trained at other regularization strengths without any additional training or finetuning. Extending decoding-time realignment from language to diffusion models, DeRaDiff operates over iterative predictions of continuous latents by replacing the reverse-step reference distribution with a geometric mixture of an aligned and a reference posterior, thus giving rise to a closed-form update under common schedulers and a single tunable parameter, λ, for on-the-fly control. Our experiments show that across multiple text-image alignment and image-quality metrics, our method consistently provides a strong approximation for models aligned entirely from scratch at different regularization strengths. Thus, our method yields an efficient way to search for the optimal strength, eliminating the need for expensive alignment sweeps and thereby substantially reducing computational costs.
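
To see why a geometric mixture of two Gaussian reverse-step posteriors admits a closed form, the Python sketch below computes the normalized density proportional to p_aligned^lam * p_ref^(1 - lam) for scalar Gaussians. This is the standard product-of-Gaussians identity, not the paper's exact update rule: the function name, the scalar setting, and the way `lam` weights the two posteriors are assumptions made for illustration.

```python
def geometric_mixture_gaussian(mu_aligned, var_aligned, mu_ref, var_ref, lam):
    """Mean and variance of N(mu_aligned, var_aligned)^lam * N(mu_ref, var_ref)^(1 - lam).

    Raising a Gaussian to a power scales its precision, and multiplying
    Gaussians adds precisions, so the result is again Gaussian.
    """
    precision = lam / var_aligned + (1.0 - lam) / var_ref
    if precision <= 0.0:
        # Can happen for lam outside [0, 1] with unequal variances.
        raise ValueError("lam gives a mixture that cannot be normalized")
    var = 1.0 / precision
    mu = var * (lam * mu_aligned / var_aligned + (1.0 - lam) * mu_ref / var_ref)
    return mu, var


# With equal variances the mixture mean is a plain linear interpolation.
mu, var = geometric_mixture_gaussian(
    mu_aligned=1.0, var_aligned=0.5, mu_ref=3.0, var_ref=0.5, lam=0.25
)
```

When both posteriors share the same variance, as is the case for samplers with a fixed per-step variance, the variance is unchanged and the mean reduces to `lam * mu_aligned + (1 - lam) * mu_ref`, so a single scalar interpolates between the reference and the aligned model at each denoising step.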

[LG-44] On the Computational Complexity of Performative Prediction

Link: https://arxiv.org/abs/2601.20180
Authors: Ioannis Anagnostides,Rohan Chauhan,Ioannis Panageas,Tuomas Sandholm,Jingming Yan
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Performative prediction captures the phenomenon where deploying a predictive model shifts the underlying data distribution. While simple retraining dynamics are known to converge linearly when the performative effects are weak (ρ < 1), the complexity in the regime ρ > 1 was hitherto open. In this paper, we establish a sharp phase transition: computing an ε-performatively stable point is PPAD-complete – and thus polynomial-time equivalent to Nash equilibria in general-sum games – even when ρ = 1 + O(ε). This intractability persists even in the ostensibly simple setting with a quadratic loss function and linear distribution shifts. One of our key technical contributions is to extend this PPAD-hardness result to general convex domains, which is of broader interest in the complexity of variational inequalities. Finally, we address the special case of strategic classification, showing that computing a strategic local optimum is PLS-hard.

[LG-45] MAPLE: Self-supervised Learning-Enhanced Nonlinear Dimensionality Reduction for Visual Analysis

Link: https://arxiv.org/abs/2601.20173
Authors: Zeyang Huang,Takanori Fujiwara,Angelos Chatzimparmpas,Wandrille Duchemin,Andreas Kerren
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*Comments:

Abstract:We present a new nonlinear dimensionality reduction method, MAPLE, that enhances UMAP by improving manifold modeling. MAPLE employs a self-supervised learning approach to more efficiently encode low-dimensional manifold geometry. Central to this approach are maximum manifold capacity representations (MMCRs), which help untangle complex manifolds by compressing variances among locally similar data points while amplifying variance among dissimilar data points. This design is particularly effective for high-dimensional data with substantial intra-cluster variance and curved manifold structures, such as biological or image data. Our qualitative and quantitative evaluations demonstrate that MAPLE can produce clearer visual cluster separations and finer subcluster resolution than UMAP while maintaining comparable computational cost.

[LG-46] Loss Landscape Geometry and the Learning of Symmetries: Or What Influence Functions Reveal About Robust Generalization

Link: https://arxiv.org/abs/2601.20172
Authors: James Amarel,Robyn Miller,Nicolas Hengartner,Benjamin Migliori,Emily Casleton,Alexei Skurikhin,Earl Lawrence,Gerd J. Kunde
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*Comments:

Abstract:We study how neural emulators of partial differential equation solution operators internalize physical symmetries by introducing an influence-based diagnostic that measures the propagation of parameter updates between symmetry-related states, defined as the metric-weighted overlap of loss gradients evaluated along group orbits. This quantity probes the local geometry of the learned loss landscape and goes beyond forward-pass equivariance tests by directly assessing whether learning dynamics couple physically equivalent configurations. Applying our diagnostic to autoregressive fluid flow emulators, we show that orbit-wise gradient coherence provides the mechanism for learning to generalize over symmetry transformations and indicates when training selects a symmetry compatible basin. The result is a novel technique for evaluating if surrogate models have internalized symmetry properties of the known solution operator.

[LG-47] Local Duality for Sparse Support Vector Machines

Link: https://arxiv.org/abs/2601.20170
Authors: Penghe Zhang, Naihua Xiu, Houduo Qi
Subjects: Machine Learning (cs.LG)

Abstract:Due to the rise of cardinality minimization in optimization, sparse support vector machines (SSVMs) have attracted much attention lately and show certain empirical advantages over convex SVMs. A common way to derive an SSVM is to add a cardinality function such as the $\ell_0$-norm to the dual problem of a convex SVM. However, this process lacks theoretical justification. This paper fills the gap by developing a local duality theory for such an SSVM formulation and exploring its relationship with the hinge-loss SVM (hSVM) and the ramp-loss SVM (rSVM). In particular, we prove that the derived SSVM is exactly the dual problem of the 0/1-loss SVM, and the linear representer theorem holds for their local solutions. The local solution of SSVM also provides guidelines on selecting hyperparameters of hSVM and rSVM. Under specific conditions, we show that a sequence of global solutions of hSVM converges to a local solution of 0/1-loss SVM. Moreover, a local minimizer of 0/1-loss SVM is a local minimizer of rSVM. This explains why a local solution induced by SSVM outperforms hSVM and rSVM in the prior empirical study. We further conduct numerical tests on real datasets and demonstrate potential advantages of SSVM by working with locally nice solutions proposed in this paper.
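
To make the three losses named in the abstract concrete, here is a small sketch under one common convention, which is an assumption here since the abstract does not spell it out: each loss is applied to the margin violation t = 1 - y * f(x). The function names are illustrative only.

```python
import numpy as np

def margin_violation(y, scores):
    """t = 1 - y * f(x); t > 0 means the sample is inside the margin or misclassified."""
    return 1.0 - np.asarray(y, dtype=float) * np.asarray(scores, dtype=float)

def hinge_loss(t):
    """Convex surrogate: grows linearly with the violation."""
    return np.maximum(0.0, t)

def ramp_loss(t):
    """Hinge loss clipped at 1, so outliers contribute a bounded penalty."""
    return np.minimum(1.0, np.maximum(0.0, t))

def zero_one_loss(t):
    """Counts violations: 1 if t > 0, else 0."""
    return (t > 0).astype(float)

y = np.array([1, 1, -1, -1])
scores = np.array([2.0, 0.5, 0.3, -3.0])
t = margin_violation(y, scores)

print(hinge_loss(t), ramp_loss(t), zero_one_loss(t))
```

Under this convention the ramp loss is bounded above by both the hinge loss and the 0/1 loss, which gives some intuition for why the abstract relates rSVM and 0/1-loss SVM solutions.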

[LG-48] PASS: Ambiguity Guided Subsets for Scalable Classical and Quantum Constrained Clustering

Link: https://arxiv.org/abs/2601.20157
Authors: Pedro Chumpitaz-Flores, My Duong, Ying Mao, Kaixun Hua
Subjects: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
Comments: 25 pages, 8 figures, preprint

Abstract:Pairwise-constrained clustering augments unsupervised partitioning with side information by enforcing must-link (ML) and cannot-link (CL) constraints between specific samples, yielding labelings that respect known affinities and separations. However, ML and CL constraints add an extra layer of complexity to the clustering problem, with current methods struggling in data scalability, especially in niche applications like quantum or quantum-hybrid clustering. We propose PASS, a pairwise-constraints and ambiguity-driven subset selection framework that preserves ML and CL constraint satisfaction while allowing scalable, high-quality clustering solutions. PASS collapses ML constraints into pseudo-points and offers two selectors: a constraint-aware margin rule that collects near-boundary points and all detected CL violations, and an information-geometric rule that scores points via a Fisher-Rao distance derived from soft assignment posteriors, then selects the highest-information subset under a simple budget. Across diverse benchmarks, PASS attains competitive SSE at substantially lower cost than exact or penalty-based methods, and remains effective in regimes where prior approaches fail.

[LG-49] Spectral Ghost in Representation Learning: from Component Analysis to Self-Supervised Learning

Link: https://arxiv.org/abs/2601.20154
Authors: Bo Dai, Na Li, Dale Schuurmans
Subjects: Machine Learning (cs.LG)
Comments: 43 pages, 3 figures

Abstract:Self-supervised learning (SSL) has improved empirical performance by unleashing the power of unlabeled data for practical applications. Specifically, SSL extracts representations from massive unlabeled data, which are then transferred to many downstream tasks with limited data. The significant improvement on diverse applications of representation learning has attracted increasing attention, resulting in a variety of dramatically different self-supervised learning objectives for representation extraction, with an assortment of learning procedures, but without a clear and unified understanding. Such an absence hampers the ongoing development of representation learning, leaving a theoretical understanding missing, principles for efficient algorithm design unclear, and the use of representation learning methods in practice unjustified. The urgency for a unified framework is further motivated by the rapid growth in representation learning methods. In this paper, we are therefore compelled to develop a principled foundation of representation learning. We first theoretically investigate the sufficiency of the representation from a spectral representation view, which reveals the spectral essence of the existing successful SSL algorithms and paves the path to a unified framework for understanding and analysis. Such a framework also inspires the development of more efficient and easy-to-use representation learning algorithms in a principled way for real-world applications.

[LG-50] LogSieve: Task-Aware CI Log Reduction for Sustainable LLM-Based Analysis ICSE2026

Link: https://arxiv.org/abs/2601.20148
Authors: Marcus Emmanuel Barnes, Taher A. Ghaleb, Safwat Hassan
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: Preprint. Accepted for presentation at Mining Software Repositories (MSR’26), co-located ICSE 2026. The final version will appear in the ACM Digital Library as part of the MSR’26 conference proceedings

Abstract:Logs are essential for understanding Continuous Integration (CI) behavior, particularly for diagnosing build failures and performance regressions. Yet their growing volume and verbosity make both manual inspection and automated analysis increasingly costly, time-consuming, and environmentally costly. While prior work has explored log compression, anomaly detection, and LLM-based log analysis, most efforts target structured system logs rather than the unstructured, noisy, and verbose logs typical of CI workflows. We present LogSieve, a lightweight, RCA-aware and semantics-preserving log reduction technique that filters low-information lines while retaining content relevant to downstream reasoning. Evaluated on CI logs from 20 open-source Android projects using GitHub Actions, LogSieve achieves an average 42% reduction in lines and 40% reduction in tokens with minimal semantic loss. This pre-inference reduction lowers computational cost and can proportionally reduce energy use (and associated emissions) by decreasing the volume of data processed during LLM inference. Compared with structure-first baselines (LogZip and random-line removal), LogSieve preserves much higher semantic and categorical fidelity (Cosine = 0.93, GPTScore = 0.93, 80% exact-match accuracy). Embedding-based classifiers automate relevance detection with near-human accuracy (97%), enabling scalable and sustainable integration of semantics-aware filtering into CI workflows. LogSieve thus bridges log management and LLM reasoning, offering a practical path toward greener and more interpretable CI automation.

[LG-51] Scaling Next-Brain-Token Prediction for MEG

Link: https://arxiv.org/abs/2601.20138
Authors: Richard Csaky
Subjects: Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:We present a large autoregressive model for source-space MEG that scales next-token prediction to long context across datasets and scanners: handling a corpus of over 500 hours and thousands of sessions across the three largest MEG datasets. A modified SEANet-style vector-quantizer reduces multichannel MEG into a flattened token stream on which we train a Qwen2.5-VL backbone from scratch to predict the next brain token and to recursively generate minutes of MEG from up to a minute of context. To evaluate long-horizon generation, we introduce three task-matched tests: (i) on-manifold stability via generated-only drift compared to the time-resolved distribution of real sliding windows, and (ii) conditional specificity via correct context versus prompt-swap controls using a neurophysiologically grounded metric set. We train on CamCAN and Omega and run all analyses on held-out MOUS, establishing cross-dataset generalization. Across metrics, generations remain relatively stable over long rollouts and are closer to the correct continuation than swapped controls. Code available at: this https URL.

[LG-52] Going NUTS with ADVI: Exploring various Bayesian Inference techniques with Facebook Prophet

Link: https://arxiv.org/abs/2601.20120
Authors: Jovan Krajevski, Biljana Tojtovska Ribarski
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 6 pages, 5 figures, Published in Proceedings of the 22nd International Conference for Informatics and Information Technologies - CiiT 2025

Abstract:Since its introduction, Facebook Prophet has attracted positive attention from both classical statisticians and the Bayesian statistics community. The model provides two built-in inference methods: maximum a posteriori estimation using the L-BFGS-B algorithm, and Markov Chain Monte Carlo (MCMC) sampling via the No-U-Turn Sampler (NUTS). While exploring various time-series forecasting problems using Bayesian inference with Prophet, we encountered limitations stemming from the inability to apply alternative inference techniques beyond those provided by default. Additionally, the fluent API design of Facebook Prophet proved insufficiently flexible for implementing our custom modeling ideas. To address these shortcomings, we developed a complete reimplementation of the Prophet model in PyMC, which enables us to extend the base model and evaluate and compare multiple Bayesian inference methods. In this paper, we present our PyMC-based implementation and analyze in detail the implementation of different Bayesian inference techniques. We consider full MCMC techniques, MAP estimation, and variational inference techniques on a time-series forecasting problem. We discuss in detail the sampling approach, convergence diagnostics, forecasting metrics, and their computational efficiency, and we identify possible issues that will be addressed in our future work.

[LG-53] A Reinforcement Learning Based Universal Sequence Design for Polar Codes ICML2026

Link: https://arxiv.org/abs/2601.20118
Authors: David Kin Wai Ho, Arman Fazeli, Mohamad M. Mansour, Louay M. A. Jalloul
Subjects: Machine Learning (cs.LG)
Comments: 8 pages, 4 figures, ICML2026

Abstract:To advance Polar code design for 6G applications, we develop a reinforcement learning-based universal sequence design framework that is extensible and adaptable to diverse channel conditions and decoding strategies. Crucially, our method scales to code lengths up to 2048, making it suitable for use in standardization. Across all (N,K) configurations supported in 5G, our approach achieves competitive performance relative to the NR sequence adopted in 5G and yields up to a 0.2 dB gain over the beta-expansion baseline at N=2048. We further highlight the key elements that enabled learning at scale: (i) incorporation of physical-law-constrained learning grounded in the universal partial order property of Polar codes, (ii) exploitation of the weak long-term influence of decisions to limit lookahead evaluation, and (iii) joint multi-configuration optimization to increase learning efficiency.
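
For readers unfamiliar with the beta-expansion baseline mentioned above, the sketch below shows how a polarization-weight reliability sequence is typically built. Each synthetic channel index is scored by the sum of beta^j over its set bits. The value beta = 2^(1/4) is the commonly cited choice and is an assumption here, since the abstract does not state which beta the baseline uses.

```python
def polarization_weight(index, n_bits, beta=2 ** 0.25):
    """Sum of beta**j over the set bits j of the channel index."""
    return sum(((index >> j) & 1) * beta ** j for j in range(n_bits))

def beta_expansion_sequence(n_bits):
    """Channel indices of a length-2**n_bits polar code, least reliable first."""
    return sorted(range(1 << n_bits), key=lambda i: polarization_weight(i, n_bits))

def information_set(n_bits, k):
    """Indices of the k most reliable channels, used to carry information bits."""
    return sorted(beta_expansion_sequence(n_bits)[-k:])

print(beta_expansion_sequence(3))   # reliability order for N = 8
print(information_set(3, 4))        # information positions for (N, K) = (8, 4)
```

Because the weight only grows when a bit is added or a set bit is moved to a higher position, any sequence built this way respects the universal partial order that the abstract uses as a learning constraint.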

[LG-54] In-Context Reinforcement Learning From Suboptimal Historical Data ICML2025

Link: https://arxiv.org/abs/2601.20116
Authors: Juncheng Dong, Moyang Guo, Ethan X. Fang, Zhuoran Yang, Vahid Tarokh
Subjects: Machine Learning (cs.LG)
Comments: Accepted to Forty-Second International Conference on Machine Learning (ICML2025)

Abstract:Transformer models have achieved remarkable empirical successes, largely due to their in-context learning capabilities. Inspired by this, we explore training an autoregressive transformer for in-context reinforcement learning (ICRL). In this setting, we initially train a transformer on an offline dataset consisting of trajectories collected from various RL tasks, and then fix and use this transformer to create an action policy for new RL tasks. Notably, we consider the setting where the offline dataset contains trajectories sampled from suboptimal behavioral policies. In this case, standard autoregressive training corresponds to imitation learning and results in suboptimal performance. To address this, we propose the Decision Importance Transformer (DIT) framework, which emulates the actor-critic algorithm in an in-context manner. In particular, we first train a transformer-based value function that estimates the advantage functions of the behavior policies that collected the suboptimal trajectories. Then we train a transformer-based policy via a weighted maximum likelihood estimation loss, where the weights are constructed based on the trained value function to steer the suboptimal policies to the optimal ones. We conduct extensive experiments to test the performance of DIT on both bandit and Markov Decision Process problems. Our results show that DIT achieves superior performance, particularly when the offline dataset contains suboptimal historical data.

[LG-55] Quantization-Aware Distillation for NVFP4 Inference Accuracy Recovery

Link: https://arxiv.org/abs/2601.20088
Authors: Meng Xin, Sweta Priyadarshi, Jingyu Xin, Bilal Kartal, Aditya Vavre, Asma Kuriparambil Thekkumpate, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Ido Shahaf, Akhiad Bercovich, Kinjal Patel, Suguna Varshini Velury, Chenjie Luo, Zhiyu Cheng, Jenny Chen, Chen-Han Yu, Wei Ping, Oleg Rybakov, Nima Tajbakhsh, Oluwatobi Olabiyi, Dusan Stosic, Di Wu, Song Han, Eric Chung, Sharath Turuvekere Sreenivas, Bryan Catanzaro, Yoshi Suhara, Tijmen Blankevoort, Huizi Mao
Subjects: Machine Learning (cs.LG)

Abstract:This technical report presents quantization-aware distillation (QAD) and our best practices for recovering accuracy of NVFP4-quantized large language models (LLMs) and vision-language models (VLMs). QAD distills a full-precision teacher model into a quantized student model using a KL divergence loss. While applying distillation to quantized models is not a new idea, we observe key advantages of QAD for today’s LLMs: 1. It shows remarkable effectiveness and stability for models trained through multi-stage post-training pipelines, including supervised fine-tuning (SFT), reinforcement learning (RL), and model merging, where traditional quantization-aware training (QAT) suffers from engineering complexity and training instability; 2. It is robust to data quality and coverage, enabling accuracy recovery without full training data. We evaluate QAD across multiple post-trained models including AceReason Nemotron, Nemotron 3 Nano, Nemotron Nano V2, Nemotron Nano V2 VL (VLM), and Llama Nemotron Super v1, showing consistent recovery to near-BF16 accuracy.
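
The abstract says only that QAD uses "a KL divergence loss" between a full-precision teacher and a quantized student. A minimal NumPy sketch of such a loss over next-token logits might look as follows. The direction KL(teacher || student), the averaging over positions, and the function names are assumptions for illustration, not details from the report.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_distillation_loss(teacher_logits, student_logits):
    """Mean over positions of KL(teacher || student) between token distributions."""
    t_logp = log_softmax(np.asarray(teacher_logits, dtype=float))
    s_logp = log_softmax(np.asarray(student_logits, dtype=float))
    kl_per_position = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1)
    return float(kl_per_position.mean())

# Two positions, vocabulary of three tokens (toy numbers).
teacher = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])
student = np.array([[1.5, 0.7, -0.8], [0.1, 0.2, 0.3]])

print(kl_distillation_loss(teacher, student))
```

In an actual training loop the student logits would come from the quantized forward pass and the loss would be backpropagated through it. This sketch covers only the loss computation.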

[LG-56] Techno-economic optimization of a heat-pipe microreactor part II: multi-objective optimization analysis

Link: https://arxiv.org/abs/2601.20079
Authors: Paul Seurin, Dean Price
Subjects: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

Abstract:Heat-pipe microreactors (HPMRs) are compact and transportable nuclear power systems exhibiting inherent safety, well-suited for deployment in remote regions where access is limited and reliance on costly fossil fuels is prevalent. In prior work, we developed a design optimization framework that incorporates techno-economic considerations through surrogate modeling and reinforcement learning (RL)-based optimization, focusing solely on minimizing the levelized cost of electricity (LCOE) by using a bottom-up cost estimation approach. In this study, we extend that framework to a multi-objective optimization that uses the Pareto Envelope Augmented with Reinforcement Learning (PEARL) algorithm. The objectives include minimizing both the rod-integrated peaking factor ($F_{\Delta h}$) and LCOE, subject to safety and operational constraints. We evaluate three cost scenarios: (1) high-cost axial and drum reflectors, (2) a low-cost axial reflector, and (3) low-cost axial and drum reflectors. Our findings indicate that reducing the solid moderator radius, pin pitch, and drum coating angle, all while increasing the fuel height, effectively lowers $F_{\Delta h}$. Across all three scenarios, four key strategies consistently emerged for optimizing LCOE: (1) minimizing the axial reflector contribution when costly, (2) reducing control drum reliance, (3) substituting expensive tri-structural isotropic (TRISO) fuel with axial reflector material priced at the level of graphite, and (4) maximizing fuel burnup. While PEARL demonstrates promise in navigating trade-offs across diverse design scenarios, discrepancies between surrogate model predictions and full-order simulations remain. Further improvements are anticipated through constraint relaxation and surrogate development, constituting an ongoing area of investigation.

[LG-57] Distributional value gradients for stochastic environments

Link: https://arxiv.org/abs/2601.20071
Authors: Baptiste Debes, Tinne Tuytelaars
Subjects: Machine Learning (cs.LG)

Abstract:Gradient-regularized value learning methods improve sample efficiency by leveraging learned models of transition dynamics and rewards to estimate return gradients. However, existing approaches, such as MAGE, struggle in stochastic or noisy environments, limiting their applicability. In this work, we address these limitations by extending distributional reinforcement learning on continuous state-action spaces to model not only the distribution over scalar state-action value functions but also over their gradients. We refer to this approach as Distributional Sobolev Training. Inspired by Stochastic Value Gradients (SVG), our method utilizes a one-step world model of reward and transition distributions implemented via a conditional Variational Autoencoder (cVAE). The proposed framework is sample-based and employs Max-sliced Maximum Mean Discrepancy (MSMMD) to instantiate the distributional Bellman operator. We prove that the Sobolev-augmented Bellman operator is a contraction with a unique fixed point, and highlight a fundamental smoothness trade-off underlying contraction in gradient-aware RL. To validate our method, we first showcase its effectiveness on a simple stochastic reinforcement learning toy problem, then benchmark its performance on several MuJoCo environments.

[LG-58] Domain Expansion: A Latent Space Construction Framework for Multi-Task Learning ICLR2026

Link: https://arxiv.org/abs/2601.20069
Authors: Chi-Yao Huang, Khoa Vo, Aayush Atul Verma, Duo Lu, Yezhou Yang
Subjects: Machine Learning (cs.LG)
Comments: Accepted to ICLR 2026

Abstract:Training a single network with multiple objectives often leads to conflicting gradients that degrade shared representations, forcing them into a compromised state that is suboptimal for any single task–a problem we term latent representation collapse. We introduce Domain Expansion, a framework that prevents these conflicts by restructuring the latent space itself. Our framework uses a novel orthogonal pooling mechanism to construct a latent space where each objective is assigned to a mutually orthogonal subspace. We validate our approach across diverse benchmarks–including ShapeNet, MPIIGaze, and Rotated MNIST–on challenging multi-objective problems combining classification with pose and gaze estimation. Our experiments demonstrate that this structure not only prevents collapse but also yields an explicit, interpretable, and compositional latent space where concepts can be directly manipulated.

[LG-59] Externally Validated Longitudinal GRU Model for Visit-Level 180-Day Mortality Risk in Metastatic Castration-Resistant Prostate Cancer

Link: https://arxiv.org/abs/2601.20046
Authors: Javier Mencia-Ledo, Mohammad Noaeen, Zahra Shakeri
Subjects: Machine Learning (cs.LG); Applications (stat.AP)
Comments: 7 pages, 4 figures

Abstract:Metastatic castration-resistant prostate cancer (mCRPC) is a highly aggressive disease with poor prognosis and heterogeneous treatment response. In this work, we developed and externally validated a visit-level 180-day mortality risk model using longitudinal data from two Phase III cohorts (n=526 and n=640). Only visits with observable 180-day outcomes were labeled; right-censored cases were excluded from analysis. We compared five candidate architectures: Long Short-Term Memory, Gated Recurrent Unit (GRU), Cox Proportional Hazards, Random Survival Forest (RSF), and Logistic Regression. For each dataset, we selected the smallest risk threshold that achieved an 85% sensitivity floor. The GRU and RSF models showed high discrimination capabilities initially (C-index: 87% for both). In external validation, the GRU obtained better calibration (slope: 0.93; intercept: 0.07) and achieved a PR-AUC of 0.87. Clinical impact analysis showed a median time-in-warning of 151.0 days for true positives (59.0 days for false positives) and 18.3 alerts per 100 patient-visits. Given late-stage frailty or cachexia and hemodynamic instability, permutation importance ranked BMI and systolic blood pressure as the strongest associations. These results suggest that longitudinal routine clinical markers can estimate short-horizon mortality risk in mCRPC and support proactive care planning over a multi-month window.

[LG-60] Regime-Adaptive Bayesian Optimization via Dirichlet Process Mixtures of Gaussian Processes

Link: https://arxiv.org/abs/2601.20043
Authors: Yan Zhang, Xuefeng Liu, Sipeng Chen, Sascha Ranftl, Chong Liu, Shibo Li
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Standard Bayesian Optimization (BO) assumes uniform smoothness across the search space, an assumption violated in multi-regime problems such as molecular conformation search through distinct energy basins or drug discovery across heterogeneous molecular scaffolds. A single GP either oversmooths sharp transitions or hallucinates noise in smooth regions, yielding miscalibrated uncertainty. We propose RAMBO, a Dirichlet Process Mixture of Gaussian Processes that automatically discovers latent regimes during optimization, each modeled by an independent GP with locally optimized hyperparameters. We derive collapsed Gibbs sampling that analytically marginalizes latent functions for efficient inference, and introduce adaptive concentration parameter scheduling for coarse-to-fine regime discovery. Our acquisition functions decompose uncertainty into intra-regime and inter-regime components. Experiments on synthetic benchmarks and real-world applications, including molecular conformer optimization, virtual screening for drug discovery, and fusion reactor design, demonstrate consistent improvements over state-of-the-art baselines on multi-regime objectives.

[LG-61] Decomposing multimodal embedding spaces with group-sparse autoencoders

Link: https://arxiv.org/abs/2601.20028
Authors: Chiraag Kaushik, Davis Barch, Andrea Fanelli
Subjects: Machine Learning (cs.LG)
Comments: 19 pages

Abstract:The Linear Representation Hypothesis asserts that the embeddings learned by neural networks can be understood as linear combinations of features corresponding to high-level concepts. Based on this ansatz, sparse autoencoders (SAEs) have recently become a popular method for decomposing embeddings into a sparse combination of linear directions, which have been shown empirically to often correspond to human-interpretable semantics. However, recent attempts to apply SAEs to multimodal embedding spaces (such as the popular CLIP embeddings for image/text data) have found that SAEs often learn “split dictionaries”, where most of the learned sparse features are essentially unimodal, active only for data of a single modality. In this work, we study how to effectively adapt SAEs for the setting of multimodal embeddings while ensuring multimodal alignment. We first argue that the existence of a split dictionary decomposition on an aligned embedding space implies the existence of a non-split dictionary with improved modality alignment. Then, we propose a new SAE-based approach to multimodal embedding decomposition using cross-modal random masking and group-sparse regularization. We apply our method to popular embeddings for image/text (CLIP) and audio/text (CLAP) data and show that, compared to standard SAEs, our approach learns a more multimodal dictionary while reducing the number of dead neurons and improving feature semanticity. We finally demonstrate how this improvement in alignment of concepts between modalities can enable improvements in the interpretability and control of cross-modal tasks.

[LG-62] BayPrAnoMeta: Bayesian Proto-MAML for Few-Shot Industrial Image Anomaly Detection

Link: https://arxiv.org/abs/2601.19992
Authors: Soham Sarkar, Tanmay Sen, Sayantan Banerjee
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Industrial image anomaly detection is a challenging problem owing to extreme class imbalance and the scarcity of labeled defective samples, particularly in few-shot settings. We propose BayPrAnoMeta, a Bayesian generalization of Proto-MAML for few-shot industrial image anomaly detection. Unlike existing Proto-MAML approaches that rely on deterministic class prototypes and distance-based adaptation, BayPrAnoMeta replaces prototypes with task-specific probabilistic normality models and performs inner-loop adaptation via a Bayesian posterior predictive likelihood. We model normal support embeddings with a Normal-Inverse-Wishart (NIW) prior, producing a Student-t predictive distribution that enables uncertainty-aware, heavy-tailed anomaly scoring and is essential for robustness in extreme few-shot settings. We further extend BayPrAnoMeta to a federated meta-learning framework with supervised contrastive regularization for heterogeneous industrial clients and prove convergence to stationary points of the resulting nonconvex objective. Experiments on the MVTec AD benchmark demonstrate consistent and significant AUROC improvements over MAML, Proto-MAML, and PatchCore-based methods in few-shot anomaly detection settings.
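
For reference, the textbook conjugate update behind the NIW-to-Student-t step in the abstract can be written as below. The symbols ($\mu_0, \kappa_0, \nu_0, \Psi_0$ for the prior, $d$ for the embedding dimension, $n$ for the number of normal support embeddings) are generic notation assumed here and may differ from the paper's.

```latex
% Prior: (\mu, \Sigma) \sim \mathrm{NIW}(\mu_0, \kappa_0, \nu_0, \Psi_0)
% Data:  x_1, \dots, x_n \in \mathbb{R}^d, with
%        \bar{x} = \tfrac{1}{n}\sum_i x_i, \quad
%        S = \sum_i (x_i - \bar{x})(x_i - \bar{x})^\top
\begin{aligned}
\kappa_n &= \kappa_0 + n, \qquad \nu_n = \nu_0 + n, \\
\mu_n    &= \frac{\kappa_0 \mu_0 + n \bar{x}}{\kappa_n}, \\
\Psi_n   &= \Psi_0 + S + \frac{\kappa_0 n}{\kappa_n}
            (\bar{x} - \mu_0)(\bar{x} - \mu_0)^\top, \\
x_\ast \mid x_{1:n} &\sim t_{\nu_n - d + 1}\!\left(
            \mu_n,\ \frac{\kappa_n + 1}{\kappa_n (\nu_n - d + 1)}\, \Psi_n \right).
\end{aligned}
```

With few support samples, the degrees of freedom $\nu_n - d + 1$ stay small and the predictive density keeps heavy tails, which matches the abstract's point about robustness in extreme few-shot settings.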

[LG-63] Benchmarking LLAMA Model Security Against OWASP Top 10 For LLM Applications

Link: https://arxiv.org/abs/2601.19970
Authors: Nourin Shahin, Izzat Alsmadi
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:As large language models (LLMs) move from research prototypes to enterprise systems, their security vulnerabilities pose serious risks to data privacy and system integrity. This study benchmarks various Llama model variants against the OWASP Top 10 for LLM Applications framework, evaluating threat detection accuracy, response safety, and computational overhead. Using the FABRIC testbed with NVIDIA A30 GPUs, we tested five standard Llama models and five Llama Guard variants on 100 adversarial prompts covering ten vulnerability categories. Our results reveal significant differences in security performance: the compact Llama-Guard-3-1B model achieved the highest detection rate of 76% with minimal latency (0.165s per test), whereas base models such as Llama-3.1-8B failed to detect threats (0% accuracy) despite longer inference times (0.754s). We observe an inverse relationship between model size and security effectiveness, suggesting that smaller, specialized models often outperform larger general-purpose ones in security tasks. Additionally, we provide an open-source benchmark dataset including adversarial prompts, threat labels, and attack metadata to support reproducible research in AI security, [1].

[LG-64] E2HiL: Entropy-Guided Sample Selection for Efficient Real-World Human-in-the-Loop Reinforcement Learning

Link: https://arxiv.org/abs/2601.19969
Authors: Haoyuan Deng, Yuanjiang Xue, Haoyang Du, Boyang Zhou, Zhenyu Wu, Ziwei Wang
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:Human-in-the-loop guidance has emerged as an effective approach for enabling faster convergence in online reinforcement learning (RL) of complex real-world manipulation tasks. However, existing human-in-the-loop RL (HiL-RL) frameworks often suffer from low sample efficiency, requiring substantial human interventions to achieve convergence and thereby leading to high labor costs. To address this, we propose a sample-efficient real-world human-in-the-loop RL framework named E2HiL, which requires fewer human interventions by actively selecting informative samples. Specifically, stable reduction of policy entropy enables improved trade-off between exploration and exploitation with higher sample efficiency. We first build influence functions of different samples on the policy entropy, which is efficiently estimated by the covariance of action probabilities and soft advantages of policies. Then we select samples with moderate values of influence functions, where shortcut samples that induce sharp entropy drops and noisy samples with negligible effect are pruned. Extensive experiments on four real-world manipulation tasks demonstrate that E2HiL achieves a 42.1% higher success rate while requiring 10.1% fewer human interventions compared to the state-of-the-art HiL-RL method, validating its effectiveness. The project page providing code, videos, and mathematical formulations can be found at this https URL.

[LG-65] Modeling Cascaded Delay Feedback for Online Net Conversion Rate Prediction: Benchmark Insights and Solutions

Link: https://arxiv.org/abs/2601.19965
Authors: Mingxuan Luo, Guipeng Xv, Sishuo Chen, Xinyu Li, Li Zhang, Zhangming Chan, Xiang-Rong Sheng, Han Zhu, Jian Xu, Bo Zheng, Chen Lin
Subjects: Machine Learning (cs.LG)

Abstract:In industrial recommender systems, conversion rate (CVR) is widely used for traffic allocation, but it fails to fully reflect recommendation effectiveness because it ignores refund behavior. To better capture true user satisfaction and business value, net conversion rate (NetCVR), defined as the probability that a clicked item is purchased and not refunded, has been […]. […] CVR, NetCVR prediction involves a more complex multi-stage cascaded delayed feedback process. The two cascaded delays from click to conversion and from conversion to refund have opposite effects, making traditional CVR modeling methods inapplicable. Moreover, the lack of open-source datasets and online continuous training schemes further hinders progress in this […]. To address these challenges, we introduce CASCADE (Cascaded Sequences of Conversion and Delayed Refund), the first large-scale open dataset derived from the Taobao app for online continuous NetCVR prediction. Through an in-depth analysis of CASCADE, we identify three key insights: (1) NetCVR exhibits strong temporal dynamics, necessitating online continuous modeling; (2) cascaded modeling of CVR and refund rate outperforms direct NetCVR modeling; and (3) delay time, which correlates with both CVR and refund rate, is an important feature for NetCVR […]. Based on these insights, we propose TESLA, a continuous NetCVR modeling framework featuring a CVR-refund-rate cascaded architecture, stage-wise debiasing, and a delay-time-aware ranking loss. Extensive experiments demonstrate that TESLA consistently outperforms state-of-the-art methods on CASCADE, achieving absolute improvements of 12.41 percent in RI-AUC and 14.94 percent in RI-PRAUC on NetCVR prediction. The code and dataset are publicly available at this https URL.

[LG-66] Classifier Calibration at Scale: An Empirical Study of Model-Agnostic Post-Hoc Methods

Link: https://arxiv.org/abs/2601.19944
Authors: Valery Manokhin, Daniel Grønhaug
Subjects: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
Comments: 61 pages, 23 figures

Abstract:We study model-agnostic post-hoc calibration methods intended to improve probabilistic predictions in supervised binary classification on real i.i.d. tabular data, with particular emphasis on conformal and Venn-based approaches that provide distribution-free validity guarantees under exchangeability. We benchmark 21 widely used classifiers, including linear models, SVMs, tree ensembles (CatBoost, XGBoost, LightGBM), and modern tabular neural and foundation models, on binary tasks from the TabArena-v0.1 suite using randomized, stratified five-fold cross-validation with a held-out test fold. Five calibrators (isotonic regression, Platt scaling, Beta calibration, Venn-Abers predictors, and Pearsonify) are trained on a separate calibration split and applied to test predictions. Calibration is evaluated using proper scoring rules (log-loss and Brier score) and diagnostic measures (Spiegelhalter’s Z, ECE, and ECI), alongside discrimination (AUC-ROC) and standard classification metrics. Across tasks and architectures, Venn-Abers predictors achieve the largest average reductions in log-loss, followed closely by Beta calibration, while Platt scaling exhibits weaker and less consistent effects. Beta calibration improves log-loss most frequently across tasks, whereas Venn-Abers displays fewer instances of extreme degradation and slightly more instances of extreme improvement. Importantly, we find that commonly used calibration procedures, most notably Platt scaling and isotonic regression, can systematically degrade proper scoring performance for strong modern tabular models. Overall classification performance is often preserved, but calibration effects vary substantially across datasets and architectures, and no method dominates uniformly. In expectation, all methods except Pearsonify slightly increase accuracy, but the effect is marginal, with the largest expected gain about 0.008%.
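
As a quick reference for the evaluation measures named above, the sketch below computes the Brier score, log-loss, and a simple equal-width-bin ECE for binary predictions. The bin count and the binning scheme are assumptions chosen for illustration, and the study's exact ECE variant may differ.

```python
import numpy as np

def brier_score(y_true, p_pred):
    """Mean squared error between predicted probability and the 0/1 label."""
    y, p = np.asarray(y_true, dtype=float), np.asarray(p_pred, dtype=float)
    return float(np.mean((p - y) ** 2))

def log_loss(y_true, p_pred, eps=1e-12):
    """Negative mean Bernoulli log-likelihood, with clipping to avoid log(0)."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def expected_calibration_error(y_true, p_pred, n_bins=10):
    """Weighted mean |observed frequency - mean confidence| over equal-width bins."""
    y, p = np.asarray(y_true, dtype=float), np.asarray(p_pred, dtype=float)
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(y[mask].mean() - p[mask].mean())
    return float(ece)

y_true = np.array([0, 0, 1, 1])
p_pred = np.array([0.1, 0.4, 0.6, 0.9])
print(brier_score(y_true, p_pred), log_loss(y_true, p_pred),
      expected_calibration_error(y_true, p_pred))
```

A post-hoc calibrator would be fit on a separate calibration split and these metrics compared before and after on the test fold, which is the protocol the abstract describes.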

[LG-67] Emergent Specialization in Learner Populations: Competition as the Source of Diversity

链接: https://arxiv.org/abs/2601.19943
作者: Yuhao Li
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 15 pages, 5 figures, code available at this https URL

点击查看摘要

Abstract:How can populations of learners develop coordinated, diverse behaviors without explicit communication or diversity incentives? We demonstrate that competition alone is sufficient to induce emergent specialization – learners spontaneously partition into specialists for different environmental regimes through competitive dynamics, consistent with ecological niche theory. We introduce the NichePopulation algorithm, a simple mechanism combining competitive exclusion with niche affinity tracking. Validated across six real-world domains (cryptocurrency trading, commodity prices, weather forecasting, solar irradiance, urban traffic, and air quality), our approach achieves a mean Specialization Index of 0.75 with effect sizes of Cohen’s d 20. Key findings: (1) At lambda=0 (no niche bonus), learners still achieve SI 0.30, proving specialization is genuinely emergent; (2) Diverse populations outperform homogeneous baselines by +26.5% through method-level division of labor; (3) Our approach outperforms MARL baselines (QMIX, MAPPO, IQL) by 4.3x while being 4x faster.

[LG-68] PiC-BNN: A 128-kbit 65 nm Processing-in-CAM-Based End-to-End Binary Neural Network Accelerator

链接: https://arxiv.org/abs/2601.19920
作者: Yuval Harary,Almog Sharoni,Esteban Garzón,Marco Lanuzza,Adam Teman,Leonid Yavits
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 7 pages, 6 figures. Accepted to IEEE CCMCC 2025

点击查看摘要

Abstract:Binary Neural Networks (BNNs), where weights and activations are constrained to binary values (+1, -1), are a highly efficient alternative to traditional neural networks. Unfortunately, typical BNNs, while binarizing linear layers (matrix-vector multiplication), still implement other network layers (batch normalization, softmax, output layer, and sometimes the input layer of a convolutional neural network) in full precision. This limits the area and energy benefits and requires architectural support for full precision operations. We propose PiC-BNN, a true end-to-end binary in-approximate search (Hamming distance tolerant) Content Addressable Memory based BNN accelerator. PiC-BNN is designed and manufactured in a commercial 65nm process. PiC-BNN uses Hamming distance tolerance to apply the law of large numbers to enable accurate classification without implementing full precision operations. PiC-BNN achieves baseline software accuracy (95.2%) on the MNIST dataset and 93.5% on the Hand Gesture (HG) dataset, a throughput of 560K inferences/s, and presents a power efficiency of 703M inferences/s/W when implementing a binary MLP model for MNIST/HG dataset classification.

[LG-69] CHIME: Chiplet-based Heterogeneous Near-Memory Acceleration for Edge Multimodal LLM Inference

链接: https://arxiv.org/abs/2601.19908
作者: Yanru Chen,Runyang Tian,Yue Pan,Zheyu Li,Weihong Xu,Tajana Rosing
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The proliferation of large language models (LLMs) is accelerating the integration of multimodal assistants into edge devices, where inference is executed under stringent latency and energy constraints, often exacerbated by intermittent connectivity. These challenges become particularly acute in the context of multimodal LLMs (MLLMs), as high-dimensional visual inputs are transformed into extensive token sequences, thereby inflating the key-value (KV) cache and imposing substantial data movement overheads to the LLM backbone. To address these issues, we present CHIME, a chiplet-based heterogeneous near-memory acceleration for edge MLLMs inference. CHIME leverages the complementary strengths of integrated monolithic 3D (M3D) DRAM and RRAM chiplets: DRAM supplies low-latency bandwidth for attention, while RRAM offers dense, non-volatile storage for weights. This heterogeneous hardware is orchestrated by a co-designed mapping framework that executes fused kernels near data, minimizing cross-chiplet traffic to maximize effective bandwidth. On FastVLM (0.6B/1.7B) and MobileVLM (1.7B/3B), CHIME achieves up to 54x speedup and up to 246x better energy efficiency per inference as compared to the edge GPU NVIDIA Jetson Orin NX. It sustains 116.5-266.5 token/J compared to Jetson’s 0.7-1.1 token/J. Furthermore, it delivers up to 69.2x higher throughput than the state-of-the-art PIM accelerator FACIL. Compared to the M3D DRAM-only design, CHIME’s heterogeneous memory further improves energy efficiency by 7% and performance by 2.4x.

[LG-70] VSCOUT: A Hybrid Variational Autoencoder Approach to Outlier Detection in High-Dimensional Retrospective Monitoring

链接: https://arxiv.org/abs/2601.20830
作者: Waldyn G. Martinez
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:Modern industrial and service processes generate high-dimensional, non-Gaussian, and contamination-prone data that challenge the foundational assumptions of classical Statistical Process Control (SPC). Heavy tails, multimodality, nonlinear dependencies, and sparse special-cause observations can distort baseline estimation, mask true anomalies, and prevent reliable identification of an in-control (IC) reference set. To address these challenges, we introduce VSCOUT, a distribution-free framework designed specifically for retrospective (Phase I) monitoring in high-dimensional settings. VSCOUT combines an Automatic Relevance Determination Variational Autoencoder (ARD-VAE) architecture with ensemble-based latent outlier filtering and changepoint detection. The ARD prior isolates the most informative latent dimensions, while the ensemble and changepoint filters identify pointwise and structural contamination within the determined latent space. A second-stage retraining step removes flagged observations and re-estimates the latent structure using only the retained inliers, mitigating masking and stabilizing the IC latent manifold. This two-stage refinement produces a clean and reliable IC baseline suitable for subsequent Phase II deployment. Extensive experiments across benchmark datasets demonstrate that VSCOUT achieves superior sensitivity to special-cause structure while maintaining controlled false alarms, outperforming classical SPC procedures, robust estimators, and modern machine-learning baselines. Its scalability, distributional flexibility, and resilience to complex contamination patterns position VSCOUT as a practical and effective method for retrospective modeling and anomaly detection in AI-enabled environments.

[LG-71] Demystifying Prediction Powered Inference

链接: https://arxiv.org/abs/2601.20819
作者: Yilin Song,Dan M. Kluger,Harsh Parikh,Tian Gu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning predictions are increasingly used to supplement incomplete or costly-to-measure outcomes in fields such as biomedical research, environmental science, and social science. However, treating predictions as ground truth introduces bias while ignoring them wastes valuable information. Prediction-Powered Inference (PPI) offers a principled framework that leverages predictions from large unlabeled datasets to improve statistical efficiency while maintaining valid inference through explicit bias correction using a smaller labeled subset. Despite its potential, the growing PPI variants and the subtle distinctions between them have made it challenging for practitioners to determine when and how to apply these methods responsibly. This paper demystifies PPI by synthesizing its theoretical foundations, methodological extensions, connections to existing statistics literature, and diagnostic tools into a unified practical workflow. Using the Mosaiks housing price data, we show that PPI variants produce tighter confidence intervals than complete-case analysis, but that double-dipping, i.e. reusing training data for inference, leads to anti-conservative confidence intervals and coverages. Under missing-not-at-random mechanisms, all methods, including classical inference using only labeled data, yield biased estimates. We provide a decision flowchart linking assumption violations to appropriate PPI variants, a summary table of selective methods, and practical diagnostic strategies for evaluating core assumptions. By framing PPI as a general recipe rather than a single estimator, this work bridges methodological innovation and applied practice, helping researchers responsibly integrate predictions into valid inference.
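
As a rough illustration of the bias-correction idea described in the abstract, the sketch below implements the basic prediction-powered estimator of a mean: the average prediction on the unlabeled data plus a "rectifier" estimated on the labeled subset. The function name `ppi_mean` and the toy numbers are assumptions made for this example; the paper surveys several variants that this sketch does not cover.

```python
import math
import statistics

def ppi_mean(y_labeled, pred_labeled, pred_unlabeled, z=1.96):
    """Basic prediction-powered estimate of E[Y] with a normal-approximation CI.

    estimate = mean(pred_unlabeled) + mean(y_labeled - pred_labeled)
    """
    residuals = [y - f for y, f in zip(y_labeled, pred_labeled)]
    rectifier = statistics.fmean(residuals)  # estimated bias of the predictions
    estimate = statistics.fmean(pred_unlabeled) + rectifier
    # Variance adds up over the two independent samples.
    var = (statistics.variance(pred_unlabeled) / len(pred_unlabeled)
           + statistics.variance(residuals) / len(residuals))
    se = math.sqrt(var)
    return estimate, (estimate - z * se, estimate + z * se)

# Made-up toy data: the predictor overshoots every outcome by exactly 1.
y_labeled = [1.0, 2.0, 3.0, 4.0]
pred_labeled = [2.0, 3.0, 4.0, 5.0]
pred_unlabeled = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

estimate, ci = ppi_mean(y_labeled, pred_labeled, pred_unlabeled)
print(estimate, ci)
```

The raw prediction mean is 4.5, and the rectifier of -1 moves the estimate to 3.5. The interval is only valid if the labeled data were not used to train the predictor, which is the double-dipping issue the abstract warns about.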

[LG-72] Neural Quantum States in Mixed Precision

链接: https://arxiv.org/abs/2601.20782
作者: Massimo Solinas,Agnes Valenti,Nawaf Bou-Rabee,Roeland Wiersema
类目: Quantum Physics (quant-ph); Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG)
*备注: 22 pages, 12 figures

点击查看摘要

Abstract:Scientific computing has long relied on double precision (64-bit floating point) arithmetic to guarantee accuracy in simulations of real-world phenomena. However, the growing availability of hardware accelerators such as Graphics Processing Units (GPUs) has made low-precision formats attractive due to their superior performance, reduced memory footprint, and improved energy efficiency. In this work, we investigate the role of mixed-precision arithmetic in neural-network based Variational Monte Carlo (VMC), a widely used method for solving computationally otherwise intractable quantum many-body systems. We first derive general analytical bounds on the error introduced by reduced precision on Metropolis-Hastings MCMC, and then empirically validate these bounds on the use-case of VMC. We demonstrate that significant portions of the algorithm, in particular, sampling the quantum state, can be executed in half precision without loss of accuracy. More broadly, this work provides a theoretical framework to assess the applicability of mixed-precision arithmetic in machine-learning approaches that rely on MCMC sampling. In the context of VMC, we additionally demonstrate the practical effectiveness of mixed-precision strategies, enabling more scalable and energy-efficient simulations of quantum many-body systems.

[LG-73] Cross-Country Learning for National Infectious Disease Forecasting Using European Data

链接: https://arxiv.org/abs/2601.20771
作者: Zacharias Komodromos,Kleanthis Malialis,Artemis Kontou,Panayiotis Kolios
类目: Populations and Evolution (q-bio.PE); Machine Learning (cs.LG)
*备注: 7 pages, 4 figures, 5 tables

点击查看摘要

Abstract:Accurate forecasting of infectious disease incidence is critical for public health planning and timely intervention. While most data-driven forecasting approaches rely primarily on historical data from a single country, such data are often limited in length and variability, restricting the performance of machine learning (ML) models. In this work, we investigate a cross-country learning approach for infectious disease forecasting, in which a single model is trained on time series data from multiple countries and evaluated on a country of interest. This setting enables the model to exploit shared epidemic dynamics across countries and to benefit from an enlarged training set. We examine this approach through a case study on COVID-19 case forecasting in Cyprus, using surveillance data from European countries. We evaluate multiple ML models and analyse the impact of the lookback window length and cross-country `data augmentation’ on multi-step forecasting performance. Our results show that incorporating data from other countries can lead to consistent improvements over models trained solely on national data. Although the empirical focus is on Cyprus and COVID-19, the proposed framework and findings are applicable to infectious disease forecasting more broadly, particularly in settings with limited national historical data.

[LG-74] Leveraging Second-Order Curvature for Efficient Learned Image Compression: Theory and Empirical Evidence

链接: https://arxiv.org/abs/2601.20769
作者: Yichi Zhang,Fengqing Zhu
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Training learned image compression (LIC) models entails navigating a challenging optimization landscape defined by the fundamental trade-off between rate and distortion. Standard first-order optimizers, such as SGD and Adam, struggle with gradient conflicts arising from competing objectives, leading to slow convergence and suboptimal rate-distortion performance. In this work, we demonstrate that a simple utilization of a second-order quasi-Newton optimizer, SOAP, dramatically improves both training efficiency and final performance across diverse LICs. Our theoretical and empirical analyses reveal that Newton preconditioning inherently resolves the intra-step and inter-step update conflicts intrinsic to the R-D objective, facilitating faster, more stable convergence. Beyond acceleration, we uncover a critical deployability benefit: second-order trained models exhibit significantly fewer activation and latent outliers. This substantially enhances robustness to post-training quantization. Together, these results establish second-order optimization, achievable as a seamless drop-in replacement of the imported optimizer, as a powerful, practical tool for advancing the efficiency and real-world readiness of LICs.

[LG-75] A scalable flow-based approach to mitigate topological freezing

链接: https://arxiv.org/abs/2601.20708
作者: Claudio Bonanno,Andrea Bulgarelli,Elia Cellini,Alessandro Nada,Dario Panfalone,Davide Vadacchino,Lorenzo Verzichelli
类目: High Energy Physics - Lattice (hep-lat); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph)
*备注: 1+9 pages, 3 figures, contribution to the 42nd International Symposium on Lattice Field Theory (Lattice 2025), 2-8 November 2025, Mumbai, India

点击查看摘要

Abstract:As lattice gauge theories with non-trivial topological features are driven towards the continuum limit, standard Markov Chain Monte Carlo simulations suffer from topological freezing, i.e., a dramatic growth of autocorrelations in topological observables. A widely used strategy is the adoption of Open Boundary Conditions (OBC), which restores ergodic sampling of topology but at the price of breaking translation invariance and introducing unphysical boundary artifacts. In this contribution we summarize a scalable, exact flow-based strategy to remove them by transporting configurations from a prior with an OBC defect to a fully periodic ensemble, and apply it to 4d SU(3) Yang–Mills theory. The method is based on a Stochastic Normalizing Flow (SNF) that alternates non-equilibrium Monte Carlo updates with localized, gauge-equivariant defect coupling layers implemented via masked parametric stout smearing. Training is performed by minimizing the average dissipated work, equivalent to a Kullback–Leibler divergence between forward and reverse non-equilibrium path measures, to achieve more reversible trajectories and improved efficiency. We discuss the scaling with the number of degrees of freedom affected by the defect and show that defect SNFs achieve better performances than purely stochastic non-equilibrium methods at comparable cost. Finally, we validate the approach by reproducing reference results for the topological susceptibility.

[LG-76] Sparse clustering via the Deterministic Information Bottleneck algorithm

链接: https://arxiv.org/abs/2601.20628
作者: Efthymios Costa,Ioanna Papatsouma,Angelos Markos
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Submitted to IFCS 2026 (8 pages total)

点击查看摘要

Abstract:Cluster analysis relates to the task of assigning objects into groups which ideally present some desirable characteristics. When a cluster structure is confined to a subset of the feature space, traditional clustering techniques face unprecedented challenges. We present an information-theoretic framework that overcomes the problems associated with sparse data, allowing for joint feature weighting and clustering. Our proposal constitutes a competitive alternative to existing clustering algorithms for sparse data, as demonstrated through simulations on synthetic data. The effectiveness of our method is established by an application on a real-world genomics data set.

[LG-77] Trigger Optimization and Event Classification for Dark Matter Searches in the CYGNO Experiment Using Machine Learning

链接: https://arxiv.org/abs/2601.20626
作者: F. D. Amaro,R. Antonietti,E. Baracchini,L. Benussi,C. Capoccia,M. Caponero,L. G. M. de Carvalho,G. Cavoto,I. A. Costa,A. Croce,M. D’Astolfo,G. D’Imperio,G. Dho,E. Di Marco,J. M. F. dos Santos,D. Fiorina,F. Iacoangeli,Z. Islam,E. Kemp,H. P. Lima Jr,G. Maccarrone,R. D. P. Mano,D. J. G. Marques,G. Mazzitelli,P. Meloni,A. Messina,C. M. B. Monteiro,R. A. Nobrega,G. M. Oppedisano,I. F. Pains,E. Paoletti,F. Petrucci,S. Piacentini,D. Pierluigi,D. Pinci,F. Renga,A. Russo,G. Saviano,P. A. O. C. Silva,N. J. Spooner,R. Tesauro,S. Tomassini,D. Tozzi
类目: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 6 pages, 1 figure, 14th Young Researcher Meeting (YRM 2025)

点击查看摘要

Abstract:The CYGNO experiment employs an optical-readout Time Projection Chamber (TPC) to search for rare low-energy interactions using finely resolved scintillation images. While the optical readout provides rich topological information, it produces large, sparse megapixel images that challenge real-time triggering, data reduction, and background discrimination. We summarize two complementary machine-learning approaches developed within CYGNO. First, we present a fast and fully unsupervised strategy for online data reduction based on reconstruction-based anomaly detection. A convolutional autoencoder trained exclusively on pedestal images (i.e. frames acquired with GEM amplification disabled) learns the detector noise morphology and highlights particle-induced structures through localized reconstruction residuals, from which compact Regions of Interest (ROIs) are extracted. On real prototype data, the selected configuration retains (93.0 +/- 0.2)% of reconstructed signal intensity while discarding (97.8 +/- 0.1)% of the image area, with ~25 ms per-frame inference time on a consumer GPU. Second, we report a weakly supervised application of the Classification Without Labels (CWoLa) framework to data acquired with an Americium–Beryllium neutron source. Using only mixed AmBe and standard datasets (no event-level labels), a convolutional classifier learns to identify nuclear-recoil-like topologies. The achieved performance approaches the theoretical limit imposed by the mixture composition and isolates a high-score population with compact, approximately circular morphologies consistent with nuclear recoils. 

[LG-78] Exact Graph Learning via Integer Programming

链接: https://arxiv.org/abs/2601.20589
作者: Lucas Kook,Søren Wengel Mogensen
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning the dependence structure among variables in complex systems is a central problem across medical, natural, and social sciences. These structures can be naturally represented by graphs, and the task of inferring such graphs from data is known as graph learning or as causal discovery if the graphs are given a causal interpretation. Existing approaches typically rely on restrictive assumptions about the data-generating process, employ greedy oracle algorithms, or solve approximate formulations of the graph learning problem. As a result, they are either sensitive to violations of central assumptions or fail to guarantee globally optimal solutions. We address these limitations by introducing a nonparametric graph learning framework based on nonparametric conditional independence testing and integer programming. We reformulate the graph learning problem as an integer-programming problem and prove that solving the integer-programming problem provides a globally optimal solution to the original graph learning problem. Our method leverages efficient encodings of graphical separation criteria, enabling the exact recovery of larger graphs than was previously feasible. We provide an implementation in the openly available R package ‘glip’ which supports learning (acyclic) directed (mixed) graphs and chain graphs. From the resulting output one can compute representations of the corresponding Markov equivalence classes or weak equivalence classes. Empirically, we demonstrate that our approach is faster than other existing exact graph learning procedures for a large fraction of instances and graphs of various sizes. GLIP also achieves state-of-the-art performance on simulated data and benchmark datasets across all aforementioned classes of graphs.

[LG-79] Incorporating data drift to perform survival analysis on credit risk

链接: https://arxiv.org/abs/2601.20533
作者: Jianwei Peng(1),Stefan Lessmann(1 and 2) ((1) Humboldt-Universität zu Berlin, (2) Bucharest University of Economic Studies)
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Risk Management (q-fin.RM)
*备注: 27 pages, 2 figures

点击查看摘要

Abstract:Survival analysis has become a standard approach for modelling time to default by time-varying covariates in credit risk. Most existing methods implicitly assume a stationary data-generating process; in practice, however, mortgage portfolios are exposed to various forms of data drift caused by changing borrower behaviour, macroeconomic conditions, policy regimes and so on. This study investigates the impact of data drift on survival-based credit risk models and proposes a dynamic joint modelling framework to improve robustness under non-stationary environments. The proposed model integrates a longitudinal behavioural marker derived from balance dynamics with a discrete-time hazard formulation, combined with landmark one-hot encoding and isotonic calibration. Three types of data drift (sudden, incremental and recurring) are simulated and analysed on mortgage loan datasets from Freddie Mac. Experiments and corresponding evidence show that the proposed landmark-based joint model consistently outperforms classical survival models, tree-based drift-adaptive learners and gradient boosting methods in terms of discrimination and calibration across all drift scenarios, which confirms the superiority of our model design.
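
For readers unfamiliar with the discrete-time hazard formulation mentioned in the abstract, the sketch below shows the standard relationship between per-period hazards and the survival curve, S(t) = (1 - h_1)(1 - h_2)...(1 - h_t). The hazard values are invented; in the paper they would come from the fitted model, which is not reproduced here.

```python
def survival_curve(hazards):
    """Turn per-period default hazards h_1..h_T into survival probabilities S(1)..S(T)."""
    surv, s = [], 1.0
    for h in hazards:
        s *= 1.0 - h  # survive this period given survival so far
        surv.append(s)
    return surv

# Made-up hazards for three consecutive periods of one loan.
hazards = [0.01, 0.02, 0.05]
surv = survival_curve(hazards)
cumulative_default = [1.0 - s for s in surv]
print(surv, cumulative_default)
```

The cumulative probability of default by period t is 1 - S(t), which is the quantity a calibration step would be applied to.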

[LG-80] Physics-informed Blind Reconstruction of Dense Fields from Sparse Measurements using Neural Networks with a Differentiable Simulator

链接: https://arxiv.org/abs/2601.20496
作者: Ofek Aloni,Barak Fishbain
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generating dense physical fields from sparse measurements is a fundamental question in sampling, signal processing, and many other applications. State-of-the-art methods either use spatial statistics or rely on examples of dense fields in the training phase, which often are not available, and thus rely on synthetic data. Here, we present a reconstruction method that generates dense fields from sparse measurements, without assuming availability of the spatial statistics, nor of examples of the dense fields. This is made possible through the introduction of an automatically differentiable numerical simulator into the training phase of the method. The method is shown to have superior results over statistical and neural network based methods on a set of three standard problems from fluid mechanics.

[LG-81] Convergence Analysis of Randomized Subspace Normalized SGD under Heavy-Tailed Noise

链接: https://arxiv.org/abs/2601.20399
作者: Gaku Omiya,Pierre-Louis Poirion,Akiko Takeda
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 41 pages

点击查看摘要

Abstract:Randomized subspace methods reduce per-iteration cost; however, in nonconvex optimization, most analyses are expectation-based, and high-probability bounds remain scarce even under sub-Gaussian noise. We first prove that randomized subspace SGD (RS-SGD) admits a high-probability convergence bound under sub-Gaussian noise, achieving the same order of oracle complexity as prior in-expectation results. Motivated by the prevalence of heavy-tailed gradients in modern machine learning, we then propose randomized subspace normalized SGD (RS-NSGD), which integrates direction normalization into subspace updates. Assuming the noise has bounded p-th moments, we establish both in-expectation and high-probability convergence guarantees, and show that RS-NSGD can achieve better oracle complexity than full-dimensional normalized SGD.
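
The abstract does not spell out the update rule, so the following is only one plausible reading of "direction normalization in a random subspace": draw a random matrix P, project the gradient to P^T g, normalize it, and step along P (P^T g) / ||P^T g||. The Gaussian sketch, its scaling, the step size, and the function name `rs_nsgd_step` are all assumptions made for illustration and may differ from the authors' algorithm.

```python
import math
import random

def rs_nsgd_step(x, grad, r, lr, rng):
    """One normalized step restricted to a random r-dimensional subspace (sketch)."""
    d = len(x)
    # Assumed choice: Gaussian sketch P (d x r) with N(0, 1/r) entries.
    P = [[rng.gauss(0.0, 1.0 / math.sqrt(r)) for _ in range(r)] for _ in range(d)]
    # Projected gradient P^T g lives in the r-dimensional subspace.
    # The full gradient is formed here for clarity; a practical method
    # would only need the r directional derivatives.
    pg = [sum(P[i][j] * grad[i] for i in range(d)) for j in range(r)]
    norm = math.sqrt(sum(v * v for v in pg))
    if norm == 0.0:
        return list(x)  # no usable direction in this subspace
    u = [v / norm for v in pg]
    # Map the unit direction back to the full space and take the step.
    return [x[i] - lr * sum(P[i][j] * u[j] for j in range(r)) for i in range(d)]

def loss(x):
    """Toy objective 0.5 * ||x||^2, whose gradient is x itself."""
    return 0.5 * sum(v * v for v in x)

rng = random.Random(0)
x0 = [1.0] * 10
x = list(x0)
for _ in range(300):
    x = rs_nsgd_step(x, grad=x, r=3, lr=0.05, rng=rng)
print(loss(x0), loss(x))
```

Because the step length does not depend on the gradient magnitude, a single very large (heavy-tailed) gradient sample cannot produce an arbitrarily large update, which is the usual motivation for normalization.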

[LG-82] Do Whitepaper Claims Predict Market Behavior? Evidence from Cryptocurrency Factor Analysis

链接: https://arxiv.org/abs/2601.20336
作者: Murad Farzulla
类目: Computational Finance (q-fin.CP); Machine Learning (cs.LG)
*备注: 35 pages, 8 figures, 10 tables. Code available at this https URL

点击查看摘要

Abstract:Cryptocurrency projects articulate value propositions through whitepapers, making claims about functionality and technical capabilities. This study investigates whether these narratives align with observed market behavior. We construct a pipeline combining zero-shot NLP classification (BART-MNLI) with CP tensor decomposition to compare three spaces: (1) a claims matrix from 24 whitepapers across 10 semantic categories, (2) market statistics for 49 assets over two years of hourly data, and (3) latent factors from tensor decomposition (rank 2, 92.45% variance explained). Using Procrustes rotation and Tucker’s congruence coefficient, we test alignment across 23 common entities. Results show weak alignment: claims-statistics (phi=0.341, p=0.332), claims-factors (phi=0.077, p=0.747), and statistics-factors (phi=0.197, p<0.001). The statistics-factors significance validates our methodology, confirming the pipeline detects relationships when present. Inter-model validation with DeBERTa-v3 yields 32% exact agreement but 67% top-3 agreement. Cross-sectional analysis reveals heterogeneous contributions: NEAR, MKR, ATOM show positive alignment while ENS, UNI, Bitcoin diverge most. Excluding Bitcoin confirms results are not driven by market dominance. We interpret findings as weak alignment between whitepaper narratives and market factor structure. Limited power (n=23) precludes distinguishing weak from no alignment, but strong alignment (phi=0.70) can be confidently rejected. Implications for narrative economics and investment analysis are discussed.
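
Tucker's congruence coefficient, reported as phi in the abstract, is the uncentered cosine between two loading vectors. The sketch below shows the vector form only; the paper applies it after a Procrustes rotation of whole matrices, a step not reproduced here, and the example vectors are invented.

```python
import math

def tucker_congruence(x, y):
    """Tucker's congruence coefficient: sum(x*y) / sqrt(sum(x^2) * sum(y^2))."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den

# Made-up loading vectors.
print(tucker_congruence([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))   # proportional vectors
print(tucker_congruence([1.0, 0.0], [0.0, 1.0]))             # orthogonal vectors
```

Unlike the Pearson correlation, the vectors are not mean-centered, so the coefficient is sensitive to a shared offset in the loadings.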

[LG-83] Empirical Likelihood-Based Fairness Auditing: Distribution-Free Certification and Flagging

链接: https://arxiv.org/abs/2601.20269
作者: Jie Tang,Chuanlong Xie,Xianli Zeng,Lixing Zhu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 55 pages, 9 figures; Code available at: this https URL. Author list is in alphabetical order by last names

点击查看摘要

Abstract:Machine learning models in high-stakes applications, such as recidivism prediction and automated personnel selection, often exhibit systematic performance disparities across sensitive subpopulations, raising critical concerns regarding algorithmic bias. Fairness auditing addresses these risks through two primary functions: certification, which verifies adherence to fairness constraints; and flagging, which isolates specific demographic groups experiencing disparate treatment. However, existing auditing techniques are frequently limited by restrictive distributional assumptions or prohibitive computational overhead. We propose a novel empirical likelihood-based (EL) framework that constructs robust statistical measures for model performance disparities. Unlike traditional methods, our approach is non-parametric; the proposed disparity statistics follow asymptotically chi-square or mixed chi-square distributions, ensuring valid inference without assuming underlying data distributions. This framework uses a constrained optimization profile that admits stable numerical solutions, facilitating both large-scale certification and efficient subpopulation discovery. Empirically, the EL methods outperform bootstrap-based approaches, yielding coverage rates closer to nominal levels while reducing computational latency by several orders of magnitude. We demonstrate the practical utility of this framework on the COMPAS dataset, where it successfully flags intersectional biases, specifically identifying a significantly higher positive prediction rate for African-American males under 25 and a systemic under-prediction for Caucasian females relative to the population mean.

[LG-84] Efficient Evaluation of LLM Performance with Statistical Guarantees

链接: https://arxiv.org/abs/2601.20251
作者: Skyler Wu,Yash Nair,Emmanuel J. Candés
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 24 pages, 10 figures

点击查看摘要

Abstract:Exhaustively evaluating many large language models (LLMs) on a large suite of benchmarks is expensive. We cast benchmarking as finite-population inference and, under a fixed query budget, seek tight confidence intervals (CIs) for model accuracy with valid frequentist coverage. We propose Factorized Active Querying (FAQ), which (a) leverages historical information through a Bayesian factor model; (b) adaptively selects questions using a hybrid variance-reduction/active-learning sampling policy; and (c) maintains validity through Proactive Active Inference – a finite-population extension of active inference (Zrnic and Candes, 2024) that enables direct question selection while preserving coverage. With negligible overhead cost, FAQ delivers up to 5x effective sample size gains over strong baselines on two benchmark suites, across varying historical-data missingness levels: this means that it matches the CI width of uniform sampling while using up to 5x fewer queries. We release our source code and our curated datasets to support reproducible evaluation and future research.
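
To make the "finite-population inference" framing concrete, the sketch below computes the textbook confidence interval for benchmark accuracy under uniform sampling without replacement, i.e. the baseline that the abstract compares against, not the FAQ method itself. The function name, the benchmark size of 1000 questions, and the 70/30 outcome split are invented for the example.

```python
import math
import statistics

def uniform_sampling_ci(scores, population_size, z=1.96):
    """Normal-approximation CI for the population mean from a simple random
    sample drawn without replacement, with finite-population correction."""
    n = len(scores)
    mean = statistics.fmean(scores)
    fpc = 1.0 - n / population_size  # shrinks to 0 when every question is queried
    se = math.sqrt(fpc * statistics.variance(scores) / n)
    return mean - z * se, mean + z * se

# Made-up outcome: 100 of 1000 benchmark questions queried, 70 answered correctly.
scores = [1.0] * 70 + [0.0] * 30
low, high = uniform_sampling_ci(scores, population_size=1000)
print(low, high)
```

An "effective sample size gain" of 5x then means that an adaptive method reaches the same interval width with roughly one fifth of the queries this baseline would need.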

[LG-85] Quantum statistics from classical simulations via generative Gibbs sampling

链接: https://arxiv.org/abs/2601.20228
作者: Weizhou Wang,Xuanxi Zhang,Jonathan Weare,Aaron R. Dinner
类目: Chemical Physics (physics.chem-ph); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 12 pages, 9 figures

点击查看摘要

Abstract:Accurate simulation of nuclear quantum effects is essential for molecular modeling but expensive using path integral molecular dynamics (PIMD). We present GG-PI, a ring-polymer-based framework that combines generative modeling of the single-bead conditional density with Gibbs sampling to recover quantum statistics from classical simulation data. GG-PI uses inexpensive standard classical simulations or existing data for training and allows transfer across temperatures without retraining. On standard test systems, GG-PI significantly reduces wall clock time compared to PIMD. Our approach extends easily to a wide range of problems with similar Markov structure.

[LG-86] Bias-Reduced Estimation of Finite Mixtures: An Application to Latent Group Structures in Panel Data

Link: https://arxiv.org/abs/2601.20197
Authors: Raphaël Langevin
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Econometrics (econ.EM); Computation (stat.CO)

Abstract:Finite mixture models are widely used in econometric analyses to capture unobserved heterogeneity. This paper shows that maximum likelihood estimation of finite mixtures of parametric densities can suffer from substantial finite-sample bias in all parameters under mild regularity conditions. The bias arises from the influence of outliers in component densities with unbounded or large support and increases with the degree of overlap among mixture components. I show that maximizing the classification-mixture likelihood function, equipped with a consistent classifier, yields parameter estimates that are less biased than those obtained by standard maximum likelihood estimation (MLE). I then derive the asymptotic distribution of the resulting estimator and provide conditions under which oracle efficiency is achieved. Monte Carlo simulations show that conventional mixture MLE exhibits pronounced finite-sample bias, which diminishes as the sample size or the statistical distance between component densities tends to infinity. The simulations further show that the proposed estimation strategy generally outperforms standard MLE in finite samples in terms of both bias and mean squared errors under relatively weak assumptions. An empirical application to latent group panel structures using health administrative data shows that the proposed approach reduces out-of-sample prediction error by approximately 17.6% relative to the best results obtained from standard MLE procedures.

[LG-87] Randomized Feasibility Methods for Constrained Optimization with Adaptive Step Sizes

Link: https://arxiv.org/abs/2601.20076
Authors: Abhishek Chakraborty, Angelia Nedić
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:We consider minimizing an objective function subject to constraints defined by the intersection of lower-level sets of convex functions. We study two cases: (i) strongly convex and Lipschitz-smooth objective function and (ii) convex but possibly nonsmooth objective function. To deal with the constraints that are not easy to project on, we use a randomized feasibility algorithm with Polyak steps and a random number of sampled constraints per iteration, while taking (sub)gradient steps to minimize the objective function. For case (i), we prove linear convergence in expectation of the objective function values to any prescribed tolerance using an adaptive stepsize. For case (ii), we develop a fully problem parameter-free and adaptive stepsize scheme that yields an $O(1/\sqrt{T})$ worst-case rate in expectation. The infeasibility of the iterates decreases geometrically with the number of feasibility updates almost surely, while for the averaged iterates, we establish an expected lower bound on the function values relative to the optimal value that depends on the distribution for the random number of sampled constraints. For certain choices of sample-size growth, optimal rates are achieved. Finally, simulations on a Quadratically Constrained Quadratic Programming (QCQP) problem and Support Vector Machines (SVM) demonstrate the computational efficiency of our algorithm compared to other state-of-the-art methods.
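
The alternation described in the abstract, a gradient step on the objective followed by Polyak-type feasibility steps on a random number of sampled constraints, can be sketched on a toy problem. Everything specific here is an assumption made for illustration: the quadratic objective, the two linear constraints, the diminishing step `1/(k+2)` (a stand-in, not the paper's adaptive stepsize), and the helper names `polyak_feasibility_step` and `solve`.

```python
import random


def polyak_feasibility_step(x, a, b, beta=1.0):
    """Polyak step toward the half-space {y : a.y <= b}; a no-op if x is feasible."""
    violation = max(sum(ai * xi for ai, xi in zip(a, x)) - b, 0.0)
    if violation == 0.0:
        return x
    norm_sq = sum(ai * ai for ai in a)  # squared norm of the constraint gradient
    return [xi - beta * violation / norm_sq * ai for ai, xi in zip(a, x)]


def solve(target, constraints, iters=2000, seed=0):
    """Minimize 0.5 * ||x - target||^2 subject to a.x <= b for each (a, b)."""
    rng = random.Random(seed)
    x = [0.0] * len(target)
    for k in range(iters):
        step = 1.0 / (k + 2)  # assumed diminishing step, not the paper's scheme
        # Gradient of 0.5 * ||x - target||^2 is (x - target).
        x = [xi - step * (xi - ti) for xi, ti in zip(x, target)]
        # Random number of sampled constraints per iteration.
        for _ in range(rng.randint(1, 3)):
            a, b = rng.choice(constraints)
            x = polyak_feasibility_step(x, a, b)
    return x


# Toy feasible set: x <= 1 and y <= 1; the constrained minimizer is (1, 1).
constraints = [([1.0, 0.0], 1.0), ([0.0, 1.0], 1.0)]
solution = solve([2.0, 2.0], constraints)
print(solution)
```

With `beta=1.0` and a linear constraint, the Polyak step coincides with the exact projection onto the sampled half-space, which is why the iterates settle near the corner of the feasible set without ever projecting onto the full intersection.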

[LG-88] Explainable deep learning reveals the physical mechanisms behind the turbulent kinetic energy equation

Link: https://arxiv.org/abs/2601.20052
Authors: Francisco Alcántara-Ávila, Andrés Cremades, Sergio Hoyas, Ricardo Vinuesa
Subjects: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments: 6 pages, 5 figures, 1 appendix

Abstract:In this work, we investigate the physical mechanisms governing turbulent kinetic energy transport using explainable deep learning (XDL). An XDL model based on SHapley Additive exPlanations (SHAP) is used to identify and percolate high-importance structures for the evolution of the turbulent kinetic energy budget terms of a turbulent channel flow at a friction Reynolds number of $Re_\tau = 125$. The results show that the important structures are predominantly located in the near-wall region and are more frequently associated with sweep-type events. In the viscous layer, the SHAP structures relevant for production and viscous diffusion are almost entirely contained within those relevant for dissipation, revealing a clear hierarchical organization of near-wall turbulence. In the outer layer, this hierarchical organization breaks down and only velocity-pressure-gradient correlation and turbulent transport SHAP structures remain, with a moderate spatial coincidence of approximately 60%. Finally, we show that none of the coherent structures classically studied in turbulence are capable of representing the mechanisms behind the various terms of the turbulent kinetic energy budget throughout the channel. These results reveal dissipation as the dominant organizing mechanism of near-wall turbulence, constraining production and viscous diffusion within a single structural hierarchy that breaks down in the outer layer.

[LG-89] Minimax Rates for Hyperbolic Hierarchical Learning

Link: https://arxiv.org/abs/2601.20047
Authors: Divit Rawal, Sriram Vishwanath
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We prove an exponential separation in sample complexity between Euclidean and hyperbolic representations for learning on hierarchical data under standard Lipschitz regularization. For depth-$R$ hierarchies with branching factor $m$, we first establish a geometric obstruction for Euclidean space: any bounded-radius embedding forces volumetric collapse, mapping exponentially many tree-distant points to nearby locations. This necessitates Lipschitz constants scaling as $\exp(\Omega(R))$ to realize even simple hierarchical targets, yielding exponential sample complexity under capacity control. We then show this obstruction vanishes in hyperbolic space: constant-distortion hyperbolic embeddings admit $O(1)$-Lipschitz realizability, enabling learning with $n = O(mR \log m)$ samples. A matching $\Omega(mR \log m)$ lower bound via Fano’s inequality establishes that hyperbolic representations achieve the information-theoretic optimum. We also show a geometry-independent bottleneck: any rank-$k$ prediction space captures only $O(k)$ canonical hierarchical contrasts.

[LG-90] The Sound of Noise: Leveraging the Inductive Bias of Pre-trained Audio Transformers for Glitch Identification in LIGO

Link: https://arxiv.org/abs/2601.20034
Authors: Suyash Deshmukh, Chayan Chatterjee, Abigail Petulante, Tabata Aira Ferreira, Karan Jani
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)

Abstract:Transient noise artifacts, or glitches, fundamentally limit the sensitivity of gravitational-wave (GW) interferometers and can mimic true astrophysical signals, particularly the short-duration intermediate-mass black hole (IMBH) mergers. Current glitch classification methods, such as Gravity Spy, rely on supervised models trained from scratch using labeled datasets. These approaches suffer from a significant "label bottleneck," requiring massive, expertly annotated datasets to achieve high accuracy and often struggling to generalize to new glitch morphologies or exotic GW signals encountered in observing runs. In this work, we present a novel cross-domain framework that treats GW strain data through the lens of audio processing. We utilize the Audio Spectrogram Transformer (AST), a model pre-trained on large-scale audio datasets, and adapt it to the GW domain. Instead of learning time-frequency features from scratch, our method exploits the strong inductive bias inherent in pre-trained audio models, transferring learned representations of natural sound to the characterization of detector noise and GW signals, including IMBHs. We validate this approach by analyzing strain data from the third (O3) and fourth (O4) observing runs of the LIGO detectors. We used t-Distributed Stochastic Neighbor Embedding (t-SNE), an unsupervised clustering technique, to visualize the AST-derived embeddings of signals and glitches, revealing well-separated groups that align closely with independently validated Gravity Spy glitch classes. Our results indicate that the inductive bias from audio pre-training allows superior feature extraction compared to traditional supervised techniques, offering a robust, data-efficient pathway for discovering new, anomalous transients, and classifying complex noise artifacts in the era of next-generation detectors.

[LG-91] Exploring the holographic entropy cone via reinforcement learning

Link: https://arxiv.org/abs/2601.19979
Authors: Temple He, Jaeha Lee, Hirosi Ooguri
Subjects: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: 38 pages, 10 figures, 2 tables

Abstract:We develop a reinforcement learning algorithm to study the holographic entropy cone. Given a target entropy vector, our algorithm searches for a graph realization whose min-cut entropies match the target vector. If the target vector does not admit such a graph realization, it must lie outside the cone, in which case the algorithm finds a graph whose corresponding entropy vector most nearly approximates the target and allows us to probe the location of the facets. For the $\mathsf{N}=3$ cone, we confirm that our algorithm successfully rediscovers monogamy of mutual information beginning with a target vector outside the holographic entropy cone. We then apply the algorithm to the $\mathsf{N}=6$ cone, analyzing the 6 “mystery” extreme rays of the subadditivity cone from arXiv:2412.15364 that satisfy all known holographic entropy inequalities yet lacked graph realizations. We found realizations for 3 of them, proving they are genuine extreme rays of the holographic entropy cone, while providing evidence that the remaining 3 are not realizable, implying unknown holographic inequalities exist for $\mathsf{N}=6$.

[LG-92] Global Plane Waves From Local Gaussians: Periodic Charge Densities in a Blink

Link: https://arxiv.org/abs/2601.19966
Authors: Jonas Elsborg, Felix Ærtebjerg, Luca Thiede, Alán Aspuru-Guzik, Tejs Vegge, Arghya Bhowmik
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph)
Comments: 24 pages including appendix, 8 figures, 5 tables

Abstract:We introduce ELECTRAFI, a fast, end-to-end differentiable model for predicting periodic charge densities in crystalline materials. ELECTRAFI constructs anisotropic Gaussians in real space and exploits their closed-form Fourier transforms to analytically evaluate plane-wave coefficients via the Poisson summation formula. This formulation delegates non-local and periodic behavior to analytic transforms, enabling reconstruction of the full periodic charge density with a single inverse FFT. By avoiding explicit real-space grid probing, periodic image summation, and spherical harmonic expansions, ELECTRAFI matches or exceeds state-of-the-art accuracy across periodic benchmarks while being up to $633\times$ faster than the strongest competing method, reconstructing crystal charge densities in a fraction of a second. When used to initialize DFT calculations, ELECTRAFI reduces total DFT compute cost by up to ~20%, whereas slower charge density models negate savings due to high inference times. Our results show that accuracy and inference cost jointly determine end-to-end DFT speedups, and motivate our focus on efficiency.

[LG-93] Deep Neural Networks as Iterated Function Systems and a Generalization Bound

Link: https://arxiv.org/abs/2601.19958
Authors: Jonathan Vacher (MAP5 - UMR 8145)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)

Abstract:Deep neural networks (DNNs) achieve remarkable performance on a wide range of tasks, yet their mathematical analysis remains fragmented: stability and generalization are typically studied in disparate frameworks and on a case-by-case basis. Architecturally, DNNs rely on the recursive application of parametrized functions, a mechanism that can be unstable and difficult to train, making stability a primary concern. Even when training succeeds, there are few rigorous results on how well such models generalize beyond the observed data, especially in the generative setting. In this work, we leverage the theory of stochastic Iterated Function Systems (IFS) and show that two important deep architectures can be viewed as, or canonically associated with, place-dependent IFS. This connection allows us to import results from random dynamical systems to (i) establish the existence and uniqueness of invariant measures under suitable contractivity assumptions, and (ii) derive a Wasserstein generalization bound for generative modeling. The bound naturally leads to a new training objective that directly controls the collage-type approximation error between the data distribution and its image under the learned transfer operator. We illustrate the theory on a controlled 2D example and empirically evaluate the proposed objective on standard image datasets (MNIST, CelebA, CIFAR-10).

[LG-94] MK-SGC-SC: Multiple Kernel guided Sparse Graph Construction in Spectral Clustering for Unsupervised Speaker Diarization

Link: https://arxiv.org/abs/2601.19946
Authors: Nikhil Raghav, Avisek Gupta, Swagatam Das, Md Sahidullah
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Comments: 5 pages

Abstract:Speaker diarization aims to segment audio recordings into regions corresponding to individual speakers. Although unsupervised speaker diarization is inherently challenging, the prospect of identifying speaker regions without pretraining or weak supervision motivates research on clustering techniques. In this work, we share the notable observation that measuring multiple kernel similarities of speaker embeddings to thereafter craft a sparse graph for spectral clustering in a principled manner is sufficient to achieve state-of-the-art performances in a fully unsupervised setting. Specifically, we consider four polynomial kernels and a degree one arccosine kernel to measure similarities in speaker embeddings, using which sparse graphs are constructed in a principled manner to emphasize local similarities. Experiments show the proposed approach excels in unsupervised speaker diarization over a variety of challenging environments in the DIHARD-III, AMI, and VoxConverse corpora. To encourage further research, our implementations are available at this https URL.

Information Retrieval

[IR-0] MedViz: An Agent-based Visual-guided Research Assistant for Navigating Biomedical Literature

Link: https://arxiv.org/abs/2601.20709
Authors: Huan He, Xueqing Peng, Yutong Xie, Qijia Liu, Chia-Hsuan Chang, Lingfei Qian, Brian Ondov, Qiaozhu Mei, Hua Xu
Subjects: Information Retrieval (cs.IR)

Abstract:Biomedical researchers face increasing challenges in navigating millions of publications in diverse domains. Traditional search engines typically return articles as ranked text lists, offering little support for global exploration or in-depth analysis. Although recent advances in generative AI and large language models have shown promise in tasks such as summarization, extraction, and question answering, their dialog-based implementations are poorly integrated with literature search workflows. To address this gap, we introduce MedViz, a visual analytics system that integrates multiple AI agents with interactive visualization to support the exploration of the large-scale biomedical literature. MedViz combines a semantic map of millions of articles with agent-driven functions for querying, summarizing, and hypothesis generation, allowing researchers to iteratively refine questions, identify trends, and uncover hidden connections. By bridging intelligent agents with interactive visualization, MedViz transforms biomedical literature search into a dynamic, exploratory process that accelerates knowledge discovery.

[IR-1] Overview of the TREC 2025 Tip-of-the-Tongue track

Link: https://arxiv.org/abs/2601.20671
Authors: Jaime Arguello, Fernando Diaz, Maik Fröebe, To Eun Kim, Bhaskar Mitra
Subjects: Information Retrieval (cs.IR)

Abstract:Tip-of-the-tongue (ToT) known-item retrieval involves re-finding an item for which the searcher does not reliably recall an identifier. ToT information requests (or queries) are verbose and tend to include several complex phenomena, making them especially difficult for existing information retrieval systems. The TREC 2025 ToT track focused on a single ad-hoc retrieval task. This year, we extended the track to general domain and incorporated different sets of test queries from diverse sources, namely from the MS-ToT dataset, manual topic development, and LLM-based synthetic query generation. This year, 9 groups (including the track coordinators) submitted 32 runs.

[IR-2] TGSBM: Transformer-Guided Stochastic Block Model for Link Prediction

Link: https://arxiv.org/abs/2601.20646
Authors: Zhejian Yang, Songwei Zhao, Zilin Zhao, Hechang Chen
Subjects: Social and Information Networks (cs.SI); Information Retrieval (cs.IR)
Comments: 12 pages, 4 figures

Abstract:Link prediction is a cornerstone of the Web ecosystem, powering applications from recommendation and search to knowledge graph completion and collaboration forecasting. However, large-scale networks present unique challenges: they contain hundreds of thousands of nodes and edges with heterogeneous and overlapping community structures that evolve over time. Existing approaches face notable limitations: traditional graph neural networks struggle to capture global structural dependencies, while recent graph transformers achieve strong performance but incur quadratic complexity and lack interpretable latent structure. We propose TGSBM (Transformer-Guided Stochastic Block Model), a framework that integrates the principled generative structure of Overlapping Stochastic Block Models (OSBM) with the representational power of sparse Graph Transformers. TGSBM comprises three main components: (i) expander-augmented sparse attention that enables near-linear complexity and efficient global mixing, (ii) a neural variational encoder that infers structured posteriors over community memberships and strengths, and (iii) a neural edge decoder that reconstructs links via OSBM’s generative process, preserving interpretability. Experiments across diverse benchmarks demonstrate competitive performance (mean rank 1.6 under HeaRT protocol), superior scalability (up to $6\times$ faster training), and interpretable community structures. These results position TGSBM as a practical approach that strikes a balance between accuracy, efficiency, and transparency for large-scale link prediction.

[IR-3] When Vision Meets Texts in Listwise Reranking

Link: https://arxiv.org/abs/2601.20623
Authors: Hongyi Cai
Subjects: Information Retrieval (cs.IR)

Abstract:Recent advancements in information retrieval have highlighted the potential of integrating visual and textual information, yet effective reranking for image-text documents remains challenging due to the modality gap and scarcity of aligned datasets. Meanwhile, existing approaches often rely on large models (7B to 32B parameters) with reasoning-based distillation, incurring unnecessary computational overhead while primarily focusing on textual modalities. In this paper, we propose Rank-Nexus, a multimodal image-text document reranker that performs listwise qualitative reranking on retrieved lists incorporating both images and texts. To bridge the modality gap, we introduce a progressive cross-modal training strategy. We first train modalities separately: leveraging abundant text reranking data, we distill knowledge into the text branch. For images, where data is scarce, we construct distilled pairs from multimodal large language model (MLLM) captions on image retrieval benchmarks. Subsequently, we distill a joint image-text reranking dataset. Rank-Nexus achieves outstanding performance on text reranking benchmarks (TREC, BEIR) and the challenging image reranking benchmark (INQUIRE, MMDocIR), using only a lightweight 2B pretrained visual-language model. This efficient design ensures strong generalization across diverse multimodal scenarios without excessive parameters or reasoning overhead.

[IR-4] On Every Note a Griff: Looking for a Useful Representation of Basso Continuo Performance Style

Link: https://arxiv.org/abs/2601.20478
Authors: Adam Štefunko, Carlos Eduardo Cancino-Chacón, Jan Hajič jr
Subjects: Sound (cs.SD); Information Retrieval (cs.IR)
Comments: 6 pages, 5 figures, accepted to the Music Encoding Conference (MEC) 2026

Abstract:Basso continuo is a baroque improvisatory accompaniment style which involves improvising multiple parts above a given bass line in a musical score on a harpsichord or organ. Basso continuo is not merely a matter of history; moreover, it is a historically inspired living practice, and The Aligned Continuo Dataset (ACoRD) records the first sample of modern-day basso continuo playing in the symbolic domain. This dataset, containing 175 MIDI recordings of 5 basso continuo scores performed by 7 players, allows us to start observing and analyzing the variety that basso continuo improvisation brings. A recently proposed basso continuo performance-to-score alignment system provides a way of mapping improvised performance notes to score notes. In order to study aligned basso continuo performances, we need an appropriate feature representation. We propose griff, a representation inspired by historical basso continuo treatises. It enables us to encode both pitch content and structure of a basso continuo realization in a transposition-invariant way. Griffs are directly extracted from aligned basso continuo performances by grouping together performance notes aligned to the same score note in an onset-time ordered way, and they provide meaningful tokens that form a feature space in which we can analyze basso continuo performance styles. We statistically describe griffs extracted from the ACoRD dataset recordings, and show in two experiments how griffs can be used for statistical analysis of individuality of different players’ basso continuo performance styles. We finally present an argument why it is desirable to preserve the structure of a basso continuo improvisation in order to conduct a refined analysis of personal performance styles of individual basso continuo practitioners, and why griffs can provide a meaningful historically informed feature space worthy of a more robust empirical validation.
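
The extraction step the abstract describes (group performance notes by the score note they are aligned to, order them by onset) can be sketched in a few lines. The abstract does not give the actual griff encoding, so this is only one plausible reading: the function `extract_griffs`, the `(score_note_id, onset_seconds, midi_pitch)` tuple layout, and the choice of "semitones above the bass note" as the transposition-invariant encoding are all assumptions, and the sample notes are invented.

```python
from collections import defaultdict


def extract_griffs(performance_notes, bass_pitches):
    """Group aligned notes per score note, onset-ordered, as intervals above the bass.

    performance_notes: iterable of (score_note_id, onset_seconds, midi_pitch)
    bass_pitches: dict mapping score_note_id -> MIDI pitch of the score bass note
    """
    groups = defaultdict(list)
    for score_id, onset, pitch in performance_notes:
        groups[score_id].append((onset, pitch))

    griffs = {}
    for score_id, notes in groups.items():
        notes.sort()  # onset-time order (ties broken by pitch)
        bass = bass_pitches[score_id]
        # Intervals relative to the bass do not change under transposition.
        griffs[score_id] = tuple(pitch - bass for _, pitch in notes)
    return griffs


# Invented example: two bass notes, each realized with a few upper notes.
notes = [
    ("n1", 0.02, 52), ("n1", 0.00, 48), ("n1", 0.05, 55),
    ("n2", 1.00, 50), ("n2", 1.01, 57),
]
bass = {"n1": 36, "n2": 38}
griffs = extract_griffs(notes, bass)

# The same performance shifted up a whole tone yields identical tokens.
transposed = extract_griffs(
    [(s, t, p + 2) for s, t, p in notes],
    {k: v + 2 for k, v in bass.items()},
)
print(griffs, griffs == transposed)
```

Keeping the onset order inside each tuple preserves how a chord was spread in time, not just which pitches it contained, which is the structural information the abstract argues should be retained.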

[IR-5] Eliminating Hallucination in Diffusion-Augmented Interactive Text-to-Image Retrieval

Link: https://arxiv.org/abs/2601.20391
Authors: Zhuocheng Zhang, Kangheng Liang, Guanxuan Li, Paul Henderson, Richard Mccreadie, Zijun Long
Subjects: Information Retrieval (cs.IR)

Abstract:Diffusion-Augmented Interactive Text-to-Image Retrieval (DAI-TIR) is a promising paradigm that improves retrieval performance by generating query images via diffusion models and using them as additional "views" of the user’s intent. However, these generative views can be incorrect because diffusion generation may introduce hallucinated visual cues that conflict with the original query text. Indeed, we empirically demonstrate that these hallucinated cues can substantially degrade DAI-TIR performance. To address this, we propose Diffusion-aware Multi-view Contrastive Learning (DMCL), a hallucination-robust training framework that casts DAI-TIR as joint optimization over representations of query intent and the target image. DMCL introduces semantic-consistency and diffusion-aware contrastive objectives to align textual and diffusion-generated query views while suppressing hallucinated query signals. This yields an encoder that acts as a semantic filter, effectively mapping hallucinated cues into a null space, improving robustness to spurious cues and better representing the user’s intent. Attention visualization and geometric embedding-space analyses corroborate this filtering behavior. Across five standard benchmarks, DMCL delivers consistent improvements in multi-round Hits@10, reaching as high as 7.37% over prior fine-tuned and zero-shot baselines, which indicates it is a general and robust training framework for DAI-TIR.

[IR-6] Towards End-to-End Alignment of User Satisfaction via Questionnaire in Video Recommendation

Link: https://arxiv.org/abs/2601.20215
Authors: Na Li, Jiaqi Yu, Minzhi Xie, Tiantian He, Xiaoxiao Xu, Zixiu Wang, Lantao Hu, Yongqi Liu, Han Li, Kaiqiao Zhan, Kun Gai
Subjects: Information Retrieval (cs.IR)

Abstract:Short-video recommender systems typically optimize ranking models using dense user behavioral signals, such as clicks and watch time. However, these signals are only indirect proxies of user satisfaction and often suffer from noise and bias. Recently, explicit satisfaction feedback collected through questionnaires has emerged as a high-quality direct alignment supervision, but is extremely sparse and easily overwhelmed by abundant behavioral data, making it difficult to incorporate into online recommendation models. To address these challenges, we propose a novel framework, EASQ (towards End-to-End Alignment of user Satisfaction via Questionnaire), to enable real-time alignment of ranking models with true user satisfaction. Specifically, we first construct an independent parameter pathway for sparse questionnaire signals by combining a multi-task architecture and a lightweight LoRA module. The multi-task design separates sparse satisfaction supervision from dense behavioral signals, preventing the former from being overwhelmed. The LoRA module pre-injects these preferences in a parameter-isolated manner, ensuring stability in the backbone while optimizing user satisfaction. Furthermore, we employ a DPO-based optimization objective tailored for online learning, which aligns the main model outputs with sparse satisfaction signals in real time. This design enables end-to-end online learning, allowing the model to continuously adapt to new questionnaire feedback while maintaining the stability and effectiveness of the backbone. Extensive offline experiments and large-scale online A/B tests demonstrate that EASQ consistently improves user satisfaction metrics across multiple scenarios. EASQ has been successfully deployed in a production short-video recommendation system, delivering significant and stable business gains.

[IR-7] High-Resolution Mapping of Port Dynamics from Open-Access AIS Data in Tokyo Bay ATC

Link: https://arxiv.org/abs/2601.20211
Authors: Moritz Hütten
Subjects: Computational Engineering, Finance, and Science (cs.CE); Information Retrieval (cs.IR)
Comments: 29 pages, 18 figures, and 7 tables, matching the version published in Geomatics. Accompanying research data are available at this https URL

Abstract:Knowledge about vessel activity in port areas and around major industrial zones provides insights into economic trends, supports decision-making for shipping and port operators, and contributes to maritime safety. Vessel data from terrestrial receivers of the Automatic Identification System (AIS) have become increasingly openly available, and we demonstrate that such data can be used to infer port activities at high resolution and with precision comparable to official statistics. We analyze open-access AIS data from a three-month period in 2024 for Tokyo Bay, located in Japan’s most densely populated urban region. Accounting for uneven data coverage, we reconstruct vessel activity in Tokyo Bay at $\sim 30$ m resolution and identify 161 active berths across seven major port areas in the bay. During the analysis period, we find an average of $35 \pm 17_{\text{stat}}$ vessels moving within the bay at any given time, and $293 \pm 22_{\text{stat}}\,{}^{+65_{\text{syst}}}_{-10_{\text{syst}}}$ vessels entering or leaving the bay daily, with an average gross tonnage of $11{,}860^{+280}_{-50}$. These figures indicate an accelerating long-term trend toward fewer but larger vessels in Tokyo Bay’s commercial traffic. Furthermore, we find that in dense urban environments, radio shadows in vessel AIS data can reveal the precise locations of inherently passive receiver stations.

[IR-8] MERGE: Next-Generation Item Indexing Paradigm for Large-Scale Streaming Recommendation

Link: https://arxiv.org/abs/2601.20199
Authors: Jing Yan, Yimeng Bai, Zongyu Liu, Yahui Liu, Junwei Wang, Jingze Huang, Haoda Li, Sihao Ding, Shaohui Ruan, Yang Zhang
Subjects: Information Retrieval (cs.IR)

Abstract:Item indexing, which maps a large corpus of items into compact discrete representations, is critical for both discriminative and generative recommender systems, yet existing Vector Quantization (VQ)-based approaches struggle with the highly skewed and non-stationary item distributions common in streaming industry recommenders, leading to poor assignment accuracy, imbalanced cluster occupancy, and insufficient cluster separation. To address these challenges, we propose MERGE, a next-generation item indexing paradigm that adaptively constructs clusters from scratch, dynamically monitors cluster occupancy, and forms hierarchical index structures via fine-to-coarse merging. Extensive experiments demonstrate that MERGE significantly improves assignment accuracy, cluster uniformity, and cluster separation compared with existing indexing methods, while online A/B tests show substantial gains in key business metrics, highlighting its potential as a foundational indexing approach for large-scale recommendation.

[IR-9] IMRNNs: An Efficient Method for Interpretable Dense Retrieval via Embedding Modulation EACL2026

Link: https://arxiv.org/abs/2601.20084
Authors: Yash Saxena, Ankur Padia, Kalpa Gunaratna, Manas Gaur
Subjects: Information Retrieval (cs.IR)
Comments: Accepted in EACL 2026

Abstract:Interpretability in black-box dense retrievers remains a central challenge in Retrieval-Augmented Generation (RAG). Understanding how queries and documents semantically interact is critical for diagnosing retrieval behavior and improving model design. However, existing dense retrievers rely on static embeddings for both queries and documents, which obscures this bidirectional relationship. Post-hoc approaches such as re-rankers are computationally expensive, add inference latency, and still fail to reveal the underlying semantic alignment. To address these limitations, we propose Interpretable Modular Retrieval Neural Networks (IMRNNs), a lightweight framework that augments any dense retriever with dynamic, bidirectional modulation at inference time. IMRNNs employ two independent adapters: one conditions document embeddings on the current query, while the other refines the query embedding using corpus-level feedback from initially retrieved documents. This iterative modulation process enables the model to adapt representations dynamically and expose interpretable semantic dependencies between queries and documents. Empirically, IMRNNs not only enhance interpretability but also improve retrieval effectiveness. Across seven benchmark datasets, applying our method to standard dense retrievers yields average gains of +6.35% nDCG, +7.14% recall, and +7.04% MRR over state-of-the-art baselines. These results demonstrate that incorporating interpretability-driven modulation can both explain and enhance retrieval in RAG systems.

Attachment Download

Click to download the full list of today's papers