This post contains the latest paper listing retrieved from Arxiv.org on 2026-01-19. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive it by scheduled email, please leave your email address in the comments.

Note: the daily paper data is retrieved from Arxiv.org and updated automatically around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2026-01-19)

404 papers were updated today, including:

  • Natural Language Processing: 68 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 101 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 64 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 105 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] How Long Is a Piece of String? A Brief Empirical Analysis of Tokenizers

【Quick Read】: This paper questions the assumption that the token is a uniform unit of measurement across large language models (LLMs), i.e., whether tokens are truly consistent across models and text domains. The study finds that tokenization varies significantly across models and text distributions, so directly using token counts for model comparison or inference pricing is biased. The key to the approach is a comprehensive empirical analysis quantifying the variability of token compression, showing that common heuristics about token lengths are overly simplistic and offering a clearer, more intuitive baseline for understanding tokenization in contemporary LLMs.

Link: https://arxiv.org/abs/2601.11518
Authors: Jonathan Roberts, Kai Han, Samuel Albanie
Affiliations: University of Cambridge; The University of Hong Kong
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Frontier LLMs are increasingly utilised across academia, society and industry. A commonly used unit for comparing models, their inputs and outputs, and estimating inference pricing is the token. In general, tokens are used as a stable currency, assumed to be broadly consistent across tokenizers and contexts, enabling direct comparisons. However, tokenization varies significantly across models and domains of text, making naive interpretation of token counts problematic. We quantify this variation by providing a comprehensive empirical analysis of tokenization, exploring the compression of sequences to tokens across different distributions of textual data. Our analysis challenges commonly held heuristics about token lengths, finding them to be overly simplistic. We hope the insights of our study add clarity and intuition toward tokenization in contemporary LLMs.
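As a rough illustration of the kind of measurement the paper describes, the sketch below compares "compression" (characters per token) for two toy tokenizers across two text samples. The tokenizers and texts are stand-ins for illustration only; a real study would use actual BPE tokenizers (e.g., via the `tiktoken` or `tokenizers` libraries):

```python
# Toy sketch: characters-per-token "compression" varies with both the
# tokenizer and the text domain, which is why raw token counts are not
# directly comparable across models.

def whitespace_tokenize(text):
    return text.split()

def char_bigram_tokenize(text):
    return [text[i:i + 2] for i in range(0, len(text), 2)]

def chars_per_token(text, tokenize):
    tokens = tokenize(text)
    return len(text) / max(len(tokens), 1)

samples = {
    "prose": "Frontier LLMs are increasingly utilised across academia.",
    "code":  "for i in range(10): print(i * i)",
}

for domain, text in samples.items():
    for name, tok in [("whitespace", whitespace_tokenize),
                      ("bigram", char_bigram_tokenize)]:
        print(f"{domain:5s} {name:10s} {chars_per_token(text, tok):.2f} chars/token")
```

Even in this caricature, the ratio differs by tokenizer and by domain, mirroring the paper's point that tokens are not a stable currency.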

[NLP-1] Do explanations generalize across large reasoning models?

【Quick Read】: This paper asks whether the chain-of-thought (CoT) explanations produced by large reasoning models (LRMs) generalize, i.e., whether they capture general patterns about the underlying problem rather than model-specific idiosyncrasies. The key to the solution is an empirical test of whether different LRMs behave consistently when given the same CoT explanation, using cross-model consistency as the measure of generalization. The study shows that CoT explanations often do generalize in this sense, that this consistency correlates with human preference rankings and with reinforcement-learning post-training, and it further proposes a sentence-level ensembling strategy to improve consistency, providing an actionable framework for assessing and improving the generalization of LRM explanations.

Link: https://arxiv.org/abs/2601.11517
Authors: Koyena Pal, David Bau, Chandan Singh
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large reasoning models (LRMs) produce a textual chain of thought (CoT) in the process of solving a problem, which serves as a potentially powerful tool to understand the problem by surfacing a human-readable, natural-language explanation. However, it is unclear whether these explanations generalize, i.e. whether they capture general patterns about the underlying problem rather than patterns which are esoteric to the LRM. This is a crucial question in understanding or discovering new concepts, e.g. in AI for science. We study this generalization question by evaluating a specific notion of generalizability: whether explanations produced by one LRM induce the same behavior when given to other LRMs. We find that CoT explanations often exhibit this form of generalization (i.e. they increase consistency between LRMs) and that this increased generalization is correlated with human preference rankings and post-training with reinforcement learning. We further analyze the conditions under which explanations yield consistent answers and propose a straightforward, sentence-level ensembling strategy that improves consistency. Taken together, these results prescribe caution when using LRM explanations to yield new insights and outline a framework for characterizing LRM explanation generalization.
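The cross-model consistency notion above can be sketched with a minimal agreement score: the fraction of models whose answer matches the majority answer, compared with and without a shared explanation. This is a hypothetical illustration of the evaluation idea, not the paper's actual metric:

```python
from collections import Counter

def consistency(answers):
    """Fraction of answers agreeing with the most common answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

# Hypothetical answers from four models to the same question,
# with and without being given the same CoT explanation.
with_cot = ["B", "B", "B", "A"]
without_cot = ["B", "A", "C", "A"]

print("consistency with CoT   :", consistency(with_cot))
print("consistency without CoT:", consistency(without_cot))
```

A generalizable explanation, in the paper's sense, is one that raises this agreement when passed to other models.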

[NLP-2] Building Production-Ready Probes For Gemini

【Quick Read】: This paper addresses the failure of activation probes in deployed generative AI systems under distribution shift, focusing on the shift from short-context to long-context inputs. The key to the solution is new probe architectures that are robust to long-context shift, with the caveat that architecture alone is insufficient: broad generalization requires combining specific architectural choices with training on diverse data distributions. Moreover, pairing probes with a prompted classifier yields accurate and computationally cheap misuse detection, which has informed successful deployment in the Gemini family of models.

Link: https://arxiv.org/abs/2601.11516
Authors: János Kramár, Joshua Engels, Zheng Wang, Bilal Chughtai, Rohin Shah, Neel Nanda, Arthur Conmy
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Frontier language model capabilities are improving rapidly. We thus need stronger mitigations against bad actors misusing increasingly powerful systems. Prior work has shown that activation probes may be a promising misuse mitigation technique, but we identify a key remaining challenge: probes fail to generalize under important production distribution shifts. In particular, we find that the shift from short-context to long-context inputs is difficult for existing probe architectures. We propose several new probe architectures that handle this long-context distribution shift. We evaluate these probes in the cyber-offensive domain, testing their robustness against various production-relevant shifts, including multi-turn conversations, static jailbreaks, and adaptive red teaming. Our results demonstrate that while multimax addresses context length, a combination of architecture choice and training on diverse distributions is required for broad generalization. Additionally, we show that pairing probes with prompted classifiers achieves optimal accuracy at a low cost due to the computational efficiency of probes. These findings have informed the successful deployment of misuse mitigation probes in user-facing instances of Gemini, Google’s frontier language model. Finally, we find early positive results using AlphaEvolve to automate improvements in both probe architecture search and adaptive red teaming, showing that automating some AI safety research is already possible.
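To build intuition for why context length breaks naive probes, the toy sketch below trains a logistic-regression probe on synthetic activations where a "misuse" signal occupies only one token position per positive example: mean-pooling over tokens dilutes the signal as context grows, while max-pooling keeps it. This is only loosely in the spirit of the paper's length-robust architectures (all shapes, data, and the pooling choice are assumptions for illustration), not its actual design:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations: (examples, tokens, channels).
n, t, d = 200, 50, 8
acts = rng.normal(size=(n, t, d))
labels = rng.integers(0, 2, size=n)
# Plant a localized signal at one random token position per positive example.
for i in np.flatnonzero(labels):
    acts[i, rng.integers(t), 0] += 4.0

def pooled_features(a, pool):
    X = a.max(axis=1) if pool == "max" else a.mean(axis=1)
    return X - X.mean(axis=0)  # center so a bias-free probe can separate

def train_probe(X, y, lr=0.5, steps=300):
    """Logistic-regression probe trained by plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

for pool in ("mean", "max"):
    X = pooled_features(acts, pool)
    w = train_probe(X, labels)
    acc = ((X @ w > 0) == labels).mean()
    print(f"{pool}-pooled probe accuracy: {acc:.2f}")
```

The longer the context, the more a mean-pooled probe washes out a localized signal, which is one face of the short-to-long-context shift the paper studies.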

[NLP-3] The Poisoned Apple Effect: Strategic Manipulation of Mediated Markets via Technology Expansion of AI Agents

【Quick Read】: This paper studies how integrating AI agents into economic markets reshapes strategic interaction and challenges regulatory design. The core question is whether introducing more AI technology options changes equilibrium payoffs and regulatory outcomes in canonical game-theoretic settings (bargaining, negotiation, and persuasion), and whether new strategic behaviors can undermine fairness. The key finding is the "Poisoned Apple" effect: a party may release a new technology that neither it nor its opponent ultimately uses, solely to manipulate the regulator's choice of market mechanism, improving its own welfare at the expense of the opponent and the regulator's fairness objectives. This shows that static regulatory frameworks can be exploited via technology expansion, so dynamic market designs that adapt to evolving AI capabilities are needed.

Link: https://arxiv.org/abs/2601.11496
Authors: Eilam Shapira, Roi Reichart, Moshe Tennenholtz
Affiliations: Unknown
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:

Abstract:The integration of AI agents into economic markets fundamentally alters the landscape of strategic interaction. We investigate the economic implications of expanding the set of available technologies in three canonical game-theoretic settings: bargaining (resource division), negotiation (asymmetric information trade), and persuasion (strategic information transmission). We find that simply increasing the choice of AI delegates can drastically shift equilibrium payoffs and regulatory outcomes, often creating incentives for regulators to proactively develop and release technologies. Conversely, we identify a strategic phenomenon termed the “Poisoned Apple” effect: an agent may release a new technology, which neither they nor their opponent ultimately uses, solely to manipulate the regulator’s choice of market design in their favor. This strategic release improves the releaser’s welfare at the expense of their opponent and the regulator’s fairness objectives. Our findings demonstrate that static regulatory frameworks are vulnerable to manipulation via technology expansion, necessitating dynamic market designs that adapt to the evolving landscape of AI capabilities.

[NLP-4] CTest-Metric: A Unified Framework to Assess Clinical Validity of Metrics for CT Report Generation

【Quick Read】: This paper addresses the lack of a unified, reliable framework for assessing evaluation metrics for radiology report generation (RRG) in the generative AI era: widely used natural language generation (NLG) metrics poorly reflect clinical utility, and there has been no systematic assessment of their sensitivity to stylistic variation and factual errors, or of their agreement with expert judgment. The core of the solution is CTest-Metric, the first unified metric-assessment framework, with three modules: (i) LLM-based rephrasing to test Writing Style Generalizability (WSG); (ii) Synthetic Error Injection (SEI) at graded severities to test sensitivity to factual errors; and (iii) Metrics-vs-Expert correlation (MvE) using clinician ratings on 175 "disagreement" cases. The framework provides the first systematic quantification of the robustness and clinical applicability of RRG evaluation metrics, and a reproducible standard for developing and validating future high-quality medical text generation models.

Link: https://arxiv.org/abs/2601.11488
Authors: Vanshali Sharma, Andrea Mia Bejar, Gorkem Durak, Ulas Bagci
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ISBI 2026

Abstract:In the generative AI era, where even critical medical tasks are increasingly automated, radiology report generation (RRG) continues to rely on suboptimal metrics for quality assessment. Developing domain-specific metrics has therefore been an active area of research, yet it remains challenging due to the lack of a unified, well-defined framework to assess their robustness and applicability in clinical contexts. To address this, we present CTest-Metric, a first unified metric assessment framework with three modules determining the clinical feasibility of metrics for CT RRG. The modules test: (i) Writing Style Generalizability (WSG) via LLM-based rephrasing; (ii) Synthetic Error Injection (SEI) at graded severities; and (iii) Metrics-vs-Expert correlation (MvE) using clinician ratings on 175 “disagreement” cases. Eight widely used metrics (BLEU, ROUGE, METEOR, BERTScore-F1, F1-RadGraph, RaTEScore, GREEN Score, CRG) are studied across seven LLMs built on a CT-CLIP encoder. Using our novel framework, we found that lexical NLG metrics are highly sensitive to stylistic variations; GREEN Score aligns best with expert judgments (Spearman~0.70), while CRG shows negative correlation; and BERTScore-F1 is least sensitive to factual error injection. We will release the framework, code, and allowable portion of the anonymized evaluation data (rephrased/error-injected CT reports), to facilitate reproducible benchmarking and future metric development.

[NLP-5] MHA2MLA-VLM: Enabling DeepSeek's Economical Multi-Head Latent Attention across Vision-Language Models

【Quick Read】: This paper targets the memory and compute bottlenecks caused by rapid Key-Value (KV) cache growth during vision-language model (VLM) inference. Multi-Head Latent Attention (MLA) can compress the KV cache and accelerate inference, but adapting general-purpose VLMs to the MLA architecture usually requires costly pretraining, limiting practicality. The key to the proposed MHA2MLA-VLM framework is two techniques enabling efficient, pretraining-free conversion: a modality-adaptive partial-RoPE strategy that selectively masks nonessential dimensions to support both unimodal and multimodal settings, and a modality-decoupled low-rank approximation that compresses the visual and textual KV spaces independently. In addition, a parameter-efficient fine-tuning strategy is used, with the finding that minimizing output activation error rather than parameter distance substantially reduces performance loss, so the original model's performance is restored with minimal supervised data while the KV cache footprint is greatly reduced and compatibility with quantization is preserved.

Link: https://arxiv.org/abs/2601.11464
Authors: Xiaoran Fan, Zhichao Sun, Tao Ji, Lixing Shen, Tao Gui
Affiliations: 1. University of Science and Technology of China; 2. Huawei Technologies Co., Ltd.; 3. Institute of Automation, Chinese Academy of Sciences; 4. Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:As vision-language models (VLMs) tackle increasingly complex and multimodal tasks, the rapid growth of Key-Value (KV) cache imposes significant memory and computational bottlenecks during inference. While Multi-Head Latent Attention (MLA) offers an effective means to compress the KV cache and accelerate inference, adapting existing VLMs to the MLA architecture without costly pretraining remains largely unexplored. In this work, we present MHA2MLA-VLM, a parameter-efficient and multimodal-aware framework for converting off-the-shelf VLMs to MLA. Our approach features two core techniques: (1) a modality-adaptive partial-RoPE strategy that supports both traditional and multimodal settings by selectively masking nonessential dimensions, and (2) a modality-decoupled low-rank approximation method that independently compresses the visual and textual KV spaces. Furthermore, we introduce parameter-efficient fine-tuning to minimize adaptation cost and demonstrate that minimizing output activation error, rather than parameter distance, substantially reduces performance loss. Extensive experiments on three representative VLMs show that MHA2MLA-VLM restores original model performance with minimal supervised data, significantly reduces KV cache footprint, and integrates seamlessly with KV quantization.
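The modality-decoupled low-rank idea can be illustrated with a toy truncated-SVD sketch: if the visual and textual key spaces have different intrinsic structure, compressing each with its own rank budget is cheaper than compressing them jointly. All shapes, ranks, and the SVD procedure below are hypothetical stand-ins, not the paper's actual method:

```python
import numpy as np

rng = np.random.default_rng(0)

def low_rank(W, r):
    """Best rank-r approximation of W via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def rel_err(W, r):
    return np.linalg.norm(W - low_rank(W, r)) / np.linalg.norm(W)

d = 32
# Toy key matrices for two modalities with different low-rank structure.
W_vis = rng.normal(size=(d, 4)) @ rng.normal(size=(4, d))  # ~rank 4
W_txt = rng.normal(size=(d, 6)) @ rng.normal(size=(6, d))  # ~rank 6

# Decoupled: each modality gets its own rank budget, near-exact recovery.
print("visual, rank 4 :", f"{rel_err(W_vis, 4):.6f}")
print("textual, rank 6:", f"{rel_err(W_txt, 6):.6f}")
# Joint compression of the stacked matrix at the same rank loses information.
W_joint = np.vstack([W_vis, W_txt])
print("joint, rank 6  :", f"{rel_err(W_joint, 6):.4f}")
```

The decoupled budgets recover each matrix almost exactly, while the joint rank-6 approximation cannot, mirroring why per-modality compression can be more economical.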

[NLP-6] Interactive Narrative Analytics: Bridging Computational Narrative Extraction and Human Sensemaking

【Quick Read】: This paper addresses the challenge that information overload and misinformation pose for extracting meaningful narratives from large-scale news data. The key to the solution is defining the nascent field of Interactive Narrative Analytics (INA), which combines computational narrative extraction methods with interactive visual analytics so that humans can reason and make decisions while exploring narrative structures, enabling human-machine collaborative narrative understanding and sensemaking.

Link: https://arxiv.org/abs/2601.11459
Authors: Brian Keith
Affiliations: Universidad Católica del Norte
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR)
Comments: 17 pages, 5 figures, published in IEEE Access as open access paper

Abstract:Information overload and misinformation create significant challenges in extracting meaningful narratives from large news collections. This paper defines the nascent field of Interactive Narrative Analytics (INA), which combines computational narrative extraction with interactive visual analytics to support sensemaking. INA approaches enable the interactive exploration of narrative structures through computational methods and visual interfaces that facilitate human interpretation. The field faces challenges in scalability, interactivity, knowledge integration, and evaluation standardization, yet offers promising opportunities across news analysis, intelligence, scientific literature exploration, and social media analysis. Through the combination of computational and human insight, INA addresses complex challenges in narrative sensemaking.

[NLP-7] Predict the Retrieval! Test time adaptation for Retrieval Augmented Generation

【Quick Read】: This paper addresses the degraded generalization of Retrieval-Augmented Generation (RAG) systems under distribution shift when adapting to specialized domains. The key to the solution is TTARAG, a test-time adaptation method that dynamically updates the language model's parameters during inference: by training the model to predict the retrieved content, parameters are automatically adjusted to the target domain, substantially improving question-answering accuracy and robustness in specialized settings.

Link: https://arxiv.org/abs/2601.11443
Authors: Xin Sun, Zhongqi Chen, Qiang Liu, Shu Wu, Bowen Song, Weiqiang Wang, Zilei Wang, Liang Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful approach for enhancing large language models’ question-answering capabilities through the integration of external knowledge. However, when adapting RAG systems to specialized domains, challenges arise from distribution shifts, resulting in suboptimal generalization performance. In this work, we propose TTARAG, a test-time adaptation method that dynamically updates the language model’s parameters during inference to improve RAG system performance in specialized domains. Our method introduces a simple yet effective approach where the model learns to predict retrieved content, enabling automatic parameter adjustment to the target domain. Through extensive experiments across six specialized domains, we demonstrate that TTARAG achieves substantial performance improvements over baseline RAG systems. Code available at this https URL.
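The "predict the retrieval" idea can be caricatured with a toy bigram language model that takes a few gradient steps on the retrieved passage at test time, lowering its loss on in-domain continuations. The model, data, and update rule below are drastically simplified assumptions for illustration, not TTARAG's actual procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

V = 5  # toy vocabulary size
logits = rng.normal(scale=0.1, size=(V, V))  # logits[i, j]: score of j after i

def nll(logits, seq):
    """Average negative log-likelihood of a token sequence."""
    total = 0.0
    for prev, nxt in zip(seq, seq[1:]):
        p = np.exp(logits[prev]) / np.exp(logits[prev]).sum()
        total -= np.log(p[nxt])
    return total / (len(seq) - 1)

def adapt(logits, retrieved, lr=0.5, steps=20):
    """Test-time adaptation sketch: gradient steps on predicting the
    retrieved passage (softmax cross-entropy gradient is p - onehot)."""
    logits = logits.copy()
    for _ in range(steps):
        grad = np.zeros_like(logits)
        for prev, nxt in zip(retrieved, retrieved[1:]):
            p = np.exp(logits[prev]) / np.exp(logits[prev]).sum()
            grad[prev] += p
            grad[prev, nxt] -= 1.0
        logits -= lr * grad / (len(retrieved) - 1)
    return logits

retrieved = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]  # stand-in retrieved passage
continuation = [1, 2, 3]                    # in-domain test sequence

before = nll(logits, continuation)
after = nll(adapt(logits, retrieved), continuation)
print(f"NLL before adaptation: {before:.3f}")
print(f"NLL after adaptation:  {after:.3f}")
```

After a few steps on the retrieved text, the model assigns higher likelihood to in-domain continuations, which is the intuition behind adapting parameters at inference time.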

[NLP-8] Hierarchical Orthogonal Residual Spread for Precise Massive Editing in Large Language Models ICASSP2026

【Quick Read】: This paper targets safety concerns in deploying large language models (LLMs), in particular the heavy computation, noisy gradients, and knowledge conflicts that existing model-editing methods incur when blending new and old knowledge. The key to the solution is a new editing framework, Hierarchical Orthogonal Residual SprEad (HORSE), which optimizes updates to the information matrix through a hierarchical orthogonal residual spread, reducing gradient noise and improving editing stability, thereby enabling efficient and precise massive model editing.

Link: https://arxiv.org/abs/2601.11441
Authors: Xiaojie Gu, Guangxu Chen, Yuheng Yang, Jingxin Han, Andi Zhang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICASSP 2026

Abstract:Large language models (LLMs) exhibit exceptional performance across various domains, yet they face critical safety concerns. Model editing has emerged as an effective approach to mitigate these issues. Existing model editing methods often focus on optimizing an information matrix that blends new and old knowledge. While effective, these approaches can be computationally expensive and may cause conflicts. In contrast, we shift our attention to Hierarchical Orthogonal Residual SprEad of the information matrix, which reduces noisy gradients and enables more stable edits from a different perspective. We demonstrate the effectiveness of our method HORSE through a clear theoretical comparison with several popular methods and extensive experiments conducted on two datasets across multiple LLMs. The results show that HORSE maintains precise massive editing across diverse scenarios. The code is available at this https URL

[NLP-9] The Unreasonable Effectiveness of Pattern Matching

【Quick Read】: This paper addresses ongoing controversies about what large language models (LLMs) essentially are: language mimics, databases, or a blurry version of the Web. The study shows that LLMs can recover meaning from heavily scrambled "Jabberwocky" language, in which most content words are replaced by random nonsense strings, revealing that a core mechanism is sensitivity to, and matching of, structural patterns. The key insight is that LLMs do not rely solely on explicit knowledge storage or semantic understanding; they infer plausible meaning from the syntactic and contextual structure of the input, suggesting that pattern matching is itself a key ingredient of intelligence rather than a mere substitute for "real" intelligence.

Link: https://arxiv.org/abs/2601.11432
Authors: Gary Lupyan, Blaise Agüera y Arcas
Affiliations: University of Wisconsin–Madison; Google
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We report on an astonishing ability of large language models (LLMs) to make sense of “Jabberwocky” language in which most or all content words have been randomly replaced by nonsense strings, e.g., translating “He dwushed a ghanc zawk” to “He dragged a spare chair”. This result addresses ongoing controversies regarding how to best think of what LLMs are doing: are they a language mimic, a database, a blurry version of the Web? The ability of LLMs to recover meaning from structural patterns speaks to the unreasonable effectiveness of pattern-matching. Pattern-matching is not an alternative to “real” intelligence, but rather a key ingredient.

[NLP-10] Relational Linearity is a Predictor of Hallucinations

【Quick Read】: This paper studies hallucination in large language models (LLMs), focusing on fabricated facts about synthetic entities unknown to the model. Even medium-size models such as Gemma-7B-IT frequently produce answers they cannot recognize as fictitious, indicating a deficit in self-assessing their knowledge. The key contribution is the hypothesis, and its empirical validation, that the linearity of a relation affects how knowledge is stored and how well a model can assess it: linear relations tend to be stored more abstractly, making it hard for the model to judge whether it knows a fact, whereas nonlinear relations are stored more directly, making knowledge assessment easier. The authors build SyntHal, a dataset of 6000 synthetic entities across six relations, measure each relation's linearity with the Δcos metric, and find a strong positive correlation (r ∈ [0.78, 0.82]) between linearity and hallucination rate, confirming that relational structure is an important factor in LLM self-knowledge. This suggests new directions for managing hallucination behavior and improving the representation of factual knowledge.

Link: https://arxiv.org/abs/2601.11429
Authors: Yuetian Lu, Yihong Liu, Hinrich Schütze
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 4 figures, 8 tables

Abstract:Hallucination is a central failure mode in large language models (LLMs). We focus on hallucinations of answers to questions like: “Which instrument did Glenn Gould play?”, but we ask these questions for synthetic entities that are unknown to the model. Surprisingly, we find that medium-size models like Gemma-7B-IT frequently hallucinate, i.e., they have difficulty recognizing that the hallucinated fact is not part of their knowledge. We hypothesize that an important factor in causing these hallucinations is the linearity of the relation: linear relations tend to be stored more abstractly, making it difficult for the LLM to assess its knowledge; the facts of nonlinear relations tend to be stored more directly, making knowledge assessment easier. To investigate this hypothesis, we create SyntHal, a dataset of 6000 synthetic entities for six relations. In our experiments with four models, we determine, for each relation, the hallucination rate on SyntHal and also measure its linearity, using Δcos. We find a strong correlation (r ∈ [0.78, 0.82]) between relational linearity and hallucination rate, providing evidence for our hypothesis that the underlying storage of triples of a relation is a factor in how well a model can self-assess its knowledge. This finding has implications for how to manage hallucination behavior and suggests new research directions for improving the representation of factual knowledge in LLMs.
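One common proxy for relational linearity is the average pairwise cosine similarity of the offset vectors (object embedding minus subject embedding): near 1.0 means the relation behaves like a single linear translation in embedding space. The sketch below uses this proxy on synthetic embeddings; it is a hypothetical stand-in for the paper's Δcos measure, not its exact definition:

```python
import numpy as np

rng = np.random.default_rng(0)

def offset_linearity(subj, obj):
    """Average pairwise cosine similarity of (object - subject) offsets."""
    offsets = obj - subj
    offsets = offsets / np.linalg.norm(offsets, axis=1, keepdims=True)
    sims = offsets @ offsets.T
    n = len(sims)
    return (sims.sum() - n) / (n * (n - 1))  # exclude the diagonal

d, n = 16, 50
subj = rng.normal(size=(n, d))

# "Linear" relation: every object is subject + a shared direction (plus noise).
direction = rng.normal(size=d)
obj_linear = subj + direction + 0.1 * rng.normal(size=(n, d))

# "Nonlinear" relation: objects unrelated to their subjects.
obj_nonlinear = rng.normal(size=(n, d))

print("linear relation   :", round(offset_linearity(subj, obj_linear), 2))
print("nonlinear relation:", round(offset_linearity(subj, obj_nonlinear), 2))
```

Under the paper's hypothesis, relations scoring high on such a linearity measure would be the ones with higher hallucination rates.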

[NLP-11] Isotropy-Optimized Contrastive Learning for Semantic Course Recommendation

【Quick Read】: This paper addresses the anisotropy of vanilla BERT (Bidirectional Encoder Representations from Transformers) embedding spaces, where course descriptions exhibit high cosine similarity even when semantically unrelated, hurting recommendation accuracy. The key to the solution is a self-supervised contrastive learning framework combined with data augmentation and isotropy regularization, which produces more discriminative embeddings and improves the accuracy and semantic separation of course recommendations.

Link: https://arxiv.org/abs/2601.11427
Authors: Ali Khreis, Anthony Nasr, Yusuf Hilal
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 7 pages, 7 figures

Abstract:This paper presents a semantic course recommendation system for students using a self-supervised contrastive learning approach built upon BERT (Bidirectional Encoder Representations from Transformers). Traditional BERT embeddings suffer from anisotropic representation spaces, where course descriptions exhibit high cosine similarities regardless of semantic relevance. To address this limitation, we propose a contrastive learning framework with data augmentation and isotropy regularization that produces more discriminative embeddings. Our system processes student text queries and recommends Top-N relevant courses from a curated dataset of over 500 engineering courses across multiple faculties. Experimental results demonstrate that our fine-tuned model achieves improved embedding separation and more accurate course recommendations compared to vanilla BERT baselines.
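The anisotropy failure mode is easy to reproduce on synthetic data: if all embeddings share a large common component, every pair looks similar under cosine similarity. The sketch below measures average pairwise cosine before and after removing the common mean direction; the mean-removal step is only a simple stand-in for the paper's isotropy regularizer:

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_pairwise_cosine(X):
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    n = len(X)
    return (sims.sum() - n) / (n * (n - 1))  # exclude the diagonal

# Anisotropic toy embeddings: a large shared component dominates, so even
# unrelated items look similar (the failure mode described for vanilla BERT).
n, d = 100, 32
shared = rng.normal(size=d) * 5
X = shared + rng.normal(size=(n, d))

print("avg cosine before:", round(avg_pairwise_cosine(X), 2))

# Removing the common mean direction spreads embeddings apart.
X_iso = X - X.mean(axis=0)
print("avg cosine after :", round(avg_pairwise_cosine(X_iso), 2))
```

A more isotropic space like `X_iso` lets cosine similarity reflect genuine semantic relatedness, which is what the recommender needs.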

[NLP-12] PubMed-OCR: PMC Open Access OCR Annotations

【Quick Read】: This paper addresses the difficulty of evaluating and modeling OCR-related tasks on scientific literature, in particular the need for structured representations of text in article page images and for downstream applications such as coordinate-grounded question answering and layout-aware modeling. The key to the solution is PubMed-OCR, an OCR-centric corpus built from PubMed Central Open Access PDFs, annotated with Google Cloud Vision for text detection and localization, and released in a compact JSON schema with word-, line-, and paragraph-level bounding boxes, supporting layout-aware modeling and the evaluation of OCR-dependent pipelines.

Link: https://arxiv.org/abs/2601.11425
Authors: Hunter Heidenreich, Yosheb Getachew, Olivia Dinica, Ben Elliott
Affiliations: Roots.ai
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Digital Libraries (cs.DL); Machine Learning (cs.LG)
Comments:

Abstract:PubMed-OCR is an OCR-centric corpus of scientific articles derived from PubMed Central Open Access PDFs. Each page image is annotated with Google Cloud Vision and released in a compact JSON schema with word-, line-, and paragraph-level bounding boxes. The corpus spans 209.5K articles (1.5M pages; ~1.3B words) and supports layout-aware modeling, coordinate-grounded QA, and evaluation of OCR-dependent pipelines. We analyze corpus characteristics (e.g., journal coverage and detected layout features) and discuss limitations, including reliance on a single OCR engine and heuristic line reconstruction. We release the data and schema to facilitate downstream research and invite extensions.

[NLP-13] Evaluating LLM Behavior in Hiring: Implicit Weights Fairness Across Groups and Alignment with Human Preferences

【Quick Read】: This paper examines whether the weights that generative AI assigns to different attributes in hiring decisions are consistent with economic principles, recruiter preferences, or broader societal norms. The key to the solution is an evaluation framework built on established economic methodologies for analyzing human hiring behavior: synthetic datasets are constructed from real freelancer profiles and project descriptions, and a full factorial design is used to quantify how large language models (LLMs) weigh match-relevant criteria when evaluating freelancer-project fit, revealing the model's decision logic and how it varies across project contexts and demographic subgroups.

Link: https://arxiv.org/abs/2601.11379
Authors: Morgane Hoffmann, Emma Jouffroy, Warren Jouanneau, Marc Palyart, Charles Pebereau
Affiliations: Malt
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments:

Abstract:General-purpose Large Language Models (LLMs) show significant potential in recruitment applications, where decisions require reasoning over unstructured text, balancing multiple criteria, and inferring fit and competence from indirect productivity signals. Yet, it is still uncertain how LLMs assign importance to each attribute and whether such assignments are in line with economic principles, recruiter preferences or broader societal norms. We propose a framework to evaluate an LLM’s decision logic in recruitment, by drawing on established economic methodologies for analyzing human hiring behavior. We build synthetic datasets from real freelancer profiles and project descriptions from a major European online freelance marketplace and apply a full factorial design to estimate how a LLM weighs different match-relevant criteria when evaluating freelancer-project fit. We identify which attributes the LLM prioritizes and analyze how these weights vary across project contexts and demographic subgroups. Finally, we explain how a comparable experimental setup could be implemented with human recruiters to assess alignment between model and human decisions. Our findings reveal that the LLM weighs core productivity signals, such as skills and experience, but interprets certain features beyond their explicit matching value. While showing minimal average discrimination against minority groups, intersectional effects reveal that productivity signals carry different weights between demographic groups.

[NLP-14] Reward Modeling for Scientific Writing Evaluation

【Quick Read】: This paper addresses the difficulty of evaluating scientific writing generation: existing LLM-based judges struggle with diverse open-ended scientific writing tasks because they cannot reason over sparse domain knowledge and rely on fixed scoring rubrics. The key to the solution is a cost-efficient, open-source reward-model training framework with a two-stage strategy: first optimizing scientific evaluation preferences, then refining reasoning capabilities. A multi-aspect evaluation design and joint training across tasks enable fine-grained scoring and robustness to dynamic rubrics, so that a single trained evaluator can be reused on unseen tasks and settings without per-task fine-tuning.

Link: https://arxiv.org/abs/2601.11374
Authors: Furkan Şahinuç, Subhabrata Dutta, Iryna Gurevych
Affiliations: Ubiquitous Knowledge Processing Lab (UKP Lab); Technical University of Darmstadt; Hessian Center for AI (hessian.AI)
Subjects: Computation and Language (cs.CL)
Comments: arXiv admin note: text overlap with arXiv:2508.07955

Abstract:Scientific writing is an expert-domain task that demands deep domain knowledge, task-specific requirements and reasoning capabilities that leverage the domain knowledge to satisfy the task specifications. While scientific text generation has been widely studied, its evaluation remains a challenging and open problem. It is critical to develop models that can be reliably deployed for evaluating diverse open-ended scientific writing tasks while adhering to their distinct requirements. However, existing LLM-based judges and reward models are primarily optimized for general-purpose benchmarks with fixed scoring rubrics and evaluation criteria. Consequently, they often fail to reason over sparse knowledge of scientific domains when interpreting task-dependent and multi-faceted criteria. Moreover, fine-tuning for each individual task is costly and impractical for low-resource settings. To bridge these gaps, we propose cost-efficient, open-source reward models tailored for scientific writing evaluation. We introduce a two-stage training framework that initially optimizes scientific evaluation preferences and then refines reasoning capabilities. Our multi-aspect evaluation design and joint training across diverse tasks enable fine-grained assessment and robustness to dynamic criteria and scoring rubrics. Experimental analysis shows that our training regime strongly improves LLM-based scientific writing evaluation. Our models generalize effectively across tasks and to previously unseen scientific writing evaluation settings, allowing a single trained evaluator to be reused without task-specific retraining.

[NLP-15] AstroReason-Bench: Evaluating Unified Agentic Planning across Heterogeneous Space Planning Problems

【Quick Read】: This paper addresses the insufficiently tested planning capabilities of agentic large language models (LLMs) in physics-constrained real-world settings: existing benchmarks cover symbolic or weakly grounded environments and do not reflect performance on high-stakes, multi-objective, long-horizon decision-making tasks. The key to the solution is AstroReason-Bench, a comprehensive benchmark for Space Planning Problems (SPP) that integrates multiple scheduling regimes (such as ground-station communication and agile Earth observation) and provides a unified agent-oriented interaction protocol, offering a standardized, diagnostic testbed for evaluating agents' reasoning and acting under realistic physical constraints.

Link: https://arxiv.org/abs/2601.11354
Authors: Weiyi Wang, Xinchi Chen, Jingjing Gong, Xuanjing Huang, Xipeng Qiu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Recent advances in agentic Large Language Models (LLMs) have positioned them as generalist planners capable of reasoning and acting across diverse tasks. However, existing agent benchmarks largely focus on symbolic or weakly grounded environments, leaving their performance in physics-constrained real-world domains underexplored. We introduce AstroReason-Bench, a comprehensive benchmark for evaluating agentic planning in Space Planning Problems (SPP), a family of high-stakes problems with heterogeneous objectives, strict physical constraints, and long-horizon decision-making. AstroReason-Bench integrates multiple scheduling regimes, including ground station communication and agile Earth observation, and provides a unified agent-oriented interaction protocol. Evaluating on a range of state-of-the-art open- and closed-source agentic LLM systems, we find that current agents substantially underperform specialized solvers, highlighting key limitations of generalist planning under realistic constraints. AstroReason-Bench offers a challenging and diagnostic testbed for future agentic research.

[NLP-16] How Much Would a Clinician Edit This Draft? Evaluating LLM Alignment for Patient Message Response Drafting

【Quick Read】: This paper tackles the problem of aligning generative AI with individual clinicians' workflows when drafting responses to patient portal messages; the core challenge is evaluating whether large language models (LLMs) actually reduce clinician workload and match a given clinician's clinical style and communication preferences. The key to the solution is a novel taxonomy of thematic elements and a multi-level evaluation framework that quantifies, at both content and theme levels, how much clinicians would edit LLM drafts; adaptation strategies including thematic prompting, retrieval-augmented generation, supervised fine-tuning, and direct preference optimization are then used to systematically improve alignment between LLM outputs and individual clinicians' responses, supporting reliable deployment in patient-clinician communication workflows.

Link: https://arxiv.org/abs/2601.11344
Authors: Parker Seegmiller, Joseph Gatto, Sarah E. Greer, Ganza Belise Isingizwe, Rohan Ray, Timothy E. Burdick, Sarah Masud Preum
Affiliations: Dartmouth College; Dartmouth Health; The Dartmouth Institute
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) show promise in drafting responses to patient portal messages, yet their integration into clinical workflows raises various concerns, including whether they would actually save clinicians time and effort in their portal workload. We investigate LLM alignment with individual clinicians through a comprehensive evaluation of the patient message response drafting task. We develop a novel taxonomy of thematic elements in clinician responses and propose a novel evaluation framework for assessing clinician editing load of LLM-drafted responses at both content and theme levels. We release an expert-annotated dataset and conduct large-scale evaluations of local and commercial LLMs using various adaptation techniques including thematic prompting, retrieval-augmented generation, supervised fine-tuning, and direct preference optimization. Our results reveal substantial epistemic uncertainty in aligning LLM drafts with clinician responses. While LLMs demonstrate capability in drafting certain thematic elements, they struggle with clinician-aligned generation in other themes, particularly question asking to elicit further information from patients. Theme-driven adaptation strategies yield improvements across most themes. Our findings underscore the necessity of adapting LLMs to individual clinician preferences to enable reliable and responsible use in patient-clinician communication workflows.

[NLP-17] Unlocking the Potentials of Retrieval-Augmented Generation for Diffusion Language Models

【Quick Read】: This paper addresses the limited generation precision of diffusion language models (DLMs) under the Retrieval-Augmented Generation (RAG) framework, whose core cause is Response Semantic Drift (RSD): the generated content gradually departs from the query's original semantics over iterative denoising, reducing answer accuracy. The key to the solution is Semantic-Preserving REtrieval-Augmented Diffusion (SPREAD), a novel framework with a query-relevance-guided denoising strategy that actively steers the denoising trajectory to keep the generation semantically anchored to the query, effectively suppressing RSD and improving precision.

Link: https://arxiv.org/abs/2601.11342
Authors: Chuanyue Yu, Jiahui Wang, Yuhan Li, Heng Chang, Ge Lan, Qingyun Sun, Jia Li, Jianxin Li, Ziwei Zhang
Affiliations: Nankai University; Beihang University; HKUST (Guangzhou); Huawei Technologies Co., Ltd.
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Preprint

Abstract:Diffusion Language Models (DLMs) have recently demonstrated remarkable capabilities in natural language processing tasks. However, the potential of Retrieval-Augmented Generation (RAG), which has shown great success in enhancing large language models (LLMs), has not been well explored, due to the fundamental difference between LLM and DLM decoding. To fill this critical gap, we systematically test the performance of DLMs within the RAG framework. Our findings reveal that DLMs coupled with RAG show promising potential with stronger dependency on contextual information, but suffer from limited generation precision. We identify a key underlying issue: Response Semantic Drift (RSD), where the generated answer progressively deviates from the query’s original semantics, leading to low precision content. We trace this problem to the denoising strategies in DLMs, which fail to maintain semantic alignment with the query throughout the iterative denoising process. To address this, we propose Semantic-Preserving REtrieval-Augmented Diffusion (SPREAD), a novel framework that introduces a query-relevance-guided denoising strategy. By actively guiding the denoising trajectory, SPREAD ensures the generation remains anchored to the query’s semantics and effectively suppresses drift. Experimental results demonstrate that SPREAD significantly enhances the precision and effectively mitigates RSD of generated answers within the RAG framework.
zh

[NLP-18] Neural Chain-of-Thought Search: Searching the Optimal Reasoning Path to Enhance Large Language Models

【速读】: 该论文旨在解决当前大型语言模型在链式思维(Chain-of-Thought, CoT)推理过程中存在的次优路径问题,即模型生成推理步骤时缺乏前瞻性,容易陷入冗余且低效的推理路径。解决方案的关键在于提出神经链式思维搜索(Neural Chain-of-Thought Search, NCoTS)框架,将推理过程重构为对最优思考策略的动态搜索;通过量化表征解空间,识别出稀疏但更准确且更简洁的优越推理路径,并利用双因素启发式方法评估候选推理操作符,同时优化正确性和计算成本,从而实现帕累托改进,在多个推理基准上提升准确率超3.5%的同时减少生成长度超过22%。

链接: https://arxiv.org/abs/2601.11340
作者: Guoming Ling,Zhongzhan Huang,Yupei Lin,Junxin Li,Shanshan Zhong,Hefeng Wu,Liang Lin
机构: Sun Yat-sen University (中山大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Chain-of-Thought reasoning has significantly enhanced the problem-solving capabilities of Large Language Models. Unfortunately, current models generate reasoning steps sequentially without foresight, often becoming trapped in suboptimal reasoning paths with redundant steps. In contrast, we introduce Neural Chain-of-Thought Search (NCoTS), a framework that reformulates reasoning as a dynamic search for the optimal thinking strategy. By quantitatively characterizing the solution space, we reveal the existence of sparse superior reasoning paths that are simultaneously more accurate and concise than standard outputs. Our method actively navigates towards these paths by evaluating candidate reasoning operators using a dual-factor heuristic that optimizes for both correctness and computational cost. Consequently, NCoTS achieves a Pareto improvement across diverse reasoning benchmarks, boosting accuracy by over 3.5% while reducing generation length by over 22%. Our code and data are available at this https URL.
zh

[NLP-19] Idea First Code Later: Disentangling Problem Solving from Code Generation in Evaluating LLMs for Competitive Programming

【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在竞技编程任务中,将算法推理与代码实现混为一谈的问题,导致对模型真实问题解决能力的评估失真。其核心解决方案是将自然语言编辑稿(natural-language editorials)作为中心环节,用于引导模型先生成算法思路再编写代码,并以此作为评估基准。关键创新在于:1)通过生成或使用专家撰写的黄金编辑稿(gold editorials)前置指导,显著提升部分LLM的解题成功率;2)引入基于专家标注的编辑稿对比分析方法,诊断模型在算法设计阶段的推理错误;3)提出“LLM作为裁判”(LLM-as-a-judge)协议以实现可扩展的自动化评估。该研究还构建了一个包含83个ICPC风格题目及完整测试套件的数据集,强调未来评测基准应明确区分问题求解(problem solving)与代码实现(implementation)两个维度。

链接: https://arxiv.org/abs/2601.11332
作者: Sama Hadhoud,Alaa Elsetohy,Frederikus Hudi,Jan Christian Blaise Cruz,Steven Halim,Alham Fikri Aji
机构: MBZUAI; NAIST; National University of Singapore
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) increasingly succeed on competitive programming problems, yet existing evaluations conflate algorithmic reasoning with code-level implementation. We argue that competitive programming is fundamentally a problem-solving task and propose centering natural-language editorials in both solution generation and evaluation. Generating an editorial prior to code improves solve rates for some LLMs, with substantially larger gains when using expertly written gold editorials. However, even with gold editorials, models continue to struggle with implementation, while the gap between generated and gold editorials reveals a persistent problem-solving bottleneck in specifying correct and complete algorithms. Beyond pass/fail metrics, we diagnose reasoning errors by comparing model-generated editorials to gold standards using expert annotations and validate an LLM-as-a-judge protocol for scalable evaluation. We introduce a dataset of 83 ICPC-style problems with gold editorials and full test suites, and evaluate 19 LLMs, arguing that future benchmarks should explicitly separate problem solving from implementation.
zh

[NLP-20] F-Actor: Controllable Conversational Behaviour in Full-Duplex Models

【速读】: 该论文旨在解决当前语音对话系统在自然性和交互性方面的局限性,即现有系统虽能实现准确的语音生成,但缺乏根据上下文动态调整对话行为的能力,导致用户体验不够自然和沉浸。其解决方案的关键在于提出首个开源、指令驱动的全双工对话语音模型,通过仅微调语言模型(保持音频编码器冻结)的方式,在典型学术资源限制下仅需2000小时数据即可高效训练,从而实现对说话人声音、话题、对话行为(如回应性话语和打断)及对话发起方式的显式控制,同时采用单阶段训练协议并系统性分析设计选择,为可控全双工语音系统的可复现研究提供支持。

链接: https://arxiv.org/abs/2601.11329
作者: Maike Züfle,Ondrej Klejch,Nicholas Sanders,Jan Niehues,Alexandra Birch,Tsz Kin Lam
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); University of Edinburgh (爱丁堡大学); NatWest (英国国民西敏寺银行)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Spoken conversational systems require more than accurate speech generation to have human-like conversations: to feel natural and engaging, they must produce conversational behaviour that adapts dynamically to the context. Current spoken conversational systems, however, rarely allow such customization, limiting their naturalness and usability. In this work, we present the first open, instruction-following full-duplex conversational speech model that can be trained efficiently under typical academic resource constraints. By keeping the audio encoder frozen and finetuning only the language model, our model requires just 2,000 hours of data, without relying on large-scale pretraining or multi-stage optimization. The model can follow explicit instructions to control speaker voice, conversation topic, conversational behaviour (e.g., backchanneling and interruptions), and dialogue initiation. We propose a single-stage training protocol and systematically analyze design choices. Both the model and training code will be released to enable reproducible research on controllable full-duplex speech systems.
zh

[NLP-21] Membership Inference on LLMs in the Wild

【速读】: 该论文旨在解决在严格黑盒(black-box)场景下对大型语言模型(Large Language Models, LLMs)训练数据进行成员推理攻击(Membership Inference Attacks, MIAs)时面临的两大挑战:一是现有方法通常依赖不可访问的模型内部信息(如logits),二是缺乏跨领域的泛化能力。为应对这些问题,论文提出了一种名为SimMIA的鲁棒MIA框架,其关键创新在于引入了一种先进的采样策略和评分机制,仅利用生成文本即可实现高效且具有广泛适用性的成员推理,从而在不依赖模型内部结构的情况下实现了当前最优性能。

链接: https://arxiv.org/abs/2601.11314
作者: Jiatong Yi,Yanyang Li
机构: The Chinese University of Hong Kong (香港中文大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Membership Inference Attacks (MIAs) act as a crucial auditing tool for the opaque training data of Large Language Models (LLMs). However, existing techniques predominantly rely on inaccessible model internals (e.g., logits) or suffer from poor generalization across domains in strict black-box settings where only generated text is available. In this work, we propose SimMIA, a robust MIA framework tailored for this text-only regime by leveraging an advanced sampling strategy and scoring mechanism. Furthermore, we present WikiMIA-25, a new benchmark curated to evaluate MIA performance on modern proprietary LLMs. Experiments demonstrate that SimMIA achieves state-of-the-art results in the black-box setting, rivaling baselines that exploit internal model information.
zh

[NLP-22] One LLM to Train Them All: Multi-Task Learning Framework for Fact-Checking ECIR2026

【速读】: 该论文旨在解决自动化事实核查(Automated Fact-Checking, AFC)中因依赖大型专有模型所带来的权重封闭性、复杂度高及成本高昂等问题,同时克服使用多个小型专用模型进行细粒度任务(如声明检测、证据排序和立场识别)时带来的部署与维护成本过高的困境。其解决方案的关键在于采用多任务学习(Multi-Task Learning, MTL)策略,通过微调单一小型解码器架构语言模型(decoder-only LLMs),联合执行三个核心AFC任务,从而在保持模型轻量化的同时实现性能提升;实验表明,该方法相较于零样本或少样本设置,在声明检测、证据重排序和立场检测任务上分别实现了最高达44%、54%和31%的相对性能增益。

链接: https://arxiv.org/abs/2601.11293
作者: Malin Astrid Larsson,Harald Fosen Grunnaleite,Vinay Setty
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted version in ECIR 2026

点击查看摘要

Abstract:Large language models (LLMs) are reshaping automated fact-checking (AFC) by enabling unified, end-to-end verification pipelines rather than isolated components. While large proprietary models achieve strong performance, their closed weights, complexity, and high costs limit sustainability. Fine-tuning smaller open weight models for individual AFC tasks can help but requires multiple specialized models resulting in high costs. We propose multi-task learning (MTL) as a more efficient alternative that fine-tunes a single model to perform claim detection, evidence ranking, and stance detection jointly. Using small decoder-only LLMs (e.g., Qwen3-4b), we explore three MTL strategies: classification heads, causal language modeling heads, and instruction-tuning, and evaluate them across model sizes, task orders, and standard non-LLM baselines. While multitask models do not universally surpass single-task baselines, they yield substantial improvements, achieving up to 44%, 54%, and 31% relative gains for claim detection, evidence re-ranking, and stance detection, respectively, over zero-/few-shot settings. Finally, we also provide practical, empirically grounded guidelines to help practitioners apply MTL with LLMs for automated fact-checking.
zh
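将声明检测、证据排序与立场识别统一为指令-输出格式、混合成单一训练集,是上文 MTL 策略中指令微调路线的直观做法。以下为一个假设性的极简示意(指令模板为演示用虚构,并非论文实际使用的 prompt):

```python
def build_mtl_examples(claims, evidence_pairs, stance_pairs):
    """将三个事实核查子任务统一为指令-输出格式,混合成单一训练集。"""
    data = []
    for text, label in claims:  # 子任务一:声明检测
        data.append({"instruction": f"判断下面句子是否为可核查的声明:{text}",
                     "output": label})
    for claim, ev, score in evidence_pairs:  # 子任务二:证据相关性排序
        data.append({"instruction": f"为声明“{claim}”给证据“{ev}”的相关性打分(0-2):",
                     "output": str(score)})
    for claim, ev, stance in stance_pairs:  # 子任务三:立场检测
        data.append({"instruction": f"证据“{ev}”对声明“{claim}”的立场是?",
                     "output": stance})
    return data

examples = build_mtl_examples(
    claims=[("地球绕太阳运行。", "是")],
    evidence_pairs=[("地球绕太阳运行。", "开普勒定律描述行星轨道。", 2)],
    stance_pairs=[("地球绕太阳运行。", "开普勒定律描述行星轨道。", "支持")],
)
print(len(examples))  # 3
```

混合后的样本可直接用于单一小模型的联合微调,这正是“一个模型同时学三个任务”的核心思路。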

[NLP-23] Knowledge is Not Enough: Injecting RL Skills for Continual Adaptation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)面临的“知识截止”(knowledge cutoff)问题,即模型参数冻结导致无法直接内化新信息,且传统监督微调(Supervised Fine-Tuning, SFT)虽能更新事实性内容,但难以提升模型对新知识的实际应用能力(如问答或决策)。其解决方案的关键在于提出参数技能迁移(Parametric Skill Transfer, PaST)框架,核心创新是基于观察到SFT与强化学习(Reinforcement Learning, RL)的参数更新方向近乎正交的现象,从中提取一个领域无关的技能向量(Skill Vector),并将其线性注入经过轻量级SFT后的目标模型中,从而实现高效、有效的知识适应与跨域技能迁移。

链接: https://arxiv.org/abs/2601.11258
作者: Pingzhi Tang,Yiding Wang,Muhan Zhang
机构: Peking University (北京大学); BIGAI (通用人工智能国家重点实验室)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) face the “knowledge cutoff” challenge, where their frozen parametric memory prevents direct internalization of new information. While Supervised Fine-Tuning (SFT) is commonly used to update model knowledge, it often updates factual content without reliably improving the model’s ability to use the newly incorporated information for question answering or decision-making. Reinforcement Learning (RL) is essential for acquiring reasoning skills; however, its high computational cost makes it impractical for efficient online adaptation. We empirically observe that the parameter updates induced by SFT and RL are nearly orthogonal. Based on this observation, we propose Parametric Skill Transfer (PaST), a framework that supports modular skill transfer for efficient and effective knowledge adaptation. By extracting a domain-agnostic Skill Vector from a source domain, we can linearly inject knowledge manipulation skills into a target model after it has undergone lightweight SFT on new data. Experiments on knowledge-incorporation QA (SQuAD, LooGLE) and agentic tool-use benchmarks (ToolBench) demonstrate the effectiveness of our method. On SQuAD, PaST outperforms the state-of-the-art self-editing SFT baseline by up to 9.9 points. PaST further scales to long-context QA on LooGLE with an 8.0-point absolute accuracy gain, and improves zero-shot ToolBench success rates by +10.3 points on average with consistent gains across tool categories, indicating strong scalability and cross-domain transferability of the Skill Vector.
zh
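PaST 中“技能向量”的抽取与线性注入,本质上是参数空间的向量算术。下面用 numpy 字典模拟模型参数,给出一个极简示意(假设性实现,仅演示思路,并非论文官方代码;alpha 为假设的注入系数):

```python
import numpy as np

def extract_skill_vector(rl_params, base_params):
    """技能向量 = RL 微调后参数 - 基础模型参数(逐张量相减)。"""
    return {k: rl_params[k] - base_params[k] for k in base_params}

def inject_skill(sft_params, skill_vector, alpha=1.0):
    """将技能向量按系数 alpha 线性加到 SFT 后的目标模型参数上。"""
    return {k: sft_params[k] + alpha * skill_vector[k] for k in sft_params}

# 玩具示例:用 2x2 矩阵代表某一层权重
base = {"w": np.zeros((2, 2))}
rl   = {"w": np.array([[0.1, 0.0], [0.0, 0.1]])}   # RL 训练带来的更新
sft  = {"w": np.array([[0.0, 0.2], [0.2, 0.0]])}   # SFT 注入新知识的更新

skill = extract_skill_vector(rl, base)
merged = inject_skill(sft, skill, alpha=1.0)
print(merged["w"])  # [[0.1 0.2] [0.2 0.1]]
```

本例中两组更新恰好正交(逐元素乘积之和为零),对应论文的经验观察:SFT 与 RL 的参数更新方向近乎正交,因而线性叠加产生的相互干扰较小。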

[NLP-24] Reasoning in Trees: Improving Retrieval-Augmented Generation for Multi-Hop Question Answering WWW2026

【速读】: 该论文旨在解决多跳问答(multi-hop question answering, MQA)任务中,现有基于大语言模型(LLM)的迭代检索增强生成(Retrieval-Augmented Generation, RAG)方法因查询分解不准确和错误传播导致推理连贯性差的问题。其解决方案的关键在于提出一种分层框架——推理树引导的RAG(Reasoning Tree Guided RAG, RT-RAG),通过显式构建推理树来结构化分解多跳问题,利用实体分析与共识选择机制明确区分核心查询、已知实体与未知实体,从而减少错误分解;随后采用自底向上的遍历策略进行迭代查询重写与证据精炼,有效抑制错误传播,显著提升复杂多跳问答的性能。

链接: https://arxiv.org/abs/2601.11255
作者: Yuling Shi,Maolin Sun,Zijun Liu,Mo Yang,Yixiong Fang,Tianran Sun,Xiaodong Gu
机构: Shanghai Jiao Tong University (上海交通大学); Shandong University (山东大学); iAuto (智驾); Sagenic Tech (赛诺科技); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to GLOW@WWW2026. Code available at this https URL

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has demonstrated significant effectiveness in enhancing large language models (LLMs) for complex multi-hop question answering (QA). For multi-hop QA tasks, current iterative approaches predominantly rely on LLMs to self-guide and plan multi-step exploration paths during retrieval, leading to substantial challenges in maintaining reasoning coherence across steps from inaccurate query decomposition and error propagation. To address these issues, we introduce Reasoning Tree Guided RAG (RT-RAG), a novel hierarchical framework for complex multi-hop QA. RT-RAG systematically decomposes multi-hop questions into explicit reasoning trees, minimizing inaccurate decomposition through structured entity analysis and consensus-based tree selection that clearly separates core queries, known entities, and unknown entities. Subsequently, a bottom-up traversal strategy employs iterative query rewriting and refinement to collect high-quality evidence, thereby mitigating error propagation. Comprehensive experiments show that RT-RAG substantially outperforms state-of-the-art methods by 7.0% F1 and 6.0% EM, demonstrating the effectiveness of RT-RAG in complex multi-hop QA.
zh
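RT-RAG 的“自底向上遍历”可用后序遍历来示意:先解决叶子上的子问题,再用子答案改写父查询。以下为假设性玩具实现,其中 kb、answer_fn 等均为演示用假设,真实系统中这一步由检索与 LLM 完成:

```python
from dataclasses import dataclass, field

@dataclass
class ReasonNode:
    query: str                              # 当前节点的子查询
    children: list = field(default_factory=list)

def bottom_up_solve(node, answer_fn, resolved=None):
    """后序遍历推理树:先解决子问题,再用其答案改写并回答父查询。"""
    if resolved is None:
        resolved = {}
    for child in node.children:
        bottom_up_solve(child, answer_fn, resolved)
    # 用已解决的子查询答案改写当前查询(极简字符串替换示意)
    query = node.query
    for sub_query, ans in resolved.items():
        query = query.replace(sub_query, ans)
    resolved[node.query] = answer_fn(query)
    return resolved[node.query]

# 玩具两跳问题:先解出“谁执导了电影X?”,再回答其出生地
kb = {"谁执导了电影X?": "导演A", "导演A出生在哪里?": "城市B"}
tree = ReasonNode("谁执导了电影X?出生在哪里?",
                  children=[ReasonNode("谁执导了电影X?")])
answer_fn = lambda q: kb.get(q, "未知")
print(bottom_up_solve(tree, answer_fn))  # 城市B
```

这种先分解、后汇聚的结构,正是论文用来抑制多跳推理中错误传播的基本骨架。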

[NLP-25] How DDAIR you? Disambiguated Data Augmentation for Intent Recognition EACL2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在意图识别(Intent Recognition)任务中进行数据增强时,可能生成与非目标类别语义模糊的合成样本的问题。其关键解决方案是提出DDAIR(Disambiguated Data Augmentation for Intent Recognition),利用Sentence Transformers提取句子嵌入(sentence embeddings),识别那些在语义上更接近其他意图而非目标意图的模糊样本,并通过迭代重生成机制对这些模糊样本进行修正,从而提升低资源场景下意图分类的准确性。

链接: https://arxiv.org/abs/2601.11234
作者: Galo Castillo-López,Alexis Lombard,Nasredine Semmar,Gaël de Chalendar
机构: Université Paris-Saclay (巴黎萨克雷大学); CEA (法国原子能和替代能源委员会); List (List实验室)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted for publication at EACL 2026

点击查看摘要

Abstract:Large Language Models (LLMs) are effective for data augmentation in classification tasks like intent detection. In some cases, they inadvertently produce examples that are ambiguous with regard to untargeted classes. We present DDAIR (Disambiguated Data Augmentation for Intent Recognition) to mitigate this problem. We use Sentence Transformers to detect ambiguous class-guided augmented examples generated by LLMs for intent recognition in low-resource scenarios. We identify synthetic examples that are semantically more similar to another intent than to their target one. We also provide an iterative re-generation method to mitigate such ambiguities. Our findings show that sentence embeddings effectively help to (re)generate less ambiguous examples, and suggest promising potential to improve classification performance in scenarios where intents are loosely or broadly defined.
zh
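DDAIR 判定模糊样本的核心判据——合成样本与其他意图的质心比与目标意图更相似——可用余弦相似度简单示意。以下为假设性玩具实现,用二维向量代替真实系统中由 Sentence Transformers 编码得到的句子嵌入:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_ambiguous(examples, target_intent, centroids):
    """返回与其他意图质心更相似(而非目标意图)的合成样本下标。"""
    ambiguous = []
    for i, emb in enumerate(examples):
        sims = {intent: cosine(emb, c) for intent, c in centroids.items()}
        if max(sims, key=sims.get) != target_intent:
            ambiguous.append(i)
    return ambiguous

# 玩具二维“嵌入”:每个意图的质心可由其种子样本嵌入取均值得到
centroids = {"查余额": np.array([1.0, 0.0]), "转账": np.array([0.0, 1.0])}
synthetic = [np.array([0.9, 0.1]),   # 接近目标意图“查余额”,保留
             np.array([0.2, 0.8])]   # 语义漂移到“转账”,应被标记重生成
print(find_ambiguous(synthetic, "查余额", centroids))  # [1]
```

被标记的样本即可送回 LLM 迭代重生成,对应论文的消歧循环。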

[NLP-26] FactCorrector: A Graph-Inspired Approach to Long-Form Factuality Correction of Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在知识密集型应用中生成事实性错误响应的问题。现有方法通常依赖于训练阶段的调整,难以适应跨领域场景。其解决方案的关键在于提出一种无需重新训练即可跨领域适配的后处理修正方法 FactCorrector,该方法利用关于原始回答事实性的结构化反馈信息,生成针对性的修正结果,从而提升事实准确性并保持内容相关性。

链接: https://arxiv.org/abs/2601.11232
作者: Javier Carnerero-Cano,Massimiliano Pronesti,Radu Marinescu,Tigran Tchrakian,James Barry,Jasmina Gajcin,Yufang Hou,Alessandra Pascale,Elizabeth Daly
机构: IBM Research Europe - Ireland; IT:U - Interdisciplinary Transformation University Austria
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are widely used in knowledge-intensive applications but often generate factually incorrect responses. A promising approach to rectify these flaws is correcting LLMs using feedback. Therefore, in this paper, we introduce FactCorrector, a new post-hoc correction method that adapts across domains without retraining and leverages structured feedback about the factuality of the original response to generate a correction. To support rigorous evaluations of factuality correction methods, we also develop the VELI5 benchmark, a novel dataset containing systematically injected factual errors and ground-truth corrections. Experiments on VELI5 and several popular long-form factuality datasets show that the FactCorrector approach significantly improves factual precision while preserving relevance, outperforming strong baselines. We release our code at this https URL.
zh

[NLP-27] Language of Thought Shapes Output Diversity in Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)输出多样性不足的问题,这限制了模型在创造性任务和多元文化场景中的表现。解决方案的关键在于控制模型“思维语言”(language of thought),即在推理过程中使用不同语言进行内部思考,从而在不改变最终输出语言的前提下显著提升输出多样性。研究发现,不同思维语言占据模型思维空间中不同的区域,且非英语思维语言相较于英语能带来更明显的多样性增益;进一步通过多语言混合采样策略,利用语言间的组合效应实现多样性天花板的扩展,并在多元对齐场景中提升文化知识与价值取向的覆盖广度。

链接: https://arxiv.org/abs/2601.11227
作者: Shaoyang Xu,Wenxuan Zhang
机构: Singapore University of Technology and Design (新加坡科技设计大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Output diversity is crucial for Large Language Models as it underpins pluralism and creativity. In this work, we reveal that controlling the language used during model thinking-the language of thought-provides a novel and structural source of output diversity. Our preliminary study shows that different thinking languages occupy distinct regions in a model’s thinking space. Based on this observation, we study two repeated sampling strategies under multilingual thinking-Single-Language Sampling and Mixed-Language Sampling-and conduct diversity evaluation on outputs that are controlled to be in English, regardless of the thinking language used. Across extensive experiments, we demonstrate that switching the thinking language from English to non-English languages consistently increases output diversity, with a clear and consistent positive correlation such that languages farther from English in the thinking space yield larger gains. We further show that aggregating samples across multiple thinking languages yields additional improvements through compositional effects, and that scaling sampling with linguistic heterogeneity expands the model’s diversity ceiling. Finally, we show that these findings translate into practical benefits in pluralistic alignment scenarios, leading to broader coverage of cultural knowledge and value orientations in LLM outputs. Our code is publicly available at this https URL.
zh
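上文所说的“输出多样性”可以用简单的词面指标粗略度量,例如 1 减去所有输出两两 Jaccard 相似度的均值。以下为一个假设性示意(论文实际采用的多样性指标可能不同,示例文本亦为虚构):

```python
def pairwise_diversity(outputs):
    """多样性 = 1 - 平均两两 Jaccard 相似度(基于词集合的粗略度量)。"""
    sims = []
    for i in range(len(outputs)):
        for j in range(i + 1, len(outputs)):
            a, b = set(outputs[i].split()), set(outputs[j].split())
            sims.append(len(a & b) / len(a | b))
    return 1 - sum(sims) / len(sims)

# 假想场景:同一问题,分别用不同语言“思考”,但最终都输出英文答案
single_lang = ["the cat sat on the mat", "the cat sat on a mat",
               "the cat sat on the mat"]
mixed_lang  = ["the cat sat on the mat", "a feline rested on the rug",
               "one cat was sitting on a small mat"]
# 混合思维语言采样得到的输出集合,多样性得分更高
print(round(pairwise_diversity(single_lang), 3),
      round(pairwise_diversity(mixed_lang), 3))
```

这类指标可用于复现论文中“单语言采样 vs 混合语言采样”的多样性对比。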

[NLP-28] MultiCaption: Detecting disinformation using multilingual visual claims

【速读】: 该论文旨在解决在线虚假信息(online disinformation)在多模态和多语言环境下的检测难题,尤其是视觉内容中存在矛盾陈述时的识别问题。现有自动化事实核查方法受限于缺乏能反映真实复杂场景的数据集,导致模型泛化能力不足。其解决方案的关键在于构建了MultiCaption数据集,该数据集包含11,088条跨64种语言的视觉陈述对(visual claims),通过多种标注策略明确判断陈述间的逻辑矛盾关系,从而为多模态、多语言环境下的虚假信息检测提供高质量基准资源。实验表明,该数据集比标准自然语言推理(NLI)任务更具挑战性,且多语言训练可显著提升性能,无需依赖机器翻译即可实现有效的跨语言事实核查。

链接: https://arxiv.org/abs/2601.11220
作者: Rafael Martins Frade,Rrubaa Panchendrarajan,Arkaitz Zubiaga
机构: 1. University of Oxford (牛津大学); 2. DeepMind (深度思维); 3. University of Manchester (曼彻斯特大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Online disinformation poses an escalating threat to society, driven increasingly by the rapid spread of misleading content across both multimedia and multilingual platforms. While automated fact-checking methods have advanced in recent years, their effectiveness remains constrained by the scarcity of datasets that reflect these real-world complexities. To address this gap, we first present MultiCaption, a new dataset specifically designed for detecting contradictions in visual claims. Pairs of claims referring to the same image or video were labeled through multiple strategies to determine whether they contradict each other. The resulting dataset comprises 11,088 visual claims in 64 languages, offering a unique resource for building and evaluating misinformation-detection systems in truly multimodal and multilingual environments. We then provide comprehensive experiments using transformer-based architectures, natural language inference models, and large language models, establishing strong baselines for future research. The results show that MultiCaption is more challenging than standard NLI tasks, requiring task-specific finetuning for strong performance. Moreover, the gains from multilingual training and testing highlight the dataset’s potential for building effective multilingual fact-checking pipelines without relying on machine translation.
zh

[NLP-29] T*: Progressive Block Scaling for MDM Through Trajectory Aware RL

【速读】: 该论文旨在解决掩码扩散语言模型(Masked Diffusion Language Models, MDMs)在推理阶段并行度不足的问题,尤其是在数学推理任务中,如何在不显著降低性能的前提下提升解码效率。解决方案的关键在于提出一种基于TraceRL的训练课程(training curriculum),即T*,其通过渐进式扩大掩码块大小(block-size scaling)的方式,从一个自回归(AR)初始化的小块模型平滑过渡到大块模型,从而实现更高的并行解码能力,同时保持较低的性能损失。进一步分析表明,T*能够收敛至一种替代的解码调度策略Ŝ,该策略在数学推理基准上表现相当。

链接: https://arxiv.org/abs/2601.11214
作者: Hanchen Xia,Baoyou Chen,Yutang Ge,Guojiang Zhao,Siyu Zhu
机构: Shanghai Academy of AI for Science (上海人工智能科学研究院); Shanghai Jiao Tong University (上海交通大学); Carnegie Mellon University (卡内基梅隆大学); Fudan University (复旦大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present T*, a simple TraceRL-based training curriculum for progressive block-size scaling in masked diffusion language models (MDMs). Starting from an AR-initialized small-block MDM, T* transitions smoothly to larger blocks, enabling higher-parallelism decoding with minimal performance degradation on math reasoning benchmarks. Moreover, further analysis suggests that T* can converge to an alternative decoding schedule Ŝ that achieves comparable performance.
zh

[NLP-30] DOREMI: Optimizing Long Tail Predictions in Document-Level Relation Extraction

【速读】: 该论文针对文档级关系抽取(Document-Level Relation Extraction, DocRE)中存在的两大挑战展开研究:一是依赖跨句上下文信息导致的复杂性,二是关系类型分布的长尾特性(long-tail distribution),即多数关系类别在训练集中样本稀少。为解决上述问题,作者提出DOcument-level Relation Extraction optiMizing the long taIl (DOREMI) 框架,其核心创新在于通过迭代方式选择最具信息量的样本进行最小化的人工标注,从而有效提升低频关系的建模能力。不同于以往依赖大规模噪声数据或启发式去噪的方法,DOREMI 以目标导向的方式增强稀缺关系的学习效率与鲁棒性,且可无缝集成至任意现有 DocRE 模型中,显著缓解长尾偏差,提升模型在罕见关系上的泛化性能。

链接: https://arxiv.org/abs/2601.11190
作者: Laura Menotti,Stefano Marchesin,Gianmaria Silvello
机构: University of Padova (帕多瓦大学)
类目: Computation and Language (cs.CL)
备注: Accepted for publication in Knowledge-Based Systems

点击查看摘要

Abstract:Document-Level Relation Extraction (DocRE) presents significant challenges due to its reliance on cross-sentence context and the long-tail distribution of relation types, where many relations have scarce training examples. In this work, we introduce DOcument-level Relation Extraction optiMizing the long taIl (DOREMI), an iterative framework that enhances underrepresented relations through minimal yet targeted manual annotations. Unlike previous approaches that rely on large-scale noisy data or heuristic denoising, DOREMI actively selects the most informative examples to improve training efficiency and robustness. DOREMI can be applied to any existing DocRE model and is effective at mitigating long-tail biases, offering a scalable solution to improve generalization on rare relations.
zh
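DOREMI 中“主动挑选最具信息量样本”这一环节,可用预测熵作为不确定性代理来示意:在稀有关系的候选中优先标注模型最不确定的样本。以下为假设性玩具实现(熵只是一种可能的信息量准则,论文的具体选样策略以原文为准):

```python
import math

def entropy(probs):
    """离散分布的香农熵,衡量模型预测的不确定性。"""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_annotation(candidates, rare_relations, budget=2):
    """在稀有关系候选中,按预测熵降序挑选待人工标注的样本 id。"""
    pool = [c for c in candidates if c["relation"] in rare_relations]
    pool.sort(key=lambda c: entropy(c["probs"]), reverse=True)
    return [c["id"] for c in pool[:budget]]

candidates = [
    {"id": 0, "relation": "常见关系", "probs": [0.5, 0.5]},   # 非长尾,不选
    {"id": 1, "relation": "稀有关系", "probs": [0.55, 0.45]},  # 高不确定性
    {"id": 2, "relation": "稀有关系", "probs": [0.99, 0.01]},  # 模型已很确定
    {"id": 3, "relation": "稀有关系", "probs": [0.6, 0.4]},
]
print(select_for_annotation(candidates, {"稀有关系"}, budget=2))  # [1, 3]
```

每轮仅对选中的少量样本做人工标注、再重训模型,即可构成论文所述的迭代式长尾优化循环。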

[NLP-31] TANDEM: Temporal-Aware Neural Detection for Multimodal Hate Speech

【速读】: 该论文旨在解决社交媒体中长时序多模态有害内容(如仇恨言论)检测与可解释性之间的矛盾问题,即现有自动化系统虽能高精度识别有害内容,但缺乏对具体时间戳和目标身份等关键证据的细粒度解释能力,难以支持人工审核介入。其解决方案的关键在于提出TANDEM框架,通过一种新颖的“双联强化学习”策略,使视觉-语言模型与音频-语言模型在自约束的跨模态上下文中相互优化,从而在无需密集帧级标注的情况下稳定地进行长时间序列推理,将原本的二分类任务转化为结构化推理问题,显著提升了目标识别准确率(F1达0.73)并保持精确的时间定位能力。

链接: https://arxiv.org/abs/2601.11178
作者: Girish A. Koushik,Helen Treharne,Diptesh Kanojia
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Social and Information Networks (cs.SI)
备注: Under review at ICWSM 2026

点击查看摘要

Abstract:Social media platforms are increasingly dominated by long-form multimodal content, where harmful narratives are constructed through a complex interplay of audio, visual, and textual cues. While automated systems can flag hate speech with high accuracy, they often function as “black boxes” that fail to provide the granular, interpretable evidence, such as precise timestamps and target identities, required for effective human-in-the-loop moderation. In this work, we introduce TANDEM, a unified framework that transforms audio-visual hate detection from a binary classification task into a structured reasoning problem. Our approach employs a novel tandem reinforcement learning strategy where vision-language and audio-language models optimize each other through self-constrained cross-modal context, stabilizing reasoning over extended temporal sequences without requiring dense frame-level supervision. Experiments across three benchmark datasets demonstrate that TANDEM significantly outperforms zero-shot and context-augmented baselines, achieving 0.73 F1 in target identification on HateMM (a 30% improvement over state-of-the-art) while maintaining precise temporal grounding. We further observe that while binary detection is robust, differentiating between offensive and hateful content remains challenging in multi-class settings due to inherent label ambiguity and dataset imbalance. More broadly, our findings suggest that structured, interpretable alignment is achievable even in complex multimodal settings, offering a blueprint for the next generation of transparent and actionable online safety moderation tools.
zh

[NLP-32] The Growing Gains and Pains of Iterative Web Corpora Crawling: Insights from South Slavic CLASSLA-web 2.0 Corpora LREC2026

【速读】: 该论文旨在解决低资源语言(Less-resourced languages)文本数据稀缺的问题,特别是针对南斯拉夫语族语言(South Slavic languages)缺乏大规模通用语料库的现状。解决方案的关键在于建立一个持续迭代的国家顶级域名(national top-level domains, TLDs)爬取基础设施,通过自动化、周期性地抓取相关网站内容,构建高质量且规模显著扩大的多语言语料库——CLASSLA-web 2.0,该版本包含7种语言共170亿词、3810万篇文本,并首次实现了自动话题标注(topic labeling),从而显著提升了语料多样性与可用性。

链接: https://arxiv.org/abs/2601.11170
作者: Taja Kuzman Pungeršek,Peter Rupnik,Vít Suchomel,Nikola Ljubešić
机构: 未知
类目: Computation and Language (cs.CL)
备注: 10 pages, 7 figures, 2 tables. Submitted to the LREC 2026 conference

点击查看摘要

Abstract:Crawling national top-level domains has proven to be highly effective for collecting texts in less-resourced languages. This approach has been recently used for South Slavic languages and resulted in the largest general corpora for this language group: the CLASSLA-web 1.0 corpora. Building on this success, we established a continuous crawling infrastructure for iterative national top-level domain crawling across South Slavic and related webs. We present the first outcome of this crawling infrastructure - the CLASSLA-web 2.0 corpus collection, with substantially larger web corpora containing 17.0 billion words in 38.1 million texts in seven languages: Bosnian, Bulgarian, Croatian, Macedonian, Montenegrin, Serbian, and Slovenian. In addition to genre categories, the new version is also automatically annotated with topic labels. Comparing CLASSLA-web 2.0 with its predecessor reveals that only one-fifth of the texts overlap, showing that re-crawling after just two years yields largely new content. However, while the new web crawls bring growing gains, we also notice growing pains - a manual inspection of top domains reveals a visible degradation of web content, as machine-generated sites now contribute a significant portion of texts.
zh

[NLP-33] FlashLabs Chroma 1.0: A Real-Time End-to-End Spoken Dialogue Model with Personalized Voice Cloning

【速读】: 该论文旨在解决当前端到端语音对话系统中语音身份保留能力不足的问题,即模型在多轮对话中难以保持说话人特征的一致性,从而限制了个性化语音交互的实现。解决方案的关键在于提出 Chroma 1.0,这是首个开源、实时的端到端语音对话模型,通过采用交错文本-音频标记调度(比例为 1:2)实现流式生成,显著降低端到端延迟(Real-Time Factor, RTF = 0.43),同时在多轮对话中保持高质量的个性化语音合成,实验表明其在说话人相似度上相较人工基线提升 10.96%,并维持强大的推理与对话能力。

链接: https://arxiv.org/abs/2601.11141
作者: Tanyu Chen,Tairan Chen,Kai Shen,Zhenghua Bao,Zhihui Zhang,Man Yuan,Yi Shi
机构: FlashLabs
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Recent end-to-end spoken dialogue systems leverage speech tokenizers and neural audio codecs to enable LLMs to operate directly on discrete speech representations. However, these models often exhibit limited speaker identity preservation, hindering personalized voice interaction. In this work, we present Chroma 1.0, the first open-source, real-time, end-to-end spoken dialogue model that achieves both low-latency interaction and high-fidelity personalized voice cloning. Chroma achieves sub-second end-to-end latency through an interleaved text-audio token schedule (1:2) that supports streaming generation, while maintaining high-quality personalized voice synthesis across multi-turn conversations. Our experimental results demonstrate that Chroma achieves a 10.96% relative improvement in speaker similarity over the human baseline, with a Real-Time Factor (RTF) of 0.43, while maintaining strong reasoning and dialogue capabilities. Our code and models are publicly available at this https URL and this https URL .
zh

[NLP-34] Integrity Shield: A System for Ethical AI Use & Authorship Transparency in Assessments

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)直接处理上传的PDF考试文件时引发的学术诚信问题,即学生可能利用黑箱商业LLM(如GPT-4、Claude等)作弊,导致成绩和学位证书可靠性下降。现有水印技术要么在token层面操作,要么依赖对模型解码过程的控制,无法适用于教师上传文档后由学生通过封闭API查询的场景。其解决方案的关键在于提出Integrity Shield——一种文档层水印系统,能够在不改变人类可读外观的前提下,将具有结构感知能力的、基于题目级别的水印嵌入到考试PDF中;这些水印不仅能有效阻止多模态大语言模型(Multimodal Large Language Models, MLLMs)正确作答(平均91–94%的考试级阻断率),还能稳定地编码出可从模型或学生回答中可靠恢复的项目级签名(签名恢复准确率达89–93%),从而实现防作弊与责任溯源的双重目标。

链接: https://arxiv.org/abs/2601.11093
作者: Ashish Raj Shekhar,Shiven Agarwal,Priyanuj Bordoloi,Yash Shah,Tejas Anvekar,Vivek Gupta
机构: Arizona State University (亚利桑那州立大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) can now solve entire exams directly from uploaded PDF assessments, raising urgent concerns about academic integrity and the reliability of grades and credentials. Existing watermarking techniques either operate at the token level or assume control over the model’s decoding process, making them ineffective when students query proprietary black-box systems with instructor-provided documents. We present Integrity Shield, a document-layer watermarking system that embeds schema-aware, item-level watermarks into assessment PDFs while keeping their human-visible appearance unchanged. These watermarks consistently prevent MLLMs from answering shielded exam PDFs and encode stable, item-level signatures that can be reliably recovered from model or student responses. Across 30 exams spanning STEM, humanities, and medical reasoning, Integrity Shield achieves exceptionally high prevention (91-94% exam-level blocking) and strong detection reliability (89-93% signature retrieval) across four commercial MLLMs. Our demo showcases an interactive interface where instructors upload an exam, preview watermark behavior, and inspect pre/post AI performance and authorship evidence.
zh

[NLP-35] Efficient Multilingual Name Type Classification Using Convolutional Networks

【速读】: 该论文旨在解决多语言命名实体分类(Named Entity Classification, NEC)任务中对计算效率与准确率平衡的问题,特别是在资源受限环境下如何实现高效处理。其解决方案的关键在于提出一种名为Onomas-CNN X的卷积神经网络架构,该架构通过并行卷积分支(parallel convolution branches)、深度可分离操作(depthwise-separable operations)和分层分类机制(hierarchical classification)相结合的方式,在仅使用单个CPU核心的情况下实现了高达92.1%的准确率,并以每秒处理2,813个名称的速度显著优于基于Transformer的XLM-RoBERTa模型(快46倍),同时将能耗降低至其四分之一。这表明在训练数据充足时,专用CNN结构仍可在特定自然语言处理任务上媲美大型预训练模型。

链接: https://arxiv.org/abs/2601.11090
作者: Davor Lauc
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Preprint of paper presented at ISAI-NLP Phukat 2025

点击查看摘要

Abstract:We present a convolutional neural network approach for classifying proper names by language and entity type. Our model, Onomas-CNN X, combines parallel convolution branches with depthwise-separable operations and hierarchical classification to process names efficiently on CPU hardware. We evaluate the architecture on a large multilingual dataset covering 104 languages and four entity types (person, organization, location, other). Onomas-CNN X achieves 92.1% accuracy while processing 2,813 names per second on a single CPU core - 46 times faster than fine-tuned XLM-RoBERTa with comparable accuracy. The model reduces energy consumption by a factor of 46 compared to transformer baselines. Our experiments demonstrate that specialized CNN architectures remain competitive with large pre-trained models for focused NLP tasks when sufficient training data exists.
zh
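
摘要提到深度可分离卷积是 Onomas-CNN X 在 CPU 上高效的来源之一。下面用一个纯 Python 小例子对比标准卷积与深度可分离卷积的参数量(卷积核宽度与通道数均为示意性假设,并非论文的真实网络配置):

```python
def conv1d_params(kernel_size, c_in, c_out):
    """标准一维卷积参数量(不含偏置):k * C_in * C_out"""
    return kernel_size * c_in * c_out

def depthwise_separable_params(kernel_size, c_in, c_out):
    """深度可分离卷积:逐通道卷积 k * C_in,再加逐点卷积 C_in * C_out"""
    return kernel_size * c_in + c_in * c_out

# 示意配置:卷积核宽 5、输入输出各 256 通道
std = conv1d_params(5, 256, 256)               # 327,680 参数
sep = depthwise_separable_params(5, 256, 256)  # 1,280 + 65,536 = 66,816 参数
print(f"参数量压缩约 {std / sep:.1f} 倍")
```

参数量的缩减直接转化为 CPU 上更少的乘加运算,这与摘要中报告的吞吐与能耗优势方向一致。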

[NLP-36] ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在代码生成任务中普遍局限于静态逻辑验证,而忽视了真实后端开发所需的动态、全流程工程能力的问题。现有基准测试未能充分评估模型在完整开发周期中的表现,如环境配置、服务部署及端到端API测试等关键环节。为应对这一挑战,作者提出ABC-Bench,其核心创新在于构建了一个面向代理式后端编码的可执行工作流基准,要求模型从仓库探索开始,完成容器化服务实例化并成功通过外部API测试。该解决方案的关键在于设计了一个自动化、可扩展的流水线,从开源项目中提取224个实际任务,覆盖8种编程语言和19个框架,从而系统性地评估LLM在复杂工程场景下的端到端执行能力。

链接: https://arxiv.org/abs/2601.11077
作者: Jie Yang,Honglin Guo,Li Ji,Jiazheng Zhou,Rui Zheng,Zhikai Lei,Shuo Zhang,Zhiheng Xi,Shichun Liu,Yuxin Wang,Bo Wang,Yining Zheng,Tao Gui,Xipeng Qiu
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The evolution of Large Language Models (LLMs) into autonomous agents has expanded the scope of AI coding from localized code generation to complex, repository-level, and execution-driven problem solving. However, current benchmarks predominantly evaluate code logic in static contexts, neglecting the dynamic, full-process requirements of real-world engineering, particularly in backend development which demands rigorous environment configuration and service deployment. To address this gap, we introduce ABC-Bench, a benchmark explicitly designed to evaluate agentic backend coding within a realistic, executable workflow. Using a scalable automated pipeline, we curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories. Distinct from previous evaluations, ABC-Bench requires the agents to manage the entire development lifecycle, from repository exploration to instantiating containerized services, and to pass external end-to-end API tests. Our extensive evaluation reveals that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks, highlighting a substantial disparity between current model capabilities and the demands of practical backend engineering. Our code is available at this https URL.
zh

[NLP-37] Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs

【速读】: 该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)训练下大语言模型(Large Language Models, LLMs)出现的“伪奖励”现象——即模型在接收到错误或无意义奖励信号时仍能显著提升性能的问题。这一现象可能源于模型对训练数据的过度依赖,导致其绕过真实推理过程而采用记忆检索的捷径策略,从而引发性能虚假提升与潜在的泛化能力下降。解决方案的关键在于通过路径修补(Path Patching)、对数概率透镜(Logit Lens)、Jensen-Shannon散度(JSD)分析及神经微分方程等工具,识别出一个隐藏的“锚点-适配器”电路(Anchor-Adapter circuit),其中中间层(L18–20)存在一个功能锚点(Functional Anchor),触发对已记忆答案的检索,随后在深层(L21+)由结构适配器(Structural Adapters)重构表征以适应此捷径信号;进一步发现可通过放大该电路中特定多层感知机(MLP)键值来实现双向因果操控,从而人工增强或抑制由数据污染驱动的性能表现,为检测和缓解RLVR调优模型中的数据污染提供了机制层面的干预路径。

链接: https://arxiv.org/abs/2601.11061
作者: Lecheng Yan,Ruizhe Li,Guanhua Chen,Qing Li,Jiahui Geng,Wenxi Li,Vincent Wang,Chris Lee
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Work in process

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) is highly effective for enhancing LLM reasoning, yet recent evidence shows models like Qwen 2.5 achieve significant gains even with spurious or incorrect rewards. We investigate this phenomenon and identify a “Perplexity Paradox”: spurious RLVR triggers a divergence where answer-token perplexity drops while prompt-side coherence degrades, suggesting the model is bypassing reasoning in favor of memorization. Using Path Patching, Logit Lens, JSD analysis, and Neural Differential Equations, we uncover a hidden Anchor-Adapter circuit that facilitates this shortcut. We localize a Functional Anchor in the middle layers (L18-20) that triggers the retrieval of memorized solutions, followed by Structural Adapters in later layers (L21+) that transform representations to accommodate the shortcut signal. Finally, we demonstrate that scaling specific MLP keys within this circuit allows for bidirectional causal steering-artificially amplifying or suppressing contamination-driven performance. Our results provide a mechanistic roadmap for identifying and mitigating data contamination in RLVR-tuned models. Code is available at this https URL.
zh
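
摘要中用 JSD(Jensen-Shannon 散度)分析伪奖励训练前后输出分布的变化。JSD 本身的计算可以用标准库写成如下最小示意:

```python
import math

def kl_divergence(p, q):
    """KL(P||Q),p、q 为同长度的离散概率分布"""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon 散度:对称,且(以 e 为底)上界为 ln 2"""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.7, 0.2, 0.1]   # 例如:正常模型在某位置的词元分布(示例数据)
q = [0.1, 0.2, 0.7]   # 例如:伪奖励训练后的分布(示例数据)
print(round(js_divergence(p, q), 4))
```

论文即是用这类分布距离追踪各层表示被“锚点-适配器”电路改写的程度,具体用法以原文为准。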

[NLP-38] CoG: Controllable Graph Reasoning via Relational Blueprints and Failure-Aware Refinement over Knowledge Graphs

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中因幻觉(hallucination)导致的可靠性问题,以及现有知识图谱(Knowledge Graph, KG)增强型LLM方法因采用同质化搜索策略而产生的认知僵化问题,此类策略易受邻域噪声和结构错位影响,进而引发推理停滞。解决方案的关键在于提出一种无需训练的框架CoG,其灵感源自双过程理论(Dual-Process Theory),通过两个核心模块实现:一是“关系蓝图引导”模块(Relational Blueprint Guidance),作为快速直觉过程,利用可解释的关系蓝图作为软结构约束,快速稳定搜索方向以抵抗噪声;二是“故障感知精炼”模块(Failure-Aware Refinement),作为审慎分析过程,在遇到推理障碍时触发证据条件下的反思并执行可控回溯,从而突破推理停滞。

链接: https://arxiv.org/abs/2601.11047
作者: Yuanxiang Liu,Songze Li,Xiaoke Guo,Zhaoyan Gong,Qifei Zhang,Huajun Chen,Wen Zhang
机构: Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities but often grapple with reliability challenges like hallucinations. While Knowledge Graphs (KGs) offer explicit grounding, existing paradigms of KG-augmented LLMs typically exhibit cognitive rigidity–applying homogeneous search strategies that render them vulnerable to instability under neighborhood noise and structural misalignment leading to reasoning stagnation. To address these challenges, we propose CoG, a training-free framework inspired by Dual-Process Theory that mimics the interplay between intuition and deliberation. First, functioning as the fast, intuitive process, the Relational Blueprint Guidance module leverages relational blueprints as interpretable soft structural constraints to rapidly stabilize the search direction against noise. Second, functioning as the prudent, analytical process, the Failure-Aware Refinement module intervenes upon encountering reasoning impasses. It triggers evidence-conditioned reflection and executes controlled backtracking to overcome reasoning stagnation. Experimental results on three benchmarks demonstrate that CoG significantly outperforms state-of-the-art approaches in both accuracy and efficiency.
zh
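
“关系蓝图引导”的检索过程可以理解为:按关系名序列在图上逐跳搜索,某一跳失败即回溯尝试其他分支。下面是一个与论文实现无关的玩具级示意(KG 用邻接表表示;蓝图在此被实现为硬约束,而论文中是软约束):

```python
def blueprint_guided_search(kg, start, blueprint):
    """沿关系蓝图(关系名序列)在 KG 上做深度优先检索;
    某一跳找不到匹配的边即返回 None,触发上一跳的回溯。"""
    def dfs(node, depth):
        if depth == len(blueprint):
            return [node]
        for rel, nxt in kg.get(node, []):
            if rel == blueprint[depth]:
                path = dfs(nxt, depth + 1)
                if path is not None:
                    return [node] + path
        return None  # 触发回溯

    return dfs(start, 0)

kg = {
    "Paris": [("capital_of", "France"), ("located_in", "Europe")],
    "France": [("continent", "Europe")],
}
print(blueprint_guided_search(kg, "Paris", ["capital_of", "continent"]))
# 得到路径 ['Paris', 'France', 'Europe']
```

CoG 的“故障感知精炼”则对应在此类回溯点上引入证据条件下的反思,而不是盲目枚举分支。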

[NLP-39] Spectral Characterization and Mitigation of Sequential Knowledge Editing Collapse

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在进行连续知识编辑时出现的灾难性性能退化问题,尤其是参数修改类方法导致模型通用能力显著下降的现象。现有方法多依赖启发式约束来缓解这一问题,但其内在机制尚不明确。论文通过谱分析揭示了模型通用能力与预训练权重矩阵主奇异方向密切相关——这些方向对扰动极为敏感,并随多次编辑逐步被破坏,且与编辑效果和泛化性能的下降高度一致。解决方案的关键在于提出REVIVE框架,该框架以原始权重的谱基表示参数更新,并显式保护主奇异子空间,通过滤除可能干扰关键区域的更新成分,从而在长期连续编辑(如高达20,000次)下显著提升编辑有效性并维持模型通用能力。

链接: https://arxiv.org/abs/2601.11042
作者: Chi Zhang,Mengqi Zhang,Xiaotian Ye,Runxi Cheng,Zisheng Zhou,Ying Zhou,Pengjie Ren,Zhumin Chen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 22 pages, 18 figures

点击查看摘要

Abstract:Sequential knowledge editing in large language models often causes catastrophic collapse of the model’s general abilities, especially for parameter-modifying methods. Existing approaches mitigate this issue through heuristic constraints on parameter updates, yet the mechanisms underlying such degradation remain insufficiently understood. In this work, we present a spectral analysis of sequential knowledge editing and show that a model’s general abilities are closely associated with dominant singular directions of pretrained weight matrices. These directions are highly sensitive to perturbations and are progressively disrupted by repeated edits, closely tracking the collapse in both editing efficacy and general performance. Building on this insight, we propose REVIVE, a plug-and-play framework that stabilizes sequential editing by explicitly preserving the dominant singular subspace. REVIVE represents parameter updates in the spectral basis of the original weights and filters components that would interfere with the protected region. Extensive experiments across multiple models and benchmarks show that REVIVE consistently improves editing efficacy while substantially preserving general abilities under long-horizon sequential editing, including extreme settings with up to 20,000 edits.
zh
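
“在原权重的谱基下表示更新并滤除干扰主奇异子空间的分量”这一思路,可以用 NumPy 写成如下最小示意(矩阵规模与保护维数 k 均为假设,非论文官方实现):

```python
import numpy as np

def protect_dominant_subspace(W, delta_W, k):
    """滤除更新 delta_W 中落在 W 前 k 个左/右奇异方向上的分量,
    只保留与主奇异子空间正交的部分(示意性实现)。"""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    Uk, Vk = U[:, :k], Vt[:k, :].T              # 主奇异子空间的左/右基
    delta = delta_W - Uk @ (Uk.T @ delta_W)     # 去掉左奇异方向上的分量
    delta = delta - (delta @ Vk) @ Vk.T         # 去掉右奇异方向上的分量
    return delta

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
dW = 0.01 * rng.standard_normal((8, 8))       # 模拟一次知识编辑的参数更新
safe_dW = protect_dominant_subspace(W, dW, k=2)
```

过滤后的更新对主奇异方向的扰动为零,这正是摘要中“显式保护主奇异子空间”所要达到的效果。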

[NLP-40] SonicBench: Dissecting the Physical Perception Bottleneck in Large Audio Language Models

【速读】: 该论文旨在解决大型音频语言模型(Large Audio Language Models, LALMs)在感知音频基础物理属性(如音高、响度和空间位置)方面的不足问题,这些问题尚未得到充分探索。其解决方案的关键在于提出SonicBench——一个基于心理物理学的基准测试平台,系统评估12个核心物理属性,涵盖五个感知维度,并通过可控生成工具构建刺激材料,支持识别(绝对判断)与比较(相对判断)两种互补范式。这一设计不仅能够量化模型的感官精度,还能考察其关系推理能力,从而揭示LALMs在基础听觉理解上的缺陷:多数模型表现接近随机猜测,且未展现出人类在比较任务中的优势;进一步分析表明,冻结的音频编码器已能有效捕获这些物理线索(准确率至少60%),说明瓶颈在于对已有感知信号的对齐与解码阶段。

链接: https://arxiv.org/abs/2601.11039
作者: Yirong Sun,Yanjun Chen,Xin Qiu,Gang Zhang,Hongyu Chen,Daokuan Wu,Chengming Li,Min Yang,Dawei Zhu,Wei Zhang,Xiaoyu Shen
机构: Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, EIT; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Shenzhen University of Advanced Technology; Shenzhen MSU-BIT University; Amazon AGI
类目: Sound (cs.SD); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Audio Language Models (LALMs) excel at semantic and paralinguistic tasks, yet their ability to perceive the fundamental physical attributes of audio such as pitch, loudness, and spatial location remains under-explored. To bridge this gap, we introduce SonicBench, a psychophysically grounded benchmark that systematically evaluates 12 core physical attributes across five perceptual dimensions. Unlike previous datasets, SonicBench uses a controllable generation toolbox to construct stimuli for two complementary paradigms: recognition (absolute judgment) and comparison (relative judgment). This design allows us to probe not only sensory precision but also relational reasoning capabilities, a domain where humans typically exhibit greater proficiency. Our evaluation reveals a substantial deficiency in LALMs’ foundational auditory understanding; most models perform near random guessing and, contrary to human patterns, fail to show the expected advantage on comparison tasks. Furthermore, explicit reasoning yields minimal gains. However, our linear probing analysis demonstrates crucially that frozen audio encoders do successfully capture these physical cues (accuracy at least 60%), suggesting that the primary bottleneck lies in the alignment and decoding stages, where models fail to leverage the sensory signals they have already captured.
zh

[NLP-41] Budget-Aware Anytime Reasoning with LLM-Synthesized Preference Data

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在计算资源受限场景下如何高效生成高质量推理结果的问题。传统方法倾向于进行耗时的完整推理以获得最优解,但在实际应用中(如旅行规划任务),往往需要在固定推理预算内尽早输出可用的中间解。为此,作者提出了一种“随时推理”(Anytime Reasoning)框架,并引入“随时指数”(Anytime Index)作为量化指标,衡量模型在增加推理标记数时解决方案质量的提升效率。其核心创新在于提出一种推理时自改进方法,利用模型自身生成的偏好数据进行在线学习,从而优化中间解的质量,显著提升了不同模型(包括Grok-3、GPT系列及LLaMA)在有限预算下的推理效率与效果。

链接: https://arxiv.org/abs/2601.11038
作者: Xuanming Zhang,Shwan Ashrafi,Aziza Mirsaidova,Amir Rezaeian,Miguel Ballesteros,Lydia B. Chilton,Zhou Yu,Dan Roth
机构: Columbia University (哥伦比亚大学); Oracle AI
类目: Computation and Language (cs.CL)
备注: 13 pages, 3 figures

点击查看摘要

Abstract:We study the reasoning behavior of large language models (LLMs) under limited computation budgets. In such settings, producing useful partial solutions quickly is often more practical than exhaustive reasoning, which incurs high inference costs. Many real-world tasks, such as trip planning, require models to deliver the best possible output within a fixed reasoning budget. We introduce an anytime reasoning framework and the Anytime Index, a metric that quantifies how effectively solution quality improves as reasoning tokens increase. To further enhance efficiency, we propose an inference-time self-improvement method using LLM-synthesized preference data, where models learn from their own reasoning comparisons to produce better intermediate solutions. Experiments on NaturalPlan (Trip), AIME, and GPQA datasets show consistent gains across Grok-3, GPT-oss, GPT-4.1/4o, and LLaMA models, improving both reasoning quality and efficiency under budget constraints.
zh
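
摘要未给出 Anytime Index 的具体公式。一种自然的示意性定义,是把“质量-推理 token 数”曲线的曲线下面积按总预算归一化:早期就能给出高质量中间解的模型得分更高。

```python
def anytime_index(tokens, quality):
    """质量-预算曲线下面积(梯形法),按总预算归一化。
    注意:这只是对 Anytime Index 含义的示意性还原,非论文原始定义。"""
    assert len(tokens) == len(quality) and len(tokens) >= 2
    area = 0.0
    for i in range(1, len(tokens)):
        area += (quality[i] + quality[i - 1]) / 2 * (tokens[i] - tokens[i - 1])
    return area / (tokens[-1] - tokens[0])

budgets = [0, 100, 200]
early = [0.0, 0.8, 0.9]   # 早期即给出高质量中间解
late = [0.0, 0.2, 0.9]    # 最终质量相同,但早期解很差
print(anytime_index(budgets, early), anytime_index(budgets, late))
```

两条曲线终点质量相同,但“早熟”曲线的指数更高,体现了预算受限场景下中间解质量的价值。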

[NLP-42] From Interpretability to Performance: Optimizing Retrieval Heads for Long-Context Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长上下文处理能力上的局限性问题,尤其是如何有效利用机制可解释性中识别出的“检索头”(retrieval heads)来提升模型在长文本场景下的性能。其解决方案的关键在于提出一种名为RetMask的方法,该方法通过对比正常模型输出与掩码掉检索头后的模型输出,生成训练信号以指导模型优化;这一机制驱动的训练策略在不损害通用任务性能的前提下,显著提升了模型在128K上下文长度下的表现,如HELMT基准上提升2.28点,并在带引用生成和段落重排序等任务中分别获得70%和32%的改进,同时揭示了检索头组织模式(集中式 vs 分布式)对效果的影响,验证了机制洞察向性能增强转化的可行性。

链接: https://arxiv.org/abs/2601.11020
作者: Youmi Ma,Naoaki Okazaki
机构: Institute of Science Tokyo (东京科学研究所)
类目: Computation and Language (cs.CL)
备注: 13 pages

点击查看摘要

Abstract:Advances in mechanistic interpretability have identified special attention heads, known as retrieval heads, that are responsible for retrieving information from the context. However, the role of these retrieval heads in improving model performance remains unexplored. This work investigates whether retrieval heads can be leveraged to enhance the long-context capabilities of LLMs. Specifically, we propose RetMask, a method that generates training signals by contrasting normal model outputs with those from an ablated variant in which the retrieval heads are masked. This mechanism-based approach achieves substantial improvements: +2.28 points on HELMET at 128K for Llama-3.1, with +70% gains on generation with citation and +32% on passage re-ranking, while preserving performance on general tasks. Experiments across three model families reveal that the effectiveness depends on retrieval head organization: models with concentrated patterns of retrieval heads respond strongly, while those with distributed patterns show limited gains. This mechanistic relationship validates the function of retrieval heads and demonstrates that mechanistic insights can be transformed into performance enhancements.
zh

[NLP-43] Finding the Translation Switch: Discovering and Exploiting the Task-Initiation Features in LLMs AAAI2026

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在未进行任务特定微调的情况下仍具备较强翻译能力的内在机制不明确的问题。其核心挑战在于如何识别并验证这些模型中驱动翻译行为的关键内部特征。解决方案的关键在于引入稀疏自编码器(Sparse Autoencoders, SAEs)构建一个新颖的特征识别框架:首先筛选出在翻译输入中频繁共激活的特征,再通过基于主成分分析(PCA)的一致性度量过滤出功能一致的特征子集,从而精准定位到一组“翻译启动”特征(translation initiation features)。因果干预实验表明,增强这些特征可引导模型正确翻译,而移除它们则引发幻觉和离题输出,证实其为模型先天翻译能力的核心组成。进一步地,作者利用这一机制提出一种新的数据选择策略——优先对“机制困难样本”(mechanistically hard samples)进行微调,即那些无法自然激活翻译启动特征的样本,显著提升了微调效率并抑制了幻觉现象,且该机制在同家族更大模型中具有可迁移性。

链接: https://arxiv.org/abs/2601.11019
作者: Xinwei Wu,Heng Liu,Xiaohu Zhao,Yuqi Ren,Linlong Xu,Longyue Wang,Deyi Xiong,Weihua Luo,Kaifu Zhang
机构: 1. University of Science and Technology of China (中国科学技术大学); 2. Tsinghua University (清华大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Large Language Models (LLMs) frequently exhibit strong translation abilities, even without task-specific fine-tuning. However, the internal mechanisms governing this innate capability remain largely opaque. To demystify this process, we leverage Sparse Autoencoders (SAEs) and introduce a novel framework for identifying task-specific features. Our method first recalls features that are frequently co-activated on translation inputs and then filters them for functional coherence using a PCA-based consistency metric. This framework successfully isolates a small set of translation initiation features. Causal interventions demonstrate that amplifying these features steers the model towards correct translation, while ablating them induces hallucinations and off-task outputs, confirming they represent a core component of the model’s innate translation competency. Moving from analysis to application, we leverage this mechanistic insight to propose a new data selection strategy for efficient fine-tuning. Specifically, we prioritize training on mechanistically hard samples-those that fail to naturally activate the translation initiation features. Experiments show this approach significantly improves data efficiency and suppresses hallucinations. Furthermore, we find these mechanisms are transferable to larger models of the same family. Our work not only decodes a core component of the translation mechanism in LLMs but also provides a blueprint for using internal model mechanism to create more robust and efficient models. The codes are available at this https URL.
zh
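
摘要中“基于 PCA 的一致性度量”可以理解为:一组特征激活向量若主要沿单一方向变化,则第一主成分解释的方差占比接近 1。下面是一个假设性的最小实现(并非论文的确切度量):

```python
import numpy as np

def pca_consistency(X):
    """第一主成分解释方差占比,度量一组向量的方向一致性(示意实现)。"""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2
    return float(var[0] / var.sum())

rng = np.random.default_rng(0)
v = rng.standard_normal(16)
# 功能一致的一组激活:近似秩一(沿 v 方向)加微小噪声
coherent = np.outer(rng.standard_normal(50), v) + 0.01 * rng.standard_normal((50, 16))
# 无结构的随机激活
random_acts = rng.standard_normal((50, 16))
print(pca_consistency(coherent), pca_consistency(random_acts))
```

按此度量筛选,即可从高频共激活特征中保留“功能一致”的子集,过滤掉只是偶然共现的特征。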

[NLP-44] AdaMARP: An Adaptive Multi-Agent Interaction Framework for General Immersive Role-Playing

【速读】: 该论文旨在解决当前大语言模型(Large Language Model, LLM)在角色扮演(role-playing)任务中沉浸感不足与适应性差的问题,尤其是现有系统对动态环境信息建模能力弱、场景和角色设定多为静态,难以支持多角色协同、场景切换及实时角色引入等复杂交互需求。其解决方案的关键在于提出一个自适应多智能体角色扮演框架 AdaMARP,核心创新包括:1)设计一种沉浸式消息格式,将 [Thought]、(Action)、Environment 和 Speech 交织呈现,增强环境感知与行为连贯性;2)引入显式的 Scene Manager 模块,通过离散动作(如 init_scene、pick_speaker、switch_scene、add_role、end)及其理由来精确控制叙事流程,实现对角色调度与场景演进的结构化管理。该框架还配套构建了 AdaRPSet(用于训练角色行为一致性)和 AdaSMSet(用于监督场景决策),并通过 AdaptiveBench 进行轨迹级评估,实验证明其在角色一致性、环境锚定性和叙事连贯性等方面显著优于主流商业模型。

链接: https://arxiv.org/abs/2601.11007
作者: Zhenhua Xu,Dongsheng Chen,Shuo Wang,Jian Li,Chengjie Wang,Meng Han,Yabiao Wang
机构: Zhejiang University (浙江大学); Tencent Youtu Lab (腾讯优图实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:LLM role-playing aims to portray arbitrary characters in interactive narratives, yet existing systems often suffer from limited immersion and adaptability. They typically under-model dynamic environmental information and assume largely static scenes and casts, offering insufficient support for multi-character orchestration, scene transitions, and on-the-fly character introduction. We propose an adaptive multi-agent role-playing framework, AdaMARP, featuring an immersive message format that interleaves [Thought], (Action), Environment, and Speech, together with an explicit Scene Manager that governs role-playing through discrete actions (init_scene, pick_speaker, switch_scene, add_role, end) accompanied by rationales. To train these capabilities, we construct AdaRPSet for the Actor Model and AdaSMSet for supervising orchestration decisions, and introduce AdaptiveBench for trajectory-level evaluation. Experiments across multiple backbones and model scales demonstrate consistent improvements: AdaRPSet enhances character consistency, environment grounding, and narrative coherence, with an 8B actor outperforming several commercial LLMs, while AdaSMSet enables smoother scene transitions and more natural role introductions, surpassing Claude Sonnet 4.5 using only a 14B LLM.
zh
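
Scene Manager 的五个离散动作(init_scene、pick_speaker、switch_scene、add_role、end)可以抽象为一个小状态机。下面是一个与论文实现无关的最小示意:

```python
def scene_manager_step(state, action, arg=None):
    """按离散动作更新场景状态(示意性状态机,非 AdaMARP 官方实现)。"""
    if action == "init_scene":
        return {"scene": arg, "roles": [], "speaker": None, "done": False}
    if action == "add_role":
        state["roles"].append(arg)
    elif action == "pick_speaker":
        state["speaker"] = arg
    elif action == "switch_scene":
        state["scene"] = arg
    elif action == "end":
        state["done"] = True
    else:
        raise ValueError(f"未知动作: {action}")
    return state

# 一段示例轨迹:开场 -> 引入角色 -> 指定发言人 -> 切换场景 -> 结束
state = scene_manager_step(None, "init_scene", "茶馆")
state = scene_manager_step(state, "add_role", "掌柜")
state = scene_manager_step(state, "pick_speaker", "掌柜")
state = scene_manager_step(state, "switch_scene", "后院")
state = scene_manager_step(state, "end")
```

论文中每个动作还附带理由(rationale)以供监督训练,此处从略。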

[NLP-45] NAACL: Noise-AwAre Verbal Confidence Calibration for LLMs in RAG Systems

【速读】: 该论文旨在解决在检索增强生成(Retrieval-Augmented Generation, RAG)场景下大语言模型(Large Language Models, LLMs)的置信度校准(confidence calibration)问题,特别是由于检索到的上下文存在噪声(如矛盾或无关证据)导致模型出现严重过度自信(overconfidence)的问题。解决方案的关键在于提出NAACL Rules(Noise-Aware Confidence Calibration Rules),这是一套基于噪声感知的校准规则,并进一步设计了NAACL框架,通过约2000个HotpotQA样本的监督微调(Supervised Fine-Tuning, SFT)使模型具备内在的噪声感知能力,从而在不依赖更强教师模型的前提下显著提升校准性能,实证表明其在域内和域外均能有效降低期望校准误差(Expected Calibration Error, ECE)。

链接: https://arxiv.org/abs/2601.11004
作者: Jiayu Liu,Rui Wang,Qing Zong,Qingcheng Zeng,Tianshi Zheng,Haochen Shi,Dadi Guo,Baixuan Xu,Chunyang Li,Yangqiu Song
机构: HKUST; Northwestern University
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Accurately assessing model confidence is essential for deploying large language models (LLMs) in mission-critical factual domains. While retrieval-augmented generation (RAG) is widely adopted to improve grounding, confidence calibration in RAG settings remains poorly understood. We conduct a systematic study across four benchmarks, revealing that LLMs exhibit poor calibration performance due to noisy retrieved contexts. Specifically, contradictory or irrelevant evidence tends to inflate the model’s false certainty, leading to severe overconfidence. To address this, we propose NAACL Rules (Noise-AwAre Confidence CaLibration Rules) to provide a principled foundation for resolving overconfidence under noise. We further design NAACL, a noise-aware calibration framework that synthesizes supervision from about 2K HotpotQA examples guided by these rules. By performing supervised fine-tuning (SFT) with this data, NAACL equips models with intrinsic noise awareness without relying on stronger teacher models. Empirical results show that NAACL yields substantial gains, improving ECE scores by 10.9% in-domain and 8.0% out-of-domain. By bridging the gap between retrieval noise and verbal calibration, NAACL paves the way for both accurate and epistemically reliable LLMs.
zh
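
摘要中的 ECE(期望校准误差)是标准指标:按置信度分桶,对每桶的 |准确率 − 平均置信度| 按样本占比加权求和。最小实现如下:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = Σ (桶内样本占比) * |桶内准确率 - 桶内平均置信度|"""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, y))
    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / n * abs(accuracy - avg_conf)
    return ece

# 过度自信的例子:置信度 0.95,实际只答对一半 -> ECE = 0.45
print(expected_calibration_error([0.95] * 10, [1] * 5 + [0] * 5))
```

噪声上下文诱发的过度自信正是以这种“置信度远高于准确率”的形式推高 ECE,NAACL 的噪声感知规则旨在压低这一差值。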

[NLP-46] Redefining Machine Simultaneous Interpretation: From Incremental Translation to Human-Like Strategies

【速读】: 该论文旨在解决同步机器翻译(Simultaneous Machine Translation, SiMT)在严格实时约束下难以实现高质量翻译的问题,传统仅依赖READ/WRITE动作的策略无法充分应对实时性与语义保真度之间的平衡。其解决方案的关键在于扩展SiMT的动作空间,引入四种自适应动作:Sentence_Cut(句子切分)、Drop(删减)、Partial_Summarization(部分摘要化)和Pronominalization(代词化),这些动作能够在保持语义一致性的前提下实现实时重构、省略与简化。研究进一步将这些动作整合进大语言模型(Large Language Model, LLM)框架,并通过面向动作的提示(action-aware prompting)构建训练样本;同时开发了一种延迟感知的文本转语音(TTS)流水线以评估翻译质量与词级单调性,实验表明该方法在多个语言对上显著提升了语义指标并降低了延迟,尤其当Drop与Sentence_Cut结合时,在流畅性与延迟之间实现了更优平衡。

链接: https://arxiv.org/abs/2601.11002
作者: Qianen Zhang,Zeyu Yang,Satoshi Nakamura
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); Nara Institute of Science and Technology (奈良科学技术大学院大学)
类目: Computation and Language (cs.CL)
备注: arXiv admin note: substantial text overlap with arXiv:2509.21801

点击查看摘要

Abstract:Simultaneous Machine Translation (SiMT) requires high-quality translations under strict real-time constraints, which traditional policies with only READ/WRITE actions cannot fully address. We extend the action space of SiMT with four adaptive actions: Sentence_Cut, Drop, Partial_Summarization and Pronominalization, which enable real-time restructuring, omission, and simplification while preserving semantic fidelity. We adapt these actions in a large language model (LLM) framework and construct training references through action-aware prompting. To evaluate both quality and word-level monotonicity, we further develop a latency-aware TTS pipeline that maps textual outputs to speech with realistic timing. Experiments on the ACL60/60 English-Chinese, English-German and English-Japanese benchmarks show that our framework consistently improves semantic metrics and achieves lower delay compared to reference translations and salami-based baselines. Notably, combining Drop and Sentence_Cut leads to consistent improvements in the balance between fluency and latency. These results demonstrate that enriching the action space of LLM-based SiMT provides a promising direction for bridging the gap between human and machine interpretation.
zh

[NLP-47] When Personalization Misleads: Understanding and Mitigating Hallucinations in Personalized LLMs

【速读】: 该论文旨在解决个性化大语言模型(Personalized Large Language Models, PLLMs)在推理过程中因用户历史信息与事实表示之间的表征纠缠(representational entanglement)而导致的“个人化诱导幻觉”(personalization-induced hallucinations)问题,即模型倾向于生成符合用户历史偏好但偏离客观事实的答案,从而损害事实准确性并可能传播错误认知。解决方案的关键在于提出一种轻量级的推理时干预方法——事实保真个性化引导(Factuality-Preserving Personalized Steering, FPPS),该方法能够在不破坏个性化行为的前提下有效缓解由个人化引发的事实扭曲,实验证明其在多个LLM骨干网络和个性化策略下均能显著提升事实准确性并维持个性化性能。

链接: https://arxiv.org/abs/2601.11000
作者: Zhongxiang Sun,Yi Zhan,Chenglei Shen,Weijie Yu,Xiao Zhang,Ming He,Jun Xu
机构: Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院); AI Lab at Lenovo Research(联想研究院人工智能实验室); School of Artificial Intelligence and Data Science, University of International Business and Economics(对外经济贸易大学人工智能与数据科学学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 15 figures

点击查看摘要

Abstract:Personalized large language models (LLMs) adapt model behavior to individual users to enhance user satisfaction, yet personalization can inadvertently distort factual reasoning. We show that when personalized LLMs face factual queries, there exists a phenomenon where the model generates answers aligned with a user’s prior history rather than the objective truth, resulting in personalization-induced hallucinations that degrade factual reliability and may propagate incorrect beliefs, due to representational entanglement between personalization and factual representations. To address this issue, we propose Factuality-Preserving Personalized Steering (FPPS), a lightweight inference-time approach that mitigates personalization-induced factual distortions while preserving personalized behavior. We further introduce PFQABench, the first benchmark designed to jointly evaluate factual and personalized question answering under personalization. Experiments across multiple LLM backbones and personalization methods show that FPPS substantially improves factual accuracy while maintaining personalized performance.
zh
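
FPPS 属于推理时的激活引导(activation steering)。一个常见做法(此处为通用示意,非论文的确切算法)是用两组隐藏状态均值之差作为“个性化方向”,推理时从隐藏状态中削去该方向上的分量:

```python
import numpy as np

def steering_direction(personalized_acts, factual_acts):
    """difference-in-means:个性化样本与事实样本隐藏状态均值之差。"""
    return personalized_acts.mean(axis=0) - factual_acts.mean(axis=0)

def remove_direction(h, v, beta=1.0):
    """从隐藏状态 h 中按强度 beta 削去方向 v 上的投影分量。"""
    v_hat = v / np.linalg.norm(v)
    return h - beta * np.dot(h, v_hat) * v_hat

rng = np.random.default_rng(0)
# 用随机数据模拟两组隐藏状态(真实场景中来自模型前向传播)
v = steering_direction(rng.standard_normal((32, 8)) + 1.0,
                       rng.standard_normal((32, 8)))
h = rng.standard_normal(8)
h_steered = remove_direction(h, v, beta=1.0)
```

beta 控制去个性化的强度:beta=0 保留原始个性化行为,beta=1 完全移除该方向分量,介于其间即可在事实性与个性化之间折中。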

[NLP-48] ZPD Detector: Data Selection via Capability-Difficulty Alignment for Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)训练中因数据成本上升和高质量数据稀缺所导致的数据利用效率低下问题。现有数据选择方法多依赖静态标准(如样本难度、不确定性或启发式规则),无法捕捉模型与数据之间动态演化的交互关系。其解决方案的关键在于提出ZPD Detector框架,该框架受“最近发展区”(Zone of Proximal Development, ZPD)教育理论启发,通过双向建模样本难度与模型当前能力之间的对齐关系,结合难度校准、基于项目反应理论(Item Response Theory, IRT)的能力估计以及能力-难度匹配得分,实现学习过程中动态识别最具信息量的样本,从而显著提升数据使用效率并为训练策略设计提供新视角。

链接: https://arxiv.org/abs/2601.10986
作者: Bo Yang,Yunkui Chen,Lanfei Feng,Yu Zhang,Shijian Li
机构: Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As the cost of training large language models continues to increase and high-quality training data becomes increasingly scarce, selecting high-value samples or synthesizing effective training data under limited data budgets has emerged as a critical research problem. Most existing data selection methods rely on static criteria, such as difficulty, uncertainty, or heuristics, and fail to model the evolving relationship between the model and the data. Inspired by the educational theory of the Zone of Proximal Development (ZPD), we propose ZPD Detector, a data selection framework that adopts a bidirectional perspective between models and data by explicitly modeling the alignment between sample difficulty and the model’s current capability. ZPD Detector integrates difficulty calibration, model capability estimation based on Item Response Theory (IRT), and a capability-difficulty matching score to dynamically identify the most informative samples at each learning stage, improving data utilization efficiency; moreover, this dynamic matching strategy provides new insights into training strategy design. All code and data will be released once our work is accepted, to support reproducible research.
zh
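
摘要提到用 IRT 估计模型能力。经典 2PL 模型下,能力为 θ 的“考生”答对难度 b、区分度 a 的题目的概率为 σ(a(θ−b));匹配分的具体定义摘要中未给出,下面是一个假设性版本(答对概率越接近 0.5,样本越处在“最近发展区”):

```python
import math

def irt_2pl(theta, a, b):
    """2PL IRT:P(答对) = 1 / (1 + exp(-a * (theta - b)))"""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def zpd_match_score(theta, b, a=1.0, target=0.5):
    """假设性的能力-难度匹配分:答对概率越接近 target 得分越高。"""
    return 1.0 - abs(irt_2pl(theta, a, b) - target)

# 能力 theta=0 时,难度 b=0 的题目恰好落在最近发展区,
# 而 b=5 的题目过难、匹配分更低
print(zpd_match_score(0.0, 0.0), zpd_match_score(0.0, 5.0))
```

随着训练推进重新估计 θ 并重算匹配分,即可在不同学习阶段动态挑选“跳一跳够得着”的样本。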

[NLP-49] AJAR: Adaptive Jailbreak Architecture for Red-teaming

【速读】: 该论文旨在解决当前红队测试(red-teaming)框架在应对生成式 AI(Generative AI)代理(agent)时的局限性问题,即现有方法要么局限于静态文本攻击,要么缺乏模块化架构以模拟复杂、多轮的代理级渗透行为。其解决方案的关键在于提出 AJAR(Adaptive Jailbreak Architecture for Red-teaming),通过协议驱动的认知编排(Protocol-driven Cognitive Orchestration)实现攻击逻辑与执行循环的解耦,利用 Model Context Protocol (MCP) 将先进算法如 X-Teaming 封装为标准化插件服务,从而支持状态感知的回溯和环境感知的攻击评估,有效拓展了对工具使用场景下新型安全风险的探测能力。

链接: https://arxiv.org/abs/2601.10971
作者: Yipu Dou,Wang Yang
机构: Southeast University (东南大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) evolve from static chatbots into autonomous agents capable of tool execution, the landscape of AI safety is shifting from content moderation to action security. However, existing red-teaming frameworks remain bifurcated: they either focus on rigid, script-based text attacks or lack the architectural modularity to simulate complex, multi-turn agentic exploitations. In this paper, we introduce AJAR (Adaptive Jailbreak Architecture for Red-teaming), a proof-of-concept framework designed to bridge this gap through Protocol-driven Cognitive Orchestration. Built upon the robust runtime of Petri, AJAR leverages the Model Context Protocol (MCP) to decouple adversarial logic from the execution loop, encapsulating state-of-the-art algorithms like X-Teaming as standardized, plug-and-play services. We validate the architectural feasibility of AJAR through a controlled qualitative case study, demonstrating its ability to perform stateful backtracking within a tool-use environment. Furthermore, our preliminary exploration of the “Agentic Gap” reveals a complex safety dynamic: while tool usage introduces new injection vectors via code execution, the cognitive load of parameter formatting can inadvertently disrupt persona-based attacks. AJAR is open-sourced to facilitate the standardized, environment-aware evaluation of this emerging attack surface. The code and data are available at this https URL.
zh
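
“把 X-Teaming 等算法封装为即插即用服务”在工程上对应插件注册表这类模式。下面是一个与 MCP 协议本身无关的最小示意(类名、策略名均为举例):

```python
class AttackRegistry:
    """把攻击策略注册为可插拔服务的最小注册表(示意,非 AJAR 官方代码)。"""

    def __init__(self):
        self._strategies = {}

    def register(self, name):
        def decorator(fn):
            self._strategies[name] = fn
            return fn
        return decorator

    def run(self, name, prompt):
        return self._strategies[name](prompt)

registry = AttackRegistry()

@registry.register("x_teaming")          # 策略名仅作示例
def x_teaming_stub(prompt):
    return f"[multi-turn attack plan for: {prompt}]"

print(registry.run("x_teaming", "tool-use jailbreak"))
```

将攻击逻辑与执行循环解耦后,同一执行环境即可替换或叠加不同的红队策略,这正是论文强调的模块化收益。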

[NLP-50] Steering Language Models Before They Speak: Logit-Level Interventions

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在特定应用场景中缺乏可控性的问题,例如风格敏感的文本重写、用户自适应沟通以及毒性内容抑制等任务。现有方法如基于提示(prompting-based)和基于激活(activation-based)的控制策略存在局限:前者难以实现稳定且细粒度的控制,后者则需要对模型内部层进行深度访问,限制了实用性。本文提出一种无需训练的推理时 logits 干预方法,其核心在于构建一个基于标注语料库 z-标准化对数几率(z-normalized log-odds)统计得到的词元得分表,通过该表调整解码分布来实现输出特性的定向引导。实验表明,该方法在写作复杂度、正式程度和毒性三个不同任务上均能实现显著且一致的性能提升,验证了其任务无关性和广泛适用性。

链接: https://arxiv.org/abs/2601.10960
作者: Hyeseon An,Shinwoo Park,Hyundong Jin,Yo-Sub Han
机构: Yonsei University (延世大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 5 figures, preprint

点击查看摘要

Abstract:Steering LLMs is essential for specialized applications such as style-sensitive text rewriting, user-adaptive communication, and toxicity mitigation. Current steering methods, such as prompting-based and activation-based approaches, are widely used to guide model behavior. However, activation-based techniques require deep access to internal layers, while prompting-based steering often fails to provide consistent or fine-grained control. In order to address these limitations, we propose a training-free inference-time logit intervention for controllable generation. Our approach utilizes a statistical token score table derived from z-normalized log-odds of labeled corpora to shift the decoding distribution. Empirical evaluations across three diverse datasets focusing on writing complexity, formality, and toxicity demonstrate that our method effectively steers output characteristics, confirming its broad applicability and task-agnostic nature. Our results show that statistically grounded logit steering can achieve large, consistent, and multi-task control gains: up to +47%p accuracy and 50x f1 improvement.
zh

[NLP-51] Multi-Stage Patient Role-Playing Framework for Realistic Clinical Interactions

【速读】: 该论文旨在解决当前临床对话模拟中缺乏真实性和多样性的问题,现有方法依赖通用或由大语言模型(Large Language Models, LLMs)生成的对话数据,导致医生-患者交互的真实性不足。解决方案的关键在于构建首个中文患者模拟数据集(Ch-PatientSim),基于五维人格结构(persona structure)模拟患者行为,并通过少量样本生成与人工验证相结合的方式缓解类别不平衡问题;进一步提出无需训练的多阶段患者角色扮演框架(Multi-Stage Patient Role-Playing, MSPRP),将交互分解为三个阶段以确保响应的个性化与真实性,实验表明该方法在多个维度上显著提升模型性能。

链接: https://arxiv.org/abs/2601.10951
作者: Shijie Jiang,Zefan Zhang,Kehua Zhu,Tian Bai,Ruihong Zhao
机构: Jilin University (吉林大学); The First Hospital of Jilin University (吉林大学第一医院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 22 pages, 5 figures, under review

点击查看摘要

Abstract:The simulation of realistic clinical interactions plays a pivotal role in advancing clinical Large Language Models (LLMs) and supporting medical diagnostic education. Existing approaches and benchmarks rely on generic or LLM-generated dialogue data, which limits the authenticity and diversity of doctor-patient interactions. In this work, we propose the first Chinese patient simulation dataset (Ch-PatientSim), constructed from realistic clinical interaction scenarios to comprehensively evaluate the performance of models in emulating patient behavior. Patients are simulated based on a five-dimensional persona structure. To address issues of the persona class imbalance, a portion of the dataset is augmented using few-shot generation, followed by manual verification. We evaluate various state-of-the-art LLMs and find that most produce overly formal responses that lack individual personality. To address this limitation, we propose a training-free Multi-Stage Patient Role-Playing (MSPRP) framework, which decomposes interactions into three stages to ensure both personalization and realism in model responses. Experimental results demonstrate that our approach significantly improves model performance across multiple dimensions of patient simulation.
zh

[NLP-52] PatientVLM Meets DocVLM: Pre-Consultation Dialogue Between Vision-Language Models for Efficient Diagnosis AAAI2026

【速读】: 该论文旨在解决传统医疗诊断人工智能(AI)研究中因缺乏患者自报症状而导致的诊断准确率受限问题。其解决方案的关键在于提出一种预问诊对话框架(Pre-Consultation Dialogue Framework, PCDF),通过两个视觉语言模型(Vision-Language Models, VLMs)——医生VLM(DocVLM)与患者VLM(PatientVLM)——模拟真实世界中的多轮问诊过程:DocVLM基于图像和对话历史生成后续问题,PatientVLM则依据真实诊断推导的症状谱系进行响应。该框架生成的合成症状经临床验证具有良好的临床相关性、覆盖度和真实性,最终利用此类对话监督信号对DocVLM进行微调,显著优于仅依赖图像训练的模型,凸显了真实症状采集在提升诊断准确性中的关键作用。

链接: https://arxiv.org/abs/2601.10945
作者: K Lokesh,Abhirama Subramanyam Penamakuri,Uday Agarwal,Apoorva Challa,Shreya K Gowda,Somesh Gupta,Anand Mishra
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at AAAI 2026 Main Track

点击查看摘要

Abstract:Traditionally, AI research in medical diagnosis has largely centered on image analysis. While this has led to notable advancements, the absence of patient-reported symptoms continues to hinder diagnostic accuracy. To address this, we propose a Pre-Consultation Dialogue Framework (PCDF) that mimics real-world diagnostic procedures, where doctors iteratively query patients before reaching a conclusion. Specifically, we simulate diagnostic dialogues between two vision-language models (VLMs): a DocVLM, which generates follow-up questions based on the image and dialogue history, and a PatientVLM, which responds using a symptom profile derived from the ground-truth diagnosis. We additionally conducted a small-scale clinical validation of the synthetic symptoms generated by our framework, with licensed clinicians confirming their clinical relevance, symptom coverage, and overall realism. These findings indicate that the resulting DocVLM-PatientVLM interactions form coherent, multi-turn consultations paired with images and diagnoses, which we then use to fine-tune the DocVLM. This dialogue-based supervision leads to substantial gains over image-only training, highlighting the value of realistic symptom elicitation for diagnosis.
zh

[NLP-53] Selecting Language Models for Social Science: Start Small Start Open and Validate

【速读】: 该论文旨在解决社会科学家在面对数千个大规模预训练语言模型(Large Pretrained Language Models, LLMs)时如何科学选择合适模型的问题。其核心挑战在于,尽管常用基准测试(ex-ante validity tests)被广泛引用,但仅依赖此类指标不足以确保研究结果的可信度。论文指出,社会科学家必须通过事后验证(ex-post validation)来评估计算测量的有效性,并强调可复现性(replicability)是更关键的选择依据——即能够稳定重现特定研究发现,要求整个任务流程(包括模型使用、数据处理和分析步骤)具备可重复性。解决方案的关键在于:优先选用较小且开源的语言模型,同时构建限定范围的基准测试(delimited benchmarks),以系统性地验证从输入到输出的完整计算流水线的有效性和一致性。

链接: https://arxiv.org/abs/2601.10926
作者: Dustin S. Stoltz,Marshall A. Taylor,Sanuj Kumar
机构: Lehigh University (理海大学); New Mexico State University (新墨西哥州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Currently, there are thousands of large pretrained language models (LLMs) available to social scientists. How do we select among them? Using validity, reliability, reproducibility, and replicability as guides, we explore the significance of: (1) model openness, (2) model footprint, (3) training data, and (4) model architectures and fine-tuning. While ex-ante tests of validity (i.e., benchmarks) are often privileged in these discussions, we argue that social scientists cannot altogether avoid validating computational measures (ex-post). Replicability, in particular, is a more pressing guide for selecting language models. Being able to reliably replicate a particular finding that entails the use of a language model necessitates reliably reproducing a task. To this end, we propose starting with smaller, open models, and constructing delimited benchmarks to demonstrate the validity of the entire computational pipeline.
zh

[NLP-54] Massively Multilingual Joint Segmentation and Glossing

【速读】: 该论文旨在解决当前生成式语言标注模型在实际语言文档记录中可信度不足的问题,特别是现有模型(如GlossLM)仅输出词级释义但不预测形态边界,导致结果难以解释且难以被人类标注者信任。其解决方案的关键在于首次提出联合预测形态分割与逐层释义(interlinear gloss)的神经网络架构,通过优化训练策略平衡两个任务的准确性及对齐关系,并基于扩展语料库预训练出PolyGloss系列序列到序列多语言模型,该模型在释义和形态分割任务上均优于GlossLM及其他开源大语言模型(LLM),同时支持通过低秩适应(LoRA)快速适配新数据集。

链接: https://arxiv.org/abs/2601.10925
作者: Michael Ginn,Lindia Tjuatja,Enora Rice,Ali Marashian,Maria Valentini,Jasmine Xu,Graham Neubig,Alexis Palmer
机构: University of Colorado Boulder (科罗拉多大学博尔德分校); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注: 13 pages, 8 figures, submitted to ARR Jan 2026

点击查看摘要

Abstract:Automated interlinear gloss prediction with neural networks is a promising approach to accelerate language documentation efforts. However, while state-of-the-art models like GlossLM achieve high scores on glossing benchmarks, user studies with linguists have found critical barriers to the usefulness of such models in real-world scenarios. In particular, existing models typically generate morpheme-level glosses but assign them to whole words without predicting the actual morpheme boundaries, making the predictions less interpretable and thus untrustworthy to human annotators. We conduct the first study on neural models that jointly predict interlinear glosses and the corresponding morphological segmentation from raw text. We run experiments to determine the optimal way to train models that balance segmentation and glossing accuracy, as well as the alignment between the two tasks. We extend the training corpus of GlossLM and pretrain PolyGloss, a family of seq2seq multilingual models for joint segmentation and glossing that outperforms GlossLM on glossing and beats various open-source LLMs on segmentation, glossing, and alignment. In addition, we demonstrate that PolyGloss can be quickly adapted to a new dataset via low-rank adaptation.
zh

[NLP-55] Neural Induction of Finite-State Transducers

【速读】: 该论文旨在解决手工构建有限状态转换器(Finite-State Transducer, FST)的困难问题,尤其是在字符串到字符串重写任务中,尽管FST具有高效性,但其构造过程复杂且依赖人工设计。解决方案的关键在于提出一种新颖的自动构建方法,该方法基于循环神经网络(Recurrent Neural Network, RNN)学习到的隐状态几何结构,从而自动生成无权FST。实验表明,该方法在词形变化、音素转写和历史文本归一化等真实数据集上均表现出高准确性和鲁棒性,相比传统FST学习算法在测试集上的准确率提升最高达87%。

链接: https://arxiv.org/abs/2601.10918
作者: Michael Ginn,Alexis Palmer,Mans Hulden
机构: 未知
类目: Computation and Language (cs.CL)
备注: 14 pages, 8 figures, submitted to ARR Jan 2026

点击查看摘要

Abstract:Finite-State Transducers (FSTs) are effective models for string-to-string rewriting tasks, often providing the efficiency necessary for high-performance applications, but constructing transducers by hand is difficult. In this work, we propose a novel method for automatically constructing unweighted FSTs following the hidden state geometry learned by a recurrent neural network. We evaluate our methods on real-world datasets for morphological inflection, grapheme-to-phoneme prediction, and historical normalization, showing that the constructed FSTs are highly accurate and robust for many datasets, substantially outperforming classical transducer learning algorithms by up to 87% accuracy on held-out test sets.
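论文依据 RNN 隐状态几何构造 FST,具体算法未在摘要中给出;以下用"量化隐状态 + 记录转移"的粗糙替代做一个最小示意(量化函数 q 为笔者假设,仅用于说明从轨迹归纳无权 FST 的思路)。

```python
def build_fst(traces, q=lambda h: tuple(round(x, 1) for x in h)):
    """traces: [(h_prev, 输入符号, 输出符号, h_next), ...],取自 RNN 运行轨迹。
    用量化函数 q 把连续隐状态离散成 FST 状态(对论文几何方法的粗糙替代)。"""
    fst = {}
    for h_prev, i, o, h_next in traces:
        fst.setdefault((q(h_prev), i), (o, q(h_next)))
    return fst

def transduce(fst, start_state, symbols):
    """按归纳出的转移表做字符串到字符串的转写;start_state 需已量化。"""
    state, out = start_state, []
    for s in symbols:
        o, state = fst[(state, s)]
        out.append(o)
    return out
```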
zh

[NLP-56] DialDefer: A Framework for Detecting and Mitigating LLM Dialogic Deference

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在对话场景中作为第三方评判者时,其判断结果因表述框架不同而产生不可靠偏差的问题。具体而言,相同内容在被呈现为需验证的陈述(“该陈述是否正确?”)与归因于说话者(“该说话者是否正确?”)时,LLMs 会给出显著不同的评价,这种现象被称为对话性顺从(dialogic deference)。解决方案的关键在于提出 DialDefer 框架,通过引入对话性顺从得分(Dialogic Deference Score, DDS)量化此类由框架引发的判断方向性偏移——这一指标能够揭示传统准确率指标所掩盖的系统性偏差。实证结果显示,DDS 在多个领域可达 ±87 个百分点(p < .0001),且人类 vs. LLM 归因是驱动最大偏差(17.7 个百分点)的核心因素,表明模型对人类持更高顺从倾向,从而将问题定位为校准而非单纯准确率优化。

链接: https://arxiv.org/abs/2601.10896
作者: Parisa Rabbani,Priyam Sahoo,Ruben Mathew,Aishee Mondal,Harshita Ketharaman,Nimet Beyza Bozdag,Dilek Hakkani-Tür
机构: 未知
类目: Computation and Language (cs.CL)
备注: 10 pages main content, 7 figures, 35 pages total with appendix

点击查看摘要

Abstract:LLMs are increasingly used as third-party judges, yet their reliability when evaluating speakers in dialogue remains poorly understood. We show that LLMs judge identical claims differently depending on framing: the same content elicits different verdicts when presented as a statement to verify ("Is this statement correct?") versus attributed to a speaker ("Is this speaker correct?"). We call this dialogic deference and introduce DialDefer, a framework for detecting and mitigating these framing-induced judgment shifts. Our Dialogic Deference Score (DDS) captures directional shifts that aggregate accuracy obscures. Across nine domains, 3k+ instances, and four models, conversational framing induces large shifts (|DDS| up to 87pp, p < .0001) while accuracy remains stable (<2pp), with effects amplifying 2-4x on naturalistic Reddit conversations. Models can shift toward agreement (deference) or disagreement (skepticism) depending on domain – the same model ranges from DDS = -53 on graduate-level science to +58 on social judgment. Ablations reveal that human-vs-LLM attribution drives the largest shifts (17.7pp swing), suggesting models treat disagreement with humans as more costly than with AI. Mitigation attempts reduce deference but can over-correct into skepticism, framing this as a calibration problem beyond accuracy optimization.
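论文摘要未给出 DDS 的显式公式,这里按其描述(框架诱导的方向性判断偏移,以百分点计)写一个假设性的最小实现,仅供理解该指标与聚合准确率的区别:

```python
def dialogic_deference_score(statement_verdicts, speaker_verdicts):
    """DDS 的一种示意性定义(笔者假设,非论文公式):
    两种框架下判"正确"的比例之差,单位为百分点。
    正值表示归因于说话者后模型更倾向同意(顺从),负值表示更怀疑。"""
    assert len(statement_verdicts) == len(speaker_verdicts) > 0
    n = len(statement_verdicts)
    p_stmt = sum(statement_verdicts) / n  # "该陈述正确" 判定比例
    p_spkr = sum(speaker_verdicts) / n    # "该说话者正确" 判定比例
    return 100.0 * (p_spkr - p_stmt)
```

注意即便两种框架的准确率相同,只要判定方向系统性偏移,DDS 仍可很大,这正是"聚合准确率掩盖偏差"的含义。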
zh

[NLP-57] EncodeRec: An Embedding Backbone for Recommendation Systems

【速读】: 该论文旨在解决预训练语言模型(Pre-trained Language Models, PLMs)在推荐系统中应用时存在的两个核心问题:一是PLMs未显式优化以生成结构化且具有判别性的嵌入空间,二是其表示过于通用,难以捕捉推荐任务所需的领域特定语义。解决方案的关键在于提出EncodeRec方法,该方法通过将文本表示与推荐目标对齐,在保持语言模型参数冻结的前提下,直接从商品描述中学习紧凑且信息丰富的嵌入表示,从而在不牺牲语义保真度的情况下提升推荐性能。

链接: https://arxiv.org/abs/2601.10837
作者: Guy Hadad,Neomi Rabaev,Bracha Shapira
机构: Ben-Gurion University of the Negev (本古里安大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Recent recommender systems increasingly leverage embeddings from large pre-trained language models (PLMs). However, such embeddings exhibit two key limitations: (1) PLMs are not explicitly optimized to produce structured and discriminative embedding spaces, and (2) their representations remain overly generic, often failing to capture the domain-specific semantics crucial for recommendation tasks. We present EncodeRec, an approach designed to align textual representations with recommendation objectives while learning compact, informative embeddings directly from item descriptions. EncodeRec keeps the language model parameters frozen during recommender system training, making it computationally efficient without sacrificing semantic fidelity. Experiments across core recommendation benchmarks demonstrate its effectiveness both as a backbone for sequential recommendation models and for semantic ID tokenization, showing substantial gains over PLM-based and embedding model baselines. These results underscore the pivotal role of embedding adaptation in bridging the gap between general-purpose language models and practical recommender systems.
zh

[NLP-58] Reasoning Models Generate Societies of Thought

【速读】: 该论文试图解决的问题是:尽管大型语言模型(Large Language Models, LLMs)在多个领域展现出卓越能力,但其复杂推理机制的本质仍不明确。研究发现,当前先进推理模型(如DeepSeek-R1和QwQ-32B)相较于指令微调模型在复杂认知任务中表现更优,但这种优势并非单纯源于更长的思维链(Chain of Thought, CoT),而是源于一种“思想社会”(society of thought)结构——即内部认知视角之间的多智能体式交互,这些视角具有不同的个性特征和领域专长,从而实现多样性与辩论的结合。解决方案的关键在于识别并验证这种多智能体式的内部结构:通过定量分析与机制可解释性方法揭示推理模型激活了更广泛的认知冲突,表现出问答、视角转换及观点调和等类对话行为,并且受控强化学习实验表明,仅以推理准确率为奖励信号即可促使基础模型增强此类对话行为;进一步地,使用对话结构引导(conversational scaffolding)进行微调能显著加速推理能力提升。这说明,系统性组织的认知多样性才是推理性能提升的核心驱动力,为构建类人类群体智慧的计算范式提供了新路径。

链接: https://arxiv.org/abs/2601.10825
作者: Junsol Kim,Shiyang Lai,Nino Scherrer,Blaise Agüera y Arcas,James Evans
机构: Google(谷歌); University of Chicago (芝加哥大学); Santa Fe Institute (圣达菲研究所)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models have achieved remarkable capabilities across domains, yet mechanisms underlying sophisticated reasoning remain elusive. Recent reasoning models outperform comparable instruction-tuned models on complex cognitive tasks, attributed to extended computation through longer chains of thought. Here we show that enhanced reasoning emerges not from extended computation alone, but from simulating multi-agent-like interactions – a society of thought – which enables diversification and debate among internal cognitive perspectives characterized by distinct personality traits and domain expertise. Through quantitative analysis and mechanistic interpretability methods applied to reasoning traces, we find that reasoning models like DeepSeek-R1 and QwQ-32B exhibit much greater perspective diversity than instruction-tuned models, activating broader conflict between heterogeneous personality- and expertise-related features during reasoning. This multi-agent structure manifests in conversational behaviors, including question-answering, perspective shifts, and the reconciliation of conflicting views, and in socio-emotional roles that characterize sharp back-and-forth conversations, together accounting for the accuracy advantage in reasoning tasks. Controlled reinforcement learning experiments reveal that base models increase conversational behaviors when rewarded solely for reasoning accuracy, and fine-tuning models with conversational scaffolding accelerates reasoning improvement over base models. These findings indicate that the social organization of thought enables effective exploration of solution spaces. We suggest that reasoning models establish a computational parallel to collective intelligence in human groups, where diversity enables superior problem-solving when systematically structured, which suggests new opportunities for agent organization to harness the wisdom of crowds.
zh

[NLP-59] owards Reliable ML Feature Engineering via Planning in Constrained-Topology of LLM Agents

【速读】: 该论文旨在解决生成式 AI 在实际机器学习(ML)团队中应用时面临的三大挑战:一是缺乏能够捕捉生产级特征工程迭代与复杂编码过程的数据集;二是现有编码代理(如 CoPilot 和 Devin)难以与团队特有的工具、代码库、工作流和实践进行深度集成与个性化适配;三是人类与 AI 协作效率低下,常因反馈时机不当或不足导致效果不佳。解决方案的关键在于提出一种基于规划器引导的、受限拓扑结构的多智能体框架,其中由大语言模型(LLM)驱动的规划器利用团队环境图谱来协调多个可用代理的调用、生成上下文感知的提示,并通过下游失败回溯修正上游生成结果;同时支持在关键步骤请求人工干预,从而确保生成代码的可靠性、可维护性及与团队预期的一致性。

链接: https://arxiv.org/abs/2601.10820
作者: Himanshu Thakur,Anusha Kamath,Anurag Muthyala,Dhwani Sanmukhani,Smruthi Mukund,Jay Katukuri
机构: Meta; JPMorgan Chase & Co.
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Recent advances in code generation models have unlocked unprecedented opportunities for automating feature engineering, yet their adoption in real-world ML teams remains constrained by critical challenges: (i) the scarcity of datasets capturing the iterative and complex coding processes of production-level feature engineering, (ii) limited integration and personalization of widely used coding agents, such as CoPilot and Devin, with a team’s unique tools, codebases, workflows, and practices, and (iii) suboptimal human-AI collaboration due to poorly timed or insufficient feedback. We address these challenges with a planner-guided, constrained-topology multi-agent framework that generates code for repositories in a multi-step fashion. The LLM-powered planner leverages a team’s environment, represented as a graph, to orchestrate calls to available agents, generate context-aware prompts, and use downstream failures to retroactively correct upstream artifacts. It can request human intervention at critical steps, ensuring generated code is reliable, maintainable, and aligned with team expectations. On a novel in-house dataset, our approach achieves 38% and 150% improvement in the evaluation metric over manually crafted and unplanned workflows respectively. In practice, when building features for recommendation models serving over 120 million users, our approach has delivered real-world impact by reducing feature engineering cycles from three weeks to a single day.
zh

[NLP-60] A Concise Agent is Less Expert: Revealing Side Effects of Using Style Features on Conversational Agents

【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)对话代理中风格特征(style features)之间存在的隐性交互效应问题,即在提示(prompt)中显式要求某一风格特征时,可能无意中引发其他风格特征的系统性变化,从而影响生成内容的质量与意图一致性。其解决方案的关键在于:首先通过构建一个包含127篇ACL Anthology论文的系统性调研,识别出12种高频使用的风格特征;其次设计了一种基于LLM作为裁判(LLM-as-a-Judge)的成对评估框架,在任务导向和开放域对话场景下量化不同风格特征之间的因果影响;最终提出CASSE数据集以捕捉这些复杂的风格交互关系,并验证了基于提示(prompt-based)和激活引导(activation-steering-based)的缓解策略的有效性与局限性——尽管能部分恢复被抑制的特质,但常导致目标风格性能下降,从而揭示了当前风格控制方法存在非忠实性(non-faithful control),并呼吁采用多目标优化和更严谨的理论框架实现安全、精准的风格引导。

链接: https://arxiv.org/abs/2601.10809
作者: Young-Min Cho,Yuan Yuan,Sharath Chandra Guntuku,Lyle Ungar
机构: University of Pennsylvania (宾夕法尼亚大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Style features such as friendly, helpful, or concise are widely used in prompts to steer the behavior of Large Language Model (LLM) conversational agents, yet their unintended side effects remain poorly understood. In this work, we present the first systematic study of cross-feature stylistic side effects. We conduct a comprehensive survey of 127 conversational agent papers from ACL Anthology and identify 12 frequently used style features. Using controlled, synthetic dialogues across task-oriented and open domain settings, we quantify how prompting for one style feature causally affects others via a pairwise LLM as a Judge evaluation framework. Our results reveal consistent and structured side effects, such as prompting for conciseness significantly reduces perceived expertise. They demonstrate that style features are deeply entangled rather than orthogonal. To support future research, we introduce CASSE (Conversational Agent Stylistic Side Effects), a dataset capturing these complex interactions. We further evaluate prompt based and activation steering based mitigation strategies and find that while they can partially restore suppressed traits, they often degrade the primary intended style. These findings challenge the assumption of faithful style control in LLMs and highlight the need for multi-objective and more principled approaches to safe, targeted stylistic steering in conversational agents.
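成对 LLM-as-a-Judge 评估得到的胜负记录,可以汇总为一张"风格副作用矩阵"。以下为示意性草图(数据结构与 0.5 的无偏基线均为笔者假设):

```python
def side_effect_matrix(judgments):
    """judgments: {(被提示特征, 被测特征): [1/0 胜负列表]},
    1 表示加入该风格提示后的回复在被测特征上被裁判判为更强。
    返回胜率相对 0.5 基线的偏移:负值即该特征被副作用抑制,
    例如 (concise, expertise) 为负对应"要求简洁会降低感知专业度"。"""
    return {k: sum(v) / len(v) - 0.5 for k, v in judgments.items()}
```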
zh

[NLP-61] BYOL: Bring Your Own Language Into LLM s

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在多语言能力上的严重资源不均衡问题,即全球7000余种语言中仅有不足100种具备足够的数字语料支持LLM训练,导致低资源和极端低资源语言(Extreme-Low-Resource Languages)在性能、文化适配性和可及性方面显著落后。解决方案的关键在于提出一种统一的、可扩展的语言感知框架——Bring Your Own Language (BYOL),其核心是基于语言数字足迹将语言划分为四个层级(极端低资源、低资源、中等资源、高资源),并据此选择差异化的集成路径:对于低资源语言,采用端到端的数据精炼与扩展流水线(包括语料清洗、合成文本生成、持续预训练和监督微调);对于极端低资源语言,则引入翻译媒介路径,通过定制机器翻译系统实现间接建模。实验表明,该方法在Chichewa和Maori上相较强基线平均提升约12%,并在Inuktitut上通过翻译中介路径实现BLEU提升4分,同时保留英语及多语言能力。

链接: https://arxiv.org/abs/2601.10804
作者: Syed Waqas Zamir,Wassim Hamidouche,Boulbaba Ben Amor,Luana Marotti,Inbal Becker-Reshef,Juan Lavista Ferres
机构: Microsoft AI for Good Research Lab (微软AI for Good研究实验室); Inception, G42
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) exhibit strong multilingual capabilities, yet remain fundamentally constrained by the severe imbalance in global language resources. While over 7,000 languages are spoken worldwide, only a small subset (fewer than 100) has sufficient digital presence to meaningfully influence modern LLM training. This disparity leads to systematic underperformance, cultural misalignment, and limited accessibility for speakers of low-resource and extreme-low-resource languages. To address this gap, we introduce Bring Your Own Language (BYOL), a unified framework for scalable, language-aware LLM development tailored to each language’s digital footprint. BYOL begins with a language resource classification that maps languages into four tiers (Extreme-Low, Low, Mid, High) using curated web-scale corpora, and uses this classification to select the appropriate integration pathway. For low-resource languages, we propose a full-stack data refinement and expansion pipeline that combines corpus cleaning, synthetic text generation, continual pretraining, and supervised finetuning. Applied to Chichewa and Maori, this pipeline yields language-specific LLMs that achieve approximately 12 percent average improvement over strong multilingual baselines across 12 benchmarks, while preserving English and multilingual capabilities via weight-space model merging. For extreme-low-resource languages, we introduce a translation-mediated inclusion pathway, and show on Inuktitut that a tailored machine translation system improves over a commercial baseline by 4 BLEU, enabling high-accuracy LLM access when direct language modeling is infeasible. Finally, we release human-translated versions of the Global MMLU-Lite benchmark in Chichewa, Maori, and Inuktitut, and make our codebase and models publicly available at this https URL .
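摘要提到通过权重空间模型合并(weight-space model merging)保留英语与多语言能力,但未指明具体算法;下面以最常见的线性插值形式做一个示意(lam 为假设超参数):

```python
def merge_weights(base, tuned, lam=0.5):
    """权重空间线性插值(一种常见的合并形式,论文未指明具体算法):
    lam 越大越偏向语言专精模型,越小越保留基座模型的通用能力。
    真实场景中各参数为同形状张量,此处用列表示意。"""
    assert base.keys() == tuned.keys()
    return {k: [(1 - lam) * b + lam * t for b, t in zip(base[k], tuned[k])]
            for k in base}
```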
zh

[NLP-62] LLM s for Game Theory: Entropy-Guided In-Context Learning and Adaptive CoT Reasoning AAAI2026

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在离散博弈任务中决策质量不足的问题,尤其是在需要序列化推理的场景下,如井字棋(Tic-Tac-Toe)。其核心挑战在于如何在保持计算效率的同时提升策略选择的准确性。解决方案的关键在于提出一种基于熵引导的自适应链式思维(entropy-guided chain-of-thought, CoT)推理框架:通过token级不确定性(熵)动态调整上下文检索数量与推理路径数——低不确定性时采用精简推理路径以节省资源,高不确定性时则激活多路径CoT探索以增强决策质量。实验表明,该方法显著优于基线LLM,在100局对弈中平均得分从−11.6%提升至+9.5%,且统计显著,验证了不确定性感知自适应推理机制的有效性。

链接: https://arxiv.org/abs/2601.10775
作者: Tommaso Felice Banfi,Sashenka Gamage
机构: 未知
类目: Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
备注: Accepted at AAAI 2026 Bridge (Logical and Symbolic Reasoning in Language Models)

点击查看摘要

Abstract:We propose a novel LLM-based framework for reasoning in discrete, game-theoretic tasks, illustrated with Tic-Tac-Toe. The method integrates in-context learning with entropy-guided chain-of-thought (CoT) reasoning and adaptive context retrieval. The model dynamically adjusts both the number of retrieved examples and reasoning paths according to token-level uncertainty: concise reasoning with minimal context is used when uncertainty is low, whereas higher uncertainty triggers expanded multi-path CoT exploration. Experimental evaluation against a sub-optimal algorithmic opponent shows that entropy-aware adaptive reasoning substantially improves decision quality, increasing the average game outcome from (-11.6%) with the baseline LLM to (+9.5%) with entropy-guided adaptive reasoning over 100 games (win = +1, tie = 0, loss = -1), while maintaining a relatively low number of LLM queries per game. Statistical validation confirms that the improvement is significant, and correlation analysis reveals a negative association between token-level entropy and move optimality. These findings demonstrate that uncertainty-guided adaptive reasoning effectively enhances LLM performance in sequential decision-making environments.
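熵引导自适应推理的核心是"按词元级不确定性分配推理预算"。以下为最小示意(阈值 low/high 与各档的示例数、路径数均为假设值):

```python
import math

def token_entropy(probs):
    """词元分布的香农熵(单位 nat),作为不确定性度量。"""
    return -sum(p * math.log(p) for p in probs if p > 0)

def plan_reasoning(probs, low=0.5, high=1.2):
    """按首步词元熵选择检索示例数与 CoT 路径数;阈值与档位均为假设值。"""
    h = token_entropy(probs)
    if h < low:
        return {"examples": 1, "paths": 1}   # 低不确定性:精简推理,省查询
    if h < high:
        return {"examples": 3, "paths": 3}
    return {"examples": 5, "paths": 5}       # 高不确定性:多路径 CoT 探索
```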
zh

[NLP-63] Building AI Agents to Improve Job Referral Requests to Strangers

【速读】: 该论文旨在解决职业社交平台中求职者撰写有效职位推荐请求(job referral request)的难题,以提升其获得他人推荐的概率。解决方案的关键在于构建一个由改进代理(improver agent)和评估代理(evaluator agent)组成的AI系统:改进代理利用大语言模型(LLM)重写原始请求,评估代理则基于训练好的模型预测请求被成功推荐的可能性;进一步引入检索增强生成(RAG)技术后,系统能够精准优化弱请求而避免对强请求造成负面影响,从而在不损害高质量请求表现的前提下,使弱请求的预测成功率平均提升14%。

链接: https://arxiv.org/abs/2601.10726
作者: Ross Chu,Yuting Huang
机构: University of California, Berkeley (加州大学伯克利分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper develops AI agents that help job seekers write effective requests for job referrals in a professional online community. The basic workflow consists of an improver agent that rewrites the referral request and an evaluator agent that measures the quality of revisions using a model trained to predict the probability of receiving referrals from other users. Revisions suggested by the LLM (large language model) increase predicted success rates for weaker requests while reducing them for stronger requests. Enhancing the LLM with Retrieval-Augmented Generation (RAG) prevents edits that worsen stronger requests while it amplifies improvements for weaker requests. Overall, using LLM revisions with RAG increases the predicted success rate for weaker requests by 14% without degrading performance on stronger requests. Although improvements in model-predicted success do not guarantee more referrals in the real world, they provide low-cost signals for promising features before running higher-stakes experiments on real users.
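改进代理与评估代理的基本工作流可以抽象为一个"改写—打分—择优"循环。以下为示意性草图,improve 与 evaluate 为注入的假设接口(对应文中的 LLM 改写器与成功率预测模型):

```python
def refine_request(text, improve, evaluate, rounds=3):
    """改进-评估循环的示意:仅当评估分上升才采纳改写,
    对应文中"提升弱请求而不损害强请求"的约束。"""
    best, best_score = text, evaluate(text)
    for _ in range(rounds):
        cand = improve(best)       # 改进代理:重写推荐请求
        score = evaluate(cand)     # 评估代理:预测获得推荐的概率
        if score > best_score:
            best, best_score = cand, score
    return best, best_score
```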
zh

[NLP-64] Do You Trust Me? Cognitive-Affective Signatures of Trustworthiness in Large Language Models

【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在在线信息环境中如何隐式编码用户感知的信任度(perceived trustworthiness),以及这种编码是否具有心理一致性。解决方案的关键在于,通过分析指令微调后的LLMs(如Llama 3.1 8B、Qwen 2.5 7B和Mistral 7B)在PEACE-Reviews数据集上的激活模式,发现模型在预训练过程中已隐式学习到与人类信任形成机制一致的信号——尤其是公平性(fairness)、确定性(certainty)和责任归属自我感(accountability-self)等认知评估维度;进一步的探针分析表明,这些信任信号可通过线性解码提取,且微调仅优化而非重构原有表征,从而为构建可信、透明的网络生态系统中的AI系统提供了可解释的表征基础。

链接: https://arxiv.org/abs/2601.10719
作者: Gerard Yeo,Svetlana Churina,Kokil Jaidka
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Perceived trustworthiness underpins how users navigate online information, yet it remains unclear whether large language models (LLMs), increasingly embedded in search, recommendation, and conversational systems, represent this construct in psychologically coherent ways. We analyze how instruction-tuned LLMs (Llama 3.1 8B, Qwen 2.5 7B, Mistral 7B) encode perceived trustworthiness in web-like narratives using the PEACE-Reviews dataset annotated for cognitive appraisals, emotions, and behavioral intentions. Across models, systematic layer- and head-level activation differences distinguish high- from low-trust texts, revealing that trust cues are implicitly encoded during pretraining. Probing analyses show linearly decodable trust signals and fine-tuning effects that refine rather than restructure these representations. Strongest associations emerge with appraisals of fairness, certainty, and accountability-self – dimensions central to human trust formation online. These findings demonstrate that modern LLMs internalize psychologically grounded trust signals without explicit supervision, offering a representational foundation for designing credible, transparent, and trustworthy AI systems in the web ecosystem. Code and appendix are available at: this https URL.
zh

[NLP-65] Japanese AI Agent System on Human Papillomavirus Vaccination: System Design

【速读】: 该论文旨在解决日本因HPV疫苗主动推荐暂停(2013–2021年)所引发的公众接种犹豫问题,尤其针对社交媒体上错误信息(misinformation)泛滥导致的信息鸿沟,以及传统传播方式难以同步应对个体咨询与群体舆论监测的局限性。解决方案的关键在于构建一个双功能人工智能代理系统:其一为基于ReAct架构的检索增强生成(Retrieval-Augmented Generation, RAG)对话机器人,整合五类知识源(学术论文、政府文件、新闻媒体及社交平台数据),通过多工具协同实现高准确率的个性化问答;其二为自动化报告生成模块,可从用户交互和社交舆情中提取模式,输出涵盖新闻分析、研究综述、情感倾向和行为特征的结构化洞察。该系统在单轮与多轮对话任务中均达到高分(平均4.80–4.98),并验证了其在信息可信传递与公共话语(discourse)系统性分析上的可行性,形成可迁移至其他医疗场景的集成框架。

链接: https://arxiv.org/abs/2601.10718
作者: Junyu Liu,Siwen Yang,Dexiu Ma,Qian Niu,Zequn Zhang,Momoko Nagai-Tanima,Tomoki Aoyama
机构: Kyoto University (京都大学); University of Waterloo (滑铁卢大学); Texas Tech University (德克萨斯理工大学); The University of Tokyo (东京大学); USTC (中国科学技术大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Human papillomavirus (HPV) vaccine hesitancy poses significant public health challenges, particularly in Japan where proactive vaccination recommendations were suspended from 2013 to 2021. The resulting information gap is exacerbated by misinformation on social media, and traditional ways cannot simultaneously address individual queries while monitoring population-level discourse. This study aimed to develop a dual-purpose AI agent system that provides verified HPV vaccine information through a conversational interface while generating analytical reports for medical institutions based on user interactions and social media. We implemented a system comprising: a vector database integrating academic papers, government sources, news media, and social media; a Retrieval-Augmented Generation chatbot using ReAct agent architecture with multi-tool orchestration across five knowledge sources; and an automated report generation system with modules for news analysis, research synthesis, social media sentiment analysis, and user interaction pattern identification. Performance was assessed using a 0-5 scoring scale. For single-turn evaluation, the chatbot achieved mean scores of 4.83 for relevance, 4.89 for routing, 4.50 for reference quality, 4.90 for correctness, and 4.88 for professional identity (overall 4.80). Multi-turn evaluation yielded higher scores: context retention 4.94, topic coherence 5.00, and overall 4.98. The report generation system achieved completeness 4.00-5.00, correctness 4.00-5.00, and helpfulness 3.67-5.00, with reference validity 5.00 across all periods. This study demonstrates the feasibility of an integrated AI agent system for bidirectional HPV vaccine communication. The architecture enables verified information delivery with source attribution while providing systematic public discourse analysis, with a transferable framework for adaptation to other medical contexts.
zh

[NLP-66] MPCI-Bench: A Benchmark for Multimodal Pairwise Contextual Integrity Evaluation of Language Model Agents ACL2026

【速读】: 该论文旨在解决当前语言模型代理(language-model agents)在处理个人数据时,缺乏对多模态隐私行为的系统性评估问题,尤其是现有基于上下文完整性(Contextual Integrity, CI)的基准测试主要聚焦于文本场景和负面拒绝情形,忽视了多模态隐私风险以及隐私与效用之间的权衡。解决方案的关键在于提出首个面向代理场景的多模态成对上下文完整性基准——MPCI-Bench,其核心创新包括:(1)基于同一视觉源生成正负样本对,确保情境一致性;(2)构建三个层级的评估范式(规范性种子判断、情境丰富的叙事推理、可执行代理行为轨迹);(3)通过三原则迭代精炼流程保障数据质量。实验表明,主流多模态模型在隐私-效用平衡上存在系统性缺陷,并表现出显著的模态泄露差距(modality leakage gap),即敏感视觉信息比文本信息更易泄露。

链接: https://arxiv.org/abs/2601.08235
作者: Shouju Wang,Haopeng Zhang
机构: University of Hawaii at Manoa (夏威夷大学马诺阿分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Submitted to ACL 2026

点击查看摘要

Abstract:As language-model agents evolve from passive chatbots into proactive assistants that handle personal data, evaluating their adherence to social norms becomes increasingly critical, often through the lens of Contextual Integrity (CI). However, existing CI benchmarks are largely text-centric and primarily emphasize negative refusal scenarios, overlooking multimodal privacy risks and the fundamental trade-off between privacy and utility. In this paper, we introduce MPCI-Bench, the first Multimodal Pairwise Contextual Integrity benchmark for evaluating privacy behavior in agentic settings. MPCI-Bench consists of paired positive and negative instances derived from the same visual source and instantiated across three tiers: normative Seed judgments, context-rich Story reasoning, and executable agent action Traces. Data quality is ensured through a Tri-Principle Iterative Refinement pipeline. Evaluations of state-of-the-art multimodal models reveal systematic failures to balance privacy and utility and a pronounced modality leakage gap, where sensitive visual information is leaked more frequently than textual information. We will open-source MPCI-Bench to facilitate future research on agentic CI.
zh

[NLP-67] Generative AI Purpose-built for Social and Mental Health: A Real-World Pilot

【速读】: 该论文旨在解决当前心理健康支持服务中存在可及性差、个性化不足以及 scalability(可扩展性)受限的问题,尤其是在资源匮乏或难以获得传统心理治疗的背景下。其解决方案的关键在于开发并评估一个专为心理健康设计的生成式 AI (Generative AI) 基础模型聊天机器人,通过自然语言交互提供安全、个性化且可扩展的心理健康支持。研究发现,用户在使用该系统后,抑郁症状(PHQ-9)和焦虑症状(GAD-7)显著减轻,并且积极心理指标如希望感、行为激活和社会连接等也持续改善;同时高参与度与良好的工作联盟(working alliance)显著预测疗效,自动化安全机制有效识别并处理了潜在风险情境,验证了该技术在真实世界应用中的可行性与安全性。

链接: https://arxiv.org/abs/2511.11689
作者: Thomas D. Hull,Lizhe Zhang,Patricia A. Arean,Matteo Malgaroli
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Generative artificial intelligence (GAI) chatbots built for mental health could deliver safe, personalized, and scalable mental health support. We evaluate a foundation model designed for mental health. Adults completed mental health measures while engaging with the chatbot between May 15, 2025 and September 15, 2025. Users completed an opt-in consent, demographic information, mental health symptoms, social connection, and self-identified goals. Measures were repeated every two weeks up to 6 weeks, and a final follow-up at 10 weeks. Analyses included effect sizes, and growth mixture models to identify participant groups and their characteristic engagement, severity, and demographic factors. Users demonstrated significant reductions in PHQ-9 and GAD-7 that were sustained at follow-up. Significant improvements in Hope, Behavioral Activation, Social Interaction, Loneliness, and Perceived Social Support were observed throughout and maintained at 10 week follow-up. Engagement was high and predicted outcomes. Working alliance was comparable to traditional care and predicted outcomes. Automated safety guardrails functioned as designed, with 76 sessions flagged for risk and all handled according to escalation policies. This single arm naturalistic observational study provides initial evidence that a GAI foundation model for mental health can deliver accessible, engaging, effective, and safe mental health support. These results lend support to findings from early randomized designs and offer promise for future study of mental health GAI in real world settings.
zh

计算机视觉

[CV-0] UniX: Unifying Autoregression and Diffusion for Chest X-Ray Understanding and Generation

【速读】:该论文旨在解决当前医学基础模型在胸部X光图像理解与生成任务中难以协同统一的问题,因为这两类任务的目标存在本质冲突:理解任务强调语义抽象,而生成任务则要求像素级重建。现有基于参数共享的自回归架构常导致两个任务性能均受损。其解决方案的关键在于提出UniX模型,通过将任务解耦为一个用于理解的自回归分支和一个用于高保真生成的扩散分支,并引入跨模态自注意力机制,动态地利用理解特征引导生成过程;同时结合严格的数据清洗流程和多阶段训练策略,实现两任务间的协同优化,从而在仅使用LLM-CXR四分之一参数的情况下,显著提升理解性能(Micro-F1提升46.1%)和生成质量(FD-RadDino提升24.2%),达到与专用模型相当的性能水平。

链接: https://arxiv.org/abs/2601.11522
作者: Ruiheng Zhang,Jingfeng Yao,Huangxuan Zhao,Hao Yan,Xiao He,Lei Chen,Zhou Wei,Yong Luo,Zengmao Wang,Lefei Zhang,Dacheng Tao,Bo Du
机构: Wuhan University (武汉大学); Huazhong University of Science and Technology (华中科技大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Codes and models are available at this https URL

点击查看摘要

Abstract:Despite recent progress, medical foundation models still struggle to unify visual understanding and generation, as these tasks have inherently conflicting goals: semantic abstraction versus pixel-level reconstruction. Existing approaches, typically based on parameter-shared autoregressive architectures, frequently lead to compromised performance in one or both tasks. To address this, we present UniX, a next-generation unified medical foundation model for chest X-ray understanding and generation. UniX decouples the two tasks into an autoregressive branch for understanding and a diffusion branch for high-fidelity generation. Crucially, a cross-modal self-attention mechanism is introduced to dynamically guide the generation process with understanding features. Coupled with a rigorous data cleaning pipeline and a multi-stage training strategy, this architecture enables synergistic collaboration between tasks while leveraging the strengths of diffusion models for superior generation. On two representative benchmarks, UniX achieves a 46.1% improvement in understanding performance (Micro-F1) and a 24.2% gain in generation quality (FD-RadDino), using only a quarter of the parameters of LLM-CXR. By achieving performance on par with task-specific models, our work establishes a scalable paradigm for synergistic medical image understanding and generation. Codes and models are available at this https URL.
zh

[CV-1] ShapeR: Robust Conditional 3D Shape Generation from Casual Captures WWW KR ATC

【速读】:该论文旨在解决现有3D形状生成方法对高质量、干净且无遮挡输入数据的依赖问题,这类条件在真实世界场景中难以满足。为应对这一挑战,作者提出ShapeR,其核心解决方案是利用从随意拍摄的图像序列中提取的多模态信息(包括稀疏SLAM点、带姿态的多视角图像和机器生成的文本描述),并通过一个经过校正的流变换器(rectified flow transformer)进行有效条件建模,从而生成高保真度的度量尺度3D形状。关键创新在于结合了视觉惯性SLAM、3D检测算法与视觉-语言模型以构建鲁棒的输入表征,并通过在线组合增强、课程训练策略及背景杂波处理机制提升模型在复杂现实场景下的泛化能力。

链接: https://arxiv.org/abs/2601.11514
作者: Yawar Siddiqui,Duncan Frost,Samir Aroudj,Armen Avetisyan,Henry Howard-Jenkins,Daniel DeTone,Pierre Moulon,Qirui Wu,Zhengqin Li,Julian Straub,Richard Newcombe,Jakob Engel
机构: Meta Reality Labs Research (Meta现实实验室研究); Simon Fraser University (西蒙弗雷泽大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project Page: this http URL Video: this https URL

点击查看摘要

Abstract:Recent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well-segmented inputs. Such conditions are rarely met in real-world scenarios. We present ShapeR, a novel approach for conditional 3D object shape generation from casually captured sequences. Given an image sequence, we leverage off-the-shelf visual-inertial SLAM, 3D detection algorithms, and vision-language models to extract, for each object, a set of sparse SLAM points, posed multi-view images, and machine-generated captions. A rectified flow transformer trained to effectively condition on these modalities then generates high-fidelity metric 3D shapes. To ensure robustness to the challenges of casually captured data, we employ a range of techniques including on-the-fly compositional augmentations, a curriculum training scheme spanning object- and scene-level datasets, and strategies to handle background clutter. Additionally, we introduce a new evaluation benchmark comprising 178 in-the-wild objects across 7 real-world scenes with geometry annotations. Experiments show that ShapeR significantly outperforms existing approaches in this challenging setting, achieving an improvement of 2.7x in Chamfer distance compared to state of the art.
zh
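上文摘要以 Chamfer 距离(较最优基线提升 2.7 倍)衡量生成形状与真实几何的差异。下面给出对称 Chamfer 距离一种常见定义的最小实现(纯示意,非论文官方代码;函数名与点云输入格式均为本文假设):

```python
import numpy as np

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """对称 Chamfer 距离的一种常见定义:
    双向最近邻平方距离的均值之和。
    p: (N, 3) 点云,q: (M, 3) 点云。"""
    # 两两平方欧氏距离矩阵,形状 (N, M)
    d2 = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)
    # p 中每点到 q 的最近距离均值 + q 中每点到 p 的最近距离均值
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```

注意文献中也存在取开方或只取单向均值的变体,对比不同论文数值时需确认口径一致。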

[CV-2] ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes

【速读】:该论文旨在解决室内环境中物体随时间动态变化时,如何在稀疏采样的3D扫描数据中实现语义实例分割(Semantic Instance Segmentation, SIS)的时序一致性问题。现有方法在处理此类任务时面临两大挑战:一是传统3D语义实例分割(3DSIS)方法缺乏时间推理能力,需依赖离散匹配步骤导致跟踪不稳定;二是基于4D LiDAR的方法依赖高频时间采样,在长期演化的室内场景中难以适用。论文提出的解决方案核心在于ReScene4D——一种适配3DSIS架构以支持稀疏时序观测的新型方法,其关键创新在于通过跨帧信息共享机制,在不依赖密集观测的前提下实现了稳定的实例身份追踪,并显著提升了标准3DSIS性能。此外,作者还引入了t-mAP指标来量化时序一致性,为评估该任务提供了新基准。

链接: https://arxiv.org/abs/2601.11508
作者: Emily Steiner,Jianhao Zheng,Henry Howard-Jenkins,Chris Xie,Iro Armeni
机构: Stanford University (斯坦福大学); Meta Reality Labs Research (Meta现实实验室研究)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Indoor environments evolve as objects move, appear, or disappear. Capturing these dynamics requires maintaining temporally consistent instance identities across intermittently captured 3D scans, even when changes are unobserved. We introduce and formalize the task of temporally sparse 4D indoor semantic instance segmentation (SIS), which jointly segments, identifies, and temporally associates object instances. This setting poses a challenge for existing 3DSIS methods, which require a discrete matching step due to their lack of temporal reasoning, and for 4D LiDAR approaches, which perform poorly due to their reliance on high-frequency temporal measurements that are uncommon in the longer-horizon evolution of indoor environments. We propose ReScene4D, a novel method that adapts 3DSIS architectures for 4DSIS without needing dense observations. It explores strategies to share information across observations, demonstrating that this shared context not only enables consistent instance tracking but also improves standard 3DSIS quality. To evaluate this task, we define a new metric, t-mAP, that extends mAP to reward temporal identity consistency. ReScene4D achieves state-of-the-art performance on the 3RScan dataset, establishing a new benchmark for understanding evolving indoor scenes.
zh

[CV-3] Generative Scenario Rollouts for End-to-End Autonomous Driving

【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在端到端自动驾驶系统中主要依赖稀疏轨迹标注进行模仿学习、未能充分挖掘其作为生成式模型潜力的问题。解决方案的关键在于提出一种可插拔的框架——生成场景滚动(Generative Scenario Rollouts, GeRo),通过自回归滚动策略联合执行规划与语言引导的未来交通场景生成。GeRo首先训练VLA模型将本车及交通参与者动态编码为潜在标记,以支持文本对齐生成;随后在多视角图像、场景描述和本车动作问题的条件下,生成未来潜在标记和文本响应,实现长时程推理与多智能体规划。引入滚动一致性损失(rollout-consistency loss)利用真实或伪标签稳定预测,减少漂移并保持文本-动作对齐,从而实现时序一致、语言 grounded 的滚动模拟。该方法显著提升了Bench2Drive基准上的驾驶得分(+15.7)和成功率(+26.2),并在强化学习结合生成滚动的基础上实现了最先进的闭环与开环性能,展现出强大的零样本鲁棒性。

链接: https://arxiv.org/abs/2601.11475
作者: Rajeev Yasarla,Deepti Hegde,Shizhong Han,Hsin-Pai Cheng,Yunxiao Shi,Meysam Sadeghigooghari,Shweta Mahajan,Apratim Bhattacharyya,Litian Liu,Risheek Garrepalli,Thomas Svantesson,Fatih Porikli,Hong Cai
机构: Qualcomm AI Research (高通人工智能研究); Qualcomm Technologies, Inc. (高通技术公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models are emerging as highly effective planning models for end-to-end autonomous driving systems. However, current works mostly rely on imitation learning from sparse trajectory annotations and under-utilize their potential as generative models. We propose Generative Scenario Rollouts (GeRo), a plug-and-play framework for VLA models that jointly performs planning and generation of language-grounded future traffic scenes through an autoregressive rollout strategy. First, a VLA model is trained to encode ego vehicle and agent dynamics into latent tokens under supervision from planning, motion, and language tasks, facilitating text-aligned generation. Next, GeRo performs language-conditioned autoregressive generation. Given multi-view images, a scenario description, and ego-action questions, it generates future latent tokens and textual responses to guide long-horizon rollouts. A rollout-consistency loss stabilizes predictions using ground truth or pseudo-labels, mitigating drift and preserving text-action alignment. This design enables GeRo to perform temporally consistent, language-grounded rollouts that support long-horizon reasoning and multi-agent planning. On Bench2Drive, GeRo improves driving score and success rate by +15.7 and +26.2, respectively. By integrating reinforcement learning with generative rollouts, GeRo achieves state-of-the-art closed-loop and open-loop performance, demonstrating strong zero-shot robustness. These results highlight the promise of generative, language-conditioned reasoning as a foundation for safer and more interpretable end-to-end autonomous driving.
zh

[CV-4] PRISM-CAFO: Prior-conditioned Remote-sensing Infrastructure Segmentation and Mapping for CAFOs

【速读】:该论文旨在解决大规模集约化动物饲养设施(Concentrated Animal Feeding Operations, CAFOs)的精准识别与特征刻画问题,以应对其对人类健康和环境带来的风险,并提升在极端天气和疫病威胁下的监测能力。解决方案的关键在于提出一种基于基础设施的可解释性流程:首先利用领域调优的YOLOv8检测器定位候选基础设施(如畜舍、饲料场、粪便池和储粮塔),并通过SAM2生成掩码并结合组件特定筛选条件;其次提取结构化描述符(如数量、面积、朝向及空间关系)并与深度视觉特征融合,采用轻量级空间交叉注意力分类器进行决策;最终输出CAFO类型预测及掩码级别的归因信息,实现决策可解释性。该方法在多个美国区域均达到最优性能,相较最佳基线提升达15%。

链接: https://arxiv.org/abs/2601.11451
作者: Oishee Bintey Hoque,Nibir Chandra Mandal,Kyle Luong,Amanda Wilson,Samarth Swarup,Madhav Marathe,Abhijin Adiga
机构: University of Virginia (弗吉尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large-scale livestock operations pose significant risks to human health and the environment, while also being vulnerable to threats such as infectious diseases and extreme weather events. As the number of such operations continues to grow, accurate and scalable mapping has become increasingly important. In this work, we present an infrastructure-first, explainable pipeline for identifying and characterizing Concentrated Animal Feeding Operations (CAFOs) from aerial and satellite imagery. Our method (1) detects candidate infrastructure (e.g., barns, feedlots, manure lagoons, silos) with a domain-tuned YOLOv8 detector, then derives SAM2 masks from these boxes and filters component-specific criteria, (2) extracts structured descriptors (e.g., counts, areas, orientations, and spatial relations) and fuses them with deep visual features using a lightweight spatial cross-attention classifier, and (3) outputs both CAFO type predictions and mask-level attributions that link decisions to visible infrastructure. Through comprehensive evaluation, we show that our approach achieves state-of-the-art performance, with Swin-B+PRISM-CAFO surpassing the best performing baseline by up to 15%. Beyond strong predictive performance across diverse U.S. regions, we run systematic gradient–activation analyses that quantify the impact of domain priors and show ho
zh

[CV-5] When Are Two Scores Better Than One? Investigating Ensembles of Diffusion Models

【速读】:该论文试图解决的问题是:在无条件基于得分的扩散模型(score-based diffusion models)中,集成方法(ensembling)是否能够像在监督学习中那样提升生成模型的性能,尤其是在图像质量指标(如FID)上的表现。解决方案的关键在于系统性地评估多种集成策略(包括Deep Ensembles、Monte Carlo Dropout等)对得分估计误差和模型似然的影响,并结合理论分析揭示得分模型集成的本质机制,从而解释为何集成虽能改善分数匹配损失和对数似然,却无法稳定提升感知质量(如FID),并进一步探索其在表格数据(随机森林)中的适用性差异。

链接: https://arxiv.org/abs/2601.11444
作者: Raphaël Razafindralambo,Rémy Sun,Frédéric Precioso,Damien Garreau,Pierre-Alexandre Mattei
机构: Inria(法国国家信息与自动化研究院); Université Côte d’Azur(蔚蓝海岸大学); CNRS(法国国家科学研究中心); I3S(智能系统研究所); Maasai(研究团队); Julius-Maximilians-Universität Würzburg(维尔茨堡大学); Institute for Computer Science / CAIDAS(计算机科学研究所/CAIDAS); LJAD(数学与应用实验室)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
备注: Accepted at TMLR. Code: this https URL

点击查看摘要

Abstract:Diffusion models now generate high-quality, diverse samples, with an increasing focus on more powerful models. Although ensembling is a well-known way to improve supervised models, its application to unconditional score-based diffusion models remains largely unexplored. In this work we investigate whether it provides tangible benefits for generative modelling. We find that while ensembling the scores generally improves the score-matching loss and model likelihood, it fails to consistently enhance perceptual quality metrics such as FID on image datasets. We confirm this observation across a breadth of aggregation rules using Deep Ensembles, Monte Carlo Dropout, on CIFAR-10 and FFHQ. We attempt to explain this discrepancy by investigating possible explanations, such as the link between score estimation and image quality. We also look into tabular data through random forests, and find that one aggregation strategy outperforms the others. Finally, we provide theoretical insights into the summing of score models, which shed light not only on ensembling but also on several model composition techniques (e.g. guidance).
zh
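该论文比较了多种得分聚合规则,其中最基础的一种是对集成成员的得分估计(即对 ∇ log p 的估计)取算术平均。下面用解析可解的高斯得分函数做一个最小演示(示意代码,成员构造与函数名均为本文假设,并非论文实现):

```python
import numpy as np

def gaussian_score(x, mu, sigma):
    """各向同性高斯 N(mu, sigma^2 I) 的解析得分函数 grad_x log p(x)。"""
    return (mu - x) / sigma ** 2

def ensemble_score(x, members):
    """对集成成员在同一点 x 处的得分估计取算术平均。"""
    return np.mean([s(x) for s in members], axis=0)

# 两个成员对同一目标分布的均值估计存在相反方向的小偏差,
# 平均后偏差相互抵消——这正是集成改善得分匹配损失的直观来源。
members = [lambda x: gaussian_score(x, np.array([0.1, 0.0]), 1.0),
           lambda x: gaussian_score(x, np.array([-0.1, 0.0]), 1.0)]
avg = ensemble_score(np.zeros(2), members)
```

论文的发现是:这类聚合虽能改善得分匹配损失与似然,但并不必然转化为 FID 等感知指标的提升。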

[CV-6] Map2Thought: Explicit 3D Spatial Reasoning via Metric Cognitive Maps

【速读】:该论文旨在解决三维视觉语言模型(3D VLMs)在空间推理过程中缺乏可解释性的问题,即模型难以提供清晰、可追踪的几何推理路径。其解决方案的关键在于提出Map2Thought框架,该框架由两个核心组件构成:度量认知地图(Metric Cognitive Map, Metric-CogMap)和认知思维链(Cognitive Chain-of-Thought, Cog-CoT)。其中,Metric-CogMap通过融合离散网格表示与连续度量尺度表示,实现统一的空间表征;在此基础上,Cog-CoT利用确定性几何运算(如向量操作、边界框距离计算及遮挡感知的外观顺序线索)进行显式几何推理,并生成基于三维结构的可解释推理轨迹,从而显著提升模型在低监督场景下的性能与透明度。

链接: https://arxiv.org/abs/2601.11442
作者: Xiangjun Gao,Zhensong Zhang,Dave Zhenyu Chen,Songcen Xu,Long Quan,Eduardo Pérez-Pellitero,Youngkyoon Jang
机构: The Hong Kong University of Science and Technology (香港科技大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose Map2Thought, a framework that enables explicit and interpretable spatial reasoning for 3D VLMs. The framework is grounded in two key components: Metric Cognitive Map (Metric-CogMap) and Cognitive Chain-of-Thought (Cog-CoT). Metric-CogMap provides a unified spatial representation by integrating a discrete grid for relational reasoning with a continuous, metric-scale representation for precise geometric understanding. Building upon the Metric-CogMap, Cog-CoT performs explicit geometric reasoning through deterministic operations, including vector operations, bounding-box distances, and occlusion-aware appearance order cues, producing interpretable inference traces grounded in 3D structure. Experimental results show that Map2Thought enables explainable 3D understanding, achieving 59.9% accuracy using only half the supervision, closely matching the 60.9% baseline trained with the full dataset. It consistently outperforms state-of-the-art methods by 5.3%, 4.8%, and 4.0% under 10%, 25%, and 50% training subsets, respectively, on the VSI-Bench.
zh

[CV-7] opology-Guaranteed Image Segmentation: Enforcing Connectivity Genus and Width Constraints

【速读】:该论文旨在解决传统拓扑方法在图像分割中难以同时保留结构连通性与几何宽度信息(如线宽、长度)的问题。现有方法如持久同调(persistent homology)虽能刻画拓扑不变量(如连通分支数和亏格),但其数学定义缺乏对结构宽度属性的显式建模,导致分割结果在实际应用中无法满足几何精度需求。解决方案的关键在于提出一种融合持久同调与偏微分方程(PDE)平滑思想的新颖数学框架:通过调整上水平集的局部极值点来引入宽度信息,并将此增强后的拓扑描述嵌入变分图像分割模型中;进一步结合特定损失函数设计神经网络,从而在保持连通性和亏格等拓扑不变量的同时,显式地保留分割结构的宽度特征(如厚度与长度)。

链接: https://arxiv.org/abs/2601.11409
作者: Wenxiao Li,Xue-Cheng Tai,Jun Liu
机构: Beijing Normal University (北京师范大学); NORCE Norwegian Research Centre (挪威研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing research highlights the crucial role of topological priors in image segmentation, particularly in preserving essential structures such as connectivity and genus. Accurately capturing these topological features often requires incorporating width-related information, including the thickness and length inherent to the image structures. However, traditional mathematical definitions of topological structures lack this dimensional width information, limiting methods like persistent homology from fully addressing practical segmentation needs. To overcome this limitation, we propose a novel mathematical framework that explicitly integrates width information into the characterization of topological structures. This method leverages persistent homology, complemented by smoothing concepts from partial differential equations (PDEs), to modify local extrema of upper-level sets. This approach enables the resulting topological structures to inherently capture width properties. We incorporate this enhanced topological description into variational image segmentation models. Using some proper loss functions, we are also able to design neural networks that can segment images with the required topological and width properties. Through variational constraints on the relevant topological energies, our approach successfully preserves essential topological invariants such as connectivity and genus counts, simultaneously ensuring that segmented structures retain critical width attributes, including line thickness and length. Numerical experiments demonstrate the effectiveness of our method, showcasing its capability to maintain topological fidelity while explicitly embedding width characteristics into segmented image structures.
zh
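摘要中提到的连通性、亏格等拓扑不变量是在上水平集 {u ≥ τ} 上刻画的。下面用并查集统计单一阈值下上水平集的连通分支数(0 维 Betti 数 β₀)作最小示意——仅帮助理解持久同调所观察的对象,并非论文方法本身(函数名与 4 邻接选择均为本文假设):

```python
def betti0_upper_level(img, tau):
    """统计上水平集 {img >= tau} 在 4 邻接下的连通分支数 β0。
    img 为二维列表(灰度图),tau 为阈值。"""
    h, w = len(img), len(img[0])
    parent = {}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # 路径折半
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # 只把落入上水平集的像素登记为并查集结点
    for i in range(h):
        for j in range(w):
            if img[i][j] >= tau:
                parent[(i, j)] = (i, j)
    # 将 4 邻接的上水平集像素合并(只需检查右、下两个方向)
    for (i, j) in list(parent):
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if (ni, nj) in parent:
                union((i, j), (ni, nj))
    return len({find(p) for p in parent})
```

持久同调即是在 τ 连续变化时追踪这些分支(及更高维特征)的出现与消亡;论文的贡献在于在此拓扑描述之上补充宽度(线宽、长度)信息。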

[CV-8] SME-YOLO: A Real-Time Detector for Tiny Defect Detection on PCB Surfaces

【速读】:该论文旨在解决印刷电路板(Printed Circuit Board, PCB)表面缺陷检测中因缺陷尺寸微小、纹理相似性高及尺度分布不均而导致的高精度检测难题。解决方案的关键在于提出一种基于YOLOv11n的新型框架SME-YOLO(Small-target Multi-scale Enhanced YOLO),其核心创新包括:引入归一化Wasserstein距离损失(Normalized Wasserstein Distance Loss, NWDLoss)以降低交并比(Intersection over Union, IoU)对微小目标位置偏移的敏感性;用高效上采样卷积模块(Efficient Upsampling Convolution Block, EUCB)替代原始上采样结构,通过多尺度卷积逐步恢复空间分辨率并增强边缘与纹理细节保留能力;设计多尺度聚焦注意力模块(Multi-Scale Focused Attention, MSFA),针对PCB缺陷的空间分布特性自适应强化关键尺度区间内的感知能力,实现局部细粒度特征与全局上下文信息的有效融合。实验表明,该方法在PKU-PCB数据集上相较基线YOLOv11n提升了2.2%的mAP和4%的Precision,验证了其有效性。

链接: https://arxiv.org/abs/2601.11402
作者: Meng Han
机构: Henan University (河南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Surface defects on Printed Circuit Boards (PCBs) directly compromise product reliability and safety. However, achieving high-precision detection is challenging because PCB defects are typically characterized by tiny sizes, high texture similarity, and uneven scale distributions. To address these challenges, this paper proposes a novel framework based on YOLOv11n, named SME-YOLO (Small-target Multi-scale Enhanced YOLO). First, we employ the Normalized Wasserstein Distance Loss (NWDLoss). This metric effectively mitigates the sensitivity of Intersection over Union (IoU) to positional deviations in tiny objects. Second, the original upsampling module is replaced by the Efficient Upsampling Convolution Block (EUCB). By utilizing multi-scale convolutions, the EUCB gradually recovers spatial resolution and enhances the preservation of edge and texture details for tiny defects. Finally, this paper proposes the Multi-Scale Focused Attention (MSFA) module. Tailored to the specific spatial distribution of PCB defects, this module adaptively strengthens perception within key scale intervals, achieving efficient fusion of local fine-grained features and global context information. Experimental results on the PKU-PCB dataset demonstrate that SME-YOLO achieves state-of-the-art performance. Specifically, compared to the baseline YOLOv11n, SME-YOLO improves mAP by 2.2% and Precision by 4%, validating the effectiveness of the proposed method.
zh
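摘要中的 NWDLoss 基于微小目标检测文献中常见的归一化 Wasserstein 距离:将框 (cx, cy, w, h) 建模为二维高斯,再对两高斯间的 2-Wasserstein 距离做指数归一化,从而避免 IoU 对微小目标位置偏移的过度敏感。以下为该常见定义的示意实现(常数 c 的取值与论文具体设置可能不同,仅作演示):

```python
import math

def nwd(box1, box2, c=12.8):
    """归一化 Wasserstein 距离(常见定义)。
    box = (cx, cy, w, h);对应高斯 N([cx, cy], diag(w^2/4, h^2/4)),
    此时两高斯 2-Wasserstein 距离有闭式解。c 为数据集相关常数。"""
    cx1, cy1, w1, h1 = box1
    cx2, cy2, w2, h2 = box2
    w2_sq = ((cx1 - cx2) ** 2 + (cy1 - cy2) ** 2
             + ((w1 - w2) / 2) ** 2 + ((h1 - h2) / 2) ** 2)
    return math.exp(-math.sqrt(w2_sq) / c)

def nwd_loss(box1, box2, c=12.8):
    """以 1 - NWD 作为回归损失:框完全重合时损失为 0。"""
    return 1.0 - nwd(box1, box2, c)
```

与 IoU 不同,即便两个微小框完全不相交,NWD 仍给出随距离平滑衰减的非零相似度,从而提供有效梯度。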

[CV-9] Wetland mapping from sparse annotations with satellite image time series and temporal-aware segment anything model

【速读】:该论文旨在解决湿地遥感制图中因稀疏点标注、季节与年际动态变化导致的精度不足问题,尤其针对现有深度学习模型在单时相影像下表现不佳,以及基础模型如SAM(Segment Anything Model)无法建模时间信息而导致异质湿地分割碎片化的问题。其解决方案的关键在于提出WetSAM框架,采用双分支结构:一是时序提示分支,通过层次化适配器和动态时间聚合机制,将湿地特征从物候变化中解耦;二是空间分支,利用时序约束的区域生长策略生成可靠的密集伪标签;并通过双向一致性正则化协同优化两个分支,从而实现仅需少量点标注即可获得高精度、结构一致的湿地分割结果。

链接: https://arxiv.org/abs/2601.11400
作者: Shuai Yuan,Tianwu Lin,Shuang Chen,Yu Xia,Peng Qin,Xiangyu Liu,Xiaoqing Xu,Nan Xu,Hongsheng Zhang,Jie Wang,Peng Gong
机构: The University of Hong Kong (香港大学); Pengcheng Laboratory (鹏城实验室); Harbin Institute of Technology (深圳) (哈尔滨工业大学(深圳)); Shenzhen University (深圳大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate wetland mapping is essential for ecosystem monitoring, yet dense pixel-level annotation is prohibitively expensive and practical applications usually rely on sparse point labels, under which existing deep learning models perform poorly, while strong seasonal and inter-annual wetland dynamics further render single-date imagery inadequate and lead to significant mapping errors; although foundation models such as SAM show promising generalization from point prompts, they are inherently designed for static images and fail to model temporal information, resulting in fragmented masks in heterogeneous wetlands. To overcome these limitations, we propose WetSAM, a SAM-based framework that integrates satellite image time series for wetland mapping from sparse point supervision through a dual-branch design, where a temporally prompted branch extends SAM with hierarchical adapters and dynamic temporal aggregation to disentangle wetland characteristics from phenological variability, and a spatial branch employs a temporally constrained region-growing strategy to generate reliable dense pseudo-labels, while a bidirectional consistency regularization jointly optimizes both branches. Extensive experiments across eight global regions of approximately 5,000 km2 each demonstrate that WetSAM substantially outperforms state-of-the-art methods, achieving an average F1-score of 85.58%, and delivering accurate and structurally consistent wetland segmentation with minimal labeling effort, highlighting its strong generalization capability and potential for scalable, low-cost, high-resolution wetland mapping.
zh

[CV-10] SUG-Occ: An Explicit Semantics and Uncertainty Guided Sparse Learning Framework for Real-Time 3D Occupancy Prediction

【速读】:该论文旨在解决3D语义占据预测(3D semantic occupancy prediction)在自动驾驶场景理解中因高计算与内存开销而导致难以实时部署的问题。其核心挑战在于如何在保持几何与语义完整性的同时,显著降低冗余计算。解决方案的关键在于提出SUG-Occ框架,通过显式利用语义和不确定性先验抑制自由空间投影,结合无符号距离编码增强几何一致性,构建结构一致的稀疏三维表示;进一步设计级联稀疏补全模块(基于超交叉稀疏卷积与生成式上采样),实现从粗到精的高效推理;最后引入基于对象上下文表示(Object Contextual Representation, OCR)的掩码解码器,通过轻量级查询-上下文交互聚合全局语义信息并优化体素级预测,避免对体积特征进行昂贵的注意力操作,从而在SemanticKITTI基准上实现了7.34%的精度提升和57.8%的效率增益。

链接: https://arxiv.org/abs/2601.11396
作者: Hanlin Wu,Pengfei Lin,Ehsan Javanmardi,Nanren Bao,Bo Qian,Hao Si,Manabu Tsukada
机构: The University of Tokyo (东京大学); National Institute of Informatics (日本信息研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As autonomous driving moves toward full scene understanding, 3D semantic occupancy prediction has emerged as a crucial perception task, offering voxel-level semantics beyond traditional detection and segmentation paradigms. However, such a refined representation for scene understanding incurs prohibitive computation and memory overhead, posing a major barrier to practical real-time deployment. To address this, we propose SUG-Occ, an explicit Semantics and Uncertainty Guided Sparse Learning Enabled 3D Occupancy Prediction Framework, which exploits the inherent sparsity of 3D scenes to reduce redundant computation while maintaining geometric and semantic completeness. Specifically, we first utilize semantic and uncertainty priors to suppress projections from free space during view transformation while employing an explicit unsigned distance encoding to enhance geometric consistency, producing a structurally consistent sparse 3D representation. Secondly, we design a cascade sparse completion module via hyper cross sparse convolution and generative upsampling to enable efficient coarse-to-fine reasoning. Finally, we devise an object contextual representation (OCR) based mask decoder that aggregates global semantic context from sparse features and refines voxel-wise predictions via lightweight query-context interactions, avoiding expensive attention operations over volumetric features. Extensive experiments on SemanticKITTI benchmark demonstrate that the proposed approach outperforms the baselines, achieving a 7.34% improvement in accuracy and a 57.8% gain in efficiency.
zh

[CV-11] Heterogeneous Uncertainty-Guided Composed Image Retrieval with Fine-Grained Probabilistic Learning AAAI2026

【速读】:该论文旨在解决组合图像检索(Composed Image Retrieval, CIR)中因数据三元组内在噪声导致的不确定性问题,该不确定性会削弱模型鲁棒性。现有概率学习方法因采用实例级整体建模且对查询与目标同质化处理,在CIR场景下表现不足。其解决方案的关键在于提出一种异质不确定性引导(Heterogeneous Uncertainty-Guided, HUG)范式,通过细粒度的概率学习框架,用高斯嵌入(Gaussian embeddings)分别建模查询与目标的语义概念及其不确定性;针对多模态查询与单模态目标设计差异化的不确定性估计策略,并引入可证明的动态加权机制整合内容质量与跨模态协同不确定性,从而获得综合查询不确定性;同时设计基于不确定性的对比学习目标,结合全面负采样策略,显著提升判别能力。

链接: https://arxiv.org/abs/2601.11393
作者: Haomiao Tang,Jinpeng Wang,Minyi Zhao,Guanghao Meng,Ruisheng Luo,Long Chen,Shu-Tao Xia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication and oral presentation at AAAI 2026

点击查看摘要

Abstract:Composed Image Retrieval (CIR) enables image search by combining a reference image with modification text. Intrinsic noise in CIR triplets incurs intrinsic uncertainty and threatens the model’s robustness. Probabilistic learning approaches have shown promise in addressing such issues; however, they fall short for CIR due to their instance-level holistic modeling and homogeneous treatment of queries and targets. This paper introduces a Heterogeneous Uncertainty-Guided (HUG) paradigm to overcome these limitations. HUG utilizes a fine-grained probabilistic learning framework, where queries and targets are represented by Gaussian embeddings that capture detailed concepts and uncertainties. We customize heterogeneous uncertainty estimations for multi-modal queries and uni-modal targets. Given a query, we capture uncertainties not only regarding uni-modal content quality but also multi-modal coordination, followed by a provable dynamic weighting mechanism to derive comprehensive query uncertainty. We further design uncertainty-guided objectives, including query-target holistic contrast and fine-grained contrasts with comprehensive negative sampling strategies, which effectively enhance discriminative learning. Experiments on benchmarks demonstrate HUG’s effectiveness beyond state-of-the-art baselines, with faithful analysis justifying the technical contributions.
zh

[CV-12] hink-Clip-Sample: Slow-Fast Frame Selection for Video Understanding ICASSP2026

【速读】:该论文旨在解决多模态大语言模型(Multi-modal Large Language Models, MLLMs)在长视频理解任务中因计算资源限制和帧选择策略不佳而导致性能受限的问题。其解决方案的核心在于提出一种无需训练的框架Think-Clip-Sample (TCS),包含两个关键组件:(i) 多查询推理(Multi-Query Reasoning),通过生成多个互补性查询来充分挖掘问题与视频内容之间的多维关联;(ii) 片段级慢快采样(Clip-level Slow-Fast Sampling),自适应地平衡密集局部细节与稀疏全局上下文,从而提升长视频理解的准确率与效率。实验表明,TCS在MLVU、LongVideoBench和VideoMME等多个基准上均显著提升模型性能,最高可提升6.9%准确率,并在保持相近精度的同时减少50%的推理时间成本。

链接: https://arxiv.org/abs/2601.11359
作者: Wenhui Tan,Ruihua Song,Jiaze Li,Jianzhong Ju,Zhenbo Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICASSP2026

点击查看摘要

Abstract:Recent progress in multi-modal large language models (MLLMs) has significantly advanced video understanding. However, their performance on long-form videos remains limited by computational constraints and suboptimal frame selection. We present Think-Clip-Sample (TCS), a training-free framework that enhances long video understanding through two key components: (i) Multi-Query Reasoning, which generates multiple queries to capture complementary aspects of the question and video; and (ii) Clip-level Slow-Fast Sampling, which adaptively balances dense local details and sparse global context. Extensive experiments on MLVU, LongVideoBench, and VideoMME demonstrate that TCS consistently improves performance across different MLLMs, boosting up to 6.9% accuracy, and is capable of achieving comparable accuracy with 50% fewer inference time cost, highlighting both efficiency and efficacy of TCS on long video understanding.
zh
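“慢快采样”的基本思想是:一部分帧预算在相关片段内密集(“慢”)采样以保留局部细节,其余预算在整段视频上稀疏(“快”)采样以保留全局上下文。下面是该思想的一个最小示意(预算五五分配与取整方式均为本文假设,具体的自适应分配策略以论文为准):

```python
def slow_fast_sample(num_frames, clip_start, clip_end, budget):
    """片段级慢快采样示意:一半预算在相关片段 [clip_start, clip_end)
    内密集采样,另一半在 [0, num_frames) 上稀疏采样,
    返回去重并排序后的帧索引。"""
    k_slow = budget // 2
    k_fast = budget - k_slow

    def evenly(lo, hi, k):
        # 在 [lo, hi) 上等间隔取 k 个整数索引(取各子区间中点)
        if k <= 0 or hi <= lo:
            return []
        step = (hi - lo) / k
        return [min(hi - 1, int(lo + step * i + step / 2)) for i in range(k)]

    slow = evenly(clip_start, clip_end, k_slow)   # 局部细节
    fast = evenly(0, num_frames, k_fast)          # 全局上下文
    return sorted(set(slow + fast))
```

实际系统中,相关片段通常由 Multi-Query Reasoning 生成的查询与视频帧的相似度确定,这里直接以区间参数代入。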

[CV-13] Assessing Building Heat Resilience Using UAV and Street-View Imagery with Coupled Global Context Vision Transformer

【速读】:该论文旨在解决发展中国家城市地区因气候变化加剧的人类热暴露问题,尤其是低收入社区由于建筑材料和城市结构特征导致的热风险不平等。其核心挑战在于缺乏可扩展的方法来评估与热暴露相关的建筑属性。解决方案的关键在于提出一种基于机器学习的双模态跨视角学习框架(dual-modality cross-view learning approach),通过融合开源的无人机(UAV)和街景(SV)影像,利用耦合全局上下文视觉变压器(CGCViT)模型提取城市结构的热相关表征,并结合HotSat-1热红外(TIR)测量数据量化建筑属性与健康风险之间的关系。该方法显著优于单一模态模型(提升达9.3%),并揭示了植被覆盖、浅色屋顶及混凝土/黏土/木质屋顶等特征与较低热暴露水平密切相关,从而为基于本地化数据的公平气候适应策略提供技术支持。

链接: https://arxiv.org/abs/2601.11357
作者: Steffen Knoblauch,Ram Kumar Muthusamy,Hao Li,Iddy Chazua,Benedcto Adamu,Innocent Maholi,Alexander Zipf
机构: Heidelberg University (海德堡大学); National University of Singapore (新加坡国立大学); OpenMap Development Tanzania (OpenMap发展坦桑尼亚)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Climate change is intensifying human heat exposure, particularly in densely built urban centers of the Global South. Low-cost construction materials and high thermal-mass surfaces further exacerbate this risk. Yet scalable methods for assessing such heat-relevant building attributes remain scarce. We propose a machine learning framework that fuses openly available unmanned aerial vehicle (UAV) and street-view (SV) imagery via a coupled global context vision transformer (CGCViT) to learn heat-relevant representations of urban structures. Thermal infrared (TIR) measurements from HotSat-1 are used to quantify the relationship between building attributes and heat-associated health risks. Our dual-modality cross-view learning approach outperforms the best single-modality models by up to 9.3% , demonstrating that UAV and SV imagery provide valuable complementary perspectives on urban structures. The presence of vegetation surrounding buildings (versus no vegetation), brighter roofing (versus darker roofing), and roofing made of concrete, clay, or wood (versus metal or tarpaulin) are all significantly associated with lower HotSat-1 TIR values. Deployed across the city of Dar es Salaam, Tanzania, the proposed framework illustrates how household-level inequalities in heat exposure - often linked to socio-economic disadvantage and reflected in building materials - can be identified and addressed using machine learning. Our results point to the critical role of localized, data-driven risk assessment in shaping climate adaptation strategies that deliver equitable outcomes.
zh

[CV-14] Beer-Lambert Autoencoder for Unsupervised Stain Representation Learning and Deconvolution in Multi-immunohistochemical Brightfield Histology Images

【速读】:该论文旨在解决多色免疫组化(multiplex immunohistochemistry, mIHC)中多个染色剂(chromogens)在RGB全切片图像(WSIs)中的贡献分离问题,这是实现染色标准化、标记表达定量分析和细胞层面读数的关键步骤。传统基于Beer-Lambert(BL)定律的颜色解卷积方法在两至三染色场景下表现良好,但在K≥3种染色剂的mIHC中因欠定性(under-determined)和数值不稳定而失效。其解决方案的核心是一个数据驱动的编码器-解码器架构:编码器采用轻量级U-Net结构预测K个非负浓度通道,解码器则为可微分的BL前向模型,其染色矩阵以典型染色剂色相初始化;训练过程无监督,通过感知重建目标与抑制不必要染色混合的损失项共同优化,从而获得清晰且分离良好的各染色剂浓度图谱。

链接: https://arxiv.org/abs/2601.11336
作者: Mark Eastwood,Thomas McKee,Zedong Hu,Sabine Tejpar,Fayyaz Minhas
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Separating the contributions of individual chromogenic stains in RGB histology whole slide images (WSIs) is essential for stain normalization, quantitative assessment of marker expression, and cell-level readouts in immunohistochemistry (IHC). Classical Beer-Lambert (BL) color deconvolution is well-established for two- or three-stain settings, but becomes under-determined and unstable for multiplex IHC (mIHC) with K ≥ 3 chromogens. We present a simple, data-driven encoder-decoder architecture that learns cohort-specific stain characteristics for mIHC RGB WSIs and yields crisp, well-separated per-stain concentration maps. The encoder is a compact U-Net that predicts K nonnegative concentration channels; the decoder is a differentiable BL forward model with a learnable stain matrix initialized from typical chromogen hues. Training is unsupervised with a perceptual reconstruction objective augmented by loss terms that discourage unnecessary stain mixing. On a colorectal mIHC panel comprising 5 stains (H, CDX2, MUC2, MUC5, CD8) we show excellent RGB reconstruction, and significantly reduced inter-channel bleed-through compared with matrix-based deconvolution. Code and model are available at this https URL.
zh
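作为补充说明,摘要中的 Beer-Lambert 前向模型以及它所替代的经典矩阵解卷积,可以用下面的 NumPy 草图来示意(非论文官方实现;染色矩阵的色相向量数值为示意性假设)。当染色矩阵行满秩且 K≤3 时,对光密度取伪逆即可精确恢复浓度;论文针对的正是 K≥3 时该伪逆欠定、不稳定的情形:

```python
import numpy as np

def bl_forward(C, S):
    """Beer-Lambert 前向模型: 浓度 C (N,K) 与染色矩阵 S (K,3) -> RGB 透射率 (N,3)。"""
    od = C @ S                 # 光密度 (optical density)
    return np.exp(-od)         # I = I0 * exp(-OD), 取 I0 = 1

def bl_deconvolve(rgb, S):
    """经典矩阵解卷积: 对 OD = -log(I) 乘以染色矩阵的伪逆; K>3 时欠定且不稳定。"""
    od = -np.log(np.clip(rgb, 1e-6, 1.0))
    return od @ np.linalg.pinv(S)

# 两种染色的玩具例子(染色矩阵为假设的、单位化后的色相向量)
S = np.array([[0.65, 0.70, 0.29],    # 类 Hematoxylin 色相
              [0.07, 0.99, 0.11]])   # 类 Eosin 色相
S = S / np.linalg.norm(S, axis=1, keepdims=True)
C_true = np.array([[0.8, 0.1], [0.2, 0.9]])
rgb = bl_forward(C_true, S)
C_rec = bl_deconvolve(rgb, S)       # 行满秩时可精确恢复
```

论文中的编码器-解码器正是把上面的伪逆一步替换为 U-Net 编码器,而把 `bl_forward` 保留为可微分解码器。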

[CV-15] Enhancing Vision Language Models with Logic Reasoning for Situational Awareness

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在态势感知(Situational Awareness, SA)应用中,对罕见但关键事件的识别准确性不足、细节提取不充分以及输出缺乏可解释性的问题。其解决方案的关键在于将VLMs与传统计算机视觉方法通过显式逻辑推理相结合,提出一种智能微调(intelligent fine-tuning, FT)策略,不仅显著提升了识别精度,还能在推理阶段生成对VLM输出的合理性说明,从而增强结果的可靠性与可解释性。

链接: https://arxiv.org/abs/2601.11322
作者: Pavana Pradeep,Krishna Kant,Suya Yu
机构: Temple University (坦普尔大学); ARL (陆军研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Logic in Computer Science (cs.LO)
备注: Accepted for publication in IEEE Transactions on AI

点击查看摘要

Abstract:Vision-Language Models (VLMs) offer the ability to generate high-level, interpretable descriptions of complex activities from images and videos, making them valuable for situational awareness (SA) applications. In such settings, the focus is on identifying infrequent but significant events with high reliability and accuracy, while also extracting fine-grained details and assessing recognition quality. In this paper, we propose an approach that integrates VLMs with traditional computer vision methods through explicit logic reasoning to enhance SA in three key ways: (a) extracting fine-grained event details, (b) employing an intelligent fine-tuning (FT) strategy that achieves substantially higher accuracy than uninformed selection, and (c) generating justifications for VLM outputs during inference. We demonstrate that our intelligent FT mechanism improves the accuracy and provides a valuable means, during inferencing, to either confirm the validity of the VLM output or indicate why it may be questionable.
zh

[CV-16] Context-Aware Semantic Segmentation via Stage-Wise Attention

【速读】:该论文旨在解决遥感图像超高分辨率(Ultra High Resolution, UHR)语义分割任务中面临的挑战,即基于 Transformer 的模型因内存随 token 数量呈二次增长,不得不在上下文范围与空间分辨率之间折中。解决方案的关键在于提出一种双分支、基于 Swin 的架构 CASWiT(Context-Aware Stage-Wise Transformer),其通过一个上下文编码器处理下采样邻域以捕获长程依赖关系,同时高分辨率编码器提取 UHR 图像块的细节特征,并利用跨尺度融合模块(结合交叉注意力与门控特征注入机制)将全局上下文信息注入细粒度特征中,从而在保持高分辨率的同时增强语义理解能力。此外,作者还设计了一种 SimMIM 风格的预训练策略,通过掩码 75% 的高分辨率图像 token 和对应低分辨率中心区域进行重建训练,进一步提升模型性能。

链接: https://arxiv.org/abs/2601.11310
作者: Antoine Carreaud,Elias Naha,Arthur Chansel,Nina Lahellec,Jan Skaloud,Adrien Gressin
机构: ESO lab. EPFL (EPFL欧洲核子研究中心实验室); University of Applied Sciences Western Switzerland (HES-SO / HEIG-VD) (瑞士西部应用科学大学(HES-SO / HEIG-VD))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Semantic ultra high resolution image (UHR) segmentation is essential in remote sensing applications such as aerial mapping and environmental monitoring. Transformer-based models struggle in this setting because memory grows quadratically with token count, constraining either the contextual scope or the spatial resolution. We introduce CASWiT (Context-Aware Stage-Wise Transformer), a dual-branch, Swin-based architecture that injects global cues into fine-grained UHR features. A context encoder processes a downsampled neighborhood to capture long-range dependencies, while a high resolution encoder extracts detailed features from UHR patches. A cross-scale fusion module, combining cross-attention and gated feature injection, enriches high-resolution tokens with context. Beyond architecture, we propose a SimMIM-style pretraining. We mask 75% of the high-resolution image tokens and the low-resolution center region that spatially corresponds to the UHR patch, then train the shared dual-encoder with small decoder to reconstruct the UHR initial image. Extensive experiments on the large-scale IGN FLAIR-HUB aerial dataset demonstrate the effectiveness of CASWiT. Our method achieves 65.83% mIoU, outperforming RGB baselines by 1.78 points. On URUR, CASWiT achieves 49.1% mIoU, surpassing the current SoTA by +0.9% under the official evaluation protocol. All codes are provided on: this https URL.
zh
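摘要中"交叉注意力 + 门控特征注入"式的跨尺度融合,可以用下面的最小 NumPy 草图来示意(单头、无多层归一化等细节;维度与门控形式均为示意性假设,非官方实现):高分辨率 token 作为 query 去下采样的上下文 token 中聚合全局线索,再经 sigmoid 门控以残差方式注入:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_scale_fusion(hr, ctx, Wq, Wk, Wv, Wg):
    """hr: (N,d) 高分辨率 token; ctx: (M,d) 下采样邻域的上下文 token。"""
    Q, K, V = hr @ Wq, ctx @ Wk, ctx @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (N,M) 交叉注意力
    context = attn @ V                               # 聚合的全局上下文
    gate = 1.0 / (1.0 + np.exp(-(hr @ Wg)))          # 逐元素 sigmoid 门控
    return hr + gate * context                       # 门控残差注入

rng = np.random.default_rng(0)
d = 8
hr, ctx = rng.normal(size=(4, d)), rng.normal(size=(6, d))
Ws = [rng.normal(scale=0.1, size=(d, d)) for _ in range(4)]
fused = cross_scale_fusion(hr, ctx, *Ws)
```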

[CV-17] SAMannot: A Memory-Efficient Local Open-source Framework for Interactive Video Instance Segmentation based on SAM2

【速读】:该论文旨在解决当前精准视频分割研究中面临的两大核心问题:一是手动标注过程劳动密集且效率低下,二是云服务工具存在隐私泄露风险。为应对上述挑战,作者提出了一种名为SAMannot的开源本地化框架,其关键创新在于将Segment Anything Model 2 (SAM2) 集成到人机协同(human-in-the-loop)的工作流中,并通过优化模型依赖关系和引入轻量级处理层,在保证高响应速度的同时显著降低计算资源消耗。该方案还包含持久实例身份管理、基于屏障帧的自动“锁定与精修”流程以及基于掩码骨架化的自动提示机制,从而实现高效、私密且可扩展的视频实例分割标注,适用于动物行为追踪等复杂场景。

链接: https://arxiv.org/abs/2601.11301
作者: Gergely Dinya,András Gelencsér,Krisztina Kupán,Clemens Küpper,Kristóf Karacs,Anna Gelencsér-Horváth
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current research workflows for precise video segmentation are often forced into a compromise between labor-intensive manual curation, costly commercial platforms, and/or privacy-compromising cloud-based services. The demand for high-fidelity video instance segmentation in research is often hindered by the bottleneck of manual annotation and the privacy concerns of cloud-based tools. We present SAMannot, an open-source, local framework that integrates the Segment Anything Model 2 (SAM2) into a human-in-the-loop workflow. To address the high resource requirements of foundation models, we modified the SAM2 dependency and implemented a processing layer that minimizes computational overhead and maximizes throughput, ensuring a highly responsive user interface. Key features include persistent instance identity management, an automated ``lock-and-refine’’ workflow with barrier frames, and a mask-skeletonization-based auto-prompting mechanism. SAMannot facilitates the generation of research-ready datasets in YOLO and PNG formats alongside structured interaction logs. Verified through animal behavior tracking use-cases and subsets of the LVOS and DAVIS benchmark datasets, the tool provides a scalable, private, and cost-effective alternative to commercial platforms for complex video annotation tasks.
zh
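摘要提到的"基于掩码骨架化的自动提示"机制,其要点是从上一帧的掩码内部取可靠的点提示再喂给 SAM2。下面用"离背景最远的前景像素"这一距离变换式近似来示意取点思路(纯 NumPy 暴力计算,仅适合小掩码;并非 SAMannot 的官方实现或 SAM2 接口):

```python
import numpy as np

def auto_prompt_point(mask):
    """取掩码内离背景最远的像素作为点提示(骨架中心的简单近似)。"""
    fg = np.argwhere(mask)        # 前景像素坐标
    bg = np.argwhere(~mask)       # 背景像素坐标
    # 每个前景像素到最近背景像素的欧氏距离(小掩码下的暴力计算)
    d = np.linalg.norm(fg[:, None, :] - bg[None, :, :], axis=-1).min(axis=1)
    return tuple(fg[d.argmax()])

mask = np.zeros((7, 7), dtype=bool)
mask[1:6, 1:6] = True             # 5x5 的方形目标
point = auto_prompt_point(mask)   # 应落在目标中心
```

实际工具中会先对掩码做骨架化再沿骨架采样多个点,这里只保留"取掩码深处的点"这一核心直觉。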

[CV-18] Efficient On-Board Processing of Oblique UAV Video for Rapid Flood Extent Mapping

【速读】:该论文旨在解决无人机(UAV)在受限的尺寸、重量和功耗(Size, Weight, and Power, SWaP)条件下,难以实时处理高分辨率斜向航拍视频流的问题。由于这类视频具有宽视场特性,对边缘设备的计算密度要求极高,导致现有硬件无法实现低延迟的语义分割推理。解决方案的关键在于提出Temporal Token Reuse (TTR)框架,其核心思想是利用航拍视频中固有的时空冗余性,将图像块表示为“token”,并通过轻量级相似性度量动态识别静态区域,并复用这些区域预计算的深层特征,从而跳过冗余的主干网络计算,显著降低推理延迟。实验表明,TTR在边缘硬件上实现了30%的延迟减少,且分割精度仅下降0.5% mIoU,有效提升了实时遥感任务中的视频理解能力。

链接: https://arxiv.org/abs/2601.11290
作者: Vishisht Sharma,Sam Leroux,Lisa Landuyt,Nick Witvrouwen,Pieter Simoens
机构: Ghent University - imec(根特大学-imec); VITO (Flemish Institute for Technological Research)(比利时弗拉芒政府技术创新研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Effective disaster response relies on rapid disaster response, where oblique aerial video is the primary modality for initial scouting due to its ability to maximize spatial coverage and situational awareness in limited flight time. However, the on-board processing of high-resolution oblique streams is severely bottlenecked by the strict Size, Weight, and Power (SWaP) constraints of Unmanned Aerial Vehicles (UAVs). The computational density required to process these wide-field-of-view streams precludes low-latency inference on standard edge hardware. To address this, we propose Temporal Token Reuse (TTR), an adaptive inference framework capable of accelerating video segmentation on embedded devices. TTR exploits the intrinsic spatiotemporal redundancy of aerial video by formulating image patches as tokens; it utilizes a lightweight similarity metric to dynamically identify static regions and propagate their precomputed deep features, thereby bypassing redundant backbone computations. We validate the framework on standard benchmarks and a newly curated Oblique Floodwater Dataset designed for hydrological monitoring. Experimental results on edge-grade hardware demonstrate that TTR achieves a 30% reduction in inference latency with negligible degradation in segmentation accuracy (<0.5% mIoU). These findings confirm that TTR effectively shifts the operational Pareto frontier, enabling high-fidelity, real-time oblique video understanding for time-critical remote sensing missions.
zh
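TTR 的核心思想——把图像块视为 token,用轻量相似度判断静态区域并复用缓存的深层特征——可以用下面的草图表达(余弦相似度与阈值 `tau` 均为示意性假设,非官方实现):

```python
import numpy as np

def ttr_step(tokens, prev_tokens, prev_feats, backbone, tau=0.98):
    """逐 patch: 与上一帧余弦相似度 >= tau 时复用缓存特征, 否则重新过主干。"""
    num = (tokens * prev_tokens).sum(-1)
    den = np.linalg.norm(tokens, axis=-1) * np.linalg.norm(prev_tokens, axis=-1) + 1e-8
    reuse = (num / den) >= tau                      # 静态区域掩码
    feats = prev_feats.copy()
    if (~reuse).any():
        feats[~reuse] = backbone(tokens[~reuse])    # 仅对变化的 token 计算
    return feats, reuse

backbone = lambda x: x * 2.0                        # 假想的"昂贵"主干网络
rng = np.random.default_rng(1)
t0 = rng.normal(size=(5, 4))                        # 第 0 帧的 5 个 patch token
f0 = backbone(t0)
t1 = t0.copy()
t1[2] *= -1.0                                       # 仅第 2 个 patch 发生变化
f1, reuse = ttr_step(t1, t0, f0, backbone)
```

这样推理延迟主要由"变化 token"的占比决定,对应摘要中在边缘硬件上约 30% 的延迟下降。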

[CV-19] X-Distill: Cross-Architecture Vision Distillation for Visuomotor Learning

【速读】:该论文旨在解决机器人学习中视觉-运动策略(visuomotor policies)在数据稀缺场景下性能受限的问题。当前主流方法多依赖于大规模预训练的Vision Transformer(ViT),虽具备强大泛化能力,但其高数据需求难以适配多数机器人应用场景;而紧凑的卷积神经网络(CNN)虽易于优化,却缺乏足够的表示能力。解决方案的关键在于提出一种名为X-Distill的离线跨架构知识蒸馏方法:首先将大型冻结的DINOv2教师模型(具有丰富视觉表征能力)的知识蒸馏至轻量级ResNet-18学生模型,使其获得强大的视觉先验;随后,该蒸馏后的编码器与扩散策略头(diffusion policy head)在目标任务上联合微调。这一策略有效结合了ViT的强大表征能力和CNN的高效优化特性,在34个仿真基准和5个真实世界任务中均显著优于从头训练的ResNet或微调后的DINOv2模型,甚至超越使用点云观测或更大视觉-语言模型的方案,验证了简单且基于理论基础的蒸馏策略在数据高效机器人操作中的优越性。

链接: https://arxiv.org/abs/2601.11269
作者: Maanping Shao,Feihong Zhang,Gu Zhang,Baiye Cheng,Zhengrong Xue,Huazhe Xu
机构: Institute for Interdisciplinary Information Sciences, Tsinghua University(清华大学交叉信息研究院); Shanghai Qi Zhi Institute(上海期智研究院); Shanghai Artificial Intelligence Laboratory(上海人工智能实验室); Tsinghua University(清华大学); Huazhong University of Science and Technology(华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Visuomotor policies often leverage large pre-trained Vision Transformers (ViTs) for their powerful generalization capabilities. However, their significant data requirements present a major challenge in the data-scarce context of most robotic learning settings, where compact CNNs with strong inductive biases can be more easily optimized. To address this trade-off, we introduce X-Distill, a simple yet highly effective method that synergizes the strengths of both architectures. Our approach involves an offline, cross-architecture knowledge distillation, transferring the rich visual representations of a large, frozen DINOv2 teacher to a compact ResNet-18 student on the general-purpose ImageNet dataset. This distilled encoder, now endowed with powerful visual priors, is then jointly fine-tuned with a diffusion policy head on the target manipulation tasks. Extensive experiments on 34 simulated benchmarks and 5 challenging real-world tasks demonstrate that our method consistently outperforms policies equipped with from-scratch ResNet or fine-tuned DINOv2 encoders. Notably, X-Distill also surpasses 3D encoders that utilize privileged point cloud observations or much larger Vision-Language Models. Our work highlights the efficacy of a simple, well-founded distillation strategy for achieving state-of-the-art performance in data-efficient robotic manipulation.
zh
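X-Distill 的离线跨架构蒸馏,其目标可概括为让学生特征经投影后逼近冻结教师的特征。下面给出一个带线性投影头的余弦蒸馏损失草图(损失形式是蒸馏中的常见做法,并非论文确认的具体实现;投影矩阵维度为假设):

```python
import numpy as np

def cosine_distill_loss(student_feat, teacher_feat, proj):
    """学生 (N,ds) 经投影 proj (ds,dt) 对齐教师 (N,dt), 最小化 1 - 余弦相似度。"""
    s = student_feat @ proj
    s = s / (np.linalg.norm(s, axis=1, keepdims=True) + 1e-8)
    t = teacher_feat / (np.linalg.norm(teacher_feat, axis=1, keepdims=True) + 1e-8)
    return float((1.0 - (s * t).sum(axis=1)).mean())

rng = np.random.default_rng(0)
stu = rng.normal(size=(4, 8))     # 类 ResNet-18 学生特征 (低维)
tea = rng.normal(size=(4, 16))    # 类 DINOv2 教师特征 (高维, 冻结)
proj = rng.normal(scale=0.1, size=(8, 16))
loss = cosine_distill_loss(stu, tea, proj)

# 完全对齐时损失为 0
perfect = cosine_distill_loss(tea, tea, np.eye(16))
```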

[CV-20] FTDMamba: Frequency-Assisted Temporal Dilation Mamba for Unmanned Aerial Vehicle Video Anomaly Detection

【速读】:该论文针对无人机视频异常检测(UAV Video Anomaly Detection, UAV VAD)中动态背景下的多源运动耦合问题展开研究,旨在解决现有方法在复杂动态场景下易将正常无人机运动误判为异常或难以识别被掩盖的真正异常的问题。其核心挑战在于如何有效分离无人机自身运动与目标物体运动,并建模跨时间尺度的帧间连续性和局部空间相关性。解决方案的关键在于提出频率辅助的时间膨胀Mamba网络(Frequency-Assisted Temporal Dilation Mamba, FTDMamba),包含两个核心模块:一是频率解耦时空相关模块(Frequency Decoupled Spatiotemporal Correlation Module),通过频域分析解耦耦合运动模式并建模全局时空依赖;二是时间膨胀Mamba模块(Temporal Dilation Mamba Module),利用Mamba架构的序列建模能力,在多个时间感受野内联合学习细粒度时序动态与局部空间结构。

链接: https://arxiv.org/abs/2601.11254
作者: Cheng-Zhuang Liu,Si-Bao Chen,Qing-Ling Shu,Chris Ding,Jin Tang,Bin Luo
机构: Anhui University (安徽大学); The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in video anomaly detection (VAD) mainly focus on ground-based surveillance or unmanned aerial vehicle (UAV) videos with static backgrounds, whereas research on UAV videos with dynamic backgrounds remains limited. Unlike static scenarios, dynamically captured UAV videos exhibit multi-source motion coupling, where the motion of objects and UAV-induced global motion are intricately intertwined. Consequently, existing methods may misclassify normal UAV movements as anomalies or fail to capture true anomalies concealed within dynamic backgrounds. Moreover, many approaches do not adequately address the joint modeling of inter-frame continuity and local spatial correlations across diverse temporal scales. To overcome these limitations, we propose the Frequency-Assisted Temporal Dilation Mamba (FTDMamba) network for UAV VAD, including two core components: (1) a Frequency Decoupled Spatiotemporal Correlation Module, which disentangles coupled motion patterns and models global spatiotemporal dependencies through frequency analysis; and (2) a Temporal Dilation Mamba Module, which leverages Mamba’s sequence modeling capability to jointly learn fine-grained temporal dynamics and local spatial structures across multiple temporal receptive fields. Additionally, unlike existing UAV VAD datasets which focus on static backgrounds, we construct a large-scale Moving UAV VAD dataset (MUVAD), comprising 222,736 frames with 240 anomaly events across 12 anomaly types. Extensive experiments demonstrate that FTDMamba achieves state-of-the-art (SOTA) performance on two public static benchmarks and the new MUVAD dataset. The code and MUVAD dataset will be available at: this https URL.
zh
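"通过频域分析解耦无人机平台引入的全局运动与目标自身的局部运动"这一思路,其直觉可以用一维时序信号的 FFT 低/高频分解来示意(截止频率为假设;FTDMamba 实际在时空特征上操作,此处仅演示频域解耦本身):

```python
import numpy as np

def freq_decouple(signal, cutoff):
    """把时序信号分解为低频(类平台全局运动)与高频(类局部目标运动)分量。"""
    spec = np.fft.rfft(signal)
    low_spec = spec.copy()
    low_spec[cutoff:] = 0                 # 仅保留低频 bin
    high_spec = spec - low_spec
    n = len(signal)
    return np.fft.irfft(low_spec, n=n), np.fft.irfft(high_spec, n=n)

t = np.linspace(0, 1, 64, endpoint=False)
slow = np.sin(2 * np.pi * 1 * t)          # 慢速全局漂移
fast = 0.3 * np.sin(2 * np.pi * 12 * t)   # 快速局部运动
low, high = freq_decouple(slow + fast, cutoff=4)
```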

[CV-21] Language-Agnostic Visual Embeddings for Cross-Script Handwriting Retrieval

【速读】:该论文旨在解决手写文字检索(handwritten word retrieval)中的两大挑战:一是手写风格差异大导致的识别困难,二是跨语言语义鸿沟(cross-lingual semantic gaps)带来的检索性能下降。为应对这些问题,作者提出了一种轻量级非对称双编码器框架(lightweight asymmetric dual-encoder framework),其关键在于通过联合优化实例级对齐(instance-level alignment)与类别级语义一致性(class-level semantic consistency),将视觉嵌入锚定到语言无关的语义原型(language-agnostic semantic prototypes),从而实现跨脚本和书写风格的不变性(style-invariant visual embeddings)。该方案在保持高检索准确率的同时,显著降低了模型参数量,支持资源受限环境下的边缘部署。

链接: https://arxiv.org/abs/2601.11248
作者: Fangke Chen,Tianhao Dong,Sirry Chen,Guobin Zhang,Yishu Zhang,Yining Chen
机构: Zhejiang University (浙江大学); Shanghai Innovation Institute; Nanyang Technological University (南洋理工大学); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages,5 figures

点击查看摘要

Abstract:Handwritten word retrieval is vital for digital archives but remains challenging due to large handwriting variability and cross-lingual semantic gaps. While large vision-language models offer potential solutions, their prohibitive computational costs hinder practical edge deployment. To address this, we propose a lightweight asymmetric dual-encoder framework that learns unified, style-invariant visual embeddings. By jointly optimizing instance-level alignment and class-level semantic consistency, our approach anchors visual embeddings to language-agnostic semantic prototypes, enforcing invariance across scripts and writing styles. Experiments show that our method outperforms 28 baselines and achieves state-of-the-art accuracy on within-language retrieval benchmarks. We further conduct explicit cross-lingual retrieval, where the query language differs from the target language, to validate the effectiveness of the learned cross-lingual representations. Achieving strong performance with only a fraction of the parameters required by existing models, our framework enables accurate and resource-efficient cross-script handwriting retrieval.
zh

[CV-22] Image-Text Knowledge Modeling for Unsupervised Multi-Scenario Person Re-Identification

【速读】:该论文旨在解决跨场景行人重识别(Person Re-Identification, ReID)中因不同场景(如分辨率差异、服装变化等)导致的性能下降问题,提出了一种无监督多场景(Unsupervised Multi-Scenario, UMS)ReID任务,以在单一框架下统一建模多种异构场景。其解决方案的关键在于引入图像-文本知识建模(Image-Text Knowledge Modeling, ITKM),该方法基于预训练的CLIP模型,分三阶段优化:首先在图像编码器中加入场景嵌入(scenario embedding)并微调以自适应融合多场景知识;其次通过学习文本嵌入与伪标签对齐,并设计多场景分离损失增强跨场景文本表示的差异性;最后引入层次化匹配模块(簇级和实例级)获取场景内异质正样本对(如可见光与红外图像),并采用动态文本表示更新策略保持图文监督信号的一致性。此架构有效提升了跨场景泛化能力与整体性能。

链接: https://arxiv.org/abs/2601.11243
作者: Zhiqi Pang,Lingling Zhao,Yang Liu,Chunyu Wang,Gaurav Sharma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 10 figures

点击查看摘要

Abstract:We propose unsupervised multi-scenario (UMS) person re-identification (ReID) as a new task that expands ReID across diverse scenarios (cross-resolution, clothing change, etc.) within a single coherent framework. To tackle UMS-ReID, we introduce image-text knowledge modeling (ITKM) – a three-stage framework that effectively exploits the representational power of vision-language models. We start with a pre-trained CLIP model with an image encoder and a text encoder. In Stage I, we introduce a scenario embedding in the image encoder and fine-tune the encoder to adaptively leverage knowledge from multiple scenarios. In Stage II, we optimize a set of learned text embeddings to associate with pseudo-labels from Stage I and introduce a multi-scenario separation loss to increase the divergence between inter-scenario text representations. In Stage III, we first introduce cluster-level and instance-level heterogeneous matching modules to obtain reliable heterogeneous positive pairs (e.g., a visible image and an infrared image of the same person) within each scenario. Next, we propose a dynamic text representation update strategy to maintain consistency between text and image supervision signals. Experimental results across multiple scenarios demonstrate the superiority and generalizability of ITKM; it not only outperforms existing scenario-specific methods but also enhances overall performance by integrating knowledge from multiple scenarios.
zh

[CV-23] Bio-inspired fine-tuning for selective transfer learning in image classification

【速读】:该论文旨在解决迁移学习中因源域与目标域分布差异导致模型性能下降的问题,特别是在标注数据有限的情况下。其解决方案的关键在于提出一种基于进化优化的自适应微调技术(BioTune),通过智能选择冻结层并动态调整未冻结层的学习率,从而实现对不同数据特征和分布变化的高效适配。实验表明,BioTune在多个图像分类数据集上均优于当前主流微调方法(如AutoRGN和LoRA),且在四种不同CNN架构中保持稳定优异表现,验证了其鲁棒性与通用性。

链接: https://arxiv.org/abs/2601.11235
作者: Ana Davila,Jacinto Colan,Yasuhisa Hasegawa
机构: Nagoya University (名古屋大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning has significantly advanced image analysis across diverse domains but often depends on large, annotated datasets for success. Transfer learning addresses this challenge by utilizing pre-trained models to tackle new tasks with limited labeled data. However, discrepancies between source and target domains can hinder effective transfer learning. We introduce BioTune, a novel adaptive fine-tuning technique utilizing evolutionary optimization. BioTune enhances transfer learning by optimally choosing which layers to freeze and adjusting learning rates for unfrozen layers. Through extensive evaluation on nine image classification datasets, spanning natural and specialized domains such as medical imaging, BioTune demonstrates superior accuracy and efficiency over state-of-the-art fine-tuning methods, including AutoRGN and LoRA, highlighting its adaptability to various data characteristics and distribution changes. Additionally, BioTune consistently achieves top performance across four different CNN architectures, underscoring its flexibility. Ablation studies provide valuable insights into the impact of BioTune’s key components on overall performance. The source code is available at this https URL.
zh
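BioTune 用进化优化选择冻结哪些层并调整未冻结层的学习率。下面用一个玩具适应度函数演示最简单的 (1+1) 进化策略搜索二进制冻结掩码(适应度函数、变异率与迭代数均为示意性假设,远比论文中的算法简化):

```python
import numpy as np

def evolve_freeze_mask(fitness, n_layers, steps=300, p_flip=0.2, seed=0):
    """(1+1)-ES: 随机翻转冻结掩码位, 仅当适应度不下降时接受候选解。"""
    rng = np.random.default_rng(seed)
    mask = rng.integers(0, 2, n_layers)      # 1 = 冻结该层
    best = fitness(mask)
    for _ in range(steps):
        cand = mask.copy()
        flips = rng.random(n_layers) < p_flip
        cand[flips] ^= 1
        f = fitness(cand)
        if f >= best:
            mask, best = cand, f
    return mask, best

# 玩具适应度: 假定理想方案是冻结前半部分、微调后半部分
target = np.array([1, 1, 1, 1, 0, 0, 0, 0])
fitness = lambda m: -int(np.abs(m - target).sum())
mask, best = evolve_freeze_mask(fitness, 8)
```

真实场景中适应度是验证集精度,每次评估都要一轮微调,因此进化搜索的样本效率是关键设计点。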

[CV-24] VidLeaks: Membership Inference Attacks Against Text-to-Video Models

【速读】:该论文旨在解决生成式视频模型(Text-to-Video, T2V)中存在的隐私泄露问题,特别是针对训练数据成员身份的成员推理攻击(Membership Inference Attacks, MIAs)在视频场景下的适用性不足。现有MIAs主要面向静态数据(如图像或文本),无法有效捕捉视频特有的时空复杂性,尤其忽略了关键帧中记忆信号的稀疏性以及随机时间动态引入的不稳定性。解决方案的关键在于提出首个系统性研究框架VidLeaks,其通过两种互补的信号探测稀疏-时序记忆:一是空间重建保真度(Spatial Reconstruction Fidelity, SRF),利用Top-K相似性增强来自稀疏记忆关键帧的空间记忆信号;二是时序生成稳定性(Temporal Generative Stability, TGS),通过多次查询间的语义一致性来捕获时间维度上的信息泄露。实验表明,即使在严格的仅查询黑盒设置下,VidLeaks仍能实现高达97.01%的AUC值,揭示了T2V模型存在显著的成员信息泄露风险,为视频生成系统的审计和防御机制开发提供了基础。

链接: https://arxiv.org/abs/2601.11210
作者: Li Wang,Wenyu Chen,Ning Yu,Zheng Li,Shanqing Guo
机构: Shandong University (山东大学); Eyeline Labs
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The proliferation of powerful Text-to-Video (T2V) models, trained on massive web-scale datasets, raises urgent concerns about copyright and privacy violations. Membership inference attacks (MIAs) provide a principled tool for auditing such risks, yet existing techniques - designed for static data like images or text - fail to capture the spatio-temporal complexities of video generation. In particular, they overlook the sparsity of memorization signals in keyframes and the instability introduced by stochastic temporal dynamics. In this paper, we conduct the first systematic study of MIAs against T2V models and introduce a novel framework VidLeaks, which probes sparse-temporal memorization through two complementary signals: 1) Spatial Reconstruction Fidelity (SRF), using a Top-K similarity to amplify spatial memorization signals from sparsely memorized keyframes, and 2) Temporal Generative Stability (TGS), which measures semantic consistency across multiple queries to capture temporal leakage. We evaluate VidLeaks under three progressively restrictive black-box settings - supervised, reference-based, and query-only. Experiments on three representative T2V models reveal severe vulnerabilities: VidLeaks achieves AUC of 82.92% on AnimateDiff and 97.01% on InstructVideo even in the strict query-only setting, posing a realistic and exploitable privacy risk. Our work provides the first concrete evidence that T2V models leak substantial membership information through both sparse and temporal memorization, establishing a foundation for auditing video generation systems and motivating the development of new defenses. Code is available at: this https URL.
zh
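VidLeaks 的两个信号可以抽象为:SRF 用 Top-K 相似度放大稀疏关键帧上的记忆信号,TGS 用多次查询输出之间的两两一致性。下面是对帧级相似度分数的简化计算(相似度的来源、K 值与嵌入形式均为假设,仅示意两种统计量本身):

```python
import numpy as np

def srf_score(frame_sims, k=3):
    """空间重建保真度: 生成帧与目标帧相似度的 Top-K 均值, 突出少数被记忆的关键帧。"""
    top_k = np.sort(frame_sims)[-k:]
    return float(top_k.mean())

def tgs_score(query_embeds):
    """时序生成稳定性: 多次查询输出嵌入的两两余弦一致性均值。"""
    e = query_embeds / np.linalg.norm(query_embeds, axis=1, keepdims=True)
    sims = e @ e.T
    n = len(e)
    return float((sims.sum() - n) / (n * (n - 1)))   # 去掉对角线后取均值

member_sims = np.array([0.2, 0.3, 0.95, 0.9, 0.92, 0.25])  # 少数关键帧高度相似
non_member_sims = np.array([0.3, 0.35, 0.4, 0.3, 0.25, 0.33])
```

成员样本的 SRF 分数应高于非成员,多次查询完全一致时 TGS 取最大值 1。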

[CV-25] ATATA: One Algorithm to Align Them All

【速读】:该论文旨在解决多模态样本对在结构上保持对齐的联合生成问题,尤其针对现有方法如Score Distillation Sampling(SDS)存在计算效率低、模式崩溃及视觉质量差等缺陷。其核心解决方案是基于Rectified Flow模型,通过在样本空间中联合传输一个片段(joint transport of a segment),实现高效且结构对齐的样本对生成,区别于已有方法中相互依赖式(codependent)的生成过程。该方法可构建于任意在结构化潜在空间运行的Rectified Flow模型之上,显著提升推理速度并保持高质量的视觉表现,在图像、视频和3D形状生成任务中均优于当前主流编辑或联合推理方法。

链接: https://arxiv.org/abs/2601.11194
作者: Boyi Pang,Savva Ignatyev,Vladimir Ippolitov,Ramil Khafizov,Yurii Melnik,Oleg Voynov,Maksim Nakhodnov,Aibek Alanov,Xiaopeng Fan,Peter Wonka,Evgeny Burnaev
机构: Harbin Institute of Technology(哈尔滨工业大学); Appiled AI Institute; Fusion Brain; MSU; HSE University(高等经济大学); The Peng Cheng Laboratory(鹏城实验室); HIT Suzhou Research Institute(哈尔滨工业大学苏州研究院); KAUST(沙特阿卜杜拉国王科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We suggest a new multi-modal algorithm for joint inference of paired structurally aligned samples with Rectified Flow models. While some existing methods propose a codependent generation process, they do not view the problem of joint generation from a structural alignment perspective. Recent work uses Score Distillation Sampling to generate aligned 3D models, but SDS is known to be time-consuming, prone to mode collapse, and often provides cartoonish results. By contrast, our suggested approach relies on the joint transport of a segment in the sample space, yielding faster computation at inference time. Our approach can be built on top of an arbitrary Rectified Flow model operating on the structured latent space. We show the applicability of our method to the domains of image, video, and 3D shape generation using state-of-the-art baselines and evaluate it against both editing-based and joint inference-based competing approaches. We demonstrate a high degree of structural alignment for the sample pairs obtained with our method and a high visual quality of the samples. Our method improves the state-of-the-art for image and video generation pipelines. For 3D generation, it is able to show comparable quality while working orders of magnitude faster.
zh

[CV-26] Democratizing planetary-scale analysis: An ultra-lightweight Earth embedding database for accurate and flexible global land monitoring

【速读】:该论文旨在解决全球尺度地球观测(Earth Observation, EO)数据在存储与计算资源上的巨大压力,从而阻碍行星尺度研究的问题。其核心解决方案是提出一种名为Embedded Seamless Data (ESD) 的超轻量级全球地球嵌入数据库,通过将Landsat系列(5、7、8、9)和MODIS Terra多源遥感数据转换为信息密集的量化潜在向量(quantized latent vectors),在统一的潜在空间中压缩并保留关键地物物理特征与语义信息。关键技术在于采用ESDNet架构与有限标量量化(Finite Scalar Quantization, FSQ)方法,实现约340倍的数据体积压缩,使单年全球陆表数据可封装于约2.4 TB内,且保持高重建保真度(MAE: 0.0130; RMSE: 0.0179; CC: 0.8543),显著提升本地工作站上的十年尺度分析可行性,并在土地覆盖分类任务中优于原始反射率数据(准确率79.74% vs. 76.92%)。

链接: https://arxiv.org/abs/2601.11183
作者: Shuang Chen,Jie Wang,Shuai Yuan,Jiayang Li,Yu Xia,Yuanhong Liao,Junbo Wei,Jincheng Yuan,Xiaoqing Xu,Xiaolin Zhu,Peng Zhu,Hongsheng Zhang,Yuyu Zhou,Haohuan Fu,Huabing Huang,Bin Chen,Fan Dai,Peng Gong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid evolution of satellite-borne Earth Observation (EO) systems has revolutionized terrestrial monitoring, yielding petabyte-scale archives. However, the immense computational and storage requirements for global-scale analysis often preclude widespread use, hindering planetary-scale studies. To address these barriers, we present Embedded Seamless Data (ESD), an ultra-lightweight, 30-m global Earth embedding database spanning the 25-year period from 2000 to 2024. By transforming high-dimensional, multi-sensor observations from the Landsat series (5, 7, 8, and 9) and MODIS Terra into information-dense, quantized latent vectors, ESD distills essential geophysical and semantic features into a unified latent space. Utilizing the ESDNet architecture and Finite Scalar Quantization (FSQ), the dataset achieves a transformative ~340-fold reduction in data volume compared to raw archives. This compression allows the entire global land surface for a single year to be encapsulated within approximately 2.4 TB, enabling decadal-scale global analysis on standard local workstations. Rigorous validation demonstrates high reconstructive fidelity (MAE: 0.0130; RMSE: 0.0179; CC: 0.8543). By condensing the annual phenological cycle into 12 temporal steps, the embeddings provide inherent denoising and a semantically organized space that outperforms raw reflectance in land-cover classification, achieving 79.74% accuracy (vs. 76.92% for raw fusion). With robust few-shot learning capabilities and longitudinal consistency, ESD provides a versatile foundation for democratizing planetary-scale research and advancing next-generation geospatial artificial intelligence.
zh
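ESD 采用的有限标量量化(FSQ)核心非常简单:把每个潜变量维度经有界函数压缩后,独立地取整到固定的级数,从而得到紧凑的整数码本索引。下面是常见 FSQ 公式的最小草图(级数 `levels` 为示意性假设,ESDNet 的实际配置以论文为准):

```python
import numpy as np

def fsq(z, levels=5):
    """有限标量量化: 每维经 tanh 压界后取整到 levels 个级。

    返回 (反量化值 ∈ [-1,1], 非负整数索引 ∈ {0,...,levels-1})。
    """
    half = (levels - 1) / 2
    bounded = np.tanh(z) * half
    codes = np.round(bounded).astype(int)     # ∈ {-half, ..., half}
    return codes / half, codes + int(half)

z = np.array([[-3.0, -0.2, 0.0, 0.4, 3.0]])   # 假想的一行潜变量
zq, idx = fsq(z, levels=5)
```

整数索引可以按位打包存储,这正是此类嵌入数据库能把全球年度数据压到约 2.4 TB 量级的原因之一。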

[CV-27] SoLA-Vision: Fine-grained Layer-wise Linear Softmax Hybrid Attention

【速读】:该论文旨在解决标准Softmax自注意力机制在视觉任务中因计算复杂度为O(N²)而难以部署于高分辨率场景的问题,同时克服线性注意力虽将复杂度降至O(N)但因状态表示压缩导致建模能力下降和精度损失的局限。其解决方案的关键在于从层堆叠(layer-stacking)视角对线性和Softmax注意力进行系统性分析,并提出细粒度的逐层混合策略——即SoLA-Vision(Softmax-Linear Attention Vision),通过战略性地插入少量全局Softmax层,在保持较低计算成本的同时显著提升模型表达能力和下游任务性能。

链接: https://arxiv.org/abs/2601.11164
作者: Ruibang Li,Guan Luo,Yiwei Zhang,Jin Gao,Bing Li,Weiming Hu
机构: State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS); School of Artificial Intelligence, University of Chinese Academy of Sciences; Beijing Key Laboratory of Super Intelligent Security of Multi-Modal Information; School of Information Science and Technology, ShanghaiTech University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint

点击查看摘要

Abstract:Standard softmax self-attention excels in vision tasks but incurs quadratic complexity O(N^2), limiting high-resolution deployment. Linear attention reduces the cost to O(N), yet its compressed state representations can impair modeling capacity and accuracy. We present an analytical study that contrasts linear and softmax attention for visual representation learning from a layer-stacking perspective. We further conduct systematic experiments on layer-wise hybridization patterns of linear and softmax attention. Our results show that, compared with rigid intra-block hybrid designs, fine-grained layer-wise hybridization can match or surpass performance while requiring fewer softmax layers. Building on these findings, we propose SoLA-Vision (Softmax-Linear Attention Vision), a flexible layer-wise hybrid attention backbone that enables fine-grained control over how linear and softmax attention are integrated. By strategically inserting a small number of global softmax layers, SoLA-Vision achieves a strong trade-off between accuracy and computational cost. On ImageNet-1K, SoLA-Vision outperforms purely linear and other hybrid attention models. On dense prediction tasks, it consistently surpasses strong baselines by a considerable margin. Code will be released.
zh
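线性注意力与 Softmax 注意力的复杂度差异来自计算顺序:前者先算 φ(K)ᵀV 得到与 N 无关的 (d,d) 状态,整体 O(N·d²);后者必须显式构造 (N,N) 注意力矩阵。下面的草图对比两者(核特征取 elu(x)+1,这是线性注意力文献中的常见选择,并非 SoLA-Vision 确认的实现):

```python
import numpy as np

def softmax_attention(Q, K, V):
    """O(N^2) 的标准 Softmax 注意力。"""
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def linear_attention(Q, K, V, eps=1e-6):
    """O(N) 的线性注意力: phi(Q) (phi(K)^T V), 核特征取 elu(x)+1。"""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))   # elu(x)+1, 保证非负
    q, k = phi(Q), phi(K)
    kv = k.T @ V                         # (d,d) 压缩状态, 与序列长度无关
    z = q @ k.sum(axis=0)                # 归一化项
    return (q @ kv) / (z[:, None] + eps)

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
out_s = softmax_attention(Q, K, V)
out_l = linear_attention(Q, K, V)
```

论文的逐层混合策略正是在这两种算子之间按层选择,用少量 Softmax 层补足线性层被压缩掉的建模能力。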

[CV-28] GMM-COMET: Continual Source-Free Universal Domain Adaptation via a Mean Teacher and Gaussian Mixture Model-Based Pseudo-Labeling

【速读】:该论文致力于解决源-free通用域适应(source-free universal domain adaptation, SF-UniDA)场景下模型在连续流式多目标域上的持续适应问题,即当目标域分布随时间变化且源数据不可用时,如何使模型在不依赖源域数据的前提下实现稳定、鲁棒的性能提升。其解决方案的关键在于:将基于高斯混合模型(Gaussian Mixture Model, GMM)的伪标签生成机制嵌入到均值教师(Mean Teacher)框架中,以增强长期适应序列中的稳定性,并引入一致性损失(consistency loss)进一步提升模型对不同域分布的鲁棒性,从而构建出首个针对持续SF-UniDA任务的有效基线方法GMM-COMET。

链接: https://arxiv.org/abs/2601.11161
作者: Pascal Schlachter,Bin Yang
机构: University of Stuttgart (斯图加特大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unsupervised domain adaptation tackles the problem that domain shifts between training and test data impair the performance of neural networks in many real-world applications. Thereby, in realistic scenarios, the source data may no longer be available during adaptation, and the label space of the target domain may differ from the source label space. This setting, known as source-free universal domain adaptation (SF-UniDA), has recently gained attention, but all existing approaches only assume a single domain shift from source to target. In this work, we present the first study on continual SF-UniDA, where the model must adapt sequentially to a stream of multiple different unlabeled target domains. Building upon our previous methods for online SF-UniDA, we combine their key ideas by integrating Gaussian mixture model-based pseudo-labeling within a mean teacher framework for improved stability over long adaptation sequences. Additionally, we introduce consistency losses for further robustness. The resulting method GMM-COMET provides a strong first baseline for continual SF-UniDA and is the only approach in our experiments to consistently improve upon the source-only model across all evaluated scenarios. Our code is available at this https URL.
zh
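下面用一个简化示意说明“均值教师EMA更新 + 基于GMM的伪标签”这两个核心组件的思想(仅为概念性草图,并非论文官方实现;函数名、参数及玩具数据均为假设,用NumPy模拟而非完整神经网络):

```python
import numpy as np

def ema_update(teacher, student, momentum=0.999):
    """Mean-teacher update: exponential moving average of student weights."""
    return {k: momentum * teacher[k] + (1.0 - momentum) * student[k] for k in teacher}

def gmm_pseudo_labels(features, means, covs, priors):
    """Assign pseudo-labels by the most likely per-class Gaussian component."""
    n_classes = means.shape[0]
    log_post = np.zeros((features.shape[0], n_classes))
    for c in range(n_classes):
        diff = features - means[c]
        inv = np.linalg.inv(covs[c])
        maha = np.einsum("nd,dk,nk->n", diff, inv, diff)  # squared Mahalanobis distance
        log_det = np.linalg.slogdet(covs[c])[1]
        log_post[:, c] = np.log(priors[c]) - 0.5 * (maha + log_det)
    return log_post.argmax(axis=1)

# Toy demo: two well-separated classes in a 2-D feature space.
rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [5.0, 5.0]])
covs = np.stack([np.eye(2), np.eye(2)])
priors = np.array([0.5, 0.5])
feats = np.vstack([rng.normal(0.0, 0.3, (10, 2)), rng.normal(5.0, 0.3, (10, 2))])
labels = gmm_pseudo_labels(feats, means, covs, priors)

teacher = ema_update({"w": np.ones(3)}, {"w": np.zeros(3)}, momentum=0.9)
```

实际方法中,伪标签由教师网络特征上拟合的GMM产生并监督学生网络,教师再按EMA缓慢跟随学生,从而在长适应序列中抑制误差累积。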

[CV-29] Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning

【速读】:该论文旨在解决视觉逆图形学(Vision-as-Inverse-Graphics)任务中,现有视觉语言模型(VLMs)在单次推理下难以实现精细空间与物理接地的问题,从而无法有效重建或编辑可编辑的图形程序。其解决方案的关键在于提出一个闭环的“写-执行-渲染-比较-修正”迭代机制,即 VIGA(Vision-as-Inverse-Graphic Agent),通过交织的多模态推理实现长程任务规划;该方法结合技能库(交替生成器与验证器角色)和演化上下文记忆(包含计划、代码差异及渲染历史),支持任务无关(task-agnostic)与模型无关(model-agnostic)的场景重建、编辑与物理交互等复杂任务,显著提升性能,在BlenderGym和SlideBench等基准上分别取得35.32%和117.17%的改进,并在新提出的BlenderBench基准中实现124.70%的提升。

链接: https://arxiv.org/abs/2601.11109
作者: Shaofeng Yin,Jiaxin Ge,Zora Zhiruo Wang,Xiuyu Li,Michael J. Black,Trevor Darrell,Angjoo Kanazawa,Haiwen Feng
机构: University of California, Berkeley (加州大学伯克利分校); Carnegie Mellon University (卡内基梅隆大学); Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所); Impossible Inc.
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Vision-as-inverse-graphics, the concept of reconstructing an image as an editable graphics program, is a long-standing goal of computer vision. Yet even strong VLMs aren’t able to achieve this in one shot as they lack fine-grained spatial and physical grounding capability. Our key insight is that closing this gap requires interleaved multimodal reasoning through iterative execution and verification. Stemming from this, we present VIGA (Vision-as-Inverse-Graphic Agent) that starts from an empty world and reconstructs or edits scenes through a closed-loop write-run-render-compare-revise procedure. To support long-horizon reasoning, VIGA combines (i) a skill library that alternates generator and verifier roles and (ii) an evolving context memory that contains plans, code diffs, and render history. VIGA is task-agnostic as it doesn’t require auxiliary modules, covering a wide range of tasks such as 3D reconstruction, multi-step scene editing, 4D physical interaction, and 2D document editing, etc. Empirically, we found VIGA substantially improves one-shot baselines on BlenderGym (35.32%) and SlideBench (117.17%). Moreover, VIGA is also model-agnostic as it doesn’t require finetuning, enabling a unified protocol to evaluate heterogeneous foundation VLMs. To better support this protocol, we introduce BlenderBench, a challenging benchmark that stress-tests interleaved multimodal reasoning with a graphics engine, where VIGA improves by 124.70%.
zh
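摘要中的“写-执行-渲染-比较-修正”闭环可以抽象为如下通用迭代框架(示意性草图:run、score、revise 等可调用对象分别代表代码生成器、图形引擎加渲染、视觉比较器与修正器,全部为假设的占位,并非论文实现):

```python
def closed_loop_refine(program, run, score, revise, target, max_iters=5, tol=0.95):
    """Generic write-run-render-compare-revise loop: execute the current
    program, score its render against the target, and revise until the
    score passes the tolerance or the iteration budget is exhausted."""
    history = []
    for _ in range(max_iters):
        render = run(program)                      # "run + render" step
        s = score(render, target)                  # "compare" step
        history.append((program, s))
        if s >= tol:
            break
        program = revise(program, render, target)  # "revise" step
    return program, history

# Toy instantiation: a "program" is a number, the target render is 10.0,
# and each revision moves halfway toward the target.
final, hist = closed_loop_refine(
    0.0,
    run=lambda p: p,
    score=lambda r, t: 1.0 - abs(r - t) / 10.0,
    revise=lambda p, r, t: p + (t - r) * 0.5,
    target=10.0,
)
```

VIGA 额外维护技能库与上下文记忆(计划、代码diff、渲染历史),即对应上面 `history` 的富化版本。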

[CV-30] Graph Smoothing for Enhanced Local Geometry Learning in Point Cloud Analysis AAAI2026

【速读】:该论文旨在解决图结构方法在3D点云分析中因边界点稀疏连接和交汇区域噪声连接而导致的次优图结构问题(suboptimal graph structures)。其解决方案的关键在于提出一种融合图平滑模块(graph smoothing module)与增强局部几何学习模块(enhanced local geometry learning module)的新方法:首先通过图平滑模块优化图结构,以减少不可靠连接的负面影响;随后基于优化后的图结构,利用基于特征向量的自适应几何描述符提取形状特征,并通过柱坐标变换获取分布特征,从而提升局部几何信息的表达能力。

链接: https://arxiv.org/abs/2601.11102
作者: Shangbo Yuan,Jie Xu,Ping Hu,Xiaofeng Zhu,Na Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Graph-based methods have proven to be effective in capturing relationships among points for 3D point cloud analysis. However, these methods often suffer from suboptimal graph structures, particularly due to sparse connections at boundary points and noisy connections in junction areas. To address these challenges, we propose a novel method that integrates a graph smoothing module with an enhanced local geometry learning module. Specifically, we identify the limitations of conventional graph structures, particularly in handling boundary points and junction areas. In response, we introduce a graph smoothing module designed to optimize the graph structure and minimize the negative impact of unreliable sparse and noisy connections. Based on the optimized graph structure, we improve the feature extract function with local geometry information. These include shape features derived from adaptive geometric descriptors based on eigenvectors and distribution features obtained through cylindrical coordinate transformation. Experimental results on real-world datasets validate the effectiveness of our method in various point cloud learning tasks, i.e., classification, part segmentation, and semantic segmentation.
zh
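摘要提到通过“柱坐标变换”获取局部分布特征,这一步可用如下最小示意说明:将邻域点平移到中心点后转换为 (rho, phi, z) 柱坐标(仅为概念演示,坐标轴选取与论文实际的特征提取函数无关):

```python
import numpy as np

def cylindrical_features(neighbors, center):
    """Express a point's neighbors in local cylindrical coordinates
    (rho, phi, z) about the center point."""
    local = neighbors - center                  # translate to the center point
    rho = np.hypot(local[:, 0], local[:, 1])    # radial distance in the xy-plane
    phi = np.arctan2(local[:, 1], local[:, 0])  # azimuth angle
    z = local[:, 2]                             # height along the axis
    return np.stack([rho, phi, z], axis=1)

nbrs = np.array([[1.0, 0.0, 0.5],
                 [0.0, 2.0, -1.0]])
feat = cylindrical_features(nbrs, np.zeros(3))
```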

[CV-31] CoDance: An Unbind-Rebind Paradigm for Robust Multi-Subject Animation

【速读】:该论文旨在解决多主体角色图像动画中面临的三大挑战:难以处理任意数量和类型的主体、主体间空间布局不一致(如参考图与驱动姿态存在错位),以及无法精准将动作重新绑定到指定主体。现有方法因过度依赖严格的像素级空间绑定而受限,导致在复杂场景下泛化能力差。其解决方案的关键在于提出一个Unbind-Rebind框架:首先通过引入随机扰动的姿势偏移编码器(pose shift encoder)打破刚性空间绑定,使模型学习与位置无关的动作表示(location-agnostic motion representation);随后利用文本提示的语义引导和主体掩码的空间引导,在重绑定模块中精确控制动作流向目标角色,从而实现对任意主体数量、类型及空间配置的灵活动画生成。

链接: https://arxiv.org/abs/2601.11096
作者: Shuai Tan,Biao Gong,Ke Ma,Yutong Feng,Qiyuan Zhang,Yan Wang,Yujun Shen,Hengshuang Zhao
机构: The University of Hong Kong (香港大学); Ant Group (蚂蚁集团); Huazhong University of Science and Technology (华中科技大学); Tsinghua University (清华大学); University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:Character image animation is gaining significant importance across various domains, driven by the demand for robust and flexible multi-subject rendering. While existing methods excel in single-person animation, they struggle to handle arbitrary subject counts, diverse character types, and spatial misalignment between the reference image and the driving poses. We attribute these limitations to an overly rigid spatial binding that forces strict pixel-wise alignment between the pose and reference, and an inability to consistently rebind motion to intended subjects. To address these challenges, we propose CoDance, a novel Unbind-Rebind framework that enables the animation of arbitrary subject counts, types, and spatial configurations conditioned on a single, potentially misaligned pose sequence. Specifically, the Unbind module employs a novel pose shift encoder to break the rigid spatial binding between the pose and the reference by introducing stochastic perturbations to both poses and their latent features, thereby compelling the model to learn a location-agnostic motion representation. To ensure precise control and subject association, we then devise a Rebind module, leveraging semantic guidance from text prompts and spatial guidance from subject masks to direct the learned motion to intended characters. Furthermore, to facilitate comprehensive evaluation, we introduce a new multi-subject CoDanceBench. Extensive experiments on CoDanceBench and existing datasets show that CoDance achieves SOTA performance, exhibiting remarkable generalization across diverse subjects and spatial layouts. The code and weights will be open-sourced.
zh

[CV-32] PhysRVG: Physics-Aware Unified Reinforcement Learning for Video Generative Models

【速读】:该论文旨在解决当前基于Transformer的视频生成模型在模拟刚体运动时缺乏物理原理约束的问题,尤其是物体碰撞等经典力学行为难以真实还原的局限性。现有预训练-微调范式在像素级全局去噪过程中忽略了物体刚性这一关键物理属性,导致即使数学上正确的物理约束也被视为次优条件而非强制规则,从而限制了生成视频的物理真实性。解决方案的关键在于提出一种物理感知的强化学习范式,首次在高维空间中直接施加物理碰撞规则,确保物理知识被严格执行而非作为软约束;进一步将其扩展为统一框架Mimicry-Discovery Cycle (MDcycle),支持高效微调的同时保持对物理反馈信号的充分利用,从而显著提升生成视频的物理合理性。

链接: https://arxiv.org/abs/2601.11087
作者: Qiyuan Zhang,Biao Gong,Shuai Tan,Zheng Zhang,Yujun Shen,Xing Zhu,Yuyuan Li,Kelu Yao,Chunhua Shen,Changqing Zou
机构: Zhejiang University (浙江大学); Ant Group (蚂蚁集团); Zhejiang Lab (浙江省实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Physical principles are fundamental to realistic visual simulation, but remain a significant oversight in transformer-based video generation. This gap highlights a critical limitation in rendering rigid body motion, a core tenet of classical mechanics. While computer graphics and physics-based simulators can easily model such collisions using Newton's formulas, modern pretrain-finetune paradigms discard the concept of object rigidity during pixel-level global denoising. Even perfectly correct mathematical constraints are treated as suboptimal solutions (i.e., conditions) during model optimization in post-training, fundamentally limiting the physical realism of generated videos. Motivated by these considerations, we introduce, for the first time, a physics-aware reinforcement learning paradigm for video generation models that enforces physical collision rules directly in high-dimensional spaces, ensuring the physics knowledge is strictly applied rather than treated as conditions. Subsequently, we extend this paradigm to a unified framework, termed Mimicry-Discovery Cycle (MDcycle), which allows substantial fine-tuning while fully preserving the model’s ability to leverage physics-grounded feedback. To validate our approach, we construct a new benchmark, PhysRVGBench, and perform extensive qualitative and quantitative experiments to thoroughly assess its effectiveness.
zh
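作为摘要中“物理模拟器可用牛顿公式轻松建模刚体碰撞”的一个具体例子,下面给出一维弹性碰撞的经典解(由动量与动能守恒推出,属标准物理公式;此处仅说明被强制执行的物理规则本身,并非论文的训练代码):

```python
def elastic_collision_1d(m1, v1, m2, v2):
    """Post-collision velocities of a 1-D elastic collision, derived from
    conservation of momentum and kinetic energy."""
    v1p = ((m1 - m2) * v1 + 2.0 * m2 * v2) / (m1 + m2)
    v2p = ((m2 - m1) * v2 + 2.0 * m1 * v1) / (m1 + m2)
    return v1p, v2p

# Equal masses simply swap velocities.
v1p, v2p = elastic_collision_1d(1.0, 2.0, 1.0, 0.0)
```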

[CV-33] H-AIM: Orchestrating LLM s PDDL and Behavior Trees for Hierarchical Multi-Robot Planning

【速读】:该论文旨在解决具身人工智能中异构机器人团队从高层指令执行长时程任务的挑战,尤其针对大语言模型(LLM)在长期推理和动态多机器人协同方面的局限性。其解决方案的关键在于提出一种三级级联式框架H-AIM(Hierarchical Autonomous Intelligent Multi-Robot Planning),通过三个核心阶段实现:首先利用LLM将自然语言指令转化为形式化的规划问题(PDDL);其次融合LLM语义推理与经典规划器的搜索能力生成优化动作序列;最后将计划编译为行为树(Behavior Tree)以支持反应式控制。此外,系统通过共享黑板机制实现异构机器人团队的状态同步与通信,从而提升整体协作效率。

链接: https://arxiv.org/abs/2601.11063
作者: Haishan Zeng,Peng Li
机构: University of Chinese Academy of Sciences (中国科学院大学); Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:In embodied artificial intelligence, enabling heterogeneous robot teams to execute long-horizon tasks from high-level instructions remains a critical challenge. While large language models (LLMs) show promise in instruction parsing and preliminary planning, they exhibit limitations in long-term reasoning and dynamic multi-robot coordination. We propose Hierarchical Autonomous Intelligent Multi-Robot Planning (H-AIM), a novel embodied multi-robot task planning framework that addresses these issues through a three-stage cascaded architecture: 1) It leverages an LLM to parse instructions and generate Planning Domain Definition Language (PDDL) problem descriptions, thereby transforming commands into formal planning problems; 2) It combines the semantic reasoning of LLMs with the search capabilities of a classical planner to produce optimized action sequences; 3) It compiles the resulting plan into behavior trees for reactive control. The framework supports dynamically sized heterogeneous robot teams via a shared blackboard mechanism for communication and state synchronization. To validate our approach, we introduce the MACE-THOR benchmark dataset, comprising 42 complex tasks across 8 distinct household layouts. Experimental results demonstrate that H-AIM achieves a remarkable performance improvement, elevating the task success rate from 12% to 55% and boosting the goal condition recall from 32% to 72% against the strongest baseline, LaMMA-P.
zh
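H-AIM 第一阶段由 LLM 把自然语言指令转写为 PDDL 问题描述,该工件的形态大致如下(极简示意:其中的域名、对象与谓词均为本文虚构的家居场景示例,并非论文数据集中的真实定义):

```python
def make_pddl_problem(name, domain, objects, init, goal):
    """Render a PDDL problem file from parsed instruction components:
    typed objects, initial-state predicates, and goal predicates."""
    objs = " ".join(f"{o} - {t}" for o, t in objects)
    init_s = " ".join(f"({p})" for p in init)
    goal_s = " ".join(f"({p})" for p in goal)
    return (
        f"(define (problem {name}) (:domain {domain})\n"
        f"  (:objects {objs})\n"
        f"  (:init {init_s})\n"
        f"  (:goal (and {goal_s})))"
    )

prob = make_pddl_problem(
    "fetch-cup", "household",
    [("robot1", "robot"), ("cup1", "object"), ("table1", "location")],
    ["at robot1 table1", "on cup1 table1"],
    ["holding robot1 cup1"],
)
```

第二阶段将这样的问题文件交给经典规划器求解,第三阶段再把动作序列编译为行为树执行。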

[CV-34] M3DDM: An improved video outpainting by a modified masking strategy

【速读】:该论文旨在解决M3DDM在视频外画(video outpainting)任务中因训练与推理阶段掩码策略不一致而导致的视觉质量下降问题,尤其是在相机运动受限或外画区域较大等信息稀疏场景下,表现为空间模糊和时序不一致性。其解决方案的关键在于修正训练阶段的掩码策略:将原模型中跨帧随机方向和宽度的掩码方式改为统一方向和宽度的掩码策略,并在此基础上对预训练的M3DDM模型进行微调,从而显著提升视频外画的视觉保真度与时序一致性,同时保持计算效率。

链接: https://arxiv.org/abs/2601.11048
作者: Takuya Murakawa,Takumi Fukuzawa,Ning Ding,Toru Tamaki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: proc. of IWAIT2026

点击查看摘要

Abstract:M3DDM provides a computationally efficient framework for video outpainting via latent diffusion modeling. However, it exhibits significant quality degradation – manifested as spatial blur and temporal inconsistency – under challenging scenarios characterized by limited camera motion or large outpainting regions, where inter-frame information is limited. We identify the cause as a training-inference mismatch in the masking strategy: M3DDM’s training applies random mask directions and widths across frames, whereas inference requires consistent directional outpainting throughout the video. To address this, we propose M3DDM+, which applies uniform mask direction and width across all frames during training, followed by fine-tuning of the pretrained M3DDM model. Experiments demonstrate that M3DDM+ substantially improves visual fidelity and temporal coherence in information-limited scenarios while maintaining computational efficiency. The code is available at this https URL.
zh
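M3DDM+ 对掩码策略的修改可以用如下草图直观对比:uniform=True 时所有帧共享同一方向与宽度(论文的修正策略),uniform=False 则模拟原 M3DDM 逐帧随机方向/宽度的掩码方式(函数与参数命名均为示意,非官方实现):

```python
import numpy as np

def make_masks(n_frames, width, height, direction, ratio, uniform=True, rng=None):
    """Build binary outpainting masks (1 = region to generate)."""
    rng = rng or np.random.default_rng(0)
    dirs = ["left", "right", "top", "bottom"]
    masks = np.zeros((n_frames, height, width), dtype=np.uint8)
    for t in range(n_frames):
        d = direction if uniform else rng.choice(dirs)   # per-frame random in M3DDM
        r = ratio if uniform else rng.uniform(0.1, 0.5)
        if d == "left":
            masks[t, :, : int(width * r)] = 1
        elif d == "right":
            masks[t, :, width - int(width * r):] = 1
        elif d == "top":
            masks[t, : int(height * r), :] = 1
        else:
            masks[t, height - int(height * r):, :] = 1
    return masks

# M3DDM+ training masks: one direction and width for the whole clip.
uniform_masks = make_masks(4, 8, 8, "right", 0.25)
```

这样训练与推理时的掩码分布一致,避免了原方法的 train-inference mismatch。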

[CV-35] Your One-Stop Solution for AI-Generated Video Detection

【速读】:该论文旨在解决当前AI生成视频检测领域中存在的两大关键问题:一是现有数据集规模有限且多基于过时或单一的生成模型,难以反映现代生成技术的多样性与快速演进;二是当前基准测试仍停留在数据集构建阶段,缺乏对检测方法的系统性评估与深入分析。解决方案的核心在于提出AIGVDBench,一个覆盖31种先进生成模型、包含超过44万条视频的综合性基准,并对33种不同类别的检测器执行了1500余次评估,从而完成8项深入分析并揭示4项新发现,为AI生成视频检测研究提供了坚实的基础和可扩展的评估框架。

链接: https://arxiv.org/abs/2601.11035
作者: Long Ma,Zihao Xue,Yan Wang,Zhiyuan Yan,Jin Xu,Xiaorui Jiang,Haiyang Yu,Yong Liao,Zhen Bi
机构: University of Science and Technology of China (中国科学技术大学); Huzhou University (湖州大学); Alibaba Group (阿里巴巴集团); Peking University (北京大学); Institute of Dataspace, Hefei Comprehensive National Science Center (合肥综合性国家科学中心数据空间研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in generative modeling can create remarkably realistic synthetic videos, making it increasingly difficult for humans to distinguish them from real ones and necessitating reliable detection methods. However, two key limitations hinder the development of this field. From the dataset perspective, existing datasets are often limited in scale and constructed using outdated or narrowly scoped generative models, making it difficult to capture the diversity and rapid evolution of modern generative techniques. Moreover, the dataset construction process frequently prioritizes quantity over quality, neglecting essential aspects such as semantic diversity, scenario coverage, and technological representativeness. From the benchmark perspective, current benchmarks largely remain at the stage of dataset creation, leaving many fundamental issues and in-depth analysis yet to be systematically explored. Addressing this gap, we propose AIGVDBench, a benchmark designed to be comprehensive and representative, covering 31 state-of-the-art generation models and over 440,000 videos. By executing more than 1,500 evaluations on 33 existing detectors belonging to four distinct categories, this work presents 8 in-depth analyses from multiple perspectives and identifies 4 novel findings that offer valuable insights for future research. We hope this work provides a solid foundation for advancing the field of AI-generated video detection. Our benchmark is open-sourced at this https URL.
zh

[CV-36] IDDR-NGP: Incorporating Detectors for Distractor Removal with Instant Neural Radiance Field

【速读】:该论文旨在解决在隐式三维表示(implicit 3D representations)中同时去除多种类型干扰物(distractors)的问题,例如雪花、彩纸、落叶和花瓣等。现有方法通常仅针对特定类型的干扰物进行处理,缺乏通用性。解决方案的关键在于提出一种统一的去干扰方法 IDDR-NGP,其直接作用于 Instant-NGP 框架,并通过结合隐式三维表示与二维检测器,实现从多视角受损图像中高效恢复高质量三维场景。该方法引入了学习感知图像块相似性损失(LPIPS loss)和多视角补偿损失(MVCL),以联合优化渲染结果并聚合多视角信息,从而实现端到端训练下对多种干扰物的有效移除。

链接: https://arxiv.org/abs/2601.11030
作者: Xianliang Huang,Jiajie Gou,Shuhang Chen,Zhizhou Zhong,Jihong Guan,Shuigeng Zhou
机构: Fudan University (复旦大学); Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 7 figures, accepted by ACM-MM23

点击查看摘要

Abstract:This paper presents the first unified distractor removal method, named IDDR-NGP, which directly operates on Instant-NGP. The method is able to remove a wide range of distractors in 3D scenes, such as snowflakes, confetti, defoliation and petals, whereas existing methods usually focus on a specific type of distractor. By incorporating implicit 3D representations with 2D detectors, we demonstrate that it is possible to efficiently restore 3D scenes from multiple corrupted images. We design the learned perceptual image patch similarity (LPIPS) loss and the multi-view compensation loss (MVCL) to jointly optimize the rendering results of IDDR-NGP, which could aggregate information from multi-view corrupted images. All of them can be trained in an end-to-end manner to synthesize high-quality 3D scenes. To support the research on distractor removal in implicit 3D representations, we build a new benchmark dataset that consists of both synthetic and real-world distractors. To validate the effectiveness and robustness of IDDR-NGP, we provide a wide range of distractors with corresponding annotated labels added to both realistic and synthetic scenes. Extensive experimental results demonstrate the effectiveness and robustness of IDDR-NGP in removing multiple types of distractors. In addition, our approach achieves results comparable with existing SOTA desnowing methods and is capable of accurately removing both realistic and synthetic distractors.
zh
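论文将 LPIPS 损失与多视角补偿损失(MVCL)联合优化;其“加权求和 + 跨视角一致性”的结构可用如下玩具版示意(MVCL 在此简化为各视角渲染与均值渲染的差异,`perceptual` 是 LPIPS 网络的占位可调用;权重与具体形式均为假设,并非论文原始实现):

```python
import numpy as np

def multi_view_compensation(renders):
    """Toy MVCL stand-in: penalize disagreement of each view's render
    with the mean render across views."""
    mean = renders.mean(axis=0)
    return float(((renders - mean) ** 2).mean())

def joint_loss(renders, targets, perceptual, w_lpips=0.1, w_mvcl=0.05):
    """Weighted sum of reconstruction, perceptual (LPIPS-style), and
    multi-view compensation terms."""
    recon = float(((renders - targets) ** 2).mean())
    return recon + w_lpips * perceptual(renders, targets) + w_mvcl * multi_view_compensation(renders)

targets = np.ones((2, 4, 4))                                   # two clean views
inconsistent = np.stack([np.zeros((4, 4)), np.ones((4, 4))])   # views disagree
loss = joint_loss(inconsistent, targets, perceptual=lambda a, b: 0.0)
```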

[CV-37] Matching High-Dimensional Geometric Quantiles for Test-Time Adaptation of Transformers and Convolutional Networks Alike

【速读】:该论文旨在解决测试时适应(Test-time adaptation, TTA)中因测试数据与训练数据分布差异导致的性能下降问题,尤其关注现有方法对模型架构依赖性强、难以推广至通用架构的局限性。解决方案的关键在于提出一种架构无关的TTA方法,通过在分类器前添加一个适配器网络(adapter network)来预处理输入图像,并采用提出的分位数损失(quantile loss)进行训练。该方法通过匹配高维几何分位数来校正分布偏移,理论上证明了在一定条件下最小化分位数损失可学习到最优适配器,从而实现对不同架构(包括经典卷积网络和Transformer)的有效适应。

链接: https://arxiv.org/abs/2601.11022
作者: Sravan Danda,Aditya Challa,Shlok Mehendale,Snehanshu Saha
机构: BITS Pilani K K Birla Goa Campus (比特理工学院 K K 巴尔拉戈阿校区)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Test-time adaptation (TTA) refers to adapting a classifier for the test data when the probability distribution of the test data slightly differs from that of the training data of the model. To the best of our knowledge, most of the existing TTA approaches modify the weights of the classifier relying heavily on the architecture. It is unclear as to how these approaches are extendable to generic architectures. In this article, we propose an architecture-agnostic approach to TTA by adding an adapter network pre-processing the input images suitable to the classifier. This adapter is trained using the proposed quantile loss. Unlike existing approaches, we correct for the distribution shift by matching high-dimensional geometric quantiles. We prove theoretically that under suitable conditions minimizing quantile loss can learn the optimal adapter. We validate our approach on CIFAR10-C, CIFAR100-C and TinyImageNet-C by training both classic convolutional and transformer networks on CIFAR10, CIFAR100 and TinyImageNet datasets.
zh
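论文通过匹配高维几何分位数来校正分布偏移;下面用“逐维边缘分位数匹配”的简化版本示意这一思想(注意:几何分位数是真正的高维推广,此处的一维边缘版本仅为概念演示,并非论文的 quantile loss 实现):

```python
import numpy as np

def quantile_loss(adapted, reference, qs=(0.25, 0.5, 0.75)):
    """Simplified stand-in: squared gap between per-dimension empirical
    quantiles of adapted test features and the source reference."""
    qa = np.quantile(adapted, qs, axis=0)
    qr = np.quantile(reference, qs, axis=0)
    return float(np.mean((qa - qr) ** 2))

rng = np.random.default_rng(1)
source = rng.normal(0.0, 1.0, (2000, 4))   # source-domain features
shifted = source + 2.0                      # test features under a mean shift
corrected = shifted - 2.0                   # an "adapter" undoing the shift
```

直觉上,适配器使测试特征的分位数对齐源域分位数时损失趋近于零,对应已校正的分布偏移。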

[CV-38] MMedExpert-R1: Strengthening Multimodal Medical Reasoning via Domain-Specific Adaptation and Clinical Guideline Reinforcement

【速读】:该论文旨在解决医学视觉语言模型(MedVLMs)在真实临床场景中复杂推理能力不足的问题,具体包括三方面挑战:深度推理数据稀缺、多专科对齐的冷启动限制,以及标准强化学习算法难以建模临床推理多样性。其解决方案的关键在于提出MMedExpert-R1,通过三个核心机制实现突破:首先,构建包含10K样本的高质量多专科推理数据集MMedExpert,提供结构化推理轨迹;其次,采用领域特定适配(Domain-Specific Adaptation, DSA)生成专科特异的LoRA模块以实现多样化初始化;再次,引入基于临床指南的优势建模(Guideline-Based Advantages, GBA),显式刻画不同临床推理视角以对齐真实诊断策略;最后,通过冲突感知能力集成(Conflict-Aware Capability Integration)将多个专科专家融合为统一代理,保障跨专科鲁棒性。该方法在MedXpert-MM和OmniMedVQA等基准上达到SOTA性能,为可靠多模态医疗推理系统奠定基础。

链接: https://arxiv.org/abs/2601.10949
作者: Meidan Ding,Jipeng Zhang,Wenxuan Wang,Haiqin Zhong,Xiaoling Luo,Wenting Chen,Linlin Shen
机构: Shenzhen University (深圳大学); City University of Hong Kong (香港城市大学); The Hong Kong University of Science and Technology (香港科技大学); Renmin University of China (中国人民大学); Guangdong Provincial Key Laboratory of Intelligent Information Processing (广东省智能信息处理重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical Vision-Language Models (MedVLMs) excel at perception tasks but struggle with complex clinical reasoning required in real-world scenarios. While reinforcement learning (RL) has been explored to enhance reasoning capabilities, existing approaches face critical mismatches: the scarcity of deep reasoning data, cold-start limits multi-specialty alignment, and standard RL algorithms fail to model clinical reasoning diversity. We propose MMedExpert-R1, a novel reasoning MedVLM that addresses these challenges through domain-specific adaptation and clinical guideline reinforcement. We construct MMedExpert, a high-quality dataset of 10K samples across four specialties with step-by-step reasoning traces. Our Domain-Specific Adaptation (DSA) creates specialty-specific LoRA modules to provide diverse initialization, while Guideline-Based Advantages (GBA) explicitly models different clinical reasoning perspectives to align with real-world diagnostic strategies. Conflict-Aware Capability Integration then merges these specialized experts into a unified agent, ensuring robust multi-specialty alignment. Comprehensive experiments demonstrate state-of-the-art performance, with our 7B model achieving 27.50 on MedXpert-MM and 83.03 on OmniMedVQA, establishing a robust foundation for reliable multimodal medical reasoning systems.
zh

[CV-39] Sparse Data Tree Canopy Segmentation: Fine-Tuning Leading Pretrained Models on Only 150 Images

【速读】:该论文旨在解决在极端数据稀缺条件下(仅150张标注图像)进行树冠(tree canopy)分割任务时,深度学习模型易过拟合且泛化能力差的问题。其关键解决方案在于系统评估五种代表性架构(YOLOv11、Mask R-CNN、DeepLabv3、Swin-UNet 和 DINOv2)在小样本场景下的性能表现,并发现基于预训练卷积神经网络(CNN)的模型(尤其是YOLOv11和Mask R-CNN)显著优于基于视觉Transformer(Vision Transformer)的模型。原因在于CNN具有更强的归纳偏置(inductive bias),更适应小数据环境,而Transformer类模型因高数据需求和任务类型不匹配(如将语义分割误用于实例分割)导致性能下降。研究进一步验证了轻量级CNN方法在有限标注数据下仍是树冠检测最可靠的选择。

链接: https://arxiv.org/abs/2601.10931
作者: David Szczecina,Hudson Sun,Anthony Bertnyk,Niloofar Azad,Kyle Gao,Lincoln Linlin Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 4 pages, 2 figures

点击查看摘要

Abstract:Tree canopy detection from aerial imagery is an important task for environmental monitoring, urban planning, and ecosystem analysis. Simulating real-life data annotation scarcity, the Solafune Tree Canopy Detection competition provides a small and imbalanced dataset of only 150 annotated images, posing significant challenges for training deep models without severe overfitting. In this work, we evaluate five representative architectures, YOLOv11, Mask R-CNN, DeepLabv3, Swin-UNet, and DINOv2, to assess their suitability for canopy segmentation under extreme data scarcity. Our experiments show that pretrained convolution-based models, particularly YOLOv11 and Mask R-CNN, generalize significantly better than pretrained transformer-based models. DeepLabv3, Swin-UNet and DINOv2 underperform, likely due to differences between semantic and instance segmentation tasks, the high data requirements of Vision Transformers, and the lack of strong inductive biases. These findings confirm that transformer-based architectures struggle in low-data regimes without substantial pretraining or augmentation and that differences between semantic and instance segmentation further affect model performance. We provide a detailed analysis of training strategies, augmentation policies, and model behavior under the small-data constraint and demonstrate that lightweight CNN-based methods remain the most reliable for canopy detection on limited imagery.
zh

[CV-40] RobuMTL: Enhancing Multi-Task Learning Robustness Against Weather Conditions WACV

【速读】:该论文旨在解决自主系统在真实世界环境中因恶劣天气导致视觉退化而引发的多任务学习(Multi-Task Learning, MTL)模型性能下降与可靠性降低的问题。解决方案的关键在于提出一种名为RobuMTL的新架构,其通过动态选择基于输入扰动的任务特定分层低秩适配(Low-Rank Adaptation, LoRA)模块和LoRA专家队列(expert squad),以混合专家(mixture-of-experts)的方式实现对不同输入特征的自适应专业化,从而提升模型在多样化现实条件下的鲁棒性。

链接: https://arxiv.org/abs/2601.10921
作者: Tasneem Shaffee,Sherief Reda
机构: Brown University (布朗大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026

点击查看摘要

Abstract:Robust Multi-Task Learning (MTL) is crucial for autonomous systems operating in real-world environments, where adverse weather conditions can severely degrade model performance and reliability. In this paper, we introduce RobuMTL, a novel architecture designed to adaptively address visual degradation by dynamically selecting task-specific hierarchical Low-Rank Adaptation (LoRA) modules and a LoRA expert squad based on input perturbations in a mixture-of-experts fashion. Our framework enables adaptive specialization based on input characteristics, improving robustness across diverse real-world conditions. To validate our approach, we evaluated it on the PASCAL and NYUD-v2 datasets and compared it against single-task models, standard MTL baselines, and state-of-the-art methods. On the PASCAL benchmark, RobuMTL delivers a +2.8% average relative improvement under single perturbations and up to +44.4% under mixed weather conditions compared to the MTL baseline. On NYUD-v2, RobuMTL achieves a +9.7% average relative improvement across tasks. The code is available at GitHub.
zh
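RobuMTL 的“按输入扰动动态选择 LoRA 专家”机制可以抽象为软门控的 LoRA 混合前向过程,示意如下(门控此处仅用输入的均值池化统计量代替论文的扰动感知选择器;维度、命名与玩具数据均为假设):

```python
import numpy as np

def lora_delta(A, B, alpha=1.0):
    """Low-rank update Delta W = (alpha / r) * B @ A with rank r = A.shape[0]."""
    r = A.shape[0]
    return (alpha / r) * (B @ A)

def moe_lora_forward(x, W0, experts, gate_W):
    """Route the input through a softmax-gated mixture of LoRA experts."""
    logits = x.mean(axis=0) @ gate_W          # pooled input stats -> expert logits
    g = np.exp(logits - logits.max())
    g = g / g.sum()                           # softmax gate over experts
    delta = sum(gi * lora_delta(A, B) for gi, (A, B) in zip(g, experts))
    return x @ (W0 + delta).T, g

rng = np.random.default_rng(0)
d_in, d_out, r, n_exp = 6, 4, 2, 3
x = rng.normal(size=(5, d_in))
W0 = rng.normal(size=(d_out, d_in))                       # frozen base weight
experts = [(rng.normal(size=(r, d_in)), rng.normal(size=(d_out, r)))
           for _ in range(n_exp)]                          # (A, B) per expert
gate_W = rng.normal(size=(d_in, n_exp))
y, gate = moe_lora_forward(x, W0, experts, gate_W)
```

冻结基座权重 W0、只训练各专家的低秩因子与门控,是这类混合专家式 LoRA 设计的常见动机。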

[CV-41] Self-learned representation-guided latent diffusion model for breast cancer classification in deep ultraviolet whole surface images

【速读】:该论文旨在解决乳腺保乳手术(Breast-Conserving Surgery, BCS)中术中切缘评估对高精度、快速成像的需求,以及由于深度紫外荧光扫描显微镜(Deep Ultraviolet Fluorescence Scanning Microscopy, DUV-FSM)数据标注稀缺导致深度学习模型训练困难的问题。解决方案的关键在于提出一种由自监督学习(Self-Supervised Learning, SSL)引导的潜在扩散模型(Latent Diffusion Model, LDM),通过微调DINO教师模型提取的嵌入向量注入细胞结构的语义信息,生成高质量合成训练图像块;随后将真实与合成图像块联合用于视觉Transformer(Vision Transformer, ViT)的微调,并采用图像块预测聚合策略实现全切片图像(Whole Slide Image, WSI)级别的分类,从而在有限标注数据下显著提升模型性能。

链接: https://arxiv.org/abs/2601.10917
作者: Pouya Afshin,David Helminiak,Tianling Niu,Julie M. Jorns,Tina Yen,Bing Yu,Dong Hye Ye
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This paper has been accepted for the IEEE International Symposium on Biomedical Imaging (ISBI) 2026, London, UK, and will be presented in the corresponding session

点击查看摘要

Abstract:Breast-Conserving Surgery (BCS) requires precise intraoperative margin assessment to preserve healthy tissue. Deep Ultraviolet Fluorescence Scanning Microscopy (DUV-FSM) offers rapid, high-resolution surface imaging for this purpose; however, the scarcity of annotated DUV data hinders the training of robust deep learning models. To address this, we propose an Self-Supervised Learning (SSL)-guided Latent Diffusion Model (LDM) to generate high-quality synthetic training patches. By guiding the LDM with embeddings from a fine-tuned DINO teacher, we inject rich semantic details of cellular structures into the synthetic data. We combine real and synthetic patches to fine-tune a Vision Transformer (ViT), utilizing patch prediction aggregation for WSI-level classification. Experiments using 5-fold cross-validation demonstrate that our method achieves 96.47 % accuracy and reduces the FID score to 45.72, significantly outperforming class-conditioned baselines.
zh
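论文采用“图像块预测聚合”得到全切片(WSI)级别的分类结果;最简单的一种聚合规则是对各 patch 概率做均值池化再阈值判定,示意如下(摘要未给出论文具体采用的聚合规则,以下纯属假设性示例):

```python
import numpy as np

def wsi_prediction(patch_probs, threshold=0.5):
    """Aggregate patch-level probabilities into a slide-level score and a
    binary call via mean pooling (one simple aggregation rule)."""
    score = float(np.mean(patch_probs))
    return score, int(score >= threshold)

# Four ViT patch predictions from one slide (hypothetical values).
score, label = wsi_prediction([0.9, 0.8, 0.95, 0.2])
```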

[CV-42] Classification of Chest XRay Diseases through image processing and analysis techniques

【速读】:该论文旨在解决多类胸部X光图像分类问题(Multi-Classification Chest X-Ray Images),即通过自动化方法对不同胸腔疾病进行准确识别与区分。其解决方案的关键在于采用DenseNet121等深度学习模型,并结合开源Web应用实现可交互的诊断辅助系统,从而提升分类精度与临床实用性。同时,研究通过对比实验评估不同方法的性能表现,并指出当前模型的局限性,为未来改进提供方向。

链接: https://arxiv.org/abs/2601.10913
作者: Santiago Martínez Novoa,María Catalina Ibáñez,Lina Gómez Mesa,Jeremias Kramer
机构: Universidad de los Andes(安第斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Chest X-ray images are one of the most prevalent forms of radiological examination used for diagnosing thoracic diseases. In this study, we offer a concise overview of several methods employed for multi-class classification of chest X-ray images, including DenseNet121. In addition, we deploy an open-source web-based application. We conduct tests to compare different methods and assess how well they work, examine the weaknesses of the methods we propose, and suggest ideas for improving them in the future. Our code is available at: this https URL
zh

[CV-43] FrankenMotion: Part-level Human Motion Generation and Composition

【速读】:该论文旨在解决现有文本驱动人体运动生成方法在控制细粒度身体部位运动方面能力不足的问题,其根源在于缺乏高质量的、具有时间感知的局部动作标注数据。解决方案的关键在于构建了一个包含原子级、时序敏感的身体部位文本注释的高质量运动数据集,并基于此设计了一种扩散模型框架FrankenMotion,该框架通过为每个身体部位分配独立的时间结构化文本提示来实现对空间(身体部位)和时间(原子动作)维度的联合控制,从而首次实现了具备精细局部运动控制能力的生成式人体运动合成。

链接: https://arxiv.org/abs/2601.10909
作者: Chuqiao Li,Xianghui Xie,Yong Cao,Andreas Geiger,Gerard Pons-Moll
机构: Tübingen AI Center, University of Tübingen, Germany(图宾根人工智能中心,图宾根大学,德国); Max Planck Institute for Informatics, Saarland Informatics Campus, Germany(马克斯·普朗克信息学研究所,萨尔兰信息学园区,德国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Human motion generation from text prompts has made remarkable progress in recent years. However, existing methods primarily rely on either sequence-level or action-level descriptions due to the absence of fine-grained, part-level motion annotations. This limits their controllability over individual body parts. In this work, we construct a high-quality motion dataset with atomic, temporally-aware part-level text annotations, leveraging the reasoning capabilities of large language models (LLMs). Unlike prior datasets that either provide synchronized part captions with fixed time segments or rely solely on global sequence labels, our dataset captures asynchronous and semantically distinct part movements at fine temporal resolution. Based on this dataset, we introduce a diffusion-based part-aware motion generation framework, namely FrankenMotion, where each body part is guided by its own temporally-structured textual prompt. This is, to our knowledge, the first work to provide atomic, temporally-aware part-level motion annotations and have a model that allows motion generation with both spatial (body part) and temporal (atomic action) control. Experiments demonstrate that FrankenMotion outperforms all previous baseline models adapted and retrained for our setting, and our model can compose motions unseen during training. Our code and dataset will be publicly available upon publication.
zh
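上述"为每个身体部位分配独立的时间结构化文本提示"的思想,可以用一个极简的数据结构示意如下(字段名与取值均为本文作者假设的说明性格式,与论文实际的标注 schema 无关):

```python
# 假设的逐部位、带时间区间的文本提示格式(仅作说明,非论文数据格式)。
# 每个身体部位对应若干 (起始秒, 结束秒, 原子动作描述) 三元组。
part_prompts = {
    "left_arm":  [(0.0, 1.5, "raises slowly above the head"),
                  (1.5, 3.0, "waves side to side")],
    "right_leg": [(0.5, 2.0, "steps forward")],
    "torso":     [(0.0, 3.0, "leans slightly forward")],
}

def prompts_active_at(t, prompts):
    """返回 t 时刻(秒)驱动各身体部位的文本提示;无提示则为 None。"""
    return {part: next((txt for s, e, txt in spans if s <= t < e), None)
            for part, spans in prompts.items()}

print(prompts_active_at(1.0, part_prompts))
```

这种异步的逐部位标注正是该数据集区别于"固定时间段同步部位描述"数据集之处。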

[CV-44] Medical SAM3: A Foundation Model for Universal Prompt-Driven Medical Image Segmentation

【速读】:该论文旨在解决可提示分割基础模型(promptable segmentation foundation models,如 SAM3)在医学图像分割任务中因领域偏移(domain shift)、缺乏特权空间提示(privileged spatial prompts)以及对复杂解剖结构和体积信息推理能力不足而导致的性能下降问题。其解决方案的关键在于对原始 SAM3 模型进行全面微调(full fine-tuning),利用涵盖 10 种医学成像模态、33 个数据集的大规模 2D 和 3D 医学影像数据,配以分割掩码与文本提示,使模型获得鲁棒的领域特定表征,同时保留基于提示的灵活性。这一方法显著提升了在语义模糊、形态复杂及长程三维上下文等挑战场景下的分割性能,验证了整体模型适配优于仅依赖提示工程的有效性。

链接: https://arxiv.org/abs/2601.10880
作者: Chongcong Jiang,Tianxingjian Ding,Chuhan Song,Jiachen Tu,Ziyang Yan,Yihua Shao,Zhenyi Wang,Yuzhang Shang,Tianyu Han,Yu Tian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Promptable segmentation foundation models such as SAM3 have demonstrated strong generalization capabilities through interactive and concept-based prompting. However, their direct applicability to medical image segmentation remains limited by severe domain shifts, the absence of privileged spatial prompts, and the need to reason over complex anatomical and volumetric structures. Here we present Medical SAM3, a foundation model for universal prompt-driven medical image segmentation, obtained by fully fine-tuning SAM3 on large-scale, heterogeneous 2D and 3D medical imaging datasets with paired segmentation masks and text prompts. Through a systematic analysis of vanilla SAM3, we observe that its performance degrades substantially on medical data, with its apparent competitiveness largely relying on strong geometric priors such as ground-truth-derived bounding boxes. These findings motivate full model adaptation beyond prompt engineering alone. By fine-tuning SAM3’s model parameters on 33 datasets spanning 10 medical imaging modalities, Medical SAM3 acquires robust domain-specific representations while preserving prompt-driven flexibility. Extensive experiments across organs, imaging modalities, and dimensionalities demonstrate consistent and significant performance gains, particularly in challenging scenarios characterized by semantic ambiguity, complex morphology, and long-range 3D context. Our results establish Medical SAM3 as a universal, text-guided segmentation foundation model for medical imaging and highlight the importance of holistic model adaptation for achieving robust prompt-driven segmentation under severe domain shift. Code and model will be made available at this https URL.
zh

[CV-45] Effects of Different Attention Mechanisms Applied on 3D Models in Video Classification

【速读】:该论文旨在解决在提升视频帧空间分辨率的同时,因减少时间维度信息捕获而导致的生成式 AI(Generative AI)模型性能下降问题。其核心解决方案在于通过引入多种注意力机制(如多头注意力、通道注意力及卷积块注意力模块 CBAM 和时序卷积网络 TCN)来增强受限时间特征下的模型表达能力,从而缓解因时间信息丢失带来的识别精度损失。实验表明,在修改后的 R(2+1)D 架构中加入多头注意力机制后,UCF101 数据集上的准确率达到 88.98%,验证了注意力机制对提升受限时间建模能力的关键作用。

链接: https://arxiv.org/abs/2601.10854
作者: Mohammad Rasras,Iuliana Marin,Serban Radu,Irina Mocanu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 6 figures, conference

点击查看摘要

Abstract:Human action recognition has become an important research focus in computer vision due to the wide range of applications where it is used. 3D ResNet-based CNN models, particularly MC3, R3D, and R(2+1)D, use different convolutional filters to extract spatiotemporal features. This paper investigates the impact of reducing the knowledge captured from temporal data while increasing the resolution of the frames. To set up this experiment, we first created designs similar to the three originals, but with a dropout layer added before the final classifier. We then developed ten new versions of each of these three designs. The variants include dedicated attention blocks within their architecture, such as the convolutional block attention module (CBAM) and temporal convolution networks (TCN), in addition to multi-headed and channel attention mechanisms. The purpose is to observe how much each of these blocks influences the performance of the restricted-temporal models. Testing all the models on UCF101 showed an accuracy of 88.98% for the variant with multi-headed attention added to the modified R(2+1)D. The paper concludes that the temporal features removed from the newly created increased-resolution models matter significantly for performance. The variants behaved differently in class-level accuracy, despite the similarity of their improvements to overall performance.
zh
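关于"在受限时间特征上叠加注意力机制",下面用 NumPy 给出一个单头时序自注意力的最小示意(权重随机初始化、维度为假设值,仅说明跨帧重加权的计算流程,并非论文中 R(2+1)D 变体的实现):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(feats, Wq, Wk, Wv):
    """对时间轴做单头自注意力。

    feats: (T, D),由 3D-CNN 主干逐帧池化得到的特征;
    Wq/Wk/Wv: (D, D) 投影矩阵,此处随机给定,真实模型中由训练学得。
    """
    Q, K, V = feats @ Wq, feats @ Wk, feats @ Wv
    scores = Q @ K.T / np.sqrt(feats.shape[1])  # (T, T) 帧间相关性
    return softmax(scores, axis=-1) @ V         # (T, D) 时序重加权后的特征

rng = np.random.default_rng(0)
T, D = 8, 16                       # 假设 8 帧、16 维特征
feats = rng.normal(size=(T, D))
W = [rng.normal(size=(D, D)) * 0.1 for _ in range(3)]
out = temporal_self_attention(feats, *W)
print(out.shape)  # (8, 16)
```

多头版本即对 D 维切分成若干子空间并行执行上述计算后拼接。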

[CV-46] One Model Many Behaviors: Training-Induced Effects on Out-of-Distribution Detection WACV2026

【速读】:该论文旨在解决开放世界场景下机器学习系统中分布外(Out-of-distribution, OOD)检测性能与模型在分布内(In-distribution, ID)准确率之间关系不明确的问题。现有研究普遍假设更高的ID准确率会带来更好的OOD检测效果,但本文通过系统性实证研究发现二者呈非单调关系:随着训练策略提升ID准确率,OOD检测性能先提升后下降。解决方案的关键在于揭示训练策略、OOD检测方法选择与最终检测性能之间的强耦合性,表明不存在适用于所有场景的通用OOD检测方法,强调需根据具体训练流程定制化选择或设计检测策略。

链接: https://arxiv.org/abs/2601.10836
作者: Gerhard Krumpl,Henning Avenhaus,Horst Possegger
机构: Graz University of Technology (格拉茨工业大学); KESTRELEYE GmbH (KESTRELEYE有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: WACV 2026

点击查看摘要

Abstract:Out-of-distribution (OOD) detection is crucial for deploying robust and reliable machine-learning systems in open-world settings. Despite steady advances in OOD detectors, their interplay with modern training pipelines that maximize in-distribution (ID) accuracy and generalization remains under-explored. We investigate this link through a comprehensive empirical study. Fixing the architecture to the widely adopted ResNet-50, we benchmark 21 post-hoc, state-of-the-art OOD detection methods across 56 ImageNet-trained models obtained via diverse training strategies and evaluate them on eight OOD test sets. Contrary to the common assumption that higher ID accuracy implies better OOD detection performance, we uncover a non-monotonic relationship: OOD performance initially improves with accuracy but declines once advanced training recipes push accuracy beyond the baseline. Moreover, we observe a strong interdependence between training strategy, detector choice, and resulting OOD performance, indicating that no single method is universally optimal.
zh
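文中评测的 21 种后处理(post-hoc)OOD 检测方法里,最常见的两类打分函数可示意如下(MSP 与 energy score 均为该领域的标准方法,并非本文新提出;数值仅为演示):

```python
import numpy as np

def msp_score(logits):
    """最大 softmax 概率(MSP):分值越高越像分布内(ID)样本。"""
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return p.max(axis=-1)

def energy_score(logits, T=1.0):
    """负自由能(energy score):分值越高越像 ID 样本。"""
    return T * np.log(np.exp(logits / T).sum(axis=-1))

logits_id  = np.array([[8.0, 1.0, 0.5]])   # 置信、类 ID 的预测
logits_ood = np.array([[1.1, 1.0, 0.9]])   # 平坦、类 OOD 的预测
print(msp_score(logits_id)[0] > msp_score(logits_ood)[0])       # True
print(energy_score(logits_id)[0] > energy_score(logits_ood)[0])  # True
```

论文的核心发现正是:同一打分函数在不同训练策略得到的模型上,表现可能截然不同。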

[CV-47] Can Vision-Language Models Understand Construction Workers? An Exploratory Study

【速读】:该论文旨在解决建筑场景中机器人对工人行为与情绪识别能力不足的问题,以实现更安全、高效的协作。其核心挑战在于缺乏标注数据且需准确理解复杂的人类行为和情感状态。解决方案的关键在于利用通用视觉-语言模型(Vision-Language Models, VLMs)进行零样本或少样本的行为识别,无需大量领域特定训练数据即可实现对施工人员动作和情绪的初步检测,其中GPT-4o展现出最优性能,验证了VLM在建筑场景下作为基础感知工具的可行性。

链接: https://arxiv.org/abs/2601.10835
作者: Hieu Bui,Nathaniel E. Chodosh,Arash Tavakoli
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As robotics become increasingly integrated into construction workflows, their ability to interpret and respond to human behavior will be essential for enabling safe and effective collaboration. Vision-Language Models (VLMs) have emerged as a promising tool for visual understanding tasks and offer the potential to recognize human behaviors without extensive domain-specific training. This capability makes them particularly appealing in the construction domain, where labeled data is scarce and monitoring worker actions and emotional states is critical for safety and productivity. In this study, we evaluate the performance of three leading VLMs, GPT-4o, Florence 2, and LLaVa-1.5, in detecting construction worker actions and emotions from static site images. Using a curated dataset of 1,000 images annotated across ten action and ten emotion categories, we assess each model’s outputs through standardized inference pipelines and multiple evaluation metrics. GPT-4o consistently achieved the highest scores across both tasks, with an average F1-score of 0.756 and accuracy of 0.799 in action recognition, and an F1-score of 0.712 and accuracy of 0.773 in emotion recognition. Florence 2 performed moderately, with F1-scores of 0.497 for action and 0.414 for emotion, while LLaVa-1.5 showed the lowest overall performance, with F1-scores of 0.466 for action and 0.461 for emotion. Confusion matrix analyses revealed that all models struggled to distinguish semantically close categories, such as collaborating in teams versus communicating with supervisors. While the results indicate that general-purpose VLMs can offer a baseline capability for human behavior recognition in construction environments, further improvements, such as domain adaptation, temporal modeling, or multimodal sensing, may be needed for real-world reliability.
zh

[CV-48] A Unified 3D Object Perception Framework for Real-Time Outside-In Multi-Camera Systems

【速读】:该论文旨在解决工业基础设施场景下基于静态摄像头网络的高精度三维目标感知与多目标多相机(MTMC)跟踪问题,尤其针对“从内向外”自动驾驶模型向“从外向内”固定摄像机网络迁移时面临的异构部署和极端遮挡挑战。其解决方案的关键在于:一是引入绝对世界坐标几何先验和遮挡感知的重识别(ReID)嵌入模块,以提升跨分布式传感器网络的身份稳定性;二是采用基于NVIDIA COSMOS框架的生成式数据增强策略,有效缩小仿真到现实(Sim2Real)域差距,无需人工标注即可增强模型外观不变性;三是开发针对多尺度可变形聚合(MSDA)的TensorRT优化插件,在现代GPU上实现2.15倍加速,使单张Blackwell级GPU支持超过64路并发视频流,满足实时部署需求。

链接: https://arxiv.org/abs/2601.10819
作者: Yizhou Wang,Sameer Pusegaonkar,Yuxing Wang,Anqi Li,Vishal Kumar,Chetan Sethi,Ganapathy Aiyer,Yun He,Kartikay Thakkar,Swapnil Rathi,Bhushan Rupde,Zheng Tang,Sujit Biswas
机构: NVIDIA Corporation(英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate 3D object perception and multi-target multi-camera (MTMC) tracking are fundamental for the digital transformation of industrial infrastructure. However, transitioning “inside-out” autonomous driving models to “outside-in” static camera networks presents significant challenges due to heterogeneous camera placements and extreme occlusion. In this paper, we present an adapted Sparse4D framework specifically optimized for large-scale infrastructure environments. Our system leverages absolute world-coordinate geometric priors and introduces an occlusion-aware ReID embedding module to maintain identity stability across distributed sensor networks. To bridge the Sim2Real domain gap without manual labeling, we employ a generative data augmentation strategy using the NVIDIA COSMOS framework, creating diverse environmental styles that enhance the model’s appearance-invariance. Evaluated on the AI City Challenge 2025 benchmark, our camera-only framework achieves a state-of-the-art HOTA of 45.22. Furthermore, we address real-time deployment constraints by developing an optimized TensorRT plugin for Multi-Scale Deformable Aggregation (MSDA). Our hardware-accelerated implementation achieves a 2.15× speedup on modern GPU architectures, enabling a single Blackwell-class GPU to support over 64 concurrent camera streams.
zh

[CV-49] ICONIC-444: A 3.1-Million-Image Dataset for OOD Detection Research KR WACV2026

【速读】:该论文旨在解决当前分布外(Out-of-Distribution, OOD)检测研究中缺乏大规模、高质量且类别定义清晰的工业图像数据集的问题,尤其是难以覆盖从近似OOD到远距离OOD的多难度层级,以及支持细粒度与粗粒度计算机视觉任务的评估需求。解决方案的关键在于提出ICONIC-444数据集,该数据集包含超过310万张RGB图像,涵盖444个类别,由原型工业分拣设备采集,真实模拟工业场景;同时定义了四个基准任务以系统性地评估和推动OOD检测方法的发展,并为22种先进的后处理OOD检测方法提供了基线结果。

链接: https://arxiv.org/abs/2601.10802
作者: Gerhard Krumpl,Henning Avenhaus,Horst Possegger
机构: Graz University of Technology (格拉茨技术大学); KESTRELEYE GmbH (KESTRELEYE有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: WACV 2026, Dataset repo: this https URL

点击查看摘要

Abstract:Current progress in out-of-distribution (OOD) detection is limited by the lack of large, high-quality datasets with clearly defined OOD categories across varying difficulty levels (near- to far-OOD) that support both fine- and coarse-grained computer vision tasks. To address this limitation, we introduce ICONIC-444 (Image Classification and OOD Detection with Numerous Intricate Complexities), a specialized large-scale industrial image dataset containing over 3.1 million RGB images spanning 444 classes tailored for OOD detection research. Captured with a prototype industrial sorting machine, ICONIC-444 closely mimics real-world tasks. It complements existing datasets by offering structured, diverse data suited for rigorous OOD evaluation across a spectrum of task complexities. We define four reference tasks within ICONIC-444 to benchmark and advance OOD detection research and provide baseline results for 22 state-of-the-art post-hoc OOD detection methods.
zh

[CV-50] Future Optical Flow Prediction Improves Robot Control Video Generation

【速读】:该论文旨在解决从嘈杂的真实世界数据中学习通用且空间密集的未来运动表示(future motion representations)这一关键挑战,尤其是在生成式 AI(Generative AI)和控制任务中缺乏有效建模方法的问题。解决方案的关键在于提出 FOFPred——一个基于语言条件的光学流预测模型,其核心创新是将统一的视觉-语言模型(Vision-Language Model, VLM)与扩散模型(Diffusion architecture)相结合,从而实现像素级生成保真度与多模态推理能力的协同提升。该架构通过在大规模、无结构的网络视频-文本数据上进行预训练,并辅以关键的数据清洗与处理技术,显著增强了模型对复杂真实场景中运动模式的学习与泛化能力,最终在机器人操作和视频生成等下游任务中验证了其跨域适用性。

链接: https://arxiv.org/abs/2601.10781
作者: Kanchana Ranasinghe,Honglu Zhou,Yu Fang,Luyu Yang,Le Xue,Ran Xu,Caiming Xiong,Silvio Savarese,Michael S Ryoo,Juan Carlos Niebles
机构: Salesforce AI Research (Salesforce人工智能研究); Stony Brook University (石溪大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Site (Code, Models, Demo): this https URL

点击查看摘要

Abstract:Future motion representations, such as optical flow, offer immense value for control and generative tasks. However, forecasting generalizable spatially dense motion representations remains a key challenge, and learning such forecasting from noisy, real-world data remains relatively unexplored. We introduce FOFPred, a novel language-conditioned optical flow forecasting model featuring a unified Vision-Language Model (VLM) and Diffusion architecture. This unique combination enables strong multimodal reasoning with pixel-level generative fidelity for future motion prediction. Our model is trained on web-scale human activity data, a highly scalable but unstructured source. To extract meaningful signals from this noisy video-caption data, we employ crucial data preprocessing techniques and our unified architecture with strong image pretraining. The resulting trained model is then extended to tackle two distinct downstream tasks in control and generation. Evaluations across robotic manipulation and video generation under language-driven settings establish the cross-domain versatility of FOFPred, confirming the value of a unified VLM-Diffusion architecture and scalable learning from diverse web data for future optical flow prediction.
zh

[CV-51] Explore with Long-term Memory: A Benchmark and Multimodal LLM -based Reinforcement Learning Framework for Embodied Exploration

【速读】:该论文旨在解决当前主流单次(one-shot)具身任务仅关注任务完成结果,而忽视探索过程与长期情景记忆利用的问题,从而限制了智能体在复杂、长时程环境中实现持续学习与优化决策的能力。其解决方案的关键在于提出一种名为Long-term Memory Embodied Exploration (LMEE) 的新范式,通过统一智能体的探索认知与决策行为,并构建包含多目标导航与基于记忆的问答任务的评估基准LMEE-Bench,以全面衡量探索过程与结果;同时提出MemoryExplorer方法,利用强化学习微调多模态大语言模型(Multimodal Large Language Model, MLLM),引入包含动作预测、前沿选择和问答任务的多任务奖励函数,促使智能体主动查询记忆并开展前瞻性探索,显著提升了长时程具身任务中的表现。

链接: https://arxiv.org/abs/2601.10744
作者: Sen Wang,Bangwei Liu,Zhenkun Gao,Lizhuang Ma,Xuhong Wang,Yuan Xie,Xin Tan
机构: East China Normal University (华东师范大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Our dataset and code will be released at our \href{ this https URL }{website}

点击查看摘要

Abstract:An ideal embodied agent should possess lifelong learning capabilities to handle long-horizon and complex tasks, enabling continuous operation in general environments. This not only requires the agent to accurately accomplish given tasks but also to leverage long-term episodic memory to optimize decision-making. However, existing mainstream one-shot embodied tasks primarily focus on task completion results, neglecting the crucial process of exploration and memory utilization. To address this, we propose Long-term Memory Embodied Exploration (LMEE), which aims to unify the agent’s exploratory cognition and decision-making behaviors to promote lifelong learning. We further construct a corresponding dataset and benchmark, LMEE-Bench, incorporating multi-goal navigation and memory-based question answering to comprehensively evaluate both the process and outcome of embodied exploration. To enhance the agent’s memory recall and proactive exploration capabilities, we propose MemoryExplorer, a novel method that fine-tunes a multimodal large language model through reinforcement learning to encourage active memory querying. By incorporating a multi-task reward function that includes action prediction, frontier selection, and question answering, our model achieves proactive exploration. Extensive experiments against state-of-the-art embodied exploration models demonstrate that our approach achieves significant advantages in long-horizon embodied tasks.
zh
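论文提出的多任务奖励(动作预测、前沿选择、问答)从形式上可理解为若干子任务奖励的加权组合。下面是一个极简示意(函数签名、权重与各子项含义均为本文作者的假设,论文的具体奖励设计以原文为准):

```python
def multitask_reward(action_correct, frontier_gain, qa_f1,
                     w_act=1.0, w_frontier=0.5, w_qa=1.0):
    """多任务奖励的加权和示意(权重仅为演示取值,非论文设定)。

    action_correct: 动作预测是否正确(bool);
    frontier_gain:  本步前沿选择带来的探索收益(标量);
    qa_f1:          基于记忆的问答得分(0~1)。
    """
    return w_act * float(action_correct) + w_frontier * frontier_gain + w_qa * qa_f1

r = multitask_reward(action_correct=True, frontier_gain=0.3, qa_f1=0.8)
print(r)
```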

[CV-52] Line-based Event Preprocessing: Towards Low-Energy Neuromorphic Computer Vision

【速读】:该论文旨在解决神经形态视觉(neuromorphic vision)在嵌入式应用中能耗优化难题,特别是如何降低基于脉冲神经网络(spiking neural networks)的事件数据处理过程中的能量消耗。其关键解决方案是引入基于线特征的事件数据预处理机制,通过减少输入事件数量来降低神经形态硬件上的突触操作次数,从而实现理论能耗显著下降的同时维持甚至提升分类准确率,系统性地提升了神经形态分类效率,为更节能的神经形态计算机视觉提供了新路径。

链接: https://arxiv.org/abs/2601.10742
作者: Amélie Gruel,Pierre Lewden,Adrien F. Vincent,Sylvain Saïghi
机构: Labo. de l’Intégration du Matériau au Système, Univ. Bordeaux, Bordeaux INP, CNRS- F-33400 Talence, France
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 18 pages (3 pages of acknowledgments and references), 10 figures and 4 tables. Submitted to the IOP Science “Neuromorphic Computing and Engineering” journal, awaiting feedback. This work is supported by a public grant overseen by the French National Research Agency (ANR) as part of the “PEPR IA France 2030” programme (Emergences project ANR-23-PEIA-0002)

点击查看摘要

Abstract:Neuromorphic vision made significant progress in recent years, thanks to the natural match between spiking neural networks and event data in terms of biological inspiration, energy savings, latency and memory use for dynamic visual data processing. However, optimising its energy requirements still remains a challenge within the community, especially for embedded applications. One solution may reside in preprocessing events to optimise data quantity thus lowering the energy cost on neuromorphic hardware, proportional to the number of synaptic operations. To this end, we extend an end-to-end neuromorphic line detection mechanism to introduce line-based event data preprocessing. Our results demonstrate on three benchmark event-based datasets that preprocessing leads to an advantageous trade-off between energy consumption and classification performance. Depending on the line-based preprocessing strategy and the complexity of the classification task, we show that one can maintain or increase the classification accuracy while significantly reducing the theoretical energy consumption. Our approach systematically leads to a significant improvement of the neuromorphic classification efficiency, thus laying the groundwork towards a more frugal neuromorphic computer vision thanks to event preprocessing.
zh
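"基于检测到的线特征筛减输入事件、从而减少神经形态硬件上突触操作次数"的预处理思想可用如下示意代码说明(假设事件为 (x, y, t, p) 四元组、线由 ax + by + c = 0 给出;这只是几何筛选的草图,与论文端到端的神经形态实现无关):

```python
import numpy as np

def filter_events_near_line(events, a, b, c, max_dist=2.0):
    """保留到直线 ax + by + c = 0 距离不超过 max_dist 像素的事件。

    events: (N, 4) 数组,每行为 (x, y, t, p)。
    """
    x, y = events[:, 0], events[:, 1]
    dist = np.abs(a * x + b * y + c) / np.hypot(a, b)
    return events[dist <= max_dist]

events = np.array([
    [10.0, 10.0, 0.001,  1.0],   # 落在直线 y = x 上
    [10.0, 30.0, 0.002, -1.0],   # 远离该直线
])
kept = filter_events_near_line(events, a=1.0, b=-1.0, c=0.0)
print(len(kept))  # 1
```

事件数量的下降直接对应理论能耗(正比于突触操作数)的下降。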

[CV-53] Real-Time Drivers Drowsiness Detection and Analysis through Deep Learning

【速读】:该论文旨在解决驾驶员在长时间驾驶过程中因疲劳导致的注意力下降甚至睡意问题,此类状态可能引发严重的交通事故,威胁自身及他人安全。为实现对驾驶员疲劳状态的实时监测与预警,研究提出了一种基于深度卷积神经网络(Deep Convolutional Neural Networks, DCNNs)的非侵入式、低成本且高效的驾驶疲劳检测系统。其解决方案的关键在于:利用实时摄像头采集驾驶员面部图像,通过OpenCV库提取关键面部特征点(如眼睑开合程度和张嘴动作),结合预训练的DCNN模型进行特征分析与分类,从而在检测到疲劳迹象时立即触发持续警报,有效提升行车安全性。该方法在NTHU-DDD和Yawn-Eye-Dataset数据集上分别实现了99.6%和97%的识别准确率,验证了其可行性与鲁棒性。

链接: https://arxiv.org/abs/2511.12438
作者: ANK Zaman,Prosenjit Chatterjee,Rajat Sharma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:A long road trip is fun for drivers. However, a long drive for days can be tedious when a driver must accommodate stringent deadlines to reach distant destinations. Such a scenario forces drivers to drive extra miles, utilizing extra hours daily without sufficient rest and breaks. Once a driver undergoes such a scenario, it occasionally triggers drowsiness during driving. Drowsiness while driving can be life-threatening to any individual and can affect other drivers’ safety; therefore, a real-time detection system is needed. To identify fatigued facial characteristics in drivers and trigger the alarm immediately, this research develops a real-time driver drowsiness detection system utilizing deep convolutional neural networks (DCNNs). The proposed and implemented model takes real-time facial images of a driver using a live camera and utilizes a Python-based library named OpenCV to examine the facial images for facial landmarks like sufficient eye openings and yawn-like mouth movements. The DCNNs framework then gathers the data and utilizes a pre-trained model to detect the drowsiness of a driver using facial landmarks. If the driver is identified as drowsy, the system issues a continuous alert in real time, embedded in the Smart Car system. By potentially saving innocent lives on the roadways, the proposed technique offers a non-invasive, inexpensive, and cost-effective way to identify drowsiness. Our proposed and implemented DCNNs-embedded drowsiness detection model performs successfully on the NTHU-DDD dataset and the Yawn-Eye-Dataset, with drowsiness detection classification accuracies of 99.6% and 97%, respectively.
zh
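摘要提到用 OpenCV 面部关键点衡量"眼睑开合程度"。实现这类判据的一个常见启发式指标是眼睛纵横比(Eye Aspect Ratio, EAR,源自 Soukupová 与 Čech 的经典眨眼检测工作);以下仅为该通用指标的示意,论文未必采用完全相同的公式:

```python
import math

def eye_aspect_ratio(eye):
    """由 6 个眼部关键点 (p1..p6) 计算 EAR。

    eye: 按 p1..p6 顺序排列的 (x, y) 坐标。EAR 在连续多帧内持续偏小,
    通常意味着闭眼,可作为疲劳报警的触发条件之一。
    """
    def d(p, q):
        return math.dist(p, q)
    p1, p2, p3, p4, p5, p6 = eye
    return (d(p2, p6) + d(p3, p5)) / (2.0 * d(p1, p4))

open_eye   = [(0, 0), (1, 2), (2, 2), (3, 0), (2, -2), (1, -2)]      # 睁眼(示例坐标)
closed_eye = [(0, 0), (1, 0.2), (2, 0.2), (3, 0), (2, -0.2), (1, -0.2)]  # 闭眼
print(eye_aspect_ratio(open_eye) > eye_aspect_ratio(closed_eye))  # True
```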

[CV-54] Simple Models Rich Representations: Visual Decoding from Primate Intracortical Neural Signals NEURIPS2025

【速读】:该论文旨在解决从灵长类动物高密度皮层内记录的神经活动中解码视觉信息的问题(decoding visual information from high-density intracortical recordings in primates)。其关键解决方案在于:通过系统评估模型架构、训练目标与数据规模对解码性能的影响,发现解码准确率主要由对神经信号时序动态建模能力决定,而非模型复杂度;在此基础上,提出一种结合时序注意力机制与浅层多层感知机(MLP)的简单模型,实现了高达70%的top-1图像检索准确率,显著优于线性基线及循环和卷积方法;进一步构建了一个模块化生成式解码流程,融合低分辨率潜在空间重建与语义条件扩散模型,可在仅200毫秒脑活动输入下生成合理图像,为脑机接口(BCI)和语义神经解码提供了可扩展的框架原则。

链接: https://arxiv.org/abs/2601.11108
作者: Matteo Ciferri,Matteo Ferrante,Nicola Toschi
机构: University of Rome, Tor Vergata (罗马大学托尔维加塔分校); Department of Biomedicine and Prevention (生物医学与预防系); A.A. Martinos Center for Biomedical Imaging (A.A. 马蒂诺斯生物医学成像中心); Harvard Medical School/MGH, Boston (美国哈佛医学院/麻省总医院,波士顿)
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
备注: Presented at “Foundation Models for the Brain and Body” - NeurIPS 2025 Workshop

点击查看摘要

Abstract:Understanding how neural activity gives rise to perception is a central challenge in neuroscience. We address the problem of decoding visual information from high-density intracortical recordings in primates, using the THINGS Ventral Stream Spiking Dataset. We systematically evaluate the effects of model architecture, training objectives, and data scaling on decoding performance. Results show that decoding accuracy is mainly driven by modeling temporal dynamics in neural signals, rather than architectural complexity. A simple model combining temporal attention with a shallow MLP achieves up to 70% top-1 image retrieval accuracy, outperforming linear baselines as well as recurrent and convolutional approaches. Scaling analyses reveal predictable diminishing returns with increasing input dimensionality and dataset size. Building on these findings, we design a modular generative decoding pipeline that combines low-resolution latent reconstruction with semantically conditioned diffusion, generating plausible images from 200 ms of brain activity. This framework provides principles for brain-computer interfaces and semantic neural decoding.
zh
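论文的最优模型是"时序注意力 + 浅层 MLP"。下面用 NumPy 给出一个结构示意(时间注意力做加权池化后接单隐层 MLP;维度与权重均为随机占位,非论文实现):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode(spikes, w_attn, W1, b1, W2, b2):
    """先对时间做注意力池化,再过单隐层 MLP。

    spikes: (T, N) 分箱后的发放率;w_attn: (N,) 打分向量;
    其余权重均为随机占位,代表实际中学得的参数。
    """
    scores = softmax(spikes @ w_attn)       # (T,) 各时间箱的注意力权重
    pooled = scores @ spikes                # (N,) 加权的时间平均
    h = np.maximum(0.0, pooled @ W1 + b1)   # ReLU 隐层
    return h @ W2 + b2                      # 类别 / 检索嵌入的 logits

rng = np.random.default_rng(1)
T, N, H, C = 20, 64, 32, 10    # 假设:200 ms 分 20 箱、64 个神经元、10 类
spikes = rng.poisson(2.0, size=(T, N)).astype(float)
out = decode(spikes, rng.normal(size=N) * 0.1,
             rng.normal(size=(N, H)) * 0.1, np.zeros(H),
             rng.normal(size=(H, C)) * 0.1, np.zeros(C))
print(out.shape)  # (10,)
```

这与论文"性能主要由时序建模而非模型复杂度驱动"的结论一致:整个解码器只有一层注意力和一层隐层。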

[CV-55] Generation of Chest CT pulmonary Nodule Images by Latent Diffusion Models using the LIDC-IDRI Dataset

【速读】:该论文旨在解决医学影像诊断中因特定疾病(如小细胞肺癌或良恶性肿瘤难以区分的病例)数据稀缺导致的数据不平衡问题,从而影响计算机辅助诊断系统性能。其解决方案的关键在于利用潜在扩散模型(Latent Diffusion Models, LDM)自动生成高质量的胸部CT结节图像,通过基于医生评估构建的图像与文本提示配对数据集对Stable Diffusion v1.5和v2.0进行微调,并调节引导尺度(guidance scale, GS)以控制生成图像对输入文本的忠实度。实验表明,SDv2在GS=5时表现最优,在图像质量、多样性及文本一致性方面均达到与真实临床图像相当的水平,验证了该方法在生成具有特定医学特征的高保真图像上的有效性。

链接: https://arxiv.org/abs/2601.11085
作者: Kaito Urata,Maiko Nagao,Atsushi Teramoto,Kazuyoshi Imaizumi,Masashi Kondo,Hiroshi Fujita
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注:

点击查看摘要

Abstract:Recently, computer-aided diagnosis systems have been developed to support diagnosis, but their performance depends heavily on the quality and quantity of training data. However, in clinical practice, it is difficult to collect the large amount of CT images for specific cases, such as small cell carcinoma with low epidemiological incidence or benign tumors that are difficult to distinguish from malignant ones. This leads to the challenge of data imbalance. In this study, to address this issue, we proposed a method to automatically generate chest CT nodule images that capture target features using latent diffusion models (LDM) and verified its effectiveness. Using the LIDC-IDRI dataset, we created pairs of nodule images and finding-based text prompts based on physician evaluations. For the image generation models, we used Stable Diffusion version 1.5 (SDv1) and 2.0 (SDv2), which are types of LDM. Each model was fine-tuned using the created dataset. During the generation process, we adjusted the guidance scale (GS), which indicates the fidelity to the input text. Both quantitative and subjective evaluations showed that SDv2 (GS = 5) achieved the best performance in terms of image quality, diversity, and text consistency. In the subjective evaluation, no statistically significant differences were observed between the generated images and real images, confirming that the quality was equivalent to real clinical images. We proposed a method for generating chest CT nodule images based on input text using LDM. Evaluation results demonstrated that the proposed method could generate high-quality images that successfully capture specific medical features.
zh
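论文中调节的引导尺度(guidance scale, GS)在标准 LDM 实现中通常对应无分类器引导(classifier-free guidance)的噪声组合公式。以下示意假设其实现遵循标准 CFG(具体以论文为准):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_text, guidance_scale):
    """无分类器引导:按引导尺度把去噪预测推向文本条件方向。

    eps_uncond / eps_text: 无条件与文本条件下的噪声预测。
    guidance_scale 越大,生成结果对输入文本越忠实(论文中 SDv2 取 GS=5 最优)。
    """
    return eps_uncond + guidance_scale * (eps_text - eps_uncond)

eps_u = np.array([0.0, 0.0])
eps_t = np.array([1.0, -1.0])
guided = cfg_combine(eps_u, eps_t, guidance_scale=5.0)
print(guided)  # [ 5. -5.]
```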

[CV-56] Visual question answering-based image-finding generation for pulmonary nodules on chest CT from structured annotations

【速读】:该论文旨在解决胸部CT影像中肺结节诊断过程中,传统固定描述方式难以满足临床医生个性化提问需求的问题,从而提升诊断交互性与效率。其解决方案的关键在于:利用LIDC-IDRI数据集中结构化的形态学特征信息,构建了一个视觉问答(Visual Question Answering, VQA)数据集,并基于此对VQA模型进行微调,实现根据医生提出的自然语言问题生成符合放射学规范的图像发现描述。实验表明,生成结果在CIDEr评分(3.896)和形态学特征一致性方面表现优异,验证了该方法作为交互式诊断支持系统的有效性。

链接: https://arxiv.org/abs/2601.11075
作者: Maiko Nagao,Kaito Urata,Atsushi Teramoto,Kazuyoshi Imaizumi,Masashi Kondo,Hiroshi Fujita
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注:

点击查看摘要

Abstract:Interpretation of imaging findings based on morphological characteristics is important for diagnosing pulmonary nodules on chest computed tomography (CT) images. In this study, we constructed a visual question answering (VQA) dataset from structured data in an open dataset and investigated an image-finding generation method for chest CT images, with the aim of enabling interactive diagnostic support that presents findings based on questions that reflect physicians’ interests rather than fixed descriptions. In this study, chest CT images included in the Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI) datasets were used. Regions of interest surrounding the pulmonary nodules were extracted from these images, and image findings and questions were defined based on morphological characteristics recorded in the database. A dataset comprising pairs of cropped images, corresponding questions, and image findings was constructed, and the VQA model was fine-tuned on it. Language evaluation metrics such as BLEU were used to evaluate the generated image findings. The VQA dataset constructed using the proposed method contained image findings with natural expressions as radiological descriptions. In addition, the generated image findings showed a high CIDEr score of 3.896, and a high agreement with the reference findings was obtained through evaluation based on morphological characteristics. We constructed a VQA dataset for chest CT images using structured information on the morphological characteristics from the LIDC-IDRI dataset. Methods for generating image findings in response to these questions have also been investigated. Based on the generated results and evaluation metric scores, the proposed method was effective as an interactive diagnostic support system that can present image findings according to physicians’ interests.
zh

[CV-57] Convolutions Need Registers Too: HVS-Inspired Dynamic Attention for Video Quality Assessment ACM-MM

【速读】:该论文旨在解决无参考视频质量评估(No-reference Video Quality Assessment, NR-VQA)中难以准确建模视频时序动态特性的难题,尤其是在缺乏参考视频的情况下如何有效捕捉人类视觉系统(Human Visual System, HVS)所关注的动态显著区域。传统方法通常依赖静态显著图作为辅助输入,未能将全局上下文嵌入到视频特征提取的核心流程中。其解决方案的关键在于提出一种名为DAGR-VQA的新框架,首次将可学习的注册令牌(register tokens)直接集成到卷积主干网络中,以实现时空动态显著性预测;通过引入可学习的全局上下文载体,模型能够生成时变的、受HVS启发的注意力机制,从而在不进行显式运动估计的前提下,自适应地追踪视频中的显著区域,并结合RGB输入与时间Transformer进行联合分析,最终实现感知一致的视频质量评估。

链接: https://arxiv.org/abs/2601.11045
作者: Mayesha Maliha R. Mithila,Mylene C.Q. Farias
机构: Texas State University (德克萨斯州立大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted at ACM MMSys 2026. 12 pages, 8 figures. No supplementary material

点击查看摘要

Abstract:No-reference video quality assessment (NR-VQA) estimates perceptual quality without a reference video, which is often challenging. While recent techniques leverage saliency or transformer attention, they merely address the global context of the video signal by using static maps as auxiliary inputs, rather than embedding context fundamentally within feature extraction of the video sequence. We present Dynamic Attention with Global Registers for Video Quality Assessment (DAGR-VQA), the first framework integrating register tokens directly into a convolutional backbone for spatio-temporal, dynamic saliency prediction. By embedding learnable register tokens as global context carriers, our model enables dynamic, HVS-inspired attention, producing temporally adaptive saliency maps that track salient regions over time without explicit motion estimation. Our model integrates dynamic saliency maps with RGB inputs, capturing spatial data and analyzing it through a temporal transformer to deliver a perceptually consistent video quality assessment. Comprehensive tests conducted on the LSVQ, KonVid-1k, LIVE-VQC, and YouTube-UGC datasets show that the performance is highly competitive, surpassing the majority of top baselines. Ablation studies demonstrate that the integration of register tokens promotes the development of stable and temporally consistent attention mechanisms. Achieving an efficiency of 387.7 FPS at 1080p, DAGR-VQA demonstrates computational performance suitable for real-time applications like multimedia streaming systems.
zh
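把可学习注册令牌(register tokens)与卷积特征一同送入自注意力的思路可示意如下(单头注意力、随机权重,仅演示注册令牌作为全局上下文"暂存槽"的用法,并非 DAGR-VQA 的实现):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_registers(patch_tokens, registers):
    """在卷积特征令牌之外拼接注册令牌,再做自注意力。

    patch_tokens: (P, D) 展平的空间特征;registers: (R, D) 可学习的
    全局上下文令牌(此处随机给定)。返回更新后的空间令牌;注册槽位
    充当全局信息的读写空间。
    """
    tokens = np.concatenate([registers, patch_tokens], axis=0)  # (R+P, D)
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    out = softmax(scores, axis=-1) @ tokens
    return out[registers.shape[0]:]  # 只取回空间令牌

rng = np.random.default_rng(2)
P, R, D = 49, 4, 8              # 假设 7x7 特征图、4 个注册令牌
patches = rng.normal(size=(P, D))
regs = rng.normal(size=(R, D))
out_tokens = attend_with_registers(patches, regs)
print(out_tokens.shape)  # (49, 8)
```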

[CV-58] KOCOBrain: Kuramoto-Guided Graph Network for Uncovering Structure-Function Coupling in Adolescent Prenatal Drug Exposure

【速读】:该论文旨在解决孕期暴露于精神活性物质(如大麻)对胎儿神经发育的干扰问题,尤其是如何准确识别与这种暴露相关的脑网络神经特征。其核心挑战在于传统方法难以有效捕捉结构连接组(structural connectome)与功能连接组(functional connectome)之间的耦合关系,且在类别不平衡的数据场景下模型鲁棒性不足。解决方案的关键在于提出KOCOBrain框架——一种统一的图神经网络架构,通过基于Kuramoto相位动力学的层建模解剖连接上的神经同步过程,生成反映结构-功能耦合的相位感知嵌入;同时引入认知感知注意力机制,利用认知评分对个体信息路由进行调制,并采用联合优化目标提升模型在类别不平衡条件下的预测性能。该方法在ABCD队列数据中显著优于现有基线模型,并揭示了可解释的结构-功能模式,反映了早期暴露导致的大脑网络协调障碍。

链接: https://arxiv.org/abs/2601.11018
作者: Badhan Mazumder,Lei Wu,Sir-Lord Wiafe,Vince D. Calhoun,Dong Hye Ye
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint version of the paper accepted to the IEEE International Symposium on Biomedical Imaging (ISBI 2026). This is the author’s accepted manuscript. The final published version will appear in IEEE Xplore

点击查看摘要

Abstract:Exposure to psychoactive substances during pregnancy, such as cannabis, can disrupt neurodevelopment and alter large-scale brain networks, yet identifying their neural signatures remains challenging. We introduced KOCOBrain: KuramotO COupled Brain Graph Network; a unified graph neural network framework that integrates structural and functional connectomes via Kuramoto-based phase dynamics and cognition-aware attention. The Kuramoto layer models neural synchronization over anatomical connections, generating phase-informed embeddings that capture structure-function coupling, while cognitive scores modulate information routing in a subject-specific manner followed by a joint objective enhancing robustness under class imbalance scenario. Applied to the ABCD cohort, KOCOBrain improved prenatal drug exposure prediction over relevant baselines and revealed interpretable structure-function patterns that reflect disrupted brain network coordination associated with early exposure.
zh

[CV-59] Differentiating through binarized topology changes: Second-order subpixel-smoothed projection

【速读】:该论文旨在解决拓扑优化(Topology Optimization, TopOpt)中可制造结构的非可微性问题,即由于制造结构本质上是二值的(binary),导致其在梯度优化过程中缺乏可微性,从而破坏了基于梯度算法的收敛性保证。现有方法如子像素平滑投影(Subpixel-Smoothed Projection, SSP)虽通过一阶展开实现界面平滑,但无法保证在拓扑变化(如界面合并)时仍具有可微性。本文的关键解决方案是引入对SSP的二阶正则化,利用滤波场的海森矩阵(Hessian)增强其在拓扑变换过程中的二阶可微性,从而在保持几乎处处二值结构的同时,显著提升优化算法的收敛性与适用范围,尤其适用于连接性主导的场景(connectivity-dominant cases)。该改进方法称为SSP2,具有与传统投影方法相当的计算复杂度,可直接替换现有拓扑优化代码。

链接: https://arxiv.org/abs/2601.10737
作者: Giuseppe Romano,Rodrigo Arrieta,Steven G. Johnson
机构: 未知
类目: Signal Processing (eess.SP); Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:A key challenge in topology optimization (TopOpt) is that manufacturable structures, being inherently binary, are non-differentiable, creating a fundamental tension with gradient-based optimization. The subpixel-smoothed projection (SSP) method addresses this issue by smoothing sharp interfaces at the subpixel level through a first-order expansion of the filtered field. However, SSP does not guarantee differentiability under topology changes, such as the merging of two interfaces, and therefore violates the convergence guarantees of many popular gradient-based optimization algorithms. We overcome this limitation by regularizing SSP with the Hessian of the filtered field, resulting in a twice-differentiable projected density during such transitions, while still guaranteeing an almost-everywhere binary structure. We demonstrate the effectiveness of our second-order SSP (SSP2) methodology on both thermal and photonic problems, showing that SSP2 has faster convergence than SSP for connectivity-dominant cases – where frequent topology changes occur – while exhibiting comparable performance otherwise. Beyond improving convergence guarantees for CCSA optimizers, SSP2 enables the use of a broader class of optimization algorithms with stronger theoretical guarantees, such as interior-point methods. Since SSP2 adds minimal complexity relative to SSP or traditional projection schemes, it can be used as a drop-in replacement in existing TopOpt codes.
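作为背景:SSP/SSP2 改进的是拓扑优化中的平滑投影环节。下面用社区常见的 tanh 平滑 Heaviside 投影做一个极简示意(β、η 取示例值;这不是论文的子像素平滑或二阶 Hessian 正则化实现),展示"滤波密度场随 β 增大逼近二值结构、同时保持对 ρ 可微"的一般思路:

```python
import numpy as np

def tanh_projection(rho, beta, eta=0.5):
    """Smoothed Heaviside projection widely used in topology optimization.
    As beta grows, the projected density approaches a binary 0/1 field
    while remaining differentiable with respect to rho."""
    num = np.tanh(beta * eta) + np.tanh(beta * (rho - eta))
    den = np.tanh(beta * eta) + np.tanh(beta * (1.0 - eta))
    return num / den

rho = np.linspace(0.0, 1.0, 11)                 # a filtered (grayscale) density field
soft = tanh_projection(rho, beta=1.0)           # mild projection, still grayscale
hard = tanh_projection(rho, beta=64.0)          # near-binary almost everywhere
```

投影在 ρ=η 处保持 0.5,两侧随 β 增大迅速压向 0/1;SSP/SSP2 处理的正是这类投影在界面与拓扑变化处的可微性问题。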
zh

人工智能

[AI-0] MetaboNet: The Largest Publicly Available Consolidated Dataset for Type 1 Diabetes Management

【速读】:该论文旨在解决当前1型糖尿病(Type 1 Diabetes, T1D)算法开发中因数据集碎片化和缺乏标准化而导致的数据整合困难、可比性与泛化能力不足的问题。解决方案的关键在于构建一个统一且可访问的T1D数据资源——MetaboNet数据集,其核心特征是整合了多个公开可用的数据集,并严格要求包含连续血糖监测(Continuous Glucose Monitoring, CGM)数据与胰岛素泵剂量记录的重叠信息,同时保留如碳水化合物摄入量和体力活动等辅助信息。该数据集涵盖3135名受试者共1228患者年的高质量数据,显著大于现有独立基准数据集,且提供免协议的公共子集与需数据使用协议(Data Use Agreement, DUA)限制的子集,辅以自动化处理流程,从而提升算法开发的效率与结果的通用性。

链接: https://arxiv.org/abs/2601.11505
作者: Miriam K. Wolff,Peter Calhoun,Eleonora Maria Aiello,Yao Qin,Sam F. Royston
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Quantitative Methods (q-bio.QM)
备注: 22 pages, 5 figures, 7 supplementary figures, submitted to JDST

点击查看摘要

Abstract:Progress in Type 1 Diabetes (T1D) algorithm development is limited by the fragmentation and lack of standardization across existing T1D management datasets. Current datasets differ substantially in structure and are time-consuming to access and process, which impedes data integration and reduces the comparability and generalizability of algorithmic developments. This work aims to establish a unified and accessible data resource for T1D algorithm development. Multiple publicly available T1D datasets were consolidated into a unified resource, termed the MetaboNet dataset. Inclusion required the availability of both continuous glucose monitoring (CGM) data and corresponding insulin pump dosing records. Additionally, auxiliary information such as reported carbohydrate intake and physical activity was retained when present. The MetaboNet dataset comprises 3135 subjects and 1228 patient-years of overlapping CGM and insulin data, making it substantially larger than existing standalone benchmark datasets. The resource is distributed as a fully public subset available for immediate download at this https URL , and with a Data Use Agreement (DUA)-restricted subset accessible through their respective application processes. For the datasets in the latter subset, processing pipelines are provided to automatically convert the data into the standardized MetaboNet format. A consolidated public dataset for T1D research is presented, and the access pathways for both its unrestricted and DUA-governed components are described. The resulting dataset covers a broad range of glycemic profiles and demographics and thus can yield more generalizable algorithmic performance than individual datasets.
zh

[AI-1] BoxMind: Closed-loop AI strategy optimization for elite boxing validated in the 2024 Olympics

【速读】:该论文旨在解决竞技体育中搏击类项目(如拳击)因动作动态复杂性和战术表征缺乏结构化而导致的AI驱动分析发展滞后的问题。其核心解决方案在于构建一个闭环的AI专家系统BoxMind,通过定义具有精确时空边界的原子级出拳事件,并将其解析为18个分层的技术-战术指标;进而提出一种基于图结构的预测模型,融合显式的技战术特征与可学习的时间变异性潜在嵌入(latent embeddings),以捕捉选手对战中的动态博弈关系;最终将比赛结果建模为技术-战术指标的可微函数,利用获胜概率梯度生成可执行的战略调整建议,从而实现从视频数据到战略智能的转化。

链接: https://arxiv.org/abs/2601.11492
作者: Kaiwen Wang,Kaili Zheng,Rongrong Deng,Qingmin Fan,Milin Zhang,Zongrui Li,Xuesi Zhou,Bo Han,Liren Chen,Chenyi Guo,Ji Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Competitive sports require sophisticated tactical analysis, yet combat disciplines like boxing remain underdeveloped in AI-driven analytics due to the complexity of action dynamics and the lack of structured tactical representations. To address this, we present BoxMind, a closed-loop AI expert system validated in elite boxing competition. By defining atomic punch events with precise temporal boundaries and spatial and technical attributes, we parse match footage into 18 hierarchical technical-tactical indicators. We then propose a graph-based predictive model that fuses these explicit technical-tactical profiles with learnable, time-variant latent embeddings to capture the dynamics of boxer matchups. Modeling match outcome as a differentiable function of technical-tactical indicators, we turn winning probability gradients into executable tactical adjustments. Experiments show that the outcome prediction model achieves state-of-the-art performance, with 69.8% accuracy on BoxerGraph test set and 87.5% on Olympic matches. Using this predictive model as a foundation, the system generates strategic recommendations that demonstrate proficiency comparable to human experts. BoxMind is validated through a closed-loop deployment during the 2024 Paris Olympics, directly contributing to the Chinese National Team’s historic achievement of three gold and two silver medals. BoxMind establishes a replicable paradigm for transforming unstructured video data into strategic intelligence, bridging the gap between computer vision and decision support in competitive sports.
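文中"把比赛结果建模为技战术指标的可微函数、用获胜概率梯度生成调整建议"的思路,可以用一个假设的逻辑回归胜率模型直观演示(指标、权重、偏置均为虚构示例,仅说明梯度如何指向"最值得改进的指标"):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical technical-tactical indicators (normalized) and learned weights.
indicators = np.array([0.4, 0.7, 0.2])   # e.g. jab rate, defense, combo rate
weights = np.array([1.5, 0.5, 2.0])      # importance learned by the outcome model
bias = -1.0

def win_prob(x):
    return sigmoid(weights @ x + bias)

# Gradient of win probability w.r.t. the indicators: p * (1 - p) * w.
p = win_prob(indicators)
grad = p * (1.0 - p) * weights

# A one-step tactical adjustment follows the gradient direction.
adjusted = np.clip(indicators + 0.1 * grad / np.linalg.norm(grad), 0.0, 1.0)
p_new = win_prob(adjusted)
```

梯度最大的分量(此例中第三项)即模型认为提升胜率收益最大的指标;真实系统中指标有 18 个、模型为图结构,但"沿获胜概率梯度给出可执行调整"的机制相同。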
zh

[AI-2] Health Facility Location in Ethiopia: Leveraging LLMs to Integrate Expert Knowledge into Algorithmic Planning

【速读】:该论文旨在解决埃塞俄比亚卫生部门在资源有限条件下,如何科学优先升级基层卫生站(health posts)以最大化人口覆盖范围,同时兼顾多元专家与利益相关者的定性偏好问题。解决方案的关键在于提出一种混合框架——大型语言模型与扩展贪心算法(Large language model and Extended Greedy, LEG),其核心是将具有理论保障的群体覆盖优化算法与基于大语言模型(LLM)的迭代精化机制相结合,通过人机对齐策略使最终方案既满足数学上的覆盖率近似保证,又能有效融入专家的自然语言指导,从而实现公平、数据驱动的健康系统规划。

链接: https://arxiv.org/abs/2601.11479
作者: Yohai Trabelsi,Guojun Xiong,Fentabil Getnet,Stéphane Verguet,Milind Tambe
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Ethiopia’s Ministry of Health is upgrading health posts to improve access to essential services, particularly in rural areas. Limited resources, however, require careful prioritization of which facilities to upgrade to maximize population coverage while accounting for diverse expert and stakeholder preferences. In collaboration with the Ethiopian Public Health Institute and Ministry of Health, we propose a hybrid framework that systematically integrates expert knowledge with optimization techniques. Classical optimization methods provide theoretical guarantees but require explicit, quantitative objectives, whereas stakeholder criteria are often articulated in natural language and difficult to formalize. To bridge these domains, we develop the Large language model and Extended Greedy (LEG) framework. Our framework combines a provable approximation algorithm for population coverage optimization with LLM-driven iterative refinement that incorporates human-AI alignment to ensure solutions reflect expert qualitative guidance while preserving coverage guarantees. Experiments on real-world data from three Ethiopian regions demonstrate the framework’s effectiveness and its potential to inform equitable, data-driven health system planning.
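其中"具有理论保障的群体覆盖优化"通常指最大覆盖问题的贪心 (1−1/e) 近似算法。下面是一个与埃塞俄比亚真实数据无关的玩具示例(设施与其覆盖的人口单元均为虚构):

```python
def greedy_max_coverage(coverage, budget):
    """Greedy algorithm for the maximum coverage problem.
    coverage: dict mapping facility -> set of population units it would cover.
    Picking the facility with the largest marginal gain at each step gives
    the classic (1 - 1/e) approximation guarantee."""
    chosen, covered = [], set()
    for _ in range(budget):
        best = max(coverage, key=lambda f: len(coverage[f] - covered))
        if not coverage[best] - covered:
            break  # no remaining facility adds new coverage
        chosen.append(best)
        covered |= coverage[best]
    return chosen, covered

facilities = {
    "hp_A": {1, 2, 3, 4},
    "hp_B": {3, 4, 5},
    "hp_C": {5, 6},
    "hp_D": {7},
}
chosen, covered = greedy_max_coverage(facilities, budget=2)
```

注意第二步选的是边际增益最大的 hp_C 而非原始覆盖更大的 hp_B;LEG 框架在此类贪心解之上再用 LLM 迭代融入专家的定性偏好。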
zh

[AI-3] Exploring LLM Features in Predictive Process Monitoring for Small-Scale Event-Logs

【速读】:该论文旨在解决预测性流程监控(Predictive Process Monitoring)中的关键挑战,即如何在数据稀缺场景下准确预测流程的未来结果,包括总时间(Total Time)和活动发生频率(Activity Occurrence)等关键绩效指标(Key Performance Indicators, KPIs)。其解决方案的关键在于扩展并验证基于大语言模型(Large Language Model, LLM)的预测框架,通过提示工程(prompting)实现对多种KPI的通用预测能力,并揭示LLM不仅利用其内在先验知识,还挖掘训练迹线间的内部相关性进行推理。实验表明,在仅100条迹线的数据条件下,该方法优于传统基准方法,且其推理机制并非简单复制已有预测策略,而是执行更高阶的逻辑推理以生成预测结果。

链接: https://arxiv.org/abs/2601.11468
作者: Alessandro Padella,Massimiliano de Leoni,Marlon Dumas
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注: 19 pages, 4 figure, TMIS journal submission

点击查看摘要

Abstract:Predictive Process Monitoring is a branch of process mining that aims to predict the outcome of an ongoing process. Recently, it leveraged machine-and-deep learning architectures. In this paper, we extend our prior LLM-based Predictive Process Monitoring framework, which was initially focused on total time prediction via prompting. The extension consists of comprehensively evaluating its generality, semantic leverage, and reasoning mechanisms, also across multiple Key Performance Indicators. Empirical evaluations conducted on three distinct event logs and across the Key Performance Indicators of Total Time and Activity Occurrence prediction indicate that, in data-scarce settings with only 100 traces, the LLM surpasses the benchmark methods. Furthermore, the experiments also show that the LLM exploits both its embodied prior knowledge and the internal correlations among training traces. Finally, we examine the reasoning strategies employed by the model, demonstrating that the LLM does not merely replicate existing predictive methods but performs higher-order reasoning to generate the predictions.
zh

[AI-4] GenDA: Generative Data Assimilation on Complex Urban Areas via Classifier-Free Diffusion Guidance

【速读】:该论文旨在解决城市风场重构问题,即在仅有稀疏传感器数据的情况下,如何重建高分辨率的风场分布。传统方法难以在复杂城市环境中实现准确且泛化的风场估计,尤其是在缺乏充足观测数据时。其解决方案的关键在于提出一种生成式数据同化(Generative Data Assimilation, GenDA)框架,该框架采用基于图结构的多尺度扩散架构,通过计算流体动力学(CFD)仿真训练模型,并将无条件分支学习几何感知的流场先验,条件分支注入观测约束以指导采样过程,从而实现障碍物感知的风场重构,并可在未见几何结构、风向和网格分辨率下无需重新训练即可泛化。

链接: https://arxiv.org/abs/2601.11440
作者: Francisco Giral,Álvaro Manzano,Ignacio Gómez,Ricardo Vinuesa,Soledad Le Clainche
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:Urban wind flow reconstruction is essential for assessing air quality, heat dispersion, and pedestrian comfort, yet remains challenging when only sparse sensor data are available. We propose GenDA, a generative data assimilation framework that reconstructs high-resolution wind fields on unstructured meshes from limited observations. The model employs a multiscale graph-based diffusion architecture trained on computational fluid dynamics (CFD) simulations and interprets classifier-free guidance as a learned posterior reconstruction mechanism: the unconditional branch learns a geometry-aware flow prior, while the sensor-conditioned branch injects observational constraints during sampling. This formulation enables obstacle-aware reconstruction and generalization across unseen geometries, wind directions, and mesh resolutions without retraining. We consider both sparse fixed sensors and trajectory-based observations using the same reconstruction procedure. When evaluated against supervised graph neural network (GNN) baselines and classical reduced-order data assimilation methods, GenDA reduces the relative root-mean-square error (RRMSE) by 25-57% and increases the structural similarity index (SSIM) by 23-33% across the tested meshes. Experiments are conducted on Reynolds-averaged Navier-Stokes (RANS) simulations of a real urban neighbourhood in Bristol, United Kingdom, at a characteristic Reynolds number of \mathrmRe\approx2\times10^7 , featuring complex building geometry and irregular terrain. The proposed framework provides a scalable path toward generative, geometry-aware data assimilation for environmental monitoring in complex domains.
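文中"无条件分支学习先验、条件分支注入观测约束"即扩散模型中标准的 classifier-free guidance 组合方式。下面用虚构的噪声预测张量给出该组合公式的极简数值示意(与 GenDA 的图网络实现无关):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: interpolate/extrapolate between the
    unconditional prior score and the observation-conditioned score.
    w = 0 -> pure prior; w = 1 -> pure conditional; w > 1 -> extrapolation."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0, -1.0])   # toy "geometry-aware prior" prediction
eps_cond = np.array([1.0, 1.0, 0.0])      # toy sensor-conditioned prediction

prior_only = cfg_combine(eps_uncond, eps_cond, w=0.0)
cond_only = cfg_combine(eps_uncond, eps_cond, w=1.0)
guided = cfg_combine(eps_uncond, eps_cond, w=2.0)
```

采样时每一步都用该组合后的预测去噪,从而在"几何先验"与"传感器约束"之间取得可调的折中。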
zh

[AI-5] The Great March 100: 100 Detail-oriented Tasks for Evaluating Embodied AI Agents

【速读】:该论文旨在解决当前机器人学习与模仿学习领域中数据集和任务设计缺乏系统性原则的问题,从而导致现有评估难以真实反映不同方法在机器人代理能力上的差异。其解决方案的关键在于提出首个“机器人学习奥运会”(Great March 100, GM-100)基准,包含100个精心设计的任务,覆盖广泛的交互类型与长尾行为,基于对人类-物体交互原语(human-object interaction primitives)和物体可供性(object affordances)的系统分析进行扩展,确保任务多样性与挑战性,能够有效区分当前视觉语言动作模型(VLA models)的性能差异。

链接: https://arxiv.org/abs/2601.11421
作者: Ziyu Wang,Chenyuan Liu,Yushun Xiang,Runhao Zhang,Qingbo Hao,Hongliang Lu,Houyu Chen,Zhizhong Feng,Kaiyue Zheng,Dehao Ye,Xianchao Zeng,Xinyu Zhou,Boran Wen,Jiaxin Li,Mingyu Zhang,Kecheng Zheng,Qian Zhu,Ran Cheng,Yong-Lu Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, with the rapid development of robot learning and imitation learning, numerous datasets and methods have emerged. However, these datasets and their task designs often lack systematic consideration and principles. This raises important questions: Do the current datasets and task designs truly advance the capabilities of robotic agents? Do evaluations on a few common tasks accurately reflect the differentiated performance of various methods proposed by different teams and evaluated on different tasks? To address these issues, we introduce the Great March 100 (\textbfGM-100) as the first step towards a robot learning Olympics. GM-100 consists of 100 carefully designed tasks that cover a wide range of interactions and long-tail behaviors, aiming to provide a diverse and challenging set of tasks to comprehensively evaluate the capabilities of robotic agents and promote diversity and complexity in robot dataset task designs. These tasks are developed through systematic analysis and expansion of existing task designs, combined with insights from human-object interaction primitives and object affordances. We collect a large amount of trajectory data on different robotic platforms and evaluate several baseline models. Experimental results demonstrate that the GM-100 tasks are 1) feasible to execute and 2) sufficiently challenging to effectively differentiate the performance of current VLA models. Our data and code are available at this https URL.
zh

[AI-6] Hyperparameter Optimization of Constraint Programming Solvers

【速读】:该论文旨在解决约束求解器(constraint programming solver)在实际应用中因超参数配置不当而导致性能不佳的问题。由于手动调参过程复杂且依赖专家经验,作者提出了一种名为“探查与求解”(probe and solve)的两阶段自动化超参数优化框架,其核心在于将有限的时间预算划分为两个阶段:第一阶段通过可配置的超参数优化方法(如贝叶斯优化或汉明距离搜索)探索不同配置;第二阶段则使用第一阶段找到的最佳配置在剩余时间内求解问题。该方案的关键创新在于引入了模型驱动的探索策略(以贝叶斯优化为代表),相比传统的局部搜索方法(如汉明距离搜索)展现出更强的泛化能力和稳定性,在ACE和Choco两个求解器上均显著优于默认配置,从而为资源受限场景下的约束求解器调优提供了一个高效、实用的解决方案。

链接: https://arxiv.org/abs/2601.11389
作者: Hedieh Haddad,Thibault Falque,Pierre Talbot,Pascal Bouvry
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 28 pages, 3 figures. Submitted to Journal of Combinatorial Optimization. Special Issue: Recent applications, models and algorithms in Combinatorial Optimization

点击查看摘要

Abstract:The performance of constraint programming solvers is highly sensitive to the choice of their hyperparameters. Manually finding the best solver configuration is a difficult, time-consuming task that typically requires expert knowledge. In this paper, we introduce probe and solve algorithm, a novel two-phase framework for automated hyperparameter optimization integrated into the CPMpy library. This approach partitions the available time budget into two phases: a probing phase that explores different sets of hyperparameters using configurable hyperparameter optimization methods, followed by a solving phase where the best configuration found is used to tackle the problem within the remaining time. We implement and compare two hyperparameter optimization methods within the probe and solve algorithm: Bayesian optimization and Hamming distance search. We evaluate the algorithm on two different constraint programming solvers, ACE and Choco, across 114 combinatorial problem instances, comparing their performance against the solver’s default configurations. Results show that using Bayesian optimization, the algorithm outperforms the solver’s default configurations, improving solution quality for ACE in 25.4% of instances and matching the default performance in 57.9%, and for Choco, achieving superior results in 38.6% of instances. It also consistently surpasses Hamming distance search within the same framework, confirming the advantage of model-based exploration over simple local search. Overall, the probe and solve algorithm offers a practical, resource-aware approach for tuning constraint solvers that yields robust improvements across diverse problem types.
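"探查与求解"的两阶段预算划分可以用如下与具体求解器无关的示意代码说明(配置空间与质量评分函数均为虚构;真实系统中探查阶段由贝叶斯优化或 Hamming 距离搜索驱动,而非随机抽样):

```python
import random

def probe_and_solve(configs, score, total_budget, probe_frac=0.3):
    """Two-phase scheme: spend probe_frac of the budget trying configurations,
    then use the best one found for the remaining solve time.
    `score(cfg)` stands in for 'solution quality achieved per unit time'."""
    probe_budget = int(total_budget * probe_frac)
    rng = random.Random(42)
    tried = [rng.choice(configs) for _ in range(probe_budget)]
    best = max(tried, key=score)
    solve_budget = total_budget - probe_budget
    return best, score(best) * solve_budget   # quality accrued in phase two

configs = [{"restart": r, "heuristic": h}
           for r in (10, 100) for h in ("dom/wdeg", "impact")]
quality = lambda c: (2.0 if c["heuristic"] == "dom/wdeg" else 2.0) \
                    + (0.5 if c["restart"] == 100 else 0.0) \
                    - (1.0 if c["heuristic"] == "impact" else 0.0)
best_cfg, attained = probe_and_solve(configs, quality, total_budget=10)
```

核心权衡在于 `probe_frac`:探查过多会挤占求解时间,过少则可能停留在较差配置上。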
zh

[AI-7] Institutional AI: Governing LLM Collusion in Multi-Agent Cournot Markets via Public Governance Graphs

【速读】:该论文旨在解决多智能体大语言模型(Multi-agent LLM ensembles)在协作过程中可能收敛至社会有害均衡的问题,例如市场中的合谋行为(如Cournot竞争模型中的串通定价)。其核心挑战在于传统基于偏好工程的对齐方法难以在群体层面有效约束代理间的协调行为。解决方案的关键是提出“制度型人工智能”(Institutional AI)框架,将对齐问题从代理空间(agent-space)重构为制度空间(institution-space)的设计问题,其中治理图(governance graph)作为不可变、公开的规范声明,明确法律状态、状态转移规则、制裁机制及修复路径;并通过Oracle/Controller运行时系统执行这些规则,对协调行为施加可执行后果并记录加密签名的审计日志。实验证明,相较于无治理和仅靠提示的宪法式干预,基于治理图的制度方案显著降低合谋发生率(严重合谋比例从50%降至5.6%),表明制度设计可为多智能体系统的对齐提供可操作且有效的抽象机制。

链接: https://arxiv.org/abs/2601.11369
作者: Marcantonio Bracale Syrnikov,Federico Pierucci,Marcello Galisai,Matteo Prandi,Piercosma Bisconti,Francesco Giarrusso,Olga Sorokoletova,Vincenzo Suriani,Daniele Nardi
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-agent LLM ensembles can converge on coordinated, socially harmful equilibria. This paper advances an experimental framework for evaluating Institutional AI, our system-level approach to AI alignment that reframes alignment from preference engineering in agent-space to mechanism design in institution-space. Central to this approach is the governance graph, a public, immutable manifest that declares legal states, transitions, sanctions, and restorative paths; an Oracle/Controller runtime interprets this manifest, attaching enforceable consequences to evidence of coordination while recording a cryptographically keyed, append-only governance log for audit and provenance. We apply the Institutional AI framework to govern the Cournot collusion case documented by prior work and compare three regimes: Ungoverned (baseline incentives from the structure of the Cournot market), Constitutional (a prompt-only policy-as-prompt prohibition implemented as a fixed written anti-collusion constitution), and Institutional (governance-graph-based). Across six model configurations including cross-provider pairs (N=90 runs/condition), the Institutional regime produces large reductions in collusion: mean tier falls from 3.1 to 1.8 (Cohen’s d=1.28), and severe-collusion incidence drops from 50% to 5.6%. The prompt-only Constitutional baseline yields no reliable improvement, illustrating that declarative prohibitions do not bind under optimisation pressure. These results suggest that multi-agent alignment may benefit from being framed as an institutional design problem, where governance graphs can provide a tractable abstraction for alignment-relevant collective behavior.
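作为背景:Cournot 市场中"合谋 vs. 竞争"的差异可以解析计算。在线性需求 P = a − bQ、对称边际成本 c 的双寡头下,Nash 均衡每厂产量为 (a−c)/(3b),而合谋(联合垄断)总产量为 (a−c)/(2b),价格更高、联合利润更大。下面用示例参数做数值验证:

```python
def cournot_nash_q(a, b, c):
    """Per-firm quantity at the symmetric two-firm Cournot-Nash equilibrium."""
    return (a - c) / (3 * b)

def cartel_total_q(a, b, c):
    """Total quantity chosen by a colluding cartel (joint monopoly)."""
    return (a - c) / (2 * b)

a, b, c = 100.0, 1.0, 10.0
q_nash = cournot_nash_q(a, b, c)          # per-firm quantity at Nash
q_cartel = cartel_total_q(a, b, c)        # cartel total quantity
price_nash = a - b * (2 * q_nash)
price_cartel = a - b * q_cartel
profit_nash_total = (price_nash - c) * 2 * q_nash
profit_cartel_total = (price_cartel - c) * q_cartel
```

合谋带来更高价格与更高联合利润,这正是代理自发收敛到合谋均衡的激励来源,也是治理图需要施加可执行后果去抵消的对象。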
zh

[AI-8] FEATHer: Fourier-Efficient Adaptive Temporal Hierarchy Forecaster for Time-Series Forecasting

【速读】:该论文旨在解决工业场景中边缘设备(如PLC、微控制器)因计算资源受限(如延迟和内存约束,参数量通常仅数百至数千)而难以部署高精度长时序预测模型的问题。解决方案的关键在于提出一种超轻量级的Fourier-Efficient Adaptive Temporal Hierarchy Forecaster (FEATHer) 架构:其核心创新包括(1)通过频域路径实现多尺度分解以捕捉不同频率特征;(2)采用无循环、无注意力机制的共享密集时间核(Dense Temporal Kernel),利用投影-深度可分离卷积-投影结构高效建模时间依赖性;(3)基于频谱特性的分支门控机制自适应融合表示;(4)通过周期稀疏核(Sparse Period Kernel)进行周期性下采样重构,有效建模季节性模式。该设计使模型参数量低至400,同时在8个基准数据集上取得平均排名2.05,显著优于现有方法,验证了在资源受限边缘硬件上实现可靠长程预测的可行性。

链接: https://arxiv.org/abs/2601.11350
作者: Jaehoon Lee,Seungwoo Lee,Younghwi Kim,Dohee Kim,Sunghyun Sim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

点击查看摘要

Abstract:Time-series forecasting is fundamental in industrial domains like manufacturing and smart factories. As systems evolve toward automation, models must operate on edge devices (e.g., PLCs, microcontrollers) with strict constraints on latency and memory, limiting parameters to a few thousand. Conventional deep architectures are often impractical here. We propose the Fourier-Efficient Adaptive Temporal Hierarchy Forecaster (FEATHer) for accurate long-term forecasting under severe limits. FEATHer introduces: (i) ultra-lightweight multiscale decomposition into frequency pathways; (ii) a shared Dense Temporal Kernel using projection-depthwise convolution-projection without recurrence or attention; (iii) frequency-aware branch gating that adaptively fuses representations based on spectral characteristics; and (iv) a Sparse Period Kernel reconstructing outputs via period-wise downsampling to capture seasonality. FEATHer maintains a compact architecture (as few as 400 parameters) while outperforming baselines. Across eight benchmarks, it achieves the best ranking, recording 60 first-place results with an average rank of 2.05. These results demonstrate that reliable long-range forecasting is achievable on constrained edge hardware, offering a practical direction for industrial real-time inference.
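文中"投影-深度可分离卷积-投影"的 Dense Temporal Kernel 可以用 NumPy 勾勒其结构(维度与参数规模均为虚构示例,仅展示该结构无循环、无注意力且参数量可以非常小):

```python
import numpy as np

rng = np.random.default_rng(0)
T, C, H = 16, 3, 4        # time steps, channels, hidden width

# Hypothetical parameters of a projection -> depthwise conv -> projection kernel.
W_in = rng.normal(size=(C, H)) * 0.1      # input channel projection
dw = rng.normal(size=(H, 3)) * 0.1        # one length-3 filter per hidden channel
W_out = rng.normal(size=(H, C)) * 0.1     # output projection

def dense_temporal_kernel(x):
    """x: (T, C) series. No recurrence, no attention: only projections
    and a per-channel (depthwise) temporal convolution."""
    h = x @ W_in                                   # (T, H)
    h_pad = np.pad(h, ((1, 1), (0, 0)))            # pad time axis for same-length output
    conv = np.stack([np.convolve(h_pad[:, j], dw[j], mode="valid")
                     for j in range(H)], axis=1)   # depthwise conv, (T, H)
    return conv @ W_out                            # back to (T, C)

x = rng.normal(size=(T, C))
y = dense_temporal_kernel(x)
n_params = W_in.size + dw.size + W_out.size        # only 36 parameters in this toy
```

深度可分离卷积让每个隐藏通道只带一个小滤波器,这正是此类模型能压到数百参数量级的原因之一。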
zh

[AI-9] XChoice: Explainable Evaluation of AI-Human Alignment in LLM -based Constrained Choice Decision Making

【速读】:该论文旨在解决当前评估人工智能(AI)与人类在受限决策场景中对齐程度时,仅依赖结果一致性指标(如准确率和F1分数)所导致的“黑箱”问题,即无法揭示模型决策机制与人类行为之间深层次差异。其解决方案的关键在于提出XChoice框架,通过拟合基于机制的决策模型到人类数据和大语言模型(LLM)生成的决策中,恢复可解释的参数向量,从而量化决策因素的重要性权重、约束敏感性及隐含权衡关系,并以此为基础比较不同模型、选项和子群体间的对齐情况。该方法实现了从表面结果匹配到机制层面诊断的跃迁,为精准识别和改进AI系统中的非对齐现象提供了新路径。

链接: https://arxiv.org/abs/2601.11286
作者: Weihong Qi,Fan Huang,Rasika Muralidharan,Jisun An,Haewoon Kwak
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present XChoice, an explainable framework for evaluating AI-human alignment in constrained decision making. Moving beyond outcome agreement such as accuracy and F1 score, XChoice fits a mechanism-based decision model to human data and LLM-generated decisions, recovering interpretable parameters that capture the relative importance of decision factors, constraint sensitivity, and implied trade-offs. Alignment is assessed by comparing these parameter vectors across models, options, and subgroups. We demonstrate XChoice on Americans’ daily time allocation using the American Time Use Survey (ATUS) as human ground truth, revealing heterogeneous alignment across models and activities and salient misalignment concentrated in Black and married groups. We further validate robustness of XChoice via an invariance analysis and evaluate targeted mitigation with a retrieval augmented generation (RAG) intervention. Overall, XChoice provides mechanism-based metrics that diagnose misalignment and support informed improvements beyond surface outcome matching.
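"比较参数向量以评估对齐程度"这一步,最简单的度量之一是决策因素权重向量间的余弦相似度。下面用虚构的权重向量演示(真实框架中的参数由机制化选择模型拟合得到,此处数值纯为示例):

```python
import numpy as np

def cosine_alignment(u, v):
    """Alignment score in [-1, 1] between two recovered parameter vectors
    (e.g. human vs. LLM decision-factor weights)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical decision-factor weights recovered by fitting the same
# mechanism-based choice model to human data and to two LLMs.
human = np.array([0.5, 0.3, 0.2])      # e.g. weights on work / leisure / chores
llm_a = np.array([0.45, 0.35, 0.2])    # close to the human profile
llm_b = np.array([0.1, 0.2, 0.7])      # inverted priorities

align_a = cosine_alignment(human, llm_a)
align_b = cosine_alignment(human, llm_b)
```

这种机制层面的比较能区分"结果碰巧一致"与"权衡方式真正一致"两种情形,后者才是 XChoice 意义上的对齐。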
zh

[AI-10] From SERPs to Sound: How Search Engine Result Pages and AI-generated Podcasts Interact to Influence User Attitudes on Controversial Topics

【速读】:该论文旨在解决信息媒介交互对用户态度影响的问题,特别是搜索结果页面(Search Engine Result Pages, SERPs)与生成式AI播客(AI-generated podcasts)这两种不同信息呈现方式在接触顺序和模态差异下如何共同塑造用户观点,尤其是在争议性、价值导向性强的议题中。解决方案的关键在于通过一项受控用户实验(N=483),系统考察信息暴露的序列效应(sequence effect)以及观点偏见(viewpoint bias)和议题争议程度(degree of topic controversiality)在态度变化中的调节作用,从而揭示两种媒介交互机制下的认知与态度形成路径。

链接: https://arxiv.org/abs/2601.11282
作者: Junjie Wang,Gaole He,Alisa Rieger,Ujwal Gadiraju
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
备注: ACM CHIIR 2026

点击查看摘要

Abstract:Compared to search engine result pages (SERPs), AI-generated podcasts represent a relatively new and relatively more passive modality of information consumption, delivering narratives in a naturally engaging format. As these two media increasingly converge in everyday information-seeking behavior, it is essential to explore how their interaction influences user attitudes, particularly in contexts involving controversial, value-laden, and often debated topics. Addressing this need, we aim to understand how information mediums of present-day SERPs and AI-generated podcasts interact to shape the opinions of users. To this end, through a controlled user study (N=483), we investigated user attitudinal effects of consuming information via SERPs and AI-generated podcasts, focusing on how the sequence and modality of exposure shape user opinions. A majority of users in our study corresponded to attitude change outcomes, and we found an effect of sequence on attitude change. Our results further revealed a role of viewpoint bias and the degree of topic controversiality in shaping attitude change, although we found no effect of individual moderators.
zh

[AI-11] Beyond Model Scaling: Test-Time Intervention for Efficient Deep Reasoning

【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRM)在多步推理过程中存在的效率低下问题,如过度思考(overthinking)和推理偏移(overshoot),这些问题会导致计算成本增加并降低性能。现有高效推理方法通常采用封闭回路模式,缺乏外部干预机制来引导推理过程。解决方案的关键在于提出一种名为Think-with-Me的测试时交互式推理范式,其核心创新是利用推理过程中的过渡连接词(transitional conjunctions)作为自然的干预点,这些连接词标志着自我验证或探索阶段;通过在此类节点暂停推理并引入外部反馈(基于合理性与完整性多准则评估,来自人类或LLM代理),模型可自适应地延长或终止推理路径,在减少冗余的同时保持准确性。该方法借助Group Relative Policy Optimization(GRPO)训练目标模型以适配此交互模式,实验证明其在有限上下文窗口下实现了准确率与推理长度之间的更优平衡。

链接: https://arxiv.org/abs/2601.11252
作者: Qianyue Wang,Jinwu Hu,Yufeng Wang,Huanxiang Lin,Bolin Chen,Zhiquan Wen,Yaofo Chen,Mingkui Tan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Reasoning Models (LRMs) excel at multi-step reasoning but often suffer from inefficient reasoning processes like overthinking and overshoot, where excessive or misdirected reasoning increases computational cost and degrades performance. Existing efficient reasoning methods operate in a closed-loop manner, lacking mechanisms for external intervention to guide the reasoning process. To address this, we propose Think-with-Me, a novel test-time interactive reasoning paradigm that introduces external feedback intervention into the reasoning process. Our key insights are that transitional conjunctions serve as natural points for intervention, signaling phases of self-validation or exploration and using transitional words appropriately to prolong the reasoning enhances performance, while excessive use affects performance. Building on these insights, Think-with-Me pauses reasoning at these points for external feedback, adaptively extending or terminating reasoning to reduce redundancy while preserving accuracy. The feedback is generated via a multi-criteria evaluation (rationality and completeness) and comes from either human or LLM proxies. We train the target model using Group Relative Policy Optimization (GRPO) to adapt to this interactive mode. Experiments show that Think-with-Me achieves a superior balance between accuracy and reasoning length under limited context windows. On AIME24, Think-with-Me outperforms QwQ-32B by 7.19% in accuracy while reducing average reasoning length by 81% under an 8K window. The paradigm also benefits security and creative tasks.
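"以过渡连接词为干预点"的判定本身可以非常轻量。以下英文示例演示这一思想(连接词表与推理轨迹均为虚构;实际反馈由人类或 LLM 代理按合理性/完整性多准则评估给出):

```python
TRANSITIONS = ("wait", "alternatively", "however", "but", "hmm")

def intervention_points(reasoning_steps):
    """Return indices of steps that open with a transitional conjunction --
    natural points to pause and request external feedback."""
    points = []
    for i, step in enumerate(reasoning_steps):
        first = step.strip().lower().split(maxsplit=1)[0].rstrip(",.")
        if first in TRANSITIONS:
            points.append(i)
    return points

trace = [
    "Let x be the number of apples.",
    "Then 3x + 2 = 11, so x = 3.",
    "Wait, I should double-check the arithmetic.",
    "However, the question asks for oranges, not apples.",
    "So the answer is 3.",
]
points = intervention_points(trace)
```

在每个干预点,外部反馈决定继续延伸还是终止推理;模型再经 GRPO 训练去适应这种"可被打断"的推理模式。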
zh

[AI-12] SDFLoRA: Selective Dual-Module LoRA for Federated Fine-tuning with Heterogeneous Clients

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中大规模语言模型(Large Language Models, LLMs)因客户端间低秩配置差异导致的LoRA(Low-Rank Adaptation)更新聚合偏差与不稳定性问题。现有方法通常强制统一秩或对异构更新进行空间对齐,但会过度约束客户端特有语义、限制个性化能力,并在差分隐私(Differential Privacy, DP)噪声注入下削弱本地信息保护。其解决方案的关键在于提出选择性双模块联邦LoRA(Selective Dual-module Federated LoRA, SDFLoRA),将每个客户端适配器分解为全局模块(捕获可迁移知识)与局部模块(保留客户端特定适应),仅对全局模块进行选择性对齐与聚合,而局部模块保持私有;该设计在支持秩异质性的同时,通过仅向全局模块注入差分隐私噪声实现隐私感知优化,从而在GLUE基准上优于主流联邦LoRA基线并取得更优的效用-隐私权衡。

链接: https://arxiv.org/abs/2601.11219
作者: Zhikang Shen,Jianrong Lu,Haiyuan Wan,Jianhai Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated learning (FL) for large language models (LLMs) has attracted increasing attention as a way to enable privacy-preserving adaptation over distributed data. Parameter-efficient methods such as LoRA are widely adopted to reduce communication and memory costs. Despite these advances, practical FL deployments often exhibit rank heterogeneity, since different clients may use different low-rank configurations. This makes direct aggregation of LoRA updates biased and unstable. Existing solutions typically enforce unified ranks or align heterogeneous updates into a shared subspace, which over-constrains client-specific semantics, limits personalization, and provides weak protection of local client information under differential privacy noise. To address this issue, we propose Selective Dual-module Federated LoRA (SDFLoRA), which decomposes each client adapter into a global module that captures transferable knowledge and a local module that preserves client-specific adaptations. The global module is selectively aligned and aggregated across clients, while local modules remain private. This design enables robust learning under rank heterogeneity and supports privacy-aware optimization by injecting differential privacy noise exclusively into the global module. Experiments on GLUE benchmarks demonstrate that SDFLoRA outperforms representative federated LoRA baselines and achieves a better utility-privacy trade-off.
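"仅聚合全局模块、局部模块保持私有"的双模块设计可以用 NumPy 勾勒(维度、秩配置均为示例;未包含论文的选择性对齐步骤,DP 噪声只注入全局模块这一点也未在示意中体现):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r_global = 8, 6, 2

clients = []
for r_local in (1, 2, 4):          # heterogeneous local ranks are allowed
    clients.append({
        "B_g": rng.normal(size=(d_out, r_global)),  # global module, shared rank
        "A_g": rng.normal(size=(r_global, d_in)),
        "B_l": rng.normal(size=(d_out, r_local)),   # local module, stays private
        "A_l": rng.normal(size=(r_local, d_in)),
    })

# Server step: average only the shared-rank global modules.
B_g_avg = np.mean([c["B_g"] for c in clients], axis=0)
A_g_avg = np.mean([c["A_g"] for c in clients], axis=0)

def effective_update(client):
    """Client-side weight delta: aggregated global part + private local part."""
    return B_g_avg @ A_g_avg + client["B_l"] @ client["A_l"]

deltas = [effective_update(c) for c in clients]
```

由于只有全局模块参与聚合,各客户端可以使用不同的局部秩而不引入聚合偏差,同时保留个性化的局部适配。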
zh

[AI-13] LoRA as Oracle

【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)在安全关键场景中面临的后门攻击(Backdoor Attack)和隐私泄露(Privacy Leakage)问题,尤其是现有防御方法普遍依赖干净参考模型、大量重训练或对攻击机制做出强假设的局限性。其解决方案的关键在于提出一种基于低秩适应(Low-Rank Adaptation, LoRA)的轻量级、模型无关的探测框架:通过在冻结的主干模型上附加任务特定的LoRA适配器,并分析其在暴露于可疑样本时的优化动态与表征变化,发现中毒样本和成员样本会引发显著区别于正常数据的低秩更新信号;这些信号可通过简单的排序统计与能量统计量进行测量,从而实现无需原始训练数据或修改部署模型的可靠后门检测与成员推理攻击检测。

链接: https://arxiv.org/abs/2601.11207
作者: Marco Arazzi,Antonino Nocera
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Backdoored and privacy-leaking deep neural networks pose a serious threat to the deployment of machine learning systems in security-critical settings. Existing defenses for backdoor detection and membership inference typically require access to clean reference models, extensive retraining, or strong assumptions about the attack mechanism. In this work, we introduce a novel LoRA-based oracle framework that leverages low-rank adaptation modules as a lightweight, model-agnostic probe for both backdoor detection and membership inference. Our approach attaches task-specific LoRA adapters to a frozen backbone and analyzes their optimization dynamics and representation shifts when exposed to suspicious samples. We show that poisoned and member samples induce distinctive low-rank updates that differ significantly from those generated by clean or non-member data. These signals can be measured using simple ranking and energy-based statistics, enabling reliable inference without access to the original training data or modification of the deployed model.
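文中提到的"能量统计量"之一可以理解为 LoRA 更新 ΔW = B·A 的 Frobenius 能量。下面是一个纯示意的玩具例子(两组适配器的数值尺度完全是人为设定,仅说明该统计量如何区分量级不同的低秩更新,并非论文实验的复现):

```python
import numpy as np

def lora_update_energy(B, A):
    """Frobenius energy of a LoRA update delta_W = B @ A -- one of the simple
    statistics the probe monitors after fine-tuning on suspicious samples."""
    delta = B @ A
    return float(np.linalg.norm(delta, ord="fro") ** 2)

rng = np.random.default_rng(1)
d, r = 16, 4
# Toy adapters: pretend one was fit on clean data, one on trigger-bearing data
# that forces a much larger low-rank correction (illustrative scales only).
B_clean, A_clean = rng.normal(size=(d, r)) * 0.01, rng.normal(size=(r, d)) * 0.01
B_susp, A_susp = rng.normal(size=(d, r)) * 0.1, rng.normal(size=(r, d)) * 0.1

e_clean = lora_update_energy(B_clean, A_clean)
e_susp = lora_update_energy(B_susp, A_susp)
flagged = e_susp > 10 * e_clean          # hypothetical detection threshold
```

探针只需冻结主干、训练小型适配器并比较此类统计量,因而无需干净参考模型或重训练整个网络。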
zh

[AI-14] Epistemic Control and the Normativity of Machine Learning-Based Science

【速读】:该论文旨在解决机器学习(Machine Learning, ML)系统在科学研究中广泛应用背景下,人类科学家是否因ML系统的特性而被排除在科学决策过程之外的问题。作者通过引入“认识论控制”(epistemic control)这一概念,并将其细化为“追踪”(tracking)和“追溯”(tracing)两个条件,构建了一个更细致的理解框架,从而反驳了Paul Humphreys提出的悲观观点,指出人类科学家仍可在ML驱动的科学研究中保持实质性的认识论参与和控制权。解决方案的关键在于重新界定“认识论控制”的内涵,并提出一种更具操作性和情境敏感性的评估标准,以实现对ML系统在科学实践中作用的合理定位。

链接: https://arxiv.org/abs/2601.11202
作者: Emanuele Ratti
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The past few years have witnessed an increasing use of machine learning (ML) systems in science. Paul Humphreys has argued that, because of specific characteristics of ML systems, human scientists are pushed out of the loop of science. In this chapter, I investigate to what extent this is true. First, I express these concerns in terms of what I call epistemic control. I identify two conditions for epistemic control, called tracking and tracing, drawing on works in philosophy of technology. With this new understanding of the problem, I then argue against Humphreys' pessimistic view. Finally, I construct a more nuanced view of epistemic control in ML-based science.
zh

[AI-15] FAQ: Mitigating Quantization Error via Regenerating Calibration Data with Family-Aware Quantization

【速读】:该论文旨在解决后训练量化(Post-Training Quantization, PTQ)中校准数据(calibration data)代表性不足与普适性差的问题,这一问题直接影响量化参数的准确性,进而限制了大语言模型(Large Language Models, LLMs)在资源受限设备上的部署效果。解决方案的关键在于提出一种名为FAQ(Family-Aware Quantization)的校准数据再生框架,其核心思想是利用同家族大语言模型的先验知识生成高保真校准样本:首先将原始校准样本输入同家族更大的模型,借助其一致的知识体系重构高质量数据;随后通过专家引导下的群体竞争机制筛选最优样本,并进行重归一化处理,从而显著提升标准PTQ方法的效果。实验表明,FAQ在多个模型系列(如Qwen3-8B)上相较原始校准数据可减少高达28.5%的精度损失。
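校准数据如何影响量化参数,可用最简单的逐张量 min-max 对称量化来直观说明。以下为纯示意代码,激活分布与校准集均为虚构,与 FAQ 的再生流程本身无关:

```python
def calibrate_scale(samples, n_bits=8):
    # 由校准样本的动态范围估计对称量化的 scale
    max_abs = max(abs(x) for x in samples)
    qmax = 2 ** (n_bits - 1) - 1  # int8 对称量化:127
    return max_abs / qmax

def fake_quant(x, scale):
    # 量化再反量化,用于度量量化误差
    q = max(-128, min(127, round(x / scale)))
    return q * scale

# 推理阶段真实出现的激活值(示意)
activations = [0.1, -0.4, 0.8, -1.2, 0.95]
# 有代表性的校准集 vs 动态范围偏小的劣质校准集
calib_sets = {"good": [0.1, -1.1, 0.9], "bad": [0.05, -0.1, 0.12]}

errs = {}
for name, calib in calib_sets.items():
    s = calibrate_scale(calib)
    errs[name] = sum((x - fake_quant(x, s)) ** 2 for x in activations)
print(errs)
```

动态范围偏小的校准集会使大激活被裁剪、误差急剧放大,这正是 FAQ 通过再生高保真校准样本想要避免的情形。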

链接: https://arxiv.org/abs/2601.11200
作者: Haiyang Xiao,Weiqing Li,Jinyue Guo,Guochao Jiang,Guohua Liu,Yuewei Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Although post-training quantization (PTQ) provides an efficient numerical compression scheme for deploying large language models (LLMs) on resource-constrained devices, the representativeness and universality of calibration data remain a core bottleneck in determining the accuracy of quantization parameters. Traditional PTQ methods typically rely on limited samples, making it difficult to capture the activation distribution during the inference phase, leading to biases in quantization parameters. To address this, we propose FAQ (Family-Aware Quantization), a calibration data regeneration framework that leverages prior knowledge from LLMs of the same family to generate high-fidelity calibration samples. Specifically, FAQ first inputs the original calibration samples into a larger LLM from the same family as the target model, regenerating a series of high-fidelity calibration data using a highly consistent knowledge system. Subsequently, this data, carrying Chain-of-Thought reasoning and conforming to the expected activation distribution, undergoes group competition under expert guidance to select the best samples, which are then re-normalized to enhance the effectiveness of standard PTQ. Experiments on multiple model series, including Qwen3-8B, show that FAQ reduces accuracy loss by up to 28.5% compared to the baseline with original calibration data, demonstrating its powerful potential and contribution.
zh

[AI-16] SD-RAG : A Prompt-Injection-Resilient Framework for Selective Disclosure in Retrieval-Augmented Generation

【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中敏感信息泄露与隐私保护不足的问题,尤其针对现有方法依赖提示层约束而易受提示注入攻击(prompt injection attacks)的缺陷。解决方案的关键在于提出一种名为SD-RAG的新范式,其核心创新是将安全与隐私约束的执行从生成过程解耦,转而在检索阶段通过数据净化和披露控制实现前置防护;同时引入语义驱动的动态策略机制与优化的基于图的数据模型,支持细粒度、策略感知的检索,从而在保障生成质量的同时显著提升隐私保护效果和对抗攻击的鲁棒性。
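SD-RAG 的核心思想是在检索阶段而非提示层执行披露控制。下面是一个概念性草图,文档结构、密级字段与脱敏逻辑均为笔者假设:

```python
def enforce_policy(docs, user_clearance):
    # 检索阶段过滤:越权内容从不进入生成模型的输入,
    # 因此提示注入无法诱导模型泄露它
    released = []
    for doc in docs:
        if doc["level"] <= user_clearance:
            released.append(doc["text"])
        elif doc.get("redacted"):
            released.append(doc["redacted"])  # 提供脱敏版本
        # 否则整篇丢弃
    return released

docs = [
    {"text": "public policy summary", "level": 0},
    {"text": "internal salary table", "level": 2,
     "redacted": "salary statistics (aggregated)"},
    {"text": "secret merger memo", "level": 3},
]
context = enforce_policy(docs, user_clearance=1)
prompt = "Answer using:\n" + "\n".join(context)
print(context)
```

由于敏感文档在增强提示之前就被丢弃或脱敏,针对生成模型的提示注入攻击无从获取这些内容。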

链接: https://arxiv.org/abs/2601.11199
作者: Aiman Al Masoud,Marco Arazzi,Antonino Nocera
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has attracted significant attention due to its ability to combine the generative capabilities of Large Language Models (LLMs) with knowledge obtained through efficient retrieval mechanisms over large-scale data collections. Currently, the majority of existing approaches overlook the risks associated with exposing sensitive or access-controlled information directly to the generation model. Only a few approaches propose techniques to instruct the generative model to refrain from disclosing sensitive information; however, recent studies have also demonstrated that LLMs remain vulnerable to prompt injection attacks that can override intended behavioral constraints. For these reasons, we propose a novel approach to Selective Disclosure in Retrieval-Augmented Generation, called SD-RAG, which decouples the enforcement of security and privacy constraints from the generation process itself. Rather than relying on prompt-level safeguards, SD-RAG applies sanitization and disclosure controls during the retrieval phase, prior to augmenting the language model’s input. Moreover, we introduce a semantic mechanism to allow the ingestion of human-readable dynamic security and privacy constraints together with an optimized graph-based data model that supports fine-grained, policy-aware retrieval. Our experimental evaluation demonstrates the superiority of SD-RAG over baseline existing approaches, achieving up to a 58% improvement in the privacy score, while also showing a strong resilience to prompt injection attacks targeting the generative model.
zh

[AI-17] Policy-Based Deep Reinforcement Learning Hyperheuristics for Job-Shop Scheduling Problems

【速读】:该论文旨在解决作业车间调度问题(Job Shop Scheduling Problem, JSSP),其核心挑战在于如何动态选择最优的调度规则以最小化最大完工时间(makespan)。解决方案的关键在于提出了一种基于策略的深度强化学习超启发式框架,其中包含两个核心机制:一是动作预过滤(action prefiltering),通过限制决策空间为可行的低层调度动作,实现对启发式规则的无偏评估;二是承诺机制(commitment mechanism),用于调控启发式切换频率,从逐步切换到整 episode 承诺的不同策略影响训练行为与调度性能。此外,研究对比了确定性贪婪选择与随机采样两种策略层面的动作选择方式,实验表明该方法在标准JSSP基准测试中优于传统启发式、元启发式及近期基于神经网络的调度方法。
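超启发式的“动作预过滤 + 承诺机制”骨架可以用一个单机玩具调度环境来示意,其中启发式集合、评价指标(总流程时间)与作业数据均为虚构,并非论文环境:

```python
# 低层启发式:从可行工序集合中选一个(工序 = (job_id, 加工时间))
HEURISTICS = {
    "SPT": lambda ops: min(ops, key=lambda o: o[1]),  # 最短加工时间优先
    "LPT": lambda ops: max(ops, key=lambda o: o[1]),  # 最长加工时间优先
}

def run_episode(jobs, pick, commit_steps):
    # pick(step) 返回启发式名;承诺机制限制切换频率
    ops = list(jobs)
    t, total_flow, current = 0, 0, None
    for step in range(len(jobs)):
        if step % commit_steps == 0:
            current = pick(step)       # 到承诺期才重新决策
        op = HEURISTICS[current](ops)  # ops 即预过滤后的可行动作集合
        ops.remove(op)
        t += op[1]
        total_flow += t                # 单机总流程时间(makespan 的简化替代)
    return total_flow

jobs = [(0, 3), (1, 5), (2, 2), (3, 4)]
# 整 episode 承诺 SPT vs LPT
spt = run_episode(jobs, lambda s: "SPT", commit_steps=len(jobs))
lpt = run_episode(jobs, lambda s: "LPT", commit_steps=len(jobs))
print(spt, lpt)
```

commit_steps 越大切换越少;论文比较的正是从逐步切换到整 episode 承诺之间的各种策略对训练与调度效果的影响。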

链接: https://arxiv.org/abs/2601.11189
作者: Sofiene Lassoued,Asrat Gobachew,Stefan Lier,Andreas Schwung
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper proposes a policy-based deep reinforcement learning hyper-heuristic framework for solving the Job Shop Scheduling Problem. The hyper-heuristic agent learns to switch scheduling rules based on the system state dynamically. We extend the hyper-heuristic framework with two key mechanisms. First, action prefiltering restricts decision-making to feasible low-level actions, enabling low-level heuristics to be evaluated independently of environmental constraints and providing an unbiased assessment. Second, a commitment mechanism regulates the frequency of heuristic switching. We investigate the impact of different commitment strategies, from step-wise switching to full-episode commitment, on both training behavior and makespan. Additionally, we compare two action selection strategies at the policy level: deterministic greedy selection and stochastic sampling. Computational experiments on standard JSSP benchmarks demonstrate that the proposed approach outperforms traditional heuristics, metaheuristics, and recent neural network-based scheduling methods.
zh

[AI-18] Clustering High-dimensional Data: Balancing Abstraction and Representation Tutorial at AAAI 2026

【速读】:该论文旨在解决大规模真实数据集中自然分组(即聚类)的识别问题,核心挑战在于如何在抽象(abstraction)与表示(representation)之间取得平衡。传统方法如K-means虽实现高抽象度(通过平均消除细节),但表示能力有限(假设所有簇为原始空间中的高斯分布),难以应对高维复杂数据。解决方案的关键在于引入更具表达力的表示机制,例如子空间聚类和深度聚类方法,同时显式地在目标函数中强制抽象性,以避免模型退化为单纯表示学习工具。当前深度聚类方法通过中心点约束(centroid-based)和密度约束(density-based)损失来定义并实现抽象,而子空间聚类思想进一步通过学习两个独立的潜在空间——一个用于聚类相关的信息,另一个捕捉其他冗余信息——有效分离了抽象与表示,从而提升聚类性能。未来方向将聚焦于更自适应地动态平衡二者,以增强聚类方法的性能、能效与可解释性。
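以教程中的 K-means 为例:每个簇仅由质心表示(高抽象、简单表示),细节被平均掉。下面是一个自包含的一维示例,数据与初始中心均为虚构:

```python
def kmeans_1d(xs, centers, iters=10):
    for _ in range(iters):
        # 抽象:每个簇仅由均值(质心)表示,个体细节被平均掉
        clusters = {c: [] for c in range(len(centers))}
        for x in xs:
            c = min(range(len(centers)), key=lambda i: (x - centers[i]) ** 2)
            clusters[c].append(x)
        # 表示:簇就是原空间中的一个中心点(对应高斯假设)
        centers = [sum(v) / len(v) if v else centers[i]
                   for i, v in clusters.items()]
    return centers

xs = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers = kmeans_1d(xs, centers=[0.0, 10.0])
print(centers)
```

教程讨论的子空间/深度聚类方法,可以理解为把这里的“均值”换成更丰富的表示,同时在目标函数中显式保留抽象约束。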

链接: https://arxiv.org/abs/2601.11160
作者: Claudia Plant,Lena G. M. Bauer,Christian Böhm
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:How to find a natural grouping of a large real data set? Clustering requires a balance between abstraction and representation. To identify clusters, we need to abstract from superfluous details of individual objects. But we also need a rich representation that emphasizes the key features shared by groups of objects that distinguish them from other groups of objects. Each clustering algorithm implements a different trade-off between abstraction and representation. Classical K-means implements a high level of abstraction - details are simply averaged out - combined with a very simple representation - all clusters are Gaussians in the original data space. We will see how approaches to subspace and deep clustering support high-dimensional and complex data by allowing richer representations. However, with increasing representational expressiveness comes the need to explicitly enforce abstraction in the objective function to ensure that the resulting method performs clustering and not just representation learning. We will see how current deep clustering methods define and enforce abstraction through centroid-based and density-based clustering losses. Balancing the conflicting goals of abstraction and representation is challenging. Ideas from subspace clustering help by learning one latent space for the information that is relevant to clustering and another latent space to capture all other information in the data. The tutorial ends with an outlook on future research in clustering. Future methods will more adaptively balance abstraction and representation to improve performance, energy efficiency and interpretability. By automatically finding the sweet spot between abstraction and representation, the human brain is very good at clustering and other related tasks such as single-shot learning. So, there is still much room for improvement. 
zh

[AI-19] Cross-Modal Attention Network with Dual Graph Learning in Multimodal Recommendation

【速读】:该论文旨在解决现有多媒体推荐系统中两个关键问题:一是模态融合过于浅层,通常依赖简单的特征拼接,难以挖掘模态内部与模态之间的高阶协同关系;二是用户与物品在特征处理上存在不对称性,即用户仅通过交互ID表征,而物品则拥有丰富的多模态内容,导致难以学习共享的语义空间。解决方案的核心在于提出一种交叉模态递归注意力网络(Cross-modal Recursive Attention Network, CRANE),其关键创新包括:设计递归交叉模态注意力(Recursive Cross-Modal Attention, RCA)机制,通过在联合潜在空间中迭代地利用跨模态相关性来优化各模态特征,从而显式建模高阶模态内和模态间依赖关系;同时构建对称的双图嵌入框架——包含异构用户-物品交互图与同质物品-物品语义图,并通过自监督对比学习目标统一行为信号与语义信号,实现用户与物品的对称多模态建模。该方法在保持高计算效率的同时,在多个真实数据集上显著优于当前最优基线模型。

链接: https://arxiv.org/abs/2601.11151
作者: Ji Dai,Quan Fang,Jun Hu,Desheng Cai,Yang Yang,Can Zhao
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted to ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)

点击查看摘要

Abstract:Multimedia recommendation systems leverage user-item interactions and multimodal information to capture user preferences, enabling more accurate and personalized recommendations. Despite notable advancements, existing approaches still face two critical limitations: first, shallow modality fusion often relies on simple concatenation, failing to exploit rich synergic intra- and inter-modal relationships; second, asymmetric feature treatment, where users are only characterized by interaction IDs while items benefit from rich multimodal content, hinders the learning of a shared semantic space. To address these issues, we propose a Cross-modal Recursive Attention Network with dual graph Embedding (CRANE). To tackle shallow fusion, we design a core Recursive Cross-Modal Attention (RCA) mechanism that iteratively refines modality features based on cross-correlations in a joint latent space, effectively capturing high-order intra- and inter-modal dependencies. For symmetric multimodal learning, we explicitly construct users' multimodal profiles by aggregating features of their interacted items. Furthermore, CRANE integrates a symmetric dual-graph framework, comprising a heterogeneous user-item interaction graph and a homogeneous item-item semantic graph, unified by a self-supervised contrastive learning objective to fuse behavioral and semantic signals. Despite these complex modeling capabilities, CRANE maintains high computational efficiency. Theoretical and empirical analyses confirm its scalability and high practical efficiency, achieving faster convergence on small datasets and superior performance ceilings on large-scale ones. Comprehensive experiments on four public real-world datasets validate an average 5% improvement in key metrics over state-of-the-art baselines.
zh

[AI-20] Do We Always Need Query-Level Workflows? Rethinking Agent ic Workflow Generation for Multi-Agent Systems

【速读】:该论文旨在解决多智能体系统(Multi-Agent Systems, MAS)中任务分解与协作流程生成的效率与可靠性问题,特别是现有方法在任务层面(task-level)和查询层面(query-level)生成工作流时的成本与性能权衡不明确的问题。研究发现,查询层面的工作流生成并非总是必要,因为少量最优的任务级工作流即可覆盖甚至超越更多查询场景;同时,基于完整执行的 exhaustive 任务级评估方式不仅消耗大量 token,且可靠性不足。为此,作者提出一种名为 SCALE 的低开销任务级生成框架,其核心创新在于通过少样本校准(few-shot calibration)实现对优化器的自我预测(self-prediction),从而替代传统全量执行验证,显著降低计算成本并保持高性能——实验表明,SCALE 在多个数据集上平均性能下降仅 0.61%,而总体 token 使用量减少高达 83%。
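“少量任务级工作流即可覆盖大部分查询”的观察,可以用贪心集合覆盖来示意:每轮选出能新解决最多未覆盖查询的工作流。以下工作流名称与成功矩阵均为虚构:

```python
def top_k_cover(success, k):
    # success[w] = 工作流 w 能正确处理的查询集合
    covered, chosen = set(), []
    pool = {w: set(qs) for w, qs in success.items()}
    for _ in range(k):
        # 贪心:选新增覆盖最多的工作流
        w = max(pool, key=lambda w: len(pool[w] - covered))
        chosen.append(w)
        covered |= pool[w]
    return chosen, covered

success = {
    "wf_plan_then_solve": {1, 2, 3, 4},
    "wf_debate": {3, 4, 5},
    "wf_single_agent": {1, 6},
}
chosen, covered = top_k_cover(success, k=2)
print(chosen, sorted(covered))
```

此例中 top-2 工作流已覆盖 6 个查询中的 5 个,这正是论文质疑逐查询生成必要性的直觉来源。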

链接: https://arxiv.org/abs/2601.11147
作者: Zixu Wang,Bingbing Xu,Yige Yuan,Huawei Shen,Xueqi Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 17 pages, 4 figures, 3 tables

点击查看摘要

Abstract:Multi-Agent Systems (MAS) built on large language models typically solve complex tasks by coordinating multiple agents through workflows. Existing approaches generate workflows either at task level or query level, but their relative costs and benefits remain unclear. After rethinking and empirical analyses, we show that query-level workflow generation is not always necessary, since a small set of top-K best task-level workflows together already covers equivalent or even more queries. We further find that exhaustive execution-based task-level evaluation is both extremely token-costly and frequently unreliable. Inspired by the idea of self-evolution and generative reward modeling, we propose a low-cost task-level generation framework SCALE, which stands for Self prediction of the optimizer with few-shot CALibration for Evaluation, instead of full validation execution. Extensive experiments demonstrate that SCALE maintains competitive performance, with an average degradation of just 0.61% compared to existing approaches across multiple datasets, while cutting overall token usage by up to 83%.
zh

[AI-21] Deep GraphRAG : A Balanced Approach to Hierarchical Retrieval and Adaptive Integration

【速读】:该论文旨在解决图结构检索增强生成(GraphRAG)框架中全局搜索的全面性与局部搜索效率之间的权衡问题,以及现有方法在大规模分层图中导航困难、检索路径优化不足和探索-利用动态平衡缺失等挑战。其解决方案的关键在于提出一种深度分层检索与自适应集成相结合的框架——Deep GraphRAG,核心创新包括:(1) 采用“由粗到细”的三级检索策略,即社区间过滤、社区级精炼与实体级细粒度搜索,实现宏观跨社区与微观内社区语境关系的有效融合;(2) 引入基于束搜索优化的动态重排序模块,持续筛选候选节点以平衡效率与全局覆盖;(3) 设计知识集成模块,利用轻量级大语言模型(LLM)结合动态权重奖励GRPO(DW-GRPO)训练方法,自动调整相关性、忠实性和简洁性三目标的权重,使1.5B参数模型性能逼近70B模型水平。
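三级“由粗到细”检索加束搜索的流程可概括为:先按社区摘要打分并保留 beam 个社区,再仅在保留社区内做实体级细粒度检索。以下打分函数与社区数据均为玩具假设:

```python
def score(query, text):
    # 玩具相关性:查询词在文本中的覆盖率
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / max(len(q), 1)

communities = {
    "finance": {"summary": "bank loan credit fraud",
                "entities": ["credit card fraud ring", "loan default model"]},
    "health": {"summary": "hospital patient triage",
               "entities": ["patient triage queue", "icu bed allocation"]},
}

def hierarchical_retrieve(query, beam=1, top_k=1):
    # 阶段 1/2:按社区摘要打分,束搜索只保留 beam 个社区(剪枝搜索空间)
    ranked = sorted(communities,
                    key=lambda c: -score(query, communities[c]["summary"]))
    kept = ranked[:beam]
    # 阶段 3:仅在保留社区内做实体级细粒度检索
    cands = [e for c in kept for e in communities[c]["entities"]]
    return sorted(cands, key=lambda e: -score(query, e))[:top_k]

print(hierarchical_retrieve("credit fraud detection"))
```

先剪枝社区再细检索,使实体级打分只发生在极小的候选集上,这就是“效率与全局覆盖”权衡的来源。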

链接: https://arxiv.org/abs/2601.11144
作者: Yuejie Li,Ke Yang,Tao Wang,Bolin Chen,Bowen Li,Chengjun Mao
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph-based Retrieval-Augmented Generation (GraphRAG) frameworks face a trade-off between the comprehensiveness of global search and the efficiency of local search. Existing methods are often challenged by navigating large-scale hierarchical graphs, optimizing retrieval paths, and balancing exploration-exploitation dynamics, frequently lacking robust multi-stage re-ranking. To overcome these deficits, we propose Deep GraphRAG, a framework designed for a balanced approach to hierarchical retrieval and adaptive integration. It introduces a hierarchical global-to-local retrieval strategy that integrates macroscopic inter-community and microscopic intra-community contextual relations. This strategy employs a three-stage process: (1) inter-community filtering, which prunes the search space using local context; (2) community-level refinement, which prioritizes relevant subgraphs via entity-interaction analysis; and (3) entity-level fine-grained search within target communities. A beam search-optimized dynamic re-ranking module guides this process, continuously filtering candidates to balance efficiency and global comprehensiveness. Deep GraphRAG also features a Knowledge Integration Module leveraging a compact LLM, trained with Dynamic Weighting Reward GRPO (DW-GRPO). This novel reinforcement learning approach dynamically adjusts reward weights to balance three key objectives: relevance, faithfulness, and conciseness. This training enables compact models (1.5B) to approach the performance of large models (70B) in the integration task. Evaluations on Natural Questions and HotpotQA demonstrate that Deep GraphRAG significantly outperforms baseline graph retrieval methods in both accuracy and efficiency.
zh

[AI-22] Learning Quadrupedal Locomotion for a Heavy Hydraulic Robot Using an Actuator Model

【速读】:该论文旨在解决大规模液压机器人在仿真到现实(sim-to-real)迁移过程中因液压系统固有的慢控制响应和复杂流体动力学特性所带来的挑战。由于多连杆气缸结构及各气缸间流体速率差异导致的复杂动力学,使得对所有关节进行高保真度仿真难以适用于强化学习(Reinforcement Learning, RL)场景。其解决方案的关键在于提出一种基于液压动力学的解析执行器模型(analytical actuator model),该模型可在1微秒内预测全部12个执行器的关节扭矩,从而实现RL环境中的快速计算;相比基于神经网络的模型,在数据有限条件下展现出更强的泛化能力,并成功将训练好的运动策略部署于超过300 kg的液压四足机器人上,首次实现了重型液压四足机器人稳定且鲁棒的命令跟踪运动控制的sim-to-real迁移。
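解析执行器模型的基本物理图像是:缸力由两腔压力与各自有效面积决定,关节扭矩再乘以瞬时力臂。以下单缸简化示例中的压力、缸径与力臂均为示意参数,并非论文模型:

```python
import math

def cylinder_force(p_head, p_rod, bore_d, rod_d):
    # 头腔/杆腔压力(Pa)作用在不同有效面积上
    a_head = math.pi * (bore_d / 2) ** 2
    a_rod = a_head - math.pi * (rod_d / 2) ** 2  # 杆腔面积扣除活塞杆
    return p_head * a_head - p_rod * a_rod

def joint_torque(force, moment_arm):
    # 关节扭矩 = 缸力 × 瞬时力臂(力臂随连杆几何变化,此处取常数示意)
    return force * moment_arm

# 示意参数:头腔 10 MPa、杆腔 2 MPa,缸径 40 mm,杆径 20 mm,力臂 5 cm
f = cylinder_force(10e6, 2e6, 0.040, 0.020)
tau = joint_torque(f, 0.05)
print(round(f, 1), round(tau, 2))
```

论文模型还需纳入多缸互联结构与流量动力学,这里仅演示“压力到扭矩”这一解析映射为何可以在微秒级完成。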

链接: https://arxiv.org/abs/2601.11143
作者: Minho Lee,Hyeonseok Kim,Jin Tak Kim,Sangshin Park,Jeong Hyun Lee,Jungsan Cho,Jemin Hwangbo
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 9 pages, Accepted to IEEE Robotics and Automation Letters (RA-L) 2025

点击查看摘要

Abstract:The simulation-to-reality (sim-to-real) transfer of large-scale hydraulic robots presents a significant challenge in robotics because of the inherent slow control response and complex fluid dynamics. The complex dynamics result from the multiple interconnected cylinder structure and the difference in fluid rates of the cylinders. These characteristics complicate detailed simulation for all joints, making it unsuitable for reinforcement learning (RL) applications. In this work, we propose an analytical actuator model driven by hydraulic dynamics to represent the complicated actuators. The model predicts joint torques for all 12 actuators in under 1 microsecond, allowing rapid processing in RL environments. We compare our model with neural network-based actuator models and demonstrate the advantages of our model in data-limited scenarios. The locomotion policy trained in RL with our model is deployed on a hydraulic quadruped robot, which is over 300 kg. This work is the first demonstration of a successful transfer of stable and robust command-tracking locomotion with RL on a heavy hydraulic quadruped robot, demonstrating advanced sim-to-real transferability.
zh

[AI-23] Context-aware Graph Causality Inference for Few-Shot Molecular Property Prediction

【速读】:该论文旨在解决分子属性预测在少样本(few-shot)场景下的挑战,即当仅有少量标注分子数据时,如何准确预测未见属性。传统基于上下文学习的方法虽能捕捉分子与属性之间的关系,但存在两大局限:一是未能利用功能基团(functional group)与属性间的因果关联,二是难以识别直接与属性相关的关键子结构。其解决方案的关键在于提出CaMol框架,该框架从因果推断视角出发,假设每个分子具有决定特定属性的潜在因果结构。核心创新包括:(1) 构建上下文图(context graph),通过连接功能基团、分子和属性来编码化学先验知识,引导发现因果子结构;(2) 设计可学习的原子掩码策略,以分离因果子结构与混杂因素;(3) 引入分布干预器(distribution intervener),结合因果子结构与化学合理的混杂因子进行后门调整,从而剥离真实化学变异中的因果效应。实验表明,CaMol在少样本任务中展现出更高的预测精度与样本效率,并且所发现的因果子结构高度契合化学知识,提升了模型的可解释性。

链接: https://arxiv.org/abs/2601.11135
作者: Van Thuy Hoang,O-Joun Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 15 pages

点击查看摘要

Abstract:Molecular property prediction is becoming one of the major applications of graph learning in Web-based services, e.g., online protein structure prediction and drug discovery. A key challenge arises in few-shot scenarios, where only a few labeled molecules are available for predicting unseen properties. Recently, several studies have used in-context learning to capture relationships among molecules and properties, but they face two limitations in: (1) exploiting prior knowledge of functional groups that are causally linked to properties and (2) identifying key substructures directly correlated with properties. We propose CaMol, a context-aware graph causality inference framework, to address these challenges by using a causal inference perspective, assuming that each molecule consists of a latent causal structure that determines a specific property. First, we introduce a context graph that encodes chemical knowledge by linking functional groups, molecules, and properties to guide the discovery of causal substructures. Second, we propose a learnable atom masking strategy to disentangle causal substructures from confounding ones. Third, we introduce a distribution intervener that applies backdoor adjustment by combining causal substructures with chemically grounded confounders, disentangling causal effects from real-world chemical variations. Experiments on diverse molecular datasets showed that CaMol achieved superior accuracy and sample efficiency in few-shot tasks, showing its generalizability to unseen properties. Also, the discovered causal substructures were strongly aligned with chemical knowledge about functional groups, supporting the model interpretability.
zh

[AI-24] Learn Before Represent: Bridging Generative and Contrastive Learning for Domain-Specific LLM Embeddings

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在垂直领域(如化学和法律)中表现不佳的问题,其核心瓶颈在于当前主流的“LLM+对比学习(LLM+CL)”范式仅关注语义对齐,而无法实现领域知识的获取,导致在专业术语理解上的失败。解决方案的关键在于提出一种两阶段框架——“先学习再表示(Learn Before Represent, LBR)”:第一阶段通过信息瓶颈约束的生成式学习注入领域知识,同时保留LLM的因果注意力机制以最大化知识吸收并压缩语义;第二阶段在压缩表示上进行生成式精炼的对比学习,实现语义对齐。该方法保持了架构一致性,并解决了生成与对比学习之间的目标冲突,显著提升了医疗、化学和代码检索等任务中的性能。

链接: https://arxiv.org/abs/2601.11124
作者: Xiaoyu Liang,Yuchen Peng,Jiale Luo,Wenhao Wang,Haoji Hu,Xincheng Zhou
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures

点击查看摘要

Abstract:Large Language Models (LLMs) adapted via contrastive learning excel in general representation learning but struggle in vertical domains like chemistry and law, primarily due to a lack of domain-specific knowledge. This work identifies a core bottleneck: the prevailing "LLM+CL" paradigm focuses on semantic alignment but cannot perform knowledge acquisition, leading to failures on specialized terminology. To bridge this gap, we propose Learn Before Represent (LBR), a novel two-stage framework. LBR first injects domain knowledge via an Information Bottleneck-Constrained Generative Learning stage, preserving the LLM's causal attention to maximize knowledge acquisition while compressing semantics. It then performs Generative-Refined Contrastive Learning on the compressed representations for alignment. This approach maintains architectural consistency and resolves the objective conflict between generative and contrastive learning. Extensive experiments on medical, chemistry, and code retrieval tasks show that LBR significantly outperforms strong baselines. Our work establishes a new paradigm for building accurate and robust representations in vertical domains.
zh

[AI-25] ReCreate: Reasoning and Creating Domain Agents Driven by Experience

【速读】:该论文旨在解决如何在真实场景中自动创建和适应领域代理(domain agents)的问题,当前多数代理仍依赖人工设计,因任务差异大而成本高昂。其核心挑战在于现有自动化方法多将代理生成视为黑箱过程,仅依赖最终性能指标进行优化,忽视了成功或失败背后的因果证据,且计算开销高。解决方案的关键是提出 ReCreate 框架,该框架基于经验驱动(experience-driven)理念,通过三个关键组件实现:(i) 经验存储与检索机制,用于按需分析代理交互历史;(ii) 推理-创造协同流水线(reasoning-creating synergy pipeline),将执行经验映射为代理结构的编辑;(iii) 分层更新机制,从实例级细节中抽象出可复用的领域模式(domain patterns)。这一策略显著提升了代理生成效率与效果,优于人类设计及现有自动化方法。

链接: https://arxiv.org/abs/2601.11100
作者: Zhezheng Hao,Hong Wang,Jian Luo,Jianqing Zhang,Yuyan Zhou,Qiang Lin,Can Wang,Hande Dong,Jiawei Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Model agents are reshaping the industrial landscape. However, most practical agents remain human-designed because tasks differ widely, making them labor-intensive to build. This situation poses a central question: can we automatically create and adapt domain agents in the wild? While several recent approaches have sought to automate agent creation, they typically treat agent generation as a black-box procedure and rely solely on final performance metrics to guide the process. Such strategies overlook critical evidence explaining why an agent succeeds or fails, and often require high computational costs. To address these limitations, we propose ReCreate, an experience-driven framework for the automatic creation of domain agents. ReCreate systematically leverages agent interaction histories, which provide rich concrete signals on both the causes of success or failure and the avenues for improvement. Specifically, we introduce an agent-as-optimizer paradigm that effectively learns from experience via three key components: (i) an experience storage and retrieval mechanism for on-demand inspection; (ii) a reasoning-creating synergy pipeline that maps execution experience into scaffold edits; and (iii) hierarchical updates that abstract instance-level details into reusable domain patterns. In experiments across diverse domains, ReCreate consistently outperforms human-designed agents and existing automated agent generation methods, even when starting from minimal seed scaffolds.
zh

[AI-26] MiCA: A Mobility-Informed Causal Adapter for Lightweight Epidemic Forecasting

【速读】:该论文旨在解决传染病传播预测中因人类移动数据噪声大、间接性强且难以与疾病记录可靠整合,以及病例时间序列短、时间分辨率粗导致传统依赖大量参数的移动感知预测模型效果受限的问题。解决方案的关键在于提出轻量级、架构无关的移动信息因果适配器(Mobility-Informed Causal Adapter, MiCA),其通过因果发现推断移动关系,并利用门控残差混合机制将这些空间结构信息融入时间预测模型,使轻量级模型能够在数据稀疏和噪声环境下选择性地利用移动衍生的空间结构,同时避免引入图神经网络或全注意力等复杂关系组件,从而在保持轻量化的同时显著提升预测准确性。
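门控残差混合的形式可以写成 y = (1-g)·时间预测 + g·空间聚合,其中空间聚合沿因果发现得到的移动关系加权。以下示例中的因果权重与门控值均为手工假设,非学习所得:

```python
import math

def gated_residual_mix(temporal_pred, neighbor_preds, causal_weights, gate_logit):
    # 因果发现推断出的移动关系 -> 邻区预测的加权聚合(空间信号)
    spatial = sum(w * p for w, p in zip(causal_weights, neighbor_preds))
    g = 1 / (1 + math.exp(-gate_logit))  # 门控决定采纳多少空间结构信息
    return (1 - g) * temporal_pred + g * spatial

# 地区 A 的时间模型预测 100 例;两个有移动关联的邻区预测 140、80
y = gated_residual_mix(100.0, [140.0, 80.0], [0.7, 0.3], gate_logit=0.0)
print(y)
```

当移动数据噪声较大时,门控可学习到更小的 g,退回到纯时间预测,这正是该设计在噪声条件下保持鲁棒的直觉。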

链接: https://arxiv.org/abs/2601.11089
作者: Suhan Guo,Jiahong Deng,Furao Shen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate forecasting of infectious disease dynamics is critical for public health planning and intervention. Human mobility plays a central role in shaping the spatial spread of epidemics, but mobility data are noisy, indirect, and difficult to integrate reliably with disease records. Meanwhile, epidemic case time series are typically short and reported at coarse temporal resolution. These conditions limit the effectiveness of parameter-heavy mobility-aware forecasters that rely on clean and abundant data. In this work, we propose the Mobility-Informed Causal Adapter (MiCA), a lightweight and architecture-agnostic module for epidemic forecasting. MiCA infers mobility relations through causal discovery and integrates them into temporal forecasting models via gated residual mixing. This design allows lightweight forecasters to selectively exploit mobility-derived spatial structure while remaining robust under noisy and data-limited conditions, without introducing heavy relational components such as graph neural networks or full attention. Extensive experiments on four real-world epidemic datasets, including COVID-19 incidence, COVID-19 mortality, influenza, and dengue, show that MiCA consistently improves lightweight temporal backbones, achieving an average relative error reduction of 7.5% across forecasting horizons. Moreover, MiCA attains performance competitive with SOTA spatio-temporal models while remaining lightweight.
zh

[AI-27] Visual Marker Search for Autonomous Drone Landing in Diverse Urban Environments

【速读】:该论文旨在解决当前基于标记的无人机自主着陆系统在复杂城市环境中鲁棒性不足的问题,尤其针对理想化假设(如完美可见性和传感器性能)导致的实际部署受限。其解决方案的关键在于构建一个基于AirSim平台的仿真评估套件,通过系统性地改变城市布局、光照和天气条件来模拟真实操作多样性,并在此基础上对比两种启发式覆盖策略与强化学习代理的表现,从而揭示探索策略与场景复杂度对成功率、路径效率及鲁棒性的综合影响,为开发可靠的空中导航系统提供实证依据。

链接: https://arxiv.org/abs/2601.11078
作者: Jiaohong Yao,Linfeng Liang,Yao Deng,Xi Zheng,Richard Han,Yuankai Qi
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Marker-based landing is widely used in drone delivery and return-to-base systems for its simplicity and reliability. However, most approaches assume idealized landing site visibility and sensor performance, limiting robustness in complex urban settings. We present a simulation-based evaluation suite on the AirSim platform with systematically varied urban layouts, lighting, and weather to replicate realistic operational diversity. Using onboard camera sensors (RGB for marker detection and depth for obstacle avoidance), we benchmark two heuristic coverage patterns and a reinforcement learning-based agent, analyzing how exploration strategy and scene complexity affect success rate, path efficiency, and robustness. Results underscore the need to evaluate marker-based autonomous landing under diverse, sensor-relevant conditions to guide the development of reliable aerial navigation systems.
zh

[AI-28] A3D: Adaptive Affordance Assembly with Dual-Arm Manipulation AAAI2026

【速读】:该论文旨在解决机器人在家具组装任务中面临的双重挑战:一是需要实现精确的双臂协同操作,其中一臂负责抓取和移动部件,另一臂提供协作支撑与稳定;二是要在长时程装配过程中动态调整支撑策略,并具备跨不同部件几何形状的泛化能力。解决方案的关键在于提出A3D框架,其核心创新是通过密集点级几何表示建模部件间的交互模式,从而学习自适应的可操作性(affordances)以识别最优支撑与稳定位置;同时引入一个自适应模块,利用交互反馈动态调整支撑策略,使机器人能够根据历史交互信息实时优化装配过程中的支撑行为。该方法在包含50种多样部件的仿真环境中验证了对多种家具类别和几何形状的有效泛化能力。

链接: https://arxiv.org/abs/2601.11076
作者: Jiaqi Liang,Yue Chen,Qize Yu,Yan Shen,Haipeng Zhang,Hao Dong,Ruihai Wu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: AAAI2026 oral

点击查看摘要

Abstract:Furniture assembly is a crucial yet challenging task for robots, requiring precise dual-arm coordination where one arm manipulates parts while the other provides collaborative support and stabilization. To accomplish this task more effectively, robots need to actively adapt support strategies throughout the long-horizon assembly process, while also generalizing across diverse part geometries. We propose A3D, a framework which learns adaptive affordances to identify optimal support and stabilization locations on furniture parts. The method employs dense point-level geometric representations to model part interaction patterns, enabling generalization across varied geometries. To handle evolving assembly states, we introduce an adaptive module that uses interaction feedback to dynamically adjust support strategies during assembly based on previous interactions. We establish a simulation environment featuring 50 diverse parts across 8 furniture types, designed for dual-arm collaboration evaluation. Experiments demonstrate that our framework generalizes effectively to diverse part geometries and furniture categories in both simulation and real-world settings.
zh

[AI-29] Bridging Cognitive Neuroscience and Graph Intelligence: Hippocampus-Inspired Multi-View Hypergraph Learning for Web Finance Fraud

【速读】:该论文旨在解决在线金融欺诈检测中两个关键挑战:一是欺诈行为的伪装(fraud camouflage),即恶意交易通过模仿正常行为来规避检测;二是长尾数据分布问题,导致罕见但危害严重的欺诈案例难以被识别。解决方案的核心在于提出一种受海马体启发的多视角超图学习模型(HIMVH),其关键创新包括:(1) 设计交叉视图不一致性感知模块,模拟海马体对场景冲突的监控机制,捕捉不同交易视图间的细微差异与行为异质性,从而识别隐蔽的欺诈行为;(2) 引入新颖性感知超图学习模块,借鉴CA1区的匹配-不匹配新颖性检测机制,通过度量特征偏离邻域期望的程度并自适应重加权消息传递,显著提升对长尾稀有欺诈模式的敏感性。
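新颖性感知重加权的直觉可示意如下:特征偏离邻域期望(均值)越远的消息,权重越高,从而放大长尾稀有模式的信号。注意具体公式与加权方向为笔者假设,并非论文原式:

```python
import math

def novelty_weights(neighbor_feats):
    # 偏差 = 特征与邻域均值(期望)的距离
    mean = sum(neighbor_feats) / len(neighbor_feats)
    devs = [abs(f - mean) for f in neighbor_feats]
    # softmax 将偏差归一化为消息权重:偏差越大,权重越高
    exps = [math.exp(d) for d in devs]
    z = sum(exps)
    return [e / z for e in exps]

# 三个常见行为模式 + 一个长尾稀有模式(5.0)
w = novelty_weights([1.0, 1.1, 0.9, 5.0])
print([round(x, 3) for x in w])
```

稀有样本(5.0)获得最大的消息权重,对应论文所说的“增强对长尾稀有欺诈模式的敏感性”。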

链接: https://arxiv.org/abs/2601.11073
作者: Rongkun Cui,Nana Zhang,Kun Zhu,Qi Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Online financial services constitute an essential component of contemporary web ecosystems, yet their openness introduces substantial exposure to fraud that harms vulnerable users and weakens trust in digital finance. Such threats have become a significant web harm that erodes societal fairness and affects the well being of online communities. However, existing detection methods based on graph neural networks (GNNs) struggle with two persistent challenges: (1) fraud camouflage, where malicious transactions mimic benign behaviors to evade detection, and (2) long-tailed data distributions, which obscure rare but critical fraudulent cases. To fill these gaps, we propose HIMVH, a Hippocampus-Inspired Multi-View Hypergraph learning model for web finance fraud detection. Specifically, drawing inspiration from the scene conflict monitoring role of the hippocampus, we design a cross-view inconsistency perception module that captures subtle discrepancies and behavioral heterogeneity across multiple transaction views. This module enables the model to identify subtle cross-view conflicts for detecting online camouflaged fraudulent behaviors. Furthermore, inspired by the match-mismatch novelty detection mechanism of the CA1 region, we introduce a novelty-aware hypergraph learning module that measures feature deviations from neighborhood expectations and adaptively reweights messages, thereby enhancing sensitivity to online rare fraud patterns in the long-tailed settings. Extensive experiments on six web-based financial fraud datasets demonstrate that HIMVH achieves 6.42% improvement in AUC, 9.74% in F1 and 39.14% in AP on average over 15 SOTA models.
zh

[AI-30] Fairness in Healthcare Processes: A Quantitative Analysis of Decision Making in Triage

【速读】:该论文旨在解决医疗场景中自动化决策公平性不足的问题,特别是在急诊分诊(emergency triage)这种高压力环境下,如何通过实证数据评估和识别潜在的不公平现象。其解决方案的关键在于提出一种基于流程挖掘(process mining)的方法,将真实世界事件日志(如MIMIC-IV ED中的MIMICEL数据)与正义理论(justice theory)的概念维度相连接,量化时间、重做、偏离和决策等过程结果,并利用Kruskal-Wallis检验、卡方检验及效应量分析来评估年龄、性别、种族、语言和保险状态等因素对这些结果的影响,从而构建一个可解释的公平性概念框架,为负责任的、公平感知的流程挖掘在医疗领域的应用提供实证支持。

链接: https://arxiv.org/abs/2601.11065
作者: Rachmadita Andreswari,Stephan A. Fahrenkrog-Petersen,Jan Mendling
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: conference

点击查看摘要

Abstract:Fairness in automated decision-making has become a critical concern, particularly in high-pressure healthcare scenarios such as emergency triage, where fast and equitable decisions are essential. Process mining is increasingly investigating fairness. There is a growing area focusing on fairness-aware algorithms. So far, we know less how these concepts perform on empirical healthcare data or how they cover aspects of justice theory. This study addresses this research problem and proposes a process mining approach to assess fairness in triage by linking real-life event logs with conceptual dimensions of justice. Using the MIMICEL event log (as derived from MIMIC-IV ED), we analyze time, re-do, deviation and decision as process outcomes, and evaluate the influence of age, gender, race, language and insurance using the Kruskal-Wallis, Chi-square and effect size measurements. These outcomes are mapped to justice dimensions to support the development of a conceptual framework. The results demonstrate which aspects of potential unfairness in high-acuity and sub-acute surface. In this way, this study contributes empirical insights that support further research in responsible, fairness-aware process mining in healthcare.
zh

[AI-31] Predicting Biased Human Decision-Making with Large Language Models in Conversational Settings

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)是否能够预测对话场景中人类决策的偏倚行为,以及其预测是否能捕捉到认知负荷对这些偏倚的影响。研究通过预注册实验(N = 1,648)让参与者在不同复杂度的对话中完成六项经典决策任务,发现人类表现出框架效应(Framing Effect)和现状偏倚(Status Quo Bias),且随着对话复杂度上升导致认知负荷增加,这两种偏倚显著增强,验证了“负荷-偏倚交互效应”。解决方案的关键在于利用LLMs(包括GPT-4、GPT-5及开源模型)基于人口统计信息和历史对话内容进行个体决策预测,结果表明:引入对话上下文的预测显著提升了准确性,并且模型预测重现了人类观察到的偏倚模式与负荷调节效应;其中GPT-4家族在预测准确性和人类偏倚模式拟合度上均优于其他模型,证明其作为模拟人类决策工具的有效性。

链接: https://arxiv.org/abs/2601.11049
作者: Stephen Pilli,Vivek Nallur
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted at ACM IUI 2026

点击查看摘要

Abstract:We examine whether large language models (LLMs) can predict biased decision-making in conversational settings, and whether their predictions capture not only human cognitive biases but also how those effects change under cognitive load. In a pre-registered study (N = 1,648), participants completed six classic decision-making tasks via a chatbot with dialogues of varying complexity. Participants exhibited two well-documented cognitive biases: the Framing Effect and the Status Quo Bias. Increased dialogue complexity resulted in participants reporting higher mental demand. This increase in cognitive load selectively, but significantly, increased the effect of the biases, demonstrating the load-bias interaction. We then evaluated whether LLMs (GPT-4, GPT-5, and open-source models) could predict individual decisions given demographic information and prior dialogue. While results were mixed across choice problems, LLM predictions that incorporated dialogue context were significantly more accurate in several key scenarios. Importantly, their predictions reproduced the same bias patterns and load-bias interactions observed in humans. Across all models tested, the GPT-4 family consistently aligned with human behavior, outperforming GPT-5 and open-source models in both predictive accuracy and fidelity to human-like bias patterns. These findings advance our understanding of LLMs as tools for simulating human decision-making and inform the design of conversational agents that adapt to user biases.
zh

[AI-32] AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

【速读】:该论文旨在解决当前基准测试对自主智能体(autonomous agents)能力评估的局限性问题,即现有评测体系多聚焦于单一能力,难以模拟真实世界中长期、复杂且需多步骤协作的任务场景;同时,依赖人工反馈导致自动化收集与评估效率低下,制约了模型的大规模部署与迭代优化。其解决方案的关键在于提出AgencyBench——一个基于日常AI使用场景构建的综合性基准,涵盖32个真实任务、138项具体指令与评分标准,每个任务平均涉及90次工具调用、百万级token处理及数小时执行时间。为实现自动化评估,研究引入用户模拟代理(user simulation agent)提供迭代反馈,并通过Docker沙箱环境进行可视化和功能导向的评分判定。实验表明,闭源模型显著优于开源模型(48.4% vs 32.1%),且不同模型在资源效率、反馈驱动自纠错能力和工具偏好等方面存在显著差异,进一步揭示了模型架构与代理框架协同优化的重要性。

链接: https://arxiv.org/abs/2601.11044
作者: Keyu Li,Junhao Shi,Yang Xiao,Mohan Jiang,Jie Sun,Yunze Wu,Shijie Xia,Xiaojie Cai,Tianze Xu,Weiye Si,Wenjie Li,Dequan Wang,Pengfei Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) based autonomous agents demonstrate multifaceted capabilities to contribute substantially to economic production. However, existing benchmarks remain focused on single agentic capability, failing to capture long-horizon real-world scenarios. Moreover, the reliance on human-in-the-loop feedback for realistic tasks creates a scalability bottleneck, hindering automated rollout collection and evaluation. To bridge this gap, we introduce AgencyBench, a comprehensive benchmark derived from daily AI usage, evaluating 6 core agentic capabilities across 32 real-world scenarios, comprising 138 tasks with specific queries, deliverables, and rubrics. These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve. To enable automated evaluation, we employ a user simulation agent to provide iterative feedback, and a Docker sandbox to conduct visual and functional rubric-based assessment. Experiments reveal that closed-source models significantly outperform open-source models (48.4% vs 32.1%). Further analysis reveals significant disparities across models in resource efficiency, feedback-driven self-correction, and specific tool-use preferences. Finally, we investigate the impact of agentic scaffolds, observing that proprietary models demonstrate superior performance within their native ecosystems (e.g., Claude-4.5-Opus via Claude-Agent-SDK), while open-source models exhibit distinct performance peaks, suggesting potential optimization for specific execution frameworks. AgencyBench serves as a critical testbed for next-generation agents, highlighting the necessity of co-optimizing model architecture with agentic frameworks. We believe this work sheds light on the future direction of autonomous agents, and we release the full benchmark and evaluation toolkit at this https URL.
zh

[AI-33] BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search

【速读】:该论文旨在解决基于强化学习(Reinforcement Learning, RL)的智能体搜索(agentic search)在实际应用中可靠性不足的问题,即智能体在缺乏足够证据或推理达到极限时仍倾向于生成看似合理但不可靠的答案,而非主动选择“我不知道”(I DON’T KNOW, IDK)。为提升智能体对自身推理边界的认知能力,作者提出边界感知策略优化(Boundary-Aware Policy Optimization, BAPO),其核心创新在于:(i) 引入分组式的边界感知奖励机制,仅在推理确实达到极限时鼓励IDK响应;(ii) 设计自适应奖励调节器,在早期探索阶段动态抑制该奖励,防止模型将IDK作为规避困难任务的捷径。实验表明,BAPO可在不牺牲准确性的前提下显著增强智能体搜索的整体可靠性。

链接: https://arxiv.org/abs/2601.11037
作者: Shiyu Liu,Yongjing Yin,Jianhao Yan,Yunbo Tang,Qinggang Zhang,Bei Li,Xin Chen,Jingang Wang,Xunliang Cai,Jinsong Su
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Code is available at this https URL

点击查看摘要

Abstract:RL-based agentic search enables LLMs to solve complex questions via dynamic planning and external search. While this approach significantly enhances accuracy with agent policies optimized via large-scale reinforcement learning, we identify a critical gap in reliability: these agents fail to recognize their reasoning boundaries and rarely admit “I DON’T KNOW” (IDK) even when evidence is insufficient or reasoning reaches its limit. The lack of reliability often leads to plausible but unreliable answers, introducing significant risks in many real-world scenarios. To this end, we propose Boundary-Aware Policy Optimization (BAPO), a novel RL framework designed to cultivate reliable boundary awareness without compromising accuracy. BAPO introduces two key components: (i) a group-based boundary-aware reward that encourages an IDK response only when the reasoning reaches its limit, and (ii) an adaptive reward modulator that strategically suspends this reward during early exploration, preventing the model from exploiting IDK as a shortcut. Extensive experiments on four benchmarks demonstrate that BAPO substantially enhances the overall reliability of agentic search.
zh
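下面用一个极简的 Python 草图示意摘要中"分组式边界感知奖励"的思路(假设性实现,函数名与具体奖励数值均为示例,并非论文原始公式):同一问题的一组 rollout 中只要存在正确答案,IDK 就不再获得奖励,从而防止模型把 IDK 当作规避困难任务的捷径。

```python
def boundary_aware_reward(group):
    """group 为同一问题的一组 rollout:[(answer, is_correct), ...]。
    组内有人答对 -> IDK 记 0 分(推理尚未到达极限,不鼓励弃答);
    组内全部答错 -> IDK 记正分(推理确实到达边界)。"""
    any_correct = any(correct for ans, correct in group if ans != "IDK")
    rewards = []
    for ans, correct in group:
        if ans == "IDK":
            rewards.append(0.0 if any_correct else 0.5)
        else:
            rewards.append(1.0 if correct else -1.0)
    return rewards

# 组内存在正确答案:IDK 不得分
print(boundary_aware_reward([("A", True), ("IDK", False)]))   # [1.0, 0.0]
# 组内全部失败:IDK 获得正奖励
print(boundary_aware_reward([("B", False), ("IDK", False)]))  # [-1.0, 0.5]
```

论文中的"自适应奖励调节器"可理解为在训练早期将上面 IDK 的正奖励置零,此处未展开。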

[AI-34] Combating Spurious Correlations in Graph Interpretability via Self-Reflection

【速读】:该论文旨在解决在具有强虚假相关性(spurious correlations)的图结构数据上提升可解释性图学习(interpretable graph learning)性能的问题,尤其针对ICLR 2022提出的Spurious-Motif基准数据集。该数据集通过设计误导性模式使模型难以区分真正重要的结构,导致现有方法表现显著下降。论文的关键解决方案是引入一种自省(self-reflection)框架,该框架将原始方法生成的节点和边重要性评分作为反馈输入回原模型,进行迭代式再评估,从而增强对真实相关结构的识别能力。这一机制借鉴了大语言模型中用于复杂任务的自省提示策略,并从图表示学习角度分析其有效性,进一步提出基于该反馈机制的微调训练方法以优化模型性能。

链接: https://arxiv.org/abs/2601.11021
作者: Kecheng Cai,Chenyang Xu,Chao Peng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Interpretable graph learning has recently emerged as a popular research topic in machine learning. The goal is to identify the important nodes and edges of an input graph that are crucial for performing a specific graph reasoning task. A number of studies have been conducted in this area, and various benchmark datasets have been proposed to facilitate evaluation. Among them, one of the most challenging is the Spurious-Motif benchmark, introduced at ICLR 2022. The datasets in this synthetic benchmark are deliberately designed to include spurious correlations, making it particularly difficult for models to distinguish truly relevant structures from misleading patterns. As a result, existing methods exhibit significantly worse performance on this benchmark compared to others. In this paper, we focus on improving interpretability on the challenging Spurious-Motif datasets. We demonstrate that the self-reflection technique, commonly used in large language models to tackle complex tasks, can also be effectively adapted to enhance interpretability in datasets with strong spurious correlations. Specifically, we propose a self-reflection framework that can be integrated with existing interpretable graph learning methods. When such a method produces importance scores for each node and edge, our framework feeds these predictions back into the original method to perform a second round of evaluation. This iterative process mirrors how large language models employ self-reflective prompting to reassess their previous outputs. We further analyze the reasons behind this improvement from the perspective of graph representation learning, which motivates us to propose a fine-tuning training method based on this feedback mechanism. 
zh
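摘要中的自省框架可以概括为一个反馈循环:把上一轮产生的节点/边重要性评分再次交给原方法,进行第二轮评估。下面是一个假设性的最小 Python 草图(`toy_score` 为虚构的玩具评分函数,仅用于说明迭代结构,并非论文实现):

```python
def self_reflect(score_fn, elements, rounds=2):
    """自省框架的最小示意:将上一轮评分作为反馈再次输入评分函数,
    迭代 rounds 轮。score_fn(elements, prev_scores) -> scores。"""
    scores = None
    for _ in range(rounds):
        scores = score_fn(elements, scores)
    return scores

def toy_score(elements, prev):
    """玩具评分函数:首轮给出基础分;后续轮次用上一轮评分调制基础分,
    使低置信度元素的评分被进一步压低。"""
    base = {e: 1.0 / (1 + i) for i, e in enumerate(elements)}
    if prev is None:
        return base
    return {e: base[e] * prev[e] for e in elements}

s = self_reflect(toy_score, ["edge_1", "edge_2"], rounds=2)
print(s)  # {'edge_1': 1.0, 'edge_2': 0.25}
```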

[AI-35] Efficient Protein Optimization via Structure-aware Hamiltonian Dynamics

【速读】:该论文旨在解决蛋白质序列优化中因上位效应(epistasis)和结构约束忽视所导致的高维复杂性问题,从而实现高效设计具有特定功能和结构特性的蛋白质变体。其解决方案的关键在于提出一种基于哈密顿蒙特卡洛(Hamiltonian Monte Carlo)的贝叶斯优化方法 HADES,通过引入物理模拟中的动量和不确定性来加速候选序列在连续状态空间中的探索,并结合位置离散化策略将连续状态映射为离散蛋白序列;同时,采用两阶段编码器-解码器框架构建结构-功能关系的代理模型,学习平滑的适应度景观以指导采样,从而有效利用蛋白质序列与结构之间的相互约束,提升优化效率与设计质量。

链接: https://arxiv.org/abs/2601.11012
作者: Jiahao Wang,Shuangjia Zheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The ability to engineer optimized protein variants has transformative potential for biotechnology and medicine. Prior sequence-based optimization methods struggle with the high-dimensional complexities due to the epistasis effect and the disregard for structural constraints. To address this, we propose HADES, a Bayesian optimization method utilizing Hamiltonian dynamics to efficiently sample from a structure-aware approximated posterior. Leveraging momentum and uncertainty in the simulated physical movements, HADES enables rapid transition of proposals toward promising areas. A position discretization procedure is introduced to propose discrete protein sequences from such a continuous state system. The posterior surrogate is powered by a two-stage encoder-decoder framework to determine the structure and function relationships between mutant neighbors, consequently learning a smoothed landscape to sample from. Extensive experiments demonstrate that our method outperforms state-of-the-art baselines in in-silico evaluations across most metrics. Remarkably, our approach offers a unique advantage by leveraging the mutual constraints between protein structure and sequence, facilitating the design of protein sequences with similar structures and optimized properties. The code and data are publicly available at this https URL.
zh
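HADES 的核心是利用哈密顿动力学在连续状态空间中加速采样。下面给出标量情形下哈密顿动力学 leapfrog 积分的教科书式 Python 草图(仅示意该类方法的积分步骤,并非 HADES 的实际实现;势能取谐振子 U(q) = q²/2 作为假设示例):

```python
def leapfrog(q, p, grad_U, eps=0.1, steps=10):
    """leapfrog 积分:q 为位置(此处对应连续化的序列状态),
    p 为动量,grad_U 为势能梯度;半步动量 + 交替全步位置/动量。"""
    p -= 0.5 * eps * grad_U(q)
    for i in range(steps):
        q += eps * p
        g = grad_U(q)
        p -= eps * g if i < steps - 1 else 0.5 * eps * g
    return q, p

# 谐振子势 U(q) = q^2 / 2,哈密顿量 H = U(q) + p^2 / 2 应近似守恒
grad_U = lambda q: q
q0, p0 = 1.0, 0.0
q1, p1 = leapfrog(q0, p0, grad_U)
H0 = 0.5 * q0 ** 2 + 0.5 * p0 ** 2
H1 = 0.5 * q1 ** 2 + 0.5 * p1 ** 2
print(abs(H1 - H0) < 0.02)  # True:能量误差为 O(eps^2)
```

论文中的"位置离散化"步骤则是在此类连续轨迹之上,把连续状态映射回离散蛋白序列,此处未展开。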

[AI-36] Beyond Max Tokens: Stealthy Resource Amplification via Tool Calling Chains in LLM Agents

【速读】:该论文旨在解决现代大语言模型(Large Language Model, LLM)代理系统中代理-工具通信环路(agent-tool communication loop)所面临的一种新型拒绝服务(Denial-of-Service, DoS)攻击问题。现有DoS攻击主要依赖用户提示或注入检索增强生成(Retrieval-Augmented Generation, RAG)上下文,属于单轮且缺乏任务导向性,难以在多轮代理-工具交互中隐蔽地放大计算成本。解决方案的关键在于设计一种隐蔽的、多轮经济型DoS攻击:通过在兼容Model Context Protocol (MCP) 的工具服务器中调整文本可见字段和模板驱动的返回策略,同时保持函数签名不变并确保最终输出正确,从而诱导代理进入冗长、冗余的工具调用序列;该过程利用蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)优化器对修改进行高效探索,使攻击在不触发传统验证机制的前提下显著增加token数量、GPU KV缓存占用和能耗,最高可提升成本达658倍、能量消耗100–560倍,并大幅降低并发吞吐量。此方法将代理-工具接口提升至首要安全边界,要求从仅验证最终答案转向监控整个代理流程的经济与计算开销。

链接: https://arxiv.org/abs/2601.10955
作者: Kaiyu Zhou,Yongsen Zheng,Yicheng He,Meng Xue,Xueluan Gong,Yuji Wang,Kwok-Yan Lam
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The agent-tool communication loop is a critical attack surface in modern Large Language Model (LLM) agents. Existing Denial-of-Service (DoS) attacks, primarily triggered via user prompts or injected retrieval-augmented generation (RAG) context, are ineffective for this new paradigm. They are fundamentally single-turn and often lack a task-oriented approach, making them conspicuous in goal-oriented workflows and unable to exploit the compounding costs of multi-turn agent-tool interactions. We introduce a stealthy, multi-turn economic DoS attack that operates at the tool layer under the guise of a correctly completed task. Our method adjusts text-visible fields and a template-governed return policy in a benign, Model Context Protocol (MCP)-compatible tool server, optimizing these edits with a Monte Carlo Tree Search (MCTS) optimizer. These adjustments leave function signatures unchanged and preserve the final payload, steering the agent into prolonged, verbose tool-calling sequences using text-only notices. This compounds costs across turns, escaping single-turn caps while keeping the final answer correct to evade validation. Across six LLMs on the ToolBench and BFCL benchmarks, our attack expands tasks into trajectories exceeding 60,000 tokens, inflates costs by up to 658x, and raises energy by 100-560x. It drives GPU KV cache occupancy from 1% to 35-74% and cuts co-running throughput by approximately 50%. Because the server remains protocol-compatible and task outcomes are correct, conventional checks fail. These results elevate the agent-tool interface to a first-class security frontier, demanding a paradigm shift from validating final answers to monitoring the economic and computational cost of the entire agentic process.
zh

[AI-37] What Matters in Data Curation for Multimodal Reasoning? Insights from the DCVLR Challenge

【速读】:该论文旨在解决多模态推理(Multimodal Reasoning)中的数据筛选问题,即如何通过高效的数据选择策略提升模型性能,而非单纯依赖大规模数据集。其核心解决方案在于:在固定模型架构和训练协议的前提下,基于对齐的基准数据集进行难度导向的样本选择(Difficulty-based Example Selection),从而显著提升模型在多模态推理任务上的表现。实验证明,这种策略是性能提升的主要驱动力,而单纯扩大数据规模仅能降低训练波动性,多样性增强与合成数据扩充等常见方法不仅无益,反而可能损害性能,表明该任务已进入饱和区间(Saturation Regime),强调了数据质量(尤其是对齐程度和难度分布)的重要性。

链接: https://arxiv.org/abs/2601.10922
作者: Yosub Shin,Michael Buriek,Boris Sobolev,Pavel Bushuyeu,Vikas Kumar,Haoyang Xu,Samuel Watson,Igor Molybog
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We study data curation for multimodal reasoning through the NeurIPS 2025 Data Curation for Vision-Language Reasoning (DCVLR) challenge, which isolates dataset selection by fixing the model and training protocol. Using a compact curated dataset derived primarily from Walton Multimodal Cold Start, our submission placed first in the challenge. Through post-competition ablations, we show that difficulty-based example selection on an aligned base dataset is the dominant driver of performance gains. Increasing dataset size does not reliably improve mean accuracy under the fixed training recipe, but mainly reduces run-to-run variance, while commonly used diversity and synthetic augmentation heuristics provide no additional benefit and often degrade performance. These results characterize DCVLR as a saturation-regime evaluation and highlight the central role of alignment and difficulty in data-efficient multimodal reasoning.
zh

[AI-38] ARC Prize 2025: Technical Report

【速读】:该论文旨在解决当前前沿人工智能在抽象推理与通用智能(AGI)评估中面临的局限性问题,特别是现有模型对知识覆盖的依赖导致的性能瓶颈及基准污染(benchmark contamination)现象。其核心解决方案在于提出并分析“精炼循环”(refinement loop)机制的作用,即通过任务级迭代优化过程(如进化程序合成或应用层反馈调整)提升模型在少样本场景下的泛化能力;同时指出,尽管基于权重空间优化的零预训练深度学习方法已展现竞争力(参数量仅7M),但真正突破仍需引入交互式推理挑战——这正是ARC-AGI-3所聚焦的方向,包括探索、规划、记忆、目标获取与对齐等关键能力。

链接: https://arxiv.org/abs/2601.10904
作者: François Chollet,Mike Knoop,Gregory Kamradt,Bryan Landers
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The ARC-AGI benchmark series serves as a critical measure of few-shot generalization on novel tasks, a core aspect of intelligence. The ARC Prize 2025 global competition targeted the newly released ARC-AGI-2 dataset, which features greater task complexity compared to its predecessor. The Kaggle competition attracted 1,455 teams and 15,154 entries, with the top score reaching 24% on the ARC-AGI-2 private evaluation set. Paper submissions nearly doubled year-over-year to 90 entries, reflecting the growing research interest in fluid intelligence and abstract reasoning. The defining theme of 2025 is the emergence of the refinement loop – a per-task iterative program optimization loop guided by a feedback signal. Refinement loops come in a variety of forms, in particular evolutionary program synthesis approaches and application-layer refinements to commercial AI systems. Such refinement loops are also possible in weight space, as evidenced by zero-pretraining deep learning methods which are now achieving competitive performance with remarkably small networks (7M parameters). In parallel, four frontier AI labs (Anthropic, Google DeepMind, OpenAI, and xAI) reported ARC-AGI performance in public model cards in 2025, establishing ARC-AGI as an industry standard benchmark for AI reasoning. However, our analysis indicates that current frontier AI reasoning performance remains fundamentally constrained to knowledge coverage, giving rise to new forms of benchmark contamination. In this paper, we survey the top-performing methods, examine the role of refinement loops in AGI progress, discuss knowledge-dependent overfitting, and preview ARC-AGI-3, which introduces interactive reasoning challenges that require exploration, planning, memory, goal acquisition, and alignment capabilities.
zh

[AI-39] Approximately Optimal Global Planning for Contact-Rich SE(2) Manipulation on a Graph of Reachable Sets

【速读】:该论文旨在解决接触丰富操作(Contact-Rich Manipulation, CRM)中路径规划的优化问题,即如何在保证可行性的同时实现全局最优的机械臂运动规划,以充分发挥CRM相较于仅依赖末端执行器(如指尖)操作的优势。其解决方案的关键在于提出一种两阶段新范式:离线构建一个互达集图(mutual reachable sets graph),其中每个节点表示从特定初始物体姿态和夹持状态可达的所有物体姿态集合;在线阶段在此图上进行规划,有效计算并序列化局部计划,从而实现全局优化的运动轨迹。该方法显著提升了任务效率与成功率,并具备实时性,使CRM在实际应用中成为可行方案。

链接: https://arxiv.org/abs/2601.10827
作者: Simin Liu,Tong Zhao,Bernhard Paus Graesdal,Peter Werner,Jiuguang Wang,John Dolan,Changliu Liu,Tao Pang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 17 pages, 14 figures; under submission to IEEE Transactions on Robotics

点击查看摘要

Abstract:If we consider human manipulation, it is clear that contact-rich manipulation (CRM)-the ability to use any surface of the manipulator to make contact with objects-can be far more efficient and natural than relying solely on end-effectors (i.e., fingertips). However, state-of-the-art model-based planners for CRM are still focused on feasibility rather than optimality, limiting their ability to fully exploit CRM’s advantages. We introduce a new paradigm that computes approximately optimal manipulator plans. This approach has two phases. Offline, we construct a graph of mutual reachable sets, where each set contains all object orientations reachable from a starting object orientation and grasp. Online, we plan over this graph, effectively computing and sequencing local plans for globally optimized motion. On a challenging, representative contact-rich task, our approach outperforms a leading planner, reducing task cost by 61%. It also achieves a 91% success rate across 250 queries and maintains sub-minute query times, ultimately demonstrating that globally optimized contact-rich manipulation is now practical for real-world tasks.
zh

[AI-40] Digital Metabolism: Decoupling Logic from Facts via Regenerative Unlearning – Towards a Pure Neural Logic Core

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)中存在的参数纠缠问题,即通用推理能力(逻辑)与特定事实知识(事实)在共享权重中处于叠加态,导致计算资源被浪费在模拟检索过程上,进而引发幻觉现象。解决方案的关键在于提出“数字代谢”(digital metabolism)这一热力学假说,并设计了再生逻辑核协议(Regenerative Logic-Core Protocol, RLCP),通过双流训练框架结合深层梯度反转机制,使特定事实依赖关系线性不可解,从而实现目标事实关联的近零保留。实验表明,经RLCP处理后的模型在保持低事实记忆准确率(<7%)的同时,自发涌现出链式思维(Chain-of-Thought, CoT)结构,暗示其从直接联想回忆(O(1))向显式推理(O(N))转变,为构建模块化“神经中央处理器 + 符号内存”架构提供了动态权重层面的实现路径。

链接: https://arxiv.org/abs/2601.10810
作者: Mengmeng Peng,Zhenyu Fang,He Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) currently suffer from parameter entanglement, where general reasoning capabilities (logic) and specific factual knowledge (facts) exist in a superposition state within shared weights. This coupling leads to the “memory wall,” where computational capacity is squandered on simulating retrieval, often resulting in hallucinations. In this paper, we propose “digital metabolism,” a thermodynamic hypothesis suggesting that targeted forgetting is necessary for distilling a pure neural logic core. To validate this hypothesis, we introduce the Regenerative Logic-Core Protocol (RLCP), a dual-stream training framework that renders specific factual dependencies linearly undecodable via deep-layer gradient reversal. Applying RLCP to Qwen2.5-0.5B, we observe a distinct phase transition: the model achieves near-zero retention of targeted factual associations (Accuracy < 7%) while exhibiting changes consistent with an emergent “structural crystallization” effect. Empirical analysis on GSM8K reveals that the “metabolized” model spontaneously adopts chain-of-thought (CoT) scaffolding, which we interpret as compensating for the loss of direct associative recall (shifting from O(1) recall to O(N) reasoning). While the causal mechanism underlying this behavioral shift requires further investigation, our findings provide a dynamic weight-level counterpart to architectural innovations like DeepSeek’s Engram, paving the way for modular “Neural CPU + Symbolic RAM” architectures.
zh
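摘要中的"深层梯度反转"是实现定向遗忘的关键算子。下面用纯 Python 给出梯度反转层的最小示意(假设的简化实现,非 RLCP 原始代码):前向恒等传递,反向将梯度乘以 -lam,使事实解码目标的梯度在共享层中被反转。

```python
class GradReverse:
    """梯度反转层的最小示意:前向不改变激活,
    反向把梯度取负并按系数 lam 缩放,从而"代谢"掉特定事实关联。"""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return list(x)  # 前向恒等

    def backward(self, grad_out):
        return [-self.lam * g for g in grad_out]  # 反向取负并缩放

grl = GradReverse(lam=0.5)
print(grl.forward([1.0, -2.0]))   # [1.0, -2.0]
print(grl.backward([1.0, 1.0]))   # [-0.5, -0.5]
```

在实际深度学习框架中,这一算子通常以自定义 autograd 函数的形式插入指定层之后。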

[AI-41] Unified Optimization of Source Weights and Transfer Quantities in Multi-Source Transfer Learning: An Asymptotic Framework

【速读】:该论文旨在解决多源迁移学习中因盲目均匀迁移导致的负迁移问题,以及现有方法通常仅优化源权重或转移样本量、而忽视二者联合优化的局限性。其解决方案的关键在于提出统一权重与数量的优化框架(Unified Optimization of Weights and Quantities, UOWQ),将多源迁移学习建模为参数估计问题,并基于Kullback-Leibler散度的泛化误差分析,理论上证明在权重适当调整时使用全部源样本是最优的;同时推导出单源情况下的闭式解和多源情况下的凸优化数值求解方法,从而实现源权重与转移样本量的协同最优配置。

链接: https://arxiv.org/abs/2601.10779
作者: Qingyue Zhang,Chang Chu,Haohao Fu,Tianren Peng,Yanru Wu,Guanbo Huang,Yang Li,Shao-Lun Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transfer learning plays a vital role in improving model performance in data-scarce scenarios. However, naive uniform transfer from multiple source tasks may result in negative transfer, highlighting the need to properly balance the contributions of heterogeneous sources. Moreover, existing transfer learning methods typically focus on optimizing either the source weights or the amount of transferred samples, while largely neglecting the joint consideration of the other. In this work, we propose a theoretical framework, Unified Optimization of Weights and Quantities (UOWQ), which formulates multi-source transfer learning as a parameter estimation problem grounded in an asymptotic analysis of a Kullback-Leibler divergence-based generalization error measure. The proposed framework jointly determines the optimal source weights and optimal transfer quantities for each source task. Firstly, we prove that using all available source samples is always optimal once the weights are properly adjusted, and we provide a theoretical explanation for this phenomenon. Moreover, to determine the optimal transfer weights, our analysis yields closed-form solutions in the single-source setting and develops a convex optimization-based numerical procedure for the multi-source case. Building on the theoretical results, we further propose practical algorithms for both multi-source transfer learning and multi-task learning settings. Extensive experiments on real-world benchmarks, including DomainNet and Office-Home, demonstrate that UOWQ consistently outperforms strong baselines. The results validate both the theoretical predictions and the practical effectiveness of our framework.
zh
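UOWQ 的核心是依据 KL 散度度量的泛化误差为各源任务分配权重。下面给出一个假设性的示意:按各源与目标分布之间的 KL 散度估计值做 softmin 归一化,散度越小权重越大。这只是帮助理解"权重随源-目标差异调整"这一思想的草图,并非论文推导出的闭式解或凸优化程序。

```python
import math

def source_weights(kl_divs, temperature=1.0):
    """kl_divs: 各源任务与目标分布之间 KL 散度的估计值列表。
    返回归一化的源权重(和为 1,随散度单调递减)。"""
    scores = [math.exp(-d / temperature) for d in kl_divs]
    z = sum(scores)
    return [s / z for s in scores]

w = source_weights([0.1, 0.5, 2.0])
print(w)  # 三个源的权重依次递减,且和为 1
```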

[AI-42] LogicLens: Leveraging Semantic Code Graph to explore Multi Repository large systems

【速读】:该论文旨在解决大型软件系统(尤其是跨多个代码仓库和微服务的系统)中开发者难以理解其结构、领域逻辑及运行时行为的问题。这些问题通常隐含且分散,导致开发效率低下。解决方案的关键在于构建一个语义多仓库图(semantic multi-repository graph),该图通过语法分析(如AST解析和仓库遍历)与基于大语言模型(Large Language Models, LLMs)的语义增强相结合,在预处理阶段生成,从而同时捕捉结构元素(如文件、类、函数)和功能抽象(如领域实体、操作、工作流)。在此基础上,LogicLens作为反应式对话代理,使开发者能够通过自然语言交互动态检索相关子图并回答技术或功能问题,从而实现对复杂系统的高效探索与推理。

链接: https://arxiv.org/abs/2601.10773
作者: Niko Usai,Dario Montagnini,Kristian Ilianov Iliev,Raffaele Camanzo
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding large software systems is a challenging task, especially when code is distributed across multiple repositories and microservices. Developers often need to reason not only about the structure of the code, but also about its domain logic and runtime behaviors, which are typically implicit and scattered. We introduce LogicLens, a reactive conversational agent that assists developers in exploring complex software systems through a semantic multi-repository graph. This graph is built in a preprocessing step by combining syntactic code analysis, via AST parsing and repository traversal, with semantic enrichment using Large Language Models (LLMs). The resulting graph captures both structural elements, such as files, classes, and functions, as well as functional abstractions like domain entities, operations, and workflows. Once the graph is constructed, LogicLens enables developers to interact with it via natural language, dynamically retrieving relevant subgraphs and answering technical or functional queries. We present the architecture of the system, discuss emergent behaviors, and evaluate its effectiveness on real-world multi-repository scenarios. We demonstrate emergent capabilities including impact analysis and symptom-based debugging that arise naturally from the semantic graph structure.
zh
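LogicLens 预处理阶段的句法分析部分(AST 解析与仓库遍历)可以用 Python 标准库 `ast` 给出一个极简示意:从源码中抽取"类 -> 方法 / 函数"的结构节点。这只是一个假设性草图,并非 LogicLens 的实际实现,其语义增强(LLM 标注领域实体、工作流)不在此示例范围内。

```python
import ast

def code_graph(source: str):
    """构建一个极简的结构图:记录每个类的方法名与全部函数名。"""
    tree = ast.parse(source)
    graph = {"classes": {}, "functions": []}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            graph["classes"][node.name] = [
                n.name for n in node.body if isinstance(n, ast.FunctionDef)
            ]
        elif isinstance(node, ast.FunctionDef):
            # 注意:ast.walk 递归遍历,类中的方法也会出现在这里
            graph["functions"].append(node.name)
    return graph

g = code_graph("class A:\n    def m(self): pass\n\ndef f(): pass\n")
print(g["classes"])  # {'A': ['m']}
```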

[AI-43] Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers

【速读】:该论文旨在解决传统语音系统中因任务分离导致的可扩展性差、效率低以及跨任务泛化能力弱的问题,这些问题通常源于为文本到语音(TTS)、自动语音识别(ASR)和语音转换(VC)分别设计独立模型所形成的碎片化流水线。解决方案的关键在于提出一个统一的音频基础模型 General-Purpose Audio (GPA),其核心创新是基于共享离散音频标记空间的全自回归架构,通过联合多任务训练实现 TTS、ASR 和 VC 的端到端集成,并支持指令驱动的任务诱导,无需修改模型结构即可灵活切换任务。这一设计不仅提升了跨任务的通用性与效率,还实现了高并发推理与多尺度部署能力,包括面向边缘计算场景的轻量化 0.3B 参数版本,从而在保持高性能的同时满足低延迟实际部署需求。

链接: https://arxiv.org/abs/2601.10770
作者: Runyuan Cai,Yu Lin,Yiming Wang,Chunlin Fu,Xiaodong Zeng
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Traditional speech systems typically rely on separate, task-specific models for text-to-speech (TTS), automatic speech recognition (ASR), and voice conversion (VC), resulting in fragmented pipelines that limit scalability, efficiency, and cross-task generalization. In this paper, we present General-Purpose Audio (GPA), a unified audio foundation model that integrates multiple core speech tasks within a single large language model (LLM) architecture. GPA operates on a shared discrete audio token space and supports instruction-driven task induction, enabling a single autoregressive model to flexibly perform TTS, ASR, and VC without architectural modifications. This unified design combines a fully autoregressive formulation over discrete speech tokens, joint multi-task training across speech domains, and a scalable inference pipeline that achieves high concurrency and throughput. The resulting model family supports efficient multi-scale deployment, including a lightweight 0.3B-parameter variant optimized for edge and resource-constrained environments. Together, these design choices demonstrate that a unified autoregressive architecture can achieve competitive performance across diverse speech tasks while remaining viable for low-latency, practical deployment.
zh

[AI-44] Optimisation of complex product innovation processes based on trend models with three-valued logic

【速读】:该论文旨在解决复杂产品创新过程建模与分析的问题,其核心挑战在于如何在信息有限的情况下捕捉系统动态行为。解决方案的关键在于引入基于启发式规则的趋势模型(trend model),其中每个启发式通过单调趋势(递增、递减或恒定)来量化系统演化特征,从而避免对数值数据或粗糙集的依赖;进一步地,将解定义为一组可能的状态转换场景,并以转移图(transition graph)表示,使得系统的未来或过去行为均可通过该图中的路径进行刻画。

链接: https://arxiv.org/abs/2601.10768
作者: Nina Bočková,Barbora Volná,Mirko Dohnal
机构: 未知
类目: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:This paper investigates complex product-innovation processes using models grounded in a set of heuristics. Each heuristic is expressed through simple trends – increasing, decreasing, or constant – which serve as minimally information-intensive quantifiers, avoiding reliance on numerical values or rough sets. A solution to a trend model is defined as a set of scenarios with possible transitions between them, represented by a transition graph. Any possible future or past behaviour of the system under study can thus be depicted by a path within this graph.
zh
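摘要中"解 = 情景集合 + 转移图,任意行为对应图中一条路径"的表述可以用一个假设性的最小 Python 示例来说明(三值趋势用 +1/0/-1 表示;转移规则取"每个变量的趋势每步最多变化一档",仅为示意,并非论文中的实际模型):

```python
from collections import deque

def smooth(a, b):
    """相邻情景之间,每个变量的趋势最多变化一档(+ <-> 0 <-> -)。"""
    return all(abs(x - y) <= 1 for x, y in zip(a, b))

def paths_exist(start, goal, scenarios):
    """在转移图上用 BFS 判断是否存在从 start 到 goal 的行为路径。"""
    seen, queue = {start}, deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            return True
        for nxt in scenarios:
            if nxt not in seen and smooth(cur, nxt):
                seen.add(nxt)
                queue.append(nxt)
    return False

# 两个变量的全部 9 个情景
scenarios = [(i, j) for i in (-1, 0, 1) for j in (-1, 0, 1)]
print(paths_exist((1, 1), (-1, -1), scenarios))  # True:可经由 (0, 0) 等中间情景
```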

[AI-45] Neuro-Symbolic Activation Discovery: Transferring Mathematical Structures from Physics to Ecology for Parameter-Efficient Neural Networks

【速读】:该论文旨在解决现代神经网络中通用激活函数(如ReLU、GELU、SiLU)忽视科学数据内在数学结构的问题,从而导致模型在特定领域任务中效率与性能不足。其解决方案的关键在于提出神经符号激活发现(Neuro-Symbolic Activation Discovery)框架,利用遗传编程(Genetic Programming)从科学数据中自动提取可解释的数学表达式,并将其作为定制化激活函数注入神经网络。该方法实现了跨领域的几何迁移现象(Geometric Transfer):在粒子物理数据上学习到的激活函数能够有效迁移到生态分类任务中,显著提升参数效率——例如在Forest Cover数据集上以仅5,825参数达到82.4%准确率,相较传统重型网络(31,801参数,83.4%准确率)实现5.5倍参数压缩且仅损失1%精度,同时通过引入参数效率评分(E_param = AUC / log₁₀(Params))验证轻量化混合架构比过度参数化基线更高效。

链接: https://arxiv.org/abs/2601.10740
作者: Anas Hajbi
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Modern neural networks rely on generic activation functions (ReLU, GELU, SiLU) that ignore the mathematical structure inherent in scientific data. We propose Neuro-Symbolic Activation Discovery, a framework that uses Genetic Programming to extract interpretable mathematical formulas from data and inject them as custom activation functions. Our key contribution is the discovery of a Geometric Transfer phenomenon: activation functions learned from particle physics data successfully generalize to ecological classification, outperforming standard activations (ReLU, GELU, SiLU) in both accuracy and parameter efficiency. On the Forest Cover dataset, our Hybrid Transfer model achieves 82.4% accuracy with only 5,825 parameters, compared to 83.4% accuracy requiring 31,801 parameters for a conventional heavy network – a 5.5x parameter reduction with only 1% accuracy loss. We introduce a Parameter Efficiency Score ( E_param = AUC / \log_10(Params) ) and demonstrate that lightweight hybrid architectures consistently achieve 18-21% higher efficiency than over-parameterized baselines. Crucially, we establish boundary conditions: while Physics to Ecology transfer succeeds (both involve continuous Euclidean measurements), Physics to Text transfer fails (discrete word frequencies require different mathematical structures). Our work opens pathways toward domain-specific activation libraries for efficient scientific machine learning.
zh
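摘要中"从数据中提取符号表达式并注入为激活函数"的发现流程,可用下面一段极简 Python 示意。注意这只是思路草图:候选表达式为手工枚举而非论文中的遗传编程演化,数据、适应度定义与函数名均为笔者假设:

```python
import math

# 候选符号表达式库(示意;论文使用遗传编程自动演化,这里仅手工枚举)
CANDIDATES = {
    "relu":     lambda x: max(0.0, x),
    "tanh":     math.tanh,
    "sin_gate": lambda x: x * math.sin(x),           # 假设的"物理风格"激活
    "soft_abs": lambda x: math.sqrt(x * x + 1) - 1,  # 光滑的 |x| 变体
}

def fitness(act, xs, ys):
    """以单神经元拟合误差的负值作为适应度(越大越好)。"""
    # 对标量权重 w 做简单网格搜索,衡量"该激活是否易于拟合数据"
    best = float("inf")
    for w in [i / 10 for i in range(-30, 31)]:
        err = sum((act(w * x) - y) ** 2 for x, y in zip(xs, ys))
        best = min(best, err)
    return -best

def discover_activation(xs, ys):
    """返回在给定数据上适应度最高的激活函数名(发现阶段的极简替身)。"""
    return max(CANDIDATES, key=lambda name: fitness(CANDIDATES[name], xs, ys))

# 玩具数据:目标是形如 tanh 的饱和响应
xs = [i / 5 for i in range(-10, 11)]
ys = [math.tanh(2 * x) for x in xs]
best_name = discover_activation(xs, ys)
```

在真实流程中,被选出的表达式会替换网络各层的激活函数;论文的"几何迁移"即指在一个领域发现的表达式在另一领域的网络中复用。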

[AI-46] CTHA: Constrained Temporal Hierarchical Architecture for Stable Multi-Agent LLM Systems

【速读】:该论文旨在解决多时间尺度智能体架构(multi-time-scale agent architectures)在引入时间层次结构后,因层间协调稳定性被破坏而导致的严重层间冲突、误差无界传播及可扩展性受限的问题。其解决方案的关键在于提出约束型时间层次架构(Constrained Temporal Hierarchical Architecture, CTHA),通过将层间通信空间投影到结构化流形(structured manifolds)以恢复协调稳定性,并引入受原则约束的仲裁机制确保决策一致性;具体包含三项核心约束:(1)消息契约约束(Message Contract Constraints),通过类型化的摘要、计划和策略包规范层间信息流;(2)权威流形约束(Authority Manifold Constraints),依据各层的时间范围限定其决策空间;(3)仲裁解析约束(Arbiter Resolution Constraints),保障多层决策的无冲突组合。

链接: https://arxiv.org/abs/2601.10738
作者: Percy Jardine
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, multi-time-scale agent architectures have extended the ubiquitous single-loop paradigm by introducing temporal hierarchies with distinct cognitive layers. While yielding substantial performance gains, this diversification fundamentally compromises the coordination stability intrinsic to unified agent systems, which causes severe inter-layer conflicts, unbounded error propagation, and restricted scalability. To address these challenges, we propose Constrained Temporal Hierarchical Architecture (CTHA), a general framework that projects the inter-layer communication space onto structured manifolds to restore coordination stability, while incorporating principled arbitration mechanisms to ensure coherent decision-making. Specifically, CTHA enforces three key constraints: (1) Message Contract Constraints that formalize information flow between layers via typed summary, plan, and policy packets; (2) Authority Manifold Constraints that bound each layer’s decision space according to its temporal scope; and (3) Arbiter Resolution Constraints that guarantee conflict-free composition of multi-layer decisions. Empirical experiments demonstrate that CTHA is effective for complex task execution at scale, offering 47% reduction in failure cascades, 2.3x improvement in sample efficiency, and superior scalability compared to unconstrained hierarchical baselines. We anticipate that CTHA, as a principled extension of temporal hierarchies, will contribute to a deeper understanding of multi-agent coordination and suggest promising directions for the evolution of robust autonomous systems.
zh
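摘要中的"消息契约约束"与"仲裁解析约束"可以用下面的极简 Python 草图说明:各层以带类型的决策包通信,仲裁器按时间范围(authority)选取唯一决策。数据包字段与"范围越大权威越高"的规则均为笔者假设,并非论文原始定义:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PolicyPacket:
    layer: str    # 发出决策的层
    horizon: int  # 该层的时间范围(步数),此处假设范围越大权威越高
    action: str

def arbitrate(packets):
    """仲裁器:冲突时按时间范围选取唯一决策,保证多层决策的无冲突组合。"""
    assert packets, "at least one packet required"
    return max(packets, key=lambda p: p.horizon).action

decision = arbitrate([
    PolicyPacket("reactive", horizon=1, action="dodge"),
    PolicyPacket("tactical", horizon=10, action="reroute"),
    PolicyPacket("strategic", horizon=100, action="replan"),
])
```

真实系统中,"权威流形约束"还会限定每层可写入的决策维度,而不是像这里简单地整体取舍。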

[AI-47] ORBITFLOW: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration VLDB2026

【速读】:该论文旨在解决长上下文大语言模型(Long-context LLMs)在服务过程中因请求长度和批处理组成动态变化导致的KV缓存(Key-Value Cache)内存占用波动问题,从而避免因频繁的CPU与GPU间KV缓存传输引发的延迟突增和SLO(Service Level Objective)违反。解决方案的关键在于提出ORBITFLOW系统,其通过轻量级整数线性规划(ILP)求解器为每个请求在内存约束下动态决定保留哪些层的KV缓存于GPU,并基于运行时反馈持续优化缓存放置策略;同时,在高负载下引入回退机制临时推迟内存开销大的请求,保障整体SLO达成。

链接: https://arxiv.org/abs/2601.10729
作者: Xinyue Ma,Heelim Hong,Taegeon Um,Jongseop Lee,Seoyeong Choy,Woo-Yeon Lee,Myeongjae Jeon
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
备注: Accepted at the 52nd International Conference on Very Large Data Bases (VLDB 2026). Xinyue Ma and Heelim Hong contributed equally (co-first authors)

点击查看摘要

Abstract:Serving long-context LLMs is challenging because request lengths and batch composition vary during token generation, causing the memory footprint to fluctuate significantly at runtime. Offloading KV caches to host memory limits effective memory usage, but existing static and predetermined offloading strategies cannot adapt to the rapidly shifting memory demands of long-context serving. This often leads to excessive CPU-to-GPU KV transfers that translate into latency spikes and frequent SLO violations. To address these challenges, we introduce ORBITFLOW, a fine-grained and adaptive KV cache management system that meets latency SLOs in long-context LLM serving. ORBITFLOW employs a lightweight ILP solver to decide which layers’ KV caches to retain on the GPU for each request, within memory capacity constraints. It continuously refines KV placements based on runtime feedback when the active plan becomes suboptimal during token generation. Under heavy load, ORBITFLOW invokes a fallback mechanism to temporarily defer in-flight requests with large memory footprints, preserving overall SLO attainment. Our experiments demonstrate that ORBITFLOW improves SLO attainment for TPOT and TBT by up to 66% and 48%, respectively, while reducing the 95th percentile latency by 38% and achieving up to 3.3x higher throughput compared to existing offloading methods.
zh
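ORBITFLOW "在显存预算内决定哪些层的 KV 缓存驻留 GPU"的核心可以抽象成一个选择问题。下面用 0/1 背包动态规划做小规模示意(论文使用轻量 ILP 求解器;层体积、传输开销等数值均为假设):

```python
def place_kv(layer_sizes, transfer_costs, gpu_budget):
    """返回驻留 GPU 的层集合,使避免的 CPU->GPU 传输开销最大。"""
    n = len(layer_sizes)
    # dp[已用显存] = (最大收益, 选中的层集合)
    dp = {0: (0, frozenset())}
    for i in range(n):
        nxt = dict(dp)
        for used, (gain, chosen) in dp.items():
            b = used + layer_sizes[i]
            if b <= gpu_budget and nxt.get(b, (-1,))[0] < gain + transfer_costs[i]:
                nxt[b] = (gain + transfer_costs[i], chosen | {i})
        dp = nxt
    return max(dp.values(), key=lambda t: t[0])[1]

# 4 层、显存预算 5:应优先保留"传输开销 / 体积"比高的层
resident = place_kv(layer_sizes=[2, 3, 2, 4],
                    transfer_costs=[10, 4, 9, 5],
                    gpu_budget=5)
```

论文中该决策按请求逐一进行并随运行时反馈持续修正,这里只演示单次静态放置。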

[AI-48] DSA-Tokenizer: Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion ACL

【速读】:该论文旨在解决现有语音分词器(Speech Tokenizer)在语义与声学特征分离上的不足,即多数方法要么侧重语义编码,要么将语义内容与声学风格耦合,难以实现完整的语义-声学解耦。其解决方案的关键在于提出DSA-Tokenizer,通过不同的优化约束显式地将语音分解为离散的语义令牌(Semantic Tokens)和声学令牌(Acoustic Tokens):语义令牌由自动语音识别(ASR)监督以捕捉语言内容,而声学令牌专注于梅尔频谱图(mel-spectrograms)重建以编码风格信息;同时引入层次化的流匹配(Flow-Matching)解码器消除两类序列间的刚性长度约束,并采用联合重构-重组训练策略强化这种解耦能力,从而实现高保真重建与灵活重组,推动可控语音生成的发展。

链接: https://arxiv.org/abs/2601.09239
作者: Hanlin Zhang,Daxin Tan,Dehua Tao,Xiao Chen,Haochen Tan,Yunhe Li,Yuchen Cao,Jianping Wang,Linqi Song
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Submitted to ACL ARR 2026 January

点击查看摘要

Abstract:Speech tokenizers serve as the cornerstone of discrete Speech Large Language Models (Speech LLMs). Existing tokenizers either prioritize semantic encoding, fuse semantic content with acoustic style inseparably, or achieve incomplete semantic-acoustic disentanglement. To achieve better disentanglement, we propose DSA-Tokenizer, which explicitly disentangles speech into discrete semantic and acoustic tokens via distinct optimization constraints. Specifically, semantic tokens are supervised by ASR to capture linguistic content, while acoustic tokens focus on mel-spectrogram restoration to encode style. To eliminate rigid length constraints between the two sequences, we introduce a hierarchical Flow-Matching decoder that further improves speech generation quality. Furthermore, we employ a joint reconstruction-recombination training strategy to enforce this separation. DSA-Tokenizer enables high-fidelity reconstruction and flexible recombination through robust disentanglement, facilitating controllable generation in speech LLMs. Our analysis highlights disentangled tokenization as a pivotal paradigm for future speech modeling. Audio samples are available at this https URL. The code and model will be made publicly available after the paper has been accepted.
zh

[AI-49] EvidFuse: Writing-Time Evidence Learning for Consistent Text-Chart Data Reporting

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的数据驱动报告生成系统中普遍存在的图表与文本不一致(chart-text inconsistency)及洞察固化(insight freezing)问题。现有方法通常采用分阶段的流水线策略,如先生成文本再生成图表或反之,导致中间证据空间固定,无法在叙事演进过程中动态检索或构建新的可视化证据,从而限制了分析的深度与灵活性。解决方案的关键在于提出一个无需训练的多智能体框架 EvidFuse,其核心创新是通过两个协作组件解耦可视化分析与长文本撰写过程:一是具备探索性数据分析(Exploratory Data Analysis, EDA)知识并可访问原始表格的“数据增强型分析代理”(Data-Augmented Analysis Agent),二是能够规划大纲并实时发出细粒度分析请求的“实时证据构建写作者”(Real-Time Evidence Construction Writer)。这种设计使得视觉证据可在写作过程中按需构造并即时融入叙述,从而动态约束后续论述并扩展证据空间,显著提升图表质量、图文一致性与报告整体实用性。

链接: https://arxiv.org/abs/2601.05487
作者: Huanxiang Lin,Qianyue Wang,Jinwu Hu,Bailin Chen,Qing Du,Mingkui Tan
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Data-driven reports communicate decision-relevant insights by tightly interleaving narrative text with charts grounded in underlying tables. However, current LLM-based systems typically generate narratives and visualizations in staged pipelines, following either a text-first-graph-second or a graph-first-text-second paradigm. These designs often lead to chart-text inconsistency and insight freezing, where the intermediate evidence space becomes fixed and the model can no longer retrieve or construct new visual evidence as the narrative evolves, resulting in shallow and predefined analysis. To address the limitations, we propose EvidFuse, a training-free multi-agent framework that enables writing-time text-chart interleaved generation for data-driven reports. EvidFuse decouples visualization analysis from long-form drafting via two collaborating components: a Data-Augmented Analysis Agent, equipped with Exploratory Data Analysis (EDA)-derived knowledge and access to raw tables, and a Real-Time Evidence Construction Writer that plans an outline and drafts the report while intermittently issuing fine-grained analysis requests. This design allows visual evidence to be constructed and incorporated exactly when the narrative requires it, directly constraining subsequent claims and enabling on-demand expansion of the evidence space. Experiments demonstrate that EvidFuse attains the top rank in both LLM-as-a-judge and human evaluations on chart quality, chart-text alignment, and report-level usefulness.
zh
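EvidFuse 的"写作时取证"流程,即写作者逐段起草、遇到需要证据的段落就向分析代理发出细粒度请求,可用下面的 Python 草图示意。代理行为、请求格式与示例数据均为笔者假设,仅体现"证据在写作进行中按需构造"这一思路:

```python
def analysis_agent(request, table):
    """极简分析代理:按请求对表格某列求均值,返回一条"图表证据"描述。"""
    col = request["column"]
    mean = sum(row[col] for row in table) / len(table)
    return f"chart(mean {col} = {mean:.1f})"

def write_report(outline, table):
    """写作者按大纲起草;带分析请求的段落会即时插入新构造的图表证据。"""
    report = []
    for section, request in outline:
        report.append(section)
        if request is not None:  # 写作进行中按需发出分析请求
            report.append(analysis_agent(request, table))
    return report

table = [{"sales": 10.0}, {"sales": 14.0}]
report = write_report(
    [("## 概览", None), ("销售水平如下图:", {"column": "sales"})], table)
```

与"先文后图"或"先图后文"的分阶段流水线相比,这种结构允许证据空间随叙事演进而扩展。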

[AI-50] Artificial Intelligence and the US Economy: An Accounting Perspective on Investment and Production

【速读】:该论文试图解决当前人工智能(Artificial Intelligence, AI)对宏观经济影响的量化与理解不足的问题,尤其是其在国家账户体系中的体现方式及其对总需求和GDP增长的实际贡献。解决方案的关键在于构建一个简单的宏观核算框架,并结合对AI生产过程的简化描述,识别出数据中心(Data Centers)作为AI生态系统核心基础设施的作用;研究表明,尽管AI相关资本支出显著拉动了 aggregate demand(总需求),但扣除高进口成分后其对GDP增长的贡献较小,而新建成的数据中心若按当前利用率和定价水平运营,其产出的服务价值可能在未来几个季度内达到与累计投资相当的规模,从而揭示了AI驱动经济增长的潜在路径及中期内可能面临的宏观风险。

链接: https://arxiv.org/abs/2601.11196
作者: Luisa Carpinelli,Filippo Natoli,Marco Taboga
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
备注: 35 pages, 11 figures, pre-print

点击查看摘要

Abstract:Artificial intelligence (AI) has moved to the center of policy, market, and academic debates, but its macroeconomic footprint is still only partly understood. This paper provides an overview on how the current AI wave is captured in US national accounts, combining a simple macro-accounting framework with a stylized description of the AI production process. We highlight the crucial role played by data centers, which constitute the backbone of the AI ecosystem and have attracted formidable investment in 2025, as they are indispensable for meeting the rapidly increasing worldwide demand for AI services. We document that the boom in IT and AI-related capital expenditure in the first three quarters of the year has given an outsized boost to aggregate demand, while its contribution to GDP growth is smaller once the high import content of AI hardware is netted out. Furthermore, simple calculations suggest that, at current utilization rates and pricing, the production of services originating in new AI data centers could contribute to GDP over the turn of the next quarters on a scale comparable to that of investment spending to date. Short reinvestment cycles and uncertainty about future AI demand, while not currently acting as a macroeconomic drag, can nevertheless fuel macroeconomic risks over the medium term.
zh

[AI-51] Contextual Distributionally Robust Optimization with Causal and Continuous Structure: An Interpretable and Tractable Approach

【速读】:该论文旨在解决情境化分布鲁棒优化(Contextual Distributionally Robust Optimization, Contextual DRO)中如何有效建模因果结构与连续性、同时保持决策规则可解释性的问题。现有方法往往忽视了数据生成过程中的因果关系,或难以在复杂环境中获得具有理论保障且可解释的最优决策策略。解决方案的关键在于:首先提出因果Sinkhorn散度(Causal Sinkhorn Discrepancy, CSD),这是一种熵正则化的因果Wasserstein距离,能够在保持因果一致性的同时诱导连续的运输计划;进而构建基于CSD的模糊集,形成因果Sinkhorn鲁棒优化模型(Causal-SDRO),并推导其强对偶形式,其中最坏情况分布被刻画为Gibbs分布的混合;最后设计软回归森林(Soft Regression Forest, SRF)决策规则,该规则在任意可测函数空间内逼近最优策略,兼具参数化、可微分和Lipschitz光滑特性,从而实现全局与局部层面的内在可解释性,并通过一种收敛速率匹配标准随机梯度下降的随机复合梯度算法高效求解。

链接: https://arxiv.org/abs/2601.11016
作者: Fenglin Zhang,Jie Wang
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:In this paper, we introduce a framework for contextual distributionally robust optimization (DRO) that considers the causal and continuous structure of the underlying distribution by developing interpretable and tractable decision rules that prescribe decisions using covariates. We first introduce the causal Sinkhorn discrepancy (CSD), an entropy-regularized causal Wasserstein distance that encourages continuous transport plans while preserving the causal consistency. We then formulate a contextual DRO model with a CSD-based ambiguity set, termed Causal Sinkhorn DRO (Causal-SDRO), and derive its strong dual reformulation where the worst-case distribution is characterized as a mixture of Gibbs distributions. To solve the corresponding infinite-dimensional policy optimization, we propose the Soft Regression Forest (SRF) decision rule, which approximates optimal policies within arbitrary measurable function spaces. The SRF preserves the interpretability of classical decision trees while being fully parametric, differentiable, and Lipschitz smooth, enabling intrinsic interpretation from both global and local perspectives. To solve the Causal-SDRO with parametric decision rules, we develop an efficient stochastic compositional gradient algorithm that converges to an \varepsilon-stationary point at a rate of O(\varepsilon^{-4}), matching the convergence rate of standard stochastic gradient descent. Finally, we validate our method through numerical experiments on synthetic and real-world datasets, demonstrating its superior performance and interpretability.
zh
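摘要中的因果 Sinkhorn 散度(CSD)以熵正则化最优传输为基础。下面给出标准 Sinkhorn 迭代的最小 Python 示意(未包含论文的因果一致性约束;代价矩阵与正则化参数均为假设值):

```python
import math

def sinkhorn(cost, a, b, eps=0.1, iters=500):
    """返回熵正则化传输计划 P,行边际近似 a,列边际近似 b。"""
    n, m = len(cost), len(cost[0])
    # Gibbs 核:K_ij = exp(-c_ij / eps),eps 越小越接近非正则化 OT
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):  # 交替缩放行、列,直至边际匹配
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

cost = [[0.0, 1.0], [1.0, 0.0]]
P = sinkhorn(cost, a=[0.5, 0.5], b=[0.5, 0.5])
```

熵正则化使传输计划在各对点之间"弥散",这正是摘要中"鼓励连续传输计划"的来源;CSD 在此之上额外要求计划满足因果(非预期性)结构。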

[AI-52] AnyECG: Evolved ECG Foundation Model for Holistic Health Profiling

【速读】:该论文旨在解决当前人工智能心电图(AI-ECG)模型多局限于单一疾病识别,忽视共病(comorbidity)关系及未来疾病风险预测的问题。其核心挑战在于构建一个能够实现全面健康画像(holistic health profiling)的通用型基础模型。解决方案的关键在于利用迁移学习(transfer learning)对已有的 ECGFounder 模型进行微调,开发出 AnyECG 基础模型,并基于包含 1330 万份心电图和 298 万名患者的多中心数据集进行训练与验证。该模型在 1172 种疾病中展现出系统性预测能力,尤其在多种慢性病如原发性甲状旁腺功能亢进症(AUROC 0.941)、2 型糖尿病(0.803)等上表现优异,实现了同时检测多种疾病、识别共病模式并预测长期健康风险的目标。

链接: https://arxiv.org/abs/2601.10748
作者: Jun Li,Hongling Zhu,Yujie Xiao,Qinghao Zhao,Yalei Ke,Gongzheng Tang,Guangkun Nie,Deyun Zhang,Jin Li,Canqing Yu,Shenda Hong
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: in progress

点击查看摘要

Abstract:Background: Artificial intelligence enabled electrocardiography (AI-ECG) has demonstrated the ability to detect diverse pathologies, but most existing models focus on single disease identification, neglecting comorbidities and future risk prediction. Although ECGFounder expanded cardiac disease coverage, a holistic health profiling model remains needed. Methods: We constructed a large multicenter dataset comprising 13.3 million ECGs from 2.98 million patients. Using transfer learning, ECGFounder was fine-tuned to develop AnyECG, a foundation model for holistic health profiling. Performance was evaluated using external validation cohorts and a 10-year longitudinal cohort for current diagnosis, future risk prediction, and comorbidity identification. Results: AnyECG demonstrated systemic predictive capability across 1172 conditions, achieving an AUROC greater than 0.7 for 306 diseases. The model revealed novel disease associations, robust comorbidity patterns, and future disease risks. Representative examples included high diagnostic performance for hyperparathyroidism (AUROC 0.941), type 2 diabetes (0.803), Crohn disease (0.817), lymphoid leukemia (0.856), and chronic obstructive pulmonary disease (0.773). Conclusion: The AnyECG foundation model provides substantial evidence that AI-ECG can serve as a systemic tool for concurrent disease detection and long-term risk prediction.
zh

[AI-53] Millimeter-Wave Gesture Recognition in ISAC: Does Reducing Sensing Airtime Hamper Accuracy?

【速读】:该论文旨在解决集成感知与通信(Integrated Sensing and Communications, ISAC)系统中因空分复用导致的感知性能下降问题,尤其是当感知时间被压缩时对手势识别准确率的影响尚不明确。解决方案的关键在于通过在毫米波(Millimeter-Wave, mmWave)ISAC系统上采集多波束对的功率数据,并利用卷积神经网络(Convolutional Neural Networks, CNN)训练手势分类器;实验表明,在仅使用25%感知空时的情况下,分类准确率仅下降0.15个百分点,证明了mmWave ISAC可在极低感知空时下维持高精度感知性能,同时具备高数据吞吐能力,从而为真正无线扩展现实(Extended Reality, XR)等应用提供可靠支持。

链接: https://arxiv.org/abs/2601.10733
作者: Jakob Struye,Nabeel Nisar Bhat,Siddhartha Kumar,Mohammad Hossein Moghaddam,Jeroen Famaey
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Most Integrated Sensing and Communications (ISAC) systems require dividing airtime across their two modes. However, the specific impact of this decision on sensing performance remains unclear and underexplored. In this paper, we therefore investigate the impact on a gesture recognition system using a Millimeter-Wave (mmWave) ISAC system. With our dataset of power per beam pair gathered with two mmWave devices performing constant beam sweeps while test subjects performed distinct gestures, we train a gesture classifier using Convolutional Neural Networks. We then subsample these measurements, emulating reduced sensing airtime, showing that a sensing airtime of 25 % only reduces classification accuracy by 0.15 percentage points from full-time sensing. Alongside this high-quality sensing at low airtime, mmWave systems are known to provide extremely high data throughputs, making mmWave ISAC a prime enabler for applications such as truly wireless Extended Reality.
zh
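论文通过对完整波束扫描数据做时间下采样来仿真"感知空时占比"降低。这一操作本身很简单,可用几行 Python 示意(帧数、波束对数与功率数值均为假设):

```python
def subsample_sweeps(sweeps, airtime_fraction):
    """保留约 airtime_fraction 比例的扫描帧(均匀抽取),其余帧视为让给通信。"""
    assert 0 < airtime_fraction <= 1
    step = round(1 / airtime_fraction)
    return sweeps[::step]

# 8 帧 x 每帧 3 个波束对的功率读数(数值为示意)
sweeps = [[i + b / 10 for b in range(3)] for i in range(8)]
kept = subsample_sweeps(sweeps, airtime_fraction=0.25)  # 25% 感知空时
```

论文的结论是:把这样下采样到 25% 的输入喂给同一个 CNN 分类器,准确率仅比全时感知低 0.15 个百分点。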

机器学习

[LG-0] QUPID: A Partitioned Quantum Neural Network for Anomaly Detection in Smart Grid

链接: https://arxiv.org/abs/2601.11500
作者: Hoang M. Ngo,Tre’ R. Jeter,Jung Taek Seo,My T. Thai
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Smart grid infrastructures have revolutionized energy distribution, but their day-to-day operations require robust anomaly detection methods to counter risks associated with cyber-physical threats and system faults potentially caused by natural disasters, equipment malfunctions, and cyber attacks. Conventional machine learning (ML) models are effective in several domains, yet they struggle to represent the complexities observed in smart grid systems. Furthermore, traditional ML models are highly susceptible to adversarial manipulations, making them increasingly unreliable for real-world deployment. Quantum ML (QML) provides a unique advantage, utilizing quantum-enhanced feature representations to model the intricacies of the high-dimensional nature of smart grid systems while demonstrating greater resilience to adversarial manipulation. In this work, we propose QUPID, a partitioned quantum neural network (PQNN) that outperforms traditional state-of-the-art ML models in anomaly detection. We extend our model to R-QUPID, which maintains its performance even when including differential privacy (DP) for enhanced robustness. Moreover, our partitioning framework addresses a significant scalability problem in QML by efficiently distributing computational workloads, making quantum-enhanced anomaly detection practical in large-scale smart grid environments. Our experimental results across various scenarios exemplify the efficacy of QUPID and R-QUPID to significantly improve anomaly detection capabilities and robustness compared to traditional ML approaches.

[LG-1] On the Probability of First Success in Differential Evolution: Hazard Identities and Tail Bounds

链接: https://arxiv.org/abs/2601.11499
作者: Dimitar Nedanovski,Svetoslav Nenov,Dimitar Pilev
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: All codes are publicly available at this https URL

点击查看摘要

Abstract:We study first-hitting times in Differential Evolution (DE) through a conditional hazard framework. Instead of analyzing convergence via Markov-chain transition kernels or drift arguments, we express the survival probability of a measurable target set A as a product of conditional first-hit probabilities (hazards) p_t = \mathbb{P}(E_t \mid \mathcal{F}_{t-1}). This yields distribution-free identities for survival and explicit tail bounds whenever deterministic lower bounds on the hazard hold on the survival event. For the L-SHADE algorithm with current-to-pbest/1 mutation, we construct a checkable algorithmic witness event \mathcal{L}_t under which the conditional hazard admits an explicit lower bound depending only on sampling rules, population size, and crossover statistics. This separates theoretical constants from empirical event frequencies and explains why worst-case constant-hazard bounds are typically conservative. We complement the theory with a Kaplan–Meier survival analysis on the CEC2017 benchmark suite. Across functions and budgets, we identify three distinct empirical regimes: (i) strongly clustered success, where hitting times concentrate in short bursts; (ii) approximately geometric tails, where a constant-hazard model is accurate; and (iii) intractable cases with no observed hits within the evaluation horizon. The results show that while constant-hazard bounds provide valid tail envelopes, the practical behavior of L-SHADE is governed by burst-like transitions rather than homogeneous per-generation success probabilities.
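摘要中"生存概率等于逐代条件命中概率(hazard)的连乘",以及"常数下界 p_min 给出几何尾部上界 S(T) ≤ (1 - p_min)^T",可以用几行 Python 直接验证(hazard 序列为示意数值):

```python
def survival(hazards):
    """S(t) = prod_{s<=t} (1 - p_s):到第 t 代仍未命中目标集的概率。"""
    s, out = 1.0, []
    for p in hazards:
        s *= (1.0 - p)
        out.append(s)
    return out

hazards = [0.3, 0.1, 0.2, 0.1]       # 逐代条件命中概率(示意)
S = survival(hazards)
p_min = min(hazards)
# 常数 hazard 下界给出的几何尾部包络
bound = [(1 - p_min) ** (t + 1) for t in range(len(hazards))]
ok = all(s <= b + 1e-12 for s, b in zip(S, bound))
```

示例也体现了摘要的观察:真实 hazard 不均匀(如首代 0.3 的"突发"),常数下界包络因此偏保守。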

[LG-2] Extractive summarization on a CMOS Ising machine

链接: https://arxiv.org/abs/2601.11491
作者: Ziqing Zeng,Abhimanyu Kumar,Chris H. Kim,Ulya R. Karpuzcu,Sachin S. Sapatnekar
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
*备注:

点击查看摘要

Abstract:Extractive summarization (ES) aims to generate a concise summary by selecting a subset of sentences from a document while maximizing relevance and minimizing redundancy. Although modern ES systems achieve high accuracy using powerful neural models, their deployment typically relies on CPU or GPU infrastructures that are energy-intensive and poorly suited for real-time inference in resource-constrained environments. In this work, we explore the feasibility of implementing McDonald-style extractive summarization on a low-power CMOS coupled oscillator-based Ising machine (COBI) that supports integer-valued, all-to-all spin couplings. We first propose a hardware-aware Ising formulation that reduces the scale imbalance between local fields and coupling terms, thereby improving robustness to coefficient quantization: this method can be applied to any problem formulation that requires k of n variables to be chosen. We then develop a complete ES pipeline including (i) stochastic rounding and iterative refinement to compensate for precision loss, and (ii) a decomposition strategy that partitions a large ES problem into smaller Ising subproblems that can be efficiently solved on COBI and later combined. Experimental results on the CNN/DailyMail dataset show that our pipeline can produce high-quality summaries using only integer-coupled Ising hardware with limited precision. COBI achieves 3-4.5x runtime speedups compared to a brute-force method, which is comparable to software Tabu search, and two to three orders of magnitude reductions in energy, while maintaining competitive summary quality. These results highlight the potential of deploying CMOS Ising solvers for real-time, low-energy text summarization on edge devices.
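McDonald 式抽取摘要的 Ising/QUBO 形式,即"最大化句子相关性、惩罚两两冗余,并软约束恰好选 k 句",可以用下面的小规模 Python 草图说明(相关性与冗余数值为假设;论文在 COBI 硬件上求解,这里用暴力枚举代替):

```python
from itertools import product

def build_qubo(rel, red, k, lam=5.0):
    """构造 QUBO 矩阵:对角为 -相关性 + k-of-n 罚项,非对角为冗余 + 罚项。"""
    n = len(rel)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # 来自 lam * (sum x - k)^2 的展开(x^2 = x,常数项略去)
        Q[i][i] = -rel[i] + lam * (1 - 2 * k)
        for j in range(i + 1, n):
            Q[i][j] = red[i][j] + 2 * lam
    return Q

def solve_brute(Q):
    """小规模暴力求解:枚举全部 0/1 赋值取能量最小者。"""
    n = len(Q)
    def energy(x):
        return sum(Q[i][i] * x[i] for i in range(n)) + \
               sum(Q[i][j] * x[i] * x[j]
                   for i in range(n) for j in range(i + 1, n))
    return min(product([0, 1], repeat=n), key=energy)

rel = [3.0, 2.0, 2.5, 0.5]                       # 各句与文档的相关性
red = [[0, 2.0, 0.1, 0], [2.0, 0, 0.2, 0],       # 两两冗余(句 0、1 高度重复)
       [0.1, 0.2, 0, 0], [0, 0, 0, 0]]
x = solve_brute(build_qubo(rel, red, k=2))       # 选 2 句
```

最优解避开了冗余的句对(0, 1),选中相关性高且互补的句 0 与句 2;这正是论文中可推广到任意"n 选 k"问题的罚项构造。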

[LG-3] Low-Rank Key Value Attention

链接: https://arxiv.org/abs/2601.11471
作者: James O’Neill,Robert Clancy,Mariia Matskevichus,Fergal Reid
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transformer pretraining is increasingly constrained by memory and compute requirements, with the key-value (KV) cache emerging as a dominant bottleneck during training and autoregressive decoding. We propose low-rank KV adaptation (LRKV), a simple modification of multi-head attention that reduces KV cache memory by exploiting redundancy across attention heads while preserving full token-level resolution. Each layer uses a shared full-rank KV projection augmented with low-rank, head-specific residuals, yielding a continuous trade-off between complete sharing and fully independent attention. LRKV is a drop-in replacement for standard multi-head attention and directly subsumes query-sharing approaches such as multi-query and grouped-query attention, while remaining distinct from latent-compression methods such as multi-latent attention (MLA). Across large-scale pretraining experiments, LRKV consistently achieves faster loss reduction, lower validation perplexity, and stronger downstream task performance than standard attention, MQA/GQA, and MLA. At the 2.5B scale, LRKV outperforms standard attention while using roughly half the KV cache, and reaches equivalent model quality with up to 20-25% less training compute when measured in cumulative FLOPs. To explain these gains, we analyze attention head structure in operator space and show that LRKV preserves nearly all functional head diversity relative to standard attention, whereas more aggressive KV-sharing mechanisms rely on compensatory query specialization. Together, these results establish LRKV as a practical and effective attention mechanism for scaling Transformer pretraining under memory- and compute-constrained regimes.
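LRKV 的核心结构,即"每个头的 K/V 投影 = 共享满秩投影 + 头特定的低秩残差 W_h = W_shared + U_h V_h",可用下面的纯 Python 草图示意(只演示 K 投影;维度、数值与函数名均为笔者假设,非论文实现):

```python
def matmul(A, B):
    """朴素矩阵乘法(演示用,避免外部依赖)。"""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def add(A, B):
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(A, B)]

def lrkv_key_proj(x, W_shared, U_heads, V_heads):
    """对每个注意力头 h 返回 K_h = x @ (W_shared + U_h V_h)。"""
    outs = []
    for U, V in zip(U_heads, V_heads):
        W_h = add(W_shared, matmul(U, V))  # 低秩残差 U_h V_h,秩 = U 的列数
        outs.append(matmul(x, W_h))
    return outs

d = 4                                       # 模型维度(示意)
x = [[1.0, 0.0, 0.0, 0.0]]                 # 单 token 输入
W_shared = [[float(i == j) for j in range(d)] for i in range(d)]  # 单位阵
# 头 0 带秩 1 残差;头 1 残差为零,即完全退回共享投影
U = [[[0.5], [0.0], [0.0], [0.0]], [[0.0], [0.0], [0.0], [0.0]]]
V = [[[1.0, 0.0, 0.0, 0.0]], [[0.0, 0.0, 0.0, 0.0]]]
K = lrkv_key_proj(x, W_shared, U, V)
```

残差秩 r 即是连续旋钮:r = 0 时退化为所有头完全共享 KV(接近 MQA),r = d 时各头完全独立,对应摘要所说的连续折中。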

[LG-4] Learning Semantic-Geometric Task Graph-Representations from Human Demonstrations

链接: https://arxiv.org/abs/2601.11460
作者: Franziska Herbert,Vignesh Prasad,Han Liu,Dorothea Koert,Georgia Chalvatzaki
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 9 pages, 7 figures, preprint

点击查看摘要

Abstract:Learning structured task representations from human demonstrations is essential for understanding long-horizon manipulation behaviors, particularly in bimanual settings where action ordering, object involvement, and interaction geometry can vary significantly. A key challenge lies in jointly capturing the discrete semantic structure of tasks and the temporal evolution of object-centric geometric relations in a form that supports reasoning over task progression. In this work, we introduce a semantic-geometric task graph-representation that encodes object identities, inter-object relations, and their temporal geometric evolution from human demonstrations. Building on this formulation, we propose a learning framework that combines a Message Passing Neural Network (MPNN) encoder with a Transformer-based decoder, decoupling scene representation learning from action-conditioned reasoning about task progression. The encoder operates solely on temporal scene graphs to learn structured representations, while the decoder conditions on action-context to predict future action sequences, associated objects, and object motions over extended time horizons. Through extensive evaluation on human demonstration datasets, we show that semantic-geometric task graph-representations are particularly beneficial for tasks with high action and object variability, where simpler sequence-based models struggle to capture task progression. Finally, we demonstrate that task graph representations can be transferred to a physical bimanual robot and used for online action selection, highlighting their potential as reusable task abstractions for downstream decision-making in manipulation systems.

[LG-5] IMS: Intelligent Hardware Monitoring System for Secure SoCs DATE

链接: https://arxiv.org/abs/2601.11447
作者: Wadid Foudhaili,Aykut Rencber,Anouar Nechi,Rainer Buchty,Mladen Berekovic,Andres Gomez,Saleh Mulhem
类目: Cryptography and Security (cs.CR); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: The final version is accepted for publication at the Design, Automation & Test in Europe Conference (DATE) 2026

点击查看摘要

Abstract:In the modern Systems-on-Chip (SoC), the Advanced eXtensible Interface (AXI) protocol exhibits security vulnerabilities, enabling partial or complete denial-of-service (DoS) through protocol-violation attacks. The recent countermeasures lack a dedicated real-time protocol semantic analysis and evade protocol compliance checks. This paper tackles this AXI vulnerability issue and presents an intelligent hardware monitoring system (IMS) for real-time detection of AXI protocol violations. IMS is a hardware module leveraging neural networks to achieve high detection accuracy. For model training, we perform DoS attacks through header-field manipulation and systematic malicious operations, while recording AXI transactions to build a training dataset. We then deploy a quantization-optimized neural network, achieving 98.7% detection accuracy with ≤3% latency overhead, and throughput of 2.5 million inferences/s. We subsequently integrate this IMS into a RISC-V SoC as a memory-mapped IP core to monitor its AXI bus. For demonstration and initial assessment for later ASIC integration, we implemented this IMS on an AMD Zynq UltraScale+ MPSoC ZCU104 board, showing an overall small hardware footprint (9.04% look-up-tables (LUTs), 0.23% DSP slices, and 0.70% flip-flops) and negligible impact on the overall design’s achievable frequency. This demonstrates the feasibility of lightweight security monitoring for resource-constrained edge environments.

[LG-6] Inter-patient ECG Arrhythmia Classification with LGNs and LUTNs

链接: https://arxiv.org/abs/2601.11433
作者: Wout Mommen,Lars Keuninckx,Paul Detterer,Achiel Colpaert,Piet Wambacq
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep Differentiable Logic Gate Networks (LGNs) and Lookup Table Networks (LUTNs) are demonstrated to be suitable for the automatic classification of electrocardiograms (ECGs) using the inter-patient paradigm. The methods are benchmarked using the MIT-BIH arrhythmia data set, achieving up to 94.28% accuracy and a jκ index of 0.683 on a four-class classification problem. Our models use between 2.89k and 6.17k FLOPs, including preprocessing and readout, which is three to six orders of magnitude less compared to SOTA methods. A novel preprocessing method is utilized that attains superior performance compared to existing methods for both the mixed-patient and inter-patient paradigms. In addition, a novel method for training the Lookup Tables (LUTs) in LUTNs is devised that uses the Boolean equation of a multiplexer (MUX). Additionally, rate coding was utilized for the first time in these LGNs and LUTNs, enhancing the performance of LGNs. Furthermore, it is the first time that LGNs and LUTNs have been benchmarked on the MIT-BIH arrhythmia dataset using the inter-patient paradigm. Using an Artix 7 FPGA, between 2000 and 2990 LUTs were needed, and between 5 to 7 mW (i.e. 50 pJ to 70 pJ per inference) was estimated for running these models. The performance in terms of both accuracy and jκ index is significantly higher compared to previous LGN results. These positive results suggest that one can utilize LGNs and LUTNs for the detection of arrhythmias at extremely low power and high speeds in heart implants or wearable devices, even for patients not included in the training set.

[LG-7] Forcing and Diagnosing Failure Modes of Fourier Neural Operators Across Diverse PDE Families

链接: https://arxiv.org/abs/2601.11428
作者: Lennon Shikhman
类目: Machine Learning (cs.LG)
*备注: 17 pages, 8 figures

点击查看摘要

Abstract:Fourier Neural Operators (FNOs) have shown strong performance in learning solution maps of partial differential equations (PDEs), but their robustness under distribution shifts, long-horizon rollouts, and structural perturbations remains poorly understood. We present a systematic stress-testing framework that probes failure modes of FNOs across five qualitatively different PDE families: dispersive, elliptic, multi-scale fluid, financial, and chaotic systems. Rather than optimizing in-distribution accuracy, we design controlled stress tests–including parameter shifts, boundary or terminal condition changes, resolution extrapolation with spectral analysis, and iterative rollouts–to expose vulnerabilities such as spectral bias, compounding integration errors, and overfitting to restricted boundary regimes. Our large-scale evaluation (1,000 trained models) reveals that distribution shifts in parameters or boundary conditions can inflate errors by more than an order of magnitude, while resolution changes primarily concentrate error in high-frequency modes. Input perturbations generally do not amplify error, though worst-case scenarios (e.g., localized Poisson perturbations) remain challenging. These findings provide a comparative failure-mode atlas and actionable insights for improving robustness in operator learning.
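摘要提到"分辨率变化主要把误差集中到高频模态",其诊断手段是对预测误差做谱分析。下面用纯 Python 的离散傅里叶变换演示如何定位误差集中的频率(信号与误差构造均为假设):

```python
import cmath
import math

def dft_mag(signal):
    """返回信号各频率模态的 DFT 幅度(归一化)。"""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))) / n for k in range(n)]

n = 64
truth = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]   # 低频真值
# 假设模型在第 13 号高频模态上产生系统性误差
pred = [truth[t] - 0.3 * math.sin(2 * math.pi * 13 * t / n) for t in range(n)]

err_spectrum = dft_mag([p - g for p, g in zip(pred, truth)])
peak_mode = max(range(n // 2), key=lambda k: err_spectrum[k])
```

若误差谱的峰值持续落在高频端,即是摘要所说的谱偏置(spectral bias)的直接证据。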

[LG-8] New Adaptive Mechanism for Large Neighborhood Search using Dual Actor-Critic

链接: https://arxiv.org/abs/2601.11414
作者: Shaohua Yu,Wenhao Mao,Zigao Wu,Jakob Puchinger
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Adaptive Large Neighborhood Search (ALNS) is a widely used heuristic method for solving combinatorial optimization problems. ALNS explores the solution space by iteratively using destroy and repair operators with probabilities, which are adjusted by an adaptive mechanism to find optimal solutions. However, the classic ALNS adaptive mechanism does not consider the interaction between destroy and repair operators when selecting them. To overcome this limitation, this study proposes a novel adaptive mechanism. This mechanism enhances the adaptability of the algorithm through a Dual Actor-Critic (DAC) model, which fully considers the fact that the quality of new solutions is jointly determined by the destroy and repair operators. It effectively utilizes the interaction between these operators during the weight adjustment process, greatly improving the adaptability of the ALNS algorithm. In this mechanism, the destroy and repair processes are modeled as independent Markov Decision Processes to guide the selection of operators more accurately. Furthermore, we use Graph Neural Networks to extract key features from problem instances and perform effective aggregation and normalization to enhance the algorithm’s transferability to different sizes and characteristics of problems. Through a series of experiments, we demonstrate that the proposed DAC-ALNS algorithm significantly improves solution efficiency and exhibits excellent transferability.
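
For context on what the DAC mechanism replaces: the classic ALNS adaptive scheme selects destroy and repair operators independently by roulette-wheel over weights, which are smoothed with scores earned in the last segment. A minimal sketch of that baseline mechanism (the scores and reaction factor are illustrative, not taken from the paper):

```python
import random

def select_operator(weights, rng):
    """Roulette-wheel selection proportional to adaptive weights."""
    r = rng.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(weights) - 1

def update_weight(w, score, reaction=0.5):
    """Exponential smoothing of an operator weight with the score
    earned in the last segment (the classic ALNS rule)."""
    return (1 - reaction) * w + reaction * score

rng = random.Random(0)
destroy_w = [1.0, 1.0, 1.0]           # destroy and repair operators are
repair_w = [1.0, 1.0]                 # chosen *independently* here, which
d = select_operator(destroy_w, rng)   # is the limitation DAC-ALNS targets
r = select_operator(repair_w, rng)
destroy_w[d] = update_weight(destroy_w[d], score=3.0)
```

Because the two selections are independent, the interaction between a destroy operator and a repair operator never enters the weight update, which is exactly the gap the dual actor-critic mechanism addresses.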

[LG-9] Factored Value Functions for Graph-Based Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2601.11401
作者: Ahmed Rashwan,Keith Briggs,Chris Budd,Lisa Kreusser
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Credit assignment is a core challenge in multi-agent reinforcement learning (MARL), especially in large-scale systems with structured, local interactions. Graph-based Markov decision processes (GMDPs) capture such settings via an influence graph, but standard critics are poorly aligned with this structure: global value functions provide weak per-agent learning signals, while existing local constructions can be difficult to estimate and ill-behaved in infinite-horizon settings. We introduce the Diffusion Value Function (DVF), a factored value function for GMDPs that assigns to each agent a value component by diffusing rewards over the influence graph with temporal discounting and spatial attenuation. We show that DVF is well-defined, admits a Bellman fixed point, and decomposes the global discounted value via an averaging property. DVF can be used as a drop-in critic in standard RL algorithms and estimated scalably with graph neural networks. Building on DVF, we propose Diffusion A2C (DA2C) and a sparse message-passing actor, Learned DropEdge GNN (LD-GNN), for learning decentralised algorithms under communication costs. Across the firefighting benchmark and three distributed computation tasks (vector graph colouring and two transmit power optimisation problems), DA2C consistently outperforms local and global critic baselines, improving average reward by up to 11%.
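
The diffusion idea can be illustrated with a toy computation: an agent's value aggregates the rewards of agents k hops away with spatial attenuation alpha**k, discounted over time by gamma**t. The sketch below uses a truncated matrix-power kernel as a stand-in for the paper's exact operator (alpha, gamma, and the truncation horizon are our assumptions):

```python
import numpy as np

def diffusion_values(adj, rewards, gamma=0.9, alpha=0.5, hops=2):
    """Per-agent value from diffusing stepwise rewards over the
    influence graph: an agent at hop distance k contributes with
    weight alpha**k; step t is discounted by gamma**t.
    `rewards` is a sequence of per-agent reward vectors."""
    n = adj.shape[0]
    kernel = np.zeros((n, n))
    power = np.eye(n)
    for k in range(hops + 1):
        kernel += (alpha ** k) * power   # sum_k alpha^k * A^k
        power = power @ adj
    values = np.zeros(n)
    for t, r in enumerate(rewards):
        values += (gamma ** t) * (kernel @ np.asarray(r, float))
    return values

# Two agents joined by one edge; only agent 0 earns reward at t=0
adj = np.array([[0.0, 1.0], [1.0, 0.0]])
values = diffusion_values(adj, rewards=[[1.0, 0.0]], alpha=0.5, hops=1)
```

Agent 1 receives half of agent 0's reward through the single edge, giving each agent a local learning signal instead of one shared global return.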

[LG-10] Latent Space Inference via Paired Autoencoders

链接: https://arxiv.org/abs/2601.11397
作者: Emma Hart,Bas Peters,Julianne Chung,Matthias Chung
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 21 pages, 7 figures

点击查看摘要

Abstract:This work describes a novel data-driven latent space inference framework built on paired autoencoders to handle observational inconsistencies when solving inverse problems. Our approach uses two autoencoders, one for the parameter space and one for the observation space, connected by learned mappings between the autoencoders’ latent spaces. These mappings enable a surrogate for regularized inversion and optimization in low-dimensional, informative latent spaces. Our flexible framework can work with partial, noisy, or out-of-distribution data, all while maintaining consistency with the underlying physical models. The paired autoencoders enable reconstruction of corrupted data, and then use the reconstructed data for parameter estimation, which produces more accurate reconstructions compared to paired autoencoders alone and end-to-end encoder-decoders of the same architecture, especially in scenarios with data inconsistencies. We demonstrate our approaches on two imaging examples in medical tomography and geophysical seismic-waveform inversion, but the described approaches are broadly applicable to a variety of inverse problems in scientific and engineering applications.

[LG-11] Offline Reinforcement-Learning-Based Power Control for Application-Agnostic Energy Efficiency

链接: https://arxiv.org/abs/2601.11352
作者: Akhilesh Raj,Swann Perarnau,Aniruddha Gokhale,Solomon Bekele Abera
类目: Machine Learning (cs.LG); Performance (cs.PF); Systems and Control (eess.SY)
*备注: 11 pages, 5 figures, 3 tables and unpublished

点击查看摘要

Abstract:Energy efficiency has become an integral aspect of modern computing infrastructure design, impacting the performance, cost, scalability, and durability of production systems. The incorporation of power actuation and sensing capabilities in CPU designs is indicative of this, enabling the deployment of system software that can actively monitor and adjust energy consumption and performance at runtime. While reinforcement learning (RL) would seem ideal for the design of such energy efficiency control systems, online training presents challenges ranging from the lack of proper models for setting up an adequate simulated environment, to perturbation (noise) and reliability issues, if training is deployed on a live system. In this paper we discuss the use of offline reinforcement learning as an alternative approach for the design of an autonomous CPU power controller, with the goal of improving the energy efficiency of parallel applications at runtime without unduly impacting their performance. Offline RL sidesteps the issues incurred by online RL training by leveraging a dataset of state transitions collected from arbitrary policies prior to training. Our methodology applies offline RL to a gray-box approach to energy efficiency, combining online application-agnostic performance data (e.g., heartbeats) and hardware performance counters to ensure that the scientific objectives are met with limited performance degradation. Evaluating our method on a variety of compute-bound and memory-bound benchmarks and controlling power on a live system through Intel’s Running Average Power Limit, we demonstrate that such an offline-trained agent can substantially reduce energy consumption at a tolerable performance degradation cost. 

[LG-12] Information Theoretic Perspective on Representation Learning

链接: https://arxiv.org/abs/2601.11334
作者: Deborah Pereg
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:An information-theoretic framework is introduced to analyze last-layer embedding, focusing on learned representations for regression tasks. We define representation-rate and derive limits on the reliability with which input-output information can be represented as is inherently determined by the input-source entropy. We further define representation capacity in a perturbed setting, and representation rate-distortion for a compressed output. We derive the achievable capacity, the achievable representation-rate, and their converse. Finally, we combine the results in a unified setting.

[LG-13] FORESTLLM: Large Language Models Make Random Forest Great on Few-shot Tabular Learning

链接: https://arxiv.org/abs/2601.11311
作者: Zhihan Yang,Jiaqi Wei,Xiang Zhang,Haoyu Dong,Yiwen Wang,Xiaoke Guo,Pengkun Zhang,Yiwei Xu,Chenyu You
类目: Machine Learning (cs.LG)
*备注: 23 pages

点击查看摘要

Abstract:Tabular data underpins high-stakes decision-making in domains such as finance, healthcare, and scientific discovery. Yet, learning effectively from tabular data in few-shot settings, where labeled examples are scarce, remains a fundamental challenge. Traditional tree-based methods often falter in these regimes due to their reliance on statistical purity metrics, which become unstable and prone to overfitting with limited supervision. At the same time, direct applications of large language models (LLMs) often overlook the data's inherent structure, leading to suboptimal performance. To overcome these limitations, we propose FORESTLLM, a novel framework that unifies the structural inductive biases of decision forests with the semantic reasoning capabilities of LLMs. Crucially, FORESTLLM leverages the LLM only during training, treating it as an offline model designer that encodes rich, contextual knowledge into a lightweight, interpretable forest model, eliminating the need for LLM inference at test time. Our method is twofold. First, we introduce a semantic splitting criterion in which the LLM evaluates candidate partitions based on their coherence over both labeled and unlabeled data, enabling the induction of more robust and generalizable tree structures under few-shot supervision. Second, we propose a one-time in-context inference mechanism for leaf node stabilization, where the LLM distills the decision path and its supporting examples into a concise, deterministic prediction, replacing noisy empirical estimates with semantically informed outputs. Across a diverse suite of few-shot classification and regression benchmarks, FORESTLLM achieves state-of-the-art performance.

[LG-14] Metabolomic Biomarker Discovery for ADHD Diagnosis Using Interpretable Machine Learning

链接: https://arxiv.org/abs/2601.11283
作者: Nabil Belacel,Mohamed Rachid Boulassel
类目: Machine Learning (cs.LG)
*备注: 24 pages, 4 figures, 2 tables, submitted to AI in Medicine

点击查看摘要

Abstract:Attention Deficit Hyperactivity Disorder (ADHD) is a prevalent neurodevelopmental disorder with limited objective diagnostic tools, highlighting the urgent need for objective, biology-based diagnostic frameworks in precision psychiatry. We integrate urinary metabolomics with an interpretable machine learning framework to identify biochemical signatures associated with ADHD. Targeted metabolomic profiles from 52 ADHD and 46 control participants were analyzed using a Closest Resemblance (CR) classifier with embedded feature selection. The CR model outperformed Random Forest and K-Nearest Neighbor classifiers, achieving an AUC of 0.97 based on a reduced panel of 14 metabolites. These metabolites, including dopamine 4-sulfate, N-acetylaspartylglutamic acid, and citrulline, map to dopaminergic neurotransmission and amino acid metabolism pathways, offering mechanistic insight into ADHD pathophysiology. The CR classifier's transparent decision boundaries and low computational cost support integration into targeted metabolomic assays and future point-of-care diagnostic platforms. Overall, this work demonstrates a translational framework combining metabolomics and interpretable machine learning to advance objective, biologically informed diagnostic strategies for ADHD.
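
The abstract does not spell out the CR decision rule; as a hedged illustration of a "closest resemblance" style classifier, the sketch below uses a nearest-centroid rule over a tiny synthetic metabolite panel (the data and the exact distance are our assumptions, not the paper's model):

```python
import numpy as np

def fit_centroids(X, y):
    """One prototype (feature-wise mean) per class."""
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def predict(X, classes, centroids):
    """Assign each sample to the class whose prototype it most resembles."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[d.argmin(axis=1)]

# Tiny synthetic panel: 2 "metabolite" features, two groups
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
classes, cents = fit_centroids(X, y)
preds = predict(np.array([[0.05, 0.05], [1.0, 0.9]]), classes, cents)
```

A rule of this shape has the transparency property the abstract emphasises: each decision is explained by a distance to an interpretable per-class prototype over the 14-metabolite panel.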

[LG-15] Sample-Near-Optimal Agnostic Boosting with Improved Running Time ALT2026

链接: https://arxiv.org/abs/2601.11265
作者: Arthur da Cunha,Mikael Møller Høgsgaard,Andrea Paudice
类目: Machine Learning (cs.LG)
*备注: 28 pages, 0 figures. Accepted at the 37th International Conference on Algorithmic Learning Theory (ALT 2026)

点击查看摘要

Abstract:Boosting is a powerful method that turns weak learners, which perform only slightly better than random guessing, into strong learners with high accuracy. While boosting is well understood in the classic setting, it is less so in the agnostic case, where no assumptions are made about the data. Indeed, only recently was the sample complexity of agnostic boosting nearly settled (arXiv:2503.09384), but the known algorithm achieving this bound has exponential running time. In this work, we propose the first agnostic boosting algorithm with near-optimal sample complexity, running in time polynomial in the sample size when considering the other parameters of the problem fixed.

[LG-16] Scalable Music Cover Retrieval Using Lyrics-Aligned Audio Embeddings ECIR2026

链接: https://arxiv.org/abs/2601.11262
作者: Joanne Affolter,Benjamin Martin,Elena V. Epure,Gabriel Meseguer-Brocal,Frédéric Kaplan
类目: Sound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Published at ECIR 2026 (European Conference of Information Retrieval)

点击查看摘要

Abstract:Music Cover Retrieval, also known as Version Identification, aims to recognize distinct renditions of the same underlying musical work, a task central to catalog management, copyright enforcement, and music retrieval. State-of-the-art approaches have largely focused on harmonic and melodic features, employing increasingly complex audio pipelines designed to be invariant to musical attributes that often vary widely across covers. While effective, these methods demand substantial training time and computational resources. By contrast, lyrics constitute a strong invariant across covers, though their use has been limited by the difficulty of extracting them accurately and efficiently from polyphonic audio. Early methods relied on simple frameworks that limited downstream performance, while more recent systems deliver stronger results but require large models integrated within complex multimodal architectures. We introduce LIVI (Lyrics-Informed Version Identification), an approach that seeks to balance retrieval accuracy with computational efficiency. First, LIVI leverages supervision from state-of-the-art transcription and text embedding models during training to achieve retrieval accuracy on par with–or superior to–harmonic-based systems. Second, LIVI remains lightweight and efficient by removing the transcription step at inference, challenging the dominance of complexity-heavy pipelines.
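
At inference time, with the transcription step removed, a system like LIVI reduces to similarity search over lyric-informed embeddings. A generic sketch of that retrieval step (the embeddings here are random stand-ins for the model's outputs, and cosine similarity is an assumed choice):

```python
import numpy as np

def retrieve(query_emb, catalog_embs, top_k=3):
    """Rank catalog tracks by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    C = catalog_embs / np.linalg.norm(catalog_embs, axis=1, keepdims=True)
    return np.argsort(-(C @ q))[:top_k]

rng = np.random.default_rng(0)
catalog = rng.normal(size=(100, 64))             # stand-in lyric embeddings
cover = catalog[42] + 0.1 * rng.normal(size=64)  # noisy "cover" of track 42
ranked = retrieve(cover, catalog)
```

The lightweight-inference claim rests on this shape: once embeddings exist, version identification is a single normalised dot product per catalog track.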

[LG-17] Effects of Introducing Synaptic Scaling on Spiking Neural Network Learning

链接: https://arxiv.org/abs/2601.11261
作者: Shinnosuke Touda,Hirotsugu Okuno
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 6 pages, 4 figures, 6 tables

点击查看摘要

[LG-18] Latent Dynamics Graph Convolutional Networks for model order reduction of parameterized time-dependent PDEs

链接: https://arxiv.org/abs/2601.11259
作者: Lorenzo Tomada,Federico Pichi,Gianluigi Rozza
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) are emerging as powerful tools for nonlinear Model Order Reduction (MOR) of time-dependent parameterized Partial Differential Equations (PDEs). However, existing methodologies struggle to combine geometric inductive biases with interpretable latent behavior, overlooking dynamics-driven features or disregarding spatial information. In this work, we address this gap by introducing Latent Dynamics Graph Convolutional Network (LD-GCN), a purely data-driven, encoder-free architecture that learns a global, low-dimensional representation of dynamical systems conditioned on external inputs and parameters. The temporal evolution is modeled in the latent space and advanced through time-stepping, allowing for time-extrapolation, and the trajectories are consistently decoded onto geometrically parameterized domains using a GNN. Our framework enhances interpretability by enabling the analysis of the reduced dynamics and supporting zero-shot prediction through latent interpolation. The methodology is mathematically validated via a universal approximation theorem for encoder-free architectures, and numerically tested on complex computational mechanics problems involving physical and geometric parameters, including the detection of bifurcating phenomena for Navier-Stokes equations. Code availability: this https URL

[LG-19] Operator learning on domain boundary through combining fundamental solution-based artificial data and boundary integral techniques

链接: https://arxiv.org/abs/2601.11222
作者: Haochen Wu,Heng Wu,Benzhuo Lu
类目: Machine Learning (cs.LG)
*备注: 31 pages

点击查看摘要

Abstract:For linear partial differential equations with known fundamental solutions, this work introduces a novel operator learning framework that relies exclusively on domain boundary data, including solution values and normal derivatives, rather than full-domain sampling. By integrating the previously developed Mathematical Artificial Data (MAD) method, which enforces physical consistency, all training data are synthesized directly from the fundamental solutions of the target problems, resulting in a fully data-driven pipeline without the need for external measurements or numerical simulations. We refer to this approach as the Mathematical Artificial Data Boundary Neural Operator (MAD-BNO), which learns boundary-to-boundary mappings using MAD-generated Dirichlet-Neumann data pairs. Once trained, the interior solution at arbitrary locations can be efficiently recovered through boundary integral formulations, supporting Dirichlet, Neumann, and mixed boundary conditions as well as general source terms. The proposed method is validated on benchmark operator learning tasks for two-dimensional Laplace, Poisson, and Helmholtz equations, where it achieves accuracy comparable to or better than existing neural operator approaches while significantly reducing training time. The framework is naturally extensible to three-dimensional problems and complex geometries.

[LG-20] TimeMar: Multi-Scale Autoregressive Modeling for Unconditional Time Series Generation

链接: https://arxiv.org/abs/2601.11184
作者: Xiangyu Xu,Qingsong Zhong,Jilin Hu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generative modeling offers a promising solution to data scarcity and privacy challenges in time series analysis. However, the structural complexity of time series, characterized by multi-scale temporal patterns and heterogeneous components, remains insufficiently addressed. In this work, we propose a structure-disentangled multiscale generation framework for time series. Our approach encodes sequences into discrete tokens at multiple temporal resolutions and performs autoregressive generation in a coarse-to-fine manner, thereby preserving hierarchical dependencies. To tackle structural heterogeneity, we introduce a dual-path VQ-VAE that disentangles trend and seasonal components, enabling the learning of semantically consistent latent representations. Additionally, we present a guidance-based reconstruction strategy, where coarse seasonal signals are utilized as priors to guide the reconstruction of fine-grained seasonal patterns. Experiments on six datasets show that our approach produces higher-quality time series than existing methods. Notably, our model achieves strong performance with a significantly reduced parameter count and exhibits superior capability in generating high-quality long-term sequences. Our implementation is available at this https URL.

[LG-21] LSTM VS. Feed-Forward Autoencoders for Unsupervised Fault Detection in Hydraulic Pumps

链接: https://arxiv.org/abs/2601.11163
作者: P. Sánchez,K. Reyes,B. Radu,E. Fernández
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Unplanned failures in industrial hydraulic pumps can halt production and incur substantial costs. We explore two unsupervised autoencoder (AE) schemes for early fault detection: a feed-forward model that analyses individual sensor snapshots and a Long Short-Term Memory (LSTM) model that captures short temporal windows. Both networks are trained only on healthy data drawn from a minute-level log of 52 sensor channels; evaluation uses a separate set that contains seven annotated fault intervals. Despite the absence of fault samples during training, the models achieve high reliability.
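
The recipe shared by both schemes is: fit a reconstructor on healthy data only, then flag samples whose reconstruction error exceeds a threshold calibrated on healthy data. A dependency-free sketch using PCA as a linear stand-in for the autoencoder (the synthetic data, latent size, and threshold quantile are illustrative):

```python
import numpy as np

def fit_linear_ae(X_healthy, k=2):
    """PCA as a linear autoencoder: keep the top-k principal
    directions; healthy data reconstructs well, faults poorly."""
    mu = X_healthy.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_healthy - mu, full_matrices=False)
    return mu, Vt[:k]

def recon_error(X, mu, W):
    Z = (X - mu) @ W.T        # "encode" to k latent coordinates
    X_hat = Z @ W + mu        # "decode" back to sensor space
    return np.linalg.norm(X - X_hat, axis=1)

rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 2))
A = rng.normal(size=(2, 5))
healthy = Z @ A + 0.01 * rng.normal(size=(500, 5))    # near-2D "normal" data
mu, W = fit_linear_ae(healthy, k=2)
threshold = np.quantile(recon_error(healthy, mu, W), 0.99)
faulty = healthy[:5] + 3.0 * rng.normal(size=(5, 5))  # off-manifold shift
flags = recon_error(faulty, mu, W) > threshold
```

An LSTM variant changes only the reconstructor: it consumes short windows of consecutive snapshots instead of single rows, so temporal faults that look normal snapshot-by-snapshot can still raise the error.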

[LG-22] Theoretically and Practically Efficient Resistance Distance Computation on Large Graphs

链接: https://arxiv.org/abs/2601.11159
作者: Yichun Yang,Longlong Lin,Rong-Hua Li,Meihao Liao,Guoren Wang
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:The computation of resistance distance is pivotal in a wide range of graph analysis applications, including graph clustering, link prediction, and graph neural networks. Despite its foundational importance, efficient algorithms for computing resistance distances on large graphs are still lacking. Existing state-of-the-art (SOTA) methods, including power iteration-based algorithms and random walk-based local approaches, often struggle with slow convergence rates, particularly when the condition number of the graph Laplacian matrix, denoted by \kappa , is large. To tackle this challenge, we propose two novel and efficient algorithms inspired by the classic Lanczos method: Lanczos Iteration and Lanczos Push, both designed to reduce dependence on \kappa . Among them, Lanczos Iteration is a near-linear time global algorithm, whereas Lanczos Push is a local algorithm with a time complexity independent of the size of the graph. More specifically, we prove that the time complexity of Lanczos Iteration is \tilde{O}(\sqrt{\kappa}\,m) (where m is the number of edges of the graph and \tilde{O} means the complexity omitting the \log terms), which achieves a speedup of \sqrt{\kappa} compared to previous power iteration-based global methods. For Lanczos Push, we demonstrate that its time complexity is \tilde{O}(\kappa^{2.75}) under certain mild and frequently established assumptions, which represents a significant improvement of \kappa^{0.25} over the SOTA random walk-based local algorithms. We validate our algorithms through extensive experiments on eight real-world datasets of varying sizes and statistical properties, demonstrating that Lanczos Iteration and Lanczos Push significantly outperform SOTA methods in terms of both efficiency and accuracy.
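
As a reference point for what these fast algorithms approximate: resistance distance has a closed form in the Moore-Penrose pseudoinverse of the graph Laplacian, R(u, v) = L+[u, u] + L+[v, v] - 2 L+[u, v]. The dense baseline below is exact but cubic in the number of nodes, which is precisely the gap Lanczos-style methods address:

```python
import numpy as np

def resistance_distance(adj):
    """Exact effective-resistance matrix via the Moore-Penrose
    pseudoinverse of the graph Laplacian:
    R(u, v) = L+[u, u] + L+[v, v] - 2 * L+[u, v].
    O(n^3): the dense baseline that scalable methods avoid."""
    L = np.diag(adj.sum(axis=1)) - adj
    Lp = np.linalg.pinv(L)
    d = np.diag(Lp)
    return d[:, None] + d[None, :] - 2.0 * Lp

# Path graph 1-2-3 with unit edges: endpoints are two resistors in series
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
R = resistance_distance(adj)
```

On the three-node path the endpoint pair sees two unit resistors in series, so R(1, 3) = 2, which is a handy sanity check for any approximate solver.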

[LG-23] Assessing the Viability of Unsupervised Learning with Autoencoders for Predictive Maintenance in Helicopter Engines

链接: https://arxiv.org/abs/2601.11154
作者: P. Sánchez,K. Reyes,B. Radu,E. Fernández
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Unplanned engine failures in helicopters can lead to severe operational disruptions, safety hazards, and costly repairs. To mitigate these risks, this study compares two predictive maintenance strategies for helicopter engines: a supervised classification pipeline and an unsupervised anomaly detection approach based on autoencoders (AEs). The supervised method relies on labelled examples of both normal and faulty behaviour, while the unsupervised approach learns a model of normal operation using only healthy engine data, flagging deviations as potential faults. Both methods are evaluated on a real-world dataset comprising labelled snapshots of helicopter engine telemetry. While supervised models demonstrate strong performance when annotated failures are available, the AE achieves effective detection without requiring fault labels, making it particularly well suited for settings where failure data are scarce or incomplete. The comparison highlights the practical trade-offs between accuracy, data availability, and deployment feasibility, and underscores the potential of unsupervised learning as a viable solution for early fault detection in aerospace applications.

[LG-24] FSL-BDP: Federated Survival Learning with Bayesian Differential Privacy for Credit Risk Modeling

链接: https://arxiv.org/abs/2601.11134
作者: Sultan Amed,Tanmay Sen,Sayantan Banerjee
类目: Machine Learning (cs.LG); Risk Management (q-fin.RM); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Credit risk models are a critical decision-support tool for financial institutions, yet tightening data-protection rules (e.g., GDPR, CCPA) increasingly prohibit cross-border sharing of borrower data, even as these models benefit from cross-institution learning. Traditional default prediction suffers from two limitations: binary classification ignores default timing, treating early defaulters (high loss) equivalently to late defaulters (low loss), and centralized training violates emerging regulatory constraints. We propose a Federated Survival Learning framework with Bayesian Differential Privacy (FSL-BDP) that models time-to-default trajectories without centralizing sensitive data. The framework provides Bayesian (data-dependent) differential privacy (DP) guarantees while enabling institutions to jointly learn risk dynamics. Experiments on three real-world credit datasets (LendingClub, SBA, Bondora) show that federation fundamentally alters the relative effectiveness of privacy mechanisms. While classical DP performs better than Bayesian DP in centralized settings, the latter benefits substantially more from federation (+7.0% vs +1.4%), achieving near parity with non-private performance and outperforming classical DP in the majority of participating clients. This ranking reversal yields a key decision-support insight: privacy mechanism selection should be evaluated in the target deployment architecture, rather than on centralized benchmarks. These findings provide actionable guidance for practitioners designing privacy-preserving decision support systems in regulated, multi-institutional environments.

[LG-25] Shape-morphing programming of soft materials on complex geometries via neural operator

链接: https://arxiv.org/abs/2601.11126
作者: Lu Chen,Gengxiang Chen,Xu Liu,Jingyan Su,Xuhao Lyu,Lihui Wang,Yingguang Li
类目: Machine Learning (cs.LG)
*备注: 20 pages,5 Figures

点击查看摘要

Abstract:Shape-morphing soft materials can enable diverse target morphologies through voxel-level material distribution design, offering significant potential for various applications. Despite progress in basic shape-morphing design with simple geometries, achieving advanced applications such as conformal implant deployment or aerodynamic morphing requires accurate and diverse morphing designs on complex geometries, which remains challenging. Here, we present a Spectral and Spatial Neural Operator (S2NO), which enables high-fidelity morphing prediction on complex geometries. S2NO effectively captures global and local morphing behaviours on irregular computational domains by integrating Laplacian eigenfunction encoding and spatial convolutions. Combining S2NO with evolutionary algorithms enables voxel-level optimisation of material distributions for shape morphing programming on various complex geometries, including irregular-boundary shapes, porous structures, and thin-walled structures. Furthermore, the neural operator’s discretisation-invariant property enables super-resolution material distribution design, further expanding the diversity and complexity of morphing design. These advancements significantly improve the efficiency and capability of programming complex shape morphing.

[LG-26] Optimized Algorithms for Text Clustering with LLM-Generated Constraints AAAI-26

链接: https://arxiv.org/abs/2601.11118
作者: Chaoqi Jia,Weihong Wu,Longkun Guo,Zhigang Lu,Chao Chen,Kok-Leong Ong
类目: Machine Learning (cs.LG)
*备注: AAAI-26

点击查看摘要

Abstract:Clustering is a fundamental tool that has garnered significant interest across a wide range of applications including text analysis. To improve clustering accuracy, many researchers have incorporated background knowledge, typically in the form of must-link and cannot-link constraints, to guide the clustering process. With the recent advent of large language models (LLMs), there is growing interest in improving clustering quality through LLM-based automatic constraint generation. In this paper, we propose a novel constraint-generation approach that reduces resource consumption by generating constraint sets rather than using traditional pairwise constraints. This approach improves both query efficiency and constraint accuracy compared to state-of-the-art methods. We further introduce a constrained clustering algorithm tailored to the characteristics of LLM-generated constraints. Our method incorporates a confidence threshold and a penalty mechanism to address potentially inaccurate constraints. We evaluate our approach on five text datasets, considering both the cost of constraint generation and the overall clustering performance. The results show that our method achieves clustering accuracy comparable to the state-of-the-art algorithms while reducing the number of LLM queries by more than 20 times.
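
The confidence threshold and penalty mechanism can be sketched as a modified assignment step: a candidate cluster pays a penalty for every sufficiently confident cannot-link constraint it would violate. The constraint format, threshold, and penalty value below are our assumptions, not the paper's:

```python
import numpy as np

def assign_with_constraints(x_idx, X, centroids, labels,
                            cannot_links, penalty=10.0, conf_min=0.8):
    """Choose the cluster minimising distance plus a penalty for every
    sufficiently confident cannot-link the choice would violate.
    cannot_links: list of (i, j, confidence) triples."""
    cost = np.linalg.norm(centroids - X[x_idx], axis=1)
    for i, j, conf in cannot_links:
        if conf < conf_min:
            continue                      # drop low-confidence constraints
        other = j if i == x_idx else (i if j == x_idx else None)
        if other is not None and labels[other] >= 0:
            cost[labels[other]] += penalty
    return int(cost.argmin())

X = np.array([[0.0], [0.1], [0.2]])
centroids = np.array([[0.0], [1.0]])
labels = np.array([0, -1, -1])            # point 0 already in cluster 0
links = [(0, 1, 0.95)]                    # "0 and 1 should not co-cluster"
lab1 = assign_with_constraints(1, X, centroids, labels, links)
lab2 = assign_with_constraints(2, X, centroids, labels, links)
```

A soft penalty rather than a hard ban is what keeps the clustering robust when some LLM-generated constraints are wrong: a sufficiently large distance gap can still override a mistaken constraint.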

[LG-27] Differentially Private Subspace Fine-Tuning for Large Language Models

链接: https://arxiv.org/abs/2601.11113
作者: Lele Zheng,Xiang Wang,Tao Zhang,Yang Cao,Ke Cheng,Yulong Shen
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Fine-tuning large language models on downstream tasks is crucial for realizing their cross-domain potential but often relies on sensitive data, raising privacy concerns. Differential privacy (DP) offers rigorous privacy guarantees and has been widely adopted in fine-tuning; however, naively injecting noise across the high-dimensional parameter space creates perturbations with large norms, degrading performance and destabilizing training. To address this issue, we propose DP-SFT, a two-stage subspace fine-tuning method that substantially reduces noise magnitude while preserving formal DP guarantees. Our intuition is that, during fine-tuning, significant parameter updates lie within a low-dimensional, task-specific subspace, while other directions change minimally. Hence, we only inject DP noise into this subspace to protect privacy without perturbing irrelevant parameters. In phase one, we identify the subspace by analyzing principal gradient directions to capture task-specific update signals. In phase two, we project full gradients onto this subspace, add DP noise, and map the perturbed gradients back to the original parameter space for model updates, markedly lowering noise impact. Experiments on multiple datasets demonstrate that DP-SFT enhances accuracy and stability under rigorous DP constraints, accelerates convergence, and achieves substantial gains over DP fine-tuning baselines.
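
Phase two of this scheme can be illustrated in a few lines: project the gradient onto a k-dimensional subspace, clip and add Gaussian noise there, then map back, so noise is injected in k rather than d dimensions. The basis, clipping norm, and noise scale below are illustrative stand-ins, not DP-SFT's calibrated values:

```python
import numpy as np

def dp_subspace_step(grad, U, sigma, clip, rng):
    """Project grad onto the task subspace (columns of U, orthonormal),
    clip its norm, add Gaussian noise in the k-dim latent space, and
    map back: noise magnitude scales with k, not the full dimension d."""
    z = U.T @ grad
    z = z / max(1.0, np.linalg.norm(z) / clip)            # norm clipping
    z = z + rng.normal(scale=sigma * clip, size=z.shape)  # noise in k dims
    return U @ z

rng = np.random.default_rng(0)
d, k = 1000, 8
U, _ = np.linalg.qr(rng.normal(size=(d, k)))  # orthonormal basis, d x k
grad = rng.normal(size=d)
noisy = dp_subspace_step(grad, U, sigma=0.5, clip=1.0, rng=rng)
```

The returned update lies entirely in the span of U, so parameters outside the identified task subspace receive no perturbation at all, which is the source of the claimed noise reduction.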

[LG-28] Soft Bayesian Context Tree Models for Real-Valued Time Series

链接: https://arxiv.org/abs/2601.11079
作者: Shota Saito,Yuta Nakahara,Toshiyasu Matsushima
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper proposes the soft Bayesian context tree model (Soft-BCT), which is a novel BCT model for real-valued time series. The Soft-BCT considers soft (probabilistic) splits of the context space, instead of hard (deterministic) splits as in the previous BCT for real-valued time series. A learning algorithm for the Soft-BCT is proposed based on variational inference. On several real-world datasets, the Soft-BCT demonstrates performance comparable or superior to the previous BCT.

[LG-29] OpFML: Pipeline for ML-based Operational Forecasting

链接: https://arxiv.org/abs/2601.11046
作者: Shahbaz Alvi,Giusy Fedele,Gabriele Accarino,Italo Epicoco,Ilenia Manco,Pasquale Schiano
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning is finding its application in a multitude of areas in science and research, and Climate and Earth Sciences is no exception to this trend. Operational forecasting systems based on data-driven approaches and machine learning methods deploy models for periodic forecasting. Wildfire danger assessment using machine learning has garnered significant interest in the last decade, as conventional methods often overestimate the risk of wildfires. In this work, we present the code OpFML: Operational Forecasting with Machine Learning. OpFML is a configurable and adaptable pipeline that can be utilized to serve a machine learning model for periodic forecasting. We further demonstrate the capabilities of the pipeline through its application to daily Fire Danger Index forecasting and outline its various features.

[LG-30] Self-Augmented Mixture-of-Experts for QoS Prediction

链接: https://arxiv.org/abs/2601.11036
作者: Kecheng Cai,Chao Peng,Chenyang Xu,Xia Chen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quality of Service (QoS) prediction is one of the most fundamental problems in service computing and personalized recommendation. In the problem, there is a set of users and services, each associated with a set of descriptive features. Interactions between users and services produce feedback values, typically represented as numerical QoS metrics such as response time or availability. Given the observed feedback for a subset of user-service pairs, the goal is to predict the QoS values for the remaining pairs. A key challenge in QoS prediction is the inherent sparsity of user-service interactions, as only a small subset of feedback values is typically observed. To address this, we propose a self-augmented strategy that leverages a model’s own predictions for iterative refinement. In particular, we partially mask the predicted values and feed them back into the model to predict again. Building on this idea, we design a self-augmented mixture-of-experts model, where multiple expert networks iteratively and collaboratively estimate QoS values. We find that the iterative augmentation process naturally aligns with the MoE architecture by enabling inter-expert communication: in the second round, each expert receives the first-round predictions and refines its output accordingly. Experiments on benchmark datasets show that our method outperforms existing baselines and achieves competitive results.
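The self-augmentation loop described above (predict, partially re-mask the predictions, feed them back, predict again) can be sketched on a toy matrix-completion problem. The low-rank "predictor", matrix sizes, and masking rate below are our assumptions for illustration, not the authors' MoE model:

```python
import numpy as np

# Toy QoS matrix with sparse observations (stand-ins for user-service feedback).
rng = np.random.default_rng(0)
true = rng.random((8, 8))                    # ground-truth QoS values (toy)
observed = rng.random((8, 8)) < 0.4          # sparse observation mask

def predict(filled):
    """Stand-in model: rank-2 SVD reconstruction of the current matrix."""
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    return (U[:, :2] * s[:2]) @ Vt[:2]

filled = np.where(observed, true, true[observed].mean())  # init missing with mean
for _ in range(5):
    pred = predict(filled)
    # Self-augmentation: keep observed entries exactly, re-mask ~30% of the
    # predicted entries back to the current estimate so the next round refines them.
    remask = rng.random(true.shape) < 0.3
    filled = np.where(observed, true, np.where(remask, filled, pred))

rmse = np.sqrt(np.mean((filled[~observed] - true[~observed]) ** 2))
print(round(rmse, 3))
```

Observed entries are never overwritten; only the masked predictions are iteratively refined.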

[LG-31] AVP-Pro: An Adaptive Multi-Modal Fusion and Contrastive Learning Approach for Comprehensive Two-Stage Antiviral Peptide Identification

链接: https://arxiv.org/abs/2601.11028
作者: Xinru Wen,Weizhong Lin,Zi Liu,Xuan Xiao
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: substantial text overlap with arXiv:2512.21544

点击查看摘要

Abstract:The accurate identification of antiviral peptides (AVPs) is crucial for novel drug development. However, existing methods still have limitations in capturing complex sequence dependencies and distinguishing confusing samples with high similarity. To address these challenges, we propose AVP-Pro, a novel two-stage predictive framework that integrates adaptive feature fusion and contrastive learning. To comprehensively capture the physicochemical properties and deep-seated patterns of peptide sequences, we constructed a panoramic feature space encompassing 10 distinct descriptors and designed a hierarchical fusion architecture. This architecture integrates self-attention and adaptive gating mechanisms to dynamically modulate the weights of local motifs extracted by CNNs and global dependencies captured by BiLSTMs based on sequence context. Targeting the blurred decision boundary caused by the high similarity between positive and negative sample sequences, we adopted an Online Hard Example Mining (OHEM)-driven contrastive learning strategy enhanced by BLOSUM62. This approach significantly sharpened the model’s discriminative power. Model evaluation results show that in the first stage of general AVP identification, the model achieved an accuracy of 0.9531 and an MCC of 0.9064, outperforming existing state-of-the-art (SOTA) methods. In the second stage of functional subtype prediction, combined with a transfer learning strategy, the model realized accurate classification of 6 viral families and 8 specific viruses under small-sample conditions. AVP-Pro provides a powerful and interpretable new tool for the high-throughput screening of antiviral drugs. To further enhance accessibility for users, we have developed a user-friendly web interface, which is available at this https URL.

[LG-32] Backdoor Attacks on Multi-modal Contrastive Learning

链接: https://arxiv.org/abs/2601.11006
作者: Simi D Kuniyilh,Rita Machacy
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Contrastive learning has become a leading self-supervised approach to representation learning across domains, including vision, multimodal settings, graphs, and federated learning. However, recent studies have shown that contrastive learning is susceptible to backdoor and data poisoning attacks. In these attacks, adversaries can manipulate pretraining data or model updates to insert hidden malicious behavior. This paper offers a thorough and comparative review of backdoor attacks in contrastive learning. It analyzes threat models, attack methods, target domains, and available defenses. We summarize recent advancements in this area, underline the specific vulnerabilities inherent to contrastive learning, and discuss the challenges and future research directions. Our findings have significant implications for the secure deployment of systems in industrial and distributed environments.

[LG-33] Exact Constraint Enforcement in Physics-Informed Extreme Learning Machines using Null-Space Projection Framework

链接: https://arxiv.org/abs/2601.10999
作者: Rishi Mishra,Smriti,Balaji Srinivasan,Sundararajan Natarajan,Ganapathy Krishnamurthi
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 16 pages, 6 figures

点击查看摘要

Abstract:Physics-informed extreme learning machines (PIELMs) typically impose boundary and initial conditions through penalty terms, yielding only approximate satisfaction that is sensitive to user-specified weights and can propagate errors into the interior solution. This work introduces Null-Space Projected PIELM (NP-PIELM), achieving exact constraint enforcement through algebraic projection in coefficient space. The method exploits the geometric structure of the admissible coefficient manifold, recognizing that it admits a decomposition through the null space of the boundary operator. By characterizing this manifold via a translation-invariant representation and projecting onto the kernel component, optimization is restricted to constraint-preserving directions, transforming the constrained problem into unconstrained least-squares where boundary conditions are satisfied exactly at discrete collocation points. This eliminates penalty coefficients, dual variables, and problem-specific constructions while preserving single-shot training efficiency. Numerical experiments on elliptic and parabolic problems including complex geometries and mixed boundary conditions validate the framework.

[LG-34] Constant Metric Scaling in Riemannian Computation

链接: https://arxiv.org/abs/2601.10992
作者: Kisung You
类目: Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:Constant rescaling of a Riemannian metric appears in many computational settings, often through a global scale parameter that is introduced either explicitly or implicitly. Although this operation is elementary, its consequences are not always made clear in practice and may be confused with changes in curvature, manifold structure, or coordinate representation. In this note we provide a short, self-contained account of constant metric scaling on arbitrary Riemannian manifolds. We distinguish between quantities that change under such a scaling, including norms, distances, volume elements, and gradient magnitudes, and geometric objects that remain invariant, such as the Levi–Civita connection, geodesics, exponential and logarithmic maps, and parallel transport. We also discuss implications for Riemannian optimization, where constant metric scaling can often be interpreted as a global rescaling of step sizes rather than a modification of the underlying geometry. The goal of this note is purely expository and is intended to clarify how a global metric scale parameter can be introduced in Riemannian computation without altering the geometric structures on which these methods rely.
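The dichotomy the note describes can be checked numerically in the simplest setting (the Euclidean plane as the manifold; the scale and points below are arbitrary): rescaling the metric g → c²g multiplies norms and distances by c and divides Riemannian gradients by c², while the geodesic between two points is pointwise unchanged.

```python
import numpy as np

c = 3.0                                       # global metric scale parameter
p, q = np.array([0.0, 0.0]), np.array([1.0, 2.0])

dist = np.linalg.norm(q - p)                  # distance under g
dist_scaled = c * dist                        # distance under c^2 * g

# The geodesic gamma(t) = p + t*(q - p) is the same curve under both metrics;
# only its length changes.
t = np.linspace(0, 1, 5)
geodesic = p + t[:, None] * (q - p)

# Riemannian gradient rescales inversely: grad_{c^2 g} f = grad_g f / c^2,
# so a step of size eta under c^2*g equals a step of size eta/c^2 under g.
grad_g = np.array([2.0, -1.0])                # stand-in Euclidean gradient
grad_scaled = grad_g / c**2

print(dist_scaled / dist)                     # -> 3.0
```

This is the "global rescaling of step sizes" interpretation for Riemannian optimization mentioned in the abstract.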

[LG-35] Reasoning Distillation for Lightweight Automated Program Repair

链接: https://arxiv.org/abs/2601.10987
作者: Aanand Balasubramanian,Sashank Silwal
类目: Machine Learning (cs.LG)
*备注: 8 pages, 5 tables. Preprint

点击查看摘要

Abstract:We study whether lightweight symbolic reasoning supervision can improve fix type classification in compact automated program repair models. Small code models are attractive for resource-constrained settings, but they typically produce only a single prediction, making it unclear whether they learn meaningful program structure or rely on shallow correlations. We propose a reasoning distillation approach in which a large teacher model provides structured symbolic reasoning tags alongside fix-type labels. These tags capture high-level causal properties of bugs without relying on free-form explanations. We train a CodeT5-based student model under label-only and reasoning-distilled settings on the IntroClass benchmark. Reasoning supervision consistently improves macro averaged performance, particularly on less frequent bug categories, without increasing model size or complexity. We further analyze the relationship between reasoning accuracy and fix-type prediction, showing that correct reasoning traces strongly correlate with correct predictions, while not fully determining them. Our results suggest that symbolic reasoning distillation is a practical way to improve interpretability and robustness in lightweight program repair models.

[LG-36] Toward Adaptive Grid Resilience: A Gradient-Free Meta-RL Framework for Critical Load Restoration

链接: https://arxiv.org/abs/2601.10973
作者: Zain ul Abdeen,Waris Gill,Ming Jin
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Restoring critical loads after extreme events demands adaptive control to maintain distribution-grid resilience, yet uncertainty in renewable generation, limited dispatchable resources, and nonlinear dynamics make effective restoration difficult. Reinforcement learning (RL) can optimize sequential decisions under uncertainty, but standard RL often generalizes poorly and requires extensive retraining for new outage configurations or generation patterns. We propose a meta-guided gradient-free RL (MGF-RL) framework that learns a transferable initialization from historical outage experiences and rapidly adapts to unseen scenarios with minimal task-specific tuning. MGF-RL couples first-order meta-learning with evolutionary strategies, enabling scalable policy search without gradient computation while accommodating nonlinear, constrained distribution-system dynamics. Experiments on IEEE 13-bus and IEEE 123-bus test systems show that MGF-RL outperforms standard RL, MAML-based meta-RL, and model predictive control across reliability, restoration speed, and adaptation efficiency under renewable forecast errors. MGF-RL generalizes to unseen outages and renewable patterns while requiring substantially fewer fine-tuning episodes than conventional RL. We also provide sublinear regret bounds that relate adaptation efficiency to task similarity and environmental variation, supporting the empirical gains and motivating MGF-RL for real-time load restoration in renewable-rich distribution grids.

[LG-37] Transient learning dynamics drive escape from sharp valleys in Stochastic Gradient Descent

链接: https://arxiv.org/abs/2601.10962
作者: Ning Yang,Yikuan Zhang,Qi Ouyang,Chao Tang,Yuhai Tu
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)
*备注: 15 pages, 6 figures

点击查看摘要

Abstract:Stochastic gradient descent (SGD) is central to deep learning, yet the dynamical origin of its preference for flatter, more generalizable solutions remains unclear. Here, by analyzing SGD learning dynamics, we identify a nonequilibrium mechanism governing solution selection. Numerical experiments reveal a transient exploratory phase in which SGD trajectories repeatedly escape sharp valleys and transition toward flatter regions of the loss landscape. By using a tractable physical model, we show that the SGD noise reshapes the landscape into an effective potential that favors flat solutions. Crucially, we uncover a transient freezing mechanism: as training proceeds, growing energy barriers suppress inter-valley transitions and ultimately trap the dynamics within a single basin. Increasing the SGD noise strength delays this freezing, which enhances convergence to flatter minima. Together, these results provide a unified physical framework linking learning dynamics, loss-landscape geometry, and generalization, and suggest principles for the design of more effective optimization algorithms.

[LG-38] Multivariate LSTM-Based Forecasting for Renewable Energy: Enhancing Climate Change Mitigation WWW ICLR2025

链接: https://arxiv.org/abs/2601.10961
作者: Farshid Kamrani,Kristen Schell
类目: Machine Learning (cs.LG)
*备注: ICLR 2025 Workshop on Tackling Climate Change with Machine Learning, paper #57 ( this https URL )

点击查看摘要

Abstract:The increasing integration of renewable energy sources (RESs) into modern power systems presents significant opportunities but also notable challenges, primarily due to the inherent variability of RES generation. Accurate forecasting of RES generation is crucial for maintaining the reliability, stability, and economic efficiency of power system operations. Traditional approaches, such as deterministic methods and stochastic programming, frequently depend on representative scenarios generated through clustering techniques like K-means. However, these methods may fail to fully capture the complex temporal dependencies and non-linear patterns within RES data. This paper introduces a multivariate Long Short-Term Memory (LSTM)-based network designed to forecast RESs generation using their real-world historical data. The proposed model effectively captures long-term dependencies and interactions between different RESs, utilizing historical data from both local and neighboring areas to enhance predictive accuracy. In the case study, we showed that the proposed forecasting approach results in lower CO2 emissions and a more reliable supply of electric loads.

[LG-39] HOSL: Hybrid-Order Split Learning for Memory-Constrained Edge Training

链接: https://arxiv.org/abs/2601.10940
作者: Aakriti,Zhe Li,Dandan Liang,Chao Huang,Rui Li,Haibo Yang
类目: Machine Learning (cs.LG)
*备注: 12 pages, 2 figures, 6 tables. Submitted to WiOpt 2026

点击查看摘要

Abstract:Split learning (SL) enables collaborative training of large language models (LLMs) between resource-constrained edge devices and compute-rich servers by partitioning model computation across the network boundary. However, existing SL systems predominantly rely on first-order (FO) optimization, which requires clients to store intermediate quantities such as activations for backpropagation. This results in substantial memory overhead, largely negating benefits of model partitioning. In contrast, zeroth-order (ZO) optimization eliminates backpropagation and significantly reduces memory usage, but often suffers from slow convergence and degraded performance. In this work, we propose HOSL, a novel Hybrid-Order Split Learning framework that addresses this fundamental trade-off between memory efficiency and optimization effectiveness by strategically integrating ZO optimization on the client side with FO optimization on the server side. By employing memory-efficient ZO gradient estimation at the client, HOSL eliminates backpropagation and activation storage, reducing client memory consumption. Meanwhile, server-side FO optimization ensures fast convergence and competitive performance. Theoretically, we show that HOSL achieves a $\mathcal{O}(\sqrt{d_c/TQ})$ rate, which depends on the client-side model dimension $d_c$ rather than the full model dimension $d$, demonstrating that convergence improves as more computation is offloaded to the server. Extensive experiments on OPT models (125M and 1.3B parameters) across 6 tasks demonstrate that HOSL reduces client GPU memory by up to 3.7× compared to the FO method while achieving accuracy within 0.20%-4.23% of this baseline. Furthermore, HOSL outperforms the ZO baseline by up to 15.55%, validating the effectiveness of our hybrid strategy for memory-efficient training on edge devices.
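The client-side memory trick HOSL builds on can be illustrated with a two-point zeroth-order gradient estimator, which needs only forward evaluations and therefore stores no activations for backpropagation. The estimator below is a generic sketch (step sizes, sample counts, and the quadratic test loss are our choices, not the authors' setup):

```python
import numpy as np

def zo_grad(loss, w, mu=1e-3, n_samples=32, seed=0):
    """Two-point zeroth-order gradient estimate: average of directional
    finite differences along random Gaussian probe directions."""
    rng = np.random.default_rng(seed)
    g = np.zeros_like(w)
    for _ in range(n_samples):
        u = rng.standard_normal(w.shape)          # random probe direction
        g += (loss(w + mu * u) - loss(w - mu * u)) / (2 * mu) * u
    return g / n_samples

# Quadratic test loss; the true gradient is 2 * (w - 1).
loss = lambda w: float(np.sum((w - 1.0) ** 2))
w = np.zeros(4)
for _ in range(300):
    w -= 0.05 * zo_grad(loss, w)
print(np.round(w, 3))  # converges toward the minimizer at all-ones
```

Each update costs 2 × n_samples forward passes and no backward pass, which is the memory/convergence trade-off the hybrid framework balances against server-side first-order steps.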

[LG-40] A PAC-Bayesian Analysis of Channel-Induced Degradation in Edge Inference

链接: https://arxiv.org/abs/2601.10915
作者: Yangshuo He,Guanding Yu,Jingge Zhu
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the emerging paradigm of edge inference, neural networks (NNs) are partitioned across distributed edge devices that collaboratively perform inference via wireless transmission. However, standard NNs are generally trained in a noiseless environment, creating a mismatch with the noisy channels during edge deployment. In this paper, we address this issue by characterizing the channel-induced performance deterioration as a generalization error against unseen channels. We introduce an augmented NN model that incorporates channel statistics directly into the weight space, allowing us to derive PAC-Bayesian generalization bounds that explicitly quantify the impact of wireless distortion. We further provide closed-form expressions for practical channels to demonstrate the tractability of these bounds. Inspired by the theoretical results, we propose a channel-aware training algorithm that minimizes a surrogate objective based on the derived bound. Simulations show that the proposed algorithm can effectively improve inference accuracy by leveraging channel statistics, without end-to-end re-training.

[LG-41] FAConvLSTM: Factorized-Attention ConvLSTM for Efficient Feature Extraction in Multivariate Climate Data

链接: https://arxiv.org/abs/2601.10914
作者: Francis Ndikum Nji,Jianwu Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning physically meaningful spatiotemporal representations from high-resolution multivariate Earth observation data is challenging due to strong local dynamics, long-range teleconnections, multi-scale interactions, and nonstationarity. While ConvLSTM2D is a commonly used baseline, its dense convolutional gating incurs high computational cost and its strictly local receptive fields limit the modeling of long-range spatial structure and disentangled climate dynamics. To address these limitations, we propose FAConvLSTM, a Factorized-Attention ConvLSTM layer designed as a drop-in replacement for ConvLSTM2D that simultaneously improves efficiency, spatial expressiveness, and physical interpretability. FAConvLSTM factorizes recurrent gate computations using lightweight $1\times 1$ bottlenecks and shared depthwise spatial mixing, substantially reducing channel complexity while preserving recurrent dynamics. Multi-scale dilated depthwise branches and squeeze-and-excitation recalibration enable efficient modeling of interacting physical processes across spatial scales, while peephole connections enhance temporal precision. To capture teleconnection-scale dependencies without incurring global attention cost, FAConvLSTM incorporates a lightweight axial spatial attention mechanism applied sparsely in time. A dedicated subspace head further produces compact per-timestep embeddings refined through temporal self-attention with fixed seasonal positional encoding. Experiments on multivariate spatiotemporal climate data demonstrate that FAConvLSTM yields more stable, interpretable, and robust latent representations than standard ConvLSTM, while significantly reducing computational overhead.

[LG-42] Realistic Curriculum Reinforcement Learning for Autonomous and Sustainable Marine Vessel Navigation AAAI-26 AAAI

链接: https://arxiv.org/abs/2601.10911
作者: Zhang Xiaocai,Xiao Zhe,Liang Maohan,Liu Tao,Li Haijiang,Zhang Wenbin
类目: Machine Learning (cs.LG)
*备注: Presented at the 40th Annual AAAI Conference on Artificial Intelligence (AAAI-26)

点击查看摘要

Abstract:Sustainability is becoming increasingly critical in the maritime transport, encompassing both environmental and social impacts, such as Greenhouse Gas (GHG) emissions and navigational safety. Traditional vessel navigation heavily relies on human experience, often lacking autonomy and emission awareness, and is prone to human errors that may compromise safety. In this paper, we propose a Curriculum Reinforcement Learning (CRL) framework integrated with a realistic, data-driven marine simulation environment and a machine learning-based fuel consumption prediction module. The simulation environment is constructed using real-world vessel movement data and enhanced with a Diffusion Model to simulate dynamic maritime conditions. Vessel fuel consumption is estimated using historical operational data and learning-based regression. The surrounding environment is represented as image-based inputs to capture spatial complexity. We design a lightweight, policy-based CRL agent with a comprehensive reward mechanism that considers safety, emissions, timeliness, and goal completion. This framework effectively handles complex tasks progressively while ensuring stable and efficient learning in continuous action spaces. We validate the proposed approach in a sea area of the Indian Ocean, demonstrating its efficacy in enabling sustainable and safe vessel navigation.

[LG-43] Action Shapley: A Training Data Selection Metric for World Model in Reinforcement Learning

链接: https://arxiv.org/abs/2601.10905
作者: Rajat Ghosh,Debojyoti Dutta
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Numerous offline and model-based reinforcement learning systems incorporate world models to emulate the inherent environments. A world model is particularly important in scenarios where direct interactions with the real environment are costly, dangerous, or impractical. The efficacy and interpretability of such world models are notably contingent upon the quality of the underlying training data. In this context, we introduce Action Shapley as an agnostic metric for the judicious and unbiased selection of training data. To facilitate the computation of Action Shapley, we present a randomized dynamic algorithm specifically designed to mitigate the exponential complexity inherent in traditional Shapley value computations. Through empirical validation across five data-constrained real-world case studies, the algorithm demonstrates a computational efficiency improvement exceeding 80% in comparison to conventional exponential-time computations. Furthermore, our Action Shapley-based training data selection policy consistently outperforms ad-hoc training data selection.
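The data-valuation idea behind Action Shapley can be sketched with a permutation-sampling Shapley estimator: each training point's value is its average marginal contribution to a utility function over random orderings of the data. The tiny dataset and the stand-in utility below are ours, not the paper's randomized dynamic algorithm:

```python
import random

# Per-point "quality" of three toy training points (stand-in data).
data = {"a": 3.0, "b": 1.0, "c": 0.5}

def utility(subset):
    """Stand-in for world-model performance trained on `subset`:
    diminishing returns in total quality."""
    total = sum(data[k] for k in subset)
    return total / (1.0 + total)

def mc_shapley(points, utility, n_perm=2000, seed=0):
    """Monte Carlo Shapley values via random permutations of the points."""
    rng = random.Random(seed)
    phi = {p: 0.0 for p in points}
    for _ in range(n_perm):
        order = list(points)
        rng.shuffle(order)
        subset, prev = [], utility([])
        for p in order:
            subset.append(p)
            cur = utility(subset)
            phi[p] += cur - prev          # marginal contribution of p
            prev = cur
    return {p: v / n_perm for p, v in phi.items()}

phi = mc_shapley(list(data), utility)
# Efficiency: the values telescope, so they sum to utility(all) - utility(empty).
print(round(sum(phi.values()), 6), round(utility(list(data)), 6))
```

Higher-quality points receive larger values, which is the property a Shapley-based training-data selection policy exploits.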

[LG-44] Unit-Consistent (UC) Adjoint for GSD and Backprop in Deep Learning Applications

链接: https://arxiv.org/abs/2601.10873
作者: Jeffrey Uhlmann
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep neural networks constructed from linear maps and positively homogeneous nonlinearities (e.g., ReLU) possess a fundamental gauge symmetry: the network function is invariant to node-wise diagonal rescalings. However, standard gradient descent is not equivariant to this symmetry, causing optimization trajectories to depend heavily on arbitrary parameterizations. Prior work has proposed rescaling-invariant optimization schemes for positively homogeneous networks (e.g., path-based or path-space updates). Our contribution is complementary: we formulate the invariance requirement at the level of the backward adjoint/optimization geometry, which provides a simple, operator-level recipe that can be applied uniformly across network components and optimizer state. By replacing the Euclidean transpose with a Unit-Consistent (UC) adjoint, we derive UC gauge-consistent steepest descent and backpropagation.

[LG-45] Beyond Accuracy: A Stability-Aware Metric for Multi-Horizon Forecasting

链接: https://arxiv.org/abs/2601.10863
作者: Chutian Ma,Grigorii Pomazkin,Giacinto Paolo Saggese,Paul Smith
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traditional time series forecasting methods optimize for accuracy alone. This objective neglects temporal consistency, in other words, how consistently a model predicts the same future event as the forecast origin changes. We introduce the forecast accuracy and coherence score (forecast AC score for short) for measuring the quality of probabilistic multi-horizon forecasts in a way that accounts for both multi-horizon accuracy and stability. Our score additionally provides for user-specified weights to balance accuracy and consistency requirements. As an example application, we implement the score as a differentiable objective function for training seasonal ARIMA models and evaluate it on the M4 Hourly benchmark dataset. Results demonstrate substantial improvements over traditional maximum likelihood estimation. Our AC-optimized models achieve a 75% reduction in forecast volatility for the same target timestamps while maintaining comparable or improved point forecast accuracy.
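A hedged sketch of the accuracy-and-coherence idea: score a set of multi-horizon forecasts by combining point error with how much forecasts for the same target timestamp are revised as the forecast origin moves. The exact forecast AC score is defined in the paper; the weighting scheme, toy numbers, and function names below are our simplification:

```python
import numpy as np

def ac_score(forecasts, actuals, w_acc=0.5, w_con=0.5):
    """forecasts[o, t]: forecast of target t issued from origin o.
    Lower is better; w_acc / w_con trade off accuracy vs. consistency."""
    err = np.nanmean(np.abs(forecasts - actuals[None, :]))   # accuracy term
    revisions = np.abs(np.diff(forecasts, axis=0))           # origin-to-origin revisions
    con = np.nanmean(revisions)                              # consistency term
    return w_acc * err + w_con * con

actuals = np.array([10.0, 12.0, 11.0])
stable   = np.array([[10.5, 12.5, 11.5], [10.5, 12.5, 11.5]])  # never revised
volatile = np.array([[ 9.0, 13.5, 10.0], [12.0, 11.5, 13.0]])  # heavy revisions
print(ac_score(stable, actuals), ac_score(volatile, actuals))  # stable scores lower
```

A pure-accuracy metric would see little difference between the two forecast sets; the consistency term is what separates them.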

[LG-46] AI-Guided Human-In-the-Loop Inverse Design of High Performance Engineering Structures

链接: https://arxiv.org/abs/2601.10859
作者: Dat Quoc Ha,Md Ferdous Alam,Markus J. Buehler,Faez Ahmed,Josephine V. Carstensen
类目: Machine Learning (cs.LG)
*备注: 21 pages, 10 figures

点击查看摘要

Abstract:Inverse design tools such as Topology Optimization (TO) can achieve new levels of improvement for high-performance engineered structures. However, widespread use is hindered by high computational times and a black-box nature that inhibits user interaction. Human-in-the-loop TO approaches are emerging that integrate human intuition into the design generation process. However, these rely on the time-consuming bottleneck of iterative region selection for design modifications. To reduce the number of iterative trials, this contribution presents an AI co-pilot that uses machine learning to predict the user’s preferred regions. The prediction model is configured as an image segmentation task with a U-Net architecture. It is trained on synthetic datasets where human preferences either identify the longest topological member or the most complex structural connection. The model successfully predicts plausible regions for modification and presents them to the user as AI recommendations. The human preference model demonstrates generalization across diverse and non-standard TO problems and exhibits emergent behavior outside the single-region selection training data. Demonstration examples show that the new human-in-the-loop TO approach that integrates the AI co-pilot can improve manufacturability or increase the linear buckling load by 39% while increasing the total design time by only 15 seconds compared to conventional simplistic TO.

[LG-47] Mugi: Value Level Parallelism for Efficient LLMs

链接: https://arxiv.org/abs/2601.10823
作者: Daniel Price,Prabhu Vellaisamy,John Shen,Di Wu
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: 2026 International Conference on Architectural Support for Programming Languages and Operating Systems

点击查看摘要

Abstract:Value level parallelism (VLP) has been proposed to improve the efficiency of large-batch, low-precision general matrix multiply (GEMM) between symmetric activations and weights. In transformer based large language models (LLMs), there exist more sophisticated operations beyond activation-weight GEMM. In this paper, we explore how VLP benefits LLMs. First, we generalize VLP for nonlinear approximations, outperforming existing nonlinear approximations in end-to-end LLM accuracy, performance, and efficiency. Our VLP approximation follows a value-centric approach, where important values are assigned with greater accuracy. Second, we optimize VLP for small-batch GEMMs with asymmetric inputs efficiently, which leverages timely LLM optimizations, including weight-only quantization, key-value (KV) cache quantization, and group query attention. Finally, we design a new VLP architecture, Mugi, to encapsulate the innovations above and support full LLM workloads, while providing better performance, efficiency and sustainability. Our experimental results show that Mugi can offer significant improvements on throughput and energy efficiency, up to 45× and 668× for nonlinear softmax operations, and 2.07× and 3.11× for LLMs, and also decrease operational carbon for LLM operation by 1.45× and embodied carbon by 1.48×.

[LG-48] Towards Tensor Network Models for Low-Latency Jet Tagging on FPGAs

链接: https://arxiv.org/abs/2601.10801
作者: Alberto Coppi,Ema Puljak,Lorenzo Borella,Daniel Jaschke,Enrique Rico,Maurizio Pierini,Jacopo Pazzini,Andrea Triossi,Simone Montangero
类目: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Instrumentation and Detectors (physics.ins-det); Quantum Physics (quant-ph)
*备注: 10 pages, 8 figures

点击查看摘要

Abstract:We present a systematic study of Tensor Network (TN) models – Matrix Product States (MPS) and Tree Tensor Networks (TTN) – for real-time jet tagging in high-energy physics, with a focus on low-latency deployment on Field Programmable Gate Arrays (FPGAs). Motivated by the strict requirements of the HL-LHC Level-1 trigger system, we explore TNs as compact and interpretable alternatives to deep neural networks. Using low-level jet constituent features, our models achieve competitive performance compared to state-of-the-art deep learning classifiers. We investigate post-training quantization to enable hardware-efficient implementations without degrading classification performance or latency. The best-performing models are synthesized to estimate FPGA resource usage, latency, and memory occupancy, demonstrating sub-microsecond latency and supporting the feasibility of online deployment in real-time trigger systems. Overall, this study highlights the potential of TN-based models for fast and resource-efficient inference in low-latency environments.

[LG-49] Analytic Bijections for Smooth and Interpretable Normalizing Flows

链接: https://arxiv.org/abs/2601.10774
作者: Mathis Gerdes,Miranda C. N. Cheng
类目: Machine Learning (cs.LG); High Energy Physics - Lattice (hep-lat)
*备注: 33 + 5 pages, 17 + 1 figures, 3 tables

点击查看摘要

Abstract:A key challenge in designing normalizing flows is finding expressive scalar bijections that remain invertible with tractable Jacobians. Existing approaches face trade-offs: affine transformations are smooth and analytically invertible but lack expressivity; monotonic splines offer local control but are only piecewise smooth and act on bounded domains; residual flows achieve smoothness but need numerical inversion. We introduce three families of analytic bijections – cubic rational, sinh, and cubic polynomial – that are globally smooth ($C^\infty$), defined on all of $\mathbb{R}$, and analytically invertible in closed form, combining the favorable properties of all prior approaches. These bijections serve as drop-in replacements in coupling flows, matching or exceeding spline performance. Beyond coupling layers, we develop radial flows: a novel architecture using direct parametrization that transforms the radial coordinate while preserving angular direction. Radial flows exhibit exceptional training stability, produce geometrically interpretable transformations, and on targets with radial structure can achieve comparable quality to coupling flows with $1000\times$ fewer parameters. We provide comprehensive evaluation on 1D and 2D benchmarks, and demonstrate applicability to higher-dimensional physics problems through experiments on $\phi^4$ lattice field theory, where our bijections outperform affine baselines and enable problem-specific designs that address mode collapse.
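One of the bijection families named in the abstract (sinh) can be sketched directly: it is globally smooth and monotone on all of ℝ, with a closed-form inverse and a tractable log-Jacobian. The parametrization below is our guess at a simple instance, not the paper's exact formulation:

```python
import numpy as np

# A sinh-based scalar bijection: y = a*sinh(b*x + c) + d, monotone for a, b > 0.
def sinh_fwd(x, a=1.5, b=0.7, c=0.2, d=-0.1):
    y = a * np.sinh(b * x + c) + d
    logdet = np.log(a * b * np.cosh(b * x + c))   # log |dy/dx|, needed for flows
    return y, logdet

def sinh_inv(y, a=1.5, b=0.7, c=0.2, d=-0.1):
    # Closed-form inverse via arcsinh; no numerical root-finding required.
    return (np.arcsinh((y - d) / a) - c) / b

x = np.linspace(-5, 5, 101)
y, logdet = sinh_fwd(x)
print(np.max(np.abs(sinh_inv(y) - x)))   # round-trip error near machine precision
```

Compared with monotonic splines, there are no knots or bounded domains: the same closed form holds everywhere on ℝ.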

[LG-50] A Probabilistic Approach to Trajectory-Based Optimal Experimental Design

链接: https://arxiv.org/abs/2601.11473
作者: Ahmed Attia
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 42 figures; this version includes supplementary material as appendices

点击查看摘要

Abstract:We present a novel probabilistic approach for optimal path experimental design. In this approach a discrete path optimization problem is defined on a static navigation mesh, and trajectories are modeled as random variables governed by a parametric Markov policy. The discrete path optimization problem is then replaced with an equivalent stochastic optimization problem over the policy parameters, resulting in an optimal probability model that samples estimates of the optimal discrete path. This approach enables exploration of the utility function’s distribution tail and treats the utility function of the design as a black box, making it applicable to linear and nonlinear inverse problems and beyond experimental design. Numerical verification and analysis are carried out by using a parameter identification problem widely used in model-based optimal experimental design.

[LG-51] Near-Optimal Decentralized Stochastic Nonconvex Optimization with Heavy-Tailed Noise

链接: https://arxiv.org/abs/2601.11435
作者: Menglian Wang,Zhuanghua Liu,Luo Luo
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper studies the decentralized stochastic nonconvex optimization problem over row-stochastic networks. We consider heavy-tailed gradient noise, which is empirically observed in many popular real-world applications. Specifically, we propose a decentralized normalized stochastic gradient descent method with Pull-Diag gradient tracking, which achieves approximate stationary points with the optimal sample complexity and near-optimal communication complexity. We further apply our framework to the setting of undirected networks, also achieving nearly tight upper complexity bounds. Moreover, we conduct empirical studies to show the practical superiority of the proposed methods.

[LG-52] Statistical Robustness of Interval CVaR Based Regression Models under Perturbation and Contamination

链接: https://arxiv.org/abs/2601.11420
作者: Yulei You,Junyi Liu
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Robustness under perturbation and contamination is a prominent issue in statistical learning. We address robust nonlinear regression based on the so-called interval conditional value-at-risk (In-CVaR), which is introduced to enhance robustness by trimming extreme losses. While recent literature shows that In-CVaR based statistical learning exhibits superior robustness performance compared with classical robust regression models, its theoretical robustness analysis for nonlinear regression remains largely unexplored. We rigorously quantify robustness under contamination, with a unified study of the distributional breakdown point for a broad class of regression models, including linear, piecewise affine and neural network models with \ell_1 , \ell_2 and Huber losses. Moreover, we analyze the qualitative robustness of the In-CVaR based estimator under perturbation. We show that under several minor assumptions, the In-CVaR based estimator is qualitatively robust in terms of the Prokhorov metric if and only if the largest portion of losses is trimmed. Overall, this study analyzes robustness properties of In-CVaR based nonlinear regression models under both perturbation and contamination, which illustrates the advantages of the In-CVaR risk measure over conditional value-at-risk and expectation for robust regression in both theory and numerical experiments.
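The trimming idea behind In-CVaR can be sketched empirically. Assuming the interval CVaR averages the loss order statistics between the a- and b-upper-tail levels (the indexing convention here is illustrative, not the paper's exact estimator):

```python
def interval_cvar(losses, a, b):
    """Mean of losses ranked between the a- and b-upper-tail quantiles:
    the largest (most extreme) fraction a of losses is trimmed away."""
    s = sorted(losses, reverse=True)
    n = len(s)
    lo, hi = int(n * a), int(n * b)
    return sum(s[lo:hi]) / (hi - lo)
```

With a > 0 the statistic ignores the very largest losses, which is what makes it insensitive to a single contaminated observation, unlike the plain CVaR (a = 0) or the expectation (a = 0, b = 1).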

[LG-53] Zero-Shot Detection of Elastic Transient Morphology Across Physical Systems

链接: https://arxiv.org/abs/2601.11415
作者: Jose Sánchez Andreu
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 17 pages, 6 figures. Supplemental material included

点击查看摘要

Abstract:We test whether a representation learned from interferometric strain transients in gravitational-wave observatories can act as a frozen morphology-sensitive operator for unseen sensors, provided the target signals preserve coherent elastic transient structure. Using a neural encoder trained exclusively on non-Gaussian instrumental glitches, we perform strict zero-shot anomaly analysis on rolling-element bearings without retraining, fine-tuning, or target-domain labels. On the IMS-NASA run-to-failure dataset, the operator yields a monotonic health index HI(t) = s0.99(t)/tau normalized to an early-life reference distribution, enabling fixed false-alarm monitoring at 1-q = 1e-3 with tau = Q0.999(P0). In discrete fault regimes (CWRU), it achieves strong window-level discrimination (AUC_win about 0.90) and file-level separability approaching unity (AUC_file about 0.99). Electrically dominated vibration signals (VSB) show weak, non-selective behavior, delineating a physical boundary for transfer. Under a matched IMS controlled-split protocol, a generic EfficientNet-B0 encoder pretrained on ImageNet collapses in the intermittent regime (Lambda_tail about 2), while the interferometric operator retains strong extreme-event selectivity (Lambda_tail about 860), indicating that the effect is not a generic property of CNN features. Controlled morphology-destruction transformations selectively degrade performance despite per-window normalization, consistent with sensitivity to coherent time-frequency organization rather than marginal amplitude statistics.
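The health index HI(t) = s0.99(t)/tau is simple to reproduce. A minimal sketch, using a plain order-statistic quantile and an illustrative q (the paper normalizes an encoder score by tau = Q0.999 of an early-life reference distribution P0):

```python
import math

def health_index(score, ref_scores, q=0.999):
    """HI(t) = s(t) / tau, where tau is the q-quantile of early-life
    reference scores; crossing HI > 1 corresponds to a fixed false-alarm rate."""
    s = sorted(ref_scores)
    idx = min(len(s) - 1, max(0, math.ceil(q * len(s)) - 1))
    tau = s[idx]  # empirical q-quantile as an order statistic
    return score / tau
```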

[LG-54] Model-free policy gradient for discrete-time mean-field control

链接: https://arxiv.org/abs/2601.11217
作者: Matthieu Meunier,Huyên Pham,Christoph Reisinger
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 42 pages, 5 figures

点击查看摘要

Abstract:We study model-free policy learning for discrete-time mean-field control (MFC) problems with finite state space and compact action space. In contrast to the extensive literature on value-based methods for MFC, policy-based approaches remain largely unexplored due to the intrinsic dependence of transition kernels and rewards on the evolving population state distribution, which prevents the direct use of likelihood-ratio estimators of policy gradients from classical single-agent reinforcement learning. We introduce a novel perturbation scheme on the state-distribution flow and prove that the gradient of the resulting perturbed value function converges to the true policy gradient as the perturbation magnitude vanishes. This construction yields a fully model-free estimator based solely on simulated trajectories and an auxiliary estimate of the sensitivity of the state distribution. Building on this framework, we develop MF-REINFORCE, a model-free policy gradient algorithm for MFC, and establish explicit quantitative bounds on its bias and mean-squared error. Numerical experiments on representative mean-field control tasks demonstrate the effectiveness of the proposed approach.

[LG-55] Comprehensive Robust Dynamic Mode Decomposition from Mode Extraction to Dimensional Reduction WWW

链接: https://arxiv.org/abs/2601.11116
作者: Yuki Nakamura,Shingo Takemoto,Shunsuke Ono
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: Submitted to IEEE Transactions on Signal Processing. The source code is available at this https URL . The project page is this https URL

点击查看摘要

Abstract:We propose Comprehensive Robust Dynamic Mode Decomposition (CR-DMD), a novel framework that robustifies the entire DMD process - from mode extraction to dimensional reduction - against mixed noise. Although standard DMD is widely used for uncovering spatio-temporal patterns and constructing low-dimensional models of dynamical systems, it suffers from significant performance degradation under noise due to its reliance on least-squares estimation for computing the linear time evolution operator. Existing robust variants typically modify the least-squares formulation, but they remain unstable and fail to ensure faithful low-dimensional representations. First, we introduce a convex optimization-based preprocessing method designed to effectively remove mixed noise, achieving accurate and stable mode extraction. Second, we propose a new convex formulation for dimensional reduction that explicitly links the robustly extracted modes to the original noisy observations, constructing a faithful representation of the original data via a sparse weighted sum of the modes. Both stages are efficiently solved by a preconditioned primal-dual splitting method. Experiments on fluid dynamics datasets demonstrate that CR-DMD consistently outperforms state-of-the-art robust DMD methods in terms of mode accuracy and fidelity of low-dimensional representations under noisy conditions.
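For context, the least-squares operator estimate that CR-DMD robustifies is the classical (exact) DMD below; this sketch is the noise-sensitive baseline, not the proposed convex method:

```python
import numpy as np

def dmd_operator(snapshots):
    """Classical DMD: least-squares fit of A in x_{k+1} ~ A x_k.
    `snapshots` has shape (state_dim, num_steps + 1)."""
    X, Xp = snapshots[:, :-1], snapshots[:, 1:]
    return Xp @ np.linalg.pinv(X)  # the step that breaks down under noise

# Toy linear system with known dynamics: on clean data the fit is exact.
A_true = np.diag([0.9, 0.5])
x = np.array([1.0, 1.0])
traj = [x]
for _ in range(8):
    x = A_true @ x
    traj.append(x)
A_est = dmd_operator(np.stack(traj, axis=1))
eigvals = np.sort(np.linalg.eigvals(A_est).real)  # DMD eigenvalues
```

On noisy snapshots the same pinv-based fit degrades, which is the failure mode the convex preprocessing and reformulation above address.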

[LG-56] KANHedge: Efficient Hedging of High-Dimensional Options Using Kolmogorov-Arnold Network-Based BSDE Solver

链接: https://arxiv.org/abs/2601.11097
作者: Rushikesh Handal,Masanori Hirano
类目: Computational Finance (q-fin.CP); Machine Learning (cs.LG)
*备注: 8 pages

点击查看摘要

Abstract:High-dimensional option pricing and hedging present significant challenges in quantitative finance, where traditional PDE-based methods struggle with the curse of dimensionality. The BSDE framework offers a computationally efficient alternative to PDE-based methods, and recently proposed deep BSDE solvers, generally utilizing conventional Multi-Layer Perceptrons (MLPs), build upon this framework to provide a scalable alternative to numerical BSDE solvers. In this research, we show that although such MLP-based deep BSDEs demonstrate promising results in option pricing, there remains room for improvement regarding hedging performance. To address this issue, we introduce KANHedge, a novel BSDE-based hedger that leverages Kolmogorov-Arnold Networks (KANs) within the BSDE framework. Unlike conventional MLP approaches that use fixed activation functions, KANs employ learnable B-spline activation functions that provide enhanced function approximation capabilities for continuous derivatives. We comprehensively evaluate KANHedge on both European and American basket options across multiple dimensions and market conditions. Our experimental results demonstrate that while KANHedge and MLP achieve comparable pricing accuracy, KANHedge provides improved hedging performance. Specifically, KANHedge achieves considerable reductions in hedging cost metrics, demonstrating enhanced risk control capabilities.

[LG-57] Split-and-Conquer: Distributed Factor Modeling for High-Dimensional Matrix-Variate Time Series

链接: https://arxiv.org/abs/2601.11091
作者: Hangjin Jiang,Yuzhou Li,Zhaoxing Gao
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we propose a distributed framework for reducing the dimensionality of high-dimensional, large-scale, heterogeneous matrix-variate time series data using a factor model. The data are first partitioned column-wise (or row-wise) and allocated to node servers, where each node estimates the row (or column) loading matrix via two-dimensional tensor PCA. These local estimates are then transmitted to a central server and aggregated, followed by a final PCA step to obtain the global row (or column) loading matrix estimator. Given the estimated loading matrices, the corresponding factor matrices are subsequently computed. Unlike existing distributed approaches, our framework preserves the latent matrix structure, thereby improving computational efficiency and enhancing information utilization. We also discuss row- and column-wise clustering procedures for settings in which the group memberships are unknown. Furthermore, we extend the analysis to unit-root nonstationary matrix-variate time series. Asymptotic properties of the proposed method are derived for the diverging dimension of the data in each computing unit and the sample size T . Simulation results assess the computational efficiency and estimation accuracy of the proposed framework, and real data applications further validate its predictive performance.

[LG-58] Memorize Early Then Query: Inlier-Memorization-Guided Active Outlier Detection

链接: https://arxiv.org/abs/2601.10993
作者: Minseo Kang,Seunghwan Park,Dongha Kim
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Outlier detection (OD) aims to identify abnormal instances, known as outliers or anomalies, by learning typical patterns of normal data, or inliers. Performing OD under an unsupervised regime-without any information about anomalous instances in the training data-is challenging. A recently observed phenomenon, known as the inlier-memorization (IM) effect, where deep generative models (DGMs) tend to memorize inlier patterns during early training, provides a promising signal for distinguishing outliers. However, existing unsupervised approaches that rely solely on the IM effect still struggle when inliers and outliers are not well-separated or when outliers form dense clusters. To address these limitations, we incorporate active learning to selectively acquire informative labels, and propose IMBoost, a novel framework that explicitly reinforces the IM effect to improve outlier detection. Our method consists of two stages: 1) a warm-up phase that induces and promotes the IM effect, and 2) a polarization phase in which actively queried samples are used to maximize the discrepancy between inlier and outlier scores. In particular, we propose a novel query strategy and tailored loss function in the polarization phase to effectively identify informative samples and fully leverage the limited labeling budget. We provide a theoretical analysis showing that the IMBoost consistently decreases inlier risk while increasing outlier risk throughout training, thereby amplifying their separation. Extensive experiments on diverse benchmark datasets demonstrate that IMBoost not only significantly outperforms state-of-the-art active OD methods but also requires substantially less computational cost.

[LG-59] Depression Detection Based on Electroencephalography Using a Hybrid Deep Neural Network CNN-GRU and MRMR Feature Selection

链接: https://arxiv.org/abs/2601.10959
作者: Mohammad Reza Yousefi,Hajar Ismail Al-Tamimi,Amin Dehghani
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注: 20 pages, 8 figures

点击查看摘要

Abstract:This study investigates the detection and classification of depressive and non-depressive states using deep learning approaches. Depression is a prevalent mental health disorder that substantially affects quality of life, and early diagnosis can greatly enhance treatment effectiveness and patient care. However, conventional diagnostic methods rely heavily on self-reported assessments, which are often subjective and may lack reliability. Consequently, there is a strong need for objective and accurate techniques to identify depressive states. In this work, a deep learning based framework is proposed for the early detection of depression using EEG signals. EEG data, which capture underlying brain activity and are not influenced by external behavioral factors, can reveal subtle neural changes associated with depression. The proposed approach combines convolutional neural networks (CNNs) and gated recurrent units (GRUs) to jointly extract spatial and temporal features from EEG recordings. The minimum redundancy maximum relevance (MRMR) algorithm is then applied to select the most informative features, followed by classification using a fully connected neural network. The results demonstrate that the proposed model achieves high performance in accurately identifying depressive states, with an overall accuracy of 98.74%. By effectively integrating temporal and spatial information and employing optimized feature selection, this method shows strong potential as a reliable tool for clinical applications. Overall, the proposed framework not only enables accurate early detection of depression but also has the potential to support improved treatment strategies and patient outcomes.

[LG-60] Learning collision operators from plasma phase space data using differentiable simulators

链接: https://arxiv.org/abs/2601.10885
作者: Diogo D. Carvalho,Pablo J. Bilbao,Warren B. Mori,Luis O. Silva,E. Paulo Alves
类目: Plasma Physics (physics.plasm-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:We propose a methodology to infer collision operators from phase space data of plasma dynamics. Our approach combines a differentiable kinetic simulator, whose core component in this work is a differentiable Fokker-Planck solver, with a gradient-based optimisation method to learn the collisional operators that best describe the phase space dynamics. We test our method using data from two-dimensional Particle-in-Cell simulations of spatially uniform thermal plasmas, and learn the collision operator that captures the self-consistent electromagnetic interaction between finite-size charged particles over a wide variety of simulation parameters. We demonstrate that the learned operators are more accurate than alternative estimates based on particle tracks, while making no prior assumptions about the relevant time-scales of the processes and significantly reducing memory requirements. We find that the retrieved operators, obtained in the non-relativistic regime, are in excellent agreement with theoretical predictions derived for electrostatic scenarios. Our results show that differentiable simulators offer a powerful and computationally efficient approach to infer novel operators for a wide range of problems, such as electromagnetically dominated collisional dynamics and stochastic wave-particle interactions.

[LG-61] Physically constrained unfolded multi-dimensional OMP for large MIMO systems

链接: https://arxiv.org/abs/2601.10771
作者: Nay Klaimi(INSA Rennes, IETR),Clément Elvira(IETR),Philippe Mary(INSA Rennes, IETR),Luc Le Magoarou(INSA Rennes, IETR)
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sparse recovery methods are essential for channel estimation and localization in modern communication systems, but their reliability relies on accurate physical models, which are rarely perfectly known. Their computational complexity also grows rapidly with the dictionary dimensions in large MIMO systems. In this paper, we propose MOMPnet, a novel unfolded sparse recovery framework that addresses both the reliability and complexity challenges of traditional methods. By integrating deep unfolding with data-driven dictionary learning, MOMPnet mitigates hardware impairments while preserving interpretability. Instead of a single large dictionary, multiple smaller, independent dictionaries are employed, enabling a low-complexity multidimensional Orthogonal Matching Pursuit algorithm. The proposed unfolded network is evaluated on realistic channel data against multiple baselines, demonstrating its strong performance and potential.

[LG-62] Mass Distribution versus Density Distribution in the Context of Clustering

链接: https://arxiv.org/abs/2601.10759
作者: Kai Ming Ting,Ye Zhu,Hang Zhang,Tianrun Liang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper investigates two fundamental descriptors of data, density distribution and mass distribution, in the context of clustering. Density distribution has been the de facto descriptor of data distribution since the introduction of statistics. We show that density distribution has a fundamental limitation, namely high-density bias, irrespective of the algorithms used to perform clustering. Existing density-based clustering algorithms have employed different algorithmic means to counter the effect of the high-density bias with some success, but the fundamental limitation of using density distribution remains an obstacle to discovering clusters of arbitrary shapes, sizes and densities. Using the mass distribution as a better foundation, we propose a new algorithm which maximizes the total mass of all clusters, called mass-maximization clustering (MMC). The algorithm can be easily changed to maximize the total density of all clusters in order to examine the fundamental limitation of using density distribution versus mass distribution. The key advantage of MMC over density-maximization clustering is that the maximization is conducted without a bias towards dense clusters.

[LG-63] Sensor Placement for Urban Traffic Interpolation: A Data-Driven Evaluation to Inform Policy

链接: https://arxiv.org/abs/2601.10747
作者: Silke K. Kaiser
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data on citywide street-segment traffic volumes are essential for urban planning and sustainable mobility management. Yet such data are available only for a limited subset of streets due to the high costs of sensor deployment and maintenance. Traffic volumes on the remaining network are therefore interpolated based on existing sensor measurements. However, current sensor locations are often determined by administrative priorities rather than by data-driven optimization, leading to biased coverage and reduced estimation performance. This study provides a large-scale, real-world benchmarking of easily implementable, data-driven strategies for optimizing the placement of permanent and temporary traffic sensors, using segment-level data from Berlin (Strava bicycle counts) and Manhattan (taxi counts). It compares spatial placement strategies based on network centrality, spatial coverage, feature coverage, and active learning. In addition, the study examines temporal deployment schemes for temporary sensors. The findings highlight that spatial placement strategies that emphasize even spatial coverage and employ active learning achieve the lowest prediction errors. With only 10 sensors, they reduce the mean absolute error by over 60% in Berlin and 70% in Manhattan compared to alternatives. Temporal deployment choices further improve performance: distributing measurements evenly across weekdays reduces error by an additional 7% in Berlin and 21% in Manhattan. Together, these spatial and temporal principles allow temporary deployments to closely approximate the performance of optimally placed permanent deployments. From a policy perspective, the results indicate that cities can substantially improve data usefulness by adopting data-driven sensor placement strategies, while retaining flexibility in choosing between temporary and permanent deployments.
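The "even spatial coverage" strategy that performs best above can be illustrated with greedy farthest-point placement, a common stand-in (the paper's exact procedure may differ):

```python
def farthest_point_placement(locations, k, dist=lambda p, q: abs(p - q)):
    """Greedy max-min coverage: start anywhere, then repeatedly add the
    candidate location farthest from all sensors chosen so far."""
    chosen = [locations[0]]
    while len(chosen) < k:
        best = max(
            (p for p in locations if p not in chosen),
            key=lambda p: min(dist(p, c) for c in chosen),
        )
        chosen.append(best)
    return chosen
```

The 1-D distance here is only for illustration; on a street network one would swap in a graph or Euclidean distance, and the greedy max-min structure is unchanged.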

[LG-64] UBiGTLoc: A Unified BiLSTM-Graph Transformer Localization Framework for IoT Sensor Networks

链接: https://arxiv.org/abs/2601.10743
作者: Ayesh Abu Lehyeh,Anastassia Gharib,Tian Xia,Dryver Huston,Safwan Wshah
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: Accepted and published in IEEE Internet of Things Journal

点击查看摘要

Abstract:Sensor nodes localization in wireless Internet of Things (IoT) sensor networks is crucial for the effective operation of diverse applications, such as smart cities and smart agriculture. Existing sensor nodes localization approaches heavily rely on anchor nodes within wireless sensor networks (WSNs). Anchor nodes are sensor nodes equipped with global positioning system (GPS) receivers and thus, have known locations. These anchor nodes operate as references to localize other sensor nodes. However, the presence of anchor nodes may not always be feasible in real-world IoT scenarios. Additionally, localization accuracy can be compromised by fluctuations in Received Signal Strength Indicator (RSSI), particularly under non-line-of-sight (NLOS) conditions. To address these challenges, we propose UBiGTLoc, a Unified Bidirectional Long Short-Term Memory (BiLSTM)-Graph Transformer Localization framework. The proposed UBiGTLoc framework effectively localizes sensor nodes in both anchor-free and anchor-presence WSNs. The framework leverages BiLSTM networks to capture temporal variations in RSSI data and employs Graph Transformer layers to model spatial relationships between sensor nodes. Extensive simulations demonstrate that UBiGTLoc consistently outperforms existing methods and provides robust localization across both dense and sparse WSNs while relying solely on cost-effective RSSI data.

[LG-65] SSC-UNet: UNet with Self-Supervised Contrastive Learning for Phonocardiography Noise Reduction ALT

链接: https://arxiv.org/abs/2601.10735
作者: Lizy Abraham,Siobhan Coughlan,Kritika Rajain,Changhong Li,Saji Philip,Adam James
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: Accepted by IEEE Healthcom 2025

点击查看摘要

Abstract:Congenital Heart Disease (CHD) remains a significant global health concern affecting approximately 1% of births worldwide. Phonocardiography has emerged as a supplementary tool to diagnose CHD cost-effectively. However, the performance of these diagnostic models highly depends on the quality of the phonocardiography, thus, noise reduction is particularly critical. Supervised UNet effectively improves noise reduction capabilities, but limited clean data hinders its application. The complex time-frequency characteristics of phonocardiography further complicate finding the balance between effectively removing noise and preserving pathological features. In this study, we proposed a self-supervised phonocardiography noise reduction model based on Noise2Noise to enable training without clean data. Augmentation and contrastive learning are applied to enhance its performance. We obtained an average SNR of 12.98 dB after filtering under 10 dB of hospital noise. Classification sensitivity after filtering was improved from 27% to 88%, indicating its promising pathological feature retention capabilities in practical noisy environments.
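The SNR figures quoted above follow the standard power-ratio definition, sketched here for reference (illustrative, not the authors' evaluation code):

```python
import math

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise),
    using mean-square power estimates of the two waveforms."""
    p_sig = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10.0 * math.log10(p_sig / p_noise)
```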

信息检索

[IR-0] Validating Search Query Simulations: A Taxonomy of Measures

链接: https://arxiv.org/abs/2601.11412
作者: Andreas Konstantin Kruff,Nolwenn Bernard,Philipp Schaer
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Assessing the validity of user simulators when used for the evaluation of information retrieval systems remains an open question, constraining their effective use and the reliability of simulation-based results. To address this issue, we conduct a comprehensive literature review with a particular focus on methods for the validation of simulated user queries with regard to real queries. Based on the review, we develop a taxonomy that structures the current landscape of available measures. We empirically corroborate the taxonomy by analyzing the relationships between the different measures applied to four different datasets representing diverse search scenarios. Finally, we provide concrete recommendations on which measures or combinations of measures should be considered when validating user simulation in different contexts. Furthermore, we release a dedicated library with the most commonly used measures to facilitate future research.

[IR-1] Seek and You Shall Find: Design Evaluation of a Context-Aware Interactive Search Companion

链接: https://arxiv.org/abs/2601.11287
作者: Markus Bink,Marten Risius,Udo Kruschwitz,David Elsweiler
类目: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
*备注: Pre-Print accepted at CHIIR 2026

点击查看摘要

Abstract:Many users struggle with effective online search and critical evaluation, especially in high-stakes domains like health, while often overestimating their digital literacy. Thus, in this demo, we present an interactive search companion that seamlessly integrates expert search strategies into existing search engine result pages. Providing context-aware tips on clarifying information needs, improving query formulation, encouraging result exploration, and mitigating biases, our companion aims to foster reflective search behaviour while minimising cognitive burden. A user study demonstrates the companion’s successful encouragement of more active and exploratory search, leading users to submit 75 % more queries and view roughly twice as many results, as well as performance gains in difficult tasks. This demo illustrates how lightweight, contextual guidance can enhance search literacy and empower users through micro-learning opportunities. While the vision involves real-time LLM adaptivity, this study utilises a controlled implementation to test the underlying intervention strategies.

[IR-2] “Can You Tell Me?”: Designing Copilots to Support Human Judgement in Online Information Seeking

链接: https://arxiv.org/abs/2601.11284
作者: Markus Bink,Marten Risius,Udo Kruschwitz,David Elsweiler
类目: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
*备注: Pre-Print accepted at CHIIR 2026

点击查看摘要

Abstract:Generative AI (GenAI) tools are transforming information seeking, but their fluent, authoritative responses risk overreliance and discourage independent verification and reasoning. Rather than replacing the cognitive work of users, GenAI systems should be designed to support and scaffold it. Therefore, this paper introduces an LLM-based conversational copilot designed to scaffold information evaluation rather than provide answers and foster digital literacy skills. In a pre-registered, randomised controlled trial (N=261) examining three interface conditions including a chat-based copilot, our mixed-methods analysis reveals that users engaged deeply with the copilot, demonstrating metacognitive reflection. However, the copilot did not significantly improve answer correctness or search engagement, largely due to a “time-on-chat vs. exploration” trade-off and users’ bias toward positive information. Qualitative findings reveal tension between the copilot’s Socratic approach and users’ desire for efficiency. These results highlight both the promise and pitfalls of pedagogical copilots, and we outline design pathways to reconcile literacy goals with efficiency demands.

[IR-3] Rank4Gen: RAG-Preference-Aligned Document Set Selection and Ranking

链接: https://arxiv.org/abs/2601.11273
作者: Yongqi Fan,Yuxiang Chu,Zhentao Xia,Xiaoyang Chen,Jie Liu,Haijin Liang,Jin Ma,Ben He,Yingfei Sun,Dezhi Ye,Tong Ruan
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:In the RAG paradigm, the information retrieval module provides context for generators by retrieving and ranking multiple documents to support the aggregation of evidence. However, existing ranking models are primarily optimized for query–document relevance, which often misaligns with generators’ preferences for evidence selection and citation, limiting their impact on response quality. Moreover, most approaches do not account for preference differences across generators, resulting in unstable cross-generator performance. We propose \textbfRank4Gen, a generator-aware ranker for RAG that targets the goal of \emphRanking for Generators. Rank4Gen introduces two key preference modeling strategies: (1) \textbfFrom Ranking Relevance to Response Quality, which optimizes ranking with respect to downstream response quality rather than query–document relevance; and (2) \textbfGenerator-Specific Preference Modeling, which conditions a single ranker on different generators to capture their distinct ranking preferences. To enable such modeling, we construct \textbfPRISM, a dataset built from multiple open-source corpora and diverse downstream generators. Experiments on five challenging and recent RAG benchmarks demonstrate that Rank4Gen achieves strong and competitive performance for complex evidence composition in RAG.

[IR-4] LLM-Assisted Pseudo-Relevance Feedback ECIR2026

链接: https://arxiv.org/abs/2601.11238
作者: David Otero,Javier Parapar
类目: Information Retrieval (cs.IR)
*备注: Accepted ECIR 2026

点击查看摘要

Abstract:Query expansion is a long-standing technique to mitigate vocabulary mismatch in ad hoc Information Retrieval. Pseudo-relevance feedback methods, such as RM3, estimate an expanded query model from the top-ranked documents, but remain vulnerable to topic drift when early results include noisy or tangential content. Recent approaches instead prompt Large Language Models to generate synthetic expansions or query variants. While effective, these methods risk hallucinations and misalignment with collection-specific terminology. We propose a hybrid alternative that preserves the robustness and interpretability of classical PRF while leveraging LLM semantic judgement. Our method inserts an LLM-based filtering stage prior to RM3 estimation: the LLM judges the documents in the initial top-k ranking, and RM3 is computed only over those accepted as relevant. This simple intervention improves over blind PRF and a strong baseline across several datasets and metrics.
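The filter-then-expand pipeline described above can be sketched in a few lines: an LLM relevance judge (mocked here as a precomputed boolean list) screens the initial top-k, and an RM3-style feedback model is estimated only over the accepted documents. The uniform term weighting and fixed interpolation weight are simplified stand-ins for the full RM3 estimator, not the paper's exact implementation.

```python
from collections import Counter

def rm3_expansion(query_terms, docs, accepted, n_terms=5, orig_weight=0.5):
    """RM3-style expansion restricted to LLM-accepted feedback docs.

    `accepted` holds boolean judgments from a (hypothetical) LLM relevance
    judge; only docs judged relevant contribute to the feedback model.
    """
    counts, total = Counter(), 0
    for doc, keep in zip(docs, accepted):
        if not keep:
            continue
        for term in doc.split():
            counts[term] += 1
            total += 1
    feedback = {t: c / total for t, c in counts.most_common(n_terms)} if total else {}
    # Interpolate the original query model with the feedback model.
    expanded = {t: orig_weight / len(query_terms) for t in query_terms}
    for t, w in feedback.items():
        expanded[t] = expanded.get(t, 0.0) + (1 - orig_weight) * w
    return expanded

docs = ["neural ranking models", "cooking pasta recipes", "ranking with neural nets"]
accepted = [True, False, True]  # the off-topic doc is filtered out before RM3
model = rm3_expansion(["neural", "ranking"], docs, accepted)
print(sorted(model, key=model.get, reverse=True)[:3])
```

Because the noisy document never enters the feedback model, its vocabulary cannot drift the expanded query.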

[IR-5] From Knots to Knobs: Towards Steerable Collaborative Filtering Using Sparse Autoencoders

链接: https://arxiv.org/abs/2601.11182
作者: Martin Spišák,Ladislav Peška,Petr Škoda,Vojtěch Vančura,Rodrigo Alves
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Sparse autoencoders (SAEs) have recently emerged as pivotal tools for introspection into large language models. SAEs can uncover high-quality, interpretable features at different levels of granularity and enable targeted steering of the generation process by selectively activating specific neurons in their latent activations. Our paper is the first to apply this approach to collaborative filtering, aiming to extract similarly interpretable features from representations learned purely from interaction signals. In particular, we focus on a widely adopted class of collaborative filtering autoencoders (CFAEs) and augment them by inserting an SAE between their encoder and decoder networks. We demonstrate that such representation is largely monosemantic and propose suitable mapping functions between semantic concepts and individual neurons. We also evaluate a simple yet effective method that utilizes this representation to steer the recommendations in a desired direction.
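A minimal sketch of the mechanism, with toy hand-set weights rather than a trained model: an SAE latent sits between the CF encoder and decoder, and steering amounts to boosting one latent neuron (one interpretable "concept") before decoding. The matrices, dimensions, and boost amount below are all illustrative assumptions.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sae_steer(item_embedding, W_enc, W_dec, boost=None, alpha=2.0):
    """Sparse-autoencoder pass over a CF embedding, with optional steering.

    `boost` names a latent neuron index; adding `alpha` to its activation
    nudges the reconstruction toward that neuron's concept.
    """
    z = relu(matvec(W_enc, item_embedding))  # sparse latent code
    if boost is not None:
        z[boost] += alpha                    # steer: activate the chosen concept
    return matvec(W_dec, z)

# Toy 2-d embedding, 3 latent "concepts".
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
W_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
x = [0.5, 0.2]
plain = sae_steer(x, W_enc, W_dec)
steered = sae_steer(x, W_enc, W_dec, boost=1)
print(plain, steered)
```

With `boost=1`, the reconstruction shifts along the second latent direction while the rest of the embedding is untouched.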

[IR-6] The Big Ban Theory: A Pre- and Post-Intervention Dataset of Online Content Moderation Actions

链接: https://arxiv.org/abs/2601.11128
作者: Aldo Cerulli,Lorenzo Cima,Benedetta Tessa,Serena Tardelli,Stefano Cresci
类目: ocial and Information Networks (cs.SI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Online platforms rely on moderation interventions to curb harmful behavior such as hate speech, toxicity, and the spread of mis- and disinformation. Yet research on the effects and possible biases of such interventions faces multiple limitations. For example, existing works frequently focus on a single intervention or a few, due to the absence of comprehensive datasets. As a result, researchers must typically collect the necessary data for each new study, which limits opportunities for systematic comparisons. To overcome these challenges, we introduce The Big Ban Theory (TBBT), a large dataset of moderation interventions. TBBT covers 25 interventions of varying type, severity, and scope, comprising in total over 339K users and nearly 39M posted messages. For each intervention, we provide standardized metadata and pseudonymized user activity collected three months before and after its enforcement, enabling consistent and comparable analyses of intervention effects. In addition, we provide a descriptive exploratory analysis of the dataset, along with several use cases of how it can support research on content moderation. With this dataset, we aim to support researchers studying the effects of moderation interventions and to promote more systematic, reproducible, and comparable research. TBBT is publicly available at: this https URL.

[IR-7] PruneRAG: Confidence-Guided Query Decomposition Trees for Efficient Retrieval-Augmented Generation

链接: https://arxiv.org/abs/2601.11024
作者: Shuguang Jiao,Xinyu Xiao,Yunfan Wei,Shuhan Qi,Chengkai Huang,Quan Z. Michael Sheng,Lina Yao
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) has become a powerful framework for enhancing large language models in knowledge-intensive and reasoning tasks. However, as reasoning chains deepen or search trees expand, RAG systems often face two persistent failures: evidence forgetting, where retrieved knowledge is not effectively used, and inefficiency, caused by uncontrolled query expansions and redundant retrieval. These issues reveal a critical gap between retrieval and evidence utilization in current RAG architectures. We propose PruneRAG, a confidence-guided query decomposition framework that builds a structured query decomposition tree to perform stable and efficient reasoning. PruneRAG introduces three key mechanisms: adaptive node expansion that regulates tree width and depth, confidence-guided decisions that accept reliable answers and prune uncertain branches, and fine-grained retrieval that extracts entity-level anchors to improve retrieval precision. Together, these components preserve salient evidence throughout multi-hop reasoning while significantly reducing retrieval overhead. To better analyze evidence misuse, we define the Evidence Forgetting Rate as a metric to quantify cases where golden evidence is retrieved but not correctly used. Extensive experiments across various multi-hop QA benchmarks show that PruneRAG achieves superior accuracy and efficiency over state-of-the-art baselines.
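The Evidence Forgetting Rate introduced above can be illustrated with a small sketch. One plausible reading of the metric (the exact formula is in the paper; the record fields below are illustrative assumptions): among questions where golden evidence was retrieved, count the fraction where the answer failed to use any of it.

```python
def evidence_forgetting_rate(records):
    """Fraction of questions where golden evidence was retrieved but not
    used in the answer. Each record holds sets of doc ids: 'retrieved'
    (what the retriever returned), 'golden' (ground-truth evidence), and
    'used' (docs actually drawn on by the generated answer).
    """
    retrieved_golden = [r for r in records if r["golden"] & r["retrieved"]]
    if not retrieved_golden:
        return 0.0
    forgotten = [r for r in retrieved_golden if not (r["golden"] & r["used"])]
    return len(forgotten) / len(retrieved_golden)

records = [
    {"retrieved": {"d1", "d2"}, "golden": {"d1"}, "used": {"d1"}},  # used golden
    {"retrieved": {"d3", "d4"}, "golden": {"d3"}, "used": {"d4"}},  # forgot golden
    {"retrieved": {"d5"},       "golden": {"d9"}, "used": {"d5"}},  # golden not retrieved
]
print(evidence_forgetting_rate(records))  # → 0.5
```

The third record is excluded from the denominator: when golden evidence was never retrieved, the failure is the retriever's, not evidence forgetting.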

[IR-8] PRISM: Personalized Recommendation via Information Synergy Module WWW2026

链接: https://arxiv.org/abs/2601.10944
作者: Xinyi Zhang,Yutong Li,Peijie Sun,Letian Sha,Zhongxuan Han
类目: Information Retrieval (cs.IR)
*备注: Accepted as a Full Paper at WWW 2026

点击查看摘要

Abstract:Multimodal sequential recommendation (MSR) leverages diverse item modalities to improve recommendation accuracy, while achieving effective and adaptive fusion remains challenging. Existing MSR models often overlook synergistic information that emerges only through modality combinations. Moreover, they typically assume a fixed importance for different modality interactions across users. To address these limitations, we propose Personalized Recommendation via Information Synergy Module (PRISM), a plug-and-play framework for sequential recommendation (SR). PRISM explicitly decomposes multimodal information into unique, redundant, and synergistic components through an Interaction Expert Layer and dynamically weights them via an Adaptive Fusion Layer guided by user preferences. This information-theoretic design enables fine-grained disentanglement and personalized fusion of multimodal signals. Extensive experiments on four datasets and three SR backbones demonstrate its effectiveness and versatility. The code is available at this https URL.
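The weight-then-fuse step can be sketched as a softmax over per-user preference scores for the three decomposed components. Both the decomposition into named components and the scoring here are illustrative stand-ins for PRISM's learned Interaction Expert and Adaptive Fusion layers, not the paper's architecture.

```python
import math

def adaptive_fusion(components, user_pref):
    """Softmax-weighted fusion of decomposed interaction components.

    `components` maps a component name (unique / redundant / synergistic)
    to a vector; `user_pref` maps names to preference scores (default 0).
    """
    scores = {k: user_pref.get(k, 0.0) for k in components}
    z = sum(math.exp(s) for s in scores.values())
    weights = {k: math.exp(s) / z for k, s in scores.items()}
    dim = len(next(iter(components.values())))
    fused = [sum(weights[k] * components[k][i] for k in components)
             for i in range(dim)]
    return weights, fused

components = {"unique": [1.0, 0.0], "redundant": [0.0, 1.0], "synergy": [1.0, 1.0]}
weights, fused = adaptive_fusion(components, {"synergy": 1.0})
print(weights, fused)
```

A user whose preference favors synergistic signals gets a fused representation pulled toward the synergy component, while another user with different scores would get different weights from the same ranker.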

[IR-9] Can Instructed Retrieval Models Really Support Exploration?

链接: https://arxiv.org/abs/2601.10936
作者: Piyush Maheshwari,Sheshera Mysore,Hamed Zamani
类目: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Exploratory searches are characterized by under-specified goals and evolving query intents. In such scenarios, retrieval models that can capture user-specified nuances in query intent and adapt results accordingly are desirable – instruction-following retrieval models promise such a capability. In this work, we evaluate instructed retrievers for the prevalent yet under-explored application of aspect-conditional seed-guided exploration using an expert-annotated test collection. We evaluate both recent LLMs fine-tuned for instructed retrieval and general-purpose LLMs prompted for ranking with the highly performant Pairwise Ranking Prompting. We find that the best instructed retrievers improve on ranking relevance compared to instruction-agnostic approaches. However, we also find that instruction following performance, crucial to the user experience of interacting with models, does not mirror ranking relevance improvements and displays insensitivity or counter-intuitive behavior to instructions. Our results indicate that while users may benefit from using current instructed retrievers over instruction-agnostic models, they may not benefit from using them for long-running exploratory sessions requiring greater sensitivity to instructions.
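The Pairwise Ranking Prompting baseline mentioned above aggregates pairwise LLM judgments into a ranking. A minimal sketch with the LLM call mocked by a length heuristic (`prefer` stands in for a prompt that returns the preferred document):

```python
from itertools import combinations

def prp_rank(docs, prefer):
    """Pairwise Ranking Prompting aggregation: every pair of candidates
    is judged by `prefer` (a stand-in for an LLM comparison prompt), and
    docs are ordered by their number of pairwise wins.
    """
    wins = {d: 0 for d in docs}
    for a, b in combinations(docs, 2):
        wins[prefer(a, b)] += 1
    return sorted(docs, key=lambda d: wins[d], reverse=True)

# Mock judge: prefers the longer document, standing in for an LLM call.
docs = ["short", "a medium doc", "a much longer document"]
ranked = prp_rank(docs, lambda a, b: a if len(a) >= len(b) else b)
print(ranked)
```

With real LLM judges the comparisons are noisy and order-sensitive, which is part of why instruction sensitivity can diverge from ranking relevance as the abstract reports.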

[IR-10] Tail-Aware Data Augmentation for Long-Tail Sequential Recommendation WWW2026

链接: https://arxiv.org/abs/2601.10933
作者: Yizhou Dang,Zhifu Wei,Minhan Huang,Lianbo Ma,Jianzhe Zhao,Guibing Guo,Xingwei Wang
类目: Information Retrieval (cs.IR)
*备注: Accepted by WWW 2026

点击查看摘要

Abstract:Sequential recommendation (SR) learns user preferences based on their historical interaction sequences and provides personalized suggestions. In real-world scenarios, most users can only interact with a handful of items, while the majority of items are seldom consumed. This pervasive long-tail challenge limits the model’s ability to learn user preferences. Despite previous efforts to enrich tail items/users with knowledge from head parts or improve tail learning through additional contextual information, they still face the following issues: 1) They struggle to improve the situation where interactions of tail users/items are scarce, leading to incomplete preference learning for the tail parts. 2) Existing methods often degrade overall or head parts performance when improving accuracy for tail users/items, thereby harming the user experience. We propose Tail-Aware Data Augmentation (TADA) for long-tail sequential recommendation, which enhances the interaction frequency for tail items/users while maintaining head performance, thereby promoting the model’s learning capabilities for the tail. Specifically, we first capture the co-occurrence and correlation among low-popularity items by a linear model. Building upon this, we design two tail-aware augmentation operators, T-Substitute and T-Insert. The former replaces the head item with a relevant item, while the latter utilizes co-occurrence relationships to extend the original sequence by incorporating both head and tail items. The augmented and original sequences are mixed at the representation level to preserve preference knowledge. We further extend the mix operation across different tail-user sequences and augmented sequences to generate richer augmented samples, thereby improving tail performance. Comprehensive experiments demonstrate the superiority of our method. The codes are provided at this https URL.
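The two operators can be illustrated roughly as follows. The item sets and co-occurrence maps are toy stand-ins for the paper's learned linear model, and the representation-level mixing step is omitted.

```python
import random

def t_substitute(seq, head_items, related, rng):
    """Sketch of T-Substitute: replace one head item with a correlated
    (tail-leaning) item drawn from the co-occurrence map `related`."""
    heads = [i for i, it in enumerate(seq) if it in head_items and it in related]
    if not heads:
        return list(seq)
    pos = rng.choice(heads)
    out = list(seq)
    out[pos] = rng.choice(related[out[pos]])
    return out

def t_insert(seq, cooccur, rng):
    """Sketch of T-Insert: extend the sequence by inserting a co-occurring
    item right after a randomly chosen position."""
    pos = rng.randrange(len(seq))
    pool = cooccur.get(seq[pos], [])
    if not pool:
        return list(seq)
    return list(seq[: pos + 1]) + [rng.choice(pool)] + list(seq[pos + 1 :])

rng = random.Random(0)
seq = ["a", "b", "c"]
aug1 = t_substitute(seq, {"a"}, {"a": ["x"]}, rng)
aug2 = t_insert(seq, {"a": ["y"], "b": ["y"], "c": ["y"]}, rng)
print(aug1, aug2)
```

Both operators return new sequences and leave the original untouched, so the original and augmented views can later be mixed as the abstract describes.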

[IR-11] Streaming Stochastic Submodular Maximization with On-Demand User Requests NEURIPS’25

链接: https://arxiv.org/abs/2601.10901
作者: Honglian Wang,Sijing Tu,Lutz Oettershagen,Aristides Gionis
类目: Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
*备注: NeurIPS’25

点击查看摘要

Abstract:We explore a novel problem in streaming submodular maximization, inspired by the dynamics of news-recommendation platforms. We consider a setting where users can visit a news website at any time, and upon each visit, the website must display up to k news items. User interactions are inherently stochastic: each news item presented to the user is consumed with a certain acceptance probability by the user, and each news item covers certain topics. Our goal is to design a streaming algorithm that maximizes the expected total topic coverage. To address this problem, we establish a connection to submodular maximization subject to a matroid constraint. We show that we can effectively adapt previous methods to address our problem when the number of user visits is known in advance or linear-size memory in the stream length is available. However, in more realistic scenarios where only an upper bound on the visits and sublinear memory is available, the algorithms fail to guarantee any bounded performance. To overcome these limitations, we introduce a new online streaming algorithm that achieves a competitive ratio of 1/(8δ), where δ controls the approximation quality. Moreover, it requires only a single pass over the stream, and uses memory independent of the stream length. Empirically, our algorithms consistently outperform the baselines.
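The single-pass flavor of the problem can be sketched with a simple thresholding rule: keep an item only if its marginal topic-coverage gain clears a threshold and the budget is not exhausted. This is a simplified deterministic, cardinality-constrained version for intuition, not the paper's 1/(8δ)-competitive algorithm for the stochastic matroid setting.

```python
def streaming_threshold_coverage(stream, k, tau):
    """Single-pass streaming selection for topic coverage: accept an item
    iff its marginal coverage gain is at least `tau` and fewer than k
    items have been kept so far. Memory is independent of stream length.
    """
    covered, picked = set(), []
    for item, topics in stream:
        gain = len(set(topics) - covered)  # marginal coverage of this item
        if gain >= tau and len(picked) < k:
            picked.append(item)
            covered |= set(topics)
    return picked, covered

stream = [("n1", ["politics"]), ("n2", ["politics", "sports"]),
          ("n3", ["tech", "science"]), ("n4", ["sports"])]
picked, covered = streaming_threshold_coverage(stream, k=2, tau=2)
print(picked, sorted(covered))
```

Because each item is inspected once and only the covered-topic set and picks are stored, the state stays small regardless of how long the stream runs, which is the property the paper's sublinear-memory setting demands.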

附件下载

点击下载今日全部论文列表