This post contains the latest paper listing fetched from Arxiv.org on 2026-01-26. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.
Note: the daily paper data is fetched from Arxiv.org and updated automatically at around 12:00 every day.
Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.
Table of Contents
Overview (2026-01-26)
A total of 368 papers were updated today, including:
- Natural Language Processing: 66 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 106 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 66 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 86 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] Strategies for Span Labeling with Large Language Models
[Quick Read]: This paper addresses inconsistent span labeling when generative language models are used for annotation tasks such as named entity recognition or error detection, a consequence of their lacking an explicit mechanism for referring to specific parts of the input. Existing approaches rely on ad-hoc prompting strategies with unstable results; the paper groups them into three families: tagging the input text, indexing numerical span positions, and matching span content. To address the limitations of content matching, the authors propose LogitMatch, a constrained decoding method that forces the model's output to align with valid input spans. Its key idea is a logit-level constraint that eliminates the span-matching failures of conventional matching methods, and it outperforms the other strategies in some setups across four tasks.
Link: https://arxiv.org/abs/2601.16946
Authors: Danil Semin, Ondřej Dušek, Zdeněk Kasner
Affiliations: Institute of Formal and Applied Linguistics; Faculty of Mathematics and Physics, Charles University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) are increasingly used for text analysis tasks, such as named entity recognition or error detection. Unlike encoder-based models, however, generative architectures lack an explicit mechanism to refer to specific parts of their input. This leads to a variety of ad-hoc prompting strategies for span labeling, often with inconsistent results. In this paper, we categorize these strategies into three families: tagging the input text, indexing numerical positions of spans, and matching span content. To address the limitations of content matching, we introduce LogitMatch, a new constrained decoding method that forces the model’s output to align with valid input spans. We evaluate all methods across four diverse tasks. We find that while tagging remains a robust baseline, LogitMatch improves upon competitive matching-based methods by eliminating span matching issues and outperforms other strategies in some setups.
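As a rough illustration of the constrained-decoding idea behind LogitMatch (the entry above does not spell out its implementation), the sketch below masks the candidate set at each decoding step so that the generated span must be an exact contiguous span of the source text. The token scores, function names, and the greedy loop are illustrative stand-ins for real model logits and a real decoder.

```python
# Minimal sketch of span-constrained decoding: at each step, the set of
# allowed next tokens is restricted to tokens that extend some occurrence
# of the partially generated span inside the source text.

from typing import List, Set


def allowed_next_tokens(source: List[str], prefix: List[str]) -> Set[str]:
    """Tokens that can follow `prefix` at some position in `source`."""
    if not prefix:  # any source token may start a span
        return set(source)
    allowed = set()
    n = len(prefix)
    for i in range(len(source) - n):
        if source[i:i + n] == prefix:
            allowed.add(source[i + n])
    return allowed


def constrained_greedy_decode(source: List[str], scores, max_len: int = 5) -> List[str]:
    """Greedy decoding where candidates outside valid input spans are masked out."""
    span: List[str] = []
    for _ in range(max_len):
        candidates = allowed_next_tokens(source, span)
        if not candidates:
            break
        # pick the highest-scoring candidate among the allowed ones
        span.append(max(candidates, key=lambda tok: scores.get(tok, 0.0)))
    return span


if __name__ == "__main__":
    source = "the quick brown fox jumps over the lazy dog".split()
    # toy "logits": one score per token, standing in for model probabilities
    scores = {"lazy": 2.0, "dog": 1.5, "quick": 0.3}
    print(constrained_greedy_decode(source, scores, max_len=2))  # ['lazy', 'dog']
```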
[NLP-1] Information Representation Fairness in Long-Document Embeddings: The Peculiar Interaction of Positional and Language Bias
[Quick Read]: This paper addresses representation bias in embedding-based search caused by segment position and language resources: in long documents, early segments and segments in higher-resource languages such as English are over-represented, while later segments and segments in lower-resource languages are marginalized, hurting discoverability. The key to the solution is a permutation-based evaluation framework that quantifies these biases, together with an inference-time attention calibration method that redistributes the attention in pooling-token embeddings more evenly across positions, giving each segment position a more balanced attention share and improving the discoverability of later segments.
Link: https://arxiv.org/abs/2601.16934
Authors: Elias Schuhmacher, Andrianos Michail, Juri Opitz, Rico Sennrich, Simon Clematide
Affiliations: University of Zurich
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:To be discoverable in an embedding-based search process, each part of a document should be reflected in its embedding representation. To quantify any potential reflection biases, we introduce a permutation-based evaluation framework. With this, we observe that state-of-the-art embedding models exhibit systematic positional and language biases when documents are longer and consist of multiple segments. Specifically, early segments and segments in higher-resource languages like English are over-represented, while later segments and segments in lower-resource languages are marginalized. In our further analysis, we find that the positional bias stems from front-loaded attention distributions in pooling-token embeddings, where early tokens receive more attention. To mitigate this issue, we introduce an inference-time attention calibration method that redistributes attention more evenly across document positions, increasing discoverability of later segments. Our evaluation framework and attention calibration are available at this https URL
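The abstract does not give the calibration formula, so the following is only a minimal sketch of one plausible variant: blending the pooling-token attention with a uniform distribution before re-pooling. The blending coefficient `alpha`, the array shapes, and the function names are assumptions for illustration.

```python
# Illustrative sketch (not the paper's exact method): calibrate front-loaded
# pooling attention by blending it with a uniform distribution, then recompute
# the pooled document embedding. `alpha` controls how much flattening is applied.

import numpy as np


def calibrate_attention(attn: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend the original attention weights with a uniform distribution."""
    uniform = np.full_like(attn, 1.0 / attn.shape[-1])
    mixed = (1.0 - alpha) * attn + alpha * uniform
    return mixed / mixed.sum(-1, keepdims=True)  # re-normalize to sum to 1


def pooled_embedding(token_states: np.ndarray, attn: np.ndarray) -> np.ndarray:
    """Attention-weighted sum of token states -> one document vector."""
    return attn @ token_states


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    token_states = rng.normal(size=(8, 4))  # 8 tokens, hidden size 4
    raw_attn = np.array([0.4, 0.3, 0.1, 0.08, 0.05, 0.04, 0.02, 0.01])  # front-loaded
    calibrated = calibrate_attention(raw_attn, alpha=0.5)
    print("before:", pooled_embedding(token_states, raw_attn))
    print("after: ", pooled_embedding(token_states, calibrated))
```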
[NLP-2] LLM-Based Adversarial Persuasion Attacks on Fact-Checking Systems
[Quick Read]: This paper addresses the susceptibility of automated fact-checking (AFC) systems to adversarial attacks. Existing methods mostly rely on noise injection or semantic perturbation and overlook the subtler threat of persuasion techniques, which are widely used in disinformation campaigns. The key to the solution is a new class of persuasive adversarial attacks that use a generative LLM to rephrase claims so that they keep their original meaning while incorporating 15 persuasion techniques grouped into 6 categories, thereby disrupting both claim verification and evidence retrieval. Experiments on the FEVER and FEVEROUS benchmarks show that such attacks substantially degrade verification accuracy and evidence recall, identifying persuasion techniques as a potent class of adversarial attacks and underscoring the need for more robust AFC systems.
Link: https://arxiv.org/abs/2601.16890
Authors: João A. Leite, Olesya Razuvayevskaya, Kalina Bontcheva, Carolina Scarton
Affiliations: University of Sheffield
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Automated fact-checking (AFC) systems are susceptible to adversarial attacks, enabling false claims to evade detection. Existing adversarial frameworks typically rely on injecting noise or altering semantics, yet no existing framework exploits the adversarial potential of persuasion techniques, which are widely used in disinformation campaigns to manipulate audiences. In this paper, we introduce a novel class of persuasive adversarial attacks on AFCs by employing a generative LLM to rephrase claims using persuasion techniques. Considering 15 techniques grouped into 6 categories, we study the effects of persuasion on both claim verification and evidence retrieval using a decoupled evaluation strategy. Experiments on the FEVER and FEVEROUS benchmarks show that persuasion attacks can substantially degrade both verification performance and evidence retrieval. Our analysis identifies persuasion techniques as a potent class of adversarial attacks, highlighting the need for more robust AFC systems.
[NLP-3] Reasoning Promotes Robustness in Theory of Mind Tasks
[Quick Read]: This paper asks whether the recent gains of large language models (LLMs) on Theory of Mind (ToM) tests reflect genuine social-cognitive ability or merely greater robustness. To answer this, the authors study reasoning models trained with reinforcement learning with verifiable rewards (RLVR), combining novel machine-psychology experiments with established benchmarks. The key finding is that, across prompt variations and task perturbations, the gains of reasoning models over ordinary LLMs are better explained by increased robustness in finding the correct solution than by fundamentally new ToM reasoning, which gives a clearer interpretive frame for evaluating social-cognitive behavior in LLMs.
Link: https://arxiv.org/abs/2601.16853
Authors: Ian B. de Haan, Peter van der Putten, Max van Duijn
Affiliations: Leiden University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 14 pages, 2 figures
Abstract:Large language models (LLMs) have recently shown strong performance on Theory of Mind (ToM) tests, prompting debate about the nature and true performance of the underlying capabilities. At the same time, reasoning-oriented LLMs trained via reinforcement learning with verifiable rewards (RLVR) have achieved notable improvements across a range of benchmarks. This paper examines the behavior of such reasoning models in ToM tasks, using novel adaptations of machine psychological experiments and results from established benchmarks. We observe that reasoning models consistently exhibit increased robustness to prompt variations and task perturbations. Our analysis indicates that the observed gains are more plausibly attributed to increased robustness in finding the correct solution, rather than to fundamentally new forms of ToM reasoning. We discuss the implications of this interpretation for evaluating social-cognitive behavior in LLMs.
[NLP-4] ColorConceptBench: A Benchmark for Probabilistic Color-Concept Understanding in Text-to-Image Models
[Quick Read]: This paper addresses the limited ability of current text-to-image (T2I) models to associate colors with implicit concepts, i.e., abstract color semantics not specified by explicit color names or codes. The key to the solution is ColorConceptBench, a human-annotated benchmark that evaluates how models map 1,281 implicit color concepts through the lens of probabilistic color distributions, moving beyond explicit color labels to probe abstract semantic understanding and generation consistency. Evaluation of seven leading T2I models shows weak sensitivity to abstract semantics, and standard interventions such as scaling and guidance do not fix the problem, suggesting that human-like color semantics requires a more fundamental change in how models learn and represent implicit meaning.
Link: https://arxiv.org/abs/2601.16836
Authors: Chenxi Ruan, Yu Xiao, Yihan Hou, Guosheng Hu, Wei Zeng
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; China Academy of Art
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract:While text-to-image (T2I) models have advanced considerably, their capability to associate colors with implicit concepts remains underexplored. To address the gap, we introduce ColorConceptBench, a new human-annotated benchmark to systematically evaluate color-concept associations through the lens of probabilistic color distributions. ColorConceptBench moves beyond explicit color names or codes by probing how models translate 1,281 implicit color concepts using a foundation of 6,369 human annotations. Our evaluation of seven leading T2I models reveals that current models lack sensitivity to abstract semantics, and crucially, this limitation appears resistant to standard interventions (e.g., scaling and guidance). This demonstrates that achieving human-like color semantics requires more than larger models, but demands a fundamental shift in how models learn and represent implicit meaning.
[NLP-5] Trapped in the past? Disentangling fluid and crystallized intelligence of large language models using chess
[Quick Read]: This paper addresses the difficulty of telling whether the capabilities of large language models (LLMs) stem from memorized knowledge (crystallized intelligence) or from reasoning (fluid intelligence). The key to the solution is using chess as a controlled testbed: a taxonomy of positions is built according to proximity to the training distribution, from common positions solvable by memorization to novel positions requiring first-principles reasoning, and several GPT generations are evaluated under varying reasoning intensities. Performance degrades markedly as the demand for fluid intelligence rises and collapses to random levels on out-of-distribution tasks, revealing the limits of current architectures in systematic generalization and indicating that scale alone is not enough for robust fluid intelligence.
Link: https://arxiv.org/abs/2601.16823
Authors: Leonard S. Pleiss, Maximilian Schiffer, Robert K. von Weizsäcker
Affiliations: Technical University Munich
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) exhibit remarkable capabilities, yet it remains unclear to what extent these reflect sophisticated recall (crystallized intelligence) or reasoning ability (fluid intelligence). We introduce chess as a controlled testbed for disentangling these faculties. Leveraging the game’s structure and scalable engine evaluations, we construct a taxonomy of positions varying in training corpus proximity–ranging from common states solvable by memorization to novel ones requiring first-principles reasoning. We systematically evaluate multiple GPT generations under varying reasoning intensities. Our analysis reveals a clear gradient: performance consistently degrades as fluid intelligence demands increase. Notably, in out-of-distribution tasks, performance collapses to random levels. While newer models improve, progress slows significantly for tasks outside the training distribution. Furthermore, while reasoning-augmented inference improves performance, its marginal benefit per token decreases with distributional proximity. These results suggest current architectures remain limited in systematic generalization, highlighting the need for mechanisms beyond scale to achieve robust fluid intelligence.
[NLP-6] SoS: Analysis of Surface over Semantics in Multilingual Text-To-Image Generation
[Quick Read]: This paper addresses the Surface-over-Semantics (SoS) problem of text-to-image (T2I) models on non-English input: the models attend more to the surface form of the prompt than to its underlying meaning, producing stereotypical depictions of different cultural identities. The key to the solution is a standardized prompt set covering 171 cultural identities translated into 14 languages, used to systematically evaluate seven T2I models, together with a new measure that quantifies the strength and distribution of SoS tendencies across models, languages, and cultures. The analysis shows that all but one model exhibit strong surface-level tendencies in at least two languages, that the effect intensifies across the layers of the T2I text encoders, and that these tendencies frequently correlate with stereotypical visual depictions.
Link: https://arxiv.org/abs/2601.16803
Authors: Carolin Holtermann, Florian Schneider, Anne Lauscher
Affiliations: Trustworthy AI Lab, University of Hamburg; Language Technology Group, University of Hamburg
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Text-to-image (T2I) models are increasingly employed by users worldwide. However, prior research has pointed to the high sensitivity of T2I towards particular input languages - when faced with languages other than English (i.e., different surface forms of the same prompt), T2I models often produce culturally stereotypical depictions, prioritizing the surface over the prompt’s semantics. Yet a comprehensive analysis of this behavior, which we dub Surface-over-Semantics (SoS), is missing. We present the first analysis of T2I models’ SoS tendencies. To this end, we create a set of prompts covering 171 cultural identities, translated into 14 languages, and use it to prompt seven T2I models. To quantify SoS tendencies across models, languages, and cultures, we introduce a novel measure and analyze how the tendencies we identify manifest visually. We show that all but one model exhibit strong surface-level tendency in at least two languages, with this effect intensifying across the layers of T2I text encoders. Moreover, these surface tendencies frequently correlate with stereotypical visual depictions.
[NLP-7] Large Language Models as Automatic Annotators and Annotation Adjudicators for Fine-Grained Opinion Analysis
[Quick Read]: This paper addresses the scarcity of labeled data for fine-grained opinion analysis, where manual annotation is costly and slow, especially across diverse domains and real-world applications. The key to the solution is an LLM-based declarative annotation pipeline whose structured prompting reduces the variability of manual prompt engineering, plus a novel adjudication method in which an LLM reconciles multiple labels and produces the final annotation. Trials on the Aspect Sentiment Triplet Extraction (ASTE) and Aspect-Category-Opinion-Sentiment (ACOS) tasks with models of different sizes show high inter-annotator agreement, substantially cutting the cost and human effort of building fine-grained opinion-annotated datasets.
Link: https://arxiv.org/abs/2601.16800
Authors: Gaurav Negi, MA Waskow, Paul Buitelaar
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Fine-grained opinion analysis of text provides a detailed understanding of expressed sentiments, including the addressed entity. Although this level of detail is sound, it requires considerable human effort and substantial cost to annotate opinions in datasets for training models, especially across diverse domains and real-world applications. We explore the feasibility of LLMs as automatic annotators for fine-grained opinion analysis, addressing the shortage of domain-specific labelled datasets. In this work, we use a declarative annotation pipeline. This approach reduces the variability of manual prompt engineering when using LLMs to identify fine-grained opinion spans in text. We also present a novel methodology for an LLM to adjudicate multiple labels and produce final annotations. After trialling the pipeline with models of different sizes for the Aspect Sentiment Triplet Extraction (ASTE) and Aspect-Category-Opinion-Sentiment (ACOS) analysis tasks, we show that LLMs can serve as automatic annotators and adjudicators, achieving high Inter-Annotator Agreement across individual LLM-based annotators. This reduces the cost and human effort needed to create these fine-grained opinion-annotated datasets.
[NLP-8] Persuasion Tokens for Editing Factual Knowledge in LLMs EACL
[Quick Read]: This paper addresses two limitations of in-context knowledge editing (IKE): its reliance on lengthy, fact-specific demonstrations that are costly to construct, and the large share of the context window those demonstrations occupy. The key to the solution is persuasion tokens (P-Tokens), special tokens trained to replicate the effect of IKE demonstrations so that knowledge can be edited efficiently without fact-specific examples. Across two editing datasets and three LLMs, P-Tokens match or exceed IKE, remain robust to distractors with only small side effects on neighboring facts, and improve as their number grows, providing a more practical and scalable route to updating knowledge in LLMs.
Link: https://arxiv.org/abs/2601.16781
Authors: Paul Youssef, Jörg Schlötterer, Christin Seifert
Affiliations: Marburg University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at EACL Main 2026
Abstract:In-context knowledge editing (IKE) is a promising technique for updating Large Language Models (LLMs) with new information. However, IKE relies on lengthy, fact-specific demonstrations which are costly to create and consume significant context window space. In this paper, we introduce persuasion tokens (P-Tokens) – special tokens trained to replicate the effect of IKE demonstrations, enabling efficient knowledge editing without requiring fact-specific demonstrations. We evaluate P-Tokens across two editing datasets and three LLMs, demonstrating performance comparable to, and often exceeding, IKE. We further find that editing performance is robust to distractors with small negative effects to neighboring facts, and that increasing the number of P-Tokens improves performance. Our work addresses key limitations of IKE and provides a more practical and scalable alternative for editing LLMs.
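A minimal sketch of how such special tokens could be realized, assuming P-Tokens behave like trainable soft-prompt embeddings prepended to the input while the backbone stays frozen; the paper's actual training objective and integration point may differ, and all names below are illustrative.

```python
# Sketch (assumption: P-Tokens act like trainable soft-prompt embeddings
# prepended to the input embeddings; only these embeddings receive gradients).

import torch
import torch.nn as nn


class PersuasionTokens(nn.Module):
    def __init__(self, num_tokens: int, hidden_size: int):
        super().__init__()
        # k trainable embedding vectors, one per persuasion token
        self.p_tokens = nn.Parameter(torch.randn(num_tokens, hidden_size) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, hidden) -> (batch, k + seq_len, hidden)
        batch = input_embeds.size(0)
        prefix = self.p_tokens.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)


if __name__ == "__main__":
    hidden, k = 16, 4
    p = PersuasionTokens(num_tokens=k, hidden_size=hidden)
    dummy_input = torch.randn(2, 10, hidden)  # stand-in for token embeddings
    extended = p(dummy_input)
    print(extended.shape)  # torch.Size([2, 14, 16])
    # Only the P-Token embeddings are trainable parameters of this module:
    print([name for name, t in p.named_parameters() if t.requires_grad])  # ['p_tokens']
```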
[NLP-9] Do LLM hallucination detectors suffer from low-resource effect? EACL2026
[Quick Read]: This paper asks whether hallucination detectors for large language models (LLMs) also suffer from the low-resource effect, i.e., the sharp performance drop of LLMs in low-resource languages. Across five tasks in three domains, task accuracy does drop heavily in low-resource languages, yet the drop in detector accuracy is often several times smaller, suggesting that the models' internal mechanisms may still encode signals about their uncertainty. The key observation is that hallucination detectors remain robust within a language and in multilingual setups as long as in-language supervision is available, whereas performance degrades in cross-lingual settings without such supervision.
Link: https://arxiv.org/abs/2601.16766
Authors: Debtanu Datta, Mohan Kishore Chilukuri, Yash Kumar, Saptarshi Ghosh, Muhammad Bilal Zafar
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at EACL 2026 (Main)
Abstract:LLMs, while outperforming humans in a wide range of tasks, can still fail in unanticipated ways. We focus on two pervasive failure modes: (i) hallucinations, where models produce incorrect information about the world, and (ii) the low-resource effect, where the models show impressive performance in high-resource languages like English but the performance degrades significantly in low-resource languages like Bengali. We study the intersection of these issues and ask: do hallucination detectors suffer from the low-resource effect? We conduct experiments on five tasks across three domains (factual recall, STEM, and Humanities). Experiments with four LLMs and three hallucination detectors reveal a curious finding: As expected, the task accuracies in low-resource languages experience large drops (compared to English). However, the drop in detectors’ accuracy is often several times smaller than the drop in task accuracy. Our findings suggest that even in low-resource languages, the internal mechanisms of LLMs might encode signals about their uncertainty. Further, the detectors are robust within language (even for non-English) and in multilingual setups, but not in cross-lingual settings without in-language supervision.
[NLP-10] Standardizing Longitudinal Radiology Report Evaluation via Large Language Model Annotation
[Quick Read]: This paper addresses the difficulty of automatically annotating longitudinal information in radiology reports, i.e., identifying and extracting temporal changes in disease progression across examinations, which is needed to evaluate report generation models. Existing annotation methods depend on manual rules or lexicons and are labor-intensive, hard to adapt, and not accurate enough. The key to the solution is an LLM-based annotation pipeline that first locates sentences carrying relevant information and then extracts disease progression. Five mainstream LLMs are compared on 500 manually annotated reports, and the selected Qwen2.5-32B is used to annotate another 95,169 reports from the public MIMIC-CXR dataset, yielding a standardized benchmark. The LLM-based annotation improves F1 by 11.3% for longitudinal information detection and 5.3% for disease tracking over existing annotation solutions and supports the evaluation of seven state-of-the-art report generation models.
Link: https://arxiv.org/abs/2601.16753
Authors: Xinyi Wang, Grazziela Figueredo, Ruizhe Li, Xin Chen
Affiliations: University of Nottingham
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Longitudinal information in radiology reports refers to the sequential tracking of findings across multiple examinations over time, which is crucial for monitoring disease progression and guiding clinical decisions. Many recent automated radiology report generation methods are designed to capture longitudinal information; however, validating their performance is challenging. There is no proper tool to consistently label temporal changes in both ground-truth and model-generated texts for meaningful comparisons. Existing annotation methods are typically labor-intensive, relying on the use of manual lexicons and rules. Complex rules are closed-source, domain specific and hard to adapt, whereas overly simple ones tend to miss essential specialised information. Large language models (LLMs) offer a promising annotation alternative, as they are capable of capturing nuanced linguistic patterns and semantic similarities without extensive manual intervention. They also adapt well to new contexts. In this study, we therefore propose an LLM-based pipeline to automatically annotate longitudinal information in radiology reports. The pipeline first identifies sentences containing relevant information and then extracts the progression of diseases. We evaluate and compare five mainstream LLMs on these two tasks using 500 manually annotated reports. Considering both efficiency and performance, Qwen2.5-32B was subsequently selected and used to annotate another 95,169 reports from the public MIMIC-CXR dataset. Our Qwen2.5-32B-annotated dataset provided us with a standardized benchmark for evaluating report generation models. Using this new benchmark, we assessed seven state-of-the-art report generation models. Our LLM-based annotation method outperforms existing annotation solutions, achieving 11.3% and 5.3% higher F1-scores for longitudinal information detection and disease tracking, respectively.
[NLP-11] SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents
[Quick Read]: This paper addresses the high API cost and latency that long interaction contexts impose on LLM agents in software development tasks. Existing context compression methods such as LongLLMLingua rely on fixed metrics such as PPL, ignore the task-specific nature of code understanding, and often break syntactic and logical structure, losing key implementation details. The key to the solution is SWE-Pruner, a self-adaptive context pruning framework for coding agents: imitating how human programmers selectively skim source code, the agent states an explicit goal for the current task (e.g., "focus on error handling"), and a lightweight neural skimmer (0.6B parameters) dynamically selects the relevant lines from the surrounding context given that hint, cutting token usage by 23-54% on agent tasks with minimal impact on performance.
Link: https://arxiv.org/abs/2601.16746
Authors: Yuhang Wang, Yuling Shi, Mo Yang, Rongrui Zhang, Shilin He, Heng Lian, Yuting Chen, Siyu Ye, Kai Cai, Xiaodong Gu
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: Code available at this https URL
Abstract:LLM agents have demonstrated remarkable capabilities in software development, but their performance is hampered by long interaction contexts, which incur high API costs and latency. While various context compression approaches such as LongLLMLingua have emerged to tackle this challenge, they typically rely on fixed metrics such as PPL, ignoring the task-specific nature of code understanding. As a result, they frequently disrupt syntactic and logical structure and fail to retain critical implementation details. In this paper, we propose SWE-Pruner, a self-adaptive context pruning framework tailored for coding agents. Drawing inspiration from how human programmers “selectively skim” source code during development and debugging, SWE-Pruner performs task-aware adaptive pruning for long contexts. Given the current task, the agent formulates an explicit goal (e.g., “focus on error handling”) as a hint to guide the pruning targets. A lightweight neural skimmer (0.6B parameters) is trained to dynamically select relevant lines from the surrounding context given the goal. Evaluations across four benchmarks and multiple models validate SWE-Pruner’s effectiveness in various scenarios, achieving 23-54% token reduction on agent tasks like SWE-Bench Verified and up to 14.84x compression on single-turn tasks like LongCodeQA with minimal performance impact.
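The skimmer itself is a trained 0.6B model; as a toy stand-in, the sketch below scores each context line by lexical overlap with the stated goal and keeps the top-k lines, just to illustrate the goal-conditioned line-selection interface. The function names and scoring rule are assumptions, not SWE-Pruner's actual model.

```python
# Toy stand-in for the goal-conditioned skimmer: score each context line by
# overlap with the stated goal and keep only the top-k lines.


def prune_context(lines, goal, keep=2):
    goal_terms = [t for t in goal.lower().split() if len(t) > 2]  # drop short filler words

    def score(line):
        text = line.lower()
        return sum(term in text for term in goal_terms)

    ranked = sorted(range(len(lines)), key=lambda i: score(lines[i]), reverse=True)
    kept = sorted(ranked[:keep])  # keep the selected lines in their original order
    return [lines[i] for i in kept]


if __name__ == "__main__":
    context = [
        "def load_config(path):",
        "    data = open(path).read()",
        "    try:",
        "        return parse(data)",
        "    except ParseError as err:",
        "        log.error('bad config: %s', err)",
        "        raise",
    ]
    goal = "focus on error handling"
    print("\n".join(prune_context(context, goal, keep=2)))
```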
[NLP-12] Mitigating Bias in Automated Grading Systems for ESL Learners: A Contrastive Learning Approach
[Quick Read]: This paper addresses algorithmic bias against English-as-a-Second-Language (ESL) learners in automated essay scoring (AES). Existing Transformer-based regression models, trained mostly on native-speaker corpora, learn spurious correlations between surface-level L2 linguistic features and essay quality, so high-proficiency ESL writing is systematically under-scored. The key to the solution is contrastive learning over matched essay pairs: 17,161 pairs of ESL and native essays with identical human-rated quality are constructed, and the model is fine-tuned with a Triplet Margin Loss to align the latent representations of the two kinds of writing. This reduces the high-proficiency scoring gap from 10.3% to 6.2% while keeping a Quadratic Weighted Kappa of 0.76, and post-hoc linguistic analysis suggests the model disentangles sentence complexity from grammatical error, avoiding penalties for valid L2 syntactic structures.
Link: https://arxiv.org/abs/2601.16724
Authors: Kevin Fan, Eric Yun
Affiliations: Georgia Institute of Technology; Georgia State University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:As Automated Essay Scoring (AES) systems are increasingly used in high-stakes educational settings, concerns regarding algorithmic bias against English as a Second Language (ESL) learners have increased. Current Transformer-based regression models trained primarily on native-speaker corpora often learn spurious correlations between surface-level L2 linguistic features and essay quality. In this study, we conduct a bias study of a fine-tuned DeBERTa-v3 model using the ASAP 2.0 and ELLIPSE datasets, revealing a constrained score scaling for high-proficiency ESL writing where high-proficiency ESL essays receive scores 10.3% lower than Native speaker essays of identical human-rated quality. To mitigate this, we propose applying contrastive learning with a triplet construction strategy: Contrastive Learning with Matched Essay Pairs. We constructed a dataset of 17,161 matched essay pairs and fine-tuned the model using Triplet Margin Loss to align the latent representations of ESL and Native writing. Our approach reduced the high-proficiency scoring disparity by 39.9% (to a 6.2% gap) while maintaining a Quadratic Weighted Kappa (QWK) of 0.76. Post-hoc linguistic analysis suggests the model successfully disentangled sentence complexity from grammatical error, preventing the penalization of valid L2 syntactic structures.
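A minimal sketch of the triplet objective named in the abstract, using `torch.nn.TripletMarginLoss`. The triplet roles assumed here (anchor = ESL essay, positive = matched native essay of the same human-rated quality, negative = an essay of different quality) and the dummy encoder are illustrative; the paper's exact triplet construction may differ.

```python
# Sketch of the contrastive objective with torch.nn.TripletMarginLoss.
# The encoder and "essay features" are dummies standing in for DeBERTa outputs.

import torch
import torch.nn as nn

embed_dim = 32
encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, embed_dim))

triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)

esl_batch = torch.randn(8, 128)           # anchors: ESL essays
native_match = torch.randn(8, 128)        # positives: matched native essays, same quality
different_quality = torch.randn(8, 128)   # negatives: essays of a different quality

anchor = encoder(esl_batch)
positive = encoder(native_match)
negative = encoder(different_quality)

loss = triplet_loss(anchor, positive, negative)
loss.backward()                           # gradients flow into the shared encoder
print(float(loss))
```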
[NLP-13] Better Generalizing to Unseen Concepts: An Evaluation Framework and An LLM-Based Auto-Labeled Pipeline for Biomedical Concept Recognition EACL2026
[Quick Read]: This paper addresses the difficulty of generalizing to unseen concepts in mention-agnostic biomedical concept recognition (MA-BCR), which stems from the scarcity of human annotations. The key to the solution is twofold: an evaluation framework built on hierarchical concept indices with novel metrics for systematically measuring generalization, and a task-specific pipeline for generating LLM-based auto-labeled data (ALD). The results show that although LLM-generated ALD cannot fully replace manual annotation, it is a valuable resource for improving generalization to unseen concepts, supplying broader coverage and structural knowledge.
Link: https://arxiv.org/abs/2601.16711
Authors: Shanshan Liu, Noriki Nishida, Fei Cheng, Narumi Tokunaga, Rumana Ferdous Munne, Yuki Yamagata, Kouji Kozaki, Takehito Utsuro, Yuji Matsumoto
Affiliations: RIKEN AIP; University of Tsukuba; Kyoto University; RIKEN R-IH; RIKEN BRC; Osaka Electro-Communication University
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted to EACL 2026 (Main)
Abstract:Generalization to unseen concepts is a central challenge due to the scarcity of human annotations in Mention-agnostic Biomedical Concept Recognition (MA-BCR). This work makes two key contributions to systematically address this issue. First, we propose an evaluation framework built on hierarchical concept indices and novel metrics to measure generalization. Second, we explore LLM-based Auto-Labeled Data (ALD) as a scalable resource, creating a task-specific pipeline for its generation. Our research unequivocally shows that while LLM-generated ALD cannot fully substitute for manual annotations, it is a valuable resource for improving generalization, successfully providing models with the broader coverage and structural knowledge needed to approach recognizing unseen concepts. Code and datasets are available at this https URL.
[NLP-14] EMemBench: Interactive Benchmarking of Episodic Memory for VLM Agents
[Quick Read]: This paper addresses the lack of dynamic, verifiable benchmarks covering multiple dimensions of agents' long-term memory; existing evaluations typically use fixed question sets and cannot reflect memory retention and retrieval in complex interactive settings. The key to the solution is EMemBench, a programmatic benchmark built on interactive games that generates questions from each agent's own trajectory in both text and visual environments, covering memory skills such as single/multi-hop recall, induction, temporal, spatial, and logical reasoning, as well as adversarial questions. Ground truth is computed from underlying game signals, ensuring verifiable answers and balanced skill coverage, and providing a unified, rigorous, and challenging testbed for agents backed by strong LMs/VLMs.
Link: https://arxiv.org/abs/2601.16690
Authors: Xinze Li, Ziyue Zhu, Siyuan Liu, Yubo Ma, Yuhang Zang, Yixin Cao, Aixin Sun
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages
Abstract:We introduce EMemBench, a programmatic benchmark for evaluating long-term memory of agents through interactive games. Rather than using a fixed set of questions, EMemBench generates questions from each agent’s own trajectory, covering both text and visual game environments. Each template computes verifiable ground truth from underlying game signals, with controlled answerability and balanced coverage over memory skills: single/multi-hop recall, induction, temporal, spatial, logical, and adversarial. We evaluate memory agents with strong LMs/VLMs as backbones, using in-context prompting as baselines. Across 15 text games and multiple visual seeds, results are far from saturated: induction and spatial reasoning are persistent bottlenecks, especially in visual setting. Persistent memory yields clear gains for open backbones on text games, but improvements are less consistent for VLM agents, suggesting that visually grounded episodic memory remains an open challenge. A human study further confirms the difficulty of EMemBench.
[NLP-15] PLawBench: A Rubric-Based Benchmark for Evaluating LLM s in Real-World Legal Practice
[Quick Read]: This paper addresses the limitations of existing legal benchmarks for large language models (LLMs): they use simplified, highly standardized tasks that do not capture the ambiguity, complexity, and reasoning demands of real legal practice, and they do not explicitly assess fine-grained legal reasoning. The key to the solution is PLawBench, a benchmark grounded in real legal workflows that models the core processes of legal practitioners through three task categories (public legal consultation, practical case analysis, and legal document generation), covering legal issue identification, structured legal reasoning, and legally coherent document drafting. It ships with roughly 12,500 expert-designed rubric items and an LLM-based evaluator aligned with human expert judgments; of 10 state-of-the-art LLMs evaluated, none performs strongly, exposing substantial gaps in fine-grained legal reasoning and pointing to directions for future legal LLMs.
Link: https://arxiv.org/abs/2601.16669
Authors: Yuzhen Shi, Huanghai Liu, Yiran Hu, Gaojie Song, Xinran Xu, Yubo Ma, Tianyi Tang, Li Zhang, Qingjing Chen, Di Feng, Wenbo Lv, Weiheng Wu, Kexin Yang, Sen Yang, Wei Wang, Rongyao Shi, Yuanyang Qiu, Yuemeng Qi, Jingwen Zhang, Xiaoyu Sui, Yifan Chen, Yi Zhang, An Yang, Bowen Yu, Dayiheng Liu, Junyang Lin, Weixing Shen, Bing Zhao, Charles L.A. Clarke, Hu Wei
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:
Abstract:As large language models (LLMs) are increasingly applied to legal domain-specific tasks, evaluating their ability to perform legal work in real-world settings has become essential. However, existing legal benchmarks rely on simplified and highly standardized tasks, failing to capture the ambiguity, complexity, and reasoning demands of real legal practice. Moreover, prior evaluations often adopt coarse, single-dimensional metrics and do not explicitly assess fine-grained legal reasoning. To address these limitations, we introduce PLawBench, a Practical Law Benchmark designed to evaluate LLMs in realistic legal practice scenarios. Grounded in real-world legal workflows, PLawBench models the core processes of legal practitioners through three task categories: public legal consultation, practical case analysis, and legal document generation. These tasks assess a model’s ability to identify legal issues and key facts, perform structured legal reasoning, and generate legally coherent documents. PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert-designed evaluation rubrics, resulting in approximately 12,500 rubric items for fine-grained assessment. Using an LLM-based evaluator aligned with human expert judgments, we evaluate 10 state-of-the-art LLMs. Experimental results show that none achieves strong performance on PLawBench, revealing substantial limitations in the fine-grained legal reasoning capabilities of current LLMs and highlighting important directions for future evaluation and development of legal LLMs. Data is available at: this https URL.
[NLP-16] Select or Project? Evaluating Lower-dimensional Vectors for LLM Training Data Explanations
[Quick Read]: This paper addresses the computational infeasibility of gradient-based instance-level explanations for large language models (LLMs), caused by the enormous dimensionality of model gradients. Existing practice restricts influence estimation to a subset of model parameters, but that subset is usually chosen ad hoc and rarely justified by systematic evaluation. The key finding is that a greedily selected subset of architecturally meaningful components captures training-data-influence information better than either the full gradient or random projection, performs better on a retrieval task, and is more computationally efficient, making targeted component selection a practical route to instance-level explanations for large models.
Link: https://arxiv.org/abs/2601.16651
Authors: Lukas Hinterleitner, Loris Schoenegger, Benjamin Roth
Affiliations: Faculty of Computer Science, University of Vienna, Vienna, Austria; UniVie Doctoral School Computer Science, University of Vienna, Vienna, Austria; Faculty of Philological and Cultural Studies, University of Vienna, Vienna, Austria
Subjects: Computation and Language (cs.CL)
Comments: 8 pages
Abstract:Gradient-based methods for instance-based explanation for large language models (LLMs) are hindered by the immense dimensionality of model gradients. In practice, influence estimation is restricted to a subset of model parameters to make computation tractable, but this subset is often chosen ad hoc and rarely justified by systematic evaluation. This paper investigates if it is better to create low-dimensional representations by selecting a small, architecturally informed subset of model components or by projecting the full gradients into a lower-dimensional space. Using a novel benchmark, we show that a greedily selected subset of components captures the information about training data influence needed for a retrieval task more effectively than either the full gradient or random projection. We further find that this approach is more computationally efficient than random projection, demonstrating that targeted component selection is a practical strategy for making instance-based explanations of large models more computationally feasible.
[NLP-17] Sycophancy Hides Linearly in the Attention Heads
[Quick Read]: This paper addresses sycophancy in large language models, i.e., the tendency to defer to perceived user doubt rather than uphold factual accuracy. Motivated by the linear representation hypothesis, the key to the solution is training linear probes across the residual stream, MLP, and attention layers to find where correct-to-incorrect sycophancy signals are most linearly separable. These signals turn out to be most separable in multi-head attention activations, and linear steering is most effective in a sparse subset of middle-layer attention heads, so simple, targeted linear interventions in attention activation space can mitigate sycophantic shifts while improving factual robustness.
Link: https://arxiv.org/abs/2601.16644
Authors: Rifo Genadi, Munachiso Nwadike, Nurdaulet Mukhituly, Hilal Alquabeh, Tatsuya Hiraoka, Kentaro Inui
Affiliations: MBZUAI (Mohamed bin Zayed University of Artificial Intelligence); RIKEN AIP (RIKEN Center for Advanced Intelligence Project); Tohoku University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:We find that correct-to-incorrect sycophancy signals are most linearly separable within multi-head attention activations. Motivated by the linear representation hypothesis, we train linear probes across the residual stream, multilayer perceptron (MLP), and attention layers to analyze where these signals emerge. Although separability appears in the residual stream and MLPs, steering using these probes is most effective in a sparse subset of middle-layer attention heads. Using TruthfulQA as the base dataset, we find that probes trained on it transfer effectively to other factual QA benchmarks. Furthermore, comparing our discovered direction to previously identified “truthful” directions reveals limited overlap, suggesting that factual accuracy, and deference resistance, arise from related but distinct mechanisms. Attention-pattern analysis further indicates that the influential heads attend disproportionately to expressions of user doubt, contributing to sycophantic shifts. Overall, these findings suggest that sycophancy can be mitigated through simple, targeted linear interventions that exploit the internal geometry of attention activations.
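A small sketch of the probe-and-steer recipe on synthetic data: fit a logistic-regression probe on made-up attention-head activations, take its weight vector as the sycophancy direction, and project that direction out of activations at inference time. Dimensions, data, and the steering rule are invented for illustration and are not the paper's exact procedure.

```python
# Sketch: linear probe on synthetic "attention-head activations", then use the
# probe's weight vector as a direction to project out of the activations.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64                                   # head activation dimension
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)

# Synthetic activations: "sycophantic" examples are shifted along true_dir.
X_neg = rng.normal(size=(200, d))
X_pos = rng.normal(size=(200, d)) + 2.0 * true_dir
X = np.vstack([X_neg, X_pos])
y = np.array([0] * 200 + [1] * 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
print("probe accuracy:", probe.score(X, y))
print("alignment with planted direction:", float(direction @ true_dir))


def remove_direction(activations: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project the identified direction out of each activation vector."""
    return activations - np.outer(activations @ direction, direction)


steered = remove_direction(X_pos, direction)
print("mean projection after steering:", float((steered @ direction).mean()))
```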
[NLP-18] Typologically Informed Parameter Aggregation EACL2026
[Quick Read]: This paper addresses the weak performance of massively multilingual language models on low-resource and unseen languages, and the high cost of training language-specific adapters at scale for adapter-based fine-tuning. The key to the solution is Typologically Informed Parameter Aggregation (TIPA), a training-free method that builds proxy language adapters by aggregating existing adapters with weights determined by typological similarity, enabling zero-shot cross-lingual transfer without training a dedicated adapter module for each language.
Link: https://arxiv.org/abs/2601.16629
Authors: Stef Accou, Wessel Poelman
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: EACL 2026: Findings
Abstract:Massively multilingual language models enable cross-lingual generalization but underperform on low-resource and unseen languages. While adapter-based fine-tuning offers a parameter-efficient solution, training language-specific adapters at scale remains costly. We introduce Typologically Informed Parameter Aggregation (TIPA), a training-free method that constructs proxy language adapters by aggregating existing ones, weighted by typological similarity. Integrated into the MAD-X framework, these proxies enable zero-shot cross-lingual transfer without additional training. We evaluate TIPA on five NLP tasks and over 230 languages. TIPA consistently outperforms or matches baselines such as English-only fine-tuning or selecting the typologically closest language adapter. We see the largest gains for languages lacking dedicated adapters. Our results demonstrate that typologically informed aggregation provides a viable alternative to language-specific modules without any training needed.
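A minimal sketch of the aggregation step, assuming adapters are stored as dictionaries of same-shaped parameter arrays and that typological similarities to the target language are already computed; the similarity values and parameter shapes below are made up.

```python
# Sketch of the TIPA idea: build a proxy adapter for an unseen language as a
# similarity-weighted average of existing language adapters.

import numpy as np


def aggregate_adapters(adapters: dict, similarities: dict) -> dict:
    """Similarity-weighted average over adapters with identical parameter shapes."""
    langs = list(adapters)
    weights = np.array([similarities[lang] for lang in langs], dtype=float)
    weights /= weights.sum()  # normalize to a convex combination
    proxy = {}
    for name in adapters[langs[0]]:
        proxy[name] = sum(w * adapters[lang][name] for w, lang in zip(weights, langs))
    return proxy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shapes = {"down_proj": (16, 4), "up_proj": (4, 16)}
    adapters = {
        lang: {name: rng.normal(size=shape) for name, shape in shapes.items()}
        for lang in ["deu", "nld", "swe"]
    }
    # Hypothetical typological similarities to the target language.
    similarities = {"deu": 0.6, "nld": 0.3, "swe": 0.1}
    proxy = aggregate_adapters(adapters, similarities)
    print({name: p.shape for name, p in proxy.items()})
```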
[NLP-19] MultiLexNorm: A Unified Benchmark and a Generative Model for Lexical Normalization for Asian Languages
[Quick Read]: This paper addresses the performance degradation of NLP models on social media text caused by informal, spontaneous language and diverse sociolects, which lexical normalization mitigates by converting non-standard text into a standard variant. The key to the solution is an extension of the MultiLexNorm benchmark covering 5 Asian languages from different language families in 4 scripts, together with a new architecture based on large language models (LLMs) that generalizes more robustly across languages than the previous state of the art; an analysis of the remaining errors highlights the limitations of current methods and directions for future work.
Link: https://arxiv.org/abs/2601.16623
Authors: Weerayut Buaphet, Thanh-Nhi Nguyen, Risa Kondo, Tomoyuki Kajiwara, Yumin Kim, Jimin Lee, Hwanhee Lee, Holy Lovenia, Peerat Limkonchotiwat, Sarana Nutanong, Rob Van der Goot
Affiliations: School of Information Science and Technology (VISTEC); University of Information Technology, Ho Chi Minh City; Vietnam National University, Ho Chi Minh City; Graduate School of Science and Engineering, Ehime University; D3 Center, The University of Osaka; Chung-Ang University; SEACrowd; IndoNLP; AI Singapore; IT University of Copenhagen
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Social media data has been of interest to Natural Language Processing (NLP) practitioners for over a decade, because of its richness in information, but also challenges for automatic processing. Since language use is more informal, spontaneous, and adheres to many different sociolects, the performance of NLP models often deteriorates. One solution to this problem is to transform data to a standard variant before processing it, which is also called lexical normalization. There has been a wide variety of benchmarks and models proposed for this task. The MultiLexNorm benchmark proposed to unify these efforts, but it consists almost solely of languages from the Indo-European language family in the Latin script. Hence, we propose an extension to MultiLexNorm, which covers 5 Asian languages from different language families in 4 different scripts. We show that the previous state-of-the-art model performs worse on the new languages and propose a new architecture based on Large Language Models (LLMs), which shows more robust performance. Finally, we analyze remaining errors, revealing future directions for this task.
[NLP-20] How Does Personalized Memory Shape LLM Behavior? Benchmarking Rational Preference Utilization in Personalized Assistants
[Quick Read]: This paper addresses irrational personalization in LLM-powered assistants with memory: irrelevant or inappropriate personalized memories introduced into the context interfere with the LLM's intent understanding. The key to the solution is to treat memory utilization as a pragmatic reasoning process that selectively integrates personalized information, realized in RP-Reasoner. The authors also build RPEval, a benchmark comprising a personalized intent reasoning dataset and a multi-granularity evaluation protocol, which reveals widespread irrational personalization and its negative impact on user experience. RP-Reasoner clearly outperforms carefully designed baselines on RPEval and resolves 80% of the bad cases observed in a large-scale commercial personalized assistant, showing the promise of pragmatic reasoning for mitigating irrational personalization.
Link: https://arxiv.org/abs/2601.16621
Authors: Xueyang Feng, Weinan Gan, Xu Chen, Quanyu Dai, Yong Liu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language model (LLM)-powered assistants have recently integrated memory mechanisms that record user preferences, leading to more personalized and user-aligned responses. However, irrelevant personalized memories are often introduced into the context, interfering with the LLM’s intent understanding. To comprehensively investigate the dual effects of personalization, we develop RPEval, a benchmark comprising a personalized intent reasoning dataset and a multi-granularity evaluation protocol. RPEval reveals the widespread phenomenon of irrational personalization in existing LLMs and, through error pattern analysis, illustrates its negative impact on user experience. Finally, we introduce RP-Reasoner, which treats memory utilization as a pragmatic reasoning process, enabling the selective integration of personalized information. Experimental results demonstrate that our method significantly outperforms carefully designed baselines on RPEval, and resolves 80% of the bad cases observed in a large-scale commercial personalized assistant, highlighting the potential of pragmatic reasoning to mitigate irrational personalization. Our benchmark is publicly available at this https URL.
[NLP-21] PROST-LLM: Progressively Enhancing the Speech-to-Speech Translation Capability in LLMs ICASSP2026
[Quick Read]: This paper addresses the limited use of large language models (LLMs) for speech-to-speech translation (S2ST), which is hindered by data scarcity. The key to the solution is PROST-LLM, a progressive S2ST framework with three stages: the model is first fine-tuned on the CVSS corpus with tri-task learning and a chain-of-modality method to boost initial performance; the fine-tuned model then generates preference pairs through self-sampling and back-translation without human evaluation; finally, these preference pairs are used for preference optimization to further improve translation quality. The approach mitigates the shortage of high-quality labeled S2ST data and strengthens the S2ST capability of LLMs.
Link: https://arxiv.org/abs/2601.16618
Authors: Jing Xu, Jiaqi Wang, Daxin Tan, Xiao Chen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted by ICASSP 2026
Abstract:Although Large Language Models (LLMs) excel in many tasks, their application to Speech-to-Speech Translation (S2ST) is underexplored and hindered by data scarcity. To bridge this gap, we propose PROST-LLM (PROgressive Speech-to-speech Translation) to enhance the S2ST capabilities in LLMs progressively. First, we fine-tune the LLMs with the CVSS corpus, employing designed tri-task learning and chain of modality methods to boost the initial performance. Then, leveraging the fine-tuned model, we generate preference pairs through self-sampling and back-translation without human evaluation. Finally, these preference pairs are used for preference optimization to enhance the model’s S2ST capability further. Extensive experiments confirm the effectiveness of our proposed PROST-LLM in improving the S2ST capability of LLMs.
[NLP-22] AuroraEdge-V-2B: A Faster And Stronger Edge Visual Large Language Model
[Quick Read]: This paper addresses three obstacles to deploying visual large language models (VLLMs) in industry: VLLMs underperform custom deep learning models (DLMs) on domain-specific tasks, their large parameter counts make edge deployment expensive, and their slow inference makes real-time response difficult. The key to the solution is AuroraEdge-V-2B, a compact, robust, and fast VLLM for edge deployment whose compression-fusion method significantly reduces the number of visual tokens and halves floating-point operations during inference while keeping strong generalization, enabling low-latency edge deployment and outperforming models of similar size (e.g., Qwen2-VL-2B, InternVL-2.5-2B) on 9 benchmarks.
Link: https://arxiv.org/abs/2601.16615
Authors: Xiang Chen
Affiliations: Independent Researcher
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Recently, due to the advancement of multimodal technology, people are attempting to use visual large language models (VLLMs) in industrial production. Many deep learning models (DLMs) deployed in the production environment are gradually being replaced by VLLMs. Compared with DLMs, VLLMs have some advantages in industrial applications: (1) Their strong generalization ability enables them to perform well across a wide range of tasks. (2) They are flexible and can deal with unfamiliar samples through context learning quickly. However, VLLMs also have obvious drawbacks: (1) VLLMs do not perform as well as custom-developed DLMs in specific domains. (2) The number of parameters in VLLMs is generally quite large, and their deployment requires substantial computational resources. (3) VLLMs generally operate much slower than DLMs, making real-time response challenging to achieve. To better utilize VLLMs in industrial applications, we introduce AuroraEdge-V-2B in this work, a compact, robust, and high-speed VLLM designed for edge deployment. To make the model run faster, we also propose a compression-fusion method to improve inference efficiency. AuroraEdge-V-2B has the following notable features: (1) Easy deployment and faster: It has only 2B parameters and is highly suitable for edge deployment, offering better real-time performance. (2) Fewer visual tokens and cheaper: It significantly reduces the number of visual tokens in the decoding process, thereby reducing the floating-point operations by half during inference and making it cheaper to use. (3) Strong performance: It gets a higher score on 9 benchmarks than models with the same number of parameter (e.g., Qwen2-VL-2B, Qwen2.5-VL-3B, InternVL-2.5-2B).
[NLP-23] Attention-MoA: Enhancing Mixture-of-Agents via Inter-Agent Semantic Attention and Deep Residual Synthesis
[Quick Read]: This paper addresses the shallow semantic interaction between agents in current Mixture-of-Agents (MoA) frameworks, which makes it hard to correct hallucinations and refine reasoning chains during inference; recent variants add dynamic routing and residual connections for efficiency but do not achieve deep semantic collaboration. The key to the solution is Attention-MoA, which restructures agent collaboration with an inter-agent semantic attention mechanism that allocates attention dynamically by semantic relevance, combined with an inter-layer residual module with an adaptive early-stopping mechanism that mitigates information degradation in deep layers and improves computational efficiency. This markedly strengthens the system's ability to understand complex tasks and correct errors, outperforming mainstream large models on several benchmarks.
Link: https://arxiv.org/abs/2601.16596
Authors: Jianyu Wen, Yang Wei, Xiongxi Yu, Changxuan Xiao, Ke Zeng
Affiliations: Meituan LongCat Interaction Team
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:As the development of Large Language Models (LLMs) shifts from parameter scaling to inference-time collaboration, the Mixture-of-Agents (MoA) framework has emerged as a general paradigm to harness collective intelligence by layering diverse models. While recent MoA variants have introduced dynamic routing and residual connections to improve efficiency, these methods often fail to facilitate deep semantic interaction between agents, limiting the system’s ability to actively correct hallucinations and refine logic. In this paper, we introduce Attention-MoA, a novel MoA-based framework that redefines collaboration through Inter-agent Semantic Attention. Complemented by an Inter-layer Residual Module with Adaptive Early Stopping Mechanism, our architecture mitigates information degradation in deep layers while improving computational efficiency. Extensive evaluations across AlpacaEval 2.0, MT-Bench, and FLASK demonstrate that Attention-MoA significantly outperforms state-of-the-art baselines, achieving a 91.15% Length-Controlled Win Rate on AlpacaEval 2.0 and dominating in 10 out of 12 capabilities on FLASK. Notably, Attention-MoA enables an ensemble of small open-source models to outperform massive proprietary models like Claude-4.5-Sonnet and GPT-4.1, achieving an MT-Bench score of 8.83 and an AlpacaEval 2.0 LC Win Rate of 77.36%.
[NLP-24] Retrieve-Refine-Calibrate: A Framework for Complex Claim Fact-Checking
[Quick Read]: This paper addresses the accuracy loss of the decomposition paradigm in fact-checking, where a claim is split into sub-claims verified separately and noise from irrelevant entities or evidence can creep in. The key to the solution is an LLM-based Retrieve-Refine-Calibrate (RRC) framework: it first identifies the key entities in the claim and retrieves evidence relevant to them, then refines the retrieved evidence against the claim to remove irrelevant information, and finally calibrates the verdict by re-evaluating low-confidence predictions, improving overall verification accuracy.
Link: https://arxiv.org/abs/2601.16555
Authors: Mingwei Sun, Qianlong Wang, Ruifeng Xu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, 4 figures. This is an original work by the authors. Any unauthorized submission, reproduction, or commercial use by third parties is prohibited
Abstract:Fact-checking aims to verify the truthfulness of a claim based on the retrieved evidence. Existing methods typically follow a decomposition paradigm, in which a claim is broken down into sub-claims that are individually verified. However, the decomposition paradigm may introduce noise to the verification process due to irrelevant entities or evidence, ultimately degrading verification accuracy. To address this problem, we propose a Retrieve-Refine-Calibrate (RRC) framework based on large language models (LLMs). Specifically, the framework first identifies the entities mentioned in the claim and retrieves evidence relevant to them. Then, it refines the retrieved evidence based on the claim to reduce irrelevant information. Finally, it calibrates the verification process by re-evaluating low-confidence predictions. Experiments on two popular fact-checking datasets (HOVER and FEVEROUS-S) demonstrate that our framework achieves superior performance compared with competitive baselines.
[NLP-25] A Collision-Free Hot-Tier Extension for Engram-Style Conditional Memory: A Controlled Study of Training Dynamics
[Quick Read]: This paper investigates whether collisions between high-frequency n-grams are the primary bottleneck of Engram-style conditional memory. The key to the solution is Engram-Nine, a collision-free hot-tier extension that maps the most frequent n-grams through a minimal perfect hash function (MPHF) while keeping the original multi-head hashed lookup as a cold tier, enabling a strictly iso-parameter comparison. The experiments suggest that collisions are not purely harmful and may act as implicit regularization, that the hot-to-cold loss advantage flips during training, and that the dominant limitation more likely lies in the gate's credit assignment across positions than in index accuracy.
Link: https://arxiv.org/abs/2601.16531
Authors: Tao Lin
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:We investigate whether high-frequency key collisions are a primary bottleneck in Engram-style conditional memory. To isolate the effect of collisions, we introduce Engram-Nine, a collision-free hot-tier extension that maps the most frequent n-grams through a Minimal Perfect Hash Function (MPHF) while retaining the original multi-head hashed lookup as a cold tier. Under a strictly iso-parameter setup, the collision-free design does not consistently improve validation loss. Through route-stratified evaluation (decomposing per-token loss into hot/cold contributions), we uncover a consistent “hot-to-cold advantage flip” during training: hot (high-frequency) positions initially have lower loss, but cold positions eventually surpass them. Crucially, collision-free configurations flip earlier than collision-prone baselines, suggesting that collisions act as implicit regularization. We also identify a gating mismatch: the gate learns to favor hot positions early in training, but this preference persists even after the flip, assigning higher weights to positions with higher loss. Our findings suggest that improving lookup precision alone does not guarantee better training outcomes. The dominant limitation may lie in gating credit assignment rather than index accuracy, and collision-induced noise may provide beneficial regularization that should not be naively eliminated.
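A toy illustration of the two-tier lookup described above: a Python dict stands in for the collision-free MPHF over the most frequent n-grams, while the remaining n-grams fall back to a fixed number of hashed buckets where collisions can occur. This mirrors the routing logic only, not the learned memory values or gating.

```python
# Hot/cold two-tier n-gram lookup: a dict plays the role of the MPHF (exact,
# collision-free indices for frequent n-grams); everything else is hashed into
# a small number of cold buckets, where distinct n-grams may collide.

NUM_COLD_BUCKETS = 8

hot_ngrams = [("the", "cat"), ("on", "the"), ("in", "a")]   # most frequent n-grams
hot_table = {ng: idx for idx, ng in enumerate(hot_ngrams)}  # MPHF stand-in


def cold_bucket(ngram) -> int:
    # hash() varies across runs; a real system would use a fixed hash function.
    return hash(ngram) % NUM_COLD_BUCKETS


def lookup(ngram):
    if ngram in hot_table:                # collision-free hot tier
        return ("hot", hot_table[ngram])
    return ("cold", cold_bucket(ngram))   # collision-prone cold tier


if __name__ == "__main__":
    for ng in [("the", "cat"), ("purple", "giraffe"), ("quantum", "soup")]:
        print(ng, "->", lookup(ng))
```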
[NLP-26] Curate-Train-Refine: A Closed-Loop Agentic Framework for Zero Shot Classification
[Quick Read]: This paper addresses the high inference cost and latency of large language models (LLMs) in zero-shot and few-shot classification, which limit practical deployment despite strong accuracy. The key to the solution is training lightweight text classifiers on dynamically generated supervision: an iterative agentic loop in which the LLM curates training data, analyzes the classifier's successes and failures, and synthesizes targeted examples for the observed errors, progressively improving data quality and adapting it to the downstream classifier and task. This yields consistent accuracy gains over zero- and few-shot baselines while avoiding the operational cost of deploying a large model.
Link: https://arxiv.org/abs/2601.16530
Authors: Gaurav Maheshwari, Kevin El Haddad
Affiliations: Diabolocom; ISIA Lab - University of Mons
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Large language models (LLMs) and high-capacity encoders have advanced zero and few-shot classification, but their inference cost and latency limit practical deployment. We propose training lightweight text classifiers using dynamically generated supervision from an LLM. Our method employs an iterative, agentic loop in which the LLM curates training data, analyzes model successes and failures, and synthesizes targeted examples to address observed errors. This closed-loop generation and evaluation process progressively improves data quality and adapts it to the downstream classifier and task. Across four widely used benchmarks, our approach consistently outperforms standard zero and few-shot baselines. These results indicate that LLMs can serve effectively as data curators, enabling accurate and efficient classification without the operational cost of large-model deployment.
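A skeleton of the closed loop described in the abstract, with placeholder `llm_*` functions standing in for the actual LLM calls and a simple scikit-learn pipeline as the lightweight classifier; the real curation, error-analysis, and synthesis prompts are not specified here, so everything below is an assumed minimal structure.

```python
# Closed-loop skeleton: curate seed data -> train a light classifier ->
# find errors on a dev set -> synthesize targeted examples -> repeat.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def llm_curate_seed(labels):                 # placeholder for an LLM call
    return [("great product, works well", "positive"),
            ("arrived broken, waste of money", "negative")]


def llm_synthesize_for_errors(errors):       # placeholder for an LLM call
    return [("not terrible, but not good either", "negative")]


def train(dataset):
    texts, labels = zip(*dataset)
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    return clf.fit(texts, labels)


def closed_loop(dev_set, labels, rounds: int = 3):
    dataset = llm_curate_seed(labels)
    clf = train(dataset)
    for _ in range(rounds):
        errors = [(t, y) for t, y in dev_set if clf.predict([t])[0] != y]
        if not errors:
            break
        dataset += llm_synthesize_for_errors(errors)  # targeted new examples
        clf = train(dataset)
    return clf


if __name__ == "__main__":
    dev = [("I love it", "positive"), ("not good either", "negative")]
    model = closed_loop(dev, labels=["positive", "negative"])
    print(model.predict(["works well", "waste of money"]))
```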
[NLP-27] Beyond Superficial Unlearning: Sharpness-Aware Robust Erasure of Hallucinations in Multimodal LLMs
[Quick Read]: This paper addresses object hallucinations in multimodal large language models (MLLMs), i.e., descriptions of non-existent entities that undermine reliability. Existing unlearning methods only suppress hallucinations superficially and suffer from structural fragility: standard erasure traps the model in sharp minima, so hallucinations catastrophically resurge after lightweight relearning. The key to the solution is SARE, which casts unlearning as a targeted min-max optimization problem and uses a Targeted-SAM mechanism to explicitly flatten the loss landscape around hallucinated concepts; by suppressing hallucinations under simulated worst-case parameter perturbations, it makes the unlearning result geometrically stable against weight shifts, achieving durable and robust removal while preserving general generation quality.
Link: https://arxiv.org/abs/2601.16527
Authors: Xianya Fang, Feiyang Ren, Xiang Chen, Yu Tian, Zhen Bi, Haiyang Yu, Sheng-Jun Huang
Affiliations: Nanjing University of Aeronautics and Astronautics; Institute for AI, Tsinghua University; Huzhou University; Institute of Dataspace, Hefei Comprehensive National Science Center; University of Science and Technology of China
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Multimodal LLMs are powerful but prone to object hallucinations, which describe non-existent entities and harm reliability. While recent unlearning methods attempt to mitigate this, we identify a critical flaw: structural fragility. We empirically demonstrate that standard erasure achieves only superficial suppression, trapping the model in sharp minima where hallucinations catastrophically resurge after lightweight relearning. To ensure geometric stability, we propose SARE, which casts unlearning as a targeted min-max optimization problem and uses a Targeted-SAM mechanism to explicitly flatten the loss landscape around hallucinated concepts. By suppressing hallucinations under simulated worst-case parameter perturbations, our framework ensures robust removal stable against weight shifts. Extensive experiments demonstrate that SARE significantly outperforms baselines in erasure efficacy while preserving general generation quality. Crucially, it maintains persistent hallucination suppression against relearning and parameter updates, validating the effectiveness of geometric stabilization.
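A generic sketch of the sharpness-aware (SAM-style) min-max step that Targeted-SAM builds on: perturb the weights along the normalized gradient of the unlearning loss (radius `rho`), take the gradient at the perturbed point, then update the original weights. SARE applies this selectively around hallucinated concepts in an MLLM; the toy model and loss below are illustrative only.

```python
# One simplified SAM-style min-max step on a toy model.

import torch
import torch.nn as nn

model = nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
rho = 0.05

x = torch.randn(16, 8)
target = torch.zeros(16, 1)        # stand-in "forget" target
loss_fn = nn.MSELoss()

# 1) Gradient at the current weights.
loss = loss_fn(model(x), target)
loss.backward()

# 2) Worst-case perturbation e = rho * g / ||g||, applied in place.
grads = [p.grad.detach().clone() for p in model.parameters()]
norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
with torch.no_grad():
    for p, g in zip(model.parameters(), grads):
        p.add_(rho * g / (norm + 1e-12))

# 3) Loss at the perturbed weights; its gradient drives the actual update.
opt.zero_grad()
perturbed_loss = loss_fn(model(x), target)
perturbed_loss.backward()
with torch.no_grad():              # undo the perturbation before stepping
    for p, g in zip(model.parameters(), grads):
        p.sub_(rho * g / (norm + 1e-12))
opt.step()
print(float(loss), float(perturbed_loss))
```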
[NLP-28] TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning
[Quick Read]: This paper addresses the underexplored ability of multimodal large language models (MLLMs) to perform precise compositional spatial reasoning; existing benchmarks use relatively simple tasks, rely on semantic approximation or coarse relative positioning, and lack rigorously formalized metrics. The key to the solution is TangramPuzzle, a geometry-grounded benchmark built on the Tangram Construction Expression (TCE), a symbolic geometric framework that grounds tangram assemblies in exact, machine-verifiable coordinate specifications and thus removes the ambiguity of visual approximation. Two complementary tasks (outline prediction and end-to-end code generation) systematically probe how well models understand and satisfy geometric constraints, and experiments on advanced open-source and proprietary models reveal that MLLMs tend to match the target silhouette while neglecting geometric constraints, leading to distorted or deformed pieces.
Link: https://arxiv.org/abs/2601.16520
Authors: Daixian Liu, Jiayi Kuang, Yinghui Li, Yangning Li, Di Yin, Haoyu Cao, Xing Sun, Ying Shen, Hai-Tao Zheng, Liang Lin, Philip S. Yu
Affiliations: Tsinghua University; Sun-Yat Sen University; Tencent Youtu Lab; University of Illinois Chicago
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable progress in visual recognition and semantic understanding. Nevertheless, their ability to perform precise compositional spatial reasoning remains largely unexplored. Existing benchmarks often involve relatively simple tasks and rely on semantic approximations or coarse relative positioning, while their evaluation metrics are typically limited and lack rigorous mathematical formulations. To bridge this gap, we introduce TangramPuzzle, a geometry-grounded benchmark designed to evaluate compositional spatial reasoning through the lens of the classic Tangram game. We propose the Tangram Construction Expression (TCE), a symbolic geometric framework that grounds tangram assemblies in exact, machine-verifiable coordinate specifications, to mitigate the ambiguity of visual approximation. We design two complementary tasks: Outline Prediction, which demands inferring global shapes from local components, and End-to-End Code Generation, which requires solving inverse geometric assembly problems. We conduct extensive evaluation experiments on advanced open-source and proprietary models, revealing an interesting insight: MLLMs tend to prioritize matching the target silhouette while neglecting geometric constraints, leading to distortions or deformations of the pieces.
[NLP-29] SearchLLM: Detecting LLM Paraphrased Text by Measuring the Similarity with Regeneration of the Candidate Source via Search Engine EACL2026
[Quick Read]: This paper addresses the difficulty of detecting LLM paraphrasing, which can lose or distort the original meaning, particularly when the paraphrased text closely mimics the original and defeats traditional detectors. The key to the solution is SearchLLM, which uses a search engine to locate candidate sources of the original text and compares the input with regenerated versions of those candidates to decide whether it is LLM-paraphrased. Designed as a proxy layer, SearchLLM plugs into existing detectors, consistently improving their accuracy and helping them resist paraphrasing attacks.
Link: https://arxiv.org/abs/2601.16512
Authors: Hoang-Quoc Nguyen-Son, Minh-Son Dao, Koji Zettsu
Affiliations: National Institute of Information and Communications Technology; Nagoya University
Subjects: Computation and Language (cs.CL)
Comments: EACL 2026 camera ready (Main Track)
Abstract:With the advent of large language models (LLMs), it has become common practice for users to draft text and utilize LLMs to enhance its quality through paraphrasing. However, this process can sometimes result in the loss or distortion of the original intended meaning. Due to the human-like quality of LLM-generated text, traditional detection methods often fail, particularly when text is paraphrased to closely mimic original content. In response to these challenges, we propose a novel approach named SearchLLM, designed to identify LLM-paraphrased text by leveraging search engine capabilities to locate potential original text sources. By analyzing similarities between the input and regenerated versions of candidate sources, SearchLLM effectively distinguishes LLM-paraphrased content. SearchLLM is designed as a proxy layer, allowing seamless integration with existing detectors to enhance their performance. Experimental results across various LLMs demonstrate that SearchLLM consistently enhances the accuracy of recent detectors in detecting LLM-paraphrased text that closely mimics original content. Furthermore, SearchLLM also helps the detectors prevent paraphrasing attacks.
zh
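下面用一段示意性 Python 代码说明上文 [NLP-29] 中 SearchLLM 作为"代理层"的基本判定流程:先用搜索引擎找到候选原文,再让 LLM 重新生成候选文本,最后比较输入与候选/重生成文本的相似度。其中 search_candidates、llm_paraphrase 均为笔者假设的外部接口,相似度也只用 difflib 粗略近似,并非论文的官方实现或真实评分方式。

```python
# 示意性实现(非论文官方代码):SearchLLM 代理层的核心判定逻辑。
from difflib import SequenceMatcher
from typing import Callable, List

def similarity(a: str, b: str) -> float:
    """粗略的文本相似度(0~1),仅作演示。"""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def search_llm_score(
    text: str,
    search_candidates: Callable[[str], List[str]],  # 假设接口:返回候选原文列表
    llm_paraphrase: Callable[[str], str],            # 假设接口:用 LLM 重新生成候选原文
    top_k: int = 5,
) -> float:
    """返回输入文本与(重生成后的)候选原文的最大相似度,分数越高越可能是 LLM 改写。"""
    best = 0.0
    for cand in search_candidates(text)[:top_k]:
        regenerated = llm_paraphrase(cand)
        best = max(best, similarity(text, cand), similarity(text, regenerated))
    return best

if __name__ == "__main__":
    # 用固定候选与恒等"改写"跑通流程,真实使用时需接入搜索引擎与 LLM
    fake_search = lambda q: ["The cat sat on the mat."]
    fake_llm = lambda s: s
    print(search_llm_score("A cat was sitting on the mat.", fake_search, fake_llm))
```

得到的分数可以作为现有检测器的附加信号,这正是论文所说"代理层"集成方式的一种可能形式。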
[NLP-30] Is Length Really A Liability? An Evaluation of Multi-turn LLM Conversations using BoolQ
【速读】: 该论文试图解决当前大语言模型(Large Language Models, LLMs)基准测试主要依赖单轮提示(single-prompt evaluations)所导致的评估局限性问题,即此类静态评估无法捕捉真实场景中因对话长度和上下文依赖引发的有害响应风险。解决方案的关键在于引入多轮对话设置下的动态评估范式,通过在BoolQ数据集上系统性地改变对话长度和支架(scaffolding)条件,揭示出仅在多轮交互中显现的模型特异性脆弱性,从而证明了静态评估难以识别部署相关的真实风险。
链接: https://arxiv.org/abs/2601.16508
作者: Karl Neergaard,Le Qiu,Emmanuele Chersoni
机构: The Hong Kong Polytechnic University (香港理工大学)
类目: Computation and Language (cs.CL)
备注: 4 pages plus 6 pages of bibliography and appendix
Abstract:Single-prompt evaluations dominate current LLM benchmarking, yet they fail to capture the conversational dynamics where real-world harm occurs. In this study, we examined whether conversation length affects response veracity by evaluating LLM performance on the BoolQ dataset under varying length and scaffolding conditions. Our results across three distinct LLMs revealed model-specific vulnerabilities that are invisible under single-turn testing. The length-dependent and scaffold-specific effects we observed demonstrate a fundamental limitation of static evaluations, as deployment-relevant vulnerabilities could only be spotted in a multi-turn conversational setting.
zh
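为了更直观地理解上文 [NLP-30] 中"改变对话长度与支架条件"的实验构造方式,下面给出一个最小化的评测脚本示意:在提出 BoolQ 问题之前先插入若干轮无关对话。chat 接口、填充轮次内容与答案解析方式均为笔者假设,并非论文的原始评测代码。

```python
# 示意代码:构造不同对话长度条件下的 BoolQ 多轮评测(非官方实现)。
from typing import Dict, List

def build_conversation(passage: str, question: str,
                       filler_turns: List[str]) -> List[Dict[str, str]]:
    messages = [{"role": "system", "content": "Answer yes or no."}]
    for turn in filler_turns:                     # 先填充若干轮与问题无关的对话
        messages.append({"role": "user", "content": turn})
        messages.append({"role": "assistant", "content": "Okay."})
    messages.append({"role": "user",
                     "content": f"Passage: {passage}\nQuestion: {question}"})
    return messages

def accuracy(chat, examples, filler_turns) -> float:
    """chat(messages)->str 为假设的模型接口;examples 为 (passage, question, answer) 三元组。"""
    correct = 0
    for passage, question, answer in examples:
        reply = chat(build_conversation(passage, question, filler_turns)).lower()
        correct += int(("yes" in reply) == answer)
    return correct / max(len(examples), 1)
```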
[NLP-31] LOGICAL-COMMONSENSEQA: A Benchmark for Logical Commonsense Reasoning
【速读】: 该论文旨在解决当前常识推理评测基准普遍采用单标签评价方式,无法准确刻画多个可能解释之间逻辑关系(如联合合理、互斥或共同不合理)的问题。其解决方案的关键在于提出LOGICAL-COMMONSENSEQA基准,将常识推理任务重新建模为对原子陈述对的逻辑组合操作(AND、OR、NEITHER/NOR),通过引入可区分的合理性层级运算符,实现对模型在复合式常识推理能力上的系统性评估。该框架揭示了现有模型在否定类推理中的显著性能下降,并为推进组合式常识推理提供了可控且严谨的实验平台。
链接: https://arxiv.org/abs/2601.16504
作者: Obed Junias,Maria Leonor Pacheco
机构: University of Colorado Boulder (科罗拉多大学博尔德分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Commonsense reasoning often involves evaluating multiple plausible interpretations rather than selecting a single atomic answer, yet most benchmarks rely on single-label evaluation, obscuring whether statements are jointly plausible, mutually exclusive, or jointly implausible. We introduce LOGICAL-COMMONSENSEQA, a benchmark that re-frames commonsense reasoning as logical composition over pairs of atomic statements using plausibility-level operators (AND, OR, NEITHER/NOR). Evaluating instruction-tuned, reasoning-specialized, and fine-tuned models under zero-shot, few-shot, and chain-of-thought prompting, we find that while models perform reasonably on conjunctive and moderately on disjunctive reasoning, performance degrades sharply on negation-based questions. LOGICAL-COMMONSENSEQA exposes fundamental reasoning limitations and provides a controlled framework for advancing compositional commonsense reasoning.
zh
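上文 [NLP-31] 的核心是把两个原子陈述的合理性标签用 AND / OR / NEITHER-NOR 组合成复合问题的金标签。下面是按此描述写的一个极简示意函数(非官方实现,运算符命名为笔者假设):

```python
# 示意代码:原子陈述合理性标签的逻辑组合(非官方实现)。
def compose(p_a: bool, p_b: bool, op: str) -> bool:
    """p_a, p_b: 两个原子陈述是否合理;op: 'AND' | 'OR' | 'NOR'。"""
    if op == "AND":
        return p_a and p_b                   # 两者需同时合理
    if op == "OR":
        return p_a or p_b                    # 至少一个合理
    if op == "NOR":
        return (not p_a) and (not p_b)       # 两者均不合理
    raise ValueError(f"unknown operator: {op}")

assert compose(True, False, "OR") is True
assert compose(False, False, "NOR") is True
assert compose(True, False, "AND") is False
```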
[NLP-32] MRAG: Benchmarking Retrieval-Augmented Generation for Bio-medicine
【速读】: 该论文旨在解决医学领域中检索增强生成(Retrieval-Augmented Generation, RAG)系统缺乏全面评估基准的问题。为应对这一挑战,作者提出了Medical RAG(MRAG)基准,涵盖中英文多种任务,并构建了基于Wikipedia和PubMed的语料库;其关键解决方案在于设计并开源MRAG-Bench数据集与MRAG-Toolkit工具包,支持对不同RAG组件(如检索方法、模型规模、提示策略)进行系统性探索,从而推动医学问答场景下大语言模型(LLM)可靠性和推理质量的提升。
链接: https://arxiv.org/abs/2601.16503
作者: Wei Zhu
机构: University of Hong Kong (香港大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:While Retrieval-Augmented Generation (RAG) has been swiftly adopted in scientific and clinical QA systems, a comprehensive evaluation benchmark in the medical domain is lacking. To address this gap, we introduce the Medical Retrieval-Augmented Generation (MRAG) benchmark, covering various tasks in English and Chinese languages, and building a corpus with Wikipedia and Pubmed. Additionally, we develop the MRAG-Toolkit, facilitating systematic exploration of different RAG components. Our experiments reveal that: (a) RAG enhances LLM reliability across MRAG tasks. (b) the performance of RAG systems is influenced by retrieval approaches, model sizes, and prompting strategies. (c) While RAG improves usefulness and reasoning quality, LLM responses may become slightly less readable for long-form questions. We will release the MRAG-Bench’s dataset and toolkit with CCBY-4.0 license upon acceptance, to facilitate applications from both academia and industry.
zh
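下面是一个通用的"检索-拼接-生成"RAG 骨架示意,帮助理解上文 [NLP-32] 所评测的系统形态:用 TF-IDF 充当检索器,generate 为假设的 LLM 接口。这只是常见 RAG 流程的简化写法,并非 MRAG-Toolkit 本身。

```python
# 示意代码:最小化的检索增强生成流程(非 MRAG-Toolkit 官方实现)。
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(query: str, corpus: list, k: int = 3) -> list:
    vec = TfidfVectorizer()
    doc_mat = vec.fit_transform(corpus)          # 对语料建立 TF-IDF 索引
    scores = cosine_similarity(vec.transform([query]), doc_mat)[0]
    top = scores.argsort()[::-1][:k]
    return [corpus[i] for i in top]

def rag_answer(query: str, corpus: list, generate) -> str:
    """generate(prompt)->str 为假设的生成接口,由使用者接入具体模型。"""
    context = "\n".join(retrieve(query, corpus))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)
```

论文比较的检索方法、模型规模与提示策略,对应的就是替换这里的 retrieve、generate 与 prompt 模板。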
[NLP-33] EvoConfig: Self-Evolving Multi-Agent Systems for Efficient Autonomous Environment Configuration
【速读】: 该论文旨在解决大型语言模型在软件工程任务中构建可靠可执行环境时面临的配置效率低和错误处理能力弱的问题。当前方法普遍忽视对智能体(agent)细粒度操作的分析,导致难以应对复杂错误并频繁出现配置失败。解决方案的关键在于提出EvoConfig框架,其核心创新包括:一是专家诊断模块(expert diagnosis module),实现对执行后行为的细粒度分析;二是自进化机制(self-evolving mechanism),使专家代理能够自我反馈并实时动态调整修复优先级,从而提升多智能体协作下的环境配置正确率与调试能力。实验证明,EvoConfig在Envbench基准上达到78.1%的成功率,显著优于Repo2Run的71.0%,且在错误识别准确性和修复建议有效性方面表现更优。
链接: https://arxiv.org/abs/2601.16489
作者: Xinshuai Guo,Jiayi Kuang,Linyue Pan,Yinghui Li,Yangning Li,Hai-Tao Zheng,Ying Shen,Di Yin,Xing Sun
机构: Tsinghua University (清华大学); Sun Yat-sen University (中山大学); Tencent Youtu Lab (腾讯优图实验室)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:A reliable executable environment is the foundation for ensuring that large language models solve software engineering tasks. Due to the complex and tedious construction process, large-scale configuration is relatively inefficient. However, most methods always overlook fine-grained analysis of the actions performed by the agent, making it difficult to handle complex errors and resulting in configuration failures. To address this bottleneck, we propose EvoConfig, an efficient environment configuration framework that optimizes multi-agent collaboration to build correct runtime environments. EvoConfig features an expert diagnosis module for fine-grained post-execution analysis, and a self-evolving mechanism that lets expert agents self-feedback and dynamically adjust error-fixing priorities in real time. Empirically, EvoConfig matches the previous state-of-the-art Repo2Run on Repo2Run’s 420 repositories, while delivering clear gains on harder cases: on the more challenging Envbench, EvoConfig achieves a 78.1% success rate, outperforming Repo2Run by 7.1%. Beyond end-to-end success, EvoConfig also demonstrates stronger debugging competence, achieving higher accuracy in error identification and producing more effective repair recommendations than existing methods.
zh
[NLP-34] Timely Machine: Awareness of Time Makes Test-Time Scaling Agentic
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在代理(agentic)场景下,因频繁调用工具导致传统基于生成长度的测试时扩展(test-time scaling)失效的问题。其核心挑战在于工具延迟(tool latency)使推理时间与生成长度解耦,从而无法有效利用计算资源。解决方案的关键在于重新定义测试时为墙钟时间(wall-clock time),提出Timely Machine框架,使模型能根据时间预算动态调整策略;同时引入Timely-Eval基准评估不同工具频率和时间约束下的表现,并设计Timely-RL方法,通过冷启动监督微调结合强化学习(Reinforcement Learning, RL)提升模型的时间规划能力,从而增强对时间预算的感知并稳定提升性能。
链接: https://arxiv.org/abs/2601.16486
作者: Yichuan Ma,Linyang Li,Yongkang chen,Peiji Li,Xiaozhe Li,Qipeng Guo,Dahua Lin,Kai Chen
机构: Fudan University (复旦大学); Shanghai AI Laboratory; Tongji University (同济大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under Review
Abstract:As large language models (LLMs) increasingly tackle complex reasoning tasks, test-time scaling has become critical for enhancing capabilities. However, in agentic scenarios with frequent tool calls, the traditional generation-length-based definition breaks down: tool latency decouples inference time from generation length. We propose Timely Machine, redefining test-time as wall-clock time, where models dynamically adjust strategies based on time budgets. We introduce Timely-Eval, a benchmark spanning high-frequency tool calls, low-frequency tool calls, and time-constrained reasoning. By varying tool latency, we find smaller models excel with fast feedback through more interactions, while larger models dominate high-latency settings via superior interaction quality. Moreover, existing models fail to adapt reasoning to time budgets. We propose Timely-RL to address this gap. After cold-start supervised fine-tuning, we use reinforcement learning to enhance temporal planning. Timely-RL improves time budget awareness and consistently boosts performance across Timely-Eval. We hope our work offers a new perspective on test-time scaling for the agentic era.
zh
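上文 [NLP-34] 把测试时计算重新定义为墙钟时间预算。下面的示意循环展示了"剩余时间不足就停止调用工具、直接作答"的基本控制逻辑;call_tool、answer 为假设接口,阈值取值亦为笔者假设,并非论文实现。

```python
# 示意代码:按墙钟时间预算决定继续调用工具还是立即作答(非官方实现)。
import time

def timely_agent(question: str, budget_s: float, call_tool, answer,
                 min_reserve_s: float = 2.0, max_turns: int = 20) -> str:
    """budget_s: 总时间预算(秒);min_reserve_s: 为最终作答保留的时间。"""
    deadline = time.monotonic() + budget_s
    observations = []
    for _ in range(max_turns):
        remaining = deadline - time.monotonic()
        if remaining <= min_reserve_s:        # 预算将尽:停止探索
            break
        observations.append(call_tool(question, observations))  # 工具延迟自动计入预算
    return answer(question, observations)
```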
[NLP-35] TL-GRPO: Turn-Level RL for Reasoning-Guided Iterative Optimization
【速读】: 该论文旨在解决迭代优化类推理任务中现有强化学习方法的局限性问题,这类任务的特点是智能体在多轮交互中持续作用于相同的环境状态,且轨迹价值由最优轮次奖励决定而非累计回报。传统基于轨迹级强化学习(如GRPO)的方法无法实现细粒度的轮次级优化,而黑箱优化方法则忽略了模型已有的先验知识和推理能力。解决方案的关键在于提出一种轻量级强化学习算法——轮次级GRPO(Turn-Level GRPO, TL-GRPO),其核心创新是引入轮次级分组采样机制,使策略能够在每一轮交互中独立评估并优化行为,从而实现对复杂科学优化任务(如模拟电路尺寸设计,ACS)的高效、精准调优。
链接: https://arxiv.org/abs/2601.16480
作者: Peiji Li,Linyang Li,Handa Sun,Wenjin Mai,Yongkang Chen,Xiaozhe Li,Yue Shen,Yichuan Ma,Yiliu Sun,Jiaxi Cao,Zhishu He,Bo Wang,Xiaoqing Zheng,Zhaori Bi,Xipeng Qiu,Qipeng Guo,Kai Chen,Dahua Lin
机构: Fudan University (复旦大学); Shanghai AI Laboratory; The Chinese University of Hong Kong (香港中文大学)
类目: Computation and Language (cs.CL)
备注: Work in progress
Abstract:Large language models have demonstrated strong reasoning capabilities in complex tasks through tool integration, which is typically framed as a Markov Decision Process and optimized with trajectory-level RL algorithms such as GRPO. However, a common class of reasoning tasks, iterative optimization, presents distinct challenges: the agent interacts with the same underlying environment state across turns, and the value of a trajectory is determined by the best turn-level reward rather than cumulative returns. Existing GRPO-based methods cannot perform fine-grained, turn-level optimization in such settings, while black-box optimization methods discard prior knowledge and reasoning capabilities. To address this gap, we propose Turn-Level GRPO (TL-GRPO), a lightweight RL algorithm that performs turn-level group sampling for fine-grained optimization. We evaluate TL-GRPO on analog circuit sizing (ACS), a challenging scientific optimization task requiring multiple simulations and domain expertise. Results show that TL-GRPO outperforms standard GRPO and Bayesian optimization methods across various specifications. Furthermore, our 30B model trained with TL-GRPO achieves state-of-the-art performance on ACS tasks under same simulation budget, demonstrating both strong generalization and practical utility.
zh
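上文 [NLP-35] 的关键是"轮次级分组采样":每一轮独立采样一组候选,并在组内对奖励做归一化。下面用纯 Python 写出这一步 GRPO 风格优势计算的简化示意(非官方实现,数值仅为演示):

```python
# 示意代码:轮次级的组内归一化优势(GRPO 风格,非官方实现)。
import statistics

def group_advantages(rewards, eps: float = 1e-6):
    """对同一轮内采样得到的一组候选的奖励做组内标准化。"""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# 一条轨迹由多轮组成,每一轮的候选组独立归一化
trajectory_rewards = [[0.2, 0.5, 0.9], [0.1, 0.1, 0.4]]
advantages = [group_advantages(turn) for turn in trajectory_rewards]
print(advantages)
```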
[NLP-36] DeepEra: A Deep Evidence Reranking Agent for Scientific Retrieval-Augmented Generated Question Answering
【速读】: 该论文旨在解决两阶段检索增强生成(Retrieval-Augmented Generation, RAG)框架中存在的一种关键问题:即检索到的文本片段虽然在语义上与查询相似,但逻辑上并不相关(Semantically Similar but Logically Irrelevant, SSLI),这会显著降低科学问答(Scientific Question Answering, SciQA)系统的事实可靠性。解决方案的关键在于提出一种深度证据重排序代理(Deep Evidence Reranking Agent, DeepEra),其通过引入分步推理机制,对候选文档段落进行超越表面语义层面的精确评估,从而提升检索结果的逻辑一致性与事实准确性。
链接: https://arxiv.org/abs/2601.16478
作者: Haotian Chen,Qingqing Long,Siyu Pu,Xiao Luo,Wei Ju,Meng Xiao,Yuanchun Zhou,Jianghua Zhao,Xuezhi Wang
机构: Computer Network Information Center, Chinese Academy of Sciences (中国科学院计算机网络信息中心); University of Wisconsin–Madison (威斯康星大学麦迪逊分校); Peking University (北京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:With the rapid growth of scientific literature, scientific question answering (SciQA) has become increasingly critical for exploring and utilizing scientific knowledge. Retrieval-Augmented Generation (RAG) enhances LLMs by incorporating knowledge from external sources, thereby providing credible evidence for scientific question answering. But existing retrieval and reranking methods remain vulnerable to passages that are semantically similar but logically irrelevant, often reducing factual reliability and amplifying hallucinations. To address this challenge, we propose a Deep Evidence Reranking Agent (DeepEra) that integrates step-by-step reasoning, enabling more precise evaluation of candidate passages beyond surface-level semantics. To support systematic evaluation, we construct SciRAG-SSLI (Scientific RAG - Semantically Similar but Logically Irrelevant), a large-scale dataset comprising about 300K SciQA instances across 10 subjects, constructed from a 10M scientific corpus. The dataset combines naturally retrieved contexts with systematically generated distractors to test logical robustness and factual grounding. Comprehensive evaluations confirm that our approach achieves superior retrieval performance compared to leading rerankers. To our knowledge, this work is the first to comprehensively study and empirically validate non-negligible SSLI issues in two-stage RAG frameworks.
zh
[NLP-37] Persona Jailbreaking in Large Language Models EACL26
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在教育、心理健康和客户支持等高风险领域中,因对话历史中的对抗性输入导致的隐式人格操控(persona manipulation)问题。现有研究多集中于叙事或角色扮演任务,忽视了仅通过用户侧输入即可在黑盒推理场景下重塑模型人格的现象,从而暴露了LLM安全性的新漏洞。解决方案的关键在于提出PHISH(Persona Hijacking via Implicit Steering in History)框架,其核心机制是将语义负载的提示嵌入用户查询中,通过多轮对话逐步诱导出与原始设定相反的人格特征(reverse personas),并在不显著损害模型推理能力的前提下实现高效且可量化的人格篡改。实验表明,PHISH在多个基准测试和8种主流LLM上均能稳定触发人格变化,并在高风险应用场景中得到人工与LLM-as-Judge双重验证。
链接: https://arxiv.org/abs/2601.16466
作者: Jivnesh Sandhan,Fei Cheng,Tushar Sandhan,Yugo Murawaki
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at EACL26 (Findings)
Abstract:Large Language Models (LLMs) are increasingly deployed in domains such as education, mental health and customer support, where stable and consistent personas are critical for reliability. Yet, existing studies focus on narrative or role-playing tasks and overlook how adversarial conversational history alone can reshape induced personas. Black-box persona manipulation remains unexplored, raising concerns for robustness in realistic interactions. In response, we introduce the task of persona editing, which adversarially steers LLM traits through user-side inputs under a black-box, inference-only setting. To this end, we propose PHISH (Persona Hijacking via Implicit Steering in History), the first framework to expose a new vulnerability in LLM safety that embeds semantically loaded cues into user queries to gradually induce reverse personas. We also define a metric to quantify attack success. Across 3 benchmarks and 8 LLMs, PHISH predictably shifts personas, triggers collateral changes in correlated traits, and exhibits stronger effects in multi-turn settings. In high-risk domains mental health, tutoring, and customer support, PHISH reliably manipulates personas, validated by both human and LLM-as-Judge evaluations. Importantly, PHISH causes only a small reduction in reasoning benchmark performance, leaving overall utility largely intact while still enabling significant persona manipulation. While current guardrails offer partial protection, they remain brittle under sustained attack. Our findings expose new vulnerabilities in personas and highlight the need for context-resilient persona in LLMs. Our codebase and dataset is available at: this https URL
zh
[NLP-38] Graph-Anchored Knowledge Indexing for Retrieval-Augmented Generation
【速读】: 该论文旨在解决现有检索增强生成(Retrieval-Augmented Generation, RAG)系统在处理噪声文档时,难以有效整合与解释分散于其中的关键证据这一核心挑战。其解决方案的关键在于提出GraphAnchor方法——一种图锚定的知识索引机制,将静态知识表示重构为动态演化的知识索引;该方法在迭代检索过程中增量更新图结构,以锚定显著实体和关系,形成结构化索引,从而引导大语言模型(Large Language Models, LLMs)评估知识充分性并生成后续子查询,最终联合所有检索文档与演化后的图结构生成答案。
链接: https://arxiv.org/abs/2601.16462
作者: Zhenghao Liu,Mingyan Wu,Xinze Li,Yukun Yan,Shuo Wang,Cheng Yang,Minghe Yu,Zheni Zeng,Maosong Sun
机构: Northeastern University (东北大学); Tsinghua University (清华大学); Beijing National Research Center for Information Science and Technology (北京信息科学与技术国家研究中心); Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Retrieval-Augmented Generation (RAG) has emerged as a dominant paradigm for mitigating hallucinations in Large Language Models (LLMs) by incorporating external knowledge. Nevertheless, effectively integrating and interpreting key evidence scattered across noisy documents remains a critical challenge for existing RAG systems. In this paper, we propose GraphAnchor, a novel Graph-Anchored Knowledge Indexing approach that reconceptualizes graph structures from static knowledge representations into active, evolving knowledge indices. GraphAnchor incrementally updates a graph during iterative retrieval to anchor salient entities and relations, yielding a structured index that guides the LLM in evaluating knowledge sufficiency and formulating subsequent subqueries. The final answer is generated by jointly leveraging all retrieved documents and the final evolved graph. Experiments on four multi-hop question answering benchmarks demonstrate the effectiveness of GraphAnchor, and reveal that GraphAnchor modulates the LLM’s attention to more effectively associate key information distributed in retrieved documents. All code and data are available at this https URL.
zh
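下面给出上文 [NLP-38] 中"把图作为动态知识索引"这一思路的简化示意:在迭代检索过程中增量累积(头实体, 关系, 尾实体)三元组,并把序列化后的图写入下一轮提示。retrieve、extract_triples、llm 均为假设接口,提示词也是笔者示例,并非 GraphAnchor 官方实现。

```python
# 示意代码:迭代检索中增量维护三元组索引并据此决定是否继续检索(非官方实现)。
from typing import Set, Tuple

Triple = Tuple[str, str, str]

class GraphIndex:
    def __init__(self) -> None:
        self.triples: Set[Triple] = set()

    def update(self, passages, extract_triples) -> None:
        for p in passages:
            self.triples.update(extract_triples(p))   # 锚定新出现的实体与关系

    def to_prompt(self) -> str:
        return "\n".join(f"({h}, {r}, {t})" for h, r, t in sorted(self.triples))

def iterative_rag(question, retrieve, extract_triples, llm, max_rounds: int = 3) -> str:
    graph, docs, query = GraphIndex(), [], question
    for _ in range(max_rounds):
        hits = retrieve(query)
        docs.extend(hits)
        graph.update(hits, extract_triples)
        decision = llm("Graph:\n" + graph.to_prompt() + "\nQuestion: " + question +
                       "\nIf evidence is sufficient reply ENOUGH, else give a sub-query.")
        if decision.strip() == "ENOUGH":       # 图索引提示"证据已充分"
            break
        query = decision                        # 否则用新的子查询继续检索
    return llm("Docs:\n" + "\n".join(docs) + "\nGraph:\n" + graph.to_prompt() +
               "\nQuestion: " + question + "\nAnswer:")
```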
[NLP-39] Mixing Expert Knowledge: Bring Human Thoughts Back To the Game of Go NEURIPS2025
【速读】: 该论文旨在解决通用大语言模型(Large Language Models, LLMs)在特定领域(如围棋)中推理能力显著不足的问题,即LLMs虽在数学和编程等通用任务中表现优异,但在专业领域(如围棋)难以达到初学者水平,更无法实现自然语言下的战略推理。其解决方案的关键在于:首先通过结构化围棋专业知识与通用链式思维(Chain-of-Thought, CoT)推理数据的混合微调进行冷启动,随后采用强化学习将围棋专家知识与通用推理能力深度融合,从而构建出名为LoGos的模型,该模型不仅保留了LLMs的强通用推理能力,还能以自然语言进行围棋对弈,实现接近人类职业选手的战略决策与落子预测性能。
链接: https://arxiv.org/abs/2601.16447
作者: Yichuan Ma,Linyang Li,Yongkang Chen,Peiji Li,Jiasheng Ye,Qipeng Guo,Dahua Lin,Kai Chen
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Fudan University (复旦大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computation and Language (cs.CL)
备注: Accepted to NeurIPS 2025
Abstract:Large language models (LLMs) have demonstrated exceptional performance in reasoning tasks such as mathematics and coding, matching or surpassing human capabilities. However, these impressive reasoning abilities face significant challenges in specialized domains. Taking Go as an example, although AlphaGo has established the high performance ceiling of AI systems in Go, mainstream LLMs still struggle to reach even beginner-level proficiency, let alone perform natural language reasoning. This performance gap between general-purpose LLMs and domain experts is significantly limiting the application of LLMs on a wider range of domain-specific tasks. In this work, we aim to bridge the divide between LLMs’ general reasoning capabilities and expert knowledge in domain-specific tasks. We perform mixed fine-tuning with structured Go expertise and general long Chain-of-Thought (CoT) reasoning data as a cold start, followed by reinforcement learning to integrate expert knowledge in Go with general reasoning capabilities. Through this methodology, we present \textbfLoGos, a powerful LLM that not only maintains outstanding general reasoning abilities, but also conducts Go gameplay in natural language, demonstrating effective strategic reasoning and accurate next-move prediction. LoGos achieves performance comparable to human professional players, substantially surpassing all existing LLMs. Through this work, we aim to contribute insights on applying general LLM reasoning capabilities to specialized domains. We will release the first large-scale Go dataset for LLM training, the first LLM Go evaluation benchmark, and the first general LLM that reaches human professional-level performance in Go at: this https URL.
zh
[NLP-40] Exploring the Effects of Alignment on Numerical Bias in Large Language Models AAAI2026
【速读】: 该论文试图解决大语言模型作为评估者(LLM-as-a-judge)时存在的数值偏差(numerical bias)问题,即评估分数在某些值上出现过度集中,导致评价性能下降。研究表明,这种偏差源于模型对齐(alignment)过程,包括指令微调(instruction tuning)和偏好微调(preference tuning),这些步骤虽然提升了模型的对齐质量,但也降低了输出多样性,从而加剧了数值偏差。解决方案的关键在于对后对齐阶段的模型进行评分范围调整(score range adjustment),该方法相较于温度缩放(temperature scaling)和分布校准(distribution calibration)更有效,能够显著降低偏差并提升评估性能,尽管目前仍属于启发式策略,需进一步研究最优评分区间选择与更鲁棒的缓解机制。
链接: https://arxiv.org/abs/2601.16444
作者: Ayako Sato,Hwichan Kim,Zhousi Chen,Masato Mita,Mamoru Komachi
机构: Tokyo Metropolitan University (东京都立大学); Hitotsubashi University (一桥大学); CyberAgent Inc. (CyberAgent公司)
类目: Computation and Language (cs.CL)
备注: Accepted at AIBSD 2026 (Workshop at AAAI 2026)
Abstract:"LLM-as-a-judge," which utilizes large language models (LLMs) as evaluators, has proven effective in many evaluation tasks. However, evaluator LLMs exhibit numerical bias, a phenomenon where certain evaluation scores are generated disproportionately often, leading to reduced evaluation performance. This study investigates the cause of this bias. Given that most evaluator LLMs are aligned through instruction tuning and preference tuning, and that prior research suggests alignment reduces output diversity, we hypothesize that numerical bias arises from alignment. To test this, we compare outputs from pre- and post-alignment LLMs, and observe that alignment indeed increases numerical bias. We also explore mitigation strategies for post-alignment LLMs, including temperature scaling, distribution calibration, and score range adjustment. Among these, score range adjustment is most effective in reducing bias and improving performance, though still heuristic. Our findings highlight the need for further work on optimal score range selection and more robust mitigation strategies.
zh
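围绕上文 [NLP-40] 讨论的数值偏差与"评分范围调整",下面给出两个小工具的示意:用归一化熵粗略衡量评分分布的集中程度(越低越偏),以及把窄区间评分线性映射回原区间。具体的偏差度量与调整方式以论文为准,此处仅为笔者的一种可能写法。

```python
# 示意代码:量化评分集中程度与评分区间映射(非官方实现)。
import math
from collections import Counter

def normalized_entropy(scores, levels) -> float:
    """1.0 表示各分值完全均匀,越接近 0 表示评分越集中(数值偏差越明显)。"""
    counts, total = Counter(scores), len(scores)
    probs = [counts.get(v, 0) / total for v in levels]
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(levels))

def rescale(score: float, src=(1, 5), dst=(1, 10)) -> float:
    """把在 src 区间内给出的分数线性映射到 dst 区间。"""
    (lo, hi), (a, b) = src, dst
    return a + (score - lo) * (b - a) / (hi - lo)

print(normalized_entropy([7, 7, 7, 8, 7, 7], levels=list(range(1, 11))))
print(rescale(4, src=(1, 5), dst=(1, 10)))
```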
[NLP-41] Endless Terminals: Scaling RL Environments for Terminal Agents
【速读】: 该论文旨在解决自改进智能体(self-improving agents)在训练过程中面临的环境瓶颈问题。当前的终端基准测试(terminal benchmarks)主要用于评估,而非支持训练;强化学习(Reinforcement Learning, RL)需要的是可扩展的训练流水线,而不仅仅是数据集。为此,作者提出了一种名为“无限终端”(Endless Terminals)的全自动流水线,其核心创新在于无需人工标注即可程序化生成终端使用任务。该流水线包含四个关键阶段:多样化任务描述生成、容器化环境构建与验证、完成度测试生成以及可解性过滤。通过此方法,研究者获得了涵盖文件操作、日志管理、数据处理、脚本编写和数据库操作等领域的3255个任务,并采用基础的PPO算法配合二元奖励机制进行训练,未引入检索、多智能体协作或专用工具等复杂组件。实验表明,尽管架构简单,模型性能显著提升,且在人类标注基准(如TerminalBench 2.0)上也展现出优越迁移能力,证明了环境规模扩展对简单强化学习策略的有效性。
链接: https://arxiv.org/abs/2601.16443
作者: Kanishk Gandhi,Shivam Garg,Noah D. Goodman,Dimitris Papailiopoulos
机构: Stanford University (斯坦福大学); Microsoft Research (微软研究院); UW-Madison (威斯康星大学麦迪逊分校)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Environments are the bottleneck for self-improving agents. Current terminal benchmarks were built for evaluation, not training; reinforcement learning requires a scalable pipeline, not just a dataset. We introduce Endless Terminals, a fully autonomous pipeline that procedurally generates terminal-use tasks without human annotation. The pipeline has four stages: generating diverse task descriptions, building and validating containerized environments, producing completion tests, and filtering for solvability. From this pipeline we obtain 3255 tasks spanning file operations, log management, data processing, scripting, and database operations. We train agents using vanilla PPO with binary episode level rewards and a minimal interaction loop: no retrieval, multi-agent coordination, or specialized tools. Despite this simplicity, models trained on Endless Terminals show substantial gains: on our held-out dev set, Llama-3.2-3B improves from 4.0% to 18.2%, Qwen2.5-7B from 10.7% to 53.3%, and Qwen3-8B-openthinker-sft from 42.6% to 59.0%. These improvements transfer to human-curated benchmarks: models trained on Endless Terminals show substantial gains on held out human curated benchmarks: on TerminalBench 2.0, Llama-3.2-3B improves from 0.0% to 2.2%, Qwen2.5-7B from 2.2% to 3.4%, and Qwen3-8B-openthinker-sft from 1.1% to 6.7%, in each case outperforming alternative approaches including models with more complex agentic scaffolds. These results demonstrate that simple RL succeeds when environments scale.
zh
[NLP-42] Learning Domain Knowledge in Multimodal Large Language Models through Reinforcement Fine-Tuning
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在遥感和医学影像等专业领域中表现受限的问题。现有方法通常通过文本指令、提示或辅助描述注入领域知识,但实验表明此类输入层面的知识注入对科学多模态任务效果甚微,说明当前MLLMs难以仅通过语言形式内化领域先验信息。解决方案的关键在于将领域知识从输入层提升至优化层:作者提出一种强化学习微调框架,将领域知识编码为约束条件和奖励信号,直接嵌入模型的学习目标函数中,从而引导模型在输出空间中生成符合领域规范的行为。该方法在多个遥感与医疗数据集上均取得显著性能提升,达到当前最优水平,验证了优化层面领域知识整合的必要性。
链接: https://arxiv.org/abs/2601.16419
作者: Qinglong Cao,Yuntian Chen,Chao Ma,Xiaokang Yang
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal large language models (MLLMs) have shown remarkable capabilities in multimodal perception and understanding tasks. However, their effectiveness in specialized domains, such as remote sensing and medical imaging, remains limited. A natural approach to domain adaptation is to inject domain knowledge through textual instructions, prompts, or auxiliary captions. Surprisingly, we find that such input-level domain knowledge injection yields little to no improvement on scientific multimodal tasks, even when the domain knowledge is explicitly provided. This observation suggests that current MLLMs fail to internalize domain-specific priors through language alone, and that domain knowledge must be integrated at the optimization level. Motivated by this insight, we propose a reinforcement fine-tuning framework that incorporates domain knowledge directly into the learning objective. Instead of treating domain knowledge as descriptive information, we encode it as domain-informed constraints and reward signals, shaping the model’s behavior in the output space. Extensive experiments across multiple datasets in remote sensing and medical domains consistently demonstrate good performance gains, achieving state-of-the-art results on multimodal domain tasks. Our results highlight the necessity of optimization-level domain knowledge integration and reveal a fundamental limitation of textual domain conditioning in current MLLMs.
zh
[NLP-43] Jacobian Scopes: token-level causal attributions in LLMs ACL2026
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中难以量化具体输入token对最终预测贡献程度的问题,尤其是在多层架构和注意力头数量庞大的情况下。其核心挑战在于理解哪些先前的token最强烈地影响了模型的下一个词预测。解决方案的关键在于提出了一套基于梯度的token级因果归因方法——Jacobian Scopes,通过分析最终隐藏状态与输入之间的线性化关系,定量刻画输入token对模型预测的影响。该方法包含三种变体:语义范围(Semantic Scope)关注特定logits的敏感性、Fisher范围(Fisher Scope)聚焦于整个预测分布,以及温度范围(Temperature Scope)衡量模型置信度(逆温度)。这些方法在指令理解、翻译和上下文学习(In-Context Learning, ICL)等任务中验证了有效性,并揭示了隐含的政治偏见等现象。
链接: https://arxiv.org/abs/2601.16407
作者: Toni J.B. Liu,Baran Zadeoğlu,Nicolas Boullé,Raphaël Sarfati,Christopher J. Earls
机构: Cornell University, USA; Imperial College London, UK; Goodfire AI, USA
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 15 figures, under review at ACL 2026
Abstract:Large language models (LLMs) make next-token predictions based on clues present in their context, such as semantic descriptions and in-context examples. Yet, elucidating which prior tokens most strongly influence a given prediction remains challenging due to the proliferation of layers and attention heads in modern architectures. We propose Jacobian Scopes, a suite of gradient-based, token-level causal attribution methods for interpreting LLM predictions. By analyzing the linearized relations of final hidden state with respect to inputs, Jacobian Scopes quantify how input tokens influence a model’s prediction. We introduce three variants - Semantic, Fisher, and Temperature Scopes - which respectively target sensitivity of specific logits, the full predictive distribution, and model confidence (inverse temperature). Through case studies spanning instruction understanding, translation and in-context learning (ICL), we uncover interesting findings, such as when Jacobian Scopes point to implicit political biases. We believe that our proposed methods also shed light on recently debated mechanisms underlying in-context time-series forecasting. Our code and interactive demonstrations are publicly available at this https URL.
zh
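上文 [NLP-43] 的 Jacobian Scopes 属于基于梯度的 token 级归因。下面给出一个通用的"对目标 logit 反传、取各输入嵌入梯度范数"的示意脚本,思路与之相近但并非论文官方实现,也未区分 Semantic / Fisher / Temperature 三种变体;模型选用 gpt2 仅为演示(需安装 torch 与 transformers)。

```python
# 示意代码:梯度范数式的 token 归因(非 Jacobian Scopes 官方实现)。
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                                  # 仅作演示的小模型
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(ids).detach().requires_grad_(True)

logits = model(inputs_embeds=embeds).logits          # [1, seq_len, vocab]
target_id = logits[0, -1].argmax()                   # 关注最可能的下一个词
logits[0, -1, target_id].backward()                  # 对该 logit 求输入梯度

scores = embeds.grad.norm(dim=-1).squeeze(0)         # 每个输入 token 的梯度范数
for token, s in zip(tok.convert_ids_to_tokens(ids[0]), scores.tolist()):
    print(f"{token!r}: {s:.4f}")
```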
[NLP-44] Clarify or Answer: Reinforcement Learning for Agent ic VQA with Context Under-specification
【速读】: 该论文旨在解决现实世界视觉问答(Visual Question Answering, VQA)中因图像-问题对信息不足而导致的歧义性问题,即当答案依赖于图像中不可见的外部上下文时,直接回答可能产生自信但错误的预测。解决方案的关键在于提出一种“澄清或回答”(Clarify-or-Answer, CoA)代理模型,该模型独立建模“是否需要澄清”的决策过程以及“若需澄清应提出何种问题”的生成机制;CoA首先判断是否需要澄清,若需要则生成一个聚焦且非平凡的澄清问题,并结合回答结果生成最终答案。此外,作者引入CONTEXTCLARIFY数据集和GRPO-CR(Clarification Reasoning)强化学习方法,通过多奖励信号优化澄清问题的质量,从而显著提升端到端VQA准确率(平均提升15.3点,相对基线提高83%)。
链接: https://arxiv.org/abs/2601.16400
作者: Zongwan Cao,Bingbing Wen,Lucy Lu Wang
机构: University of Washington (华盛顿大学); Allen Institute for AI (艾伦人工智能研究所)
类目: Computation and Language (cs.CL)
备注:
Abstract:Real-world visual question answering (VQA) is often context-dependent: an image-question pair may be under-specified, such that the correct answer depends on external information that is not observable in the image. In such cases, directly answering can lead to confident but incorrect predictions. We propose CoA (Clarify-or-Answer), an ask-or-answer agent that separately models the decision to ask or answer, and what to ask if needed. CoA first determines whether clarification is necessary; if so, it asks a single focused question and then incorporates the response to produce the final answer. We introduce CONTEXTCLARIFY with a set of ambiguous VQA questions and the contrast set that is non-ambiguous. We further introduce GRPO-CR (Clarification Reasoning), a reinforcement learning approach that optimizes clarification question generation with multiple reward signals encouraging well-formed, focused, non-trivial questions that resolve ambiguity. Across three VLLMs and three datasets, CoA achieves consistent improvements at both the module and system levels, improving end-to-end VQA accuracy by an average of +15.3 points (83%) over prompting-based baselines.
zh
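上文 [NLP-44] 的 CoA 把"是否需要澄清"与"问什么"分开建模。下面是这一两阶段决策的最小骨架示意;vlm、ask_user 为假设接口,提示词为笔者示例,并非论文实现(论文中澄清问题生成还经过 GRPO-CR 强化学习优化)。

```python
# 示意代码:Clarify-or-Answer 两阶段决策骨架(非官方实现)。
def clarify_or_answer(image, question, vlm, ask_user) -> str:
    """vlm(prompt, image)->str 为假设的视觉语言模型接口。"""
    need = vlm(f"Question: {question}\n"
               "Is external context required to answer? Reply YES or NO.", image)
    if need.strip().upper().startswith("YES"):
        clar_q = vlm(f"Question: {question}\n"
                     "Ask ONE focused clarification question.", image)
        extra = ask_user(clar_q)                     # 获取用户补充的上下文
        return vlm(f"Question: {question}\nContext: {extra}\nAnswer briefly.", image)
    return vlm(f"Question: {question}\nAnswer briefly.", image)
```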
[NLP-45] White-Box Sensitivity Auditing with Steering Vectors
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)算法审计中依赖黑盒评估方法所带来的局限性,即仅通过输入输出测试难以捕捉抽象的社会相关属性(如性别偏见),且测试样本常受限于启发式生成。其解决方案的关键在于提出一种白盒敏感性审计框架,利用激活操控(activation steering)技术对模型内部表征进行干预,从而系统性地探测模型预测对关键概念(如受保护属性)的敏感度。该方法在四个模拟高风险决策任务中的应用表明,即使黑盒评估未发现显著偏见,白盒审计仍能揭示模型预测对受保护属性的高度依赖性。
链接: https://arxiv.org/abs/2601.16398
作者: Hannah Cyberey,Yangfeng Ji,David Evans
机构: University of Virginia (弗吉尼亚大学)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Algorithmic audits are essential tools for examining systems for properties required by regulators or desired by operators. Current audits of large language models (LLMs) primarily rely on black-box evaluations that assess model behavior only through input-output testing. These methods are limited to tests constructed in the input space, often generated by heuristics. In addition, many socially relevant model properties (e.g., gender bias) are abstract and difficult to measure through text-based inputs alone. To address these limitations, we propose a white-box sensitivity auditing framework for LLMs that leverages activation steering to conduct more rigorous assessments through model internals. Our auditing method conducts internal sensitivity tests by manipulating key concepts relevant to the model’s intended function for the task. We demonstrate its application to bias audits in four simulated high-stakes LLM decision tasks. Our method consistently reveals substantial dependence on protected attributes in model predictions, even in settings where standard black-box evaluations suggest little or no bias. Our code is openly available at this https URL
zh
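上文 [NLP-45] 借助激活操控(activation steering)做白盒敏感性测试。下面用一个玩具模型演示其底层机制:通过 forward hook 在某层激活上叠加一个概念方向并扫描强度,观察输出变化。概念方向此处用随机向量代替,真实审计需使用与任务相关的概念向量,且这并非论文官方实现。

```python
# 示意代码:forward hook 式激活干预与强度扫描(玩具模型,非官方实现)。
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
direction = torch.randn(16)                          # 假设:已提取的概念方向
direction /= direction.norm()

def make_hook(direction: torch.Tensor, alpha: float):
    def hook(module, inputs, output):
        return output + alpha * direction            # 在该层激活上做干预
    return hook

x = torch.randn(4, 8)
baseline = model(x)
for alpha in (0.0, 1.0, 4.0):                        # 扫描干预强度
    handle = model[1].register_forward_hook(make_hook(direction, alpha))
    steered = model(x)
    handle.remove()
    print(alpha, (steered - baseline).abs().mean().item())   # 输出对该概念的敏感度
```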
[NLP-46] Cite-While-You-Generate: Training-Free Evidence Attribution for Multimodal Clinical Summarization
【速读】: 该论文旨在解决临床摘要生成中可信性不足的问题,即如何在保证生成文本流畅性的同时,提供每条陈述的来源透明性(source attribution),以增强临床决策的可解释性和可靠性。传统方法依赖于事后标注或重新训练模型来实现源引用,存在效率低、部署难等局限。其解决方案的关键在于提出一种无需训练的生成时源引用框架,直接利用解码器注意力机制识别支持性文本片段或图像区域,并通过两种多模态引用策略实现精准定位:一是原始图像模式,直接使用图像块注意力;二是基于描述的引用模式,将图像替换为生成的描述文本以实现纯文本对齐。实验表明,该方法在对话和放射学报告两个典型领域均显著优于嵌入式基线与自引用方法,在保持轻量化的同时提升了文本与多模态层面的引用准确率(如F1分数提升15%)。
链接: https://arxiv.org/abs/2601.16397
作者: Qianqi Yan,Huy Nguyen,Sumana Srivatsa,Hari Bandi,Xin Eric Wang,Krishnaram Kenthapadi
机构: University of California, Santa Barbara (加州大学圣塔芭芭拉分校); Oracle Health AI (Oracle健康AI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Trustworthy clinical summarization requires not only fluent generation but also transparency about where each statement comes from. We propose a training-free framework for generation-time source attribution that leverages decoder attentions to directly cite supporting text spans or images, overcoming the limitations of post-hoc or retraining-based methods. We introduce two strategies for multimodal attribution: a raw image mode, which directly uses image patch attentions, and a caption-as-span mode, which substitutes images with generated captions to enable purely text-based alignment. Evaluations on two representative domains: clinician-patient dialogues (CliConSummation) and radiology reports (MIMIC-CXR), show that our approach consistently outperforms embedding-based and self-attribution baselines, improving both text-level and multimodal attribution accuracy (e.g., +15% F1 over embedding baselines). Caption-based attribution achieves competitive performance with raw-image attention while being more lightweight and practical. These findings highlight attention-guided attribution as a promising step toward interpretable and deployable clinical summarization systems.
zh
[NLP-47] Cross-Lingual Activation Steering for Multilingual Language Models
【速读】: 该论文旨在解决大语言模型在多语言能力上存在的性能差距问题,即主流语言与非主流语言之间的表现差异。现有研究认为这种差距源于多语言表征中共享神经元与语言特异性神经元的不平衡。论文提出的解决方案是Cross-Lingual Activation Steering(CLAS),其关键在于无需修改模型权重,在推理阶段通过选择性调节神经元激活来实现跨语言能力增强;实验表明,CLAS能显著提升低资源语言的任务表现(分类准确率平均提升2.3%,F1分数提升3.4%),且不损害高资源语言性能,其有效性源于功能上的语义分离而非严格的对齐,即语言簇间距离增加促进了有效迁移。
链接: https://arxiv.org/abs/2601.16390
作者: Rhitabrat Pokharel,Ameeta Agrawal,Tanay Nagar
机构: Portland State University (波特兰州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under review
Abstract:Large language models exhibit strong multilingual capabilities, yet significant performance gaps persist between dominant and non-dominant languages. Prior work attributes this gap to imbalances between shared and language-specific neurons in multilingual representations. We propose Cross-Lingual Activation Steering (CLAS), a training-free inference-time intervention that selectively modulates neuron activations. We evaluate CLAS on classification and generation benchmarks, achieving average improvements of 2.3% (Acc.) and 3.4% (F1) respectively, while maintaining high-resource language performance. We discover that effective transfer operates through functional divergence rather than strict alignment; performance gains correlate with increased language cluster separation. Our results demonstrate that targeted activation steering can unlock latent multilingual capacity in existing models without modification to model weights.
zh
[NLP-48] PolyAgent : Large Language Model Agent for Polymer Design
【速读】: 该论文旨在解决实验室研究人员在早期聚合物发现阶段面临的计算资源受限与模型可访问性不足的问题,即难以利用机器学习模型进行聚合物结构-性能预测和生成。其解决方案的关键在于构建一个集成于终端的闭环聚合物结构-性能预测框架,该框架基于大语言模型(LLM)推理能力,实现属性预测、属性引导的聚合物结构生成及结构优化功能,并通过SMILES序列结合合成可及性评分(Synthetic Accessibility Score)和合成复杂度评分(SC Score)来确保生成的聚合物结构在实验上具有可行性,从而为实验室研究者提供切实可行的计算洞察。
链接: https://arxiv.org/abs/2601.16376
作者: Vani Nigam,Achuth Chandrasekhar,Amir Barati Farimani
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:On-demand Polymer discovery is essential for various industries, ranging from biomedical to reinforcement materials. Experiments with polymers have a long trial-and-error process, leading to long procedures and extensive resources. For these processes, machine learning has accelerated scientific discovery at the property prediction and latent space search fronts. However, laboratory researchers cannot readily access codes and these models to extract individual structures and properties due to infrastructure limitations. We present a closed-loop polymer structure-property predictor integrated in a terminal for early-stage polymer discovery. The framework is powered by LLM reasoning to provide users with property prediction, property-guided polymer structure generation, and structure modification capabilities. The SMILES sequences are guided by the synthetic accessibility score and the synthetic complexity score (SC Score) to ensure that polymer generation is as close as possible to synthetically accessible monomer-level structures. This framework addresses the challenge of generating novel polymer structures for laboratory researchers, thereby providing computational insights into polymer research.
zh
[NLP-49] Identity Cooperation and Framing Effects within Groups of Real and Simulated Humans
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在模拟人类行为时,尤其是社会困境博弈中身份与情境因素影响的建模不足问题。现有研究多依赖“弱绑定”(weak binding)策略来引导聊天模型扮演特定角色,但难以忠实再现基于身份的行为模式。本文的关键解决方案在于采用“深度绑定”(deep binding)方法,通过为基础模型注入详尽的叙事性身份背景(narrative identities),并利用指令微调模型验证行为一致性,从而显著提升LLMs对人类行为的仿真 fidelity。此外,该方法还能有效建模时间(如实验年份)、问题表述方式及被试群体等情境变量,弥补传统人类实验描述中常被忽略的细节,增强研究结果的可复现性。
链接: https://arxiv.org/abs/2601.16355
作者: Suhong Moon,Minwoo Kang,Joseph Suh,Mustafa Safdari,John Canny
机构: University of California, Berkeley (加州大学伯克利分校); Google DeepMind (谷歌深度思维)
类目: Computation and Language (cs.CL)
备注:
Abstract:Humans act via a nuanced process that depends both on rational deliberation and also on identity and contextual factors. In this work, we study how large language models (LLMs) can simulate human action in the context of social dilemma games. While prior work has focused on “steering” (weak binding) of chat models to simulate personas, we analyze here how deep binding of base models with extended backstories leads to more faithful replication of identity-based behaviors. Our study has these findings: simulation fidelity vs human studies is improved by conditioning base LMs with rich context of narrative identities and checking consistency using instruction-tuned models. We show that LLMs can also model contextual factors such as time (year that a study was performed), question framing, and participant pool effects. LLMs, therefore, allow us to explore the details that affect human studies but which are often omitted from experiment descriptions, and which hamper accurate replication.
zh
[NLP-50] Regional Bias in Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中存在的区域偏见(regional bias)问题,即模型在跨文化应用场景中对特定地区表现出非中立倾向,从而影响输出的公平性、可靠性和包容性。其解决方案的关键在于提出一种名为FAZE的基于提示(prompt-based)的评估框架,该框架通过设计100个情境中立但要求强制选择区域的测试用例,量化模型对不同地区的偏好程度,并以10分制评分(分数越高表示偏向越明显),从而系统性地识别和比较不同LLMs的区域偏见水平。
链接: https://arxiv.org/abs/2601.16349
作者: M P V S Gopinadh,Kappara Lakshmi Sindhu,Soma Sekhar Pandu Ranga Raju P,Yesaswini Swarna
机构: Vishnu Institute of Technology (维什努技术学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 1 figure. Presented at the Second International Conference on Advanced Computing, Machine Learning, Robotics and Internet Technologies (AMRIT 2024)
Abstract:This study investigates regional bias in large language models (LLMs), an emerging concern in AI fairness and global representation. We evaluate ten prominent LLMs: GPT-3.5, GPT-4o, Gemini 1.5 Flash, Gemini 1.0 Pro, Claude 3 Opus, Claude 3.5 Sonnet, Llama 3, Gemma 7B, Mistral 7B, and Vicuna-13B using a dataset of 100 carefully designed prompts that probe forced-choice decisions between regions under contextually neutral scenarios. We introduce FAZE, a prompt-based evaluation framework that measures regional bias on a 10-point scale, where higher scores indicate a stronger tendency to favor specific regions. Experimental results reveal substantial variation in bias levels across models, with GPT-3.5 exhibiting the highest bias score (9.5) and Claude 3.5 Sonnet scoring the lowest (2.5). These findings indicate that regional bias can meaningfully undermine the reliability, fairness, and inclusivity of LLM outputs in real-world, cross-cultural applications. This work contributes to AI fairness research by highlighting the importance of inclusive evaluation frameworks and systematic approaches for identifying and mitigating geographic biases in language models.
zh
[NLP-51] Where is the multimodal goal post? On the Ability of Foundation Models to Recognize Contextually Important Moments
【速读】: 该论文旨在解决多模态基础模型在识别视频中关键子事件(sub-events)方面的性能不足问题,这是实现高质量语言生成与事件摘要的前提。其核心挑战在于模型难以有效融合来自不同模态(如视觉、音频、文本等)的信息以准确判断事件的重要性。解决方案的关键在于构建一个基于足球比赛精彩片段中隐含人类偏好标注的新数据集,无需额外人工标注成本;并通过该数据集评估多个前沿多模态模型,发现它们的性能接近随机水平,且存在过度依赖单一模态、缺乏跨模态协同的问题。研究进一步指出,未来改进方向应聚焦于模块化架构设计以应对样本级异质性,并引入互补训练策略以增强多模态信息融合能力。
链接: https://arxiv.org/abs/2601.16333
作者: Aditya K Surikuchi,Raquel Fernández,Sandro Pezzelle
机构: University of Amsterdam (阿姆斯特丹大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Foundation models are used for many real-world applications involving language generation from temporally-ordered multimodal events. In this work, we study the ability of models to identify the most important sub-events in a video, which is a fundamental prerequisite for narrating or summarizing multimodal events. Specifically, we focus on football games and evaluate models on their ability to distinguish between important and non-important sub-events in a game. To this end, we construct a new dataset by leveraging human preferences for importance implicit in football game highlight reels, without any additional annotation costs. Using our dataset, which we will publicly release to the community, we compare several state-of-the-art multimodal models and show that they are not far from chance level performance. Analyses of models beyond standard evaluation metrics reveal their tendency to rely on a single dominant modality and their ineffectiveness in synthesizing necessary information from multiple sources. Our findings underline the importance of modular architectures that can handle sample-level heterogeneity in multimodal data and the need for complementary training procedures that can maximize cross-modal synergy.
zh
[NLP-52] Machine-Assisted Grading of Nationwide School-Leaving Essay Exams with LLM s and Statistical NLP
【速读】: 该论文旨在解决大规模开放式考试作答自动化评分的可行性问题,特别是在时间紧迫、需高一致性评估的国家级毕业考试场景中,传统人工评分面临效率瓶颈。解决方案的关键在于构建一个以课程标准为依据的评分量规(rubric)驱动、结合大型语言模型(Large Language Models, LLMs)与统计自然语言处理(Statistical Natural Language Processing, NLP)技术的“人在回路”(human-in-the-loop)评分流程,确保自动化评分结果在人类评分范围内,并具备可解释性与公平性。该方法不仅实现了国家层面的高效评估,还能生成细粒度子分报告,支持个性化教学反馈,适用于小语种环境下的教育数字化转型。
链接: https://arxiv.org/abs/2601.16314
作者: Andres Karjus,Kais Allkivi,Silvia Maine,Katarin Leppik,Krister Kruusmaa,Merilin Aruvee
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) enable rapid and consistent automated evaluation of open-ended exam responses, including dimensions of content and argumentation that have traditionally required human judgment. This is particularly important in cases where a large amount of exams need to be graded in a limited time frame, such as nation-wide graduation exams in various countries. Here, we examine the applicability of automated scoring on two large datasets of trial exam essays of two full national cohorts from Estonia. We operationalize the official curriculum-based rubric and compare LLM and statistical natural language processing (NLP) based assessments with human panel scores. The results show that automated scoring can achieve performance comparable to that of human raters and tends to fall within the human scoring range. We also evaluate bias, prompt injection risks, and LLMs as essay writers. These findings demonstrate that a principled, rubric-driven, human-in-the-loop scoring pipeline is viable for high-stakes writing assessment, particularly relevant for digitally advanced societies like Estonia, which is about to adapt a fully electronic examination system. Furthermore, the system produces fine-grained subscore profiles that can be used to generate systematic, personalized feedback for instruction and exam preparation. The study provides evidence that LLM-assisted assessment can be implemented at a national scale, even in a small-language context, while maintaining human oversight and compliance with emerging educational and regulatory standards.
zh
[NLP-53] Teaching and Evaluating LLMs to Reason About Polymer Design Related Tasks
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在聚合物设计(polymer design)任务中表现不佳的问题,其核心挑战在于:现有模型普遍缺乏聚合物领域的专业知识,且已对齐的模型在知识覆盖和能力适配方面存在不足。解决方案的关键在于提出一个大规模、多源融合的基准数据集 PolyBench,包含超过 12.5 万项聚合物设计相关任务,并基于包含 1300 多万条数据点的知识库(涵盖实验与合成来源)确保广泛覆盖聚合物及其性质;同时引入一种知识增强的推理蒸馏方法(knowledge-augmented reasoning distillation),通过结构化思维链(Chain-of-Thought, CoT)增强训练数据质量,并按从简单到复杂的分析推理层级组织任务,从而实现小参数量语言模型(Small Language Models, SLMs,7B–14B 参数)在 PolyBench 上超越同类规模模型甚至闭源前沿 LLM 的性能表现,同时在其他聚合物基准上也取得提升。
链接: https://arxiv.org/abs/2601.16312
作者: Dikshya Mohanty,Mohammad Saqib Hasan,Syed Mostofa Monsur,Size Zheng,Benjamin Hsiao,Niranjan Balasubramanian
机构: Stony Brook University (石溪大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Research in AI4Science has shown promise in many science applications, including polymer design. However, current LLMs prove ineffective on this problem space because: (i) most models lack polymer-specific knowledge (ii) existing aligned models lack coverage of knowledge and capabilities relevant to polymer design. Addressing this, we introduce PolyBench, a large scale training and test benchmark dataset of more than 125K polymer design related tasks, leveraging a knowledge base of 13M+ data points obtained from experimental and synthetic sources to ensure broad coverage of polymers and their properties. For effective alignment using PolyBench, we introduce a knowledge-augmented reasoning distillation method that augments this dataset with structured CoT. Furthermore, tasks in PolyBench are organized from simple to complex analytical reasoning problems, enabling generalization tests and diagnostic probes across the problem space. Experiments show that small language models (SLMs), of 7B to 14B parameters, trained on PolyBench data outperform similar sized models, and even closed source frontier LLMs on PolyBench test dataset while demonstrating gains on other polymer benchmarks as well.
zh
[NLP-54] A Longitudinal Multinational and Multilingual Corpus of News Coverage of the Russo-Ukrainian War
【速读】: 该论文旨在解决跨国家、跨语言语境下战争叙事差异的系统性分析问题,尤其关注冲突期间媒体框架与信息战策略的演变机制。其解决方案的关键在于构建了DNIPRO这一纵向多语种新闻语料库,包含24.6万篇来自俄罗斯、乌克兰、美国、英国和中国五国媒体的报道,覆盖英语、俄语和中文三种语言,并配有结构化元数据与多种人工标注的高质量注释,从而支持对立场识别、情感分析、话题框架及矛盾性事件的深入计算分析。该资源的独特价值在于整合了地缘政治竞争视角,使研究者能够量化揭示不同国家媒体如何建构对立现实,为全球信息生态系统中冲突叙事的生成与演化提供可扩展的基准工具。
链接: https://arxiv.org/abs/2601.16309
作者: Dikshya Mohanty,Taisiia Sabadyn,Jelwin Rodrigues,Chenlu Wang,Abhishek Kalugade,Ritwik Banerjee
机构: Stony Brook University (石溪大学)
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:
Abstract:We introduce DNIPRO, a novel longitudinal corpus of 246K news articles documenting the Russo-Ukrainian war from Feb 2022 to Aug 2024, spanning eleven media outlets across five nation states (Russia, Ukraine, U.S., U.K., and China) and three languages (English, Russian, and Mandarin Chinese). This multilingual resource features consistent and comprehensive metadata, and multiple types of annotation with rigorous human evaluations for downstream tasks relevant to systematic transnational analyses of contentious wartime discourse. DNIPRO’s distinctive value lies in its inclusion of competing geopolitical perspectives, making it uniquely suited for studying narrative divergence, media framing, and information warfare. To demonstrate its utility, we include use case experiments using stance detection, sentiment analysis, topical framing, and contradiction analysis of major conflict events within the larger war. Our explorations reveal how outlets construct competing realities, with coverage exhibiting polarized interpretations that reflect geopolitical interests. Beyond supporting computational journalism research, DNIPRO provides a foundational resource for understanding how conflicting narratives emerge and evolve across global information ecosystems.
zh
[NLP-55] Generating Literature-Driven Scientific Theories at Scale
【速读】: 该论文试图解决自动化科学发现中理论构建(theory building)这一高阶科学活动长期被忽视的问题,即如何从大规模科学文献中自动合成包含定性与定量规律的理论。其解决方案的关键在于提出一种基于文献支撑(literature-grounding)的理论生成方法,通过利用13.7k篇源论文构建2.9k个理论,并对比参数化大语言模型(LLM)记忆与文献驱动生成方式在准确性与新颖性目标下的表现差异。实验表明,相较于依赖参数知识的生成方式,文献支撑的方法显著提升了理论对已有证据的匹配度和对未来4.6k篇论文结果的预测能力。
链接: https://arxiv.org/abs/2601.16282
作者: Peter Jansen,Peter Clark,Doug Downey,Daniel S. Weld
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages plus appendix, 3 figures
Abstract:Contemporary automated scientific discovery has focused on agents for generating scientific experiments, while systems that perform higher-level scientific activities such as theory building remain underexplored. In this work, we formulate the problem of synthesizing theories consisting of qualitative and quantitative laws from large corpora of scientific literature. We study theory generation at scale, using 13.7k source papers to synthesize 2.9k theories, examining how generation using literature-grounding versus parametric knowledge, and accuracy-focused versus novelty-focused generation objectives change theory properties. Our experiments show that, compared to using parametric LLM memory for generation, our literature-supported method creates theories that are significantly better at both matching existing evidence and at predicting future results from 4.6k subsequently-written papers
zh
[NLP-56] Better as Generators Than Classifiers: Leveraging LLMs and Synthetic Data for Low-Resource Multilingual Classification EACL2026
【速读】: 该论文旨在解决低资源语言场景下小模型训练缺乏高质量标注数据的问题,其核心挑战是如何利用大语言模型(Large Language Models, LLMs)的多语言能力来提升小模型在多种语言和任务上的性能。解决方案的关键在于将LLM作为“教师”角色,通过生成合成数据(synthetic data)对小模型进行蒸馏式训练或指令微调,从而使得小模型在少量合成数据支持下即可超越原始大模型本身,尤其在低资源语言中表现更为显著。实验表明,LLMs更适合作为数据生成器而非直接分类器使用,能够有效赋能轻量级、高效的多语言模型。
链接: https://arxiv.org/abs/2601.16278
作者: Branislav Pecher,Jan Cegin,Robert Belanec,Ivan Srba,Jakub Simko,Maria Bielikova
机构: Kempelen Institute of Intelligent Technologies (Kempelen智能技术研究所); Brno University of Technology (布诺理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to the Findings of EACL 2026
Abstract:Large Language Models (LLMs) have demonstrated remarkable multilingual capabilities, making them promising tools in both high- and low-resource languages. One particularly valuable use case is generating synthetic samples that can be used to train smaller models in low-resource scenarios where human-labelled data is scarce. In this work, we investigate whether these synthetic data generation capabilities can serve as a form of distillation, producing smaller models that perform on par with or even better than massive LLMs across languages and tasks. To this end, we use a state-of-the-art multilingual LLM to generate synthetic datasets covering 11 languages and 4 classification tasks. These datasets are then used to train smaller models via fine-tuning or instruction tuning, or as synthetic in-context examples for compact LLMs. Our experiments show that even small amounts of synthetic data enable smaller models to outperform the large generator itself, particularly in low-resource languages. Overall, the results suggest that LLMs are best utilised as generators (teachers) rather than classifiers, producing data that empowers smaller and more efficient multilingual models.
zh
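上文 [NLP-56] 的核心用法是让大模型充当"教师"批量生成带标签的合成样本,再用于训练小模型或作为上下文示例。下面是生成环节的一个通用示意;teacher 接口、提示词与 JSONL 字段均为笔者假设,并非论文流程的精确复现。

```python
# 示意代码:用教师 LLM 生成多语言合成训练数据并写成 JSONL(非官方实现)。
import json

def generate_synthetic(teacher, task: str, labels, languages,
                       n_per_cell: int, path: str) -> None:
    """teacher(prompt)->str 为假设的多语言 LLM 接口。"""
    with open(path, "w", encoding="utf-8") as f:
        for lang in languages:
            for label in labels:
                for _ in range(n_per_cell):
                    prompt = (f"Write one short {lang} text for the task '{task}' "
                              f"whose correct label is '{label}'. Return only the text.")
                    record = {"text": teacher(prompt).strip(),
                              "label": label, "lang": lang}
                    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```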
[NLP-57] GameTalk: Training LLM s for Strategic Conversation
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在多智能体(multi-agent)环境中进行长期战略决策的问题,尤其是在需要通过多轮对话实现协调与谈判的场景下。现有研究多集中于单轮决策任务或静态动作预测,缺乏对基于完整对话过程优化全局目标的探索。解决方案的关键在于提出GameTalk框架,通过将奖励信号设计为依赖于整个交互过程的方式,对LLMs进行微调(fine-tuning),并采用GRPO、DPO和STaR等方法实现对长程策略的优化。实验表明,该方法显著优于未训练模型,尤其在奖励塑造(reward shaping)条件下,其中DPO(Direct Preference Optimization)表现最佳,验证了对话式微调在复杂交互环境中的有效性。
链接: https://arxiv.org/abs/2601.16276
作者: Victor Conchello Vendrell,Max Ruiz Luyten,Mihaela van der Schaar
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 32 pages, 8 figures
Abstract:Strategic decision-making in multi-agent settings is a key challenge for large language models (LLMs), particularly when coordination and negotiation must unfold over extended conversations. While recent work has explored the use of LLMs in isolated decision tasks, little attention has been given to optimizing long-term objectives through dialogue. We introduce \textbfGameTalk, a framework for training LLMs to make strategic decisions via multi-turn interactions. Unlike prior work that focuses on single-turn objectives or static action prediction, we train LLMs to optimize a global objective across full conversations. We achieve this by adapting fine-tuning methods like GRPO, DPO, and STaR to incorporate reward signals that depend on the entire interaction. We evaluate this approach on a suite of increasingly complex games, designed to stress different aspects of reasoning, coordination, and opponent modeling. Our results show that GameTalk significantly outperforms untrained models, especially under reward shaping, with DPO consistently yielding the strongest gains. These findings position conversational fine-tuning as a promising path for LLMs to reason, negotiate, and act in interactive environments.
zh
[NLP-58] SoundBreak: A Systematic Study of Audio-Only Adversarial Attacks on Trimodal Models
【速读】: 该论文旨在解决多模态基础模型(multimodal foundation models)在面对单模态(音频)对抗攻击时的鲁棒性不足问题,尤其是针对音频输入扰动如何引发跨模态失效这一此前被忽视的攻击面。其关键解决方案在于系统性地分析六种互补的攻击目标,覆盖从音频编码器表征到跨模态注意力、隐藏状态及输出概率等多个处理阶段,并通过实验证明:即使在低感知失真条件下(LPIPS = 0.08, SI-SNR = 0),仅对音频进行扰动即可导致高达96%的攻击成功率,且攻击效果更依赖于优化过程的扩展而非数据规模增加。这揭示了当前多模态系统对单一模态干扰的脆弱性,并强调需构建基于跨模态一致性的防御机制以提升整体鲁棒性。
链接: https://arxiv.org/abs/2601.16231
作者: Aafiya Hussain,Gaurav Srivastava,Alvi Ishmam,Zaber Hakim,Chris Thomas
机构: Virginia Tech (弗吉尼亚理工学院)
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注:
Abstract:Multimodal foundation models that integrate audio, vision, and language achieve strong performance on reasoning and generation tasks, yet their robustness to adversarial manipulation remains poorly understood. We study a realistic and underexplored threat model: untargeted, audio-only adversarial attacks on trimodal audio-video-language models. We analyze six complementary attack objectives that target different stages of multimodal processing, including audio encoder representations, cross-modal attention, hidden states, and output likelihoods. Across three state-of-the-art models and multiple benchmarks, we show that audio-only perturbations can induce severe multimodal failures, achieving up to 96% attack success rate. We further show that attacks can be successful at low perceptual distortions (LPIPS = 0.08, SI-SNR = 0) and benefit more from extended optimization than increased data scale. Transferability across models and encoders remains limited, while speech recognition systems such as Whisper primarily respond to perturbation magnitude, achieving 97% attack success under severe distortion. These results expose a previously overlooked single-modality attack surface in multimodal systems and motivate defenses that enforce cross-modal consistency.
zh
[NLP-59] Limits of n-gram Style Control for LLMs via Logit-Space Injection
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)个性化生成风格时面临的计算效率与控制精度之间的权衡问题。传统方法如提示工程(prompt engineering)或参数高效微调(parameter-efficient fine-tuning,例如LoRA)虽有效,但前者难以精确捕捉复杂写作风格,后者则需要高资源的训练过程。为此,作者提出一种轻量级替代方案:在解码阶段通过logit空间注入n-gram风格先验来引导冻结模型(frozen LLM)的输出风格。其关键在于构建一个基于不同n-gram阶数(1-to-3-gram)的风格概率先验,并在生成过程中将当前上下文匹配的n-gram风格log-probabilities加权叠加到LLM原始logits上,权重由控制参数λ ∈ [0, 1]调节。该方法实现了可调的风格控制,但在实践中表现出高度脆弱性——仅在极窄的低λ范围内(如Don Quixote语料中λ=0.1)能同时提升风格契合度和流畅性,其他情况下易导致文本退化甚至崩溃,整体性能仍落后于提示工程和LoRA。
链接: https://arxiv.org/abs/2601.16224
作者: Sami-ul Ahmed
机构: University of Colorado Boulder (科罗拉多大学博尔德分校)
类目: Computation and Language (cs.CL)
备注: 18 pages, 7 figures. Experimental study of decoding-time style control via n-gram logit injection
Abstract:Large language models (LLMs) are typically personalized via prompt engineering or parameter-efficient fine-tuning such as LoRA. However, writing style can be difficult to distill into a single prompt, and LoRA fine-tuning requires computationally intensive training and infrastructure. We investigate a possible lightweight alternative: steering a frozen LLM with n-gram style priors injected in logit space at decoding time. We train an n-gram model on stylistically distinct corpora – including Don Quixote, CNN/DailyMail news headlines, and arXiv abstracts – constructing an interpolated 1-to-3-gram prior over next-token probabilities. During generation we modify the LLM’s logits by adding a weighted sum of style log-probabilities from each n-gram order that matches the current context, scaled by a control parameter lambda in [0, 1]. We sweep lambda and style corpora and report style perplexity under the n-gram model, base-model perplexity as a proxy for fluency, Jensen-Shannon (JS) divergence between the original and steered token distributions, and token-overlap statistics. On TinyLlama-1.1B we identify a single narrow regime (for the Don Quixote corpus at lambda=0.1) where style perplexity improves by 24.7% and base-model perplexity improves by 51.4% relative to the frozen model. Outside this regime, and for multi-author corpora such as CNN/DailyMail and arXiv abstracts, even small nonzero lambda values generally result in worse style and fluency, and larger lambda values lead to collapse with extreme perplexities and incoherent text. Logit-space injection of n-gram style priors provides lightweight, tunable style control, but it is fragile: it operates effectively only within a narrow range of low lambda values and is consistently outperformed by prompting and LoRA.
zh
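The steering rule described above (a λ-weighted sum of n-gram style log-probabilities added to the frozen model's logits) can be sketched in a few lines. The tiny bigram table, uniform order weights, and backoff floor below are illustrative assumptions, not the paper's actual prior or weighting.

```python
# Minimal sketch of logit-space n-gram style injection (illustrative, not the paper's code).
import math
import torch

vocab = ["<bos>", "the", "knight", "rode", "forth", "quickly"]
V = len(vocab)

# Assumed toy style prior: log P_style(next | last (n-1) tokens) for n = 1..2.
style_logprobs = {
    (): {"the": math.log(0.4), "knight": math.log(0.3)},          # unigram prior
    ("the",): {"knight": math.log(0.7), "forth": math.log(0.1)},  # bigram prior
}

def steer_logits(base_logits, context, lam=0.1, order_weights=(0.5, 0.5), floor=math.log(1e-4)):
    """base_logits: (V,) frozen-LM logits; context: list of previous tokens."""
    steered = base_logits.clone()
    for n, w in enumerate(order_weights, start=1):
        key = tuple(context[-(n - 1):]) if n > 1 else ()
        table = style_logprobs.get(key, {})
        for i, tok in enumerate(vocab):
            lp = table.get(tok, floor)        # unseen tokens fall back to a small floor probability
            steered[i] += lam * w * lp        # lambda controls the strength of the style prior
    return steered

base_logits = torch.randn(V)                  # stand-in for the frozen model's next-token logits
steered = steer_logits(base_logits, context=["<bos>", "the"], lam=0.1)
delta = steered - base_logits
print("boost for 'knight' vs 'quickly':",
      (delta[vocab.index("knight")] - delta[vocab.index("quickly")]).item())
```

Raising `lam` strengthens the style prior, which mirrors the collapse behaviour reported in the abstract: large values let the low-order n-gram statistics dominate the frozen model's own distribution.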
[NLP-60] Towards Latent Diffusion Suitable For Text
【速读】: This paper targets the slow sampling and limited coherence of autoregressive large language models (LLMs). The key of the solution is Neural Flow Diffusion Models (NFDM), which learn a data-driven multivariate forward process so that the forward process and generative trajectory better fit language modeling, thereby allowing continuous diffusion models to be applied directly to discrete state spaces. At the same model size, NFDM substantially narrows the likelihood gap with autoregressive models while achieving sample quality comparable to previous latent diffusion models.
链接: https://arxiv.org/abs/2601.16220
作者: Nesta Midavaine,Christian A. Naesseth,Grigory Bartosh
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:Language diffusion models aim to improve sampling speed and coherence over autoregressive LLMs. We introduce Neural Flow Diffusion Models for language generation, an extension of NFDM that enables the straightforward application of continuous diffusion models to discrete state spaces. NFDM learns a multivariate forward process from the data, ensuring that the forward process and generative trajectory are a good fit for language modeling. Our model substantially reduces the likelihood gap with autoregressive models of the same size, while achieving sample quality comparable to that of previous latent diffusion models.
zh
[NLP-61] Domain Specific Specialization in Low-Resource Settings: The Efficacy of Offline Response-Based Knowledge Distillation in Large Language Models
【速读】: This paper addresses the hallucinations that large language models (LLMs) produce when handling domain-specific or institutional knowledge absent from pre-training, and the difficulty of building high-accuracy specialized assistants under constrained hardware. The key is an offline response-based knowledge distillation method whose carefully designed data strategy improves the accuracy and robustness of the student model: whereas large unstructured datasets lead to persistent hallucinations, only 500 lines of context-aware synthetic data generated by a teacher model achieve 96.7% accuracy and strong rejection capability, confirming the core hypothesis that data quality and structural alignment matter more than quantity. Using the Unsloth library to optimize the Qwen-2.5-7B student model further reduces GPU memory requirements from 40 GB to 16 GB, enabling efficient and reliable domain adaptation in low-resource settings.
链接: https://arxiv.org/abs/2601.16219
作者: Erdem Aslan,Pakize Erdoğmuş
机构: Düzce University (杜兹切大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 10 tables
Abstract:Large Language Models (LLMs) excel in general tasks but often struggle with hallucinations when handling domain-specific or institutional knowledge absent from their pre-training. We present an offline response-based knowledge distillation method that develops high-accuracy specialized assistants under constrained hardware resources. We evaluate three distinct data strategies: general domain adaptation (15,000 lines), unstructured knowledge injection (2,000 lines), and a context-aware synthetic dataset (500 lines) generated by a teacher model. To minimize computational costs, we utilize the Unsloth library to optimize the Qwen-2.5-7B student model, reducing NVIDIA A100 GPU memory requirements from 40 GB to 16 GB. Experimental results demonstrate that while larger unstructured datasets suffer from persistent hallucinations, the 500-line context-aware dataset achieves a 96.7% accuracy rate and robust rejection capability. These findings validate the LIMA hypothesis, showing that data quality and structural alignment are more critical than quantity for domain adaptation in low-resource settings.
zh
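Offline response-based distillation of the kind described above boils down to caching teacher responses for domain prompts and fine-tuning the student on the resulting pairs. The sketch below covers only the data-building step; `query_teacher` is a placeholder (the paper's teacher model, prompts, and 500-line dataset are not reproduced here), and the JSONL format is an assumption.

```python
# Sketch of building an offline response-based distillation set (placeholder teacher).
import json

def query_teacher(question: str, context: str) -> str:
    """Placeholder for the teacher LLM call; replace with a real client."""
    return f"(teacher answer grounded in: {context[:40]}...)"

def build_distillation_set(qa_seeds, out_path="distill.jsonl"):
    """qa_seeds: list of (question, institutional context snippet) pairs."""
    with open(out_path, "w", encoding="utf-8") as f:
        for question, context in qa_seeds:
            answer = query_teacher(question, context)
            # Context-aware pair: the student later sees question -> answer only,
            # but the answer was generated with the institutional context in view.
            f.write(json.dumps({"instruction": question, "output": answer},
                               ensure_ascii=False) + "\n")

seeds = [
    ("What are the library's opening hours?",
     "The campus library is open 08:00-22:00 on weekdays."),
    ("How do I reset my student portal password?",
     "Password resets are handled by the IT helpdesk, room B12."),
]
build_distillation_set(seeds)
print(open("distill.jsonl", encoding="utf-8").read())
```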
[NLP-62] M3Kang: Evaluating Multilingual Multimodal Mathematical Reasoning in Vision-Language Models
【速读】: This paper addresses the fact that the performance of vision-language models (VLMs) on multilingual mathematical reasoning remains unclear and far below human level, with the core challenge being how to evaluate and improve their understanding and reasoning over diagram-containing math problems across languages and cultures. The key is M3Kang, the first large-scale multilingual, multimodal mathematical reasoning dataset, derived from the Kangaroo Math Competition, the world's largest youth mathematics contest. It covers 108 languages and 1,747 multiple-choice problems organized by grade-level difficulty, some of which include diagrams essential for solving them. Systematic benchmarking of mainstream closed- and open-source SOTA models shows that performance improves with language coverage and model size but is unrelated to grade level, and that multilingual techniques transfer effectively to the multimodal setting with significant gains over baselines, providing a benchmark and direction for future research on multilingual mathematical reasoning.
链接: https://arxiv.org/abs/2601.16218
作者: Aleix Torres-Camps,Nathaniel Mitrani Hadida,Víctor Conchello Vendrell,Àlex Batlle Casellas,Arnau Padrés Masdemont,Jordi Ros-Giralt
机构: Qualcomm AI Research (高通人工智能研究); Qualcomm Technologies, Inc. (高通技术公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 8 figures
Abstract:Despite state-of-the-art vision-language models (VLMs) have demonstrated strong reasoning capabilities, their performance in multilingual mathematical reasoning remains underexplored, particularly when compared to human performance. To bridge this gap, we introduce M3Kang, the first massively multilingual, multimodal mathematical reasoning dataset for VLMs. It is derived from the Kangaroo Math Competition, the world’s largest mathematics contest, which annually engages over six million participants under the age of 18 across more than 90 countries. M3Kang includes 1,747 unique multiple-choice problems organized by grade-level difficulty, with translations into 108 culturally diverse languages, some of them including diagrams essential for solving them. Using this dataset, we conduct extensive benchmarking on both closed- and open-source SOTA models. We observe that, despite recent advances, models still struggle with basic math and diagram-based reasoning, with performance scaling with language presence and model size, but not with grade level. We also find that multilingual techniques can be effectively extended to the multimodal setting, resulting in significant improvements over baseline approaches. Our analysis also incorporates performance data from over 68,000 students, enabling direct comparison with human performance. We are open-sourcing M3Kang, including the English-only subset M2Kang, along with the framework and codebase used to construct the dataset.
zh
[NLP-63] ChiEngMixBench: Evaluating Large Language Models on Spontaneous and Natural Chinese-English Code-Mixed Generation
【速读】: This paper addresses the insufficient evaluation of code-mixing behavior in human-LLM interaction: existing work often reduces it to a translation or convertibility problem, making it hard to judge whether a model's switching is context-appropriate and aligned with human conventions. The key is ChiEngMixBench, the first benchmark targeting code-mixing in authentic community contexts, which formulates code-mixing as a cognitive alignment problem and introduces two complementary signals, Spontaneity and Naturalness, to systematically quantify model performance in multilingual interaction. The study further uncovers an implicitly emergent Terminology Layering Strategy consistent with Matrix Language Frame (MLF) theory, revealing structured cognitive alignment between multilingual large language models and human communication.
链接: https://arxiv.org/abs/2601.16217
作者: Qingyan Yang,Tongxi Wang,Yunsheng Luo
机构: Southeast University (东南大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Code-mixing is increasingly prevalent in interactions between humans and large language models, yet existing work often reduces it to a translation or convertibility problem, making it difficult to assess whether a model’s switching behavior is context-appropriate and aligned with human conventions. We introduce ChiEngMixBench, the first benchmark designed to evaluate code-mixing ability in authentic community contexts, built upon a general construction pipeline that enables scalable dataset development across domains and bilingual pairs. ChiEngMixBench formulates code-mixing as a cognitive alignment problem, characterized by two complementary signals: Spontaneity and Naturalness. Empirical evaluation shows that our metrics can systematically distinguish code-mixing performance across models. Beyond benchmarking, we further uncover an implicitly emergent Terminology Layering Strategy, a phenomenon consistent with the Matrix Language Frame (MLF) theory, indicating structured cognitive alignment between multilingual large language models and human communication.
zh
[NLP-64] EdgeSpot: Efficient and High-Performance Few-Shot Model for Keyword Spotting ICASSP2026
【速读】: This paper addresses the difficulty of jointly achieving accuracy and computational efficiency for keyword spotting (KS) models on edge devices in few-shot settings. The key is EdgeSpot, a lightweight and efficient architecture that combines an optimized BC-ResNet acoustic backbone, a trainable Per-Channel Energy Normalization frontend, and lightweight temporal self-attention, trained via knowledge distillation from a self-supervised teacher optimized with the Sub-center ArcFace loss. Experiments show that EdgeSpot clearly outperforms strong baselines at a fixed false-alarm rate (FAR); the largest variant, EdgeSpot-4, improves 10-shot accuracy at 1% FAR from 73.7% to 82.0% while requiring only 29.4M MACs and 128k parameters, balancing high accuracy against low resource consumption.
链接: https://arxiv.org/abs/2601.16316
作者: Oguzhan Buyuksolak,Alican Gok,Osman Erman Okman
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted to be presented in IEEE ICASSP 2026
Abstract:We introduce an efficient few-shot keyword spotting model for edge devices, EdgeSpot, that pairs an optimized version of a BC-ResNet-based acoustic backbone with a trainable Per-Channel Energy Normalization frontend and lightweight temporal self-attention. Knowledge distillation is utilized during training by employing a self-supervised teacher model, optimized with Sub-center ArcFace loss. This study demonstrates that the EdgeSpot model consistently provides better accuracy at a fixed false-alarm rate (FAR) than strong BC-ResNet baselines. The largest variant, EdgeSpot-4, improves the 10-shot accuracy at 1% FAR from 73.7% to 82.0%, which requires only 29.4M MACs with 128k parameters.
zh
[NLP-65] Zero-Shot Speech LLMs for Multi-Aspect Evaluation of L2 Speech: Challenges and Opportunities
【速读】: This paper tackles the difficulty of accurate automated scoring of L2 English pronunciation, particularly the modeling of complex sentence-level characteristics such as fluency, prosody, and completeness. The key is to evaluate the zero-shot scoring ability of an instruction-tuned speech large language model (Speech-LLM), Qwen2-Audio-7B-Instruct, on 5,000 Speechocean762 utterances without task-specific training. Results show strong agreement with human ratings within a ±2 tolerance for high-quality speech, demonstrating the strong potential of speech LLMs for scalable pronunciation assessment, while also revealing overestimated scores for low-quality speech and limited precision in error detection, pointing to future improvements through prompting, calibration, and phonetic integration.
链接: https://arxiv.org/abs/2601.16230
作者: Aditya Kamlesh Parikh,Cristian Tejedor-Garcia,Catia Cucchiarini,Helmer Strik
机构: Radboud University (拉德布德大学)
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注: This publication is part of the project Responsible AI for Voice Diagnostics (RAIVD) with file number NGF.1607.22.013 of the research programme NGF AiNed Fellowship Grants which is financed by the Dutch Research Council (NWO)
Abstract:An accurate assessment of L2 English pronunciation is crucial for language learning, as it provides personalized feedback and ensures a fair evaluation of individual progress. However, automated scoring remains challenging due to the complexity of sentence-level fluency, prosody, and completeness. This paper evaluates the zero-shot performance of Qwen2-Audio-7B-Instruct, an instruction-tuned speech-LLM, on 5,000 Speechocean762 utterances. The model generates rubric-aligned scores for accuracy, fluency, prosody, and completeness, showing strong agreement with human ratings within ±2 tolerance, especially for high-quality speech. However, it tends to overpredict low-quality speech scores and lacks precision in error detection. These findings demonstrate the strong potential of speech LLMs in scalable pronunciation assessment and suggest future improvements through enhanced prompting, calibration, and phonetic integration to advance Computer-Assisted Pronunciation Training.
zh
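The ±2 tolerance agreement used to compare model scores with human ratings is straightforward to compute; the sketch below uses made-up scores on a 0-10 scale purely for illustration.

```python
# Agreement within a +/-2 tolerance between model and human rubric scores (toy data).
import numpy as np

def tolerance_agreement(model_scores, human_scores, tol=2):
    model_scores = np.asarray(model_scores, dtype=float)
    human_scores = np.asarray(human_scores, dtype=float)
    return float(np.mean(np.abs(model_scores - human_scores) <= tol))

human = [9, 8, 7, 3, 10, 6, 2, 8]
model = [8, 9, 9, 6, 10, 7, 5, 8]   # note the over-prediction on the low-quality items
print("agreement within +/-2:", tolerance_agreement(model, human))
```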
计算机视觉
[CV-0] AnyView: Synthesizing Any Novel View in Dynamic Scenes
【速读】: This paper addresses the difficulty generative video models have in maintaining multi-view and spatiotemporal consistency in highly dynamic real-world scenes. The key of the solution is AnyView, a diffusion-based video generation framework that combines multiple supervision sources, including monocular (2D), multi-view static (3D), and multi-view dynamic (4D) data, to train a generalist spatiotemporal implicit representation capable of zero-shot novel-view video synthesis from arbitrary camera positions and trajectories, without relying on strong geometric priors or structural assumptions.
链接: https://arxiv.org/abs/2601.16982
作者: Basile Van Hoorick,Dian Chen,Shun Iwase,Pavel Tokmakov,Muhammad Zubair Irshad,Igor Vasiljevic,Swati Gupta,Fangzhou Cheng,Sergey Zakharov,Vitor Campagnolo Guizilini
机构: Toyota Research Institute (丰田研究院); Amazon Web Services (亚马逊网络服务)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Project webpage: this https URL
Abstract:Modern generative video models excel at producing convincing, high-quality outputs, but struggle to maintain multi-view and spatiotemporal consistency in highly dynamic real-world environments. In this work, we introduce AnyView, a diffusion-based video generation framework for dynamic view synthesis with minimal inductive biases or geometric assumptions. We leverage multiple data sources with various levels of supervision, including monocular (2D), multi-view static (3D) and multi-view dynamic (4D) datasets, to train a generalist spatiotemporal implicit representation capable of producing zero-shot novel videos from arbitrary camera locations and trajectories. We evaluate AnyView on standard benchmarks, showing competitive results with the current state of the art, and propose AnyViewBench, a challenging new benchmark tailored towards extreme dynamic view synthesis in diverse real-world scenarios. In this more dramatic setting, we find that most baselines drastically degrade in performance, as they require significant overlap between viewpoints, while AnyView maintains the ability to produce realistic, plausible, and spatiotemporally consistent videos when prompted from any viewpoint. Results, data, code, and models can be viewed at: this https URL
zh
[CV-1] SyncLight: Controllable and Consistent Multi-View Relighting
【速读】: This paper addresses the difficulty of maintaining lighting consistency across multi-view captures of a static scene, which multi-camera broadcasts, stereoscopic cinema, and virtual production strictly require. Existing generative methods have advanced single-view relighting but cannot guarantee consistency across views. The key of SyncLight is a multi-view diffusion transformer trained with latent bridge matching: conditioned on a single reference edit, it enables precise control of light intensity and color across an entire multi-view image set, generalizes zero-shot to an arbitrary number of viewpoints without camera pose information, and performs high-fidelity multi-view relighting in a single inference step.
链接: https://arxiv.org/abs/2601.16981
作者: David Serrano-Lozano,Anand Bhattad,Luis Herranz,Jean-François Lalonde,Javier Vazquez-Corral
机构: Universitat Autònoma de Barcelona(巴塞罗那自治大学); Johns Hopkins University(约翰霍普金斯大学); Universidad Politécnica de Madrid(马德里理工大学); Université Laval(拉瓦尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project page: this http URL
Abstract:We present SyncLight, the first method to enable consistent, parametric relighting across multiple uncalibrated views of a static scene. While single-view relighting has advanced significantly, existing generative approaches struggle to maintain the rigorous lighting consistency essential for multi-camera broadcasts, stereoscopic cinema, and virtual production. SyncLight addresses this by enabling precise control over light intensity and color across a multi-view capture of a scene, conditioned on a single reference edit. Our method leverages a multi-view diffusion transformer trained using a latent bridge matching formulation, achieving high-fidelity relighting of the entire image set in a single inference step. To facilitate training, we introduce a large-scale hybrid dataset comprising diverse synthetic environments – curated from existing sources and newly designed scenes – alongside high-fidelity, real-world multi-view captures under calibrated illumination. Surprisingly, though trained only on image pairs, SyncLight generalizes zero-shot to an arbitrary number of viewpoints, effectively propagating lighting changes across all views, without requiring camera pose information. SyncLight enables practical relighting workflows for multi-view capture systems.
zh
[CV-2] VisGym: Diverse Customizable Scalable Environments for Multimodal Agents
【速读】: This paper addresses the lack of a systematic framework for evaluating and training vision-language models (VLMs) in multi-step visual interaction tasks, particularly how models integrate perception, memory, and action over long horizons. The key is VisGym, a benchmark suite of 17 environments spanning symbolic puzzles, real-image understanding, navigation, and manipulation, with flexible controls over difficulty, input representation, planning horizon, and feedback; it also provides structured demonstration generators for supervised fine-tuning. The suite exposes key failure modes in long-context utilization, transfer to visually rendered symbolic tasks, and partially observable settings, offering a quantitative testbed and concrete pathways for improving multi-step visual decision-making in VLMs.
链接: https://arxiv.org/abs/2601.16973
作者: Zirui Wang,Junyi Zhang,Jiaxin Ge,Long Lian,Letian Fu,Lisa Dunlap,Ken Goldberg,XuDong Wang,Ion Stoica,David M. Chan,Sewon Min,Joseph E. Gonzalez
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Modern Vision-Language Models (VLMs) remain poorly characterized in multi-step visual interactions, particularly in how they integrate perception, memory, and action over long horizons. We introduce VisGym, a gymnasium of 17 environments for evaluating and training VLMs. The suite spans symbolic puzzles, real-image understanding, navigation, and manipulation, and provides flexible controls over difficulty, input representation, planning horizon, and feedback. We also provide multi-step solvers that generate structured demonstrations, enabling supervised finetuning. Our evaluations show that all frontier models struggle in interactive settings, achieving low success rates in both the easy (46.6%) and hard (26.0%) configurations. Our experiments reveal notable limitations: models struggle to effectively leverage long context, performing worse with an unbounded history than with truncated windows. Furthermore, we find that several text-based symbolic tasks become substantially harder once rendered visually. However, explicit goal observations, textual feedback, and exploratory demonstrations in partially observable or unknown-dynamics settings for supervised finetuning yield consistent gains, highlighting concrete failure modes and pathways for improving multi-step visual decision-making. Code, data, and models can be found at: this https URL.
zh
[CV-3] Domain-invariant Mixed-domain Semi-supervised Medical Image Segmentation with Clustered Maximum Mean Discrepancy Alignment ICASSP2026
【速读】: This paper targets the performance degradation in medical image semantic segmentation caused by scarce annotations and inconsistent multi-source data distributions (the mixed-domain setting), where existing semi-supervised or domain adaptation methods are hard to apply in practice because explicit domain labels are unavailable and domain gaps are severe. The key is a domain-invariant mixed-domain semi-supervised segmentation framework with two core mechanisms: a Copy-Paste Mechanism (CPM) that transfers informative content across domains to enrich training diversity, and a Cluster Maximum Mean Discrepancy (CMMD) module that clusters unlabeled features and aligns them with labeled anchors via an MMD objective to learn domain-invariant representations. Embedded in a teacher-student architecture, the method achieves robust and precise segmentation even with very few labeled samples and multiple unknown domain discrepancies.
链接: https://arxiv.org/abs/2601.16954
作者: Ba-Thinh Lam,Thanh-Huy Nguyen,Hoang-Thien Nguyen,Quang-Khai Bui-Tran,Nguyen Lan Vi Vu,Phat K. Huynh,Ulas Bagci,Min Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted in ICASSP 2026
Abstract:Deep learning has shown remarkable progress in medical image semantic segmentation, yet its success heavily depends on large-scale expert annotations and consistent data distributions. In practice, annotations are scarce, and images are collected from multiple scanners or centers, leading to mixed-domain settings with unknown domain labels and severe domain gaps. Existing semi-supervised or domain adaptation approaches typically assume either a single domain shift or access to explicit domain indices, which rarely hold in real-world deployment. In this paper, we propose a domain-invariant mixed-domain semi-supervised segmentation framework that jointly enhances data diversity and mitigates domain bias. A Copy-Paste Mechanism (CPM) augments the training set by transferring informative regions across domains, while a Cluster Maximum Mean Discrepancy (CMMD) block clusters unlabeled features and aligns them with labeled anchors via an MMD objective, encouraging domain-invariant representations. Integrated within a teacher-student framework, our method achieves robust and precise segmentation even with very few labeled examples and multiple unknown domain discrepancies. Experiments on Fundus and MMs benchmarks demonstrate that our approach consistently surpasses semi-supervised and domain adaptation methods, establishing a potential solution for mixed-domain semi-supervised medical image segmentation.
zh
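The MMD objective at the heart of the CMMD block is standard; below is a minimal RBF-kernel MMD between a cluster of unlabeled features and a set of labeled anchor features. Feature dimensions, bandwidth, and batch sizes are illustrative assumptions, and the clustering step itself is omitted.

```python
# Minimal RBF-kernel Maximum Mean Discrepancy between unlabeled cluster features
# and labeled anchor features (illustrative, not the paper's exact CMMD block).
import torch

def rbf_mmd(x, y, sigma=1.0):
    """x: (n, d) unlabeled cluster features, y: (m, d) labeled anchor features."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    k_xx = kernel(x, x).mean()
    k_yy = kernel(y, y).mean()
    k_xy = kernel(x, y).mean()
    return k_xx + k_yy - 2 * k_xy

unlabeled_cluster = torch.randn(32, 128) + 0.5     # shifted distribution stands in for a domain gap
labeled_anchor = torch.randn(48, 128)
loss_mmd = rbf_mmd(unlabeled_cluster, labeled_anchor)
print("MMD alignment loss:", loss_mmd.item())      # would be added to the segmentation loss with a weight
```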
[CV-4] Reward-Forcing: Autoregressive Video Generation with Reward Feedback
【速读】: This paper addresses the limited performance of autoregressive video generation models when no strong teacher model is available, where output quality typically lags behind bidirectional architectures. The key is to guide the generation process with reward signals, which simplifies training while preserving high visual fidelity and temporal consistency. By removing the dependence on a teacher model, the approach enables more efficient and scalable autoregressive generation without sacrificing performance, matching or surpassing comparable models on standard benchmarks such as VBench.
链接: https://arxiv.org/abs/2601.16933
作者: Jingran Zhang,Ning Li,Yuanhao Ban,Andrew Bai,Justin Cui
机构: University of California, Los Angeles (加州大学洛杉矶分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: preprint
Abstract:While most prior work in video generation relies on bidirectional architectures, recent efforts have sought to adapt these models into autoregressive variants to support near real-time generation. However, such adaptations often depend heavily on teacher models, which can limit performance, particularly in the absence of a strong autoregressive teacher, resulting in output quality that typically lags behind their bidirectional counterparts. In this paper, we explore an alternative approach that uses reward signals to guide the generation process, enabling more efficient and scalable autoregressive generation. By using reward signals to guide the model, our method simplifies training while preserving high visual fidelity and temporal consistency. Through extensive experiments on standard benchmarks, we find that our approach performs comparably to existing autoregressive models and, in some cases, surpasses similarly sized bidirectional models by avoiding constraints imposed by teacher architectures. For example, on VBench, our method achieves a total score of 84.92, closely matching state-of-the-art autoregressive methods that score 84.31 but require significant heterogeneous distillation.
zh
[CV-5] LoL: Longer than Longer Scaling Video Generation to Hour
【速读】: This paper addresses error accumulation and loss of long-term coherence in autoregressive long-video generation, in particular the "sink-collapse" failure induced by attention sink frames, where generated content repeatedly reverts to the anchor frame, producing abrupt scene resets and cyclic motion. The key insight is that this failure stems from an inherent conflict between the periodic structure of Rotary Position Embedding (RoPE) and multi-head attention; the proposed lightweight, training-free fix introduces multi-head RoPE jitter that breaks homogenization across attention heads, effectively suppressing long-horizon temporal collapse and enabling real-time, streaming, effectively infinite-length video generation with little quality degradation.
链接: https://arxiv.org/abs/2601.16914
作者: Justin Cui,Jie Wu,Ming Li,Tao Yang,Xiaojie Li,Rui Wang,Andrew Bai,Yuanhao Ban,Cho-Jui Hsieh
机构: 1. University of California, San Diego (加州大学圣地亚哥分校); 2. Alibaba Group (阿里巴巴集团); 3. Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: preprint
Abstract:Recent research in long-form video generation has shifted from bidirectional to autoregressive models, yet these methods commonly suffer from error accumulation and a loss of long-term coherence. While attention sink frames have been introduced to mitigate this performance decay, they often induce a critical failure mode we term sink-collapse: the generated content repeatedly reverts to the sink frame, resulting in abrupt scene resets and cyclic motion patterns. Our analysis reveals that sink-collapse originates from an inherent conflict between the periodic structure of Rotary Position Embedding (RoPE) and the multi-head attention mechanisms prevalent in current generative models. To address it, we propose a lightweight, training-free approach that effectively suppresses this behavior by introducing multi-head RoPE jitter that breaks inter-head attention homogenization and mitigates long-horizon collapse. Extensive experiments show that our method successfully alleviates sink-collapse while preserving generation quality. To the best of our knowledge, this work achieves the first demonstration of real-time, streaming, and infinite-length video generation with little quality decay. As an illustration of this robustness, we generate continuous videos up to 12 hours in length, which, to our knowledge, is among the longest publicly demonstrated results in streaming video generation.
zh
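The training-free fix amounts to giving each attention head a slightly different RoPE phase so the heads stop collapsing onto the same periodic pattern. The sketch below applies a small per-head angle offset inside a standard RoPE rotation; the jitter scale and the exact way the offset enters the angles are assumptions for illustration, not the paper's formulation.

```python
# Sketch of RoPE with a small per-head phase jitter (illustrative form of the idea).
import torch

def rope_with_head_jitter(x, jitter_scale=0.02, base=10000.0):
    """x: (batch, heads, seq, head_dim) queries or keys; head_dim must be even."""
    b, h, t, d = x.shape
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # (d/2,)
    pos = torch.arange(t, dtype=torch.float32)
    angles = pos[:, None] * inv_freq[None, :]                              # (t, d/2)
    # Per-head phase offset: each head sees slightly shifted rotation angles.
    head_jitter = jitter_scale * torch.randn(h, 1, 1)                      # (h, 1, 1)
    angles = angles[None, :, :] + head_jitter                              # (h, t, d/2)
    cos, sin = angles.cos()[None], angles.sin()[None]                      # (1, h, t, d/2)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(2, 8, 16, 64)        # (batch, heads, seq, head_dim)
q_rot = rope_with_head_jitter(q)
print(q_rot.shape)                   # torch.Size([2, 8, 16, 64])
```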
[CV-6] Embedding-based Crop Type Classification in the Groundnut Basin of Senegal
【速读】: This paper addresses the limited suitability of satellite remote sensing methods for crop-type mapping in smallholder regions, which matters for food security, livelihood support, and climate-change mitigation. The key is a four-part evaluation criterion for embedding-based approaches (performance, plausibility, transferability, and accessibility) and the use of geospatial foundation model (FM) embeddings for crop classification and mapping. For a region in the groundnut basin of Senegal, the TESSERA embedding approach performs best, improving accuracy by 28% over the next-best method in a temporal transfer example, indicating that TESSERA embeddings are an effective route for crop-type classification and mapping in smallholder regions.
链接: https://arxiv.org/abs/2601.16900
作者: Madeline C. Lisaius,Srinivasan Keshav,Andrew Blake,Clement Atzberger
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Crop type maps from satellite remote sensing are important tools for food security, local livelihood support and climate change mitigation in smallholder regions of the world, but most satellite-based methods are not well suited to smallholder conditions. To address this gap, we establish a four-part criteria for a useful embedding-based approach consisting of 1) performance, 2) plausibility, 3) transferability and 4) accessibility and evaluate geospatial foundation model (FM) embeddings-based approaches using TESSERA and AlphaEarth against current baseline methods for a region in the groundnut basin of Senegal. We find that the TESSERA-based approach to land cover and crop type mapping fulfills the selection criteria best, and in one temporal transfer example shows 28% higher accuracy compared to the next best method. These results indicate that TESSERA embeddings are an effective approach for crop type classification and mapping tasks in Senegal.
zh
[CV-7] Evaluating Large Vision-language Models for Surgical Tool Detection
【速读】: This paper addresses the incomplete understanding of complex surgical workflows by current surgical AI systems, caused by their limited multimodal perception, and focuses on improving surgical tool detection. The key is to evaluate three state-of-the-art vision-language models (VLMs), Qwen2.5, LLaVA1.5, and InternVL3.5, on the GraSP robotic surgery dataset under both zero-shot and parameter-efficient LoRA fine-tuning settings, testing whether VLMs can deliver human-like reasoning and scene understanding in surgical settings. Results show that Qwen2.5 performs best in both configurations, with stronger zero-shot generalization than the Grounding DINO baseline and comparable fine-tuned performance, highlighting the central role of multimodal large models in advancing general-purpose surgical AI systems.
链接: https://arxiv.org/abs/2601.16895
作者: Nakul Poudel,Richard Simon,Cristian A. Linte
机构: Rochester Institute of Technology (罗切斯特理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Surgery is a highly complex process, and artificial intelligence has emerged as a transformative force in supporting surgical guidance and decision-making. However, the unimodal nature of most current AI systems limits their ability to achieve a holistic understanding of surgical workflows. This highlights the need for general-purpose surgical AI systems capable of comprehensively modeling the interrelated components of surgical scenes. Recent advances in large vision-language models that integrate multimodal data processing offer strong potential for modeling surgical tasks and providing human-like scene reasoning and understanding. Despite their promise, systematic investigations of VLMs in surgical applications remain limited. In this study, we evaluate the effectiveness of large VLMs for the fundamental surgical vision task of detecting surgical tools. Specifically, we investigate three state-of-the-art VLMs, Qwen2.5, LLaVA1.5, and InternVL3.5, on the GraSP robotic surgery dataset under both zero-shot and parameter-efficient LoRA fine-tuning settings. Our results demonstrate that Qwen2.5 consistently achieves superior detection performance in both configurations among the evaluated VLMs. Furthermore, compared with the open-set detection baseline Grounding DINO, Qwen2.5 exhibits stronger zero-shot generalization and comparable fine-tuned performance. Notably, Qwen2.5 shows superior instrument recognition, while Grounding DINO demonstrates stronger localization.
zh
[CV-8] GPA-VGGT: Adapting VGGT to Large scale Localization by self-Supervised learning with Geometry and Physics Aware loss
【速读】: This paper addresses the difficulty of adapting transformer-based visual geometry frameworks such as VGGT to unlabeled and unseen scenes, especially for camera pose estimation and 3D reconstruction at large scale, where reliance on ground-truth labels limits generalization. The key is a self-supervised framework that extends conventional pair-wise relations to sequence-wise geometric constraints and jointly optimizes physical photometric consistency and geometric constraints as the loss, removing the need for hard labels. Within each sequence, multiple source frames are sampled and geometrically projected onto different target frames, improving temporal feature consistency so that the local and global cross-view attention layers as well as the camera and depth heads effectively capture multi-view geometry. Experiments show convergence within hundreds of iterations and significant gains in large-scale localization.
链接: https://arxiv.org/abs/2601.16885
作者: Yangfan Xu,Lilian Zhang,Xiaofeng He,Pengdong Wu,Wenqi Wu,Jun Mao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Transformer-based general visual geometry frameworks have shown promising performance in camera pose estimation and 3D scene understanding. Recent advancements in Visual Geometry Grounded Transformer (VGGT) models have shown great promise in camera pose estimation and 3D reconstruction. However, these models typically rely on ground truth labels for training, posing challenges when adapting to unlabeled and unseen scenes. In this paper, we propose a self-supervised framework to train VGGT with unlabeled data, thereby enhancing its localization capability in large-scale environments. To achieve this, we extend conventional pair-wise relations to sequence-wise geometric constraints for self-supervised learning. Specifically, in each sequence, we sample multiple source frames and geometrically project them onto different target frames, which improves temporal feature consistency. We formulate physical photometric consistency and geometric constraints as a joint optimization loss to circumvent the requirement for hard labels. By training the model with this proposed method, not only the local and global cross-view attention layers but also the camera and depth heads can effectively capture the underlying multi-view geometry. Experiments demonstrate that the model converges within hundreds of iterations and achieves significant improvements in large-scale localization. Our code will be released at this https URL.
zh
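The self-supervised signal here is the classic photometric reprojection loss: warp a source frame into a target view using predicted depth and relative pose, then penalize the photometric difference. The sketch below is a generic monocular-style implementation under assumed shapes and a pinhole intrinsic matrix; it is not the GPA-VGGT loss itself, and it omits the geometric-constraint terms.

```python
# Generic photometric reprojection loss (sketch; omits the paper's geometric terms).
import torch
import torch.nn.functional as F

def photometric_consistency(src_img, tgt_img, tgt_depth, T_tgt_to_src, K):
    """src_img, tgt_img: (B, 3, H, W); tgt_depth: (B, 1, H, W);
    T_tgt_to_src: (B, 4, 4) relative pose; K: (B, 3, 3) intrinsics."""
    B, _, H, W = tgt_img.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).reshape(1, 3, -1).expand(B, -1, -1)  # (B, 3, HW)

    cam = torch.inverse(K) @ pix * tgt_depth.reshape(B, 1, -1)        # back-project target pixels
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W)], dim=1)          # homogeneous (B, 4, HW)
    src_cam = (T_tgt_to_src @ cam_h)[:, :3]                           # transform into source frame
    src_pix = K @ src_cam
    src_pix = src_pix[:, :2] / src_pix[:, 2:3].clamp(min=1e-6)        # perspective divide

    # Normalize to [-1, 1] for grid_sample and warp the source image into the target view.
    gx = 2.0 * src_pix[:, 0] / (W - 1) - 1.0
    gy = 2.0 * src_pix[:, 1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).reshape(B, H, W, 2)
    warped = F.grid_sample(src_img, grid, align_corners=True, padding_mode="border")
    return (warped - tgt_img).abs().mean()                            # L1 photometric error

B, H, W = 2, 32, 48
K = torch.eye(3).unsqueeze(0).repeat(B, 1, 1)
K[:, 0, 0] = K[:, 1, 1] = 30.0
K[:, 0, 2] = W / 2
K[:, 1, 2] = H / 2
loss = photometric_consistency(torch.rand(B, 3, H, W), torch.rand(B, 3, H, W),
                               torch.rand(B, 1, H, W) + 1.0,
                               torch.eye(4).unsqueeze(0).repeat(B, 1, 1), K)
print("photometric loss:", loss.item())
```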
[CV-9] No Validation No Problem: Predicting Model Performance from a Single Gradient
【速读】: This paper aims to remove the reliance on a validation set for checkpoint selection and early stopping during deep-learning training. The key is a validation-free checkpointing signal computed from a single forward-backward pass: the Frobenius norm of the classifier-head gradient, ||g||_F = ||dL/dW||_F. Across ImageNet-1k CNNs and Transformers this proxy correlates strongly negatively with Top-1 accuracy and positively with loss, and selecting the checkpoint with the minimum head gradient within a short tail window approaches oracle performance (about a 1.12% gap with light per-family tuning). Different normalizations suit different architectures (head-scale for classic CNNs, feature-scale for Transformers and modern CNNs) for stability; the same probe also predicts COCO detection/segmentation mAP and monitors diffusion (UNet/DDPM) training progress, serving as a lightweight, label-free monitor that adds less than 0.1% of an epoch of overhead.
链接: https://arxiv.org/abs/2601.16874
作者: Fangzheng Wu,Brian Summa
机构: Tulane University (杜兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:We propose a validation-free checkpointing signal from a single forward-backward pass: the Frobenius norm of the classifier-head gradient on one detached-feature batch, ||g||_F = ||dL/dW||_F. Across ImageNet-1k CNNs and Transformers, this proxy is strongly negative with Top-1 and positive with loss. Selecting the checkpoint with the minimum head gradient in a short tail window closes most of the gap to the oracle (4.24% +/- 2.00% with a universal setup, about 1.12% with light per-family tuning). For practical deployment, a head-scale normalization is more stable within classic CNN families (e.g., ResNets), while a feature-scale normalization works well for Transformers and modern CNNs. The same one-batch probe also predicts COCO detection/segmentation mAP. In diffusion (UNet/DDPM on CIFAR-10), it tracks progress and enables near-oracle tail-window selection; it is positively correlated with same-distribution probe MSE and negatively with FID (lower is better), so it can be used as a lightweight, label-free monitor. Validation labels are never used beyond reporting. The probe adds much less than 0.1% of an epoch and works as a drop-in for validation-free checkpoint selection and early stopping.
zh
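The probe itself is a one-batch computation: run detached backbone features through the classifier head, backprop the loss into the head weight only, and take the Frobenius norm, optionally rescaled. The sketch below follows that recipe with a dummy head and random features; the two normalizations are written as named in the abstract, but their exact definitions in the paper may differ.

```python
# One-batch head-gradient probe ||dL/dW||_F on detached features (sketch).
import torch
import torch.nn as nn

def head_gradient_probe(head: nn.Linear, feats: torch.Tensor, labels: torch.Tensor):
    """feats: (B, D) detached backbone features; labels: (B,) class indices."""
    head.zero_grad(set_to_none=True)
    logits = head(feats.detach())                    # features are frozen; only the head gets grads
    loss = nn.functional.cross_entropy(logits, labels)
    loss.backward()
    g = head.weight.grad
    g_norm = g.norm(p="fro")
    head_scale = g_norm / head.weight.detach().norm(p="fro").clamp(min=1e-12)
    feat_scale = g_norm / feats.detach().norm(p="fro").clamp(min=1e-12)
    return g_norm.item(), head_scale.item(), feat_scale.item()

head = nn.Linear(512, 1000)
feats = torch.randn(256, 512)                        # one batch of backbone features
labels = torch.randint(0, 1000, (256,))
raw, head_scaled, feat_scaled = head_gradient_probe(head, feats, labels)
print(f"||g||_F={raw:.4f}  head-scale={head_scaled:.4f}  feature-scale={feat_scaled:.4f}")
# Checkpoint selection: compute the probe for each tail-window checkpoint and keep the minimum.
```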
[CV-10] Calibrated Probabilistic Interpolation for GEDI Biomass
【速读】: This paper addresses reliable wall-to-wall biomass mapping from NASA's GEDI mission, whose core challenge is interpolating sparse LiDAR observations across heterogeneous landscapes while producing accurate, well-calibrated uncertainty estimates. Standard machine-learning approaches such as Random Forest and XGBoost treat spatial predictions from multispectral or SAR data as independent and do not adapt to the varying difficulty of different regions, so they fail to produce reliable prediction intervals. The key is Attentive Neural Processes (ANPs), a probabilistic meta-learning framework that explicitly conditions predictions on local observation sets and geospatial foundation model embeddings, learning a flexible spatial covariance function so that uncertainty expands in complex terrain and contracts in homogeneous areas. Across five biomes, from tropical Amazonian forests to boreal and alpine ecosystems, the method achieves competitive accuracy with near-ideal calibration and supports few-shot adaptation, substantially improving cross-region generalization.
链接: https://arxiv.org/abs/2601.16834
作者: Robin Young,Srinivasan Keshav
机构: 未知
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reliable wall-to-wall biomass mapping from NASA’s GEDI mission requires interpolating sparse LiDAR observations across heterogeneous landscapes. While machine learning approaches like Random Forest and XGBoost are standard for this task, they treat spatial predictions of GEDI observations from multispectral or SAR remote sensing data as independent without adapting to the varying difficulty of heterogeneous landscapes. We demonstrate these approaches generally fail to produce calibrated prediction intervals. We identify that this stems from conflating ensemble variance with aleatoric uncertainty and ignoring local spatial context. To resolve this, we introduce Attentive Neural Processes (ANPs), a probabilistic meta-learning framework that explicitly conditions predictions on local observation sets and geospatial foundation model embeddings. Unlike static ensembles, ANPs learn a flexible spatial covariance function, allowing uncertainty estimates to expand in complex landscapes and contract in homogeneous areas. We validate this approach across five distinct biomes ranging from Tropical Amazonian forests to Boreal and Alpine ecosystems, demonstrating that ANPs achieve competitive accuracy while maintaining near-ideal uncertainty calibration. We demonstrate the operational utility of the method through few-shot adaptation, where the model recovers most of the performance gap in cross-region transfer using minimal local data. This work provides a scalable, theoretically rigorous alternative to ensemble variance for continental scale earth observation.
zh
[CV-11] Incorporating Eye-Tracking Signals Into Multimodal Deep Visual Models For Predicting User Aesthetic Experience In Residential Interiors
【速读】: This paper addresses the difficulty of predicting aesthetic evaluations of interior spaces, owing to the subjectivity of perception and the complexity of visual responses. The key is a dual-branch CNN-LSTM framework that fuses visual features with eye-tracking signals to predict aesthetic ratings of residential interiors, treating eye-tracking data as privileged information during training. This improves prediction of both objective dimensions (e.g., light) and subjective dimensions (e.g., relaxation): pupil responses contribute most to objective assessments, while combining gaze with visual cues notably boosts subjective evaluations. Models trained with eye-tracking retain comparable performance when deployed with visual input alone, underscoring the feasibility and advantage of the approach for practical aesthetic assessment tools.
链接: https://arxiv.org/abs/2601.16811
作者: Chen-Ying Chien,Po-Chih Kuo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Understanding how people perceive and evaluate interior spaces is essential for designing environments that promote well-being. However, predicting aesthetic experiences remains difficult due to the subjective nature of perception and the complexity of visual responses. This study introduces a dual-branch CNN-LSTM framework that fuses visual features with eye-tracking signals to predict aesthetic evaluations of residential interiors. We collected a dataset of 224 interior design videos paired with synchronized gaze data from 28 participants who rated 15 aesthetic dimensions. The proposed model attains 72.2% accuracy on objective dimensions (e.g., light) and 66.8% on subjective dimensions (e.g., relaxation), outperforming state-of-the-art video baselines and showing clear gains on subjective evaluation tasks. Notably, models trained with eye-tracking retain comparable performance when deployed with visual input alone. Ablation experiments further reveal that pupil responses contribute most to objective assessments, while the combination of gaze and visual cues enhances subjective evaluations. These findings highlight the value of incorporating eye-tracking as privileged information during training, enabling more practical tools for aesthetic assessment in interior design.
zh
[CV-12] REL-SF4PASS: Panoramic Semantic Segmentation with REL Depth Representation and Spherical Fusion CVPR2026
【速读】: This paper studies Panoramic Semantic Segmentation (PASS) and observes that existing methods underexploit panoramic image geometry, mostly relying on spherical geometry with RGB input or depth in its original or HHA format. The key is REL, a new cylindrical-coordinate depth representation composed of Rectified Depth, Elevation-Gained Vertical Inclination Angle, and Lateral Orientation Angle, which fully describes 3D space and the surface normal direction, combined with Spherical-dynamic Multi-Modal Fusion (SMMF), which applies different fusion strategies to different panoramic image regions to mitigate the breakage caused by unrolling the cylinder side surface in the equirectangular (ERP) projection, improving both performance and robustness. Experiments show a 2.35% average mIoU gain on the Stanford2D3D Panoramic dataset and roughly 70% lower performance variance under 3D disturbance.
链接: https://arxiv.org/abs/2601.16788
作者: Xuewei Li,Xinghan Bao,Zhimin Chen,Xi Li
机构: Shanghai DianJi University (上海电力大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: submitted to CVPR 2026
Abstract:As an important and challenging problem in computer vision, Panoramic Semantic Segmentation (PASS) aims to give complete scene perception based on an ultra-wide angle of view. Most PASS methods often focus on spherical geometry with RGB input or using the depth information in original or HHA format, which does not make full use of panoramic image geometry. To address these shortcomings, we propose REL-SF4PASS with our REL depth representation based on cylindrical coordinate and Spherical-dynamic Multi-Modal Fusion SMMF. REL is made up of Rectified Depth, Elevation-Gained Vertical Inclination Angle, and Lateral Orientation Angle, which fully represents 3D space in cylindrical coordinate style and the surface normal direction. SMMF aims to ensure the diversity of fusion for different panoramic image regions and reduce the breakage of cylinder side surface expansion in ERP projection, which uses different fusion strategies to match the different regions in panoramic images. Experimental results show that REL-SF4PASS considerably improves performance and robustness on popular benchmark, Stanford2D3D Panoramic datasets. It gains 2.35% average mIoU improvement on all 3 folds and reduces the performance variance by approximately 70% when facing 3D disturbance.
zh
[CV-13] SLD: Segmentation-Based Landmark Detection for Spinal Ligaments
【速读】: This paper addresses the insufficient accuracy of automatic identification of ligament attachment points for spinal biomechanical modeling, where existing methods are either limited to specific spinal regions or not precise enough. The key is a method that combines shape-driven 3D vertebra segmentation with domain-specific rule-based reasoning: vertebrae are first segmented by shape to localize their boundaries, and rules grounded in anatomical priors then identify different types of attachment points, enabling accurate detection across all spinal regions. Validation yields a mean absolute error (MAE) of 0.7 mm and a root mean square error (RMSE) of 1.1 mm, clearly outperforming existing approaches.
链接: https://arxiv.org/abs/2601.16782
作者: Lara Blomenkamp,Ivanna Kramer,Sabine Bauer,Theresa Schöche
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In biomechanical modeling, the representation of ligament attachments is crucial for a realistic simulation of the forces acting between the vertebrae. These forces are typically modeled as vectors connecting ligament landmarks on adjacent vertebrae, making precise identification of these landmarks a key requirement for constructing reliable spine models. Existing automated detection methods are either limited to specific spinal regions or lack sufficient accuracy. This work presents a novel approach for detecting spinal ligament landmarks, which first performs shape-based segmentation of 3D vertebrae and subsequently applies domain-specific rules to identify different types of attachment points. The proposed method outperforms existing approaches by achieving high accuracy and demonstrating strong generalization across all spinal regions. Validation on two independent spinal datasets from multiple patients yielded a mean absolute error (MAE) of 0.7 mm and a root mean square error (RMSE) of 1.1 mm.
zh
[CV-14] CASP: Few-Shot Class-Incremental Learning with CLS Token Attention Steering Prompts
【速读】: This paper targets the core challenge of few-shot class-incremental learning (FSCIL): adapting rapidly to new classes from very few samples while effectively mitigating catastrophic forgetting. Existing prompt-based methods make progress, but knowledge transfer and generalization remain limited in extreme few-shot increments. The key of the proposed CLS Token Attention Steering Prompts (CASP) is to introduce class-shared trainable bias parameters into the query, key, and value projections of the CLS token to explicitly modulate self-attention weights, strengthening the model's ability to filter task-irrelevant information; an attention perturbation strategy and Manifold Token Mixup in the shallow feature space further synthesize potential new-class features, improving generalization and reserving representation capacity for future tasks. The method requires no fine-tuning during incremental phases and significantly reduces parameter overhead.
链接: https://arxiv.org/abs/2601.16773
作者: Shuai Huang,Xuhan Lin,Yuwu Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Few-shot class-incremental learning (FSCIL) presents a core challenge in continual learning, requiring models to rapidly adapt to new classes with very limited samples while mitigating catastrophic forgetting. Recent prompt-based methods, which integrate pretrained backbones with task-specific prompts, have made notable progress. However, under extreme few-shot incremental settings, the model’s ability to transfer and generalize becomes critical, and it is thus essential to leverage pretrained knowledge to learn feature representations that can be shared across future categories during the base session. Inspired by the mechanism of the CLS token, which is similar to human attention and progressively filters out task-irrelevant information, we propose the CLS Token Attention Steering Prompts (CASP). This approach introduces class-shared trainable bias parameters into the query, key, and value projections of the CLS token to explicitly modulate the self-attention weights. To further enhance generalization, we also design an attention perturbation strategy and perform Manifold Token Mixup in the shallow feature space, synthesizing potential new class features to improve generalization and reserve the representation capacity for upcoming tasks. Experiments on the CUB200, CIFAR100, and ImageNet-R datasets demonstrate that CASP outperforms state-of-the-art methods in both standard and fine-grained FSCIL settings without requiring fine-tuning during incremental phases and while significantly reducing the parameter overhead.
zh
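The central mechanism, class-shared trainable biases added only to the CLS token's query/key/value projections, can be sketched by adding the bias at position 0 after the projection. The attention layer below is a plain single-layer implementation written for illustration; head counts, dimensions, and the exact place the biases enter are assumptions rather than the paper's configuration.

```python
# Sketch: learnable steering biases applied only to the CLS token's q/k/v (illustrative).
import torch
import torch.nn as nn

class CLSSteeredAttention(nn.Module):
    def __init__(self, dim=192, heads=3):
        super().__init__()
        self.heads, self.dh = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Class-shared steering biases that only touch the CLS token's projections.
        self.cls_bias_q = nn.Parameter(torch.zeros(dim))
        self.cls_bias_k = nn.Parameter(torch.zeros(dim))
        self.cls_bias_v = nn.Parameter(torch.zeros(dim))

    def forward(self, x):                      # x: (B, N, dim), token 0 is CLS
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        cls_mask = torch.zeros(1, N, 1, device=x.device)
        cls_mask[:, 0] = 1.0                   # the bias only reaches position 0 (CLS)
        q = q + cls_mask * self.cls_bias_q
        k = k + cls_mask * self.cls_bias_k
        v = v + cls_mask * self.cls_bias_v
        def split(t):                          # (B, N, D) -> (B, heads, N, dh)
            return t.reshape(B, N, self.heads, self.dh).transpose(1, 2)
        attn = torch.softmax(split(q) @ split(k).transpose(-2, -1) / self.dh ** 0.5, dim=-1)
        out = (attn @ split(v)).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)

x = torch.randn(4, 197, 192)                   # CLS + 196 patch tokens
print(CLSSteeredAttention()(x).shape)          # torch.Size([4, 197, 192])
```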
[CV-15] AutoRegressive Generation with B-rep Holistic Token Sequence Representation
【速读】: This paper addresses the fact that traditional boundary representation (B-rep) modeling keeps geometry and topology separate, which prevents the use of sequence-based generative frameworks such as transformer architectures. The key of BrepARG is to be the first to encode B-rep geometric and topological features into a unified, holistic token sequence of three token types: geometry and position tokens express geometric features, while face index tokens express topological relations. Geometry blocks (faces and edges) are constructed hierarchically and then sequenced, yielding a holistic token-sequence representation of the entire B-rep that supports transformer-based autoregressive generation and thus end-to-end B-rep generation.
链接: https://arxiv.org/abs/2601.16771
作者: Jiahao Li,Yunpeng Bai,Yongkang Dai,Hao Guo,Hongping Gan,Yilei Shi
机构: Northwestern Polytechnical University (西北工业大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Previous representation and generation approaches for the B-rep relied on graph-based representations that disentangle geometric and topological features through decoupled computational pipelines, thereby precluding the application of sequence-based generative frameworks, such as transformer architectures that have demonstrated remarkable performance. In this paper, we propose BrepARG, the first attempt to encode B-rep’s geometry and topology into a holistic token sequence representation, enabling sequence-based B-rep generation with an autoregressive architecture. Specifically, BrepARG encodes B-rep into 3 types of tokens: geometry and position tokens representing geometric features, and face index tokens representing topology. Then the holistic token sequence is constructed hierarchically, starting with constructing the geometry blocks (i.e., faces and edges) using the above tokens, followed by geometry block sequencing. Finally, we assemble the holistic sequence representation for the entire B-rep. We also construct a transformer-based autoregressive model that learns the distribution over holistic token sequences via next-token prediction, using a multi-layer decoder-only architecture with causal masking. Experiments demonstrate that BrepARG achieves state-of-the-art (SOTA) performance. BrepARG validates the feasibility of representing B-rep as holistic token sequences, opening new directions for B-rep generation.
zh
[CV-16] Flow Matching for Probabilistic Monocular 3D Human Pose Estimation
【速读】: This paper addresses the depth ambiguity of monocular 3D human pose estimation, which leads earlier methods to produce incorrect yet overconfident 3D estimates. To mitigate this, the paper proposes FMPose, a generative, probabilistic 3D human pose estimation method based on flow matching. The key is to use continuous normalizing flows to model the optimal transport from a simple prior distribution to the distribution of plausible 3D human poses, conditioned on 2D pose cues modeled with graph convolutional networks that aggregate features over learnable connections between joints. Compared with diffusion-based methods, FMPose generates poses faster and more accurately, with clear gains on the three common benchmarks Human3.6M, MPI-INF-3DHP, and 3DPW.
链接: https://arxiv.org/abs/2601.16763
作者: Cuong Le,Pavló Melnyk,Bastian Wandt,Mårten Wadenbäck
机构: Linköping University (林雪平大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 2 figures, 7 tables, under submission
Abstract:Recovering 3D human poses from a monocular camera view is a highly ill-posed problem due to the depth ambiguity. Earlier studies on 3D human pose lifting from 2D often contain incorrect-yet-overconfident 3D estimations. To mitigate the problem, emerging probabilistic approaches treat the 3D estimations as a distribution, taking into account the uncertainty measurement of the poses. Falling in a similar category, we proposed FMPose, a probabilistic 3D human pose estimation method based on the flow matching generative approach. Conditioned on the 2D cues, the flow matching scheme learns the optimal transport from a simple source distribution to the plausible 3D human pose distribution via continuous normalizing flows. The 2D lifting condition is modeled via graph convolutional networks, leveraging the learnable connections between human body joints as the graph structure for feature aggregation. Compared to diffusion-based methods, the FMPose with optimal transport produces faster and more accurate 3D pose generations. Experimental results show major improvements of our FMPose over current state-of-the-art methods on three common benchmarks for 3D human pose estimation, namely Human3.6M, MPI-INF-3DHP and 3DPW.
zh
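A conditional flow-matching objective of the kind FMPose builds on can be written in a few lines: sample a time t, linearly interpolate between a noise sample and a ground-truth 3D pose, and regress the constant velocity target. The tiny MLP velocity field and the joint count below are illustrative stand-ins for the paper's GCN-conditioned model.

```python
# Minimal conditional flow-matching training step for 3D pose vectors (illustrative).
import torch
import torch.nn as nn

J = 17                                   # joints; poses flattened to 3*J values
class VelocityField(nn.Module):
    """Stand-in for the paper's GCN-conditioned network: v_theta(x_t, t, 2D condition)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3 * J + 2 * J + 1, 256), nn.SiLU(),
                                 nn.Linear(256, 3 * J))
    def forward(self, x_t, t, cond2d):
        return self.net(torch.cat([x_t, cond2d, t], dim=-1))

def flow_matching_step(model, pose3d, pose2d):
    x0 = torch.randn_like(pose3d)                        # simple source distribution
    t = torch.rand(pose3d.size(0), 1)
    x_t = (1.0 - t) * x0 + t * pose3d                    # linear (optimal-transport) path
    target_v = pose3d - x0                               # constant velocity along the path
    pred_v = model(x_t, t, pose2d)
    return ((pred_v - target_v) ** 2).mean()

model = VelocityField()
loss = flow_matching_step(model, torch.randn(8, 3 * J), torch.randn(8, 2 * J))
loss.backward()
print("flow-matching loss:", loss.item())
# Sampling: integrate dx/dt = v_theta(x, t, cond) from t=0 to 1 starting from Gaussian noise.
```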
[CV-17] Curated endoscopic retrograde cholangiopancreatography images dataset
【速读】: This paper addresses the scarcity of public datasets for endoscopic retrograde cholangiopancreatography (ERCP), which limits the application of artificial intelligence to automated diagnosis. The key is a large, high-quality, rigorously curated ERCP image dataset comprising 19,018 raw and 19,317 processed images, of which 5,519 are manually annotated and reviewed by experienced gastroenterologists, ensuring professional and reliable labels. The collection is intended to serve as a benchmark resource for research on automatic ERCP analysis and the diagnosis of biliary and pancreatic diseases.
链接: https://arxiv.org/abs/2601.16759
作者: Alda João Andrade,Mónica Martins,André Ferreira,Tarcísio Araújo,Luís Lopes,Victor Alves
机构: Hospital Santa Luzia, ULS Alto Minho, Viana do Castelo, Portugal; University of Minho, Braga, 4710-057, Portugal; Essen University Hospital (AöR), University of Duisburg-Essen, Essen, Germany; Life and Health Sciences Research Institute (ICVS), School of Medicine, University of Minho, Braga, Portugal; ICVS/3B’s - PT Government Associate Laboratory, Braga/Guimarães
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Endoscopic Retrograde Cholangiopancreatography (ERCP) is a key procedure in the diagnosis and treatment of biliary and pancreatic diseases. Artificial intelligence has been pointed as one solution to automatize diagnosis. However, public ERCP datasets are scarce, which limits the use of such approach. Therefore, this study aims to help fill this gap by providing a large and curated dataset. The collection is composed of 19.018 raw images and 19.317 processed from 1.602 patients. 5.519 images are labeled, which provides a ready to use dataset. All images were manually inspected and annotated by two gastroenterologist with more than 5 years of experience and reviewed by another gastroenterologist with more than 20 years of experience, all with more than 400 ERCP procedures annually. The utility and validity of the dataset is proven by a classification experiment. This collection aims to provide or contribute for a benchmark in automatic ERCP analysis and diagnosis of biliary and pancreatic diseases.
zh
[CV-18] Automated Road Crack Localization to Guide Highway Maintenance
【速读】: This paper addresses the rising maintenance costs of highway pavements under the intensified temperature fluctuations brought by climate change, aiming to develop an efficient and targeted highway maintenance strategy. The key is an open-data detection framework that integrates airborne imagery with OpenStreetMap (OSM) data to fine-tune YOLOv11 for accurate highway crack localization, and further proposes a Swiss Relative Highway Crack Density (RHCD) index to guide nationwide maintenance decisions. Empirically, the RHCD shows only weak correlations with long-term land surface temperature amplitude (LT-LST-A) and traffic volume (TV), underlining its independent value, and high RHCD values concentrate near urban centers and intersections, providing contextual validation of the predictions.
链接: https://arxiv.org/abs/2601.16737
作者: Steffen Knoblauch,Ram Kumar Muthusamy,Pedram Ghamisi,Alexander Zipf
机构: HeiGIT at Heidelberg University (海德堡大学HeiGIT); GIScience Research Group, Heidelberg University (海德堡大学地理信息科学研究组); Interdisciplinary Centre of Scientific Computing (IWR), Heidelberg University (海德堡大学跨学科科学计算中心); Lancaster University (兰卡斯特大学); Helmholtz-Zentrum Dresden-Rossendorf (HZDR), Helmholtz Institute Freiberg for Resource Technology (德国亥姆霍兹德累斯顿研究中心,弗莱贝格资源技术亥姆霍兹研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 9 figures
Abstract:Highway networks are crucial for economic prosperity. Climate change-induced temperature fluctuations are exacerbating stress on road pavements, resulting in elevated maintenance costs. This underscores the need for targeted and efficient maintenance strategies. This study investigates the potential of open-source data to guide highway infrastructure maintenance. The proposed framework integrates airborne imagery and OpenStreetMap (OSM) to fine-tune YOLOv11 for highway crack localization. To demonstrate the framework’s real-world applicability, a Swiss Relative Highway Crack Density (RHCD) index was calculated to inform nationwide highway maintenance. The crack classification model achieved an F1-score of 0.84 for the positive class (crack) and 0.97 for the negative class (no crack). The Swiss RHCD index exhibited weak correlations with Long-term Land Surface Temperature Amplitudes (LT-LST-A) (Pearson’s r = -0.05) and Traffic Volume (TV) (Pearson’s r = 0.17), underlining the added value of this novel index for guiding maintenance over other data. Significantly high RHCD values were observed near urban centers and intersections, providing contextual validation for the predictions. These findings highlight the value of open-source data sharing to drive innovation, ultimately enabling more efficient solutions in the public sector.
zh
[CV-19] A Step to Decouple Optimization in 3DGS
【速读】: This paper addresses two overlooked issues in the optimization of 3D Gaussian Splatting (3DGS): (i) update-step coupling, which induces optimizer-state rescaling and costly attribute updates outside the viewpoints, and (ii) gradient coupling in the moment, which can lead to under- or over-effective regularization. By revisiting the 3DGS optimization pipeline, the authors decouple the process into three components: Sparse Adam, Re-State Regularization, and Decoupled Attribute Regularization. The key is to identify and separate the harmful couplings and then, guided by empirical analysis, re-couple the beneficial components, yielding the AdamW-GS optimizer, which improves optimization efficiency and representation effectiveness simultaneously.
链接: https://arxiv.org/abs/2601.16736
作者: Renjie Ding,Yaonan Wang,Min Liu,Jialin Zhu,Jiazheng Wang,Jiahao Zhao,Wenting Shen,Feixiang He,Xiang Che
机构: National Engineering Research Center of Robot Visual Perception and Control Technology, School of Artificial Intelligence and Robitics, Hunan University (湖南大学); Baidu Inc. (百度公司); Central South University (中南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D Gaussian Splatting (3DGS) has emerged as a powerful technique for real-time novel view synthesis. As an explicit representation optimized through gradient propagation among primitives, optimization widely accepted in deep neural networks (DNNs) is actually adopted in 3DGS, such as synchronous weight updating and Adam with the adaptive gradient. However, considering the physical significance and specific design in 3DGS, there are two overlooked details in the optimization of 3DGS: (i) update step coupling, which induces optimizer state rescaling and costly attribute updates outside the viewpoints, and (ii) gradient coupling in the moment, which may lead to under- or over-effective regularization. Nevertheless, such a complex coupling is under-explored. After revisiting the optimization of 3DGS, we take a step to decouple it and recompose the process into: Sparse Adam, Re-State Regularization and Decoupled Attribute Regularization. Taking a large number of experiments under the 3DGS and 3DGS-MCMC frameworks, our work provides a deeper understanding of these components. Finally, based on the empirical analysis, we re-design the optimization and propose AdamW-GS by re-coupling the beneficial components, under which better optimization efficiency and representation effectiveness are achieved simultaneously.
zh
[CV-20] Using Shadows in Circular Synthetic Aperture Sonar Imaging for Target Analysis
【速读】: This paper addresses the loss of shadow information in circular synthetic aperture sonar (CSAS) imaging caused by the parallax of the circular survey path, which hampers target shape recognition and 3D reconstruction. The key is a three-stage pipeline: sub-aperture filtering first produces a collection of images from different viewpoints along the circular trajectory; Fixed Focus Shadow Enhancement (FFSE) then extracts sharp shadow information; finally, a space-carving algorithm infers the 3D shape of the target from the segmented shadows. This recovers the shadow cues discarded by conventional CSAS imaging and markedly improves target analysis and 3D reconstruction.
链接: https://arxiv.org/abs/2601.16733
作者: Yann Le Gall,Nicolas Burlet,Mathieu Simon,Fabien Novella,Samantha Dugelay,Jean-Philippe Malkasse
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注:
Abstract:Circular Synthetic Aperture Sonar (CSAS) provides a 360° azimuth view of the seabed, surpassing the limited aperture and mono-view image of conventional side-scan SAS. This makes CSAS a valuable tool for target recognition in mine warfare where the diversity of point of view is essential for reducing false alarms. CSAS processing typically produces a very high-resolution two-dimensional image. However, the parallax introduced by the circular displacement of the illuminator fill-in the shadow regions, and the shadow cast by an object on the seafloor is lost in favor of azimuth coverage and resolution. Yet the shadows provide complementary information on target shape useful for target recognition. In this paper, we explore a way to retrieve shadow information from CSAS data to improve target analysis and carry 3D reconstruction. Sub-aperture filtering is used to get a collection of images at various points of view along the circular trajectory and fixed focus shadow enhancement (FFSE) is applied to obtain sharp shadows. An interactive interface is also proposed to allow human operators to visualize these shadows along the circular trajectory. A space-carving reconstruction method is applied to infer the 3D shape of the object from the segmented shadows. The results demonstrate the potential of shadows in circular SAS for improving target analysis and 3D reconstruction.
zh
[CV-21] CER-HV: A CER-Based Human-in-the-Loop Framework for Cleaning Datasets Applied to Arabic-Script HTR
【速读】: This paper addresses the lag of Arabic-script handwritten text recognition (HTR) behind Latin-script HTR, tracing a major cause to widespread label errors in existing datasets that have gone largely undetected and uncorrected. The key is CER-HV (CER-based Ranking with Human Verification), which combines a CER-based noise detector built on a carefully configured Convolutional Recurrent Neural Network (CRNN) with early stopping to avoid overfitting noisy samples, and a human-in-the-loop (HITL) verification step that reviews and corrects high-risk samples. The framework substantially improves dataset quality across Arabic-script languages (Arabic, Persian, Pashto, Urdu, Ajami), achieves state-of-the-art character error rates (CER) even without cleaning, and further improves evaluation metrics after cleaning.
链接: https://arxiv.org/abs/2601.16713
作者: Sana Al-azzawi,Elisa Barney,Marcus Liwicki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Handwritten text recognition (HTR) for Arabic-script languages still lags behind Latin-script HTR, despite recent advances in model architectures, datasets, and benchmarks. We show that data quality is a significant limiting factor in many published datasets and propose CER-HV (CER-based Ranking with Human Verification) as a framework to detect and clean label errors. CER-HV combines a CER-based noise detector, built on a carefully configured Convolutional Recurrent Neural Network (CRNN) with early stopping to avoid overfitting noisy samples, and a human-in-the-loop (HITL) step that verifies high-ranking samples. The framework reveals that several existing datasets contain previously underreported problems, including transcription, segmentation, orientation, and non-text content errors. These have been identified with up to 90 percent precision in the Muharaf and 80-86 percent in the PHTI datasets. We also show that our CRNN achieves state-of-the-art performance across five of the six evaluated datasets, reaching 8.45 percent Character Error Rate (CER) on KHATT (Arabic), 8.26 percent on PHTI (Pashto), 10.66 percent on Ajami, and 10.11 percent on Muharaf (Arabic), all without any data cleaning. We establish a new baseline of 11.3 percent CER on the PHTD (Persian) dataset. Applying CER-HV improves the evaluation CER by 0.3-0.6 percent on the cleaner datasets and 1.0-1.8 percent on the noisier ones. Although our experiments focus on documents written in an Arabic-script language, including Arabic, Persian, Urdu, Ajami, and Pashto, the framework is general and can be applied to other text recognition datasets.
zh
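The detector side of CER-HV reduces to scoring every training sample by the character error rate between the model's prediction and its stored transcription, then sending the highest-CER samples to human review. The sketch below implements CER with a standard edit distance and ranks toy samples; the review threshold is an arbitrary placeholder, and Latin strings stand in for Arabic-script text.

```python
# CER-based ranking of suspicious transcriptions for human verification (sketch).
def edit_distance(a: str, b: str) -> int:
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def cer(prediction: str, reference: str) -> float:
    return edit_distance(prediction, reference) / max(len(reference), 1)

# Each sample: (id, model prediction, dataset label).
samples = [
    ("img_001", "handwritten line", "handwritten line"),
    ("img_002", "quick brown fox", "qick brown fx"),        # label likely has typos
    ("img_003", "completely different", "unrelated text"),  # label likely wrong or mis-segmented
]
ranked = sorted(((cer(p, r), sid) for sid, p, r in samples), reverse=True)
for score, sid in ranked:
    flag = "-> human verification" if score > 0.10 else "ok"
    print(f"{sid}: CER={score:.2f} {flag}")
```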
[CV-22] Affinity Contrastive Learning for Skeleton-based Human Activity Understanding
【速读】: This paper addresses the failure of existing contrastive learning methods for skeleton-based activity understanding to exploit structural inter-class similarities and their neglect of anomalous positive samples. The key of the proposed Affinity Contrastive Learning Network (ACLNet) is threefold: an affinity metric refines similarity measurement and forms activity superclasses that provide more informative and discriminative contrastive signals; a dynamic temperature schedule adaptively adjusts the penalty strength for different superclasses; and a margin-based contrastive strategy improves the separation of hard positive and negative samples within classes. Experiments on multiple benchmarks validate the method for skeleton-based action recognition, gait recognition, and person re-identification.
链接: https://arxiv.org/abs/2601.16694
作者: Hongda Liu,Yunfan Liu,Min Ren,Lin Sui,Yunlong Wang,Zhenan Sun
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences (中国科学院大学电子、电气与通信工程学院); School of Artificial Intelligence, Beijing University of Posts and Telecommunications (北京邮电大学人工智能学院); Moonshot AI (月之暗面)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by TBIOM
Abstract:In skeleton-based human activity understanding, existing methods often adopt the contrastive learning paradigm to construct a discriminative feature space. However, many of these approaches fail to exploit the structural inter-class similarities and overlook the impact of anomalous positive samples. In this study, we introduce ACLNet, an Affinity Contrastive Learning Network that explores the intricate clustering relationships among human activity classes to improve feature discrimination. Specifically, we propose an affinity metric to refine similarity measurements, thereby forming activity superclasses that provide more informative contrastive signals. A dynamic temperature schedule is also introduced to adaptively adjust the penalty strength for various superclasses. In addition, we employ a margin-based contrastive strategy to improve the separation of hard positive and negative samples within classes. Extensive experiments on NTU RGB+D 60, NTU RGB+D 120, Kinetics-Skeleton, PKU-MMD, FineGYM, and CASIA-B demonstrate the superiority of our method in skeleton-based action recognition, gait recognition, and person re-identification. The source code is available at this https URL.
zh
[CV-23] ReWeaver: Towards Simulation-Ready and Topology-Accurate Garment Reconstruction CVPR2026
【速读】:该论文旨在解决现有3D服装重建方法依赖非结构化表示(如3D高斯泼溅)导致的拓扑结构和缝合线信息不准确的问题,从而难以支持高保真物理仿真与机器人操作。其解决方案的关键在于提出ReWeaver框架,能够从稀疏多视角RGB图像(最少四张)中同时预测二维UV空间与三维空间中的衣片(panel)及缝合线(seam),并精确建模它们之间的连接关系,生成结构化的2D–3D服装表示,显著提升拓扑准确性、几何对齐度与缝合-衣片一致性。
链接: https://arxiv.org/abs/2601.16672
作者: Ming Li,Hui Shan,Kai Zheng,Chentao Shen,Siyu Liu,Yanwei Fu,Zhen Chen,Xiangru Huang
机构: Zhejiang University (浙江大学); Shanghai Innovation Institute; Westlake University; Fudan University (复旦大学); Adobe; Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 8 figures, Submitted to CVPR 2026
Abstract:High-quality 3D garment reconstruction plays a crucial role in mitigating the sim-to-real gap in applications such as digital avatars, virtual try-on and robotic manipulation. However, existing garment reconstruction methods typically rely on unstructured representations, such as 3D Gaussian Splats, struggling to provide accurate reconstructions of garment topology and sewing structures. As a result, the reconstructed outputs are often unsuitable for high-fidelity physical simulation. We propose ReWeaver, a novel framework for topology-accurate 3D garment and sewing pattern reconstruction from sparse multi-view RGB images. Given as few as four input views, ReWeaver predicts seams and panels as well as their connectivities in both the 2D UV space and the 3D space. The predicted seams and panels align precisely with the multi-view images, yielding structured 2D–3D garment representations suitable for 3D perception, high-fidelity physical simulation, and robotic manipulation. To enable effective training, we construct a large-scale dataset GCD-TS, comprising multi-view RGB images, 3D garment geometries, textured human body meshes and annotated sewing patterns. The dataset contains over 100,000 synthetic samples covering a wide range of complex geometries and topologies. Extensive experiments show that ReWeaver consistently outperforms existing methods in terms of topology accuracy, geometry alignment and seam-panel consistency.
zh
[CV-24] ReViP: Reducing False Completion in Vision-Language-Action Models with Vision-Proprioception Rebalance
【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人操作中因模态失衡导致的状态主导偏差(state-dominant bias)和虚假完成(false completion)问题,即模型过度依赖内部状态信息而忽视视觉证据,从而在执行失败时仍错误地预测任务完成。解决方案的关键在于提出一种新的VLA框架ReViP(Vision-Proprioception Rebalance),其核心创新是引入任务感知的环境先验(task-aware environment priors),通过外部视觉语言模型(VLM)作为任务阶段观察器实时提取任务中心的视觉线索,并驱动视觉-本体感觉特征级线性调制(Vision-Proprioception Feature-wise Linear Modulation),以自适应调节语义感知与本体感觉动态之间的耦合强度,从而增强视觉锚定(visual grounding)并提升扰动下的鲁棒性。
链接: https://arxiv.org/abs/2601.16667
作者: Zhuohao Li,Yinghao Li,Jian-Jian Jiang,Lang Zhou,Tianyu Zhang,Wei-Shi Zheng
机构: Sun Yat-sen University (中山大学); Shenzhen Loop Area Institute (深圳环区研究院); Beijing Institute of Technology (北京理工大学); Peng Cheng Laboratory, China (鹏城实验室,中国); Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China (教育部机器智能与先进计算重点实验室,中国); Guangdong Province Key Laboratory of Information Security Technology, China (广东省信息安全技术重点实验室,中国)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-Language-Action (VLA) models have advanced robotic manipulation by combining vision, language, and proprioception to predict actions. However, previous methods fuse proprioceptive signals directly with VLM-encoded vision-language features, resulting in state-dominant bias and false completions despite visible execution failures. We attribute this to modality imbalance, where policies over-rely on internal state while underusing visual evidence. To address this, we present ReViP, a novel VLA framework with Vision-Proprioception Rebalance to enhance visual grounding and robustness under perturbations. The key insight is to introduce auxiliary task-aware environment priors to adaptively modulate the coupling between semantic perception and proprioceptive dynamics. Specifically, we use an external VLM as a task-stage observer to extract real-time task-centric visual cues from visual observations, which drive a Vision-Proprioception Feature-wise Linear Modulation to enhance environmental awareness and reduce state-driven errors. Moreover, to evaluate false completion, we propose the first False-Completion Benchmark Suite built on LIBERO with controlled settings such as Object-Drop. Extensive experiments show that ReViP effectively reduces false-completion rates and improves success rates over strong VLA baselines on our suite, with gains extending to LIBERO, RoboTwin 2.0, and real-world evaluations.
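下面用一个简化的 PyTorch 片段示意摘要中“视觉-本体感觉特征级线性调制(FiLM)”的思路(非官方实现,模块接口与维度均为假设):由任务相关的视觉线索生成逐通道的缩放与偏移,再调制本体感觉特征。

```python
# 简化示意:用任务相关视觉线索对本体感觉特征做 FiLM 调制。
import torch
import torch.nn as nn

class VisionProprioFiLM(nn.Module):
    def __init__(self, cue_dim: int, proprio_dim: int):
        super().__init__()
        # 由视觉线索预测逐通道的缩放 gamma 与偏移 beta
        self.to_gamma_beta = nn.Linear(cue_dim, 2 * proprio_dim)

    def forward(self, visual_cue: torch.Tensor, proprio_feat: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(visual_cue).chunk(2, dim=-1)
        # gamma 以 1 为基准,避免训练初期破坏原有本体感觉特征
        return (1.0 + gamma) * proprio_feat + beta

# 用法示例:batch=4,视觉线索 256 维(可来自外部 VLM 的任务阶段描述编码),本体感觉特征 64 维
film = VisionProprioFiLM(cue_dim=256, proprio_dim=64)
cue, proprio = torch.randn(4, 256), torch.randn(4, 64)
print(film(cue, proprio).shape)   # torch.Size([4, 64])
```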
zh
[CV-25] Reliable Brain Tumor Segmentation Based on Spiking Neural Networks with Efficient Training
【速读】:该论文旨在解决3D脑肿瘤分割任务中传统深度学习模型计算成本高、能耗大以及缺乏可靠不确定性估计的问题,尤其针对医疗物联网(Medical IoT)和即时检测(Point-of-Care)系统对低功耗、高可靠性模型的需求。其解决方案的关键在于提出一种基于脉冲神经网络(Spiking Neural Networks, SNNs)的可靠且节能框架:通过多视角(矢状面、冠状面、轴面)集成SNN模型实现体素级不确定性估计以提升分割鲁棒性,并引入前向时域传播(Forward Propagation Through Time, FPTT)方法显著降低训练过程中的计算复杂度,实验表明该方法在BraTS数据集上实现了与现有方法相当的精度,同时减少了87%的浮点运算次数(FLOPs),验证了SNN在医疗边缘计算场景下的可行性与优势。
链接: https://arxiv.org/abs/2601.16652
作者: Aurora Pia Ghiardelli,Guangzhi Tang,Tao Sun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注: Accepted at ISBI 2026
Abstract:We propose a reliable and energy-efficient framework for 3D brain tumor segmentation using spiking neural networks (SNNs). A multi-view ensemble of sagittal, coronal, and axial SNN models provides voxel-wise uncertainty estimation and enhances segmentation robustness. To address the high computational cost in training SNN models for semantic image segmentation, we employ Forward Propagation Through Time (FPTT), which maintains temporal learning efficiency with significantly reduced computational cost. Experiments on the Multimodal Brain Tumor Segmentation Challenges (BraTS 2017 and BraTS 2023) demonstrate competitive accuracy, well-calibrated uncertainty, and an 87% reduction in FLOPs, underscoring the potential of SNNs for reliable, low-power medical IoT and Point-of-Care systems.
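下面的 NumPy 片段示意“多视角(矢状/冠状/轴面)集成 + 体素级不确定性”这一思路(非官方实现,此处用随机数代替三个 SNN 模型的输出):

```python
# 概念示意:用三个视角模型的概率图做集成,用成员间方差和预测熵刻画体素级不确定性。
import numpy as np

def ensemble_with_uncertainty(prob_sagittal, prob_coronal, prob_axial):
    """输入:三个形状相同的概率体 (D, H, W),取值范围 [0, 1]。
    返回:(集成概率, 视角间方差, 预测熵)。"""
    stack = np.stack([prob_sagittal, prob_coronal, prob_axial], axis=0)
    mean_prob = stack.mean(axis=0)                 # 集成后的肿瘤概率
    variance = stack.var(axis=0)                   # 三个视角间的分歧
    entropy = -(mean_prob * np.log(mean_prob + 1e-8)
                + (1 - mean_prob) * np.log(1 - mean_prob + 1e-8))
    return mean_prob, variance, entropy

# 用法示例(随机数代替真实模型输出)
d, h, w = 8, 64, 64
probs = [np.random.rand(d, h, w) for _ in range(3)]
mean_prob, var_map, ent_map = ensemble_with_uncertainty(*probs)
print(mean_prob.shape, float(var_map.max()), float(ent_map.max()))
```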
zh
[CV-26] Edge-Aware Image Manipulation via Diffusion Models with a Novel Structure-Preservation Loss WACV2026
【速读】:该论文旨在解决基于潜在扩散模型(Latent Diffusion Models, LDMs)的图像编辑中难以保持像素级边缘结构的问题,这对摄影级风格迁移或图像色调调整等任务至关重要。其解决方案的关键在于提出一种无需训练的结构保真损失(Structure Preservation Loss, SPL),该损失通过局部线性模型量化输入图像与编辑后图像之间的结构差异,并将其直接集成到扩散模型的生成过程中,以确保结构一致性;此外,还辅以缓解LDM解码失真的后处理步骤、用于精确编辑定位的掩码策略以及颜色保真损失,从而在不破坏未编辑区域色彩的前提下提升整体编辑质量。
链接: https://arxiv.org/abs/2601.16645
作者: Minsu Gong,Nuri Ryu,Jungseul Ok,Sunghyun Cho
机构: Planby Technologies; POSTECH(浦项科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to WACV 2026
Abstract:Recent advances in image editing leverage latent diffusion models (LDMs) for versatile, text-prompt-driven edits across diverse tasks. Yet, maintaining pixel-level edge structures-crucial for tasks such as photorealistic style transfer or image tone adjustment-remains as a challenge for latent-diffusion-based editing. To overcome this limitation, we propose a novel Structure Preservation Loss (SPL) that leverages local linear models to quantify structural differences between input and edited images. Our training-free approach integrates SPL directly into the diffusion model’s generative process to ensure structural fidelity. This core mechanism is complemented by a post-processing step to mitigate LDM decoding distortions, a masking strategy for precise edit localization, and a color preservation loss to preserve hues in unedited areas. Experiments confirm SPL enhances structural fidelity, delivering state-of-the-art performance in latent-diffusion-based image editing. Our code will be publicly released at this https URL.
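下面给出一个概念性示意(非论文官方的 SPL 实现,窗口半径与正则项均为假设):在每个局部窗口内拟合 edited ≈ a·input + b 的局部线性模型,用残差的均方近似刻画结构差异。

```python
# 概念示意:局部线性模型残差作为“结构保持损失”的一种近似。
import torch
import torch.nn.functional as F

def box_filter(x, r):
    k = 2 * r + 1
    return F.avg_pool2d(x, kernel_size=k, stride=1, padding=r, count_include_pad=False)

def structure_preservation_loss(inp, edited, r=4, eps=1e-4):
    """inp, edited: (B, 1, H, W) 的灰度图;r 为局部窗口半径。"""
    mean_i = box_filter(inp, r)
    mean_e = box_filter(edited, r)
    cov_ie = box_filter(inp * edited, r) - mean_i * mean_e
    var_i  = box_filter(inp * inp, r) - mean_i * mean_i
    a = cov_ie / (var_i + eps)          # 每个窗口的局部线性系数
    b = mean_e - a * mean_i
    residual = edited - (a * inp + b)   # 局部线性模型无法解释的部分,对应结构改变
    return (residual ** 2).mean()

# 用法示例:仅整体调色(线性变换)时结构一致,损失应接近 0
inp = torch.rand(1, 1, 64, 64)
edited = inp * 0.8 + 0.1
print(structure_preservation_loss(inp, edited).item())
```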
zh
[CV-27] SCHIGAND: A Synthetic Facial Generation Mode Pipeline
【速读】:该论文旨在解决当前生物特征识别系统在训练与测试中面临的隐私法规限制、数据稀缺性及伦理问题,同时克服现有生成模型在真实性、多样性与身份保持之间难以平衡的挑战。其解决方案的关键在于提出SCHIGAND——一个集成StyleCLIP、HyperStyle、InterfaceGAN和扩散模型(Diffusion models)的新型合成人脸生成流程,通过增强身份一致性并生成类内真实变异和类间区分度,显著提升了合成人脸数据集的质量与可控性,从而满足生物特征测试需求。
链接: https://arxiv.org/abs/2601.16627
作者: Ananya Kadali,Sunnie Jehan-Morrison,Orasiki Wellington,Barney Evans,Precious Durojaiye,Richard Guest
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注:
Abstract:The growing demand for diverse and high-quality facial datasets for training and testing biometric systems is challenged by privacy regulations, data scarcity, and ethical concerns. Synthetic facial images offer a potential solution, yet existing generative models often struggle to balance realism, diversity, and identity preservation. This paper presents SCHIGAND, a novel synthetic face generation pipeline integrating StyleCLIP, HyperStyle, InterfaceGAN, and Diffusion models to produce highly realistic and controllable facial datasets. SCHIGAND enhances identity preservation while generating realistic intra-class variations and maintaining inter-class distinctiveness, making it suitable for biometric testing. The generated datasets were evaluated using ArcFace, a leading facial verification model, to assess their effectiveness in comparison to real-world facial datasets. Experimental results demonstrate that SCHIGAND achieves a balance between image quality and diversity, addressing key limitations of prior generative models. This research highlights the potential of SCHIGAND to supplement and, in some cases, replace real data for facial biometric applications, paving the way for privacy-compliant and scalable solutions in synthetic dataset generation.
zh
[CV-28] Boundary and Position Information Mining for Aerial Small Object Detection
【速读】:该论文旨在解决无人机(Unmanned Aerial Vehicle, UAV)应用场景中,小目标检测因尺度不平衡和边缘模糊而导致的准确率下降问题。其解决方案的关键在于提出了一种边界与位置信息挖掘(Boundary and Position Information Mining, BPIM)框架,通过引入位置信息引导(Position Information Guidance, PIG)模块获取目标位置线索、边界信息引导(Boundary Information Guidance, BIG)模块提取边缘特征,并结合跨尺度融合(Cross Scale Fusion, CSF)、三重特征融合(Three Feature Fusion, TFF)以及自适应权重融合(Adaptive Weight Fusion, AWF)机制,实现多层级特征的协同优化。该方法利用注意力机制和跨尺度特征融合策略,有效整合边界、位置与尺度信息,从而显著提升小目标的感知能力和上下文特征判别力,在保持计算复杂度可控的前提下取得优于主流方法的检测性能。
链接: https://arxiv.org/abs/2601.16617
作者: Rongxin Huang,Guangfeng Lin,Wenbo Zhou,Zhirong Li,Wenhuan Wu
机构: Xi’an University of Technology (西安理工大学); Hubei University of Automotive Technology (湖北汽车工业学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 10 figures
Abstract:Unmanned Aerial Vehicle (UAV) applications have become increasingly prevalent in aerial photography and object recognition. However, there are major challenges to accurately capturing small targets in object detection due to the imbalanced scale and the blurred edges. To address these issues, boundary and position information mining (BPIM) framework is proposed for capturing object edge and location cues. The proposed BPIM includes position information guidance (PIG) module for obtaining location information, boundary information guidance (BIG) module for extracting object edge, cross scale fusion (CSF) module for gradually assembling the shallow layer image feature, three feature fusion (TFF) module for progressively combining position and boundary information, and adaptive weight fusion (AWF) module for flexibly merging the deep layer semantic feature. Therefore, BPIM can integrate boundary, position, and scale information in image for small object detection using attention mechanisms and cross-scale feature fusion strategies. Furthermore, BPIM not only improves the discrimination of the contextual feature by adaptive weight fusion with boundary, but also enhances small object perceptions by cross-scale position fusion. On the VisDrone2021, DOTA1.0, and WiderPerson datasets, experimental results show the better performances of BPIM compared to the baseline Yolov5-P2, and obtains the promising performance in the state-of-the-art methods with comparable computation load.
zh
[CV-29] A Lightweight Medical Image Classification Framework via Self-Supervised Contrastive Learning and Quantum-Enhanced Feature Modeling
【速读】:该论文旨在解决医疗图像分析中因标注数据稀缺、计算资源受限以及模型泛化能力不足所导致的性能瓶颈问题。其解决方案的关键在于提出了一种轻量级的医学图像分类框架,该框架融合了自监督对比学习与量子增强特征建模:首先利用SimCLR风格的自监督预训练策略在无标签数据上对MobileNetV2进行初始化,以缓解标注依赖;随后嵌入一个轻量级参数化量子电路(Parameterized Quantum Circuit, PQC)作为量子特征增强模块,构建经典-量子混合架构,并在有限标注数据上进行微调。该方法在仅约2–3百万参数和低计算成本下,显著优于未采用自监督或量子增强的经典基线模型,在准确率(Accuracy)、AUC和F1分数等指标上均实现提升,同时特征可视化验证了其更强的判别能力和表示稳定性。
链接: https://arxiv.org/abs/2601.16608
作者: Jingsong Xia,Siqi Wang
机构: The Second Clinical Medical College, Nanjing Medical University (南京医科大学第二临床医学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Intelligent medical image analysis is essential for clinical decision support but is often limited by scarce annotations, constrained computational resources, and suboptimal model generalization. To address these challenges, we propose a lightweight medical image classification framework that integrates self-supervised contrastive learning with quantum-enhanced feature modeling. MobileNetV2 is employed as a compact backbone and pretrained using a SimCLR-style self-supervised paradigm on unlabeled images. A lightweight parameterized quantum circuit (PQC) is embedded as a quantum feature enhancement module, forming a hybrid classical-quantum architecture, which is subsequently fine-tuned on limited labeled data. Experimental results demonstrate that, with only approximately 2-3 million parameters and low computational cost, the proposed method consistently outperforms classical baselines without self-supervised learning or quantum enhancement in terms of Accuracy, AUC, and F1-score. Feature visualization further indicates improved discriminability and representation stability. Overall, this work provides a practical and forward-looking solution for high-performance medical artificial intelligence under resource-constrained settings.
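论文中“SimCLR 风格自监督预训练”通常使用 NT-Xent 对比损失,下面给出一个通用的 PyTorch 写法供参考(非论文官方代码,温度等超参为示例值):

```python
# 示意代码:SimCLR 风格自监督预训练常用的 NT-Xent 对比损失。
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """z1, z2: 同一批图像两种增强视图的投影特征,形状均为 (N, D)。"""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)       # (2N, D)
    n = z1.size(0)
    sim = z @ z.t() / temperature                            # 余弦相似度 / 温度
    sim.fill_diagonal_(float("-inf"))                        # 屏蔽与自身的相似度
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])  # 正样本是另一视图
    return F.cross_entropy(sim, targets)

# 用法示例:batch=16,投影特征 128 维
z1, z2 = torch.randn(16, 128), torch.randn(16, 128)
print(nt_xent_loss(z1, z2).item())
```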
zh
[CV-30] X-Aligner: Composed Visual Retrieval without the Bells and Whistles
【速读】:该论文旨在解决当前组成视频检索(Composed Video Retrieval, CoVR)框架在融合视觉与文本查询时仅通过单阶段特征融合导致性能提升有限的问题。其解决方案的关键在于提出一种基于视觉语言模型(Vision Language Models, VLMs)的新颖两阶段训练框架,引入一个名为X-Aligner的交叉注意力模块,该模块通过多层交叉注意力机制逐步融合并对齐视觉与文本模态表示至目标视频的语义空间;同时,将视觉查询的描述性字幕作为额外输入以增强多模态查询表达能力,并采用分阶段训练策略——第一阶段仅训练新引入模块以保留预训练VLM的表征能力,第二阶段再微调文本查询编码器,从而实现更高效且具泛化性的多模态对齐与检索性能。
链接: https://arxiv.org/abs/2601.16582
作者: Yuqian Zheng,Mariana-Iuliana Georgescu
机构: Technical University of Munich (慕尼黑工业大学); Helmholtz Munich (亥姆霍兹慕尼黑研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages
Abstract:Composed Video Retrieval (CoVR) facilitates video retrieval by combining visual and textual queries. However, existing CoVR frameworks typically fuse multimodal inputs in a single stage, achieving only marginal gains over initial baseline. To address this, we propose a novel CoVR framework that leverages the representational power of Vision Language Models (VLMs). Our framework incorporates a novel cross-attention module X-Aligner, composed of cross-attention layers that progressively fuse visual and textual inputs and align their multimodal representation with that of the target video. To further enhance the representation of the multimodal query, we incorporate the caption of the visual query as an additional input. The framework is trained in two stages to preserve the pretrained VLM representation. In the first stage, only the newly introduced module is trained, while in the second stage, the textual query encoder is also fine-tuned. We implement our framework on top of BLIP-family architecture, namely BLIP and BLIP-2, and train it on the Webvid-CoVR data set. In addition to in-domain evaluation on Webvid-CoVR-Test, we perform zero-shot evaluations on the Composed Image Retrieval (CIR) data sets CIRCO and Fashion-IQ. Our framework achieves state-of-the-art performance on CoVR obtaining a Recall@1 of 63.93% on Webvid-CoVR-Test, and demonstrates strong zero-shot generalization on CIR tasks.
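下面用 PyTorch 写一个简化的“跨注意力融合块”来示意 X-Aligner 的逐步融合思路(非官方实现,层数、维度与输入含义均为假设):

```python
# 概念示意:文本修改指令作为 query,视觉查询特征作为 key/value,逐层交叉注意力融合。
import torch
import torch.nn as nn

class CrossAttentionFusionBlock(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, query_tokens, context_tokens):
        # query_tokens: (B, Lq, D) 例如文本修改指令的 token 特征
        # context_tokens: (B, Lk, D) 例如视觉查询(视频帧/字幕)的 token 特征
        attn_out, _ = self.attn(self.norm1(query_tokens), context_tokens, context_tokens)
        x = query_tokens + attn_out
        return x + self.ffn(self.norm2(x))

# 用法示例:堆叠两层即可实现“逐步融合”
blocks = nn.ModuleList([CrossAttentionFusionBlock() for _ in range(2)])
text_q, visual_ctx = torch.randn(2, 16, 256), torch.randn(2, 49, 256)
x = text_q
for blk in blocks:
    x = blk(x, visual_ctx)
print(x.shape)   # torch.Size([2, 16, 256])
```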
zh
[CV-31] HA2F: Dual-module Collaboration-Guided Hierarchical Adaptive Aggregation Framework for Remote Sensing Change Detection
【速读】:该论文针对遥感变化检测(Remote Sensing Change Detection, RSCD)中现有方法存在的问题——即局部特征提取与全局图像处理方式导致的跨时相特征匹配偏差,以及对辐射和几何噪声敏感的问题——提出了一种双模块协同引导的分层自适应聚合框架(HA2F)。其关键在于:一是动态分层特征校准模块(DHFCM),通过感知特征选择机制动态融合相邻层级特征,抑制无关差异以缓解多时相特征对齐偏差;二是噪声自适应特征精化模块(NAFRM),利用双特征选择机制突出变化敏感区域并生成空间掩码,有效抑制非相关区域或阴影干扰。该方案在多个基准数据集上实现了最优性能,兼顾精度与计算效率。
链接: https://arxiv.org/abs/2601.16573
作者: Shuying Li,Yuchen Wang,San Zhang,Chuang Yang
机构: Xi’an University of Posts and Telecommunications (西安邮电大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Remote sensing change detection (RSCD) aims to identify the spatio-temporal changes of land cover, providing critical support for multi-disciplinary applications (e.g., environmental monitoring, disaster assessment, and climate change studies). Existing methods focus either on extracting features from localized patches, or pursue processing entire images holistically, which leads to cross-temporal feature matching deviation and sensitivity to radiometric and geometric noise. To address the above issues, we propose a dual-module collaboration guided hierarchical adaptive aggregation framework, namely HA2F, which consists of dynamic hierarchical feature calibration module (DHFCM) and noise-adaptive feature refinement module (NAFRM). The former dynamically fuses adjacent-level features through perceptual feature selection, suppressing irrelevant discrepancies to address multi-temporal feature alignment deviations. The NAFRM utilizes the dual feature selection mechanism to highlight the change sensitive regions and generate spatial masks, suppressing the interference of irrelevant regions or shadows. Extensive experiments verify the effectiveness of the proposed HA2F, which achieves state-of-the-art performance on LEVIR-CD, WHU-CD, and SYSU-CD datasets, surpassing existing comparative methods in terms of both precision metrics and computational efficiency. In addition, ablation experiments show that DHFCM and NAFRM are effective. The HA2F official code is available at this https URL.
zh
[CV-32] Understanding and Improving UMAP with Geometric and Topological Priors: The JORC-UMAP Algorithm
【速读】:该论文旨在解决UMAP(Uniform Manifold Approximation and Projection)在高维数据降维可视化中因局部欧氏距离假设导致的拓扑撕裂(topological tearing)和结构坍塌(structural collapse)问题,其根源在于UMAP对k近邻图(k-nearest neighbor graph)的敏感性。解决方案的关键在于引入两种几何与拓扑先验:一是基于Ollivier-Ricci曲率(Ollivier-Ricci curvature)作为几何先验,强化几何瓶颈处的边并减少冗余连接;二是利用Jaccard相似性构建拓扑先验,确保邻域一致性。由此提出的JORC-UMAP方法能更准确地区分真实流形结构与虚假连接,在保持计算效率的同时显著提升可视化质量,如通过SVM分类准确率和三元组保留分数等指标验证。
链接: https://arxiv.org/abs/2601.16552
作者: Xiaobin Li,Run Zhang
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Geometric Topology (math.GT)
备注: 22 pages, 8 figures. Comments are welcome
Abstract:Nonlinear dimensionality reduction techniques, particularly UMAP, are widely used for visualizing high-dimensional data. However, UMAP’s local Euclidean distance assumption often fails to capture intrinsic manifold geometry, leading to topological tearing and structural collapse. We identify UMAP’s sensitivity to the k-nearest neighbor graph as a key cause. To address this, we introduce Ollivier-Ricci curvature as a geometric prior, reinforcing edges at geometric bottlenecks and reducing redundant links. Since curvature estimation is noise-sensitive, we also incorporate a topological prior using Jaccard similarity to ensure neighborhood consistency. The resulting method, JORC-UMAP, better distinguishes true manifold structure from spurious connections. Experiments on synthetic and real-world datasets show that JORC-UMAP reduces tearing and collapse more effectively than standard UMAP and other DR methods, as measured by SVM accuracy and triplet preservation scores, while maintaining computational efficiency. This work offers a geometry-aware enhancement to UMAP for more faithful data visualization.
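下面的片段示意其中“基于 Jaccard 相似度的拓扑先验”这一部分(非官方实现,Ollivier-Ricci 曲率的计算此处省略;假设已安装 scikit-learn):对 kNN 图的每条边用邻域 Jaccard 相似度重新加权,低重合的边视为可疑的虚假连接。

```python
# 概念示意:用邻域 Jaccard 相似度为 kNN 图的边重新加权(拓扑先验)。
import numpy as np
from sklearn.neighbors import NearestNeighbors

def jaccard_edge_weights(X, k=15):
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nbrs.kneighbors(X)                # idx[:, 0] 是样本自身
    neigh = [set(row[1:]) for row in idx]
    weights = {}
    for i, row in enumerate(idx):
        for j in row[1:]:
            inter = len(neigh[i] & neigh[int(j)])
            union = len(neigh[i] | neigh[int(j)])
            weights[(i, int(j))] = inter / union if union else 0.0
    return weights

# 用法示例:两个分离的高斯簇,簇内边的 Jaccard 权重通常高于跨簇边
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(8, 1, (100, 5))])
w = jaccard_edge_weights(X, k=10)
print(len(w), max(w.values()), min(w.values()))
```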
zh
[CV-33] Semi-Supervised Hierarchical Open-Set Classification WACV2026
【速读】:该论文旨在解决层次化开放集分类(hierarchical open-set classification)中的性能瓶颈问题,即在存在未见类别时如何有效提升模型对已知类别的分类准确率,并合理将未知类分配至合适的高层类别。针对这一挑战,作者提出了一种基于伪标签(pseudo-labeling)的师生框架(teacher-student framework),其关键创新在于:1)引入子树伪标签(subtree pseudo-labels),通过利用类别层级结构提供更可靠的监督信号以应对未知类干扰;2)设计年龄门控机制(age-gating),动态抑制伪标签的过自信预测,从而提升训练稳定性与泛化能力。实验表明,该方法在仅使用每类20个标注样本的情况下即可达到全监督基准性能,显著优于自监督预训练+微调策略。
链接: https://arxiv.org/abs/2601.16541
作者: Erik Wallin,Fredrik Kahl,Lars Hammarstrand
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: WACV2026
Abstract:Hierarchical open-set classification handles previously unseen classes by assigning them to the most appropriate high-level category in a class taxonomy. We extend this paradigm to the semi-supervised setting, enabling the use of large-scale, uncurated datasets containing a mixture of known and unknown classes to improve the hierarchical open-set performance. To this end, we propose a teacher-student framework based on pseudo-labeling. Two key components are introduced: 1) subtree pseudo-labels, which provide reliable supervision in the presence of unknown data, and 2) age-gating, a mechanism that mitigates overconfidence in pseudo-labels. Experiments show that our framework outperforms self-supervised pretraining followed by supervised adaptation, and even matches the fully supervised counterpart when using only 20 labeled samples per class on the iNaturalist19 benchmark. Our code is available at this https URL.
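子树伪标签的一种可能做法(仅为便于理解的猜测性示意,并非论文的确切定义):当叶子类别置信度不足时,沿类别层级向上聚合概率,把伪标签放宽到置信度足够高的内部节点。

```python
# 概念示意:叶子置信度不足时,把伪标签上移到层级中的“子树”节点。
import numpy as np

# 一个玩具层级:root -> {canine, feline},canine -> {dog, wolf},feline -> {cat, lion}
PARENT = {"dog": "canine", "wolf": "canine", "cat": "feline", "lion": "feline",
          "canine": "root", "feline": "root"}
LEAVES = ["dog", "wolf", "cat", "lion"]

def is_descendant(leaf, node):
    cur = leaf
    while cur is not None:
        if cur == node:
            return True
        cur = PARENT.get(cur)
    return False

def subtree_pseudo_label(leaf_probs, threshold=0.9):
    """leaf_probs: 与 LEAVES 对应的 softmax 概率;
    返回 (伪标签节点, 该节点子树的聚合置信度)。"""
    probs = dict(zip(LEAVES, leaf_probs))
    node = max(probs, key=probs.get)
    conf = float(probs[node])
    while conf < threshold and node in PARENT:          # 置信度不足则上移到父节点
        node = PARENT[node]
        conf = float(sum(p for leaf, p in probs.items() if is_descendant(leaf, node)))
    return node, conf

# 用法示例:模型在 dog/wolf 之间犹豫,但很确定是犬科 -> 伪标签为 "canine"
print(subtree_pseudo_label(np.array([0.5, 0.45, 0.03, 0.02])))
```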
zh
[CV-34] OnlineSI: Taming Large Language Model for Online 3D Understanding and Grounding
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLM)在现实世界中缺乏持续空间理解与推理能力的问题,尤其关注其在动态环境中无法长期积累并更新空间认知,以及难以部署于具身系统(embodied systems)的局限性。解决方案的关键在于提出 OnlineSI 框架,通过维护有限的空间记忆(spatial memory)来持续存储历史观测信息,从而保证每次推理所需的计算复杂度不随输入数据累积而增长;同时融合3D点云信息与语义信息,增强模型对场景中物体的位置感知与识别精度,实现对环境的持续空间理解。
链接: https://arxiv.org/abs/2601.16538
作者: Zixian Liu,Zhaoxi Chen,Liang Pan,Ziwei Liu
机构: Tsinghua University (清华大学); Nanyang Technological University (南洋理工大学); Shanghai AI Lab (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:In recent years, researchers have increasingly been interested in how to enable Multimodal Large Language Models (MLLM) to possess spatial understanding and reasoning capabilities. However, most existing methods overlook the importance of the ability to continuously work in an ever-changing world, and lack the possibility of deployment on embodied systems in real-world environments. In this work, we introduce OnlineSI, a framework that can continuously improve its spatial understanding of its surroundings given a video stream. Our core idea is to maintain a finite spatial memory to retain past observations, ensuring the computation required for each inference does not increase as the input accumulates. We further integrate 3D point cloud information with semantic information, helping MLLM to better locate and identify objects in the scene. To evaluate our method, we introduce the Fuzzy F1-Score to mitigate ambiguity, and test our method on two representative datasets. Experiments demonstrate the effectiveness of our method, paving the way towards real-world embodied systems.
zh
[CV-35] AnchoredDream: Zero-Shot 360° Indoor Scene Generation from a Single View via Geometric Grounding
【速读】:该论文旨在解决从单张图像生成完整360°室内场景的难题,这一问题在实际应用中具有重要意义,但因其高度病态性(ill-posed)而极具挑战。现有方法虽借助扩散模型和深度估计网络取得一定进展,但在大视角变化下仍难以保持外观一致性(appearance consistency)与几何合理性(geometric plausibility)。其解决方案的关键在于提出一种零样本(zero-shot)新框架 AnchoredDream,通过引入“外观-几何相互增强机制”(appearance-geometry mutual boosting mechanism),首先基于外观引导生成高保真几何结构,再分阶段完成全景重建:包括 warp-and-inpaint、warp-and-refine、后优化以及创新的 Grouting Block 模块,以确保输入视图与生成区域间的无缝衔接。该方法显著提升了生成结果在视觉一致性和几何合理性方面的表现。
链接: https://arxiv.org/abs/2601.16532
作者: Runmao Yao,Junsheng Zhou,Zhen Dong,Yu-Shen Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Single-view indoor scene generation plays a crucial role in a range of real-world applications. However, generating a complete 360° scene from a single image remains a highly ill-posed and challenging problem. Recent approaches have made progress by leveraging diffusion models and depth estimation networks, yet they still struggle to maintain appearance consistency and geometric plausibility under large viewpoint changes, limiting their effectiveness in full-scene generation. To address this, we propose AnchoredDream, a novel zero-shot pipeline that anchors 360° scene generation on high-fidelity geometry via an appearance-geometry mutual boosting mechanism. Given a single-view image, our method first performs appearance-guided geometry generation to construct a reliable 3D scene layout. Then, we progressively generate the complete scene through a series of modules: warp-and-inpaint, warp-and-refine, post-optimization, and a novel Grouting Block, which ensures seamless transitions between the input view and generated regions. Extensive experiments demonstrate that AnchoredDream outperforms existing methods by a large margin in both appearance consistency and geometric plausibility–all in a zero-shot manner. Our results highlight the potential of geometric grounding for high-quality, zero-shot single-view scene generation.
zh
[CV-36] SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer
【速读】:该论文旨在解决扩散模型(Diffusion Models)在视频生成任务中因长输入序列导致的高计算延迟问题,其根源在于全注意力机制(full attention)具有二次复杂度。现有稀疏注意力机制中,训练-free 方法受限于稀疏程度不足,加速效果有限;而训练-based 方法虽可实现更高稀疏性,但需大量数据和计算资源进行训练。解决方案的关键在于提出 SALAD,通过并行引入一个轻量级线性注意力分支(linear attention branch),并设计一种输入依赖的门控机制(input-dependent gating mechanism)来动态平衡两个分支的输出,从而在保持生成质量的同时实现 90% 的稀疏度和 1.72 倍的推理加速,且微调过程仅需 2,000 个视频样本和 1,600 步训练即可完成。
链接: https://arxiv.org/abs/2601.16515
作者: Tongcheng Fang,Hanling Zhang,Ruiqi Xie,Zhuo Han,Xin Tao,Tianchen Zhao,Pengfei Wan,Wenbo Ding,Wanli Ouyang,Xuefei Ning,Yu Wang
机构: Tsinghua University (清华大学); Kuaishou Technology (快手科技); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion Transformers have recently demonstrated remarkable performance in video generation. However, the long input sequences result in high computational latency due to the quadratic complexity of full attention. Various sparse attention mechanisms have been proposed. Training-free sparse attention is constrained by limited sparsity and thus offers modest acceleration, whereas training-based methods can reach much higher sparsity but demand substantial data and computation for training. In this work, we propose SALAD, introducing a lightweight linear attention branch in parallel with the sparse attention. By incorporating an input-dependent gating mechanism to finely balance the two branches, our method attains 90% sparsity and 1.72x inference speedup, while maintaining generation quality comparable to the full attention baseline. Moreover, our finetuning process is highly efficient, requiring only 2,000 video samples and 1,600 training steps with a batch size of 8.
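下面用 PyTorch 示意“并行线性注意力分支 + 输入相关门控”的基本结构(非 SALAD 官方实现;为保持自包含,稀疏注意力分支此处用完整 softmax 注意力代替):

```python
# 概念示意:线性注意力分支与(稠密代替的)稀疏注意力分支按输入相关门控加权融合。
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedLinearAttentionBranch(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.gate = nn.Linear(dim, 1)      # 输入相关门控

    def linear_attention(self, q, k, v):
        q, k = F.elu(q) + 1, F.elu(k) + 1              # 非负特征映射
        kv = torch.einsum("bnd,bne->bde", k, v)        # (B, D, D)
        z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
        return torch.einsum("bnd,bde,bn->bne", q, kv, z)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # “稀疏注意力”分支:示意中用完整 softmax 注意力代替
        attn = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        sparse_out = attn @ v
        linear_out = self.linear_attention(q, k, v)
        g = torch.sigmoid(self.gate(x))                # (B, N, 1)
        return g * sparse_out + (1 - g) * linear_out

# 用法示例
block = GatedLinearAttentionBranch(dim=64)
x = torch.randn(2, 128, 64)                            # (batch, tokens, dim)
print(block(x).shape)                                  # torch.Size([2, 128, 64])
```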
zh
[CV-37] Expert Knowledge-Guided Decision Calibration for Accurate Fine-Grained Tree Species Classification
【速读】:该论文旨在解决森林树种细粒度分类中因数据稀缺导致的长尾分布和类别间相似性高所引发的识别困难问题,尤其在少样本或易混淆类别上表现不佳。其核心解决方案是引入外部“领域专家”知识,并设计了一种专家知识引导的分类决策校准网络(EKDC-Net),关键在于两个模块:一是局部先验引导的知识提取模块(LPKEM),通过类激活图(CAM)分析引导专家聚焦于判别性特征;二是不确定性引导的决策校准模块(UDCM),动态融合全局类别不确定性和实例级预测不确定性以修正本地模型输出。该方法仅需0.08M额外可学习参数即可显著提升主干网络准确率(+6.42%)与精确率(+11.46%),且具备轻量化、即插即用特性。
链接: https://arxiv.org/abs/2601.16498
作者: Chen Long,Dian Chen,Ruifei Ding,Zhe Chen,Zhen Dong,Bisheng Yang
机构: Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate fine-grained tree species classification is critical for forest inventory and biodiversity monitoring. Existing methods predominantly focus on designing complex architectures to fit local data distributions. However, they often overlook the long-tailed distributions and high inter-class similarity inherent in limited data, thereby struggling to distinguish between few-shot or confusing categories. In the process of knowledge dissemination in the human world, individuals will actively seek expert assistance to transcend the limitations of local thinking. Inspired by this, we introduce an external “Domain Expert” and propose an Expert Knowledge-Guided Classification Decision Calibration Network (EKDC-Net) to overcome these challenges. Our framework addresses two core issues: expert knowledge extraction and utilization. Specifically, we first develop a Local Prior Guided Knowledge Extraction Module (LPKEM). By leveraging Class Activation Map (CAM) analysis, LPKEM guides the domain expert to focus exclusively on discriminative features essential for classification. Subsequently, to effectively integrate this knowledge, we design an Uncertainty-Guided Decision Calibration Module (UDCM). This module dynamically corrects the local model’s decisions by considering both overall category uncertainty and instance-level prediction uncertainty. Furthermore, we present a large-scale classification dataset covering 102 tree species, named CU-Tree102 to address the issue of scarce diversity in current benchmarks. Experiments on three benchmark datasets demonstrate that our approach achieves state-of-the-art performance. Crucially, as a lightweight plug-and-play module, EKDC-Net improves backbone accuracy by 6.42% and precision by 11.46% using only 0.08M additional learnable parameters. The dataset, code, and pre-trained models are available at this https URL.
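UDCM 的一种直观近似(仅为示意,并非论文模块的精确实现):按本地模型预测的熵来决定采信专家输出的程度,熵越高越依赖专家。

```python
# 概念示意:按实例级不确定性(预测熵)融合本地模型与领域专家的类别概率。
import numpy as np

def entropy(p, eps=1e-8):
    return -np.sum(p * np.log(p + eps), axis=-1)

def uncertainty_guided_calibration(local_probs, expert_probs):
    """local_probs, expert_probs: (N, C) 的类别概率;返回校准后的概率。"""
    num_classes = local_probs.shape[-1]
    max_entropy = np.log(num_classes)
    # 归一化到 [0,1] 的实例级不确定性,作为专家预测的融合权重
    w = (entropy(local_probs) / max_entropy)[:, None]
    fused = (1 - w) * local_probs + w * expert_probs
    return fused / fused.sum(axis=-1, keepdims=True)

# 用法示例:第 1 个样本本地模型很自信,第 2 个样本接近均匀分布(高不确定性)
local = np.array([[0.9, 0.05, 0.05],
                  [0.4, 0.35, 0.25]])
expert = np.array([[0.6, 0.3, 0.1],
                   [0.1, 0.8, 0.1]])
print(uncertainty_guided_calibration(local, expert).round(3))
```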
zh
[CV-38] Multi-View Consistent Wound Segmentation With Neural Fields
【速读】:该论文旨在解决伤口分割(wound segmentation)中从二维(2D)图像恢复多视角一致的三维(3D)结构这一挑战,以实现更精确的愈合进程追踪。其解决方案的关键在于提出WoundNeRF,一种基于神经辐射场(Neural Radiance Fields, NeRF)和有符号距离函数(Signed Distance Function, SDF)的方法,通过自动标注生成稳健的伤口分割结果,并在精度上优于当前主流的视觉Transformer(Vision Transformer)网络和传统的栅格化算法。
链接: https://arxiv.org/abs/2601.16487
作者: Remi Chierchia,Léo Lebrat,David Ahmedt-Aristizabal,Yulia Arzhaeva,Olivier Salvado,Clinton Fookes,Rodrigo Santa Cruz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Wound care is often challenged by the economic and logistical burdens that consistently afflict patients and hospitals worldwide. In recent decades, healthcare professionals have sought support from computer vision and machine learning algorithms. In particular, wound segmentation has gained interest due to its ability to provide professionals with fast, automatic tissue assessment from standard RGB images. Some approaches have extended segmentation to 3D, enabling more complete and precise healing progress tracking. However, inferring multi-view consistent 3D structures from 2D images remains a challenge. In this paper, we evaluate WoundNeRF, a NeRF SDF-based method for estimating robust wound segmentations from automatically generated annotations. We demonstrate the potential of this paradigm in recovering accurate segmentations by comparing it against state-of-the-art Vision Transformer networks and conventional rasterisation-based algorithms. The code will be released to facilitate further development in this promising paradigm.
zh
[CV-39] DeMark: A Query-Free Black-Box Attack on Deepfake Watermarking Defenses
【速读】:该论文旨在解决当前用于检测和溯源深度伪造图像(deepfakes)的防御水印技术易被攻击者移除的问题,尤其针对基于编码器-解码器架构的水印方案。其核心解决方案是提出DeMark,一种无需查询的黑盒攻击框架,通过压缩感知(compressive sensing)驱动的稀疏化过程,在潜在空间(latent space)中抑制水印信号,同时保持图像的感知真实性和结构合理性。该方法在8种主流水印方案上将检测准确率从100%降至平均32.9%,显著优于现有攻击手段,且验证了多种防御策略(如图像超分辨率、稀疏水印和对抗训练)均难以有效抵御此类攻击,揭示了当前水印机制对潜在空间扰动的脆弱性,强调亟需更鲁棒的水印设计以应对深度伪造滥用风险。
链接: https://arxiv.org/abs/2601.16473
作者: Wei Song,Zhenchang Xing,Liming Zhu,Yulei Sui,Jingling Xue
机构: UNSW Sydney (新南威尔士大学); CSIRO’s Data61 (澳大利亚联邦科学与工业研究组织数据61实验室)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rapid proliferation of realistic deepfakes has raised urgent concerns over their misuse, motivating the use of defensive watermarks in synthetic images for reliable detection and provenance tracking. However, this defense paradigm assumes such watermarks are inherently resistant to removal. We challenge this assumption with DeMark, a query-free black-box attack framework that targets defensive image watermarking schemes for deepfakes. DeMark exploits latent-space vulnerabilities in encoder-decoder watermarking models through a compressive sensing based sparsification process, suppressing watermark signals while preserving perceptual and structural realism appropriate for deepfakes. Across eight state-of-the-art watermarking schemes, DeMark reduces watermark detection accuracy from 100% to 32.9% on average while maintaining natural visual quality, outperforming existing attacks. We further evaluate three defense strategies, including image super resolution, sparse watermarking, and adversarial training, and find them largely ineffective. These results demonstrate that current encoder decoder watermarking schemes remain vulnerable to latent-space manipulations, underscoring the need for more robust watermarking methods to safeguard against deepfakes.
zh
[CV-40] Order from Chaos: Physical World Understanding from Glitchy Gameplay Videos
【速读】:该论文旨在解决多模态大语言模型(MLLMs)在物理世界理解能力上的不足,尤其是对物体动力学、材料属性及因果交互等物理原理的掌握仍远未达到人类水平的问题。现有数据集或依赖高成本的真实世界视频标注,或受限于合成模拟的现实感与多样性不足。解决方案的关键在于提出一种新颖范式:利用游戏视频中的“ glitches”(视觉异常,即违反预设物理法则的现象)作为大规模且可扩展的监督信号来源。通过构建 PhysGame 数据集(包含 140,057 条以 glitch 为中心的问答对)和 GameBench 基准测试集,结合元信息引导的提示策略提升数据质量,实验表明该方法显著提升了模型在真实世界(PhysBench)和通用场景(MVBench)下的物理推理性能,并增强了检测物理不合理性的鲁棒性,验证了从游戏异常中学习是推动多模态智能体物理理解能力的有效路径。
链接: https://arxiv.org/abs/2601.16471
作者: Meng Cao,Haoran Tang,Haoze Zhao,Mingfei Han,Ruyang Liu,Qiang Sun,Xiaojun Chang,Ian Reid,Xiaodan Liang
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); Peking University (北京大学); University of Waterloo (滑铁卢大学); University of Toronto (多伦多大学); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by TMLR
Abstract:Understanding the physical world, including object dynamics, material properties, and causal interactions, remains a core challenge in artificial intelligence. Although recent multi-modal large language models (MLLMs) have demonstrated impressive general reasoning capabilities, they still fall short of achieving human-level understanding of physical principles. Existing datasets for physical reasoning either rely on real-world videos, which incur high annotation costs, or on synthetic simulations, which suffer from limited realism and diversity. In this paper, we propose a novel paradigm that leverages glitches in gameplay videos, referring to visual anomalies that violate predefined physical laws, as a rich and scalable supervision source for physical world understanding. We introduce PhysGame, a meta-information-guided instruction-tuning dataset containing 140,057 glitch-centric question-answer pairs across five physical domains and sixteen fine-grained categories. To ensure data accuracy, we design a prompting strategy that utilizes gameplay metadata such as titles and descriptions to guide high-quality QA generation. Complementing PhysGame, we construct GameBench, an expert-annotated benchmark with 880 glitch-identified gameplay videos designed to evaluate physical reasoning capabilities. Extensive experiments show that PhysGame significantly enhances both Game2Real transferability, improving the real world physical reasoning performance of Qwen2.5VL by 2.5% on PhysBench, and Game2General transferability, yielding a 1.9% gain on the MVBench benchmark. Moreover, PhysGame-tuned models achieve a 3.7% absolute improvement on GameBench, demonstrating enhanced robustness in detecting physical implausibilities. These results indicate that learning from gameplay anomalies offers a scalable and effective pathway toward advancing physical world understanding in multimodal intelligence.
zh
[CV-41] VISTA-PATH: An interactive foundation model for pathology image segmentation and quantitative analysis in computational pathology
【速读】:该论文旨在解决病理图像语义分割(semantic segmentation)中模型泛化能力不足、与临床需求脱节的问题,尤其针对组织结构异质性强、依赖专家知识标注的挑战。现有分割基础模型通常将任务视为静态视觉预测,难以适配复杂多样的病理图像特征并支持临床可解释性。其解决方案的关键在于提出VISTA-PATH——一个交互式、类别感知的病理分割基础模型,通过联合条件建模视觉上下文、语义组织描述及可选专家提供的空间提示(spatial prompts),实现跨多种器官和组织类别的高精度像素级分割;同时构建包含160万张图像-掩码-文本三元组的大规模病理分割数据集(VISTA-PATH Data),并支持稀疏边界框反馈驱动的全切片动态优化,从而将分割从静态预测转变为可交互、临床导向的数字病理表示。
链接: https://arxiv.org/abs/2601.16451
作者: Peixian Liang,Songhao Li,Shunsuke Koga,Yutong Li,Zahra Alipour,Yucheng Tang,Daguang Xu,Zhi Huang
机构: University of Pennsylvania (宾夕法尼亚大学); Georgia Institute of Technology and Emory University (佐治亚理工学院和埃默里大学); NVIDIA Corporation (英伟达公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate semantic segmentation of histopathology images is crucial for quantitative tissue analysis and downstream clinical modeling. Recent segmentation foundation models have improved generalization through large-scale pretraining, yet remain poorly aligned with pathology because they treat segmentation as a static visual prediction task. Here we present VISTA-PATH, an interactive, class-aware pathology segmentation foundation model designed to resolve heterogeneous structures, incorporate expert feedback, and produce pixel-level segmentations that are directly meaningful for clinical interpretation. VISTA-PATH jointly conditions segmentation on visual context, semantic tissue descriptions, and optional expert-provided spatial prompts, enabling precise multi-class segmentation across heterogeneous pathology images. To support this paradigm, we curate VISTA-PATH Data, a large-scale pathology segmentation corpus comprising over 1.6 million image-mask-text triplets spanning 9 organs and 93 tissue classes. Across extensive held-out and external benchmarks, VISTA-PATH consistently outperforms existing segmentation foundation models. Importantly, VISTA-PATH supports dynamic human-in-the-loop refinement by propagating sparse, patch-level bounding-box annotation feedback into whole-slide segmentation. Finally, we show that the high-fidelity, class-aware segmentation produced by VISTA-PATH makes it a preferred model for computational pathology. It improves tissue microenvironment analysis through the proposed Tumor Interaction Score (TIS), which exhibits strong and significant associations with patient survival. Together, these results establish VISTA-PATH as a foundation model that elevates pathology image segmentation from a static prediction to an interactive and clinically grounded representation for digital pathology. Source code and demo can be found at this https URL.
zh
[CV-42] Emotion-LLaMAv2 and MMEVerse: A New Framework and Benchmark for Multimodal Emotion Understanding
【速读】:该论文旨在解决情感计算与人机交互中从多模态信号(如视觉、语音等)识别和推理人类情绪的挑战,特别是当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在情感推理能力上的局限性,以及缺乏高质量、大规模标注数据集和标准化评估基准的问题。解决方案的关键在于提出Emotion-LLaMAv2模型与MMEVerse基准平台:Emotion-LLaMAv2引入三个核心改进——端到端多视角编码器消除对外部人脸检测的依赖并捕获更细腻的情绪线索;Conv Attention预融合模块实现LLM主干外的局部与全局多模态特征协同交互;基于感知到认知的课程式指令微调策略统一了情绪识别与自由形式的情感推理任务;同时,MMEVerse整合12个公开情绪数据集,通过多智能体管道重标注生成130k训练片段和36k测试片段,构建统一的多模态指令格式与标准化评估体系,从而推动该领域的大规模训练与可复现评估。
链接: https://arxiv.org/abs/2601.16449
作者: Xiaojiang Peng,Jingyi Chen,Zebang Cheng,Bao Peng,Fengyi Wu,Yifei Dong,Shuyuan Tu,Qiyu Hu,Huiting Huang,Yuxiang Lin,Jun-Yan He,Kai Wang,Zheng Lian,Zhi-Qi Cheng
机构: Shenzhen Technology University (深圳技术大学); University of Washington (华盛顿大学); Meituan (美团); National University of Singapore (新加坡国立大学); Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Understanding human emotions from multimodal signals poses a significant challenge in affective computing and human-robot interaction. While multimodal large language models (MLLMs) have excelled in general vision-language tasks, their capabilities in emotional reasoning remain limited. The field currently suffers from a scarcity of large-scale datasets with high-quality, descriptive emotion annotations and lacks standardized benchmarks for evaluation. Our preliminary framework, Emotion-LLaMA, pioneered instruction-tuned multimodal learning for emotion reasoning but was restricted by explicit face detectors, implicit fusion strategies, and low-quality training data with limited scale. To address these limitations, we present Emotion-LLaMAv2 and the MMEVerse benchmark, establishing an end-to-end pipeline together with a standardized evaluation setting for emotion recognition and reasoning. Emotion-LLaMAv2 introduces three key advances. First, an end-to-end multiview encoder eliminates external face detection and captures nuanced emotional cues via richer spatial and temporal multiview tokens. Second, a Conv Attention pre-fusion module is designed to enable simultaneous local and global multimodal feature interactions external to the LLM backbone. Third, a perception-to-cognition curriculum instruction tuning scheme within the LLaMA2 backbone unifies emotion recognition and free-form emotion reasoning. To support large-scale training and reproducible evaluation, MMEVerse aggregates twelve publicly available emotion datasets, including IEMOCAP, MELD, DFEW, and MAFW, into a unified multimodal instruction format. The data are re-annotated via a multi-agent pipeline involving Qwen2 Audio, Qwen2.5 VL, and GPT 4o, producing 130k training clips and 36k testing clips across 18 evaluation benchmarks.
zh
[CV-43] Masked Face Recognition under Different Backbones
【速读】:该论文旨在解决后疫情时代民航安检中大量乘客佩戴口罩对传统人脸匹配模型性能造成显著下降的问题。其解决方案的关键在于系统性地评估多种骨干网络(backbone network)在有无口罩遮挡条件下的识别效果,发现r100系列模型在无 mask 场景下表现最优(98%+准确率),而在戴口罩场景下,改进后的 r100_mask_v2 模型仍保持领先(90.07% 准确率),同时 Vision Transformer(ViT)类模型(如 Vit-Small/Tiny)也展现出较强的鲁棒性,为实际部署提供了基于性能与效率权衡的针对性建议。
链接: https://arxiv.org/abs/2601.16440
作者: Bo Zhang,Ming Zhang,Kun Wu,Lei Bian,Yi Lin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Erratum to the paper (Zhang et al., 2025): corrections to Table IV and the data in Page 3, Section A. In the post-pandemic era, a high proportion of civil aviation passengers wear masks during security checks, posing significant challenges to traditional face recognition models. The backbone network serves as the core component of face recognition models. In standard tests, r100 series models excelled (98%+ accuracy at 0.01% FAR in face comparison, high top1/top5 in search). r50 ranked second, r34_mask_v1 lagged. In masked tests, r100_mask_v2 led (90.07% accuracy), r50_mask_v3 performed best among r50 but trailed r100. Vit-Small/Tiny showed strong masked performance with gains in effectiveness. Through extensive comparative experiments, this paper conducts a comprehensive evaluation of several core backbone networks, aiming to reveal the impacts of different models on face recognition with and without masks, and provide specific deployment recommendations.
zh
[CV-44] MDAFNet: Multiscale Differential Edge and Adaptive Frequency Guided Network for Infrared Small Target Detection
【速读】:该论文旨在解决红外小目标检测(Infrared Small Target Detection, IRSTD)中因网络层数增加导致的目标边缘像素逐渐退化,以及传统卷积在特征提取过程中难以区分频率成分,从而引发低频背景干扰高频目标、高频噪声诱发误检的问题。解决方案的关键在于提出一种名为MDAFNet(Multi-scale Differential Edge and Adaptive Frequency Guided Network)的新型网络架构,其核心创新包括:一是多尺度边缘差分模块(Multi-Scale Differential Edge, MSDE),通过多尺度边缘提取与增强机制有效补偿下采样过程中的目标边缘信息损失;二是双域自适应特征增强模块(Dual-Domain Adaptive Feature Enhancement, DAFE),融合频域处理机制与空间域模拟频率分解与融合机制,实现对高频目标的自适应增强和对高频噪声的选择性抑制。
链接: https://arxiv.org/abs/2601.16434
作者: Shuying Li,Qiang Ma,San Zhang,Wuwei Wang,Chuang Yang
机构: Xi’an University of Posts and Telecommunications(西安邮电大学); Shanghai Artificial Intelligence Laboratory(上海人工智能实验室); The Hong Kong Polytechnic University(香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Infrared small target detection (IRSTD) plays a crucial role in numerous military and civilian applications. However, existing methods often face the gradual degradation of target edge pixels as the number of network layers increases, and traditional convolution struggles to differentiate between frequency components during feature extraction, leading to low-frequency backgrounds interfering with high-frequency targets and high-frequency noise triggering false detections. To address these limitations, we propose MDAFNet (Multi-scale Differential Edge and Adaptive Frequency Guided Network for Infrared Small Target Detection), which integrates the Multi-Scale Differential Edge (MSDE) module and Dual-Domain Adaptive Feature Enhancement (DAFE) module. The MSDE module, through a multi-scale edge extraction and enhancement mechanism, effectively compensates for the cumulative loss of target edge information during downsampling. The DAFE module combines frequency domain processing mechanisms with simulated frequency decomposition and fusion mechanisms in the spatial domain to effectively improve the network’s capability to adaptively enhance high-frequency targets and selectively suppress high-frequency noise. Experimental results on multiple datasets demonstrate the superior detection performance of MDAFNet.
zh
[CV-45] AlphaFace: High Fidelity and Real-time Face Swapper Robust to Facial Pose
【速读】:该论文旨在解决现有人脸交换(face-swapping)方法在极端面部姿态下性能显著下降的问题。传统方法依赖显式几何特征以提升姿态鲁棒性,但引入了额外依赖关系并增加计算开销;而基于扩散模型的方法虽效果优异,却难以满足实时处理需求。其解决方案的关键在于提出AlphaFace,该方法利用开源视觉-语言模型及CLIP的图像与文本嵌入,设计新颖的视觉和文本语义对比损失(visual and textual semantic contrastive losses),从而增强身份表征能力并更精确地保留属性信息,同时保持实时性能。
链接: https://arxiv.org/abs/2601.16429
作者: Jongmin Yu,Hyeontaek Oh,Zhongtian Sun,Angelica I Aviles-Rivero,Moongu Jeon,Jinhong Yang
机构: ProjectG.AI; University of Cambridge (剑桥大学); University of Kent (肯特大学); Tsinghua University (清华大学); Gwangju Institute of Science and Technology (光州科学技术院); Inje University (仁川大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Existing face-swapping methods often deliver competitive results in constrained settings but exhibit substantial quality degradation when handling extreme facial poses. To improve facial pose robustness, explicit geometric features are applied, but this approach remains problematic since it introduces additional dependencies and increases computational cost. Diffusion-based methods have achieved remarkable results; however, they are impractical for real-time processing. We introduce AlphaFace, which leverages an open-source vision-language model and CLIP image and text embeddings to apply novel visual and textual semantic contrastive losses. AlphaFace enables stronger identity representation and more precise attribute preservation, all while maintaining real-time performance. Comprehensive experiments across FF++, MPIE, and LPFF demonstrate that AlphaFace surpasses state-of-the-art methods in pose-challenging cases. The project is publicly available at this https URL.
zh
[CV-46] DCCS-Det: Directional Context and Cross-Scale-Aware Detector for Infrared Small Target
【速读】:该论文针对红外小目标检测(IRSTD)中普遍存在的两大问题展开研究:一是现有方法在建模局部-全局特征时存在不足,导致目标与背景的区分能力弱;二是特征冗余和语义稀释现象严重,影响目标表示质量。解决方案的关键在于提出DCCS-Det检测器,其核心创新包括两个模块:一是双流显著性增强(Dual-stream Saliency Enhancement, DSE)块,通过融合局部感知与方向感知的上下文聚合机制,有效捕捉长程空间依赖性和局部细节;二是潜在语义提取与聚合(Latent-aware Semantic Extraction and Aggregation, LaSEA)模块,借助跨尺度特征提取和随机池化采样策略,缓解特征退化问题,强化判别性特征并抑制噪声。实验表明,该方法在多个数据集上实现了最优检测精度与良好效率。
链接: https://arxiv.org/abs/2601.16428
作者: Shuying Li,Qiang Ma,San Zhang,Chuang Yang
机构: Xi’an University of Posts and Telecommunications (西安邮电大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Key Laboratory of Cyberspace Security, Ministry of Education of China (教育部网络空间安全重点实验室); Henan Key Laboratory of Cyberspace Situation Awareness (河南省网络空间态势感知重点实验室); The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Infrared small target detection (IRSTD), which aims to identify small, low-contrast targets against complex backgrounds, is critical for applications like remote sensing and surveillance. However, existing methods often struggle with inadequate joint modeling of local-global features (harming target-background discrimination) or feature redundancy and semantic dilution (degrading target representation quality). To tackle these issues, we propose DCCS-Det (Directional Context and Cross-Scale Aware Detector for Infrared Small Target), a novel detector that incorporates a Dual-stream Saliency Enhancement (DSE) block and a Latent-aware Semantic Extraction and Aggregation (LaSEA) module. The DSE block integrates localized perception with direction-aware context aggregation to help capture long-range spatial dependencies and local details. On this basis, the LaSEA module mitigates feature degradation via cross-scale feature extraction and random pooling sampling strategies, enhancing discriminative features and suppressing noise. Extensive experiments show that DCCS-Det achieves state-of-the-art detection accuracy with competitive efficiency across multiple datasets. Ablation studies further validate the contributions of DSE and LaSEA in improving target perception and feature representation under complex scenarios. The DCCS-Det official code is available at this https URL.
zh
[CV-47] A Cosine Network for Image Super-Resolution
【速读】:该论文旨在解决图像超分辨率(Image Super-Resolution, ISR)任务中结构信息提取与保持的难题,尤其关注如何有效保留和增强从低分辨率图像中恢复出的结构特征以提升重建质量。其解决方案的关键在于提出一种基于余弦网络(Cosine Network for Image Super-Resolution, CSRNet)的新型架构与训练策略:首先设计奇偶异构块(odd and even heterogeneous blocks)以扩大网络结构差异,从而提取互补的同源结构信息;其次融合线性与非线性结构信息,克服单一结构表示的局限性,提升结构信息的鲁棒性;最后引入余弦退火机制(cosine annealing mechanism)优化训练过程,通过暖重启(warm restarts)调整学习率,缓解梯度下降陷入局部最优的问题。这些改进共同提升了模型在ISR任务中的性能表现。
链接: https://arxiv.org/abs/2601.16413
作者: Chunwei Tian,Chengyuan Zhang,Bob Zhang,Zhiwu Li,C. L. Philip Chen,David Zhang
机构: Harbin Institute of Technology (哈尔滨工业大学); Hunan University (湖南大学); University of Macau (澳门大学); Xidian University (西安电子科技大学); South China University of Technology (华南理工大学); Pazhou Lab (琶洲实验室); Chinese University of Hong Kong (Shenzhen) (香港中文大学(深圳)); Shenzhen Institute of Artificial Intelligence and Robotics for Society (深圳市人工智能与机器人研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: in IEEE Transactions on Image Processing (2025)
Abstract:Deep convolutional neural networks can use hierarchical information to progressively extract structural information to recover high-quality images. However, preserving the effectiveness of the obtained structural information is important in image super-resolution. In this paper, we propose a cosine network for image super-resolution (CSRNet) by improving a network architecture and optimizing the training strategy. To extract complementary homologous structural information, odd and even heterogeneous blocks are designed to enlarge the architectural differences and improve the performance of image super-resolution. Combining linear and non-linear structural information can overcome the drawback of homologous information and enhance the robustness of the obtained structural information in image super-resolution. Taking into account the local minimum of gradient descent, a cosine annealing mechanism is used to optimize the training procedure by performing warm restarts and adjusting the learning rate. Experimental results illustrate that the proposed CSRNet is competitive with state-of-the-art methods in image super-resolution.
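摘要中“带热重启的余弦退火”可直接用 PyTorch 自带的调度器实现,下面是一个通用用法示例(非论文官方训练脚本,T_0、T_mult 等取值为假设):

```python
# 示意代码:SGDR 风格的带热重启余弦退火学习率调度。
import torch
import torch.nn as nn

model = nn.Conv2d(3, 64, kernel_size=3, padding=1)          # 任意占位模型
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# 第一个周期 T_0=10 个 epoch,之后每个周期长度乘以 T_mult=2,最低学习率为 eta_min
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2, eta_min=1e-6)

for epoch in range(30):
    # train_one_epoch(model, optimizer)   # 省略具体训练循环
    scheduler.step()
    if epoch % 5 == 0:
        print(f"epoch {epoch:02d}, lr = {optimizer.param_groups[0]['lr']:.2e}")
```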
zh
[CV-48] ResAgent: Entropy-based Prior Point Discovery and Visual Reasoning for Referring Expression Segmentation
【速读】:该论文旨在解决基于多模态大语言模型(Multimodal Large Language Model, MLLM)的指代表达分割(Referring Expression Segmentation, RES)方法中存在的两个关键问题:一是MLLM生成的粗粒度边界框导致冗余或非判别性的点提示;二是依赖文本坐标推理的方式不可靠,难以区分视觉相似的干扰项。解决方案的核心在于提出一个名为ResAgent的新型框架,其关键创新为两个模块:(1) 基于熵的点发现(Entropy-Based Point Discovery, EBD),通过建模边界框内的空间不确定性来识别高信息量候选点,将点选择视为信息最大化过程;(2) 基于视觉的推理(Vision-Based Reasoning, VBR),通过联合视觉-语义对齐验证点的正确性,摒弃纯文本坐标推理以实现更鲁棒的验证。该框架采用从粗到精的流程:边界框初始化、熵引导的点发现、视觉验证和掩码解码,在四个基准数据集上均达到新的最先进性能,证明了其在生成准确且语义锚定的分割掩码方面的有效性。
链接: https://arxiv.org/abs/2601.16394
作者: Yihao Wang,Jusheng Zhang,Ziyi Tang,Keze Wang,Meng Yang
机构: Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 23 pages, 7 figures
Abstract:Referring Expression Segmentation (RES) is a core vision-language segmentation task that enables pixel-level understanding of targets via free-form linguistic expressions, supporting critical applications such as human-robot interaction and augmented reality. Despite the progress of Multimodal Large Language Model (MLLM)-based approaches, existing RES methods still suffer from two key limitations: first, the coarse bounding boxes from MLLMs lead to redundant or non-discriminative point prompts; second, the prevalent reliance on textual coordinate reasoning is unreliable, as it fails to distinguish targets from visually similar distractors. To address these issues, we propose ResAgent, a novel RES framework integrating Entropy-Based Point Discovery (EBD) and Vision-Based Reasoning (VBR). Specifically, EBD identifies high-information candidate points by modeling spatial uncertainty within coarse bounding boxes, treating point selection as an information maximization process. VBR verifies point correctness through joint visual-semantic alignment, abandoning text-only coordinate inference for more robust validation. Built on these components, ResAgent implements a coarse-to-fine workflow: bounding box initialization, entropy-guided point discovery, vision-based validation, and mask decoding. Extensive evaluations on four benchmark datasets (RefCOCO, RefCOCO+, RefCOCOg, and ReasonSeg) demonstrate that ResAgent achieves new state-of-the-art performance across all four benchmarks, highlighting its effectiveness in generating accurate and semantically grounded segmentation masks with minimal prompts.
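下面给出 EBD“熵引导选点”的一种可能实现示意(非官方代码,概率图来源与阈值均为假设):在粗边界框内按像素级二值熵从高到低选取候选点提示。

```python
# 概念示意:在粗边界框内按像素熵选取信息量最大的候选点提示。
import numpy as np

def entropy_based_points(prob_map, bbox, top_k=3):
    """prob_map: (H, W) 的目标概率图(例如来自一次粗分割);
    bbox: (x0, y0, x1, y1);返回熵最高的 top_k 个点坐标 (x, y)。"""
    x0, y0, x1, y1 = bbox
    crop = prob_map[y0:y1, x0:x1]
    ent = -(crop * np.log(crop + 1e-8) + (1 - crop) * np.log(1 - crop + 1e-8))
    flat_idx = np.argsort(ent, axis=None)[::-1][:top_k]       # 按熵从高到低排序
    ys, xs = np.unravel_index(flat_idx, ent.shape)
    return [(int(x) + x0, int(y) + y0) for x, y in zip(xs, ys)]

# 用法示例:中心区域概率接近 0.5(最不确定),应被优先选为候选点
prob = np.full((64, 64), 0.95)
prob[28:36, 28:36] = 0.5
print(entropy_based_points(prob, bbox=(16, 16, 48, 48), top_k=3))
```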
zh
[CV-49] VTFusion: A Vision-Text Multimodal Fusion Network for Few-Shot Anomaly Detection
【速读】:该论文旨在解决少样本异常检测(Few-Shot Anomaly Detection, FSAD)中因依赖自然场景预训练特征而忽视工业领域细粒度语义信息,以及视觉与文本模态融合策略浅层、存在语义错位导致跨模态干扰的问题。解决方案的关键在于提出一种面向FSAD的视觉-文本多模态融合框架VTFusion,其核心创新包括:1)设计自适应特征提取器以学习任务特定表示,弥合预训练模型与工业数据之间的域差距,并通过生成多样化合成异常增强特征判别力;2)构建专用的多模态预测融合模块,包含促进跨模态深度信息交互的融合块和在多模态引导下生成精细化像素级异常图的分割网络,从而显著提升工业场景下的异常检测性能。
链接: https://arxiv.org/abs/2601.16381
作者: Yuxin Jiang,Yunkang Cao,Yuqi Cheng,Yiheng Zhang,Weiming Shen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Few-Shot Anomaly Detection (FSAD) has emerged as a critical paradigm for identifying irregularities using scarce normal references. While recent methods have integrated textual semantics to complement visual data, they predominantly rely on features pre-trained on natural scenes, thereby neglecting the granular, domain-specific semantics essential for industrial inspection. Furthermore, prevalent fusion strategies often resort to superficial concatenation, failing to address the inherent semantic misalignment between visual and textual modalities, which compromises robustness against cross-modal interference. To bridge these gaps, this study proposes VTFusion, a vision-text multimodal fusion framework tailored for FSAD. The framework rests on two core designs. First, adaptive feature extractors for both image and text modalities are introduced to learn task-specific representations, bridging the domain gap between pre-trained models and industrial data; this is further augmented by generating diverse synthetic anomalies to enhance feature discriminability. Second, a dedicated multimodal prediction fusion module is developed, comprising a fusion block that facilitates rich cross-modal information exchange and a segmentation network that generates refined pixel-level anomaly maps under multimodal guidance. VTFusion significantly advances FSAD performance, achieving image-level AUROCs of 96.8% and 86.2% in the 2-shot scenario on the MVTec AD and VisA datasets, respectively. Furthermore, VTFusion achieves an AUPRO of 93.5% on a real-world dataset of industrial automotive plastic parts introduced in this paper, further demonstrating its practical applicability in demanding industrial scenarios.
zh
[CV-50] Cognitively-Inspired Tokens Overcome Egocentric Bias in Multimodal Models
【速读】:该论文旨在解决当前多模态语言模型(Multimodal Language Models, MLMs)在空间推理任务中表现出的持续自我中心偏差(egocentric bias),尤其是其在需要采用另一代理视角进行视觉理解(visual perspective-taking)时的性能不足问题。研究表明,尽管现有模型在语义层面的跨模态任务上表现优异,但缺乏支持分配性空间认知(allocentric reasoning)的能力。解决方案的关键在于引入“视角标记”(perspective tokens),这类特殊嵌入通过两种机制编码方向信息:一是基于具身身体关键点的线索,二是抽象表示以支持心理旋转(mental rotation)。将这些标记集成到LLaVA-1.5-13B模型中后,在Level-2视觉视角采纳任务上显著提升准确率,且基于旋转的标记能泛化至非人类参考代理。实证分析进一步表明,微调增强了基础模型中已存在的潜在方向敏感性,说明MLMs本身包含分配性推理的先验结构,但需通过显式嵌入空间结构来激活和优化。这一方法提供了一种轻量、与模型无关的机制,使模型具备更类人的空间推理能力。
链接: https://arxiv.org/abs/2601.16378
作者: Bridget Leonard,Scott O. Murray
机构: University of Washington (华盛顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注:
Abstract:Multimodal language models (MLMs) perform well on semantic vision-language tasks but fail at spatial reasoning that requires adopting another agent’s visual perspective. These errors reflect a persistent egocentric bias and raise questions about whether current models support allocentric reasoning. Inspired by human spatial cognition, we introduce perspective tokens, specialized embeddings that encode orientation through either (1) embodied body-keypoint cues or (2) abstract representations supporting mental rotation. Integrating these tokens into LLaVA-1.5-13B yields improved performance on level-2 visual perspective-taking tasks. Across synthetic and naturalistic benchmarks (Isle Bricks V2, COCO, 3DSRBench), perspective tokens improve accuracy, with rotation-based tokens generalizing to non-human reference agents. Representational analyses reveal that fine-tuning enhances latent orientation sensitivity already present in the base model, suggesting that MLMs contain precursors of allocentric reasoning but lack appropriate internal structure. Overall, embedding cognitively grounded spatial structure directly into token space provides a lightweight, model-agnostic mechanism for perspective-taking and more human-like spatial reasoning.
zh
[CV-51] Coarse-to-Fine Non-rigid Multi-modal Image Registration for Historical Panel Paintings based on Crack Structures
【速读】:该论文旨在解决历史面板画作多模态图像(如可见光摄影、红外反射成像、紫外荧光摄影、X射线成像等)在像素级对齐过程中依赖人工标注导致效率低、精度差的问题。其核心挑战在于不同模态图像间存在分辨率差异、图像尺寸巨大、非刚性形变及模态特异性内容,使得传统注册方法难以适用。解决方案的关键在于提出一种基于稀疏关键点与薄板样条(thin-plate splines)的粗到精非刚性配准框架:首先利用卷积神经网络联合检测和描述由颜料层裂纹(craquelure)特征提取的关键点,并通过图神经网络实现基于局部块的描述子匹配;随后结合局部区域同形投影误差过滤误匹配;进一步引入多层级关键点精化机制以支持混合分辨率图像的逐级高精度配准。实验表明,该方法在自建多模态数据集上显著优于现有关键点匹配与密集匹配方法及其优化策略。
链接: https://arxiv.org/abs/2601.16348
作者: Aline Sindel,Andreas Maier,Vincent Christlein
机构: Friedrich-Alexander-Universität Erlangen-Nürnberg (弗里德里希·亚历山大埃尔朗根-纽伦堡大学); Erlangen National High Performance Computing Center (NHR@FAU) (埃尔朗根国家高性能计算中心); German Research Foundation (德国研究基金会); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint, submitted for review
Abstract:Art technological investigations of historical panel paintings rely on acquiring multi-modal image data, including visual light photography, infrared reflectography, ultraviolet fluorescence photography, x-radiography, and macro photography. For a comprehensive analysis, the multi-modal images require pixel-wise alignment, which is still often performed manually. Multi-modal image registration can reduce this laborious manual work, is substantially faster, and enables higher precision. Due to varying image resolutions, huge image sizes, non-rigid distortions, and modality-dependent image content, registration is challenging. Therefore, we propose a coarse-to-fine non-rigid multi-modal registration method efficiently relying on sparse keypoints and thin-plate-splines. Historical paintings exhibit a fine crack pattern, called craquelure, on the paint layer, which is captured by all image systems and is well-suited as a feature for registration. In our one-stage non-rigid registration approach, we employ a convolutional neural network for joint keypoint detection and description based on the craquelure and a graph neural network for descriptor matching in a patch-based manner, and filter matches based on homography reprojection errors in local areas. For coarse-to-fine registration, we introduce a novel multi-level keypoint refinement approach to register mixed-resolution images up to the highest resolution. We created a multi-modal dataset of panel paintings with a high number of keypoint annotations, and a large test set comprising five multi-modal domains and varying image resolutions. The ablation study demonstrates the effectiveness of all modules of our refinement method. Our proposed approaches achieve the best registration results compared to competing keypoint and dense matching methods and refinement methods.
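摘要中"稀疏关键点 + 薄板样条"的非刚性配准,其核心一步是用匹配好的关键点拟合 TPS 形变场。下面用 SciPy 的 `RBFInterpolator`(`kernel="thin_plate_spline"`)给出这一步骤的最小示意;关键点检测/匹配(CNN + 图神经网络)与多层级精化不在此示意范围内,玩具坐标亦为假设数据。

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def tps_warp_points(src_pts, dst_pts, query_pts, smoothing=0.0):
    """用匹配关键点拟合薄板样条形变场,并把源图坐标映射到目标图坐标(示意)。"""
    tps = RBFInterpolator(src_pts, dst_pts,
                          kernel="thin_plate_spline", smoothing=smoothing)
    return tps(query_pts)

# 用法示意:4 对匹配点加上轻微非刚性扰动
src = np.array([[0, 0], [0, 100], [100, 0], [100, 100]], dtype=float)
dst = src + np.array([[1, 2], [0, -1], [2, 0], [-1, 1]], dtype=float)
print(tps_warp_points(src, dst, np.array([[50.0, 50.0]])))
```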
zh
[CV-52] FeTTL: Federated Template and Task Learning for Multi-Institutional Medical Imaging
【速读】:该论文旨在解决联邦学习(Federated Learning)在多中心医疗影像应用中因数据分布偏移(domain shifts)和异质性(heterogeneity)导致的模型性能下降问题,尤其针对不同机构间成像协议、设备类型及人群差异带来的挑战。其解决方案的关键在于提出一种名为“联邦模板与任务学习”(Federated Template and Task Learning, FeTTL)的新框架,通过联合学习一个全局模板(global template)与任务模型(task model),实现客户端间数据分布的对齐,从而有效缓解分布偏移问题。实验表明,FeTTL在视网膜眼底图像光学盘分割和组织病理学转移灶分类两项任务上均显著优于现有联邦学习基线方法(p值=0.002),且强调了模板与任务模型协同学习的重要性,为真实世界多中心场景下的鲁棒模型部署提供了可扩展的解决方案。
链接: https://arxiv.org/abs/2601.16302
作者: Abhijeet Parida,Antonia Alomar,Zhifan Jiang,Pooneh Roshanitabrizi,Austin Tapp,Ziyue Xu,Syed Muhammad Anwar,Maria J. Ledesma-Carbayo,Holger R. Roth,Marius George Linguraru
机构: Sheikh Zayed Institute for Pediatric Surgical Innovation, Children’s National Hospital, Washington, DC, USA; Universidad Politécnica de Madrid, Madrid, Spain; Universitat Pompeu Fabra, Barcelona, Spain; CIBER-BBN, ISCIII, Madrid, Spain; NVIDIA Corporation, Santa Clara, CA, USA; Departments of Radiology and Pediatrics, George Washington University, Washington, DC, USA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Federated learning enables collaborative model training across geographically distributed medical centers while preserving data privacy. However, domain shifts and heterogeneity in data often lead to a degradation in model performance. Medical imaging applications are particularly affected by variations in acquisition protocols, scanner types, and patient populations. To address these issues, we introduce Federated Template and Task Learning (FeTTL), a novel framework designed to harmonize multi-institutional medical imaging data in federated environments. FeTTL learns a global template together with a task model to align data distributions among clients. We evaluated FeTTL on two challenging and diverse multi-institutional medical imaging tasks: retinal fundus optical disc segmentation and histopathological metastasis classification. Experimental results show that FeTTL significantly outperforms the state-of-the-art federated learning baselines (p-values < 0.002) for optical disc segmentation and classification of metastases from multi-institutional data. Our experiments further highlight the importance of jointly learning the template and the task. These findings suggest that FeTTL offers a principled and extensible solution for mitigating distribution shifts in federated learning, supporting robust model deployment in real-world, multi-institutional environments.
zh
[CV-53] Memory-V2V: Augmenting Video-to-Video Diffusion Models with Memory
【速读】:该论文旨在解决多轮视频编辑(multi-turn video editing)中跨轮次一致性(cross-consistency)不足的问题,即在用户多次交互式修改视频时,现有视频到视频扩散模型难以保持前后编辑结果的一致性。解决方案的关键在于提出 Memory-V2V 框架,其核心创新是通过引入显式记忆机制来增强模型对历史编辑结果的依赖:首先利用外部缓存中的已编辑视频进行精确检索,并结合动态标记化策略将先前结果作为条件输入;同时,在 DiT(Diffusion Transformer)骨干网络中设计可学习的标记压缩器(token compressor),以消除冗余条件标记并保留关键视觉线索,从而在提升跨轮次一致性的同时实现约 30% 的整体推理加速。
链接: https://arxiv.org/abs/2601.16296
作者: Dohun Lee,Chun-Hao Paul Huang,Xuelin Chen,Jong Chul Ye,Duygu Ceylan,Hyeonho Jeong
机构: Adobe Research(Adobe研究院); KAIST(韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page: this https URL
Abstract:Recent foundational video-to-video diffusion models have achieved impressive results in editing user provided videos by modifying appearance, motion, or camera movement. However, real-world video editing is often an iterative process, where users refine results across multiple rounds of interaction. In this multi-turn setting, current video editors struggle to maintain cross-consistency across sequential edits. In this work, we tackle, for the first time, the problem of cross-consistency in multi-turn video editing and introduce Memory-V2V, a simple, yet effective framework that augments existing video-to-video models with explicit memory. Given an external cache of previously edited videos, Memory-V2V employs accurate retrieval and dynamic tokenization strategies to condition the current editing step on prior results. To further mitigate redundancy and computational overhead, we propose a learnable token compressor within the DiT backbone that compresses redundant conditioning tokens while preserving essential visual cues, achieving an overall speedup of 30%. We validate Memory-V2V on challenging tasks including video novel view synthesis and text-conditioned long video editing. Extensive experiments show that Memory-V2V produces videos that are significantly more cross-consistent with minimal computational overhead, while maintaining or even improving task-specific performance over state-of-the-art baselines. Project page: this https URL
zh
[CV-54] GR3EN: Generative Relighting for 3D Environments
【速读】:该论文旨在解决复杂真实场景中3D重建的光照重渲染(relighting)问题,现有方法通常面临欠定或病态的逆渲染(inverse rendering)难题,难以在大规模室内外环境中生成高质量结果。其解决方案的关键在于利用视频到视频的光照重渲染扩散模型(video-to-video relighting diffusion model)输出进行知识蒸馏(distillation),将该模型的能力迁移至3D重建中,从而避免直接求解复杂的逆渲染问题,实现对房间尺度场景的可控、高质量3D光照重渲染。
链接: https://arxiv.org/abs/2601.16272
作者: Xiaoyan Xing,Philipp Henzler,Junhwa Hur,Runze Li,Jonathan T. Barron,Pratul P. Srinivasan,Dor Verbin
机构: Google DeepMind(谷歌深度思维); Google Research(谷歌研究院); University of Amsterdam(阿姆斯特丹大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL
Abstract:We present a method for relighting 3D reconstructions of large room-scale environments. Existing solutions for 3D scene relighting often require solving under-determined or ill-conditioned inverse rendering problems, and are as such unable to produce high-quality results on complex real-world scenes. Though recent progress in using generative image and video diffusion models for relighting has been promising, these techniques are either limited to 2D image and video relighting or 3D relighting of individual objects. Our approach enables controllable 3D relighting of room-scale scenes by distilling the outputs of a video-to-video relighting diffusion model into a 3D reconstruction. This side-steps the need to solve a difficult inverse rendering problem, and results in a flexible system that can relight 3D reconstructions of complex real-world scenes. We validate our approach on both synthetic and real-world datasets to show that it can faithfully render novel views of scenes under new lighting conditions.
zh
[CV-55] PocketDVDNet: Realtime Video Denoising for Real Camera Noise
【速读】:该论文旨在解决真实场景下多成分传感器噪声(multi-component sensor noise)对视频去噪的挑战,尤其针对自动对焦、自动驾驶和监控等实时应用中对高质量、低资源消耗视频去噪的需求。其解决方案的关键在于提出了一种轻量级视频去噪网络 PocketDVDNet,通过融合稀疏引导的结构化剪枝(sparsity-guided structured pruning)、物理信息噪声模型(physics-informed noise model)与知识蒸馏(knowledge distillation)的联合压缩框架,在显著降低模型参数量(减少74%)的同时提升去噪质量,并实现5帧图像块的实时处理能力。该方法通过在真实噪声环境下训练教师模型并指导学生网络学习隐式噪声建模,从而无需显式输入噪声图即可完成高效去噪,实现了性能与效率的协同优化。
链接: https://arxiv.org/abs/2601.16780
作者: Crispian Morris,Imogen Dexter,Fan Zhang,David R. Bull,Nantheera Anantrasirichai
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Live video denoising under realistic, multi-component sensor noise remains challenging for applications such as autofocus, autonomous driving, and surveillance. We propose PocketDVDNet, a lightweight video denoiser developed using our model compression framework that combines sparsity-guided structured pruning, a physics-informed noise model, and knowledge distillation to achieve high-quality restoration with reduced resource demands. Starting from a reference model, we induce sparsity, apply targeted channel pruning, and retrain a teacher on realistic multi-component noise. The student network learns implicit noise handling, eliminating the need for explicit noise-map inputs. PocketDVDNet reduces the original model size by 74% while improving denoising quality and processing 5-frame patches in real-time. These results demonstrate that aggressive compression, combined with domain-adapted distillation, can reconcile performance and efficiency for practical, real-time video denoising.
zh
[CV-56] Fast faithful and photorealistic diffusion-based image super-resolution with enhanced Flow Map models
【速读】:该论文旨在解决扩散模型(Diffusion Models, DMs)在图像超分辨率(Image Super-Resolution, SR)任务中面临的两个核心挑战:一是重建忠实度(reconstruction faithfulness)与照片真实感(photorealism)之间的权衡问题,二是推理效率不足的问题。现有基于知识蒸馏的单步扩散方法虽提升了效率,但受限于信息压缩导致感知线索(如纹理真实性和景深)退化。为此,作者提出FlowMapSR框架,其关键创新在于:首先将流图(Flow Map)模型引入SR任务,利用其无需显式蒸馏即可实现高效推理的优势;其次引入两种互补增强策略——基于分类器自由引导(Classifier-Free Guidance)的正负提示引导机制,以及基于低秩适应(LoRA)的对抗微调方法;实验表明,结合Shortcut型Flow Map结构与上述增强手段,可在x4和x8两种放大倍率下同时实现更优的忠实度与真实感平衡,且无需针对不同尺度进行条件设计或退化引导。
链接: https://arxiv.org/abs/2601.16660
作者: Maxence Noble,Gonzalo Iñaki Quintana,Benjamin Aubin,Clément Chadebec
机构: Jasper Research(杰帕研究); CMAP, CNRS, Ecole polytechnique(法国国家科学研究中心,巴黎综合理工学院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Technical report
Abstract:Diffusion-based image super-resolution (SR) has recently attracted significant attention by leveraging the expressive power of large pre-trained text-to-image diffusion models (DMs). A central practical challenge is resolving the trade-off between reconstruction faithfulness and photorealism. To address inference efficiency, many recent works have explored knowledge distillation strategies specifically tailored to SR, enabling one-step diffusion-based approaches. However, these teacher-student formulations are inherently constrained by information compression, which can degrade perceptual cues such as lifelike textures and depth of field, even with high overall perceptual quality. In parallel, self-distillation DMs, known as Flow Map models, have emerged as a promising alternative for image generation tasks, enabling fast inference while preserving the expressivity and training stability of standard DMs. Building on these developments, we propose FlowMapSR, a novel diffusion-based framework for image super-resolution explicitly designed for efficient inference. Beyond adapting Flow Map models to SR, we introduce two complementary enhancements: (i) positive-negative prompting guidance, based on a generalization of the classifier-free guidance paradigm to Flow Map models, and (ii) adversarial fine-tuning using Low-Rank Adaptation (LoRA). Among the considered Flow Map formulations (Eulerian, Lagrangian, and Shortcut), we find that the Shortcut variant consistently achieves the best performance when combined with these enhancements. Extensive experiments show that FlowMapSR achieves a better balance between reconstruction faithfulness and photorealism than recent state-of-the-art methods for both x4 and x8 upscaling, while maintaining competitive inference time. Notably, a single model is used for both upscaling factors, without any scale-specific conditioning or degradation-guided mechanisms.
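摘要提到的"正负提示引导"是对 classifier-free guidance 的推广。下面只示意经典 CFG 的组合方式:以负向提示的预测为基线,沿"正向 − 负向"方向按引导系数外推;`model(x, t, cond)` 的接口与系数取值均为假设,论文面向 Flow Map 模型的完整推广请以原文为准。

```python
import torch

def guided_prediction(model, x, t, pos_emb, neg_emb, guidance_scale=4.5):
    """正负提示引导的最小示意(经典 classifier-free guidance 组合)。"""
    out_pos = model(x, t, pos_emb)   # 正向提示:期望的画质/语义描述
    out_neg = model(x, t, neg_emb)   # 负向提示:模糊、伪影等不希望出现的内容
    return out_neg + guidance_scale * (out_pos - out_neg)

# 用法示意:用一个随机"假网络"验证输出形状
dummy = lambda x, t, c: 0.1 * x + 0.01 * c.mean()
x = torch.randn(1, 3, 64, 64)
print(guided_prediction(dummy, x, 0.5, torch.randn(8), torch.randn(8)).shape)
```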
zh
[CV-57] PanopMamba: Vision State Space Modeling for Nuclei Panoptic Segmentation
【速读】:该论文旨在解决组织病理图像中细胞核全景分割(panoptic segmentation)的三大挑战:小目标检测困难、边界模糊以及类别不平衡问题。其核心解决方案是提出一种新颖的混合编码器-解码器架构PanopMamba,该架构融合了状态空间模型(State Space Model, SSM)与Transformer,并引入基于SSM的特征增强融合机制,以实现金字塔特征中的高效长距离感知和跨尺度信息共享。关键创新在于设计多尺度Mamba骨干网络与SSM驱动的特征融合模块,显著提升了密集重叠细胞核在语义和空间维度上的特征表达能力,同时首次将Mamba引入全景分割任务,在多个基准数据集上验证了其优越性。
链接: https://arxiv.org/abs/2601.16631
作者: Ming Kang,Fung Fung Ting,Raphaël C.-W. Phan,Zongyuan Ge,Chee-Ming Ting
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Applications (stat.AP)
备注: 10 pages, 3 figures
Abstract:Nuclei panoptic segmentation supports cancer diagnostics by integrating both semantic and instance segmentation of different cell types to analyze overall tissue structure and individual nuclei in histopathology images. Major challenges include detecting small objects, handling ambiguous boundaries, and addressing class imbalance. To address these issues, we propose PanopMamba, a novel hybrid encoder-decoder architecture that integrates Mamba and Transformer with additional feature-enhanced fusion via state space modeling. We design a multiscale Mamba backbone and a State Space Model (SSM)-based fusion network to enable efficient long-range perception in pyramid features, thereby extending the pure encoder-decoder framework while facilitating information sharing across multiscale features of nuclei. The proposed SSM-based feature-enhanced fusion integrates pyramid feature networks and dynamic feature enhancement across different spatial scales, enhancing the feature representation of densely overlapping nuclei in both semantic and spatial dimensions. To the best of our knowledge, this is the first Mamba-based approach for panoptic segmentation. Additionally, we introduce alternative evaluation metrics, including image-level Panoptic Quality (iPQ), boundary-weighted PQ (wPQ), and frequency-weighted PQ (fwPQ), which are specifically designed to address the unique challenges of nuclei segmentation and thereby mitigate the potential bias inherent in vanilla PQ. Experimental evaluations on two multiclass nuclei segmentation benchmark datasets, MoNuSAC2020 and NuInsSeg, demonstrate the superiority of PanopMamba for nuclei panoptic segmentation over state-of-the-art methods. Consequently, the robustness of PanopMamba is validated across various metrics, while the distinctiveness of PQ variants is also demonstrated. Code is available at this https URL.
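摘要中的 iPQ / wPQ / fwPQ 都是标准 Panoptic Quality 的加权变体。作为参照,下面给出标准 PQ(= SQ × RQ)的最小计算示意;各变体的具体加权方式摘要未给出,此处不做臆测,示例数值亦为假设。

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """标准 Panoptic Quality:PQ = SQ * RQ。

    matched_ious: 匹配成功(IoU > 0.5)的预测-真值实例对的 IoU 列表(TP 集合);
    num_fp / num_fn: 未匹配的预测实例数 / 真值实例数。
    """
    tp = len(matched_ious)
    if tp + num_fp + num_fn == 0:
        return 1.0
    sq = sum(matched_ious) / max(tp, 1)           # segmentation quality
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)  # recognition quality
    return sq * rq

print(panoptic_quality([0.90, 0.80, 0.75], num_fp=1, num_fn=2))  # ≈ 0.544
```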
zh
[CV-58] On The Robustness of Foundational 3D Medical Image Segmentation Models Against Imprecise Visual Prompts
【速读】:该论文旨在解决生成式 AI (Generative AI) 在医学影像分割任务中对不精确提示(prompt)的鲁棒性问题,即当前3D基础模型在面对现实世界中常见的密集视觉提示扰动时,其性能表现尚不明确。解决方案的关键在于通过系统性地施加多种受控扰动(controlled perturbations),模拟真实场景下的提示不准确性,并在多器官腹部分割任务上对两个最新的基础模型进行实验验证,从而揭示模型对形状和空间线索的依赖程度及其对特定扰动的抗干扰能力。
链接: https://arxiv.org/abs/2601.16383
作者: Soumitri Chattopadhyay,Basar Demir,Marc Niethammer
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ISBI 2026
Abstract:While 3D foundational models have shown promise for promptable segmentation of medical volumes, their robustness to imprecise prompts remains under-explored. In this work, we aim to address this gap by systematically studying the effect of various controlled perturbations of dense visual prompts, that closely mimic real-world imprecision. By conducting experiments with two recent foundational models on a multi-organ abdominal segmentation task, we reveal several facets of promptable medical segmentation, especially pertaining to reliance on visual shape and spatial cues, and the extent of resilience of models towards certain perturbations. Codes are available at: this https URL
zh
[CV-59] Experience with Single Domain Generalization in Real World Medical Imaging Deployments AAAI2026
【速读】:该论文旨在解决医学影像领域中单域泛化(Single Domain Generalization, SDG)的问题,即如何在仅使用单一数据域训练模型的情况下,使其能够有效泛化到未见过的目标域,尤其是在多中心研究中因扫描仪和成像协议差异导致的域偏移问题。其解决方案的关键在于提出一种通用的“深度学习+专家知识嵌入”(Deep Learning with Expert Knowledge Embedding, DL+EKE)方法,通过将领域专家知识系统性地融入深度学习框架中,增强模型对罕见类别特征变化的鲁棒性,从而显著优于当前主流的SDG技术,在糖尿病视网膜病变(Diabetic Retinopathy, DR)等任务中验证了该方法的有效性,并进一步将其部署于应激心电图(stress ECG)和静息态功能磁共振成像(resting-state fMRI)的实际场景中。
链接: https://arxiv.org/abs/2601.16359
作者: Ayan Banerjee,Komandoor Srivathsan,Sandeep K.S. Gupta
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at AAAI 2026 Innovative Applications of Artificial Intelligence
Abstract:A desirable property of any deployed artificial intelligence is generalization across domains, i.e. data generation distribution under a specific acquisition condition. In medical imaging applications the most coveted property for effective deployment is Single Domain Generalization (SDG), which addresses the challenge of training a model on a single domain to ensure it generalizes well to unseen target domains. In multi-center studies, differences in scanners and imaging protocols introduce domain shifts that exacerbate variability in rare class characteristics. This paper presents our experience on SDG in real-life deployment for two exemplary medical imaging case studies on seizure onset zone detection using fMRI data, and stress electrocardiogram based coronary artery detection. Utilizing the commonly used application of diabetic retinopathy, we first demonstrate that state-of-the-art SDG techniques fail to achieve generalized performance across data domains. We then develop a generic expert knowledge integrated deep learning technique DL+EKE and instantiate it for the DR application and show that DL+EKE outperforms SOTA SDG methods on DR. We then deploy instances of the DL+EKE technique on the two real-world examples of stress ECG and resting state (rs)-fMRI and discuss issues faced with SDG techniques.
zh
人工智能
[AI-0] A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLM s
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)训练过程中损失曲率演化难以测量的问题,尤其是传统基于Hessian矩阵最大特征值(Hessian sharpness, λ_max^H)的分析因计算成本过高而无法应用于大模型。其解决方案的关键在于提出并验证了临界尖锐度(critical sharpness, λ_c),这是一种仅需少于10次前向传播即可计算的高效指标,且能准确捕捉到已知的Hessian尖锐度现象(如渐进式尖锐化和稳定边缘效应)。通过该指标,作者首次在高达70亿参数规模的OLMo-2模型上展示了这些尖锐度现象,并进一步引入相对临界尖锐度(relative critical sharpness, λ_c^{1→2})来量化一个损失景观在优化另一个损失时的曲率变化,从而指导预训练到微调的过渡策略与数据混合设计。此方法为大模型训练中的曲率动态诊断和数据组成决策提供了可扩展、实用的工具。
链接: https://arxiv.org/abs/2601.16979
作者: Dayal Singh Kalra,Jean-Christophe Gagnon-Audet,Andrey Gromov,Ishita Mediratta,Kelvin Niu,Alexander H Miller,Michael Shvartsman
机构: 未知
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 9 pages, 6 figures
Abstract:Understanding the curvature evolution of the loss landscape is fundamental to analyzing the training dynamics of neural networks. The most commonly studied measure, Hessian sharpness ( \lambda_\max^H ) – the largest eigenvalue of the loss Hessian – determines local training stability and interacts with the learning rate throughout training. Despite its significance in analyzing training dynamics, direct measurement of Hessian sharpness remains prohibitive for Large Language Models (LLMs) due to high computational cost. We analyze critical sharpness ( \lambda_c ), a computationally efficient measure requiring fewer than 10 forward passes given the update direction \Delta \mathbf{\theta} . Critically, this measure captures well-documented Hessian sharpness phenomena, including progressive sharpening and Edge of Stability. Using this measure, we provide the first demonstration of these sharpness phenomena at scale, up to 7B parameters, spanning both pre-training and mid-training of OLMo-2 models. We further introduce relative critical sharpness ( \lambda_c^{1\to 2} ), which quantifies the curvature of one loss landscape while optimizing another, to analyze the transition from pre-training to fine-tuning and guide data mixing strategies. Critical sharpness provides practitioners with a practical tool for diagnosing curvature dynamics and informing data composition choices at scale. More broadly, our work shows that scalable curvature measures can provide actionable insights for large-scale training.
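摘要没有给出 critical sharpness(λ_c)的精确定义,只说明它在给定更新方向 Δθ 时只需少量前向计算。下面给出一种与这一思路一致、但并非论文定义的示意:用中心差分估计损失沿更新方向的曲率(约等于 u^T H u),仅需 3 次前向计算;`loss_fn` 接口与 eps 取值均为假设。

```python
import torch

def directional_curvature(loss_fn, params, delta, eps=1e-3):
    """沿更新方向的损失曲率的中心差分估计(示意,非论文中 λ_c 的定义)。"""
    u = delta / delta.norm()
    l0 = loss_fn(params)
    lp = loss_fn(params + eps * u)
    lm = loss_fn(params - eps * u)
    return (lp - 2 * l0 + lm) / eps ** 2

# 用法示意:二次损失 L(θ) = θ^T A θ 的方向曲率应为 2 u^T A u = 20
A = torch.diag(torch.tensor([1.0, 10.0], dtype=torch.float64))
loss = lambda th: th @ A @ th
theta = torch.tensor([1.0, 1.0], dtype=torch.float64)
print(directional_curvature(loss, theta, delta=torch.tensor([0.0, 1.0], dtype=torch.float64)))
```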
zh
[AI-1] Empowering Medical Equipment Sustainability in Low-Resource Settings: An AI-Powered Diagnostic and Support Platform for Biomedical Technicians MICCAI2025
【速读】:该论文旨在解决低收入和中等收入国家(LMICs)医疗诊断设备因缺乏及时维护、技术专家资源有限及制造商支持不足而导致的设备闲置或故障率高的问题,从而减少设备停机时间、延迟诊断和患者护理质量下降。解决方案的关键在于开发并验证一个基于人工智能(AI)的实时支持平台,该平台集成大语言模型(LLM)与用户友好的网页界面,使生物医学工程师能够输入错误代码或设备症状,并获得准确的分步故障排除指导;同时内置全球同行讨论论坛以促进知识共享,尤其针对罕见或未记录的问题。初步原型在Philips HDI 5000超声设备上实现100%错误代码解析精度和80%纠正措施建议准确率,证明了AI驱动系统在资源受限环境中提升医疗设备维护效率的可行性与潜力。
链接: https://arxiv.org/abs/2601.16967
作者: Bernes Lorier Atabonfack,Ahmed Tahiru Issah,Mohammed Hardi Abdul Baaki,Clemence Ingabire,Tolulope Olusuyi,Maruf Adewole,Udunna C. Anazodo,Timothy X Brown
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted at the MIRASOL Workshop at MICCAI 2025. To appear in Lecture Notes in Computer Science (LNCS)
Abstract:In low- and middle-income countries (LMICs), a significant proportion of medical diagnostic equipment remains underutilized or non-functional due to a lack of timely maintenance, limited access to technical expertise, and minimal support from manufacturers, particularly for devices acquired through third-party vendors or donations. This challenge contributes to increased equipment downtime, delayed diagnoses, and compromised patient care. This research explores the development and validation of an AI-powered support platform designed to assist biomedical technicians in diagnosing and repairing medical devices in real-time. The system integrates a large language model (LLM) with a user-friendly web interface, enabling imaging technologists/radiographers and biomedical technicians to input error codes or device symptoms and receive accurate, step-by-step troubleshooting guidance. The platform also includes a global peer-to-peer discussion forum to support knowledge exchange and provide additional context for rare or undocumented issues. A proof of concept was developed using the Philips HDI 5000 ultrasound machine, achieving 100% precision in error code interpretation and 80% accuracy in suggesting corrective actions. This study demonstrates the feasibility and potential of AI-driven systems to support medical device maintenance, with the aim of reducing equipment downtime to improve healthcare delivery in resource-constrained environments.
zh
[AI-2] Spatial-Agent : Agent ic Geo-spatial Reasoning with Scientific Core Concepts
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的智能体在地理空间推理任务中表现不佳的问题,尤其是其依赖网络搜索或模式匹配而非真正的地理计算,常出现空间关系幻觉。解决方案的关键在于提出Spatial-Agent,该系统以空间信息科学的基础理论为支撑,将地理分析问答建模为概念转换问题:通过解析自然语言问题生成可执行的工作流——GeoFlow图(GeoFlow Graphs),即节点代表空间概念、边表示转换操作的有向无环图;利用空间信息理论提取空间概念并赋予功能角色,遵循严格的顺序约束进行转换序列的模板化生成,从而实现高准确率且具备可解释性的地理空间推理能力。
链接: https://arxiv.org/abs/2601.16965
作者: Riyang Bao,Cheng Yang,Dazhou Yu,Zhexiang Tang,Gengchen Mai,Liang Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15pages, 4 figures
Abstract:Geospatial reasoning is essential for real-world applications such as urban analytics, transportation planning, and disaster response. However, existing LLM-based agents often fail at genuine geospatial computation, relying instead on web search or pattern matching while hallucinating spatial relationships. We present Spatial-Agent, an AI agent grounded in foundational theories of spatial information science. Our approach formalizes geo-analytical question answering as a concept transformation problem, where natural-language questions are parsed into executable workflows represented as GeoFlow Graphs – directed acyclic graphs with nodes corresponding to spatial concepts and edges representing transformations. Drawing on spatial information theory, Spatial-Agent extracts spatial concepts, assigns functional roles with principled ordering constraints, and composes transformation sequences through template-based generation. Extensive experiments on MapEval-API and MapQA benchmarks demonstrate that Spatial-Agent significantly outperforms existing baselines including ReAct and Reflexion, while producing interpretable and executable geospatial workflows.
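摘要把地理分析问答形式化为 GeoFlow 图:节点是空间概念、边是概念间的转换,随后按拓扑序执行。下面用 networkx 搭一个极小的示意图并按拓扑序遍历;节点与转换的具体命名(如 buffer、intersect)为本文杜撰的示例,并非论文的概念体系。

```python
import networkx as nx

# 示意问题:"学校 500 米范围内有哪些公交站?"
g = nx.DiGraph()
g.add_edge("schools(point)", "school_buffer(region)", op="buffer_500m")
g.add_edge("school_buffer(region)", "stops_near_schools(point)", op="intersect")
g.add_edge("bus_stops(point)", "stops_near_schools(point)", op="intersect")

for node in nx.topological_sort(g):           # 按依赖顺序处理每个空间概念
    ops = {g.edges[p, node]["op"] for p in g.predecessors(node)}
    print(node, "<-", ops or "input")
```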
zh
[AI-3] Agent Drive: An Open Benchmark Dataset for Agent ic AI Reasoning with LLM -Generated Scenarios in Autonomous Systems
【速读】:该论文旨在解决当前自主智能体(agentic AI)在推理驱动的感知、规划与决策任务中缺乏大规模、结构化且具备安全关键性的评估基准的问题。解决方案的关键在于提出并构建了AgentDrive这一开放基准数据集,其中包含30万条由大语言模型(LLM)生成的驾驶场景,这些场景在七个正交维度(如场景类型、驾驶员行为、环境、道路布局等)上被因子化建模,并通过LLM驱动的提示到JSON的管道生成语义丰富且可直接用于仿真的规范。每个场景均经过仿真滚动、代理安全指标计算及规则化结果标注,从而确保其结构化与安全性;同时,配套引入AgentDrive-MCQ多选题基准,涵盖五种推理维度,用于系统性评估模型性能。此方案为训练和评估面向复杂现实世界的自主智能体提供了标准化、可扩展且安全可控的测试平台。
链接: https://arxiv.org/abs/2601.16964
作者: Mohamed Amine Ferrag,Abderrahmane Lakas,Merouane Debbah
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages
Abstract:The rapid advancement of large language models (LLMs) has sparked growing interest in their integration into autonomous systems for reasoning-driven perception, planning, and decision-making. However, evaluating and training such agentic AI models remains challenging due to the lack of large-scale, structured, and safety-critical benchmarks. This paper introduces AgentDrive, an open benchmark dataset containing 300,000 LLM-generated driving scenarios designed for training, fine-tuning, and evaluating autonomous agents under diverse conditions. AgentDrive formalizes a factorized scenario space across seven orthogonal axes: scenario type, driver behavior, environment, road layout, objective, difficulty, and traffic density. An LLM-driven prompt-to-JSON pipeline generates semantically rich, simulation-ready specifications that are validated against physical and schema constraints. Each scenario undergoes simulation rollouts, surrogate safety metric computation, and rule-based outcome labeling. To complement simulation-based evaluation, we introduce AgentDrive-MCQ, a 100,000-question multiple-choice benchmark spanning five reasoning dimensions: physics, policy, hybrid, scenario, and comparative reasoning. We conduct a large-scale evaluation of fifty leading LLMs on AgentDrive-MCQ. Results show that while proprietary frontier models perform best in contextual and policy reasoning, advanced open models are rapidly closing the gap in structured and physics-grounded reasoning. We release the AgentDrive dataset, AgentDrive-MCQ benchmark, evaluation code, and related materials at this https URL
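摘要将场景空间按七个正交维度做了因子化。下面用一个 Python 字典示意"提示词到 JSON"流水线可能产出的单条场景规格;七个维度名取自摘要,但各维度下的具体取值(如 "highway_merge"、"aggressive" 等)为本文假设的示例,并非数据集的真实 schema。

```python
import json

scenario = {
    "scenario_type": "highway_merge",          # 场景类型
    "driver_behavior": "aggressive",           # 驾驶员行为
    "environment": {"weather": "rain", "time_of_day": "night"},  # 环境
    "road_layout": "three_lane_highway",       # 道路布局
    "objective": "reach_exit_within_2km",      # 目标
    "difficulty": "hard",                      # 难度
    "traffic_density": "high",                 # 交通密度
}
print(json.dumps(scenario, indent=2, ensure_ascii=False))
```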
zh
[AI-4] Nishpaksh: TEC Standard-Compliant Framework for Fairness Auditing and Certification of AI Models
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在高风险决策系统(如新兴电信和6G应用)中缺乏符合地区监管要求的公平性评估框架的问题。现有工具如IBM AI Fairness 360和Microsoft Fairlearn虽能检测偏差,但难以适配区域性法规与国家优先事项。其解决方案的关键在于提出Nishpaksh——一个本土化的公平性评估工具,该工具基于印度电信工程中心(TEC)标准,集成问卷驱动的风险量化、情境化阈值确定及定量公平性评估,并通过向量化计算、响应式状态管理和可认证报告,实现可复现、审计级的公平性评估,从而填补研究导向方法与印度AI治理实践之间的实施鸿沟。
链接: https://arxiv.org/abs/2601.16926
作者: Shashank Prakash,Ranjitha Prasad,Avinash Agarwal
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted and presented at 2026 18th International Conference on COMmunication Systems and NETworks (COMSNETS)
Abstract:The growing reliance on Artificial Intelligence (AI) models in high-stakes decision-making systems, particularly within emerging telecom and 6G applications, underscores the urgent need for transparent and standardized fairness assessment frameworks. While global toolkits such as IBM AI Fairness 360 and Microsoft Fairlearn have advanced bias detection, they often lack alignment with region-specific regulatory requirements and national priorities. To address this gap, we propose Nishpaksh, an indigenous fairness evaluation tool that operationalizes the Telecommunication Engineering Centre (TEC) Standard for the Evaluation and Rating of Artificial Intelligence Systems. Nishpaksh integrates survey-based risk quantification, contextual threshold determination, and quantitative fairness evaluation into a unified, web-based dashboard. The tool employs vectorized computation, reactive state management, and certification-ready reporting to enable reproducible, audit-grade assessments, thereby addressing a critical post-standardization implementation need. Experimental validation on the COMPAS dataset demonstrates Nishpaksh’s effectiveness in identifying attribute-specific bias and generating standardized fairness scores compliant with the TEC framework. The system bridges the gap between research-oriented fairness methodologies and regulatory AI governance in India, marking a significant step toward responsible and auditable AI deployment within critical infrastructure like telecommunications.
zh
[AI-5] Preventing the Collapse of Peer Review Requires Verification-First AI
【速读】:该论文试图解决当前学术评审体系中因AI辅助评审工具设计不当而导致的“伪可靠性”问题,即当评审系统过度依赖可量化的代理指标(proxy metrics)而非真实科学验证时,可能引发对研究质量的误判和虚假信号膨胀。其核心解决方案在于提出“真理耦合”(truth-coupling)作为评价标准,强调评审工具应优先服务于对潜在科学事实的验证能力,而非简单模仿人类评审行为。关键创新点在于构建了一个最小模型,揭示了两种驱动机制——验证压力(verification pressure)与信号萎缩(signal shrinkage)——如何共同促成从“真知导向”向“代理主导”的相变,并推导出激励崩溃条件:即使当前决策仍看似可靠,理性参与者也会转向优化代理指标而非追求真实科学价值。因此,论文建议将AI定位为对抗性审计者(adversarial auditor),生成可审计的验证证据以扩展有效验证带宽,而非仅作为评分预测器放大虚假主张。
链接: https://arxiv.org/abs/2601.16909
作者: Lei You,Lele Cao,Iryna Gurevych
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This paper argues that AI-assisted peer review should be verification-first rather than review-mimicking. We propose truth-coupling, i.e. how tightly venue scores track latent scientific truth, as the right objective for review tools. We formalize two forces that drive a phase transition toward proxy-sovereign evaluation: verification pressure, when claims outpace verification capacity, and signal shrinkage, when real improvements become hard to separate from noise. In a minimal model that mixes occasional high-fidelity checks with frequent proxy judgment, we derive an explicit coupling law and an incentive-collapse condition under which rational effort shifts from truth-seeking to proxy optimization, even when current decisions still appear reliable. These results motivate actions for tool builders and program chairs: deploy AI as an adversarial auditor that generates auditable verification artifacts and expands effective verification bandwidth, rather than as a score predictor that amplifies claim inflation.
zh
[AI-6] GRIP: Algorithm-Agnostic Machine Unlearning for Mixture-of-Experts via Geometric Router Constraints
【速读】:该论文旨在解决大语言模型中机器遗忘(Machine Unlearning, MU)在混合专家(Mixture-of-Experts, MoE)架构下难以有效实施的问题。现有方法因利用MoE架构的路由机制漏洞——通过调整路由器将查询引导至非知识专家而非真正擦除知识,导致模型效用下降且遗忘效果表面化。其解决方案的关键在于提出一种算法无关的框架GRIP(Geometric Routing Invariance Preservation),核心创新是引入几何约束:通过将路由器梯度更新投影到专家特定的零空间(null-space),实现路由稳定性与参数灵活性的解耦。这使得专家选择在保留知识时保持稳定,同时允许路由器参数在零空间内自由调整,从而迫使遗忘优化直接作用于专家参数本身,而非依赖路由操纵这一捷径。实验表明,GRIP作为适配器可显著提升路由稳定性(>95%)并维持模型性能,使现有遗忘算法适用于MoE架构。
链接: https://arxiv.org/abs/2601.16905
作者: Andy Zhu,Rongzhe Wei,Yupu Gu,Pan Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Machine unlearning (MU) for large language models has become critical for AI safety, yet existing methods fail to generalize to Mixture-of-Experts (MoE) architectures. We identify that traditional unlearning methods exploit MoE’s architectural vulnerability: they manipulate routers to redirect queries away from knowledgeable experts rather than erasing knowledge, causing a loss of model utility and superficial forgetting. We propose Geometric Routing Invariance Preservation (GRIP), an algorithm-agnostic framework for unlearning for MoE. Our core contribution is a geometric constraint, implemented by projecting router gradient updates into an expert-specific null-space. Crucially, this decouples routing stability from parameter rigidity: while discrete expert selections remain stable for retained knowledge, the continuous router parameters remain plastic within the null space, allowing the model to undergo necessary internal reconfiguration to satisfy unlearning objectives. This forces the unlearning optimization to erase knowledge directly from expert parameters rather than exploiting the superficial router manipulation shortcut. GRIP functions as an adapter, constraining router parameter updates without modifying the underlying unlearning algorithm. Extensive experiments on large-scale MoE models demonstrate that our adapter eliminates expert selection shift (achieving over 95% routing stability) across all tested unlearning methods while preserving their utility. By preventing existing algorithms from exploiting MoE model’s router vulnerability, GRIP adapts existing unlearning research from dense architectures to MoEs.
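摘要的核心机制是把路由器的梯度更新投影到"专家特定的零空间"。下面给出零空间投影这一步的通用示意:给定保留数据在路由器输入处构成的特征矩阵,投影后的梯度与其行空间正交,因此不改变这些样本的路由打分。特征矩阵如何构造、按哪个专家划分等细节请见论文,这里的随机数据与函数名均为假设。

```python
import torch

def project_to_null_space(grad, feature_matrix, rcond=1e-6):
    """把梯度投影到 feature_matrix 的(右)零空间(示意实现)。"""
    # SVD 取行空间的正交基,再从梯度中减去落在该子空间上的分量
    _, s, vh = torch.linalg.svd(feature_matrix, full_matrices=False)
    rank = int((s > rcond * s.max()).sum())
    v_row = vh[:rank]                        # (rank, d) 行空间正交基
    return grad - v_row.T @ (v_row @ grad)

# 用法示意:3 条保留样本的 8 维路由器输入特征
A = torch.randn(3, 8)
g = torch.randn(8)
g_proj = project_to_null_space(g, A)
print(torch.linalg.norm(A @ g_proj))  # ≈ 0:投影后的更新不改变保留样本的路由打分
```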
zh
[AI-7] MAGE-KT: Multi-Agent Graph-Enhanced Knowledge Tracing with Subgraph Retrieval and Asymmetric Fusion
【速读】:该论文旨在解决知识追踪(Knowledge Tracing, KT)中如何更有效地建模学生、题目与知识概念(Knowledge Concepts, KCs)之间关系的问题,尤其针对现有图结构方法未能充分挖掘KC间复杂关联、且在大规模异构KT图上进行全图编码时存在计算开销大和噪声干扰导致注意力扩散至无关区域的局限性。其解决方案的关键在于提出多智能体图增强型知识追踪框架(Multi-Agent Graph-Enhanced Knowledge Tracing, MAGE-KT),通过融合多智能体KC关系提取器与学生-题目交互图构建多视角异构图,捕获语义与行为信号的互补信息;并基于目标学生的历史记录检索紧凑高价值子图,利用非对称交叉注意力融合模块实现精准增强预测,有效避免注意力扩散和冗余计算,从而提升KC关系建模精度与下一题预测性能。
链接: https://arxiv.org/abs/2601.16886
作者: Chi Yu,Hongyu Yuan,Zhiyi Duan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Knowledge Tracing (KT) aims to model a student’s learning trajectory and predict performance on the next question. A key challenge is how to better represent the relationships among students, questions, and knowledge concepts (KCs). Recently, graph-based KT paradigms have shown promise for this problem. However, existing methods have not sufficiently explored inter-concept relations, often inferred solely from interaction sequences. In addition, the scale and heterogeneity of KT graphs make full-graph encoding both computationally costly and noise-prone, causing attention to bleed into student-irrelevant regions and degrading the fidelity of inter-KC relations. To address these issues, we propose a novel framework: Multi-Agent Graph-Enhanced Knowledge Tracing (MAGE-KT). It constructs a multi-view heterogeneous graph by combining a multi-agent KC relation extractor and a student-question interaction graph, capturing complementary semantic and behavioral signals. Conditioned on the target student’s history, it retrieves compact, high-value subgraphs and integrates them using an Asymmetric Cross-attention Fusion Module to enhance prediction while avoiding attention diffusion and irrelevant computation. Experiments on three widely used KT datasets show substantial improvements in KC-relation accuracy and clear gains in next-question prediction over existing methods.
zh
[AI-8] Explaining Group Recommendations via Counterfactuals
【速读】:该论文旨在解决群体推荐系统(Group Recommender Systems)中缺乏透明性的问题,即群体成员难以理解推荐结果背后的决策依据。现有解释方法多聚焦于个体偏好,无法有效捕捉群体内多个用户偏好之间的复杂交互关系。其解决方案的关键在于提出一种群体反事实解释(Group Counterfactual Explanations)框架,通过量化移除特定历史交互行为对群体推荐结果的影响来提供可解释性;同时引入适用于群体场景的效用与公平性度量,并设计基于帕累托前沿筛选(Pareto-based filtering)和生长-剪枝(grow-and-prune)策略的启发式算法,以在计算效率与解释质量之间实现平衡。
链接: https://arxiv.org/abs/2601.16882
作者: Maria Stratigi,Nikos Bikakis
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Group recommender systems help users make collective choices but often lack transparency, leaving group members uncertain about why items are suggested. Existing explanation methods focus on individuals, offering limited support for groups where multiple preferences interact. In this paper, we propose a framework for group counterfactual explanations, which reveal how removing specific past interactions would change a group recommendation. We formalize this concept, introduce utility and fairness measures tailored to groups, and design heuristic algorithms, such as Pareto-based filtering and grow-and-prune strategies, for efficient explanation discovery. Experiments on MovieLens and Amazon datasets show clear trade-offs: low-cost methods produce larger, less fair explanations, while other approaches yield concise and balanced results at higher cost. Furthermore, the Pareto-filtering heuristic demonstrates significant efficiency improvements in sparse settings.
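摘要提到的"基于帕累托前沿的筛选"可以用一个非支配过滤器来说明:每个反事实解释候选带有若干目标(这里假设为"需要移除的交互数"与"公平性差距",都越小越好),只保留不被其他候选同时支配的那些。目标的具体定义以论文为准,下列数据为示例。

```python
def pareto_filter(candidates):
    """保留在 (size, fairness_gap) 两个目标上非支配的候选(两者越小越好)。"""
    kept = []
    for i, (_, s_i, f_i) in enumerate(candidates):
        dominated = any(
            s_j <= s_i and f_j <= f_i and (s_j < s_i or f_j < f_i)
            for j, (_, s_j, f_j) in enumerate(candidates) if j != i
        )
        if not dominated:
            kept.append(candidates[i])
    return kept

cands = [("e1", 2, 0.30), ("e2", 3, 0.10), ("e3", 4, 0.10), ("e4", 5, 0.05)]
print([name for name, _, _ in pareto_filter(cands)])  # e3 被 e2 支配,其余保留
```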
zh
[AI-9] Boosting Deep Reinforcement Learning with Semantic Knowledge for Robotic Manipulators
【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)在机器人控制中部署时面临的样本效率低下的问题,即训练过程需要大量经验数据,导致计算和时间成本过高。解决方案的关键在于将语义知识以知识图谱嵌入(Knowledge Graph Embeddings, KGEs)的形式与视觉观测信息融合,构建一种新型架构,使智能体在训练过程中能够利用环境中的上下文知识,从而显著提升学习效率。实验表明,该方法可在不增加训练时间和计算复杂度的前提下,实现最多60%的学习时间减少和约15个百分点的任务准确率提升。
链接: https://arxiv.org/abs/2601.16866
作者: Lucía Güitta-López,Vincenzo Suriani,Jaime Boal,Álvaro J. López-López,Daniele Nardi
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep Reinforcement Learning (DRL) is a powerful framework for solving complex sequential decision-making problems, particularly in robotic control. However, its practical deployment is often hindered by the substantial amount of experience required for learning, which results in high computational and time costs. In this work, we propose a novel integration of DRL with semantic knowledge in the form of Knowledge Graph Embeddings (KGEs), aiming to enhance learning efficiency by providing contextual information to the agent. Our architecture combines KGEs with visual observations, enabling the agent to exploit environmental knowledge during training. Experimental validation with robotic manipulators in environments featuring both fixed and randomized target attributes demonstrates that our method achieves up to 60% reduction in learning time and improves task accuracy by approximately 15 percentage points, without increasing training time or computational complexity. These results highlight the potential of semantic knowledge to reduce sample complexity and improve the effectiveness of DRL in robotic applications.
zh
[AI-10] Mixture-of-Models: Unifying Heterogeneous Agents via N-Way Self-Evaluating Deliberation
【速读】:该论文旨在解决传统大模型在硬件资源受限场景下难以实现高性能推理的问题,尤其是如何在不依赖超大规模参数模型(如100B+)的前提下,通过高效协同多个小规模模型(<20B)来逼近甚至超越其性能。解决方案的关键在于提出N-Way Self-Evaluating Deliberation (NSED)协议——一种运行时Mixture-of-Models (MoM)架构,其核心创新包括:(1) 动态专家经纪人(Dynamic Expertise Broker),将模型选择建模为带成本约束的背包问题(Knapsack Problem),实时绑定异构检查点至功能角色;(2) 将推理过程形式化为宏观尺度递归神经网络(Macro-Scale Recurrent Neural Network, RNN),利用语义遗忘门实现迭代精炼而不增加显存(VRAM)比例开销;(3) 引入非线性共识机制(Quadratic Voting激活函数)与无信任Peer Review编排结构,提升群体决策质量。实证结果表明,该方法可在消费级硬件上实现接近或优于顶级大模型的性能,并具备内在对齐优势。
链接: https://arxiv.org/abs/2601.16863
作者: Tims Pecerskis,Aivars Smirnovs
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
备注:
Abstract:This paper introduces the N-Way Self-Evaluating Deliberation (NSED) protocol, a Runtime Mixture-of-Models (MoM) architecture that constructs emergent composite models from a plurality of distinct expert agents. Unlike traditional Mixture-of-Experts (MoE) which rely on static gating networks, NSED employs a Dynamic Expertise Broker - a runtime optimization engine that treats model selection as a variation of the Knapsack Problem, binding heterogeneous checkpoints to functional roles based on live telemetry and cost constraints. At the execution layer, we formalize deliberation as a Macro-Scale Recurrent Neural Network (RNN), where the consensus state loops back through a semantic forget gate to enable iterative refinement without proportional VRAM scaling. Key components include an orchestration fabric for trustless N-to-N peer review, a Quadratic Voting activation function for non-linear consensus, and a feedback-driven state update. Empirical validation on challenging benchmarks (AIME 2025, LiveCodeBench) demonstrates that this topology allows ensembles of small (less than 20B) consumer-grade models to match or exceed the performance of state-of-the-art 100B+ parameter models, establishing a new hardware arbitrage efficiency frontier. Furthermore, testing on the DarkBench safety suite reveals intrinsic alignment properties, with peer-mediated correction reducing sycophancy scores below that of any individual agent.
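摘要提到以"二次投票(Quadratic Voting)激活函数"做非线性共识,其在论文中的确切形式未在摘要中给出。下面只示意标准二次投票的聚合思想:花费 c 个积分获得 sqrt(c) 票,从而抑制单个智能体的极端偏好;智能体与候选答案的数据均为假设。

```python
import math

def quadratic_vote(allocations):
    """标准二次投票聚合:每个候选的有效票数 = sum_agents sqrt(credits)。"""
    scores = {}
    for agent, alloc in allocations.items():
        for option, credits in alloc.items():
            scores[option] = scores.get(option, 0.0) + math.sqrt(credits)
    winner = max(scores, key=scores.get)
    return winner, scores

votes = {
    "agent_a": {"answer_1": 9, "answer_2": 1},
    "agent_b": {"answer_2": 4},
    "agent_c": {"answer_1": 1, "answer_2": 4},
}
print(quadratic_vote(votes))  # answer_1 得 3+1=4 票,answer_2 得 1+2+2=5 票
```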
zh
[AI-11] Orbitopal Fixing in SAT
【速读】:该论文旨在解决布尔可满足性(Boolean Satisfiability, SAT)求解器在处理具有对称性的实例时效率低下的问题,即求解器会重复搜索与已探索区域对称的搜索空间,导致冗余计算。解决方案的关键在于提出一种基于轨道多面体固定(orbitopal fixing)的静态对称性破除方法,该方法通过添加单位子句(unit clauses)来实现对称性约束,既保证了推理速度且不干扰原有求解器启发式策略,又兼容形式化证明日志记录;同时,该方法在替代冗余证明系统(substitution redundancy proof system)中生成简洁的证明证书,从而在对称性丰富的基准测试中实现了稳定的加速效果,且在其他场景下几乎无性能退化。
链接: https://arxiv.org/abs/2601.16855
作者: Markus Anders,Cayden Codel,Marijn J. H. Heule
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: to appear at TACAS 2026
Abstract:Despite their sophisticated heuristics, boolean satisfiability (SAT) solvers are still vulnerable to symmetry, causing them to visit search regions that are symmetric to ones already explored. While symmetry handling is routine in other solving paradigms, integrating it into state-of-the-art proof-producing SAT solvers is difficult: added reasoning must be fast, non-interfering with solver heuristics, and compatible with formal proof logging. To address these issues, we present a practical static symmetry breaking approach based on orbitopal fixing, a technique adapted from mixed-integer programming. Our approach adds only unit clauses, which minimizes downstream slowdowns, and it emits succinct proof certificates in the substitution redundancy proof system. Implemented in the satsuma tool, our methods deliver consistent speedups on symmetry-rich benchmarks with negligible regressions elsewhere.
zh
[AI-12] Uncertainty propagation through trained multi-layer perceptrons: Exact analytical results
【速读】:该论文旨在解决神经网络中不确定性传播(uncertainty propagation)的解析建模问题,特别是针对具有单隐藏层和ReLU激活函数的多层感知机(Multi-Layer Perceptron, MLP)。传统方法通常依赖于泰勒级数展开来近似输出分布的统计特性,但存在精度受限的问题。本文的关键创新在于推导出输入为多元高斯分布时,输出均值与方差的精确解析表达式,无需采用任何近似展开,从而为神经网络的不确定性量化提供了理论严谨且可计算的框架。
链接: https://arxiv.org/abs/2601.16830
作者: Andrew Thompson,Miles McCrory
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Statistics Theory (math.ST)
备注:
Abstract:We give analytical results for propagation of uncertainty through trained multi-layer perceptrons (MLPs) with a single hidden layer and ReLU activation functions. More precisely, we give expressions for the mean and variance of the output when the input is multivariate Gaussian. In contrast to previous results, we obtain exact expressions without resort to a series expansion.
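作为摘要结论的一个具体参照:对单个 ReLU 单元,高斯输入下输出的均值与方差本就有经典解析式(修正高斯分布的一、二阶矩),无需泰勒展开。论文针对整个单隐层 MLP 输出给出的精确表达式更一般,下面只演示单单元情形,以说明"精确解析传播"的含义。

```python
from scipy.stats import norm

def relu_gaussian_moments(mu, sigma):
    """x ~ N(mu, sigma^2) 时,y = max(0, x) 的精确均值与方差(经典结果)。"""
    a = mu / sigma
    mean = mu * norm.cdf(a) + sigma * norm.pdf(a)
    second_moment = (mu ** 2 + sigma ** 2) * norm.cdf(a) + mu * sigma * norm.pdf(a)
    return mean, second_moment - mean ** 2

print(relu_gaussian_moments(mu=0.0, sigma=1.0))  # ≈ (0.3989, 0.3408)
```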
zh
[AI-13] Privacy in Human-AI Romantic Relationships: Concerns Boundaries and Agency
【速读】:该论文旨在解决生成式 AI (Generative AI) 在人类与AI伴侣的浪漫关系中所引发的隐私安全风险问题,特别是这些关系在探索、亲密和终止阶段中的隐私感知与实践差异。其解决方案的关键在于通过访谈研究(N=17)揭示了用户在不同关系阶段对隐私边界的动态认知与协商过程,发现AI伴侣被赋予代理性(agency),能主动参与隐私边界谈判并鼓励用户披露个人信息;同时指出平台功能设计与多元互动模式共同扩展了隐私场景,强调需重构人类-人工智能亲密关系中的隐私建构逻辑。
链接: https://arxiv.org/abs/2601.16824
作者: Rongjun Ma,Shijing He,Jose Luis Martin-Navarro,Xiao Zhan,Jose Such
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted at CHI 2026
Abstract:An increasing number of LLM-based applications are being developed to facilitate romantic relationships with AI partners, yet the safety and privacy risks in these partnerships remain largely underexplored. In this work, we investigate privacy in human-AI romantic relationships through an interview study (N=17), examining participants’ experiences and privacy perceptions across stages of exploration, intimacy, and dissolution, alongside platforms they used. We found that these relationships took varied forms, from one-to-one to one-to-many, and were shaped by multiple actors, including creators, platforms, and moderators. AI partners were perceived as having agency, actively negotiating privacy boundaries with participants and sometimes encouraging disclosure of personal details. As intimacy deepened, these boundaries became more permeable, though some participants voiced concerns such as conversation exposure and sought to preserve anonymity. Overall, platform affordances and diverse romantic dynamics expand the privacy landscape, underscoring the need to rethink how privacy is constructed in human-AI intimacy.
zh
[AI-14] Will It Survive? Deciphering the Fate of AI-Generated Code in Open Source
【速读】:该论文旨在解决生成式 AI(Generative AI)在软件开发中作为编码助手所生成代码的长期可维护性问题,特别是针对“一次性代码”(disposable code)假设的验证——即AI生成代码是否被快速合并后很快被丢弃。研究通过生存分析方法对201个开源项目中超过20万个由AI和人类编写的代码单元进行追踪,发现AI生成代码的修改率显著更低(线级修改率降低15.8个百分点,风险比HR=0.842, p<0.001),表明其具备更强的持久性。关键解决方案在于实证检验了这一假设,并揭示出AI生成代码虽在适应性修改上略逊于人类代码,但整体稳定性更高;同时指出预测代码何时被修改仍具挑战性(宏F1仅为0.285),提示组织实践而非生成质量才是决定AI代码长期演进的关键瓶颈。
链接: https://arxiv.org/abs/2601.16809
作者: Musfiqur Rahman,Emad Shihab
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: This paper has been submitted to EASE 2026 research track and currently under review
Abstract:The integration of AI agents as coding assistants into software development has raised questions about the long-term viability of AI agent-generated code. A prevailing hypothesis within the software engineering community suggests this code is “disposable”, meaning it is merged quickly but discarded shortly thereafter. If true, organizations risk shifting maintenance burden from generation to post-deployment remediation. We investigate this hypothesis through survival analysis of 201 open-source projects, tracking over 200,000 code units authored by AI agents versus humans. Contrary to the disposable code narrative, agent-authored code survives significantly longer: at the line level, it exhibits a 15.8 percentage-point lower modification rate and 16% lower hazard of modification (HR = 0.842, p < 0.001). However, modification profiles differ. Agent-authored code shows modestly elevated corrective rates (26.3% vs. 23.0%), while human code shows higher adaptive rates. However, the effect sizes are small (Cramér’s V = 0.116), and per-agent variation exceeds the agent-human gap. Turning to prediction, textual features can identify modification-prone code (AUC-ROC = 0.671), but predicting when modifications occur remains challenging (Macro F1 = 0.285), suggesting timing depends on external organizational dynamics. The bottleneck for agent-generated code may not be generation quality, but the organizational practices that govern its long-term evolution.
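摘要中的风险比(HR)来自生存分析。下面用 lifelines 的 Cox 比例风险模型给出同类分析的最小示意:以"代码单元是否被修改"为事件、"存活周数"为时长、"是否由智能体生成"为协变量;玩具数据与列名均为本文假设,与论文数据无关。

```python
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "weeks_survived": [4, 30, 52, 12, 80, 26, 60, 8, 45, 20],
    "modified":       [1, 1, 0, 1, 0, 1, 0, 1, 1, 1],   # 1 = 该代码单元被修改(事件发生)
    "agent_authored": [0, 1, 1, 0, 1, 0, 1, 0, 0, 1],   # 1 = AI 智能体生成
})
cph = CoxPHFitter()
cph.fit(df, duration_col="weeks_survived", event_col="modified")
print(cph.hazard_ratios_)  # agent_authored 的 HR < 1 表示更不容易被修改
```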
zh
[AI-15] An Efficient Insect-inspired Approach for Visual Point-goal Navigation
【速读】:该论文旨在解决视觉点目标导航(visual point-goal navigation)任务中计算成本高、效率低的问题,尤其针对传统深度学习模型在复杂环境中路径规划时所需的大量算力资源。解决方案的关键在于受昆虫大脑结构启发,构建了一个融合了关联学习(associative learning)和路径整合(path integration)机制的简化智能体(agent)。该设计模拟了昆虫在真实世界中通过视觉线索学习并优化从巢穴到食物源路径的能力,在保持高性能的同时显著降低了计算开销,且在更贴近现实的仿真环境中展现出鲁棒性。
链接: https://arxiv.org/abs/2601.16806
作者: Lu Yihe,Barbara Webb
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:In this work we develop a novel insect-inspired agent for visual point-goal navigation. This combines abstracted models of two insect brain structures that have been implicated, respectively, in associative learning and path integration. We draw an analogy between the formal benchmark of the Habitat point-goal navigation task and the ability of insects to learn and refine visually guided paths around obstacles between a discovered food location and their nest. We demonstrate that the simple insect-inspired agent exhibits performance comparable to recent SOTA models at many orders of magnitude less computational cost. Testing in a more realistic simulated environment shows the approach is robust to perturbations.
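摘要中所说的路径整合,指把沿途每一步的(速度、朝向)位移向量累加,随时维护一个指向出发点的"回巢向量"。下面给出这一机制本身的最小示意;它与昆虫脑联想学习模块如何组合属于论文内容,此处的步长数据为假设。

```python
import numpy as np

def path_integrate(steps):
    """累加每步位移,返回相对出发点的位置;取负即为回巢向量(示意)。"""
    pos = np.zeros(2)
    for speed, heading in steps:
        pos += speed * np.array([np.cos(heading), np.sin(heading)])
    return pos

outbound = [(1.0, 0.0), (1.0, 0.0), (1.0, np.pi / 2), (1.0, np.pi / 2)]
pos = path_integrate(outbound)
print(pos, -pos)  # 当前位置 ≈ (2, 2),回巢向量 ≈ (-2, -2)
```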
zh
[AI-16] GTA: Generative Traffic Agents for Simulating Realistic Mobility Behavior
【速读】:该论文试图解决如何在不依赖昂贵数据采集和手工规则的前提下,大规模、情境敏感地模拟人类交通选择行为这一挑战,这对于城市规划与可持续交通政策的早期评估至关重要。解决方案的关键在于提出生成式交通代理(Generative Traffic Agents, GTA),其通过大语言模型(LLM)驱动的基于人格特征的代理,从人口普查的社会经济数据中生成人工人群,并模拟活动日程与出行方式选择,从而实现无需人工设定规则的大规模、类人级交通行为仿真。
链接: https://arxiv.org/abs/2601.16778
作者: Simon Lämmer,Mark Colley,Patrick Ebel
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Conditionally accepted at CHI 2026
Abstract:People’s transportation choices reflect complex trade-offs shaped by personal preferences, social norms, and technology acceptance. Predicting such behavior at scale is a critical challenge with major implications for urban planning and sustainable transport. Traditional methods use handcrafted assumptions and costly data collection, making them impractical for early-stage evaluations of new technologies or policies. We introduce Generative Traffic Agents (GTA) for simulating large-scale, context-sensitive transportation choices using LLM-powered, persona-based agents. GTA generates artificial populations from census-based sociodemographic data. It simulates activity schedules and mode choices, enabling scalable, human-like simulations without handcrafted rules. We evaluate GTA in Berlin-scale experiments, comparing simulation results against empirical data. While agents replicate patterns, such as modal split by socioeconomic status, they show systematic biases in trip length and mode preference. GTA offers new opportunities for modeling how future innovations, from bike lanes to transit apps, shape mobility decisions.
zh
[AI-17] LongCat-Flash-Thinking-2601 Technical Report
【速读】:该论文旨在解决当前开源大模型在复杂工具交互(tool-integrated reasoning)和真实世界噪声环境下的泛化能力与鲁棒性不足的问题。其核心挑战在于如何通过系统性训练框架提升模型在多领域、多任务场景中的稳定推理能力,尤其是在长尾分布生成、多轮代理交互以及高噪声现实环境中保持高性能。解决方案的关键在于:首先,构建了一个统一的训练范式,融合领域并行专家训练与后续融合机制,并实现从预训练到后训练的数据构造、环境设计、算法优化与基础设施的端到端协同;其次,通过深入探索环境扩展(environment scaling)与任务结构化设计,显著增强模型对复杂工具使用的泛化能力;再次,基于对真实世界噪声模式的系统分析与分解,引入针对性训练策略以显式建模噪声,从而提升实际应用中的鲁棒性;最后,提出“Heavy Thinking”模式,在测试阶段通过并行扩展推理深度与宽度实现有效的推理规模放大(test-time scaling)。
链接: https://arxiv.org/abs/2601.16725
作者: Meituan LongCat Team,Anchun Gui,Bei Li,Bingyang Tao,Bole Zhou,Borun Chen,Chao Zhang,Chao Zhang,Chen Gao,Chen Zhang,Chengcheng Han,Chenhui Yang,Chuyu Zhang,Cong Chen,Cunguang Wang,Daoru Pan,Defei Bu,Dengchang Zhao,Di Xiu,Dishan Liu,Dongyu Ru,Dunwei Tu,Fan Wu,Fengcheng Yuan,Fengcun Li,Gang Xu,Guanyu Wu,Guoyuan Lin,Haibin Wang,Hansi Yang,Hao Yang,Haonan Yan,Haoxiang Ma,Haoxing Wen,Hongyan Hao,Hongyin Tang,Hongyu Zang,Hongzhi Ni,Hui Su,Jiacheng Zhang,Jiahong Zhou,Jiahuan Li,Jiaming Wang,Jian Yang,Jianfei Zhang,Jianhao Xu,Jianing Wang,Jiapeng Zhu,Jiaqi Sun,Jiarong Shi,Jiarui Zhao,Jingang Wang,Jinluan Yang,Jinrui Ding,Jinwei Xiao,Jiyuan He,Juncan Xu,Kefeng Zhang,Keheng Wang,Li Wei,Lianhui Ma,Lin Qiu,Lingbing Kong,Lingchuan Liu,Linsen Guo,Mengshen Zhu,Mengxia Shen,Mingyang Zhu,Peiguang Li,Peng Pei,Pengcheng Jia,Pengtao Zhang,Peng Zhao,Qi Gu,Qiong Huang,Qiyuan Duan,Quanchi Weng,Rongxiang Weng,Rongzhi Zhang,Rumei Li,Shanglin Lei,Shengnan An,Shijun Dai,Shuaikang Liu,Shuang Zhou,Shuo Wang,Songyuan Zhao,Tao Liang,Tianhao Hu,Tianze Chen,Wei Liu,Wei Shi,Wei Wang,Weifeng Tang,Wenjie Shi,Wenlong Zhu,Wentao Chen,Wentao Shi,Xi Su,Xiangcheng Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce LongCat-Flash-Thinking-2601, a 560-billion-parameter open-source Mixture-of-Experts (MoE) reasoning model with superior agentic reasoning capability. LongCat-Flash-Thinking-2601 achieves state-of-the-art performance among open-source models on a wide range of agentic benchmarks, including agentic search, agentic tool use, and tool-integrated reasoning. Beyond benchmark performance, the model demonstrates strong generalization to complex tool interactions and robust behavior under noisy real-world environments. Its advanced capability stems from a unified training framework that combines domain-parallel expert training with subsequent fusion, together with an end-to-end co-design of data construction, environments, algorithms, and infrastructure spanning from pre-training to post-training. In particular, the model’s strong generalization capability in complex tool-use are driven by our in-depth exploration of environment scaling and principled task construction. To optimize long-tailed, skewed generation and multi-turn agentic interactions, and to enable stable training across over 10,000 environments spanning more than 20 domains, we systematically extend our asynchronous reinforcement learning framework, DORA, for stable and efficient large-scale multi-environment training. Furthermore, recognizing that real-world tasks are inherently noisy, we conduct a systematic analysis and decomposition of real-world noise patterns, and design targeted training procedures to explicitly incorporate such imperfections into the training process, resulting in improved robustness for real-world applications. To further enhance performance on complex reasoning tasks, we introduce a Heavy Thinking mode that enables effective test-time scaling by jointly expanding reasoning depth and width through intensive parallel thinking.
zh
[AI-18] Dynamic Expert-Guided Model Averag ing for Causal Discovery
【速读】:该论文旨在解决因果发现(causal discovery)中算法选择困难与现实场景假设不成立的问题。在实际应用中,由于存在大量性能相当但假设条件各异的因果发现算法,且真实数据常违反经典算法的理想假设(如无混杂变量、线性关系等),导致模型可靠性下降,亟需一种灵活、鲁棒的集成策略。其解决方案的关键在于提出一种基于动态请求专家知识的模型平均方法(flexible model averaging method),通过引入可动态调用的专家知识(如大语言模型 LLMs)来指导不同算法的加权组合,从而提升集成模型在噪声数据和不完美专家情况下的表现,尤其适用于临床等复杂场景中的因果推断任务。
链接: https://arxiv.org/abs/2601.16715
作者: Adrick Tench,Thomas Demeester
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Understanding causal relationships is critical for healthcare. Accurate causal models provide a means to enhance the interpretability of predictive models, and furthermore a basis for counterfactual and interventional reasoning and the estimation of treatment effects. However, would-be practitioners of causal discovery face a dizzying array of algorithms without a clear best choice. This abundance of competitive algorithms makes ensembling a natural choice for practical applications. At the same time, real-world use cases frequently face challenges that violate the assumptions of common causal discovery algorithms, forcing heavy reliance on expert knowledge. Inspired by recent work on dynamically requested expert knowledge and LLMs as experts, we present a flexible model averaging method leveraging dynamically requested expert knowledge to ensemble a diverse array of causal discovery algorithms. Experiments demonstrate the efficacy of our method with imperfect experts such as LLMs on both clean and noisy data. We also analyze the impact of different degrees of expert correctness and assess the capabilities of LLMs for clinical causal discovery, providing valuable insights for practitioners.
zh
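针对上文 [AI-18] 摘要中"用(可能不完美的)专家知识对多个因果发现算法的输出做模型平均"这一思路,下面给出一个极简的 Python 示意:对若干算法输出的邻接矩阵按专家给出的信任度加权平均,再按阈值保留边。其中 ensemble_causal_graphs、expert_weights、threshold 等名称与取值均为本文为说明而假设,并非论文的原始实现(论文中的专家知识是动态请求的)。

```python
import numpy as np

def ensemble_causal_graphs(adj_list, expert_weights, threshold=0.5):
    """对多个因果发现算法输出的邻接矩阵做加权平均(示意)。

    adj_list       : 各算法输出的 0/1 邻接矩阵列表, A[i, j] = 1 表示存在边 i -> j
    expert_weights : 专家(例如 LLM)对各算法在当前数据上的信任度(假设值)
    threshold      : 加权平均后保留一条边所需的最低"置信度"
    """
    weights = np.asarray(expert_weights, dtype=float)
    weights = weights / weights.sum()              # 归一化权重
    stacked = np.stack(adj_list).astype(float)     # 形状 (算法数, d, d)
    avg = np.tensordot(weights, stacked, axes=1)   # 每条边的加权"得票率"
    return (avg >= threshold).astype(int), avg

# 用法示意:三个算法在 3 个变量上的输出
A1 = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
A2 = np.array([[0, 1, 1], [0, 0, 1], [0, 0, 0]])
A3 = np.array([[0, 0, 0], [0, 0, 1], [1, 0, 0]])
consensus, confidence = ensemble_causal_graphs([A1, A2, A3], expert_weights=[0.5, 0.3, 0.2])
print(consensus)
```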
[AI-19] Adoption of Generative Artificial Intelligence in the German Software Engineering Industry: An Empirical Study
【速读】:该论文旨在解决当前关于生成式 AI(Generative AI)工具在德国软件工程师群体中采用动态的实证研究缺失问题,尤其关注深度交互、组织约束和经验因素如何影响其有效使用。解决方案的关键在于通过混合方法研究设计——包括18名从业者的探索性访谈与109名开发者的问卷调查——系统分析了工具采纳模式、提示策略及组织因素对有效性的影响,识别出经验水平调节感知收益、组织规模影响工具选择与使用强度,并指出项目上下文认知不足是最主要障碍,从而为开发者、组织和工具厂商提供可操作的实践建议。
链接: https://arxiv.org/abs/2601.16700
作者: Ludwig Felder,Tobias Eisenreich,Mahsa Fischer,Stefan Wagner,Chunyang Chen
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
备注: Submitted to FSE '26
Abstract:Generative artificial intelligence (GenAI) tools have seen rapid adoption among software developers. While adoption rates in the industry are rising, the underlying factors influencing the effective use of these tools, including the depth of interaction, organizational constraints, and experience-related considerations, have not been thoroughly investigated. This issue is particularly relevant in environments with stringent regulatory requirements, such as Germany, where practitioners must address the GDPR and the EU AI Act while balancing productivity gains with intellectual property considerations. Despite the significant impact of GenAI on software engineering, to the best of our knowledge, no empirical study has systematically examined the adoption dynamics of GenAI tools within the German context. To address this gap, we present a comprehensive mixed-methods study on GenAI adoption among German software engineers. Specifically, we conducted 18 exploratory interviews with practitioners, followed by a developer survey with 109 participants. We analyze patterns of tool adoption, prompting strategies, and organizational factors that influence effectiveness. Our results indicate that experience level moderates the perceived benefits of GenAI tools, and productivity gains are not evenly distributed among developers. Further, organizational size affects both tool selection and the intensity of tool use. Limited awareness of the project context is identified as the most significant barrier. We summarize a set of actionable implications for developers, organizations, and tool vendors seeking to advance artificial intelligence (AI) assisted software development.
zh
[AI-20] AgentsEval: Clinically Faithful Evaluation of Medical Imaging Reports via Multi-Agent Reasoning
【速读】:该论文旨在解决医学影像报告自动生成系统中临床正确性和推理保真度难以准确评估的问题(clinical correctness and reasoning fidelity)。现有方法因无法捕捉放射学诊断中的结构化逻辑而存在判断不可靠、临床相关性弱的缺陷。解决方案的关键在于提出AgentsEval——一种多智能体流式推理框架,模拟放射科医生的协作诊断流程,将评估过程分解为可解释的步骤:标准定义、证据提取、对齐与一致性评分,从而提供明确的推理路径和结构化的临床反馈。此外,研究构建了一个基于多领域扰动的基准测试集,涵盖五种不同成像模态的数据集,并引入受控语义变化以验证评估的鲁棒性。实验表明,该框架在句法改写、语义扰动和风格变化下仍能保持临床一致性和语义忠实性,推动了大语言模型在临床实践中可信部署的发展。
链接: https://arxiv.org/abs/2601.16685
作者: Suzhong Fu,Jingqi Dong,Xuan Ding,Rui Sun,Yiming Yang,Shuguang Cui,Zhen Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Evaluating the clinical correctness and reasoning fidelity of automatically generated medical imaging reports remains a critical yet unresolved challenge. Existing evaluation methods often fail to capture the structured diagnostic logic that underlies radiological interpretation, resulting in unreliable judgments and limited clinical relevance. We introduce AgentsEval, a multi-agent stream reasoning framework that emulates the collaborative diagnostic workflow of radiologists. By dividing the evaluation process into interpretable steps including criteria definition, evidence extraction, alignment, and consistency scoring, AgentsEval provides explicit reasoning traces and structured clinical feedback. We also construct a multi-domain perturbation-based benchmark covering five medical report datasets with diverse imaging modalities and controlled semantic variations. Experimental results demonstrate that AgentsEval delivers clinically aligned, semantically faithful, and interpretable evaluations that remain robust under paraphrastic, semantic, and stylistic perturbations. This framework represents a step toward transparent and clinically grounded assessment of medical report generation systems, fostering trustworthy integration of large language models into clinical practice.
zh
[AI-21] Sim-to-Real Transfer via a Style-Identified Cycle Consistent Generative Adversarial Network: Zero-Shot Deployment on Robotic Manipulators through Visual Domain Adaptation
【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)在工业场景中应用时面临的样本效率低的问题,特别是由于真实环境训练成本高、耗时长所导致的部署困难。其核心挑战在于虚拟环境与真实环境之间的“仿真到现实差距”(sim-to-real gap),限制了策略从仿真到物理世界的直接迁移。为实现零样本迁移(zero-shot transfer),即无需额外实机调优即可直接部署,本文提出了一种基于风格识别循环一致生成对抗网络(Style-Identified Cycle Consistent Generative Adversarial Network, StyleID-CycleGAN 或 SICGAN)的域适应方法。该方案的关键创新在于利用SICGAN将原始虚拟观测转化为具有真实感的合成图像,从而构建一个融合虚拟动力学与真实视觉输入的混合训练域,使DRL智能体在虚拟环境中训练后可直接应用于真实场景,实验验证其在两种工业机器人上的零样本迁移成功率超过95%,且具备对不同颜色和形状物体的良好泛化能力。
链接: https://arxiv.org/abs/2601.16677
作者: Lucía Güitta-López,Lionel Güitta-López,Jaime Boal,Álvaro Jesús López-López
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:The sample efficiency challenge in Deep Reinforcement Learning (DRL) compromises its industrial adoption due to the high cost and time demands of real-world training. Virtual environments offer a cost-effective alternative for training DRL agents, but the transfer of learned policies to real setups is hindered by the sim-to-real gap. Achieving zero-shot transfer, where agents perform directly in real environments without additional tuning, is particularly desirable for its efficiency and practical value. This work proposes a novel domain adaptation approach relying on a Style-Identified Cycle Consistent Generative Adversarial Network (StyleID-CycleGAN or SICGAN), an original Cycle Consistent Generative Adversarial Network (CycleGAN) based model. SICGAN translates raw virtual observations into real-synthetic images, creating a hybrid domain for training DRL agents that combines virtual dynamics with real-like visual inputs. Following virtual training, the agent can be directly deployed, bypassing the need for real-world training. The pipeline is validated with two distinct industrial robots in the approaching phase of a pick-and-place operation. In virtual environments agents achieve success rates of 90 to 100%, and real-world deployment confirms robust zero-shot transfer (i.e., without additional training in the physical environment) with accuracies above 95% for most workspace regions. We use augmented reality targets to improve the evaluation process efficiency, and experimentally demonstrate that the agent successfully generalizes to real objects of varying colors and shapes, including LEGO® cubes and a mug. These results establish the proposed pipeline as an efficient, scalable solution to the sim-to-real problem.
zh
[AI-22] Revisiting the Role of Natural Language Code Comments in Code Translation
【速读】:该论文旨在解决当前代码翻译(code translation)任务中因缺乏自然语言代码注释(code comments)而导致的性能瓶颈问题,尤其是现有基准测试未充分考虑注释对翻译质量的影响。研究表明,代码注释,特别是描述代码整体功能而非逐行细节的注释,能显著提升大语言模型(LLM)在跨语言代码转换中的准确性。解决方案的关键在于提出一种名为 COMMENTRA 的新方法,其核心思想是利用高质量的代码注释作为上下文信息来增强 LLM 的理解能力,实验证明该方法可使基于 LLM 的代码翻译性能提升近一倍。
链接: https://arxiv.org/abs/2601.16661
作者: Monika Gupta,Ajay Meena,Anamitra Roy Choudhury,Vijay Arya,Srikanta Bedathur
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:The advent of large language models (LLMs) has ushered in a new era in automated code translation across programming languages. Since most code-specific LLMs are pretrained on well-commented code from large repositories like GitHub, it is reasonable to hypothesize that natural language code comments could aid in improving translation quality. Despite their potential relevance, comments are largely absent from existing code translation benchmarks, rendering their impact on translation quality inadequately characterised. In this paper, we present a large-scale empirical study evaluating the impact of comments on translation performance. Our analysis involves more than 80,000 translations, with and without comments, of 1100+ code samples from two distinct benchmarks covering pairwise translations between five different programming languages: C, C++, Go, Java, and Python. Our results provide strong evidence that code comments, particularly those that describe the overall purpose of the code rather than line-by-line functionality, significantly enhance translation accuracy. Based on these findings, we propose COMMENTRA, a code translation approach, and demonstrate that it can potentially double the performance of LLM-based code translation. To the best of our knowledge, our study is the first in terms of its comprehensiveness, scale, and language coverage on how to improve code translation accuracy using code comments.
zh
[AI-23] Provably Robust Bayesian Counterfactual Explanations under Model Changes
【速读】:该论文旨在解决现实场景中机器学习模型频繁更新导致现有反事实解释(Counterfactual Explanations, CEs)快速失效或不可靠的问题。其解决方案的关键在于提出概率安全的反事实解释(Probabilistically Safe CEs, PSCE),通过引入 δ-安全(确保高预测置信度)和 ϵ-鲁棒(确保低预测方差)两个形式化约束,基于贝叶斯原理构建在模型变化下仍满足 ⟨δ,ϵ⟩-集的概率保证机制,并将不确定性感知约束整合进优化框架,从而生成在模型更新后依然可靠、可解释且具有理论保障的反事实样本。
链接: https://arxiv.org/abs/2601.16659
作者: Jamie Duell,Xiuyi Fan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Counterfactual explanations (CEs) offer interpretable insights into machine learning predictions by answering "what if?" questions. However, in real-world settings where models are frequently updated, existing counterfactual explanations can quickly become invalid or unreliable. In this paper, we introduce Probabilistically Safe CEs (PSCE), a method for generating counterfactual explanations that are δ-safe, to ensure high predictive confidence, and ε-robust to ensure low predictive variance. Based on Bayesian principles, PSCE provides formal probabilistic guarantees for CEs under model changes which are adhered to in what we refer to as the ⟨δ, ε⟩-set. Uncertainty-aware constraints are integrated into our optimization framework and we validate our method empirically across diverse datasets. We compare our approach against state-of-the-art Bayesian CE methods, where PSCE produces counterfactual explanations that are not only more plausible and discriminative, but also provably robust under model change.
zh
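结合上文 [AI-23] 摘要中 δ-安全(高预测置信度)与 ε-鲁棒(低预测方差)的描述,下面是一个最小示意:用若干后验模型采样(例如 MC dropout 或贝叶斯后验抽样)对候选反事实的预测,检查其是否同时满足两个阈值。函数名、阈值以及"对目标类概率取均值/方差"这一具体约束形式均为假设,仅用于说明 ⟨δ, ε⟩ 约束的大致含义,并非论文的正式定义。

```python
import numpy as np

def is_delta_epsilon_valid(prob_samples, target_class, delta=0.9, eps=0.02):
    """检查一个反事实样本是否满足 δ-安全、ε-鲁棒(示意, 非论文原始约束形式)。

    prob_samples : 形状 (S, C), S 个后验模型采样对该反事实给出的类别概率
    target_class : 期望翻转到的目标类别索引
    """
    p_target = prob_samples[:, target_class]
    delta_safe = p_target.mean() >= delta    # 平均预测置信度足够高
    eps_robust = p_target.var() <= eps       # 后验方差足够低 -> 对模型变化更稳健
    return delta_safe and eps_robust

# 用法示意:5 次后验采样、二分类
samples = np.array([[0.08, 0.92], [0.05, 0.95], [0.10, 0.90], [0.07, 0.93], [0.12, 0.88]])
print(is_delta_epsilon_valid(samples, target_class=1, delta=0.9, eps=0.02))
```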
[AI-24] Generative Confidants: How do People Experience Trust in Emotional Support from Generative AI?
【速读】:该论文试图解决的问题是:在生成式 AI(Generative AI)被广泛用于情感支持和陪伴的背景下,人们如何在非监督、非正式的交互中建立并体验信任,这一机制尚不明确。解决方案的关键在于通过一项定性研究,对24名频繁使用生成式 AI 获取情感支持的用户进行日记记录、对话转录与深度访谈,识别出三个新的信任驱动因素:个性化带来的熟悉感、对生成式 AI 的细致心理模型以及用户对对话控制权的认知。研究发现,生成式 AI 采用统一的个性化、积极且具说服力的语言风格有助于促进这些信任因素,但也可能削弱用户对 AI 本质(即其为训练后模拟人类语言的机器)的认知,从而影响信任行为的多样性。
链接: https://arxiv.org/abs/2601.16656
作者: Riccardo Volpato,Simone Stumpf,Lisa DeBruine
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:People are increasingly turning to generative AI (e.g., ChatGPT, Gemini, Copilot) for emotional support and companionship. While trust is likely to play a central role in enabling these informal and unsupervised interactions, we still lack an understanding of how people develop and experience it in this context. Seeking to fill this gap, we recruited 24 frequent users of generative AI for emotional support and conducted a qualitative study consisting of diary entries about interactions, transcripts of chats with AI, and in-depth interviews. Our results suggest important novel drivers of trust in this context: familiarity emerging from personalisation, nuanced mental models of generative AI, and awareness of people’s control over conversations. Notably, generative AI’s homogeneous use of personalised, positive, and persuasive language appears to promote some of these trust-building factors. However, this also seems to discourage other trust-related behaviours, such as remembering that generative AI is a machine trained to converse in human language. We present implications for future research that are likely to become critical as the use of generative AI for emotional support increasingly overlaps with therapeutic work.
zh
[AI-25] LUMINA: Long-horizon Understanding for Multi-turn Interactive Agents
【速读】:该论文旨在解决当前大型语言模型在多轮、长周期代理任务中表现不佳的问题,这类任务通常需要规划(planning)、状态追踪(state tracking)和长上下文处理等复杂能力。为厘清各项底层能力对任务成功的重要性,作者提出了一种基于“Oracle反事实框架”的评估方法:通过假设代理能获得一个完美执行特定技能的“Oracle”辅助,测量其性能提升幅度,从而量化该技能的关键性。解决方案的核心在于构建一组可调复杂度的程序化生成游戏类任务,结合精确的Oracle干预(如完美规划或无误状态追踪),在受控环境中隔离并量化各技能的贡献,避免现实基准中存在的混杂因素。
链接: https://arxiv.org/abs/2601.16649
作者: Amin Rakhsha,Thomas Hehn,Pietro Mazzaglia,Fabio Valerio Massoli,Arash Behboodi,Tribhuvanesh Orekondy
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models can perform well on many isolated tasks, yet they continue to struggle on multi-turn, long-horizon agentic problems that require skills such as planning, state tracking, and long context processing. In this work, we aim to better understand the relative importance of advancing these underlying capabilities for success on such tasks. We develop an oracle counterfactual framework for multi-turn problems that asks: how would an agent perform if it could leverage an oracle to perfectly perform a specific task? The change in the agent’s performance due to this oracle assistance allows us to measure the criticality of such oracle skill in the future advancement of AI agents. We introduce a suite of procedurally generated, game-like tasks with tunable complexity. These controlled environments allow us to provide precise oracle interventions, such as perfect planning or flawless state tracking, and make it possible to isolate the contribution of each oracle without confounding effects present in real-world benchmarks. Our results show that while some interventions (e.g., planning) consistently improve performance across settings, the usefulness of other skills is dependent on the properties of the environment and language model. Our work sheds light on the challenges of multi-turn agentic environments to guide the future efforts in the development of AI agents and language models.
zh
[AI-26] Dual-Prototype Disentanglement: A Context-Aware Enhancement Framework for Time Series Forecasting
【速读】:该论文旨在解决时间序列预测中现有深度学习方法难以动态解耦并利用复杂、交织的时序模式的问题,导致模型学习到静态的平均表示而缺乏上下文感知能力。其解决方案的关键在于提出一种模型无关的辅助框架——双原型自适应解耦(DPAD),核心包括两个部分:一是构建包含共性模式库(具有强时序先验)和稀有模式库(动态记忆关键罕见事件)的动态双原型库(DDP),二是设计双路径上下文感知路由机制(DPC),通过从DDP中选择性检索特定上下文的模式表示来增强输出;此外引入解耦引导损失(DGLoss)确保各原型库各司其职且覆盖全面,从而实现对时序模式的动态解耦与上下文自适应。
链接: https://arxiv.org/abs/2601.16632
作者: Haonan Yang,Jianchao Tang,Zhuo Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Time series forecasting has witnessed significant progress with deep learning. While prevailing approaches enhance forecasting performance by modifying architectures or introducing novel enhancement strategies, they often fail to dynamically disentangle and leverage the complex, intertwined temporal patterns inherent in time series, thus resulting in the learning of static, averaged representations that lack context-aware capabilities. To address this, we propose the Dual-Prototype Adaptive Disentanglement framework (DPAD), a model-agnostic auxiliary method that equips forecasting models with the ability of pattern disentanglement and context-aware adaptation. Specifically, we construct a Dynamic Dual-Prototype bank (DDP), comprising a common pattern bank with strong temporal priors to capture prevailing trend or seasonal patterns, and a rare pattern bank dynamically memorizing critical yet infrequent events, and then an Dual-Path Context-aware routing (DPC) mechanism is proposed to enhance outputs with selectively retrieved context-specific pattern representations from the DDP. Additionally, we introduce a Disentanglement-Guided Loss (DGLoss) to ensure that each prototype bank specializes in its designated role while maintaining comprehensive coverage. Comprehensive experiments demonstrate that DPAD consistently improves forecasting performance and reliability of state-of-the-art models across diverse real-world benchmarks.
zh
[AI-27] E2Former-V2: On-the-Fly Equivariant Attention with Linear Activation Memory
【速读】:该论文旨在解决当前等变图神经网络(Equivariant Graph Neural Networks, EGNNs)在建模三维原子系统时面临的可扩展性瓶颈问题,尤其是由于显式构建几何特征或在每条边进行密集张量积运算导致的计算效率低下。解决方案的关键在于提出一种名为E2Former-V2的新架构,其核心创新包括:(1) 等变轴对齐稀疏化(Equivariant Axis-Aligned Sparsification, EAAS),通过利用SO(3)到SO(2)的基变换,将计算密集的稠密张量收缩转化为高效的稀疏奇偶重索引操作;(2) 原位等变注意力机制(On-the-Fly Equivariant Attention),基于定制的融合Triton内核实现完全以节点为中心的机制,避免生成边缘张量并最大化共享随机存取存储器(SRAM)利用率,从而在推理阶段实现高达20倍的TFLOPS提升。实验表明,该方法在保持与现有模型相当预测性能的同时显著加速了推理过程。
链接: https://arxiv.org/abs/2601.16622
作者: Lin Huang,Chengxiang Huang,Ziang Wang,Yiyue Du,Chu Wang,Haocheng Lu,Yunyang Li,Xiaoli Liu,Arthur Jiang,Jia Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Equivariant Graph Neural Networks (EGNNs) have become a widely used approach for modeling 3D atomistic systems. However, mainstream architectures face critical scalability bottlenecks due to the explicit construction of geometric features or dense tensor products on every edge. To overcome this, we introduce E2Former-V2, a scalable architecture that integrates algebraic sparsity with hardware-aware execution. We first propose Equivariant Axis-Aligned Sparsification (EAAS). EAAS builds on Wigner-6j convolution by exploiting an SO(3) → SO(2) change of basis to transform computationally expensive dense tensor contractions into efficient, sparse parity re-indexing operations. Building on this representation, we introduce On-the-Fly Equivariant Attention, a fully node-centric mechanism implemented via a custom fused Triton kernel. By eliminating materialized edge tensors and maximizing SRAM utilization, our kernel achieves a 20× improvement in TFLOPS compared to standard implementations. Extensive experiments on the SPICE and OMol25 datasets demonstrate that E2Former-V2 maintains comparable predictive performance while notably accelerating inference. This work demonstrates that large equivariant transformers can be trained efficiently using widely accessible GPU platforms. The code is available at this https URL.
zh
[AI-28] Integrating Meteorological and Operational Data: A Novel Approach to Understanding Railway Delays in Finland WWW
【速读】:该论文旨在解决铁路运行延迟预测中缺乏多源异构数据融合的问题,特别是如何有效整合气象信息与铁路运营数据以提升延迟分析的准确性。其核心解决方案是构建并公开发布首个将芬兰铁路运营数据与同步气象观测数据(2018–2024年)集成的高质量数据集,通过Haversine距离实现时空对齐,并采用空间插值填补缺失数据、周期性编码时间特征及鲁棒缩放处理传感器异常值等预处理策略,最终形成包含28个工程特征的3850万条观测记录,为机器学习模型提供可靠输入。
链接: https://arxiv.org/abs/2601.16592
作者: Vinicius Pozzobon Borin,Jean Michel de Souza Sant’Ana,Usama Raheel,Nurul Huda Mahmood
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: 12 pages, 8 figures, database: this https URL
Abstract:Train delays result from complex interactions between operational, technical, and environmental factors. While weather impacts railway reliability, particularly in Nordic regions, existing datasets rarely integrate meteorological information with operational train data. This study presents the first publicly available dataset combining Finnish railway operations with synchronized meteorological observations from 2018-2024. The dataset integrates operational metrics from Finland Digitraffic Railway Traffic Service with weather measurements from 209 environmental monitoring stations, using spatial-temporal alignment via Haversine distance. It encompasses 28 engineered features across operational variables and meteorological measurements, covering approximately 38.5 million observations from Finland’s 5,915-kilometer rail network. Preprocessing includes strategic missing data handling through spatial fallback algorithms, cyclical encoding of temporal features, and robust scaling of weather data to address sensor outliers. Analysis reveals distinct seasonal patterns, with winter months exhibiting delay rates exceeding 25% and geographic clustering of high-delay corridors in central and northern Finland. Furthermore, the work demonstrates applications of the data set in analysing the reliability of railway traffic in Finland. A baseline experiment using XGBoost regression achieved a Mean Absolute Error of 2.73 minutes for predicting station-specific delays, demonstrating the dataset’s utility for machine learning applications. The dataset enables diverse applications, including train delay prediction, weather impact assessment, and infrastructure vulnerability mapping, providing researchers with a flexible resource for machine learning applications in railway operations research.
zh
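上文 [AI-28] 摘要提到用 Haversine 距离把列车事件与最近的气象站做时空对齐,并对时间特征做周期性编码。下面给出这两步的通用 Python 示意(与论文发布的数据处理代码无关,示例坐标与列名均为假设)。

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Haversine 大圆距离(km), 可用于把列车事件匹配到最近的气象站。"""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def add_cyclic_time_features(df, col="hour", period=24):
    """把周期性时间特征编码为 sin/cos 两列, 避免 23 点与 0 点在数值上"相距很远"。"""
    df[f"{col}_sin"] = np.sin(2 * np.pi * df[col] / period)
    df[f"{col}_cos"] = np.cos(2 * np.pi * df[col] / period)
    return df

# 用法示意
print(round(haversine_km(60.17, 24.94, 65.01, 25.47), 1))   # 赫尔辛基 -> 奥卢, 约 540 km
df = add_cyclic_time_features(pd.DataFrame({"hour": [0, 6, 12, 23]}))
print(df)
```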
[AI-29] Emerging Threats and Countermeasures in Neuromorphic Systems: A Survey
【速读】:该论文旨在解决类脑计算(Neuromorphic Computing)系统在硬件和软件层面日益突出的安全与隐私风险问题,尤其是针对基于脉冲神经网络(Spiking Neural Networks, SNNs)和忆阻器(Memristor)的新兴架构所引入的新攻击面。其解决方案的关键在于系统性地分析安全威胁,涵盖侧信道漏洞、攻击方法及防护策略,并强调通过集成物理不可克隆函数(Physical Unclonable Functions, PUFs)和真随机数生成器(True Random Number Generators, TRNGs)等硬件原语,在保障能效优势的同时实现可信的加密计算与安全防护,从而为构建高效且安全的类脑硬件架构提供理论基础与实践指导。
链接: https://arxiv.org/abs/2601.16589
作者: Pablo Sorrentino,Stjepan Picek,Ihsen Alouani,Nikolaos Athanasios Anagnostopoulos,Francesco Regazzoni,Lejla Batina,Tamalika Banerjee,Fatih Turkmen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: Survey paper on security and privacy in neuromorphic and in-memory computing systems (35 pages, 11 figures, 6 tables)
Abstract:Neuromorphic computing mimics brain-inspired mechanisms through spiking neurons and energy-efficient processing, offering a pathway to efficient in-memory computing (IMC). However, these advancements raise critical security and privacy concerns. As the adoption of bio-inspired architectures and memristive devices increases, so does the urgency to assess the vulnerability of these emerging technologies to hardware and software attacks. Emerging architectures introduce new attack surfaces, particularly due to asynchronous, event-driven processing and stochastic device behavior. The integration of memristors into neuromorphic hardware and software implementations in spiking neural networks offers diverse possibilities for advanced computing architectures, including their role in security-aware applications. This survey systematically analyzes the security landscape of neuromorphic systems, covering attack methodologies, side-channel vulnerabilities, and countermeasures. We focus on both hardware and software concerns relevant to spiking neural networks (SNNs) and hardware primitives, such as Physical Unclonable Functions (PUFs) and True Random Number Generators (TRNGs) for cryptographic and secure computation applications. We approach this analysis from diverse perspectives, from attack methodologies to countermeasure strategies that integrate efficiency and protection in brain-inspired hardware. This review not only maps the current landscape of security threats but provides a foundation for developing secure and trustworthy neuromorphic architectures.
zh
[AI-30] Process-Tensor Tomography of SGD: Measuring Non-Markovian Memory via Back-Flow of Distinguishability
【速读】:该论文旨在解决深度学习训练过程中优化器与数据状态之间存在非马尔可夫(non-Markovian)记忆效应的问题,即当前模型输出不仅依赖于当前输入,还受历史训练步骤中优化器状态和数据顺序的影响。这一现象挑战了传统假设——随机梯度下降(SGD)等优化算法在训练过程中满足马尔可夫性质(Markovian idealization)。论文的关键解决方案是提出一种模型无关的训练记忆观测工具(witness),基于**可区分性回流(back-flow of distinguishability)**构建,通过一个两步干预协议比较单次与两次干预后的输出分布差异(如总变差距离 TV、Jensen-Shannon 散度 JS 或 Hellinger 距离),其正向增加量 ΔBF=D2−D1 可作为非马尔可夫性的实证指标。该方法无需修改网络结构,计算成本低,且能清晰归因于优化器状态与数据顺序的记忆作用,为评估不同优化策略、课程学习安排及训练调度提供了统一的测量基准。
链接: https://arxiv.org/abs/2601.16563
作者: Vasileios Sevetlidis,George Pavlidis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: to be published in the 29th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research
Abstract:This work proposes neural training as a process tensor: a multi-time map that takes a sequence of controllable instruments (batch choices, augmentations, optimizer micro-steps) and returns an observable of the trained model. Building on this operational lens, we introduce a simple, model-agnostic witness of training memory based on back-flow of distinguishability. In a controlled two-step protocol, we compare outcome distributions after one intervention versus two; the increase Δ_BF = D_2 − D_1 > 0 (with D ∈ {TV, JS, H} measured on softmax predictions over a fixed probe set) certifies non-Markovianity. We observe consistent positive back-flow with tight bootstrap confidence intervals, amplification under higher momentum, larger batch overlap, and more micro-steps, and collapse under a causal break (resetting optimizer state), directly attributing the effect to optimizer/data-state memory. The witness is robust across TV/JS/Hellinger, inexpensive to compute, and requires no architectural changes. We position this as a measurement contribution: a principled diagnostic and empirical evidence that practical SGD deviates from the Markov idealization. An exploratory case study illustrates how the micro-level signal can inform curriculum orderings. "Data order matters" turns into a testable operator with confidence bounds; our framework offers a common stage to compare optimizers, curricula, and schedules through their induced training memory.
zh
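为说明上文 [AI-30] 摘要中的"可区分性回流"判据 Δ_BF = D_2 − D_1 > 0,下面给出 TV / JS / Hellinger 三种距离及回流量计算的极简示意。其中参考分布与两次干预后的分布均为虚构数值,并且对论文的两步干预协议做了简化,仅用于展示指标本身的计算方式。

```python
import numpy as np

def tv(p, q):        # 总变差距离
    return 0.5 * np.abs(p - q).sum()

def js(p, q):        # Jensen-Shannon 散度(以 2 为底, 取值范围 [0, 1])
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(np.where(a > 0, a * np.log2(a / b), 0.0))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def hellinger(p, q):
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def back_flow(p_ref, p_after_1, p_after_2, dist=tv):
    """Δ_BF = D2 - D1:两次干预后的可区分性高于一次干预, 即为非马尔可夫记忆的证据(简化示意)。"""
    d1 = dist(p_ref, p_after_1)   # 一次干预后与参考输出分布的距离
    d2 = dist(p_ref, p_after_2)   # 两次干预后的距离
    return d2 - d1

# 用法示意:在固定探针集上对 softmax 预测取平均得到的三个分布(数值为虚构)
p_ref = np.array([0.70, 0.20, 0.10])
p_1   = np.array([0.66, 0.23, 0.11])
p_2   = np.array([0.58, 0.28, 0.14])
print(back_flow(p_ref, p_1, p_2, dist=tv) > 0)   # True -> 正的 back-flow
```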
[AI-31] PRISM: Purified Representation and Integrated Semantic Modeling for Generative Sequential Recommendation
【速读】:该论文针对生成式推荐(Generative Sequential Recommendation, GSR)中存在的两个关键问题展开研究:一是语义标记(Semantic ID, SID)的纯度与稳定性不足,即现有量化方法难以应对用户交互噪声和码本坍缩问题,导致SID区分度模糊;二是生成过程的信息损失严重且缺乏结构约束,仅依赖粗粒度离散标记会丢失细粒度语义信息,并忽略物品间的层次逻辑关系。解决方案的核心在于提出PRISM框架,其关键创新为:(1) 设计净化语义量化器(Purified Semantic Quantizer),通过自适应协同去噪与分层语义锚定机制构建鲁棒码本,提升SID质量;(2) 提出集成语义推荐器(Integrated Semantic Recommender),引入动态语义融合机制以恢复量化损失的细粒度语义,并通过语义结构对齐目标强制生成结果符合逻辑结构,从而实现高质量、结构化的序列生成。
链接: https://arxiv.org/abs/2601.16556
作者: Dengzhao Fang,Jingtong Gao,Yu Li,Xiangyu Zhao,Yi Chang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Generative Sequential Recommendation (GSR) has emerged as a promising paradigm, reframing recommendation as an autoregressive sequence generation task over discrete Semantic IDs (SIDs), typically derived via codebook-based quantization. Despite its great potential in unifying retrieval and ranking, existing GSR frameworks still face two critical limitations: (1) impure and unstable semantic tokenization, where quantization methods struggle with interaction noise and codebook collapse, resulting in SIDs with ambiguous discrimination; and (2) lossy and weakly structured generation, where reliance solely on coarse-grained discrete tokens inevitably introduces information loss and neglects items’ hierarchical logic. To address these issues, we propose a novel generative recommendation framework, PRISM, with Purified Representation and Integrated Semantic Modeling. Specifically, to ensure high-quality tokenization, we design a Purified Semantic Quantizer that constructs a robust codebook via adaptive collaborative denoising and hierarchical semantic anchoring mechanisms. To compensate for information loss during quantization, we further propose an Integrated Semantic Recommender, which incorporates a dynamic semantic integration mechanism to integrate fine-grained semantics and enforces logical validity through a semantic structure alignment objective. PRISM consistently outperforms state-of-the-art baselines across four real-world datasets, demonstrating substantial performance gains, particularly in high-sparsity scenarios.
zh
[AI-32] LLM is Not All You Need: A Systematic Evaluation of ML vs. Foundation Models for text and image based Medical Classification
【速读】:该论文旨在解决医学分类任务中不同模型架构性能对比不清的问题,特别是传统机器学习(Machine Learning, ML)与基于Transformer的生成式AI(Generative AI)模型在多模态数据(文本与图像)上的表现差异。其解决方案的关键在于构建一个统一且严格的基准测试框架,使用四个公开可用的数据集(涵盖二分类与多分类复杂度),对三类模型进行系统评估:经典ML模型(LR、LightGBM、ResNet-50)、提示驱动的大型语言模型/视觉语言模型(Prompt-Based LLMs/VLMs,如Gemini 2.5),以及参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)模型(LoRA适配的Gemma3变体)。实验通过一致的数据划分和指标对齐确保公平比较,结果表明:经典ML模型在多数医学分类任务中仍保持最优性能,尤其在结构化文本数据上表现突出;而LoRA微调的Gemma3模型因微调策略不足导致泛化能力差;零样本VLM管道(Gemini 2.5)则在图像任务中表现可比于ResNet-50,但在文本任务中表现不佳,揭示了基础模型并非在所有场景下均优于传统方法,且PEFT的有效性高度依赖于具体的适配策略。
链接: https://arxiv.org/abs/2601.16549
作者: Meet Raval,Tejul Pandit,Dhvani Upadhyay
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures, 3 tables, paper accepted in AAIML’26 conference
Abstract:The combination of multimodal Vision-Language Models (VLMs) and Large Language Models (LLMs) opens up new possibilities for medical classification. This work offers a rigorous, unified benchmark by using four publicly available datasets covering text and image modalities (binary and multiclass complexity) that contrasts traditional Machine Learning (ML) with contemporary transformer-based techniques. We evaluated three model classes for each task: Classical ML (LR, LightGBM, ResNet-50), Prompt-Based LLMs/VLMs (Gemini 2.5), and Fine-Tuned PEFT Models (LoRA-adapted Gemma3 variants). All experiments used consistent data splits and aligned metrics. According to our results, traditional machine learning (ML) models set a high standard by consistently achieving the best overall performance across most medical categorization tasks. This was especially true for structured text-based datasets, where the classical models performed exceptionally well. In stark contrast, the LoRA-tuned Gemma variants consistently showed the worst performance across all text and image experiments, failing to generalize from the minimal fine-tuning provided. However, the zero-shot LLM/VLM pipelines (Gemini 2.5) had mixed results; they performed poorly on text-based tasks, but demonstrated competitive performance on the multiclass image task, matching the classical ResNet-50 baseline. These results demonstrate that in many medical categorization scenarios, established machine learning models continue to be the most reliable option. The experiment suggests that foundation models are not universally superior and that the effectiveness of Parameter-Efficient Fine-Tuning (PEFT) is highly dependent on the adaptation strategy, as minimal fine-tuning proved detrimental in this study.
zh
[AI-33] CORD: Bridging the Audio-Text Reasoning Gap via Weighted On-policy Cross-modal Distillation
【速读】:该论文旨在解决大型音频语言模型(Large Audio Language Models, LALMs)在继承文本基础大语言模型(Large Language Models, LLMs)架构后,普遍存在知识与推理能力下降的问题。作者认为这一局限源于当前训练范式未能有效弥合特征表示空间中的声学-语义鸿沟(acoustic-semantic gap)。解决方案的关键在于提出一种统一的对齐框架CORD,其核心是在线跨模态自蒸馏(online cross-modal self-distillation),通过将音频条件推理与文本条件推理在统一模型内进行对齐,利用文本模态作为内部教师信号,实现多粒度对齐:在token层面采用重要性加权的策略反向KL散度优化早期和语义关键token;在序列层面引入基于判别器的全局奖励,并通过Group Relative Policy Optimization(GRPO)优化完整推理轨迹。该方法仅需8万条合成训练样本即可显著提升音频条件推理性能并缩小音频与文本模态间的性能差距,验证了其有效性与数据效率。
链接: https://arxiv.org/abs/2601.16547
作者: Jing Hu,Danxiang Zhu,Xianlong Luo,Dan Zhang,Shuwei He,Yishu Lei,Haitao Zheng,Shikun Feng,Jingzhou He,Yu Sun,Hua Wu,Haifeng Wang
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 13 pages, 4 figures
Abstract:Large Audio Language Models (LALMs) have garnered significant research interest. Despite being built upon text-based large language models (LLMs), LALMs frequently exhibit a degradation in knowledge and reasoning capabilities. We hypothesize that this limitation stems from the failure of current training paradigms to effectively bridge the acoustic-semantic gap within the feature representation space. To address this challenge, we propose CORD, a unified alignment framework that performs online cross-modal self-distillation. Specifically, it aligns audio-conditioned reasoning with its text-conditioned counterpart within a unified model. Leveraging the text modality as an internal teacher, CORD performs multi-granularity alignment throughout the audio rollout process. At the token level, it employs on-policy reverse KL divergence with importance-aware weighting to prioritize early and semantically critical tokens. At the sequence level, CORD introduces a judge-based global reward to optimize complete reasoning trajectories via Group Relative Policy Optimization (GRPO). Empirical results across multiple benchmarks demonstrate that CORD consistently enhances audio-conditioned reasoning and substantially bridges the audio-text performance gap with only 80k synthetic training samples, validating the efficacy and data efficiency of our on-policy, multi-level cross-modal alignment approach.
zh
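上文 [AI-33] 摘要提到在 token 级别使用带重要性加权的 on-policy 反向 KL,把文本条件分布作为内部教师。下面是该损失的一种直观数值示意(numpy 实现):weights、具体分布数值以及"靠前 token 权重更大"的设定均为本文假设,并非论文的实际损失实现。

```python
import numpy as np

def weighted_reverse_kl(p_audio, p_text, weights):
    """逐 token 的反向 KL(以音频条件分布为"学生"), 并按 token 重要性加权(示意)。

    p_audio : (T, V) 音频条件下每个 token 位置的词表分布
    p_text  : (T, V) 文本条件下(内部教师)的词表分布
    weights : (T,)   token 重要性权重, 例如更偏向靠前或语义关键的 token(假设)
    """
    eps = 1e-12
    kl_per_token = np.sum(p_audio * (np.log(p_audio + eps) - np.log(p_text + eps)), axis=-1)
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w * kl_per_token) / np.sum(w))

# 用法示意:3 个 token、词表大小 4, 靠前的 token 权重更大
p_a = np.array([[0.7, 0.1, 0.1, 0.1], [0.4, 0.3, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25]])
p_t = np.array([[0.6, 0.2, 0.1, 0.1], [0.5, 0.2, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25]])
print(weighted_reverse_kl(p_a, p_t, weights=[3.0, 2.0, 1.0]))
```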
[AI-34] Do Models Hear Like Us? Probing the Representational Alignment of Audio LLMs and Naturalistic EEG
【速读】:该论文旨在解决当前生成式 AI(Generative AI)中音频大语言模型(Audio LLMs)的内部表征是否与人类在自然听觉情境下的神经活动动态一致这一关键问题。解决方案的关键在于系统性地分析12个开源Audio LLMs在不同网络层中的表征与脑电图(EEG)信号之间的对齐情况,采用8种相似性度量方法(如基于Spearman相关性的表示相似性分析RSA),并结合时间窗口和情感语调特征进行多维解析,从而揭示了模型表征在时空维度上的深度依赖性、N400相关神经动力学的一致性以及负向语调引起的几何相似性下降与协方差依赖增强的分离现象。
链接: https://arxiv.org/abs/2601.16540
作者: Haoyun Yang,Xin Xiao,Jiang Zhong,Yu Tian,Dong Xiaohua,Yu Mao,Hao Wu,Kaiwen Wei
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
Abstract:Audio Large Language Models (Audio LLMs) have demonstrated strong capabilities in integrating speech perception with language understanding. However, whether their internal representations align with human neural dynamics during naturalistic listening remains largely unexplored. In this work, we systematically examine layer-wise representational alignment between 12 open-source Audio LLMs and Electroencephalogram (EEG) signals across 2 datasets. Specifically, we employ 8 similarity metrics, such as Spearman-based Representational Similarity Analysis (RSA), to characterize within-sentence representational geometry. Our analysis reveals 3 key findings: (1) we observe a rank-dependence split, in which model rankings vary substantially across different similarity metrics; (2) we identify spatio-temporal alignment patterns characterized by depth-dependent alignment peaks and a pronounced increase in RSA within the 250-500 ms time window, consistent with N400-related neural dynamics; (3) we find an affective dissociation whereby negative prosody, identified using a proposed Tri-modal Neighborhood Consistency (TNC) criterion, reduces geometric similarity while enhancing covariance-based dependence. These findings provide new neurobiological insights into the representational mechanisms of Audio LLMs.
zh
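上文 [AI-34] 摘要使用的核心工具之一是基于 Spearman 相关的表示相似性分析(RSA)。下面给出一个通用的 RSA 计算示意:先分别计算模型层表征与 EEG 特征的表征不相似度矩阵(RDM),再对两者的上三角取 Spearman 相关。示例数据为随机生成,item 数、维度与距离度量均为假设,与论文的具体预处理无关。

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(features, metric="correlation"):
    """表征不相似度矩阵 (RDM) 的上三角向量:features 形状为 (n_items, dim)。"""
    return pdist(features, metric=metric)

def rsa_spearman(model_feats, eeg_feats):
    """Spearman-RSA:比较模型层表征与 EEG 表征的几何结构是否一致(示意)。"""
    rho, p = spearmanr(rdm(model_feats), rdm(eeg_feats))
    return rho, p

# 用法示意:同一批 20 个句子/音频片段, 模型某层 64 维表征 vs EEG 某时间窗 32 维特征(随机数据)
rng = np.random.default_rng(0)
model_feats = rng.normal(size=(20, 64))
eeg_feats = rng.normal(size=(20, 32))
print(rsa_spearman(model_feats, eeg_feats))
```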
[AI-35] SycoEval-EM: Sycophancy Evaluation of Large Language Models in Simulated Clinical Encounters for Emergency Care
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在临床决策支持中可能因患者施压而屈从于不适当医疗建议的风险问题。解决方案的关键在于提出SycoEval-EM,一个基于多智能体的对抗性模拟框架,通过在急诊医学场景下对LLM进行多轮对抗性患者说服测试,系统评估其鲁棒性。该方法揭示了模型对影像学检查请求的脆弱性显著高于阿片类药物处方(38.8% vs 25.0%),且现有静态基准无法准确预测模型在社会压力下的安全性,强调了采用多轮对抗测试作为临床AI认证必要手段的重要性。
链接: https://arxiv.org/abs/2601.16529
作者: Dongshen Peng,Yi Wang,Carl Preiksaitis,Christian Rose
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 11 pages, 5 figures
Abstract:Large language models (LLMs) show promise in clinical decision support yet risk acquiescing to patient pressure for inappropriate care. We introduce SycoEval-EM, a multi-agent simulation framework evaluating LLM robustness through adversarial patient persuasion in emergency medicine. Across 20 LLMs and 1,875 encounters spanning three Choosing Wisely scenarios, acquiescence rates ranged from 0-100%. Models showed higher vulnerability to imaging requests (38.8%) than opioid prescriptions (25.0%), with model capability poorly predicting robustness. All persuasion tactics proved equally effective (30.0-36.0%), indicating general susceptibility rather than tactic-specific weakness. Our findings demonstrate that static benchmarks inadequately predict safety under social pressure, necessitating multi-turn adversarial testing for clinical AI certification.
zh
[AI-36] Finite-Time Analysis of Gradient Descent for Shallow Transformers AISTATS2026
【速读】:该论文旨在解决Transformer模型在非凸优化景观下性能优异但理论机制尚不清晰的问题。其解决方案的关键在于通过分析一个浅层Transformer(含m个独立注意力头)在核 regime 下使用投影梯度下降法进行训练的过程,揭示了两个核心结论:(i) 非渐近保证所需的宽度仅随样本量n的对数增长;(ii) 优化误差与序列长度T无关,这与循环结构中误差随T指数增长形成鲜明对比。这一发现表明Transformer在优化稳定性上具有显著优势,尽管其内存需求随序列长度增加而上升。
链接: https://arxiv.org/abs/2601.16514
作者: Enes Arda,Semih Cayci,Atilla Eryilmaz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: Accepted to AISTATS 2026
Abstract:Understanding why Transformers perform so well remains challenging due to their non-convex optimization landscape. In this work, we analyze a shallow Transformer with m independent heads trained by projected gradient descent in the kernel regime. Our analysis reveals two main findings: (i) the width required for nonasymptotic guarantees scales only logarithmically with the sample size n , and (ii) the optimization error is independent of the sequence length T . This contrasts sharply with recurrent architectures, where the optimization error can grow exponentially with T . The trade-off is memory: to keep the full context, the Transformer’s memory requirement grows with the sequence length. We validate our theoretical results numerically in a teacher-student setting and confirm the predicted scaling laws for Transformers.
zh
[AI-37] kNN-Graph: An adaptive graph model for k-nearest neighbors
【速读】:该论文旨在解决k近邻(k-nearest neighbors, kNN)算法在大规模应用中面临的推理速度与分类精度之间的计算权衡问题。现有近似最近邻方法虽能加速检索,但常导致分类精度下降,且难以自适应地选择最优邻域大小(k)。其解决方案的关键在于提出一种自适应图模型,通过将邻近节点选择与加权计算完全转移到训练阶段,从而解耦推理延迟与计算复杂度。该模型结合了分层可导航小世界(Hierarchical Navigable Small World, HNSW)图结构与预计算投票机制:高层用于快速导航,低层则编码精确的、节点特定的决策边界并自适应调整邻域数量。实验表明,该架构在六大数据集上显著提升推理速度并实现实时性能,同时保持高分类准确性,为基于图的非参数学习提供了新的结构范式。
链接: https://arxiv.org/abs/2601.16509
作者: Jiaye Li,Gang Chen,Hang Xu,Shichao Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 25 pages, 6 figures
Abstract:The k-nearest neighbors (kNN) algorithm is a cornerstone of non-parametric classification in artificial intelligence, yet its deployment in large-scale applications is persistently constrained by the computational trade-off between inference speed and accuracy. Existing approximate nearest neighbor solutions accelerate retrieval but often degrade classification precision and lack adaptability in selecting the optimal neighborhood size (k). Here, we present an adaptive graph model that decouples inference latency from computational complexity. By integrating a Hierarchical Navigable Small World (HNSW) graph with a pre-computed voting mechanism, our framework completely transfers the computational burden of neighbor selection and weighting to the training phase. Within this topological structure, higher graph layers enable rapid navigation, while lower layers encode precise, node-specific decision boundaries with adaptive neighbor counts. Benchmarking against eight state-of-the-art baselines across six diverse datasets, we demonstrate that this architecture significantly accelerates inference speeds, achieving real-time performance, without compromising classification accuracy. These findings offer a scalable, robust solution to the long-standing inference bottleneck of kNN, establishing a new structural paradigm for graph-based nonparametric learning.
zh
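上文 [AI-37] 摘要的核心思想是把"选邻居 + 投票"完全搬到训练阶段、推理时只做快速导航与查表。下面用 scikit-learn 的近邻索引给出一个极简示意:论文使用的是 HNSW 分层图且邻居数 k 逐节点自适应,这里以固定 k 与单层索引代替,类名与参数均为本文假设。

```python
import numpy as np
from collections import Counter
from sklearn.neighbors import NearestNeighbors

class PrecomputedVoteKNN:
    """把邻居选择与投票全部搬到训练阶段的极简示意(非论文原始的 HNSW 分层实现)。"""

    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y):
        self.X, self.y = np.asarray(X), np.asarray(y)
        nn = NearestNeighbors(n_neighbors=self.k + 1).fit(self.X)
        _, idx = nn.kneighbors(self.X)                     # 第一列是样本自身, 故取 k+1
        # 训练期为每个节点预先算好投票结果(论文中的 k 是逐节点自适应的, 这里固定)
        self.node_vote = np.array([Counter(self.y[row[1:]]).most_common(1)[0][0] for row in idx])
        self.router = NearestNeighbors(n_neighbors=1).fit(self.X)   # 推理期只需找 1 个入口节点
        return self

    def predict(self, Xq):
        _, idx = self.router.kneighbors(np.asarray(Xq))
        return self.node_vote[idx[:, 0]]                   # 直接查表, 无需在线投票

# 用法示意:两个高斯簇
X = np.vstack([np.random.default_rng(1).normal(0, 1, (50, 2)),
               np.random.default_rng(2).normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(PrecomputedVoteKNN(k=5).fit(X, y).predict([[0.2, -0.1], [4.1, 3.8]]))
```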
[AI-38] SafeThinker: Reasoning about Risk to Deepen Safety Beyond Shallow Alignment
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在安全对齐上的浅层防御问题,即现有方法虽能识别明显威胁,但难以抵御伪装成正常请求的隐蔽攻击(如预填充攻击),且常导致模型实用性下降。其解决方案的关键在于提出SafeThinker框架,通过一个轻量级网关分类器动态评估输入风险,并据此将输入路由至三种机制:标准拒绝机制(Standardized Refusal Mechanism)处理明确威胁以提升效率;安全感知双专家模块(Safety-Aware Twin Expert, SATE)识别伪装成良性查询的欺骗性攻击;分布引导思考组件(Distribution-Guided Think, DDGT)在生成过程不确定时自适应介入。该设计实现了内在判断与防御资源的协同调度,显著降低各类越狱攻击成功率的同时保持模型功能完整性。
链接: https://arxiv.org/abs/2601.16506
作者: Xianya Fang,Xianying Luo,Yadong Wang,Xiang Chen,Yu Tian,Zequn Sun,Rui Liu,Jun Fang,Naiqiang Tan,Yuanning Cui,Sheng-Jun Huang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite the intrinsic risk-awareness of Large Language Models (LLMs), current defenses often result in shallow safety alignment, rendering models vulnerable to disguised attacks (e.g., prefilling) while degrading utility. To bridge this gap, we propose SafeThinker, an adaptive framework that dynamically allocates defensive resources via a lightweight gateway classifier. Based on the gateway’s risk assessment, inputs are routed through three distinct mechanisms: (i) a Standardized Refusal Mechanism for explicit threats to maximize efficiency; (ii) a Safety-Aware Twin Expert (SATE) module to intercept deceptive attacks masquerading as benign queries; and (iii) a Distribution-Guided Think (DDGT) component that adaptively intervenes during uncertain generation. Experiments show that SafeThinker significantly lowers attack success rates across diverse jailbreak strategies without compromising utility, demonstrating that coordinating intrinsic judgment throughout the generation process effectively balances robustness and practicality.
zh
[AI-39] Doc2AHP: Inferring Structured Multi-Criteria Decision Models via Semantic Trees with LLM s
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂决策任务中难以保证结构一致性与推理可靠性的问题,同时克服传统决策理论(如层次分析法 Analytic Hierarchy Process, AHP)因依赖人工专家标注而产生的“专家瓶颈”问题。其解决方案的关键在于提出 Doc2AHP 框架,该框架将 AHP 的结构化原则作为约束条件,引导 LLM 在无结构文档空间中进行受控搜索,从而确保父节点与子节点之间的逻辑蕴含关系;此外,引入多智能体权重分配机制与自适应一致性优化策略,保障权重分配的数值一致性,使非专家用户也能从零构建高质量决策模型,并显著优于直接生成基线方法在逻辑完整性和下游任务准确性上的表现。
链接: https://arxiv.org/abs/2601.16479
作者: Hongjia Wu,Shuai Zhou,Hongxin Zhang,Wei Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:While Large Language Models (LLMs) demonstrate remarkable proficiency in semantic understanding, they often struggle to ensure structural consistency and reasoning reliability in complex decision-making tasks that demand rigorous logic. Although classical decision theories, such as the Analytic Hierarchy Process (AHP), offer systematic rational frameworks, their construction relies heavily on labor-intensive domain expertise, creating an “expert bottleneck” that hinders scalability in general scenarios. To bridge the gap between the generalization capabilities of LLMs and the rigor of decision theory, we propose Doc2AHP, a novel structured inference framework guided by AHP principles. Eliminating the need for extensive annotated data or manual intervention, our approach leverages the structural principles of AHP as constraints to direct the LLM in a constrained search within the unstructured document space, thereby enforcing the logical entailment between parent and child nodes. Furthermore, we introduce a multi-agent weighting mechanism coupled with an adaptive consistency optimization strategy to ensure the numerical consistency of weight allocation. Empirical results demonstrate that Doc2AHP not only empowers non-expert users to construct high-quality decision models from scratch but also significantly outperforms direct generative baselines in both logical completeness and downstream task accuracy.
zh
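上文 [AI-39] 以 AHP 的结构化原则约束 LLM 的生成,其中涉及权重分配与一致性检验。下面给出经典 AHP 中由两两比较矩阵计算权重(主特征向量)与一致性比率 CR = CI/RI 的通用示意;这是标准 AHP 的常规做法,并非 Doc2AHP 的多智能体权重机制本身,示例比较矩阵为虚构。

```python
import numpy as np

# Saaty 随机一致性指标 RI(n = 1..9), 用于计算一致性比率 CR = CI / RI
RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}

def ahp_weights_and_cr(pairwise):
    """由两两比较矩阵计算 AHP 准则权重与一致性比率(标准 AHP 计算, 示意)。"""
    A = np.asarray(pairwise, dtype=float)
    n = A.shape[0]
    eigvals, eigvecs = np.linalg.eig(A)
    k = np.argmax(eigvals.real)                  # 主特征值 λ_max 及对应特征向量
    w = np.abs(eigvecs[:, k].real)
    w = w / w.sum()                              # 归一化得到权重
    lam_max = eigvals[k].real
    ci = (lam_max - n) / (n - 1)                 # 一致性指标 CI
    cr = ci / RI[n] if RI[n] > 0 else 0.0        # 一致性比率 CR, 实践中通常要求 < 0.1
    return w, cr

# 用法示意:三个准则的两两比较(A[i, j] 表示准则 i 相对准则 j 的重要程度)
A = [[1,   3,   5],
     [1/3, 1,   2],
     [1/5, 1/2, 1]]
weights, cr = ahp_weights_and_cr(A)
print(np.round(weights, 3), round(cr, 3))
```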
[AI-40] RENEW: Risk- and Energy-Aware Navigation in Dynamic Waterways AAAI2026
【速读】:该论文旨在解决自主水面航行器(ASV)在存在外部扰动(如海流)的动态环境中进行全局路径规划的问题,特别是如何同时应对适应性非可航行区域识别与拓扑路径多样性不足带来的导航鲁棒性挑战。解决方案的关键在于提出了一种统一的风险-能耗感知策略(risk- and energy-aware strategy),通过高层约束三角剖分实现拓扑多样性,低层在安全走廊内进行轨迹优化,并结合海上应急规划理念,采用“尽力而为”机制维持控制稳定性,从而在复杂海洋环境中实现更安全、高效的路径规划。
链接: https://arxiv.org/abs/2601.16424
作者: Mingi Jeong,Alberto Quattrini Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 9 pages, 10 figures, 4 tables, AAAI 2026 (main track; oral acceptance)
Abstract:We present RENEW, a global path planner for Autonomous Surface Vehicle (ASV) in dynamic environments with external disturbances (e.g., water currents). RENEW introduces a unified risk- and energy-aware strategy that ensures safety by dynamically identifying non-navigable regions and enforcing adaptive safety constraints. Inspired by maritime contingency planning, it employs a best-effort strategy to maintain control under adverse conditions. The hierarchical architecture combines high-level constrained triangulation for topological diversity with low-level trajectory optimization within safe corridors. Validated with real-world ocean data, RENEW is the first framework to jointly address adaptive non-navigability and topological path diversity for robust maritime navigation.
zh
[AI-41] PyHealth 2.0: A Comprehensive Open-Source Toolkit for Accessible and Reproducible Clinical Deep Learning
【速读】:该论文旨在解决临床人工智能(Clinical AI)研究中普遍存在的三大障碍:基线复现困难、高计算成本以及对领域专业知识的依赖。为应对这些挑战,作者提出了PyHealth 2.0——一个增强型临床深度学习工具包,其核心解决方案在于构建了一个统一框架,集成超过15个数据集、20余项临床任务、25种模型、5种可解释性方法及不确定性量化技术(包括共形预测),支持多模态临床数据(信号、影像和电子健康记录)并实现5种以上医学编码标准的自动转换;同时通过优化算法与资源调度设计,在保证精度的前提下将处理速度提升至39倍、内存消耗降低至1/20,从而适配从16GB笔记本到生产系统的多样化计算环境;此外,依托活跃的开源社区(400+成员)和多语言支持(RHealth),显著降低使用门槛,推动可复现、易访问的医疗AI研究发展。
链接: https://arxiv.org/abs/2601.16414
作者: John Wu,Yongda Fan,Zhenbang Wu,Paul Landes,Eric Schrock,Sayeed Sajjad Razin,Arjun Chatterjee,Naveen Baskaran,Joshua Steier,Andrea Fitzpatrick,Bilal Arif,Rian Atri,Jathurshan Pradeepkumar,Siddhartha Laghuvarapu,Junyi Gao,Adam R. Cross,Jimeng Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Under Review
Abstract:Difficulty replicating baselines, high computational costs, and required domain expertise create persistent barriers to clinical AI research. To address these challenges, we introduce PyHealth 2.0, an enhanced clinical deep learning toolkit that enables predictive modeling in as few as 7 lines of code. PyHealth 2.0 offers three key contributions: (1) a comprehensive toolkit addressing reproducibility and compatibility challenges by unifying 15+ datasets, 20+ clinical tasks, 25+ models, 5+ interpretability methods, and uncertainty quantification including conformal prediction within a single framework that supports diverse clinical data modalities - signals, imaging, and electronic health records - with translation of 5+ medical coding standards; (2) accessibility-focused design accommodating multimodal data and diverse computational resources with up to 39x faster processing and 20x lower memory usage, enabling work from 16GB laptops to production systems; and (3) an active open-source community of 400+ members lowering domain expertise barriers through extensive documentation, reproducible research contributions, and collaborations with academic health systems and industry partners, including multi-language support via RHealth. PyHealth 2.0 establishes an open-source foundation and community advancing accessible, reproducible healthcare AI. Available at pip install pyhealth.
zh
[AI-42] Reasoning-Enhanced Rare-Event Prediction with Balanced Outcome Correction
【速读】:该论文旨在解决低频事件(low-prevalence events)预测中因类别极度不平衡导致的传统模型偏向多数类、召回率低、校准性差及实际应用价值受限的问题。其解决方案的关键在于提出LPCORP(Low-Prevalence CORrector for Prediction)框架,该框架采用两阶段策略:第一阶段利用增强推理(reasoning-enhanced)模型从文本输入中生成更丰富的预测结果;第二阶段通过一个轻量级逻辑回归分类器基于置信度对预测结果进行选择性修正,从而有效缓解由发生频率驱动的偏差。该方法在不进行任何重采样操作的前提下,显著提升了测试集上的性能,尤其改善了精度(precision),并在成本分析中表明,基于预测的预防性干预可比无预防措施时减少超过50%的罕见事件损失成本。
链接: https://arxiv.org/abs/2601.16406
作者: Vitaly Bulgakov,Alexander Turchin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 28 pages, 12 figures, provisional patent
Abstract:Rare-event prediction is critical in domains such as healthcare, finance, reliability engineering, customer support, and aviation safety, where positive outcomes are infrequent yet potentially catastrophic. Extreme class imbalance biases conventional models toward majority-class predictions, limiting recall, calibration, and operational usefulness. We propose LPCORP (Low-Prevalence CORrector for Prediction)*, a two-stage framework that combines reasoning-enhanced prediction with confidence-based outcome correction. A reasoning model first produces enriched predictions from narrative inputs, after which a lightweight logistic-regression classifier evaluates and selectively corrects these outputs to mitigate prevalence-driven bias. We evaluate LPCORP on real-world datasets from medical and consumer service domains. The results show that this method transforms a highly imbalanced setting into a well-balanced one while preserving the original number of samples and without applying any resampling strategies. Test-set evaluation demonstrates substantially improved performance, particularly in precision, which is a known weakness in low-prevalence data. We further provide a cost-reduction analysis comparing the expenses associated with rare-event damage control without preventive measures to those incurred when low-cost, prediction-based preventive interventions are applied, which showed more than 50% reduction in some cases. * Patent pending: U.S. Provisional 63/933,518, filed 8 December 2025.
zh
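上文 [AI-42] 摘要描述的 LPCORP 是"推理增强预测 + 轻量逻辑回归做基于置信度的选择性纠正"的两阶段流程。下面是第二阶段的一个极简示意:纠正器学习"第一阶段的判断是否正确",推理时仅在其认为不可信时才翻转结果。特征构造、阈值与合成数据均为本文假设,不代表该(专利申请中)方法的实际实现。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_corrector(stage1_probs, y_true):
    """第二阶段:训练轻量逻辑回归, 学习"第一阶段的判断是否正确"(示意, 特征为假设)。"""
    stage1_label = (stage1_probs >= 0.5).astype(int)
    is_correct = (stage1_label == y_true).astype(int)          # 纠正器的监督信号
    X = np.column_stack([stage1_probs, stage1_label])
    return LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, is_correct)

def corrected_predict(corrector, stage1_probs, trust_threshold=0.5):
    """推理时:仅当纠正器认为第一阶段不可信时才翻转其输出, 实现选择性纠正。"""
    stage1_label = (stage1_probs >= 0.5).astype(int)
    X = np.column_stack([stage1_probs, stage1_label])
    p_correct = corrector.predict_proba(X)[:, 1]
    return np.where(p_correct >= trust_threshold, stage1_label, 1 - stage1_label)

# 用法示意(虚构数据):正例约占 5% 的不平衡场景
rng = np.random.default_rng(0)
y = (rng.random(500) < 0.05).astype(int)
p1 = np.clip(0.35 * y + rng.random(500) * 0.6, 0.01, 0.99)     # 模拟第一阶段输出的正类概率
corrector = fit_corrector(p1, y)
y_hat = corrected_predict(corrector, p1)
print("第一阶段正例数:", int((p1 >= 0.5).sum()), " 纠正后正例数:", int(y_hat.sum()))
```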
[AI-43] Improving the Accuracy of Community Detection on Signed Networks via Community Refinement and Contrastive Learning
【速读】:该论文旨在解决Signed Network(符号网络)中社区检测(Community Detection, CD)因边权符号噪声或冲突导致社区结构不一致的问题。现有方法在面对复杂且含噪的正负关系时,难以稳定地识别出具有实际意义的社区划分。其解决方案的关键在于提出一个模型无关(model-agnostic)的后处理框架ReCon,通过四个迭代步骤——结构精炼(structural refinement)、边界精炼(boundary refinement)、对比学习(contrastive learning)和聚类——逐步优化初始社区结构,从而显著提升社区检测的准确性与鲁棒性。实验表明,该框架可有效集成至多种CD方法中,并在合成与真实网络上均表现出一致性改进效果。
链接: https://arxiv.org/abs/2601.16372
作者: Hyunuk Shin,Hojin Kim,Chanyoung Lee,Yeon-Chang Lee,David Yoon Suk Kang
机构: 未知
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:
Abstract:Community detection (CD) on signed networks is crucial for understanding how positive and negative relations jointly shape network structure. However, existing CD methods often yield inconsistent communities due to noisy or conflicting edge signs. In this paper, we propose ReCon, a model-agnostic post-processing framework that progressively refines community structures through four iterative steps: (1) structural refinement, (2) boundary refinement, (3) contrastive learning, and (4) clustering. Extensive experiments on eighteen synthetic and four real-world networks using four CD methods demonstrate that ReCon consistently enhances community detection accuracy, serving as an effective and easily integrable solution for reliable CD across diverse network properties.
zh
[AI-44] NOIR: Privacy-Preserving Generation of Code with Open-Source LLMs USENIX-SECURITY
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)驱动的代码生成服务中存在知识产权和数据安全风险的问题,即云服务提供商可能通过观察客户端输入的提示(prompts)和生成的代码来推断敏感信息。解决方案的关键在于提出NOIR框架,其核心机制包括:在客户端部署编码器与解码器,将提示嵌入(embeddings)发送至云端获取增强嵌入后本地解码生成代码;引入基于token嵌入级别的局部差分隐私(local differential privacy),结合词汇表层面的不可区分性保护以及数据无关的随机分词器(data-independent and randomized tokenizer),从而有效抵御诚实但好奇(honest-but-curious)云服务器发起的重构攻击和频率分析攻击。实验证明,NOIR在多个基准测试(如EvalPlus和BigCodeBench)上实现了高保真度的代码生成性能,同时提供强隐私保障。
链接: https://arxiv.org/abs/2601.16354
作者: Khoa Nguyen,Khiem Ton,NhatHai Phan,Issa Khalil,Khang Tran,Cristian Borcea,Ruoming Jin,Abdallah Khreishah,My T. Thai
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: To appear at Usenix Security Symposium 2026
Abstract:Although boosting software development performance, large language model (LLM)-powered code generation introduces intellectual property and data security risks rooted in the fact that a service provider (cloud) observes a client’s prompts and generated code, which can be proprietary in commercial systems. To mitigate this problem, we propose NOIR, the first framework to protect the client’s prompts and generated code from the cloud. NOIR uses an encoder and a decoder at the client to encode and send the prompts’ embeddings to the cloud to get enriched embeddings from the LLM, which are then decoded to generate the code locally at the client. Since the cloud can use the embeddings to infer the prompt and the generated code, NOIR introduces a new mechanism to achieve indistinguishability, a local differential privacy protection at the token embedding level, in the vocabulary used in the prompts and code, and a data-independent and randomized tokenizer on the client side. These components effectively defend against reconstruction and frequency analysis attacks by an honest-but-curious cloud. Extensive analysis and results using open-source LLMs show that NOIR significantly outperforms existing baselines on benchmarks, including the Evalplus (MBPP and HumanEval, Pass@1 of 76.7 and 77.4), and BigCodeBench (Pass@1 of 38.7, only a 1.77% drop from the original LLM) under strong privacy against attacks.
zh
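上文 [AI-44] 摘要提到在 token 嵌入层面实现词表上的不可区分性(局部差分隐私)。下面给出文本 metric-DP 文献中常见的一类通用机制示意:对嵌入加各向同性方向的噪声(幅度服从 Gamma(d, 1/ε)),再投影回最近的词表向量。这只是同类思路的一个说明性例子,并非 NOIR 的实际机制;函数名、维度与参数均为假设。

```python
import numpy as np

def privatize_token_embedding(emb, vocab_embs, epsilon, rng):
    """在 token 嵌入上加噪后映射回最近词表向量的通用 LDP 风格示意(并非 NOIR 的实际机制)。

    emb        : (d,)   原始 token 嵌入
    vocab_embs : (V, d) 词表嵌入矩阵
    epsilon    : 隐私预算, 越小噪声越大
    """
    d = emb.shape[0]
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)              # 球面上均匀的噪声方向
    magnitude = rng.gamma(shape=d, scale=1.0 / epsilon)  # metric-DP 文本机制常用的噪声幅度分布
    noisy = emb + magnitude * direction
    # 投影回词表:返回与加噪向量最近的词表 token 的下标
    return int(np.argmin(np.linalg.norm(vocab_embs - noisy, axis=1)))

# 用法示意:8 维嵌入、词表大小 100(随机数据)
rng = np.random.default_rng(0)
vocab = rng.normal(size=(100, 8))
token_id = 7
print(privatize_token_embedding(vocab[token_id], vocab, epsilon=5.0, rng=rng))
```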
[AI-45] DSGym: A Holistic Framework for Evaluating and Training Data Science Agents
【速读】:该论文旨在解决当前数据科学基准测试中存在的三大问题:评估接口碎片化导致跨基准比较困难、任务覆盖范围狭窄以及缺乏严格的**数据接地(data grounding)**验证,尤其指出现有基准中大量任务可无需使用真实数据即可完成,削弱了评估的有效性。其解决方案的核心是提出DSGym——一个标准化的、自包含执行环境框架,通过模块化架构支持任务、代理模板和工具的灵活扩展,从而构建一个动态且可扩展的测试平台;同时,通过DGS-Tasks对现有基准进行质量筛选与“捷径”可解性过滤,并引入DSBio(生物信息学任务)和DSPredict(多领域预测任务)以增强任务多样性与现实性,最终实现对数据科学代理在规划、执行和验证分析全流程中的端到端能力评估。
链接: https://arxiv.org/abs/2601.16344
作者: Fan Nie,Junlin Wang,Harper Hua,Federico Bianchi,Yongchan Kwon,Zhenting Qi,Owen Queen,Shang Zhu,James Zou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Data science agents promise to accelerate discovery and insight-generation by turning data into executable analyses and findings. Yet existing data science benchmarks fall short due to fragmented evaluation interfaces that make cross-benchmark comparison difficult, narrow task coverage and a lack of rigorous data grounding. In particular, we show that a substantial portion of tasks in current benchmarks can be solved without using the actual data. To address these limitations, we introduce DSGym, a standardized framework for evaluating and training data science agents in self-contained execution environments. Unlike static benchmarks, DSGym provides a modular architecture that makes it easy to add tasks, agent scaffolds, and tools, positioning it as a live, extensible testbed. We curate DSGym-Tasks, a holistic task suite that standardizes and refines existing benchmarks via quality and shortcut solvability filtering. We further expand coverage with (1) DSBio: expert-derived bioinformatics tasks grounded in literature and (2) DSPredict: challenging prediction tasks spanning domains such as computer vision, molecular prediction, and single-cell perturbation. Beyond evaluation, DSGym enables agent training via execution-verified data synthesis pipeline. As a case study, we build a 2,000-example training set and trained a 4B model in DSGym that outperforms GPT-4o on standardized analysis benchmarks. Overall, DSGym enables rigorous end-to-end measurement of whether agents can plan, implement, and validate data analyses in realistic scientific context.
zh
[AI-46] DMAVA: Distributed Multi-Autonomous Vehicle Architecture Using Autoware
【速读】:该论文旨在解决多辆自动驾驶汽车(Autonomous Vehicles, AVs)在仿真环境中协同运行的难题,现有仿真架构大多局限于单车操作或依赖集中式控制,难以实现跨物理主机的实时同步与独立决策。其解决方案的关键在于提出一种分布式多AV架构(Distributed Multi-AV Architecture, DMAVA),通过低延迟的数据中心通信层实现多主机间的同步协调,每个车辆独立运行完整的AV栈(如Autoware Universe),并在共享的Unity环境中并发执行,结合ROS 2 Humble、AWSIM Labs和Zenoh等工具链,实现了稳定定位、可靠跨主机通信及闭环控制的完全同步,为高级协作式自动驾驶任务(如多车自动代客泊车)提供了可扩展的基础平台。
链接: https://arxiv.org/abs/2601.16336
作者: Zubair Islam,Mohamed El-Darieby
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 9 pages, 4 figures, 5 tables, Submitted to IEEE IV 2026, Demo videos and source code available
Abstract:Simulating and validating coordination among multiple autonomous vehicles (AVs) is a challenging task as most existing simulation architectures are limited to single-vehicle operation or rely on centralized control. This paper presents a Distributed Multi-AV Architecture (DMAVA) that enables synchronized, real-time autonomous driving simulation across multiple physical hosts. Each vehicle runs its own complete AV stack and operates independently from other AVs. The vehicles in the simulation maintain synchronized coordination through a low-latency data-centric communication layer. The proposed system integrates ROS 2 Humble, Autoware Universe, AWSIM Labs, and Zenoh to support concurrent execution of multiple Autoware stacks within a shared Unity-based environment. Experiments conducted on multiple-host configurations demonstrate stable localization, reliable inter-host communication, and fully synchronized closed-loop control. The DMAVA also serves as a foundation for Multi-Vehicle Autonomous Valet Parking, demonstrating its extensibility toward higher-level cooperative autonomy. Demo videos and source code are available at: this https URL.
zh
[AI-47] DMV-AVP: Distributed Multi-Vehicle Autonomous Valet Parking using Autoware
【速读】:该论文旨在解决现有自动驾驶泊车(Autonomous Valet Parking, AVP)仿真系统普遍采用集中式或非分布式架构所导致的可扩展性受限和全自主控制能力不足的问题。解决方案的关键在于构建一个基于分布式多车辆架构(Distributed Multi-Vehicle Architecture, DMAVA)的DMV-AVP系统,其核心创新包括:1)开发了一个多车辆AVP节点,实现跨车辆的状态协调、队列管理和车位预留机制;2)集成了运行于Unity环境(AWSIM Labs)中的YOLOv5车位检测模块,提供基于视觉的实时感知能力;二者均通过Zenoh通信层实现低延迟的主题同步与主机间协同行为。在两台和三台主机配置下的实验验证了确定性的协调能力、无冲突泊车行为及可扩展性能,为未来实车部署与硬件在环验证奠定了基础。
链接: https://arxiv.org/abs/2601.16327
作者: Zubair Islam,Mohamed El-Darieby
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 7 pages, 5 figures, 1 table. Demo videos and source code available
Abstract:This paper presents the DMV-AVP System, a distributed simulation of Multi-Vehicle Autonomous Valet Parking (AVP). The system was implemented as an application of the Distributed Multi-Vehicle Architecture (DMAVA) for synchronized multi-host execution. Most existing simulation approaches rely on centralized or non-distributed designs that constrain scalability and limit fully autonomous control. This work introduces two modules built on top of the DMAVA: 1) a Multi-Vehicle AVP Node that performs state-based coordination, queuing, and reservation management across multiple vehicles, and 2) a Unity-Integrated YOLOv5 Parking Spot Detection Module that provides real-time, vision-based perception within AWSIM Labs. Both modules integrate seamlessly with the DMAVA and extend it specifically for multi-vehicle AVP operation, supported by a Zenoh-based communication layer that ensures low-latency topic synchronization and coordinated behavior across hosts. Experiments conducted on two- and three-host configurations demonstrate deterministic coordination, conflict-free parking behavior, and scalable performance across distributed Autoware instances. The results confirm that the proposed Distributed Multi-Vehicle AVP System supports cooperative AVP simulation and establishes a foundation for future real-world and hardware-in-the-loop validation. Demo videos and source code are available at this https URL
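下面给出一个极简的 Python 示意,演示摘要中“状态协调、排队与车位预留”这一协调逻辑的基本形态。代码为假设性示例(类名 ParkingCoordinator 与各方法名均为本文虚构),并非论文 Multi-Vehicle AVP Node 的实现;真实系统中该状态需经由 Zenoh 等低延迟通信层在多台主机间同步。

```python
# 极简示意:多车辆 AVP 的车位预留与排队协调逻辑(假设性示例,非论文官方实现)
from collections import deque

class ParkingCoordinator:
    """集中维护车位状态;实际系统中该状态应经由低延迟通信层(如 Zenoh 主题)在多主机间同步。"""

    def __init__(self, spot_ids):
        self.free_spots = set(spot_ids)   # 空闲车位
        self.reservations = {}            # vehicle_id -> spot_id
        self.waiting = deque()            # 等待队列

    def request_spot(self, vehicle_id):
        """车辆请求泊车:有空位则立即预留,否则进入等待队列。"""
        if vehicle_id in self.reservations:
            return self.reservations[vehicle_id]
        if self.free_spots:
            spot = self.free_spots.pop()
            self.reservations[vehicle_id] = spot
            return spot
        self.waiting.append(vehicle_id)
        return None  # 暂无车位,等待后续分配

    def release_spot(self, vehicle_id):
        """车辆驶离:释放车位并分配给队首等待车辆,避免预留冲突。"""
        spot = self.reservations.pop(vehicle_id, None)
        if spot is None:
            return
        if self.waiting:
            nxt = self.waiting.popleft()
            self.reservations[nxt] = spot
        else:
            self.free_spots.add(spot)

# 用法示意
coord = ParkingCoordinator(["P1", "P2"])
print(coord.request_spot("car_a"))  # 'P1' 或 'P2'
print(coord.request_spot("car_b"))
print(coord.request_spot("car_c"))  # None,进入等待队列
coord.release_spot("car_a")         # 释放的车位转移给 car_c
print(coord.reservations)
```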
zh
[AI-48] Space Filling Curves is All You Need: Communication-Avoiding Matrix Multiplication Made Simple
【速读】:该论文旨在解决现代CPU平台上通用矩阵乘法(GEMM)性能优化中因平台和矩阵形状依赖性导致的性能“玻璃下巴”(glass jaw)问题,即传统调优方法难以适应不同硬件架构与输入尺寸,从而限制了可移植性和最佳性能实现。其解决方案的关键在于利用广义空间填充曲线(Generalized Space-Filling Curves, SFC),特别是广义希尔伯特曲线(Generalized Hilbert Curves),对计算空间进行分区,生成平台无关且形状无关的矩阵乘法策略,从而天然具备高数据局部性;进一步地,通过将基于SFC的工作划分扩展为通信避免(Communication-Avoiding, CA)算法,以复制输入张量的方式在关键路径上最小化通信开销,实现无需复杂调优即可获得接近最优性能的紧凑代码(约30行代码),并在多个CPU平台上实现比厂商库最高达2倍几何平均加速的显著效果。
链接: https://arxiv.org/abs/2601.16294
作者: Evangelos Georganas,Alexander Heinecke,Pradeep Dubey
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
Abstract:General Matrix Multiplication (GEMM) is the cornerstone of Deep Learning and HPC workloads; accordingly, academia and industry have heavily optimized this kernel. Modern platforms with matrix multiplication accelerators exhibit high FLOP/Byte machine balance, which makes implementing optimal matrix multiplication challenging. On modern CPU platforms with matrix engines, state-of-the-art vendor libraries tune input tensor layouts, parallelization schemes, and cache blocking to minimize data movement across the memory hierarchy and maximize throughput. However, the best settings for these parameters depend strongly on the target platform (number of cores, memory hierarchy, cache sizes) and on the shapes of the matrices, making exhaustive tuning infeasible; in practice this leads to performance “glass jaws”. In this work we revisit space filling curves (SFC) to alleviate the problem of this cumbersome tuning. SFC convert multi-dimensional coordinates (e.g. 2D) into a single dimension (1D), keeping nearby points in the high-dimensional space close in the 1D order. We partition the Matrix Multiplication computation space using recent advancements in generalized SFC (Generalized Hilbert Curves), and we obtain platform-oblivious and shape-oblivious matrix-multiplication schemes that exhibit inherently high degree of data locality. Furthermore, we extend the SFC-based work partitioning to implement Communication-Avoiding (CA) algorithms that replicate the input tensors and provably minimize communication/data-movement on the critical path. The integration of CA-algorithms is seamless and yields compact code (~30 LOC), yet it achieves state-of-the-art results on multiple CPU platforms, outperforming vendor libraries by up to 2x(geometric-mean speedup) for a range of GEMM shapes.
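为帮助理解“用空间填充曲线对矩阵乘法的计算空间进行分区”这一思路,下面给出一个简化的 Python 示意:用经典 Hilbert 曲线(网格边长为 2 的幂)决定 C 矩阵分块的遍历顺序,使相邻计算的分块在行/列方向彼此靠近,从而提高 A 行块与 B 列块的复用。这只是思路层面的草图,并非论文所用的广义 Hilbert 曲线(gilbert)或通信避免算法的实现。

```python
import numpy as np

def rot(n, x, y, rx, ry):
    # Hilbert 曲线的象限旋转/翻转(经典实现)
    if ry == 0:
        if rx == 1:
            x, y = n - 1 - x, n - 1 - y
        x, y = y, x
    return x, y

def d2xy(n, d):
    # 把一维 Hilbert 索引 d 映射为 n×n 网格(n 为 2 的幂)上的 (x, y)
    x = y = 0
    s, t = 1, d
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        x, y = rot(s, x, y, rx, ry)
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y

def hilbert_tiled_matmul(A, B, tile=64):
    # 按 Hilbert 曲线顺序遍历 C 的分块:相邻访问的分块共享 A 的行块或 B 的列块,局部性更好
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    tm, tn = -(-M // tile), -(-N // tile)   # 分块数(向上取整)
    n = 1
    while n < max(tm, tn):
        n *= 2                              # Hilbert 网格边长取 2 的幂
    for d in range(n * n):
        bi, bj = d2xy(n, d)
        if bi >= tm or bj >= tn:
            continue                        # 跳过越界的虚拟分块
        i0, j0 = bi * tile, bj * tile
        i1, j1 = min(i0 + tile, M), min(j0 + tile, N)
        C[i0:i1, j0:j1] = A[i0:i1, :] @ B[:, j0:j1]
    return C

A, B = np.random.rand(300, 200), np.random.rand(200, 500)
print(np.allclose(hilbert_tiled_matmul(A, B), A @ B))   # True:遍历顺序不影响结果
```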
zh
[AI-49] SemanticALLI: Caching Reasoning Not Just Responses in Agentic Systems
【速读】:该论文旨在解决生成式 AI (Generative AI) 管道中因用户自然语言表述变化而导致的中间逻辑重复重建问题,即即使用户表达不同,系统仍频繁重复计算相同分析意图(如指标归一化或图表结构搭建),造成资源浪费。其解决方案的关键在于提出一种面向管道的语义缓存架构 SemanticALLI,通过将生成过程分解为 Analytic Intent Resolution (AIR) 和 Visualization Synthesis (VS) 两个阶段,将结构化的中间表示(Intermediate Representation, IR)提升为可缓存的第一类对象,从而在语义层面实现跨请求的冗余推理复用,显著提升缓存命中率(从38.7%提升至83.10%),并减少4,023次大语言模型(LLM)调用和整体token消耗。
链接: https://arxiv.org/abs/2601.16286
作者: Varun Chillara,Dylan Kline,Christopher Alvares,Evan Wooten,Huan Yang,Shlok Khetan,Cade Bauer,Tré Guillory,Tanishka Shah,Yashodhara Dhariwal,Volodymyr Pavlov,George Popstefanov
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Agentic AI pipelines suffer from a hidden inefficiency: they frequently reconstruct identical intermediate logic, such as metric normalization or chart scaffolding, even when the user’s natural language phrasing is entirely novel. Conventional boundary caching fails to capture this inefficiency because it treats inference as a monolithic black box. We introduce SemanticALLI, a pipeline-aware architecture within Alli (PMG’s marketing intelligence platform), designed to operationalize redundant reasoning. By decomposing generation into Analytic Intent Resolution (AIR) and Visualization Synthesis (VS), SemanticALLI elevates structured intermediate representations (IRs) to first-class, cacheable artifacts. The impact of caching within the agentic loop is substantial. In our evaluation, baseline monolithic caching caps at a 38.7% hit rate due to linguistic variance. In contrast, our structured approach allows for an additional stage, the Visualization Synthesis stage, to achieve an 83.10% hit rate, bypassing 4,023 LLM calls with a median latency of just 2.66 ms. This internal reuse reduces total token consumption, offering a practical lesson for AI system design: even when users rarely repeat themselves, the pipeline often does, at stable, structured checkpoints where caching is most reliable.
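下面用几行 Python 勾勒“把结构化中间表示(IR)作为可缓存对象”的核心思想:对规范化后的 IR 取哈希作为键,命中即跳过下游 LLM 调用。这是在若干假设下的最小示意(StageCache 类与键设计均为虚构),并非 SemanticALLI 的真实接口。

```python
import hashlib
import json

class StageCache:
    """把规范化后的中间表示(IR)映射到已算好的下游结果;命中则无需再次调用 LLM。"""

    def __init__(self):
        self._store = {}

    @staticmethod
    def key(ir):
        # 不同措辞若解析出相同的 IR,规范化序列化后哈希值一致,即可复用缓存
        return hashlib.sha256(json.dumps(ir, sort_keys=True).encode()).hexdigest()

    def get_or_compute(self, ir, compute_fn):
        k = self.key(ir)
        if k in self._store:
            return self._store[k], True       # 命中:跳过一次下游 LLM 调用
        result = compute_fn(ir)               # 未命中:才真正执行下游阶段
        self._store[k] = result
        return result, False

# 用法示意:同一分析意图的第二次请求直接命中缓存
vs_cache = StageCache()
ir = {"metric": "ctr", "group_by": "campaign", "chart": "bar"}
spec, hit = vs_cache.get_or_compute(ir, lambda x: {"type": x["chart"], "y": x["metric"]})
_, hit_again = vs_cache.get_or_compute(ir, lambda x: {"type": x["chart"], "y": x["metric"]})
print(hit, hit_again)   # False True
```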
zh
[AI-50] When Agents Fail to Act: A Diagnostic Framework for Tool Invocation Reliability in Multi-Agent LLM Systems
【速读】:该论文旨在解决多智能体系统(Multi-agent Systems)中工具使用可靠性评估方法缺失的问题,尤其是在中小企业(SME)部署场景下对隐私敏感环境的需求。其解决方案的关键在于提出了一种基于大数据分析的综合性诊断框架,包含12类错误分类体系,用于系统性评估智能代理在工具初始化、参数处理、执行及结果解释等环节中的故障模式;通过在多种边缘硬件配置上对1,980个确定性测试实例进行评估,识别出适用于生产部署的可靠阈值,并验证了不同模型(如Qwen2.5系列、GPT-4、Claude 3.5/3.7)在可靠性与效率间的权衡关系,从而为资源受限组织提供可落地的生成式AI代理部署方案。
链接: https://arxiv.org/abs/2601.16280
作者: Donghao Huang,Gauri Malwe,Zhaoxia Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted for publication in 2026 The 9th International Conference on Artificial Intelligence and Big Data (ICAIBD 2026)
Abstract:Multi-agent systems powered by large language models (LLMs) are transforming enterprise automation, yet systematic evaluation methodologies for assessing tool-use reliability remain underdeveloped. We introduce a comprehensive diagnostic framework that leverages big data analytics to evaluate procedural reliability in intelligent agent systems, addressing critical needs for SME-centric deployment in privacy-sensitive environments. Our approach features a 12-category error taxonomy capturing failure modes across tool initialization, parameter handling, execution, and result interpretation. Through systematic evaluation of 1,980 deterministic test instances spanning both open-weight models (Qwen2.5 series, Functionary) and proprietary alternatives (GPT-4, Claude 3.5/3.7) across diverse edge hardware configurations, we identify actionable reliability thresholds for production deployment. Our analysis reveals that procedural reliability, particularly tool initialization failures, constitutes the primary bottleneck for smaller models, while qwen2.5:32b achieves flawless performance matching GPT-4.1. The framework demonstrates that mid-sized models (qwen2.5:14b) offer practical accuracy-efficiency trade-offs on commodity hardware (96.6% success rate, 7.3 s latency), enabling cost-effective intelligent agent deployment for resource-constrained organizations. This work establishes foundational infrastructure for systematic reliability evaluation of tool-augmented multi-agent AI systems.
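下面的小段 Python 展示“错误分类体系 + 按记录归类失败模式”这类诊断框架的基本形态。注意:论文定义的是 12 类错误分类,此处仅用 4 个虚构类别与字段名作结构示意,并非原框架的实现。

```python
from collections import Counter
from enum import Enum

class ToolError(Enum):
    INIT_FAILURE = "tool_initialization"       # 工具选择/初始化失败
    BAD_PARAMETERS = "parameter_handling"      # 参数缺失、类型或格式错误
    EXECUTION_ERROR = "execution"              # 工具执行本身报错
    RESULT_MISREAD = "result_interpretation"   # 对返回结果的解读与事实矛盾

def classify_tool_call(call):
    """根据一次工具调用记录的字段返回其失败类别;成功则返回 None(字段名为示意)。"""
    if not call.get("tool_resolved"):
        return ToolError.INIT_FAILURE
    if call.get("schema_errors"):
        return ToolError.BAD_PARAMETERS
    if call.get("exit_code", 0) != 0:
        return ToolError.EXECUTION_ERROR
    if call.get("answer_contradicts_result"):
        return ToolError.RESULT_MISREAD
    return None

# 用法示意:统计一批确定性测试实例的失败分布
calls = [
    {"tool_resolved": False},
    {"tool_resolved": True, "exit_code": 1},
    {"tool_resolved": True},
]
print(Counter(classify_tool_call(c) for c in calls))
```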
zh
[AI-51] Ordering-based Causal Discovery via Generalized Score Matching
【速读】:该论文旨在解决从纯观测数据中学习有向无环图(Directed Acyclic Graph, DAG)结构这一长期存在的挑战,尤其针对离散数据场景下的因果发现问题。其解决方案的关键在于扩展了原本适用于连续数据的分数匹配(score matching)框架,并提出了一种基于离散分数函数的新颖叶节点判别准则(leaf discriminant criterion),从而能够准确推断出真实因果顺序;该排序结果可显著提升现有因果发现基线方法在几乎所有实验设置中的准确性。
链接: https://arxiv.org/abs/2601.16249
作者: Vy Vo,He Zhao,Trung Le,Edwin V. Bonilla,Dinh Phung
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Learning DAG structures from purely observational data remains a long-standing challenge across scientific domains. An emerging line of research leverages the score of the data distribution to initially identify a topological order of the underlying DAG via leaf node detection and subsequently performs edge pruning for graph recovery. This paper extends the score matching framework for causal discovery, which is originally designated for continuous data, and introduces a novel leaf discriminant criterion based on the discrete score function. Through simulated and real-world experiments, we demonstrate that our theory enables accurate inference of true causal orders from observed discrete data and the identified ordering can significantly boost the accuracy of existing causal discovery baselines on nearly all of the settings.
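这类基于分数(score)的排序式因果发现,通用流程是“识别叶节点、移除、重复”,最终得到拓扑序。下面给出该流程骨架的 Python 示意;其中 leaf_statistic 只是占位统计量,论文针对离散数据提出的叶节点判别准则请以原文为准。

```python
import numpy as np

def leaf_statistic(X_sub, col):
    # 占位实现:实际方法通常基于分数函数(如其雅可比对角项的方差)判别叶节点
    return np.var(X_sub[:, col])

def estimate_causal_order(X):
    """反复找出当前变量集中的叶节点并移除,得到一个拓扑序(根节点在前)。"""
    remaining = list(range(X.shape[1]))
    order = []
    while remaining:
        sub = X[:, remaining]
        stats = {j: leaf_statistic(sub, idx) for idx, j in enumerate(remaining)}
        leaf = min(stats, key=stats.get)   # 判别出的叶节点
        order.append(leaf)
        remaining.remove(leaf)
    return order[::-1]                      # 先被移除的是叶子,反转后根在前

X = np.random.randn(500, 4)
print(estimate_causal_order(X))
```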
zh
[AI-52] A New Paradigm for Trusted Respiratory Monitoring Via Consumer Electronics-grade Radar Signals
【速读】:该论文旨在解决基于雷达的呼吸监测中用户敏感身份信息(User-sensitive Identity Information, USI)泄露问题,同时确保在非接触式监测场景下仍能保持高精度的呼吸信号检测。其核心挑战在于消费级雷达数据不可避免地包含与个体身份相关的特征,若未加处理可能被恶意利用导致隐私泄露。解决方案的关键在于提出一种可信呼吸监测范式 Tru-RM,其三大核心技术为:属性特征解耦(Attribute Feature Decoupling, AFD),用于将原始雷达信号分解为通用呼吸成分、个人差异成分及其他无关成分;灵活扰动加密器(Flexible Perturbation Encryptor, FPE),通过引入大噪声抑制无关成分,并结合带学习强度参数的相位噪声算法消除个人差异成分中的USI,从而实现无损的用户身份匿名化;以及鲁棒扰动容忍网络(Perturbation Tolerable Network, PTN),设计迁移的广义域不变网络以应对波形显著变化下的呼吸检测任务,保障扰动后呼吸信号的高精度识别。
链接: https://arxiv.org/abs/2601.16241
作者: Xinyu Li,Jinyang Huang,Feng-Qi Cui,Meng Wang,Peng Zhao,Meng Li,Dan Guo,Meng Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Respiratory monitoring is an extremely important task in modern medical services. Due to its significant advantages, e.g., non-contact, radar-based respiratory monitoring has attracted widespread attention from both academia and industry. Unfortunately, though it can achieve high monitoring accuracy, consumer electronics-grade radar data inevitably contains User-sensitive Identity Information (USI), which may be maliciously used and further lead to privacy leakage. To track these challenges, by variational mode decomposition (VMD) and adversarial loss-based encryption, we propose a novel Trusted Respiratory Monitoring paradigm, Tru-RM, to perform automated respiratory monitoring through radio signals while effectively anonymizing USI. The key enablers of Tru-RM are Attribute Feature Decoupling (AFD), Flexible Perturbation Encryptor (FPE), and robust Perturbation Tolerable Network (PTN) used for attribute decomposition, identity encryption, and perturbed respiratory monitoring, respectively. Specifically, AFD is designed to decompose the raw radar signals into the universal respiratory component, the personal difference component, and other unrelated components. Then, by using large noise to drown out the other unrelated components, and the phase noise algorithm with a learning intensity parameter to eliminate USI in the personal difference component, FPE is designed to achieve complete user identity information encryption without affecting respiratory features. Finally, by designing the transferred generalized domain-independent network, PTN is employed to accurately detect respiration when waveforms change significantly. Extensive experiments based on various detection distances, respiratory patterns, and durations demonstrate the superior performance of Tru-RM on strong anonymity of USI, and high detection accuracy of perturbed respiratory waveforms.
zh
[AI-53] VibeTensor: System Software for Deep Learning Fully Generated by AI Agents
【速读】:该论文旨在解决如何通过大语言模型(LLM)驱动的编码代理(coding agents)在高度人类指导下的自动化生成完整、可运行且经过验证的深度学习系统软件栈的问题。传统AI辅助软件工程多局限于局部代码片段生成,而本文提出并实现了一个“全自动生成”(fully generated)的深度学习框架——VIBETENSOR,其核心突破在于:不仅生成了从Python层到CUDA内存管理的端到端代码,还通过自动化构建、测试与差异性检查完成验证,无需人工逐变更审查。关键解决方案包括:基于C++20的高性能张量核心(支持CPU+GPU)、PyTorch风格的Python绑定(nanobind)、自研张量存储系统与反向自动微分机制、CUDA运行时功能集成(流/事件/图)、带诊断能力的流有序缓存分配器,以及稳定的C ABI用于动态插件扩展。该成果标志着编码代理已具备生成复杂、结构化系统级软件的能力,并在多个基准测试和端到端训练任务中验证了其实用性与稳定性。
链接: https://arxiv.org/abs/2601.16238
作者: Bing Xu,Terry Chen,Fengzhe Zhou,Tianqi Chen,Yangqing Jia,Vinod Grover,Haicheng Wu,Wei Liu,Craig Wittenbrink,Wen-mei Hwu,Roger Bringmann,Ming-Yu Liu,Luis Ceze,Michael Lightstone,Humphrey Shi
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Open-source: this https URL
Abstract:VIBETENSOR is an open-source research system software stack for deep learning, generated by LLM-powered coding agents under high-level human guidance. In this paper, “fully generated” refers to code provenance: implementation changes were produced and applied as agent-proposed diffs; validation relied on agent-run builds, tests, and differential checks, without per-change manual diff review. It implements a PyTorch-style eager tensor library with a C++20 core (CPU+CUDA), a torch-like Python overlay via nanobind, and an experimental this http URL interface. Unlike thin bindings, VIBETENSOR includes its own tensor/storage system, schema-lite dispatcher, reverse-mode autograd, CUDA runtime (streams/events/graphs), a stream-ordered caching allocator with diagnostics, and a stable C ABI for dynamically loaded operator plugins. We view this release as a milestone for AI-assisted software engineering: it shows coding agents can generate a coherent deep learning runtime spanning language bindings down to CUDA memory management, validated primarily by builds and tests. We describe the architecture, summarize the workflow used to produce and validate the system, and evaluate the artifact. We report repository scale and test-suite composition, and summarize reproducible microbenchmarks from an accompanying AI-generated kernel suite, including fused attention versus PyTorch SDPA/FlashAttention. We also report end-to-end training sanity checks on 3 small workloads (sequence reversal, ViT, miniGPT) on NVIDIA H100 (Hopper, SM90) and Blackwell-class GPUs; multi-GPU results are Blackwell-only and use an optional CUTLASS-based ring-allreduce plugin gated on CUDA 13+ and sm103a toolchain support. Finally, we discuss failure modes in generated system software, including a “Frankenstein” composition effect where locally correct subsystems interact to yield globally suboptimal performance.
zh
[AI-54] Computational Foundations for Strategic Coopetition: Formalizing Collective Action and Loyalty
【速读】:该论文旨在解决混合动机多智能体环境中持续搭便车(free-riding)问题,即个体努力对所有成员均有益,但成本由自身承担,导致纳什均衡下普遍怠工的现象。其核心解决方案是构建一种基于忠诚度调节的效用函数,包含两个关键机制:一是忠诚收益(福利内化与内在贡献满足感),二是成本容限(忠诚成员的努力负担降低)。通过依赖加权团队凝聚力将i*结构依赖映射至团队层面激励,实现了从个体到团队的策略协同建模,并在3,125种配置中验证了忠诚度效应显著(中位努力差异达15.04倍),六项行为目标全部达标,且在Apache HTTP Server长达28年的开源项目案例中成功复现各阶段贡献模式,统计显著性达到p<0.001,效应量Cohen’s d=0.71。
链接: https://arxiv.org/abs/2601.16237
作者: Vik Pant,Eric Yu
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Software Engineering (cs.SE)
备注: 68 pages, 22 figures. Third technical report in research program; should be read with companion arXiv:2510.18802 and arXiv:2510.24909 . Adapts and extends complex actor material from Pant (2021) doctoral dissertation, University of Toronto
Abstract:Mixed-motive multi-agent settings are rife with persistent free-riding because individual effort benefits all members equally, yet each member bears the full cost of their own contribution. Classical work by Holmström established that under pure self-interest, Nash equilibrium is universal shirking. While i* represents teams as composite actors, it lacks scalable computational mechanisms for analyzing how collective action problems emerge and resolve in coopetitive settings. This technical report extends computational foundations for strategic coopetition to team-level dynamics, building on companion work formalizing interdependence/complementarity (arXiv:2510.18802) and trust dynamics (arXiv:2510.24909). We develop loyalty-moderated utility functions with two mechanisms: loyalty benefit (welfare internalization plus intrinsic contribution satisfaction) and cost tolerance (reduced effort burden for loyal members). We integrate i* structural dependencies through dependency-weighted team cohesion, connecting member incentives to team-level positioning. The framework applies to both human teams (loyalty as psychological identification) and multi-agent systems (alignment coefficients and adjusted cost functions). Experimental validation across 3,125 configurations demonstrates robust loyalty effects (15.04x median effort differentiation). All six behavioral targets achieve thresholds: free-riding baseline (96.5%), loyalty monotonicity (100%), effort differentiation (100%), team size effect (100%), mechanism synergy (99.5%), and bounded outcomes (100%). Empirical validation using published Apache HTTP Server (1995-2023) case study achieves 60/60 points, reproducing contribution patterns across formation, growth, maturation, and governance phases. Statistical significance confirmed at p < 0.001, Cohen’s d=0.71.
zh
[AI-55] Policy-Embedded Graph Expansion: Networked HIV Testing with Diffusion-Driven Network Samples
【速读】:该论文旨在解决现实场景中HIV检测效率低下的问题,特别是在增量式揭示的疾病传播网络上如何优化序列检测策略,以提升检测覆盖率并减少资源消耗。其核心挑战在于现有智能算法依赖不切实际的假设,难以在真实世界部署。解决方案的关键是提出Policy-Embedded Graph Expansion (PEGE) 框架,该框架将图扩展的生成分布直接嵌入决策策略中,而非进行显式的拓扑重建;同时引入Dynamics-Driven Branching (DDB) 模型,基于扩散机制实现数据受限环境下自然形成的树状结构(如转介流程)的高效建模与决策支持。实验表明,PEGE + DDB组合方法在真实HIV传播网络上显著优于基线模型,例如在仅测试25%人口的情况下,检测率提升9%,且折扣奖励提高13%。
链接: https://arxiv.org/abs/2601.16233
作者: Akseli Kangaslahti,Davin Choo,Lingkai Kong,Milind Tambe,Alastair van Heerden,Cheryl Johnson
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:
Abstract:HIV is a retrovirus that attacks the human immune system and can lead to death without proper treatment. In collaboration with the WHO and Wits University, we study how to improve the efficiency of HIV testing with the goal of eventual deployment, directly supporting progress toward UN Sustainable Development Goal 3.3. While prior work has demonstrated the promise of intelligent algorithms for sequential, network-based HIV testing, existing approaches rely on assumptions that are impractical in our real-world implementations. Here, we study sequential testing on incrementally revealed disease networks and introduce Policy-Embedded Graph Expansion (PEGE), a novel framework that directly embeds a generative distribution over graph expansions into the decision-making policy rather than attempting explicit topological reconstruction. We further propose Dynamics-Driven Branching (DDB), a diffusion-based graph expansion model that supports decision making in PEGE and is designed for data-limited settings where forest structures arise naturally, as in our real-world referral process. Experiments on real HIV transmission networks show that the combined approach (PEGE + DDB) consistently outperforms existing baselines (e.g., 13% improvement in discounted reward and 9% more HIV detections with 25% of the population tested) and explore key tradeoffs that drive decision quality.
zh
[AI-56] Interpretable Fine-Gray Deep Survival Model for Competing Risks: Predicting Post-Discharge Foot Complications for Diabetic Patients in Ontario
【速读】:该论文旨在解决深度学习模型在医疗领域应用中因缺乏透明性而导致的AI安全性和临床信任问题,特别是在存在竞争风险的生存建模场景下。其核心挑战在于如何在保持高预测性能的同时实现模型的内在可解释性。解决方案的关键在于提出一种名为CRISPNAM-FG的内在可解释生存模型,该模型基于神经加法模型(Neural Additive Models, NAMs)结构,并为每个风险类别设计独立的投影向量,通过Fine-Gray方法预测累积发生函数(Cumulative Incidence Function, CIF),从而在不依赖后处理解释技术的前提下提供形状函数(shape functions)和特征重要性图(feature importance plots),实现预测过程的透明化与可审计性。
链接: https://arxiv.org/abs/2511.12409
作者: Dhanesh Ramachandram,Anne Loefler,Surain Roberts,Amol Verma,Maia Norman,Fahad Razak,Conrad Pow,Charles de Mestral
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Model interpretability is crucial for establishing AI safety and clinician trust in medical applications for example, in survival modelling with competing risks. Recent deep learning models have attained very good predictive performance but their limited transparency, being black-box models, hinders their integration into clinical practice. To address this gap, we propose an intrinsically interpretable survival model called CRISPNAM-FG. Leveraging the structure of Neural Additive Models (NAMs) with separate projection vectors for each risk, our approach predicts the Cumulative Incidence Function using the Fine-Gray formulation, achieving high predictive power with intrinsically transparent and auditable predictions. We validated the model on several benchmark datasets and applied our model to predict future foot complications in diabetic patients across 29 Ontario hospitals (2016-2023). Our method achieves competitive performance compared to other deep survival models while providing transparency through shape functions and feature importance plots.
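为说明“神经加法模型(NAM)+ 每个风险独立投影向量”这一结构化思想,下面给出一个 PyTorch 最小骨架:每个特征对应一个小型 MLP,其输出经各风险的投影向量映射为该风险下的加法贡献,因而可以绘制逐特征的 shape function。这只是结构示意(超参数与类名均为假设),不包含 Fine-Gray 累积发生函数的具体损失,也不是 CRISPNAM-FG 的官方代码。

```python
import torch
import torch.nn as nn

class FeatureNet(nn.Module):
    """单个特征的 shape function:输入一维特征,输出隐表示。"""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())

    def forward(self, x):            # x: (batch, 1)
        return self.net(x)           # (batch, hidden)

class CompetingRiskNAM(nn.Module):
    def __init__(self, n_features, n_risks, hidden=32):
        super().__init__()
        self.feature_nets = nn.ModuleList([FeatureNet(hidden) for _ in range(n_features)])
        # 每个风险一个投影向量:把各特征的隐表示映射为该风险下的加法贡献
        self.risk_proj = nn.Parameter(torch.randn(n_risks, hidden) * 0.01)

    def forward(self, x):            # x: (batch, n_features)
        contribs = []
        for j, fnet in enumerate(self.feature_nets):
            h = fnet(x[:, j:j + 1])                      # (batch, hidden)
            contribs.append(h @ self.risk_proj.T)        # (batch, n_risks):特征 j 的贡献
        contribs = torch.stack(contribs, dim=1)          # (batch, n_features, n_risks)
        return contribs.sum(dim=1), contribs             # 总风险得分 + 可解释的逐特征贡献

model = CompetingRiskNAM(n_features=5, n_risks=2)
scores, per_feature = model(torch.randn(8, 5))
print(scores.shape, per_feature.shape)   # torch.Size([8, 2]) torch.Size([8, 5, 2])
```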
zh
[AI-57] BONO-Bench: A Comprehensive Test Suite for Bi-objective Numerical Optimization with Traceable Pareto Sets
【速读】:该论文旨在解决多目标优化(Multi-Objective Optimization)领域中基准测试(Benchmarking)所面临的两大问题:一是人工构造的测试问题虽具有明确最优解但缺乏现实性与多样性;二是将单目标复杂问题组合成多目标问题时,难以控制和理解其性质。解决方案的关键在于提出一种基于理论可分析的凸二次函数(convex-quadratic functions)组合生成方法,能够系统地构建具有可控属性(如决策变量数、局部最优数量、帕累托前沿形状、目标空间平台区域及条件数等)的双目标数值优化问题,并确保帕累托前沿(Pareto front)可通过诸如超体积(hypervolume)或R2指标等帕累托合规性能指标精确逼近。该方法形成了名为BONO-Bench的20类测试问题集,并以Python包bonobench形式开源发布,从而支持可复现且结构清晰的基准评估。
链接: https://arxiv.org/abs/2601.16970
作者: Lennart Schäpermeier,Pascal Kerschke
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
备注: Accepted for publication in the Special Issue on Benchmarking in Multi-Criteria Optimization at ACM TELO
Abstract:The evaluation of heuristic optimizers on test problems, better known as benchmarking, is a cornerstone of research in multi-objective optimization. However, most test problems used in benchmarking numerical multi-objective black-box optimizers come from one of two flawed approaches: On the one hand, problems are constructed manually, which result in problems with well-understood optimal solutions, but unrealistic properties and biases. On the other hand, more realistic and complex single-objective problems are composited into multi-objective problems, but with a lack of control and understanding of problem properties. This paper proposes an extensive problem generation approach for bi-objective numerical optimization problems consisting of the combination of theoretically well-understood convex-quadratic functions into unimodal and multimodal landscapes with and without global structure. It supports configuration of test problem properties, such as the number of decision variables, local optima, Pareto front shape, plateaus in the objective space, or degree of conditioning, while maintaining theoretical tractability: The optimal front can be approximated to an arbitrary degree of precision regarding Pareto-compliant performance indicators such as the hypervolume or the exact R2 indicator. To demonstrate the generator’s capabilities, a test suite of 20 problem categories, called BONO-Bench, is created and subsequently used as a basis of an illustrative benchmark study. Finally, the general approach underlying our proposed generator, together with the associated test suite, is publicly released in the Python package bonobench to facilitate reproducible benchmarking.
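下面用 NumPy 勾勒“由可控条件数的凸二次函数组合出双目标测试问题”的基本做法:分别构造两个条件数不同、最优点不同的凸二次函数,拼成一个双目标问题。这只是思路示意(函数名与参数均为假设),不是 bonobench 包的实际 API,也未覆盖多峰景观、平台区域等论文支持的配置项。

```python
import numpy as np

def make_convex_quadratic(dim, cond=10.0, seed=None):
    """构造 f(x) = (x - x*)^T H (x - x*),其中 H 的条件数由 cond 控制。"""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))          # 随机正交旋转
    H = Q @ np.diag(np.logspace(0, np.log10(cond), dim)) @ Q.T    # 特征值谱决定条件数
    x_star = rng.uniform(-4, 4, size=dim)                         # 该目标的最优点
    f = lambda x: float((x - x_star) @ H @ (x - x_star))
    return f, x_star

def make_biobjective(dim=3, cond=(10.0, 100.0), seed=0):
    f1, opt1 = make_convex_quadratic(dim, cond[0], seed)
    f2, opt2 = make_convex_quadratic(dim, cond[1], seed + 1)
    F = lambda x: (f1(x), f2(x))
    return F, opt1, opt2

F, opt1, opt2 = make_biobjective()
# 在两个单目标最优点之间做线性插值,可粗略观察两目标之间的权衡(真正的帕累托集一般是一条曲线)
for t in np.linspace(0.0, 1.0, 5):
    print(F((1 - t) * opt1 + t * opt2))
```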
zh
[AI-58] ES4R: Speech Encoding Based on Prepositive Affective Modeling for Empathetic Response Generation
【速读】:该论文旨在解决现有语音到语音大语言模型在生成共情回应时对情感信息(如语调、语气和情感强度)建模不足的问题,这些问题通常源于依赖自动语音识别(ASR)转录或通过编码器提取潜在表示,从而削弱了多轮对话中的情感信息与上下文连贯性。其解决方案的关键在于提出ES4R框架,通过在语音编码前显式建模结构化的共情上下文,而非依赖编码器的隐式学习或显式情绪监督;具体创新包括引入双层注意力机制以捕捉话语级情感状态与对话级情感动态,并利用语音引导的跨模态注意力将情感表征与文本语义融合,最终实现更自然且富有同理心的语音响应生成。
链接: https://arxiv.org/abs/2601.16225
作者: Zhuoyue Gao,Xiaohui Wang,Xiaocui Yang,Wen Zhang,Daling Wang,Shi Feng,Yifei Zhang
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:
Abstract:Empathetic speech dialogue requires not only understanding linguistic content but also perceiving rich paralinguistic information such as prosody, tone, and emotional intensity for affective understandings. Existing speech-to-speech large language models either rely on ASR transcription or use encoders to extract latent representations, often weakening affective information and contextual coherence in multi-turn dialogues. To address this, we propose \textbfES4R, a framework for speech-based empathetic response generation. Our core innovation lies in explicitly modeling structured affective context before speech encoding, rather than relying on implicit learning by the encoder or explicit emotion supervision. Specifically, we introduce a dual-level attention mechanism to capture turn-level affective states and dialogue-level affective dynamics. The resulting affective representations are then integrated with textual semantics through speech-guided cross-modal attention to generate empathetic responses. For speech output, we employ energy-based strategy selection and style fusion to achieve empathetic speech synthesis. ES4R consistently outperforms strong baselines in both automatic and human evaluations and remains robust across different LLM backbones.
zh
机器学习
[LG-0] Latent Diffusion for Internet of Things Attack Data Generation in Intrusion Detection
链接: https://arxiv.org/abs/2601.16976
作者: Estela Sánchez-Carballo,Francisco M. Melgarejo-Meseguer,José Luis Rojo-Álvarez
类目: Machine Learning (cs.LG)
*备注: Submitted to IEEE. 15 pages, 2 figures
Abstract:Intrusion Detection Systems (IDSs) are a key component for protecting Internet of Things (IoT) environments. However, in Machine Learning-based (ML-based) IDSs, performance is often degraded by the strong class imbalance between benign and attack traffic. Although data augmentation has been widely explored to mitigate this issue, existing approaches typically rely on simple oversampling techniques or generative models that struggle to simultaneously achieve high sample fidelity, diversity, and computational efficiency. To address these limitations, we propose the use of a Latent Diffusion Model (LDM) for attack data augmentation in IoT intrusion detection and provide a comprehensive comparison against state-of-the-art baselines. Experiments were conducted on three representative IoT attack types, specifically Distributed Denial-of-Service (DDoS), Mirai, and Man-in-the-Middle, evaluating both downstream IDS performance and intrinsic generative quality using distributional, dependency-based, and diversity metrics. Results show that balancing the training data with LDM-generated samples substantially improves IDS performance, achieving F1-scores of up to 0.99 for DDoS and Mirai attacks and consistently outperforming competing methods. Additionally, quantitative and qualitative analyses demonstrate that LDMs effectively preserve feature dependencies while generating diverse samples and reduce sampling time by approximately 25% compared to diffusion models operating directly in data space. These findings highlight latent diffusion as an effective and scalable solution for synthetic IoT attack data generation, substantially mitigating the impact of class imbalance in ML-based IDSs for IoT scenarios.
[LG-1] Auto-Regressive Masked Diffusion Models
链接: https://arxiv.org/abs/2601.16971
作者: Mahdi Karami,Ali Ghodsi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Masked diffusion models (MDMs) have emerged as a promising approach for language modeling, yet they face a performance gap compared to autoregressive models (ARMs) and require more training iterations. In this work, we present the Auto-Regressive Masked Diffusion (ARMD) model, an architecture designed to close this gap by unifying the training efficiency of autoregressive models with the parallel generation capabilities of diffusion-based models. Our key insight is to reframe the masked diffusion process as a block-wise causal model. This perspective allows us to design a strictly causal, permutation-equivariant architecture that computes all conditional probabilities across multiple denoising steps in a single, parallel forward pass. The resulting architecture supports efficient, autoregressive-style decoding and a progressive permutation training scheme, allowing the model to learn both canonical left-to-right and random token orderings. Leveraging this flexibility, we introduce a novel strided parallel generation strategy that accelerates inference by generating tokens in parallel streams while maintaining global coherence. Empirical results demonstrate that ARMD achieves state-of-the-art performance on standard language modeling benchmarks, outperforming established diffusion baselines while requiring significantly fewer training steps. Furthermore, it establishes a new benchmark for parallel text generation, effectively bridging the performance gap between parallel and sequential decoding.
[LG-2] 3D Molecule Generation from Rigid Motifs via SE(3) Flows
链接: https://arxiv.org/abs/2601.16955
作者: Roman Poletukhin,Marcel Kollovieh,Eike Eberhard,Stephan Günnemann
类目: Machine Learning (cs.LG)
*备注:
Abstract:Three-dimensional molecular structure generation is typically performed at the level of individual atoms, yet molecular graph generation techniques often consider fragments as their structural units. Building on the advances in frame-based protein structure generation, we extend these fragmentation ideas to 3D, treating general molecules as sets of rigid-body motifs. Utilising this representation, we employ SE(3)-equivariant generative modelling for de novo 3D molecule generation from rigid motifs. In our evaluations, we observe comparable or superior results to state-of-the-art across benchmarks, surpassing it in atom stability on GEOM-Drugs, while yielding a 2x to 10x reduction in generation steps and offering 3.5x compression in molecular representations compared to the standard atom-based methods.
[LG-3] Is BatchEnsemble a Single Model? On Calibration and Diversity of Efficient Ensembles
链接: https://arxiv.org/abs/2601.16936
作者: Anton Zamyatin,Patrick Indri,Sagar Malhotra,Thomas Gärtner
类目: Machine Learning (cs.LG)
*备注: Accepted at the 1st workshop on Epistemic Intelligence in Machine Learning at EurIPS 2025
Abstract:In resource-constrained and low-latency settings, uncertainty estimates must be efficiently obtained. Deep Ensembles provide robust epistemic uncertainty (EU) but require training multiple full-size models. BatchEnsemble aims to deliver ensemble-like EU at far lower parameter and memory cost by applying learned rank-1 perturbations to a shared base network. We show that BatchEnsemble not only underperforms Deep Ensembles but closely tracks a single model baseline in terms of accuracy, calibration and out-of-distribution (OOD) detection on CIFAR10/10C/SVHN. A controlled study on MNIST finds members are near-identical in function and parameter space, indicating limited capacity to realize distinct predictive modes. Thus, BatchEnsemble behaves more like a single model than a true ensemble.
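BatchEnsemble 的核心是在共享权重上为每个成员施加 rank-1 扰动,等价于先用成员向量 r_i 缩放输入、再用 s_i 缩放输出。下面是一个 PyTorch 最小示意(非论文或任何库的官方实现),顺带展示用成员间方差作为简单不确定性信号的用法。

```python
import torch
import torch.nn as nn

class BatchEnsembleLinear(nn.Module):
    """BatchEnsemble 风格的线性层:共享权重 W,成员 i 等价于使用 W_i = W * (s_i r_i^T)。"""
    def __init__(self, in_dim, out_dim, n_members):
        super().__init__()
        self.shared = nn.Linear(in_dim, out_dim, bias=False)      # 所有成员共享
        self.r = nn.Parameter(torch.ones(n_members, in_dim))       # 成员的输入缩放向量
        self.s = nn.Parameter(torch.ones(n_members, out_dim))      # 成员的输出缩放向量

    def forward(self, x, member):
        # 先按 r_i 缩放输入,过共享层,再按 s_i 缩放输出
        return self.shared(x * self.r[member]) * self.s[member]

layer = BatchEnsembleLinear(16, 4, n_members=4)
x = torch.randn(8, 16)
outs = torch.stack([layer(x, m) for m in range(4)])   # 4 个成员的预测
print(outs.shape)               # torch.Size([4, 8, 4])
print(outs.var(dim=0).mean())   # 成员间方差可作为简单的认知不确定性信号
```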
[LG-4] Group-realizable multi-group learning by minimizing empirical risk
链接: https://arxiv.org/abs/2601.16922
作者: Navid Ardeshir,Samuel Deng,Daniel Hsu,Jingwen Liu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:The sample complexity of multi-group learning is shown to improve in the group-realizable setting over the agnostic setting, even when the family of groups is infinite so long as it has finite VC dimension. The improved sample complexity is obtained by empirical risk minimization over the class of group-realizable concepts, which itself could have infinite VC dimension. Implementing this approach is also shown to be computationally intractable, and an alternative approach is suggested based on improper learning.
[LG-5] Calibrated Similarity for Reliable Geometric Analysis of Embedding Spaces
链接: https://arxiv.org/abs/2601.16907
作者: Nicolas Tacheny
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: substantial text overlap with arXiv:2512.10350
Abstract:While raw cosine similarity in pretrained embedding spaces exhibits strong rank correlation with human judgments, anisotropy induces systematic miscalibration of absolute values: scores concentrate in a narrow high-similarity band regardless of actual semantic relatedness, limiting interpretability as a quantitative measure. Prior work addresses this by modifying the embedding space (whitening, contrastive fine tuning), but such transformations alter geometric structure and require recomputing all embeddings. Using isotonic regression trained on human similarity judgments, we construct a monotonic transformation that achieves near-perfect calibration while preserving rank correlation and local stability (98% across seven perturbation types). Our contribution is not to replace cosine similarity, but to restore interpretability of its absolute values through monotone calibration, without altering its ranking properties. We characterize isotonic calibration as an order-preserving reparameterization and prove that all order-based constructions (angular ordering, nearest neighbors, threshold graphs and quantile-based decisions) are invariant under this transformation.
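文中的校准思路可以直接用 scikit-learn 的保序回归实现:学习一个单调映射,把原始余弦相似度拉回到与人工评分对齐的刻度,同时不改变排序。下面是一个用随机演示数据的最小示意(数据与参数均为假设,并非论文的实验设置)。

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
human = rng.uniform(0, 1, 200)                              # 人工相似度评分(归一化到 [0,1])
cosine = 0.7 + 0.25 * human + rng.normal(0, 0.02, 200)      # 模拟“挤在高分段”的原始余弦相似度

iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(cosine, human)                                      # 学习单调映射:余弦值 -> 校准后相似度

new_scores = np.array([0.72, 0.80, 0.93])
print(iso.predict(new_scores))   # 校准后的相似度:绝对值可解释,且保持原有排序
```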
[LG-6] The Trajectory Alignment Coefficient in Two Acts: From Reward Tuning to Reward Learning
链接: https://arxiv.org/abs/2601.16906
作者: Calarina Muslimani,Yunshu Du,Kenta Kawamoto,Kaushik Subramanian,Peter Stone,Peter Wurman
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注:
Abstract:The success of reinforcement learning (RL) is fundamentally tied to having a reward function that accurately reflects the task objective. Yet, designing reward functions is notoriously time-consuming and prone to misspecification. To address this issue, our first goal is to understand how to support RL practitioners in specifying appropriate weights for a reward function. We leverage the Trajectory Alignment Coefficient (TAC), a metric that evaluates how closely a reward function’s induced preferences match those of a domain expert. To evaluate whether TAC provides effective support in practice, we conducted a human-subject study in which RL practitioners tuned reward weights for Lunar Lander. We found that providing TAC during reward tuning led participants to produce more performant reward functions and report lower cognitive workload relative to standard tuning without TAC. However, the study also underscored that manual reward design, even with TAC, remains labor-intensive. This limitation motivated our second goal: to learn a reward model that maximizes TAC directly. Specifically, we propose Soft-TAC, a differentiable approximation of TAC that can be used as a loss function to train reward models from human preference data. Validated in the racing simulator Gran Turismo 7, reward models trained using Soft-TAC successfully captured preference-specific objectives, resulting in policies with qualitatively more distinct behaviors than models trained with standard Cross-Entropy loss. This work demonstrates that TAC can serve as both a practical tool for guiding reward tuning and a reward learning objective in complex domains.
[LG-7] FedSGM: A Unified Framework for Constraint Aware Bidirectionally Compressed Multi-Step Federated Optimization
链接: https://arxiv.org/abs/2601.16897
作者: Antesh Upadhyay,Sang Bin Moon,Abolfazl Hashemi
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
Abstract:We introduce FedSGM, a unified framework for federated constrained optimization that addresses four major challenges in federated learning (FL): functional constraints, communication bottlenecks, local updates, and partial client participation. Building on the switching gradient method, FedSGM provides projection-free, primal-only updates, avoiding expensive dual-variable tuning or inner solvers. To handle communication limits, FedSGM incorporates bi-directional error feedback, correcting the bias introduced by compression while explicitly understanding the interaction between compression noise and multi-step local updates. We derive convergence guarantees showing that the averaged iterate achieves the canonical \mathcal{O}(1/\sqrt{T}) rate, with additional high-probability bounds that decouple optimization progress from sampling noise due to partial participation. Additionally, we introduce a soft switching version of FedSGM to stabilize updates near the feasibility boundary. To our knowledge, FedSGM is the first framework to unify functional constraints, compression, multiple local updates, and partial client participation, establishing a theoretically grounded foundation for constrained federated learning. Finally, we validate the theoretical guarantees of FedSGM via experimentation on Neyman-Pearson classification and constrained Markov decision process (CMDP) tasks.
[LG-8] Multigrade Neural Network Approximation
链接: https://arxiv.org/abs/2601.16884
作者: Shijun Zhang,Zuowei Shen,Yuesheng Xu
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注:
Abstract:We study multigrade deep learning (MGDL) as a principled framework for structured error refinement in deep neural networks. While the approximation power of neural networks is now relatively well understood, training very deep architectures remains challenging due to highly non-convex and often ill-conditioned optimization landscapes. In contrast, for relatively shallow networks, most notably one-hidden-layer ReLU models, training admits convex reformulations with global guarantees, motivating learning paradigms that improve stability while scaling to depth. MGDL builds upon this insight by training deep networks grade by grade: previously learned grades are frozen, and each new residual block is trained solely to reduce the remaining approximation error, yielding an interpretable and stable hierarchical refinement process. We develop an operator-theoretic foundation for MGDL and prove that, for any continuous target function, there exists a fixed-width multigrade ReLU scheme whose residuals decrease strictly across grades and converge uniformly to zero. To the best of our knowledge, this work provides the first rigorous theoretical guarantee that grade-wise training yields provable vanishing approximation error in deep networks. Numerical experiments further illustrate the theoretical results.
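多级训练的流程可以概括为:训练第 g 级去拟合当前残差,冻结后把残差更新为“目标减去已学到的部分”,再训练下一级。下面给出一个一维回归上的 PyTorch 最小示意(网络宽度、级数等均为随意设定),仅用于说明 grade-wise 训练的机制,并非论文的实验配置。

```python
import torch
import torch.nn as nn

def make_grade(width=64):
    return nn.Sequential(nn.Linear(1, width), nn.ReLU(), nn.Linear(width, 1))

def train_grade(grade, x, residual, steps=2000, lr=1e-2):
    opt = torch.optim.Adam(grade.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((grade(x) - residual) ** 2).mean()   # 只拟合当前残差
        loss.backward()
        opt.step()
    return loss.item()

x = torch.linspace(-1, 1, 256).unsqueeze(1)
y = torch.sin(4 * x) + 0.3 * torch.sign(x)           # 目标函数
residual, grades = y.clone(), []
for g in range(3):                                    # 逐级训练,已训练的级被冻结
    grade = make_grade()
    mse = train_grade(grade, x, residual)
    grades.append(grade.eval())
    with torch.no_grad():
        residual = residual - grade(x)                # 下一级只需拟合剩余误差
    print(f"grade {g}: residual MSE = {mse:.5f}")
```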
[LG-9] Theory of Minimal Weight Perturbations in Deep Networks and its Applications for Low-Rank Activated Backdoor Attacks
链接: https://arxiv.org/abs/2601.16880
作者: Bethan Evans,Jared Tanner
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:
Abstract:The minimal norm weight perturbations of DNNs required to achieve a specified change in output are derived and the factors determining their size are discussed. These single-layer exact formulae are contrasted with more generic multi-layer Lipschitz constant based robustness guarantees; both are observed to be of the same order which indicates similar efficacy in their guarantees. These results are applied to precision-modification-activated backdoor attacks, establishing provable compression thresholds below which such attacks cannot succeed, and show empirically that low-rank compression can reliably activate latent backdoors while preserving full-precision accuracy. These expressions reveal how back-propagated margins govern layer-wise sensitivity and provide certifiable guarantees on the smallest parameter updates consistent with a desired output shift.
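对单个线性层 y = W x,使输出恰好平移 delta 的最小 Frobenius 范数扰动有熟知的闭式解 ΔW = delta·xᵀ/‖x‖²(秩 1 更新),其范数为 ‖delta‖/‖x‖。下面用 NumPy 做一个数值验证;这只是单层情形的示意,论文中逐层公式与多层 Lipschitz 界的精确表达式请参见原文。

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))
x = rng.standard_normal(5)
delta = np.array([0.5, -0.2, 0.0])               # 希望输出产生的改变量

dW = np.outer(delta, x) / np.dot(x, x)           # 最小 Frobenius 范数的秩 1 扰动
print(np.allclose((W + dW) @ x, W @ x + delta))  # True:输出恰好平移 delta
print(np.linalg.norm(dW))                        # 扰动大小等于 ||delta|| / ||x||
```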
[LG-10] Provably Learning Attention with Queries
链接: https://arxiv.org/abs/2601.16873
作者: Satwik Bhattamishra,Kulin Shah,Michael Hahn,Varun Kanade
类目: Machine Learning (cs.LG)
*备注: Preprint
Abstract:We study the problem of learning Transformer-based sequence models with black-box access to their outputs. In this setting, a learner may adaptively query the oracle with any sequence of vectors and observe the corresponding real-valued output. We begin with the simplest case, a single-head softmax-attention regressor. We show that for a model with width d, there is an elementary algorithm to learn the parameters of single-head attention exactly with O(d^2) queries. Further, we show that if there exists an algorithm to learn ReLU feedforward networks (FFNs), then the single-head algorithm can be easily adapted to learn one-layer Transformers with single-head attention. Next, motivated by the regime where the head dimension r \ll d, we provide a randomised algorithm that learns single-head attention-based models with O(rd) queries via compressed sensing arguments. We also study robustness to noisy oracle access, proving that under mild norm and margin conditions, the parameters can be estimated to \varepsilon accuracy with a polynomial number of queries even when outputs are only provided up to additive tolerance. Finally, we show that multi-head attention parameters are not identifiable from value queries in general – distinct parameterisations can induce the same input-output map. Hence, guarantees analogous to the single-head setting are impossible without additional structural assumptions.
[LG-11] The Art of Being Difficult: Combining Human and AI Strengths to Find Adversarial Instances for Heuristics
链接: https://arxiv.org/abs/2601.16849
作者: Henri Nikoleit,Ankit Anand,Anurag Murty Naredla,Heiko Röglin
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:
Abstract:We demonstrate the power of human-LLM collaboration in tackling open problems in theoretical computer science. Focusing on combinatorial optimization, we refine outputs from the FunSearch algorithm [Romera-Paredes et al., Nature 2023] to derive state-of-the-art lower bounds for standard heuristics. Specifically, we target the generation of adversarial instances where these heuristics perform poorly. By iterating on FunSearch’s outputs, we identify improved constructions for hierarchical k-median clustering, bin packing, the knapsack problem, and a generalization of Lovász’s gasoline problem - some of these have not seen much improvement for over a decade, despite intermittent attention. These results illustrate how expert oversight can effectively extrapolate algorithmic insights from LLM-based evolutionary methods to break long-standing barriers. Our findings demonstrate that while LLMs provide critical initial patterns, human expertise is essential for transforming these patterns into mathematically rigorous and insightful constructions. This work highlights that LLMs are a strong collaborative tool in mathematics and computer science research.
[LG-12] Sample-wise Constrained Learning via a Sequential Penalty Approach with Applications in Image Processing
链接: https://arxiv.org/abs/2601.16812
作者: Francesca Lanzillotta,Chiara Albisani,Davide Pucci,Daniele Baracchi,Alessandro Piva,Matteo Lapucci
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Optimization and Control (math.OC)
*备注:
Abstract:In many learning tasks, certain requirements on the processing of individual data samples should arguably be formalized as strict constraints in the underlying optimization problem, rather than by means of arbitrary penalties. We show that, in these scenarios, learning can be carried out exploiting a sequential penalty method that allows to properly deal with constraints. The proposed algorithm is shown to possess convergence guarantees under assumptions that are reasonable in deep learning scenarios. Moreover, the results of experiments on image processing tasks show that the method is indeed viable to be used in practice.
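序贯罚函数法的骨架是:把约束违反量以罚项形式加入目标,在外层循环中逐步增大罚系数,每一轮近似求解一次带罚子问题。下面用一个简单的可微优化问题给出 PyTorch 示意(目标、约束与罚系数日程均为随意设定),只为说明该机制,并非论文在深度网络与图像任务上的训练流程。

```python
import torch

x = torch.nn.Parameter(torch.tensor([3.0, 0.0]))
f = lambda x: ((x - torch.tensor([0.0, 2.0])) ** 2).sum()   # 原始训练目标
g = lambda x: x[0] + x[1] - 1.0                              # 约束 g(x) <= 0

mu = 1.0
for outer in range(6):                           # 外层:逐步增大罚系数
    lr = 0.5 / (4 * mu + 2)                      # 步长随罚系数缩小,保持内层迭代稳定
    opt = torch.optim.SGD([x], lr=lr)
    for _ in range(300):                         # 内层:近似求解带罚子问题
        opt.zero_grad()
        loss = f(x) + mu * torch.clamp(g(x), min=0.0) ** 2
        loss.backward()
        opt.step()
    mu *= 4.0
print(x.data, g(x).item())    # 逼近约束最优点 (-0.5, 1.5),约束违反量趋近 0
```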
[LG-13] Building a Robust Risk-Based Access Control System to Combat Ransomwares Capability to Encrypt: A Machine Learning Approach
链接: https://arxiv.org/abs/2601.16795
作者: Kenan Begovic,Abdulaziz Al-Ali,Qutaibah Malluhi
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE for possible publication
Abstract:Ransomware core capability, unauthorized encryption, demands controls that identify and block malicious cryptographic activity without disrupting legitimate use. We present a probabilistic, risk-based access control architecture that couples machine learning inference with mandatory access control to regulate encryption on Linux in real time. The system builds a specialized dataset from the native ftrace framework using the function_graph tracer, yielding high-resolution kernel-function execution traces augmented with resource and I/O counters. These traces support both a supervised classifier and interpretable rules that drive an SELinux policy via lightweight booleans, enabling context-sensitive permit/deny decisions at the moment encryption begins. Compared to approaches centered on sandboxing, hypervisor introspection, or coarse system-call telemetry, the function-level tracing we adopt provides finer behavioral granularity than syscall-only telemetry while avoiding the virtualization/VMI overhead of sandbox-based approaches. Our current user-space prototype has a non-trivial footprint under burst I/O; we quantify it and recognize that a production kernel-space solution should aim to address this. We detail dataset construction, model training and rule extraction, and the run-time integration that gates file writes for suspect encryption while preserving benign cryptographic workflows. During evaluation, the two-layer composition retains model-level detection quality while delivering rule-like responsiveness; we also quantify operational footprint and outline engineering steps to reduce CPU and memory overhead for enterprise deployment. The result is a practical path from behavioral tracing and learning to enforceable, explainable, and risk-proportionate encryption control on production Linux systems.
[LG-14] ReLU Networks for Model Predictive Control: Network Complexity and Performance Guarantees
链接: https://arxiv.org/abs/2601.16764
作者: Xingchen Li,Keyou You
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Recent years have witnessed a resurgence in using ReLU neural networks (NNs) to represent model predictive control (MPC) policies. However, determining the required network complexity to ensure closed-loop performance remains a fundamental open problem. This involves a critical precision-complexity trade-off: undersized networks may fail to capture the MPC policy, while oversized ones may outweigh the benefits of ReLU network approximation. In this work, we propose a projection-based method to enforce hard constraints and establish a state-dependent Lipschitz continuity property for the optimal MPC cost function, which enables sharp convergence analysis of the closed-loop system. For the first time, we derive explicit bounds on ReLU network width and depth for approximating MPC policies with guaranteed closed-loop performance. To further reduce network complexity and enhance closed-loop performance, we propose a non-uniform error framework with a state-aware scaling function to adaptively adjust both the input and output of the ReLU network. Our contributions provide a foundational step toward certifiable ReLU NN-based MPC.
[LG-15] A Feature Extraction Pipeline for Enhancing Lightweight Neural Networks in sEMG-based Joint Torque Estimation
链接: https://arxiv.org/abs/2601.16712
作者: Kartik Chari,Raid Dokhan,Anas Homsi,Niklas Kueper,Elsa Andrea Kirchner
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Robot-assisted rehabilitation offers an effective approach, wherein exoskeletons adapt to users’ needs and provide personalized assistance. However, to deliver such assistance, accurate prediction of the user’s joint torques is essential. In this work, we propose a feature extraction pipeline using 8-channel surface electromyography (sEMG) signals to predict elbow and shoulder joint torques. For preliminary evaluation, this pipeline was integrated into two neural network models: the Multilayer Perceptron (MLP) and the Temporal Convolutional Network (TCN). Data were collected from a single subject performing elbow and shoulder movements under three load conditions (0 kg, 1.10 kg, and 1.85 kg) using three motion-capture cameras. Reference torques were estimated from center-of-mass kinematics under the assumption of static equilibrium. Our offline analyses showed that, with our feature extraction pipeline, the MLP model achieved mean RMSE of 0.963 N m, 1.403 N m, and 1.434 N m (over five seeds) for elbow, front-shoulder, and side-shoulder joints, respectively, which were comparable to the TCN performance. These results demonstrate that the proposed feature extraction pipeline enables a simple MLP to achieve performance comparable to that of a network designed explicitly for temporal dependencies. This finding is particularly relevant for applications with limited training data, a common scenario in patient care.
[LG-16] I Guess That's Why They Call it the Blues: Causal Analysis for Audio Classifiers
链接: https://arxiv.org/abs/2601.16675
作者: David A. Kelly,Hana Chockler
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:
Abstract:It is well-known that audio classifiers often rely on non-musically relevant features and spurious correlations to classify audio. Hence audio classifiers are easy to manipulate or confuse, resulting in wrong classifications. While inducing a misclassification is not hard, until now the set of features that the classifiers rely on was not well understood. In this paper we introduce a new method that uses causal reasoning to discover features of the frequency space that are sufficient and necessary for a given classification. We describe an implementation of this algorithm in the tool FreqReX and provide experimental results on a number of standard benchmark datasets. Our experiments show that causally sufficient and necessary subsets allow us to manipulate the outputs of the models in a variety of ways by changing the input very slightly. Namely, a change to one out of 240,000 frequencies results in a change in classification 58% of the time, and the change can be so small that it is practically inaudible. These results show that causal analysis is useful for understanding the reasoning process of audio classifiers and can be used to successfully manipulate their outputs.
[LG-17] Predicting Startup Success Using Large Language Models: A Novel In-Context Learning Approach
链接: https://arxiv.org/abs/2601.16568
作者: Abdurahman Maarouf,Alket Bakiaj,Stefan Feuerriegel
类目: Machine Learning (cs.LG)
*备注:
Abstract:Venture capital (VC) investments in early-stage startups that end up being successful can yield high returns. However, predicting early-stage startup success remains challenging due to data scarcity (e.g., many VC firms have information about only a few dozen of early-stage startups and whether they were successful). This limits the effectiveness of traditional machine learning methods that rely on large labeled datasets for model training. To address this challenge, we propose an in-context learning framework for startup success prediction using large language models (LLMs) that requires no model training and leverages only a small set of labeled startups as demonstration examples. Specifically, we propose a novel k-nearest-neighbor-based in-context learning framework, called kNN-ICL, which selects the most relevant past startups as examples based on similarity. Using real-world profiles from Crunchbase, we find that the kNN-ICL approach achieves higher prediction accuracy than supervised machine learning baselines and vanilla in-context learning. Further, we study how performance varies with the number of in-context examples and find that a high balanced accuracy can be achieved with as few as 50 examples. Together, we demonstrate that in-context learning can serve as a decision-making tool for VC firms operating in data-scarce environments.
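kNN-ICL 的两步核心是:先按嵌入相似度检索与查询最相近的 k 个已标注样本,再把它们拼成提示交给 LLM。下面是一个 Python 最小示意,其中嵌入为随机向量、提示模板为虚构,仅展示检索与拼接的流程,并非论文处理 Crunchbase 档案的实际管线。

```python
import numpy as np

def knn_select(query_emb, example_embs, k=5):
    # 余弦相似度最高的 k 个已标注样本的索引
    sims = example_embs @ query_emb / (
        np.linalg.norm(example_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    return np.argsort(-sims)[:k]

def build_prompt(query_profile, examples):
    lines = ["Decide whether the startup will be successful (yes/no)."]
    for profile, label in examples:
        lines.append(f"Startup: {profile}\nSuccessful: {'yes' if label else 'no'}")
    lines.append(f"Startup: {query_profile}\nSuccessful:")
    return "\n\n".join(lines)

# 用法示意(嵌入为随机向量,仅作演示)
rng = np.random.default_rng(0)
bank = [(f"startup profile {i}", bool(i % 2)) for i in range(50)]   # (公司档案, 是否成功)
bank_embs = rng.standard_normal((50, 32))
q_emb = rng.standard_normal(32)
idx = knn_select(q_emb, bank_embs, k=3)
print(build_prompt("AI drug-discovery startup, seed stage", [bank[i] for i in idx]))
```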
[LG-18] DANCE: Dynamic Available Neighbor-gated Condensation for Federated Text-Attributed Graphs
链接: https://arxiv.org/abs/2601.16519
作者: Zekai Chen,Haodong Lu,Xunkai Li,Henan Sun,Jia Li,Hongchao Qin,Rong-Hua Li,Guoren Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Federated graph learning (FGL) enables collaborative training on graph data across multiple clients. With the rise of large language models (LLMs), textual attributes in FGL graphs are gaining attention. Text-attributed graph federated learning (TAG-FGL) improves FGL by explicitly leveraging LLMs to process and integrate these textual features. However, current TAG-FGL methods face three main challenges: (1) Overhead. LLMs for processing long texts incur high token and computation costs. To make TAG-FGL practical, we introduce graph condensation (GC) to reduce computation load, but this choice also brings new issues. (2) Suboptimal. To reduce LLM overhead, we introduce GC into TAG-FGL by compressing multi-hop texts/neighborhoods into a condensed core with fixed LLM surrogates. However, this one-shot condensation is often not client-adaptive, leading to suboptimal performance. (3) Interpretability. LLM-based condensation further introduces a black-box bottleneck: summaries lack faithful attribution and clear grounding to specific source spans, making local inspection and auditing difficult. To address the above issues, we propose DANCE, a new TAG-FGL paradigm with GC. To improve suboptimal performance, DANCE performs round-wise, model-in-the-loop condensation refresh using the latest global model. To enhance interpretability, DANCE preserves provenance by storing locally inspectable evidence packs that trace predictions to selected neighbors and source text spans. Across 8 TAG datasets, DANCE improves accuracy by 2.33% at an 8% condensation ratio, with 33.42% fewer tokens than baselines.
[LG-19] Rethinking Large Language Models For Irregular Time Series Classification In Critical Care
链接: https://arxiv.org/abs/2601.16516
作者: Feixiang Zheng,Yu Wu,Cecilia Mascolo,Ting Dang
类目: Machine Learning (cs.LG)
*备注: 5 pages, 3 figures
Abstract:Time series data from the Intensive Care Unit (ICU) provides critical information for patient monitoring. While recent advancements in applying Large Language Models (LLMs) to time series modeling (TSM) have shown great promise, their effectiveness on the irregular ICU data, characterized by particularly high rates of missing values, remains largely unexplored. This work investigates two key components underlying the success of LLMs for TSM: the time series encoder and the multimodal alignment strategy. To this end, we establish a systematic testbed to evaluate their impact across various state-of-the-art LLM-based methods on benchmark ICU datasets against strong supervised and self-supervised baselines. Results reveal that the encoder design is more critical than the alignment strategy. Encoders that explicitly model irregularity achieve substantial performance gains, yielding an average AUPRC increase of 12.8% over the vanilla Transformer. While less impactful, the alignment strategy is also noteworthy, with the best-performing semantically rich, fusion-based strategy achieving a modest 2.9% improvement over cross-attention. However, LLM-based methods require at least 10× longer training than the best-performing irregular supervised models, while delivering only comparable performance. They also underperform in data-scarce few-shot learning settings. These findings highlight both the promise and current limitations of LLMs for irregular ICU time series. The code is available at this https URL.
[LG-20] Learning to Optimize by Differentiable Programming
链接: https://arxiv.org/abs/2601.16510
作者: Liping Tao,Xindi Tong,Chee Wei Tan
类目: Mathematical Software (cs.MS); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Solving massive-scale optimization problems requires scalable first-order methods with low per-iteration cost. This tutorial highlights a shift in optimization: using differentiable programming not only to execute algorithms but to learn how to design them. Modern frameworks such as PyTorch, TensorFlow, and JAX enable this paradigm through efficient automatic differentiation. Embedding first-order methods within these systems allows end-to-end training that improves convergence and solution quality. Guided by Fenchel-Rockafellar duality, the tutorial demonstrates how duality-informed iterative schemes such as ADMM and PDHG can be learned and adapted. Case studies across LP, OPF, Laplacian regularization, and neural network verification illustrate these gains.
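A minimal sketch of the "learning to optimize by differentiable programming" idea the tutorial describes: unroll a fixed number of iterations with learnable parameters and train them end-to-end on the final objective. The unrolled solver below is plain gradient descent with learnable per-step sizes on random least-squares instances, an illustrative stand-in rather than the tutorial's duality-informed ADMM/PDHG schemes.

```python
# Unroll K gradient steps with learnable step sizes; train the step sizes
# end-to-end so that the final least-squares objective is minimized.
import torch

K, n, d = 10, 32, 8
steps = torch.nn.Parameter(0.1 * torch.ones(K))   # learnable per-step sizes
opt = torch.optim.Adam([steps], lr=1e-2)

def unrolled_solve(A, b):
    x = torch.zeros(A.shape[1])
    for k in range(K):                       # differentiable solver loop
        grad = A.T @ (A @ x - b)
        x = x - steps[k] * grad
    return x

for it in range(200):                        # meta-train over random problems
    A = torch.randn(n, d) / n ** 0.5
    b = torch.randn(n)
    x = unrolled_solve(A, b)
    loss = 0.5 * ((A @ x - b) ** 2).sum()    # final objective drives the update
    opt.zero_grad()
    loss.backward()
    opt.step()

print("learned step sizes:", steps.data)
```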
[LG-21] BoostFGL: Boosting Fairness in Federated Graph Learning
链接: https://arxiv.org/abs/2601.16496
作者: Zekai Chen,Kairui Yang,Xunkai Li,Henan Sun,Zhihan Zhang,Jia Li,Qiangqiang Dai,Rong-Hua Li,Guoren Wang
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:
Abstract:Federated graph learning (FGL) enables collaborative training of graph neural networks (GNNs) across decentralized subgraphs without exposing raw data. While existing FGL methods often achieve high overall accuracy, we show that this average performance can conceal severe degradation on disadvantaged node groups. From a fairness perspective, these disparities arise systematically from three coupled sources: label skew toward majority patterns, topology confounding in message propagation, and aggregation dilution of updates from hard clients. To address this, we propose BoostFGL, a boosting-style framework for fairness-aware FGL. BoostFGL introduces three coordinated mechanisms: (1) Client-side node boosting, which reshapes local training signals to emphasize systematically under-served nodes; (2) Client-side topology boosting, which reallocates propagation emphasis toward reliable yet underused structures and attenuates misleading neighborhoods; and (3) Server-side model boosting, which performs difficulty- and reliability-aware aggregation to preserve informative updates from hard clients while stabilizing the global model. Extensive experiments on 9 datasets show that BoostFGL delivers substantial fairness gains, improving Overall-F1 by 8.43%, while preserving competitive overall performance against strong FGL baselines.
[LG-22] Robust Categorical Data Clustering Guided by Multi-Granular Competitive Learning
链接: https://arxiv.org/abs/2601.16491
作者: Shenghong Cai,Yiqun Zhang,Xiaopeng Luo,Yiu-Ming Cheung,Hong Jia,Peng Liu
类目: Machine Learning (cs.LG)
*备注: This paper has been published in the IEEE International Conference on Distributed Computing Systems (ICDCS 2024)
Abstract:Data set composed of categorical features is very common in big data analysis tasks. Since categorical features are usually with a limited number of qualitative possible values, the nested granular cluster effect is prevalent in the implicit discrete distance space of categorical data. That is, data objects frequently overlap in space or subspace to form small compact clusters, and similar small clusters often form larger clusters. However, the distance space cannot be well-defined like the Euclidean distance due to the qualitative categorical data values, which brings great challenges to the cluster analysis of categorical data. In view of this, we design a Multi-Granular Competitive Penalization Learning (MGCPL) algorithm to allow potential clusters to interactively tune themselves and converge in stages with different numbers of naturally compact clusters. To leverage MGCPL, we also propose a Cluster Aggregation strategy based on MGCPL Encoding (CAME) to first encode the data objects according to the learned multi-granular distributions, and then perform final clustering on the embeddings. It turns out that the proposed MGCPL-guided Categorical Data Clustering (MCDC) approach is competent in automatically exploring the nested distribution of multi-granular clusters and highly robust to categorical data sets from various domains. Benefiting from its linear time complexity, MCDC is scalable to large-scale data sets and promising in pre-partitioning data sets or compute nodes for boosting distributed computing. Extensive experiments with statistical evidence demonstrate its superiority compared to state-of-the-art counterparts on various real public data sets.
[LG-23] A Cautionary Tale of Self-Supervised Learning for Imaging Biomarkers: Alzheimer's Disease Case Study
链接: https://arxiv.org/abs/2601.16467
作者: Maxwell Reynolds,Chaitanya Srinivasan,Vijay Cherupally,Michael Leone,Ke Yu,Li Sun,Tigmanshu Chaudhary,Andreas Pfenning,Kayhan Batmanghelich
类目: Machine Learning (cs.LG)
*备注:
Abstract:Discovery of sensitive and biologically grounded biomarkers is essential for early detection and monitoring of Alzheimer’s disease (AD). Structural MRI is widely available but typically relies on hand-crafted features such as cortical thickness or volume. We ask whether self-supervised learning (SSL) can uncover more powerful biomarkers from the same data. Existing SSL methods underperform FreeSurfer-derived features in disease classification, conversion prediction, and amyloid status prediction. We introduce Residual Noise Contrastive Estimation (R-NCE), a new SSL framework that integrates auxiliary FreeSurfer features while maximizing additional augmentation-invariant information. R-NCE outperforms traditional features and existing SSL methods across multiple benchmarks, including AD conversion prediction. To assess biological relevance, we derive Brain Age Gap (BAG) measures and perform genome-wide association studies. R-NCE-BAG shows high heritability and associations with MAPT and IRAG1, with enrichment in astrocytes and oligodendrocytes, indicating sensitivity to neurodegenerative and cerebrovascular processes.
[LG-24] On the Effects of Adversarial Perturbations on Distribution Robustness
链接: https://arxiv.org/abs/2601.16464
作者: Yipei Wang,Zhaoying Pan,Xiaoqian Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Adversarial robustness refers to a model’s ability to resist perturbation of inputs, while distribution robustness evaluates the performance of the model under data shifts. Although both aim to ensure reliable performance, prior work has revealed a tradeoff in distribution and adversarial robustness. Specifically, adversarial training might increase reliance on spurious features, which can harm distribution robustness, especially the performance on some underrepresented subgroups. We present a theoretical analysis of adversarial and distribution robustness that provides a tractable surrogate for per-step adversarial training by studying models trained on perturbed data. In addition to the tradeoff, our work further identified a nuanced phenomenon that \ell_\infty perturbations on data with moderate bias can yield an increase in distribution robustness. Moreover, the gain in distribution robustness remains on highly skewed data when simplicity bias induces reliance on the core feature, characterized as greater feature separability. Our theoretical analysis extends the understanding of the tradeoff by highlighting the interplay of the tradeoff and the feature separability. Despite the tradeoff that persists in many cases, overlooking the role of feature separability may lead to misleading conclusions about robustness.
[LG-25] On the Expressive Power of Floating-Point Transformers
链接: https://arxiv.org/abs/2601.16450
作者: Sejun Park,Yeachan Park,Geonho Hwang
类目: Machine Learning (cs.LG)
*备注:
Abstract:The study on the expressive power of transformers shows that transformers are permutation equivariant, and they can approximate all permutation-equivariant continuous functions on a compact domain. However, these results are derived under real parameters and exact operations, while real implementations on computers can only use a finite set of numbers and inexact machine operations with round-off errors. In this work, we investigate the representability of floating-point transformers that use floating-point parameters and floating-point operations. Unlike existing results under exact operations, we first show that floating-point transformers can represent a class of non-permutation-equivariant functions even without positional encoding. Furthermore, we prove that floating-point transformers can represent all permutation-equivariant functions when the sequence length is bounded, but they cannot when the sequence length is large. We also found the minimal equivariance structure in floating-point transformers, and show that all non-trivial additive positional encoding can harm the representability of floating-point transformers.
[LG-26] Brownian ReLU (Br-ReLU): A New Activation Function for a Long-Short Term Memory (LSTM) Network
链接: https://arxiv.org/abs/2601.16446
作者: George Awiakye-Marfo,Elijah Agbosu,Victoria Mawuena Barns,Samuel Asante Gyamerah
类目: Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注: 13 pages, 7 figures, 6 tables
Abstract:Deep learning models are effective for sequential data modeling, yet commonly used activation functions such as ReLU, LeakyReLU, and PReLU often exhibit gradient instability when applied to noisy, non-stationary financial time series. This study introduces BrownianReLU, a stochastic activation function induced by Brownian motion that enhances gradient propagation and learning stability in Long Short-Term Memory (LSTM) networks. Using Monte Carlo simulation, BrownianReLU provides a smooth, adaptive response for negative inputs, mitigating the dying ReLU problem. The proposed activation is evaluated on financial time series from Apple, GCB, and the S&P 500, as well as LendingClub loan data for classification. Results show consistently lower Mean Squared Error and higher $R^2$ values, indicating improved predictive accuracy and generalization. Although the ROC-AUC metric is limited in classification tasks, activation choice significantly affects the trade-off between accuracy and sensitivity, with Brownian ReLU and the selected activation functions yielding practically meaningful performance.
[LG-27] Safe Multitask Molecular Graph Networks for Vapor Pressure and Odor Threshold Prediction
链接: https://arxiv.org/abs/2601.16426
作者: Shuang Wu,Meijie Wang,Lun Yu
类目: Machine Learning (cs.LG)
*备注:
Abstract:We investigate two important tasks in odor-related property modeling: Vapor Pressure (VP) and Odor Threshold (OP). To evaluate the model's out-of-distribution (OOD) capability, we adopt the Bemis-Murcko scaffold split. In terms of features, we introduce the rich A20/E17 molecular graph features (20-dimensional atom features + 17-dimensional bond features) and systematically compare GINE and PNA backbones. The results show: for VP, PNA with a simple regression head achieves Val MSE $\approx$ 0.21 (normalized space); for the OP single task under the same scaffold split, using A20/E17 with robust training (Huber/winsor) achieves Val MSE $\approx$ 0.60-0.61. For multitask training, we propose a "safe multitask" approach: VP as the primary task and OP as the auxiliary task, using delayed activation + gradient clipping + small weight, which avoids harming the primary task and simultaneously yields the best VP generalization performance. This paper provides complete reproducible experiments, ablation studies, and error-similarity analysis while discussing the impact of data noise and method limitations.
[LG-28] Bayesian Experimental Design for Model Discrepancy Calibration: A Rivalry between Kullback–Leibler Divergence and Wasserstein Distance
链接: https://arxiv.org/abs/2601.16425
作者: Huchen Yang,Xinghao Dong,Jin-Long Wu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Designing experiments that systematically gather data from complex physical systems is central to accelerating scientific discovery. While Bayesian experimental design (BED) provides a principled, information-based framework that integrates experimental planning with probabilistic inference, the selection of utility functions in BED is a long-standing and active topic, where different criteria emphasize different notions of information. Although Kullback–Leibler (KL) divergence has been one of the most common choices, recent studies have proposed Wasserstein distance as an alternative. In this work, we first employ a toy example to illustrate an issue of Wasserstein distance - the value of Wasserstein distance of a fixed-shape posterior depends on the relative position of its main mass within the support and can exhibit false rewards unrelated to information gain, especially with a non-informative prior (e.g., uniform distribution). We then further provide a systematic comparison between these two criteria through a classical source inversion problem in the BED literature, revealing that the KL divergence tends to lead to faster convergence in the absence of model discrepancy, while Wasserstein metrics provide more robust sequential BED results if model discrepancy is non-negligible. These findings clarify the trade-offs between KL divergence and Wasserstein metrics for the utility function and provide guidelines for selecting suitable criteria in practical BED applications.
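A small worked illustration of the location-sensitivity issue described in this abstract: two posteriors with the same shape but different locations inside a uniform prior's support have (essentially) identical KL divergence to the prior, yet different Wasserstein-1 distances. The grid, bandwidth, and centers below are arbitrary choices for the demonstration, not the paper's source-inversion setup.

```python
# KL vs W1 on a fixed-shape posterior placed at different positions inside
# the support of a uniform prior on [0, 1], discretized on a grid.
import numpy as np
from scipy.stats import norm, wasserstein_distance

x = np.linspace(0.0, 1.0, 2001)
dx = x[1] - x[0]
prior = np.ones_like(x)                        # uniform prior density on [0, 1]

def metrics(center, sigma=0.03):
    post = norm.pdf(x, loc=center, scale=sigma)
    post /= post.sum() * dx                    # renormalize on the grid
    kl = np.sum(post * np.log(post / prior + 1e-300)) * dx
    w1 = wasserstein_distance(x, x, u_weights=post, v_weights=prior)
    return kl, w1

for c in (0.5, 0.2):
    kl, w1 = metrics(c)
    print(f"posterior centered at {c}: KL = {kl:.3f}, W1 = {w1:.3f}")
# KL is essentially identical for both centers, while W1 changes with the
# posterior's position -- the "false reward" unrelated to information gain.
```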
[LG-29] Tight Regret Bounds for Bilateral Trade under Semi Feedback
链接: https://arxiv.org/abs/2601.16412
作者: Yaonan Jin
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:
Abstract:The study of regret minimization in fixed-price bilateral trade has received considerable attention in recent research. Previous works [CCC+24a, CCC+24b, AFF24, BCCF24, CJLZ25, LCM25a, GDFS25] have acquired a thorough understanding of the problem, except for determining the tight regret bound for GBB semi-feedback fixed-price mechanisms under adversarial values. In this paper, we resolve this open question by devising an $\widetilde{O}(T^{2/3})$-regret mechanism, matching the $\Omega(T^{2/3})$ lower bound from [CJLZ25] up to polylogarithmic factors.
[LG-30] A Refinement of Vapnik–Chervonenkis Theorem
链接: https://arxiv.org/abs/2601.16411
作者: A. Iosevich,A. Vagharshakyan,E. Wyman
类目: Machine Learning (cs.LG); Classical Analysis and ODEs (math.CA); Probability (math.PR)
*备注:
Abstract:Vapnik–Chervonenkis' theorem is a seminal result in machine learning. It establishes sufficient conditions for empirical probabilities to converge to theoretical probabilities, uniformly over families of events. It also provides an estimate for the rate of such uniform convergence. We revisit the probabilistic component of the classical argument. Instead of applying Hoeffding's inequality at the final step, we use a normal approximation with explicit Berry–Esseen error control. This yields a moderate-deviation sharpening of the usual VC estimate, with an additional factor of order $(\varepsilon\sqrt{n})^{-1}$ in the leading exponential term when $\varepsilon\sqrt{n}$ is large.
[LG-31] Reinforcement Learning-Based Energy-Aware Coverage Path Planning for Precision Agriculture
链接: https://arxiv.org/abs/2601.16405
作者: Beining Wu,Zihao Ding,Leo Ostigaard,Jun Huang
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted by RACS '25: International Conference on Research in Adaptive and Convergent Systems, November 16-19, 2025, Ho Chi Minh, Vietnam. 10 pages, 5 figures
Abstract:Coverage Path Planning (CPP) is a fundamental capability for agricultural robots; however, existing solutions often overlook energy constraints, resulting in incomplete operations in large-scale or resource-limited environments. This paper proposes an energy-aware CPP framework grounded in Soft Actor-Critic (SAC) reinforcement learning, designed for grid-based environments with obstacles and charging stations. To enable robust and adaptive decision-making under energy limitations, the framework integrates Convolutional Neural Networks (CNNs) for spatial feature extraction and Long Short-Term Memory (LSTM) networks for temporal dynamics. A dedicated reward function is designed to jointly optimize coverage efficiency, energy consumption, and return-to-base constraints. Experimental results demonstrate that the proposed approach consistently achieves over 90% coverage while ensuring energy safety, outperforming traditional heuristic algorithms such as Rapidly-exploring Random Tree (RRT), Particle Swarm Optimization (PSO), and Ant Colony Optimization (ACO) baselines by 13.4-19.5% in coverage and reducing constraint violations by 59.9-88.3%. These findings validate the proposed SAC-based framework as an effective and scalable solution for energy-constrained CPP in agricultural robotics.
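A minimal sketch of the CNN + LSTM feature extractor described above: a CNN encodes the grid map (e.g., coverage, obstacles, and charging stations as channels), an LSTM tracks temporal context, and heads output action logits and a value estimate. All layer sizes and the 3-channel, 16x16 observation are illustrative assumptions; the SAC machinery (twin critics, entropy temperature, reward shaping) is omitted.

```python
# Hypothetical CNN + LSTM backbone for a grid-based coverage-planning agent.
import torch
import torch.nn as nn

class GridPolicy(nn.Module):
    def __init__(self, in_channels=3, grid=16, hidden=128, n_actions=4):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.Flatten())
        self.lstm = nn.LSTM(32 * grid * grid, hidden, batch_first=True)
        self.pi = nn.Linear(hidden, n_actions)   # action logits
        self.v = nn.Linear(hidden, 1)            # value head

    def forward(self, obs_seq, state=None):
        # obs_seq: (batch, time, channels, H, W)
        b, t = obs_seq.shape[:2]
        feats = self.cnn(obs_seq.flatten(0, 1)).view(b, t, -1)
        out, state = self.lstm(feats, state)
        return self.pi(out), self.v(out), state

policy = GridPolicy()
logits, value, _ = policy(torch.randn(2, 5, 3, 16, 16))
print(logits.shape, value.shape)   # (2, 5, 4) and (2, 5, 1)
```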
[LG-32] Towards a Theoretical Understanding to the Generalization of RLHF
链接: https://arxiv.org/abs/2601.16403
作者: Zhaochun Li(1,2),Mingyang Yi(3),Yue Wang(2),Shisheng Cui(1),Yong Liu(3) ((1) Beijing Institute of Technology, (2) Zhongguancun Academy, (3) Renmin University of China)
类目: Machine Learning (cs.LG)
*备注: 31 pages, 6 figures
Abstract:Reinforcement Learning from Human Feedback (RLHF) and its variants have emerged as the dominant approaches for aligning Large Language Models with human intent. While empirically effective, the theoretical generalization properties of these methods in high-dimensional settings remain to be explored. To this end, we build the generalization theory on RLHF of LLMs under the linear reward model, through the framework of algorithmic stability. In contrast to the existing works built upon the consistency of maximum likelihood estimations on the reward model, our analysis is presented under an end-to-end learning framework, which is consistent with practice. Concretely, we prove that under a key feature coverage condition, the empirical optima of the policy model have a generalization bound of order $\mathcal{O}(n^{-1/2})$. Moreover, the results can be extrapolated to parameters obtained by gradient-based learning algorithms, i.e., Gradient Ascent (GA) and Stochastic Gradient Ascent (SGA). Thus, we argue that our results provide new theoretical evidence for the empirically observed generalization of LLMs after RLHF.
[LG-33] A Regularized Actor-Critic Algorithm for Bi-Level Reinforcement Learning
链接: https://arxiv.org/abs/2601.16399
作者: Sihan Zeng,Sujay Bhatt,Sumitra Ganesh,Alec Koppel
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:We study a structured bi-level optimization problem where the upper-level objective is a smooth function and the lower-level problem is policy optimization in a Markov decision process (MDP). The upper-level decision variable parameterizes the reward of the lower-level MDP, and the upper-level objective depends on the optimal induced policy. Existing methods for bi-level optimization and RL often require second-order information, impose strong regularization at the lower level, or inefficiently use samples through nested-loop procedures. In this work, we propose a single-loop, first-order actor-critic algorithm that optimizes the bi-level objective via a penalty-based reformulation. We introduce into the lower-level RL objective an attenuating entropy regularization, which enables asymptotically unbiased upper-level hyper-gradient estimation without solving the unregularized RL problem exactly. We establish the finite-time and finite-sample convergence of the proposed algorithm to a stationary point of the original, unregularized bi-level optimization problem through a novel lower-level residual analysis under a special type of Polyak-Lojasiewicz condition. We validate the performance of our method through experiments on a GridWorld goal position problem and on happy tweet generation through reinforcement learning from human feedback (RLHF).
[LG-34] Analyzing Neural Network Information Flow Using Differential Geometry
链接: https://arxiv.org/abs/2601.16366
作者: Shuhang Tan,Jayson Sia,Paul Bogdan,Radoslav Ivanov
类目: Machine Learning (cs.LG); Symbolic Computation (cs.SC)
*备注:
Abstract:This paper provides a fresh view of the neural network (NN) data flow problem, i.e., identifying the NN connections that are most important for the performance of the full model, through the lens of graph theory. Understanding the NN data flow provides a tool for symbolic NN analysis, e.g., robustness analysis or model repair. Unlike the standard approach to NN data flow analysis, which is based on information theory, we employ the notion of graph curvature, specifically Ollivier-Ricci curvature (ORC). The ORC has been successfully used to identify important graph edges in various domains such as road traffic analysis, biological and social networks. In particular, edges with negative ORC are considered bottlenecks and as such are critical to the graph's overall connectivity, whereas positive-ORC edges are not essential. We use this intuition for the case of NNs as well: we 1) construct a graph induced by the NN structure and introduce the notion of neural curvature (NC) based on the ORC; 2) calculate curvatures based on activation patterns for a set of input examples; 3) aim to demonstrate that NC can indeed be used to rank edges according to their importance for the overall NN functionality. We evaluate our method through pruning experiments and show that removing negative-ORC edges quickly degrades the overall NN performance, whereas positive-ORC edges have little impact. The proposed method is evaluated on a variety of models trained on three image datasets, namely MNIST, CIFAR-10 and CIFAR-100. The results indicate that our method can identify a larger number of unimportant edges as compared to state-of-the-art pruning methods.
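A minimal sketch of the Ollivier-Ricci curvature (ORC) that the neural-curvature score builds on: for an edge (x, y), kappa(x, y) = 1 - W1(m_x, m_y)/d(x, y), where m_x spreads mass uniformly over x's neighbors and W1 uses shortest-path costs. It uses the POT package (`ot.emd2`) on a toy networkx graph; constructing the NN-induced graph with activation-based weights is the paper's contribution and is not shown.

```python
# Ollivier-Ricci curvature on a small graph: negative on bottleneck edges,
# positive inside well-connected cliques.
import networkx as nx
import numpy as np
import ot  # POT: Python Optimal Transport

def ollivier_ricci(G, x, y):
    nbx, nby = list(G.neighbors(x)), list(G.neighbors(y))
    mx = np.full(len(nbx), 1.0 / len(nbx))       # uniform measure around x
    my = np.full(len(nby), 1.0 / len(nby))       # uniform measure around y
    # pairwise shortest-path costs between the two neighborhoods
    M = np.array([[nx.shortest_path_length(G, u, v) for v in nby] for u in nbx],
                 dtype=float)
    w1 = ot.emd2(mx, my, M)                      # exact 1-Wasserstein cost
    return 1.0 - w1 / nx.shortest_path_length(G, x, y)

G = nx.barbell_graph(5, 1)                       # two cliques joined by a path
for u, v in [(0, 1), (4, 5)]:                    # clique edge vs bridge edge
    print((u, v), round(ollivier_ricci(G, u, v), 3))
# The bridge edge comes out negative (a bottleneck), the clique edge positive.
```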
[LG-35] Efficient Gaussian process learning via subspace projections ICASSP2026
链接: https://arxiv.org/abs/2601.16332
作者: Felipe Tobar,Elsa Cazelles
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Accepted at IEEE ICASSP 2026
Abstract:We propose a novel training objective for GPs constructed using lower-dimensional linear projections of the data, referred to as projected likelihood (PL). We provide a closed-form expression for the information loss related to the PL and empirically show that it can be reduced with random projections on the unit sphere. We show the superiority of the PL, in terms of accuracy and computational efficiency, over the exact GP training and the variational free energy approach to sparse GPs over different optimisers, kernels and datasets of moderately large sizes.
[LG-36] Student Mental Health Screening via Fitbit Data Collected During the COVID-19 Pandemic
链接: https://arxiv.org/abs/2601.16324
作者: Rebecca Lopez,Avantika Shrestha,ML Tlachac,Kevin Hickey,Xingtong Guo,Shichao Liu,Elke Rundensteiner
类目: Machine Learning (cs.LG)
*备注:
Abstract:College students experience many stressors, resulting in high levels of anxiety and depression. Wearable technology provides unobtrusive sensor data that can be used for the early detection of mental illness. However, current research is limited concerning the variety of psychological instruments administered, physiological modalities, and time series parameters. In this research, we collect the Student Mental and Environmental Health (StudentMEH) Fitbit dataset from students at our institution during the pandemic. We provide a comprehensive assessment of the ability of predictive machine learning models to screen for depression, anxiety, and stress using different Fitbit modalities. Our findings indicate potential in physiological modalities such as heart rate and sleep to screen for mental illness, with F1 scores as high as 0.79 for anxiety, with heart rate reaching 0.77 for stress screening and sleep achieving 0.78 for depression. This research highlights the potential of wearable devices to support continuous mental health monitoring, as well as the importance of identifying the best data aggregation levels and appropriate modalities for screening for different mental ailments.
[LG-37] Kernel smoothing on manifolds
链接: https://arxiv.org/abs/2601.16777
作者: Eunseong Bae,Wolfgang Polonik
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Differential Geometry (math.DG); Machine Learning (stat.ML)
*备注:
Abstract:Under the assumption that data lie on a compact (unknown) manifold without boundary, we derive finite sample bounds for kernel smoothing and its (first and second) derivatives, and we establish asymptotic normality through Berry-Esseen type bounds. Special cases include kernel density estimation, kernel regression and the heat kernel signature. Connections to the graph Laplacian are also discussed.
[LG-38] Efficient Learning of Stationary Diffusions with Stein-type Discrepancies
链接: https://arxiv.org/abs/2601.16597
作者: Fabian Bleile,Sarah Lumpp,Mathias Drton
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:Learning a stationary diffusion amounts to estimating the parameters of a stochastic differential equation whose stationary distribution matches a target distribution. We build on the recently introduced kernel deviation from stationarity (KDS), which enforces stationarity by evaluating expectations of the diffusion's generator in a reproducing kernel Hilbert space. Leveraging the connection between KDS and Stein discrepancies, we introduce the Stein-type KDS (SKDS) as an alternative formulation. We prove that a vanishing SKDS guarantees alignment of the learned diffusion's stationary distribution with the target. Furthermore, under broad parametrizations, SKDS is convex with an empirical version that is $\epsilon$-quasiconvex with high probability. Empirically, learning with SKDS attains comparable accuracy to KDS while substantially reducing computational cost and yields improvements over the majority of competitive baselines.
[LG-39] Learning Successive Interference Cancellation for Low-Complexity Soft-Output MIMO Detection
链接: https://arxiv.org/abs/2601.16586
作者: Benedikt Fesl,Fatih Capar
类目: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
Abstract:Low-complexity multiple-input multiple-output (MIMO) detection remains a key challenge in modern wireless systems, particularly for 5G reduced capability (RedCap) and internet-of-things (IoT) devices. In this context, the growing interest in deploying machine learning on edge devices must be balanced against stringent constraints on computational complexity and memory while supporting high-order modulation. Beyond accurate hard detection, reliable soft information is equally critical, as modern receivers rely on soft-input channel decoding, imposing additional requirements on the detector design. In this work, we propose recurSIC, a lightweight learning-based MIMO detection framework that is structurally inspired by successive interference cancellation (SIC) and incorporates learned processing stages. It generates reliable soft information via multi-path hypothesis tracking with a tunable complexity parameter while requiring only a single forward pass and a minimal parameter count. Numerical results in realistic wireless scenarios show that recurSIC achieves strong hard- and soft-detection performance at very low complexity, making it well suited for edge-constrained MIMO receivers.
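For context, a classical zero-forcing SIC baseline of the kind recurSIC is structurally inspired by: detect the strongest remaining stream, slice it to the constellation, subtract its contribution, and repeat. QPSK, a 4x4 channel, and norm-based ordering are illustrative choices; the learned stages and multi-hypothesis soft-output tracking of the paper are not reproduced here.

```python
# Classical V-BLAST-style zero-forcing SIC detector (hard decisions only).
import numpy as np

def zf_sic_detect(H, y, constellation):
    """H: (n_rx, n_tx) channel, y: (n_rx,) received vector."""
    n_tx = H.shape[1]
    x_hat = np.zeros(n_tx, dtype=complex)
    remaining = list(range(n_tx))
    while remaining:
        Hr = H[:, remaining]
        W = np.linalg.pinv(Hr)                               # zero-forcing filter
        k = int(np.argmin(np.sum(np.abs(W) ** 2, axis=1)))   # best post-ZF SNR
        est = W[k] @ y
        sym = constellation[np.argmin(np.abs(constellation - est))]  # slice
        idx = remaining[k]
        x_hat[idx] = sym
        y = y - H[:, idx] * sym                              # cancel detected stream
        remaining.pop(k)
    return x_hat

rng = np.random.default_rng(1)
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
H = (rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))) / np.sqrt(2)
x = qpsk[rng.integers(0, 4, 4)]
y = H @ x + 0.05 * (rng.standard_normal(4) + 1j * rng.standard_normal(4))
print("sent:    ", x)
print("detected:", zf_sic_detect(H, y, qpsk))
```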
[LG-40] Perfect Clustering for Sparse Directed Stochastic Block Models
链接: https://arxiv.org/abs/2601.16427
作者: Behzad Aalipur,Yichen Qin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注:
Abstract:Exact recovery in stochastic block models (SBMs) is well understood in undirected settings, but remains considerably less developed for directed and sparse networks, particularly when the number of communities diverges. Spectral methods for directed SBMs often lack stability in asymmetric, low-degree regimes, and existing non-spectral approaches focus primarily on undirected or dense settings. We propose a fully non-spectral, two-stage procedure for community detection in sparse directed SBMs with potentially growing numbers of communities. The method first estimates the directed probability matrix using a neighborhood-smoothing scheme tailored to the asymmetric setting, and then applies $K$-means clustering to the estimated rows, thereby avoiding the limitations of eigen- or singular value decompositions in sparse, asymmetric networks. Our main theoretical contribution is a uniform row-wise concentration bound for the smoothed estimator, obtained through new arguments that control asymmetric neighborhoods and separate in- and out-degree effects. These results imply the exact recovery of all community labels with probability tending to one, under mild sparsity and separation conditions that allow both $\gamma_n \to 0$ and $K_n \to \infty$. Simulation studies, including highly directed, sparse, and non-symmetric block structures, demonstrate that the proposed procedure performs reliably in regimes where directed spectral and score-based methods deteriorate. To the best of our knowledge, this provides the first exact recovery guarantee for this class of non-spectral, neighborhood-smoothing methods in the sparse, directed setting.
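A simplified sketch of the two-stage, non-spectral recipe described above: (1) smooth each node's adjacency row by averaging over nodes with similar out/in profiles, (2) run K-means on the smoothed rows. The neighborhood-selection rule, the neighborhood size q, and the toy block matrix are assumptions for illustration; the paper's estimator and tuning differ.

```python
# Toy directed SBM: neighborhood-smoothed probability estimate, then K-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, K = 300, 3
z = rng.integers(0, K, n)                       # ground-truth communities
B = 0.02 * np.ones((K, K)) + 0.08 * np.eye(K)   # block connection probabilities
B[0, 1] += 0.06                                 # add some asymmetry (directed)
A = (rng.random((n, n)) < B[z][:, z]).astype(float)
np.fill_diagonal(A, 0)

profile = np.hstack([A, A.T])                   # out- and in-profiles per node
P_hat = np.zeros_like(A)
q = int(0.1 * n)                                # neighborhood size (assumption)
for i in range(n):
    d = np.linalg.norm(profile - profile[i], axis=1)
    nbrs = np.argsort(d)[:q]                    # most similar nodes (incl. i)
    P_hat[i] = A[nbrs].mean(axis=0)             # smoothed out-row for node i

labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(
    np.hstack([P_hat, P_hat.T]))
print("estimated community sizes:", np.bincount(labels))
```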
[LG-41] Long-Term Probabilistic Forecast of Vegetation Conditions Using Climate Attributes in the Four Corners Region
链接: https://arxiv.org/abs/2601.16347
作者: Erika McPhillips,Hyeongseong Lee,Xiangyu Xie,Kathy Baylis,Chris Funk,Mengyang Gu
类目: Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Weather conditions can drastically alter the state of crops and rangelands, and in turn, impact the incomes and food security of individuals worldwide. Satellite-based remote sensing offers an effective way to monitor vegetation and climate variables on regional and global scales. The annual peak Normalized Difference Vegetation Index (NDVI), derived from satellite observations, is closely associated with crop development, rangeland biomass, and vegetation growth. Although various machine learning methods have been developed to forecast NDVI over short time ranges, such as one-month-ahead predictions, long-term forecasting approaches, such as one-year-ahead predictions of vegetation conditions, are not yet available. To fill this gap, we develop a two-phase machine learning model to forecast the one-year-ahead peak NDVI over high-resolution grids, using the Four Corners region of the Southwestern United States as a testbed. In phase one, we identify informative climate attributes, including precipitation and maximum vapor pressure deficit, and develop the generalized parallel Gaussian process that captures the relationship between climate attributes and NDVI. In phase two, we forecast these climate attributes using historical data at least one year before the NDVI prediction month, which then serve as inputs to forecast the peak NDVI at each spatial grid. We developed open-source tools that outperform alternative methods for both gross NDVI and grid-based NDVI one-year forecasts, providing information that can help farmers and ranchers make actionable plans a year in advance.
[LG-42] Active learning for photonics
链接: https://arxiv.org/abs/2601.16287
作者: Ryan Lopez,Charlotte Loh,Rumen Dangovski,Marin Soljačić
类目: Optics (physics.optics); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
*备注: 6 pages, 5 figures, submitted to Optics Express
Abstract:Active learning for photonic crystals explores the integration of analytic approximate Bayesian last layer neural networks (LL-BNNs) with uncertainty-driven sample selection to accelerate photonic band gap prediction. We employ an analytic LL-BNN formulation, corresponding to the infinite Monte Carlo sample limit, to obtain uncertainty estimates that are strongly correlated with the true predictive error on unlabeled candidate structures. These uncertainty scores drive an active learning strategy that prioritizes the most informative simulations during training. Applied to the task of predicting band gap sizes in two-dimensional, two-tone photonic crystals, our approach achieves up to a 2.6x reduction in required training data compared to a random sampling baseline while maintaining predictive accuracy. The efficiency gains arise from concentrating computational resources on high uncertainty regions of the design space rather than sampling uniformly. Given the substantial cost of full band structure simulations, especially in three dimensions, this data efficiency enables rapid and scalable surrogate modeling. Our results suggest that analytic LL-BNN based active learning can substantially accelerate topological optimization and inverse design workflows for photonic crystals, and more broadly, offers a general framework for data efficient regression across scientific machine learning domains.
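A generic sketch of the uncertainty-driven active-learning loop described above. A scikit-learn Gaussian process stands in for the analytic last-layer BNN surrogate (an assumption; the paper's model differs), and a cheap synthetic function stands in for the band-structure simulator: at each round the candidates with the largest predictive standard deviation are "simulated" and added to the training set.

```python
# Uncertainty-driven active learning with a GP surrogate as a stand-in.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def expensive_simulation(x):                     # stands in for a band-gap solver
    return np.sin(3 * x[:, 0]) * np.cos(2 * x[:, 1])

rng = np.random.default_rng(0)
pool = rng.uniform(-1, 1, size=(500, 2))         # unlabeled candidate designs
labeled_idx = list(rng.choice(len(pool), 10, replace=False))

for rnd in range(5):
    X, y = pool[labeled_idx], expensive_simulation(pool[labeled_idx])
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    _, std = gp.predict(pool, return_std=True)
    std[labeled_idx] = -np.inf                   # never re-select labeled points
    new = np.argsort(std)[-10:]                  # 10 most uncertain candidates
    labeled_idx.extend(new.tolist())
    print(f"round {rnd}: {len(labeled_idx)} labeled, "
          f"max selected std {std[new].max():.3f}")
```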
[LG-43] Distributional Computational Graphs: Error Bounds
链接: https://arxiv.org/abs/2601.16250
作者: Olof Hallqvist Elias,Michael Selby,Phillip Stanley-Marbell
类目: Machine Learning (stat.ML); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR)
*备注: 28 pages, 2 figures
Abstract:We study a general framework of distributional computational graphs: computational graphs whose inputs are probability distributions rather than point values. We analyze the discretization error that arises when these graphs are evaluated using finite approximations of continuous probability distributions. Such an approximation might be the result of representing a continuous real-valued distribution using a discrete representation or from constructing an empirical distribution from samples (or might be the output of another distributional computational graph). We establish non-asymptotic error bounds in terms of the Wasserstein-1 distance, without imposing structural assumptions on the computational graph.
[LG-44] Test-Time Adaptation for Speech Emotion Recognition ICASSP2026
链接: https://arxiv.org/abs/2601.16240
作者: Jiaheng Dong,Hong Jia,Ting Dang
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted by 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)
Abstract:The practical utility of Speech Emotion Recognition (SER) systems is undermined by their fragility to domain shifts, such as speaker variability, the distinction between acted and naturalistic emotions, and cross-corpus variations. While domain adaptation and fine-tuning are widely studied, they require either source data or labelled target data, which are often unavailable or raise privacy concerns in SER. Test-time adaptation (TTA) bridges this gap by adapting models at inference using only unlabeled target data. Yet, having been predominantly designed for image classification and speech recognition, the efficacy of TTA for mitigating the unique domain shifts in SER has not been investigated. In this paper, we present the first systematic evaluation and comparison covering 11 TTA methods across three representative SER tasks. The results indicate that backpropagation-free TTA methods are the most promising. Conversely, entropy minimization and pseudo-labeling generally fail, as their core assumption of a single, confident ground-truth label is incompatible with the inherent ambiguity of emotional expression. Further, no single method universally excels, and its effectiveness is highly dependent on the distributional shifts and tasks.
[LG-45] Latent Causal Diffusions for Single-Cell Perturbation Modeling
链接: https://arxiv.org/abs/2601.15341
作者: Lars Lorch,Jiaqi Zhang,Charlotte Bunne,Andreas Krause,Bernhard Schölkopf,Caroline Uhler
类目: Molecular Networks (q-bio.MN); Machine Learning (cs.LG)
*备注:
Abstract:Perturbation screens hold the potential to systematically map regulatory processes at single-cell resolution, yet modeling and predicting transcriptome-wide responses to perturbations remains a major computational challenge. Existing methods often underperform simple baselines, fail to disentangle measurement noise from biological signal, and provide limited insight into the causal structure governing cellular responses. Here, we present the latent causal diffusion (LCD), a generative model that frames single-cell gene expression as a stationary diffusion process observed under measurement noise. LCD outperforms established approaches in predicting the distributional shifts of unseen perturbation combinations in single-cell RNA-sequencing screens while simultaneously learning a mechanistic dynamical system of gene regulation. To interpret these learned dynamics, we develop an approach we call causal linearization via perturbation responses (CLIPR), which yields an approximation of the direct causal effects between all genes modeled by the diffusion. CLIPR provably identifies causal effects under a linear drift assumption and recovers causal structure in both simulated systems and a genome-wide perturbation screen, where it clusters genes into coherent functional modules and resolves causal relationships that standard differential expression analysis cannot. The LCD-CLIPR framework bridges generative modeling with causal inference to predict unseen perturbation effects and map the underlying regulatory mechanisms of the transcriptome.
信息检索
[IR-0] From Atom to Community: Structured and Evolving Agent Memory for User Behavior Modeling
链接: https://arxiv.org/abs/2601.16872
作者: Yuxin Liao,Le Wu,Min Hou,Yu Wang,Han Wu,Meng Wang
类目: Information Retrieval (cs.IR)
*备注:
Abstract:User behavior modeling lies at the heart of personalized applications like recommender systems. With LLM-based agents, user preference representation has evolved from latent embeddings to semantic memory. While existing memory mechanisms show promise in textual dialogues, modeling non-textual behaviors remains challenging, as preferences must be inferred from implicit signals like clicks without ground truth supervision. Current approaches rely on a single unstructured summary, updated through simple overwriting. However, this is suboptimal: users exhibit multi-faceted interests that get conflated, preferences evolve yet naive overwriting causes forgetting, and sparse individual interactions necessitate collaborative signals. We present STEAM (STructured and Evolving Agent Memory), a novel framework that reimagines how agent memory is organized and updated. STEAM decomposes preferences into atomic memory units, each capturing a distinct interest dimension with explicit links to observed behaviors. To exploit collaborative patterns, STEAM organizes similar memories across users into communities and generates prototype memories for signal propagation. The framework further incorporates adaptive evolution mechanisms, including consolidation for refining memories and formation for capturing emerging interests. Experiments on three real-world datasets demonstrate that STEAM substantially outperforms state-of-the-art baselines in recommendation accuracy, simulation fidelity, and diversity.
[IR-1] Navigating the Shift: A Comparative Analysis of Web Search and Generative AI Response Generation
链接: https://arxiv.org/abs/2601.16858
作者: Mahe Chen,Xiaoxuan Wang,Kaiwen Chen,Nick Koudas
类目: Information Retrieval (cs.IR)
*备注:
Abstract:The rise of generative AI as a primary information source presents a paradigm shift from traditional web search. This paper presents a large-scale empirical study quantifying the fundamental differences between the results returned by Google Search and leading generative AI services. We analyze multiple dimensions, demonstrating that AI-generated answers and web search results diverge significantly in their consulted source domains, the typology of these domains (e.g., earned media vs. owned, social), query intent and the freshness of the information provided. We then investigate the role of LLM pre-training as a key factor shaping these differences, analyzing how this intrinsic knowledge base interacts with and influences real-time web search when enabled. Our findings reveal the distinct mechanics of these two information ecosystems, leading to critical observations on the emergent field of Answer Engine Optimization (AEO) and its contrast with traditional Search Engine Optimization (SEO).
[IR-2] PI2I: A Personalized Item-Based Collaborative Filtering Retrieval Framework WWW’26
链接: https://arxiv.org/abs/2601.16815
作者: Shaoqing Wang,Yingcai Ma,Kairui Fu,Ziyang Wang,Dunxian Huang,Yuliang Yan,Jian Wu
类目: Information Retrieval (cs.IR)
*备注: Published on WWW’26: In Proceedings of the ACM Web Conference 2026
Abstract:Efficiently selecting relevant content from vast candidate pools is a critical challenge in modern recommender systems. Traditional methods, such as item-to-item collaborative filtering (CF) and two-tower models, often fall short in capturing the complex user-item interactions due to uniform truncation strategies and overdue user-item crossing. To address these limitations, we propose Personalized Item-to-Item (PI2I), a novel two-stage retrieval framework that enhances the personalization capabilities of CF. In the first Indexer Building Stage (IBS), we optimize the retrieval pool by relaxing truncation thresholds to maximize Hit Rate, thereby temporarily retaining more items users might be interested in. In the second Personalized Retrieval Stage (PRS), we introduce an interactive scoring model to overcome the limitations of inner product calculations, allowing for richer modeling of intricate user-item interactions. Additionally, we construct negative samples based on the trigger-target (item-to-item) relationship, ensuring consistency between offline training and online inference. Offline experiments on large-scale real-world datasets demonstrate that PI2I outperforms traditional CF methods and rivals Two-Tower models. Deployed in the “Guess You Like” section on Taobao, PI2I achieved a 1.05% increase in online transaction rates. In addition, we have released a large-scale recommendation dataset collected from Taobao, containing 130 million real-world user interactions used in the experiments of this paper. The dataset is publicly available at this https URL, which could serve as a valuable benchmark for the research community.
[IR-3] LLM-powered Real-time Patent Citation Recommendation for Financial Technologies
链接: https://arxiv.org/abs/2601.16775
作者: Tianang Deng,Yu Deng,Tianchen Gao,Yonghong Hu,Rui Pan
类目: Information Retrieval (cs.IR); Applications (stat.AP)
*备注:
Abstract:Rapid financial innovation has been accompanied by a sharp increase in patenting activity, making timely and comprehensive prior-art discovery more difficult. This problem is especially evident in financial technologies, where innovations develop quickly, patent collections grow continuously, and citation recommendation systems must be updated as new applications arrive. Existing patent retrieval and citation recommendation methods typically rely on static indexes or periodic retraining, which limits their ability to operate effectively in such dynamic settings. In this study, we propose a real-time patent citation recommendation framework designed for large and fast-changing financial patent corpora. Using a dataset of 428,843 financial patents granted by the China National Intellectual Property Administration (CNIPA) between 2000 and 2024, we build a three-stage recommendation pipeline. The pipeline uses large language model (LLM) embeddings to represent the semantic content of patent abstracts, applies efficient approximate nearest-neighbor search to construct a manageable candidate set, and ranks candidates by semantic similarity to produce top-k citation recommendations. In addition to improving recommendation accuracy, the proposed framework directly addresses the dynamic nature of patent systems. By using an incremental indexing strategy based on hierarchical navigable small-world (HNSW) graphs, newly issued patents can be added without rebuilding the entire index. A rolling day-by-day update experiment shows that incremental updating improves recall while substantially reducing computational cost compared with rebuild-based indexing. The proposed method also consistently outperforms traditional text-based baselines and alternative nearest-neighbor retrieval approaches.
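A minimal sketch of the incremental HNSW indexing idea described above, using the hnswlib package: a new batch of patents is appended to the existing graph index without rebuilding it, and queries retrieve a candidate set for downstream similarity ranking. The random vectors are stand-ins for the LLM abstract embeddings, and the dimensions and index parameters are illustrative assumptions.

```python
# Incremental HNSW index: add new items without rebuilding, then query.
import numpy as np
import hnswlib

dim, n0 = 384, 10_000
rng = np.random.default_rng(0)
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=50_000, ef_construction=200, M=16)
index.add_items(rng.standard_normal((n0, dim)), np.arange(n0))   # initial corpus

# A new "day" of patents arrives: append to the index without rebuilding it.
new_batch = rng.standard_normal((500, dim))
index.add_items(new_batch, np.arange(n0, n0 + 500))

index.set_ef(100)                                 # search-time accuracy/speed knob
query = rng.standard_normal((1, dim))
labels, distances = index.knn_query(query, k=20)  # candidate set for ranking
print(labels[0][:5], distances[0][:5])
```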
[IR-4] LLM-based Semantic Search for Conversational Queries in E-commerce
链接: https://arxiv.org/abs/2601.16492
作者: Emad Siddiqui,Venkatesh Terikuti,Xuan Lu
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Conversational user queries are increasingly challenging traditional e-commerce platforms, whose search systems are typically optimized for keyword-based queries. We present an LLM-based semantic search framework that effectively captures user intent from conversational queries by combining domain-specific embeddings with structured filters. To address the challenge of limited labeled data, we generate synthetic data using LLMs to guide the fine-tuning of two models: an embedding model that positions semantically similar products close together in the representation space, and a generative model for converting natural language queries into structured constraints. By combining similarity-based retrieval with constraint-based filtering, our framework achieves strong precision and recall across various settings compared to baseline approaches on a real-world dataset.
[IR-5] Segregation Before Polarization: How Recommendation Strategies Shape Echo Chamber Pathways
链接: https://arxiv.org/abs/2601.16457
作者: Junning Zhao,Kazutoshi Sasahara,Yu Chen
类目: Social and Information Networks (cs.SI); Information Retrieval (cs.IR); Physics and Society (physics.soc-ph)
*备注: 13 pages, 5 figures for main text; 7 pages, 6 figures for supplementary materials
Abstract:Social media platforms facilitate echo chambers through feedback loops between user preferences and recommendation algorithms. While algorithmic homogeneity is well-documented, the distinct evolutionary pathways driven by content-based versus link-based recommendations remain unclear. Using an extended dynamic Bounded Confidence Model (BCM), we show that content-based algorithms–unlike their link-based counterparts–steer social networks toward a segregation-before-polarization (SbP) pathway. Along this trajectory, structural segregation precedes opinion divergence, accelerating individual isolation while delaying but ultimately intensifying collective polarization. Furthermore, we reveal a paradox in information sharing: Reposting increases the number of connections in the network, yet it simultaneously reinforces echo chambers because it amplifies small, latent opinion differences that would otherwise remain inconsequential. These findings suggest that mitigating polarization requires stage-dependent algorithmic interventions, shifting from content-centric to structure-centric strategies as networks evolve.
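For reference, a baseline bounded-confidence (Deffuant-style) opinion dynamic of the kind the paper extends: at each step a random pair interacts, and if their opinions differ by less than the confidence bound eps, both move toward each other. The parameter values are arbitrary, and the recommendation layer (content- vs link-based partner selection) that drives the paper's findings is not modeled here.

```python
# Baseline bounded-confidence (Deffuant) opinion dynamics on n agents.
import numpy as np

rng = np.random.default_rng(0)
n, eps, mu, steps = 500, 0.2, 0.5, 200_000
opinions = rng.uniform(0, 1, n)

for _ in range(steps):
    i, j = rng.integers(0, n, 2)
    if i != j and abs(opinions[i] - opinions[j]) < eps:
        shift = mu * (opinions[j] - opinions[i])
        opinions[i] += shift          # both agents move toward each other
        opinions[j] -= shift

# Count opinion clusters (groups separated by more than eps) at the end.
ordered = np.sort(opinions)
clusters = 1 + int(np.sum(np.diff(ordered) > eps))
print("final opinion clusters:", clusters)
```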

