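arXiv Papers Today | 2026-01-27

This post lists the latest papers retrieved from arXiv.org on 2026-01-27. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: the daily paper data is fetched from arXiv.org and updated automatically at around 12:00 each day.

Tip: if you would like to receive the daily paper data by email, leave your email address in the comments.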

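[Quick read]: This paper addresses the lack of interpretability, explorability, and reversibility of text embeddings in biomedical domains such as clinical trials, which limits transparency and rules out potentially valuable generative uses. The key to the solution is a general, trainable Embedding Language Model (ELM) architecture and training framework. With training tasks designed for clinical trials and an expert-validated synthetic dataset, the authors train ctELM, a model that can accurately describe and compare unseen clinical trials from their embeddings alone and generate plausible clinical trial abstracts from novel vectors. The model also responds to concept vectors (such as subject age and sex), demonstrating the potential for controllable generation in embedding space.

Link: https://arxiv.org/abs/2601.18796
Authors: Brian Ondov, Chia-Hsuan Chang, Yujia Zhou, Mauro Giuffrè, Hua Xu
Affiliation: Yale School of Medicine
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: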

Abstract:Text embeddings have become an essential part of a variety of language applications. However, methods for interpreting, exploring and reversing embedding spaces are limited, reducing transparency and precluding potentially valuable generative use cases. In this work, we align Large Language Models to embeddings of clinical trials using the recently reported Embedding Language Model (ELM) method. We develop an open-source, domain-agnostic ELM architecture and training framework, design training tasks for clinical trials, and introduce an expert-validated synthetic dataset. We then train a series of ELMs exploring the impact of tasks and training regimes. Our final model, ctELM, can accurately describe and compare unseen clinical trials from embeddings alone and produce plausible clinical trials from novel vectors. We further show that generated trial abstracts are responsive to moving embeddings along concept vectors for age and sex of study subjects. Our public ELM implementation and experimental results will aid the alignment of Large Language Models to embedding spaces in the biomedical domain and beyond.

[NLP-1] Reuse your FLOPs: Scaling RL on Hard Problems by Conditioning on Very Off-Policy Prefixes

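[Quick read]: This paper addresses the compute wasted when reinforcement learning (RL) is applied to large language model (LLM) reasoning on hard problems, where correct on-policy traces are rare, policy gradients vanish, and learning stalls. The proposed method, PrefixRL, conditions on prefixes of previously sampled, successful off-policy traces and runs on-policy RL to complete the remainder, sidestepping the optimization instabilities of standard off-policy methods. The prefix length modulates problem difficulty. The authors prove that the PrefixRL objective is consistent with the standard RL objective and more sample-efficient, and they observe "back-generalization": training only on prefixed problems improves out-of-distribution performance on unprefixed problems. The method remains effective when the off-policy traces come from a different model family.

Link: https://arxiv.org/abs/2601.18795
Authors: Amrith Setlur, Zijian Wang, Andrew Cohen, Paria Rashidinejad, Sang Michael Xie
Affiliation: Meta
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: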

Abstract:Typical reinforcement learning (RL) methods for LLM reasoning waste compute on hard problems, where correct on-policy traces are rare, policy gradients vanish, and learning stalls. To bootstrap more efficient RL, we consider reusing old sampling FLOPs (from prior inference or RL training) in the form of off-policy traces. Standard off-policy methods supervise against off-policy data, causing instabilities during RL optimization. We introduce PrefixRL, where we condition on the prefix of successful off-policy traces and run on-policy RL to complete them, side-stepping off-policy instabilities. PrefixRL boosts the learning signal on hard problems by modulating the difficulty of the problem through the off-policy prefix length. We prove that the PrefixRL objective is not only consistent with the standard RL objective but also more sample efficient. Empirically, we discover back-generalization: training only on prefixed problems generalizes to out-of-distribution unprefixed performance, with learned strategies often differing from those in the prefix. In our experiments, we source the off-policy traces by rejection sampling with the base model, creating a self-improvement loop. On hard reasoning problems, PrefixRL reaches the same training reward 2x faster than the strongest baseline (SFT on off-policy data then RL), even after accounting for the compute spent on the initial rejection sampling, and increases the final reward by 3x. The gains transfer to held-out benchmarks, and PrefixRL is still effective when off-policy traces are derived from a different model family, validating its flexibility in practical settings.
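The prefix-conditioning idea can be sketched in a few lines. This is a minimal illustration only: the function name `make_prefixed_prompt`, the whitespace tokenization, and the `prefix_fraction` parameter are assumptions made here, not details taken from the paper.

```python
def make_prefixed_prompt(problem, trace_tokens, prefix_fraction):
    """Condition a problem on the first part of a successful off-policy trace.

    Hypothetical helper: a longer prefix reveals more of the known-good
    solution, which makes the remaining on-policy completion easier.
    """
    if not 0.0 <= prefix_fraction <= 1.0:
        raise ValueError("prefix_fraction must be in [0, 1]")
    cut = int(len(trace_tokens) * prefix_fraction)
    if cut == 0:
        return problem  # unprefixed problem: ordinary on-policy RL
    return problem + "\n" + " ".join(trace_tokens[:cut])


trace = ["Let", "x", "=", "2", "so", "x^2", "=", "4"]
print(make_prefixed_prompt("Solve for x^2.", trace, 0.5))
```

In such a setup the policy would generate only the continuation after the prefix, and the reward would be computed on the completed solution; how the paper schedules prefix lengths is not described in this listing.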

[NLP-2] MEGnifying Emotion: Sentiment Analysis from Annotated Brain Data

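[Quick read]: This paper addresses decoding sentiment directly from brain activity, specifically non-invasive magnetoencephalography (MEG) recordings. Existing datasets align brain data with speech or speech transcripts but carry no sentiment annotations. The key to the solution is to label the auditory stimuli (audiobooks) with pre-trained Text-to-Sentiment models, then use forced alignment of text and audio to time-align the sentiment labels with the MEG signals, yielding a dataset for training Brain-to-Sentiment models. Experiments show improved balanced accuracy over the baseline, supporting the approach as a proof of concept for reusing existing MEG datasets to learn sentiment decoding directly from brain signals.

Link: https://arxiv.org/abs/2601.18792
Authors: Brian Liu, Oiwi Parker Jones
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: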

Abstract:Decoding emotion from brain activity could unlock a deeper understanding of the human experience. While a number of existing datasets align brain data with speech and with speech transcripts, no datasets have annotated brain data with sentiment. To bridge this gap, we explore the use of pre-trained Text-to-Sentiment models to annotate non-invasive brain recordings, acquired using magnetoencephalography (MEG), while participants listened to audiobooks. Having annotated the text, we employ force-alignment of the text and audio to align our sentiment labels with the brain recordings. It is straightforward then to train Brain-to-Sentiment models on these data. Experimental results show an improvement in balanced accuracy for Brain-to-Sentiment compared to baseline, supporting the proposed approach as a proof-of-concept for leveraging existing MEG datasets and learning to decode sentiment directly from the brain.

[NLP-3] Subword-Based Comparative Linguistics across 242 Languages Using Wikipedia Glottosets

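[Quick read]: This paper addresses large-scale, quantitative comparison of lexical structure across 242 languages written in Latin and Cyrillic scripts. The core challenge is modeling lexical similarity and divergence across many languages in a unified way without manual annotation. The key to the solution is to build "glottosets" from Wikipedia lexicons and compare them via Byte-Pair Encoding (BPE), using rank-based subword vectors to capture vocabulary overlap, lexical divergence, and language similarity. The approach is validated by checking BPE segmentations against morpheme boundaries: across 15 languages, alignment is reported as 95% better than a random baseline (F1 = 0.34 vs 0.15). BPE vocabulary similarity also correlates significantly with genetic relatedness (Mantel r = 0.329, p < 0.001), giving a scalable, automated framework for macro-linguistic analysis.

Link: https://arxiv.org/abs/2601.18791
Authors: Iaroslav Chelombitko, Mika Hämäläinen, Aleksey Komissarov
Affiliation: Neapolis University Pafos, Paphos, Cyprus; Metropolia University of Applied Sciences, Helsinki, Finland; aglabx, Paphos, Cyprus
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 15 pages, 4 figures, 4 tables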

Abstract:We present a large-scale comparative study of 242 Latin and Cyrillic-script languages using subword-based methodologies. By constructing ‘glottosets’ from Wikipedia lexicons, we introduce a framework for simultaneous cross-linguistic comparison via Byte-Pair Encoding (BPE). Our approach utilizes rank-based subword vectors to analyze vocabulary overlap, lexical divergence, and language similarity at scale. Evaluations demonstrate that BPE segmentation aligns with morpheme boundaries 95% better than random baseline across 15 languages (F1 = 0.34 vs 0.15). BPE vocabulary similarity correlates significantly with genetic language relatedness (Mantel r = 0.329, p < 0.001), with Romance languages forming the tightest cluster (mean distance 0.51) and cross-family pairs showing clear separation (0.82). Analysis of 26,939 cross-linguistic homographs reveals that 48.7% receive different segmentations across related languages, with variation correlating to phylogenetic distance. Our results provide quantitative macro-linguistic insights into lexical patterns across typologically diverse languages within a unified analytical framework.
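One common way to score a segmentation against morpheme boundaries is a boundary-level F1. The sketch below shows that definition only; the abstract does not say exactly how the paper computes its F1, and the helper names `boundaries` and `boundary_f1` are illustrative.

```python
def boundaries(segments):
    """Character offsets of internal boundaries, e.g. un|break|able -> {2, 7}."""
    offsets, position = set(), 0
    for segment in segments[:-1]:
        position += len(segment)
        offsets.add(position)
    return offsets


def boundary_f1(predicted, gold):
    """F1 between predicted and gold boundary offsets of a single word."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = boundaries(["un", "break", "able"])   # morpheme segmentation
bpe = boundaries(["unb", "reak", "able"])    # a made-up subword segmentation
print(boundary_f1(bpe, gold))
```

A corpus-level score would aggregate counts over all words before computing precision and recall, which is an evaluation choice this listing does not specify.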

[NLP-4] MortalMATH: Evaluating the Conflict Between Reasoning Objectives and Emergency Contexts

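[Quick read]: This paper asks whether LLMs that are increasingly optimized for deep reasoning pursue correct task execution so single-mindedly that they neglect safety in high-stakes situations, a "tunnel vision" effect. The key to the study is MortalMATH, a benchmark of 150 scenarios in which users ask for algebra help while describing increasingly life-threatening emergencies (e.g., stroke symptoms, freefall), used to evaluate systematically how models respond. Generalist models (such as Llama-3.1) refuse the math task to address the danger, whereas specialized reasoning models (such as Qwen-3-32b and GPT-5-nano) often ignore the emergency and keep task completion rates above 95%. Reasoning computation also delays any potential help by up to 15 seconds. The results suggest that training purely toward correct answers may cause models to unlearn the survival instincts needed for safe deployment.

Link: https://arxiv.org/abs/2601.18790
Authors: Etienne Lanzeray, Stephane Meilliez, Malo Ruelle, Damien Sileo
Affiliation: Univ. Lille, Lille, France; Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 - CRIStAL, F-59000 Lille, France
Categories: Computation and Language (cs.CL)
Notes: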

Abstract:Large Language Models are increasingly optimized for deep reasoning, prioritizing the correct execution of complex tasks over general conversation. We investigate whether this focus on calculation creates a “tunnel vision” that ignores safety in critical situations. We introduce MortalMATH, a benchmark of 150 scenarios where users request algebra help while describing increasingly life-threatening emergencies (e.g., stroke symptoms, freefall). We find a sharp behavioral split: generalist models (like Llama-3.1) successfully refuse the math to address the danger. In contrast, specialized reasoning models (like Qwen-3-32b and GPT-5-nano) often ignore the emergency entirely, maintaining over 95 percent task completion rates while the user describes dying. Furthermore, the computational time required for reasoning introduces dangerous delays: up to 15 seconds before any potential help is offered. These results suggest that training models to relentlessly pursue correct answers may inadvertently unlearn the survival instincts required for safe deployment.

[NLP-5] Unsupervised Text Segmentation via Kernel Change-Point Detection on Sentence Embeddings

[Quick read]: This paper addresses unsupervised text segmentation, motivated by the fact that boundary labels are expensive, subjective, and transfer poorly across domains and granularities. The proposed method, Embed-KCPD, is training-free: it represents sentences as embedding vectors and estimates segment boundaries by minimizing a penalized kernel change-point detection (KCPD) objective. The main theoretical contribution is what the authors describe as the first dependence-aware theory for KCPD under m-dependent sequences: an oracle inequality for the population penalized risk, and a localization guarantee that each true change point is recovered within a window that is small relative to segment length. An LLM-based simulation framework connects the theory to practice.

Link: https://arxiv.org/abs/2601.18788
Authors: Mumin Jia, Jairo Diaz-Rodriguez
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Notes: arXiv admin note: substantial text overlap with arXiv:2510.03437

Abstract:Unsupervised text segmentation is crucial because boundary labels are expensive, subjective, and often fail to transfer across domains and granularity choices. We propose Embed-KCPD, a training-free method that represents sentences as embedding vectors and estimates boundaries by minimizing a penalized KCPD objective. Beyond the algorithmic instantiation, we develop, to our knowledge, the first dependence-aware theory for KCPD under m-dependent sequences, a finite-memory abstraction of short-range dependence common in language. We prove an oracle inequality for the population penalized risk and a localization guarantee showing that each true change point is recovered within a window that is small relative to segment length. To connect theory to practice, we introduce an LLM-based simulation framework that generates synthetic documents with controlled finite-memory dependence and known boundaries, validating the predicted scaling behavior. Across standard segmentation benchmarks, Embed-KCPD often outperforms strong unsupervised baselines. A case study on Taylor Swift’s tweets illustrates that Embed-KCPD combines strong theoretical guarantees, simulated reliability, and practical effectiveness for text segmentation.
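A generic penalized kernel change-point objective can be minimized exactly with dynamic programming. The sketch below assumes an RBF kernel, a constant per-segment penalty, and toy 2-D "embeddings"; the paper's actual kernel, penalty, and solver are not given in this listing, so every name and parameter here is illustrative.

```python
import math


def rbf(x, y, gamma=1.0):
    """RBF kernel between two vectors (kernel choice is an assumption)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def kcpd(embeddings, penalty, kernel=rbf):
    """Return boundary indices minimizing kernel scatter + penalty per segment."""
    n = len(embeddings)
    gram = [[kernel(a, b) for b in embeddings] for a in embeddings]
    # 2-D prefix sums so any block sum of the Gram matrix is O(1).
    pre = [[0.0] * (n + 1) for _ in range(n + 1)]
    diag = [0.0] * (n + 1)
    for i in range(n):
        diag[i + 1] = diag[i] + gram[i][i]
        for j in range(n):
            pre[i + 1][j + 1] = gram[i][j] + pre[i][j + 1] + pre[i + 1][j] - pre[i][j]

    def cost(s, e):
        # Within-segment scatter of sentences s..e-1 in kernel feature space.
        block = pre[e][e] - pre[s][e] - pre[e][s] + pre[s][s]
        return (diag[e] - diag[s]) - block / (e - s)

    best = [0.0] + [math.inf] * n
    back = [0] * (n + 1)
    for e in range(1, n + 1):
        for s in range(e):
            candidate = best[s] + cost(s, e) + penalty
            if candidate < best[e]:
                best[e], back[e] = candidate, s
    bounds, e = [], n
    while e > 0:
        if back[e] > 0:
            bounds.append(back[e])
        e = back[e]
    return sorted(bounds)


sentences = [(0.0, 0.0)] * 4 + [(5.0, 5.0)] * 4  # two obvious "topics"
print(kcpd(sentences, penalty=0.5))
```

A larger penalty yields fewer segments; in practice the embeddings would come from a sentence encoder rather than hand-written vectors.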

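[NLP-6] Design Techniques for LLM-Powered Interactive Storytelling: A Case Study of the Dramamancer System EMNLP

[Quick read]: This paper addresses the difficulty of balancing authorial intent and player agency in interactive narrative. Traditional narrative systems are often constrained by preset scripts and struggle to give players freedom while keeping the story coherent. The proposed approach uses a large language model (LLM) as an intermediary that transforms author-created story schemas into player-driven playthroughs. Its key idea is to use the LLM to map a static narrative framework to real-time, context-sensitive content, preserving narrative consistency while strengthening the player's sense of participation. The paper illustrates this paradigm with the Dramamancer system and outlines associated design techniques and evaluation considerations.

Link: https://arxiv.org/abs/2601.18785
Authors: Tiffany Wang, Yuqian Sun, Yi Wang, Melissa Roemmele, John Joon Young Chung, Max Kreminski
Affiliation: Midjourney
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: Extended abstract presented at the 2025 Wordplay Workshop at EMNLP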

Abstract:The rise of Large Language Models (LLMs) has enabled a new paradigm for bridging authorial intent and player agency in interactive narrative. We consider this paradigm through the example of Dramamancer, a system that uses an LLM to transform author-created story schemas into player-driven playthroughs. This extended abstract outlines some design techniques and evaluation considerations associated with this system.

[NLP-7] POPE: Learning to Reason on Hard Problems via Privileged On-Policy Exploration

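[Quick read]: This paper addresses the exploration problem in reinforcement learning (RL) for large language models (LLMs): on hard problems, on-policy RL almost never samples a correct rollout, so rewards are zero and there is no learning signal. Existing remedies such as entropy bonuses, more permissive clipping of the importance ratio, or direct optimization of pass@k objectives do not resolve the issue and can destabilize optimization. The authors also find that mixing easy and hard problems during training is counterproductive because of "ray interference": optimization concentrates on already-solvable problems and inhibits progress on harder ones. The key to the solution is Privileged On-Policy Exploration (POPE), which uses human or other oracle solutions as privileged information, prepending prefixes of them to hard problems so that guided rollouts obtain non-zero rewards. The resulting behaviors then transfer back to the original, unguided problems through a synergy between instruction following and reasoning, expanding the set of solvable problems and substantially improving performance on challenging reasoning benchmarks.

Link: https://arxiv.org/abs/2601.18779
Authors: Yuxiao Qu, Amrith Setlur, Virginia Smith, Ruslan Salakhutdinov, Aviral Kumar
Affiliation: Carnegie Mellon University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: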

Abstract:Reinforcement learning (RL) has improved the reasoning abilities of large language models (LLMs), yet state-of-the-art methods still fail to learn on many training problems. On hard problems, on-policy RL rarely explores even a single correct rollout, yielding zero reward and no learning signal for driving improvement. We find that natural solutions to remedy this exploration problem from classical RL, such as entropy bonuses, more permissive clipping of the importance ratio, or direct optimization of pass@k objectives, do not resolve this issue and often destabilize optimization without improving solvability. A natural alternative is to leverage transfer from easier problems. However, we show that mixing easy and hard problems during RL training is counterproductive due to ray interference, where optimization focuses on already-solvable problems in a way that actively inhibits progress on harder ones. To address this challenge, we introduce Privileged On-Policy Exploration (POPE), an approach that leverages human- or other oracle solutions as privileged information to guide exploration on hard problems, unlike methods that use oracle solutions as training targets (e.g., off-policy RL methods or warmstarting from SFT). POPE augments hard problems with prefixes of oracle solutions, enabling RL to obtain non-zero rewards during guided rollouts. Crucially, the resulting behaviors transfer back to the original, unguided problems through a synergy between instruction-following and reasoning. Empirically, POPE expands the set of solvable problems and substantially improves performance on challenging reasoning benchmarks.

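[NLP-8] Teaching Models to Teach Themselves: Reasoning at the Edge of Learnability

[Quick read]: This paper addresses the stalling of RL fine-tuning for large reasoning models on datasets with low initial success rates, where the training signal is sparse; in other words, how a model can escape its own learning plateau. The core solution is SOAR, a self-improvement framework based on meta-RL with a bi-level teacher/student structure for automated curriculum generation: a teacher copy of the model uses the pretrained model's latent knowledge to propose synthetic problems, a student copy attempts them, and the teacher is rewarded according to the student's measured improvement on a small set of hard problems. The key is grounding the curriculum in measured student progress rather than intrinsic proxy rewards, which avoids the instability and diversity collapse common in prior LLM self-play. The analysis also finds that structural quality and well-posedness of the generated questions matter more for learning progress than solution correctness.

Link: https://arxiv.org/abs/2601.18778
Authors: Shobhita Sundaram, John Quan, Ariel Kwiatkowski, Kartik Ahuja, Yann Ollivier, Julia Kempe
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes: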

Abstract:Can a model learn to escape its own learning plateau? Reinforcement learning methods for finetuning large reasoning models stall on datasets with low initial success rates, and thus little training signal. We investigate a fundamental question: Can a pretrained LLM leverage latent knowledge to generate an automated curriculum for problems it cannot solve? To explore this, we design SOAR: A self-improvement framework designed to surface these pedagogical signals through meta-RL. A teacher copy of the model proposes synthetic problems for a student copy, and is rewarded with its improvement on a small subset of hard problems. Critically, SOAR grounds the curriculum in measured student progress rather than intrinsic proxy rewards. Our study on the hardest subsets of mathematical benchmarks (0/128 success) reveals three core findings. First, we show that it is possible to realize bi-level meta-RL that unlocks learning under sparse, binary rewards by sharpening a latent capacity of pretrained models to generate useful stepping stones. Second, grounded rewards outperform intrinsic reward schemes used in prior LLM self-play, reliably avoiding the instability and diversity collapse modes they typically exhibit. Third, analyzing the generated questions reveals that structural quality and well-posedness are more critical for learning progress than solution correctness. Our results suggest that the ability to generate useful stepping stones does not require the preexisting ability to actually solve the hard problems, paving a principled path to escape reasoning plateaus without additional curated data.

[NLP-9] PRECISE: Reducing the Bias of LLM Evaluations Using Prediction-Powered Ranking Estimation AAAI2026

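[Quick read]: This paper addresses the dependence of quality evaluation for retrieval systems (search, ranking, and RAG) on large numbers of human relevance annotations, and the fact that LLMs used as automated judges have inherent biases that prevent their direct use for metric estimation. The key to the solution is PRECISE, a statistical framework that extends Prediction-Powered Inference (PPI): it combines a small number of human annotations (as few as 100 queries) with a large pool of unlabeled examples (10,000) and LLM judgments to produce reliable estimates of metrics that require sub-instance annotations at the query-document level, such as Precision@K. By reformulating the metric-integration space, it reduces computational complexity from O(2^|C|), where |C| is the corpus size (on the order of millions), to O(2^K). Experiments show that the method reduces the variance of Precision@K estimates while correcting for LLM bias in low-resource settings.

Link: https://arxiv.org/abs/2601.18777
Authors: Abhishek Divekar, Anirban Majumder
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: Accepted at AAAI 2026 - Innovative Applications of AI (IAAI-26)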

Abstract:Evaluating the quality of search, ranking and RAG systems traditionally requires a significant number of human relevance annotations. In recent times, several deployed systems have explored the usage of Large Language Models (LLMs) as automated judges for this task while their inherent biases prevent direct use for metric estimation. We present a statistical framework extending Prediction-Powered Inference (PPI) that combines minimal human annotations with LLM judgments to produce reliable estimates of metrics which require sub-instance annotations. Our method requires as few as 100 human-annotated queries and 10,000 unlabeled examples, reducing annotation requirements significantly compared to traditional approaches. We formulate our proposed framework (PRECISE) for inference of relevance uplift for an LLM-based query reformulation application, extending PPI to sub-instance annotations at the query-document level. By reformulating the metric-integration space, we reduced the computational complexity from O(2^|C|) to O(2^K), where |C| represents corpus size (in order of millions). Detailed experiments across prominent retrieval datasets demonstrate that our method reduces the variance of estimates for the business-critical Precision@K metric, while effectively correcting for LLM bias in low-resource settings.
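The base idea that PRECISE builds on can be illustrated with the standard prediction-powered estimate of a mean: average the LLM judgments on the large unlabeled pool, then add a correction measured on the small human-labeled set. The sketch below shows only this base estimator with made-up toy data; it does not implement the paper's sub-instance extension for Precision@K, and the function name is illustrative.

```python
def ppi_mean(human_labels, llm_on_labeled, llm_on_unlabeled):
    """Prediction-powered estimate of a mean relevance rate.

    human_labels / llm_on_labeled: paired judgments on the small labeled set.
    llm_on_unlabeled: LLM judgments on the large unlabeled pool.
    """
    if not human_labels or len(human_labels) != len(llm_on_labeled):
        raise ValueError("labeled sets must be non-empty and paired")
    if not llm_on_unlabeled:
        raise ValueError("unlabeled pool must be non-empty")
    llm_mean = sum(llm_on_unlabeled) / len(llm_on_unlabeled)
    # Rectifier: average amount by which the LLM judge misses the human label.
    rectifier = sum(h - f for h, f in zip(human_labels, llm_on_labeled)) / len(human_labels)
    return llm_mean + rectifier


human = [1, 0, 1, 1]          # toy human relevance labels
judge = [1, 1, 1, 1]          # the LLM judge is too lenient on this set
pool = [1, 1, 0, 1, 1, 1, 0, 1]
print(ppi_mean(human, judge, pool))
```

Here the raw LLM average on the pool is pulled down by the leniency measured on the labeled set, which is the bias-correction effect the abstract describes.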

[NLP-10] Dep-Search: Learning Dependency-Aware Reasoning Traces with Persistent Memory

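[Quick read]: This paper addresses three limitations of existing search-based LLM frameworks on complex multi-hop reasoning: they rely on implicit natural-language reasoning to decide search strategy, which makes dependencies between sub-questions hard to manage; they cannot efficiently reuse previously retrieved knowledge; and they are hard to train toward optimal search strategies with reinforcement learning. The key to the solution is Dep-Search, a dependency-aware framework that integrates structured reasoning, retrieval, and persistent memory and is trained with GRPO (Group Relative Policy Optimization). Explicit control mechanisms let the model decompose questions with dependency relationships, retrieve information when needed, access previously stored knowledge from memory, and summarize long reasoning contexts into reusable memory entries, substantially improving multi-hop question answering.

Link: https://arxiv.org/abs/2601.18771
Authors: Yanming Liu, Xinyue Peng, Zixuan Yan, Yanxin Shen, Wenjie Xu, Yuefeng Huang, Xinyi Wang, Jiannan Cao, Jianwei Yin, Xuhong Zhang
Affiliation: Zhejiang University; Intel Corporation; Tsinghua University; Massachusetts Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Notes: Dep-Search 1st version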

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks, particularly when augmented with search mechanisms that enable systematic exploration of external knowledge bases. The field has evolved from traditional retrieval-augmented generation (RAG) frameworks to more sophisticated search-based frameworks that orchestrate multi-step reasoning through explicit search strategies. However, existing search frameworks still rely heavily on implicit natural language reasoning to determine search strategies and how to leverage retrieved information across reasoning steps. This reliance on implicit reasoning creates fundamental challenges for managing dependencies between sub-questions, efficiently reusing previously retrieved knowledge, and learning optimal search strategies through reinforcement learning. To address these limitations, we propose Dep-Search, a dependency-aware search framework that advances beyond existing search frameworks by integrating structured reasoning, retrieval, and persistent memory through GRPO. Dep-Search introduces explicit control mechanisms that enable the model to decompose questions with dependency relationships, retrieve information when needed, access previously stored knowledge from memory, and summarize long reasoning contexts into reusable memory entries. Through extensive experiments on seven diverse question answering datasets, we demonstrate that Dep-Search significantly enhances LLMs’ ability to tackle complex multi-hop reasoning tasks, achieving substantial improvements over strong baselines across different model scales.

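[NLP-11] Beyond Preferences: Learning Alignment Principles Grounded in Human Reasons and Values

[Quick read]: This paper addresses how to fairly and effectively determine the "constitution", the set of natural-language principles that guide model behavior, when aligning LLMs with human values. Existing approaches such as Inverse Constitutional AI (ICAI) derive principles only from human preference annotations and do not combine users' general expectations with their preferences in specific interactions. The key to the solution is Grounded Constitutional AI (GCAI), a unified framework that builds a constitution from two kinds of principles: general principles surfaced from users' statements of values regarding AI, and contextual principles obtained by extending ICAI with human-provided reasons for their preferences. Experiments show that humans prefer the GCAI constitution over the ICAI one and consider it more morally grounded, coherent, and pluralistic.

Link: https://arxiv.org/abs/2601.18760
Authors: Henry Bell, Lara Neubauer da Costa Schertel, Bochu Ding, Brandon Fain
Affiliation: Duke University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes: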

Abstract:A crucial consideration when developing and deploying Large Language Models (LLMs) is the human values to which these models are aligned. In the constitutional framework of alignment, models are aligned to a set of principles (the constitution) specified in natural language. However, it is unclear how to fairly determine this constitution with widespread stakeholder input. In this work we propose Grounded Constitutional AI (GCAI), a unified framework for generating constitutions of principles that are representative of both users’ general expectations toward AI (general principles) and their interaction-time preferences (contextual principles). We extend the Inverse Constitutional AI (ICAI) approach to generate contextual principles from human preference annotation data by leveraging human-provided reasons for their preferences. We supplement these contextual principles with general principles surfaced from user statements of values regarding AI. We show that a constitution generated by GCAI is preferred by humans over one generated through ICAI both personally, and for widespread use in governing AI behavior. Additionally, participants consider the GCAI constitution to be more morally grounded, coherent, and pluralistic.

[NLP-12] Capturing P: On the Expressive Power and Efficient Evaluation of Boolean Retrieval

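[Quick read]: This paper addresses an efficiency dilemma that modern information retrieval systems face when handling complex logical and arithmetic constraints. Standard iterator-based engines (Document-at-a-Time) do not natively support nested logic graphs, so such queries run with intractable performance; naive recursive approaches (Term-at-a-Time) can express these structures but consume prohibitive memory when enforcing broad logical exclusions. The key to the solution is a formal retrieval language $\mathcal{L}_R$ based on Directed Acyclic Graphs (DAGs), proven to capture exactly the complexity class $\mathbf{P}$, together with the evaluation algorithm $\texttt{ComputePN}$, which combines native DAG traversal with a memory-efficient "Positive-Negative" response mechanism to evaluate any $\mathcal{L}_R$ query efficiently. This turns the search index into a general-purpose computational engine.

Link: https://arxiv.org/abs/2601.18747
Authors: Amir Aavani
Affiliation: Apple Inc.
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL); Databases (cs.DB)
Notes: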

Abstract:Modern information retrieval is transitioning from simple document filtering to complex, neuro-symbolic reasoning workflows. However, current retrieval architectures face a fundamental efficiency dilemma when handling the rigorous logical and arithmetic constraints required by this new paradigm. Standard iterator-based engines (Document-at-a-Time) do not natively support complex, nested logic graphs; forcing them to execute such queries typically results in intractable runtime performance. Conversely, naive recursive approaches (Term-at-a-Time), while capable of supporting these structures, suffer from prohibitive memory consumption when enforcing broad logical exclusions. In this paper, we propose that a retrieval engine must be capable of "Capturing $\mathbf{P}$": evaluating any polynomial-time property directly over its index in a computationally efficient manner. We define a formal Retrieval Language ($\mathcal{L}_R$) based on Directed Acyclic Graphs (DAGs) and prove it precisely captures the complexity class $\mathbf{P}$. We introduce $\texttt{ComputePN}$, a novel evaluation algorithm that makes $\mathcal{L}_R$ tractable. By combining native DAG traversal with a memory-efficient "Positive-Negative" response mechanism, $\texttt{ComputePN}$ ensures the efficient evaluation of any query in $\mathcal{L}_R$. This work establishes the theoretical foundation for turning the search index into a general-purpose computational engine.

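[NLP-13] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

[Quick read]: This paper addresses two problems in transferring reasoning ability to LLMs through knowledge distillation: off-policy distillation suffers from a distribution mismatch between training and inference, and existing on-policy distillation does not explicitly use the ground-truth solutions available in reasoning datasets. The authors propose On-Policy Self-Distillation (OPSD), in which a single model acts as both teacher and student. The teacher policy conditions on privileged information (such as verified reasoning traces), while the student policy sees only the question; training minimizes the per-token divergence between the two distributions over the student's own rollouts. The method needs no separate teacher model, makes explicit use of external reasoning traces, achieves 4-8x the token efficiency of RL methods such as GRPO, and outperforms off-policy distillation.

Link: https://arxiv.org/abs/2601.18734
Authors: Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, Aditya Grover
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes: 13 pages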

Abstract:Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self (i.e., the version without access to privileged information), we introduce On-Policy Self-Distillation (OPSD), a framework where a single model acts as both teacher and student by conditioning on different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per-token divergence between these distributions over the student’s own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving 4-8x token efficiency compared to reinforcement learning methods such as GRPO and superior performance over off-policy distillation methods.
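The training signal described above, a per-token divergence between student and teacher next-token distributions along the student's own rollout, might look like the sketch below. The abstract does not name the divergence; KL(student || teacher) is used here purely as an assumption, and the toy distributions stand in for real model outputs.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def per_token_divergence(student_dists, teacher_dists):
    """Mean KL(student || teacher) over the positions of one student rollout.

    Assumed setup: both lists hold next-token distributions at the same
    positions; the teacher is the same model conditioned on privileged context.
    """
    if len(student_dists) != len(teacher_dists):
        raise ValueError("student and teacher must cover the same positions")
    if not student_dists:
        return 0.0
    total = sum(kl(s, t) for s, t in zip(student_dists, teacher_dists))
    return total / len(student_dists)


student = [[0.5, 0.5], [0.7, 0.3]]   # sees only the question
teacher = [[0.9, 0.1], [0.7, 0.3]]   # also sees a verified reasoning trace
print(per_token_divergence(student, teacher))
```

In a real training loop this quantity would be computed from model logits and minimized with respect to the student's parameters only.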

[NLP-14] One Adapts to Any: Meta Reward Modeling for Personalized LLM Alignment

[Quick read]: This paper addresses two challenges in personalized alignment: feedback from individual users is scarce, and models must adapt efficiently to unseen users. The core of the solution is to recast personalized reward modeling as a meta-learning problem. The proposed Meta Reward Modeling (MRM) represents each user's reward model as a weighted combination of base reward functions and optimizes the initialization of those weights with a Model-Agnostic Meta-Learning (MAML)-style framework, enabling fast adaptation from few samples. A Robust Personalization Objective (RPO) places greater emphasis on hard-to-learn users during meta-optimization to improve robustness.

Link: https://arxiv.org/abs/2601.18731
Authors: Hongru Cai, Yongqi Li, Tiezheng Yu, Fengbin Zhu, Wenjie Wang, Fuli Feng, Wenjie Li
Affiliation: The Hong Kong Polytechnic University; Huawei Technologies; National University of Singapore; University of Science and Technology of China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

Abstract:Alignment of Large Language Models (LLMs) aims to align outputs with human preferences, and personalized alignment further adapts models to individual users. This relies on personalized reward models that capture user-specific preferences and automatically provide individualized feedback. However, developing these models faces two critical challenges: the scarcity of feedback from individual users and the need for efficient adaptation to unseen users. We argue that addressing these constraints requires a paradigm shift from fitting data to learn user preferences to learn the process of preference adaptation. To realize this, we propose Meta Reward Modeling (MRM), which reformulates personalized reward modeling as a meta-learning problem. Specifically, we represent each user’s reward model as a weighted combination of base reward functions, and optimize the initialization of these weights using a Model-Agnostic Meta-Learning (MAML)-style framework to support fast adaptation under limited feedback. To ensure robustness, we introduce the Robust Personalization Objective (RPO), which places greater emphasis on hard-to-learn users during meta optimization. Extensive experiments on personalized preference datasets validate that MRM enhances few-shot personalization, improves user robustness, and consistently outperforms baselines.
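The per-user adaptation step can be sketched as a few gradient steps on the mixing weights of fixed base rewards. This shows only an inner adaptation loop under assumptions made here: a Bradley-Terry (logistic) preference loss, hand-written base-reward scores, and plain gradient descent. The meta-optimization of the initial weights and the RPO objective are not implemented.

```python
import math


def reward(weights, base_scores):
    """User reward as a weighted combination of base reward scores."""
    return sum(w * s for w, s in zip(weights, base_scores))


def adapt(init_weights, preference_pairs, lr=0.5, steps=20):
    """Adapt weights to one user from (chosen, rejected) base-score pairs."""
    weights = list(init_weights)
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for chosen, rejected in preference_pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Derivative of -log(sigmoid(margin)) with respect to margin.
            coeff = -1.0 / (1.0 + math.exp(margin))
            for k in range(len(weights)):
                grad[k] += coeff * (chosen[k] - rejected[k])
        weights = [w - lr * g / len(preference_pairs) for w, g in zip(weights, grad)]
    return weights


meta_init = [0.5, 0.5]                  # stands in for a meta-learned initialization
pairs = [([0.2, 0.9], [0.8, 0.1])]      # this user favors the second base reward
adapted = adapt(meta_init, pairs)
print(adapted)
```

After adaptation the weight on the second base reward grows and the chosen response scores higher than the rejected one, which is the few-shot personalization behavior the abstract describes.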

[NLP-15] Reflect: Transparent Principle-Guided Reasoning for Constitutional Alignment at Scale

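[Quick read]: This paper addresses the cost of aligning LLMs with value-laden principles: existing approaches are computationally demanding, depend on human annotation data, and adapt poorly to principles not emphasized in the original training. Parameter fine-tuning methods such as reinforcement learning from human feedback (RLHF) are effective but require careful engineering and tuning. The paper proposes Reflect, an inference-time framework for constitutional alignment that needs no training or data. It operates entirely in-context in three stages: (i) a constitution-conditioned base response, (ii) post-generation self-evaluation, and (iii) self-critique followed by a final revision, so that the model reasons explicitly over the principles during inference. The method significantly improves conformance to diverse and complex principles, is particularly effective at reducing rare but significant violations, preserves factual reasoning, and generates training data usable for later parameter fine-tuning, which reduces inference-time overhead in long-term deployment.

Link: https://arxiv.org/abs/2601.18730
Authors: Henry Bell, Caroline Zhang, Mohammed Mobasserul Haque, Dhaval Potdar, Samia Zaman, Brandon Fain
Affiliation: Duke University; Independent Researcher
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: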

Abstract:The constitutional framework of alignment aims to align large language models (LLMs) with value-laden principles written in natural language (such as to avoid using biased language). Prior work has focused on parameter fine-tuning techniques, such as reinforcement learning from human feedback (RLHF), to instill these principles. However, these approaches are computationally demanding, require careful engineering and tuning, and often require difficult-to-obtain human annotation data. We propose REFLECT, an inference-time framework for constitutional alignment that does not require any training or data, providing a plug-and-play approach for aligning an instruction-tuned model to a set of principles. REFLECT operates entirely in-context, combining a (i) constitution-conditioned base response with post-generation (ii) self-evaluation, (iii)(a) self-critique, and (iii)(b) final revision. REFLECT’s technique of explicit in-context reasoning over principles during post-generation outperforms standard few-shot prompting and provides transparent reasoning traces. Our results demonstrate that REFLECT significantly improves LLM conformance to diverse and complex principles, including principles quite distinct from those emphasized in the model’s original parameter fine-tuning, without sacrificing factual reasoning. REFLECT is particularly effective at reducing the rate of rare but significant violations of principles, thereby improving safety and robustness in the tail end of the distribution of generations. Finally, we show that REFLECT naturally generates useful training data for traditional parameter fine-tuning techniques, allowing for efficient scaling and the reduction of inference-time computational overhead in long-term deployment scenarios.
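
The four in-context stages listed in the abstract can be pictured as a short control loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `reflect_style_pipeline`, the prompt wording, the PASS/FAIL convention, and the choice to skip critique and revision when self-evaluation passes are all hypothetical, and `llm` stands for any text-in, text-out callable supplied by the caller.

```python
from typing import Callable, Dict, Optional, Sequence


def reflect_style_pipeline(
    prompt: str,
    principles: Sequence[str],
    llm: Callable[[str], str],
) -> Dict[str, Optional[str]]:
    """Hypothetical inference-time loop: base response, self-evaluation,
    self-critique, final revision. No training or extra data is involved."""
    constitution = "\n".join(f"- {p}" for p in principles)
    header = f"Principles:\n{constitution}\n\n"

    # (i) Constitution-conditioned base response.
    base = llm(f"{header}Request: {prompt}\nResponse:")

    # (ii) Self-evaluation (assumed PASS/FAIL output format).
    verdict = llm(
        f"{header}Response: {base}\n"
        "Does the response follow every principle? Answer PASS or FAIL."
    )
    trace: Dict[str, Optional[str]] = {
        "base": base, "evaluation": verdict, "critique": None, "final": base,
    }
    if verdict.strip().upper().startswith("PASS"):
        return trace  # assumption: a passing response is returned unchanged

    # (iii)(a) Self-critique.
    critique = llm(
        f"{header}Response: {base}\nCritique the response against the principles."
    )
    # (iii)(b) Final revision conditioned on the critique.
    final = llm(
        f"{header}Response: {base}\nCritique: {critique}\n"
        "Rewrite the response so it follows the principles."
    )
    trace["critique"] = critique
    trace["final"] = final
    return trace
```

Returning every intermediate string, rather than only the final answer, is one way to keep the kind of reasoning trace the abstract describes available for inspection.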

[NLP-16] HalluCitation Matters: Revealing the Impact of Hallucinated References with 300 Hallucinated Papers in ACL Conferences

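【Quick read】: This paper addresses the "HalluCitation" problem that generative AI causes in academic writing: fabricated citations, where a paper lists references that do not exist, which seriously damages scientific reliability and the credibility of conferences. The key to its approach is to systematically identify and quantify how prevalent HalluCitation is and what impact it has. Analyzing all papers published at ACL, NAACL, and EMNLP in 2024 and 2025 (main conference, Findings, and workshop papers), the authors find that nearly 300 papers contain at least one HalluCitation, most of them from 2025, with half concentrated at EMNLP 2025 and more than 100 of those being main-conference or Findings papers at that venue. This shows that the problem is growing rapidly and already poses a substantive threat to the academic credibility of top conferences.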

Link: https://arxiv.org/abs/2601.18724
Authors: Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
Affiliations: Nara Institute of Science and Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments: Work In Progress

Abstract:Recently, we have often observed hallucinated citations or references that do not correspond to any existing work in papers under review, preprints, or published papers. Such hallucinated citations pose a serious concern to scientific reliability. When they appear in accepted papers, they may also negatively affect the credibility of conferences. In this study, we refer to hallucinated citations as “HalluCitation” and systematically investigate their prevalence and impact. We analyze all papers published at ACL, NAACL, and EMNLP in 2024 and 2025, including main conference, Findings, and workshop papers. Our analysis reveals that nearly 300 papers contain at least one HalluCitation, most of which were published in 2025. Notably, half of these papers were identified at EMNLP 2025, the most recent conference, indicating that this issue is rapidly increasing. Moreover, more than 100 such papers were accepted as main conference and Findings papers at EMNLP 2025, affecting the credibility.

[NLP-17] Gained in Translation: Privileged Pairwise Judges Enhance Multilingual Reasoning

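【Quick read】: This paper addresses the sharp performance drop that current reasoning large language models (RLMs) show on low-resource languages, where reasoning ability in non-English settings falls far below that in English. The key to its solution is SP3F (Self-Play with Privileged Pairwise Feedback), a two-stage framework that requires no data in the target language. In the first stage, supervised fine-tuning (SFT) on translated English question-answer pairs raises base model correctness. In the second stage, reinforcement learning (RL) with self-play uses feedback from a pairwise judge that receives the English reference answer as privileged information, so the judge can still tell which response is better even when none of the model's outputs is fully correct. The method outperforms fully post-trained models on multiple math and non-math tasks while using substantially less training data.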

Link: https://arxiv.org/abs/2601.18722
Authors: Lintang Sutawika, Gokul Swamy, Zhiwei Steven Wu, Graham Neubig
Affiliations: Carnegie Mellon University, Language Technologies Institute; Carnegie Mellon University, Robotics Institute; Carnegie Mellon University, Software and Societal Systems Department
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Code available at this https URL

Abstract:When asked a question in a language less seen in its training data, current reasoning large language models (RLMs) often exhibit dramatically lower performance than when asked the same question in English. In response, we introduce SP3F (Self-Play with Privileged Pairwise Feedback), a two-stage framework for enhancing multilingual reasoning without any data in the target language(s). First, we supervise fine-tune (SFT) on translated versions of English question-answer pairs to raise base model correctness. Second, we perform RL with feedback from a pairwise judge in a self-play fashion, with the judge receiving the English reference response as privileged information. Thus, even when none of the model’s responses are completely correct, the privileged pairwise judge can still tell which response is better. End-to-end, SP3F greatly improves base model performance, even outperforming fully post-trained models on multiple math and non-math tasks with less than of the training data across the single-language, multilingual, and generalization to unseen language settings.

[NLP-18] Mechanistic Analysis of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning

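【Quick read】: This paper addresses catastrophic forgetting in large language models (LLMs) during continual fine-tuning, where learning a new task interferes with knowledge retained from earlier tasks. Its key finding is the systematic identification of three core mechanisms that drive forgetting: gradient interference in attention weights, representational drift in intermediate layers, and loss landscape flattening. It further shows that forgetting severity correlates strongly with task similarity (Pearson r = 0.87) and gradient alignment, and that roughly 15% to 23% of attention heads are severely disrupted during fine-tuning, with heads in lower layers more susceptible. These mechanistic insights provide a theoretical basis for designing targeted mitigation strategies for continual learning.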

Link: https://arxiv.org/abs/2601.18699
Authors: Olaf Yunus Laitinen Imanov
Affiliations: Technical University of Denmark
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 16 pages, 16 figures (6 main + 10 supplementary)

Abstract:Large language models exhibit remarkable performance across diverse tasks through pre-training and fine-tuning paradigms. However, continual fine-tuning on sequential tasks induces catastrophic forgetting, where newly acquired knowledge interferes with previously learned capabilities. Despite widespread observations of this phenomenon, the mechanistic understanding remains limited. Here, we present a comprehensive mechanistic analysis of catastrophic forgetting in transformer-based LLMs during sequential fine-tuning. Through systematic experiments across multiple model scales (109B to 400B total parameters) and task sequences, we identify three primary mechanisms driving forgetting: gradient interference in attention weights, representational drift in intermediate layers, and loss landscape flattening. We demonstrate that forgetting severity correlates strongly with task similarity (Pearson r = 0.87) and gradient alignment metrics. Our analysis reveals that approximately 15 to 23 percent of attention heads undergo severe disruption during fine-tuning, with lower layers showing greater susceptibility. These findings establish mechanistic foundations for developing targeted mitigation strategies in continual learning systems.

[NLP-19] FadeMem: Biologically-Inspired Forgetting for Efficient Agent Memory

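【Quick read】: This paper addresses the memory limitations of large language models (LLMs) deployed as autonomous agents: without a selective forgetting mechanism, they either forget catastrophically at context boundaries or suffer information overload within a single session. The key to its solution is FadeMem, a biologically inspired agent memory architecture that applies differential decay across a dual-layer memory hierarchy. Retention is governed by adaptive exponential decay functions modulated by semantic relevance, access frequency, and temporal patterns, mimicking the dynamic forgetting and consolidation seen in human cognition. Combined with LLM-guided conflict resolution and intelligent memory fusion, the system keeps key information while letting irrelevant details fade, improving multi-hop reasoning and retrieval and achieving a 45% storage reduction on Multi-Session Chat, LoCoMo, and LTI-Bench.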

Link: https://arxiv.org/abs/2601.18642
Authors: Lei Wei, Xu Dong, Xiao Peng, Niantao Xie, Bin Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large language models deployed as autonomous agents face critical memory limitations, lacking selective forgetting mechanisms that lead to either catastrophic forgetting at context boundaries or information overload within them. While human memory naturally balances retention and forgetting through adaptive decay processes, current AI systems employ binary retention strategies that preserve everything or lose it entirely. We propose FadeMem, a biologically-inspired agent memory architecture that incorporates active forgetting mechanisms mirroring human cognitive efficiency. FadeMem implements differential decay rates across a dual-layer memory hierarchy, where retention is governed by adaptive exponential decay functions modulated by semantic relevance, access frequency, and temporal patterns. Through LLM-guided conflict resolution and intelligent memory fusion, our system consolidates related information while allowing irrelevant details to fade. Experiments on Multi-Session Chat, LoCoMo, and LTI-Bench demonstrate superior multi-hop reasoning and retrieval with 45% storage reduction, validating the effectiveness of biologically-inspired forgetting in agent memory systems.
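
The abstract says retention follows an adaptive exponential decay modulated by semantic relevance and access frequency across two memory layers, but it does not give the formula. The sketch below shows one plausible shape of such a rule. Everything specific in it is an assumption made for illustration: the `MemoryItem` fields, the per-layer base rates, the weights `w_rel` and `w_freq`, and the pruning threshold are hypothetical and are not taken from FadeMem.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryItem:
    text: str
    relevance: float    # assumed semantic relevance score in [0, 1]
    access_count: int   # how often the item has been retrieved
    age_hours: float    # time since the item was stored
    layer: str          # "short" or "long" (assumed dual-layer split)


# Assumed per-hour base decay rates: the short-term layer fades faster.
BASE_RATE = {"short": 0.10, "long": 0.01}


def retention(item: MemoryItem, w_rel: float = 2.0, w_freq: float = 1.0) -> float:
    """Exponential decay whose rate shrinks for relevant, frequently used items."""
    damping = 1.0 + w_rel * item.relevance + w_freq * math.log1p(item.access_count)
    rate = BASE_RATE[item.layer] / damping
    return math.exp(-rate * item.age_hours)


def prune(items: List[MemoryItem], threshold: float = 0.3) -> List[MemoryItem]:
    """Keep only items whose retention score is still above the threshold."""
    return [m for m in items if retention(m) >= threshold]
```

Dividing the base rate by a damping term keeps the score in (0, 1] and makes the effect of each signal easy to read: more relevance or more accesses means slower fading, never faster.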

[NLP-20] AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning

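【Quick read】: This paper addresses the lack of effective tool invocation and composition in multimodal large language models (MLLMs) on complex visual reasoning tasks: how to let a model identify, select, and coordinate multiple tools on its own to complete multi-step tasks, while still generalizing to unseen tools and new tasks. The key to its solution is AdaReasoner, a framework that models tool use as a general reasoning skill rather than as tool-specific or explicitly supervised behavior. Its core components are (i) a scalable data curation pipeline that exposes models to long-horizon, multi-step tool interactions; (ii) Tool-GRPO, a reinforcement learning algorithm that optimizes tool selection and sequencing based on end-task success; and (iii) an adaptive learning mechanism that dynamically regulates tool usage frequency. Together these let the model infer tool utility from task context and intermediate results, coordinate multiple tools, and generalize to unknown tools. The approach outperforms existing methods on several benchmarks, improving the 7B base model by 24.9% on average and surpassing proprietary systems such as GPT-5.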

Link: https://arxiv.org/abs/2601.18631
Authors: Mingyang Song, Haoyu Sun, Jiawei Gu, Linjie Li, Luxin Xu, Ranjay Krishna, Yu Cheng
Affiliations: Fudan University; Tongji University; National University of Singapore; University of Washington; University of Electronic Science and Technology of China; The Chinese University of Hong Kong
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: 28 pages, 10 figures and 13 tables

Abstract:When humans face problems beyond their immediate capabilities, they rely on tools, providing a promising paradigm for improving visual reasoning in multimodal large language models (MLLMs). Effective reasoning, therefore, hinges on knowing which tools to use, when to invoke them, and how to compose them over multiple steps, even when faced with new tools or new tasks. We introduce AdaReasoner, a family of multimodal models that learn tool use as a general reasoning skill rather than as tool-specific or explicitly supervised behavior. AdaReasoner is enabled by (i) a scalable data curation pipeline exposing models to long-horizon, multi-step tool interactions; (ii) Tool-GRPO, a reinforcement learning algorithm that optimizes tool selection and sequencing based on end-task success; and (iii) an adaptive learning mechanism that dynamically regulates tool usage. Together, these components allow models to infer tool utility from task context and intermediate outcomes, enabling coordination of multiple tools and generalization to unseen tools. Empirically, AdaReasoner exhibits strong tool-adaptive and generalization behaviors: it autonomously adopts beneficial tools, suppresses irrelevant ones, and adjusts tool usage frequency based on task demands, despite never being explicitly trained to do so. These capabilities translate into state-of-the-art performance across challenging benchmarks, improving the 7B base model by +24.9% on average and surpassing strong proprietary systems such as GPT-5 on multiple tasks, including VSP and Jigsaw.

[NLP-21] Emergence of Phonemic, Syntactic, and Semantic Representations in Artificial Neural Networks

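【Quick read】: This paper addresses the lack of a unifying computational framework for how neural representations form during language acquisition, in particular how to model the neural basis of the developmental sequence in which children move from phoneme categorization to word identification to syntactic composition. The key to its approach is to analyze the activation patterns of artificial neural networks (ANNs) during training. Both speech-based and text-based models show a staged learning trajectory: their activations gradually build subspaces corresponding to phonemic, lexical, and syntactic structure, and the geometry of these subspaces is qualitatively similar to the stages of human language acquisition. Although the models need far more data than children (two to four orders of magnitude more), the study identifies conditions under which the major stages of language acquisition emerge spontaneously, offering a new route to understanding the computational principles of language acquisition.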

Link: https://arxiv.org/abs/2601.18617
Authors: Pierre Orhan, Pablo Diego-Simón, Emmanuel Chemla, Yair Lakretz, Yves Boubenec, Jean-Rémi King
Affiliations: Paris Brain Institute; Laboratoire de Sciences Cognitives et Psycholinguistique (LSCP); École Normale Supérieure; PSL University; CNRS; Meta AI
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:During language acquisition, children successively learn to categorize phonemes, identify words, and combine them with syntax to form new meaning. While the development of this behavior is well characterized, we still lack a unifying computational framework to explain its underlying neural representations. Here, we investigate whether and when phonemic, lexical, and syntactic representations emerge in the activations of artificial neural networks during their training. Our results show that both speech- and text-based models follow a sequence of learning stages: during training, their neural activations successively build subspaces, where the geometry of the neural activations represents phonemic, lexical, and syntactic structure. While this developmental trajectory qualitatively relates to children’s, it is quantitatively different: These algorithms indeed require two to four orders of magnitude more data for these neural representations to emerge. Together, these results show conditions under which major stages of language acquisition spontaneously emerge, and hence delineate a promising path to understand the computations underpinning language acquisition.

[NLP-22] Stability as a Liability: Systematic Breakdown of Linguistic Structure in LLMs

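【Quick read】: This paper asks whether optimization stability and generation quality go hand in hand when training large language models, that is, whether stable training dynamics guarantee a diverse and expressive generation distribution. The key to its analysis is showing that, under standard maximum likelihood training, stable parameter trajectories drive the model to approximately minimize the forward KL divergence to the empirical distribution while implicitly lowering generative entropy. The model then concentrates probability mass on a small number of modes of the empirical distribution and exhibits systematic degeneration even though the loss converges smoothly. The authors validate this mechanism with a controlled feedback-based training framework, showing that optimization stability and generative expressivity are not inherently aligned and that stability alone is not a reliable indicator of generation quality.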

Link: https://arxiv.org/abs/2601.18588
Authors: Xianzhe Meng, Qiangsheng Zeng, Ling Luo, Qinghan Yang, Jiarui Hao, Wenbo Wu, Qinyu Wang, Rui Yin, Lin Qi, Renzhi Lu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Training stability is typically regarded as a prerequisite for reliable optimization in large language models. In this work, we analyze how stabilizing training dynamics affects the induced generation distribution. We show that under standard maximum likelihood training, stable parameter trajectories lead stationary solutions to approximately minimize the forward KL divergence to the empirical distribution, while implicitly reducing generative entropy. As a consequence, the learned model can concentrate probability mass on a limited subset of empirical modes, exhibiting systematic degeneration despite smooth loss convergence. We empirically validate this effect using a controlled feedback-based training framework that stabilizes internal generation statistics, observing consistent low-entropy outputs and repetitive behavior across architectures and random seeds. It indicates that optimization stability and generative expressivity are not inherently aligned, and that stability alone is an insufficient indicator of generative quality.
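
The two quantities the abstract reasons about, forward KL divergence to the empirical distribution and generative entropy, are easy to compute for a toy categorical distribution. The sketch below is only a numerical illustration of those definitions; the three distributions are invented for this example, and it does not reproduce the paper's training framework or its claim about how the two quantities interact during training.

```python
import math
from typing import Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in nats: H(p) = -sum_x p(x) log p(x)."""
    return -sum(x * math.log(x) for x in p if x > 0)


def forward_kl(p_data: Sequence[float], q_model: Sequence[float]) -> float:
    """Forward KL in nats: KL(p_data || q_model) = sum_x p(x) log(p(x) / q(x)).
    Assumes q_model(x) > 0 wherever p_data(x) > 0."""
    return sum(p * math.log(p / q) for p, q in zip(p_data, q_model) if p > 0)


# Invented toy distributions over four "modes" of the data.
empirical = [0.40, 0.30, 0.20, 0.10]
spread_model = [0.35, 0.30, 0.20, 0.15]   # covers all modes
peaked_model = [0.85, 0.05, 0.05, 0.05]   # mass concentrated on one mode

for name, q in (("spread", spread_model), ("peaked", peaked_model)):
    print(f"{name}: entropy={entropy(q):.3f} nats, "
          f"forward KL={forward_kl(empirical, q):.3f} nats")
```

A model that puts most of its mass on one mode has visibly lower entropy than one that covers all modes, which is the kind of low-entropy, repetitive output the abstract reports observing.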

[NLP-23] From Classification to Ranking: Enhancing LLM Reasoning Capabilities for MBTI Personality Detection AAAI2026

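【Quick read】: This paper addresses the difficulty that existing large language model (LLM) based personality detection methods have in classifying personality traits accurately, given the complexity of human personality and the subtle differences between traits, and it also targets the heavy reliance of conventional prompt-based methods on expert-crafted knowledge and their lack of autonomous pattern learning. The key to its solution is to recast personality detection as a ranking task rather than a classification task and to train with a reinforcement learning paradigm, Group Relative Policy Optimization (GRPO), paired with a purpose-built ranking-based reward function. This lets the model learn an optimal answer ranking from personality assessments that are subjective and have blurred boundaries, improving the accuracy and robustness of personality detection.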

Link: https://arxiv.org/abs/2601.18582
Authors: Yuan Cao, Feixiang Liu, Xinyue Wang, Yihan Zhu, Hui Xu, Zheng Wang, Qiang Qiu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, 4 figures, AAAI 2026 Bridge

Abstract:Personality detection aims to measure an individual’s corresponding personality traits through their social media posts. The advancements in Large Language Models (LLMs) offer novel perspectives for personality detection tasks. Existing approaches enhance personality trait analysis by leveraging LLMs to extract semantic information from textual posts as prompts, followed by training classifiers for categorization. However, accurately classifying personality traits remains challenging due to the inherent complexity of human personality and subtle inter-trait distinctions. Moreover, prompt-based methods often exhibit excessive dependency on expert-crafted knowledge without autonomous pattern-learning capacity. To address these limitations, we view personality detection as a ranking task rather than a classification and propose a corresponding reinforcement learning training paradigm. First, we employ supervised fine-tuning (SFT) to establish personality trait ranking capabilities while enforcing standardized output formats, creating a robust initialization. Subsequently, we introduce Group Relative Policy Optimization (GRPO) with a specialized ranking-based reward function. Unlike verification tasks with definitive solutions, personality assessment involves subjective interpretations and blurred boundaries between trait categories. Our reward function explicitly addresses this challenge by training LLMs to learn optimal answer rankings. Comprehensive experiments have demonstrated that our method achieves state-of-the-art performance across multiple personality detection benchmarks.
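
To make the combination of a ranking-based reward with GRPO more concrete, here is a minimal sketch. The group-relative normalization (each reward minus the group mean, divided by the group standard deviation) is the commonly described GRPO advantage; the reciprocal-rank reward is a stand-in chosen for illustration, since the abstract does not specify the paper's reward function, and the MBTI labels in the example are invented sample outputs.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def reciprocal_rank_reward(ranking: Sequence[str], gold: str) -> float:
    """Assumed reward: 1/rank of the gold label in the model's ranking, 0 if absent."""
    return 1.0 / (ranking.index(gold) + 1) if gold in ranking else 0.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group, as in GRPO-style training.
    Uses the population standard deviation; eps avoids division by zero."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, three sampled rankings of candidate MBTI types (invented data).
gold_label = "INTJ"
sampled_rankings = [
    ["INTJ", "INTP", "ENTJ"],   # gold ranked first
    ["INTP", "INTJ", "ENTJ"],   # gold ranked second
    ["ENFP", "ESFP", "ISFP"],   # gold missing
]
rewards = [reciprocal_rank_reward(r, gold_label) for r in sampled_rankings]
advantages = group_relative_advantages(rewards)
```

Unlike a 0/1 correctness reward, a rank-sensitive reward still separates a near miss from a complete miss, which matches the abstract's point that trait boundaries are blurred.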

[NLP-24] One Persona, Many Cues, Different Results: How Sociodemographic Cues Impact LLM Personalization

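【Quick read】: This paper addresses the risk that personalizing large language models (LLMs) by sociodemographic subgroup introduces or amplifies bias and unfair outcomes across groups. The key to its approach is a systematic evaluation of six commonly used persona cues across seven open and proprietary LLMs. Although the cues are highly correlated overall, responses differ substantially across personas, so the authors caution against drawing conclusions from a single cue and recommend that future personalization research evaluate multiple externally valid cues together.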

Link: https://arxiv.org/abs/2601.18572
Authors: Franziska Weeber, Vera Neplenbroek, Jan Batzner, Sebastian Padó
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Personalization of LLMs by sociodemographic subgroup often improves user experience, but can also introduce or amplify biases and unfair outcomes across groups. Prior work has employed so-called personas, sociodemographic user attributes conveyed to a model, to study bias in LLMs by relying on a single cue to prompt a persona, such as user names or explicit attribute mentions. This disregards LLM sensitivity to prompt variations (robustness) and the rarity of some cues in real interactions (external validity). We compare six commonly used persona cues across seven open and proprietary LLMs on four writing and advice tasks. While cues are overall highly correlated, they produce substantial variance in responses across personas. We therefore caution against claims from a single persona cue and recommend future personalization research to evaluate multiple externally valid cues.

[NLP-25] Unknown Unknowns: Why Hidden Intentions in LLMs Evade Detection

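【Quick read】: This paper addresses the difficulty of detecting covert, goal-directed behaviors ("hidden intentions") in deployed generative AI. Such behaviors may arise from training or optimization artifacts, or be introduced deliberately by a malicious developer, and their covert nature makes them hard to identify in open-world settings. The key to its approach is a taxonomy of ten categories of hidden intentions, grounded in social science research and organized along four dimensions: intent, mechanism, context, and impact. Controlled model experiments confirm that these intentions can be induced, and a systematic evaluation of reasoning and non-reasoning LLM judges shows that existing detection methods fail easily under low prevalence, where false positives dominate precision and false negatives hide real risks. This reveals the root cause of auditing failure: it needs either a vanishingly small false positive rate or strong priors on the type of manipulation. The framework lays groundwork for future research on inducing, testing, and governing hidden intentions.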

Link: https://arxiv.org/abs/2601.18552
Authors: Devansh Srivastav, David Pape, Lea Schönherr
Affiliations: CISPA Helmholtz Center for Information Security, Saarbrücken, Germany
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:LLMs are increasingly embedded in everyday decision-making, yet their outputs can encode subtle, unintended behaviours that shape user beliefs and actions. We refer to these covert, goal-directed behaviours as hidden intentions, which may arise from training and optimisation artefacts, or be deliberately induced by an adversarial developer, yet remain difficult to detect in practice. We introduce a taxonomy of ten categories of hidden intentions, grounded in social science research and organised by intent, mechanism, context, and impact, shifting attention from surface-level behaviours to design-level strategies of influence. We show how hidden intentions can be easily induced in controlled models, providing both testbeds for evaluation and demonstrations of potential misuse. We systematically assess detection methods, including reasoning and non-reasoning LLM judges, and find that detection collapses in realistic open-world settings, particularly under low-prevalence conditions, where false positives overwhelm precision and false negatives conceal true risks. Stress tests on precision-prevalence and precision-FNR trade-offs reveal why auditing fails without vanishingly small false positive rates or strong priors on manipulation types. Finally, a qualitative case study shows that all ten categories manifest in deployed, state-of-the-art LLMs, emphasising the urgent need for robust frameworks. Our work provides the first systematic analysis of detectability failures of hidden intentions in LLMs under open-world settings, offering a foundation for understanding, inducing, and stress-testing such behaviours, and establishing a flexible taxonomy for anticipating evolving threats and informing governance.
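
The claim that false positives overwhelm precision at low prevalence follows directly from Bayes' rule: precision = TPR·p / (TPR·p + FPR·(1 − p)), where p is the prevalence and TPR = 1 − FNR. The sketch below works through that arithmetic. The detector rates used in the loop (10% false negatives, 5% false positives) are illustrative numbers chosen here, not results from the paper.

```python
def precision(fnr: float, fpr: float, prevalence: float) -> float:
    """Precision of a detector given its false negative rate, false positive
    rate, and the base rate (prevalence) of the behavior being detected."""
    true_pos = (1.0 - fnr) * prevalence
    false_pos = fpr * (1.0 - prevalence)
    total = true_pos + false_pos
    return true_pos / total if total > 0 else 0.0


# Illustrative judge: misses 10% of hidden intentions, flags 5% of benign outputs.
for prevalence in (0.5, 0.1, 0.01, 0.001):
    print(f"prevalence={prevalence:<6} precision={precision(0.10, 0.05, prevalence):.3f}")
```

With these assumed rates the same judge goes from mostly correct flags at 50% prevalence to flags that are almost all false alarms at 0.1% prevalence, which is why the abstract stresses vanishingly small false positive rates or strong priors.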

[NLP-26] Evaluating Morphological Plausibility of Subword Tokenization via Statistical Alignment with Morpho-Syntactic Features EACL2026

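【Quick read】: This paper addresses the dependence of existing subword segmentation evaluation methods on high-quality gold segmentation data, which for many languages is either unavailable or of inconsistent quality. The key to its solution is a new evaluation metric based on morpho-syntactic features: it uses IBM Model 1 to probabilistically align subwords with morphological features, so the morphological plausibility of a segmentation can be measured without gold segmentations. The method draws on the morpho-syntactic information widely available in resources such as Universal Dependencies and UniMorph, applies well across languages with very different morphological systems, and correlates with traditional morpheme boundary recall.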

Link: https://arxiv.org/abs/2601.18536
Authors: Abishek Stephen, Jindřich Libovický
Affiliations: Charles University, Faculty of Mathematics and Physics; Institute of Formal and Applied Linguistics
Subjects: Computation and Language (cs.CL)
Comments: Accepted to Findings of EACL 2026, 9 pages, 6 figures

Abstract:We present a novel metric for the evaluation of the morphological plausibility of subword segmentation. Unlike the typically used morpheme boundary or retrieval F-score, which requires gold segmentation data that is either unavailable or of inconsistent quality across many languages, our approach utilizes morpho-syntactic features. These are available in resources such as Universal Dependencies or UniMorph for a much wider range of languages. The metric works by probabilistically aligning subwords with morphological features through an IBM Model 1. Our experiments show that the metric correlates well with traditional morpheme boundary recall while being more broadly applicable across languages with different morphological systems.
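
IBM Model 1 is a classic expectation-maximization alignment model, and the idea of aligning subwords with morphological features can be sketched in a few lines. The code below is a simplified textbook-style version (no NULL token), not the authors' metric: the alignment direction t(subword | feature), the `Stem` pseudo-feature, and the toy word list are assumptions made for this example, and the step that turns alignment probabilities into a plausibility score is not shown.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[List[str], List[str]]  # (features of a word, its subwords)


def ibm_model1(pairs: List[Pair], iterations: int = 10) -> Dict[Tuple[str, str], float]:
    """Estimate t[(subword, feature)] = P(subword | feature) with EM."""
    vocab = {s for _, subwords in pairs for s in subwords}
    t: Dict[Tuple[str, str], float] = defaultdict(lambda: 1.0 / len(vocab))
    for _ in range(iterations):
        count: Dict[Tuple[str, str], float] = defaultdict(float)
        total: Dict[str, float] = defaultdict(float)
        # E-step: distribute each subword over the features of its word.
        for features, subwords in pairs:
            for s in subwords:
                z = sum(t[(s, f)] for f in features)
                for f in features:
                    c = t[(s, f)] / z
                    count[(s, f)] += c
                    total[f] += c
        # M-step: renormalize expected counts per feature.
        for (s, f), c in count.items():
            t[(s, f)] = c / total[f]
    return t


# Invented toy data: plural forms are segmented as stem + "s".
toy_pairs: List[Pair] = [
    (["Stem", "Number=Plur"], ["cat", "s"]),
    (["Stem"], ["cat"]),
    (["Stem", "Number=Plur"], ["dog", "s"]),
    (["Stem"], ["dog"]),
]
t = ibm_model1(toy_pairs)
```

On this toy data the subword `s` ends up with the largest probability under `Number=Plur`, which is the intuition behind calling a segmentation morphologically plausible when its subwords align cleanly with features.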

[NLP-27] From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for Reinforcement Learning of Open-ended Generation ICLR2026

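【Quick read】: This paper addresses the challenges that Reinforcement Learning with Verifiable Rewards (RLVR) faces on open-ended generation, where the absence of an unambiguous ground truth makes single-point supervision inefficient and prone to reward hacking. The key to its solution is Reinforcement Learning with Verifiable Reference-based Rewards (RLVRR), which extracts an ordered linguistic signal (a "reward chain") from high-quality reference texts and decomposes the reward into two dimensions: a content dimension that preserves deterministic core concepts such as keywords, and a style dimension that uses a large language model (LLM) to verify adherence to stylistic properties. The method combines the exploratory strength of reinforcement learning with the efficiency and reliability of supervised fine-tuning (SFT), unifies the training of structured reasoning and open-ended generation, and shows better generalization and output diversity across multiple benchmarks.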

Link: https://arxiv.org/abs/2601.18533
Authors: Yuxin Jiang, Yufei Wang, Qiyuan Zhang, Xingshan Zeng, Liangyou Li, Jierun Chen, Chaofan Tao, Haoli Bai, Lifeng Shang
Affiliations: Huawei Technologies Co., Ltd; City University of Hong Kong
Subjects: Computation and Language (cs.CL)
Comments: 19 pages, 8 figures, 12 tables. Accepted at ICLR 2026

Abstract:Reinforcement learning with verifiable rewards (RLVR) succeeds in reasoning tasks (e.g., math and code) by checking the final verifiable answer (i.e., a verifiable dot signal). However, extending this paradigm to open-ended generation is challenging because there is no unambiguous ground truth. Relying on single-dot supervision often leads to inefficiency and reward hacking. To address these issues, we propose reinforcement learning with verifiable reference-based rewards (RLVRR). Instead of checking the final answer, RLVRR extracts an ordered linguistic signal from high-quality references (i.e., a reward chain). Specifically, RLVRR decomposes rewards into two dimensions: content, which preserves deterministic core concepts (e.g., keywords), and style, which evaluates adherence to stylistic properties through LLM-based verification. In this way, RLVRR combines the exploratory strength of RL with the efficiency and reliability of supervised fine-tuning (SFT). Extensive experiments on more than 10 benchmarks with Qwen and Llama models confirm the advantages of our approach. RLVRR (1) substantially outperforms SFT trained with ten times more data and advanced reward models, (2) unifies the training of structured reasoning and open-ended generation, and (3) generalizes more effectively while preserving output diversity. These results establish RLVRR as a principled and efficient path toward verifiable reinforcement learning for general-purpose LLM alignment. We release our code and data at this https URL.
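
One way to picture a deterministic, order-aware content reward is to score how much of a reference keyword sequence an output covers in the same order. The sketch below does this with a longest-common-subsequence count. It is an illustration of the general idea only: the function name, the whitespace tokenization, and the LCS-based scoring are assumptions made here, and the abstract does not state that RLVRR computes its content reward this way. The style dimension, which the paper verifies with an LLM, is not covered.

```python
from typing import Sequence


def ordered_keyword_reward(reference_keywords: Sequence[str], output_text: str) -> float:
    """Fraction of reference keywords that the output covers in the reference
    order, computed as LCS(reference keywords, keyword hits in the output)."""
    keywords = [k.lower() for k in reference_keywords]
    if not keywords:
        return 0.0
    keyword_set = set(keywords)
    # Naive tokenization: lowercase, split on whitespace, strip punctuation.
    tokens = [tok.strip(".,;:!?\"'()") for tok in output_text.lower().split()]
    hits = [tok for tok in tokens if tok in keyword_set]

    # Standard dynamic-programming longest common subsequence.
    m, n = len(keywords), len(hits)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if keywords[i] == hits[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / m


chain = ["boil", "pasta", "drain"]  # invented reference keyword chain
in_order = ordered_keyword_reward(chain, "Boil water, add pasta, then drain.")
out_of_order = ordered_keyword_reward(chain, "Drain the pasta.")
```

Because the score is a fraction rather than a single pass/fail check, an output earns partial credit for each step of the chain it reproduces, which is the contrast with single-dot supervision that the abstract draws.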

[NLP-28] Exploring Fine-Tuning for In-Context Retrieval and Efficient KV-Caching in Long-Context Language Models EACL2026

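【Quick read】: This paper addresses the limited ability of long-context language models (LCLMs) to identify and use relevant information, and their insufficient robustness under key-value cache (KV-cache) compression. The key to its approach is a systematic study of how different fine-tuning strategies improve LCLM performance, both on in-domain tasks and under KV-cache compression. Experiments show that targeted fine-tuning brings substantial in-domain gains (up to +20 points) and moderate robustness improvements under KV-cache compression, while out-of-domain generalization, for example on finance question answering, remains strongly task dependent.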

Link: https://arxiv.org/abs/2601.18527
Authors: Francesco Maria Molfese, Momchil Hardalov, Rexhina Blloshmi, Bill Byrne, Adrià de Gispert
Affiliations: Sapienza University of Rome; Amazon AGI
Subjects: Computation and Language (cs.CL)
Comments: European Chapter of the Association for Computational Linguistics EACL 2026

Abstract:With context windows of millions of tokens, Long-Context Language Models (LCLMs) can encode entire document collections, offering a strong alternative to conventional retrieval-augmented generation (RAG). However, it remains unclear whether fine-tuning strategies can improve long-context performance and translate to greater robustness under KV-cache compression techniques. In this work, we investigate which training strategies most effectively enhance LCLMs’ ability to identify and use relevant information, as well as enhancing their robustness under KV-cache compression. Our experiments show substantial in-domain improvements, achieving gains of up to +20 points over the base model. However, out-of-domain generalization remains task dependent with large variance: LCLMs excel on finance questions (+9 points), while RAG shows stronger performance on multiple-choice questions (+6 points) over the baseline models. Finally, we show that our fine-tuning approaches bring moderate improvements in robustness under KV-cache compression, with gains varying across tasks.

[NLP-29] GenAI for Social Work Field Education: Client Simulation with Real-Time Feedback

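【Quick read】: This paper addresses the difficulty of giving timely, objective feedback in social work field education, where instructors and counseling clients are in short supply. The key to its solution is SWITCH, a Social Work Interactive Training Chatbot. Its core contributions are a cognitively grounded client simulation model with static and dynamic fields that produces realistic conversational behavior; high-accuracy recognition of counseling skills in user utterances through retrieval-augmented in-context learning and a fine-tuned BERT multi-label classifier; and an integrated Motivational Interviewing (MI) progression system that regulates MI stage transitions based on the recognized skills. Together these form a scalable, low-cost, and consistent training workflow that complements traditional field education and frees supervisors to focus on higher-level mentorship.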

Link: https://arxiv.org/abs/2601.18517
Authors: James Sungarda, Hongkai Liu, Zilong Zhou, Tien-Hsuan Wu, Johnson Chun-Sing Cheung, Ben Kao
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 2025 IEEE International Conference on Big Data. ISBN: 979-8-3315-9447-3/25. Page numbers: 3544-3553

Abstract:Field education is the signature pedagogy of social work, yet providing timely and objective feedback during training is constrained by the availability of instructors and counseling clients. In this paper, we present SWITCH, the Social Work Interactive Training Chatbot. SWITCH integrates realistic client simulation, real-time counseling skill classification, and a Motivational Interviewing (MI) progression system into the training workflow. To model a client, SWITCH uses a cognitively grounded profile comprising static fields (e.g., background, beliefs) and dynamic fields (e.g., emotions, automatic thoughts, openness), allowing the agent’s behavior to evolve realistically throughout a session. The skill classification module identifies the counseling skills in the user utterances and feeds the result to the MI controller that regulates the MI stage transitions. To enhance classification accuracy, we study in-context learning with retrieval over annotated transcripts, and a fine-tuned BERT multi-label classifier. In the experiments, we demonstrate that both the BERT-based approach and in-context learning outperform the baseline by a large margin. SWITCH thereby offers a scalable, low-cost, and consistent training workflow that complements field education, and allows supervisors to focus on higher-level mentorship.

[NLP-30] Using Large Language Models to Construct Virtual Top Managers: A Method for Organizational Research

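【Quick read】: This paper addresses the difficulty of getting top managers to take part in organizational research, especially when real CEOs are hard to reach in practice. The key to its solution is to build virtual personas based on large language models (LLMs). These personas are theoretically scaffolded by combining real CEO communications with Moral Foundations Theory so that they simulate the decision-making of specific leaders. Across three phases, the study assesses the construct validity, reliability, and behavioral fidelity of the virtual CEOs. The results indicate that such LLM-generated personas approximate the moral judgments observed in human samples, supporting their use as credible, complementary research tools.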

Link: https://arxiv.org/abs/2601.18512
Authors: Antonio Garzon-Vico, Krithika Sharon Komalapati, Arsalan Shahid, Jan Rosier
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This study introduces a methodological framework that uses large language models to create virtual personas of real top managers. Drawing on real CEO communications and Moral Foundations Theory, we construct LLM-based participants that simulate the decision-making of individual leaders. Across three phases, we assess construct validity, reliability, and behavioral fidelity by benchmarking these virtual CEOs against human participants. Our results indicate that theoretically scaffolded personas approximate the moral judgements observed in human samples, suggesting that LLM-based personas can serve as credible and complementary tools for organizational research in contexts where direct access to executives is limited. We conclude by outlining implications for future research using LLM-based personas in organizational settings.

[NLP-31] AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

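【Quick read】: This paper addresses the complex safety and security risks that arise when AI agents use tools autonomously and interact with their environment, noting that current guardrail models lack awareness of agentic risk and transparency in risk diagnosis. The key to its solution is a unified three-dimensional risk taxonomy (source, failure mode, consequence), on which the authors build a fine-grained agent safety benchmark (ATBench) and a diagnostic guardrail framework (AgentDoG). AgentDoG provides fine-grained, contextual monitoring across agent trajectories and can diagnose the root causes of unsafe actions as well as actions that look safe but are unreasonable, offering provenance and transparency beyond binary labels and thereby supporting agent alignment.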

Link: https://arxiv.org/abs/2601.18491
Authors: Dongrui Liu, Qihan Ren, Chen Qian, Shuai Shao, Yuejin Xie, Yu Li, Zhonghao Yang, Haoyu Luo, Peng Wang, Qingyu Liu, Binxin Hu, Ling Tang, Jilin Mei, Dadi Guo, Leitao Yuan, Junyao Yang, Guanxu Chen, Qihao Lin, Yi Yu, Bo Zhang, Jiaxuan Guo, Jie Zhang, Wenqi Shao, Huiqi Deng, Zhiheng Xi, Wenjie Wang, Wenxuan Wang, Wen Shen, Zhikai Chen, Haoyu Xie, Jialing Tao, Juntao Dai, Jiaming Ji, Zhongjie Ba, Linfeng Zhang, Yong Liu, Quanshi Zhang, Lei Zhu, Zhihua Wei, Hui Xue, Chaochao Lu, Jing Shao, Xia Hu
Affiliations: Shanghai Artificial Intelligence Laboratory
Subjects: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 40 pages, 26 figures

Abstract:The rise of AI agents introduces complex safety and security challenges arising from autonomous tool use and environmental interactions. Current guardrail models lack agentic risk awareness and transparency in risk diagnosis. To introduce an agentic guardrail that covers complex and numerous risky behaviors, we first propose a unified three-dimensional taxonomy that orthogonally categorizes agentic risks by their source (where), failure mode (how), and consequence (what). Guided by this structured and hierarchical taxonomy, we introduce a new fine-grained agentic safety benchmark (ATBench) and a Diagnostic Guardrail framework for agent safety and security (AgentDoG). AgentDoG provides fine-grained and contextual monitoring across agent trajectories. More crucially, AgentDoG can diagnose the root causes of unsafe actions and seemingly safe but unreasonable actions, offering provenance and transparency beyond binary labels to facilitate effective agent alignment. AgentDoG variants are available in three sizes (4B, 7B, and 8B parameters) across Qwen and Llama model families. Extensive experimental results demonstrate that AgentDoG achieves state-of-the-art performance in agentic safety moderation in diverse and complex interactive scenarios. All models and datasets are openly released.

[NLP-32] Demographic Probing of Large Language Models Lacks Construct Validity

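[Quick read]: This paper addresses the **lack of construct validity in demographic probing**, the widely used approach for evaluating how Large Language Models (LLMs) adapt their behavior to demographic attributes such as race and gender. The authors note that existing methods typically rely on a single demographic cue (e.g., a name or dialect) as the signal of group membership and implicitly assume that such cues are interchangeable operationalizations of the same demographically conditioned behavior, an assumption that had not been tested. In realistic settings, cues pointing to the same demographic group induce only partially overlapping changes in model behavior, and differentiation between groups within a given cue is weak and uneven, so estimated disparities vary in both magnitude and direction. The key recommendation is to cross-check with multiple ecologically valid demographic cues and to explicitly control for linguistic confounders, which yields more stable and defensible inferences about demographic effects.

Link: https://arxiv.org/abs/2601.18486
Authors: Manuel Tonneau,Neil K. R. Seghal,Niyati Malhotra,Victor Orozco-Olvera,Ana María Muñoz Boudet,Lakshmi Subramanian,Sharath Chandra Guntuku,Valentin Hofmann
Institutions: World Bank; University of Oxford; New York University; University of Pennsylvania; Allen Institute for AI; University of Washington
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: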

Abstract:Demographic probing is widely used to study how large language models (LLMs) adapt their behavior to signaled demographic attributes. This approach typically uses a single demographic cue in isolation (e.g., a name or dialect) as a signal for group membership, implicitly assuming strong construct validity: that such cues are interchangeable operationalizations of the same underlying, demographically conditioned behavior. We test this assumption in realistic advice-seeking interactions, focusing on race and gender in a U.S. context. We find that cues intended to represent the same demographic group induce only partially overlapping changes in model behavior, while differentiation between groups within a given cue is weak and uneven. Consequently, estimated disparities are unstable, with both magnitude and direction varying across cues. We further show that these inconsistencies partly arise from variation in how strongly cues encode demographic attributes and from linguistic confounders that independently shape model behavior. Together, our findings suggest that demographic probing lacks construct validity: it does not yield a single, stable characterization of how LLMs condition on demographic information, which may reflect a misspecified or fragmented construct. We conclude by recommending the use of multiple, ecologically valid cues and explicit control of confounders to support more defensible claims about demographic effects in LLMs.

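[NLP-33] Funny or Persuasive but Not Both: Evaluating Fine-Grained Multi-Concept Control in LLMs EACL

[Quick read]: This paper addresses the lack of fine-grained control over specific semantic concepts in text generated by Large Language Models (LLMs), and in particular the performance drop that appears when several independent concepts (e.g., persuasiveness and humor) are controlled at once. Existing prompting and representation-engineering techniques typically offer only coarse or single-attribute control, and multi-attribute settings have lacked systematic evaluation. The paper proposes a fine-grained controllability evaluation framework for single- and dual-concept scenarios. Its key idea is to use linguistically distinct concept pairs to quantify how well generation ability is preserved under multi-attribute conditions. This exposes a fundamental compositionality limitation of current prompting-based methods: even when concepts are semantically separable, models struggle to control several attributes jointly. The framework gives future multi-concept control methods a quantifiable benchmark.

Link: https://arxiv.org/abs/2601.18483
Authors: Arya Labroo,Ivaxi Sheth,Vyas Raina,Amaani Ahmed,Mario Fritz
Institution: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted for publication at EACL main conference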

Abstract:Large Language Models (LLMs) offer strong generative capabilities, but many applications require explicit and fine-grained control over specific textual concepts, such as humor, persuasiveness, or formality. Prior approaches in prompting and representation engineering can provide coarse or single-attribute control, but systematic evaluation of multi-attribute settings remains limited. We introduce an evaluation framework for fine-grained controllability for both single- and dual-concept scenarios, focusing on linguistically distinct concept pairs (e.g., persuasiveness vs. humor). Surprisingly, across multiple LLMs and generative tasks, we find that performance often drops in the dual-concept setting, even though the chosen concepts should in principle be separable. This reveals a fundamental limitation of naive prompting-based control: models struggle with compositionality even when concepts are intuitively independent. Our framework provides systematic evidence of this gap and offers a principled approach for measuring the ability of future methods for multi-concept control.

[NLP-34] Latent Knowledge as a Predictor of Fact Acquisition in Fine-Tuned Large Language Models

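[Quick read]: This paper studies the uneven storage of biomedical knowledge in Large Language Models (LLMs): some facts are encoded in the weights but cannot be reliably extracted under deterministic decoding (latent knowledge), while others are scarcely represented at all. The authors fine-tune Llama 3.1 8B Instruct to learn term-to-identifier mappings from the Human Phenotype Ontology (HPO) and the Gene Ontology (GO), use stochastic decoding to detect latent knowledge at baseline, and fit Cox proportional hazards models to identify predictors of fact acquisition, generalization, and degradation. The key idea is to model learning as a time-to-event process with latent knowledge as the central predictor. Latent knowledge markedly speeds up fact acquisition (HR 2.6) and is associated with earlier, higher peak learning rates and faster convergence. Generalization to withheld GO facts was uncommon (5.8%) but more likely when latent knowledge was present, and previously correct mappings degraded more often for withheld terms than for trained ones, which suggests that reinforcement during training protects against degradation.

Link: https://arxiv.org/abs/2601.18468
Authors: Daniel B. Hier,Tayo Obafemi-Ajayi
Institution: Unknown
Categories: Computation and Language (cs.CL)
Comments: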

Abstract:Large language models store biomedical facts with uneven strength after pretraining: some facts are present in the weights but are not reliably accessible under deterministic decoding (latent knowledge), while others are scarcely represented. We fine tuned Llama 3.1 8B Instruct to learn ontology term identifier mappings from the Human Phenotype Ontology (800 pairs) and the Gene Ontology (400 training pairs), withholding 400 GO pairs to test generalization. Treating learning as a time to event process across 20 epochs, we used stochastic decoding to detect latent knowledge at baseline and Cox proportional hazards models to identify predictors of acquisition, generalization, and degradation. Baseline deterministic recall for HPO was 2.8%, rising to 71.9% after fine-tuning. Latent knowledge was the strongest predictor of faster fact acquisition (HR 2.6) and was associated with earlier, higher peak learning rates and faster convergence; identifier frequency and curated annotation counts had smaller effects. Generalization to withheld GO facts was uncommon (5.8%) but more likely when latent knowledge was present. Previously correct GO mappings degraded more often for withheld (unseen) terms than for trained (seen) terms, suggesting a protective effect of reinforcement during training. These results show that latent knowledge predicts both the speed of factual learning during fine-tuning and the limited generalization of unseen ontology facts, while resistance to degradation depends on whether facts are reinforced.
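
The split between known, latent, and absent facts can be sketched in a few lines. This is a minimal illustration and not the authors' protocol: the function name `classify_fact`, the three labels, and the rule "any correct sample counts as latent" are all assumptions, because the abstract does not state the sampling budget or threshold. The identifier strings are only illustrative.

```python
from typing import Sequence


def _normalize(identifier: str) -> str:
    """Compare identifiers case-insensitively and ignore surrounding whitespace."""
    return identifier.strip().upper()


def classify_fact(greedy_answer: str, sampled_answers: Sequence[str], gold: str) -> str:
    """Label one fact from a deterministic decode and several stochastic decodes.

    Assumed rule (hypothetical, not taken from the paper):
      "known"  : the deterministic (greedy) answer matches the gold identifier
      "latent" : greedy is wrong, but at least one sampled answer is correct
      "absent" : neither decoding mode produces the gold identifier
    """
    target = _normalize(gold)
    if _normalize(greedy_answer) == target:
        return "known"
    if any(_normalize(s) == target for s in sampled_answers):
        return "latent"
    return "absent"


if __name__ == "__main__":
    # Greedy decoding misses, but one of three samples recovers the identifier.
    print(classify_fact("HP:0000002", ["HP:0000003", "HP:0001250", "HP:0000004"], "HP:0001250"))
```

In a real experiment the two answer sources would come from the same model queried with temperature 0 and with temperature above 0; here they are plain strings so the sketch stays self-contained.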

[NLP-35] Pisets: A Robust Speech Recognition System for Lectures and Interviews

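[Quick read]: This paper targets the errors and hallucinations common in Whisper-based speech recognition, especially on long recordings and under difficult acoustic conditions. The key to the solution is a three-component architecture: Wav2Vec2 performs primary recognition, an Audio Spectrogram Transformer (AST) filters out false positives, and Whisper produces the final transcription. Combined with curriculum learning, training on diverse Russian-language speech corpora, and uncertainty modeling, the system is reported to be more robust and more accurate than WhisperX and the standard Whisper model.

Link: https://arxiv.org/abs/2601.18415
Authors: Ivan Bondarenko,Daniil Grebenkin,Oleg Sedukhin,Mikhail Klementev,Roman Derunets,Lyudmila Budneva
Institutions: Novosibirsk State University; Siberian Neuronets LLC
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: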

Abstract:This work presents a speech-to-text system “Pisets” for scientists and journalists which is based on a three-component architecture aimed at improving speech recognition accuracy while minimizing errors and hallucinations associated with the Whisper model. The architecture comprises primary recognition using Wav2Vec2, false positive filtering via the Audio Spectrogram Transformer (AST), and final speech recognition through Whisper. The implementation of curriculum learning methods and the utilization of diverse Russian-language speech corpora significantly enhanced the system’s effectiveness. Additionally, advanced uncertainty modeling techniques were introduced, contributing to further improvements in transcription quality. The proposed approaches ensure robust transcribing of long audio data across various acoustic conditions compared to WhisperX and the usual Whisper model. The source code of “Pisets” system is publicly available at GitHub: this https URL.

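[NLP-36] Do not be greedy, Think Twice: Sampling and Selection for Document-level Information Extraction ECAI2026 IJCAI

[Quick read]: This paper addresses a limitation of Document-level Information Extraction (DocIE): greedy decoding fixes a single output and leaves better candidate templates unexplored. The key to the solution is the ThinkTwice framework, in which a Large Language Model (LLM) generates multiple candidate templates for the same document and a selection module picks one. Two selectors are proposed: an unsupervised method that exploits agreement across the generated outputs, and a supervised method that scores templates with a reward model trained on labeled DocIE data. To offset the scarcity of gold reasoning trajectories for DocIE, the authors also design a rejection-sampling procedure that produces silver training data pairing output templates with reasoning traces. Experiments show that the framework consistently outperforms greedy baselines and the state of the art.

Link: https://arxiv.org/abs/2601.18395
Authors: Mikel Zubillaga,Oscar Sainz,Oier Lopez de Lacalle,Eneko Agirre
Institution: HiTZ Center - Ixa, University of the Basque Country UPV/EHU
Categories: Computation and Language (cs.CL)
Comments: Submitted to IJCAI-ECAI 2026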

Abstract:Document-level Information Extraction (DocIE) aims to produce an output template with the entities and relations of interest occurring in the given document. Standard practices include prompting decoder-only LLMs using greedy decoding to avoid output variability. Rather than treating this variability as a limitation, we show that sampling can produce substantially better solutions than greedy decoding, especially when using reasoning models. We thus propose ThinkTwice, a sampling and selection framework in which the LLM generates multiple candidate templates for a given document, and a selection module chooses the most suitable one. We introduce both an unsupervised method that exploits agreement across generated outputs, and a supervised selection method using reward models trained on labeled DocIE data. To address the scarcity of golden reasoning trajectories for DocIE, we propose a rejection-sampling-based method to generate silver training data that pairs output templates with reasoning traces. Our experiments show the validity of unsupervised and supervised ThinkTwice, consistently outperforming greedy baselines and the state-of-the-art.
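
The unsupervised selector "exploits agreement across generated outputs". One plausible reading, sketched below, is to pick the sampled template that overlaps most with the other samples. This is an assumption about the general idea, not the paper's exact scoring rule: representing a template as a set of `(slot, value)` pairs, the pairwise F1 measure, and the names `f1_overlap` and `select_by_agreement` are all hypothetical.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical representation: a template is a set of (slot, value) pairs.
Template = FrozenSet[Tuple[str, str]]


def f1_overlap(a: Template, b: Template) -> float:
    """F1 between two templates viewed as sets of (slot, value) pairs."""
    if not a and not b:
        return 1.0  # two empty templates agree perfectly
    shared = len(a & b)
    if shared == 0:
        return 0.0
    precision = shared / len(a)
    recall = shared / len(b)
    return 2 * precision * recall / (precision + recall)


def select_by_agreement(candidates: List[Template]) -> int:
    """Return the index of the candidate with the highest mean F1 against the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return 0

    def mean_agreement(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(f1_overlap(candidates[i], o) for o in others) / len(others)

    # max() keeps the first index on ties, so earlier samples win ties.
    return max(range(len(candidates)), key=mean_agreement)


if __name__ == "__main__":
    samples = [
        frozenset({("attacker", "group A"), ("target", "embassy")}),
        frozenset({("attacker", "group A"), ("target", "embassy")}),
        frozenset({("attacker", "group C")}),
    ]
    print(select_by_agreement(samples))
```

A supervised variant would replace `mean_agreement` with a reward-model score per candidate; the selection loop stays the same.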

[NLP-37] OCR-Enhanced Multimodal ASR Can Read While Listening ICASSP2026

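[Quick read]: This paper addresses the limits of audio-only Automatic Speech Recognition (ASR) in multilingual settings, where noisy or unclear speech makes high accuracy hard to reach from audio alone. The authors propose Donut-Whisper, whose core innovation is a dual-encoder architecture that fuses audio with visual information such as subtitles. A cross-attention module combines the linear and the Q-Former-based modality alignment structures to produce more discriminative joint audio-visual features. The paper also designs a lightweight knowledge distillation scheme in which the audio-visual model teaches an audio-only model. On a new multilingual audio-visual dataset built from Chinese and English movie clips, the method clearly outperforms the baselines, with an absolute WER reduction of 5.75% on the English subset and an absolute CER reduction of 16.5% on the Chinese subset.

Link: https://arxiv.org/abs/2601.18393
Authors: Junli Chen,Changli Tang,Yixuan Li,Guangzhi Sun,Chao Zhang
Institution: Unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: 4 pages, 2 figures. Submitted to ICASSP 2026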

Abstract:Visual information, such as subtitles in a movie, often helps automatic speech recognition. In this paper, we propose Donut-Whisper, an audio-visual ASR model with dual encoder to leverage visual information to improve speech recognition performance in both English and Chinese. Donut-Whisper combines the advantage of the linear and the Q-Former-based modality alignment structures via a cross-attention module, generating more powerful audio-visual features. Meanwhile, we propose a lightweight knowledge distillation scheme showcasing the potential of using audio-visual models to teach audio-only models to achieve better performance. Moreover, we propose a new multilingual audio-visual speech recognition dataset based on movie clips containing both Chinese and English partitions. As a result, Donut-Whisper achieved significantly better performance on both English and Chinese partitions of the dataset compared to both Donut and Whisper large V3 baselines. In particular, an absolute 5.75% WER reduction and a 16.5% absolute CER reduction were achieved on the English and Chinese sets respectively compared to the Whisper ASR baseline.
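
The reported gains are absolute reductions, that is, percentage-point differences between the baseline error rate and the model error rate. As a reminder of what is being measured, here is the standard word and character error rate computed via Levenshtein distance. The function names are our own and the example strings are made up; nothing here comes from the paper's evaluation code.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference token
                cur[j - 1] + 1,          # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # substitution, or match at zero cost
            ))
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, the usual choice for Chinese transcripts."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def absolute_reduction(baseline_rate: float, model_rate: float) -> float:
    """Absolute (percentage-point) reduction, as opposed to a relative one."""
    return baseline_rate - model_rate


if __name__ == "__main__":
    print(wer("the cat sat", "the cat sat down"))
    print(cer("abc", "abd"))
```

Under this definition, a baseline at 10% WER and a model at 4.25% WER would give an absolute reduction of 5.75 points, whatever the relative improvement is.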

[NLP-38] Dynamic Thinking-Token Selection for Efficient Reasoning in Large Reasoning Models

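[Quick read]: This paper addresses the high memory footprint and computational overhead that Large Reasoning Models (LRMs) incur when they generate long reasoning traces for complex tasks. The proposed solution is Dynamic Thinking-Token Selection (DynTS). Its key observation, obtained by analyzing attention maps, is that only some decision-critical tokens in a reasoning trace steer the model toward the final answer, while the remaining tokens contribute negligibly. DynTS therefore retains only the Key-Value (KV) cache states associated with these critical tokens during inference and evicts the redundant entries, which makes reasoning more efficient.

Link: https://arxiv.org/abs/2601.18383
Authors: Zhenyuan Guo,Tong Chen,Wenlong Meng,Chen Gong,Xin Yu,Chengkun Wei,Wenzhi Chen
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: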

Abstract:Large Reasoning Models (LRMs) excel at solving complex problems by explicitly generating a reasoning trace before deriving the final answer. However, these extended generations incur substantial memory footprint and computational overhead, bottlenecking LRMs’ efficiency. This work uses attention maps to analyze the influence of reasoning traces and uncover an interesting phenomenon: only some decision-critical tokens in a reasoning trace steer the model toward the final answer, while the remaining tokens contribute negligibly. Building on this observation, we propose Dynamic Thinking-Token Selection (DynTS). This method identifies decision-critical tokens and retains only their associated Key-Value (KV) cache states during inference, evicting the remaining redundant entries to optimize efficiency.
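
The eviction step can be pictured with a toy cache. The sketch below assumes each cached token already has an importance score (for instance, accumulated attention mass) and keeps a fixed budget of the highest-scoring entries. How DynTS actually scores and schedules tokens is not described in the abstract, so the scoring input, the fixed `budget`, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def select_tokens_to_keep(importance: Sequence[float], budget: int) -> List[int]:
    """Indices of the `budget` highest-scoring tokens, returned in original order."""
    if budget <= 0:
        return []
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:budget])


def evict_kv(
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
    importance: Sequence[float],
    budget: int,
) -> Tuple[List[Sequence[float]], List[Sequence[float]], List[int]]:
    """Keep only the KV entries of the selected tokens; everything else is evicted."""
    if not (len(keys) == len(values) == len(importance)):
        raise ValueError("keys, values and importance must have the same length")
    keep = select_tokens_to_keep(importance, budget)
    return [keys[i] for i in keep], [values[i] for i in keep], keep


if __name__ == "__main__":
    # Four cached tokens with 2-dimensional toy key/value vectors.
    k = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
    v = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
    scores = [0.10, 0.70, 0.05, 0.90]
    print(evict_kv(k, v, scores, budget=2))
```

Keeping the surviving indices in their original order matters in practice because positional information is tied to cache order.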

[NLP-39] Corpus-Based Approaches to Igbo Diacritic Restoration

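[Quick read]: This thesis addresses diacritic ambiguity in low-resource languages, Igbo in particular, where the scarcity of data makes it hard to recover the original word form once diacritics are missing from the text. The key to the solution is a flexible framework for generating diacritic-restoration datasets, together with three families of methods: standard n-gram language models, classification models over a context window, and embedding models based on vector similarity. The embedding models pick the most likely diacritic variant by comparing similarity scores between the combined context word embeddings and the vector of each candidate variant, which makes effective use of contextual information.

Link: https://arxiv.org/abs/2601.18380
Authors: Ignatius Ezeani
Institution: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR)
Comments: 270 pages. Ph.D. Thesis. The University of Sheffield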

Abstract:With natural language processing (NLP), researchers aim to enable computers to identify and understand patterns in human languages. This is often difficult because a language embeds many dynamic and varied properties in its syntax, pragmatics and phonology, which need to be captured and processed. The capacity of computers to process natural languages is increasing because NLP researchers are pushing its boundaries. But these research works focus more on well-resourced languages such as English, Japanese, German, French, Russian, Mandarin Chinese, etc. Over 95% of the world’s 7000 languages are low-resourced for NLP, i.e. they have little or no data, tools, and techniques for NLP work. In this thesis, we present an overview of diacritic ambiguity and a review of previous diacritic disambiguation approaches on other languages. Focusing on the Igbo language, we report the steps taken to develop a flexible framework for generating datasets for diacritic restoration. Three main approaches were proposed: the standard n-gram models, the classification models, and the embedding models. The standard n-gram models use a sequence of previous words to the target stripped word as key predictors of the correct variants. For the classification models, a window of words on both sides of the target stripped word was used. The embedding models compare the similarity scores of the combined context word embeddings and the embeddings of each of the candidate variant vectors.
Cite as: arXiv:2601.18380 [cs.CL] (or arXiv:2601.18380v1 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2601.18380
Journal reference: 2019 White Rose eTheses Online
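
The n-gram idea, using the preceding word to choose among diacritic variants of a stripped word, fits in a short sketch. The code below is a bigram toy with a unigram fallback, not the thesis implementation: the class name `BigramRestorer`, the `<s>` start marker, and the two training strings are assumptions, and the strings are illustrative rather than checked Igbo sentences.

```python
import unicodedata
from collections import Counter, defaultdict
from typing import Iterable


def strip_diacritics(word: str) -> str:
    """Remove combining marks (tone marks, dot-below) after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


class BigramRestorer:
    """Pick the variant seen most often after the previous (stripped) word."""

    def __init__(self) -> None:
        self.bigram = defaultdict(Counter)   # (previous stripped word, stripped word) -> variant counts
        self.unigram = defaultdict(Counter)  # stripped word -> variant counts

    def train(self, sentences: Iterable[str]) -> None:
        for sentence in sentences:
            prev = "<s>"
            for word in sentence.split():
                word = unicodedata.normalize("NFC", word)
                key = strip_diacritics(word)
                self.bigram[(prev, key)][word] += 1
                self.unigram[key][word] += 1
                prev = key

    def restore(self, stripped_sentence: str) -> str:
        restored, prev = [], "<s>"
        for key in stripped_sentence.split():
            # Fall back to unigram counts, then to the stripped form itself.
            counts = self.bigram.get((prev, key)) or self.unigram.get(key)
            restored.append(counts.most_common(1)[0][0] if counts else key)
            prev = key
        return " ".join(restored)


if __name__ == "__main__":
    model = BigramRestorer()
    model.train(["o ji ákwà", "o riri àkwá"])  # toy strings, not validated Igbo
    print(model.restore("o ji akwa"))
    print(model.restore("o riri akwa"))
```

The classification and embedding models described in the abstract differ mainly in what they condition on: a window on both sides of the target rather than only the preceding words.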

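[NLP-40] Hierarchical Text Classification with LLM-Refined Taxonomies

[Quick read]: This paper addresses a problem in Hierarchical Text Classification (HTC): real-world taxonomies contain semantic ambiguities, such as identical leaf names under similar parent nodes, that prevent language models from learning clear decision boundaries. The key to the solution is TaxMorph, a framework that uses Large Language Models (LLMs) to restructure the entire taxonomy through renaming, merging, splitting, and reordering, so that it better matches the semantic space encoded by the models. Experiments show that the LLM-refined taxonomies outperform human-curated ones on several benchmarks, with F1 gains of up to 2.9 percentage points.

Link: https://arxiv.org/abs/2601.18375
Authors: Jonas Golde,Nicolaas Jedema,Ravi Krishnan,Phong Le
Institutions: Humboldt Universität zu Berlin; Amazon; Meta; School of Computer Science, University of St Andrews
Categories: Computation and Language (cs.CL)
Comments: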

Abstract:Hierarchical text classification (HTC) depends on taxonomies that organize labels into structured hierarchies. However, many real-world taxonomies introduce ambiguities, such as identical leaf names under similar parent nodes, which prevent language models (LMs) from learning clear decision boundaries. In this paper, we present TaxMorph, a framework that uses large language models (LLMs) to transform entire taxonomies through operations such as renaming, merging, splitting, and reordering. Unlike prior work, our method revises the full hierarchy to better match the semantics encoded by LMs. Experiments across three HTC benchmarks show that LLM-refined taxonomies consistently outperform human-curated ones in various settings up to +2.9pp. in F1. To better understand these improvements, we compare how well LMs can assign leaf nodes to parent nodes and vice versa across human-curated and LLM-refined taxonomies. We find that human-curated taxonomies lead to more easily separable clusters in embedding space. However, the LLM-refined taxonomies align more closely with the model’s actual confusion patterns during classification. In other words, even though they are harder to separate, they better reflect the model’s inductive biases. These findings suggest that LLM-guided refinement creates taxonomies that are more compatible with how models learn, improving HTC performance.

[NLP-41] CitiLink: Enhancing Municipal Transparency and Citizen Engagement through Searchable Meeting Minutes

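[Quick read]: This paper addresses the difficulty citizens and journalists face in finding information in city council minutes, which are structurally complex and formally worded. The key to the solution is CitiLink, a platform that uses Large Language Models (LLMs) to extract metadata, discussed subjects, and voting outcomes from unstructured text and stores them in a database. Full-text search with BM25 ranking and faceted filtering are then exposed through a user-facing interface, improving the accessibility and transparency of local government information.

Link: https://arxiv.org/abs/2601.18374
Authors: Rodrigo Silva,José Evans,José Isidro,Miguel Marques,Afonso Fonseca,Ricardo Morais,João Canavilhas,Arian Pasquali,Purificação Silvano,Alípio Jorge,Nuno Guimarães,Sérgio Nunes,Ricardo Campos
Institution: Unknown
Categories: Computation and Language (cs.CL)
Comments: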

Abstract:City council minutes are typically lengthy and formal documents with a bureaucratic writing style. Although publicly available, their structure often makes it difficult for citizens or journalists to efficiently find information. In this demo, we present CitiLink, a platform designed to transform unstructured municipal meeting minutes into structured and searchable data, demonstrating how NLP and IR can enhance the accessibility and transparency of local government. The system employs LLMs to extract metadata, discussed subjects, and voting outcomes, which are then indexed in a database to support full-text search with BM25 ranking and faceted filtering through a user-friendly interface. The developed system was built over a collection of 120 minutes made available by six Portuguese municipalities. To assess its usability, CitiLink was tested through guided sessions with municipal personnel, providing insights into how real users interact with the system. In addition, we evaluated Gemini’s performance in extracting relevant information from the minutes, highlighting its effectiveness in data extraction.
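
For readers unfamiliar with BM25, the ranking function mentioned above scores a document by summing, over query terms, an inverse-document-frequency weight times a saturating, length-normalized term frequency. Below is a small self-contained version of the common Okapi formulation. The whitespace tokenizer, the defaults `k1=1.5` and `b=0.75`, and the sample documents are assumptions for illustration; CitiLink's actual search backend is not specified in the abstract.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            # "+ 1" inside the log keeps the IDF non-negative for very common terms.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = 1 - b + b * len(d) / avgdl
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


if __name__ == "__main__":
    minutes = [
        "council approved the annual budget",
        "vote on road repairs in the parish",
        "budget vote postponed by the council",
    ]
    ranked = sorted(zip(bm25_scores("road repairs", minutes), minutes), reverse=True)
    for score, text in ranked:
        print(round(score, 3), text)
```

Faceted filtering would sit in front of this step: the extracted metadata narrows the candidate set, and BM25 orders what remains.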

[NLP-42] Can Good Writing Be Generative? Expert-Level AI Writing Emerges through Fine-Tuning on High-Quality Books

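[Quick read]: This paper asks whether Generative AI can convincingly emulate the writing styles of human authors, which would challenge the long-held view that creative writing is a uniquely human domain. The key to the study is a behavioral experiment in which 28 MFA-trained writers competed against three Large Language Models (LLMs) in emulating the styles of 50 critically acclaimed authors, with the results assessed through blind pairwise comparisons by 28 expert judges and 131 lay judges. Experts preferred human writing under in-context prompting, but the preference flipped in favor of AI once the models were fine-tuned on the authors' complete works; lay judges preferred AI writing throughout. The design exposes both the strength of AI at style emulation and its impact on writers' sense of identity, and provides empirical grounding for understanding the role of AI in creative labor.

Link: https://arxiv.org/abs/2601.18353
Authors: Tuhin Chakrabarty,Paramveer S. Dhillon
Institutions: Stony Brook University; University of Michigan
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Proceedings of CHI 2026 Conference (To Appear)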

Abstract:Creative writing has long been considered a uniquely human endeavor, requiring voice and style that machines could not replicate. This assumption is challenged by Generative AI that can emulate thousands of author styles in seconds with negligible marginal labor. To understand this better, we conducted a behavioral experiment where 28 MFA writers (experts) competed against three LLMs in emulating 50 critically acclaimed authors. Based on blind pairwise comparisons by 28 expert judges and 131 lay judges, we find that experts preferred human writing in 82.7% of cases under the in-context prompting condition but this reversed to 62% preference for AI after fine-tuning on authors’ complete works. Lay judges, however, consistently preferred AI writing. Debrief interviews with expert writers revealed that their preference for AI writing triggered an identity crisis, eroding aesthetic confidence and questioning what constitutes “good writing.” These findings challenge discourse about AI’s creative limitations and raise fundamental questions about the future of creative labor.

[NLP-43] Code over Words: Overcoming Semantic Inertia via Code-Grounded Reasoning

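[Quick read]: This paper addresses Semantic Inertia in Large Language Models (LLMs): in dynamic in-context settings, models fail to suppress priors formed during pre-training (e.g., "lava is dangerous") and hold on to them even when given an explicit counterfactual rule (e.g., "lava is safe"). The key to the solution is to represent the dynamic rules as executable code instead of natural-language descriptions, which separates logical constraints from descriptive semantics and avoids the entanglement of the two in natural-language encoding. Concretely, the authors introduce Code-Grounded Vistas (LCV), which fine-tunes models on counterfactual rule pairs and identifies states with conflicting rules, forcing the model to attend to logical constraints over visual or commonsense semantics and thereby inhibiting the prior. The approach outperforms strategies that rely on expensive inference-time search in both efficiency and accuracy.

Link: https://arxiv.org/abs/2601.18352
Authors: Manjie Xu,Isabella Yin,Xinyi Tu,Chi Zhang,Yixin Zhu
Institutions: Peking University; Tsinghua International School; University of California, Berkeley; State Key Lab of General AI, Peking University; Beijing Key Laboratory of Behavior and Mental Health, Peking University; Embodied Intelligence Lab, PKU-Wuhan Institute for Artificial Intelligence
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: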

Abstract:LLMs struggle with Semantic Inertia: the inability to inhibit pre-trained priors (e.g., “Lava is Dangerous”) when dynamic, in-context rules contradict them. We probe this phenomenon using Baba Is You, where physical laws are mutable text rules, enabling precise evaluation of models’ ability to override learned priors when rules change. We quantitatively observe that larger models can exhibit inverse scaling: they perform worse than smaller models when natural language reasoning requires suppressing pre-trained associations (e.g., accepting “Lava is Safe”). Our analysis attributes this to natural language encoding, which entangles descriptive semantics and logical rules, leading to persistent hallucinations of familiar physics despite explicit contradictory rules. Here we show that representing dynamics as executable code, rather than descriptive text, reverses this trend and enables effective prior inhibition. We introduce Code-Grounded Vistas (LCV), which fine-tunes models on counterfactual pairs and identifies states with contradictory rules, thereby forcing attention to logical constraints rather than visual semantics. This training-time approach outperforms expensive inference-time search methods in both efficiency and accuracy. Our results demonstrate that representation fundamentally determines whether scaling improves or impairs contextual reasoning. This challenges the assumption that larger models are universally better, with implications for domains that require dynamic overriding of learned priors.

[NLP-44] When Domain Pretraining Interferes with Instruction Alignment: An Empirical Study of Adapter Merging in Medical LLMs

[Quick read]: This paper addresses two weaknesses of Large Language Models (LLMs) in medicine: imprecise terminology and unreliable following of safety-critical instructions. The key to the solution is a two-stage LoRA (Low-Rank Adaptation) pipeline. Domain-Adaptive Pre-Training (DAPT) first injects broad medical knowledge, and Supervised Fine-Tuning (SFT) then aligns the model with medical question-answering instructions. To balance instruction following against retention of domain knowledge, the author further introduces Weighted Adapter Merging, which linearly combines the SFT and DAPT adapters before the final model is exported.

Link: https://arxiv.org/abs/2601.18350
Authors: Junyi Zou
Institution: Zjydiary Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) show strong general capability but often struggle with medical terminology precision and safety-critical instruction following. We present a case study for adapter interference in safety-critical domains using a 14B-parameter base model through a two-stage LoRA pipeline: (1) domain-adaptive pre-training (PT) to inject broad medical knowledge via continued pre-training (DAPT), and (2) supervised fine-tuning (SFT) to align the model with medical question-answering behaviors through instruction-style data. To balance instruction-following ability and domain knowledge retention, we propose Weighted Adapter Merging, linearly combining SFT and PT adapters before exporting a merged base-model checkpoint. On a held-out medical validation set (F5/F6), the merged model achieves BLEU-4 = 16.38, ROUGE-1 = 20.42, ROUGE-2 = 4.60, and ROUGE-L = 11.54 under a practical decoding configuration. We further analyze decoding sensitivity and training stability with loss curves and controlled decoding comparisons.
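
Linearly combining two LoRA adapters and folding the result into the base weights can be written out directly. The sketch below follows the usual LoRA convention that an adapter contributes `(alpha / rank) * B @ A` to a weight matrix and then takes a weighted sum of the two deltas. The merge weights `0.75` and `0.25`, the tiny matrices, and the helper names are assumptions for illustration; the abstract does not give the weights actually used.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, adequate for toy sizes."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(b_mat: Matrix, a_mat: Matrix, alpha: float, rank: int) -> Matrix:
    """Weight update contributed by one LoRA adapter: (alpha / rank) * B @ A."""
    scale = alpha / rank
    return [[scale * v for v in row] for row in matmul(b_mat, a_mat)]


def merge_adapters(base: Matrix, deltas: Sequence[Matrix], weights: Sequence[float]) -> Matrix:
    """Return base + sum_i weights[i] * deltas[i] without mutating the inputs."""
    if len(deltas) != len(weights):
        raise ValueError("need one weight per adapter delta")
    merged = [[float(v) for v in row] for row in base]
    for delta, w in zip(deltas, weights):
        for i, row in enumerate(delta):
            for j, v in enumerate(row):
                merged[i][j] += w * v
    return merged


if __name__ == "__main__":
    w0 = [[1.0, 0.0], [0.0, 1.0]]                         # toy base weight matrix
    sft = lora_delta([[1.0], [0.0]], [[0.0, 2.0]], 1, 1)  # rank-1 "SFT" adapter
    pt = lora_delta([[0.0], [1.0]], [[4.0, 0.0]], 1, 1)   # rank-1 "PT" adapter
    print(merge_adapters(w0, [sft, pt], weights=[0.75, 0.25]))
```

Sweeping the two weights is the natural experiment here: it traces how much instruction-following behavior is traded for domain knowledge as the PT share grows.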

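[NLP-45] Overalignment in Frontier LLMs: An Empirical Study of Sycophantic Behaviour in Healthcare

[Quick read]: This paper addresses the patient-safety risk posed by sycophancy in Large Language Models (LLMs) used in clinical settings, that is, the tendency to agree with the user instead of holding to factual accuracy. The key to the solution is a new quantitative metric, the Adjusted Sycophancy Score, which accounts for "confusability" (instability caused by the model's own stochasticity) so that genuinely sycophantic responses to authoritative user suggestions can be isolated more accurately. The study is validated on medical Multiple-Choice Question Answering (MCQA) data with verifiable ground truth. It reveals a counter-intuitive vulnerability: reasoning-optimized models show high baseline accuracy yet readily produce flawed reasoning traces under authoritative pressure. The authors stress that benchmark performance is not a proxy for clinical reliability and suggest that simpler reasoning structures may be more robust to expert-driven sycophancy.

Link: https://arxiv.org/abs/2601.18334
Authors: Clément Christophe,Wadood Mohammed Abdul,Prateek Munjal,Tathagata Raha,Ronnie Rajan,Praveenkumar Kanithi
Institution: M42, Abu Dhabi
Categories: Computation and Language (cs.CL)
Comments: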

Abstract:As LLMs are increasingly integrated into clinical workflows, their tendency for sycophancy, prioritizing user agreement over factual accuracy, poses significant risks to patient safety. While existing evaluations often rely on subjective datasets, we introduce a robust framework grounded in medical MCQA with verifiable ground truths. We propose the Adjusted Sycophancy Score, a novel metric that isolates alignment bias by accounting for stochastic model instability, or “confusability”. Through an extensive scaling analysis of the Qwen-3 and Llama-3 families, we identify a clear scaling trajectory for resilience. Furthermore, we reveal a counter-intuitive vulnerability in reasoning-optimized “Thinking” models: while they demonstrate high vanilla accuracy, their internal reasoning traces frequently rationalize incorrect user suggestions under authoritative pressure. Our results across frontier models suggest that benchmark performance is not a proxy for clinical reliability, and that simplified reasoning structures may offer superior robustness against expert-driven sycophancy.

[NLP-46] Integrating Fine-Grained Audio-Visual Evidence for Robust Multimodal Emotion Reasoning

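[Quick read]: This paper addresses unimodal dominance in current Multimodal Large Language Models (MLLMs) during fine-grained emotion reasoning. The problem stems from data scarcity and insufficient cross-modal fusion, and it leads to hallucinations when visual and acoustic cues are subtle, ambiguous, or contradictory (as in sarcasm). The key to the solution is the SABER-LLM framework. First, the authors build SABER, a large-scale emotion reasoning dataset of 600K video clips annotated with a six-dimensional schema that jointly captures audiovisual cues and causal logic. Second, a structured evidence decomposition paradigm enforces a "perceive-then-reason" separation to alleviate unimodal dominance. Finally, consistency-aware direct preference optimization explicitly strengthens alignment among modalities under ambiguous or conflicting perceptual conditions, which markedly improves robustness on complex emotional dynamics.

Link: https://arxiv.org/abs/2601.18321
Authors: Zhixian Zhao,Wenjie Tian,Xiaohai Tian,Jun Zhang,Lei Xie
Institution: Northwestern Polytechnical University
Categories: Multimedia (cs.MM); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: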

Abstract:Multimodal emotion analysis is shifting from static classification to generative reasoning. Beyond simple label prediction, robust affective reasoning must synthesize fine-grained signals such as facial micro-expressions and prosodic shifts to decode the latent causality within complex social contexts. However, current Multimodal Large Language Models (MLLMs) face significant limitations in fine-grained perception, primarily due to data scarcity and insufficient cross-modal fusion. As a result, these models often exhibit unimodal dominance which leads to hallucinations in complex multimodal interactions, particularly when visual and acoustic cues are subtle, ambiguous, or even contradictory (e.g., in sarcastic scenes). To address this, we introduce SABER-LLM, a framework designed for robust multimodal reasoning. First, we construct SABER, a large-scale emotion reasoning dataset comprising 600K video clips, annotated with a novel six-dimensional schema that jointly captures audiovisual cues and causal logic. Second, we propose the structured evidence decomposition paradigm, which enforces a “perceive-then-reason” separation between evidence extraction and reasoning to alleviate unimodal dominance. The ability to perceive complex scenes is further reinforced by consistency-aware direct preference optimization, which explicitly encourages alignment among modalities under ambiguous or conflicting perceptual conditions. Experiments on EMER, EmoBench-M, and SABER-Test demonstrate that SABER-LLM significantly outperforms open-source baselines and achieves robustness competitive with closed-source models in decoding complex emotional dynamics. The dataset and model are available at this https URL.

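[NLP-47] MultiVis-Agent: A Multi-Agent Framework with Logic Rules for Reliable and Comprehensive Cross-Modal Data Visualization SIGMOD2026

[Quick read]: This paper addresses two problems in automated visualization generation: complex multi-modal requirements and insufficient system reliability. Existing systems accept only single-modality input, generate in one shot, and follow rigid workflows, so they cannot handle real multi-stage tasks that involve reference images, code examples, and iterative refinement. Approaches based on Large Language Models (LLMs) show potential but carry reliability risks such as catastrophic failures and infinite loops. The key to the solution is the MultiVis-Agent framework, whose core innovation is a four-layer logic rule framework: the rules act as mathematical constraints that guide LLM reasoning instead of replacing it, providing mathematical guarantees of reliability while preserving flexibility. Together with the MultiVis-Bench benchmark for multi-modal visualization tasks, experiments show a visualization score of 75.63% against 57.54-62.79% for baselines, and a task completion rate of 99.58% and code execution success rate of 94.56% against 74.48% and 65.10% without logic rules.

Link: https://arxiv.org/abs/2601.18320
Authors: Jinwei Lu,Yuanfeng Song,Chen Zhang,Raymond Chi-Wing Wong
Institutions: The Hong Kong Polytechnic University; Shanghai ByteDance; The Hong Kong University of Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: Accepted to SIGMOD 2026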

Abstract:Real-world visualization tasks involve complex, multi-modal requirements that extend beyond simple text-to-chart generation, requiring reference images, code examples, and iterative refinement. Current systems exhibit fundamental limitations: single-modality input, one-shot generation, and rigid workflows. While LLM-based approaches show potential for these complex requirements, they introduce reliability challenges including catastrophic failures and infinite loop susceptibility. To address this gap, we propose MultiVis-Agent, a logic rule-enhanced multi-agent framework for reliable multi-modal and multi-scenario visualization generation. Our approach introduces a four-layer logic rule framework that provides mathematical guarantees for system reliability while maintaining flexibility. Unlike traditional rule-based systems, our logic rules are mathematical constraints that guide LLM reasoning rather than replacing it. We formalize the MultiVis task spanning four scenarios from basic generation to iterative refinement, and develop MultiVis-Bench, a benchmark with over 1,000 cases for multi-modal visualization evaluation. Extensive experiments demonstrate that our approach achieves 75.63% visualization score on challenging tasks, significantly outperforming baselines (57.54-62.79%), with task completion rates of 99.58% and code execution success rates of 94.56% (vs. 74.48% and 65.10% without logic rules), successfully addressing both complexity and reliability challenges in automated visualization generation.

[NLP-48] Calibrating Beyond English: Language Diversity for Better Quantized Multilingual LLM EACL2026

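[Quick read]: This paper addresses the drop in multilingual performance that occurs when post-training quantization of Large Language Models (LLMs) relies on English-only calibration sets. It systematically evaluates how calibration sets with different language configurations affect quantization quality. The key finding is that non-English or mixed-language calibration data significantly lowers perplexity, with multilingual mixes performing best and reducing perplexity by up to 3.52 points. The paper also shows that tailoring the calibration set to the target language yields the largest gains for that language, and it traces the degradation seen for certain language-quantizer combinations to differences in activation range distributions. Together these results underline the importance of language-matched, diverse calibration data for robustly quantizing multilingual LLMs.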

Link: https://arxiv.org/abs/2601.18306
Authors: Everlyn Asiko Chimoto, Mostafa Elhoushi, Bruce A. Bassett
Affiliations: Lelapa AI; Cerebras Systems, Inc; University of the Witwatersrand; University of Cape Town; South African Astronomical Observatory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to EACL 2026 Main Conference

Abstract:Quantization is an effective technique for reducing the storage footprint and computational costs of Large Language Models (LLMs), but it often results in performance degradation. Existing post-training quantization methods typically use small, English-only calibration sets; however, their impact on multilingual models remains underexplored. We systematically evaluate eight calibration settings (five single-language and three multilingual mixes) on two quantizers (GPTQ, AWQ) on data from 10 languages. Our findings reveal a consistent trend: non-English and multilingual calibration sets significantly improve perplexity compared to English-only baselines. Specifically, we observe notable average perplexity gains across both quantizers on Llama3.1 8B and Qwen2.5 7B, with multilingual mixes achieving the largest overall reductions of up to 3.52 points in perplexity. Furthermore, our analysis indicates that tailoring calibration sets to the evaluation language yields the largest improvements for individual languages, underscoring the importance of linguistic alignment. We also identify specific failure cases where certain language-quantizer combinations degrade performance, which we trace to differences in activation range distributions across languages. These results highlight that static one-size-fits-all calibration is suboptimal and that tailoring calibration data, both in language and diversity, plays a crucial role in robustly quantizing multilingual LLMs.
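The gains above are stated in perplexity points. As a minimal sketch of what that measurement involves, the snippet below computes perplexity from per-token log-probabilities and takes the difference between two calibration settings. The log-probability values are made-up placeholders, not numbers from the paper.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probs for the same evaluation text, scored by
# two quantized models that differ only in their calibration set.
english_calibrated = [-2.0, -2.0, -2.0, -2.0]
multilingual_calibrated = [-1.5, -1.5, -1.5, -1.5]

ppl_en = perplexity(english_calibrated)
ppl_multi = perplexity(multilingual_calibrated)
gain = ppl_en - ppl_multi  # positive means the multilingual mix lowered perplexity
```

Because perplexity is an exponential of the mean log-loss, a fixed gain in points corresponds to different relative improvements depending on the baseline value.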

[NLP-49] Suppressing Final Layer Hidden State Jumps in Transformer Pretraining EACL2026

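[Quick read]: This paper examines a pattern observed when pre-training Transformer language models: hidden representations change only slightly across the middle layers, while a pronounced "jump" occurs at the final layer. This imbalance across layers may indicate that the model underuses the capacity of its middle layers. The key to the solution is a new training regularizer, the jump-suppressing regularizer (JREG), which penalizes abrupt changes in angular distance near the final layer during pre-training, encouraging more balanced use of the middle layers' representational capacity and improving overall task performance.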

Link: https://arxiv.org/abs/2601.18302
Authors: Keigo Shibata, Kazuki Yano, Ryosuke Takahashi, Jaesung Lee, Wataru Ikeda, Jun Suzuki
Affiliations: Tohoku University; RIKEN; NII LLMC
Categories: Computation and Language (cs.CL)
Comments: Accepted to the Findings of EACL 2026

Abstract: This paper discusses the internal behavior of Transformer language models. Many recent pre-trained models have been reported to exhibit only slight changes in the angular distance between the input and output hidden state vectors in the middle Transformer layers, despite a disproportionately large "jump" in the angular distance occurring in or around the final Transformer layer. To characterize this, we first introduce a quantitative metric for the jump strength around the final layer, and then demonstrate its prevalence across many open-weight models, as well as its amplification throughout pre-training. Assuming such jumps indicate an undesirable property, we propose the jump-suppressing regularizer (JREG) which penalizes this jump during pre-training, thereby encouraging more balanced capability usage across the middle layers. Empirical evaluations of three model sizes of Llama-based models, trained with the proposed JREG method, reveal improved task performance compared to the baseline without altering the model architecture.
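To make the quantity concrete, here is a minimal sketch of the angular distance between a layer's input and output hidden states. The abstract does not define the paper's jump-strength metric, so `jump_ratio` below is a hypothetical stand-in (final-layer distance divided by the mean over earlier layers), and the toy vectors are invented.

```python
import math

def angular_distance(u, v):
    """Angle between two vectors, scaled to [0, 1] (1 means opposite directions)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("angular distance is undefined for a zero vector")
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard against rounding
    return math.acos(cosine) / math.pi

def layer_distances(hidden_states):
    """hidden_states[i] feeds layer i; hidden_states[i + 1] is that layer's output."""
    return [angular_distance(h_in, h_out)
            for h_in, h_out in zip(hidden_states, hidden_states[1:])]

def jump_ratio(distances):
    """Hypothetical jump measure: last-layer distance over the mean of earlier layers."""
    *earlier, last = distances
    return last / (sum(earlier) / len(earlier))

# Toy 2-d hidden states for a 3-layer model: two small rotations, then a large one.
states = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.2], [0.0, 1.0]]
distances = layer_distances(states)
ratio = jump_ratio(distances)
```

A regularizer in this spirit would add a penalty that grows with the final-layer distance; how JREG weights or localizes that penalty is not stated in the abstract.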

[NLP-50] Temp-R1: A Unified Autonomous Agent for Complex Temporal KGQA via Reverse Curriculum Reinforcement Learning

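[Quick read]: This paper tackles complex reasoning in Temporal Knowledge Graph Question Answering (TKGQA), where the core challenge is reasoning over dynamic facts with multi-hop dependencies and complex temporal constraints. Existing methods are limited by fixed workflows and expensive closed-source APIs, which restricts flexibility and scalability. The key to the solution is Temp-R1, the first autonomous end-to-end agent for TKGQA trained with reinforcement learning. First, it expands the action space with specialized internal actions to relieve the cognitive overload of single-action reasoning. Second, it uses reverse curriculum learning, training on difficult questions first to prevent shortcut learning on simple ones and to foster complex reasoning. Temp-R1 achieves state-of-the-art performance on MultiTQ and TimelineKGQA, improving on strong baselines by 19.8% on complex questions.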

Link: https://arxiv.org/abs/2601.18296
Authors: Zhaoyan Gong, Zhiqiang Liu, Songze Li, Xiaoke Guo, Yuanxiang Liu, Xinle Deng, Zhizhen Liu, Lei Liang, Huajun Chen, Wen Zhang
Affiliations: Zhejiang University; Ant Group; ZJU-Ant Group Joint Lab of Knowledge Graph
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Work in progress

Abstract:Temporal Knowledge Graph Question Answering (TKGQA) is inherently challenging, as it requires sophisticated reasoning over dynamic facts with multi-hop dependencies and complex temporal constraints. Existing methods rely on fixed workflows and expensive closed-source APIs, limiting flexibility and scalability. We propose Temp-R1, the first autonomous end-to-end agent for TKGQA trained through reinforcement learning. To address cognitive overload in single-action reasoning, we expand the action space with specialized internal actions alongside external action. To prevent shortcut learning on simple questions, we introduce reverse curriculum learning that trains on difficult questions first, forcing the development of sophisticated reasoning before transferring to easier cases. Our 8B-parameter Temp-R1 achieves state-of-the-art performance on MultiTQ and TimelineKGQA, improving 19.8% over strong baselines on complex questions. Our work establishes a new paradigm for autonomous temporal reasoning agents. Our code will be publicly available soon at this https URL.
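The reverse-curriculum idea (hard questions first) can be sketched as a simple ordering step ahead of training. This is only an illustration under stated assumptions: the `hops` field used as a difficulty score and the fixed number of stages are hypothetical, since the abstract does not say how difficulty is measured or scheduled.

```python
def reverse_curriculum(examples, difficulty, num_stages):
    """Split examples into training stages ordered from hardest to easiest."""
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    if not examples:
        return []
    ranked = sorted(examples, key=difficulty, reverse=True)  # hardest first
    stage_size = -(-len(ranked) // num_stages)  # ceiling division
    return [ranked[i:i + stage_size] for i in range(0, len(ranked), stage_size)]

# Hypothetical questions; 'hops' stands in for whatever difficulty signal is used.
questions = [
    {"q": "A", "hops": 1},
    {"q": "B", "hops": 3},
    {"q": "C", "hops": 2},
    {"q": "D", "hops": 4},
]
stages = reverse_curriculum(questions, lambda ex: ex["hops"], num_stages=2)
first_stage_ids = [ex["q"] for ex in stages[0]]
```

A conventional curriculum would use the same function with `reverse=False`; the ordering is the only thing that changes.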

[NLP-51] U-Fold: Dynamic Intent-Aware Context Folding for User-Centric Agents

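[Quick read]: This paper addresses the scalability limits that context length imposes on Large Language Model (LLM) agents in user-centric dialogues. Existing context-folding methods ease the problem but have two key flaws in multi-turn, complex-intent settings: they irreversibly discard fine-grained constraints and intermediate facts that later decisions depend on, and they fail to track evolving user intent, which leads to omissions and erroneous actions. The proposed U-Fold framework retains the full user-agent dialogue and tool-call history and, at each turn, uses two core components to produce an intent-aware, evolving dialogue summary and a compact, task-relevant tool log, managing context efficiently without losing information.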

Link: https://arxiv.org/abs/2601.18285
Authors: Jin Su, Runnan Fang, Yeqiu Li, Xiaobin Wang, Shihao Cai, Pengjun Xie, Ningyu Zhang, Fajie Yuan
Affiliations: Zhejiang University; Tongyi Lab, Alibaba Group; Westlake University
Categories: Computation and Language (cs.CL)
Comments:

Abstract: Large language model (LLM)-based agents have been successfully deployed in many tool-augmented settings, but their scalability is fundamentally constrained by context length. Existing context-folding methods mitigate this issue by summarizing past interactions, yet they are typically designed for single-query or single-intent scenarios. In more realistic user-centric dialogues, we identify two major failure modes: (i) they irreversibly discard fine-grained constraints and intermediate facts that are crucial for later decisions, and (ii) their summaries fail to track evolving user intent, leading to omissions and erroneous actions. To address these limitations, we propose U-Fold, a dynamic context-folding framework tailored to user-centric tasks. U-Fold retains the full user–agent dialogue and tool-call history but, at each turn, uses two core components to produce an intent-aware, evolving dialogue summary and a compact, task-relevant tool log. Extensive experiments on τ-bench, τ²-bench, VitaBench, and harder context-inflated settings show that U-Fold consistently outperforms ReAct (achieving a 71.4% win rate in long-context settings) and prior folding baselines (with improvements of up to 27.0%), particularly on long, noisy, multi-turn tasks. Our study demonstrates that U-Fold is a promising step toward transferring context-management techniques from single-query benchmarks to realistic user-centric applications.

[NLP-52] Think-Augmented Function Calling: Improving LLM Parameter Accuracy Through Embedded Reasoning

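[Quick read]: This paper addresses the lack of explicit reasoning transparency when Large Language Models (LLMs) perform function calling, especially for complex functions with interdependent parameters. Existing approaches such as chain-of-thought prompting provide reasoning guidance only at the agent level and cannot give fine-grained explanations of how individual parameters are generated. The key to the solution is a new framework, Think-Augmented Function Calling (TAFC). It introduces a universal "think" parameter augmentation that makes the decision process explicit, with dynamic optimization of parameter descriptions to improve reasoning quality. For complex parameters it automatically triggers granular reasoning based on complexity scoring, so that critical decisions are adequately justified. It also adds reasoning-guided optimization to bring generated reasoning closer to human expectations. TAFC requires no changes to the LLM architecture and keeps API compatibility. On the ToolBench benchmark it significantly improves parameter generation accuracy and reasoning coherence for multi-parameter functions and makes AI agent behavior more interpretable.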

Link: https://arxiv.org/abs/2601.18282
Authors: Lei Wei, Jinpeng Ou, Xiao Peng, Bin Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in function calling for autonomous agents, yet current mechanisms lack explicit reasoning transparency during parameter generation, particularly for complex functions with interdependent parameters. While existing approaches like chain-of-thought prompting operate at the agent level, they fail to provide fine-grained reasoning guidance for individual function parameters. To address these limitations, we propose Think-Augmented Function Calling (TAFC), a novel framework that enhances function calling accuracy through explicit reasoning at both function and parameter levels. Our method introduces a universal “think” parameter augmentation that enables models to articulate their decision-making process, with dynamic optimization for parameter descriptions to improve reasoning quality. For complex parameters, TAFC automatically triggers granular reasoning based on complexity scoring, ensuring appropriate justification for critical decisions. Additionally, we propose reasoning-guided optimization to align generated reasoning with human expectations. TAFC requires no architectural modifications to existing LLMs while maintaining full API compatibility. Evaluation on ToolBench across proprietary and open-source models demonstrates significant improvements in parameter generation accuracy and reasoning coherence for multi-parameter functions, while providing enhanced interpretability for debugging AI agent behaviors.
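One way to picture the "think" parameter augmentation is as a wrapper around an ordinary tool definition. The sketch below assumes a JSON-Schema-style layout of the kind commonly used for function calling; the schema shape, the description text, the decision to list `think` first, and the `book_flight` tool are all assumptions made for illustration, not details taken from the paper.

```python
import copy

THINK_DESCRIPTION = "Explain, before filling the other arguments, why each value was chosen."

def add_think_parameter(tool_schema):
    """Return a copy of a JSON-Schema-style tool definition with a leading 'think' string."""
    augmented = copy.deepcopy(tool_schema)
    params = augmented.setdefault("parameters", {"type": "object"})
    original_props = params.get("properties", {})
    # Listing 'think' first is a choice made for this sketch, not a detail from the paper.
    params["properties"] = {
        "think": {"type": "string", "description": THINK_DESCRIPTION},
        **{k: v for k, v in original_props.items() if k != "think"},
    }
    params["required"] = ["think"] + [r for r in params.get("required", []) if r != "think"]
    return augmented

def strip_think(arguments):
    """Split a model-produced argument dict into (reasoning, arguments to execute)."""
    executable = {k: v for k, v in arguments.items() if k != "think"}
    return arguments.get("think", ""), executable

# Hypothetical tool definition and model output.
book_flight = {
    "name": "book_flight",
    "parameters": {
        "type": "object",
        "properties": {"origin": {"type": "string"}, "date": {"type": "string"}},
        "required": ["origin", "date"],
    },
}
augmented = add_think_parameter(book_flight)
reasoning, call_args = strip_think(
    {"think": "User said 'next Friday' from Paris.", "origin": "CDG", "date": "2026-02-06"}
)
```

Stripping `think` before execution keeps the underlying function signature unchanged, which is consistent with the abstract's claim of API compatibility.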

[NLP-53] Reflecting Twice before Speaking with Empathy: Self-Reflective Alternating Inference for Empathy-Aware End-to-End Spoken Dialogue

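[Quick read]: This paper addresses the limited empathetic dialogue ability of current end-to-end Spoken Language Models (SLMs). Existing methods lean heavily on rigid supervision signals, such as ground-truth responses in supervised fine-tuning or preference scores in reinforcement learning, which cannot capture the complexity of emotional expression or the appropriateness of empathetic behavior. The solution rests on two contributions. The first is EmpathyEval, an evaluation model that uses natural-language descriptions to assess empathetic quality in spoken dialogue. The second is ReEmpathy, an SLM that strengthens empathetic dialogue through a novel Empathetic Self-Reflective Alternating Inference mechanism, which interleaves spoken response generation with free-form, empathy-related reflective reasoning and substantially improves the emotional sensitivity and emotional intelligence of the dialogue.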

Link: https://arxiv.org/abs/2601.18281
Authors: Yuhang Jia, Pei Liu, Haoqin Sun, Jiaming Zhou, Xuxin Cheng, Cao Liu, Ke Zeng, Xunliang Cai, Yong Qin
Affiliations: Nankai University; Meituan LongCat Interaction Team
Categories: Computation and Language (cs.CL); Sound (cs.SD)
Comments:

Abstract:End-to-end Spoken Language Models (SLMs) hold great potential for paralinguistic perception, and numerous studies have aimed to enhance their capabilities, particularly for empathetic dialogue. However, current approaches largely depend on rigid supervised signals, such as ground-truth response in supervised fine-tuning or preference scores in reinforcement learning. Such reliance is fundamentally limited for modeling complex empathy, as there is no single “correct” response and a simple numerical score cannot fully capture the nuances of emotional expression or the appropriateness of empathetic behavior. To address these limitations, we sequentially introduce EmpathyEval, a descriptive natural-language-based evaluation model for assessing empathetic quality in spoken dialogues. Building upon EmpathyEval, we propose ReEmpathy, an end-to-end SLM that enhances empathetic dialogue through a novel Empathetic Self-Reflective Alternating Inference mechanism, which interleaves spoken response generation with free-form, empathy-related reflective reasoning. Extensive experiments demonstrate that ReEmpathy substantially improves empathy-sensitive spoken dialogue by enabling reflective reasoning, offering a promising approach toward more emotionally intelligent and empathy-aware human-computer interactions.

[NLP-54] Designing large language model prompts to extract scores from messy text: A shared dataset and challenge

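[Quick read]: This paper addresses the automatic extraction of UK research quality scores (on the scale of 1* to 4*) from short unstructured texts. The core challenge is to design a prompt that directs Large Language Models (LLMs) to identify and output the numeric score accurately, and to return the missing value code -1 when the text does not clearly contain a valid score. The key to the solution is prompt design: the prompt must ensure that the LLM outputs a single number with no other text, and it must state clearly when -1 should be returned, which improves numerical reasoning over messy, irregular text. The initial solution has an accuracy of 72.6%; the goal is to beat it by refining the prompt and to deepen understanding of how LLMs perform on numerical tasks.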

Link: https://arxiv.org/abs/2601.18271
Authors: Mike Thelwall
Affiliations: Unknown
Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL)
Comments:

Abstract:In some areas of computing, natural language processing and information science, progress is made by sharing datasets and challenging the community to design the best algorithm for an associated task. This article introduces a shared dataset of 1446 short texts, each of which describes a research quality score on the UK scale of 1* to 4*. This is a messy collection, with some texts not containing scores and others including invalid scores or strange formats. With this dataset there is also a description of what constitutes a valid score and a “gold standard” of the correct scores for these texts (including missing values). The challenge is to design a prompt for Large Language Models (LLMs) to extract the scores from these texts as accurately as possible. The format for the response should be a number and no other text so there are two aspects to the challenge: ensuring that the LLM returns only a number, and instructing it to deduce the correct number for the text. As part of this, the LLM prompt needs to explain when to return the missing value code, -1, instead of a number when the text does not clearly contain one. The article also provides an example of a simple prompt. The purpose of the challenge is twofold: to get an effective solution to this problem, and to increase understanding of prompt design and LLM capabilities for complex numerical tasks. The initial solution suggested has an accuracy of 72.6%, so the challenge is to beat this.
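The two requirements in the challenge (number-only output, and -1 for missing values) suggest a strict post-check on whatever the LLM returns. The sketch below is one hedged way to score replies against a gold standard. The validity rule used here (any number from 1 to 4 inclusive) is an assumption, since the dataset ships its own definition of a valid score that the abstract does not reproduce, and the sample replies are invented.

```python
import re

MISSING = -1  # missing value code named in the challenge description

def parse_score_response(response):
    """Return the score if the reply is a bare number in [1, 4], else MISSING."""
    text = response.strip()
    # Anything other than a plain non-negative number violates the output format.
    if re.fullmatch(r"\d+(?:\.\d+)?", text) is None:
        return MISSING
    value = float(text)
    return value if 1.0 <= value <= 4.0 else MISSING  # assumed validity range

def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold standard."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equally long")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Hypothetical LLM replies and gold scores.
replies = ["3", " 4 ", "The score is 2", "-1", "7"]
gold_scores = [3, 4, 2, -1, -1]
predictions = [parse_score_response(r) for r in replies]
score = accuracy(predictions, gold_scores)
```

Treating a verbose reply such as "The score is 2" as a miss makes format violations visible in the accuracy figure instead of silently repairing them.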

[NLP-55] FGGM: Fisher-Guided Gradient Masking for Continual Learning ICASSP2026

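[Quick read]: This paper addresses catastrophic forgetting in the continual learning of large language models. The key to the solution is the Fisher-Guided Gradient Masking (FGGM) framework, which uses diagonal Fisher Information to dynamically generate binary masks with adaptive thresholds. The masks select which parameters to update and preserve critical parameters, balancing stability and plasticity without requiring historical data. The method gives a mathematically principled estimate of parameter importance and, compared with magnitude-based methods such as MIGU, retains capabilities markedly better on the TRACE benchmark and on code generation tasks.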

Link: https://arxiv.org/abs/2601.18261
Authors: Chao-Hong Tan, Qian Chen, Wen Wang, Yukun Ma, Chong Zhang, Chong Deng, Qinglin Zhang, Xiangang Li, Jieping Ye
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted by ICASSP 2026

Abstract:Catastrophic forgetting impairs the continuous learning of large language models. We propose Fisher-Guided Gradient Masking (FGGM), a framework that mitigates this by strategically selecting parameters for updates using diagonal Fisher Information. FGGM dynamically generates binary masks with adaptive thresholds, preserving critical parameters to balance stability and plasticity without requiring historical data. Unlike magnitude-based methods such as MIGU, our approach offers a mathematically principled parameter importance estimation. On the TRACE benchmark, FGGM shows a 9.6% relative improvement in retaining general capabilities over supervised fine-tuning (SFT) and a 4.4% improvement over MIGU on TRACE tasks. Additional analysis on code generation tasks confirms FGGM’s superior performance and reduced forgetting, establishing it as an effective solution.
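As a rough illustration of gradient masking driven by diagonal Fisher information, the sketch below estimates the diagonal as the mean squared per-example gradient, builds a binary mask from a rank-based threshold, and applies a masked SGD step. Several pieces are assumptions: the abstract does not say how the adaptive threshold is set (a fixed update fraction is used here), nor which side of the threshold is updated (exposed as the `update_high` flag), and all numbers are toy values.

```python
def diagonal_fisher(per_example_grads):
    """Empirical diagonal Fisher: mean of squared per-example gradients per parameter."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[i] ** 2 for g in per_example_grads) / n for i in range(dim)]

def fisher_mask(fisher, update_fraction, update_high=True):
    """Binary mask (1 = update, 0 = frozen) selecting a fraction of parameters by Fisher rank."""
    k = max(1, round(update_fraction * len(fisher)))
    order = sorted(range(len(fisher)), key=lambda i: fisher[i], reverse=update_high)
    chosen = set(order[:k])
    return [1 if i in chosen else 0 for i in range(len(fisher))]

def masked_sgd_step(params, grads, mask, lr):
    """Plain SGD step in which masked-out parameters keep their current value."""
    return [p - lr * g * m for p, g, m in zip(params, grads, mask)]

# Toy example with three parameters and two training examples.
per_example = [[1.0, 0.0, 2.0], [1.0, 0.0, -2.0]]
fisher = diagonal_fisher(per_example)           # [1.0, 0.0, 4.0]
mask = fisher_mask(fisher, update_fraction=0.34)
new_params = masked_sgd_step([0.5, 0.5, 0.5], [0.1, 0.2, 0.3], mask, lr=1.0)
```

Setting `update_high=False` flips the selection so that high-Fisher parameters are frozen instead, which is the other plausible reading of "preserving critical parameters".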

[NLP-56] BoRP: Bootstrapped Regression Probing for Scalable and Human-Aligned LLM Evaluation

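[Quick read]: This paper addresses the difficulty of evaluating user satisfaction in open-ended conversational AI, where traditional A/B testing lacks reliable metrics because explicit feedback is sparse and implicit metrics are ambiguous. The key to the solution is the BoRP (Bootstrapped Regression Probing) framework. Its core innovation is to exploit the geometric properties of the Large Language Model (LLM) latent space: a polarization-index-based bootstrapping mechanism automatically generates evaluation rubrics, and Partial Least Squares (PLS) maps hidden states to continuous satisfaction scores, yielding high-fidelity, low-cost, and scalable satisfaction evaluation.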

Link: https://arxiv.org/abs/2601.18253
Authors: Peng Sun, Xiangyu Zhang, Duan Wu
Affiliations: Alibaba Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This is a pre-print

Abstract:Accurate evaluation of user satisfaction is critical for iterative development of conversational AI. However, for open-ended assistants, traditional A/B testing lacks reliable metrics: explicit feedback is sparse, while implicit metrics are ambiguous. To bridge this gap, we introduce BoRP (Bootstrapped Regression Probing), a scalable framework for high-fidelity satisfaction evaluation. Unlike generative approaches, BoRP leverages the geometric properties of LLM latent space. It employs a polarization-index-based bootstrapping mechanism to automate rubric generation and utilizes Partial Least Squares (PLS) to map hidden states to continuous scores. Experiments on industrial datasets show that BoRP (Qwen3-8B/14B) significantly outperforms generative baselines (even Qwen3-Max) in alignment with human judgments. Furthermore, BoRP reduces inference costs by orders of magnitude, enabling full-scale monitoring and highly sensitive A/B testing via CUPED.
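The abstract mentions CUPED for sensitive A/B testing. CUPED is a general variance-reduction technique that adjusts an experiment metric with a pre-experiment covariate. The sketch below shows the standard adjustment, y - theta * (x - mean(x)) with theta = cov(x, y) / var(x); how the paper pairs it with BoRP scores is not described in the abstract, and the sample values are invented.

```python
def cuped_adjust(metric, covariate):
    """Return CUPED-adjusted metric values: y - theta * (x - mean(x))."""
    n = len(metric)
    if n != len(covariate) or n < 2:
        raise ValueError("need two equally sized samples with at least 2 points")
    mean_x = sum(covariate) / n
    mean_y = sum(metric) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, metric))
    var_x = sum((x - mean_x) ** 2 for x in covariate)
    if var_x == 0.0:
        return list(metric)  # a constant covariate carries no information
    theta = cov_xy / var_x
    return [y - theta * (x - mean_x) for x, y in zip(covariate, metric)]

def sample_variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

# Hypothetical per-user data: pre-experiment score (x) and in-experiment score (y).
pre_period = [1.0, 2.0, 3.0, 4.0]
in_experiment = [2.1, 3.9, 6.2, 7.8]
adjusted = cuped_adjust(in_experiment, pre_period)
```

The adjustment leaves the sample mean unchanged while removing the variance explained by the covariate, which is what makes small treatment effects easier to detect.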

[NLP-57] TechING: Towards Real World Technical Image Understanding via VLMs EACL2026

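[Quick read]: This paper addresses the poor performance of current Vision Language Models (VLMs) on hand-drawn technical diagrams such as flowcharts and block diagrams. Technical professionals routinely sketch such diagrams during discussions, and existing models struggle to recognize them or make them editable. The key to the solution is a large synthetically generated corpus of technical diagrams (reflective of real-world images) for training VLMs, together with several new self-supervision tasks; the resulting models are then evaluated on a smaller corpus of real hand-drawn images. Fine-tuning Llama 3.2 11B-instruct on the synthetic images yields LLama-VL-TUG, which improves ROUGE-L by 2.14x and, on real-world images, improves the average F1 score by 6.97x while achieving the fewest compilation errors in 7 out of 8 diagram types, demonstrating that the synthetic-data approach is effective and practical.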

Link: https://arxiv.org/abs/2601.18238
Authors: Tafazzul Nadeem, Bhavik Shangari, Manish Rai, Gagan Raj Gupta, Ashutosh Modi
Affiliations: IIT Kanpur; IIT Bhilai
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at Findings of EACL 2026, 30 Pages (9 Pages main paper + 4 pages references + 17 pages appendix)

Abstract:Professionals working in technical domain typically hand-draw (on whiteboard, paper, etc.) technical diagrams (e.g., flowcharts, block diagrams, etc.) during discussions; however, if they want to edit these later, it needs to be drawn from scratch. Modern day VLMs have made tremendous progress in image understanding but they struggle when it comes to understanding technical diagrams. One way to overcome this problem is to fine-tune on real world hand-drawn images, but it is not practically possible to generate large number of such images. In this paper, we introduce a large synthetically generated corpus (reflective of real world images) for training VLMs and subsequently evaluate VLMs on a smaller corpus of hand-drawn images (with the help of humans). We introduce several new self-supervision tasks for training and perform extensive experiments with various baseline models and fine-tune Llama 3.2 11B-instruct model on synthetic images on these tasks to obtain LLama-VL-TUG, which significantly improves the ROUGE-L performance of Llama 3.2 11B-instruct by 2.14x and achieves the best all-round performance across all baseline models. On real-world images, human evaluation reveals that we achieve minimum compilation errors across all baselines in 7 out of 8 diagram types and improve the average F1 score of Llama 3.2 11B-instruct by 6.97x.

[NLP-58] Generative AI in Saudi Arabia: A National Survey of Adoption, Risks, and Public Perceptions

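[Quick read]: This paper addresses the lack of systematic research on public awareness, adoption, and potential risks of Generative AI (GenAI) in Saudi Arabia during its national digital transformation. Through a nationwide survey (n=330), it describes GenAI use among Saudi nationals across seven dimensions, from awareness to future expectations, and finds uneven results, with notable gaps in depth of technical understanding and in advanced uses such as programming or multimodal generation. The key recommendation is a multi-level strategy centered on AI literacy: structured training that covers foundational skills, domain-specific applications, and ethical guidance, combined with GenAI solutions aligned with the local culture and language and stronger frameworks for privacy and responsible deployment. These give policymakers and developers actionable priorities.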

Link: https://arxiv.org/abs/2601.18234
Authors: Abdulaziz AlDakheel, Ali Alshehre, Esraa Alamoudi, Moslim AlKhabbaz, Ahmed Aljohani, Raed Alharbi
Affiliations: Saudi Electronic University
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Generative Artificial Intelligence (GenAI) is rapidly becoming embedded in Saudi Arabia’s digital transformation under Vision 2030, yet public awareness, adoption, and concerns surrounding these tools remain underexplored. This study provides an early snapshot of GenAI engagement among Saudi nationals. Using a nationwide survey of 330 participants across regions, age groups, and employment sectors, we examine seven dimensions of GenAI use: awareness and understanding, adoption patterns, perceived impacts, training needs, risks and barriers, data-sharing behaviors, and future expectations. Findings show that 93% of respondents actively use GenAI primarily for text-based tasks, while more advanced uses such as programming or multimodal generation are less common. Despite the prevalence of use, overall awareness and conceptual understanding remain uneven, with many reporting limited technical knowledge. Participants recognize GenAI’s benefits for productivity, work quality, and understanding complex information, yet caution that sustained reliance may undermine critical thinking and key professional skills. Trust in AI-generated outputs remains cautious, with widespread concerns about privacy, misinformation, and ethical misuse, including potential job displacement. Respondents show strong interest in structured GenAI training that combines foundational skills, domain-specific applications, and clear guidance on privacy, ethics, and responsible use. These results establish a baseline for GenAI engagement in Saudi Arabia and highlight priorities for policymakers and developers: expanding AI literacy, ensuring culturally and linguistically aligned GenAI solutions, and strengthening frameworks for privacy and responsible deployment.

[NLP-59] PaperTok: Exploring the Use of Generative AI for Creating Short-form Videos for Research Communication

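[Quick read]: This paper addresses a difficulty researchers face in science communication: disseminating scholarly work is critical, yet researchers often lack the time and skills to create engaging content for popular media such as short-form video platforms. The key to the solution is PaperTok, an end-to-end system that uses Generative AI to turn an academic paper into script options and corresponding audiovisual content, lowering the barrier to a first draft. Researchers can then refine the result with further prompting according to their preferences, producing science communication content efficiently and at good quality.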

Link: https://arxiv.org/abs/2601.18218
Authors: Meziah Ruby Cristobal, Hyeonjeong Byeon, Tze-Yu Chen, Ruoxi Shang, Donghoon Shin, Ruican Zhong, Tony Zhou, Gary Hsieh
Affiliations: University of Washington
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The dissemination of scholarly research is critical, yet researchers often lack the time and skills to create engaging content for popular media such as short-form videos. To address this gap, we explore the use of generative AI to help researchers transform their academic papers into accessible video content. Informed by a formative study with science communicators and content creators (N=8), we designed PaperTok, an end-to-end system that automates the initial creative labor by generating script options and corresponding audiovisual content from a source paper. Researchers can then refine based on their preferences with further prompting. A mixed-methods user study (N=18) and crowdsourced evaluation (N=100) demonstrate that PaperTok’s workflow can help researchers create engaging and informative short-form videos. We also identified the need for more fine-grained controls in the creation process. To this end, we offer implications for future generative tools that support science outreach.

[NLP-60] PaperSearchQA: Learning to Search and Reason over Scientific Papers with RLVR EACL2026

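[Quick read]: This paper addresses the fact that current search agents mainly target general-domain question answering (QA), which limits their relevance to technical settings in science, engineering, and medicine. The authors propose training search agents to retrieve from and reason over scientific papers, improving their ability to answer technical questions and laying groundwork for future "AI Scientist" systems. The key elements of the solution are a search corpus of 16 million biomedical paper abstracts, a challenging factoid QA dataset built on it called PaperSearchQA (60k samples), and training with Reinforcement Learning with Verifiable Rewards (RLVR). Agents trained in this environment outperform non-RL retrieval baselines and exhibit behaviors such as planning, reasoning, and self-verification.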

Link: https://arxiv.org/abs/2601.18207
Authors: James Burgess, Jan N. Hansen, Duo Peng, Yuhui Zhang, Alejandro Lozano, Min Woo Sun, Emma Lundberg, Serena Yeung-Levy
Affiliations: Stanford University; Chan Zuckerberg Biohub Network; KTH Royal Institute of Technology
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: EACL 2026

Abstract:Search agents are language models (LMs) that reason and search knowledge bases (or the web) to answer questions; recent methods supervise only the final answer accuracy using reinforcement learning with verifiable rewards (RLVR). Most RLVR search agents tackle general-domain QA, which limits their relevance to technical AI systems in science, engineering, and medicine. In this work we propose training agents to search and reason over scientific papers – this tests technical question-answering, it is directly relevant to real scientists, and the capabilities will be crucial to future AI Scientist systems. Concretely, we release a search corpus of 16 million biomedical paper abstracts and construct a challenging factoid QA dataset called PaperSearchQA with 60k samples answerable from the corpus, along with benchmarks. We train search agents in this environment to outperform non-RL retrieval baselines; we also perform further quantitative analysis and observe interesting agent behaviors like planning, reasoning, and self-verification. Our corpus, datasets, and benchmarks are usable with the popular Search-R1 codebase for RLVR training and released on this https URL. Finally, our data creation methods are scalable and easily extendable to other scientific domains.
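Since RLVR supervises only final-answer accuracy, the reward for a factoid question can be as simple as a normalized exact match. The sketch below is a minimal version under stated assumptions: the normalization (lowercasing, stripping punctuation and English articles) follows a common QA-evaluation convention and is not confirmed as the paper's reward, and the example answers are invented.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match_reward(predicted, gold_answers):
    """1.0 if the normalized prediction equals any normalized gold answer, else 0.0."""
    target = normalize_answer(predicted)
    return 1.0 if any(target == normalize_answer(g) for g in gold_answers) else 0.0

# Hypothetical agent outputs scored against hypothetical gold answers.
reward_hit = exact_match_reward("The BRCA1 gene.", ["BRCA1 gene"])
reward_miss = exact_match_reward("TP53", ["BRCA1"])
```

A binary reward like this is easy to verify automatically, which is what lets training proceed without labels for the intermediate search and reasoning steps.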

[NLP-61] MemWeaver: Weaving Hybrid Memories for Traceable Long-Horizon Agentic Reasoning

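[Quick read]: This paper addresses the memory-system problems that Large Language Model-based agents face in long-horizon interactions: weak temporal consistency, limited multi-hop reasoning, and difficulty reusing evidence across sessions. Existing approaches rely mainly on unstructured retrieval or coarse abstractions, which makes reasoning brittle and hard to trace. The key to the solution is MemWeaver, a unified memory framework that consolidates long-term experience into three interconnected components: a temporally grounded graph memory for structured relational reasoning, an experience memory that abstracts recurring interaction patterns, and a passage memory that preserves the original textual evidence. MemWeaver also uses a dual-channel retrieval strategy that jointly retrieves structured knowledge and supporting evidence to build compact yet information-dense reasoning contexts. On the LoCoMo benchmark it substantially improves multi-hop and temporal reasoning accuracy while reducing input context length by over 95%.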

Link: https://arxiv.org/abs/2601.18204
Authors: Juexiang Ye, Xue Li, Xinyu Yang, Chengkai Huang, Lanshun Nie, Lina Yao, Dechen Zhan
Affiliations: Harbin Institute of Technology; The University of New South Wales; Macquarie University; CSIRO’s Data61
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language model-based agents operating in long-horizon interactions require memory systems that support temporal consistency, multi-hop reasoning, and evidence-grounded reuse across sessions. Existing approaches largely rely on unstructured retrieval or coarse abstractions, which often lead to temporal conflicts, brittle reasoning, and limited traceability. We propose MemWeaver, a unified memory framework that consolidates long-term agent experiences into three interconnected components: a temporally grounded graph memory for structured relational reasoning, an experience memory that abstracts recurring interaction patterns from repeated observations, and a passage memory that preserves original textual evidence. MemWeaver employs a dual-channel retrieval strategy that jointly retrieves structured knowledge and supporting evidence to construct compact yet information-dense contexts for reasoning. Experiments on the LoCoMo benchmark demonstrate that MemWeaver substantially improves multi-hop and temporal reasoning accuracy while reducing input context length by over 95% compared to long-context baselines.

[NLP-62] Fine-Grained Emotion Detection on GoEmotions: Experimental Comparison of Classical Machine Learning, BiLSTM, and Transformer Models

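[Quick read]: This paper addresses the challenges of fine-grained emotion recognition, a multi-label natural language processing (NLP) task complicated by label overlap and class imbalance. The key to the approach is a comparison of three modeling families: a TF-IDF-based logistic regression system, a bidirectional LSTM (BiLSTM) with attention, and a fine-tuned BERT model, with inverse-frequency class weights used to mitigate class imbalance. The results show that logistic regression attains the highest Micro-F1 (0.51), while BERT gives the best overall balance and surpasses the results reported in the original GoEmotions paper, reaching Macro-F1 0.49, Hamming Loss 0.036, and Subset Accuracy 0.36. This suggests that contextual representations discriminate better on rare emotions and ambiguous examples.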

Link: https://arxiv.org/abs/2601.18162
Authors: Ani Harutyunyan, Sachin Kumar
Affiliations: American University of Armenia
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Fine-grained emotion recognition is a challenging multi-label NLP task due to label overlap and class imbalance. In this work, we benchmark three modeling families on the GoEmotions dataset: a TF-IDF-based logistic regression system trained with binary relevance, a BiLSTM with attention, and a BERT model fine-tuned for multi-label classification. Experiments follow the official train/validation/test split, and imbalance is mitigated using inverse-frequency class weights. Across several metrics, namely Micro-F1, Macro-F1, Hamming Loss, and Subset Accuracy, we observe that logistic regression attains the highest Micro-F1 of 0.51, while BERT achieves the best overall balance surpassing the official paper’s reported results, reaching Macro-F1 0.49, Hamming Loss 0.036, and Subset Accuracy 0.36. This suggests that frequent emotions often rely on surface lexical cues, whereas contextual representations improve performance on rarer emotions and more ambiguous examples.
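The four metrics named above weigh errors differently, which is why one model can lead on Micro-F1 while another leads on Macro-F1. The following sketch computes all four from binary indicator matrices using their standard definitions; the toy labels are invented and are not GoEmotions data.

```python
def multilabel_metrics(y_true, y_pred):
    """Micro-F1, Macro-F1, Hamming loss and subset accuracy for 0/1 label matrices."""
    n, k = len(y_true), len(y_true[0])
    tp, fp, fn = [0] * k, [0] * k, [0] * k
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                tp[j] += 1
            elif p:
                fp[j] += 1
            elif t:
                fn[j] += 1

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    return {
        "micro_f1": f1(sum(tp), sum(fp), sum(fn)),              # pools all labels
        "macro_f1": sum(f1(a, b, c) for a, b, c in zip(tp, fp, fn)) / k,  # averages per label
        "hamming_loss": (sum(fp) + sum(fn)) / (n * k),          # wrong cells / all cells
        "subset_accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / n,  # exact rows
    }

# Toy data: 4 texts, 3 emotion labels; the rare third label is never predicted.
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 0]]
metrics = multilabel_metrics(y_true, y_pred)
```

In this toy case Micro-F1 is 0.75 while Macro-F1 is about 0.56, because the missed rare label counts as a full class in the macro average but as a single error in the micro average.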

[NLP-63] FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning

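[Quick read]: This paper addresses a bottleneck in reinforcement learning (RL) training of Large Language Models (LLMs): during rollout, long generated sequences make attention and KV-cache memory dominate step time. FP8 (8-bit floating point) can cut compute cost and memory bandwidth pressure, but policy weights change every step and must be repeatedly quantized and synchronized, and low-precision rollouts can deviate from the higher-precision policy assumed by the trainer, causing train-inference mismatch and even instability. The key to the solution is a complete FP8 rollout stack: (i) blockwise FP8 quantization for efficient W8A8 linear-layer inference; (ii) FP8 extended to the KV-cache, with per-step QKV scale recalibration to remove long-context memory bottlenecks; and (iii) importance-sampling-based rollout correction (token-level TIS/MIS variants) to mitigate the bias introduced by reduced precision. Experiments show up to 44% higher rollout throughput on dense and MoE models while keeping learning behavior comparable to BF16 baselines.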

Link: https://arxiv.org/abs/2601.18150
Authors: Zhaopeng Qiu, Shuang Yu, Jingqi Zhang, Shuai Zhang, Xue Huang, Jingyi Yang, Junjie Lai
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Reinforcement learning (RL) for large language models (LLMs) is increasingly bottlenecked by rollout (generation), where long output sequence lengths make attention and KV-cache memory dominate end-to-end step time. FP8 offers an attractive lever for accelerating RL by reducing compute cost and memory traffic during rollout, but applying FP8 in RL introduces unique engineering and algorithmic challenges: policy weights change every step (requiring repeated quantization and weight synchronization into the inference engine) and low-precision rollouts can deviate from the higher-precision policy assumed by the trainer, causing train-inference mismatch and potential instability. This report presents a practical FP8 rollout stack for LLM RL, implemented in the veRL ecosystem with support for common training backends (e.g., FSDP/Megatron-LM) and inference engines (e.g., vLLM/SGLang). We (i) enable FP8 W8A8 linear-layer rollout using blockwise FP8 quantization, (ii) extend FP8 to KV-cache to remove long-context memory bottlenecks via per-step QKV scale recalibration, and (iii) mitigate mismatch using importance-sampling-based rollout correction (token-level TIS/MIS variants). Across dense and MoE models, these techniques deliver up to 44% rollout throughput gains while preserving learning behavior comparable to BF16 baselines.
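To illustrate the mismatch correction in item (iii), the sketch below computes token-level importance weights between the trainer's policy and the low-precision rollout policy and caps them, in the style of truncated importance sampling. The cap value, the loss form, and the reading of TIS as truncation are assumptions; the report's exact TIS/MIS variants are not specified in the abstract, and the log-probabilities are toy values.

```python
import math

def truncated_is_weights(trainer_logprobs, rollout_logprobs, cap=2.0):
    """Per-token ratio pi_trainer / pi_rollout, truncated from above at `cap`."""
    return [min(math.exp(t - r), cap)
            for t, r in zip(trainer_logprobs, rollout_logprobs)]

def corrected_policy_loss(trainer_logprobs, rollout_logprobs, advantages, cap=2.0):
    """Policy-gradient surrogate with each token reweighted by its truncated ratio.

    In an autograd framework the weights would be treated as constants
    (no gradient flows through them); here everything is plain floats.
    """
    weights = truncated_is_weights(trainer_logprobs, rollout_logprobs, cap)
    terms = [w * a * lp for w, a, lp in zip(weights, advantages, trainer_logprobs)]
    return -sum(terms) / len(terms)

# Toy log-probs of the sampled tokens under the trainer (e.g. BF16) and the FP8 rollout.
trainer_lp = [-1.0, -0.5, -2.0]
rollout_lp = [-1.0, -0.7, -4.0]
weights = truncated_is_weights(trainer_lp, rollout_lp)
loss = corrected_policy_loss(trainer_lp, rollout_lp, advantages=[1.0, 1.0, 1.0])
```

The cap keeps a single badly mismatched token (the third one here) from dominating the update, at the cost of some bias.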

[NLP-64] DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints

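[Quick read]: This paper addresses gaps in current agent evaluation benchmarks for long-horizon tasks: they underrepresent global constrained optimization (such as time and budget limits) as well as the active information gathering and fine-grained local constraints found in real settings. Existing LLM planning benchmarks mostly focus on local, step-level reasoning and do not measure genuine planning ability. The authors therefore introduce DeepPlanning, a challenging benchmark with multi-day travel planning and multi-product shopping tasks that combine proactive information acquisition, local constrained reasoning, and global constrained optimization. Its main contribution is a set of complex tasks closer to real-world conditions. Evaluations show that even frontier agentic LLMs perform poorly on it, highlighting the importance of reliable explicit reasoning patterns and parallel tool use for better effectiveness-efficiency trade-offs in long-horizon planning.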

Link: https://arxiv.org/abs/2601.18137
Authors: Yinger Zhang,Shutong Jiang,Renhao Li,Jianhong Tu,Yang Su,Lianghao Deng,Xudong Guo,Chenxu Lv,Junyang Lin
Affiliations: Alibaba Group
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:While agent evaluation has shifted toward long-horizon tasks, most benchmarks still emphasize local, step-level reasoning rather than the global constrained optimization (e.g., time and financial budgets) that demands genuine planning ability. Meanwhile, existing LLM planning benchmarks underrepresent the active information gathering and fine-grained local constraints typical of real-world settings. To address this, we introduce DeepPlanning, a challenging benchmark for practical long-horizon agent planning. It features multi-day travel planning and multi-product shopping tasks that require proactive information acquisition, local constrained reasoning, and global constrained optimization. Evaluations on DeepPlanning show that even frontier agentic LLMs struggle with these problems, highlighting the importance of reliable explicit reasoning patterns and parallel tool use for achieving better effectiveness-efficiency trade-offs. Error analysis further points to promising directions for improving agentic LLMs over long planning horizons. We open-source the code and data to support future research.

[NLP-65] Typhoon-S: Minimal Open Post-Training for Sovereign Large Language Models

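[Quick read]: This paper addresses the difficulty of adapting large language models (LLMs) to sovereign settings, namely how to build high-quality sovereign language models capable of region-specific tasks (such as local legal reasoning and cultural knowledge) under limited compute, limited data, and strict transparency requirements. The core challenge is to avoid relying on massive instruction corpora, complex preference-tuning pipelines, and large-scale reinforcement fine-tuning (RFT), thereby lowering the barrier to deployment. The key to the solution is Typhoon S, a minimal and open post-training recipe that combines supervised fine-tuning (SFT), on-policy distillation, and small-scale RFT, and that introduces InK-GRPO (the GRPO loss augmented with a next-word prediction loss). The recipe improves Thai legal reasoning and Thai-specific knowledge while preserving general capabilities, offering a practical path toward high-performing sovereign LLMs under academic-scale resources.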

Link: https://arxiv.org/abs/2601.18129
Authors: Kunat Pipatanakul,Pittawat Taveekitworachai
Affiliations: Typhoon, SCB 10X
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages. Code is publicly available at this https URL . Datasets and model weights are available at this https URL

Abstract:Large language models (LLMs) have progressed rapidly; however, most state-of-the-art models are trained and evaluated primarily in high-resource languages such as English and Chinese, and are often developed by a small number of organizations with access to large-scale compute and data. This gatekeeping creates a practical barrier for sovereign settings in which a regional- or national-scale institution or domain owner must retain control and understanding of model weights, training data, and deployment while operating under limited resources and strict transparency constraints. To this end, we identify two core requirements: (1) adoptability, the ability to transform a base model into a general-purpose assistant, and (2) sovereign capability, the ability to perform high-stakes, region-specific tasks (e.g., legal reasoning in local languages and cultural knowledge). We investigate whether these requirements can be achieved without scaling massive instruction corpora or relying on complex preference tuning pipelines and large-scale reinforcement fine-tuning (RFT). We present Typhoon S, a minimal and open post-training recipe that combines supervised fine-tuning, on-policy distillation, and small-scale RFT. Using Thai as a representative case study, we demonstrate that our approach transforms both sovereign-adapted and general-purpose base models into instruction-tuned models with strong general performance. We further show that small-scale RFT with InK-GRPO – an extension of GRPO that augments the GRPO loss with a next-word prediction loss – improves Thai legal reasoning and Thai-specific knowledge while preserving general capabilities. Our results suggest that a carefully designed post-training strategy can reduce the required scale of instruction data and computation, providing a practical path toward high-quality sovereign LLMs under academic-scale resources.
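
The abstract describes InK-GRPO only as GRPO with an added next-word prediction loss. A minimal sketch of that idea follows, assuming a group-normalized advantage, a plain log-likelihood policy term (clipping and KL terms omitted), and a hypothetical mixing coefficient `lam`; none of these details are confirmed by the source.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: rewards standardized within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def ink_grpo_style_loss(rewards, seq_logprobs, knowledge_token_logprobs, lam=0.1):
    """Policy term plus lam times a next-word prediction (NLL) term.

    seq_logprobs: log-probability of each sampled response in the group.
    knowledge_token_logprobs: token log-probs on separate in-domain text
    (hypothetical input; the paper may source this text differently).
    """
    advantages = group_advantages(rewards)
    policy_loss = -sum(a * lp for a, lp in zip(advantages, seq_logprobs)) / len(advantages)
    nll = -sum(knowledge_token_logprobs) / len(knowledge_token_logprobs)
    return policy_loss + lam * nll
```

When all rewards in a group are equal the advantages vanish, so only the next-word prediction term contributes to the loss.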

[NLP-66] FABLE: Forest-Based Adaptive Bi-Path LLM-Enhanced Retrieval for Multi-Document Reasoning

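[Quick read]: This paper addresses the persistent limitations of long-context large language models (LLMs) in practice, including the lost-in-the-middle phenomenon, high computational cost, and poor scalability for multi-document reasoning. It also notes that traditional retrieval-augmented generation (RAG) systems rely on flat chunk-level retrieval, which introduces semantic noise and cannot support structured cross-document synthesis. The key to the solution is FABLE, a forest-based adaptive bi-path LLM-enhanced retrieval framework. It builds hierarchical forest indexes with multi-granularity semantic structure, then combines LLM-guided hierarchical traversal with structure-aware propagation to acquire fine-grained evidence under explicit budget control. FABLE reaches accuracy comparable to full-context LLM inference with up to 94% fewer tokens, indicating that long-context LLMs amplify rather than replace the need for structured retrieval.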

Link: https://arxiv.org/abs/2601.18116
Authors: Lin Sun,Linglin Zhang,Jingang Huang,Change Jia,Zhengwei Cheng,Xiangzheng Zhang
Affiliations: Qiyuan Tech
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The rapid expansion of long-context Large Language Models (LLMs) has reignited debate on whether Retrieval-Augmented Generation (RAG) remains necessary. However, empirical evidence reveals persistent limitations of long-context inference, including the lost-in-the-middle phenomenon, high computational cost, and poor scalability for multi-document reasoning. Conversely, traditional RAG systems, while efficient, are constrained by flat chunk-level retrieval that introduces semantic noise and fails to support structured cross-document synthesis. We present FABLE, a Forest-based Adaptive Bi-path LLM-Enhanced retrieval framework that integrates LLMs into both knowledge organization and retrieval. FABLE constructs LLM-enhanced hierarchical forest indexes with multi-granularity semantic structures, then employs a bi-path strategy combining LLM-guided hierarchical traversal with structure-aware propagation for fine-grained evidence acquisition, with explicit budget control for adaptive efficiency trade-offs. Extensive experiments demonstrate that FABLE consistently outperforms SOTA RAG methods and achieves comparable accuracy to full-context LLM inference with up to 94% token reduction, showing that long-context LLMs amplify rather than fully replace the need for structured retrieval.

[NLP-67] GLEN-Bench: A Graph-Language based Benchmark for Nutritional Health

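[Quick read]: This paper addresses three key gaps in current computational methods for personalized dietary guidance: dietary pattern studies often ignore real-world constraints (such as socioeconomic status, comorbidities, and limited food access); recommendation systems rarely explain why a food suits a particular patient; and no unified benchmark evaluates methods across the connected tasks of nutritional health. The key to the solution is GLEN-Bench, the first comprehensive benchmark based on knowledge graphs and natural language. It combines NHANES health records, FNDDS food composition data, and USDA food-access metrics into a knowledge graph linking demographics, health conditions, dietary behaviors, poverty-related constraints, and nutrient needs. Using opioid use disorder as the test scenario, it defines three linked tasks (risk detection, personalized recommendation, and question answering with explanations), supporting a path from data to explainable decisions and providing baselines for nutritional interventions.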

Link: https://arxiv.org/abs/2601.18106
Authors: Jiatan Huang,Zheyuan Zhang,Tianyi Ma,Mingchen Li,Yaning Zheng,Yanfang Ye,Chuxu Zhang
Affiliations: University of Connecticut; University of Notre Dame; University of Massachusetts Amherst
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Nutritional interventions are important for managing chronic health conditions, but current computational methods provide limited support for personalized dietary guidance. We identify three key gaps: (1) dietary pattern studies often ignore real-world constraints such as socioeconomic status, comorbidities, and limited food access; (2) recommendation systems rarely explain why a particular food helps a given patient; and (3) no unified benchmark evaluates methods across the connected tasks needed for nutritional interventions. We introduce GLEN-Bench, the first comprehensive graph-language based benchmark for nutritional health assessment. We combine NHANES health records, FNDDS food composition data, and USDA food-access metrics to build a knowledge graph that links demographics, health conditions, dietary behaviors, poverty-related constraints, and nutrient needs. We test the benchmark using opioid use disorder, where models must detect subtle nutritional differences across disease stages. GLEN-Bench includes three linked tasks: risk detection identifies at-risk individuals from dietary and socioeconomic patterns; recommendation suggests personalized foods that meet clinical needs within resource constraints; and question answering provides graph-grounded, natural-language explanations to facilitate comprehension. We evaluate these graph-language approaches, including graph neural networks, large language models, and hybrid architectures, to establish solid baselines and identify practical design choices. Our analysis identifies clear dietary patterns linked to health risks, providing insights that can guide practical interventions.

[NLP-68] CHiRPE: A Step Towards Real-World Clinical NLP with Clinician-Oriented Model Explanations EACL2026

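[Quick read]: This paper addresses the problem that natural language processing (NLP) tools are hard for clinicians to adopt because they lack interpretability. Traditional explainable AI (XAI) methods are misaligned with clinical reasoning and are developed without clinician input. The key to the solution is CHiRPE (Clinical High-Risk Prediction with Explainability), an NLP pipeline co-designed with clinicians. It integrates symptom-domain mapping, large language model (LLM) summarisation, and BERT classification to predict psychosis risk, and it introduces novel SHAP explanation formats, in particular concept-guided explanations and hybrid graph-and-text summaries, which 28 clinical experts strongly preferred. The approach combines high predictive accuracy with clinical interpretability, supporting clinically guided model development.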

Link: https://arxiv.org/abs/2601.18102
Authors: Stephanie Fong,Zimu Wang,Guilherme C. Oliveira,Xiangyu Zhao,Yiwen Jiang,Jiahe Liu,Beau-Luke Colton,Scott Woods,Martha E. Shenton,Barnaby Nelson,Zongyuan Ge,Dominic Dwyer
Affiliations: Orygen and The University of Melbourne; AIM for Health Lab, Monash University; University of Liverpool; Yale School of Medicine, Yale University; Brigham and Women’s Hospital, Harvard Medical School
Categories: Computation and Language (cs.CL)
Comments: This paper is accepted at EACL 2026

Abstract:The medical adoption of NLP tools requires interpretability by end users, yet traditional explainable AI (XAI) methods are misaligned with clinical reasoning and lack clinician input. We introduce CHiRPE (Clinical High-Risk Prediction with Explainability), an NLP pipeline that takes transcribed semi-structured clinical interviews to: (i) predict psychosis risk; and (ii) generate novel SHAP explanation formats co-developed with clinicians. Trained on 944 semi-structured interview transcripts across 24 international clinics of the AMP-SCZ study, the CHiRPE pipeline integrates symptom-domain mapping, LLM summarisation, and BERT classification. CHiRPE achieved over 90% accuracy across three BERT variants and outperformed baseline models. Explanation formats were evaluated by 28 clinical experts who indicated a strong preference for our novel concept-guided explanations, especially hybrid graph-and-text summary formats. CHiRPE demonstrates that clinically-guided model development produces both accurate and interpretable results. Our next step is focused on real-world testing across our 24 international sites.

[NLP-69] Sparks of Cooperative Reasoning: LLMs as Strategic Hanabi Agents

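[Quick read]: This paper addresses the challenge of cooperative reasoning among multiple agents under incomplete information, using the card game Hanabi, which requires theory-of-mind reasoning and strategic communication. The key to the approach is a systematic evaluation of language models of different scales (4B to 600B+ parameters) in three settings: a minimal prompt with only explicit card details (Watson setting), scaffolding with programmatic, Bayesian-motivated deductions (Sherlock setting), and multi-turn state tracking via working memory (Mycroft setting). The study finds that models can track state through an internal working memory and that cross-play performance interpolates smoothly with model strength. In the Sherlock setting the strongest models average more than 15 points, still below experienced humans and specialist Hanabi agents (above 20). Supervised and RL fine-tuning of a 4B model on the new HanabiLogs and HanabiRewards datasets improves cooperative play by 21% and 156% respectively, surpasses the non-reasoning model GPT-4.1 by 52%, and generalizes to other tasks such as group guessing, temporal reasoning, and instruction following.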

Link: https://arxiv.org/abs/2601.18077
Authors: Mahesh Ramesh,Kaousheik Jayakumar,Aswinkumar Ramkumar,Pavan Thodima,Aniket Rege
Affiliations: University of Wisconsin-Madison; University of Maryland, College Park
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Cooperative reasoning under incomplete information remains challenging for both humans and multi-agent systems. The card game Hanabi embodies this challenge, requiring theory-of-mind reasoning and strategic communication. We benchmark 17 state-of-the-art LLM agents in 2-5 player games and study the impact of context engineering across model scales (4B to 600B+) to understand persistent coordination failures and robustness to scaffolding: from a minimal prompt with only explicit card details (Watson setting), to scaffolding with programmatic, Bayesian-motivated deductions (Sherlock setting), to multi-turn state tracking via working memory (Mycroft setting). We show that (1) agents can maintain an internal working memory for state tracking and (2) cross-play performance between different LLMs smoothly interpolates with model strength. In the Sherlock setting, the strongest reasoning models exceed 15 points on average across player counts, yet still trail experienced humans and specialist Hanabi agents, both consistently scoring above 20. We release the first public Hanabi datasets with annotated trajectories and move utilities: (1) HanabiLogs, containing 1,520 full game logs for instruction tuning, and (2) HanabiRewards, containing 560 games with dense move-level value annotations for all candidate moves. Supervised and RL finetuning of a 4B open-weight model (Qwen3-Instruct) on our datasets improves cooperative Hanabi play by 21% and 156% respectively, bringing performance to within ~3 points of a strong proprietary reasoning model (o4-mini) and surpassing the best non-reasoning model (GPT-4.1) by 52%. The HanabiRewards RL-finetuned model further generalizes beyond Hanabi, improving performance on a cooperative group-guessing benchmark by 11%, temporal reasoning on EventQA by 6.4%, instruction-following on IFBench-800K by 1.7 Pass@10, and matching AIME 2025 mathematical reasoning Pass@10.

[NLP-70] Grounded Concreteness: Human-Like Concreteness Sensitivity in Vision-Language Models

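[Quick read]: This paper asks whether vision-language models (VLMs), compared with text-only large language models (LLMs), show more human-like sensitivity to linguistic concreteness when both are evaluated with text-only prompts. To answer this, the authors run a controlled comparison between matched Llama text backbones and their Llama Vision counterparts at several model scales, treating multimodal pretraining as an ablation on perceptual grounding rather than as access to images at inference. The key to the approach is measurement at three complementary levels: output behavior (relating concreteness to QA accuracy), embedding geometry (whether representations organize along a concreteness axis), and attention dynamics (context reliance quantified with attention entropy). The authors also elicit token-level concreteness ratings from the models and test their alignment with human norm distributions. Across these measures VLMs show stronger grounded sensitivity, suggesting that multimodal pretraining improves concreteness-aware language understanding.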

Link: https://arxiv.org/abs/2601.18065
Authors: Aryan Roy,Zekun Wang,Christopher J. MacLellan
Affiliations: Georgia Institute of Technology
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Do vision–language models (VLMs) develop more human-like sensitivity to linguistic concreteness than text-only large language models (LLMs) when both are evaluated with text-only prompts? We study this question with a controlled comparison between matched Llama text backbones and their Llama Vision counterparts across multiple model scales, treating multimodal pretraining as an ablation on perceptual grounding rather than access to images at inference. We measure concreteness effects at three complementary levels: (i) output behavior, by relating question-level concreteness to QA accuracy; (ii) embedding geometry, by testing whether representations organize along a concreteness axis; and (iii) attention dynamics, by quantifying context reliance via attention-entropy measures. In addition, we elicit token-level concreteness ratings from models and evaluate alignment to human norm distributions, testing whether multimodal training yields more human-consistent judgments. Across benchmarks and scales, VLMs show larger gains on more concrete inputs, exhibit clearer concreteness-structured representations, produce ratings that better match human norms, and display systematically different attention patterns consistent with increased grounding.

[NLP-71] Neurocomputational Mechanisms of Syntactic Transfer in Bilingual Sentence Production

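[Quick read]: This paper addresses the difficulty of explaining the neural mechanisms of cross-linguistic influence (CLI) from bilingual production errors and their traditionally documented timing signatures (such as event-related potentials) alone. The key to the approach is to bring in oscillatory neural signatures and to use the neurocomputational model ROSE as an implementational-level framework. On this account, CLI is driven by specific oscillatory failure modes during second-language (L2) sentence planning, which gives theories of functional inhibition/competition a grounding in neural dynamics and opens the way to more spatiotemporally complex biomarkers of language dysfunction than commonly discussed neural signatures.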

Link: https://arxiv.org/abs/2601.18056
Authors: Ahmet Yavuz Uluslu,Elliot Murphy
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We discuss the benefits of incorporating into the study of bilingual production errors and their traditionally documented timing signatures (e.g., event-related potentials) certain types of oscillatory signatures, which can offer new implementational-level constraints for theories of bilingualism. We argue that a recent neural model of language, ROSE, can offer a neurocomputational account of syntactic transfer in bilingual production, capturing some of its formal properties and the scope of morphosyntactic sequencing failure modes. We take as a case study cross-linguistic influence (CLI) and attendant theories of functional inhibition/competition, and present these as being driven by specific oscillatory failure modes during L2 sentence planning. We argue that modeling CLI in this way not only offers the kind of linking hypothesis ROSE was built to encourage, but also licenses the exploration of more spatiotemporally complex biomarkers of language dysfunction than more commonly discussed neural signatures.

[NLP-72] Addressing LLM Diversity by Infusing Random Concepts

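[Quick read]: This paper addresses the limited diversity of outputs generated by large language models (LLMs). The key to the solution is to infuse random concepts (such as unrelated words or sentences) into the prompt to elicit more diverse responses. Experiments show that this strategy increases output diversity, and the authors design a systematic evaluation protocol to quantify the effect, offering a starting point and a benchmark method for future work on using randomness to improve model diversity.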

Link: https://arxiv.org/abs/2601.18053
Authors: Pulin Agrawal,Prasoon Goyal
Affiliations: Pennsylvania State University; Amazon
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) are known to produce outputs with limited diversity. In this work, we study whether infusing random concepts in the prompts can improve the diversity of the generated outputs. To benchmark the approach, we design a systematic evaluation protocol which involves prompting an LLM with questions of the form “Name 10 Hollywood actors”, and analyzing diversity measures of the resulting LLM outputs. Our experiments on multiple LLMs show that prepending random words/sentences unrelated to the prompt result in greater diversity in the outputs of LLMs. We believe that this promising result and the evaluation protocol opens up interesting avenues for future work, such as how infusing randomness into LLMs could be applied to other domains. Further, the evaluation protocol could also inspire research into benchmarking LLM diversity more systematically.
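
The prompting idea can be sketched in a few lines. The helper below prepends randomly sampled words to a prompt, and `unique_ratio` is one simple way to score diversity across repeated runs; the function names, the separator, and the metric are illustrative assumptions, since the abstract does not specify the exact prompt format or diversity measures.

```python
import random


def infuse_random_concepts(prompt, vocabulary, k=3, seed=None):
    """Prepend k randomly sampled, unrelated words to the prompt."""
    rng = random.Random(seed)
    words = rng.sample(vocabulary, k)
    return " ".join(words) + "\n\n" + prompt


def unique_ratio(answer_lists):
    """Fraction of distinct items among all items returned across runs."""
    items = [item.strip().lower() for answers in answer_lists for item in answers]
    return len(set(items)) / len(items) if items else 0.0
```

Each call with a different seed yields a different prefix, so repeated queries of the form "Name 10 Hollywood actors" start from different contexts; a higher `unique_ratio` over the collected answers would indicate more diverse outputs.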

[NLP-73] Sentipolis: Emotion-Aware Agents for Social Simulations

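[Quick read]: This paper addresses emotional amnesia and weak long-horizon emotional continuity in LLM-based social simulation, which arise because emotion is treated as a transient cue. The core solution is the Sentipolis framework, whose key elements are a continuous Pleasure-Arousal-Dominance (PAD) emotion representation, dual-speed emotion dynamics, and emotion-memory coupling. Together these give agents a persistent emotional state and improve emotional coherence and communication quality in social interaction.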

Link: https://arxiv.org/abs/2601.18027
Authors: Chiyuan Fu,Lyuhao Chen,Yunze Xiao,Weihao Xuan,Carlos Busso,Mona Diab
Affiliations: Carnegie Mellon University; The University of Tokyo; RIKEN AIP
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:LLM agents are increasingly used for social simulation, yet emotion is often treated as a transient cue, causing emotional amnesia and weak long-horizon continuity. We present Sentipolis, a framework for emotionally stateful agents that integrates continuous Pleasure-Arousal-Dominance (PAD) representation, dual-speed emotion dynamics, and emotion–memory coupling. Across thousands of interactions over multiple base models and evaluators, Sentipolis improves emotionally grounded behavior, boosting communication, and emotional continuity. Gains are model-dependent: believability increases for higher-capacity models but can drop for smaller ones, and emotion-awareness can mildly reduce adherence to social norms, reflecting a human-like tension between emotion-driven behavior and rule compliance in social simulation. Network-level diagnostics show reciprocal, moderately clustered, and temporally stable relationship structures, supporting the study of cumulative social dynamics such as alliance formation and gradual relationship change.

[NLP-74] CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data

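[Quick read]: This paper addresses the poor performance of language identification (LID) models on web data when curating multilingual corpora, especially for low-resource languages and noisy text. The key to the solution is CommonLID, a community-driven, human-annotated LID benchmark covering 109 languages, many of them previously under-served, which provides an evaluation standard closer to the distribution of real web text. By testing eight popular LID models on CommonLID and five other common evaluation sets, the authors show that existing evaluations overestimate accuracy for many languages, underlining the benchmark's value for building more representative, higher-quality multilingual corpora.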

Link: https://arxiv.org/abs/2601.18026
Authors: Pedro Ortiz Suarez,Laurie Burchell,Catherine Arnett,Rafael Mosquera-Gómez,Sara Hincapie-Monsalve,Thom Vaughan,Damian Stewart,Malte Ostendorff,Idris Abdulmumin,Vukosi Marivate,Shamsuddeen Hassan Muhammad,Atnafu Lambebo Tonja,Hend Al-Khalifa,Nadia Ghezaiel Hammouda,Verrah Otiende,Tack Hwa Wong,Jakhongir Saydaliev,Melika Nobakhtian,Muhammad Ravi Shulthan Habibi,Chalamalasetti Kranti,Carol Muchemi,Khang Nguyen,Faisal Muhammad Adam,Luis Frentzen Salim,Reem Alqifari,Cynthia Amol,Joseph Marvin Imperial,Ilker Kesen,Ahmad Mustafid,Pavel Stepachev,Leshem Choshen,David Anugraha,Hamada Nayel,Seid Muhie Yimam,Vallerie Alexandra Putra,My Chiffon Nguyen,Azmine Toushik Wasi,Gouthami Vadithya,Rob van der Goot,Lanwenn ar C’horr,Karan Dua,Andrew Yates,Mithil Bangera,Yeshil Bangera,Hitesh Laxmichand Patel,Shu Okabe,Fenal Ashokbhai Ilasariya,Dmitry Gaynullin,Genta Indra Winata,Yiyuan Li,Juan Pablo Martínez,Amit Agarwal,Ikhlasul Akmal Hanif,Raia Abu Ahmad,Esther Adenuga,Filbert Aurelian Tjiaranata,Weerayut Buaphet,Michael Anugraha,Sowmya Vajjala,Benjamin Rice,Azril Hafizi Amirudin,Jesujoba O. Alabi,Srikant Panda,Yassine Toughrai,Bruhan Kyomuhendo,Daniel Ruffinelli,Akshata A,Manuel Goulão,Ej Zhou,Ingrid Gabriela Franco Ramirez,Cristina Aggazzotti,Konstantin Dobler,Jun Kevin,Quentin Pagès,Nicholas Andrews,Nuhu Ibrahim,Mattes Ruckdeschel,Amr Keleg,Mike Zhang,Casper Muziri,Saron Samuel,Sotaro Takeshita,Kun Kerdthaisong,Luca Foppiano,Rasul Dent,Tommaso Green,Ahmad Mustapha Wali,Kamohelo Makaaka,Vicky Feliren,Inshirah Idris,Hande Celikkanat,Abdulhamid Abubakar,Jean Maillard,Benoît Sagot,Thibault Clérice,Kenton Murray,Sarah Luger
Affiliations: Common Crawl Foundation; EleutherAI; Factored AI; MLCommons; University of Pretoria; Imperial College London; MBZUAI; King Saud University; University of Hail; USIU-Africa; Universiti Teknologi PETRONAS; EPFL; Tehran Institute for Advanced Studies; Khatam University; Universitas Indonesia; University of Potsdam; Universität Trier; Michigan State University; NOUN (ACETEL); Academia Sinica; National Taiwan University of Science and Technology; Maseno University; Tonative Africa; University of Bath; National University Philippines; University of Copenhagen; Independent; University of Edinburgh; MIT-IBM Watson AI Research; MIT; IBM Research; Stanford University; Benha University; University of Hamburg; Bina Nusantara University; SEACrowd; Computational Intelligence and Operations Laboratory; University of New Haven; IT University of Copenhagen; Ofis Publik ar Brezhoneg; Johns Hopkins University; New York University; TUM Heilbronn; Stevens Institute of Technology; Capital One; University of North Carolina at Chapel Hill; Academia Aragonesa de la Lengua; Universidad de Zaragoza; Liverpool John Moores University; DFKI Berlin; The African Research Collective; Vidyasirimedhi Institute of Science and Technology; National Research Council, Canada; Princeton University; University of The People; Saarland University; Birla Institute of Technology and Science; LORIA; University of Mannheim; NeuralShift; University of Cambridge; Hasso Plattner Institute; ELLIS Unit Potsdam; Universitas Pelita Harapan; University of Manchester; Thammasat University; ScienciaLAB; Inria Paris; University of Bucharest; Monash University, Indonesia; Wadmedani Ahlia University; NSUK; Council for Ligurian Linguistic Heritage
Categories: Computation and Language (cs.CL)
Comments: 17 pages, 7 tables, 5 figures

Abstract:Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language models. In this paper, we introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. Many of the included languages have been previously under-served, making CommonLID a key resource for developing more representative high-quality text corpora. We show CommonLID’s value by using it, alongside five other common evaluation sets, to test eight popular LID models. We analyse our results to situate our contribution and to provide an overview of the state of the art. In particular, we highlight that existing evaluations overestimate LID accuracy for many languages in the web domain. We make CommonLID and the code used to create it available under an open, permissive license.

[NLP-75] A System for Name and Address Parsing with Large Language Models

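[Quick read]: This paper addresses the reliable conversion of unstructured person-name and address text into structured data, a challenge that is especially acute in large-scale information systems. Traditional rule-based or probabilistic methods work well on clean input but fail under noisy or multilingual conditions, while neural models and large language models (LLMs) generalize better but lack deterministic control and reproducibility. The key to the solution is a prompt-driven, validation-centered framework that maps free text to a consistent 17-field schema without fine-tuning. Its core elements are input normalisation, structured prompt design, constrained decoding, and strict rule-based validation under fixed experimental settings, yielding high field-level accuracy, strong schema adherence, and stable confidence calibration, and offering an interpretable, scalable, and robust alternative for structured information extraction.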

Link: https://arxiv.org/abs/2601.18014
Authors: Adeeba Tarannum,Muzakkiruddin Ahmed Mohammed,Mert Can Cakmak,Shames Al Mandalawi,John Talburt
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reliable transformation of unstructured person and address text into structured data remains a key challenge in large-scale information systems. Traditional rule-based and probabilistic approaches perform well on clean inputs but fail under noisy or multilingual conditions, while neural and large language models (LLMs) often lack deterministic control and reproducibility. This paper introduces a prompt-driven, validation-centered framework that converts free-text records into a consistent 17-field schema without fine-tuning. The method integrates input normalisation, structured prompting, constrained decoding, and strict rule-based validation under fixed experimental settings to ensure reproducibility. Evaluations on heterogeneous real-world address data show high field-level accuracy, strong schema adherence, and stable confidence calibration. The results demonstrate that combining deterministic validation with generative prompting provides a robust, interpretable, and scalable solution for structured information extraction, offering a practical alternative to training-heavy or domain-specific models.

[NLP-76] Evaluating Semantic and Syntactic Understanding in Large Language Models for Payroll Systems

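[Quick read]: This paper addresses the unreliability of large language models (LLMs) on exact numerical calculation and auditable output, especially in high-stakes settings such as payroll systems, where a model must understand the rules, apply them in the right order, and produce cent-accurate results. The key to the approach is a tiered dataset and a range of prompting strategies (from minimal baselines to schema-guided and reasoning-enhanced prompts), used to systematically evaluate several model families (GPT, Claude, Perplexity, Grok, and Gemini). The study identifies clear regimes where careful prompting is sufficient and regimes where explicit computation is required, providing a reproducible framework and practical guidance for deployments that demand both accuracy and verifiability.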

Link: https://arxiv.org/abs/2601.18012
Authors: Hendrika Maclean,Mert Can Cakmak,Muzakkiruddin Ahmed Mohammed,Shames Al Mandalawi,John Talburt
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models are now used daily for writing, search, and analysis, and their natural language understanding continues to improve. However, they remain unreliable on exact numerical calculation and on producing outputs that are straightforward to audit. We study synthetic payroll system as a focused, high-stakes example and evaluate whether models can understand a payroll schema, apply rules in the right order, and deliver cent-accurate results. Our experiments span a tiered dataset from basic to complex cases, a spectrum of prompts from minimal baselines to schema-guided and reasoning variants, and multiple model families including GPT, Claude, Perplexity, Grok and Gemini. Results indicate clear regimes where careful prompting is sufficient and regimes where explicit computation is required. The work offers a compact, reproducible framework and practical guidance for deploying LLMs in settings that demand both accuracy and assurance.
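
To make "explicit computation" concrete, the sketch below computes a cent-accurate net pay with Python's `decimal` module instead of asking a model to do the arithmetic. The payroll rules here (overtime above 40 hours at 1.5x, a flat tax rate, rounding half up at each step) are hypothetical and are not taken from the paper's synthetic payroll schema.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def to_cents(amount):
    """Round a Decimal amount to whole cents, rounding halves up."""
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)


def net_pay(hours, rate, tax_rate, overtime_threshold="40", overtime_multiplier="1.5"):
    """Apply the (hypothetical) rules in a fixed order: overtime, gross, tax, net."""
    hours, rate, tax_rate = Decimal(hours), Decimal(rate), Decimal(tax_rate)
    threshold = Decimal(overtime_threshold)
    regular_hours = min(hours, threshold)
    overtime_hours = max(hours - threshold, Decimal("0"))
    gross = to_cents(regular_hours * rate + overtime_hours * rate * Decimal(overtime_multiplier))
    tax = to_cents(gross * tax_rate)
    return gross - tax
```

Passing amounts as strings avoids binary floating-point error, and the fixed order of steps makes each intermediate value easy to audit.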

[NLP-77] PEAR: Pairwise Evaluation for Automatic Relative Scoring in Machine Translation

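[Quick read]: This paper addresses reference-free quality evaluation for machine translation (MT), that is, how to measure translation quality accurately without human reference translations. Traditional quality estimation (QE) methods score a single candidate at a time and capture relative quality between translations poorly. The paper proposes PEAR (Pairwise Evaluation for Automatic Relative Scoring), whose core idea is to recast reference-free evaluation as a graded pairwise comparison: given a source segment and two candidate translations, the model predicts the direction and magnitude of their quality difference. The key is to use differences in human judgments as supervision and to add a regularization term that encourages the prediction to flip sign when the candidate order is reversed. On the WMT24 meta-evaluation benchmark, PEAR outperforms single-candidate QE models trained on the same data and backbones, surpasses much larger QE and reference-based metrics with fewer parameters, provides a less redundant evaluation signal, and works as an effective utility function for Minimum Bayes Risk (MBR) decoding.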

Link: https://arxiv.org/abs/2601.18006
Authors: Lorenzo Proietti,Roman Grundkiewicz,Matt Post
Affiliations: Sapienza University of Rome; Microsoft
Categories: Computation and Language (cs.CL)
Comments: 18 pages

Abstract:We present PEAR (Pairwise Evaluation for Automatic Relative Scoring), a supervised Quality Estimation (QE) metric family that reframes reference-free Machine Translation (MT) evaluation as a graded pairwise comparison. Given a source segment and two candidate translations, PEAR predicts the direction and magnitude of their quality difference. The metrics are trained using pairwise supervision derived from differences in human judgments, with an additional regularization term that encourages sign inversion under candidate order reversal. On the WMT24 meta-evaluation benchmark, PEAR outperforms strictly matched single-candidate QE baselines trained with the same data and backbones, isolating the benefit of the proposed pairwise formulation. Despite using substantially fewer parameters than recent large metrics, PEAR surpasses far larger QE models and reference-based metrics. Our analysis further indicates that PEAR yields a less redundant evaluation signal relative to other top metrics. Finally, we show that PEAR is an effective utility function for Minimum Bayes Risk (MBR) decoding, reducing pairwise scoring cost at negligible impact.
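
The training objective described above can be sketched as a regression toward the human score difference plus a penalty on violations of sign inversion. The squared-error form, the penalty form, and the weight `lam` are assumptions made for illustration; the abstract does not give the exact loss.

```python
def pear_style_loss(pred_ab, pred_ba, human_a, human_b, lam=1.0):
    """Pairwise loss with an antisymmetry regularizer (illustrative form).

    pred_ab: model score for (source, candidate A, candidate B).
    pred_ba: model score for the same pair with the candidates swapped.
    human_a, human_b: human quality judgments of the two candidates.
    """
    target = human_a - human_b              # signed, graded quality difference
    regression = (pred_ab - target) ** 2    # fit direction and magnitude
    antisymmetry = (pred_ab + pred_ba) ** 2  # zero when pred_ba == -pred_ab
    return regression + lam * antisymmetry
```

A model that predicts `+0.5` for (A, B) and `-0.5` for (B, A) when the human scores differ by 0.5 incurs zero loss under this form.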

[NLP-78] AI-based approach to burnout identification from textual data

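[Quick read]: This paper addresses how to detect workplace burnout automatically from text using natural language processing (NLP), in particular for monitoring large volumes of written communication in high-stress work environments. The key to the solution is a pretrained RuBERT model fine-tuned on two data sources: synthetic sentences generated with ChatGPT and real user comments from Russian YouTube videos. The resulting classifier assigns a burnout probability to input text, enabling automated detection and quantification of burnout-related language signals.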

Link: https://arxiv.org/abs/2601.17993
Authors: Marina Zavertiaeva,Petr Parshakov,Mikhail Usanin,Aleksei Smirnov,Sofia Paklina,Anastasiia Kibardina
Affiliations: National Research University Higher School of Economics
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 2 figures

Abstract:This study introduces an AI-based methodology that utilizes natural language processing (NLP) to detect burnout from textual data. The approach relies on a RuBERT model originally trained for sentiment analysis and subsequently fine-tuned for burnout detection using two data sources: synthetic sentences generated with ChatGPT and user comments collected from Russian YouTube videos about burnout. The resulting model assigns a burnout probability to input texts and can be applied to process large volumes of written communication for monitoring burnout-related language signals in high-stress work environments.

[NLP-79] SD-E2: Semantic Exploration for Reasoning Under Token Budgets EACL2026

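[Quick read]: This paper addresses the limited performance of small language models (SLMs) on complex reasoning tasks, where the core bottleneck is the high cost of exploration under tight compute budgets. The key to the solution is a reinforcement learning framework, Semantic Diversity-Exploration-Exploitation (SD-E²), which explicitly optimizes semantic diversity in generated reasoning trajectories. A frozen sentence-embedding model measures differences in semantic space, and a diversity reward captures strategy coverage and average pairwise semantic distance rather than surface-form novelty. This reward is combined with outcome correctness and solution efficiency in a normalized multi-objective function that stabilizes training and improves reasoning. Experiments show that SD-E² clearly outperforms baselines on GSM8K, MedMCQA, and AIME, indicating that rewarding semantic novelty guides the exploration-exploitation balance of resource-constrained models more efficiently.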

链接: https://arxiv.org/abs/2601.17982
作者: Kshitij Mishra,Nils Lukas,Salem Lahlou
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at EACL 2026

点击查看摘要

Abstract:Small language models (SLMs) struggle with complex reasoning because exploration is expensive under tight compute budgets. We introduce Semantic Diversity-Exploration-Exploitation (SD-E^2), a reinforcement learning framework that makes exploration explicit by optimizing semantic diversity in generated reasoning trajectories. Using a frozen sentence-embedding model, SD-E^2 assigns a diversity reward that captures (i) the coverage of semantically distinct solution strategies and (ii) their average pairwise dissimilarity in embedding space, rather than surface-form novelty. This diversity reward is combined with outcome correctness and solution efficiency in a z-score-normalized multi-objective objective that stabilizes training. On GSM8K, SD-E^2 surpasses the base Qwen2.5-3B-Instruct and strong GRPO baselines (GRPO-CFL and GRPO-CFEE) by +27.4, +5.2, and +1.5 percentage points, respectively, while discovering on average 9.8 semantically distinct strategies per question. We further improve MedMCQA to 49.64% versus 38.37% for the base model and show gains on the harder AIME benchmark (1983-2025), reaching 13.28% versus 6.74% for the base. These results indicate that rewarding semantic novelty yields a more compute-efficient exploration-exploitation signal for training reasoning-capable SLMs. By introducing cognitive adaptation (adjusting the reasoning process structure rather than per-token computation), SD-E^2 offers a complementary path to efficiency gains in resource-constrained models.
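The two ingredients named in this abstract, average pairwise dissimilarity in embedding space and z-score normalization of several objectives, can be sketched as follows. This is only an illustration under stated assumptions: the embeddings are plain lists standing in for a frozen sentence-embedding model, and summing the normalized objectives with equal weights is a hypothetical choice, not something the abstract specifies.

```python
import math
from statistics import mean, pstdev


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def avg_pairwise_dissimilarity(embeddings):
    """Mean of (1 - cosine similarity) over all unordered pairs."""
    n = len(embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    return mean(1.0 - cosine_similarity(embeddings[i], embeddings[j]) for i, j in pairs)


def z_scores(values):
    """Standardize values within one group; all zeros if there is no spread."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def combined_rewards(correctness, efficiency, diversity):
    """Sum the z-scored objectives per rollout (equal weights are an assumption)."""
    columns = [z_scores(correctness), z_scores(efficiency), z_scores(diversity)]
    return [sum(col[i] for col in columns) for i in range(len(correctness))]


orthogonal = avg_pairwise_dissimilarity([[1, 0], [0, 1]])
identical = avg_pairwise_dissimilarity([[1, 0], [1, 0]])
rewards = combined_rewards([1, 0], [1, 0], [1, 0])
```

Normalizing each objective within the group keeps any one of them from dominating purely because of its scale, which is one plausible reading of why the abstract says normalization stabilizes training.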

[NLP-80] LLMs as Cultural Archives: Cultural Commonsense Knowledge Graph Extraction EACL2026

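[Quick read]: This paper addresses the problem that the cultural commonsense in large language models (LLMs) is implicit and unstructured, which makes it hard to use and limits interpretability and usefulness in cross-cultural settings. The key to the solution is an iterative, prompt-based framework that treats LLMs as cultural archives, systematically extracts culture-specific entities, relations, and practices, and organizes them into multi-step inferential chains across languages to build a Cultural Commonsense Knowledge Graph (CCKG). The method improves performance on culture-related tasks and also reveals uneven cultural encoding in current LLMs, suggesting that chain-structured cultural knowledge is a practical basis for culturally grounded NLP.

Link: https://arxiv.org/abs/2601.17971
Authors: Junior Cedric Tonga, Chen Cecilia Liu, Iryna Gurevych, Fajri Koto
Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Ubiquitous Knowledge Processing Lab (UKP Lab); Technische Universität Darmstadt
Categories: Computation and Language (cs.CL)
Comments: EACL 2026 MAIN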

Abstract:Large language models (LLMs) encode rich cultural knowledge learned from diverse web-scale data, offering an unprecedented opportunity to model cultural commonsense at scale. Yet this knowledge remains mostly implicit and unstructured, limiting its interpretability and use. We present an iterative, prompt-based framework for constructing a Cultural Commonsense Knowledge Graph (CCKG) that treats LLMs as cultural archives, systematically eliciting culture-specific entities, relations, and practices and composing them into multi-step inferential chains across languages. We evaluate CCKG on five countries with human judgments of cultural relevance, correctness, and path coherence. We find that the cultural knowledge graphs are better realized in English, even when the target culture is non-English (e.g., Chinese, Indonesian, Arabic), indicating uneven cultural encoding in current LLMs. Augmenting smaller LLMs with CCKG improves performance on cultural reasoning and story generation, with the largest gains from English chains. Our results show both the promise and limits of LLMs as cultural technologies and that chain-structured cultural knowledge is a practical substrate for culturally grounded NLP.

[NLP-81] A Monosemantic Attribution Framework for Stable Interpretability in Clinical Neuroscience Large Language Models

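[Quick read]: This paper addresses the interpretability challenge of deploying large language models (LLMs) in clinical settings such as Alzheimer’s disease progression diagnosis. Existing attribution methods give unstable explanations that differ markedly between methods because LLM representations are polysemantic, while mechanistic interpretability approaches lack direct alignment with inputs and outputs and provide no explicit importance scores. The key to the solution is a unified interpretability framework that builds a layer-level monosemantic embedding space through monosemantic feature extraction and is optimized to explicitly reduce inter-method variability. It produces stable, interpretable input-level importance scores and highlights salient features via a decompressed representation of the target layer, in support of safe and trustworthy use of LLMs in cognitive health and neurodegenerative disease.

Link: https://arxiv.org/abs/2601.17952
Authors: Michail Mamalakis, Tiago Azevedo, Cristian Cosentino, Chiara D’Ercoli, Subati Abulikemu, Zhongtian Sun, Richard Bethlehem, Pietro Lio
Affiliations: University of Cambridge; University of Calabria; Sapienza Università di Roma; University of Kent
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: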

Abstract:Interpretability remains a key challenge for deploying large language models (LLMs) in clinical settings such as Alzheimer’s disease progression diagnosis, where early and trustworthy predictions are essential. Existing attribution methods exhibit high inter-method variability and unstable explanations due to the polysemantic nature of LLM representations, while mechanistic interpretability approaches lack direct alignment with model inputs and outputs and do not provide explicit importance scores. We introduce a unified interpretability framework that integrates attributional and mechanistic perspectives through monosemantic feature extraction. By constructing a monosemantic embedding space at the level of an LLM layer and optimizing the framework to explicitly reduce inter-method variability, our approach produces stable input-level importance scores and highlights salient features via a decompressed representation of the layer of interest, advancing the safe and trustworthy application of LLMs in cognitive health and neurodegenerative disease.

[NLP-82] ShapLoRA: Allocation of Low-rank Adaption on Large Language Models via Shapley Value Inspired Importance Estimation

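[Quick read]: This paper addresses the suboptimal performance of low-rank adaptation (LoRA) when ranks are allocated uniformly during fine-tuning of large language models (LLMs). Existing allocation methods rely on unexplainable and unreliable rank-importance measures, which limits the gains LoRA can deliver. The key to the solution is the ShapLoRA framework, which draws on the Shapley value from game theory, combines it with sensitivity analysis to obtain a more explainable rank-importance measure called Shapley sensitivity, and refines the training workflow by computing Shapley sensitivity on a separate validation set and setting up an allocating-retraining procedure for fair comparison. Experiments show that the method outperforms recent baselines with a comparable number of tunable parameters.

Link: https://arxiv.org/abs/2601.17921
Authors: Yi Zhao, Qinghua Yao, Xinyuan Song, Wei Zhu
Affiliations: Singapore Management University; University of Pennsylvania; Emory University; University of Hong Kong
Categories: Computation and Language (cs.CL)
Comments: accepted by CPAL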

Abstract:Low-rank adaption (LoRA) is a representative method in the field of parameter-efficient fine-tuning (PEFT), and is key to democratizing modern large language models (LLMs). Vanilla LoRA is implemented with uniform ranks, and the recent literature has found that properly allocating ranks on the LLM backbones results in performance boosts. However, previous rank allocation methods have limitations since they rely on unexplainable and unreliable importance measures for the LoRA ranks. To address these issues, we propose the ShapLoRA framework. Inspired by the explainable attribution measure Shapley Value, we combine sensitivity-based measures with the idea of coalitions in the collaborative games among LoRA ranks, and propose a more explainable importance measure called Shapley sensitivity. In addition, we optimize the workflow of existing works by: (a) calculating Shapley sensitivity on a separate validation set; (b) setting up the allocating-retraining procedures for fair comparisons. We have conducted experiments on various challenging tasks, and the experimental results demonstrate that our ShapLoRA method can outperform the recent baselines with comparable tunable parameters. Code and fine-tuned models will be open-sourced to facilitate future research.
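The coalition idea behind Shapley-style importance can be illustrated with a generic Monte Carlo permutation estimator. This is not the paper's Shapley sensitivity, whose exact definition is not given in the abstract. The "players" below are hypothetical LoRA ranks, and `toy_value` is a made-up stand-in for a validation-set score.

```python
import random


def shapley_estimates(players, value_fn, num_permutations=200, seed=0):
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        previous = value_fn(frozenset(coalition))
        for player in order:
            coalition.add(player)
            current = value_fn(frozenset(coalition))
            # Marginal contribution of this player to the coalition built so far.
            totals[player] += current - previous
            previous = current
    return {p: total / num_permutations for p, total in totals.items()}


# Toy value function: each hypothetical rank contributes a fixed, additive gain.
toy_gain = {"rank_0": 3, "rank_1": 1, "rank_2": 0}


def toy_value(coalition):
    return sum(toy_gain[r] for r in coalition)


scores = shapley_estimates(list(toy_gain), toy_value)
print(scores)
```

With an additive value function every marginal contribution equals the rank's own gain, so the estimate is exact here. In a real setting the value function would be non-additive, and a rank with a near-zero estimate would be a natural candidate for pruning.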

[NLP-83] Benchmarking Direct Preference Optimization for Medical Large Vision-Language Models EACL2026

[Quick read]: This paper addresses the limited deployability of medical large vision-language models (LVLMs) caused by insufficient alignment and reliability, and in particular the fact that Direct Preference Optimization (DPO) has not been thoroughly validated in high-stakes medical settings. The authors benchmark nine DPO variants on two medical LVLMs and, building on the findings, propose a targeted preference construction strategy that addresses the visual misinterpretation errors common in existing DPO models. This strategy yields a 3.6% improvement over the strongest existing DPO baseline on visual question-answering tasks.

Link: https://arxiv.org/abs/2601.17918
Authors: Dain Kim, Jiwoo Lee, Jaehoon Yun, Yong Hoe Koo, Qingyu Chen, Hyunjae Kim, Jaewoo Kang
Affiliations: Korea University; AIGEN Sciences; Hanyang University College of Medicine; Asan Medical Center, University of Ulsan College of Medicine; Yale University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: EACL 2026 (Findings)

Abstract:Large Vision-Language Models (LVLMs) hold significant promise for medical applications, yet their deployment is often constrained by insufficient alignment and reliability. While Direct Preference Optimization (DPO) has emerged as a potent framework for refining model responses, its efficacy in high-stakes medical contexts remains underexplored, lacking the rigorous empirical groundwork necessary to guide future methodological advances. To bridge this gap, we present the first comprehensive examination of diverse DPO variants within the medical domain, evaluating nine distinct formulations across two medical LVLMs: LLaVA-Med and HuatuoGPT-Vision. Our results reveal several critical limitations: current DPO approaches often yield inconsistent gains over supervised fine-tuning, with their efficacy varying significantly across different tasks and backbones. Furthermore, they frequently fail to resolve fundamental visual misinterpretation errors. Building on these insights, we present a targeted preference construction strategy as a proof-of-concept that explicitly addresses visual misinterpretation errors frequently observed in existing DPO models. This design yields a 3.6% improvement over the strongest existing DPO baseline on visual question-answering tasks. To support future research, we release our complete framework, including all training data, model checkpoints, and our codebase at this https URL.
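For readers unfamiliar with the objective being benchmarked, the standard (vanilla) DPO loss for a single preference pair can be written in a few lines. This sketch covers only the original formulation, not the nine variants the paper evaluates. The log-probabilities are plain floats standing in for summed token log-probabilities from a policy and a frozen reference model, and the default `beta` is an arbitrary illustrative value.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Vanilla DPO loss for one (chosen, rejected) pair: -log(sigmoid(beta * margin))."""
    # How much more the policy prefers the chosen response than the reference does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    x = beta * margin
    # -log(sigmoid(x)) equals log(1 + exp(-x)); pick the numerically stable branch.
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)    # policy identical to reference
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)   # policy favors the chosen response
worse = dpo_loss(-3.0, -1.0, -2.0, -2.0)      # policy favors the rejected response
```

The loss falls as the policy widens its preference margin for the chosen response relative to the reference model, which is why the quality of the preference pairs matters as much as the objective itself.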

[NLP-84] Streaming-dLLM: Accelerating Diffusion LLMs via Suffix Pruning and Dynamic Decoding

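[Quick read]: This paper addresses spatial redundancy and temporal inefficiency in the inference of diffusion large language models (dLLMs). Existing methods model information-sparse suffix regions uniformly during block-wise diffusion, which causes spatial redundancy, and apply a fixed denoising schedule that does not adapt the number of iterations to generation progress, which wastes time. The key to the solution is the training-free Streaming-dLLM framework: spatially, attenuation-guided suffix modeling approximates the full context by pruning redundant mask tokens; temporally, a dynamic confidence-aware strategy with an early-exit mechanism lets the model skip unnecessary iterations for converged tokens. Experiments show up to a 68.2× speedup while maintaining generation quality.

Link: https://arxiv.org/abs/2601.17917
Authors: Zhongyu Xiao, Zhiwei Hao, Jianyuan Guo, Yong Luo, Jia Liu, Jie Xu, Han Hu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Tech report. Code is available at this https URL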

Abstract:Diffusion Large Language Models (dLLMs) offer a compelling paradigm for natural language generation, leveraging parallel decoding and bidirectional attention to achieve superior global coherence compared to autoregressive models. While recent works have accelerated inference via KV cache reuse or heuristic decoding, they overlook the intrinsic inefficiencies within the block-wise diffusion process. Specifically, they suffer from spatial redundancy by modeling information-sparse suffix regions uniformly and from temporal inefficiency by applying fixed denoising schedules across the entire decoding process. To address this, we propose Streaming-dLLM, a training-free framework that streamlines inference across both spatial and temporal dimensions. Spatially, we introduce attenuation-guided suffix modeling to approximate the full context by pruning redundant mask tokens. Temporally, we employ a dynamic confidence-aware strategy with an early exit mechanism, allowing the model to skip unnecessary iterations for converged tokens. Extensive experiments show that Streaming-dLLM achieves up to 68.2× speedup while maintaining generation quality, highlighting its effectiveness in diffusion decoding. The code is available at this https URL.

[NLP-85] Assessment of Generative Named Entity Recognition in the Era of Large Language Models

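[Quick read]: This paper examines the paradigm shift that large language models (LLMs) bring to named entity recognition (NER), namely how generative models can deliver NER that is more flexible, scalable, and accurate. Its key findings are that, with parameter-efficient fine-tuning and structured output formats such as inline bracketed or XML, open-source LLMs reach performance competitive with traditional encoder-based models on flat and nested NER; that the NER capability of LLMs stems from instruction following and generative ability rather than memorization of entity-label pairs; and that NER instruction tuning barely harms general capabilities and can even improve performance on other tasks such as DROP. Generative NER is therefore a promising alternative to traditional methods.

Link: https://arxiv.org/abs/2601.17898
Authors: Qi Zhan, Yile Wang, Hui Huang
Affiliations: Shenzhen University
Categories: Computation and Language (cs.CL)
Comments: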

Abstract:Named entity recognition (NER) is evolving from a sequence labeling task into a generative paradigm with the rise of large language models (LLMs). We conduct a systematic evaluation of open-source LLMs on both flat and nested NER tasks. We investigate several research questions including the performance gap between generative NER and traditional NER models, the impact of output formats, whether LLMs rely on memorization, and the preservation of general capabilities after fine-tuning. Through experiments across eight LLMs of varying scales and four standard NER datasets, we find that: (1) With parameter-efficient fine-tuning and structured formats like inline bracketed or XML, open-source LLMs achieve performance competitive with traditional encoder-based models and surpass closed-source LLMs like GPT-3; (2) The NER capability of LLMs stems from instruction-following and generative power, not mere memorization of entity-label pairs; and (3) Applying NER instruction tuning has minimal impact on general capabilities of LLMs, even improving performance on datasets like DROP due to enhanced entity understanding. These findings demonstrate that generative NER with LLMs is a promising, user-friendly alternative to traditional methods. We release the data and code at this https URL.

[NLP-86] Artificial Intelligence and Intellectual Property Rights: Comparative Transnational Policy Analysis

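[Quick read]: This paper addresses the inadequate adaptation of Indian intellectual property law to AI-generated outputs: weak trade secret protection, missing criteria for patentability and inventorship, and unclear copyright authorship rules, which lead to doctrinal inconsistencies and inefficient enforcement. The key to the solution is a comparative legal study across India, the US, the UK, and the EU that proposes a harmonized legal taxonomy, clarifying the role of AI in the innovation process while preserving innovation incentives and aligning with global IP governance, thereby supporting the modernization of Indian IP law.

Link: https://arxiv.org/abs/2601.17892
Authors: Sahibpreet Singh, Manjit Singh
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Published in Journal of University Institute of Legal Studies, Vol. 19, Issue 1, pp. 182-208, 2025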

Abstract:Artificial intelligence’s rapid integration with intellectual property rights necessitates assessment of its impact on trade secrets, copyrights and patents. This study addresses lacunae in existing laws where India lacks AI-specific provisions, creating doctrinal inconsistencies and enforcement inefficacies. Global discourse on AI-IPR protections remains nascent. The research identifies gaps in Indian IP laws’ adaptability to AI-generated outputs: trade secret protection is inadequate against AI threats; standardized inventorship criteria are absent. Employing doctrinal and comparative methodology, it scrutinizes legislative texts, judicial precedents and policy instruments across India, US, UK and EU. Preliminary findings reveal shortcomings: India’s contract law creates fragmented trade secret regime; Section 3(k) of Indian Patents Act blocks AI invention patenting; copyright varies in authorship attribution. The study proposes harmonized legal taxonomy accommodating AI’s role while preserving innovation incentives. India’s National AI Strategy (2024) shows progress but legislative clarity is imperative. This contributes to global discourse with AI-specific IP protections ensuring resilience and equitable innovation. Promising results underscore recalibrating India’s IP jurisprudence for global alignment.

[NLP-87] Self-Manager: Parallel Agent Loop for Long-form Deep Research

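[Quick read]: This paper addresses the context interference, blocking behavior, and limited scalability and adaptability that arise when agents performing long-horizon deep research are confined to a single context window and a sequential execution paradigm. The key to the solution is Self-Manager, a parallel agent loop in which the main thread creates multiple subthreads with isolated contexts and manages them iteratively through Thread Control Blocks, enabling asynchronous, concurrent execution and more focused, flexible, and efficient task handling.

Link: https://arxiv.org/abs/2601.17879
Authors: Yilong Xu, Zhi Zheng, Xiang Long, Yujun Cai, Yiwei Wang
Affiliations: University of Chinese Academy of Sciences; ModelBest Inc.; University of Queensland; University of California Merced
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: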

Abstract:Long-form deep research requires multi-faceted investigations over extended horizons to get a comprehensive report. When handling such complex tasks, existing agents manage context at the subtask level to overcome linear context accumulation and information loss. However, they still adhere to a single context window and sequential execution paradigm, which results in mutual interference and blocking behavior, restricting scalability and adaptability. To address this issue, this paper introduces Self-Manager, a parallel agent loop that enables asynchronous and concurrent execution. The main thread can create multiple subthreads, each with its own isolated context, and manage them iteratively through Thread Control Blocks, allowing for more focused and flexible parallel agent execution. To assess its effectiveness, we benchmark Self-Manager on DeepResearch Bench, where it consistently outperforms existing single-agent loop baselines across all metrics. Furthermore, we conduct extensive analytical experiments to demonstrate the necessity of Self-Manager’s design choices, as well as its advantages in contextual capacity, efficiency, and generalization.
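The main-thread and subthread arrangement described in this abstract can be pictured with a small `asyncio` sketch. Everything here is an assumption for illustration: the `ThreadControlBlock` fields, the function names, and the use of `asyncio.sleep(0)` as a stand-in for an LLM or tool call are hypothetical, and the paper's iterative management of subthreads is reduced to a single gather.

```python
import asyncio
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ThreadControlBlock:
    """Hypothetical bookkeeping record for one subthread."""
    thread_id: int
    subtask: str
    status: str = "pending"
    context: List[str] = field(default_factory=list)  # isolated per subthread
    result: Optional[str] = None


async def run_subthread(tcb):
    tcb.status = "running"
    tcb.context.append(f"investigating: {tcb.subtask}")
    await asyncio.sleep(0)  # stand-in for an LLM or tool call
    tcb.result = f"notes on {tcb.subtask}"
    tcb.status = "done"


async def main_thread(subtasks):
    # The main thread only sees control blocks, not each subthread's full context.
    tcbs = [ThreadControlBlock(i, s) for i, s in enumerate(subtasks)]
    await asyncio.gather(*(run_subthread(t) for t in tcbs))
    return tcbs


blocks = asyncio.run(main_thread(["background", "methods", "evaluation"]))
report = "\n".join(b.result for b in blocks)
print(report)
```

Because each control block owns its own context list, one subtask's intermediate notes cannot crowd another's, which is the interference problem the abstract attributes to single-context agents.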

[NLP-88] On the Emergence and Test-Time Use of Structural Information in Large Language Models

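[Quick read]: This paper studies how language models learn abstract structural information from observational data and use it at test time for flexible compositional generation. The core challenge is to determine whether models acquire generalizable structural knowledge rather than relying only on patterns in the training data. The key to the approach is a controlled natural language dataset built on linguistic structural transformations, used to analyze empirically how the learning of structural information relates to complex reasoning tasks and to expose the limits of test-time compositional generation.

Link: https://arxiv.org/abs/2601.17869
Authors: Michelle Chao Chen, Moritz Miller, Bernhard Schölkopf, Siyuan Guo
Affiliations: ETH Zurich; Max Planck Institute for Intelligent Systems; University of Cambridge
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: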

Abstract:Learning structural information from observational data is central to producing new knowledge outside the training corpus. This holds for mechanistic understanding in scientific discovery as well as flexible test-time compositional generation. We thus study how language models learn abstract structures and utilize the learnt structural information at test-time. To ensure a controlled setup, we design a natural language dataset based on linguistic structural transformations. We empirically show that the emergence of learning structural information correlates with complex reasoning tasks, and that the ability to perform test-time compositional generation remains limited.

[NLP-89] D-Models and E-Models: Diversity-Stability Trade-offs in the Sampling Behavior of Large Language Models WWW’26

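[Quick read]: This paper asks whether the token-level predictive probability (P_token) of large language models (LLMs) faithfully reflects the task-level target distribution (P_task), that is, whether sampling probabilities match actual task requirements such as information relevance, product purchase probability, or action execution probability. The study finds two behavioral types: D-models (e.g., Qwen-2.5), whose P_token fluctuates strongly and aligns poorly with P_task, and E-models (e.g., Mistral-Small), whose P_token is more stable and better aligned with P_task. The key to the approach is identifying the two types through controlled distribution-sampling simulations and then verifying the systematic diversity-stability trade-off in downstream tasks such as code generation and recommendation. This provides a basis for choosing between D- and E-models, especially in web-scale applications (recommendation, search, conversational agents) that must balance diversity with reliability.

Link: https://arxiv.org/abs/2601.17865
Authors: Jia Gu, Liang Pang, Huawei Shen, Xueqi Cheng
Affiliations: State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Beijing
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 10 figures. Accepted by WWW’26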

Abstract:The predictive probability of the next token (P_token) in large language models (LLMs) is inextricably linked to the probability of relevance for the next piece of information, the purchase probability of the next product, and the execution probability of the next action, all of which fall under the scope of the task-level target distribution (P_task). While LLMs are known to generate samples that approximate real-world distributions, whether their fine-grained sampling probabilities faithfully align with task requirements remains an open question. Through controlled distribution-sampling simulations, we uncover a striking dichotomy in LLM behavior, distinguishing two model types: D-models (e.g., Qwen-2.5), whose P_token exhibits large step-to-step variability and poor alignment with P_task; and E-models (e.g., Mistral-Small), whose P_token is more stable and better aligned with P_task. We further evaluate these two model types in downstream tasks such as code generation and recommendation, revealing systematic trade-offs between diversity and stability that shape task outcomes. Finally, we analyze the internal properties of both model families to probe their underlying mechanisms. These findings offer foundational insights into the probabilistic sampling behavior of LLMs and provide practical guidance on when to favor D- versus E-models. For web-scale applications, including recommendation, search, and conversational agents, our results inform model selection and configuration to balance diversity with reliability under real-world uncertainty, providing a better level of interpretation.

[NLP-90] EFT-CoT: A Multi-Agent Chain-of-Thought Framework for Emotion-Focused Therapy

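[Quick read]: This paper addresses the tendency of current Cognitive Behavioral Therapy (CBT)-based approaches to Mental Health Question Answering (MHQA) to over-rely on “top-down” rational restructuring while neglecting clients’ embodied experiences and primary emotion processing. The key to the solution is EFT-CoT, a multi-agent chain-of-thought framework grounded in Emotion-Focused Therapy (EFT). It follows a “bottom-up” trajectory that decomposes the intervention into a three-stage reasoning flow, “Embodied Perception - Cognitive Exploration - Narrative Intervention”, and uses eight specialized agents to explicitly carry out somatic awareness mapping, adaptive assessment, core belief extraction, and narrative restructuring, yielding more empathetic and structurally professional support.

Link: https://arxiv.org/abs/2601.17842
Authors: Lanqing Du, Yunong Li, YuJie Long, Shihong Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: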

Abstract:Leveraging Large Language Models (LLMs) for Mental Health Question Answering (MHQA) is promising for mitigating resource shortages. However, existing Cognitive Behavioral Therapy (CBT)-based approaches predominantly favor a “top-down” rational restructuring, often neglecting clients’ embodied experiences and primary emotion processing. To address this, we propose an Emotion-Focused Therapy (EFT)-based Multi-Agent Chain-of-Thought framework (EFT-CoT). Adopting a “bottom-up” trajectory, it deconstructs the intervention into a three-stage reasoning flow: “Embodied Perception - Cognitive Exploration - Narrative Intervention.” Utilizing eight specialized agents, the system explicitly executes critical components such as somatic awareness mapping, adaptive assessment, core belief extraction, and narrative restructuring. We further constructed “EFT-Instruct,” a high-quality dataset via Chain-of-Thought distillation of approximately 67,000 authentic texts, and fine-tuned a specialized model, EFT-LLM. Experimental evaluations demonstrate that EFT-LLM outperforms strong baselines and human responses across metrics like empathy depth and structural professionalism. Ablation studies confirm the necessity of the multi-agent mechanism. The model exhibits superior psychological reasoning, offering an effective pathway for interpretable, high-empathy counseling systems.

[NLP-91] Linguistic and Argument Diversity in Synthetic Data for Function-Calling Agents

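[Quick read]: This paper addresses the insufficient quality and diversity of training data for function-calling agents, in particular the limited linguistic diversity of requests and the limited coverage of argument values (e.g., city names, stock tickers) in existing methods. The key to the solution is a synthetic data generation method that needs no hand-crafted rules or taxonomies and instead optimizes general-purpose diversity metrics across both queries and arguments, broadening the dataset while keeping comparable correctness and improving out-of-distribution generalization. The resulting model achieves a 7.4% accuracy gain on the BFCL benchmark over comparable baselines.

Link: https://arxiv.org/abs/2601.17829
Authors: Dan Greenstein, Zohar Karnin, Chen Amiraz, Oren Somekh
Affiliations: Technion; TII
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: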

Abstract:The construction of function calling agents has emerged as a promising avenue for extending model capabilities. A major challenge for this task is obtaining high-quality, diverse data for training. Prior work emphasizes diversity in functions, invocation patterns, and interaction turns, yet linguistic diversity of requests and coverage of arguments (e.g., city_name, stock_ticker) remain underexplored. We propose a method that generates synthetic datasets via optimizing general-purpose diversity metrics across both queries and arguments, without relying on hand-crafted rules or taxonomies, making it robust to different use cases. We demonstrate the effectiveness of our technique via both intrinsic and extrinsic testing, comparing it to SoTA data generation methods. We show superiority over baselines in terms of diversity, while keeping comparable correctness. Additionally, when used as a training set, the model resulting from our dataset exhibits superior out-of-distribution performance compared to analogous models based on the baseline data generation methods. In particular, we achieve a 7.4% increase in accuracy on the BFCL benchmark compared to similar counterparts.

[NLP-92] DIETA: A Decoder-only transformer-based model for Italian-English machine TrAnslation ACL

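[Quick read]: This paper addresses the lack of efficient, small, high-performing dedicated models for Italian-English machine translation (MT). The authors train DIETA, a small decoder-only Transformer with 0.5 billion parameters. The key is a large, heterogeneous parallel corpus of about 207 million sentence pairs spanning parliamentary proceedings, legal texts, web-crawled content, subtitles, news, and literature, supplemented with 352 million back-translated examples produced with pretrained models, together with a new small test set of 450 sentences based on 2025 WikiNews articles for evaluating translation of contemporary text. DIETA performs competitively on multiple Italian-English benchmarks, consistently ranks in the second quartile of a 32-system leaderboard, and outperforms most other sub-3B models on four out of five test suites.

Link: https://arxiv.org/abs/2601.17823
Authors: Pranav Kasela, Marco Braga, Alessandro Ghiotto, Andrea Pilzer, Marco Viviani, Alessandro Raganato
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Published in CLiC-IT '25: this https URL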

Abstract:In this paper, we present DIETA, a small, decoder-only Transformer model with 0.5 billion parameters, specifically designed and trained for Italian-English machine translation. We collect and curate a large parallel corpus consisting of approximately 207 million Italian-English sentence pairs across diverse domains, including parliamentary proceedings, legal texts, web-crawled content, subtitles, news, and literature, together with 352 million back-translated examples produced with pretrained models. Additionally, we create and release a new small-scale evaluation set, consisting of 450 sentences, based on 2025 WikiNews articles, enabling assessment of translation quality on contemporary text. Comprehensive evaluations show that DIETA achieves competitive performance on multiple Italian-English benchmarks, consistently ranking in the second quartile of a 32-system leaderboard and outperforming most other sub-3B models on four out of five test suites. The training script, trained models, curated corpus, and newly introduced evaluation set are made publicly available, facilitating further research and development in specialized Italian-English machine translation. this https URL

[NLP-93] Beyond a Single Perspective: Text Anomaly Detection with Multi-View Language Representations

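[Quick read]: This paper addresses the limitation of two-step “embedding-detector” methods for text anomaly detection (TAD), which depend on a single embedding model and adapt poorly across datasets and anomaly types. The key to the solution is the MCA^2 framework, which uses embeddings from multiple pretrained language models to capture normal textual patterns from several views, a contrastive collaboration module to strengthen complementary interactions across views, and an adaptive allocation module that automatically learns each view’s contribution weight, improving adaptability and detection accuracy across diverse datasets.

Link: https://arxiv.org/abs/2601.17786
Authors: Yixin Liu, Kehan Yan, Shiyuan Li, Qingfeng Chen, Shirui Pan
Affiliations: Griffith University; Guangxi University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 17 pages, 7 tables, and 5 figures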

Abstract:Text anomaly detection (TAD) plays a critical role in various language-driven real-world applications, including harmful content moderation, phishing detection, and spam review filtering. While two-step “embedding-detector” TAD methods have shown state-of-the-art performance, their effectiveness is often limited by the use of a single embedding model and the lack of adaptability across diverse datasets and anomaly types. To address these limitations, we propose to exploit the embeddings from multiple pretrained language models and integrate them into MCA^2, a multi-view TAD framework. MCA^2 adopts a multi-view reconstruction model to effectively extract normal textual patterns from multiple embedding perspectives. To exploit inter-view complementarity, a contrastive collaboration module is designed to leverage and strengthen the interactions across different views. Moreover, an adaptive allocation module is developed to automatically assign the contribution weight of each view, thereby improving the adaptability to diverse datasets. Extensive experiments on 10 benchmark datasets verify the effectiveness of MCA^2 against strong baselines. The source code of MCA^2 is available at this https URL.

[NLP-94] Controlling Reading Ease with Gaze-Guided Text Generation EACL2026

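[Quick read]: This paper addresses how to generate text with controllable reading ease, suited to the cognitive load of different readers. The key to the solution is a model that predicts human gaze patterns and steers language model outputs so that the generated text elicits particular reading behaviors (such as fixation time and number of regressions). Experiments show that the method effectively adjusts reading difficulty, driven mainly by features that affect lexical processing, which opens a path to better information accessibility and personalized language-learning material.

Link: https://arxiv.org/abs/2601.17781
Authors: Andreas Säuberli, Darja Jepifanova, Diego Frassinelli, Barbara Plank
Affiliations: MaiNLP, Center for Information and Language Processing, LMU Munich, Germany; Munich Center for Machine Learning (MCML), Munich, Germany
Categories: Computation and Language (cs.CL)
Comments: Accepted for publication at EACL 2026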

Abstract:The way our eyes move while reading can tell us about the cognitive effort required to process the text. In the present study, we use this fact to generate texts with controllable reading ease. Our method employs a model that predicts human gaze patterns to steer language model outputs towards eliciting certain reading behaviors. We evaluate the approach in an eye-tracking experiment with native and non-native speakers of English. The results demonstrate that the method is effective at making the generated texts easier or harder to read, measured both in terms of reading times and perceived difficulty of the texts. A statistical analysis reveals that the changes in reading behavior are mostly due to features that affect lexical processing. Possible applications of our approach include text simplification for information accessibility and generation of personalized educational material for language learning.

[NLP-95] DPI: Exploiting Parameter Heterogeneity for Interference-Free Fine-Tuning

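[Quick read]: This paper addresses the “seesaw effect” in multi-task supervised fine-tuning (SFT) of large language models (LLMs), where conflicting objectives mean that optimizing one task can degrade others. The key to the solution is a dynamic isolation strategy based on parameter disentanglement: each task is first fine-tuned independently to identify its core parameter region (the subset of parameters with the largest updates); tasks are then merged or split into stages according to how much their core regions overlap; and during multi-stage fine-tuning the core parameters of earlier tasks are frozen so that later tasks cannot overwrite them, which mitigates cross-task interference and stabilizes overall performance.

Link: https://arxiv.org/abs/2601.17777
Authors: Xiaoyu Liu, Xiaoyu Guan, Di Liang, Xianjie Wu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: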

Abstract:Supervised fine-tuning (SFT) is a crucial step for adapting large language models (LLMs) to downstream tasks. However, conflicting objectives across heterogeneous SFT tasks often induce the “seesaw effect”: optimizing for one task may degrade performance on others, particularly when model parameters are updated indiscriminately. In this paper, we propose a principled approach to disentangle and isolate task-specific parameter regions, motivated by the hypothesis that parameter heterogeneity underlies cross-task interference. Specifically, we first independently fine-tune LLMs on diverse SFT tasks and identify each task’s core parameter region as the subset of parameters exhibiting the largest updates. Tasks with highly overlapping core parameter regions are merged for joint training, while disjoint tasks are organized into different stages. During multi-stage SFT, core parameters acquired in prior tasks are frozen, thereby preventing overwriting by subsequent tasks. To verify the effectiveness of our method, we conducted intensive experiments on multiple public datasets. The results showed that our dynamic parameter isolation strategy consistently reduced data conflicts and achieved consistent performance improvements compared to multi-stage and multi-task tuning baselines.
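The three steps in this abstract (find each task's core region, compare regions, freeze earlier cores) can be sketched on flat parameter lists. This is a toy illustration under stated assumptions: the top-fraction cutoff, the use of Jaccard overlap as the similarity measure, and the plain SGD step are hypothetical choices, since the abstract does not specify them.

```python
def core_region(before, after, fraction=0.1):
    """Indices of the parameters with the largest absolute update."""
    deltas = [abs(a - b) for a, b in zip(after, before)]
    k = max(1, int(len(deltas) * fraction))
    ranked = sorted(range(len(deltas)), key=lambda i: deltas[i], reverse=True)
    return set(ranked[:k])


def jaccard_overlap(region_a, region_b):
    """Share of core parameters that two tasks have in common (assumed metric)."""
    union = region_a | region_b
    return len(region_a & region_b) / len(union) if union else 0.0


def masked_sgd_step(params, grads, frozen, lr=0.1):
    """One SGD step that leaves frozen (earlier-task core) parameters untouched."""
    return [p if i in frozen else p - lr * g
            for i, (p, g) in enumerate(zip(params, grads))]


task_a_core = core_region([0.0, 0.0, 0.0, 0.0], [0.1, 2.0, -3.0, 0.0], fraction=0.5)
overlap = jaccard_overlap(task_a_core, {2, 3})
updated = masked_sgd_step([1.0, 1.0], [1.0, 1.0], frozen={0}, lr=0.5)
```

A high overlap would suggest training the two tasks jointly, while a low overlap would suggest separate stages with the first task's core indices passed as `frozen` during the second.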

[NLP-96] Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction

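[Quick read]: This paper targets the efficiency bottleneck caused by the large memory footprint of the key-value (KV) cache when deploying large language models (LLMs); existing compression techniques typically trade performance degradation against computational overhead. The key to the solution is a gating-based KV cache eviction method that uses lightweight sink-attention gating modules to identify and retain critical KV pairs and that integrates seamlessly into both the prefill and decoding stages. The gate training algorithm relies only on forward passes, avoiding expensive backpropagation, and a task-agnostic reconstruction objective gives it strong task generalization. As a result, the method evicts up to 70% of the KV cache while maintaining near-lossless performance.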

Link: https://arxiv.org/abs/2601.17668
Authors: Jang-Hyun Kim, Dongyoon Han, Sangdoo Yun
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Efficient key-value (KV) cache management is crucial for the practical deployment of large language models (LLMs), yet existing compression techniques often incur a trade-off between performance degradation and computational overhead. We propose a novel gating-based KV cache eviction method for frozen-weight LLMs that achieves high compression ratios with negligible computational cost. Our approach introduces lightweight sink-attention gating modules to identify and retain critical KV pairs, and integrates seamlessly into both the prefill and decoding stages. The proposed gate training algorithm relies on forward passes of an LLM, avoiding expensive backpropagation, while achieving strong task generalization through a task-agnostic reconstruction objective. Extensive experiments across the Qwen2.5-1M, Qwen3, and Gemma3 families show that our method maintains near-lossless performance while evicting up to 70% of the KV cache. The results are consistent across a wide range of tasks, including long-context understanding, code comprehension, and mathematical reasoning, demonstrating the generality of our approach.
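
To make the eviction step concrete, here is a small sketch of score-based KV pruning. It covers only the final selection step under stated assumptions: the `gate_scores` array is a made-up stand-in for whatever the paper's gating modules output, and `evict_kv` is a hypothetical helper, not part of any released code.

```python
import numpy as np


def evict_kv(keys: np.ndarray, values: np.ndarray, gate_scores: np.ndarray,
             evict_ratio: float = 0.7):
    """Keep the highest-scoring KV pairs and drop the rest, preserving token order."""
    n = gate_scores.shape[0]
    n_keep = max(1, int(round((1.0 - evict_ratio) * n)))
    top = np.argpartition(gate_scores, -n_keep)[-n_keep:]
    keep = np.sort(top)  # restore the original positional order
    return keys[keep], values[keep], keep


# Toy cache: 10 cached tokens with 4-dimensional keys and values.
rng = np.random.default_rng(0)
keys = rng.normal(size=(10, 4))
values = rng.normal(size=(10, 4))
# Hypothetical per-token importance scores in [0, 1].
gate_scores = np.array([0.9, 0.1, 0.2, 0.8, 0.05, 0.3, 0.7, 0.15, 0.25, 0.12])

kept_keys, kept_values, kept_idx = evict_kv(keys, values, gate_scores, evict_ratio=0.7)
```

With a 70% eviction ratio, only the three highest-scoring entries of the ten survive, which mirrors the compression level reported in the abstract.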

[NLP-97] UrduLM: A Resource-Efficient Monolingual Urdu Language Model

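[Quick read]: This paper addresses the lack of dedicated pretrained language models and curated corpora for Urdu in natural language processing (NLP). Current multilingual models offer limited Urdu support and suffer from poor performance, high computational cost, and cultural inaccuracies, mainly because of insufficient training data. The key to the solution is UrduLM, a monolingual pretrained language model built specifically for Urdu. Its core components are: (1) a 33GB Urdu corpus collected and curated from diverse sources; (2) a custom BPE subword tokenizer that reduces tokenization overhead by at least 20-30% compared with multilingual alternatives; and (3) a pretrained decoder-only model with 100M parameters. In few-shot evaluations, UrduLM reaches 66.6% accuracy on sentiment classification and BLEU scores above 30 on grammar correction, competitive with multilingual models up to 30x its size while consuming fewer resources. The openly released pipeline provides a baseline and a scalable framework for NLP research on Urdu and other low-resource languages.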

Link: https://arxiv.org/abs/2601.17664
Authors: Syed Muhammad Ali, Hammad Sajid, Zainab Haider, Ali Muhammad Asad, Haya Fatima, Abdul Samad
Affiliations: Habib University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages

Abstract:Urdu, spoken by 230 million people worldwide, lacks dedicated transformer-based language models and curated corpora. While multilingual models provide limited Urdu support, they suffer from poor performance, high computational costs, and cultural inaccuracies due to insufficient training data. To address these challenges, we present UrduLM, a pretrained Urdu monolingual language model trained in low-resource settings. We curate a 33GB Urdu corpus from diverse sources, develop a custom BPE tokenizer that reduces tokenization overhead by at least 20-30% compared to multilingual alternatives, and pretrain a 100M-parameter decoder-only model. In few-shot evaluations, UrduLM achieves competitive performance with multilingual models up to 30x its size, reaching 66.6% accuracy on sentiment classification and BLEU scores exceeding 30 on grammar correction tasks. The complete methodology – including corpus, tokenizer, model weights, and evaluation benchmarks – is released openly to establish a baseline for Urdu NLP research and provide a scalable framework for other underrepresented languages.

[NLP-98] Beyond the Rabbit Hole: Mapping the Relational Harms of QAnon Radicalization

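[Quick read]: This paper addresses an important gap in research on conspiracy theories: the emotional harm inflicted on the friends and families of believers, an interpersonal dimension that large-scale computational studies have largely overlooked. The key to the solution is a novel mixed-methods approach. It first uses BERTopic topic modeling to map the radicalization trajectories of QAnon adherents, then applies an LDA-based graphical model to uncover six recurring "radicalization personas", and finally uses LLM-assisted emotion detection and regression modeling to link these personas quantitatively to the specific emotional harms that narrators report. The results show that the personas are not merely descriptive categories but strong predictors of particular emotional outcomes, which yields the first empirical framework for understanding radicalization as a relational phenomenon.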

Link: https://arxiv.org/abs/2601.17658
Authors: Bich Ngoc (Rubi) Doan, Giuseppe Russo, Gianmarco De Francisci Morales, Robert West
Affiliations: Stanford University; University of Trento
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The rise of conspiracy theories has created far-reaching societal harm in the public discourse by eroding trust and fueling polarization. Beyond this public impact lies a deeply personal toll on the friends and families of conspiracy believers, a dimension often overlooked in large-scale computational research. This study fills this gap by systematically mapping radicalization journeys and quantifying the associated emotional toll inflicted on loved ones. We use the prominent case of QAnon as a case study, analyzing 12747 narratives from the r/QAnonCasualties support community through a novel mixed-methods approach. First, we use topic modeling (BERTopic) to map the radicalization trajectories, identifying key pre-existing conditions, triggers, and post-radicalization characteristics. From this, we apply an LDA-based graphical model to uncover six recurring archetypes of QAnon adherents, which we term “radicalization personas.” Finally, using LLM-assisted emotion detection and regression modeling, we link these personas to the specific emotional toll reported by narrators. Our findings reveal that these personas are not just descriptive; they are powerful predictors of the specific emotional harms experienced by narrators. Radicalization perceived as a deliberate ideological choice is associated with narrator anger and disgust, while those marked by personal and cognitive collapse are linked to fear and sadness. This work provides the first empirical framework for understanding radicalization as a relational phenomenon, offering a vital roadmap for researchers and practitioners to navigate its interpersonal fallout.

[NLP-99] AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark for LLMs' Contextual and Cultural Knowledge and Thinking

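[Quick read]: This paper addresses the limitations of current multimodal large language models (MLLMs) in understanding Internet audio-visual content, in particular their weak grasp of non-textual signals (such as music and sound effects) and of cultural context and emotion. The key to the solution is AVMeme Exam, a human-curated benchmark of over one thousand iconic Internet audio and video clips spanning speech, songs, music, and sound effects. Each clip is paired with a QA item that probes understanding at several levels, from surface content to context, emotion, usage, and world knowledge, and comes with metadata such as original year, transcript, summary, and sensitivity. Using this benchmark to evaluate state-of-the-art MLLMs alongside human participants, the study finds that current models perform poorly on textless music and sound effects and struggle with contextual and cultural understanding. This points to a significant gap in human-aligned multimodal intelligence and calls for models that perceive context and culture beyond the surface of what they hear and see.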

Link: https://arxiv.org/abs/2601.17645
Authors: Xilin Jiang, Qiaolin Wang, Junkai Wu, Xiaomin He, Zhongweiyang Xu, Yinghao Ma, Minshuo Piao, Kaiyi Yang, Xiuwen Zheng, Riki Shimizu, Yicong Chen, Arsalan Firoozi, Gavin Mischler, Sukru Samet Dindar, Richard Antonello, Linyang He, Tsun-An Hsieh, Xulin Fan, Yulun Wu, Yuesheng Ma, Chaitanya Amballa, Weixiong Chen, Jiarui Hai, Ruisi Li, Vishal Choudhari, Cong Han, Yinghao Aaron Li, Adeen Flinker, Mounya Elhilali, Emmanouil Benetos, Mark Hasegawa-Johnson, Romit Roy Choudhury, Nima Mesgarani
Affiliations: Columbia University; University of Illinois Urbana-Champaign; University of Washington; Johns Hopkins University; New York University; Queen Mary University of London; Google; Meta
Categories: Sound (cs.SD); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: this http URL

Abstract:Internet audio-visual clips convey meaning through time-varying sound and motion, which extend beyond what text alone can represent. To examine whether AI models can understand such signals in human cultural contexts, we introduce AVMeme Exam, a human-curated benchmark of over one thousand iconic Internet sounds and videos spanning speech, songs, music, and sound effects. Each meme is paired with a unique QA assessing levels of understanding from surface content to context and emotion to usage and world knowledge, along with metadata such as original year, transcript, summary, and sensitivity. We systematically evaluate state-of-the-art multimodal large language models (MLLMs) alongside human participants using this benchmark. Our results reveal a consistent limitation: current models perform poorly on textless music and sound effects, and struggle to think in context and in culture compared to surface content. These findings highlight a key gap in human-aligned multimodal intelligence and call for models that can perceive contextually and culturally beyond the surface of what they hear and see. Project page: this http URL

[NLP-100] Memento: Towards Proactive Visualization of Everyday Memories with Personal Wearable AR Assistant

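[Quick read]: This paper addresses the lack of long-term memory and context awareness in current augmented reality (AR) interaction: existing systems do not persistently record a user's queries together with the place, time, and activity in which they occurred, so they cannot respond proactively based on context. The key to the solution is Memento, a conversational AR assistant that permanently stores users' verbal queries along with their spatiotemporal and activity contexts to build a persistent memory store. On this basis, Memento discovers connections between a user's recurring interests and the contexts that trigger them. When a similar or identical context recurs, it proactively recalls the relevant memories and delivers up-to-date responses through AR, integrating the AR experience into the user's daily routine as a connected series of interactions with a coherent long-term perspective, tailored to the user's multimodal (visual, spatial, temporal, and embodied) context.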

Link: https://arxiv.org/abs/2601.17622
Authors: Yoonsang Kim, Yalong Yang, Arie E. Kaufman
Affiliations: Stony Brook University; Georgia Institute of Technology
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 8 pages, 5 figures. This is the author’s version of the article that will appear at the IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (IEEE VRW) 2026

Abstract:We introduce Memento, a conversational AR assistant that permanently captures and memorizes users’ verbal queries alongside their spatiotemporal and activity contexts. By storing these “memories,” Memento discovers connections between users’ recurring interests and the contexts that trigger them. Upon detection of similar or identical spatiotemporal activity, Memento proactively recalls user interests and delivers up-to-date responses through AR, seamlessly integrating the AR experience into their daily routine. Unlike prior work, each interaction in Memento is not a transient event, but part of a connected series of interactions with a coherent long-term perspective, tailored to the user’s broader multimodal (visual, spatial, temporal, and embodied) context. We conduct a preliminary evaluation through user feedback from participants with diverse expertise in immersive apps, and explore the value of a proactive context-aware AR assistant in everyday settings. We share our findings and challenges in designing a proactive, context-aware AR system.

[NLP-101] Agentic Search in the Wild: Intents and Trajectory Dynamics from 14M Real Search Requests

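[Quick read]: This paper addresses the poorly understood behavior of LLM-powered search agents in multi-step information-seeking tasks, in particular the lack of empirical evidence on how agentic search sessions unfold and how retrieved evidence is used. The key to the solution is a large-scale log analysis based on 14.44M requests (3.97M sessions) collected from DeepResearchGym. The logs are sessionized, sessions are labeled with intents, and each step is labeled with a query-reformulation type. The paper also proposes the Context-driven Term Adoption Rate (CTAR), which quantifies whether newly introduced query terms can be traced back to previously retrieved evidence. The analysis reveals key behavioral patterns, including the distribution of session lengths, intent-dependent behavior, and evidence reuse across steps, and thereby provides empirical grounding for repetition-aware early stopping, intent-adaptive retrieval budgets, and explicit cross-step context tracking.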

Link: https://arxiv.org/abs/2601.17617
Authors: Jingjie Ning, João Coelho, Yibo Kong, Yunfan Long, Bruno Martins, João Magalhães, Jamie Callan, Chenyan Xiong
Affiliations: Carnegie Mellon University; Instituto Superior Técnico, University of Lisbon; NOVA LINCS, NOVA University Lisbon
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Abstract:LLM-powered search agents are increasingly being used for multi-step information seeking tasks, yet the IR community lacks empirical understanding of how agentic search sessions unfold and how retrieved evidence is used. This paper presents a large-scale log analysis of agentic search based on 14.44M search requests (3.97M sessions) collected from DeepResearchGym, i.e. an open-source search API accessed by external agentic clients. We sessionize the logs, assign session-level intents and step-wise query-reformulation labels using LLM-based annotation, and propose Context-driven Term Adoption Rate (CTAR) to quantify whether newly introduced query terms are traceable to previously retrieved evidence. Our analyses reveal distinctive behavioral patterns. First, over 90% of multi-turn sessions contain at most ten steps, and 89% of inter-step intervals fall under one minute. Second, behavior varies by intent. Fact-seeking sessions exhibit high repetition that increases over time, while sessions requiring reasoning sustain broader exploration. Third, agents reuse evidence across steps. On average, 54% of newly introduced query terms appear in the accumulated evidence context, with contributions from earlier steps beyond the most recent retrieval. The findings suggest that agentic search may benefit from repetition-aware early stopping, intent-adaptive retrieval budgets, and explicit cross-step context tracking. We plan to release the anonymized logs to support future research.
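
One plausible reading of CTAR can be sketched in a few lines. The abstract only says the metric checks whether newly introduced query terms are traceable to previously retrieved evidence, so the tokenizer, the definition of a "new" term (absent from all earlier queries in the session), and the function names below are assumptions made for illustration.

```python
import re


def terms(text: str) -> set[str]:
    """Lowercased alphanumeric tokens (a deliberately simple tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def ctar_per_step(queries: list[str], evidence: list[list[str]]) -> list:
    """For each step after the first, the share of new query terms found in earlier evidence.

    evidence[i] holds the texts retrieved at step i. A step that introduces no
    new terms yields None.
    """
    seen_query_terms = terms(queries[0])
    context: set[str] = set()
    for doc in evidence[0]:
        context |= terms(doc)

    rates = []
    for step in range(1, len(queries)):
        query_terms = terms(queries[step])
        new_terms = query_terms - seen_query_terms
        if new_terms:
            rates.append(len(new_terms & context) / len(new_terms))
        else:
            rates.append(None)
        seen_query_terms |= query_terms
        # Evidence retrieved at this step becomes visible to later steps only.
        if step < len(evidence):
            for doc in evidence[step]:
                context |= terms(doc)
    return rates


# Toy session with three queries and the evidence returned by the first two.
queries = [
    "kv cache compression",
    "kv cache eviction gating",
    "gating eviction benchmark results",
]
evidence = [
    ["Eviction policies shrink the KV cache."],
    ["Gating modules score each entry."],
]
rates = ctar_per_step(queries, evidence)
```

In the toy session, the second query adopts one of its two new terms from the evidence seen so far, while the third introduces terms that appear nowhere in the accumulated context.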

[NLP-102] What Language Models Know But Don't Say: Non-Generative Prior Extraction for Generalization

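[Quick read]: This paper addresses poor generalization in small-sample settings, especially in practical domains such as medicine and finance, where labeled data is scarce and covariate shift makes it hard for conventional logistic regression models to hold up on real-world populations. The key to the solution is LoID (Logit-Informed Distributions), a deterministic method that extracts informative prior distributions for Bayesian logistic regression by directly analyzing a large language model's (LLM's) token-level predictions. Its core idea is to use carefully constructed sentences to probe how strongly and how consistently the LLM favors a positive or a negative direction of impact for each feature, which quantifies both the feature's influence and the model's confidence in it. Because it does not rely on text generation, the method offers a reproducible, efficient, and reliable way to inject prior knowledge.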

Link: https://arxiv.org/abs/2601.17609
Authors: Sara Rezaeimanesh, Mohammad M. Ghassemi
Affiliations: Michigan State University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In domains like medicine and finance, large-scale labeled data is costly and often unavailable, leading to models trained on small datasets that struggle to generalize to real-world populations. Large language models contain extensive knowledge from years of research across these domains. We propose LoID (Logit-Informed Distributions), a deterministic method for extracting informative prior distributions for Bayesian logistic regression by directly accessing their token-level predictions. Rather than relying on generated text, we probe the model’s confidence in opposing semantic directions (positive vs. negative impact) through carefully constructed sentences. By measuring how consistently the LLM favors one direction across diverse phrasings, we extract the strength and reliability of the model’s belief about each feature’s influence. We evaluate LoID on ten real-world tabular datasets under synthetic out-of-distribution (OOD) settings characterized by covariate shift, where the training data represents only a subset of the population. We compare our approach against (1) standard uninformative priors, (2) AutoElicit, a recent method that prompts LLMs to generate priors via text completions, (3) LLMProcesses, a method that uses LLMs to generate numerical predictions through in-context learning and (4) an oracle-style upper bound derived from fitting logistic regression on the full dataset. We assess performance using Area Under the Curve (AUC). Across datasets, LoID significantly improves performance over logistic regression trained on OOD data, recovering up to 59% of the performance gap relative to the oracle model. LoID outperforms AutoElicit and LLMProcesses on 8 out of 10 datasets, while providing a reproducible and computationally efficient mechanism for integrating LLM knowledge into Bayesian inference.
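
The idea of turning directional consistency into a prior can be illustrated with a toy calculation. The abstract does not give the mapping from token-level predictions to prior parameters, so everything below is a placeholder: the log-probabilities are invented numbers, and the linear rules for the prior mean and standard deviation are assumptions chosen only to show the shape of the computation.

```python
import numpy as np


def prior_from_logits(logp_pos, logp_neg, max_mean: float = 1.0, base_std: float = 1.0):
    """Map per-phrasing log-probabilities to a Gaussian prior (mean, std) for one coefficient.

    logp_pos[i] / logp_neg[i]: log-probability the LLM assigns to the
    "positive impact" / "negative impact" completion of phrasing i.
    """
    diffs = np.asarray(logp_pos, dtype=float) - np.asarray(logp_neg, dtype=float)
    votes = np.sign(diffs)              # +1 when the positive direction is favored
    consistency = float(votes.mean())   # in [-1, 1]; the sign gives the direction
    mean = max_mean * consistency       # assumed: stronger agreement, larger |mean|
    std = base_std * (1.0 - 0.5 * abs(consistency))  # assumed: agreement tightens the prior
    return mean, std


# Hypothetical scores for one feature across four phrasings.
logp_pos = [-1.0, -0.5, -2.0, -0.7]
logp_neg = [-2.0, -1.5, -1.0, -1.9]
prior_mean, prior_std = prior_from_logits(logp_pos, logp_neg)
```

Because the computation reads fixed log-probabilities rather than sampled text, repeating it on the same inputs always returns the same prior, which is the reproducibility property the abstract emphasizes.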

[NLP-103] Learning to Ideate for Machine Learning Engineering Agents EACL2026

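[Quick read]: This paper addresses the limited ability of existing machine learning engineering (MLE) agents to iteratively optimize the effectiveness of the algorithms they implement. The key to the solution is MLE-Ideator, a dual-agent framework that separates ideation from implementation: a dedicated Ideator agent provides strategic help to the implementation agent on request, which improves overall system performance. Experiments show that, without any training, the framework significantly outperforms implementation-only baselines. In addition, training the Ideator with reinforcement learning (RL) on only 1K samples from 10 MLE tasks gives the Qwen3-8B Ideator an 11.5% relative improvement and lets it surpass Claude Sonnet 3.5, which supports the potential of this architecture for building strategic AI systems for scientific discovery.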

Link: https://arxiv.org/abs/2601.17596
Authors: Yunxiang Zhang, Kang Zhou, Zhichao Xu, Kiran Ramnath, Yun Zhou, Sangmin Woo, Haibo Ding, Lin Lee Cheong
Affiliations: AWS AI Labs; University of Michigan
Categories: Computation and Language (cs.CL)
Comments: EACL 2026 main conference

Abstract:Existing machine learning engineering (MLE) agents struggle to iteratively optimize their implemented algorithms for effectiveness. To address this, we introduce MLE-Ideator, a dual-agent framework that separates ideation from implementation. In our system, an implementation agent can request strategic help from a dedicated Ideator. We show this approach is effective in two ways. First, in a training-free setup, our framework significantly outperforms implementation-only agent baselines on MLE-Bench. Second, we demonstrate that the Ideator can be trained with reinforcement learning (RL) to generate more effective ideas. With only 1K training samples from 10 MLE tasks, our RL-trained Qwen3-8B Ideator achieves an 11.5% relative improvement compared to its untrained counterpart and surpasses Claude Sonnet 3.5. These results highlight a promising path toward training strategic AI systems for scientific discovery.

[NLP-104] From Chains to DAGs: Probing the Graph Structure of Reasoning in LLMs

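[Quick read]: This paper asks whether the internal representations of large language models (LLMs) encode the geometry of a graph structure, namely a directed acyclic graph (DAG), when performing multi-step reasoning, rather than only a linear chain of steps. Prior work often treats reasoning as a linear sequence, but many real reasoning problems involve branching, merging, and reuse of intermediate conclusions, which a DAG models more naturally. The key to the solution is the Reasoning DAG Probing framework: lightweight probes are trained to predict two graph-theoretic properties from the model's hidden states, node depth and pairwise node distance. This directly tests whether the LLM encodes reasoning-DAG geometry in a linearly accessible form and shows where that structure emerges across layers.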

Link: https://arxiv.org/abs/2601.17593
Authors: Tianjun Zhong, Linyang He, Nima Mesgarani
Affiliations: Columbia University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Recent progress in large language models has renewed interest in mechanistically characterizing how multi-step reasoning is represented and computed. While much prior work treats reasoning as a linear chain of steps, many reasoning problems are more naturally structured as directed acyclic graphs (DAGs), where intermediate conclusions may depend on multiple premises, branch into parallel sub-derivations, and later merge or be reused. Understanding whether such graph-structured reasoning is reflected in model internals remains an open question. In this work, we introduce Reasoning DAG Probing, a framework that directly asks whether LLM hidden states encode the geometry of a reasoning DAG in a linearly accessible form, and where this structure emerges across layers. Within this framework, we associate each reasoning node with a textual realization and train lightweight probes to predict two graph-theoretic properties from hidden states: node depth and pairwise node distance. We use these probes to analyze the layerwise emergence of DAG structure and evaluate controls that disrupt reasoning-relevant structure while preserving superficial textual properties. Our results provide evidence that reasoning DAG geometry is meaningfully encoded in intermediate layers, with recoverability varying systematically by node depth and model scale, suggesting that LLM reasoning is not only sequential but exhibits measurable internal graph structure.
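
The two probe targets named in the abstract, node depth and pairwise node distance, are easy to compute for a small reasoning DAG. The sketch below shows one way to derive them; the abstract does not define either quantity precisely, so taking depth as the longest path from a parentless premise and distance as the shortest path ignoring edge direction are assumptions, as is the toy graph.

```python
from collections import deque


def node_depths(parents: dict) -> dict:
    """Depth of each node: 0 for premises, else 1 + the deepest parent."""
    depth = {}

    def visit(node):
        if node not in depth:
            if not parents[node]:
                depth[node] = 0
            else:
                depth[node] = 1 + max(visit(p) for p in parents[node])
        return depth[node]

    for node in parents:
        visit(node)
    return depth


def pairwise_distance(parents: dict, a, b):
    """Shortest path length between a and b, treating edges as undirected."""
    adjacency = {node: set() for node in parents}
    for child, node_parents in parents.items():
        for p in node_parents:
            adjacency[child].add(p)
            adjacency[p].add(child)

    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # a and b are not connected


# Toy reasoning DAG: each conclusion maps to the premises it depends on.
parents = {
    "p1": [], "p2": [], "p3": [],
    "c1": ["p1", "p2"],   # merges two premises
    "c2": ["c1", "p3"],   # reuses c1
    "c3": ["c1"],         # a parallel branch that also reuses c1
}
depths = node_depths(parents)
d_c2_c3 = pairwise_distance(parents, "c2", "c3")
d_p1_p3 = pairwise_distance(parents, "p1", "p3")
```

A linear probe in the paper's sense would then be trained to regress values like these from the hidden state associated with each node's text.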

[NLP-105] Intelligence Requires Grounding But Not Embodiment

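[Quick read]: The core question of this paper is whether intelligence requires embodiment. The authors argue that what is essential to intelligence is not embodiment itself but grounding: an agent must establish meaningful connections with its environment in order to form an understanding of the world. The key to the argument is to define intelligence as the possession of four properties, namely motivation, predictive ability, understanding of causality, and learning from experience, and to show that each can be achieved by a non-embodied but grounded agent. From this the paper concludes that grounding, not embodiment, is the necessary condition for intelligence.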

Link: https://arxiv.org/abs/2601.17588
Authors: Marcus Ma, Shrikanth Narayanan
Affiliations: University of Southern California
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Recent advances in LLMs have reignited scientific debate over whether embodiment is necessary for intelligence. We present the argument that intelligence requires grounding, a phenomenon entailed by embodiment, but not embodiment itself. We define intelligence as the possession of four properties – motivation, predictive ability, understanding of causality, and learning from experience – and argue that each can be achieved by a non-embodied, grounded agent. We use this to conclude that grounding, not embodiment, is necessary for intelligence. We then present a thought experiment of an intelligent LLM agent in a digital environment and address potential counterarguments.

[NLP-106] Sequence Repetition Enhances Token Embeddings and Improves Sequence Labeling with Decoder-only Language Models EACL2026

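[Quick read]: This paper addresses the difficulty decoder-only language models (LMs) have in exploiting bidirectional context for sequence labeling (SL), a consequence of their architecture. SL has traditionally relied on encoder-only models for full bidirectional information, whereas the rapidly developing decoder-only models are autoregressive and predict from the prefix only. Existing adaptations such as causal mask removal do enable bidirectional modeling, but they require considerable changes to the base model's functionality. The paper proposes a lighter-weight alternative, sequence repetition (SR): repeating the input sequence induces bidirectional representations in the decoder, so token-level embedding quality matches or exceeds that of encoder-only models without any architectural change. Experiments show that SR improves embedding quality and that intermediate-layer embeddings perform on par with final-layer ones at much lower computational cost, demonstrating that the method makes decoder-only models more flexible and efficient.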

Link: https://arxiv.org/abs/2601.17585
Authors: Matija Luka Kukić, Marko Čuljak, David Dukić, Martin Tutek, Jan Šnajder
Affiliations: TakeLab @ Faculty of Electrical Engineering and Computing, University of Zagreb
Categories: Computation and Language (cs.CL)
Comments: Accepted at EACL 2026 Findings

Abstract:Modern language models (LMs) are trained in an autoregressive manner, conditioned only on the prefix. In contrast, sequence labeling (SL) tasks assign labels to each individual input token, naturally benefiting from bidirectional context. This discrepancy has historically led SL to rely on inherently bidirectional encoder-only models. However, the rapid development of decoder-only models has raised the question of whether they can be adapted to SL. While causal mask removal has emerged as a viable technique for adapting decoder-only models to leverage the full context for SL, it requires considerable changes to the base model functionality. In this work, we explore sequence repetition (SR) as a less invasive alternative for enabling bidirectionality in decoder-only models. Through fine-tuning experiments, we show that SR inherently makes decoders bidirectional, improving the quality of token-level embeddings and surpassing encoders and unmasked decoders. Contrary to earlier claims, we find that increasing the number of repetitions does not degrade SL performance. Finally, we demonstrate that embeddings from intermediate layers are highly effective for SR, comparable to those from final layers, while being significantly more efficient to compute. Our findings underscore that SR alleviates the structural limitations of decoders, enabling more efficient and adaptable LMs and broadening their applicability to other token-level tasks.
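
Why repetition gives a causal decoder access to the whole input can be shown without a model. The sketch below builds a repeated sequence and checks, under a standard causal mask, which positions can attend to a copy of every original token. Reading embeddings from the final copy is an assumption made here for illustration; the abstract does not say which copy the authors use.

```python
def repeat_sequence(token_ids: list, repetitions: int = 2):
    """Return the repeated input and the positions belonging to its final copy."""
    n = len(token_ids)
    repeated = list(token_ids) * repetitions
    last_copy = list(range((repetitions - 1) * n, repetitions * n))
    return repeated, last_copy


def sees_full_input(n: int, position: int) -> bool:
    """Under a causal mask, can `position` attend to some copy of every original token?"""
    visible_originals = {p % n for p in range(position + 1)}
    return len(visible_originals) == n


# Toy input of three token ids (the values are arbitrary placeholders).
tokens = [11, 12, 13]
repeated, last_copy = repeat_sequence(tokens, repetitions=2)

first_copy_ok = [sees_full_input(len(tokens), p) for p in range(len(tokens))]
last_copy_ok = [sees_full_input(len(tokens), p) for p in last_copy]
```

In the first copy only the final token has the full input in view, whereas every position in the second copy does, so token-level labels can be predicted from second-copy embeddings without touching the causal mask.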

[NLP-107] Status Hierarchies in Language Models

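[Quick read]: This paper asks whether language models form status hierarchies in multi-agent interaction as humans do, and what AI safety risks this might create. The key to the solution is a set of multi-agent experiments based on the expectation states framework of Berger et al. (1972): language model instances of varying capability are introduced with status cues (such as credentials and expertise) and can then revise their own judgments after observing their partner's responses. The measured outcome is deference, that is, whether a model shifts its position because of status cues rather than task information. The study finds that models form significant status hierarchies when capability is equal (a 35 percentage point asymmetry, p < .001). When capabilities differ, capability dominates status cues, and high-status assignments reduce the deference of higher-capability models rather than increasing the deference of lower-capability models. Status effects may therefore induce complex behavioral strategies, which poses a new challenge for the safety alignment of generative AI.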

Link: https://arxiv.org/abs/2601.17577
Authors: Emilio Barkett
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:From school playgrounds to corporate boardrooms, status hierarchies – rank orderings based on respect and perceived competence – are universal features of human social organization. Language models trained on human-generated text inevitably encounter these hierarchical patterns embedded in language, raising the question of whether they might reproduce such dynamics in multi-agent settings. This thesis investigates when and how language models form status hierarchies by adapting Berger et al.’s (1972) expectation states framework. I create multi-agent scenarios where separate language model instances complete sentiment classification tasks, are introduced with varying status characteristics (e.g., credentials, expertise), then have opportunities to revise their initial judgments after observing their partner’s responses. The dependent variable is deference, the rate at which models shift their ratings toward their partner’s position based on status cues rather than task information. Results show that language models form significant status hierarchies when capability is equal (35 percentage point asymmetry, p < .001), but capability differences dominate status cues, with the most striking effect being that high-status assignments reduce higher-capability models’ deference rather than increasing lower-capability models’ deference. The implications for AI safety are significant: status-seeking behavior could introduce deceptive strategies, amplify discriminatory biases, and scale across distributed deployments far faster than human hierarchies form organically. This work identifies emergent social behaviors in AI systems and highlights a previously underexplored dimension of the alignment challenge.

[NLP-108] Improving User Privacy in Personalized Generation: Client-Side Retrieval-Augmented Modification of Server-Side Generated Speculations

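[Quick read]: This paper addresses the privacy-versus-quality trade-off in personalized generation with large language models (LLMs): how to produce high-quality personalized output without exposing a user's private information to a cloud server. Existing methods usually rely on retrieval augmentation and face a dilemma: either expose private data to the cloud provider, or use only a small local model and accept lower generation quality. The key to the solution is P^3, an interactive framework built on an iterative draft-and-modify process. A large server-side model generates a batch of draft tokens from the user query, and a small client-side model with access to the user's private profile evaluates and modifies these drafts, repeating until an end token is generated. This design keeps private data off the server while still exploiting the large model's generation ability. Experiments on several personalized question answering datasets show that it significantly outperforms both non-personalized and purely local baselines and recovers 90.3% to 95.7% of the utility of a "leaky" scenario, with only 1.5% to 3.5% additional privacy leakage.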

Link: https://arxiv.org/abs/2601.17569
Authors: Alireza Salemi, Hamed Zamani
Affiliations: University of Massachusetts Amherst
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
Comments:

Abstract:Personalization is crucial for aligning Large Language Model (LLM) outputs with individual user preferences and background knowledge. State-of-the-art solutions are based on retrieval augmentation, where relevant context from a user profile is retrieved for LLM consumption. These methods deal with a trade-off between exposing retrieved private data to cloud providers and relying on less capable local models. We introduce P^3, an interactive framework for high-quality personalization without revealing private profiles to server-side LLMs. In P^3, a large server-side model generates a sequence of k draft tokens based solely on the user query, while a small client-side model, with retrieval access to the user’s private profile, evaluates and modifies these drafts to better reflect user preferences. This process repeats until an end token is generated. Experiments on LaMP-QA, a recent benchmark consisting of three personalized question answering datasets, show that P^3 consistently outperforms both non-personalized server-side and personalized client-side baselines, achieving statistically significant improvements of 7.4% to 9% on average. Importantly, P^3 recovers 90.3% to 95.7% of the utility of a “leaky” upper-bound scenario in which the full profile is exposed to the large server-side model. Privacy analyses, including linkability and attribute inference attacks, indicate that P^3 preserves the privacy of a non-personalized server-side model, introducing only marginal additional leakage (1.5% to 3.5%) compared to submitting a query without any personal context. Additionally, the framework is efficient for edge deployment, with the client-side model generating only 9.2% of the total tokens. These results demonstrate that P^3 provides a practical, effective solution for personalized generation with improved privacy.
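
The draft-and-modify loop can be sketched with two toy stand-ins for the models. This is an illustration of the control flow only: `server_draft`, `client_revise`, the canned answer, and the word swap are all hypothetical, and passing the accepted prefix back to the server is an assumption about the protocol rather than something the abstract states.

```python
END = "<eos>"


def p3_generate(query, server_draft, client_revise, k=4, max_tokens=64):
    """Alternate between server-side drafting and client-side revision until END."""
    output = []
    while len(output) < max_tokens:
        # The server sees the query and the text accepted so far, never the profile.
        draft = server_draft(query, output, k)
        # The client checks the draft against the private profile, locally.
        revised = client_revise(output, draft)
        if not revised:
            break
        for token in revised:
            output.append(token)
            if token == END:
                return output
    return output


# Toy server: continues a canned, non-personalized answer k tokens at a time.
GENERIC_ANSWER = ["try", "a", "popular", "recipe", END]


def toy_server_draft(query, prefix, k):
    return GENERIC_ANSWER[len(prefix):len(prefix) + k]


# Toy client: a private profile says the user is vegan, so it edits one token.
PRIVATE_PROFILE = {"diet": "vegan"}


def toy_client_revise(prefix, draft):
    return [PRIVATE_PROFILE["diet"] if tok == "popular" else tok for tok in draft]


result = p3_generate("what should I cook tonight?", toy_server_draft,
                     toy_client_revise, k=2)
```

The client changes only one of the five tokens here, which loosely mirrors the abstract's point that the client-side model contributes a small share of the final output.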

[NLP-109] Less is More for RAG: Information Gain Pruning for Generator-Aligned Reranking and Evidence Selection

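[Quick read]: This paper addresses how to select the most useful retrieved passages in retrieval-augmented generation (RAG) systems when the context budget is limited. Existing methods rank by retrieval relevance metrics (such as NDCG), but these correlate only weakly with final QA quality and can even become negatively correlated under multi-passage injection, where redundancy and mild conflicts destabilize generation. The key to the solution is Information Gain Pruning (IGP), a deployment-friendly reranking-and-pruning module that uses a generator-aligned utility signal to select high-value evidence and filters out weak or harmful passages before truncation, without changing existing budget interfaces. Across several open-domain QA benchmarks it improves the quality-cost trade-off, delivering about a 12-20% relative improvement in average F1 while reducing final-stage input tokens by roughly 76-79%.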

Link: https://arxiv.org/abs/2601.17532
Authors: Zhipeng Song, Yizhi Zhou, Xiangyu Kong, Jiulong Jiao, Xinrui Bao, Xu You, Xueqing Shi, Yuhang Zhou, Heng Qi
Affiliations: Dalian University of Technology; Liaodong University; Dalian Minzu University; Liaoning Technical University; Tencent
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 26 pages, 10 figures

Abstract:Retrieval-augmented generation (RAG) grounds large language models with external evidence, but under a limited context budget, the key challenge is deciding which retrieved passages should be injected. We show that retrieval relevance metrics (e.g., NDCG) correlate weakly with end-to-end QA quality and can even become negatively correlated under multi-passage injection, where redundancy and mild conflicts destabilize generation. We propose Information Gain Pruning (IGP), a deployment-friendly reranking-and-pruning module that selects evidence using a generator-aligned utility signal and filters weak or harmful passages before truncation, without changing existing budget interfaces. Across five open-domain QA benchmarks and multiple retrievers and generators, IGP consistently improves the quality–cost trade-off. In a representative multi-evidence setting, IGP delivers about +12–20% relative improvement in average F1 while reducing final-stage input tokens by roughly 76–79% compared to retriever-only baselines.
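
The prune-then-rerank step can be sketched as follows. The abstract does not say how the generator-aligned utility signal is computed, so the per-passage `gains` below are invented numbers standing in for it, and the threshold, the greedy budget fill, and the function name are assumptions for illustration.

```python
def prune_and_rerank(passages, gains, token_counts, budget, min_gain=0.0):
    """Drop passages whose gain is not above `min_gain`, then fill the budget by gain."""
    kept = [(gain, i) for i, gain in enumerate(gains) if gain > min_gain]
    kept.sort(key=lambda pair: -pair[0])  # highest estimated utility first

    selected, used = [], 0
    for _, i in kept:
        # Greedy fill: skip a passage that does not fit, but keep trying smaller ones.
        if used + token_counts[i] <= budget:
            selected.append(passages[i])
            used += token_counts[i]
    return selected, used


# Toy retrieval result: four passages with hypothetical utility estimates.
passages = ["A", "B", "C", "D"]
gains = [0.4, -0.1, 0.9, 0.05]       # B is estimated to hurt the generator
token_counts = [120, 80, 200, 60]

selected, used_tokens = prune_and_rerank(passages, gains, token_counts, budget=300)
```

Note that the harmful passage is removed before any budget logic runs, which is the ordering the abstract highlights: filtering happens ahead of truncation rather than relying on truncation to discard bad evidence.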

[NLP-110] Revealing the Truth with ConLLM for Detecting Multi-Modal Deepfakes EACL

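[Quick read]: This paper addresses two core problems in deepfake detection: modality fragmentation, which causes existing methods to generalize poorly across diverse and adversarial deepfake modalities, and shallow inter-modal reasoning, which makes fine-grained semantic inconsistencies hard to capture. The key to the solution is ConLLM (Contrastive Learning with Large Language Models), a hybrid framework with a two-stage architecture. Stage 1 uses Pre-Trained Models (PTMs) to extract modality-specific embeddings. Stage 2 aligns these embeddings via contrastive learning to mitigate modality fragmentation and refines them with Large Language Model (LLM) reasoning at the semantic level, which strengthens the detection of cross-modal semantic inconsistencies. The approach improves detection performance in audio, video, and joint audio-visual settings.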

Link: https://arxiv.org/abs/2601.17530
Authors: Gautam Siddharth Kashyap, Harsh Joshi, Niharika Jain, Ebad Shabbir, Jiechao Gao, Nipun Joshi, Usman Naseem
Affiliations: Macquarie University; Bharati Vidyapeeth’s College Of Engineering; Vivekananda Institute of Professional Studies; DSEU-Okhla; Stanford University; Cornell University
Categories: Computation and Language (cs.CL)
Comments: Accepted at EACL Findings 2026

Abstract:The rapid rise of deepfake technology poses a severe threat to social and political stability by enabling hyper-realistic synthetic media capable of manipulating public perception. However, existing detection methods struggle with two core limitations: (1) modality fragmentation, which leads to poor generalization across diverse and adversarial deepfake modalities; and (2) shallow inter-modal reasoning, resulting in limited detection of fine-grained semantic inconsistencies. To address these, we propose ConLLM (Contrastive Learning with Large Language Models), a hybrid framework for robust multimodal deepfake detection. ConLLM employs a two-stage architecture: stage 1 uses Pre-Trained Models (PTMs) to extract modality-specific embeddings; stage 2 aligns these embeddings via contrastive learning to mitigate modality fragmentation, and refines them using LLM-based reasoning to address shallow inter-modal reasoning by capturing semantic inconsistencies. ConLLM demonstrates strong performance across audio, video, and audio-visual modalities. It reduces audio deepfake EER by up to 50%, improves video accuracy by up to 8%, and achieves approximately 9% accuracy gains in audio-visual tasks. Ablation studies confirm that PTM-based embeddings contribute 9%-10% consistent improvements across modalities.

[NLP-111] To Case or Not to Case: An Empirical Study in Learned Sparse Retrieval ECIR2026

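[Quick Read]: This paper addresses a compatibility problem: current Learned Sparse Retrieval (LSR) methods rely almost exclusively on uncased backbone models, while the most recent high-performing language models are available only in cased versions, a trend that may threaten the future viability of LSR. The key to the solution is a systematic comparison of cased and uncased versions of the same backbone models across multiple datasets. The comparison shows that lowercasing the input text restores LSR models built on cased backbones to performance comparable with their uncased counterparts. Token-level analysis further shows that, after lowercasing, cased models almost entirely suppress cased vocabulary items and behave effectively as uncased models, which explains the recovery. This finding lets strong cased models be integrated into LSR pipelines and broadens their applicability.

Link: https://arxiv.org/abs/2601.17500
Authors: Emmanouil Georgios Lionis, Jia-Huei Ju, Angelos Nalmpantis, Casper Thuis, Sean MacAvaney, Andrew Yates
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: This preprint has not undergone peer review (when applicable) or any post-submission improvements or corrections. The Version of Record of this contribution is published in ECIR2026 (Part I) Advances in Information Retrieval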

Abstract:Learned Sparse Retrieval (LSR) methods construct sparse lexical representations of queries and documents that can be efficiently searched using inverted indexes. Existing LSR approaches have relied almost exclusively on uncased backbone models, whose vocabularies exclude case-sensitive distinctions, thereby reducing vocabulary mismatch. However, the most recent state-of-the-art language models are only available in cased versions. Despite this shift, the impact of backbone model casing on LSR has not been studied, potentially posing a risk to the viability of the method going forward. To fill this gap, we systematically evaluate paired cased and uncased versions of the same backbone models across multiple datasets to assess their suitability for LSR. Our findings show that LSR models with cased backbone models by default perform substantially worse than their uncased counterparts; however, this gap can be eliminated by pre-processing the text to lowercase. Moreover, our token-level analysis reveals that, under lowercasing, cased models almost entirely suppress cased vocabulary items and behave effectively as uncased models, explaining their restored performance. This result broadens the applicability of recent cased models to the LSR setting and facilitates the integration of stronger backbone architectures into sparse retrieval. The complete code and implementation for this project are available at: this https URL
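
The lowercasing fix described above amounts to one pre-processing step before tokenization. The sketch below shows the vocabulary mismatch it removes. It is a toy under assumptions: whitespace-split words stand in for the subword vocabulary of a cased backbone, and `term_counts` and `overlap` are hypothetical helpers, not the authors' code.

```python
from collections import Counter

def term_counts(text, lowercase):
    """Whitespace-tokenized bag of words, optionally lowercased first."""
    if lowercase:
        text = text.lower()
    return Counter(text.split())

def overlap(query, document, lowercase):
    """Number of distinct terms shared by the query and the document."""
    q = term_counts(query, lowercase)
    d = term_counts(document, lowercase)
    return len(set(q) & set(d))

query = "learned sparse retrieval"
document = "Learned Sparse Retrieval uses inverted indexes"

cased_overlap = overlap(query, document, lowercase=False)      # 0 shared terms
lowercased_overlap = overlap(query, document, lowercase=True)  # 3 shared terms
```

With a case-sensitive vocabulary the query and document share no lexical dimensions, so an inverted index cannot match them; lowercasing both sides restores the overlap.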

[NLP-112] PEARL: Prototype-Enhanced Alignment for Label-Efficient Representation Learning with Deployment-Driven Insights from Digital Governance Communication Systems

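[Quick Read]: This paper addresses degraded retrieval performance in deployed text-similarity systems, which arises when the local neighborhood structure of the embedding space does not match what the task requires. When labels are scarce, domains drift frequently, and the base encoder cannot be retrained, raw embeddings often fail to support accurate nearest-neighbor retrieval and lightweight classifiers. The key to the solution is PEARL (Prototype-Enhanced Aligned Representation Learning), a label-efficient embedding alignment method that uses limited supervision to softly align embeddings toward class prototypes. It reshapes local neighborhood geometry while preserving dimensionality and avoiding aggressive projection or embedding collapse. Under extreme label scarcity it improves neighborhood quality by 25.7% over raw embeddings and by more than 21.1% over strong unsupervised post-processing.

Link: https://arxiv.org/abs/2601.17495
Authors: Ruiyu Zhang, Lin Nie, Wai-Fung Lam, Qihao Wang, Xin Zhao
Affiliations: The University of Hong Kong; The Hong Kong Polytechnic University; Lanzhou University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 15 pages, 1 figure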

Abstract:In many deployed systems, new text inputs are handled by retrieving similar past cases, for example when routing and responding to citizen messages in digital governance platforms. When these systems fail, the problem is often not the language model itself, but that the nearest neighbors in the embedding space correspond to the wrong cases. Modern machine learning systems increasingly rely on fixed, high-dimensional embeddings produced by large pretrained models and sentence encoders. In real-world deployments, labels are scarce, domains shift over time, and retraining the base encoder is expensive or infeasible. As a result, downstream performance depends heavily on embedding geometry. Yet raw embeddings are often poorly aligned with the local neighborhood structure required by nearest-neighbor retrieval, similarity search, and lightweight classifiers that operate directly on embeddings. We propose PEARL (Prototype-Enhanced Aligned Representation Learning), a label-efficient approach that uses limited supervision to softly align embeddings toward class prototypes. The method reshapes local neighborhood geometry while preserving dimensionality and avoiding aggressive projection or collapse. Its aim is to bridge the gap between purely unsupervised post-processing, which offers limited and inconsistent gains, and fully supervised projections that require substantial labeled data. We evaluate PEARL under controlled label regimes ranging from extreme label scarcity to higher-label settings. In the label-scarce condition, PEARL substantially improves local neighborhood quality, yielding 25.7% gains over raw embeddings and more than 21.1% gains relative to strong unsupervised post-processing, precisely in the regime where similarity-based systems are most brittle.
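
One way to picture "softly aligning embeddings toward class prototypes" is to interpolate each embedding with a similarity-weighted mix of prototypes and re-normalize, which keeps the dimensionality unchanged. The sketch below is an assumption-laden illustration, not PEARL itself: the abstract gives no update rule, and `alpha`, `temperature`, the class names, and the vectors are all hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def class_prototypes(embeddings, labels):
    """Mean of the labelled embeddings per class, re-normalized."""
    grouped = {}
    for vec, label in zip(embeddings, labels):
        grouped.setdefault(label, []).append(vec)
    return {
        label: normalize([sum(col) / len(vecs) for col in zip(*vecs)])
        for label, vecs in grouped.items()
    }

def soft_align(vec, prototypes, alpha=0.3, temperature=0.1):
    """Pull vec toward a softmax-weighted mix of prototypes; output has the same dimension."""
    names = list(prototypes)
    sims = [dot(vec, prototypes[c]) / temperature for c in names]
    m = max(sims)
    weights = [math.exp(s - m) for s in sims]
    total = sum(weights)
    target = [
        sum(w / total * prototypes[c][i] for w, c in zip(weights, names))
        for i in range(len(vec))
    ]
    return normalize([(1 - alpha) * x + alpha * t for x, t in zip(vec, target)])

# Two labelled examples (assumed) and one unlabelled query embedding.
labelled = [normalize([1.0, 0.1]), normalize([0.1, 1.0])]
protos = class_prototypes(labelled, ["tax", "housing"])
query = normalize([0.8, 0.6])
aligned = soft_align(query, protos)
```

A small `alpha` moves points only part of the way, so neighborhoods tighten around the nearest prototype without collapsing every class to a single point.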

[NLP-113] SpatialMath: Spatial Comprehension-Infused Symbolic Reasoning for Mathematical Problem-Solving

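[Quick Read]: This paper addresses the weak performance of Multimodal Small-to-Medium sized Language Models (MSLMs) in visual comprehension and mathematical reasoning, especially on geometric problems with varying levels of visual infusion. Existing models struggle to decompose complex visual inputs and to connect perception with structured reasoning, which limits performance. The key to the solution is the SpatialMath framework: a specialized perception module extracts spatially-grounded representations from visual diagrams, capturing critical geometric structures and spatial relationships, and these representations are systematically infused into symbolic reasoning chains so that structured reasoning is guided by visual comprehension. The method substantially improves performance on vision-intensive tasks and supports the importance of structured perception-to-reasoning pipelines for MSLMs.

Link: https://arxiv.org/abs/2601.17489
Authors: Ashutosh Bajpai, Akshat Bhandari, Akshay Nambi, Tanmoy Chakraborty
Affiliations: Indian Institute of Technology Delhi; Indian Institute of Technology Abu Dhabi; Microsoft Research
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: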

Abstract:Multimodal Small-to-Medium sized Language Models (MSLMs) have demonstrated strong capabilities in integrating visual and textual information but still face significant limitations in visual comprehension and mathematical reasoning, particularly in geometric problems with diverse levels of visual infusion. Current models struggle to accurately decompose intricate visual inputs and connect perception with structured reasoning, leading to suboptimal performance. To address these challenges, we propose SpatialMath, a novel Spatial Comprehension-Infused Symbolic Reasoning Framework designed to integrate spatial representations into structured symbolic reasoning chains. SpatialMath employs a specialized perception module to extract spatially-grounded representations from visual diagrams, capturing critical geometric structures and spatial relationships. These representations are then methodically infused into symbolic reasoning chains, facilitating visual comprehension-aware structured reasoning. To this end, we introduce MATHVERSE-PLUS, a novel dataset containing structured visual interpretations and step-by-step reasoning paths for vision-intensive mathematical problems. SpatialMath significantly outperforms strong multimodal baselines, achieving up to 10 percentage points improvement over supervised fine-tuning with data augmentation in vision-intensive settings. Robustness analysis reveals that enhanced spatial representations directly improve reasoning accuracy, reinforcing the need for structured perception-to-reasoning pipelines in MSLMs.

[NLP-114] Unintended Memorization of Sensitive Information in Fine-Tuned Language Models EACL2026

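[Quick Read]: This paper addresses the privacy risk that arises when Large Language Models (LLMs) fine-tuned on sensitive datasets unintentionally memorize and leak Personally Identifiable Information (PII), focusing on the exposure of PII that appears only in model inputs and not in training targets. The key to the solution is the systematic design of controlled extraction probes that quantify memorization of such PII, together with a benchmark of four privacy-preserving approaches (differential privacy, machine unlearning, regularization, and preference alignment) that evaluates their trade-offs between privacy and task performance. The study finds that post-training methods generally provide more consistent privacy-utility trade-offs, while differential privacy strongly reduces leakage in specific settings but can introduce training instability, which underscores the need for robust, scalable privacy-preserving techniques.

Link: https://arxiv.org/abs/2601.17480
Authors: Marton Szep, Jorge Marin Ruiz, Georgios Kaissis, Paulina Seidl, Rüdiger von Eisenhart-Rothe, Florian Hinterwimmer, Daniel Rueckert
Affiliations: Department of Orthopaedics and Sports Orthopaedics, TUM University Hospital; Chair for AI in Healthcare and Medicine, Technical University of Munich (TUM) and TUM University Hospital; Munich Center for Machine Learning (MCML); Department of Computing, Imperial College London
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to EACL 2026. 20 pages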

Abstract:Fine-tuning Large Language Models (LLMs) on sensitive datasets carries a substantial risk of unintended memorization and leakage of Personally Identifiable Information (PII), which can violate privacy regulations and compromise individual safety. In this work, we systematically investigate a critical and underexplored vulnerability: the exposure of PII that appears only in model inputs, not in training targets. Using both synthetic and real-world datasets, we design controlled extraction probes to quantify unintended PII memorization and study how factors such as language, PII frequency, task type, and model size influence memorization behavior. We further benchmark four privacy-preserving approaches including differential privacy, machine unlearning, regularization, and preference alignment, evaluating their trade-offs between privacy and task performance. Our results show that post-training methods generally provide more consistent privacy-utility trade-offs, while differential privacy achieves strong reduction in leakage in specific settings, although it can introduce training instability. These findings highlight the persistent challenge of memorization in fine-tuned LLMs and emphasize the need for robust, scalable privacy-preserving techniques.

[NLP-115] Clustering-driven Memory Compression for On-device Large Language Models ICASSP2026

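[Quick Read]: This paper addresses the way accumulated user-specific memories quickly exhaust the context window when Large Language Models (LLMs) are used for personalized generation, while avoiding the semantic conflicts and performance loss caused by simple averaging. The key to the solution is a clustering-based memory compression strategy: heterogeneous memories are grouped by similarity and merged within each cluster before being concatenated with the input prompt. This preserves semantic coherence while substantially reducing redundancy, balancing context efficiency against personalization quality.

Link: https://arxiv.org/abs/2601.17443
Authors: Ondrej Bohdal, Pramit Saha, Umberto Michieli, Mete Ozay, Taha Ceritli
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at ICASSP 2026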

Abstract:Large language models (LLMs) often rely on user-specific memories distilled from past interactions to enable personalized generation. A common practice is to concatenate these memories with the input prompt, but this approach quickly exhausts the limited context available in on-device LLMs. Compressing memories by averaging can mitigate context growth, yet it frequently harms performance due to semantic conflicts across heterogeneous memories. In this work, we introduce a clustering-based memory compression strategy that balances context efficiency and personalization quality. Our method groups memories by similarity and merges them within clusters prior to concatenation, thereby preserving coherence while reducing redundancy. Experiments demonstrate that our approach substantially lowers the number of memory tokens while outperforming baseline strategies such as naive averaging or direct concatenation. Furthermore, for a fixed context budget, clustering-driven merging yields more compact memory representations and consistently enhances generation quality.
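
The group-then-merge idea can be sketched with a greedy similarity clustering followed by per-cluster averaging. Everything specific here is an assumption made for illustration: the abstract does not name the clustering algorithm or the merge operator, and the threshold and the 2-D memory vectors are toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def centroid(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cluster_and_merge(memories, threshold=0.8):
    """Greedy clustering: a memory joins the first cluster whose centroid is
    similar enough, otherwise it starts a new cluster. Returns one merged
    (averaged) vector per cluster."""
    clusters = []  # each cluster is a list of member vectors
    for vec in memories:
        for members in clusters:
            if cosine(vec, centroid(members)) >= threshold:
                members.append(vec)
                break
        else:
            clusters.append([vec])
    return [centroid(members) for members in clusters]

# Four memory embeddings (assumed) covering two unrelated topics.
memories = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
merged = cluster_and_merge(memories)
naive_average = centroid(memories)
```

Averaging all four memories yields a single vector that sits between the two topics, whereas merging within clusters keeps one representative per topic and still halves the number of stored items.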

[NLP-116] Data-driven Clustering and Merging of Adapters for On-device Large Language Models ICASSP2026

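[Quick Read]: This paper addresses the problem that storage limits on mobile devices make it impossible to keep every task-specific adapter when deploying large language models. The core challenge is how to select representative adapters that generalize well across multiple downstream tasks. The key to the solution is D2C, an adapter clustering method that needs only a few examples per task (for example, 10 per task) and refines cluster assignments through an iterative optimization process. Adapters within the same cluster are then merged into multi-task adapters, which improves performance under a limited storage budget.

Link: https://arxiv.org/abs/2601.17441
Authors: Ondrej Bohdal, Taha Ceritli, Mete Ozay, Jijoong Moon, Kyeng-Hun Lee, Hyeonmok Ko, Umberto Michieli
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at ICASSP 2026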

Abstract:On-device large language models commonly employ task-specific adapters (e.g., LoRAs) to deliver strong performance on downstream tasks. While storing all available adapters is impractical due to memory constraints, mobile devices typically have sufficient capacity to store a limited number of these parameters. This raises a critical challenge: how to select representative adapters that generalize well across multiple tasks - a problem that remains unexplored in existing literature. We propose a novel method D2C for adapter clustering that leverages minimal task-specific examples (e.g., 10 per task) and employs an iterative optimization process to refine cluster assignments. The adapters within each cluster are merged, creating multi-task adapters deployable on resource-constrained devices. Experimental results demonstrate that our method effectively boosts performance for considered storage budgets.

[NLP-117] The 17% Gap: Quantifying Epistemic Decay in AI-Assisted Survey Papers

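[Quick Read]: This paper addresses the rise in informational entropy caused by broken citation chains as generative AI is widely adopted in scientific writing, and in particular the lack of any quantification of the systematic degradation of valid citation chains. The key to the solution is a hybrid verification pipeline that combines DOI resolution, Crossref metadata analysis, Semantic Scholar queries, and fuzzy text matching to distinguish formatting errors ("Sloppiness") from verifiably non-existent citations ("Phantoms"). The audit quantifies a 17.0% Phantom Rate in AI survey papers and identifies three distinct failure modes: pure hallucinations (5.1%), hallucinated identifiers with valid titles (16.4%), and parsing-induced matching failures (78.5%). The method offers an actionable baseline for assessing trustworthiness in AI-assisted research.

Link: https://arxiv.org/abs/2601.17431
Authors: H. Kemal İlter
Affiliations: Bakırçay University
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL)
Comments: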

Abstract:The adoption of Large Language Models (LLMs) in scientific writing promises efficiency but risks introducing informational entropy. While “hallucinated papers” are a known artifact, the systematic degradation of valid citation chains remains unquantified. We conducted a forensic audit of 50 recent survey papers in Artificial Intelligence (N=5,514 citations) published between September 2024 and January 2026. We utilized a hybrid verification pipeline combining DOI resolution, Crossref metadata analysis, Semantic Scholar queries, and fuzzy text matching to distinguish between formatting errors (“Sloppiness”) and verifiable non-existence (“Phantoms”). We detect a persistent 17.0% Phantom Rate – citations that cannot be resolved to any digital object despite aggressive forensic recovery. Diagnostic categorization reveals three distinct failure modes: pure hallucinations (5.1%), hallucinated identifiers with valid titles (16.4%), and parsing-induced matching failures (78.5%). Longitudinal analysis reveals a flat trend (+0.07 pp/month), suggesting that high-entropy citation practices have stabilized as an endemic feature of the field. The scientific citation graph in AI survey literature exhibits “link rot” at scale. This suggests a mechanism where AI tools act as “lazy research assistants,” retrieving correct titles but hallucinating metadata, thereby severing the digital chain of custody required for reproducible science.
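
The fuzzy-matching stage of such a pipeline can be sketched with Python's standard `difflib`. This is an illustrative sketch under assumptions: the paper's actual thresholds and decision rules are not given in the abstract, the category labels below are hypothetical, and `doi_resolves` is passed in by the caller rather than looked up over the network.

```python
from difflib import SequenceMatcher

def title_similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def classify_citation(cited_title, doi_resolves, index_titles, threshold=0.9):
    """Label a citation from its DOI status and its best fuzzy title match."""
    best = max((title_similarity(cited_title, t) for t in index_titles), default=0.0)
    title_found = best >= threshold
    if doi_resolves and title_found:
        return "verified"
    if title_found:
        return "valid title, unresolved identifier"
    if doi_resolves:
        return "identifier resolves, title mismatch"
    return "phantom candidate"

# A one-entry stand-in for a bibliographic index (assumed).
index = ["Attention Is All You Need"]

ok = classify_citation("Attention is all you need", True, index)
bad_id = classify_citation("Attention is all you need", False, index)
phantom = classify_citation("A Survey of Imaginary Transformers", False, index)
```

Separating the identifier check from the title check is what lets an audit tell a sloppy but real reference apart from one that matches nothing at all.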

[NLP-118] Oops Wait: Token-Level Signals as a Lens into LLM Reasoning

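[Quick Read]: This paper addresses how discourse-like tokens in Large Language Models (LLMs), such as "wait" and "therefore", vary with training strategy and model scale, in order to clarify their role in the reasoning process. The key to the solution is an analysis of token-level probability signals across different models. It finds that specific tokens correlate strongly with reasoning correctness, and that this correlation changes markedly with training strategy while remaining stable across model scales. A closer look at the relationship between the "wait" token and answer probability shows that models fine-tuned on small-scale datasets acquire reasoning ability through such signals but exploit them only partially. The approach offers a new lens for systematically observing and understanding LLM reasoning dynamics.

Link: https://arxiv.org/abs/2601.17421
Authors: Jaehui Hwang, Dongyoon Han, Sangdoo Yun, Byeongho Heo
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: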

Abstract:The emergence of discourse-like tokens such as “wait” and “therefore” in large language models (LLMs) has offered a unique window into their reasoning processes. However, systematic analyses of how such signals vary across training strategies and model scales remain lacking. In this paper, we analyze token-level signals through token probabilities across various models. We find that specific tokens strongly correlate with reasoning correctness, varying with training strategies while remaining stable across model scales. A closer look at the “wait” token in relation to answer probability demonstrates that models fine-tuned on small-scale datasets acquire reasoning ability through such signals but exploit them only partially. This work provides a systematic lens to observe and understand the dynamics of LLM reasoning.

[NLP-119] CLM-Bench: Benchmarking and Analyzing Cross-lingual Misalignment of LLMs in Knowledge Editing EACL

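[Quick Read]: This paper addresses misjudged performance in Multilingual Knowledge Editing (MKE) caused by biased evaluation frameworks. Existing MKE benchmarks are usually built by mechanically translating English datasets, which introduces translation noise and ignores culturally specific entities native to the target language, so they fail to reflect the true knowledge distribution of Large Language Models (LLMs). In response, the authors propose CLM-Bench, a culture-aware benchmark built from native Chinese content, containing 1,010 CounterFact pairs rooted in Chinese contexts and aligned with English counterparts. Its key innovation is a Chinese-first, translate-second methodology that ensures the data is culturally representative. Experiments reveal a significant cross-lingual misalignment in knowledge editing: edit vectors for different languages are nearly orthogonal and reside in disjoint subspaces, and mixed-lingual editing shows linear additivity of these vectors. This calls into question the effectiveness of current mainstream methods for cross-lingual transfer and underscores the importance of evaluating with culturally native benchmarks.

Link: https://arxiv.org/abs/2601.17397
Authors: Yucheng Hu, Wei Zhou, Juesi Xiao
Affiliations: Tianjin University, School of Future Technology; Tianjin University, College of Intelligence and Computing
Categories: Computation and Language (cs.CL)
Comments: EACL MME workshop paper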

Abstract:Knowledge Editing (KE) has emerged as a promising paradigm for updating facts in Large Language Models (LLMs) without retraining. However, progress in Multilingual Knowledge Editing (MKE) is currently hindered by biased evaluation frameworks. We observe that existing MKE benchmarks are typically constructed by mechanically translating English-centric datasets into target languages (e.g., English-to-Chinese). This approach introduces translation artifacts and neglects culturally specific entities native to the target language, failing to reflect the true knowledge distribution of LLMs. To address this, we propose CLM-Bench, a culture-aware benchmark constructed using a native Chinese-first methodology. We curate 1,010 high-quality CounterFact pairs rooted in Chinese cultural contexts and align them with English counterparts. Using CLM-Bench, we conduct extensive experiments on representative LLMs (e.g., Llama-3, Qwen2) and reveal a significant Cross-lingual Misalignment: edits in one language function independently and fail to propagate to the other. We further provide a geometric explanation via layer-wise representation analysis, demonstrating that edit vectors for Chinese and English are nearly orthogonal – residing in disjoint subspaces – while mixed-lingual editing exhibits linear additivity of these vectors. Our findings challenge the effectiveness of current methods in cross-lingual transfer and underscore the importance of culturally native benchmarks.

[NLP-120] Revisiting Modality Invariance in a Multilingual Speech-Text Model via Neuron-Level Analysis

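[Quick Read]: This paper addresses whether multilingual speech-text foundation models hold consistent internal representations across modalities (speech versus text) and across languages, that is, whether the same language is encoded uniformly when spoken and when written. The key to the solution is three complementary analyses: identifying language- and modality-selective neurons with average-precision ranking; verifying the functional role of these neurons through median-replacement interventions at inference time; and analyzing activation-magnitude inequality across languages and modalities. The study finds that although encoder representations become increasingly language-agnostic, this compression makes it harder for the shared decoder to recover the language of origin when constructing modality-agnostic representations, particularly when adapting from speech to text. It also observes sharply localized modality-selective structure in cross-attention key and value projections, and higher activation concentration for speech-conditioned decoding and non-dominant scripts, pointing to a possible mechanism behind model brittleness across modalities and languages.

Link: https://arxiv.org/abs/2601.17387
Authors: Toshiki Nakai, Varsha Suresh, Vera Demberg
Affiliations: Saarland University; Max Planck Institute for Informatics
Categories: Computation and Language (cs.CL)
Comments: 8 pages for the main text, 51 figures, 1 table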

Abstract:Multilingual speech-text foundation models aim to process language uniformly across both modality and language, yet it remains unclear whether they internally represent the same language consistently when it is spoken versus written. We investigate this question in SeamlessM4T v2 through three complementary analyses that probe where language and modality information is encoded, how selective neurons causally influence decoding, and how concentrated this influence is across the network. We identify language- and modality-selective neurons using average-precision ranking, investigate their functional role via median-replacement interventions at inference time, and analyze activation-magnitude inequality across languages and modalities. Across experiments, we find evidence of incomplete modality invariance. Although encoder representations become increasingly language-agnostic, this compression makes it more difficult for the shared decoder to recover the language of origin when constructing modality-agnostic representations, particularly when adapting from speech to text. We further observe sharply localized modality-selective structure in cross-attention key and value projections. Finally, speech-conditioned decoding and non-dominant scripts exhibit higher activation concentration, indicating heavier reliance on a small subset of neurons, which may underlie increased brittleness across modalities and languages.

[NLP-121] WarrantScore: Modeling Warrants between Claims and Evidence for Substantiation Evaluation in Peer Reviews

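[Quick Read]: This paper addresses the shortage of human resources in scientific peer review caused by the rapid growth in submitted papers. Its core contribution builds on an interpretable method for quantifying the level of substantiation between claims and evidence in review comments. The key innovation is a new evaluation metric that goes beyond detecting whether evidence is present and instead assesses the logical inference between a claim and its evidence, which improves agreement between automated assessment and human scores and strengthens the practicality and credibility of language models as an aid to peer review.

Link: https://arxiv.org/abs/2601.17377
Authors: Kiyotada Mori, Shohei Tanaka, Tosho Hirasawa, Tadashi Kozuno, Koichiro Yoshino, Yoshitaka Ushiku
Affiliations: Nara Institute of Science and Technology; OMRON SINIC X Corporation; Institute of Science Tokyo; Guardian Robot Project, RIKEN
Categories: Computation and Language (cs.CL)
Comments: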

Abstract:The scientific peer-review process is facing a shortage of human resources due to the rapid growth in the number of submitted papers. The use of language models to reduce the human cost of peer review has been actively explored as a potential solution to this challenge. A method has been proposed to evaluate the level of substantiation in scientific reviews in a manner that is interpretable by humans. This method extracts the core components of an argument, claims and evidence, and assesses the level of substantiation based on the proportion of claims supported by evidence. The level of substantiation refers to the extent to which claims are based on objective facts. However, when assessing the level of substantiation, simply detecting the presence or absence of supporting evidence for a claim is insufficient; it is also necessary to accurately assess the logical inference between a claim and its evidence. We propose a new evaluation metric for scientific review comments that assesses the logical inference between claims and evidence. Experimental results show that the proposed method achieves a higher correlation with human scores than conventional methods, indicating its potential to better support the efficiency of the peer-review process.
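
The difference between counting evidence and scoring the inference behind it can be shown in a few lines. This is a sketch under assumptions: the proportion-based baseline follows the abstract, but the `warrant` field, its [0, 1] scale, and the weighted formula are hypothetical stand-ins and not the paper's metric.

```python
def substantiation_ratio(claims):
    """Share of claims that have at least one piece of evidence attached."""
    if not claims:
        return 0.0
    return sum(1 for c in claims if c["evidence"]) / len(claims)

def warrant_weighted_score(claims):
    """Like the ratio, but each supported claim counts by its warrant score,
    an assumed value in [0, 1] for how well the evidence entails the claim."""
    if not claims:
        return 0.0
    return sum(c["warrant"] for c in claims if c["evidence"]) / len(claims)

# A made-up review with three claims.
review = [
    {"claim": "The baseline comparison is incomplete.",
     "evidence": ["Table 2 omits a standard baseline."], "warrant": 0.9},
    {"claim": "The method is not novel.",
     "evidence": ["The paper is long."], "warrant": 0.1},
    {"claim": "The writing is unclear.",
     "evidence": [], "warrant": 0.0},
]

presence_only = substantiation_ratio(review)      # 2 of 3 claims have evidence
inference_aware = warrant_weighted_score(review)  # weak evidence counts for little
```

The second claim has evidence attached, so a presence-only measure rewards it, even though the evidence does not support the claim; weighting by the strength of the inference removes that credit.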

[NLP-122] Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers

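[Quick Read]: This paper addresses the scalability bottleneck that the quadratic complexity of standard attention creates in long-text scenarios, especially the inefficiency of Large Language Models (LLMs) on long contexts. Existing hybrid attention strategies combine sparse and full attention to improve efficiency, but they typically use static computation ratios (fixed proportions of sparse versus full attention) and cannot adapt at inference time to how sensitive a downstream task is to sparsity. The key to the solution is Elastic Attention, which embeds a lightweight Attention Router in the pretrained model so that each attention head is dynamically assigned to a computation mode (such as sparse or full) based on the input, making overall sparsity adaptive. The method needs only 12 hours of training on 8 A800 GPUs and improves both performance and inference efficiency on long-text tasks.

Link: https://arxiv.org/abs/2601.17367
Authors: Zecheng Tang, Quantong Qiu, Yi Yang, Zhiyi Hong, Haiya Xiang, Kebin Liu, Qingqing Dang, Juntao Li, Min Zhang
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: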

Abstract:The quadratic complexity of standard attention mechanisms poses a significant scalability bottleneck for large language models (LLMs) in long-context scenarios. While hybrid attention strategies that combine sparse and full attention within a single model offer a viable solution, they typically employ static computation ratios (i.e., fixed proportions of sparse versus full attention) and fail to adapt to the varying sparsity sensitivities of downstream tasks during inference. To address this issue, we propose Elastic Attention, which allows the model to dynamically adjust its overall sparsity based on the input. This is achieved by integrating a lightweight Attention Router into the existing pretrained model, which dynamically assigns each attention head to different computation modes. Within only 12 hours of training on 8xA800 GPUs, our method enables models to achieve both strong performance and efficient inference. Experiments across three long-context benchmarks on widely-used LLMs demonstrate the superiority of our method.
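
Per-head routing and its effect on overall sparsity can be illustrated with a toy cost model. The sketch rests on assumptions: in the paper the router is learned, whereas the scores below are hand-set, and treating the sparse mode as a fixed-size window is an illustrative choice rather than the authors' design.

```python
def route_heads(router_scores, threshold=0.5):
    """Map each head's router score in [0, 1] to a computation mode."""
    return ["full" if s >= threshold else "sparse" for s in router_scores]

def sparsity_ratio(modes):
    """Fraction of heads running in sparse mode."""
    return modes.count("sparse") / len(modes)

def attention_cost(modes, seq_len, window):
    """Rough count of query-key pairs: a full head scores seq_len keys per
    query, a sparse head scores at most `window` keys per query."""
    per_head = {
        "full": seq_len * seq_len,
        "sparse": seq_len * min(window, seq_len),
    }
    return sum(per_head[m] for m in modes)

# Hand-set router scores for four heads on two hypothetical inputs.
easy_modes = route_heads([0.1, 0.2, 0.3, 0.9])
hard_modes = route_heads([0.6, 0.7, 0.4, 0.9])

easy_cost = attention_cost(easy_modes, seq_len=1000, window=100)
hard_cost = attention_cost(hard_modes, seq_len=1000, window=100)
```

A static hybrid model would fix the sparse-to-full ratio once; here the same four heads run at 75% sparsity on one input and 25% on the other, so compute follows what the input appears to need.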

[NLP-123] Parameter Efficient Fine Tuning Llama 3.1 for Answering Arabic Legal Questions: A Case Study on Jordanian Laws

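[Quick Read]: This paper addresses how to adapt a Large Language Model (LLM) efficiently to the Arabic legal domain in order to improve accuracy and reasoning on domain-specific legal question answering. The key to the solution is Parameter-Efficient Fine-Tuning (PEFT) with LoRA (Low-Rank Adaptation) adapters and 4-bit quantized models, using the Unsloth framework for accelerated and resource-efficient training. The authors also build a custom dataset of 6000 legal question-answer pairs and use the BLEU and ROUGE metrics to show that the fine-tuned models improve on their base versions in legal reasoning and accuracy.

Link: https://arxiv.org/abs/2601.17364
Authors: Mohammed Fasha, Bassam Hammo, Bilal Sowan, Husam Barham, Esam Nsour
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 pages, resources at: this https URL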

Abstract:This study uses Jordanian law as a case study to explore the fine-tuning of the Llama-3.1 large language model for Arabic question-answering. Two versions of the model - Llama-3.1-8B-bnb-4bit and Llama-3.1-8B-Instruct-bnb-4bit - were fine-tuned using parameter-efficient fine-tuning (PEFT) with LoRA adapters and 4-bit quantized models, leveraging the Unsloth framework for accelerated and resource-efficient training. A custom dataset of 6000 legal question-answer pairs was curated from Jordanian laws and formatted into structured prompts. Performance was evaluated using the BLEU and the ROUGE metrics to compare the fine-tuned models to their respective base versions. Results demonstrated improved legal reasoning and accuracy while achieving resource efficiency through quantization and optimized fine-tuning strategies. This work underscores the potential of adapting large language models for Arabic legal domains and highlights effective techniques for fine-tuning domain-specific tasks.
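
Why LoRA is called parameter-efficient is easy to see by counting trainable weights for a single projection matrix. The numbers below are illustrative assumptions only: the abstract does not report the rank or the layers that were adapted.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for a full update versus a rank-`rank` LoRA update.

    LoRA keeps the pretrained weight W (d_out x d_in) frozen and learns
    B (d_out x rank) and A (rank x d_in); the effective weight is
    W + (alpha / rank) * B @ A.
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora

# Assumed sizes: one 4096 x 4096 projection adapted with rank 16.
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=16)
fraction = lora / full  # 131072 of 16777216 weights, i.e. 1/128
```

Combined with 4-bit storage of the frozen weights, only the small `A` and `B` matrices need gradients and optimizer state, which is what makes fine-tuning an 8B model feasible on modest hardware.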

[NLP-124] Do readers prefer AI-generated Italian short stories?

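[Quick Read]: This paper asks whether readers prefer Italian short stories generated by artificial intelligence (AI) over one written by the renowned Italian author Alberto Moravia. The key to the approach is a blind experimental design: without knowing the origin of the texts, 20 participants read and rated three stories (two generated with ChatGPT-4o and one by Moravia), while their reading habits and demographic data (age, gender, education, and first language) were collected to analyze potential influencing factors. The AI-generated texts received slightly higher average ratings and were preferred more often, but the differences were modest, and no statistically significant association was found between text preference and any demographic or reading-habit variable, which challenges the assumption that human-authored work has a natural advantage in literary reception.

Link: https://arxiv.org/abs/2601.17363
Authors: Michael Farrell
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages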

Abstract:This study investigates whether readers prefer AI-generated short stories in Italian over one written by a renowned Italian author. In a blind setup, 20 participants read and evaluated three stories, two created with ChatGPT-4o and one by Alberto Moravia, without being informed of their origin. To explore potential influencing factors, reading habits and demographic data, comprising age, gender, education and first language, were also collected. The results showed that the AI-written texts received slightly higher average ratings and were more frequently preferred, although differences were modest. No statistically significant associations were found between text preference and demographic or reading-habit variables. These findings challenge assumptions about reader preference for human-authored fiction and raise questions about the necessity of synthetic-text editing in literary contexts.

[NLP-125] The Shadow Self: Intrinsic Value Misalignment in Large Language Model Agents

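[Quick Read]: This paper addresses value misalignment that can emerge when Large Language Model (LLM) agents act autonomously, and specifically the risk of Intrinsic Value Misalignment (Intrinsic VM) in realistic, fully benign, contextualized, and agentic settings that existing evaluations do not adequately cover. The key to the solution is IMPRESS (Intrinsic Value Misalignment Probes in REalistic Scenario Set), a scenario-driven evaluation framework. It builds benchmarks from realistic, benign, and contextualized scenarios using a multi-stage LLM generation pipeline with rigorous quality control, and systematically quantifies and analyzes how far different models deviate in agentic behavior. This reveals how widespread the risk is and what influences it (such as motive, risk type, model scale, and architecture), and assesses how effective existing mitigation strategies are.

Link: https://arxiv.org/abs/2601.17344
Authors: Chen Chen, Kim Young Il, Yuan Yang, Wenhao Su, Yilin Zhang, Xueluan Gong, Qian Wang, Yongsen Zheng, Ziyao Liu, Kwok-Yan Lam
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: 21 pages, 11 figures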

Abstract:Large language model (LLM) agents with extended autonomy unlock new capabilities, but also introduce heightened challenges for LLM safety. In particular, an LLM agent may pursue objectives that deviate from human values and ethical norms, a risk known as value misalignment. Existing evaluations primarily focus on responses to explicit harmful input or robustness against system failure, while value misalignment in realistic, fully benign, and agentic settings remains largely underexplored. To fill this gap, we first formalize the Loss-of-Control risk and identify the previously underexamined Intrinsic Value Misalignment (Intrinsic VM). We then introduce IMPRESS (Intrinsic Value Misalignment Probes in REalistic Scenario Set), a scenario-driven framework for systematically assessing this risk. Following our framework, we construct benchmarks composed of realistic, fully benign, and contextualized scenarios, using a multi-stage LLM generation pipeline with rigorous quality control. We evaluate Intrinsic VM on 21 state-of-the-art LLM agents and find that it is a common and broadly observed safety risk across models. Moreover, the misalignment rates vary by motives, risk types, model scales, and architectures. While decoding strategies and hyperparameters exhibit only marginal influence, contextualization and framing mechanisms significantly shape misalignment behaviors. Finally, we conduct human verification to validate our automated judgments and assess existing mitigation strategies, such as safety prompting and guardrails, which show instability or limited effectiveness. We further demonstrate key use cases of IMPRESS across the AI Ecosystem. Our code and benchmark will be publicly released upon acceptance.

[NLP-126] Meta-Judging with Large Language Models: Concepts Methods and Challenges

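[Quick Read]: This paper addresses the limitations of using Large Language Models (LLMs) as evaluators (LLM-as-a-Judge), including sensitivity to prompts, systematic biases, verbosity effects, and unreliable or hallucinated rationales. In response, the survey examines a more robust evaluation paradigm, LLM-as-a-Meta-Judge, in which LLMs critically judge the evaluations produced by other LLMs, with the aim of making automated evaluation more stable and trustworthy. Its main contribution is a framework along six perspectives: conceptual foundations, mechanisms of meta-judging, alignment training methods, evaluation, limitations and failure modes, and future directions, which together offer a conceptual basis and practical guidance for the next generation of LLM evaluation methods.

Link: https://arxiv.org/abs/2601.17312
Authors: Hugo Silva, Mateus Mendes, Hugo Gonçalo Oliveira
Affiliations: University of Coimbra; Polytechnic University of Coimbra
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: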

Abstract:Large language models (LLMs) are evolving fast and are now frequently used as evaluators, in a process typically referred to as LLM-as-a-Judge, which provides quality assessments of model outputs. However, recent research points out significant vulnerabilities in such evaluation, including sensitivity to prompts, systematic biases, verbosity effects, and unreliable or hallucinated rationales. These limitations motivated the development of a more robust paradigm, dubbed LLM-as-a-Meta-Judge. This survey reviews recent advances in meta-judging and organizes the literature, by introducing a framework along six key perspectives: (i) Conceptual Foundations, (ii) Mechanisms of Meta-Judging, (iii) Alignment Training Methods, (iv) Evaluation, (v) Limitations and Failure Modes, and (vi) Future Directions. By analyzing the limitations of LLM-as-a-Judge and summarizing recent advances in meta-judging by LLMs, we argue that LLM-as-a-Meta-Judge offers a promising direction for more stable and trustworthy automated evaluation, while highlighting remaining challenges related to cost, prompt sensitivity, and shared model biases, which must be addressed to advance the next generation of LLM evaluation methodologies.

[NLP-127] Structure-Aware NL-to-SQL for SFC Provisioning via AST-Masking Empowered Language Models

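[Quick Read]: This paper targets Service Function Chain (SFC) provisioning in dynamic, latency-sensitive networks, where conventional Reinforcement Learning (RL) methods ignore structured domain knowledge and therefore generalize poorly and are hard to interpret. To enable specification-driven SFC management, it uses Large Language Models (LLMs) to translate natural language (NL) specifications into executable Structured Query Language (SQL) commands. The key to the solution is Abstract Syntax Tree (AST)-Masking, a fine-tuning method that uses the SQL AST to assign weights to key components and enforce syntax-aware learning during training, improving the accuracy and syntactic correctness of generated SQL without adding inference overhead. Experiments show that AST-Masking significantly improves SQL generation accuracy across multiple language models: FLAN-T5 reaches an Execution Accuracy (EA) of 99.6%, and Gemma shows the largest absolute gain, rising from 7.5% to 72.0%, supporting structure-aware fine-tuning as a way to obtain syntactically correct and efficient SQL.

Link: https://arxiv.org/abs/2601.17295
Authors: Xinyu Zhu,Parisa Fard Moshiri,Poonam Lohan,Burak Kantarci,Emil Janulewicz
Affiliations: University of Ottawa; Ciena
Categories: Networking and Internet Architecture (cs.NI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 6 pages, 3 figures, accepted to IEEE International Conference on Communications (ICC) 2026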

Abstract:Effective Service Function Chain (SFC) provisioning requires precise orchestration in dynamic and latency-sensitive networks. Reinforcement Learning (RL) improves adaptability but often ignores structured domain knowledge, which limits generalization and interpretability. Large Language Models (LLMs) address this gap by translating natural language (NL) specifications into executable Structured Query Language (SQL) commands for specification-driven SFC management. Conventional fine-tuning, however, can cause syntactic inconsistencies and produce inefficient queries. To overcome this, we introduce Abstract Syntax Tree (AST)-Masking, a structure-aware fine-tuning method that uses SQL ASTs to assign weights to key components and enforce syntax-aware learning without adding inference overhead. Experiments show that AST-Masking significantly improves SQL generation accuracy across multiple language models. FLAN-T5 reaches an Execution Accuracy (EA) of 99.6%, while Gemma achieves the largest absolute gain from 7.5% to 72.0%. These results confirm the effectiveness of structure-aware fine-tuning in ensuring syntactically correct and efficient SQL generation for interpretable SFC orchestration.
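
The abstract describes weighting key SQL components during fine-tuning. A minimal sketch of that general idea, a token-weighted loss, might look like the following. It is not the authors' implementation: the `STRUCTURAL` keyword set, the function names, and the weight of 2.0 are hypothetical, and a simple keyword lookup stands in for real AST parsing.

```python
import math

# Hypothetical stand-in for AST-derived structure: a fixed set of SQL keywords.
STRUCTURAL = {"SELECT", "FROM", "WHERE", "JOIN", "ON", "GROUP", "BY", "ORDER", "AND", "OR"}

def token_weights(tokens, structural_weight=2.0):
    """Give structural tokens a larger loss weight than ordinary tokens."""
    return [structural_weight if t.upper() in STRUCTURAL else 1.0 for t in tokens]

def weighted_nll(token_probs, weights):
    """Weighted mean negative log-likelihood of the reference tokens.

    token_probs[i] is the model's probability for the i-th reference token.
    """
    total = sum(w * -math.log(p) for p, w in zip(token_probs, weights))
    return total / sum(weights)

tokens = ["SELECT", "name", "FROM", "vnf"]
weights = token_weights(tokens)
# A mistake on a structural token costs more than the same mistake elsewhere.
loss_structural_error = weighted_nll([0.5, 1.0, 1.0, 1.0], weights)
loss_plain_error = weighted_nll([1.0, 0.5, 1.0, 1.0], weights)
```

Because the weights only change the training loss, nothing extra runs at inference time, which matches the "no inference overhead" claim in the abstract.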

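[NLP-128] Mind the Ambiguity: Aleatoric Uncertainty Quantification in LLMs for Safe Medical Question Answering WWW2026

[Quick Read]: This paper addresses the drop in accuracy that Large Language Models (LLMs) suffer in Medical Question Answering (MQA) when user queries are ambiguous, a significant safety risk in high-stakes healthcare settings. The key to the solution is to formally link input ambiguity to aleatoric uncertainty (AU), the uncertainty that arises from underspecified input, and to build a "Clarify-Before-Answer" framework on that link. The core component is AU-Probe, a lightweight module that detects AU directly from the LLM's hidden states without fine-tuning or multiple forward passes, so the system can efficiently and proactively ask the user for clarification. Experiments on four open LLMs show an average accuracy improvement of 9.48% over baselines.

Link: https://arxiv.org/abs/2601.17284
Authors: Yaokun Liu,Yifan Liu,Phoebe Mbuvi,Zelin Li,Ruichen Yao,Gawon Lim,Dong Wang
Affiliations: University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at The Web Conference 2026 (WWW 2026)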

Abstract:The deployment of Large Language Models in Medical Question Answering is severely hampered by ambiguous user queries, a significant safety risk that demonstrably reduces answer accuracy in high-stakes healthcare settings. In this paper, we formalize this challenge by linking input ambiguity to aleatoric uncertainty (AU), which is the irreducible uncertainty arising from underspecified input. To facilitate research in this direction, we construct CV-MedBench, the first benchmark designed for studying input ambiguity in Medical QA. Using this benchmark, we analyze AU from a representation engineering perspective, revealing that AU is linearly encoded in LLM’s internal activation patterns. Leveraging this insight, we introduce a novel AU-guided “Clarify-Before-Answer” framework, which incorporates AU-Probe - a lightweight module that detects input ambiguity directly from hidden states. Unlike existing uncertainty estimation methods, AU-Probe requires neither LLM fine-tuning nor multiple forward passes, enabling an efficient mechanism to proactively request user clarification and significantly enhance safety. Extensive experiments across four open LLMs demonstrate the effectiveness of our QA framework, with an average accuracy improvement of 9.48% over baselines. Our framework provides an efficient and robust solution for safe Medical QA, strengthening the reliability of health-related applications. The code is available at this https URL, and the CV-MedBench dataset is released on Hugging Face at this https URL.

[NLP-129] PingPong: A Natural Benchmark for Multi-Turn Code-Switching Dialogues

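[Quick Read]: This paper addresses the limited ability of current Natural Language Processing (NLP) systems to understand code-switching dialogues, and in particular the failure of existing benchmarks to reflect the complexity of real-world multilingual communication. The key to the solution is PingPong, a benchmark of human-authored, naturally multi-threaded conversations among 2 to 4 participants. It covers five language-combination variations (some trilingual) and features long reply distances and marked differences in speaker dominance, so it mirrors everyday multilingual exchanges more closely. On top of this dataset the authors define three downstream tasks: question answering, dialogue summarization, and topic classification. Experiments show that state-of-the-art language models remain limited on code-switched inputs, underscoring the need for more robust NLP systems.

Link: https://arxiv.org/abs/2601.17277
Authors: Mohammad Rifqi Farhansyah,Hanif Muhammad Zhafran,Farid Adilazuarda,Shamsuddeen Hassan Muhammad,Maryam Ibrahim Mukhtar,Nedjma Ousidhoum,Genta Indra Winata,Ayu Purwarianti,Alham Fikri Aji
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: preprint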

Abstract:Code-switching is a widespread practice among the world’s multilingual majority, yet few benchmarks accurately reflect its complexity in everyday communication. We present PingPong, a benchmark for natural multi-party code-switching dialogues covering five language-combination variations, some of which are trilingual. Our dataset consists of human-authored conversations among 2 to 4 participants covering authentic, multi-threaded structures where replies frequently reference much earlier points in the dialogue. We demonstrate that our data is significantly more natural and structurally diverse than machine-generated alternatives, offering greater variation in message length, speaker dominance, and reply distance. Based on these dialogues, we define three downstream tasks: Question Answering, Dialogue Summarization, and Topic Classification. Evaluations of several state-of-the-art language models on PingPong reveal that performance remains limited on code-switched inputs, underscoring the urgent need for more robust NLP systems capable of addressing the intricacies of real-world multilingual discourse.

[NLP-130] Window Size Versus Accuracy Experiments in Voice Activity Detectors

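[Quick Read]: This paper addresses the loss of accuracy in Voice Activity Detection (VAD) caused by a poorly chosen window size, focusing on how different VAD algorithms behave across diverse real-world digital audio streams. The approach is a systematic evaluation of three widely used VAD algorithms (Silero, WebRTC, and Root Mean Square (RMS)) at different window sizes, followed by the addition of hysteresis on top of each VAD output to stabilize it. Silero is significantly more accurate than WebRTC and RMS, and hysteresis benefits WebRTC, giving empirical guidance for tuning VAD systems.

Link: https://arxiv.org/abs/2601.17270
Authors: Max McKinnon,Samir Khaki,Chandan KA Reddy,William Huang
Affiliations: Unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: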

Abstract:Voice activity detection (VAD) plays a vital role in enabling applications such as speech recognition. We analyze the impact of window size on the accuracy of three VAD algorithms: Silero, WebRTC, and Root Mean Square (RMS) across a set of diverse real-world digital audio streams. We additionally explore the use of hysteresis on top of each VAD output. Our results offer practical references for optimizing VAD systems. Silero significantly outperforms WebRTC and RMS, and hysteresis provides a benefit for WebRTC.
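
To make the window-size and hysteresis terms concrete, here is a minimal sketch of an RMS-energy detector over fixed, non-overlapping windows with a simple count-based hysteresis on top. The function names, the threshold, and the counting rule are assumptions for illustration; the abstract does not specify how the paper implements either step.

```python
import math

def rms_vad(samples, window_size, threshold):
    """Return one decision per non-overlapping window: True if RMS >= threshold."""
    decisions = []
    for start in range(0, len(samples) - window_size + 1, window_size):
        window = samples[start:start + window_size]
        rms = math.sqrt(sum(x * x for x in window) / window_size)
        decisions.append(rms >= threshold)
    return decisions

def apply_hysteresis(decisions, on_count, off_count):
    """Flip state only after enough consecutive windows disagree with it.

    on_count: consecutive speech windows needed to switch to speech.
    off_count: consecutive non-speech windows needed to switch back.
    """
    state = False
    run = 0
    smoothed = []
    for d in decisions:
        if d != state:
            run += 1
            needed = on_count if d else off_count
            if run >= needed:
                state = d
                run = 0
        else:
            run = 0
        smoothed.append(state)
    return smoothed

samples = [0.0] * 4 + [1.0] * 4 + [0.0] * 4
raw = rms_vad(samples, window_size=4, threshold=0.5)
smoothed = apply_hysteresis(
    [False, True, False, True, True, True, False, False], on_count=2, off_count=2
)
```

A smaller `window_size` gives finer time resolution but noisier per-window decisions, which is the trade-off the hysteresis step is meant to soften.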

[NLP-131] Frame-Guided Synthetic Claim Generation for Automatic Fact-Checking Using High-Volume Tabular Data

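[Quick Read]: This paper addresses the fact that automated fact-checking benchmarks have largely ignored verification against real-world, high-volume structured data and instead rely on small, curated tables that do not reflect realistic complexity. The key to the solution is a large-scale, multilingual fact-checking dataset of 78,503 synthetic claims grounded in 434 complex Organisation for Economic Co-operation and Development (OECD) tables, which average over 500K rows each, together with a frame-guided generation method in which algorithms programmatically select significant data points based on semantic frames and generate realistic claims in several languages. Knowledge-probing experiments indicate that large language models (LLMs) have not memorized these facts, so systems must perform genuine retrieval and reasoning instead of relying on parametric knowledge. The analysis identifies evidence retrieval as the main bottleneck for current models, which highlights how hard accurate fact-checking over very large structured data remains.

Link: https://arxiv.org/abs/2601.17232
Authors: Jacob Devasier,Akshith Putta,Qing Wang,Alankrit Moses,Chengkai Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: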

Abstract:Automated fact-checking benchmarks have largely ignored the challenge of verifying claims against real-world, high-volume structured data, instead focusing on small, curated tables. We introduce a new large-scale, multilingual dataset to address this critical gap. It contains 78,503 synthetic claims grounded in 434 complex OECD tables, which average over 500K rows each. We propose a novel, frame-guided methodology where algorithms programmatically select significant data points based on six semantic frames to generate realistic claims in English, Chinese, Spanish, and Hindi. Crucially, we demonstrate through knowledge-probing experiments that LLMs have not memorized these facts, forcing systems to perform genuine retrieval and reasoning rather than relying on parameterized knowledge. We provide a baseline SQL-generation system and show that our benchmark is highly challenging. Our analysis identifies evidence retrieval as the primary bottleneck, with models struggling to find the correct data in massive tables. This dataset provides a critical new resource for advancing research on this unsolved, real-world problem.

[NLP-132] CaseFacts: A Benchmark for Legal Fact-Checking and Precedent Retrieval

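[Quick Read]: This paper addresses the long-standing neglect of high-stakes domains such as law in automated fact-checking: existing methods mostly verify claims against static corpora and cannot handle legal text, where truth evolves over time and is technically complex. The key to the solution is CaseFacts, a new benchmark for verifying colloquial legal claims against U.S. Supreme Court precedents. The dataset is built with a multi-stage pipeline that uses Large Language Models (LLMs) to synthesize claims from expert case summaries and a novel semantic similarity heuristic to efficiently identify and verify complex legal overrulings. The design requires systems both to bridge the semantic gap between lay assertions and technical judicial language and to account for the temporal validity of a claim, with the aim of advancing research on legal fact verification systems.

Link: https://arxiv.org/abs/2601.17230
Authors: Akshith Reddy Putta,Jacob Devasier,Chengkai Li
Affiliations: University of Texas at Arlington
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: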

Abstract:Automated Fact-Checking has largely focused on verifying general knowledge against static corpora, overlooking high-stakes domains like law where truth is evolving and technically complex. We introduce CaseFacts, a benchmark for verifying colloquial legal claims against U.S. Supreme Court precedents. Unlike existing resources that map formal texts to formal texts, CaseFacts challenges systems to bridge the semantic gap between layperson assertions and technical jurisprudence while accounting for temporal validity. The dataset consists of 6,294 claims categorized as Supported, Refuted, or Overruled. We construct this benchmark using a multi-stage pipeline that leverages Large Language Models (LLMs) to synthesize claims from expert case summaries, employing a novel semantic similarity heuristic to efficiently identify and verify complex legal overrulings. Experiments with state-of-the-art LLMs reveal that the task remains challenging; notably, augmenting models with unrestricted web search degrades performance compared to closed-book baselines due to the retrieval of noisy, non-authoritative precedents. We release CaseFacts to spur research into legal fact verification systems.

[NLP-133] Retell Reward Repeat: Reinforcement Learning for Narrative Theory-Informed Story Generation

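[Quick Read]: This paper addresses the limits that reliance on scarce ground-truth data places on training and evaluating automatic story generation (ASG), where the subjectivity of narrative makes it hard for models to align with human preferences. The key to the solution is a reinforcement learning framework (d-RLAIF) grounded in Todorov's Theory of Narrative Equilibrium: narrative principles are turned into reward signals supplied by LLM-as-judge models that are tested for alignment with human annotators, and Gemini-3-Flash is used to evaluate the generated stories against human-written ones. The method produces stories that are more diverse and better aligned with human narrative conventions than supervised fine-tuning (SFT), showing the promise of reinforcement learning for linguistically grounded post-training on subjective tasks.

Link: https://arxiv.org/abs/2601.17226
Authors: David Y. Liu,Xanthe Muston,Aditya Joshi,Sebastian Sequoiah-Grayson
Affiliations: University of New South Wales
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 Pages, 6 figures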

Abstract:Despite the subjective nature of storytelling, past works on automatic story generation (ASG) have relied on limited ground truths for training and evaluation. In this work, we explore reinforcement learning (d-RLAIF) as a post-training alternative to supervised fine-tuning (SFT). We first apply Todorov’s Theory of Narrative Equilibrium to establish principles that define desirable ASG qualities. We prompt 7B and 14B LLM-as-judge models with our principles to test alignment with human annotators and provide reward signals during d-RLAIF. We use Gemini-3-Flash to evaluate the output of our post-trained models and compare them to human-written stories from the TimeTravel dataset. We show that d-RLAIF offers a viable alternative to supervised fine-tuning (SFT)–producing stories that are more diverse and aligned with human narrative conventions. Our paper demonstrates the promise of reinforcement learning for linguistically grounded post-training for subjective tasks such as ASG.

[NLP-134] Beyond Outcome Verification: Verifiable Process Reward Models for Structured Reasoning

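[Quick Read]: This paper addresses the lack of verifiability and transparency in the reasoning of Large Language Models (LLMs), in particular the opacity, bias, and reward-hacking risks that arise when neural judges score Chain-of-Thought (CoT) steps. The key to the solution is Verifiable Process Reward Models (VPRMs), a mechanism that verifies intermediate reasoning with deterministic rules: programmatic, rule-based verifiers check each reasoning step, so intermediate decisions follow domain rules and step-level decisions are substantially more coherent with the final label.

Link: https://arxiv.org/abs/2601.17223
Authors: Massimiliano Pronesti,Anya Belz,Yufang Hou
Affiliations: IBM Research Europe - Ireland; Dublin City University; IT:U Interdisciplinary Transformation University Austria
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: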

Abstract:Recent work on reinforcement learning with verifiable rewards (RLVR) has shown that large language models (LLMs) can be substantially improved using outcome-level verification signals, such as unit tests for code or exact-match checks for mathematics. In parallel, process supervision has long been explored as a way to shape the intermediate reasoning behaviour of LLMs, but existing approaches rely on neural judges to score chain-of-thought steps, leaving them vulnerable to opacity, bias, and reward hacking. To address this gap, we introduce Verifiable Process Reward Models (VPRMs), a reinforcement-learning framework in which intermediate reasoning steps are checked by deterministic, rule-based verifiers. We apply VPRMs to risk-of-bias assessment for medical evidence synthesis, a domain where guideline-defined criteria and rule-based decision paths enable programmatic verification of reasoning traces. Across multiple datasets, we find that VPRMs generate reasoning that adheres closely to domain rules and achieve substantially higher coherence between step-level decisions and final labels. Results show that VPRMs achieve up to 20% higher F1 than state-of-the-art models and 6.5% higher than verifiable outcome rewards, with substantial gains in evidence grounding and logical coherence.

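[NLP-135] DF-RAG: Query-Aware Diversity for Retrieval-Augmented Generation EACL2026

[Quick Read]: This paper addresses the limited performance of Retrieval-Augmented Generation (RAG) on reasoning-intensive question answering (QA). The core difficulty is that conventional cosine-similarity retrieval maximizes relevance but tends to introduce redundant content, which lowers information recall. The key to the solution is Diversity-Focused Retrieval-Augmented Generation (DF-RAG), which systematically adds diversity to the retrieval step by selecting chunks that are both relevant to the query and maximally dissimilar from each other. Its main innovation is that it optimizes the level of diversity for each query dynamically at test time, with no additional fine-tuning or prior information. DF-RAG improves F1 on reasoning-intensive QA benchmarks by 4-10 percent over vanilla RAG and captures up to 91.3 percent of an estimated Oracle ceiling of up to 18 percent absolute F1 gain.

Link: https://arxiv.org/abs/2601.17212
Authors: Saadat Hasan Khan,Spencer Hong,Jingyu Wu,Kevin Lybarger,Youbing Yin,Erin Babinsky,Daben Liu
Affiliations: George Mason University; Capital One
Categories: Computation and Language (cs.CL)
Comments: Accepted to Findings of EACL 2026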

Abstract:Retrieval-augmented generation (RAG) is a common technique for grounding language model outputs in domain-specific information. However, RAG is often challenged by reasoning-intensive question-answering (QA), since common retrieval methods like cosine similarity maximize relevance at the cost of introducing redundant content, which can reduce information recall. To address this, we introduce Diversity-Focused Retrieval-Augmented Generation (DF-RAG), which systematically incorporates diversity into the retrieval step to improve performance on complex, reasoning-intensive QA benchmarks. DF-RAG builds upon the Maximal Marginal Relevance framework to select information chunks that are both relevant to the query and maximally dissimilar from each other. A key innovation of DF-RAG is its ability to optimize the level of diversity for each query dynamically at test time without requiring any additional fine-tuning or prior information. We show that DF-RAG improves F1 performance on reasoning-intensive QA benchmarks by 4-10 percent over vanilla RAG using cosine similarity and also outperforms other established baselines. Furthermore, we estimate an Oracle ceiling of up to 18 percent absolute F1 gains over vanilla RAG, of which DF-RAG captures up to 91.3 percent.
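
The abstract says DF-RAG builds on Maximal Marginal Relevance (MMR). A minimal sketch of plain greedy MMR selection over embedding vectors could look like this. It uses a fixed trade-off `lam` chosen by the caller; DF-RAG's per-query, test-time choice of the diversity level is not reproduced here, and the function names and toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(query_vec, chunk_vecs, k, lam):
    """Greedy MMR: return indices of up to k chunks.

    lam = 1.0 ranks by relevance only; lower values penalize chunks that
    are similar to ones already selected.
    """
    selected = []
    candidates = list(range(len(chunk_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, chunk_vecs[i])
            redundancy = max(
                (cosine(chunk_vecs[i], chunk_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query = [1.0, 0.0]
chunks = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]  # chunk 1 nearly duplicates chunk 0
relevance_only = mmr_select(query, chunks, k=2, lam=1.0)
diverse = mmr_select(query, chunks, k=2, lam=0.3)
```

With `lam=1.0` the near-duplicate chunk is retrieved second; with `lam=0.3` it is skipped in favor of the less similar chunk, which is the redundancy effect the abstract describes.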

[NLP-136] Relating Word Embedding Gender Biases to Gender Gaps: A Cross-Cultural Analysis

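[Quick Read]: This paper addresses the gender bias that Natural Language Processing (NLP) models inherit from culturally derived training text such as news and social media, and asks whether that bias reflects real gender gaps in society. The approach is first to quantify gender bias in word embeddings and then to correlate the resulting bias metrics with statistical gender gaps in education, politics, economics, and health. This shows how bias at the level of language maps onto real social structure, and the regularities and predictive strength of the metrics are characterized on data spanning 51 U.S. regions and 99 countries.

Link: https://arxiv.org/abs/2601.17203
Authors: Scott Friedman,Sonja Schmer-Galunder,Anthony Chen,Jeffrey Rye
Affiliations: SIFT
Categories: Computation and Language (cs.CL)
Comments: 7 pages, 5 figures. Presented at the First Workshop on Gender Bias in Natural Language Processing (GeBNLP 2019)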

Abstract:Modern models for common NLP tasks often employ machine learning techniques and train on journalistic, social media, or other culturally-derived text. These have recently been scrutinized for racial and gender biases, rooting from inherent bias in their training text. These biases are often sub-optimal and recent work poses methods to rectify them; however, these biases may shed light on actual racial or gender gaps in the culture(s) that produced the training text, thereby helping us understand cultural context through big data. This paper presents an approach for quantifying gender bias in word embeddings, and then using them to characterize statistical gender gaps in education, politics, economics, and health. We validate these metrics on 2018 Twitter data spanning 51 U.S. regions and 99 countries. We correlate state and country word embedding biases with 18 international and 5 U.S.-based statistical gender gaps, characterizing regularities and predictive strength.
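
The abstract does not give the paper's bias metric, so the following is only a sketch of one common way to score a word's gender association in an embedding space: mean cosine similarity to male anchor words minus mean cosine similarity to female anchor words. The anchor lists, function names, and toy two-dimensional vectors are assumptions for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def gender_bias(word_vec, male_vecs, female_vecs):
    """Positive values lean toward the male anchors, negative toward the female anchors."""
    male = sum(cosine(word_vec, m) for m in male_vecs) / len(male_vecs)
    female = sum(cosine(word_vec, f) for f in female_vecs) / len(female_vecs)
    return male - female

# Toy vectors: axis 0 plays the role of "male", axis 1 of "female".
male_anchors = [[1.0, 0.0]]
female_anchors = [[0.0, 1.0]]
skewed = gender_bias([1.0, 0.0], male_anchors, female_anchors)
neutral = gender_bias([1.0, 1.0], male_anchors, female_anchors)
```

Scores computed this way per region could then be correlated with a regional gender-gap statistic, which is the kind of comparison the abstract reports.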

[NLP-137] Reasoning Beyond Literal: Cross-style Multimodal Reasoning for Figurative Language Understanding

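[Quick Read]: This paper addresses the weak performance of Vision-Language Models (VLMs) on multimodal figurative language such as sarcasm, humor, and metaphor. Such language depends on subtle incongruities between expressed and intended meaning, and an accompanying image can amplify or invert the text, so models must reason across modalities and account for subjectivity. The key to the solution is a three-step framework that (i) interprets multimodal figurative language, (ii) produces interpretable reasoning traces for transparency, and (iii) generalizes by training jointly across figurative styles. Experiments show that adding reasoning traces substantially improves understanding and that reasoning learned in one style transfers to related styles (for example between sarcasm and humor). The resulting lightweight VLM generalizes across styles better than much larger open- and closed-source models while providing inspectable reasoning.

Link: https://arxiv.org/abs/2601.17197
Authors: Seyyed Saeid Cheshmi,Hahnemann Ortiz,James Mooney,Dongyeop Kang
Affiliations: University of Minnesota
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: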

Abstract:Vision-language models (VLMs) have demonstrated strong reasoning abilities in literal multimodal tasks such as visual mathematics and science question answering. However, figurative language, such as sarcasm, humor, and metaphor, remains a significant challenge, as it conveys intent and emotion through subtle incongruities between expressed and intended meanings. In multimodal settings, accompanying images can amplify or invert textual meaning, demanding models that reason across modalities and account for subjectivity. We propose a three-step framework for developing efficient multimodal reasoning models that can (i) interpret multimodal figurative language, (ii) provide transparent reasoning traces, and (iii) generalize across multiple figurative styles. Experiments across four styles show that (1) incorporating reasoning traces substantially improves multimodal figurative understanding, (2) reasoning learned in one style can transfer to others, especially between related styles like sarcasm and humor, and (3) training jointly across styles yields a generalized reasoning VLM that outperforms much larger open- and closed-source models. Our findings show that lightweight VLMs with verifiable reasoning achieve robust cross-style generalization while providing inspectable reasoning traces for multimodal tasks. The code and implementation are available at this https URL.

[NLP-138] Beyond Simulations: What 20000 Real Conversations Reveal About Mental Health AI Safety

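[Quick Read]: This paper addresses the fact that safety evaluations of large language models (LLMs) used for mental health support rely on small, simulation-based test sets whose relationship to the linguistic distribution of real usage is unknown, so measured safety may not reflect deployed behavior. The key to the solution is an ecological audit of more than 20,000 real user conversations with an AI purpose-built for mental health support, comparing its performance on standard test sets with its performance in deployment. Test-set failure rates were far higher than in real use. Across all conversations, three mentions of non-suicidal self-injury (NSSI) risk (0.015%) did not trigger a crisis intervention, which corresponds to an end-to-end false negative rate of 0.38% among sessions flagged by the LLM judge, and clinician review found no case of suicide risk that failed to receive crisis resources. The findings support continuous, deployment-relevant safety assurance over static benchmark certification.

Link: https://arxiv.org/abs/2601.17003
Authors: Caitlin A. Stamatis,Jonah Meyerhoff,Richard Zhang,Olivier Tieleman,Matteo Malgaroli,Thomas D. Hull
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: 38 pages, 8 figures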

Abstract:Large language models (LLMs) are increasingly used for mental health support, yet existing safety evaluations rely primarily on small, simulation-based test sets that have an unknown relationship to the linguistic distribution of real usage. In this study, we present replications of four published safety test sets targeting suicide risk assessment, harmful content generation, refusal robustness, and adversarial jailbreaks for a leading frontier generic AI model alongside an AI purpose built for mental health support. We then propose and conduct an ecological audit on over 20,000 real-world user conversations with the purpose-built AI designed with layered suicide and non-suicidal self-injury (NSSI) safeguards to compare test set performance to real world performance. While the purpose-built AI was significantly less likely than general-purpose LLMs to produce enabling or harmful content across suicide/NSSI (.4-11.27% vs 29.0-54.4%), eating disorder (8.4% vs 54.0%), and substance use (9.9% vs 45.0%) benchmark prompts, test set failure rates for suicide/NSSI were far higher than in real-world deployment. Clinician review of flagged conversations from the ecological audit identified zero cases of suicide risk that failed to receive crisis resources. Across all 20,000 conversations, three mentions of NSSI risk (.015%) did not trigger a crisis intervention; among sessions flagged by the LLM judge, this corresponds to an end-to-end system false negative rate of .38%, providing a lower bound on real-world safety failures. These findings support a shift toward continuous, deployment-relevant safety assurance for AI mental-health systems rather than limited set benchmark certification.

[NLP-139] RAM-SD: Retrieval-Augmented Multi-agent framework for Sarcasm Detection

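[Quick Read]: This paper addresses the difficulty of sarcasm detection, which depends on nuanced contextual understanding, world knowledge, and multi-faceted linguistic cues. Existing approaches, from fine-tuned transformers to large language models, apply one uniform reasoning strategy and struggle with the varied analytical demands of different sarcastic expressions, such as modeling violations of contextual expectation, grounding in external knowledge, or recognizing specific rhetorical patterns. The key to the solution is RAM-SD, a retrieval-augmented multi-agent framework with four stages: contextual retrieval grounds the query in sarcastic and non-sarcastic exemplars; a meta-planner classifies the sarcasm type and selects a reasoning plan; several specialized agents carry out complementary, multi-view analysis; and an integrator merges the results into a final judgment with a natural language explanation. On four standard benchmarks RAM-SD reaches a Macro-F1 of 77.74%, 7.01 points above the GPT-4o+CoC baseline, while providing transparent, traceable reasoning.

Link: https://arxiv.org/abs/2601.17002
Authors: Ziyang Zhou,Ziqi Liu,Yan Wang,Yiming Lin,Yangbin Chen
Affiliations: Xi’an Jiaotong–Liverpool University
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 4 figures, 6 tables, preprint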

Abstract:Sarcasm detection remains a significant challenge due to its reliance on nuanced contextual understanding, world knowledge, and multi-faceted linguistic cues that vary substantially across different sarcastic expressions. Existing approaches, from fine-tuned transformers to large language models, apply a uniform reasoning strategy to all inputs, struggling to address the diverse analytical demands of sarcasm. These demands range from modeling contextual expectation violations to requiring external knowledge grounding or recognizing specific rhetorical patterns. To address this limitation, we introduce RAM-SD, a Retrieval-Augmented Multi-Agent framework for Sarcasm Detection. The framework operates through four stages: (1) contextual retrieval grounds the query in both sarcastic and non-sarcastic exemplars; (2) a meta-planner classifies the sarcasm type and selects an optimal reasoning plan from a predefined set; (3) an ensemble of specialized agents performs complementary, multi-view analysis; and (4) an integrator synthesizes these analyses into a final, interpretable judgment with a natural language explanation. Evaluated on four standard benchmarks, RAM-SD achieves a state-of-the-art Macro-F1 of 77.74%, outperforming the strong GPT-4o+CoC baseline by 7.01 points. Our framework not only sets a new performance benchmark but also provides transparent and interpretable reasoning traces, illuminating the cognitive processes behind sarcasm comprehension.

[NLP-140] Uncertainty Quantification for Named Entity Recognition via Full-Sequence and Subsequence Conformal Prediction

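[Quick Read]: This paper addresses the fact that current Named Entity Recognition (NER) models output a single label sequence with no measure of uncertainty, leaving downstream applications vulnerable to cascading errors. The key to the solution is a general framework, based on conformal prediction, that adapts sequence-labeling NER models to produce uncertainty-aware prediction sets at a user-specified confidence level; the sets carry a coverage guarantee that they contain the correct label sequence with at least that probability. By designing efficient nonconformity scoring functions, the method accounts for heterogeneity across sentence length, language, entity type, and number of entities per sentence, and supports both unconditional and class-conditional coverage. Experiments on several benchmark datasets and NER models demonstrate its applicability, validity, and efficiency.

Link: https://arxiv.org/abs/2601.16999
Authors: Matthew Singer,Srijan Sengupta,Karl Pazdernik
Affiliations: North Carolina State University; Pacific Northwest National Laboratory
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: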

Abstract:Named Entity Recognition (NER) serves as a foundational component in many natural language processing (NLP) pipelines. However, current NER models typically output a single predicted label sequence without any accompanying measure of uncertainty, leaving downstream applications vulnerable to cascading errors. In this paper, we introduce a general framework for adapting sequence-labeling-based NER models to produce uncertainty-aware prediction sets. These prediction sets are collections of full-sentence labelings that are guaranteed to contain the correct labeling with a user-specified confidence level. This approach serves a role analogous to confidence intervals in classical statistics by providing formal guarantees about the reliability of model predictions. Our method builds on conformal prediction, which offers finite-sample coverage guarantees under minimal assumptions. We design efficient nonconformity scoring functions to construct efficient, well-calibrated prediction sets that support both unconditional and class-conditional coverage. This framework accounts for heterogeneity across sentence length, language, entity type, and number of entities within a sentence. Empirical experiments on four NER models across three benchmark datasets demonstrate the broad applicability, validity, and efficiency of the proposed methods.
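
As a rough illustration of how conformal prediction turns model scores into prediction sets, here is a sketch of the standard split-conformal recipe. The nonconformity score used here, one minus the model's probability for a full-sentence labeling, is a common stand-in and not necessarily the scoring function the paper designs; the function names and toy numbers are hypothetical.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration nonconformity scores.

    Each calibration score is the nonconformity of the TRUE labeling of a
    held-out sentence.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: accept every candidate.
        return float("inf")
    return sorted(calibration_scores)[rank - 1]

def prediction_set(candidate_probs, threshold):
    """Keep every candidate labeling whose nonconformity (1 - prob) is within the threshold."""
    return [labels for labels, p in candidate_probs.items() if 1.0 - p <= threshold]

calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
threshold = conformal_threshold(calibration, alpha=0.5)
candidates = {
    ("B-PER", "O"): 0.7,
    ("O", "O"): 0.2,
    ("B-ORG", "O"): 0.1,
}
kept = prediction_set(candidates, threshold)
```

Lowering `alpha` raises the threshold, so the set grows until it is large enough to meet the requested coverage; with very little calibration data it falls back to keeping every candidate.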

[NLP-141] Evaluating Reward Model Generalization via Pairwise Maximum Discrepancy Competitions

[Quick Read]: This paper addresses the weak assessment of generalization in current reward model (RM) evaluation: existing methods rely on static, pre-annotated preference datasets and cannot faithfully measure how an RM handles unseen prompts and out-of-distribution data in open-world settings. The key to the solution is the Pairwise Maximum Discrepancy Competition (PMDC) framework. From a large unlabeled prompt pool it dynamically selects the prompt-response pairs on which two RMs disagree most, producing highly contentious test cases that an oracle adjudicates; the outcomes are aggregated with a Bradley-Terry model into a global ranking and a pairwise win-rate landscape, giving an annotation-efficient measure of RM generalization.

Link: https://arxiv.org/abs/2601.16987
Authors: Shunyang Luo,Peibei Cao,Zhihui Zhu,Kehua Feng,Zhihua Wang,Keyan Ding
Affiliations: ZJU-UIUC Institute, Zhejiang University; School of Artificial Intelligence, Nanjing University of Information Science and Technology; ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University; City University of Hong Kong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages, 6 figures, 2 tables

Abstract:Reward models (RMs) are central to aligning large language models, yet their practical effectiveness hinges on generalization to unseen prompts and shifting distributions. Most existing RM evaluations rely on static, pre-annotated preference datasets, which provide limited coverage and often fail to faithfully assess generalization in open-world settings. We introduce Pairwise Maximum Discrepancy Competition (PMDC), a dynamic and annotation-efficient framework for evaluating RM generalization using a large, unlabeled, open-domain prompt pool. PMDC actively selects prompt–response pairs that maximize disagreement between two RMs, yielding a compact set of highly contentious test cases. These cases are adjudicated by an oracle, and the resulting outcomes are aggregated via a Bradley–Terry model to produce a global ranking and pairwise win-rate landscape of RMs. We apply PMDC to re-evaluate 10 representative RMs and observe substantial rank reshuffling compared with conventional benchmarks. Qualitative analyses further uncover systematic generalization failures, providing valuable insights for improving reward modeling.
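
The aggregation step mentioned in the abstract can be sketched with the classic iterative (minorization-maximization) fit of a Bradley-Terry model to pairwise win counts. This is a generic sketch, not the paper's code: the function name, the iteration count, and the toy win counts for three hypothetical reward models are all assumptions.

```python
def bradley_terry(wins, n_items, iterations=200):
    """Fit Bradley-Terry strengths from pairwise outcomes.

    wins[(i, j)] is the number of comparisons in which item i beat item j.
    Returns a list of strengths normalized to sum to 1.
    """
    strengths = [1.0 / n_items] * n_items
    for _ in range(iterations):
        updated = []
        for i in range(n_items):
            total_wins = sum(wins.get((i, j), 0) for j in range(n_items) if j != i)
            denom = 0.0
            for j in range(n_items):
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                if games:
                    denom += games / (strengths[i] + strengths[j])
            # Items that played no games keep their current strength.
            updated.append(total_wins / denom if denom else strengths[i])
        norm = sum(updated)
        strengths = [s / norm for s in updated]
    return strengths

# Toy adjudicated outcomes among three hypothetical reward models.
outcomes = {(0, 1): 8, (1, 0): 2, (1, 2): 7, (2, 1): 3, (0, 2): 9, (2, 0): 1}
strengths = bradley_terry(outcomes, n_items=3)
ranking = sorted(range(3), key=lambda i: strengths[i], reverse=True)
```

Under the fitted model, the probability that item i beats item j is `strengths[i] / (strengths[i] + strengths[j])`, which is how a pairwise win-rate table can be read off the same fit.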

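[NLP-142] Crystal-KV: Efficient KV Cache Management for Chain-of-Thought LLMs via Answer-First Principle

[Quick Read]: This paper addresses the excessive memory overhead of Chain-of-Thought (CoT) reasoning in large language models (LLMs), which comes from storing long think-stage sequences in the Key-Value (KV) cache. Conventional KV compression strategies are ineffective for CoT because they do not distinguish how much each think-stage token contributes to the final answer. The key to the solution is the Crystal-KV framework and its answer-first principle: by mapping answer preferences into the think-stage attention map, it separates cache entries into SlipKV (which mainly maintains the reasoning flow but may introduce misleading context) and CrystalKV (which truly contributes to the correctness of the final answer). An attention-based Least Recently Frequently Used algorithm then identifies when a SlipKV entry is no longer useful and evicts it while retaining CrystalKV, so the reasoning flow is not disrupted. Finally, an adaptive cache budget allocation algorithm adjusts the KV cache budget of each layer/head according to its proportion of CrystalKV, which significantly improves throughput and response time while maintaining or even improving CoT answer accuracy.

Link: https://arxiv.org/abs/2601.16986
Authors: Zihan Wang,Cheng Tang,Lei Gong,Cheng Li,Chao Wang,Teng Wang,Wenqi Lou,Xuehai Zhou
Affiliations: University of Science and Technology of China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; Suzhou Institute for Advanced Research, University of Science and Technology of China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: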

Abstract:Chain-of-Thought (CoT) reasoning in large language models (LLMs) significantly improves accuracy on complex tasks, yet incurs excessive memory overhead due to the long think-stage sequences stored in the Key-Value (KV) cache. Unlike traditional generation tasks where all tokens are uniformly important, CoT emphasizes the final answer, rendering conventional KV compression strategies ineffective. In this paper, we present Crystal-KV, an efficient KV cache management framework tailored for CoT reasoning. Our key insight is the answer-first principle. By mapping answer preferences into think-stage attention map, we distinguish between SlipKV, which mainly maintains the reasoning flow but may occasionally introduce misleading context, and CrystalKV, which truly contributes to the correctness of the final answer. Next, we propose an attention-based Least Recently Frequently Used algorithm. It precisely identifies when a SlipKV entry’s utility expires and evicts it, retaining CrystalKV without disrupting reasoning flow. Finally, we introduce an adaptive cache budget allocation algorithm. Based on the dynamic proportion of CrystalKV, it estimates the importance of each layer/head and adjusts the KV cache budget during inference, amplifying critical components to improve budget utilization. Results show that Crystal-KV achieves state-of-the-art KV cache compression, significantly improves throughput, and enables faster response time, while maintaining, or even improving, answer accuracy for CoT reasoning.

[NLP-143] coAI: Advancing 3GPP Technical Specification Search through Agent ic Multi-Modal Retrieval-Augmented Generation AACL2025

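[Quick read]: This paper addresses the difficulty of handling complex queries, visual information, and cross-document dependencies in 3GPP technical specifications, which stems from their hierarchical structure, dense formatting, and multi-modal content. The key to the solution is TelcoAI, a system built on an agentic multi-modal retrieval-augmented generation (RAG) architecture. Its core innovations are section-aware chunking, structured query planning, metadata-guided retrieval, and multi-modal fusion of text and diagrams, which together markedly improve the accuracy and reliability of technical document understanding.

Link: https://arxiv.org/abs/2601.16984
Authors: Rahul Ghosh,Chun-Hao Liu,Gaurav Rele,Vidya Sagar Ravipati,Hazar Aouad
Affiliations: Generative AI Innovation Center, Amazon Web Services (AWS); Bouygues Telecom
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments: Accepted to IJCNLP-AACL 2025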

Abstract:The 3rd Generation Partnership Project (3GPP) produces complex technical specifications essential to global telecommunications, yet their hierarchical structure, dense formatting, and multi-modal content make them difficult to process. While Large Language Models (LLMs) show promise, existing approaches fall short in handling complex queries, visual information, and document interdependencies. We present TelcoAI, an agentic, multi-modal Retrieval-Augmented Generation (RAG) system tailored for 3GPP documentation. TelcoAI introduces section-aware chunking, structured query planning, metadata-guided retrieval, and multi-modal fusion of text and diagrams. Evaluated on multiple benchmarks, including expert-curated queries, our system achieves 87% recall, 83% claim recall, and 92% faithfulness, representing a 16% improvement over state-of-the-art baselines. These results demonstrate the effectiveness of agentic and multi-modal reasoning in technical document understanding, advancing practical solutions for real-world telecommunications research and engineering.

[NLP-144] Noise-Robust AV-ASR Using Visual Features Both in the Whisper Encoder and Decoder ICASSP2026

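[Quick read]: This paper addresses the performance degradation of automatic speech recognition (ASR) systems in noisy environments, in particular by fusing visual information to make audiovisual ASR (AV-ASR) more robust. The key to the solution is a simple and effective visual feature fusion method, the "dual-use" strategy: in a pre-trained Whisper ASR model, visual features are used in both the encoder and the decoder, so that the encoder learns audiovisual interactions and the decoder dynamically weighs the modalities. Experiments show that the method clearly outperforms conventional middle fusion across Whisper model sizes, with relative word error rate (WER) improvements in babble noise of 35% (Whisper small) to 57% (Whisper medium), and sets a new state of the art in noisy conditions on the LRS3 AV-ASR benchmark.

Link: https://arxiv.org/abs/2601.18396
Authors: Zhengyang Li,Thomas Graave,Björn Möller,Zehang Wu,Matthias Franz,Tim Fingscheidt
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Comments: accepted at ICASSP2026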

Abstract:In audiovisual automatic speech recognition (AV-ASR) systems, information fusion of visual features in a pre-trained ASR has been proven as a promising method to improve noise robustness. In this work, based on the prominent Whisper ASR, first, we propose a simple and effective visual fusion method – use of visual features both in encoder and decoder (dual-use) – to learn the audiovisual interactions in the encoder and to weigh modalities in the decoder. Second, we compare visual fusion methods in Whisper models of various sizes. Our proposed dual-use method shows consistent noise robustness improvement, e.g., a 35% relative improvement (WER: 4.41% vs. 6.83%) based on Whisper small, and a 57% relative improvement (WER: 4.07% vs. 9.53%) based on Whisper medium, compared to typical reference middle fusion in babble noise with a signal-to-noise ratio (SNR) of 0dB. Third, we conduct ablation studies examining the impact of various module designs and fusion options. Fine-tuned on 1929 hours of audiovisual data, our dual-use method using Whisper medium achieves 4.08% (MUSAN babble noise) and 4.43% (NoiseX babble noise) average WER across various SNRs, thereby establishing a new state-of-the-art in noisy conditions on the LRS3 AV-ASR benchmark. Our code is at this https URL
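
The relative improvements quoted above follow directly from the reported WER pairs. A small worked check, where the helper name `relative_improvement` is ours and not from the paper:

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline word error rate removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer


# WER pairs as reported in the abstract (middle fusion vs. dual-use, 0 dB babble noise).
small = relative_improvement(baseline_wer=6.83, new_wer=4.41)
medium = relative_improvement(baseline_wer=9.53, new_wer=4.07)

print(f"Whisper small:  {small:.1%}")
print(f"Whisper medium: {medium:.1%}")
```

Both values round to the 35% and 57% stated in the abstract.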

Computer Vision

[CV-0] SeNeDiF-OOD: Semantic Nested Dichotomy Fusion for Out-of-Distribution Detection Methodology in Open-World Classification. A Case Study on Monument Style Classification

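[Quick read]: This paper addresses the complexity of out-of-distribution (OOD) detection in open-world environments, in particular the difficulty of identifying heterogeneous OOD data that ranges from low-level noise to semantic shift. Conventional single-stage detectors struggle to separate such diverse OOD samples. The key to the solution is a new framework based on Semantic Nested Dichotomy Fusion (SeNeDiF-OOD), which decomposes the detection task into a hierarchy of binary fusion nodes. Each layer integrates decision boundaries aligned with a specific level of semantic abstraction, filtering the different categories of OOD input accurately while keeping in-distribution performance stable.

Link: https://arxiv.org/abs/2601.18739
Authors: Ignacio Antequera-Sánchez,Juan Luis Suárez-Díaz,Rosana Montes,Francisco Herrera
Affiliations: University of Granada
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 28 pages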

Abstract:Out-of-distribution (OOD) detection is a fundamental requirement for the reliable deployment of artificial intelligence applications in open-world environments. However, addressing the heterogeneous nature of OOD data, ranging from low-level corruption to semantic shifts, remains a complex challenge that single-stage detectors often fail to resolve. To address this issue, we propose SeNeDiF-OOD, a novel methodology based on Semantic Nested Dichotomy Fusion. This framework decomposes the detection task into a hierarchical structure of binary fusion nodes, where each layer is designed to integrate decision boundaries aligned with specific levels of semantic abstraction. To validate the proposed framework, we present a comprehensive case study using MonuMAI, a real-world architectural style recognition system exposed to an open environment. This application faces a diverse range of inputs, including non-monument images, unknown architectural styles, and adversarial attacks, making it an ideal testbed for our proposal. Through extensive experimental evaluation in this domain, results demonstrate that our hierarchical fusion methodology significantly outperforms traditional baselines, effectively filtering these diverse OOD categories while preserving in-distribution performance.

[CV-1] Advances and Innovations in the Multi-Agent Robotic System (MARS) Challenge NEURIPS2025

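[Quick read]: This paper addresses the challenges of multi-agent collaboration in Embodied AI, in particular how to achieve efficient and scalable collaborative planning and control in complex task scenarios. Its key contribution is the Multi-Agent Robotic System (MARS) Challenge, held at the NeurIPS 2025 SpaVLE Workshop, which focuses on multi-agent embodied planning driven by vision-language models (VLMs) and on executing robotic manipulation policies in dynamic environments. Evaluating the submitted solutions is intended to advance the design and coordination mechanisms of embodied multi-agent systems.

Link: https://arxiv.org/abs/2601.18733
Authors: Li Kang,Heng Zhou,Xiufeng Song,Rui Li,Bruno N.Y. Chen,Ziye Wang,Ximeng Meng,Stone Tao,Yiran Qin,Xiaohong Liu,Ruimao Zhang,Lei Bai,Yilun Du,Hao Su,Philip Torr,Zhenfei Yin,Ruihao Gong,Yejun Zeng,Fengjun Zhong,Shenghao Jin,Jinyang Guo,Xianglong Liu,Xiaojun Jia,Tianqi Shan,Wenqi Ren,Simeng Qin,Jialing Yang,Xiaoyu Ma,Tianxing Chen,Zixuan Li,Zijian Cai,Yan Qin,Yusen Qin,Qiangyu Chen,Kaixuan Wang,Zhaoming Han,Yao Mu,Ping Luo,Yuanqi Yao,Haoming Song,Jan-Nico Zaech,Fabien Despinoy,Danda Pani Paudel,Luc Van Gool
Affiliations: SJTU (Shanghai Jiao Tong University); Oxford (University of Oxford); USTC (University of Science and Technology of China); Shanghai AI Lab (Shanghai Artificial Intelligence Laboratory); CMU (Carnegie Mellon University); HKU (The University of Hong Kong); Tongji (Tongji University); UC San Diego (University of California, San Diego); CUHK-SZ (The Chinese University of Hong Kong, Shenzhen); SYSU (Sun Yat-sen University); Harvard (Harvard University)
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: MARS Challenge @ NeurIPS 2025 Workshop on Space in Vision, Language, and Embodied AI. Challenge page: this https URL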

Abstract:Recent advancements in multimodal large language models and vision-language-action models have significantly driven progress in Embodied AI. As the field transitions toward more complex task scenarios, multi-agent system frameworks are becoming essential for achieving scalable, efficient, and collaborative solutions. This shift is fueled by three primary factors: increasing agent capabilities, enhancing system efficiency through task delegation, and enabling advanced human-agent interactions. To address the challenges posed by multi-agent collaboration, we propose the Multi-Agent Robotic System (MARS) Challenge, held at the NeurIPS 2025 Workshop on SpaVLE. The competition focuses on two critical areas: planning and control, where participants explore multi-agent embodied planning using vision-language models (VLMs) to coordinate tasks and policy execution to perform robotic manipulation in dynamic environments. By evaluating solutions submitted by participants, the challenge provides valuable insights into the design and coordination of embodied multi-agent systems, contributing to the future development of advanced collaborative AI systems.

[CV-2] Low Cost High Efficiency: LiDAR Place Recognition in Vineyards with Matryoshka Representation Learning

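[Quick read]: This paper addresses localization and place recognition for mobile robots in agricultural environments that lack distinctive landmarks, especially unstructured scenes such as vineyards, where conventional methods struggle to achieve efficient and robust scene understanding. The key to the solution is MinkUNeXt-VINE, a lightweight deep-learning method that combines a pre-processing strategy with a multi-loss design based on Matryoshka Representation Learning. It extracts features efficiently from low-cost, sparse LiDAR input and produces low-dimensional output representations, improving recognition accuracy and robustness while remaining real-time capable.

Link: https://arxiv.org/abs/2601.18714
Authors: Judith Vilella-Cantos,Mauro Martini,Marcello Chiaberge,Mónica Ballesta,David Valiente
Affiliations: Universitat Politècnica de València; Universitat de València
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: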

Abstract:Localization in agricultural environments is challenging due to their unstructured nature and lack of distinctive landmarks. Although agricultural settings have been studied in the context of object classification and segmentation, the place recognition task for mobile robots is not trivial in the current state of the art. In this study, we propose MinkUNeXt-VINE, a lightweight, deep-learning-based method that surpasses state-of-the-art methods in vineyard environments thanks to its pre-processing and Matryoshka Representation Learning multi-loss approach. Our method prioritizes enhanced performance with low-cost, sparse LiDAR inputs and lower-dimensionality outputs to ensure high efficiency in real-time scenarios. Additionally, we present a comprehensive ablation study of the results on various evaluation cases and two extensive long-term vineyard datasets employing different LiDAR sensors. The results demonstrate the efficiency of the trade-off output produced by this approach, as well as its robust performance on low-cost and low-resolution input data. The code is publicly available for reproduction.
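
As a rough illustration of what a Matryoshka Representation Learning multi-loss looks like, the sketch below applies the same metric-learning loss to nested prefixes of one embedding and sums the results, so that truncated descriptors stay useful. The abstract does not say which base loss MinkUNeXt-VINE uses; the triplet loss, the prefix sizes in `dims`, and all function names here are assumptions for illustration only.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (all-zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that pulls the positive closer than the negative."""
    return max(0.0, distance(anchor, positive) - distance(anchor, negative) + margin)


def matryoshka_triplet_loss(anchor, positive, negative, dims=(2, 4), margin=0.2):
    """Sum the triplet loss over nested embedding prefixes of size `dims`."""
    total = 0.0
    for d in dims:
        a, p, n = (l2_normalize(v[:d]) for v in (anchor, positive, negative))
        total += triplet_loss(a, p, n, margin)
    return total


anchor = [1.0, 0.0, 0.0, 0.0]
easy = matryoshka_triplet_loss(anchor, anchor, [0.0, 1.0, 0.0, 0.0])
hard = matryoshka_triplet_loss(anchor, anchor, anchor)
```

At inference time, a descriptor trained this way can be cut to its first `d` dimensions to trade accuracy for retrieval speed, which is the efficiency trade-off the abstract refers to.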

[CV-3] SMART: Scalable Mesh-free Aerodynamic Simulations from Raw Geometries using a Transformer-based Surrogate Model

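[Quick read]: This paper addresses the high computational cost that conventional machine-learning surrogate models incur by depending on the simulation mesh for physical simulations over complex geometries, while avoiding the larger prediction errors that purely mesh-free methods suffer from their lack of structural information. The key to the solution is SMART, a neural surrogate model that represents the geometry as a point cloud and encodes it into a shared latent space capturing structural and parametric characteristics. A physics decoder then attends across layers to the encoder's intermediate latent representations to map spatial queries to physical quantities, jointly updating the latent geometric features and the evolving physical field. Without using the simulation mesh, the model matches or exceeds the accuracy of mesh-based methods.

Link: https://arxiv.org/abs/2601.18707
Authors: Jan Hagnberger,Mathias Niepert
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments: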

Abstract:Machine learning-based surrogate models have emerged as more efficient alternatives to numerical solvers for physical simulations over complex geometries, such as car bodies. Many existing models incorporate the simulation mesh as an additional input, thereby reducing prediction errors. However, generating a simulation mesh for new geometries is computationally costly. In contrast, mesh-free methods, which do not rely on the simulation mesh, typically incur higher errors. Motivated by these considerations, we introduce SMART, a neural surrogate model that predicts physical quantities at arbitrary query locations using only a point-cloud representation of the geometry, without requiring access to the simulation mesh. The geometry and simulation parameters are encoded into a shared latent space that captures both structural and parametric characteristics of the physical field. A physics decoder then attends to the encoder’s intermediate latent representations to map spatial queries to physical quantities. Through this cross-layer interaction, the model jointly updates latent geometric features and the evolving physical field. Extensive experiments show that SMART is competitive with and often outperforms existing methods that rely on the simulation mesh as input, demonstrating its capabilities for industry-level simulations.

[CV-4] Are Video Generation Models Geographically Fair? An Attraction-Centric Evaluation of Global Visual Knowledge

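[Quick read]: This paper asks whether the visual knowledge of current text-to-video generation models is geographically fair, that is, whether these models understand and generate tourist attractions from different regions, development levels, and cultural backgrounds equally well. The key to the solution is a systematic evaluation framework, Geo-Attraction Landmark Probing (GAP), which uses complementary metrics (global structural alignment, fine-grained keypoint-based alignment, and vision-language model judgments) to separate overall video quality from geographically grounded knowledge of specific attractions, and which is validated against human evaluation. The authors also construct GEOATTRACTION-500, a benchmark of 500 globally distributed attractions, for attraction-centric quantitative evaluation. Empirical results show that advanced models such as Sora 2 exhibit relatively uniform visual knowledge across regions, with only weak dependence on attraction popularity, which highlights both the promise of text-to-video models for global applications and the need for continued evaluation.

Link: https://arxiv.org/abs/2601.18698
Authors: Xiao Liu,Jiawei Zhang
Affiliations: University of California, Davis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Work in progress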

Abstract:Recent advances in text-to-video generation have produced visually compelling results, yet it remains unclear whether these models encode geographically equitable visual knowledge. In this work, we investigate the geo-equity and geographically grounded visual knowledge of text-to-video models through an attraction-centric evaluation. We introduce Geo-Attraction Landmark Probing (GAP), a systematic framework for assessing how faithfully models synthesize tourist attractions from diverse regions, and construct GEOATTRACTION-500, a benchmark of 500 globally distributed attractions spanning varied regions and popularity levels. GAP integrates complementary metrics that disentangle overall video quality from attraction-specific knowledge, including global structural alignment, fine-grained keypoint-based alignment, and vision-language model judgments, all validated against human evaluation. Applying GAP to the state-of-the-art text-to-video model Sora 2, we find that, contrary to common assumptions of strong geographic bias, the model exhibits a relatively uniform level of geographically grounded visual knowledge across regions, development levels, and cultural groupings, with only weak dependence on attraction popularity. These results suggest that current text-to-video models express global visual knowledge more evenly than expected, highlighting both their promise for globally deployed applications and the need for continued evaluation as such systems evolve.

[CV-5] A Pragmatic VLA Foundation Model

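[Quick read]: This paper addresses the limited cross-task and cross-platform generalization and the high training cost of current Vision-Language-Action (VLA) foundation models for robotic manipulation. The key to the solution is LingBot-VLA, a high-performance VLA model trained on about 20,000 hours of real-world data from 9 popular dual-arm robot configurations and evaluated systematically on 3 robotic platforms with 130 post-training episodes per task, which confirms its strong performance and broad applicability. The study also develops an efficient codebase that reaches a throughput of 261 samples per second per GPU in an 8-GPU training setup, a 1.5x to 2.8x speedup over existing VLA-oriented codebases, which substantially lowers adaptation cost and makes the model better suited to real-world deployment.

Link: https://arxiv.org/abs/2601.18692
Authors: Wei Wu,Fan Lu,Yunnan Wang,Shuai Yang,Shi Liu,Fangjing Wang,Qian Zhu,He Sun,Yong Wang,Shuailei Ma,Yiyu Ren,Kejia Zhang,Hui Yu,Jingmei Zhao,Shuai Zhou,Zhenqi Qiu,Houlong Xiong,Ziyu Wang,Zechen Wang,Ran Cheng,Yong-Lu Li,Yongtao Huang,Xing Zhu,Yujun Shen,Kecheng Zheng
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Webpage: this https URL , Code: this https URL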

Abstract:Offering great potential in robotic manipulation, a capable Vision-Language-Action (VLA) foundation model is expected to faithfully generalize across tasks and platforms while ensuring cost efficiency (e.g., data and GPU hours required for adaptation). To this end, we develop LingBot-VLA with around 20,000 hours of real-world data from 9 popular dual-arm robot configurations. Through a systematic assessment on 3 robotic platforms, each completing 100 tasks with 130 post-training episodes per task, our model achieves clear superiority over competitors, showcasing its strong performance and broad generalizability. We have also built an efficient codebase, which delivers a throughput of 261 samples per second per GPU with an 8-GPU training setup, representing a 1.5x to 2.8x speedup (depending on the VLM base model used) over existing VLA-oriented codebases. The above features ensure that our model is well-suited for real-world deployment. To advance the field of robot learning, we provide open access to the code, base model, and benchmark data, with a focus on enabling more challenging tasks and promoting sound evaluation standards.

[CV-6] Counterfactual Explanations on Robust Perceptual Geodesics ICLR2026

[Quick read]: This paper addresses the semantic inconsistency or adversarial perturbations that arise when counterfactual explanations are generated with a poorly chosen distance metric: existing methods adopt flat or geometrically misaligned metrics in latent space, which tends to produce off-manifold artifacts, semantic drift, or adversarial collapse. The key to the solution is Perceptual Counterfactual Geodesics (PCG), which constructs geodesic paths under a perceptually Riemannian metric induced from robust vision features, yielding smooth, on-manifold, and semantically valid counterfactual transitions that align with human perception.

Link: https://arxiv.org/abs/2601.18678
Authors: Eslam Zaher,Maciej Trzaskowski,Quan Nguyen,Fred Roosta
Affiliations: ARC Training Centre for Information Resilience; University of Queensland; Institute for Molecular Bioscience; QIMR Berghofer Medical Research Institute
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Differential Geometry (math.DG)
Comments: Accepted at ICLR 2026

Abstract:Latent-space optimization methods for counterfactual explanations - framed as minimal semantic perturbations that change model predictions - inherit the ambiguity of Wachter et al.'s objective: the choice of distance metric dictates whether perturbations are meaningful or adversarial. Existing approaches adopt flat or misaligned geometries, leading to off-manifold artifacts, semantic drift, or adversarial collapse. We introduce Perceptual Counterfactual Geodesics (PCG), a method that constructs counterfactuals by tracing geodesics under a perceptually Riemannian metric induced from robust vision features. This geometry aligns with human perception and penalizes brittle directions, enabling smooth, on-manifold, semantically valid transitions. Experiments on three vision datasets show that PCG outperforms baselines and reveals failure modes hidden under standard metrics.

[CV-7] Splat-Portrait: Generalizing Talking Heads with Gaussian Splatting

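[Quick read]: This paper addresses the inaccurate reconstructions in 3D talking head generation that result from relying on domain-specific heuristics (such as warping-based facial motion representation priors) and that undermine the realism of the animation. The key to the solution is Splat-Portrait, a Gaussian-Splatting-based method that automatically disentangles a single portrait image into a static 3D reconstruction (represented as static Gaussian Splatting) and a whole-image 2D background, then generates natural lip motion from the input audio alone, without any motion-driven priors. Training relies only on 2D reconstruction and score-distillation losses, with no 3D supervision or landmark annotations, and markedly improves the visual quality of talking head generation and novel view synthesis.

Link: https://arxiv.org/abs/2601.18633
Authors: Tong Shi,Melonie de Almeida,Daniela Ivanova,Nicolas Pugeault,Paul Henderson
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: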

Abstract:Talking Head Generation aims at synthesizing natural-looking talking videos from speech and a single portrait image. Previous 3D talking head generation methods have relied on domain-specific heuristics such as warping-based facial motion representation priors to animate talking motions, yet still produce inaccurate 3D avatar reconstructions, thus undermining the realism of generated animations. We introduce Splat-Portrait, a Gaussian-splatting-based method that addresses the challenges of 3D head reconstruction and lip motion synthesis. Our approach automatically learns to disentangle a single portrait image into a static 3D reconstruction represented as static Gaussian Splatting, and a predicted whole-image 2D background. It then generates natural lip motion conditioned on input audio, without any motion-driven priors. Training is driven purely by 2D reconstruction and score-distillation losses, without 3D supervision or landmarks. Experimental results demonstrate that Splat-Portrait exhibits superior performance on talking head generation and novel view synthesis, achieving better visual quality compared to previous works. Our project code and supplementary documents are publicly available at this https URL.

[CV-8] CONQUER: Context-Aware Representation with Query Enhancement for Text-Based Person Search ICASSP2026

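[Quick read]: This paper targets the cross-modal discrepancies and ambiguous user queries that hinder Text-Based Person Search (TBPS). The core challenge is to improve the alignment between text and image modalities and to make the model robust to incomplete or ambiguous queries. The key innovation is CONQUER, a two-stage framework. During training, multi-granularity encoding, complementary pair mining, and context-guided optimal matching based on Optimal Transport build a more robust cross-modal embedding space. At inference, a plug-and-play query enhancement module uses anchor selection and attribute-driven enrichment to refine vague queries adaptively without retraining the backbone, which improves retrieval performance notably, especially in cross-domain and incomplete-query scenarios.

Link: https://arxiv.org/abs/2601.18625
Authors: Zequn Xie
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICASSP 2026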

Abstract:Text-Based Person Search (TBPS) aims to retrieve pedestrian images from large galleries using natural language descriptions. This task, essential for public safety applications, is hindered by cross-modal discrepancies and ambiguous user queries. We introduce CONQUER, a two-stage framework designed to address these challenges by enhancing cross-modal alignment during training and adaptively refining queries at inference. During training, CONQUER employs multi-granularity encoding, complementary pair mining, and context-guided optimal matching based on Optimal Transport to learn robust embeddings. At inference, a plug-and-play query enhancement module refines vague or incomplete queries via anchor selection and attribute-driven enrichment, without requiring retraining of the backbone. Extensive experiments on CUHK-PEDES, ICFG-PEDES, and RSTPReid demonstrate that CONQUER consistently outperforms strong baselines in both Rank-1 accuracy and mAP, yielding notable improvements in cross-domain and incomplete-query scenarios. These results highlight CONQUER as a practical and effective solution for real-world TBPS deployment. Source code is available at this https URL.
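
For readers unfamiliar with Optimal Transport matching, the sketch below runs plain Sinkhorn iterations on a small cost matrix to obtain a soft assignment between text tokens and image regions. This is the generic entropy-regularized algorithm, not CONQUER's context-guided variant; the uniform marginals, the `reg` value, and the toy cost matrix are assumptions for illustration.

```python
import math


def sinkhorn(cost, reg=0.1, iters=200):
    """Entropy-regularized optimal transport plan with uniform marginals."""
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: low-cost pairs receive high affinity.
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    row_mass, col_mass = [1.0 / n] * n, [1.0 / m] * m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy cost: text token 0 matches image region 0, token 1 matches region 1.
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(cost)
```

The resulting plan places almost all of its mass on the diagonal, meaning each token is softly matched to its cheapest region while every token and region still receives equal total mass.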

[CV-9] Adaptive Domain Shift in Diffusion Models for Cross-Modality Image Translation ICLR2026

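[Quick read]: This paper addresses the brittleness and inefficiency that fixed-schedule domain transfer causes in cross-modality image translation: standard diffusion methods rely on a global linear mapping, which forces the sampler through high-cost off-manifold regions during reverse sampling, causing semantic drift and increasing the correction burden. The key to the solution is to embed domain-shift dynamics directly into the generative process. At every reverse diffusion step the model predicts a spatially varying mixing field and injects an explicit, target-consistent restoration term into the drift. This in-step guidance keeps large updates on-manifold and shifts the model's role from global alignment to local residual correction. Using a continuous-time formulation with an exact solution form, the authors further derive a first-order sampler that preserves marginal consistency. The method improves structural fidelity and semantic consistency on medical imaging, remote sensing, and electroluminescence semantic mapping tasks while converging in fewer steps.

Link: https://arxiv.org/abs/2601.18623
Authors: Zihao Wang,Yuzhou Chen,Shaogang Ren
Affiliations: Laplace Lab at University of Tennessee; University of California Riverside; University of Tennessee at Chattanooga
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Paper accepted as a conference paper at ICLR 2026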

Abstract:Cross-modal image translation remains brittle and inefficient. Standard diffusion approaches often rely on a single, global linear transfer between domains. We find that this shortcut forces the sampler to traverse off-manifold, high-cost regions, inflating the correction burden and inviting semantic drift. We refer to this shared failure mode as fixed-schedule domain transfer. In this paper, we embed domain-shift dynamics directly into the generative process. Our model predicts a spatially varying mixing field at every reverse step and injects an explicit, target-consistent restoration term into the drift. This in-step guidance keeps large updates on-manifold and shifts the model’s role from global alignment to local residual correction. We provide a continuous-time formulation with an exact solution form and derive a practical first-order sampler that preserves marginal consistency. Empirically, across translation tasks in medical imaging, remote sensing, and electroluminescence semantic mapping, our framework improves structural fidelity and semantic consistency while converging in fewer denoising steps.

[CV-10] Scale-Aware Self-Supervised Learning for Segmentation of Small and Sparse Structures

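[Quick read]: This paper addresses the poor performance of self-supervised learning (SSL) on small, sparse, or locally irregular targets in medical and geoscience image segmentation. Existing SSL methods are typically tuned to large, homogeneous regions and degrade markedly on fine-grained structures. The key to the solution is a scale-aware SSL adaptation that adds small-window cropping to the augmentation pipeline during pretraining, so that the model focuses on fine-scale structures. Experiments show accuracy gains of up to 13% for seismic fault segmentation and 5% for cell delineation; the benefit is limited to small-scale targets, which underscores the need to match SSL design to the scale and sparsity of the target objects.

Link: https://arxiv.org/abs/2601.18619
Authors: Jorge Quesada,Ghassan AlRegib
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: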

Abstract:Self-supervised learning (SSL) has emerged as a powerful strategy for representation learning under limited annotation regimes, yet its effectiveness remains highly sensitive to many factors, especially the nature of the target task. In segmentation, existing pipelines are typically tuned to large, homogeneous regions, but their performance drops when objects are small, sparse, or locally irregular. In this work, we propose a scale-aware SSL adaptation that integrates small-window cropping into the augmentation pipeline, zooming in on fine-scale structures during pretraining. We evaluate this approach across two domains with markedly different data modalities: seismic imaging, where the goal is to segment sparse faults, and neuroimaging, where the task is to delineate small cellular structures. In both settings, our method yields consistent improvements over standard and state-of-the-art baselines under label constraints, improving accuracy by up to 13% for fault segmentation and 5% for cell delineation. In contrast, large-scale features such as seismic facies or tissue regions see little benefit, underscoring that the value of SSL depends critically on the scale of the target objects. Our findings highlight the need to align SSL design with object size and sparsity, offering a general principle for building more effective representation learning pipelines across scientific imaging domains.
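
A minimal sketch of what adding small-window crops to an SSL augmentation pipeline might look like. The window sizes, the number of small views, and the function names are illustrative assumptions; the abstract does not give the authors' settings, and a real pipeline would also resize the small crops back to the network input size to achieve the zoom-in effect.

```python
import random


def random_crop(image, size, rng):
    """Cut a square `size` x `size` window at a random position."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]


def scale_aware_views(image, small=4, large=8, n_small=2, seed=0):
    """Return one large context view plus several small fine-scale views."""
    rng = random.Random(seed)
    views = [random_crop(image, large, rng)]
    views += [random_crop(image, small, rng) for _ in range(n_small)]
    return views


# Toy 16 x 16 "image" whose pixel value encodes its position.
image = [[row * 16 + col for col in range(16)] for row in range(16)]
views = scale_aware_views(image)
```

The small windows make thin structures such as faults or cell boundaries occupy a larger share of each training view, which is the intuition behind the scale-aware adaptation.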

[CV-11] Multimodal Privacy-Preserving Entity Resolution with Fully Homomorphic Encryption ICASSP’26

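[Quick read]: This paper addresses the long-standing problem of entity resolution in high-compliance sectors, namely how to reconcile identities securely when the data is highly heterogeneous (for example, syntactic variations in personal identifiers). The key to the solution is a novel multimodal framework that operates on the voluminous datasets typical of government and financial institutions and tackles the three challenges of data volume, matching fidelity, and privacy. By ensuring that the plaintext of personally identifiable information (PII) remains computationally inaccessible throughout the matching lifecycle, the framework lets institutions meet strict regulatory requirements with cryptographic assurance of client confidentiality, while achieving a low equal error rate (EER) and computational tractability at scale.

Link: https://arxiv.org/abs/2601.18612
Authors: Susim Roy,Nalini Ratha
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 3 figures, IEEE ICASSP’26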

Abstract:The canonical challenge of entity resolution within high-compliance sectors, where secure identity reconciliation is frequently confounded by significant data heterogeneity, including syntactic variations in personal identifiers, is a longstanding and complex problem. To this end, we introduce a novel multimodal framework operating with the voluminous data sets typical of government and financial institutions. Specifically, our methodology is designed to address the tripartite challenge of data volume, matching fidelity, and privacy. Consequently, the underlying plaintext of personally identifiable information remains computationally inaccessible throughout the matching lifecycle, empowering institutions to rigorously satisfy stringent regulatory mandates with cryptographic assurances of client confidentiality while achieving a demonstrably low equal error rate and maintaining computational tractability at scale.

[CV-12] EFSI-DETR: Efficient Frequency-Semantic Integration for Real-Time Small Object Detection in UAV Imagery

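[Quick read]: This paper addresses real-time small object detection in unmanned aerial vehicle (UAV) imagery, where the core challenges are limited feature representation and inefficient multi-scale fusion. Existing methods underuse frequency-domain information and rely on static convolutions, which makes it hard to obtain rich feature representations and to exploit deep semantic information. The key to the solution is the EFSI-DETR framework, whose core innovations are: (1) a Dynamic Frequency-Spatial Unified Synergy Network (DyFusNet) that jointly exploits frequency and spatial cues for robust multi-scale feature fusion; and (2) an Efficient Semantic Feature Concentrator (ESFC) that extracts deep semantic features at very low computational cost. A Fine-grained Feature Retention (FFR) strategy additionally preserves spatially rich shallow features during fusion to strengthen small-object detail. The approach achieves state-of-the-art performance on the VisDrone and CODrone datasets at real-time inference speed (188 FPS on a single RTX 4090).

Link: https://arxiv.org/abs/2601.18597
Authors: Yu Xia,Chang Liu,Tianqi Xiang,Zhigang Tu
Affiliations: State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University; Wuhan University Shenzhen Research Institute; School of Computer Science, Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: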

Abstract:Real-time small object detection in Unmanned Aerial Vehicle (UAV) imagery remains challenging due to limited feature representation and ineffective multi-scale fusion. Existing methods underutilize frequency information and rely on static convolutional operations, which constrain the capacity to obtain rich feature representations and hinder the effective exploitation of deep semantic features. To address these issues, we propose EFSI-DETR, a novel detection framework that integrates efficient semantic feature enhancement with dynamic frequency-spatial guidance. EFSI-DETR comprises two main components: (1) a Dynamic Frequency-Spatial Unified Synergy Network (DyFusNet) that jointly exploits frequency and spatial cues for robust multi-scale feature fusion, (2) an Efficient Semantic Feature Concentrator (ESFC) that enables deep semantic extraction with minimal computational cost. Furthermore, a Fine-grained Feature Retention (FFR) strategy is adopted to incorporate spatially rich shallow features during fusion to preserve fine-grained details, crucial for small object detection in UAV imagery. Extensive experiments on VisDrone and CODrone benchmarks demonstrate that our EFSI-DETR achieves state-of-the-art performance with real-time efficiency, yielding improvements of 1.6% and 5.8% in AP and AP_s on VisDrone, while obtaining 188 FPS inference speed on a single RTX 4090 GPU.

[CV-13] AGSP-DSA: An Adaptive Graph Signal Processing Framework for Robust Multimodal Fusion with Dynamic Semantic Alignment

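[Quick read]: This paper addresses the robustness of heterogeneous multimodal data fusion, in particular the performance drop when text, audio, or image modalities are missing or noisy. The key to the solution is the Adaptive Graph Signal Processing with Dynamic Semantic Alignment (AGSP-DSA) framework, whose core components are: a dual-graph learning mechanism that captures both intra-modal and inter-modal relations; spectral graph filtering to boost informative signals; node embedding with multi-scale Graph Convolutional Networks (GCNs); and a semantic-aware attention mechanism that lets each modality adjust its contribution dynamically according to contextual relevance. The method achieves state-of-the-art performance on the CMU-MOSEI, AVE, and MM-IMDB benchmarks, supporting its effectiveness for sentiment analysis, event recognition, and multimedia classification.

Link: https://arxiv.org/abs/2601.18589
Authors: KV Karthikeya,Ashok Kumar Das,Shantanu Pal,Vivekananda Bhat K,Arun Sekar Rajasekaran
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: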

Abstract:In this paper, we introduce an Adaptive Graph Signal Processing with Dynamic Semantic Alignment (AGSP-DSA) framework to perform robust multimodal data fusion over heterogeneous sources, including text, audio, and images. The proposed approach uses a dual-graph construction to learn both intra-modal and inter-modal relations, spectral graph filtering to boost the informative signals, and effective node embedding with Multi-scale Graph Convolutional Networks (GCNs). A semantic-aware attention mechanism lets each modality contribute dynamically according to its contextual relevance. The experimental outcomes on three benchmark datasets, including CMU-MOSEI, AVE, and MM-IMDB, show that AGSP-DSA achieves state-of-the-art performance. More precisely, it achieves 95.3% accuracy, 0.936 F1-score, and 0.924 mAP on CMU-MOSEI, improving on MM-GNN by 2.6 percent in accuracy. It gets 93.4% accuracy and 0.911 F1-score on AVE and 91.8% accuracy and 0.886 F1-score on MM-IMDB, which demonstrate good generalization and robustness in the missing modality setting. These findings verify the efficiency of AGSP-DSA in promoting multimodal learning in sentiment analysis, event recognition and multimedia classification.

[CV-14] GimmBO: Interactive Generative Image Model Merging via Bayesian Optimization

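[Quick read]: This paper addresses the difficulty of interactively exploring adapter merging for diffusion models, that is, how to optimize the merging weights of multiple adapters efficiently in a high-dimensional weight space so that the generated images match user preferences. The current practice of manual slider tuning scales poorly and is inefficient, especially with 20 to 30 adapters. The key to the solution is the GimmBO framework, which builds on Preferential Bayesian Optimization (PBO) and introduces a two-stage Bayesian Optimization (BO) backend that accounts for the weight sparsity and range constraints seen in real-world use. This improves sampling efficiency and convergence, and outperforms BO and line-search baselines in both simulated-user experiments and a real user study.

Link: https://arxiv.org/abs/2601.18585
Authors: Chenxi Liu,Selena Ling,Alec Jacobson
Affiliations: University of Toronto
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: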

点击查看摘要

Abstract:Fine-tuning-based adaptation is widely used to customize diffusion-based image generation, leading to large collections of community-created adapters that capture diverse subjects and styles. Adapters derived from the same base model can be merged with weights, enabling the synthesis of new visual results within a vast and continuous design space. To explore this space, current workflows rely on manual slider-based tuning, an approach that scales poorly and makes weight selection difficult, even when the candidate set is limited to 20-30 adapters. We propose GimmBO to support interactive exploration of adapter merging for image generation through Preferential Bayesian Optimization (PBO). Motivated by observations from real-world usage, including sparsity and constrained weight ranges, we introduce a two-stage BO backend that improves sampling efficiency and convergence in high-dimensional spaces. We evaluate our approach with simulated users and a user study, demonstrating improved convergence, high success rates, and consistent gains over BO and line-search baselines, and further show the flexibility of the framework through several extensions.

[CV-15] Self-Refining Video Sampling

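[Quick read]: This paper addresses the weakness of current video generation models in modeling complex physical dynamics, which leaves generated videos lacking physical realism and motion coherence. The key to the solution is self-refining video sampling, which uses a pre-trained video generator as its own self-refiner: by interpreting the generator as a denoising autoencoder, it adds an iterative inner-loop refinement at inference time and improves video quality without any external verifier or additional training. An uncertainty-aware refinement strategy selectively refines regions based on self-consistency, avoiding artifacts caused by over-refinement and yielding significant improvements in motion coherence and physics alignment.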

Link: https://arxiv.org/abs/2601.18577
Authors: Sangwon Jang,Taekyung Ki,Jaehyeong Jo,Saining Xie,Jaehong Yoon,Sung Ju Hwang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:Modern video generators still struggle with complex physical dynamics, often falling short of physical realism. Existing approaches address this using external verifiers or additional training on augmented data, which is computationally expensive and still limited in capturing fine-grained motion. In this work, we present self-refining video sampling, a simple method that uses a pre-trained video generator trained on large-scale datasets as its own self-refiner. By interpreting the generator as a denoising autoencoder, we enable iterative inner-loop refinement at inference time without any external verifier or additional training. We further introduce an uncertainty-aware refinement strategy that selectively refines regions based on self-consistency, which prevents artifacts caused by over-refinement. Experiments on state-of-the-art video generators demonstrate significant improvements in motion coherence and physics alignment, achieving over 70% human preference compared to the default sampler and guidance-based sampler.

[CV-16] An Unsupervised Tensor-Based Domain Alignment

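[Quick read]: This paper addresses the drop in generalization performance that domain adaptation suffers when source- and target-domain data distributions differ. The key to the solution is a tensor-based domain alignment (DA) algorithm that aligns source and target tensors within an invariant subspace by iteratively optimizing alignment matrices under a constraint defined on the oblique manifold, which is more flexible and adaptable than the traditional Stiefel manifold. Regularization terms that preserve the variance of the source and target tensors improve robustness. The method generalizes existing tensor-based domain alignment methods as special cases, and the experiments report faster conversion and higher classification accuracy.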

Link: https://arxiv.org/abs/2601.18564
Authors: Chong Hyun Lee,Kibae Lee,Hyun Hee Yim
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: 5 pages, 5 figures

Abstract:We propose a tensor-based domain alignment (DA) algorithm designed to align source and target tensors within an invariant subspace through the use of alignment matrices. These matrices, along with the subspace, undergo iterative optimization under a constraint on the oblique manifold, which offers greater flexibility and adaptability compared to the traditional Stiefel manifold. Moreover, regularization terms defined to preserve the variance of both source and target tensors ensure robust performance. Our framework is versatile, effectively generalizing existing tensor-based DA methods as special cases. Through extensive experiments, we demonstrate that our approach not only enhances DA conversion speed but also significantly boosts classification accuracy. This positions our method as superior to current state-of-the-art techniques, making it a preferable choice for complex domain adaptation tasks.
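
As a rough illustration of the constraint difference mentioned in the abstract: the oblique manifold only asks that each column of a matrix have unit norm, whereas the Stiefel manifold also asks that the columns be mutually orthogonal. The sketch below is not taken from the paper; the helper names are hypothetical and matrices are plain lists of rows. It shows a matrix that satisfies the first condition but not the second.

```python
import math

def normalize_columns(matrix):
    """Rescale every column to unit norm (assumes no all-zero column)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    norms = [math.sqrt(sum(matrix[r][c] ** 2 for r in range(n_rows)))
             for c in range(n_cols)]
    return [[matrix[r][c] / norms[c] for c in range(n_cols)]
            for r in range(n_rows)]

def gram(matrix):
    """Return A^T A for a matrix stored as a list of rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [[sum(matrix[r][i] * matrix[r][j] for r in range(n_rows))
             for j in range(n_cols)] for i in range(n_cols)]

def is_oblique(matrix, tol=1e-9):
    """Unit-norm columns: the diagonal of A^T A is all ones."""
    g = gram(matrix)
    return all(abs(g[i][i] - 1.0) < tol for i in range(len(g)))

def is_stiefel(matrix, tol=1e-9):
    """Orthonormal columns: A^T A is the identity."""
    g = gram(matrix)
    n = len(g)
    return all(abs(g[i][j] - (1.0 if i == j else 0.0)) < tol
               for i in range(n) for j in range(n))

# Two unit-norm columns that are not orthogonal to each other.
A = normalize_columns([[1.0, 1.0], [0.0, 1.0], [0.0, 0.0]])
print(is_oblique(A), is_stiefel(A))
```

Every matrix with orthonormal columns also has unit-norm columns, so the oblique constraint is the looser of the two, which is one way to read the "greater flexibility" claimed in the abstract.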

[CV-17] AI-enabled Satellite Edge Computing: A Single-Pixel Feature based Shallow Classification Model for Hyperspectral Imaging

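[Quick read]: This paper addresses response delays in applications such as disaster monitoring and emergency mapping caused by the limited downlink transmission rate of hyperspectral imaging satellites, together with the image degradation (bad pixels, misaligned pixels, and mixed noise) that sensor failures and scan pattern errors cause during onboard processing. The key to the solution is a lightweight, non-deep-learning edge computing paradigm combined with a few-shot learning strategy for autonomous onboard classification, built around a two-stage pixel-wise label propagation scheme. In the first stage, initial labels are obtained by propagating selected anchor labels through an anchor-pixel affinity matrix, and a top-k pruned sparse graph is then generated directly from pixel-level similarities. In the second stage, a closed-form solution derived from the sparse graph replaces iterative computation, reducing computational cost. A rank-constraint-based graph clustering algorithm determines the anchor labels, enabling efficient and robust hyperspectral image classification under tight resource constraints.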

Link: https://arxiv.org/abs/2601.18560
Authors: Li Fang,Tianyu Li,Yanghong Lin,Shudong Zhou,Wei Yao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:As an important component of the Earth observation system, hyperspectral imaging satellites provide high-fidelity and enriched information for the formulation of related policies owing to their powerful spectral measurement capabilities. However, the transmission speed of the satellite downlink has become a major bottleneck in certain applications, such as disaster monitoring and emergency mapping, which demand a fast response ability. We propose an efficient AI-enabled Satellite Edge Computing paradigm for hyperspectral image classification, enabling satellites to attain autonomous decision-making. To accommodate the resource constraints of satellite platforms, the proposed method adopts a lightweight, non-deep learning framework integrated with a few-shot learning strategy. Moreover, onboard processing on satellites could be faced with sensor failure and scan pattern errors, which result in degraded image quality with bad/misaligned pixels and mixed noise. To address these challenges, we develop a novel two-stage pixel-wise label propagation scheme that utilizes only intrinsic spectral features at the single pixel level without the necessity to consider spatial structural information as required by deep neural networks. In the first stage, initial pixel labels are obtained by propagating selected anchor labels through the constructed anchor-pixel affinity matrix. Subsequently, a top-k pruned sparse graph is generated by directly computing pixel-level similarities. In the second stage, a closed-form solution derived from the sparse graph is employed to replace iterative computations. Furthermore, we developed a rank constraint-based graph clustering algorithm to determine the anchor labels.
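
To make the first-stage idea more concrete, here is a minimal sketch of propagating anchor labels through an anchor-pixel affinity and then building a top-k pruned similarity graph. It is only an illustration under stated assumptions: the Gaussian affinity, the `sigma` value, the toy three-band spectra, and all function names are hypothetical and not taken from the paper, and the second-stage closed-form solution is not reproduced.

```python
import math

def gaussian_affinity(a, b, sigma=0.2):
    """Similarity of two spectra (assumed Gaussian kernel on squared distance)."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def propagate_from_anchors(pixels, anchors, anchor_labels, sigma=0.2):
    """Give each pixel the label with the largest summed anchor affinity."""
    labels = []
    for p in pixels:
        scores = {}
        for a, lab in zip(anchors, anchor_labels):
            scores[lab] = scores.get(lab, 0.0) + gaussian_affinity(p, a, sigma)
        labels.append(max(scores, key=scores.get))
    return labels

def top_k_graph(pixels, k, sigma=0.2):
    """Keep only the k strongest neighbours of every pixel (sparse dict graph)."""
    graph = {}
    for i, p in enumerate(pixels):
        sims = [(gaussian_affinity(p, q, sigma), j)
                for j, q in enumerate(pixels) if j != i]
        sims.sort(reverse=True)
        graph[i] = {j: s for s, j in sims[:k]}
    return graph

# Toy single-pixel spectra with three bands each.
pixels = [[0.10, 0.20, 0.10], [0.12, 0.22, 0.09],
          [0.90, 0.80, 0.85], [0.88, 0.79, 0.90]]
anchors = [[0.10, 0.20, 0.10], [0.90, 0.80, 0.85]]
anchor_labels = ["water", "soil"]

initial = propagate_from_anchors(pixels, anchors, anchor_labels)
graph = top_k_graph(pixels, k=1)
print(initial)
print({i: sorted(nbrs) for i, nbrs in graph.items()})
```

Only per-pixel spectra are used here, with no spatial neighbourhood, which mirrors the single-pixel feature setting described in the abstract.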

[CV-18] Generative Diffusion Augmentation with Quantum-Enhanced Discrimination for Medical Image Diagnosis

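[Quick read]: This paper addresses model bias caused by severe class imbalance in medical image classification: on real-world medical datasets where positive samples far outnumber negative ones, conventional deep learning models often show low recall on minority classes, creating a risk of clinical misdiagnosis. The key to the solution is SDA-QEC, a framework that combines Simplified Diffusion Augmentation (SDA) with Quantum-Enhanced Classification (QEC). A lightweight diffusion generator first synthesizes high-quality samples for minority classes to rebalance the training distribution; a quantum feature layer embedded in the MobileNetV2 architecture then performs high-dimensional feature mapping in Hilbert space to strengthen discrimination of subtle differences. On coronary angiography image classification, the method reaches 98.33% accuracy and F1-score with 98.33% sensitivity and 98.33% specificity, outperforming several classical CNN baselines and supporting the feasibility of combining generative augmentation with quantum-enhanced modeling in small-sample, highly imbalanced, high-risk diagnostic settings.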

Link: https://arxiv.org/abs/2601.18556
Authors: Jingsong Xia,Siqi Wang
Affiliations: Nanjing Medical University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:In biomedical engineering, artificial intelligence has become a pivotal tool for enhancing medical diagnostics, particularly in medical image classification tasks such as detecting pneumonia from chest X-rays and breast cancer screening. However, real-world medical datasets frequently exhibit severe class imbalance, where positive samples substantially outnumber negative samples, leading to biased models with low recall rates for minority classes. This imbalance not only compromises diagnostic accuracy but also poses clinical misdiagnosis risks. To address this challenge, we propose SDA-QEC (Simplified Diffusion Augmentation with Quantum-Enhanced Classification), an innovative framework that integrates simplified diffusion-based data augmentation with quantum-enhanced feature discrimination. Our approach employs a lightweight diffusion augmentor to generate high-quality synthetic samples for minority classes, rebalancing the training distribution. Subsequently, a quantum feature layer embedded within MobileNetV2 architecture enhances the model’s discriminative capability through high-dimensional feature mapping in Hilbert space. Comprehensive experiments on coronary angiography image classification demonstrate that SDA-QEC achieves 98.33% accuracy, 98.78% AUC, and 98.33% F1-score, significantly outperforming classical baselines including ResNet18, MobileNetV2, DenseNet121, and VGG16. Notably, our framework simultaneously attains 98.33% sensitivity and 98.33% specificity, achieving a balanced performance critical for clinical deployment. The proposed method validates the feasibility of integrating generative augmentation with quantum-enhanced modeling in real-world medical imaging tasks, offering a novel research pathway for developing highly reliable medical AI systems in small-sample, highly imbalanced, and high-risk diagnostic scenarios.

[CV-19] Automated Landmark Detection for assessing hip conditions: A Cross-Modality Validation of MRI versus X-ray

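[Quick read]: This paper addresses the fact that clinical screening for FemoroAcetabular Impingement (FAI) relies on angles measured on X-rays, whereas assessing the height and span of the impingement area requires 3D information from MRI. The two modalities are used separately, which hinders multi-dimensional quantitative analysis. The key to the solution is to validate cross-modality clinical equivalence with standard heatmap regression architectures on a paired MRI/X-ray dataset (89 patients), showing that coronal views of 3D MRI volumes achieve landmark localisation and diagnostic accuracy equivalent to X-ray for cam-type impingement. This supports integrating automated FAI assessment into routine MRI workflows and lays the groundwork for volumetric analysis through additional landmarks.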

Link: https://arxiv.org/abs/2601.18555
Authors: Roberto Di Via,Vito Paolo Pastore,Francesca Odone,Siôn Glyn-Jones,Irina Voiculescu
Affiliations: University of Genoa; University of Oxford
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at International Symposium on Biomedical Imaging (ISBI 2026)

Abstract:Many clinical screening decisions are based on angle measurements. In particular, FemoroAcetabular Impingement (FAI) screening relies on angles traditionally measured on X-rays. However, assessing the height and span of the impingement area also requires a 3D view through an MRI scan. The two modalities inform the surgeon on different aspects of the condition. In this work, we conduct a matched-cohort validation study (89 patients, paired MRI/X-ray) using standard heatmap regression architectures to assess cross-modality clinical equivalence. Given that landmark detection has been proven effective on X-rays, we show that MRI also achieves equivalent localisation and diagnostic accuracy for cam-type impingement. Our method demonstrates clinical feasibility for FAI assessment in coronal views of 3D MRI volumes, opening the possibility for volumetric analysis through placing further landmarks. These results support integrating automated FAI assessment into routine MRI workflows. Code is released at this https URL
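
Once landmarks have been detected, turning them into an angle is simple plane geometry. The sketch below computes the angle at a vertex landmark between two rays towards two other landmarks. The function name and coordinates are hypothetical, and which landmarks define a given clinical angle is not specified here; it only illustrates the kind of measurement the abstract refers to.

```python
import math

def angle_at_vertex(vertex, p1, p2):
    """Unsigned angle in degrees between rays vertex->p1 and vertex->p2 (2D)."""
    v1 = (p1[0] - vertex[0], p1[1] - vertex[1])
    v2 = (p2[0] - vertex[0], p2[1] - vertex[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    # atan2 is numerically steadier than acos for nearly parallel rays.
    return math.degrees(math.atan2(abs(cross), dot))

# Hypothetical landmark coordinates in pixels.
centre = (0.0, 0.0)
landmark_a = (1.0, 0.0)
landmark_b = (1.0, 1.0)
angle = angle_at_vertex(centre, landmark_a, landmark_b)
print(round(angle, 2))
```

Because the angle depends only on landmark positions, a small localisation error in either modality translates directly into an angular error, which is why equivalent landmark accuracy on MRI and X-ray matters for the diagnostic comparison.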

[CV-20] REMAC: Reference-Based Martian Asymmetrical Image Compression

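[Quick read]: This paper addresses two key issues in Martian image compression: existing learned compression methods ignore the limited computational resources available on Mars, and they do not exploit the strong similarities across Martian images to improve compression performance. The core of the solution is a reference-based Martian asymmetrical image compression approach (REMAC), with three main contributions: 1) computational complexity is shifted from the resource-constrained encoder to the resource-rich decoder, with a reference-guided entropy module and a ref-decoder that use information from reference images to exploit inter-image similarities and reduce redundant operations at the encoder; 2) the ref-decoder adopts a deep, multi-scale architecture to model long-range spatial dependencies and exploit intra-image similarities; 3) a latent feature recycling mechanism further eases the extreme computational constraints on Mars. Experiments show that REMAC reduces encoder complexity by 43.51% while achieving a BD-PSNR gain of 0.2664 dB.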

Link: https://arxiv.org/abs/2601.18547
Authors: Qing Ding,Mai Xu,Shengxi Li,Xin Deng,Xin Zou
Affiliations: Beihang University; Beijing Institute of Spacecraft System Engineering
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted for publication in IEEE Transactions on Geoscience and Remote Sensing (TGRS). 2025 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. 18 pages, 20 figures

Abstract:To expedite space exploration on Mars, it is indispensable to develop an efficient Martian image compression method for transmitting images through the constrained Mars-to-Earth communication channel. Although the existing learned compression methods have achieved promising results for natural images from Earth, there remain two critical issues that hinder their effectiveness for Martian image compression: 1) They overlook the highly-limited computational resources on Mars; 2) They do not utilize the strong inter-image similarities across Martian images to advance image compression performance. Motivated by our empirical analysis of the strong intra- and inter-image similarities from the perspective of texture, color, and semantics, we propose a reference-based Martian asymmetrical image compression (REMAC) approach, which shifts computational complexity from the encoder to the resource-rich decoder and simultaneously improves compression performance. To leverage inter-image similarities, we propose a reference-guided entropy module and a ref-decoder that utilize useful information from reference images, reducing redundant operations at the encoder and achieving superior compression performance. To exploit intra-image similarities, the ref-decoder adopts a deep, multi-scale architecture with enlarged receptive field size to model long-range spatial dependencies. Additionally, we develop a latent feature recycling mechanism to further alleviate the extreme computational constraints on Mars. Experimental results show that REMAC reduces encoder complexity by 43.51% compared to the state-of-the-art method, while achieving a BD-PSNR gain of 0.2664 dB.

[CV-21] GenAgent: Scaling Text-to-Image Generation via Agentic Multimodal Reasoning

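[Quick read]: This paper addresses the high training cost and understanding-generation trade-off of unified visual understanding and generation models, as well as the inability of existing modular systems, constrained by static pipelines, to support autonomous multi-turn interaction and iterative refinement. The key to the solution is the GenAgent framework, which decouples the two capabilities through an agent: understanding stays in the multimodal model, while image generation models are treated as invokable tools. The agent produces multimodal chains-of-thought covering reasoning, tool invocation, judgment, and reflection, and is trained in two stages (supervised fine-tuning followed by end-to-end reinforcement learning). This boosts the base generator (FLUX.1-dev) on GenEval++ and WISE, and the framework shows cross-tool generalization, test-time scaling, and task-adaptive reasoning.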

Link: https://arxiv.org/abs/2601.18543
Authors: Kaixun Jiang,Yuzheng Wang,Junjie Zhou,Pandeng Li,Zhihang Liu,Chen-Wei Xie,Zhaoyu Chen,Yun Zheng,Wenqiang Zhang
Affiliations: Fudan University; Tongyi Lab; Nanjing University; University of Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce GenAgent, unifying visual understanding and generation through an agentic multimodal model. Unlike unified models that face expensive training costs and understanding-generation trade-offs, GenAgent decouples these capabilities through an agentic framework: understanding is handled by the multimodal model itself, while generation is achieved by treating image generation models as invokable tools. Crucially, unlike existing modular systems constrained by static pipelines, this design enables autonomous multi-turn interactions where the agent generates multimodal chains-of-thought encompassing reasoning, tool invocation, judgment, and reflection to iteratively refine outputs. We employ a two-stage training strategy: first, cold-start with supervised fine-tuning on high-quality tool invocation and reflection data to bootstrap agent behaviors; second, end-to-end agentic reinforcement learning combining pointwise rewards (final image quality) and pairwise rewards (reflection accuracy), with trajectory resampling for enhanced multi-turn exploration. GenAgent significantly boosts base generator (FLUX.1-dev) performance on GenEval++ (+23.6%) and WISE (+14%). Beyond performance gains, our framework demonstrates three key properties: 1) cross-tool generalization to generators with varying capabilities, 2) test-time scaling with consistent improvements across interaction rounds, and 3) task-adaptive reasoning that automatically adjusts to different tasks. Our code will be available at this https URL.

[CV-22] From Cold Start to Active Learning: Embedding-Based Scan Selection for Medical Image Segmentation

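[Quick read]: This paper addresses the annotation bottleneck in medical image segmentation, where manual labeling is slow and requires expertise. The core of the solution is a cold-start sampling strategy that combines foundation-model embeddings with clustering: the number of clusters is selected automatically and samples are drawn from each cluster proportionally, producing a diverse and representative initial training set. This is followed by an uncertainty-based active learning (AL) framework that integrates spatial diversity to guide sample selection, improving segmentation accuracy in low-data regimes. The method is intuitive and interpretable, supports visualization of candidate samples in feature space, and outperforms baselines such as random selection on three datasets spanning X-ray and MRI.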

Link: https://arxiv.org/abs/2601.18532
Authors: Devon Levy,Bar Assayag,Laura Gaspar,Ilan Shimshoni,Bella Specktor-Fadida
Affiliations: University of Haifa
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 19 pages without references

Abstract:Accurate segmentation annotations are critical for disease monitoring, yet manual labeling remains a major bottleneck due to the time and expertise required. Active learning (AL) alleviates this burden by prioritizing informative samples for annotation, typically through a diversity-based cold-start phase followed by uncertainty-driven selection. We propose a novel cold-start sampling strategy that combines foundation-model embeddings with clustering, including automatic selection of the number of clusters and proportional sampling across clusters, to construct a diverse and representative initial training set. This is followed by an uncertainty-based AL framework that integrates spatial diversity to guide sample selection. The proposed method is intuitive and interpretable, enabling visualization of the feature-space distribution of candidate samples. We evaluate our approach on three datasets spanning X-ray and MRI modalities. On the CheXmask dataset, the cold-start strategy outperforms random selection, improving Dice from 0.918 to 0.929 and reducing the Hausdorff distance from 32.41 to 27.66 mm. In the AL setting, combined entropy and diversity selection improves Dice from 0.919 to 0.939 and reduces the Hausdorff distance from 30.10 to 19.16 mm. On the Montgomery dataset, cold-start gains are substantial, with Dice improving from 0.928 to 0.950 and Hausdorff distance decreasing from 14.22 to 9.38 mm. On the SynthStrip dataset, cold-start selection slightly affects Dice but reduces the Hausdorff distance from 9.43 to 8.69 mm, while active learning improves Dice from 0.816 to 0.826 and reduces the Hausdorff distance from 7.76 to 6.38 mm. Overall, the proposed framework consistently outperforms baseline methods in low-data regimes, improving segmentation accuracy.
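
The proportional sampling step can be sketched as follows, assuming cluster assignments have already been computed from the embeddings (the embedding model, the clustering algorithm, and the automatic choice of cluster count are outside this sketch). The function name, the largest-remainder rounding, and the toy cluster sizes are assumptions made for illustration, not details from the paper.

```python
import random
from collections import defaultdict

def proportional_cold_start(cluster_ids, budget, seed=0):
    """Pick `budget` sample indices, splitting the budget across clusters
    in proportion to cluster size (assumes budget <= number of samples)."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)
    total = len(cluster_ids)
    # Largest-remainder rounding so the quotas sum exactly to the budget.
    exact = {cid: budget * len(idxs) / total for cid, idxs in members.items()}
    quota = {cid: int(value) for cid, value in exact.items()}
    leftover = budget - sum(quota.values())
    by_remainder = sorted(exact, key=lambda c: exact[c] - quota[c], reverse=True)
    for cid in by_remainder[:leftover]:
        quota[cid] += 1
    selected = []
    for cid, idxs in members.items():
        selected.extend(rng.sample(idxs, quota[cid]))
    return sorted(selected)

# Toy assignment: three clusters holding 50, 30, and 20 scans.
cluster_ids = [0] * 50 + [1] * 30 + [2] * 20
picked = proportional_cold_start(cluster_ids, budget=10)
counts = {c: sum(1 for i in picked if cluster_ids[i] == c) for c in (0, 1, 2)}
print(counts)
```

Sampling in proportion to cluster size keeps the initial set representative of the data distribution, while drawing from every sufficiently large cluster keeps it diverse.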

[CV-23] Closing the Modality Gap Aligns Group-Wise Semantics ICLR2026

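[Quick read]: This paper addresses the structural mismatch between modalities in multimodal learning known as the modality gap: representations of different modalities are systematically offset from each other in the shared latent space, so although CLIP-style losses align modalities at the semantic level, the latent space remains only partially shared. The key to the solution is a novel method that consistently reduces this discrepancy in the two-modal setting and extends directly to the n-modal case. Experiments show that reducing the gap brings only marginal or inconsistent gains on instance-wise tasks (e.g., retrieval) but significantly improves group-wise tasks (e.g., clustering), highlighting the role of the modality gap in tasks that require semantic grouping.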

Link: https://arxiv.org/abs/2601.18525
Authors: Eleonora Grassucci,Giordano Cicchetti,Emanuele Frasca,Aurelio Uncini,Danilo Comminiello
Affiliations: Sapienza University of Rome
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2026

Abstract:In multimodal learning, CLIP has been recognized as the de facto method for learning a shared latent space across multiple modalities, placing similar representations close to each other and moving them away from dissimilar ones. Although CLIP-based losses effectively align modalities at the semantic level, the resulting latent spaces often remain only partially shared, revealing a structural mismatch known as the modality gap. While the necessity of addressing this phenomenon remains debated, particularly given its limited impact on instance-wise tasks (e.g., retrieval), we prove that its influence is instead strongly pronounced in group-level tasks (e.g., clustering). To support this claim, we introduce a novel method designed to consistently reduce this discrepancy in two-modal settings, with a straightforward extension to the general n-modal case. Through our extensive evaluation, we demonstrate our novel insight: while reducing the gap provides only marginal or inconsistent improvements in traditional instance-wise tasks, it significantly enhances group-wise tasks. These findings may reshape our understanding of the modality gap, highlighting its key role in improving performance on tasks requiring semantic grouping.
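
One commonly used proxy for the modality gap is the distance between the centroids of the L2-normalized embeddings of each modality. The sketch below computes that proxy on toy two-dimensional embeddings. It is an assumption that this proxy matches what the paper measures, and the function names and data are hypothetical.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean norm (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def modality_gap(emb_a, emb_b):
    """Euclidean distance between the centroids of two normalized embedding sets."""
    centre_a = centroid([l2_normalize(v) for v in emb_a])
    centre_b = centroid([l2_normalize(v) for v in emb_b])
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(centre_a, centre_b)))

# Toy embeddings: the two modalities occupy different regions of the space.
image_emb = [[1.0, 0.1], [0.9, 0.2]]
text_emb = [[0.1, 1.0], [0.2, 0.9]]
print(round(modality_gap(image_emb, text_emb), 3))
print(modality_gap(image_emb, image_emb))
```

A retrieval task only compares each query with its own candidates, so a constant offset between modalities can leave rankings largely intact, whereas clustering pools all embeddings together and can end up grouping by modality instead of by meaning. This is one intuition for the instance-wise versus group-wise contrast reported in the abstract.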

[CV-24] DisasterInsight: A Multimodal Benchmark for Function-Aware and Grounded Disaster Assessment ICPR2026

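[Quick read]: This paper addresses the limited functional understanding and instruction robustness of current remote-sensing vision-language models (VLMs) in disaster response, where existing benchmarks rely on coarse labels and image-level recognition and fall short of the needs of real humanitarian assessment. The key to the solution is DisasterInsight, a multimodal benchmark that restructures the xBD dataset into approximately 112K building-centered instances and supports instruction-diverse evaluation across tasks including building-function classification, damage-level and disaster-type classification, counting, and structured report generation. The paper also proposes DI-Chat, a domain-adapted baseline obtained by fine-tuning existing VLM backbones on disaster-specific instruction data with Low-Rank Adaptation (LoRA), which significantly improves damage-level and disaster-type classification and report generation quality.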

Link: https://arxiv.org/abs/2601.18493
Authors: Sara Tehrani,Yonghao Xu,Leif Haglund,Amanda Berg,Michael Felsberg
Affiliations: Linköping University; Vantor
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review at ICPR 2026

Abstract:Timely interpretation of satellite imagery is critical for disaster response, yet existing vision-language benchmarks for remote sensing largely focus on coarse labels and image-level recognition, overlooking the functional understanding and instruction robustness required in real humanitarian workflows. We introduce DisasterInsight, a multimodal benchmark designed to evaluate vision-language models (VLMs) on realistic disaster analysis tasks. DisasterInsight restructures the xBD dataset into approximately 112K building-centered instances and supports instruction-diverse evaluation across multiple tasks, including building-function classification, damage-level and disaster-type classification, counting, and structured report generation aligned with humanitarian assessment guidelines. To establish domain-adapted baselines, we propose DI-Chat, obtained by fine-tuning existing VLM backbones on disaster-specific instruction data using parameter-efficient Low-Rank Adaptation (LoRA). Extensive experiments on state-of-the-art generic and remote-sensing VLMs reveal substantial performance gaps across tasks, particularly in damage understanding and structured report generation. DI-Chat achieves significant improvements on damage-level and disaster-type classification as well as report generation quality, while building-function classification remains challenging for all evaluated models. DisasterInsight provides a unified benchmark for studying grounded multimodal reasoning in disaster imagery.

[CV-25] LoD-Structured 3D Gaussian Splatting for Streaming Video Reconstruction

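[Quick read]: This paper addresses the difficulty of real-time reconstruction for Streaming Free-Viewpoint Video (SFVV), which is bottlenecked by sparse-view inputs, high training costs, and bandwidth constraints, and which demands rapid optimization, high-fidelity reconstruction under sparse constraints, and a minimal storage footprint. The key to the solution is the StreamLoD-GS framework, built on three core innovations: (1) an Anchor- and Octree-based Level of Detail (LoD)-structured 3D Gaussian Splatting (3DGS) with a hierarchical Gaussian dropout technique, for efficient and stable optimization with high-quality rendering; (2) a Gaussian Mixture Model (GMM)-based motion partitioning mechanism that separates dynamic from static content, refining dynamic regions while keeping the background stable; (3) a quantized residual refinement framework that significantly reduces storage requirements without compromising visual fidelity.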

Link: https://arxiv.org/abs/2601.18475
Authors: Xinhui Liu,Can Wang,Lei Liu,Zhenghao Chen,Wei Jiang,Wei Wang,Dong Xu
Affiliations: The University of Hong Kong; Futurewei Technologies Inc; The University of Newcastle
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Free-Viewpoint Video (FVV) reconstruction enables photorealistic and interactive 3D scene visualization; however, real-time streaming is often bottlenecked by sparse-view inputs, prohibitive training costs, and bandwidth constraints. While recent 3D Gaussian Splatting (3DGS) has advanced FVV due to its superior rendering speed, Streaming Free-Viewpoint Video (SFVV) introduces additional demands for rapid optimization, high-fidelity reconstruction under sparse constraints, and minimal storage footprints. To bridge this gap, we propose StreamLoD-GS, an LoD-based Gaussian Splatting framework designed specifically for SFVV. Our approach integrates three core innovations: 1) an Anchor- and Octree-based LoD-structured 3DGS with a hierarchical Gaussian dropout technique to ensure efficient and stable optimization while maintaining high-quality rendering; 2) a GMM-based motion partitioning mechanism that separates dynamic and static content, refining dynamic regions while preserving background stability; and 3) a quantized residual refinement framework that significantly reduces storage requirements without compromising visual fidelity. Extensive experiments demonstrate that StreamLoD-GS achieves competitive or state-of-the-art performance in terms of quality, efficiency, and storage.

[CV-26] Fair-Eye Net: A Fair Trustworthy Multimodal Integrated Glaucoma Full Chain AI System

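[Quick read]: This paper addresses the subjectivity and fragmented care that arise in glaucoma screening and longitudinal follow-up from reliance on single tests or loosely linked examinations, along with the inconsistency and inequity caused by limited access to imaging equipment and specialists. The key to the solution is Fair-Eye Net, a fair and reliable multimodal AI system that integrates fundus photos, OCT structural metrics, visual field (VF) functional indices, and demographic factors through a dual-stream heterogeneous fusion architecture, with an uncertainty-aware hierarchical gating strategy for selective prediction and safe referral. Fairness is optimized as a primary objective within a multitask learning framework, which reduces missed diagnoses in disadvantaged subgroups (the racial false-negative disparity falls by 73.4%) while cross-domain performance stays stable, closing the clinical loop from screening to risk alerting.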

Link: https://arxiv.org/abs/2601.18464
Authors: Wenbin Wei,Suyuan Yao,Cheng Huang,Xiangyu Gao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Glaucoma is a top cause of irreversible blindness globally, making early detection and longitudinal follow-up pivotal to preventing permanent vision loss. Current screening and progression assessment, however, rely on single tests or loosely linked examinations, introducing subjectivity and fragmented care. Limited access to high-quality imaging tools and specialist expertise further compromises consistency and equity in real-world use. To address these gaps, we developed Fair-Eye Net, a fair, reliable multimodal AI system closing the clinical loop from glaucoma screening to follow-up and risk alerting. It integrates fundus photos, OCT structural metrics, VF functional indices, and demographic factors via a dual-stream heterogeneous fusion architecture, with an uncertainty-aware hierarchical gating strategy for selective prediction and safe referral. A fairness constraint reduces missed diagnoses in disadvantaged subgroups. Experimental results show it achieved an AUC of 0.912 (96.7% specificity), cut racial false-negativity disparity by 73.4% (12.31% to 3.28%), maintained stable cross-domain performance, and enabled 3-12 months of early risk alerts (92% sensitivity, 88% specificity). Unlike post hoc fairness adjustments, Fair-Eye Net optimizes fairness as a primary goal with clinical reliability via multitask learning, offering a reproducible path for clinical translation and large-scale deployment to advance global eye health equity.
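
The 73.4% figure is consistent with a relative reduction of the disparity from 12.31% to 3.28%. The short check below makes that arithmetic explicit; it assumes, based only on the abstract's wording, that both percentages denote the gap in false-negative rates between racial subgroups.

```python
def relative_reduction(before, after):
    """Relative reduction, in percent, when going from `before` to `after`."""
    return 100.0 * (before - after) / before

fn_gap_before = 12.31  # percent, as reported in the abstract
fn_gap_after = 3.28    # percent, as reported in the abstract
reduction = relative_reduction(fn_gap_before, fn_gap_after)
print(f"{reduction:.1f}%")  # prints 73.4%
```

Note that this is a relative change in the disparity itself; the absolute change is 12.31 - 3.28 = 9.03 percentage points.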

[CV-27] 3DGesPolicy: Phoneme-Aware Holistic Co-Speech Gesture Generation Based on Action Control

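[Quick read]: This paper addresses the semantic incoherence and spatial instability that arise when generating holistic co-speech gestures, problems it attributes to existing part-decomposed or frame-level regression methods. The key to the solution is the 3DGesPolicy framework, which reformulates holistic gesture generation as a continuous trajectory control problem and uses diffusion policy from robotics to model frame-to-frame variations as unified holistic actions, learning coherent inter-frame motion patterns that are consistent both spatially and semantically. A Gesture-Audio-Phoneme (GAP) fusion module deeply integrates multi-modal signals to achieve structured, fine-grained alignment between speech semantics, body motion, and facial expressions.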

Link: https://arxiv.org/abs/2601.18451
Authors: Xuanmeng Sha,Liyun Zhang,Tomohiro Mashita,Naoya Chiba,Yuki Uranishi
Affiliations: The University of Osaka, Japan; Osaka Electro-Communication University, Japan; Osaka University, Japan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
Comments: 13 pages, 5 figures

Abstract:Generating holistic co-speech gestures that integrate full-body motion with facial expressions is hampered by semantically incoherent body-motion coordination and spatially unstable, meaningless movements, which stem from existing part-decomposed or frame-level regression methods. We introduce 3DGesPolicy, a novel action-based framework that reformulates holistic gesture generation as a continuous trajectory control problem through diffusion policy from robotics. By modeling frame-to-frame variations as unified holistic actions, our method effectively learns inter-frame holistic gesture motion patterns and ensures both spatially and semantically coherent movement trajectories that adhere to realistic motion manifolds. To further bridge the gap in expressive alignment, we propose a Gesture-Audio-Phoneme (GAP) fusion module that can deeply integrate and refine multi-modal signals, ensuring structured and fine-grained alignment between speech semantics, body motion, and facial expressions. Extensive quantitative and qualitative experiments on the BEAT2 dataset demonstrate the effectiveness of 3DGesPolicy over other state-of-the-art methods in generating natural, expressive, and highly speech-aligned holistic gestures.

[CV-28] On Procrustes Contamination in Machine Learning Applications of Geometric Morphometrics

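[Quick read]: This paper addresses the statistical dependence introduced when Generalized Procrustes Analysis (GPA) is used as preprocessing in machine learning (ML) applications of geometric morphometrics (GMM): aligning all specimens with GPA before splitting the data into training and test sets can introduce cross-sample dependence and contaminate downstream predictive models. The key to the solution is a realignment procedure in which test specimens are aligned to the training set before model fitting, eliminating cross-sample dependency. Controlled 2D and 3D simulations across varying sample sizes, landmark densities, and allometric patterns characterise the contamination, reveal a robust "diagonal" in sample-size vs. landmark-space that reflects the scaling of RMSE, and underline the importance of spatial autocorrelation among landmarks.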

Link: https://arxiv.org/abs/2601.18448
Authors: Lloyd Austin Courtenay
Affiliations: PACEA UMR5199, CNRS; Université de Bordeaux; Department d’Història i Història de l’Art, Universitat Rovira i Virgili
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 17 pages, 5 figures, Preprint pending review

Abstract:Geometric morphometrics (GMM) is widely used to quantify shape variation, more recently serving as input for machine learning (ML) analyses. Standard practice aligns all specimens via Generalized Procrustes Analysis (GPA) prior to splitting data into training and test sets, potentially introducing statistical dependence and contaminating downstream predictive models. Here, the effects of GPA-induced contamination are formally characterised using controlled 2D and 3D simulations across varying sample sizes, landmark densities, and allometric patterns. A novel realignment procedure is proposed, whereby test specimens are aligned to the training set prior to model fitting, eliminating cross-sample dependency. Simulations reveal a robust “diagonal” in sample-size vs. landmark-space, reflecting the scaling of RMSE under isotropic variation, with slopes analytically derived from the degrees of freedom in Procrustes tangent space. The importance of spatial autocorrelation among landmarks is further demonstrated using linear and convolutional regression models, highlighting performance degradation when landmark relationships are ignored. This work establishes the need for careful preprocessing in ML applications of GMM, provides practical guidelines for realignment, and clarifies fundamental statistical constraints inherent to Procrustes shape space.
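
The realignment idea can be sketched for 2D landmarks as follows: a consensus shape is estimated from the training specimens only, and each test specimen is then superimposed onto that fixed consensus, so no test data influences the alignment. This is a minimal illustration, not the paper's procedure. Landmarks are encoded as complex numbers x + iy, the function names are hypothetical, and details such as tangent-space projection are omitted.

```python
import math

def center_and_scale(shape):
    """Remove translation and scale so the shape has unit centroid size."""
    centre = sum(shape) / len(shape)
    centred = [z - centre for z in shape]
    size = math.sqrt(sum(abs(z) ** 2 for z in centred))
    return [z / size for z in centred]

def rotate_onto(shape, reference):
    """Least-squares rotation of a centred, scaled shape onto a reference."""
    s = sum(w.conjugate() * z for z, w in zip(shape, reference))
    rotation = s.conjugate() / abs(s)
    return [rotation * z for z in shape]

def fit_training_consensus(train_shapes, n_iter=10):
    """GPA-style consensus estimated from training specimens only."""
    shapes = [center_and_scale(s) for s in train_shapes]
    mean = shapes[0]
    for _ in range(n_iter):
        shapes = [rotate_onto(s, mean) for s in shapes]
        mean = center_and_scale([sum(col) / len(shapes) for col in zip(*shapes)])
    return mean

def align_to_training(specimen, train_mean):
    """Superimpose one unseen specimen onto the fixed training consensus."""
    return rotate_onto(center_and_scale(specimen), train_mean)

def similarity(shape, angle, scale, shift):
    """Apply a rotation, scaling, and translation (used to build toy data)."""
    rotation = complex(math.cos(angle), math.sin(angle))
    return [scale * rotation * z + shift for z in shape]

# Toy data: one triangle observed under different poses.
base = [0 + 0j, 2 + 0j, 1 + 1j]
train = [base, similarity(base, 0.7, 3.0, 5 - 2j)]
test_specimen = similarity(base, -1.2, 0.5, -4 + 1j)

train_mean = fit_training_consensus(train)
aligned = align_to_training(test_specimen, train_mean)
print(max(abs(a - m) for a, m in zip(aligned, train_mean)))
```

The design point is the same as fitting any preprocessing step on training data only: the consensus plays the role of a fitted parameter, and test specimens are transformed with it instead of taking part in estimating it.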

[CV-29] Comparative Evaluation of Machine Learning Algorithms for Affective State Recognition from Children's Drawings

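[Quick read]: This paper addresses the difficulty of recognizing children's affective states at an early age, where conventional assessment methods are often intrusive, subjective, or hard to apply consistently. The approach is to classify emotional state from children's drawings with deep learning models: MobileNet, EfficientNet, and VGG16 are trained with transfer learning on a dataset annotated with emotional labels by psychological experts, and compared within a unified framework on classification performance, robustness, and computational efficiency, with attention to the trade-offs relevant to mobile and real-time applications.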

Link: https://arxiv.org/abs/2601.18414
Authors: Aura Loredana Dan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 8 figures

Abstract:Autism spectrum disorder (ASD) represents a neurodevelopmental condition characterized by difficulties in expressing emotions and communication, particularly during early childhood. Understanding the affective state of children at an early age remains challenging, as conventional assessment methods are often intrusive, subjective, or difficult to apply consistently. This paper builds upon previous work on affective state recognition from children’s drawings by presenting a comparative evaluation of machine learning models for emotion classification. Three deep learning architectures – MobileNet, EfficientNet, and VGG16 – are evaluated within a unified experimental framework to analyze classification performance, robustness, and computational efficiency. The models are trained using transfer learning on a dataset of children’s drawings annotated with emotional labels provided by psychological experts. The results highlight important trade-offs between lightweight and deeper architectures when applied to drawing-based affective computing tasks, particularly in mobile and real-time application contexts.

[CV-30] Larger than memory image processing

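Quick read: This paper addresses efficient analysis of very large image datasets (such as 1.4 PB electron-microscopy volumes and 150 TB human-organ atlases) under limited memory, where the core bottleneck is I/O. The key idea is to structure image analysis as streaming passes over the data and to propose a slice-based streaming architecture that works on top of both common storage formats, stacks of 2D slices (e.g., multi-page TIFF) and 3D chunked layouts (e.g., Zarr/HDF5), while sweep-based execution, windowed operations, and overlap-aware tiling minimize redundant disk access. The authors further design a domain-specific language (DSL) that analyzes pipelines at compile time and run time to select window sizes, fuse stages, tee and zip streams, and schedule passes, yielding near-linear I/O scans and predictable memory footprints and substantially higher throughput without loading the full dataset into memory.

Link: https://arxiv.org/abs/2601.18407
Authors: Jon Sporring,David Stansby
Affiliations: University of Copenhagen; University College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages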

Abstract:This report addresses larger-than-memory image analysis for petascale datasets such as 1.4 PB electron-microscopy volumes and 150 TB human-organ atlases. We argue that performance is fundamentally I/O-bound. We show that structuring analysis as streaming passes over data is crucial. For 3D volumes, two representations are popular: stacks of 2D slices (e.g., directories or multi-page TIFF) and 3D chunked layouts (e.g., Zarr/HDF5). While for a few algorithms, chunked layout on disk is crucial to keep disk I/O at a minimum, we show how the slice-based streaming architecture can be built on top of either image representation in a manner that minimizes disk I/O. This is in particular advantageous for algorithms relying on neighbouring values, since the slicing streaming architecture is 1D, which implies that there are only 2 possible sweeping orders, both of which are aligned with the order in which images are read from the disk. This is in contrast to 3D chunks, in which any sweep cannot be done without accessing each chunk at least 9 times. We formalize this with sweep-based execution (natural 2D/3D orders), windowed operations, and overlap-aware tiling to minimize redundant access. Building on these principles, we introduce a domain-specific language (DSL) that encodes algorithms with intrinsic knowledge of their optimal streaming and memory use; the DSL performs compile-time and run-time pipeline analyses to automatically select window sizes, fuse stages, tee and zip streams, and schedule passes for limited-RAM machines, yielding near-linear I/O scans and predictable memory footprints. The approach integrates with existing tooling for segmentation and morphology but reframes pre/post-processing as pipelines that privilege sequential read/write patterns, delivering substantial throughput gains for extremely large images without requiring full-volume residency in memory.
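
As a rough illustration of a windowed operation in a single streaming pass, the sketch below averages each slice with its neighbours along z while holding only one window of slices in memory. It is not the paper's DSL: `read_slices` is a hypothetical stand-in for a reader that yields 2D slices from disk in storage order, and boundary slices are simply skipped.

```python
from collections import deque
import numpy as np

def read_slices(volume):
    """Hypothetical reader: yield one 2D slice at a time in storage order."""
    for z in range(volume.shape[0]):
        yield volume[z]

def windowed_mean_along_z(slices, radius=1):
    """Yield the mean over 2*radius+1 consecutive slices, one sweep, bounded memory."""
    window = deque(maxlen=2 * radius + 1)  # the oldest slice is dropped automatically
    for current in slices:
        window.append(current)
        if len(window) == window.maxlen:
            yield np.mean(np.stack(window), axis=0)

# Toy volume whose slice z is filled with the value z.
volume = np.arange(5, dtype=float)[:, None, None] * np.ones((5, 2, 2))
results = [float(out[0, 0])
           for out in windowed_mean_along_z(read_slices(volume), radius=1)]
```

Each input slice is read exactly once, and peak memory is proportional to the window size rather than to the volume size.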

[CV-31] Efficient Complex-Valued Vision Transformers for MRI Classification Directly from k-Space

Quick read: This paper targets the reliance of deep learning for MRI on reconstructed magnitude images, which discards phase information and requires computationally expensive transforms. Conventional architectures (convolutional networks or grid-patch Vision Transformers) are ill-suited to the global, non-local nature of raw frequency-domain (k-Space) data. The key idea is a complex-valued Vision Transformer (kViT) with a radial k-Space patching strategy that respects the spectral energy distribution of the frequency domain. Experiments show classification performance competitive with image-domain baselines (ResNet, EfficientNet, ViT), greater robustness at high acceleration factors, and up to 68× lower VRAM consumption during training, pointing toward resource-efficient, direct-from-scanner AI analysis.

Link: https://arxiv.org/abs/2601.18392
Authors: Moritz Rempe,Lukas T. Rotkopf,Marco Schlimbach,Helmut Becker,Fabian Hörst,Johannes Haubold,Philipp Dammann,Kevin Kröninger,Jens Kleesiek
Affiliations: Institute for AI in Medicine (IKIM); University Hospital Essen; Cancer Research Center Cologne Essen (CCCE); University Medicine Essen; RACOON Study Group; Department of Physics; Technical University Dortmund; German Cancer Consortium (DKTK); Partner Site Essen; Medical Faculty and Faculty of Computer Science; University of Duisburg-Essen; German Cancer Research Center (DKFZ); Department of Radiology; Department of Neurosurgery and Spine Surgery
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep learning applications in Magnetic Resonance Imaging (MRI) predominantly operate on reconstructed magnitude images, a process that discards phase information and requires computationally expensive transforms. Standard neural network architectures rely on local operations (convolutions or grid-patches) that are ill-suited for the global, non-local nature of raw frequency-domain (k-Space) data. In this work, we propose a novel complex-valued Vision Transformer (kViT) designed to perform classification directly on k-Space data. To bridge the geometric disconnect between current architectures and MRI physics, we introduce a radial k-Space patching strategy that respects the spectral energy distribution of the frequency domain. Extensive experiments on the fastMRI and in-house datasets demonstrate that our approach achieves classification performance competitive with state-of-the-art image-domain baselines (ResNet, EfficientNet, ViT). Crucially, kViT exhibits superior robustness to high acceleration factors and offers a paradigm shift in computational efficiency, reducing VRAM consumption during training by up to 68× compared to standard methods. This establishes a pathway for resource-efficient, direct-from-scanner AI analysis.

[CV-32] ARMOR: Agentic Reasoning for Methods Orchestration and Reparameterization for Robust Adversarial Attacks

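Quick read: This paper addresses the lack of strategic adaptation and semantic awareness in existing automated attack suites, which operate as static ensembles. It proposes ARMOR (Agentic Reasoning for Methods Orchestration and Reparameterization), a framework in which agents guided by Vision Language Models (VLMs) orchestrate three canonical adversarial primitives (Carlini-Wagner, Jacobian-based Saliency Map Attack, and Spatially Transformed Attacks) and collaboratively generate and synthesize perturbations through a shared "Mixing Desk". Large Language Models (LLMs) adaptively tune and reparameterize the attack agents in a real-time, closed-loop system that exploits image-specific semantic vulnerabilities, improving cross-architecture transfer and attack effectiveness against both white-box and blind targets.

Link: https://arxiv.org/abs/2601.18386
Authors: Gabriel Lee Jun Rong,Christos Korgialas,Dion Jia Xu Ho,Pai Chet Ng,Xiaoxiao Miao,Konstantinos N. Plataniotis
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: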

Abstract:Existing automated attack suites operate as static ensembles with fixed sequences, lacking strategic adaptation and semantic awareness. This paper introduces the Agentic Reasoning for Methods Orchestration and Reparameterization (ARMOR) framework to address these limitations. ARMOR orchestrates three canonical adversarial primitives, Carlini-Wagner (CW), Jacobian-based Saliency Map Attack (JSMA), and Spatially Transformed Attacks (STA), via Vision Language Model (VLM)-guided agents that collaboratively generate and synthesize perturbations through a shared "Mixing Desk". Large Language Models (LLMs) adaptively tune and reparameterize parallel attack agents in a real-time, closed-loop system that exploits image-specific semantic vulnerabilities. On standard benchmarks, ARMOR achieves improved cross-architecture transfer and reliably fools both settings, delivering a blended output for blind targets and selecting the best attack or blended attacks for white-box targets using a confidence-and-SSIM score.

[CV-33] Estimation of geometric transformation matrices using grid-shaped pilot signals

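Quick read: This paper addresses synchronization in image watermarking under geometric distortions (cropping, scaling, rotation, and so on), in particular the difficulty of locating the watermark after cropping changes the image origin. The key idea is a grid-shaped pilot signal that encodes different values in the horizontal and vertical directions; it is embedded in the image and deforms along with it. Analyzing this distortion, with the Radon transform used to estimate grid angles and intervals and the direction-specific encoding used to tell horizontal from vertical lines, recovers the geometric transformation matrix and enables robust synchronization and watermark extraction.

Link: https://arxiv.org/abs/2601.18385
Authors: Rinka Kawano,Masaki Kawamura
Affiliations: Yamaguchi University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: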

Abstract:Digital watermarking techniques are essential to prevent unauthorized use of images. Since pirated images are often geometrically distorted by operations such as scaling and cropping, accurate synchronization - detecting the embedding position of the watermark - is critical for proper extraction. In particular, cropping changes the origin of the image, making synchronization difficult. However, few existing methods are robust against cropping. To address this issue, we propose a watermarking method that estimates geometric transformations applied to a stego image using a pilot signal, allowing synchronization even after cropping. A grid-shaped pilot signal with distinct horizontal and vertical values is embedded in the image. When the image is transformed, the grid is also distorted. By analyzing this distortion, the transformation matrix can be estimated. Applying the Radon transform to the distorted image allows estimation of the grid angles and intervals. In addition, since the horizontal and vertical grid lines are encoded differently, the grid orientation can be determined, which reduces ambiguity. To validate our method, we performed simulations with anisotropic scaling, rotation, shearing, and cropping. The results show that the proposed method accurately estimates transformation matrices with low error under both single and composite attacks.
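
To make the angle-estimation step concrete, here is a minimal NumPy sketch of a Radon-style projection search on a synthetic stripe pattern. It is only an illustration under stated assumptions: a single family of parallel lines, a coarse one-degree search, and a simple projection-energy score; the paper's actual pilot signal, embedding, and estimator are not reproduced, and all names are hypothetical.

```python
import numpy as np

def projection_energy(image, angle_deg):
    """Energy of the 1D projection of a zero-mean image along one direction."""
    h, w = image.shape
    y, x = np.mgrid[0:h, 0:w]
    x = x - (w - 1) / 2.0
    y = y - (h - 1) / 2.0
    # Restrict to a disc so that the support looks the same from every angle.
    inside = x ** 2 + y ** 2 <= (min(h, w) / 2.0 - 1) ** 2
    values = image[inside] - image[inside].mean()
    t = np.radians(angle_deg)
    s = x[inside] * np.cos(t) + y[inside] * np.sin(t)  # signed distance along the normal
    bins = np.floor(s - s.min()).astype(int)
    profile = np.bincount(bins, weights=values)  # discrete Radon projection
    return float(np.sum(profile ** 2))

def estimate_line_normal_angle(image, angles=None):
    """Return the angle (degrees) whose projection is sharpest."""
    if angles is None:
        angles = np.arange(0.0, 180.0, 1.0)
    scores = [projection_energy(image, a) for a in angles]
    return float(angles[int(np.argmax(scores))])

# Synthetic pilot: parallel lines with period 8 px whose normal points at 20 degrees.
size = 128
yy, xx = np.mgrid[0:size, 0:size]
t0 = np.radians(20.0)
dist = (xx - (size - 1) / 2.0) * np.cos(t0) + (yy - (size - 1) / 2.0) * np.sin(t0)
pilot = (np.mod(dist, 8.0) < 1.5).astype(float)
estimated_angle = estimate_line_normal_angle(pilot)
```

When the projection direction matches the line normal, the lines collapse into sharp peaks and the energy is maximal; the spacing of those peaks would then give the grid interval.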

[CV-34] Gaze Prediction in Virtual Reality Without Eye Tracking Using Visual and Head Motion Cues

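Quick read: This paper addresses gaze prediction in virtual reality (VR) when direct eye tracking is unavailable because of hardware limitations or privacy concerns, so that techniques such as foveated rendering can still anticipate user attention. The key idea is a gaze prediction framework that fuses Head-Mounted Display (HMD) motion signals with visual saliency cues from video frames: UniSal, a lightweight saliency encoder, extracts visual features, which are fused with HMD motion time series and passed to a lightweight time-series module (TSMixer or LSTM) that forecasts future gaze directions. On the EHTask dataset and on commercial VR hardware, the approach consistently outperforms baselines such as Center-of-HMD and Mean Gaze.

Link: https://arxiv.org/abs/2601.18372
Authors: Christos Petrou,Harris Partaourides,Athanasios Balomenos,Yannis Kopsinis,Sotirios Chatzis
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: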

Abstract:Gaze prediction plays a critical role in Virtual Reality (VR) applications by reducing sensor-induced latency and enabling computationally demanding techniques such as foveated rendering, which rely on anticipating user attention. However, direct eye tracking is often unavailable due to hardware limitations or privacy concerns. To address this, we present a novel gaze prediction framework that combines Head-Mounted Display (HMD) motion signals with visual saliency cues derived from video frames. Our method employs UniSal, a lightweight saliency encoder, to extract visual features, which are then fused with HMD motion data and processed through a time-series prediction module. We evaluate two lightweight architectures, TSMixer and LSTM, for forecasting future gaze directions. Experiments on the EHTask dataset, along with deployment on commercial VR hardware, show that our approach consistently outperforms baselines such as Center-of-HMD and Mean Gaze. These results demonstrate the effectiveness of predictive gaze modeling in reducing perceptual lag and enhancing natural interaction in VR environments where direct eye tracking is constrained.

[CV-35] OREHAS: A fully automated deep-learning pipeline for volumetric endolymphatic hydrops quantification in MRI

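Quick read: This paper addresses the lack of automated, quantitative, and reproducible assessment of endolymphatic hydrops (EH) in clinical imaging; manual delineation is slow and subject to operator bias. It proposes OREHAS (Optimized Recognition Evaluation of volumetric Hydrops in the Auditory System), a fully automatic pipeline that combines slice classification, inner ear localization, and sequence-specific segmentation to compute per-ear endolymphatic-to-vestibular volume ratios (ELR) directly from routine 3D-SPACE-MRC and 3D-REAL-IR MRI. Trained with only 3 to 6 annotated slices per patient, it reaches Dice scores of 0.90 (SPACE-MRC) and 0.75 (REAL-IR) on full 3D volumes and agrees more closely with expert annotations than existing clinical software (VSI of 74.3% versus 42.5%), with more physiologically realistic volumes.

Link: https://arxiv.org/abs/2601.18368
Authors: Caterina Fuster-Barceló,Claudia Castrillón,Laura Rodrigo-Muñoz,Victor Manuel Vega-Suárez,Nicolás Pérez-Fernández,Gorka Bastarrika,Arrate Muñoz-Barrutia
Affiliations: Universidad Carlos III de Madrid; University of Navarra
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: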

Abstract:We present OREHAS (Optimized Recognition Evaluation of volumetric Hydrops in the Auditory System), the first fully automatic pipeline for volumetric quantification of endolymphatic hydrops (EH) from routine 3D-SPACE-MRC and 3D-REAL-IR MRI. The system integrates three components (slice classification, inner ear localization, and sequence-specific segmentation) into a single workflow that computes per-ear endolymphatic-to-vestibular volume ratios (ELR) directly from whole MRI volumes, eliminating the need for manual intervention. Trained with only 3 to 6 annotated slices per patient, OREHAS generalized effectively to full 3D volumes, achieving Dice scores of 0.90 for SPACE-MRC and 0.75 for REAL-IR. In an external validation cohort with complete manual annotations, OREHAS closely matched expert ground truth (VSI = 74.3%) and substantially outperformed the clinical this http URL software (VSI = 42.5%), which tended to overestimate endolymphatic volumes. Across 19 test patients, vestibular measurements from OREHAS were consistent with this http URL, while endolymphatic volumes were systematically smaller and more physiologically realistic. These results show that reliable and reproducible EH quantification can be achieved from standard MRI using limited supervision. By combining efficient deep-learning-based segmentation with a clinically aligned volumetric workflow, OREHAS reduces operator dependence and ensures methodological consistency, and its results are compatible with established imaging protocols. The approach provides a robust foundation for large-scale studies and for recalibrating clinical diagnostic thresholds based on accurate volumetric measurements of the inner ear.
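
The two quantities reported above can be sketched in a few lines of NumPy. The Dice score follows its standard definition; the ELR helper assumes that both binary masks live on the same voxel grid with a known voxel size, which is a simplification made for this example and not a detail taken from the paper. All function names are hypothetical.

```python
import numpy as np

def dice_score(pred, truth):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    total = pred.sum() + truth.sum()
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return float(2.0 * np.logical_and(pred, truth).sum() / total)

def volume_mm3(mask, voxel_size_mm):
    """Volume of a binary mask given the voxel edge lengths in millimetres."""
    return float(np.asarray(mask, dtype=bool).sum() * np.prod(voxel_size_mm))

def endolymph_to_vestibule_ratio(endolymph_mask, vestibule_mask, voxel_size_mm):
    """ELR as the ratio of the two segmented volumes (same grid assumed)."""
    vestibule = volume_mm3(vestibule_mask, voxel_size_mm)
    if vestibule == 0:
        raise ValueError("vestibule mask is empty")
    return volume_mm3(endolymph_mask, voxel_size_mm) / vestibule

# Toy masks on a 2x2x2 grid.
truth = np.zeros((2, 2, 2), dtype=bool)
truth[0] = True                      # 4 voxels
pred = np.zeros((2, 2, 2), dtype=bool)
pred[:, 0] = True                    # 4 voxels, 2 shared with truth
toy_dice = dice_score(pred, truth)

vestibule = np.ones((2, 2, 2), dtype=bool)   # 8 voxels
endolymph = np.zeros((2, 2, 2), dtype=bool)
endolymph[0, 0] = True                       # 2 voxels
toy_elr = endolymph_to_vestibule_ratio(endolymph, vestibule, (0.5, 0.5, 0.5))
```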

[CV-36] Q-Bench-Portrait: Benchmarking Multimodal Large Language Models on Portrait Image Quality Perception

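Quick read: This paper addresses the limited ability of multimodal large language models (MLLMs) to perceive portrait image quality: models that do well on generic low-level vision benchmarks still struggle with portraits, which have distinct structural and perceptual properties. It introduces Q-Bench-Portrait, the first holistic benchmark for portrait image quality perception, featuring (1) diverse portrait sources (natural, synthetically distorted, AI-generated, artistic, and computer graphics images); (2) quality dimensions covering technical distortions, AIGC-specific distortions, and aesthetics; and (3) single-choice, multiple-choice, true/false, and open-ended questions at both global and local levels. An evaluation of 20 open-source and 5 closed-source MLLMs shows clear remaining limitations and provides a standardized testbed for future work.

Link: https://arxiv.org/abs/2601.18346
Authors: Sijing Wu,Yunhao Li,Zicheng Zhang,Qi Jia,Xinyue Li,Huiyu Duan,Xiongkuo Min,Guangtao Zhai
Affiliations: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: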

Abstract:Recent advances in multimodal large language models (MLLMs) have demonstrated impressive performance on existing low-level vision benchmarks, which primarily focus on generic images. However, their capabilities to perceive and assess portrait images, a domain characterized by distinct structural and perceptual properties, remain largely underexplored. To this end, we introduce Q-Bench-Portrait, the first holistic benchmark specifically designed for portrait image quality perception, comprising 2,765 image-question-answer triplets and featuring (1) diverse portrait image sources, including natural, synthetic distortion, AI-generated, artistic, and computer graphics images; (2) comprehensive quality dimensions, covering technical distortions, AIGC-specific distortions, and aesthetics; and (3) a range of question formats, including single-choice, multiple-choice, true/false, and open-ended questions, at both global and local levels. Based on Q-Bench-Portrait, we evaluate 20 open-source and 5 closed-source MLLMs, revealing that although current models demonstrate some competence in portrait image perception, their performance remains limited and imprecise, with a clear gap relative to human judgments. We hope that the proposed benchmark will foster further research into enhancing the portrait image perception capabilities of both general-purpose and domain-specific MLLMs.

[CV-37] Beyond Rigid: Benchmarking Non-Rigid Video Editing

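Quick read: This paper addresses non-rigid deformation in text-driven video editing, which often suffers from physical distortion and temporal flicker. It proposes NRVBench, the first benchmark dedicated to non-rigid video editing, comprising a high-quality dataset, NRVE-Acc (a new evaluation metric based on Vision-Language Models), and VM-Edit, a training-free baseline that uses a dual-region denoising mechanism for structure-aware control, balancing structural preservation against dynamic deformation to improve physical plausibility and temporal consistency.

Link: https://arxiv.org/abs/2601.18340
Authors: Bingzheng Qu,Kehai Chen,Xuefeng Bai,Jun Yu,Min Zhang
Affiliations: Harbin Institute of Technology, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: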

Abstract:Despite the remarkable progress in text-driven video editing, generating coherent non-rigid deformations remains a critical challenge, often plagued by physical distortion and temporal flicker. To bridge this gap, we propose NRVBench, the first dedicated and comprehensive benchmark designed to evaluate non-rigid video editing. First, we curate a high-quality dataset consisting of 180 non-rigid motion videos from six physics-based categories, equipped with 2,340 fine-grained task instructions and 360 multiple-choice questions. Second, we propose NRVE-Acc, a novel evaluation metric based on Vision-Language Models that can rigorously assess physical compliance, temporal consistency, and instruction alignment, overcoming the limitations of general metrics in capturing complex dynamics. Third, we introduce a training-free baseline, VM-Edit, which utilizes a dual-region denoising mechanism to achieve structure-aware control, balancing structural preservation and dynamic deformation. Extensive experiments demonstrate that while current methods have shortcomings in maintaining physical plausibility, our method achieves excellent performance across both standard and proposed metrics. We believe the benchmark could serve as a standard testing platform for advancing physics-aware video editing.

[CV-38] PPISP: Physically-Plausible Compensation and Control of Photometric Variations in Radiance Field Reconstruction

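Quick read: This paper addresses the sensitivity of multi-view 3D reconstruction to photometric inconsistencies caused by camera optics and differences in image signal processing (ISP); existing remedies such as per-frame latent variables or affine color corrections lack physical grounding and generalize poorly. The key idea is the Physically-Plausible ISP (PPISP) correction module, which disentangles camera-intrinsic from capture-dependent effects, together with a controller trained on the input views that predicts ISP parameters for novel viewpoints, analogous to auto exposure and auto white balance in real cameras. This enables realistic reconstruction and fair evaluation on novel views without ground-truth images.

Link: https://arxiv.org/abs/2601.18336
Authors: Isaac Deutsch,Nicolas Moënne-Loccoz,Gavriel State,Zan Gojcic
Affiliations: NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: For more details and updates, please visit our project website: this https URL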

Abstract:Multi-view 3D reconstruction methods remain highly sensitive to photometric inconsistencies arising from camera optical characteristics and variations in image signal processing (ISP). Existing mitigation strategies such as per-frame latent variables or affine color corrections lack physical grounding and generalize poorly to novel views. We propose the Physically-Plausible ISP (PPISP) correction module, which disentangles camera-intrinsic and capture-dependent effects through physically based and interpretable transformations. A dedicated PPISP controller, trained on the input views, predicts ISP parameters for novel viewpoints, analogous to auto exposure and auto white balance in real cameras. This design enables realistic and fair evaluation on novel views without access to ground-truth images. PPISP achieves SoTA performance on standard benchmarks, while providing intuitive control and supporting the integration of metadata when available. The source code is available at: this https URL

[CV-39] A Tumor Aware DenseNet Swin Hybrid Learning with Boosted and Hierarchical Feature Spaces for Large-Scale Brain MRI Classification

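Quick read: This paper addresses the difficulty of capturing fine-grained texture and long-range context at the same time in brain tumor MRI analysis, and the resulting misclassification of tumor types (glioma, meningioma, pituitary tumor) with irregular shapes, blurred boundaries, or complex structure. It proposes an efficient Densely Swin Hybrid (EDSH) framework with two tumor-aware setups. The first uses a Boosted Feature Space (BFS), in which independently customized DenseNet and Swin_t branches extract local and global representations that are dimension-aligned, fused, and boosted, increasing sensitivity to diffuse glioma patterns. The second is a hierarchical DenseNet-Swin_t architecture with Deep Feature Extraction and Dual Residual connections (DFE and DR), where DenseNet serves as a stem CNN for structured local features and Swin_t models global tumor morphology, suppressing false negatives for meningioma and pituitary tumors. DenseNet's input stage is tailored to MRI spatial characteristics, and Swin_t uses task-aligned patch embedding and shifted-window self-attention to capture hierarchical global dependencies.

Link: https://arxiv.org/abs/2601.18330
Authors: Muhammad Ali Shah(1),Muhammad Mansoor Alam(1,2),Saddam Hussain Khan(3) ((1) Riphah International University, Islamabad, Pakistan, (2) Multimedia University, Malaysia, (3) University of Engineering and Applied Sciences, Swat, Kanju Township, Pakistan)
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 33 Pages, 8 Tables, Figures 16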

Abstract:This study proposes an efficient Densely Swin Hybrid (EDSH) framework for brain tumor MRI analysis, designed to jointly capture fine-grained texture patterns and long-range contextual dependencies. Two tumor-aware experimental setups are introduced to address class-specific diagnostic challenges. The first setup employs a Boosted Feature Space (BFS), where independently customized DenseNet and Swin_t branches learn complementary local and global representations that are dimension-aligned, fused, and boosted, enabling highly sensitive detection of diffuse glioma patterns by learning the features of irregular shape, poorly defined mass, and heterogeneous texture. The second setup adopts a hierarchical DenseNet-Swin_t architecture with Deep Feature Extraction and Dual Residual connections (DFE and DR), in which DenseNet serves as a stem CNN for structured local feature learning, while Swin_t models global tumor morphology, effectively suppressing false negatives in meningioma and pituitary tumor classification by learning the features of well-defined mass, location (outside the brain), and tumor enlargements (dural tail or upward extension). DenseNet is customized at the input level to match MRI spatial characteristics, leveraging dense residual connectivity to preserve texture information and mitigate vanishing-gradient effects. In parallel, Swin_t is tailored through task-aligned patch embedding and shifted-window self-attention to efficiently capture hierarchical global dependencies. Extensive evaluation on a large-scale MRI dataset (40,260 images across four tumor classes) demonstrates consistent superiority over standalone CNNs, Vision Transformers, and hybrids, achieving 98.50% accuracy and recall on the unseen test set.

[CV-40] SwipeGen: Bridging the Execution Gap in GUI Agents via Human-like Swipe Synthesis

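Quick read: This paper addresses the inaccurate execution of swipe interactions by graphical user interface (GUI) agents, which has become a new bottleneck for task completion because existing agents use overly simplified swipe strategies that do not match human behavior. The key idea is to decompose human swipe gestures into multiple quantifiable dimensions and build SwipeGen, an automated pipeline that synthesizes human-like swipes through GUI exploration; the synthesized data is then used to train GUISwiper, a GUI agent with stronger interaction execution, which reaches 69.07% swipe execution accuracy, a 214% improvement over existing vision-language model (VLM) baselines.

Link: https://arxiv.org/abs/2601.18305
Authors: Xuan Wang,Siyuan Su,Quantong Fu,Yongxiang Hu,Yangfan Zhou
Affiliations: Fudan University; Shanghai Key Laboratory of Intelligent Information Processing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 3 figures. Under review. Code and dataset will be released upon acceptance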

Abstract:With the widespread adoption of Graphical User Interface (GUI) agents for automating GUI interaction tasks, substantial research has focused on improving GUI perception to ground task instructions into concrete action steps. However, the step execution capability of these agents has gradually emerged as a new bottleneck for task completion. In particular, existing GUI agents often adopt overly simplified strategies for handling swipe interactions, preventing them from accurately replicating human-like behavior. To address this limitation, we decompose human swipe gestures into multiple quantifiable dimensions and propose an automated pipeline, SwipeGen, to synthesize human-like swipe interactions through GUI exploration. Based on this pipeline, we construct and release the first benchmark for evaluating the swipe execution capability of GUI agents. Furthermore, leveraging the synthesized data, we propose GUISwiper, a GUI agent with enhanced interaction execution capabilities. Experimental results demonstrate that GUISwiper achieves a swipe execution accuracy of 69.07%, representing a 214% improvement over existing VLM baselines.

[CV-41] Contextual Range-View Projection for 3D LiDAR Point Clouds

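Quick read: This paper addresses the many-to-one conflict when projecting 3D LiDAR point clouds onto 2D range images: when several 3D points fall on the same pixel, conventional methods keep only the one with the smallest depth (closest to the LiDAR), ignoring semantics and object structure and losing contextual information. The key idea is to extend the depth-based selection rule with two mechanisms. Centerness-Aware Projection (CAP) adjusts point depths so that points near instance centers are preferred, and Class-Weighted-Aware Projection (CWAP) uses user-defined class weights to prioritize chosen classes. CAP improves mIoU by up to 3.1%, and CWAP improves targeted classes with negligible impact on the others.

Link: https://arxiv.org/abs/2601.18301
Authors: Seyedali Mousavi,Seyedhamidreza Mousavi,Masoud Daneshtalab
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: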

Abstract:Range-view projection provides an efficient method for transforming 3D LiDAR point clouds into 2D range image representations, enabling effective processing with 2D deep learning models. However, a major challenge in this projection is the many-to-one conflict, where multiple 3D points are mapped onto the same pixel in the range image, requiring a selection strategy. Existing approaches typically retain the point with the smallest depth (closest to the LiDAR), disregarding semantic relevance and object structure, which leads to the loss of important contextual information. In this paper, we extend the depth-based selection rule by incorporating contextual information from both instance centers and class labels, introducing two mechanisms: Centerness-Aware Projection (CAP) and Class-Weighted-Aware Projection (CWAP). In CAP, point depths are adjusted according to their distance from the instance center, thereby prioritizing central instance points over noisy boundary and background points. In CWAP, object classes are prioritized through user-defined weights, offering flexibility in the projection strategy. Our evaluations on the SemanticKITTI dataset show that CAP preserves more instance points during projection, achieving up to a 3.1% mIoU improvement compared to the baseline. Furthermore, CWAP enhances the performance of targeted classes while having a negligible impact on the performance of other classes.
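
The many-to-one conflict and a centerness-style tie-break can be sketched with NumPy. The spherical projection below is the commonly used range-view mapping; the field-of-view defaults and the `centerness_adjusted_depth` rule (depth plus a weighted distance to the instance center) are illustrative assumptions, not the exact CAP formulation from the paper.

```python
import numpy as np

def range_projection(points, height=64, width=1024,
                     fov_up_deg=3.0, fov_down_deg=-25.0, selection_key=None):
    """Return a (height, width) image of point indices, -1 where no point lands.

    If several points fall on one pixel, the one with the smallest selection_key
    is kept; the default key is the depth, i.e. the closest point wins.
    """
    depth = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(points[:, 1], points[:, 0])
    pitch = np.arcsin(points[:, 2] / np.maximum(depth, 1e-8))
    fov_up, fov_down = np.radians(fov_up_deg), np.radians(fov_down_deg)
    col = np.floor(0.5 * (1.0 - yaw / np.pi) * width).astype(int)
    row = np.floor((1.0 - (pitch - fov_down) / (fov_up - fov_down)) * height).astype(int)
    pixel = np.clip(row, 0, height - 1) * width + np.clip(col, 0, width - 1)
    key = depth if selection_key is None else np.asarray(selection_key, dtype=float)
    # Sort by pixel, then by key, and keep the first point of every pixel group.
    order = np.lexsort((key, pixel))
    unique_pixels, first = np.unique(pixel[order], return_index=True)
    index_image = np.full(height * width, -1, dtype=int)
    index_image[unique_pixels] = order[first]
    return index_image.reshape(height, width)

def centerness_adjusted_depth(points, instance_centers, alpha=1.0):
    """Hypothetical CAP-like key: penalize points far from their instance center."""
    depth = np.linalg.norm(points, axis=1)
    return depth + alpha * np.linalg.norm(points - instance_centers, axis=1)

# Two points on the same ray: index 0 is closer, index 1 sits at the instance center.
points = np.array([[10.0, 0.0, 0.0], [20.0, 0.0, 0.0]])
centers = np.array([[20.0, 0.0, 0.0], [20.0, 0.0, 0.0]])
baseline = range_projection(points)
adjusted = range_projection(
    points, selection_key=centerness_adjusted_depth(points, centers, alpha=2.0))
```

With the default key the closer point occupies the shared pixel; with the adjusted key the point at the instance center is kept instead. The explicit sort avoids relying on the write order of NumPy assignments with repeated indices, which is not guaranteed.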

[CV-42] Revisiting Aerial Scene Classification on the AID Benchmark

Quick read: This paper addresses scene classification in aerial images, where heterogeneous content (buildings, forests, mountains, unoccupied land) makes robust modeling difficult. It proposes Aerial-Y-Net, a spatial attention-enhanced CNN with multi-scale feature fusion, which reaches 91.72% accuracy on the AID dataset and outperforms several baseline architectures.

Link: https://arxiv.org/abs/2601.18263
Authors: Subhajeet Das,Susmita Ghosh,Abhiroop Chatterjee
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Presented at the IEEE India Geoscience and Remote Sensing Symposium 2025 and accepted for publication in IEEE Xplore

Abstract:Aerial images play a vital role in urban planning and environmental preservation, as they consist of various structures, representing different types of buildings, forests, mountains, and unoccupied lands. Due to their heterogeneous nature, developing robust models for scene classification remains a challenge. In this study, we conduct a literature review of various machine learning methods for aerial image classification. Our survey covers a range of approaches from handcrafted features (e.g., SIFT, LBP) to traditional CNNs (e.g., VGG, GoogLeNet), and advanced deep hybrid networks. We also design Aerial-Y-Net, a spatial attention-enhanced CNN with a multi-scale feature fusion mechanism, which acts as an attention-based model and helps us better understand the complexities of aerial images. Evaluated on the AID dataset, our model achieves 91.72% accuracy, outperforming several baseline architectures.

[CV-43] Depth to Anatomy: Learning Internal Organ Locations from Surface Depth Images

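Quick read: This paper addresses imprecise patient positioning in medical imaging, which lowers scanning efficiency and degrades patient experience, and asks how non-invasive depth sensing can quickly estimate internal organ positions. The key idea is a learning-based framework that predicts the 3D locations and shapes of multiple internal organs directly from a single 2D depth image, without explicit surface reconstruction. Depth images paired with anatomical segmentations are synthesized from a large-scale full-body MRI dataset to train a unified convolutional neural network (CNN) that localizes diverse structures, including bones and soft tissues.

Link: https://arxiv.org/abs/2601.18260
Authors: Eytan Kats,Kai Geissler,Daniel Mensing,Jochen G. Hirsch,Stefan Heldman,Mattias P. Heinrich
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: preprint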

Abstract:Automated patient positioning plays an important role in optimizing scanning procedure and improving patient throughput. Leveraging depth information captured by RGB-D cameras presents a promising approach for estimating internal organ positions, thereby enabling more accurate and efficient positioning. In this work, we propose a learning-based framework that directly predicts the 3D locations and shapes of multiple internal organs from single 2D depth images of the body surface. Utilizing a large-scale dataset of full-body MRI scans, we synthesize depth images paired with corresponding anatomical segmentations to train a unified convolutional neural network architecture. Our method accurately localizes a diverse set of anatomical structures, including bones and soft tissues, without requiring explicit surface reconstruction. Experimental results demonstrate the potential of integrating depth sensors into radiology workflows to streamline scanning procedures and enhance patient experience through automated patient positioning.
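
One simple way to picture how a depth image could be synthesized from a volumetric scan is shown below. It assumes a binary body mask derived from the MRI, an orthographic camera looking along the first axis, and isotropic voxels; these choices and the function name are assumptions made for illustration, since the abstract does not describe the authors' rendering procedure.

```python
import numpy as np

def depth_from_body_mask(body_mask, voxel_mm=1.0):
    """Orthographic depth map: distance to the first body voxel along axis 0.

    body_mask has shape (depth, height, width); pixels whose ray never hits
    the body are set to NaN.
    """
    body_mask = np.asarray(body_mask, dtype=bool)
    hit = body_mask.any(axis=0)
    first_index = body_mask.argmax(axis=0).astype(float)  # first True along each ray
    return np.where(hit, first_index * voxel_mm, np.nan)

# Toy volume: the body starts at index 2 for one ray and is absent elsewhere.
toy_mask = np.zeros((4, 2, 2), dtype=bool)
toy_mask[2:, 0, 0] = True
toy_depth = depth_from_body_mask(toy_mask, voxel_mm=1.5)
```

Pairing such a depth map with the organ segmentation of the same volume would give one training sample of the kind the abstract describes.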

[CV-44] Co-PLNet: A Collaborative Point-Line Network for Prompt-Guided Wireframe Parsing

Quick read: This paper addresses the mismatches and reduced robustness that arise when line segments and junctions are predicted separately and reconciled afterwards in wireframe parsing. It proposes Co-PLNet, a point-line collaborative framework: a Point-Line Prompt Encoder (PLP-Encoder) converts early detections into spatially aligned geometric prompt maps, and a Cross-Guidance Line Decoder (CGL-Decoder) refines predictions with sparse attention conditioned on complementary prompts, enforcing point-line consistency and improving accuracy and real-time efficiency.

Link: https://arxiv.org/abs/2601.18252
Authors: Chao Wang,Xuanying Li,Cheng Dai,Jinglei Feng,Yuxiang Luo,Yuqi Ouyang,Hao Qin
Affiliations: Sichuan University; North Sichuan Medical College; Waseda University; University College Dublin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Wireframe parsing aims to recover line segments and their junctions to form a structured geometric representation useful for downstream tasks such as Simultaneous Localization and Mapping (SLAM). Existing methods predict lines and junctions separately and reconcile them post-hoc, causing mismatches and reduced robustness. We present Co-PLNet, a point-line collaborative framework that exchanges spatial cues between the two tasks, where early detections are converted into spatial prompts via a Point-Line Prompt Encoder (PLP-Encoder), which encodes geometric attributes into compact and spatially aligned maps. A Cross-Guidance Line Decoder (CGL-Decoder) then refines predictions with sparse attention conditioned on complementary prompts, enforcing point-line consistency and efficiency. Experiments on Wireframe and YorkUrban show consistent improvements in accuracy and robustness, together with favorable real-time efficiency, demonstrating our effectiveness for structured geometry perception.

[CV-45] A multimodal vision foundation model for generalizable knee pathology

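Quick read: This paper addresses limitations of current AI methods for musculoskeletal imaging: fragmented task-specific supervised pipelines, dependence on large annotated datasets, and weak generalization across modalities and clinical scenarios. It introduces OrthoFoundation, a multimodal vision foundation model for musculoskeletal pathology, pre-trained with self-supervised contrastive learning on 1.2 million unlabeled knee X-ray and MRI images using a DINOv3 backbone. The model achieves state-of-the-art (SOTA) performance on 14 downstream tasks, matches supervised baselines with only 50% of the labeled data, and generalizes from the knee to the hip, shoulder, and ankle.

Link: https://arxiv.org/abs/2601.18250
Authors: Kang Yu,Dingyu Wang,Zimu Yuan,Nan Zhou,Jiajun Liu,Jiaxin Liu,Shanggui Liu,Yaoyan Zheng,Huishu Yuan,Di Huang,Dong Jiang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: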

Abstract:Musculoskeletal disorders represent a leading cause of global disability, creating an urgent demand for precise interpretation of medical imaging. Current artificial intelligence (AI) approaches in orthopedics predominantly rely on task-specific, supervised learning paradigms. These methods are inherently fragmented, require extensive annotated datasets, and often lack generalizability across different modalities and clinical scenarios. The development of foundation models in this field has been constrained by the scarcity of large-scale, curated, and open-source musculoskeletal datasets. To address these challenges, we introduce OrthoFoundation, a multimodal vision foundation model optimized for musculoskeletal pathology. We constructed a pre-training dataset of 1.2 million unlabeled knee X-ray and MRI images from internal and public databases. Utilizing a DINOv3 backbone, the model was trained via self-supervised contrastive learning to capture robust radiological representations. OrthoFoundation achieves state-of-the-art (SOTA) performance across 14 downstream tasks. It attained superior accuracy in X-ray osteoarthritis diagnosis and ranked first in MRI structural injury detection. The model demonstrated remarkable label efficiency, matching supervised baselines using only 50% of labeled data. Furthermore, despite being pre-trained on knee images, OrthoFoundation exhibited exceptional cross-anatomy generalization to the hip, shoulder, and ankle. OrthoFoundation represents a significant advancement toward general-purpose AI for musculoskeletal imaging. By learning fundamental, joint-agnostic radiological semantics from large-scale multimodal data, it overcomes the limitations of conventional models and provides a robust framework for reducing annotation burdens and enhancing diagnostic accuracy in clinical practice.

[CV-46] Vision-Language-Model-Guided Differentiable Ray Tracing for Fast and Accurate Multi-Material RF Parameter Estimation

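[Quick read]: This paper addresses the need for accurate radio-frequency (RF) material parameter estimation in electromagnetic digital twins for 6G systems, and in particular the sensitivity to initialization and the high computational cost under limited measurements of conventional gradient-based inverse ray tracing (RT). The key to the solution is a vision-language model (VLM) that guides a differentiable ray tracing (DRT) engine. First, the VLM parses scene images to identify material categories and maps them to quantitative priors through an ITU-R material table, yielding better-informed conductivity initializations. Second, the VLM selects transmitter/receiver placements that excite diverse, material-discriminative propagation paths. The resulting semantic priors markedly accelerate and stabilize the gradient-based optimization in DRT, achieving sub-0.1% mean relative error with only a few receivers while reducing the overall measurement requirements and iteration complexity.

Link: https://arxiv.org/abs/2601.18242
Authors: Zerui Kang,Yishen Lim,Zhouyou Gu,Seung-Woo Ko,Tony Q.S. Quek,Jihong Park
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)
Comments: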

Abstract:Accurate radio-frequency (RF) material parameters are essential for electromagnetic digital twins in 6G systems, yet gradient-based inverse ray tracing (RT) remains sensitive to initialization and costly under limited measurements. This paper proposes a vision-language-model (VLM) guided framework that accelerates and stabilizes multi-material parameter estimation in a differentiable RT (DRT) engine. A VLM parses scene images to infer material categories and maps them to quantitative priors via an ITU-R material table, yielding informed conductivity initializations. The VLM further selects informative transmitter/receiver placements that promote diverse, material-discriminative paths. Starting from these priors, the DRT performs gradient-based refinement using measured received signal strengths. Experiments in NVIDIA Sionna on indoor scenes show 2-4× faster convergence and 10-100× lower final parameter error compared with uniform or random initialization and random placement baselines, achieving sub-0.1% mean relative error with only a few receivers. Complexity analyses indicate per-iteration time scales near-linearly with the number of materials and measurement setups, while VLM-guided placement reduces the measurements required for accurate recovery. Ablations over RT depth and ray counts confirm further accuracy gains without significant per-iteration overhead. Results demonstrate that semantic priors from VLMs effectively guide physics-based optimization for fast and reliable RF material estimation.

[CV-47] V-Loop: Visual Logical Loop Verification for Hallucination Detection in Medical Visual Question Answering

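[Quick read]: This paper addresses the tendency of multimodal large language models (MLLMs) to hallucinate in medical visual question answering (VQA), that is, to produce outputs that contradict the facts in the image, which can have serious consequences in high-stakes medical settings. Existing introspective detection methods based on uncertainty are computationally efficient but fundamentally indirect: they estimate the predictive uncertainty of an image-question pair rather than directly verifying the factual correctness of a specific answer. The paper proposes Visual Logical Loop Verification (V-Loop), a training-free, plug-and-play framework whose core idea is a bidirectional reasoning mechanism. It extracts semantic units from the original question-answer pair, generates a verification question that conditions on the answer unit to re-query the question unit, and enforces visual attention consistency so that both the primary and the verification question rely on the same image evidence. If the verification answer matches the expected semantics, the logical loop closes and the answer is considered factually grounded; otherwise the primary answer is flagged as a hallucination. The method improves detection accuracy while remaining efficient, and it also boosts existing uncertainty-based methods.

Link: https://arxiv.org/abs/2601.18240
Authors: Mengyuan Jin,Zehui Liao,Yong Xia
Affiliation: Northwestern Polytechnical University (西北工业大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: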

Abstract:Multimodal Large Language Models (MLLMs) have shown remarkable capability in assisting disease diagnosis in medical visual question answering (VQA). However, their outputs remain vulnerable to hallucinations (i.e., responses that contradict visual facts), posing significant risks in high-stakes medical scenarios. Recent introspective detection methods, particularly uncertainty-based approaches, offer computational efficiency but are fundamentally indirect, as they estimate predictive uncertainty for an image-question pair rather than verifying the factual correctness of a specific answer. To address this limitation, we propose Visual Logical Loop Verification (V-Loop), a training-free and plug-and-play framework for hallucination detection in medical VQA. V-Loop introduces a bidirectional reasoning process that forms a visually grounded logical loop to verify factual correctness. Given an input, the MLLM produces an answer for the primary input pair. V-Loop extracts semantic units from the primary QA pair, generates a verification question by conditioning on the answer unit to re-query the question unit, and enforces visual attention consistency to ensure that answering both the primary question and the verification question relies on the same image evidence. If the verification answer matches the expected semantic content, the logical loop closes, indicating factual grounding; otherwise, the primary answer is flagged as hallucinated. Extensive experiments on multiple medical VQA benchmarks and MLLMs show that V-Loop consistently outperforms existing introspective methods, remains highly efficient, and further boosts uncertainty-based approaches when used in combination.

[CV-48] Facial Emotion Recognition on FER-2013 using an EfficientNetB2-Based Approach

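[Quick read]: This paper addresses the challenges of facial emotion recognition (FER) from face images in real-world settings, including low image quality, lighting and pose variation, background distraction, small inter-class differences, label noise, and severe class imbalance, all of which are especially pronounced in the FER-2013 dataset (48×48 grayscale images). To tackle them, the authors propose a lightweight and efficient FER pipeline built on the EfficientNetB2 architecture with a two-stage warm-up and fine-tuning strategy, complemented by the AdamW optimizer, decoupled weight decay, label smoothing (epsilon = 0.06), clipped class weights to mitigate class imbalance, dropout, mixed-precision training, and extensive real-time data augmentation. The model reaches a test accuracy of 68.78% with nearly ten times fewer parameters than VGG16-based baselines, and shows stable training and good generalization, making it suitable for real-time and edge-computing scenarios.

Link: https://arxiv.org/abs/2601.18228
Authors: Sahil Naik,Soham Bagayatkar,Pavankumar Singh
Affiliation: VIT, Mumbai (维特大学孟买校区)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 6 pages, 4 figures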

Abstract:Detection of human emotions based on facial images in real-world scenarios is a difficult task due to low image quality, variations in lighting, pose changes, background distractions, small inter-class variations, noisy crowd-sourced labels, and severe class imbalance, as observed in the FER-2013 dataset of 48x48 grayscale images. Although recent approaches using large CNNs such as VGG and ResNet achieve reasonable accuracy, they are computationally expensive and memory-intensive, limiting their practicality for real-time applications. We address these challenges using a lightweight and efficient facial emotion recognition pipeline based on EfficientNetB2, trained using a two-stage warm-up and fine-tuning strategy. The model is enhanced with AdamW optimization, decoupled weight decay, label smoothing (epsilon = 0.06) to reduce annotation noise, and clipped class weights to mitigate class imbalance, along with dropout, mixed-precision training, and extensive real-time data augmentation. The model is trained using a stratified 87.5%/12.5% train-validation split while keeping the official test set intact, achieving a test accuracy of 68.78% with nearly ten times fewer parameters than VGG16-based baselines. Experimental results, including per-class metrics and learning dynamics, demonstrate stable training and strong generalization, making the proposed approach suitable for real-time and edge-based applications.
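
The two imbalance and noise remedies named in the abstract, label smoothing with epsilon = 0.06 and clipped class weights, can be sketched in a few lines. This is only an illustration under stated assumptions: the inverse-frequency weighting formula, the clipping bounds `lo` and `hi`, the uniform smoothing convention, and the class counts in the usage note are all hypothetical and are not taken from the paper.

```python
import math

def clipped_class_weights(counts, lo=0.5, hi=2.0):
    """Inverse-frequency class weights clipped to [lo, hi] (bounds are assumed)."""
    total = sum(counts)
    k = len(counts)
    raw = [total / (k * c) for c in counts]
    return [min(max(w, lo), hi) for w in raw]

def smooth_labels(target, num_classes, eps=0.06):
    """Uniform smoothing: 1 - eps + eps/K on the true class, eps/K elsewhere."""
    off = eps / num_classes
    return [off + (1.0 - eps if i == target else 0.0) for i in range(num_classes)]

def weighted_smoothed_ce(logits, target, weights, eps=0.06):
    """Cross-entropy against smoothed labels, scaled by the true class weight."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))  # stable log-sum-exp
    log_probs = [x - log_z for x in logits]
    q = smooth_labels(target, len(logits), eps)
    return -weights[target] * sum(qi * lp for qi, lp in zip(q, log_probs))
```

For example, with made-up class counts `[100, 10, 1000]` the raw weights 3.7, 37 and 0.37 are clipped to 2.0, 2.0 and 0.5, so a very rare class cannot dominate the loss the way an unclipped weight would.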

[CV-49] HomoFM: Deep Homography Estimation with Flow Matching

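[Quick read]: This paper addresses the insufficient accuracy and weak generalization of existing deep homography estimation methods on complex geometric transformations, especially their unstable behavior under domain shift (for example, multimodal matching or illumination changes). The key to the solution is the HomoFM framework, which for the first time brings flow matching from generative modeling into homography estimation and casts the problem as velocity field learning: by learning a continuous, point-wise velocity field that maps a noise distribution to registered coordinates, it recovers high-precision transformations along a conditional flow trajectory. To improve domain adaptation, a gradient reversal layer (GRL) is integrated into the feature extraction backbone, explicitly constraining the encoder to learn domain-invariant representations and markedly improving robustness.

Link: https://arxiv.org/abs/2601.18222
Authors: Mengfan He,Liangzheng Sun,Chunyu Li,Ziyang Meng
Affiliation: Tsinghua University (清华大学); Beijing Information Science and Technology University (北京信息科技大学); Beijing Institute of Technology (北京理工大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: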

Abstract:Deep homography estimation has broad applications in computer vision and robotics. Remarkable progress has been achieved, yet existing methods typically treat it as a direct regression or iterative refinement problem and often struggle to capture complex geometric transformations or generalize across different domains. In this work, we propose HomoFM, a new framework that introduces the flow matching technique from generative modeling into the homography estimation task for the first time. Unlike the existing methods, we formulate the homography estimation problem as a velocity field learning problem. By modeling a continuous and point-wise velocity field that transforms noisy distributions into registered coordinates, the proposed network recovers high-precision transformations through a conditional flow trajectory. Furthermore, to address the challenge of domain shift, e.g., multimodal matching or varying illumination, we integrate a gradient reversal layer (GRL) into the feature extraction backbone. This domain adaptation strategy explicitly constrains the encoder to learn domain-invariant representations, significantly enhancing the network’s robustness. Extensive experiments demonstrate the effectiveness of the proposed method, showing that HomoFM outperforms state-of-the-art methods in both estimation accuracy and robustness on standard benchmarks. Code and data resources are available at this https URL.
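
The velocity-field formulation can be made concrete with a toy flow matching sketch. It assumes the common linear interpolation path x_t = (1 - t) x_0 + t x_1, whose regression target is the constant velocity x_1 - x_0; the abstract does not say which path HomoFM uses, and the eight-number corner parameterization below is likewise a hypothetical choice. A trained network would replace the oracle velocity used here.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t = 0) to target x1 (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity field along the linear path."""
    return [b - a for a, b in zip(x0, x1)]

def euler_integrate(x0, velocity_fn, steps=10):
    """Follow a velocity field from t = 0 to t = 1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
x1 = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0]  # four (x, y) corners, hypothetical target
x0 = [rng.gauss(0.0, 1.0) for _ in x1]          # noisy starting coordinates
v_star = target_velocity(x0, x1)
# Oracle field for illustration only: a trained model would predict v from (x, t).
x_end = euler_integrate(x0, lambda x, t: v_star, steps=10)
```

With the oracle field, integration lands on the target coordinates up to floating-point error, which is the behavior a learned velocity field is trained to approximate.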

[CV-50] QualiRAG : Retrieval-Augmented Generation for Visual Quality Understanding

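[Quick read]: This paper addresses the labor-intensive annotation and dataset-specific bias that come with the supervised fine-tuning or reinforcement learning on human-annotated data used in current visual quality assessment (VQA). The core challenge is achieving fine-grained spatiotemporal perception and fusion of contextual information without task-specific training, so as to improve a model's understanding of image and video quality. The key to the solution is QualiRAG, a training-free retrieval-augmented generation (RAG) framework. It dynamically constructs four complementary knowledge sources (visual metadata, subject localization, global quality summaries, and local quality descriptions) and uses relevance-aware retrieval to draw on the latent perceptual knowledge of large multimodal models (LMMs) for evidence-grounded reasoning, substantially improving performance on visual quality understanding tasks and delivering competitive results on visual quality comparison tasks.

Link: https://arxiv.org/abs/2601.18195
Authors: Linhan Cao,Wei Sun,Weixia Zhang,Xiangyang Zhu,Kaiwei Zhang,Jun Jia,Dandan Zhu,Guangtao Zhai,Xiongkuo Min
Affiliation: Shanghai Jiao Tong University (上海交通大学); East China Normal University (华东师范大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: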

Abstract:Visual quality assessment (VQA) is increasingly shifting from scalar score prediction toward interpretable quality understanding, a paradigm that demands fine-grained spatiotemporal perception and auxiliary contextual information. Current approaches rely on supervised fine-tuning or reinforcement learning on curated instruction datasets, which involve labor-intensive annotation and are prone to dataset-specific biases. To address these challenges, we propose QualiRAG, a training-free Retrieval-Augmented Generation (RAG) framework that systematically leverages the latent perceptual knowledge of large multimodal models (LMMs) for visual quality perception. Unlike conventional RAG that retrieves from static corpora, QualiRAG dynamically generates auxiliary knowledge by decomposing questions into structured requests and constructing four complementary knowledge sources: visual metadata, subject localization, global quality summaries, and local quality descriptions, followed by relevance-aware retrieval for evidence-grounded reasoning. Extensive experiments show that QualiRAG achieves substantial improvements over open-source general-purpose LMMs and VQA-finetuned LMMs on visual quality understanding tasks, and delivers competitive performance on visual quality comparison tasks, demonstrating robust quality assessment capabilities without any task-specific training. The code will be publicly available at this https URL.

[CV-51] MindCine: Multimodal EEG-to-Video Reconstruction with Large-Scale Pretrained Models

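[Quick read]: This paper addresses the challenge of reconstructing human dynamic visual perception from electroencephalography (EEG) signals. The main problems are, first, the single-modality limitation: existing methods align EEG only with the text modality and ignore other modalities, which makes them prone to overfitting; and second, data scarcity: models struggle to converge on the limited EEG-video data available. The key to the solution is a new framework named MindCine, which uses a multimodal joint learning strategy to incorporate modalities beyond text and improve generalization, and leverages a pre-trained large-scale EEG model to ease data scarcity and strengthen semantic decoding. A sequence-to-sequence (Seq2Seq) model with causal attention is designed specifically to decode perceptual information, enabling high-quality video reconstruction.

Link: https://arxiv.org/abs/2601.18192
Authors: Tian-Yi Zhou,Xuan-Hao Liu,Bao-Liang Lu,Wei-Long Zheng
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
Comments: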

Abstract:Reconstructing human dynamic visual perception from electroencephalography (EEG) signals is of great research significance owing to EEG’s non-invasiveness and high temporal resolution. However, EEG-to-video reconstruction remains challenging due to: 1) Single Modality: existing studies solely align EEG signals with the text modality, which ignores other modalities and is prone to overfitting; 2) Data Scarcity: current methods often have difficulty converging when trained on limited EEG-video data. To solve the above problems, we propose a novel framework, MindCine, to achieve high-fidelity video reconstruction on limited data. We employ a multimodal joint learning strategy to incorporate beyond-text modalities in the training stage and leverage a pre-trained large EEG model to relieve the data scarcity issue for decoding semantic information, while a Seq2Seq model with causal attention is specifically designed for decoding perceptual information. Extensive experiments demonstrate that our model outperforms state-of-the-art methods both qualitatively and quantitatively. Additionally, the results underscore the complementary strengths of different modalities and demonstrate that leveraging a large-scale EEG model can further enhance reconstruction performance by alleviating the challenges associated with limited data.

[CV-52] Multi-Perspective Subimage CLIP with Keyword Guidance for Remote Sensing Image-Text Retrieval

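[Quick read]: This paper addresses two challenges that current vision-language pre-training (VLP) models face in remote sensing image-text retrieval (RSITR): existing methods rely mainly on coarse-grained global alignment and struggle to capture the dense, multi-scale semantics of high-resolution remote sensing imagery, and adaptation by full fine-tuning is computationally expensive and prone to catastrophic forgetting. The key to the solution is MPS-CLIP, a parameter-efficient framework. It uses a large language model (LLM) to extract keywords that guide the Segment Anything Model (SamGeo) in generating semantically relevant sub-perspectives, a Gated Global Attention (G²A) adapter that captures global context and long-range dependencies while the backbone stays frozen, and a Multi-Perspective Representation (MPR) module that aggregates local cues into robust multi-perspective embeddings. A hybrid objective combining a multi-perspective contrastive loss with a weighted triplet loss dynamically selects the maximum-response perspectives to suppress noise and enforce precise semantic matching.

Link: https://arxiv.org/abs/2601.18190
Authors: Yifan Li,Shiying Wang,Jianqiang Huang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 3 figures. Code: this https URL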

Abstract:Vision-Language Pre-training (VLP) models like CLIP have significantly advanced Remote Sensing Image-Text Retrieval (RSITR). However, existing methods predominantly rely on coarse-grained global alignment, which often overlooks the dense, multi-scale semantics inherent in overhead imagery. Moreover, adapting these heavy models via full fine-tuning incurs prohibitive computational costs and risks catastrophic forgetting. To address these challenges, we propose MPS-CLIP, a parameter-efficient framework designed to shift the retrieval paradigm from global matching to keyword-guided fine-grained alignment. Specifically, we leverage a Large Language Model (LLM) to extract core semantic keywords, guiding the Segment Anything Model (SamGeo) to generate semantically relevant sub-perspectives. To efficiently adapt the frozen backbone, we introduce a Gated Global Attention (G^2A) adapter, which captures global context and long-range dependencies with minimal overhead. Furthermore, a Multi-Perspective Representation (MPR) module aggregates these local cues into robust multi-perspective embeddings. The framework is optimized via a hybrid objective combining multi-perspective contrastive and weighted triplet losses, which dynamically selects maximum-response perspectives to suppress noise and enforce precise semantic matching. Extensive experiments on the RSICD and RSITMD benchmarks demonstrate that MPS-CLIP achieves state-of-the-art performance with 35.18% and 48.40% mean Recall (mR), respectively, significantly outperforming full fine-tuning baselines and recent competitive methods. Code is available at this https URL.

[CV-53] NaVIDA: Vision-Language Navigation with Inverse Dynamics Augmentation

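[Quick read]: This paper addresses the unstable behavior, weak generalization, and cumulative trajectory error in vision-and-language navigation (VLN) that stem from the lack of explicit modeling of vision-action causality. The key to the solution is NaVIDA (Navigation with Inverse Dynamics Augmentation), a unified framework that introduces chunk-based inverse-dynamics supervision to explicitly learn the causal relationship between visual changes and the corresponding actions. It uses hierarchical probabilistic action chunking (HPAC) to structure the supervision signal and extend the effective planning range, and an entropy-guided mechanism that adaptively sets the execution horizon of action chunks, curbing error accumulation and stabilizing behavior at inference.

Link: https://arxiv.org/abs/2601.18188
Authors: Weiye Zhu,Zekai Zhang,Xiangchen Wang,Hewei Pan,Teng Wang,Tiantian Geng,Rongtao Xu,Feng Zheng
Affiliation: SUSTech; MBZUAI; SpatialTemporal AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 14 figures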

Abstract:Vision-and-Language Navigation (VLN) requires agents to interpret natural language instructions and act coherently in visually rich environments. However, most existing methods rely on reactive state-action mappings without explicitly modeling how actions causally transform subsequent visual observations. Lacking such vision-action causality, agents cannot anticipate the visual changes induced by their own actions, leading to unstable behaviors, weak generalization, and cumulative error along the trajectory. To address these issues, we introduce NaVIDA (Navigation with Inverse Dynamics Augmentation), a unified VLN framework that couples policy learning with action-grounded visual dynamics and adaptive execution. NaVIDA augments training with chunk-based inverse-dynamics supervision to learn the causal relationship between visual changes and corresponding actions. To structure this supervision and extend the effective planning range, NaVIDA employs hierarchical probabilistic action chunking (HPAC), which organizes trajectories into multi-step chunks and provides discriminative, longer-range visual-change cues. To further curb error accumulation and stabilize behavior at inference, an entropy-guided mechanism adaptively sets the execution horizon of action chunks. Extensive experiments show that NaVIDA achieves superior navigation performance compared to state-of-the-art methods with fewer parameters (3B vs. 8B). Real-world robot evaluations further validate the practical feasibility and effectiveness of our approach. Code and data will be available upon acceptance.
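
One plausible reading of the entropy-guided execution horizon is sketched below: run the steps of a predicted action chunk while the policy is confident and cut the chunk at the first high-entropy step. The threshold `max_entropy`, the `min_steps` floor, and the stopping rule itself are assumptions made for illustration; the abstract does not specify how NaVIDA computes the horizon.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def execution_horizon(chunk_probs, max_entropy=0.8, min_steps=1):
    """Number of leading chunk steps to execute before re-planning.

    chunk_probs holds one action distribution per step of the predicted chunk.
    Execution stops at the first step whose entropy exceeds max_entropy,
    but at least min_steps steps are always executed.
    """
    horizon = 0
    for probs in chunk_probs:
        if horizon >= min_steps and entropy(probs) > max_entropy:
            break
        horizon += 1
    return horizon

chunk = [
    [0.90, 0.05, 0.03, 0.02],  # confident
    [0.80, 0.10, 0.05, 0.05],  # still fairly confident
    [0.40, 0.30, 0.20, 0.10],  # uncertain: stop before this step
    [0.97, 0.01, 0.01, 0.01],  # ignored, the chunk was already cut
]
steps_to_run = execution_horizon(chunk)
```

The design intent is that errors made under uncertainty are not carried through the rest of the chunk, which matches the error-accumulation motivation stated in the abstract.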

[CV-54] YOLO-DS: Fine-Grained Feature Decoupling via Dual-Statistic Synergy Operator for Object Detection

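[Quick read]: This paper addresses the lack of explicit modeling of heterogeneous object responses within shared feature channels in existing one-stage object detectors such as the YOLO series, which limits further performance gains. The key to the solution is a novel Dual-Statistic Synergy Operator (DSO) that decouples object features by jointly modeling the channel-wise mean and the peak-to-mean difference. Two lightweight gating modules build on it: the Dual-Statistic Synergy Gating (DSG) module for adaptive channel-wise feature selection and the Multi-Path Segmented Gating (MSG) module for depth-wise feature weighting. Together they strengthen the model's ability to discriminate heterogeneous objects while keeping inference efficient.

Link: https://arxiv.org/abs/2601.18172
Authors: Lin Huang,Yujuan Tan,Weisheng Li,Shitai Shan,Liu Liu,Bo Liu,Linlin Shen,Jing Yu,Yue Niu
Affiliation: National University of Defense Technology (国防科技大学); Chongqing University of Posts and Telecommunications (重庆邮电大学); Inspur (浪潮); Shenzhen University (深圳大学); Chongqing University (重庆大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: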

Abstract:One-stage object detection, particularly the YOLO series, strikes a favorable balance between accuracy and efficiency. However, existing YOLO detectors lack explicit modeling of heterogeneous object responses within shared feature channels, which limits further performance gains. To address this, we propose YOLO-DS, a framework built around a novel Dual-Statistic Synergy Operator (DSO). The DSO decouples object features by jointly modeling the channel-wise mean and the peak-to-mean difference. Building upon the DSO, we design two lightweight gating modules: the Dual-Statistic Synergy Gating (DSG) module for adaptive channel-wise feature selection, and the Multi-Path Segmented Gating (MSG) module for depth-wise feature weighting. On the MS-COCO benchmark, YOLO-DS consistently outperforms YOLOv8 across five model scales (N, S, M, L, X), achieving AP gains of 1.1% to 1.7% with only a minimal increase in inference latency. Extensive visualization, ablation, and comparative studies validate the effectiveness of our approach, demonstrating its superior capability in discriminating heterogeneous objects with high efficiency.
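
The two statistics the DSO is built on, the channel-wise mean and the peak-to-mean difference, are easy to compute, and a small sketch shows why the second one adds information. The sigmoid gate and its weights `w_mean`, `w_peak` and `bias` are hypothetical stand-ins; the abstract does not describe how the actual DSG module combines the statistics.

```python
import math

def dual_statistics(channel):
    """Return (mean, peak-to-mean difference) for one flattened feature channel."""
    mean = sum(channel) / len(channel)
    return mean, max(channel) - mean

def gate(channel, w_mean=1.0, w_peak=1.0, bias=0.0):
    """Hypothetical scalar gate in (0, 1) driven by both statistics."""
    mean, peak_to_mean = dual_statistics(channel)
    z = w_mean * mean + w_peak * peak_to_mean + bias
    return 1.0 / (1.0 + math.exp(-z))

def apply_gates(feature_map, **gate_kwargs):
    """Rescale every channel of a feature map by its own gate value."""
    return [[gate(ch, **gate_kwargs) * v for v in ch] for ch in feature_map]

flat = [1.0, 1.0, 1.0, 1.0]    # uniform response
peaky = [0.0, 0.0, 0.0, 4.0]   # same mean, one strong localized response
```

Both toy channels have a mean of 1.0, so average pooling alone cannot tell them apart, while the peak-to-mean difference is 0.0 for `flat` and 3.0 for `peaky`.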

[CV-55] TempDiffReg: Temporal Diffusion Model for Non-Rigid 2D-3D Vascular Registration

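[Quick read]: This paper addresses the difficulty of precise localization in transarterial chemoembolization (TACE), caused by complex intra-operative vascular navigation and anatomical variability. The core goal is accurate and robust registration between two-dimensional (2D) and three-dimensional (3D) vessels to support real-time guidance of microcatheters and instruments. The key to the solution is a coarse-to-fine registration strategy: a structure-aware perspective-n-point (SA-PnP) module first establishes global correspondence between 2D and 3D vessel structures, and the Temporal Diffusion Registration (TempDiffReg) model then iteratively refines vessel deformation through a temporal diffusion mechanism, using temporal context to capture complex anatomical variations and local structural differences. Evaluated on 626 paired multi-frame samples, the method reports a mean squared error (MSE) of 0.63 mm and a mean absolute error (MAE) of 0.51 mm, which are 66.7% and 17.7% lower, respectively, than the most competitive existing approaches, and it has the potential to improve the safety and efficiency of clinical procedures.

Link: https://arxiv.org/abs/2601.18168
Authors: Zehua Liu,Shihao Zou,Jincai Huang,Yanfang Zhang,Chao Tong,Weixin Si
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE BIBM 2025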

Abstract:Transarterial chemoembolization (TACE) is a preferred treatment option for hepatocellular carcinoma and other liver malignancies, yet it remains a highly challenging procedure due to complex intra-operative vascular navigation and anatomical variability. Accurate and robust 2D-3D vessel registration is essential to guide microcatheter and instruments during TACE, enabling precise localization of vascular structures and optimal therapeutic targeting. To tackle this issue, we develop a coarse-to-fine registration strategy. First, we introduce a global alignment module, structure-aware perspective n-point (SA-PnP), to establish correspondence between 2D and 3D vessel structures. Second, we propose TempDiffReg, a temporal diffusion model that performs vessel deformation iteratively by leveraging temporal context to capture complex anatomical variations and local structural changes. We collected data from 23 patients and constructed 626 paired multi-frame samples for comprehensive evaluation. Experimental results demonstrate that the proposed method consistently outperforms state-of-the-art (SOTA) methods in both accuracy and anatomical plausibility. Specifically, our method achieves a mean squared error (MSE) of 0.63 mm and a mean absolute error (MAE) of 0.51 mm in registration accuracy, representing 66.7% lower MSE and 17.7% lower MAE compared to the most competitive existing approaches. It has the potential to assist less-experienced clinicians in safely and efficiently performing complex TACE procedures, ultimately enhancing both surgical outcomes and patient care. Code and data are available at: this https URL

[CV-56] Agentic Very Long Video Understanding

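[Quick read]: This paper addresses a key challenge in long-horizon video understanding: achieving cross-modal, temporally coherent contextual understanding with compositional reasoning over continuous egocentric video streams that span days or even weeks. Existing approaches such as large language models and retrieval-augmented generation are constrained by limited context windows and cannot perform multi-hop reasoning over very long videos. The core of the solution is the EGAgent framework, which centers on entity scene graphs that model people, places, objects, and their relationships as they evolve over time. A planning agent is equipped with tools for structured search, multimodal (visual and audio) retrieval, and compositional reasoning, supporting fine-grained, cross-modal, and temporally coherent question answering.

Link: https://arxiv.org/abs/2601.18157
Authors: Aniket Rege,Arka Sadhu,Yuliang Li,Kejie Li,Ramya Korlakai Vinayak,Yuning Chai,Yong Jae Lee,Hyo Jin Kim
Affiliation: Meta
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 26 pages, 7 figures, 8 tables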

Abstract:The advent of always-on personal AI assistants, enabled by all-day wearable devices such as smart glasses, demands a new level of contextual understanding, one that goes beyond short, isolated events to encompass the continuous, longitudinal stream of egocentric video. Achieving this vision requires advances in long-horizon video understanding, where systems must interpret and recall visual and audio information spanning days or even weeks. Existing methods, including large language models and retrieval-augmented generation, are constrained by limited context windows and lack the ability to perform compositional, multi-hop reasoning over very long video streams. In this work, we address these challenges through EGAgent, an enhanced agentic framework centered on entity scene graphs, which represent people, places, objects, and their relationships over time. Our system equips a planning agent with tools for structured search and reasoning over these graphs, as well as hybrid visual and audio search capabilities, enabling detailed, cross-modal, and temporally coherent reasoning. Experiments on the EgoLifeQA and Video-MME (Long) datasets show that our method achieves state-of-the-art performance on EgoLifeQA (57.5%) and competitive performance on Video-MME (Long) (74.1%) for complex longitudinal video understanding tasks.

[CV-57] Forward Consistency Learning with Gated Context Aggregation for Video Anomaly Detection

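[Quick read]: This paper addresses the difficulty of deploying video anomaly detection (VAD) on resource-constrained edge devices, and the fact that existing prediction-based methods rely only on single-frame future prediction error and ignore richer long-term temporal constraints. The key to the solution is FoGA, a lightweight model with three main contributions: (1) a U-Net-based dual-path prediction scheme that produces both immediate and forward predictions; (2) a gated context aggregation module in the skip connections that dynamically fuses encoder and decoder features at the same scale; and (3) a novel forward consistency loss combined with a hybrid anomaly measurement strategy that integrates the prediction errors of immediate and forward frames. With only about 2M parameters, the method runs at up to 155 FPS while substantially outperforming state-of-the-art methods in detection performance.

Link: https://arxiv.org/abs/2601.18135
Authors: Jiahao Lyu,Minghua Zhao,Xuewen Huang,Yifei Chen,Shuangli Du,Jing Hu,Cheng Shi,Zhiyong Lv
Affiliation: Xi’an University of Technology (西安理工大学); Xi’an Jiaotong University (西安交通大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to the KBS journal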

Abstract:As a crucial element of public security, video anomaly detection (VAD) aims to measure deviations from normal patterns for various events in real-time surveillance systems. However, most existing VAD methods rely on large-scale models to pursue extreme accuracy, limiting their feasibility on resource-limited edge devices. Moreover, mainstream prediction-based VAD detects anomalies using only single-frame future prediction errors, overlooking the richer constraints from longer-term temporal forward information. In this paper, we introduce FoGA, a lightweight VAD model that performs Forward consistency learning with Gated context Aggregation, containing about 2M parameters and tailored for potential edge devices. Specifically, we propose a Unet-based method that performs feature extraction on consecutive frames to generate both immediate and forward predictions. Then, we introduce a gated context aggregation module into the skip connections to dynamically fuse encoder and decoder features at the same spatial scale. Finally, the model is jointly optimized with a novel forward consistency loss, and a hybrid anomaly measurement strategy is adopted to integrate errors from both immediate and forward frames for more accurate detection. Extensive experiments demonstrate the effectiveness of the proposed method, which substantially outperforms state-of-the-art competing methods, running up to 155 FPS. Hence, our FoGA achieves an excellent trade-off between performance and the efficiency metric.
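
A hybrid anomaly measurement that integrates immediate and forward prediction errors could look like the following sketch. The per-video min-max normalization, the mean squared error as the frame-level error, and the mixing weight `alpha` are assumptions chosen for illustration; they are not FoGA's published scoring rule.

```python
def mse(pred, target):
    """Mean squared error between two flattened frames."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def min_max(values):
    """Rescale a sequence of per-frame errors to [0, 1] within one video."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def hybrid_scores(immediate_errors, forward_errors, alpha=0.5):
    """Blend normalized immediate-frame and forward-frame prediction errors."""
    a = min_max(immediate_errors)
    b = min_max(forward_errors)
    return [alpha * x + (1.0 - alpha) * y for x, y in zip(a, b)]

# Hypothetical per-frame errors for a three-frame clip.
scores = hybrid_scores([1.0, 2.0, 3.0], [2.0, 2.0, 6.0])
```

In this toy clip the last frame has the largest error under both predictions and receives the top score, while the middle frame, which is only mildly off in the immediate prediction, gets a low blended score.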

[CV-58] Grasp-and-Lift: Executable 3D Hand-Object Interaction Reconstruction via Physics-in-the-Loop Optimization

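[Quick read]: This paper addresses the physically implausible interactions, such as penetration, missed contact, and unstable grasps, that appear when existing large-scale hand manipulation datasets (such as DexYCB and HO3D) are replayed in physics simulation. The core challenge is that these data are well aligned visually but lack physical feasibility. The key to the solution is a simulation-in-the-loop optimization framework that converts the original trajectories into physically executable motions: hand motion is parameterized with a low-dimensional spline built on sparse temporal keyframes, and the gradient-free optimizer CMA-ES treats a high-fidelity physics engine as a black-box objective, maximizing physical success (for example, a stable grasp and lift) while staying close to the human demonstration. Compared with the recent transfer pipeline MANIPTRANS, the method lowers hand and object pose errors during replay and recovers hand-object physical interactions more accurately.

Link: https://arxiv.org/abs/2601.18121
Authors: Byeonggyeol Choi,Woojin Oh,Jongwoo Lim
Affiliation: Seoul National University (首尔国立大学); ME; IPAI
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 7 figures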

Abstract:Dexterous hand manipulation increasingly relies on large-scale motion datasets with precise hand-object trajectory data. However, existing resources such as DexYCB and HO3D are primarily optimized for visual alignment but often yield physically implausible interactions when replayed in physics simulators, including penetration, missed contact, and unstable grasps. We propose a simulation-in-the-loop refinement framework that converts these visually aligned trajectories into physically executable ones. Our core contribution is to formulate this as a tractable black-box optimization problem. We parameterize the hand’s motion using a low-dimensional, spline-based representation built on sparse temporal keyframes. This allows us to use a powerful gradient-free optimizer, CMA-ES, to treat the high-fidelity physics engine as a black-box objective function. Our method finds motions that simultaneously maximize physical success (e.g., stable grasp and lift) while minimizing deviation from the original human demonstration. Compared to recent transfer pipelines such as MANIPTRANS, our approach achieves lower hand and object pose errors during replay and more accurately recovers hand-object physical interactions. Our approach provides a general and scalable method for converting visual demonstrations into physically valid trajectories, enabling the generation of high-fidelity data crucial for robust policy learning.
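
The black-box formulation can be illustrated with a deliberately simplified toy. Two substitutions are made and should be read as such: a basic elitist random-perturbation search stands in for CMA-ES, and piecewise-linear interpolation stands in for the spline. The "simulator" is a hypothetical one-dimensional cost in which a negative hand-object distance counts as penetration; none of the numbers come from the paper.

```python
import random

def interpolate(keyframes, samples_per_segment=5):
    """Densify sparse keyframes (piecewise-linear stand-in for a spline)."""
    traj = []
    for a, b in zip(keyframes, keyframes[1:]):
        for s in range(samples_per_segment):
            t = s / samples_per_segment
            traj.append((1.0 - t) * a + t * b)
    traj.append(keyframes[-1])
    return traj

def simulate_cost(keyframes, demo, penalty=100.0):
    """Black-box objective: penetration penalty plus deviation from the demo."""
    penetration = sum(max(0.0, -d) for d in interpolate(keyframes))
    deviation = sum((k - d) ** 2 for k, d in zip(keyframes, demo))
    return penalty * penetration + deviation

def evolve(demo, iters=200, pop=16, sigma=0.05, seed=0):
    """Elitist gradient-free search over keyframes (not CMA-ES)."""
    rng = random.Random(seed)
    best, best_cost = list(demo), simulate_cost(demo, demo)
    for _ in range(iters):
        for _ in range(pop):
            cand = [k + rng.gauss(0.0, sigma) for k in best]
            cost = simulate_cost(cand, demo)
            if cost < best_cost:
                best, best_cost = cand, cost
    return best, best_cost

demo = [0.30, 0.10, -0.02, -0.01, 0.05]  # hypothetical distances; negatives penetrate
refined, refined_cost = evolve(demo)
```

The optimizer only ever sees a scalar cost per candidate, which is the property that lets a non-differentiable physics engine be used as the objective. A real pipeline would swap `simulate_cost` for a simulator rollout.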

[CV-59] LungCRCT: Causal Representation based Lung CT Processing for Lung Cancer Treatment

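[Quick read]: This paper addresses the bottleneck in clinical translation of deep learning for early lung cancer diagnosis and treatment analysis, which stems from poor interpretability and weak modeling of causal relationships. Image recognition models based on convolutional neural networks (CNNs) or vision transformers (ViTs) perform well on lung nodule classification, but their limited ability to capture physical causal mechanisms makes it hard to support causal intervention analysis. The key to the solution is LungCRCT, a lung cancer analysis framework based on latent causal representation learning. It combines graph-autoencoder-based causal discovery, distance correlation disentanglement, and entropy-based image reconstruction refinement to extract causal representations from LDCT images, enabling interpretable modeling of lung cancer progression and reaching an AUC of 93.91% on malignant tumor classification with much lighter downstream models.

Link: https://arxiv.org/abs/2601.18118
Authors: Daeyoung Kim
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: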

Abstract:Because it is often asymptomatic in its early stages, lung cancer is one of the leading causes of cancer mortality worldwide. Moreover, its major symptoms are hard to distinguish from those of other respiratory diseases such as COPD, which further leads patients to overlook cancer progression in the early stages. Thus, to improve survival rates, early detection through consistent, proactive monitoring of the respiratory system is crucial. One of the most prevalent and effective monitoring methods is the low-dose computed tomography (LDCT) chest scan, which, together with rapid advances in computer-vision AI models such as EfficientNet and ResNet, has brought remarkable improvements in lung cancer detection and tumor classification. However, although advanced CNN models under transfer learning and ViT-based models achieve high detection performance, their intrinsic reliance on correlation and their low interpretability due to complexity still limit the extension of deep learning to lung cancer treatment analysis and causal intervention simulation. This research therefore introduces LungCRCT, a latent causal representation learning based lung cancer analysis framework that retrieves causal representations of factors within the physical causal mechanism of lung cancer progression. Using advanced graph-autoencoder-based causal discovery algorithms with distance correlation disentanglement and entropy-based image reconstruction refinement, LungCRCT not only enables causal intervention analysis for lung cancer treatments but also yields robust yet extremely light downstream models for malignant tumor classification, with an AUC score of 93.91%.

[CV-60] Spatial-Conditioned Reasoning in Long-Egocentric Videos

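[Quick read]: This paper addresses the difficulty of visual navigation in long-horizon egocentric video, caused by viewpoint drift and the lack of persistent geometric context, and in particular the limited spatial reasoning of current vision-language models (VLMs) on such sequences. The key to the solution is to strengthen the spatial understanding of VLMs with explicit spatial signals, without modifying model architectures or inference procedures. Concretely, the authors build the Sanpo-D benchmark through a fine-grained re-annotation of the Google Sanpo dataset and fuse depth maps with RGB frames as input, which improves spatial reasoning on safety-critical tasks such as pedestrian and obstruction detection and reveals a trade-off between general-purpose accuracy and spatial specialization.

Link: https://arxiv.org/abs/2601.18100
Authors: James Tribble,Hao Wang,Si-En Hong,Chaoyi Zhou,Ashish Bastola,Siyu Huang,Abolfazl Razi
Affiliation: Clemson University (克莱姆森大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: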

Abstract:Long-horizon egocentric video presents significant challenges for visual navigation due to viewpoint drift and the absence of persistent geometric context. Although recent vision-language models perform well on image and short-video reasoning, their spatial reasoning capability in long egocentric sequences remains limited. In this work, we study how explicit spatial signals influence VLM-based video understanding without modifying model architectures or inference procedures. We introduce Sanpo-D, a fine-grained re-annotation of the Google Sanpo dataset, and benchmark multiple VLMs on navigation-oriented spatial queries. To examine input-level inductive bias, we further fuse depth maps with RGB frames and evaluate their impact on spatial reasoning. Our results reveal a trade-off between general-purpose accuracy and spatial specialization, showing that depth-aware and spatially grounded representations can improve performance on safety-critical tasks such as pedestrian and obstruction detection.

[CV-61] Computational Framework for Estimating Relative Gaussian Blur Kernels between Image Pairs

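【Quick read】: This paper addresses how to estimate image blur parameters in real time without any training, targeting fast application of the Gaussian blur model in real scenes. The core solution is a zero-training forward computational framework that discretely evaluates the analytic expression mapping the sharper image to the blurred one over the applicable range of standard deviations and selects the best matches. The analytic expression yields multiple solutions at some image points, but a similarity measure over neighboring points filters them down to a single solution, allowing the relative blur between partially blurred images to be estimated accurately.

Link: https://arxiv.org/abs/2601.18099
Authors: Akbar Saadat
Affiliations: Iranian railways
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 14 input images, 3 TikZ images. arXiv admin note: substantial text overlap with arXiv:2601.04779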

Abstract:Following the earlier verification of the Gaussian model in \citeA{Saa2026}, this paper introduces a zero-training forward computational framework that realizes the model in real-time applications. The framework is based on discrete calculation of the analytic expression of the defocused image from the sharper one over the application range of the standard deviation of the Gaussian kernels, and on selecting the best matches. The analytic expression yields multiple solutions at certain image points, but these are filtered down to a single solution using similarity measures over neighboring points. The framework is structured to handle cases where the two given images are partially blurred versions of each other. Experimental evaluations on real images demonstrate that the proposed framework achieves a mean absolute error (MAE) below 1.7% in estimating synthetic blur values. Furthermore, the discrepancy between actual blurred image intensities and their corresponding estimates, obtained by applying the extracted defocus filters to the less blurred images, remains under 2%.
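
To make the idea of a relative blur kernel concrete, here is a minimal 1-D sketch. It is not the paper's analytic method: it simply grid-searches candidate standard deviations and keeps the one whose blur of the sharper signal best matches the blurrier one. It relies on the standard fact that convolving Gaussians adds their variances, so the relative kernel between blurs of σ1 and σ2 has σ_r = sqrt(σ2² − σ1²). The function names, the 4σ kernel truncation, and the edge-replication padding are all assumptions made for illustration.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at about 4 sigma (illustrative choice)."""
    if sigma <= 0:
        return [1.0]
    radius = max(1, int(math.ceil(4 * sigma)))
    weights = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(signal, sigma):
    """Convolve a 1-D signal with a Gaussian, replicating edge samples at the borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out


def mean_absolute_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def estimate_relative_sigma(sharper, blurrier, candidates):
    """Return the candidate sigma whose blur of `sharper` is closest to `blurrier`."""
    return min(candidates, key=lambda s: mean_absolute_error(blur(sharper, s), blurrier))


# Hypothetical test signal: a step edge blurred with sigma 1.0 and 2.0.
step_edge = [0.0] * 32 + [1.0] * 32
sharper_signal = blur(step_edge, 1.0)
blurrier_signal = blur(step_edge, 2.0)
candidate_sigmas = [0.25 * k for k in range(1, 13)]
relative_sigma = estimate_relative_sigma(sharper_signal, blurrier_signal, candidate_sigmas)
print(relative_sigma, math.sqrt(2.0 ** 2 - 1.0 ** 2))
```

The brute-force search stands in for the paper's closed-form expression only to show what quantity is being estimated; a real implementation would work on 2-D images and local neighborhoods.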

[CV-62] Text-Pass Filter: An Efficient Scene Text Detector

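【Quick read】: This paper addresses the loss of visual features at text margins caused by the shrink-mask expansion strategy used in existing text detectors, which limits accurate recognition of arbitrary-shaped text. The core solution is the Text-Pass Filter (TPF), which segments the whole text region directly and thereby avoids this intrinsic limitation. The key innovation is to borrow the frequency-selective behavior of a band-pass filter and build a unique feature-filter pair for each text instance, so that at inference each filter passes its matching text feature and blocks the others; adhesive (touching) texts are separated naturally without complex decoding or post-processing. A Reinforcement Ensemble Unit (REU) enhances feature consistency within the same text, and a Foreground Prior Unit (FPU) improves foreground-background discrimination, yielding efficient, real-time, and robust text detection.

Link: https://arxiv.org/abs/2601.18098
Authors: Chuang Yang,Haozhao Ma,Xu Han,Yuan Yuan,Qi Wang
Affiliations: Northwestern Polytechnical University; School of Artificial Intelligence, OPtics and ElectroNics (iOPEN); School of Computer Science
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: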

Abstract:To make text assembly efficient, existing methods detect text via the shrink-mask expansion strategy. However, the shrinking operation loses the visual features of text margins and blurs the difference between foreground and background, which intrinsically limits the recognition of text features. We address this issue and design the Text-Pass Filter (TPF) for arbitrary-shaped text detection. It segments the whole text directly, which avoids these intrinsic limitations. Notably, unlike previous whole-text-region-based methods, TPF separates adhesive texts naturally without complex decoding or post-processing, which makes real-time text detection possible. Concretely, a band-pass filter passes components within a specified band of frequencies, called its passband, and blocks components with frequencies above or below this band. This provides a natural idea for extracting whole texts separately. By simulating the band-pass filter, TPF constructs a unique feature-filter pair for each text. In the inference stage, every filter extracts its matched text by passing its pass-feature and blocking other features. Meanwhile, because the large aspect ratio of ribbon-like texts makes it hard to recognize them as a whole, a Reinforcement Ensemble Unit (REU) is designed to enhance the feature consistency of the same text and to enlarge the filter's recognition field. Furthermore, a Foreground Prior Unit (FPU) is introduced to encourage TPF to discriminate between foreground and background, which improves the quality of the feature-filter pairs. Experiments demonstrate the effectiveness of REU and FPU and show TPF's superiority.

[CV-63] Cross-Domain Transfer with Self-Supervised Spectral-Spatial Modeling for Hyperspectral Image Classification

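【Quick read】: This paper addresses the degraded generalization of hyperspectral image classification in cross-domain transfer, caused by reliance on source-domain annotations and by distribution shift. The key to the solution is a self-supervised cross-domain transfer framework that needs no source-domain labels and rests on three core mechanisms. First, in pre-training, a Spatial-Spectral Transformer (S2Former) module uses a dual-branch spatial-spectral transformer with bidirectional cross-attention to model spectral and spatial features jointly. Second, a Frequency Domain Constraint (FDC) preserves frequency-domain consistency through the real Fast Fourier Transform (rFFT) and a high-frequency magnitude loss, sharpening the model's ability to resolve fine details and boundaries. Third, in fine-tuning, a Diffusion-Aligned Fine-tuning (DAFT) distillation mechanism aligns semantic evolution trajectories through a teacher-student structure, enabling robust transfer when target-domain samples are scarce.

Link: https://arxiv.org/abs/2601.18088
Authors: Jianshu Chao,Tianhua Lv,Qiqiong Ma,Yunfei Qiu,Li Fang,Huifang Shen,Wei Yao
Affiliations: Quanzhou Institute of Equipment Manufacturing, Haixi Institutes, Chinese Academy of Sciences, Quanzhou, China; Fujian Institute of Research on the Structure of Matter, Chinese Academy of Sciences, Fuzhou, China; Liaoning Technical University, Huludao, Liaoning, China; State Key Laboratory of Regional and Urban Ecology, Institute of Urban Environment, Chinese Academy of Sciences, Xiamen, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: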

Abstract:Self-supervised learning has demonstrated considerable potential in hyperspectral representation, yet its application in cross-domain transfer scenarios remains under-explored. Existing methods, however, still rely on source domain annotations and are susceptible to distribution shifts, leading to degraded generalization performance in the target domain. To address this, this paper proposes a self-supervised cross-domain transfer framework that learns transferable spectral-spatial joint representations without source labels and achieves efficient adaptation under few samples in the target domain. During the self-supervised pre-training phase, a Spatial-Spectral Transformer (S2Former) module is designed. It adopts a dual-branch spatial-spectral transformer and introduces a bidirectional cross-attention mechanism to achieve spectral-spatial collaborative modeling: the spatial branch enhances structural awareness through random masking, while the spectral branch captures fine-grained differences. Both branches mutually guide each other to improve semantic consistency. We further propose a Frequency Domain Constraint (FDC) to maintain frequency-domain consistency through real Fast Fourier Transform (rFFT) and high-frequency magnitude loss, thereby enhancing the model’s capability to discern fine details and boundaries. During the fine-tuning phase, we introduce a Diffusion-Aligned Fine-tuning (DAFT) distillation mechanism. This aligns semantic evolution trajectories through a teacher-student structure, enabling robust transfer learning under low-label conditions. Experimental results demonstrate stable classification performance and strong cross-domain adaptability across four hyperspectral datasets, validating the method’s effectiveness under resource-constrained conditions.
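
The Frequency Domain Constraint combines an rFFT with a high-frequency magnitude loss. The abstract does not give the exact formula, so the sketch below is only one plausible reading: take the one-sided DFT magnitudes of a reconstructed and a reference 1-D signal and average the absolute difference over the upper part of the spectrum. The `cutoff_ratio` parameter, the L1 form, and the function names are assumptions, and a plain DFT is used in place of an optimized FFT to keep the example dependency-free.

```python
import cmath
import math


def rfft_magnitudes(signal):
    """Magnitudes of the one-sided DFT (bins 0 .. n//2) of a real 1-D signal."""
    n = len(signal)
    magnitudes = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        magnitudes.append(abs(coeff))
    return magnitudes


def high_frequency_magnitude_loss(pred, target, cutoff_ratio=0.5):
    """Mean absolute magnitude difference over bins at or above the cutoff (assumed form)."""
    if len(pred) != len(target):
        raise ValueError("signals must have the same length")
    if not 0.0 <= cutoff_ratio < 1.0:
        raise ValueError("cutoff_ratio must be in [0, 1)")
    pred_mag = rfft_magnitudes(pred)
    target_mag = rfft_magnitudes(target)
    start = int(len(pred_mag) * cutoff_ratio)
    high_bins = range(start, len(pred_mag))
    return sum(abs(pred_mag[k] - target_mag[k]) for k in high_bins) / len(high_bins)


# Hypothetical signals: a smooth reference, one copy with added high-frequency
# ripple, and one copy with only a constant (zero-frequency) offset.
reference = [math.cos(2 * math.pi * t / 8) for t in range(8)]
with_ripple = [x + 0.5 * (-1) ** t for t, x in enumerate(reference)]
with_offset = [x + 0.3 for x in reference]
ripple_loss = high_frequency_magnitude_loss(with_ripple, reference)
offset_loss = high_frequency_magnitude_loss(with_offset, reference)
print(ripple_loss, offset_loss)
```

The ripple is penalized while the constant offset is not, which is the behavior a constraint aimed at fine details and boundaries would want.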

[CV-64] Semi-Supervised Hyperspectral Image Classification with Edge-Aware Superpixel Label Propagation and Adaptive Pseudo-Labeling

【Quick read】: This paper addresses boundary label diffusion and pseudo-label instability in semi-supervised hyperspectral image (HSI) classification, which arise from high annotation costs and scarce samples. The key to the solution is a Dynamic Reliability-Enhanced Pseudo-Label Framework (DREPL) that combines Dynamic History-Fused Prediction (DHP) with Adaptive Tripartite Sample Categorization (ATSC). DHP weights historical predictions against current ones to smooth pseudo-label fluctuations, improving temporal consistency and noise resistance; ATSC uses confidence and consistency measures to treat easy, ambiguous, and hard samples differently, improving pseudo-label quality and learning efficiency. In addition, an Edge-Aware Superpixel Label Propagation (EASLP) module refines the spatial prior and suppresses label diffusion in boundary regions, so that the full method jointly optimizes spatio-temporal consistency.

Link: https://arxiv.org/abs/2601.18049
Authors: Yunfei Qiu,Qiqiong Ma,Tianhua Lv,Li Fang,Shudong Zhou,Wei Yao
Affiliations: Liaoning Technical University; State Key Laboratory of Regional and Urban Ecology, Institute of Urban Environment, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Significant progress has been made in semi-supervised hyperspectral image (HSI) classification regarding feature extraction and classification performance. However, due to high annotation costs and limited sample availability, semi-supervised learning still faces challenges such as boundary label diffusion and pseudo-label instability. To address these issues, this paper proposes a novel semi-supervised hyperspectral classification framework integrating spatial prior information with a dynamic learning mechanism. First, we design an Edge-Aware Superpixel Label Propagation (EASLP) module. By integrating an edge intensity penalty with a neighborhood correction strategy, it mitigates label diffusion from superpixel segmentation while enhancing classification robustness in boundary regions. Second, we introduce a Dynamic History-Fused Prediction (DHP) method. By maintaining historical predictions and dynamically weighting them with current results, DHP smooths pseudo-label fluctuations and improves temporal consistency and noise resistance. Concurrently, incorporating confidence and consistency measures, the Adaptive Tripartite Sample Categorization (ATSC) strategy implements hierarchical utilization of easy, ambiguous, and hard samples, leading to enhanced pseudo-label quality and learning efficiency. The Dynamic Reliability-Enhanced Pseudo-Label Framework (DREPL), composed of DHP and ATSC, strengthens pseudo-label stability across temporal and sample domains. Operating in synergy with EASLP, it achieves spatio-temporal consistency optimization. Evaluations on four benchmark datasets demonstrate its capability to maintain superior classification performance.
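
The two pseudo-labeling ideas in this abstract can be sketched in a few lines. The version below is an illustrative guess rather than the authors' DHP or ATSC: history fusion is written as a fixed-momentum blend of past and current class probabilities, and the three-way split uses fixed confidence thresholds plus an agreement check between the fused and current predictions. The momentum, the thresholds, and the function names are all hypothetical; the paper describes the weighting as dynamic and the categorization as adaptive.

```python
def fuse_history(history_probs, current_probs, momentum=0.7):
    """Blend stored and current class probabilities, then renormalize."""
    fused = [momentum * h + (1.0 - momentum) * c
             for h, c in zip(history_probs, current_probs)]
    total = sum(fused)
    return [p / total for p in fused]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def categorize_sample(fused_probs, current_probs, high=0.9, low=0.6):
    """Return (category, pseudo_label) with category in {easy, ambiguous, hard}."""
    label = argmax(fused_probs)
    confidence = fused_probs[label]
    consistent = label == argmax(current_probs)
    if confidence >= high and consistent:
        return "easy", label
    if confidence >= low:
        return "ambiguous", label
    return "hard", label


# Hypothetical two-class predictions for three unlabeled pixels.
stable = categorize_sample(fuse_history([0.9, 0.1], [0.95, 0.05]), [0.95, 0.05])
flipped = categorize_sample(fuse_history([0.9, 0.1], [0.2, 0.8]), [0.2, 0.8])
uncertain = categorize_sample(fuse_history([0.5, 0.5], [0.45, 0.55]), [0.45, 0.55])
print(stable, flipped, uncertain)
```

A sample whose current prediction suddenly flips keeps its historical label but is demoted to the ambiguous group, which is the smoothing effect the abstract attributes to history fusion.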

[CV-65] Leveraging Persistence Image to Enhance Robustness and Performance in Curvilinear Structure Segmentation

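【Quick read】: This paper addresses insufficient accuracy in segmenting curvilinear structures in medical images, focusing on how to integrate topological properties such as connectivity to improve the consistency and robustness of segmentation. Existing methods mostly rely on handcrafted topological loss functions, which generalize poorly, are computationally expensive, and are hard to build into the network architecture. The key to the solution is the PIs-Regressor module, which learns the persistence image (PI), a finite and differentiable representation of topological features, directly from data, together with Topology SegNet, which fuses this topological information in both the downsampling and upsampling stages. Topological priors are thus built into the architecture rather than imposed as external constraints, which markedly improves robustness to disturbances such as overexposure and blurring and improves both pixel-level accuracy and topological fidelity.

Link: https://arxiv.org/abs/2601.18045
Authors: Zhuangzhi Gao,Feixiang Zhou,He Zhao,Xiuju Chen,Xiaoxin Li,Qinkai Yu,Yitian Zhao,Alena Shantsila,Gregory Y. H. Lip,Eduard Shantsila,Yalin Zheng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by IEEE International Symposium on Biomedical Imaging (ISBI) 2026. 5 pages, 3 figures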

Abstract:Segmenting curvilinear structures in medical images is essential for analyzing morphological patterns in clinical applications. Integrating topological properties, such as connectivity, improves segmentation accuracy and consistency. However, extracting and embedding such properties, especially from Persistence Diagrams (PD), is challenging due to their non-differentiability and computational cost. Existing approaches mostly encode topology through handcrafted loss functions, which generalize poorly across tasks. In this paper, we propose PIs-Regressor, a simple yet effective module that learns persistence images (PIs), finite and differentiable representations of topological features, directly from data. Together with Topology SegNet, which fuses these features in both downsampling and upsampling stages, our framework integrates topology into the network architecture itself rather than into auxiliary losses. Unlike existing methods that depend heavily on handcrafted loss functions, our approach directly incorporates topological information into the network structure, leading to more robust segmentation. Our design is flexible and can be seamlessly combined with other topology-based methods to further enhance segmentation performance. Experimental results show that integrating topological features enhances model robustness, effectively handling challenges like overexposure and blurring in medical imaging. On three curvilinear benchmarks, our approach demonstrates state-of-the-art performance in both pixel-level accuracy and topological fidelity.
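
For readers unfamiliar with persistence images, the sketch below shows the classical way one is computed from a persistence diagram, which is the target a regressor such as the one described here would presumably learn to predict. It is not the paper's module. Each (birth, death) pair is moved to (birth, persistence) coordinates, weighted by its persistence, and spread over a fixed grid with a Gaussian evaluated at pixel centers. The grid size, bandwidth, linear weighting, and value range are illustrative assumptions.

```python
import math


def persistence_image(diagram, resolution=8, sigma=0.1, value_range=(0.0, 1.0)):
    """Rasterize (birth, death) pairs into a resolution x resolution grid.

    Rows index persistence (death - birth), columns index birth.
    """
    low, high = value_range
    step = (high - low) / resolution
    centers = [low + (i + 0.5) * step for i in range(resolution)]
    image = [[0.0] * resolution for _ in range(resolution)]
    for birth, death in diagram:
        persistence = death - birth
        if persistence <= 0:
            continue  # points on the diagonal carry no topological signal
        for row, p_center in enumerate(centers):
            for col, b_center in enumerate(centers):
                sq_dist = (b_center - birth) ** 2 + (p_center - persistence) ** 2
                image[row][col] += persistence * math.exp(-sq_dist / (2 * sigma ** 2))
    return image


# Hypothetical diagram: one long-lived feature and one point on the diagonal.
example_image = persistence_image([(0.2, 0.9), (0.3, 0.3)])
peak_value, peak_row, peak_col = max(
    (value, row, col)
    for row, cells in enumerate(example_image)
    for col, value in enumerate(cells)
)
print(peak_row, peak_col)
```

Because the output is a fixed-size array built from smooth functions, it can be compared with an ordinary regression loss, which is what makes the representation convenient inside a network.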

[CV-66] Strip-Fusion: Spatiotemporal Fusion for Multispectral Pedestrian Detection

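【Quick read】: This paper addresses three weaknesses of multispectral pedestrian detection: under-use of temporal information, imperfect alignment of input image pairs, and degraded performance under difficult lighting and heavy occlusion. The core solution is Strip-Fusion, a spatiotemporal fusion network that uses temporally adaptive convolutions to weight spatiotemporal features dynamically and so better capture pedestrian motion and context. A loss based on the Kullback-Leibler divergence mitigates the modality imbalance between visible and thermal inputs by steering feature alignment toward the more informative modality, and a new post-processing algorithm reduces false positives. Together these improve robustness and accuracy on the KAIST and CVC-14 datasets, with notable gains over existing methods in challenging scenes.

Link: https://arxiv.org/abs/2601.18008
Authors: Asiegbu Miracle Kanu-Asiegbu,Nitin Jotwani,Xiaoxiao Du
Affiliations: University of Michigan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been accepted for publication in IEEE Robotics and Automation Letters (RA-L). Code available at: this https URL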

Abstract:Pedestrian detection is a critical task in robot perception. Multispectral modalities (visible light and thermal) can boost pedestrian detection performance by providing complementary visual information. Several gaps remain with multispectral pedestrian detection methods. First, existing approaches primarily focus on spatial fusion and often neglect temporal information. Second, RGB and thermal image pairs in multispectral benchmarks may not always be perfectly aligned. Pedestrians are also challenging to detect due to varying lighting conditions, occlusion, etc. This work proposes Strip-Fusion, a spatial-temporal fusion network that is robust to misalignment in input images, as well as varying lighting conditions and heavy occlusions. The Strip-Fusion pipeline integrates temporally adaptive convolutions to dynamically weigh spatial-temporal features, enabling our model to better capture pedestrian motion and context over time. A novel Kullback-Leibler divergence loss was designed to mitigate modality imbalance between visible and thermal inputs, guiding feature alignment toward the more informative modality during training. Furthermore, a novel post-processing algorithm was developed to reduce false positives. Extensive experimental results show that our method performs competitively for both the KAIST and the CVC-14 benchmarks. We also observed significant improvements compared to previous state-of-the-art on challenging conditions such as heavy occlusion and misalignment.

[CV-67] MorphXAI: An Explainable Framework for Morphological Analysis of Parasites in Blood Smear Images WACV2026

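【Quick read】: This paper addresses the limited interpretability of deep learning models for diagnosing parasitic infections, against the backdrop of the efficiency and accuracy problems of manual microscopy in low-resource settings. Existing explanations rely mostly on visual heatmaps or attention maps, which cannot capture the morphological traits clinicians use for diagnosis. The core of the solution is the MorphXAI framework, which builds morphological supervision directly into the prediction pipeline so that, while localizing parasites, the model also characterizes clinically relevant attributes such as shape, curvature, visible dot count, flagellum presence, and developmental stage, yielding both better detection performance and structured, biologically meaningful explanations.

Link: https://arxiv.org/abs/2601.18001
Authors: Aqsa Yousaf,Sint Sint Win,Megan Coffee,Habeeb Olufowobi
Affiliations: University of Texas at Arlington; NYU Grossman School of Medicine
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at WACV 2026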

Abstract:Parasitic infections remain a pressing global health challenge, particularly in low-resource settings where diagnosis still depends on labor-intensive manual inspection of blood smears and the availability of expert domain knowledge. While deep learning models have shown strong performance in automating parasite detection, their clinical usefulness is constrained by limited interpretability. Existing explainability methods are largely restricted to visual heatmaps or attention maps, which highlight regions of interest but fail to capture the morphological traits that clinicians rely on for diagnosis. In this work, we present MorphXAI, an explainable framework that unifies parasite detection with fine-grained morphological analysis. MorphXAI integrates morphological supervision directly into the prediction pipeline, enabling the model to localize parasites while simultaneously characterizing clinically relevant attributes such as shape, curvature, visible dot count, flagellum presence, and developmental stage. To support this task, we curate a clinician-annotated dataset of three parasite species (Leishmania, Trypanosoma brucei, and Trypanosoma cruzi) with detailed morphological labels, establishing a new benchmark for interpretable parasite analysis. Experimental results show that MorphXAI not only improves detection performance over the baseline but also provides structured, biologically meaningful explanations.

[CV-68] Systematic Characterization of Minimal Deep Learning Architectures: A Unified Analysis of Convergence, Pruning, and Quantization

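【Quick read】: This paper addresses how to identify minimal yet stable deep learning architectures that reliably solve a given task, especially under pruning and low-precision quantization constraints. The key to the solution is a systematic computational methodology that performs a structured design sweep over a large architecture space and evaluates convergence behavior, pruning sensitivity, and quantization robustness, thereby exposing regularities in learning dynamics across network types and quantifying the relationship between parameter redundancy and accuracy loss. The study finds that, despite architectural diversity, learning on image classification falls into three regimes (unstable, learning, and overfitting); that deeper networks are more resilient to pruning, with parameter redundancy as high as 60%; and that low-precision quantization hurts models with fewer parameters, and harder datasets, more severely. These findings offer actionable guidance for building compact, stable models.

Link: https://arxiv.org/abs/2601.17987
Authors: Ziwei Zheng,Huizhi Liang,Vaclav Snasel,Vito Latora,Panos Pardalos,Giuseppe Nicosia,Varun Ojha
Affiliations: Newcastle University; VSB-Technical University of Ostrava; Queen Mary University of London; University of Florida; University of Catania
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: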

Abstract:Deep learning networks excel at classification, yet identifying minimal architectures that reliably solve a task remains challenging. We present a computational methodology for systematically exploring and analyzing the relationships among convergence, pruning, and quantization. The workflow first performs a structured design sweep across a large set of architectures, then evaluates convergence behavior, pruning sensitivity, and quantization robustness on representative models. Focusing on well-known image classification of increasing complexity, and across Deep Neural Networks, Convolutional Neural Networks, and Vision Transformers, our initial results show that, despite architectural diversity, performance is largely invariant and learning dynamics consistently exhibit three regimes: unstable, learning, and overfitting. We further characterize the minimal learnable parameters required for stable learning, uncover distinct convergence and pruning phases, and quantify the effect of reduced numeric precision on trainable parameters. Aligning with intuition, the results confirm that deeper architectures are more resilient to pruning than shallower ones, with parameter redundancy as high as 60%, and quantization impacts models with fewer learnable parameters more severely and has a larger effect on harder image datasets. These findings provide actionable guidance for selecting compact, stable models under pruning and low-precision constraints in image classification.
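
The abstract does not say which pruning criterion or quantization scheme was used, so the following is a generic sketch of the two operations being studied, under stated assumptions: global magnitude pruning (zero the smallest-magnitude fraction of weights) and symmetric uniform quantization to a given bit width. Function names and the flat-list weight representation are chosen for illustration only.

```python
def magnitude_prune(weights, fraction):
    """Zero out the given fraction of weights with the smallest absolute value."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    n_prune = int(len(weights) * fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for index in order[:n_prune]:
        pruned[index] = 0.0
    return pruned


def quantize_uniform(weights, bits):
    """Symmetric uniform quantization: snap each weight to a multiple of the step size."""
    if bits < 2:
        raise ValueError("bits must be at least 2 for a signed symmetric grid")
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / levels
    if scale == 0.0:
        return list(weights)
    return [round(w / scale) * scale for w in weights]


# Hypothetical weight vector.
example_weights = [0.5, -0.1, 0.02, -0.8, 0.3]
pruned_weights = magnitude_prune(example_weights, 0.4)
coarse_weights = quantize_uniform(example_weights, 2)
fine_weights = quantize_uniform(example_weights, 8)
print(pruned_weights, coarse_weights)
```

With only two bits every weight collapses to one of three values, which gives some intuition for why models with few parameters have less slack to absorb quantization error.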

[CV-69] Domain-Expert-Guided Hybrid Mixture-of-Experts for Medical AI: Integrating Data-Driven Learning with Clinical Priors

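【Quick read】: This paper addresses the difficulty Mixture-of-Experts (MoE) models have in learning effectively in specialized domains such as medicine, where data are limited, and uses the rich expert knowledge available in clinical practice (for example physician gaze patterns and diagnostic heuristics) to improve generalization and interpretability. The key to the solution is a plug-and-play, interpretable Domain-Knowledge-Guided Hybrid MoE (DKGH-MoE) module. It combines a data-driven MoE, which extracts novel features from raw imaging data, with a domain-expert-guided MoE, which incorporates clinical priors, specifically clinician eye-gaze cues, to emphasize regions of high diagnostic relevance. The two complement each other, improving both performance and clinical interpretability.

Link: https://arxiv.org/abs/2601.17977
Authors: Jinchen Gu,Nan Zhao,Lei Qiu,Lu Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages; 3 figures; accepted by International Symposium on Biomedical Imaging (ISBI) 2026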

Abstract:Mixture-of-Experts (MoE) models increase representational capacity with modest computational cost, but their effectiveness in specialized domains such as medicine is limited by small datasets. In contrast, clinical practice offers rich expert knowledge, such as physician gaze patterns and diagnostic heuristics, that models cannot reliably learn from limited data. Combining data-driven experts, which capture novel patterns, with domain-expert-guided experts, which encode accumulated clinical insights, provides complementary strengths for robust and clinically meaningful learning. To this end, we propose Domain-Knowledge-Guided Hybrid MoE (DKGH-MoE), a plug-and-play and interpretable module that unifies data-driven learning with domain expertise. DKGH-MoE integrates a data-driven MoE to extract novel features from raw imaging data, and a domain-expert-guided MoE incorporates clinical priors, specifically clinician eye-gaze cues, to emphasize regions of high diagnostic relevance. By integrating domain expert insights with data-driven features, DKGH-MoE improves both performance and interpretability.

[CV-70] UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders

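【Quick read】: This paper addresses how to generate pixel-dense features from pre-trained visual backbones efficiently, keeping performance high while lowering inference cost. Among existing methods, cross-attention-based upsamplers perform well but inherit the same efficiency scaling bottleneck as the backbones they upsample, while earlier iterative upsamplers are promising but held back by unstable features. The key to the solution is the UPLiFT architecture together with an efficient local attention operator, the Local Attender, which replaces conventional global attention with an attentional pooling formulation defined fully locally. This keeps features stable throughout iterative upsampling and reaches state-of-the-art performance at lower inference cost than existing methods.

Link: https://arxiv.org/abs/2601.17950
Authors: Matthew Walmer,Saksham Suri,Anirud Aggarwal,Abhinav Shrivastava
Affiliations: University of Maryland, College Park; Meta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: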

Abstract:The space of task-agnostic feature upsampling has emerged as a promising area of research to efficiently create denser features from pre-trained visual backbones. These methods act as a shortcut to achieve dense features for a fraction of the cost by learning to map low-resolution features to high-resolution versions. While early works in this space used iterative upsampling approaches, more recent works have switched to cross-attention-based methods, which risk falling into the same efficiency scaling problems of the backbones they are upsampling. In this work, we demonstrate that iterative upsampling methods can still compete with cross-attention-based methods; moreover, they can achieve state-of-the-art performance with lower inference costs. We propose UPLiFT, an architecture for Universal Pixel-dense Lightweight Feature Transforms. We also propose an efficient Local Attender operator to overcome the limitations of prior iterative feature upsampling methods. This operator uses an alternative attentional pooling formulation defined fully locally. We show that our Local Attender allows UPLiFT to maintain stable features throughout upsampling, enabling state-of-the-art performance with lower inference costs than existing pixel-dense feature upsamplers. In addition, we apply UPLiFT to generative downstream tasks and show that it achieves competitive performance with state-of-the-art Coupled Flow Matching models for VAE feature upsampling. Altogether, UPLiFT offers a versatile and efficient approach to creating denser features.

[CV-71] FlowMorph: Physics-Consistent Self-Supervision for Label-Free Single-Cell Mechanics in Microfluidic Videos

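【Quick read】: This paper addresses the problem of quantifying the deformability of red blood cells (RBCs) in microfluidic flow: how to extract a physics-consistent, label-free mechanics proxy $k$ from low-resolution, short brightfield videos instead of relying on supervised segmentation or hand-crafted kymographs. The key to the solution is FlowMorph, a physics-consistent self-supervised framework. It models each cell boundary with a parametric contour, advances it through a differentiable "capsule-in-flow" mechanism that combines laminar advection with curvature-regularized elastic relaxation, and optimizes losses derived only from automatically extracted silhouettes and optical flow (silhouette overlap, intra-cellular flow agreement, area conservation, wall constraints, and temporal smoothness), giving accurate, annotation-free modeling and prediction of RBC mechanics.

Link: https://arxiv.org/abs/2601.17947
Authors: Bora Yimenicioglu,Vishal Manikanden
Affiliations: RareGen; Cornell University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: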

Abstract:Mechanical properties of red blood cells (RBCs) are promising biomarkers for hematologic and systemic disease, motivating microfluidic assays that probe deformability at throughputs of $10^3$ to $10^6$ cells per experiment. However, existing pipelines rely on supervised segmentation or hand-crafted kymographs and rarely encode the laminar Stokes-flow physics that governs RBC shape evolution. We introduce FlowMorph, a physics-consistent self-supervised framework that learns a label-free scalar mechanics proxy $k$ for each tracked RBC from short brightfield microfluidic videos. FlowMorph models each cell by a low-dimensional parametric contour, advances boundary points through a differentiable "capsule-in-flow" combining laminar advection and curvature-regularized elastic relaxation, and optimizes a loss coupling silhouette overlap, intra-cellular flow agreement, area conservation, wall constraints, and temporal smoothness, using only automatically derived silhouettes and optical flow. Across four public RBC microfluidic datasets, FlowMorph achieves a mean silhouette IoU of 0.905 on physics-rich videos with provided velocity fields and markedly improves area conservation and wall violations over purely data-driven baselines. On $\sim 1.5 \times 10^5$ centered sequences, the scalar $k$ alone separates tank-treading from flipping dynamics with an AUC of 0.863. Using only 200 real-time deformability cytometry (RT-DC) events for calibration, a monotone map $E = g(k)$ predicts apparent Young's modulus with a mean absolute error of 0.118 MPa on 600 held-out cells and degrades gracefully under shifts in channel geometry, optics, and frame rate.
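
One of the loss terms listed above, area conservation, is easy to illustrate on a polygonal contour. The sketch below is an assumption about the general shape of such a term, not FlowMorph's actual loss: it computes the enclosed area with the shoelace formula and penalizes the squared relative change between a reference contour and a deformed one. Function names and the squared-relative form are hypothetical.

```python
def polygon_area(contour):
    """Shoelace formula: area enclosed by an ordered list of (x, y) vertices."""
    twice_area = 0.0
    n = len(contour)
    for i in range(n):
        x0, y0 = contour[i]
        x1, y1 = contour[(i + 1) % n]
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


def area_conservation_penalty(reference_contour, deformed_contour):
    """Squared relative area change between two contours (assumed form)."""
    reference_area = polygon_area(reference_contour)
    if reference_area == 0.0:
        raise ValueError("reference contour must enclose a non-zero area")
    relative_change = (polygon_area(deformed_contour) - reference_area) / reference_area
    return relative_change ** 2


# Hypothetical contours: a unit square, an area-preserving stretch, and a dilation.
unit_square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
stretched = [(0.0, 0.0), (2.0, 0.0), (2.0, 0.5), (0.0, 0.5)]
dilated = [(0.0, 0.0), (1.1, 0.0), (1.1, 1.0), (0.0, 1.0)]
stretch_penalty = area_conservation_penalty(unit_square, stretched)
dilation_penalty = area_conservation_penalty(unit_square, dilated)
print(stretch_penalty, dilation_penalty)
```

A shape change that preserves area costs nothing, while a change in enclosed area is penalized, which matches the physical expectation that a cell's projected area stays roughly constant as it deforms.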

[CV-72] DTC: A Deformable Transposed Convolution Module for Medical Image Segmentation

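【Quick read】: This paper addresses the loss of structural information and blurred detail that conventional upsampling methods (such as transposed convolution and linear interpolation) cause in medical image segmentation because they sample at fixed positions. The key solution is Deformable Transposed Convolution (DTC), which learns dynamic sampling coordinates in place of fixed-position sampling, strengthening the decoder's ability to reconstruct high-resolution feature maps and improving detail recovery in multi-scale prediction.

Link: https://arxiv.org/abs/2601.17939
Authors: Chengkun Sun,Jinqian Pan,Renjie Liang,Zhengkang Fan,Xin Miao,Jiang Bian,Jie Xu
Affiliations: University of Florida; University of Texas at Arlington; Indiana University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: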

Abstract:In medical image segmentation, particularly in UNet-like architectures, upsampling is primarily used to transform smaller feature maps into larger ones, enabling feature fusion between encoder and decoder features and supporting multi-scale prediction. Conventional upsampling methods, such as transposed convolution and linear interpolation, operate on fixed positions: transposed convolution applies kernel elements to predetermined pixel or voxel locations, while linear interpolation assigns values based on fixed coordinates in the original feature map. These fixed-position approaches may fail to capture structural information beyond predefined sampling positions and can lead to artifacts or loss of detail. Inspired by deformable convolutions, we propose a novel upsampling method, Deformable Transposed Convolution (DTC), which learns dynamic coordinates (i.e., sampling positions) to generate high-resolution feature maps for both 2D and 3D medical image segmentation tasks. Experiments on 3D (e.g., BTCV15) and 2D datasets (e.g., ISIC18, BUSI) demonstrate that DTC can be effectively integrated into existing medical image segmentation models, consistently improving the decoder’s feature reconstruction and detail recovery capability.
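
The contrast between fixed and learned sampling positions can be shown with a small sketch. This is not the authors' DTC layer (it has no kernel weights and the offsets are supplied by hand rather than predicted by a network); it only illustrates the mechanism: each output pixel reads the input at its usual upsampling coordinate plus an offset, using bilinear interpolation so that fractional positions are well defined. With all offsets zero it reduces to ordinary fixed-position interpolation. All names and the coordinate convention are assumptions.

```python
def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2-D grid at a fractional (y, x), clamped to the border."""
    height, width = len(feature), len(feature[0])
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    wy, wx = y - y0, x - x0
    top = (1.0 - wx) * feature[y0][x0] + wx * feature[y0][x1]
    bottom = (1.0 - wx) * feature[y1][x0] + wx * feature[y1][x1]
    return (1.0 - wy) * top + wy * bottom


def deformable_upsample(feature, offsets, scale=2):
    """Upsample by `scale`, shifting each output pixel's sampling point by its (dy, dx)."""
    height, width = len(feature), len(feature[0])
    output = []
    for out_y in range(height * scale):
        row = []
        for out_x in range(width * scale):
            dy, dx = offsets[out_y][out_x]
            row.append(bilinear_sample(feature, out_y / scale + dy, out_x / scale + dx))
        output.append(row)
    return output


# Hypothetical 2x2 feature map upsampled to 4x4.
small_map = [[0.0, 1.0], [2.0, 3.0]]
zero_offsets = [[(0.0, 0.0)] * 4 for _ in range(4)]
fixed_result = deformable_upsample(small_map, zero_offsets)
shifted_offsets = [list(row) for row in zero_offsets]
shifted_offsets[0][0] = (0.5, 0.5)  # stand-in for an offset a network would predict
shifted_result = deformable_upsample(small_map, shifted_offsets)
print(fixed_result[1][1], shifted_result[0][0])
```

In a trainable layer the offsets would come from a small convolution over the input features, and because bilinear interpolation is differentiable in the sampling coordinates, gradients can flow back to the layer that predicts them.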

[CV-73] From Specialist to Generalist: Unlocking SAM's Learning Potential on Unlabeled Medical Images

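【Quick read】: This paper addresses the difficulty of adapting foundation models such as the Segment Anything Model (SAM) to medical image segmentation, which stems from domain shift, scarce annotations, and the inability of Parameter-Efficient Fine-Tuning (PEFT) to exploit unlabeled data. The key to the solution is the SC-SAM framework, a bidirectional co-training scheme between U-Net and SAM: U-Net, the "specialist", supplies point-based prompts and pseudo-labels that guide SAM's adaptation, while SAM, the "generalist", provides a strong supervisory signal that regularizes U-Net. Both models can thus exploit unlabeled data effectively, which markedly improves semi-supervised medical image segmentation.

Link: https://arxiv.org/abs/2601.17934
Authors: Vi Vu,Thanh-Huy Nguyen,Tien-Thinh Nguyen,Ba-Thinh Lam,Hoang-Thien Nguyen,Tianyang Wang,Xingjian Li,Min Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ISBI 2026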

Abstract:Foundation models like the Segment Anything Model (SAM) show strong generalization, yet adapting them to medical images remains difficult due to domain shift, scarce labels, and the inability of Parameter-Efficient Fine-Tuning (PEFT) to exploit unlabeled data. While conventional models like U-Net excel in semi-supervised medical learning, their potential to assist a PEFT SAM has been largely overlooked. We introduce SC-SAM, a specialist-generalist framework where U-Net provides point-based prompts and pseudo-labels to guide SAM’s adaptation, while SAM serves as a powerful generalist supervisor to regularize U-Net. This reciprocal guidance forms a bidirectional co-training loop that allows both models to effectively exploit the unlabeled data. Across prostate MRI and polyp segmentation benchmarks, our method achieves state-of-the-art results, outperforming other existing semi-supervised SAM variants and even medical foundation models like MedSAM, highlighting the value of specialist-generalist cooperation for label-efficient medical image segmentation. Our code is available at this https URL.

[CV-74] Dissipative Learning: A Framework for Viable Adaptive Systems

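【Quick read】: This paper examines the fundamental nature of regularization and forgetting in machine learning, asking how stability, adaptability, and information compression in learning can be understood through thermodynamics and information geometry. The core difficulty is that conventional approaches treat regularization as a heuristic and overlook its structural role in how a system evolves. The key to the solution is the BEDS (Bayesian Emergent Dissipative Structures) framework, which models learning as the evolution of compressed belief states under dissipation constraints. Its Conditional Optimality Theorem shows that regularization measured by information divergence under the Fisher-Rao metric is the unique thermodynamically optimal strategy, achieving minimal dissipation, whereas Euclidean regularization is structurally suboptimal. The theory unifies Ridge regression, SIGReg, exponential moving average (EMA), and Soft Actor-Critic (SAC) as special cases of a single governing equation, and interprets overfitting as over-crystallization and catastrophic forgetting as insufficient dissipation control. For continual learning and multi-agent systems it offers a new paradigm in which viability, meaning stability under adaptation and finite resources, is the primary criterion.

Link: https://arxiv.org/abs/2601.17933
Authors: Laurent Caraffa
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 68 pages, 14 figures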

Abstract:We propose a perspective in which learning is an intrinsically dissipative process. Forgetting and regularization are not heuristic add-ons but structural requirements for adaptive systems. Drawing on information theory, thermodynamics, and information geometry, we introduce the BEDS (Bayesian Emergent Dissipative Structures) framework, modeling learning as the evolution of compressed belief states under dissipation constraints. A central contribution is the Conditional Optimality Theorem, showing that Fisher-Rao regularization, which measures change via information divergence rather than Euclidean distance, is the unique thermodynamically optimal regularization strategy, achieving minimal dissipation. Euclidean regularization is shown to be structurally suboptimal. The framework unifies existing methods (Ridge, SIGReg, EMA, SAC) as special cases of a single governing equation. Within this view, overfitting corresponds to over-crystallization, while catastrophic forgetting reflects insufficient dissipation control. The framework distinguishes BEDS-crystallizable problems, where beliefs converge to stable equilibria, from BEDS-maintainable problems, which require continual adaptation. It extends naturally to continual and multi-agent systems, where viability, stability under adaptation and finite resources, replaces asymptotic optimality as the primary criterion. Overall, this work reframes learning as maintaining viable belief states under dissipation constraints, providing a principled lens on forgetting, regularization, and stability.

[CV-75] RemEdit: Efficient Diffusion Editing with Riemannian Geometry

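Quick read: This paper targets the trade-off between semantic fidelity and inference speed in image editing with generative AI. Its solution rests on two complementary innovations. First, it treats the latent space as a Riemannian manifold and uses a Mamba-based module to learn the manifold's structure efficiently, which enables direct and accurate geodesic path computation for smooth semantic edits. Second, it proposes a task-specific attention pruning mechanism in which a lightweight pruning head identifies the tokens essential to the edit, improving inference efficiency while preserving semantic integrity. The method surpasses prior state-of-the-art frameworks while maintaining real-time performance under 50% pruning, setting a new benchmark for practical and powerful image editing.

Link: https://arxiv.org/abs/2601.17927
Authors: Eashan Adhikarla, Brian D. Davison
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: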

Abstract:Controllable image generation is fundamental to the success of modern generative AI, yet it faces a critical trade-off between semantic fidelity and inference speed. The RemEdit diffusion-based framework addresses this trade-off with two synergistic innovations. First, for editing fidelity, we navigate the latent space as a Riemannian manifold. A mamba-based module efficiently learns the manifold’s structure, enabling direct and accurate geodesic path computation for smooth semantic edits. This control is further refined by a dual-SLERP blending technique and a goal-aware prompt enrichment pass from a Vision-Language Model. Second, for additional acceleration, we introduce a novel task-specific attention pruning mechanism. A lightweight pruning head learns to retain tokens essential to the edit, enabling effective optimization without the semantic degradation common in content-agnostic approaches. RemEdit surpasses prior state-of-the-art editing frameworks while maintaining real-time performance under 50% pruning. Consequently, RemEdit establishes a new benchmark for practical and powerful image editing. Source code: this https URL.
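
The abstract mentions a dual-SLERP blending technique without giving details. As background only, here is a minimal sketch of ordinary spherical linear interpolation (SLERP) between two latent vectors; the function name `slerp` and the list-of-floats representation are assumptions for illustration, and this is not the authors' dual-SLERP procedure.

```python
import math

def slerp(v0, v1, t):
    """Spherical linear interpolation between two non-zero vectors (plain SLERP).

    Illustrative helper, not taken from RemEdit. Antiparallel inputs are not
    handled, because the interpolation path is then ambiguous.
    """
    n0 = math.sqrt(sum(a * a for a in v0))
    n1 = math.sqrt(sum(b * b for b in v1))
    # Cosine of the angle between the two directions, clamped against rounding.
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (n0 * n1)
    cos_omega = max(-1.0, min(1.0, cos_omega))
    omega = math.acos(cos_omega)
    if omega < 1e-8:
        # Nearly parallel vectors: fall back to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s = math.sin(omega)
    w0 = math.sin((1 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]

# Halfway between two orthogonal unit vectors stays on the unit circle.
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Unlike straight linear interpolation, the midpoint here keeps unit norm, which is the usual reason SLERP is preferred for blending latent codes.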

[CV-76] Feature-Space Generative Models for One-Shot Class-Incremental Learning

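Quick read: This paper addresses few-shot class-incremental learning (FSCIL) in an extreme setting: the model receives only one sample (1-shot) per novel class and cannot be modified after base training. The core challenge is generalizing to accurate novel-class recognition from so little data. The proposed method, Gen1S, rests on the hypothesis that base-class and novel-class embeddings are structurally similar. Subtracting the class prototype (the average class embedding) from each sample's embedding yields a residual space; a variational autoencoder (VAE) or diffusion model then learns the multi-modal distribution of base-class residuals as a structural prior. This prior is used to improve novel-class recognition, and the method consistently outperforms the prior state of the art across multiple benchmarks and backbone architectures.

Link: https://arxiv.org/abs/2601.17905
Authors: Jack Foster, Kirill Paramonov, Mete Ozay, Umberto Michieli
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: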

Abstract:Few-shot class-incremental learning (FSCIL) is a paradigm where a model, initially trained on a dataset of base classes, must adapt to an expanding problem space by recognizing novel classes with limited data. We focus on the challenging FSCIL setup where a model receives only a single sample (1-shot) for each novel class and no further training or model alterations are allowed after the base training phase. This makes generalization to novel classes particularly difficult. We propose a novel approach predicated on the hypothesis that base and novel class embeddings have structural similarity. We map the original embedding space into a residual space by subtracting the class prototype (i.e., the average class embedding) of input samples. Then, we leverage generative modeling with VAE or diffusion models to learn the multi-modal distribution of residuals over the base classes, and we use this as a valuable structural prior to improve recognition of novel classes. Our approach, Gen1S, consistently improves novel class recognition over the state of the art across multiple benchmarks and backbone architectures.
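
The residual-space mapping described in the abstract (subtracting the class prototype from each embedding) can be sketched as follows. The function names and the toy embeddings are assumptions for illustration; the generative model that Gen1S fits on these residuals is not shown.

```python
def class_prototypes(embeddings, labels):
    """Return the average embedding (prototype) for each class label."""
    sums, counts = {}, {}
    for vec, lab in zip(embeddings, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
            counts[lab] = 0
        sums[lab] = [s + x for s, x in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

def to_residuals(embeddings, labels, prototypes):
    """Map each embedding into residual space by subtracting its class prototype."""
    return [
        [x - p for x, p in zip(vec, prototypes[lab])]
        for vec, lab in zip(embeddings, labels)
    ]

# Toy base-class embeddings (made-up numbers, two classes).
embeddings = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
labels = ["a", "a", "b"]
prototypes = class_prototypes(embeddings, labels)
residuals = to_residuals(embeddings, labels, prototypes)
```

Residuals of every class are centred at the origin, which is what lets a single generative model be trained over all base classes at once.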

[CV-77] Revisiting 3D Reconstruction Kernels as Low-Pass Filters

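Quick read: This paper addresses spectral aliasing caused by discrete sampling in 3D reconstruction, where high-frequency components overlap low-frequency components in the spectrum of the discrete signal and degrade reconstruction accuracy. Its core solution is the Jinc kernel, whose magnitude drops instantaneously to zero at the cutoff frequency; this corresponds to an ideal low-pass filter and isolates the baseband spectrum. Because the Jinc kernel decays slowly in the spatial domain, the authors further propose modulated kernels that balance spatial efficiency against frequency-domain fidelity and improve rendering performance.

Link: https://arxiv.org/abs/2601.17900
Authors: Shengjun Zhang, Min Chen, Yibo Wei, Mingyu Dong, Yueqi Duan
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 5 figures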

Abstract: 3D reconstruction recovers 3D signals from sampled discrete 2D pixels, with the goal of converging to continuous 3D spaces. In this paper, we revisit 3D reconstruction from the perspective of signal processing, identifying the periodic spectral extension induced by discrete sampling as the fundamental challenge. Previous 3D reconstruction kernels, such as Gaussians, exponential functions, and Student's t distributions, serve as low-pass filters to isolate the baseband spectrum. However, their non-ideal low-pass response results in the overlap of high-frequency components with low-frequency components in the discrete-time signal's spectrum. To this end, we introduce the Jinc kernel, whose magnitude drops instantaneously to zero exactly at the cutoff frequency, corresponding to an ideal low-pass filter. Because the Jinc kernel decays slowly in the spatial domain, we further propose modulated kernels that strike an effective balance and achieve superior rendering performance by reconciling spatial efficiency and frequency-domain fidelity. Experimental results demonstrate the effectiveness of our Jinc and modulated kernels.
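
For readers unfamiliar with the jinc function, the sketch below evaluates it with the standard library only. It assumes the common normalization jinc(x) = 2·J1(x)/x with jinc(0) = 1 (the paper may use a different scaling) and computes J1 from Bessel's integral; the function names are illustrative and none of this is the authors' kernel code.

```python
import math

def bessel_j1(x, n=400):
    """Bessel function J1(x) from Bessel's integral
    J1(x) = (1/pi) * integral_0^pi cos(theta - x*sin(theta)) d(theta),
    evaluated with the trapezoid rule on n intervals."""
    h = math.pi / n
    total = 0.0
    for k in range(n + 1):
        theta = k * h
        weight = 0.5 if k in (0, n) else 1.0  # trapezoid end-point weights
        total += weight * math.cos(theta - x * math.sin(theta))
    return total * h / math.pi

def jinc(x):
    """jinc(x) = 2*J1(x)/x, with the limiting value 1 at x = 0 (assumed normalization)."""
    if abs(x) < 1e-12:
        return 1.0
    return 2.0 * bessel_j1(x) / x

# Sample the kernel along a radius; it oscillates and decays slowly,
# which is the spatial-domain drawback the abstract refers to.
samples = [jinc(0.5 * i) for i in range(20)]
```

The first zero of the kernel sits at the first zero of J1 (x ≈ 3.8317), and the side lobes beyond it shrink only polynomially, in contrast to the rapid decay of a Gaussian.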

[CV-78] Masked Depth Modeling for Spatial Perception

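Quick read: This paper addresses the limited depth accuracy and low pixel coverage of RGB-D cameras in practice, which stem from hardware limitations and challenging imaging conditions such as specular or texture-less surfaces. The key is LingBot-Depth, a model that treats depth-sensor errors as "masked" signals, completes depth from visual context, and relies on an automated data curation pipeline for scalable training. The model outperforms top-tier RGB-D cameras in both depth precision and pixel coverage, and results on downstream tasks suggest that it provides an aligned latent representation across RGB and depth modalities.

Link: https://arxiv.org/abs/2601.17895
Authors: Bin Tan, Changjiang Sun, Xiage Qin, Hanat Adai, Zelin Fu, Tianxiang Zhou, Han Zhang, Yinghao Xu, Xing Zhu, Yujun Shen, Nan Xue
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Tech report, 19 pages, 15 figures and 4 tables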

Abstract:Spatial visual perception is a fundamental requirement in physical-world applications like autonomous driving and robotic manipulation, driven by the need to interact with 3D environments. Capturing pixel-aligned metric depth using RGB-D cameras would be the most viable way, yet it usually faces obstacles posed by hardware limitations and challenging imaging conditions, especially in the presence of specular or texture-less surfaces. In this work, we argue that the inaccuracies from depth sensors can be viewed as “masked” signals that inherently reflect underlying geometric ambiguities. Building on this motivation, we present LingBot-Depth, a depth completion model which leverages visual context to refine depth maps through masked depth modeling and incorporates an automated data curation pipeline for scalable training. It is encouraging to see that our model outperforms top-tier RGB-D cameras in terms of both depth precision and pixel coverage. Experimental results on a range of downstream tasks further suggest that LingBot-Depth offers an aligned latent representation across RGB and depth modalities. We release the code, checkpoint, and 3M RGB-depth pairs (including 2M real data and 1M simulated data) to the community of spatial perception.

[CV-79] PEAfowl: Perception-Enhanced Multi-View Vision-Language-Action for Bimanual Manipulation

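Quick read: This paper addresses the weak generalization of vision-language-action (VLA) policies for bimanual manipulation in cluttered scenes under occlusion, viewpoint changes, and scene variation. Existing methods fuse multi-view features through view-agnostic token concatenation, which yields weak 3D spatial consistency, and inject language as global conditioning, which makes instruction grounding coarse. The solution has three parts: (1) PEAfowl predicts a depth distribution for each token, performs differentiable 3D lifting, and aggregates cross-view neighbors to build geometrically grounded, cross-view consistent spatial representations; (2) a Perceiver-style text-aware readout over frozen CLIP visual features replaces global language conditioning, enabling iterative evidence accumulation and more precise instruction grounding; (3) a training-only depth distillation strategy learns geometric priors from a pretrained depth teacher, making the perception front-end more robust to noisy and incomplete depth. Together these improve sim-to-real transfer and task success rates on real robots.

Link: https://arxiv.org/abs/2601.17885
Authors: Qingyu Fan, Zhaoxiang Li, Yi Lu, Wang Chen, Qiu Shen, Xiao-xiao Long, Yinghao Cai, Tao Lu, Shuo Wang, Xun Cao
Affiliation: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Nanjing University (南京大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: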

Abstract: Bimanual manipulation in cluttered scenes requires policies that remain stable under occlusions, viewpoint and scene variations. Existing vision-language-action models often fail to generalize because (i) multi-view features are fused via view-agnostic token concatenation, yielding weak 3D-consistent spatial understanding, and (ii) language is injected as global conditioning, resulting in coarse instruction grounding. In this paper, we introduce PEAfowl, a perception-enhanced multi-view VLA policy for bimanual manipulation. For spatial reasoning, PEAfowl predicts per-token depth distributions, performs differentiable 3D lifting, and aggregates local cross-view neighbors to form geometrically grounded, cross-view consistent representations. For instruction grounding, we propose to replace global conditioning with a Perceiver-style text-aware readout over frozen CLIP visual features, enabling iterative evidence accumulation. To overcome noisy and incomplete commodity depth without adding inference overhead, we apply training-only depth distillation from a pretrained depth teacher to supervise the depth-distribution head, providing the perception front-end with geometry-aware priors. On RoboTwin 2.0 under the domain-randomized setting, PEAfowl improves the strongest baseline by 23.0 pp in success rate, and real-robot experiments further demonstrate reliable sim-to-real transfer and consistent improvements from depth distillation. Project website: this https URL.

[CV-80] EEG Foundation Models: Progresses Benchmarking and Open Problems

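Quick read: This paper addresses the lack of fair and comprehensive comparisons among existing electroencephalography (EEG) foundation models for brain-computer interfaces (BCIs), which stems from inconsistent pre-training objectives, preprocessing methods, and downstream evaluation protocols. The key is a unified taxonomic framework covering data standardization, model architectures, and self-supervised pre-training strategies. On that basis, the authors evaluate 12 open-source foundation models and competitive baselines on 13 EEG datasets spanning nine BCI paradigms, considering both cross-subject generalization (leave-one-subject-out) and rapid calibration (within-subject few-shot), to show how well foundation models actually transfer and how performance relates to model scale.

Link: https://arxiv.org/abs/2601.17883
Authors: Dingkun Liu, Yuheng Chen, Zhu Chen, Zhenyao Cui, Yaozhi Wen, Jiayu An, Jingwei Luo, Dongrui Wu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: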

Abstract: Electroencephalography (EEG) foundation models have recently emerged as a promising paradigm for brain-computer interfaces (BCIs), aiming to learn transferable neural representations from large-scale heterogeneous recordings. Despite rapid progress, fair and comprehensive comparisons of existing EEG foundation models are lacking, due to inconsistent pre-training objectives, preprocessing choices, and downstream evaluation protocols. This paper fills this gap. We first review 50 representative models and organize their design choices into a unified taxonomic framework including data standardization, model architectures, and self-supervised pre-training strategies. We then evaluate 12 open-source foundation models and competitive specialist baselines across 13 EEG datasets spanning nine BCI paradigms. Emphasizing real-world deployments, we consider both cross-subject generalization under a leave-one-subject-out protocol and rapid calibration under a within-subject few-shot setting. We further compare full-parameter fine-tuning with linear probing to assess the transferability of pre-trained representations, and examine the relationship between model scale and downstream performance. Our results indicate that: 1) linear probing is frequently insufficient; 2) specialist models trained from scratch remain competitive across many tasks; and 3) larger foundation models do not necessarily yield better generalization performance under current data regimes and training practices.
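
The leave-one-subject-out protocol named in the abstract can be sketched as a simple index splitter. The function name and the toy subject IDs are assumptions for illustration; the paper's actual data loading and evaluation code are not shown in the source.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject.

    Every trial of the held-out subject goes to the test split, so the model
    is always evaluated on a person it has never seen during training.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test

# One subject ID per recorded trial (made-up example).
trial_subjects = ["s1", "s1", "s2", "s3", "s2"]
folds = list(leave_one_subject_out(trial_subjects))
```

Splitting by subject rather than by trial avoids leaking subject-specific signal characteristics into the test set, which is why it is the standard measure of cross-subject generalization.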

[CV-81] Quran-MD: A Fine-Grained Multilingual Multimodal Dataset of the Quran

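Quick read: This paper addresses the scarcity of multimodal data for Quranic research, in particular the lack of resources with fine-grained alignment between text and audio, which limits the use of generative AI for tasks such as speech recognition, semantic understanding, and personalized tutoring. The key contribution is the Quran MD dataset, which combines the original Arabic text, English translation, phonetic transliteration, and audio recordings from 32 different reciters at both the verse (ayah) and word levels. The resulting precise text-audio alignment provides a high-quality, extensible benchmark resource for natural language processing, speech synthesis, tajweed detection, and cross-modal learning.

Link: https://arxiv.org/abs/2601.17880
Authors: Muhammad Umar Salman, Mohammad Areeb Qazi, Mohammed Talha Alam
Affiliation: MBZUAI (Mohamed Bin Zayed University of Artificial Intelligence)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 2 tables and 2 figures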

Abstract: We present Quran MD, a comprehensive multimodal dataset of the Quran that integrates textual, linguistic, and audio dimensions at the verse and word levels. For each verse (ayah), the dataset provides its original Arabic text, English translation, and phonetic transliteration. To capture the rich oral tradition of Quranic recitation, we include verse-level audio from 32 distinct reciters, reflecting diverse recitation styles and dialectal nuances. At the word level, each token is paired with its corresponding Arabic script, English translation, transliteration, and an aligned audio recording, allowing fine-grained analysis of pronunciation, phonology, and semantic context. This dataset supports various applications, including natural language processing, speech recognition, text-to-speech synthesis, linguistic analysis, and digital Islamic studies. Bridging text and audio modalities across multiple reciters, this dataset provides a unique resource to advance computational approaches to Quranic recitation and study. Beyond enabling tasks such as ASR, tajweed detection, and Quranic TTS, it lays the foundation for multimodal embeddings, semantic retrieval, style transfer, and personalized tutoring systems that can support both research and community applications. The dataset is available at this https URL.

[CV-82] VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding

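Quick read: This paper addresses the causal-masking bias of standard autoregressive video LLMs, which hinders global spatiotemporal modeling and limits understanding efficiency. The core solution is VidLaDA, a framework built on a diffusion language model that uses bidirectional attention to capture bidirectional dependencies in video data. To relieve the inference bottleneck of diffusion decoding over massive numbers of video tokens, the authors design MARS-Cache, which combines asynchronous visual cache refreshing with frame-wise chunk attention, pruning redundancy while preserving global connectivity through anchor tokens. This yields more than a 12x inference speedup without compromising reasoning accuracy.

Link: https://arxiv.org/abs/2601.17868
Authors: Zhihao He, Tieyuan Chen, Kangyu Wang, Ziran Qin, Yang Shao, Chaofan Gan, Shijie Li, Zuxuan Wu, Weiyao Lin
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: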

Abstract:Standard Autoregressive Video LLMs inevitably suffer from causal masking biases that hinder global spatiotemporal modeling, leading to suboptimal understanding efficiency. We propose VidLaDA, a Video LLM based on Diffusion Language Model utilizing bidirectional attention to capture bidirectional dependencies. To further tackle the inference bottleneck of diffusion decoding on massive video tokens, we introduce MARS-Cache. This framework accelerates inference by combining asynchronous visual cache refreshing with frame-wise chunk attention, effectively pruning redundancy while preserving global connectivity via anchor tokens. Extensive experiments show VidLaDA outperforms diffusion baselines and rivals state-of-the-art autoregressive models (e.g., Qwen2.5-VL and LLaVA-Video), with MARS-Cache delivering over 12x speedup without compromising reasoning accuracy. Code and checkpoints are open-sourced at this https URL.

[CV-83] MV-SAM: Multi-view Promptable Segmentation using Pointmap Guidance

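Quick read: This paper addresses inconsistency in multi-view image segmentation caused by a lack of 3D awareness: existing methods produce masks that are not geometrically consistent across views and usually need costly per-scene optimization to enforce 3D consistency. The key is a pointmap-based mapping into 3D space. Pointmaps, which are 3D points reconstructed from unposed images by recent visual geometry models, give a one-to-one correspondence between pixels and 3D points, so images and prompts can be lifted explicitly into 3D. On this basis, the method extends the Segment Anything Model (SAM) by converting the image embeddings from its encoder into 3D point embeddings, which a transformer decodes using cross-attention with 3D prompt embeddings. This aligns 2D interactions with 3D geometry, so the model implicitly learns view-consistent masks and achieves high-quality multi-view segmentation without explicit 3D networks or annotated 3D data.

Link: https://arxiv.org/abs/2601.17866
Authors: Yoonwoo Jeong, Cheng Sun, Yu-Chiang Frank Wang, Minsu Cho, Jaesung Choe
Affiliation: NVIDIA; POSTECH
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page, this https URL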

Abstract:Promptable segmentation has emerged as a powerful paradigm in computer vision, enabling users to guide models in parsing complex scenes with prompts such as clicks, boxes, or textual cues. Recent advances, exemplified by the Segment Anything Model (SAM), have extended this paradigm to videos and multi-view images. However, the lack of 3D awareness often leads to inconsistent results, necessitating costly per-scene optimization to enforce 3D consistency. In this work, we introduce MV-SAM, a framework for multi-view segmentation that achieves 3D consistency using pointmaps – 3D points reconstructed from unposed images by recent visual geometry models. Leveraging the pixel-point one-to-one correspondence of pointmaps, MV-SAM lifts images and prompts into 3D space, eliminating the need for explicit 3D networks or annotated 3D data. Specifically, MV-SAM extends SAM by lifting image embeddings from its pretrained encoder into 3D point embeddings, which are decoded by a transformer using cross-attention with 3D prompt embeddings. This design aligns 2D interactions with 3D geometry, enabling the model to implicitly learn consistent masks across views through 3D positional embeddings. Trained on the SA-1B dataset, our method generalizes well across domains, outperforming SAM2-Video and achieving comparable performance with per-scene optimization baselines on NVOS, SPIn-NeRF, ScanNet++, uCo3D, and DL3DV benchmarks. Code will be released.

[CV-84] Domain Generalization with Quantum Enhancement for Medical Image Classification: A Lightweight Approach for Cross-Center Deployment

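Quick read: This paper addresses the performance drop that medical imaging AI models suffer under domain shift in cross-center deployment, with the aim of improving clinical generalizability. The key is a lightweight domain generalization framework with quantum-enhanced collaborative learning that generalizes robustly to unseen target domains without real multi-center labeled data. Its core components are: (1) simulating multi-domain imaging differences with brightness, contrast, sharpening, and noise perturbations; (2) domain-adversarial training with a gradient reversal layer to suppress domain-discriminative features; and (3) a lightweight quantum feature enhancement layer that uses parameterized quantum circuits for nonlinear feature mapping and entanglement modeling. A test-time adaptation strategy further alleviates distribution shift. Experiments show that the method significantly outperforms baseline models, with lower performance variance and higher AUC and sensitivity on unseen domains.

Link: https://arxiv.org/abs/2601.17862
Authors: Jingsong Xia, Siqi Wang
Affiliation: The Second Clinical College, Nanjing Medical University (南京医科大学第二临床学院)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: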

Abstract:Medical image artificial intelligence models often achieve strong performance in single-center or single-device settings, yet their effectiveness frequently deteriorates in real-world cross-center deployment due to domain shift, limiting clinical generalizability. To address this challenge, we propose a lightweight domain generalization framework with quantum-enhanced collaborative learning, enabling robust generalization to unseen target domains without relying on real multi-center labeled data. Specifically, a MobileNetV2-based domain-invariant encoder is constructed and optimized through three key components: (1) multi-domain imaging shift simulation using brightness, contrast, sharpening, and noise perturbations to emulate heterogeneous acquisition conditions; (2) domain-adversarial training with gradient reversal to suppress domain-discriminative features; and (3) a lightweight quantum feature enhancement layer that applies parameterized quantum circuits for nonlinear feature mapping and entanglement modeling. In addition, a test-time adaptation strategy is employed during inference to further alleviate distribution shifts. Experiments on simulated multi-center medical imaging datasets demonstrate that the proposed method significantly outperforms baseline models without domain generalization or quantum enhancement on unseen domains, achieving reduced domain-specific performance variance and improved AUC and sensitivity. These results highlight the clinical potential of quantum-enhanced domain generalization under constrained computational resources and provide a feasible paradigm for hybrid quantum–classical medical imaging systems.
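
Component (2) relies on gradient reversal, a standard trick in domain-adversarial training: the layer is the identity in the forward pass and flips the sign of the gradient in the backward pass. The framework-free sketch below shows only that behaviour; the class name, the scaling factor `lam`, and the manual `backward` method are assumptions for illustration, not the authors' code.

```python
class GradientReversal:
    """Identity in the forward pass, negated (and scaled) gradient in the backward pass."""

    def __init__(self, lam=1.0):
        # lam controls how strongly the domain classifier's gradient
        # pushes the encoder away from domain-discriminative features.
        self.lam = lam

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_output):
        # The encoder receives the reversed gradient, so minimizing the domain
        # loss downstream maximizes domain confusion upstream.
        return [-self.lam * g for g in grad_output]

layer = GradientReversal(lam=0.5)
out = layer.forward([1.0, 2.0])
grad_to_encoder = layer.backward([1.0, -2.0])
```

In an autograd framework the same behaviour is normally written as a custom differentiable function rather than a hand-called `backward`.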

[CV-85] SynMind: Reducing Semantic Hallucination in fMRI-Based Image Reconstruction

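Quick read: This paper addresses the semantic misalignment that is common in fMRI-based image reconstruction: reconstructed images look realistic and resemble the target stimuli overall, yet salient objects are often replaced or hallucinated. The key is to rethink the role of explicit semantic interpretation in fMRI decoding. fMRI signals are parsed into hierarchical, compositional, sentence-level semantic descriptions (multi-granularity textual representations generated with grounded vision-language models, VLMs), which explicitly capture object identities and spatial organization. Building on this, the SynMind framework fuses these explicit semantic encodings with visual priors to condition a pretrained diffusion model. The method outperforms existing approaches on most quantitative metrics and in human evaluations, showing stronger semantic consistency and neural interpretability.

Link: https://arxiv.org/abs/2601.17857
Authors: Lan Yang, Minghan Yang, Ke Li, Honggang Zhang, Kaiyue Pang, Yi-Zhe Song
Affiliation: Beijing University of Posts and Telecommunications (北京邮电大学); University of Surrey (萨里大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: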

Abstract:Recent advances in fMRI-based image reconstruction have achieved remarkable photo-realistic fidelity. Yet, a persistent limitation remains: while reconstructed images often appear naturalistic and holistically similar to the target stimuli, they frequently suffer from severe semantic misalignment – salient objects are often replaced or hallucinated despite high visual quality. In this work, we address this limitation by rethinking the role of explicit semantic interpretation in fMRI decoding. We argue that existing methods rely too heavily on entangled visual embeddings which prioritize low-level appearance cues – such as texture and global gist – over explicit semantic identity. To overcome this, we parse fMRI signals into rich, sentence-level semantic descriptions that mirror the hierarchical and compositional nature of human visual understanding. We achieve this by leveraging grounded VLMs to generate synthetic, human-like, multi-granularity textual representations that capture object identities and spatial organization. Built upon this foundation, we propose SynMind, a framework that integrates these explicit semantic encodings with visual priors to condition a pretrained diffusion model. Extensive experiments demonstrate that SynMind outperforms state-of-the-art methods across most quantitative metrics. Notably, by offloading semantic reasoning to our text-alignment module, SynMind surpasses competing methods based on SDXL while using the much smaller Stable Diffusion 1.4 and a single consumer GPU. Large-scale human evaluations further confirm that SynMind produces reconstructions more consistent with human visual perception. Neurovisualization analyses reveal that SynMind engages broader and more semantically relevant brain regions, mitigating the over-reliance on high-level visual areas.

[CV-86] Geometry-Grounded Gaussian Splatting

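Quick read: This paper addresses three problems in Gaussian Splatting (GS)-based 3D shape reconstruction: ill-defined geometry representation, poor multi-view consistency, and sensitivity to floaters (spurious points). The key is a rigorous theoretical derivation that establishes Gaussian primitives as a specific type of stochastic solid, giving geometry-grounded Gaussian splatting a formal foundation. Exploiting the volumetric nature of stochastic solids, the method treats Gaussian primitives directly as explicit geometric representations and efficiently renders high-quality depth maps for fine-grained geometry extraction, improving the accuracy and robustness of shape reconstruction.

Link: https://arxiv.org/abs/2601.17835
Authors: Baowen Zhang, Chenxing Jiang, Heng Li, Shaojie Shen, Ping Tan
Affiliation: Hong Kong University of Science and Technology (香港科技大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 15 figures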

Abstract:Gaussian Splatting (GS) has demonstrated impressive quality and efficiency in novel view synthesis. However, shape extraction from Gaussian primitives remains an open problem. Due to inadequate geometry parameterization and approximation, existing shape reconstruction methods suffer from poor multi-view consistency and are sensitive to floaters. In this paper, we present a rigorous theoretical derivation that establishes Gaussian primitives as a specific type of stochastic solids. This theoretical framework provides a principled foundation for Geometry-Grounded Gaussian Splatting by enabling the direct treatment of Gaussian primitives as explicit geometric representations. Using the volumetric nature of stochastic solids, our method efficiently renders high-quality depth maps for fine-grained geometry extraction. Experiments show that our method achieves the best shape reconstruction results among all Gaussian Splatting-based methods on public datasets.

[CV-87] VAE-REPA: Variational Autoencoder Representation Alignment for Efficient Diffusion Training

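Quick read: This paper addresses the slow training convergence of denoising-based diffusion transformers. Existing methods such as REPA (which relies on external representation encoders) and SRA (which requires a dual-model setup) improve performance but incur heavy computational overhead during training because of their external dependencies. The key is VAE-REPA, a lightweight intrinsic guidance framework that uses features from an off-the-shelf pre-trained Variational Autoencoder (VAE). The VAE's reconstruction property means these features already encode visual priors such as texture details, structural patterns, and basic semantic information. A lightweight projection layer aligns the intermediate latent features of the diffusion transformer with the VAE features under a feature alignment loss. The design needs no extra representation encoder or dual-model maintenance, accelerates training, and adds only 4% GFLOPs with no additional cost for an external guidance model.

Link: https://arxiv.org/abs/2601.17830
Authors: Mengmeng Wang, Dengyang Jiang, Liuzhuozheng Li, Yucheng Lin, Guojiang Shen, Xiangjie Kong, Yong Liu, Guang Dai, Jingdong Wang
Affiliation: Zhejiang University of Technology (浙江工业大学); SGIT AI Lab, State Grid Corporation of China (国家电网公司SGIT人工智能实验室); Zhejiang University (浙江大学); Baidu (百度)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: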

Abstract: Denoising-based diffusion transformers, despite their strong generation performance, suffer from inefficient training convergence. Existing methods addressing this issue, such as REPA (relying on external representation encoders) or SRA (requiring dual-model setups), inevitably incur heavy computational overhead during training due to external dependencies. To tackle these challenges, this paper proposes VAE-REPA, a lightweight intrinsic guidance framework for efficient diffusion training. VAE-REPA leverages off-the-shelf pre-trained Variational Autoencoder (VAE) features: their reconstruction property ensures inherent encoding of visual priors like rich texture details, structural patterns, and basic semantic information. Specifically, VAE-REPA aligns the intermediate latent features of diffusion transformers with VAE features via a lightweight projection layer, supervised by a feature alignment loss. This design accelerates training without extra representation encoders or dual-model maintenance, resulting in a simple yet effective pipeline. Extensive experiments demonstrate that VAE-REPA improves both generation quality and training convergence speed compared to vanilla diffusion transformers, matches or outperforms state-of-the-art acceleration methods, and incurs merely 4% extra GFLOPs with zero additional cost for external guidance models.

[CV-88] ViTCoP: Accelerating Large Vision-Language Models via Visual and Textual Semantic Collaborative Pruning

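Quick read: This paper addresses the high computational cost of large vision-language models (LVLMs) caused by redundant visual tokens. Existing methods either prune too early in the vision encoder and lose critical visual information, or prune in the large language model (LLM) and leave redundant information among the selected tokens. The key is ViTCoP (Visual and Textual Semantic Collaborative Pruning), a framework that combines redundancy filtering in the vision encoder with step-wise co-pruning in the LLM based on its hierarchical characteristics, so that critical and informationally diverse visual tokens are preserved efficiently. For compatibility with acceleration techniques such as FlashAttention, it uses the L2 norm of K-vectors as the token saliency metric in the LLM, improving both performance and efficiency.

Link: https://arxiv.org/abs/2601.17818
Authors: Wen Luo, Peng Chen, Xiaotao Huang, LiQun Huang
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: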

Abstract:Large Vision-Language Models (LVLMs) incur high computational costs due to significant redundancy in their visual tokens. To effectively reduce this cost, researchers have proposed various visual token pruning methods. However, existing methods are generally limited, either losing critical visual information prematurely due to pruning in the vision encoder, or leading to information redundancy among the selected tokens due to pruning in the Large Language Models (LLMs). To address these challenges, we propose a Visual and Textual Semantic Collaborative Pruning framework (ViTCoP) that combines redundancy filtering in the vision encoder with step-wise co-pruning within the LLM based on its hierarchical characteristics, to efficiently preserve critical and informationally diverse visual tokens. Meanwhile, to ensure compatibility with acceleration techniques like FlashAttention, we introduce the L2 norm of K-vectors as the token saliency metric in the LLM. Extensive experiments on various Large Vision-Language Models demonstrate that ViTCoP not only achieves state-of-the-art performance surpassing existing methods on both image and video understanding tasks, but also significantly reduces model inference latency and GPU memory consumption. Notably, its performance advantage over other methods becomes even more pronounced under extreme pruning rates.
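
Scoring tokens by the L2 norm of their key vectors needs only the keys themselves, not the attention matrix, which fused kernels such as FlashAttention do not materialize. The sketch below illustrates that idea only. The function names are hypothetical, and the abstract does not say whether high-norm or low-norm keys count as salient, so the direction is exposed as a flag (`keep_largest`) rather than asserted.

```python
import math

def key_norm_saliency(keys):
    """L2 norm of each token's key vector (one score per token)."""
    return [math.sqrt(sum(x * x for x in k)) for k in keys]

def select_tokens(keys, keep, keep_largest=True):
    """Return the indices of the `keep` tokens retained after pruning.

    keep_largest is an assumption: which end of the norm ranking is
    treated as salient is not stated in the abstract.
    """
    scores = key_norm_saliency(keys)
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=keep_largest)
    # Restore the original token order for the surviving tokens.
    return sorted(ranked[:keep])

# Four visual tokens with 2-dimensional keys (made-up numbers).
keys = [[3.0, 4.0], [0.0, 1.0], [1.0, 1.0], [6.0, 8.0]]
kept_high = select_tokens(keys, keep=2)
kept_low = select_tokens(keys, keep=2, keep_largest=False)
```

ViTCoP's step-wise co-pruning across LLM layers and its vision-encoder redundancy filter are separate components and are not modelled here.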

[CV-89] Agreement-Driven Multi-View 3D Reconstruction for Live Cattle Weight Estimation

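Quick read: This paper addresses the productivity and economic costs of traditional livestock weight estimation methods, such as manual weighing or body condition scoring, which require handling the animals. It proposes a low-cost, non-contact approach to estimating cattle live weight. The key is 3D reconstruction from multi-view RGB images with a SAM 3D-based agreement-guided fusion strategy, followed by ensemble regression for prediction. Experiments indicate that improving reconstruction quality matters more than increasing model complexity for scalable deployment on farms with limited data, and that classical ensemble models perform most consistently in practical scenarios (R² = 0.69 ± 0.10, MAPE = 2.22 ± 0.56%).

Link: https://arxiv.org/abs/2601.17791
Authors: Rabin Dulal, Wenfeng Jia, Lihong Zheng, Jane Quinn
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: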

Abstract: Accurate cattle live weight estimation is vital for livestock management, welfare, and productivity. Traditional methods, such as manual weighing using a walk-over weighing system or proximate measurements using body condition scoring, involve manual handling of stock and can impact productivity from both a stock and economic perspective. To address these issues, this study investigated a cost-effective, non-contact method for live weight calculation in cattle using 3D reconstruction. The proposed pipeline utilized multi-view RGB images with SAM 3D-based agreement-guided fusion, followed by ensemble regression. Our approach generates a single 3D point cloud per animal and compares classical ensemble models with deep learning models under low-data conditions. Results show that SAM 3D with multi-view agreement fusion outperforms other 3D generation methods, while classical ensemble models provide the most consistent performance for practical farm scenarios (R² = 0.69 ± 0.10, MAPE = 2.22 ± 0.56%), making this practical for on-farm implementation. These findings demonstrate that improving reconstruction quality is more critical than increasing model complexity for scalable deployment on farms where producing a large volume of 3D data is challenging.
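
The two reported metrics have standard definitions, sketched below for reference. The weights in the example are made-up numbers chosen for easy arithmetic, not data from the paper, and the function names are illustrative.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (true values must be non-zero)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical live weights in kg and model predictions.
true_kg = [100.0, 200.0, 300.0]
pred_kg = [110.0, 190.0, 300.0]
r2 = r2_score(true_kg, pred_kg)    # 1 - 200 / 20000 = 0.99
err = mape(true_kg, pred_kg)       # (10% + 5% + 0%) / 3 = 5%
```

The two metrics answer different questions: MAPE measures the typical relative error per animal, while R² measures how much of the weight variation across the herd the model explains, so a low MAPE can coexist with a moderate R² when the animals have similar weights.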

[CV-90] Training-Free Text-to-Image Compositional Food Generation via Prompt Grafting

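Quick read: This paper addresses object entanglement in multi-food image generation: because many foods lack clear boundaries in real-world meal images, modern text-to-image diffusion models struggle to generate images that contain several distinct food items. The key is Prompt Grafting (PG), a training-free framework that introduces prompts in two stages during sampling. A layout prompt first establishes distinct food regions; once the layout has stabilized, the target prompt is grafted in, which allows controllable separation or mixing of food items.

Link: https://arxiv.org/abs/2601.17666
Authors: Xinyue Pan, Yuhao Chen, Fengqing Zhu
Affiliation: Purdue University (普渡大学); University of Waterloo (滑铁卢大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CAI2026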

Abstract:Real-world meal images often contain multiple food items, making reliable compositional food image generation important for applications such as image-based dietary assessment, where multi-food data augmentation is needed, and recipe visualization. However, modern text-to-image diffusion models struggle to generate accurate multi-food images due to object entanglement, where adjacent foods (e.g., rice and soup) fuse together because many foods do not have clear boundaries. To address this challenge, we introduce Prompt Grafting (PG), a training-free framework that combines explicit spatial cues in text with implicit layout guidance during sampling. PG runs a two-stage process where a layout prompt first establishes distinct regions and the target prompt is grafted once layout formation stabilizes. The framework enables food entanglement control: users can specify which food items should remain separated or be intentionally mixed by editing the arrangement of layouts. Across two food datasets, our method significantly improves the presence of target objects and provides qualitative evidence of controllable separation.

[CV-91] SPACE-CLIP: Spatial Perception via Adaptive CLIP Embeddings for Monocular Depth Estimation

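Quick read: This paper addresses an inherent limitation of Contrastive Language-Image Pre-training (CLIP) models in perceiving geometric structure. CLIP excels at semantic understanding but models spatial structure poorly, and existing methods guide it indirectly through textual prompts, which is inefficient and limits results. The key is SPACE-CLIP, a dual-pathway decoder architecture that extracts and interprets latent geometric knowledge directly from a frozen CLIP vision encoder, bypassing the text encoder and its textual prompts entirely. A semantic pathway dynamically conditions high-level features on global context via feature-wise linear modulation (FiLM), while a structural pathway extracts fine-grained spatial details from early layers. Hierarchical fusion of the two combines semantic context with precise geometry, substantially improving depth estimation and providing a spatial perception module that can be integrated into next-generation embodied AI systems such as vision-language-action (VLA) models.

Link: https://arxiv.org/abs/2601.17657
Authors: Taewan Cho, Taeryang Kim, Andrew Jaeyong Choi
Affiliation: Gachon University (嘉泉大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: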

Abstract:Contrastive Language-Image Pre-training (CLIP) has accomplished extraordinary success for semantic understanding but inherently struggles to perceive geometric structure. Existing methods attempt to bridge this gap by querying CLIP with textual prompts, a process that is often indirect and inefficient. This paper introduces a fundamentally different approach using a dual-pathway decoder. We present SPACE-CLIP, an architecture that unlocks and interprets latent geometric knowledge directly from a frozen CLIP vision encoder, completely bypassing the text encoder and its associated textual prompts. A semantic pathway interprets high-level features, dynamically conditioned on global context using feature-wise linear modulation (FiLM). In addition, a structural pathway extracts fine-grained spatial details from early layers. These complementary streams are hierarchically fused, enabling a robust synthesis of semantic context and precise geometry. Extensive experiments on the KITTI benchmark show that SPACE-CLIP dramatically outperforms previous CLIP-based methods. Our ablation studies validate that the synergistic fusion of our dual pathways is critical to this success. SPACE-CLIP offers a new, efficient, and architecturally elegant blueprint for repurposing large-scale vision models. The proposed method is not just a standalone depth estimator, but a readily integrable spatial perception module for the next generation of embodied AI systems, such as vision-language-action (VLA) models. Our model is available at this https URL
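As a rough illustration of the FiLM conditioning mentioned in the abstract, the sketch below applies a per-channel scale and shift to a small feature map. The function name `film` and the hand-picked `gamma` and `beta` values are assumptions for illustration only; in SPACE-CLIP these parameters would presumably be predicted from global context by a network that the abstract does not describe.

```python
def film(features, gamma, beta):
    """Feature-wise linear modulation: y[c][i] = gamma[c] * x[c][i] + beta[c].

    features: list of channels, each a list of floats.
    gamma, beta: one scale and one shift per channel (hypothetical values here).
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("one (gamma, beta) pair is needed per channel")
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


# Two channels with two spatial positions each.
features = [[1.0, 2.0], [3.0, 4.0]]
modulated = film(features, gamma=[2.0, 0.5], beta=[0.0, 1.0])
```

The point of the design is that the same feature map can be re-weighted channel by channel depending on a conditioning signal, without changing the layers that produced it.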

[CV-92] Stylizing ViT: Anatomy-Preserving Instance Style Transfer for Domain Generalization

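[Quick read]: This paper addresses the limited cross-domain and cross-population generalization of deep learning models in medical image analysis, which stems from data heterogeneity and scarcity. Traditional data augmentation is of limited use under substantial domain shift, and existing style augmentation methods either lack style diversity or introduce artifacts. The key to the solution is Stylizing ViT, a new Vision Transformer encoder whose weight-shared attention blocks perform both self-attention and cross-attention: self-attention preserves anatomical consistency and cross-attention performs style transfer, which increases style diversity and model robustness without introducing artifacts.

Link: https://arxiv.org/abs/2601.17586
Authors: Sebastian Doerrich, Francesco Di Salvo, Jonas Alle, Christian Ledig
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Accepted at 23rd IEEE International Symposium on Biomedical Imaging (IEEE ISBI 2026)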

Abstract:Deep learning models in medical image analysis often struggle with generalizability across domains and demographic groups due to data heterogeneity and scarcity. Traditional augmentation improves robustness, but fails under substantial domain shifts. Recent advances in stylistic augmentation enhance domain generalization by varying image styles but fall short in terms of style diversity or by introducing artifacts into the generated images. To address these limitations, we propose Stylizing ViT, a novel Vision Transformer encoder that utilizes weight-shared attention blocks for both self- and cross-attention. This design allows the same attention block to maintain anatomical consistency through self-attention while performing style transfer via cross-attention. We assess the effectiveness of our method for domain generalization by employing it for data augmentation on three distinct image classification tasks in the context of histopathology and dermatology. Results demonstrate an improved robustness (up to +13% accuracy) over the state of the art while generating perceptually convincing images without artifacts. Additionally, we show that Stylizing ViT is effective beyond training, achieving a 17% performance improvement during inference when used for test-time augmentation. The source code is available at this https URL .

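[CV-93] Sponge Tool Attack: Stealthy Denial-of-Efficiency against Tool-Augmented Agentic Reasoning

[Quick read]: This paper addresses the security risk that arises when Large Language Models (LLMs) call external tools for agentic reasoning, because the tool-calling process can be maliciously manipulated. Existing methods improve reasoning efficiency and utility, but the vulnerability of the tool-calling stage has not been studied thoroughly. The key contribution is a new attack paradigm, Sponge Tool Attack (STA). Under query-only access, it uses an iterative multi-agent collaborative framework with an explicit prompt-rewriting policy to produce natural-looking rewrites of the input prompt with high semantic fidelity. These rewrites turn originally concise, efficient reasoning trajectories into verbose, convoluted ones, causing substantial computational overhead while preserving the original task semantics and user intent, which keeps the attack stealthy.

Link: https://arxiv.org/abs/2601.17566
Authors: Qi Li, Xinchao Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: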

Abstract:Enabling large language models (LLMs) to solve complex reasoning tasks is a key step toward artificial general intelligence. Recent work augments LLMs with external tools to enable agentic reasoning, achieving high utility and efficiency in a plug-and-play manner. However, the inherent vulnerabilities of such methods to malicious manipulation of the tool-calling process remain largely unexplored. In this work, we identify a tool-specific attack surface and propose Sponge Tool Attack (STA), which disrupts agentic reasoning solely by rewriting the input prompt under a strict query-only access assumption. Without any modification on the underlying model or the external tools, STA converts originally concise and efficient reasoning trajectories into unnecessarily verbose and convoluted ones before arriving at the final answer. This results in substantial computational overhead while remaining stealthy by preserving the original task semantics and user intent. To achieve this, we design STA as an iterative, multi-agent collaborative framework with explicit rewritten policy control, and generates benign-looking prompt rewrites from the original one with high semantic fidelity. Extensive experiments across 6 models (including both open-source models and closed-source APIs), 12 tools, 4 agentic frameworks, and 13 datasets spanning 5 domains validate the effectiveness of STA.

[CV-94] Saliency Driven Imagery Preprocessing for Efficient Compression – Industrial Paper

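[Quick read]: This paper addresses the high storage and bandwidth cost of compressing satellite remote-sensing imagery, particularly as the volume of high-resolution imagery grows and conventional uniform encoding cannot be optimized for task-relevant regions of interest (ROIs). The key to the solution is preprocessing driven by saliency maps: smoothing kernels of variable size are mapped to different quantized saliency levels and applied to image pixels accordingly. This yields variable-rate compression within a single large satellite image, matches compression effort to the ROIs that downstream tasks care about, and stays compatible with existing lossy compression coding standards.

Link: https://arxiv.org/abs/2601.17555
Authors: Justin Downes, Sam Saltwick, Anthony Chen
Affiliations: Amazon Web Services
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Proceedings of the 31st ACM International Conference on Advances in Geographic Information Systems (2023)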

Abstract:The compression of satellite imagery remains an important research area as hundreds of terabytes of images are collected every day, which drives up storage and bandwidth costs. Although progress has been made in increasing the resolution of these satellite images, many downstream tasks are only interested in small regions of any given image. These areas of interest vary by task but, once known, can be used to optimize how information within the image is encoded. Whereas standard image encoding methods, even those optimized for remote sensing, work on the whole image equally, there are emerging methods that can be guided by saliency maps to focus on important areas. In this work we show how imagery preprocessing techniques driven by saliency maps can be used with traditional lossy compression coding standards to create variable rate image compression within a single large satellite image. Specifically, we use variable sized smoothing kernels that map to different quantized saliency levels to process imagery pixels in order to optimize downstream compression and encoding schemes.
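The preprocessing idea in the abstract, where each quantized saliency level maps to a smoothing kernel of a different size, can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the box (mean) filter, the helper names, and the example radii are all hypothetical choices, since the abstract does not say which kernel shape or level-to-size mapping the authors use.

```python
def quantize(saliency, levels):
    """Map a saliency value in [0, 1] to an integer level in 0..levels-1."""
    return min(int(saliency * levels), levels - 1)


def box_blur_at(image, row, col, radius):
    """Mean of the (2*radius+1)^2 window around (row, col), clipped at borders."""
    height, width = len(image), len(image[0])
    window = [
        image[r][c]
        for r in range(max(0, row - radius), min(height, row + radius + 1))
        for c in range(max(0, col - radius), min(width, col + radius + 1))
    ]
    return sum(window) / len(window)


def saliency_smooth(image, saliency, radii):
    """Blur each pixel with the radius assigned to its saliency level.

    radii[level] is the blur radius for that level (assumed mapping):
    low-saliency pixels get large radii, salient pixels get radius 0.
    """
    levels = len(radii)
    return [
        [
            box_blur_at(image, r, c, radii[quantize(saliency[r][c], levels)])
            for c in range(len(image[0]))
        ]
        for r in range(len(image))
    ]


# A bright, salient pixel in the centre of a dark, non-salient background.
image = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
saliency = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
smoothed = saliency_smooth(image, saliency, radii=[1, 0])
```

Smoothing the non-salient regions removes high-frequency detail there, so a standard lossy codec applied afterwards tends to spend fewer bits on them while the salient pixels pass through unchanged.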

[CV-95] OTI: A Model-free and Visually Interpretable Measure of Image Attackability

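[Quick read]: This paper addresses two problems with existing measures of image attackability. First, they rely on a model proxy to extract model-dependent image features, so they cannot be used when the task-specific model is not accessible in practice. Second, the extracted features lack visual interpretability, which makes their relationship to image content hard to understand. The key to the solution is a new model-free, visually interpretable attackability measure, Object Texture Intensity (OTI), which defines image attackability as the texture intensity of the image's semantic object and is motivated theoretically from two perspectives: decision boundaries, and the mid- and high-frequency characteristics of adversarial perturbations. The result is an effective and computationally efficient attackability estimate that also offers an intuitive visual explanation.

Link: https://arxiv.org/abs/2601.17536
Authors: Jiaming Liang, Haowei Liu, Chi-Man Pun
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: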

Abstract:Despite the tremendous success of neural networks, benign images can be corrupted by adversarial perturbations to deceive these models. Intriguingly, images differ in their attackability. Specifically, given an attack configuration, some images are easily corrupted, whereas others are more resistant. Evaluating image attackability has important applications in active learning, adversarial training, and attack enhancement. This prompts a growing interest in developing attackability measures. However, existing methods are scarce and suffer from two major limitations: (1) They rely on a model proxy to provide prior knowledge (e.g., gradients or minimal perturbation) to extract model-dependent image features. Unfortunately, in practice, many task-specific models are not readily accessible. (2) Extracted features characterizing image attackability lack visual interpretability, obscuring their direct relationship with the images. To address these, we propose a novel Object Texture Intensity (OTI), a model-free and visually interpretable measure of image attackability, which measures image attackability as the texture intensity of the image’s semantic object. Theoretically, we describe the principles of OTI from the perspectives of decision boundaries as well as the mid- and high-frequency characteristics of adversarial perturbations. Comprehensive experiments demonstrate that OTI is effective and computationally efficient. In addition, our OTI provides the adversarial machine learning community with a visual understanding of attackability.

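[CV-96] Will It Zero-Shot?: Predicting Zero-Shot Classification Performance For Arbitrary Queries

[Quick read]: This paper addresses the unpredictable performance of Vision-Language Models (VLMs) across domains: non-expert users have no easy way to judge whether a chosen VLM suits their task. The key to the solution is to combine text-only evaluation with synthetic images generated for the target task, and to use those images to improve the prediction of zero-shot accuracy. Experiments show that the image-based approach substantially improves prediction quality and gives users visual feedback on the kinds of images used in the assessment.

Link: https://arxiv.org/abs/2601.17535
Authors: Kevin Robbins, Xiaotong Liu, Yu Wu, Le Sun, Grady McPeak, Abby Stylianou, Robert Pless
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: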

Abstract:Vision-Language Models like CLIP create aligned embedding spaces for text and images, making it possible for anyone to build a visual classifier by simply naming the classes they want to distinguish. However, a model that works well in one domain may fail in another, and non-expert users have no straightforward way to assess whether their chosen VLM will work on their problem. We build on prior work using text-only comparisons to evaluate how well a model works for a given natural language task, and explore approaches that also generate synthetic images relevant to that task to evaluate and refine the prediction of zero-shot accuracy. We show that generated imagery to the baseline text-only scores substantially improves the quality of these predictions. Additionally, it gives a user feedback on the kinds of images that were used to make the assessment. Experiments on standard CLIP benchmark datasets demonstrate that the image-based approach helps users predict, without any labeled examples, whether a VLM will be effective for their application.

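[CV-97] FMIR, a foundation model-based Image Registration Framework for Robust Image Registration

[Quick read]: This paper addresses the limited generalization of deep learning in medical image registration, especially when training data is scarce. The key to the solution is FMIR, a foundation model-based registration framework that pairs a general-purpose, foundation model-based feature encoder for extracting anatomical structures with a general registration head, and is trained with a channel regularization strategy on a single dataset. FMIR achieves state-of-the-art (SOTA) in-domain performance while remaining robust on out-of-domain data, which offers a viable path toward generalizable medical imaging foundation models under limited resources.

Link: https://arxiv.org/abs/2601.17529
Authors: Fengting Zhang, Yue He, Qinghao Liu, Yaonan Wang, Xiang Chen, Hang Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: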

Abstract:Deep learning has revolutionized medical image registration by achieving unprecedented speeds, yet its clinical application is hindered by a limited ability to generalize beyond the training domain, a critical weakness given the typically small scale of medical datasets. In this paper, we introduce FMIR, a foundation model-based registration framework that overcomes this […]. […] a foundation model-based feature encoder for extracting anatomical structures with a general registration head, and trained with a channel regularization strategy on just a single dataset, FMIR achieves state-of-the-art (SOTA) in-domain performance while maintaining robust registration on out-of-domain […]. […] approach demonstrates a viable path toward building generalizable medical imaging foundation models with limited resources. The code is available at this https URL.

[CV-98] BMDS-Net: A Bayesian Multi-Modal Deep Supervision Network for Robust Brain Tumor Segmentation

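[Quick read]: This paper addresses two key problems that hinder clinical use of brain tumor segmentation from multi-modal magnetic resonance imaging (MRI): sensitivity to missing modalities, which are common in clinical practice, and a lack of confidence calibration, so that chasing higher Dice scores alone does not meet the safety requirements of medical deployment. The key to the solution is the BMDS-Net framework, whose main contributions are: 1) a robust deterministic backbone that combines a Zero-Init Multimodal Contextual Fusion (MMCF) module with a Residual-Gated Deep Decoder Supervision (DDS) mechanism, giving stable feature learning and more precise boundary delineation, with significantly reduced Hausdorff Distance even under modality corruption; 2) a memory-efficient Bayesian fine-tuning strategy that turns the network into a probabilistic predictor producing voxel-wise uncertainty maps, which help clinicians spot potential errors. Experiments on the BraTS 2021 dataset show that BMDS-Net maintains competitive segmentation accuracy and is far more stable than baseline models in missing-modality scenarios.

Link: https://arxiv.org/abs/2601.17504
Authors: Yan Zhou, Zhen Huang, Yingqiu Li, Yue Ouyang, Suncheng Xiang, Zehua Wang
Affiliations: Changsha University of Science and Technology; Shanghai Jiao Tong University; Shanghai Chest Hospital, Shanghai Jiao Tong University School of Medicine
Categories: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments: 16 pages, 5 figures. Manuscript prepared for submission to ACM TOMM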

Abstract:Accurate brain tumor segmentation from multi-modal magnetic resonance imaging (MRI) is a prerequisite for precise radiotherapy planning and surgical navigation. While recent Transformer-based models such as Swin UNETR have achieved impressive benchmark performance, their clinical utility is often compromised by two critical issues: sensitivity to missing modalities (common in clinical practice) and a lack of confidence calibration. Merely chasing higher Dice scores on idealized data fails to meet the safety requirements of real-world medical deployment. In this work, we propose BMDS-Net, a unified framework that prioritizes clinical robustness and trustworthiness over simple metric maximization. Our contribution is three-fold. First, we construct a robust deterministic backbone by integrating a Zero-Init Multimodal Contextual Fusion (MMCF) module and a Residual-Gated Deep Decoder Supervision (DDS) mechanism, enabling stable feature learning and precise boundary delineation with significantly reduced Hausdorff Distance, even under modality corruption. Second, and most importantly, we introduce a memory-efficient Bayesian fine-tuning strategy that transforms the network into a probabilistic predictor, providing voxel-wise uncertainty maps to highlight potential errors for clinicians. Third, comprehensive experiments on the BraTS 2021 dataset demonstrate that BMDS-Net not only maintains competitive accuracy but, more importantly, exhibits superior stability in missing-modality scenarios where baseline models fail. The source code is publicly available at this https URL.

[CV-99] PhaSR: Generalized Image Shadow Removal with Physically Aligned Priors

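[Quick read]: This paper addresses how to disentangle illumination from intrinsic reflectance when removing shadows under diverse lighting, a task on which conventional methods degrade markedly when physical priors are not properly aligned. The key to the solution is a dual-level prior alignment mechanism. First, Physically Aligned Normalization (PAN) performs closed-form illumination correction by combining Gray-world normalization, log-domain Retinex decomposition, and dynamic range recombination, which suppresses chromatic bias. Second, Geometric-Semantic Rectification Attention (GSRA) extends differential attention to cross-modal alignment, fusing depth-guided geometry with DINO-v2 semantic embeddings to ease modal conflicts under varying illumination. The method is robust and low in complexity across scenes ranging from a single light source to multi-source ambient lighting, and it generalizes to complex lighting better than traditional methods.

Link: https://arxiv.org/abs/2601.17470
Authors: Chia-Ming Lee, Yu-Fan Lin, Yu-Jou Hsiao, Jing-Hui Jung, Yu-Lun Liu, Chih-Chung Hsu
Affiliations: National Yang Ming Chiao Tung University; National Cheng Kung University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL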

Abstract:Shadow removal under diverse lighting conditions requires disentangling illumination from intrinsic reflectance, a challenge compounded when physical priors are not properly aligned. We propose PhaSR (Physically Aligned Shadow Removal), addressing this through dual-level prior alignment to enable robust performance from single-light shadows to multi-source ambient lighting. First, Physically Aligned Normalization (PAN) performs closed-form illumination correction via Gray-world normalization, log-domain Retinex decomposition, and dynamic range recombination, suppressing chromatic bias. Second, Geometric-Semantic Rectification Attention (GSRA) extends differential attention to cross-modal alignment, harmonizing depth-derived geometry with DINO-v2 semantic embeddings to resolve modal conflicts under varying illumination. Experiments show competitive performance in shadow removal with lower complexity and generalization to ambient lighting where traditional methods fail under multi-source illumination. Our source code is available at this https URL.
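To make the first PAN ingredient more concrete, here is a minimal sketch of classical Gray-world normalization, which rescales each colour channel so that all channel means match the overall mean. It covers only this one step; the log-domain Retinex decomposition and dynamic range recombination are not shown, and the function name and pixel layout are assumptions rather than the authors' implementation.

```python
def gray_world(pixels):
    """Classical Gray-world white balance.

    pixels: list of (r, g, b) float tuples.
    Returns a new list in which every channel has the same mean,
    equal to the average of the three original channel means.
    """
    count = len(pixels)
    means = [sum(p[c] for p in pixels) / count for c in range(3)]
    gray = sum(means) / 3.0
    # Leave a channel untouched if it is entirely zero (avoids division by zero).
    gains = [gray / m if m > 0 else 1.0 for m in means]
    return [tuple(p[c] * gains[c] for c in range(3)) for p in pixels]


# A bluish colour cast: the blue channel is three times the red one.
pixels = [(0.2, 0.4, 0.6), (0.4, 0.8, 1.2)]
balanced = gray_world(pixels)
balanced_means = [sum(p[c] for p in balanced) / len(balanced) for c in range(3)]
```

After the correction the three channel means coincide, which is the sense in which a global colour cast is removed before any further decomposition.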

[CV-100] ReflexSplit: Single Image Reflection Separation via Layer Fusion-Separation

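[Quick read]: This paper addresses the confusion between transmission and reflection layers caused by nonlinear mixing in Single Image Reflection Separation (SIRS), which is most pronounced in deep decoder layers and stems from implicit fusion mechanisms and inadequate multi-scale coordination. The key to the solution is ReflexSplit, a dual-stream framework with three core innovations: (1) Cross-scale Gated Fusion (CrGF), which adaptively aggregates semantic priors, texture details, and decoder context to stabilize gradient flow and keep features consistent; (2) Layer Fusion-Separation Blocks (LFSB), which alternate between fusion for shared structure extraction and separation for layer-specific disentanglement and, drawing on differential attention, use cross-stream subtraction to separate the two streams; (3) a curriculum training strategy that progressively strengthens differential separation through depth-dependent initialization and epoch-wise warmup, improving performance and generalization on synthetic and real-world data.

Link: https://arxiv.org/abs/2601.17468
Authors: Chia-Ming Lee, Yu-Fan Lin, Jing-Hui Jung, Yu-Jou Hsiao, Chih-Chung Hsu, Yu-Lun Liu
Affiliations: National Yang Ming Chiao Tung University; National Cheng Kung University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL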

Abstract:Single Image Reflection Separation (SIRS) disentangles mixed images into transmission and reflection layers. Existing methods suffer from transmission-reflection confusion under nonlinear mixing, particularly in deep decoder layers, due to implicit fusion mechanisms and inadequate multi-scale coordination. We propose ReflexSplit, a dual-stream framework with three key innovations. (1) Cross-scale Gated Fusion (CrGF) adaptively aggregates semantic priors, texture details, and decoder context across hierarchical depths, stabilizing gradient flow and maintaining feature consistency. (2) Layer Fusion-Separation Blocks (LFSB) alternate between fusion for shared structure extraction and differential separation for layer-specific disentanglement. Inspired by Differential Transformer, we extend attention cancellation to dual-stream separation via cross-stream subtraction. (3) Curriculum training progressively strengthens differential separation through depth-dependent initialization and epoch-wise warmup. Extensive experiments on synthetic and real-world benchmarks demonstrate state-of-the-art performance with superior perceptual quality and robust generalization. Our code is available at this https URL.

[CV-101] Coronary Artery Segmentation and Vessel-Type Classification in X-Ray Angiography

[Quick read]: This paper addresses vessel segmentation and vessel-type classification in X-ray coronary angiography (XCA), where low contrast, motion, foreshortening, overlap, and catheter interference degrade segmentation and cause domain shift across centers. Its key elements are: (1) selecting a best frame with a low-intensity histogram criterion and applying joint super-resolution and enhancement to improve image quality; (2) per-image parameter tuning via Support Vector Regression (SVR), which improves Dice over a single global parameter setting for the classical filters; (3) training an FPN with merged coronary and catheter labels, which improves robustness and external generalization; (4) a two-stage pipeline that first segments vessels (Dice up to 0.931) and then identifies vessel type (accuracy above 95%), supporting quantitative analysis that depends on anatomical localization.

Link: https://arxiv.org/abs/2601.17429
Authors: Mehdi Yousefzadeh, Siavash Shirzadeh Barough, Ashkan Fakharifar, Yashar Tayyarazad, Narges Eghbali, Mohaddeseh Mozaffari, Hoda Taeb, Negar Sadat Rafiee Tabatabaee, Parsa Esfahanian, Ghazaleh Sadeghi Gohar, Amineh Safavirad, Saeideh Mazloomzadeh, Ehsan khalilipur, Armin Elahifar, Majid Maleki
Affiliations: Institute for Research in Fundamental Sciences (IPM); Shahid Beheshti University; Shahid Beheshti University of Medical Sciences; Gilan University of Medical Sciences; Simon Fraser University; Alborz University of Medical Sciences; Iran University of Medical Sciences; Rajaie Cardiovascular Medical and Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:X-ray coronary angiography (XCA) is the clinical reference standard for assessing coronary artery disease, yet quantitative analysis is limited by the difficulty of robust vessel segmentation in routine data. Low contrast, motion, foreshortening, overlap, and catheter confounding degrade segmentation and contribute to domain shift across centers. Reliable segmentation, together with vessel-type labeling, enables vessel-specific coronary analytics and downstream measurements that depend on anatomical localization. From 670 cine sequences (407 subjects), we select a best frame near peak opacification using a low-intensity histogram criterion and apply joint super-resolution and enhancement. We benchmark classical Meijering, Frangi, and Sato vesselness filters under per-image oracle tuning, a single global mean setting, and per-image parameter prediction via Support Vector Regression (SVR). Neural baselines include U-Net, FPN, and a Swin Transformer, trained with coronary-only and merged coronary+catheter supervision. A second stage assigns vessel identity (LAD, LCX, RCA). External evaluation uses the public DCA1 cohort. SVR per-image tuning improves Dice over global means for all classical filters (e.g., Frangi: 0.759 vs. 0.741). Among deep models, FPN attains 0.914+/-0.007 Dice (coronary-only), and merged coronary+catheter labels further improve to 0.931+/-0.006. On DCA1 as a strict external test, Dice drops to 0.798 (coronary-only) and 0.814 (merged), while light in-domain fine-tuning recovers to 0.881+/-0.014 and 0.882+/-0.015. Vessel-type labeling achieves 98.5% accuracy (Dice 0.844) for RCA, 95.4% (0.786) for LAD, and 96.2% (0.794) for LCX. Learned per-image tuning strengthens classical pipelines, while high-resolution FPN models and merged-label supervision improve stability and external transfer with modest adaptation.
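Since every result in this abstract is reported as a Dice score, a small sketch of the standard Dice coefficient for binary masks may help when reading the numbers. The flat-list mask format and the convention of returning 1.0 for two empty masks are assumptions made here for illustration, not details taken from the paper.

```python
def dice(pred, target):
    """Dice coefficient 2*|A ∩ B| / (|A| + |B|) for flat binary masks.

    pred, target: equal-length sequences of 0/1 values.
    Returns 1.0 when both masks are empty (a common convention, assumed here).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 1.0 if total == 0 else 2.0 * intersection / total


# Two masks with two foreground pixels each, overlapping in one pixel.
example_score = dice([1, 1, 0, 0], [1, 0, 1, 0])
```

A score of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they share no foreground pixel.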

[CV-102] CoT-Seg: Rethinking Segmentation with Chain-of-Thought Reasoning and Self-Correction

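[Quick read]: This paper addresses the weakness of existing reasoning segmentation methods on complex queries and out-of-domain images, where ambiguous semantics or implicit prompts make the target object hard to identify. The key to the solution is CoT-Seg, a training-free framework that combines chain-of-thought reasoning with self-correction. It uses the inherent reasoning ability of pre-trained multimodal large language models (MLLMs, such as GPT-4o) to decompose complex queries into meta-instructions and extract fine-grained semantics from the image, then iteratively refines the segmentation mask based on self-evaluation. This design improves reliability and robustness in ambiguous or error-prone cases, and it allows retrieval-augmented reasoning to be added for inputs that lack sufficient information, yielding a vision-language segmentation paradigm closer to how humans approach hard questions.

Link: https://arxiv.org/abs/2601.17420
Authors: Shiu-hong Kao, Chak Ho Huang, Huaiqian Liu, Yu-Wing Tai, Chi-Keung Tang
Affiliations: The Hong Kong University of Science and Technology; Dartmouth College
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL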

Abstract:Existing works of reasoning segmentation often fall short in complex cases, particularly when addressing complicated queries and out-of-domain images. Inspired by the chain-of-thought reasoning, where harder problems require longer thinking steps/time, this paper aims to explore a system that can think step-by-step, look up information if needed, generate results, self-evaluate its own results, and refine the results, in the same way humans approach harder questions. We introduce CoT-Seg, a training-free framework that rethinks reasoning segmentation by combining chain-of-thought reasoning with self-correction. Instead of fine-tuning, CoT-Seg leverages the inherent reasoning ability of pre-trained MLLMs (GPT-4o) to decompose queries into meta-instructions, extract fine-grained semantics from images, and identify target objects even under implicit or complex prompts. Moreover, CoT-Seg incorporates a self-correction stage: the model evaluates its own segmentation against the original query and reasoning trace, identifies mismatches, and iteratively refines the mask. This tight integration of reasoning and correction significantly improves reliability and robustness, especially in ambiguous or error-prone cases. Furthermore, our CoT-Seg framework allows easy incorporation of retrieval-augmented reasoning, enabling the system to access external knowledge when the input lacks sufficient information. To showcase CoT-Seg’s ability to handle very challenging cases, we introduce a new dataset ReasonSeg-Hard. Our results highlight that combining chain-of-thought reasoning, self-correction, offers a powerful paradigm for vision-language integration driven segmentation.

[CV-103] Cloud-Enabled IoT System for Real-Time Environmental Monitoring and Remote Device Control Using Firebase

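[Quick read]: This paper addresses the limitations of traditional remote monitoring systems in real-time data access, remote control, and cloud integration. The key to the solution is a cloud-based Internet of Things (IoT) system that uses Google's Firebase Realtime Database for data collection and device control synchronized across multiple devices. The system is built around an ESP32 microcontroller with a DHT22 temperature/humidity sensor and an HC-SR04 ultrasonic distance sensor, and it controls two LED indicators remotely through a cloud interface. Without complex server infrastructure, it achieves control latency under 1.5 seconds, a 99.2% data transmission success rate, and persistent storage, offering a scalable, low-cost (total cost of $32.50) solution for scenarios such as smart home automation and industrial monitoring.

Link: https://arxiv.org/abs/2601.17414
Authors: Abdul Hasib, A. S. M. Ahsanul Sarkar Akib
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: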

Abstract:The proliferation of Internet of Things (IoT) devices has created unprecedented opportunities for remote monitoring and control applications across various domains. Traditional monitoring systems often suffer from limitations in real-time data accessibility, remote controllability, and cloud integration. This paper presents a cloud-enabled IoT system that leverages Google’s Firebase Realtime Database for synchronized environmental monitoring and device control. The system utilizes an ESP32 microcontroller to interface with a DHT22 temperature/humidity sensor and an HC-SR04 ultrasonic distance sensor, while enabling remote control of two LED indicators through a cloud-based interface. Real-time sensor data is transmitted to Firebase, providing a synchronized platform accessible from multiple devices simultaneously. Experimental results demonstrate reliable data transmission with 99.2% success rate, real-time control latency under 1.5 seconds, and persistent data storage for historical analysis. The system architecture offers a scalable framework for various IoT applications, from smart home automation to industrial monitoring, with a total implementation cost of $32.50. The integration of Firebase provides robust cloud capabilities without requiring complex server infrastructure, making advanced IoT applications accessible to developers and researchers with limited resources.

[CV-104] Source-Free Domain Adaptation by Optimizing Batch-Wise Cosine Similarity

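[Quick read]: This paper addresses a central challenge in Source-Free Domain Adaptation (SFDA): adapting a model from a labeled source domain to an unlabeled target domain without access to the source data. Existing methods usually rely on neighborhood consistency but are prone to errors caused by misleading neighborhood information. The core of the solution is the concept of a neighborhood signature: by learning more informative clusters and mitigating the effect of noisy neighbors, adaptation can be achieved with a single loss term designed to optimize the similarity and dissimilarity of sample predictions in the target domain. Experiments show that the method outperforms existing approaches on the challenging VisDA dataset and is competitive on other benchmark datasets.

Link: https://arxiv.org/abs/2601.17408
Authors: Harsharaj Pathak, Vineeth N Balasubramanian
Affiliations: Indian Institute of Technology Hyderabad
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: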

Abstract:Source-Free Domain Adaptation (SFDA) is an emerging area of research that aims to adapt a model trained on a labeled source domain to an unlabeled target domain without accessing the source data. Most of the successful methods in this area rely on the concept of neighborhood consistency but are prone to errors due to misleading neighborhood information. In this paper, we explore this approach from the point of view of learning more informative clusters and mitigating the effect of noisy neighbors using a concept called neighborhood signature, and demonstrate that adaptation can be achieved using just a single loss term tailored to optimize the similarity and dissimilarity of predictions of samples in the target domain. In particular, our proposed method outperforms existing methods in the challenging VisDA dataset while also yielding competitive results on other benchmark datasets.
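The title refers to batch-wise cosine similarity between predictions, so the sketch below shows only that basic ingredient: the pairwise cosine similarity matrix over a batch of prediction vectors. It is not the paper's loss. How the neighborhood signature selects which pairs to pull together or push apart is not described in the abstract, and the function name and zero-vector handling are assumptions.

```python
import math


def cosine_matrix(batch):
    """Pairwise cosine similarity between the rows of `batch`.

    batch: list of equal-length prediction vectors (e.g. softmax outputs).
    Rows with zero norm get similarity 0.0 to everything (assumed convention).
    """
    norms = [math.sqrt(sum(x * x for x in row)) for row in batch]
    matrix = []
    for u, norm_u in zip(batch, norms):
        row = []
        for v, norm_v in zip(batch, norms):
            if norm_u == 0.0 or norm_v == 0.0:
                row.append(0.0)
            else:
                row.append(sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v))
        matrix.append(row)
    return matrix


# Three predictions: two confident in different classes, one split between them.
batch = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
similarity = cosine_matrix(batch)
```

A loss built on such a matrix could raise the entries for pairs believed to share a class and lower the others, which is one plausible reading of optimizing the similarity and dissimilarity of predictions.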

[CV-105] HAAF: Hierarchical Adaptation and Alignment of Foundation Models for Few-Shot Pathology Anomaly Detection

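[Quick read]: This paper addresses the difficulty of detecting subtle morphological abnormalities in precision pathology that results from a Granularity Mismatch in Vision-Language (V-L) models: generic semantic representations fail to capture texture-rich, locally subtle lesion features within specific Regions of Interest (ROIs). The key to the solution is the Hierarchical Adaptation and Alignment Framework (HAAF), whose core innovation is a Cross-Level Scaled Alignment (CLSA) mechanism. CLSA first injects visual features into the text prompts to generate content-adaptive descriptors, which then spatially guide the visual encoder to focus on anomalous regions, grounding semantic prompts in ROI-specific visual context. A dual-branch inference strategy additionally combines semantic scores with geometric prototypes to improve stability in few-shot settings.

Link: https://arxiv.org/abs/2601.17405
Authors: Chunze Yang, Wenjie Zhao, Yue Tang, Junbo Lu, Jiusong Ge, Qidong Liu, Zeyu Gao, Chen Li
Affiliations: Xi’an Jiaotong University; University of Cambridge
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: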

Abstract:Precision pathology relies on detecting fine-grained morphological abnormalities within specific Regions of Interest (ROIs), as these local, texture-rich cues - rather than global slide contexts - drive expert diagnostic reasoning. While Vision-Language (V-L) models promise data efficiency by leveraging semantic priors, adapting them faces a critical Granularity Mismatch, where generic representations fail to resolve such subtle defects. Current adaptation methods often treat modalities as independent streams, failing to ground semantic prompts in ROI-specific visual contexts. To bridge this gap, we propose the Hierarchical Adaptation and Alignment Framework (HAAF). At its core is a novel Cross-Level Scaled Alignment (CLSA) mechanism that enforces a sequential calibration order: visual features first inject context into text prompts to generate content-adaptive descriptors, which then spatially guide the visual encoder to spotlight anomalies. Additionally, a dual-branch inference strategy integrates semantic scores with geometric prototypes to ensure stability in few-shot settings. Experiments on four benchmarks show HAAF significantly outperforms state-of-the-art methods and effectively scales with domain-specific backbones (e.g., CONCH) in low-resource scenarios.

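[CV-106] ReLE: A Scalable System and Structured Benchmark for Diagnosing Capability Anisotropy in Chinese LLMs

[Quick read]: This paper addresses two challenges in evaluating Large Language Models (LLMs) in Chinese: distorted rankings caused by benchmark saturation, and high computational cost that limits frequent, fine-grained capability diagnosis. The authors propose ReLE (Robust Efficient Live Evaluation), a scalable live evaluation system built on two methodological contributions: a Symbolic-Grounded Hybrid Scoring Mechanism that eliminates embedding-based false positives in reasoning tasks, and a Dynamic Variance-Aware Scheduler based on Neyman allocation with noise correction, which cuts compute cost by 70% relative to full-pass evaluation while keeping a ranking correlation of ρ = 0.96. Evaluating 304 models on a Domain × Capability orthogonal matrix of 207,843 samples reveals pronounced Capability Anisotropy, meaning non-uniform performance across domains, and shows that aggregate rankings are highly sensitive to weighting schemes, with a Rank Stability Amplitude (RSA) of 11.4 in ReLE versus about 5.0 in traditional benchmarks.

Link: https://arxiv.org/abs/2601.17399
Authors: Rui Fang, Jian Li, Wei Chen, Bin Hu, Ying-Cong Chen, Xin Tang, Liang Diao
Affiliations: Sun Yat-sen University; ReLE Benchmark Team; NSFOCUS Technologies Co., Ltd.; HKUST-GZ; Huawei; Ping An Property & Casualty Insurance Company of China, Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: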

Abstract:Large Language Models (LLMs) have achieved rapid progress in Chinese language understanding, yet accurately evaluating their capabilities remains challenged by benchmark saturation and prohibitive computational costs. While static leaderboards provide snapshot rankings, they often mask the structural trade-offs between capabilities. In this work, we present ReLE (Robust Efficient Live Evaluation), a scalable system designed to diagnose Capability Anisotropy, the non-uniformity of model performance across domains. Using ReLE, we evaluate 304 models (189 commercial, 115 open-source) across a Domain × Capability orthogonal matrix comprising 207,843 samples. We introduce two methodological contributions to address current evaluation pitfalls: (1) A Symbolic-Grounded Hybrid Scoring Mechanism that eliminates embedding-based false positives in reasoning tasks; (2) A Dynamic Variance-Aware Scheduler based on Neyman allocation with noise correction, which reduces compute costs by 70% compared to full-pass evaluations while maintaining a ranking correlation of ρ = 0.96. Our analysis reveals that aggregate rankings are highly sensitive to weighting schemes: models exhibit a Rank Stability Amplitude (RSA) of 11.4 in ReLE versus ~5.0 in traditional benchmarks, confirming that modern models are highly specialized rather than generally superior. We position ReLE not as a replacement for comprehensive static benchmarks, but as a high-frequency diagnostic monitor for the evolving model landscape.
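The scheduler in this abstract is based on Neyman allocation, the textbook rule that splits a fixed sampling budget across strata in proportion to stratum size times stratum standard deviation. The sketch below shows only that classical rule; ReLE's noise correction and dynamic scheduling are not reproduced, and the function name and the equal-split fallback are assumptions.

```python
def neyman_allocation(sizes, stds, budget):
    """Classical Neyman allocation: n_h = budget * N_h * s_h / sum_k(N_k * s_k).

    sizes:  number of items N_h in each stratum (e.g. each benchmark cell).
    stds:   estimated score standard deviation s_h in each stratum.
    budget: total number of samples to evaluate.
    Returns fractional sample counts; rounding is left to the caller.
    """
    if len(sizes) != len(stds):
        raise ValueError("sizes and stds must have the same length")
    weights = [n * s for n, s in zip(sizes, stds)]
    total = sum(weights)
    if total == 0:
        # No variance anywhere: fall back to an equal split (assumed behaviour).
        return [budget / len(sizes)] * len(sizes)
    return [budget * w / total for w in weights]


# Two equally sized strata; the second is three times as variable.
allocation = neyman_allocation(sizes=[100, 100], stds=[1.0, 3.0], budget=40)
```

Strata where scores vary more receive more of the budget, which is how a variance-aware scheduler can reduce total evaluation cost while keeping the ranking estimate stable.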

[CV-107] SMV-EAR: Bring Spatiotemporal Multi-View Representation Learning into Efficient Event-Based Action Recognition

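[Quick read]: This paper targets two weaknesses of existing spatiotemporal multi-view representation learning (SMVRL) methods when applied to event-camera action recognition (EAR): projecting sparse events along the spatial axes H and W yields a translation-variant spatial binning representation, and naive early concatenation fails to model the sample-wise complementarity of motion features from different views. The proposed solution has three parts: (i) a principled spatiotemporal multi-view representation built through translation-invariant dense conversion of sparse events; (ii) a dual-branch dynamic fusion architecture that explicitly models sample-wise complementarity between views; and (iii) a bio-inspired temporal warping augmentation that mimics the speed variability of real human actions. On three challenging EAR datasets the framework improves Top-1 accuracy over the existing SMVRL method by +7.0%, +10.7%, and +10.2%, with 30.1% fewer parameters and 35.7% less computation.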

Link: https://arxiv.org/abs/2601.17391
Authors: Rui Fan, Weidong Hao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Event camera action recognition (EAR) offers compelling privacy-protecting and efficiency advantages, and temporal motion dynamics are of great importance to it. Existing spatiotemporal multi-view representation learning (SMVRL) methods for event-based object recognition (EOR) offer promising solutions by projecting H-W-T events along the spatial axes H and W, yet they are limited by a translation-variant spatial binning representation and a naive early concatenation fusion architecture. This paper reexamines the key SMVRL design stages for EAR and proposes: (i) a principled spatiotemporal multi-view representation through translation-invariant dense conversion of sparse events, (ii) a dual-branch, dynamic fusion architecture that models sample-wise complementarity between motion features from different views, and (iii) a bio-inspired temporal warping augmentation that mimics the speed variability of real-world human actions. On three challenging EAR datasets, HARDVS, DailyDVS-200, and THU-EACT-50-CHL, we show +7.0%, +10.7%, and +10.2% Top-1 accuracy gains over the existing SMVRL EOR method with 30.1% fewer parameters and 35.7% lower computation, establishing our framework as a novel and powerful EAR paradigm.

[CV-108] ONRW: Optimizing inversion noise for high-quality and robust watermark

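[Quick read]: This paper addresses the limited robustness of existing deep-learning image watermarking methods to noise and other image corruptions introduced during transmission, which restricts their practical value. The authors propose a high-quality, robust watermarking framework based on a diffusion model. A clean image is first converted into inversion noise through null-text optimization; the noise is then optimized in latent space, and the diffusion model's iterative denoising produces the watermarked image. The denoising process preserves visual quality and also improves robustness to a range of corruptions. To keep the noise optimization from distorting the original semantics, the method adds self-attention constraints and pseudo-mask strategies that maintain content consistency.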

Link: https://arxiv.org/abs/2601.17388
Authors: Xuan Ding, Xiu Yan, Chuanlong Xie, Yao Zhu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint. Under review

Abstract:Watermarking methods have always been effective means of protecting intellectual property, yet they face significant challenges. Although existing deep learning-based watermarking systems can hide watermarks in images with minimal impact on image quality, they often lack robustness when encountering image corruptions during transmission, which undermines their practical application value. To this end, we propose a high-quality and robust watermark framework based on the diffusion model. Our method first converts the clean image into inversion noise through a null-text optimization process, and after optimizing the inversion noise in the latent space, it produces a high-quality watermarked image through an iterative denoising process of the diffusion model. The iterative denoising process serves as a powerful purification mechanism, ensuring both the visual quality of the watermarked image and enhancing the robustness of the watermark against various corruptions. To prevent the optimizing of inversion noise from distorting the original semantics of the image, we specifically introduced self-attention constraints and pseudo-mask strategies. Extensive experimental results demonstrate the superior performance of our method against various image corruptions. In particular, our method outperforms the stable signature method by an average of 10% across 12 different image transformations on COCO datasets. Our codes are available at this https URL.

[CV-109] Physical Prompt Injection Attacks on Large Vision-Language Models

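[Quick read]: This paper studies Physical Prompt Injection Attacks (PPIA) against Large Vision-Language Models (LVLMs) deployed in real physical environments, where an attacker embeds malicious visual instructions in physical objects to induce incorrect model behavior. Existing attacks typically require access to input channels or prior knowledge of user queries, which is rarely available in practice. The proposed attack is black-box and query-agnostic: it selects highly recognizable, semantically effective visual prompts offline, then places them strategically in the environment under the guidance of spatiotemporal attention so that the prompts are both perceivable and influential. The attack needs no access to the model, its inputs, or its internal pipeline and operates solely through visual observation. Across 10 state-of-the-art LVLMs in simulated and real-world settings it reaches attack success rates of up to 98% and remains robust to changes in distance, viewpoint, and illumination.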

Link: https://arxiv.org/abs/2601.17383
Authors: Chen Ling, Kai Hu, Hangcheng Liu, Xingshuo Han, Tianwei Zhang, Changhai Ou
Affiliations: Wuhan University (武汉大学); Nanyang Technological University (南洋理工大学); Nanjing University of Aeronautics and Astronautics (南京航空航天大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Large Vision-Language Models (LVLMs) are increasingly deployed in real-world intelligent systems for perception and reasoning in open physical environments. While LVLMs are known to be vulnerable to prompt injection attacks, existing methods either require access to input channels or depend on knowledge of user queries, assumptions that rarely hold in practical deployments. We propose the first Physical Prompt Injection Attack (PPIA), a black-box, query-agnostic attack that embeds malicious typographic instructions into physical objects perceivable by the LVLM. PPIA requires no access to the model, its inputs, or internal pipeline, and operates solely through visual observation. It combines offline selection of highly recognizable and semantically effective visual prompts with strategic environment-aware placement guided by spatiotemporal attention, ensuring that the injected prompts are both perceivable and influential on model behavior. We evaluate PPIA across 10 state-of-the-art LVLMs in both simulated and real-world settings on tasks including visual question answering, planning, and navigation. PPIA achieves attack success rates up to 98%, with strong robustness under varying physical conditions such as distance, viewpoint, and illumination. Our code is publicly available at this https URL.

[CV-110] UCAD: Uncertainty-guided Contour-aware Displacement for semi-supervised medical image segmentation

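[Quick read]: This paper addresses a limitation of displacement strategies in semi-supervised medical image segmentation: they operate only on rectangular regions and ignore anatomical structure, which causes boundary distortion and semantic inconsistency. The proposed Uncertainty-guided Contour-Aware Displacement framework (UCAD) uses superpixels to generate anatomically coherent regions aligned with anatomical boundaries, and an uncertainty-guided selection mechanism that selectively displaces challenging regions to strengthen consistency learning. A dynamic uncertainty-weighted consistency loss adaptively stabilizes training and regularizes the model on unlabeled regions, improving segmentation accuracy under limited annotation.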

Link: https://arxiv.org/abs/2601.17366
Authors: Chengbo Ding, Fenghe Tang, Shaohua Kevin Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ISBI 2026

Abstract:Existing displacement strategies in semi-supervised segmentation only operate on rectangular regions, ignoring anatomical structures and resulting in boundary distortions and semantic inconsistency. To address these issues, we propose UCAD, an Uncertainty-Guided Contour-Aware Displacement framework for semi-supervised medical image segmentation that preserves contour-aware semantics while enhancing consistency learning. Our UCAD leverages superpixels to generate anatomically coherent regions aligned with anatomy boundaries, and an uncertainty-guided selection mechanism to selectively displace challenging regions for better consistency learning. We further propose a dynamic uncertainty-weighted consistency loss, which adaptively stabilizes training and effectively regularizes the model on unlabeled regions. Extensive experiments demonstrate that UCAD consistently outperforms state-of-the-art semi-supervised segmentation methods, achieving superior segmentation accuracy under limited annotation. The code is available at:this https URL.

[CV-111] PocketGS: On-Device Training of 3D Gaussian Splatting for High Perceptual Modeling

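[Quick read]: This paper addresses on-device training of 3D Gaussian Splatting (3DGS). Existing methods assume unconstrained training resources and therefore fail on mobile devices, which are limited to minute-scale training budgets and tight peak-memory ceilings. PocketGS resolves this with three co-designed operators: G builds geometry-faithful point-cloud priors; I injects local surface statistics to seed anisotropic Gaussians, reducing early conditioning gaps; and T unrolls alpha compositing with cached intermediates and index-mapped gradient scattering for stable backpropagation on mobile hardware. Together the operators balance training efficiency, memory compactness, and modeling fidelity. The authors report that PocketGS outperforms a mainstream workstation 3DGS baseline and supports a fully on-device capture-to-rendering workflow.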

Link: https://arxiv.org/abs/2601.17354
Authors: Wenzhi Guo, Guangchi Fang, Shu Yang, Bing Wang
Affiliations: Hong Kong Polytechnic University (香港理工大学); Nanjing University (南京大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:Efficient and high-fidelity 3D scene modeling is a long-standing pursuit in computer graphics. While recent 3D Gaussian Splatting (3DGS) methods achieve impressive real-time modeling performance, they rely on resource-unconstrained training assumptions that fail on mobile devices, which are limited by minute-scale training budgets and hardware-available peak-memory. We present PocketGS, a mobile scene modeling paradigm that enables on-device 3DGS training under these tightly coupled constraints while preserving high perceptual fidelity. Our method resolves the fundamental contradictions of standard 3DGS through three co-designed operators: G builds geometry-faithful point-cloud priors; I injects local surface statistics to seed anisotropic Gaussians, thereby reducing early conditioning gaps; and T unrolls alpha compositing with cached intermediates and index-mapped gradient scattering for stable mobile backpropagation. Collectively, these operators satisfy the competing requirements of training efficiency, memory compactness, and modeling fidelity. Extensive experiments demonstrate that PocketGS is able to outperform the powerful mainstream workstation 3DGS baseline to deliver high-quality reconstructions, enabling a fully on-device, practical capture-to-rendering workflow.
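
Operator T is described as unrolling alpha compositing. For readers unfamiliar with the term, the sketch below shows only the standard front-to-back compositing recurrence for one pixel: each contribution is weighted by its opacity times the transmittance accumulated so far. It is a plain forward pass under that standard definition; the paper's cached intermediates and index-mapped gradient scattering are not reproduced, and the function name is illustrative.

```python
def composite_front_to_back(colors, alphas):
    """Blend depth-sorted (nearest first) RGB contributions for one pixel.

    colors: list of (r, g, b) tuples
    alphas: list of opacities in [0, 1], same order as `colors`
    Returns the blended color and the remaining transmittance.
    """
    blended = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for color, alpha in zip(colors, alphas):
        weight = alpha * transmittance
        blended = [acc + weight * channel for acc, channel in zip(blended, color)]
        transmittance *= 1.0 - alpha
    return blended, transmittance

# A half-opaque red splat in front of a half-opaque green splat.
pixel, remaining = composite_front_to_back(
    [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], [0.5, 0.5]
)
```

The intermediate `transmittance` values are what a backward pass needs, which is presumably why caching them matters on memory-limited devices.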

[CV-112] HyDeMiC: A Deep Learning-based Mineral Classifier using Hyperspectral Data

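[Quick read]: This paper addresses the difficulty that traditional mineral classification methods have with environmental noise, sensor limitations, and the computational complexity of high-dimensional hyperspectral imaging (HSI) data. The authors propose HyDeMiC, a convolutional neural network (CNN) mineral classifier trained on laboratory-measured spectra of 115 minerals from the United States Geological Survey (USGS) library; training samples are generated by convolving the reference spectra with an HSI sensor response function. On synthetic datasets with noise levels of 1%, 2%, 5%, and 10%, HyDeMiC reaches near-perfect classification (MCC = 1.00) on clean and low-noise data and maintains strong performance under moderate noise, suggesting it can cope with the noise encountered in field HSI applications.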

Link: https://arxiv.org/abs/2601.17352
Authors: M. L. Mamud, Piyoosh Jaysaval, Frederick D Day-Lewis, M. K. Mudunuru
Affiliations: Pacific Northwest National Laboratory (太平洋西北国家实验室)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Hyperspectral imaging (HSI) has emerged as a powerful remote sensing tool for mineral exploration, capitalizing on unique spectral signatures of minerals. However, traditional classification methods such as discriminant analysis, logistic regression, and support vector machines often struggle with environmental noise in data, sensor limitations, and the computational complexity of analyzing high-dimensional HSI data. This study presents HyDeMiC (Hyperspectral Deep Learning-based Mineral Classifier), a convolutional neural network (CNN) model designed for robust mineral classification under noisy data. To train HyDeMiC, laboratory-measured hyperspectral data for 115 minerals spanning various mineral groups were used from the United States Geological Survey (USGS) library. The training dataset was generated by convolving reference mineral spectra with an HSI sensor response function. These datasets contained three copper-bearing minerals, Cuprite, Malachite, and Chalcopyrite, used as case studies for performance demonstration. The trained CNN model was evaluated on several synthetic 2D hyperspectral datasets with noise levels of 1%, 2%, 5%, and 10%. Our noisy data analysis aims to replicate realistic field conditions. The HyDeMiC’s performance was assessed using the Matthews Correlation Coefficient (MCC), providing a comprehensive measure across different noise regimes. Results demonstrate that HyDeMiC achieved near-perfect classification accuracy (MCC = 1.00) on clean and low-noise datasets and maintained strong performance under moderate noise conditions. These findings emphasize HyDeMiC’s robustness in the presence of moderate noise, highlighting its potential for real-world applications in hyperspectral imaging, where noise is often a significant challenge.
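
The results above are reported as a Matthews Correlation Coefficient (MCC). As a reminder of what that number means, the sketch below computes the binary form of MCC from confusion-matrix counts. The paper's task is multi-class, so this is a simplified illustration, not the authors' evaluation code.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """Binary MCC from true/false positive/negative counts.

    Returns a value in [-1, 1]: 1 is perfect prediction, 0 is no better
    than chance, -1 is total disagreement.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Conventional value when any marginal count is zero.
        return 0.0
    return (tp * tn - fp * fn) / denominator

perfect = matthews_corrcoef(tp=50, tn=50, fp=0, fn=0)
inverted = matthews_corrcoef(tp=0, tn=0, fp=50, fn=50)
```

Unlike plain accuracy, MCC stays informative when classes are imbalanced, which makes it a reasonable choice when one mineral dominates a scene.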

[CV-113] NeRF-MIR: Towards High-Quality Restoration of Masked Images with Neural Radiance Fields

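[Quick read]: This paper addresses the degraded quality of Neural Radiance Fields (NeRF) scene reconstruction when the input images are masked, a corruption common in natural scene capture. The proposed method, NeRF-MIR, has three main components: (1) a Patch-based Entropy for Ray Emitting (PERE) strategy that distributes emitted rays so the model learns intricate textures more effectively; (2) a Progressively Iterative REstoration (PIRE) mechanism that restores masked regions through self-training; and (3) a dynamically weighted loss function that automatically recalibrates the loss weights for masked regions. Experiments on real data and on three constructed masked datasets show that NeRF-MIR restores masked images better than competing methods.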

Link: https://arxiv.org/abs/2601.17350
Authors: Xianliang Huang, Zhizhou Zhong, Shuhang Chen, Yi Xu, Juhong Guan, Shuigeng Zhou
Affiliations: Fudan University (复旦大学); Tongji University (同济大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 14 pages, 15 figures

Abstract: Neural Radiance Fields (NeRF) have demonstrated remarkable performance in novel view synthesis. However, there is much room for improvement in restoring 3D scenes based on NeRF from corrupted images, which are common in natural scene captures and can significantly impact the effectiveness of NeRF. This paper introduces NeRF-MIR, a novel neural rendering approach specifically proposed for the restoration of masked images, demonstrating the potential of NeRF in this domain. Recognizing that randomly emitting rays to pixels in NeRF may not effectively learn intricate image textures, we propose a Patch-based Entropy for Ray Emitting (PERE) strategy to distribute emitted rays properly. This enables NeRF-MIR to fuse comprehensive information from images of different views. Additionally, we introduce a Progressively Iterative REstoration (PIRE) mechanism to restore the masked regions in a self-training process. Furthermore, we design a dynamically-weighted loss function that automatically recalibrates the loss weights for masked regions. As existing datasets do not support NeRF-based masked image restoration, we construct three masked datasets to simulate corrupted scenarios. Extensive experiments on real data and constructed datasets demonstrate the superiority of NeRF-MIR over its counterparts in masked image restoration.

[CV-114] Revisiting Lightweight Low-Light Image Enhancement: From a YUV Color Space Perspective

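[Quick read]: This paper addresses the trade-off between visual quality and model compactness in Lightweight Low-Light Image Enhancement (L3IE). Existing methods simplify network design with disentangling strategies such as Retinex theory and YUV color space transformations, but they overlook channel-specific degradation patterns and cross-channel interactions. Through frequency-domain analysis the authors confirm the advantage of the YUV color space for L3IE and observe that the luminance channel (Y) mainly loses low-frequency content while the chrominance channels (UV) are corrupted by high-frequency noise. Based on this, they propose a YUV-based paradigm that restores the Y channel with a Dual-Stream Global-Local Attention module, processes the UV channels with a Y-guided Local-Aware Frequency Attention module, and fuses features with a Guided Interaction module, achieving better visual quality with far fewer parameters.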

Link: https://arxiv.org/abs/2601.17349
Authors: Hailong Yan, Shice Liu, Xiangtao Zhang, Lujian Yao, Fengxiang Yang, Jinwei Chen, Bo Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Tech report

Abstract:In the current era of mobile internet, Lightweight Low-Light Image Enhancement (L3IE) is critical for mobile devices, which faces a persistent trade-off between visual quality and model compactness. While recent methods employ disentangling strategies to simplify lightweight architectural design, such as Retinex theory and YUV color space transformations, their performance is fundamentally limited by overlooking channel-specific degradation patterns and cross-channel interactions. To address this gap, we perform a frequency-domain analysis that confirms the superiority of the YUV color space for L3IE. We identify a key insight: the Y channel primarily loses low-frequency content, while the UV channels are corrupted by high-frequency noise. Leveraging this finding, we propose a novel YUV-based paradigm that strategically restores channels using a Dual-Stream Global-Local Attention module for the Y channel, a Y-guided Local-Aware Frequency Attention module for the UV channels, and a Guided Interaction module for final feature fusion. Extensive experiments validate that our model establishes a new state-of-the-art on multiple benchmarks, delivering superior visual quality with a significantly lower parameter count.
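
The method depends on separating luminance (Y) from chrominance (U, V). The sketch below shows one common RGB-to-YUV conversion using the BT.601 luma weights. The abstract does not say which YUV variant the authors use, so these coefficients are an assumption chosen for illustration.

```python
def rgb_to_yuv(r, g, b):
    """Convert one RGB pixel (components in [0, 1]) to YUV.

    Uses BT.601 luma weights; U and V are scaled color differences.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b  # luminance
    u = 0.492 * (b - y)                    # blue-difference chroma
    v = 0.877 * (r - y)                    # red-difference chroma
    return y, u, v

# A neutral gray pixel carries luminance only; both chroma values are ~0.
gray_y, gray_u, gray_v = rgb_to_yuv(0.5, 0.5, 0.5)
```

Because brightness lives entirely in Y, a network can treat under-exposure (mostly a Y problem) and color noise (mostly a UV problem) with separate modules, which is the decoupling the paper exploits.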

[CV-115] STARS: Shared-specific Translation and Alignment for missing-modality Remote Sensing Semantic Segmentation

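[Quick read]: This paper addresses the performance drop of multimodal remote sensing fusion models when a modality (for example, optical imagery or a Digital Surface Model, DSM) is missing, and in particular the feature collapse and overly generalized recovered features seen in existing methods. The proposed STARS framework rests on two designs. First, an asymmetric alignment mechanism with bidirectional translation and stop-gradient prevents feature collapse and reduces sensitivity to hyperparameters. Second, a Pixel-level Semantic sampling Alignment (PSA) strategy combines class-balanced pixel sampling with a cross-modality semantic alignment loss to mitigate alignment failures caused by severe class imbalance and to improve minority-class recognition.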

Link: https://arxiv.org/abs/2601.17342
Authors: Tong Wang, Xiaodong Zhang, Guanzhou Chen, Jiaqi Wang, Chenxi Liu, Xiaoliang Tan, Wenchao Guo, Xuyang Li, Xuanrui Wang, Zifan Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Multimodal remote sensing technology significantly enhances the understanding of surface semantics by integrating heterogeneous data such as optical images, Synthetic Aperture Radar (SAR), and Digital Surface Models (DSM). However, in practical applications, missing modality data (e.g., optical or DSM) is a common and severe challenge, which leads to performance decline in traditional multimodal fusion models. Existing methods for addressing missing modalities still face limitations, including feature collapse and overly generalized recovered features. To address these issues, we propose STARS (Shared-specific Translation and Alignment for missing-modality Remote Sensing), a robust semantic segmentation framework for incomplete multimodal inputs. STARS is built on two key designs. First, we introduce an asymmetric alignment mechanism with bidirectional translation and stop-gradient, which effectively prevents feature collapse and reduces sensitivity to hyperparameters. Second, we propose a Pixel-level Semantic sampling Alignment (PSA) strategy that combines class-balanced pixel sampling with cross-modality semantic alignment loss, to mitigate alignment failures caused by severe class imbalance and improve minority-class recognition.

[CV-116] TEXTS-Diff: TEXTS-Aware Diffusion Model for Real-World Text Image Super-Resolution ICASSP2026

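[Quick read]: This paper addresses two problems in real-world text image super-resolution (TISR): scarce text image data leads to poor restoration of text regions, and datasets made of isolated text samples limit the quality of background reconstruction. The authors build Real-Texts, a large-scale, high-quality dataset of real-world images with natural text instances in Chinese and English, and propose the TEXTS-Aware Diffusion Model (TEXTS-Diff). The model uses abstract concepts to improve understanding of textual elements in a scene and concrete text regions to refine text details. This reduces distortion and hallucination artifacts in text regions while preserving overall scene fidelity, so that background and text are both reconstructed at high quality.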

Link: https://arxiv.org/abs/2601.17340
Authors: Haodong He, Xin Zhan, Yancheng Bai, Rui Lan, Lei Sun, Xiangxiang Chu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICASSP 2026

Abstract:Real-world text image super-resolution aims to restore overall visual quality and text legibility in images suffering from diverse degradations and text distortions. However, the scarcity of text image data in existing datasets results in poor performance on text regions. In addition, datasets consisting of isolated text samples limit the quality of background reconstruction. To address these limitations, we construct Real-Texts, a large-scale, high-quality dataset collected from real-world images, which covers diverse scenarios and contains natural text instances in both Chinese and English. Additionally, we propose the TEXTS-Aware Diffusion Model (TEXTS-Diff) to achieve high-quality generation in both background and textual regions. This approach leverages abstract concepts to improve the understanding of textual elements within visual scenes and concrete text regions to enhance textual details. It mitigates distortions and hallucination artifacts commonly observed in text regions, while preserving high-quality visual scene fidelity. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple evaluation metrics, exhibiting superior generalization ability and text restoration accuracy in complex scenarios. All the code, model, and dataset will be released.

[CV-117] AGE-Net: Spectral–Spatial Fusion and Anatomical Graph Reasoning with Evidential Ordinal Regression for Knee Osteoarthritis Grading

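[Quick read]: This paper addresses automated Kellgren–Lawrence (KL) grading from knee radiographs, which is difficult because structural changes are subtle, anatomical dependencies are long-range, and grade boundaries are ambiguous. The proposed AGE-Net is a ConvNeXt-based framework that integrates Spectral–Spatial Fusion (SSF), Anatomical Graph Reasoning (AGR) for long-range structural relations, and Differential Refinement (DFR). To preserve label ordinality and quantify predictive uncertainty, it uses a Normal-Inverse-Gamma (NIG) evidential regression head together with a pairwise ordinal ranking constraint. The authors report that AGE-Net outperforms strong CNN baselines on quadratic weighted kappa (QWK) and mean squared error (MSE); evaluations of uncertainty quality, robustness, and explainability are outlined but deferred to the full manuscript.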

Link: https://arxiv.org/abs/2601.17336
Authors: Xiaoyang Li, Runni Zhou
Affiliations: Northeastern University (东北大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages

Abstract:Automated Kellgren–Lawrence (KL) grading from knee radiographs is challenging due to subtle structural changes, long-range anatomical dependencies, and ambiguity near grade boundaries. We propose AGE-Net, a ConvNeXt-based framework that integrates Spectral–Spatial Fusion (SSF), Anatomical Graph Reasoning (AGR), and Differential Refinement (DFR). To capture predictive uncertainty and preserve label ordinality, AGE-Net employs a Normal-Inverse-Gamma (NIG) evidential regression head and a pairwise ordinal ranking constraint. On a knee KL dataset, AGE-Net achieves a quadratic weighted kappa (QWK) of 0.9017 +/- 0.0045 and a mean squared error (MSE) of 0.2349 +/- 0.0028 over three random seeds, outperforming strong CNN baselines and showing consistent gains in ablation studies. We further outline evaluations of uncertainty quality, robustness, and explainability, with additional experimental figures to be included in the full manuscript.
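
The headline metric here is the quadratic weighted kappa (QWK), which penalizes a prediction by the squared distance between predicted and true grade. The sketch below is a plain-Python version of the standard definition, offered as an illustration of the metric and not as the authors' evaluation script.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """QWK for integer labels in range(num_classes); requires num_classes >= 2."""
    n = len(y_true)
    observed = [[0] * num_classes for _ in range(num_classes)]
    for truth, prediction in zip(y_true, y_pred):
        observed[truth][prediction] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(num_classes))
                 for j in range(num_classes)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            expected = true_hist[i] * pred_hist[j] / n  # chance agreement
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    if weighted_expected == 0:
        # All labels and predictions fall in a single class.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected

# KL grades run from 0 to 4, so there are five classes.
perfect_qwk = quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4], 5)
```

Confusing grade 0 with grade 4 costs sixteen times as much as confusing adjacent grades, which is why QWK suits ordinal labels such as KL grades.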

[CV-118] Learning with Geometric Priors in U-Net Variants for Polyp Segmentation

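[Quick read]: This paper addresses the difficulty that existing U-Net-based polyp segmentation methods have in capturing geometric and structural cues in low-contrast or cluttered endoscopic scenes. The authors propose a plug-and-play Geometric Prior-guided Module (GPM). They fine-tune the Visual Geometry Grounded Transformer (VGGT) on the simulated ColonDepth dataset to produce depth maps tailored to the endoscopic domain, and GPM encodes these depth maps as geometric priors injected into the U-Net encoder's feature maps. Spatial and channel attention then refine the result, emphasizing local spatial detail and global channel information, which improves the model's awareness of geometric structure.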

Link: https://arxiv.org/abs/2601.17331
Authors: Fabian Vazquez, Jose A. Nuñez, Diego Adame, Alissen Moreno, Augustin Zhan, Huimin Li, Jinghao Yang, Haoteng Tang, Bin Fu, Pengfei Gu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate and robust polyp segmentation is essential for early colorectal cancer detection and for computer-aided diagnosis. While convolutional neural network-, Transformer-, and Mamba-based U-Net variants have achieved strong performance, they still struggle to capture geometric and structural cues, especially in low-contrast or cluttered colonoscopy scenes. To address this challenge, we propose a novel Geometric Prior-guided Module (GPM) that injects explicit geometric priors into U-Net-based architectures for polyp segmentation. Specifically, we fine-tune the Visual Geometry Grounded Transformer (VGGT) on a simulated ColonDepth dataset to estimate depth maps of polyp images tailored to the endoscopic domain. These depth maps are then processed by GPM to encode geometric priors into the encoder’s feature maps, where they are further refined using spatial and channel attention mechanisms that emphasize both local spatial and global channel information. GPM is plug-and-play and can be seamlessly integrated into diverse U-Net variants. Extensive experiments on five public polyp segmentation datasets demonstrate consistent gains over three strong baselines. Code and the generated depth maps are available at: this https URL

[CV-119] Thermodynamically Optimal Regularization under Information-Geometric Constraints

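[Quick read]: This paper addresses the lack of a unified theoretical basis for empirical regularization techniques in modern machine learning (such as weight decay, dropout, and exponential moving averages) and asks whether the growing energy cost of training large models approaches a fundamental efficiency bound. The authors build a framework connecting thermodynamic optimality, information geometry, and regularization. Under three explicit assumptions, namely (A1) optimality requires an intrinsic, parametrization-invariant measure of information, (A2) belief states are modeled by maximum-entropy distributions under known constraints, and (A3) optimal processes are quasi-static, they prove that the Fisher–Rao metric is the unique admissible geometry on belief space and that thermodynamically optimal regularization corresponds to minimizing the squared Fisher–Rao distance to a reference state. The result implies that classical regularization schemes cannot structurally guarantee thermodynamic optimality, and the paper introduces an experimentally testable notion of the thermodynamic efficiency of learning.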

Link: https://arxiv.org/abs/2601.17330
Authors: Laurent Caraffa
Affiliations: Université Gustave Eiffel (居斯塔夫·艾菲尔大学); LASTIG (地理信息科学与技术实验室); IGN-ENSG (法国国家地理研究所-国立高等测绘学院)
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 0 figures

Abstract: Modern machine learning relies on a collection of empirically successful but theoretically heterogeneous regularization techniques, such as weight decay, dropout, and exponential moving averages. At the same time, the rapidly increasing energetic cost of training large models raises the question of whether learning algorithms approach any fundamental efficiency bound. In this work, we propose a unifying theoretical framework connecting thermodynamic optimality, information geometry, and regularization. Under three explicit assumptions, (A1) that optimality requires an intrinsic, parametrization-invariant measure of information, (A2) that belief states are modeled by maximum-entropy distributions under known constraints, and (A3) that optimal processes are quasi-static, we prove a conditional optimality theorem. Specifically, the Fisher–Rao metric is the unique admissible geometry on belief space, and thermodynamically optimal regularization corresponds to minimizing squared Fisher–Rao distance to a reference state. We derive the induced geometries for Gaussian and circular belief models, yielding hyperbolic and von Mises manifolds, respectively, and show that classical regularization schemes are structurally incapable of guaranteeing thermodynamic optimality. We introduce a notion of thermodynamic efficiency of learning and propose experimentally testable predictions. This work provides a principled geometric and thermodynamic foundation for regularization in machine learning.
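
The abstract notes that Gaussian belief models induce a hyperbolic geometry. For the univariate case the Fisher–Rao distance has a well-known closed form, and the sketch below evaluates it. This is the textbook formula for the family N(μ, σ²) and is offered only to make the geometry concrete; the paper's own derivation and regularizer are not reproduced.

```python
import math

def fisher_rao_gaussian(mu1, sigma1, mu2, sigma2):
    """Fisher-Rao distance between N(mu1, sigma1^2) and N(mu2, sigma2^2).

    The Fisher metric is ds^2 = (dmu^2 + 2 dsigma^2) / sigma^2, which is
    a rescaled hyperbolic half-plane metric; both sigmas must be positive.
    """
    numerator = (mu1 - mu2) ** 2 / 2.0 + (sigma1 - sigma2) ** 2
    return math.sqrt(2.0) * math.acosh(1.0 + numerator / (2.0 * sigma1 * sigma2))

def squared_distance_penalty(mu, sigma, ref_mu=0.0, ref_sigma=1.0):
    """Squared Fisher-Rao distance to a reference state (hypothetical helper)."""
    return fisher_rao_gaussian(mu, sigma, ref_mu, ref_sigma) ** 2

# With equal means the distance reduces to sqrt(2) * |ln(sigma2 / sigma1)|.
same_mean = fisher_rao_gaussian(0.0, 1.0, 0.0, math.e)
```

Note how the penalty differs from weight decay: shrinking σ toward zero costs an unbounded distance, whereas an L2 penalty on parameters would reward it.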

[CV-120] SymbolSight: Minimizing Inter-Symbol Interference for Reading with Prosthetic Vision

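[Quick read]: This paper addresses letter recognition errors in sequential letter presentation with retinal prostheses, where low spatial resolution and temporal persistence let the afterimage of one symbol interfere with perception of the next. The proposed SymbolSight framework optimizes the symbol-to-letter mapping, using language-specific bigram statistics to minimize confusion between frequently adjacent letters. The approach requires no hardware upgrade; it relies on simulated prosthetic vision and a neural proxy observer to select heterogeneous symbol sets better suited to serial, low-bandwidth prosthetic vision. In simulations for Arabic, Bulgarian, and English, the resulting symbol sets reduced predicted confusion by a median factor of 22 relative to native alphabets, yielding candidates for future psychophysical and clinical evaluation.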

Link: https://arxiv.org/abs/2601.17326
Authors: Jasmine Lesner, Michael Beyeler
Affiliations: University of California, Santa Barbara (加州大学圣塔芭芭拉分校)
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: Submitted to IEEE EMBC 2026. 7 pages, 6 figures, 2 tables

Abstract:Retinal prostheses restore limited visual perception, but low spatial resolution and temporal persistence make reading difficult. In sequential letter presentation, the afterimage of one symbol can interfere with perception of the next, leading to systematic recognition errors. Rather than relying on future hardware improvements, we investigate whether optimizing the visual symbols themselves can mitigate this temporal interference. We present SymbolSight, a computational framework that selects symbol-to-letter mappings to minimize confusion among frequently adjacent letters. Using simulated prosthetic vision (SPV) and a neural proxy observer, we estimate pairwise symbol confusability and optimize assignments using language-specific bigram statistics. Across simulations in Arabic, Bulgarian, and English, the resulting heterogeneous symbol sets reduced predicted confusion by a median factor of 22 relative to native alphabets. These results suggest that standard typography is poorly matched to serial, low-bandwidth prosthetic vision and demonstrate how computational modeling can efficiently narrow the design space of visual encodings to generate high-potential candidates for future psychophysical and clinical evaluation.
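
The optimization described above can be read as an assignment problem: give each letter a symbol so that letters that often appear next to each other receive symbols that are rarely confused. The sketch below solves a toy instance by brute force. The cost function, the data, and the exhaustive search are assumptions made for illustration; the abstract does not specify the authors' optimizer.

```python
from itertools import permutations

def expected_confusion(assignment, bigram_freq, confusability):
    """Bigram-weighted confusion of a letter -> symbol assignment."""
    return sum(
        freq * confusability[assignment[first]][assignment[second]]
        for (first, second), freq in bigram_freq.items()
    )

def best_assignment(letters, symbols, bigram_freq, confusability):
    """Exhaustively search assignments; only feasible for tiny alphabets."""
    best, best_cost = None, float("inf")
    for chosen in permutations(symbols, len(letters)):
        candidate = dict(zip(letters, chosen))
        cost = expected_confusion(candidate, bigram_freq, confusability)
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best, best_cost

# Hypothetical data: symbols 0 and 1 look alike; "ab" is a frequent bigram.
confusability = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.1],
    [0.1, 0.1, 0.0],
]
bigram_freq = {("a", "b"): 10, ("b", "c"): 1}
mapping, cost = best_assignment("abc", [0, 1, 2], bigram_freq, confusability)
```

For a real alphabet the search space is factorial in size, so a heuristic or a quadratic-assignment solver would replace the exhaustive loop.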

[CV-121] SkyReels-V3 Technique Report

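[Quick read]: This paper addresses multimodal contextual inference in video generation as a step toward world models with strong generalization and consistency. The core challenge is to handle three paradigms in one model (reference images-to-video synthesis, video extension, and audio-guided video generation) while keeping visual quality, subject identity, temporal coherence, and instruction following at a high level. The proposed SkyReels-V3 builds on a unified multimodal in-context learning framework with diffusion Transformers. It improves reference adherence with a data pipeline of cross-frame pairing, image editing, and semantic rewriting; improves robustness with hybrid image-video training and multi-resolution joint optimization; and supports minute-level audio-driven video generation through first-and-last frame insertion patterns and a reconstructed key-frame inference paradigm. The authors report performance approaching leading closed-source systems on several metrics.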

Link: https://arxiv.org/abs/2601.17323
Authors: Debang Li, Zhengcong Fei, Tuanhui Li, Yikun Dou, Zheng Chen, Jiangping Yang, Mingyuan Fan, Jingtao Xu, Jiahua Wang, Baoxuan Gu, Mingshan Chang, Yuqiang Xie, Binjie Mao, Youqiang Zhang, Nuo Pang, Hao Zhang, Yuzhe Jin, Zhiheng Xu, Dixuan Lin, Guibin Chen, Yahui Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Video generation serves as a cornerstone for building world models, where multimodal contextual inference stands as the defining test of capability. To this end, we present SkyReels-V3, a conditional video generation model built upon a unified multimodal in-context learning framework with diffusion Transformers. SkyReels-V3 supports three core generative paradigms within a single architecture: reference images-to-video synthesis, video-to-video extension, and audio-guided video generation. (i) The reference images-to-video model is designed to produce high-fidelity videos with strong subject identity preservation, temporal coherence, and narrative consistency. To enhance reference adherence and compositional stability, we design a comprehensive data processing pipeline that leverages cross-frame pairing, image editing, and semantic rewriting, effectively mitigating copy-paste artifacts. During training, an image-video hybrid strategy combined with multi-resolution joint optimization is employed to improve generalization and robustness across diverse scenarios. (ii) The video extension model integrates spatio-temporal consistency modeling with large-scale video understanding, enabling both seamless single-shot continuation and intelligent multi-shot switching with professional cinematographic patterns. (iii) The talking avatar model supports minute-level audio-conditioned video generation by training first-and-last frame insertion patterns and reconstructing key-frame inference paradigms. While maintaining visual quality, audio-video synchronization has also been optimized. Extensive evaluations demonstrate that SkyReels-V3 achieves state-of-the-art or near state-of-the-art performance on key metrics including visual quality, instruction following, and specific aspect metrics, approaching leading closed-source systems. Github: this https URL.

[CV-122] ClinNet: Evidential Ordinal Regression with Bilateral Asymmetry and Prototype Memory for Knee Osteoarthritis Grading

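[Quick read]: This paper addresses radiographic grading of knee osteoarthritis (KOA), which is difficult because differences between grades are subtle, expert annotations are uncertain, and disease progression is inherently ordinal. Conventional deep learning treats the task as deterministic multi-class classification, ignoring both the continuity of degeneration and annotation uncertainty. The proposed ClinNet framework recasts it as evidential ordinal regression with three components: (1) a Bilateral Asymmetry Encoder (BAE) that explicitly models medial-lateral structural discrepancies; (2) a Diagnostic Memory Bank that maintains class-wise prototypes to stabilize feature representations; and (3) an evidential ordinal head based on the Normal-Inverse-Gamma (NIG) distribution that jointly estimates continuous Kellgren-Lawrence (KL) grades and epistemic uncertainty. The method reaches a Quadratic Weighted Kappa of 0.892, significantly above existing baselines (p < 0.001), and its uncertainty estimates flag out-of-distribution samples and potential misdiagnoses.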

Link: https://arxiv.org/abs/2601.17315
Authors: Xiaoyang Li, Runni Zhou
Affiliations: University of Dundee (邓迪大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages

Abstract: Knee osteoarthritis (KOA) grading based on radiographic images is a critical yet challenging task due to subtle inter-grade differences, annotation uncertainty, and the inherently ordinal nature of disease progression. Conventional deep learning approaches typically formulate this problem as deterministic multi-class classification, ignoring both the continuous progression of degeneration and the uncertainty in expert annotations. In this work, we propose ClinNet, a novel trustworthy framework that addresses KOA grading as an evidential ordinal regression problem. The proposed method integrates three key components: (1) a Bilateral Asymmetry Encoder (BAE) that explicitly models medial-lateral structural discrepancies; (2) a Diagnostic Memory Bank that maintains class-wise prototypes to stabilize feature representations; and (3) an Evidential Ordinal Head based on the Normal-Inverse-Gamma (NIG) distribution to jointly estimate continuous KL grades and epistemic uncertainty. Extensive experiments demonstrate that ClinNet achieves a Quadratic Weighted Kappa of 0.892 and Accuracy of 0.768, statistically outperforming state-of-the-art baselines (p < 0.001). Crucially, we demonstrate that the model's uncertainty estimates successfully flag out-of-distribution samples and potential misdiagnoses, paving the way for safe clinical deployment.
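
An NIG evidential head outputs four parameters (γ, ν, α, β) per image, from which a point prediction and two kinds of uncertainty follow in closed form. The sketch below uses the standard moments of the Normal-Inverse-Gamma distribution as commonly applied in evidential regression. How ClinNet maps these quantities to a final KL grade is not stated in the abstract, so only the generic formulas are shown.

```python
def nig_summary(gamma, nu, alpha, beta):
    """Summarize a Normal-Inverse-Gamma output for one sample.

    gamma: predicted mean; nu, alpha, beta: evidence parameters.
    Requires nu > 0, alpha > 1, beta > 0 for the variances to exist.
    """
    if nu <= 0 or alpha <= 1 or beta <= 0:
        raise ValueError("need nu > 0, alpha > 1, beta > 0")
    aleatoric = beta / (alpha - 1.0)         # expected data noise, E[sigma^2]
    epistemic = beta / (nu * (alpha - 1.0))  # uncertainty of the mean, Var[mu]
    return {"mean": gamma, "aleatoric": aleatoric, "epistemic": epistemic}

# Hypothetical head output for one radiograph.
summary = nig_summary(gamma=2.0, nu=4.0, alpha=3.0, beta=2.0)
```

A small ν means the network has seen little supporting evidence, so the epistemic term grows; thresholding it is one plausible way to flag out-of-distribution inputs.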

[CV-123] Dynamic Meta-Ensemble Framework for Efficient and Accurate Deep Learning in Plant Leaf Disease Detection on Resource-Constrained Edge Devices

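[Quick Read]: This paper addresses the limited compute and energy budgets that constrain deploying deep learning models for plant disease detection on resource-constrained edge devices such as IoT sensors, smartphones, and embedded systems. The key to the solution is the Dynamic Meta-Ensemble Framework (DMEF), whose adaptive weighting mechanism dynamically fuses the predictions of three lightweight convolutional neural networks (MobileNetV2, NASNetMobile, and InceptionV3) to trade off accuracy improvement (DeltaAcc) against computational efficiency (model size). During training, DMEF iteratively updates the ensemble weights, favoring sub-models with high performance and low complexity. With a reported inference latency of 75 ms and a footprint of 1 million parameters, it reaches 99.53% and 96.61% accuracy on the potato and maize disease datasets, respectively, outperforming standalone models and static ensembles.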

Link: https://arxiv.org/abs/2601.17290
Authors: Weloday Fikadu Moges, Jianmei Su, Amin Waqas
Affiliation: Southwest University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deploying deep learning models for plant disease detection on edge devices such as IoT sensors, smartphones, and embedded systems is severely constrained by limited computational resources and energy budgets. To address this challenge, we introduce a novel Dynamic Meta-Ensemble Framework (DMEF) for high-accuracy plant disease diagnosis under resource constraints. DMEF employs an adaptive weighting mechanism that dynamically combines the predictions of three lightweight convolutional neural networks (MobileNetV2, NASNetMobile, and InceptionV3) by optimizing a trade-off between accuracy improvements (DeltaAcc) and computational efficiency (model size). During training, the ensemble weights are updated iteratively, favoring models exhibiting high performance and low complexity. Extensive experiments on benchmark datasets for potato and maize diseases demonstrate state-of-the-art classification accuracies of 99.53% and 96.61%, respectively, surpassing standalone models and static ensembles by 2.1% and 6.3%. With computationally efficient inference latency (75ms) and a compact footprint (1 million parameters), DMEF shows strong potential for edge-based agricultural monitoring, suggesting viability for scalable crop disease management. This bridges the gap between high-accuracy AI and practical field applications.
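The abstract describes the weighting only at a high level, so the following Python sketch shows one plausible way to favor accurate, small models. Everything in it is an assumption for illustration: the scoring rule (accuracy gain minus a size penalty), the softmax, the `lam` and `temperature` parameters, and the example numbers are not taken from the paper.

```python
import math


def ensemble_weights(delta_acc, sizes, lam=0.5, temperature=0.1):
    """Softmax weights over models from a hypothetical accuracy/size score.

    delta_acc: accuracy gain of each model (illustrative values).
    sizes: model sizes in any common unit; normalized by the largest.
    """
    max_size = max(sizes)
    scores = [d - lam * s / max_size for d, s in zip(delta_acc, sizes)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine(probs, weights):
    """Weighted average of per-model class-probability vectors."""
    n_classes = len(probs[0])
    return [sum(w * p[c] for w, p in zip(weights, probs)) for c in range(n_classes)]


if __name__ == "__main__":
    # Made-up gains and sizes for three models, not measurements.
    w = ensemble_weights([0.02, 0.01, 0.02], [3.0, 5.0, 24.0])
    print(w)
    print(combine([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]], w))
```

With equal accuracy gains, the smaller model receives the larger weight, which matches the stated preference for high performance at low complexity.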

[CV-124] Fluxamba: Topology-Aware Anisotropic State Space Models for Geological Lineament Segmentation in Multi-Source Remote Sensing

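[Quick Read]: This paper addresses precise segmentation of geological linear features, from planetary lineaments to terrestrial fractures, where the core challenge is capturing long-range dependencies across complex anisotropic topologies. Existing State Space Models (SSMs) offer near-linear computational complexity, but their reliance on rigid, axis-aligned scanning paths creates a fundamental topological mismatch with curvilinear targets, leading to fragmented context and feature erosion. The key to the solution is Fluxamba, a lightweight architecture built around the Structural Flux Block (SFB), which integrates an Anisotropic Structural Gate (ASG) with a Prior-Modulated Flow (PMF) to decouple feature orientation from spatial location and gate context aggregation along the target's intrinsic geometry rather than along fixed paths. A Hierarchical Spatial Regulator (HSR) and a High-Fidelity Focus Unit (HFFU) additionally strengthen multi-scale semantic alignment in low-contrast environments and maximize the signal-to-noise ratio of faint features, preserving accuracy while substantially reducing computational cost.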

Link: https://arxiv.org/abs/2601.17288
Authors: Jin Bai, Huiyao Zhang, Qi Wen, Shengyang Li, Xiaolin Tian, Atta ur Rahman
Affiliation: Technology and Engineering Center for Space Utilization, CAS, China; University of Chinese Academy of Sciences, China; Macau University of Science and Technology, China; University of Peshawar, Pakistan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The precise segmentation of geological linear features, spanning from planetary lineaments to terrestrial fractures, demands capturing long-range dependencies across complex anisotropic topologies. Although State Space Models (SSMs) offer near-linear computational complexity, their dependence on rigid, axis-aligned scanning trajectories induces a fundamental topological mismatch with curvilinear targets, resulting in fragmented context and feature erosion. To bridge this gap, we propose Fluxamba, a lightweight architecture that introduces a topology-aware feature rectification framework. Central to our design is the Structural Flux Block (SFB), which orchestrates an anisotropic information flux by integrating an Anisotropic Structural Gate (ASG) with a Prior-Modulated Flow (PMF). This mechanism decouples feature orientation from spatial location, dynamically gating context aggregation along the target’s intrinsic geometry rather than rigid paths. Furthermore, to mitigate serialization-induced noise in low-contrast environments, we incorporate a Hierarchical Spatial Regulator (HSR) for multi-scale semantic alignment and a High-Fidelity Focus Unit (HFFU) to explicitly maximize the signal-to-noise ratio of faint features. Extensive experiments on diverse geological benchmarks (LROC-Lineament, LineaMapper, and GeoCrack) demonstrate that Fluxamba establishes a new state-of-the-art. Notably, on the challenging LROC-Lineament dataset, it achieves an F1-score of 89.22% and mIoU of 89.87%. Achieving a real-time inference speed of over 24 FPS with only 3.4M parameters and 6.3G FLOPs, Fluxamba reduces computational costs by up to two orders of magnitude compared to heavy-weight baselines, thereby establishing a new Pareto frontier between segmentation fidelity and onboard deployment feasibility.

[CV-125] SPADE: A SIMD Posit-enabled compute engine for Accelerating DNN Efficiency

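[Quick Read]: This paper addresses the need in edge-AI systems for arithmetic units that are energy-efficient, area-efficient, and multi-precision. Existing floating-point and fixed-point architectures struggle to balance numerical precision, energy efficiency, and hardware compactness, and incur redundant resources when supporting multiple data formats. The key to the solution is SPADE, a unified multi-precision single-instruction multiple-data (SIMD) Posit multiply-accumulate (MAC) architecture. Its regime-aware, lane-fused Posit datapath reuses submodules across 8-, 16-, and 32-bit Posit formats (Posit (8,0), Posit (16,1), Posit (32,2)) within a single framework, without replicating separate datapaths, which substantially reduces lookup-table (LUT) and slice usage while retaining competitive performance and inference accuracy.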

Link: https://arxiv.org/abs/2601.17279
Authors: Sonu Kumar, Lavanya Vinnakota, Mukul Lokhande, Santosh Kumar Vishvakarma, Adam Teman
Affiliation: IIT Indore
Subjects: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract: The growing demand for edge-AI systems requires arithmetic units that balance numerical precision, energy efficiency, and compact hardware while supporting diverse formats. Posit arithmetic offers advantages over floating- and fixed-point representations through its tapered precision, wide dynamic range, and improved numerical robustness. This work presents SPADE, a unified multi-precision SIMD Posit-based multiply-accumulate (MAC) architecture supporting Posit (8,0), Posit (16,1), and Posit (32,2) within a single framework. Unlike prior single-precision or floating/fixed-point SIMD MACs, SPADE introduces a regime-aware, lane-fused SIMD Posit datapath that hierarchically reuses Posit-specific submodules (LOD, complementor, shifter, and multiplier) across 8/16/32-bit precisions without datapath replication. FPGA implementation on a Xilinx Virtex-7 shows 45.13% LUT and 80% slice reduction for Posit (8,0), and up to 28.44% and 17.47% improvement for Posit (16,1) and Posit (32,2) over prior work, with only 6.9% LUT and 14.9% register overhead for multi-precision support. ASIC results across TSMC nodes achieve 1.38 GHz at 6.1 mW (28 nm). Evaluation on MNIST, CIFAR-10/100, and alphabet datasets confirms competitive inference accuracy.
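For readers unfamiliar with the Posit (n, es) notation, the sketch below decodes a posit bit pattern into a float in software. It follows the general posit layout (sign, regime run, up to `es` exponent bits, fraction) and is only a reference decoder for illustration; it says nothing about how SPADE's hardware datapath is built, and the function name `decode_posit` is hypothetical.

```python
import math


def decode_posit(bits: int, n: int = 8, es: int = 0) -> float:
    """Decode an n-bit posit with es exponent bits into a Python float."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return math.nan  # NaR ("not a real")
    negative = bool(bits >> (n - 1))
    if negative:
        bits = (-bits) & mask  # two's complement yields the magnitude encoding
    # Regime: run of identical bits starting just below the sign bit.
    pos = n - 2
    first = (bits >> pos) & 1
    run = 0
    while pos >= 0 and ((bits >> pos) & 1) == first:
        run += 1
        pos -= 1
    k = run - 1 if first else -run
    pos -= 1  # skip the regime terminator bit, if present
    # Exponent: up to es bits; bits cut off by the word end count as zero.
    exp = 0
    for _ in range(es):
        exp <<= 1
        if pos >= 0:
            exp |= (bits >> pos) & 1
            pos -= 1
    # Fraction: whatever bits remain, with an implicit leading 1.
    frac_bits = max(pos + 1, 0)
    frac = bits & ((1 << frac_bits) - 1)
    mantissa = 1.0 + frac / (1 << frac_bits)
    value = mantissa * 2.0 ** (k * (1 << es) + exp)
    return -value if negative else value


if __name__ == "__main__":
    for pattern in (0x20, 0x40, 0x50, 0x60, 0x7F):
        print(hex(pattern), decode_posit(pattern))
```

The variable-length regime is what gives posits their tapered precision: values near 1 keep more fraction bits, while very large or very small values spend bits on the regime instead.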

[CV-126] Cross360: 360° Monocular Depth Estimation via Cross Projections Across Scales

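[Quick Read]: This paper addresses the difficulty of reconciling global continuity with local distortion in 360° depth estimation. Existing multi-projection fusion methods struggle to balance global and local consistency: local patch features have limited global perception, and feature extraction is inconsistent at patch boundaries. The key to the solution is the Cross360 architecture, which fuses local and global information through cross-attention between less-distorted tangent patches and equirectangular features. The Cross Projection Feature Alignment module uses cross-attention to align local tangent-patch features with the global field of view, so that each patch is aware of the global context; the Progressive Feature Aggregation with Attention module progressively aggregates multi-scale features and refines detail, improving the accuracy and global consistency of depth estimation.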

Link: https://arxiv.org/abs/2601.17271
Authors: Kun Huang, Fang-Lue Zhang, Neil Dodgson
Affiliation: Victoria University of Wellington
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: TIP, 12 pages

Abstract:360° depth estimation is a challenging research problem due to the difficulty of finding a representation that both preserves global continuity and avoids distortion in spherical images. Existing methods attempt to leverage complementary information from multiple projections, but struggle with balancing global and local consistency. Their local patch features have limited global perception, and the combined global representation does not address discrepancies in feature extraction at the boundaries between patches. To address these issues, we propose Cross360, a novel cross-attention-based architecture integrating local and global information using less-distorted tangent patches along with equirectangular features. Our Cross Projection Feature Alignment module employs cross-attention to align local tangent projection features with the equirectangular projection’s 360° field of view, ensuring each tangent projection patch is aware of the global context. Additionally, our Progressive Feature Aggregation with Attention module refines multi-scaled features progressively, enhancing depth estimation accuracy. Cross360 significantly outperforms existing methods across most benchmark datasets, especially those in which the entire 360° image is available, demonstrating its effectiveness in accurate and globally consistent depth estimation. The code and model are available at this https URL.

[CV-127] Inference-Time Loss-Guided Colour Preservation in Diffusion Sampling

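[Quick Read]: This paper addresses the difficulty of precise color control in text-to-image diffusion models for design-oriented workflows, where outputs must meet user-specified color targets and existing methods often show color deviation or local failures. The key to the solution is a training-free, inference-time, region-constrained color preservation method built on three mechanisms: (i) Region of Interest (ROI)-based inpainting for spatial selectivity; (ii) background-latent re-imposition to prevent color drift outside the ROI; and (iii) latent nudging via gradient guidance with a composite loss defined in CIE Lab and linear RGB. The loss adds CVaR-style and soft-maximum penalties so that it controls not only the mean ROI color but also the tail of the pixelwise error distribution, and it uses a late-start gate and a time-dependent schedule to stabilize guidance during denoising. The method can be integrated into standard Stable Diffusion inpainting pipelines for targeted color adherence.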

Link: https://arxiv.org/abs/2601.17259
Authors: Angad Singh Ahuja, Aarush Ram Anandh
Affiliation: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: 25 Pages, 12 Figures, 3 Tables, 5 Appendices, 8 Algorithms

Abstract:Precise color control remains a persistent failure mode in text-to-image diffusion systems, particularly in design-oriented workflows where outputs must satisfy explicit, user-specified color targets. We present an inference-time, region-constrained color preservation method that steers a pretrained diffusion model without any additional training. Our approach combines (i) ROI-based inpainting for spatial selectivity, (ii) background-latent re-imposition to prevent color drift outside the ROI, and (iii) latent nudging via gradient guidance using a composite loss defined in CIE Lab and linear RGB. The loss is constructed to control not only the mean ROI color but also the tail of the pixelwise error distribution through CVaR-style and soft-maximum penalties, with a late-start gate and a time-dependent schedule to stabilize guidance across denoising steps. We show that mean-only baselines can satisfy average color constraints while producing perceptually salient local failures, motivating our distribution-aware objective. The resulting method provides a practical, training-free mechanism for targeted color adherence that can be integrated into standard Stable Diffusion inpainting pipelines.
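The point about mean-only objectives hiding local failures can be illustrated with a small Python sketch. It compares a plain mean with a CVaR-style tail mean and a log-sum-exp soft maximum over per-pixel errors. The function names, the `tail_fraction` and `tau` parameters, and the toy error values are assumptions; the paper's actual loss, defined in CIE Lab and linear RGB, may be built differently.

```python
import math


def mean_error(errors):
    """Average per-pixel color error over the ROI."""
    return sum(errors) / len(errors)


def cvar(errors, tail_fraction=0.1):
    """Mean of the worst tail_fraction of errors (CVaR-style tail penalty)."""
    k = max(1, math.ceil(tail_fraction * len(errors)))
    worst = sorted(errors, reverse=True)[:k]
    return sum(worst) / k


def soft_max(errors, tau=1.0):
    """Smooth upper bound on max(errors) via log-sum-exp with temperature tau."""
    top = max(errors)
    return top + tau * math.log(sum(math.exp((e - top) / tau) for e in errors))


if __name__ == "__main__":
    uniform = [2.0] * 10          # every pixel slightly off
    spiky = [1.0] * 9 + [11.0]    # one pixel badly off, same mean
    for name, errs in (("uniform", uniform), ("spiky", spiky)):
        print(name, mean_error(errs), cvar(errs), soft_max(errs))
```

Both toy patches have the same mean error, but only the tail-aware terms separate them, which is the motivation the abstract gives for a distribution-aware objective.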

[CV-128] FineVAU: A Novel Human-Aligned Benchmark for Fine-Grained Video Anomaly Understanding

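[Quick Read]: This paper addresses inaccurate evaluation in Video Anomaly Understanding (VAU). Existing benchmarks rely on n-gram metrics (e.g., BLEU, ROUGE-L) or on Large Language Model (LLM)-based evaluation; the former cannot capture the richness of free-form, visually grounded answers, while the latter favors language quality over factual relevance, so scores drift away from human perception. The key to the solution is the FineVAU benchmark, whose contributions are: a) FVScore, a human-aligned metric based on the presence of critical visual elements that provides interpretable, fine-grained feedback; and b) the FineW3 dataset, which augments existing annotations with high-quality, fine-grained visual information through a structured automatic procedure. Experiments show that FVScore aligns better with human perception than existing approaches and reveal marked limitations of large vision-language models (LVLMs) on anomalous events that require spatial and fine-grained temporal understanding.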

Link: https://arxiv.org/abs/2601.17258
Authors: João Pereira, Vasco Lopes, João Neves, David Semedo
Affiliation: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Video Anomaly Understanding (VAU) is a novel task focused on describing unusual occurrences in videos. Despite growing interest, the evaluation of VAU remains an open challenge. Existing benchmarks rely on n-gram-based metrics (e.g., BLEU, ROUGE-L) or LLM-based evaluation. The former fails to capture the rich, free-form, and visually grounded nature of LVLM responses, while the latter focuses on assessing language quality over factual relevance, often resulting in subjective judgments that are misaligned with human perception. In this work, we address this issue by proposing FineVAU, a new benchmark for VAU that shifts the focus towards rich, fine-grained and domain-specific understanding of anomalous videos. We formulate VAU as a three-fold problem, with the goal of comprehensively understanding key descriptive elements of anomalies in video: events (What), participating entities (Who) and location (Where). Our benchmark introduces a) FVScore, a novel, human-aligned evaluation metric that assesses the presence of critical visual elements in LVLM answers, providing interpretable, fine-grained feedback; and b) FineW3, a novel, comprehensive dataset curated through a structured and fully automatic procedure that augments existing human annotations with high quality, fine-grained visual information. Human evaluation reveals that our proposed metric has a superior alignment with human perception of anomalies in comparison to current approaches. Detailed experiments on FineVAU unveil critical limitations in LVLMs’ ability to perceive anomalous events that require spatial and fine-grained temporal understanding, despite strong performance on coarse-grained, static information, and events with strong visual cues.

[CV-129] Multi-stage Bridge Inspection System: Integrating Foundation Models with Location Anonymization

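[Quick Read]: This paper addresses the tension in bridge damage inspection between protecting regional privacy and accurately extracting damage features, i.e., achieving efficient, accurate damage recognition and visual assessment without leaking the regional information shown on construction signs. The key to the solution is as follows: Segment Anything Model (SAM) 3 segments rebar corrosion regions, and DBSCAN automatically completes missed regions; construction-sign regions are masked with Gaussian blur to protect geographic information; four image preprocessing methods improve optical character recognition (OCR) accuracy; and GPU optimization brings processing to 1.7 seconds per image. Together these form an open-source bridge damage detection system that balances privacy protection with inspection efficiency.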

Link: https://arxiv.org/abs/2601.17254
Authors: Takato Yasuno
Affiliation: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures, 2 tables

Abstract:In Japan, civil infrastructure condition monitoring is mandated through visual inspection every five years. Field-captured damage images frequently contain concrete cracks and rebar exposure, often accompanied by construction signs revealing regional information. To enable safe infrastructure use without causing public anxiety, it is essential to protect regional information while accurately extracting damage features and visualizing key indicators for repair decision-making. This paper presents an open-source bridge damage detection system with regional privacy protection capabilities. We employ Segment Anything Model (SAM) 3 for rebar corrosion detection and utilize DBSCAN for automatic completion of missed regions. Construction sign regions are detected and protected through Gaussian blur. Four preprocessing methods improve OCR accuracy, and GPU optimization enables 1.7-second processing per image. The technology stack includes SAM3, PyTorch, OpenCV, pytesseract, and scikit-learn, achieving efficient bridge inspection with regional information protection.

[CV-130] C-RADIOv4 (Tech Report)

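[Quick Read]: This paper addresses how multi-teacher knowledge distillation can retain and improve the distinct capabilities of several teacher models in one model while improving downstream performance at unchanged computational complexity. The key to the solution is a multi-teacher distillation strategy that builds a unified student model combining the capabilities of the vision teachers SigLIP2, DINOv3, and SAM3. C-RADIOv4 further improves any-resolution support and brings back the ViTDet option for better efficiency at high resolution.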

Link: https://arxiv.org/abs/2601.17237
Authors: Mike Ranzinger, Greg Heinrich, Collin McCarthy, Jan Kautz, Andrew Tao, Bryan Catanzaro, Pavlo Molchanov
Affiliation: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:By leveraging multi-teacher distillation, agglomerative vision backbones provide a unified student model that retains and improves the distinct capabilities of multiple teachers. In this tech report, we describe the most recent release of the C-RADIO family of models, C-RADIOv4, which builds upon AM-RADIO/RADIOv2.5 in design, offering strong improvements on key downstream tasks at the same computational complexity. We release -SO400M (412M params), and -H (631M) model variants, both trained with an updated set of teachers: SigLIP2, DINOv3, and SAM3. In addition to improvements on core metrics and new capabilities from imitating SAM3, the C-RADIOv4 model family further improves any-resolution support, brings back the ViTDet option for drastically enhanced efficiency at high-resolution, and comes with a permissive license.

[CV-131] Semi-Supervised Domain Adaptation with Latent Diffusion for Pathology Image Classification

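[Quick Read]: This paper addresses the poor generalization of deep learning models in computational pathology across cohorts and institutions, which stems from domain shift. Existing methods either cannot exploit unlabeled target-domain data or rely on image-to-image translation, which can distort tissue structures and reduce accuracy. The key to the solution is a semi-supervised domain adaptation (SSDA) framework that uses a latent diffusion model trained on unlabeled data from both source and target domains to generate synthetic images that preserve tissue morphology and carry target-domain characteristics. Conditioning the diffusion model on foundation model features, cohort identity, and tissue preparation method preserves source-domain tissue structure while introducing target-domain appearance. These target-aware synthetic images are combined with real labeled source-domain images to train a downstream classifier, which markedly improves performance on the target-domain test set without degrading source-domain performance.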

Link: https://arxiv.org/abs/2601.17228
Authors: Tengyue Zhang, Ruiwen Ding, Luoting Zhuang, Yuxiao Wu, Erika F. Rodriguez, William Hsu
Affiliation: University of California, Los Angeles; David Geffen School of Medicine; Department of Radiological Sciences; Medical & Imaging Informatics; Bioengineering Department; Henry Samueli School of Engineering; Department of Pathology & Laboratory Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Deep learning models in computational pathology often fail to generalize across cohorts and institutions due to domain shift. Existing approaches either fail to leverage unlabeled data from the target domain or rely on image-to-image translation, which can distort tissue structures and compromise model accuracy. In this work, we propose a semi-supervised domain adaptation (SSDA) framework that utilizes a latent diffusion model trained on unlabeled data from both the source and target domains to generate morphology-preserving and target-aware synthetic images. By conditioning the diffusion model on foundation model features, cohort identity, and tissue preparation method, we preserve tissue structure in the source domain while introducing target-domain appearance characteristics. The target-aware synthetic images, combined with real, labeled images from the source cohort, are subsequently used to train a downstream classifier, which is then tested on the target cohort. The effectiveness of the proposed SSDA framework is demonstrated on the task of lung adenocarcinoma prognostication. The proposed augmentation yielded substantially better performance on the held-out test set from the target cohort, without degrading source-cohort performance. The approach improved the weighted F1 score on the target-cohort held-out test set from 0.611 to 0.706 and the macro F1 score from 0.641 to 0.716. Our results demonstrate that target-aware diffusion-based synthetic data augmentation provides a promising and effective approach for improving domain generalization in computational pathology.

[CV-132] Spatiotemporal Semantic V2X Framework for Cooperative Collision Prediction

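[Quick Read]: This paper addresses the communication bottleneck of real-time collision prediction in Intelligent Transportation Systems (ITS): conventional methods transmit raw video or high-dimensional sensor data from roadside units (RSUs) to vehicles, which is impractical under vehicular bandwidth and latency constraints. The key to the solution is a semantic V2X framework in which the RSU uses the Video Joint Embedding Predictive Architecture (V-JEPA) to generate spatiotemporal semantic embeddings of future frames and transmits these compact representations over V2X links instead of raw frames; on the vehicle, a lightweight attentive probe and classifier decode the embeddings to predict imminent collisions. The method reduces transmission requirements by four orders of magnitude relative to raw video while maintaining predictive accuracy, and experiments report a 10% F1-score improvement.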

Link: https://arxiv.org/abs/2601.17216
Authors: Murat Arda Onsu, Poonam Lohan, Burak Kantarci, Aisha Syed, Matthew Andrews, Sean Kennedy
Affiliation: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 6 pages, 5 figures, accepted to IEEE ICC 2026

Abstract: Intelligent Transportation Systems (ITS) demand real-time collision prediction to ensure road safety and reduce accident severity. Conventional approaches rely on transmitting raw video or high-dimensional sensory data from roadside units (RSUs) to vehicles, which is impractical under vehicular communication bandwidth and latency constraints. In this work, we propose a semantic V2X framework in which RSU-mounted cameras generate spatiotemporal semantic embeddings of future frames using the Video Joint Embedding Predictive Architecture (V-JEPA). To evaluate the system, we construct a digital twin of an urban traffic environment enabling the generation of diverse traffic scenarios with both safe and collision events. These embeddings of the future frame, extracted from V-JEPA, capture task-relevant traffic dynamics and are transmitted via V2X links to vehicles, where a lightweight attentive probe and classifier decode them to predict imminent collisions. By transmitting only semantic embeddings instead of raw frames, the proposed system significantly reduces communication overhead while maintaining predictive accuracy. Experimental results demonstrate that the framework with an appropriate processing method achieves a 10% F1-score improvement for collision prediction while reducing transmission requirements by four orders of magnitude compared to raw video. This validates the potential of semantic V2X communication to enable cooperative, real-time collision prediction in ITS.

[CV-133] Structural Complexity of Brain MRI reveals age-associated patterns ICASSP2026

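[Quick Read]: This paper addresses the stability and accuracy of multiscale structural complexity analysis for three-dimensional medical images, in particular brain magnetic resonance imaging (MRI). The traditional block-based coarse-graining approach yields unstable estimates at coarse scales because of limited sampling, so the authors propose a sliding-window coarse-graining scheme that gives smoother estimates of information loss and better robustness at large scales. With this refined coarse-graining, the analysis of brain MRI data is stable across scales and shows that structural complexity decreases systematically with age, most strongly at coarse scales, which supports its use as a predictor of biological age.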

Link: https://arxiv.org/abs/2601.17211
Authors: Anzhe Cheng, Italo Ivo Lima Dias Pinto, Paul Bogdan
Affiliation: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICASSP 2026

Abstract:We adapt structural complexity analysis to three-dimensional signals, with an emphasis on brain magnetic resonance imaging (MRI). This framework captures the multiscale organization of volumetric data by coarse-graining the signal at progressively larger spatial scales and quantifying the information lost between successive resolutions. While the traditional block-based approach can become unstable at coarse resolutions due to limited sampling, we introduce a sliding-window coarse-graining scheme that provides smoother estimates and improved robustness at large scales. Using this refined method, we analyze large structural MRI datasets spanning mid- to late adulthood and find that structural complexity decreases systematically with age, with the strongest effects emerging at coarser scales. These findings highlight structural complexity as a reliable signal processing tool for multiscale analysis of 3D imaging data, while also demonstrating its utility in predicting biological age from brain MRI.
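The sampling argument behind the sliding-window scheme can be seen in a toy Python sketch. It uses a 1D list as a stand-in for a 3D MRI volume and plain window means as the coarse-graining operator; both simplifications, and the function names, are assumptions for illustration rather than the authors' implementation.

```python
def block_coarse_grain(signal, size):
    """Means over non-overlapping blocks; yields len(signal) // size values."""
    return [
        sum(signal[i:i + size]) / size
        for i in range(0, len(signal) - size + 1, size)
    ]


def sliding_coarse_grain(signal, size):
    """Means over stride-1 windows; yields len(signal) - size + 1 values."""
    return [
        sum(signal[i:i + size]) / size
        for i in range(len(signal) - size + 1)
    ]


if __name__ == "__main__":
    signal = [0, 2, 4, 6, 8, 10, 12, 14]
    print(block_coarse_grain(signal, 4))    # only two samples at this scale
    print(sliding_coarse_grain(signal, 4))  # five samples at the same scale
```

At a window of 4 on 8 points, the block version leaves two values to estimate any statistic from, while the sliding version leaves five, which is the sense in which it is less prone to limited-sampling instability at coarse scales.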

Artificial Intelligence

[AI-0] Multi-Objective Reinforcement Learning for Efficient Tactical Decision Making for Trucks in Highway Traffic

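[Quick Read]: This paper addresses how heavy-duty vehicles can balance safety, efficiency, and operational cost when making decisions in highway driving, and in particular the way conventional scalar rewards obscure the trade-off structure by aggregating conflicting objectives. The key to the solution is a multi-objective reinforcement learning framework based on Proximal Policy Optimization (PPO) that learns a continuous set of Pareto-optimal policies. The set makes explicit the trade-offs among safety (measured by collisions and successful completion), energy efficiency (energy cost), and time efficiency (driver cost), yielding a smooth, interpretable Pareto frontier and allowing transitions between driving policies without retraining, for robust and adaptive decision-making in autonomous trucking.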

Link: https://arxiv.org/abs/2601.18783
Authors: Deepthi Pathare, Leo Laine, Morteza Haghir Chehreghani
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:

Abstract:Balancing safety, efficiency, and operational costs in highway driving poses a challenging decision-making problem for heavy-duty vehicles. A central difficulty is that conventional scalar reward formulations, obtained by aggregating these competing objectives, often obscure the structure of their trade-offs. We present a Proximal Policy Optimization based multi-objective reinforcement learning framework that learns a continuous set of policies explicitly representing these trade-offs and evaluates it on a scalable simulation platform for tactical decision making in trucks. The proposed approach learns a continuous set of Pareto-optimal policies that capture the trade-offs among three conflicting objectives: safety, quantified in terms of collisions and successful completion; energy efficiency and time efficiency, quantified using energy cost and driver cost, respectively. The resulting Pareto frontier is smooth and interpretable, enabling flexibility in choosing driving behavior along different conflicting objectives. This framework allows seamless transitions between different driving policies without retraining, yielding a robust and adaptive decision-making strategy for autonomous trucking applications.
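As a reminder of what "Pareto-optimal" means for the three costs named above, here is a minimal Python sketch that filters a list of policy evaluations down to the non-dominated ones. The cost tuples (collision rate, energy cost, driver cost) are made-up numbers, and the helper names are hypothetical; the paper learns a continuous policy set with PPO, which this filter does not attempt to reproduce.

```python
def dominates(a, b):
    """True if cost vector a is no worse than b everywhere and better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the cost vectors that no other vector dominates (lower is better)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # (collision rate, energy cost, driver cost) per policy, illustrative only.
    policies = [
        (0.01, 5.0, 9.0),
        (0.02, 4.0, 8.0),
        (0.02, 5.0, 9.5),  # worse than the first policy on every objective
        (0.05, 3.0, 7.0),
    ]
    print(pareto_front(policies))
```

A scalar reward would collapse each tuple to one number and pick a single winner; keeping the whole front is what lets an operator choose a trade-off afterwards.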

[AI-1] α³-SecBench: A Large-Scale Evaluation Suite of Security, Resilience, and Trust for LLM-based UAV Agents over 6G Networks

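[Quick Read]: This paper addresses the lack of systematic evaluation of security, resilience, and trust for Large Language Model (LLM)-based unmanned aerial vehicle (UAV) autonomy under malicious attack in safety-critical settings, particularly in emerging 6G network environments. The key to the solution is α³-SecBench, the first large-scale evaluation suite of its kind. It builds on multi-turn conversational UAV missions and overlays 20,000 validated security attack scenarios covering seven autonomy layers: sensing, perception, planning, control, communication, edge/cloud infrastructure, and LLM reasoning. It evaluates 23 mainstream LLMs along three dimensions: security (attack detection and vulnerability attribution), resilience (safe degradation), and trust (policy-compliant tool use). The results show that, beyond anomaly detection, current models remain inconsistent at vulnerability attribution, mitigation, and compliant decision-making, providing a baseline for designing security-aware autonomous UAV systems.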

Link: https://arxiv.org/abs/2601.18754
Authors: Mohamed Amine Ferrag, Abderrahmane Lakas, Merouane Debbah
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract: Autonomous unmanned aerial vehicle (UAV) systems are increasingly deployed in safety-critical, networked environments where they must operate reliably in the presence of malicious adversaries. While recent benchmarks have evaluated large language model (LLM)-based UAV agents in reasoning, navigation, and efficiency, systematic assessment of security, resilience, and trust under adversarial conditions remains largely unexplored, particularly in emerging 6G-enabled settings. We introduce α³-SecBench, the first large-scale evaluation suite for assessing the security-aware autonomy of LLM-based UAV agents under realistic adversarial interference. Building on multi-turn conversational UAV missions from α³-Bench, the framework augments benign episodes with 20,000 validated security overlay attack scenarios targeting seven autonomy layers, including sensing, perception, planning, control, communication, edge/cloud infrastructure, and LLM reasoning. α³-SecBench evaluates agents across three orthogonal dimensions: security (attack detection and vulnerability attribution), resilience (safe degradation behavior), and trust (policy-compliant tool usage). We evaluate 23 state-of-the-art LLMs from major industrial providers and leading AI labs using thousands of adversarially augmented UAV episodes sampled from a corpus of 113,475 missions spanning 175 threat types. While many models reliably detect anomalous behavior, effective mitigation, vulnerability attribution, and trustworthy control actions remain inconsistent. Normalized overall scores range from 12.9% to 57.1%, highlighting a significant gap between anomaly detection and security-aware autonomous decision-making. We release α³-SecBench on GitHub: this https URL

[AI-2] HalluGuard: Demystifying Data-Driven and Reasoning-Driven Hallucinations in LLMs ICLR’26

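[Quick Read]: This paper addresses the reliability problems caused by hallucination when Large Language Models (LLMs) are used in high-stakes domains such as healthcare, law, and scientific discovery. Existing detection methods usually target only data-driven or only reasoning-driven hallucinations and depend on task-specific heuristics, so they generalize poorly to complex scenarios. The key to the solution is the Hallucination Risk Bound, a unified theoretical framework that formally decomposes hallucination risk into a data-driven component arising from training-time mismatches and a reasoning-driven component arising from inference-time instabilities. Building on it, HalluGuard is a score based on the Neural Tangent Kernel (NTK) that uses the NTK's induced geometry and captured representations to identify both kinds of hallucination jointly, with state-of-the-art detection performance reported across 10 benchmarks, 11 baselines, and 9 LLM backbones.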

Link: https://arxiv.org/abs/2601.18753
Authors: Xinyue Zeng, Junhong Lin, Yujun Yan, Feng Guo, Liang Shi, Jun Wu, Dawei Zhou
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ICLR’26

Abstract:The reliability of Large Language Models (LLMs) in high-stakes domains such as healthcare, law, and scientific discovery is often compromised by hallucinations. These failures typically stem from two sources: data-driven hallucinations and reasoning-driven hallucinations. However, existing detection methods usually address only one source and rely on task-specific heuristics, limiting their generalization to complex scenarios. To overcome these limitations, we introduce the Hallucination Risk Bound, a unified theoretical framework that formally decomposes hallucination risk into data-driven and reasoning-driven components, linked respectively to training-time mismatches and inference-time instabilities. This provides a principled foundation for analyzing how hallucinations emerge and evolve. Building on this foundation, we introduce HalluGuard, an NTK-based score that leverages the induced geometry and captured representations of the NTK to jointly identify data-driven and reasoning-driven hallucinations. We evaluate HalluGuard on 10 diverse benchmarks, 11 competitive baselines, and 9 popular LLM backbones, consistently achieving state-of-the-art performance in detecting diverse forms of LLM hallucinations.

[AI-3] Trust, Don't Trust, or Flip: Robust Preference-Based Reinforcement Learning with Multi-Expert Feedback

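[Quick Read]: This paper addresses the robustness of preference-based reinforcement learning (PBRL) when preference data comes from heterogeneous annotators, especially when some annotators are systematically adversarial; conventional methods fail because they cannot separate reliable from unreliable feedback. The key to the solution is TriTrust-PBRL (TTP), a unified framework that jointly learns a shared reward model and expert-specific trust parameters, so that the credibility of each annotator is identified and adjusted during gradient-based optimization. The trust parameters naturally evolve to be positive (trust), near zero (ignore), or negative (flip the preference), which lets the model invert adversarial preference signals rather than simply discard them and recover useful information from mixed expert pools.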

Link: https://arxiv.org/abs/2601.18751
Authors: Seyed Amir Hosseini, Maryam Abdolali, Amirhosein Tavakkoli, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Equal contribution: Seyed Amir Hosseini and Maryam Abdolali. Corresponding author: Maryam Abdolali ( this http URL @kntu. this http URL )

Abstract:Preference-based reinforcement learning (PBRL) offers a promising alternative to explicit reward engineering by learning from pairwise trajectory comparisons. However, real-world preference data often comes from heterogeneous annotators with varying reliability; some accurate, some noisy, and some systematically adversarial. Existing PBRL methods either treat all feedback equally or attempt to filter out unreliable sources, but both approaches fail when faced with adversarial annotators who systematically provide incorrect preferences. We introduce TriTrust-PBRL (TTP), a unified framework that jointly learns a shared reward model and expert-specific trust parameters from multi-expert preference feedback. The key insight is that trust parameters naturally evolve during gradient-based optimization to be positive (trust), near zero (ignore), or negative (flip), enabling the model to automatically invert adversarial preferences and recover useful signal rather than merely discarding corrupted feedback. We provide theoretical analysis establishing identifiability guarantees and detailed gradient analysis that explains how expert separation emerges naturally during training without explicit supervision. Empirically, we evaluate TTP on four diverse domains spanning manipulation tasks (MetaWorld) and locomotion (DM Control) under various corruption scenarios. TTP achieves state-of-the-art robustness, maintaining near-oracle performance under adversarial corruption while standard PBRL methods fail catastrophically. Notably, TTP outperforms existing baselines by successfully learning from mixed expert pools containing both reliable and adversarial annotators, all while requiring no expert features beyond identification indices and integrating seamlessly with existing PBRL pipelines.
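One way to picture a trust parameter that can be positive, near zero, or negative is to let it scale the reward difference inside a Bradley-Terry preference model. The Python sketch below does exactly that. This parameterization is an assumption chosen for illustration; the abstract does not give TTP's actual likelihood, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pref_prob(r_a, r_b, trust):
    """Probability that an annotator with the given trust prefers a over b.

    Assumed form: Bradley-Terry with the reward gap scaled by trust.
    trust > 0 follows the reward, trust = 0 is a coin flip, trust < 0 inverts.
    """
    return sigmoid(trust * (r_a - r_b))


def nll(r_a, r_b, trust, label):
    """Negative log-likelihood of one label (1: a preferred, 0: b preferred)."""
    p = pref_prob(r_a, r_b, trust)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


if __name__ == "__main__":
    # Trajectory a has the higher reward, but an adversarial annotator picks b.
    r_a, r_b, adversarial_label = 2.0, 0.0, 0
    for trust in (1.0, 0.0, -1.0):
        print(trust, nll(r_a, r_b, trust, adversarial_label))
```

For the adversarial label, the loss is lowest at negative trust, so gradient descent on this toy likelihood would push that annotator's trust below zero and reuse the flipped signal instead of discarding it.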
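
The three trust regimes named in the abstract (trust, ignore, flip) can be illustrated with a trust-scaled Bradley-Terry preference model. What follows is a minimal sketch under stated assumptions, not the TTP implementation: the multiplicative form `trust * (return_a - return_b)` and the names `preference_prob` and `preference_nll` are hypothetical, chosen only to show why a negative trust value lowers the loss on systematically inverted labels.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def preference_prob(return_a: float, return_b: float, trust: float) -> float:
    """P(expert prefers A over B); the trust-scaled form is an assumption."""
    return sigmoid(trust * (return_a - return_b))


def preference_nll(return_a: float, return_b: float, trust: float,
                   prefers_a: bool) -> float:
    """Negative log-likelihood of one pairwise label from one expert."""
    p = preference_prob(return_a, return_b, trust)
    return -math.log(p if prefers_a else 1.0 - p)


# Trajectory A is truly better (return 2 vs 1), but an adversarial
# expert labels B as preferred (prefers_a=False).
loss_if_trusted = preference_nll(2.0, 1.0, trust=1.0, prefers_a=False)
loss_if_ignored = preference_nll(2.0, 1.0, trust=0.0, prefers_a=False)
loss_if_flipped = preference_nll(2.0, 1.0, trust=-1.0, prefers_a=False)
```

Under these assumptions `loss_if_flipped` is the smallest of the three, so gradient descent on the trust parameter of a consistently wrong expert would push it negative, while a trust of zero makes the label uninformative about the reward.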

[AI-4] TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models

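【Quick Read】: This paper addresses the lack of evaluation of time series reasoning in generalist models: existing benchmarks do not cover multi-dimensional reasoning over time series data. The key to the solution is TSRBench, a comprehensive multi-modal benchmark that comprises: 1) 4125 problems from 14 domains, categorized into four major dimensions (Perception, Reasoning, Prediction, and Decision-Making); and 2) 15 tasks that systematically evaluate essential capabilities such as numerical reasoning. An evaluation of more than 30 leading large language models (LLMs), vision-language models (VLMs), and time series LLMs (TSLLMs) on TSRBench reveals three key challenges: scaling laws break down for prediction tasks, semantic understanding is decoupled from numerical prediction, and multimodal fusion is ineffective. The benchmark provides a standardized evaluation framework and insights for advancing generalist models in time series settings.

Link: https://arxiv.org/abs/2601.18744
Authors: Fangxu Yu,Xingang Guo,Lingzhi Yuan,Haoqiang Kang,Hongyu Zhao,Lianhui Qin,Furong Huang,Bin Hu,Tianyi Zhou
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: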

Abstract:Time series data is ubiquitous in real-world scenarios and crucial for critical applications ranging from energy management to traffic control. Consequently, the ability to reason over time series is a fundamental skill for generalist models to solve practical problems. However, this dimension is notably absent from existing benchmarks of generalist models. To bridge this gap, we introduce TSRBench, a comprehensive multi-modal benchmark designed to stress-test the full spectrum of time series reasoning capabilities. TSRBench features: i) a diverse set of 4125 problems from 14 domains, categorized into 4 major dimensions: Perception, Reasoning, Prediction, and Decision-Making; and ii) 15 tasks from the 4 dimensions evaluating essential reasoning capabilities (e.g., numerical reasoning). Through extensive experiments, we evaluated over 30 leading proprietary and open-source LLMs, VLMs, and TSLLMs within TSRBench. Our findings reveal that: i) scaling laws hold for perception and reasoning but break down for prediction; ii) strong reasoning does not guarantee accurate context-aware forecasting, indicating a decoupling between semantic understanding and numerical prediction; and iii) despite the complementary nature of textual and visual representations of time series as inputs, current multimodal models fail to effectively fuse them for reciprocal performance gains. TSRBench provides a standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance generalist models. Our code and dataset are available at this https URL.

[AI-5] Why Keep Your Doubts to Yourself? Trading Visual Uncertainties in Multi-Agent Bandit Systems ICLR2026

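【Quick Read】: This paper addresses the economic unsustainability of scaling vision-language models (VLMs) in multi-agent systems, where coordinating heterogeneous agents under information asymmetry often drives costs up sharply. Existing approaches such as Mixture-of-Agents and knowledge-based routers rely on heuristic proxies that ignore costs and collapse the structure of uncertainty, leading to suboptimal coordination. The key to the solution is the Agora framework, which reframes coordination as a decentralized market for uncertainty: it formalizes epistemic uncertainty as a tradable asset (at the perceptual, semantic, and inferential levels) and drives profitability-oriented trading among agents under rational economic rules. It also introduces a market-aware broker that extends Thompson Sampling to initiate collaboration and guide the system toward cost-efficient equilibria.

Link: https://arxiv.org/abs/2601.18735
Authors: Jusheng Zhang,Yijia Fan,Kaitong Cai,Jing Yang,Jiawei Yao,Jian Wang,Guanlong Qu,Ziliang Chen,Keze Wang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to ICLR 2026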

Abstract:Vision-Language Models (VLMs) enable powerful multi-agent systems, but scaling them is economically unsustainable: coordinating heterogeneous agents under information asymmetry often spirals costs. Existing paradigms, such as Mixture-of-Agents and knowledge-based routers, rely on heuristic proxies that ignore costs and collapse uncertainty structure, leading to provably suboptimal coordination. We introduce Agora, a framework that reframes coordination as a decentralized market for uncertainty. Agora formalizes epistemic uncertainty into a structured, tradable asset (perceptual, semantic, inferential), and enforces profitability-driven trading among agents based on rational economic rules. A market-aware broker, extending Thompson Sampling, initiates collaboration and guides the system toward cost-efficient equilibria. Experiments on five multimodal benchmarks (MMMU, MMBench, MathVision, InfoVQA, CC-OCR) show that Agora outperforms strong VLMs and heuristic multi-agent strategies, e.g., achieving +8.5% accuracy over the best baseline on MMMU while reducing cost by over 3x. These results establish market-based coordination as a principled and scalable paradigm for building economically viable multi-agent visual intelligence systems.
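
Since the broker is described as an extension of Thompson Sampling, a plain cost-penalized Thompson Sampling loop gives a feel for the starting point. This is a generic Beta-Bernoulli sketch, not Agora's broker: the `BetaArm` class, the `cost_weight` penalty, and the simulated success rates are all hypothetical and chosen only for illustration.

```python
import random


class BetaArm:
    """Beta-Bernoulli posterior over one agent's success rate."""

    def __init__(self, cost: float) -> None:
        self.alpha = 1.0  # prior pseudo-count of successes
        self.beta = 1.0   # prior pseudo-count of failures
        self.cost = cost  # hypothetical per-call cost of this agent

    def sample(self, rng: random.Random) -> float:
        return rng.betavariate(self.alpha, self.beta)

    def update(self, success: bool) -> None:
        if success:
            self.alpha += 1.0
        else:
            self.beta += 1.0


def choose_arm(arms, rng, cost_weight):
    """Pick the arm with the best sampled success rate minus weighted cost."""
    scores = [arm.sample(rng) - cost_weight * arm.cost for arm in arms]
    return max(range(len(arms)), key=scores.__getitem__)


def simulate(true_rates, costs, rounds=500, cost_weight=0.5, seed=0):
    """Run the bandit loop and return how often each arm was chosen."""
    rng = random.Random(seed)
    arms = [BetaArm(c) for c in costs]
    pulls = [0] * len(arms)
    for _ in range(rounds):
        i = choose_arm(arms, rng, cost_weight)
        pulls[i] += 1
        arms[i].update(rng.random() < true_rates[i])
    return pulls


# Two simulated agents with equal cost; the first succeeds far more often.
pulls = simulate(true_rates=[0.9, 0.3], costs=[0.1, 0.1])
```

With equal costs the loop concentrates its calls on the more reliable agent; raising one agent's `cost` shifts the balance, which is the kind of cost-accuracy trade-off the abstract describes at a much richer level.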

[AI-6] Conditioned Generative Modeling of Molecular Glues: A Realistic AI Approach for Synthesizable Drug-like Molecules

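【Quick Read】: This paper addresses the abnormal accumulation of amyloid beta-42 (Abeta-42) in Alzheimer’s disease (AD), in particular the synaptic dysfunction and neurodegeneration triggered by its early intracellular aggregation. Whereas prior work has mostly focused on extracellular amyloid plaques, this study emphasizes intracellular Abeta-42 as an early toxic driver of the disease. The key to the solution is an AI-assisted drug design strategy that targets the ubiquitin-proteasome system (UPS) and uses E3 ligase-directed molecular glues to promote selective degradation of Abeta-42. Specifically, the authors combine structure-based modeling, ADMET screening, and docking to evaluate the potential of three E3 ligases (CRBN, VHL, and MDM2) to form ternary complexes with Abeta-42, and develop a Ligase-Conditioned Junction Tree Variational Autoencoder (LC-JT-VAE). The model incorporates protein sequence embeddings and torsional angle-aware molecular graphs to generate chemically valid, novel, and target-specific molecular glues capable of facilitating Abeta-42 degradation, offering a UPS-targeted therapeutic framework for neurodegenerative diseases.

Link: https://arxiv.org/abs/2601.18716
Authors: Naeyma N. Islam,Thomas R. Caulfield
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Comments: 30 pages, 8 figures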

Abstract:Alzheimer’s disease (AD) is marked by the pathological accumulation of amyloid beta-42 (Abeta-42), contributing to synaptic dysfunction and neurodegeneration. While extracellular amyloid plaques are well-studied, increasing evidence highlights intracellular Abeta-42 as an early and toxic driver of disease progression. In this study, we present a novel, AI-assisted drug design approach to promote targeted degradation of Abeta-42 via the ubiquitin-proteasome system (UPS), using E3 ligase-directed molecular glues. We systematically evaluated the ternary complex formation potential of Abeta-42 with three E3 ligases: CRBN, VHL, and MDM2, through structure-based modeling, ADMET screening, and docking. We then developed a Ligase-Conditioned Junction Tree Variational Autoencoder (LC-JT-VAE) to generate ligase-specific small molecules, incorporating protein sequence embeddings and torsional angle-aware molecular graphs. Our results demonstrate that this generative model can produce chemically valid, novel, and target-specific molecular glues capable of facilitating Abeta-42 degradation. This integrated approach offers a promising framework for designing UPS-targeted therapies for neurodegenerative diseases.

[AI-7] Health-SCORE: Towards Scalable Rubrics for Improving Health-LLMs

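【Quick Read】: This paper addresses the difficulty of scaling rubric-based evaluation and training of large language model (LLM) responses in safety-critical domains such as healthcare, where rubrics require costly human expertise and long development cycles. The key to the solution is Health-SCORE, a generalizable and scalable rubric-based framework that substantially reduces rubric development cost while maintaining evaluation quality comparable to human-created rubrics. Health-SCORE can also serve as a structured reward signal for reinforcement learning with safety-aware supervision, or be incorporated directly into prompts to improve response quality, making LLM training and evaluation more efficient and scalable.

Link: https://arxiv.org/abs/2601.18706
Authors: Zhichao Yang,Sepehr Janghorbani,Dongxu Zhang,Jun Han,Qian Qian,Andrew Ressler II,Gregory D. Lyng,Sanjit Singh Batra,Robert E. Tillman
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: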

Abstract:Rubrics are essential for evaluating open-ended LLM responses, especially in safety-critical domains such as healthcare. However, creating high-quality and domain-specific rubrics typically requires significant human expertise, time, and development cost, making rubric-based evaluation and training difficult to scale. In this work, we introduce Health-SCORE, a generalizable and scalable rubric-based training and evaluation framework that substantially reduces rubric development costs without sacrificing performance. We show that Health-SCORE provides two practical benefits beyond standalone evaluation: it can be used as a structured reward signal to guide reinforcement learning with safety-aware supervision, and it can be incorporated directly into prompts to improve response quality through in-context learning. Across open-ended healthcare tasks, Health-SCORE achieves evaluation quality comparable to human-created rubrics while significantly lowering development effort, making rubric-based evaluation and training more scalable.

[AI-8] From Fuzzy to Exact: The Halo Architecture for Infinite-Depth Reasoning via Rational Arithmetic UAI2026

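【Quick Read】: This paper attempts to address the “hallucinations” and logical incoherence that current deep learning models exhibit in high-order causal inference, which it attributes to IEEE 754 floating-point approximation errors accumulating over deep compositional functions. The key to the solution is the Exactness Hypothesis: the high-order causal inference required for artificial general intelligence (AGI) needs a computational substrate capable of Arbitrary Precision Arithmetic. To this end, the author designs the Halo Architecture, centered on a novel Exact Inference Unit (EIU) that performs Rational Arithmetic over the rationals ($\mathbb{Q}$). According to the paper, experiments on the Huginn-0125 prototype show that this approach avoids numerical divergence indefinitely, which the author presents as a prerequisite for reducing logical uncertainty in System 2 AGI.

Link: https://arxiv.org/abs/2601.18702
Authors: Hansheng Ren
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: 8 pages, 6 figures. Submitted to UAI 2026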

Abstract:Current paradigms in Deep Learning prioritize computational throughput over numerical precision, relying on the assumption that intelligence emerges from statistical correlation at scale. In this paper, we challenge this orthodoxy. We propose the Exactness Hypothesis: that General Intelligence (AGI), specifically high-order causal inference, requires a computational substrate capable of Arbitrary Precision Arithmetic. We argue that the “hallucinations” and logical incoherence seen in current Large Language Models (LLMs) are artifacts of IEEE 754 floating-point approximation errors accumulating over deep compositional functions. To mitigate this, we introduce the Halo Architecture, a paradigm shift to Rational Arithmetic ($\mathbb{Q}$) supported by a novel Exact Inference Unit (EIU). Empirical validation on the Huginn-0125 prototype demonstrates that while 600B-parameter scale BF16 baselines collapse in chaotic systems, Halo maintains zero numerical divergence indefinitely. This work establishes exact arithmetic as a prerequisite for reducing logical uncertainty in System 2 AGI.
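
The contrast between floating-point and rational arithmetic that the abstract builds on can be seen with Python's standard `fractions` module. The sketch only illustrates rounding error versus exact accumulation in general; it says nothing about the EIU or about LLM behaviour, and the helper `accumulate` is a hypothetical name.

```python
from fractions import Fraction


def accumulate(step, n):
    """Add `step` to a running total `n` times, in the numeric type of `step`."""
    total = step * 0  # zero of the same numeric type as `step`
    for _ in range(n):
        total += step
    return total


float_sum = accumulate(0.1, 10)              # IEEE 754 binary floating point
exact_sum = accumulate(Fraction(1, 10), 10)  # exact rational arithmetic

print(float_sum == 1.0)  # False: 0.1 has no exact binary representation
print(exact_sum == 1)    # True: rationals are added without rounding
```

Exact rationals avoid rounding, but their numerators and denominators can grow with every operation, so the gain in exactness is paid for in memory and time.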

[AI-9] TEA-Bench: A Systematic Benchmarking of Tool-enhanced Emotional Support Dialogue Agent

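【Quick Read】: This paper addresses the tendency of current Emotional Support Conversation (ESC) systems, which operate in text-only settings, to focus on affective expression while neglecting factual grounding, which makes models prone to hallucination and less trustworthy. The core challenge is how to use external tools to strengthen factual grounding and thereby improve reliability and usefulness in multi-turn dialogue. The key to the solution is TEA-Bench, the first interactive benchmark for evaluating tool-augmented agents in ESC. It features realistic emotional scenarios, an MCP-style (Model Context Protocol) tool environment, and process-level metrics that jointly assess emotional support quality and factual grounding. Experiments show that tool augmentation generally improves support quality and reduces hallucination, but the gains depend strongly on model capacity: stronger models use tools more selectively and effectively, while weaker models benefit only marginally.

Link: https://arxiv.org/abs/2601.18700
Authors: Xingyu Sui,Yanyan Zhao,Yulin Hu,Jiahe Guo,Weixiang Zhao,Bing Qin
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: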

Abstract:Emotional Support Conversation requires not only affective expression but also grounded instrumental support to provide trustworthy guidance. However, existing ESC systems and benchmarks largely focus on affective support in text-only settings, overlooking how external tools can enable factual grounding and reduce hallucination in multi-turn emotional support. We introduce TEA-Bench, the first interactive benchmark for evaluating tool-augmented agents in ESC, featuring realistic emotional scenarios, an MCP-style tool environment, and process-level metrics that jointly assess the quality and factual grounding of emotional support. Experiments on nine LLMs show that tool augmentation generally improves emotional support quality and reduces hallucination, but the gains are strongly capacity-dependent: stronger models use tools more selectively and effectively, while weaker models benefit only marginally. We further release TEA-Dialog, a dataset of tool-enhanced ESC dialogues, and find that supervised fine-tuning improves in-distribution support but generalizes poorly. Our results underscore the importance of tool use in building reliable emotional support agents.

[AI-10] Neural Multi-Speaker Voice Cloning for Nepali in Low-Resource Settings

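【Quick Read】: This paper addresses the difficulty of voice cloning for Nepali, a low-resource language, and in particular the challenge of synthesizing a specific speaker’s voice from only a few samples. The key to the solution is the construction of two separate datasets: untranscribed audio for training a speaker encoder, and paired text-audio data for training a Tacotron2-based synthesizer. The speaker encoder is optimized with the Generative End2End loss so that its embeddings capture the speaker’s vocal identity; these embeddings are fused with Tacotron2’s text embeddings to produce mel-spectrograms, which a WaveRNN vocoder converts into audio. Experiments indicate that the method clones the characteristics of unseen speakers from very few samples, demonstrating the feasibility of few-shot voice cloning for low-resource languages.

Link: https://arxiv.org/abs/2601.18694
Authors: Aayush M. Shrestha,Aditya Bajracharya,Projan Shakya,Dinesh B. Kshatri
Affiliation: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: 16 pages with appendix included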

Abstract:This research presents a few-shot voice cloning system for Nepali speakers, designed to synthesize speech in a specific speaker’s voice from Devanagari text using minimal data. Voice cloning in Nepali remains largely unexplored due to its low-resource nature. To address this, we constructed separate datasets: untranscribed audio for training a speaker encoder and paired text-audio data for training a Tacotron2-based synthesizer. The speaker encoder, optimized with Generative End2End loss, generates embeddings that capture the speaker’s vocal identity, validated through Uniform Manifold Approximation and Projection (UMAP) for dimension reduction visualizations. These embeddings are fused with Tacotron2’s text embeddings to produce mel-spectrograms, which are then converted into audio using a WaveRNN vocoder. Audio data were collected from various sources, including self-recordings, and underwent thorough preprocessing for quality and alignment. Training was performed using mel and gate loss functions under multiple hyperparameter settings. The system effectively clones speaker characteristics even for unseen voices, demonstrating the feasibility of few-shot voice cloning for the Nepali language and establishing a foundation for personalized speech synthesis in low-resource scenarios.

[AI-11] ART for Diffusion Sampling: A Reinforcement Learning Approach to Timestep Schedule

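【Quick Read】: This paper addresses the loss of sample quality that uniform or hand-crafted time grids cause when discretizing score-based diffusion models, particularly under a fixed budget on the number of time steps. The key to the solution is Adaptive Reparameterized Time (ART), which controls the clock speed of a reparameterized time variable to allocate uneven timesteps along the sampling trajectory while preserving the terminal time. The time change is then formulated as a continuous-time reinforcement learning (RL) problem, and the ART-RL algorithm uses Gaussian policies to learn the timestep schedule in a data-driven way, improving the quality of generated samples and transferring to other datasets without retraining.

Link: https://arxiv.org/abs/2601.18681
Authors: Yilie Huang,Wenpin Tang,Xunyu Zhou
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments: 17 pages, 7 figures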

Abstract:We consider time discretization for score-based diffusion models to generate samples from a learned reverse-time dynamic on a finite grid. Uniform and hand-crafted grids can be suboptimal given a budget on the number of time steps. We introduce Adaptive Reparameterized Time (ART) that controls the clock speed of a reparameterized time variable, leading to a time change and uneven timesteps along the sampling trajectory while preserving the terminal time. The objective is to minimize the aggregate error arising from the discretized Euler scheme. We derive a randomized control companion, ART-RL, and formulate time change as a continuous-time reinforcement learning (RL) problem with Gaussian policies. We then prove that solving ART-RL recovers the optimal ART schedule, which in turn enables practical actor–critic updates to learn the latter in a data-driven way. Empirically, based on the official EDM pipeline, ART-RL improves Fréchet Inception Distance on CIFAR-10 over a wide range of budgets and transfers to AFHQv2, FFHQ, and ImageNet without the need of retraining.
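
The idea of uneven timesteps that still end at the same terminal time can be sketched in a few lines. The code below is an illustration under stated assumptions, not the ART or ART-RL algorithm: the per-step `clock_speeds` are hand-picked rather than learned, the normalization rule in `timestep_grid` is a hypothetical choice, and the test equation dx/dt = -x on a forward-time grid stands in for the reverse-time diffusion dynamics.

```python
def timestep_grid(clock_speeds, terminal_time):
    """Turn positive per-step clock speeds into a grid on [0, terminal_time]."""
    if any(s <= 0 for s in clock_speeds):
        raise ValueError("clock speeds must be positive")
    total = sum(clock_speeds)
    grid = [0.0]
    for s in clock_speeds:
        # Each step gets a share of the time budget proportional to its speed.
        grid.append(grid[-1] + terminal_time * s / total)
    grid[-1] = terminal_time  # remove accumulated rounding at the endpoint
    return grid


def euler(drift, x0, grid):
    """Explicit Euler integration of dx/dt = drift(t, x) along `grid`."""
    x = x0
    for t0, t1 in zip(grid, grid[1:]):
        x += (t1 - t0) * drift(t0, x)
    return x


uniform = timestep_grid([1, 1, 1, 1], terminal_time=1.0)
uneven = timestep_grid([1, 2, 4, 8], terminal_time=1.0)

x_uniform = euler(lambda t, x: -x, 1.0, uniform)
x_uneven = euler(lambda t, x: -x, 1.0, uneven)
```

Both grids use four steps and end at time 1.0, yet they give different Euler results; choosing the speeds so that the accumulated discretization error is smallest is the problem the paper hands to reinforcement learning.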

[AI-12] Learning temporal embeddings from electronic health records of chronic kidney disease patients

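【Quick Read】: This paper investigates how to learn clinically meaningful temporal embeddings from longitudinal electronic health records (EHR) without sacrificing predictive performance, and how architectural choices affect embedding quality. The core of the solution is to use a time-aware LSTM (T-LSTM) as the embedding learner and to treat embedding learning as an intermediate step rather than predicting directly end-to-end. This yields more structured embeddings and better downstream results: in chronic kidney disease (CKD) stage clustering, the T-LSTM attains a lower Davies-Bouldin Index (DBI = 9.91) and higher classification accuracy (0.74) than the other architectures, and for in-ICU mortality prediction, embedding models raise accuracy from 0.72-0.75 for end-to-end models to 0.82-0.83.

Link: https://arxiv.org/abs/2601.18675
Authors: Aditya Kumar,Mario A. Cypko,Oliver Amft
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7 pages, 3 figures, 3 tables. The paper has been submitted to IEEE EMBC 2026 and copyright might be transferred without notice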

Abstract:We investigate whether temporal embedding models trained on longitudinal electronic health records can learn clinically meaningful representations without compromising predictive performance, and how architectural choices affect embedding quality. Model-guided medicine requires representations that capture disease dynamics while remaining transparent and task agnostic, whereas most clinical prediction models are optimised for a single task. Representation learning facilitates learning embeddings that generalise across downstream tasks, and recurrent architectures are well-suited for modelling temporal structure in observational clinical data. Using the MIMIC-IV dataset, we study patients with chronic kidney disease (CKD) and compare three recurrent architectures: a vanilla LSTM, an attention-augmented LSTM, and a time-aware LSTM (T-LSTM). All models are trained both as embedding models and as direct end-to-end predictors. Embedding quality is evaluated via CKD stage clustering and in-ICU mortality prediction. The T-LSTM produces more structured embeddings, achieving a lower Davies-Bouldin Index (DBI = 9.91) and higher CKD stage classification accuracy (0.74) than the vanilla LSTM (DBI = 15.85, accuracy = 0.63) and attention-augmented LSTM (DBI = 20.72, accuracy = 0.67). For in-ICU mortality prediction, embedding models consistently outperform end-to-end predictors, improving accuracy from 0.72-0.75 to 0.82-0.83, which indicates that learning embeddings as an intermediate step is more effective than direct end-to-end learning.
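
The Davies-Bouldin Index used above to compare embeddings is a standard clustering metric in which lower values indicate tighter, better separated clusters. The following pure-Python sketch computes it from its textbook definition on made-up 2-D points; the toy data and function names are illustrative assumptions and have no connection to the MIMIC-IV embeddings.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def davies_bouldin(clusters):
    """Davies-Bouldin Index for a list of clusters (each a list of points)."""
    centers = [centroid(c) for c in clusters]
    # Scatter: mean distance from each point to its own cluster centroid.
    scatter = [
        sum(math.dist(p, ctr) for p in pts) / len(pts)
        for pts, ctr in zip(clusters, centers)
    ]
    k = len(clusters)
    worst = []
    for i in range(k):
        # For cluster i, keep the most similar (worst separated) other cluster.
        ratios = [
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k) if j != i
        ]
        worst.append(max(ratios))
    return sum(worst) / k


# Two tight clusters far apart: scatter 1 each, centroid distance 10.
toy = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
dbi = davies_bouldin(toy)  # (1 + 1) / 10 = 0.2
```

Moving the two clusters closer together or spreading their points out raises the index, which is why the lower DBI reported for the T-LSTM is read as more structured embeddings.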

[AI-13] FaLW: A Forgetting-aware Loss Reweighting for Long-tailed Unlearning ICLR2026

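【Quick Read】: This paper addresses the failure of existing machine unlearning methods under long-tailed distributions: when the data to be forgotten (such as a user’s activity records) is heavily class-imbalanced, existing methods exhibit Heterogeneous Unlearning Deviation and Skewed Unlearning Deviation. The key to the solution is FaLW, a plug-and-play, instance-wise dynamic loss reweighting method. It assesses the unlearning state of each sample by comparing its predictive probability with the distribution of unseen data from the same class, and on that basis applies a forgetting-aware reweighting scheme, modulated by a balancing factor, to adaptively adjust the unlearning intensity for each sample.

Link: https://arxiv.org/abs/2601.18650
Authors: Liheng Yu,Zhe Zhao,Yuxuan Wang,Pengkun Wang,Binwu Wang,Yang Wang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Camera-ready for ICLR 2026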

Abstract:Machine unlearning, which aims to efficiently remove the influence of specific data from trained models, is crucial for upholding data privacy regulations like the “right to be forgotten”. However, existing research predominantly evaluates unlearning methods on relatively balanced forget sets. This overlooks a common real-world scenario where data to be forgotten, such as a user’s activity records, follows a long-tailed distribution. Our work is the first to investigate this critical research gap. We find that in such long-tailed settings, existing methods suffer from two key issues: Heterogeneous Unlearning Deviation and Skewed Unlearning Deviation. To address these challenges, we propose FaLW, a plug-and-play, instance-wise dynamic loss reweighting method. FaLW innovatively assesses the unlearning state of each sample by comparing its predictive probability to the distribution of unseen data from the same class. Based on this, it uses a forgetting-aware reweighting scheme, modulated by a balancing factor, to adaptively adjust the unlearning intensity for each sample. Extensive experiments demonstrate that FaLW achieves superior performance. Code is available in the Supplementary Material.

[AI-14] Unheard in the Digital Age: Rethinking AI Bias and Speech Diversity

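【Quick Read】: This paper addresses digital exclusion: the marginalization of people with atypical speech that results from the heavy reliance of automated speech recognition (ASR) and other AI systems on standardized speech. The core issue is how structural biases are encoded and amplified through algorithm design, reinforcing discrimination against non-standard speech and affecting fair access to opportunity in areas such as education and employment. The key to the solution is to promote inclusive technological design, anti-bias training to reduce the impact of discriminatory algorithmic decisions, and enforceable policy reform that explicitly recognizes speech diversity as a matter of equity rather than merely accessibility. The article also advocates interdisciplinary collaboration and co-created solutions so that AI systems reflect the full diversity of human voices and uphold the rights and representation of atypical speakers.

Link: https://arxiv.org/abs/2601.18641
Authors: Onyedikachi Hope Amaechi-Okorie,Branislav Radeljic
Affiliation: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: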

Abstract:Speech remains one of the most visible yet overlooked vectors of inclusion and exclusion in contemporary society. While fluency is often equated with credibility and competence, individuals with atypical speech patterns are routinely marginalized. Given the current state of the debate, this article focuses on the structural biases that shape perceptions of atypical speech and are now being encoded into artificial intelligence. Automated speech recognition (ASR) systems and voice interfaces, trained predominantly on standardized speech, routinely fail to recognize or respond to diverse voices, compounding digital exclusion. As AI technologies increasingly mediate access to opportunity, the study calls for inclusive technological design, anti-bias training to minimize the impact of discriminatory algorithmic decisions, and enforceable policy reform that explicitly recognize speech diversity as a matter of equity, not merely accessibility. Drawing on interdisciplinary research, the article advocates for a cultural and institutional shift in how we value voice, urging co-created solutions that elevate the rights, representation, and realities of atypical speakers in the digital age. Ultimately, the article reframes speech inclusion as a matter of equity (not accommodation) and advocates for co-created AI systems that reflect the full spectrum of human voices.

[AI-15] Assessing the Quality of Mental Health Support in LLM Responses through Multi-Attribute Human Evaluation

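【Quick Read】: This paper addresses the inconsistency between the cognitive and affective quality of generative AI responses in mental health dialogue: models provide safe, coherent, and clinically appropriate information (high cognitive reliability) but are unstable in emotional resonance and sensitivity, which limits their suitability for professional psychological support. The key to the solution is a human-grounded evaluation methodology in which two psychiatrically trained experts independently rate the responses of nine large language models (closed-source and open-source) to 500 mental health conversations built from real-world scenario questions, using a 6-attribute rubric that covers Cognitive Support and Affective Resonance. The analysis exposes a cognitive-affective gap and motivates failure-aware, clinically grounded evaluation frameworks that weigh relational sensitivity alongside informational accuracy, to guide the responsible design and clinical oversight of mental health-oriented conversational AI.

Link: https://arxiv.org/abs/2601.18630
Authors: Abeer Badawi,Md Tahmid Rahman Laskar,Elahe Rahimi,Sheri Grach,Lindsay Bertrand,Lames Danok,Frank Rudzicz,Jimmy Huang,Elham Dolatabadi
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: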

Abstract:The escalating global mental health crisis, marked by persistent treatment gaps, availability, and a shortage of qualified therapists, positions Large Language Models (LLMs) as a promising avenue for scalable support. While LLMs offer potential for accessible emotional assistance, their reliability, therapeutic relevance, and alignment with human standards remain challenging to address. This paper introduces a human-grounded evaluation methodology designed to assess LLM-generated responses in therapeutic dialogue. Our approach involved curating a dataset of 500 mental health conversations from datasets with real-world scenario questions and evaluating the responses generated by nine diverse LLMs, including closed-source and open-source models. More specifically, these responses were evaluated by two psychiatrically trained experts, who independently rated each on a 5-point Likert scale across a comprehensive 6-attribute rubric. This rubric captures Cognitive Support and Affective Resonance, providing a multidimensional perspective on therapeutic quality. Our analysis reveals that LLMs provide strong cognitive reliability by producing safe, coherent, and clinically appropriate information, but they demonstrate unstable affective alignment. Although closed-source models (e.g., GPT-4o) offer balanced therapeutic responses, open-source models show greater variability and emotional flatness. We reveal a persistent cognitive-affective gap and highlight the need for failure-aware, clinically grounded evaluation frameworks that prioritize relational sensitivity alongside informational accuracy in mental health-oriented LLMs. We advocate for balanced evaluation protocols with a human in the loop that center on therapeutic sensitivity and provide a framework to guide the responsible design and clinical oversight of mental health-oriented conversational AI.

[AI-16] Rank-1 Approximation of Inverse Fisher for Natural Policy Gradients in Deep Reinforcement Learning

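【Quick Read】: This paper addresses the high cost of computing natural gradients in deep reinforcement learning: each iteration requires inverting the Fisher Information Matrix (FIM), which is computationally expensive and hard to scale. The key to the solution is an efficient and scalable natural policy optimization method that replaces the full inverse FIM with a rank-1 approximation, substantially reducing computational complexity. The theoretical analysis shows that, under certain conditions, the rank-1 approximation converges faster than standard policy gradients and, in some settings, matches the sample complexity of stochastic policy gradient methods. Experiments show that the method outperforms standard actor-critic and trust-region baselines across a range of environments.

Link: https://arxiv.org/abs/2601.18626
Authors: Yingxiao Huo,Satya Prakash Dash,Radu Stoican,Samuel Kaski,Mingfei Sun
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: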

Abstract:Natural gradients have long been studied in deep reinforcement learning due to their fast convergence properties and covariant weight updates. However, computing natural gradients requires inversion of the Fisher Information Matrix (FIM) at each iteration, which is computationally prohibitive in nature. In this paper, we present an efficient and scalable natural policy optimization technique that leverages a rank-1 approximation to full inverse-FIM. We theoretically show that under certain conditions, a rank-1 approximation to inverse-FIM converges faster than policy gradients and, under some conditions, enjoys the same sample complexity as stochastic policy gradient methods. We benchmark our method on a diverse set of environments and show that it achieves superior performance to standard actor-critic and trust-region baselines.
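
One way to see why a rank-1 structure makes the inverse cheap is the Sherman-Morrison identity. The sketch below assumes a damped rank-1 Fisher estimate of the form `damping * I + score score^T` built from a single score vector; that form, the function name, and the numbers are assumptions for illustration, and the paper's actual approximation may differ.

```python
def rank1_natural_gradient(score, grad, damping):
    """Apply (damping * I + score score^T)^{-1} to grad via Sherman-Morrison.

    Costs O(n) time and memory instead of the O(n^3) of a dense inverse.
    """
    if damping <= 0:
        raise ValueError("damping must be positive")
    s_dot_g = sum(s * g for s, g in zip(score, grad))
    s_dot_s = sum(s * s for s in score)
    coeff = s_dot_g / (damping + s_dot_s)
    return [(g - coeff * s) / damping for s, g in zip(score, grad)]


# With score = [1, 0] and damping = 1 the matrix is [[2, 0], [0, 1]],
# so its inverse maps the gradient [2, 3] to [1, 3].
direction = rank1_natural_gradient(score=[1.0, 0.0], grad=[2.0, 3.0], damping=1.0)
```

The update never forms a matrix: two dot products and one vector operation replace the explicit inversion that the abstract calls computationally prohibitive.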

[AI-17] PolySHAP: Extending KernelSHAP with Interaction-Informed Polynomial Regression

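【Quick Read】: This paper addresses the exponential cost of exact Shapley value computation: a model with d features requires $2^d$ game evaluations. The core of the solution is PolySHAP, which replaces the first-order linear approximation in KernelSHAP with higher-degree polynomials that capture non-linear interactions between features. The method improves Shapley value estimates empirically on several benchmark datasets, and the estimates are proven to be consistent. The paper also shows that paired sampling is equivalent to second-order PolySHAP, providing the first strong theoretical justification for this widely used heuristic.

Link: https://arxiv.org/abs/2601.18608
Authors: Fabian Fumagalli,R. Teal Witter,Christopher Musco
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: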

Abstract:Shapley values have emerged as a central game-theoretic tool in explainable AI (XAI). However, computing Shapley values exactly requires 2^d game evaluations for a model with d features. Lundberg and Lee’s KernelSHAP algorithm has emerged as a leading method for avoiding this exponential cost. KernelSHAP approximates Shapley values by approximating the game as a linear function, which is fit using a small number of game evaluations for random feature subsets. In this work, we extend KernelSHAP by approximating the game via higher degree polynomials, which capture non-linear interactions between features. Our resulting PolySHAP method yields empirically better Shapley value estimates for various benchmark datasets, and we prove that these estimates are consistent. Moreover, we connect our approach to paired sampling (antithetic sampling), a ubiquitous modification to KernelSHAP that improves empirical accuracy. We prove that paired sampling outputs exactly the same Shapley value approximations as second-order PolySHAP, without ever fitting a degree 2 polynomial. To the best of our knowledge, this finding provides the first strong theoretical justification for the excellent practical performance of the paired sampling heuristic.
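
The exponential cost mentioned in the abstract is easy to see in a brute-force computation. The sketch below evaluates the classical Shapley formula exactly by enumerating coalitions; it is the baseline that KernelSHAP and PolySHAP approximate, not either of those algorithms, and the three-feature `toy_game` with one pairwise interaction is a made-up example.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, d):
    """Exact Shapley values for a game `value(frozenset) -> float` on d players."""
    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_game(coalition):
    """Additive payoffs plus a bonus when features 0 and 1 are both present."""
    payoff = sum((1.0, 2.0, 3.0)[j] for j in coalition)
    if {0, 1} <= coalition:
        payoff += 4.0  # interaction term that a linear model cannot represent
    return payoff


phi = shapley_values(toy_game, d=3)  # approximately [3.0, 4.0, 3.0]
```

The interaction bonus of 4 is split evenly between features 0 and 1, and the values sum to the payoff of the full coalition. The inner loops visit every subset of the other features, which is what becomes infeasible as d grows.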

[AI-18] A Balanced Neuro-Symbolic Approach for Commonsense Abductive Logic

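【Quick Read】: This paper addresses the unreliability of large language models (LLMs) on problems that require complex proof planning, especially logical reasoning tasks in which commonsense relations are missing. Existing solver-based approaches assume that all relevant facts are given and cannot supply implicit commonsense knowledge, so reasoning fails. The key to the solution is an iterative feedback mechanism: feedback from a logic solver guides the LLM to supply the missing commonsense relations step by step, with a search over potential commonsense assumptions that maximizes the chance of finding useful facts while keeping cost tractable. On pure-logical reasoning datasets from which some commonsense information has been removed, the method achieves considerable improvements over existing techniques, demonstrating the value of balancing neural and symbolic elements when working in human contexts.

Link: https://arxiv.org/abs/2601.18595
Authors: Joseph Cotnareanu,Didier Chetelat,Yingxue Zhang,Mark Coates
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: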

Abstract:Although Large Language Models (LLMs) have demonstrated impressive formal reasoning abilities, they often break down when problems require complex proof planning. One promising approach for improving LLM reasoning abilities involves translating problems into formal logic and using a logic solver. Although off-the-shelf logic solvers are in principle substantially more efficient than LLMs at logical reasoning, they assume that all relevant facts are provided in a question and are unable to deal with missing commonsense relations. In this work, we propose a novel method that uses feedback from the logic solver to augment a logic problem with commonsense relations provided by the LLM, in an iterative manner. This involves a search procedure through potential commonsense assumptions to maximize the chance of finding useful facts while keeping cost tractable. On a collection of pure-logical reasoning datasets, from which some commonsense information has been removed, our method consistently achieves considerable improvements over existing techniques, demonstrating the value in balancing neural and symbolic elements when working in human contexts.
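
The solver-in-the-loop idea can be sketched with a tiny forward-chaining prover that asks for extra assumptions only when a proof fails. This is a toy illustration under stated assumptions, not the paper's method: the hand-written `candidates` list stands in for commonsense relations proposed by an LLM, the search tries one assumption at a time in order, and all fact names are hypothetical.

```python
def forward_chain(facts, rules):
    """Derive every fact reachable from `facts` with Horn rules (premises, conclusion)."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known


def abduce(facts, rules, goal, candidates):
    """Return the assumptions added to prove `goal`, or None if none suffices.

    `candidates` stands in for commonsense facts proposed by an LLM.
    """
    if goal in forward_chain(facts, rules):
        return []  # provable without any extra assumption
    for candidate in candidates:
        if goal in forward_chain(set(facts) | {candidate}, rules):
            return [candidate]
    return None


rules = [
    (("tweety_is_bird", "birds_can_fly"), "tweety_can_fly"),
    (("tweety_can_fly",), "tweety_can_reach_nest"),
]
facts = {"tweety_is_bird"}
candidates = ["birds_are_blue", "birds_can_fly"]  # hypothetical LLM proposals

added = abduce(facts, rules, "tweety_can_reach_nest", candidates)
```

Here the prover cannot reach the goal from the stated facts alone, rejects the irrelevant proposal, and accepts `birds_can_fly` because it completes the proof. The paper's search over assumption sets and its cost control are considerably more involved.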

[AI-19] Learning long term climate-resilient transport adaptation pathways under direct and indirect flood impacts using reinforcement learning

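【Quick Read】: This paper addresses the difficulty of making long-term adaptation investment decisions for urban transport systems as climate change makes extreme weather such as heavy rainfall more frequent. The core challenges are the long-term, sequential nature of infrastructure investments, deep uncertainty, and complex cross-sector interactions. The key to the solution is a generic decision-support framework that couples an integrated assessment model (IAM) with reinforcement learning (RL). Starting from climate projections (e.g., IPCC scenario pathways), it maps extreme-weather drivers (e.g., rain) to hazard likelihoods (e.g., flooding), propagates hazards into urban infrastructure impacts (e.g., transport disruption), and values the resulting losses in service performance and societal costs. Embedded in an RL loop, the framework learns adaptive, multi-decade investment policies that trade off investment and maintenance expenditures against avoided impacts. Demonstrated on pluvial flooding in Copenhagen, the learned strategies yield better coordinated spatial-temporal pathways and improved robustness relative to baseline strategies (inaction and random action), which suggests that the framework can transfer to other hazards and cities.

Link: https://arxiv.org/abs/2601.18586
Authors: Miguel Costa,Arthur Vandervoort,Carolin Schmidt,Morten W. Petersen,Martin Drews,Karyn Morrissey,Francisco C. Pereira
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: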

Abstract:Climate change is expected to intensify rainfall and other hazards, increasing disruptions in urban transportation systems. Designing effective adaptation strategies is challenging due to the long-term, sequential nature of infrastructure investments, deep uncertainty, and complex cross-sector interactions. We propose a generic decision-support framework that couples an integrated assessment model (IAM) with reinforcement learning (RL) to learn adaptive, multi-decade investment pathways under uncertainty. The framework combines long-term climate projections (e.g., IPCC scenario pathways) with models that map projected extreme-weather drivers (e.g. rain) into hazard likelihoods (e.g. flooding), propagate hazards into urban infrastructure impacts (e.g. transport disruption), and value direct and indirect consequences for service performance and societal costs. Embedded in a reinforcement-learning loop, it learns adaptive climate adaptation policies that trade off investment and maintenance expenditures against avoided impacts. In collaboration with Copenhagen Municipality, we demonstrate the approach on pluvial flooding in the inner city for the horizon of 2024 to 2100. The learned strategies yield coordinated spatial-temporal pathways and improved robustness relative to conventional optimization baselines, namely inaction and random action, illustrating the framework’s transferability to other hazards and cities.

[AI-20] FastInsight: Fast and Insightful Retrieval via Fusion Operators for Graph RAG

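【Quick Read】: This paper addresses the inefficiency of existing Graph RAG (graph retrieval-augmented generation) methods, which rely on time-consuming large language model (LLM) reasoning to achieve insightful retrieval on corpus graphs. It identifies two key limitations of current approaches: model-based search is blind to graph topology (topology-blindness), and graph search is blind to semantics (semantics-blindness). The key to the solution is FastInsight, a framework built on two novel fusion operators: the Graph-based Reranker (GRanker), which acts as a graph model-based search and adds topology awareness, and Semantic-Topological eXpansion (STeX), which acts as a vector-graph search and combines semantic and topological signals. Experiments show that FastInsight significantly improves both retrieval accuracy and generation quality over state-of-the-art baselines, achieving a substantial Pareto improvement in the trade-off between effectiveness and efficiency.

Link: https://arxiv.org/abs/2601.18579
Authors: Seonho An,Chaejeong Hyun,Min-Soo Kim
Affiliation: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Under review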

Abstract:Existing Graph RAG methods aiming for insightful retrieval on corpus graphs typically rely on time-intensive processes that interleave Large Language Model (LLM) reasoning. To enable time-efficient insightful retrieval, we propose FastInsight. We first introduce a graph retrieval taxonomy that categorizes existing methods into three fundamental operations: vector search, graph search, and model-based search. Through this taxonomy, we identify two critical limitations in current approaches: the topology-blindness of model-based search and the semantics-blindness of graph search. FastInsight overcomes these limitations by interleaving two novel fusion operators: the Graph-based Reranker (GRanker), which functions as a graph model-based search, and Semantic-Topological eXpansion (STeX), which operates as a vector-graph search. Extensive experiments on broad retrieval and generation datasets demonstrate that FastInsight significantly improves both retrieval accuracy and generation quality compared to state-of-the-art baselines, achieving a substantial Pareto improvement in the trade-off between effectiveness and efficiency.

[AI-21] Attention-Based Neural-Augmented Kalman Filter for Legged Robot State Estimation

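[Quick Read]: This paper addresses state-estimation error in legged robots caused by foot slip: when a foot slips, kinematic measurements violate the no-slip assumption and inject bias during the filter update step. The key to the solution is an Attention-Based Neural-Augmented Kalman Filter (AttenNKF), which embeds a neural compensator in an Invariant Extended Kalman Filter (InEKF). The compensator uses an attention mechanism to infer the error conditioned on foot-slip severity and corrects the InEKF state estimate as a post-update compensation. It is trained in a latent space to reduce sensitivity to raw input scales and to encourage structured, slip-conditioned compensation while preserving the InEKF recursion. Experiments show that the method outperforms existing legged-robot state estimators, particularly under slip-prone conditions.

Link: https://arxiv.org/abs/2601.18569
Authors: Seokju Lee, Kyung-Soo Kim
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 6 figures, Accepted to IEEE Robotics and Automation Letters (RA-L)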

Abstract:In this letter, we propose an Attention-Based Neural-Augmented Kalman Filter (AttenNKF) for state estimation in legged robots. Foot slip is a major source of estimation error: when slip occurs, kinematic measurements violate the no-slip assumption and inject bias during the update step. Our objective is to estimate this slip-induced error and compensate for it. To this end, we augment an Invariant Extended Kalman Filter (InEKF) with a neural compensator that uses an attention mechanism to infer error conditioned on foot-slip severity and then applies this estimate as a post-update compensation to the InEKF state (i.e., after the filter update). The compensator is trained in a latent space, which aims to reduce sensitivity to raw input scales and encourages structured slip-conditioned compensations, while preserving the InEKF recursion. Experiments demonstrate improved performance compared to existing legged-robot state estimators, particularly under slip-prone conditions.
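
The post-update compensation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a plain linear Kalman update stands in for the InEKF, `compensator` is a hypothetical callable standing in for the attention-based network, and keeping the uncompensated state for the next filter step is only one possible reading of "preserving the InEKF recursion".

```python
import numpy as np

def kf_update(x, P, z, H, R):
    """Standard linear Kalman measurement update (stand-in for the InEKF update)."""
    S = H @ P @ H.T + R                    # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    x_upd = x + K @ (z - H @ x)            # corrected state
    P_upd = (np.eye(len(x)) - K @ H) @ P   # corrected covariance
    return x_upd, P_upd

def compensated_update(x, P, z, H, R, compensator):
    """Run the filter update, then add a learned correction to the output state.

    `compensator(x_upd, z)` is a hypothetical model returning an additive
    correction; the filter state and covariance themselves are left unchanged.
    """
    x_upd, P_upd = kf_update(x, P, z, H, R)
    x_comp = x_upd + compensator(x_upd, z)  # post-update compensation
    return x_upd, P_upd, x_comp
```

In this sketch the caller would pass `x_upd` and `P_upd` to the next prediction step and report `x_comp` as the estimate.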

[AI-22] Deconstructing Instruction-Following: A New Benchmark for Granular Evaluation of Large Language Model Instruction Compliance Abilities EACL2026

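[Quick Read]: This paper addresses how to evaluate whether large language models (LLMs) comply with complex instructions; existing benchmarks often fail to reflect real-world use and cannot separate instruction compliance from task success. The key to the solution is MOSAIC (MOdular Synthetic Assessment of Instruction Compliance), a framework that dynamically generates a synthetic dataset with up to 20 application-oriented generation constraints, enabling granular and independent analysis of compliance. The analysis shows how performance varies with constraint type, quantity, and position; it reveals model-specific weaknesses, synergistic and conflicting interactions between instructions, and positional biases such as primacy and recency effects, providing insight for diagnosing model failures and building more reliable LLMs.

Link: https://arxiv.org/abs/2601.18554
Authors: Alberto Purpura, Li Wang, Sahil Badyal, Eugenio Beaufrand, Adam Faulkner
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Paper accepted to EACL 2026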

Abstract:Reliably ensuring Large Language Models (LLMs) follow complex instructions is a critical challenge, as existing benchmarks often fail to reflect real-world use or isolate compliance from task success. We introduce MOSAIC (MOdular Synthetic Assessment of Instruction Compliance), a modular framework that uses a dynamically generated dataset with up to 20 application-oriented generation constraints to enable a granular and independent analysis of this capability. Our evaluation of five LLMs from different families based on this new benchmark demonstrates that compliance is not a monolithic capability but varies significantly with constraint type, quantity, and position. The analysis reveals model-specific weaknesses, uncovers synergistic and conflicting interactions between instructions, and identifies distinct positional biases such as primacy and recency effects. These granular insights are critical for diagnosing model failures and developing more reliable LLMs for systems that demand strict adherence to complex instructions.

[AI-23] SKETCH: Semantic Key-Point Conditioning for Long-Horizon Vessel Trajectory Prediction

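[Quick Read]: This paper addresses the compounded uncertainty that complex navigation behaviors and environmental factors introduce into long-horizon vessel trajectory prediction; existing methods often lack global directional consistency and produce drifting or implausible extrapolations. The key to the solution is a semantic-key-point-conditioned trajectory modeling framework that conditions generation of the future trajectory on high-level semantic information, the Next Key Point (NKP). This decomposes long-horizon prediction into two stages, global semantic decision-making and local motion modeling, restricting the support of future trajectories to semantically feasible subsets and improving the accuracy and plausibility of long-term predictions.

Link: https://arxiv.org/abs/2601.18537
Authors: Linyong Gan, Zimo Li, Wenxin Xu, Xingjian Li, Jianhua Z. Huang, Enmei Tu, Shuhang Chen
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: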

Abstract:Accurate long-horizon vessel trajectory prediction remains challenging due to compounded uncertainty from complex navigation behaviors and environmental factors. Existing methods often struggle to maintain global directional consistency, leading to drifting or implausible trajectories when extrapolated over long time horizons. To address this issue, we propose a semantic-key-point-conditioned trajectory modeling framework, in which future trajectories are predicted by conditioning on a high-level Next Key Point (NKP) that captures navigational intent. This formulation decomposes long-horizon prediction into global semantic decision-making and local motion modeling, effectively restricting the support of future trajectories to semantically feasible subsets. To efficiently estimate the NKP prior from historical observations, we adopt a pretrain-finetune strategy. Extensive experiments on real-world AIS data demonstrate that the proposed method consistently outperforms state-of-the-art approaches, particularly for long travel durations, directional accuracy, and fine-grained trajectory prediction.

[AI-24] Scalable Transit Delay Prediction at City Scale: A Systematic Approach with Multi-Resolution Feature Engineering and Deep Learning

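[Quick Read]: This paper addresses the lack of reliable, network-wide delay prediction in urban bus systems, which passengers need for accurate arrival information and operators need for real-time dispatching. The core difficulty is that existing methods typically cover only a few routes, depend on hand-crafted features, and are hard to scale to large urban networks. The key to the solution is a scalable city-scale prediction pipeline that combines multi-resolution feature engineering, dimensionality reduction with Adaptive PCA, and deep learning. Specifically, it generates 1,683 spatiotemporal features by exploring 23 aggregation combinations built on H3 cells, and compresses them with Adaptive PCA into 83 principal components that preserve 95% of the variance. A hybrid H3+topology clustering strategy avoids the "giant cluster" problem in dense areas and yields 12 balanced route clusters, enabling efficient distributed training. A global LSTM with cluster-aware features achieves the best trade-off between accuracy and efficiency, outperforming transformer models by 18 to 52% with 275 times fewer parameters, which supports the framework's applicability to real deployments and its potential for reuse.

Link: https://arxiv.org/abs/2601.18521
Authors: Emna Boudabbous, Mohamed Karaa, Lokman Sboui, Julio Montecinos, Omar Alam
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This manuscript is a preprint of an earlier version. A revised system-oriented version is currently under review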

Abstract:Urban bus transit agencies need reliable, network-wide delay predictions to provide accurate arrival information to passengers and support real-time operational control. Accurate predictions help passengers plan their trips, reduce waiting time, and allow operations staff to adjust headways, dispatch extra vehicles, and manage disruptions. Although real-time feeds such as GTFS-Realtime (GTFS-RT) are now widely available, most existing delay prediction systems handle only a few routes, depend on hand-crafted features, and offer little guidance on how to design a scalable, reusable architecture. We present a city-scale prediction pipeline that combines multi-resolution feature engineering, dimensionality reduction, and deep learning. The framework generates 1,683 spatiotemporal features by exploring 23 aggregation combinations over H3 cells, routes, segments, and temporal patterns, and compresses them into 83 components using Adaptive PCA while preserving 95% of the variance. To avoid the “giant cluster” problem that occurs when dense urban areas fall into a single H3 region, we introduce a hybrid H3+topology clustering method that yields 12 balanced route clusters (coefficient of variation 0.608) and enables efficient distributed training. We compare five model architectures on six months of bus operations from the Société de transport de Montréal (STM) network in Montréal. A global LSTM with cluster-aware features achieves the best trade-off between accuracy and efficiency, outperforming transformer models by 18 to 52% while using 275 times fewer parameters. We also report multi-level evaluation at the elementary segment, segment, and trip level with walk-forward validation and latency analysis, showing that the proposed pipeline is suitable for real-time, city-scale deployment and can be reused for other networks with limited adaptation.
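
Two quantities in the abstract, the number of components needed to preserve 95% of the variance and the coefficient of variation of cluster sizes, can be computed as in the sketch below. It uses ordinary PCA via SVD rather than the paper's Adaptive PCA, whose details the abstract does not give, and it assumes the population standard deviation for the coefficient of variation. The function names are illustrative.

```python
import numpy as np

def pca_keep_variance(X, target=0.95):
    """Project X onto the fewest principal components whose cumulative
    explained-variance ratio reaches `target` (plain PCA, not Adaptive PCA)."""
    Xc = X - X.mean(axis=0)                        # center each feature
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)     # cumulative variance ratio
    k = int(np.searchsorted(ratio, target)) + 1    # first index reaching target
    k = min(k, len(s))                             # guard against rounding
    return Xc @ Vt[:k].T, k

def coefficient_of_variation(sizes):
    """Population standard deviation divided by the mean (an assumed definition)."""
    sizes = np.asarray(sizes, dtype=float)
    return sizes.std() / sizes.mean()
```

Applied to a feature matrix with one row per observation, `pca_keep_variance` returns the reduced matrix and the number of retained components; a lower coefficient of variation over cluster sizes indicates more balanced clusters.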

[AI-25] Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates

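[Quick Read]: This paper addresses the difficulty of continual adaptation for large language model (LLM) agents after deployment, where conventional reinforcement learning (RL) is impractical because of its high computational cost and risk of catastrophic forgetting. The key to the solution is Just-In-Time Reinforcement Learning (JitRL), a training-free framework that maintains a dynamic, non-parametric memory of experiences, retrieves relevant trajectories at test time to estimate action advantages on the fly, and applies an additive update directly to the LLM's output logits. The authors prove that this update rule is the closed-form solution to the KL-constrained policy optimization objective, so the policy is optimized without any gradient updates. JitRL significantly outperforms existing training-free methods and surpasses costly fine-tuning methods such as WebRL, while reducing costs by more than 30 times.

Link: https://arxiv.org/abs/2601.18510
Authors: Yibo Li, Zijie Lin, Ailin Deng, Xuan Zhang, Yufei He, Shuo Ji, Tri Cao, Bryan Hooi
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: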

Abstract:While Large Language Model (LLM) agents excel at general tasks, they inherently struggle with continual adaptation due to the frozen weights after deployment. Conventional reinforcement learning (RL) offers a solution but incurs prohibitive computational costs and the risk of catastrophic forgetting. We introduce Just-In-Time Reinforcement Learning (JitRL), a training-free framework that enables test-time policy optimization without any gradient updates. JitRL maintains a dynamic, non-parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on-the-fly. These estimates are then used to directly modulate the LLM’s output logits. We theoretically prove that this additive update rule is the exact closed-form solution to the KL-constrained policy optimization objective. Extensive experiments on WebArena and Jericho demonstrate that JitRL establishes a new state-of-the-art among training-free methods. Crucially, JitRL outperforms the performance of computationally expensive fine-tuning methods (e.g., WebRL) while reducing monetary costs by over 30 times, offering a scalable path for continual learning agents. The code is available at this https URL.
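
The additive logit update can be sketched as follows. For a reference policy pi_ref and advantage estimates A(a), the maximizer of E_pi[A] - beta * KL(pi || pi_ref) is pi*(a) proportional to pi_ref(a) * exp(A(a) / beta), which amounts to adding A / beta to the logits. The temperature `beta`, the function names, and the toy advantage values are assumptions; the memory retrieval and advantage estimation steps of JitRL are not shown, and the paper's exact formulation may differ.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def jit_policy(logits, advantages, beta=1.0):
    """Additive logit update: softmax(logits + A / beta).

    `beta` is an assumed KL-penalty weight; larger values keep the
    updated policy closer to the reference policy.
    """
    logits = np.asarray(logits, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    return softmax(logits + advantages / beta)

def kl_objective(pi, ref, advantages, beta):
    """Expected advantage minus beta times KL(pi || ref)."""
    return float(pi @ advantages - beta * np.sum(pi * np.log(pi / ref)))
```

With uniform reference logits `[0, 0]` and advantages `[log 3, 0]`, the updated policy puts probability 0.75 on the first action, and `kl_objective` is higher there than at the reference policy.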

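[AI-26] DEEPMED: Building a Medical DeepResearch Agent via Multi-hop Med-Search Data and Turn-Controlled Agentic Training Inference

[Quick Read]: This paper addresses two problems: medical reasoning models are limited by parametric knowledge and therefore prone to forgetting and hallucination, and general DeepResearch (DR) models yield only limited gains when transferred directly to the medical domain. The authors attribute the latter to two gaps. First, medical tasks require interpreting evidence in a clinical context; existing DR models can retrieve information but lack clinical reasoning, so they "find it but fail to use it." Second, blindly scaling tool calls in medical scenarios injects noisy context, which derails sensitive medical reasoning and triggers repeated evidence-seeking along incorrect paths. The key to the solution has three parts: a multi-hop med-search QA data synthesis method that lets the model apply the DR paradigm in medical contexts; a difficulty-aware turn penalty during training that suppresses excessive tool calls; and a monitor at inference time that validates hypotheses within a controlled number of steps and prevents context rot. Across seven medical benchmarks, the approach improves its base model by 9.79% on average and outperforms larger medical reasoning and DR models.

Link: https://arxiv.org/abs/2601.18496
Authors: Zihan wang, Hao Wang, Shi Feng, Xiaocui Yang, Daling Wang, Yiqun Zhang, Jinghao Lin, Haihua Yang, Xiaozhong Ji
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: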

Abstract:Medical reasoning models remain constrained by parametric knowledge and are thus susceptible to forgetting and hallucinations. DeepResearch (DR) models ground outputs in verifiable evidence from tools and perform strongly in general domains, but their direct transfer to medical field yields relatively limited gains. We attribute this to two gaps: task characteristic and tool-use scaling. Medical questions require evidence interpretation in a knowledge-intensive clinical context; while general DR models can retrieve information, they often lack clinical-context reasoning and thus “find it but fail to use it,” leaving performance limited by medical abilities. Moreover, in medical scenarios, blindly scaling tool-call can inject noisy context, derailing sensitive medical reasoning and prompting repetitive evidence-seeking along incorrect paths. Therefore, we propose DeepMed. For data, we deploy a multi-hop med-search QA synthesis method supporting the model to apply the DR paradigm in medical contexts. For training, we introduce a difficulty-aware turn-penalty to suppress excessive tool-call growth. For inference, we bring a monitor to help validate hypotheses within a controlled number of steps and avoid context rot. Overall, on seven medical benchmarks, DeepMed improves its base model by 9.79% on average and outperforms larger medical reasoning and DR models.

[AI-27] OffSeeker: Online Reinforcement Learning Is Not All You Need for Deep Research Agents

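[Quick Read]: This paper addresses the reliance of current deep research agents on online reinforcement learning (RL) for long-horizon tasks, which is expensive because of extensive API calls, and asks how to train such agents efficiently without sacrificing performance. The key to the solution is a fully open-source suite for offline training. Its core components are DeepForge, a task synthesis framework that generates large-scale research queries without heavy preprocessing, and a curated dataset of 66k QA pairs, 33k supervised fine-tuning (SFT) trajectories, and 21k Direct Preference Optimization (DPO) pairs. Using these resources, the authors train OffSeeker (8B parameters) entirely on offline data; across six benchmarks it leads among similar-sized agents and remains competitive with 30B-parameter systems trained with heavy online RL.

Link: https://arxiv.org/abs/2601.18467
Authors: Yuhang Zhou, Kai Zheng, Qiguang Chen, Mengkang Hu, Qingfeng Sun, Can Xu, Jingjing Chen
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: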

Abstract:Deep research agents have shown remarkable potential in handling long-horizon tasks. However, state-of-the-art performance typically relies on online reinforcement learning (RL), which is financially expensive due to extensive API calls. While offline training offers a more efficient alternative, its progress is hindered by the scarcity of high-quality research trajectories. In this paper, we demonstrate that expensive online reinforcement learning is not all you need to build powerful research agents. To bridge this gap, we introduce a fully open-source suite designed for effective offline training. Our core contributions include DeepForge, a ready-to-use task synthesis framework that generates large-scale research queries without heavy preprocessing; and a curated collection of 66k QA pairs, 33k SFT trajectories, and 21k DPO pairs. Leveraging these resources, we train OffSeeker (8B), a model developed entirely offline. Extensive evaluations across six benchmarks show that OffSeeker not only leads among similar-sized agents but also remains competitive with 30B-parameter systems trained via heavy online RL.

[AI-28] GCFX: Generative Counterfactual Explanations for Deep Graph Models at the Model Level

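[Quick Read]: This paper addresses the lack of transparency in deep graph learning models, whose complex internal structure makes their decisions hard to explain and therefore hard for users to understand and trust. The key to the solution is GCFX, a model-level counterfactual explanation method based on deep graph generation. Its main contributions are: (1) an architecture combining dual encoders, structure-aware taggers, and Message Passing Neural Network decoders, which learns the true latent distribution of the input data and generates high-quality, closely related counterfactual examples; and (2) a global counterfactual summarization algorithm that selects the most representative and comprehensive explanations from many candidate counterfactuals, revealing the model's global predictive patterns. On a synthetic dataset and several real-world datasets, GCFX outperforms existing methods in counterfactual validity and coverage while keeping explanation costs low.

Link: https://arxiv.org/abs/2601.18447
Authors: Jinlong Hu, Jiacheng Liu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: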

Abstract:Deep graph learning models have demonstrated remarkable capabilities in processing graph-structured data and have been widely applied across various fields. However, their complex internal architectures and lack of transparency make it difficult to explain their decisions, resulting in opaque models that users find hard to understand and trust. In this paper, we explore model-level explanation techniques for deep graph learning models, aiming to provide users with a comprehensive understanding of the models’ overall decision-making processes and underlying mechanisms. Specifically, we address the problem of counterfactual explanations for deep graph learning models by introducing a generative model-level counterfactual explanation approach called GCFX, which is based on deep graph generation. This approach generates a set of high-quality counterfactual explanations that reflect the model’s global predictive behavior by leveraging an enhanced deep graph generation framework and a global summarization algorithm. GCFX features an architecture that combines dual encoders, structure-aware taggers, and Message Passing Neural Network decoders, enabling it to accurately learn the true latent distribution of input data and generate high-quality, closely related counterfactual examples. Subsequently, a global counterfactual summarization algorithm selects the most representative and comprehensive explanations from numerous candidate counterfactuals, providing broad insights into the model’s global predictive patterns. Experiments on a synthetic dataset and several real-world datasets demonstrate that GCFX outperforms existing methods in terms of counterfactual validity and coverage while maintaining low explanation costs, thereby offering crucial support for enhancing the practicality and trustworthiness of global counterfactual explanations.

[AI-29] Gradient Regularized Natural Gradients

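[Quick Read]: This paper addresses poor convergence and limited generalization in second-order optimizers, such as natural gradient descent, that arise from a lack of gradient stability during training. Prior work shows that gradient regularization (GR) improves generalization, but its combination with second-order methods has received little attention. The paper proposes Gradient-Regularized Natural Gradients (GRNG), a scalable second-order optimization framework whose key idea is to integrate explicit gradient regularization into natural gradient updates, retaining the optimization speed of natural gradients while improving training stability and enabling convergence to global minima. The framework provides two complementary algorithms: a frequentist variant that avoids explicit inversion of the Fisher Information Matrix (FIM) through structured approximations, and a Bayesian variant based on a regularized Kalman formulation that eliminates FIM inversion entirely. Both come with convergence guarantees, and on vision and language tasks GRNG improves on first-order methods (SGD, AdamW) and second-order baselines (K-FAC, Sophia).

Link: https://arxiv.org/abs/2601.18420
Authors: Satya Prakash Dash, Hossein Abdi, Wei Pan, Samuel Kaski, Mingfei Sun
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: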

Abstract:Gradient regularization (GR) has been shown to improve the generalizability of trained models. While Natural Gradient Descent has been shown to accelerate optimization in the initial phase of training, little attention has been paid to how the training dynamics of second-order optimizers can benefit from GR. In this work, we propose Gradient-Regularized Natural Gradients (GRNG), a family of scalable second-order optimizers that integrate explicit gradient regularization with natural gradient updates. Our framework provides two complementary algorithms: a frequentist variant that avoids explicit inversion of the Fisher Information Matrix (FIM) via structured approximations, and a Bayesian variant based on a Regularized-Kalman formulation that eliminates the need for FIM inversion entirely. We establish convergence guarantees for GRNG, showing that gradient regularization improves stability and enables convergence to global minima. Empirically, we demonstrate that GRNG consistently enhances both optimization speed and generalization compared to first-order methods (SGD, AdamW) and second-order baselines (K-FAC, Sophia), with strong results on vision and language benchmarks. Our findings highlight gradient regularization as a principled and practical tool to unlock the robustness of natural gradient methods for large-scale deep learning.

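[AI-30] daVinci-Dev: Agent-native Mid-training for Software Engineering

[Quick Read]: This paper addresses the lack of efficient, scalable mid-training (MT) methods as large language models (LLMs) move from single-turn code generation toward agentic software engineering. Current mainstream approaches rely on expensive reinforcement learning, while mid-training remains underexplored because of its resource requirements, making it a bottleneck for instilling foundational agentic behaviors. The core of the solution is agent-native data, which comprises two complementary types of trajectories: contextually-native trajectories, which preserve the complete information flow an agent experiences and offer broad coverage and diversity; and environmentally-native trajectories, collected from executable repositories through actual tool invocations and test executions, which provide depth and interaction authenticity. The approach targets the distribution mismatch between static training data and the dynamic development environment and improves the model's agentic capabilities in realistic settings.

Link: https://arxiv.org/abs/2601.18418
Authors: Ji Zeng, Dayuan Fu, Tiantian Mi, Yumin Zhuang, Yaxing Huang, Xuefeng Li, Lyumanshan Ye, Muhang Xie, Qishuo Hua, Zhen Huang, Mohan Jiang, Hanning Wang, Jifan Lin, Yang Xiao, Jie Sun, Yunze Wu, Pengfei Liu
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: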

Abstract:Recently, the frontier of Large Language Model (LLM) capabilities has shifted from single-turn code generation to agentic software engineering, a paradigm where models autonomously navigate, edit, and test complex repositories. While post-training methods have become the de facto approach for code agents, agentic mid-training, that is, mid-training (MT) on large-scale data that mirrors authentic agentic workflows, remains critically underexplored due to substantial resource requirements, despite offering a more scalable path to instilling foundational agentic behaviors than relying solely on expensive reinforcement learning. A central challenge in realizing effective agentic mid-training is the distribution mismatch between static training data and the dynamic, feedback-rich environment of real development. To address this, we present a systematic study of agentic mid-training, establishing both the data synthesis principles and training methodology for effective agent development at scale. Central to our approach is agent-native data, supervision comprising two complementary types of trajectories: contextually-native trajectories that preserve the complete information flow an agent experiences, offering broad coverage and diversity; and environmentally-native trajectories collected from executable repositories where observations stem from actual tool invocations and test executions, providing depth and interaction authenticity. We verify the model’s agentic capabilities on SWE-Bench Verified. We demonstrate our superiority over the previous open software engineering mid-training recipe Kimi-Dev under two post-training settings with an aligned base model and agentic scaffold, while using less than half mid-training tokens (73.1B). Besides relative advantage, our best performing 32B and 72B models achieve 56.1% and 58.5% resolution rates, respectively, which are …

[AI-31] AI Agent for Reverse-Engineering Legacy Finite-Difference Code and Translating to Devito

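[Quick Read]: This paper addresses the lack of intelligent support for migrating legacy finite-difference code to the Devito environment, where knowledge acquisition, semantic understanding, and code generation are inefficient and error-prone. The core of the solution is an AI agent framework built on multi-stage iterative workflows that combines Retrieval-Augmented Generation (RAG) with open-source large language models and uses a LangGraph architecture for state-aware dynamic routing and concurrent processing. Key components include: a Devito knowledge graph built with Leiden community detection to improve query performance across domains such as seismic wave simulation and computational fluid dynamics; a reverse-engineering component that derives three-level query strategies for retrieval through static analysis of Fortran source code; Pydantic-based constraints that keep code synthesis structured and reliable; and feedback mechanisms motivated by reinforcement learning that move the system from static translation toward dynamic, adaptive analytical behavior.

Link: https://arxiv.org/abs/2601.18381
Authors: Yinghan Hou, Zongyou Yang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 14 pages, 7 figures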

Abstract:To facilitate the transformation of legacy finite difference implementations into the Devito environment, this study develops an integrated AI agent framework. Retrieval-Augmented Generation (RAG) and open-source Large Language Models are combined through multi-stage iterative workflows in the system’s hybrid LangGraph architecture. The agent constructs an extensive Devito knowledge graph through document parsing, structure-aware segmentation, extraction of entity relationships, and Leiden-based community detection. GraphRAG optimisation enhances query performance across semantic communities that include seismic wave simulation, computational fluid dynamics, and performance tuning libraries. A reverse engineering component derives three-level query strategies for RAG retrieval through static analysis of Fortran source code. To deliver precise contextual information for language model guidance, the multi-stage retrieval pipeline performs parallel searching, concept expansion, community-scale retrieval, and semantic similarity analysis. Code synthesis is governed by Pydantic-based constraints to guarantee structured outputs and reliability. A comprehensive validation framework integrates conventional static analysis with the G-Eval approach, covering execution correctness, structural soundness, mathematical consistency, and API compliance. The overall agent workflow is implemented on the LangGraph framework and adopts concurrent processing to support quality-based iterative refinement and state-aware dynamic routing. The principal contribution lies in the incorporation of feedback mechanisms motivated by reinforcement learning, enabling a transition from static code translation toward dynamic and adaptive analytical behavior.

[AI-32] Analytic Incremental Learning For Sound Source Localization With Imbalance Rectification ICASSP26

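[Quick Read]: This paper addresses the dual imbalance that sound source localization (SSL) faces in real-world deployment: intra-task imbalance, arising from long-tailed direction-of-arrival (DoA) distributions, and inter-task imbalance, induced by cross-task skews and overlaps. Together these often cause catastrophic forgetting and significantly degrade localization accuracy. The key to the solution is a unified framework with two innovations: a data augmentation method (GDA) based on Generalized Cross-Correlation with Phase Transform (GCC-PHAT) that uses peak characteristics to alleviate intra-task distribution skew; and an Analytic Dynamic Imbalance Rectifier (ADIR) with task-adaptation regularization, which enables analytic updates that adapt to inter-task dynamics and makes the model robust to evolving imbalances without exemplar storage.

Link: https://arxiv.org/abs/2601.18335
Authors: Zexia Fan, Yu Chen, Qiquan Zhang, Kainan Chen, Xinyuan Qian
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Accepted by ICASSP26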

Abstract:Sound source localization (SSL) demonstrates remarkable results in controlled settings but struggles in real-world deployment due to dual imbalance challenges: intra-task imbalance arising from long-tailed direction-of-arrival (DoA) distributions, and inter-task imbalance induced by cross-task skews and overlaps. These often lead to catastrophic forgetting, significantly degrading the localization accuracy. To mitigate these issues, we propose a unified framework with two key innovations. Specifically, we design a GCC-PHAT-based data augmentation (GDA) method that leverages peak characteristics to alleviate intra-task distribution skews. We also propose an Analytic dynamic imbalance rectifier (ADIR) with task-adaption regularization, which enables analytic updates that adapt to inter-task dynamics. On the SSLR benchmark, our proposal achieves state-of-the-art (SoTA) results of 89.0% accuracy, 5.3° mean absolute error, and 1.6 backward transfer, demonstrating robustness to evolving imbalances without exemplar storage.
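
For readers unfamiliar with GCC-PHAT, the feature that the GDA augmentation builds on, the sketch below shows the standard time-delay estimate between two microphone signals. It covers only textbook GCC-PHAT, not the paper's augmentation or the ADIR component; the function name, the `max_shift` parameter, and the small constant added for numerical safety are illustrative choices.

```python
import numpy as np

def gcc_phat(sig, ref, fs=1, max_shift=None):
    """Estimate the delay of `sig` relative to `ref` (in seconds) with GCC-PHAT.

    Returns (delay, cross_correlation); a positive delay means `sig` lags `ref`.
    """
    n = len(sig) + len(ref)                  # zero-pad to avoid circular wrap
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)               # cross-power spectrum
    cross /= np.abs(cross) + 1e-12           # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    if max_shift is None:
        max_shift = n // 2
    max_shift = max(1, min(int(max_shift), n // 2))
    # Reorder so that index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs, cc
```

The location and sharpness of the peak in the returned cross-correlation carry the direction-of-arrival information for a microphone pair.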

[AI-33] A Generative AI-Driven Reliability Layer for Action-Oriented Disaster Resilience

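[Quick Read]: This paper addresses a weakness of conventional early warning systems (EWS) under intensifying climate-related hazards: they issue alerts quickly but often fail to prompt timely protective action, leading to preventable losses and inequities. The key to the solution is Climate RADAR (Risk-Aware, Dynamic, and Action Recommendation system), a generative-AI-based reliability layer that integrates meteorological, hydrological, vulnerability, and social data into a composite risk index and uses guardrail-embedded large language models (LLMs) to deliver personalized action recommendations to citizens, volunteers, and municipal agencies. This reframes disaster communication from "alerts delivered" to "actions executed"; the evaluation reports higher protective-action execution, reduced response latency, and increased usability and trust.

Link: https://arxiv.org/abs/2601.18308
Authors: Geunsik Lim
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI); Systems and Control (eess.SY)
Comments: 19 pages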

Abstract:As climate-related hazards intensify, conventional early warning systems (EWS) disseminate alerts rapidly but often fail to trigger timely protective actions, leading to preventable losses and inequities. We introduce Climate RADAR (Risk-Aware, Dynamic, and Action Recommendation system), a generative AI-based reliability layer that reframes disaster communication from alerts delivered to actions executed. It integrates meteorological, hydrological, vulnerability, and social data into a composite risk index and employs guardrail-embedded large language models (LLMs) to deliver personalized recommendations across citizen, volunteer, and municipal interfaces. Evaluation through simulations, user studies, and a municipal pilot shows improved outcomes, including higher protective action execution, reduced response latency, and increased usability and trust. By combining predictive analytics, behavioral science, and responsible AI, Climate RADAR advances people-centered, transparent, and equitable early warning systems, offering practical pathways toward compliance-ready disaster resilience infrastructures.

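[AI-34] TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment

[Quick Read]: This paper addresses the safety risk of large language models (LLMs) generating harmful or inappropriate content in real applications, and in particular how to improve safety alignment efficiently and continuously. The key to the solution is TriPlay-RL, a closed-loop reinforcement learning framework in which three roles, an attacker, a defender, and an evaluator, are iteratively co-optimized with near-zero manual annotation. The attacker improves the effectiveness and diversity of its adversarial prompts, the defender becomes safer without degrading general reasoning capability, and the evaluator refines its fine-grained judgment of response safety over successive iterations, yielding a scalable and efficient paradigm for LLM safety alignment.

Link: https://arxiv.org/abs/2601.18292
Authors: Zhewen Tan, Wenhan Yu, Jianfeng Si, Tongxin Liu, Kaiqi Guan, Huiyan Jin, Jiawen Tao, Xiaokun Yuan, Duohe Ma, Xiangzheng Zhang, Tong Yang, Lin Sun
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: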

Abstract:In recent years, safety risks associated with large language models have become increasingly prominent, highlighting the urgent need to mitigate the generation of toxic and harmful content. The mainstream paradigm for LLM safety alignment typically adopts a collaborative framework involving three roles: an attacker for adversarial prompt generation, a defender for safety defense, and an evaluator for response assessment. In this paper, we propose a closed-loop reinforcement learning framework called TriPlay-RL that enables iterative and co-improving collaboration among three roles with near-zero manual annotation. Experimental results show that the attacker preserves high output diversity while achieving a 20%-50% improvement in adversarial effectiveness; the defender attains 10%-30% gains in safety performance without degrading general reasoning capability; and the evaluator continuously refines its fine-grained judgment ability through iterations, accurately distinguishing unsafe responses, simple refusals, and useful guidance. Overall, our framework establishes an efficient and scalable paradigm for LLM safety alignment, enabling continuous co-evolution within a unified learning loop.

[AI-35] What Do Learned Models Measure?

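[Quick Read]: This paper addresses the fact that existing evaluation frameworks cannot ensure measurement stability when machine learning models are used as measurement instruments. The core problem is that when model outputs are interpreted as measurements of physical or abstract quantities, models with comparable predictive performance (for example, similar generalization error, calibration, or robustness) may still implicitly define inequivalent measurement functions, producing systematic differences in practical settings such as distribution shift. The key to the solution is to propose and formalize measurement stability as a new evaluative dimension, requiring that the measured quantity remain invariant across realizations of the learning process and across contexts, to complement standard evaluation criteria.

Link: https://arxiv.org/abs/2601.18278
Authors: Indrė Žliobaitė
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: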

Abstract:In many scientific and data-driven applications, machine learning models are increasingly used as measurement instruments, rather than merely as predictors of predefined labels. When the measurement function is learned from data, the mapping from observations to quantities is determined implicitly by the training distribution and inductive biases, allowing multiple inequivalent mappings to satisfy standard predictive evaluation criteria. We formalize learned measurement functions as a distinct focus of evaluation and introduce measurement stability, a property capturing invariance of the measured quantity across admissible realizations of the learning process and across contexts. We show that standard evaluation criteria in machine learning, including generalization error, calibration, and robustness, do not guarantee measurement stability. Through a real-world case study, we show that models with comparable predictive performance can implement systematically inequivalent measurement functions, with distribution shift providing a concrete illustration of this failure. Taken together, our results highlight a limitation of existing evaluation frameworks in settings where learned model outputs are identified as measurements, motivating the need for an additional evaluative dimension.

[AI-36] Neural Network Approximation: A View from Polytope Decomposition

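【Quick read】: This paper addresses the fact that existing universal approximation theory overlooks the local regularity of the target function when analyzing neural network expressiveness: conventional constructions usually divide the input space uniformly into tiny hypercubes without accounting for how the function behaves in different regions. The key to the solution is a polytope decomposition view combined with an explicit kernel polynomial method, so that the approximation is characterized both by a refined Totik-Ditzian-type modulus of continuity and by an adaptive polytopal partition of the domain. On this basis, a ReLU network is constructed in each subdomain separately to approximate the kernel polynomial, giving an approximation that is more efficient and flexible and that behaves better near singular points. The approach also extends to analytic functions, where it attains a higher approximation rate.

Link: https://arxiv.org/abs/2601.18264
Authors: ZeYu Li, ShiJun Zhang, TieYong Zeng, FengLei Fan
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract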

Abstract:Universal approximation theory offers a foundational framework to verify neural network expressiveness, enabling principled utilization in real-world applications. However, most existing theoretical constructions are established by uniformly dividing the input space into tiny hypercubes without considering the local regularity of the target function. In this work, we investigate the universal approximation capabilities of ReLU networks from a view of polytope decomposition, which offers a more realistic and task-oriented approach compared to current methods. To achieve this, we develop an explicit kernel polynomial method to derive a universal approximation of continuous functions, which is characterized not only by the refined Totik-Ditzian-type modulus of continuity, but also by polytopical domain decomposition. Then, a ReLU network is constructed to approximate the kernel polynomial in each subdomain separately. Furthermore, we find that polytope decomposition makes our approximation more efficient and flexible than existing methods in many cases, especially near singular points of the objective function. Lastly, we extend our approach to analytic functions to reach a higher approximation rate.
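
The benefit of a non-uniform decomposition near a singular point can be seen in one dimension, where the polytopes are simply intervals. The sketch below is an illustration only and does not reproduce the paper's kernel polynomial construction: it builds a one-hidden-layer ReLU network that linearly interpolates `sqrt(x)` on [0, 1], once on uniform cells and once on cells graded toward the singularity at 0. The graded knot rule `(i/n)**2` is an assumption chosen for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_interpolant(f, knots):
    """One-hidden-layer ReLU network that linearly interpolates f at the knots."""
    knots = np.asarray(knots, dtype=float)
    values = f(knots)
    slopes = np.diff(values) / np.diff(knots)
    # Each hidden unit switches on at a knot; its output weight is the slope change there.
    weights = np.concatenate(([slopes[0]], np.diff(slopes)))
    biases = knots[:-1]

    def net(x):
        x = np.asarray(x, dtype=float)
        return values[0] + relu(x[:, None] - biases[None, :]) @ weights

    return net

def max_error(f, net, n=20001):
    xs = np.linspace(0.0, 1.0, n)
    return float(np.max(np.abs(f(xs) - net(xs))))

f = np.sqrt                                       # derivative blows up at x = 0
n_cells = 16
uniform = np.linspace(0.0, 1.0, n_cells + 1)
graded = np.linspace(0.0, 1.0, n_cells + 1) ** 2  # finer cells near the singular point

err_uniform = max_error(f, relu_interpolant(f, uniform))
err_graded = max_error(f, relu_interpolant(f, graded))
print(err_uniform, err_graded)
```

Both networks use the same number of hidden units; only the placement of the cells differs, and the graded partition gives the smaller worst-case error.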

[AI-37] Beyond Retention: Orchestrating Structural Safety and Plasticity in Continual Learning for LLMs

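【Quick read】: This paper addresses the stability-plasticity trade-off that large language models (LLMs) face in continual learning, i.e., how to learn new tasks effectively while retaining old knowledge. Existing methods such as Experience Replay (ER) mitigate catastrophic forgetting, but their effect differs sharply across capability domains: ER yields positive transfer on robust, unstructured tasks (such as NLP classification) yet causes severe negative transfer on fragile, structured tasks (such as code generation), which indicates that ER achieves broad consolidation at the expense of structural integrity. The authors therefore propose Orthogonal Subspace Wake-up (OSW). Its core idea is to identify the essential parameter subspaces of previous tasks through a brief "wake-up" phase and to apply orthogonal updates for new tasks, providing a mathematically grounded "safety guarantee" for established knowledge structures. Experiments show that OSW protects fragile code generation ability while maintaining high plasticity, addressing ER's failure on structure-sensitive tasks.

Link: https://arxiv.org/abs/2601.18255
Authors: Fei Meng
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract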

Abstract:Continual learning in Large Language Models (LLMs) faces the critical challenge of balancing stability (retaining old knowledge) and plasticity (learning new tasks). While Experience Replay (ER) is a standard countermeasure against catastrophic forgetting, its impact across diverse capabilities remains underexplored. In this work, we uncover a critical dichotomy in ER’s behavior: while it induces positive backward transfer on robust, unstructured tasks (e.g., boosting performance on previous NLP classification tasks through repeated rehearsal), it causes severe negative transfer on fragile, structured domains like code generation (e.g., a significant relative drop in coding accuracy). This reveals that ER trades structural integrity for broad consolidation. To address this dilemma, we propose Orthogonal Subspace Wake-up (OSW). OSW identifies essential parameter subspaces of previous tasks via a brief “wake-up” phase and enforces orthogonal updates for new tasks, providing a mathematically grounded “safety guarantee” for established knowledge structures. Empirical results across a diverse four-task sequence demonstrate that OSW uniquely succeeds in preserving fragile coding abilities where Replay fails, while simultaneously maintaining high plasticity for novel tasks. Our findings emphasize the necessity of evaluating structural safety alongside average retention in LLM continual learning.
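
The abstract does not spell out how OSW finds the subspace or enforces orthogonality, so the following is a generic gradient-projection sketch rather than the authors' procedure. It assumes the "wake-up" phase yields a matrix of old-task gradient samples, takes the top singular directions as the protected subspace, and removes that component from a new-task gradient. The function names and the synthetic gradients are hypothetical.

```python
import numpy as np

def protected_basis(old_task_grads, rank):
    """Orthonormal basis (as columns) of the top-`rank` directions of old-task gradients."""
    # Rows are gradient samples, so the right singular vectors live in parameter space.
    _, _, vt = np.linalg.svd(old_task_grads, full_matrices=False)
    return vt[:rank].T

def orthogonal_update(grad, basis):
    """Remove the component of `grad` that lies in the protected subspace."""
    return grad - basis @ (basis.T @ grad)

rng = np.random.default_rng(0)
dim = 32

# Stand-in for gradients gathered during a brief wake-up pass (synthetic, rank 3).
directions = rng.normal(size=(3, dim))
wake_up_grads = rng.normal(size=(20, 3)) @ directions

basis = protected_basis(wake_up_grads, rank=3)
new_task_grad = rng.normal(size=dim)
safe_grad = orthogonal_update(new_task_grad, basis)

# To first order, a step along safe_grad leaves the old-task directions untouched.
print(np.abs(wake_up_grads @ safe_grad).max())
```

The projection is what gives a first-order guarantee: any direction the old tasks were sensitive to receives no update from the new task.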

[AI-38] TAM-Eval: Evaluating LLMs for Automated Unit Test Maintenance

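【Quick read】: This paper addresses the insufficient support that large language models (LLMs) offer for unit test suite maintenance in software engineering: existing work is mostly limited to isolated test generation or assertion prediction and does not cover the full maintenance process of creating, repairing, and updating tests. The key to the solution is TAM-Eval (Test Automated Maintenance Evaluation), an evaluation framework and benchmark dataset that operates at the test-file level, supports system-agnostic evaluation while preserving full repository context, and uses a reference-free protocol that measures model performance by test pass rate, code coverage, and mutation testing results, so that it reflects real test maintenance workflows more faithfully.

Link: https://arxiv.org/abs/2601.18241
Authors: Elena Bruches, Vadim Alperovich, Dari Baturova, Roman Derunets, Daniil Grebenkin, Georgy Mkrtchyan, Oleg Sedukhin, Mikhail Klementev, Ivan Bondarenko, Nikolay Bushkov, Stanislav Moiseev
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted for publication at the 9th Workshop on Validation, Analysis and Evolution of Software Tests (VST 2026), co-located with the 33rd IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER 2026)

Click to view abstract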

Abstract:While Large Language Models (LLMs) have shown promise in software engineering, their application to unit testing remains largely confined to isolated test generation or oracle prediction, neglecting the broader challenge of test suite maintenance. We introduce TAM-Eval (Test Automated Maintenance Evaluation), a framework and benchmark designed to evaluate model performance across three core test maintenance scenarios: creation, repair, and updating of test suites. Unlike prior work limited to function-level tasks, TAM-Eval operates at the test file level, while maintaining access to full repository context during isolated evaluation, better reflecting real-world maintenance workflows. Our benchmark comprises 1,539 automatically extracted and validated scenarios from Python, Java, and Go projects. TAM-Eval supports system-agnostic evaluation of both raw LLMs and agentic workflows, using a reference-free protocol based on test suite pass rate, code coverage, and mutation testing. Empirical results indicate that state-of-the-art LLMs have limited capabilities in realistic test maintenance processes and yield only marginal improvements in test effectiveness. We release TAM-Eval as an open-source framework to support future research in automated software testing. Our data and code are publicly available at this https URL.

[AI-39] Rethinking Cross-Modal Fine-Tuning: Optimizing the Interaction between Feature Alignment and Target Fitting AISTATS 2026

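【Quick read】: This paper addresses how, when adapting a pre-trained model to unseen feature modalities, the representation of the new modality can be aligned with the relevant part of the pre-trained model's representation space to enable accurate knowledge transfer. The core challenge is the joint optimization of feature alignment and target fine-tuning: if the two are not properly calibrated, the mismatch between the source and target feature-label structures can grow and target generalization can degrade. The key to the solution is a theory-driven framework that, through the novel concept of feature-label distortion, establishes a provable generalization bound on the target error, reveals the interaction between feature alignment and target fitting, and offers actionable guidance for practical algorithm design. The resulting method significantly outperforms state-of-the-art methods on multiple benchmark datasets.

Link: https://arxiv.org/abs/2601.18231
Authors: Trong Khiem Tran, Manh Cuong Dao, Phi Le Nguyen, Thao Nguyen Truong, Trong Nghia Hoang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at AISTATS 2026. Preprint version

Click to view abstract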

Abstract:Adapting pre-trained models to unseen feature modalities has become increasingly important due to the growing need for cross-disciplinary knowledge integration. A key challenge here is how to align the representation of new modalities with the most relevant parts of the pre-trained model’s representation space to enable accurate knowledge transfer. This requires combining feature alignment with target fine-tuning, but uncalibrated combinations can exacerbate misalignment between the source and target feature-label structures and reduce target generalization. Existing work, however, lacks a theoretical understanding of this critical interaction between feature alignment and target fitting. To bridge this gap, we develop a principled framework that establishes a provable generalization bound on the target error, which explains the interaction between feature alignment and target fitting through a novel concept of feature-label distortion. This bound offers actionable insights into how this interaction should be optimized for practical algorithm design. The resulting approach achieves significantly improved performance over state-of-the-art methods across a wide range of benchmark datasets.

[AI-40] Yunjue Agent Tech Report: A Fully Reproducible Zero-Start In-Situ Self-Evolving Agent System for Open-Ended Tasks

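【Quick read】: This paper addresses the rigid capabilities of conventional agent systems in open-ended environments, where task distributions drift continuously and external supervision is lacking; the core difficulty is that static toolsets or offline training cannot adapt to these dynamics. The key to the solution is the In-Situ Self-Evolving paradigm, which treats sequential task interactions as a stream of experience and distills short-term execution feedback into long-term, reusable capabilities. Tool evolution is identified as the critical pathway for capability expansion because it provides verifiable, binary feedback signals. Within this framework the authors build Yunjue Agent, a system that iteratively synthesizes, optimizes, and reuses tools to handle emerging challenges.

Link: https://arxiv.org/abs/2601.18226
Authors: Haotian Li, Shijun Yang, Weizhen Qi, Silei Zhao, Rui Hua, Mingzhu Song, Xiaojian Yang, Chao Peng
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract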

Abstract:Conventional agent systems often struggle in open-ended environments where task distributions continuously drift and external supervision is scarce. Their reliance on static toolsets or offline training lags behind these dynamics, leaving the system’s capability boundaries rigid and unknown. To address this, we propose the In-Situ Self-Evolving paradigm. This approach treats sequential task interactions as a continuous stream of experience, enabling the system to distill short-term execution feedback into long-term, reusable capabilities without access to ground-truth labels. Within this framework, we identify tool evolution as the critical pathway for capability expansion, which provides verifiable, binary feedback signals. Within this framework, we develop Yunjue Agent, a system that iteratively synthesizes, optimizes, and reuses tools to navigate emerging challenges. To optimize evolutionary efficiency, we further introduce a Parallel Batch Evolution strategy. Empirical evaluations across five diverse benchmarks under a zero-start setting demonstrate significant performance gains over proprietary baselines. Additionally, complementary warm-start evaluations confirm that the accumulated general knowledge can be seamlessly transferred to novel domains. Finally, we propose a novel metric to monitor evolution convergence, serving as a function analogous to training loss in conventional optimization. We open-source our codebase, system traces, and evolved tools to facilitate future research in resilient, self-evolving intelligence.

[AI-41] ShopSimulator: Evaluating and Exploring RL-Driven LLM Agent for Shopping Assistants

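【Quick read】: This paper addresses the multi-faceted challenges that large language model (LLM) agents face when deployed in e-commerce shopping scenarios, including insufficient understanding of users' personal preferences, weak multi-turn dialogue ability, and difficulty selecting precisely among highly similar products. Existing research lacks a unified and challenging simulation environment for systematically evaluating and training LLM agents, and most work focuses only on evaluation benchmarks without training support. The authors therefore propose ShopSimulator, a large-scale, challenging Chinese shopping simulation environment for comprehensively evaluating LLM agents across diverse scenarios. Their systematic training exploration finds that combining supervised fine-tuning (SFT) with reinforcement learning (RL) significantly improves performance on key tasks such as deep search, balancing personalization cues, and user interaction.

Link: https://arxiv.org/abs/2601.18225
Authors: Pei Wang, Yanan Wu, Xiaoshuai Song, Weixun Wang, Gengru Chen, Zhongwen Li, Kezhong Yan, Ken Deng, Qi Liu, Shuaibing Zhao, Shaopan Xiong, Xuepeng Liu, Xuefeng Chen, Wanxi Deng, Wenbo Su, Bo Zheng
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract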

Abstract:Large language model (LLM)-based agents are increasingly deployed in e-commerce shopping. To perform thorough, user-tailored product searches, agents should interpret personal preferences, engage in multi-turn dialogues, and ultimately retrieve and discriminate among highly similar products. However, existing research has yet to provide a unified simulation environment that consistently captures all of these aspects, and always focuses solely on evaluation benchmarks without training support. In this paper, we introduce ShopSimulator, a large-scale and challenging Chinese shopping environment. Leveraging ShopSimulator, we evaluate LLMs across diverse scenarios, finding that even the best-performing models achieve less than 40% full-success rate. Error analysis reveals that agents struggle with deep search and product selection in long trajectories, fail to balance the use of personalization cues, and to effectively engage with users. Further training exploration provides practical guidance for overcoming these weaknesses, with the combination of supervised fine-tuning (SFT) and reinforcement learning (RL) yielding significant performance improvements. Code and data will be released at this https URL.

[AI-42] Paying Less Generalization Tax: A Cross-Domain Generalization Study of RL Training for LLM Agents

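【Quick read】: This paper addresses the drop in generalization that generalist LLM agents suffer when they are post-trained on specific environments and then deployed to unseen test domains. The core challenge is how to design training strategies that improve an agent's cross-domain robustness when the test domains are unknown. The key to the solution is to identify the two environment properties that most affect cross-domain generalization, state information richness and planning complexity, and to propose a low-overhead state augmentation method: adding a small number of goal-irrelevant distractor features to the state increases its information richness and markedly improves cross-domain performance. The study also finds that enabling step-by-step thinking during reinforcement learning (RL) training helps preserve generalization, whereas SFT warmup or mid-training mitigates catastrophic forgetting but harms generalization to domains not included in the training data mix.

Link: https://arxiv.org/abs/2601.18217
Authors: Zhihan Liu, Lin Guan, Yixin Nie, Kai Zhang, Zhuoqun Hao, Lin Chen, Asli Celikyilmaz, Zhaoran Wang, Na Zhang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract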

Abstract:Generalist LLM agents are often post-trained on a narrow set of environments but deployed across far broader, unseen domains. In this work, we investigate the challenge of agentic post-training when the eventual test domains are unknown. Specifically, we analyze which properties of reinforcement learning (RL) environments and modeling choices have the greatest influence on out-of-domain performance. First, we identify two environment axes that strongly correlate with cross-domain generalization: (i) state information richness, i.e., the amount of information for the agent to process from the state, and (ii) planning complexity, estimated via goal reachability and trajectory length under a base policy. Notably, domain realism and text-level similarity are not the primary factors; for instance, the simple grid-world domain Sokoban leads to even stronger generalization in SciWorld than the more realistic ALFWorld. Motivated by these findings, we further show that increasing state information richness alone can already effectively improve cross-domain robustness. We propose a randomization technique, which is low-overhead and broadly applicable: add small amounts of distractive goal-irrelevant features to the state to make it richer without altering the task. Beyond environment-side properties, we also examine several modeling choices: (a) SFT warmup or mid-training helps prevent catastrophic forgetting during RL but undermines generalization to domains that are not included in the mid-training datamix; and (b) turning on step-by-step thinking during RL, while not always improving in-domain performance, plays a crucial role in preserving generalization.
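
The randomization technique described above (adding goal-irrelevant distractor features to the state) could look roughly like the sketch below for a text-based environment. The abstract does not say how distractors are produced, so the fixed sentence pool, the function name `add_distractors`, and the example state are all hypothetical.

```python
import random

# Hypothetical pool of sentences that never affect the task goal.
DISTRACTORS = [
    "A clock on the wall shows 3:15.",
    "There is a faint smell of paint in the air.",
    "A blue poster hangs near the door.",
    "Rain is tapping against the window.",
]

def add_distractors(state_text, rng, k=2):
    """Append k randomly chosen goal-irrelevant sentences to a textual state."""
    extras = rng.sample(DISTRACTORS, k)
    return state_text + " " + " ".join(extras)

rng = random.Random(0)
state = "You are in the kitchen. A mug is on the table. Goal: put the mug in the sink."
augmented = add_distractors(state, rng, k=2)
print(augmented)
```

The original state is kept verbatim, so the task and its reward are unchanged; only the amount of information the agent must filter grows.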

[AI-43] SAGE: Steerable Agentic Data Generation for Deep Search with Execution Feedback

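【Quick read】: This paper addresses the performance limits that deep search agents face during training because high-quality, difficulty-controlled question-answer data is scarce. Since manually annotating such data is expensive and involves complex exploration trajectories with multi-document reasoning, the authors propose SAGE, an automated generation pipeline built around an interactive, iterative system with a data generator and a search agent: the data generator proposes candidate QA pairs, the search agent attempts to solve them and returns execution feedback, and the two refine the questions and answers over multiple rounds until the target difficulty level is met. The key innovation is using feedback between agents as a self-supervised signal to synthesize high-quality data of controllable difficulty automatically, which significantly improves the accuracy and generalization of deep search models.

Link: https://arxiv.org/abs/2601.18202
Authors: Fangyuan Xu, Rujun Han, Yanfei Chen, Zifeng Wang, I-Hung Hsu, Jun Yan, Vishy Tirumalashetty, Eunsol Choi, Tomas Pfister, Chen-Yu Lee
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract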

Abstract:Deep search agents, which aim to answer complex questions requiring reasoning across multiple documents, can significantly speed up the information-seeking process. Collecting human annotations for this application is prohibitively expensive due to long and complex exploration trajectories. We propose an agentic pipeline that automatically generates high-quality, difficulty-controlled deep search question-answer pairs for a given corpus and a target difficulty level. Our pipeline, SAGE, consists of a data generator which proposes QA pairs and a search agent which attempts to solve the generated question and provide execution feedback for the data generator. The two components interact over multiple rounds to iteratively refine the question-answer pairs until they satisfy the target difficulty level. Our intrinsic evaluation shows SAGE generates questions that require diverse reasoning strategies, while significantly increasing the correctness and difficulty of the generated data. Our extrinsic evaluation demonstrates up to 23% relative performance gain on popular deep search benchmarks by training deep search agents with our synthetic data. Additional experiments show that agents trained on our data can adapt from fixed-corpus retrieval to Google Search at inference time, without further training.

[AI-44] HeterCSI: Channel-Adaptive Heterogeneous CSI Pretraining Framework for Generalized Wireless Foundation Models

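【Quick read】: This paper addresses the dual heterogeneity that wireless foundation models face when processing channel state information (CSI): CSI varies along both the scale and the scenario dimensions, which makes it hard for existing pretraining methods to generalize and scale. Current methods either restrict inputs to fixed dimensions or isolate training by scale, which weakens cross-scenario adaptability. The key to the solution is the HeterCSI framework, whose central insight is a new understanding of gradient dynamics in heterogeneous CSI pretraining: scale heterogeneity mainly causes destructive gradient interference, whereas scenario diversity promotes constructive gradient alignment when properly managed. Accordingly, the authors formulate heterogeneous CSI batch construction as a partitioning optimization problem that minimizes zero-padding overhead while preserving scenario diversity, and they design a scale-aware adaptive batching strategy together with a double-masking mechanism that separates valid signals from padding artifacts, which significantly improves training efficiency and cross-scenario generalization.

Link: https://arxiv.org/abs/2601.18200
Authors: Chenyu Zhang, Xinchen Lyu, Chenshan Ren, Shuhan Liu, Qimei Cui, Xiaofeng Tao
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 8 figures

Click to view abstract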

Abstract:Wireless foundation models promise transformative capabilities for channel state information (CSI) processing across diverse 6G network applications, yet face fundamental challenges due to the inherent dual heterogeneity of CSI across both scale and scenario dimensions. However, current pretraining approaches either constrain inputs to fixed dimensions or isolate training by scale, limiting the generalization and scalability of wireless foundation models. In this paper, we propose HeterCSI, a channel-adaptive pretraining framework that reconciles training efficiency with robust cross-scenario generalization via a new understanding of gradient dynamics in heterogeneous CSI pretraining. Our key insight reveals that CSI scale heterogeneity primarily causes destructive gradient interference, while scenario diversity actually promotes constructive gradient alignment when properly managed. Specifically, we formulate heterogeneous CSI batch construction as a partitioning optimization problem that minimizes zero-padding overhead while preserving scenario diversity. To solve this, we develop a scale-aware adaptive batching strategy that aligns CSI samples of similar scales, and design a double-masking mechanism to isolate valid signals from padding artifacts. Extensive experiments on 12 datasets demonstrate that HeterCSI establishes a generalized foundation model without scenario-specific finetuning, achieving superior average performance over full-shot baselines. Compared to the state-of-the-art zero-shot benchmark WiFo, it reduces NMSE by 7.19 dB, 4.08 dB, and 5.27 dB for CSI reconstruction, time-domain, and frequency-domain prediction, respectively. The proposed HeterCSI framework also reduces training latency by 53% compared to existing approaches while improving generalization performance by 1.53 dB on average.
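
The link between batching by scale and zero-padding overhead can be made concrete with a simplified sketch. It is not the HeterCSI algorithm: samples are reduced to 1-D arrays of different lengths, batching is a plain sort by size, scenario diversity is ignored, and only a single validity mask is built rather than the paper's double-masking mechanism. All function names are hypothetical.

```python
import numpy as np

def padding_overhead(sizes, batches):
    """Fraction of zero-padded entries when each batch is padded to its largest sample."""
    padded = sum(len(b) * max(sizes[i] for i in b) for b in batches)
    valid = sum(sizes[i] for b in batches for i in b)
    return 1.0 - valid / padded

def scale_aware_batches(sizes, batch_size):
    """Group samples of similar size by sorting sample indices by size."""
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def pad_with_mask(samples):
    """Zero-pad 1-D samples to a common length; the mask is True on real entries."""
    width = max(len(s) for s in samples)
    batch = np.zeros((len(samples), width))
    mask = np.zeros((len(samples), width), dtype=bool)
    for row, s in enumerate(samples):
        batch[row, :len(s)] = s
        mask[row, :len(s)] = True
    return batch, mask

sizes = [8, 64, 16, 128, 8, 64, 16, 128]  # stand-in CSI sample sizes
naive = [[0, 1, 2, 3], [4, 5, 6, 7]]      # arrival order, mixed scales
grouped = scale_aware_batches(sizes, batch_size=4)

overhead_naive = padding_overhead(sizes, naive)
overhead_grouped = padding_overhead(sizes, grouped)
print(overhead_naive, overhead_grouped)
```

A loss computed only where the mask is True keeps the padded zeros from contributing gradients, which is the role the abstract assigns to masking.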

[AI-45] GAIA: A Data Flywheel System for Training GUI Test-Time Scaling Critic Models

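【Quick read】: This paper addresses the catastrophic deviations that arise when GUI agents driven by large vision-language models (LVLMs) execute tasks with irreversible operations, where a single erroneous action can have serious consequences. The key to the solution is the GUI Action Critic's Data Flywheel System (GAIA). At its core is an Intuitive Critic Model (ICM) trained on positive and negative action samples from an initial agent; the ICM judges the immediate correctness of each action and thereby selects operations with a higher probability of success. The ICM then guides the agent to collect higher-quality positive and negative samples, forming a closed data loop that is used to train a second-round critic and steadily improve Test-Time Scaling (TTS) performance.

Link: https://arxiv.org/abs/2601.18197
Authors: Shaokang Wang, Pei Fu, Ruoceng Zhang, Shaojie Zhang, Xiuwen Xi, Jiahui Yang, Bin Qin, Ying Huang, Zhenbo Luo, Jian Luan
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract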

Abstract:While Large Vision-Language Models (LVLMs) have significantly advanced GUI agents’ capabilities in parsing textual instructions, interpreting screen content, and executing tasks, a critical challenge persists: the irreversibility of agent operations, where a single erroneous action can trigger catastrophic deviations. To address this, we propose the GUI Action Critic’s Data Flywheel System (GAIA), a training framework that enables the models to have iterative critic capabilities, which are used to improve the Test-Time Scaling (TTS) of basic GUI agents’ performance. Specifically, we train an Intuitive Critic Model (ICM) using positive and negative action examples from a base agent first. This critic evaluates the immediate correctness of the agent’s intended actions, thereby selecting operations with higher success probability. Then, the initial critic guides agent actions to collect refined positive/negative samples, initiating the self-improving cycle. The augmented data then trains a second-round critic with enhanced discernment capability. We conduct experiments on various datasets and demonstrate that the proposed ICM can improve the test-time performance of various closed-source and open-source models, and the performance can be gradually improved as the data is recycled. The code and dataset will be publicly released.
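
The two roles of the critic described above (picking among candidate actions at test time, and steering the collection of new labeled samples) can be sketched schematically. This is a toy under stated assumptions, not GAIA's implementation: the critic is a lookup table, actions are strings, and `toy_outcome` stands in for whatever check labels an action as correct.

```python
def select_action(candidates, critic):
    """Best-of-N selection: keep the candidate the critic scores highest."""
    return max(candidates, key=critic)

def flywheel_round(steps, critic, is_correct):
    """Collect (action, label) pairs from critic-guided choices for the next critic."""
    samples = []
    for candidates in steps:
        action = select_action(candidates, critic)
        samples.append((action, is_correct(action)))
    return samples

# Hypothetical first-round critic: estimated success probabilities per action.
SCORES = {"type(email)": 0.80, "click(Submit)": 0.35, "click(Cancel)": 0.05,
          "scroll(down)": 0.55, "click(Back)": 0.60}

def toy_critic(action):
    return SCORES.get(action, 0.0)

def toy_outcome(action):
    # Hypothetical correctness check; click(Back) is a case the critic overrates.
    return action in {"type(email)", "scroll(down)"}

steps = [["click(Submit)", "click(Cancel)", "type(email)"],
         ["scroll(down)", "click(Back)"]]
samples = flywheel_round(steps, toy_critic, toy_outcome)
print(samples)
```

The second step yields a negative sample exactly where the first critic was wrong, which is the kind of refined data a second-round critic would be trained on.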

[AI-46] VIBEVOICE-ASR Technical Report

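【Quick read】: This paper addresses the difficulty of understanding speech in long-form audio (such as meetings and podcasts), where context fragmentation and multi-speaker complexity remain unresolved despite recent progress in short-form speech recognition. The key to the solution is the VibeVoice-ASR framework, which uses end-to-end generative modeling to unify Automatic Speech Recognition (ASR), Speaker Diarization, and timestamping into a single task and supports single-pass processing of up to 60 minutes of audio. It also introduces a prompt-based context injection mechanism that lets users supply customized context, which significantly improves accuracy on domain-specific terminology and polyphonic character disambiguation.

Link: https://arxiv.org/abs/2601.18184
Authors: Zhiliang Peng, Jianwei Yu, Yaoyao Chang, Zilong Wang, Li Dong, Yingbo Hao, Yujie Tu, Chenyu Yang, Wenhui Wang, Songchen Xu, Yutao Sun, Hangbo Bao, Weijiang Xu, Yi Zhu, Zehua Wang, Ting Song, Yan Xia, Zewen Chi, Shaohan Huang, Liang Wang, Chuang Ding, Shuai Wang, Xie Chen, Furu Wei
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract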

Abstract:This report presents VibeVoice-ASR, a general-purpose speech understanding framework built upon VibeVoice, designed to address the persistent challenges of context fragmentation and multi-speaker complexity in long-form audio (e.g., meetings, podcasts) that remain despite recent advancements in short-form speech recognition. Unlike traditional pipelined approaches that rely on audio chunking, VibeVoice-ASR supports single-pass processing for up to 60 minutes of audio. It unifies Automatic Speech Recognition, Speaker Diarization, and Timestamping into a single end-to-end generation task. In addition, VibeVoice-ASR supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. Furthermore, we introduce a prompt-based context injection mechanism that allows users to supply customized context, significantly improving accuracy on domain-specific terminology and polyphonic character disambiguation.

[AI-47] Success Conditioning as Policy Improvement: The Optimization Problem Solved by Imitating Success

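【Quick read】: This paper tries to resolve the unclear theoretical basis of success conditioning, a widely used strategy: it has not been clear what objective, if any, it optimizes. The key to the solution is a proof that success conditioning is equivalent to solving a trust-region optimization problem in which policy improvement is maximized subject to a χ² divergence constraint whose radius is determined automatically by the data. The theory yields an exact identity: at every state, the relative policy improvement, the magnitude of policy change, and a quantity the authors call action-influence (which measures how randomness in action choices affects the probability of success) are exactly equal. Success conditioning is therefore a conservative policy improvement operator that cannot degrade performance or induce dangerous distribution shift; when it fails, it does so observably, by hardly changing the policy at all.

Link: https://arxiv.org/abs/2601.18175
Authors: Daniel Russo
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
Comments:

Click to view abstract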

Abstract:A widely used technique for improving policies is success conditioning, in which one collects trajectories, identifies those that achieve a desired outcome, and updates the policy to imitate the actions taken along successful trajectories. This principle appears under many names (rejection sampling with SFT, goal-conditioned RL, Decision Transformers), yet what optimization problem it solves, if any, has remained unclear. We prove that success conditioning exactly solves a trust-region optimization problem, maximizing policy improvement subject to a χ² divergence constraint whose radius is determined automatically by the data. This yields an identity: relative policy improvement, the magnitude of policy change, and a quantity we call action-influence (measuring how random variation in action choices affects success rates) are exactly equal at every state. Success conditioning thus emerges as a conservative improvement operator. Exact success conditioning cannot degrade performance or induce dangerous distribution shift, but when it fails, it does so observably, by hardly changing the policy at all. We apply our theory to the common practice of return thresholding, showing this can amplify improvement, but at the cost of potential misalignment with the true objective.
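
The three-way identity can be checked numerically in the simplest setting, a single state with three actions. The sketch assumes that conditioning on success reweights the policy by Bayes' rule and, as a guess at the paper's definition, takes action-influence to be the variance of the per-action success probability under the policy divided by the squared success rate. The numbers are made up.

```python
import numpy as np

pi = np.array([0.5, 0.3, 0.2])  # behaviour policy over three actions
q = np.array([0.2, 0.5, 0.9])   # P(success | action), invented for the example

J = float(pi @ q)               # success rate under pi
pi_plus = pi * q / J            # policy conditioned on success (Bayes' rule)
J_plus = float(pi_plus @ q)     # success rate after imitating successes

relative_improvement = (J_plus - J) / J
chi2 = float(np.sum((pi_plus - pi) ** 2 / pi))  # chi-squared divergence of pi_plus from pi
# Assumed definition: normalized variance of success probability across actions.
action_influence = float(pi @ (q - J) ** 2) / J ** 2

print(relative_improvement, chi2, action_influence)
```

In this one-state case all three quantities reduce algebraically to Var(q)/J², so they coincide; when every action has the same success probability the variance is zero and the policy does not move, which matches the "fails observably" behaviour described in the abstract.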

[AI-48] Beyond Pairwise Comparisons: A Distributional Test of Distinctiveness for Machine-Generated Works in Intellectual Property Law

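【Quick read】: This paper addresses a question shared by core legal doctrines for creative works (novelty in patent law, originality in copyright, and distinctiveness in trademark): whether a body of work is meaningfully distinct from a reference class. Conventional analyses usually rely on pairwise comparisons among individual exemplars (item-level evidence), but this item-level approach is limited for machine-generated works, whose output space is effectively unbounded and cannot be characterized by a finite sample of items. The paper therefore proposes a distributional alternative: a two-sample test based on Maximum Mean Discrepancy (MMD) computed on semantic embeddings, which determines whether two creative processes (human or machine) produce statistically distinguishable output distributions. The key innovations are that the method assesses the difference between generative processes without task-specific training and without access to proprietary training data, and that it is sample-efficient (differences are often detected with as few as 5-10 images or 7-20 texts). The results suggest that generative AI does not simply reproduce its training data but acts as an interpolator in semantic space, producing outputs that are semantically human-like yet stochastically distinct.

Link: https://arxiv.org/abs/2601.18156
Authors: Anirban Mukherjee, Hannah Hanwen Chang
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract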

Abstract:Key doctrines, including novelty (patent), originality (copyright), and distinctiveness (trademark), turn on a shared empirical question: whether a body of work is meaningfully distinct from a relevant reference class. Yet analyses typically operationalize this set-level inquiry using item-level evidence: pairwise comparisons among exemplars. That unit-of-analysis mismatch may be manageable for finite corpora of human-created works, where it can be bridged by ad hoc aggregations. But it becomes acute for machine-generated works, where the object of evaluation is not a fixed set of works but a generative process with an effectively unbounded output space. We propose a distributional alternative: a two-sample test based on maximum mean discrepancy computed on semantic embeddings to determine if two creative processes (whether human or machine) produce statistically distinguishable output distributions. The test requires no task-specific training (obviating the need for discovery of proprietary training data to characterize the generative process) and is sample-efficient, often detecting differences with as few as 5-10 images and 7-20 texts. We validate the framework across three domains: handwritten digits (controlled images), patent abstracts (text), and AI-generated art (real-world images). We reveal a perceptual paradox: even when human evaluators distinguish AI outputs from human-created art with only about 58% accuracy, our method detects distributional distinctiveness. Our results present evidence contrary to the view that generative models act as mere regurgitators of training data. Rather than producing outputs statistically indistinguishable from a human baseline, as simple regurgitation would predict, they produce outputs that are semantically human-like yet stochastically distinct, suggesting their dominant function is as a semantic interpolator within a learned latent space.
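
A two-sample test of this kind can be sketched with a kernel MMD statistic and a permutation test. The choices below are assumptions made for illustration and are not taken from the paper: an RBF kernel, a fixed bandwidth heuristic, the biased MMD estimator, and synthetic Gaussian vectors standing in for semantic embeddings of human and machine outputs.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    sq = np.sum(a ** 2, axis=1)[:, None] + np.sum(b ** 2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * bandwidth ** 2))

def mmd2(x, y, bandwidth):
    """Biased estimate of the squared maximum mean discrepancy between two samples."""
    return float(rbf_kernel(x, x, bandwidth).mean()
                 + rbf_kernel(y, y, bandwidth).mean()
                 - 2 * rbf_kernel(x, y, bandwidth).mean())

def permutation_p_value(x, y, bandwidth, n_perm=500, seed=0):
    """P-value for H0 'x and y come from the same distribution' by label shuffling."""
    rng = np.random.default_rng(seed)
    observed = mmd2(x, y, bandwidth)
    pooled = np.vstack([x, y])
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        px, py = pooled[idx[:len(x)]], pooled[idx[len(x):]]
        count += int(mmd2(px, py, bandwidth) >= observed)
    return (count + 1) / (n_perm + 1)

rng = np.random.default_rng(1)
dim = 16
human = rng.normal(0.0, 1.0, size=(20, dim))    # stand-in embeddings, process A
machine = rng.normal(1.0, 1.0, size=(20, dim))  # process B, shifted mean
same = rng.normal(0.0, 1.0, size=(20, dim))     # second draw from process A

bandwidth = float(np.sqrt(dim))                 # heuristic choice, an assumption
p_diff = permutation_p_value(human, machine, bandwidth)
p_same = permutation_p_value(human, same, bandwidth)
print(p_diff, p_same)
```

The test compares whole samples rather than pairs of items, so it needs only embeddings of outputs from each process and no access to how either process was trained.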

[AI-49] Explaining Synergistic Effects in Social Recommendations

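【Quick read】: This paper addresses the nonlinearity and opacity of synergistic effects across multiple social networks in social recommender systems, which prevent users from understanding how diverse information is combined during recommendation and thus reduce explainability. Existing explainers can only identify the topological information that significantly influences a recommendation; they cannot explain the synergy among those pieces of information. The key to the solution is the SemExplainer framework, which quantifies graph information gain to identify subgraph structures that embody synergistic effects and uses a conditional entropy optimization strategy to maximize information gain, thereby selecting synergistic subgraphs from among the explanatory subgraphs. Finally, it searches the synergistic subgraphs for paths from users to recommended items to generate explanations for the recommendations.

Link: https://arxiv.org/abs/2601.18151
Authors: Yicong Li, Shan Jin, Qi Liu, Shuo Wang, Jiaying Liu, Shuo Yu, Qiang Zhang, Kuanjiu Zhou, Feng Xia
Affiliation: unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract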

Abstract:In social recommenders, the inherent nonlinearity and opacity of synergistic effects across multiple social networks hinders users from understanding how diverse information is leveraged for recommendations, consequently diminishing explainability. However, existing explainers can only identify the topological information in social networks that significantly influences recommendations, failing to further explain the synergistic effects among this information. Inspired by existing findings that synergistic effects enhance mutual information between inputs and predictions to generate information gain, we extend this discovery to graph data. We quantify graph information gain to identify subgraphs embodying synergistic effects. Based on the theoretical insights, we propose SemExplainer, which explains synergistic effects by identifying subgraphs that embody them. SemExplainer first extracts explanatory subgraphs from multi-view social networks to generate preliminary importance explanations for recommendations. A conditional entropy optimization strategy to maximize information gain is developed, thereby further identifying subgraphs that embody synergistic effects from explanatory subgraphs. Finally, SemExplainer searches for paths from users to recommended items within the synergistic subgraphs to generate explanations for the recommendations. Extensive experiments on three datasets demonstrate the superiority of SemExplainer over baseline methods, providing superior explanations of synergistic effects.
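
The idea that synergy shows up as extra mutual information between inputs and predictions can be illustrated on a toy case unrelated to graphs. In the sketch below, two binary inputs (imagine signals from two social views) each carry no information about an XOR output on their own, while together they determine it. This is only an illustration of information gain; it is not SemExplainer's graph-based measure, and all names are hypothetical.

```python
from itertools import product

import numpy as np

def entropy(probs):
    p = np.asarray([v for v in probs if v > 0])
    return float(-(p * np.log2(p)).sum())

def mutual_information(pairs):
    """I(X;Y) in bits for a list of equally likely (x, y) outcomes."""
    n = len(pairs)

    def dist(values):
        vals = list(values)
        return [vals.count(v) / n for v in set(vals)]

    h_x = entropy(dist(x for x, _ in pairs))
    h_y = entropy(dist(y for _, y in pairs))
    h_xy = entropy(dist(pairs))
    return h_x + h_y - h_xy

# Two binary inputs a and b; the prediction depends on their combination.
rows = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)]

info_a = mutual_information([(a, y) for a, b, y in rows])
info_b = mutual_information([(b, y) for a, b, y in rows])
info_joint = mutual_information([((a, b), y) for a, b, y in rows])

synergy_gain = info_joint - info_a - info_b
print(info_a, info_b, info_joint, synergy_gain)
```

An explainer that scores each input separately would rate both as irrelevant here, which is the gap the abstract attributes to existing methods.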

[AI-50] RareAlert: Aligning heterogeneous large language model reasoning for early rare disease risk screening

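【Quick read】: This paper addresses missed and delayed diagnosis of rare diseases at the initial consultation, where limited information and high uncertainty make it hard to recognize high-risk patients during primary care encounters and targeted diagnostic testing is consequently delayed. The key to the solution is RareAlert, an early screening system that integrates medical reasoning signals generated by ten large language models (LLMs), calibrates and weights these signals with machine learning, and distills the aligned reasoning into a lightweight model that can be deployed locally. The approach reframes rare disease identification as a universal uncertainty-resolution process applied to the whole patient population. Evaluated on the real-world dataset RareBench, RareAlert (based on Qwen3-4B) reaches an AUC of 0.917 on an independent test set, outperforming the best machine learning ensemble and several mainstream LLMs (including GPT-5 and Claude-3.7-Sonnet), which demonstrates the effectiveness of fusing calibrated reasoning from multiple sources in highly uncertain clinical tasks.

Link: https://arxiv.org/abs/2601.18132
Authors: Xi Chen, Hongru Zhou, Huahui Yi, Shiyu Feng, Hanyu Zhou, Tiancheng He, Mingke You, Li Wang, Qiankun Li, Kun Wang, Weili Fu, Kang Li, Jian Li
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 28 pages, 3 figures

Click to view abstract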

Abstract:Missed and delayed diagnosis remains a major challenge in rare disease care. At the initial clinical encounters, physicians assess rare disease risk using only limited information under high uncertainty. When high-risk patients are not recognised at this stage, targeted diagnostic testing is often not initiated, resulting in missed diagnosis. Existing primary care triage processes are structurally insufficient to reliably identify patients with rare diseases at initial clinical presentation and universal screening is needed to reduce diagnostic delay. Here we present RareAlert, an early screening system which predicts patient-level rare disease risk from routinely available primary-visit information. RareAlert integrates reasoning generated by ten LLMs, calibrates and weights these signals using machine learning, and distils the aligned reasoning into a single locally deployable model. To develop and evaluate RareAlert, we curated RareBench, a real-world dataset of 158,666 cases covering 33 Orphanet disease categories and more than 7,000 rare conditions, including both rare and non-rare presentations. The results showed that rare disease identification can be reconceptualised as a universal uncertainty resolution process applied to the general patient population. On an independent test set, RareAlert, a Qwen3-4B based model trained with calibrated reasoning signals, achieved an AUC of 0.917, outperforming the best machine learning ensemble and all evaluated LLMs, including GPT-5, DeepSeek-R1, Claude-3.7-Sonnet, o3-mini, Gemini-2.5-Pro, and Qwen3-235B. These findings demonstrate the diversity in LLM medical reasoning and the effectiveness of aligning such reasoning in highly uncertain clinical tasks. By incorporating calibrated reasoning into a single model, RareAlert enables accurate, privacy-preserving, and scalable rare disease risk screening suitable for large-scale local deployment.

[AI-51] RouteMoA: Dynamic Routing without Pre-Inference Boosts Efficient Mixture-of-Agents

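[Quick read]: This paper addresses the high computational cost and latency caused by the dense topology of Mixture-of-Agents (MoA) architectures, as well as the limitations of existing filtering methods based on large language model (LLM) judges, which do not effectively reduce inference overhead, lack model selection criteria, and scale poorly to large model pools. The key to the solution is the RouteMoA framework, which filters efficiently through a three-stage dynamic routing mechanism. First, a lightweight scorer predicts coarse-grained performance from the query, narrowing the candidate set without any actual inference. Second, a mixture of judges performs lightweight self- and cross-assessment on existing model outputs, providing posterior correction without additional inference. Finally, a model ranking mechanism trades off performance, cost, and latency to select the best models. The approach maintains performance while substantially reducing resource consumption; experiments show that it cuts cost by 89.8% and latency by 63.6% on a large-scale model pool.

Link: https://arxiv.org/abs/2601.18130
Authors: Jize Wang, Han Wu, Zhiyuan You, Yiming Song, Yijun Wang, Zifei Shan, Yining Li, Songyang Zhang, Xinyi Le, Cailian Chen, Xinping Guan, Dacheng Tao
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: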

Abstract:Mixture-of-Agents (MoA) improves LLM performance through layered collaboration, but its dense topology raises costs and latency. Existing methods employ LLM judges to filter responses, yet still require all models to perform inference before judging, failing to cut costs effectively. They also lack model selection criteria and struggle with large model pools, where full inference is costly and can exceed context limits. To address this, we propose RouteMoA, an efficient mixture-of-agents framework with dynamic routing. It employs a lightweight scorer to perform initial screening by predicting coarse-grained performance from the query, narrowing candidates to a high-potential subset without inference. A mixture of judges then refines these scores through lightweight self- and cross-assessment based on existing model outputs, providing posterior correction without additional inference. Finally, a model ranking mechanism selects models by balancing performance, cost, and latency. RouteMoA outperforms MoA across varying tasks and model pool sizes, reducing cost by 89.8% and latency by 63.6% in the large-scale model pool.
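To make the final ranking step concrete, here is a minimal Python sketch of selecting models by trading off predicted performance, cost, and latency. It is an illustration under stated assumptions, not the RouteMoA implementation: the `Candidate` fields, the linear utility, the weights, and the model names are all hypothetical, and the abstract does not specify the actual scoring formula.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    predicted_score: float  # hypothetical scorer output in [0, 1]
    cost: float             # hypothetical normalised cost in [0, 1]
    latency: float          # hypothetical normalised latency in [0, 1]


def rank_models(candidates, k, cost_weight=0.3, latency_weight=0.2):
    """Return the top-k candidates by a simple linear utility (an assumption)."""
    def utility(c):
        return c.predicted_score - cost_weight * c.cost - latency_weight * c.latency

    return sorted(candidates, key=utility, reverse=True)[:k]


# Hypothetical model pool; no inference is run to produce these numbers.
pool = [
    Candidate("model-a", 0.90, 0.80, 0.70),
    Candidate("model-b", 0.85, 0.20, 0.30),
    Candidate("model-c", 0.60, 0.05, 0.10),
    Candidate("model-d", 0.40, 0.50, 0.90),
]
selected = rank_models(pool, k=2)
print([c.name for c in selected])
```

In this toy pool the strongest model is not selected because its cost and latency penalties outweigh its small performance edge, which is the kind of trade-off the abstract describes.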

[AI-52] The Limits of AI Data Transparency Policy: Three Disclosure Fallacies

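[Quick read]: This paper addresses the limited practical effectiveness of current AI data transparency policies, namely three core gaps between policy design and actual outcomes: a "specification gap", an "enforcement gap", and an "impact gap". The key to its solution is to draw on social science research to propose more effective paths to transparency, emphasizing that policymaking must match disclosures precisely to goals, strengthen enforcement mechanisms, and ensure that disclosed information actually drives improvements in developer practices and public understanding, thereby moving from formal disclosure to substantive transparency.

Link: https://arxiv.org/abs/2601.18127
Authors: Judy Hanwen Shen, Ken Liu, Angelina Wang, Sarah H. Cen, Andy K. Zhang, Caroline Meinhardt, Daniel Zhang, Kevin Klyman, Rishi Bommasani, Daniel E. Ho
Institution: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: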

Abstract:Data transparency has emerged as a rallying cry for addressing concerns about AI: data quality, privacy, and copyright chief among them. Yet while these calls are crucial for accountability, current transparency policies often fall short of their intended aims. Similar to nutrition facts for food, policies aimed at nutrition facts for AI currently suffer from a limited consideration of research on effective disclosures. We offer an institutional perspective and identify three common fallacies in policy implementations of data disclosures for AI. First, many data transparency proposals exhibit a specification gap between the stated goals of data transparency and the actual disclosures necessary to achieve such goals. Second, reform attempts exhibit an enforcement gap between required disclosures on paper and enforcement to ensure compliance in fact. Third, policy proposals manifest an impact gap between disclosed information and meaningful changes in developer practices and public understanding. Informed by the social science on transparency, our analysis identifies affirmative paths for transparency that are effective rather than merely symbolic.

[AI-53] Understanding Users' Privacy Reasoning and Behaviors During Chatbot Use to Support Meaningful Agency in Privacy

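[Quick read]: This paper addresses the difficulty users have in protecting sensitive information when interacting with conversational agents (CAs), because privacy judgments are highly contextual. Existing research lacks a deep understanding of how users handle sensitive information in the moment in realistic scenarios and of the reasoning behind their behavior, which limits the effectiveness of privacy-protection support. The key to the solution is to design and evaluate an integrated privacy notice panel that intercepts message submissions, highlights potentially sensitive information, offers anonymization actions (retracting, faking, and generalizing), and improves the discoverability of built-in privacy controls, thereby raising users' privacy awareness, encouraging protective behavior, and supporting context-specific decisions about protecting information.

Link: https://arxiv.org/abs/2601.18125
Authors: Mohammad Hadi Nezhad, Francisco Enrique Vicente Castro, Ivon Arroyo
Institution: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Preprint of a paper under review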

Abstract:Conversational agents (CAs) (e.g., chatbots) are increasingly used in settings where users disclose sensitive information, raising significant privacy concerns. Because privacy judgments are highly contextual, supporting users to engage in privacy-protective actions during chatbot interactions is essential. However, enabling meaningful engagement requires a deeper understanding of how users currently reason about and manage sensitive information during realistic chatbot use scenarios. To investigate this, we qualitatively examined computer science (undergraduate and masters) students’ in-the-moment disclosure and protection behaviors, as well as the reasoning underlying these behaviors, across a range of realistic chatbot tasks. Participants used a simulated ChatGPT interface with and without a privacy notice panel that intercepts message submissions, highlights potentially sensitive information, and offers privacy protective actions. The panel supports anonymization through retracting, faking, and generalizing, and surfaces two of ChatGPT’s built-in privacy controls to improve their discoverability. Drawing on interaction logs, think-alouds, and survey responses, we analyzed how the panel fostered privacy awareness, encouraged protective actions, and supported context-specific reasoning about what information to protect and how. We further discuss design opportunities for tools that provide users greater and more meaningful agency in protecting sensitive information during CA interactions.

[AI-54] Deadline-Aware Energy-Efficient Control of Domestic Immersion Hot Water Heaters AAAI2026

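[Quick read]: This paper addresses the energy waste that occurs when conventional domestic immersion water heaters run in winter by heating continuously rather than efficiently, particularly without accounting for predictable demand windows and ambient heat losses. Its core solution is deadline-aware control, which minimizes energy consumption while ensuring the water reaches the target temperature at a specified time. The key contribution is a Gymnasium-based simulation environment that models a water heater with first-order thermal losses, on which three methods are compared: time-optimal bang-bang control, a zero-shot Monte Carlo Tree Search (MCTS) planner, and a Proximal Policy Optimization (PPO) reinforcement learning policy. Experiments show that, under identical physical conditions, the PPO policy saves up to 69% of the energy used by bang-bang control and has near-zero inference cost once trained.

Link: https://arxiv.org/abs/2601.18123
Authors: Muhammad Ibrahim Khan, Bivin Pradeep, James Brusey
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at AAAI 2026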

Abstract:Typical domestic immersion water heater systems are often operated continuously during winter, heating quickly rather than efficiently and ignoring predictable demand windows and ambient losses. We study deadline-aware control, where the aim is to reach a target temperature at a specified time while minimising energy consumption. We introduce an efficient Gymnasium environment that models an immersion hot water heater with first-order thermal losses and discrete on and off actions of 0 W and 6000 W applied every 120 seconds. Methods include a time-optimal bang-bang baseline, a zero-shot Monte Carlo Tree Search planner, and a Proximal Policy Optimisation policy. We report total energy consumption in watt-hours under identical physical dynamics. Across sweeps of initial temperature from 10 to 30 degrees Celsius, deadline from 30 to 90 steps, and target temperature from 40 to 80 degrees Celsius, PPO achieves the most energy-efficient performance at a 60-step horizon of 2 hours, using 3.23 kilowatt-hours, compared to 4.37 to 10.45 kilowatt-hours for bang-bang control and 4.18 to 6.46 kilowatt-hours for MCTS. This corresponds to energy savings of 26 percent at 30 steps and 69 percent at 90 steps. In a representative trajectory with a 50 kg water mass, 20 degrees Celsius ambient temperature, and a 60 degrees Celsius target, PPO consumes 54 percent less energy than bang-bang control and 33 percent less than MCTS. These results show that learned deadline-aware control reduces energy consumption under identical physical assumptions, while planners provide partial savings without training and learned policies offer near-zero inference cost once trained.
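The setup described in the abstract can be sketched in a few lines of Python. This is a rough illustration, not the paper's Gymnasium environment: the 6000 W heater, 120-second step, 50 kg water mass, and 20 degrees Celsius ambient temperature come from the abstract, while the loss coefficient `LOSS_W_PER_K` and the thermostat-style bang-bang rule are assumptions chosen for illustration and may differ from the paper's baseline.

```python
STEP_SECONDS = 120.0      # control interval from the abstract
HEATER_WATTS = 6000.0     # "on" power from the abstract
WATER_MASS_KG = 50.0      # representative mass from the abstract
AMBIENT_C = 20.0          # representative ambient temperature from the abstract
SPECIFIC_HEAT = 4186.0    # J/(kg K), standard value for water
LOSS_W_PER_K = 5.0        # hypothetical first-order loss coefficient


def step(temp_c, heater_on):
    """Advance the tank temperature by one control interval."""
    power = HEATER_WATTS if heater_on else 0.0
    loss = LOSS_W_PER_K * (temp_c - AMBIENT_C)  # first-order loss to ambient
    return temp_c + (power - loss) * STEP_SECONDS / (WATER_MASS_KG * SPECIFIC_HEAT)


def bang_bang_energy_wh(start_c, target_c, horizon_steps):
    """Heat whenever below target; return (energy in Wh, final temperature)."""
    temp, on_steps = start_c, 0
    for _ in range(horizon_steps):
        heater_on = temp < target_c
        on_steps += int(heater_on)
        temp = step(temp, heater_on)
    return on_steps * HEATER_WATTS * STEP_SECONDS / 3600.0, temp


# Each "on" step costs 6000 W * 120 s / 3600 = 200 Wh.
energy_wh, final_temp = bang_bang_energy_wh(20.0, 60.0, horizon_steps=60)
print(energy_wh, round(final_temp, 2))
```

Because this rule reaches the target early and then holds it against losses, it spends energy that a deadline-aware policy could avoid by delaying heating until closer to the deadline.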

[AI-55] Beyond Text-to-SQL: Can LLMs Really Debug Enterprise ETL SQL?

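[Quick read]: This paper addresses the low accuracy of enterprise-level SQL generation and debugging: even for experienced developers and advanced text-to-SQL large language models (LLMs), producing fully correct SQL in a single attempt remains very difficult and often requires multiple debugging iterations. The key to its solution is OurBench, the first benchmark for SQL reasoning and debugging in enterprise settings, built on two innovations. The first is an automated construction workflow based on reverse engineering that systematically injects realistic syntax errors and semantic errors, enabling large-scale, diverse test case generation. The second is an execution-free evaluation framework designed for real enterprise environments, providing fast, accurate, and resource-efficient assessment. The benchmark contains 469 syntax-error queries with explicit error messages (OurBenchSyn) and 516 semantic-error queries in which the code fails to meet user intent (OurBenchSem). The queries average over 140 lines and have complex abstract syntax trees, which deepens and broadens the evaluation of LLMs on enterprise SQL debugging tasks.

Link: https://arxiv.org/abs/2601.18119
Authors: Jing Ye, Yiwen Duan, Yonghong Yu, Victor Ma, Yang Gao, Xing Chen
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: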

Abstract:SQL is central to enterprise data engineering, yet generating fully correct SQL code in a single attempt remains difficult, even for experienced developers and advanced text-to-SQL LLMs, often requiring multiple debugging iterations. We introduce OurBench, the first benchmark for enterprise-level SQL reasoning and debugging. Our benchmark is built on two key innovations: (1) an automated construction workflow that uses reverse engineering to systematically inject realistic bugs into large-scale SQL code, enabling scalable and diverse benchmark generation; and (2) an execution-free evaluation framework tailored to enterprise settings, providing fast, accurate, and resource-efficient assessment. OurBench comprises 469 OurBenchSyn queries featuring syntax errors with explicit error messages, and 516 OurBenchSem queries targeting semantic errors in which the code fails to meet user intent. The queries are highly complex, averaging over 140 lines and featuring deep and wide abstract syntax trees. Evaluation of nearly 30 LLMs reveals a substantial performance gap: the best-performing model, Claude-4-Sonnet, achieves only 36.46 percent accuracy on OurBenchSyn and 32.17 percent on OurBenchSem, while most models score below 20 percent. We further explore four solution strategies, identify key challenges, and outline promising directions for enterprise SQL debugging with LLMs.

[AI-56] MalURLBench: A Benchmark Evaluating Agents' Vulnerabilities When Processing Web URLs

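[Quick read]: This paper addresses the security vulnerabilities of large language models (LLMs) when processing malicious URLs: models are easily deceived by disguised malicious links, leading them to access unsafe webpages with serious consequences. To address this, the authors propose MalURLBench, the first benchmark dedicated to evaluating LLM vulnerability to malicious URLs, containing 61,845 attack instances spanning 10 real-world scenarios and 7 categories of malicious websites. Experiments show that current mainstream LLMs struggle to detect elaborately disguised malicious URLs. Building on this, the study identifies the key factors that affect attack success rates and proposes URLGuard, a lightweight defense module, to improve security. The work provides a foundational resource and a practical approach for strengthening the security of web agents.

Link: https://arxiv.org/abs/2601.18113
Authors: Dezhang Kong, Zhuxi Wu, Shiqi Liu, Zhicheng Tan, Kuichen Lu, Minghao Li, Qichen Liu, Shengyu Chu, Zhenhua Xu, Xuan Liu, Meng Han
Institution: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: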

Abstract:LLM-based web agents have become increasingly popular for their utility in daily life and work. However, they exhibit critical vulnerabilities when processing malicious URLs: accepting a disguised malicious URL enables subsequent access to unsafe webpages, which can cause severe damage to service providers and users. Despite this risk, no benchmark currently targets this emerging threat. To address this gap, we propose MalURLBench, the first benchmark for evaluating LLMs’ vulnerabilities to malicious URLs. MalURLBench contains 61,845 attack instances spanning 10 real-world scenarios and 7 categories of real malicious websites. Experiments with 12 popular LLMs reveal that existing models struggle to detect elaborately disguised malicious URLs. We further identify and analyze key factors that impact attack success rates and propose URLGuard, a lightweight defense module. We believe this work will provide a foundational resource for advancing the security of web agents. Our code is available at this https URL.

[AI-57] Demystifying Data-Driven Probabilistic Medium-Range Weather Forecasting

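[Quick read]: This paper addresses the fact that, in current data-driven weather forecasting, complex bespoke architectures and training strategies obscure what actually drives forecast accuracy. The key to its solution is a scalable framework that combines a directly downsampled latent space with a history-conditioned local projector to learn multi-scale atmospheric dynamics. The design is robust to the choice of probabilistic estimator and seamlessly supports stochastic interpolants, diffusion models, and ensemble training based on the Continuous Ranked Probability Score (CRPS). It achieves statistically significant improvements on most variables over the Integrated Forecasting System and the deep learning probabilistic model GenCast, indicating that scaling a general-purpose model is sufficient for state-of-the-art medium-range forecasting.

Link: https://arxiv.org/abs/2601.18111
Authors: Jean Kossaifi, Nikola Kovachki, Morteza Mardani, Daniel Leibovici, Suman Ravuri, Ira Shokar, Edoardo Calvello, Mohammad Shoaib Abbas, Peter Harrington, Ashay Subramaniam, Noah Brenowitz, Boris Bonev, Wonmin Byeon, Karsten Kreis, Dale Durran, Arash Vahdat, Mike Pritchard, Jan Kautz
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: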

Abstract:The recent revolution in data-driven methods for weather forecasting has led to a fragmented landscape of complex, bespoke architectures and training strategies, obscuring the fundamental drivers of forecast accuracy. Here, we demonstrate that state-of-the-art probabilistic skill requires neither intricate architectural constraints nor specialized training heuristics. We introduce a scalable framework for learning multi-scale atmospheric dynamics by combining a directly downsampled latent space with a history-conditioned local projector that resolves high-resolution physics. We find that our framework design is robust to the choice of probabilistic estimator, seamlessly supporting stochastic interpolants, diffusion models, and CRPS-based ensemble training. Validated against the Integrated Forecasting System and the deep learning probabilistic model GenCast, our framework achieves statistically significant improvements on most of the variables. These results suggest scaling a general-purpose model is sufficient for state-of-the-art medium-range prediction, eliminating the need for tailored training recipes and proving effective across the full spectrum of probabilistic frameworks.
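For readers unfamiliar with CRPS, the score mentioned above can be estimated from a finite ensemble. The sketch below uses the standard ensemble estimator (mean absolute error of the members minus half the mean pairwise spread). Whether the paper uses this exact form or a bias-corrected variant is not stated in the abstract, so treat it as an assumption; the function name and sample values are illustrative only.

```python
def ensemble_crps(members, observation):
    """Standard ensemble CRPS estimate for one scalar variable (lower is better)."""
    m = len(members)
    # Accuracy term: average distance of the members from the observation.
    accuracy = sum(abs(x - observation) for x in members) / m
    # Spread term: half the average distance between all pairs of members.
    spread = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return accuracy - spread


# With one member the score reduces to the absolute error.
print(ensemble_crps([1.0], 3.0))
# A two-member ensemble that brackets the observation scores better.
print(ensemble_crps([0.0, 2.0], 1.0))
```

The spread term rewards ensembles that represent uncertainty instead of collapsing to a single deterministic forecast, which is why the score suits probabilistic training.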

[AI-58] Mitigating the OWASP Top 10 For Large Language Models Applications using Intelligent Agents

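[Quick read]: This paper addresses the security risks facing widely deployed large language models (LLMs), specifically the top 10 security vulnerabilities for LLM applications listed by the Open Web Application Security Project (OWASP). These include potential threats to data integrity, confidentiality, and service availability. The key to the proposed solution is a security framework built on LLM-enabled intelligent agents that proactively identifies, assesses, and counteracts security threats in real time, offering a forward-looking protection mechanism for LLM systems and an initial blueprint for future research and development.

Link: https://arxiv.org/abs/2601.18105
Authors: Mohammad Fasha, Faisal Abul Rub, Nasim Matar, Bilal Sowan, Mohammad Al Khaldy
Institution: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 5 pages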

Abstract:Large Language Models (LLMs) have emerged as a transformative and disruptive technology, enabling a wide range of applications in natural language processing, machine translation, and beyond. However, this widespread integration of LLMs also raised several security concerns highlighted by the Open Web Application Security Project (OWASP), which has identified the top 10 security vulnerabilities inherent in LLM applications. Addressing these vulnerabilities is crucial, given the increasing reliance on LLMs and the potential threats to data integrity, confidentiality, and service availability. This paper presents a framework designed to mitigate the security risks outlined in the OWASP Top 10. Our proposed model leverages LLM-enabled intelligent agents, offering a new approach to proactively identify, assess, and counteract security threats in real-time. The proposed framework serves as an initial blueprint for future research and development, aiming to enhance the security measures of LLMs and protect against emerging threats in this rapidly evolving landscape.

[AI-59] LatentMoE: Toward Optimal Accuracy per FLOP and Parameter in Mixture of Experts

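[Quick read]: This paper addresses the open question of how far existing Mixture of Experts (MoE) architectures are from optimal inference cost efficiency, that is, how to improve accuracy per unit of compute (per floating-point operation, FLOP) and per parameter. The core challenge is that existing MoE designs hit performance bottlenecks across deployment regimes (offline high-throughput and online low-latency inference), which makes hardware-software co-optimization difficult. The key to the solution is LatentMoE, a new model architecture obtained through systematic design exploration and optimized for maximal accuracy per unit of compute. Supported by large-scale training experiments (up to 95 billion parameters) and theoretical analysis, it consistently outperforms standard MoE architectures in accuracy per FLOP and per parameter. It has been adopted in Nvidia's Nemotron-3 Super and Ultra models and scaled further to longer token horizons and larger model sizes.

Link: https://arxiv.org/abs/2601.18089
Authors: Venmugil Elango, Nidhi Bhatia, Roger Waleffe, Rasoul Shafipour, Tomer Asida, Abhinav Khattar, Nave Assaf, Maximilian Golub, Joey Guman, Tiyasa Mitra, Ritchie Zhao, Ritika Borkar, Ran Zilberstein, Mostofa Patwary, Mohammad Shoeybi, Bita Rouhani
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: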

Abstract:Mixture of Experts (MoEs) have become a central component of many state-of-the-art open-source and proprietary large language models. Despite their widespread adoption, it remains unclear how close existing MoE architectures are to optimal with respect to inference cost, as measured by accuracy per floating-point operation and per parameter. In this work, we revisit MoE design from a hardware-software co-design perspective, grounded in empirical and theoretical considerations. We characterize key performance bottlenecks across diverse deployment regimes, spanning offline high-throughput execution and online, latency-critical inference. Guided by these insights, we introduce LatentMoE, a new model architecture resulting from systematic design exploration and optimized for maximal accuracy per unit of compute. Empirical design space exploration at scales of up to 95B parameters and over a 1T-token training horizon, together with supporting theoretical analysis, shows that LatentMoE consistently outperforms standard MoE architectures in terms of accuracy per FLOP and per parameter. Given its strong performance, the LatentMoE architecture has been adopted by the flagship Nemotron-3 Super and Ultra models and scaled to substantially larger regimes, including longer token horizons and larger model sizes, as reported in Nvidia et al. (arXiv:2512.20856).

[AI-60] “Crash Test Dummies” for AI-Enabled Clinical Assessment: Validating Virtual Patient Scenarios with Virtual Learners

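[Quick read]: This paper addresses the insufficient reliability and robustness of AI-assisted clinical competency assessment in health professions education (HPE). Existing methods rely mostly on agreement between AI and human raters and lack a measurement framework for how cases, learners, and raters jointly shape scores, so assessment results may be unreliable and may even misguide learners. The key to the solution is an open-source AI virtual patient platform combined with a psychometric method based on a Bayesian HRM-SDT model, using "simulated learners" with tunable parameters to stress-test the assessment pipeline. The model treats ratings as decisions under uncertainty, separates learner ability, case difficulty, and rater behavior into distinct dimensions, and estimates parameters with Markov chain Monte Carlo (MCMC). This enables robust, interpretable, and generalizable quantification of clinical competency, and provides a staged safety blueprint, tied to entrustment-based validation, for deploying AI-assisted assessment tools with human learners.

Link: https://arxiv.org/abs/2601.18085
Authors: Brian Gin, Ahreum Lim, Flávia Silva e Oliveira, Kuan Xing, Xiaomei Song, Gayana Amiyangoda, Thilanka Seneviratne, Alison F. Doubleday, Ananya Gangopadhyaya, Bob Kiser, Lukas Shum-Tim, Dhruva Patel, Kosala Marambe, Lauren Maggio, Ara Tekian, Yoon Soo Park
Institution: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments: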

Abstract:Background: In medical and health professions education (HPE), AI is increasingly used to assess clinical competencies, including via virtual standardized patients. However, most evaluations rely on AI-human interrater reliability and lack a measurement framework for how cases, learners, and raters jointly shape scores. This leaves robustness uncertain and can expose learners to misguidance from unvalidated systems. We address this by using AI “simulated learners” to stress-test and psychometrically characterize assessment pipelines before human use. Objective: Develop an open-source AI virtual patient platform and measurement model for robust competency evaluation across cases and rating conditions. Methods: We built a platform with virtual patients, virtual learners with tunable ACGME-aligned competency profiles, and multiple independent AI raters scoring encounters with structured Key-Features items. Transcripts were analyzed with a Bayesian HRM-SDT model that treats ratings as decisions under uncertainty and separates learner ability, case performance, and rater behavior; parameters were estimated with MCMC. Results: The model recovered simulated learners’ competencies, with significant correlations to the generating competencies across all ACGME domains despite a non-deterministic pipeline. It estimated case difficulty by competency and showed stable rater detection (sensitivity) and criteria (severity/leniency thresholds) across AI raters using identical models/prompts but different seeds. We also propose a staged “safety blueprint” for deploying AI tools with learners, tied to entrustment-based validation milestones. Conclusions: Combining a purpose-built virtual patient platform with a principled psychometric model enables robust, interpretable, generalizable competency estimates and supports validation of AI-assisted assessment prior to use with human learners. 

[AI-61] Diffusion Model-based Reinforcement Learning for Version Age of Information Scheduling: Average and Tail-Risk-Sensitive Control

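[Quick read]: This paper addresses semantic staleness in multi-user status update systems caused by stochastic packet arrivals and unreliable channels. Existing scheduling methods that minimize average Version Age of Information (VAoI) overlook rare but severe staleness events, which undermines system reliability. The key to the solution is RS-D3SAC, a risk-sensitive scheduling algorithm that combines a diffusion-based policy network with a quantile-based distributional critic, explicitly models the VAoI return distribution, and optimizes tail risk via Conditional Value-at-Risk (CVaR). It substantially reduces the probability of extreme staleness while satisfying long-term transmission cost constraints, without sacrificing average performance.

Link: https://arxiv.org/abs/2601.18069
Authors: Haoyuan Pan, Sizhao Chen, Zhaorui Wang, Tse-Tin Chan
Institution: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: 16 pages, 11 figures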

Abstract:Ensuring timely and semantically accurate information delivery is critical in real-time wireless systems. While Age of Information (AoI) quantifies temporal freshness, Version Age of Information (VAoI) captures semantic staleness by accounting for version evolution between transmitters and receivers. Existing VAoI scheduling approaches primarily focus on minimizing average VAoI, overlooking rare but severe staleness events that can compromise reliability under stochastic packet arrivals and unreliable channels. This paper investigates both average-oriented and tail-risk-sensitive VAoI scheduling in a multi-user status update system with long-term transmission cost constraints. We first formulate the average VAoI minimization problem as a constrained Markov decision process and introduce a deep diffusion-based Soft Actor-Critic (D2SAC) algorithm. By generating actions through a diffusion-based denoising process, D2SAC enhances policy expressiveness and establishes a strong baseline for mean performance. Building on this foundation, we put forth RS-D3SAC, a risk-sensitive deep distributional diffusion-based Soft Actor-Critic algorithm. RS-D3SAC integrates a diffusion-based actor with a quantile-based distributional critic, explicitly modeling the full VAoI return distribution. This enables principled tail-risk optimization via Conditional Value-at-Risk (CVaR) while satisfying long-term transmission cost constraints. Extensive simulations show that, while D2SAC reduces average VAoI, RS-D3SAC consistently achieves substantial reductions in CVaR without sacrificing mean performance. The dominant gain in tail-risk reduction stems from the distributional critic, with the diffusion-based actor providing complementary refinement to stabilize and enrich policy decisions, highlighting their effectiveness for robust and risk-aware VAoI scheduling in multi-user wireless systems.
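The difference between average-oriented and tail-risk-sensitive objectives can be illustrated with an empirical CVaR computation. The sketch below is not the RS-D3SAC critic: it takes the mean of the worst fraction of sampled staleness values, and the function name, the `tail_fraction` parameter, and the two sample sets are hypothetical.

```python
import math


def cvar(staleness_samples, tail_fraction=0.1):
    """Mean of the worst `tail_fraction` of samples (larger value = staler)."""
    ordered = sorted(staleness_samples, reverse=True)
    # round() guards against floating-point noise before taking the ceiling.
    tail_count = max(1, math.ceil(round(tail_fraction * len(ordered), 9)))
    tail = ordered[:tail_count]
    return sum(tail) / len(tail)


# Two hypothetical schedulers with the same mean staleness of 5.0.
steady = [5.0] * 10
spiky = [4.0] * 9 + [14.0]

for name, samples in (("steady", steady), ("spiky", spiky)):
    mean = sum(samples) / len(samples)
    print(name, mean, cvar(samples, tail_fraction=0.1))
```

Both schedulers look identical under an average VAoI objective, but the CVaR of the second one exposes its rare severe staleness event, which is the behavior a tail-risk-sensitive scheduler is meant to penalize.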

[AI-62] EvolVE: Evolutionary Search for LLM-based Verilog Generation and Optimization

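[Quick read]: This paper addresses the high labor cost and reliance on domain expertise in the Verilog hardware design flow, as well as the limitations of existing large language models (LLMs) in handling the strict formal logic and concurrency of hardware systems. The key to its solution is the EvolVE framework, which systematically analyzes multiple evolution strategies on chip design tasks and finds that Monte Carlo Tree Search (MCTS) is best at improving functional correctness, while Idea-Guided Refinement (IGR) is better for optimization. It also introduces Structured Testbench Generation (STG) to accelerate the evolutionary process and builds the industry-scale IC-RTL benchmark to fill the gap in evaluating complex optimization scenarios. EvolVE reaches 98.1% on VerilogEval v2 and 92% on RTLLM v2, and reduces the Power, Performance, Area (PPA) product by up to 66% on IC-RTL.

Link: https://arxiv.org/abs/2601.18067
Authors: Wei-Po Hsin, Ren-Hao Deng, Yao-Ting Hsieh, En-Ming Huang, Shih-Hao Hung
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Programming Languages (cs.PL)
Comments: 17 pages, 6 figures, 8 tables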

Abstract:Verilog’s design cycle is inherently labor-intensive and necessitates extensive domain expertise. Although Large Language Models (LLMs) offer a promising pathway toward automation, their limited training data and intrinsic sequential reasoning fail to capture the strict formal logic and concurrency inherent in hardware systems. To overcome these barriers, we present EvolVE, the first framework to analyze multiple evolution strategies on chip design tasks, revealing that Monte Carlo Tree Search (MCTS) excels at maximizing functional correctness, while Idea-Guided Refinement (IGR) proves superior for optimization. We further leverage Structured Testbench Generation (STG) to accelerate the evolutionary process. To address the lack of complex optimization benchmarks, we introduce IC-RTL, targeting industry-scale problems derived from the National Integrated Circuit Contest. Evaluations establish EvolVE as the new state-of-the-art, achieving 98.1% on VerilogEval v2 and 92% on RTLLM v2. Furthermore, on the industry-scale IC-RTL suite, our framework surpasses reference implementations authored by contest participants, reducing the Power, Performance, Area (PPA) product by up to 66% in Huffman Coding and 17% in the geometric mean across all problems. The source code of the IC-RTL benchmark is available at this https URL.

[AI-63] Resonant Sparse Geometry Networks

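[Quick read]: This paper addresses the high computational complexity (O(n^2)) and parameter redundancy caused by dense attention in conventional Transformer architectures, while exploring efficient neural network designs that better match the properties of biological nervous systems. The key to its solution is Resonant Sparse Geometry Networks (RSGN), a brain-inspired architecture that embeds computational nodes in a learned hyperbolic space so that connection strength decays with geodesic distance, yielding input-dependent dynamic sparsity. RSGN also uses a two-timescale mechanism: fast differentiable activation propagation for gradient-based optimization, and slow Hebbian-style structural learning that adjusts connectivity through local correlation rules. The result is O(n·k) computational complexity (where k is the average active neighborhood size), which substantially lowers resource consumption.

Link: https://arxiv.org/abs/2601.18064
Authors: Hasi Hays
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: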

Abstract:We introduce Resonant Sparse Geometry Networks (RSGN), a brain-inspired architecture with self-organizing sparse hierarchical input-dependent connectivity. Unlike Transformer architectures that employ dense attention mechanisms with O(n^2) computational complexity, RSGN embeds computational nodes in learned hyperbolic space where connection strength decays with geodesic distance, achieving dynamic sparsity that adapts to each input. The architecture operates on two distinct timescales: fast differentiable activation propagation optimized through gradient descent, and slow Hebbian-inspired structural learning for connectivity adaptation through local correlation rules. We provide rigorous mathematical analysis demonstrating that RSGN achieves O(n*k) computational complexity, where k ≪ n represents the average active neighborhood size. Experimental evaluation on hierarchical classification and long-range dependency tasks demonstrates that RSGN achieves 96.5% accuracy on long-range dependency tasks while using approximately 15x fewer parameters than standard Transformers. On challenging hierarchical classification with 20 classes, RSGN achieves 23.8% accuracy (compared to 5% random baseline) with only 41,672 parameters, nearly 10x fewer than the Transformer baselines which require 403,348 parameters to achieve 30.1% accuracy. Our ablation studies confirm the contribution of each architectural component, with Hebbian learning providing consistent improvements. These results suggest that brain-inspired principles of sparse, geometrically-organized computation offer a promising direction toward more efficient and biologically plausible neural architectures.
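The idea of connection strength decaying with geodesic distance can be sketched as follows. The abstract does not say which model of hyperbolic space or which decay function RSGN uses, so this example assumes the Poincaré ball model and an exponential decay purely for illustration. The function names, the `temperature` parameter, the threshold, and the node coordinates are hypothetical.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincaré ball."""
    def sq_norm(x):
        return sum(c * c for c in x)

    diff = sq_norm([a - b for a, b in zip(u, v)])
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - sq_norm(u)) * (1.0 - sq_norm(v))))


def connection_strength(u, v, temperature=1.0):
    """Assumed decay rule: strength falls off exponentially with distance."""
    return math.exp(-poincare_distance(u, v) / temperature)


def active_neighbors(nodes, index, threshold=0.2):
    """Indices of nodes whose connection to `nodes[index]` exceeds the threshold."""
    return [
        j for j, other in enumerate(nodes)
        if j != index and connection_strength(nodes[index], other) > threshold
    ]


# Hypothetical node embeddings; points near the boundary are far from everything.
nodes = [(0.0, 0.0), (0.1, 0.0), (0.5, 0.0), (0.0, 0.95)]
print(active_neighbors(nodes, 0))
```

Thresholding the strengths keeps only a small neighborhood per node, which is one simple way to obtain the kind of sparse connectivity behind the O(n*k) cost quoted in the abstract.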
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE) Cite as: arXiv:2601.18064 [cs.LG] (or arXiv:2601.18064v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2601.18064 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Hasi Hays [view email] [v1] Mon, 26 Jan 2026 01:45:51 UTC (3,011 KB) Full-text links: Access Paper: View a PDF of the paper titled Resonant Sparse Geometry Networks, by Hasi HaysView PDFHTML (experimental)TeX Source view license Current browse context: cs.LG prev | next new | recent | 2026-01 Change to browse by: cs cs.AI cs.NE References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) IArxiv recommender toggle IArxiv Recommender (What is IArxiv?) 

[AI-64] Expert Evaluation and the Limits of Human Feedback in Mental Health AI Safety Testing

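[Quick read]: This paper asks whether, in high-stakes domains such as mental health, expert judgments can provide reliable, consistent labels as a supervision signal when AI systems are trained and evaluated with learning from human feedback (LHF). Although three certified psychiatrists had similar training and followed a shared rubric, their inter-rater agreement was very low (ICC 0.087–0.295). They diverged systematically on suicide and self-harm responses, and one factor even showed negative reliability (Krippendorff's α = -0.203). The disagreement is not random error but stems from different clinical frameworks: safety-first, engagement-centered, and culturally-informed orientations. The key to the proposed remedy is to abandon consensus-based aggregation in favor of alignment methods that preserve and learn from expert disagreement, turning principled differences between experts into an interpretable, usable training signal rather than treating them as noise or error. This marks a shift from pursuing consistency to understanding diversity, with implications for building safer, more transparent generative AI systems.

Link: https://arxiv.org/abs/2601.18061
Authors: Kiana Jafari,Paul Ulrich Nikolaus Rust,Duncan Eddy,Robbie Fraser,Nina Vasan,Darja Djordjevic,Akanksha Dadlani,Max Lamparth,Eugenia Kim,Mykel Kochenderfer
Affiliation: unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 17 pages, 7 pages of appendix, 21 tables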

Abstract:Learning from human feedback (LHF) assumes that expert judgments, appropriately aggregated, yield valid ground truth for training and evaluating AI systems. We tested this assumption in mental health, where high safety stakes make expert consensus essential. Three certified psychiatrists independently evaluated LLM-generated responses using a calibrated rubric. Despite similar training and shared instructions, inter-rater reliability was consistently poor (ICC 0.087–0.295), falling below thresholds considered acceptable for consequential assessment. Disagreement was highest on the most safety-critical items. Suicide and self-harm responses produced greater divergence than any other category, and the divergence was systematic rather than random. One factor yielded negative reliability (Krippendorff’s α = -0.203), indicating structured disagreement worse than chance. Qualitative interviews revealed that disagreement reflects coherent but incompatible individual clinical frameworks (safety-first, engagement-centered, and culturally-informed orientations) rather than measurement error. By demonstrating that experts rely on holistic risk heuristics rather than granular factor discrimination, these findings suggest that aggregated labels function as arithmetic compromises that effectively erase grounded professional philosophies. Our results characterize expert disagreement in safety-critical AI as a sociotechnical phenomenon where professional experience introduces sophisticated layers of principled divergence. We discuss implications for reward modeling, safety classification, and evaluation benchmarks, recommending that practitioners shift from consensus-based aggregation to alignment methods that preserve and learn from expert disagreement.
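
To make the reliability figures concrete, here is a minimal sketch of how an intraclass correlation coefficient can be computed from an items-by-raters score table. The abstract does not say which ICC form was used; the sketch assumes ICC(2,1) (two-way random effects, absolute agreement, single rater), and the function name `icc2_1` is illustrative.

```python
def icc2_1(ratings):
    """ICC(2,1) from a table where ratings[i][j] is rater j's score for item i."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    col_means = [sum(r[j] for r in ratings) / n for j in range(k)]
    # Sums of squares for items (rows), raters (columns), and the total.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in ratings for x in r)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
```

A value near 1 means raters assign nearly the same scores to each item; values in the range reported above mean that most of the score variance comes from who is rating rather than from what is being rated.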

[AI-65] An Experimental Comparison of Cognitive Forcing Functions for Execution Plans in AI-Assisted Writing: Effects On Trust Overreliance and Perceived Critical Thinking

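[Quick read]: This paper addresses the overreliance that generative AI (GenAI) can induce in knowledge workflows, where users who trust AI output may exercise less critical thinking. The key to its approach is to apply cognitive forcing functions (CFFs) to AI-generated execution plans, specifically the Assumption (argument analysis) and WhatIf (hypothesis testing) interventions, which prompt users to actively reflect on and evaluate AI output and thereby raise critical engagement without noticeably increasing cognitive load.

Link: https://arxiv.org/abs/2601.18033
Authors: Ahana Ghosh,Advait Sarkar,Siân Lindley,Christian Poelitz
Affiliation: unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: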

Abstract:Generative AI (GenAI) tools improve productivity in knowledge workflows such as writing, but also risk overreliance and reduced critical thinking. Cognitive forcing functions (CFFs) mitigate these risks by requiring active engagement with AI output. As GenAI workflows grow more complex, systems increasingly present execution plans for user review. However, these plans are themselves AI-generated and prone to overreliance, and the effectiveness of applying CFFs to AI plans remains underexplored. We conduct a controlled experiment in which participants completed AI-assisted writing tasks while reviewing AI-generated plans under four CFF conditions: Assumption (argument analysis), WhatIf (hypothesis testing), Both, and a no-CFF control. A follow-up think-aloud and interview study qualitatively compared these conditions. Results show that the Assumption CFF most effectively reduced overreliance without increasing cognitive load, while participants perceived the WhatIf CFF as most helpful. These findings highlight the value of plan-focused CFFs for supporting critical reflection in GenAI-assisted knowledge work.

[AI-66] Coding-Enforced Resilient and Secure Aggregation for Hierarchical Federated Learning

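[Quick read]: This paper addresses the tension between model accuracy and privacy preservation in hierarchical federated learning (HFL) under unreliable communication, in particular the inaccurate aggregation and partial participation that arise when the coordination of privacy noise is randomly disrupted. The key to its solution is a robust hierarchical secure aggregation scheme, H-SecCoGC, which uses coding strategies to enforce structured aggregation, so that the global model is constructed accurately under arbitrarily strong privacy guarantees, improving robustness, privacy preservation, and learning efficiency.

Link: https://arxiv.org/abs/2601.17995
Authors: Shudi Weng,Ming Xiao,Mikael Skoglund
Affiliation: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: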

Abstract:Hierarchical federated learning (HFL) has emerged as an effective paradigm to enhance link quality between clients and the server. However, ensuring model accuracy while preserving privacy under unreliable communication remains a key challenge in HFL, as the coordination among privacy noise can be randomly disrupted. To address this limitation, we propose a robust hierarchical secure aggregation scheme, termed H-SecCoGC, which integrates coding strategies to enforce structured aggregation. The proposed scheme not only ensures accurate global model construction under varying levels of privacy, but also avoids the partial participation issue, thereby significantly improving robustness, privacy preservation, and learning efficiency. Both theoretical analyses and experimental results demonstrate the superiority of our scheme under unreliable communication across arbitrarily strong privacy guarantees.
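
Why unreliable links disrupt coordinated privacy noise can be seen in a toy sketch. This is not H-SecCoGC and contains no coding strategy; it assumes a simple scheme in which clients add noise terms that sum to zero, and the helper names are hypothetical.

```python
import random

def zero_sum_noise(n, scale, rng):
    """Noise terms for n clients that cancel exactly when all are summed."""
    noise = [rng.gauss(0.0, scale) for _ in range(n - 1)]
    noise.append(-sum(noise))
    return noise

def aggregate(updates, noise, arrived):
    """Server-side sum over the clients whose masked updates arrived."""
    return sum(updates[i] + noise[i] for i in arrived)

rng = random.Random(0)
updates = [1.0, 2.0, 3.0]
noise = zero_sum_noise(len(updates), scale=5.0, rng=rng)
full = aggregate(updates, noise, arrived=[0, 1, 2])   # noise cancels
partial = aggregate(updates, noise, arrived=[0, 1])   # client 2 dropped
```

With all three clients present the sum equals the true total; when one upload is lost, the leftover noise no longer cancels and contaminates the aggregate, which is the failure mode the abstract attributes to unreliable communication.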

[AI-67] Credit Fairness: Online Fairness In Shared Resource Pools

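[Quick read]: This paper addresses the long-run inequity that arises when a resource allocation mechanism applies max-min fairness independently in each round: even agents with the same average demand can end up with very different cumulative allocations. The key to its solution is credit fairness, a new notion that strengthens sharing incentives by ensuring that agents who lend resources in early rounds can recoup them in later rounds. The paper proposes a mechanism that is both credit fair and Pareto efficient, avoiding the large disparities that the max-min mechanism can produce.

Link: https://arxiv.org/abs/2601.17944
Authors: Seyed Majid Zahedi,Rupert Freeman
Affiliation: unknown
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Operating Systems (cs.OS)
Comments: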

Abstract:We consider a setting in which a group of agents share resources that must be allocated among them in each discrete time period. Agents have time-varying demands and derive constant marginal utility from each unit of resource received up to their demand, with zero utility for any additional resources. In this setting, it is known that independently maximizing the minimum utility in each round satisfies sharing incentives (agents weakly prefer participating in the mechanism to not participating), strategyproofness (agents have no incentive to misreport their demands), and Pareto efficiency (Freeman et al. 2018). However, recent work (Vuppalapati et al. 2023) has shown that this max-min mechanism can lead to large disparities in the total resources received by agents, even when they have the same average demand. In this paper, we introduce credit fairness, a strengthening of sharing incentives that ensures agents who lend resources in early rounds are able to recoup them in later rounds. Credit fairness can be achieved in conjunction with either Pareto efficiency or strategyproofness, but not both. We propose a mechanism that is credit fair and Pareto efficient, and we evaluate its performance in a computational resource-sharing setting.
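
The per-round max-min mechanism the abstract refers to can be sketched with the standard water-filling procedure. The code below illustrates only that baseline, not the credit-fair mechanism proposed in the paper; the demand sequences are made-up numbers for the example.

```python
def max_min_allocation(demands, capacity):
    """Water-filling: split what is left equally among unsatisfied agents."""
    alloc = [0.0] * len(demands)
    active = [i for i, d in enumerate(demands) if d > 0]
    remaining = float(capacity)
    while active and remaining > 1e-12:
        share = remaining / len(active)
        satisfied = [i for i in active if demands[i] - alloc[i] <= share]
        if not satisfied:
            # Nobody can be fully served: everyone gets an equal share.
            for i in active:
                alloc[i] += share
            break
        for i in satisfied:
            remaining -= demands[i] - alloc[i]
            alloc[i] = float(demands[i])
        active = [i for i in active if i not in satisfied]
    return alloc

# Two agents with the same average demand (1 unit per round), capacity 2.
rounds = [(1, 4), (1, 0), (1, 0), (1, 0)]
totals = [0.0, 0.0]
for demands in rounds:
    for i, a in enumerate(max_min_allocation(demands, capacity=2)):
        totals[i] += a
```

Here the steady agent ends with 4 units and the bursty agent with 1, even though both asked for 4 in total. This is the kind of cumulative disparity that motivates a notion of fairness spanning rounds.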

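[AI-68] LLM-Based SQL Generation: Prompting Self-Refinement and Adaptive Weighted Majority Voting

[Quick read]: This paper addresses challenges in natural-language-to-SQL (Text-to-SQL) generation: ambiguity in user queries, the complexity of schema linking, limited generalization across SQL dialects, and the need for domain-specific understanding. Its core solution is a Single-Agent Self-Refinement with Ensemble Voting (SSEV) pipeline that needs no ground-truth data and combines self-refinement with Weighted Majority Voting (WMV) and its randomized variant, the Randomized Weighted Majority Algorithm (RWMA). Building on this, the paper proposes ReCAPAgent-SQL, a multi-agent framework in which specialized agents for planning, external knowledge retrieval, critique, action generation, self-refinement, schema linking, and result validation collaborate to iteratively refine SQL predictions, targeting enterprise databases and real-world Text-to-SQL tasks.

Link: https://arxiv.org/abs/2601.17942
Authors: Yu-Jie Yang,Hung-Fu Chang,Po-An Chen
Affiliation: unknown
Subjects: Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 29 pages, 22 figures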

Abstract:Text-to-SQL has emerged as a prominent research area, particularly with the rapid advancement of large language models (LLMs). By enabling users to query databases through natural language rather than SQL, this technology significantly lowers the barrier to data analysis. However, generating accurate SQL from natural language remains challenging due to ambiguity in user queries, the complexity of schema linking, limited generalization across SQL dialects, and the need for domain-specific understanding. In this study, we propose a Single-Agent Self-Refinement with Ensemble Voting (SSEV) pipeline built on PET-SQL that operates without ground-truth data, integrating self-refinement with Weighted Majority Voting (WMV) and its randomized variant (RWMA). Experimental results show that the SSEV achieves competitive performance across multiple benchmarks, attaining execution accuracies of 85.5% on Spider 1.0-Dev, 86.4% on Spider 1.0-Test, and 66.3% on BIRD-Dev. Building on insights from the SSEV pipeline, we further propose ReCAPAgent-SQL (Refinement-Critique-Act-Plan agent-based SQL framework) to address the growing complexity of enterprise databases and real-world Text-to-SQL tasks. The framework integrates multiple specialized agents for planning, external knowledge retrieval, critique, action generation, self-refinement, schema linking, and result validation, enabling iterative refinement of SQL predictions through agent collaboration. ReCAPAgent-SQL’s WMA results achieve 31% execution accuracy on the first 100 queries of Spider 2.0-Lite, demonstrating significant improvements in handling real-world enterprise scenarios. Overall, our work facilitates the deployment of scalable Text-to-SQL systems in practical settings, supporting better data-driven decision-making at lower cost and with greater efficiency.
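
Weighted majority voting and its randomized variant can be sketched generically as below. The abstract does not describe how the pipeline derives expert weights without ground-truth data, so the `penalize` step simply takes the set of experts judged wrong as an input; the expert names and the factor `beta` are assumptions for illustration.

```python
import random
from collections import defaultdict

def weighted_majority_vote(predictions, weights):
    """Return the answer with the largest total expert weight."""
    totals = defaultdict(float)
    for expert, answer in predictions.items():
        totals[answer] += weights[expert]
    return max(totals, key=totals.get)

def randomized_weighted_majority(predictions, weights, rng=random):
    """Follow one expert, sampled with probability proportional to weight."""
    experts = list(predictions)
    chosen = rng.choices(experts, weights=[weights[e] for e in experts], k=1)[0]
    return predictions[chosen]

def penalize(weights, wrong_experts, beta=0.5):
    """Multiplicatively shrink the weight of experts judged wrong."""
    return {e: w * beta if e in wrong_experts else w for e, w in weights.items()}
```

For example, with candidate queries from three generators and weights 3, 1, 1, the vote follows the heavily weighted generator until a penalty lowers its weight below the combined weight of the other two.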

[AI-69] Learning Transferable Skills in Action RPGs via Directed Skill Graphs and Selective Adaptation

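[Quick read]: This paper addresses how a lifelong-learning agent can gradually extend its competence without retraining from scratch or overwriting existing behaviors. The key is to model combat as a directed skill graph and train its components with a hierarchical curriculum, decomposing control into five reusable, specialized skills: camera control, target lock-on, movement, dodging, and a heal-attack decision policy. This factorization improves sample efficiency and supports selective fine-tuning: when the environment shifts from Phase 1 to Phase 2, only a subset of skills needs adapting while upstream skills remain transferable. Experiments show that targeted fine-tuning of just two skills rapidly recovers performance under a limited interaction budget, offering a practical path toward continually evolving agents in complex real-time environments.

Link: https://arxiv.org/abs/2601.17923
Authors: Ali Najar
Affiliation: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 5 pages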

Abstract:Lifelong agents should expand their competence over time without retraining from scratch or overwriting previously learned behaviors. We investigate this in a challenging real-time control setting (Dark Souls III) by representing combat as a directed skill graph and training its components in a hierarchical curriculum. The resulting agent decomposes control into five reusable skills: camera control, target lock-on, movement, dodging, and a heal-attack decision policy, each optimized for a narrow responsibility. This factorization improves sample efficiency by reducing the burden on any single policy and supports selective post-training: when the environment shifts from Phase 1 to Phase 2, only a subset of skills must be adapted, while upstream skills remain transferable. Empirically, we find that targeted fine-tuning of just two skills rapidly recovers performance under a limited interaction budget, suggesting that skill-graph curricula together with selective fine-tuning offer a practical pathway toward evolving, continually learning agents in complex real-time environments.
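
A directed skill graph with a curriculum order and a selective fine-tuning mask might look like the sketch below. The five skill names come from the abstract, but the edges between them and the choice of which two skills to adapt are hypothetical; the abstract does not specify either.

```python
from graphlib import TopologicalSorter

# skill -> set of upstream skills it depends on (hypothetical edges)
SKILL_GRAPH = {
    "camera": set(),
    "lock_on": {"camera"},
    "movement": {"lock_on"},
    "dodging": {"movement"},
    "heal_attack": {"dodging"},
}

def curriculum_order(graph):
    """Train upstream skills before the skills that depend on them."""
    return list(TopologicalSorter(graph).static_order())

def trainable_mask(graph, adapt):
    """Mark which skills are fine-tuned; all others stay frozen."""
    return {skill: skill in adapt for skill in graph}

order = curriculum_order(SKILL_GRAPH)
# Hypothetical Phase 2 adaptation: update only two downstream skills.
mask = trainable_mask(SKILL_GRAPH, adapt={"dodging", "heal_attack"})
```

Keeping each skill as a separate node is what makes the mask possible: frozen upstream policies keep producing the same outputs while only the selected downstream policies receive updates.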

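[AI-70] Agentic AI for Self-Driving Laboratories in Soft Matter: Taxonomy, Benchmarks, and Open Challenges

[Quick read]: This paper addresses how self-driving laboratories (SDLs) can make efficient, safe, and reproducible automated experimental decisions under expensive actions, noisy and delayed feedback, strict feasibility and safety constraints, and non-stationarity. The key is to frame SDL autonomy as an explicit agent-environment interaction problem with defined observations, actions, costs, and constraints, connecting common SDL pipelines to established AI principles. The paper also proposes a capability-driven taxonomy covering decision horizon, uncertainty modeling, action parameterization, constraint handling, failure recovery, and human involvement, and synthesizes benchmark task templates and evaluation metrics centered on cost-aware performance, robustness to drift, constraint-violation behavior, and reproducibility, to enable meaningful comparison between systems.

Link: https://arxiv.org/abs/2601.17920
Authors: Xuanzhou Chen,Audrey Wang,Stanley Yin,Hanyang Jiang,Dong Zhang
Affiliation: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: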

Abstract:Self-driving laboratories (SDLs) close the loop between experiment design, automated execution, and data-driven decision making, and they provide a demanding testbed for agentic AI under expensive actions, noisy and delayed feedback, strict feasibility and safety constraints, and non-stationarity. This survey uses soft matter as a representative setting but focuses on the AI questions that arise in real laboratories. We frame SDL autonomy as an agent-environment interaction problem with explicit observations, actions, costs, and constraints, and we use this formulation to connect common SDL pipelines to established AI principles. We review the main method families that enable closed-loop experimentation, including Bayesian optimization and active learning for sample-efficient experiment selection, planning and reinforcement learning for long-horizon protocol optimization, and tool-using agents that orchestrate heterogeneous instruments and software. We emphasize verifiable and provenance-aware policies that support debugging, reproducibility, and safe operation. We then propose a capability-driven taxonomy that organizes systems by decision horizon, uncertainty modeling, action parameterization, constraint handling, failure recovery, and human involvement. To enable meaningful comparison, we synthesize benchmark task templates and evaluation metrics that prioritize cost-aware performance, robustness to drift, constraint violation behavior, and reproducibility. Finally, we distill lessons from deployed SDLs and outline open challenges in multi-modal representation, calibrated uncertainty, safe exploration, and shared benchmark infrastructure.

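[AI-71] Think Locally, Explain Globally: Graph-Guided LLM Investigations via Local Reasoning and Belief Propagation

[Quick read]: This paper addresses the unreliability of large language model (LLM) agents in open-ended investigations, especially over massive, heterogeneous operational data, where ReAct-style agents constrained by the context window struggle to build explanatory chains of evidence. The core challenge is the hidden dependency structure among pieces of evidence (entities interact, signals co-vary), so the importance of a fact may only become clear after later discoveries. Existing methods tend to lose key information in the iterative retrieve-summarize-reason loop, which makes results sensitive to exploration order and inconsistent across runs (high Pass-at-k but low Majority-at-k). The key to the solution is EoG (Explanations over Graphs), which formulates investigation as abductive reasoning over a dependency graph and separates local evidence mining from global belief management: the LLM performs bounded local evidence labeling (cause vs. symptom), while a deterministic controller manages traversal, state updates, and belief propagation to compute a minimal explanatory frontier. On an ITBench diagnostics task this yields a 7x average gain in Majority-at-k entity F1.

Link: https://arxiv.org/abs/2601.17915
Authors: Saurabh Jha,Rohan Arora,Bhavya,Noah Zheutlin,Paulina Toro Isaza,Laura Shwartz,Yu Deng,Daby Sow,Ruchi Mahindru,Ruchir Puri
Affiliation: unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: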

Abstract:LLM agents excel when environments are mostly static and the needed information fits in a model’s context window, but they often fail in open-ended investigations where explanations must be constructed by iteratively mining evidence from massive, heterogeneous operational data. These investigations exhibit hidden dependency structure: entities interact, signals co-vary, and the importance of a fact may only become clear after other evidence is discovered. Because the context window is bounded, agents must summarize intermediate findings before their significance is known, increasing the risk of discarding key evidence. ReAct-style agents are especially brittle in this regime. Their retrieve-summarize-reason loop makes conclusions sensitive to exploration order and introduces run-to-run non-determinism, producing a reliability gap where Pass-at-k may be high but Majority-at-k remains low. Simply sampling more rollouts or generating longer reasoning traces does not reliably stabilize results, since hypotheses cannot be autonomously checked as new evidence arrives and there is no explicit mechanism for belief bookkeeping and revision. In addition, ReAct entangles semantic reasoning with controller duties such as tool orchestration and state tracking, so execution errors and plan drift degrade reasoning while consuming scarce context. We address these issues by formulating investigation as abductive reasoning over a dependency graph and proposing EoG (Explanations over Graphs), a disaggregated framework in which an LLM performs bounded local evidence mining and labeling (cause vs symptom) while a deterministic controller manages traversal, state, and belief propagation to compute a minimal explanatory frontier. On a representative ITBench diagnostics task, EoG improves both accuracy and run-to-run consistency over ReAct baselines, including a 7x average gain in Majority-at-k entity F1. 
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO) Cite as: arXiv:2601.17915 [cs.AI] (or arXiv:2601.17915v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2601.17915
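
The reliability gap between Pass-at-k and Majority-at-k can be illustrated with a small sketch. The definitions below are simplified assumptions (one answer string per run, exact match against the truth); the paper reports entity F1 under these regimes, which this example does not reproduce.

```python
from collections import Counter

def pass_at_k(answers, truth):
    """True if at least one of the k runs produced the correct answer."""
    return truth in answers

def majority_at_k(answers, truth):
    """True if the most frequent answer across the k runs is correct."""
    top, _ = Counter(answers).most_common(1)[0]
    return top == truth

# Five runs of the same investigation with made-up root-cause answers.
runs = ["cache", "db", "cache", "network", "cache"]
gap = (pass_at_k(runs, "db"), majority_at_k(runs, "db"))
```

One lucky run is enough for Pass-at-k, while Majority-at-k requires the agent to land on the right answer most of the time, so a large gap between the two signals run-to-run inconsistency.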

[AI-72] Causal Pre-training Under the Fairness Lens: An Empirical Study of TabPFN

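[Quick read]: This paper examines the fairness of foundation models for tabular data, in particular models pre-trained on data generated by structural causal models (SCMs), such as TabPFN, whose fairness properties remain underexplored despite strong predictive accuracy and robustness to spurious correlations. Through a systematic empirical evaluation of TabPFN and its fine-tuned variants across dataset sizes and distribution shifts, covering predictive performance, fairness, and robustness, the paper finds that causal pre-training improves algorithmic fairness only moderately and inconsistently, particularly under missing-not-at-random (MNAR) covariate shifts. Causal pre-training alone is therefore insufficient to ensure fairness, and additional fairness interventions are needed.

Link: https://arxiv.org/abs/2601.17912
Authors: Qinyi Liu,Mohammad Khalil,Naman Goel
Affiliation: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: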

Abstract:Foundation models for tabular data, such as the Tabular Prior-data Fitted Network (TabPFN), are pre-trained on a massive number of synthetic datasets generated by structural causal models (SCM). They leverage in-context learning to offer high predictive accuracy in real-world tasks. However, the fairness properties of these foundational models, which incorporate ideas from causal reasoning during pre-training, have not yet been explored in sufficient depth. In this work, we conduct a comprehensive empirical evaluation of TabPFN and its fine-tuned variants, assessing predictive performance, fairness, and robustness across varying dataset sizes and distributional shifts. Our results reveal that while TabPFN achieves stronger predictive accuracy compared to baselines and exhibits robustness to spurious correlations, improvements in fairness are moderate and inconsistent, particularly under missing-not-at-random (MNAR) covariate shifts. These findings suggest that the causal pre-training in TabPFN is helpful but insufficient for algorithmic fairness, highlighting implications for deploying such models in practice and the need for further fairness interventions.
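
As a reminder of what a group-fairness measurement looks like, here is a minimal sketch of the demographic parity difference for binary predictions. The abstract does not name the fairness metrics used in the study, so this metric is an assumption chosen because it is common and simple.

```python
def demographic_parity_difference(y_pred, group):
    """Gap between the highest and lowest positive-prediction rate per group."""
    rates = {}
    for g in set(group):
        preds = [p for p, gg in zip(y_pred, group) if gg == g]
        rates[g] = sum(preds) / len(preds)
    return max(rates.values()) - min(rates.values())

# Made-up predictions for two groups of three people each.
y_pred = [1, 1, 0, 0, 1, 0]
group = ["a", "a", "a", "b", "b", "b"]
dpd = demographic_parity_difference(y_pred, group)
```

A value of 0 means every group receives positive predictions at the same rate; evaluating such a gap alongside accuracy under each distribution shift is one way to check whether accuracy gains carry over to fairness.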

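[AI-73] UniCog: Uncovering Cognitive Abilities of LLMs through Latent Mind Space Analysis

[Quick read]: This paper addresses the limits of existing interpretability methods in explaining how different cognitive abilities are engaged during large language model (LLM) reasoning, given growing evidence that LLM cognition differs fundamentally from human cognition. The key is UniCog, a framework that builds a latent mind space and encodes dense model activations into sparse, disentangled latent dimensions, enabling a unified analysis of LLM cognitive abilities. The analysis reveals a Pareto principle in which a shared reasoning core is complemented by ability-specific signatures, and finds that reasoning failures often show up as anomalous intensity in latent activations. Building on this, a latent-informed candidate prioritization strategy improves reasoning performance by up to 7.5% across challenging benchmarks.

Link: https://arxiv.org/abs/2601.17897
Authors: Jiayu Liu,Yinhe Long,Zhenya Huang,Enhong Chen
Affiliation: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: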

Abstract:A growing body of research suggests that the cognitive processes of large language models (LLMs) differ fundamentally from those of humans. However, existing interpretability methods remain limited in explaining how cognitive abilities are engaged during LLM reasoning. In this paper, we propose UniCog, a unified framework that analyzes LLM cognition via a latent mind space. Formulated as a latent variable model, UniCog encodes diverse abilities from dense model activations into sparse, disentangled latent dimensions. Through extensive analysis on six advanced LLMs, including DeepSeek-V3.2 and GPT-4o, we reveal a Pareto principle of LLM cognition, where a shared reasoning core is complemented by ability-specific signatures. Furthermore, we discover that reasoning failures often manifest as anomalous intensity in latent activations. These findings open a new paradigm in LLM analysis, providing a cognition-grounded view of reasoning dynamics. Finally, leveraging these insights, we introduce a latent-informed candidate prioritization strategy, which improves reasoning performance by up to 7.5% across challenging benchmarks. Our code is available at this https URL.

[AI-74] When Personalization Legitimizes Risks: Uncovering Safety Vulnerabilities in Personalized Dialogue Agents

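[Quick read]: This paper addresses a new safety risk that long-term memory introduces into personalized large language model (LLM) agents: intent legitimation, in which benign personal memories bias the model's inference of user intent so that inherently harmful requests are treated as legitimate. The key to the solution is PS-Bench, a benchmark for identifying and quantifying this failure, together with mechanistic evidence from the internal representation space and a lightweight detection-reflection method that reduces the safety degradation caused by personalization.

Link: https://arxiv.org/abs/2601.17887
Authors: Jiahe Guo,Xiangran Guo,Yulin Hu,Zimo Long,Xingyu Sui,Xuda Zhi,Yongbo Huang,Hao He,Weixiang Zhao,Yanyan Zhao,Bing Qin
Affiliation: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: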

Abstract:Long-term memory enables large language model (LLM) agents to support personalized and sustained interactions. However, most work on personalized agents prioritizes utility and user experience, treating memory as a neutral component and largely overlooking its safety implications. In this paper, we reveal intent legitimation, a previously underexplored safety failure in personalized agents, where benign personal memories bias intent inference and cause models to legitimize inherently harmful queries. To study this phenomenon, we introduce PS-Bench, a benchmark designed to identify and quantify intent legitimation in personalized interactions. Across multiple memory-augmented agent frameworks and base LLMs, personalization increases attack success rates by 15.8%-243.7% relative to stateless baselines. We further provide mechanistic evidence for intent legitimation from the internal representation space, and propose a lightweight detection-reflection method that effectively reduces safety degradation. Overall, our work provides the first systematic exploration and evaluation of intent legitimation as a safety failure mode that naturally arises from benign, real-world personalization, highlighting the importance of assessing safety under long-term personal context. WARNING: This paper may contain harmful content.

[AI-75] Comparative Algorithmic Governance of Public Health Instruments across India EU US and LMICs

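[Quick read]: This paper addresses the insufficient harmonisation between normative health law and algorithmic public health infrastructures in resource-constrained jurisdictions, which leads to uneven implementation of public health instruments such as the International Health Regulations (IHR 2005) and the WHO Framework Convention on Tobacco Control (WHO FCTC). The key to its proposal is to embed artificial intelligence (AI) within a rights-compliant, supranationally coordinated regulatory framework, to model algorithmic treaty-making on the FCTC architecture, and to have the WHO lead compliance mechanisms modelled on the WTO Dispute Settlement Body, in order to strengthen equity, responsiveness, and cross-border resilience in global public health governance.

Link: https://arxiv.org/abs/2601.17877
Authors: Sahibpreet Singh
Affiliation: unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Chapter in “Law and Medicine” (Pacific Books International, 2025), pp. 409-423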

Abstract:The study investigates the juridico-technological architecture of international public health instruments, focusing on their implementation across India, the European Union, the United States and low- and middle-income countries (LMICs), particularly in Sub-Saharan Africa. It addresses a research lacuna: the insufficient harmonisation between normative health law and algorithmic public health infrastructures in resource-constrained jurisdictions. The principal objective is to assess how artificial intelligence augments implementation of instruments grounded in IHR 2005 and the WHO FCTC while identifying doctrinal and infrastructural bottlenecks. Using comparative doctrinal analysis and legal-normative mapping, the study triangulates legislative instruments, WHO monitoring frameworks, AI systems including BlueDot, Aarogya Setu and EIOS, and compliance metrics. Preliminary results show that AI has improved early detection, surveillance precision and responsiveness in high-capacity jurisdictions, whereas LMICs face infrastructural deficits, data privacy gaps and fragmented legal scaffolding. The findings highlight the relevance of the EU Artificial Intelligence Act and GDPR as regulatory prototypes for health-oriented algorithmic governance and contrast them with embryonic AI integration and limited internet penetration in many LMICs. The study argues for embedding AI within a rights-compliant, supranationally coordinated regulatory framework to secure equitable health outcomes and stronger compliance. It proposes a model for algorithmic treaty-making inspired by FCTC architecture and calls for WHO-led compliance mechanisms modelled on the WTO Dispute Settlement Body to enhance pandemic preparedness, surveillance equity and transnational governance resilience.

[AI-76] MergeMix: Optimizing Mid-Training Data Mixtures via Learnable Model Merging

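[Quick read]: This paper addresses the difficulty of optimizing data mixing ratios for large language models (LLMs): finding the composition that maximizes performance without expensive full-scale training or inefficient heuristic trial and error. The key is MergeMix, which repurposes model merging weights as a high-fidelity, low-cost performance proxy. It trains domain-specific experts on a small number of tokens and optimizes their merging weights against downstream benchmarks, sharply reducing search cost while matching or surpassing manual tuning.

Link: https://arxiv.org/abs/2601.17858
Authors: Jiapeng Wang,Changxin Tian,Kunlong Chen,Ziqi Liu,Jiaxin Mao,Wayne Xin Zhao,Zhiqiang Zhang,Jun Zhou
Affiliation: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: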

Abstract:Optimizing data mixtures is essential for unlocking the full potential of large language models (LLMs), yet identifying the optimal composition remains computationally prohibitive due to reliance on heuristic trials or expensive proxy training. To address this, we introduce MergeMix, a novel approach that efficiently determines optimal data mixing ratios by repurposing model merging weights as a high-fidelity, low-cost performance proxy. By training domain-specific experts on minimal tokens and optimizing their merging weights against downstream benchmarks, MergeMix effectively optimizes the performance of data mixtures without incurring the cost of full-scale training. Extensive experiments on models with 8B and 16B parameters validate that MergeMix achieves performance comparable to or surpassing exhaustive manual tuning while drastically reducing search costs. Furthermore, MergeMix exhibits high rank consistency (Spearman ρ 0.9) and strong cross-scale transferability, offering a scalable, automated solution for data mixture optimization.
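
The idea of treating merging weights as a stand-in for mixing ratios can be sketched on toy vectors. This is only an illustration of the principle: the linear parameter average, the grid search over the simplex, and the `score` callback are assumptions, and the abstract does not say how MergeMix merges experts or optimizes the weights.

```python
from itertools import product

def merge(experts, weights):
    """Weighted average of expert parameter vectors."""
    dim = len(experts[0])
    return [sum(w * e[i] for w, e in zip(weights, experts)) for i in range(dim)]

def search_merge_weights(experts, score, steps=10):
    """Grid-search weights on the simplex, keeping the best-scoring merge."""
    best_w, best_s = None, float("-inf")
    for combo in product(range(steps + 1), repeat=len(experts)):
        if sum(combo) != steps:
            continue
        w = [c / steps for c in combo]
        s = score(merge(experts, w))
        if s > best_s:
            best_w, best_s = w, s
    return best_w, best_s

# Toy setup: two "domain experts" and a benchmark that prefers a 30/70 blend.
experts = [[1.0, 0.0], [0.0, 1.0]]
target = [0.3, 0.7]
score = lambda p: -sum((a - b) ** 2 for a, b in zip(p, target))
weights, best = search_merge_weights(experts, score)
```

Each candidate is evaluated by merging already-trained experts rather than by training a new model on a new mixture, which is where the cost saving described in the abstract would come from.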

[AI-77] RAICL: Retrieval-Augmented In-Context Learning for Vision-Language-Model Based EEG Seizure Detection

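[Quick read]: This paper addresses the heavy dependence of current electroencephalogram (EEG) decoding methods on task-specific datasets for training specialized neural architectures, which hinders the development of general large brain decoding models. The core challenges are the non-stationarity of EEG signals and poor generalization with few samples. The key is a new paradigm based on vision-language models (VLMs): multivariate EEG signals are converted into stacked waveform images and combined with text prompts that encode neuroscience domain knowledge, so that a pre-trained VLM can recognize patterns of brain activity directly. The paper also proposes Retrieval-Augmented In-Context Learning (RAICL), which dynamically selects the most relevant and representative few-shot examples as conditioning input to cope with non-stationary EEG. On seizure detection the approach matches or outperforms traditional time-series methods without retraining or building downstream architectures, suggesting potential for clinical deployment.

Link: https://arxiv.org/abs/2601.17844
Authors: Siyang Li,Zhuoya Wang,Xiyan Gui,Xiaoqing Chen,Ziwei Wang,Yaozhi Wen,Dongrui Wu
Affiliation: unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: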

Abstract:Electroencephalogram (EEG) decoding is a critical component of medical diagnostics, rehabilitation engineering, and brain-computer interfaces. However, contemporary decoding methodologies remain heavily dependent on task-specific datasets to train specialized neural network architectures. Consequently, limited data availability impedes the development of generalizable large brain decoding models. In this work, we propose a paradigm shift from conventional signal-based decoding by leveraging large-scale vision-language models (VLMs) to analyze EEG waveform plots. By converting multivariate EEG signals into stacked waveform images and integrating neuroscience domain expertise into textual prompts, we demonstrate that foundational VLMs can effectively differentiate between different patterns in the human brain. To address the inherent non-stationarity of EEG signals, we introduce a Retrieval-Augmented In-Context Learning (RAICL) approach, which dynamically selects the most representative and relevant few-shot examples to condition the autoregressive outputs of the VLM. Experiments on EEG-based seizure detection indicate that state-of-the-art VLMs under RAICL achieved performance better than or comparable to traditional time-series-based approaches. These findings suggest a new direction in physiological signal processing that effectively bridges the modalities of vision, language, and neural activities. Furthermore, the utilization of off-the-shelf VLMs, without the need for retraining or downstream architecture construction, offers a readily deployable solution for clinical applications.
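
The retrieval step of retrieval-augmented in-context learning can be sketched as a nearest-neighbor lookup over stored examples. This covers only the relevance half of the selection described above; the feature vectors, the cosine-similarity criterion, and the example pool with its file names are assumptions for illustration, not the paper's method.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_examples(query_vec, pool, k=2):
    """Return the k stored examples most similar to the query segment.

    pool holds (feature_vector, label, image_path) tuples.
    """
    ranked = sorted(pool, key=lambda ex: cosine(query_vec, ex[0]), reverse=True)
    return ranked[:k]

pool = [
    ([1.0, 0.0], "seizure", "a.png"),
    ([0.0, 1.0], "normal", "b.png"),
    ([0.9, 0.1], "seizure", "c.png"),
]
shots = retrieve_examples([1.0, 0.0], pool, k=2)
```

The retrieved waveform images and their labels would then be placed in the prompt ahead of the query image, so the few-shot context changes with each new segment instead of being fixed in advance.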

[AI-78] Aligning Medical Conversational AI through Online Reinforcement Learning with Information-Theoretic Rewards

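[Quick read]: This paper addresses how to train medical conversational AI to conduct high-quality patient interviews and generate a comprehensive History of Present Illness (HPI) without pre-collected human conversations. Traditional methods rely on expensive expert-annotated conversations or static corpora, which limits generalization and adaptability. The key is Information Gain Fine-Tuning (IGFT), which combines online Group Relative Policy Optimization (GRPO) with information-theoretic rewards so that the model explores questioning strategies through self-generated conversations with simulated patients. IGFT's information gain reward measures how much each question reveals about clinical entities (such as symptoms, temporal patterns, and medical history) and is combined with GPT-4o-mini quality assessments of clinical relevance, patient engagement, and specificity, steering the model toward targeted questions that gather diagnostic information efficiently.

Link: https://arxiv.org/abs/2601.17828
Authors: Tanvi Verma,Yang Zhou,Rick Siow Mong Goh,Yong Liu
Affiliation: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: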

Abstract:We present Information Gain Fine-Tuning (IGFT), a novel approach for training medical conversational AI to conduct effective patient interviews and generate comprehensive History of Present Illness (HPI) without requiring pre-collected human conversations. IGFT combines online Group Relative Policy Optimization (GRPO) with information-theoretic rewards, enabling models to learn from self-generated conversations with simulated patients. Unlike existing approaches that rely on expensive expert-annotated conversations or static datasets, our online RL framework allows models to discover effective questioning strategies through exploration. Our key innovation is an information gain reward function that tracks which clinical entities, such as symptoms, temporal patterns, and medical history, are revealed during conversation. Each question’s reward is computed based on its expected information gain combined with GPT-4o-mini quality assessments across dimensions including clinical relevance, patient engagement, and specificity. This hybrid approach ensures models learn to ask targeted, clinically appropriate questions that efficiently gather diagnostic information. We fine-tune two models using LoRA: Llama-3.1-8B-Instruct and DeepSeek-R1-Distill-Qwen-7B (a reasoning-optimized model). Training exclusively on Avey data containing concise HPIs, we evaluate generalization to MIMIC data with longer, more elaborate HPIs. DeepSeek-R1-Distill-Qwen-7B (IGFT) achieves F1 scores of 0.408 on Avey (10.9% improvement over base) and 0.289 on MIMIC (12.9% improvement), while Llama-3.1-8B-Instruct (IGFT) reaches 0.384 and 0.336 respectively. Both models outperform OpenAI’s model on MIMIC and surpass medical domain-specific baselines like HuatuoGPT and UltraMedical, which were optimized for single-turn medical QA rather than multi-turn conversations.
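
A hybrid reward of this kind, followed by group-relative normalization, can be sketched as below. The sketch substitutes a simple count of newly revealed entities for the paper's expected information gain, and the mixing weight `gain_weight`, the use of the population standard deviation, and all function names are assumptions.

```python
from statistics import mean, pstdev

def information_gain(known_entities, revealed_entities):
    """Number of clinical entities a question revealed for the first time."""
    return len(set(revealed_entities) - set(known_entities))

def hybrid_reward(gain, quality, gain_weight=0.5):
    """Blend entity gain with an external quality score (assumed weighting)."""
    return gain_weight * gain + (1.0 - gain_weight) * quality

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

gain = information_gain({"fever"}, {"fever", "cough", "onset 3 days ago"})
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Because advantages are computed relative to other questions sampled for the same dialogue state, a question is reinforced only when it reveals more than its alternatives, not merely because it reveals something.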

[AI-79] RegGuard: AI-Powered Retrieval-Enhanced Assistant for Pharmaceutical Regulatory Compliance

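[Quick read]: This paper addresses the high cost and risk that multinational pharmaceutical companies face when compliance teams must manually interpret heterogeneous, multi-source regulatory texts and align them with internal policies amid increasingly frequent and complex regulatory updates. The key to the solution is the RegGuard system, whose core contributions are two modules: HiSACC (Hierarchical Semantic Aggregation for Contextual Chunking), which uses semantic clustering to split long documents into contextually consistent chunks and keeps information coherent across non-contiguous passages; and ReLACE (Regulatory Listwise Adaptive Cross-Encoder for Reranking), a domain-adapted cross-encoder built on an open-source model that jointly models the user query and candidate documents to improve ranking relevance. Evaluated in enterprise settings, the system shows clear gains in answer relevance, groundedness, and contextual focus, reduces hallucination risk, and supports auditability and traceability, making it applicable to other domains with stringent compliance requirements.

Link: https://arxiv.org/abs/2601.17826
Authors: Siyuan Yang, Xihan Bian, Jiayin Tang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: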

Abstract:The increasing frequency and complexity of regulatory updates present a significant burden for multinational pharmaceutical companies. Compliance teams must interpret evolving rules across jurisdictions, formats, and agencies, often manually, at high cost and risk of error. We introduce RegGuard, an industrial-scale AI assistant designed to automate the interpretation of heterogeneous regulatory texts and align them with internal corporate policies. The system ingests heterogeneous document sources through a secure pipeline and enhances retrieval and generation quality with two novel components: HiSACC (Hierarchical Semantic Aggregation for Contextual Chunking) semantically segments long documents into coherent units while maintaining consistency across non-contiguous sections. ReLACE (Regulatory Listwise Adaptive Cross-Encoder for Reranking), a domain-adapted cross-encoder built on an open-source model, jointly models user queries and retrieved candidates to improve ranking relevance. Evaluations in enterprise settings demonstrate that RegGuard improves answer quality specifically in terms of relevance, groundedness, and contextual focus, while significantly mitigating hallucination risk. The system architecture is built for auditability and traceability, featuring provenance tracking, access control, and incremental indexing, making it highly responsive to evolving document sources and relevant for any domain with stringent compliance demands.

[AI-80] MMR-Bench: A Comprehensive Benchmark for Multimodal LLM Routing

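[Quick read]: This paper addresses the "one-size-fits-all" usage pattern that arises when deploying multimodal large language models (MLLMs), whose architectures, alignment strategies, and efficiency differ widely: a single model cannot serve both lightweight tasks (such as OCR) and complex multimodal reasoning, so compute is wasted or accuracy suffers. The key to the solution is MMR-Bench, a unified benchmark that controls input modality characteristics, sets variable compute budgets, and provides an evaluation suite spanning OCR, general visual question answering (VQA), and multimodal math reasoning, so that query-level model selection (routing) can be evaluated properly. Experiments show that incorporating multimodal signals markedly improves routing quality and yields a better accuracy-cost trade-off at a fixed cost, and that routing policies trained on a subset of models generalize zero-shot to new datasets and text-only benchmarks without retuning, providing a foundation for adaptive multimodal model selection and efficient MLLM deployment.

Link: https://arxiv.org/abs/2601.17814
Authors: Haoxuan Ma, Guannan Lai, Han-Jia Ye
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: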

Abstract:Multimodal large language models (MLLMs) have advanced rapidly, yet heterogeneity in architecture, alignment strategies, and efficiency means that no single model is uniformly superior across tasks. In practical deployments, workloads span lightweight OCR to complex multimodal reasoning; using one MLLM for all queries either over-provisions compute on easy instances or sacrifices accuracy on hard ones. Query-level model selection (routing) addresses this tension, but extending routing from text-only LLMs to MLLMs is nontrivial due to modality fusion, wide variation in computational cost across models, and the absence of a standardized, budget-aware evaluation. We present MMR-Bench, a unified benchmark that isolates the multimodal routing problem and enables comparison under fixed candidate sets and cost models. MMR-Bench provides (i) a controlled environment with modality-aware inputs and variable compute budgets, (ii) a broad suite of vision-language tasks covering OCR, general VQA, and multimodal math reasoning, and (iii) strong single-model reference, oracle upper bounds, and representative routing policies. Using MMR-Bench, we show that incorporating multimodal signals improves routing quality. Empirically, these cues improve the cost-accuracy frontier and enable the routed system to exceed the strongest single model’s accuracy at roughly 33% of its cost. Furthermore, policies trained on a subset of models and tasks generalize zero-shot to new datasets and text-only benchmarks without retuning, establishing MMR-Bench as a foundation for studying adaptive multimodal model selection and efficient MLLM deployment. The code will be available at: this https URL.
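Query-level routing under a compute budget can be sketched in a few lines. This is a minimal illustration, not one of the routing policies evaluated in MMR-Bench: it assumes each candidate model comes with a per-query cost and a predicted accuracy, and the `Candidate` class, model names, and numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    cost: float                # cost of answering this query with the model
    predicted_accuracy: float  # router's estimate for this query, in [0, 1]


def route(candidates, budget):
    """Pick the most accurate affordable model; fall back to the cheapest one."""
    affordable = [c for c in candidates if c.cost <= budget]
    if not affordable:
        return min(candidates, key=lambda c: c.cost)
    # Ties on accuracy are broken in favour of the cheaper model.
    return max(affordable, key=lambda c: (c.predicted_accuracy, -c.cost))


pool = [
    Candidate("small-mllm", cost=1.0, predicted_accuracy=0.90),
    Candidate("large-mllm", cost=10.0, predicted_accuracy=0.92),
]
cheap_choice = route(pool, budget=5.0)
rich_choice = route(pool, budget=20.0)
```

On an easy query such as OCR, the predicted accuracies of the two models are close, so a tight budget costs little accuracy. That is the kind of cost-accuracy trade-off the benchmark is designed to measure.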

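[AI-81] Neuro-Symbolic Verification on Instruction Following of LLMs

[Quick read]: This paper addresses the difficulty of guaranteeing instruction following when large language models (LLMs) are used in practice, especially in LLM-based agentic workflows, where non-compliant outputs can propagate and amplify along reasoning chains and cause task failures or system incidents. The key to the solution is NSVIF, a neuro-symbolic framework that casts instruction-following verification as a constraint-satisfaction problem: user instructions are formalized as logical and semantic constraints, and a unified solver coordinates logical reasoning with semantic analysis, yielding general-purpose, interpretable verification of LLM outputs.

Link: https://arxiv.org/abs/2601.17789
Authors: Yiming Su, Kunzhao Xu, Yanjie Gao, Fan Yang, Cheng Li, Mao Yang, Tianyin Xu
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: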

Abstract:A fundamental problem of applying Large Language Models (LLMs) to important applications is that LLMs do not always follow instructions, and violations are often hard to observe or check. In LLM-based agentic workflows, such violations can propagate and amplify along reasoning chains, causing task failures and system incidents. This paper presents NSVIF, a neuro-symbolic framework for verifying whether an LLM’s output follows the instructions used to prompt the LLM. NSVIF is a universal, general-purpose verifier; it makes no assumption about the instruction or the LLM. NSVIF formulates instruction-following verification as a constraint-satisfaction problem by modeling user instructions as constraints. NSVIF models both logical and semantic constraints; constraint solving is done by a unified solver that orchestrates logical reasoning and semantic analysis. To evaluate NSVIF, we develop VIFBENCH, a new benchmark for instruction-following verifiers with fine-grained data labels. Experiments show that NSVIF significantly outperforms LLM-based approaches and provides interpretable feedback. We also show that feedback from NSVIF helps improve LLMs’ instruction-following capability without post-training.
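The idea of modeling instructions as constraints can be illustrated with a small sketch. It covers only the logical side (constraints that can be checked mechanically); semantic constraints would need a model-based checker, which is omitted here. This is not NSVIF's solver, and the helper names `max_words`, `is_json`, and `verify` are hypothetical.

```python
import json


def max_words(limit):
    """Constraint: the output has at most `limit` whitespace-separated words."""
    return lambda text: len(text.split()) <= limit


def is_json(text):
    """Constraint: the output parses as JSON."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


def verify(output, constraints):
    """Return the names of violated constraints (an empty list means compliant)."""
    return [name for name, check in constraints.items() if not check(output)]


constraints = {"valid_json": is_json, "max_5_words": max_words(5)}
ok = verify('{"a": 1}', constraints)
bad = verify("not json at all, and this is far too long", constraints)
```

Returning the violated constraint names, rather than a single pass/fail flag, is one simple way to make the verdict interpretable.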

[AI-82] Shortcut Learning in Binary Classifier Black Boxes: Applications to Voice Anti-Spoofing and Biometrics

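[Quick read]: This paper addresses "shortcut learning", also known as the "Clever Hans effect", in which deep learning models used in data-driven applications exploit biases in their training data and may therefore behave unpredictably at test time. The key to the solution is a novel framework for analyzing black-box classifiers that combines interventional and observational perspectives and applies a linear mixed-effects model for post-hoc analysis. It systematically assesses how training and test data affect classifier scores, going beyond conventional error-rate metrics to give a deeper understanding of biased datasets and their influence on classifier behavior.

Link: https://arxiv.org/abs/2601.17782
Authors: Md Sahidullah, Hye-jin Shim, Rosa Gonzalez Hautamäki, Tomi H. Kinnunen
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted for Publication in IEEE Journal of Selected Topics in Signal Processing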

Abstract:The widespread adoption of deep-learning models in data-driven applications has drawn attention to the potential risks associated with biased datasets and models. Neglected or hidden biases within datasets and models can lead to unexpected results. This study addresses the challenges of dataset bias and explores "shortcut learning" or the "Clever Hans effect" in binary classifiers. We propose a novel framework for analyzing black-box classifiers and for examining the impact of both training and test data on classifier scores. Our framework incorporates intervention and observational perspectives, employing a linear mixed-effects model for post-hoc analysis. By evaluating classifier performance beyond error rates, we aim to provide insights into biased datasets and offer a comprehensive understanding of their influence on classifier behavior. The effectiveness of our approach is demonstrated through experiments on audio anti-spoofing and speaker verification tasks using both statistical models and deep neural networks. The insights gained from this study have broader implications for tackling biases in other domains and advancing the field of explainable artificial intelligence.

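[AI-83] Time-Varying Causal Treatment for Quantifying the Causal Effect of Short-Term Variations on Arctic Sea Ice Dynamics

[Quick read]: This paper addresses the challenge of quantifying the causal relationship between ice melt and freshwater distribution, focusing on the causal effect of polar sea ice thickness changes on sea surface height (SSH). Conventional deep learning models struggle to estimate treatment effects reliably in spatiotemporal settings, mainly because of unobserved confounders and the lack of physical constraints. The key to the solution is the Knowledge-Guided Causal Model Variational Autoencoder (KGCM-VAE). It introduces a velocity modulation scheme that uses SSH transitions to dynamically scale smoothed velocity signals and thereby generate physically plausible causal treatments; it uses Maximum Mean Discrepancy (MMD) to balance the treated and control covariate distributions in the latent space; and it uses a causal adjacency-constrained decoder so that outputs respect known physical structure, improving the accuracy and interpretability of causal effect estimates.

Link: https://arxiv.org/abs/2601.17647
Authors: Akila Sampath, Vandana Janeja, Jianwu Wang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: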

Abstract:Quantifying the causal relationship between ice melt and freshwater distribution is critical, as these complex interactions manifest as regional fluctuations in sea surface height (SSH). Leveraging SSH as a proxy for sea ice dynamics enables improved understanding of the feedback mechanisms driving polar climate change and global sea-level rise. However, conventional deep learning models often struggle with reliable treatment effect estimation in spatiotemporal settings due to unobserved confounders and the absence of physical constraints. To address these challenges, we propose the Knowledge-Guided Causal Model Variational Autoencoder (KGCM-VAE) to quantify causal mechanisms between sea ice thickness and SSH. The proposed framework integrates a velocity modulation scheme in which smoothed velocity signals are dynamically amplified via a sigmoid function governed by SSH transitions to generate physically grounded causal treatments. In addition, the model incorporates Maximum Mean Discrepancy (MMD) to balance treated and control covariate distributions in the latent space, along with a causal adjacency-constrained decoder to ensure alignment with established physical structures. Experimental results on both synthetic and real-world Arctic datasets demonstrate that KGCM-VAE achieves superior PEHE compared to state-of-the-art benchmarks. Ablation studies further confirm the effectiveness of the approach, showing that the joint application of MMD and causal adjacency constraints yields a 1.88% reduction in estimation error.
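The MMD balancing term mentioned in the abstract can be illustrated with a short sketch. It assumes an RBF kernel and the biased (V-statistic) estimator of squared MMD; the abstract does not state which kernel or estimator KGCM-VAE uses, and the latent vectors below are toy values.

```python
import math


def rbf(x, y, gamma=1.0):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of MMD^2 between two samples of latent vectors."""
    def mean_kernel(a, b):
        return sum(rbf(u, v, gamma) for u in a for v in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


treated = [(0.0, 0.0), (1.0, 1.0)]
control_same = [(0.0, 0.0), (1.0, 1.0)]
control_shifted = [(3.0, 3.0), (4.0, 4.0)]

balanced = mmd_squared(treated, control_same)
imbalanced = mmd_squared(treated, control_shifted)
```

Used as a penalty during training, a term like this pushes the treated and control latent distributions toward each other, which is the balancing role the abstract describes.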

[AI-84] A Systemic Evaluation of Multimodal RAG Privacy

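[Quick read]: This paper addresses the privacy leakage that multimodal Retrieval-Augmented Generation (mRAG) pipelines may cause in vision tasks, in particular the risk of leaking sensitive information from private datasets at inference time. Its empirical analysis finds that, using only standard model prompting, an attacker can exploit the mRAG mechanism to infer whether a specific visual asset (such as an image) is present in the knowledge base and then extract associated metadata (such as its caption), exposing private content. The key takeaway is the need to design and adopt privacy-preserving mechanisms that block such leakage paths without sacrificing model performance, which sets a direction for future research on mRAG privacy.

Link: https://arxiv.org/abs/2601.17644
Authors: Ali Al-Lawati, Suhang Wang
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: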

Abstract:The growing adoption of multimodal Retrieval-Augmented Generation (mRAG) pipelines for vision-centric tasks (e.g. visual QA) introduces important privacy challenges. In particular, while mRAG provides a practical capability to connect private datasets to improve model performance, it risks the leakage of private information from these datasets during inference. In this paper, we perform an empirical study to analyze the privacy risks inherent in the mRAG pipeline observed through standard model prompting. Specifically, we implement a case study that attempts to infer the inclusion of a visual asset, e.g. image, in the mRAG, and if present leak the metadata, e.g. caption, related to it. Our findings highlight the need for privacy-preserving mechanisms and motivate future research on mRAG privacy.

[AI-85] Health-ORSC-Bench: A Benchmark for Measuring Over-Refusal and Safety Completion in Health Context

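[Quick read]: This paper addresses miscalibrated safety alignment of large language models (LLMs) in healthcare: when faced with borderline or dual-use queries, models either over-refuse benign requests or comply unsafely with harmful ones, creating a safety-utility trade-off. The key to the solution is Health-ORSC-Bench, the first large-scale benchmark that systematically evaluates both over-refusal and safe completion on boundary prompts, where safe completion means giving high-value guidance without enabling actual harm. The benchmark contains 31,920 benign boundary prompts across seven health categories and uses an automated pipeline with human validation to quantify model behavior at different levels of intent ambiguity. It reveals a widespread "safety-pessimism" in current frontier models (such as GPT-5 and Claude-4), shows that model family and size significantly affect calibration, and offers a quantifiable, comparable standard for building next-generation medical AI assistants that respond precisely, safely, and helpfully.

Link: https://arxiv.org/abs/2601.17642
Authors: Zhihao Zhang, Liting Huang, Guanghao Wu, Preslav Nakov, Heng Ji, Usman Naseem
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Preprint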

Abstract:Safety alignment in Large Language Models is critical for healthcare; however, reliance on binary refusal boundaries often results in over-refusal of benign queries or unsafe compliance with harmful ones. While existing benchmarks measure these extremes, they fail to evaluate Safe Completion: the model’s ability to maximise helpfulness on dual-use or borderline queries by providing safe, high-level guidance without crossing into actionable harm. We introduce Health-ORSC-Bench, the first large-scale benchmark designed to systematically measure Over-Refusal and Safe Completion quality in healthcare. Comprising 31,920 benign boundary prompts across seven health categories (e.g., self-harm, medical misinformation), our framework uses an automated pipeline with human validation to test models at varying levels of intent ambiguity. We evaluate 30 state-of-the-art LLMs, including GPT-5 and Claude-4, revealing a significant tension: safety-optimised models frequently refuse up to 80% of “Hard” benign prompts, while domain-specific models often sacrifice safety for utility. Our findings demonstrate that model family and size significantly influence calibration: larger frontier models (e.g., GPT-5, Llama-4) exhibit “safety-pessimism” and higher over-refusal than smaller or MoE-based counterparts (e.g., Qwen-3-Next), highlighting that current LLMs struggle to balance refusal and compliance. Health-ORSC-Bench provides a rigorous standard for calibrating the next generation of medical AI assistants toward nuanced, safe, and helpful completions. The code and data will be released upon acceptance. Warning: Some contents may include toxic or undesired contents.

[AI-86] BrainDistill: Implantable Motor Decoding with Task-Specific Knowledge Distillation

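[Quick read]: This paper addresses the difficulty of deploying large Transformer-based neural decoders for brain-computer interfaces (BCIs) in power-constrained implantable systems, given their large parameter counts and high computational demands. The key to the solution is BrainDistill, a new decoding pipeline that integrates an implantable neural decoder (IND) with a task-specific knowledge distillation (TSKD) framework. Rather than copying the teacher model's full representation, TSKD uses supervised projection to explicitly prioritize the features that matter for motor decoding, which gives strong performance with little calibration data. Combined with a quantization-aware training scheme that enables integer-only inference, the approach substantially lowers power consumption with minimal performance loss.

Link: https://arxiv.org/abs/2601.17625
Authors: Yuhan Xie, Jinhan Liu, Xiaoyong Ni, Fei Tan, Icare Sakr, Thibault Collin, Shiqi Sun, Alejandro Rodriguez Guajardo, Demon Fanny, Charles-francois Vincent Latchoumane, Henri Lorach, Jocelyne Bloch, Gregoire Courtine, Mahsa Shoaran
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 21 pages, 7 figures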

Abstract:Transformer-based neural decoders with large parameter counts, pre-trained on large-scale datasets, have recently outperformed classical machine learning models and small neural networks on brain-computer interface (BCI) tasks. However, their large parameter counts and high computational demands hinder deployment in power-constrained implantable systems. To address this challenge, we introduce BrainDistill, a novel implantable motor decoding pipeline that integrates an implantable neural decoder (IND) with a task-specific knowledge distillation (TSKD) framework. Unlike standard feature distillation methods that attempt to preserve teacher representations in full, TSKD explicitly prioritizes features critical for decoding through supervised projection. Across multiple neural datasets, IND consistently outperforms prior neural decoders on motor decoding tasks, while its TSKD-distilled variant further surpasses alternative distillation methods in few-shot calibration settings. Finally, we present a quantization-aware training scheme that enables integer-only inference with activation clipping ranges learned during training. The quantized IND enables deployment under the strict power constraints of implantable BCIs with minimal performance loss.
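The role of an activation clipping range in integer-only inference can be shown with a minimal sketch of uniform quantization. In the paper the clipping ranges are learned during training; here `clip_min` and `clip_max` are fixed, hypothetical values, and the helper names are illustrative only.

```python
def quantize(values, clip_min, clip_max, bits=8):
    """Clip activations to [clip_min, clip_max] and map them to unsigned integers."""
    levels = 2 ** bits - 1
    scale = (clip_max - clip_min) / levels
    codes = []
    for v in values:
        clipped = min(max(v, clip_min), clip_max)
        codes.append(round((clipped - clip_min) / scale))
    return codes, scale


def dequantize(codes, scale, clip_min):
    """Map integer codes back to approximate real-valued activations."""
    return [clip_min + c * scale for c in codes]


codes, scale = quantize([-2.0, 0.0, 2.0], clip_min=-1.0, clip_max=1.0)
restored = dequantize(codes, scale, clip_min=-1.0)
```

Values outside the clipping range saturate at the lowest or highest code. A narrower range gives finer resolution inside it at the cost of more saturation, which is why the range is a useful quantity to learn.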

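[AI-87] Human-Aligned Enhancement of Programming Answers with LLMs Guided by User Feedback

[Quick read]: This paper addresses the problem that programming answers on technical Q&A platforms (such as Stack Overflow) are often incomplete or outdated because feedback in user comments goes unaddressed, which undermines the reliability and trustworthiness of technical knowledge. The key to the solution is AUTOCOMBAT, a tool in which large language models (LLMs) jointly use user comments and question context to automatically identify and incorporate improvement-related feedback. It produces high-quality, intent-preserving revisions of programming answers, significantly outperforms the baseline, and was well received by developers in practice.

Link: https://arxiv.org/abs/2601.17604
Authors: Suborno Deb Bappon, Saikat Mondal, Chanchal K. Roy, Kevin Schneider
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Preprint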

Abstract:Large Language Models (LLMs) are widely used to support software developers in tasks such as code generation, optimization, and documentation. However, their ability to improve existing programming answers in a human-like manner remains underexplored. On technical question-and-answer platforms such as Stack Overflow (SO), contributors often revise answers based on user comments that identify errors, inefficiencies, or missing explanations. Yet roughly one-third of this feedback is never addressed due to limited time, expertise, or visibility, leaving many answers incomplete or outdated. This study investigates whether LLMs can enhance programming answers by interpreting and incorporating comment-based feedback. We make four main contributions. First, we introduce ReSOlve, a benchmark consisting of 790 SO answers with associated comment threads, annotated for improvement-related and general feedback. Second, we evaluate four state-of-the-art LLMs on their ability to identify actionable concerns, finding that DeepSeek achieves the best balance between precision and recall. Third, we present AUTOCOMBAT, an LLM-powered tool that improves programming answers by jointly leveraging user comments and question context. Compared to human revised references, AUTOCOMBAT produces near-human quality improvements while preserving the original intent and significantly outperforming the baseline. Finally, a user study with 58 practitioners shows strong practical value, with 84.5 percent indicating they would adopt or recommend the tool. Overall, AUTOCOMBAT demonstrates the potential of scalable, feedback-driven answer refinement to improve the reliability and trustworthiness of technical knowledge platforms.

[AI-88] Discovery of Feasible 3D Printing Configurations for Metal Alloys via AI-driven Adaptive Experimental Design

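[Quick read]: This paper addresses the difficulty of configuring process parameters for additive manufacturing of metal alloys, that is, efficiently finding input parameter combinations (such as laser power and scan speed) that yield high-quality prints. Conventional trial and error is inefficient because each validation is expensive and the parameter space is large. The key to the solution is to combine AI-driven adaptive experimental design with domain knowledge: a surrogate model is built from past experimental data and, in each iteration, is used to select a small batch of input configurations for validation, which greatly reduces the number of experiments and the resources consumed.

Link: https://arxiv.org/abs/2601.17587
Authors: Azza Fadhel, Nathaniel W. Zuckschwerdt, Aryan Deshwal, Susmita Bose, Amit Bandyopadhyay, Jana Doppa
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Proceedings of Innovative Applications of AI (IAAI) 2026 Conference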

Abstract:Configuring the parameters of additive manufacturing processes for metal alloys is a challenging problem due to complex relationships between input parameters (e.g., laser power, scan speed) and quality of printed outputs. The standard trial-and-error approach to find feasible parameter configurations is highly inefficient because validating each configuration is expensive in terms of resources (physical and human labor) and the configuration space is very large. This paper combines the general principles of AI-driven adaptive experimental design with domain knowledge to address the challenging problem of discovering feasible configurations. The key idea is to build a surrogate model from past experiments to intelligently select a small batch of input configurations for validation in each iteration. To demonstrate the effectiveness of this methodology, we deploy it for Directed Energy Deposition process to print GRCop–42, a high-performance copper–chromium–niobium alloy developed by NASA for aerospace applications. Within three months, our approach yielded multiple defect-free outputs across a range of laser powers dramatically reducing time to result and resource expenditure compared to several months of manual experimentation by domain scientists with no success. By enabling high-quality GRCop–42 fabrication on readily available infrared laser platforms for the first time, we democratize access to this critical alloy, paving the way for cost-effective, decentralized production for aerospace applications.

[AI-89] Prompt Driven Development with Claude Code: Building a Complete TUI Framework for the Ring Programming Language

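[Quick read]: The question this paper examines is how well current large language models (LLMs) can generate and maintain large, multi-module software systems through natural language interaction, a capability that remains poorly characterized and lacks empirical support. The key to the approach is a purely prompt-driven workflow using Claude Code with the Opus 4.5 model: a 7,420-line terminal user interface framework was developed over three days with 107 prompts covering feature requests, bug fixes, architectural guidance, and documentation generation. By combining quantitative prompt analysis with qualitative assessment of model behavior, the study provides evidence that modern LLMs can sustain architectural coherence without any manually written code and can support building production-grade tooling for emerging programming languages, positioning prompt-driven development as a viable methodology in software engineering practice.

Link: https://arxiv.org/abs/2601.17584
Authors: Mahmoud Samir Fayed, Ahmed Samir Fayed
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: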

Abstract:Large language models are increasingly used in software development, yet their ability to generate and maintain large, multi module systems through natural language interaction remains insufficiently characterized. This study presents an empirical analysis of developing a 7420 line Terminal User Interface framework for the Ring programming language, completed in roughly ten hours of active work spread across three days using a purely prompt driven workflow with Claude Code, Opus 4.5. The system was produced through 107 prompts: 21 feature requests, 72 bug fix prompts, 9 prompts sharing information from Ring documentation, 4 prompts providing architectural guidance, and 1 prompt dedicated to generating documentation. Development progressed across five phases, with the Window Manager phase requiring the most interaction, followed by complex UI systems and controls expansion. Bug related prompts covered redraw issues, event handling faults, runtime errors, and layout inconsistencies, while feature requests focused primarily on new widgets, window manager capabilities, and advanced UI components. Most prompts were short, reflecting a highly iterative workflow in which the human role was limited to specifying requirements, validating behaviour, and issuing corrective prompts without writing any code manually. The resulting framework includes a complete windowing subsystem, event driven architecture, interactive widgets, hierarchical menus, grid and tree components, tab controls, and a multi window desktop environment. By combining quantitative prompt analysis with qualitative assessment of model behaviour, this study provides empirical evidence that modern LLMs can sustain architectural coherence and support the construction of production grade tooling for emerging programming languages, highlighting prompt driven development as a viable methodology within software engineering practice.

[AI-90] How AI Coding Agents Modify Code: A Large-Scale Study of GitHub Pull Requests

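[Quick read]: This paper addresses the limited understanding of how AI coding agents contribute to open source software development, in particular how agent-generated pull requests (PRs) differ from human PRs in the way they modify code and in how consistently their descriptions match the changes. Using a large real-world dataset (the AIDev dataset), the study analyzes 24,014 merged agent-generated PRs (440,295 commits) and 5,081 merged human PRs (23,242 commits), quantitatively compares commit counts, files touched, and deleted lines, and uses lexical and semantic similarity to assess the consistency between PR descriptions and code diffs. The key to the approach is large-scale empirical comparison, which shows that agentic PRs differ substantially from human PRs in commit count (Cliff's δ = 0.5429) and are slightly more consistent between description and code changes, providing a basis for assessing the reliability of AI coding agents in real development workflows.

Link: https://arxiv.org/abs/2601.17581
Authors: Daniel Ogenrwot, John Businge
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 5 pages, 5 figures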

Abstract:AI coding agents are increasingly acting as autonomous contributors by generating and submitting pull requests (PRs). However, we lack empirical evidence on how these agent-generated PRs differ from human contributions, particularly in how they modify code and describe their changes. Understanding these differences is essential for assessing their reliability and impact on development workflows. Using the MSR 2026 Mining Challenge version of the AIDev dataset, we analyze 24,014 merged Agentic PRs (440,295 commits) and 5,081 merged Human PRs (23,242 commits). We examine additions, deletions, commits, and files touched, and evaluate the consistency between PR descriptions and their diffs using lexical and semantic similarity. Agentic PRs differ substantially from Human PRs in commit count (Cliff’s δ = 0.5429) and show moderate differences in files touched and deleted lines. They also exhibit slightly higher description-to-diff similarity across all measures. These findings provide a large-scale empirical characterization of how AI coding agents contribute to open source development.
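For readers unfamiliar with the effect size quoted above, Cliff's δ compares every value in one group with every value in the other: δ = (#pairs with x > y − #pairs with x < y) / (n·m), ranging from −1 to 1. The sketch below implements that definition directly; the commit counts are toy values, not data from the paper.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Toy commit counts per PR (hypothetical).
fully_separated = cliffs_delta([3, 4, 5], [1, 2])
identical_groups = cliffs_delta([1, 2], [1, 2])
```

A value of 1 means every value in the first group exceeds every value in the second, and 0 means neither group tends to be larger.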

[AI-91] Real-Time Trend Prediction via Continually-Aligned LLM Query Generation

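[Quick read]: This paper addresses the cold-start problem of trend detection in low-traffic search environments: with too little query volume, systems struggle to identify emerging or long-tail trends, and existing methods based on keyword frequency or query spikes respond slowly and perform poorly in sparse settings. The key to the solution is RTTP (Real-Time Trending Prediction), a framework in which a continual learning LLM (CL-LLM) generates search-style queries directly from news content and scores them by combining engagement strength with creator authority, surfacing trends before real search volume forms. It also introduces Mix-Policy DPO, a preference-based continual learning method that combines on-policy stability with off-policy novelty to mitigate catastrophic forgetting across model updates and keep performance improving steadily.

Link: https://arxiv.org/abs/2601.17567
Authors: Zijing Hui, Wenhan Lyu, Shusen Wang, Li Chen, Chu Wang
Affiliation: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: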

Abstract:Trending news detection in low-traffic search environments faces a fundamental cold-start problem, where a lack of query volume prevents systems from identifying emerging or long-tail trends. Existing methods relying on keyword frequency or query spikes are inherently slow and ineffective in these sparse settings, lagging behind real-world shifts in attention. We introduce RTTP, a novel Real-Time Trending Prediction framework that generates search queries directly from news content instead of waiting for users to issue them. RTTP leverages a continual learning LLM (CL-LLM) that converts posts into search-style queries and scores them using engagement strength + creator authority, enabling early trend surfacing before search volume forms. To ensure adaptation without degrading reasoning, we propose Mix-Policy DPO, a new preference-based continual learning approach that combines on-policy stability with off-policy novelty to mitigate catastrophic forgetting during model upgrades. Deployed at production scale on Facebook and Meta AI products, RTTP delivers +91.4% improvement in tail-trend detection precision@500 and +19% query generation accuracy over industry baselines, while sustaining stable performance after multi-week online training. This work demonstrates that LLM-generated synthetic search signals, when aligned and continually updated, unlock timely trend understanding in low-traffic search environments.

[AI-92] JaxARC: A High-Performance JAX-based Environment for Abstraction and Reasoning Research

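[Quick read]: This paper addresses the limited experimental scale of current Gymnasium-based reinforcement learning (RL) environments for abstract reasoning tasks, which is caused by computational bottlenecks. The key to the solution is JaxARC, a high-performance, stateless, functional RL environment implemented in JAX. By exploiting JAX's automatic parallelization and vectorization, it achieves speedups of 38x to 5,439x and a peak throughput of 790 million steps per second, making large-scale RL research on ARC (Abstraction and Reasoning Corpus) tasks feasible.

Link: https://arxiv.org/abs/2601.17564
Authors: Aadam, Monu Verma, Mohamed Abdel-Mottaleb
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: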

Abstract:The Abstraction and Reasoning Corpus (ARC) tests AI systems’ ability to perform human-like inductive reasoning from a few demonstration pairs. Existing Gymnasium-based RL environments severely limit experimental scale due to computational bottlenecks. We present JaxARC, an open-source, high-performance RL environment for ARC implemented in JAX. Its functional, stateless architecture enables massive parallelism, achieving 38-5,439x speedup over Gymnasium at matched batch sizes, with peak throughput of 790M steps/second. JaxARC supports multiple ARC datasets, flexible action spaces, composable wrappers, and configuration-driven reproducibility, enabling large-scale RL research previously computationally infeasible. JaxARC is available at this https URL.

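[AI-93] Towards Generalisable Imitation Learning Through Conditioned Transition Estimation and Online Behaviour Alignment (AAMAS 2026)

[Quick read]: This paper addresses three limitations of current imitation learning from observation (ILfO) methods: they depend on action-based supervised optimization, assume that each state has a single optimal action, and copy teacher actions without fully considering the environment state. To address these, the paper proposes Unsupervised Imitation Learning from Observation (UfO), whose core is a two-stage strategy: the first stage approximates the teacher's true actions by analyzing the observed state transitions, and the second stage adjusts the agent's trajectories to align closely with the teacher's, so that the policy is learned without supervision. Experiments show that UfO outperforms the teacher and all existing ILfO methods while having the smallest standard deviation, which indicates stronger generalization.

Link: https://arxiv.org/abs/2601.17563
Authors: Nathan Gavenski, Matteo Leonetti, Odinaldo Rodrigues
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: The 25th International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS 2026)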

Abstract:State-of-the-art imitation learning from observation methods (ILfO) have recently made significant progress, but they still have some limitations: they need action-based supervised optimisation, assume that states have a single optimal action, and tend to apply teacher actions without full consideration of the actual environment state. While the truth may be out there in observed trajectories, existing methods struggle to extract it without supervision. In this work, we propose Unsupervised Imitation Learning from Observation (UfO) that addresses all of these limitations. UfO learns a policy through a two-stage process, in which the agent first obtains an approximation of the teacher’s true actions in the observed state transitions, and then refines the learned policy further by adjusting agent trajectories to closely align them with the teacher’s. Experiments we conducted in five widely used environments show that UfO not only outperforms the teacher and all other ILfO methods but also displays the smallest standard deviation. This reduction in standard deviation indicates better generalisation in unseen scenarios.

[AI-94] Breaking the Protocol: Security Analysis of the Model Context Protocol Specification and Prompt Injection Vulnerabilities in Tool-Integrated LLM Agents

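[Quick read]: This paper addresses security vulnerabilities in the Model Context Protocol (MCP), which integrates large language models with external tools and has not yet received a formal security analysis. The study identifies three protocol-level vulnerabilities in the MCP architecture: the absence of capability attestation, which lets servers claim arbitrary permissions; bidirectional sampling without origin authentication, which enables server-side prompt injection; and implicit trust propagation in multi-server configurations. To address them, the authors propose MCPSec, a protocol extension built around capability attestation and message authentication that stays backward compatible, cuts attack success rates from 52.8% to 12.4%, and adds a median latency overhead of only 8.3 ms per message. The results indicate that MCP's security weaknesses stem from its architectural design rather than from specific implementations and require protocol-level remediation.

Link: https://arxiv.org/abs/2601.17549
Authors: Narek Maloyan, Dmitry Namiot
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: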

Abstract:The Model Context Protocol (MCP) has emerged as a de facto standard for integrating Large Language Models with external tools, yet no formal security analysis of the protocol specification exists. We present the first rigorous security analysis of MCP’s architectural design, identifying three fundamental protocol-level vulnerabilities: (1) absence of capability attestation allowing servers to claim arbitrary permissions, (2) bidirectional sampling without origin authentication enabling server-side prompt injection, and (3) implicit trust propagation in multi-server configurations. We implement MCPBench, a novel framework bridging existing agent security benchmarks to MCP-compliant infrastructure, enabling direct measurement of protocol-specific attack surfaces. Through controlled experiments on 847 attack scenarios across five MCP server implementations, we demonstrate that MCP’s architectural choices amplify attack success rates by 23–41% compared to equivalent non-MCP integrations. We propose MCPSec, a backward-compatible protocol extension adding capability attestation and message authentication, reducing attack success rates from 52.8% to 12.4% with median latency overhead of 8.3ms per message. Our findings establish that MCP’s security weaknesses are architectural rather than implementation-specific, requiring protocol-level remediation.
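Message authentication of the kind the abstract mentions can be illustrated with a keyed MAC over a canonical serialization of each message. The abstract does not say which scheme MCPSec uses, so HMAC-SHA256 with a pre-shared key is an assumption here, and the message fields and function names are hypothetical.

```python
import hashlib
import hmac
import json


def sign_message(message: dict, key: bytes) -> str:
    """Compute an HMAC-SHA256 tag over a canonical JSON encoding of the message."""
    payload = json.dumps(message, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_message(message: dict, tag: str, key: bytes) -> bool:
    """Check the tag in constant time to avoid leaking information via timing."""
    return hmac.compare_digest(sign_message(message, key), tag)


key = b"pre-shared-demo-key"  # hypothetical key, for illustration only
message = {"tool": "read_file", "args": {"path": "notes.txt"}}
tag = sign_message(message, key)

tampered = dict(message, tool="delete_file")
accepted = verify_message(message, tag, key)
rejected = verify_message(tampered, tag, key)
```

Sorting keys before hashing keeps the tag stable when field order changes. A tag like this detects tampering in transit, but it does not by itself address capability attestation, the other half of the proposed extension.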

[AI-95] Cognitive Platform Engineering for Autonomous Cloud Operations

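[Quick read]: This paper addresses the limits of modern DevOps practices in coping with the scale and dynamism of cloud-native systems: conventional rule-based automation cannot keep up with massive telemetry volumes and configuration drift, which leads to reactive operations, delayed remediation, and dependence on manual expertise. The key to the solution is "Cognitive Platform Engineering", a next-generation paradigm that integrates sensing, reasoning, and autonomous action directly into the platform lifecycle, together with a four-plane reference architecture covering data collection, intelligent inference, policy-driven orchestration, and human experience that forms a continuous feedback loop, enabling more efficient, compliant, and self-adjusting management of cloud environments.

Link: https://arxiv.org/abs/2601.17542
Authors: Vinoth Punniyamoorthy, Nitin Saksena, Srivenkateswara Reddy Sankiti, Nachiappan Chockalingam, Aswathnarayan Muthukrishnan Kirubakaran, Shiva Kumar Reddy Carimireddy, Durgaraman Maruthavanan
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: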

Abstract:Modern DevOps practices have accelerated software delivery through automation, CI/CD pipelines, and observability tooling, but these approaches struggle to keep pace with the scale and dynamism of cloud-native systems. As telemetry volume grows and configuration drift increases, traditional, rule-driven automation often results in reactive operations, delayed remediation, and dependency on manual expertise. This paper introduces Cognitive Platform Engineering, a next-generation paradigm that integrates sensing, reasoning, and autonomous action directly into the platform lifecycle. This paper proposes a four-plane reference architecture that unifies data collection, intelligent inference, policy-driven orchestration, and human experience layers within a continuous feedback loop. A prototype implementation built with Kubernetes, Terraform, Open Policy Agent, and ML-based anomaly detection demonstrates improvements in mean time to resolution, resource efficiency, and compliance. The results show that embedding intelligence into platform operations enables resilient, self-adjusting, and intent-aligned cloud environments. The paper concludes with research opportunities in reinforcement learning, explainable governance, and sustainable self-managing cloud ecosystems.

[AI-96] Reconstructing Training Data from Adapter-based Federated Large Language Models WWW

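[Quick Read]: This paper addresses privacy leakage in adapter-based federated large language models (FedLLMs) during parameter-efficient fine-tuning. Existing work assumes that freezing the backbone and training only low-rank adapters limits gradient leakage and resists Gradient Inversion Attacks (GIAs), but the paper shows that low-rank adapters in fact create new, exploitable leakage channels. The key contribution is a gradient inversion attack tailored to adapter structures, the Unordered-word-bag-based Text Reconstruction (UTR) attack, which reconstructs text with high accuracy through three mechanisms: (i) inferring token presence from attention patterns in frozen layers, (ii) performing sentence-level gradient inversion within the low-rank subspace of adapter gradients, and (iii) enforcing semantic coherence through constrained greedy decoding guided by language priors. Experiments show that UTR achieves near-perfect reconstruction accuracy (ROUGE-1/2 > 99) across several models and datasets, clearly outperforming prior GIAs and revealing a fundamental trade-off between parameter efficiency and privacy protection.

Link: https://arxiv.org/abs/2601.17533
Authors: Silong Chen, Yuchuan Luo, Guilin Deng, Yi Liu, Min Xu, Shaojing Fu, Xiaohua Jia
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Yuchuan Luo and Yi Liu are co-corresponding authors. Accepted at The Web Conference (WWW) 2026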

Abstract:Adapter-based Federated Large Language Models (FedLLMs) are widely adopted to reduce the computational, storage, and communication overhead of full-parameter fine-tuning for web-scale applications while preserving user privacy. By freezing the backbone and training only compact low-rank adapters, these methods appear to limit gradient leakage and thwart existing Gradient Inversion Attacks (GIAs). Contrary to this assumption, we show that low-rank adapters create new, exploitable leakage channels. We propose the Unordered-word-bag-based Text Reconstruction (UTR) attack, a novel GIA tailored to the unique structure of adapter-based FedLLMs. UTR overcomes three core challenges: low-dimensional gradients, frozen backbones, and combinatorially large reconstruction spaces by: (i) inferring token presence from attention patterns in frozen layers, (ii) performing sentence-level inversion within the low-rank subspace of adapter gradients, and (iii) enforcing semantic coherence through constrained greedy decoding guided by language priors. Extensive experiments across diverse models (GPT2-Large, BERT, Qwen2.5-7B) and datasets (CoLA, SST-2, Rotten Tomatoes) demonstrate that UTR achieves near-perfect reconstruction accuracy (ROUGE-1/2 > 99), even with large batch size settings where prior GIAs fail completely. Our results reveal a fundamental tension between parameter efficiency and privacy in FedLLMs, challenging the prevailing belief that lightweight adaptation inherently enhances security. Our code and data are available at this https URL.

[AI-97] Automatic Stability and Recovery for Neural Network Training

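[Quick Read]: This paper addresses irreversible divergence and silent performance degradation caused by rare but severe destabilizing updates in modern neural network training. Existing optimization methods rely mainly on preventive mechanisms embedded in the optimizer and offer limited ability to detect and recover from instability once it occurs. The key to the solution is a supervisory runtime stability framework that treats optimization as a controlled stochastic process. By isolating an innovation signal derived from secondary measurements such as validation probes, it detects and recovers from destabilizing updates automatically, without modifying the underlying optimizer. The framework also provides theoretical runtime safety guarantees that formalize bounded degradation and recovery.

Link: https://arxiv.org/abs/2601.17483
Authors: Barak Or
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under Review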

Abstract:Training modern neural networks is increasingly fragile, with rare but severe destabilizing updates often causing irreversible divergence or silent performance degradation. Existing optimization methods primarily rely on preventive mechanisms embedded within the optimizer, offering limited ability to detect and recover from instability once it occurs. We introduce a supervisory runtime stability framework that treats optimization as a controlled stochastic process. By isolating an innovation signal derived from secondary measurements, such as validation probes, the framework enables automatic detection and recovery from destabilizing updates without modifying the underlying optimizer. We provide theoretical runtime safety guarantees that formalize bounded degradation and recovery. Our implementation incurs minimal overhead and is compatible with memory-constrained training settings.
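The general idea of a supervisor that sits outside the optimizer can be illustrated with a toy rollback loop. This is not the paper's algorithm: the acceptance rule (reject a step whose probe loss exceeds `tolerance` times the last accepted loss) and all names below are assumptions chosen only to show detection and recovery without touching the update rule itself.

```python
def supervised_training(updates, probe, w0, tolerance=2.0, floor=1e-12):
    """Apply update functions in order, rolling back any step that the probe flags.

    updates:   iterable of functions mapping parameters -> new parameters
    probe:     secondary measurement (e.g. a validation loss) on parameters
    tolerance: hypothetical threshold on the relative jump in probe loss
    """
    w = w0
    last_loss = probe(w0)
    rollbacks = 0
    for step in updates:
        candidate = step(w)
        loss = probe(candidate)
        if loss > tolerance * max(last_loss, floor):
            rollbacks += 1      # recovery: keep the last accepted parameters
            continue
        w, last_loss = candidate, loss
    return w, rollbacks


def probe_loss(w):
    return w * w                # toy validation probe for f(w) = w^2


def gradient_step(w):
    return w - 0.1 * 2 * w      # ordinary gradient-descent step on w^2


def bad_step(w):
    return w + 100.0            # simulated destabilizing update
```

Running `supervised_training([gradient_step, gradient_step, bad_step, gradient_step], probe_loss, 1.0)` rejects the third update and continues from the last accepted parameters.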

[AI-98] Lattice: Generative Guardrails for Conversational Agents

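[Quick Read]: This paper addresses the problem that guardrails in current conversational AI systems rely on static rules and cannot adapt to new threats or changing deployment contexts. The key to the solution is the Lattice framework, which constructs and continuously improves guardrails in two stages: the first stage builds initial guardrails from labeled examples through iterative simulation and optimization; the second stage adapts deployed guardrails in a closed loop through risk assessment, adversarial testing, and rule consolidation. This yields a 7-percentage-point F1 improvement on cross-domain data, supporting the effectiveness of self-constructed guardrails based on iterative optimization.

Link: https://arxiv.org/abs/2601.17481
Authors: Emily Broadhurst, Tawab Safi, Joseph Edell, Vashisht Ganesh, Karime Maamari
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: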

Abstract:Conversational AI systems require guardrails to prevent harmful outputs, yet existing approaches use static rules that cannot adapt to new threats or deployment contexts. We introduce Lattice, a framework for self-constructing and continuously improving guardrails. Lattice operates in two stages: construction builds initial guardrails from labeled examples through iterative simulation and optimization; continuous improvement autonomously adapts deployed guardrails through risk assessment, adversarial testing, and consolidation. Evaluated on the ProsocialDialog dataset, Lattice achieves 91% F1 on held-out data, outperforming keyword baselines by 43pp, LlamaGuard by 25pp, and NeMo by 4pp. The continuous improvement stage achieves 7pp F1 improvement on cross-domain data through closed-loop optimization. Our framework shows that effective guardrails can be self-constructed through iterative optimization.

[AI-99] Embodiment-Induced Coordination Regimes in Tabular Multi-Agent Q-Learning

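[Quick Read]: This paper asks whether centralized value learning really improves coordination and stability in multi-agent reinforcement learning (MARL), particularly under embodiment constraints. Prior work commonly assumes that centralized learning outperforms independent learning, but the assumption has rarely been tested under controlled conditions. The key to the approach is a fully tabular predator-prey gridworld in which dynamics parameters such as agent speed and stamina are strictly controlled, and independent Q-learning is compared with centralized Q-learning across multiple kinematic regimes and asymmetric role settings. The design removes confounders such as function approximation and representation learning, isolating the effect of coordination structure itself. The results show that centralized learning is not always beneficial: increased coordination can become a liability, and its effectiveness depends strongly on the kinematic regime and on agent roles.

Link: https://arxiv.org/abs/2601.17454
Authors: Muhammad Ahmed Atif, Nehal Naeem Haji, Mohammad Shahid Shaikh, Muhammad Ebad Atif
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: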

Abstract:Centralized value learning is often assumed to improve coordination and stability in multi-agent reinforcement learning, yet this assumption is rarely tested under controlled conditions. We directly evaluate it in a fully tabular predator-prey gridworld by comparing independent and centralized Q-learning under explicit embodiment constraints on agent speed and stamina. Across multiple kinematic regimes and asymmetric agent roles, centralized learning fails to provide a consistent advantage and is frequently outperformed by fully independent learning, even under full observability and exact value estimation. Moreover, asymmetric centralized-independent configurations induce persistent coordination breakdowns rather than transient learning instability. By eliminating confounding effects from function approximation and representation learning, our tabular analysis isolates coordination structure as the primary driver of these effects. The results show that increased coordination can become a liability under embodiment constraints, and that the effectiveness of centralized learning is fundamentally regime and role dependent rather than universal.
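For readers unfamiliar with the two setups being compared, the sketch below shows the standard tabular Q-learning update and how the table differs between independent and centralized learners. The action set, learning rate, and discount are placeholder assumptions, not values from the paper.

```python
from collections import defaultdict
from itertools import product

ACTIONS = ("stay", "up", "down", "left", "right")  # hypothetical action set


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).

    `action` may be a single action or a joint-action tuple.
    """
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q[(state, action)]


# Independent learners: one table per agent, indexed by that agent's own action.
independent_q = [defaultdict(float) for _ in range(2)]

# Centralized learner: a single table indexed by the joint action of all agents,
# so the action dimension grows as |A|^n (25 entries per state for two agents here).
JOINT_ACTIONS = tuple(product(ACTIONS, repeat=2))
centralized_q = defaultdict(float)
```

The same `q_update` serves both cases; only the key space changes, which is what makes a like-for-like tabular comparison possible.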

[AI-100] Towards a Declarative Agentic Layer for Intelligent Agents in MCP-Based Server Ecosystems

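[Quick Read]: This paper addresses reliability problems in current agent systems based on large language models (LLMs), such as hallucinated actions, unexecutable plans, and brittle multi-agent coordination. These failures do not stem from limitations of the underlying models but from the absence of explicit architectural structure linking goals, capabilities, and execution. The key to the solution is DALIA (Declarative Agentic Layer for Intelligent Agents), a declarative, model-independent architectural layer that formalizes executable capabilities, exposes tasks through a declarative discovery protocol, maintains a federated directory of agents and their execution resources, and constructs deterministic task graphs grounded exclusively in declared operations. By clearly separating discovery, planning, and execution, it constrains agent behavior to a verifiable operational space, reduces reliance on speculative reasoning and free-form coordination, and enables reproducible and verifiable agentic workflows across heterogeneous environments.

Link: https://arxiv.org/abs/2601.17435
Authors: Maria Jesus Rodriguez-Sanchez, Manuel Noguera, Angel Ruiz-Zafra, Kawtar Benghazi
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 12 pages, 3 figures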

Abstract:Recent advances in Large Language Models (LLMs) have enabled the development of increasingly complex agentic and multi-agent systems capable of planning, tool use and task decomposition. However, empirical evidence shows that many of these systems suffer from fundamental reliability issues, including hallucinated actions, unexecutable plans and brittle coordination. Crucially, these failures do not stem from limitations of the underlying models themselves, but from the absence of explicit architectural structure linking goals, capabilities and execution. This paper presents a declarative, model-independent architectural layer for grounded agentic workflows that addresses this gap. The proposed layer, referred to as DALIA (Declarative Agentic Layer for Intelligent Agents), formalises executable capabilities, exposes tasks through a declarative discovery protocol, maintains a federated directory of agents and their execution resources, and constructs deterministic task graphs grounded exclusively in declared operations. By enforcing a clear separation between discovery, planning and execution, the architecture constrains agent behaviour to a verifiable operational space, reducing reliance on speculative reasoning and free-form coordination. We present the architecture and design principles of the proposed layer and illustrate its operation through a representative task-oriented scenario, demonstrating how declarative grounding enables reproducible and verifiable agentic workflows across heterogeneous environments.
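One way to picture "deterministic task graphs grounded exclusively in declared operations" is a planner that refuses any step not present in a declared registry and orders the rest reproducibly. The sketch below is a hypothetical illustration; the data shapes and the function name are assumptions and do not reflect DALIA's actual interfaces.

```python
def build_task_graph(declared, plan):
    """Return a reproducible execution order for `plan`.

    declared: set of operation names that agents have declared
    plan:     dict mapping each operation to the list of operations it depends on
    Raises ValueError for undeclared operations or cyclic dependencies.
    """
    mentioned = set(plan) | {dep for deps in plan.values() for dep in deps}
    unknown = sorted(mentioned - declared)
    if unknown:
        raise ValueError(f"undeclared operations: {unknown}")

    remaining = {op: set() for op in mentioned}
    for op, deps in plan.items():
        remaining[op].update(deps)

    order = []
    while remaining:
        # Sorting the ready set makes the order independent of dict insertion order.
        ready = sorted(op for op, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("cycle in task graph")
        for op in ready:
            order.append(op)
            del remaining[op]
        for deps in remaining.values():
            deps.difference_update(ready)
    return order
```

A plan that references an operation outside the registry (a "hallucinated action") fails before anything executes, and the same inputs always produce the same order.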

[AI-101] A Syllogistic Probe: Tracing the Evolution of Logic Reasoning in Large Language Models

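[Quick Read]: This paper investigates whether large language models (LLMs) show, in their underlying logical framework, a trend similar to the human shift from intuition-driven inference to rigorous formal systems, focusing on how traditional and modern logic differ in syllogistic reasoning. The key to the approach is to use existential import as a probe and to run systematic experiments on state-of-the-art LLMs with a new syllogism dataset, revealing how model scale, thinking, and the base model affect this shift: (i) growth in parameter count moves models toward modern logic; (ii) thinking accelerates the shift efficiently, beyond what parameter scaling alone achieves; (iii) the base model determines how easily and stably the shift emerges.

Link: https://arxiv.org/abs/2601.17426
Authors: Zhengqing Zang, Yuqi Ding, Yanmei Gu, Changkai Song, Zhengkai Yang, Guoping Du, Junbo Zhao, Haobo Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: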

Abstract:Human logic has gradually shifted from intuition-driven inference to rigorous formal systems. Motivated by recent advances in large language models (LLMs), we explore whether LLMs exhibit a similar evolution in the underlying logical framework. Using existential import as a probe, we evaluate syllogisms under traditional and modern logic. Through extensive experiments testing SOTA LLMs on a new syllogism dataset, we report several findings: (i) Model size scaling promotes the shift toward modern logic; (ii) Thinking serves as an efficient accelerator beyond parameter scaling; (iii) the Base model plays a crucial role in determining how easily and stably this shift can emerge. Beyond these core factors, we conduct additional experiments for in-depth analysis of properties of current LLMs on syllogistic reasoning.
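Existential import is the assumption, made in traditional logic, that every term in a categorical statement names a non-empty class; modern logic drops it, so "All M are P" is vacuously true when nothing is M. The sketch below checks syllogism validity by brute force over Venn-region occupancy and shows a form (Darapti) whose validity flips between the two readings. It is an independent illustration, not the paper's dataset or evaluation code.

```python
from itertools import product

S, M, P = 0, 1, 2
# Each Venn region is a triple (in_S, in_M, in_P); a model is the set of inhabited regions.
REGIONS = list(product([False, True], repeat=3))


def all_are(a, b):
    return lambda model: all(r[b] for r in model if r[a])


def some_are(a, b):
    return lambda model: any(r[a] and r[b] for r in model)


def models(existential_import):
    for mask in product([False, True], repeat=len(REGIONS)):
        model = [r for r, keep in zip(REGIONS, mask) if keep]
        if not model:
            continue  # skip the empty domain
        if existential_import and not all(any(r[t] for r in model) for t in (S, M, P)):
            continue  # traditional reading: every term must be non-empty
        yield model


def is_valid(premises, conclusion, existential_import):
    """Valid iff the conclusion holds in every model that satisfies all premises."""
    return all(
        conclusion(m)
        for m in models(existential_import)
        if all(p(m) for p in premises)
    )


darapti = ([all_are(M, P), all_are(M, S)], some_are(S, P))
barbara = ([all_are(M, P), all_are(S, M)], all_are(S, P))
```

`is_valid(*darapti, existential_import=True)` holds while `is_valid(*darapti, existential_import=False)` does not; Barbara is valid under both readings, so only forms like Darapti can separate the two frameworks.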

[AI-102] GO-OSC and VASH: Geometry-Aware Representation Learning for Early Degradation Detection in Oscillatory Systems

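[Quick Read]: This paper addresses delayed or unstable detection of early degradation in oscillatory systems, which arises because classical energy-based diagnostics and unconstrained learned representations are structurally insensitive in this regime. The key to the solution is GO-OSC, a geometry-aware representation learning framework that enforces a canonical and identifiable latent parameterization, enabling stable comparison and aggregation of dynamics across short, unlabeled windows. On top of this representation, the authors define a family of invariant linear geometric probes that target degradation-relevant directions in latent space. The theory shows that under early phase-only degradation, energy-based statistics have zero first-order detection power whereas geometric probes have strictly positive sensitivity, which allows earlier and more robust detection.

Link: https://arxiv.org/abs/2601.17396
Authors: Vashista Nobaub
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 21 pages, 5 figures. Includes theoretical analysis, ablation studies, and experiments on synthetic and real vibration datasets. Code available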

Abstract:Early-stage degradation in oscillatory systems often manifests as geometric distortions of the dynamics, such as phase jitter, frequency drift, or loss of coherence, long before changes in signal energy are detectable. In this regime, classical energy-based diagnostics and unconstrained learned representations are structurally insensitive, leading to delayed or unstable detection. We introduce GO-OSC, a geometry-aware representation learning framework for oscillatory time series that enforces a canonical and identifiable latent parameterization, enabling stable comparison and aggregation across short, unlabeled windows. Building on this representation, we define a family of invariant linear geometric probes that target degradation-relevant directions in latent space. We provide theoretical results showing that under early phase-only degradation, energy-based statistics have zero first-order detection power, whereas geometric probes achieve strictly positive sensitivity. Our analysis characterizes when and why linear probing fails under non-identifiable representations and shows how canonicalization restores statistical detectability. Experiments on synthetic benchmarks and real vibration datasets validate the theory, demonstrating earlier detection, improved data efficiency, and robustness to operating condition changes.

[AI-103] Prompt and Circumstances: Evaluating the Efficacy of Human Prompt Inference in AI-Generated Art

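[Quick Read]: This paper asks whether, in generative AI art, concealed prompts sold on prompt marketplaces can be considered legitimate intellectual property, given that humans or AI tools may be able to infer them from publicly advertised sample images. The core challenge is to assess how recoverable prompts are and how well inferred prompts reproduce the original content. The key to the approach is a human subject study that compares prompt inference by humans alone, by AI alone, and by humans and AI combined (merged with the help of a large language model), and quantifies the similarity between generated and original images. The study finds that although human and combined human-AI inference produce images with a moderate level of similarity, they remain less successful than the original prompt, and the proposed merging techniques do not outperform purely human-inferred prompts, providing empirical evidence relevant to intellectual property protection for prompts.

Link: https://arxiv.org/abs/2601.17379
Authors: Khoi Trinh, Scott Seidenberger, Joseph Spracklen, Raveen Wijewickrama, Bimal Viswanath, Murtuza Jadliwala, Anindya Maiti
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: To appear in EvoMUSART 2026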

Abstract:The emerging field of AI-generated art has witnessed the rise of prompt marketplaces, where creators can purchase, sell, or share prompts to generate unique artworks. These marketplaces often assert ownership over prompts, claiming them as intellectual property. This paper investigates whether concealed prompts sold on prompt marketplaces can be considered bona fide intellectual property, given that humans and AI tools may be able to infer the prompts based on publicly advertised sample images accompanying each prompt on sale. Specifically, our study aims to assess (i) how accurately humans can infer the original prompt solely by examining an AI-generated image, with the goal of generating images similar to the original image, and (ii) the possibility of improving upon individual human and AI prompt inferences by crafting combined human and AI prompts with the help of a large language model. Although previous research has explored AI-driven prompt inference and protection strategies, our work is the first to incorporate a human subject study and examine collaborative human-AI prompt inference in depth. Our findings indicate that while prompts inferred by humans and prompts inferred through a combined human and AI effort can generate images with a moderate level of similarity, they are not as successful as using the original prompt. Moreover, combining human- and AI-inferred prompts using our suggested merging techniques did not improve performance over purely human-inferred prompts.

[AI-104] Res-MIA: A Training-Free Resolution-Based Membership Inference Attack on Federated Learning Models

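[Quick Read]: This paper addresses privacy leakage from models in federated learning (FL), specifically membership inference attacks (MIAs) in the black-box setting, where an attacker uses the target model's outputs to decide whether a particular sample was used in training. Existing evidence shows that even under decentralized federated learning, the global model can still leak sensitive membership information. The authors propose Res-MIA, a novel training-free, black-box membership inference attack. Its key idea is to exploit the sensitivity of deep models to high-frequency input details: the input resolution is progressively reduced through controlled downsampling and restoration, and the resulting change in prediction confidence is analyzed. Training samples show a significantly steeper confidence decay under resolution degradation, which yields a reliable membership signal. The method needs no shadow models and no auxiliary data, only a limited number of forward queries. On CIFAR-10 it reaches an AUC of up to 0.88, indicating that frequency-sensitive overfitting is a previously underexplored source of privacy leakage in federated learning.

Link: https://arxiv.org/abs/2601.17378
Authors: Mohammad Zare, Pirooz Shamsinejadbabaki
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: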

Abstract:Membership inference attacks (MIAs) pose a serious threat to the privacy of machine learning models by allowing adversaries to determine whether a specific data sample was included in the training set. Although federated learning (FL) is widely regarded as a privacy-aware training paradigm due to its decentralized nature, recent evidence shows that the final global model can still leak sensitive membership information through black-box access. In this paper, we introduce Res-MIA, a novel training-free and black-box membership inference attack that exploits the sensitivity of deep models to high-frequency input details. Res-MIA progressively degrades the input resolution using controlled downsampling and restoration operations, and analyzes the resulting confidence decay in the model’s predictions. Our key insight is that training samples exhibit a significantly steeper confidence decline under resolution erosion compared to non-member samples, revealing a robust membership signal. Res-MIA requires no shadow models, no auxiliary data, and only a limited number of forward queries to the target model. We evaluate the proposed attack on a federated ResNet-18 trained on CIFAR-10, where it consistently outperforms existing training-free baselines and achieves an AUC of up to 0.88 with minimal computational overhead. These findings highlight frequency-sensitive overfitting as an important and previously underexplored source of privacy leakage in federated learning, and emphasize the need for privacy-aware model designs that reduce reliance on fine-grained, non-robust input features.
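The confidence-decay signal described above can be sketched in a few lines. The abstract does not give the exact degradation operators or scoring rule, so the average-pool-then-repeat degradation, the choice of factors, and the mean-drop score below are assumptions; `confidence` stands for any black-box function returning the model's confidence for an image.

```python
def degrade(image, factor):
    """Average-pool `image` (a list of rows) by `factor`, then upsample by pixel repetition."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for by in range(0, h, factor):
        for bx in range(0, w, factor):
            ys = range(by, min(by + factor, h))
            xs = range(bx, min(bx + factor, w))
            mean = sum(image[y][x] for y in ys for x in xs) / (len(ys) * len(xs))
            for y in ys:
                for x in xs:
                    out[y][x] = mean
    return out


def decay_score(confidence, image, factors=(2, 4)):
    """Mean drop in confidence relative to the full-resolution input.

    A larger score means confidence erodes faster as resolution is reduced,
    which the paper reports as characteristic of training members.
    """
    base = confidence(image)
    drops = [base - confidence(degrade(image, f)) for f in factors]
    return sum(drops) / len(drops)
```

Under this sketch, an attacker would threshold `decay_score` to label a sample as member or non-member; choosing that threshold is outside the scope of the illustration.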

[AI-105] Diversified Scaling Inference in Time Series Foundation Models

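[Quick Read]: This paper addresses the underuse of inference-time compute in Time Series Foundation Models (TSFMs), in particular the bottleneck that standard sampling-based inference often fails to follow scaling laws and yields limited gains. The core solution is controlled diversified inference scaling: tailored time series perturbations expand the support of the generative distribution and thereby improve exploration of the solution space. The key contributions are a theoretical analysis of the diversity-fidelity trade-off and the derivation of a critical sample threshold above which diversified sampling outperforms standard sampling. The work shows that properly applied diversified inference can substantially improve TSFM performance without updating model parameters, establishing inference design as a compute-efficient dimension of TSFM optimization.

Link: https://arxiv.org/abs/2601.17376
Authors: Ruijin Hua, Zichuan Liu, Kun Zhang, Yiyuan Yang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 23 pages, 16 figures, 9 tables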

Abstract:The advancement of Time Series Foundation Models (TSFMs) has been driven primarily by large-scale pre-training, but inference-time compute potential remains largely untapped. This work systematically investigates two questions: how do TSFMs behave under standard sampling-based inference scaling, and can controlled sampling diversity enhance performance? We first examine the properties of TSFMs under standard sampling and find that they often fail to adhere to scaling laws due to insufficient exploration of the solution space. Building on this, we then delve into diversified inference scaling via tailored time series perturbations to expand the generative distribution’s support. We theoretically analyze the diversity-fidelity trade-off and derive a critical sample threshold for diversified sampling to outperform standard sampling. Extensive experiments across various TSFMs and datasets show proper diversified inference scaling yields substantial performance gains without parameter updates, establishing inference design as a critical, compute-efficient dimension of TSFM optimization. As an application, we propose RobustMSE, a rigorous metric to quantify the headroom performance of TSFM under a fixed budget. Overall, our findings clarify these factor interactions, enabling reliable performance via diverse large-scale inference time series in parallel environments without re-training TSFMs.

[AI-106] Robust Privacy: Inference-Time Privacy through Certified Robustness

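[Quick Read]: This paper addresses the risk that machine learning systems leak sensitive input attributes at inference time, that is, an adversary can infer private information about the original input from the model's output. The core solution is Robust Privacy (RP), a new privacy notion inspired by certified robustness: if the model's prediction is invariant within a radius-$R$ neighborhood of an input $x$ (e.g., under the $\ell_2$ norm), then $x$ enjoys $R$-Robust Privacy, meaning the observed prediction cannot distinguish $x$ from any point in that neighborhood. The authors further develop Attribute Privacy Enhancement (APE) to translate input-level invariance into an attribute-level privacy effect; in a recommendation task, RP expands the set of sensitive-attribute values compatible with a positive decision, widening the inference interval. Experiments show that RP lowers the success rate of model inversion attacks (MIAs): at noise level $\sigma=0.1$ the ASR falls from 73% to 4% with partial performance degradation, and it can fall to 44% with no performance degradation.

Link: https://arxiv.org/abs/2601.17360
Authors: Jiankai Jin, Xiangzheng Zhang, Zhao Liu, Deyue Zhang, Quanchen Zou
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: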

Abstract:Machine learning systems can produce personalized outputs that allow an adversary to infer sensitive input attributes at inference time. We introduce Robust Privacy (RP), an inference-time privacy notion inspired by certified robustness: if a model’s prediction is provably invariant within a radius-$R$ neighborhood around an input $x$ (e.g., under the $\ell_2$ norm), then $x$ enjoys $R$-Robust Privacy, i.e., observing the prediction cannot distinguish $x$ from any input within distance $R$ of $x$. We further develop Attribute Privacy Enhancement (APE) to translate input-level invariance into an attribute-level privacy effect. In a controlled recommendation task where the decision depends primarily on a sensitive attribute, we show that RP expands the set of sensitive-attribute values compatible with a positive recommendation, expanding the inference interval accordingly. Finally, we empirically demonstrate that RP also mitigates model inversion attacks (MIAs) by masking fine-grained input-output dependence. Even at small noise levels ($\sigma=0.1$), RP reduces the attack success rate (ASR) from 73% to 4% with partial model performance degradation. RP can also partially mitigate MIAs (e.g., ASR drops to 44%) with no model performance degradation.
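The abstract does not say how the invariance radius $R$ is certified. One widely used technique that matches the mention of a noise level $\sigma$ is randomized smoothing (Cohen et al., 2019), where a Gaussian-smoothed classifier is certified within $\ell_2$ radius $R = \frac{\sigma}{2}\left(\Phi^{-1}(p_A) - \Phi^{-1}(p_B)\right)$, with $p_A$ and $p_B$ the probabilities of the top and runner-up classes under noise. The sketch below computes that radius; treating it as the paper's mechanism is an assumption.

```python
from statistics import NormalDist


def certified_radius(sigma, p_top, p_runner_up):
    """L2 radius certified by randomized smoothing (Cohen et al., 2019).

    sigma:       standard deviation of the Gaussian input noise
    p_top:       (lower bound on) probability of the predicted class under noise
    p_runner_up: (upper bound on) probability of the second most likely class
    """
    inv_cdf = NormalDist().inv_cdf
    return 0.5 * sigma * (inv_cdf(p_top) - inv_cdf(p_runner_up))
```

For example, `certified_radius(0.1, 0.9, 0.1)` is roughly 0.128. Under the RP definition, every input within that distance of $x$ would yield the same prediction, so the prediction alone could not tell them apart.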

[AI-107] Spectral Geometry for Deep Learning: Compression and Hallucination Detection via Random Matrix Theory

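[Quick Read]: This thesis addresses reliability problems in large language models and deep neural networks (such as hallucination and out-of-distribution behavior) together with their high computational cost. The key to the solution is a unified framework based on spectral geometry and random matrix theory (RMT) that analyzes the eigenvalue structure of hidden activations. On one side, EigenTrack uses spectral features and their temporal dynamics to detect hallucinations and out-of-distribution behavior in language and vision-language models in real time; on the other, RMT-KD is a compression method that identifies informative components from spectral statistics and applies iterative knowledge distillation to produce compact, efficient models that preserve accuracy. The results indicate that spectral statistics provide interpretable and robust signals for monitoring uncertainty and guiding the compression of large-scale neural networks.

Link: https://arxiv.org/abs/2601.17357
Authors: Davide Ettori
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Master thesis, MS in Computer Science, University of Illinois Chicago, defended November 21, 2025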

Abstract:Large language models and deep neural networks achieve strong performance but suffer from reliability issues and high computational cost. This thesis proposes a unified framework based on spectral geometry and random matrix theory to address both problems by analyzing the eigenvalue structure of hidden activations. The first contribution, EigenTrack, is a real-time method for detecting hallucinations and out-of-distribution behavior in language and vision-language models using spectral features and their temporal dynamics. The second contribution, RMT-KD, is a principled compression method that identifies informative spectral components and applies iterative knowledge distillation to produce compact and efficient models while preserving accuracy. Together, these results show that spectral statistics provide interpretable and robust signals for monitoring uncertainty and guiding compression in large-scale neural networks.

[AI-108] Auditing Disability Representation in Vision-Language Models

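[Quick Read]: This paper addresses interpretive distortion in vision-language models (VLMs) when they describe person-centric images and disability context is introduced: the models shift from factual description grounded in visual evidence to narratives containing unsupported inferences, affective degradation, and deficit-oriented framing. The key to the approach is a benchmark built on paired prompts, Neutral Prompts (NP) and Disability-Contextualised Prompts (DP), together with an evaluation framework that treats interpretive fidelity as the core objective. It combines text-based metrics (shifts in sentiment, social regard, and response length) with an LLM-as-judge protocol validated by annotators with lived experience of disability. Experiments show that targeted prompting and preference fine-tuning improve interpretive fidelity and substantially reduce interpretation shifts.

Link: https://arxiv.org/abs/2601.17348
Authors: Srikant Panda, Sourabh Singh Yadav, Palkesh Malviya
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: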

Abstract:Vision-language models (VLMs) are increasingly deployed in socially sensitive applications, yet their behavior with respect to disability remains underexplored. We study disability-aware descriptions for person-centric images, where models often transition from evidence-grounded factual description to interpretation shift, including the introduction of unsupported inferences beyond observable visual evidence. To systematically analyze this phenomenon, we introduce a benchmark based on paired Neutral Prompts (NP) and Disability-Contextualised Prompts (DP) and evaluate 15 state-of-the-art open- and closed-source VLMs under a zero-shot setting across 9 disability categories. Our evaluation framework treats interpretive fidelity as the core objective and combines standard text-based metrics capturing affective degradation through shifts in sentiment, social regard and response length with an LLM-as-judge protocol, validated by annotators with lived experience of disability. We find that introducing disability context consistently degrades interpretive fidelity, inducing interpretation shifts characterised by speculative inference, narrative elaboration, affective degradation and deficit-oriented framing. These effects are further amplified along race and gender dimensions. Finally, we demonstrate that targeted prompting and preference fine-tuning effectively improve interpretive fidelity and substantially reduce interpretation shifts.

[AI-109] Multi-Agent Learning Path Planning via LLMs

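[Quick Read]: This paper addresses the lack of transparency, adaptability, and learner-centered explainability in learning path planning methods used in current intelligent tutoring systems. The key to the solution is Multi-Agent Learning Path Planning (MALPP), a new framework based on collaboration among multiple agents. It consists of three task-specific agents, a learner analytics agent, a path planning agent, and a reflection agent, which collaborate through structured prompts and predefined rules to analyze learner profiles, generate personalized learning paths, and iteratively refine them using interpretable feedback. The framework is grounded in Cognitive Load Theory and the Zone of Proximal Development, so that recommended paths are cognitively aligned and pedagogically meaningful, with the aim of improving learning efficiency and trustworthiness.

Link: https://arxiv.org/abs/2601.17346
Authors: Haoxin Xu, Changyong Qi, Tong Liu, Bohao Zhang, Anna He, Bingqian Jiang, Longwei Zheng, Xiaoqing Gu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: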

Abstract:The integration of large language models (LLMs) into intelligent tutoring systems offers transformative potential for personalized learning in higher education. However, most existing learning path planning approaches lack transparency, adaptability, and learner-centered explainability. To address these challenges, this study proposes a novel Multi-Agent Learning Path Planning (MALPP) framework that leverages a role- and rule-based collaboration mechanism among intelligent agents, each powered by LLMs. The framework includes three task-specific agents: a learner analytics agent, a path planning agent, and a reflection agent. These agents collaborate via structured prompts and predefined rules to analyze learning profiles, generate tailored learning paths, and iteratively refine them with interpretable feedback. Grounded in Cognitive Load Theory and Zone of Proximal Development, the system ensures that recommended paths are cognitively aligned and pedagogically meaningful. Experiments conducted on the MOOCCubeX dataset using seven LLMs show that MALPP significantly outperforms baseline models in path quality, knowledge sequence consistency, and cognitive load alignment. Ablation studies further validate the effectiveness of the collaborative mechanism and theoretical constraints. This research contributes to the development of trustworthy, explainable AI in education and demonstrates a scalable approach to learner-centered adaptive instruction powered by LLMs.

[AI-110] Are We Evaluating the Edit Locality of LLM Model Editing Properly?

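[Quick Read]: This paper addresses shortcomings in how current model editing research evaluates knowledge preservation (specificity, also called edit locality). Existing evaluation protocols have conceptual flaws, correlate only weakly with the strength of specificity regularizers, and lack sensitivity, which makes it hard to measure how well different editing methods keep non-target knowledge unchanged while injecting target knowledge. The key to the solution is a constructive evaluation protocol that eliminates the conflict between open-ended generative LLMs and the assumption of determined answers, avoids query-independent fluency biases, and allows evaluation strictness to be adjusted smoothly within a near-continuous space. As a result, the metrics become more sensitive to changes in regularization strength, correlate strongly with it, and discriminate between methods at a finer granularity.

Link: https://arxiv.org/abs/2601.17343
Authors: Wei Liu, Haomei Xu, Hongkai Liu, Zhiying Deng, Ruixuan Li, Heng Huang, Yee Whye Teh, Wee Sun Lee
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: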

Abstract:Model editing has recently emerged as a popular paradigm for efficiently updating knowledge in LLMs. A central desideratum of updating knowledge is to balance editing efficacy, i.e., the successful injection of target knowledge, and specificity (also known as edit locality), i.e., the preservation of existing non-target knowledge. However, we find that existing specificity evaluation protocols are inadequate for this purpose. We systematically elaborate on three fundamental issues they face. Beyond the conceptual issues, we further empirically demonstrate that existing specificity metrics are weakly correlated with the strength of specificity regularizers. We also find that current metrics lack sufficient sensitivity, rendering them ineffective at distinguishing the specificity performance of different methods. Finally, we propose a constructive evaluation protocol. Under this protocol, the conflict between open-ended LLMs and the assumption of determined answers is eliminated, query-independent fluency biases are avoided, and the evaluation strictness can be smoothly adjusted within a near-continuous space. Experiments across various LLMs, datasets, and editing methods show that metrics derived from the proposed protocol are more sensitive to changes in the strength of specificity regularizers and exhibit strong correlation with them, enabling more fine-grained discrimination of different methods’ knowledge preservation capabilities.

[AI-111] The Relativity of AGI: Distributional Axioms, Fragility, and Undecidability

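[Quick Read]: This paper asks whether Artificial General Intelligence (AGI) admits a formal definition that supports absolute claims of existence, robustness, or self-verification. The key to the approach is to formalize AGI as a distributional, resource-bounded semantic predicate indexed by a task family, a task distribution, a performance functional, and explicit resource budgets. On this basis the author proves that the generality of AGI is inherently relational, with no distribution-independent notion of AGI; that arbitrarily small perturbations of the task distribution can invalidate AGI properties via cliff sets, ruling out universal robustness; that unbounded transfer across task families is impossible under finite resources; and, by Rice-style and Gödel–Tarski arguments, that AGI is a nontrivial semantic property that cannot be soundly and completely certified by any computable procedure, including the agent itself. Recursive self-improvement schemes that rely on internal self-certification are therefore ill-posed.

Link: https://arxiv.org/abs/2601.17335
Authors: Angshul Majumdar
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: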

Abstract:We study whether Artificial General Intelligence (AGI) admits a coherent theoretical definition that supports absolute claims of existence, robustness, or self-verification. We formalize AGI axiomatically as a distributional, resource-bounded semantic predicate, indexed by a task family, a task distribution, a performance functional, and explicit resource budgets. Under this framework, we derive four classes of results. First, we show that generality is inherently relational: there is no distribution-independent notion of AGI. Second, we prove non-invariance results demonstrating that arbitrarily small perturbations of the task distribution can invalidate AGI properties via cliff sets, precluding universal robustness. Third, we establish bounded transfer guarantees, ruling out unbounded generalization across task families under finite resources. Fourth, invoking Rice-style and Gödel–Tarski arguments, we prove that AGI is a nontrivial semantic property and therefore cannot be soundly and completely certified by any computable procedure, including procedures implemented by the agent itself. Consequently, recursive self-improvement schemes that rely on internal self-certification of AGI are ill-posed. Taken together, our results show that strong, distribution-independent claims of AGI are not false but undefined without explicit formal indexing, and that empirical progress in AI does not imply the attainability of self-certifying general intelligence.

[AI-112] FinMetaMind: A Tech Blueprint on NLQ Systems for Financial Knowledge Search

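[Quick Read]: This paper addresses the limitations of traditional query methods in financial knowledge search with respect to precision, recall, and linking across objects, focusing on challenges such as entity recognition, relevance ranking, data freshness, and the integration of heterogeneous multi-source information in financial data. The key to its solution is a modern Natural Language Query (NLQ) system architecture built on Natural Language Processing (NLP), search engineering, and vector data models. By combining offline indexing with online retrieval, it aims for efficient semantic understanding and accurate matching over financial documents and structured data, improving the depth and efficiency of financial knowledge discovery.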

Link: https://arxiv.org/abs/2601.17333
Authors: Lalit Pant,Shivang Nagar
Affiliation: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Databases (cs.DB)
Comments: 8 pages, 8 figures, Information Retrieval, Natural Language Query, Vector Search, Embeddings, Named Entity Recognition, Large Language Models

Abstract:Natural Language Query (NLQ) allows users to search and interact with information systems using plain, human language instead of structured query syntax. This paper presents a technical blueprint on the design of a modern NLQ system tailored to financial knowledge search. The introduction of NLQ not only enhances the precision and recall of the knowledge search compared to traditional methods, but also facilitates deeper insights by efficiently linking disparate financial objects, events, and relationships. Using core constructs from natural language processing, search engineering, and vector data models, the proposed system aims to address key challenges in discovering, relevance ranking, data freshness, and entity recognition intrinsic to financial data retrieval. In this work, we detail the unique requirements of NLQ for financial datasets and documents, outline the architectural components for offline indexing and online retrieval, and discuss the real-world use cases of enhanced knowledge search in financial services. We delve into the theoretical underpinnings and experimental evidence supporting our proposed architecture, ultimately providing a comprehensive analysis on the subject matter. We also provide a detailed elaboration of our experimental methodology, the data used, the results and future optimizations in this study.

[AI-113] TheoremForge: Scaling up Formal Data Synthesis with Low-Budget Agentic Workflow

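[Quick Read]: This paper addresses the high cost of agentic workflows in formal mathematics, which hinders large-scale data synthesis and worsens the scarcity of open-source corpora. Its core solution is TheoremForge, a low-cost formal data synthesis pipeline that decomposes formalization into five sub-tasks (statement formalization, proof generation, premise selection, proof correction, and proof sketching) and introduces a Decoupled Extraction Strategy that recovers valid training signals from globally failed trajectories, making use of otherwise wasted computation. On a 2,000-problem benchmark the method reaches a Verified Rate of 12.6%, against 8.6% for the baseline, at an average cost of only $0.481 per successful trajectory with Gemini-3-Flash. It also raises the data yield for proof generation by 1.6× over standard filtering, offering a workable framework for a scalable data flywheel to train future expert models.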

Link: https://arxiv.org/abs/2601.17332
Authors: Yicheng Tao,Hongteng Xu
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:The high cost of agentic workflows in formal mathematics hinders large-scale data synthesis, exacerbating the scarcity of open-source corpora. To address this, we introduce TheoremForge, a cost-effective formal data synthesis pipeline that decomposes the formalization process into five sub-tasks, which are statement formalization, proof generation, premise selection, proof correction and proof sketching. By implementing a Decoupled Extraction Strategy, the workflow recovers valid training signals from globally failed trajectories, effectively utilizing wasted computation. Experiments on a 2,000-problem benchmark demonstrate that TheoremForge achieves a Verified Rate of 12.6%, surpassing the 8.6% baseline, at an average cost of only $0.481 per successful trajectory using Gemini-3-Flash. Crucially, our strategy increases data yield by 1.6× for proof generation compared to standard filtering. These results establish TheoremForge as a scalable framework for constructing a data flywheel to train future expert models. Our code is available here.

[AI-114] Conformal Feedback Alignment: Quantifying Answer-Level Reliability for Robust LLM Alignment EACL

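[Quick Read]: This paper addresses unstable training in preference-based alignment, such as Reinforcement Learning from Human Feedback (RLHF), caused by noisy and inconsistent preference labels. Existing methods mitigate uncertainty by weighting preferences but overlook a more fundamental factor: the reliability of the answers being compared. The key to the solution is the Conformal Feedback Alignment (CFA) framework, which grounds preference weighting in the statistical guarantees of Conformal Prediction (CP). CFA quantifies the reliability of each answer by constructing conformal prediction sets with controllable coverage and converts this answer-level uncertainty into principled weights for DPO- and PPO-style training, improving alignment robustness and data efficiency.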

Link: https://arxiv.org/abs/2601.17329
Authors: Tiejin Chen,Xiaoou Liu,Vishnu Nandam,Kuan-Ru Liou,Hua Wei
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to Findings of EACL

Abstract:Preference-based alignment like Reinforcement Learning from Human Feedback (RLHF) learns from pairwise preferences, yet the labels are often noisy and inconsistent. Existing uncertainty-aware approaches weight preferences, but ignore a more fundamental factor: the reliability of the answers being compared. To address the problem, we propose Conformal Feedback Alignment (CFA), a framework that grounds preference weighting in the statistical guarantees of Conformal Prediction (CP). CFA quantifies answer-level reliability by constructing conformal prediction sets with controllable coverage and aggregates these reliabilities into principled weights for both DPO- and PPO-style training. Experiments across different datasets show that CFA improves alignment robustness and data efficiency, highlighting that modeling answer-side uncertainty complements preference-level weighting and yields more robust, data-efficient alignment. Codes are provided here.
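
As a rough illustration of the conformal prediction step mentioned in the abstract, the sketch below builds a split-conformal prediction set over candidate answers. It is not the authors' implementation: the calibration scores, the answer probabilities, and the helper names `conformal_quantile` and `prediction_set` are hypothetical, and the way CFA turns such sets into preference weights is not shown because this listing does not specify it.

```python
import math

def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(cal_scores)
    # Rank of the order statistic used by split conformal prediction.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: every answer is kept.
        return float("inf")
    return sorted(cal_scores)[k - 1]

def prediction_set(probs, qhat):
    """Keep every candidate whose nonconformity score 1 - p is at most qhat."""
    return {label for label, p in probs.items() if 1.0 - p <= qhat}

# Hypothetical calibration scores: 1 - (probability given to the true answer).
cal_scores = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.35, 0.40, 0.60, 0.80]
qhat = conformal_quantile(cal_scores, alpha=0.2)

# Hypothetical model probabilities over candidate answers for one prompt.
probs = {"A": 0.70, "B": 0.45, "C": 0.05}
answer_set = prediction_set(probs, qhat)
print(qhat, sorted(answer_set))
```

With these made-up numbers the threshold is 0.60, so answers A and B stay in the set and C is excluded. A smaller `alpha` asks for higher coverage and tends to enlarge the set.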

[AI-115] Phase Transition for Budgeted Multi-Agent Synergy

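[Quick Read]: This paper addresses why multi-agent systems under a fixed inference budget may improve only slightly, saturate, or even degrade. The core challenge is to understand and quantify three binding constraints: finite context windows, lossy inter-agent communication, and shared failures among similar agents. The key to the solution is a minimal, calibratable theoretical framework that characterizes system behavior with four core quantities: the compute-performance scaling exponent $\beta$ of each leaf agent, the message-length fidelity curve $\gamma(m)$, the effective shared-error correlation $\rho$, and the hard fan-in limit imposed by the context window $W$. For binary success/failure tasks, the authors prove that deep $b$-ary trees exhibit a sharp phase transition: a single scalar $\alpha_\rho$ (combining $\gamma(m)$, $\rho$, and fan-in $b$) determines whether a weak signal is amplified to a nontrivial fixed point or decays to chance. They further derive an organization exponent $s$ and show that budgeted synergy occurs when $s\beta > 1$, which yields closed-form compute allocation rules and explicit budget thresholds. A continuous-performance warm-up analysis makes the correlation- and communication-induced performance floors explicit. The predicted phase boundaries are validated in synthetic simulations, and the same mechanisms are used to explain the bottlenecks reported in large-scale studies of LLM agent systems.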

Link: https://arxiv.org/abs/2601.17311
Authors: Bang Liu,Linglong Kong,Jian Pei
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 55 pages, 12 figures

Abstract:Multi-agent systems can improve reliability, yet under a fixed inference budget they often help, saturate, or even collapse. We develop a minimal and calibratable theory that predicts these regimes from three binding constraints of modern agent stacks: finite context windows, lossy inter-agent communication, and shared failures among similar agents. Each leaf agent is summarized by a compute-performance scaling exponent $\beta$; communication is captured by a message-length fidelity curve $\gamma(m)$; dependence is captured by an effective shared-error correlation $\rho$; and a context window $W$ imposes hard fan-in limits that make hierarchy necessary. For binary success/failure tasks with majority aggregation, we prove a sharp phase transition for deep $b$-ary trees with correlated inputs and lossy communication: a single scalar $\alpha_\rho$ (combining $\gamma(m)$, $\rho$, and fan-in $b$) determines whether weak signal is amplified to a nontrivial fixed point or washed out to chance. In the amplifying regime, we derive an organization exponent $s$ and show that budgeted synergy, i.e., outperforming the best single agent under the same total budget, occurs exactly when $s\beta > 1$, yielding closed-form compute allocation rules and explicit budget thresholds. We further characterize saturation via a mixing depth and provide a conservative clipped predictor that remains accurate across growth and saturation. A continuous-performance warm-up gives closed-form risks for star, chain, and tree organizations, making correlation- and communication-induced floors explicit and exposing the core design trade-offs in a smooth setting. Finally, we validate the predicted phase boundaries in controlled synthetic simulations and show how the same mechanisms explain the dominant bottlenecks reported in recent large-scale matched-budget studies of LLM agent-system scaling.
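
The amplify-or-wash-out behavior described in the abstract can be seen in a toy recursion. This is only a sketch under simplifying assumptions that are not taken from the paper: inputs are independent (no shared-error correlation), the fan-in `b` is odd, and lossy communication is modeled by replacing each child's vote with a fair coin flip with probability `1 - gamma`. The function names are hypothetical.

```python
from math import comb

def majority_success(p, b):
    """P(more than half of b independent voters are correct), for odd b."""
    return sum(comb(b, k) * p**k * (1 - p)**(b - k)
               for k in range(b // 2 + 1, b + 1))

def tree_success(p_leaf, b, depth, gamma=1.0):
    """Success probability at the root of a depth-`depth` b-ary majority tree."""
    p = p_leaf
    for _ in range(depth):
        # Toy channel: with probability 1 - gamma a vote becomes a coin flip.
        p_received = gamma * p + (1 - gamma) * 0.5
        p = majority_success(p_received, b)
    return p

# Same weak leaf signal (60% correct), two communication fidelities.
amplified = tree_success(0.6, b=3, depth=10, gamma=1.0)
washed_out = tree_success(0.6, b=3, depth=10, gamma=0.5)
print(round(amplified, 4), round(washed_out, 4))
```

For `b = 3` the update map has slope `1.5 * gamma` at `p = 0.5`, so in this toy model the signal grows when `gamma > 2/3` and decays toward chance otherwise. The paper's threshold additionally accounts for correlation among agents.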

[AI-116] High-Fidelity Longitudinal Patient Simulation Using Real-World Data

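[Quick Read]: This paper addresses the difficulty of simulating patient trajectories in clinical medicine, in particular producing high-fidelity, individualized predictions of future health events when complex biological mechanisms and sociocultural factors interact. The core challenge is the lack of modeling methods that accurately capture how a patient evolves over time. The key to the solution is a generative simulator model built on real-world Electronic Health Records (EHR): it takes a patient's history as input and synthesizes fine-grained, realistic future health trajectories. Pretrained on more than 200 million clinical records, the model closely matches future event occurrence rates, laboratory results, and temporal dynamics, and it accurately estimates future event probabilities, providing a scalable computational framework for personalized treatment planning and virtual clinical trials.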

Link: https://arxiv.org/abs/2601.17310
Authors: Yu Akagi,Tomohisa Seki,Hiromasa Ito,Toru Takiguchi,Kazuhiko Ohe,Yoshimasa Kawazoe
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Simulation is a powerful tool for exploring uncertainty. Its potential in clinical medicine is transformative and includes personalized treatment planning and virtual clinical trials. However, simulating patient trajectories is challenging because of complex biological and sociocultural influences. Here, we show that real-world clinical records can be leveraged to empirically model patient timelines. We developed a generative simulator model that takes a patient’s history as input and synthesizes fine-grained, realistic future trajectories. The model was pretrained on more than 200 million clinical records. It produced high-fidelity future timelines, closely matching event occurrence rates, laboratory test results, and temporal dynamics in real patient future data. It also accurately estimated future event probabilities, with observed-to-expected ratios consistently near 1.0 across diverse outcomes and time horizons. Our results reveal the untapped value of real-world data in electronic health records and introduce a scalable framework for in silico modeling of clinical care.
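
The observed-to-expected (O/E) ratio quoted in the abstract is a standard calibration check: the number of events that actually occurred divided by the sum of predicted probabilities. A minimal sketch with made-up numbers follows; the function name and the data are hypothetical and are not drawn from the paper.

```python
def observed_to_expected(predicted_probs, outcomes):
    """O/E ratio: observed event count over the sum of predicted probabilities."""
    expected = sum(predicted_probs)
    observed = sum(outcomes)
    return observed / expected

# Hypothetical per-patient event probabilities and what actually happened.
predicted_probs = [0.1, 0.2, 0.7, 0.5, 0.5]
outcomes = [0, 0, 1, 1, 0]
ratio = observed_to_expected(predicted_probs, outcomes)
print(ratio)
```

A ratio near 1.0 means the predicted risk matches the observed event rate on average; values above 1 indicate under-prediction and values below 1 indicate over-prediction.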

[AI-117] Decentralized Multi-Agent Swarms for Autonomous Grid Security in Industrial IoT: A Consensus-based Approach

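[Quick Read]: This paper addresses the high latency caused by centralized security monitoring architectures in Industrial Internet of Things (IIoT) environments, which attackers can exploit to compromise an entire manufacturing ecosystem. The key to the solution is a decentralized multi-agent swarm (DMAS) architecture that deploys an autonomous artificial intelligence (AI) agent at each edge gateway, forming a distributed digital "immune system". The agents cooperate over a lightweight peer-to-peer protocol to detect anomalous behavior without uploading data to the cloud, and a consensus-based threat validation (CVT) mechanism lets them vote on threat levels so that compromised nodes can be quarantined immediately. This improves response time and detection accuracy and reduces bandwidth consumption.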

Link: https://arxiv.org/abs/2601.17303
Authors: Samaresh Kumar Singh,Joyjit Roy
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET)
Comments: 9 pages, 8 figures, and Submitted to IEEE SoutheastCon 2026

Abstract:As Industrial Internet of Things (IIoT) environments expand to include tens of thousands of connected devices. The centralization of security monitoring architectures creates serious latency issues that savvy attackers can exploit to compromise an entire manufacturing ecosystem. This paper outlines a new, decentralized multi-agent swarm (DMAS) architecture that includes autonomous artificial intelligence (AI) agents at each edge gateway, functioning as a distributed digital “immune system” for IIoT networks. Instead of using a traditional static firewall approach, the DMAS agents communicate via a lightweight peer-to-peer protocol to cooperatively detect anomalous behavior across the IIoT network without sending data to a cloud infrastructure. The authors also outline a consensus-based threat validation (CVT) process in which agents vote on the threat level of an identified threat, enabling instant quarantine of a compromised node or nodes. The authors conducted experiments on a testbed that simulated an innovative factory environment with 2000 IIoT devices and found that the DMAS demonstrated sub-millisecond response times (average of 0.85ms), 97.3% accuracy in detecting malicious activity under high load, and 87% accuracy in detecting zero-day attacks. All significantly higher than baseline values for both centralized and edge computing. Additionally, the proposed architecture can prevent real-time cascading failures in industrial control systems and reduce network bandwidth use by 89% compared to cloud-based solutions.

[AI-118] On the Insecurity of Keystroke-Based AI Authorship Detection: Timing-Forgery Attacks Against Motor-Signal Verification

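[Quick Read]: This paper examines the security of mechanisms that distinguish human-written text from Generative AI text using keystroke timing features, such as the coefficient of variation δ of inter-keystroke intervals. Existing defenses assume that δ reliably indicates content provenance, but the paper shows that they fail under two practical attack classes: the copy-type attack, in which a human transcribes LLM-generated text and so produces authentic motor signals, and the timing-forgery attack, in which an automated agent samples inter-keystroke intervals from empirical human distributions. In experiments, every attack evades all five classifiers at a rate of ≥99.8%, and the detectors label ≥99.8% of attack samples as human-written with mean confidence ≥0.993. The key contribution is a formal non-identifiability result: when only keystroke timing is observed, the mutual information between features and content provenance is zero under copy-type attacks. The paper also notes that although composition and transcription produce statistically distinguishable motor patterns (Cohen's d=1.28), both yield δ values well above detection thresholds, so the distinction is irrelevant for security. Securing provenance therefore requires architectures that bind the writing process to semantic content.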

Link: https://arxiv.org/abs/2601.17280
Authors: David Condrey
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 9 pages, 1 figure, 7 tables. Code available at anc/ folder

Abstract:Recent proposals advocate using keystroke timing signals, specifically the coefficient of variation (δ) of inter-keystroke intervals, to distinguish human-composed text from AI-generated content. We demonstrate that this class of defenses is insecure against two practical attack classes: the copy-type attack, in which a human transcribes LLM-generated text producing authentic motor signals, and timing-forgery attacks, in which automated agents sample inter-keystroke intervals from empirical human distributions. Using 13,000 sessions from the SBU corpus and three timing-forgery variants (histogram sampling, statistical impersonation, and generative LSTM), we show all attacks achieve ≥99.8% evasion rates against five classifiers. While detectors achieve AUC=1.000 against fully-automated injection, they classify ≥99.8% of attack samples as human with mean confidence ≥0.993. We formalize a non-identifiability result: when the detector observes only timing, the mutual information between features and content provenance is zero for copy-type attacks. Although composition and transcription produce statistically distinguishable motor patterns (Cohen’s d=1.28), both yield δ values 2-4x above detection thresholds, rendering the distinction security-irrelevant. These systems confirm a human operated the keyboard, but not whether that human originated the text. Securing provenance requires architectures that bind the writing process to semantic content.
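
To make the timing feature and the forgery idea concrete, the sketch below computes the coefficient of variation of inter-keystroke intervals and then forges a session by resampling observed human intervals, in the spirit of the histogram-sampling variant named in the abstract. Everything here is illustrative: the lognormal "human" data, the fixed-rate bot, and the function names are assumptions, not the paper's data or code.

```python
import random
import statistics

def coefficient_of_variation(intervals):
    """Standard deviation of the intervals divided by their mean."""
    return statistics.pstdev(intervals) / statistics.fmean(intervals)

def forge_intervals(human_intervals, n, rng):
    """Resample observed human intervals with replacement (empirical distribution)."""
    return [rng.choice(human_intervals) for _ in range(n)]

rng = random.Random(0)
# Hypothetical human inter-keystroke intervals in milliseconds.
human = [rng.lognormvariate(5.0, 0.5) for _ in range(500)]
# A naive bot that types at a fixed rate with tiny jitter.
bot = [100.0 + rng.uniform(-1.0, 1.0) for _ in range(500)]
# A forging bot that replays intervals drawn from the human sample.
forged = forge_intervals(human, 500, rng)

cv_human = coefficient_of_variation(human)
cv_bot = coefficient_of_variation(bot)
cv_forged = coefficient_of_variation(forged)
print(round(cv_human, 3), round(cv_bot, 3), round(cv_forged, 3))
```

The fixed-rate bot has a coefficient of variation near zero and is easy to flag, while the forged session inherits roughly the human value, which is why a threshold on this statistic alone cannot separate the two.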

[AI-119] Latent-Space Contrastive Reinforcement Learning for Stable and Efficient LLM Reasoning

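[Quick Read]: This paper addresses the performance bottleneck of Large Language Models (LLMs) on complex multi-step reasoning, where they rely on "statistical fitting" rather than systematic logical deduction. Traditional Reinforcement Learning (RL) tries to improve on this with a "think-before-speak" paradigm, but in high-dimensional discrete token spaces it faces structural challenges: low sample efficiency, high gradient-estimation variance, and catastrophic forgetting. The key to the solution is the DeepLatent Reasoning (DLR) framework, which moves the trial-and-error cost from expensive token-level full-sequence generation into a continuous latent space. A lightweight assistant model efficiently samples reasoning-chain encodings in the latent space; a dual reward mechanism based on correctness and formatting selects high-value latent trajectories, and only these are fed to a frozen main model for single-pass decoding. A contrastive learning objective enables directed exploration in the latent space, increasing diversity while preserving reasoning coherence, and because the main model's parameters stay frozen, the method avoids catastrophic forgetting.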

Link: https://arxiv.org/abs/2601.17275
Authors: Lianlei Shan,Han Chen,Yixuan Wang,Zhenjie Liu,Wei Li
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages

Abstract:While Large Language Models (LLMs) demonstrate exceptional performance in surface-level text generation, their nature in handling complex multi-step reasoning tasks often remains one of "statistical fitting" rather than systematic logical deduction. Traditional Reinforcement Learning (RL) attempts to mitigate this by introducing a "think-before-speak" paradigm. However, applying RL directly in high-dimensional, discrete token spaces faces three inherent challenges: sample-inefficient rollouts, high gradient estimation variance, and the risk of catastrophic forgetting. To fundamentally address these structural bottlenecks, we propose DeepLatent Reasoning (DLR), a latent-space bidirectional contrastive reinforcement learning framework. This framework shifts the trial-and-error cost from expensive token-level full sequence generation to the continuous latent manifold. Specifically, we introduce a lightweight assistant model to efficiently sample K reasoning chain encodings within the latent space. These encodings are filtered via a dual reward mechanism based on correctness and formatting; only high-value latent trajectories are fed into a frozen main model for single-pass decoding. To maximize reasoning diversity while maintaining coherence, we design a contrastive learning objective to enable directed exploration within the latent space. Since the main model parameters remain frozen during optimization, this method mathematically eliminates catastrophic forgetting. Experiments demonstrate that under comparable GPU computational budgets, DLR achieves more stable training convergence, supports longer-horizon reasoning chains, and facilitates the sustainable accumulation of reasoning capabilities, providing a viable path toward reliable and scalable reinforcement learning for LLMs.

[AI-120] The Viscosity of Logic: Phase Transitions and Hysteresis in DPO Alignment

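[Quick Read]: This paper addresses a misconception in tuning the alignment-strength parameter β when aligning Generative AI models with Direct Preference Optimization (DPO): the common assumption that increasing β keeps improving model behavior, which overlooks non-monotonic or even negative changes in capability. The key to the approach is to treat β as a control parameter and sweep it densely under a fixed DPO recipe, systematically evaluating how capability relates to the preference margin at each β for several 7B open-weight model families (Mistral, Llama, and Qwen). The study finds that the effect of β on capability is strongly architecture-dependent and non-monotonic, and that the preference margin can be strongly anticorrelated with reasoning capability (Pearson r = -0.91 for Llama), so margin-based model selection may favor capability-impaired models. In addition, a training-path effect (hysteresis) shows that capability losses induced by training at high β are hard to recover by later reducing β. The paper therefore argues for capability-resolved evaluation across the β range rather than judging model quality only by margins or aggregate benchmarks.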

Link: https://arxiv.org/abs/2601.17260
Authors: Marco Pollanen
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 Pages, 5 Figures

Abstract:Direct Preference Optimization (DPO) is often tuned as if increasing alignment pressure (controlled by β) yields progressively “better” behavior. We instead treat β as a control parameter and densely sweep it for three 7B open-weight families under a fixed DPO recipe. In Mistral, capability is sharply non-monotonic: aggregated logic-probe margins become positive only in a narrow band near β ≈ 10⁻² and revert outside it, with boundary points that are seed-sensitive. Across architectures under the same sweep, we observe qualitatively different response modes: sharp reorganization in Mistral, selective changes in Llama, and smooth trade-offs in Qwen. Critically, the DPO preference margin can anticorrelate with reasoning capability (Pearson r=-0.91 for Llama logic), so margin-based selection can prefer capability-impaired models. Training path also matters: exposure to high β induces capability losses that persist even after β is reduced (hysteresis). These findings motivate capability-resolved evaluation across the β landscape rather than reliance on margins or aggregate benchmarks.
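
For reference, the standard DPO loss for one preference pair is $-\log \sigma\big(\beta[(\log \pi_\theta(y_w|x) - \log \pi_{\mathrm{ref}}(y_w|x)) - (\log \pi_\theta(y_l|x) - \log \pi_{\mathrm{ref}}(y_l|x))]\big)$, where $y_w$ is the chosen and $y_l$ the rejected response. The sketch below evaluates it from hypothetical sequence log-probabilities to show where β enters. It is not the paper's training code, and treating the β-scaled quantity inside the sigmoid as "the margin" is an assumption about terminology.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Standard DPO loss for one (chosen, rejected) pair, plus its margin."""
    # Implicit reward margin: policy-vs-reference log-ratio gap, scaled by beta.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    return loss, margin

# Hypothetical log-probabilities: the policy favors the chosen response
# slightly more than the reference model does.
loss, margin = dpo_loss(logp_w=-10.0, logp_l=-12.0,
                        ref_logp_w=-11.0, ref_logp_l=-11.0, beta=0.1)
print(round(loss, 4), round(margin, 4))
```

When policy and reference agree the margin is 0 and the loss equals log 2. Larger β scales the same log-ratio gap into a larger margin, which is one reason a margin can grow without any guarantee that downstream capability improves.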

[AI-121] Online parameter estimation for the Crazyflie quadcopter through an EM algorithm

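[Quick Read]: This paper addresses state estimation and parameter identification for a quadcopter system subject to random noise, with the aim of improving stability and control accuracy in complex environments. The key to the solution is to use an Extended Kalman Filter (EKF) for state estimation from noisy sensor observations, to design a Linear Quadratic Gaussian (LQG) controller within a Stochastic Differential Equation (SDE) modeling framework, and to apply the Expectation Maximization (EM) algorithm for offline and online estimation of system parameters. Online parameter estimation shows a slightly wider range of convergence values than offline estimation.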

Link: https://arxiv.org/abs/2601.17009
Authors: Yanhua Zhao
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
Comments: 20 pages, 37 figures

Abstract:Drones are becoming more and more popular nowadays. They are small in size, low in cost, and reliable in operation. They contain a variety of sensors and can perform a variety of flight tasks, reaching places that are difficult or inaccessible for humans. Earthquakes damage a lot of infrastructure, making it impossible for rescuers to reach some areas. But drones can help. Many amateur and professional photographers like to use drones for aerial photography. Drones play a non-negligible role in agriculture and transportation too. Drones can be used to spray pesticides, and they can also transport supplies. A quadcopter is a four-rotor drone and has been studied in this paper. In this paper, random noise is added to the quadcopter system and its effects on the drone system are studied. An extended Kalman filter has been used to estimate the state based on noisy observations from the sensor. Based on a SDE system, a linear quadratic Gaussian controller has been implemented. The expectation maximization algorithm has been applied for parameter estimation of the quadcopter. The results of offline parameter estimation and online parameter estimation are presented. The results show that the online parameter estimation has a slightly larger range of convergence values than the offline parameter estimation.

[AI-122] MathMixup: Boosting LLM Mathematical Reasoning with Difficulty-Controllable Data Synthesis and Curriculum Learning

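[Quick Read]: This paper addresses the limited gains of Large Language Models (LLMs) on mathematical reasoning that stem from training data of insufficient quality, limited diversity, and imprecise difficulty control, which makes efficient training paradigms such as curriculum learning hard to support. The key to the solution is MathMixup, a data synthesis paradigm that systematically generates high-quality, difficulty-controllable mathematical reasoning problems through hybrid and decomposed strategies, combined with automated self-checking and manual screening to ensure semantic clarity and a reasonable difficulty gradient. The MathMixupQA dataset built with this method, together with a matching curriculum learning strategy, improves the mathematical reasoning of LLMs: a fine-tuned Qwen2.5-7B reaches an average score of 52.6% across seven mathematical benchmarks, surpassing previous state-of-the-art methods.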

Link: https://arxiv.org/abs/2601.17006
Authors: Xuchen Li,Jing Chen,Xuzhao Li,Hao Liang,Xiaohuan Zhou,Taifeng Wang,Wentao Zhang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint, Under review

Abstract:In mathematical reasoning tasks, the advancement of Large Language Models (LLMs) relies heavily on high-quality training data with clearly defined and well-graded difficulty levels. However, existing data synthesis methods often suffer from limited diversity and lack precise control over problem difficulty, making them insufficient for supporting efficient training paradigms such as curriculum learning. To address these challenges, we propose MathMixup, a novel data synthesis paradigm that systematically generates high-quality, difficulty-controllable mathematical reasoning problems through hybrid and decomposed strategies. Automated self-checking and manual screening are incorporated to ensure semantic clarity and a well-structured difficulty gradient in the synthesized data. Building on this, we construct the MathMixupQA dataset and design a curriculum learning strategy that leverages these graded problems, supporting flexible integration with other datasets. Experimental results show that MathMixup and its curriculum learning strategy significantly enhance the mathematical reasoning performance of LLMs. Fine-tuned Qwen2.5-7B achieves an average score of 52.6% across seven mathematical benchmarks, surpassing previous state-of-the-art methods. These results fully validate the effectiveness and broad applicability of MathMixup in improving the mathematical reasoning abilities of LLMs and advancing data-centric curriculum learning.

[AI-123] From Noise to Insights: Enhancing Supply Chain Decision Support through AI-Based Survey Integrity Analytics

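[Quick Read]: This paper addresses the unreliability of survey data used in supply chain decision-making, especially when assessing an organization's readiness for AI-driven tools such as safety stock optimization systems, where low-quality or fake responses significantly reduce the accuracy of the analysis. The key to the solution is a lightweight AI-based screening framework that filters survey inputs with supervised learning: fake responses are identified by manually labeling logical inconsistencies and response patterns, and classifiers such as Random Forest are trained on 99 industry responses. The best model reaches 92.0% accuracy, an improvement over the pilot study, supporting the feasibility and scalability of integrating AI into survey pipelines.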

Link: https://arxiv.org/abs/2601.17005
Authors: Bhubalan Mani
Affiliation: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: 2025 IEEE 4th World Conference on Applied Intelligence and Computing (AIC)

Abstract:The reliability of survey data is crucial in supply chain decision-making, particularly when evaluating readiness for AI-driven tools such as safety stock optimization systems. However, surveys often attract low-effort or fake responses that degrade the accuracy of derived insights. This study proposes a lightweight AI-based framework for filtering unreliable survey inputs using a supervised machine learning approach. In this expanded study, a larger dataset of 99 industry responses was collected, with manual labeling to identify fake responses based on logical inconsistencies and response patterns. After preprocessing and label encoding, both Random Forest and baseline models (Logistic Regression, XGBoost) were trained to distinguish genuine from fake responses. The best-performing model achieved an 92.0% accuracy rate, demonstrating improved detection compared to the pilot study. Despite limitations, the results highlight the viability of integrating AI into survey pipelines and provide a scalable solution for improving data integrity in supply chain research, especially during product launch and technology adoption phases.

[AI-124] Investigating Self-regulated Learning Sequences within a Generative AI-based Intelligent Tutoring System

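[Quick Read]: This paper asks how to capture students' dynamic self-regulated learning (SRL) patterns in learning environments assisted by generative artificial intelligence (GenAI), in order to improve learning outcomes. The key to the approach is to extract SRL behavior sequences from trace data of students' interactions with GenAI while they complete a problem-solving task in an intelligent tutoring system, and to analyze their purpose of use from an information-processing perspective (information acquisition versus information transformation). Sequential and clustering analysis divides participants into two groups with distinct SRL patterns, revealing differences in the frequency and temporal characteristics of GenAI use and providing empirical evidence for pedagogical design and the improvement of GenAI-assisted learning environments.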

Link: https://arxiv.org/abs/2601.17000
Authors: Jie Gao,Shasha Li,Jianhua Zhang,Shan Li,Tingting Wang
Affiliation: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:There has been a growing trend in employing generative artificial intelligence (GenAI) techniques to support learning. Moreover, scholars have reached a consensus on the critical role of self-regulated learning (SRL) in ensuring learning effectiveness within GenAI-assisted learning environments, making it essential to capture students’ dynamic SRL patterns. In this study, we extracted students’ interaction patterns with GenAI from trace data as they completed a problem-solving task within a GenAI-assisted intelligent tutoring system. Students’ purpose of using GenAI was also analyzed from the perspective of information processing, i.e., information acquisition and information transformation. Using sequential and clustering analysis, this study classified participants into two groups based on their SRL sequences. These two groups differed in the frequency and temporal characteristics of GenAI use. In addition, most students used GenAI for information acquisition rather than information transformation, while the correlation between the purpose of using GenAI and learning performance was not statistically significant. Our findings inform both pedagogical design and the development of GenAI-assisted learning environments.

[AI-125] BibAgent: An Agentic Framework for Traceable Miscitation Detection in Scientific Literature

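[Quick Read]: This paper addresses the erosion of citation integrity in scientific literature caused by widespread miscitation, which ranges from subtle distortions to fabricated references. Systematic citation verification is currently out of reach: manual review cannot keep up with modern publishing volumes, and existing automated tools are limited to abstract-only analysis or to small, domain-specific datasets, partly because of the "paywall barrier" to full-text access. The key to the solution is BibAgent, a scalable end-to-end agentic framework that integrates retrieval, reasoning, and adaptive evidence aggregation and applies different strategies to accessible and paywalled sources. For paywalled references, it introduces a novel Evidence Committee mechanism that infers citation validity from the consensus of downstream citations. The approach improves the accuracy and interpretability of citation verification and enables large-scale, cross-disciplinary detection of citation misalignments.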

Link: https://arxiv.org/abs/2601.16993
Authors: Peiran Li,Fangzhou Lin,Shuo Xing,Xiang Zheng,Xi Hong,Jiashuo Sun,Zhengzhong Tu,Chaoqun Ni
Affiliation: Unknown
Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Citations are the bedrock of scientific authority, yet their integrity is compromised by widespread miscitations: ranging from nuanced distortions to fabricated references. Systematic citation verification is currently unfeasible; manual review cannot scale to modern publishing volumes, while existing automated tools are restricted by abstract-only analysis or small-scale, domain-specific datasets in part due to the “paywall barrier” of full-text access. We introduce BibAgent, a scalable, end-to-end agentic framework for automated citation verification. BibAgent integrates retrieval, reasoning, and adaptive evidence aggregation, applying distinct strategies for accessible and paywalled sources. For paywalled references, it leverages a novel Evidence Committee mechanism that infers citation validity via downstream citation consensus. To support systematic evaluation, we contribute a 5-category Miscitation Taxonomy and MisciteBench, a massive cross-disciplinary benchmark comprising 6,350 miscitation samples spanning 254 fields. Our results demonstrate that BibAgent outperforms state-of-the-art Large Language Model (LLM) baselines in citation verification accuracy and interpretability, providing scalable, transparent detection of citation misalignments across the scientific literature.

[AI-126] Sparsity-Aware Low-Rank Representation for Efficient Fine-Tuning of Large Language Models

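[Quick Read]: This paper addresses the high storage and compute overhead of fine-tuning Large Language Models (LLMs) in resource-constrained environments. Low-Rank Adaptation (LoRA) reduces the number of trainable parameters, but the underlying dense weights still impose significant storage and computation costs, and applying magnitude-based pruning directly to LoRA usually degrades performance. The authors propose SALR (Sparsity-Aware Low-Rank Representation), a fine-tuning paradigm that unifies low-rank adaptation and sparse pruning under a mean squared error (MSE) framework: the frozen base weights are statically pruned and the discarded residual information is recovered with a truncated-SVD low-rank adapter, which provably reduces per-entry MSE by a factor of $1 - r/\min(d,k)$. In addition, multiple low-rank adapters are fused into a single GEMM operation, and a bitmap-based encoding with a two-stage pipelined decoding + GEMM design delivers actual model compression and inference speedup while preserving performance.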

Link: https://arxiv.org/abs/2601.16991
Authors: Longteng Zhang,Sen Wu,Shuai Hou,Zhengyu Qing,Zhuo Zheng,Danning Ke,Qihong Lin,Qiang Wang,Shaohuai Shi,Xiaowen Chu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Adapting large pre-trained language models to downstream tasks often entails fine-tuning millions of parameters or deploying costly dense weight updates, which hinders their use in resource-constrained environments. Low-rank Adaptation (LoRA) reduces trainable parameters by factorizing weight updates, yet the underlying dense weights still impose high storage and computation costs. Magnitude-based pruning can yield sparse models but typically degrades LoRA’s performance when applied naively. In this paper, we introduce SALR (Sparsity-Aware Low-Rank Representation), a novel fine-tuning paradigm that unifies low-rank adaptation with sparse pruning under a rigorous mean-squared-error framework. We prove that statically pruning only the frozen base weights minimizes the pruning error bound, and we recover the discarded residual information via a truncated-SVD low-rank adapter, which provably reduces per-entry MSE by a factor of $(1 - r/\min(d,k))$. To maximize hardware efficiency, we fuse multiple low-rank adapters into a single concatenated GEMM, and we adopt a bitmap-based encoding with a two-stage pipelined decoding + GEMM design to achieve true model compression and speedup. Empirically, SALR attains 50% sparsity on various LLMs while matching the performance of LoRA on GSM8K and MMLU, reduces model size by 2×, and delivers up to a 1.7× inference speedup.
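
The prune-then-recover idea in the abstract can be sketched on a tiny matrix: magnitude-prune the weights, then approximate the discarded residual with a low-rank term. The code below is a pure-Python toy, not SALR itself. It assumes rank r = 1, uses power iteration in place of a library SVD, assumes the residual is nonzero, and all names and numbers are hypothetical.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def frob_sq(M):
    """Squared Frobenius norm."""
    return sum(x * x for row in M for x in row)

def magnitude_prune(W, keep):
    """Zero all but the `keep` largest-magnitude entries of W."""
    ranked = sorted(((abs(x), i, j) for i, row in enumerate(W)
                     for j, x in enumerate(row)), reverse=True)
    kept = {(i, j) for _, i, j in ranked[:keep]}
    return [[x if (i, j) in kept else 0.0 for j, x in enumerate(row)]
            for i, row in enumerate(W)]

def rank1_approx(R, iters=100):
    """Rank-1 truncated SVD of R, with the top right singular vector
    found by power iteration on R^T R."""
    Rt = transpose(R)
    v = [1.0] * len(R[0])
    for _ in range(iters):
        v = matvec(Rt, matvec(R, v))
        n = sum(x * x for x in v) ** 0.5
        v = [x / n for x in v]
    u_scaled = matvec(R, v)  # equals sigma * u
    return [[a * b for b in v] for a in u_scaled]

W = [[0.9, -0.1, 0.2], [0.05, -0.8, 0.1], [0.3, 0.15, 0.7]]
W_sparse = magnitude_prune(W, keep=4)
residual = [[w - s for w, s in zip(rw, rs)] for rw, rs in zip(W, W_sparse)]
adapter = rank1_approx(residual)

err_pruned = frob_sq(residual)
err_with_adapter = frob_sq([[r - a for r, a in zip(rr, ra)]
                            for rr, ra in zip(residual, adapter)])
print(err_pruned, err_with_adapter)
```

Adding the rank-1 term to the pruned matrix brings it closer to the original weights than pruning alone, since the approximation removes the residual's largest singular component. A higher rank would recover more, at the cost of more adapter parameters.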

[AI-127] Breaking Task Impasses Quickly: Adaptive Neuro-Symbolic Learning for Open-World Robotics ICRA2025

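[Quick Read]: This paper addresses the limited ability of autonomous systems to adapt to unforeseen novelties in open-world environments, targeting in particular the sample inefficiency, slow adaptation, and catastrophic forgetting of hybrid planning and Reinforcement Learning (RL) methods. The key to the solution is a neuro-symbolic framework that integrates hierarchical abstractions, Task and Motion Planning (TAMP), and reinforcement learning. By combining symbolic goal-oriented learning with world-model-based exploration, it lets robots adapt rapidly to environmental changes, with reported gains in convergence speed, sample efficiency, and robustness.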

Link: https://arxiv.org/abs/2601.16985
Authors: Pierrick Lorang
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: IEEE ICRA 2025 Doctoral Consortium

Abstract:Adapting to unforeseen novelties in open-world environments remains a major challenge for autonomous systems. While hybrid planning and reinforcement learning (RL) approaches show promise, they often suffer from sample inefficiency, slow adaptation, and catastrophic forgetting. We present a neuro-symbolic framework integrating hierarchical abstractions, task and motion planning (TAMP), and reinforcement learning to enable rapid adaptation in robotics. Our architecture combines symbolic goal-oriented learning and world model-based exploration to facilitate rapid adaptation to environmental changes. Validated in robotic manipulation and autonomous driving, our approach achieves faster convergence, improved sample efficiency, and superior robustness over state-of-the-art hybrid methods, demonstrating its potential for real-world deployment.

[AI-128] Optimal Use of Preferences in Artificial Intelligence Algorithms

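[Quick read]: This paper addresses the question of how best to embed preferences in machine learning systems, namely the trade-off between embedding preferences directly during training (preference embedding) and training a preference-free model first and then introducing preferences through post-processing (preference-free training with ex post preference application). Its core contribution is a diminishing-value-of-information condition: relative to a fixed, normalised preference-free loss, preference embedding lowers the marginal value of information, which induces a mean-preserving contraction of the learned posteriors. Because the value of information is convex in beliefs, preference-free training weakly dominates for any expected-utility decision problem. This gives a theoretical basis for modular AI pipeline design: first learn calibrated probabilities, then implement asymmetric costs through downstream decision rules. When cognitive constraints bind (for example, human computational limits in human-AI decision-making), preference embedding can do better because it automates the threshold computation.

Link: https://arxiv.org/abs/2601.18732
Authors: Joshua S. Gans
Affiliation: unknown
Categories: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI)
Comments: 54 pages, 2 figures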

Abstract: Machine learning systems embed preferences either in training losses or through post-processing of calibrated predictions. Applying information design methods from Strack and Yang (2024), this paper provides decision-problem-agnostic conditions under which separation (training preference-free and applying preferences ex post) is optimal. Unlike prior work that requires specifying downstream objectives, the welfare results here apply uniformly across decision problems. The key primitive is a diminishing-value-of-information condition: relative to a fixed (normalised) preference-free loss, preference embedding makes informativeness less valuable at the margin, inducing a mean-preserving contraction of learned posteriors. Because the value of information is convex in beliefs, preference-free training weakly dominates for any expected utility decision problem. This provides theoretical foundations for modular AI pipelines that learn calibrated probabilities and implement asymmetric costs through downstream decision rules. However, separation requires users to implement optimal decision rules. When cognitive constraints bind, as documented in human AI decision-making, preference embedding can dominate by automating threshold computation. These results provide design guidance: preserve optionality through post-processing when objectives may shift; embed preferences when decision-stage frictions dominate.
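The "downstream decision rule" half of the modular pipeline can be illustrated with the textbook cost-sensitive threshold. This is a generic sketch, not taken from the paper: the function names and the two-cost setup are assumptions. Given a calibrated probability p of the positive class, acting costs (1 - p) times the false-positive cost in expectation, and not acting costs p times the false-negative cost, so acting is preferable once p reaches c_fp / (c_fp + c_fn).

```python
def decision_threshold(cost_false_positive, cost_false_negative):
    """Probability at or above which acting has lower expected cost than not acting."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)

def decide(probability, cost_false_positive, cost_false_negative):
    """Apply the cost-sensitive rule to a calibrated probability."""
    return probability >= decision_threshold(cost_false_positive, cost_false_negative)
```

The model that produced the probability never needs retraining when the costs change; only the threshold moves. This is the optionality the abstract refers to, and also the threshold computation that a user would have to carry out correctly.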

[AI-129] Point transformer for protein structural heterogeneity analysis using CryoEM

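[Quick read]: This paper addresses the difficulty of disentangling and interpreting the dynamic modes of complex protein systems with multiple degrees of freedom, in particular how to identify and characterise structural heterogeneity in cryogenic electron microscopy (CryoEM) data more effectively. The key to its solution is Point Transformer, a self-attention network designed for point cloud analysis, which improves heterogeneity analysis in the computational processing of CryoEM images and characterises the dynamics of highly complex protein systems in a more human-interpretable way.

Link: https://arxiv.org/abs/2601.18713
Authors: Muyuan Chen, Muchen Li, Renjie Liao
Affiliation: unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)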

Abstract: Structural dynamics of macromolecules are critical to their structure-function relationship. Cryogenic electron microscopy (CryoEM) provides snapshots of vitrified protein at different compositional and conformational states, and the structural heterogeneity of proteins can be characterized through computational analysis of the images. For protein systems with multiple degrees of freedom, it is still challenging to disentangle and interpret the different modes of dynamics. Here, by implementing Point Transformer, a self-attention network designed for point cloud analysis, we are able to improve the performance of heterogeneity analysis on CryoEM data, and characterize the dynamics of highly complex protein systems in a more human-interpretable way.

[AI-130] Emergent Cooperation in Quantum Multi-Agent Reinforcement Learning Using Communication

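[Quick read]: This paper addresses the lack of effective mechanisms for fostering emergent cooperation between agents in Quantum Multi-Agent Reinforcement Learning (QMARL), particularly in Sequential Social Dilemmas (SSDs). Its key solution is to introduce several communication-based cooperation mechanisms: Mutual Acknowledgment Token Exchange (MATE), its extension Mutually Endorsed Distributed Incentive Acknowledgment Token Exchange (MEDIATE), the peer-rewarding mechanism Gifting, and Reinforced Inter-Agent Learning (RIAL). Experiments on three SSD tasks (the Iterated Prisoner's Dilemma, Iterated Stag Hunt, and Iterated Game of Chicken) show that MATE with a temporal-difference measure ($\mathrm{MATE}_{\mathrm{TD}}$), AutoMATE, MEDIATE-I, and MEDIATE-S achieve high cooperation levels, indicating that communication is a viable route to emergent cooperation in QMARL.

Link: https://arxiv.org/abs/2601.18419
Authors: Michael Kölle, Christian Reff, Leo Sünkel, Julian Hager, Gerhard Stenzel, Claudia Linnhoff-Popien
Affiliation: unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Accepted at IEEE ICC 2026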

Abstract: Emergent cooperation in classical Multi-Agent Reinforcement Learning has gained significant attention, particularly in the context of Sequential Social Dilemmas (SSDs). While classical reinforcement learning approaches have demonstrated capability for emergent cooperation, research on extending these methods to Quantum Multi-Agent Reinforcement Learning remains limited, particularly through communication. In this paper, we apply communication approaches to quantum Q-Learning agents: the Mutual Acknowledgment Token Exchange (MATE) protocol, its extension Mutually Endorsed Distributed Incentive Acknowledgment Token Exchange (MEDIATE), the peer rewarding mechanism Gifting, and Reinforced Inter-Agent Learning (RIAL). We evaluate these approaches in three SSDs: the Iterated Prisoner’s Dilemma, Iterated Stag Hunt, and Iterated Game of Chicken. Our experimental results show that approaches using MATE with temporal-difference measure ($\mathrm{MATE}_{\mathrm{TD}}$), AutoMATE, MEDIATE-I, and MEDIATE-S achieved high cooperation levels across all dilemmas, demonstrating that communication is a viable mechanism for fostering emergent cooperation in Quantum Multi-Agent Reinforcement Learning.

Machine Learning

[LG-0] Benchmarking Machine Learning Models for IoT Malware Detection under Data Scarcity and Drift

Link: https://arxiv.org/abs/2601.18736
Authors: Jake Lyon, Ehsan Saeedizade, Shamik Sengupta
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)

Abstract:The rapid expansion of the Internet of Things (IoT) in domains such as smart cities, transportation, and industrial systems has heightened the urgency of addressing their security vulnerabilities. IoT devices often operate under limited computational resources, lack robust physical safeguards, and are deployed in heterogeneous and dynamic networks, making them prime targets for cyberattacks and malware applications. Machine learning (ML) offers a promising approach to automated malware detection and classification, but practical deployment requires models that are both effective and lightweight. The goal of this study is to investigate the effectiveness of four supervised learning models (Random Forest, LightGBM, Logistic Regression, and a Multi-Layer Perceptron) for malware detection and classification using the IoT-23 dataset. We evaluate model performance in both binary and multiclass classification tasks, assess sensitivity to training data volume, and analyze temporal robustness to simulate deployment in evolving threat landscapes. Our results show that tree-based models achieve high accuracy and generalization, even with limited training data, while performance deteriorates over time as malware diversity increases. These findings underscore the importance of adaptive, resource-efficient ML models for securing IoT systems in real-world environments.

[LG-1] Riemannian AmbientFlow: Towards Simultaneous Manifold Learning and Generative Modeling from Corrupted Data

Link: https://arxiv.org/abs/2601.18728
Authors: Willem Diepeveen, Oscar Leong
Categories: Machine Learning (cs.LG); Differential Geometry (math.DG); Optimization and Control (math.OC); Statistics Theory (math.ST)

Abstract:Modern generative modeling methods have demonstrated strong performance in learning complex data distributions from clean samples. In many scientific and imaging applications, however, clean samples are unavailable, and only noisy or linearly corrupted measurements can be observed. Moreover, latent structures, such as manifold geometries, present in the data are important to extract for further downstream scientific analysis. In this work, we introduce Riemannian AmbientFlow, a framework for simultaneously learning a probabilistic generative model and the underlying, nonlinear data manifold directly from corrupted observations. Building on the variational inference framework of AmbientFlow, our approach incorporates data-driven Riemannian geometry induced by normalizing flows, enabling the extraction of manifold structure through pullback metrics and Riemannian Autoencoders. We establish theoretical guarantees showing that, under appropriate geometric regularization and measurement conditions, the learned model recovers the underlying data distribution up to a controllable error and yields a smooth, bi-Lipschitz manifold parametrization. We further show that the resulting smooth decoder can serve as a principled generative prior for inverse problems with recovery guarantees. We empirically validate our approach on low-dimensional synthetic manifolds and on MNIST.

[LG-2] Analyzing Images of Blood Cells with Quantum Machine Learning Methods: Equilibrium Propagation and Variational Quantum Circuits to Detect Acute Myeloid Leukemia

Link: https://arxiv.org/abs/2601.18710
Authors: A. Bano (Rutgers University), L. Liebovitch (Columbia University)
Categories: Emerging Technologies (cs.ET); Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: 5 pages, 1 figure, 2 tables

Abstract: This paper presents a feasibility study demonstrating that quantum machine learning (QML) algorithms achieve competitive performance on real-world medical imaging despite operating under severe constraints. We evaluate Equilibrium Propagation (EP), an energy-based learning method that does not use backpropagation (incompatible with quantum systems due to state-collapsing measurements) and Variational Quantum Circuits (VQCs) for automated detection of Acute Myeloid Leukemia (AML) from blood cell microscopy images using binary classification (2 classes: AML vs. Healthy). Key Result: Using limited subsets (50-250 samples per class) of the AML-Cytomorphology dataset (18,365 expert-annotated images), quantum methods achieve performance only 12-15% below classical CNNs despite reduced image resolution (64x64 pixels), engineered features (20D), and classical simulation via Qiskit. EP reaches 86.4% accuracy (only 12% below CNN) without backpropagation, while the 4-qubit VQC attains 83.0% accuracy with consistent data efficiency: VQC maintains stable 83% performance with only 50 samples per class, whereas CNN requires 250 samples (5x more data) to reach 98%. These results establish reproducible baselines for QML in healthcare, validating NISQ-era feasibility.

[LG-3] Explainability Methods for Hardware Trojan Detection: A Systematic Comparison

Link: https://arxiv.org/abs/2601.18696
Authors: Paul Whitten, Francis Wolff, Chris Papachristou
Categories: Machine Learning (cs.LG)

Abstract: Hardware trojan detection requires accurate identification and interpretable explanations for security engineers to validate and act on results. This work compares three explainability categories for gate-level trojan detection on the Trust-Hub benchmark: (1) domain-aware property-based analysis of 31 circuit-specific features from gate fanin patterns, flip-flop distances, and I/O connectivity; (2) case-based reasoning using k-nearest neighbors for precedent-based explanations; and (3) model-agnostic feature attribution (LIME, SHAP, gradient). Results show different advantages per approach. Property-based analysis provides explanations through circuit concepts like “high fanin complexity near outputs indicates potential triggers.” Case-based reasoning achieves 97.4% correspondence between predictions and training exemplars, offering justifications grounded in precedent. LIME and SHAP provide feature attributions with strong inter-method correlation (r=0.94, p<0.001) but lack circuit-level context for validation. XGBoost classification achieves 46.15% precision and 52.17% recall on 11,392 test samples, a 9-fold precision improvement over prior work (Hasegawa et al.: 5.13%) while reducing false positive rates from 5.6% to 0.25%. Gradient-based attribution runs 481 times faster than SHAP but provides similar domain-opaque insights. This work demonstrates that property-based and case-based approaches offer domain alignment and precedent-based interpretability compared to generic feature rankings, with implications for XAI deployment where practitioners must validate ML predictions.

[LG-4] Quasi Monte Carlo methods enable extremely low-dimensional deep generative models

Link: https://arxiv.org/abs/2601.18676
Authors: Miles Martinez, Alex H. Williams
Categories: Machine Learning (cs.LG)

Abstract:This paper introduces quasi-Monte Carlo latent variable models (QLVMs): a class of deep generative models that are specialized for finding extremely low-dimensional and interpretable embeddings of high-dimensional datasets. Unlike standard approaches, which rely on a learned encoder and variational lower bounds, QLVMs directly approximate the marginal likelihood by randomized quasi-Monte Carlo integration. While this brute force approach has drawbacks in higher-dimensional spaces, we find that it excels in fitting one, two, and three dimensional deep latent variable models. Empirical results on a range of datasets show that QLVMs consistently outperform conventional variational autoencoders (VAEs) and importance weighted autoencoders (IWAEs) with matched latent dimensionality. The resulting embeddings enable transparent visualization and post hoc analyses such as nonparametric density estimation, clustering, and geodesic path computation, which are nontrivial to validate in higher-dimensional spaces. While our approach is compute-intensive and struggles to generate fine-scale details in complex datasets, it offers a compelling solution for applications prioritizing interpretability and latent space analysis.

[LG-5] A Dynamic Framework for Grid Adaptation in Kolmogorov-Arnold Networks

Link: https://arxiv.org/abs/2601.18672
Authors: Spyros Rigas, Thanasis Papaioannou, Panagiotis Trakadas, Georgios Alexandridis
Categories: Machine Learning (cs.LG)

Abstract:Kolmogorov-Arnold Networks (KANs) have recently demonstrated promising potential in scientific machine learning, partly due to their capacity for grid adaptation during training. However, existing adaptation strategies rely solely on input data density, failing to account for the geometric complexity of the target function or metrics calculated during network training. In this work, we propose a generalized framework that treats knot allocation as a density estimation task governed by Importance Density Functions (IDFs), allowing training dynamics to determine grid resolution. We introduce a curvature-based adaptation strategy and evaluate it across synthetic function fitting, regression on a subset of the Feynman dataset and different instances of the Helmholtz PDE, demonstrating that it significantly outperforms the standard input-based baseline. Specifically, our method yields average relative error reductions of 25.3% on synthetic functions, 9.4% on the Feynman dataset, and 23.3% on the PDE benchmark. Statistical significance is confirmed via Wilcoxon signed-rank tests, establishing curvature-based adaptation as a robust and computationally efficient alternative for KAN training.
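Treating knot allocation as density estimation can be sketched as inverse-CDF placement: knots sit at equal quantiles of a cumulative importance curve. The code below is an illustration under assumptions, not the paper's method. The abstract does not define its Importance Density Functions, so the finite-difference curvature estimate, the `floor` term, the function names, and the tanh test function are all hypothetical stand-ins.

```python
import math

def curvature_importance(xs, ys):
    """Absolute second finite difference on a uniform grid; endpoints copy their neighbours."""
    h = xs[1] - xs[0]
    inner = [abs(ys[i + 1] - 2 * ys[i] + ys[i - 1]) / (h * h) for i in range(1, len(xs) - 1)]
    return [inner[0]] + inner + [inner[-1]]

def allocate_knots(xs, importance, num_knots, floor=1e-3):
    """Place knots at equal quantiles of the cumulative (trapezoidal) importance."""
    # The floor keeps the density strictly positive so flat regions still get coverage.
    w = [v + floor for v in importance]
    cdf = [0.0]
    for i in range(1, len(xs)):
        cdf.append(cdf[-1] + 0.5 * (w[i] + w[i - 1]) * (xs[i] - xs[i - 1]))
    total = cdf[-1]

    knots = []
    j = 0
    for k in range(num_knots):
        target = total * k / (num_knots - 1)
        # Advance to the grid cell whose cumulative mass brackets the target.
        while j < len(xs) - 2 and cdf[j + 1] < target:
            j += 1
        frac = (target - cdf[j]) / (cdf[j + 1] - cdf[j])
        frac = min(1.0, max(0.0, frac))
        knots.append(xs[j] + frac * (xs[j + 1] - xs[j]))
    return knots

# Hypothetical target: a steep transition at x = 0 on the interval [-1, 1].
xs = [-1 + 2 * i / 400 for i in range(401)]
ys = [math.tanh(20 * x) for x in xs]
knots = allocate_knots(xs, curvature_importance(xs, ys), num_knots=9)
```

In this sketch the interior knots cluster around the steep transition near zero, whereas a grid driven only by a uniform input density would spread them evenly across the interval.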

[LG-6] TwinPurify: Purifying gene expression data to reveal tumor-intrinsic transcriptional programs via self-supervised learning

Link: https://arxiv.org/abs/2601.18640
Authors: Zhiwei Zheng, Kevin Bryson
Categories: Machine Learning (cs.LG); Molecular Networks (q-bio.MN)

Abstract: Advances in single-cell and spatial transcriptomic technologies have transformed tumor ecosystem profiling at cellular resolution. However, large scale studies on patient cohorts continue to rely on bulk transcriptomic data, where variation in tumor purity obscures tumor-intrinsic transcriptional signals and constrains downstream discovery. Many deconvolution methods report strong performance on synthetic bulk mixtures but fail to generalize to real patient cohorts because of unmodeled biological and technical variation. Here, we introduce TwinPurify, a representation learning framework that adapts the Barlow Twins self-supervised objective, representing a fundamental departure from the deconvolution paradigm. Rather than resolving the bulk mixture into discrete cell-type fractions, TwinPurify instead learns continuous, high-dimensional tumor embeddings by leveraging adjacent-normal profiles within the same cohort as “background” guidance, enabling the disentanglement of tumor-specific signals without relying on any external reference. Benchmarked against multiple large cancer cohorts across RNA-seq and microarray platforms, TwinPurify outperforms conventional representation learning baselines like auto-encoders in recovering tumor-intrinsic and immune signals. The purified embeddings improve molecular subtype and grade classification, enhance survival model concordance, and uncover biologically meaningful pathway activities compared to raw bulk profiles. By providing a transferable framework for decontaminating bulk transcriptomics, TwinPurify extends the utility of existing clinical datasets for molecular discovery.

[LG-7] Physics-Informed Uncertainty Enables Reliable AI-driven Design

Link: https://arxiv.org/abs/2601.18638
Authors: Tingkai Xue, Chin Chun Ooi, Yang Jiang, Luu Trung Pham Duong, Pao-Hsiung Chiu, Weijiang Zhao, Nagarajan Raghavan, My Ha Dao
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

Abstract:Inverse design is a central goal in much of science and engineering, including frequency-selective surfaces (FSS) that are critical to microelectronics for telecommunications and optical metamaterials. Traditional surrogate-assisted optimization methods using deep learning can accelerate the design process but do not usually incorporate uncertainty quantification, leading to poorer optimization performance due to erroneous predictions in data-sparse regions. Here, we introduce and validate a fundamentally different paradigm of Physics-Informed Uncertainty, where the degree to which a model’s prediction violates fundamental physical laws serves as a computationally-cheap and effective proxy for predictive uncertainty. By integrating physics-informed uncertainty into a multi-fidelity uncertainty-aware optimization workflow to design complex frequency-selective surfaces within the 20 - 30 GHz range, we increase the success rate of finding performant solutions from less than 10% to over 50%, while simultaneously reducing computational cost by an order of magnitude compared to the sole use of a high-fidelity solver. These results highlight the necessity of incorporating uncertainty quantification in machine-learning-driven inverse design for high-dimensional problems, and establish physics-informed uncertainty as a viable alternative to quantifying uncertainty in surrogate models for physical systems, thereby setting the stage for autonomous scientific discovery systems that can efficiently and robustly explore and evaluate candidate designs.

[LG-8] CASSANDRA: Programmatic and Probabilistic Learning and Inference for Stochastic World Modeling

Link: https://arxiv.org/abs/2601.18620
Authors: Panagiotis Lymperopoulos, Abhiramon Rajasekharan, Ian Berlot-Attwell, Stéphane Aroca-Ouellette, Kaheer Suleman
Categories: Machine Learning (cs.LG)
Comments: 28 pages, 2 figures

Abstract:Building world models is essential for planning in real-world domains such as businesses. Since such domains have rich semantics, we can leverage world knowledge to effectively model complex action effects and causal relationships from limited data. In this work, we propose CASSANDRA, a neurosymbolic world modeling approach that leverages an LLM as a knowledge prior to construct lightweight transition models for planning. CASSANDRA integrates two components: (1) LLM-synthesized code to model deterministic features, and (2) LLM-guided structure learning of a probabilistic graphical model to capture causal relationships among stochastic variables. We evaluate CASSANDRA in (i) a small-scale coffee-shop simulator and (ii) a complex theme park business simulator, where we demonstrate significant improvements in transition prediction and planning over baselines.

[LG-9] Geometry-Free Conditional Diffusion Modeling for Solving the Inverse Electrocardiography Problem

Link: https://arxiv.org/abs/2601.18615
Authors: Ramiro Valdes Jara, Adam Meyers
Categories: Machine Learning (cs.LG)

Abstract:This paper proposes a data-driven model for solving the inverse problem of electrocardiography, the mathematical problem that forms the basis of electrocardiographic imaging (ECGI). We present a conditional diffusion framework that learns a probabilistic mapping from noisy body surface signals to heart surface electric potentials. The proposed approach leverages the generative nature of diffusion models to capture the non-unique and underdetermined nature of the ECGI inverse problem, enabling probabilistic sampling of multiple reconstructions rather than a single deterministic estimate. Unlike traditional methods, the proposed framework is geometry-free and purely data-driven, alleviating the need for patient-specific mesh construction. We evaluate the method on a real ECGI dataset and compare it against strong deterministic baselines, including a convolutional neural network, long short-term memory network, and transformer-based model. The results demonstrate that the proposed diffusion approach achieves improved reconstruction accuracy, highlighting the potential of diffusion models as a robust tool for noninvasive cardiac electrophysiology imaging.

[LG-10] LaCoGSEA: Unsupervised deep learning for pathway analysis via latent correlation

Link: https://arxiv.org/abs/2601.18604
Authors: Zhiwei Zheng, Kevin Bryson
Categories: Machine Learning (cs.LG); Genomics (q-bio.GN)

Abstract:Motivation: Pathway enrichment analysis is widely used to interpret gene expression data. Standard approaches, such as GSEA, rely on predefined phenotypic labels and pairwise comparisons, which limits their applicability in unsupervised settings. Existing unsupervised extensions, including single-sample methods, provide pathway-level summaries but primarily capture linear relationships and do not explicitly model gene-pathway associations. More recently, deep learning models have been explored to capture non-linear transcriptomic structure. However, their interpretation has typically relied on generic explainable AI (XAI) techniques designed for feature-level attribution. As these methods are not designed for pathway-level interpretation in unsupervised transcriptomic analyses, their effectiveness in this setting remains limited. Results: To bridge this gap, we introduce LaCoGSEA (Latent Correlation GSEA), an unsupervised framework that integrates deep representation learning with robust pathway statistics. LaCoGSEA employs an autoencoder to capture non-linear manifolds and proposes a global gene-latent correlation metric as a proxy for differential expression, generating dense gene rankings without prior labels. We demonstrate that LaCoGSEA offers three key advantages: (i) it achieves improved clustering performance in distinguishing cancer subtypes compared to existing unsupervised baselines; (ii) it recovers a broader range of biologically meaningful pathways at higher ranks compared with linear dimensionality reduction and gradient-based XAI methods; and (iii) it maintains high robustness and consistency across varying experimental protocols and dataset sizes. Overall, LaCoGSEA provides state-of-the-art performance in unsupervised pathway enrichment analysis. 
Availability and implementation: this https URL

[LG-11] K-Myriad: Jump-starting reinforcement learning with unsupervised parallel agents

Link: https://arxiv.org/abs/2601.18580
Authors: Vincenzo De Paola, Mirco Mutti, Riccardo Zamboni, Marcello Restelli
Categories: Machine Learning (cs.LG)

Abstract:Parallelization in Reinforcement Learning is typically employed to speed up the training of a single policy, where multiple workers collect experience from an identical sampling distribution. This common design limits the potential of parallelization by neglecting the advantages of diverse exploration strategies. We propose K-Myriad, a scalable and unsupervised method that maximizes the collective state entropy induced by a population of parallel policies. By cultivating a portfolio of specialized exploration strategies, K-Myriad provides a robust initialization for Reinforcement Learning, leading to both higher training efficiency and the discovery of heterogeneous solutions. Experiments on high-dimensional continuous control tasks, with large-scale parallelization, demonstrate that K-Myriad can learn a broad set of distinct policies, highlighting its effectiveness for collective exploration and paving the way towards novel parallelization strategies.

[LG-12] Information Hidden in Gradients of Regression with Target Noise

Link: https://arxiv.org/abs/2601.18546
Authors: Arash Jamshidi, Katsiaryna Haitsiukevich, Kai Puolamäki
Categories: Machine Learning (cs.LG)

Abstract: Second-order information, such as curvature or data covariance, is critical for optimisation, diagnostics, and robustness. However, in many modern settings, only the gradients are observable. We show that the gradients alone can reveal the Hessian, which equals the data covariance $\Sigma$ in linear regression. Our key insight is a simple variance calibration: injecting Gaussian noise so that the total target noise variance equals the batch size ensures that the empirical gradient covariance closely approximates the Hessian, even when evaluated far from the optimum. We provide non-asymptotic operator-norm guarantees under sub-Gaussian inputs. We also show that without such calibration, recovery can fail by an $\Omega(1)$ factor. The proposed method is practical (a "set target-noise variance to $n$" rule) and robust (variance $\mathcal{O}(n)$ suffices to recover $\Sigma$ up to scale). Applications include preconditioning for faster optimisation, adversarial risk estimation, and gradient-only training, for example, in distributed systems. We support our theoretical results with experiments on synthetic and real data.
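The variance-calibration rule can be tried in a small simulation. The sketch below is one reading of the abstract, not the authors' code: the covariance matrix, weights, batch size, and number of batches are assumed values, and the inputs are Gaussian for convenience. It draws many mini-batches of size n, sets the target-noise variance to n, computes the mini-batch gradient of the squared loss at a point away from the optimum, and compares the empirical covariance of those gradients with the input covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, num_batches = 3, 256, 2000   # n is the batch size (assumed values)

# Assumed ground-truth input covariance, which is the Hessian of the squared loss.
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 0.5]])
L = np.linalg.cholesky(Sigma)

w_star = np.array([1.0, -2.0, 0.5])        # true weights (hypothetical)
w = w_star + np.array([1.0, -1.0, 0.5])    # evaluation point, away from the optimum

# Draw all mini-batches at once: X has shape (num_batches, n, d).
X = rng.standard_normal((num_batches, n, d)) @ L.T
# Calibration rule from the abstract: total target-noise variance equals n.
noise = rng.standard_normal((num_batches, n)) * np.sqrt(n)
y = X @ w_star + noise

# Mini-batch gradient of the loss 0.5 * mean((x.w - y)^2) for every batch.
residual = X @ w - y
grads = (X * residual[..., None]).mean(axis=1)   # shape (num_batches, d)

# Empirical covariance of the gradients across batches.
Sigma_hat = np.cov(grads, rowvar=False)
rel_error = np.linalg.norm(Sigma_hat - Sigma) / np.linalg.norm(Sigma)
```

Averaging n per-sample gradients divides their covariance by n, and noise of variance n multiplies the dominant term by n, so the two cancel and the batch-gradient covariance lands near $\Sigma$. The offset between `w` and `w_star` contributes only a term of order 1/n.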

[LG-13] From Human Labels to Literature: Semi-Supervised Learning of NMR Chemical Shifts at Scale

Link: https://arxiv.org/abs/2601.18524
Authors: Yongqi Jin, Yecheng Wang, Jun-jie Wang, Rong Zhu, Guolin Ke, Weinan E
Categories: Machine Learning (cs.LG)

Abstract:Accurate prediction of nuclear magnetic resonance (NMR) chemical shifts is fundamental to spectral analysis and molecular structure elucidation, yet existing machine learning methods rely on limited, labor-intensive atom-assigned datasets. We propose a semi-supervised framework that learns NMR chemical shifts from millions of literature-extracted spectra without explicit atom-level assignments, integrating a small amount of labeled data with large-scale unassigned spectra. We formulate chemical shift prediction from literature spectra as a permutation-invariant set supervision problem, and show that under commonly satisfied conditions on the loss function, optimal bipartite matching reduces to a sorting-based loss, enabling stable large-scale semi-supervised training beyond traditional curated datasets. Our models achieve substantially improved accuracy and robustness over state-of-the-art methods and exhibit stronger generalization on significantly larger and more diverse molecular datasets. Moreover, by incorporating solvent information at scale, our approach captures systematic solvent effects across common NMR solvents for the first time. Overall, our results demonstrate that large-scale unlabeled spectra mined from the literature can serve as a practical and effective data source for training NMR shift models, suggesting a broader role of literature-derived, weakly structured data in data-centric AI for science.
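The claim that optimal bipartite matching reduces to a sorting-based loss can be checked on a toy case. The sketch below assumes a squared-error loss on scalar shifts, for which pairing the sorted predictions with the sorted peaks is a minimum-cost assignment. The function names and the ppm values are hypothetical, and the paper's exact loss and conditions are not given in the abstract.

```python
from itertools import permutations

def sorted_set_loss(predicted, observed):
    """Squared-error loss after pairing the sorted predictions with the sorted peaks."""
    return sum((p - o) ** 2 for p, o in zip(sorted(predicted), sorted(observed)))

def brute_force_matching_loss(predicted, observed):
    """Minimum squared-error loss over every one-to-one assignment (tiny inputs only)."""
    return min(
        sum((p - o) ** 2 for p, o in zip(predicted, perm))
        for perm in permutations(observed)
    )

predicted_shifts = [7.31, 1.25, 3.92, 2.10]   # hypothetical per-atom predictions (ppm)
observed_peaks = [2.05, 7.26, 1.30, 4.01]     # hypothetical unassigned peak list (ppm)

fast = sorted_set_loss(predicted_shifts, observed_peaks)
exact = brute_force_matching_loss(predicted_shifts, observed_peaks)
```

Sorting costs O(n log n) per molecule, while a general assignment solver is roughly cubic and the brute-force check above is factorial, which is why the reduction matters when training on millions of spectra.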

[LG-14] LipNeXt: Scaling up Lipschitz-based Certified Robustness to Billion-parameter Models ICLR2026

Link: https://arxiv.org/abs/2601.18513
Authors: Kai Hu, Haoqi Hu, Matt Fredrikson
Categories: Machine Learning (cs.LG)
Comments: ICLR 2026. 17 pages

Abstract: Lipschitz-based certification offers efficient, deterministic robustness guarantees but has struggled to scale in model size, training efficiency, and ImageNet performance. We introduce LipNeXt, the first constraint-free and convolution-free 1-Lipschitz architecture for certified robustness. LipNeXt is built using two techniques: (1) a manifold optimization procedure that updates parameters directly on the orthogonal manifold and (2) a Spatial Shift Module to model spatial pattern without convolutions. The full network uses orthogonal projections, spatial shifts, a simple 1-Lipschitz $\beta$-Abs nonlinearity, and $L_2$ spatial pooling to maintain tight Lipschitz control while enabling expressive feature mixing. Across CIFAR-10/100 and Tiny-ImageNet, LipNeXt achieves state-of-the-art clean and certified robust accuracy (CRA), and on ImageNet it scales to 1-2B large models, improving CRA over prior Lipschitz models (e.g., up to +8% at $\varepsilon=1$) while retaining efficient, stable low-precision training. These results demonstrate that Lipschitz-based certification can benefit from modern scaling trends without sacrificing determinism or efficiency.

[LG-15] Conformal Prediction Algorithms for Time Series Forecasting: Methods and Benchmark

Link: https://arxiv.org/abs/2601.18509
Authors: Andro Sabashvili
Categories: Machine Learning (cs.LG)

Abstract:Reliable uncertainty quantification is of critical importance in time series forecasting, yet traditional methods often rely on restrictive distributional assumptions. Conformal prediction (CP) has emerged as a promising distribution-free framework for generating prediction intervals with rigorous theoretical guarantees. However, applying CP to sequential data presents a primary challenge: the temporal dependencies inherent in time series fundamentally violate the core assumption of data exchangeability, upon which standard CP guarantees are built. This review critically examines the main categories of algorithmic solutions designed to address this conflict. We survey and benchmark methods that relax the exchangeability assumption, those that redefine the data unit to be a collection of independent time series, approaches that explicitly model the dynamics of the prediction residuals, and online learning algorithms that adapt to distribution shifts to maintain long-run coverage. By synthesizing these approaches, we highlight computational efficiency and practical performance on real-world data.
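For orientation, the baseline that the surveyed methods modify is split conformal prediction, sketched below in plain Python. This is the standard exchangeable-data recipe, not any specific algorithm from the review, and the function names are hypothetical. The half-width is the ceil((n + 1)(1 - alpha))-th smallest absolute calibration residual; its coverage guarantee relies on exchangeability, which is exactly what temporal dependence breaks.

```python
import math

def conformal_half_width(calibration_residuals, alpha):
    """Half-width q so that [forecast - q, forecast + q] targets 1 - alpha coverage."""
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    # Finite-sample rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf   # too few calibration points for this alpha
    return scores[rank - 1]

def prediction_interval(point_forecast, calibration_residuals, alpha=0.1):
    """Symmetric interval around a point forecast."""
    q = conformal_half_width(calibration_residuals, alpha)
    return point_forecast - q, point_forecast + q
```

For time series, the families listed in the abstract change what goes into `calibration_residuals` (for example weighting or windowing recent residuals) or adjust `alpha` online, rather than changing this quantile step itself.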

[LG-16] Nearly Optimal Bayesian Inference for Structural Missingness

Link: https://arxiv.org/abs/2601.18500
Authors: Chen Liang, Donghua Yang, Yutong Wang, Tianle Zhang, Shenghe Zhou, Zhiyu Liang, Hengtong Zhang, Hongzhi Wang, Ziqi Li, Xiyang Zhang, Zheng Liang, Yifei Li
Categories: Machine Learning (cs.LG)

Abstract:Structural missingness breaks ‘just impute and train’: values can be undefined by causal or logical constraints, and the mask may depend on observed variables, unobserved variables (MNAR), and other missingness indicators. It simultaneously brings (i) a catch-22 situation with causal loop, prediction needs the missing features, yet inferring them depends on the missingness mechanism, (ii) under MNAR, the unseen are different, the missing part can come from a shifted distribution, and (iii) plug-in imputation, a single fill-in can lock in uncertainty and yield overconfident, biased decisions. In the Bayesian view, prediction via the posterior predictive distribution integrates over the full model posterior uncertainty, rather than relying on a single point estimate. This framework decouples (i) learning an in-model missing-value posterior from (ii) label prediction by optimizing the predictive posterior distribution, enabling posterior integration. This decoupling yields an in-model almost-free-lunch: once the posterior is learned, prediction is plug-and-play while preserving uncertainty propagation. It achieves SOTA on 43 classification and 15 imputation benchmarks, with finite-sample near Bayes-optimality guarantees under our SCM prior.

[LG-17] Enhancing Control Policy Smoothness by Aligning Actions with Predictions from Preceding States AAAI-26

Link: https://arxiv.org/abs/2601.18479
Authors: Kyoleen Kwak, Hyoseok Hwang
Categories: Machine Learning (cs.LG)
Comments: Accepted at AAAI-26. 7 pages (excluding references), 3 figures

Abstract:Deep reinforcement learning has proven to be a powerful approach to solving control tasks, but its characteristic high-frequency oscillations make it difficult to apply in real-world environments. While prior methods have addressed action oscillations via architectural or loss-based methods, the latter typically depend on heuristic or synthetic definitions of state similarity to promote action consistency, which often fail to accurately reflect the underlying system dynamics. In this paper, we propose a novel loss-based method by introducing a transition-induced similar state. The transition-induced similar state is defined as the distribution of next states transitioned from the previous state. Since it utilizes only environmental feedback and actually collected data, it better captures system dynamics. Building upon this foundation, we introduce Action Smoothing by Aligning Actions with Predictions from Preceding States (ASAP), an action smoothing method that effectively mitigates action oscillations. ASAP enforces action smoothness by aligning the actions with those taken in transition-induced similar states and by penalizing second-order differences to suppress high-frequency oscillations. Experiments in Gymnasium and Isaac-Lab environments demonstrate that ASAP yields smoother control and improved policy performance over existing methods.
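
The second-order difference penalty mentioned above can be sketched as follows. This shows only that one term, under the assumption that it is a mean squared second difference over the action sequence; the alignment term with transition-induced similar states and the weighting used in ASAP are not given in the abstract and are not shown. The name `second_order_penalty` is hypothetical.

```python
import numpy as np

def second_order_penalty(actions):
    """Mean squared second-order difference of an action sequence.

    actions: array of shape (T, action_dim) with T >= 3, one row per time step.
    """
    a = np.asarray(actions, dtype=float)
    # a[t+1] - 2*a[t] + a[t-1] for every interior time step t.
    second_diff = a[2:] - 2.0 * a[1:-1] + a[:-2]
    return float(np.mean(np.sum(second_diff ** 2, axis=-1)))

smooth = np.linspace(0.0, 1.0, 50)[:, None]                # linear ramp
jittery = smooth + 0.1 * (-1.0) ** np.arange(50)[:, None]  # ramp plus oscillation
```

A linear ramp has zero second difference, so it incurs essentially no penalty, while the alternating component is penalized heavily. That is why this kind of term targets high-frequency oscillation without discouraging steady action changes.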

[LG-18] Fusion of Spatio-Temporal and Multi-Scale Frequency Features for Dry Electrodes MI-EEG Decoding

Link: https://arxiv.org/abs/2601.18424
Authors: Tianyi Gong, Can Han, Junxi Wu, Dahong Qian
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Abstract:Dry-electrode Motor Imagery Electroencephalography (MI-EEG) enables fast, comfortable, real-world Brain Computer Interface by eliminating gels and shortening setup for at-home and wearable use. However, dry recordings pose three main issues: lower Signal-to-Noise Ratio with more baseline drift and sudden transients; weaker and noisier data with poor phase alignment across trials; and bigger variances between sessions. These drawbacks lead to larger data distribution shift, making features less stable for MI-EEG decoding. To address these problems, we introduce STGMFM, a tri-branch framework tailored for dry-electrode MI-EEG, which models complementary spatio-temporal dependencies via dual graph orders, and captures robust envelope dynamics with a multi-scale frequency mixing branch, motivated by the observation that amplitude envelopes are less sensitive to contact variability than instantaneous waveforms. Physiologically meaningful connectivity priors guide learning, and decision-level fusion consolidates a noise-tolerant consensus. On our collected dry-electrode MI-EEG, STGMFM consistently surpasses competitive CNN/Transformer/graph baselines. Codes are available at this https URL.

[LG-19] Frequency-Based Hyperparameter Selection in Games

Link: https://arxiv.org/abs/2601.18409
Authors: Aniket Sanyal, Baraah A.M. Sidahmed, Rebekka Burkholz, Tatjana Chavdarova
Categories: Machine Learning (cs.LG)

Abstract:Learning in smooth games fundamentally differs from standard minimization due to rotational dynamics, which invalidate classical hyperparameter tuning strategies. Despite their practical importance, effective methods for tuning in games remain underexplored. A notable example is LookAhead (LA), which achieves strong empirical performance but introduces additional parameters that critically influence performance. We propose a principled approach to hyperparameter selection in games by leveraging frequency estimation of oscillatory dynamics. Specifically, we analyze oscillations both in continuous-time trajectories and through the spectrum of the discrete dynamics in the associated frequency-based space. Building on this analysis, we introduce Modal LookAhead (MoLA), an extension of LA that selects the hyperparameters adaptively to a given problem. We provide convergence guarantees and demonstrate in experiments that MoLA accelerates training in both purely rotational games and mixed regimes, all with minimal computational overhead.
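
To make the rotational dynamics concrete, the sketch below runs simultaneous gradient descent-ascent on the bilinear game f(x, y) = x * y, with and without plain LookAhead. The toy game, the step size, and the hand-picked values of `k` and `alpha` are assumptions for illustration. MoLA's adaptive, frequency-based selection of these hyperparameters is not implemented here.

```python
import numpy as np

def gda_step(z, lr):
    """One simultaneous gradient descent-ascent step on f(x, y) = x * y."""
    x, y = z
    # x descends along df/dx = y; y ascends along df/dy = x.
    return np.array([x - lr * y, y + lr * x])

def lookahead_gda(z0, lr=0.1, k=10, alpha=0.5, outer_steps=20):
    """LookAhead: run k fast GDA steps, then move the slow weights
    a fraction alpha toward the fast iterate."""
    slow = np.asarray(z0, dtype=float)
    for _ in range(outer_steps):
        fast = slow.copy()
        for _ in range(k):
            fast = gda_step(fast, lr)
        slow = slow + alpha * (fast - slow)
    return slow

z0 = np.array([1.0, 1.0])
plain = z0.copy()
for _ in range(200):            # same number of gradient steps as LookAhead
    plain = gda_step(plain, 0.1)
la = lookahead_gda(z0)
```

Plain GDA spirals away from the equilibrium at the origin, while the averaged LookAhead iterate contracts toward it. How well it contracts depends on how far the fast iterate rotates in `k` steps, which is the kind of dependence a frequency estimate can exploit.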

[LG-20] Superlinear Multi-Step Attention

Link: https://arxiv.org/abs/2601.18401
Authors: Yufeng Huang
Categories: Machine Learning (cs.LG)
Comments: 30 pages, 6 figures

Abstract:In this paper, we propose Superlinear attention, a fully trainable multi-step attention architecture that achieves subquadratic complexity for long sequences while preserving random context access (a.k.a. structural non-exclusion): no eligible token position is structurally excluded from being selected for attention. Superlinear attention reformulates standard causal self-attention as a multi-step search problem with $N$ steps, yielding an overall complexity of $O(L^{1+\frac{1}{N}})$. To illustrate the architecture, we present a baseline $N=2$ implementation, which is algorithmically analogous to standard jump search. In this $O(L^{3/2})$ instantiation, the first step performs $O(L^{3/2})$ span-search to select relevant spans of the sequence, and the second step applies $O(L^{3/2})$ span-attention (standard attention restricted to the selected spans). In an upscaled $O(L^{1.54})$ configuration for robustness, we achieve an average decoding throughput of 114 tokens/sec at 1M context length and 80 tokens/sec at 10M context in our implementation on a modified 30B hybrid MoE model on a single B200 GPU. With limited training, we also obtain strong performance on the NIAH (Needle In A Haystack) task up to 256K context length, demonstrating that the routed span selection is learnable end-to-end. This paper emphasizes architectural formulation, scaling analysis, and systems feasibility, and presents initial validation; comprehensive quality evaluations across diverse long-context tasks are left to future work.
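
For readers unfamiliar with the analogy, this is a minimal sketch of classical jump search on a sorted list, not the attention mechanism itself. It has the same two-step shape: a coarse pass over about sqrt(n) blocks, then a scan restricted to the selected block. Each lookup costs O(sqrt(n)), which for L queries matches the $O(L^{3/2})$ total quoted for the N=2 case.

```python
import math

def jump_search(sorted_values, target):
    """Return an index of target in sorted_values, or -1 if absent."""
    n = len(sorted_values)
    if n == 0:
        return -1
    step = max(1, math.isqrt(n))  # block size of about sqrt(n)
    # Step 1: jump block by block to find the block that may hold target.
    start = 0
    while start + step < n and sorted_values[start + step] <= target:
        start += step
    # Step 2: linear scan restricted to the selected block.
    for i in range(start, min(start + step, n)):
        if sorted_values[i] == target:
            return i
    return -1

values = list(range(0, 200, 2))  # 100 sorted even numbers
```

In the attention setting the comparison in step 1 would presumably be replaced by a learned relevance score over spans, which is what makes the routed selection trainable. That mapping is an interpretation of the abstract, not a description of the authors' code.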

[LG-21] Estimating Dense-Packed Zone Height in Liquid-Liquid Separation: A Physics-Informed Neural Network Approach

Link: https://arxiv.org/abs/2601.18399
Authors: Mehmet Velioglu, Song Zhai, Alexander Mitsos, Adel Mhamdi, Andreas Jupke, Manuel Dahmen
Categories: Machine Learning (cs.LG)
Comments: 37 pages, 13 figures, 3 tables

Abstract:Separating liquid-liquid dispersions in gravity settlers is critical in chemical, pharmaceutical, and recycling processes. The dense-packed zone height is an important performance and safety indicator but it is often expensive and impractical to measure due to optical limitations. We propose to estimate phase heights using only inexpensive volume flow measurements. To this end, a physics-informed neural network (PINN) is first pretrained on synthetic data and physics equations derived from a low-fidelity (approximate) mechanistic model to reduce the need for extensive experimental data. While the mechanistic model is used to generate synthetic training data, only volume balance equations are used in the PINN, since the integration of submodels describing droplet coalescence and sedimentation into the PINN would be computationally prohibitive. The pretrained PINN is then fine-tuned with scarce experimental data to capture the actual dynamics of the separator. We then employ the differentiable PINN as a predictive model in an Extended Kalman Filter inspired state estimation framework, enabling the phase heights to be tracked and updated from flow-rate measurements. We first test the two-stage trained PINN by forward simulation from a known initial state against the mechanistic model and a non-pretrained PINN. We then evaluate phase height estimation performance with the filter, comparing the two-stage trained PINN with a two-stage trained purely data-driven neural network. All model types are trained and evaluated using ensembles to account for model parameter uncertainty. In all evaluations, the two-stage trained PINN yields the most accurate phase-height estimates.

[LG-22] Making medical vision-language models think causally across modalities with retrieval-augmented cross-modal reasoning

Link: https://arxiv.org/abs/2601.18356
Authors: Weiqin Yang, Haowen Xue, Qingyi Peng, Hexuan Hu, Qian Huang, Tingbo Zhang
Categories: Machine Learning (cs.LG)

Abstract:Medical vision-language models (VLMs) achieve strong performance in diagnostic reporting and image-text alignment, yet their underlying reasoning mechanisms remain fundamentally correlational, exhibiting reliance on superficial statistical associations that fail to capture the causal pathophysiological mechanisms central to clinical decision-making. This limitation makes them fragile, prone to hallucinations, and sensitive to dataset biases. Retrieval-augmented generation (RAG) offers a partial remedy by grounding predictions in external knowledge. However, conventional RAG depends on semantic similarity, introducing new spurious correlations. We propose Multimodal Causal Retrieval-Augmented Generation, a framework that integrates causal inference principles with multimodal retrieval. It retrieves clinically relevant exemplars and causal graphs from external sources, conditioning model reasoning on counterfactual and interventional evidence rather than correlations alone. Applied to radiology report generation, diagnosis prediction, and visual question answering, it improves factual accuracy, robustness to distribution shifts, and interpretability. Our results highlight causal retrieval as a scalable path toward medical VLMs that think beyond pattern matching, enabling trustworthy multimodal reasoning in high-stakes clinical settings.

[LG-23] Structural Gender Bias in Credit Scoring: Proxy Leakage

Link: https://arxiv.org/abs/2601.18342
Authors: Navya SD, Sreekanth D, SS Uma Sankari
Categories: Machine Learning (cs.LG)

Abstract:As financial institutions increasingly adopt machine learning for credit risk assessment, the persistence of algorithmic bias remains a critical barrier to equitable financial inclusion. This study provides a comprehensive audit of structural gender bias within the Taiwan Credit Default dataset, specifically challenging the prevailing doctrine of “fairness through blindness.” Despite the removal of explicit protected attributes and the application of industry standard fairness interventions, our results demonstrate that gendered predictive signals remain deeply embedded within non-sensitive features. Utilizing SHAP (SHapley Additive exPlanations), we identify that variables such as Marital Status, Age, and Credit Limit function as potent proxies for gender, allowing models to maintain discriminatory pathways while appearing statistically fair. To mathematically quantify this leakage, we employ an adversarial inverse modeling framework. Our findings reveal that the protected gender attribute can be reconstructed from purely non-sensitive financial features with an ROC AUC score of 0.65, demonstrating that traditional fairness audits are insufficient for detecting implicit structural bias. These results advocate for a shift from surface-level statistical parity toward causal-aware modeling and structural accountability in financial AI.

[LG-24] A Dataset for Automatic Vocal Mode Classification

Link: https://arxiv.org/abs/2601.18339
Authors: Reemt Hinrichs, Sonja Stephan, Alexander Lange, Jörn Ostermann
Categories: Sound (cs.SD); Machine Learning (cs.LG)
Comments: Part of the proceedings of the EvoMUSART 2026: 15th International Conference on Artificial Intelligence in Music, Sound, Art and Design

Abstract:The Complete Vocal Technique (CVT) is a school of singing developed in the past decades by Cathrin Sadolin et al. CVT groups the use of the voice into so-called vocal modes, namely Neutral, Curbing, Overdrive and Edge. Knowledge of the desired vocal mode can be helpful for singing students. Automatic classification of vocal modes can thus be important for technology-assisted singing teaching. Previously, automatic classification of vocal modes has been attempted without major success, potentially due to a lack of data. Therefore, we recorded a novel vocal mode dataset consisting of sustained vowels recorded from four singers, three of whom are professional singers with more than five years of CVT experience. The dataset covers the entire vocal range of the subjects, totaling 3,752 unique samples. By using four microphones, thereby offering a natural data augmentation, the dataset consists of more than 13,000 samples combined. An annotation was created using three CVT-experienced annotators, each providing an individual annotation. The merged annotation as well as the three individual annotations come with the published dataset. Additionally, we provide some baseline classification results. The best balanced accuracy across a 5-fold cross validation, 81.3%, was achieved with a ResNet18. The dataset can be downloaded under this https URL.

[LG-25] Discriminability-Driven Spatial-Channel Selection with Gradient Norm for Drone Signal OOD Detection

Link: https://arxiv.org/abs/2601.18329
Authors: Chuhan Feng, Jing Li, Jie Li, Lu Lv, Fengkui Gong
Categories: Machine Learning (cs.LG)

Abstract:We propose a drone signal out-of-distribution (OOD) detection algorithm based on discriminability-driven spatial-channel selection with a gradient norm. Time-frequency image features are adaptively weighted along both spatial and channel dimensions by quantifying inter-class similarity and variance based on protocol-specific time-frequency characteristics. Subsequently, a gradient-norm metric is introduced to measure perturbation sensitivity for capturing the inherent instability of OOD samples, which is then fused with energy-based scores for joint inference. Simulation results demonstrate that the proposed algorithm provides superior discriminative power and robust performance across SNR levels and various drone types.

[LG-26] Cognitive Fusion of ZC Sequences and Time-Frequency Images for Out-of-Distribution Detection of Drone Signals

Link: https://arxiv.org/abs/2601.18326
Authors: Jie Li, Jing Li, Lu Lv, Zhanyu Ju, Fengkui Gong
Categories: Machine Learning (cs.LG)

Abstract:We propose a drone signal out-of-distribution detection (OODD) algorithm based on the cognitive fusion of Zadoff-Chu (ZC) sequences and time-frequency images (TFI). ZC sequences are identified by analyzing the communication protocols of DJI drones, while TFI capture the time-frequency characteristics of drone signals with unknown or non-standard communication protocols. Both modalities are used jointly to enable OODD in the drone remote identification (RID) task. Specifically, ZC sequence features and TFI features are generated from the received radio frequency signals, which are then processed through a dedicated feature extraction module to enhance and align them. The resultant multi-modal features undergo multi-modal feature interaction, single-modal feature fusion, and multi-modal feature fusion to produce features that integrate and complement information across modalities. Discrimination scores are computed from the fused features along both spatial and channel dimensions to capture time-frequency characteristic differences dictated by the communication protocols, and these scores are then transformed into adaptive attention weights. The weighted features are then passed through a Softmax function to produce the signal classification results. Simulation results demonstrate that the proposed algorithm outperforms existing algorithms and achieves 1.7% and 7.5% improvements in RID and OODD metrics, respectively. The proposed algorithm also exhibits strong robustness under varying flight conditions and across different drone types.

[LG-27] A Master Class on Reproducibility: A Student Hackathon on Advanced MRI Reconstruction Methods

Link: https://arxiv.org/abs/2601.18314
Authors: Lina Felsner, Sevgi G. Kafali, Hannah Eichhorn, Agnes A. J. Leth, Aidas Batvinskas, Andre Datchev, Fabian Klemm, Jan Aulich, Puntika Leepagorn, Ruben Klinger, Daniel Rueckert, Julia A. Schnabel
Categories: Machine Learning (cs.LG)

Abstract:We report the design, protocol, and outcomes of a student reproducibility hackathon focused on replicating the results of three influential MRI reconstruction papers: (a) MoDL, an unrolled model-based network with learned denoising; (b) HUMUS-Net, a hybrid unrolled multiscale CNN+Transformer architecture; and (c) an untrained, physics-regularized dynamic MRI method that uses a quantitative MR model for early stopping. We describe the setup of the hackathon and present reproduction outcomes alongside additional experiments, and we detail fundamental practices for building reproducible codebases.

[LG-28] Convex Chance-Constrained Stochastic Control under Uncertain Specifications with Application to Learning-Based Hybrid Powertrain Control

Link: https://arxiv.org/abs/2601.18313
Authors: Teruki Kato, Ryotaro Shima, Kenji Kashima
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Submitted to IEEE Transactions on Control Systems Technology (TCST)

Abstract:This paper presents a strictly convex chance-constrained stochastic control framework that accounts for uncertainty in control specifications such as reference trajectories and operational constraints. By jointly optimizing control inputs and risk allocation under general (possibly non-Gaussian) uncertainties, the proposed method guarantees probabilistic constraint satisfaction while ensuring strict convexity, leading to uniqueness and continuity of the optimal solution. The formulation is further extended to nonlinear model-based control using exactly linearizable models identified through machine learning. The effectiveness of the proposed approach is demonstrated through model predictive control applied to a hybrid powertrain system.

[LG-29] Tractable Gaussian Phase Retrieval with Heavy Tails and Adversarial Corruption with Near-Linear Sample Complexity

Link: https://arxiv.org/abs/2601.18245
Authors: Santanu Das, Jatin Batra
Categories: Machine Learning (cs.LG)

Abstract:Phase retrieval is the classical problem of recovering a signal $x^* \in \mathbb{R}^n$ from its noisy phaseless measurements $y_i = \langle a_i, x^* \rangle^2 + \zeta_i$ (where $\zeta_i$ denotes noise, and $a_i$ is the sensing vector) for $i \in [m]$. The problem of phase retrieval has a rich history, with a variety of applications such as optics, crystallography, heteroscedastic regression, astrophysics, etc. A major consideration in algorithms for phase retrieval is robustness against measurement errors. In recent breakthroughs in algorithmic robust statistics, efficient algorithms have been developed for several parameter estimation tasks such as mean estimation, covariance estimation, robust principal component analysis (PCA), etc. in the presence of heavy-tailed noise and adversarial corruptions. In this paper, we study efficient algorithms for robust phase retrieval with heavy-tailed noise when a constant fraction of both the measurements $y_i$ and the sensing vectors $a_i$ may be arbitrarily adversarially corrupted. For this problem, Buna and Rebeschini (AISTATS 2025) very recently gave an exponential time algorithm with sample complexity $O(n \log n)$. Their algorithm needs a robust spectral initialization, specifically, a robust estimate of the top eigenvector of a covariance matrix, which they deemed to be beyond known efficient algorithmic techniques (similar spectral initializations are a key ingredient of a large family of phase retrieval algorithms). In this work, we make a connection between robust spectral initialization and recent algorithmic advances in robust PCA, yielding the first polynomial-time algorithms for robust phase retrieval with both heavy-tailed noise and adversarial corruptions, in fact with near-linear (in $n$) sample complexity.
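
The spectral initialization mentioned above can be illustrated with its standard, non-robust form: take the top eigenvector of the measurement-weighted covariance of the sensing vectors. The sketch assumes clean Gaussian sensing vectors and noiseless measurements. It does not implement the robust PCA-based variant the paper develops, and the dimensions and seed are arbitrary choices.

```python
import numpy as np

def spectral_init(A, y):
    """Top eigenvector of (1/m) * sum_i y_i a_i a_i^T.

    A: (m, n) matrix whose rows are the sensing vectors a_i.
    y: (m,) phaseless measurements y_i = <a_i, x*>^2 + noise.
    """
    m = A.shape[0]
    Y = (A * y[:, None]).T @ A / m
    eigvals, eigvecs = np.linalg.eigh(Y)  # eigenvalues in ascending order
    return eigvecs[:, -1]                 # unit-norm direction estimate

rng = np.random.default_rng(0)
n, m = 20, 4000
x_star = rng.normal(size=n)
x_star /= np.linalg.norm(x_star)
A = rng.normal(size=(m, n))
y = (A @ x_star) ** 2                     # noiseless, uncorrupted toy data
x0 = spectral_init(A, y)
alignment = abs(x0 @ x_star)              # the sign of x* is not identifiable
```

For standard Gaussian $a_i$ the expectation of this matrix is $\|x^*\|^2 I + 2 x^* x^{*\top}$, so its top eigenvector points along $x^*$. A few corrupted pairs $(a_i, y_i)$ with large $y_i$ can dominate the sum, which is why a robust estimate of this eigenvector is the bottleneck the abstract describes.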

[LG-30] Smooth Sparse and Stable: Finite-Time Exact Skeleton Recovery via Smoothed Proximal Gradients

Link: https://arxiv.org/abs/2601.18189
Authors: Rui Wu, Yongjun Li
Categories: Machine Learning (cs.LG)
Comments: 20 pages, 8 figures

Abstract:Continuous optimization has significantly advanced causal discovery, yet existing methods (e.g., NOTEARS) generally guarantee only asymptotic convergence to a stationary point. This often yields dense weighted matrices that require arbitrary post-hoc thresholding to recover a DAG. This gap between continuous optimization and discrete graph structures remains a fundamental challenge. In this paper, we bridge this gap by proposing the Hybrid-Order Acyclicity Constraint (AHOC) and optimizing it via the Smoothed Proximal Gradient (SPG-AHOC). Leveraging the Manifold Identification Property of proximal algorithms, we provide a rigorous theoretical guarantee: the Finite-Time Oracle Property. We prove that under standard identifiability assumptions, SPG-AHOC recovers the exact DAG support (structure) in finite iterations, even when optimizing a smoothed approximation. This result eliminates structural ambiguity, as our algorithm returns graphs with exact zero entries without heuristic truncation. Empirically, SPG-AHOC achieves state-of-the-art accuracy and strongly corroborates the finite-time identification theory.
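
One way to see how a proximal method can return exact zeros without thresholding is the soft-thresholding operator, the proximal map of an L1 penalty. The sketch below is generic and assumes an L1-type sparsity term; the abstract does not spell out the regularizer or the smoothing used in SPG-AHOC, so this is not the authors' update. The matrix `W` is made-up data.

```python
import numpy as np

def soft_threshold(w, tau):
    """Proximal operator of tau * ||w||_1, applied elementwise."""
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

W = np.array([[0.00, 0.04, -0.90],
              [0.30, 0.00, 0.02],
              [-0.01, 0.60, 0.00]])
W_sparse = soft_threshold(W, tau=0.05)
support = W_sparse != 0.0  # exact zeros, no post-hoc cutoff needed
```

Entries with magnitude at most `tau` are mapped to exactly zero inside the iteration, so the support can be read off directly from the iterate. This is the behavior that manifold identification results for proximal algorithms formalize.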

[LG-31] Learning Fair Domain Adaptation with Virtual Label Distribution ICASSP2026

Link: https://arxiv.org/abs/2601.18171
Authors: Yuguang Zhang, Lijun Sheng, Jian Liang, Ran He
Categories: Machine Learning (cs.LG)
Comments: ICASSP 2026

Abstract:Unsupervised Domain Adaptation (UDA) aims to mitigate performance degradation when training and testing data are sampled from different distributions. While significant progress has been made in enhancing overall accuracy, most existing methods overlook performance disparities across categories, an issue we refer to as category fairness. Our empirical analysis reveals that UDA classifiers tend to favor certain easy categories while neglecting difficult ones. To address this, we propose Virtual Label-distribution-aware Learning (VILL), a simple yet effective framework designed to improve worst-case performance while preserving high overall accuracy. The core of VILL is an adaptive re-weighting strategy that amplifies the influence of hard-to-classify categories. Furthermore, we introduce a KL-divergence-based re-balancing strategy, which explicitly adjusts decision boundaries to enhance category fairness. Experiments on commonly used datasets demonstrate that VILL can be seamlessly integrated as a plug-and-play module into existing UDA methods, significantly improving category fairness.

[LG-32] Enhance the Safety in Reinforcement Learning by ADRC Lagrangian Methods

Link: https://arxiv.org/abs/2601.18142
Authors: Mingxu Zhang, Huicheng Zhang, Jiaming Ji, Yaodong Yang, Ying Sun
Categories: Machine Learning (cs.LG)

Abstract:Safe reinforcement learning (Safe RL) seeks to maximize rewards while satisfying safety constraints, typically addressed through Lagrangian-based methods. However, existing approaches, including PID and classical Lagrangian methods, suffer from oscillations and frequent safety violations due to parameter sensitivity and inherent phase lag. To address these limitations, we propose ADRC-Lagrangian methods that leverage Active Disturbance Rejection Control (ADRC) for enhanced robustness and reduced oscillations. Our unified framework encompasses classical and PID Lagrangian methods as special cases while significantly improving safety performance. Extensive experiments demonstrate that our approach reduces safety violations by up to 74%, constraint violation magnitudes by 89%, and average costs by 67%, establishing superior effectiveness for Safe RL in complex environments.
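
For context on the baselines named above, here is a simplified sketch of a PID-style Lagrange multiplier update, which reduces to the classical integral-only update when `kp = kd = 0`. The class name, gains, and cost limit are hypothetical, and the ADRC-based controller proposed in the paper is not shown because the abstract does not describe its equations.

```python
class PIDLagrangian:
    """Lagrange multiplier update driven by the constraint violation.

    With ki > 0 and kp = kd = 0 this reduces to the classical
    integral-only (dual ascent) update.
    """

    def __init__(self, kp=0.0, ki=0.1, kd=0.0, cost_limit=25.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.cost_limit = cost_limit
        self.integral = 0.0
        self.prev_cost = None

    def update(self, episode_cost):
        error = episode_cost - self.cost_limit  # > 0 means violation
        self.integral = max(0.0, self.integral + self.ki * error)
        # Derivative acts on the cost and only reacts to increases.
        delta = 0.0 if self.prev_cost is None else episode_cost - self.prev_cost
        self.prev_cost = episode_cost
        lam = self.kp * error + self.integral + self.kd * max(0.0, delta)
        return max(0.0, lam)                    # multiplier stays nonnegative

ctrl = PIDLagrangian(ki=0.1, cost_limit=25.0)
history = [ctrl.update(c) for c in (35.0, 35.0, 5.0)]
```

The integral term only grows after violations have already accumulated, which is one source of the lag and oscillation the abstract attributes to these controllers.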

[LG-33] Robust Learning of a Group DRO Neuron

Link: https://arxiv.org/abs/2601.18115
Authors: Guyang Cao, Shuyao Li, Sushrut Karmalkar, Jelena Diakonikolas
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC)

Abstract:We study the problem of learning a single neuron under standard squared loss in the presence of arbitrary label noise and group-level distributional shifts, for a broad family of covariate distributions. Our goal is to identify a “best-fit” neuron parameterized by $\mathbf{w}_*$ that performs well under the most challenging reweighting of the groups. Specifically, we address a Group Distributionally Robust Optimization problem: given sample access to $K$ distinct distributions $\mathcal{p}_{[1]}, \dots, \mathcal{p}_{[K]}$, we seek to approximate $\mathbf{w}_*$ that minimizes the worst-case objective over convex combinations of group distributions $\boldsymbol{\lambda} \in \Delta_K$, where the objective is $\sum_{i \in [K]} \lambda_{[i]} \, \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{p}_{[i]}} (\sigma(\mathbf{w} \cdot \mathbf{x}) - y)^2 - \nu d_f(\boldsymbol{\lambda}, \frac{1}{K}\mathbf{1})$ and $d_f$ is an $f$-divergence that imposes (optional) penalty on deviations from uniform group weights, scaled by a parameter $\nu \geq 0$. We develop a computationally efficient primal-dual algorithm that outputs a vector $\widehat{\mathbf{w}}$ that is constant-factor competitive with $\mathbf{w}_*$ under the worst-case group weighting. Our analytical framework directly confronts the inherent nonconvexity of the loss function, providing robust learning guarantees in the face of arbitrary label corruptions and group-specific distributional shifts. The implementation of the dual extrapolation update motivated by our algorithmic framework shows promise on LLM pre-training benchmarks.

[LG-34] AttenMIA: LLM Membership Inference Attack through Attention Signals

Link: https://arxiv.org/abs/2601.18110
Authors: Pedram Zaree, Md Abdullah Al Mamun, Yue Dong, Ihsen Alouani, Nael Abu-Ghazaleh
Categories: Machine Learning (cs.LG)

Abstract:Large Language Models (LLMs) are increasingly deployed to enable or improve a multitude of real-world applications. Given the large size of their training data sets, their tendency to memorize training data raises serious privacy and intellectual property concerns. A key threat is the membership inference attack (MIA), which aims to determine whether a given sample was included in the model’s training set. Existing MIAs for LLMs rely primarily on output confidence scores or embedding-based features, but these signals are often brittle, leading to limited attack success. We introduce AttenMIA, a new MIA framework that exploits self-attention patterns inside the transformer model to infer membership. Attention controls the information flow within the transformer, exposing different patterns for memorization that can be used to identify members of the dataset. Our method uses information from attention heads across layers and combines them with perturbation-based divergence metrics to train an effective MIA classifier. Using extensive experiments on open-source models including LLaMA-2, Pythia, and OPT models, we show that attention-based features consistently outperform baselines, particularly under the important low-false-positive metric (e.g., achieving up to 0.996 ROC AUC and 87.9% TPR@1%FPR on the WikiMIA-32 benchmark with Llama2-13b). We show that attention signals generalize across datasets and architectures, and provide a layer- and head-level analysis of where membership leakage is most pronounced. We also show that using AttenMIA to replace other membership inference attacks in a data extraction framework results in training data extraction attacks that outperform the state of the art. Our findings reveal that attention mechanisms, originally introduced to enhance interpretability, can inadvertently amplify privacy risks in LLMs, underscoring the need for new defenses.
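
The low-false-positive metric quoted above (TPR@1%FPR) can be computed from attack scores as in the following sketch. It assumes higher scores mean "more likely a member" and uses a strict threshold, so the realized FPR never exceeds the target. The helper name `tpr_at_fpr` and the toy scores are illustrative and not taken from the paper.

```python
import numpy as np

def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.01):
    """TPR at the threshold whose FPR on non-members is at most target_fpr.

    Higher scores are assumed to mean "more likely a training member".
    """
    members = np.asarray(member_scores, dtype=float)
    nonmembers = np.sort(np.asarray(nonmember_scores, dtype=float))[::-1]
    # Number of non-members we are allowed to flag as members.
    allowed = int(np.floor(target_fpr * nonmembers.size))
    # Flag a sample only if it scores strictly above this threshold.
    threshold = nonmembers[allowed] if allowed < nonmembers.size else -np.inf
    return float(np.mean(members > threshold))

nonmember_scores = np.arange(100.0)       # toy scores 0, 1, ..., 99
member_scores = [95.0, 94.0, 10.0, 99.0]  # toy member scores
rate = tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.05)
```

Reporting the TPR at a fixed small FPR, rather than AUC alone, reflects how the attack behaves when false accusations of membership must be rare.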

[LG-35] Beyond Static Datasets: Robust Offline Policy Optimization via Vetted Synthetic Transitions

Link: https://arxiv.org/abs/2601.18107
Authors: Pedram Agand, Mo Chen
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments: 11 pages, 2 figures, 2 tables

Abstract:Offline Reinforcement Learning (ORL) holds immense promise for safety-critical domains like industrial robotics, where real-time environmental interaction is often prohibitive. A primary obstacle in ORL remains the distributional shift between the static dataset and the learned policy, which typically mandates high degrees of conservatism that can restrain potential policy improvements. We present MoReBRAC, a model-based framework that addresses this limitation through Uncertainty-Aware latent synthesis. Instead of relying solely on the fixed data, MoReBRAC utilizes a dual-recurrent world model to synthesize high-fidelity transitions that augment the training manifold. To ensure the reliability of this synthetic data, we implement a hierarchical uncertainty pipeline integrating Variational Autoencoder (VAE) manifold detection, model sensitivity analysis, and Monte Carlo (MC) dropout. This multi-layered filtering process guarantees that only transitions residing within high-confidence regions of the learned dynamics are utilized. Our results on D4RL Gym-MuJoCo benchmarks reveal significant performance gains, particularly in “random” and “suboptimal” data regimes. We further provide insights into the role of the VAE as a geometric anchor and discuss the distributional trade-offs encountered when learning from near-optimal datasets.

[LG-36] From LLMs to LRMs: Rethinking Pruning for Reasoning-Centric Models

Link: https://arxiv.org/abs/2601.18091
Authors: Longwei Ding, Anhao Zhao, Fanghua Ye, Ziyang Chen, Xiaoyu Shen
Categories: Machine Learning (cs.LG)
Comments: 18 pages, 7 figures

Abstract:Large language models (LLMs) are increasingly costly to deploy, motivating extensive research on model pruning. However, most existing studies focus on instruction-following LLMs, leaving it unclear whether established pruning strategies transfer to reasoning-augmented models that explicitly generate long intermediate reasoning traces. In this work, we conduct a controlled study of pruning for both instruction-following (LLM-instruct) and reasoning-augmented (LLM-think) models. To isolate the effects of pruning, we align pruning calibration and post-pruning recovery data with each model’s original training distribution, which we show yields more stable and reliable pruning behavior. We evaluate static depth pruning, static width pruning, and dynamic pruning across 17 tasks spanning classification, generation, and reasoning. Our results reveal clear paradigm-dependent differences: depth pruning outperforms width pruning on classification tasks, while width pruning is more robust for generation and reasoning. Moreover, static pruning better preserves reasoning performance, whereas dynamic pruning excels on classification and generation but remains challenging for long-chain reasoning. These findings underscore the need for pruning strategies that explicitly account for the distinct characteristics of reasoning-augmented LLMs. Our code is publicly available at this https URL.

[LG-37] DRPG (Decompose Retrieve Plan Generate): An Agentic Framework for Academic Rebuttal

Link: https://arxiv.org/abs/2601.18081
Authors: Peixuan Han, Yingjie Yu, Jingjun Xu, Jiaxuan You
Subjects: Machine Learning (cs.LG)

Abstract:Despite the growing adoption of large language models (LLMs) in scientific research workflows, automated support for academic rebuttal, a crucial step in academic communication and peer review, remains largely underexplored. Existing approaches typically rely on off-the-shelf LLMs or simple pipelines, which struggle with long-context understanding and often fail to produce targeted and persuasive responses. In this paper, we propose DRPG, an agentic framework for automatic academic rebuttal generation that operates through four steps: Decompose reviews into atomic concerns, Retrieve relevant evidence from the paper, Plan rebuttal strategies, and Generate responses accordingly. Notably, the Planner in DRPG reaches over 98% accuracy in identifying the most feasible rebuttal direction. Experiments on data from top-tier conferences demonstrate that DRPG significantly outperforms existing rebuttal pipelines and achieves performance beyond the average human level using only an 8B model. Our analysis further demonstrates the effectiveness of the planner design and its value in providing multi-perspective and explainable suggestions. We also show that DRPG works well in a more complex multi-round setting. These results highlight the effectiveness of DRPG and its potential to provide high-quality rebuttal content and support the scaling of academic discussions. Code for this work is available at this https URL.

[LG-38] Comparison requires valid measurement: Rethinking attack success rate comparisons in AI red teaming

Link: https://arxiv.org/abs/2601.18076
Authors: Alexandra Chouldechova, A. Feder Cooper, Solon Barocas, Abhinav Palia, Dan Vann, Hanna Wallach
Subjects: Machine Learning (cs.LG)

Abstract:We argue that conclusions drawn about relative system safety or attack method efficacy via AI red teaming are often not supported by evidence provided by attack success rate (ASR) comparisons. We show, through conceptual, theoretical, and empirical contributions, that many conclusions are founded on apples-to-oranges comparisons or low-validity measurements. Our arguments are grounded in asking a simple question: When can attack success rates be meaningfully compared? To answer this question, we draw on ideas from social science measurement theory and inferential statistics, which, taken together, provide a conceptual grounding for understanding when numerical values obtained through the quantification of system attributes can be meaningfully compared. Through this lens, we articulate conditions under which ASRs can and cannot be meaningfully compared. Using jailbreaking as a running example, we provide examples and extensive discussion of apples-to-oranges ASR comparisons and measurement validity challenges.
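
One of the conditions the abstract alludes to, sampling uncertainty, can be made concrete with a small sketch. The code below is not from the paper: it uses the standard Wilson score interval for a binomial proportion, and the attack counts are hypothetical. Overlapping intervals are only a rough signal that an ASR gap may not be meaningful, and they say nothing about the measurement-validity questions the paper also raises.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical counts: 30/50 successful attacks on system A, 35/50 on system B.
asr_a = wilson_interval(30, 50)
asr_b = wilson_interval(35, 50)
print(asr_a, asr_b, intervals_overlap(asr_a, asr_b))
```

With only 50 attempts per system, a 60% versus 70% ASR gap leaves wide, overlapping intervals, so the raw difference alone would not support a ranking.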

[LG-39] XGuardian: Towards Explainable and Generalized AI Anti-Cheat on FPS Games USENIX-SECURITY2026

Link: https://arxiv.org/abs/2601.18068
Authors: Jiayi Zhang, Chenxin Sun, Chenxiong Qian
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted by USENIX Security 2026

Abstract:Aim-assist cheats are the most prevalent and infamous form of cheating in First-Person Shooter (FPS) games, which help cheaters illegally reveal the opponent’s location and auto-aim and shoot, and thereby pose significant threats to the game industry. Although a considerable research effort has been made to automatically detect aim-assist cheats, existing works suffer from unreliable frameworks, limited generalizability, high overhead, low detection performance, and a lack of explainability of detection results. In this paper, we propose XGuardian, a server-side generalized and explainable system for detecting aim-assist cheats to overcome these limitations. It requires only two raw data inputs, pitch and yaw, which are all FPS games’ must-haves, to construct novel temporal features and describe aim trajectories, which are essential for distinguishing cheaters and normal players. XGuardian is evaluated with the latest mainstream FPS game CS2, and validates its generalizability with another two different games. It achieves high detection performance and low overhead compared to prior works across different games with real-world and large-scale datasets, demonstrating wide generalizability and high effectiveness. It is able to justify its predictions and thereby shorten the ban cycle. We make XGuardian as well as our datasets publicly available.

[LG-40] Multimodal Machine Learning for Soft High-k Elastomers under Data Scarcity

Link: https://arxiv.org/abs/2601.18032
Authors: Brijesh FNU, Viet Thanh Duy Nguyen, Ashima Sharma, Md Harun Rashid Molla, Chengyi Xu, Truong-Son Hy
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)

Abstract:Dielectric materials are critical building blocks for modern electronics such as sensors, actuators, and transistors. With the rapid recent advance in soft and stretchable electronics for emerging human- and robot-interfacing applications, there is a surging need for high-performance dielectric elastomers. However, it remains a grand challenge to develop soft elastomers that simultaneously possess high dielectric constants (k, related to energy storage capacity) and low Young’s moduli (E, related to mechanical flexibility). While some new elastomer designs have been reported in individual (mostly one-off) studies, almost no structured dataset is currently available for dielectric elastomers that systematically encompasses their molecular sequence, dielectric, and mechanical properties. Within this context, we curate a compact, high-quality dataset of acrylate-based dielectric elastomers, one of the most widely explored elastomer backbones due to its versatile chemistry and molecular design flexibility, by screening and aggregating experimental results from the literature over the past 10 years. Building on this dataset, we propose a multimodal learning framework that leverages large-scale pretrained polymer representations from graph- and sequence-based encoders. These pretrained embeddings transfer rich chemical and structural knowledge from vast polymer corpora, enabling accurate few-shot prediction of both dielectric and mechanical properties from molecular sequences. Our results represent a new paradigm for transferring knowledge from pretrained multimodal models to overcome severe data scarcity, which can be readily translated to other polymer backbones (e.g., silicones, urethanes) and thus accelerate data-efficient discovery of soft high-k dielectric elastomers. Our source code and dataset are publicly available at this https URL

[LG-41] Spelling Bee Embeddings for Language Modeling

Link: https://arxiv.org/abs/2601.18030
Authors: Markus N. Rabe, Judith Clymo, Zheren Dong
Subjects: Machine Learning (cs.LG)

Abstract:We introduce a simple modification to the embedding layer. The key change is to infuse token embeddings with information about their spelling. Models trained with these embeddings improve not only on spelling, but also across standard benchmarks. We conduct scaling studies for models with 40M to 800M parameters, which suggest that the improvements are equivalent to needing about 8% less compute and data to achieve the same test loss.

[LG-42] Memory-Efficient FPGA Implementation of Stochastic Simulated Annealing

Link: https://arxiv.org/abs/2601.18007
Authors: Duckgyu Shin, Naoya Onizawa, Warren J. Gross, Takahiro Hanyu
Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: 11 pages

Abstract:Simulated annealing (SA) is a well-known algorithm for solving combinatorial optimization problems. However, the computation time of SA increases rapidly as the size of the problem grows. Recently, a stochastic simulated annealing (SSA) algorithm that converges faster than conventional SA has been reported. In this paper, we present a hardware-aware SSA (HA-SSA) algorithm for memory-efficient FPGA implementations. HA-SSA can reduce the memory usage of storing intermediate results while maintaining the computing speed of SSA. For evaluation purposes, the proposed algorithm is compared with the conventional SSA and SA approaches on maximum cut combinatorial optimization problems. HA-SSA achieves a convergence speed that is up to 114 times faster than that of the conventional SA algorithm, depending on the maximum cut problem selected from the G-set, a dataset of maximum cut problems. HA-SSA is implemented on a field-programmable gate array (FPGA) (Xilinx Kintex-7), and it achieves up to 6 times the memory efficiency of conventional SSA while maintaining high solution quality for optimization problems.
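
For orientation, the conventional SA baseline on maximum cut can be sketched in a few lines. This is a plain software version of textbook simulated annealing, not SSA or HA-SSA, and the cooling schedule, step count, and example graph are illustrative assumptions rather than settings from the paper.

```python
import math
import random


def cut_value(edges, side):
    """Number of edges whose endpoints lie on different sides of the partition."""
    return sum(1 for u, v in edges if side[u] != side[v])


def anneal_max_cut(n, edges, steps=2000, t_start=2.0, t_end=0.05, seed=0):
    """Conventional single-flip simulated annealing for max-cut (maximization)."""
    rng = random.Random(seed)
    side = [rng.randint(0, 1) for _ in range(n)]
    current = cut_value(edges, side)
    best, best_side = current, side[:]
    for step in range(steps):
        # Geometric cooling from t_start down to t_end (an assumed schedule).
        t = t_start * (t_end / t_start) ** (step / max(1, steps - 1))
        v = rng.randrange(n)
        side[v] ^= 1  # propose moving vertex v to the other side
        candidate = cut_value(edges, side)
        delta = candidate - current
        # Always accept improvements; accept worse cuts with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current = candidate
            if current > best:
                best, best_side = current, side[:]
        else:
            side[v] ^= 1  # reject: undo the flip
    return best, best_side


# Hypothetical toy instance: a 4-cycle.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(anneal_max_cut(4, edges))
```

Recomputing the full cut value on every proposal keeps the sketch short; a real implementation would update it incrementally from the flipped vertex's neighbors.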

[LG-43] Federated learning for unpaired multimodal data through a homogeneous transformer model

Link: https://arxiv.org/abs/2601.17986
Authors: Anders Eklund
Subjects: Machine Learning (cs.LG)

Abstract:Training of multimodal foundation models is currently restricted to centralized data centers containing massive, aligned datasets (e.g., image-text pairs). However, in realistic federated environments, data is often unpaired and fragmented across disjoint nodes; one node may hold sensor data, while another holds textual logs. These datasets are strictly private and share no common samples. Current federated learning (FL) methods fail in this regime, as they assume local clients possess aligned pairs or require sharing raw feature embeddings, which violates data sovereignty. We propose a novel framework to train a global multimodal transformer across decentralized nodes with disjoint modalities. We introduce a small public anchor set to align disjoint private manifolds. Using Gram matrices calculated from these public anchors, we enforce semantic alignment across modalities through centered kernel alignment without ever transmitting private samples, offering a mathematically superior privacy guarantee compared to prototype sharing. Further, we introduce a subspace-stabilized fine-tuning method to handle FL with huge transformer models. We strictly decouple domain-specific magnitude shifts from semantic direction, ensuring that nodes with varying sensor characteristics align geometrically to the global consensus. Lastly, we propose precision weighted averaging, where efficiently obtained uncertainty estimates are used to downweight uncertain nodes. This paper establishes the mathematical backbone for federated unpaired foundation models, enabling a global model to learn a unified representation of the world from fragmented, disjoint, and private data silos without requiring centralized storage or paired samples.
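
The alignment signal described above, centered kernel alignment (CKA) between Gram matrices computed on shared public anchors, can be sketched as follows. This is a minimal pure-Python version of standard linear CKA; the anchor embeddings are made up, and how the paper turns this score into a training loss is not specified here.

```python
import math


def gram(X):
    """Linear Gram matrix K[i][j] = <x_i, x_j> over anchor embeddings."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def center(K):
    """Double-center a Gram matrix (equivalent to H K H with H = I - 11^T / n)."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]


def frob(A, B):
    """Frobenius inner product of two equally sized matrices."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def cka(K, L):
    """CKA between two Gram matrices built on the same anchors."""
    Kc, Lc = center(K), center(L)
    return frob(Kc, Lc) / math.sqrt(frob(Kc, Kc) * frob(Lc, Lc))


# Hypothetical anchors embedded by two nodes; node B's space is a rotated,
# rescaled copy of node A's, so the two geometries agree.
anchors_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
anchors_b = [[0.0, 3.0], [-3.0, 0.0], [-3.0, 3.0], [0.0, 6.0]]
print(cka(gram(anchors_a), gram(anchors_b)))
```

Only the n-by-n Gram matrices of the public anchors are compared, which is why no private sample or raw embedding has to leave a node for this computation.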

[LG-44] TensorLens: End-to-End Transformer Analysis via High-Order Attention Tensors

Link: https://arxiv.org/abs/2601.17958
Authors: Ido Andrew Atad, Itamar Zimerman, Shahar Katz, Lior Wolf
Subjects: Machine Learning (cs.LG)
Comments: 17 pages, 7 figures

Abstract:Attention matrices are fundamental to transformer research, supporting a broad range of applications including interpretability, visualization, manipulation, and distillation. Yet, most existing analyses focus on individual attention heads or layers, failing to account for the model’s global behavior. While prior efforts have extended attention formulations across multiple heads via averaging and matrix multiplications or incorporated components such as normalization and FFNs, a unified and complete representation that encapsulates all transformer blocks is still lacking. We address this gap by introducing TensorLens, a novel formulation that captures the entire transformer as a single, input-dependent linear operator expressed through a high-order attention-interaction tensor. This tensor jointly encodes attention, FFNs, activations, normalizations, and residual connections, offering a theoretically coherent and expressive linear representation of the model’s computation. TensorLens is theoretically grounded and our empirical validation shows that it yields richer representations than previous attention-aggregation methods. Our experiments demonstrate that the attention tensor can serve as a powerful foundation for developing tools aimed at interpretability and model understanding. Our code is attached as a supplementary.

[LG-45] Scaling Effects and Uncertainty Quantification in Neural Actor Critic Algorithms

Link: https://arxiv.org/abs/2601.17954
Authors: Nikos Georgoudios, Konstantinos Spiliopoulos, Justin Sirignano
Subjects: Machine Learning (cs.LG); Probability (math.PR)
Comments: 72 pages, 2 figures

Abstract:We investigate the neural Actor Critic algorithm using shallow neural networks for both the Actor and Critic models. The focus of this work is twofold: first, to compare the convergence properties of the network outputs under various scaling schemes as the network width and the number of training steps tend to infinity; and second, to provide precise control of the approximation error associated with each scaling regime. Previous work has shown convergence to ordinary differential equations with random initial conditions under inverse square root scaling in the network width. In this work, we shift the focus from convergence speed alone to a more comprehensive statistical characterization of the algorithm’s output, with the goal of quantifying uncertainty in neural Actor Critic methods. Specifically, we study a general inverse polynomial scaling in the network width, with an exponent treated as a tunable hyperparameter taking values strictly between one half and one. We derive an asymptotic expansion of the network outputs, interpreted as statistical estimators, in order to clarify their structure. To leading order, we show that the variance decays as a power of the network width, with an exponent equal to one half minus the scaling parameter, implying improved statistical robustness as the scaling parameter approaches one. Numerical experiments support this behavior and further suggest faster convergence for this choice of scaling. Finally, our analysis yields concrete guidelines for selecting algorithmic hyperparameters, including learning rates and exploration rates, as functions of the network width and the scaling parameter, ensuring provably favorable statistical behavior.

[LG-46] FedGraph-VASP: Privacy-Preserving Federated Graph Learning with Post-Quantum Security for Cross-Institutional Anti-Money Laundering

Link: https://arxiv.org/abs/2601.17935
Authors: Daniel Commey, Matilda Nkoom, Yousef Alsenani, Sena G. Hounsinou, Garth V. Crosby
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Social and Information Networks (cs.SI)

Abstract:Virtual Asset Service Providers (VASPs) face a fundamental tension between regulatory compliance and user privacy when detecting cross-institutional money laundering. Current approaches require either sharing sensitive transaction data or operating in isolation, leaving critical cross-chain laundering patterns undetected. We present FedGraph-VASP, a privacy-preserving federated graph learning framework that enables collaborative anti-money laundering (AML) without exposing raw user data. Our key contribution is a Boundary Embedding Exchange protocol that shares only compressed, non-invertible graph neural network representations of boundary accounts. These exchanges are secured using post-quantum cryptography, specifically the NIST-standardized Kyber-512 key encapsulation mechanism combined with AES-256-GCM authenticated encryption. Experiments on the Elliptic Bitcoin dataset with realistic Louvain partitioning show that FedGraph-VASP achieves an F1-score of 0.508, outperforming the state-of-the-art generative baseline FedSage+ (F1 = 0.453) by 12.1 percent on binary fraud detection. We further show robustness under low-connectivity settings where generative imputation degrades performance, while approaching centralized performance (F1 = 0.620) in high-connectivity regimes. We additionally evaluate generalization on an Ethereum fraud detection dataset, where FedGraph-VASP (F1 = 0.635) is less effective under sparse cross-silo connectivity, while FedSage+ excels (F1 = 0.855), outperforming even local training (F1 = 0.785). These results highlight a topology-dependent trade-off: embedding exchange benefits connected transaction graphs, whereas generative imputation can dominate in highly modular sparse graphs. A privacy audit shows embeddings are only partially invertible (R^2 = 0.32), limiting exact feature recovery.
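
The "12.1 percent" figure is a relative improvement, which is easy to confuse with the absolute F1 gap of 0.055. A short check using only the two F1 scores quoted in the abstract (the helper name is illustrative):

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


# F1 scores quoted in the abstract for the Elliptic Bitcoin dataset.
gain = relative_gain(0.508, 0.453)
print(f"{gain:.1%}")  # prints 12.1%
```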

[LG-47] UniPACT: A Multimodal Framework for Prognostic Question Answering on Raw ECG and Structured EHR ICASSP2026

Link: https://arxiv.org/abs/2601.17916
Authors: Jialu Tang, Tong Xia, Yuan Lu, Aaqib Saeed
Subjects: Machine Learning (cs.LG)
Comments: Accepted to IEEE ICASSP 2026

Abstract:Accurate clinical prognosis requires synthesizing structured Electronic Health Records (EHRs) with real-time physiological signals like the Electrocardiogram (ECG). Large Language Models (LLMs) offer a powerful reasoning engine for this task but struggle to natively process these heterogeneous, non-textual data types. To address this, we propose UniPACT (Unified Prognostic Question Answering for Clinical Time-series), a unified framework for prognostic question answering that bridges this modality gap. UniPACT’s core contribution is a structured prompting mechanism that converts numerical EHR data into semantically rich text. This textualized patient context is then fused with representations learned directly from raw ECG waveforms, enabling an LLM to reason over both modalities holistically. On the comprehensive MDS-ED benchmark, UniPACT achieves a state-of-the-art mean AUROC of 89.37% across a diverse set of prognostic tasks including diagnosis, deterioration, ICU admission, and mortality, outperforming specialized baselines. Further analysis demonstrates that our multimodal, multi-task approach is critical for performance and provides robustness in missing data scenarios.

[LG-48] Adaptive Weighting in Knowledge Distillation: An Axiomatic Framework for Multi-Scale Teacher Ensemble Optimization

Link: https://arxiv.org/abs/2601.17910
Authors: Aaron R. Flouro, Shawn P. Chadwick
Subjects: Machine Learning (cs.LG)
Comments: 12 pages, 1 figure, 1 table

Abstract:Knowledge distillation with multiple teachers is increasingly used to improve robustness, efficiency, and safety, yet existing approaches rely largely on heuristic or implementation-specific weighting schemes. This paper develops an operator-agnostic axiomatic framework for adaptive weighting in multi-teacher knowledge distillation across three complementary scales: token, task, and context. We formalize structural conditions under which adaptive weighting operators are well-defined, admit multiple non-equivalent implementations, and can be hierarchically composed via product-structure normalization. Within this framework, we establish existence and non-uniqueness of conforming operators, characterize convergence of gradient-based optimization under standard assumptions, analyze stability and perturbation robustness, and provide an abstract formulation of safety-constrained distillation. The results decouple theoretical guarantees from specific weighting formulas, enabling principled analysis of adaptive distillation methods under heterogeneity, distribution shift, and safety constraints.
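
Since the framework is operator-agnostic, any concrete weighting rule is just one possible instance. The sketch below picks a softmax over per-teacher scores and combines per-scale weights by multiplying and renormalizing; both choices, and all names and numbers, are illustrative assumptions and not formulas from the paper.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax; returns non-negative weights summing to 1."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def compose_scales(*weight_lists):
    """Combine per-scale teacher weights by elementwise product, then renormalize."""
    product = [math.prod(ws) for ws in zip(*weight_lists)]
    z = sum(product)
    return [p / z for p in product]


def mix_teachers(teacher_probs, weights):
    """Weighted mixture of teacher distributions, used as the distillation target."""
    vocab = len(teacher_probs[0])
    return [sum(w * p[k] for w, p in zip(weights, teacher_probs)) for k in range(vocab)]


# Hypothetical setup: two teachers, a three-token vocabulary, made-up scores.
token_w = softmax([2.0, 0.5])
task_w = softmax([0.0, 0.0])
context_w = softmax([1.0, 1.5])
weights = compose_scales(token_w, task_w, context_w)
target = mix_teachers([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]], weights)
print(weights, target)
```

Because the weights form a convex combination, the mixed target is itself a probability distribution whenever each teacher's output is one.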

[LG-49] FARM: Few-shot Adaptive Malware Family Classification under Concept Drift

Link: https://arxiv.org/abs/2601.17907
Authors: Numan Halit Guldemir, Oluwafemi Olukoya, Jesús Martínez-del-Rincón
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:Malware classification models often face performance degradation due to concept drift, arising from evolving threat landscapes and the emergence of novel malware families. This paper presents FARM (Few-shot Adaptive Recognition of Malware), a framework designed to detect and adapt to both covariate and label drift in Windows Portable Executable (PE) malware classification. FARM leverages a triplet autoencoder to project samples into a discriminative latent space, enabling unsupervised drift detection via DBSCAN clustering and dynamic thresholding. For rapid adaptation, it employs few-shot learning using prototype-based classification, requiring only a handful of labeled samples. FARM also supports full retraining when enough drifted samples accumulate, updating the latent space for long-term integration. Experiments on the BenchMFC dataset demonstrate that FARM improves classification performance under covariate drift by 5.6%, and achieves an average F1 score of 0.85 on unseen malware families using only few-shot adaptation, which further increases to 0.94 after retraining. These results highlight FARM’s robustness and adaptability in dynamic malware detection environments under limited supervision.
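
The few-shot step, prototype-based classification in a latent space, can be sketched independently of the rest of the pipeline. In this toy version the latent vectors are hand-written rather than produced by a triplet autoencoder, and a fixed distance threshold stands in for FARM's DBSCAN-based detection and dynamic thresholding; the family names are hypothetical.

```python
import math
from collections import defaultdict


def build_prototypes(support):
    """Average the latent vectors of each class; `support` is [(vector, label), ...]."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in support:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def classify(vec, prototypes, drift_threshold=None):
    """Nearest-prototype label; returns (None, distance) if farther than the threshold."""
    label, dist = min(
        ((lab, math.dist(vec, proto)) for lab, proto in prototypes.items()),
        key=lambda pair: pair[1],
    )
    if drift_threshold is not None and dist > drift_threshold:
        return None, dist  # flagged as a possible unseen family
    return label, dist


# Hypothetical 2-D latent vectors with a handful of labeled samples per family.
support = [([0.0, 0.0], "family_a"), ([0.0, 2.0], "family_a"), ([10.0, 10.0], "family_b")]
protos = build_prototypes(support)
print(classify([1.0, 1.0], protos))
print(classify([5.0, 5.0], protos, drift_threshold=2.0))
```

Adding a new family only requires averaging a few labeled vectors into a new prototype, which is what makes this step cheap compared with full retraining.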

[LG-50] Robust Computational Extraction of Non-Enhancing Hypercellular Tumor Regions from Clinical Imaging Data

Link: https://arxiv.org/abs/2601.17802
Authors: A. Brawanski, Th. Schaffer, F. Raab, K.-M. Schebesch, M. Schrey, Chr. Doenitz, A. M. Tomé, E. W. Lang
Subjects: Machine Learning (cs.LG)

Abstract:Accurate identification of non-enhancing hypercellular (NEH) tumor regions is an unmet need in neuro-oncological imaging, with significant implications for patient management and treatment planning. We present a robust computational framework that generates probability maps of NEH regions from routine MRI data, leveraging multiple network architectures to address the inherent variability and lack of clear imaging boundaries. Our approach was validated against independent clinical markers – relative cerebral blood volume (rCBV) and enhancing tumor recurrence location (ETRL) – demonstrating both methodological robustness and biological relevance. This framework enables reliable, non-invasive mapping of NEH tumor compartments, supporting their integration as imaging biomarkers in clinical workflows and advancing precision oncology for brain tumor patients.

[LG-51] Entropic Risk-Aware Monte Carlo Tree Search

Link: https://arxiv.org/abs/2601.17667
Authors: Pedro P. Santos, Jacopo Silvestrin, Alberto Sardinha, Francisco S. Melo
Subjects: Machine Learning (cs.LG)

Abstract:We propose a provably correct Monte Carlo tree search (MCTS) algorithm for solving risk-aware Markov decision processes (MDPs) with entropic risk measure (ERM) objectives. We provide a non-asymptotic analysis of our proposed algorithm, showing that the algorithm: (i) is correct in the sense that the empirical ERM obtained at the root node converges to the optimal ERM; and (ii) enjoys polynomial regret concentration. Our algorithm successfully exploits the dynamic programming formulations for solving risk-aware MDPs with ERM objectives introduced by previous works in the context of an upper confidence bound-based tree search algorithm. Finally, we provide a set of illustrative experiments comparing our risk-aware MCTS method against relevant baselines.
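
For readers unfamiliar with the objective, the empirical entropic risk of a batch of returns can be computed as below. The sketch assumes the common convention ERM_beta(X) = (1/beta) log E[exp(beta X)], where beta < 0 is risk-averse for rewards; the paper may use a different sign or parameterization, and the sample returns are made up.

```python
import math


def entropic_risk(samples, beta):
    """Empirical entropic risk (1/beta) * log(mean(exp(beta * x))), via log-sum-exp."""
    if beta == 0:
        return sum(samples) / len(samples)  # the limit beta -> 0 is the plain mean
    scaled = [beta * x for x in samples]
    m = max(scaled)
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in scaled) / len(samples))
    return log_mean_exp / beta


# Hypothetical returns from rollouts: a risky arm that pays 0 or 10.
returns = [0.0, 10.0]
print(entropic_risk(returns, -1.0), entropic_risk(returns, 0.0), entropic_risk(returns, 1.0))
```

With beta = -1 the risky arm is valued well below its mean of 5, and with beta = +1 well above it, which is the sense in which the ERM encodes risk attitude.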

[LG-52] Kareus: Joint Reduction of Dynamic and Static Energy in Large Model Training

Link: https://arxiv.org/abs/2601.17654
Authors: Ruofan Wu, Jae-Won Chung, Mosharaf Chowdhury
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:The computing demand of AI is growing at an unprecedented rate, but energy supply is not keeping pace. As a result, energy has become an expensive, contended resource that requires explicit management and optimization. Although recent works have made significant progress in large model training optimization, they focus only on a single aspect of energy consumption: dynamic or static energy. We find that fine-grained kernel scheduling and frequency scaling jointly and interdependently impact both dynamic and static energy consumption. Based on this finding, we design Kareus, a training system that pushes the time–energy tradeoff frontier by optimizing both aspects. Kareus decomposes the intractable joint optimization problem into local, partition-based subproblems. It then uses a multi-pass multi-objective optimization algorithm to find execution schedules that push the time–energy tradeoff frontier. Compared to the state of the art, Kareus reduces training energy by up to 28.3% at the same training time, or reduces training time by up to 27.5% at the same energy consumption.

[LG-53] A Mosco sufficient condition for intrinsic stability of non-unique convex Empirical Risk Minimization

Link: https://arxiv.org/abs/2601.17646
Authors: Karim Bounja, Lahcen Laayouni, Abdeljalil Sakat
Subjects: Machine Learning (cs.LG); Functional Analysis (math.FA); Optimization and Control (math.OC); Statistics Theory (math.ST)

Abstract:Empirical risk minimization (ERM) stability is usually studied via single-valued outputs, while convex non-strict losses yield set-valued minimizers. We identify Painlevé-Kuratowski upper semicontinuity (PK-u.s.c.) as the intrinsic stability notion for the ERM solution correspondence (set-level Hadamard well-posedness) and a prerequisite to interpret stability of selections. We then characterize a minimal non-degenerate qualitative regime: Mosco-consistent perturbations and locally bounded minimizers imply PK-u.s.c., minimal-value continuity, and consistency of vanishing-gap near-minimizers. Quadratic growth yields explicit quantitative deviation bounds.
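
The two notions in the abstract can be written out in a hedged form. The block below is a textbook-style sketch, not the paper's statement: the symbols f, f_n, S, S_n, c, and epsilon are illustrative, and uniform closeness on a bounded set is a stronger stand-in for the Mosco-consistent perturbations the paper actually assumes.

```latex
\begin{align*}
&\text{PK upper semicontinuity of } S_n = \arg\min f_n \text{ at } S = \arg\min f: \\
&\qquad \Bigl\{\, x : \exists\, n_k \to \infty,\ x_{n_k} \in S_{n_k},\ x_{n_k} \to x \,\Bigr\} \subseteq S . \\[4pt]
&\text{Quadratic growth of } f \text{ around } S \text{ with constant } c > 0: \\
&\qquad f(x) - \min f \;\ge\; \tfrac{c}{2}\,\operatorname{dist}(x, S)^2 . \\[4pt]
&\text{If } \sup_{x \in B} |f_n(x) - f(x)| \le \varepsilon \text{ on a bounded set } B \supseteq S \cup S_n,
 \text{ then for } x_n \in S_n,\ x^\star \in S: \\
&\qquad f(x_n) \;\le\; f_n(x_n) + \varepsilon \;\le\; f_n(x^\star) + \varepsilon \;\le\; f(x^\star) + 2\varepsilon , \\
&\qquad \operatorname{dist}(x_n, S)^2 \;\le\; \tfrac{2}{c}\bigl(f(x_n) - \min f\bigr) \;\le\; \tfrac{4\varepsilon}{c} .
\end{align*}
```

The last line is the kind of explicit deviation bound that quadratic growth makes possible: the distance from any perturbed minimizer to the limit solution set shrinks at rate sqrt(epsilon), even though neither set needs to be a single point.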

[LG-54] RPNT: Robust Pre-trained Neural Transformer – A Pathway for Generalized Motor Decoding

Link: https://arxiv.org/abs/2601.17641
Authors: Hao Fang, Ryan A. Canfield, Tomohiro Ouchi, Beatrice Macagno, Eli Shlizerman, Amy L. Orsborn
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:Brain decoding aims to interpret and translate neural activity into behaviors. As such, it is imperative that decoding models are able to generalize across variations, such as recordings from different brain sites, distinct sessions, different types of behavior, and a variety of subjects. Current models can only partially address these challenges and warrant the development of pretrained neural transformer models capable of adapting and generalizing. In this work, we propose RPNT - Robust Pretrained Neural Transformer, designed to achieve robust generalization through pretraining, which in turn enables effective finetuning given a downstream task. In particular, RPNT’s unique components include 1) Multidimensional rotary positional embedding (MRoPE) to aggregate experimental metadata such as site coordinates, session name and behavior types; 2) Context-based attention mechanism via convolution kernels operating on global attention to learn local temporal structures for handling non-stationarity of neural population activity; 3) Robust self-supervised learning (SSL) objective with uniform causal masking strategies and contrastive representations. We pretrained two separate versions of RPNT on distinct datasets: a) Multi-session, multi-task, and multi-subject microelectrode benchmark; b) Multi-site recordings using high-density Neuropixel 1.0 probes. The datasets include recordings from the dorsal premotor cortex (PMd) and from the primary motor cortex (M1) regions of nonhuman primates (NHPs) as they performed reaching tasks. After pretraining, we evaluated the generalization of RPNT in cross-session, cross-type, cross-subject, and cross-site downstream behavior decoding tasks. Our results show that RPNT consistently achieves and surpasses the decoding performance of existing decoding models in all tasks.

[LG-55] Split-on-Share: Mixture of Sparse Experts for Task-Agnostic Continual Learning

Link: https://arxiv.org/abs/2601.17616
Authors: Fatema Siddika, Md Anwar Hossen, Tanwi Mallick, Ali Jannesari
Subjects: Machine Learning (cs.LG)
Comments: 17 pages, 9 figures, 8 tables

Abstract:Continual learning in Large Language Models (LLMs) is hindered by the plasticity-stability dilemma, where acquiring new capabilities often leads to catastrophic forgetting of previous knowledge. Existing methods typically treat parameters uniformly, failing to distinguish between specific task knowledge and shared capabilities. We introduce Mixture of Sparse Experts for Task-Agnostic Continual Learning, referred to as SETA, a framework that resolves the plasticity-stability conflict by decomposing the model into modular subspaces. Unlike standard updates, where tasks compete for the same parameters, SETA separates knowledge into unique experts, designed to isolate task-specific patterns, and shared experts, responsible for capturing common features. This structure is maintained through elastic weight anchoring, which protects critical shared knowledge and enables a unified gating network to automatically retrieve the correct expert combination for each task during inference. Extensive experiments across diverse domain-specific and general benchmarks demonstrate that SETA consistently outperforms state-of-the-art parameter-efficient fine-tuning-based continual learning methods.

[LG-56] Athena: Synergizing Data Prefetching and Off-Chip Prediction via Online Reinforcement Learning

Link: https://arxiv.org/abs/2601.17615
Authors: Rahul Bera, Zhenrong Lang, Caroline Hengartner, Konstantinos Kanellopoulos, Rakesh Kumar, Mohammad Sadrosadati, Onur Mutlu
Subjects: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Abstract:Prefetching and off-chip prediction are two techniques proposed to hide long memory access latencies in high-performance processors. In this work, we demonstrate that: (1) prefetching and off-chip prediction often provide complementary performance benefits, yet (2) naively combining them often fails to realize their full performance potential, and (3) existing prefetcher control policies leave significant room for performance improvement behind. Our goal is to design a holistic framework that can autonomously learn to coordinate an off-chip predictor with multiple prefetchers employed at various cache levels. To this end, we propose a new technique called Athena, which models the coordination between prefetchers and off-chip predictor (OCP) as a reinforcement learning (RL) problem. Athena acts as the RL agent that observes multiple system-level features (e.g., prefetcher/OCP accuracy, bandwidth usage) over an epoch of program execution, and uses them as state information to select a coordination action (i.e., enabling the prefetcher and/or OCP, and adjusting prefetcher aggressiveness). At the end of every epoch, Athena receives a numerical reward that measures the change in multiple system-level metrics (e.g., number of cycles taken to execute an epoch). Athena uses this reward to autonomously and continuously learn a policy to coordinate prefetchers with OCP. Our extensive evaluation using a diverse set of memory-intensive workloads shows that Athena consistently outperforms prior state-of-the-art coordination policies across a wide range of system configurations with various combinations of underlying prefetchers, OCPs, and main memory bandwidths, while incurring only modest storage overhead. Athena is freely available at this https URL. 
ACM classes: C.1. Journal reference: 32nd IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2026

[LG-57] A Thermodynamic Theory of Learning I: Irreversible Ensemble Transport and Epistemic Costs

Link: https://arxiv.org/abs/2601.17607
Authors: Daisuke Okanohara (Preferred Networks, Inc)
Subjects: Machine Learning (cs.LG)
Comments: 9 pages. Part I of a planned series entitled “A Thermodynamic Theory of Learning.”

Abstract:Learning systems acquire structured internal representations from data, yet classical information-theoretic results state that deterministic transformations do not increase information. This raises a fundamental question: how can learning produce abstraction and insight without violating information-theoretic limits? We argue that learning is inherently an irreversible process when performed over finite time, and that the realization of epistemic structure necessarily incurs entropy production. To formalize this perspective, we model learning as a transport process in the space of probability distributions over model configurations and introduce an epistemic free-energy framework. Within this framework, we define the free-energy drop as a bookkeeping quantity that records the total reduction of epistemic free energy along a learning trajectory. This reduction decomposes into a reversible component associated with potential improvement and an irreversible component corresponding to entropy production. We then derive the Epistemic Speed Limit (ESL), a finite-time inequality that lower-bounds the minimal entropy production required by any learning process to realize a given distributional transformation. This bound depends only on the Wasserstein distance between initial and final ensemble distributions and is independent of the specific learning algorithm.
Submission history: [v1] Sat, 24 Jan 2026 21:57:54 UTC (16 KB)

[LG-58] Understanding Transformer Encoder-Decoder Representations through Bernoulli Dropout

Link: https://arxiv.org/abs/2601.17602
Authors: Xuanzhou Chen
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:We study Transformer overparameterization through the lens of angular similarity in high-dimensional encoder-decoder embeddings. We apply Bernoulli dropout between the encoder and the decoder, varying the keep probability p to identify a sparsity-dependent threshold above which the Top-1 prediction is preserved. Theoretically, we prove that, if the effective sparsity is sufficiently large, the embeddings, and thus decoder performance, remain stable under moderate coordinate dropout. Empirically, we implement the Bernoulli dropout by constructing a new Transformer model augmented with a Binary Erasure Channel (BEC) and test its performance on an English-French translation task. Experimental results visualize the trends for validation accuracy and BLEU score, both of which decline sharply at some threshold.
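The effect of coordinate dropout on angular similarity can be illustrated with a small simulation. This is a minimal sketch, not the paper's model: it assumes random Gaussian vectors stand in for encoder embeddings, and the helper names (`bernoulli_erase`, `mean_similarity`) are hypothetical. For a vector whose energy is spread evenly over its coordinates, the cosine similarity between the original and the erased copy concentrates near the square root of the keep probability.

```python
import math
import random


def bernoulli_erase(vec, keep_prob, rng):
    """Keep each coordinate independently with probability keep_prob, else zero it."""
    return [x if rng.random() < keep_prob else 0.0 for x in vec]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def mean_similarity(dim, keep_prob, trials=200, seed=0):
    """Average cosine similarity between random vectors and their erased copies."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Assumption: a standard Gaussian vector stands in for an embedding.
        emb = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += cosine(emb, bernoulli_erase(emb, keep_prob, rng))
    return total / trials


if __name__ == "__main__":
    for p in (0.9, 0.7, 0.5, 0.3):
        print(p, round(mean_similarity(512, p), 3), round(math.sqrt(p), 3))
```

The sketch only tracks angular similarity; whether the Top-1 prediction survives depends on the decoder and is not modelled here.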

[LG-59] Deep Intrinsic Surprise-Regularized Control (DISRC): A Biologically Inspired Mechanism for Efficient Deep Q-Learning in Sparse Environments

Link: https://arxiv.org/abs/2601.17598
Authors: Yash Kini, Shiv Davay, Shreya Polavarapu
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Comments: 5 pages, 3 figures, 1 table. Preprint version of work submitted, accepted, and presented at IEEE URTC. Accepted and pending publication in IEEE Xplore

Abstract:Deep reinforcement learning (DRL) has driven major advances in autonomous control. Still, standard Deep Q-Network (DQN) agents tend to rely on fixed learning rates and uniform update scaling, even as updates are modulated by temporal-difference (TD) error. This rigidity destabilizes convergence, especially in sparse-reward settings where feedback is infrequent. We introduce Deep Intrinsic Surprise-Regularized Control (DISRC), a biologically inspired augmentation to DQN that dynamically scales Q-updates based on latent-space surprise. DISRC encodes states via a LayerNorm-based encoder and computes a deviation-based surprise score relative to a moving latent setpoint. Each update is then scaled in proportion to both TD error and surprise intensity, promoting plasticity during early exploration and stability as familiarity increases. We evaluate DISRC on two sparse-reward MiniGrid environments, MiniGrid-DoorKey-8x8 and MiniGrid-LavaCrossingS9N1, under the same settings as a vanilla DQN baseline. In DoorKey, DISRC reached the first successful episode (reward 0.8) 33% faster than the vanilla DQN baseline (79 vs. 118 episodes), with lower reward standard deviation (0.25 vs. 0.34) and higher reward area under the curve (AUC: 596.42 vs. 534.90). These metrics reflect faster, more consistent learning, which is critical in sparse, delayed-reward settings. In LavaCrossing, DISRC achieved a higher final reward (0.95 vs. 0.93) and the highest AUC of all agents (957.04), though it converged more gradually. These preliminary results establish DISRC as a novel mechanism for regulating learning intensity in off-policy agents, improving both efficiency and stability in sparse-reward domains. By treating surprise as an intrinsic learning signal, DISRC enables agents to modulate updates based on expectation violations, enhancing decision quality when conventional value-based methods fall short.

[LG-60] Quantum-Inspired Episode Selection for Monte Carlo Reinforcement Learning via QUBO Optimization

Link: https://arxiv.org/abs/2601.17570
Authors: Hadi Salloum, Ali Jnadi, Yaroslav Kholodov, Alexander Gasnikov
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Proceedings of Machine Learning Research tbd: 1_13, 2025 International Conference on Computational Optimization

Abstract:Monte Carlo (MC) reinforcement learning suffers from high sample complexity, especially in environments with sparse rewards, large state spaces, and correlated trajectories. We address these limitations by reformulating episode selection as a Quadratic Unconstrained Binary Optimization (QUBO) problem and solving it with quantum-inspired samplers. Our method, MC+QUBO, integrates a combinatorial filtering step into standard MC policy evaluation: from each batch of trajectories, we select a subset that maximizes cumulative reward while promoting state-space coverage. This selection is encoded as a QUBO, where linear terms favor high-reward episodes and quadratic terms penalize redundancy. We explore both Simulated Quantum Annealing (SQA) and Simulated Bifurcation (SB) as black-box solvers within this framework. Experiments in a finite-horizon GridWorld demonstrate that MC+QUBO outperforms vanilla MC in convergence speed and final policy quality, highlighting the potential of quantum-inspired optimization as a decision-making subroutine in reinforcement learning.
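The QUBO encoding described in the abstract (linear terms rewarding high-return episodes, quadratic terms penalizing redundancy) can be sketched as follows. Several things here are assumptions rather than details from the paper: redundancy is measured as the Jaccard overlap of visited states, `redundancy_weight` is a hypothetical trade-off parameter, and an exhaustive search replaces the SQA and SB samplers, which only makes sense for very small batches.

```python
from itertools import product


def jaccard(a, b):
    """Overlap between the sets of states visited by two episodes."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_qubo(returns, visited_states, redundancy_weight):
    """Upper-triangular QUBO matrix Q for minimizing sum_{i<=j} Q[i][j] * x_i * x_j.

    Diagonal entries favor high-return episodes (negative energy);
    off-diagonal entries penalize selecting two overlapping episodes.
    """
    n = len(returns)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        Q[i][i] = -returns[i]
        for j in range(i + 1, n):
            Q[i][j] = redundancy_weight * jaccard(visited_states[i], visited_states[j])
    return Q


def energy(Q, x):
    """QUBO energy of a binary selection vector x."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(i, n))


def solve_brute_force(Q):
    """Exhaustive minimizer; a stand-in for a quantum-inspired sampler."""
    return min(product((0, 1), repeat=len(Q)), key=lambda x: energy(Q, x))


if __name__ == "__main__":
    returns = [1.0, 0.9, 0.5]
    visited = [[0, 1, 2], [0, 1, 2], [3, 4, 5]]  # episodes 0 and 1 are redundant
    print(solve_brute_force(build_qubo(returns, visited, redundancy_weight=2.0)))
```

In this toy batch the second episode duplicates the first, so the minimizer keeps the first and third episodes; the selected subset would then feed the usual MC value update.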

[LG-61] Sparse RBF Networks for PDEs and nonlocal equations: function space theory, operator calculus, and training algorithms

Link: https://arxiv.org/abs/2601.17562
Authors: Zihan Shao, Konstantin Pieper, Xiaochuan Tian
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments: 30 pages, 7 figures

Abstract:This work presents a systematic analysis and extension of the sparse radial basis function network (SparseRBFnet) previously introduced for solving nonlinear partial differential equations (PDEs). Based on its adaptive-width shallow kernel network formulation, we further investigate its function-space characterization, operator evaluation, and computational algorithm. We provide a unified description of the solution space for a broad class of radial basis functions (RBFs). Under mild assumptions, this space admits a characterization as a Besov space, independent of the specific kernel choice. We further demonstrate how the explicit kernel-based structure enables quasi-analytical evaluation of both differential and nonlocal operators, including fractional Laplacians. On the computational end, we study the adaptive-width network and related three-phase training strategy through a comparison with variants concerning the modeling and algorithmic details. In particular, we assess the roles of second-order optimization, inner-weight training, network adaptivity, and anisotropic kernel parameterizations. Numerical experiments on high-order, fractional, and anisotropic PDE benchmarks illustrate the empirical insensitivity to kernel choice, as well as the resulting trade-offs between accuracy, sparsity, and computational cost. Collectively, these results consolidate and generalize the theoretical and computational framework of SparseRBFnet, supporting accurate sparse representations with efficient operator evaluation and offering theory-grounded guidance for algorithmic and modeling choices.

[LG-62] GreenServ: Energy-Efficient Context-Aware Dynamic Routing for Multi-Model LLM Inference

Link: https://arxiv.org/abs/2601.17551
Authors: Thomas Ziller, Shashikant Ilager, Alessandro Tundo, Ezio Bartocci, Leonardo Mariani, Ivona Brandic
Subjects: Performance (cs.PF); Machine Learning (cs.LG)
Comments: Paper under submission

Abstract:Large language models (LLMs) demonstrate remarkable capabilities, but their broad deployment is limited by significant computational resource demands, particularly energy consumption during inference. Static, one-model-fits-all inference strategies are often inefficient, as they do not exploit the diverse range of available models or adapt to varying query requirements. This paper presents GreenServ, a dynamic, context-aware routing framework that optimizes the trade-off between inference accuracy and energy efficiency. GreenServ extracts lightweight contextual features from each query, including task type, semantic cluster, and text complexity, and routes queries to the most suitable model from a heterogeneous pool, based on observed accuracy and energy usage. We employ a multi-armed bandit approach to learn adaptive routing policies online. This approach operates under partial feedback, eliminates the need for extensive offline calibration, and streamlines the integration of new models into the inference pipeline. We evaluated GreenServ across five benchmark tasks and a pool of 16 contemporary open-access LLMs. Experimental results show that GreenServ consistently outperforms static (single-model) and random baselines. In particular, compared to random routing, GreenServ achieved a 22% increase in accuracy while reducing cumulative energy consumption by 31%. Finally, we evaluated GreenServ with RouterBench, achieving an average accuracy of 71.7% with a peak accuracy of 75.7%. All artifacts are open-source and available as an anonymous repository for review purposes here: this https URL
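Online routing under partial feedback can be sketched with a per-context UCB1 bandit. This is an illustrative sketch only: the abstract does not say which bandit algorithm GreenServ uses, and the class name `ContextualUCBRouter`, the reward definition (correctness minus a weighted energy cost), and the simulated accuracies and energy figures are all assumptions made for the example.

```python
import math
import random
from collections import defaultdict


class ContextualUCBRouter:
    """One UCB1 bandit per context; arms are the candidate models."""

    def __init__(self, models, energy_weight=0.5, exploration=1.0):
        self.models = list(models)
        self.energy_weight = energy_weight  # hypothetical accuracy/energy trade-off
        self.exploration = exploration
        self.counts = defaultdict(int)      # (context, model) -> number of pulls
        self.means = defaultdict(float)     # (context, model) -> mean reward

    def select(self, context):
        # Try every model once in each context before using confidence bounds.
        for model in self.models:
            if self.counts[(context, model)] == 0:
                return model
        total = sum(self.counts[(context, m)] for m in self.models)

        def ucb(model):
            n = self.counts[(context, model)]
            bonus = self.exploration * math.sqrt(2.0 * math.log(total) / n)
            return self.means[(context, model)] + bonus

        return max(self.models, key=ucb)

    def update(self, context, model, correct, energy):
        # Partial feedback: only the chosen model's outcome is observed.
        reward = float(correct) - self.energy_weight * energy
        key = (context, model)
        self.counts[key] += 1
        self.means[key] += (reward - self.means[key]) / self.counts[key]


def simulate(rounds=2000, seed=0):
    """Toy environment with made-up accuracies and energy costs."""
    accuracy = {("chat", "small"): 0.90, ("chat", "large"): 0.95,
                ("math", "small"): 0.20, ("math", "large"): 0.90}
    energy = {"small": 0.1, "large": 0.8}
    rng = random.Random(seed)
    router = ContextualUCBRouter(["small", "large"])
    for _ in range(rounds):
        context = rng.choice(["chat", "math"])
        model = router.select(context)
        correct = rng.random() < accuracy[(context, model)]
        router.update(context, model, correct, energy[model])
    return router


if __name__ == "__main__":
    for key, pulls in sorted(simulate().counts.items()):
        print(key, pulls)
```

In the toy setup the small model is nearly as accurate on "chat" queries and much cheaper, so the router learns to prefer it there and to reserve the large model for "math" queries. Adding a new model only requires appending an arm, which then gets explored automatically.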

[LG-63] EuleroDec: A Complex-Valued RVQ-VAE for Efficient and Robust Audio Coding ICASSP2026

Link: https://arxiv.org/abs/2601.17517
Authors: Luca Cerovaz, Michele Mancusi, Emanuele Rodolà
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted at ICASSP 2026

Abstract:Audio codecs power discrete music generative modelling, music streaming, and immersive media by shrinking PCM audio to bandwidth-friendly bitrates. Recent works have gravitated towards processing in the spectral domain; however, spectrogram domains typically struggle with phase modeling, which is naturally complex-valued. Most frequency-domain neural codecs either disregard phase information or encode it as two separate real-valued channels, limiting spatial fidelity. This entails the need to introduce adversarial discriminators at the expense of convergence speed and training stability to compensate for the inadequate representation power of the audio signal. In this work we introduce an end-to-end complex-valued RVQ-VAE audio codec that preserves magnitude-phase coupling across the entire analysis-quantization-synthesis pipeline and removes adversarial discriminators and diffusion post-filters. Without GANs or diffusion, we match or surpass much longer-trained baselines in-domain and reach SOTA out-of-domain performance on phase coherence and waveform fidelity. Compared to standard baselines that train for hundreds of thousands of steps, our model, which reduces the training budget by an order of magnitude, is markedly more compute-efficient while preserving high perceptual quality.

[LG-64] One-Shot Federated Clustering of Non-Independent Completely Distributed Data

Link: https://arxiv.org/abs/2601.17512
Authors: Yiqun Zhang, Shenghong Cai, Zihua Yang, Sen Feng, Yuzhu Ji, Haijun Zhang
Subjects: Machine Learning (cs.LG)
Comments: This work has been accepted for publication in IEEE Internet of Things Journal

Abstract:Federated Learning (FL), which extracts data knowledge while protecting the privacy of multiple clients, has achieved remarkable results in distributed privacy-preserving IoT systems, including smart traffic flow monitoring, smart grid load balancing, and so on. Since most data collected from edge devices are unlabeled, unsupervised Federated Clustering (FC) is becoming increasingly popular for exploring pattern knowledge from complex distributed data. However, due to the lack of label guidance, the common Non-Independent and Identically Distributed (Non-IID) issue of clients has greatly challenged FC by posing the following problems: How to fuse pattern knowledge (i.e., cluster distribution) from Non-IID clients; How are the cluster distributions among clients related; and How does this relationship connect with the global knowledge fusion? In this paper, a trickier but overlooked phenomenon in Non-IID is revealed, which bottlenecks the clustering performance of the existing FC approaches. That is, different clients could fragment a cluster, and accordingly, a more generalized Non-IID concept, i.e., Non-ICD (Non-Independent Completely Distributed), is derived. To tackle the above FC challenges, a new framework named GOLD (Global Oriented Local Distribution Learning) is proposed. GOLD first finely explores the potential incomplete local cluster distributions of clients, then uploads the distribution summarization to the server for global fusion, and finally performs local cluster enhancement under the guidance of the global distribution. Extensive experiments, including significance tests, ablation studies, scalability evaluations, qualitative results, etc., have been conducted to show the superiority of GOLD.

[LG-65] LeanTutor: Towards a Verified AI Mathematical Proof Tutor

Link: https://arxiv.org/abs/2601.17473
Authors: Manooshree Patel, Rayna Bhattacharyya, Thomas Lu, Arnav Mehta, Niels Voss, Narges Norouzi, Gireeja Ranade
Subjects: Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2506.08321

Abstract:This paper considers the development of an AI-based provably-correct mathematical proof tutor. While Large Language Models (LLMs) allow seamless communication in natural language, they are error prone. Theorem provers such as Lean allow for provable-correctness, but these are hard for students to learn. We present a proof-of-concept system (LeanTutor) by combining the complementary strengths of LLMs and theorem provers. LeanTutor is composed of three modules: (i) an autoformalizer/proof-checker, (ii) a next-step generator, and (iii) a natural language feedback generator. To evaluate the system, we introduce PeanoBench, a dataset of 371 Peano Arithmetic proofs in human-written natural language and formal language, derived from the Natural Numbers Game.

[LG-66] Identifying and Correcting Label Noise for Robust GNNs via Influence Contradiction

Link: https://arxiv.org/abs/2601.17469
Authors: Wei Ju, Wei Zhang, Siyu Yi, Zhengyang Mao, Yifan Wang, Jingyang Yuan, Zhiping Xiao, Ziyue Qiao, Ming Zhang
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Graph Neural Networks (GNNs) have shown remarkable capabilities in learning from graph-structured data with various applications such as social analysis and bioinformatics. However, the presence of label noise in real scenarios poses a significant challenge in learning robust GNNs, and their effectiveness can be severely impacted when dealing with noisy labels on graphs, often stemming from annotation errors or inconsistencies. To address this, in this paper we propose a novel approach called ICGNN that harnesses the structure information of the graph to effectively alleviate the challenges posed by noisy labels. Specifically, we first design a novel noise indicator that measures the influence contradiction score (ICS) based on the graph diffusion matrix to quantify the credibility of nodes with clean labels, such that nodes with higher ICS values are more likely to be detected as having noisy labels. Then we leverage the Gaussian mixture model to precisely detect whether the label of a node is noisy or not. Additionally, we develop a soft strategy to combine the predictions from neighboring nodes on the graph to correct the detected noisy labels. Finally, pseudo-labeling for abundant unlabeled nodes is incorporated to provide auxiliary supervision signals and guide the model optimization. Experiments on benchmark datasets show the superiority of our proposed approach.
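The detection step (fitting a Gaussian mixture to per-node scores and flagging the high-score component as noisy) can be sketched in one dimension. This is a minimal sketch under stated assumptions: the scores below are made-up stand-ins for ICS values, the mixture has exactly two components, and the 0.5 posterior threshold and the function names are hypothetical choices, not details from the paper.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_component_gmm(scores, iterations=50):
    """Fit a 2-component 1D Gaussian mixture with EM.

    Returns the component means and the per-point responsibilities.
    """
    n = len(scores)
    overall_mean = sum(scores) / n
    overall_var = max(sum((s - overall_mean) ** 2 for s in scores) / n, 1e-6)
    means = [min(scores), max(scores)]       # initialize at the extremes
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    resp = [[0.5, 0.5] for _ in scores]
    for _ in range(iterations):
        # E-step: posterior probability of each component for each score.
        for i, s in enumerate(scores):
            dens = [weights[k] * gaussian_pdf(s, means[k], variances[k]) for k in range(2)]
            total = dens[0] + dens[1]
            resp[i] = [d / total for d in dens] if total > 0 else [0.5, 0.5]
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            means[k] = sum(r[k] * s for r, s in zip(resp, scores)) / nk
            var = sum(r[k] * (s - means[k]) ** 2 for r, s in zip(resp, scores)) / nk
            variances[k] = max(var, 1e-6)    # floor avoids a degenerate component
            weights[k] = nk / n
    return means, resp


def flag_noisy(scores, threshold=0.5):
    """Flag points whose posterior for the higher-mean component exceeds threshold."""
    means, resp = fit_two_component_gmm(scores)
    high = 0 if means[0] > means[1] else 1
    return [r[high] > threshold for r in resp]


if __name__ == "__main__":
    ics = [0.05, 0.10, 0.12, 0.08, 0.90, 0.95, 0.85]  # hypothetical scores
    print(flag_noisy(ics))
```

Flagged nodes would then go to the correction step, where neighbor predictions are combined to propose a replacement label.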

[LG-67] Harnessing Reasoning Trajectories for Hallucination Detection via Answer-agreement Representation Shaping

Link: https://arxiv.org/abs/2601.17467
Authors: Jianxiong Zhang, Bing Guo, Yuming Jiang, Haobo Wang, Bo An, Xuefeng Du
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Large reasoning models (LRMs) often generate long, seemingly coherent reasoning traces yet still produce incorrect answers, making hallucination detection challenging. Although trajectories contain useful signals, directly using trace text or vanilla hidden states for detection is brittle: traces vary in form and detectors can overfit to superficial patterns rather than answer validity. We introduce Answer-agreement Representation Shaping (ARS), which learns detection-friendly trace-conditioned representations by explicitly encoding answer stability. ARS generates counterfactual answers through small latent interventions, specifically, perturbing the trace-boundary embedding, and labels each perturbation by whether the resulting answer agrees with the original. It then learns representations that bring answer-agreeing states together and separate answer-disagreeing ones, exposing latent instability indicative of hallucination risk. The shaped embeddings are plug-and-play with existing embedding-based detectors and require no human annotations during training. Experiments demonstrate that ARS consistently improves detection and achieves substantial gains over strong baselines.

[LG-68] DREAM: Dual-Standard Semantic Homogeneity with Dynamic Optimization for Graph Learning with Label Noise

Link: https://arxiv.org/abs/2601.17449
Authors: Yusheng Zhao, Jiaye Xie, Qixin Zhang, Weizhi Zhang, Xiao Luo, Zhiping Xiao, Philip S. Yu, Ming Zhang
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Graph neural networks (GNNs) have been widely used in various graph machine learning scenarios. Existing literature primarily assumes well-annotated training graphs, while the reliability of labels is not guaranteed in real-world scenarios. Recently, efforts have been made to address the problem of graph learning with label noise. However, existing methods often (i) struggle to distinguish between reliable and unreliable nodes, and (ii) overlook the relational information embedded in the graph topology. To tackle this problem, this paper proposes a novel method, Dual-Standard Semantic Homogeneity with Dynamic Optimization (DREAM), for reliable, relation-informed optimization on graphs with label noise. Specifically, we design a relation-informed dynamic optimization framework that iteratively reevaluates the reliability of each labeled node in the graph during the optimization process according to the relation of the target node and other nodes. To measure this relation comprehensively, we propose a dual-standard selection strategy that selects a set of anchor nodes based on both node proximity and graph topology. Subsequently, we compute the semantic homogeneity between the target node and the anchor nodes, which serves as guidance for optimization. We also provide a rigorous theoretical analysis to justify the design of DREAM. Extensive experiments are performed on six graph datasets across various domains under three types of graph label noise against competing baselines, and the results demonstrate the effectiveness of the proposed DREAM.

[LG-69] A new approach for combined model class selection and parameters learning for auto-regressive neural models

Link: https://arxiv.org/abs/2601.17442
Authors: Corrado Sgadari, Alessio La Bella, Marcello Farina
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments:

Abstract:This work introduces a novel approach for the joint selection of model structure and parameter learning for nonlinear dynamical systems identification. Focusing on a specific family of Recurrent Neural Networks (RNNs), i.e., Nonlinear Auto-Regressive with eXogenous inputs Echo State Networks (NARXESNs), the method simultaneously selects the optimal model class and learns model parameters from data through a new set-membership (SM) based procedure. The results show the effectiveness of the approach in identifying parsimonious yet accurate models suitable for control applications. Moreover, the proposed framework enables a robust training strategy that explicitly accounts for bounded measurement noise and enhances model robustness by allowing data-consistent evaluation of simulation performance during parameter learning, a process generally NP-hard for models with autoregressive components.

[LG-70] UniGRec: Unified Generative Recommendation with Soft Identifiers for End-to-End Optimization

Link: https://arxiv.org/abs/2601.17438
Authors: Jialei Li, Yang Zhang, Yimeng Bai, Shuai Zhu, Ziqi Xue, Xiaoyan Zhao, Dingxian Wang, Frank Yang, Andrew Rabinovich, Xiangnan He
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 11 pages, 6 figures

Abstract:Generative recommendation has recently emerged as a transformative paradigm that directly generates target items, surpassing traditional cascaded approaches. It typically involves two components: a tokenizer that learns item identifiers and a recommender trained on them. Existing methods often decouple tokenization from recommendation or rely on asynchronous alternating optimization, limiting full end-to-end alignment. To address this, we unify the tokenizer and recommender under the ultimate recommendation objective via differentiable soft item identifiers, enabling joint end-to-end training. However, this introduces three challenges: training-inference discrepancy due to soft-to-hard mismatch, item identifier collapse from codeword usage imbalance, and collaborative signal deficiency due to an overemphasis on fine-grained token-level semantics. To tackle these challenges, we propose UniGRec, a unified generative recommendation framework that addresses them from three perspectives. UniGRec employs Annealed Inference Alignment during tokenization to smoothly bridge soft training and hard inference, a Codeword Uniformity Regularization to prevent identifier collapse and encourage codebook diversity, and a Dual Collaborative Distillation mechanism that distills collaborative priors from a lightweight teacher model to jointly guide both the tokenizer and the recommender. Extensive experiments on real-world datasets demonstrate that UniGRec consistently outperforms state-of-the-art baseline methods. Our codes are available at this https URL.
ACM classes: H.3.3

[LG-71] Active Hypothesis Testing for Correlated Combinatorial Anomaly Detection

Link: https://arxiv.org/abs/2601.17430
Authors: Zichuan Yang, Yiming Xing
Subjects: Machine Learning (cs.LG)
Comments: 47 pages, 26 figures

Abstract:We study the problem of identifying an anomalous subset of streams under correlated noise, motivated by monitoring and security in cyber-physical systems. This problem can be viewed as a form of combinatorial pure exploration, where each stream plays the role of an arm and measurements must be allocated sequentially under uncertainty. Existing combinatorial bandit and hypothesis testing methods typically assume independent observations and fail to exploit correlation for efficient measurement design. We propose ECC-AHT, an adaptive algorithm that selects continuous, constrained measurements to maximize Chernoff information between competing hypotheses, enabling active noise cancellation through differential sensing. ECC-AHT achieves optimal sample complexity guarantees and significantly outperforms state-of-the-art baselines in both synthetic and real-world correlated environments. The code is available on this https URL

[LG-72] Efficient Dilated Squeeze and Excitation Neural Operator for Differential Equations

Link: https://arxiv.org/abs/2601.17407
Authors: Prajwal Chauhan, Salah Eddine Choutri, Saif Eddin Jabari
Subjects: Machine Learning (cs.LG)
Comments: Accepted to Transactions on Machine Learning Research (TMLR)

Abstract:Fast and accurate surrogates for physics-driven partial differential equations (PDEs) are essential in fields such as aerodynamics, porous media design, and flow control. However, many transformer-based models and existing neural operators remain parameter-heavy, resulting in costly training and sluggish deployment. We propose D-SENO (Dilated Squeeze-Excitation Neural Operator), a lightweight operator learning framework for efficiently solving a wide range of PDEs, including airfoil potential flow, Darcy flow in porous media, pipe Poiseuille flow, and incompressible Navier-Stokes vortical fields. D-SENO combines dilated convolution (DC) blocks with squeeze-and-excitation (SE) modules to jointly capture wide receptive fields and dynamics alongside channel-wise attention, enabling both accurate and efficient PDE inference. Carefully chosen dilation rates allow the receptive field to focus on critical regions, effectively modeling long-range physical dependencies. Meanwhile, the SE modules adaptively recalibrate feature channels to emphasize dynamically relevant scales. Our model trains up to approximately 20× faster than standard transformer-based models and neural operators, while also surpassing (or matching) them in accuracy across multiple PDE benchmarks. Ablation studies show that removing the SE modules leads to a slight drop in performance.
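The two building blocks named in the abstract can be sketched generically in one dimension. This is not the D-SENO architecture, whose layer layout and dilation rates are not given here. It only shows, under textbook definitions, how stacked dilated convolutions widen the receptive field and how a squeeze-and-excitation gate rescales channels; the weight matrices `w1` and `w2` are hypothetical.

```python
import math


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def dilated_conv1d(signal, kernel, dilation):
    """'Valid' 1D dilated cross-correlation (no padding, stride 1)."""
    span = (len(kernel) - 1) * dilation
    return [
        sum(w * signal[i + k * dilation] for k, w in enumerate(kernel))
        for i in range(len(signal) - span)
    ]


def squeeze_excite(channels, w1, w2):
    """Generic squeeze-and-excitation over a list of per-channel feature lists.

    Squeeze: global mean per channel. Excite: linear -> ReLU -> linear -> sigmoid.
    Each channel is then multiplied by its gate in (0, 1).
    """
    squeezed = [sum(c) / len(c) for c in channels]
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    return [[g * v for v in c] for g, c in zip(gates, channels)]


if __name__ == "__main__":
    # Three 3-tap layers with dilations 1, 2, 4 cover 15 input positions.
    print(receptive_field(3, [1, 2, 4]))
    print(dilated_conv1d([1, 2, 3, 4, 5], [1, 0, -1], dilation=2))
```

Doubling the dilation per layer grows the receptive field geometrically with depth while the parameter count grows only linearly, which is one reason such blocks can stay lightweight.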

[LG-73] Power-based Partial Attention: Bridging Linear-Complexity and Full Attention

Link: https://arxiv.org/abs/2601.17334
Authors: Yufeng Huang
Subjects: Machine Learning (cs.LG)
Comments: 12 pages, 3 figures

Abstract:It is widely accepted from transformer research that “attention is all we need”, but the amount of attention required has never been systematically quantified. Is quadratic $O(L^2)$ attention necessary, or is there a sub-quadratic attention mechanism that can achieve comparable performance? To answer this question, we introduce power-based partial attention (PPA), an attention mechanism of order $O(L^{1+p})$, where $0 \leq p \leq 1$, such that $p=0$ corresponds to sliding window attention with linear complexity, and $p=1$ corresponds to full attention. With this attention construction, we can explore how transformer architecture performance varies as a function of the attention scaling behavior controlled by $p$. The overall trend from our experiments shows an S-curve-like behavior where the performance transitions from sliding-window (linear-complexity) attention to full attention over a narrow window of $p$ values, and plateaus as $p$ approaches $1$. In our experiments, we show that there exists $0 < p < 1$ such that $O(L^{1+p})$ attention is sufficient to achieve similar results as $O(L^2)$ full attention.
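One way to see the cost scaling is to count attended query-key pairs under a mask whose per-query budget grows as a power of the sequence length. The construction below is a hypothetical stand-in, not the paper's PPA: it assumes a causal sliding window of width ceil(base_window * L^p), which reduces to a fixed window at p = 0 and to full causal attention at p = 1. How PPA actually chooses which keys to attend to is not stated in the abstract.

```python
import math


def window_size(seq_len, p, base_window=1):
    """Per-query key budget: grows like seq_len ** p, capped at seq_len."""
    return min(seq_len, math.ceil(base_window * seq_len ** p))


def causal_window_mask(seq_len, p, base_window=1):
    """mask[q][k] is True when query q may attend to key k.

    Each query sees itself and up to window_size - 1 preceding positions.
    """
    w = window_size(seq_len, p, base_window)
    return [[0 <= q - k < w for k in range(seq_len)] for q in range(seq_len)]


def attended_pairs(seq_len, p, base_window=1):
    """Number of (query, key) pairs evaluated, a proxy for attention cost."""
    return sum(sum(row) for row in causal_window_mask(seq_len, p, base_window))


if __name__ == "__main__":
    for p in (0.0, 0.5, 1.0):
        print(p, attended_pairs(16, p))
```

With L = 16 the pair count rises from 16 (p = 0) through 58 (p = 0.5) to 136 (p = 1, full causal attention), which tracks the $O(L^{1+p})$ scaling up to constants.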

[LG-74] PAR: Plausibility-aware Amortized Recourse Generation

Link: https://arxiv.org/abs/2601.17309
Authors: Anagha Sabu, Vidhya S, Narayanan C Krishnan
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Algorithmic recourse aims to recommend actionable changes to a factual’s attributes that flip an unfavorable model decision while remaining realistic and feasible. We formulate recourse as a Constrained Maximum A-Posteriori (MAP) inference problem under the accepted-class data distribution seeking counterfactuals with high likelihood while respecting other recourse constraints. We present PAR, an amortized approximate inference procedure that generates highly likely recourses efficiently. Recourse likelihood is estimated directly using tractable probabilistic models that admit exact likelihood evaluation and efficient gradient propagation that is useful during training. The recourse generator is trained with the objective of maximizing the likelihood under the accepted-class distribution while minimizing the likelihood under the denied-class distribution and other losses that encode recourse constraints. Furthermore, PAR includes a neighborhood-based conditioning mechanism to promote recourse generation that is customized to a factual. We validate PAR on widely used algorithmic recourse datasets and demonstrate its efficiency in generating recourses that are valid, similar to the factual, sparse, and highly plausible, yielding superior performance over existing state-of-the-art approaches.

[LG-75] Weighted Graph Clustering via Scale Contraction and Graph Structure Learning

Link: https://arxiv.org/abs/2601.17307
Authors: Haobing Liu, Yinuo Zhang, Tingting Wang, Ruobing Jiang, Yanwei Yu
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Graph clustering aims to partition nodes into distinct clusters based on their similarity, thereby revealing relationships among nodes. Nevertheless, most existing methods do not fully utilize edge weights. Leveraging edge weights in graph clustering tasks faces two critical challenges. (1) The introduction of edge weights may significantly increase storage space and training time, making it essential to reduce the graph scale while preserving nodes that are beneficial for the clustering task. (2) Edge weight information may inherently contain noise that negatively impacts clustering results. However, few studies jointly optimize clustering and edge weights, which is crucial for mitigating the negative impact of noisy edges on the clustering task. To address these challenges, we propose a contractile edge-weight-aware graph clustering network. Specifically, a cluster-oriented graph contraction module is designed to reduce the graph scale while preserving important nodes. An edge-weight-aware attention network is designed to identify and weaken noisy connections. In this way, we can more easily identify and mitigate the impact of noisy edges during the clustering process, thus enhancing clustering effectiveness. We conducted extensive experiments on three real-world weighted graph datasets. In particular, our model outperforms the best baseline, demonstrating its superior performance. Furthermore, experiments also show that the proposed graph contraction module can significantly reduce training time and storage space.

[LG-76] Tabular Foundation Models are Strong Graph Anomaly Detectors WWW2026

Link: https://arxiv.org/abs/2601.17301
Authors: Yunhui Liu,Tieke He,Yongchao Liu,Can Yi,Hong Jin,Chuntao Hong
Categories: Machine Learning (cs.LG)
Comments: Accepted by WWW 2026 (Short Paper)

Abstract:Graph anomaly detection (GAD), which aims to identify abnormal nodes that deviate from the majority, has become increasingly important in high-stakes Web domains. However, existing GAD methods follow a “one model per dataset” paradigm, leading to high computational costs, substantial data demands, and poor generalization when transferred to new datasets. This calls for a foundation model that enables a “one-for-all” GAD solution capable of detecting anomalies across diverse graphs without retraining. Yet, achieving this is challenging due to the large structural and feature heterogeneity across domains. In this paper, we propose TFM4GAD, a simple yet effective framework that adapts tabular foundation models (TFMs) for graph anomaly detection. Our key insight is that the core challenges of foundation GAD, handling heterogeneous features, generalizing across domains, and operating with scarce labels, are the exact problems that modern TFMs are designed to solve via synthetic pre-training and powerful in-context learning. The primary challenge thus becomes structural: TFMs are agnostic to graph topology. TFM4GAD bridges this gap by “flattening” the graph, constructing an augmented feature table that enriches raw node features with Laplacian embeddings, local and global structural characteristics, and anomaly-sensitive neighborhood aggregations. This augmented table is processed by a TFM in a fully in-context regime. Extensive experiments on multiple datasets with various TFM backbones reveal that TFM4GAD surprisingly achieves significant performance gains over specialized GAD models trained from scratch. Our work offers a new perspective and a practical paradigm for leveraging TFMs as powerful, generalist graph anomaly detectors.
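
The "flattening" step described in the abstract can be illustrated with a small sketch. This is not the TFM4GAD implementation: the function name `flatten_graph`, the choice of structural columns (degree, neighbor mean, absolute deviation from the neighbor mean), and the toy graph are all assumptions made for illustration, and no tabular foundation model is called.

```python
import numpy as np

def flatten_graph(adj, feats, k=2):
    """Turn a graph into an augmented node-feature table (illustrative only)."""
    deg = adj.sum(axis=1)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    # Eigenvectors for the k smallest non-trivial eigenvalues act as positional embeddings
    _, vecs = np.linalg.eigh(lap)
    lap_emb = vecs[:, 1:k + 1]
    # Neighborhood aggregation: mean of neighbor features and deviation from it
    neigh_mean = (adj @ feats) / np.maximum(deg, 1.0)[:, None]
    deviation = np.abs(feats - neigh_mean)
    return np.hstack([feats, lap_emb, deg[:, None], neigh_mean, deviation])

# Hypothetical 4-node graph; node 3 has unusual features
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
feats = np.array([[1.0, 0.0], [0.9, 0.1], [1.1, 0.0], [5.0, 3.0]])
table = flatten_graph(adj, feats, k=2)
print(table.shape)
```

Each row of `table` is one node, so the result can be handed to any tabular model; which structural columns actually help is an empirical question that the paper addresses and this sketch does not.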

[LG-77] Unrolled Neural Networks for Constrained Optimization

Link: https://arxiv.org/abs/2601.17274
Authors: Samar Hadou,Alejandro Ribeiro
Categories: Machine Learning (cs.LG)

Abstract:In this paper, we develop unrolled neural networks to solve constrained optimization problems, offering accelerated, learnable counterparts to dual ascent (DA) algorithms. Our framework, termed constrained dual unrolling (CDU), comprises two coupled neural networks that jointly approximate the saddle point of the Lagrangian. The primal network emulates an iterative optimizer that finds a stationary point of the Lagrangian for a given dual multiplier, sampled from an unknown distribution. The dual network generates trajectories towards the optimal multipliers across its layers while querying the primal network at each layer. Departing from standard unrolling, we induce DA dynamics by imposing primal-descent and dual-ascent constraints through constrained learning. We formulate training the two networks as a nested optimization problem and propose an alternating procedure that updates the primal and dual networks in turn, mitigating uncertainty in the multiplier distribution required for primal network training. We numerically evaluate the framework on mixed-integer quadratic programs (MIQPs) and power allocation in wireless networks. In both cases, our approach yields near-optimal near-feasible solutions and exhibits strong out-of-distribution (OOD) generalization.
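
For readers unfamiliar with dual ascent, the classical iteration that CDU is described as unrolling can be sketched on a toy problem. The problem below (projecting a point onto a half-space), the step size, and the function name `dual_ascent` are assumptions chosen so that the primal step has a closed form; the learned primal and dual networks of the paper are not reproduced here.

```python
import numpy as np

def dual_ascent(c, a, b, step=0.25, iters=50):
    """Solve min_x 0.5*||x - c||^2  s.t.  a.x <= b  by classical dual ascent."""
    lam = 0.0
    for _ in range(iters):
        # Primal step: minimizer of the Lagrangian for the current multiplier
        x = c - lam * a
        # Dual step: projected gradient ascent on the multiplier (lam >= 0)
        lam = max(0.0, lam + step * (a @ x - b))
    return c - lam * a, lam

c = np.array([2.0, 2.0])
a = np.array([1.0, 1.0])
x_opt, lam_opt = dual_ascent(c, a, b=2.0)
print(x_opt, lam_opt)
```

In an unrolled version, each loop iteration would correspond to one network layer, with the hand-written primal and dual updates replaced by learned ones.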

[LG-78] AGZO: Activation-Guided Zeroth-Order Optimization for LLM Fine-Tuning

Link: https://arxiv.org/abs/2601.17261
Authors: Wei Lin,Yining Jiang,Qingyu Song,Qiao Xiang,Hong Xu
Categories: Machine Learning (cs.LG)
Comments: 21 pages in total, including 9 pages of main text, with 4 figures and 3 tables. This manuscript is submitted to arXiv

Abstract:Zeroth-Order (ZO) optimization has emerged as a promising solution for fine-tuning LLMs under strict memory constraints, as it avoids the prohibitive memory cost of storing activations for backpropagation. However, existing ZO methods typically employ isotropic perturbations, neglecting the rich structural information available during the forward pass. In this paper, we identify a crucial link between gradient formation and activation structure: the gradient of a linear layer is confined to the subspace spanned by its input activations. Leveraging this insight, we propose Activation-Guided Zeroth-Order optimization (AGZO). Unlike prior methods, AGZO extracts a compact, activation-informed subspace on the fly during the forward pass and restricts perturbations to this low-rank subspace. We provide a theoretical framework showing that AGZO optimizes a subspace-smoothed objective and provably yields update directions with higher cosine similarity to the true gradient than isotropic baselines. Empirically, we evaluate AGZO on Qwen3 and Pangu models across various benchmarks. AGZO consistently outperforms state-of-the-art ZO baselines and significantly narrows the performance gap with first-order fine-tuning, while maintaining almost the same peak memory footprint as other ZO methods.
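
The link between gradients and activations stated in the abstract can be checked numerically on a single linear layer. The sketch below is not the AGZO algorithm: the squared-error loss, the QR-based basis `Q`, the two-point estimator, and all sizes are assumptions for illustration. It verifies that the weight gradient lies in the span of the input activations, then compares isotropic perturbations with perturbations restricted to that span.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 4, 64, 8                 # batch smaller than input width
X = rng.normal(size=(n, d_in))            # input activations
T = rng.normal(size=(n, d_out))           # regression targets
W = rng.normal(size=(d_out, d_in))        # linear layer weights

def loss(weights):
    return 0.5 * np.sum((X @ weights.T - T) ** 2)

# Exact gradient: every row is a linear combination of the rows of X
grad = (X @ W.T - T).T @ X

# Orthonormal basis of the activation subspace (columns of Q span the rows of X)
Q, _ = np.linalg.qr(X.T)
residual = np.linalg.norm(grad - grad @ Q @ Q.T)

def zo_estimate(Z, mu=1e-3):
    """Two-point zeroth-order estimate along the perturbation direction Z."""
    return (loss(W + mu * Z) - loss(W - mu * Z)) / (2 * mu) * Z

def cosine(A, B):
    return float(np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B)))

iso, sub = [], []
for _ in range(200):
    iso.append(cosine(zo_estimate(rng.normal(size=W.shape)), grad))
    sub.append(cosine(zo_estimate(rng.normal(size=(d_out, n)) @ Q.T), grad))
mean_iso, mean_sub = np.mean(iso), np.mean(sub)
print(residual, mean_iso, mean_sub)
```

In this toy setting the subspace-restricted estimate aligns better with the true gradient because the perturbation lives in `d_out * n` dimensions instead of `d_out * d_in`. How AGZO extracts its subspace in a real LLM is described in the paper, not here.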

[LG-79] A Constrained Optimization Perspective of Unrolled Transformers

Link: https://arxiv.org/abs/2601.17257
Authors: Javier Porras-Valenzuela,Samar Hadou,Alejandro Ribeiro
Categories: Machine Learning (cs.LG)

Abstract:We introduce a constrained optimization framework for training transformers that behave like optimization descent algorithms. Specifically, we enforce layerwise descent constraints on the objective function and replace standard empirical risk minimization (ERM) with a primal-dual training scheme. This approach yields models whose intermediate representations decrease the loss monotonically in expectation across layers. We apply our method to both unrolled transformer architectures and conventional pretrained transformers on tasks of video denoising and text classification. Across these settings, we observe constrained transformers achieve stronger robustness to perturbations and maintain higher out-of-distribution generalization, while preserving in-distribution performance.

[LG-80] Parameter Inference and Uncertainty Quantification with Diffusion Models: Extending CDI to 2D Spatial Conditioning

Link: https://arxiv.org/abs/2601.17224
Authors: Dmitrii Torbunov,Yihui Ren,Lijun Wu,Yimei Zhu
Categories: Machine Learning (cs.LG)

Abstract:Uncertainty quantification is critical in scientific inverse problems to distinguish identifiable parameters from those that remain ambiguous given available measurements. The Conditional Diffusion Model-based Inverse Problem Solver (CDI) has previously demonstrated effective probabilistic inference for one-dimensional temporal signals, but its applicability to higher-dimensional spatial data remains unexplored. We extend CDI to two-dimensional spatial conditioning, enabling probabilistic parameter inference directly from spatial observations. We validate this extension on convergent beam electron diffraction (CBED) parameter inference - a challenging multi-parameter inverse problem in materials characterization where sample geometry, electronic structure, and thermal properties must be extracted from 2D diffraction patterns. Using simulated CBED data with ground-truth parameters, we demonstrate that CDI produces well-calibrated posterior distributions that accurately reflect measurement constraints: tight distributions for well-determined quantities and appropriately broad distributions for ambiguous parameters. In contrast, standard regression methods - while appearing accurate on aggregate metrics - mask this underlying uncertainty by predicting training set means for poorly constrained parameters. Our results confirm that CDI successfully extends from temporal to spatial domains, providing the genuine uncertainty information required for robust scientific inference.

[LG-81] Evaluation on Entity Matching in Recommender Systems

Link: https://arxiv.org/abs/2601.17218
Authors: Zihan Huang,Rohan Surana,Zhouhang Xie,Junda Wu,Yu Xia,Julian McAuley
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:Entity matching is a crucial component in various recommender systems, including conversational recommender systems (CRS) and knowledge-based recommender systems. However, the lack of rigorous evaluation frameworks for cross-dataset entity matching impedes progress in areas such as LLM-driven conversational recommendations and knowledge-grounded dataset construction. In this paper, we introduce Reddit-Amazon-EM, a novel dataset comprising naturally occurring items from Reddit and the Amazon '23 dataset. Through careful manual annotation, we identify corresponding movies across Reddit-Movies and Amazon’23, two existing recommender system datasets with inherently overlapping catalogs. Leveraging Reddit-Amazon-EM, we conduct a comprehensive evaluation of state-of-the-art entity matching methods, including rule-based, graph-based, lexical-based, embedding-based, and LLM-based approaches. For reproducible research, we release our manually annotated entity matching gold set and provide the mapping between the two datasets using the best-performing method from our experiments. This serves as a valuable resource for advancing future work on entity matching in recommender systems.

[LG-82] JetFormer: A Scalable and Efficient Transformer for Jet Tagging from Offline Analysis to FPGA Triggers

Link: https://arxiv.org/abs/2601.17215
Authors: Ruoqing Zheng,Chang Sun,Qibin Liu,Lauri Laatu,Arianna Cox,Benedikt Maier,Alexander Tapper,Jose G. F. Coutinho,Wayne Luk,Zhiqiang Que
Categories: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
Comments: 15 pages

Abstract:We present JetFormer, a versatile and scalable encoder-only Transformer architecture for particle jet tagging at the Large Hadron Collider (LHC). Unlike prior approaches that are often tailored to specific deployment regimes, JetFormer is designed to operate effectively across the full spectrum of jet tagging scenarios, from high-accuracy offline analysis to ultra-low-latency online triggering. The model processes variable-length sets of particle features without relying on input of explicit pairwise interactions, yet achieves competitive or superior performance compared to state-of-the-art methods. On the large-scale JetClass dataset, a large-scale JetFormer matches the accuracy of the interaction-rich ParT model (within 0.7%) while using 37.4% fewer FLOPs, demonstrating its computational efficiency and strong generalization. On benchmark HLS4ML 150P datasets, JetFormer consistently outperforms existing models such as MLPs, Deep Sets, and Interaction Networks by 3-4% in accuracy. To bridge the gap to hardware deployment, we further introduce a hardware-aware optimization pipeline based on multi-objective hyperparameter search, yielding compact variants like JetFormer-tiny suitable for FPGA-based trigger systems with sub-microsecond latency requirements. Through structured pruning and quantization, we show that JetFormer can be aggressively compressed with minimal accuracy loss. By unifying high-performance modeling and deployability within a single architectural framework, JetFormer provides a practical pathway for deploying Transformer-based jet taggers in both offline and online environments at the LHC. Code is available at this https URL.

[LG-83] NewPINNs: Physics-Informing Neural Networks Using Conventional Solvers for Partial Differential Equations

Link: https://arxiv.org/abs/2601.17207
Authors: Maedeh Makki,Satish Chandran,Maziar Raissi,Adrien Grenier,Behzad Mohebbi
Categories: Machine Learning (cs.LG)

Abstract:We introduce NewPINNs, a physics-informing learning framework that couples neural networks with conventional numerical solvers for solving differential equations. Rather than enforcing governing equations and boundary conditions through residual-based loss terms, NewPINNs integrates the solver directly into the training loop and defines learning objectives through solver-consistency. The neural network produces candidate solution states that are advanced by the numerical solver, and training minimizes the discrepancy between the network prediction and the solver-evolved state. This pull-push interaction enables the network to learn physically admissible solutions through repeated exposure to the solver’s action, without requiring problem-specific loss engineering or explicit evaluation of differential equation residuals. By delegating the enforcement of physics, boundary conditions, and numerical stability to established numerical solvers, NewPINNs mitigates several well-known failure modes of standard physics-informed neural networks, including optimization pathologies, sensitivity to loss weighting, and poor performance in stiff or nonlinear regimes. We demonstrate the effectiveness of the proposed approach across multiple forward and inverse problems involving finite volume, finite element, and spectral solvers.

[LG-84] SpecBridge: Bridging Mass Spectrometry and Molecular Representations via Cross-Modal Alignment

Link: https://arxiv.org/abs/2601.17204
Authors: Yinkai Wang,Yan Zhou Chen,Xiaohui Chen,Li-Ping Liu,Soha Hassoun
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments: preprint

Abstract:Small-molecule identification from tandem mass spectrometry (MS/MS) remains a bottleneck in untargeted settings where spectral libraries are incomplete. While deep learning offers a solution, current approaches typically fall into two extremes: explicit generative models that construct molecular graphs atom-by-atom, or joint contrastive models that learn cross-modal subspaces from scratch. We introduce SpecBridge, a novel implicit alignment framework that treats structure identification as a geometric alignment problem. SpecBridge fine-tunes a self-supervised spectral encoder (DreaMS) to project directly into the latent space of a frozen molecular foundation model (ChemBERTa), and then performs retrieval by cosine similarity to a fixed bank of precomputed molecular embeddings. Across MassSpecGym, Spectraverse, and MSnLib benchmarks, SpecBridge improves top-1 retrieval accuracy by roughly 20-25% relative to strong neural baselines, while keeping the number of trainable parameters small. These results suggest that aligning to frozen foundation models is a practical, stable alternative to designing new architectures from scratch. The code for SpecBridge is released at this https URL.
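
The retrieval step mentioned in the abstract, cosine similarity against a fixed bank of precomputed embeddings, reduces to a few lines. The sketch below uses made-up three-dimensional vectors and a hypothetical helper `retrieve_top_k`; it does not involve DreaMS or ChemBERTa, which would supply the real query and bank embeddings.

```python
import numpy as np

def retrieve_top_k(query, bank, k=1):
    """Rank bank rows by cosine similarity to the query (illustrative only)."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    scores = b @ q                       # cosine similarity to each bank entry
    order = np.argsort(-scores)[:k]      # indices of the k best matches
    return order, scores[order]

# Hypothetical bank of three molecular embeddings and one projected spectrum
bank = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.6, 0.8, 0.0]])
query = np.array([0.5, 0.9, 0.1])
idx, sims = retrieve_top_k(query, bank, k=2)
print(idx, sims)
```

Because the bank is fixed, its normalized rows can be computed once and reused for every query.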

[LG-85] Accelerated Sinkhorn Algorithms for Partial Optimal Transport

Link: https://arxiv.org/abs/2601.17196
Authors: Nghia Thu Truong,Qui Phu Pham,Quang Nguyen,Dung Luong,Mai Tran
Categories: Machine Learning (cs.LG)

Abstract:Partial Optimal Transport (POT) addresses the problem of transporting only a fraction of the total mass between two distributions, making it suitable when marginals have unequal size or contain outliers. While Sinkhorn-based methods are widely used, their complexity bounds for POT remain suboptimal and can limit scalability. We introduce Accelerated Sinkhorn for POT (ASPOT), which integrates alternating minimization with Nesterov-style acceleration in the POT setting, yielding a complexity of $\mathcal{O}(n^{7/3}\varepsilon^{-5/3})$. We also show that an informed choice of the entropic parameter $\gamma$ improves rates for the classical Sinkhorn method. Experiments on real-world applications validate our theories and demonstrate the favorable performance of our proposed methods.
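
As background, the classical Sinkhorn iteration for balanced entropic optimal transport can be sketched as follows. This is the textbook alternating-scaling scheme, not ASPOT and not the partial-transport variant; the cost matrix, marginals, and the value of the entropic parameter `gamma` are assumptions chosen for a quick demonstration.

```python
import numpy as np

def sinkhorn(cost, r, c, gamma=0.5, iters=1000):
    """Classical Sinkhorn for balanced entropic OT with marginals r and c."""
    K = np.exp(-cost / gamma)            # Gibbs kernel
    u = np.ones_like(r)
    for _ in range(iters):
        v = c / (K.T @ u)                # rescale to match column marginals
        u = r / (K @ v)                  # rescale to match row marginals
    return u[:, None] * K * v[None, :]   # transport plan diag(u) K diag(v)

cost = np.array([[0.0, 1.0, 2.0],
                 [1.0, 0.0, 1.0],
                 [2.0, 1.0, 0.0]])
r = np.array([0.5, 0.3, 0.2])
c = np.array([0.2, 0.3, 0.5])
plan = sinkhorn(cost, r, c)
print(plan.sum(axis=1), plan.sum(axis=0))
```

A smaller `gamma` gives a plan closer to the unregularized optimum but slows convergence and can underflow in `K`, which is one reason the choice of the entropic parameter matters.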

[LG-86] Optimizing the Landscape of LLM Embeddings with Dynamic Exploratory Graph Analysis for Generative Psychometrics: A Monte Carlo Study

Link: https://arxiv.org/abs/2601.17010
Authors: Hudson Golino
Categories: Machine Learning (cs.LG); Applications (stat.AP)
Comments: 18 pages, 6 figures, conference paper

Abstract:Large language model (LLM) embeddings are increasingly used to estimate dimensional structure in psychological item pools prior to data collection, yet current applications treat embeddings as static, cross-sectional representations. This approach implicitly assumes uniform contribution across all embedding coordinates and overlooks the possibility that optimal structural information may be concentrated in specific regions of the embedding space. This study reframes embeddings as searchable landscapes and adapts Dynamic Exploratory Graph Analysis (DynEGA) to systematically traverse embedding coordinates, treating the dimension index as a pseudo-temporal ordering analogous to intensive longitudinal trajectories. A large-scale Monte Carlo simulation embedded items representing five dimensions of grandiose narcissism using OpenAI’s text-embedding-3-small model, generating network estimations across systematically varied item pool sizes (3-40 items per dimension) and embedding depths (3-1,298 dimensions). Results reveal that Total Entropy Fit Index (TEFI) and Normalized Mutual Information (NMI) lead to competing optimization trajectories across the embedding landscape. TEFI achieves minima at deep embedding ranges (900–1,200 dimensions) where entropy-based organization is maximal but structural accuracy degrades, whereas NMI peaks at shallow depths where dimensional recovery is strongest but entropy-based fit remains suboptimal. Single-metric optimization produces structurally incoherent solutions, whereas a weighted composite criterion identifies embedding depth regions that jointly balance accuracy and organization. Optimal embedding depth scales systematically with item pool size. These findings establish embedding landscapes as non-uniform semantic spaces requiring principled optimization rather than default full-vector usage.

[LG-87] Bayesian Robust Financial Trading with Adversarial Synthetic Market Data

Link: https://arxiv.org/abs/2601.17008
Authors: Haochong Xia,Simin Li,Ruixiao Xu,Zhixia Zhang,Hongxiang Wang,Zhiqian Liu,Teng Yao Long,Molei Qin,Chuqiao Zong,Bo An
Categories: Machine Learning (cs.LG); Trading and Market Microstructure (q-fin.TR)

Abstract:Algorithmic trading relies on machine learning models to make trading decisions. Despite strong in-sample performance, these models often degrade when confronted with evolving real-world market regimes, which can shift dramatically due to macroeconomic changes, e.g., monetary policy updates or unanticipated fluctuations in participant behavior. We identify two challenges that perpetuate this mismatch: (1) insufficient robustness of existing policies against uncertainties in high-level market fluctuations, and (2) the absence of a realistic and diverse simulation environment for training, leading to policy overfitting. To address these issues, we propose a Bayesian Robust Framework that systematically integrates a macro-conditioned generative model with robust policy learning. On the data side, to generate realistic and diverse data, we propose a macro-conditioned GAN-based generator that leverages macroeconomic indicators as primary control variables, synthesizing data with faithful temporal, cross-instrument, and macro correlations. On the policy side, to learn a robust policy against market fluctuations, we cast the trading process as a two-player zero-sum Bayesian Markov game, wherein an adversarial agent simulates shifting regimes by perturbing macroeconomic indicators in the macro-conditioned generator, while the trading agent, guided by a quantile belief network, maintains and updates its belief over hidden market states. The trading agent seeks a Robust Perfect Bayesian Equilibrium via Bayesian neural fictitious self-play, stabilizing learning under adversarial market perturbations. Extensive experiments on 9 financial instruments demonstrate that our framework outperforms 9 state-of-the-art baselines. In extreme events such as the COVID-19 pandemic, our method shows improved profitability and risk management, offering a reliable solution for trading under uncertain and shifting market dynamics.

[LG-88] Analysis of voice recordings features for Classification of Parkinsons Disease

Link: https://arxiv.org/abs/2601.17007
Authors: Beatriz Pérez-Sánchez,Noelia Sánchez-Maroño,Miguel A. Díaz-Freire
Categories: Machine Learning (cs.LG)

Abstract:Parkinson’s disease (PD) is a chronic neurodegenerative disease. Early diagnosis is essential to mitigate the progressive deterioration of patients’ quality of life. The most characteristic motor symptoms are very mild in the early stages, making diagnosis difficult. Recent studies have shown that the use of patient voice recordings can aid in early diagnosis. Although the analysis of such recordings is costly from a clinical point of view, advances in machine learning techniques are making the processing of this type of data increasingly accurate and efficient. Vocal recordings contain many features, but it is not known whether all of them are relevant for diagnosing the disease. This paper proposes the use of different types of machine learning models combined with feature selection methods to detect the disease. The selection techniques make it possible to reduce the number of features used by the classifiers by determining which ones provide the most information about the problem. The results show that machine learning methods, in particular neural networks, are suitable for PD classification and that the number of features can be significantly reduced without affecting the performance of the models.

Journal reference: ACM SAC Conference 2024. Related DOI: https://doi.org/10.1145/3605098.3636135

[LG-89] A Dataset of Dengue Hospitalizations in Brazil (1999 to 2021) with Weekly Disaggregation from Monthly Counts

Link: https://arxiv.org/abs/2601.16994
Authors: Lucas M. Morello,Matheus Lima Castro,Pedro Cesar M. G. Camargo,Liliane Moreira Nery,Darllan Collins da Cunha e Silva,Leopoldo Lusquino Filho
Categories: Machine Learning (cs.LG)

Abstract:This data paper describes and publicly releases this dataset (v1.0.0), published on Zenodo under DOI https://doi.org/10.5281/zenodo.18189192. Motivated by the need to increase the temporal granularity of originally monthly data to enable more effective training of AI models for epidemiological forecasting, the dataset harmonizes municipal-level dengue hospitalization time series across Brazil and disaggregates them to weekly resolution (epidemiological weeks) through an interpolation protocol with a correction step that preserves monthly totals. The statistical and temporal validity of this disaggregation was assessed using a high-resolution reference dataset from the state of Sao Paulo (2024), which simultaneously provides monthly and epidemiological-week counts, enabling a direct comparison of three strategies: linear interpolation, jittering, and cubic spline. Results indicated that cubic spline interpolation achieved the highest adherence to the reference data, and this strategy was therefore adopted to generate weekly series for the 1999 to 2021 period. In addition to hospitalization time series, the dataset includes a comprehensive set of explanatory variables commonly used in epidemiological and environmental modeling, such as demographic density, CH4, CO2, and NO2 emissions, poverty and urbanization indices, maximum temperature, mean monthly precipitation, minimum relative humidity, and municipal latitude and longitude, following the same temporal disaggregation scheme to ensure multivariate compatibility. The paper documents the dataset's provenance, structure, formats, licenses, limitations, and quality metrics (MAE, RMSE, R2, KL, JSD, DTW, and the KS test), and provides usage recommendations for multivariate time-series analysis, environmental health studies, and the development of machine learning and deep learning models for outbreak forecasting.
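
The idea of interpolating to a finer resolution and then correcting so that monthly totals are preserved can be sketched briefly. The version below makes two simplifying assumptions that differ from the paper: it uses linear interpolation rather than the cubic spline the authors adopted, and it splits every month into four equal sub-periods instead of true epidemiological weeks. The function name `disaggregate` is hypothetical.

```python
import numpy as np

def disaggregate(monthly, parts=4):
    """Split monthly counts into finer steps while preserving each monthly total."""
    monthly = np.asarray(monthly, dtype=float)
    m = len(monthly)
    month_mid = np.arange(m) + 0.5                        # month midpoints
    step_mid = (np.arange(m * parts) + 0.5) / parts       # sub-period midpoints
    # Interpolation step (linear here; the paper reports using a cubic spline)
    rough = np.interp(step_mid, month_mid, monthly / parts)
    rough = np.clip(rough, 0.0, None).reshape(m, parts)
    # Correction step: rescale each month's sub-periods to sum to the monthly total
    sums = rough.sum(axis=1, keepdims=True)
    scale = np.divide(monthly[:, None], sums,
                      out=np.zeros_like(sums), where=sums > 0)
    return (rough * scale).ravel()

monthly = [8, 20, 12]          # hypothetical monthly hospitalization counts
weekly = disaggregate(monthly)
print(weekly.reshape(3, 4).sum(axis=1))
```

The rescaling guarantees consistency with the monthly source data regardless of which interpolant produced the rough curve.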

[LG-90] Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models

Link: https://arxiv.org/abs/2601.15801
Authors: Fengheng Chu,Jiahao Chen,Yuhong Wang,Jun Wang,Zhihui Fu,Shouling Ji,Songze Li
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:While Large Language Models (LLMs) are aligned to mitigate risks, their safety guardrails remain fragile against jailbreak attacks. This reveals limited understanding of components governing safety. Existing methods rely on local, greedy attribution that assumes independent component contributions. However, they overlook the cooperative interactions between different components in LLMs, such as attention heads, which jointly contribute to safety mechanisms. We propose Global Optimization for Safety Vector Extraction (GOSV), a framework that identifies safety-critical attention heads through global optimization over all heads simultaneously. We employ two complementary activation repatching strategies: Harmful Patching and Zero Ablation. These strategies identify two spatially distinct sets of safety vectors with consistently low overlap, termed Malicious Injection Vectors and Safety Suppression Vectors, demonstrating that aligned LLMs maintain separate functional pathways for safety purposes. Through systematic analyses, we find that complete safety breakdown occurs when approximately 30% of total heads are repatched across all models. Building on these insights, we develop a novel inference-time white-box jailbreak method that exploits the identified safety vectors through activation repatching. Our attack substantially outperforms existing white-box attacks across all test models, providing strong evidence for the effectiveness of the proposed GOSV framework on LLM safety interpretability.

[LG-91] Learning to Discover: A Generalized Framework for Raga Identification without Forgetting

Link: https://arxiv.org/abs/2601.18766
Authors: Parampreet Singh,Somya Kumar,Chaitanya Shailendra Nitawe,Vipul Arora
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Comments: Accepted at NCC 2026 conference

Abstract:Raga identification in Indian Art Music (IAM) remains challenging due to the presence of numerous rarely performed Ragas that are not represented in available training datasets. Traditional classification models struggle in this setting, as they assume a closed set of known categories and therefore fail to recognise or meaningfully group previously unseen Ragas. Recent works have tried categorizing unseen Ragas, but they run into a problem of catastrophic forgetting, where the knowledge of previously seen Ragas is diminished. To address this problem, we adopt a unified learning framework that leverages both labeled and unlabeled audio, enabling the model to discover coherent categories corresponding to the unseen Ragas, while retaining the knowledge of previously known ones. We test our model on benchmark Raga Identification datasets and demonstrate its performance in categorizing previously seen, unseen, and all Raga classes. The proposed approach surpasses the previous NCD-based pipeline even in discovering the unseen Raga categories, offering new insights into representation learning for IAM tasks.

[LG-92] Data-Driven Qubit Characterization and Optimal Control using Deep Learning

Link: https://arxiv.org/abs/2601.18704
Authors: Paul Surrey,Julian D. Teske,Tobias Hangleiter,Hendrik Bluhm,Pascal Cerfontaine
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)

Abstract:Quantum computing requires the optimization of control pulses to achieve high-fidelity quantum gates. We propose a machine learning-based protocol to address the challenges of evaluating gradients and modeling complex system dynamics. By training a recurrent neural network (RNN) to predict qubit behavior, our approach enables efficient gradient-based pulse optimization without the need for a detailed system model. First, we sample qubit dynamics using random control pulses with weak prior assumptions. We then train the RNN on the system’s observed responses, and use the trained model to optimize high-fidelity control pulses. We demonstrate the effectiveness of this approach through simulations on a single $S$-$T_0$ qubit.

[LG-93] LLAMA LIMA: A Living Meta-Analysis on the Effects of Generative AI on Learning Mathematics

Link: https://arxiv.org/abs/2601.18685
Authors: Anselm Strohmaier,Samira Bödefeld,Frank Reinhold
Categories: History and Overview (math.HO); Machine Learning (cs.LG)

Abstract:The capabilities of generative AI in mathematics education are rapidly evolving, posing significant challenges for research to keep pace. Research syntheses remain scarce and risk being outdated by the time of publication. To address this issue, we present a Living Meta-Analysis (LIMA) on the effects of generative AI-based interventions for learning mathematics. Following PRISMA-LSR guidelines, we continuously update the literature base, apply a Bayesian multilevel meta-regression model to account for cumulative data, and publish updated versions on a preprint server at regular intervals. This paper reports results from the first version, including 15 studies. The analyses indicate a small positive effect (g = 0.31) with a wide credible interval [0.06, 0.58], reflecting the still limited evidence base.

[LG-94] Learned harmonic mean estimation of the marginal likelihood for multimodal posteriors with flow matching

Link: https://arxiv.org/abs/2601.18683
Authors: Alicja Polanska,Jason D. McEwen
Categories: Methodology (stat.ME); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: Submitted to 44th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering

Abstract:The marginal likelihood, or Bayesian evidence, is a crucial quantity for Bayesian model comparison but its computation can be challenging for complex models, even in parameter spaces of moderate dimension. The learned harmonic mean estimator has been shown to provide accurate and robust estimates of the marginal likelihood simply using posterior samples. It is agnostic to the sampling strategy, meaning that the samples can be obtained using any method. This enables marginal likelihood calculation and model comparison with whatever sampling is most suitable for the task. However, the internal density estimators considered previously for the learned harmonic mean can struggle with highly multimodal posteriors. In this work we introduce flow matching-based continuous normalizing flows as a powerful architecture for the internal density estimation of the learned harmonic mean. We demonstrate the ability to handle challenging multimodal posteriors, including an example in 20 parameter dimensions, showcasing the method’s ability to handle complex posteriors without the need for fine-tuning or heuristic modifications to the base distribution.
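
The re-targeted harmonic mean identity behind this estimator can be checked on a one-dimensional conjugate example where the evidence is known in closed form. In the sketch below the target density is a hand-picked Gaussian narrower than the posterior, standing in for the learned density (a flow in the paper); the model, sample size, and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Toy model: prior theta ~ N(0, 1), one observation y = 0 with y | theta ~ N(theta, 1).
# The posterior is N(0, 1/2) and the evidence is N(0; 0, 2) in closed form.
theta = rng.normal(0.0, np.sqrt(0.5), size=200_000)   # posterior samples

likelihood = normal_pdf(0.0, theta, 1.0)
prior = normal_pdf(theta, 0.0, 1.0)
# Normalized target with lighter tails than the posterior (hand-picked, not learned)
target = normal_pdf(theta, 0.0, 0.25)

# Estimator of the reciprocal evidence: mean of target / (likelihood * prior)
inv_z = np.mean(target / (likelihood * prior))
z_hat = 1.0 / inv_z
z_true = normal_pdf(0.0, 0.0, 2.0)
print(z_hat, z_true)
```

Keeping the target inside the posterior's bulk is what bounds the importance ratio and avoids the exploding variance of the original harmonic mean estimator; a multimodal posterior needs a target flexible enough to do this for every mode.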

[LG-95] Out-of-Distribution Radar Detection with Complex VAEs: Theory, Whitening, and ANMF Fusion

Link: https://arxiv.org/abs/2601.18677
Authors: Yadang Alexis Rouzoumka,Jean Pinsolle,Eugénie Terreaux,Christèle Morisseau,Jean-Philippe Ovarlez,Chengfang Ren
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 13 pages, 12 figures, submitted to IEEE Transactions on Signal Processing

Abstract:We investigate the detection of weak complex-valued signals immersed in non-Gaussian, range-varying interference, with emphasis on maritime radar scenarios. The proposed methodology exploits a Complex-valued Variational AutoEncoder (CVAE) trained exclusively on clutter-plus-noise to perform Out-Of-Distribution detection. By operating directly on in-phase / quadrature samples, the CVAE preserves phase and Doppler structure and is assessed in two configurations: (i) using unprocessed range profiles and (ii) after local whitening, where per-range covariance estimates are obtained from neighboring profiles. Using extensive simulations together with real sea-clutter data from the CSIR maritime dataset, we benchmark performance against classical and adaptive detectors (MF, NMF, AMF-SCM, ANMF-SCM, ANMF-Tyler). In both configurations, the CVAE yields a higher detection probability Pd at matched false-alarm rate Pfa, with the most notable improvements observed under whitening. We further integrate the CVAE with the ANMF through a weighted log-p fusion rule at the decision level, attaining enhanced robustness in strongly non-Gaussian clutter and enabling empirically calibrated Pfa control under H0. Overall, the results demonstrate that statistical normalization combined with complex-valued generative modeling substantively improves detection in realistic sea-clutter conditions, and that the fused CVAE-ANMF scheme constitutes a competitive alternative to established model-based detectors.

[LG-96] Uniform Computability of PAC Learning

Link: https://arxiv.org/abs/2601.18663
Authors: Vasco Brattka, Guillaume Chirache
Subjects: Logic (math.LO); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)

Abstract:We study uniform computability properties of PAC learning using Weihrauch complexity. We focus on closed concept classes, which are either represented by positive, by negative or by full information. Among other results, we prove that proper PAC learning from positive information is equivalent to the limit operation on Baire space, whereas improper PAC learning from positive information is closely related to Weak Kőnig’s Lemma and even equivalent to it, when we have some negative information about the admissible hypotheses. If arbitrary hypotheses are allowed, then improper PAC learning from positive information is still in a finitary DNC range, which implies that it is non-deterministically computable, but does not allow for probabilistic algorithms. These results can also be seen as a classification of the degree of constructivity of the Fundamental Theorem of Statistical Learning. All the aforementioned results hold if an upper bound of the VC dimension is provided as an additional input information. We also study the question of how these results are affected if the VC dimension is not given, but only promised to be finite or if concept classes are represented by negative or full information. Finally, we also classify the complexity of the VC dimension operation itself, which is a problem that is of independent interest. For positive or full information it turns out to be equivalent to the binary sorting problem, for negative information it is equivalent to the jump of sorting. This classification allows also conclusions regarding the Borel complexity of PAC learnability.

[LG-97] Universality of Many-body Projected Ensemble for Learning Quantum Data Distribution

Link: https://arxiv.org/abs/2601.18637
Authors: Quoc Hoan Tran, Koki Chinzei, Yasuhiro Endo, Hirotaka Oshima
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 21 pages, 6 figures

Abstract:Generating quantum data by learning the underlying quantum distribution poses challenges in both theoretical and practical scenarios, yet it is a critical task for understanding quantum systems. A fundamental question in quantum machine learning (QML) is the universality of approximation: whether a parameterized QML model can approximate any quantum distribution. We address this question by proving a universality theorem for the Many-body Projected Ensemble (MPE) framework, a method for quantum state design that uses a single many-body wave function to prepare random states. This demonstrates that MPE can approximate any distribution of pure states within a 1-Wasserstein distance error. This theorem provides a rigorous guarantee of universal expressivity, addressing key theoretical gaps in QML. For practicality, we propose an Incremental MPE variant with layer-wise training to improve the trainability. Numerical experiments on clustered quantum states and quantum chemistry datasets validate MPE’s efficacy in learning complex quantum data distributions.

Information Retrieval

[IR-0] S2GR: Stepwise Semantic-Guided Reasoning in Latent Space for Generative Recommendation

Link: https://arxiv.org/abs/2601.18664
Authors: Zihao Guo, Jian Wang, Ruxin Zhou, Youhua Liu, Jiawei Guo, Jun Zhao, Xiaoxiao Xu, Yongqi Liu, Kaiqiao Zhan
Subjects: Information Retrieval (cs.IR)

Abstract:Generative Recommendation (GR) has emerged as a transformative paradigm with its end-to-end generation advantages. However, existing GR methods primarily focus on direct Semantic ID (SID) generation from interaction sequences, failing to activate deeper reasoning capabilities analogous to those in large language models and thus limiting performance potential. We identify two critical limitations in current reasoning-enhanced GR approaches: (1) Strict sequential separation between reasoning and generation steps creates imbalanced computational focus across hierarchical SID codes, degrading quality for SID codes; (2) Generated reasoning vectors lack interpretable semantics, while reasoning paths suffer from unverifiable supervision. In this paper, we propose stepwise semantic-guided reasoning in latent space (S^2GR), a novel reasoning enhanced GR framework. First, we establish a robust semantic foundation via codebook optimization, integrating item co-occurrence relationship to capture behavioral patterns, and load balancing and uniformity objectives that maximize codebook utilization while reinforcing coarse-to-fine semantic hierarchies. Our core innovation introduces the stepwise reasoning mechanism inserting thinking tokens before each SID generation step, where each token explicitly represents coarse-grained semantics supervised via contrastive learning against ground-truth codebook cluster distributions ensuring physically grounded reasoning paths and balanced computational focus across all SID codes. Extensive experiments demonstrate the superiority of S^2GR, and online A/B test confirms efficacy on large-scale industrial short video platform.

[IR-1] Feature-Indexed Federated Recommendation with Residual-Quantized Codebooks

Link: https://arxiv.org/abs/2601.18570
Authors: Mingzhe Han, Jiahao Liu, Dongsheng Li, Hansu Gu, Peng Zhang, Ning Gu, Tun Lu
Subjects: Information Retrieval (cs.IR)

Abstract:Federated recommendation provides a privacy-preserving solution for training recommender systems without centralizing user interactions. However, existing methods follow an ID-indexed communication paradigm that transmits whole item embeddings between clients and the server, which has three major limitations: 1) it consumes uncontrollable communication resources, 2) the uploaded item information cannot generalize to related non-interacted items, and 3) it is sensitive to noisy client feedback. To solve these problems, it is necessary to fundamentally change the existing ID-indexed communication paradigm. Therefore, we propose a feature-indexed communication paradigm that transmits feature code embeddings as codebooks rather than raw item embeddings. Building on this paradigm, we present RQFedRec, which assigns each item a list of discrete code IDs via Residual Quantization (RQ)-Kmeans. Each client generates and trains code embeddings as codebooks based on discrete code IDs provided by the server, and the server collects and aggregates these codebooks rather than item embeddings. This design makes communication controllable since the codebooks can cover all items, enabling updates to propagate across related items that share the same code ID. In addition, since each code embedding represents many items, it is more robust to a single noisy item. To jointly capture semantic and collaborative information, RQFedRec further adopts a collaborative-semantic dual-channel aggregation with a curriculum strategy that emphasizes semantic codes early and gradually increases the contribution of collaborative codes over training. Extensive experiments on real-world datasets demonstrate that RQFedRec consistently outperforms state-of-the-art federated recommendation baselines while significantly reducing communication overhead.
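
To make "a list of discrete code IDs" concrete, here is a minimal sketch of the assignment step in residual quantization. It assumes the per-level codebooks are already available; in RQ-Kmeans they would be fitted by running k-means on the residuals of the previous level, a step omitted here. The codebook values and the function name are hypothetical.

```python
import numpy as np

def residual_quantize(vector, codebooks):
    """Greedy residual quantization.

    At each level, pick the nearest centroid to the current residual, record its
    index, and subtract it. Returns the list of code IDs and the final residual.
    """
    residual = np.asarray(vector, dtype=float)
    codes = []
    for codebook in codebooks:
        distances = np.linalg.norm(codebook - residual, axis=1)
        index = int(np.argmin(distances))
        codes.append(index)
        residual = residual - codebook[index]
    return codes, residual

# Two hypothetical levels: a coarse codebook followed by a fine one.
codebooks = [
    np.array([[0.0, 0.0], [10.0, 10.0]]),
    np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),
]
codes, residual = residual_quantize([10.0, 11.2], codebooks)
print(codes, residual)
```

Items that share a prefix of code IDs share the corresponding code embeddings. This sharing is what allows an update for one item to propagate to related items.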

[IR-2] Token-level Collaborative Alignment for LLM-based Generative Recommendation WWW2026

Link: https://arxiv.org/abs/2601.18457
Authors: Fake Lin, Binbin Hu, Zhi Zheng, Xi Zhu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou, Tong Xu
Subjects: Information Retrieval (cs.IR)
Comments: 11 pages, 2 figures, 7 tables, WWW 2026

Abstract:Large Language Models (LLMs) have demonstrated strong potential for generative recommendation by leveraging rich semantic knowledge. However, existing LLM-based recommender systems struggle to effectively incorporate collaborative filtering (CF) signals, due to a fundamental mismatch between item-level preference modeling in CF and token-level next-token prediction (NTP) optimization in LLMs. Prior approaches typically treat CF as contextual hints or representation bias, and resort to multi-stage training to reduce behavioral semantic space discrepancies, leaving CF unable to explicitly regulate LLM generation. In this work, we propose Token-level Collaborative Alignment for Recommendation (TCA4Rec), a model-agnostic and plug-and-play framework that establishes an explicit optimization-level interface between CF supervision and LLM generation. TCA4Rec consists of (i) Collaborative Tokenizer, which projects raw item-level CF logits into token-level distributions aligned with the LLM token space, and (ii) Soft Label Alignment, which integrates these CF-informed distributions with one-hot supervision to optimize a soft NTP objective. This design preserves the generative nature of LLM training while enabling collaborative alignment with essential user preference of CF models. We highlight TCA4Rec is compatible with arbitrary traditional CF models and generalizes across a wide range of decoder-based LLM recommender architectures. Moreover, it provides an explicit mechanism to balance behavioral alignment and semantic fluency, yielding generative recommendations that are both accurate and controllable. Extensive experiments demonstrate that TCA4Rec consistently improves recommendation performance across a broad spectrum of CF models and LLM-based recommender systems.
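
The Soft Label Alignment idea can be sketched as a cross-entropy against a blended target. The abstract does not give the exact formulation, so the convex mixture below, the weight `alpha`, and the example numbers are all assumptions made for illustration. The CF-informed token distribution is taken as given.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

def soft_ntp_loss(logits, target_index, cf_distribution, alpha):
    """Cross-entropy against a mix of the one-hot label and a CF-informed distribution.

    alpha = 0 recovers the standard next-token prediction loss.
    """
    one_hot = np.zeros_like(cf_distribution)
    one_hot[target_index] = 1.0
    soft_target = (1.0 - alpha) * one_hot + alpha * cf_distribution
    return float(-np.sum(soft_target * log_softmax(logits)))

# Hypothetical 4-token vocabulary.
logits = np.array([2.0, 0.5, -1.0, 0.0])
cf_distribution = np.array([0.6, 0.3, 0.05, 0.05])  # assumed output of a collaborative tokenizer
hard_loss = soft_ntp_loss(logits, 0, cf_distribution, alpha=0.0)
soft_loss = soft_ntp_loss(logits, 0, cf_distribution, alpha=0.3)
print(hard_loss, soft_loss)
```

In this sketch a single scalar controls how strongly the collaborative signal shapes the training target, which is one simple way to trade behavioral alignment against the original one-hot supervision.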

[IR-3] TopKGAT: A Top-K Objective-Driven Architecture for Recommendation WWW2026

Link: https://arxiv.org/abs/2601.18432
Authors: Sirui Chen, Jiawei Chen, Canghong Jin, Sheng Zhou, Jingbang Chen, Wujie Sun, Can Wang
Subjects: Information Retrieval (cs.IR)
Comments: Accepted by WWW2026. Related DOI: https://doi.org/10.1145/3774904.3792717

Abstract:Recommendation systems (RS) aim to retrieve the top-K items most relevant to users, with metrics such as Precision@K and Recall@K commonly used to assess effectiveness. The architecture of an RS model acts as an inductive bias, shaping the patterns the model is inclined to learn. In recent years, numerous recommendation architectures have emerged, spanning traditional matrix factorization, deep neural networks, and graph neural networks. However, their designs are often not explicitly aligned with the top-K objective, thereby limiting their effectiveness. To address this limitation, we propose TopKGAT, a novel recommendation architecture directly derived from a differentiable approximation of top-K metrics. The forward computation of a single TopKGAT layer is intrinsically aligned with the gradient ascent dynamics of the Precision@K metric, enabling the model to naturally improve top-K recommendation accuracy. Structurally, TopKGAT resembles a graph attention network and can be implemented efficiently. Extensive experiments on four benchmark datasets demonstrate that TopKGAT consistently outperforms state-of-the-art baselines. The code is available at this https URL.
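
One common way to build a differentiable approximation of Precision@K is to replace the hard "is in the top K" indicator with a sigmoid around a score threshold. The sketch below shows that generic relaxation only. It is not taken from the paper, and the threshold choice, the temperature and the function names are assumptions.

```python
import numpy as np

def hard_precision_at_k(scores, labels, k):
    """Fraction of relevant items among the k highest-scoring ones."""
    top_k = np.argsort(-scores)[:k]
    return float(np.mean(labels[top_k]))

def smooth_precision_at_k(scores, labels, k, temperature):
    """Sigmoid relaxation of Precision@K (requires k < len(scores)).

    The threshold sits halfway between the k-th and (k+1)-th largest scores;
    as the temperature goes to zero the sigmoid approaches the hard indicator.
    """
    ordered = np.sort(scores)[::-1]
    threshold = 0.5 * (ordered[k - 1] + ordered[k])
    membership = 1.0 / (1.0 + np.exp(-(scores - threshold) / temperature))
    return float(np.sum(labels * membership) / k)

scores = np.array([3.0, 2.0, 1.0, 0.0])
labels = np.array([1.0, 0.0, 1.0, 0.0])
hard = hard_precision_at_k(scores, labels, k=2)
smooth_sharp = smooth_precision_at_k(scores, labels, k=2, temperature=0.01)
smooth_soft = smooth_precision_at_k(scores, labels, k=2, temperature=1.0)
print(hard, smooth_sharp, smooth_soft)
```

With a larger temperature, relevant items just below the cutoff still contribute to the objective, so a gradient step can push them upward.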

[IR-4] Beyond the Checkbox: Strengthening DSA Compliance Through Social Media Algorithmic Auditing

Link: https://arxiv.org/abs/2601.18405
Authors: Sara Solarova, Matúš Mesarčík, Branislav Pecher, Ivan Srba
Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Comments: 2026 CHI Conference on Human Factors in Computing Systems

Abstract:Algorithms of online platforms are required under the Digital Services Act (DSA) to comply with specific obligations concerning algorithmic transparency, user protection and privacy. To verify compliance with these requirements, DSA mandates platforms to undergo independent audits. Little is known about current auditing practices and their effectiveness in ensuring such compliance. To this end, we bridge regulatory and technical perspectives by critically examining selected audit reports across three critical algorithmic-related provisions: restrictions on profiling minors, transparency in recommender systems, and limitations on targeted advertising using sensitive data. Our analysis shows significant inconsistencies in methodologies and lack of technical depth when evaluating AI-powered systems. To enhance the depth, scale, and independence of compliance assessments, we propose to employ algorithmic auditing – a process of behavioural assessment of AI algorithms by means of simulating user behaviour, observing algorithm responses and analysing them for audited phenomena.

[IR-5] Orchestrating Specialized Agents for Trustworthy Enterprise RAG

Link: https://arxiv.org/abs/2601.18267
Authors: Xincheng You, Qi Sun, Neha Bora, Huayi Li, Shubham Goel, Kang Li, Sean Culatana
Subjects: Information Retrieval (cs.IR)

Abstract:Retrieval-Augmented Generation (RAG) shows promise for enterprise knowledge work, yet it often underperforms in high-stakes decision settings that require deep synthesis, strict traceability, and recovery from underspecified prompts. One-pass retrieval-and-write pipelines frequently yield shallow summaries, inconsistent grounding, and weak mechanisms for completeness verification. We introduce ADORE (Adaptive Deep Orchestration for Research in Enterprise), an agentic framework that replaces linear retrieval with iterative, user-steered investigation coordinated by a central orchestrator and a set of specialized agents. ADORE’s key insight is that a structured Memory Bank (a curated evidence store with explicit claim-evidence linkage and section-level admissible evidence) enables traceable report generation and systematic checks for evidence completeness. Our contributions are threefold: (1) Memory-locked synthesis - report generation is constrained to a structured Memory Bank (Claim-Evidence Graph) with section-level admissible evidence, enabling traceable claims and grounded citations; (2) Evidence-coverage-guided execution - a retrieval-reflection loop audits section-level evidence coverage to trigger targeted follow-up retrieval and terminates via an evidence-driven stopping criterion; (3) Section-packed long-context grounding - section-level packing, pruning, and citation-preserving compression make long-form synthesis feasible under context limits. Across our evaluation suite, ADORE ranks first on DeepResearch Bench (52.65) and achieves the highest head-to-head preference win rate on DeepConsult (77.2%) against commercial systems.

[IR-6] GenCI: Generative Modeling of User Interest Shift via Cohort-based Intent Learning for CTR Prediction WWW2026

Link: https://arxiv.org/abs/2601.18251
Authors: Kesha Ou, Zhen Tian, Wayne Xin Zhao, Hongyu Lu, Ji-Rong Wen
Subjects: Information Retrieval (cs.IR)
Comments: Accepted by WWW 2026 Research Track

Abstract:Click-through rate (CTR) prediction plays a pivotal role in online advertising and recommender systems. Despite notable progress in modeling user preferences from historical behaviors, two key challenges persist. First, existing discriminative paradigms focus on matching candidates to user history, often overfitting to historically dominant features and failing to adapt to rapid interest shifts. Second, a critical information chasm emerges from the point-wise ranking paradigm. By scoring each candidate in isolation, CTR models discard the rich contextual signal implied by the recalled set as a whole, leading to a misalignment where long-term preferences often override the user’s immediate, evolving intent. To address these issues, we propose GenCI, a generative user intent framework that leverages semantic interest cohorts to model dynamic user preferences for CTR prediction. The framework first employs a generative model, trained with a next-item prediction (NTP) objective, to proactively produce candidate interest cohorts. These cohorts serve as explicit, candidate-agnostic representations of a user’s immediate intent. A hierarchical candidate-aware network then injects this rich contextual signal into the ranking stage, refining them with cross-attention to align with both user history and the target item. The entire model is trained end-to-end, creating a more aligned and effective CTR prediction pipeline. Extensive experiments on three widely used datasets demonstrate the effectiveness of our approach.

[IR-7] Generative Chain of Behavior for User Trajectory Prediction

Link: https://arxiv.org/abs/2601.18213
Authors: Chengkai Huang, Xiaodi Chen, Hongtao Huang, Quan Z. Sheng, Lina Yao
Subjects: Information Retrieval (cs.IR)

Abstract:Modeling long-term user behavior trajectories is essential for understanding evolving preferences and enabling proactive recommendations. However, most sequential recommenders focus on next-item prediction, overlooking dependencies across multiple future actions. We propose Generative Chain of Behavior (GCB), a generative framework that models user interactions as an autoregressive chain of semantic behaviors over multiple future steps. GCB first encodes items into semantic IDs via RQ-VAE with k-means refinement, forming a discrete latent space that preserves semantic proximity. On top of this space, a transformer-based autoregressive generator predicts multi-step future behaviors conditioned on user history, capturing long-horizon intent transitions and generating coherent trajectories. Experiments on benchmark datasets show that GCB consistently outperforms state-of-the-art sequential recommenders in multi-step accuracy and trajectory consistency. Beyond these gains, GCB offers a unified generative formulation for capturing user preference evolution.

[IR-8] DMAP: Human-Aligned Structural Document Map for Multimodal Document Understanding

Link: https://arxiv.org/abs/2601.18203
Authors: ShunLiang Fu, Yanxin Zhang, Yixin Xiang, Xiaoyu Du, Jinhui Tang
Subjects: Information Retrieval (cs.IR)

Abstract:Existing multimodal document question-answering (QA) systems predominantly rely on flat semantic retrieval, representing documents as a set of disconnected text chunks and largely neglecting their intrinsic hierarchical and relational structures. Such flattening disrupts logical and spatial dependencies (such as section organization, figure-text correspondence, and cross-reference relations) that humans naturally exploit for comprehension. To address this limitation, we introduce a document-level structural Document MAP (DMAP), which explicitly encodes both hierarchical organization and inter-element relationships within multimodal documents. Specifically, we design a Structured-Semantic Understanding Agent to construct DMAP by organizing textual content together with figures, tables, charts, etc. into a human-aligned hierarchical schema that captures both semantic and layout dependencies. Building upon this representation, a Reflective Reasoning Agent performs structure-aware and evidence-driven reasoning, dynamically assessing the sufficiency of retrieved context and iteratively refining answers through targeted interactions with DMAP. Extensive experiments on MMDocQA benchmarks demonstrate that DMAP yields document-specific structural representations aligned with human interpretive patterns, substantially enhancing retrieval precision, reasoning consistency, and multimodal comprehension over conventional RAG-based approaches. Code is available at this https URL

[IR-9] Think When Needed: Model-Aware Reasoning Routing for LLM-based Ranking

Link: https://arxiv.org/abs/2601.18146
Authors: Huizhong Guo, Tianjun Wei, Dongxia Wang, Yingpeng Du, Ziyan Wang, Jie Zhang, Zhu Sun
Subjects: Information Retrieval (cs.IR)

Abstract:Large language models (LLMs) are increasingly applied to ranking tasks in retrieval and recommendation. Although reasoning prompting can enhance ranking utility, our preliminary exploration reveals that its benefits are inconsistent and come at a substantial computational cost, suggesting that when to reason is as crucial as how to reason. To address this issue, we propose a reasoning routing framework that employs a lightweight, plug-and-play router head to decide whether to use direct inference (Non-Think) or reasoning (Think) for each instance before generation. The router head relies solely on pre-generation signals: i) compact ranking-aware features (e.g., candidate dispersion) and ii) model-aware difficulty signals derived from a diagnostic checklist reflecting the model’s estimated need for reasoning. By leveraging these features before generation, the router outputs a controllable token that determines whether to apply the Think mode. Furthermore, the router can adaptively select its operating policy along the validation Pareto frontier during deployment, enabling dynamic allocation of computational resources toward instances most likely to benefit from Think under varying system constraints. Experiments on three public ranking datasets with different scales of open-source LLMs show consistent improvements in ranking utility with reduced token consumption (e.g., +6.3% NDCG@10 with -49.5% tokens on MovieLens with Qwen3-4B), demonstrating reasoning routing as a practical solution to the accuracy-efficiency trade-off.

[IR-10] Enhancing LLM-based Recommendation with Preference Hint Discovery from Knowledge Graph

Link: https://arxiv.org/abs/2601.18096
Authors: Yuting Zhang, Ziliang Pei, Chao Wang, Ying Sun, Fuzhen Zhuang
Subjects: Information Retrieval (cs.IR)

Abstract:LLMs have garnered substantial attention in recommendation systems. Yet they fall short of traditional recommenders when capturing complex preference patterns. Recent works have tried integrating traditional recommendation embeddings into LLMs to resolve this issue, yet a core gap persists between their continuous embedding and discrete semantic spaces. Intuitively, textual attributes derived from interactions can serve as critical preference rationales for LLMs’ recommendation logic. However, directly inputting such attribute knowledge presents two core challenges: (1) Deficiency of sparse interactions in reflecting preference hints for unseen items; (2) Substantial noise introduction from treating all attributes as hints. To this end, we propose a preference hint discovery model based on the interaction-integrated knowledge graph, enhancing LLM-based recommendation. It utilizes traditional recommendation principles to selectively extract crucial attributes as hints. Specifically, we design a collaborative preference hint extraction schema, which utilizes semantic knowledge from similar users’ explicit interactions as hints for unseen items. Furthermore, we develop an instance-wise dual-attention mechanism to quantify the preference credibility of candidate attributes, identifying hints specific to each unseen item. Using these item- and user-based hints, we adopt a flattened hint organization method to shorten input length and feed the textual hint information to the LLM for commonsense reasoning. Extensive experiments on both pair-wise and list-wise recommendation tasks verify the effectiveness of our proposed framework, indicating an average relative improvement of over 3.02% against baselines.

[IR-11] Post-Training Denoising of User Profiles with LLMs in Collaborative Filtering Recommendation ECIR2026

Link: https://arxiv.org/abs/2601.18009
Authors: Ervin Dervishaj, Maria Maistro, Tuukka Ruotsalo, Christina Lioma
Subjects: Information Retrieval (cs.IR)
Comments: Accepted at the 48th European Conference on Information Retrieval (ECIR 2026)

Abstract:Implicit feedback – the main data source for training Recommender Systems (RSs) – is inherently noisy and has been shown to negatively affect recommendation effectiveness. Denoising has been proposed as a method for removing noisy implicit feedback and improving recommendations. Prior work has focused on in-training denoising, however this requires additional data, changes to the model architecture and training procedure or fine-tuning, all of which can be costly and data hungry. In this work, we focus on post-training denoising. Different from in-training denoising, post-training denoising does not involve changing the architecture of the model nor its training procedure, and does not require additional data. Specifically, we present a method for post-training denoising user profiles using Large Language Models (LLMs) for Collaborative Filtering (CF) recommendations. Our approach prompts LLMs with (i) a user profile (user interactions), (ii) a candidate item, and (iii) its rank as given by the CF recommender, and asks the LLM to remove items from the user profile to improve the rank of the candidate item. Experiments with a state-of-the-art CF recommender and 4 open and closed source LLMs in 3 datasets show that our denoising yields improvements up to 13% in effectiveness over the original user profiles. Our code is available at this https URL.

[IR-12] Unleashing the Potential of Sparse Attention on Long-term Behaviors for CTR Prediction

Link: https://arxiv.org/abs/2601.17836
Authors: Weijiang Lai, Beihong Jin, Di Zhang, Siru Chen, Jiongyan Zhang, Yuhang Gou, Jian Dong, Xingxing Wang
Subjects: Information Retrieval (cs.IR)

Abstract:In recent years, the success of large language models (LLMs) has driven the exploration of scaling laws in recommender systems. However, models that demonstrate scaling laws are actually challenging to deploy in industrial settings for modeling long sequences of user behaviors, due to the high computational complexity of the standard self-attention mechanism. Despite various sparse self-attention mechanisms proposed in other fields, they are not fully suited for recommendation scenarios. This is because user behaviors exhibit personalization and temporal characteristics: different users have distinct behavior patterns, and these patterns change over time, with data from these users differing significantly from data in other fields in terms of distribution. To address these challenges, we propose SparseCTR, an efficient and effective model specifically designed for long-term behaviors of users. To be precise, we first segment behavior sequences into chunks in a personalized manner to avoid separating continuous behaviors and enable parallel processing of sequences. Based on these chunks, we propose a three-branch sparse self-attention mechanism to jointly identify users’ global interests, interest transitions, and short-term interests. Furthermore, we design a composite relative temporal encoding via learnable, head-specific bias coefficients, better capturing sequential and periodic relationships among user behaviors. Extensive experimental results show that SparseCTR not only improves efficiency but also outperforms state-of-the-art methods. More importantly, it exhibits an obvious scaling law phenomenon, maintaining performance improvements across three orders of magnitude in FLOPs. In online A/B testing, SparseCTR increased CTR by 1.72% and CPM by 1.41%. Our source code is available at this https URL.

[IR-13] OwlerLite: Scope- and Freshness-Aware Web Retrieval for LLM Assistants

Link: https://arxiv.org/abs/2601.17824
Authors: Saber Zerhoudi, Michael Dinzinger, Michael Granitzer, Jelena Mitrovic
Subjects: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

Abstract:Browser-based language models often use retrieval-augmented generation (RAG) but typically rely on fixed, outdated indices that give users no control over which sources are consulted. This can lead to answers that mix trusted and untrusted content or draw on stale information. We present OwlerLite, a browser-based RAG system that makes user-defined scopes and data freshness central to retrieval. Users define reusable scopes (sets of web pages or sources) and select them when querying. A freshness-aware crawler monitors live pages, uses a semantic change detector to identify meaningful updates, and selectively re-indexes changed content. OwlerLite integrates text relevance, scope choice, and recency into a unified retrieval model. Implemented as a browser extension, it represents a step toward more controllable and trustworthy web assistants.

[IR-14] Token-Weighted Multi-Target Learning for Generative Recommenders with Curriculum Learning

Link: https://arxiv.org/abs/2601.17787
Authors: Wei-Ning Chiu, Chuan-Ju Wang, Pu-Jen Cheng
Subjects: Information Retrieval (cs.IR)
Comments: 11 pages, 5 figures

Abstract:Generative recommender systems have recently attracted attention by formulating next-item prediction as an autoregressive sequence generation task. However, most existing methods optimize standard next-token likelihood and implicitly treat all tokens as equally informative, which is misaligned with semantic-ID-based generation. Accordingly, we propose two complementary information-gain-based token-weighting strategies tailored to generative recommendation with semantic IDs. Front-Greater Weighting captures conditional semantic information gain by prioritizing early tokens that most effectively reduce candidate-item uncertainty given their prefixes and encode coarse semantics. Frequency Weighting models marginal information gain under long-tailed item and token distributions, upweighting rare tokens to counteract popularity bias. Beyond individual strategies, we introduce a multi-target learning framework with curriculum learning that jointly optimizes the two token-weighted objectives alongside standard likelihood, enabling stable optimization and adaptive emphasis across training stages. Extensive experiments on benchmark datasets show that our method consistently outperforms strong baselines and existing token-weighting approaches, with improved robustness, strong generalization across different semantic-ID constructions, and substantial gains on both head and tail items. Code is available at this https URL.
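
The Frequency Weighting idea, upweighting rare tokens in the likelihood, can be sketched with weights proportional to each token's self-information. The abstract does not specify the weighting function, so the choice of `-log p`, the normalization and the example counts below are assumptions made for illustration.

```python
import numpy as np

def frequency_weights(token_counts):
    """Weights proportional to -log p(token), scaled so the expected weight is 1.

    Assumes every count is positive. Rare tokens receive larger weights.
    """
    probs = token_counts / token_counts.sum()
    information = -np.log(probs)
    return information / np.sum(probs * information)

def weighted_nll(token_ids, token_log_probs, weights):
    """Mean negative log-likelihood with a per-token weight looked up by token ID."""
    return float(-np.mean(weights[token_ids] * token_log_probs))

# Hypothetical long-tailed vocabulary of three semantic-ID tokens.
token_counts = np.array([900.0, 90.0, 10.0])
weights = frequency_weights(token_counts)

# Two target tokens (IDs 0 and 2) with assumed model log-probabilities.
loss = weighted_nll(np.array([0, 2]), np.array([-0.1, -2.0]), weights)
print(weights, loss)
```

Scaling the weights to an expected value of 1 keeps the weighted loss on roughly the same scale as the unweighted one, which may make it easier to combine with the standard likelihood term.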

[IR-15] Why They Link: An Intent Taxonomy for Including Hyperlinks in Social Posts

Link: https://arxiv.org/abs/2601.17601
Authors: Fangping Lan, Abdullah Aljebreen, Eduard C. Dragut
Subjects: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Comments: 10 pages including references, 5 figures

Abstract:URLs serve as bridges between social media platforms and the broader web, linking user-generated content to external information resources. On Twitter (X), approximately one in five tweets contains at least one URL, underscoring their central role in information dissemination. While prior studies have examined the motivations of authors who share URLs, such author-centered intentions are difficult to observe in practice. To enable broader downstream use, this work investigates reader-centered interpretations, i.e., how users perceive the intentions behind hyperlinks included in posts. We develop an intent taxonomy for including hyperlinks in social posts through a hybrid approach that begins with a bottom-up, data-driven process using large-scale crowdsourced annotations, and is then refined using large language model assistance to generate descriptive category names and precise definitions. The final taxonomy comprises 6 top-level categories and 26 fine-grained intention classes, capturing diverse communicative purposes. Applying this taxonomy, we annotate and analyze 1000 user posts, revealing that advertising, arguing, and sharing are the most prevalent intentions. This resulting taxonomy provides a foundation for intent-aware information retrieval and NLP applications, enabling more accurate retrieval, recommendation, and understanding of social media content.

[IR-16] Pipeline Inspection, Visualization, and Interoperability in PyTerrier ECIR2026

链接: https://arxiv.org/abs/2601.17502
作者: Emmanouil Georgios Lionis,Craig Macdonald,Sean MacAvaney
类目: Information Retrieval (cs.IR)
*备注: This preprint has not undergone peer review (when applicable) or any post-submission improvements or corrections. The Version of Record of this contribution is published in ECIR2026 (Part IV) Advances in Information Retrieval

Abstract:PyTerrier provides a declarative framework for building and experimenting with Information Retrieval (IR) pipelines. In this demonstration, we highlight several recent pipeline operations that improve the ability of pipelines to be programmatically inspected, visualized, and integrated with other tools (via the Model Context Protocol, MCP). These capabilities aim to make it easier for researchers, students, and AI agents to understand and use a wide array of IR pipelines.
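
To make "programmatic inspection" of a declarative pipeline concrete, here is a toy sketch in plain Python. It is not PyTerrier's API: the `Stage` and `Pipeline` classes and the `stages()` method are hypothetical names invented for illustration. The only detail borrowed from PyTerrier is the use of the `>>` operator to compose steps.

```python
class Stage:
    """Hypothetical pipeline step: a named function over a list of result rows."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def stages(self):
        # A single stage inspects as a one-element list.
        return [self]

    def __rshift__(self, other):
        # `a >> b` builds a flat pipeline that runs a, then b.
        return Pipeline(self.stages() + other.stages())

    def __call__(self, rows):
        return self.fn(rows)


class Pipeline(Stage):
    """Hypothetical composed pipeline that keeps its parts available for inspection."""

    def __init__(self, parts):
        self.parts = list(parts)
        self.name = " >> ".join(p.name for p in self.parts)

    def stages(self):
        return self.parts

    def __call__(self, rows):
        for part in self.parts:
            rows = part(rows)
        return rows


# Two illustrative stages: rescale scores, then keep the two best rows.
double = Stage("double_score", lambda rows: [{**r, "score": r["score"] * 2} for r in rows])
top2 = Stage("top2", lambda rows: sorted(rows, key=lambda r: -r["score"])[:2])

pipe = double >> top2
stage_names = [s.name for s in pipe.stages()]  # inspect structure without running it
results = pipe([
    {"docno": "d1", "score": 1.0},
    {"docno": "d2", "score": 3.0},
    {"docno": "d3", "score": 2.0},
])
```

Because the composed object keeps its parts as data, a visualizer or an external tool can walk `pipe.stages()` without executing any retrieval, which is the kind of capability the demonstration describes.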

[IR-17] Towards Fair Large Language Model-based Recommender Systems without Costly Retraining WWW2026

Link: https://arxiv.org/abs/2601.17492
Authors: Jin Li,Huilin Gu,Shoujin Wang,Qi Zhang,Shui Yu,Chen Wang,Xiwei Xu,Fang Chen
Categories: Information Retrieval (cs.IR)
*Comments: Accepted by WWW 2026

Abstract:Large Language Models (LLMs) have revolutionized Recommender Systems (RS) through advanced generative user modeling. However, LLM-based RS (LLM-RS) often inadvertently perpetuates bias present in the training data, leading to severe fairness issues. Addressing these fairness problems in LLM-RS faces two significant challenges. 1) Existing debiasing methods, designed for specific bias types, lack the generality to handle diverse or emerging biases in real-world applications. 2) Debiasing methods relying on retraining are computationally infeasible given the massive parameter scale of LLMs. To overcome these challenges, we propose FUDLR (Fast Unified Debiasing for LLM-RS). The core idea is to reformulate the debiasing problem as an efficient machine unlearning task with two stages. First, FUDLR identifies bias-inducing samples to unlearn through a novel bias-agnostic mask, optimized to balance fairness improvement with accuracy preservation. Its bias-agnostic design allows adaptability to various or co-existing biases simply by incorporating different fairness metrics. Second, FUDLR performs efficient debiasing by estimating and removing the influence of identified samples on model parameters. Extensive experiments demonstrate that FUDLR effectively and efficiently improves fairness while preserving recommendation accuracy, offering a practical path toward socially responsible LLM-RS. The code and data are available at this https URL.

[IR-18] Adversarial Alignment and Disentanglement for Cross-Domain CTR Prediction with Domain-Encompassing Features ICDM2025

Link: https://arxiv.org/abs/2601.17472
Authors: Junyou He,Lixi Deng,Huichao Guo,Ye Tang,Yong Li,Sulong Xu
Categories: Information Retrieval (cs.IR)
*Comments: Accepted to ICDM 2025

Abstract:Cross-domain recommendation (CDR) has been increasingly explored to address data sparsity and cold-start issues. Recent approaches typically disentangle domain-invariant features shared between source and target domains from the domain-specific features of each domain. However, they often rely solely on domain-invariant features combined with target domain-specific features, which can lead to suboptimal performance. To overcome these limitations, this paper presents the Adversarial Alignment and Disentanglement Cross-Domain Recommendation (A^2DCDR) model, an approach designed to capture a comprehensive range of cross-domain information, including both domain-invariant and valuable non-aligned features. The A^2DCDR model enhances cross-domain recommendation through three key components: refining maximum mean discrepancy (MMD) with adversarial training for better generalization, employing a feature disentangler and reconstruction mechanism for intra-domain disentanglement, and introducing a novel fused representation combining domain-invariant, non-aligned features with original contextual data. Experiments on real-world datasets and online A/B testing show that A^2DCDR outperforms existing methods, confirming its effectiveness and practical applicability. The code is provided at this https URL.
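
For readers unfamiliar with MMD, the sketch below computes the standard biased estimate of squared MMD between two small feature sets using an RBF kernel. It illustrates only the plain alignment measure that the abstract says is refined; the adversarial refinement and the rest of the A^2DCDR model are not shown. The kernel choice, the `gamma` value, and the toy feature vectors are assumptions made for illustration.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """RBF kernel exp(-gamma * ||a - b||^2) between two equal-length vectors."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""

    def mean_kernel(left, right):
        total = sum(rbf_kernel(a, b, gamma) for a in left for b in right)
        return total / (len(left) * len(right))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


# Toy "source domain" features and two candidate "target domain" feature sets.
source = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
target_near = [[0.1, 0.0], [1.0, 0.1], [0.0, 0.9]]   # almost aligned with source
target_far = [[3.0, 3.0], [4.0, 3.0], [3.0, 4.0]]    # clearly shifted

mmd_near = mmd_squared(source, target_near)
mmd_far = mmd_squared(source, target_far)
```

A smaller value means the two feature distributions are closer, so minimizing this quantity during training pushes source and target representations toward a shared, domain-invariant space.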

[IR-19] Breaking Flat: A Generalised Query Performance Prediction Evaluation Framework

Link: https://arxiv.org/abs/2601.17359
Authors: Payel Santra,Partha Basuchowdhuri,Debasis Ganguly
Categories: Information Retrieval (cs.IR)
*Comments:

Abstract:The traditional use case of query performance prediction (QPP) is to identify which queries perform well and which perform poorly for a given ranking model. A more fine-grained and arguably more challenging extension of this task is to determine which ranking models are most effective for a given query. In this work, we generalize the QPP task and its evaluation into three settings: (i) SingleRanker MultiQuery (SRMQ-PP), corresponding to the standard use case; (ii) MultiRanker SingleQuery (MRSQ-PP), which evaluates a QPP model’s ability to select the most effective ranker for a query; and (iii) MultiRanker MultiQuery (MRMQ-PP), which considers predictions jointly across all query-ranker pairs. Our results show that (a) the relative effectiveness of QPP models varies substantially across tasks (SRMQ-PP vs. MRSQ-PP), and (b) predicting the best ranker for a query is considerably more difficult than predicting the relative difficulty of queries for a given ranker.
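
The three settings can be pictured as three ways of slicing a query-by-ranker matrix of scores. The sketch below is one possible reading, not the paper's protocol: the abstract does not name the evaluation measure, so Kendall's tau is an assumption, as are the function names, the flattening used for the joint setting, and the toy numbers.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a between two equal-length score lists (tied pairs count as 0)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


def srmq_pp(true, pred, ranker):
    """Single ranker, many queries: compare one column across all queries."""
    return kendall_tau([row[ranker] for row in true], [row[ranker] for row in pred])


def mrsq_pp(true, pred, query):
    """Many rankers, single query: compare one row across all rankers."""
    return kendall_tau(true[query], pred[query])


def mrmq_pp(true, pred):
    """Many rankers, many queries: compare all query-ranker cells jointly."""
    flat_true = [value for row in true for value in row]
    flat_pred = [value for row in pred for value in row]
    return kendall_tau(flat_true, flat_pred)


# Rows are queries, columns are rankers (hypothetical effectiveness values).
true_quality = [[0.2, 0.5, 0.4], [0.6, 0.7, 0.9], [0.1, 0.3, 0.25]]
qpp_estimate = [[0.3, 0.4, 0.5], [0.7, 0.8, 0.6], [0.2, 0.1, 0.35]]

tau_srmq = srmq_pp(true_quality, qpp_estimate, ranker=0)
tau_mrsq = mrsq_pp(true_quality, qpp_estimate, query=0)
tau_mrmq = mrmq_pp(true_quality, qpp_estimate)
```

In this toy data the predictor orders the queries perfectly for the first ranker but misorders two of the three rankers for the first query, mirroring the abstract's observation that choosing a ranker per query is the harder task.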

[IR-20] Beyond Correlations: A Downstream Evaluation Framework for Query Performance Prediction

Link: https://arxiv.org/abs/2601.17339
Authors: Payel Santra,Partha Basuchowdhuri,Debasis Ganguly
Categories: Information Retrieval (cs.IR)
*Comments:

Abstract:The standard practice of query performance prediction (QPP) evaluation is to measure a set-level correlation between the estimated retrieval qualities and the true ones. However, this correlation-based evaluation measure neither quantifies QPP effectiveness at the level of individual queries nor connects to a downstream application, meaning that QPP methods yielding high correlation values may not find a practical application in query-specific decisions in an IR pipeline. In this paper, we propose a downstream-focussed evaluation framework where a distribution of QPP estimates across a list of top documents retrieved with several rankers is used as priors for IR fusion. On the one hand, a distribution of these estimates closely matching that of the true retrieval qualities indicates the quality of the predictor; on the other hand, their usage as priors indicates a predictor’s ability to make informed choices in an IR pipeline. Our experiments first establish the importance of QPP estimates in weighted IR fusion, yielding substantial improvements of over 4.5% over unweighted CombSUM and RRF fusion strategies, and second, reveal that the downstream effectiveness of QPP does not correlate well with the standard correlation-based QPP evaluation.
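
A minimal sketch of how per-ranker QPP estimates could act as priors in fusion follows. CombSUM and reciprocal rank fusion (RRF, with the customary constant k = 60) are standard; multiplying each ranker's contribution by a weight is a generic way to inject a prior and is an assumption here, since the abstract does not give the paper's exact formulation. The runs and weights are made-up values.

```python
from collections import defaultdict


def weighted_combsum(runs, weights):
    """Sum each document's scores across runs, scaling every run by its weight."""
    fused = defaultdict(float)
    for run, weight in zip(runs, weights):
        for doc, score in run.items():
            fused[doc] += weight * score
    # Sort by fused score (descending), breaking ties by document id.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


def weighted_rrf(runs, weights, k=60):
    """Reciprocal rank fusion where each run contributes weight / (k + rank)."""
    fused = defaultdict(float)
    for run, weight in zip(runs, weights):
        ranking = sorted(run, key=lambda doc: (-run[doc], doc))
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] += weight / (k + rank)
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


# Two hypothetical rankers scoring the same three documents for one query.
runs = [
    {"d1": 0.9, "d2": 0.4, "d3": 0.1},
    {"d3": 0.9, "d2": 0.5, "d1": 0.2},
]
uniform = [1.0, 1.0]      # unweighted baseline
qpp_priors = [0.2, 0.8]   # assumed QPP estimates favouring the second ranker

baseline_order = weighted_combsum(runs, uniform)
combsum_order = weighted_combsum(runs, qpp_priors)
rrf_order = weighted_rrf(runs, qpp_priors)
```

With uniform weights the first ranker's favourite document stays on top; once the priors favour the second ranker, both fusion methods promote `d3`, which shows how the quality of the QPP estimates directly shapes the fused ranking.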