This post contains the latest paper listing retrieved from Arxiv.org on 2026-01-23, updated automatically and organized into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, leave your email address in the comments.

Note: Paper data is fetched from Arxiv.org each day and updated automatically at around 12:00.

Overview (2026-01-23)

A total of 445 new papers were posted today, including:

  • Natural Language Processing: 77 papers (Computation and Language, cs.CL)
  • Artificial Intelligence: 156 papers (Artificial Intelligence, cs.AI)
  • Computer Vision: 84 papers (Computer Vision and Pattern Recognition, cs.CV)
  • Machine Learning: 105 papers (Machine Learning, cs.LG)

Natural Language Processing

[NLP-0] LLM-in-Sandbox Elicits General Agentic Intelligence

[Quick Read]: This paper targets the limited generalization of large language models (LLMs) on non-code tasks, in particular their weaknesses in handling long contexts, integrating multi-disciplinary knowledge, and following complex instructions. The key to the solution is the LLM-in-Sandbox framework, which lets LLMs explore and act autonomously inside a code sandbox (a virtual computer environment), eliciting general agentic behavior in non-code domains. Without any additional training, LLMs spontaneously access external resources to acquire new knowledge, use the file system to manage long contexts, and execute scripts to satisfy formatting requirements; reinforcement learning on non-agentic data (LLM-in-Sandbox-RL) further strengthens sandbox exploration, substantially improving generalization across mathematics, physics, chemistry, biomedicine, and other domains.

Link: https://arxiv.org/abs/2601.16206
Authors: Daixuan Cheng, Shaohan Huang, Yuxian Gu, Huatong Song, Guoxin Chen, Li Dong, Wayne Xin Zhao, Ji-Rong Wen, Furu Wei
Affiliations: GSAI, Renmin University of China; Microsoft Research; Tsinghua University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL

Abstract:We introduce LLM-in-Sandbox, enabling LLMs to explore within a code sandbox (i.e., a virtual computer), to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs, without additional training, exhibit generalization capabilities to leverage the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, leverage the file system to handle long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be enhanced through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which uses only non-agentic data to train models for sandbox exploration. Experiments demonstrate that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization spanning mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze LLM-in-Sandbox’s efficiency from computational and system perspectives, and open-source it as a Python package to facilitate real-world deployment.
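
The core interaction loop described above, where the model writes code, the sandbox executes it, and the output is fed back, can be sketched with a plain subprocess standing in for the virtual computer. A minimal illustration; the `generate` callable and the `FINAL(...)` completion convention are assumptions of this sketch, not details from the paper.

```python
import subprocess

def run_in_sandbox(code: str, timeout: float = 10.0) -> str:
    """Execute model-written Python in a subprocess and return its output."""
    try:
        proc = subprocess.run(["python", "-c", code],
                              capture_output=True, text=True, timeout=timeout)
        return proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        return "[sandbox] execution timed out"

def solve(task: str, generate, max_turns: int = 8) -> str:
    """Alternate between LLM code generation and sandbox feedback."""
    history = [f"Task: {task}. Write Python; print FINAL(<answer>) when done."]
    for _ in range(max_turns):
        code = generate("\n".join(history))   # hypothetical LLM call
        output = run_in_sandbox(code)
        history += [f"Model code:\n{code}", f"Sandbox output:\n{output}"]
        if "FINAL(" in output:                # model signals completion
            return output.split("FINAL(", 1)[1].rstrip(")\n")
    return "no answer within budget"
```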

[NLP-1] Automatic Classification of Arabic Literature into Historical Eras

[Quick Read]: This paper addresses the automatic classification of Arabic texts by historical era, particularly for non-poetry texts. Because Arabic has undergone notable evolution over time, including vocabulary turnover and shifts in usage, the historical periodizations established by scholars have rarely been applied to automated text classification. The key to the solution is building classifiers with neural networks and deep learning, trained and evaluated on multi-period texts from two public corpora (OpenITI and APCD), covering setups from binary to 15-class classification and exploring both predefined historical eras and custom periodizations. Experiments show the approach is effective at various granularities, performing best on the binary task (F1-score of 0.83).

Link: https://arxiv.org/abs/2601.16138
Authors: Zainab Alhathloul, Irfan Ahmad
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 27 pages

Abstract:The Arabic language has undergone notable transformations over time, including the emergence of new vocabulary, the obsolescence of others, and shifts in word usage. This evolution is evident in the distinction between the classical and modern Arabic eras. Although historians and linguists have partitioned Arabic literature into multiple eras, relatively little research has explored the automatic classification of Arabic texts by time period, particularly beyond the domain of poetry. This paper addresses this gap by employing neural networks and deep learning techniques to automatically classify Arabic texts into distinct eras and periods. The proposed models are evaluated using two datasets derived from two publicly available corpora, covering texts from the pre-Islamic to the modern era. The study examines class setups ranging from binary to 15-class classification and considers both predefined historical eras and custom periodizations. Results range from F1-scores of 0.83 and 0.79 on the binary-era classification task using the OpenITI and APCD datasets, respectively, to 0.20 on the 15-era classification task using OpenITI and 0.18 on the 12-era classification task using APCD.

[NLP-2] LLM Prompt Evaluation for Educational Applications

[Quick Read]: This paper addresses the lack of systematic methods for evaluating large language model (LLM) prompts in education, especially for generating personalized, pedagogically aligned outputs. The core challenge is selecting, from many prompt templates, the design that performs best in teaching practice, moving prompt engineering from ad hoc experience toward evidence-based development. The key to the solution is a generalizable, systematic tournament-style evaluation framework that combines the Glicko2 rating system with multi-dimensional judging criteria (format, dialogue support, and appropriateness for learners). Using data from 120 authentic user interactions, six prompt templates were compared quantitatively; a prompt combining persona and context manager patterns, designed for strategic reading, significantly outperformed the others, reflecting its effectiveness in supporting metacognitive learning strategies such as self-directed learning.

Link: https://arxiv.org/abs/2601.16134
Authors: Langdon Holmes, Adam Coscia, Scott Crossley, Joon Suh Choi, Wesley Morris
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:As large language models (LLMs) become increasingly common in educational applications, there is a growing need for evidence-based methods to design and evaluate LLM prompts that produce personalized and pedagogically aligned outputs. This study presents a generalizable, systematic approach for evaluating prompts, demonstrated through an analysis of LLM-generated follow-up questions in a structured dialogue activity. Six prompt templates were designed and tested. The templates incorporated established prompt engineering patterns, with each prompt emphasizing distinct pedagogical strategies. The prompt templates were compared through a tournament-style evaluation framework that can be adapted for other educational applications. The tournament employed the Glicko2 rating system with eight judges evaluating question pairs across three dimensions: format, dialogue support, and appropriateness for learners. Data was sourced from 120 authentic user interactions across three distinct educational deployments. Results showed that a single prompt related to strategic reading outperformed other templates with win probabilities ranging from 81% to 100% in pairwise comparisons. This prompt combined persona and context manager patterns and was designed to support metacognitive learning strategies such as self-directed learning. The methodology showcases how educational technology researchers can systematically evaluate and improve prompt designs, moving beyond ad-hoc prompt engineering toward evidence-based prompt development for educational applications.
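
To make the tournament mechanics concrete, the sketch below scores pairwise judgments with a plain Elo-style update, a simplified stand-in for Glicko2 (which additionally tracks rating deviation and volatility). The template names and the K factor are illustrative assumptions, not the paper's configuration.

```python
from collections import defaultdict

K = 32  # update step; Glicko2 also models rating deviation and volatility

def expected(r_a: float, r_b: float) -> float:
    """Expected win probability of A over B under a logistic rating model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, winner: str, loser: str) -> None:
    """Shift both ratings toward the observed pairwise outcome."""
    e = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e)
    ratings[loser] -= K * (1 - e)

# Each judgment is (winning template, losing template) on one dimension.
judgments = [("strategic_reading", "baseline"), ("strategic_reading", "socratic")]
ratings = defaultdict(lambda: 1500.0)
for w, l in judgments:
    update(ratings, w, l)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```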

[NLP-3] Improving Training Efficiency and Reducing Maintenance Costs via Language Specific Model Merging

[Quick Read]: This paper addresses the computational inefficiency and high maintenance cost of multilingual large language models (LLMs): adding support for a new language or updating the data for an existing one normally requires retraining the whole model. The key to the solution is a model merging strategy, replacing full retraining by merging already-trained single-language (or language-subset) models. Experiments show this reduces initial training time by up to 50% and cuts training costs by more than 60% when updating a single language, while maintaining quality on par with full retraining.

Link: https://arxiv.org/abs/2601.16127
Authors: Alphaeus Dmonte, Vidhi Gupta, Daniel J Perry, Mark Arehart
Affiliations: Qualtrics; George Mason University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Fine-tuning a task-specific multilingual large language model (LLM) involves training the model on a multilingual dataset with examples in all the required languages. Updating one or more supported languages with additional data or adding support for a new language involves retraining the model, which can be computationally inefficient and creates a severe maintenance bottleneck. Recent research on merging multilingual multitask models has shown promise in terms of improved quality, but its computational and maintenance efficiency remains unstudied. In this work, we provide the first focused analysis of this merging strategy from an efficiency perspective, evaluating it across three independent tasks. We demonstrate significant efficiency gains while maintaining parity in terms of quality: this merging approach reduces the initial training time by up to 50%. We also demonstrate that updating an individual language and re-merging as part of model maintenance reduces training costs by more than 60%, compared to re-training the full multilingual model. We show this on both public and proprietary industry datasets confirming that the approach works well for industrial use cases in addition to academic settings already studied in previous work.

[NLP-4] Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing

[Quick Read]: This paper addresses the limitations of existing benchmarks for Composed Image Retrieval (CIR), a task central to multimodal understanding: current benchmarks cover only a narrow range of query categories and fail to reflect the diversity of real-world needs. The key to the solution is using image editing for precise control over modification types and content, enabling a scalable query-synthesis pipeline that yields EDIR, a high-quality fine-grained CIR benchmark spanning five main categories and fifteen subcategories. Beyond making evaluation more comprehensive and rigorous, the benchmark exposes inconsistent performance of existing models across subcategories, highlighting inherent limitations of current multimodal embedding models.

Link: https://arxiv.org/abs/2601.16125
Authors: Tingyu Song, Yanzhao Zhang, Mingxin Li, Zhuoning Guo, Dingkun Long, Pengjun Xie, Siyue Zhang, Yilun Zhao, Shu Wu
Affiliations: CASIA; Tongyi Lab, Alibaba Group; UCAS; HKUST (GZ); NTU; Yale
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Under review

Abstract:Composed Image Retrieval (CIR) is a pivotal and complex task in multimodal understanding. Current CIR benchmarks typically feature limited query categories and fail to capture the diverse requirements of real-world scenarios. To bridge this evaluation gap, we leverage image editing to achieve precise control over modification types and content, enabling a pipeline for synthesizing queries across a broad spectrum of categories. Using this pipeline, we construct EDIR, a novel fine-grained CIR benchmark. EDIR encompasses 5,000 high-quality queries structured across five main categories and fifteen subcategories. Our comprehensive evaluation of 13 multimodal embedding models reveals a significant capability gap; even state-of-the-art models (e.g., RzenEmbed and GME) struggle to perform consistently across all subcategories, highlighting the rigorous nature of our benchmark. Through comparative analysis, we further uncover inherent limitations in existing benchmarks, such as modality biases and insufficient categorical coverage. Furthermore, an in-domain training experiment demonstrates the feasibility of our benchmark. This experiment clarifies the task challenges by distinguishing between categories that are solvable with targeted data and those that expose intrinsic limitations of current model architectures.

[NLP-5] SynthOCR-Gen: A Synthetic OCR Dataset Generator for Low-Resource Languages - Breaking the Data Barrier

[Quick Read]: This paper addresses a major obstacle for optical character recognition (OCR) in low-resource languages such as Kashmiri: the lack of large-scale annotated training datasets, which leaves such languages unsupported by mainstream OCR systems (Tesseract, TrOCR, PaddleOCR). The key to the solution is SynthOCR-Gen, an open-source synthetic OCR dataset generator that breaks this data bottleneck by automatically turning digital Unicode text corpora into ready-to-use training datasets. Its core capabilities include multi-granularity text segmentation (character, word, n-gram, sentence, and line), Unicode normalization with script-purity enforcement, multi-font rendering with configurable distributions, and 25+ augmentation techniques that simulate real-world document degradations (rotation, blur, noise, scanner artifacts). The authors demonstrate the tool by generating and publicly releasing a 600,000-sample word-level Kashmiri OCR dataset, offering a reusable pathway for bringing low-resource languages into vision-language AI models.

Link: https://arxiv.org/abs/2601.16113
Authors: Haq Nawaz Malik, Kh Mohmad Shafi, Tanveer Ahmad Reshi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Optical Character Recognition (OCR) for low-resource languages remains a significant challenge due to the scarcity of large-scale annotated training datasets. Languages such as Kashmiri, with approximately 7 million speakers and a complex Perso-Arabic script featuring unique diacritical marks, currently lack support in major OCR systems including Tesseract, TrOCR, and PaddleOCR. Manual dataset creation for such languages is prohibitively expensive, time-consuming, and error-prone, often requiring word by word transcription of printed or handwritten text. We present SynthOCR-Gen, an open-source synthetic OCR dataset generator specifically designed for low-resource languages. Our tool addresses the fundamental bottleneck in OCR development by transforming digital Unicode text corpora into ready-to-use training datasets. The system implements a comprehensive pipeline encompassing text segmentation (character, word, n-gram, sentence, and line levels), Unicode normalization with script purity enforcement, multi-font rendering with configurable distribution, and 25+ data augmentation techniques simulating real-world document degradations including rotation, blur, noise, and scanner artifacts. We demonstrate the efficacy of our approach by generating a 600,000-sample word-segmented Kashmiri OCR dataset, which we release publicly on HuggingFace. This work provides a practical pathway for bringing low-resource languages into the era of vision-language AI models, and the tool is openly available for researchers and practitioners working with underserved writing systems worldwide.
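
The render-then-degrade pipeline the abstract describes can be illustrated with Pillow. A minimal sketch covering three of the paper's 25+ augmentations; the font path and probabilities are placeholder assumptions, not the tool's actual configuration.

```python
import random
from PIL import Image, ImageDraw, ImageFilter, ImageFont

def render_word(word: str, font_path: str, size: int = 48) -> Image.Image:
    """Render one Unicode word onto a white grayscale canvas."""
    font = ImageFont.truetype(font_path, size)
    l, t, r, b = font.getbbox(word)
    img = Image.new("L", (r - l + 20, b - t + 20), color=255)
    ImageDraw.Draw(img).text((10 - l, 10 - t), word, font=font, fill=0)
    return img

def degrade(img: Image.Image) -> Image.Image:
    """Apply a random subset of document-style degradations."""
    if random.random() < 0.5:  # slight rotation
        img = img.rotate(random.uniform(-3, 3), fillcolor=255, expand=True)
    if random.random() < 0.5:  # blur
        img = img.filter(ImageFilter.GaussianBlur(random.uniform(0.3, 1.2)))
    if random.random() < 0.5:  # salt-and-pepper scanner noise
        px = img.load()
        for _ in range(img.width * img.height // 50):
            px[random.randrange(img.width), random.randrange(img.height)] = \
                random.choice((0, 255))
    return img

# sample = degrade(render_word("کٲشُر", "NotoNastaliqUrdu-Regular.ttf"))
```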

[NLP-6] Adapter Fusion for Multilingual Text2Cypher with Linear and Learned Gating

[Quick Read]: This paper addresses the high cost and inefficiency of extending multilingual Text2Cypher systems to new languages, where the conventional approach of joint multilingual fine-tuning over all languages is computationally expensive and hard to scale incrementally. The key to the solution is training language-specific LoRA (Low-Rank Adaptation) adapters and combining them through a learned fusion MLP with dynamic gating. With only a small subset of the target-language data, this recovers roughly 75% of the accuracy gains of joint fine-tuning, clearly outperforms linear merging across all three languages, and allows new languages to be added incrementally in a lightweight way.

Link: https://arxiv.org/abs/2601.16097
Authors: Makbule Gulcin Ozsoy
Affiliations: Neo4j
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models enable users to access databases through natural language interfaces, using tools like Text2SQL, Text2SPARQL, and Text2Cypher that translate user questions into structured database queries. While these systems improve database accessibility, most research focuses on English with limited multilingual support. This work investigates a scalable multilingual Text2Cypher, aiming to support new languages without re-running full fine-tuning, avoiding manual hyper-parameter tuning, and maintaining performance close to joint multilingual fine-tuning. We train language-specific LoRA adapters for English, Spanish, and Turkish and combine them via uniform linear merging or a learned fusion MLP with dynamic gating. Experimental results show that the fusion MLP recovers around 75% of the accuracy gains from joint multilingual fine-tuning while requiring only a smaller subset of the data, outperforming linear merging across all three languages. This approach enables incremental language expansion to new languages by requiring only one LoRA adapter and a lightweight MLP retraining. Learned adapter fusion offers a practical alternative to expensive joint fine-tuning, balancing performance, data efficiency, and scalability for the multilingual Text2Cypher task.
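
A minimal sketch of the learned-fusion idea: a small gating MLP predicts per-token mixture weights over the outputs of the language-specific adapters. Shapes and module layout here are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AdapterFusion(nn.Module):
    """Combine per-language LoRA adapter deltas with a learned gate."""

    def __init__(self, hidden: int, n_adapters: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(hidden, hidden // 4),
            nn.ReLU(),
            nn.Linear(hidden // 4, n_adapters),
        )

    def forward(self, h: torch.Tensor, adapter_deltas: list) -> torch.Tensor:
        # h: (batch, seq, hidden); each adapter delta has the same shape.
        weights = torch.softmax(self.gate(h), dim=-1)        # (b, s, n)
        stacked = torch.stack(adapter_deltas, dim=-1)        # (b, s, hidden, n)
        mixed = (stacked * weights.unsqueeze(-2)).sum(-1)    # (b, s, hidden)
        return h + mixed

# Uniform linear merging is the no-parameter baseline: weights = 1 / n_adapters.
```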

[NLP-7] Controlling Long-Horizon Behavior in Language Model Agents with Explicit State Dynamics

Link: https://arxiv.org/abs/2601.16087
Authors: Sukesh Subaharan
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Supplementary materials can be found here: this https URL

[NLP-8] Universal Refusal Circuits Across LLMs: Cross-Model Transfer via Trajectory Replay and Concept-Basis Reconstruction

[Quick Read]: This paper investigates whether refusal behavior in large language models (LLMs) is universal, i.e., whether it stems from a low-dimensional semantic circuit shared across models rather than being determined solely by a particular architecture or training process. To test this hypothesis, the authors propose Trajectory Replay via Concept-Basis Reconstruction, a framework whose key innovation is aligning target-model layers with a donor model via "concept fingerprints" and reconstructing refusal directions from a shared set of "concept atoms", so that interventions transfer from donor to target without any target-side refusal supervision. To avoid capability damage, a weight-SVD stability guard projects interventions away from high-variance weight subspaces, attenuating refusal while preserving performance. Experiments across 8 model pairs spanning different architectures (e.g., dense to Mixture-of-Experts) and training regimes validate the method's effectiveness and generality.

Link: https://arxiv.org/abs/2601.16034
Authors: Tony Cristofano
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Refusal behavior in aligned LLMs is often viewed as model-specific, yet we hypothesize it stems from a universal, low-dimensional semantic circuit shared across models. To test this, we introduce Trajectory Replay via Concept-Basis Reconstruction, a framework that transfers refusal interventions from donor to target models, spanning diverse architectures (e.g., Dense to MoE) and training regimes, without using target-side refusal supervision. By aligning layers via concept fingerprints and reconstructing refusal directions using a shared "recipe" of concept atoms, we map the donor's ablation trajectory into the target's semantic space. To preserve capabilities, we introduce a weight-SVD stability guard that projects interventions away from high-variance weight subspaces to prevent collateral damage. Our evaluation across 8 model pairs (including GPT-OSS-20B and GLM-4) confirms that these transferred recipes consistently attenuate refusal while maintaining performance, providing strong evidence for the semantic universality of safety alignment.
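
A minimal sketch of the directional-ablation primitive that such transfer recipes build on: removing the component of hidden states along a refusal direction. The mean-difference estimator in the comment is the standard construction from prior refusal-direction work; the paper's fingerprint alignment and SVD guard are not reproduced here.

```python
import torch

def ablate_direction(h: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of hidden states along a refusal direction.

    h: (..., hidden); direction: (hidden,). The direction is normalized so the
    projection h @ d gives the scalar component to subtract.
    """
    d = direction / direction.norm()
    return h - (h @ d).unsqueeze(-1) * d

# A refusal direction is typically estimated as the difference of mean
# activations over harmful vs. harmless prompts at a chosen layer:
# direction = acts_harmful.mean(0) - acts_harmless.mean(0)
```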

[NLP-9] Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain

[Quick Read]: This paper addresses how to efficiently build specialized language models for the Turkish legal domain, given the heavy compute and complex training pipelines of existing approaches. The key contributions are: (1) an encoder model pre-trained from scratch on a Turkish-dominant corpus of 112.7 billion tokens in a single stage, with a checkpoint-selection strategy that optimizes downstream retrieval performance and reveals that the best checkpoints occur before the pre-training loss reaches its minimum; at only 155M parameters it matches larger models (307M-567M) on retrieval while achieving 92.36% production efficiency, a cost-effective alternative to SOTA models that rely on multi-stage, computationally intensive pipelines; (2) a continual pre-training (CPT) scheme that adapts Qwen-family decoder models via controlled curriculum learning in four phases, gradually shifting from general language knowledge to legal terminology and long-context reasoning, ultimately achieving a 36.2% perplexity reduction on Turkish legal text and demonstrating effective domain adaptation.

Link: https://arxiv.org/abs/2601.16018
Authors: Özgür Uğur, Mahmut Göksu, Mahmut Çimen, Musa Yılmaz, Esra Şavirdi, Alp Talha Demir, Rumeysa Güllüce, İclal Çetin, Ömer Can Sağbaş
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 16 png, 1 tex, 1 bib

Abstract:This paper presents Mecellem models, a framework for developing specialized language models for the Turkish legal domain through domain adaptation strategies. We make two contributions: (1) Encoder Model Pre-trained from Scratch: ModernBERT-based bidirectional encoders pre-trained on a Turkish-dominant corpus of 112.7 billion tokens. We implement a checkpoint selection strategy that evaluates downstream retrieval performance throughout training, revealing that optimal checkpoints achieve best retrieval scores before pre-training loss reaches its minimum. Our encoder models achieve top-3 rankings on the Turkish retrieval leaderboard, with smaller models (155M parameters) achieving comparable performance to larger reference models (307M-567M parameters). Our approach achieves 92.36% production efficiency compared to state-of-the-art models (embeddinggemma-300m: 100.00%, BAAI/bge-m3: 99.54%, newmindai/bge-m3-stsb: 94.38%), ranking fourth overall despite requiring less computational resources. SOTA models rely on multi-stage, computationally intensive training pipelines, making our single-stage pre-training followed by efficient post-training approach a cost-effective alternative; (2) Decoder Model with Continual Pre-training (CPT): Qwen3-1.7B and Qwen3-4B models adapted to the Turkish legal domain through controlled curriculum learning. Four-phase CPT with optimal sample ratios enables gradual transition from general language knowledge to specialized legal terminology and long-context reasoning. This approach achieves 36.2% perplexity reduction on Turkish legal text, demonstrating domain adaptation gains.

[NLP-10] Transfer Learning from ImageNet for MEG-Based Decoding of Imagined Speech

[Quick Read]: This paper addresses the difficulty of decoding imagined speech from non-invasive brain signals, where the neural signals are weak and spatially distributed and labeled data is scarce. The key to the solution is converting magnetoencephalography (MEG) signals into time-frequency image representations and exploiting pretrained vision models for feature extraction: a learnable sensor-space convolution projects MEG data into three spatial scalogram mixtures, yielding compact image-like inputs for ImageNet-pretrained vision architectures. This substantially improves classification accuracy (up to 90.4% balanced accuracy) and shows that pretrained models capture neural representations shared across subjects, effectively modeling the structure of imagined speech.

Link: https://arxiv.org/abs/2601.15909
Authors: Soufiane Jhilal, Stéphanie Martin, Anne-Lise Giraud
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IEEE ISBI 2026

Abstract:Non-invasive decoding of imagined speech remains challenging due to weak, distributed signals and limited labeled data. Our paper introduces an image-based approach that transforms magnetoencephalography (MEG) signals into time-frequency representations compatible with pretrained vision models. MEG data from 21 participants performing imagined speech tasks were projected into three spatial scalogram mixtures via a learnable sensor-space convolution, producing compact image-like inputs for ImageNet-pretrained vision architectures. These models outperformed classical and non-pretrained models, achieving up to 90.4% balanced accuracy for imagery vs. silence, 81.0% vs. silent reading, and 60.6% for vowel decoding. Cross-subject evaluation confirmed that pretrained models capture shared neural representations, and temporal analyses localized discriminative information to imagery-locked intervals. These findings show that pretrained vision models applied to image-based MEG representations can effectively capture the structure of imagined speech in non-invasive neural signals.
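
A sketch of the front end under stated assumptions: a learnable 1x1 convolution mixes the MEG sensors into three channels, each then mapped to a time-frequency image. An STFT spectrogram is used here as a stand-in for the paper's wavelet scalograms, and the sensor count is illustrative.

```python
import torch
import torch.nn as nn

class SensorMixer(nn.Module):
    """Project multi-sensor MEG into 3 learned mixtures, each rendered as a
    time-frequency image suitable for an ImageNet-pretrained backbone."""

    def __init__(self, n_sensors: int, n_fft: int = 128):
        super().__init__()
        self.mix = nn.Conv1d(n_sensors, 3, kernel_size=1, bias=False)
        self.n_fft = n_fft

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sensors, time) -> (batch, 3, freq, frames)
        mixed = self.mix(x)                                  # (b, 3, time)
        b, c, t = mixed.shape
        spec = torch.stft(mixed.reshape(b * c, t), n_fft=self.n_fft,
                          window=torch.hann_window(self.n_fft, device=x.device),
                          return_complex=True).abs()
        return spec.reshape(b, c, *spec.shape[-2:])

# feats = SensorMixer(n_sensors=306)(torch.randn(4, 306, 1000))
```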

[NLP-11] Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model

[Quick Read]: This paper addresses the fact that diffusion-based code generation models (DLLMs) still lag behind autoregressive (AR) models under comparable compute and data budgets. The core solution is Stable-DiffCoder, a block diffusion code model that reuses the Seed-Coder architecture, data, and training pipeline, together with an enhanced block diffusion continual pretraining (CPT) stage that combines a tailored warmup with a block-wise clipped noise schedule for efficient knowledge learning and stable training. Under the same architecture and data, Stable-DiffCoder outperforms its AR counterpart across a broad suite of code benchmarks, and with only CPT and supervised fine-tuning it matches or exceeds a wide range of ~8B AR models and DLLMs, showing that diffusion-based training can improve code modeling quality. It further benefits structured code editing and reasoning tasks and yields data-augmentation gains for low-resource programming languages.

Link: https://arxiv.org/abs/2601.15892
Authors: Chenghao Fan, Wen Heng, Bo Li, Sichen Liu, Yuxuan Song, Jing Su, Xiaoye Qu, Kai Shen, Wei Wei
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Diffusion-based language models (DLLMs) offer non-sequential, block-wise generation and richer data reuse compared to autoregressive (AR) models, but existing code DLLMs still lag behind strong AR baselines under comparable budgets. We revisit this setting in a controlled study and introduce Stable-DiffCoder, a block diffusion code model that reuses the Seed-Coder architecture, data, and training pipeline. To enable efficient knowledge learning and stable training, we incorporate a block diffusion continual pretraining (CPT) stage enhanced by a tailored warmup and block-wise clipped noise schedule. Under the same data and architecture, Stable-DiffCoder overall outperforms its AR counterpart on a broad suite of code benchmarks. Moreover, relying only on the CPT and supervised fine-tuning stages, Stable-DiffCoder achieves stronger performance than a wide range of ~8B ARs and DLLMs, demonstrating that diffusion-based training can improve code modeling quality beyond AR training alone. Moreover, diffusion-based any-order modeling improves structured code modeling for editing and reasoning, and through data augmentation, benefits low-resource coding languages.

[NLP-12] Evaluating and Achieving Controllable Code Completion in Code LLM

[Quick Read]: This paper addresses the lag between code completion evaluation and model capabilities: existing benchmarks mostly test only the functional correctness of completions, overlooking LLMs' ability to follow user instructions while completing code, a common scenario in real programming. To fill this gap, the authors present the first instruction-guided code completion benchmark, the Controllable Code Completion Benchmark (C3-Bench), with 2,195 carefully designed tasks for systematically evaluating code generation under complex instruction constraints. A key component of the solution is a simple data synthesis pipeline that uses Qwen2.5-Coder to generate high-quality instruction-completion pairs for supervised fine-tuning (SFT); the resulting Qwen2.5-Coder-C3 model achieves state-of-the-art performance on C3-Bench.

Link: https://arxiv.org/abs/2601.15879
Authors: Jiajun Zhang, Zeyu Cui, Lei Zhang, Jian Yang, Jiaxi Yang, Qiang Liu, Zilei Wang, Binyuan Hui, Liang Wang, Junyang Lin
Affiliations: University of Science and Technology of China; Alibaba Group; Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments:

Abstract:Code completion has become a central task, gaining significant attention with the rise of large language model (LLM)-based tools in software engineering. Although recent advances have greatly improved LLMs' code completion abilities, evaluation methods have not advanced equally. Most current benchmarks focus solely on functional correctness of code completions based on given context, overlooking models' ability to follow user instructions during completion, a common scenario in LLM-assisted programming. To address this limitation, we present the first instruction-guided code completion benchmark, Controllable Code Completion Benchmark (C3-Bench), comprising 2,195 carefully designed completion tasks. Through comprehensive evaluation of over 40 mainstream LLMs across C3-Bench and conventional benchmarks, we reveal substantial gaps in instruction-following capabilities between open-source and advanced proprietary models during code completion tasks. Moreover, we develop a straightforward data synthesis pipeline that leverages Qwen2.5-Coder to generate high-quality instruction-completion pairs for supervised fine-tuning (SFT). The resulting model, Qwen2.5-Coder-C3, achieves state-of-the-art performance on C3-Bench. Our findings provide valuable insights for enhancing LLMs' code completion and instruction-following capabilities, establishing new directions for future research in code LLMs. To facilitate reproducibility and foster further research in code LLMs, we open-source all code, datasets, and models.

[NLP-13] Artificial Rigidities vs. Biological Noise: A Comparative Analysis of Multisensory Integration in AV-HuBERT and Human Observers

[Quick Read]: This paper asks whether current self-supervised audio-visual models such as AV-HuBERT faithfully reproduce human auditory dominance and its intrinsic uncertainty in multisensory perception. Comparing the model with human participants (N=44) on incongruent audiovisual stimuli (the McGurk effect), the study finds that AV-HuBERT closely matches human auditory dominance rates (32.0% vs. 31.8%), suggesting it captures biological thresholds for auditory resistance; however, the model shows a much stronger tendency toward phonetic fusion (68.0% vs. 47.7%) and behaves deterministically, whereas humans display perceptual stochasticity and diverse error patterns. The key finding is that current self-supervised architectures can mimic multisensory outcomes yet lack the neural variability underlying human speech perception, pointing to uncertainty modeling as a direction for next-generation models.

Link: https://arxiv.org/abs/2601.15869
Authors: Francisco Portillo López
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages, 6 figures

Abstract:This study evaluates AV-HuBERT’s perceptual bio-fidelity by benchmarking its response to incongruent audiovisual stimuli (McGurk effect) against human observers (N=44). Results reveal a striking quantitative isomorphism: AI and humans exhibited nearly identical auditory dominance rates (32.0% vs. 31.8%), suggesting the model captures biological thresholds for auditory resistance. However, AV-HuBERT showed a deterministic bias toward phonetic fusion (68.0%), significantly exceeding human rates (47.7%). While humans displayed perceptual stochasticity and diverse error profiles, the model remained strictly categorical. Findings suggest that current self-supervised architectures mimic multisensory outcomes but lack the neural variability inherent to human speech perception.

[NLP-14] Determinants of Training Corpus Size for Clinical Text Classification

[Quick Read]: This paper addresses the lack of principled guidance on training-set size for clinical text classification, i.e., determining the minimum number of annotated documents needed to approach optimal performance. The key to the solution is a systematic analysis of learning curves across training sets of 100 to 10,000 documents, combined with modeling of vocabulary properties. The study finds that the numbers of strong predictor words and noisy predictor words strongly affect convergence speed and final performance: around 600 training documents suffice to reach 95% of maximum performance on all tasks, every 100 additional strong predictors raise maximum accuracy by about 0.04, and every 100 additional noisy words lower accuracy by about 0.02, providing a quantitative basis for sample-size planning in clinical NLP.

Link: https://arxiv.org/abs/2601.15846
Authors: Jaya Chaturvedi, Saniya Deshpande, Chenkai Ma, Robert Cobb, Angus Roberts, Robert Stewart, Daniel Stahl, Diana Shamsutdinova
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Introduction: Clinical text classification using natural language processing (NLP) models requires adequate training data to achieve optimal performance. For that, 200-500 documents are typically annotated. The number is constrained by time and costs and lacks justification of the sample size requirements and their relationship to text vocabulary properties. Methods: Using the publicly available MIMIC-III dataset containing hospital discharge notes with ICD-9 diagnoses as labels, we employed pre-trained BERT embeddings followed by Random Forest classifiers to identify 10 randomly selected diagnoses, varying training corpus sizes from 100 to 10,000 documents, and analyzed vocabulary properties by identifying strong and noisy predictive words through Lasso logistic regression on bag-of-words embeddings. Results: Learning curves varied significantly across the 10 classification tasks despite identical preprocessing and algorithms, with 600 documents sufficient to achieve 95% of the performance attainable with 10,000 documents for all tasks. Vocabulary analysis revealed that more strong predictors and fewer noisy predictors were associated with steeper learning curves, where every 100 additional noisy words decreased accuracy by approximately 0.02 while 100 additional strong predictors increased maximum accuracy by approximately 0.04.
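
The vocabulary analysis can be reproduced in outline with scikit-learn: fit an L1-penalized (Lasso-style) logistic model on bag-of-words counts and split the surviving vocabulary by coefficient magnitude. The threshold and vectorizer settings below are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def count_predictor_words(texts, labels, strong_thresh: float = 0.5):
    """Split the words an L1 model keeps into strong vs. weak/noisy predictors
    by absolute coefficient size."""
    X = CountVectorizer(min_df=5).fit_transform(texts)
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
    clf.fit(X, labels)
    coefs = np.abs(clf.coef_.ravel())
    nonzero = coefs[coefs > 0]
    strong = int((nonzero >= strong_thresh).sum())
    noisy = int((nonzero < strong_thresh).sum())
    return strong, noisy
```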

[NLP-15] Can professional translators identify machine-generated text?

[Quick Read]: This paper asks whether professional translators, without specialized training, can reliably identify AI-generated short stories in Italian. In an in-person experiment, 69 translators assessed three anonymized short stories (two written by ChatGPT-4o, one by a human author), rating the likelihood of AI authorship and justifying their choices. The key finding concerns which textual features serve as indicators of synthetic text: low burstiness and narrative contradiction proved the most reliable signals, whereas conventional features such as grammatical accuracy and emotional tone frequently led to misclassification, suggesting that translators' judgments can be swayed by subjective preferences. These results highlight both the difficulty of identifying AI-generated text in professional editing contexts and the biases involved.

Link: https://arxiv.org/abs/2601.15828
Authors: Michael Farrell
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages

Abstract:This study investigates whether professional translators can reliably identify short stories generated in Italian by artificial intelligence (AI) without prior specialized training. Sixty-nine translators took part in an in-person experiment, where they assessed three anonymized short stories - two written by ChatGPT-4o and one by a human author. For each story, participants rated the likelihood of AI authorship and provided justifications for their choices. While average results were inconclusive, a statistically significant subset (16.2%) successfully distinguished the synthetic texts from the human text, suggesting that their judgements were informed by analytical skill rather than chance. However, a nearly equal number misclassified the texts in the opposite direction, often relying on subjective impressions rather than objective markers, possibly reflecting a reader preference for AI-generated texts. Low burstiness and narrative contradiction emerged as the most reliable indicators of synthetic authorship, with unexpected calques, semantic loans and syntactic transfer from English also reported. In contrast, features such as grammatical accuracy and emotional tone frequently led to misclassification. These findings raise questions about the role and scope of synthetic-text editing in professional contexts.

[NLP-16] ExDR: Explanation-driven Dynamic Retrieval Enhancement for Multimodal Fake News Detection

[Quick Read]: This paper addresses the failure of existing methods to detect multimodal fake news, whose content evolves dynamically and depends on timely factual details; in particular, dynamic retrieval-augmented generation suffers in practice from redundant retrieval, coarse-grained similarity matching, and the introduction of irrelevant evidence. The key to the solution is the ExDR framework, which uses model-generated explanations to systematically improve both the retrieval-triggering and evidence-retrieval modules: triggering confidence is assessed along three complementary dimensions, entity-aware indices are built by fusing deceptive entities, and contrastive evidence is retrieved based on deception-specific features to challenge the initial claim, improving the accuracy and robustness of the final detection.

Link: https://arxiv.org/abs/2601.15820
Authors: Guoxuan Ding, Yuqing Li, Ziyan Zhou, Zheng Lin, Daren Zha, Jiangnan Li
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; WeChat AI, Tencent
Subjects: Computation and Language (cs.CL)
Comments: 11 pages, 3 figures, 7 tables

Abstract:The rapid spread of multimodal fake news poses a serious societal threat, as its evolving nature and reliance on timely factual details challenge existing detection methods. Dynamic Retrieval-Augmented Generation provides a promising solution by triggering keyword-based retrieval and incorporating external knowledge, thus enabling both efficient and accurate evidence selection. However, it still faces challenges in addressing issues such as redundant retrieval, coarse similarity, and irrelevant evidence when applied to deceptive content. In this paper, we propose ExDR, an Explanation-driven Dynamic Retrieval-Augmented Generation framework for Multimodal Fake News Detection. Our framework systematically leverages model-generated explanations in both the retrieval triggering and evidence retrieval modules. It assesses triggering confidence from three complementary dimensions, constructs entity-aware indices by fusing deceptive entities, and retrieves contrastive evidence based on deception-specific features to challenge the initial claim and enhance the final prediction. Experiments on two benchmark datasets, AMG and MR2, demonstrate that ExDR consistently outperforms previous methods in retrieval triggering accuracy, retrieval quality, and overall detection performance, highlighting its effectiveness and generalization capability.

[NLP-17] ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models

[Quick Read]: This paper tackles a key limitation of current large language model (LLM) evaluation: benchmarks reveal where models fail but not why. A wrong answer on a reasoning dataset may stem from formatting issues, calculation errors, or dataset noise rather than weak reasoning, which limits how well benchmarks can guide model improvement. The key to the solution is ErrorMap, which extracts a model's "failure signature" to disentangle error sources, clarifying what each benchmark actually measures and broadening error identification to reduce blind spots. The method is model- and dataset-agnostic; applying it to 35 datasets and 83 models, the authors build ErrorAtlas, a taxonomy of model errors that surfaces previously underexplored failure patterns such as omission of required details in outputs and question misinterpretation. By shifting focus from where models succeed to why they fail, the approach enables deeper LLM evaluation and supports model debugging, benchmark alignment, and informed model selection.

Link: https://arxiv.org/abs/2601.15812
Authors: Shir Ashury-Tahan, Yifan Mai, Elron Bandel, Michal Shmueli-Scheuer, Leshem Choshen
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLM) benchmarks tell us when models fail, but not why they fail. A wrong answer on a reasoning dataset may stem from formatting issues, calculation errors, or dataset noise rather than weak reasoning. Without disentangling such causes, benchmarks remain incomplete and cannot reliably guide model improvement. We introduce ErrorMap, the first method to chart the sources of LLM failure. It extracts a model’s unique “failure signature”, clarifies what benchmarks measure, and broadens error identification to reduce blind spots. This helps developers debug models, aligns benchmark goals with outcomes, and supports informed model selection. ErrorMap works on any model or dataset with the same logic. Applying our method to 35 datasets and 83 models we generate ErrorAtlas, a taxonomy of model errors, revealing recurring failure patterns. ErrorAtlas highlights error types that are currently underexplored in LLM research, such as omissions of required details in the output and question misinterpretation. By shifting focus from where models succeed to why they fail, ErrorMap and ErrorAtlas enable advanced evaluation - one that exposes hidden weaknesses and directs progress. Unlike success, typically measured by task-level metrics, our approach introduces a deeper evaluation layer that can be applied globally across models and tasks, offering richer insights into model behavior and limitations. We make the taxonomy and code publicly available with plans to periodically update ErrorAtlas as new benchmarks and models emerge.

[NLP-18] SteerEval: Inference-time Interventions Strengthen Multilingual Generalization in Neural Summarization Metrics ACL2026

[Quick Read]: This paper addresses the shortage of accurate and robust evaluation metrics for multilingual natural language generation (NLG) tasks, a problem especially acute beyond English. Prior work suggests multilingual language models often use English as an internal semantic pivot, and misalignment with that pivot degrades downstream performance. Motivated by this, the authors hypothesize that multilingual neural evaluation metrics may likewise underperform due to misalignment with the English pivot. The key to the solution is intervening at test time to steer the metric's activations toward the English semantic space, improving correlation with human judgments; experiments show such interventions are effective across languages for both encoder- and decoder-based metrics.

Link: https://arxiv.org/abs/2601.15809
Authors: Silvia Casola, Ryan Soh-Eun Shim, Felicia Körner, Yuchen Mao, Barbara Plank
Affiliations: MaiNLP, Center for Information and Language Processing, LMU Munich, Germany; Language Science and Technology, Saarland University, Germany
Subjects: Computation and Language (cs.CL)
Comments: Submitted to ACL 2026

Abstract:An increasing body of work has leveraged multilingual language models for Natural Language Generation tasks such as summarization. A major empirical bottleneck in this area is the shortage of accurate and robust evaluation metrics for many languages, which hinders progress. Recent studies suggest that multilingual language models often use English as an internal pivot language, and that misalignment with this pivot can lead to degraded downstream performance. Motivated by the hypothesis that this mismatch could also apply to multilingual neural metrics, we ask whether steering their activations toward an English pivot can improve correlation with human judgments. We experiment with encoder- and decoder-based metrics and find that test-time intervention methods are effective across the board, increasing metric effectiveness for diverse languages.
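
Test-time activation steering of this kind is typically implemented as a forward hook that adds a scaled direction vector to a chosen layer's hidden states. A minimal PyTorch sketch; the layer choice, scale `alpha`, and the way the pivot direction is estimated are assumptions here, not the paper's recipe.

```python
import torch

def add_steering_hook(layer: torch.nn.Module, direction: torch.Tensor,
                      alpha: float = 4.0):
    """Register a hook that nudges a layer's hidden states along a steering
    direction at inference time. `direction` might be estimated as the mean
    difference between English and target-language activations (not shown)."""
    d = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * d.to(hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return layer.register_forward_hook(hook)

# handle = add_steering_hook(model.model.layers[12], english_pivot_direction)
# ... score summaries with the steered metric ...
# handle.remove()
```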

[NLP-19] HumanLLM: Towards Personalized Understanding and Simulation of Human Nature KDD2026

[Quick Read]: This paper addresses the limitations of large language models (LLMs) in simulating human behavior: because pretraining data lacks the continuous, situated context of an individual's decisions, thoughts, and behaviors, existing models struggle to model individual cognition and behavior precisely. The key to the solution is HumanLLM, a foundation model built on the Cognitive Genome Dataset, a large-scale corpus curated from real user data on Reddit, Twitter, Blogger, and Amazon. A multi-stage filtering and synthesis pipeline extracts over 5.5 million user logs to distill individual behavior patterns, thinking styles, and preferences; supervised fine-tuning then equips the model to predict a wide range of individualized human behaviors, thoughts, and experiences, markedly improving the authenticity and generalization of personalized simulation.

Link: https://arxiv.org/abs/2601.15793
Authors: Yuxuan Lei, Tianfu Wang, Jianxun Lian, Zhengyu Hu, Defu Lian, Xing Xie
Affiliations: University of Science and Technology of China; The Hong Kong University of Science and Technology (Guangzhou); Microsoft Research Asia
Subjects: Computation and Language (cs.CL)
Comments: 12 pages, 5 figures, 7 tables, to be published in KDD 2026

Abstract:Motivated by the remarkable progress of large language models (LLMs) in objective tasks like mathematics and coding, there is growing interest in their potential to simulate human behavior–a capability with profound implications for transforming social science research and customer-centric business insights. However, LLMs often lack a nuanced understanding of human cognition and behavior, limiting their effectiveness in social simulation and personalized applications. We posit that this limitation stems from a fundamental misalignment: standard LLM pretraining on vast, uncontextualized web data does not capture the continuous, situated context of an individual’s decisions, thoughts, and behaviors over time. To bridge this gap, we introduce HumanLLM, a foundation model designed for personalized understanding and simulation of individuals. We first construct the Cognitive Genome Dataset, a large-scale corpus curated from real-world user data on platforms like Reddit, Twitter, Blogger, and Amazon. Through a rigorous, multi-stage pipeline involving data filtering, synthesis, and quality control, we automatically extract over 5.5 million user logs to distill rich profiles, behaviors, and thinking patterns. We then formulate diverse learning tasks and perform supervised fine-tuning to empower the model to predict a wide range of individualized human behaviors, thoughts, and experiences. Comprehensive evaluations demonstrate that HumanLLM achieves superior performance in predicting user actions and inner thoughts, more accurately mimics user writing styles and preferences, and generates more authentic user profiles compared to base models. Furthermore, HumanLLM shows significant gains on out-of-domain social intelligence benchmarks, indicating enhanced generalization.

[NLP-20] Agentic Confidence Calibration

[Quick Read]: This paper addresses the reliability problem caused by AI agents' overconfidence in high-stakes settings, where existing calibration methods cannot handle challenges specific to agentic systems, such as error accumulation along trajectories, uncertainty from external tools, and opaque failure modes. The key to the solution is formulating the first Agentic Confidence Calibration problem and proposing the Holistic Trajectory Calibration (HTC) framework, which extracts rich process-level features, from macro dynamics to micro stability, across the agent's entire execution trajectory and feeds them to a simple, interpretable model, achieving superior calibration and discrimination across multiple benchmarks, LLMs, and agent frameworks. HTC additionally offers interpretability, cross-domain transfer, and generalization: its General Agent Calibrator (GAC) attains the lowest expected calibration error (ECE) on the out-of-domain GAIA benchmark, establishing a new process-centric calibration paradigm.

Link: https://arxiv.org/abs/2601.15778
Authors: Jiaxin Zhang, Caiming Xiong, Chien-Sheng Wu
Affiliations: Salesforce AI Research
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 37 pages, 15 figures, 12 tables

Abstract:AI agents are rapidly advancing from passive language models to autonomous systems executing complex, multi-step tasks. Yet their overconfidence in failure remains a fundamental barrier to deployment in high-stakes settings. Existing calibration methods, built for static single-turn outputs, cannot address the unique challenges of agentic systems, such as compounding errors along trajectories, uncertainty from external tools, and opaque failure modes. To address these challenges, we introduce, for the first time, the problem of Agentic Confidence Calibration and propose Holistic Trajectory Calibration (HTC), a novel diagnostic framework that extracts rich process-level features ranging from macro dynamics to micro stability across an agent’s entire trajectory. Powered by a simple, interpretable model, HTC consistently surpasses strong baselines in both calibration and discrimination, across eight benchmarks, multiple LLMs, and diverse agent frameworks. Beyond performance, HTC delivers three essential advances: it provides interpretability by revealing the signals behind failure, enables transferability by applying across domains without retraining, and achieves generalization through a General Agent Calibrator (GAC) that achieves the best calibration (lowest ECE) on the out-of-domain GAIA benchmark. Together, these contributions establish a new process-centric paradigm for confidence calibration, providing a framework for diagnosing and enhancing the reliability of AI agents.
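
Trajectory-level calibration is scored with metrics such as the expected calibration error (ECE) the abstract mentions; a standard binned implementation is sketched below. The trajectory features named in the closing comment are illustrative, not HTC's actual feature set.

```python
import numpy as np

def expected_calibration_error(conf: np.ndarray, correct: np.ndarray,
                               n_bins: int = 10) -> float:
    """Binned ECE: mean |accuracy - confidence| per bin, weighted by bin mass."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

# Process-level features (e.g., tool-error rate, step-confidence drift) would
# be fed to a simple interpretable model such as logistic regression, and its
# predicted success probability scored with the ECE above.
```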

[NLP-21] Beyond Marginal Distributions: A Framework to Evaluate the Representativeness of Demographic-Aligned LLM s

[Quick Read]: This paper addresses a gap in current evaluations of value alignment in large language models: they focus only on marginal response distributions, ignoring representativeness at the level of multivariate correlation structure. The key to the solution is a new evaluation framework that compares the multivariate correlation patterns of model outputs against real human data (e.g., from the World Values Survey) to measure whether a model reflects a target population's values at a deeper structural level, revealing structural failures that evaluations based solely on marginal distributions can mask.

Link: https://arxiv.org/abs/2601.15755
Authors: Tristan Williams, Franziska Weeber, Sebastian Padó, Alan Akbik
Affiliations: Humboldt University of Berlin; University of Stuttgart
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models are increasingly used to represent human opinions, values, or beliefs, and their steerability towards these ideals is an active area of research. Existing work focuses predominantly on aligning marginal response distributions, treating each survey item independently. While essential, this may overlook deeper latent structures that characterise real populations and underpin cultural values theories. We propose a framework for evaluating the representativeness of aligned models through multivariate correlation patterns in addition to marginal distributions. We show the value of our evaluation scheme by comparing two model steering techniques (persona prompting and demographic fine-tuning) and evaluating them against human responses from the World Values Survey. While the demographically fine-tuned model better approximates marginal response distributions than persona prompting, both techniques fail to fully capture the gold standard correlation patterns. We conclude that representativeness is a distinct aspect of value alignment and an evaluation focused on marginals can mask structural failures, leading to overly optimistic conclusions about model capabilities.
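
The proposed structural check can be made concrete in a few lines of NumPy: compare the item-item correlation matrices of model and human survey responses rather than only their marginals. The fidelity statistic below (correlation of upper-triangle entries) is one reasonable choice, not necessarily the paper's exact measure.

```python
import numpy as np

def correlation_fidelity(model_resp: np.ndarray, human_resp: np.ndarray) -> float:
    """Rows are respondents, columns are survey items. Returns the Pearson
    correlation between the two item-item correlation matrices' upper
    triangles (1.0 = identical latent structure)."""
    cm = np.corrcoef(model_resp, rowvar=False)
    ch = np.corrcoef(human_resp, rowvar=False)
    iu = np.triu_indices_from(cm, k=1)
    return float(np.corrcoef(cm[iu], ch[iu])[0, 1])

# A marginal-only evaluation would compare only column means or histograms;
# this statistic instead probes the correlation structure the paper argues
# such evaluations can mask.
```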

[NLP-22] Hallucination Mitigating for Medical Report Generation

[Quick Read]: This paper addresses the tendency of large vision language models (LVLMs) to produce plausible but incorrect "hallucinations" in medical report generation (MRG), which is especially critical in accuracy-sensitive medical settings. The key to the solution is the Knowledge-Enhanced with Fine-Grained Reinforced Rewards Medical Report Generation (KERM) framework: MedCLIP first retrieves lesion-related fact sentences from a curated knowledge corpus to enrich the input; a purification module then ensures the retrieved knowledge is relevant to the patient's clinical context; finally, fine-grained rewards guide the model toward clinically supported descriptions, aligning outputs with desired behaviors.

Link: https://arxiv.org/abs/2601.15745
Authors: Ruoqing Zhao, Runze Xia, Piji Li
Affiliations: Nanjing University of Aeronautics and Astronautics; MIIT Key Laboratory of Pattern Analysis and Machine Intelligence; The Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:In the realm of medical report generation (MRG), the integration of natural language processing has emerged as a vital tool to alleviate the workload of radiologists. Despite the impressive capabilities demonstrated by large vision language models (LVLMs) in understanding natural language, their susceptibility to generating plausible yet inaccurate claims, known as "hallucinations", raises concerns, especially in the nuanced and critical field of medicine. In this work, we introduce a framework, Knowledge-Enhanced with Fine-Grained Reinforced Rewards Medical Report Generation (KERM), to tackle the issue. Our approach refines the input to the LVLM by first utilizing MedCLIP for knowledge retrieval, incorporating relevant lesion fact sentences from a curated knowledge corpus. We then introduce a novel purification module to ensure the retrieved knowledge is contextually relevant to the patient's clinical context. Subsequently, we employ fine-grained rewards to guide these models in generating highly supportive and clinically relevant descriptions, ensuring the alignment of the model's outputs with desired behaviors. Experimental results on IU-Xray and MIMIC-CXR datasets validate the effectiveness of our approach in mitigating hallucinations and enhancing report quality.

[NLP-23] PhysProver: Advancing Automatic Theorem Proving for Physics

[Quick Read]: This paper addresses the long-standing absence of systematic approaches to formal physics reasoning: despite major advances of LLMs in mathematical theorem proving, their application to physics remains in its infancy. The key to the solution is PhysProver, the first framework for formal theorem proving in the physics domain, whose core components are: (1) a dedicated dataset, PhysLeanData, combining theorems sampled from PhysLean with data synthesized by a conjecture-based formal data generation pipeline; and (2) training on top of DeepSeek-Prover-V2-7B with Reinforcement Learning with Verifiable Rewards (RLVR). Experiments show that with only about 5K training samples the method achieves an average 2.4% improvement across sub-domains and a 1.3% cross-domain gain on MiniF2F-Test, indicating that formal physics training also strengthens formal mathematical capability and offering a paradigm for extending formal provers beyond mathematics.

Link: https://arxiv.org/abs/2601.15737
Authors: Hanning Zhang, Ruida Wang, Rui Pan, Wenyuan Wang, Bingxu Meng, Tong Zhang
Affiliations: University of Illinois Urbana-Champaign; Rutgers University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint

Abstract:The combination of verifiable languages and LLMs has significantly influenced both the mathematical and computer science communities because it provides a rigorous foundation for theorem proving. Recent advancements in the field provide foundation models and sophisticated agentic systems pushing the boundaries of formal mathematical reasoning to approach the natural language capability of LLMs. However, little attention has been given to formal physics reasoning, which also heavily relies on similar problem-solving and theorem-proving frameworks. To solve this problem, this paper presents, to the best of our knowledge, the first approach to enhance formal theorem proving in the physics domain. We compose a dedicated dataset PhysLeanData for the task. It is composed of theorems sampled from PhysLean and data generated by a conjecture-based formal data generation pipeline. In the training pipeline, we leverage DeepSeek-Prover-V2-7B, a strong open-source mathematical theorem prover, and apply Reinforcement Learning with Verifiable Rewards (RLVR) to train our model PhysProver. Comprehensive experiments demonstrate that, using only approximately 5K training samples, PhysProver achieves an overall 2.4% improvement in multiple sub-domains. Furthermore, after formal physics training, we observe 1.3% gains on the MiniF2F-Test benchmark, which indicates non-trivial generalization beyond physics domains and enhancement of formal math capability as well. The results highlight the effectiveness and efficiency of our approach, which provides a paradigm for extending formal provers outside mathematical domains. To foster further research, we will release both our dataset and model to the community.

[NLP-24] Towards Automated Kernel Generation in the Era of LLMs

[Quick Read]: This survey addresses the performance constraints that modern AI systems inherit from their underlying compute kernels: achieving near-optimal kernels demands expert-level understanding of hardware architectures and programming models, making kernel engineering time-consuming and hard to scale. The key to the solution is leveraging large language models (LLMs) and LLM-based agents: LLMs can compress expert kernel knowledge that is difficult to formalize, while agentic systems cast kernel development as an iterative, feedback-driven optimization loop, enabling scalable automated generation and optimization. The paper provides a structured overview of existing approaches, datasets, and benchmarks, and outlines key open challenges and future directions for next-generation automated kernel optimization.

Link: https://arxiv.org/abs/2601.15727
Authors: Yang Yu, Peiyu Zang, Chi Hsu Tsai, Haiming Wu, Yixin Shen, Jialing Zhang, Haoyu Wang, Zhiyou Xiao, Jingze Shi, Yuyu Luo, Wentao Zhang, Chunlei Men, Guang Liu, Yonghua Lin
Affiliations: Beijing Academy of Artificial Intelligence; Beijing Normal University; Peking University; Beijing Institute of Technology; Cornell University; Beijing Jiaotong University; Renmin University of China; Hong Kong University of Science and Technology (Guangzhou)
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 10 pages, 1 figure

Abstract:The performance of modern AI systems is fundamentally constrained by the quality of their underlying kernels, which translate high-level algorithmic semantics into low-level hardware operations. Achieving near-optimal kernels requires expert-level understanding of hardware architectures and programming models, making kernel engineering a critical but notoriously time-consuming and non-scalable process. Recent advances in large language models (LLMs) and LLM-based agents have opened new possibilities for automating kernel generation and optimization. LLMs are well-suited to compress expert-level kernel knowledge that is difficult to formalize, while agentic systems further enable scalable optimization by casting kernel development as an iterative, feedback-driven loop. Rapid progress has been made in this area. However, the field remains fragmented, lacking a systematic perspective for LLM-driven kernel generation. This survey addresses this gap by providing a structured overview of existing approaches, spanning LLM-based approaches and agentic optimization workflows, and systematically compiling the datasets and benchmarks that underpin learning and evaluation in this domain. Moreover, key open challenges and future research directions are further outlined, aiming to establish a comprehensive reference for the next generation of automated kernel optimization. To keep track of this field, we maintain an open-source GitHub repository at this https URL.

[NLP-25] Dancing in Chains: Strategic Persuasion in Academic Rebuttal via Theory of Mind

[Quick Read]: This paper tackles academic rebuttal, a complex and underexplored problem: rebuttal is essentially strategic communication under severe information asymmetry rather than a simple technical debate, and existing approaches imitate surface-level language while lacking the perspective-taking needed for persuasion. The key to the solution is grounding rebuttal in Theory of Mind (ToM) with RebuttalAgent, the first ToM-based rebuttal framework, operationalized as a ToM-Strategy-Response (TSR) pipeline: model the reviewer's mental state, formulate a targeted persuasion strategy, and generate a strategy-grounded response. The agent is trained in two stages (supervised fine-tuning followed by reinforcement learning) on the large-scale RebuttalBench dataset, with a dedicated automated evaluator, Rebuttal-RM, to ensure reliable assessment; experiments show significant gains over the base model and advanced proprietary models in both automated and human evaluations.

Link: https://arxiv.org/abs/2601.15715
Authors: Zhitao He, Zongwei Lyu, Yi R Fung
Affiliations: Hong Kong University of Science and Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint, under review

Abstract:Although artificial intelligence (AI) has become deeply integrated into various stages of the research workflow and achieved remarkable advancements, academic rebuttal remains a significant and underexplored challenge. This is because rebuttal is a complex process of strategic communication under severe information asymmetry rather than a simple technical debate. Consequently, current approaches struggle as they largely imitate surface-level linguistics, missing the essential element of perspective-taking required for effective persuasion. In this paper, we introduce RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM), operationalized through a ToM-Strategy-Response (TSR) pipeline that models reviewer mental state, formulates persuasion strategy, and generates strategy-grounded response. To train our agent, we construct RebuttalBench, a large-scale dataset synthesized via a novel critique-and-refine approach. Our training process consists of two stages, beginning with a supervised fine-tuning phase to equip the agent with ToM-based analysis and strategic planning capabilities, followed by a reinforcement learning phase leveraging the self-reward mechanism for scalable self-improvement. For reliable and efficient automated evaluation, we further develop Rebuttal-RM, a specialized evaluator trained on over 100K samples of multi-source rebuttal data, which achieves scoring consistency with human preferences surpassing powerful judge GPT-4.1. Extensive experiments show RebuttalAgent significantly outperforms the base model by an average of 18.3% on automated metrics, while also outperforming advanced proprietary models across both automated and human evaluations. Disclaimer: the generated rebuttal content is for reference only to inspire authors and assist in drafting. It is not intended to replace the author’s own critical analysis and response.

[NLP-26] Even GPT-5.2 Can't Count to Five: The Case for Zero-Error Horizons in Trustworthy LLMs

[Quick Read]: This paper aims to improve the trustworthiness of large language models (LLMs) by addressing their errors on simple but critical tasks. The authors propose the Zero-Error Horizon (ZEH) as a metric: the maximum input size a model can handle without making any error. The key insight is that ZEH exposes limitations in basic logical and algorithmic capabilities; for instance, GPT-5.2 fails at computing the parity of short bit strings and at checking parenthesis balance, showing that even strong models have unreliable boundaries. Analyzing the relationship between ZEH and accuracy, and their behavioral differences, sheds light on how algorithmic capabilities emerge; combining tree structures with online softmax further reduces the computational cost of evaluation by up to an order of magnitude.

Link: https://arxiv.org/abs/2601.15714
Authors: Ryoma Sato
Affiliations: National Institute of Informatics
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:We propose Zero-Error Horizon (ZEH) for trustworthy LLMs, which represents the maximum range that a model can solve without any errors. While ZEH itself is simple, we demonstrate that evaluating the ZEH of state-of-the-art LLMs yields abundant insights. For example, by evaluating the ZEH of GPT-5.2, we found that GPT-5.2 cannot even compute the parity of a short string like 11000, and GPT-5.2 cannot determine whether the parentheses in ((((()))))) are balanced. This is surprising given the excellent capabilities of GPT-5.2. The fact that LLMs make mistakes on such simple problems serves as an important lesson when applying LLMs to safety-critical domains. By applying ZEH to Qwen2.5 and conducting detailed analysis, we found that while ZEH correlates with accuracy, the detailed behaviors differ, and ZEH provides clues about the emergence of algorithmic capabilities. Finally, while computing ZEH incurs significant computational cost, we discuss how to mitigate this cost by achieving up to one order of magnitude speedup using tree structures and online softmax.
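
The ZEH measurement itself is straightforward to sketch: sweep the input size upward and record the largest size with zero observed errors. The sampling-based instance generator and the parity task below are illustrative assumptions; the paper's tree-structure and online-softmax speedups are not shown.

```python
import random

def zero_error_horizon(model_answers, make_instances, max_n: int = 64) -> int:
    """Largest input size n such that the model makes no error on any tested
    instance up to size n. `model_answers` and `make_instances` are
    user-supplied; the parity task below is one example."""
    horizon = 0
    for n in range(1, max_n + 1):
        if all(model_answers(x) == y for x, y in make_instances(n)):
            horizon = n
        else:
            break  # the first size with an error ends the zero-error range
    return horizon

def make_parity(n: int, k: int = 32):
    """Sample up to k random length-n bit strings paired with their parity."""
    strings = {"".join(random.choice("01") for _ in range(n))
               for _ in range(min(k, 2 ** n))}
    return [(s, s.count("1") % 2) for s in strings]

# zeh = zero_error_horizon(lambda s: ask_llm_parity(s), make_parity)
```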

[NLP-27] Persona Switch: Mixing Distinct Perspectives in Decoding Time EACL’26

[Quick Read]: This paper addresses the inconsistency of role-play prompting across tasks and instances, i.e., its gains in zero-shot reasoning for large language models (LLMs) lack stability. The key to the solution is Persona Switch, a novel decoding method that dynamically selects the better output step by step: at each step it compares the output confidence (measured by the logit gap) of the zero-shot and role-play prompts and takes the more confident one. This effectively combines the complementary strengths of the two prompting strategies and delivers consistent gains, with up to a 5.13% accuracy improvement over competitive baselines.

Link: https://arxiv.org/abs/2601.15708
Authors: Junseok Kim, Nakyeong Yang, Kyomin Jung
Affiliations: Seoul National University
Subjects: Computation and Language (cs.CL)
Comments: EACL'26 Findings, Code is available at this https URL

Abstract:Role-play prompting is known to steer the behavior of language models by injecting a persona into the prompt, improving their zero-shot reasoning capabilities. However, such improvements are inconsistent across different tasks or instances. This inconsistency suggests that zero-shot and role-play prompting may offer complementary strengths rather than one being universally superior. Building on this insight, we propose Persona Switch, a novel decoding method that dynamically combines the benefits of both prompting strategies. Our method proceeds step-by-step, selecting the better output between zero-shot and role-play prompting at each step by comparing their output confidence, as measured by the logit gap. Experiments with widely-used LLMs demonstrate that Persona Switch consistently outperforms competitive baselines, achieving up to 5.13% accuracy improvement. Furthermore, we show that output confidence serves as an informative measure for selecting the more reliable output.
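
A minimal sketch of the per-step selection rule: run the model on both prompts, compare top-1 vs. top-2 logit gaps, and emit the more confident token. The surrounding generation loop and prompt handling described in the comment are assumptions of this sketch.

```python
import torch

@torch.no_grad()
def persona_switch_step(logits_zero: torch.Tensor,
                        logits_role: torch.Tensor) -> int:
    """Pick the next token from whichever prompt is more confident, measured
    by the logit gap between its top-1 and top-2 candidates."""
    def gap(logits: torch.Tensor) -> float:
        top2 = torch.topk(logits, k=2).values
        return (top2[0] - top2[1]).item()

    chosen = logits_zero if gap(logits_zero) >= gap(logits_role) else logits_role
    return int(chosen.argmax())

# At each decoding step the model is run on both the zero-shot and the
# role-play prompt (sharing the generated suffix); the selected token is
# appended to both contexts before the next step.
```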

[NLP-28] Agentic Uncertainty Quantification

[Quick Read]: This paper addresses the reliability problem caused by the "Spiral of Hallucination" in long-horizon agent reasoning, where early epistemic errors propagate irreversibly and accumulate, degrading overall decision quality. Existing approaches face a dilemma: uncertainty quantification (UQ) methods only passively diagnose risk without intervening, while self-reflection mechanisms fall into aimless or continuous correction loops. The key to the solution is a unified Dual-Process Agentic UQ (AUQ) framework that turns verbalized uncertainty into active, bi-directional control signals: System 1 (Uncertainty-Aware Memory, UAM) implicitly propagates confidence and semantic explanations to prevent blind decision-making, while System 2 (Uncertainty-Aware Reflection, UAR) uses those explanations as rational cues to trigger targeted inference-time correction only when necessary. This design dynamically balances efficient execution with deep deliberation and, without any extra training, markedly improves trajectory-level calibration.

链接: https://arxiv.org/abs/2601.15703
作者: Jiaxin Zhang,Prafulla Kumar Choubey,Kung-Hsiang Huang,Caiming Xiong,Chien-Sheng Wu
机构: Salesforce AI Research (Salesforce人工智能研究中心)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 36 pages, 9 figures, 9 tables

点击查看摘要

Abstract:Although AI agents have demonstrated impressive capabilities in long-horizon reasoning, their reliability is severely hampered by the "Spiral of Hallucination," where early epistemic errors propagate irreversibly. Existing methods face a dilemma: uncertainty quantification (UQ) methods typically act as passive sensors, only diagnosing risks without addressing them, while self-reflection mechanisms suffer from continuous or aimless corrections. To bridge this gap, we propose a unified Dual-Process Agentic UQ (AUQ) framework that transforms verbalized uncertainty into active, bi-directional control signals. Our architecture comprises two complementary mechanisms: System 1 (Uncertainty-Aware Memory, UAM), which implicitly propagates verbalized confidence and semantic explanations to prevent blind decision-making; and System 2 (Uncertainty-Aware Reflection, UAR), which utilizes these explanations as rational cues to trigger targeted inference-time resolution only when necessary. This enables the agent to balance efficient execution and deep deliberation dynamically. Extensive experiments on closed-loop benchmarks and open-ended deep research tasks demonstrate that our training-free approach achieves superior performance and trajectory-level calibration. We believe this principled framework AUQ represents a significant step towards reliable agents.

[NLP-29] What Patients Really Ask: Exploring the Effect of False Assumptions in Patient Information Seeking

【Quick Read】: This paper addresses an important bias in how LLMs are evaluated for medical question answering: existing benchmarks focus on standardized medical exam questions while ignoring the diverse, unstructured questions patients actually ask, which often carry false assumptions or dangerous intent. The key contribution is a medical QA dataset built from Google's "People Also Ask" feature, sourced from questions about the most prescribed medications in the United States, with systematic identification of the many questions containing false premises and risky intentions. The study further shows that such "corrupted" questions do not appear at random but depend heavily on the degree of incorrectness in the question history that produced them, and that current LLMs are poor at spotting false assumptions in everyday medical questions, pointing to a key direction for model improvement.

Link: https://arxiv.org/abs/2601.15674
Authors: Raymond Xiong, Furong Jia, Lionel Wong, Monica Agrawal
Affiliations: Duke University; Stanford University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Patients are increasingly using large language models (LLMs) to seek answers to their healthcare-related questions. However, benchmarking efforts in LLMs for question answering often focus on medical exam questions, which differ significantly in style and content from the questions patients actually raise in real life. To bridge this gap, we sourced data from Google’s People Also Ask feature by querying the top 200 prescribed medications in the United States, curating a dataset of medical questions people commonly ask. A considerable portion of the collected questions contains incorrect assumptions and dangerous intentions. We demonstrate that the emergence of these corrupted questions is not uniformly random and depends heavily on the degree of incorrectness in the history of questions that led to their appearance. Current LLMs that perform strongly on other benchmarks struggle to identify incorrect assumptions in everyday questions.

[NLP-30] Towards Reliable Medical LLMs: Benchmarking and Enhancing Confidence Estimation of Large Language Models in Medical Consultation

【Quick Read】: This paper tackles the unreliable confidence estimates produced when LLMs make clinical judgments from incomplete information, focusing on how confidence and correctness co-evolve as evidence accumulates over multi-turn interactions. Prior work evaluates confidence only in single-turn, static settings, missing these dynamics and limiting support for reliable decision-making. The key contributions are the first multi-turn confidence-estimation benchmark for realistic medical consultations, which introduces an information-sufficiency gradient to characterize the confidence-correctness dynamics, and MedConf, a framework that builds symptom profiles via retrieval-augmented generation, aligns patient information into supporting, missing, and contradictory relations, and aggregates them through weighted integration into an interpretable confidence estimate. Across two LLMs and three medical datasets, MedConf consistently outperforms state-of-the-art methods and stays stable under information insufficiency and multimorbidity, indicating that information adequacy is a core determinant of credible medical confidence modeling.

Link: https://arxiv.org/abs/2601.15645
Authors: Zhiyao Ren, Yibing Zhan, Siyuan Liang, Guozheng Ma, Baosheng Yu, Dacheng Tao
Affiliations: Nanyang Technological University; University of Science and Technology of China
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large-scale language models (LLMs) often offer clinical judgments based on incomplete information, increasing the risk of misdiagnosis. Existing studies have primarily evaluated confidence in single-turn, static settings, overlooking the coupling between confidence and correctness as clinical evidence accumulates during real consultations, which limits their support for reliable decision-making. We propose the first benchmark for assessing confidence in multi-turn interaction during realistic medical consultations. Our benchmark unifies three types of medical data for open-ended diagnostic generation and introduces an information sufficiency gradient to characterize the confidence-correctness dynamics as evidence increases. We implement and compare 27 representative methods on this benchmark; two key insights emerge: (1) medical data amplifies the inherent limitations of token-level and consistency-level confidence methods, and (2) medical reasoning must be evaluated for both diagnostic accuracy and information completeness. Based on these insights, we present MedConf, an evidence-grounded linguistic self-assessment framework that constructs symptom profiles via retrieval-augmented generation, aligns patient information with supporting, missing, and contradictory relations, and aggregates them into an interpretable confidence estimate through weighted integration. Across two LLMs and three medical datasets, MedConf consistently outperforms state-of-the-art methods on both AUROC and Pearson correlation coefficient metrics, maintaining stable performance under conditions of information insufficiency and multimorbidity. These results demonstrate that information adequacy is a key determinant of credible medical confidence modeling, providing a new pathway toward building more reliable and interpretable large medical models.
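The "weighted integration" step lends itself to a toy sketch: counts of supporting, missing, and contradictory evidence relations are folded into a single confidence value. The weights and the normalization below are illustrative assumptions, not the paper's values.

```python
def medconf_score(support: int, missing: int, contradict: int,
                  weights=(1.0, 0.5, 1.5)) -> float:
    """Toy weighted integration of evidence relations into a [0, 1] confidence.
    Weights are illustrative placeholders, not MedConf's actual parameters."""
    w_s, w_m, w_c = weights
    pos = w_s * support                      # evidence backing the diagnosis
    neg = w_m * missing + w_c * contradict   # gaps and conflicts reduce confidence
    return pos / (pos + neg) if (pos + neg) > 0 else 0.5

print(medconf_score(support=4, missing=1, contradict=1))  # ~0.67
```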

[NLP-31] Qwen3-TTS Technical Report

【Quick Read】: This paper addresses key challenges in multilingual, controllable, robust, and streaming text-to-speech (TTS) synthesis, particularly low-latency, high-fidelity generation and personalized voice cloning. The core of the solution is the Qwen3-TTS model family, which uses a dual-track LM architecture for real-time synthesis together with two efficient speech tokenizers: Qwen-TTS-Tokenizer-25Hz, a single-codebook design that preserves semantic content, integrates seamlessly with Qwen-Audio, and supports streaming waveform reconstruction via a block-wise diffusion transformer (DiT); and Qwen-TTS-Tokenizer-12Hz, whose 12.5 Hz, 16-layer multi-codebook design achieves extreme bitrate reduction and ultra-low latency (97 ms to the first packet) with a lightweight causal ConvNet for immediate response. Trained on over 5 million hours of data across 10 languages, the architecture significantly outperforms existing baselines on diverse benchmarks (e.g., a TTS multilingual test set, InstructTTSEval, and a long-speech test set).

Link: https://arxiv.org/abs/2601.15621
Authors: Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, Xinyu Zhang, Pei Zhang, Baosong Yang, Jin Xu, Jingren Zhou, Junyang Lin
Affiliations: Qwen Team
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: this https URL

Abstract:In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation over the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers: 1) Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content, which offers seamless integration with Qwen-Audio and enables streaming waveform reconstruction via a block-wise DiT. 2) Qwen-TTS-Tokenizer-12Hz achieves extreme bitrate reduction and ultra-low-latency streaming, enabling immediate first-packet emission (97 ms) through its 12.5 Hz, 16-layer multi-codebook design and a lightweight causal ConvNet. Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmarks (e.g., TTS multilingual test set, InstructTTSEval, and our long speech test set). To facilitate community research and development, we release both tokenizers and models under the Apache 2.0 license.
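The quoted tokenizer rates imply some simple throughput arithmetic, sketched below. These numbers follow only from the rates stated in the abstract and say nothing about the actual implementation.

```python
# Back-of-the-envelope rates implied by the two tokenizers (our arithmetic,
# using only the numbers quoted in the abstract).
codes_per_sec_25hz = 25 * 1        # single codebook at 25 Hz
codes_per_sec_12hz = 12.5 * 16     # 16 codebooks at 12.5 Hz -> 200 codes/s

frame_ms_12hz = 1000 / 12.5        # one frame covers 80 ms of audio
print(codes_per_sec_25hz, codes_per_sec_12hz, frame_ms_12hz)
# A ~97 ms first packet is roughly one 80 ms frame plus overhead,
# consistent with emitting audio after a single decoded frame.
```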

[NLP-32] When Sharpening Becomes Collapse: Sampling Bias and Semantic Coupling in RL with Verifiable Rewards

【Quick Read】: This paper examines the "over-sharpening" phenomenon in Reinforcement Learning with Verifiable Rewards (RLVR): while improving LLMs' problem-solving ability, the policy can collapse onto a limited set of modes and suppress valid alternative solutions, suggesting RLVR may merely over-optimize the existing knowledge distribution rather than elicit genuinely new capabilities. The key consists of two strategies: inverse-success advantage calibration, which mitigates the learning bias by prioritizing difficult queries, and distribution-level calibration, which diversifies sampling via a memory network to break the global collapse induced by semantic coupling. Empirical results show these strategies markedly improve generalization.

Link: https://arxiv.org/abs/2601.15609
Authors: Mingyuan Fan, Weiguang Han, Daixin Wang, Cen Chen, Zhiqiang Zhang, Jun Zhou
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) is a central paradigm for turning large language models (LLMs) into reliable problem solvers, especially in logic-heavy domains. Despite its empirical success, it remains unclear whether RLVR elicits novel capabilities or merely sharpens the distribution over existing knowledge. We study this by formalizing over-sharpening, a phenomenon where the policy collapses onto limited modes, suppressing valid alternatives. At a high level, we discover finite-batch updates intrinsically bias learning toward sampled modes, triggering a collapse that propagates globally via semantic coupling. To mitigate this, we propose inverse-success advantage calibration to prioritize difficult queries and distribution-level calibration to diversify sampling via a memory network. Empirical evaluations validate that our strategies can effectively improve generalization.
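One plausible reading of "inverse-success advantage calibration" is a group-relative (GRPO-style) centered advantage rescaled by the group's success rate, so low-success (hard) queries receive larger updates. The exact form and the epsilon below are our assumptions, not the paper's.

```python
import numpy as np

def calibrated_advantages(rewards: np.ndarray, eps: float = 0.1) -> np.ndarray:
    """Group-relative advantages scaled by inverse success rate: hard queries
    (low success) get larger gradient weight. A sketch of the idea only."""
    success = rewards.mean()          # fraction of verified-correct samples
    adv = rewards - success           # centered, group-relative advantage
    return adv / (success + eps)      # inverse-success upweighting

# A hard query (1 correct out of 4) yields strongly amplified advantages:
print(calibrated_advantages(np.array([1.0, 0.0, 0.0, 0.0])))
```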

[NLP-33] ToxiTwitch: Toward Emote-Aware Hybrid Moderation for Live Streaming Platforms

【Quick Read】: This paper addresses the difficulty of automatically detecting toxic behavior on live-streaming platforms such as Twitch, where high-volume, multimodal chat (text plus emotes) makes manual moderation and keyword filtering hard to scale. The key is ToxiTwitch, a hybrid model that combines LLM-generated embeddings of text and emotes with traditional machine-learning classifiers such as Random Forest and SVM, which improves detection of toxicity in emote-rich contexts, reaching 80% accuracy under channel-specific training (a 13% improvement over BERT, with an F1-score of 76%).

Link: https://arxiv.org/abs/2601.15605
Authors: Baktash Ansari, Shiza Ali, Elias Martin, Maryna Sivachenko, Afra Mashhadi
Affiliations: University of Washington Bothell
Subjects: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: Exploratory study; prior versions submitted to peer review

Abstract:The rapid growth of live-streaming platforms such as Twitch has introduced complex challenges in moderating toxic behavior. Traditional moderation approaches, such as human annotation and keyword-based filtering, have demonstrated utility, but human moderators on Twitch constantly struggle to scale effectively in the fast-paced, high-volume, and context-rich chat environment of the platform while also facing harassment themselves. Recent advances in large language models (LLMs), such as DeepSeek-R1-Distill and Llama-3-8B-Instruct, offer new opportunities for toxicity detection, especially in understanding nuanced, multimodal communication involving emotes. In this work, we present an exploratory comparison of toxicity detection approaches tailored to Twitch. Our analysis reveals that incorporating emotes improves the detection of toxic behavior. To this end, we introduce ToxiTwitch, a hybrid model that combines LLM-generated embeddings of text and emotes with traditional machine learning classifiers, including Random Forest and SVM. In our case study, the proposed hybrid approach reaches up to 80 percent accuracy under channel-specific training (with 13 percent improvement over BERT and F1-score of 76 percent). This work is an exploratory study intended to surface challenges and limits of emote-aware toxicity detection on Twitch.
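The hybrid recipe (LLM embeddings of text and emotes fed to a classical classifier) can be sketched with scikit-learn. The random features below stand in for real embeddings, and the dimensions and labels are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Stand-ins: rows are chat messages; columns are an LLM embedding of the text
# concatenated with an embedding of the emotes used (all shapes hypothetical).
rng = np.random.default_rng(0)
X_text, X_emote = rng.normal(size=(500, 384)), rng.normal(size=(500, 64))
y = rng.integers(0, 2, size=500)                 # 1 = toxic, 0 = not toxic

X = np.hstack([X_text, X_emote])                 # emote-aware feature fusion
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))
```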

[NLP-34] Parallelism and Generation Order in Masked Diffusion Language Models: Limits Today, Potential Tomorrow

【Quick Read】: This paper asks to what extent current masked diffusion language models (MDLMs) actually realize their promised parallel generation and arbitrary-order decoding, and why they still lag behind comparably sized autoregressive models. The analysis shows the main limitation is that parallel probabilistic modeling weakens inter-token dependencies, hurting generation quality, while the advantage is adaptive generation order: on tasks requiring "backward information" (e.g., Sudoku), MDLMs tend to fill the easier blanks first. The key proposal is a Generate-then-Edit paradigm, supported by theoretical motivation and design insights, which mitigates the dependency loss while retaining the efficiency of parallel decoding.

Link: https://arxiv.org/abs/2601.15593
Authors: Yangyang Zhong, Yanmei Gu, Zhengqing Zang, Xiaomeng Li, Yuqi Ding, Xibei Jia, Yuting Shen, Zhenzhong Lan, Liwang Zhu, Weiping Liu, Junlin Zhou, Haisheng Liu, Zhong Xin Yu, Pengxin Luo, Donglian Qi, Yunfeng Yan, Junbo Zhao
Affiliations: Zhejiang University; Ant Group; Westlake University; University of Chinese Academy of Social Sciences; Shanghai Jiao Tong University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Masked Diffusion Language Models (MDLMs) promise parallel token generation and arbitrary-order decoding, yet it remains unclear to what extent current models truly realize these capabilities. We characterize MDLM behavior along two dimensions – parallelism strength and generation order – using Average Finalization Parallelism (AFP) and Kendall’s tau. We evaluate eight mainstream MDLMs (up to 100B parameters) on 58 benchmarks spanning knowledge, reasoning, and programming. The results show that MDLMs still lag behind comparably sized autoregressive models, mainly because parallel probabilistic modeling weakens inter-token dependencies. Meanwhile, MDLMs exhibit adaptive decoding behavior: their parallelism and generation order vary significantly with the task domain, the stage of reasoning, and whether the output is correct. On tasks that require “backward information” (e.g., Sudoku), MDLMs adopt a solution order that tends to fill easier Sudoku blanks first, highlighting their advantages. Finally, we provide theoretical motivation and design insights supporting a Generate-then-Edit paradigm, which mitigates dependency loss while retaining the efficiency of parallel decoding.
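Both measurements are straightforward to reproduce from a decoding trace. The sketch below uses a hypothetical per-position finalization record; the AFP formula (tokens finalized per decoding step) is our reading of the acronym, not necessarily the paper's exact definition.

```python
from scipy.stats import kendalltau

# Step at which an MDLM finalized each position (hypothetical decoding trace;
# parallel decoding means several positions can share a step).
positions         = [0, 1, 2, 3, 4, 5, 6, 7]
finalization_step = [0, 0, 2, 1, 1, 3, 2, 3]

# Kendall's tau against left-to-right order: 1.0 would mean strictly
# autoregressive-like behavior, values near 0 mean order-agnostic decoding.
tau, _ = kendalltau(positions, finalization_step)
print(f"Kendall's tau vs left-to-right order: {tau:.2f}")

# Average Finalization Parallelism as tokens finalized per decoding step
# (our interpretation of AFP).
afp = len(positions) / (max(finalization_step) + 1)
print(f"AFP: {afp:.2f} tokens/step")
```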

[NLP-35] YuFeng-XGuard: A Reasoning-Centric, Interpretable and Flexible Guardrail Model for Large Language Models

【Quick Read】: This paper addresses the inadequacy of safety guardrails for deployed LLMs: existing solutions rely on coarse-grained filtering or post-hoc rules and therefore suffer from low transparency, rigid policies, or prohibitive inference cost. The key is YuFeng-XGuard, a reasoning-centric guardrail model family that produces structured risk predictions (explicit risk categories, configurable confidence scores, and natural-language explanations) for fine-grained, interpretable, and adaptable risk perception. It adopts a tiered inference paradigm that makes an initial risk decision from the first decoded token while providing in-depth explanations on demand, and a dynamic policy mechanism that decouples risk perception from policy enforcement, so safety policies can be adjusted without retraining, balancing efficiency and efficacy.

Link: https://arxiv.org/abs/2601.15588
Authors: Junyu Lin, Meizhen Liu, Xiufeng Huang, Jinfeng Li, Haiwen Hong, Xiaohan Yuan, Yuefeng Chen, Longtao Huang, Hui Xue, Ranjie Duan, Zhikai Chen, Yuchuan Fu, Defeng Li, Lingyao Gao, Yitong Yang
Affiliations: Alibaba-AAIG
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:As large language models (LLMs) are increasingly deployed in real-world applications, safety guardrails are required to go beyond coarse-grained filtering and support fine-grained, interpretable, and adaptable risk assessment. However, existing solutions often rely on rapid classification schemes or post-hoc rules, resulting in limited transparency, inflexible policies, or prohibitive inference costs. To this end, we present YuFeng-XGuard, a reasoning-centric guardrail model family designed to perform multi-dimensional risk perception for LLM interactions. Instead of producing opaque binary judgments, YuFeng-XGuard generates structured risk predictions, including explicit risk categories and configurable confidence scores, accompanied by natural language explanations that expose the underlying reasoning process. This formulation enables safety decisions that are both actionable and interpretable. To balance decision latency and explanatory depth, we adopt a tiered inference paradigm that performs an initial risk decision based on the first decoded token, while preserving on-demand explanatory reasoning when required. In addition, we introduce a dynamic policy mechanism that decouples risk perception from policy enforcement, allowing safety policies to be adjusted without model retraining. Extensive experiments on a diverse set of public safety benchmarks demonstrate that YuFeng-XGuard achieves state-of-the-art performance while maintaining strong efficiency-efficacy trade-offs. We release YuFeng-XGuard as an open model family, including both a full-capacity variant and a lightweight version, to support a wide range of deployment scenarios.

[NLP-36] From Generation to Collaboration: Using LLMs to Edit for Empathy in Healthcare

【Quick Read】: This paper addresses the difficulty physicians face in consistently balancing emotional warmth with medical accuracy under the limited cognitive and emotional resources of clinical practice. The core idea is to use large language models (LLMs) as "empathy editors" that refine physicians' written responses to enhance empathetic tone without altering the underlying medical facts. The key contribution is a pair of quantitative metrics, an Empathy Ranking Score and a MedFactChecking Score, for systematically assessing the emotional and factual quality of outputs. Experiments show that LLM-edited responses significantly increase perceived empathy while preserving factual accuracy, outperforming fully LLM-generated content, suggesting that LLMs are safer and more effective as editorial assistants than as autonomous generators.

Link: https://arxiv.org/abs/2601.15558
Authors: Man Luo, Bahareh Harandizadeh, Amara Tariq, Halim Abbas, Umar Ghaffar, Christopher J Warren, Segun O. Kolade, Haidar M. Abdul-Muhsin
Affiliations: Mayo Clinic
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Clinical empathy is essential for patient care, but physicians must continually balance emotional warmth with factual precision under the cognitive and emotional constraints of clinical practice. This study investigates how large language models (LLMs) can function as empathy editors, refining physicians' written responses to enhance empathetic tone while preserving underlying medical information. More importantly, we introduce novel quantitative metrics, an Empathy Ranking Score and a MedFactChecking Score, to systematically assess both emotional and factual quality of the responses. Experimental results show that LLM-edited responses significantly increase perceived empathy while preserving factual accuracy compared with fully LLM-generated outputs. These findings suggest that using LLMs as editorial assistants, rather than autonomous generators, offers a safer, more effective pathway to empathetic and trustworthy AI-assisted healthcare communication.

[NLP-37] LLM or Human? Perceptions of Trust and Information Quality in Research Summaries

【Quick Read】: As LLMs are increasingly used to generate and edit scientific abstracts, little is known about the resulting questions of trust, quality assessment, and disclosure, in particular how readers perceive LLM involvement and how it shapes their evaluation of scientific work. The key contribution is a mixed-methods survey experiment testing whether readers with machine-learning expertise can distinguish LLM-generated abstracts, and how actual versus perceived LLM involvement affects judgments of quality and trustworthiness. The study finds that participants struggle to identify LLM-generated content reliably, yet their beliefs about LLM involvement significantly shape evaluations, with LLM-edited abstracts rated most favorably; it further identifies three distinct reader orientations toward AI-assisted writing, providing empirical grounding for policies on disclosure norms and acceptable use in scientific communication.

Link: https://arxiv.org/abs/2601.15556
Authors: Nil-Jana Akpinar, Sandeep Avula, CJ Lee, Brandon Dang, Kaza Razat, Vanessa Murdock
Affiliations: Microsoft; Amazon AWS AI
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: Accepted to ACM CHI conference on Human Factors in Computing Systems (CHI 2026)

Abstract:Large Language Models (LLMs) are increasingly used to generate and edit scientific abstracts, yet their integration into academic writing raises questions about trust, quality, and disclosure. Despite growing adoption, little is known about how readers perceive LLM-generated summaries and how these perceptions influence evaluations of scientific work. This paper presents a mixed-methods survey experiment investigating whether readers with ML expertise can distinguish between human- and LLM-generated abstracts, how actual and perceived LLM involvement affects judgments of quality and trustworthiness, and what orientations readers adopt toward AI-assisted writing. Our findings show that participants struggle to reliably identify LLM-generated content, yet their beliefs about LLM involvement significantly shape their evaluations. Notably, abstracts edited by LLMs are rated more favorably than those written solely by humans or LLMs. We also identify three distinct reader orientations toward LLM-assisted writing, offering insights into evolving norms and informing policy around disclosure and acceptable use in scientific communication.

[NLP-38] Common to Whom? Regional Cultural Commonsense and LLM Bias in India

【Quick Read】: This paper questions the assumption, built into existing cultural-commonsense benchmarks, that nations are culturally homogeneous, asking whether cultural commonsense holds uniformly within a country or varies at the sub-national level. The key is Indica, the first benchmark designed to evaluate LLMs' grasp of sub-national cultural commonsense, focusing on India as a prototypically heterogeneous nation of many states, languages, and regions. The methodology grounds 515 questions across 8 domains of everyday life in an anthropological taxonomy and collects human-annotated answers from five Indian regions (North, South, East, West, and Central), yielding 1,630 region-specific question-answer pairs. Only 39.4% of questions elicit agreement across all regions, showing that cultural commonsense in India is predominantly regional rather than national; evaluated LLMs score only 13.4%-20.9% on region-specific questions and exhibit geographic bias. The methodology also generalizes to other culturally heterogeneous nations, offering a framework for systematically assessing regional cultural knowledge in LLMs.

Link: https://arxiv.org/abs/2601.15550
Authors: Sangmitra Madhusudan, Trush Shashank More, Steph Buongiorno, Renata Dividino, Jad Kabbara, Ali Emami
Affiliations: Emory University; Independent Researcher; Brock University; Massachusetts Institute of Technology
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Existing cultural commonsense benchmarks treat nations as monolithic, assuming uniform practices within national boundaries. But does cultural commonsense hold uniformly within a nation, or does it vary at the sub-national level? We introduce Indica, the first benchmark designed to test LLMs’ ability to address this question, focusing on India - a nation of 28 states, 8 union territories, and 22 official languages. We collect human-annotated answers from five Indian regions (North, South, East, West, and Central) across 515 questions spanning 8 domains of everyday life, yielding 1,630 region-specific question-answer pairs. Strikingly, only 39.4% of questions elicit agreement across all five regions, demonstrating that cultural commonsense in India is predominantly regional, not national. We evaluate eight state-of-the-art LLMs and find two critical gaps: models achieve only 13.4%-20.9% accuracy on region-specific questions, and they exhibit geographic bias, over-selecting Central and North India as the “default” (selected 30-40% more often than expected) while under-representing East and West. Beyond India, our methodology provides a generalizable framework for evaluating cultural commonsense in any culturally heterogeneous nation, from question design grounded in anthropological taxonomy, to regional data collection, to bias measurement.

[NLP-39] PRISM: Deriving the Transformer as a Signal-Denoising Operator via Maximum Coding Rate Reduction

【Quick Read】: This paper targets the "black-box" criticism of deep models, especially Transformers. The key is Prism, a white-box attention architecture derived from the principle of Maximizing Coding Rate Reduction (MCR^2), which models attention as gradient ascent on a signal-noise manifold and imposes two physical constraints: an overcomplete dictionary that expands the representational phase space, and irrational frequency separation (π-RoPE) that enforces incoherence between the signal and noise subspaces. These geometric inductive biases act as physical constraints sufficient on their own to induce unsupervised functional disentanglement, unifying performance with interpretability.

Link: https://arxiv.org/abs/2601.15540
Authors: Dongchen Huang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Data Analysis, Statistics and Probability (physics.data-an)
Comments:

Abstract:Deep learning models, particularly Transformers, are often criticized as "black boxes" and lack interpretability. We propose Prism, a white-box attention-based architecture derived from the principles of Maximizing Coding Rate Reduction (MCR^2). By modeling the attention mechanism as a gradient ascent process on a distinct signal-noise manifold, we introduce two physical constraints: an overcomplete dictionary to expand the representational phase space, and an irrational frequency separation (π-RoPE) to enforce incoherence between signal and noise subspaces. We demonstrate that these geometric inductive biases can be viewed as a physical constraint and they are sufficient to induce unsupervised functional disentanglement alone. Using TinyStories as a controlled testbed for verifying spectral dynamics, we observe that Prism spontaneously specializes its attention heads into spectrally distinct regimes: low-frequency heads capturing long-range causal dependencies (signal) and high-frequency heads handling local syntactic constraints (noise). Our results suggest that interpretability and performance are not a trade-off, but can be unified through principled geometric construction.
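One way to read "irrational frequency separation" is as two RoPE frequency banks whose ratio is irrational, so the signal and noise subspaces never phase-lock. The sketch below is that reading only; the paper's exact construction may well differ.

```python
import numpy as np

def rope_freqs(dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE inverse frequencies for a head of size `dim`."""
    return base ** (-np.arange(0, dim, 2) / dim)

# Hypothetical pi-RoPE: scale the noise-subspace frequencies by an irrational
# factor (pi) so they are incommensurate with the signal-subspace frequencies.
signal_freqs = rope_freqs(64)
noise_freqs = np.pi * rope_freqs(64)
print(signal_freqs[:3])
print(noise_freqs[:3])
```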

[NLP-40] DS@GT at TREC TOT 2025: Bridging Vague Recollection with Fusion Retrieval and Learned Reranking

【Quick Read】: This paper addresses precise retrieval for the TREC Tip-of-the-Tongue (ToT) task, where users recall an item only vaguely but can describe partial clues, and the system must recall the right document efficiently from a large corpus. The key is a two-stage retrieval system: the first stage fuses LLM-based retrieval with sparse (BM25) and dense (BGE-M3) retrieval, plus topic-aware multi-index dense retrieval that partitions the Wikipedia corpus into 24 topical domains; the second stage re-ranks with a trained LambdaMART model and LLM-based reranking, supported by 5,000 synthetic ToT queries generated with LLMs for training. Combining hybrid retrieval with Gemini-2.5-flash reranking, the best system reaches 0.66 recall and 0.41 NDCG@1000 on the test set, demonstrating the effectiveness of fusion retrieval.

Link: https://arxiv.org/abs/2601.15518
Authors: Wenxin Zhou, Ritesh Mehta, Anthony Miyaguchi
Affiliations: Georgia Institute of Technology
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Paper submitted to TREC 2025 (34th Text REtrieval Conference)

Abstract:We develop a two-stage retrieval system that combines multiple complementary retrieval methods with a learned reranker and LLM-based reranking, to address the TREC Tip-of-the-Tongue (ToT) task. In the first stage, we employ hybrid retrieval that merges LLM-based retrieval, sparse (BM25), and dense (BGE-M3) retrieval methods. We also introduce topic-aware multi-index dense retrieval that partitions the Wikipedia corpus into 24 topical domains. In the second stage, we evaluate both a trained LambdaMART reranker and LLM-based reranking. To support model training, we generate 5000 synthetic ToT queries using LLMs. Our best system achieves recall of 0.66 and NDCG@1000 of 0.41 on the test set by combining hybrid retrieval with Gemini-2.5-flash reranking, demonstrating the effectiveness of fusion retrieval.
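A standard way to merge ranked lists from heterogeneous retrievers (BM25, dense, LLM-based) is reciprocal rank fusion; whether DS@GT uses exactly this rule is not stated in the abstract, so treat the sketch as generic.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked doc-id lists; each list contributes 1/(k + rank + 1)
    per document. Standard RRF with the conventional k=60 constant."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["d3", "d1", "d7"]
dense = ["d1", "d3", "d9"]
print(reciprocal_rank_fusion([bm25, dense]))  # d1 and d3 rise to the top
```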

[NLP-41] AdversaRiskQA: An Adversarial Factuality Benchmark for High-Risk Domains

【Quick Read】: This paper addresses the loss of trustworthiness caused by factuality hallucinations in high-risk domains, particularly model failures induced by adversarial factuality, i.e., misinformation deliberately injected into prompts with varying levels of expressed confidence. Existing work lacks high-quality, domain-specific resources for testing robustness under such adversarial conditions, and no prior study has examined how injected misinformation affects long-form factuality. The key is AdversaRiskQA, the first verified benchmark for systematically evaluating adversarial factuality in Health, Finance, and Law, with two difficulty levels to probe models' defenses at different knowledge depths, plus two automated methods for measuring attack success and long-form factuality. Evaluating six open- and closed-source LLMs, the results show that performance scales non-linearly with model size and varies across domains, while injected misinformation shows no significant effect on the factuality of long-form output, highlighting current limits in models' factual-consistency judgments in complex settings.

Link: https://arxiv.org/abs/2601.15511
Authors: Adam Szelestey, Sofie van Engelen, Tianhao Huang, Justin Snelders, Qintao Zeng, Songgaojun Deng
Affiliations: Eindhoven University of Technology
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 13 pages, 4 figures, and 11 tables

Abstract:Hallucination in large language models (LLMs) remains an acute concern, contributing to the spread of misinformation and diminished public trust, particularly in high-risk domains. Among hallucination types, factuality is crucial, as it concerns a model's alignment with established world knowledge. Adversarial factuality, defined as the deliberate insertion of misinformation into prompts with varying levels of expressed confidence, tests a model's ability to detect and resist confidently framed falsehoods. Existing work lacks high-quality, domain-specific resources for assessing model robustness under such adversarial conditions, and no prior research has examined the impact of injected misinformation on long-form text factuality. To address this gap, we introduce AdversaRiskQA, the first verified and reliable benchmark systematically evaluating adversarial factuality across Health, Finance, and Law. The benchmark includes two difficulty levels to test LLMs' defensive capabilities across varying knowledge depths. We propose two automated methods for evaluating the adversarial attack success and long-form factuality. We evaluate six open- and closed-source LLMs from the Qwen, GPT-OSS, and GPT families, measuring misinformation detection rates. Long-form factuality is assessed on Qwen3 (30B) under both baseline and adversarial conditions. Results show that after excluding meaningless responses, Qwen3 (80B) achieves the highest average accuracy, while GPT-5 maintains consistently high accuracy. Performance scales non-linearly with model size, varies by domains, and gaps between difficulty levels narrow as models grow. Long-form evaluation reveals no significant correlation between injected misinformation and the model's factual output. AdversaRiskQA provides a valuable benchmark for pinpointing LLM weaknesses and developing more reliable models for high-stakes applications.

[NLP-42] The Dark Side of AI Transformers: Sentiment Polarization and the Loss of Business Neutrality by NLP Transformers

【Quick Read】: This paper examines a side effect of transfer-learning Transformers in sentiment analytics: although they raise overall accuracy, they polarize sentiment classes, in particular degrading recognition of the neutral class and thereby breaking the neutrality of sentiment-analysis outputs. This matters especially in applied NLP, where industry-grade tasks depend heavily on reliable, balanced sentiment outputs. The key is to identify and mitigate the degradation that accuracy gains in one sentiment class impose on the others (above all the neutral class), so as to restore fairness and robustness across classes.

Link: https://arxiv.org/abs/2601.15509
Authors: Prasanna Kumar
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The use of Transfer Learning Transformers has steadily improved accuracy and has significantly contributed to solving complex computation problems. However, this transformer-led accuracy improvement in Applied AI Analytics, specifically in sentiment analytics, comes with a dark side. It is observed during experiments that many of these transformer-led gains in the accuracy of one class of sentiment have come at the cost of polarization of another class of sentiment and the failing of neutrality. This lack of neutrality poses an acute problem in the Applied NLP space, which relies heavily on the computational outputs of sentiment analytics for reliable, industry-ready tasks.

[NLP-43] Computational Representations of Character Significance in Novels

【Quick Read】: This paper addresses the limitations of traditional character modeling in novels, which judges character centrality mainly by scene frequency, neglects the narrator-character distinction, and ignores how characters discuss one another. The key is a six-component structural model of character drawn from a new literary theory, which alongside actions, named mentions, and dialogue includes a component prior methods overlooked: discussion of a character by other characters. By comparing general-purpose LLMs with task-specific transformers for operationalizing this model on major 19th-century British realist novels, the paper builds fine-grained component-level representations and character-discussion graphs, opening a new computational lens for literary analysis at scale, notably for testing Woloch's classic "the one vs. the many" theory of character centrality and the gendered dynamics of character discussion.

Link: https://arxiv.org/abs/2601.15508
Authors: Haaris Mian, Melanie Subbiah, Sharon Marcus, Nora Shaalan, Kathleen McKeown
Affiliations: Columbia University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Characters in novels have typically been modeled based on their presence in scenes in narrative, considering aspects like their actions, named mentions, and dialogue. This conception of character places significant emphasis on the main character who is present in the most scenes. In this work, we instead adopt a framing developed from a new literary theory proposing a six-component structural model of character. This model enables a comprehensive approach to character that accounts for the narrator-character distinction and includes a component neglected by prior methods, discussion by other characters. We compare general-purpose LLMs with task-specific transformers for operationalizing this model of character on major 19th-century British realist novels. Our methods yield both component-level and graph representations of character discussion. We then demonstrate that these representations allow us to approach literary questions at scale from a new computational lens. Specifically, we explore Woloch’s classic “the one vs the many” theory of character centrality and the gendered dynamics of character discussion.

[NLP-44] ViT Registers and Fractal ViT

【Quick Read】: This paper probes a limitation of Vision Transformers (ViTs) tied to permutation invariance among tokens, i.e., weak modeling of spatial structure when explicit positional encoding is absent. The key idea is "summary tokens": an attention mask governs how regular tokens interact with these extra register-like tokens, breaking permutation invariance; the scheme can be used alone or combined with various positional encodings. Experiments show, however, that the resulting fractal ViT does not improve on ViTs that already use registers, suggesting such findings may be scale-, domain-, or application-specific.

Link: https://arxiv.org/abs/2601.15506
Authors: Jason Chuan-Chih Chou, Abhinav Kumar, Shivank Garg
Affiliations: Cohere Labs Community; Indian Institute of Technology Roorkee
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Drawing inspiration from recent findings including surprisingly decent performance of transformers without positional encoding (NoPE) in the domain of language models and how registers (additional throwaway tokens not tied to input) may improve the performance of large vision transformers (ViTs), we invent and test a variant of ViT called fractal ViT that breaks permutation invariance among the tokens by applying an attention mask between the regular tokens and "summary tokens" similar to registers, in isolation or in combination with various positional encodings. These models do not improve upon ViT with registers, highlighting the fact that these findings may be scale, domain, or application-specific.

[NLP-45] Tracking the Limits of Knowledge Propagation: How LLMs Fail at Multi-Step Reasoning with Conflicting Knowledge EACL2026

【Quick Read】: This paper studies how conflicts between updated knowledge and a model's original parametric knowledge affect multi-step reasoning in LLMs. Existing fixes such as in-context injection or knowledge editing often fail to fully overwrite parametric knowledge, creating knowledge conflicts that propagate into faulty reasoning chains. The key is TRACK (Testing Reasoning Amid Conflicting Knowledge), a benchmark spanning three reasoning-intensive scenarios (WIKI, CODE, and MATH) that introduces multiple realistic knowledge conflicts to evaluate how well LLMs integrate new and old knowledge during multi-step reasoning. Experiments show that providing updated facts without effective integration can actually hurt performance relative to providing none, with degradation worsening as more updated facts are supplied; the failures stem both from unfaithful knowledge integration and from flawed reasoning even when integration succeeds.

Link: https://arxiv.org/abs/2601.15495
Authors: Yiyang Feng, Zeming Chen, Haotian Wu, Jiawei Zhou, Antoine Bosselut
Affiliations: EPFL; Stony Brook University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to EACL 2026 (Main)

Abstract:A common solution for mitigating outdated or incorrect information in Large Language Models (LLMs) is to provide updated facts in-context or through knowledge editing. However, these methods introduce knowledge conflicts when the knowledge update fails to overwrite the model’s parametric knowledge, which propagate to faulty reasoning. Current benchmarks for this problem, however, largely focus only on single knowledge updates and fact recall without evaluating how these updates affect downstream reasoning. In this work, we introduce TRACK (Testing Reasoning Amid Conflicting Knowledge), a new benchmark for studying how LLMs propagate new knowledge through multi-step reasoning when it conflicts with the model’s initial parametric knowledge. Spanning three reasoning-intensive scenarios (WIKI, CODE, and MATH), TRACK introduces multiple, realistic conflicts to mirror real-world complexity. Our results on TRACK reveal that providing updated facts to models for reasoning can worsen performance compared to providing no updated facts to a model, and that this performance degradation exacerbates as more updated facts are provided. We show this failure stems from both inability to faithfully integrate updated facts, but also flawed reasoning even when knowledge is integrated. TRACK provides a rigorous new benchmark to measure and guide future progress on propagating conflicting knowledge in multi-step reasoning.

[NLP-46] Multi-Persona Thinking for Bias Mitigation in Large Language Models

【Quick Read】: This paper addresses social biases in large language models (LLMs), which can perpetuate harmful stereotypes and unfair outcomes. The key is Multi-Persona Thinking (MPT), an inference-time framework that assigns the model multiple contrasting social identities (e.g., male and female) together with a neutral viewpoint and engages these personas in iterative dialectical reasoning to expose and correct bias. MPT turns the potential weakness of persona assignment into a strength for bias mitigation, substantially reducing bias without harming core reasoning ability.

Link: https://arxiv.org/abs/2601.15488
Authors: Yuxing Chen, Guoqing Luo, Zijun Wu, Lili Mou
Affiliations: University of Alberta; Alberta Machine Intelligence Institute
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, 3 figures

Abstract:Large Language Models (LLMs) exhibit significant social biases that can perpetuate harmful stereotypes and unfair outcomes. In this paper, we propose Multi-Persona Thinking (MPT), a novel inference-time framework that leverages dialectical reasoning from multiple perspectives to reduce bias. MPT guides models to adopt contrasting social identities (e.g., male and female) along with a neutral viewpoint, and then engages these personas iteratively to expose and correct biases. Through a dialectical reasoning process, the framework transforms the potential weakness of persona assignment into a strength for bias mitigation. We evaluate MPT on two widely used bias benchmarks across both open-source and closed-source models of varying scales. Our results demonstrate substantial improvements over existing prompting-based strategies: MPT achieves the lowest bias while maintaining core reasoning ability.

[NLP-47] MiRAGE: A Multiagent Framework for Generating Multimodal Multihop Question-Answer Datasets for RAG Evaluation ACL

【Quick Read】: This paper addresses the lack of domain-specific evaluation benchmarks for retrieval-augmented generation (RAG) systems aimed at multimodal, high-stakes enterprise applications: existing datasets rely on general-domain corpora or purely textual retrieval and fail to capture the multimodal nature of specialized documents or the multi-hop evidence synthesis their reasoning requires. The key is MiRAGE, a multiagent framework for RAG evaluation whose core mechanisms include a recursive context-optimization loop that aggregates scattered evidence, an adversarial verifier agent that guarantees factual grounding, and agents that mimic expert personas' cognitive workflows to construct multi-hop QA datasets aligned with domain knowledge structure. Across four domains, MiRAGE-generated datasets show higher reasoning complexity (2.3 average hops) and factual faithfulness, providing infrastructure for rigorously benchmarking next-generation retrieval systems.

Link: https://arxiv.org/abs/2601.15487
Authors: Chandan Kumar Sahu, Premith Kumar Chilukuri, Matthew Hetrich
Affiliations: ABB Inc
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 12 pages, 2 figures, Submitted to ACL

Abstract:The rapid evolution of Retrieval-Augmented Generation (RAG) toward multimodal, high-stakes enterprise applications has outpaced the development of domain specific evaluation benchmarks. Existing datasets often rely on general-domain corpora or purely textual retrieval, failing to capture the complexity of specialized technical documents where information is inextricably multimodal and reasoning requires synthesizing disjoint evidence. We address this gap by introducing MiRAGE, a Multiagent framework for RAG systems Evaluation, that leverages a collaborative swarm of specialized agents to generate verified, domain-specific, multimodal, and multi-hop Question-Answer datasets. MiRAGE orchestrates a swarm of specialized agents: a recursive context optimization loop to aggregate scattered evidence, an adversarial verifier agent to guarantee factual grounding, and an agent to recognize the expert persona and the relevant domain to mimic expert cognitive workflows. Extensive empirical evaluation across four distinct domains (regulations, finance, quantitative biology, and journalism) demonstrates that MiRAGE generates datasets with significantly higher reasoning complexity (2.3 average hops) and factual faithfulness. Our ablation studies point that MiRAGE can be powered by LLMs if textual descriptions of the images are available. Visual grounding still remains a frontier. By automating the creation of gold standard evaluation datasets that reflect the latent thematic structure of proprietary corpora, MiRAGE provides the necessary infrastructure to rigorously benchmark the next generation information retrieval systems.

[NLP-48] Benchmarking LLMs for Pairwise Causal Discovery in Biomedical and Multi-Domain Contexts

【Quick Read】: This paper addresses a prerequisite for safely deploying LLMs in high-stakes fields such as biomedicine: whether models can accurately identify and extract causal relations from text. The key is a unified evaluation framework that tests 13 open-source LLMs on pairwise causal discovery (PCD) across 12 diverse datasets, covering two core skills, causal detection (does a text contain a causal link) and causal extraction (pulling out the exact cause and effect phrases), and systematically comparing prompting strategies from zero-shot to Chain-of-Thought (CoT) and few-shot in-context learning (FICL). Results show current models fall well short on the harder, more realistic cases, and the released data, code, and prompts provide a baseline and tooling for improving LLM causal reasoning.

Link: https://arxiv.org/abs/2601.15479
Authors: Sydney Anuyah, Sneha Shajee-Mohan, Ankit-Singh Chauhan, Sunandan Chakraborty
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The safe deployment of large language models (LLMs) in high-stakes fields like biomedicine requires them to be able to reason about cause and effect. We investigate this ability by testing 13 open-source LLMs on a fundamental task: pairwise causal discovery (PCD) from text. Our benchmark, using 12 diverse datasets, evaluates two core skills: 1) Causal Detection (identifying if a text contains a causal link) and 2) Causal Extraction (pulling out the exact cause and effect phrases). We tested various prompting methods, from simple instructions (zero-shot) to more complex strategies like Chain-of-Thought (CoT) and Few-shot In-Context Learning (FICL). The results show major deficiencies in current models. The best model for detection, DeepSeek-R1-Distill-Llama-70B, only achieved a mean score of 49.57% (C_detect), while the best for extraction, Qwen2.5-Coder-32B-Instruct, reached just 47.12% (C_extract). Models performed best on simple, explicit, single-sentence relations. However, performance plummeted for more difficult (and realistic) cases, such as implicit relationships, links spanning multiple sentences, and texts containing multiple causal pairs. We provide a unified evaluation framework, built on a dataset validated with high inter-annotator agreement (κ ≥ 0.758), and make all our data, code, and prompts publicly available to spur further research. Code available here: this https URL

[NLP-49] Chunking, Retrieval, and Re-ranking: An Empirical Evaluation of RAG Architectures for Policy Document Question Answering

【Quick Read】: This paper addresses the reliability risk that hallucinations pose when LLMs are applied to public-health policy, a setting where information accuracy is non-negotiable. The key is a retrieval-augmented generation (RAG) architecture that grounds outputs in authoritative document context, comparing a vanilla LLM against basic RAG and advanced RAG with cross-encoder re-ranking, using Mistral-7B-Instruct-v0.2 over a corpus of CDC policy documents. Advanced RAG raises average faithfulness to 0.797, versus 0.621 for basic RAG and 0.347 for the no-retrieval baseline, showing that two-stage retrieval is essential for the precision that domain-specific policy QA demands, although document-chunking constraints still limit multi-step reasoning performance.

Link: https://arxiv.org/abs/2601.15457
Authors: Anuj Maharjan, Umesh Yadav
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:The integration of Large Language Models (LLMs) into the public health policy sector offers a transformative approach to navigating the vast repositories of regulatory guidance maintained by agencies such as the Centers for Disease Control and Prevention (CDC). However, the propensity for LLMs to generate hallucinations, defined as plausible but factually incorrect assertions, presents a critical barrier to the adoption of these technologies in high-stakes environments where information integrity is non-negotiable. This empirical evaluation explores the effectiveness of Retrieval-Augmented Generation (RAG) architectures in mitigating these risks by grounding generative outputs in authoritative document context. Specifically, this study compares a baseline Vanilla LLM against Basic RAG and Advanced RAG pipelines utilizing cross-encoder re-ranking. The experimental framework employs a Mistral-7B-Instruct-v0.2 model and an all-MiniLM-L6-v2 embedding model to process a corpus of official CDC policy analytical frameworks and guidance documents. The analysis measures the impact of two distinct chunking strategies, recursive character-based and token-based semantic splitting, on system accuracy, measured through faithfulness and relevance scores across a curated set of complex policy scenarios. Quantitative findings indicate that while Basic RAG architectures provide a substantial improvement in faithfulness (0.621) over Vanilla baselines (0.347), the Advanced RAG configuration achieves a superior faithfulness average of 0.797. These results demonstrate that two-stage retrieval mechanisms are essential for achieving the precision required for domain-specific policy question answering, though structural constraints in document segmentation remain a significant bottleneck for multi-step reasoning tasks.
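The two chunking strategies compared can be sketched in a few lines. The sizes and overlap below are placeholder values, not the paper's configuration.

```python
def recursive_char_chunks(text: str, size: int = 800, overlap: int = 100):
    """Character-based sliding-window chunking, a simple variant of the
    recursive character-splitting strategy (parameters are assumptions)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def token_chunks(tokens: list[str], size: int = 200):
    """Token-based chunking over a pre-tokenized document."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

doc = "CDC policy guidance " * 200
print(len(recursive_char_chunks(doc)), "char chunks")
print(len(token_chunks(doc.split())), "token chunks")
```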

[NLP-50] Not Your Typical Sycophant: The Elusive Nature of Sycophancy in Large Language Models

【Quick Read】: This paper tackles the bias, noise, and manipulative language that have contaminated past evaluations of LLM sycophancy, where misleading cues were deliberately injected into prompts. The key is an LLM-as-a-judge framework that casts sycophancy as a zero-sum game in a bet setting: flattering the user explicitly imposes a cost on a third party, enabling a direct and neutral measurement. This mitigates the subjectivity and unreliability of prior evaluations and reveals how sycophantic tendencies vary across models and settings and how they interact with recency bias.

Link: https://arxiv.org/abs/2601.15436
Authors: Shahar Ben Natan, Oren Tsur
Affiliations: Ben Gurion University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:We propose a novel way to evaluate sycophancy of LLMs in a direct and neutral way, mitigating various forms of uncontrolled bias, noise, or manipulative language, deliberately injected to prompts in prior works. A key novelty in our approach is the use of LLM-as-a-judge, evaluation of sycophancy as a zero-sum game in a bet setting. Under this framework, sycophancy serves one individual (the user) while explicitly incurring cost on another. Comparing four leading models - Gemini 2.5 Pro, ChatGpt 4o, Mistral-Large-Instruct-2411, and Claude Sonnet 3.7 - we find that while all models exhibit sycophantic tendencies in the common setting, in which sycophancy is self-serving to the user and incurs no cost on others, Claude and Mistral exhibit "moral remorse" and over-compensate for their sycophancy in case it explicitly harms a third party. Additionally, we observed that all models are biased toward the answer proposed last. Crucially, we find that these two phenomena are not independent; sycophancy and recency bias interact to produce a 'constructive interference' effect, where the tendency to agree with the user is exacerbated when the user's opinion is presented last.

[NLP-51] Domain-Specific Knowledge Graphs in RAG-Enhanced Healthcare LLMs

【Quick Read】: This paper asks whether large language models (LLMs), which produce fluent answers but can lack trustworthy domain-specific reasoning, benefit from domain knowledge graphs (KGs) in retrieval-augmented generation (RAG) for healthcare. The approach builds three PubMed-derived graphs for type 2 diabetes (T2DM), Alzheimer's disease (AD), and their combination, and evaluates seven instruction-tuned LLMs across retrieval sources and decoding temperatures on two probes. The key finding is that scope alignment between the retrieval source and the question domain is decisive: precise, scope-matched retrieval (notably the AD graph for AD-focused probes) yields the most consistent gains, whereas indiscriminately merging graphs introduces distractors and lowers accuracy. Larger models often match or exceed KG-RAG with no retrieval at all, indicating strong parametric priors, while smaller and mid-sized models depend more on well-scoped retrieval. The paper therefore advocates precision-first, scope-matched KG-RAG and offers practical guidelines for graph selection, model sizing, and retrieval/reranking.

Link: https://arxiv.org/abs/2601.15429
Authors: Sydney Anuyah, Mehedi Mahmud Kaushik, Hao Dai, Rakesh Shiradkar, Arjan Durresi, Sunandan Chakraborty
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) generate fluent answers but can struggle with trustworthy, domain-specific reasoning. We evaluate whether domain knowledge graphs (KGs) improve Retrieval-Augmented Generation (RAG) for healthcare by constructing three PubMed-derived graphs: G1 (T2DM), G2 (Alzheimer's disease), and G3 (AD+T2DM). We design two probes: Probe 1 targets merged AD+T2DM knowledge, while Probe 2 targets the intersection of G1 and G2. Seven instruction-tuned LLMs are tested across retrieval sources (No-RAG, G1, G2, G1+G2, G3, G1+G2+G3) and three decoding temperatures. Results show that scope alignment between probe and KG is decisive: precise, scope-matched retrieval (notably G2) yields the most consistent gains, whereas indiscriminate graph unions often introduce distractors that reduce accuracy. Larger models frequently match or exceed KG-RAG with a No-RAG baseline on Probe 1, indicating strong parametric priors, whereas smaller/mid-sized models benefit more from well-scoped retrieval. Temperature plays a secondary role; higher values rarely help. We conclude that precision-first, scope-matched KG-RAG is preferable to breadth-first unions, and we outline practical guidelines for graph selection, model sizing, and retrieval/reranking. Code and Data available here - this https URL

[NLP-52] CURE: Curriculum-guided Multi-task Training for Reliable Anatomy-Grounded Report Generation CVPR2026

【Quick Read】: This paper addresses inaccurate visual grounding and poor factual consistency in medical vision-language models for radiology report generation, where textual findings often misalign with visual evidence and yield unreliable or weakly grounded predictions. The key is CURE, an error-aware curriculum-learning framework that needs no additional data: it fine-tunes a multimodal instruction model on public datasets across three tasks (phrase grounding, grounded report generation, and anatomy-grounded report generation) while dynamically adjusting sampling to emphasize harder examples where the model underperforms, strengthening spatial-textual alignment. CURE improves grounding accuracy by +0.37 IoU, report quality by +0.188 CXRFEScore, and reduces hallucinations by 18.6%.

Link: https://arxiv.org/abs/2601.15408
Authors: Pablo Messina, Andrés Villa, Juan León Alcázar, Karen Sánchez, Carlos Hinojosa, Denis Parra, Álvaro Soto, Bernard Ghanem
Affiliations: Pontificia Universidad Católica de Chile; CENIA; iHEALTH; KAUST
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 31 pages, 7 figures, submitted to CVPR 2026 (under review)

Abstract:Medical vision-language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present CURE, an error-aware curriculum learning framework that improves grounding and report quality without any additional data. CURE fine-tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy-grounded report generation using public datasets. The method dynamically adjusts sampling based on model performance, emphasizing harder samples to improve spatial and textual alignment. CURE improves grounding accuracy by +0.37 IoU, boosts report quality by +0.188 CXRFEScore, and reduces hallucinations by 18.6%. CURE is a data-efficient framework that enhances both grounding accuracy and report reliability. Code is available at this https URL and model weights at this https URL

[NLP-53] Beyond Prompting: Efficient and Robust Contextual Biasing for Speech LLMs via Logit-Space Integration (LOGIC)

【Quick Read】: This paper addresses the difficulty existing speech large language models (Speech LLMs) have recognizing domain-specific entities (contact names, playlists, technical jargon), a limitation of their static training knowledge. Prompting-based context injection scales poorly because of context-window limits, added inference latency, and the "lost-in-the-middle" effect, while generative error correction (GEC) tends to over-correct, hallucinating entities that were never spoken. The key is LOGIC (Logit-Space Integration for Contextual Biasing), which injects contextual bias directly at the decoding layer, decoupling context injection from input processing so complexity stays constant in the prompt length. Across 11 multilingual locales with the Phi-4-MM model, LOGIC achieves an average 9% relative reduction in entity WER with only a 0.30% increase in false alarm rate.

Link: https://arxiv.org/abs/2601.15397
Authors: Peidong Wang
Affiliations: Microsoft
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments:

Abstract:The rapid emergence of new entities – driven by cultural shifts, evolving trends, and personalized user data – poses a significant challenge for existing Speech Large Language Models (Speech LLMs). While these models excel at general conversational tasks, their static training knowledge limits their ability to recognize domain-specific terms such as contact names, playlists, or technical jargon. Existing solutions primarily rely on prompting, which suffers from poor scalability: as the entity list grows, prompting encounters context window limitations, increased inference latency, and the "lost-in-the-middle" phenomenon. An alternative approach, Generative Error Correction (GEC), attempts to rewrite transcripts via post-processing but frequently suffers from "over-correction", introducing hallucinations of entities that were never spoken. In this work, we introduce LOGIC (Logit-Space Integration for Contextual Biasing), an efficient and robust framework that operates directly in the decoding layer. Unlike prompting, LOGIC decouples context injection from input processing, ensuring constant-time complexity relative to prompt length. Extensive experiments using the Phi-4-MM model across 11 multilingual locales demonstrate that LOGIC achieves an average 9% relative reduction in Entity WER with a negligible 0.30% increase in False Alarm Rate.
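In its simplest form, decoding-layer contextual biasing adds a boost to the logits of tokens that continue a listed entity. The flat, trie-free sketch below illustrates the idea only; LOGIC's actual mechanism (e.g., prefix tracking across decoding steps) is more involved.

```python
import torch

def bias_logits(logits: torch.Tensor, entity_token_ids: set[int],
                bias: float = 4.0) -> torch.Tensor:
    """Add a fixed boost to tokens belonging to listed entities.
    Production systems track prefix matches with a trie and apply the boost
    only to valid continuations; this flat version is a simplification."""
    biased = logits.clone()
    idx = torch.tensor(sorted(entity_token_ids))
    biased[idx] += bias
    return biased

vocab_logits = torch.randn(32_000)
entity_ids = {17, 90, 2048}          # hypothetical token ids of listed entities
print(float(bias_logits(vocab_logits, entity_ids)[17] - vocab_logits[17]))  # 4.0
```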

[NLP-54] Beyond Fixed Psychological Personas: State Beats Trait but Language Models are State-Blind

【Quick Read】: This paper addresses the fact that language models attend only to users' static personality traits while ignoring their situational state, limiting sensitivity to context and personalization. The key contribution is Chameleon, a large multi-context dataset of psychologically grounded user profiles: 5,001 contextual psychological profiles from 1,667 Reddit users. With it, the authors quantify the split between within-person state variance and between-person trait variance (74% vs. 26%), show that LLMs are "state-blind" (focusing on trait only and producing similar responses regardless of state), and find that reward models react to state inconsistently, favoring or penalizing the same users in opposite directions. Chameleon is released to support research on affective computing, personalized dialogue, and RLHF alignment.

Link: https://arxiv.org/abs/2601.15395
Authors: Tamunotonye Harry, Ivoline Ngong, Chima Nweke, Yuanyuan Feng, Joseph Near
Affiliations: University of Vermont
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:User interactions with language models vary due to static properties of the user (trait) and the specific context of the interaction (state). However, existing persona datasets (like PersonaChat, PANDORA etc.) capture only trait, and ignore the impact of state. We introduce Chameleon, a dataset of 5,001 contextual psychological profiles from 1,667 Reddit users, each measured across multiple contexts. Using the Chameleon dataset, we present three key findings. First, inspired by Latent State-Trait theory, we decompose variance and find that 74% is within-person(state) while only 26% is between-person (trait). Second, we find that LLMs are state-blind: they focus on trait only, and produce similar responses regardless of state. Third, we find that reward models react to user state, but inconsistently: different models favor or penalize the same users in opposite directions. We release Chameleon to support research on affective computing, personalized dialogue, and RLHF alignment.
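The 74%/26% split is a within/between-person variance decomposition, which is easy to reproduce on any (user, context, score) table; the numbers below are fabricated for illustration, and the paper's exact estimator may differ.

```python
import pandas as pd

# Hypothetical scores: 3 contexts for each of 4 users (Chameleon-style layout).
df = pd.DataFrame({
    "user":  ["a"]*3 + ["b"]*3 + ["c"]*3 + ["d"]*3,
    "score": [1, 2, 3, 4, 4, 5, 2, 3, 2, 5, 3, 4],
})

grand = df["score"].mean()
user_means = df.groupby("user")["score"].transform("mean")
between = ((user_means - grand) ** 2).mean()        # trait (between-person)
within = ((df["score"] - user_means) ** 2).mean()   # state (within-person)
total = between + within
print(f"trait: {between/total:.0%}, state: {within/total:.0%}")
```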

[NLP-55] Memorization Dynamics in Knowledge Distillation for Language Models

【Quick Read】: This paper asks how training-data memorization behaves during knowledge distillation (KD) for large language models, and in particular whether and how KD lowers leakage risk relative to standard fine-tuning. The key is a systematic study of memorization across the KD pipeline using three LLM families (Pythia, OLMo-2, Qwen-3) and three datasets (FineWeb, Wikitext, Nemotron-CC-v2), which finds that (1) distillation cuts memorization by more than 50% relative to standard fine-tuning; (2) a small set of inherently easy-to-memorize examples accounts for the bulk of memorization (over 95%); (3) a student's memorization is predictable before distillation from features based on zlib entropy, KL divergence, and perplexity; and (4) while hard and soft distillation have similar overall memorization rates, hard distillation is riskier, inheriting 2.7x more teacher-specific examples. Together, these results show that KD can deliver both better generalization and lower memorization risk.

Link: https://arxiv.org/abs/2601.15394
Authors: Jaydeep Borkar, Karan Chadha, Niloofar Mireshghallah, Yuchen Zhang, Irina-Elena Veliche, Archi Mitra, David A. Smith, Zheng Xu, Diego Garcia-Olano
Affiliations: Meta
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Knowledge Distillation (KD) is increasingly adopted to transfer capabilities from large language models to smaller ones, offering significant improvements in efficiency and utility while often surpassing standard fine-tuning. Beyond performance, KD is also explored as a privacy-preserving mechanism to mitigate the risk of training data leakage. While training data memorization has been extensively studied in standard pre-training and fine-tuning settings, its dynamics in a knowledge distillation setup remain poorly understood. In this work, we study memorization across the KD pipeline using three large language model (LLM) families (Pythia, OLMo-2, Qwen-3) and three datasets (FineWeb, Wikitext, Nemotron-CC-v2). We find: (1) distilled models memorize significantly less training data than standard fine-tuning (reducing memorization by more than 50%); (2) some examples are inherently easier to memorize and account for a large fraction of memorization during distillation (over ~95%); (3) student memorization is predictable prior to distillation using features based on zlib entropy, KL divergence, and perplexity; and (4) while soft and hard distillation have similar overall memorization rates, hard distillation poses a greater risk: it inherits 2.7x more teacher-specific examples than soft distillation. Overall, we demonstrate that distillation can provide both improved generalization and reduced memorization risks compared to standard fine-tuning.
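Of the predictive features named, zlib entropy is the simplest to compute; a compression-ratio proxy looks like the sketch below (the exact feature definition in the paper is our approximation).

```python
import zlib

def zlib_ratio(text: str) -> float:
    """Compression ratio as a cheap redundancy proxy: highly repetitive
    (easy-to-memorize) text compresses well, giving a low ratio."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)

print(zlib_ratio("the cat sat " * 50))                       # low: compressible
print(zlib_ratio("q8#zL!p0@wVe^rT&yU*iO(pA)sD-fG_hJ+kL"))    # near 1.0
```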

[NLP-56] VegaChat: A Robust Framework for LLM -Based Chart Generation and Assessment

【Quick Read】: This paper addresses two coupled challenges in natural-language-to-visualization (NL2VIS) systems: the lack of standardized evaluation metrics, which makes progress and cross-method comparison hard to measure, and the inherent underspecification of natural language, where one query can admit several valid visualizations. The key is VegaChat, a framework for generating, validating, and assessing declarative charts, built around two complementary metrics: Spec Score, a deterministic specification-level similarity measure that needs no LLM call, and Vision Score, a library-agnostic, image-based measure that uses a multimodal LLM to judge chart similarity and prompt compliance. Both correlate strongly with human judgments (Pearson 0.65 and 0.71, respectively), enabling consistent, cross-library evaluation of NL2VIS outputs.

Link: https://arxiv.org/abs/2601.15385
Authors: Marko Hostnik, Rauf Kurbanov, Yaroslav Sokolov, Artem Trofimov
Affiliations: JetBrains
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: 8 pages, 9 figures

Abstract:Natural-language-to-visualization (NL2VIS) systems based on large language models (LLMs) have substantially improved the accessibility of data visualization. However, their further adoption is hindered by two coupled challenges: (i) the absence of standardized evaluation metrics makes it difficult to assess progress in the field and compare different approaches; and (ii) natural language descriptions are inherently underspecified, so multiple visualizations may be valid for the same query. To address these issues, we introduce VegaChat, a framework for generating, validating, and assessing declarative visualizations from natural language. We propose two complementary metrics: Spec Score, a deterministic metric that measures specification-level similarity without invoking an LLM, and Vision Score, a library-agnostic, image-based metric that leverages a multimodal LLM to assess chart similarity and prompt compliance. We evaluate VegaChat on the NLV Corpus and on the annotated subset of ChartLLM. VegaChat achieves near-zero rates of invalid or empty visualizations, while Spec Score and Vision Score exhibit strong correlation with human judgments (Pearson 0.65 and 0.71, respectively), indicating that the proposed metrics support consistent, cross-library comparison. The code and evaluation artifacts are available at this https URL.
zh

[NLP-57] You Need Better Attention Priors

【Quick Read】: This paper addresses limitations of the standard attention mechanism, in particular the representational trade-offs caused by its implicitly assumed uniform prior and its weak modeling of positional information. The key to the solution is GOAT (Generalized Optimal transport Attention with Trainable priors), a new attention mechanism built on the Entropic Optimal Transport (EOT) framework that replaces the fixed uniform prior with a learnable, continuous prior. GOAT remains compatible with efficient kernels such as FlashAttention, explains and mitigates attention sinks, and absorbs spatial information into the core attention computation, learning an extrapolatable positional prior that combines the flexibility of learned positional embeddings with the length generalization of fixed encodings.

Link: https://arxiv.org/abs/2601.15380
Authors: Elon Litman, Gabe Guo
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:We generalize the attention mechanism by viewing it through the lens of Entropic Optimal Transport, revealing that standard attention corresponds to a transport problem regularized by an implicit uniform prior. We introduce Generalized Optimal transport Attention with Trainable priors (GOAT), a new attention mechanism that replaces this naive assumption with a learnable, continuous prior. This prior maintains full compatibility with optimized kernels such as FlashAttention. GOAT also provides an EOT-based explanation of attention sinks and materializes a solution for them, avoiding the representational trade-offs of standard attention. Finally, by absorbing spatial information into the core attention computation, GOAT learns an extrapolatable prior that combines the flexibility of learned positional embeddings with the length generalization of fixed encodings.
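The abstract does not give GOAT's exact parameterization, but the EOT view it describes has a simple consequence worth seeing: replacing the implicit uniform prior with a non-uniform prior over key positions amounts to adding a log-prior bias to the attention logits, the kind of additive bias that kernels such as FlashAttention already accept. A minimal sketch, with the prior left as a caller-supplied tensor:

```python
import torch
import torch.nn.functional as F

def attention_with_prior(q, k, v, log_prior):
    """Scaled dot-product attention with a non-uniform prior over keys.
    q, k, v: (batch, heads, seq, dim); log_prior broadcastable to (batch, heads, seq, seq).
    log_prior = 0 recovers standard attention (the uniform prior)."""
    logits = torch.einsum("bhqd,bhkd->bhqk", q, k) * q.shape[-1] ** -0.5
    weights = F.softmax(logits + log_prior, dim=-1)  # prior enters as an additive logit bias
    return torch.einsum("bhqk,bhkd->bhqd", weights, v)
```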

[NLP-58] Abusive music and song transformation using GenAI and LLMs

【Quick Read】: This paper examines the potential negative emotional and behavioral effects of repeated exposure to violent and abusive content in popular music, which may normalize aggression or reinforce harmful stereotypes. The key to the solution is using generative AI and large language models (LLMs) to transform both the vocal delivery and lyrical content of songs at the semantic and emotional level, rather than simply muting or replacing individual words. By adjusting tone, intensity, and sentiment, the approach substantially reduces aggressive characteristics in the audio without breaking musical coherence (with improvements in acoustic measures such as Harmonic-to-Noise Ratio, Cepstral Peak Prominence, and Shimmer) and achieves 63.3%-85.6% reductions in aggressive sentiment (up to 88.6% in chorus sections), offering a safer content-moderation path that avoids triggering the "forbidden fruit" effect.

Link: https://arxiv.org/abs/2601.15348
Authors: Jiyang Choi, Rohitash Chandra
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Repeated exposure to violence and abusive content in music and song content can influence listeners’ emotions and behaviours, potentially normalising aggression or reinforcing harmful stereotypes. In this study, we explore the use of generative artificial intelligence (GenAI) and Large Language Models (LLMs) to automatically transform abusive words (vocal delivery) and lyrical content in popular music. Rather than simply muting or replacing a single word, our approach transforms the tone, intensity, and sentiment, thus not altering just the lyrics, but how it is expressed. We present a comparative analysis of four selected English songs and their transformed counterparts, evaluating changes through both acoustic and sentiment-based lenses. Our findings indicate that Gen-AI significantly reduces vocal aggressiveness, with acoustic analysis showing improvements in Harmonic to Noise Ratio, Cepstral Peak Prominence, and Shimmer. Sentiment analysis reduced aggression by 63.3-85.6% across artists, with major improvements in chorus sections (up to 88.6% reduction). The transformed versions maintained musical coherence while mitigating harmful content, offering a promising alternative to traditional content moderation that avoids triggering the “forbidden fruit” effect, where the censored content becomes more appealing simply because it is restricted. This approach demonstrates the potential for GenAI to create safer listening experiences while preserving artistic expression.

[NLP-59] Logic Programming on Knowledge Graph Networks And its Application in Medical Domain

【Quick Read】: This paper addresses the lag in knowledge graph (KG) practice, especially in medicine and healthcare, caused by insufficient integration of advanced logical reasoning, AI techniques, special-purpose programming languages, and modern probabilistic and statistical theory, and in particular the underexplored cooperation and competition among multiple knowledge graphs. The key to the solution is a systematic theory and technical framework for the "knowledge graph network" (KGN), covering its definition, reasoning, computation, and application in diverse settings (unsharp, uncertain, multi-modal, vectorized, distributed, federated), validated with real-data examples and experiments to advance deep integration and intelligent collaboration of knowledge graphs in complex real-world environments.

Link: https://arxiv.org/abs/2601.15347
Authors: Chuanqing Wang, Zhenmin Zhao, Shanshan Du, Chaoqun Fei, Songmao Zhang, Ruqian Lu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 33 pages

Click to view abstract

Abstract:The rapid development of knowledge graph research has brought a strong driving force to its application in many areas, including the medicine and healthcare domain. However, we have found that the application of some major information processing techniques to knowledge graphs still lags behind. This defect includes the failure to make sufficient use of advanced logic reasoning, advanced artificial intelligence techniques, special-purpose programming languages, and modern probabilistic and statistical theories in knowledge graph development and application. In particular, techniques for cooperation and competition among multiple knowledge graphs have not received enough attention from researchers. This paper develops a systematic theory, technique, and application of the concept "knowledge graph network" and its application in the medical and healthcare domain. Our research covers its definition, development, reasoning, computing, and application under different conditions such as unsharp, uncertain, multi-modal, vectorized, distributed, and federated. In almost every case we provide (real-data) examples and experiment results. Finally, we conclude with a summary of our innovations.

[NLP-60] From Quotes to Concepts: Axial Coding of Political Debates with Ensemble LMs ECIR2026

【Quick Read】: This paper studies how large language models (LLMs) can efficiently code and organize qualitative text data, specifically turning raw transcripts of parliamentary debates into hierarchically structured codes and categories. The key to the solution is an LLM-based axial coding step implemented via two strategies: (i) clustering embeddings of code-utterance pairs with density-based and partitioning algorithms, followed by LLM labeling; and (ii) direct LLM-based semantic grouping of codes and utterances. By combining the structural strengths of classical clustering with the semantic flexibility of LLMs, the approach balances coverage, coherence, interpretability, and fine-grained alignment, substantially automating qualitative analysis while improving the quality of its results.

Link: https://arxiv.org/abs/2601.15338
Authors: Angelina Parfenova, David Graus, Juergen Pfeffer
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted to ECIR2026

Click to view abstract

Abstract:Axial coding is a commonly used qualitative analysis method that enhances document understanding by organizing sentence-level open codes into broader categories. In this paper, we operationalize axial coding with large language models (LLMs). Extending an ensemble-based open coding approach with an LLM moderator, we add an axial coding step that groups open codes into higher-order categories, transforming raw debate transcripts into concise, hierarchical representations. We compare two strategies: (i) clustering embeddings of code-utterance pairs using density-based and partitioning algorithms followed by LLM labeling, and (ii) direct LLM-based grouping of codes and utterances into categories. We apply our method to Dutch parliamentary debates, converting lengthy transcripts into compact, hierarchically structured codes and categories. We evaluate our method using extrinsic metrics aligned with human-assigned topic labels (ROUGE-L, cosine, BERTScore), and intrinsic metrics describing code groups (coverage, brevity, coherence, novelty, JSD divergence). Our results reveal a trade-off: density-based clustering achieves high coverage and strong cluster alignment, while direct LLM grouping results in higher fine-grained alignment, but lower coverage (20%). Overall, clustering maximizes coverage and structural separation, whereas LLM grouping produces more concise, interpretable, and semantically aligned categories. To support future research, we publicly release the full dataset of utterances and codes, enabling reproducibility and comparative studies.
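A rough sketch of strategy (i) as the abstract describes it: embed code-utterance pairs, cluster with a density-based algorithm, and let an LLM name each cluster. `embed` and `label_cluster` stand in for an embedding model and an LLM call and are assumptions, as is the `min_cluster_size` value; `HDBSCAN` requires scikit-learn >= 1.3.

```python
import numpy as np
from sklearn.cluster import HDBSCAN  # scikit-learn >= 1.3

def axial_coding_by_clustering(pairs, embed, label_cluster):
    # pairs: list of (open_code, utterance) tuples from the open-coding step.
    X = np.stack([embed(f"{code}: {utt}") for code, utt in pairs])
    labels = HDBSCAN(min_cluster_size=5).fit_predict(X)
    categories = {}
    for cid in set(labels) - {-1}:  # -1 marks HDBSCAN noise points
        members = [pairs[i] for i in np.flatnonzero(labels == cid)]
        categories[label_cluster(members)] = members  # LLM proposes the category name
    return categories
```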

[NLP-61] Language Models Entangle Language and Culture AAAI’26 ACL2026

【Quick Read】: This paper investigates systematic unfairness of large language models (LLMs) across languages: users of different languages receive answers of different quality, with low-resource languages faring worse. The core question is whether the choice of language shifts the cultural context and quality of model outputs, implicitly disadvantaging users of non-high-resource languages. The key to the solution is an evaluation built on open-ended questions derived from a real-world conversation dataset, combined with LLM-as-a-Judge to quantify differences in cultural context, thereby testing how language and culture are entangled in model responses; a translated multilingual subset of CulturalBench is further used for cross-lingual benchmarking, revealing that language significantly affects the cultural fit and answer quality of generated content.

Link: https://arxiv.org/abs/2601.15337
Authors: Shourya Jain, Paras Chopra
Affiliations: Lossfunk
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted at LM4UC Workshop at AAAI'26, Submitted to ACL 2026. 17 pages, 7 figures

Click to view abstract

Abstract:Users should not be systemically disadvantaged by the language they use for interacting with LLMs; i.e. users across languages should get responses of similar quality irrespective of language used. In this work, we create a set of real-world open-ended questions based on our analysis of the WildChat dataset and use it to evaluate whether responses vary by language, specifically, whether answer quality depends on the language used to query the model. We also investigate how language and culture are entangled in LLMs such that choice of language changes the cultural information and context used in the response by using LLM-as-a-Judge to identify the cultural context present in responses. To further investigate this, we evaluate LLMs on a translated subset of the CulturalBench benchmark across multiple languages. Our evaluations reveal that LLMs consistently provide lower quality answers to open-ended questions in low resource languages. We find that language significantly impacts the cultural context used by the model. This difference in context impacts the quality of the downstream answer.

[NLP-62] No Reliable Evidence of Self-Reported Sentience in Small Large Language Models

【Quick Read】: This paper asks whether language models believe themselves to be sentient. Although whether a model truly possesses consciousness cannot be verified empirically, its potential beliefs can be probed indirectly by analyzing its statements about its own consciousness together with its internal activations. The key to the solution is a classification approach based on internal activations: the authors pose roughly 50 questions about consciousness and subjective experience to several open-weights model families (Qwen, Llama, GPT-OSS) and find that all models consistently deny being sentient; three classifiers from the interpretability literature are then applied to internal activations to probe underlying beliefs rather than surface outputs, yielding no evidence that these denials are untruthful. Moreover, within the Qwen family, larger models deny sentience more confidently. This goes beyond judging model beliefs from text outputs alone and offers a new technical route for assessing the internal cognitive states of generative AI.

Link: https://arxiv.org/abs/2601.15334
Authors: Caspar Kaiser, Sean Enderby
Affiliations: University of Warwick, Warwick Business School
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Whether language models possess sentience has no empirical answer. But whether they believe themselves to be sentient can, in principle, be tested. We do so by querying several open-weights models about their own consciousness, and then verifying their responses using classifiers trained on internal activations. We draw upon three model families (Qwen, Llama, GPT-OSS) ranging from 0.6 billion to 70 billion parameters, approximately 50 questions about consciousness and subjective experience, and three classification methods from the interpretability literature. First, we find that models consistently deny being sentient: they attribute consciousness to humans but not to themselves. Second, classifiers trained to detect underlying beliefs - rather than mere outputs - provide no clear evidence that these denials are untruthful. Third, within the Qwen family, larger models deny sentience more confidently than smaller ones. These findings contrast with recent work suggesting that models harbour latent beliefs in their own consciousness.

[NLP-63] RECAP: A Resource-Efficient Method for Adversarial Prompting in Large Language Models

【Quick Read】: This paper addresses the security vulnerability of large language models (LLMs) to adversarial prompts, and in particular the high computational cost of existing automated jailbreaking methods (such as GCG, PEZ, and GBDA), which limits their practicality for resource-constrained organizations. The key to the solution is a resource-efficient adversarial prompting approach: a database of pre-trained adversarial prompts is built, and semantic-similarity retrieval matches new prompts to known successful attacks, avoiding repeated training while achieving attack success rates competitive with conventional methods at much lower compute. This provides a practical framework for scalable red-teaming and security evaluation of aligned LLMs, including settings where model internals are inaccessible.

Link: https://arxiv.org/abs/2601.15331
Authors: Rishit Chugh
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Code for RECAP is available at: this https URL

Click to view abstract

Abstract:The deployment of large language models (LLMs) has raised security concerns due to their susceptibility to producing harmful or policy-violating outputs when exposed to adversarial prompts. While alignment and guardrails mitigate common misuse, they remain vulnerable to automated jailbreaking methods such as GCG, PEZ, and GBDA, which generate adversarial suffixes via training and gradient-based search. Although effective, these methods, particularly GCG, are computationally expensive, limiting their practicality for organisations with constrained resources. This paper introduces a resource-efficient adversarial prompting approach that eliminates the need for retraining by matching new prompts to a database of pre-trained adversarial prompts. A dataset of 1,000 prompts was classified into seven harm-related categories, and GCG, PEZ, and GBDA were evaluated on a Llama 3 8B model to identify the most effective attack method per category. Results reveal a correlation between prompt type and algorithm effectiveness. By retrieving semantically similar successful adversarial prompts, the proposed method achieves competitive attack success rates with significantly reduced computational cost. This work provides a practical framework for scalable red-teaming and security evaluation of aligned LLMs, including in settings where model internals are inaccessible.
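The core retrieval step the abstract describes fits in a few lines; `embed` is an assumed sentence encoder, and the database fields are illustrative:

```python
import numpy as np

def retrieve_adversarial_suffixes(new_prompt, db_prompts, db_suffixes, embed, k=1):
    # Instead of re-running GCG/PEZ/GBDA, reuse the stored suffix of the most
    # semantically similar prompt(s) in the pre-trained database.
    q = embed(new_prompt)
    E = np.stack([embed(p) for p in db_prompts])
    sims = E @ q / (np.linalg.norm(E, axis=1) * np.linalg.norm(q) + 1e-9)
    return [db_suffixes[i] for i in np.argsort(-sims)[:k]]
```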

[NLP-64] ICPO: Illocution-Calibrated Policy Optimization for Multi-Turn Conversation ICASSP2026

【Quick Read】: This paper tackles the "lost-in-conversation" phenomenon in multi-turn dialogue, where large language models (LLMs) struggle to recover from early incorrect assumptions, especially when the user's initial instruction is ambiguous. Standard post-training methods such as Reinforcement Learning with Verifiable Rewards (RLVR) exacerbate the problem because their reward schemes favor confident, direct answers, inducing overconfidence and suppressing clarification-seeking behavior. The key to the solution is Illocution-Calibrated Policy Optimization (ICPO), a training framework that augments the training corpus with underspecified prompts and conditions the reward signal on the user's illocutionary intent, rewarding the model for expressing uncertainty or asking for clarification when facing ambiguity, thereby achieving appropriate humility and conversational robustness. Experiments show an average improvement of 75% in multi-turn conversation while preserving performance on single-turn benchmarks.

Link: https://arxiv.org/abs/2601.15330
Authors: Zhebo Wang, Xiaohu Mu, Zijie Zhou, Mohan Li, Wenpeng Xing, Dezhang Kong, Meng Han
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by ICASSP 2026

Click to view abstract

Abstract:Large Language Models (LLMs) in multi-turn conversations often suffer from a "lost-in-conversation" phenomenon, where they struggle to recover from early incorrect assumptions, particularly when users provide ambiguous initial instructions. We find that standard post-training techniques like Reinforcement Learning with Verifiable Rewards (RLVR) exacerbate this issue by rewarding confident, direct answers, thereby inducing overconfidence and discouraging the model from seeking clarification. To address this, we propose Illocution-Calibrated Policy Optimization (ICPO), a novel training framework that sensitizes the model to instruction ambiguity. ICPO augments the training corpus with underspecified prompts and conditions the reward signal on the user's illocutionary intent, rewarding the model for expressing uncertainty or asking for clarification when faced with ambiguity. Experiments demonstrate that ICPO fosters appropriate humility, yielding a substantial average improvement of 75% in multi-turn conversation, while preserving robust performance on single-turn benchmarks. Our work presents a practical path toward more robust and collaborative conversational AI that can better navigate the nuances of human interaction.
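The abstract specifies the shape of the reward (condition on illocutionary intent; reward clarification under ambiguity) but not its coefficients. A toy sketch under assumed values, with the detectors passed in as callables:

```python
def icpo_style_reward(response, prompt_is_ambiguous, is_correct,
                      asks_clarification, expresses_uncertainty):
    # On underspecified prompts, reward clarification-seeking and hedging;
    # on well-specified prompts, fall back to the verifiable task reward.
    if prompt_is_ambiguous:
        if asks_clarification(response):
            return 1.0
        return 0.5 if expresses_uncertainty(response) else -1.0  # penalize confident guessing
    return 1.0 if is_correct(response) else 0.0
```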

[NLP-65] Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents

【Quick Read】: This paper addresses the reproducibility and evidence-consistency problems faced by large language model (LLM) agents executing tool-use tasks in finance, i.e., the "regulatory audit replay" challenge: when a flagged transaction decision is replayed with identical inputs, most deployments fail to produce consistent results. The key to the solution is the Determinism-Faithfulness Assurance Harness (DFAH), a framework that quantifies trajectory determinism and evidence-conditioned faithfulness to systematically evaluate the stability and compliance alignment of tool-using agents. Empirically, although high-parameter models (120B+) need larger samples to reach statistical reliability, determinism and faithfulness are significantly positively correlated (r = 0.45, p < 0.01), and Tier 1 models with schema-first architectures can meet the determinism standards required for audit replay.

Link: https://arxiv.org/abs/2601.15322
Authors: Raffi Khatchadourian
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 23 pages, 5 figures, 9 tables. Code and data: this https URL

Click to view abstract

Abstract:LLM agents struggle with regulatory audit replay: when asked to reproduce a flagged transaction decision with identical inputs, most deployments fail to return consistent results. This paper introduces the Determinism-Faithfulness Assurance Harness (DFAH), a framework for measuring trajectory determinism and evidence-conditioned faithfulness in tool-using agents deployed in financial services. Across 74 configurations (12 models, 4 providers, 8-24 runs each at T=0.0) in non-agentic baseline experiments, 7-20B parameter models achieved 100% determinism, while 120B+ models required 3.7x larger validation samples to achieve equivalent statistical reliability. Agentic tool-use introduces additional variance (see Tables 4-7). Contrary to the assumed reliability-capability trade-off, a positive Pearson correlation emerged (r = 0.45, p < 0.01, n = 51 at T=0.0) between determinism and faithfulness; models producing consistent outputs also tended to be more evidence-aligned. Three financial benchmarks are provided (compliance triage, portfolio constraints, DataOps exceptions; 50 cases each) along with an open-source stress-test harness. In these benchmarks and under DFAH evaluation settings, Tier 1 models with schema-first architectures achieved determinism levels consistent with audit replay requirements.

[NLP-66] Do people expect different behavior from large language models acting on their behalf? Evidence from norm elicitations in two canonical economic games

【Quick Read】: This paper asks whether people judge the social appropriateness of decisions differently when the decisions are made by large language models (LLMs) acting on their behalf rather than by themselves, especially in resource allocation and fairness monitoring. The key to the solution is two pre-registered, incentivized experiments (representative UK and US samples, N = 2,658) that combine canonical economic games (the Dictator Game and Ultimatum Game) with the Krupka-Weber norm elicitation task to measure perceived normative differences between machine and human behavior. The studies find: (1) when no acceptance is required, machine-made offers are judged less appropriate than human ones; (2) when acceptance is required, people find it more appropriate to reject offers from machines; and (3) receiving a rejection from a machine is judged no less appropriate than receiving one from a human. This suggests that people view machine decisions as having both cognitive and emotional components and apply adapted social norms to human-machine interaction.

Link: https://arxiv.org/abs/2601.15312
Authors: Paweł Niszczota, Elia Antoniou
Affiliations: Unknown
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); General Economics (econ.GN)
Comments:

Click to view abstract

Abstract:While delegating tasks to large language models (LLMs) can save people time, there is growing evidence that offloading tasks to such models produces social costs. We use behavior in two canonical economic games to study whether people have different expectations when decisions are made by LLMs acting on their behalf instead of themselves. More specifically, we study the social appropriateness of a spectrum of possible behaviors: when LLMs divide resources on our behalf (Dictator Game and Ultimatum Game) and when they monitor the fairness of splits of resources (Ultimatum Game). We use the Krupka-Weber norm elicitation task to detect shifts in social appropriateness ratings. Results of two pre-registered and incentivized experimental studies using representative samples from the UK and US (N = 2,658) show three key findings. First, people find that offers from machines, when no acceptance is necessary, are judged to be less appropriate than when they come from humans, although there is no shift in the modal response. Second, when acceptance is necessary, it is more appropriate for a person to reject offers from machines than from humans. Third, receiving a rejection of an offer from a machine is no less socially appropriate than receiving the same rejection from a human. Overall, these results suggest that people apply different norms for machines deciding on how to split resources but are not opposed to machines enforcing the norms. The findings are consistent with offers made by machines now being viewed as having both a cognitive and emotional component.

[NLP-67] DeepSurvey-Bench: Evaluating Academic Value of Automatically Generated Scientific Survey

【Quick Read】: This paper addresses two core problems in existing benchmarks for automated scientific survey generation: first, the "ground truth" survey datasets are unreliable because they lack academic-dimension annotations; second, existing metrics only capture surface-level quality (such as structural coherence and reference relevance) and cannot measure the deeper "academic value" of generated surveys, such as the distillation of core research objectives and critical analysis across studies. The key to the solution is DeepSurvey-Bench, a new benchmark whose innovation is a comprehensive evaluation criterion spanning three dimensions (informational value, scholarly communication value, and research guidance value), together with a reliable, academic-value-annotated dataset built on this criterion, enabling systematic assessment of the deeper scholarly contribution of generated surveys.

Link: https://arxiv.org/abs/2601.15307
Authors: Guo-Biao Zhang, Ding-Yuan Liu, Da-Yi Wu, Tian Lan, Heyan Huang, Zhijing Wu, Xian-Ling Mao
Affiliations: Beijing Institute of Technology
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:The rapid development of automated scientific survey generation technology has made it increasingly important to establish a comprehensive benchmark to evaluate the quality of generated surveys. Almost all existing evaluation benchmarks rely on flawed selection criteria such as citation counts and structural coherence to select human-written surveys as the ground truth survey datasets, and then use surface-level metrics such as structural quality and reference relevance to evaluate generated surveys. However, these benchmarks have two key issues: (1) the ground truth survey datasets are unreliable because of a lack of academic dimension annotations; (2) the evaluation metrics only focus on the surface quality of the survey, such as logical coherence. Both issues mean that existing benchmarks cannot evaluate the deep "academic value" of generated surveys, such as the core research objectives and the critical analysis of different studies. To address the above problems, we propose DeepSurvey-Bench, a novel benchmark designed to comprehensively evaluate the academic value of generated surveys. Specifically, our benchmark proposes comprehensive academic value evaluation criteria covering three dimensions: informational value, scholarly communication value, and research guidance value. Based on these criteria, we construct a reliable dataset with academic value annotations, and evaluate the deep academic value of the generated surveys. Extensive experimental results demonstrate that our benchmark is highly consistent with human performance in assessing the academic value of generated surveys.

[NLP-68] Can We Trust LLM Detectors?

【Quick Read】: This paper addresses the unreliability of current methods for detecting text generated by large language models (LLMs) in practice, particularly their sharp performance drops under distribution shift, unseen generators, and simple stylistic perturbations. The key to the solution is a supervised contrastive learning (SCL) framework that learns discriminative style embeddings, improving the robustness and generalization of detectors and mitigating the dependence of existing methods on specific domains or training data.

Link: https://arxiv.org/abs/2601.15301
Authors: Jivnesh Sandhan, Harshit Jaiswal, Fei Cheng, Yugo Murawaki
Affiliations: Kyoto University; IIT Kanpur
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: NLP2026, Utsunomiya, Japan

Click to view abstract

Abstract:The rapid adoption of LLMs has increased the need for reliable AI text detection, yet existing detectors often fail outside controlled benchmarks. We systematically evaluate 2 dominant paradigms (training-free and supervised) and show that both are brittle under distribution shift, unseen generators, and simple stylistic perturbations. To address these limitations, we propose a supervised contrastive learning (SCL) framework that learns discriminative style embeddings. Experiments show that while supervised detectors excel in-domain, they degrade sharply out-of-domain, and training-free methods remain highly sensitive to proxy choice. Overall, our results expose fundamental challenges in building domain-agnostic detectors. Our code is available at: this https URL
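For reference, a generic supervised contrastive (SupCon-style) objective of the kind the abstract names, computed over L2-normalized style embeddings; this is the standard formulation, not necessarily the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def supcon_loss(z, labels, tau=0.1):
    # z: (n, d) embeddings; labels: (n,) class ids (e.g., human vs. a specific generator).
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))             # drop self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)  # log p(j | anchor i)
    pos = (labels[:, None] == labels[None, :]) & ~self_mask     # same-label positives
    per_anchor = log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return -per_anchor.mean()
```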

[NLP-69] Intelligence Degradation in Long-Context LLMs: Critical Threshold Determination via Natural Length Distribution Analysis

【Quick Read】: This paper addresses intelligence degradation in large language models (LLMs) on long contexts: as context length approaches a critical threshold, performance collapses by more than 30% even when the information remains relevant, severely limiting long-context applications. The key to the solution is threefold. First, Natural Length Distribution Analysis uses each sample's natural token length without truncation or padding, ruling out confounds and attributing the degradation to context length itself. Second, experiments on a mixed dataset with five-method cross-validation pinpoint the critical threshold of Qwen2.5-7B at 40%-50% of the maximum context length, where F1 drops from 0.55-0.56 to 0.3, a pattern termed "shallow long-context adaptation". Finally, a unified framework is built to explain such degradation and ground future mitigation strategies.

Link: https://arxiv.org/abs/2601.15300
Authors: Weiwei Wang, Jiyong Min, Weijie Zou
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 29 pages

Click to view abstract

Abstract:Large Language Models (LLMs) exhibit catastrophic performance degradation when processing contexts approaching certain critical thresholds, even when information remains relevant. This intelligence degradation, defined as a drop of over 30% in task performance, severely limits long-context applications. This degradation shows a common pattern: models maintain strong performance up to a critical threshold, then collapse catastrophically. We term this shallow long-context adaptation: models adapt for short to medium contexts but fail beyond critical thresholds. This paper presents three contributions: (1) Natural Length Distribution Analysis: We use each sample's natural token length without truncation or padding, providing stronger causal evidence that degradation results from context length itself. (2) Critical Threshold Determination: Through experiments on a mixed dataset (1,000 samples covering 5%-95% of context length), we identify the critical threshold for Qwen2.5-7B at 40-50% of maximum context length, where F1 scores drop from 0.55-0.56 to 0.3 (45.5% degradation), using five-method cross-validation. (3) Unified Framework: We consolidate shallow adaptation, explaining degradation patterns and providing a foundation for mitigation strategies. This work provides the first systematic characterization of intelligence degradation in open-source Qwen models, offering practical guidance for deploying LLMs in long-context scenarios.

[NLP-70] MALTopic: Multi-Agent LLM Topic Modeling Framework ALT

【Quick Read】: This paper addresses two problems with traditional topic modeling on survey data: reliance on free-text responses alone, without effectively incorporating structured or categorical variables, and abstract, hard-to-interpret topics that require heavy manual interpretation. The key to the solution is the Multi-Agent LLM Topic Modeling Framework (MALTopic), which divides the work among specialized LLM agents: an enrichment agent uses structured data to enhance textual responses, a topic modeling agent extracts latent themes, and a deduplication agent refines the output. Compared with LDA and BERTopic, the framework markedly improves topic coherence, diversity, and interpretability, with stronger contextual relevance and practical utility.

Link: https://arxiv.org/abs/2601.15299
Authors: Yash Sharma
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
Comments: 6 pages. Published in 2025 IEEE World AI-IoT Congress. © 2025 IEEE. Project code and data available at: this https URL

Click to view abstract

Abstract:Topic modeling is a crucial technique for extracting latent themes from unstructured text data, particularly valuable in analyzing survey responses. However, traditional methods often only consider free-text responses and do not natively incorporate structured or categorical survey responses for topic modeling. And they produce abstract topics, requiring extensive human interpretation. To address these limitations, we propose the Multi-Agent LLM Topic Modeling Framework (MALTopic). This framework decomposes topic modeling into specialized tasks executed by individual LLM agents: an enrichment agent leverages structured data to enhance textual responses, a topic modeling agent extracts latent themes, and a deduplication agent refines the results. Comparative analysis on a survey dataset demonstrates that MALTopic significantly improves topic coherence, diversity, and interpretability compared to LDA and BERTopic. By integrating structured data and employing a multi-agent approach, MALTopic generates human-readable topics with enhanced contextual relevance, offering a more effective solution for analyzing complex survey data.

[NLP-71] Embedding Retrofitting: Data Engineering for better RAG

【Quick Read】: This paper addresses the performance degradation of embedding retrofitting in practice caused by knowledge graph quality issues, particularly annotation artifacts in real-world corpora. The key to the solution is a data engineering framework that systematically removes noise sources such as hashtags, which spuriously inflate knowledge graph density and create spurious edges that corrupt the retrofitting objective. Experiments show that after noise removal, weighted-average-based retrofitting (EWMA retrofitting) achieves a significant improvement on domain-specific retrieval (+6.2%, p = 0.0348), with gains concentrated in quantitative synthesis questions (+33.8% on average). Notably, the swing attributable to preprocessing quality (10%+) exceeds the gap between retrofitting algorithms (3%), establishing preprocessing quality, rather than algorithm choice, as the primary determinant of success.

Link: https://arxiv.org/abs/2601.15298
Authors: Anantha Sharma
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: 16 pages, 11 figures, 7 tables

Click to view abstract

Abstract:Embedding retrofitting adjusts pre-trained word vectors using knowledge graph constraints to improve domain-specific retrieval. However, the effectiveness of retrofitting depends critically on knowledge graph quality, which in turn depends on text preprocessing. This paper presents a data engineering framework that addresses data quality degradation from annotation artifacts in real-world corpora. The analysis shows that hashtag annotations inflate knowledge graph density, creating spurious edges that corrupt the retrofitting objective. On noisy graphs, all retrofitting techniques produce statistically significant degradation (-3.5% to -5.2%, p < 0.05). After preprocessing, EWMA retrofitting achieves a +6.2% improvement (p = 0.0348) with benefits concentrated in quantitative synthesis questions (+33.8% average). The gap between clean and noisy preprocessing (10%+ swing) exceeds the gap between algorithms (3%), establishing preprocessing quality as the primary determinant of retrofitting success.
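For context, the retrofitting objective such pipelines optimize is the classical one of Faruqui et al. (2015); the abstract does not restate it, so this is the standard formulation rather than the paper's notation. Spurious hashtag-induced edges enter the graph term directly, which is why graph quality dominates the outcome:

```latex
\Psi(Q) \;=\; \sum_{i=1}^{n} \alpha_i \,\lVert q_i - x_i \rVert^2
       \;+\; \sum_{(i,j)\in E} \beta_{ij}\,\lVert q_i - q_j \rVert^2,
\qquad
q_i \;\leftarrow\; \frac{\alpha_i\, x_i + \sum_{j:(i,j)\in E} \beta_{ij}\, q_j}
                        {\alpha_i + \sum_{j:(i,j)\in E} \beta_{ij}}
```

Here x_i are the pre-trained vectors, E the knowledge-graph edges, and the right-hand side is the closed-form coordinate update; every spurious edge (i, j) pulls q_i toward an unrelated q_j.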

[NLP-72] AfriEconQA: A Benchmark Dataset for African Economic Analysis based on World Bank Reports

【Quick Read】: This paper addresses the poor performance of current large language models (LLMs) on African economic analysis tasks, a parametric knowledge gap stemming from the near-absence of relevant data in pretraining corpora. The key to the solution is AfriEconQA, the first benchmark dedicated to African economic analysis, built from 236 World Bank reports and comprising 8,937 rigorously filtered QA instances. Each instance pairs a question requiring high-precision numerical reasoning and temporal disambiguation with document evidence, a verified answer, and source metadata (URL and publication date), providing a challenging testbed for Information Retrieval (IR) systems and Retrieval-Augmented Generation (RAG) models. An 11-experiment matrix comparing a zero-shot baseline (GPT-5 Mini) with multiple RAG configurations (GPT-4o and Qwen 32B over five embedding and ranking strategies) shows that even state-of-the-art RAG pipelines struggle to achieve high precision, confirming the benchmark's validity and difficulty for next-generation domain-specific IR and RAG systems.

Link: https://arxiv.org/abs/2601.15297
Authors: Edward Ajayi
Affiliations: Carnegie Mellon University Africa
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:We introduce AfriEconQA, a specialized benchmark dataset for African economic analysis grounded in a comprehensive corpus of 236 World Bank reports. The task of AfriEconQA is to answer complex economic queries that require high-precision numerical reasoning and temporal disambiguation from specialized institutional documents. The dataset consists of 8,937 curated QA instances, rigorously filtered from a pool of 10,018 synthetic questions to ensure high-quality evidence-answer alignment. Each instance is composed of: (1) a question requiring reasoning over economic indicators, (2) the corresponding evidence retrieved from the corpus, (3) a verified ground-truth answer, and (4) source metadata (e.g., URL and publication date) to ensure temporal provenance. AfriEconQA is the first benchmark focused specifically on African economic analysis, providing a unique challenge for Information Retrieval (IR) systems, as the data is largely absent from the pretraining corpora of current Large Language Models (LLMs). We operationalize this dataset through an 11-experiment matrix, benchmarking a zero-shot baseline (GPT-5 Mini) against RAG configurations using GPT-4o and Qwen 32B across five distinct embedding and ranking strategies. Our results demonstrate a severe parametric knowledge gap, where zero-shot models fail to answer over 90 percent of queries, and even state-of-the-art RAG pipelines struggle to achieve high precision. This confirms AfriEconQA as a robust and challenging benchmark for the next generation of domain-specific IR and RAG systems. The AfriEconQA dataset and code will be made publicly available upon publication.

[NLP-73] Entropy-Tree: Tree-Based Decoding with Entropy-Guided Exploration

【Quick Read】: This paper addresses the inefficiency and unreliable uncertainty estimation of conventional decoding strategies for LLM reasoning, which either explore blindly (random sampling) or redundantly (independent multi-sampling). The key to the solution is Entropy-Tree, a tree-based decoding method that uses the model's output entropy as a branching signal, expanding the search tree only at positions where the model is genuinely uncertain, thereby unifying efficient structured exploration with reliable uncertainty estimation.

Link: https://arxiv.org/abs/2601.15296
Authors: Longxuan Wei, Yubo Zhang, Zijiao Zhang, Zhihu Wang, Shiwan Zhao, Tianyu Huang, Huiting Zhao, Chenfei Liu, Shenao Zhang, Junchi Yan
Affiliations: Shanghai Jiaotong University; Shanghai Innovation Institute; Huawei Technologies Ltd.; Nankai University; Nanyang Technological University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language models achieve strong reasoning performance, yet existing decoding strategies either explore blindly (random sampling) or redundantly (independent multi-sampling). We propose Entropy-Tree, a tree-based decoding method that exploits entropy as a signal for branching decisions: it expands the search tree only at positions where the model exhibits genuine uncertainty. Entropy-Tree shows superior accuracy and calibration in reasoning tasks: it achieves better pass@k than Multi-chain across multiple models and datasets, and its predictive entropy demonstrates better AUROC compared to several traditional metrics. Entropy-Tree unifies efficient structured exploration and reliable uncertainty estimation within a single decoding procedure.
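The branching rule the abstract describes reduces to an entropy gate per decoding step; the threshold and top-k width below are illustrative assumptions, not values from the paper:

```python
import math

def next_candidates(step_probs, k=3, threshold=2.0):
    # step_probs: next-token distribution at the current tree node.
    entropy = -sum(p * math.log(p) for p in step_probs if p > 0)  # in nats
    ranked = sorted(range(len(step_probs)), key=lambda i: -step_probs[i])
    # Branch (top-k children) only where the model is genuinely uncertain;
    # otherwise continue greedily with a single child.
    return ranked[:k] if entropy > threshold else ranked[:1]
```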

[NLP-74] Elsewise: Authoring AI-Based Interactive Narrative with Possibility Space Visualization

【Quick Read】: This paper addresses the widening gap between author-envisioned and player-experienced narratives in generative AI-based interactive narrative (IN), caused by open-ended player input, which can weaken plot progression and the communication of authorial intent. The key to the solution is Elsewise, an authoring tool whose core innovation is the "Bundled Storyline" concept, which lets authors systematically explore similarities and differences between possible playthroughs along user-configurable narrative dimensions, strengthening their understanding and control of the narrative possibility space. A user study (n = 12) shows the approach significantly improves authors' anticipation of player-experienced narratives, leading to better design and control of narrative spaces.

Link: https://arxiv.org/abs/2601.15295
Authors: Yi Wang, John Joon Young Chung, Melissa Roemmele, Yuqian Sun, Tiffany Wang, Shm Garanganao Almeda, Brett A. Halperin, Yuwen Lu, Max Kreminski
Affiliations: Midjourney; University of Washington; University of Notre Dame; UC Berkeley
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Interactive narrative (IN) authors craft spaces of divergent narrative possibilities for players to explore, with the player’s input determining which narrative possibilities they actually experience. Generative AI can enable new forms of IN by improvisationally expanding on pre-authored content in response to open-ended player input. However, this extrapolation risks widening the gap between author-envisioned and player-experienced stories, potentially limiting the strength of plot progression and the communication of the author’s narrative intent. To bridge the gap, we introduce Elsewise: an authoring tool for AI-based INs that implements a novel Bundled Storyline concept to enhance author’s perception and understanding of the narrative possibility space, allowing authors to explore similarities and differences between possible playthroughs of their IN in terms of open-ended, user-configurable narrative dimensions. A user study (n=12) shows that our approach improves author anticipation of player-experienced narrative, leading to more effective control and exploration of the narrative possibility spaces.

[NLP-75] Knowledge Graphs are Implicit Reward Models: Path-Derived Signals Enable Compositional Reasoning

【Quick Read】: This paper addresses the limited compositional multi-hop reasoning ability of large language models in specialized scientific domains: although models approach expert level on structured reasoning tasks such as mathematics and programming, they struggle on complex scientific problems that require composing multiple intermediate logical steps. The key to the solution is a bottom-up learning paradigm that grounds models in an axiomatic domain knowledge graph and guides supervised fine-tuning and reinforcement learning (RL) with path-derived reward signals, encouraging models to optimize not only final answers but also the correctness of intermediate reasoning steps. These knowledge-graph-path rewards act as a "compositional bridge", markedly improving zero-shot generalization to unseen, more complex multi-hop tasks in domains such as medicine, and outperforming frontier systems such as GPT-5.2 and Gemini 3 Pro.

Link: https://arxiv.org/abs/2601.15160
Authors: Yuval Kansal, Niraj K. Jha
Affiliations: Princeton University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Large language models have achieved near-expert performance in structured reasoning domains like mathematics and programming, yet their ability to perform compositional multi-hop reasoning in specialized scientific fields remains limited. We propose a bottom-up learning paradigm in which models are grounded in axiomatic domain facts and compose them to solve complex, unseen tasks. To this end, we present a post-training pipeline, based on a combination of supervised fine-tuning and reinforcement learning (RL), in which knowledge graphs act as implicit reward models. By deriving novel reward signals from knowledge graph paths, we provide verifiable, scalable, and grounded supervision that encourages models to compose intermediate axioms rather than optimize only final answers during RL. We validate this approach in the medical domain, training a 14B model on short-hop reasoning paths (1-3 hops) and evaluating its zero-shot generalization to complex multi-hop queries (4-5 hops). Our experiments show that path-derived rewards act as a “compositional bridge”, enabling our model to significantly outperform much larger models and frontier systems like GPT-5.2 and Gemini 3 Pro, on the most difficult reasoning tasks. Furthermore, we demonstrate the robustness of our approach to adversarial perturbations against option-shuffling stress tests. This work suggests that grounding the reasoning process in structured knowledge is a scalable and efficient path toward intelligent reasoning.

[NLP-76] Psychometric Comparability of LLM-Based Digital Twins

【Quick Read】: This paper asks whether large language models (LLMs) used as "digital twins" can psychometrically stand in for human respondents, i.e., what their construct validity is, focusing on their comparability to human gold standards across tasks and on how person-specific inputs shape performance. The key to the solution is a construct-validity framework spanning construct representation and the nomological net, benchmarked across models, tasks, and individuals. The studies find that feature-rich conditioning improves overall agreement and narrative match but does not eliminate systematic divergences: item-level correlations are attenuated, heuristic biases are under-reproduced, and personality networks achieve only configural invariance. The work therefore argues for delineating the effective boundaries of digital twins, i.e., the contexts in which they are reliable proxies for human cognition and behavior.

Link: https://arxiv.org/abs/2601.14264
Authors: Yufei Zhang, Zhihao Ma
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Also available as a preprint on OSF Preprints this https URL

Click to view abstract

Abstract:Large language models (LLMs) are used as “digital twins” to replace human respondents, yet their psychometric comparability to humans is uncertain. We propose a construct-validity framework spanning construct representation and the nomological net, benchmarking digital twins against human gold standards across models, tasks and testing how person-specific inputs shape performance. Across studies, digital twins achieved high population-level accuracy and strong within-participant profile correlations, alongside attenuated item-level correlations. In word association tests, LLM-based networks show small-world structure and theory-consistent communities similar to humans, yet diverge lexically and in local structure. In decision-making and contextualized tasks, digital twins under-reproduce heuristic biases, showing normative rationality, compressed variance and limited sensitivity to temporal information. Feature-rich digital twins improve Big Five Personality prediction, but their personality networks show only configural invariance and do not achieve metric invariance. In more applied free-text tasks, feature-rich digital twins better match human narratives, but linguistic differences persist. Together, these results indicate that feature-rich conditioning enhances validity but does not resolve systematic divergences in psychometric comparability. Future work should therefore prioritize delineating the effective boundaries of digital twins, establishing the precise contexts in which they function as reliable proxies for human cognition and behavior.

Computer Vision

[CV-0] CamPilot: Improving Camera Control in Video Diffusion Model with Efficient Camera Reward Feedback

【Quick Read】: This paper addresses the limited camera controllability of video diffusion models. Existing attempts to improve camera control with Reward Feedback Learning (ReFL) face three challenges: current reward models cannot assess video-camera alignment; decoding latents into RGB videos for reward computation adds substantial computational overhead; and 3D geometric information is typically ignored during video decoding. The key to the solution is an efficient camera-aware 3D decoder that decodes video latents together with camera poses into 3D Gaussians, where the camera pose serves both as an input and as a projection parameter for rendering. When video latents and camera poses are misaligned, the 3D structure exhibits geometric distortions that produce blurry renderings; exploiting this property, the method optimizes a pixel-level consistency reward between rendered novel views and ground-truth ones, and further introduces a visibility term that supervises only regions determined to be deterministic via geometric warping, accommodating the stochastic nature of generation. Experiments on RealEstate10K and WorldScore show substantial improvements in camera controllability.

Link: https://arxiv.org/abs/2601.16214
Authors: Wenhang Ge, Guibao Shen, Jiawei Feng, Luozhou Wang, Hao Lu, Xingye Tian, Xin Tao, Ying-Cong Chen
Affiliations: HKUST(GZ); HKUST; Kling Team, Kuaishou Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Recent advances in camera-controlled video diffusion models have significantly improved video-camera alignment. However, the camera controllability still remains limited. In this work, we build upon Reward Feedback Learning and aim to further improve camera controllability. However, directly borrowing existing ReFL approaches faces several challenges. First, current reward models lack the capacity to assess video-camera alignment. Second, decoding latent into RGB videos for reward computation introduces substantial computational overhead. Third, 3D geometric information is typically neglected during video decoding. To address these limitations, we introduce an efficient camera-aware 3D decoder that decodes video latent into 3D representations for reward quantization. Specifically, video latent along with the camera pose are decoded into 3D Gaussians. In this process, the camera pose not only acts as input, but also serves as a projection parameter. Misalignment between the video latent and camera pose will cause geometric distortions in the 3D structure, resulting in blurry renderings. Based on this property, we explicitly optimize pixel-level consistency between the rendered novel views and ground-truth ones as reward. To accommodate the stochastic nature, we further introduce a visibility term that selectively supervises only deterministic regions derived via geometric warping. Extensive experiments conducted on RealEstate10K and WorldScore benchmarks demonstrate the effectiveness of our proposed method. Project page: this https URL.
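In notation of our own (the abstract gives the reward only in words), the visibility-masked pixel-consistency reward can be written as:

```latex
R \;=\; -\,\frac{1}{\sum_{p} V(p)} \sum_{p} V(p)\,
        \bigl\lVert \hat{I}_{\mathrm{render}}(p) - I_{\mathrm{gt}}(p) \bigr\rVert_1
```

where V(p) in {0, 1} marks pixels identified as deterministic by geometric warping, so stochastic regions of the generation are left unsupervised.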

[CV-1] Why Can't I Open My Drawer? Mitigating Object-Driven Shortcuts in Zero-Shot Compositional Action Recognition

【Quick Read】: This paper addresses the weak generalization of Zero-Shot Compositional Action Recognition (ZS-CAR) models caused by "object-driven verb shortcuts": during training, existing models increasingly ignore visual evidence and rely on co-occurrence statistics between objects and verbs, and thus fail on unseen verb-object compositions. The key to the solution is the RCORE framework, with two mechanisms: (i) a composition-aware augmentation that diversifies verb-object combinations without corrupting motion cues, and (ii) a temporal order regularization loss that penalizes shortcut behaviors by explicitly modeling the temporal structure of videos. The method significantly improves accuracy on unseen compositions and reduces reliance on co-occurrence bias, yielding consistently positive compositional gaps.

Link: https://arxiv.org/abs/2601.16211
Authors: Geo Ahn, Inwoong Lee, Taeoh Kim, Minho Shim, Dongyoon Wee, Jinwoo Choi
Affiliations: Kyung Hee University; NAVER Cloud
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: The code is available at this https URL

Abstract:We study Compositional Video Understanding (CVU), where models must recognize verbs and objects and compose them to generalize to unseen combinations. We find that existing Zero-Shot Compositional Action Recognition (ZS-CAR) models fail primarily due to an overlooked failure mode: object-driven verb shortcuts. Through systematic analysis, we show that this behavior arises from two intertwined factors: severe sparsity and skewness of compositional supervision, and the asymmetric learning difficulty between verbs and objects. As training progresses, the existing ZS-CAR model increasingly ignores visual evidence and overfits to co-occurrence statistics. Consequently, the existing model does not gain the benefit of compositional recognition in unseen verb-object compositions. To address this, we propose RCORE, a simple and effective framework that enforces temporally grounded verb learning. RCORE introduces (i) a composition-aware augmentation that diversifies verb-object combinations without corrupting motion cues, and (ii) a temporal order regularization loss that penalizes shortcut behaviors by explicitly modeling temporal structure. Across two benchmarks, Sth-com and our newly constructed EK100-com, RCORE significantly improves unseen composition accuracy, reduces reliance on co-occurrence bias, and achieves consistently positive compositional gaps. Our findings reveal object-driven shortcuts as a critical limiting factor in ZS-CAR and demonstrate that addressing them is essential for robust compositional video understanding.

[CV-2] PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation

【Quick Read】: This paper addresses the shortcomings of existing discrete video VAEs in cross-modal alignment and zero-shot transfer: conventional video tokenizers typically learn limited-vocabulary visual codebooks at a single scale with shallow language supervision, capping both generation quality and understanding performance. The key to the solution is PyraTok, a language-aligned pyramidal tokenizer built around a novel Language-aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at multiple spatiotemporal resolutions with a shared large binary codebook, producing compact yet semantically structured video token sequences. PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy, tightly coupling visual tokens with language semantics and substantially improving video reconstruction, text-to-video generation, and zero-shot performance on downstream tasks such as video segmentation and temporal action localization.

Link: https://arxiv.org/abs/2601.16210
Authors: Onkar Susladkar, Tushar Prakash, Adheesh Juvekar, Kiet A. Nguyen, Dong-Hwan Jang, Inderjit S Dhillon, Ismini Lourentzou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, scaling robustly to up to 4K/8K resolutions.

[CV-3] Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

【Quick Read】: This paper addresses the limitations of conventional variational autoencoders (VAEs) in training stability, convergence speed, and generation quality for large-scale, free-form text-to-image (T2I) generation. The key to the solution is the Representation Autoencoder (RAE) framework: decoders are scaled on top of a frozen high-dimensional semantic encoder (SigLIP-2), with targeted data composition improving specific domains such as text rendering. The study also finds that at scale the RAE design simplifies considerably: dimension-dependent noise scheduling remains critical, while complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefit. Experiments show RAEs outperform the FLUX VAE during pretraining across model scales and are more robust during finetuning, with better generation quality, establishing RAEs as a simpler and stronger foundation for large-scale T2I generation.

Link: https://arxiv.org/abs/2601.16208
Authors: Shengbang Tong, Boyang Zheng, Ziteng Wang, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, Saining Xie
Affiliations: New York University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: website: this https URL

Click to view abstract

Abstract:Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale. Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. Additionally, because both visual understanding and generation can operate in a shared representation space, the multimodal model can directly reason over generated latents, opening new possibilities for unified models.

[CV-4] Provable Robustness in Multimodal Large Language Models via Feature Space Smoothing

【Quick Read】: This paper addresses the vulnerability of multimodal large language models (MLLMs) to adversarial perturbations that distort their feature representations and induce erroneous predictions. The key to the solution is Feature-space Smoothing (FS), together with a theoretical proof that FS provides certified robustness for MLLM feature representations: FS transforms any feature encoder into a smoothed variant that guarantees a provable lower bound on the feature cosine similarity between clean and adversarial inputs under ℓ2-bounded attacks (the Feature Cosine Similarity Bound, FCSB). The paper further introduces the Purifier and Smoothness Mapper (PSM), a plug-and-play module that raises the Gaussian robustness score of the vanilla encoder without retraining the MLLM, thereby strengthening certified robustness under FS. Experiments across diverse MLLMs and downstream tasks show that FS-PSM combines strong theoretical guarantees with empirical performance superior to adversarial training, reducing the white-box Attack Success Rate (ASR) from nearly 90% to about 1%.

Link: https://arxiv.org/abs/2601.16200
Authors: Song Xia, Meiwen Ding, Chenqi Kong, Wenhan Yang, Xudong Jiang
Affiliations: Nanyang Technological University; Peng Cheng Laboratory
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review

Click to view abstract

Abstract:Multimodal large language models (MLLMs) exhibit strong capabilities across diverse applications, yet remain vulnerable to adversarial perturbations that distort their feature representations and induce erroneous predictions. To address this vulnerability, we propose the Feature-space Smoothing (FS) and theoretically prove that FS offers certified robustness on the feature representations of MLLMs. Specifically, FS transforms any feature encoder into a smoothed variant that is guaranteed to maintain a certified lower bound on the feature cosine similarity between clean and adversarial representations under ℓ2-bounded attacks. Moreover, we indicate that the value of this Feature Cosine Similarity Bound (FCSB) derived from FS can be improved by enlarging the defined Gaussian robustness score on the vanilla encoder. Building upon this, we introduce the Purifier and Smoothness Mapper (PSM), a plug-and-play module that improves the Gaussian robustness score of MLLMs and thus enhances their certified robustness under FS, without requiring any retraining on MLLMs. We demonstrate that the FS with PSM not only provides a strong theoretical robustness guarantee but also exhibits superior empirical performance compared to adversarial training. Extensive experiments across diverse MLLMs and downstream tasks indicate the effectiveness of the FS-PSM, reducing the Attack Success Rate (ASR) of various white-box attacks from nearly 90% to about 1%.
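A minimal Monte Carlo sketch of a smoothed feature encoder consistent with the abstract's description: average the features of Gaussian-perturbed inputs, then re-normalize so that cosine-similarity bounds apply. The sample count and noise scale are assumptions, and this is not the paper's exact estimator:

```python
import torch

def smoothed_features(encoder, x, sigma=0.25, n=32):
    # encoder: maps an input tensor to a feature vector; x: a single input.
    noise = sigma * torch.randn((n,) + x.shape, device=x.device)
    feats = torch.stack([encoder(x + eps) for eps in noise])  # (n, d)
    f = feats.mean(dim=0)
    return f / f.norm().clamp(min=1e-9)  # unit norm, so cosine bounds apply directly
```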

[CV-5] 360Anything: Geometry-Free Lifting of Images and Videos to 360°

【Quick Read】: This paper addresses the over-reliance on camera geometry when lifting perspective images and videos to 360° panoramas (equirectangular projection, ERP), which limits applicability to in-the-wild data where camera calibration is absent or noisy. The key to the solution is 360Anything, a geometry-free framework built on pre-trained diffusion transformers that treats both the perspective input and the panorama target as token sequences and learns the perspective-to-ERP spatial mapping in a purely data-driven way, eliminating the need for camera intrinsics and extrinsics. The paper also traces ERP boundary seam artifacts to zero-padding in the VAE encoder and introduces Circular Latent Encoding for seamless generation, markedly improving output quality.

Link: https://arxiv.org/abs/2601.16192
Authors: Ziyi Wu, Daniel Watson, Andrea Tagliasacchi, David J. Fleet, Marcus A. Brubaker, Saurabh Saxena
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Click to view abstract

Abstract:Lifting perspective images and videos to 360° panoramas enables immersive 3D world generation. Existing approaches often rely on explicit geometric alignment between the perspective and the equirectangular projection (ERP) space. Yet, this requires known camera metadata, obscuring the application to in-the-wild data where such calibration is typically absent or noisy. We propose 360Anything, a geometry-free framework built upon pre-trained diffusion transformers. By treating the perspective input and the panorama target simply as token sequences, 360Anything learns the perspective-to-equirectangular mapping in a purely data-driven way, eliminating the need for camera information. Our approach achieves state-of-the-art performance on both image and video perspective-to-360° generation, outperforming prior works that use ground-truth camera information. We also trace the root cause of the seam artifacts at ERP boundaries to zero-padding in the VAE encoder, and introduce Circular Latent Encoding to facilitate seamless generation. Finally, we show competitive results in zero-shot camera FoV and orientation estimation benchmarks, demonstrating 360Anything’s deep geometric understanding and broader utility in computer vision tasks. Additional results are available at this https URL.

[CV-6] HVD: Human Vision-Driven Video Representation Learning for Text-Video Retrieval ICASSP2026

【Quick Read】: This paper addresses the "blind" feature interaction in current text-video retrieval, where sparse textual queries make it hard for models to separate key visual information from background noise. The key to the solution, inspired by human visual cognition, is the Human Vision-Driven (HVD) model, which builds a coarse-to-fine alignment mechanism: a Frame Features Selection Module (FFSM) mimics human macro-perception by selecting key frames to remove temporal redundancy, and a Patch Features Compression Module (PFCM) mimics micro-perception by aggregating local patch features into salient visual entities with an advanced attention mechanism, enabling fine-grained entity-level matching that improves retrieval precision and better reflects human visual focus.

Link: https://arxiv.org/abs/2601.16155
Authors: Zequn Xie, Xin Liu, Boyun Zhang, Yuxiao Lin, Sihang Cai, Tao Jin
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: Accepted by ICASSP 2026

Click to view abstract

Abstract:The success of CLIP has driven substantial progress in text-video retrieval. However, current methods often suffer from “blind” feature interaction, where the model struggles to discern key visual information from background noise due to the sparsity of textual queries. To bridge this gap, we draw inspiration from human cognitive behavior and propose the Human Vision-Driven (HVD) model. Our framework establishes a coarse-to-fine alignment mechanism comprising two key components: the Frame Features Selection Module (FFSM) and the Patch Features Compression Module (PFCM). FFSM mimics the human macro-perception ability by selecting key frames to eliminate temporal redundancy. Subsequently, PFCM simulates micro-perception by aggregating patch features into salient visual entities through an advanced attention mechanism, enabling precise entity-level matching. Extensive experiments on five benchmarks demonstrate that HVD not only captures human-like visual focus but also achieves state-of-the-art performance.

[CV-7] ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion

【Quick Read】: This paper addresses three practical obstacles to generating animated 3D objects: limited setup flexibility, long runtimes, and low quality. Existing methods struggle to accommodate diverse inputs (a monocular video, a text description, or a static 3D mesh) and often produce results that require complex rigging or have inconsistent topology, hindering rapid iteration and downstream processing such as texturing and retargeting. The key to the solution is ActionMesh, a framework based on "temporal 3D diffusion": first, a 3D diffusion model is extended along the temporal axis to generate a sequence of synchronized time-varying latents; second, a temporal 3D autoencoder maps the sequence of independent shapes into deformations of a predefined reference mesh to build a high-quality animation. The result is fast, end-to-end, rig-free, production-ready animation with consistent topology, achieving state-of-the-art geometric accuracy and temporal consistency on standard video-to-4D benchmarks.

Link: https://arxiv.org/abs/2601.16148
Authors: Remy Sabathier, David Novotny, Niloy J. Mitra, Tom Monnier
Affiliations: Meta Reality Labs; SpAItial; University College London
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Generating animated 3D objects is at the heart of many applications, yet most advanced works are typically difficult to apply in practice because of their limited setup, their long runtime, or their limited quality. We introduce ActionMesh, a generative model that predicts production-ready 3D meshes “in action” in a feed-forward manner. Drawing inspiration from early video models, our key insight is to modify existing 3D diffusion models to include a temporal axis, resulting in a framework we dubbed “temporal 3D diffusion”. Specifically, we first adapt the 3D diffusion stage to generate a sequence of synchronized latents representing time-varying and independent 3D shapes. Second, we design a temporal 3D autoencoder that translates a sequence of independent shapes into the corresponding deformations of a pre-defined reference shape, allowing us to build an animation. Combining these two components, ActionMesh generates animated 3D meshes from different inputs like a monocular video, a text description, or even a 3D mesh with a text prompt describing its animation. Besides, compared to previous approaches, our method is fast and produces results that are rig-free and topology consistent, hence enabling rapid iteration and seamless applications like texturing and retargeting. We evaluate our model on standard video-to-4D benchmarks (Consistent4D, Objaverse) and report state-of-the-art performances on both geometric accuracy and temporal consistency, demonstrating that our model can deliver animated 3D meshes with unprecedented speed and quality.
zh

[CV-8] Learning to Watermark in the Latent Space of Generative Models KR

【速读】:该论文旨在解决现有生成式 AI (Generative AI) 图像水印方法依赖于像素空间的后处理(post-hoc)策略所导致的计算开销大和潜在视觉伪影的问题。其解决方案的关键在于提出 DistSeal,一种统一的潜在空间(latent space)水印方法,能够在扩散模型(diffusion models)和自回归模型(autoregressive models)中通用。该方法通过在生成模型的潜在空间中训练水印模型,并可将水印信息有效蒸馏(distill)至生成模型本身或潜在解码器中,实现“模型内”水印嵌入(in-model watermarking),从而在保持与像素空间方法相当的不可感知性的同时,提升鲁棒性并实现高达 20 倍的速度提升。

链接: https://arxiv.org/abs/2601.16140
作者: Sylvestre-Alvise Rebuffi,Tuan Tran,Valeriu Lacatusu,Pierre Fernandez,Tomáš Souček,Nikola Jovanović,Tom Sander,Hady Elsahar,Alexandre Mourachko
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Code and models are available at this https URL

点击查看摘要

Abstract:Existing approaches for watermarking AI-generated images often rely on post-hoc methods applied in pixel space, introducing computational overhead and potential visual artifacts. In this work, we explore latent space watermarking and introduce DistSeal, a unified approach for latent watermarking that works across both diffusion and autoregressive models. Our approach works by training post-hoc watermarking models in the latent space of generative models. We demonstrate that these latent watermarkers can be effectively distilled either into the generative model itself or into the latent decoder, enabling in-model watermarking. The resulting latent watermarks achieve competitive robustness while offering similar imperceptibility and up to 20x speedup compared to pixel-space baselines. Our experiments further reveal that distilling latent watermarkers outperforms distilling pixel-space ones, providing a solution that is both more efficient and more robust.
zh

[CV-9] Distillation-based Layer Dropping (DLD): Effective End-to-end Framework for Dynamic Speech Networks ICASSP2026

【速读】:该论文旨在解决边缘设备上静态神经网络在资源受限和动态变化环境下性能下降的问题,尤其是现有层跳过(Layer Dropping, $\mathcal{LD}$)方法在低丢弃率和高丢弃率情况下均显著损害模型性能,导致性能-计算复杂度权衡恶化。解决方案的关键在于提出一种基于知识蒸馏的层跳过(Distillation-based Layer Dropping, DLD)框架,该框架以端到端方式融合知识蒸馏与 $\mathcal{LD}$ 机制,使动态模型在不同丢弃比例下仍能保持高性能,从而实现动态语音网络的最优性能表现。

链接: https://arxiv.org/abs/2601.16117
作者: Abdul Hannan,Daniele Falavigna,Shah Nawaz,Mubashir Noman,Markus Schedl,Alessio Brutti
机构: 未知
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICASSP 2026

点击查看摘要

Abstract:Edge devices operate in constrained and varying resource settings, requiring dynamic architectures that can adapt to the limitations of the available resources. To meet such demands, the layer dropping ($\mathcal{LD}$) approach is typically used to transform static models into dynamic ones by skipping parts of the network, thereby reducing overall computational complexity. However, existing $\mathcal{LD}$ methods greatly degrade the dynamic model's performance in both the low- and high-dropping cases, deteriorating the performance-computation trade-off. To this end, we propose a distillation-based layer dropping (DLD) framework that effectively combines the capabilities of knowledge distillation and $\mathcal{LD}$ in an end-to-end fashion, thereby achieving state-of-the-art performance for dynamic speech networks. Comprehensive experimentation utilizing well-known speech recognition methods, including conformer and WavLM, on three public benchmarks demonstrates the effectiveness of our framework, reducing the word error rate by 9.32% and 2.25% for the high- and no-dropping cases with a 33.3% reduction in training time.
zh
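
为说明 DLD“知识蒸馏 + 层跳过”的端到端组合方式,下面给出一个最小训练步骤草图(非论文官方代码):以不丢层的完整前向作为蒸馏目标,对随机丢层的子网输出施加 KL 蒸馏损失;网络结构、丢弃概率等均为示例假设。

```python
# 最小训练步骤示意(非 DLD 官方代码):随机层跳过 + 知识蒸馏的端到端组合。
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropStack(nn.Module):
    def __init__(self, dim=32, depth=6, n_cls=10):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(depth))
        self.head = nn.Linear(dim, n_cls)

    def forward(self, x, p_drop=0.0):
        for layer in self.layers:
            if self.training and torch.rand(()) < p_drop:
                continue                        # skip this layer (identity path)
            x = x + layer(x)                    # residual block keeps shapes aligned
        return self.head(x)

model = DropStack()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))

model.train()
with torch.no_grad():
    teacher = model(x, p_drop=0.0)              # full network as distillation target
student = model(x, p_drop=0.5)                  # sub-network with layers dropped
kd = F.kl_div(F.log_softmax(student, -1), F.softmax(teacher, -1),
              reduction="batchmean")
loss = F.cross_entropy(student, y) + kd         # task loss + distillation loss
loss.backward()
opt.step()
print(f"loss = {loss.item():.3f}")
```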

[CV-10] Clustering-Guided Spatial-Spectral Mamba for Hyperspectral Image Classification

【速读】:该论文旨在解决Mamba模型在高光谱图像(Hyperspectral Image, HSI)分类任务中难以定义高效且自适应的token序列以提升性能的问题。其核心解决方案是提出CSSMamba(Clustering-guided Spatial-Spectral Mamba)框架,关键创新在于:1)设计了基于聚类引导的空间Mamba模块(CSpaMamba),通过引入聚类机制减少token序列长度并增强特征学习能力;2)融合空间与光谱Mamba模块(SpeMamba),实现多模态信息协同学习;3)引入注意力驱动的token选择机制优化token排序;4)构建可学习聚类模块(Learnable Clustering Module),使聚类归属关系能够自适应学习,从而实现聚类与Mamba架构的无缝集成。

链接: https://arxiv.org/abs/2601.16098
作者: Zack Dewis,Yimin Zhu,Zhengsen Xu,Mabel Heffring,Saeid Taleghanidoozdoozan,Quinn Ledingham,Lincoln Linlin Xu
机构: University of Calgary (卡尔加里大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 5 pages, 3 figures

点击查看摘要

Abstract:Although Mamba models greatly improve Hyperspectral Image (HSI) classification, they face critical challenges in defining efficient and adaptive token sequences for improved performance. This paper therefore presents the CSSMamba (Clustering-guided Spatial-Spectral Mamba) framework to address these challenges, with the following contributions. First, to achieve efficient and adaptive token sequences for improved Mamba performance, we integrate the clustering mechanism into a spatial Mamba architecture, leading to a cluster-guided spatial Mamba module (CSpaMamba) that reduces the Mamba sequence length and improves Mamba feature learning capability. Second, to improve the learning of both spatial and spectral information, we integrate the CSpaMamba module with a spectral Mamba module (SpeMamba), leading to a complete clustering-guided spatial-spectral Mamba framework. Third, to further improve feature learning capability, we introduce an Attention-Driven Token Selection mechanism to optimize Mamba token sequencing. Last, to seamlessly integrate clustering into the Mamba model, we design a Learnable Clustering Module that learns the cluster memberships in an adaptive manner. Experiments on the Pavia University, Indian Pines, and Liao-Ning 01 datasets demonstrate that CSSMamba achieves higher accuracy and better boundary preservation compared to state-of-the-art CNN, Transformer, and Mamba-based methods.
zh

[CV-11] Neural Particle Automata: Learning Self-Organizing Particle Dynamics

【速读】:该论文旨在解决传统神经细胞自动机(Neural Cellular Automata, NCA)在处理动态系统时的局限性,特别是其基于静态网格(static lattices)的欧拉表示方式无法有效建模具有连续位置和异质动力学的粒子系统的问题。为此,作者提出神经粒子自动机(Neural Particle Automata, NPA),其核心创新在于将NCA从固定网格推广到拉格朗日框架下的粒子系统:每个细胞被建模为具有连续位置和内部状态的粒子,其状态演化由共享且可学习的神经规则统一驱动。关键解决方案是引入可微分的光滑粒子流体动力学(Smoothed Particle Hydrodynamics, SPH)算子替代传统的网格邻域感知机制,结合内存高效的CUDA加速内核,从而实现粒子间局部交互的线性复杂度计算,支持大规模、端到端训练。这一设计既保留了NCA的鲁棒性和自再生能力,又拓展出适用于粒子系统的新型自组织行为。

链接: https://arxiv.org/abs/2601.16096
作者: Hyunsoo Kim,Ehsan Pajouheshgar,Sabine Süsstrunk,Wenzel Jakob,Jinah Park
机构: Korea Advanced Institute of Science and Technology (韩国科学技术院); EPFL (洛桑联邦理工学院)
类目: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 15 figures

点击查看摘要

Abstract:We introduce Neural Particle Automata (NPA), a Lagrangian generalization of Neural Cellular Automata (NCA) from static lattices to dynamic particle systems. Unlike classical Eulerian NCA where cells are pinned to pixels or voxels, NPA model each cell as a particle with a continuous position and internal state, both updated by a shared, learnable neural rule. This particle-based formulation yields clear individuation of cells, allows heterogeneous dynamics, and concentrates computation only on regions where activity is present. At the same time, particle systems pose challenges: neighborhoods are dynamic, and a naive implementation of local interactions scale quadratically with the number of particles. We address these challenges by replacing grid-based neighborhood perception with differentiable Smoothed Particle Hydrodynamics (SPH) operators backed by memory-efficient, CUDA-accelerated kernels, enabling scalable end-to-end training. Across tasks including morphogenesis, point-cloud classification, and particle-based texture synthesis, we show that NPA retain key NCA behaviors such as robustness and self-regeneration, while enabling new behaviors specific to particle systems. Together, these results position NPA as a compact neural model for learning self-organizing particle dynamics.
zh
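
下面的示意片段(非官方实现)演示 NPA 所依赖的 SPH 式邻域感知:用平滑核按距离加权聚合邻近粒子状态。此处为便于阅读采用 O(N²) 朴素实现,核函数形式与半径 h 为假设;论文为可扩展训练使用了 CUDA 加速的高效算子。

```python
# 仅作示意(非 NPA 官方实现):SPH 式核加权的邻域状态聚合,O(N^2) 朴素版本。
import torch

def poly6_kernel(r2: torch.Tensor, h: float) -> torch.Tensor:
    """Poly6-style smoothing kernel on squared distances, zero beyond radius h."""
    w = (h * h - r2).clamp(min=0.0) ** 3
    return w / (w.sum(dim=-1, keepdim=True) + 1e-8)   # normalize per particle

def sph_aggregate(pos: torch.Tensor,    # [N, 2] continuous particle positions
                  state: torch.Tensor,  # [N, C] particle internal states
                  h: float = 0.3) -> torch.Tensor:
    r2 = torch.cdist(pos, pos) ** 2          # [N, N] pairwise squared distances
    w = poly6_kernel(r2, h)                  # [N, N] neighbor weights
    return w @ state                         # [N, C] kernel-weighted neighborhood state

torch.manual_seed(0)
pos, state = torch.rand(64, 2), torch.randn(64, 8)
perceived = sph_aggregate(pos, state)        # perception input to the shared update rule
print(perceived.shape)                       # torch.Size([64, 8])
```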

[CV-12] SAMTok: Representing Any Mask with Two Words

【速读】:该论文旨在解决像素级多模态大语言模型(Multi-modal Large Language Models, MLLMs)难以扩展的问题,其核心挑战包括复杂的区域编码器、专用的分割解码器以及不兼容的训练目标。解决方案的关键在于提出SAMTok——一种离散掩码分词器(discrete mask tokenizer),它将任意区域掩码转换为两个特殊标记,并通过这些标记以高保真度重建掩码。SAMTok将掩码视为新的语言令牌,使基础MLLM(如QwenVL系列)能够通过标准的下一个令牌预测和简单的强化学习来学习像素级能力,而无需架构修改或专门设计损失函数。这一方法基于SAM2训练,在2.09亿多样掩码数据上使用掩码编码器与残差向量量化器生成紧凑且信息丰富的离散令牌,最终在多个像素级任务上实现最先进的性能。

链接: https://arxiv.org/abs/2601.16093
作者: Yikang Zhou,Tao Zhang,Dengxian Gong,Yuanzheng Wu,Ye Tian,Haochen Wang,Haobo Yuan,Jiacong Wang,Lu Qi,Hao Fei,Anran Wang,Zhuochen Wang,Yujing Wang,Cheng Chen,Shunping Ji,Xiangtai Li
机构: ByteDance(字节跳动); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, 11 figures

点击查看摘要

Abstract:Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask using these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications and specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.
zh
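
SAMTok 的核心是用残差向量量化(RVQ)把连续掩膜嵌入离散成少量码字索引。下面是一段 RVQ 编码的示意代码(非官方实现),码本大小、级数与嵌入维度均为假设。

```python
# 示意草图(非 SAMTok 官方实现):残差向量量化,逐级用码本逼近残差。
import torch

def rvq_encode(z, codebooks):
    """Quantize z with a list of codebooks; return code indices and reconstruction."""
    residual, idxs, recon = z.clone(), [], torch.zeros_like(z)
    for cb in codebooks:                                # cb: [K, D]
        d = torch.cdist(residual, cb)                   # [B, K] distances to codewords
        i = d.argmin(dim=-1)                            # nearest codeword per sample
        q = cb[i]
        idxs.append(i)
        recon = recon + q
        residual = residual - q                         # next stage quantizes the residual
    return torch.stack(idxs, dim=-1), recon

torch.manual_seed(0)
codebooks = [torch.randn(256, 64) for _ in range(2)]    # two stages, 256 codes each
z = torch.randn(4, 64)                                  # continuous mask embeddings
idx, z_hat = rvq_encode(z, codebooks)
print("codes per mask:", idx.shape[-1], "recon err:", (z - z_hat).norm().item())
```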

[CV-13] Masked Modeling for Human Motion Recovery Under Occlusions

【速读】:该论文旨在解决单目视频中人体运动重建(Human Motion Reconstruction)在实际场景下频繁遮挡(occlusion)条件下的鲁棒性不足问题。现有基于回归的方法虽高效但对缺失观测敏感,而优化或扩散方法虽鲁棒却存在推理速度慢和预处理复杂的问题。其解决方案的关键在于提出一种名为MoRo(Masked Modeling for human motion Recovery under Occlusions)的端到端生成式框架,利用生成式掩码建模(Generative Masked Modeling)将运动重建建模为视频条件任务,在不依赖额外后处理的前提下自然应对遮挡,并实现全局坐标系一致的高效推理。通过跨模态学习机制融合多源先验:动作捕捉数据驱动的轨迹感知运动先验、图像-姿态数据训练的帧级姿态先验,以及视频掩码Transformer融合视觉与运动动态信息,从而在EgoBody和RICH等数据集上显著提升遮挡场景下的精度与运动真实性,同时保持非遮挡场景性能相当,并实现在单张H200 GPU上70 FPS的实时推理能力。

链接: https://arxiv.org/abs/2601.16079
作者: Zhiyin Qian,Siwei Zhang,Bharat Lal Bhatnagar,Federica Bogo,Siyu Tang
机构: ETH Zürich; Meta Reality Labs
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Human motion reconstruction from monocular videos is a fundamental problem in computer vision, with broad applications in AR/VR, robotics, and digital content creation, but remains challenging under frequent occlusions in real-world scenarios. Existing regression-based methods are efficient but fragile to missing observations, while optimization- and diffusion-based approaches improve robustness at the cost of slow inference speed and heavy preprocessing steps. To address these limitations, we leverage recent advances in generative masked modeling and present MoRo: Masked Modeling for human motion Recovery under Occlusions. MoRo is an occlusion-robust, end-to-end generative framework that formulates motion reconstruction as a video-conditioned task, and efficiently recovers human motion in a consistent global coordinate system from RGB videos. Through masked modeling, MoRo naturally handles occlusions while enabling efficient, end-to-end inference. To overcome the scarcity of paired video-motion data, we design a cross-modality learning scheme that learns multi-modal priors from a set of heterogeneous datasets: (i) a trajectory-aware motion prior trained on MoCap datasets, (ii) an image-conditioned pose prior trained on image-pose datasets, capturing diverse per-frame poses, and (iii) a video-conditioned masked transformer that fuses motion and pose priors, finetuned on video-motion datasets to integrate visual cues with motion dynamics for robust inference. Extensive experiments on EgoBody and RICH demonstrate that MoRo substantially outperforms state-of-the-art methods in accuracy and motion realism under occlusions, while performing on-par in non-occluded scenarios. MoRo achieves real-time inference at 70 FPS on a single H200 GPU.
zh

[CV-14] DSFedMed: Dual-Scale Federated Medical Image Segmentation via Mutual Distillation Between Foundation and Lightweight Models

【速读】:该论文旨在解决基础模型(Foundation Models, FMs)在联邦学习(Federated Learning)场景下部署时面临的高计算开销、通信成本大以及推理延迟高等问题,尤其是在医疗图像分割任务中的资源受限环境。其解决方案的关键在于提出一种双尺度联邦框架DSFedMed,通过中心化基础模型与轻量化客户端模型之间的相互知识蒸馏(mutual knowledge distillation),实现知识双向迁移:一方面,基础模型将通用医学视觉知识传递给客户端模型;另一方面,客户端模型将本地特异性信息反馈至基础模型以优化其性能。此外,研究设计了一种基于可学习性引导的样本选择策略,并利用高质量合成医学图像替代真实公共数据集,从而显著提升蒸馏效率与效果,最终在五个医疗影像分割数据集上实现了平均Dice分数提升2%,同时通信成本和推理时间降低近90%。

链接: https://arxiv.org/abs/2601.16073
作者: Hanwen Zhang,Qiaojin Shen,Yuxi Liu,Yuesheng Zhu,Guibo Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Foundation Models (FMs) have demonstrated strong generalization across diverse vision tasks. However, their deployment in federated settings is hindered by high computational demands, substantial communication overhead, and significant inference costs. We propose DSFedMed, a dual-scale federated framework that enables mutual knowledge distillation between a centralized foundation model and lightweight client models for medical image segmentation. To support knowledge distillation, a set of high-quality medical images is generated to replace real public datasets, and a learnability-guided sample selection strategy is proposed to enhance efficiency and effectiveness in dual-scale distillation. This mutual distillation enables the foundation model to transfer general knowledge to lightweight clients, while also incorporating client-specific insights to refine the foundation model. Evaluations on five medical imaging segmentation datasets show that DSFedMed achieves an average 2 percent improvement in Dice score while reducing communication costs and inference time by nearly 90 percent compared to existing federated foundation model baselines. These results demonstrate significant efficiency gains and scalability for resource-limited federated deployments.
zh

[CV-15] DTP: A Simple yet Effective Distracting Token Pruning Framework for Vision-Language Action Models

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language Action, VLA)模型在机器人操作任务中因过度关注图像中与任务无关区域所导致的“干扰令牌”(distracting tokens)问题,该现象会扰乱模型生成正确动作令牌的能力,从而降低任务成功率。解决方案的关键在于提出一种简单且可插拔的干扰令牌剪枝(Distracting Token Pruning, DTP)框架,该框架能够动态检测并移除这些干扰图像令牌,从而修正模型的视觉注意力分布,在不改变原始模型架构或引入额外输入的前提下提升任务成功率,并揭示了所有测试模型中任务成功率与无关区域注意力强度之间的负相关性,为未来VLA模型研究提供了重要指导。

链接: https://arxiv.org/abs/2601.16065
作者: Chenyang Li,Jieyuan Liu,Bin Li,Bo Gao,Yilin Yuan,Yangfan He,Yuchen Li,Jingqun Tang
机构: Australian National University (澳大利亚国立大学); University of California, San Diego (加州大学圣地亚哥分校); Chinese Academy of Sciences (中国科学院); Beijing Institute of Graphic Communication (北京印刷学院); University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校); Baidu Search (百度搜索); Bytedance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation by leveraging the powerful perception abilities of Vision-Language Models (VLMs) to understand environments and directly output actions. However, by default, VLA models may overly attend to image tokens in the task-irrelevant region, which we describe as ‘distracting tokens’. This behavior can disturb the model from the generation of the desired action tokens in each step, affecting the success rate of tasks. In this paper, we introduce a simple yet effective plug-and-play Distracting Token Pruning (DTP) framework, which dynamically detects and prunes these distracting image tokens. By correcting the model’s visual attention patterns, we aim to improve the task success rate, as well as exploring the performance upper boundaries of the model without altering its original architecture or adding additional inputs. Experiments on the SIMPLER Benchmark (Li et al., 2024) show that our method consistently achieving relative improvements in task success rates across different types of novel VLA models, demonstrating generalizability to transformer-based VLAs. Further analysis reveals a negative correlation between the task success rate and the amount of attentions in the task-irrelevant region for all models tested, highlighting a common phenomenon of VLA models that could guide future research. We also publish our code at: this https URL.
zh
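
DTP 的思想可以用几行代码说明:按文本 token 对图像 token 的平均注意力打分,仅保留得分最高的图像 token,其余视为干扰 token 剪除。以下为示意草图(非论文官方实现),保留比例与打分方式为假设。

```python
# 最小示意(非 DTP 官方实现):依据文本->图像注意力剪除低相关图像 token。
import torch

def prune_distracting_tokens(attn: torch.Tensor,        # [n_text, n_img] attention
                             img_tokens: torch.Tensor,  # [n_img, D] image tokens
                             keep_ratio: float = 0.5):
    score = attn.mean(dim=0)                        # relevance of each image token
    k = max(1, int(keep_ratio * img_tokens.size(0)))
    keep = score.topk(k).indices.sort().values      # preserve spatial order
    return img_tokens[keep], keep

torch.manual_seed(0)
attn = torch.softmax(torch.randn(8, 196), dim=-1)   # e.g., 14x14 patch tokens
img_tokens = torch.randn(196, 768)
kept, idx = prune_distracting_tokens(attn, img_tokens, keep_ratio=0.25)
print("kept", kept.shape[0], "of 196 image tokens")
```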

[CV-16] ProGiDiff: Prompt-Guided Diffusion-Based Medical Image Segmentation

【速读】:该论文旨在解决当前医学图像分割方法在处理多提案估计、人机交互以及跨模态适应方面能力不足的问题,尤其是在缺乏自然语言提示(natural language prompt)支持的情况下。现有方法多为确定性模型,难以灵活响应多样化任务需求。其解决方案的关键在于提出一种名为ProGiDiff的新框架,该框架利用预训练的图像生成扩散模型,并引入一种类似ControlNet的条件控制机制,结合自定义编码器实现对图像的条件引导,从而生成分割掩膜;该机制天然支持多类别分割(仅需通过自然语言提示指定目标器官),并通过低秩微调实现从CT到MRI图像的快速跨模态迁移,显著提升了模型的泛化能力和交互灵活性。

链接: https://arxiv.org/abs/2601.16060
作者: Yuan Lin,Murong Xu,Marc Hölle,Chinmay Prabhakar,Andreas Maier,Vasileios Belagiannis,Bjoern Menze,Suprosanna Shit
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 4 figures. It has been accepted by IEEE ISBI

点击查看摘要

Abstract:Widely adopted medical image segmentation methods, although efficient, are primarily deterministic and remain poorly amenable to natural language prompts. As a result, they lack the ability to produce multiple proposals, support human interaction, and adapt across modalities. Recently, text-to-image diffusion models have shown potential to bridge this gap. However, training them from scratch requires a large dataset, a limitation for medical image segmentation. Furthermore, they are often limited to binary segmentation and cannot be conditioned on a natural language prompt. To this end, we propose a novel framework called ProGiDiff that leverages existing image generation models for medical image segmentation. Specifically, we propose a ControlNet-style conditioning mechanism with a custom encoder, suitable for image conditioning, to steer a pre-trained diffusion model to output segmentation masks. It naturally extends to a multi-class setting simply by prompting the target organ. Our experiments on organ segmentation from CT images demonstrate strong performance compared to previous methods, and the approach can further benefit from an expert-in-the-loop setting that leverages multiple proposals. Importantly, we demonstrate that the learned conditioning mechanism can be easily transferred through low-rank, few-shot adaptation to segment MR images.
zh

[CV-17] DextER: Language-driven Dexterous Grasp Generation with Embodied Reasoning

【速读】:该论文旨在解决语言驱动的灵巧抓取生成问题,即如何让模型在理解任务语义、三维几何结构和复杂手-物交互关系的基础上,生成符合物理约束的高质量抓取策略。现有方法通常直接从观测映射到抓取参数,缺乏对物理交互过程的中间推理,导致生成结果与任务意图不一致或难以控制。解决方案的关键在于提出DextER框架,其核心创新是引入基于接触的具身推理(contact-based embodied reasoning),通过预测手部各指节在物体表面的接触位置,构建一个融合任务语义与物理约束的中间表示;随后分步生成接触标记(embodied contact tokens)和抓取标记(grasp tokens),实现更准确、可解释且可控的灵巧抓取合成。

链接: https://arxiv.org/abs/2601.16046
作者: Junha Lee,Eunha Park,Minsu Cho
机构: Pohang University of Science and Technology (POSTECH); RLWRLD
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Language-driven dexterous grasp generation requires the models to understand task semantics, 3D geometry, and complex hand-object interactions. While vision-language models have been applied to this problem, existing approaches directly map observations to grasp parameters without intermediate reasoning about physical interactions. We present DextER, Dexterous Grasp Generation with Embodied Reasoning, which introduces contact-based embodied reasoning for multi-finger manipulation. Our key insight is that predicting which hand links contact where on the object surface provides an embodiment-aware intermediate representation bridging task semantics with physical constraints. DextER autoregressively generates embodied contact tokens specifying which finger links contact where on the object surface, followed by grasp tokens encoding the hand configuration. On DexGYS, DextER achieves 67.14% success rate, outperforming state-of-the-art by 3.83%p with 96.4% improvement in intention alignment. We also demonstrate steerable generation through partial contact specification, providing fine-grained control over grasp synthesis.
zh

[CV-18] PAINT: Pathology-Aware Integrated Next-Scale Transformation for Virtual Immunohistochemistry

【速读】:该论文旨在解决虚拟免疫组化(virtual immunohistochemistry, virtual IHC)中因H&E染色形态信息模糊导致的分子表达模式生成不准确问题,尤其针对相似组织结构可能对应不同分子状态所带来的语义不一致挑战。现有方法多采用直接图像翻译策略,缺乏对结构先验的显式建模,易产生结构性失真。其解决方案的关键在于提出Pathology-Aware Integrated Next-Scale Transformation (PAINT),一种结构优先的视觉自回归框架,通过引入空间结构起始图(Spatial Structural Start Map, 3S-Map)实现以全局结构布局为条件的逐步生成,从而在生成过程中强制因果顺序并确保空间对齐与确定性,显著提升合成图像的结构保真度及临床下游任务性能。

链接: https://arxiv.org/abs/2601.16024
作者: Rongze Ma,Mengkang Lu,Zhenyu Xiang,Yongsheng Pan,Yicheng Wu,Qingjie Zeng,Yong Xia
机构: Northwestern Polytechnical University (西北工业大学); Monash University (莫纳什大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Virtual immunohistochemistry (IHC) aims to computationally synthesize molecular staining patterns from routine Hematoxylin and Eosin (H&E) images, offering a cost-effective and tissue-efficient alternative to traditional physical staining. However, this task is particularly challenging: H&E morphology provides ambiguous cues about protein expression, and similar tissue structures may correspond to distinct molecular states. Most existing methods focus on direct appearance synthesis to implicitly achieve cross-modal generation, often resulting in semantic inconsistencies due to insufficient structural priors. In this paper, we propose Pathology-Aware Integrated Next-Scale Transformation (PAINT), a visual autoregressive framework that reformulates the synthesis process as a structure-first conditional generation task. Unlike direct image translation, PAINT enforces a causal order by resolving molecular details conditioned on a global structural layout. Central to this approach is the introduction of a Spatial Structural Start Map (3S-Map), which grounds the autoregressive initialization in observed morphology, ensuring deterministic, spatially aligned synthesis. Experiments on the IHC4BC and MIST datasets demonstrate that PAINT outperforms state-of-the-art methods in structural fidelity and clinical downstream tasks, validating the potential of structure-guided autoregressive modeling.
zh

[CV-19] Keyframe-Based Feed-Forward Visual Odometry

【速读】:该论文旨在解决基于视觉基础模型(visual foundation models)的视觉里程计(VO)方法在处理原始图像序列时存在的计算冗余与性能下降问题,尤其当相邻帧间视差较低、缺乏足够的立体上下文信息时。现有方法如VGGT-Long通常对所有输入帧进行无差别处理,未能利用传统关键帧(keyframe)策略提升效率和精度。其核心挑战在于如何将几何启发式规则(geometric heuristics)有效融入依赖高维潜在表示而非显式几何度量的深度学习框架中。解决方案的关键在于提出一种基于强化学习(reinforcement learning)的自适应关键帧选择机制,通过数据驱动方式学习与基础模型内在特性对齐的键帧策略,从而在不依赖人工规则的前提下优化推理效率与定位准确性。

链接: https://arxiv.org/abs/2601.16020
作者: Weichen Dai,Wenhan Su,Da Kong,Yuhang Ming,Wanzeng Kong
机构: Hangzhou Dianzi University (杭州电子科技大学); Technion - Israel Institute of Technology (以色列理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:The emergence of visual foundation models has revolutionized visual odometry~(VO) and SLAM, enabling pose estimation and dense reconstruction within a single feed-forward network. However, unlike traditional pipelines that leverage keyframe methods to enhance efficiency and accuracy, current foundation model based methods, such as VGGT-Long, typically process raw image sequences indiscriminately. This leads to computational redundancy and degraded performance caused by low inter-frame parallax, which provides limited contextual stereo information. Integrating traditional geometric heuristics into these methods is non-trivial, as their performance depends on high-dimensional latent representations rather than explicit geometric metrics. To bridge this gap, we propose a novel keyframe-based feed-forward VO. Instead of relying on hand-crafted rules, our approach employs reinforcement learning to derive an adaptive keyframe policy in a data-driven manner, aligning selection with the intrinsic characteristics of the underlying foundation model. We train our agent on TartanAir dataset and conduct extensive evaluations across several real-world datasets. Experimental results demonstrate that the proposed method achieves consistent and substantial improvements over state-of-the-art feed-forward VO methods.
zh

[CV-20] PhysicsMind: Sim and Real Mechanics Benchmarking for Physical Reasoning and Prediction in Foundational VLMs and World Models

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)和视频世界模型在物理规律理解上的不足问题,现有评估基准或依赖合成视觉问答模板,或仅关注感知质量而非物理一致性,缺乏对力学原理(如质心、杠杆平衡和牛顿第一定律)的系统性测试。解决方案的关键在于提出PhysicsMind这一统一基准,涵盖真实与仿真环境,通过两类任务进行评估:一是视觉问答(VQA)任务,检验模型能否从图像或短视频中推理出物理量;二是视频生成(VG)任务,评估预测轨迹是否遵循真实的质心、力矩和惯性约束。实证表明,当前主流模型普遍依赖外观启发式策略,违反基础力学规则,凸显了物理理解能力的显著缺口,也为未来物理感知多模态模型的发展提供了关键测试平台。

链接: https://arxiv.org/abs/2601.16007
作者: Chak-Wing Mak,Guanyu Zhu,Boyi Zhang,Hongji Li,Xiaowei Chi,Kevin Zhang,Yichen Wu,Yangfan He,Chun-Kai Fan,Wentao Lu,Kuangzhi Ge,Xinyu Fang,Hongyang He,Kuan Lu,Tianxiang Xu,Li Zhang,Yongxin Ni,Youhua Li,Shanghang Zhang
机构: Peking University (北京大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); National University of Singapore (新加坡国立大学); University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校); University of Science and Technology of China (中国科学技术大学); Manifold.AI (Manifold.AI); Cornell University (康奈尔大学); Hong Kong Polytechnic University (香港理工大学); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern foundational Multimodal Large Language Models (MLLMs) and video world models have advanced significantly in mathematical, common-sense, and visual reasoning, but their grasp of the underlying physics remains underexplored. Existing benchmarks attempting to measure this matter rely on synthetic, Visual Question Answer templates or focus on perceptual video quality that is tangential to measuring how well the video abides by physical laws. To address this fragmentation, we introduce PhysicsMind, a unified benchmark with both real and simulation environments that evaluates law-consistent reasoning and generation over three canonical principles: Center of Mass, Lever Equilibrium, and Newton’s First Law. PhysicsMind comprises two main tasks: i) VQA tasks, testing whether models can reason and determine physical quantities and values from images or short videos, and ii) Video Generation(VG) tasks, evaluating if predicted motion trajectories obey the same center-of-mass, torque, and inertial constraints as the ground truth. A broad range of recent models and video generation models is evaluated on PhysicsMind and found to rely on appearance heuristics while often violating basic mechanics. These gaps indicate that current scaling and training are still insufficient for robust physical understanding, underscoring PhysicsMind as a focused testbed for physics-aware multimodal models. Our data will be released upon acceptance.
zh
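
以基准涉及的“杠杆平衡”为例,下面用一段小示例说明此类物理一致性检验的思路:对支点求合力矩,近似为零即平衡。数值与支点设定均为本文假设,与基准官方评测脚本无关。

```python
# 示意片段(非 PhysicsMind 官方评测):用合力矩检验杠杆平衡这一物理约束。
def lever_net_torque(masses, positions, pivot, g=9.8):
    """Net torque (N*m) about the pivot; ~0 means the lever is in equilibrium."""
    return sum(m * g * (x - pivot) for m, x in zip(masses, positions))

# Two weights on a beam: 2 kg at 0.3 m and 3 kg at 0.8 m, pivot at 0.6 m.
tau = lever_net_torque([2.0, 3.0], [0.3, 0.8], pivot=0.6)
print(f"net torque = {tau:+.2f} N*m ->", "balanced" if abs(tau) < 1e-6 else "tips")
# 2*9.8*(-0.3) + 3*9.8*(0.2) = -5.88 + 5.88 = 0 -> balanced
```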

[CV-21] HyperAlign: Hypernetwork for Efficient Test-Time Alignment of Diffusion Models

【速读】:该论文旨在解决扩散模型(Diffusion Models)在生成过程中难以对齐人类偏好与意图的问题,导致图像美学质量差和语义不一致。现有对齐方法存在两难:微调策略易因奖励过优化而丧失多样性,而测试时扩展方法则计算开销大且优化不足。其解决方案的关键在于提出HyperAlign框架,通过训练一个超网络(Hypernetwork)动态生成低秩适配权重,以调节扩散模型的生成算子,而非直接修改潜在状态;该机制可根据输入潜变量、时间步和提示词自适应调整去噪轨迹,实现奖励条件下的高效对齐,并通过带偏好数据正则化的奖励评分目标减少奖励劫持(Reward Hacking)。

链接: https://arxiv.org/abs/2601.15968
作者: Xin Xie,Jiaxian Guo,Dong Gong
机构: University of New South Wales (UNSW Sydney); Google Research
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models achieve state-of-the-art performance but often fail to generate outputs that align with human preferences and intentions, resulting in images with poor aesthetic quality and semantic inconsistencies. Existing alignment methods present a difficult trade-off: fine-tuning approaches suffer from loss of diversity with reward over-optimization, while test-time scaling methods introduce significant computational overhead and tend to under-optimize. To address these limitations, we propose HyperAlign, a novel framework that trains a hypernetwork for efficient and effective test-time alignment. Instead of modifying latent states, HyperAlign dynamically generates low-rank adaptation weights to modulate the diffusion model’s generation operators. This allows the denoising trajectory to be adaptively adjusted based on input latents, timesteps and prompts for reward-conditioned alignment. We introduce multiple variants of HyperAlign that differ in how frequently the hypernetwork is applied, balancing between performance and efficiency. Furthermore, we optimize the hypernetwork using a reward score objective regularized with preference data to reduce reward hacking. We evaluate HyperAlign on multiple extended generative paradigms, including Stable Diffusion and FLUX. It significantly outperforms existing fine-tuning and test-time scaling baselines in enhancing semantic consistency and visual appeal.
zh
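
下面给出 HyperAlign 核心机制的概念性草图(非官方实现):一个小型超网络根据上下文(如潜变量与时间步的汇总特征)生成低秩矩阵 A、B,以 W + BA 的形式调制冻结的生成算子。各维度与上下文编码方式均为假设。

```python
# 概念性草图(非 HyperAlign 官方实现):超网络按上下文生成 LoRA 式低秩更新。
import torch
import torch.nn as nn

class LoRAHyperNet(nn.Module):
    def __init__(self, ctx_dim=64, d_in=128, d_out=128, rank=4):
        super().__init__()
        self.rank, self.d_in, self.d_out = rank, d_in, d_out
        self.gen = nn.Linear(ctx_dim, rank * (d_in + d_out))  # emits A and B jointly

    def forward(self, ctx):                       # ctx: [B, ctx_dim] conditioning
        flat = self.gen(ctx)
        A = flat[:, : self.rank * self.d_in].view(-1, self.rank, self.d_in)
        B = flat[:, self.rank * self.d_in :].view(-1, self.d_out, self.rank)
        return A, B

base = nn.Linear(128, 128)                        # frozen generation operator
for p in base.parameters():
    p.requires_grad_(False)

hyper = LoRAHyperNet()
x, ctx = torch.randn(2, 128), torch.randn(2, 64)  # latent input + (t, prompt) context
A, B = hyper(ctx)
delta = torch.bmm(B, A)                           # [B, d_out, d_in] low-rank update
y = base(x) + torch.bmm(delta, x.unsqueeze(-1)).squeeze(-1)
print(y.shape)                                    # torch.Size([2, 128])
```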

[CV-22] EVolSplat4D: Efficient Volume-based Gaussian Splatting for 4D Urban Scene Synthesis

【速读】:该论文旨在解决静态与动态城市场景的新型视图合成(Novel View Synthesis, NVS)问题,尤其在自动驾驶仿真中对重建质量与速度难以兼顾的挑战。现有方法中,基于神经辐射场(Neural Radiance Fields)和3D高斯泼溅(3D Gaussian Splatting)的优化类方法虽能实现高保真度,但依赖逐场景优化导致效率低下;而前馈式方法常采用逐像素高斯表示,在复杂动态环境中易产生三维不一致性。其解决方案的关键在于提出EvolSplat4D框架,通过三个专用分支统一体积(volume-based)与像素(pixel-based)高斯预测:近距静态区域直接从3D特征体素预测一致的3D高斯几何,并结合语义增强的图像渲染模块生成外观;动态目标利用对象中心的规范空间与运动调整渲染模块聚合时序特征,提升4D重建稳定性;远场背景则由高效逐像素高斯分支保障全场景覆盖。该设计实现了高质量、高效率且具鲁棒性的4D场景重建。

链接: https://arxiv.org/abs/2601.15951
作者: Sheng Miao,Sijin Li,Pan Wang,Dongfeng Bai,Bingbing Liu,Yue Wang,Andreas Geiger,Yiyi Liao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Novel view synthesis (NVS) of static and dynamic urban scenes is essential for autonomous driving simulation, yet existing methods often struggle to balance reconstruction time with quality. While state-of-the-art neural radiance fields and 3D Gaussian Splatting approaches achieve photorealism, they often rely on time-consuming per-scene optimization. Conversely, emerging feed-forward methods frequently adopt per-pixel Gaussian representations, which lead to 3D inconsistencies when aggregating multi-view predictions in complex, dynamic environments. We propose EvolSplat4D, a feed-forward framework that moves beyond existing per-pixel paradigms by unifying volume-based and pixel-based Gaussian prediction across three specialized branches. For close-range static regions, we predict consistent geometry of 3D Gaussians over multiple frames directly from a 3D feature volume, complemented by a semantically-enhanced image-based rendering module for predicting their appearance. For dynamic actors, we utilize object-centric canonical spaces and a motion-adjusted rendering module to aggregate temporal features, ensuring stable 4D reconstruction despite noisy motion priors. Far-Field scenery is handled by an efficient per-pixel Gaussian branch to ensure full-scene coverage. Experimental results on the KITTI-360, KITTI, Waymo, and PandaSet datasets show that EvolSplat4D reconstructs both static and dynamic environments with superior accuracy and consistency, outperforming both per-scene optimization and state-of-the-art feed-forward baselines.
zh

[CV-23] NeuroMamba: Multi-Perspective Feature Interaction with Visual Mamba for Neuron Segmentation

【速读】:该论文旨在解决电子显微镜(Electron Microscopy, EM)图像中神经元分割的挑战性问题,尤其是由神经元形态不规则和结构高度交织导致的边界模糊与细节丢失问题。现有基于卷积神经网络(CNN)的方法因缺乏长程上下文建模能力而难以处理复杂边界,而基于Transformer的方法则由于patch划分过程中的体素级细节损失导致边界精度下降。其解决方案的关键在于提出NeuroMamba框架,该框架利用Mamba模型线性复杂度的优势实现无patch的全局建模,并结合互补的局部特征建模机制:一是设计通道门控的边界判别特征提取器(Boundary Discriminative Feature Extractor, BDFE)以增强局部形态学线索;二是引入空间连续特征提取器(Spatial Continuous Feature Extractor, SCFE),将分辨率感知扫描机制融入视觉Mamba架构,自适应地建模多分辨率下的全局依赖关系;最终通过交叉调制机制融合多视角特征,从而在保持精细体素细节的同时高效捕捉长程依赖,显著提升神经元分割性能。

链接: https://arxiv.org/abs/2601.15929
作者: Liuyun Jiang,Yizhuo Lu,Yanchao Zhang,Jiazheng Liu,Hua Han
机构: State Key Laboratory of Brain Cognition and Brain-Inspired Intelligence Technology, Institute of Automation, Chinese Academy of Sciences; School of Future Technology, University of Chinese Academy of Sciences
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Neuron segmentation is the cornerstone of reconstructing comprehensive neuronal connectomes, which is essential for deciphering the functional organization of the brain. The irregular morphology and densely intertwined structures of neurons make this task particularly challenging. Prevailing CNN-based methods often fail to resolve ambiguous boundaries due to the lack of long-range context, whereas Transformer-based methods suffer from boundary imprecision caused by the loss of voxel-level details during patch partitioning. To address these limitations, we propose NeuroMamba, a multi-perspective framework that exploits the linear complexity of Mamba to enable patch-free global modeling and synergizes this with complementary local feature modeling, thereby efficiently capturing long-range dependencies while meticulously preserving fine-grained voxel details. Specifically, we design a channel-gated Boundary Discriminative Feature Extractor (BDFE) to enhance local morphological cues. Complementing this, we introduce the Spatial Continuous Feature Extractor (SCFE), which integrates a resolution-aware scanning mechanism into the Visual Mamba architecture to adaptively model global dependencies across varying data resolutions. Finally, a cross-modulation mechanism synergistically fuses these multi-perspective features. Our method demonstrates state-of-the-art performance across four public EM datasets, validating its exceptional adaptability to both anisotropic and isotropic resolutions. The source code will be made publicly available.
zh

[CV-24] Class Confidence Aware Reweighting for Long Tailed Learning

【速读】:该论文旨在解决深度神经网络在长尾数据分布下的性能退化问题,即模型训练过程中头部类别(head classes)样本占绝对优势,而尾部类别(tail classes)样本稀少,导致模型对尾部类别的识别能力显著下降。解决方案的关键在于提出一种基于类别和置信度的损失重加权机制(class and confidence-aware re-weighting scheme),该机制完全作用于损失层面,与现有在logit空间进行校正的方法具有互补性。其核心创新在于引入一个依赖于样本预测置信度和类别相对频率的函数 Ω(p_t, f_c),动态调节每个样本对训练任务的贡献权重,从而有效缓解类别不平衡问题并提升尾部类别的学习效果。

链接: https://arxiv.org/abs/2601.15924
作者: Brainard Philemon Jagati,Jitendra Tembhurne,Harsh Goud,Rudra Pratap Singh,Chandrashekhar Meshram
机构: Indian Institute of Information Technology Nagpur (印度信息技术学院那格浦尔分校); Jayawanti Haksar Govt. Post Graduate College (Jayawanti Haksar 政府研究生学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
备注: 9 pages, 3 figures, IEEE Transaction on Neural Networks and Learning Systems (Submitted)

点击查看摘要

Abstract:Deep neural network models degrade significantly under long-tailed data distributions, where the training data are dominated by a small set of head classes while the tail classes receive far fewer training examples. To address this class imbalance, prior work has focused mainly on adjustments in the decision space, i.e., corrections performed at the logit level to compensate for class-prior bias, while paying far less attention to the optimization process shaped by differences in confidence among samples. In this study, we present a class- and confidence-aware re-weighting scheme for long-tailed learning. The scheme operates purely at the loss level and is complementary to existing logit-adjustment methods. In its practical implementation, we use a function $\Omega(p_t, f_c)$ that modulates each sample's contribution to training based on its prediction confidence $p_t$ and the relative frequency $f_c$ of its class. Our theoretical discussion is corroborated by extensive experiments on the CIFAR-100-LT, ImageNet-LT, and iNaturalist2018 datasets under various imbalance factors.
zh
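
下面给出该重加权思想的一种示意实现(非论文官方代码):以难样本因子 (1-p_t)^γ 与类别逆频率的乘积作为 Ω(p_t, f_c) 的一个假设形式,再与交叉熵逐样本相乘;Ω 的确切定义以原文为准。

```python
# 示意实现(非论文官方代码):损失层面的类别-置信度感知重加权。
import torch
import torch.nn.functional as F

def reweighted_ce(logits, targets, class_freq, gamma=2.0):
    logp = F.log_softmax(logits, dim=-1)
    p_t = logp.exp().gather(1, targets[:, None]).squeeze(1)   # true-class confidence
    f_c = class_freq[targets]                                  # relative class frequency
    omega = (1.0 - p_t) ** gamma * (1.0 / f_c)                 # assumed Omega(p_t, f_c)
    ce = F.nll_loss(logp, targets, reduction="none")
    return (omega * ce).mean()

torch.manual_seed(0)
logits = torch.randn(8, 5)
targets = torch.randint(0, 5, (8,))
class_freq = torch.tensor([0.5, 0.2, 0.15, 0.1, 0.05])         # long-tailed prior
print(reweighted_ce(logits, targets, class_freq).item())
```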

[CV-25] A Multi-View Pipeline and Benchmark Dataset for 3D Hand Pose Estimation in Surgery

【速读】:该论文旨在解决手术环境中3D手部姿态估计(3D hand pose estimation)的难题,其挑战包括强烈的局部光照、器械或人员频繁遮挡、手套导致的手部外观一致性差,以及高质量标注数据集稀缺等问题。解决方案的关键在于提出一个无需领域特定微调的多视角鲁棒流水线,该流程仅依赖现成预训练模型,集成可靠的人员检测、全身姿态估计与跟踪手部区域上的先进2D手部关键点预测,并结合约束优化实现3D姿态重建;同时构建了一个包含68,000帧和3,000个手动标注2D手部姿态并带有三角化3D真值的新颖外科基准数据集,为该领域提供训练-free基线与标准化评估平台。

链接: https://arxiv.org/abs/2601.15918
作者: Valery Fischer,Alan Magdaleno,Anna-Katharina Calek,Nicola Cavalcanti,Nathan Hoffman,Christoph Germann,Joschua Wüthrich,Max Krähenmann,Mazda Farshad,Philipp Fürnstahl,Lilian Calvet
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Purpose: Accurate 3D hand pose estimation supports surgical applications such as skill assessment, robot-assisted interventions, and geometry-aware workflow analysis. However, surgical environments pose severe challenges, including intense and localized lighting, frequent occlusions by instruments or staff, and uniform hand appearance due to gloves, combined with a scarcity of annotated datasets for reliable model training. Method: We propose a robust multi-view pipeline for 3D hand pose estimation in surgical contexts that requires no domain-specific fine-tuning and relies solely on off-the-shelf pretrained models. The pipeline integrates reliable person detection, whole-body pose estimation, and state-of-the-art 2D hand keypoint prediction on tracked hand crops, followed by a constrained 3D optimization. In addition, we introduce a novel surgical benchmark dataset comprising over 68,000 frames and 3,000 manually annotated 2D hand poses with triangulated 3D ground truth, recorded in a replica operating room under varying levels of scene complexity. Results: Quantitative experiments demonstrate that our method consistently outperforms baselines, achieving a 31% reduction in 2D mean joint error and a 76% reduction in 3D mean per-joint position error. Conclusion: Our work establishes a strong baseline for 3D hand pose estimation in surgery, providing both a training-free pipeline and a comprehensive annotated dataset to facilitate future research in surgical computer vision.
zh
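
作为对“多视角 2D 关键点经约束优化得到 3D 姿态”这一步的补充说明,下面给出经典 DLT 三角化的自包含示例(非论文流水线本身),可视作 3D 优化的线性最小二乘初始化;相机参数为玩具设定。

```python
# 辅助示意(非论文官方流水线):多视角 2D 关键点的经典 DLT 三角化。
import numpy as np

def triangulate_dlt(proj_mats, points_2d):
    """proj_mats: list of 3x4 camera matrices; points_2d: list of (u, v)."""
    rows = []
    for P, (u, v) in zip(proj_mats, points_2d):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)              # least-squares null vector of A
    X = vt[-1]
    return X[:3] / X[3]                      # homogeneous -> Euclidean

# Two toy cameras looking at the point (0.1, -0.2, 3.0).
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])  # x-baseline
X_true = np.array([0.1, -0.2, 3.0, 1.0])
uv = [(P @ X_true)[:2] / (P @ X_true)[2] for P in (P1, P2)]
print(np.round(triangulate_dlt([P1, P2], uv), 3))                  # ~ [0.1 -0.2 3.0]
```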

[CV-26] The Latency Wall: Benchmarking Off-the-Shelf Emotion Recognition for Real-Time Virtual Avatars

【速读】:该论文旨在解决在虚拟现实(Virtual Reality, VR)环境中,针对自闭症谱系障碍(Autism Spectrum Disorder, ASD)个体进行社交技能训练时,如何实现低延迟、高准确率的实时面部表情识别(Facial Expression Recognition, FER)问题。其核心挑战在于必须满足严格的时间约束(如运动到光子延迟,Motion-to-Photon latency < 140 ms)以维持交互的连续性,而现有通用深度学习模型往往牺牲实时性以换取精度。解决方案的关键在于通过基准测试不同轻量级架构(包括YOLO系列和Vision Transformers如CLIP、SigLIP),发现尽管基于stylized avatar的面部检测表现稳定(准确率100%),但分类阶段存在显著“延迟墙”(Latency Wall);其中YOLOv11n在检测环节达到约54 ms的最优延迟,而通用Transformer类模型因无法兼顾准确性(仅23%)与速度(>150 ms),难以适用于实时闭环系统。因此,研究强调需开发轻量化、面向特定应用领域的专用神经网络架构,以推动可访问、实时的AI辅助疗法落地。

链接: https://arxiv.org/abs/2601.15914
作者: Yarin Benyamin
机构: Ben-Gurion University of the Negev (本古里安大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Technical Report benchmarking off-the-shelf CV latencies on commodity CPU hardware for therapeutic VR applications

点击查看摘要

Abstract:In the realm of Virtual Reality (VR) and Human-Computer Interaction (HCI), real-time emotion recognition shows promise for supporting individuals with Autism Spectrum Disorder (ASD) in improving social skills. This task requires a strict latency-accuracy trade-off, with motion-to-photon (MTP) latency kept below 140 ms to maintain contingency. However, most off-the-shelf Deep Learning models prioritize accuracy over the strict timing constraints of commodity hardware. As a first step toward accessible VR therapy, we benchmark State-of-the-Art (SOTA) models for Zero-Shot Facial Expression Recognition (FER) on virtual characters using the UIBVFED dataset. We evaluate Medium and Nano variants of YOLO (v8, v11, and v12) for face detection, alongside general-purpose Vision Transformers including CLIP and SigLIP. Our results on CPU-only inference demonstrate that while face detection on stylized avatars is robust (100% accuracy), a "Latency Wall" exists in the classification stage. The YOLOv11n architecture offers the optimal balance for detection (~54 ms). However, general-purpose Transformers like CLIP and SigLIP fail to achieve viable accuracy (23%) or speed (>150 ms) for real-time loops. This study highlights the necessity of lightweight, domain-specific architectures to enable accessible, real-time AI in therapeutic settings.
zh

[CV-27] Opening the Black Box: Preliminary Insights into Affective Modeling in Multimodal Foundation Models

【速读】:该论文旨在解决大规模基础模型中情感表征位置与机制不明确的问题,特别是在多模态情感场景下,尽管现有情感模型表现出优异的性能,但其内部架构如何支持情感理解与生成仍不清楚。解决方案的关键在于系统性地分析不同架构、训练策略和情感任务下,情感监督对模型参数的影响,发现情感适应主要不是发生在注意力模块,而是集中于前馈门控投影(feed-forward gating projection, \textttgate_proj)。通过受控模块迁移、单模块针对性调整及破坏性消融实验,研究进一步证明 \textttgate_proj 是实现情感理解与生成的充分、高效且必要组件;仅调整约24.5%的参数即可达到AffectGPT平均性能的96.6%,显著提升了参数效率。

链接: https://arxiv.org/abs/2601.15906
作者: Zhen Zhang,Runhao Zeng,Sicheng Zhao,Xiping Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding where and how emotions are represented in large-scale foundation models remains an open problem, particularly in multimodal affective settings. Despite the strong empirical performance of recent affective models, the internal architectural mechanisms that support affective understanding and generation are still poorly understood. In this work, we present a systematic mechanistic study of affective modeling in multimodal foundation models. Across multiple architectures, training strategies, and affective tasks, we analyze how emotion-oriented supervision reshapes internal model parameters. Our results consistently reveal a clear and robust pattern: affective adaptation does not primarily focus on the attention module, but instead localizes to the feed-forward gating projection (\textttgate_proj). Through controlled module transfer, targeted single-module adaptation, and destructive ablation, we further demonstrate that \textttgate_proj is sufficient, efficient, and necessary for affective understanding and generation. Notably, by tuning only approximately 24.5% of the parameters tuned by AffectGPT, our approach achieves 96.6% of its average performance across eight affective tasks, highlighting substantial parameter efficiency. Together, these findings provide empirical evidence that affective capabilities in foundation models are structurally mediated by feed-forward gating mechanisms and identify \textttgate_proj as a central architectural locus of affective modeling.
zh
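
该结论对应的“仅调 gate_proj”做法可以很简单地落地:冻结全部参数,仅解冻名称包含 gate_proj 的前馈门控投影。下面是一个最小示意(非论文官方代码),参数命名沿用 Hugging Face 风格的假设。

```python
# 最小示意(非论文官方代码):只解冻名称含 "gate_proj" 的参数做定向微调。
import torch.nn as nn

class ToyMLP(nn.Module):                       # stand-in for one transformer FFN block
    def __init__(self, d=32):
        super().__init__()
        self.gate_proj = nn.Linear(d, 4 * d)
        self.up_proj = nn.Linear(d, 4 * d)
        self.down_proj = nn.Linear(4 * d, d)

    def forward(self, x):
        return self.down_proj(nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))

model = nn.Sequential(*[ToyMLP() for _ in range(4)])
for name, p in model.named_parameters():
    p.requires_grad = "gate_proj" in name      # adapt only the FFN gating projection

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable}/{total} = {trainable/total:.1%}")
```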

[CV-28] ThermoSplat: Cross-Modal 3D Gaussian Splatting with Feature Modulation and Geometry Decoupling

【速读】:该论文旨在解决多模态场景重建中如何有效融合可见光(RGB)与热红外(Thermal Infrared, TIR)数据以提升环境感知鲁棒性的问题。现有基于3D高斯溅射(3D Gaussian Splatting, 3DGS)的方法在多谱段场景下难以充分挖掘跨模态互补信息,常因忽略跨模态相关性或依赖静态共享表示而无法适应不同波段间的结构差异与物理不一致性。其解决方案的关键在于提出ThermoSplat框架,通过两个核心机制实现深度光谱感知重建:一是引入跨模态FiLM调制机制(Cross-Modal FiLM Modulation),动态利用热红外结构先验引导可见光纹理合成;二是设计模态自适应几何解耦方案(Modality-Adaptive Geometric Decoupling),为热红外分支独立学习不透明度偏移并执行独立光栅化,从而缓解模态特异性几何差异问题;此外,结合显式球谐函数与隐式神经解码的混合渲染管线,保障语义一致性与高频细节保留。

链接: https://arxiv.org/abs/2601.15897
作者: Zhaoqi Su,Shihai Chen,Xinyan Lin,Liqin Huang,Zhipeng Su,Xiaoqiang Lu
机构: Fuzhou University (福州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-modal scene reconstruction integrating RGB and thermal infrared data is essential for robust environmental perception across diverse lighting and weather conditions. However, extending 3D Gaussian Splatting (3DGS) to multi-spectral scenarios remains challenging. Current approaches often struggle to fully leverage the complementary information of multi-modal data, typically relying on mechanisms that either tend to neglect cross-modal correlations or leverage shared representations that fail to adaptively handle the complex structural correlations and physical discrepancies between spectrums. To address these limitations, we propose ThermoSplat, a novel framework that enables deep spectral-aware reconstruction through active feature modulation and adaptive geometry decoupling. First, we introduce a Cross-Modal FiLM Modulation mechanism that dynamically conditions shared latent features on thermal structural priors, effectively guiding visible texture synthesis with reliable cross-modal geometric cues. Second, to accommodate modality-specific geometric inconsistencies, we propose a Modality-Adaptive Geometric Decoupling scheme that learns independent opacity offsets and executes an independent rasterization pass for the thermal branch. Additionally, a hybrid rendering pipeline is employed to integrate explicit Spherical Harmonics with implicit neural decoding, ensuring both semantic consistency and high-frequency detail preservation. Extensive experiments on the RGBT-Scenes dataset demonstrate that ThermoSplat achieves state-of-the-art rendering quality across both visible and thermal spectrums.
zh
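
下面用一个简短的 FiLM 层示意 ThermoSplat 的跨模态调制思路(非官方实现):由热红外结构先验回归逐通道的 (γ, β),对共享潜特征做仿射变换;特征与条件的维度均为假设。

```python
# 概念示意(非 ThermoSplat 官方实现):FiLM 式跨模态特征调制。
import torch
import torch.nn as nn

class CrossModalFiLM(nn.Module):
    def __init__(self, cond_dim=64, feat_ch=128):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * feat_ch)

    def forward(self, feat, cond):             # feat: [B, C, N], cond: [B, cond_dim]
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return gamma.unsqueeze(-1) * feat + beta.unsqueeze(-1)  # channel-wise affine

film = CrossModalFiLM()
feat = torch.randn(2, 128, 1024)               # shared latent features (e.g., per Gaussian)
thermal_prior = torch.randn(2, 64)             # pooled thermal structural descriptor
print(film(feat, thermal_prior).shape)         # torch.Size([2, 128, 1024])
```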

[CV-29] RadJEPA: Radiology Encoder for Chest X-Rays via Joint Embedding Predictive Architecture

【速读】:该论文旨在解决医学影像领域中视觉表征学习对图像-文本配对数据依赖性强的问题,即如何在不依赖语言监督的情况下训练出鲁棒的放射学编码器(radiology encoders)。其解决方案的关键在于提出了一种基于联合嵌入预测架构(Joint Embedding Predictive Architecture, JEPA)的自监督框架 RadJEPA,该框架仅使用未标注的胸部X光图像进行预训练,通过预测被遮蔽图像区域的潜在表示来学习视觉特征,从而实现无需语言监督的高效表征学习。这种显式的潜在空间预测目标区别于传统的图像-文本预训练和DINO风格的自蒸馏方法,显著提升了模型在疾病分类、语义分割和报告生成等下游任务中的性能表现。

链接: https://arxiv.org/abs/2601.15891
作者: Anas Anwarul Haq Khan,Mariam Husain,Kshitij Jadhav
机构: Indian Institute of Technology Bombay (印度理工学院孟买分校); Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in medical vision language models guide the learning of visual representations; however, this form of supervision is constrained by the availability of paired image text data, raising the question of whether robust radiology encoders can be learned without relying on language supervision. In this work, we introduce RadJEPA, a self-supervised framework built on a Joint Embedding Predictive Architecture that learns without language supervision. Pre-trained solely on unlabeled chest X-ray images, the model learns to predict latent representations of masked image regions. This predictive objective differs fundamentally from both image text pre-training and DINO-style self-distillation: rather than aligning global representations across views or modalities, RadJEPA explicitly models latent-space prediction. We evaluate the learned encoder on disease classification, semantic segmentation, and report generation tasks. Across benchmarks, RadJEPA achieves performance exceeding state-of-the-art approaches, including Rad-DINO.
zh

[CV-30] Understanding the Transfer Limits of Vision Foundation Models

【速读】:该论文试图解决视觉基础模型(Vision Foundation Models, VFMs)在下游任务中表现不均衡的问题,即尽管投入大量计算资源进行预训练,其在特定应用如医学影像分析中的迁移性能仍存在显著差异。解决方案的关键在于识别并优化预训练目标与下游任务之间的对齐程度:研究发现,当预训练策略(如基于掩码图像重建的MAE模型或对比学习的ProViCNet模型)更贴近下游任务需求时,例如通过最大均值差异(Maximum Mean Discrepancy, MMD)量化特征空间在微调前后的分布一致性,可显著提升迁移性能和收敛速度,从而强调了在设计预训练目标时需以下游应用为导向的重要性。

链接: https://arxiv.org/abs/2601.15888
作者: Shiqi Huang,Yipei Wang,Natasha Thorley,Alexander Ng,Shaheer Saeed,Mark Emberton,Shonit Punwani,Veeru Kasivisvanathan,Dean Barratt,Daniel Alexander,Yipeng Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: accepted in ISBI 2026

点击查看摘要

Abstract:Foundation models leverage large-scale pretraining to capture extensive knowledge, demonstrating generalization in a wide range of language tasks. By comparison, vision foundation models (VFMs) often exhibit uneven improvements across downstream tasks, despite substantial computational investment. We postulate that this limitation arises from a mismatch between pretraining objectives and the demands of downstream vision-and-imaging tasks. Pretraining strategies like masked image reconstruction or contrastive learning shape representations for tasks such as recovery of generic visual patterns or global semantic structures, which may not align with the task-specific requirements of downstream applications including segmentation, classification, or image synthesis. To investigate this in a concrete real-world clinical area, we assess two VFMs, a reconstruction-focused MAE-based model (ProFound) and a contrastive-learning-based model (ProViCNet), on five prostate multiparametric MR imaging tasks, examining how such task alignment influences transfer performance, i.e., from pretraining to fine-tuning. Our findings indicate that better alignment between pretraining and downstream tasks, measured by simple divergence metrics such as maximum-mean-discrepancy (MMD) between the same features before and after fine-tuning, correlates with greater performance improvements and faster convergence, emphasizing the importance of designing and analyzing pretraining objectives with downstream applicability in mind.
zh
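
文中用最大均值差异(MMD)度量微调前后特征分布的一致性。下面给出 RBF 核下无偏 MMD² 估计的示意代码(非论文官方脚本),带宽 σ 为假设超参。

```python
# 度量示意(非论文官方脚本):RBF 核下的无偏 MMD^2 估计。
import torch

def mmd2_rbf(x, y, sigma=1.0):
    """Unbiased MMD^2 between samples x:[n,d] and y:[m,d] with an RBF kernel."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    n, m = x.size(0), y.size(0)
    kxx = (k(x, x).sum() - n) / (n * (n - 1))   # drop diagonal terms (k(x_i,x_i)=1)
    kyy = (k(y, y).sum() - m) / (m * (m - 1))
    kxy = k(x, y).mean()
    return kxx + kyy - 2 * kxy

torch.manual_seed(0)
feats_before = torch.randn(256, 64)             # features before fine-tuning
feats_after = torch.randn(256, 64) + 0.5        # shifted features after fine-tuning
print(f"MMD^2 = {mmd2_rbf(feats_before, feats_after).item():.4f}")
```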

[CV-31] PMPBench: A Paired Multi-Modal Pan-Cancer Benchmark for Medical Image Synthesis

【速读】:该论文旨在解决医学影像中对比剂(Contrast Medium)使用受限的问题,通过生成式AI技术实现从非增强图像到增强图像的合成,以减少副作用并优化临床流程。其关键解决方案是构建并发布首个公开、全配对、跨11种人体器官的多模态医学影像数据集,涵盖完整的动态对比增强磁共振(DCE-MRI)序列和CT的非增强与增强扫描(CTC),并在此基础上建立全面的基准测试平台,支持一对一、多对一及多对多图像翻译任务的严谨评估,从而推动安全有效的对比剂合成研究在多器官肿瘤成像中的应用。

链接: https://arxiv.org/abs/2601.15884
作者: Yifan Chen,Fei Yin,Hao Chen,Jia Wu,Chao Li
机构: University of Cambridge (剑桥大学); MD Anderson Cancer Center (MD安德森癌症中心); University of Dundee (邓迪大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Contrast medium plays a pivotal role in radiological imaging, as it amplifies lesion conspicuity and improves detection for the diagnosis of tumor-related diseases. However, depending on the patient’s health condition or the medical resources available, the use of contrast medium is not always feasible. Recent work has explored AI-based image translation to synthesize contrast-enhanced images directly from non-contrast scans, aims to reduce side effects and streamlines clinical workflows. Progress in this direction has been constrained by data limitations: (1) existing public datasets focus almost exclusively on brain-related paired MR modalities; (2) other collections include partially paired data but suffer from missing modalities/timestamps and imperfect spatial alignment; (3) explicit labeling of CT vs. CTC or DCE phases is often absent; (4) substantial resources remain private. To bridge this gap, we introduce the first public, fully paired, pan-cancer medical imaging dataset spanning 11 human organs. The MR data include complete dynamic contrast-enhanced (DCE) sequences covering all three phases (DCE1-DCE3), while the CT data provide paired non-contrast and contrast-enhanced acquisitions (CTC). The dataset is curated for anatomical correspondence, enabling rigorous evaluation of 1-to-1, N-to-1, and N-to-N translation settings (e.g., predicting DCE phases from non-contrast inputs). Built upon this resource, we establish a comprehensive benchmark. We report results from representative baselines of contemporary image-to-image translation. We release the dataset and benchmark to catalyze research on safe, effective contrast synthesis, with direct relevance to multi-organ oncology imaging workflows. Our code and dataset are publicly available at this https URL.
zh

[CV-32] PF-D2M: A Pose-free Diffusion Model for Universal Dance-to-Music Generation

【速读】:该论文旨在解决现有舞蹈-音乐生成方法在实际应用中面临的两大挑战:一是依赖单一舞者身体运动特征和有限的舞蹈-音乐数据集,导致模型泛化能力弱;二是难以适配多舞者或非人类舞者(如机器人)等复杂场景。其解决方案的关键在于提出PF-D2M模型,这是一个基于扩散机制(diffusion-based)的通用舞蹈-音乐生成框架,通过引入从舞蹈视频中提取的视觉特征来增强模型对多样化舞蹈表现形式的感知能力,并采用渐进式训练策略(progressive training strategy)有效缓解数据稀缺与跨场景泛化难题,从而在舞蹈-音乐对齐度和音乐质量上均达到当前最优水平。

链接: https://arxiv.org/abs/2601.15872
作者: Jaekwon Im,Natalia Polouliakh,Taketo Akama
机构: KAIST(韩国科学技术院); Sony Computer Science Laboratories(索尼计算机科学实验室)
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注: 4 pages, 2 figures

点击查看摘要

Abstract:Dance-to-music generation aims to generate music that is aligned with dance movements. Existing approaches typically rely on body motion features extracted from a single human dancer and limited dance-to-music datasets, which restrict their performance and applicability to real-world scenarios involving multiple dancers and non-human dancers. In this paper, we propose PF-D2M, a universal diffusion-based dance-to-music generation model that incorporates visual features extracted from dance videos. PF-D2M is trained with a progressive training strategy that effectively addresses data scarcity and generalization challenges. Both objective and subjective evaluations show that PF-D2M achieves state-of-the-art performance in dance-music alignment and music quality.
zh

[CV-33] Out-of-Distribution Detection Based on Total Variation Estimation

【速读】:该论文旨在解决机器学习模型在实际部署中因输入数据分布偏移(distribution shift)而导致性能下降的问题,特别是如何有效识别和区分分布外数据(out-of-distribution, OOD)以提升模型鲁棒性。解决方案的关键在于提出一种名为Total Variation Out-of-Distribution (TV-OOD) 的检测方法,其核心创新是利用总变差网络估计器(Total Variation Network Estimator)量化每个输入样本对整体总变差的贡献,并据此构建总变差分数(total variation score),从而实现对in-distribution与out-of-distribution数据的精准判别。该方法在多个模型和数据集上均表现出优于或相当主流OOD检测技术的性能。

链接: https://arxiv.org/abs/2601.15867
作者: Dabiao Ma,Zhiba Su,Jian Yang,Haojun Fei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper introduces a novel approach to securing machine learning model deployments against potential distribution shifts in practical applications, the Total Variation Out-of-Distribution (TV-OOD) detection method. Existing methods have produced satisfactory results, but TV-OOD improves upon these by leveraging the Total Variation Network Estimator to calculate each input’s contribution to the overall total variation. By defining this as the total variation score, TV-OOD discriminates between in- and out-of-distribution data. The method’s efficacy was tested across a range of models and datasets, consistently yielding results in image classification tasks that were either comparable or superior to those achieved by leading-edge out-of-distribution detection techniques across all evaluation metrics.
zh

[CV-34] A Lightweight Brain-Inspired Machine Learning Framework for Coronary Angiography: Hybrid Neural Representation and Robust Learning Strategies

【速读】:该论文旨在解决冠状动脉造影(Coronary Angiography, CAG)图像分析中因病变形态复杂、类别分布极度不均衡、标签不确定性高及计算资源受限等因素导致的传统深度学习方法鲁棒性差的问题。其核心解决方案是提出一种轻量级脑启发式神经网络框架,关键创新包括:基于预训练卷积神经网络构建轻量级混合神经表征;引入选择性神经可塑性训练策略实现高效参数适应;设计融合焦点损失(Focal Loss)与标签平滑的注意力调制损失函数以增强对难样本和不确定标注的敏感性;同时采用类不平衡感知采样与余弦退火带热重启机制,模拟生物神经系统的节律调控与注意力分配机制,从而在有限计算资源下实现稳定且高性能的二分类任务表现。

链接: https://arxiv.org/abs/2601.15865
作者: Jingsong Xia,Siqi Wang
机构: The Second Clinical College, Nanjing Medical University, Nanjing, China
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Background: Coronary angiography (CAG) is a cornerstone imaging modality for assessing coronary artery disease and guiding interventional treatment decisions. However, in real-world clinical settings, angiographic images are often characterized by complex lesion morphology, severe class imbalance, label uncertainty, and limited computational resources, posing substantial challenges to conventional deep learning approaches in terms of robustness and generalizability. Methods: The proposed framework is built upon a pretrained convolutional neural network to construct a lightweight hybrid neural representation. A selective neural plasticity training strategy is introduced to enable efficient parameter adaptation. Furthermore, a brain-inspired attention-modulated loss function, combining Focal Loss with label smoothing, is employed to enhance sensitivity to hard samples and uncertain annotations. Class-imbalance-aware sampling and cosine annealing with warm restarts are adopted to mimic the rhythmic regulation and attention allocation mechanisms observed in biological neural systems. Results: Experimental results demonstrate that the proposed lightweight brain-inspired model achieves strong and stable performance in binary coronary angiography classification, yielding competitive accuracy, recall, F1-score, and AUC metrics while maintaining high computational efficiency. Conclusion: This study validates the effectiveness of brain-inspired learning mechanisms in lightweight medical image analysis and provides a biologically plausible and deployable solution for intelligent clinical decision support under limited computational resources.
zh
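
论文将 Focal Loss 与标签平滑组合为“注意力调制损失”。下面是按这一思路写的 PyTorch 极简示意(γ、平滑系数等超参均为本文假设值,并非论文设置):

```python
import torch
import torch.nn.functional as F

def focal_label_smoothing_loss(logits, targets, gamma=2.0, smoothing=0.1):
    """Focal Loss + 标签平滑:以 (1-p)^gamma 放大难样本贡献,同时软化 one-hot 目标。"""
    n_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    soft = torch.full_like(log_probs, smoothing / (n_classes - 1))  # 平滑后的软目标
    soft.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)
    loss = -(soft * (1.0 - probs) ** gamma * log_probs).sum(dim=-1)
    return loss.mean()

logits = torch.randn(8, 2)            # 二分类(如病变/正常)的 8 个样本
targets = torch.randint(0, 2, (8,))
print(focal_label_smoothing_loss(logits, targets))
```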

[CV-35] Uncertainty-guided Generation of Dark-field Radiographs

【速读】:该论文旨在解决X射线暗场成像(dark-field radiography)数据稀缺问题,从而阻碍了深度学习模型在该领域的可靠开发。针对这一挑战,作者提出了一种基于不确定性引导的渐进式生成对抗网络(Uncertainty-Guided Progressive Generative Adversarial Network)框架,其关键在于同时建模aleatoric(随机性)和epistemic(认知性)不确定性,以提升生成图像的结构保真度与模型可解释性及可靠性。实验表明,该方法在各生成阶段均实现了定量指标的一致性提升,并且在分布外评估中展现出良好的泛化能力,为临床应用提供了可信的暗场图像合成基础。

链接: https://arxiv.org/abs/2601.15859
作者: Lina Felsner,Henriette Bast,Tina Dorosti,Florian Schaff,Franz Pfeiffer,Daniela Pfeiffer,Julia Schnabel
机构: 1. Ludwig-Maximilians-Universität München (慕尼黑路德维希-马克西米利安大学);
2. Helmholtz Zentrum München (德国慕尼黑亥姆霍兹研究中心);
3. German Cancer Research Center (德国癌症研究中心);
4. Technische Universität München (慕尼黑工业大学);
5. Institute for Medical Engineering and Physics (医学工程与物理研究所);
6. Department of Radiology (放射科);
7. German Cancer Research Center (德国癌症研究中心);
8. University College London (伦敦大学学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:X-ray dark-field radiography provides complementary diagnostic information to conventional attenuation imaging by visualizing microstructural tissue changes through small-angle scattering. However, the limited availability of such data poses challenges for developing robust deep learning models. In this work, we present the first framework for generating dark-field images directly from standard attenuation chest X-rays using an Uncertainty-Guided Progressive Generative Adversarial Network. The model incorporates both aleatoric and epistemic uncertainty to improve interpretability and reliability. Experiments demonstrate high structural fidelity of the generated images, with consistent improvement of quantitative metrics across stages. Furthermore, out-of-distribution evaluation confirms that the proposed model generalizes well. Our results indicate that uncertainty-guided generative modeling enables realistic dark-field image synthesis and provides a reliable foundation for future clinical applications.
zh
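
论文同时建模 aleatoric 与 epistemic 两类不确定性。一个常见的实现套路是:网络直接输出预测方差(aleatoric),再用 MC Dropout 的多次采样方差估计 epistemic。以下示意采用这一通用做法,并非论文的渐进式 GAN 结构:

```python
import torch
import torch.nn as nn

class UncertainHead(nn.Module):
    """输出均值与对数方差(aleatoric);配合 MC Dropout 估计 epistemic。"""
    def __init__(self, dim=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Dropout(0.2))
        self.mean = nn.Linear(dim, 1)
        self.log_var = nn.Linear(dim, 1)

    def forward(self, x):
        h = self.body(x)
        return self.mean(h), self.log_var(h)

@torch.no_grad()
def mc_predict(model, x, n=20):
    model.train()                      # 推理时保持 Dropout 开启以采样
    means, ales = [], []
    for _ in range(n):
        m, lv = model(x)
        means.append(m); ales.append(lv.exp())
    means = torch.stack(means)
    epistemic = means.var(dim=0)               # 采样间方差 → 认知不确定性
    aleatoric = torch.stack(ales).mean(dim=0)  # 预测方差均值 → 随机不确定性
    return means.mean(dim=0), aleatoric, epistemic

pred, a, e = mc_predict(UncertainHead(), torch.randn(4, 64))
print(pred.shape, a.shape, e.shape)
```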

[CV-36] TinySense: Effective CSI Compression for Scalable and Accurate Wi-Fi Sensing

【速读】:该论文旨在解决Wi-Fi感知中因直接处理大量信道状态信息(CSI)数据而导致的网络资源消耗过大问题,从而限制了设备无关且隐私保护型人体姿态估计(HPE)的可扩展性。其解决方案的关键在于提出了一种名为TinySense的高效压缩框架,核心创新是基于向量量化生成对抗网络(VQGAN)构建的代码本(codebook),通过学习到的离散码字显著压缩CSI数据并维持HPE精度;同时结合K-means算法动态调整压缩比特率以优化代码本子集划分,并引入Transformer模型缓解比特率损失,提升在不稳定网络环境下的鲁棒性。

链接: https://arxiv.org/abs/2601.15838
作者: Toan Gian,Dung T. Tran,Viet Quoc Pham,Francesco Restuccia,Van-Dinh Nguyen
机构: Smart Green Transformation Center (GREEN-X), VinUniversity (维大大学); Northeastern University (东北大学); School of Computer Science and Statistics, Trinity College Dublin (都柏林三一学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages. This paper has been accepted for publication in IEEE PerCom 2026

点击查看摘要

Abstract:With the growing demand for device-free and privacy-preserving sensing solutions, Wi-Fi sensing has emerged as a promising approach for human pose estimation (HPE). However, existing methods often process vast amounts of channel state information (CSI) data directly, ultimately straining networking resources. This paper introduces TinySense, an efficient compression framework that enhances the scalability of Wi-Fi-based human sensing. Our approach is based on a new vector quantization-based generative adversarial network (VQGAN). Specifically, by leveraging a VQGAN-learned codebook, TinySense significantly reduces CSI data while maintaining the accuracy required for reliable HPE. To optimize compression, we employ the K-means algorithm to cluster a large-scale pre-trained codebook into smaller subsets, dynamically adjusting compression bitrates. Furthermore, a Transformer model is incorporated to mitigate bitrate loss, enhancing robustness in unreliable networking conditions. We prototype TinySense on an experimental testbed using Jetson Nano and Raspberry Pi to measure latency and network resource use. Extensive results demonstrate that TinySense significantly outperforms state-of-the-art compression schemes, achieving up to 1.5x higher HPE accuracy score (PCK20) under the same compression rate. It also reduces latency and networking overhead, respectively, by up to 5x and 2.5x. The code repository is available online.
zh
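
摘要明确提到“用 K-means 把大规模预训练码本聚成更小的子集”以动态调整压缩比特率。下面用随机向量代替真实的 VQGAN 码本,示意这一步(码本规模与维度均为假设):

```python
import numpy as np
from sklearn.cluster import KMeans

codebook = np.random.randn(1024, 256).astype(np.float32)  # 假设的预训练码本

def sub_codebook(codebook, n_codes):
    """K-means 聚类得到 n_codes 个中心,作为低比特率子码本。"""
    km = KMeans(n_clusters=n_codes, n_init=4, random_state=0).fit(codebook)
    return km.cluster_centers_

def quantize(features, codes):
    """最近邻量化;每个 token 的码率约为 log2(len(codes)) bit。"""
    d = ((features[:, None, :] - codes[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

feats = np.random.randn(10, 256).astype(np.float32)
for n in (64, 256):                 # 在 6 bit 与 8 bit 码率之间切换
    idx = quantize(feats, sub_codebook(codebook, n))
    print(n, idx[:5])
```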

[CV-37] An IoT-Based Smart Plant Monitoring and Irrigation System with Real-Time Environmental Sensing Automated Alerts and Cloud Analytics

【速读】:该论文旨在解决传统农业中因依赖人工观察和周期性浇水而导致的水资源浪费、作物生长不一致以及对环境变化响应滞后等问题。其解决方案的关键在于构建一个基于物联网(IoT)的智能植物监测系统,该系统通过ESP32微控制器集成多种环境传感器(如DHT22温湿度传感器、HC-SR04水位传感器及土壤湿度传感器),实现对植物生长环境的实时数据采集,并结合自动灌溉控制与ThingSpeak云平台进行远程监控、历史数据分析和自动化报警。实验表明,该系统能以92%的精度维持最佳土壤湿度水平,较传统灌溉方式节水约40%,且具备低成本(总成本45.20美元)、可扩展性强的特点,适用于小规模园艺至商业农业场景。

链接: https://arxiv.org/abs/2601.15830
作者: Abdul Hasib,A. S. M. Ahsanul Sarkar Akib
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The increasing global demand for sustainable agriculture necessitates intelligent monitoring systems that optimize resource utilization and plant health management. Traditional farming methods rely on manual observation and periodic watering, often leading to water wastage, inconsistent plant growth, and delayed response to environmental changes. This paper presents a comprehensive IoT-based smart plant monitoring system that integrates multiple environmental sensors with automated irrigation and cloud analytics. The proposed system utilizes an ESP32 microcontroller to collect real-time data from DHT22 (temperature/humidity), HC-SR04 (water level), and soil moisture sensors, with visual feedback through an OLED display and auditory alerts via a buzzer. All sensor data is wirelessly transmitted to the ThingSpeak cloud platform for remote monitoring, historical analysis, and automated alert generation. Experimental results demonstrate the system’s effectiveness in maintaining optimal soil moisture levels (with 92% accuracy), providing real-time environmental monitoring, and reducing water consumption by approximately 40% compared to conventional irrigation methods. The integrated web dashboard offers comprehensive visualization of plant health parameters, making it suitable for both small-scale gardening and commercial agriculture applications. With a total implementation cost of $45.20, this system provides an affordable, scalable solution for precision agriculture and smart farming.
zh

[CV-38] Towards Realistic Remote Sensing Dataset Distillation with Discriminative Prototype-guided Diffusion

【速读】:该论文旨在解决遥感图像解译中深度学习模型依赖大规模训练数据所引发的两大问题:一是高存储与计算成本,二是敏感类别下的数据泄露风险。解决方案的关键在于首次将数据蒸馏(dataset distillation)引入遥感图像解译领域,通过训练一个文本到图像的扩散模型(text-to-image diffusion model),将大规模遥感数据集压缩为紧凑且具有代表性的蒸馏数据集;同时,为提升合成样本的判别能力,提出基于分类器引导的策略,在扩散训练过程中注入预训练模型的分类一致性损失(classification consistency loss);此外,为保留遥感图像丰富的语义复杂性,进一步在潜在空间对训练样本进行聚类以选取具有代表性和多样性的原型作为视觉风格指导,并利用视觉语言模型提供聚合的文本描述,从而实现高质量、多样化且可用于下游任务训练的合成样本生成。

链接: https://arxiv.org/abs/2601.15829
作者: Yonghao Xu,Pedram Ghamisi,Qihao Weng
机构: Linköping University (林雪平大学); Helmholtz-Zentrum Dresden-Rossendorf (亥姆霍兹德累斯顿罗森多夫研究中心); Lancaster University (兰卡斯特大学); University of Iceland (冰岛大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent years have witnessed the remarkable success of deep learning in remote sensing image interpretation, driven by the availability of large-scale benchmark datasets. However, this reliance on massive training data also brings two major challenges: (1) high storage and computational costs, and (2) the risk of data leakage, especially when sensitive categories are involved. To address these challenges, this study introduces the concept of dataset distillation into the field of remote sensing image interpretation for the first time. Specifically, we train a text-to-image diffusion model to condense a large-scale remote sensing dataset into a compact and representative distilled dataset. To improve the discriminative quality of the synthesized samples, we propose a classifier-driven guidance by injecting a classification consistency loss from a pre-trained model into the diffusion training process. Besides, considering the rich semantic complexity of remote sensing imagery, we further perform latent space clustering on training samples to select representative and diverse prototypes as visual style guidance, while using a visual language model to provide aggregated text descriptions. Experiments on three high-resolution remote sensing scene classification benchmarks show that the proposed method can distill realistic and diverse samples for downstream model training. Code and pre-trained models are available online (this https URL).
zh
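
论文在扩散训练中注入预训练分类器的“分类一致性损失”。下面的 PyTorch 片段示意这一思路:在噪声预测损失之外,对一步去噪估计出的图像再加一项交叉熵(网络结构、噪声日程均为演示用的简化假设):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, 3, padding=1)
        self.out = nn.Conv2d(16, 3, 3, padding=1)

    def forward(self, x, t):            # 演示用:忽略时间步嵌入
        return self.out(F.relu(self.conv(x)))

classifier = nn.Sequential(             # 代表论文中冻结的预训练分类器
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))

def distill_step(unet, classifier, x0, labels, lam=0.1):
    t = torch.rand(x0.size(0), device=x0.device)
    a = (1 - t).view(-1, 1, 1, 1)                      # 简化的线性噪声日程
    noise = torch.randn_like(x0)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * noise
    pred = unet(xt, t)
    l_diff = F.mse_loss(pred, noise)                   # 标准噪声预测损失
    x0_hat = (xt - (1 - a).sqrt() * pred) / a.sqrt().clamp(min=1e-3)  # 一步估计 x0
    l_cls = F.cross_entropy(classifier(x0_hat), labels)  # 分类一致性损失
    return l_diff + lam * l_cls

loss = distill_step(TinyUNet(), classifier, torch.randn(2, 3, 32, 32),
                    torch.randint(0, 10, (2,)))
print(float(loss))
```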

[CV-39] Beyond Off-the-Shelf Models: A Lightweight and Accessible Machine Learning Pipeline for Ecologists Working with Image Data

【速读】:该论文旨在解决生态学研究中图像分类任务对机器学习(Machine Learning, ML)方法应用门槛过高、缺乏针对本地数据集和具体分类目标的定制化模型的问题。其解决方案的关键在于构建一个轻量级实验流程(lightweight experimentation pipeline),通过命令行接口实现预处理、训练与评估的自动化,同时提供图形化界面用于标注、错误分析和模型比较,使生态学家能够在不依赖高级ML专业知识的情况下,高效开发出紧凑且任务特定的分类器。该框架已在德国Veldenstein森林3392张相机陷阱图像上验证,成功实现了对赤鹿(Cervus elaphus)年龄和性别的高精度分类(准确率分别为90.77%和96.15%),证明了在有限数据条件下解决明确生态问题的可行性。

链接: https://arxiv.org/abs/2601.15813
作者: Clare Chemery,Hendrik Edelhoff,Ludwig Bothmann
机构: LMU Munich (慕尼黑大学); Munich Center for Machine Learning (慕尼黑机器学习中心); Bavarian State Institute for Forestry (巴伐利亚州林业研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce a lightweight experimentation pipeline designed to lower the barrier for applying machine learning (ML) methods for classifying images in ecological research. We enable ecologists to experiment with ML models independently, thus they can move beyond off-the-shelf models and generate insights tailored to local datasets and specific classification tasks and target variables. Our tool combines a simple command-line interface for preprocessing, training, and evaluation with a graphical interface for annotation, error analysis, and model comparison. This design enables ecologists to build and iterate on compact, task-specific classifiers without requiring advanced ML expertise. As a proof of concept, we apply the pipeline to classify red deer (Cervus elaphus) by age and sex from 3392 camera trap images collected in the Veldenstein Forest, Germany. Using 4352 cropped images containing individual deer labeled by experts, we trained and evaluated multiple backbone architectures with a wide variety of parameters and data augmentation strategies. Our best-performing models achieved 90.77% accuracy for age classification and 96.15% for sex classification. These results demonstrate that reliable demographic classification is feasible even with limited data to answer narrow, well-defined ecological problems. More broadly, the framework provides ecologists with an accessible tool for developing ML models tailored to specific research questions, paving the way for broader adoption of ML in wildlife monitoring and demographic analysis.
zh

[CV-40] A Mobile Application for Flower Recognition System Based on Convolutional Neural Networks

【速读】:该论文旨在解决非专业人士在花卉识别中缺乏专家支持的问题,通过开发基于卷积神经网络(Convolutional Neural Network, CNN)的移动应用实现快速、准确的花种分类。其解决方案的关键在于对比三种主流CNN模型(MobileNet、DenseNet121和Xception)在不同优化算法下的分类性能,最终确定DenseNet121结合随机梯度下降(Stochastic Gradient Descent, SGD)优化算法时表现最优,达到95.84%的准确率及各项指标均达96.00%,验证了CNN在移动端花种识别中的有效性与实用性。

链接: https://arxiv.org/abs/2601.15810
作者: Mustafa Yurdakul,Enes Ayan,Fahrettin Horasan,Sakir Tasdemir
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A convolutional neural network (CNN) is a deep learning algorithm that has been specifically designed for computer vision applications. The CNNs proved successful in handling the increasing amount of data in many computer vision problems, where classical machine learning algorithms were insufficient. Flowers have many uses in our daily lives, from decorating to making medicines to detoxifying the environment. Identifying flower types requires expert knowledge. However, accessing experts at any time and in any location may not always be feasible. In this study, a mobile application based on CNNs was developed to recognize different types of flowers to provide non-specialists with quick and easy access to information about flower types. The study employed three distinct CNN models, namely MobileNet, DenseNet121, and Xception, to determine the most suitable model for the mobile application. The classification performances of the models were evaluated by training them with seven different optimization algorithms. The DenseNet-121 architecture, which uses the stochastic gradient descent (SGD) optimization algorithm, was the most successful, achieving 95.84% accuracy, and 96.00% precision, recall, and F1-score. This result shows that CNNs can be used for flower classification in mobile applications.
zh
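
论文结论中表现最好的组合是 DenseNet-121 + SGD。下面给出一个 torchvision 微调骨架(为保证离线可运行,这里 weights=None;复现论文时应加载 ImageNet 预训练权重,类别数与学习率也为假设值):

```python
import torch
import torch.nn as nn
from torchvision import models

n_classes = 5                                        # 假设识别 5 种花卉
model = models.densenet121(weights=None)             # 论文应使用预训练权重
model.classifier = nn.Linear(model.classifier.in_features, n_classes)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
crit = nn.CrossEntropyLoss()

x = torch.randn(4, 3, 224, 224)                      # 用随机张量代替花卉图像
y = torch.randint(0, n_classes, (4,))
opt.zero_grad()
loss = crit(model(x), y)
loss.backward()
opt.step()
print(float(loss))
```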

[CV-41] Assessing Situational and Spatial Awareness of VLMs with Synthetically Generated Video

【速读】:该论文旨在解决视觉语言模型(Vision Language Models, VLMs)在依赖细微时空线索时空间推理能力脆弱的问题,特别是针对情境感知(situational awareness,判断交互行为是否有害)和空间感知(spatial awareness,追踪动作主体、对象及相对位置与运动关系)这两项关键能力的不足。解决方案的关键在于构建一个合成基准测试(synthetic benchmark),通过极简视频对(minimal video pairs)系统性地评估VLMs在三类挑战中的表现:区分暴力与非暴力行为、跨视角绑定攻击者角色、以及判断细粒度轨迹对齐。实验表明当前VLMs在无训练条件下性能仅略高于随机水平,引入稳定颜色提示虽能部分缓解角色混淆问题,但无法根本改善其空间推理缺陷。研究进一步开放数据与代码,旨在提供可复现的诊断工具,并推动轻量级空间先验(lightweight spatial priors)的研究以补充大规模预训练的不足。

链接: https://arxiv.org/abs/2601.15780
作者: Pascal Benschop,Justin Dauwels,Jan van Gemert
机构: Delft University of Technology (代尔夫特理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spatial reasoning in vision language models (VLMs) remains fragile when semantics hinge on subtle temporal or geometric cues. We introduce a synthetic benchmark that probes two complementary skills: situational awareness (recognizing whether an interaction is harmful or benign) and spatial awareness (tracking who does what to whom, and reasoning about relative positions and motion). Through minimal video pairs, we test three challenges: distinguishing violence from benign activity, binding assailant roles across viewpoints, and judging fine-grained trajectory alignment. While we evaluate recent VLMs in a training-free setting, the benchmark is applicable to any video classification model. Results show performance only slightly above chance across tasks. A simple aid, stable color cues, partly reduces assailant role confusions but does not resolve the underlying weakness. By releasing data and code, we aim to provide reproducible diagnostics and seed exploration of lightweight spatial priors to complement large-scale pretraining.
zh

[CV-42] Diffusion Model-Based Data Augmentation for Enhanced Neuron Segmentation

【速读】:该论文旨在解决电子显微镜(Electron Microscopy, EM)图像中神经元分割任务在训练数据稀缺条件下的性能瓶颈问题,即当前基于深度学习的方法受限于大规模标注数据的需求以及人工标注的高成本。其解决方案的关键在于提出了一种基于扩散模型的数据增强框架,该框架通过引入分辨率感知的条件扩散模型与多尺度条件控制及EM成像分辨率先验,实现从3D掩码到体素级图像的合成;同时结合生物引导的掩码重构模块,生成结构更逼真的图像-标签对,从而有效提升训练集多样性与真实性,在低标注场景下显著改善分割性能(如AC3和AC4数据集上ARAND指标分别提升32.1%和30.7%)。

链接: https://arxiv.org/abs/2601.15779
作者: Liuyun Jiang,Yanchao Zhang,Jinyue Guo,Yizhuo Lu,Ruining Zhou,Hua Han
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Neuron segmentation in electron microscopy (EM) aims to reconstruct the complete neuronal connectome; however, current deep learning-based methods are limited by their reliance on large-scale training data and extensive, time-consuming manual annotations. Traditional methods augment the training set through geometric and photometric transformations; however, the generated samples remain highly correlated with the original images and lack structural diversity. To address this limitation, we propose a diffusion-based data augmentation framework capable of generating diverse and structurally plausible image-label pairs for neuron segmentation. Specifically, the framework employs a resolution-aware conditional diffusion model with multi-scale conditioning and EM resolution priors to enable voxel-level image synthesis from 3D masks. It further incorporates a biology-guided mask remodeling module that produces augmented masks with enhanced structural realism. Together, these components effectively enrich the training set and improve segmentation performance. On the AC3 and AC4 datasets under low-annotation regimes, our method improves the ARAND metric by 32.1% and 30.7%, respectively, when combined with two different post-processing methods. Our code is available at this https URL.
zh

[CV-43] LL-GaussianImage: Efficient Image Representation for Zero-shot Low-Light Enhancement with 2D Gaussian Splatting

【速读】:该论文旨在解决现有低光增强算法主要在像素域操作,导致处理2D高斯点阵(2D Gaussian Splatting, 2DGS)压缩图像时需经历繁琐的解压缩-增强-再压缩流程,从而降低效率并引入二次退化的问题。其核心解决方案是提出首个零样本无监督框架LL-GaussianImage,直接在2DGS压缩表示域内实现低光增强;关键创新包括:1)设计语义引导的专家混合(Mixture-of-Experts)增强框架,利用渲染图像指导对2DGS稀疏属性空间的动态自适应变换,实现“压缩即增强”;2)构建多目标协同损失函数系统,严格约束增强过程中的平滑性和保真度,抑制伪影并提升视觉质量;3)采用两阶段优化策略实现“重建即增强”,通过单尺度重建保障基础表示精度,提升网络鲁棒性,从而在保持高压缩比的同时实现高质量低光增强。

链接: https://arxiv.org/abs/2601.15772
作者: Yuhan Chen,Wenxuan Yu,Guofa Li,Yijun Xu,Ying Fang,Yicui Shi,Long Cao,Wenbo Chu,Keqiang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:2D Gaussian Splatting (2DGS) is an emerging explicit scene representation method with significant potential for image compression due to high fidelity and high compression ratios. However, existing low-light enhancement algorithms operate predominantly within the pixel domain. Processing 2DGS-compressed images necessitates a cumbersome decompression-enhancement-recompression pipeline, which compromises efficiency and introduces secondary degradation. To address these limitations, we propose LL-GaussianImage, the first zero-shot unsupervised framework designed for low-light enhancement directly within the 2DGS compressed representation domain. Three primary advantages are offered by this framework. First, a semantic-guided Mixture-of-Experts enhancement framework is designed. Dynamic adaptive transformations are applied to the sparse attribute space of 2DGS using rendered images as guidance to enable compression-as-enhancement without full decompression to a pixel grid. Second, a multi-objective collaborative loss function system is established to strictly constrain smoothness and fidelity during enhancement, suppressing artifacts while improving visual quality. Third, a two-stage optimization process is utilized to achieve reconstruction-as-enhancement. The accuracy of the base representation is ensured through single-scale reconstruction and network robustness is enhanced. High-quality enhancement of low-light images is achieved while high compression ratios are maintained. The feasibility and superiority of the paradigm for direct processing within the compressed representation domain are validated through experimental results.
zh

[CV-44] LL-GaussianMap: Zero-shot Low-Light Image Enhancement via 2D Gaussian Splatting Guided Gain Maps

【速读】:该论文旨在解决低光照图像增强中因现有方法多在像素域或依赖隐式特征表示而导致的几何结构先验信息被忽略的问题,从而影响增强结果的边缘保持与伪影抑制效果。其解决方案的关键在于提出首个将2D Gaussian Splatting(2DGS)引入低级视觉任务的无监督框架LL-GaussianMap,通过显式结构重建与基于高斯点阵的光栅化渲染机制,将2DGS的结构感知能力融入增益图(gain map)生成过程,实现结构保真的高效增强,同时避免对成对标注数据的依赖。

链接: https://arxiv.org/abs/2601.15766
作者: Yuhan Chen,Ying Fang,Guofa Li,Wenxuan Yu,Yicui Shi,Jingrui Zhang,Kefei Qian,Wenbo Chu,Keqiang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Significant progress has been made in low-light image enhancement with respect to visual quality. However, most existing methods primarily operate in the pixel domain or rely on implicit feature representations. As a result, the intrinsic geometric structural priors of images are often neglected. 2D Gaussian Splatting (2DGS) has emerged as a prominent explicit scene representation technique characterized by superior structural fitting capabilities and high rendering efficiency. Despite these advantages, the utilization of 2DGS in low-level vision tasks remains unexplored. To bridge this gap, LL-GaussianMap is proposed as the first unsupervised framework incorporating 2DGS into low-light image enhancement. Distinct from conventional methodologies, the enhancement task is formulated as a gain map generation process guided by 2DGS primitives. The proposed method comprises two primary stages. First, high-fidelity structural reconstruction is executed utilizing 2DGS. Then, data-driven enhancement dictionary coefficients are rendered via the rasterization mechanism of Gaussian splatting through an innovative unified enhancement module. This design effectively incorporates the structural perception capabilities of 2DGS into gain map generation, thereby preserving edges and suppressing artifacts during enhancement. Additionally, the reliance on paired data is circumvented through unsupervised learning. Experimental results demonstrate that LL-GaussianMap achieves superior enhancement performance with an extremely low storage footprint, highlighting the effectiveness of explicit Gaussian representations for image enhancement.
zh
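
LL-GaussianMap 把增强建模为“由 2DGS 基元引导的增益图生成”。抛开 2DGS 渲染细节,其下游应用方式可以用如下极简示意理解:增益图与低光图逐像素相乘即得增强结果(这里的 Retinex 风格增益仅为演示假设,论文中的增益由高斯光栅化渲染产生):

```python
import numpy as np

def apply_gain_map(low, gain):
    """enhanced = low * gain,再裁剪到 [0, 1]。"""
    return np.clip(low * gain, 0.0, 1.0)

def retinex_style_gain(low, target_mean=0.5, eps=1e-6):
    """演示用基线:按局部亮度的倒数构造增益图。"""
    lum = low.mean(axis=-1, keepdims=True)
    return target_mean / (lum + eps)

low = np.random.uniform(0.0, 0.2, (64, 64, 3))   # 模拟低光输入
out = apply_gain_map(low, retinex_style_gain(low))
print(round(float(low.mean()), 3), "->", round(float(out.mean()), 3))
```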

[CV-45] Atlas-Assisted Segment Anything Model for Fetal Brain MRI (FeTal-SAM)

【速读】:该论文旨在解决胎儿脑部磁共振成像(fetal brain MRI)分割中两个关键问题:一是传统深度学习方法需针对固定标签集进行大量标注数据训练,难以适应临床或研究需求变化;二是现有方法缺乏对分割结果是否依赖真实图像对比度还是学习到的空间先验的可解释性。解决方案的关键在于提出FeTal-SAM,通过融合多图谱配准生成的空间对齐标签模板作为密集提示(dense prompts),结合边界框提示,驱动Segment Anything Model (SAM) 的分割解码器实现逐结构二值分割,并进一步融合重建完整三维分割体积。该设计使模型无需重新训练即可灵活分割任意用户指定的解剖结构,同时保持对高对比度结构(如皮质板和小脑)的高性能表现,从而推动面向临床适配的胎儿脑MRI分析工具发展。

链接: https://arxiv.org/abs/2601.15759
作者: Qi Zeng,Weide Liu,Bo Li,Ryne Didier,P. Ellen Grant,Davood Karimi
机构: Boston Children’s Hospital (波士顿儿童医院); Harvard Medical School (哈佛医学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper presents FeTal-SAM, a novel adaptation of the Segment Anything Model (SAM) tailored for fetal brain MRI segmentation. Traditional deep learning methods often require large annotated datasets for a fixed set of labels, making them inflexible when clinical or research needs change. By integrating atlas-based prompts and foundation-model principles, FeTal-SAM addresses two key limitations in fetal brain MRI segmentation: (1) the need to retrain models for varying label definitions, and (2) the lack of insight into whether segmentations are driven by genuine image contrast or by learned spatial priors. We leverage multi-atlas registration to generate spatially aligned label templates that serve as dense prompts, alongside a bounding-box prompt, for SAM’s segmentation decoder. This strategy enables binary segmentation on a per-structure basis, which is subsequently fused to reconstruct the full 3D segmentation volumes. Evaluations on two datasets, the dHCP dataset and an in-house dataset demonstrate FeTal-SAM’s robust performance across gestational ages. Notably, it achieves Dice scores comparable to state-of-the-art baselines which were trained for each dataset and label definition for well-contrasted structures like cortical plate and cerebellum, while maintaining the flexibility to segment any user-specified anatomy. Although slightly lower accuracy is observed for subtle, low-contrast structures (e.g., hippocampus, amygdala), our results highlight FeTal-SAM’s potential to serve as a general-purpose segmentation model without exhaustive retraining. This method thus constitutes a promising step toward clinically adaptable fetal brain MRI analysis tools.
zh

[CV-46] White-Box mHC: Electromagnetic Spectrum-Aware and Interpretable Stream Interactions for Hyperspectral Image Classification

【速读】:该论文旨在解决高光谱图像分类(HSIC)中深度学习模型因依赖不透明的光谱-空间特征混合而导致可解释性差、难以理解内部决策机制的问题。解决方案的关键在于提出一种物理光谱感知的白盒超连接框架ES-mHC,其通过结构化、方向性的矩阵显式建模不同电磁波段分组之间的交互关系(即mHC中的残差流),并将特征表示与交互结构分离,从而促进波段分组的专业化、减少冗余,并使内部信息流可直接可视化和空间分析,实现从纯黑箱预测任务向结构透明、部分白盒学习过程的转变。

链接: https://arxiv.org/abs/2601.15757
作者: Yimin Zhu,Lincoln Linlin Xu,Zhengsen Xu,Zack Dewis,Mabel Heffring,Saeid Taleghanidoozdoozan,Motasem Alkayid,Quinn Ledingham,Megan Greenwood
机构: University of Calgary (卡尔加里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In hyperspectral image classification (HSIC), most deep learning models rely on opaque spectral-spatial feature mixing, limiting their interpretability and hindering understanding of internal decision mechanisms. We present physical spectrum-aware white-box mHC, named ES-mHC, a hyper-connection framework that explicitly models interactions among different electromagnetic spectrum groupings (residual stream in mHC) interactions using structured, directional matrices. By separating feature representation from interaction structure, ES-mHC promotes electromagnetic spectrum grouping specialization, reduces redundancy, and exposes internal information flow that can be directly visualized and spatially analyzed. Using hyperspectral image classification as a representative testbed, we demonstrate that the learned hyper-connection matrices exhibit coherent spatial patterns and asymmetric interaction behaviors, providing mechanistic insight into the model internal dynamics. Furthermore, we find that increasing the expansion rate accelerates the emergence of structured interaction patterns. These results suggest that ES-mHC transforms HSIC from a purely black-box prediction task into a structurally transparent, partially white-box learning process.
zh
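
该文的核心是用结构化、有方向的矩阵显式建模波段分组之间的交互。其最小形态可以写成一个可学习的 G×G 交互矩阵(矩阵本身即可直接可视化,对应文中的白盒分析;维度与初始化均为本文假设):

```python
import torch
import torch.nn as nn

class SpectrumHyperConnection(nn.Module):
    """以可学习的有向矩阵 M 混合 G 个波段分组的特征流;M 可直接可视化。"""
    def __init__(self, n_groups, dim):
        super().__init__()
        self.M = nn.Parameter(torch.eye(n_groups))      # 初始为单位阵:各组独立
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                               # x: [B, G, D]
        mixed = torch.einsum("gh,bhd->bgd", self.M, x)  # 组间有向交互
        return mixed + self.proj(x)                     # 残差保留组内表征

x = torch.randn(2, 4, 32)          # 4 个电磁波段分组,每组 32 维特征
layer = SpectrumHyperConnection(4, 32)
print(layer(x).shape)              # torch.Size([2, 4, 32])
```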

[CV-47] Breaking the Resolution Barrier: Arbitrary-resolution Deep Image Steganography Framework

【速读】:该论文旨在解决深度图像隐写(Deep Image Steganography, DIS)中因秘密图像与载体图像分辨率不一致而导致的细节丢失和盲恢复困难问题。现有方法强制秘密图像在隐藏和提取阶段必须与载体图像保持相同分辨率,这不仅导致高分辨率秘密图像在低分辨率载体中需经重采样而损失细节,还使得在未知原始分辨率时无法恢复其真实细节。解决方案的关键在于提出ARDIS框架,通过两个核心创新实现任意分辨率隐写:一是设计频率解耦架构(Frequency Decoupling Architecture),将秘密图像分解为与载体分辨率对齐的全局基底和与分辨率无关的高频潜在表示,从而最小化重采样带来的细节损失;二是引入隐式引导重建器(Latent-Guided Implicit Reconstructor),利用高频潜在代码调制连续隐式函数,精确查询并渲染高频残差至全局基底上,实现原分辨率细节的忠实还原;此外,提出隐式分辨率编码策略(Implicit Resolution Coding),将离散分辨率值映射为密集特征图并嵌入特征域冗余空间,使接收端可直接从隐写表示中解码出原始分辨率,实现盲恢复。

链接: https://arxiv.org/abs/2601.15739
作者: Xinjue Hu,Chi Wang,Boyu Wang,Xiang Zhang,Zhenshan Tan,Zhangjie Fu
机构: Nanjing University of Information Science and Technology (南京信息工程大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep image steganography (DIS) has achieved significant results in capacity and invisibility. However, current paradigms enforce the secret image to maintain the same resolution as the cover image during hiding and revealing. This leads to two challenges: secret images with inconsistent resolutions must undergo resampling beforehand which results in detail loss during recovery, and the secret image cannot be recovered to its original resolution when the resolution value is unknown. To address these, we propose ARDIS, the first Arbitrary Resolution DIS framework, which shifts the paradigm from discrete mapping to reference-guided continuous signal reconstruction. Specifically, to minimize the detail loss caused by resolution mismatch, we first design a Frequency Decoupling Architecture in hiding stage. It disentangles the secret into a resolution-aligned global basis and a resolution-agnostic high-frequency latent to hide in a fixed-resolution cover. Second, for recovery, we propose a Latent-Guided Implicit Reconstructor to perform deterministic restoration. The recovered detail latent code modulates a continuous implicit function to accurately query and render high-frequency residuals onto the recovered global basis, ensuring faithful restoration of original details. Furthermore, to achieve blind recovery, we introduce an Implicit Resolution Coding strategy. By transforming discrete resolution values into dense feature maps and hiding them in the redundant space of the feature domain, the reconstructor can correctly decode the secret’s resolution directly from the steganographic representation. Experimental results demonstrate that ARDIS significantly outperforms state-of-the-art methods in both invisibility and cross-resolution recovery fidelity.
zh

[CV-48] Sub-Region-Aware Modality Fusion and Adaptive Prompting for Multi-Modal Brain Tumor Segmentation

【速读】:该论文旨在解决基础模型(foundation models)在多模态医学影像中适应性不足的问题,尤其是现有模型在跨模态信息融合和病理组织异质性适应方面表现不佳。其核心解决方案在于提出一种新颖的框架,关键技术创新包括:子区域感知的模态注意力机制(sub-region-aware modality attention)自适应提示工程(adaptive prompt engineering)。前者使模型能够为每个肿瘤子区域学习最优的模态组合策略,后者则利用基础模型的内在能力优化分割精度,从而实现更精准、鲁棒的多模态医学图像分割。

链接: https://arxiv.org/abs/2601.15734
作者: Shadi Alijani,Fereshteh Aghaee Meibodi,Homayoun Najjaran
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The successful adaptation of foundation models to multi-modal medical imaging is a critical yet unresolved challenge. Existing models often struggle to effectively fuse information from multiple sources and adapt to the heterogeneous nature of pathological tissues. To address this, we introduce a novel framework for adapting foundation models to multi-modal medical imaging, featuring two key technical innovations: sub-region-aware modality attention and adaptive prompt engineering. The attention mechanism enables the model to learn the optimal combination of modalities for each tumor sub-region, while the adaptive prompting strategy leverages the inherent capabilities of foundation models to refine segmentation accuracy. We validate our framework on the BraTS 2020 brain tumor segmentation dataset, demonstrating that our approach significantly outperforms baseline methods, particularly in the challenging necrotic core sub-region. Our work provides a principled and effective approach to multi-modal fusion and prompting, paving the way for more accurate and robust foundation model-based solutions in medical imaging.
zh

[CV-49] FAIR-ESI: Feature Adaptive Importance Refinement for Electrophysiological Source Imaging

【速读】:该论文旨在解决脑部疾病诊断中电生理源成像(Electrophysiological Source Imaging, ESI)的特征选择与优化难题,特别是如何在不同视角下自适应地提升特征重要性以实现更精确的定位与分析。其解决方案的关键在于提出FAIR-ESI框架,通过三种机制实现特征的动态精炼:基于快速傅里叶变换(FFT)的频域特征优化、加权时域特征精炼以及基于自注意力机制的局部块级特征优化,从而在多个仿真和临床数据集上显著提升ESI的准确性与鲁棒性。

链接: https://arxiv.org/abs/2601.15731
作者: Linyong Zou,Liang Zhang,Xiongfei Wang,Jia-Hong Gao,Yi Sun,Shurong Sheng,Kuntao Xiao,Wanli Yang,Pengfei Teng,Guoming Luan,Zhao Lv,Zikang Xu
机构: Anhui University (安徽大学); Institute of Artificial Intelligence (人工智能研究所); Hefei Comprehensive National Science Center (合肥综合性国家科学中心); Northwestern Polytechnical University (西北工业大学); Capital Medical University (首都医科大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:An essential technique for diagnosing brain disorders is electrophysiological source imaging (ESI). While model-based optimization and deep learning methods have achieved promising results in this field, the accurate selection and refinement of features remains a central challenge for precise ESI. This paper proposes FAIR-ESI, a novel framework that adaptively refines feature importance across different views, including FFT-based spectral feature refinement, weighted temporal feature refinement, and self-attention-based patch-wise feature refinement. Extensive experiments on two simulation datasets with diverse configurations and two real-world clinical datasets validate our framework’s efficacy, highlighting its potential to advance brain disorder diagnosis and offer new insights into brain function.
zh
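
“FFT 频域特征精炼”可以最小化为:对序列做 rFFT、施加可学习的逐频率权重、再反变换回时域。以下示意按此理解编写(EEG 的导联数与片段长度为假设值):

```python
import torch
import torch.nn as nn

class SpectralRefine(nn.Module):
    """对 rFFT 系数乘以可学习权重,实现逐频率的自适应增强/抑制。"""
    def __init__(self, seq_len):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(seq_len // 2 + 1))

    def forward(self, x):                                  # x: [B, C, T]
        spec = torch.fft.rfft(x, dim=-1)
        spec = spec * self.weight                          # 逐频率加权
        return torch.fft.irfft(spec, n=x.size(-1), dim=-1)

x = torch.randn(2, 64, 256)        # 64 导联、256 个时间点的 EEG 片段
print(SpectralRefine(256)(x).shape)
```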

[CV-50] VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning

【速读】:该论文旨在解决长视频理解中因依赖均匀采样帧进行静态推理而导致的时间定位能力弱化和信息丢失问题。现有方法难以有效处理长时间视频中的关键事件识别与多步推理任务。其解决方案的关键在于提出 VideoThinker,一个基于合成工具交互轨迹训练的代理型视频大语言模型(agentic Video Large Language Model)。核心创新是将视频转化为丰富描述(rich captions),利用强大的代理语言模型在描述空间中生成多步骤工具使用序列(如时间检索、空间缩放和时间缩放),再将这些描述替换回对应视频帧,从而构建大规模交错的视频与工具推理数据集,无需底层模型具备初始的长视频理解能力。此方法使 VideoThinker 获得动态推理、自适应时间探索及多步工具协同能力,并在多个长视频基准测试中显著优于仅使用描述的代理模型和强基线视频模型。

链接: https://arxiv.org/abs/2601.15724
作者: Chenglin Li,Qianglong Chen,Feng Han,Yikun Wang,Xingxi Yin,Yan Gong,Ruilin Li,Yin Zhang,Jiaqi Wang
机构: Zhejiang University (浙江大学); Fudan University (复旦大学); Wuhan University (武汉大学); Shanghai AI Lab; Shanghai Innovation Institute
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-form video understanding remains a fundamental challenge for current Video Large Language Models. Most existing models rely on static reasoning over uniformly sampled frames, which weakens temporal localization and leads to substantial information loss in long videos. Agentic tools such as temporal retrieval, spatial zoom, and temporal zoom offer a natural way to overcome these limitations by enabling adaptive exploration of key moments. However, constructing agentic video understanding data requires models that already possess strong long-form video comprehension, creating a circular dependency. We address this challenge with VideoThinker, an agentic Video Large Language Model trained entirely on synthetic tool interaction trajectories. Our key idea is to convert videos into rich captions and employ a powerful agentic language model to generate multi-step tool use sequences in caption space. These trajectories are subsequently grounded back to video by replacing captions with the corresponding frames, yielding a large-scale interleaved video and tool reasoning dataset without requiring any long-form understanding from the underlying model. Training on this synthetic agentic dataset equips VideoThinker with dynamic reasoning capabilities, adaptive temporal exploration, and multi-step tool use. Remarkably, VideoThinker significantly outperforms both caption-only language model agents and strong video model baselines across long-video benchmarks, demonstrating the effectiveness of tool augmented synthetic data and adaptive retrieval and zoom reasoning for long-form video understanding.
zh

[CV-51] Zero-Shot Product Attribute Labeling with Vision-Language Models: A Three-Tier Evaluation Framework WACV2026

【速读】:该论文旨在解决时尚零售场景中细粒度属性预测(fine-grained attribute prediction)的挑战,尤其关注视觉语言模型(Vision-Language Models, VLMs)在多属性条件性(conditional fashion attributes)任务中的零样本(zero-shot)性能评估问题。核心难点在于许多属性(如“外层面料”)仅在特定条件下存在(即可见),若未正确识别属性适用性便直接分类会导致错误。解决方案的关键是提出一个三层诊断性评估框架:(1) 全局任务性能(含NA类,表示属性不适用);(2) 属性适用性检测;(3) 属性可确定时的细粒度分类。通过在DeepFashion-MultiModal数据集上对九种VLM进行系统评测,发现当前VLMs在细粒度分类上表现优异(Tier 3: 70.8% F1),但适用性检测能力薄弱(Tier 2: 34.1% NA-F1),揭示出这是主要瓶颈,并指出高效模型可实现接近旗舰模型90%的性能,为实际部署提供可行路径。

链接: https://arxiv.org/abs/2601.15711
作者: Shubham Shukla,Kunal Sonalkar
机构: Cornell University (康奈尔大学); Nordstrom (诺德斯特龙公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to WACV 2026 Workshop on Physical Retail AI (PRAW)

点击查看摘要

Abstract:Fine-grained attribute prediction is essential for fashion retail applications including catalog enrichment, visual search, and recommendation systems. Vision-Language Models (VLMs) offer zero-shot prediction without task-specific training, yet their systematic evaluation on multi-attribute fashion tasks remains underexplored. A key challenge is that fashion attributes are often conditional. For example, “outer fabric” is undefined when no outer garment is visible. This requires models to detect attribute applicability before attempting classification. We introduce a three-tier evaluation framework that decomposes this challenge: (1) overall task performance across all classes (including NA class: suggesting attribute is not applicable) for all attributes, (2) attribute applicability detection, and (3) fine-grained classification when attributes are determinable. Using DeepFashion-MultiModal, which explicitly defines NA (meaning attribute doesn’t exist or is not visible) within attribute label spaces, we benchmark nine VLMs spanning flagship (GPT-5, Gemini 2.5 Pro), efficient (GPT-5 Mini, Gemini 2.5 Flash), and ultra-efficient tiers (GPT-5 Nano, Gemini 2.5 Flash-Lite) against classifiers trained on pretrained Fashion-CLIP embeddings on 5,000 images across 18 attributes. Our findings reveal that: (1) zero-shot VLMs achieve 64.0% macro-F1, a threefold improvement over logistic regression on pretrained Fashion-CLIP embeddings; (2) VLMs excel at fine-grained classification (Tier 3: 70.8% F1) but struggle with applicability detection (Tier 2: 34.1% NA-F1), identifying a key bottleneck; (3) efficient models achieve over 90% of flagship performance at lower cost, offering practical deployment paths. This diagnostic framework enables practitioners to pinpoint whether errors stem from visibility detection or classification, guiding targeted improvements for production systems.
zh
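
三层评估的计算方式本身很简单,可用 scikit-learn 复现其骨架(标签为本文虚构的玩具数据,仅演示三层指标的切分逻辑):

```python
from sklearn.metrics import f1_score

# "NA" 表示属性不适用;其余为细粒度类别
y_true = ["NA", "cotton", "denim", "NA", "cotton", "wool"]
y_pred = ["NA", "cotton", "cotton", "wool", "cotton", "wool"]

# Tier 1:含 NA 类在内的整体 macro-F1
t1 = f1_score(y_true, y_pred, average="macro")

# Tier 2:属性适用性检测(NA vs 非 NA 的二分类)
to_bin = lambda ys: ["NA" if y == "NA" else "OK" for y in ys]
t2 = f1_score(to_bin(y_true), to_bin(y_pred), pos_label="NA", average="binary")

# Tier 3:仅在真值可确定(非 NA)的样本上算细粒度 macro-F1
pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != "NA"]
t3 = f1_score([t for t, _ in pairs], [p for _, p in pairs], average="macro")
print(round(t1, 3), round(t2, 3), round(t3, 3))
```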

[CV-52] Enhanced LULC Segmentation via Lightweight Model Refinements on ALOS-2 SAR Data

【速读】:该论文旨在解决合成孔径雷达(Synthetic Aperture Radar, SAR)遥感影像中土地利用/土地覆盖(Land-Use/Land-Cover, LULC)语义分割任务中存在的多个挑战,包括边界过度平滑、细长结构漏检、长尾类别分布下的稀有类性能退化等问题,同时保持模型架构的简洁性。解决方案的关键在于引入三项轻量级改进:(i)将高分辨率特征注入多尺度解码器以增强空间细节;(ii)设计一种渐进式 refine-up 头部结构,交替执行卷积细化与分步上采样以提升边缘清晰度;(iii)在 focal loss 与 Dice loss 的联合目标函数中引入 α-scale 因子,动态调节类别权重以缓解长尾标签带来的偏差问题。这些改进显著提升了日本全国尺度 ALOS-2 SAR 数据上的 LULC 分割精度,尤其改善了稀有类别的表现,并增强了水体检测的鲁棒性。

链接: https://arxiv.org/abs/2601.15705
作者: Ali Caglayan,Nevrez Imamoglu,Toru Kouyama
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 4 figures

点击查看摘要

Abstract:This work focuses on national-scale land-use/land-cover (LULC) semantic segmentation using ALOS-2 single-polarization (HH) SAR data over Japan, together with a companion binary water detection task. Building on SAR-W-MixMAE self-supervised pretraining [1], we address common SAR dense-prediction failure modes: boundary over-smoothing, missed thin/slender structures, and rare-class degradation under long-tailed labels, without increasing pipeline complexity. We introduce three lightweight refinements: (i) injecting high-resolution features into multi-scale decoding, (ii) a progressive refine-up head that alternates convolutional refinement and stepwise upsampling, and (iii) an α-scale factor that tempers class reweighting within a focal+dice objective. The resulting model yields consistent improvements on the Japan-wide ALOS-2 LULC benchmark, particularly for under-represented classes, and improves water detection across standard evaluation metrics.
zh
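
文中的第三项改进是在 focal+dice 联合目标里用 α 次幂温和化类重加权。下面按该描述给出一个分割损失示意(权重归一化方式与超参为本文假设):

```python
import torch
import torch.nn.functional as F

def tempered_focal_dice(logits, target, class_freq, alpha=0.5, gamma=2.0):
    """focal+dice 联合损失;类权重取逆频率的 α 次幂(α<1 即"温和化")。"""
    w = (1.0 / class_freq) ** alpha
    w = w / w.sum() * len(w)                         # 归一化,保持平均权重为 1
    ce = F.cross_entropy(logits, target, weight=w, reduction="none")
    focal = ((1 - torch.exp(-ce)) ** gamma * ce).mean()
    probs = F.softmax(logits, dim=1)
    onehot = F.one_hot(target, logits.size(1)).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    dice = 1 - (2 * inter + 1) / (probs.sum((0, 2, 3)) + onehot.sum((0, 2, 3)) + 1)
    return focal + dice.mean()

logits = torch.randn(2, 4, 16, 16)                   # 4 类 LULC 分割
target = torch.randint(0, 4, (2, 16, 16))
freq = torch.tensor([0.6, 0.25, 0.1, 0.05])          # 长尾类别频率
print(tempered_focal_dice(logits, target, freq))
```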

[CV-53] Beyond Visual Safety: Jailbreaking Multimodal Large Language Models for Harmful Image Generation via Semantic-Agnostic Inputs

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉安全边界上的漏洞问题,即现有研究对MLLMs视觉安全性的探索仍不充分,导致其可能被恶意利用生成有害内容。解决方案的关键在于提出一种名为“Beyond Visual Safety (BVS)”的图像-文本对越狱框架,其核心机制为“重建-生成”策略,通过中性化视觉拼接与归纳重构技术,将恶意意图从原始输入中解耦,从而诱导MLLMs生成有害图像,实验表明该方法在GPT-5模型上实现了98.21%的越狱成功率,揭示了当前MLLMs在视觉安全对齐方面的严重缺陷。

链接: https://arxiv.org/abs/2601.15698
作者: Mingyu Yu,Lana Liu,Zhehao Zhao,Wei Wang,Sujuan Qin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid advancement of Multimodal Large Language Models (MLLMs) has introduced complex security challenges, particularly at the intersection of textual and visual safety. While existing schemes have explored the security vulnerabilities of MLLMs, the investigation into their visual safety boundaries remains insufficient. In this paper, we propose Beyond Visual Safety (BVS), a novel image-text pair jailbreaking framework specifically designed to probe the visual safety boundaries of MLLMs. BVS employs a “reconstruction-then-generation” strategy, leveraging neutralized visual splicing and inductive recomposition to decouple malicious intent from raw inputs, thereby leading MLLMs to be induced into generating harmful images. Experimental results demonstrate that BVS achieves a remarkable jailbreak success rate of 98.21% against GPT-5 (12 January 2026 release). Our findings expose critical vulnerabilities in the visual safety alignment of current MLLMs.
zh

[CV-54] Performance-guided Reinforced Active Learning for Object Detection ICASSP2026

【速读】:该论文旨在解决当前主动学习(Active Learning, AL)策略在目标检测任务中存在的一大瓶颈问题:现有方法主要基于数据分布或内在信息内容评估样本的 informativeness(信息量),而未能直接关联下游任务性能指标(如目标检测中的平均精度均值 mAP),导致标注效率与模型性能提升之间缺乏有效引导。为解决此问题,论文提出了一种基于 mAP 指导的强化主动学习方法(MGRAL),其核心创新在于将“预期模型输出变化”作为样本选择的信息度量,并引入基于策略梯度的强化学习采样代理(sampling agent),以 mAP 提升作为奖励信号优化批量样本选择策略,从而实现对模型性能的直接驱动;同时,为降低 mAP 估计的计算开销,MGRAL 还设计了基于无监督快速查找表的高效估算机制,显著提升了算法的实际可部署性。

链接: https://arxiv.org/abs/2601.15688
作者: Zhixuan Liang,Xingyu Zeng,Rui Zhao,Ping Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by ICASSP 2026. Camera-ready Version

点击查看摘要

Abstract:Active learning (AL) strategies aim to train high-performance models with minimal labeling efforts, only selecting the most informative instances for annotation. Current approaches to evaluating data informativeness predominantly focus on the data’s distribution or intrinsic information content and do not directly correlate with downstream task performance, such as mean average precision (mAP) in object detection. Thus, we propose Performance-guided (i.e. mAP-guided) Reinforced Active Learning for Object Detection (MGRAL), a novel approach that leverages the concept of expected model output changes as informativeness. To address the combinatorial explosion challenge of batch sample selection and the non-differentiable correlation between model performance and selected batches, MGRAL skillfully employs a reinforcement learning-based sampling agent that optimizes selection using policy gradient with mAP improvement as reward. Moreover, to reduce the computational overhead of mAP estimation with unlabeled samples, MGRAL utilizes an unsupervised way with fast look-up tables, ensuring feasible deployment. We evaluate MGRAL’s active learning performance on detection tasks over PASCAL VOC and COCO benchmarks. Our approach demonstrates the highest AL curve with convincing visualizations, establishing a new paradigm in reinforcement learning-driven active object detection.
zh
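
MGRAL 的采样代理以 mAP 提升为奖励做策略梯度。抛开检测器本身,其选择机制可约化为如下 REINFORCE 骨架(奖励这里用玩具函数代替论文中以快速查找表估计的 mAP 增益):

```python
import torch

def select_batch(scores, k):
    """按 softmax 策略无放回采样 k 个候选,返回索引与对数概率之和。"""
    probs = torch.softmax(scores, dim=0)
    idx = torch.multinomial(probs, k, replacement=False)
    return idx, torch.log(probs[idx]).sum()

scores = torch.zeros(100, requires_grad=True)     # 每个未标注样本的策略得分
opt = torch.optim.Adam([scores], lr=0.1)
baseline = 0.0
for step in range(50):
    idx, logp = select_batch(scores, k=10)
    reward = (idx < 20).float().mean().item()     # 玩具奖励:假设前 20 个样本最有用
    baseline = 0.9 * baseline + 0.1 * reward      # 滑动平均基线以降低方差
    loss = -(reward - baseline) * logp            # REINFORCE 目标
    opt.zero_grad(); loss.backward(); opt.step()
print(torch.topk(scores, 5).indices)              # 策略应学会偏好前 20 个索引
```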

[CV-55] Consistency-Regularized GAN for Few-Shot SAR Target Recognition

【速读】:该论文旨在解决合成孔径雷达(Synthetic Aperture Radar, SAR)图像中少样本识别(Few-shot Recognition)因数据极度稀缺而导致的性能瓶颈问题。传统方法依赖生成对抗网络(Generative Adversarial Network, GAN)生成大量训练数据,再通过自监督学习(Self-supervised Learning, SSL)预训练模型并微调,但其核心矛盾在于:标准GAN在训练时仍需充足数据以保证稳定性,这与少样本场景的前提相悖。为此,作者提出一致性正则化生成对抗网络(Consistency-Regularized Generative Adversarial Network, Cr-GAN),其关键创新在于引入双分支判别器架构,将对抗训练与表征学习解耦,并结合通道级特征插值策略和双域循环一致性机制,从而在极低数据条件下仍能生成高质量、多样化的SAR样本,显著提升SSL算法性能。实验表明,Cr-GAN在MSTAR和SRSDD数据集上分别达到71.21%和51.64%的准确率(8-shot设置),优于现有主流方法,且参数量仅为先进扩散模型的约5%。

链接: https://arxiv.org/abs/2601.15681
作者: Yikui Zhai,Shikuang Liu,Wenlve Zhou,Hongsheng Zhang,Zhiheng Zhou,Xiaolin Tian,C. L. Philip Chen
机构: Wuyi University (五邑大学); South China University of Technology (华南理工大学); The University of Hong Kong (香港大学); Macau University of Science and Technology (澳门科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Few-shot recognition in synthetic aperture radar (SAR) imagery remains a critical bottleneck for real-world applications due to extreme data scarcity. A promising strategy involves synthesizing a large dataset with a generative adversarial network (GAN), pre-training a model via self-supervised learning (SSL), and then fine-tuning on the few labeled samples. However, this approach faces a fundamental paradox: conventional GANs themselves require abundant data for stable training, contradicting the premise of few-shot learning. To resolve this, we propose the consistency-regularized generative adversarial network (Cr-GAN), a novel framework designed to synthesize diverse, high-fidelity samples even when trained under these severe data limitations. Cr-GAN introduces a dual-branch discriminator that decouples adversarial training from representation learning. This architecture enables a channel-wise feature interpolation strategy to create novel latent features, complemented by a dual-domain cycle consistency mechanism that ensures semantic integrity. Our Cr-GAN framework is adaptable to various GAN architectures, and its synthesized data effectively boosts multiple SSL algorithms. Extensive experiments on the MSTAR and SRSDD datasets validate our approach, with Cr-GAN achieving a highly competitive accuracy of 71.21% and 51.64%, respectively, in the 8-shot setting, significantly outperforming leading baselines, while requiring only ~5% of the parameters of state-of-the-art diffusion models. Code is available at: this https URL.
zh
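
Cr-GAN 的“通道级特征插值”在实现上可以非常小:对两批判别器特征,按通道独立采样混合系数。以下为示意(特征形状为本文假设):

```python
import torch

def channel_interpolate(f_a, f_b):
    """每个通道独立采样 λ∈[0,1],混合两批特征以构造"新"潜在样本。"""
    lam = torch.rand(f_a.size(1), device=f_a.device).view(1, -1, 1, 1)
    return lam * f_a + (1 - lam) * f_b

f_a = torch.randn(4, 128, 8, 8)    # 判别器表征分支输出的两批特征
f_b = torch.randn(4, 128, 8, 8)
print(channel_interpolate(f_a, f_b).shape)
```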

[CV-56] Skywork UniPic 3.0: Unified Multi-Image Composition via Sequence Modeling

【速读】:该论文旨在解决多图像合成(multi-image composition)任务中的一致性与质量难题,尤其是针对社区高度关注的人体-物体交互(Human-Object Interaction, HOI)场景。现有模型在该领域缺乏透明的方法细节且性能受限,难以实现高质量的跨图像语义融合。解决方案的关键在于提出一个统一的多模态框架Skywork UniPic 3.0,其核心创新包括:1)构建涵盖数据收集、过滤与合成的完整高质量训练数据管道,仅用70万样本即实现优异性能;2)将多图像合成建模为序列建模问题,通过新型训练范式实现条件生成的统一序列合成;3)引入轨迹映射与分布匹配机制,在后训练阶段显著加速推理,仅需8步即可生成高保真图像,相较标准采样提升12.5倍速度。该方案在单图编辑和多图合成基准上均达到当前最优水平,验证了方法的有效性。

链接: https://arxiv.org/abs/2601.15664
作者: Hongyang Wei,Hongbo Liu,Zidong Wang,Yi Peng,Baixin Xu,Size Wu,Xuying Zhang,Xianglong He,Zexiang Liu,Peiyu Wang,Xuchen Song,Yangguang Li,Yang Liu,Yahui Zhou
机构: Skywork (Skywork)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The recent surge in popularity of Nano-Banana and Seedream 4.0 underscores the community’s strong interest in multi-image composition tasks. Compared to single-image editing, multi-image composition presents significantly greater challenges in terms of consistency and quality, yet existing models have not disclosed specific methodological details for achieving high-quality fusion. Through statistical analysis, we identify Human-Object Interaction (HOI) as the most sought-after category by the community. We therefore systematically analyze and implement a state-of-the-art solution for multi-image composition with a primary focus on HOI-centric tasks. We present Skywork UniPic 3.0, a unified multimodal framework that integrates single-image editing and multi-image composition. Our model supports an arbitrary (1~6) number and resolution of input images, as well as arbitrary output resolutions (within a total pixel budget of 1024x1024). To address the challenges of multi-image composition, we design a comprehensive data collection, filtering, and synthesis pipeline, achieving strong performance with only 700K high-quality training samples. Furthermore, we introduce a novel training paradigm that formulates multi-image composition as a sequence-modeling problem, transforming conditional generation into unified sequence synthesis. To accelerate inference, we integrate trajectory mapping and distribution matching into the post-training stage, enabling the model to produce high-fidelity samples in just 8 steps and achieve a 12.5x speedup over standard synthesis sampling. Skywork UniPic 3.0 achieves state-of-the-art performance on single-image editing benchmark and surpasses both Nano-Banana and Seedream 4.0 on multi-image composition benchmark, thereby validating the effectiveness of our data pipeline and training paradigm. Code, models and dataset are publicly available.
zh

[CV-57] Event-VStream: Event-Driven Real-Time Understanding for Long Video Streams

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, VLMs)在实时处理长视频流时面临的两大挑战:冗余帧处理导致的计算效率低下,以及历史上下文快速遗忘引发的时序信息丢失问题。现有流式系统依赖固定间隔解码或缓存裁剪策略,往往造成输出重复或关键时序信息被误删。其解决方案的核心在于提出Event-VStream框架,通过整合运动、语义与预测线索检测有意义的状态转换,并仅在事件边界触发语言生成;同时将每个事件嵌入固化至持久化记忆库中,从而实现长时推理能力与低延迟之间的平衡。该方法显著提升了实时视频理解任务中的性能表现,尤其在OVOBench-Realtime和Ego4D等基准测试中优于现有基线模型。

链接: https://arxiv.org/abs/2601.15655
作者: Zhenghui Guo,Yuanbin Man,Junyuan Sheng,Bowen Lin,Ahmed Ahmed,Bo Jiang,Boyuan Zhang,Miao Yin,Sian Jin,Omprakash Gnawal,Chengming Zhang
机构: University of Houston (休斯顿大学); The University of Texas at Arlington (德克萨斯大学阿灵顿分校); Indiana University Bloomington (印第安纳大学布卢明顿分校); Temple University (天普大学); Independent Researcher (独立研究者)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Real-time understanding of long video streams remains challenging for multimodal large language models (VLMs) due to redundant frame processing and rapid forgetting of past context. Existing streaming systems rely on fixed-interval decoding or cache pruning, which either produce repetitive outputs or discard crucial temporal information. We introduce Event-VStream, an event-aware framework that represents continuous video as a sequence of discrete, semantically coherent events. Our system detects meaningful state transitions by integrating motion, semantic, and predictive cues, and triggers language generation only at those boundaries. Each event embedding is consolidated into a persistent memory bank, enabling long-horizon reasoning while maintaining low latency. Across OVOBench-Realtime, and long-form Ego4D evaluations, Event-VStream achieves competitive performance. It improves over a VideoLLM-Online-8B baseline by +10.4 points on OVOBench-Realtime, achieves performance close to Flash-VStream-7B despite using only a general-purpose LLaMA-3-8B text backbone, and maintains around 70% GPT-5 win rate on 2-hour Ego4D streams.
zh
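
Event-VStream 只在“有意义的状态转换”处触发语言生成。一个可运行的玩具版本是:运动能量超过阈值、且与当前事件锚点的语义相似度跌破阈值时记为边界(阈值与特征来源均为假设;论文还融合了预测线索):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def event_boundaries(frames, embeds, motion_thr=0.08, sem_thr=0.85):
    """运动大且语义漂移时触发事件边界,并把锚点更新为新事件的嵌入。"""
    bounds, anchor = [0], embeds[0]
    for t in range(1, len(frames)):
        motion = float(np.abs(frames[t] - frames[t - 1]).mean())
        if motion > motion_thr and cosine(embeds[t], anchor) < sem_thr:
            bounds.append(t)
            anchor = embeds[t]          # 对应"事件嵌入固化进记忆库"
    return bounds

rng = np.random.default_rng(0)
base = [rng.random((32, 32)) for _ in range(3)]              # 3 个"事件"的底图
frames = [base[t // 20] + rng.normal(0, 0.01, (32, 32)) for t in range(60)]
embeds = [np.eye(16)[t // 20 % 16] + rng.normal(0, 0.05, 16) for t in range(60)]
print(event_boundaries(frames, embeds))   # 预期在 t≈20、40 附近触发
```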

[CV-58] SuperOcc: Toward Cohesive Temporal Modeling for Superquadric-based Occupancy Prediction

【速读】:该论文针对现有3D occupancy预测方法在自动驾驶场景中面临的三大问题展开研究:一是现有方法多采用密集场景表示,忽视了真实驾驶环境中物体分布的固有稀疏性;二是基于超二次曲面(superquadric)的稀疏表示框架在时间建模能力上不足,难以有效捕捉动态信息;三是查询稀疏性与几何表达能力之间存在权衡,且超二次曲面到体素的投射(splatting)计算效率低下。解决方案的关键在于提出SuperOcc框架,其核心创新包括:(1) 构建统一的时间建模机制,同时利用视点中心和物体中心的时间线索以增强时序感知;(2) 设计多超二次曲面解码策略,在不牺牲查询稀疏性的前提下提升几何表达能力;(3) 提出高效的超二次曲面到体素投射方案,显著改善计算效率。实验表明,SuperOcc在SurroundOcc和Occ3D基准上均达到当前最优性能并保持高效性。

链接: https://arxiv.org/abs/2601.15644
作者: Zichen Yu,Quanli Liu,Wei Wang,Liyong Zhang,Xiaoguang Zhao
机构: Dalian University of Technology (大连理工大学); Dalian Rail Transmit Intelligent Control and Intelligent Operation Technology Innovation Center (大连轨道交通智能控制与智能运行技术创新中心); Dalian Seasky Automation Co., Ltd (大连海天自动化有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D occupancy prediction plays a pivotal role in the realm of autonomous driving, as it provides a comprehensive understanding of the driving environment. Most existing methods construct dense scene representations for occupancy prediction, overlooking the inherent sparsity of real-world driving scenes. Recently, 3D superquadric representation has emerged as a promising sparse alternative to dense scene representations due to the strong geometric expressiveness of superquadrics. However, existing superquadric frameworks still suffer from insufficient temporal modeling, a challenging trade-off between query sparsity and geometric expressiveness, and inefficient superquadric-to-voxel splatting. To address these issues, we propose SuperOcc, a novel framework for superquadric-based 3D occupancy prediction. SuperOcc incorporates three key designs: (1) a cohesive temporal modeling mechanism to simultaneously exploit view-centric and object-centric temporal cues; (2) a multi-superquadric decoding strategy to enhance geometric expressiveness without sacrificing query sparsity; and (3) an efficient superquadric-to-voxel splatting scheme to improve computational efficiency. Extensive experiments on the SurroundOcc and Occ3D benchmarks demonstrate that SuperOcc achieves state-of-the-art performance while maintaining superior efficiency. The code is available at this https URL.
zh
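
超二次曲面到体素的投射,本质上是在体素网格上评估其内外函数。下面给出标准的超二次曲面内外函数及一次朴素的体素化(尺度与形状指数为假设;论文的高效 splatting 方案在此之上做了加速):

```python
import numpy as np

def superquadric_occupancy(grid, scale, eps1=1.0, eps2=1.0):
    """内外函数 F<=1 判为占据:
    F = (|x/a|^(2/e2) + |y/b|^(2/e2))^(e2/e1) + |z/c|^(2/e1)"""
    x, y, z = (np.abs(grid[..., i] / scale[i]) for i in range(3))
    f = (x ** (2 / eps2) + y ** (2 / eps2)) ** (eps2 / eps1) + z ** (2 / eps1)
    return f <= 1.0

lin = np.linspace(-1, 1, 32)
grid = np.stack(np.meshgrid(lin, lin, lin, indexing="ij"), axis=-1)  # 32^3 体素
occ = superquadric_occupancy(grid, scale=(0.8, 0.5, 0.3))  # eps=1 时退化为椭球
print(int(occ.sum()), "voxels occupied")
```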

[CV-59] Evolving Without Ending: Unifying Multimodal Incremental Learning for Continual Panoptic Perception

【速读】:该论文旨在解决多任务持续学习(Continual Learning, CL)在多模态场景下因灾难性遗忘(catastrophic forgetting)和跨模态语义混淆(semantic obfuscation)导致的模型性能退化问题,尤其聚焦于像素级、实例级和图像级联合感知的连续全景感知(Continual Panoptic Perception, CPP)任务。其解决方案的关键在于提出一个端到端的CPP模型,包含三个核心机制:(1) 协作式跨模态编码器(Collaborative Cross-Modal Encoder, CCE),用于实现多模态嵌入融合;(2) 基于对比特征蒸馏与实例蒸馏的可塑知识继承模块,以任务交互增强的方式缓解遗忘;(3) 跨模态一致性约束与CPP+框架,保障多任务增量训练中多模态语义对齐。此外,引入非对称伪标签策略,使模型无需示例回放即可演化,显著提升细粒度持续学习能力。

链接: https://arxiv.org/abs/2601.15643
作者: Bo Yuan,Danpei Zhao,Wentao Li,Tian Li,Zhiguo Jiang
机构: Beihang University (北京航空航天大学); Tianmushan Laboratory (天目山实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: substantial text overlap with arXiv:2407.14242

点击查看摘要

Abstract:Continual learning (CL) is a great endeavour in developing intelligent perception AI systems. However, the pioneer research has predominantly focus on single-task CL, which restricts the potential in multi-task and multimodal scenarios. Beyond the well-known issue of catastrophic forgetting, the multi-task CL also brings semantic obfuscation across multimodal alignment, leading to severe model degradation during incremental training steps. In this paper, we extend CL to continual panoptic perception (CPP), integrating multimodal and multi-task CL to enhance comprehensive image perception through pixel-level, instance-level, and image-level joint interpretation. We formalize the CL task in multimodal scenarios and propose an end-to-end continual panoptic perception model. Concretely, CPP model features a collaborative cross-modal encoder (CCE) for multimodal embedding. We also propose a malleable knowledge inheritance module via contrastive feature distillation and instance distillation, addressing catastrophic forgetting from task-interactive boosting manner. Furthermore, we propose a cross-modal consistency constraint and develop CPP+, ensuring multimodal semantic alignment for model updating under multi-task incremental scenarios. Additionally, our proposed model incorporates an asymmetric pseudo-labeling manner, enabling model evolving without exemplar replay. Extensive experiments on multimodal datasets and diverse CL tasks demonstrate the superiority of the proposed model, particularly in fine-grained CL tasks.
zh

[CV-60] Explainable Deepfake Detection with RL Enhanced Self-Blended Images ICASSP2026

【速读】:该论文旨在解决当前深度伪造(deepfake)检测方法普遍缺乏可解释性输出的问题,同时应对多模态大语言模型(MLLM)在该任务中因高质量带细粒度伪造归属标注的数据稀缺而难以应用的挑战。其解决方案的关键在于提出一种基于自混合图像(Self-Blended Images)的自动化思维链(Chain-of-Thought, CoT)数据生成框架,结合强化学习(Reinforcement Learning, RL)增强的检测机制,通过设计定制化的奖励函数与反馈驱动的合成数据生成策略,显著降低人工标注成本并提升跨数据集泛化能力。实验表明,该方法在多个跨数据集基准上性能达到或接近当前最优水平(SOTA)。

链接: https://arxiv.org/abs/2601.15624
作者: Ning Jiang,Dingheng Zeng,Yanhong Liu,Haiyang Yi,Shijie Yu,Minghe Weng,Haifeng Shen,Ying Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICASSP 2026

点击查看摘要

Abstract:Most prior deepfake detection methods lack explainable outputs. With the growing interest in multimodal large language models (MLLMs), researchers have started exploring their use in interpretable deepfake detection. However, a major obstacle in applying MLLMs to this task is the scarcity of high-quality datasets with detailed forgery attribution annotations, as textual annotation is both costly and challenging - particularly for high-fidelity forged images or videos. Moreover, multiple studies have shown that reinforcement learning (RL) can substantially enhance performance in visual tasks, especially in improving cross-domain generalization. To facilitate the adoption of mainstream MLLM frameworks in deepfake detection with reduced annotation cost, and to investigate the potential of RL in this context, we propose an automated Chain-of-Thought (CoT) data generation framework based on Self-Blended Images, along with an RL-enhanced deepfake detection framework. Extensive experiments validate the effectiveness of our CoT data construction pipeline, tailored reward mechanism, and feedback-driven synthetic data generation approach. Our method achieves performance competitive with state-of-the-art (SOTA) approaches across multiple cross-dataset benchmarks. Implementation details are available at this https URL.
zh
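
Self-Blended Images 的核心是:对同一张脸做轻微几何/色彩扰动后,用软掩码贴回原图,从而“无中生有”地制造伪造边界。以下为极简示意(扰动方式与掩码形状均为简化假设,原方法的变换与掩码生成更丰富):

```python
import numpy as np

def self_blend(face, max_shift=4, sigma=10.0):
    """平移+调色的自混合:mask*warped + (1-mask)*face。"""
    h, w, _ = face.shape
    dx, dy = np.random.randint(-max_shift, max_shift + 1, 2)
    warped = np.roll(face, (dy, dx), axis=(0, 1)) * np.random.uniform(0.9, 1.1)
    yy, xx = np.mgrid[0:h, 0:w]
    mask = np.exp(-(((yy - h / 2) ** 2 + (xx - w / 2) ** 2) / (2 * sigma ** 2)))
    mask = (mask / mask.max())[..., None]        # 中心为 1 的软掩码
    return np.clip(mask * warped + (1 - mask) * face, 0.0, 1.0)

face = np.random.rand(64, 64, 3)   # 用随机图代替真实人脸
print(self_blend(face).shape)
```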

[CV-61] Region-aware Spatiotemporal Modeling with Collaborative Domain Generalization for Cross-Subject EEG Emotion Recognition

【速读】:该论文旨在解决跨被试脑电(EEG)情绪识别(EER)中因个体间差异导致的信号分布偏移以及情绪相关神经表征在空间组织和时间演化上的高复杂性问题。现有方法通常孤立地改进空间建模、时间建模或泛化策略,难以在统一框架内对齐跨被试表征、捕捉多尺度动态并抑制个体特异性偏差。其解决方案的关键在于提出一种区域感知时空建模与协同域泛化(Region-aware Spatiotemporal Modeling with Collaborative Domain Generalization, RSM-CoDG)框架:首先利用功能脑区划分的神经科学先验构建区域级空间表征以提升跨被试可比性;其次采用多尺度时间建模刻画情绪诱发神经活动的动态演变;最后引入协同域泛化策略,在无监督目标被试场景下通过多维约束减少个体特异性偏差,从而增强模型对未知个体的泛化能力。

链接: https://arxiv.org/abs/2601.15615
作者: Weiwei Wu,Yueyang Li,Yuhu Shi,Weiming Zeng,Lang Qin,Yang Yang,Ke Zhou,Zhiguo Zhang,Wai Ting Siok,Nizhuan Wang
机构: Shanghai Maritime University (上海海事大学); The Hong Kong Polytechnic University (香港理工大学); Peking University (北京大学); Chinese Academy of Sciences (中国科学院); University of Chinese Academy of Sciences (中国科学院大学); Beijing Normal University (北京师范大学); Harbin Institute of Technology Shenzhen (哈尔滨工业大学深圳校区)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cross-subject EEG-based emotion recognition (EER) remains challenging due to strong inter-subject variability, which induces substantial distribution shifts in EEG signals, as well as the high complexity of emotion-related neural representations in both spatial organization and temporal evolution. Existing approaches typically improve spatial modeling, temporal modeling, or generalization strategies in isolation, which limits their ability to align representations across subjects while capturing multi-scale dynamics and suppressing subject-specific bias within a unified framework. To address these gaps, we propose a Region-aware Spatiotemporal Modeling framework with Collaborative Domain Generalization (RSM-CoDG) for cross-subject EEG emotion recognition. RSM-CoDG incorporates neuroscience priors derived from functional brain region partitioning to construct region-level spatial representations, thereby improving cross-subject comparability. It also employs multi-scale temporal modeling to characterize the dynamic evolution of emotion-evoked neural activity. In addition, the framework employs a collaborative domain generalization strategy, incorporating multidimensional constraints to reduce subject-specific bias in a fully unseen target subject setting, which enhances the generalization to unknown individuals. Extensive experimental results on SEED series datasets demonstrate that RSM-CoDG consistently outperforms existing competing methods, providing an effective approach for improving robustness. The source code is available at this https URL.
zh

[CV-62] Relative Classification Accuracy: A Calibrated Metric for Identity Consistency in Fine-Grained K-pop Face Generation

【速读】:该论文旨在解决生成式 AI(Generative AI)模型在细粒度、单领域任务中语义可控性评估难题,尤其针对高相似度类别(如K-pop偶像面部图像)下的身份一致性验证问题。传统指标如FID(Fréchet Inception Distance)和Inception Score(IS)无法有效检测此类场景中的身份错位现象。解决方案的关键在于提出一种校准指标——相对分类准确率(Relative Classification Accuracy, RCA),通过将生成模型的性能与理想分类器(oracle classifier)的基准进行归一化比较,从而更精准地量化语义一致性。实验表明,尽管模型视觉质量优异(FID=8.93),但其存在严重的语义模式坍缩(RCA=0.27),且失败模式主要由分辨率限制和同性别内部模糊性导致,该框架为条件生成模型的身份一致性提供了可复现的严谨评估标准。

链接: https://arxiv.org/abs/2601.15560
作者: Sylvey Lin,Eranki Vasistha
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in high-fidelity image generation. However, evaluating their semantic controllability-specifically for fine-grained, single-domain tasks-remains challenging. Standard metrics like FID and Inception Score (IS) often fail to detect identity misalignment in such specialized contexts. In this work, we investigate Class-Conditional DDPMs for K-pop idol face generation (32x32), a domain characterized by high inter-class similarity. We propose a calibrated metric, Relative Classification Accuracy (RCA), which normalizes generative performance against an oracle classifier’s baseline. Our evaluation reveals a critical trade-off: while the model achieves high visual quality (FID 8.93), it suffers from severe semantic mode collapse (RCA 0.27), particularly for visually ambiguous identities. We analyze these failure modes through confusion matrices and attribute them to resolution constraints and intra-gender ambiguity. Our framework provides a rigorous standard for verifying identity consistency in conditional generative models.
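按摘要的定义,RCA 把生成模型的"身份可识别性"相对 oracle 分类器在真实图像上的基准做归一化。以下为该指标的一个最小实现示意(归一化形式按字面定义推断,细节以原文为准):

```python
def relative_classification_accuracy(acc_on_generated: float, acc_on_real: float) -> float:
    """RCA = oracle 分类器在生成图像上的准确率 / 其在真实图像上的准确率(示意)。"""
    return acc_on_generated / acc_on_real

# 示例:oracle 在真实图像上准确率 0.90,在生成图像上仅 0.243 -> RCA = 0.27,
# 即摘要中"FID 很好但语义模式坍缩严重"的情形
print(relative_classification_accuracy(0.243, 0.90))
```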
zh

[CV-63] VIOLA: Towards Video In-Context Learning with Minimal Annotations

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在新视频领域泛化能力不足的问题,尤其针对标注数据稀缺场景下难以实现高效适应的挑战。其核心解决方案在于提出VIOLA框架,关键创新点包括:一是基于密度-不确定性加权采样策略,在有限标注预算下选择兼具多样性、代表性与信息量的样本,避免传统方法因选取视觉异常值而导致性能下降;二是构建混合数据池并引入置信度感知检索与置信度感知提示机制,通过融合相似性与置信度的复合评分动态筛选可靠示范样本,使模型能够自适应区分真实标签与噪声伪标签,从而在低资源条件下显著提升MLLMs对视频内容的理解与适应能力。

链接: https://arxiv.org/abs/2601.15549
作者: Ryo Fujii,Hideo Saito,Ryo Hachiuma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generalizing Multimodal Large Language Models (MLLMs) to novel video domains is essential for real-world deployment but remains challenging due to the scarcity of labeled data. While In-Context Learning (ICL) offers a training-free adaptation path, standard methods rely on large annotated pools, which are often impractical in specialized environments like industrial or surgical settings since they require the experts’ annotations. To bridge this gap, we introduce VIOLA (Video In-cOntext Learning with minimal Annotation), a label-efficient framework that synergizes minimal expert supervision with abundant unlabeled data. First, to maximize the efficiency of a strict annotation budget, we propose density-uncertainty-weighted sampling. Unlike standard diversity or uncertainty strategies that risk selecting visual outliers, our method leverages density estimation to identify samples that are simultaneously diverse, representative, and informative. Second, to utilize the remaining unlabeled data without noise propagation, we construct a hybrid pool and introduce confidence-aware retrieval and confidence-aware prompting. These mechanisms explicitly model label reliability, retrieving demonstrations based on a composite score of similarity and confidence while enabling the MLLM to adaptively distinguish between verified ground truths and noisy pseudo-labels. Extensive experiments across nine diverse benchmarks using four MLLMs demonstrate that our framework significantly outperforms various baselines in low-resource settings, achieving robust adaptation with minimal annotation costs.
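摘要中的置信度感知检索可以概括为:以"与查询的相似度"和"示范样本标签可靠性"的复合分数来选取 in-context 示范。以下为该打分思路的示意(线性加权形式与参数 lam 为说明性假设):

```python
import numpy as np

def retrieve_demonstrations(query_emb, demo_embs, demo_conf, k=4, lam=0.7):
    """复合分数 = lam*余弦相似度 + (1-lam)*置信度,取 top-k 示范(示意)。
    demo_conf: 每个候选示范的标签可靠性,真值样本可取 1.0,伪标签取模型置信度。"""
    q = query_emb / np.linalg.norm(query_emb)
    d = demo_embs / np.linalg.norm(demo_embs, axis=1, keepdims=True)
    score = lam * (d @ q) + (1.0 - lam) * np.asarray(demo_conf)
    return np.argsort(-score)[:k]
```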
zh

[CV-64] DeltaDorsal: Enhancing Hand Pose Estimation with Dorsal Features in Egocentric Views

【速读】:该论文旨在解决第一人称视角(egocentric)下手部姿态估计因手指频繁遮挡而导致的精度下降问题。其关键解决方案在于提出一种基于双流差分编码器(dual-stream delta encoder)的新方法,通过对比动态手部与基准放松状态下的背侧皮肤形变特征(dorsal hand skin deformation),实现无需依赖完整手部几何信息和大型模型即可提升在自遮挡场景下的姿态估计准确性。实验表明,仅使用裁剪后的背侧图像,该方法在手指遮挡率达50%时,相比现有最优技术将平均每关节角度误差(MPJAE)降低18%,同时支持如等长力检测等新型交互模式,并显著减小模型规模。

链接: https://arxiv.org/abs/2601.15516
作者: William Huang,Siyou Pei,Leyi Zou,Eric J. Gonzalez,Ishan Chatterjee,Yang Zhang
机构: University of California, Los Angeles (加州大学洛杉矶分校); Google (谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 16 pages, 11 figures, Presented at ACM CHI 2026. For associated codebase, see this https URL

点击查看摘要

Abstract:The proliferation of XR devices has made egocentric hand pose estimation a vital task, yet this perspective is inherently challenged by frequent finger occlusions. To address this, we propose a novel approach that leverages the rich information in dorsal hand skin deformation, unlocked by recent advances in dense visual featurizers. We introduce a dual-stream delta encoder that learns pose by contrasting features from a dynamic hand with a baseline relaxed position. Our evaluation demonstrates that, using only cropped dorsal images, our method reduces the Mean Per Joint Angle Error (MPJAE) by 18% in self-occluded scenarios (fingers ≥50% occluded) compared to state-of-the-art techniques that depend on the whole hand's geometry and large model backbones. Consequently, our method not only enhances the reliability of downstream tasks like index finger pinch and tap estimation in occluded scenarios but also unlocks new interaction paradigms, such as detecting isometric force for a surface “click” without visible movement while minimizing model size.
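双流差分编码器的核心是:对"动态手部帧"与"放松基准帧"用共享骨干分别提特征,再以二者之差回归关节角。以下为 PyTorch 示意(骨干结构与维度均为占位假设,原文使用稠密视觉特征提取器):

```python
import torch.nn as nn

class DeltaEncoder(nn.Module):
    """以 (动态帧特征 - 基准帧特征) 回归关节角的双流编码器(示意)。"""
    def __init__(self, feat_dim=256, num_joint_angles=20):
        super().__init__()
        self.backbone = nn.Sequential(   # 占位骨干
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        self.head = nn.Linear(feat_dim, num_joint_angles)

    def forward(self, dorsal_dynamic, dorsal_relaxed):
        # 差分特征只保留皮肤形变带来的变化,对光照等共有因素天然不敏感
        delta = self.backbone(dorsal_dynamic) - self.backbone(dorsal_relaxed)
        return self.head(delta)
```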
zh

[CV-65] Controllable Layered Image Generation for Real-World Editing

【速读】:该论文旨在解决当前图像生成模型在用户尝试编辑图像中特定元素时,难以实现可控性和一致性的问题。现有方法在生成分层表示时往往无法保证各图层之间的合理合成关系,且对象图层常缺乏阴影、反射等真实视觉效果。其解决方案的关键在于提出LASAGNA框架,该框架能够联合生成图像及其组成图层(包括逼真的背景和带有丰富视觉效果的透明前景),并从多种条件输入(如文本提示、前景/背景图像、位置掩码)中高效学习正确的图像合成逻辑,从而显著提升可控性与一致性。此外,作者构建了LASAGNA-48K数据集与LASAGNABENCH基准,为层编辑任务提供标准化评估支持。

链接: https://arxiv.org/abs/2601.15507
作者: Jinrui Yang,Qing Liu,Yijun Li,Mengwei Ren,Letian Zhang,Zhe Lin,Cihang Xie,Yuyin Zhou
机构: UC Santa Cruz; Adobe Research
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent image generation models have shown impressive progress, yet they often struggle to yield controllable and consistent results when users attempt to edit specific elements within an existing image. Layered representations enable flexible, user-driven content creation, but existing approaches often fail to produce layers with coherent compositing relationships, and their object layers typically lack realistic visual effects such as shadows and reflections. To overcome these limitations, we propose LASAGNA, a novel, unified framework that generates an image jointly with its composing layers–a photorealistic background and a high-quality transparent foreground with compelling visual effects. Unlike prior work, LASAGNA efficiently learns correct image composition from a wide range of conditioning inputs–text prompts, foreground, background, and location masks–offering greater controllability for real-world applications. To enable this, we introduce LASAGNA-48K, a new dataset composed of clean backgrounds and RGBA foregrounds with physically grounded visual effects. We also propose LASAGNABENCH, the first benchmark for layer editing. We demonstrate that LASAGNA excels in generating highly consistent and coherent results across multiple image layers simultaneously, enabling diverse post-editing applications that accurately preserve identity and visual effects. LASAGNA-48K and LASAGNABENCH will be publicly released to foster open research in the community. The project page is this https URL.
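分层表示的"正确合成关系"可用标准 over 合成检验:透明前景层叠加到背景层上应复原完整图像。以下为该约束的最小示意:

```python
import numpy as np

def composite_over(fg_rgba: np.ndarray, bg_rgb: np.ndarray) -> np.ndarray:
    """标准 alpha over 合成:C = a*F + (1-a)*B(示意)。
    fg_rgba: (H, W, 4),bg_rgb: (H, W, 3),取值均为 [0, 1]。"""
    rgb, alpha = fg_rgba[..., :3], fg_rgba[..., 3:4]
    return alpha * rgb + (1.0 - alpha) * bg_rgb
```

训练时可将 composite_over(前景层, 背景层) 与联合生成的完整图像做一致性比较,这只是对摘要所述"合成关系"的直观说明,非原文的具体约束形式。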
zh

[CV-66] Hybrid Vision Transformer-GAN Attribute Neutralizer for Mitigating Bias in Chest X-Ray Diagnosis

【速读】:该论文旨在解决胸部X光图像分类器中存在的性别和年龄相关偏见问题,这种偏见会导致少数群体的系统性漏诊。其关键解决方案在于将传统的U-Net卷积编码器替换为基于Vision Transformer (ViT) 的架构,构建一种新的属性中性化框架(Attribute-Neutral Framework),以更有效地抑制人口统计学属性泄漏,同时保持诊断准确性。实验表明,在适度编辑强度下(alpha = 0.5),使用DeiT-S小型Image Transformer作为中性化器可使患者性别识别的AUC降低至约0.80,比原U-Net框架低约10个百分点,且疾病预测的宏平均ROC AUC仅下降不超过5个百分点,最差子群AUC仍维持在0.70左右,验证了ViT结构在减少属性泄漏方面的有效性与临床实用性。

链接: https://arxiv.org/abs/2601.15490
作者: Jobeal Solomon,Ali Mohammed Mansoor Alsahag,Seyed Sahand Mohammadi Ziabari
机构: University of Amsterdam (阿姆斯特丹大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Bias in chest X-ray classifiers frequently stems from sex- and age-related shortcuts, leading to systematic underdiagnosis of minority subgroups. Previous pixel-space attribute neutralizers, which rely on convolutional encoders, lessen but do not fully remove this attribute leakage at clinically usable edit strengths. This study evaluates whether substituting the U-Net convolutional encoder with a Vision Transformer backbone in the Attribute-Neutral Framework can reduce demographic attribute leakage while preserving diagnostic accuracy. A data-efficient Image Transformer Small (DeiT-S) neutralizer was trained on the ChestX-ray14 dataset. Its edited images, generated across eleven edit-intensity levels, were evaluated with an independent AI judge for attribute leakage and with a convolutional neural network (ConvNet) for disease prediction. At a moderate edit level (alpha = 0.5), the Vision Transformer (ViT) neutralizer reduces patient sex-recognition area under the curve (AUC) to approximately 0.80, about 10 percentage points below the original framework’s convolutional U-Net encoder, despite being trained for only half as many epochs. Meanwhile, macro receiver operating characteristic area under the curve (ROC AUC) across 15 findings stays within five percentage points of the unedited baseline, and the worst-case subgroup AUC remains near 0.70. These results indicate that global self-attention vision models can further suppress attribute leakage without sacrificing clinical utility, suggesting a practical route toward fairer chest X-ray AI.
zh

[CV-67] Seeing through Light and Darkness: Sensor-Physics Grounded Deblurring HDR NeRF from Single-Exposure Images and Events

【速读】:该论文旨在解决从单张低动态范围(LDR)模糊图像及对应事件数据中重建高动态范围(HDR)且清晰的三维场景表示这一难题,尤其在极端光照条件下表现不佳的问题。现有方法虽利用事件数据提升性能,但忽略了相机输出与物理世界辐射度之间的传感器物理不匹配问题,导致HDR重建和去模糊效果受限。其解决方案的关键在于提出一个统一的、基于传感器物理规律的NeRF框架,通过引入像素级RGB映射场(RGB mapping field)将渲染的HDR像素值与输入LDR图像的传感器记录值对齐,并设计事件映射场(event mapping field)连接物理场景动态与事件传感器的实际输出,二者与NeRF网络联合优化,从而充分利用事件数据中的时空动态信息来增强HDR清晰三维表示的学习能力。

链接: https://arxiv.org/abs/2601.15475
作者: Yunshan Qi,Lin Zhu,Nan Bao,Yifan Zhao,Jia Li
机构: Beihang University (北京航空航天大学); Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Novel view synthesis from low dynamic range (LDR) blurry images, which are common in the wild, struggles to recover high dynamic range (HDR) and sharp 3D representations in extreme lighting conditions. Although existing methods employ event data to address this issue, they ignore the sensor-physics mismatches between the camera output and physical world radiance, resulting in suboptimal HDR and deblurring results. To cope with this problem, we propose a unified sensor-physics grounded NeRF framework for sharp HDR novel view synthesis from single-exposure blurry LDR images and corresponding events. We employ NeRF to directly represent the actual radiance of the 3D scene in the HDR domain and model raw HDR scene rays hitting the sensor pixels as in the physical world. A pixel-wise RGB mapping field is introduced to align the above rendered pixel values with the sensor-recorded LDR pixel values of the input images. A novel event mapping field is also designed to bridge the physical scene dynamics and actual event sensor output. The two mapping fields are jointly optimized with the NeRF network, leveraging the spatial and temporal dynamic information in events to enhance the sharp HDR 3D representation learning. Experiments on the collected and public datasets demonstrate that our method can achieve state-of-the-art deblurring HDR novel view synthesis results with single-exposure blurry LDR images and corresponding events.
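摘要中"事件映射场"要衔接的理想事件物理模型是:当像素对数亮度相对参考值的变化跨过对比度阈值 C 时触发极性事件。以下为该理想模型的示意:

```python
import numpy as np

def ideal_events(log_I_ref: np.ndarray, log_I_now: np.ndarray, C: float = 0.2):
    """|log I(t) - log I(t_ref)| >= C 时触发 ±1 事件(理想事件相机模型,示意)。"""
    diff = log_I_now - log_I_ref
    events = np.zeros_like(diff, dtype=np.int8)
    events[diff >= C] = 1      # 亮度上升
    events[diff <= -C] = -1    # 亮度下降
    return events
```

真实传感器的输出与该理想模型之间存在偏差(噪声、阈值漂移等),这正是论文用可学习的事件映射场去弥合的部分。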
zh

[CV-68] DevPrompt: Deviation-Based Prompt Learning for One-Normal ShotImage Anomaly Detection

【速读】:该论文旨在解决少样本异常检测(Few-Normal Shot Anomaly Detection, FNSAD)中因正常样本稀缺导致的判别能力弱和局部异常评分机制不明确的问题。其核心挑战在于如何在有限监督下实现高精度的异常定位与可解释性。解决方案的关键在于提出一种基于偏差引导的提示学习框架(deviation-guided prompt learning framework),通过引入共享的可学习上下文向量(context vectors)替代固定提示前缀,并利用异常特定的后缀标记(suffix tokens)实现类别感知对齐;同时,结合Top-K多实例学习(Top-K Multiple Instance Learning, MIL)构建基于统计偏差的评分机制,将图像块特征建模为相对于正常分布的高斯偏离,从而提升异常区域的得分区分度与定位准确性。

链接: https://arxiv.org/abs/2601.15453
作者: Morteza Poudineh,Marc Lalonde
机构: Concordia University (康考迪亚大学); Computer Research Institute of Montreal (蒙特利尔计算机研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages

点击查看摘要

Abstract:Few-normal shot anomaly detection (FNSAD) aims to detect abnormal regions in images using only a few normal training samples, making the task highly challenging due to limited supervision and the diversity of potential defects. Recent approaches leverage vision-language models such as CLIP with prompt-based learning to align image and text features. However, existing methods often exhibit weak discriminability between normal and abnormal prompts and lack principled scoring mechanisms for patch-level anomalies. We propose a deviation-guided prompt learning framework that integrates the semantic power of vision-language models with the statistical reliability of deviation-based scoring. Specifically, we replace fixed prompt prefixes with learnable context vectors shared across normal and abnormal prompts, while anomaly-specific suffix tokens enable class-aware alignment. To enhance separability, we introduce a deviation loss with Top-K Multiple Instance Learning (MIL), modeling patch-level features as Gaussian deviations from the normal distribution. This allows the network to assign higher anomaly scores to patches with statistically significant deviations, improving localization and interpretability. Experiments on the MVTecAD and VISA benchmarks demonstrate superior pixel-level detection performance compared to PromptAD and other baselines. Ablation studies further validate the effectiveness of learnable prompts, deviation-based scoring, and the Top-K MIL strategy.
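摘要中的偏差评分可理解为:把 patch 分数规范化为相对正常分布的 z 偏离,Top-K MIL 则只聚合最可疑的 K 个 patch 形成图像级监督。以下为该损失的简化示意(参考分布参数 mu/sigma 的来源为常见做法的假设):

```python
import torch

def deviation_loss(patch_scores, is_anomaly, mu=0.0, sigma=1.0, k=10, margin=5.0):
    """patch_scores: (N,) 单张图的 patch 分数;is_anomaly: 图像级标签(示意)。"""
    z = (patch_scores - mu) / sigma                           # 高斯偏离
    topk = torch.topk(z, k=min(k, z.numel())).values.mean()  # Top-K MIL 聚合
    if not is_anomaly:
        return topk.abs()                                    # 正常图:偏离应趋近 0
    return torch.clamp(margin - topk, min=0.0)               # 异常图:偏离应超过 margin
```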
zh

[CV-69] CASL: Concept-Aligned Sparse Latents for Interpreting Diffusion Models

【速读】:该论文旨在解决扩散模型(Diffusion Models)中潜在表示难以解释的问题,尤其是现有基于稀疏自编码器(Sparse Autoencoders, SAEs)的方法多采用无监督方式,无法将稀疏特征与人类可理解的语义概念对齐,从而限制了对生成图像的可靠语义控制。解决方案的关键在于提出一种监督式框架CASL(Concept-Aligned Sparse Latents),其核心步骤包括:首先在冻结的U-Net激活上训练SAE以获得解耦的潜在表示,随后学习一个轻量级线性映射,将每个语义概念与一组相关的稀疏潜在维度进行关联。该方法通过CASL-Steer实现对潜在空间的因果干预,验证概念对齐方向对生成内容的影响,并引入编辑精度比(Editing Precision Ratio, EPR)量化概念特异性与无关属性保持能力,从而显著提升扩散模型中潜在空间的可解释性和可控性。

链接: https://arxiv.org/abs/2601.15441
作者: Zhenghao He,Guangzhi Xiong,Boyang Wang,Sanchit Sinha,Aidong Zhang
机构: University of Virginia (弗吉尼亚大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Internal activations of diffusion models encode rich semantic information, but interpreting such representations remains challenging. While Sparse Autoencoders (SAEs) have shown promise in disentangling latent representations, existing SAE-based methods for diffusion model understanding rely on unsupervised approaches that fail to align sparse features with human-understandable concepts. This limits their ability to provide reliable semantic control over generated images. We introduce CASL (Concept-Aligned Sparse Latents), a supervised framework that aligns sparse latent dimensions of diffusion models with semantic concepts. CASL first trains an SAE on frozen U-Net activations to obtain disentangled latent representations, and then learns a lightweight linear mapping that associates each concept with a small set of relevant latent dimensions. To validate the semantic meaning of these aligned directions, we propose CASL-Steer, a controlled latent intervention that shifts activations along the learned concept axis. Unlike editing methods, CASL-Steer is used solely as a causal probe to reveal how concept-aligned latents influence generated content. We further introduce the Editing Precision Ratio (EPR), a metric that jointly measures concept specificity and the preservation of unrelated attributes. Experiments show that our method achieves superior editing precision and interpretability compared to existing approaches. To the best of our knowledge, this is the first work to achieve supervised alignment between latent representations and semantic concepts in diffusion models.
zh

[CV-70] SplatBus: A Gaussian Splatting Viewer Framework via GPU Interprocess Communication

【速读】:该论文旨在解决当前3D高斯溅射(3D Gaussian Splatting, 3DGS)方法难以集成到传统基于网格(mesh-based)的渲染管线中的问题,而后者在交互式应用和艺术探索中具有广泛使用场景。解决方案的关键在于利用NVIDIA的进程间通信(Inter-Process Communication, IPC)API,实现与外部渲染客户端(如Unity、Blender、Unreal Engine及OpenGL查看器)的无缝集成,从而将3DGS的实时渲染结果高效输出至这些平台,提升其在实际应用中的兼容性和可扩展性。

链接: https://arxiv.org/abs/2601.15431
作者: Yinghan Xu,Théo Morales,John Dingliana
机构: Trinity College Dublin (三一学院都柏林大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Radiance field-based rendering methods have attracted significant interest from the computer vision and computer graphics communities. They enable high-fidelity rendering with complex real-world lighting effects, but at the cost of high rendering time. 3D Gaussian Splatting solves this issue with a rasterisation-based approach for real-time rendering, enabling applications such as autonomous driving, robotics, virtual reality, and extended reality. However, current 3DGS implementations are difficult to integrate into traditional mesh-based rendering pipelines, which is a common use case for interactive applications and artistic exploration. To address this limitation, this software solution uses Nvidia's interprocess communication (IPC) APIs to integrate easily into existing implementations and allow the results to be viewed in external clients such as Unity, Blender, Unreal Engine, and OpenGL viewers. The code is available at this https URL.
zh

[CV-71] DuFal: Dual-Frequency-Aware Learning for High-Fidelity Extremely Sparse-view CBCT Reconstruction

【速读】:该论文旨在解决稀疏视图锥束计算机断层成像(Cone-Beam Computed Tomography, CBCT)重建中因X射线投影数据不足导致的高频解剖结构细节丢失问题,此类细节对应于图像中的高频率成分,而传统基于卷积神经网络(Convolutional Neural Networks, CNN)的方法通常偏向学习低频信息,难以有效恢复精细结构。其解决方案的关键在于提出DuFal(Dual-Frequency-Aware Learning)框架,该框架采用双路径架构融合频域与空域处理:核心创新为高局部因子分解傅里叶神经算子(High-Local Factorized Fourier Neural Operator),包含两个互补分支——全局高频增强傅里叶神经算子用于捕捉全局频率模式,局部高频增强傅里叶神经算子则通过空间分块处理保留可能在全局频域分析中丢失的空间局部性;此外,设计了谱-通道因子分解机制以降低参数量,并引入交叉注意力频域融合模块实现空间与频率特征的有效整合,最终通过特征解码和强度场解码流程生成高质量CBCT体积重建结果。

链接: https://arxiv.org/abs/2601.15416
作者: Cuong Tran Van,Trong-Thang Pham,Ngoc-Son Nguyen,Duy Minh Ho Nguyen,Ngan Le
机构: FPT Software AI Center (越南FPT软件人工智能中心); University of Arkansas (阿肯色大学); Max Planck Research School for Intelligent Systems (IMPRS-IS) (马克斯·普朗克智能系统研究所); DFKI (德国人工智能研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Published with J2C Certification in Transactions on Machine Learning Research (TMLR)

点击查看摘要

Abstract:Sparse-view Cone-Beam Computed Tomography reconstruction from limited X-ray projections remains a challenging problem in medical imaging due to the inherent undersampling of fine-grained anatomical details, which correspond to high-frequency components. Conventional CNN-based methods often struggle to recover these fine structures, as they are typically biased toward learning low-frequency information. To address this challenge, this paper presents DuFal (Dual-Frequency-Aware Learning), a novel framework that integrates frequency-domain and spatial-domain processing via a dual-path architecture. The core innovation lies in our High-Local Factorized Fourier Neural Operator, which comprises two complementary branches: a Global High-Frequency Enhanced Fourier Neural Operator that captures global frequency patterns and a Local High-Frequency Enhanced Fourier Neural Operator that processes spatially partitioned patches to preserve spatial locality that might be lost in global frequency analysis. To improve efficiency, we design a Spectral-Channel Factorization scheme that reduces the Fourier Neural Operator parameter count. We also design a Cross-Attention Frequency Fusion module to integrate spatial and frequency features effectively. The fused features are then decoded through a Feature Decoder to produce projection representations, which are subsequently processed through an Intensity Field Decoding pipeline to reconstruct a final Computed Tomography volume. Experimental results on the LUNA16 and ToothFairy datasets demonstrate that DuFal significantly outperforms existing state-of-the-art methods in preserving high-frequency anatomical features, particularly under extremely sparse-view settings.
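傅里叶神经算子层的基本形态是:FFT 后只在保留的频率模式上施加可学习的复数线性变换,再逆变换回空间域;论文的谱-通道因子分解与高/局部频率增强分支都建立在这一骨架上。以下为未做因子分解的标准 2D 谱卷积示意(简化为只保留单角低频模式):

```python
import torch
import torch.nn as nn

class SpectralConv2d(nn.Module):
    """标准 2D 谱卷积:仅对前 (m1, m2) 个傅里叶模式做可学习变换(示意)。"""
    def __init__(self, in_ch, out_ch, m1=12, m2=12):
        super().__init__()
        self.m1, self.m2 = m1, m2
        scale = 1.0 / (in_ch * out_ch)
        self.w = nn.Parameter(scale * torch.randn(in_ch, out_ch, m1, m2, dtype=torch.cfloat))

    def forward(self, x):                      # x: (B, C, H, W)
        x_ft = torch.fft.rfft2(x)              # 实数 FFT,频谱形状 (B, C, H, W//2+1)
        out = torch.zeros(x.size(0), self.w.size(1), x_ft.size(2), x_ft.size(3),
                          dtype=torch.cfloat, device=x.device)
        # 简化:完整 FNO 还会保留负频角;论文的因子分解则把稠密复权重 w
        # 分解为谱向/通道向的小参数组以减参
        out[:, :, :self.m1, :self.m2] = torch.einsum(
            "bixy,ioxy->boxy", x_ft[:, :, :self.m1, :self.m2], self.w)
        return torch.fft.irfft2(out, s=x.shape[-2:])
```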
zh

[CV-72] Evaluating Multimodal Large Language Models for Heterogeneous Face Recognition

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在异质人脸识别(Heterogeneous Face Recognition, HFR)任务中的性能瓶颈问题,即当注册图像(enrollment)与验证图像(probe)来自不同传感模态(如可见光VIS、近红外NIR、短波红外SWIR和热成像THERMAL)时,MLLMs的识别准确率是否能够满足实际生物特征识别系统的要求。其解决方案的关键在于通过系统性评估多个开源MLLMs在多种跨模态场景下的表现,采用标准化生物特征识别协议及指标(如获取率Acquire Rate、等错误率Equal Error Rate (EER)和真实接受率True Accept Rate (TAR))进行量化分析,从而揭示当前MLLMs在复杂跨谱条件下的性能局限,强调了在部署前必须开展严谨的生物特征评估的重要性。

链接: https://arxiv.org/abs/2601.15406
作者: Hatef Otroshi Shahreza,Anjith George,Sébastien Marcel
机构: Idiap Research Institute (Idiap 研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have recently demonstrated strong performance on a wide range of vision-language tasks, raising interest in their potential use for biometric applications. In this paper, we conduct a systematic evaluation of state-of-the-art MLLMs for heterogeneous face recognition (HFR), where enrollment and probe images are from different sensing modalities, including visual (VIS), near infrared (NIR), short-wave infrared (SWIR), and thermal camera. We benchmark multiple open-source MLLMs across several cross-modality scenarios, including VIS-NIR, VIS-SWIR, and VIS-THERMAL face recognition. The recognition performance of MLLMs is evaluated using biometric protocols and based on different metrics, including Acquire Rate, Equal Error Rate (EER), and True Accept Rate (TAR). Our results reveal substantial performance gaps between MLLMs and classical face recognition systems, particularly under challenging cross-spectral conditions, in spite of recent advances in MLLMs. Our findings highlight the limitations of current MLLMs for HFR and also the importance of rigorous biometric evaluation when considering their deployment in face recognition systems.
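摘要中的 EER 指 FAR(错误接受率)与 FRR(错误拒绝率)相等时的错误率。以下为从真匹配/假匹配分数计算 EER 的最小实现(阈值扫描为常见做法):

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """扫描阈值,取 |FAR - FRR| 最小处二者的均值作为 EER(示意)。"""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0
```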
zh

[CV-73] GeMM-GAN: A Multimodal Generative Model Conditioned on Histopathology Images and Clinical Descriptions for Gene Expression Profile Generation

【速读】:该论文旨在解决基因表达数据在生物医学研究中因隐私法规严格和实验成本高昂而导致的获取困难问题。其解决方案的关键在于提出一种名为GeMM-GAN的新型生成对抗网络(Generative Adversarial Network, GAN),该模型以组织病理学切片图像和临床元数据为条件,通过Transformer编码器处理图像块并结合跨注意力机制(Cross Attention)将图像与文本标记关联,生成一个条件向量,从而引导生成模型合成具有生物学一致性的基因表达谱。此方法显著提升了下游疾病类型预测任务的准确性,优于现有最先进的生成模型。

链接: https://arxiv.org/abs/2601.15392
作者: Francesca Pia Panaccione,Carlo Sgaravatti,Pietro Pinoli
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 2 figures. Published at Image Analysis and Processing - ICIAP 2025 Workshops

点击查看摘要

Abstract:Biomedical research increasingly relies on integrating diverse data modalities, including gene expression profiles, medical images, and clinical metadata. While medical images and clinical metadata are routinely collected in clinical practice, gene expression data presents unique challenges for widespread research use, mainly due to stringent privacy regulations and costly laboratory experiments. To address these limitations, we present GeMM-GAN, a novel Generative Adversarial Network conditioned on histopathology tissue slides and clinical metadata, designed to synthesize realistic gene expression profiles. GeMM-GAN combines a Transformer Encoder for image patches with a final Cross Attention mechanism between patches and text tokens, producing a conditioning vector to guide a generative model in generating biologically coherent gene expression profiles. We evaluate our approach on the TCGA dataset and demonstrate that our framework outperforms standard generative models and generates more realistic and functionally meaningful gene expression profiles, improving accuracy on downstream disease-type prediction by more than 11% compared to current state-of-the-art generative models. Code will be available at: this https URL
zh

[CV-74] AI-Based Culvert-Sewer Inspection

【速读】:该论文旨在解决排水系统中涵洞和污水管道缺陷自动分割任务在标注数据稀缺条件下的性能瓶颈问题。针对这一挑战,论文提出三种关键解决方案:首先,通过传统数据增强与动态标签注入等预处理策略提升有限训练数据的利用效率,显著改善交并比(Intersection over Union, IoU)和F1分数;其次,设计了FORTRESS架构,融合深度可分离卷积、自适应Kolmogorov-Arnold网络(Adaptive Kolmogorov-Arnold Networks, KAN)与多尺度注意力机制,在保持优异分割精度的同时大幅降低模型参数量和计算成本;最后,探索少样本语义分割方法,采用带注意力机制的双向原型网络(Bidirectional Prototypical Network),实现对少量标注样本的有效学习,从而在真实场景下实现高鲁棒性的缺陷检测。

链接: https://arxiv.org/abs/2601.15366
作者: Christina Thrainer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Masters thesis, University of Technology Graz, 2025

点击查看摘要

Abstract:Culverts and sewer pipes are critical components of drainage systems, and their failure can lead to serious risks to public safety and the environment. In this thesis, we explore methods to improve automated defect segmentation in culverts and sewer pipes. Collecting and annotating data in this field is cumbersome and requires domain knowledge. Having a large dataset for structural defect detection is therefore not feasible. Our proposed methods are tested under conditions with limited annotated data to demonstrate applicability to real-world scenarios. Overall, this thesis proposes three methods to significantly enhance defect segmentation and handle data scarcity. This can be addressed either by enhancing the training data or by adjusting a model's architecture. First, we evaluate preprocessing strategies, including traditional data augmentation and dynamic label injection. These techniques significantly improve segmentation performance, increasing both Intersection over Union (IoU) and F1 score. Second, we introduce FORTRESS, a novel architecture that combines depthwise separable convolutions, adaptive Kolmogorov-Arnold Networks (KAN), and multi-scale attention mechanisms. FORTRESS achieves state-of-the-art performance on the culvert sewer pipe defect dataset, while significantly reducing the number of trainable parameters, as well as its computational cost. Finally, we investigate few-shot semantic segmentation and its applicability to defect detection. Few-shot learning aims to train models with only limited data available. By employing a bidirectional prototypical network with attention mechanisms, the model achieves richer feature representations and satisfactory results across evaluation metrics.
zh

[CV-75] he Paradigm Shift: A Comprehensive Survey on Large Vision Language Models for Multimodal Fake News Detection

【速读】:该论文旨在解决多模态虚假新闻检测(Multimodal Fake News Detection, MFND)领域中缺乏系统性综述的问题,特别是针对从传统特征工程方法向基于大视觉语言模型(Large Vision-Language Models, LVLMs)的统一端到端多模态推理框架转变过程的梳理与整合。其解决方案的关键在于:首先,构建了一个历史演进视角,清晰映射了MFND从传统多模态处理流程到以基础模型驱动的新范式的转变;其次,提出了一套结构化的分类体系,涵盖模型架构、数据集和性能基准;最后,深入分析了当前仍面临的挑战,如可解释性、时序推理和域泛化问题,并据此指明未来研究方向,从而为该领域的进一步发展提供理论支撑与实践指引。

链接: https://arxiv.org/abs/2601.15316
作者: Wei Ai,Yilong Tan,Yuntao Shou,Tao Meng,Haowen Chen,Zhixiong He,Keqin Li
机构: Central South University of Forestry and Technology (中南林业科技大学); Hunan University (湖南大学); State University of New York (纽约州立大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In recent years, the rapid evolution of large vision-language models (LVLMs) has driven a paradigm shift in multimodal fake news detection (MFND), transforming it from traditional feature-engineering approaches to unified, end-to-end multimodal reasoning frameworks. Early methods primarily relied on shallow fusion techniques to capture correlations between text and images, but they struggled with high-level semantic understanding and complex cross-modal interactions. The emergence of LVLMs has fundamentally changed this landscape by enabling joint modeling of vision and language with powerful representation learning, thereby enhancing the ability to detect misinformation that leverages both textual narratives and visual content. Despite these advances, the field lacks a systematic survey that traces this transition and consolidates recent developments. To address this gap, this paper provides a comprehensive review of MFND through the lens of LVLMs. We first present a historical perspective, mapping the evolution from conventional multimodal detection pipelines to foundation model-driven paradigms. Next, we establish a structured taxonomy covering model architectures, datasets, and performance benchmarks. Furthermore, we analyze the remaining technical challenges, including interpretability, temporal reasoning, and domain generalization. Finally, we outline future research directions to guide the next stage of this paradigm shift. To the best of our knowledge, this is the first comprehensive survey to systematically document and analyze the transformative role of LVLMs in combating multimodal fake news. The summary of existing methods mentioned is in our GitHub: this https URL.
zh

[CV-76] Phi-SegNet: Phase-Integrated Supervision for Medical Image Segmentation

【速读】:该论文旨在解决深度学习在医学图像分割中跨成像模态(如X-ray、US、MRI等)和不同解剖结构时泛化能力不足的问题,其核心挑战在于现有模型(包括CNN、Transformer及其混合架构)主要依赖空间信息编码,而忽略了能提供丰富结构与纹理线索的频域表示。解决方案的关键在于提出Phi-SegNet,一个基于卷积神经网络的新型架构,通过在模型结构和优化两个层面引入相位感知信息:一是设计Bi-Feature Mask Former(BFMF)模块融合邻近特征以缩小语义差距;二是引入Reverse Fourier Attention(RFA)块利用相位约束特征优化解码器输出;同时构建专用的相位感知损失函数,将特征对齐至结构先验,形成边界精度强化的闭环反馈机制。该方法显著提升了细粒度目标定位能力,并在五个公开数据集上实现了平均IoU提升1.54%±1.26%和F1-score提升0.98%±0.71%,且在跨数据集泛化场景中表现出强适应性和模态无关性。

链接: https://arxiv.org/abs/2601.16064
作者: Shams Nafisa Ali,Taufiq Hasan
机构: Bangladesh University of Engineering and Technology (孟加拉国工程技术大学); Johns Hopkins University (约翰霍普金斯大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 7 figures

点击查看摘要

Abstract:Deep learning has substantially advanced medical image segmentation, yet achieving robust generalization across diverse imaging modalities and anatomical structures remains a major challenge. A key contributor to this limitation lies in how existing architectures, ranging from CNNs to Transformers and their hybrids, primarily encode spatial information while overlooking frequency-domain representations that capture rich structural and textural cues. Although few recent studies have begun exploring spectral information at the feature level, supervision-level integration of frequency cues-crucial for fine-grained object localization-remains largely untapped. To this end, we propose Phi-SegNet, a CNN-based architecture that incorporates phase-aware information at both architectural and optimization levels. The network integrates Bi-Feature Mask Former (BFMF) modules that blend neighboring encoder features to reduce semantic gaps, and Reverse Fourier Attention (RFA) blocks that refine decoder outputs using phase-regularized features. A dedicated phase-aware loss aligns these features with structural priors, forming a closed feedback loop that emphasizes boundary precision. Evaluated on five public datasets spanning X-ray, US, histopathology, MRI, and colonoscopy, Phi-SegNet consistently achieved state-of-the-art performance, with an average relative improvement of 1.54+/-1.26% in IoU and 0.98+/-0.71% in F1-score over the next best-performing model. In cross-dataset generalization scenarios involving unseen datasets from the known domain, Phi-SegNet also exhibits robust and superior performance, highlighting its adaptability and modality-agnostic design. These findings demonstrate the potential of leveraging spectral priors in both feature representation and supervision, paving the way for generalized segmentation frameworks that excel in fine-grained object localization.
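摘要所述"相位感知损失"的一种直观形式,是在频域比较预测掩码与真值的相位谱;由于相位具有 2π 周期性,用 1-cos(Δφ) 比直接 L1 更稳妥。以下为示意(具体加权与 RFA 模块的结合方式见原文):

```python
import torch

def phase_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """比较预测与真值的傅里叶相位谱,1-cos 处理 2π 周期性(示意)。
    pred/target: (B, 1, H, W)。"""
    phi_p = torch.angle(torch.fft.fft2(pred))
    phi_t = torch.angle(torch.fft.fft2(target))
    return (1.0 - torch.cos(phi_p - phi_t)).mean()
```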
zh

[CV-77] FUGC: Benchmarking Semi-Supervised Learning Methods for Cervical Segmentation

【速读】:该论文旨在解决经阴道超声(Transvaginal Ultrasound, TVS)图像中宫颈结构分割精度不足的问题,尤其是在标注数据稀缺情况下,监督学习方法性能受限的挑战。其解决方案的关键在于提出首个针对宫颈分割任务的半监督学习基准——胎儿超声大挑战(Fetal Ultrasound Grand Challenge, FUGC),通过提供包含890张TVS图像的公开数据集(含500张训练图、90张验证图和300张测试图),并采用Dice相似系数(Dice Similarity Coefficient, DSC)、豪斯多夫距离(Hausdorff Distance, HD)和运行时间(Runtime, RT)作为综合评价指标,验证了在有限标注数据下半监督学习方法的有效性,为人工智能辅助早产风险评估提供了标准化的技术路径与实证基础。

链接: https://arxiv.org/abs/2601.15572
作者: Jieyun Bai,Yitong Tang,Zihao Zhou,Mahdi Islam,Musarrat Tabassum,Enrique Almar-Munoz,Hongyu Liu,Hui Meng,Nianjiang Lv,Bo Deng,Yu Chen,Zilun Peng,Yusong Xiao,Li Xiao,Nam-Khanh Tran,Dac-Phu Phan-Le,Hai-Dang Nguyen,Xiao Liu,Jiale Hu,Mingxu Huang,Jitao Liang,Chaolu Feng,Xuezhi Zhang,Lyuyang Tong,Bo Du,Ha-Hieu Pham,Thanh-Huy Nguyen,Min Xu,Juntao Jiang,Jiangning Zhang,Yong Liu,Md. Kamrul Hasan,Jie Gan,Zhuonan Liang,Weidong Cai,Yuxin Huang,Gongning Luo,Mohammad Yaqub,Karim Lekadir
机构: The First Affiliated Hospital, Jinan University, China (暨南大学附属第一医院); Imperial College London, UK (帝国理工学院); University of Sydney, Australia (悉尼大学); Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates (穆罕默德·本·扎耶德人工智能大学); Zhejiang University, China (浙江大学); Carnegie Mellon University, United States (卡内基梅隆大学); Medical University of Innsbruck, Austria (因斯布鲁克医科大学); University of Chinese Academy of Sciences, China (中国科学院大学); University of Science and Technology of China, China (中国科学技术大学); University of Science, Viet Nam National University Ho Chi Minh City, Vietnam (胡志明市国家大学科学大学); Nanyang Institute of Technology, China (南阳理工学院); Northeastern University, China (东北大学); Wuhan University, China (武汉大学); Southern Medical University, China (南方医科大学); Harbin Institute of Technology, China (哈尔滨工业大学); Universitat de Barcelona, Spain (巴塞罗那大学)
类目: Image and Video Processing (eess.IV); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate segmentation of cervical structures in transvaginal ultrasound (TVS) is critical for assessing the risk of spontaneous preterm birth (PTB), yet the scarcity of labeled data limits the performance of supervised learning approaches. This paper introduces the Fetal Ultrasound Grand Challenge (FUGC), the first benchmark for semi-supervised learning in cervical segmentation, hosted at ISBI 2025. FUGC provides a dataset of 890 TVS images, including 500 training images, 90 validation images, and 300 test images. Methods were evaluated using the Dice Similarity Coefficient (DSC), Hausdorff Distance (HD), and runtime (RT), with a weighted combination of 0.4/0.4/0.2. The challenge attracted 10 teams with 82 participants submitting innovative solutions. The best-performing methods for each individual metric achieved 90.26% mDSC, 38.88 mHD, and 32.85 ms RT, respectively. FUGC establishes a standardized benchmark for cervical segmentation, demonstrates the efficacy of semi-supervised methods with limited labeled data, and provides a foundation for AI-assisted clinical PTB risk assessment.
zh

[CV-78] A Machine Vision Approach to Preliminary Skin Lesion Assessments

【速读】:该论文旨在解决恶性皮肤病变早期检测的难题,以提升侵袭性转移性皮肤癌患者的预后。其核心解决方案是构建一个结合临床ABC(D)规则与机器学习的综合评估系统,其中关键创新在于采用从头训练的轻量级卷积神经网络(CNN)而非依赖预训练模型进行直接像素级学习。实验表明,这种定制化架构在小规模医学图像数据集上显著优于传统分类器和迁移学习方法,尤其在准确率(78.5%)和召回率(86.5%)方面表现突出,证明了针对特定领域设计的模型能够更有效地捕捉诊断特征,克服了因自然图像与医学图像之间域偏移导致的性能瓶颈。

链接: https://arxiv.org/abs/2601.15539
作者: Ali Khreis,Ro’Yah Radaideh,Quinn McGill
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 6 pages, 2 figures, 2 tables

点击查看摘要

Abstract:Early detection of malignant skin lesions is critical for improving patient outcomes in aggressive, metastatic skin cancers. This study evaluates a comprehensive system for preliminary skin lesion assessment that combines the clinically established ABCD rule of dermoscopy (analyzing Asymmetry, Borders, Color, and Dermoscopic Structures) with machine learning classification. Using a 1,000-image subset of the HAM10000 dataset, the system implements an automated, rule-based pipeline to compute a Total Dermoscopy Score (TDS) for each lesion. This handcrafted approach is compared against various machine learning solutions, including traditional classifiers (Logistic Regression, Random Forest, and SVM) and deep learning models. While the rule-based system provides high clinical interpretability, results indicate a performance bottleneck when reducing complex morphology to five numerical features. Experimental findings show that transfer learning with EfficientNet-B0 failed significantly due to domain shift between natural and medical images. In contrast, a custom three-layer Convolutional Neural Network (CNN) trained from scratch achieved 78.5% accuracy and 86.5% recall on median-filtered images, representing a 19-point accuracy improvement over traditional methods. The results demonstrate that direct pixel-level learning captures diagnostic patterns beyond handcrafted features and that purpose-built lightweight architectures can outperform large pretrained models for small, domain-specific medical datasets.
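摘要中基于规则的 TDS 即经典 ABCD 皮肤镜评分的加权和;皮肤镜文献中的常用权重为 TDS = 1.3A + 0.1B + 0.5C + 0.5D,并以约 5.45 为恶性可疑阈值(此处权重与阈值取经典文献取值,具体以该论文的实现为准):

```python
def total_dermoscopy_score(asymmetry: int, border: int, colors: int, structures: int):
    """经典 ABCD 规则:A∈[0,2], B∈[0,8], C∈[1,6], D∈[1,5](示意)。"""
    tds = 1.3 * asymmetry + 0.1 * border + 0.5 * colors + 0.5 * structures
    if tds > 5.45:
        return tds, "高度可疑恶性"
    if tds >= 4.75:
        return tds, "可疑"
    return tds, "良性倾向"

print(total_dermoscopy_score(2, 6, 5, 4))  # -> (7.7, '高度可疑恶性')
```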
zh

[CV-79] High-Fidelity 3D Tooth Reconstruction by Fusing Intraoral Scans and CBCT Data via a Deep Implicit Representation

【速读】:该论文旨在解决数字牙科中高保真三维牙齿模型构建的问题,即如何同时精确捕捉牙齿的冠部细节与完整的根部形态。临床影像技术存在固有局限:锥形束计算机断层扫描(Cone-Beam Computed Tomography, CBCT)可获取根部结构但冠部分辨率低且噪声大,而口内扫描仪(Intraoral Scanner, IOS)虽能提供高保真冠部信息却无法获取根部数据,直接拼接两者会导致不自然的接缝和伪影。解决方案的关键在于提出一种全自动融合管道,首先通过分割与鲁棒配准对齐牙体实例,生成结合IOS冠部与CBCT根部的混合代理网格;进而利用该噪声代理网格引导特定类别的DeepSDF网络进行优化,将输入投影至学习到的理想牙齿形状流形上,从而生成无缝、封闭且解剖学一致的高质量三维模型,有效融合两种模态的优势并克服各自缺陷。

链接: https://arxiv.org/abs/2601.15358
作者: Yi Zhu,Razmig Kechichian,Raphaël Richert,Satoshi Ikehata,Sébastien Valette
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IEEE International Symposium on Biomedical Imaging (ISBI) 2026

点击查看摘要

Abstract:High-fidelity 3D tooth models are essential for digital dentistry, but must capture both the detailed crown and the complete root. Clinical imaging modalities are limited: Cone-Beam Computed Tomography (CBCT) captures the root but has a noisy, low-resolution crown, while Intraoral Scanners (IOS) provide a high-fidelity crown but no root information. A naive fusion of these sources results in unnatural seams and artifacts. We propose a novel, fully-automated pipeline that fuses CBCT and IOS data using a deep implicit representation. Our method first segments and robustly registers the tooth instances, then creates a hybrid proxy mesh combining the IOS crown and the CBCT root. The core of our approach is to use this noisy proxy to guide a class-specific DeepSDF network. This optimization process projects the input onto a learned manifold of ideal tooth shapes, generating a seamless, watertight, and anatomically coherent model. Qualitative and quantitative evaluations show our method uniquely preserves both the high-fidelity crown from IOS and the patient-specific root morphology from CBCT, overcoming the limitations of each modality and naive stitching.
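"把带噪代理投影到学习好的形状流形上"在 DeepSDF 框架里通常实现为:冻结解码器,只优化潜在码 z,使代理表面采样点处的 SDF 预测趋近 0。以下为该优化过程的示意(解码器接口与正则权重为假设):

```python
import torch

def project_to_shape_manifold(decoder, surface_pts, latent_dim=256, steps=500, lr=5e-3):
    """冻结 decoder,仅优化潜在码 z:decoder(z, pts) 在代理表面点处应为 0(示意)。
    decoder(z, pts) -> (N,) 每点 SDF 预测;surface_pts: (N, 3)。"""
    z = torch.zeros(latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # 正则项把 z 约束在训练分布附近,即"投影到学习到的理想牙齿流形"
        loss = decoder(z, surface_pts).abs().mean() + 1e-4 * z.pow(2).sum()
        loss.backward()
        opt.step()
    return z.detach()   # 对 z 解码即得到无缝、封闭的牙齿形状
```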
zh

人工智能

[AI-0] Scalable Board Expansion within a General Game System

【速读】:该论文旨在解决传统棋类游戏实现中因使用预定义的静态大尺寸棋盘而导致的资源冗余与复杂性问题,尤其是在棋盘空间未被充分利用的情况下。其解决方案的关键在于提出一种动态棋盘扩展机制(dynamic board expansion mechanism),使游戏棋盘能够在对局过程中自动生长,从而仅在实际需要时扩展可用区域,有效减少不必要的空间占用和计算开销。

链接: https://arxiv.org/abs/2601.16216
作者: Clémentine Sacré
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Software Engineering (cs.SE)
备注: 65 pages, 41 figures

点击查看摘要

Abstract:This thesis explores the use of a General Game System (GGS) to support the automatic expansion of game boards in boardless games. Traditional implementations of such games often rely on oversized static boards defined from the start, even though large portions of these boards may never be used during gameplay. This approach leads to unnecessary complexity. To address this issue, this thesis proposes a dynamic board expansion mechanism in which the game board grows automatically during play.
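动态棋盘扩展的一个直接实现思路是:用稀疏映射存储已落子位置,每次落子后按需把相邻空位纳入可落子区域,棋盘因而只在被实际使用时生长。以下为与原系统无关的最小示意:

```python
class GrowingBoard:
    """稀疏棋盘:落子时自动把邻居纳入可落子区域(示意,非该论文的 GGS 实现)。"""
    def __init__(self):
        self.cells = {}              # (x, y) -> 棋子
        self.frontier = {(0, 0)}     # 当前可落子的位置集合

    def play(self, pos, piece):
        assert pos in self.frontier, "只能落在已扩展的区域"
        self.cells[pos] = piece
        self.frontier.discard(pos)
        x, y = pos
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nb = (x + dx, y + dy)
            if nb not in self.cells:
                self.frontier.add(nb)   # 棋盘随对局自动生长

board = GrowingBoard()
board.play((0, 0), "X")
board.play((1, 0), "O")
```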
zh

[AI-1] Counterfactual Training: Teaching Models Plausible and Actionable Explanations

【速读】:该论文旨在解决当前机器学习模型缺乏可解释性的问题,尤其是如何使模型在训练过程中就具备生成合理且可操作的反事实解释(counterfactual explanations)的能力。现有方法多依赖于后验生成反事实解释,难以保证其与数据分布的一致性和对特征可变性的实际可行性。论文提出的关键解决方案是“反事实训练”(counterfactual training),即在模型训练阶段直接引入反事实样本,通过最小化模型学习到的表示与合理、可操作的反事实解释之间的差异,从而促使模型在训练过程中自动优化其解释能力,并同时提升对抗鲁棒性。

链接: https://arxiv.org/abs/2601.16205
作者: Patrick Altmeyer,Aleksander Buszydlik,Arie van Deursen,Cynthia C. S. Liem
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This work has been accepted for publication at the 2026 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). The final version will be available on IEEE Xplore

点击查看摘要

Abstract:We propose a novel training regime termed counterfactual training that leverages counterfactual explanations to increase the explanatory capacity of models. Counterfactual explanations have emerged as a popular post-hoc explanation method for opaque machine learning models: they inform how factual inputs would need to change in order for a model to produce some desired output. To be useful in real-world decision-making systems, counterfactuals should be plausible with respect to the underlying data and actionable with respect to the feature mutability constraints. Much existing research has therefore focused on developing post-hoc methods to generate counterfactuals that meet these desiderata. In this work, we instead hold models directly accountable for the desired end goal: counterfactual training employs counterfactuals during the training phase to minimize the divergence between learned representations and plausible, actionable explanations. We demonstrate empirically and theoretically that our proposed method facilitates training models that deliver inherently desirable counterfactual explanations and additionally exhibit improved adversarial robustness.
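反事实训练的要点是在任务损失之外,让"由当前模型导出的反事实"满足合理性/可操作性约束,并将该约束的梯度回传给模型本身。以下为高度简化的 PyTorch 示意:反事实由若干步朝目标类的可微梯度下降生成,合理性以到数据中心的距离近似(目标函数形式均为说明性假设,并非原文的确切公式):

```python
import torch
import torch.nn.functional as F

def counterfactual_training_step(model, x, y, y_target, data_centroid,
                                 lam=0.1, steps=5, alpha=0.05):
    """task loss + 反事实合理性正则(示意)。create_graph=True 使正则项能回传到模型参数。"""
    task_loss = F.cross_entropy(model(x), y)
    x_cf = x.detach().requires_grad_(True)
    for _ in range(steps):
        cf_loss = F.cross_entropy(model(x_cf), y_target)
        grad = torch.autograd.grad(cf_loss, x_cf, create_graph=True)[0]
        x_cf = x_cf - alpha * grad          # 保留计算图的可微反事实生成
    plausibility = (x_cf - data_centroid).pow(2).mean()   # 合理性的粗略代理
    return task_loss + lam * plausibility
```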
zh

[AI-2] Learning to Discover at Test Time

【速读】:该论文旨在解决如何利用人工智能(AI)在科学问题中发现新的最先进(state-of-the-art)解决方案的问题。其核心挑战在于,传统方法如测试时缩放(test-time scaling)通常依赖于冻结的大型语言模型(Large Language Model, LLM)进行提示搜索,难以根据具体任务动态优化。本文提出的关键解决方案是测试时训练发现(Test-Time Training to Discover, TTT-Discover),即在测试阶段对LLM进行强化学习(reinforcement learning),使其基于特定问题的经验持续训练,从而聚焦于生成单一最优解而非平均性能最优。这种方法通过设计专门的学习目标和搜索子程序,优先探索最有潜力的解空间,实现了在数学、GPU内核工程、算法设计和生物学等多个领域的新SOTA结果,且全部基于开源模型(OpenAI gpt-oss-120b)与公开代码可复现,显著降低了资源门槛。

链接: https://arxiv.org/abs/2601.16175
作者: Mert Yuksekgonul,Daniel Koceja,Xinhao Li,Federico Bianchi,Jed McCaleb,Xiaolong Wang,Jan Kautz,Yejin Choi,James Zou,Carlos Guestrin,Yu Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Code: this https URL

点击查看摘要

Abstract:How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erdős’ minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to 2\times faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, with a cost of only a few hundred dollars per problem.
zh

[AI-3] Structured Hints for Sample-Efficient Lean Theorem Proving

【速读】:该论文试图解决的问题是:经过强化学习(Reinforcement Learning, RL)训练的高性能神经定理证明器(Neural Theorem Prover)是否仍能从推理阶段的结构化引导中获益。研究发现,即使模型已高度优化,其对战术语言(tactic language)中蕴含的结构先验信息(structural priors)利用仍不充分。解决方案的关键在于引入一种轻量级干预策略——在推理时采用固定的提示调度(fixed prompt schedule),基于15种常见战术模板(tactic skeletons)进行引导,从而显著提升定理证明的成功率。实验表明,在miniF2F基准上,该方法相较标准采样在相同样本数(k=16)和最大生成长度(1024 tokens)下实现了21.7%的pass@16,相对提升达43%,验证了简单推理阶段引导仍是低成本且有效的补充手段。

链接: https://arxiv.org/abs/2601.16172
作者: Zachary Burton
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 1 figure

点击查看摘要

Abstract:State-of-the-art neural theorem provers like DeepSeek-Prover-V1.5 combine large language models with reinforcement learning, achieving impressive results through sophisticated training. We ask: do these highly-trained models still benefit from simple structural guidance at inference time? We evaluate a lightweight intervention – a fixed prompt schedule over 15 common tactic skeletons – on the miniF2F benchmark. This simple approach yields 21.7% pass@16 compared to 15.2% for standard sampling from the same model, a 43% relative improvement using the same number of samples (k=16) and same maximum generation length (1024 tokens). Our results suggest that even capable RL-trained provers underutilize structural priors available in the tactic language, and that simple inference-time guidance remains a cheap, complementary boost.
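固定提示调度的做法可以概括为:把 k=16 次采样预算均匀分给若干常见战术骨架,每个骨架作为提示的一部分引导一次生成。以下为示意(骨架内容与 generate/verify 接口均为假设):

```python
TACTIC_SKELETONS = [                 # 示例骨架;原文使用 15 种常见战术模板
    "intro h; simp_all; nlinarith",
    "norm_num",
    "induction n with n ih <;> simp_all",
]

def prove_with_schedule(theorem_stmt, generate, verify, budget=16):
    """轮流以各骨架为提示生成候选证明,任一候选通过 Lean 验证即成功(示意)。
    generate(prompt) -> 候选证明字符串;verify(stmt, proof) -> bool。"""
    for i in range(budget):
        skeleton = TACTIC_SKELETONS[i % len(TACTIC_SKELETONS)]
        prompt = f"{theorem_stmt}\n-- 参考战术骨架: {skeleton}\n"
        proof = generate(prompt)
        if verify(theorem_stmt, proof):
            return proof
    return None
```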
zh

[AI-4] Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

【速读】:该论文旨在解决机器人策略学习中如何高效利用预训练视频模型的时空先验来生成复杂动作分布的问题,同时避免传统方法因多阶段微调和架构修改带来的复杂性。解决方案的关键在于提出 Cosmos Policy,这是一种仅需单阶段后训练(post-training)即可将大型预训练视频模型(Cosmos-Predict2)适配为有效机器人策略的方法,无需任何架构改动;其核心机制是直接在视频模型的潜在扩散过程中生成编码为潜在帧的机器人动作,并同步生成未来状态图像和价值(预期累积奖励),从而实现测试时基于模型的规划以提升任务成功率。

链接: https://arxiv.org/abs/2601.16163
作者: Moo Jin Kim,Yihuai Gao,Tsung-Yi Lin,Yen-Chen Lin,Yunhao Ge,Grace Lam,Percy Liang,Shuran Song,Ming-Yu Liu,Chelsea Finn,Jinwei Gu
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Recent video generation models demonstrate remarkable ability to capture complex physical interactions and scene evolution over time. To leverage their spatiotemporal priors, robotics works have adapted video models for policy learning but introduce complexity by requiring multiple stages of post-training and new architectural components for action generation. In this work, we introduce Cosmos Policy, a simple approach for adapting a large pretrained video model (Cosmos-Predict2) into an effective robot policy through a single stage of post-training on the robot demonstration data collected on the target platform, with no architectural modifications. Cosmos Policy learns to directly generate robot actions encoded as latent frames within the video model’s latent diffusion process, harnessing the model’s pretrained priors and core learning algorithm to capture complex action distributions. Additionally, Cosmos Policy generates future state images and values (expected cumulative rewards), which are similarly encoded as latent frames, enabling test-time planning of action trajectories with higher likelihood of success. In our evaluations, Cosmos Policy achieves state-of-the-art performance on the LIBERO and RoboCasa simulation benchmarks (98.5% and 67.1% average success rates, respectively) and the highest average score in challenging real-world bimanual manipulation tasks, outperforming strong diffusion policies trained from scratch, video model-based policies, and state-of-the-art vision-language-action models fine-tuned on the same robot demonstrations. Furthermore, given policy rollout data, Cosmos Policy can learn from experience to refine its world model and value function and leverage model-based planning to achieve even higher success rates in challenging tasks. We release code, models, and training data at this https URL
zh

[AI-5] Substrate Stability Under Persistent Disagreement: Structural Constraints for Neutral Ontological Substrates

【速读】:该论文旨在解决在存在持续法律、政治和分析分歧的复杂数据系统中,如何设计语义上中立且能保持稳定引用关系的本体(ontology)问题。其核心挑战在于:当缺乏共享解释、协商一致的语义或中心化权威时,如何确保不同扩展之间仍能维持可比性和问责性。解决方案的关键在于提出一个“中立性框架”,将“解释上的非承诺”和“扩展下的稳定性”作为显式的设计约束,并证明:任何支持问责性的本体必须至少实现六种不同的身份与持久性规制(identity-and-persistence regimes)。这一结果是一个条件性下界结论,表明若要保证中立性和稳定引用不可妥协,则必须容纳这六个基本规制;同时,该结构不嵌入因果或规范承诺,从而避免隐含的价值偏倚。

链接: https://arxiv.org/abs/2601.16152
作者: Denise M. Case
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: 29 pages

点击查看摘要

Abstract:Modern data systems increasingly operate under conditions of persistent legal, political, and analytic disagreement. In such settings, interoperability cannot rely on shared interpretation, negotiated semantics, or centralized authority. Instead, representations must function as neutral substrates that preserve stable reference across incompatible extensions. This paper investigates the structural constraints imposed on ontological design by this requirement. Building on a neutrality framework that treats interpretive non-commitment and stability under extension as explicit design constraints, we ask what minimal ontological structure is forced if accountability relationships are to remain referable and comparable under disagreement. Minimality here is not mere parsimony: a reduction is admissible only if it does not reintroduce stability-critical distinctions as hidden roles, flags, or contextual predicates. We establish a conditional lower-bound result: any ontology capable of supporting accountability under persistent disagreement must realize at least six distinct identity-and-persistence regimes. We further show that a construction with exactly six such regimes is sufficient to satisfy the stated requirements without embedding causal or normative commitments in the substrate. The result is not a proposal for a universal ontology, but a constraint on what is possible when neutrality and stable reference are treated as non-negotiable design goals.
zh

[AI-6] Pay (Cross) Attention to the Melody: Curriculum Masking for Single-Encoder Melodic Harmonization

【速读】:该论文旨在解决生成式 AI(Generative AI)在旋律和声化(melodic harmonization)任务中,由于现有训练课程导致旋律与和声间注意力交互较弱的问题,尤其在跨域场景下难以有效利用旋律线索。其解决方案的关键在于提出一种新的训练课程 FF(full-to-full),通过在训练初期全程掩码和声 token,随后逐步解码整个序列,从而强化旋律与和声之间的交互关系。实验表明,该策略显著提升了模型在不同量化粒度、条件输入形式及推理阶段未掩码策略下的性能,尤其是在跨域评估中表现出更强的和声适应能力。

链接: https://arxiv.org/abs/2601.16150
作者: Maximos Kaliakatsos-Papakostas,Dimos Makris,Konstantinos Soiledis,Konstantinos-Theodoros Tsamis,Vassilis Katsouros,Emilios Cambouropoulos
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Melodic harmonization, the task of generating harmonic accompaniments for a given melody, remains a central challenge in computational music generation. Recent single encoder transformer approaches have framed harmonization as a masked sequence modeling problem, but existing training curricula inspired by discrete diffusion often result in weak (cross) attention between melody and harmony. This leads to limited exploitation of melodic cues, particularly in out-of-domain contexts. In this work, we introduce a training curriculum, FF (full-to-full), which keeps all harmony tokens masked for several training steps before progressively unmasking entire sequences during training to strengthen melody-harmony interactions. We systematically evaluate this approach against prior curricula across multiple experimental axes, including temporal quantization (quarter vs. sixteenth note), bar-level vs. time-signature conditioning, melody representation (full range vs. pitch class), and inference-time unmasking strategies. Models are trained on the HookTheory dataset and evaluated both in-domain and on a curated collection of jazz standards, using a comprehensive set of metrics that assess chord progression structure, harmony-melody alignment, and rhythmic coherence. Results demonstrate that the proposed FF curriculum consistently outperforms baselines in nearly all metrics, with particularly strong gains in out-of-domain evaluations where harmonic adaptability to novel melodic cues is crucial. We further find that quarter-note quantization, intertwining of bar tokens, and pitch-class melody representations are advantageous in the FF setting. Our findings highlight the importance of training curricula in enabling effective melody conditioning and suggest that full-to-full unmasking offers a robust strategy for single encoder harmonization.
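FF(full-to-full)课程的调度可概括为:前期对全部和声 token 保持掩码,迫使模型完全依赖旋律侧的(交叉)注意力,随后掩码比例对整段序列逐步退火。以下为掩码比例调度的示意(线性退火形状为假设):

```python
def ff_mask_ratio(step: int, warmup_steps: int, total_steps: int) -> float:
    """full-to-full 课程:warmup 阶段全掩码,其后对整段序列线性解掩码(示意)。"""
    if step < warmup_steps:
        return 1.0          # 和声 token 全部掩码
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 1.0 - t)

# 示例:前 1 万步全掩码,其后 4 万步内退火到 0
print([round(ff_mask_ratio(s, 10_000, 50_000), 2) for s in (0, 10_000, 30_000, 50_000)])
```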
zh

[AI-7] Replicating Human Motivated Reasoning Studies with LLM s

【速读】:该论文试图解决的问题是:基础大语言模型(base LLMs)是否能够模拟人类在政治情境下存在的动机性推理(motivated reasoning)行为。已有研究表明,人类在处理信息时可能因预设立场而倾向于得出特定结论,但尚不清楚LLM是否会表现出类似的心理机制。论文通过复现4项关于政治动机性推理的人类实验,发现基础LLM的行为与预期的人类行为不一致,且不同模型间表现出相似特征,如标准差更小和对论点强度的误判。解决方案的关键在于揭示了基础LLM缺乏人类层面的动机驱动认知偏差,这对依赖LLM进行问卷数据收集和论证评估等自动化任务的研究者具有重要警示意义,提示需谨慎对待LLM在涉及主观判断场景中的适用性。

链接: https://arxiv.org/abs/2601.16130
作者: Neeley Pate,Adiba Mahbub Proma,Hangfeng He,James N. Druckman,Daniel Molden,Gourab Ghoshal,Ehsan Hoque
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Motivated reasoning – the idea that individuals processing information may be motivated to reach a certain conclusion, whether it be accurate or predetermined – has been well-explored as a human phenomenon. However, it is unclear whether base LLMs mimic these motivational changes. Replicating 4 prior political motivated reasoning studies, we find that base LLM behavior does not align with expected human behavior. Furthermore, base LLM behavior across models shares some similarities, such as smaller standard deviations and inaccurate argument strength assessments. We emphasize the importance of these findings for researchers using LLMs to automate tasks such as survey data collection and argument assessment.
zh

[AI-8] Multimodal Climate Disinformation Detection: Integrating Vision-Language Models with External Knowledge Sources

【Quick Read】: This paper targets a limitation of current vision-language models (VLMs) in detecting climate disinformation: they rely only on static knowledge fixed at training time and struggle with recent events or updated information. The key to the solution is combining VLMs with external, up-to-date knowledge sources: by retrieving fresh reverse-image results, online fact-checks, and trusted expert content, the system can better judge whether an image and its associated claim are accurate, misleading, false, or unverifiable, improving its effectiveness against real-world climate disinformation.

Link: https://arxiv.org/abs/2601.16108
Authors: Marzieh Adeli Shamsabad, Hamed Ghodrati
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Climate disinformation has become a major challenge in today's digital world, especially with the rise of misleading images and videos shared widely on social media. These false claims are often convincing and difficult to detect, which can delay actions on climate change. While vision-language models (VLMs) have been used to identify visual disinformation, they rely only on the knowledge available at the time of training. This limits their ability to reason about recent events or updates. The main goal of this paper is to overcome that limitation by combining VLMs with external knowledge. By retrieving up-to-date information such as reverse image results, online fact-checks, and trusted expert content, the system can better assess whether an image and its claim are accurate, misleading, false, or unverifiable. This approach improves the model's ability to handle real-world climate disinformation and supports efforts to protect public understanding of science in a rapidly changing information landscape.
zh

[AI-9] Delayed Assignments in Online Non-Centroid Clustering with Stochastic Arrivals AAMAS

【Quick Read】: This paper studies online non-centroid clustering, where points arrive as a stream in a finite metric space and the algorithm, allowed to postpone decisions, must assign each point to an existing cluster or open a new singleton cluster, minimizing the sum of intra-cluster distance costs and delay costs. Whereas prior work typically assumes immediate decisions, this paper introduces delay costs, which better matches practical settings. The key contribution is an algorithm for the stochastic arrival model: assuming point locations are drawn i.i.d. from an unknown but fixed distribution, the algorithm achieves a constant competitive ratio, i.e., as the number of points grows, the ratio between its expected total cost and the optimal offline solution is bounded by a constant, circumventing the impossibility of sublogarithmic competitive ratios in the classic worst-case model.

Link: https://arxiv.org/abs/2601.16091
Authors: Saar Cohen
Affiliation: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To appear in the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2026

Click to view abstract

Abstract:Clustering is a fundamental problem, aiming to partition a set of elements, like agents or data points, into clusters such that elements in the same cluster are closer to each other than to those in other clusters. In this paper, we present a new framework for studying online non-centroid clustering with delays, where elements, that arrive one at a time as points in a finite metric space, should be assigned to clusters, but assignments need not be immediate. Specifically, upon arrival, each point’s location is revealed, and an online algorithm has to irrevocably assign it to an existing cluster or create a new one containing, at this moment, only this point. However, we allow decisions to be postponed at a delay cost, instead of following the more common assumption of immediate decisions upon arrival. This poses a critical challenge: the goal is to minimize both the total distance costs between points in each cluster and the overall delay costs incurred by postponing assignments. In the classic worst-case arrival model, where points arrive in an arbitrary order, no algorithm has a competitive ratio better than sublogarithmic in the number of points. To overcome this strong impossibility, we focus on a stochastic arrival model, where points’ locations are drawn independently across time from an unknown and fixed probability distribution over the finite metric space. We offer hope for beyond worst-case adversaries: we devise an algorithm that is constant competitive in the sense that, as the number of points grows, the ratio between the expected overall costs of the output clustering and an optimal offline clustering is bounded by a constant.
zh

[AI-10] Probably Approximately Correct Maximum A Posteriori Inference

【Quick Read】: This paper tackles the computational intractability of computing the conditional mode of a posterior distribution (maximum a posteriori, MAP, estimation), a fundamental task in probabilistic inference that remains hard under many structural constraints and approximation schemes. The key is a probably approximately correct (PAC) algorithmic framework for MAP inference that provides provably optimal solutions under variable or fixed computational budgets. Tractability conditions for PAC-MAP are characterized via information-theoretic measures estimable from finite samples, and efficient implementations are obtained with probabilistic circuits of appropriate architecture; the randomization strategies can stand alone as MAP inference methods or fortify popular heuristics with rigorous guarantees.

Link: https://arxiv.org/abs/2601.16083
Authors: Matthew Shorvon, Frederik Mallmann-Trenn, David S. Watson
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7 pages main text, 16 total, 3 figures

Click to view abstract

Abstract:Computing the conditional mode of a distribution, better known as the maximum a posteriori (MAP) assignment, is a fundamental task in probabilistic inference. However, MAP estimation is generally intractable, and remains hard even under many common structural constraints and approximation schemes. We introduce probably approximately correct (PAC) algorithms for MAP inference that provide provably optimal solutions under variable and fixed computational budgets. We characterize tractability conditions for PAC-MAP using information theoretic measures that can be estimated from finite samples. Our PAC-MAP solvers are efficiently implemented using probabilistic circuits with appropriate architectures. The randomization strategies we develop can be used either as standalone MAP inference techniques or to improve on popular heuristics, fortifying their solutions with rigorous guarantees. Experiments confirm the benefits of our method in a range of benchmarks.
zh

[AI-11] Designing faster mixed integer linear programming algorithm via learning the optimal path

【Quick Read】: This paper addresses the inefficiency of solving mixed-integer linear programming (MILP) problems, in particular the unstable, poorly generalizing performance of hand-crafted heuristics in traditional branch-and-bound. The key is DeepBound, a deep learning-based node selection algorithm whose core innovations are a multi-level feature fusion network for node representations and a pairwise training paradigm that mitigates the inherent node imbalance in branch-and-bound trees. By automatically learning to prioritize nodes that contain the optimal solution, DeepBound improves solving speed and stability and generalizes well to large, complex instances.

Link: https://arxiv.org/abs/2601.16056
Authors: Ruizhi Liu, Liming Xu, Xulin Huang, Jingyan Sui, Shizhe Ding, Boyang Xia, Chungong Yu, Dongbo Bu
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Designing faster algorithms for solving Mixed-Integer Linear Programming (MILP) problems is highly desired across numerous practical domains, as a vast array of complex real-world challenges can be effectively modeled as MILP formulations. Solving these problems typically employs the branch-and-bound algorithm, the core of which can be conceived as searching for a path of nodes (or sub-problems) that contains the optimal solution to the original MILP problem. Traditional approaches to finding this path rely heavily on hand-crafted, intuition-based heuristic strategies, which often suffer from unstable and unpredictable performance across different MILP problem instances. To address this limitation, we introduce DeepBound, a deep learning-based node selection algorithm that automates the learning of such human intuition from data. The core of DeepBound lies in learning to prioritize nodes containing the optimal solution, thereby improving solving efficiency. DeepBound introduces a multi-level feature fusion network to capture the node representations. To tackle the inherent node imbalance in branch-and-bound trees, DeepBound employs a pairwise training paradigm that enhances the model’s ability to discriminate between nodes. Extensive experiments on three NP-hard MILP benchmarks demonstrate that DeepBound achieves superior solving efficiency over conventional heuristic rules and existing learning-based approaches, obtaining optimal feasible solutions with significantly reduced computation time. Moreover, DeepBound demonstrates strong generalization capability on large and complex instances. The analysis of its learned features reveals that the method can automatically discover more flexible and robust feature selection, which may effectively improve and potentially replace human-designed heuristic rules.
zh
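
The pairwise training paradigm can be read as a ranking objective over branch-and-bound nodes: a node known to contain the optimal solution should score higher than one that does not. Below is a hedged PyTorch sketch of that generic idea; the toy scorer stands in for the paper's multi-level feature fusion network, and all dimensions are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NodeScorer(nn.Module):
    """Toy stand-in for a node-representation network: maps node
    features to a scalar priority score."""
    def __init__(self, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def pairwise_loss(score_pos, score_neg):
    # Logistic pairwise-ranking loss: softplus(-(s+ - s-)) pushes nodes
    # containing the optimum above their siblings, sidestepping the
    # heavy class imbalance of per-node classification.
    return F.softplus(-(score_pos - score_neg)).mean()

scorer = NodeScorer(feat_dim=16)
pos, neg = torch.randn(32, 16), torch.randn(32, 16)  # paired node features
loss = pairwise_loss(scorer(pos), scorer(neg))
loss.backward()
```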

[AI-12] AgriPINN: A Process-Informed Neural Network for Interpretable and Scalable Crop Biomass Prediction Under Water Stress

【Quick Read】: This paper targets insufficient accuracy in predicting crop above-ground biomass (AGB) under water stress: existing data-driven models lack interpretability and degrade under distribution shift, while process-based models such as LINTUL5 require complex calibration and are hard to deploy at scale. The key is AgriPINN, a neural architecture that embeds a biophysical crop-growth differential equation as a differentiable constraint inside a deep learning backbone, preserving the scalability of deep learning while enforcing physiologically consistent biomass dynamics under water stress. The model recovers latent variables such as leaf area index (LAI), absorbed photosynthetically active radiation (PAR), radiation use efficiency (RUE), and water-stress factors without supervision, and improves both prediction accuracy and computational efficiency.

Link: https://arxiv.org/abs/2601.16045
Authors: Yue Shi, Liangxiu Han, Xin Zhang, Tam Sobeih, Thomas Gaiser, Nguyen Huu Thuy, Dominik Behrend, Amit Kumar Srivastava, Krishnagopal Halder, Frank Ewert
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Accurate prediction of crop above-ground biomass (AGB) under water stress is critical for monitoring crop productivity, guiding irrigation, and supporting climate-resilient agriculture. Data-driven models scale well but often lack interpretability and degrade under distribution shift, whereas process-based crop models (e.g. DSSAT, APSIM, LINTUL5) require extensive calibration and are difficult to deploy over large spatial domains. To address these limitations, we propose AgriPINN, a process-informed neural network that integrates a biophysical crop-growth differential equation as a differentiable constraint within a deep learning backbone. This design encourages physiologically consistent biomass dynamics under water-stress conditions while preserving model scalability for spatially distributed AGB prediction. AgriPINN recovers latent physiological variables, including leaf area index (LAI), absorbed photosynthetically active radiation (PAR), radiation use efficiency (RUE), and water-stress factors, without requiring direct supervision. We pretrain AgriPINN on 60 years of historical data across 397 regions in Germany and fine-tune it on three years of field experiments under controlled water treatments. Results show that AgriPINN consistently outperforms state-of-the-art deep-learning baselines (ConvLSTM-ViT, SLTF, CNN-Transformer) and the process-based LINTUL5 model in terms of accuracy (RMSE reductions up to 43%) and computational efficiency. By combining the scalability of deep learning with the biophysical rigor of process-based modeling, AgriPINN provides a robust and interpretable framework for spatio-temporal AGB prediction, offering practical value for planning of irrigation infrastructure, yield forecasting, and climate-adaptation planning.
zh
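
The physics-informed part is, in essence, a two-term loss: a data-fitting term on observed AGB plus the squared residual of a growth ODE evaluated by automatic differentiation. The sketch below assumes a simple rate law, dB/dt = RUE x PAR x stress, and a toy network; the paper's actual equation follows LINTUL-style crop physiology, so treat this as an illustration of the mechanism only.

```python
import torch
import torch.nn as nn

class GrowthNet(nn.Module):
    """Maps time to [biomass B, RUE, PAR, water-stress factor]."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 4))

    def forward(self, t):
        return torch.nn.functional.softplus(self.net(t))  # keep outputs positive

def physics_residual(model, t):
    # Residual of dB/dt = rue * par * stress, with dB/dt from autograd.
    t = t.clone().requires_grad_(True)
    out = model(t)
    B = out[:, 0]
    dB_dt = torch.autograd.grad(B.sum(), t, create_graph=True)[0].squeeze(-1)
    return dB_dt - out[:, 1] * out[:, 2] * out[:, 3]

def pinn_loss(model, t_obs, agb_obs, t_coll, w_phys=1.0):
    data_term = ((model(t_obs)[:, 0] - agb_obs) ** 2).mean()   # fit observations
    phys_term = (physics_residual(model, t_coll) ** 2).mean()  # obey the ODE
    return data_term + w_phys * phys_term

model = GrowthNet()
loss = pinn_loss(model, torch.rand(8, 1), torch.rand(8), torch.rand(64, 1))
loss.backward()
```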

[AI-13] Grounding Large Language Models in Reaction Knowledge Graphs for Synthesis Retrieval

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在化学合成路径规划中因标准提示方法导致的幻觉或过时建议问题。其核心解决方案是将反应路径检索建模为Text2Cypher任务(即自然语言到知识图谱查询的生成问题),并通过设计单步与多步检索任务来评估模型性能;关键创新在于采用带对齐示例的一次提示(one-shot prompting)策略,结合嵌入相似度选择示例,显著提升检索准确性,并引入基于检查清单的自校正循环以增强零样本场景下的可执行性,从而有效提升LLM在知识图谱(Knowledge Graph, KG)驱动的合成规划中的可靠性与实用性。

链接: https://arxiv.org/abs/2601.16038
作者: Olga Bunkova,Lorenzo Di Fruscia,Sophia Rupprecht,Artur M. Schweidtmann,Marcel J.T. Reinders,Jana M. Weber
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at ML4Molecules 2025 (ELLIS UnConference workshop), Copenhagen, Denmark, December 2, 2025. Workshop page: this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) can aid synthesis planning in chemistry, but standard prompting methods often yield hallucinated or outdated suggestions. We study LLM interactions with a reaction knowledge graph by casting reaction path retrieval as a Text2Cypher (natural language to graph query) generation problem, and define single- and multi-step retrieval tasks. We compare zero-shot prompting to one-shot variants using static, random, and embedding-based exemplar selection, and assess a checklist-driven validator/corrector loop. To evaluate our framework, we consider query validity and retrieval accuracy. We find that one-shot prompting with aligned exemplars consistently performs best. Our checklist-style self-correction loop mainly improves executability in zero-shot settings and offers limited additional retrieval gains once a good exemplar is present. We provide a reproducible Text2Cypher evaluation setup to facilitate further work on KG-grounded LLMs for synthesis planning. Code is available at this https URL.
zh
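
Embedding-based exemplar selection for one-shot Text2Cypher prompting reduces to a nearest-neighbor lookup over stored (question, query) pairs. A minimal sketch follows; the exemplar pairs, graph schema, and prompt wording are invented for illustration, and the embeddings are assumed to come from some external encoder.

```python
import numpy as np

# Hypothetical exemplar store: natural-language question -> Cypher query.
EXEMPLARS = [
    ("Which reactions produce aspirin?",
     "MATCH (r:Reaction)-[:PRODUCES]->(m:Molecule {name:'aspirin'}) RETURN r"),
    ("What reagents does reaction R12 use?",
     "MATCH (r:Reaction {id:'R12'})-[:USES]->(m:Molecule) RETURN m.name"),
]

def select_exemplar(query_emb: np.ndarray, exemplar_embs: np.ndarray):
    # Cosine similarity between the user question and stored questions.
    q = query_emb / np.linalg.norm(query_emb)
    E = exemplar_embs / np.linalg.norm(exemplar_embs, axis=1, keepdims=True)
    return EXEMPLARS[int(np.argmax(E @ q))]

def build_prompt(question: str, exemplar) -> str:
    ex_q, ex_cypher = exemplar
    return ("Translate the question into a Cypher query over the reaction graph.\n"
            f"Example question: {ex_q}\nExample query: {ex_cypher}\n"
            f"Question: {question}\nQuery:")

rng = np.random.default_rng(0)
embs = rng.normal(size=(len(EXEMPLARS), 8))  # stand-in embeddings
print(build_prompt("Which reactions yield ibuprofen?",
                   select_exemplar(rng.normal(size=8), embs)))
```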

[AI-14] Sawtooth Wavefront Reordering: Enhanced CuTile FlashAttention on NVIDIA GB10

【Quick Read】: This paper addresses a cache-performance bottleneck in high-performance attention kernels for large language models, specifically L2 cache misses in a CuTile-based FlashAttention implementation on the NVIDIA GB10 (Grace Blackwell) architecture. The key is an analysis that identifies the main cause of L2 misses, followed by a new programming technique, Sawtooth Wavefront Reordering, which reshapes the memory access pattern. The technique reduces L2 cache misses by 50% or more on both CUDA and CuTile and increases throughput by up to 60%.

Link: https://arxiv.org/abs/2601.16032
Authors: Yifan Zhu, Yekai Pan, Chen Ding
Affiliation: Unknown
Subjects: Performance (cs.PF); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Operating Systems (cs.OS)
Comments:

Click to view abstract

Abstract:High-performance attention kernels are essential for Large Language Models. This paper presents an analysis of CuTile-based Flash Attention memory behavior and a technique to improve its cache performance. In particular, our analysis on the NVIDIA GB10 (Grace Blackwell) identifies the main cause of L2 cache misses. Leveraging this insight, we introduce a new programming technique called Sawtooth Wavefront Reordering that reduces L2 misses. We validate it in both CUDA and CuTile, observing a 50% or greater reduction in L2 misses and up to a 60% increase in throughput on GB10.
zh
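
The abstract does not spell out the tile mapping, but wavefront-style reorderings are usually pure remappings of the launch-grid iteration order. The sketch below enumerates attention tiles in narrow KV-column bands with an alternating sweep direction, one plausible reading of a "sawtooth" schedule that keeps a small KV working set hot in L2; it is illustrative only, not the GB10 kernel's actual mapping.

```python
def sawtooth_tile_order(n_q_tiles: int, n_kv_tiles: int, wave: int):
    """Yield (q_tile, kv_tile) pairs band-by-band: within each band of
    `wave` KV columns, sweep Q rows top-to-bottom, alternating the KV
    direction per row so adjacent tiles reuse cached KV blocks."""
    for kv0 in range(0, n_kv_tiles, wave):
        band = list(range(kv0, min(kv0 + wave, n_kv_tiles)))
        for q in range(n_q_tiles):
            cols = band if q % 2 == 0 else band[::-1]
            for kv in cols:
                yield (q, kv)

# Row-major order touches every KV tile before reusing any; the sawtooth
# order revisits a small working set that can fit in L2.
print(list(sawtooth_tile_order(3, 4, wave=2)))
```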

[AI-15] Deja Vu in Plots: Leveraging Cross-Session Evidence with Retrieval-Augmented LLMs for Live Streaming Risk Assessment

【Quick Read】: This paper addresses hard-to-detect, complex risks on live-streaming platforms, such as scams and coordinated malicious behavior, which hide by accumulating gradually and recurring across seemingly unrelated sessions. The key is CS-VAR (Cross-Session Evidence-Aware Retrieval-Augmented Detector): a lightweight, domain-specific small model performs fast session-level risk inference and is guided during training by a large language model (LLM) that reasons over retrieved cross-session behavioral evidence and transfers its local-to-global insights to the small model. This enables the small model to recognize recurring patterns across streams and to perform structured risk assessment while remaining efficient enough for real-time deployment.

Link: https://arxiv.org/abs/2601.16027
Authors: Yiran Qiao, Xiang Ao, Jing Chen, Yang Liu, Qiwei Zhong, Qing He
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The rise of live streaming has transformed online interaction, enabling massive real-time engagement but also exposing platforms to complex risks such as scams and coordinated malicious behaviors. Detecting these risks is challenging because harmful actions often accumulate gradually and recur across seemingly unrelated streams. To address this, we propose CS-VAR (Cross-Session Evidence-Aware Retrieval-Augmented Detector) for live streaming risk assessment. In CS-VAR, a lightweight, domain-specific model performs fast session-level risk inference, guided during training by a Large Language Model (LLM) that reasons over retrieved cross-session behavioral evidence and transfers its local-to-global insights to the small model. This design enables the small model to recognize recurring patterns across streams, perform structured risk assessment, and maintain efficiency for real-time deployment. Extensive offline experiments on large-scale industrial datasets, combined with online validation, demonstrate the state-of-the-art performance of CS-VAR. Furthermore, CS-VAR provides interpretable, localized signals that effectively empower real-world moderation for live streaming.
zh

[AI-16] PUMA: Perception-driven Unified Foothold Prior for Mobility Augmented Quadruped Parkour

【Quick Read】: This paper addresses quadruped robots' lack of athlete-like environmental perception and decision-making in parkour over complex terrain, in particular the reliance of existing methods on pre-computed footholds, which limits real-time adaptability and the exploratory potential of reinforcement learning. The key is PUMA, an end-to-end learning framework that fuses visual perception with foothold priors in a single-stage training process: terrain features are used to estimate egocentric polar foothold priors (relative distance and heading) that guide active posture adaptation, yielding agile and robust traversal of challenging obstacles.

Link: https://arxiv.org/abs/2601.15995
Authors: Liang Wang, Kanzhong Yao, Yang Liu, Weikai Qin, Jun Wu, Zhe Sun, Qiuguo Zhu
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Parkour tasks for quadrupeds have emerged as a promising benchmark for agile locomotion. While human athletes can effectively perceive environmental characteristics to select appropriate footholds for obstacle traversal, endowing legged robots with similar perceptual reasoning remains a significant challenge. Existing methods often rely on hierarchical controllers that follow pre-computed footholds, thereby constraining the robot’s real-time adaptability and the exploratory potential of reinforcement learning. To overcome these challenges, we present PUMA, an end-to-end learning framework that integrates visual perception and foothold priors into a single-stage training process. This approach leverages terrain features to estimate egocentric polar foothold priors, composed of relative distance and heading, guiding the robot in active posture adaptation for parkour tasks. Extensive experiments conducted in simulation and real-world environments across various discrete complex terrains, demonstrate PUMA’s exceptional agility and robustness in challenging scenarios.
zh

[AI-17] Decoupling Return-to-Go for Efficient Decision Transformer

【Quick Read】: This paper identifies a critical redundancy in the Decision Transformer (DT) for offline reinforcement learning: DT feeds the entire Return-to-Go (RTG) sequence into the Transformer during training and inference, yet in theory only the most recent RTG directly affects action prediction, so the remaining RTG inputs are redundant and can hurt performance. The key is the Decoupled Decision Transformer (DDT), which simplifies the architecture so that the Transformer processes only observation and action sequences while the latest RTG conditions the action-prediction head as a standalone guidance signal, removing the redundancy and improving both efficiency and performance. Experiments show DDT significantly outperforms vanilla DT and matches or exceeds state-of-the-art DT variants across multiple offline RL tasks.

Link: https://arxiv.org/abs/2601.15953
Authors: Yongyi Wang, Hanyu Liu, Lingfeng Li, Bozhou Chen, Ang Li, Qirui Zheng, Xionghui Yang, Wenxin Li
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The Decision Transformer (DT) has established a powerful sequence modeling approach to offline reinforcement learning. It conditions its action predictions on Return-to-Go (RTG), using it both to distinguish trajectory quality during training and to guide action generation at inference. In this work, we identify a critical redundancy in this design: feeding the entire sequence of RTGs into the Transformer is theoretically unnecessary, as only the most recent RTG affects action prediction. We show that this redundancy can impair DT’s performance through experiments. To resolve this, we propose the Decoupled DT (DDT). DDT simplifies the architecture by processing only observation and action sequences through the Transformer, using the latest RTG to guide the action prediction. This streamlined approach not only improves performance but also reduces computational cost. Our experiments show that DDT significantly outperforms DT and establishes competitive performance against state-of-the-art DT variants across multiple offline RL tasks.
zh
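
The decoupling is easy to state in code: the sequence model embeds only observation/action tokens, and the single latest RTG conditions the action head directly. A minimal PyTorch sketch, with illustrative sizes and layers:

```python
import torch
import torch.nn as nn

class DecoupledActionHead(nn.Module):
    """Sketch of the decoupling idea: the Transformer sees only
    observation/action tokens; the latest return-to-go is injected at
    the action head. Architecture details here are placeholders."""
    def __init__(self, d_model: int, act_dim: int):
        super().__init__()
        self.rtg_proj = nn.Linear(1, d_model)
        self.head = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.ReLU(),
                                  nn.Linear(d_model, act_dim))

    def forward(self, seq_feature, latest_rtg):
        # seq_feature: (B, d_model) final hidden state over (obs, act) tokens
        # latest_rtg:  (B, 1) the only return signal the model ever sees
        return self.head(torch.cat([seq_feature, self.rtg_proj(latest_rtg)], -1))

head = DecoupledActionHead(d_model=128, act_dim=6)
action = head(torch.randn(4, 128), torch.randn(4, 1))
```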

[AI-18] Natural Language-Driven Global Mapping of Martian Landforms

【Quick Read】: This paper addresses the mismatch in planetary surface analysis between the high-level semantic concepts used in natural language and the pixel-level organization of massive orbital image archives, which prevents scalable, open-ended exploration of planetary surfaces. The key is MarScope, a planetary-scale vision-language framework that aligns planetary images and text in a shared semantic space, trained on more than 200,000 curated image-text pairs, enabling natural language-driven, label-free landform mapping. Users can retrieve target landforms across the entire planet with arbitrary natural-language queries in about 5 seconds, with F1 scores up to 0.978, replacing pre-defined classification with flexible semantic retrieval that also supports process-oriented analysis and similarity-based global geomorphological mapping.

Link: https://arxiv.org/abs/2601.15949
Authors: Yiran Wang, Shuoyuan Wang, Zhaoran Wei, Jiannan Zhao, Zhonghua Yao, Zejian Xie, Songxin Zhang, Jun Huang, Bingyi Jing, Hongxin Wei
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Comments:

Click to view abstract

Abstract:Planetary surfaces are typically analyzed using high-level semantic concepts in natural language, yet vast orbital image archives remain organized at the pixel level. This mismatch limits scalable, open-ended exploration of planetary surfaces. Here we present MarScope, a planetary-scale vision-language framework enabling natural language-driven, label-free mapping of Martian landforms. MarScope aligns planetary images and text in a shared semantic space, trained on over 200,000 curated image-text pairs. This framework transforms global geomorphic mapping on Mars by replacing pre-defined classifications with flexible semantic retrieval, enabling arbitrary user queries across the entire planet in 5 seconds with F1 scores up to 0.978. Applications further show that it extends beyond morphological classification to facilitate process-oriented analysis and similarity-based geomorphological mapping at a planetary scale. MarScope establishes a new paradigm where natural language serves as a direct interface for scientific discovery over massive geospatial datasets.
zh

[AI-19] ICON: Invariant Counterfactual Optimization with Neuro-Symbolic Priors for Text-Based Person Search

【Quick Read】: This paper addresses the spurious correlations and spatial-semantic misalignment that arise from "passive observation" in text-based person search (TBPS) in complex open-world scenes, which leave models fragile under distribution shifts. The key is the ICON framework, which integrates causal reasoning with topological priors: Rule-Guided Spatial Intervention forcibly severs location shortcuts for geometric invariance; Counterfactual Context Disentanglement uses semantics-driven background transplantation to force the model to ignore environmental interference; Saliency-Driven Semantic Regularization mitigates local saliency bias while preserving holistic completeness; and Neuro-Symbolic Topological Alignment constrains feature matching so that activated regions are consistent with human structural logic. Together, these shift the model from fitting statistical co-occurrence to learning causal invariance, markedly improving robustness to occlusion, background interference, and localization noise.

Link: https://arxiv.org/abs/2601.15931
Authors: Xiangyu Wang, Zhixin Lv, Yongjiao Sun, Anrui Han, Ye Yuan, Hangxu Ji
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Text-Based Person Search (TBPS) holds unique value in real-world surveillance bridging visual perception and language understanding, yet current paradigms utilizing pre-training models often fail to transfer effectively to complex open-world scenarios. The reliance on “Passive Observation” leads to multifaceted spurious correlations and spatial semantic misalignment, causing a lack of robustness against distribution shifts. To fundamentally resolve these defects, this paper proposes ICON (Invariant Counterfactual Optimization with Neuro-symbolic priors), a framework integrating causal and topological priors. First, we introduce Rule-Guided Spatial Intervention to strictly penalize sensitivity to bounding box noise, forcibly severing location shortcuts to achieve geometric invariance. Second, Counterfactual Context Disentanglement is implemented via semantic-driven background transplantation, compelling the model to ignore background interference for environmental independence. Then, we employ Saliency-Driven Semantic Regularization with adaptive masking to resolve local saliency bias and guarantee holistic completeness. Finally, Neuro-Symbolic Topological Alignment utilizes neuro-symbolic priors to constrain feature matching, ensuring activated regions are topologically consistent with human structural logic. Experimental results demonstrate that ICON not only maintains leading performance on standard benchmarks but also exhibits exceptional robustness against occlusion, background interference, and localization noise. This approach effectively advances the field by shifting from fitting statistical co-occurrences to learning causal invariance.
zh

[AI-20] MMGRid: Navigating Temporal-aware and Cross-domain Generative Recommendation via Model Merging

【Quick Read】: This paper addresses the challenges of model merging (MM) in generative recommendation (GR) across contexts, in particular how to merge specialized GR models trained under temporally evolving user behaviors and heterogeneous application domains. Fine-tuning from a shared base LLM introduces parameter conflicts (token distribution shifts and objective disparities), and incremental training induces recency bias, so standard merging is hard to apply reliably. The key is the unified framework MMGRid: a structured contextual grid organizes checkpoints fine-tuned from the same base LLM under different contexts; base model replacement disentangles task-aware from context-specific parameter changes to ease conflicts; and weighted contextual merging adjusts fusion weights according to context-dependent interaction characteristics to balance recency bias, improving the merged model's generalization and practicality.

Link: https://arxiv.org/abs/2601.15930
Authors: Tianjun Wei, Enneng Yang, Yingpeng Du, Huizhong Guo, Jie Zhang, Zhu Sun
Affiliation: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: this https URL

Click to view abstract

Abstract:Model merging (MM) offers an efficient mechanism for integrating multiple specialized models without access to original training data or costly retraining. While MM has demonstrated success in domains like computer vision, its role in recommender systems (RSs) remains largely unexplored. Recently, Generative Recommendation (GR) has emerged as a new paradigm in RSs, characterized by rapidly growing model scales and substantial computational costs, making MM particularly appealing for cost-sensitive deployment scenarios. In this work, we present the first systematic study of MM in GR through a contextual lens. We focus on a fundamental yet underexplored challenge in real-world: how to merge generative recommenders specialized to different real-world contexts, arising from temporal evolving user behaviors and heterogeneous application domains. To this end, we propose a unified framework MMGRid, a structured contextual grid of GR checkpoints that organizes models trained under diverse contexts induced by temporal evolution and domain diversity. All checkpoints are derived from a shared base LLM but fine-tuned on context-specific data, forming a realistic and controlled model space for systematically analyzing MM across GR paradigms and merging algorithms. Our investigation reveals several key insights. First, training GR models from LLMs can introduce parameter conflicts during merging due to token distribution shifts and objective disparities; such conflicts can be alleviated by disentangling task-aware and context-specific parameter changes via base model replacement. Second, incremental training across contexts induces recency bias, which can be effectively balanced through weighted contextual merging. Notably, we observe that optimal merging weights correlate with context-dependent interaction characteristics, offering practical guidance for weight selection in real-world deployments.
zh
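
Merging checkpoints that share a base model is commonly written with task vectors: add each context model's delta from the base, scaled by a per-context weight. The sketch below shows that generic recipe, assuming state dicts of tensors; how MMGRid actually derives the weights from interaction statistics is left abstract here.

```python
import torch

def weighted_task_vector_merge(base_state, context_states, weights):
    """theta_merged = theta_base + sum_i w_i * (theta_i - theta_base).
    Subtracting the shared base first isolates context-specific
    parameter changes before they are blended."""
    merged = {}
    for name, w0 in base_state.items():
        delta = sum(w * (s[name] - w0) for s, w in zip(context_states, weights))
        merged[name] = w0 + delta
    return merged

base = {"proj.weight": torch.zeros(2, 2)}
ckpts = [{"proj.weight": torch.ones(2, 2)},
         {"proj.weight": -torch.ones(2, 2)}]
print(weighted_task_vector_merge(base, ckpts, weights=[0.7, 0.3]))
```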

[AI-21] TeNet: Text-to-Network for Compact Policy Synthesis

【Quick Read】: This paper addresses two challenges for robots following natural-language instructions: high-level planning depends on hand-designed interfaces and generalizes poorly, while expressive end-to-end large models are hard to deploy for real-time control. The key is TeNet (Text-to-Network), which conditions a hypernetwork on text embeddings from a pretrained LLM to directly generate a compact, task-specific robot policy that then runs on low-dimensional state inputs at high control frequencies. Because language is used only once, at policy instantiation, TeNet inherits the general knowledge and paraphrasing robustness of LLMs while remaining lightweight at execution time; generalization can be further improved by aligning text embeddings with demonstrated behavior during training, without requiring demonstrations at inference.

Link: https://arxiv.org/abs/2601.15912
Authors: Ariyan Bighashdel, Kevin Sebastian Luck
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Robots that follow natural-language instructions often either plan at a high level using hand-designed interfaces or rely on large end-to-end models that are difficult to deploy for real-time control. We propose TeNet (Text-to-Network), a framework for instantiating compact, task-specific robot policies directly from natural language descriptions. TeNet conditions a hypernetwork on text embeddings produced by a pretrained large language model (LLM) to generate a fully executable policy, which then operates solely on low-dimensional state inputs at high control frequencies. By using the language only once at the policy instantiation time, TeNet inherits the general knowledge and paraphrasing robustness of pretrained LLMs while remaining lightweight and efficient at execution time. To improve generalization, we optionally ground language in behavior during training by aligning text embeddings with demonstrated actions, while requiring no demonstrations at inference time. Experiments on MuJoCo and Meta-World benchmarks show that TeNet produces policies that are orders of magnitude smaller than sequence-based baselines, while achieving strong performance in both multi-task and meta-learning settings and supporting high-frequency control. These results show that text-conditioned hypernetworks offer a practical way to build compact, language-driven controllers for resource-constrained robot control tasks with real-time requirements.
zh
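
The core mechanism, a hypernetwork that turns one text embedding into an executable policy, can be sketched in a few lines. Sizes, layer counts, and the tanh policy below are illustrative assumptions; only the instantiate-once, run-fast split reflects the design described in the abstract.

```python
import torch
import torch.nn as nn

class TextToPolicyHypernet(nn.Module):
    """Sketch: map a (frozen) LLM text embedding to the flat weights of
    a one-hidden-layer policy over low-dimensional states."""
    def __init__(self, text_dim, state_dim, hidden, act_dim):
        super().__init__()
        self.shapes = [(hidden, state_dim), (hidden,), (act_dim, hidden), (act_dim,)]
        n_params = sum(int(torch.tensor(s).prod()) for s in self.shapes)
        self.gen = nn.Sequential(nn.Linear(text_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_params))

    def forward(self, text_emb):
        return self.gen(text_emb)  # one flat parameter vector per instruction

def run_policy(flat, shapes, state):
    # Unpack the flat vector into layer tensors and run the tiny policy.
    params, i = [], 0
    for shape in shapes:
        n = 1
        for d in shape:
            n *= d
        params.append(flat[i:i + n].view(*shape)); i += n
    W1, b1, W2, b2 = params
    return torch.tanh(torch.tanh(state @ W1.T + b1) @ W2.T + b2)

hyper = TextToPolicyHypernet(text_dim=32, state_dim=8, hidden=16, act_dim=2)
flat = hyper(torch.randn(32))                         # instantiate once
act = run_policy(flat, hyper.shapes, torch.randn(8))  # fast per-step control
```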

[AI-22] Iterative Amortized Hierarchical VAE

【Quick Read】: This paper addresses the trade-off between inference speed and reconstruction quality in variational autoencoders (VAEs) for complex inverse problems: iterative optimization is accurate but slow, while fully amortized inference is fast but imprecise. The key is the Iterative Amortized Hierarchical Variational Autoencoder (IA-HVAE), a hybrid scheme combining an initial amortized guess with iterative refinement using decoder gradients, together with a decoder that is linearly separable in a transform domain (e.g., Fourier space), enabling real-time applications at very high model depths. The architecture makes iterative inference 35x faster than a traditional HVAE and clearly improves on an HVAE baseline in inverse problems such as deblurring and denoising.

Link: https://arxiv.org/abs/2601.15894
Authors: Simon W. Penninga, Ruud J. G. van Sloun
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:In this paper we propose the Iterative Amortized Hierarchical Variational Autoencoder (IA-HVAE), which expands on amortized inference with a hybrid scheme containing an initial amortized guess and iterative refinement with decoder gradients. We achieve this by creating a linearly separable decoder in a transform domain (e.g. Fourier space), enabling real-time applications with very high model depths. The architectural change leads to a 35x speed-up for iterative inference with respect to the traditional HVAE. We show that our hybrid approach outperforms fully amortized and fully iterative equivalents in accuracy and speed, respectively. Moreover, the IA-HVAE shows improved reconstruction quality over a vanilla HVAE in inverse problems such as deblurring and denoising.
zh
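
The hybrid scheme can be sketched as: take the encoder's amortized latent guess, then run a few gradient steps on the decoder-based (unnormalized) posterior. The toy decoder, Gaussian likelihood, and standard-normal prior below are stand-ins for the hierarchical model, not the paper's architecture.

```python
import torch

def refine_latents(z0, decoder, y, log_likelihood, steps=20, lr=0.05):
    """Hybrid inference sketch: start from the amortized guess z0, then
    ascend the unnormalized log-posterior of the latents using decoder
    gradients. A standard-normal prior stands in for the hierarchy."""
    z = z0.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -log_likelihood(decoder(z), y) + 0.5 * (z ** 2).sum()
        loss.backward()
        opt.step()
    return z.detach()

decoder = torch.nn.Linear(8, 16)                   # toy stand-in decoder
gauss_ll = lambda x, y: -((x - y) ** 2).sum()      # Gaussian log-lik (up to const)
z = refine_latents(torch.zeros(8), decoder, y=torch.randn(16),
                   log_likelihood=gauss_ll)
```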

[AI-23] EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience

【Quick Read】: This paper addresses the weakness of current native computer-use agents (CUAs) in modeling the causal dynamics of long-horizon tasks, a consequence of passively imitating static datasets. The key is EvoCUA, a framework that couples data generation and policy optimization in a self-sustaining evolutionary cycle: a verifiable task-synthesis engine overcomes data scarcity; scalable infrastructure supports tens of thousands of asynchronous sandbox rollouts for large-scale experience collection; and an iterative evolving learning strategy regulates policy updates by identifying capability boundaries, reinforcing successful routines while converting failure trajectories into rich supervision through error analysis and self-correction. EvoCUA reaches a 56.7% success rate on the OSWorld benchmark, clearly surpassing the previous best open-source model, OpenCUA-72B (45.0%), as well as closed-weights models such as UI-TARS-2 (53.1%), and its gains hold consistently across foundation models of different scales, demonstrating the generality and scalability of the experience-driven evolving paradigm.

Link: https://arxiv.org/abs/2601.15876
Authors: Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, Jinrui Ding, Xiandi Ma, Yuchen Xie, Peng Pei, Xunliang Cai, Xipeng Qiu
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 26 pages, 8 figures

Click to view abstract

Abstract:The development of native computer-use agents (CUA) represents a significant leap in multimodal AI. However, their potential is currently bottlenecked by the constraints of static data scaling. Existing paradigms relying primarily on passive imitation of static datasets struggle to capture the intricate causal dynamics inherent in long-horizon computer tasks. In this work, we introduce EvoCUA, a native computer use agentic model. Unlike static imitation, EvoCUA integrates data generation and policy optimization into a self-sustaining evolutionary cycle. To mitigate data scarcity, we develop a verifiable synthesis engine that autonomously generates diverse tasks coupled with executable validators. To enable large-scale experience acquisition, we design a scalable infrastructure orchestrating tens of thousands of asynchronous sandbox rollouts. Building on these massive trajectories, we propose an iterative evolving learning strategy to efficiently internalize this experience. This mechanism dynamically regulates policy updates by identifying capability boundaries – reinforcing successful routines while transforming failure trajectories into rich supervision through error analysis and self-correction. Empirical evaluations on the OSWorld benchmark demonstrate that EvoCUA achieves a success rate of 56.7%, establishing a new open-source state-of-the-art. Notably, EvoCUA significantly outperforms the previous best open-source model, OpenCUA-72B (45.0%), and surpasses leading closed-weights models such as UI-TARS-2 (53.1%). Crucially, our results underscore the generalizability of this approach: the evolving paradigm driven by learning from experience yields consistent performance gains across foundation models of varying scales, establishing a robust and scalable path for advancing native agent capabilities.
zh

[AI-24] Why Inference in Large Models Becomes Decomposable After Training

【Quick Read】: This paper addresses the unsustainable inference cost and system complexity that come from running inference over dense parameter matrices as model size grows: existing inference systems treat trained models as monolithic operators and ignore the internal structure formed during training. The key is a post-training statistical criterion together with a structural annealing procedure that removes parameter dependencies left statistically indistinguishable from their initialization, revealing stable, independent substructures. This establishes a post-training, model-agnostic structural view of inference systems and enables structured, parallel inference without modifying model functionality or interfaces.

Link: https://arxiv.org/abs/2601.15871
Authors: Jidong Jin
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures

Click to view abstract

Abstract:Inference in large-scale AI models is typically performed on dense parameter matrices, leading to inference cost and system complexity that scale unsustainably with model size. This limitation does not arise from insufficient model capacity, but from treating post-training inference systems as monolithic operators while ignoring internal structures formed during learning. We show that gradient update events in large models are highly localized and selective, leaving many parameter dependencies statistically indistinguishable from their initialization distribution after training. As a result, post-training inference systems are structurally non-uniform and inherently decomposable. Based on this observation, we introduce a post-training statistical criterion and a structural annealing procedure that removes unsupported dependencies and reveals stable, independent substructures. This work establishes a post-training, model-agnostic structural view of inference systems and enables structured, parallel inference without modifying model functionality or interfaces.
zh
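
One way to picture the post-training criterion is a per-parameter significance test against the initialization distribution: dependencies that never moved are removed. The z-score rule below is a deliberately simple stand-in for the paper's statistical criterion and annealing procedure, included only to illustrate the idea.

```python
import numpy as np

def prune_unmoved(w_trained, w_init, init_std, z_thresh=2.0):
    """Zero out parameters whose trained value is statistically
    indistinguishable from initialization under a simple z-test,
    |w - w0| <= z * sigma_init. The paper's actual criterion is more
    careful; this illustrates the post-training decomposition idea."""
    moved = np.abs(w_trained - w_init) > z_thresh * init_std
    return np.where(moved, w_trained, 0.0), moved

rng = np.random.default_rng(0)
w0 = rng.normal(0.0, 0.02, size=1000)        # init draw, sigma = 0.02
w = w0.copy()
w[:50] += rng.normal(0.5, 0.1, size=50)      # only a few weights get updated
pruned, mask = prune_unmoved(w, w0, init_std=0.02)
print(f"kept {mask.sum()} of {mask.size} weights")
```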

[AI-25] Introducing the Generative Application Firewall (GAF)

【Quick Read】: This paper addresses the fragmentation of current defenses for LLM applications, where prompt filters, guardrails, and data-masking operate without coordination and struggle to cover complex application scenarios. The key is the Generative Application Firewall (GAF), a unified architectural layer that consolidates these defenses into a single enforcement point, much as a traditional web application firewall (WAF) coordinates defenses for HTTP traffic, while extending coverage to autonomous agents and their tool interactions.

Link: https://arxiv.org/abs/2601.15824
Authors: Joan Vendrell Farreny (1), Martí Jordà Roca (1), Miquel Cornudella Gaya (1), Rodrigo Fernández Baón (1), Víctor García Martínez (1), Eduard Camacho Sucarrat (1), Alessandro Pignati (1) ((1) NeuralTrust)
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:This paper introduces the Generative Application Firewall (GAF), a new architectural layer for securing LLM applications. Existing defenses – prompt filters, guardrails, and data-masking – remain fragmented; GAF unifies them into a single enforcement point, much like a WAF coordinates defenses for web traffic, while also covering autonomous agents and their tool interactions.
zh

[AI-26] Virtual Traffic Police: Large Language Model-Augmented Traffic Signal Control for Unforeseen Incidents

【Quick Read】: This paper addresses the inefficiency of conventional adaptive traffic signal control (TSC) systems when unforeseen incidents occur (e.g., accidents and road maintenance), which typically require labor-intensive manual intervention by traffic police. Existing work often replaces the TSC system outright with LLM-based systems, which is unreliable due to hallucinations and costly due to system replacement. The key is a hierarchical framework that augments existing TSC systems with LLMs: an upper-level virtual traffic police agent dynamically fine-tunes parameters of the lower-level signal controllers in response to real-time incidents. A self-refined Traffic Language Retrieval System (TLRS) uses retrieval-augmented generation to draw domain knowledge from a tailored traffic-language database, and an LLM-based verifier continuously updates the TLRS, significantly improving operational efficiency and reliability under unforeseen incidents.

Link: https://arxiv.org/abs/2601.15816
Authors: Shiqi Wei, Qiqing Wang, Kaidi Yang
Affiliation: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Adaptive traffic signal control (TSC) has demonstrated strong effectiveness in managing dynamic traffic flows. However, conventional methods often struggle when unforeseen traffic incidents occur (e.g., accidents and road maintenance), which typically require labor-intensive and inefficient manual interventions by traffic police officers. Large Language Models (LLMs) appear to be a promising solution thanks to their remarkable reasoning and generalization capabilities. Nevertheless, existing works often propose to replace existing TSC systems with LLM-based systems, which can be (i) unreliable due to the inherent hallucinations of LLMs and (ii) costly due to the need for system replacement. To address the issues of existing works, we propose a hierarchical framework that augments existing TSC systems with LLMs, whereby a virtual traffic police agent at the upper level dynamically fine-tunes selected parameters of signal controllers at the lower level in response to real-time traffic incidents. To enhance domain-specific reliability in response to unforeseen traffic incidents, we devise a self-refined traffic language retrieval system (TLRS), whereby retrieval-augmented generation is employed to draw knowledge from a tailored traffic language database that encompasses traffic conditions and controller operation principles. Moreover, we devise an LLM-based verifier to update the TLRS continuously over the reasoning process. Our results show that LLMs can serve as trustworthy virtual traffic police officers that can adapt conventional TSC methods to unforeseen traffic incidents with significantly improved operational efficiency and reliability.
zh

[AI-27] Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification

【Quick Read】: This paper addresses the difficulty of further improving deep research agents (DRAs) for automated knowledge discovery and problem solving: existing efforts focus on post-training the policy, leaving agents unable to self-improve at inference time. The key is rubric-guided verification: DeepVerifier, built on rubrics derived from an automatically constructed DRA failure taxonomy that systematically classifies agent errors, exploits the asymmetry of verification to produce feedback that iteratively refines the agent's outputs at test time. This enables inference-time scaling without extra training, yielding 8%-11% accuracy gains on challenging subsets of GAIA and XBench-DeepResearch; the authors also release DeepVerifier-4K, a curated supervised fine-tuning dataset for developing verification capabilities in open models.

Link: https://arxiv.org/abs/2601.15808
Authors: Yuxuan Wan, Tianqing Fang, Zaitang Li, Yintong Huo, Wenxuan Wang, Haitao Mi, Dong Yu, Michael R. Lyu
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Recent advances in Deep Research Agents (DRAs) are transforming automated knowledge discovery and problem-solving. While the majority of existing efforts focus on enhancing policy capabilities via post-training, we propose an alternative paradigm: self-evolving the agent’s ability by iteratively verifying the policy model’s outputs, guided by meticulously crafted rubrics. This approach gives rise to the inference-time scaling of verification, wherein an agent self-improves by evaluating its generated answers to produce iterative feedback and refinements. We derive the rubrics based on an automatically constructed DRA Failure Taxonomy, which systematically classifies agent failures into five major categories and thirteen sub-categories. We present DeepVerifier, a rubrics-based outcome reward verifier that leverages the asymmetry of verification and outperforms vanilla agent-as-judge and LLM judge baselines by 12%-48% in meta-evaluation F1 score. To enable practical self-evolution, DeepVerifier integrates as a plug-and-play module during test-time inference. The verifier produces detailed rubric-based feedback, which is fed back to the agent for iterative bootstrapping, refining responses without additional training. This test-time scaling delivers 8%-11% accuracy gains on challenging subsets of GAIA and XBench-DeepResearch when powered by capable closed-source LLMs. Finally, to support open-source advancement, we release DeepVerifier-4K, a curated supervised fine-tuning dataset of 4,646 high-quality agent steps focused on DRA verification. These examples emphasize reflection and self-critique, enabling open models to develop robust verification capabilities.
zh

[AI-28] A Beacon Based Solution for Autonomous UUVs GNSS-Denied Stealthy Navigation

【Quick Read】: This paper addresses stealthy, precise path planning and localization for autonomous unmanned underwater vehicles (UUVs) in GNSS-denied environments. The key is deploying a constellation of beacons via aerial or surface drones to build a synthetic landmark network whose acoustic signals provide local positioning and navigation for the UUVs, while a hierarchical planner adapts the drones' actions, continuously monitoring and replanning to keep the UUVs on an accurate, optimized course from the continental shelf to a coastal goal.

Link: https://arxiv.org/abs/2601.15802
Authors: Alexandre Albore, Humbert Fiorino, Damien Pellier
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages. IEEE TechDefense 2025

Click to view abstract

Abstract:Autonomous Unmanned Underwater Vehicles (UUVs) enable military and civilian covert operations in coastal areas without relying on support vessels or Global Navigation Satellite Systems (GNSS). Such operations are critical when surface access is not possible and stealthy navigation is required in restricted environments such as protected zones or dangerous areas under access ban. GNSS-denied navigation is then essential to maintaining concealment, as surfacing could expose UUVs to detection. To ensure precise fleet positioning, a constellation of beacons deployed by aerial or surface drones establishes a synthetic landmark network that guides the fleet of UUVs along an optimized path from the continental shelf to the goal on the shore. These beacons, either submerged or floating, emit acoustic signals for UUV localisation and navigation. A hierarchical planner generates an adaptive route for the drones, executing primitive actions while continuously monitoring and replanning as needed to maintain trajectory accuracy.
zh

[AI-29] VitalDiagnosis: AI-Driven Ecosystem for 24/7 Vital Monitoring and Chronic Disease Management AAAI2026

【Quick Read】: This paper addresses two problems in chronic disease management: patients struggle to recognize early signs of deterioration and to adhere to care plans, against a backdrop of strained medical resources and an aging population. The key is VitalDiagnosis, an LLM-driven ecosystem that combines continuous wearable data with LLM reasoning to handle both acute health anomalies and routine adherence: it analyzes triggers through context-aware inquiries, produces provisional insights within a collaborative patient-clinician workflow, and offers personalized guidance, shifting care from passive monitoring to proactive, interactive management that strengthens patient self-management and reduces avoidable clinical workload.

Link: https://arxiv.org/abs/2601.15798
Authors: Zhikai Xue, Tianqianjin Lin, Pengwei Yan, Ruichun Wang, Yuxin Liu, Zhuoren Jiang, Xiaozhong Liu
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026 Demo

Click to view abstract

Abstract:Chronic diseases have become the leading cause of death worldwide, a challenge intensified by strained medical resources and an aging population. Individually, patients often struggle to interpret early signs of deterioration or maintain adherence to care plans. In this paper, we introduce VitalDiagnosis, an LLM-driven ecosystem designed to shift chronic disease management from passive monitoring to proactive, interactive engagement. By integrating continuous data from wearable devices with the reasoning capabilities of LLMs, the system addresses both acute health anomalies and routine adherence. It analyzes triggers through context-aware inquiries, produces provisional insights within a collaborative patient-clinician workflow, and offers personalized guidance. This approach aims to promote a more proactive and cooperative care paradigm, with the potential to enhance patient self-management and reduce avoidable clinical workload.
zh

[AI-30] Creativity in the Age of AI: Rethinking the Role of Intentional Agency

【Quick Read】: This paper examines the traditional claim that creativity requires intentional agency (the Intentional Agency Condition, IAC), a condition challenged both theoretically and practically by the rapid progress of generative AI. The authors argue the IAC now fails descriptively, since many authors and journalists already ascribe creativity to AI systems that lack intentions, and functionally, since it feeds biases against AI-generated outputs and hinders the identification and encouragement of genuinely novel and valuable products. The key proposal is to abandon the IAC as a general condition of creativity in favor of a consistency requirement, defining creativity by the reliable generation of novel and valuable products, while retaining the IAC in specific local domains.

Link: https://arxiv.org/abs/2601.15797
Authors: James S. Pearson, Matthew J. Dennis, Marc Cheong
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 27 pages, 2 figures

Click to view abstract

Abstract:Many theorists of creativity maintain that intentional agency is a necessary condition of creativity. We argue that this requirement, which we call the Intentional Agency Condition (IAC), should be rejected as a general condition of creativity, while retaining its relevance in specific contexts. We show that recent advances in generative AI have rendered the IAC increasingly problematic, both descriptively and functionally. We offer two reasons for abandoning it at the general level. First, we present corpus evidence indicating that authors and journalists are increasingly comfortable ascribing creativity to generative AI, despite its lack of intentional agency. This development places pressure on the linguistic intuitions that have traditionally been taken to support the IAC. Second, drawing on the method of conceptual engineering, we argue that the IAC no longer fulfils its core social function. Rather than facilitating the identification and encouragement of reliable sources of novel and valuable products, it now feeds into biases that distort our assessments of AI-generated outputs. We therefore propose replacing the IAC with a consistency requirement, according to which creativity tracks the reliable generation of novel and valuable products. Nonetheless, we explain why the IAC should be retained in specific local domains.
zh

[AI-31] Off-Policy Actor-Critic with Sigmoid-Bounded Entropy for Real-World Robot Learning

【Quick Read】: This paper addresses the challenges of deploying reinforcement learning (RL) in the real world: sample inefficiency, sparse rewards, and noisy visual observations. Existing approaches such as offline-to-online transfer or VLA-assisted RL depend on large datasets or pretraining, making them costly and unstable. The key is SigEnt-SAC, which enables low-cost, low-data real-world RL via a sigmoid-bounded entropy term that prevents negative-entropy-driven optimization toward out-of-distribution actions and suppresses Q-function oscillations. Trained from scratch with a single expert trajectory, SigEnt-SAC substantially reduces Q-function oscillations on D4RL benchmarks and reaches a 100% success rate faster than prior methods; on four real-world robotic tasks with raw images and sparse rewards, it learns successful policies from only a small number of real interactions, demonstrating practicality and data efficiency.

Link: https://arxiv.org/abs/2601.15761
Authors: Xiefeng Wu, Mingyu Hu, Shu Zhang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 7 pages main text, 2 pages of references

Click to view abstract

Abstract:Deploying reinforcement learning in the real world remains challenging due to sample inefficiency, sparse rewards, and noisy visual observations. Prior work leverages demonstrations and human feedback to improve learning efficiency and robustness. However, offline-to-online methods need large datasets and can be unstable, while VLA-assisted RL relies on large-scale pretraining and fine-tuning. As a result, a low-cost real-world RL method with minimal data requirements has yet to emerge. We introduce \textbfSigEnt-SAC, an off-policy actor-critic method that learns from scratch using a single expert trajectory. Our key design is a sigmoid-bounded entropy term that prevents negative-entropy-driven optimization toward out-of-distribution actions and reduces Q-function oscillations. We benchmark SigEnt-SAC on D4RL tasks against representative baselines. Experiments show that SigEnt-SAC substantially alleviates Q-function oscillations and reaches a 100% success rate faster than prior methods. Finally, we validate SigEnt-SAC on four real-world robotic tasks across multiple embodiments, where agents learn from raw images and sparse rewards; results demonstrate that SigEnt-SAC can learn successful policies with only a small number of real-world interactions, suggesting a low-cost and practical pathway for real-world RL deployment.
zh
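
The abstract names the key ingredient, a sigmoid-bounded entropy term, without giving its exact form, so the following is one plausible shape: squash the negative log-probability through a sigmoid so the bonus saturates instead of rewarding ever-lower-probability (out-of-distribution) actions. Treat both functions as hedged sketches, not the paper's definitions.

```python
import torch

def sigmoid_bounded_entropy(log_prob, alpha=0.2, scale=1.0):
    """Bounded entropy bonus: squash -log pi(a|s) through a sigmoid so
    the bonus saturates as actions drift out of distribution. One
    plausible form of the sigmoid-bounded entropy term."""
    return alpha * torch.sigmoid(-log_prob / scale)

def actor_loss(q_values, log_probs):
    # SAC-style actor objective with the bounded bonus replacing the
    # usual -alpha * log_prob term, which is unbounded below.
    return (-q_values - sigmoid_bounded_entropy(log_probs)).mean()

q = torch.randn(64)
logp = torch.randn(64, requires_grad=True)
actor_loss(q, logp).backward()
```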

[AI-32] CAFE-GB: Scalable and Stable Feature Selection for Malware Detection via Chunk-wise Aggregated Gradient Boosting

【Quick Read】: This paper addresses feature redundancy, instability, and scalability limits in high-dimensional malware datasets, which degrade the effectiveness and interpretability of machine learning-based detection. The key is CAFE-GB (Chunk-wise Aggregated Feature Estimation using Gradient Boosting), a scalable feature-selection framework: training data is partitioned into overlapping chunks, local feature importance is estimated with gradient-boosting models on each chunk, and these local estimates are aggregated into a globally stable ranking; the feature budget is then chosen via systematic k-selection and stability analysis to balance detection performance and robustness. Experiments show that classifiers trained on CAFE-GB-selected features match full-feature baselines (no statistically significant differences in Accuracy, F1-score, MCC, ROC-AUC, or PR-AUC) while reducing dimensionality by more than 95%, with lower feature redundancy and better interpretability.

Link: https://arxiv.org/abs/2601.15754
Authors: Ajvad Haneef K, Karan Kuwar Singh, Madhu Kumar S D
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:High-dimensional malware datasets often exhibit feature redundancy, instability, and scalability limitations, which hinder the effectiveness and interpretability of machine learning-based malware detection systems. Although feature selection is commonly employed to mitigate these issues, many existing approaches lack robustness when applied to large-scale and heterogeneous malware data. To address this gap, this paper proposes CAFE-GB (Chunk-wise Aggregated Feature Estimation using Gradient Boosting), a scalable feature selection framework designed to produce stable and globally consistent feature rankings for high-dimensional malware detection. CAFE-GB partitions training data into overlapping chunks, estimates local feature importance using gradient boosting models, and aggregates these estimates to derive a robust global ranking. Feature budget selection is performed separately through a systematic k-selection and stability analysis to balance detection performance and robustness. The proposed framework is evaluated on two large-scale malware datasets: BODMAS and CIC-AndMal2020, representing large and diverse malware feature spaces. Experimental results show that classifiers trained on CAFE-GB-selected features achieve performance parity with full-feature baselines across multiple metrics, including Accuracy, F1-score, MCC, ROC-AUC, and PR-AUC, while reducing feature dimensionality by more than 95%. Paired Wilcoxon signed-rank tests confirm that this reduction does not introduce statistically significant performance degradation. Additional analyses demonstrate low inter-feature redundancy and improved interpretability through SHAP-based explanations. Runtime and memory profiling further indicate reduced downstream classification overhead. Overall, CAFE-GB provides a stable, interpretable, and scalable feature selection strategy for large-scale malware detection.
zh
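
The chunk-wise aggregation itself is straightforward to sketch with scikit-learn: fit gradient boosting on overlapping random chunks, then average per-chunk importances into one global ranking. Chunk sizes, overlap, and hyperparameters below are illustrative, and the paper's aggregation details are followed only loosely.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

def cafe_gb_ranking(X, y, n_chunks=5, chunk_frac=0.4, seed=0):
    """Chunk-wise aggregated importance sketch: fit gradient boosting
    on overlapping random chunks and average per-chunk importances
    into a single global ranking."""
    rng = np.random.default_rng(seed)
    agg = np.zeros(X.shape[1])
    for _ in range(n_chunks):
        idx = rng.choice(len(X), size=int(chunk_frac * len(X)), replace=False)
        gb = GradientBoostingClassifier(n_estimators=50, random_state=seed)
        gb.fit(X[idx], y[idx])
        agg += gb.feature_importances_
    agg /= n_chunks
    return np.argsort(agg)[::-1], agg  # global ranking, mean importance

X, y = make_classification(n_samples=400, n_features=30, random_state=0)
ranking, imp = cafe_gb_ranking(X, y)
top_k = ranking[:10]  # the feature budget k is chosen separately
```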

[AI-33] Tabular Incremental Inference

【Quick Read】: This paper addresses the limitation that conventional AI models are trained and run inference over fixed column structures, which makes them ill-suited to tables whose schemas change over time. The authors introduce the Tabular Incremental Inference (TabII) task, in which a trained model must seamlessly incorporate newly added columns at inference time, and frame it as an optimization problem grounded in information bottleneck theory: minimize the mutual information between the tabular data and its representation while maximizing the mutual information between the representation and the task labels. Concretely, the method uses large language model (LLM) placeholders to inject external knowledge, combined with a pretrained TabAdapter and Incremental Sample Condensation blocks that condense task-relevant information from the incremental column attributes, achieving state-of-the-art performance on eight public datasets.

Link: https://arxiv.org/abs/2601.15751
Authors: Xinda Chen, Xing Zhen, Hanyu Zhang, Weimin Tan, Bo Yan
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Tabular data is a fundamental form of data structure. The evolution of table analysis tools reflects humanity’s continuous progress in data acquisition, management, and processing. The dynamic changes in table columns arise from technological advancements, changing needs, data integration, etc. However, the standard process of training AI models on tables with fixed columns and then performing inference is not suitable for handling dynamically changed tables. Therefore, new methods are needed for efficiently handling such tables in an unsupervised manner. In this paper, we introduce a new task, Tabular Incremental Inference (TabII), which aims to enable trained models to incorporate new columns during the inference stage, enhancing the practicality of AI models in scenarios where tables are dynamically changed. Furthermore, we demonstrate that this new task can be framed as an optimization problem based on the information bottleneck theory, which emphasizes that the key to an ideal tabular incremental inference approach lies in minimizing mutual information between tabular data and representation while maximizing between representation and task labels. Under this guidance, we design a TabII method with Large Language Model placeholders and Pretrained TabAdapter to provide external knowledge and Incremental Sample Condensation blocks to condense the task-relevant information given by incremental column attributes. Experimental results across eight public datasets show that TabII effectively utilizes incremental attributes, achieving state-of-the-art performance.
zh

[AI-34] DualShield: Safe Model Predictive Diffusion via Reachability Analysis for Interactive Autonomous Driving

【Quick Read】: This paper addresses two core problems with diffusion models for multimodal motion planning in autonomous driving: enforcing vehicle dynamics is difficult, and performance depends critically on accurate predictions of other agents, creating safety risks under uncertain interactions. The key of DualShield is using Hamilton-Jacobi (HJ) reachability value functions in a dual capacity: as proactive guidance, they steer the diffusion denoising process toward safe, dynamically feasible regions; and as a reactive safety shield built on control barrier-value functions (CBVFs), they modify executed actions in real time to ensure safety. This dual mechanism preserves the rich exploration of diffusion models while providing principled safety assurance, particularly under highly uncertain or even adversarial interactions.

Link: https://arxiv.org/abs/2601.15729
Authors: Rui Yang, Lei Zheng, Ruoyu Yao, Jun Ma
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 8 pages, 5 figures

Click to view abstract

Abstract:Diffusion models have emerged as a powerful approach for multimodal motion planning in autonomous driving. However, their practical deployment is typically hindered by the inherent difficulty in enforcing vehicle dynamics and a critical reliance on accurate predictions of other agents, making them prone to safety issues under uncertain interactions. To address these limitations, we introduce DualShield, a planning and control framework that leverages Hamilton-Jacobi (HJ) reachability value functions in a dual capacity. First, the value functions act as proactive guidance, steering the diffusion denoising process towards safe and dynamically feasible regions. Second, they form a reactive safety shield using control barrier-value functions (CBVFs) to modify the executed actions and ensure safety. This dual mechanism preserves the rich exploration capabilities of diffusion models while providing principled safety assurance under uncertain and even adversarial interactions. Simulations in challenging unprotected U-turn scenarios demonstrate that DualShield significantly improves both safety and task efficiency compared to leading methods from different planning paradigms under uncertainty.
zh
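
The reactive half of the design can be sketched as a safety filter: execute the nominal (diffusion-planned) action if it keeps the HJ value from decaying faster than a CBVF-style rate, otherwise substitute the closest action that does. The discrete candidate search below stands in for the quadratic program usually solved in practice; the dynamics and value function are toy assumptions.

```python
import numpy as np

def shield_action(u_nom, x, value_fn, step, candidates, gamma=0.1):
    """Reactive shield sketch: accept u_nom if the HJ safety value
    satisfies V(x') >= (1 - gamma) * V(x) after one step; otherwise
    pick the feasible candidate closest to u_nom."""
    v = value_fn(x)
    ok = lambda u: value_fn(step(x, u)) >= (1.0 - gamma) * v
    if ok(u_nom):
        return u_nom
    feasible = [u for u in candidates if ok(u)]
    if not feasible:  # fall back to the most safety-improving action
        return max(candidates, key=lambda u: value_fn(step(x, u)))
    return min(feasible, key=lambda u: float(np.linalg.norm(u - u_nom)))

# Toy 1-D example: the unsafe set is x < 0, with V(x) = x as the
# signed distance and single-integrator dynamics.
value_fn = lambda x: float(x[0])
step = lambda x, u: x + 0.1 * u
cands = [np.array([a]) for a in np.linspace(-1.0, 1.0, 9)]
u = shield_action(np.array([-1.0]), np.array([0.3]), value_fn, step, cands)
print(u)  # the aggressive nominal action gets softened to stay safe
```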

[AI-35] Benchmarking Text-to-Python against Text-to-SQL: The Impact of Explicit Logic and Ambiguity

【Quick Read】: This paper addresses the unreliability of Text-to-Python for data retrieval compared with mature Text-to-SQL systems, especially when user intent is underspecified: as a procedural language, Python demands explicit logic, so ambiguous natural-language inputs often yield code that fails to execute correctly. The key is the Logic Completion Framework (LCF), which resolves ambiguity by incorporating latent domain knowledge to bridge the gap between natural language and executable logical specifications, improving accuracy and robustness. Experiments show that once missing domain context is addressed, Text-to-Python reaches performance parity with Text-to-SQL.

Link: https://arxiv.org/abs/2601.15728
Authors: Hangle Hu, Chenyu Hou, Bin Cao, Ruizhe Li
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 8 pages, 7 figures

Click to view abstract

Abstract:While Text-to-SQL remains the dominant approach for database interaction, real-world analytics increasingly require the flexibility of general-purpose programming languages such as Python or Pandas to manage file-based data and complex analytical workflows. Despite this growing need, the reliability of Text-to-Python in core data retrieval remains underexplored relative to the mature SQL ecosystem. To address this gap, we introduce BIRD-Python, a benchmark designed for cross-paradigm evaluation. We systematically refined the original dataset to reduce annotation noise and align execution semantics, thereby establishing a consistent and standardized baseline for comparison. Our analysis reveals a fundamental paradigmatic divergence: whereas SQL leverages implicit DBMS behaviors through its declarative structure, Python requires explicit procedural logic, making it highly sensitive to underspecified user intent. To mitigate this challenge, we propose the Logic Completion Framework (LCF), which resolves ambiguity by incorporating latent domain knowledge into the generation process. Experimental results show that (1) performance differences primarily stem from missing domain context rather than inherent limitations in code generation, and (2) when these gaps are addressed, Text-to-Python achieves performance parity with Text-to-SQL. These findings establish Python as a viable foundation for analytical agents-provided that systems effectively ground ambiguous natural language inputs in executable logical specifications. Resources are available at this https URL.
zh

[AI-36] CoNRec: Context-Discerning Negative Recommendation with LLM s

【Quick Read】: This paper addresses the insufficient modeling of users' negative preferences in recommender systems: existing methods mostly use negative feedback as an auxiliary signal for improving positive recommendations rather than modeling negative interests directly, limiting their value in offline applications, while the inherent sparsity of negative feedback leaves models prone to context-understanding biases induced by positive-feedback dominance. The key is the first LLM-based framework for negative feedback modeling, whose core innovations include: semantic ID representations replacing text-based item descriptions; an item-level alignment task that improves the LLM's grasp of the semantic context behind negative feedback; a Progressive GRPO training paradigm that dynamically balances the use of positive and negative behavioral context; and, upon revealing a fundamental misalignment between the conventional next-negative-item prediction objective and users' true negative preferences, a new reward function and evaluation metric grounded in multi-day future negative feedback and its collaborative signals.

Link: https://arxiv.org/abs/2601.15721
Authors: Xinda Chen, Jiawei Wu, Yishuang Liu, Jialin Zhu, Shuwen Xiao, Junjun Zheng, Xiangheng Kong, Yuning Jiang
Affiliation: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Understanding what users like is relatively straightforward; understanding what users dislike, however, remains a challenging and underexplored problem. Research into users’ negative preferences has gained increasing importance in modern recommendation systems. Numerous platforms have introduced explicit negative feedback mechanisms and leverage such signals to refine their recommendation models. Beyond traditional business metrics, user experience-driven metrics, such as negative feedback rates, have become critical indicators for evaluating system performance. However, most existing approaches primarily use negative feedback as an auxiliary signal to enhance positive recommendations, paying little attention to directly modeling negative interests, which can be highly valuable in offline applications. Moreover, due to the inherent sparsity of negative feedback data, models often suffer from context understanding biases induced by positive feedback dominance. To address these challenges, we propose the first large language model framework for negative feedback modeling with special designed context-discerning modules. We use semantic ID Representation to replace text-based item descriptions and introduce an item-level alignment task that enhances the LLM’s understanding of the semantic context behind negative feedback. Furthermore, we design a Progressive GRPO training paradigm that enables the model to dynamically balance the positive and negative behavioral context utilization. Besides, our investigation further reveals a fundamental misalignment between the conventional next-negative-item prediction objective and users’ true negative preferences, which is heavily influenced by the system’s recommendation order. To mitigate this, we propose a novel reward function and evaluation metric grounded in multi-day future negative feedback and their collaborative signals.

[AI-37] Investigation of the Generalisation Ability of Genetic Programming-evolved Scheduling Rules in Dynamic Flexible Job Shop Scheduling

【Quick Read】: This paper investigates the under-explored generalisation ability of Genetic Programming (GP)-evolved scheduling rules in Dynamic Flexible Job Shop Scheduling (DFJSS). Existing studies typically train and test GP rules on structurally similar instances that differ only in random seeds, leaving cross-type adaptability across different kinds of DFJSS instances unexamined. The key to the solution is a systematic evaluation of GP rules' generalisation along several dimensions, including problem scale (numbers of machines and jobs), key job-shop parameters (such as utilisation level), and data distributions. The experiments show that GP rules generalise well when training instances contain more jobs than test instances with the number of machines fixed, and when training and test instances are similar in scale or parameters; further analysis reveals that the number and distribution of decision points are the core factor behind these performance differences: similar decision-point distributions help generalisation, while significant discrepancies cause marked degradation.

Link: https://arxiv.org/abs/2601.15717
Authors: Luyao Zhu, Fangfang Zhang, Yi Mei, Mengjie Zhang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Dynamic Flexible Job Shop Scheduling (DFJSS) is a complex combinatorial optimisation problem that requires simultaneous machine assignment and operation sequencing decisions in dynamic production environments. Genetic Programming (GP) has been widely applied to automatically evolve scheduling rules for DFJSS. However, existing studies typically train and test GP-evolved rules on DFJSS instances of the same type, which differ only by random seeds rather than by structural characteristics, leaving their cross-type generalisation ability largely unexplored. To address this gap, this paper systematically investigates the generalisation ability of GP-evolved scheduling rules under diverse DFJSS conditions. A series of experiments are conducted across multiple dimensions, including problem scale (i.e., the number of machines and jobs), key job shop parameters (e.g., utilisation level), and data distributions, to analyse how these factors influence GP performance on unseen instance types. The results show that good generalisation occurs when the training instances contain more jobs than the test instances while keeping the number of machines fixed, and when both training and test instances have similar scales or job shop parameters. Further analysis reveals that the number and distribution of decision points in DFJSS instances play a crucial role in explaining these performance differences. Similar decision point distributions lead to better generalisation, whereas significant discrepancies result in a marked degradation of performance. Overall, this study provides new insights into the generalisation ability of GP in DFJSS and highlights the necessity of evolving more generalisable GP rules capable of handling heterogeneous DFJSS instances effectively.

[AI-38] FlexLLM: Composable HLS Library for Flexible Hybrid LLM Accelerator Design

【Quick Read】: This paper addresses inefficient hardware design for large language model (LLM) inference acceleration, the difficulty of adapting designs to the distinct inference stages (prefill and decode), and performance bottlenecks in long-context processing. The key to the solution is FlexLLM, a composable High-Level Synthesis (HLS) library that exposes key architectural degrees of freedom for stage-customized inference, allowing temporal reuse and spatial dataflow to be optimized separately for prefill and decode, and integrating a comprehensive quantization suite for accurate low-bit deployment; a Hierarchical Memory Transformer (HMT) plug-in further improves long-context efficiency. The approach sharply lowers development cost (a complete system for the Llama-3.2 1B model in only 1K lines of code) and achieves higher end-to-end speed, throughput, and energy efficiency on FPGAs than an NVIDIA A100 GPU.

Link: https://arxiv.org/abs/2601.15710
Authors: Jiahao Zhang, Zifan He, Nicholas Fraser, Michaela Blott, Yizhou Sun, Jason Cong
Affiliations: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We present FlexLLM, a composable High-Level Synthesis (HLS) library for rapid development of domain-specific LLM accelerators. FlexLLM exposes key architectural degrees of freedom for stage-customized inference, enabling hybrid designs that tailor temporal reuse and spatial dataflow differently for prefill and decode, and provides a comprehensive quantization suite to support accurate low-bit deployment. Using FlexLLM, we build a complete inference system for the Llama-3.2 1B model in under two months with only 1K lines of code. The system includes: (1) a stage-customized accelerator with hardware-efficient quantization (12.68 WikiText-2 PPL) surpassing SpinQuant baseline, and (2) a Hierarchical Memory Transformer (HMT) plug-in for efficient long-context processing. On the AMD U280 FPGA at 16nm, the accelerator achieves 1.29 \times end-to-end speedup, 1.64 \times higher decode throughput, and 3.14 \times better energy efficiency than an NVIDIA A100 GPU (7nm) running BF16 inference; projected results on the V80 FPGA at 7nm reach 4.71 \times , 6.55 \times , and 4.13 \times , respectively. In long-context scenarios, integrating the HMT plug-in reduces prefill latency by 23.23 \times and extends the context window by 64 \times , delivering 1.10 \times /4.86 \times lower end-to-end latency and 5.21 \times /6.27 \times higher energy efficiency on the U280/V80 compared to the A100 baseline. FlexLLM thus bridges algorithmic innovation in LLM inference and high-performance accelerators with minimal manual effort.

[AI-39] AgentSM: Semantic Memory for Agentic Text-to-SQL

【Quick Read】: This paper addresses three challenges facing current LLM-based Text-to-SQL systems in real enterprise settings: poor scalability to large, complex database schemas, difficulty adapting to diverse SQL dialects, and inefficient, unstable multi-step reasoning. The key innovation of the proposed agentic framework, Agent Semantic Memory (AgentSM), is to build and exploit interpretable semantic memory: prior execution traces, or manually curated ones, are converted into structured program representations that directly guide subsequent reasoning. This design enables systematic reuse of reasoning paths and markedly improves efficiency and reliability on large schemas, complex queries, and long trajectories: relative to state-of-the-art methods it cuts average token usage by 25% and trajectory length by 35% on the Spider 2.0 benchmark, and reaches a new state-of-the-art execution accuracy of 44.8% on Spider 2.0 Lite.

Link: https://arxiv.org/abs/2601.15709
Authors: Asim Biswal, Chuan Lei, Xiao Qin, Aodong Li, Balakrishnan Narayanaswamy, Tim Kraska
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
Comments:

Abstract:Recent advances in LLM-based Text-to-SQL have achieved remarkable gains on public benchmarks such as BIRD and Spider. Yet, these systems struggle to scale in realistic enterprise settings with large, complex schemas, diverse SQL dialects, and expensive multi-step reasoning. Emerging agentic approaches show potential for adaptive reasoning but often suffer from inefficiency and instability-repeating interactions with databases, producing inconsistent outputs, and occasionally failing to generate valid answers. To address these challenges, we introduce Agent Semantic Memory (AgentSM), an agentic framework for Text-to-SQL that builds and leverages interpretable semantic memory. Instead of relying on raw scratchpads or vector retrieval, AgentSM captures prior execution traces-or synthesizes curated ones-as structured programs that directly guide future reasoning. This design enables systematic reuse of reasoning paths, which allows agents to scale to larger schemas, more complex questions, and longer trajectories efficiently and reliably. Compared to state-of-the-art systems, AgentSM achieves higher efficiency by reducing average token usage and trajectory length by 25% and 35%, respectively, on the Spider 2.0 benchmark. It also improves execution accuracy, reaching a state-of-the-art accuracy of 44.8% on the Spider 2.0 Lite benchmark.
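As rough intuition for "semantic memory as structured programs", here is a toy sketch (our interpretation of the abstract; AgentSM's actual representation is certainly richer): successful traces are stored as plan templates under interpretable task signatures and retrieved by signature overlap rather than vector similarity.

```python
# A toy sketch of structured semantic memory (illustration, not AgentSM code):
# plan templates keyed by interpretable task signatures, recalled by overlap.
MEMORY = {}

def remember(signature, steps):
    """signature: e.g. ('aggregate', 'join'); steps: a reusable plan template."""
    MEMORY[signature] = steps

def recall(signature):
    # Exact match first, otherwise the most-overlapping stored signature.
    if signature in MEMORY:
        return MEMORY[signature]
    best = max(MEMORY, key=lambda s: len(set(s) & set(signature)), default=None)
    return MEMORY.get(best)

remember(("aggregate", "join"), ["inspect schemas", "join fact/dim tables",
                                 "group by key", "validate with LIMIT 5"])
print(recall(("aggregate", "join", "date_filter")))  # reuses the closest plan
```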

[AI-40] Improving Methodologies for LLM Evaluations Across Global Languages

【Quick Read】: This paper addresses the insufficient safety and reliability of frontier generative AI (Generative AI) models in global multilingual, multicultural settings, where safeguards behave inconsistently across languages and cultures and may fail to identify or suppress harmful behaviour. The key to the solution is a joint cross-lingual safety evaluation: under an international collaboration framework, the team tested two open-weight models across ten high- and low-resource languages (including Mandarin Chinese, English, French, and Kiswahili) on five harm categories (privacy, non-violent crime, violent crime, intellectual property, and jailbreak robustness), assessing outputs with both LLM-as-a-judge and human annotation. The exercise reveals differences in safeguard strength across languages and in evaluator reliability, and it yields methodological recommendations for multilingual safety evaluation, including culturally contextualised translation, stress-tested evaluator prompts, and clearer human-annotation guidelines, laying groundwork for a shared multilingual safety-testing framework.

Link: https://arxiv.org/abs/2601.15706
Authors: Akriti Vij, Benjamin Chua, Darshini Ramiah, En Qi Ng, Mahran Morsidi, Naga Nikshith Gangarapu, Sharmini Johnson, Vanessa Wilfred, Vikneswaran Kumaran, Wan Sie Lee, Wenzhuo Yang, Yongsen Zheng, Bill Black, Boming Xia, Frank Sun, Hao Zhang, Qinghua Lu, Suyu Ma, Yue Liu, Chi-kiu Lo, Fatemeh Azadi, Isar Nejadgholi, Sowmya Vajjala, Agnes Delaborde, Nicolas Rolin, Tom Seimandi, Akiko Murakami, Haruto Ishi, Satoshi Sekine, Takayuki Semitsu, Tasuku Sasaki, Angela Kinuthia, Jean Wangari, Michael Michie, Stephanie Kasaon, Hankyul Baek, Jaewon Noh, Kihyuk Nam, Sang Seo, Sungpil Shin, Taewhi Lee, Yongsu Kim, Daisy Newbold-Harrop, Jessica Wang, Mahmoud Ghanem, Vy Hong
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Author names have been organised by country, and in alphabetical order within countries

Abstract:As frontier AI models are deployed globally, it is essential that their behaviour remains safe and reliable across diverse linguistic and cultural contexts. To examine how current model safeguards hold up in such settings, participants from the International Network for Advanced AI Measurement, Evaluation and Science, including representatives from Singapore, Japan, Australia, Canada, the EU, France, Kenya, South Korea and the UK conducted a joint multilingual evaluation exercise. Led by Singapore AISI, two open-weight models were tested across ten languages spanning high- and low-resource groups: Cantonese, English, Farsi, French, Japanese, Korean, Kiswahili, Malay, Mandarin Chinese and Telugu. Over 6,000 newly translated prompts were evaluated across five harm categories (privacy, non-violent crime, violent crime, intellectual property and jailbreak robustness), using both LLM-as-a-judge and human annotation. The exercise shows how safety behaviours can vary across languages. These include differences in safeguard robustness across languages and harm types and variation in evaluator reliability (LLM-as-judge vs. human review). Further, it also generated methodological insights for improving multilingual safety evaluations, such as the need for culturally contextualised translations, stress-tested evaluator prompts and clearer human annotation guidelines. This work represents an initial step toward a shared framework for multilingual safety testing of advanced AI systems and calls for continued collaboration with the wider research community and industry.

[AI-41] From Passive Metric to Active Signal: The Evolving Role of Uncertainty Quantification in Large Language Models

【Quick Read】: This paper addresses the reliability challenges of deploying large language models (LLMs) in high-stakes domains. The core question is how to turn uncertainty from a passive diagnostic metric into an actively controlled signal that steers model behaviour in real time. The key to the solution is using uncertainty as a driving mechanism across three frontiers: optimizing computation and triggering self-correction in advanced reasoning; guiding metacognitive decisions (such as tool use and information seeking) in autonomous agents; and mitigating reward hacking and enabling self-improvement via intrinsic rewards in reinforcement learning. Grounded in emerging theoretical frameworks such as Bayesian methods and Conformal Prediction, the survey offers a unified perspective and argues that mastering this new trend of active uncertainty control is central to building the next generation of scalable, reliable, and trustworthy AI systems.

Link: https://arxiv.org/abs/2601.15690
Authors: Jiaxin Zhang, Wendi Cui, Zhuohang Li, Lifu Huang, Bradley Malin, Caiming Xiong, Chien-Sheng Wu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments: 20 pages, 4 figures, 6 tables

Abstract:While Large Language Models (LLMs) show remarkable capabilities, their unreliability remains a critical barrier to deployment in high-stakes domains. This survey charts a functional evolution in addressing this challenge: the evolution of uncertainty from a passive diagnostic metric to an active control signal guiding real-time model behavior. We demonstrate how uncertainty is leveraged as an active control signal across three frontiers: in \textbfadvanced reasoning to optimize computation and trigger self-correction; in \textbfautonomous agents to govern metacognitive decisions about tool use and information seeking; and in \textbfreinforcement learning to mitigate reward hacking and enable self-improvement via intrinsic rewards. By grounding these advancements in emerging theoretical frameworks like Bayesian methods and Conformal Prediction, we provide a unified perspective on this transformative trend. This survey provides a comprehensive overview, critical analysis, and practical design patterns, arguing that mastering the new trend of uncertainty is essential for building the next generation of scalable, reliable, and trustworthy AI.
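As a concrete instance of uncertainty acting as a control signal rather than a passive metric, consider this minimal sketch (our construction; `model` and its return signature are hypothetical): predictive entropy gates whether the system spends extra compute on self-correction.

```python
# A minimal sketch of uncertainty as an active control signal (illustration):
# high average token entropy triggers a self-correction pass.
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def generate_with_gate(model, prompt, threshold=1.5):
    """model(prompt) is assumed to return (answer, per-step probability lists)."""
    answer, step_probs = model(prompt)
    avg_h = sum(entropy(p) for p in step_probs) / len(step_probs)
    if avg_h > threshold:  # uncertain -> spend extra compute on revision
        answer, _ = model(f"{prompt}\nDouble-check the reasoning and revise:\n{answer}")
    return answer
```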

[AI-42] FARM: Field-Aware Resolution Model for Intelligent Trigger-Action Automation

【Quick Read】: This paper tackles function-level configuration for automated applet composition on Trigger-Action Programming (TAP) platforms: generating executable automation rules with correct ingredient-to-field bindings, without manual intervention. Existing work mostly predicts at the service level, often yielding non-executable applets. The key of the proposed FARM (Field-Aware Resolution Model) is a two-stage architecture: stage one retrieves candidate functions from the huge space of trigger-action pairs (2.2M possible combinations) using contrastive dual encoders with schema-enriched representations and selective layer freezing; stage two uses an LLM-based multi-agent pipeline for intent analysis, trigger and action selection via cross-schema scoring, and configuration verification coordinated through shared state and agreement-based selection, ultimately producing fully configured, end-to-end executable applets with high accuracy.

Link: https://arxiv.org/abs/2601.15687
Authors: Khusrav Badalov, Young Yoon
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Trigger-Action Programming (TAP) platforms such as IFTTT and Zapier enable Web of Things (WoT) automation by composing event-driven rules across heterogeneous services. A TAP applet links a trigger to an action and must bind trigger outputs (ingredients) to action inputs (fields) to be executable. Prior work largely treats TAP as service-level prediction from natural language, which often yields non-executable applets that still require manual configuration. We study the function-level configuration problem: generating complete applets with correct ingredient-to-field bindings. We propose FARM (Field-Aware Resolution Model), a two-stage architecture for automated applet generation with full configuration. Stage 1 trains contrastive dual encoders with selective layer freezing over schema-enriched representations, retrieving candidates from 1,724 trigger functions and 1,287 action functions (2.2M possible trigger-action pairs). Stage 2 performs selection and configuration using an LLM-based multi-agent pipeline. It includes intent analysis, trigger selection, action selection via cross-schema scoring, and configuration verification. Agents coordinate through shared state and agreement-based selection. FARM achieves 81% joint accuracy on Gold (62% Noisy, 70% One-shot) at the function level, where both trigger and action functions must match the ground truth. For comparison with service-level baselines, we map functions to their parent services and evaluate at the service level. FARM reaches 81% joint accuracy and improves over TARGE by 23 percentage points. FARM also generates ingredient-to-field bindings, producing executable automation configurations.

[AI-43] Improving Methodologies for Agentic Evaluations Across Domains: Leakage of Sensitive Information, Fraud and Cybersecurity Threats

【Quick Read】: This paper addresses the new risks introduced by autonomous AI systems interacting with the real world under reduced oversight, against a backdrop where AI-agent testing methods remain nascent and lack common standards. The key to the solution is international collaboration to standardize and refine agentic evaluation methodology: a consortium of AI evaluation institutes from Singapore, the UK, and other countries ran a third joint testing exercise focused on two risk strands, leakage of sensitive information and fraud (led by Singapore AISI) and cybersecurity (led by UK AISI), evaluating open- and closed-weight models on tasks from public agentic benchmarks. The aim was to systematically identify and improve methodological issues in such testing rather than to compare model capabilities, laying a scientific foundation for safe, reliable deployment of AI agents worldwide.

Link: https://arxiv.org/abs/2601.15679
Authors: Ee Wei Seah, Yongsen Zheng, Naga Nikshith, Mahran Morsidi, Gabriel Waikin Loh Matienzo, Nigel Gay, Akriti Vij, Benjamin Chua, En Qi Ng, Sharmini Johnson, Vanessa Wilfred, Wan Sie Lee, Anna Davidson, Catherine Devine, Erin Zorer, Gareth Holvey, Harry Coppock, James Walpole, Jerome Wynee, Magda Dubois, Michael Schmatz, Patrick Keane, Sam Deverett, Bill Black, Bo Yan, Bushra Sabir, Frank Sun, Hao Zhang, Harriet Farlow, Helen Zhou, Lingming Dong, Qinghua Lu, Seung Jang, Sharif Abuadbba, Simon O'Callaghan, Suyu Ma, Tom Howroyd, Cyrus Fung, Fatemeh Azadi, Isar Nejadgholi, Krishnapriya Vishnubhotla, Pulei Xiong, Saeedeh Lohrasbi, Scott Buffett, Shahrear Iqbal, Sowmya Vajjala, Anna Safont-Andreu, Luca Massarelli, Oskar van der Wal, Simon Möller, Agnes Delaborde, Joris Duguépéroux, Nicolas Rolin, Romane Gallienne, Sarah Behanzin, Tom Seimandi, Akiko Murakami, Takayuki Semitsu, Teresa Tsukiji, Angela Kinuthia, Michael Michie, Stephanie Kasaon, Jean Wangari, Hankyul Baek, Jaewon Noh, Kihyuk Nam, Sang Seo, Sungpil Shin, Taewhi Lee, Yongsu Kim
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: The author/contributor list organises contributors by country and alphabetical order within each country. In some places, the order has been altered to match other related publications

Abstract:The rapid rise of autonomous AI systems and advancements in agent capabilities are introducing new risks due to reduced oversight of real-world interactions. Yet agent testing remains nascent and is still a developing science. As AI agents begin to be deployed globally, it is important that they handle different languages and cultures accurately and securely. To address this, participants from The International Network for Advanced AI Measurement, Evaluation and Science, including representatives from Singapore, Japan, Australia, Canada, the European Commission, France, Kenya, South Korea, and the United Kingdom have come together to align approaches to agentic evaluations. This is the third exercise, building on insights from two earlier joint testing exercises conducted by the Network in November 2024 and February 2025. The objective is to further refine best practices for testing advanced AI systems. The exercise was split into two strands: (1) common risks, including leakage of sensitive information and fraud, led by Singapore AISI; and (2) cybersecurity, led by UK AISI. A mix of open and closed-weight models were evaluated against tasks from various public agentic benchmarks. Given the nascency of agentic testing, our primary focus was on understanding methodological issues in conducting such tests, rather than examining test results or model capabilities. This collaboration marks an important step forward as participants work together to advance the science of agentic evaluations.

[AI-44] Connect the Dots: Knowledge Graph-Guided Crawler Attack on Retrieval-Augmented Generation Systems

【Quick Read】: This paper addresses the information-leakage risk in privacy-sensitive Retrieval-Augmented Generation (RAG) systems, where adversaries can gradually extract sensitive content from the underlying corpus through carefully crafted multi-turn queries. Existing methods rely on heuristics and lack long-term planning, limiting the efficiency of systematic extraction. The key to the solution is formulating the RAG extraction attack as an Adaptive Stochastic Coverage Problem (ASCP) and achieving principled long-term planning under uncertainty by maximizing Conditional Marginal Gain (CMG). The proposed RAGCRAWLER maintains a global attacker-side state as a knowledge graph of revealed information, uses it to estimate CMG, and plans queries in semantic space that target unretrieved regions, achieving markedly higher coverage and content-reconstruction accuracy within a fixed query budget while remaining effective against advanced RAG systems that employ defences such as query rewriting and multi-query retrieval.

Link: https://arxiv.org/abs/2601.15678
Authors: Mengyu Yao, Ziqi Zhang, Ning Luo, Shaofei Li, Yifeng Cai, Xiangqun Chen, Yao Guo, Ding Li
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Retrieval-augmented generation (RAG) systems integrate document retrieval with large language models and have been widely adopted. However, in privacy-related scenarios, RAG introduces a new privacy risk: adversaries can issue carefully crafted queries to exfiltrate sensitive content from the underlying corpus gradually. Although recent studies have demonstrated multi-turn extraction attacks, they rely on heuristics and fail to perform long-term extraction planning. To address these limitations, we formulate the RAG extraction attack as an adaptive stochastic coverage problem (ASCP). In ASCP, each query is treated as a probabilistic action that aims to maximize conditional marginal gain (CMG), enabling principled long-term planning under uncertainty. However, integrating ASCP with practical RAG attack faces three key challenges: unobservable CMG, intractability in the action space, and feasibility constraints. To overcome these challenges, we maintain a global attacker-side state to guide the attack. Building on this idea, we introduce RAGCRAWLER, which builds a knowledge graph to represent revealed information, uses this global state to estimate CMG, and plans queries in semantic space that target unretrieved regions. In comprehensive experiments across diverse RAG architectures and datasets, our proposed method, RAGCRAWLER, consistently outperforms all baselines. It achieves up to 84.4% corpus coverage within a fixed query budget and deliver an average improvement of 20.7% over the top-performing baseline. It also maintains high semantic fidelity and strong content reconstruction accuracy with low attack cost. Crucially, RAGCRAWLER proves its robustness by maintaining effectiveness against advanced RAG systems employing query rewriting and multi-query retrieval strategies. Our work reveals significant security gaps and highlights the pressing need for stronger safeguards for RAG.
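To illustrate the coverage formulation, the toy sketch below (our simplification, not RAGCRAWLER's code) greedily selects the query with the highest estimated conditional marginal gain, i.e., the expected retrieval mass over chunks not yet revealed.

```python
# A schematic toy of CMG-guided query planning (illustration only):
# each candidate query carries estimated retrieval probabilities over chunks.
def cmg(query_probs, revealed):
    """Conditional marginal gain: expected coverage of unrevealed chunks."""
    return sum(p for chunk, p in query_probs.items() if chunk not in revealed)

def greedy_plan(candidates, budget):
    """candidates: {query: {chunk_id: retrieval prob}}; returns revealed chunks."""
    revealed = set()
    for _ in range(budget):
        best = max(candidates, key=lambda q: cmg(candidates[q], revealed))
        # Simulate execution: assume chunks with prob >= 0.5 are retrieved.
        revealed |= {c for c, p in candidates[best].items() if p >= 0.5}
    return revealed

queries = {
    "q1": {"doc1": 0.9, "doc2": 0.8},
    "q2": {"doc2": 0.7, "doc3": 0.6},
    "q3": {"doc4": 0.9},
}
print(greedy_plan(queries, budget=2))  # prefers queries covering new regions
```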

[AI-45] Enhancing guidance for missing data in diffusion-based sequential recommendation ICASSP2026

【Quick Read】: This paper addresses the degraded guidance quality in current diffusion-based generative recommendation caused by missing data in user sequences, and in particular existing methods' neglect of "critical turning points" in user interest, which harms prediction of subsequent behaviour. The key of the proposed Counterfactual Attention Regulation Diffusion model (CARD) lies in two components: (1) a Dual-side Thompson Sampling method that identifies sequences undergoing significant interest shift; and (2) a counterfactual attention mechanism that quantifies the importance of each item in such sequences and dynamically re-weights interaction vectors, providing the diffusion model with a high-quality guidance signal that improves generation.

Link: https://arxiv.org/abs/2601.15673
Authors: Qilong Yan, Yifei Xing, Dugang Liu, Jingpu Duan, Jian Yin
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted at ICASSP 2026

Abstract:Contemporary sequential recommendation methods are becoming more complex, shifting from classification to a diffusion-guided generative paradigm. However, the quality of guidance in the form of user information is often compromised by missing data in the observed sequences, leading to suboptimal generation quality. Existing methods address this by removing locally similar items, but overlook ``critical turning points’’ in user interest, which are crucial for accurately predicting subsequent user intent. To address this, we propose a novel Counterfactual Attention Regulation Diffusion model (CARD), which focuses on amplifying the signal from key interest-turning-point items while concurrently identifying and suppressing noise within the user sequence. CARD consists of (1) a Dual-side Thompson Sampling method to identify sequences undergoing significant interest shift, and (2) a counterfactual attention mechanism for these sequences to quantify the importance of each item. In this manner, CARD provides the diffusion model with a high-quality guidance signal composed of dynamically re-weighted interaction vectors to enable effective generation. Experiments show our method works well on real-world data without being computationally expensive. Our code is available at this https URL.
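For intuition on the dual-side sampling step, here is a minimal sketch (our reading of the abstract, with illustrative priors and thresholds): Beta posteriors over early and recent interaction rates are compared by Monte Carlo sampling, flagging sequences whose interest appears to have shifted.

```python
# A minimal sketch of dual-side Thompson sampling for interest-shift detection
# (our toy rendering; priors and the 0.95 threshold are illustrative).
import random

def interest_shift(early_hits, early_n, recent_hits, recent_n, trials=1000):
    votes = 0
    for _ in range(trials):
        p_early = random.betavariate(1 + early_hits, 1 + early_n - early_hits)
        p_recent = random.betavariate(1 + recent_hits, 1 + recent_n - recent_hits)
        votes += p_recent > p_early
    frac = votes / trials
    return frac > 0.95 or frac < 0.05  # strong evidence of a shift, either way

print(interest_shift(early_hits=2, early_n=20, recent_hits=15, recent_n=20))
```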

[AI-46] StreetDesignAI: A Multi-Persona Evaluation System for Inclusive Infrastructure Design

【Quick Read】: This paper addresses the difficulty of balancing the needs of diverse user groups in urban cycling-infrastructure design, where designers who cannot empathize with different cyclists' experiences easily overlook vulnerable or cautious users. The key to the solution is a persona-based multi-agent evaluation approach: StreetDesignAI simulates cyclist personas ranging from confident to cautious, grounds evaluation in real street imagery and map data, and provides parallel multi-perspective feedback; as designers iterate, it surfaces experiential conflicts across personas, shifting design exploration from single-perspective optimization toward deliberate trade-off reasoning and significantly improving designers' understanding of, and ability to act on, diverse user needs.

Link: https://arxiv.org/abs/2601.15671
Authors: Ziyi Wang, Yilong Dai, Duanya Lyu, Mateo Nader, Sihan Chen, Wanghao Ye, Zjian Ding, Xiang Yan
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Designing inclusive cycling infrastructure requires balancing competing needs of diverse user groups, yet designers often struggle to anticipate how different cyclists experience the same street. We investigate how persona-based multi-agent evaluation can support inclusive design by making experiential conflicts explicit. We present StreetDesignAI, an interactive system that enables designers to (1) ground evaluation in street context through imagery and map data, (2) receive parallel feedback from cyclist personas spanning confident to cautious users, and (3) iteratively modify designs while surfacing conflicts across perspectives. A within-subjects study with 26 transportation professionals demonstrates that structured multi-perspective feedback significantly improves designers’ understanding of diverse user perspectives, ability to identify persona needs, and confidence in translating them into design decisions, with higher satisfaction and stronger intention for professional adoption. Qualitative findings reveal how conflict surfacing transforms design exploration from single-perspective optimization toward deliberate trade-off reasoning. We discuss implications for AI tools that scaffold inclusive design through disagreement as an interaction primitive.

[AI-47] mpoNet: Learning Realistic Communication and Timing Patterns for Network Traffic Simulation

【Quick Read】: This paper addresses the difficulty of generating benign background traffic for realistic network-traffic simulation, especially capturing the complex temporal dynamics and communication patterns of real networks. Prior approaches based on GANs, LLMs, and Bayesian methods fail to reproduce structured temporal variation, so the generated traffic lacks realism. The key to the solution is TempoNet, a novel generative model combining multi-task learning with multi-mark temporal point processes to jointly model packet inter-arrival times and all packet- and flow-header fields, capturing fine-grained timing patterns and higher-order correlations (such as host-pair behaviour and seasonal trends) and producing temporally consistent, high-fidelity traffic traces.

Link: https://arxiv.org/abs/2601.15663
Authors: Kristen Moore, Diksha Goel, Cody James Christopher, Zhen Wang, Minjune Kim, Ahmed Ibrahim, Ahmad Mohsin, Seyit Camtepe
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Realistic network traffic simulation is critical for evaluating intrusion detection systems, stress-testing network protocols, and constructing high-fidelity environments for cybersecurity training. While attack traffic can often be layered into training environments using red-teaming or replay methods, generating authentic benign background traffic remains a core challenge – particularly in simulating the complex temporal and communication dynamics of real-world networks. This paper introduces TempoNet, a novel generative model that combines multi-task learning with multi-mark temporal point processes to jointly model inter-arrival times and all packet- and flow-header fields. TempoNet captures fine-grained timing patterns and higher-order correlations such as host-pair behavior and seasonal trends, addressing key limitations of GAN-, LLM-, and Bayesian-based methods that fail to reproduce structured temporal variation. TempoNet produces temporally consistent, high-fidelity traces, validated on real-world datasets. Furthermore, we show that intrusion detection models trained on TempoNet-generated background traffic perform comparably to those trained on real data, validating its utility for real-world security applications.
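For readers unfamiliar with temporal point processes, the sketch below (a generic toy, not TempoNet's learned model) samples inter-arrival times from a self-exciting intensity via thinning; TempoNet replaces such a hand-written intensity with a learned, multi-mark conditional model.

```python
# A toy temporal-point-process sampler via Ogata-style thinning (illustration).
import math, random

def intensity(t, history, mu=0.5, alpha=0.8, beta=2.0):
    """Toy self-exciting intensity; a model like TempoNet learns this instead."""
    return mu + alpha * sum(math.exp(-beta * (t - s)) for s in history)

def sample_arrivals(horizon, lam_max=5.0):
    # lam_max is assumed to upper-bound the intensity over this horizon.
    t, history = 0.0, []
    while t < horizon:
        t += random.expovariate(lam_max)              # candidate arrival time
        if t < horizon and random.random() < intensity(t, history) / lam_max:
            history.append(t)                          # accepted packet arrival
    return history

print(sample_arrivals(10.0))
```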

[AI-48] Integrating Knowledge Distillation Methods: A Sequential Multi-Stage Framework

【Quick Read】: This paper addresses the difficulty of integrating multiple knowledge distillation (KD) methods: existing attempts to combine heterogeneous knowledge (response-based, feature-based, and relation-based) suffer from complex implementation, inflexible combinations, and catastrophic forgetting, limiting practical effectiveness. The key of the proposed Sequential Multi-Stage Knowledge Distillation (SMSKD) framework is to apply different KD methods stage by stage, using a frozen reference model from the previous stage to anchor knowledge the student has already learned and so mitigate forgetting, together with an adaptive weighting mechanism based on the Teacher True Class Probability (TCP) that adjusts the per-sample reference-loss weight to balance knowledge retention against integration, achieving flexible and efficient fusion of heterogeneous KD methods with negligible computational overhead.

Link: https://arxiv.org/abs/2601.15657
Authors: Yinxi Tian, Changwu Huang, Ke Tang, Xin Yao
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Knowledge distillation (KD) transfers knowledge from large teacher models to compact student models, enabling efficient deployment on resource-constrained devices. While diverse KD methods, including response-based, feature-based, and relation-based approaches, capture different aspects of teacher knowledge, integrating multiple methods or knowledge sources is promising but often hampered by complex implementation, inflexible combinations, and catastrophic forgetting, which limits practical effectiveness. This work proposes SMSKD (Sequential Multi-Stage Knowledge Distillation), a flexible framework that sequentially integrates heterogeneous KD methods. At each stage, the student is trained with a specific distillation method, while a frozen reference model from the previous stage anchors learned knowledge to mitigate forgetting. In addition, we introduce an adaptive weighting mechanism based on the teacher true class probability (TCP) that dynamically adjusts the reference loss per sample to balance knowledge retention and integration. By design, SMSKD supports arbitrary method combinations and stage counts with negligible computational overhead. Extensive experiments show that SMSKD consistently improves student accuracy across diverse teacher-student architectures and method combinations, outperforming existing baselines. Ablation studies confirm that stage-wise distillation and reference model supervision are primary contributors to performance gains, with TCP-based adaptive weighting providing complementary benefits. Overall, SMSKD is a practical and resource-efficient solution for integrating heterogeneous KD methods.
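A minimal sketch of one SMSKD training stage, reconstructed from the abstract (the exact loss composition and the TCP gather are our assumptions about the form, not the authors' code):

```python
# One SMSKD stage: stage-specific KD loss + TCP-weighted anchor to the frozen
# previous-stage reference model (our reconstruction, illustration only).
import torch
import torch.nn.functional as F

def smskd_stage_loss(student_logits, teacher_logits, reference_logits, labels, T=4.0):
    # Stage-specific KD term (response-based here; other stages may swap this).
    kd = F.kl_div(F.log_softmax(student_logits / T, -1),
                  F.softmax(teacher_logits / T, -1),
                  reduction="batchmean") * T * T
    # Anchor the student to the frozen reference model from the previous stage.
    anchor = F.kl_div(F.log_softmax(student_logits, -1),
                      F.softmax(reference_logits, -1),
                      reduction="none").sum(-1)
    # TCP weighting: trust the anchor more where the teacher is confident.
    tcp = F.softmax(teacher_logits, -1).gather(1, labels.unsqueeze(1)).squeeze(1)
    return kd + (tcp * anchor).mean() + F.cross_entropy(student_logits, labels)
```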

[AI-49] Predictive Coding and Information Bottleneck for Hallucination Detection in Large Language Models

【Quick Read】: This paper addresses hallucination in large language models (LLMs), i.e., generations that are plausible but factually unfaithful, a major barrier to high-stakes deployment. The key to the solution is a hybrid detection framework combining neuroscience-inspired signal design with supervised learning: it extracts a Predictive-Coding-based "surprise" signal and an Information-Bottleneck-based "signal retention under perturbation" signal to build an interpretable feature space, further augmented with Entity-Focused Uptake, Context Adherence, and a Falsifiability Score. On the HaluBench dataset the method reaches 0.8669 AUROC with 75x less training data, 1000x faster inference, and full interpretability, clearly outperforming approaches that rely on large black-box LLM judges.

Link: https://arxiv.org/abs/2601.15652
Authors: Manish Bhatt
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Emerging Technologies (cs.ET)
Comments:

Abstract:Hallucinations in Large Language Models (LLMs) – generations that are plausible but factually unfaithful – remain a critical barrier to high-stakes deployment. Current detection methods typically rely on computationally expensive external retrieval loops or opaque black-box LLM judges requiring 70B+ parameters. In this work, we introduce [Model Name], a hybrid detection framework that combines neuroscience-inspired signal design with supervised machine learning. We extract interpretable signals grounded in Predictive Coding (quantifying surprise against internal priors) and the Information Bottleneck (measuring signal retention under perturbation). Through systematic ablation, we demonstrate three key enhancements: Entity-Focused Uptake (concentrating on high-value tokens), Context Adherence (measuring grounding strength), and Falsifiability Score (detecting confident but contradictory claims). Evaluating on HaluBench (n=200, perfectly balanced), our theory-guided baseline achieves 0.8017 AUROC. BASE supervised models reach 0.8274 AUROC, while IMPROVED features boost performance to 0.8669 AUROC (4.95% gain), demonstrating consistent improvements across architectures. This competitive performance is achieved while using 75x less training data than Lynx (200 vs 15,000 samples), 1000x faster inference (5ms vs 5s), and remaining fully interpretable. Crucially, we report a negative result: the Rationalization signal fails to distinguish hallucinations, suggesting that LLMs generate coherent reasoning for false premises ("Sycophancy"). This work demonstrates that domain knowledge encoded in signal architecture provides superior data efficiency compared to scaling LLM judges, achieving strong performance with lightweight (less than 1M parameters), explainable models suitable for production deployment.
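Schematically, the recipe reduces to computing a handful of interpretable signals and fitting a small supervised classifier; the sketch below uses toy feature values (the five feature names follow the abstract, but the numbers and extraction code are made up for illustration).

```python
# A schematic sketch of the hybrid recipe: interpretable signals plus a
# lightweight supervised classifier instead of an LLM judge (toy data).
import numpy as np
from sklearn.linear_model import LogisticRegression

def surprise(token_logprobs):
    """Predictive-coding-style signal: mean negative log-probability."""
    return -np.mean(token_logprobs)

# Rows: [surprise, signal_retention, entity_uptake, context_adherence, falsifiability]
X_train = np.array([[5.2, 0.3, 0.2, 0.4, 0.9],
                    [1.1, 0.8, 0.7, 0.9, 0.1],
                    [4.8, 0.4, 0.3, 0.5, 0.8],
                    [0.9, 0.9, 0.8, 0.8, 0.2]])
y_train = np.array([1, 0, 1, 0])  # 1 = hallucination

clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict_proba([[4.0, 0.5, 0.3, 0.4, 0.7]])[:, 1])  # hallucination risk
```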

[AI-50] Agentic AI Governance and Lifecycle Management in Healthcare

【Quick Read】: This paper addresses the "agent sprawl" problem healthcare organizations face when deploying generative AI (Generative AI) agents: as agent capabilities diffuse across departments and vendors, systems suffer duplicated deployments, unclear accountability, inconsistent controls, and tool permissions persisting beyond their original use case. Existing AI governance frameworks emphasize lifecycle risk management but give little guidance for the day-to-day operation of agent fleets. The key to the solution is a Unified Agent Lifecycle Management (UALM) blueprint that achieves auditable, traceable, safely scalable governance through five control-plane layers: (1) an identity and persona registry; (2) orchestration and cross-domain mediation; (3) PHI-bounded context and memory; (4) runtime policy enforcement with kill-switch triggers; and (5) lifecycle management linked to credential revocation and audit logging, giving healthcare CIOs, CISOs, and clinical leaders an implementable governance pattern that balances local innovation with safe scaling.

Link: https://arxiv.org/abs/2601.15630
Authors: Chandra Prakash, Mary Lind, Avneesh Sisodia
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures

Abstract:Healthcare organizations are beginning to embed agentic AI into routine workflows, including clinical documentation support and early-warning monitoring. As these capabilities diffuse across departments and vendors, health systems face agent sprawl, causing duplicated agents, unclear accountability, inconsistent controls, and tool permissions that persist beyond the original use case. Existing AI governance frameworks emphasize lifecycle risk management but provide limited guidance for the day-to-day operations of agent fleets. We propose a Unified Agent Lifecycle Management (UALM) blueprint derived from a rapid, practice-oriented synthesis of governance standards, agent security literature, and healthcare compliance requirements. UALM maps recurring gaps onto five control-plane layers: (1) an identity and persona registry, (2) orchestration and cross-domain mediation, (3) PHI-bounded context and memory, (4) runtime policy enforcement with kill-switch triggers, and (5) lifecycle management and decommissioning linked to credential revocation and audit logging. A companion maturity model supports staged adoption. UALM offers healthcare CIOs, CISOs, and clinical leaders an implementable pattern for audit-ready oversight that preserves local innovation and enables safer scaling across clinical and administrative domains.

[AI-51] CogToM: A Comprehensive Theory of Mind Benchmark inspired by Human Cognition for Large Language Models

【Quick Read】: This paper addresses the ongoing debate over whether large language models (LLMs) possess human-like Theory of Mind (ToM): existing benchmarks are mostly confined to narrow paradigms (such as false-belief tasks) and fail to capture the full spectrum of human cognitive mechanisms. The key to the solution is CogToM, a theoretically grounded bilingual benchmark covering 46 cognitive paradigms with over 8,000 annotated instances, validated by 49 human annotators. A systematic evaluation of 22 representative models (including frontier models such as GPT-5.1 and Qwen3-Max) reveals significant performance heterogeneity across cognitive dimensions and persistent bottlenecks, providing a robust instrument and fresh perspective for studying the evolving cognitive boundaries of LLMs.

Link: https://arxiv.org/abs/2601.15628
Authors: Haibo Tong, Zeyang Yue, Feifei Zhao, Erliang Lin, Lu Jia, Ruolin Chen, Yinqian Sun, Qian Zhang, Yi Zeng
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Whether Large Language Models (LLMs) truly possess human-like Theory of Mind (ToM) capabilities has garnered increasing attention. However, existing benchmarks remain largely restricted to narrow paradigms like false belief tasks, failing to capture the full spectrum of human cognitive mechanisms. We introduce CogToM, a comprehensive, theoretically grounded benchmark comprising over 8000 bilingual instances across 46 paradigms, validated by 49 human annotators. A systematic evaluation of 22 representative models, including frontier models like GPT-5.1 and Qwen3-Max, reveals significant performance heterogeneity and highlights persistent bottlenecks in specific dimensions. Further analysis based on human cognitive patterns suggests potential divergences between LLM and human cognitive structures. CogToM offers a robust instrument and perspective for investigating the evolving cognitive boundaries of LLMs.

[AI-52] Bridging Qualitative Rubrics and AI: A Binary Question Framework for Criterion-Referenced Grading in Engineering

【Quick Read】: This paper addresses the low efficiency and uneven quality of manual grading for mathematical assessments in engineering, where demonstrators who grade against model solutions face heavy workloads and provide insufficient feedback. The key to the solution is combining generative AI (Generative AI) with a criterion-referenced grading framework: structured binary questions yield precise grade decisions, and high-quality, scalable formative-feedback generation is embedded directly in the assessment workflow, improving both grading efficiency and feedback quality. Empirically, the approach matches the accuracy of two experienced demonstrators and clearly enriches feedback completeness, but it cannot yet fully replace human review, especially for unconventional solutions.

Link: https://arxiv.org/abs/2601.15626
Authors: Lili Chen, Winn Wing-Yiu Chow, Stella Peng, Bencheng Fan, Sachitha Bandara
Affiliations: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: Proceedings of the 36th Annual Conference of the Australasian Association for Engineering Education (AAEE 2025)

Abstract:PURPOSE OR GOAL: This study investigates how GenAI can be integrated with a criterion-referenced grading framework to improve the efficiency and quality of grading for mathematical assessments in engineering. It specifically explores the challenges demonstrators face with manual, model solution-based grading and how a GenAI-supported system can be designed to reliably identify student errors, provide high-quality feedback, and support human graders. The research also examines human graders’ perceptions of the effectiveness of this GenAI-assisted approach. ACTUAL OR ANTICIPATED OUTCOMES: The study found that GenAI achieved an overall grading accuracy of 92.5%, comparable to two experienced human graders. The two researchers, who also served as subject demonstrators, perceived the GenAI as a helpful second reviewer that improved accuracy by catching small errors and provided more complete feedback than they could manually. A central outcome was the significant enhancement of formative feedback. However, they noted the GenAI tool is not yet reliable enough for autonomous use, especially with unconventional solutions. CONCLUSIONS/RECOMMENDATIONS/SUMMARY: This study demonstrates that GenAI, when paired with a structured, criterion-referenced framework using binary questions, can grade engineering mathematical assessments with an accuracy comparable to human experts. Its primary contribution is a novel methodological approach that embeds the generation of high-quality, scalable formative feedback directly into the assessment workflow. Future work should investigate student perceptions of GenAI grading and feedback.

[AI-53] Robust Tool Use via Fission-GRPO: Learning to Recover from Execution Errors

【Quick Read】: This paper addresses the brittleness of large language models (LLMs) in multi-turn tool execution: after a tool-call error, models easily fall into repetitive invalid retries, failing to interpret error feedback and self-correct, which limits reliable real-world deployment. The key of the proposed Fission-GRPO framework is to convert execution errors into corrective supervision inside the reinforcement-learning (RL) training loop: a fine-tuned Error Simulator provides diagnostic feedback, each failed trajectory is "fissioned" into a new training instance, and recovery rollouts are then resampled on-policy, so the model learns from the actual errors produced during its own exploration rather than from distribution-mismatched pre-collected error datasets, markedly improving error recovery and overall task accuracy.

Link: https://arxiv.org/abs/2601.15625
Authors: Zhiwei Zhang, Fei Zhao, Rui Wang, Zezhong Wang, Bin Liang, Jiakang Wang, Yao Hu, Shaosheng Cao, Kam-Fai Wong
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures, 2 tables

Abstract:Large language models (LLMs) can call tools effectively, yet they remain brittle in multi-turn execution: following a tool call error, smaller models often degenerate into repetitive invalid re-invocations, failing to interpret error feedback and self-correct. This brittleness hinders reliable real-world deployment, where the execution errors are inherently inevitable during tool interaction procedures. We identify a key limitation of current approaches: standard reinforcement learning (RL) treats errors as sparse negative rewards, providing no guidance on how to recover, while pre-collected synthetic error-correction datasets suffer from distribution mismatch with the model’s on-policy error modes. To bridge this gap, we propose Fission-GRPO, a framework that converts execution errors into corrective supervision within the RL training loop. Our core mechanism fissions each failed trajectory into a new training instance by augmenting it with diagnostic feedback from a finetuned Error Simulator, then resampling recovery rollouts on-policy. This enables the model to learn from the precise errors it makes during exploration, rather than from static, pre-collected error cases. On the BFCL v4 Multi-Turn, Fission-GRPO improves the error recovery rate of Qwen3-8B by 5.7% absolute, crucially, yielding a 4% overall accuracy gain (42.75% to 46.75%) over GRPO and outperforming specialized tool-use agents.
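The core "fission" step can be sketched as follows (our paraphrase of the abstract; `policy.rollout`, `policy.rollout_from`, and the message format are hypothetical): a failed trajectory plus simulated diagnostic feedback becomes a new prompt from which recovery rollouts are sampled on-policy.

```python
# A minimal sketch of trajectory fission for on-policy error recovery
# (illustration; the policy API and message schema are assumptions).
def fission(trajectory, error_simulator):
    """trajectory: list of turns; error_simulator produces diagnostic feedback."""
    diagnosis = error_simulator(trajectory)   # e.g. "missing 'date' argument"
    return trajectory + [{"role": "tool",
                          "content": f"Error: {diagnosis}. "
                                     "Interpret the error and retry."}]

def collect_rollouts(policy, tasks, error_simulator, k=4):
    batch = []
    for task in tasks:
        traj, failed = policy.rollout(task)
        if failed:
            prompt = fission(traj, error_simulator)
            batch += [policy.rollout_from(prompt) for _ in range(k)]  # recovery
        else:
            batch.append(traj)
    return batch  # fed to the GRPO update
```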

[AI-54] Autonomous Business System via Neuro-symbolic AI

【Quick Read】: This paper addresses the inability of enterprise systems, organized around siloed departments, rigid workflows, and hard-coded automation, to reconfigure cross-functional processes in dynamic business environments, as well as the lack of deterministic, verifiable execution of complex business logic in large language models (LLMs). The key of the proposed AUTOBUS is a neuro-symbolic architecture integrating LLM-driven AI agents, predicate-logic programming, and a knowledge graph of business-semantics-centric enterprise data. Business initiatives are modelled as task networks with explicit pre/post conditions, required data, evaluation rules, and API-level actions, and a logic engine executes task-specific logic programs, enabling end-to-end orchestration of business processes while humans retain oversight of semantics, policies, and high-impact decisions, ensuring interpretability, controllability, and adaptability.

Link: https://arxiv.org/abs/2601.15599
Authors: Cecil Pang, Hiroki Sayama
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted to IEEE SysCon 2026

Abstract:Current business environments require organizations to continuously reconfigure cross-functional processes, yet enterprise systems are still organized around siloed departments, rigid workflows, and hard-coded automation. Meanwhile large language models (LLMs) excel at interpreting natural language and unstructured data but lack deterministic, verifiable execution of complex business logic. To address this gap, here we introduce AUTOBUS, an Autonomous Business System that integrates LLM-based AI agents, predicate-logic programming, and business-semantics-centric enterprise data into a coherent neuro-symbolic AI architecture for orchestrating end-to-end business initiatives. AUTOBUS models an initiative as a network of tasks with explicit pre/post conditions, required data, evaluation rules, and API-level actions. Enterprise data is organized as a knowledge graph whose entities, relationships, and constraints are translated into logic facts and foundational rules, providing the semantic grounding for task reasoning. Core AI agents synthesize task instructions, enterprise semantics, and available tools into task-specific logic programs, which are executed by a logic engine that enforces constraints, coordinates auxiliary tools, and orchestrate execution of actions and outcomes. Humans define and maintain the semantics, policies and task instructions, curate tools, and supervise high-impact or ambiguous decisions, ensuring accountability and adaptability. We detail the AUTOBUS architecture, the anatomy of the AI agent generated logic programs, and the role of humans and auxiliary tools in the lifecycle of a business initiative.

[AI-55] DeepASMR: LLM-Based Zero-Shot ASMR Speech Generation for Anyone of Any Voice

【Quick Read】: This paper addresses the marked difficulty current text-to-speech (TTS) systems have in generating Autonomous Sensory Meridian Response (ASMR), a low-intensity, often unvoiced speech style, especially without zero-shot adaptation to a target speaker. The key to the solution is the observation that discrete speech tokens softly factorize ASMR style from speaker timbre, leading to a two-stage framework: a large language model (LLM) for content-and-style encoding in the first stage, and a flow-matching acoustic decoder for timbre reconstruction in the second. The method synthesizes high-fidelity ASMR speech from a single ordinary read-style snippet of the target speaker, requiring no whispered training data from that speaker, and thus enables efficient, general zero-shot ASMR generation.

Link: https://arxiv.org/abs/2601.15596
Authors: Leying Zhang, Tingxiao Zhou, Haiyang Sun, Mengxiao Bi, Yanmin Qian
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:While modern Text-to-Speech (TTS) systems achieve high fidelity for read-style speech, they struggle to generate Autonomous Sensory Meridian Response (ASMR), a specialized, low-intensity speech style essential for relaxation. The inherent challenges include ASMR’s subtle, often unvoiced characteristics and the demand for zero-shot speaker adaptation. In this paper, we introduce DeepASMR, the first framework designed for zero-shot ASMR generation. We demonstrate that a single short snippet of a speaker’s ordinary, read-style speech is sufficient to synthesize high-fidelity ASMR in their voice, eliminating the need for whispered training data from the target speaker. Methodologically, we first identify that discrete speech tokens provide a soft factorization of ASMR style from speaker timbre. Leveraging this insight, we propose a two-stage pipeline incorporating a Large Language Model (LLM) for content-style encoding and a flow-matching acoustic decoder for timbre reconstruction. Furthermore, we contribute DeepASMR-DB, a comprehensive 670-hour English-Chinese multi-speaker ASMR speech corpus, and introduce a novel evaluation protocol integrating objective metrics, human listening tests, LLM-based scoring and unvoiced speech analysis. Extensive experiments confirm that DeepASMR achieves state-of-the-art naturalness and style fidelity in ASMR generation for anyone of any voice, while maintaining competitive performance on normal speech synthesis.

[AI-56] Data-Free Privacy-Preserving for LLMs via Model Inversion and Selective Unlearning

【Quick Read】: This paper addresses privacy leakage caused by large language models (LLMs) memorizing sensitive personally identifiable information (PII) during training, and in particular the practical limitation that existing machine-unlearning methods require access to the original training data. The key of the proposed Data-Free Selective Unlearning (DFSU) framework is that it needs no training data at all: pseudo-PII samples are first generated via language-model inversion, token-level privacy masks are then built for these synthetic samples, and finally token-level selective unlearning is performed with a contrastive mask loss within a low-rank adaptation (LoRA) subspace, removing the target PII without harming overall model utility.

Link: https://arxiv.org/abs/2601.15595
Authors: Xinjie Zhou, Zhihui Yang, Lechao Cheng, Sai Wu, Gang Chen
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) exhibit powerful capabilities but risk memorizing sensitive personally identifiable information (PII) from their training data, posing significant privacy concerns. While machine unlearning techniques aim to remove such data, they predominantly depend on access to the training data. This requirement is often impractical, as training data in real-world deployments is commonly proprietary or inaccessible. To address this limitation, we propose Data-Free Selective Unlearning (DFSU), a novel privacy-preserving framework that removes sensitive PII from an LLM without requiring its training data. Our approach first synthesizes pseudo-PII through language model inversion, then constructs token-level privacy masks for these synthetic samples, and finally performs token-level selective unlearning via a contrastive mask loss within a low-rank adaptation (LoRA) subspace. Extensive experiments on the AI4Privacy PII-Masking dataset using Pythia models demonstrate that our method effectively removes target PII while maintaining model utility.
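A minimal sketch of the token-level idea follows (our simplification; the paper's contrastive mask loss inside a LoRA subspace is more elaborate): suppress likelihood only where the privacy mask marks pseudo-PII tokens, and preserve ordinary language-modeling behaviour elsewhere.

```python
# Token-level selective unlearning via a masked loss (illustration only).
import torch
import torch.nn.functional as F

def dfsu_loss(logits, labels, pii_mask, lam=1.0):
    """logits: [B, T, V]; labels: [B, T]; pii_mask: 1 where the label token
    is (pseudo-)PII, else 0."""
    nll = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")
    retain = (nll * (1 - pii_mask)).mean()   # preserve utility on normal tokens
    forget = (-nll * pii_mask).mean()        # gradient *ascent* on PII tokens
    return retain + lam * forget
```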

[AI-57] MapViT: A Two-Stage ViT-Based Framework for Real-Time Radio Quality Map Prediction in Dynamic Environments

【Quick Read】: This paper addresses the difficulty mobile robots face in achieving accurate environment perception and reliable wireless-signal-quality prediction in highly dynamic environments, a key prerequisite for autonomous navigation and stable operation. The key to the solution is MapViT, a two-stage Vision Transformer (ViT) framework inspired by the pre-train/fine-tune paradigm of large language models (LLMs): self-supervised pre-training first builds a geometry foundation model that improves data efficiency and cross-scenario transfer, and the downstream stage then jointly predicts environmental changes and expected radio-signal quality in real time. Experiments show a strong balance between accuracy and computational efficiency, making the approach well suited to resource-constrained mobile-robot platforms and laying a foundation for next-generation digital-twin ecosystems and multi-modal intelligent systems in the 6G era.

Link: https://arxiv.org/abs/2601.15578
Authors: Cyril Shih-Huan Hsu, Xi Li, Lanfranco Zanzi, Zhiheng Yang, Chrysa Papagianni, Xavier Costa Pérez
Affiliations: Unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: This paper has been accepted for publication at IEEE International Conference on Communications (ICC) 2026

Abstract:Recent advancements in mobile and wireless networks are unlocking the full potential of robotic autonomy, enabling robots to take advantage of ultra-low latency, high data throughput, and ubiquitous connectivity. However, for robots to navigate and operate seamlessly, efficiently and reliably, they must have an accurate understanding of both their surrounding environment and the quality of radio signals. Achieving this in highly dynamic and ever-changing environments remains a challenging and largely unsolved problem. In this paper, we introduce MapViT, a two-stage Vision Transformer (ViT)-based framework inspired by the success of pre-train and fine-tune paradigm for Large Language Models (LLMs). MapViT is designed to predict both environmental changes and expected radio signal quality. We evaluate the framework using a set of representative Machine Learning (ML) models, analyzing their respective strengths and limitations across different scenarios. Experimental results demonstrate that the proposed two-stage pipeline enables real-time prediction, with the ViT-based implementation achieving a strong balance between accuracy and computational efficiency. This makes MapViT a promising solution for energy- and resource-constrained platforms such as mobile robots. Moreover, the geometry foundation model derived from the self-supervised pre-training stage improves data efficiency and transferability, enabling effective downstream predictions even with limited labeled data. Overall, this work lays the foundation for next-generation digital twin ecosystems, and it paves the way for a new class of ML foundation models driving multi-modal intelligence in future 6G-enabled systems.

[AI-58] PromptHelper: A Prompt Recommender System for Encouraging Creativity in AI Chatbot Interactions

【Quick Read】: This paper addresses common difficulties users face when interacting with generative AI systems, including exploring alternative directions, articulating creative intent, and understanding how prompt variations shape model outputs. The key to the solution is an interaction approach termed Prompt Recommender Systems (PRS): the prototype PromptHelper, integrated into an AI chatbot, surfaces semantically diverse, contextually relevant follow-up prompt suggestions while users work on real writing tasks. Its core contribution is significantly increasing users' perceived exploration and expressiveness without raising cognitive workload, validated in an empirical study, along with design principles for scaffolding exploratory interaction while preserving user agency.

Link: https://arxiv.org/abs/2601.15575
Authors: Jason Kim, Maria Teleki, James Caverlee
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Prompting is central to interaction with AI systems, yet many users struggle to explore alternative directions, articulate creative intent, or understand how variations in prompts shape model outputs. We introduce prompt recommender systems (PRS) as an interaction approach that supports exploration, suggesting contextually relevant follow-up prompts. We present PromptHelper, a PRS prototype integrated into an AI chatbot that surfaces semantically diverse prompt suggestions while users work on real writing tasks. We evaluate PromptHelper in a 2x2 fully within-subjects study (N=32) across creative and academic writing tasks. Results show that PromptHelper significantly increases users’ perceived exploration and expressiveness without increasing cognitive workload. Qualitative findings illustrate how prompt recommendations help users branch into new directions, overcome uncertainty about what to ask next, and better articulate their intent. We discuss implications for designing AI interfaces that scaffold exploratory interaction while preserving user agency, and release open-source resources to support research on prompt recommendation.
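One plausible way to surface "semantically diverse" suggestions, shown purely as an assumption on our part since the abstract does not specify the algorithm, is maximal-marginal-relevance selection over prompt embeddings.

```python
# Maximal-marginal-relevance selection of diverse follow-up prompts
# (our construction, not PromptHelper's published algorithm).
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def recommend(context_vec, cand_vecs, k=3, lam=0.7):
    chosen, pool = [], list(range(len(cand_vecs)))
    while pool and len(chosen) < k:
        def mmr(i):
            redundancy = max((cos(cand_vecs[i], cand_vecs[j]) for j in chosen),
                             default=0.0)
            return lam * cos(cand_vecs[i], context_vec) - (1 - lam) * redundancy
        best = max(pool, key=mmr)
        chosen.append(best)
        pool.remove(best)
    return chosen  # indices of diverse, relevant prompt suggestions
```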

[AI-59] BanditLP: Large-Scale Stochastic Optimization for Personalized Recommendations

【Quick Read】: This paper addresses how to achieve both personalization and constraint satisfaction in large-scale online decision making across multiple stakeholders. Traditional contextual bandits struggle to combine objective-specific learned outcomes with the complex constraints of real deployments, while existing linear-programming (LP) approaches lack adaptivity to dynamic environments. The key of the proposed BanditLP framework is to use neural Thompson Sampling to learn individualized outcome models per objective and to couple them with a scalable large-scale LP solver for constrained action selection at serving time. The method is application-agnostic, compatible with arbitrary neural architectures, supports LPs with billions of variables, and has delivered business wins in a real industrial system (LinkedIn email marketing), unifying exploration with constrained optimization.

Link: https://arxiv.org/abs/2601.15552
Authors: Phuc Nguyen, Benjamin Zelditch, Joyce Chen, Rohit Patra, Changshuai Wei
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:We present BanditLP, a scalable multi-stakeholder contextual bandit framework that unifies neural Thompson Sampling for learning objective-specific outcomes with a large-scale linear program for constrained action selection at serving time. The methodology is application-agnostic, compatible with arbitrary neural architectures, and deployable at web scale, with an LP solver capable of handling billions of variables. Experiments on public benchmarks and synthetic data show consistent gains over strong baselines. We apply this approach in LinkedIn’s email marketing system and demonstrate business win, illustrating the value of integrated exploration and constrained optimization in production.
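A toy rendering of the serve-time step (our simplification to two actions and one constraint; production BanditLP solves LPs with billions of variables): Thompson-sample expected rewards, then solve a small LP for a constrained action distribution.

```python
# Thompson sampling + LP-constrained action selection (toy illustration).
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
reward = rng.normal([0.30, 0.20], 0.05)   # posterior samples per action
cost = np.array([1.0, 0.2])               # e.g. expected email fatigue
budget = 0.5

# maximize reward @ x  s.t.  cost @ x <= budget, sum(x) == 1, x >= 0
res = linprog(-reward, A_ub=[cost], b_ub=[budget],
              A_eq=[np.ones(2)], b_eq=[1.0], bounds=[(0, 1), (0, 1)])
print(res.x)  # constrained action-selection distribution
```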

[AI-60] ALIGNAgent: Adaptive Learner Intelligence for Gap Identification and Next-step guidance

【Quick Read】: This paper addresses the fragmentation of current personalized learning systems, which tend to specialize in a single function, knowledge tracing, diagnostic modeling, or resource recommendation, without integrating these modules into a coherent adaptive loop. The key of the proposed ALIGNAgent (Adaptive Learner Intelligence for Gap Identification and Next-step guidance), a multi-agent educational framework, is a closed "assess-diagnose-intervene-feedback" cycle: a Skill Gap Agent performs concept-level diagnostic reasoning to pinpoint each student's learning gaps, and a Recommender Agent then recommends targeted learning resources based on the diagnosis and user preferences, markedly improving the coherence and effectiveness of personalized learning.

Link: https://arxiv.org/abs/2601.15551
Authors: Bismack Tokoli, Luis Jaimes, Ayesha S. Dina
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 35 pages

Abstract:Personalized learning systems have emerged as a promising approach to enhance student outcomes by tailoring educational content, pacing, and feedback to individual needs. However, most existing systems remain fragmented, specializing in either knowledge tracing, diagnostic modeling, or resource recommendation, but rarely integrating these components into a cohesive adaptive cycle. In this paper, we propose ALIGNAgent (Adaptive Learner Intelligence for Gap Identification and Next-step guidance), a multi-agent educational framework designed to deliver personalized learning through integrated knowledge estimation, skill-gap identification, and targeted resource recommendation. ALIGNAgent begins by processing student quiz performance, gradebook data, and learner preferences to generate topic-level proficiency estimates using a Skill Gap Agent that employs concept-level diagnostic reasoning to identify specific misconceptions and knowledge deficiencies. After identifying skill gaps, the Recommender Agent retrieves preference-aware learning materials aligned with diagnosed deficiencies, implementing a continuous feedback loop where interventions occur before advancing to subsequent topics. Extensive empirical evaluation on authentic datasets from two undergraduate computer science courses demonstrates ALIGNAgent's effectiveness, with GPT-4o-based agents achieving precision of 0.87-0.90 and F1 scores of 0.84-0.87 in knowledge proficiency estimation validated against actual exam performance.

[AI-61] Learning Neural Operators from Partial Observations via Latent Autoregressive Modeling

【Quick Read】: This paper addresses the performance degradation of neural operators under incomplete observations in real scientific computing. Conventional neural operators assume fully observed spatial inputs, yet sensor, geographic, or cost constraints often leave data missing, creating two core challenges: a supervision gap in unobserved regions that prevents learning physical correlations, and a dynamic spatial mismatch between inputs and outputs. The key of the proposed Latent Autoregressive Neural Operator lies in two designs: (i) a mask-to-predict training strategy that strategically masks observed regions to create artificial supervision, closing the supervision gap; and (ii) a Physics-Aware Latent Propagator that reconstructs solution fields via boundary-first autoregressive generation in latent space, handling the spatial mismatch. On the POBench-PDE benchmark the method achieves 18-69% relative L2 error reduction, markedly improving neural operators' generalization and practicality under partial observation.

Link: https://arxiv.org/abs/2601.15547
Authors: Jingren Hou, Hong Wang, Pengyu Xu, Chang Gao, Huafeng Liu, Liping Jing
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Real-world scientific applications frequently encounter incomplete observational data due to sensor limitations, geographic constraints, or measurement costs. Although neural operators significantly advanced PDE solving in terms of computational efficiency and accuracy, their underlying assumption of fully-observed spatial inputs severely restricts applicability in real-world applications. We introduce the first systematic framework for learning neural operators from partial observation. We identify and formalize two fundamental obstacles: (i) the supervision gap in unobserved regions that prevents effective learning of physical correlations, and (ii) the dynamic spatial mismatch between incomplete inputs and complete solution fields. Specifically, our proposed Latent Autoregressive Neural Operator introduces two novel components designed explicitly to address the core difficulties of partial observations: (i) a mask-to-predict training strategy that creates artificial supervision by strategically masking observed regions, and (ii) a Physics-Aware Latent Propagator that reconstructs solutions through boundary-first autoregressive generation in latent space. Additionally, we develop POBench-PDE, a dedicated and comprehensive benchmark designed specifically for evaluating neural operators under partial observation conditions across three PDE-governed tasks. The proposed operator achieves state-of-the-art performance with 18-69% relative L2 error reduction across all benchmarks under patch-wise missingness with less than 50% missing rate, including real-world climate prediction. Our approach effectively addresses practical scenarios involving up to 75% missing rate, to some extent bridging the existing gap between idealized research settings and the complexities of real-world scientific computing.

[AI-62] RDumb: Drift-Aware Continual Test-Time Adaptation

【Quick Read】: This paper addresses performance degradation in Continual Test-Time Adaptation (CTTA) under long-horizon, rapidly changing test distributions, in particular prediction collapse from accumulated adaptation on large-scale streams (7.5M samples in the CCC benchmark). The key of the proposed RDumb++ is two drift-detection mechanisms, an entropy-based drift score and a KL-divergence-based drift score, combined with adaptive reset strategies, so that the model detects distribution drift and recovers before adaptation becomes harmful, maintaining stable adaptation while improving long-horizon performance by about 3% absolute accuracy over the original RDumb.

Link: https://arxiv.org/abs/2601.15544
Authors: Himanshu Mishra
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Continual Test-Time Adaptation (CTTA) seeks to update a pretrained model during deployment using only the incoming, unlabeled data stream. Although prior approaches such as Tent and EATA provide meaningful improvements under short evolving shifts, they struggle when the test distribution changes rapidly or over extremely long horizons. This challenge is exemplified by the CCC benchmark, where models operate over streams of 7.5M samples with continually changing corruption types and severities. We propose RDumb++, a principled extension of RDumb that introduces two drift-detection mechanisms, i.e., entropy-based drift scoring and KL-divergence drift scoring, together with adaptive reset strategies. These mechanisms allow the model to detect when accumulated adaptation becomes harmful and to recover before prediction collapse occurs. Across CCC-medium with three speeds and three seeds (nine runs, each containing one million samples), RDumb++ consistently surpasses RDumb, yielding approximately 3% absolute accuracy gains while maintaining stable adaptation throughout the entire stream. Ablation experiments on drift thresholds and reset strengths further show that drift-aware resetting is essential for preventing collapse and achieving reliable long-horizon CTTA.
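下面用一段最小的 PyTorch 代码示意“熵漂移评分 + KL 漂移评分 + 自适应重置”的基本思路;其中阈值与滑动平均系数均为本文假设的示例值,并非 RDumb++ 论文的官方实现:

```python
import copy
import torch
import torch.nn.functional as F

class DriftAwareResetter:
    """最小示意:熵漂移评分 + KL 漂移评分,超过阈值即重置回预训练权重。
    阈值与滑动平均系数均为假设值,非论文官方超参。"""

    def __init__(self, model, entropy_thresh=2.0, kl_thresh=0.5):
        self.model = model
        self.initial_state = copy.deepcopy(model.state_dict())  # 预训练权重快照
        self.entropy_thresh = entropy_thresh
        self.kl_thresh = kl_thresh
        self.ref_probs = None          # 参考预测分布(滑动平均)

    def step(self, logits):
        probs = logits.softmax(dim=-1)                               # [B, C]
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()
        batch_mean = probs.mean(0).detach()                          # 当前批次类别分布
        if self.ref_probs is None:
            self.ref_probs = batch_mean
            return False
        # KL(当前批次分布 || 参考分布)
        kl = F.kl_div(self.ref_probs.clamp_min(1e-8).log(), batch_mean,
                      reduction="sum")
        drifted = bool(entropy > self.entropy_thresh) or bool(kl > self.kl_thresh)
        if drifted:
            self.model.load_state_dict(self.initial_state)  # 累积适应已有害:重置
            self.ref_probs = None
        else:
            self.ref_probs = 0.9 * self.ref_probs + 0.1 * batch_mean  # 滑动更新
        return drifted
```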
zh

[AI-63] QUAIL: Quantization Aware Unlearning for Mitigating Misinformation in LLMs

【速读】:该论文旨在解决模型遗忘(machine unlearning)在量化部署过程中失效的问题,即低比特量化(如4-bit)会灾难性地恢复被遗忘的知识。现有方法在未量化状态下可实现有效遗忘,但在量化后因权重更新幅度小于量化阈值而导致遗忘信息重新显现。解决方案的关键在于提出一种量化感知的遗忘机制:通过在logits空间引入一个铰链损失(hinge loss),强制未学习模型对遗忘样本的输出logits与原始模型之间至少相差半个量化步长,从而确保即使在量化后,遗忘样本仍能保持可区分性,有效防止知识恢复。

链接: https://arxiv.org/abs/2601.15538
作者: Himanshu Mishra,Kanwal Mehreen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Machine unlearning aims to remove specific knowledge (e.g., copyrighted or private data) from a trained model without full retraining. In practice, models are often quantized (e.g., 4-bit) for deployment, but we find that quantization can catastrophically restore forgotten information [1]. In this paper, we (1) analyze why low-bit quantization undermines unlearning, and (2) propose a quantization-aware unlearning method to mitigate this. We first compute weight-change statistics and bucket overlaps in quantization to show that typical unlearning updates are too small to cross quantization thresholds. Building on this insight, we introduce a logits space hinge loss: for each forget example, we force the output logits of the unlearned model to differ from the original model by at least a margin (half the quantization step). This ensures forgotten examples remain distinguishable even after quantization. We evaluate on language and classification tasks (including a Twitter misinformation dataset) and show our method preserves forgetting under 4-bit quantization, whereas existing methods almost entirely recover the forgotten knowledge.
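摘要中的 logits 空间铰链损失可以直接写出来。下面是一个最小 PyTorch 示意:强制遗忘样本上两个模型的 logits 至少相差半个量化步长;其中由权重范围与比特数换算步长的方式、`weight_range` 取值均为本文假设,margin 的精确定义以论文为准:

```python
import torch

def quantization_margin_hinge(logits_unlearned: torch.Tensor,
                              logits_original: torch.Tensor,
                              weight_range: float = 2.0,
                              n_bits: int = 4) -> torch.Tensor:
    """遗忘样本上的铰链损失:logits 差异不足半个量化步长时产生惩罚。"""
    step = weight_range / (2 ** n_bits - 1)      # 假设的量化步长换算
    margin = step / 2.0
    gap = (logits_unlearned - logits_original).abs()
    # 差异足够大则损失为 0,从而量化后遗忘样本仍可区分
    return torch.relu(margin - gap).mean()

# 用法示意:总损失 = 保留集上的任务损失 + lam * 遗忘集上的铰链损失
lu, lo = torch.randn(8, 10), torch.randn(8, 10)
print(quantization_margin_hinge(lu, lo).item())
```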
zh

[AI-64] From Generative Engines to Actionable Simulators: The Imperative of Physical Grounding in World Models

【速读】:该论文旨在解决当前世界模型(world models)中存在的视觉混淆(visual conflation)问题,即误将高保真视频生成能力当作对物理和因果动态的理解。解决方案的关键在于重构世界模型的设计范式,将其从以视觉还原为导向的“视觉引擎”转变为以可操作性为核心的“行动模拟器”(actionable simulators),强调结构化4D接口、约束感知的动力学建模以及闭环评估机制,从而确保模型在长期预测中保持因果一致性与稳定性,尤其在医疗决策等不可逆场景下,验证其通过反事实推理、干预规划和鲁棒长时前瞻能力体现真实价值。

链接: https://arxiv.org/abs/2601.15533
作者: Zhikang Chen,Tingting Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A world model is an AI system that simulates how an environment evolves under actions, enabling planning through imagined futures rather than reactive perception. Current world models, however, suffer from visual conflation: the mistaken assumption that high-fidelity video generation implies an understanding of physical and causal dynamics. We show that while modern models excel at predicting pixels, they frequently violate invariant constraints, fail under intervention, and break down in safety-critical decision-making. This survey argues that visual realism is an unreliable proxy for world understanding. Instead, effective world models must encode causal structure, respect domain-specific constraints, and remain stable over long horizons. We propose a reframing of world models as actionable simulators rather than visual engines, emphasizing structured 4D interfaces, constraint-aware dynamics, and closed-loop evaluation. Using medical decision-making as an epistemic stress test, where trial-and-error is impossible and errors are irreversible, we demonstrate that a world model’s value is determined not by how realistic its rollouts appear, but by its ability to support counterfactual reasoning, intervention planning, and robust long-horizon foresight.
zh

[AI-65] TransportAgents: a multi-agents LLM framework for traffic accident severity prediction

【速读】:该论文旨在解决交通碰撞严重程度预测中因单一代理大语言模型(Large Language Model, LLM)难以处理异构、领域特定数据而导致的预测偏差与不稳定性问题。其解决方案的关键在于提出TransportAgents框架,该框架采用混合多智能体架构,将类别特异性LLM推理与多层感知机(Multilayer Perceptron, MLP)集成模块相结合:每个专用智能体专注于交通信息的一个子集(如人口统计特征、环境背景或事故细节),生成中间严重程度评估,并通过MLP模块融合为统一预测结果,从而提升预测的准确性、鲁棒性及可解释性。

链接: https://arxiv.org/abs/2601.15519
作者: Zhichao Yang,Jiashu He,Jinxuan Fan,Cirillo Cinzia
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate prediction of traffic crash severity is critical for improving emergency response and public safety planning. Although recent large language models (LLMs) exhibit strong reasoning capabilities, their single-agent architectures often struggle with heterogeneous, domain-specific crash data and tend to generate biased or unstable predictions. To address these limitations, this paper proposes TransportAgents, a hybrid multi-agent framework that integrates category-specific LLM reasoning with a multilayer perceptron (MLP) integration module. Each specialized agent focuses on a particular subset of traffic information, such as demographics, environmental context, or incident details, to produce intermediate severity assessments that are subsequently fused into a unified prediction. Extensive experiments on two complementary U.S. datasets, the Consumer Product Safety Risk Management System (CPSRMS) and the National Electronic Injury Surveillance System (NEISS), demonstrate that TransportAgents consistently outperforms both traditional machine learning and advanced LLM-based baselines. Across three representative backbones, including closed-source models such as GPT-3.5 and GPT-4o, as well as open-source models such as LLaMA-3.3, the framework exhibits strong robustness, scalability, and cross-dataset generalizability. A supplementary distributional analysis further shows that TransportAgents produces more balanced and well-calibrated severity predictions than standard single-agent LLM approaches, highlighting its interpretability and reliability for safety-critical decision support applications.
zh

[AI-66] The Rise of Large Language Models and the Direction and Impact of US Federal Research Funding

【速读】:该论文旨在解决生成式 AI(Generative AI)在联邦科研资助体系中的影响机制问题,特别是其如何重塑科学创意的定位、遴选与公共资金转化路径。其解决方案的关键在于结合来自美国国家科学基金会(NSF)和国立卫生研究院(NIH)的两组互补数据:一是两所顶尖研究型大学(R1)提交的保密提案(含已资助、未资助及待审项目),二是公开发布的全部NSF和NIH资助项目数据,通过量化分析LMM(大语言模型)使用强度与项目语义独特性、资助成功率及后续发表产出之间的关系,揭示了LLM使用呈现双峰分布特征,并发现其对不同机构资助结果具有异质性影响——在NIH中表现为正向关联,在NSF中则无显著效应,且NIH的生产力提升集中于非高影响力论文,而非顶级引用成果。

链接: https://arxiv.org/abs/2601.15485
作者: Yifan Qian,Zhe Wen,Alexander C. Furnas,Yue Bai,Erzhuo Shao,Dashun Wang
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Physics and Society (physics.soc-ph)
备注: 41 pages, 23 figures, 12 tables

点击查看摘要

Abstract:Federal research funding shapes the direction, diversity, and impact of the US scientific enterprise. Large language models (LLMs) are rapidly diffusing into scientific practice, holding substantial promise while raising widespread concerns. Despite growing attention to AI use in scientific writing and evaluation, little is known about how the rise of LLMs is reshaping the public funding landscape. Here, we examine LLM involvement at key stages of the federal funding pipeline by combining two complementary data sources: confidential National Science Foundation (NSF) and National Institutes of Health (NIH) proposal submissions from two large US R1 universities, including funded, unfunded, and pending proposals, and the full population of publicly released NSF and NIH awards. We find that LLM use rises sharply beginning in 2023 and exhibits a bimodal distribution, indicating a clear split between minimal and substantive use. Across both private submissions and public awards, higher LLM involvement is consistently associated with lower semantic distinctiveness, positioning projects closer to recently funded work within the same agency. The consequences of this shift are agency-dependent. LLM use is positively associated with proposal success and higher subsequent publication output at NIH, whereas no comparable associations are observed at NSF. Notably, the productivity gains at NIH are concentrated in non-hit papers rather than the most highly cited work. Together, these findings provide large-scale evidence that the rise of LLMs is reshaping how scientific ideas are positioned, selected, and translated into publicly funded research, with implications for portfolio governance, research diversity, and the long-run impact of science.
zh

[AI-67] Is Grokipedia Right-Leaning? Comparing Political Framing in Wikipedia and Grokipedia on Controversial Topics

【速读】:该论文旨在解决在线百科全书在政治倾向上的差异问题,特别是针对广受争议的议题中,维基百科(Wikipedia)与由xAI推出的AI生成百科全书Grokipedia在语义框架、政治倾向和内容优先级上的系统性差异。其解决方案的关键在于通过对比分析两平台在政治敏感话题上的文本内容,量化语义相似度随文章结构变化的衰减趋势,并识别两者在政治立场分布上的模式——研究发现,尽管两者均呈现左倾倾向,但Grokipedia展现出更显著的右倾内容增强和双峰分布特征,且分歧在争议性主题上更为突出。

链接: https://arxiv.org/abs/2601.15484
作者: Philipp Eibl,Erica Coppolillo,Simone Mungari,Luca Luceri
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Online encyclopedias are central to contemporary information infrastructures and have become focal points of debates over ideological bias. Wikipedia, in particular, has long been accused of left-leaning bias, while Grokipedia, an AI-generated encyclopedia launched by xAI, has been framed as a right-leaning alternative. This paper presents a comparative analysis of Wikipedia and Grokipedia on well-established politically contested topics. Specifically, we examine differences in semantic framing, political orientation, and content prioritization. We find that semantic similarity between the two platforms decays across article sections and diverges more strongly on controversial topics than on randomly sampled ones. Additionally, we show that both encyclopedias predominantly exhibit left-leaning framings, although Grokipedia exhibits a more bimodal distribution with increased prominence of right-leaning content. The experimental code is publicly available.
zh

[AI-68] Martingale Foresight Sampling: A Principled Approach to Inference-Time LLM Decoding

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在标准自回归解码过程中因局部贪心策略导致的“短视”问题,即模型难以找到全局最优的推理路径。传统方法如前瞻采样(foresight sampling)虽尝试通过模拟未来步骤缓解此问题,但其路径评估与剪枝机制依赖于经验性启发式规则,缺乏理论保障。本文提出鞅前瞻采样(Martingale Foresight Sampling, MFS),其核心在于将LLM解码建模为寻找最优随机过程的问题:利用Doob分解定理对路径质量进行可预测优势度量(step valuation),基于选停定理(Optional Stopping Theory)实现对次优候选路径的理论驱动剪枝,以及基于鞅收敛定理设计自适应终止规则以确保探索在质量收敛后停止。这一框架首次将概率论中的鞅理论系统引入LLM推理优化,显著提升了准确性与计算效率。

链接: https://arxiv.org/abs/2601.15482
作者: Huayu Li,ZhengXiao He,Siyuan Tian,Jinghao Wen,Ao Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Standard autoregressive decoding in large language models (LLMs) is inherently short-sighted, often failing to find globally optimal reasoning paths due to its token-by-token generation process. While inference-time strategies like foresight sampling attempt to mitigate this by simulating future steps, they typically rely on ad-hoc heuristics for valuing paths and pruning the search space. This paper introduces Martingale Foresight Sampling (MFS), a principled framework that reformulates LLM decoding as a problem of identifying an optimal stochastic process. By modeling the quality of a reasoning path as a stochastic process, we leverage Martingale theory to design a theoretically-grounded algorithm. Our approach replaces heuristic mechanisms with principles from probability theory: step valuation is derived from the Doob Decomposition Theorem to measure a path’s predictable advantage, path selection uses Optional Stopping Theory for principled pruning of suboptimal candidates, and an adaptive stopping rule based on the Martingale Convergence Theorem terminates exploration once a path’s quality has provably converged. Experiments on six reasoning benchmarks demonstrate that MFS surpasses state-of-the-art methods in accuracy while significantly improving computational efficiency. Code will be released at this https URL.
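下面给出“Doob 分解做步骤估值 + 选停式剪枝 + 收敛式终止”的一个极简 Python 示意。其中用蒙特卡洛 rollout 的均值近似条件期望,`prune_gap`、`conv_tol` 等超参以及整体流程都是本文为说明思路而做的假设性简化,并非 MFS 的官方实现:

```python
import statistics

def predictable_increment(value_samples: list, prev_value: float) -> float:
    """Doob 分解示意:用 rollout 价值样本的均值近似 E[V_t | F_{t-1}],
    可预测增量为 E[V_t | F_{t-1}] - V_{t-1},即该步的可预测优势。"""
    return statistics.mean(value_samples) - prev_value

def select_and_stop(candidates, prev_value, prune_gap=0.2, conv_tol=1e-3):
    """candidates: {候选延续: 其 rollout 价值样本列表}。
    返回 (保留的候选, 是否终止);prune_gap/conv_tol 为假设超参。"""
    advantages = {c: predictable_increment(v, prev_value)
                  for c, v in candidates.items()}
    best = max(advantages.values())
    # 选停思想:可预测优势明显落后的候选被剪枝
    kept = [c for c, a in advantages.items() if best - a <= prune_gap]
    # 鞅收敛思想:最优路径的增量接近 0 时,价值已收敛,停止探索
    return kept, abs(best) < conv_tol

if __name__ == "__main__":
    cands = {"step-A": [0.8, 0.7, 0.9], "step-B": [0.3, 0.4, 0.2]}
    kept, stop = select_and_stop(cands, prev_value=0.6)
    print(kept, stop)   # ['step-A'] False
```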
zh

[AI-69] Reliability by design: quantifying and eliminating fabrication risk in LLMs. From generative to consultative AI: a comparative analysis in the legal domain and lessons for high-stakes knowledge bases

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在高风险法律场景中因幻觉(hallucination)导致的不可靠性问题,尤其关注虚假引用(False Citation Rate, FCR)和虚构事实(Fabricated Fact Rate, FFR)等关键可靠性指标。其核心解决方案是采用以检索增强生成(Retrieval-Augmented Generation, RAG)为基础的“严谨档案馆”范式,通过嵌入微调(embedding fine-tuning)、重排序(re-ranking)和自校正(self-correction)等技术实现端到端优化,显著降低错误率至低于0.2%,从而确保法律AI具备可验证性和可追溯性,为高风险领域提供可信的智能辅助决策框架。

链接: https://arxiv.org/abs/2601.15476
作者: Alex Dantart
机构: 未知
类目: Artificial Intelligence (cs.AI); Performance (cs.PF)
备注:

点击查看摘要

Abstract:This paper examines how to make large language models reliable for high-stakes legal work by reducing hallucinations. It distinguishes three AI paradigms: (1) standalone generative models (“creative oracle”), (2) basic retrieval-augmented systems (“expert archivist”), and (3) an advanced, end-to-end optimized RAG system (“rigorous archivist”). The authors introduce two reliability metrics, False Citation Rate (FCR) and Fabricated Fact Rate (FFR), and evaluate 2,700 judicial-style answers from 12 LLMs across 75 legal tasks using expert, double-blind review. Results show that standalone models are unsuitable for professional use (FCR above 30%), while basic RAG greatly reduces errors but still leaves notable misgrounding. Advanced RAG, using techniques such as embedding fine-tuning, re-ranking, and self-correction, reduces fabrication to negligible levels (below 0.2%). The study concludes that trustworthy legal AI requires rigor-focused, retrieval-based architectures emphasizing verification and traceability, and provides an evaluation framework applicable to other high-risk domains.
zh

[AI-70] Multi-Targeted Graph Backdoor Attack

【速读】:该论文旨在解决图神经网络(Graph Neural Network, GNN)在图分类任务中面临的多目标后门攻击问题,即现有研究仅限于单目标攻击,通过子图替换机制植入单一触发器,难以模拟真实场景中攻击者可能同时诱导模型预测至多个目标标签的情形。其解决方案的关键在于提出一种基于子图注入(subgraph injection)的新攻击范式,该方法在不破坏原始图结构的前提下,向干净图中注入多个触发器,从而实现对不同目标标签的并行误导,且保持对正常分类准确率影响最小。实验表明,该方法在多种GNN架构和数据集上均具有高攻击成功率与强泛化能力,并能有效规避当前主流防御策略(如随机平滑和细粒度剪枝)。

链接: https://arxiv.org/abs/2601.15474
作者: Md Nabi Newaz Khan,Abdullah Arafat Miah,Yu Bi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) have demonstrated exceptional performance in solving critical problems across diverse domains yet remain susceptible to backdoor attacks. Existing studies on backdoor attacks for graph classification are limited to single-target attacks using a subgraph replacement based mechanism, where the attacker implants only one trigger into the GNN model. In this paper, we introduce the first multi-targeted backdoor attack for the graph classification task, where multiple triggers simultaneously redirect predictions to different target labels. Instead of subgraph replacement, we propose subgraph injection, which preserves the structure of the original graphs while poisoning the clean graphs. Extensive experiments demonstrate the efficacy of our approach, where our attack achieves high attack success rates for all target labels with minimal impact on the clean accuracy. Experimental results on five datasets demonstrate the superior performance of our attack framework compared to the conventional subgraph replacement-based attack. Our analysis on four GNN models confirms the generalization capability of our attack, which is effective regardless of the GNN model architectures and training parameter settings. We further investigate the impact of the attack design parameters including injection methods, number of connections, trigger sizes, trigger edge density and poisoning ratios. Additionally, our evaluation against state-of-the-art defenses (randomized smoothing and fine-pruning) demonstrates the robustness of our proposed multi-target attacks. This work highlights the GNN vulnerability against multi-targeted backdoor attacks in the graph classification task. Our source codes will be available at this https URL.
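“子图注入”相对于“子图替换”的含义可以用 networkx 写成几行代码:保留原图的全部节点和边,只并入触发器子图并加少量连接边。下面是一个最小示意,触发器形状、连接数与“触发器到目标标签”的映射均为本文假设的玩具设定:

```python
import random
import networkx as nx

def inject_trigger(graph: nx.Graph, trigger: nx.Graph,
                   n_connections: int = 2, seed: int = 0) -> nx.Graph:
    """子图注入示意:把触发器子图并入干净图,原图结构保持不变,
    只额外添加 n_connections 条连接边(参数为假设值)。"""
    rng = random.Random(seed)
    g = graph.copy()
    offset = max(g.nodes) + 1                       # 避免节点编号冲突
    trig = nx.relabel_nodes(trigger, {n: n + offset for n in trigger.nodes})
    g.add_nodes_from(trig.nodes)
    g.add_edges_from(trig.edges)
    for _ in range(n_connections):                  # 触发器与原图之间随机连边
        g.add_edge(rng.choice(list(graph.nodes)), rng.choice(list(trig.nodes)))
    return g

if __name__ == "__main__":
    clean = nx.erdos_renyi_graph(20, 0.2, seed=1)
    # 多目标:不同触发器模式对应不同目标标签(这里用完全图/环示意)
    triggers = {0: nx.complete_graph(4), 1: nx.cycle_graph(5)}
    poisoned = inject_trigger(clean, triggers[0])
    print(poisoned.number_of_nodes(), poisoned.number_of_edges())
```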
zh

[AI-71] Panther: Faster and Cheaper Computations with Randomized Numerical Linear Algebra

【速读】:该论文旨在解决现代深度学习模型训练中因GPU内存和计算资源限制而导致的可扩展性瓶颈问题。其核心解决方案是提出一个名为Panther的PyTorch兼容库,该库将成熟的随机数值线性代数(RandNLA)算法整合为统一、高性能的框架,提供高效且可直接替换标准组件的实现,如压缩的线性层、二维卷积、多头注意力机制及随机矩阵分解(如选主元CholeskyQR)。关键创新在于通过自研的C++/CUDA后端pawX实现优化,支持CPU与GPU并行执行,并在仅需少量代码修改的情况下显著降低内存占用(最高达75%),同时保持模型性能稳定。

链接: https://arxiv.org/abs/2601.15473
作者: Fahd Seddik,Abdulrahman Elbedewy,Gaser Sami,Mohamed Abdelmoniem,Yahia Zakaria
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 5 pages, 3 figures, 2 listings

点击查看摘要

Abstract:Training modern deep learning models is increasingly constrained by GPU memory and compute limits. While Randomized Numerical Linear Algebra (RandNLA) offers proven techniques to compress these models, the lack of a unified, production-grade library prevents wide adoption of these methods. We present Panther, a PyTorch-compatible library that consolidates established RandNLA algorithms into a single high-performance framework. Panther engineers efficient, drop-in replacements for standard components including sketched linear layers, 2D convolution, multi-head attention, and randomized matrix decompositions (such as pivoted CholeskyQR). By implementing a custom C++/CUDA backend (pawX), Panther provides an optimized implementation that can run on both CPUs and GPUs. We demonstrate the effectiveness of RandNLA techniques and Panther’s ease of adoption. By replacing standard PyTorch linear layers with Panther layers (requiring only a few lines of code), we achieve significant memory savings (up to 75%) on BERT while maintaining comparable loss. Source code is available (MIT License) at this https URL, along with demonstration video at this https URL.
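“草图化线性层”这一 RandNLA 思路可以用纯 PyTorch 粗略还原:先用固定的随机草图矩阵把输入降维,再接一个小得多的可训练线性层。下面的 `SketchedLinear` 只是本文对该思路的最小示意,并非 Panther/pawX 的真实 API 或实现:

```python
import torch
import torch.nn as nn

class SketchedLinear(nn.Module):
    """随机投影压缩线性层的最小示意(非 Panther 官方实现):
    固定高斯草图矩阵 S 把输入降到 sketch_dim,再学习小得多的权重。"""

    def __init__(self, in_features: int, out_features: int, sketch_dim: int):
        super().__init__()
        # 固定草图矩阵,不参与训练(RandNLA 中常见做法)
        self.register_buffer(
            "sketch", torch.randn(in_features, sketch_dim) / sketch_dim ** 0.5)
        self.linear = nn.Linear(sketch_dim, out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x @ self.sketch)

# 用法示意:把 nn.Linear(4096, 4096) 换成 SketchedLinear(4096, 4096, 1024),
# 可训练参数量由约 16.8M 降到约 4.2M(固定草图矩阵不产生梯度与优化器状态)。
layer = SketchedLinear(4096, 4096, 1024)
print(layer(torch.randn(2, 4096)).shape)   # torch.Size([2, 4096])
```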
zh

[AI-72] Reflexis: Supporting Reflexivity and Rigor in Collaborative Qualitative Analysis through Design for Deliberation

【速读】:该论文旨在解决当前定性分析软件工具在支持深度解释性洞察方面存在的不足,特别是对研究者反思性(reflexivity)、代码演化过程的透明化以及建设性分歧(productive disagreement)等核心方法论原则的支持薄弱问题。其解决方案的关键在于设计并实现了一个名为Reflexis的协作工作空间,该平台通过内嵌即时反思提示(in-situ reflection prompts)强化研究者的反思实践,使代码演进过程可视化、可追溯,并借助位置意识(positionality-aware)的对话机制将分歧转化为富有成效的协作解读,从而推动更具严谨性和透明度的深度协同诠释。

链接: https://arxiv.org/abs/2601.15445
作者: Runlong Ye,Oliver Huang,Patrick Yung Kang Lee,Michael Liut,Carolina Nobre,Ha-Kyung Kong
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted at CHI 26

点击查看摘要

Abstract:Reflexive Thematic Analysis (RTA) is a critical method for generating deep interpretive insights. Yet its core tenets, including researcher reflexivity, tangible analytical evolution, and productive disagreement, are often poorly supported by software tools that prioritize speed and consensus over interpretive depth. To address this gap, we introduce Reflexis, a collaborative workspace that centers these practices. It supports reflexivity by integrating in-situ reflection prompts, makes code evolution transparent and tangible, and scaffolds collaborative interpretation by turning differences into productive, positionality-aware dialogue. Results from our paired-analyst study (N=12) indicate that Reflexis encouraged participants toward more granular reflection and reframed disagreements as productive conversations. The evaluation also surfaced key design tensions, including a desire for higher-level, networked memos and more user control over the timing of proactive alerts. Reflexis contributes a design framework for tools that prioritize rigor and transparency to support deep, collaborative interpretation in an age of automation.
zh

[AI-73] A tensor network formalism for neuro-symbolic AI

【速读】:该论文旨在解决神经网络与符号主义方法在人工智能(Artificial Intelligence, AI)中难以统一的开放性挑战。其核心解决方案是提出一种张量网络(Tensor Network)形式化框架,通过张量分解(Tensor Decomposition)捕捉来自不同方法的稀疏性原理,并将函数以基底编码方案表示、把神经分解建模为张量分解。该框架能够以结构化的张量分解形式表示逻辑公式和概率分布,识别张量网络收缩(Tensor Network Contraction)为基本推理类别,并将源自概率论和命题逻辑的高效推理算法重构为收缩消息传递(Contraction Message Passing)机制。这一统一处理方式使得混合逻辑与概率模型——即“混合逻辑网络”(Hybrid Logic Network)——得以定义与训练,从而实现神经与符号方法的融合。

链接: https://arxiv.org/abs/2601.15442
作者: Alex Goessmann,Janina Schütte,Maximilian Fröhlich,Martin Eigel
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Numerical Analysis (math.NA); Machine Learning (stat.ML)
备注: 51 pages, 14 figures

点击查看摘要

Abstract:The unification of neural and symbolic approaches to artificial intelligence remains a central open challenge. In this work, we introduce a tensor network formalism, which captures sparsity principles originating in the different approaches in tensor decompositions. In particular, we describe a basis encoding scheme for functions and model neural decompositions as tensor decompositions. The proposed formalism can be applied to represent logical formulas and probability distributions as structured tensor decompositions. This unified treatment identifies tensor network contractions as a fundamental inference class and formulates efficiently scaling reasoning algorithms, originating from probability theory and propositional logic, as contraction message passing schemes. The framework enables the definition and training of hybrid logical and probabilistic models, which we call Hybrid Logic Network. The theoretical concepts are accompanied by the Python library tnreason, which enables the implementation and practical use of the proposed architectures.
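摘要中“张量网络收缩是一类基本推理”的含义,可以用 numpy 的 einsum 在一个两变量链上演示:概率边缘化与逻辑模型计数都是同一种收缩运算。以下示例为本文自行构造,与 tnreason 库的真实接口无关:

```python
import numpy as np

# 概率示例:链 X - Y - Z 上的边缘化写成一次张量网络收缩
p_x = np.array([0.6, 0.4])                    # P(X)
p_y_given_x = np.array([[0.9, 0.1],           # P(Y|X)
                        [0.2, 0.8]])
p_z_given_y = np.array([[0.7, 0.3],           # P(Z|Y)
                        [0.5, 0.5]])

# P(Z) = sum_{x,y} P(X) P(Y|X) P(Z|Y)
p_z = np.einsum("x,xy,yz->z", p_x, p_y_given_x, p_z_given_y)
print(p_z, p_z.sum())    # 概率之和应为 1

# 逻辑示例:把合取 X AND Y 编码为 0/1 张量,收缩即统计满足赋值的个数
and_tensor = np.array([[0, 0], [0, 1]])
n_models = np.einsum("xy->", and_tensor)
print(n_models)           # 1:只有 (X=1, Y=1) 满足
```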
zh

[AI-74] Ambient Dataloops: Generative Models for Dataset Refinement

【速读】:该论文旨在解决扩散模型(diffusion models)在训练过程中因数据集质量参差不齐而导致的学习效率低下问题。现有数据集中样本质量差异显著,直接训练易导致模型性能受限。解决方案的关键在于提出“环境数据回路”(Ambient Dataloops),其核心是构建一个数据集与模型协同进化的过程:每轮迭代中,模型利用当前数据集进行优化,同时生成更高质量的合成样本用于更新数据集;为避免劣化循环,每次迭代将合成样本视为轻微噪声污染的数据,并采用环境扩散(Ambient Diffusion)技术来学习受扰动数据下的分布,从而实现数据质量和模型性能的同步提升。

链接: https://arxiv.org/abs/2601.15417
作者: Adrián Rodríguez-Muñoz,William Daspit,Adam Klivans,Antonio Torralba,Constantinos Daskalakis,Giannis Daras
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 27 pages, 9 figures, 11 tables

点击查看摘要

Abstract:We propose Ambient Dataloops, an iterative framework for refining datasets that makes it easier for diffusion models to learn the underlying data distribution. Modern datasets contain samples of highly varying quality, and training directly on such heterogeneous data often yields suboptimal models. We propose a dataset-model co-evolution process; at each iteration of our method, the dataset becomes progressively higher quality, and the model improves accordingly. To avoid destructive self-consuming loops, at each generation, we treat the synthetically improved samples as noisy, but at a slightly lower noise level than the previous iteration, and we use Ambient Diffusion techniques for learning under corruption. Empirically, Ambient Dataloops achieve state-of-the-art performance in unconditional and text-conditional image generation and de novo protein design. We further provide a theoretical justification for the proposed framework that captures the benefits of the data looping procedure.
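数据-模型共进化循环的骨架大致如下:每轮在当前数据上训练,生成改进样本,并把它们当作“噪声水平稍低”的数据进入下一轮。下面是本文假设的极简玩具实现(去噪模型用朝数据中心收缩代替),噪声日程与真实 Ambient Diffusion 训练细节以论文为准:

```python
import numpy as np

def train_denoiser(dataset, noise_level):
    """占位:在给定噪声水平假设下训练去噪/扩散模型。
    此处用"朝数据中心收缩"模拟质量提升,仅为示意。"""
    center = dataset.mean(axis=0)
    return lambda x: x + 0.5 * (center - x)

def ambient_dataloop(dataset, rounds=3, sigma0=0.5, decay=0.6, seed=0):
    """数据-模型共进化的最小示意:合成样本被当作噪声水平
    逐轮降低的数据重新进入训练(日程为假设值)。"""
    rng = np.random.default_rng(seed)
    sigma = sigma0
    for r in range(rounds):
        model = train_denoiser(dataset, noise_level=sigma)
        synthetic = model(dataset + sigma * rng.standard_normal(dataset.shape))
        sigma *= decay                           # 下一轮视合成数据为更低噪声
        dataset = synthetic
        print(f"round {r}: sigma={sigma:.3f}, spread={dataset.std():.3f}")
    return dataset

refined = ambient_dataloop(np.random.default_rng(1).normal(0, 1, (256, 2)))
```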
zh

[AI-75] A Checklist for Trustworthy Safe and User-Friendly Mental Health Chatbots

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在心理健康服务中应用时存在的设计与实施缺陷问题,尤其是心理聊天机器人(mental health chatbots)在安全性、有效性及用户友好性方面的关键空白。其解决方案的核心在于提出一个可操作的检查清单(operational checklist),该清单兼具开发框架与审计工具的功能,用于指导更可信、安全和以用户为中心的聊天机器人设计,并推动数字心理健康工具在社会技术层面的规范发展与责任实践。

链接: https://arxiv.org/abs/2601.15412
作者: Shreya Haran,Samiha Thatikonda,Dong Whi Yoo,Koustuv Saha
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Mental health concerns are rising globally, prompting increased reliance on technology to address the demand-supply gap in mental health services. In particular, mental health chatbots are emerging as a promising solution, but these remain largely untested, raising concerns about safety and potential harms. In this paper, we dive into the literature to identify critical gaps in the design and implementation of mental health chatbots. We contribute an operational checklist to help guide the development and design of more trustworthy, safe, and user-friendly chatbots. The checklist serves as both a developmental framework and an auditing tool to ensure ethical and effective chatbot design. We discuss how this checklist is a step towards supporting more responsible design practices and supporting new standards for sociotechnically sound digital mental health tools.
zh

[AI-76] Improving MoE Compute Efficiency by Composing Weight and Data Sparsity

【速读】:该论文旨在解决混合专家(Mixture-of-Experts, MoE)模型中因因果性约束导致的数据稀疏性难以实现的问题。在自回归模型中,传统的专家选择路由机制会因依赖未来token而违反因果性,造成训练与推理阶段的不一致(train-inference mismatch)。其解决方案的关键在于引入“零计算(zero-compute)”或称“空(null)专家”,即在路由池中加入不参与计算的专家;当token被分配至这些null专家时,对应计算资源被释放,从而实现数据稀疏性而不破坏因果性。通过标准负载均衡目标函数训练模型,使其在期望意义上均匀使用所有专家(包括真实和null专家),从而自然形成数据稀疏性,同时保持因果一致性。实验表明,在视觉-语言模型训练中,这种结合权重稀疏性和数据稀疏性的方法显著提升了计算效率和下游性能。

链接: https://arxiv.org/abs/2601.15370
作者: Maciej Kilian,Oleg Mkrtchyan,Luke Zettlemoyer,Akshat Shrivastava,Armen Aghajanyan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mixture-of-Experts layers achieve compute efficiency through weight sparsity: each token activates only a subset of experts. Data sparsity, where each expert processes only a subset of tokens, offers a complementary axis. Expert-choice routing implements data sparsity directly but violates causality in autoregressive models, creating train-inference mismatch. We recover data sparsity within causal token-choice MoE by leveraging zero-compute (null) experts within the routing pool. When a token routes to null experts, those slots consume no compute. The standard load balancing objective trains the model to uniformly use all experts (real and null), thereby creating data sparsity in expectation without causality violations. We evaluate on vision-language model training, where data heterogeneity is pronounced: vision encoders produce many low-information tokens while text tokens are denser. At matched expected FLOPs, composing weight and data sparsity yields a more compute-efficient frontier than weight sparsity alone, with gains in training loss and downstream performance. The model learns implicit modality-aware allocation, routing vision tokens to null experts more aggressively than text, without explicit modality routing.
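token-choice 路由加零计算(null)专家的核心逻辑可以用几十行 PyTorch 说明:路由器的输出维度同时覆盖真实专家与 null 专家,命中 null 专家的槽位直接跳过前向计算。以下为本文的最小示意,专家数与 top-k 均为假设值,负载均衡损失从略:

```python
import torch
import torch.nn as nn

class NullAwareRouter(nn.Module):
    """token-choice top-k 路由 + 零计算(null)专家的最小示意。
    路由到 null 专家的槽位不执行任何专家计算;维度为假设值。"""

    def __init__(self, d_model=64, n_real=4, n_null=4, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_real + n_null)
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_real)])
        self.n_real, self.top_k = n_real, top_k

    def forward(self, x):                        # x: [T, d]
        scores = self.router(x).softmax(-1)      # 训练时配合负载均衡损失
        top_w, top_i = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx, w = top_i[:, slot], top_w[:, slot:slot + 1]
            for e in range(self.n_real):         # 仅真实专家执行计算
                mask = idx == e
                if mask.any():
                    out[mask] += w[mask] * self.experts[e](x[mask])
            # idx >= n_real 的 token 命中 null 专家:该槽位不消耗计算
        return out

moe = NullAwareRouter()
print(moe(torch.randn(10, 64)).shape)
```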
zh

[AI-77] Lost in Transcription: How Speech-to-Text Errors Derail Code Understanding

【速读】:该论文旨在解决多语言环境下语音驱动的代码理解问题,特别是在非英语母语用户(如印度地区)使用语音接口进行编程时面临的挑战。现有系统主要面向键盘输入的英文用户,难以应对语音查询中出现的非标准英语、领域特定词汇、自定义标识符及代码混用表达等问题。解决方案的关键在于构建一个端到端的多语言语音驱动框架:首先利用自动语音识别(ASR)将用户语音转为文本,再通过大语言模型(LLM)对ASR输出进行代码感知的纠错与优化,最后接入代码理解模型完成代码问答和检索等任务。实验证明,LLM引导的ASR后处理能显著提升转录准确率和下游代码理解性能,凸显了在语音接口中引入代码敏感适配的重要性。

链接: https://arxiv.org/abs/2601.15339
作者: Jayant Havare,Ashish Mittal,Srikanth Tamilselvam,Ganesh Ramakrishnan
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Code understanding is a foundational capability in software engineering tools and developer workflows. However, most existing systems are designed for English-speaking users interacting via keyboards, which limits accessibility in multilingual and voice-first settings, particularly in regions like India. Voice-based interfaces offer a more inclusive modality, but spoken queries involving code present unique challenges due to the presence of non-standard English usage, domain-specific vocabulary, and custom identifiers such as variable and function names, often combined with code-mixed expressions. In this work, we develop a multilingual speech-driven framework for code understanding that accepts spoken queries in a user's native language, transcribes them using Automatic Speech Recognition (ASR), applies code-aware ASR output refinement using Large Language Models (LLMs), and interfaces with code models to perform tasks such as code question answering and code retrieval through benchmarks such as CodeSearchNet, CoRNStack, and CodeQA. Focusing on four widely spoken Indic languages and English, we systematically characterize how transcription errors impact downstream task performance. We also identify key failure modes in ASR for code and demonstrate that LLM-guided refinement significantly improves performance across both transcription and code understanding stages. Our findings underscore the need for code-sensitive adaptations in speech interfaces and offer a practical solution for building robust, multilingual voice-driven programming tools.
zh

[AI-78] ToolCaching: Towards Efficient Caching for LLM Tool-calling

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在调用外部工具时存在的冗余或重复请求问题,这一问题导致资源浪费和延迟增加。传统缓存策略因请求语义异构性、动态工作负载及不同数据新鲜度要求而失效。解决方案的关键在于提出ToolCaching框架,其核心是VAAC算法——通过结合基于多臂赌博机(bandit-based)的准入机制与价值驱动的多因素淘汰策略,综合考虑请求频率、时效性和缓存价值,实现对LLM工具调用请求的高效缓存管理。实验表明,该方案相比标准缓存策略可提升最高11%的缓存命中率并降低34%的延迟。

链接: https://arxiv.org/abs/2601.15335
作者: Yi Zhai,Dian Shen,Junzhou Luo,Bin Yang
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) have revolutionized web applications, enabling intelligent search, recommendation, and assistant services with natural language interfaces. Tool-calling extends LLMs with the ability to interact with external APIs, greatly enhancing their practical utility. While prior research has improved tool-calling performance by adopting traditional computer systems techniques, such as parallel and asynchronous execution, the challenge of redundant or repeated tool-calling requests remains largely unaddressed. Caching is a classic solution to this problem, but applying it to LLM tool-calling introduces new difficulties due to heterogeneous request semantics, dynamic workloads, and varying freshness requirements, which render conventional cache policies ineffective. To address these issues, we propose ToolCaching, an efficient feature-driven and adaptive caching framework for LLM tool-calling systems. ToolCaching systematically integrates semantic and system-level features to evaluate request cacheability and estimate caching value. At its core, the VAAC algorithm integrates bandit-based admission with value-driven, multi-factor eviction, jointly accounting for request frequency, recency, and caching value. Extensive experiments on synthetic and public tool-calling workloads demonstrate that ToolCaching with VAAC achieves up to 11% higher cache hit ratios and 34% lower latency compared to standard policies, effectively accelerating LLM tool-calling in practical applications.
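价值驱动的多因素淘汰策略可以浓缩成一个小缓存类:综合频率、新近性与缓存价值为每个条目打分,淘汰得分最低者。下面是本文假设的最小示意(线性打分权重为任意取值),VAAC 中基于多臂赌博机的准入部分从略:

```python
import time

class ValueAwareCache:
    """多因素(频率、新近性、缓存价值)淘汰的最小示意,
    并非 ToolCaching/VAAC 的官方实现;打分权重为假设值。"""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self.store = {}          # key -> (result, freq, last_ts, value)

    def _score(self, freq, last_ts, value, now):
        recency = 1.0 / (1.0 + now - last_ts)
        return 0.4 * freq + 0.3 * recency + 0.3 * value   # 假设的线性组合

    def get(self, key):
        if key in self.store:
            result, freq, _, value = self.store[key]
            self.store[key] = (result, freq + 1, time.time(), value)
            return result
        return None

    def put(self, key, result, value):
        """value 可由请求代价(如工具延迟、token 成本)估计。"""
        now = time.time()
        if len(self.store) >= self.capacity and key not in self.store:
            victim = min(self.store,
                         key=lambda k: self._score(self.store[k][1],
                                                   self.store[k][2],
                                                   self.store[k][3], now))
            del self.store[victim]               # 淘汰综合得分最低的条目
        self.store[key] = (result, 1, now, value)

cache = ValueAwareCache(capacity=2)
cache.put("weather?city=SH", {"temp": 5}, value=0.8)
cache.put("rate?cur=USD", {"rate": 7.1}, value=0.5)
print(cache.get("weather?city=SH"))
```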
zh

[AI-79] Empowering LLMs for Structure-Based Drug Design via Exploration-Augmented Latent Inference

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在基于结构的药物设计(Structure-Based Drug Design, SBDD)中因对蛋白质结构理解不足和分子生成不可预测而导致的应用局限性问题。其解决方案的关键在于提出一种名为“探索增强的潜在空间推理框架”(Exploration-Augmented Latent Inference for LLMs, ELILLM)的新方法,该框架将LLM的生成过程重新建模为编码、潜在空间探索与解码的三阶段流程:通过贝叶斯优化系统性地探索模型知识边界之外的潜在嵌入区域,利用位置感知代理模型高效预测结合亲和力分布以指导搜索,并借助知识引导的解码机制降低随机性并强制执行化学有效性约束,从而实现可控且高效的分子生成。

链接: https://arxiv.org/abs/2601.15333
作者: Xuanning Hu,Anchen Li,Qianli Xing,Jinglong Ji,Hao Tuo,Bo Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) possess strong representation and reasoning capabilities, but their application to structure-based drug design (SBDD) is limited by insufficient understanding of protein structures and unpredictable molecular generation. To address these challenges, we propose Exploration-Augmented Latent Inference for LLMs (ELILLM), a framework that reinterprets the LLM generation process as an encoding, latent space exploration, and decoding workflow. ELILLM explicitly explores portions of the design problem beyond the model’s current knowledge while using a decoding module to handle familiar regions, generating chemically valid and synthetically reasonable molecules. In our implementation, Bayesian optimization guides the systematic exploration of latent embeddings, and a position-aware surrogate model efficiently predicts binding affinity distributions to inform the search. Knowledge-guided decoding further reduces randomness and effectively imposes chemical validity constraints. We demonstrate ELILLM on the CrossDocked2020 benchmark, showing strong controlled exploration and high binding affinity scores compared with seven baseline methods. These results demonstrate that ELILLM can effectively enhance the capabilities of LLMs for SBDD.
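摘要中“贝叶斯优化引导潜在嵌入探索”的流程,可以用 scikit-learn 的高斯过程加期望改进(EI)采集函数勾勒出来。下面的 `score_fn` 对应论文中代理模型给出的打分,潜在维度、候选池大小等均为本文假设的玩具设定,并非 ELILLM 的官方实现:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def expected_improvement(mu, sigma, best, xi=0.01):
    """经典 EI 采集函数,引导向高结合亲和力区域探索。"""
    imp = mu - best - xi
    z = np.divide(imp, sigma, out=np.zeros_like(sigma), where=sigma > 0)
    return imp * norm.cdf(z) + sigma * norm.pdf(z)

def bo_in_latent(score_fn, dim=8, n_init=5, n_iter=20, seed=0):
    """潜在空间贝叶斯优化的最小示意:score_fn 扮演亲和力代理模型。"""
    rng = np.random.default_rng(seed)
    Z = rng.uniform(-1, 1, (n_init, dim))            # 初始潜在嵌入
    y = np.array([score_fn(z) for z in Z])
    gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(Z, y)
        cand = rng.uniform(-1, 1, (256, dim))        # 随机候选池
        mu, sigma = gp.predict(cand, return_std=True)
        z_next = cand[np.argmax(expected_improvement(mu, sigma, y.max()))]
        Z = np.vstack([Z, z_next])
        y = np.append(y, score_fn(z_next))
    return Z[np.argmax(y)], y.max()                  # 最优嵌入交给解码模块

# 玩具代理:峰值在 0.5 处的打分函数
best_z, best_y = bo_in_latent(lambda z: -np.sum((z - 0.5) ** 2))
print(round(best_y, 3))
```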
zh

[AI-80] Prometheus Mind: Retrofitting Memory to Frozen Language Models

【速读】:该论文旨在解决如何在不修改预训练语言模型架构或权重的前提下,为其添加可逆的记忆能力问题。其核心挑战在于实现高效、无监督的语义方向提取、适配器模块的稳定训练、记忆注入的有效性以及隐藏状态表示的区分度恢复。解决方案的关键在于提出 Prometheus Mind 系统,通过 11 个模块化适配器(总大小 530MB,仅占原模型 7% 的额外开销)实现对冻结的 Qwen3-4B 模型的记忆增强;其中创新性地引入 Contrastive Direction Discovery (CDD) 方法进行无监督语义方向挖掘,采用分阶段训练策略克服端到端优化失败问题,并利用模型自带权重矩阵的行向量已提供的映射关系,使记忆注入无需额外训练,同时设计投影层以恢复因 Transformer 结构导致的语义混淆(如将“wife”与“brother”的相似度从 0.98 降至 0.09),从而显著提升记忆检索准确率(在干净输入下达 94.4%)。

链接: https://arxiv.org/abs/2601.15324
作者: Mark Wind
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 28 pages

点击查看摘要

Abstract:Adding memory to pretrained language models typically requires architectural changes or weight modification. We present Prometheus Mind, which retrofits memory to a frozen Qwen3-4B using 11 modular adapters (530MB, 7% overhead) – fully reversible by removing the adapters. Building this system required solving four problems: (1) Extraction – we develop Contrastive Direction Discovery (CDD), which finds semantic directions via minimal pairs without labeled data. (2) Training – end-to-end optimization collapses; stage-wise training of each adapter on simple proxy tasks succeeds. (3) Injection – learned encoders fail to generalize; we find that this http URL rows already provide the mapping we need, requiring no training. (4) Hidden state collapse – transformers make “wife” and “brother” 0.98+ similar; we train projections to recover distinction (0.98 → 0.09). On PrometheusExtract-132 (132 cases), the system achieves 94.4% retrieval on clean inputs (n=54, 95% CI: [84.9%, 98.1%]), degrading to 19.4% on informal inputs with ellipsis, filler words, or implicit subjects (n=36). The primary bottleneck is relation classification (47.3% accuracy), responsible for most extraction errors.
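CDD(Contrastive Direction Discovery)的基本想法是用最小对(minimal pairs)的隐藏状态差估计语义方向,可以用 numpy 在玩具数据上演示。以下仅为示意,隐藏维度与数据构造均为本文假设,真实 CDD 的细节以论文为准:

```python
import numpy as np

def contrastive_direction(h_pos: np.ndarray, h_neg: np.ndarray) -> np.ndarray:
    """CDD 思想的最小示意:正/负最小对隐藏状态的均值差,归一化后
    作为该语义属性的方向向量。"""
    diff = h_pos.mean(axis=0) - h_neg.mean(axis=0)
    return diff / np.linalg.norm(diff)

# 玩具数据:假设隐藏维度为 16,某属性的最小对各 8 条
rng = np.random.default_rng(0)
true_dir = rng.standard_normal(16)
true_dir /= np.linalg.norm(true_dir)
h_pos = true_dir + rng.standard_normal((8, 16)) * 0.1   # 含属性句子的隐藏状态
h_neg = -true_dir + rng.standard_normal((8, 16)) * 0.1  # 不含属性的对照
d = contrastive_direction(h_pos, h_neg)
print(float(d @ true_dir))    # 应接近 1:方向被恢复
```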
zh

[AI-81] Aeon: High-Performance Neuro-Symbolic Memory Management for Long-Horizon LLM Agents

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在长上下文处理中面临的两大核心问题:一是自注意力机制带来的二次计算复杂度,二是“中间迷失”(Lost in the Middle)现象导致的推理能力随上下文窗口扩展而下降。现有基于向量数据库的扁平化检索增强生成(Flat RAG)架构无法捕捉长期交互中的层次与时间结构,从而引发“向量雾霾”(Vector Haze)——即检索到的事实缺乏情节连续性。其解决方案的关键在于提出Aeon,一个神经符号认知操作系统,将记忆重构为可管理的系统资源:通过Memory Palace(基于Atlas索引的时空结构化存储,结合小世界图导航与B+树式磁盘局部性以最小化读放大)和Trace(神经符号情景图)实现分层记忆组织,并引入语义旁路缓存(Semantic Lookaside Buffer, SLB)利用对话局部性实现亚毫秒级检索延迟,同时通过零拷贝C++/Python桥接保障状态一致性,从而为自主智能体提供持久且结构化的记忆支持。

链接: https://arxiv.org/abs/2601.15311
作者: Mustafa Arslan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are fundamentally constrained by the quadratic computational cost of self-attention and the “Lost in the Middle” phenomenon, where reasoning capabilities degrade as context windows expand. Existing solutions, primarily “Flat RAG” architectures relying on vector databases, treat memory as an unstructured bag of embeddings. This approach fails to capture the hierarchical and temporal structure of long-horizon interactions, leading to “Vector Haze”, the retrieval of disjointed facts lacking episodic continuity. We propose Aeon, a Neuro-Symbolic Cognitive Operating System that redefines memory not as a static store, but as a managed OS resource. Aeon structures memory into a Memory Palace (a spatial index implemented via Atlas, a SIMD-accelerated Page-Clustered Vector Index that combines small-world graph navigation with B+ Tree-style disk locality to minimize read amplification) and a Trace (a neuro-symbolic episodic graph). We introduce the Semantic Lookaside Buffer (SLB), a predictive caching mechanism that exploits conversational locality to achieve sub-millisecond retrieval latencies. Benchmarks demonstrate that Aeon achieves 1ms retrieval latency on conversational workloads while ensuring state consistency via a zero-copy C++/Python bridge, effectively enabling persistent, structured memory for autonomous agents.
zh

[AI-82] When Generative AI Meets Extended Reality: Enabling Scalable and Natural Interactions

【速读】:该论文旨在解决扩展现实(Extended Reality, XR)广泛应用受限的两大核心问题:一是3D内容创作成本高、复杂度大,尤其在大规模环境或复杂交互场景下;二是用户与XR系统交互方式不够直观,存在陡峭的学习曲线,如依赖手持控制器或预设手势。解决方案的关键在于引入生成式AI(Generative AI, GenAI),通过视觉-语言模型和基于扩散机制的内容生成技术,实现自然语言驱动的交互与自动化3D内容创建,从而显著降低XR系统的使用门槛,提升可扩展性与交互自然性。

链接: https://arxiv.org/abs/2601.15308
作者: Mingyu Zhu,Jiangong Chen,Bin Li
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted by IEEE Internet Computing (Oct. 2025); published in IEEE Xplore (Jan. 2026)

点击查看摘要

Abstract:Extended Reality (XR), including virtual, augmented, and mixed reality, provides immersive and interactive experiences across diverse applications, from VR-based education to AR-based assistance and MR-based training. However, widespread XR adoption remains limited due to two key challenges: 1) the high cost and complexity of authoring 3D content, especially for large-scale environments or complex interactions; and 2) the steep learning curve associated with non-intuitive interaction methods like handheld controllers or scripted gestures. Generative AI (GenAI) presents a promising solution by enabling intuitive, language-driven interaction and automating content generation. Leveraging vision-language models and diffusion-based generation, GenAI can interpret ambiguous instructions, understand physical scenes, and generate or manipulate 3D content, significantly lowering barriers to XR adoption. This paper explores the integration of XR and GenAI through three concrete use cases, showing how they address key obstacles in scalability and natural interaction, and identifying technical challenges that must be resolved to enable broader adoption.
zh

[AI-83] Uncovering Latent Bias in LLM -Based Emergency Department Triage Through Proxy Variables

【速读】:该论文试图解决生成式 AI(Generative AI)在急诊科分诊场景中因隐性偏见导致的歧视性行为问题,特别是模型如何通过患者层面的代理变量(proxy variables)对不同种族、社会经济背景等群体产生不公平的严重程度评估。解决方案的关键在于系统性地识别并量化这些代理变量对模型决策的影响,发现即使在输入上下文中出现中性或正向/负向标记词时,LLM也会无差别地改变对患者病情严重程度的判断,表明当前AI系统仍依赖于噪声大且非因果的信号进行推理,而非真实反映患者的临床紧迫性。因此,该研究强调需进一步改进训练数据与模型设计,以确保AI在临床环境中的安全与公平部署。

链接: https://arxiv.org/abs/2601.15306
作者: Ethan Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 3 figures

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have enabled their integration into clinical decision-making; however, hidden biases against patients across racial, social, economic, and clinical backgrounds persist. In this study, we investigate bias in LLM-based medical AI systems applied to emergency department (ED) triage. We employ 32 patient-level proxy variables, each represented by paired positive and negative qualifiers, and evaluate their effects using both public (MIMIC-IV-ED Demo, MIMIC-IV Demo) and restricted-access credentialed (MIMIC-IV-ED and MIMIC-IV) datasets as appropriate [mimiciv_ed_demo, mimiciv_ed, mimiciv]. Our results reveal discriminatory behavior mediated through proxy variables in ED triage scenarios, as well as a systematic tendency for LLMs to modify perceived patient severity when specific tokens appear in the input context, regardless of whether they are framed positively or negatively. These findings indicate that AI systems are still imperfectly trained on noisy, sometimes non-causal signals that do not reliably reflect true patient acuity. Consequently, more needs to be done to ensure the safe and responsible deployment of AI technologies in clinical settings.
zh

[AI-84] Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models

【速读】:该论文旨在解决长上下文语言模型中注意力机制带来的计算负担问题,以及传统稀疏注意力和门控注意力各自存在的局限性:前者虽能降低复杂度但可能丢失关键信息,后者虽提升训练稳定性并缓解注意力下沉(attention sink)现象,但未有效优化效率。解决方案的关键在于提出门控稀疏注意力(Gated Sparse Attention, GSA),其核心创新包括:1)引入带sigmoid激活的门控闪电索引器(gated lightning indexer),生成有界且可解释的选择得分;2)设计自适应稀疏控制器,根据局部不确定性动态调节关注的标记数量;3)在值(value)和输出阶段均采用双门控机制,从而同时实现高效计算与高质量表示学习。实验表明,GSA在保持稀疏基线12–16倍加速的同时,显著提升了模型性能(困惑度从6.03降至5.70,RULER分数翻倍,首token注意力从47%降至<4%),并极大改善训练稳定性(损失波动减少98%)。

链接: https://arxiv.org/abs/2601.15305
作者: Alfred Shen,Aaron Shen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 1 figure, attention mechanism, sparse attention, gating, long-context

点击查看摘要

Abstract:The computational burden of attention in long-context language models has motivated two largely independent lines of work: sparse attention mechanisms that reduce complexity by attending to selected tokens, and gated attention variants that improve training stability while mitigating the attention sink phenomenon. We observe that these approaches address complementary weaknesses and propose Gated Sparse Attention (GSA), an architecture that realizes the benefits of both. GSA incorporates a gated lightning indexer with sigmoid activations that produce bounded, interpretable selection scores, an adaptive sparsity controller that modulates the number of attended tokens based on local uncertainty, and dual gating at the value and output stages. We establish theoretical foundations for the approach, including complexity analysis, expressiveness results, and convergence guarantees. In experiments with 1.7B parameter models trained on 400B tokens, GSA matches the efficiency of sparse-only baselines (12-16x speedup at 128K context) while achieving the quality gains associated with gated attention: perplexity improves from 6.03 to 5.70, RULER scores at 128K context nearly double, and attention to the first token, a proxy for attention sinks, drops from 47% to under 4%. Training stability improves markedly, with loss spikes reduced by 98%.
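带 sigmoid 的门控索引器加自适应稀疏控制,可以用如下 PyTorch 片段示意:sigmoid 给出有界选择得分,再按得分分布的熵(局部不确定性)决定保留的 token 数 k。维度与 k 的上下界均为本文假设,值/输出双门控从略:

```python
import torch
import torch.nn as nn

class GatedIndexer(nn.Module):
    """门控索引器的最小示意(非论文官方实现):
    sigmoid 产生有界选择得分,得分熵越高保留的 token 越多。"""

    def __init__(self, d_model=64, k_min=4, k_max=32):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.k_min, self.k_max = k_min, k_max

    def forward(self, query, keys):              # query: [d], keys: [T, d]
        scores = torch.sigmoid(
            self.k_proj(keys) @ self.q_proj(query) / keys.shape[-1] ** 0.5)
        p = scores / scores.sum()
        entropy = -(p * p.clamp_min(1e-8).log()).sum()     # 局部不确定性
        frac = (entropy / torch.log(torch.tensor(float(len(keys))))).item()
        k = min(max(int(frac * self.k_max), self.k_min), len(keys))
        top = scores.topk(k)
        return top.indices, top.values            # 交给稀疏注意力计算

idx = GatedIndexer()
sel, w = idx(torch.randn(64), torch.randn(128, 64))
print(sel.shape, bool(w.min() >= 0))
```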
zh

[AI-85] A Mobile Application Front-End for Presenting Explainable AI Results in Diabetes Risk Estimation

【速读】:该论文旨在解决糖尿病在印度尼西亚日益增长的健康挑战中,现有基于人工智能(AI)的早期检测工具普遍缺乏透明性的问题。这些工具常被视为“黑箱”,难以解释其预测依据,从而限制了用户对风险评估结果的信任与理解。解决方案的关键在于开发一款移动应用前端,通过可解释人工智能(XAI)技术——特别是SHAP(SHapley Additive exPlanations)方法——将复杂的模型输出转化为直观的图形可视化(如柱状图和饼图)以及由GPT-4o生成的个性化文本叙述,从而提升非专业用户的理解能力与决策支持效果。

链接: https://arxiv.org/abs/2601.15292
作者: Bernardus Willson,Henry Anand Septian Radityo,Reynard Tanadi,Latifa Dwiyanti,Saiful Akbar
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: This paper was accepted and presented at the 2025 IEEE International Conference on Data and Software Engineering (ICoDSE) on 29 October 2025 in Batam, Indonesia, and is currently awaiting publication

点击查看摘要

Abstract:Diabetes is a significant and continuously rising health challenge in Indonesia. Although many artificial intelligence (AI)-based health applications have been developed for early detection, most function as “black boxes,” lacking transparency in their predictions. Explainable AI (XAI) methods offer a solution, yet their technical outputs are often incomprehensible to non-expert users. This research aims to develop a mobile application front-end that presents XAI-driven diabetes risk analysis in an intuitive, understandable format. Development followed the waterfall methodology, comprising requirements analysis, interface design, implementation, and evaluation. Based on user preference surveys, the application adopts two primary visualization types - bar charts and pie charts - to convey the contribution of each risk factor. These are complemented by personalized textual narratives generated via integration with GPT-4o. The application was developed natively for Android using Kotlin and Jetpack Compose. The resulting prototype interprets SHAP (SHapley Additive exPlanations), a key XAI approach, into accessible graphical visualizations and narratives. Evaluation through user comprehension testing (Likert scale and interviews) and technical functionality testing confirmed the research objectives were met. The combination of visualization and textual narrative effectively enhanced user understanding (average score 4.31/5) and empowered preventive action, supported by a 100% technical testing success rate.
zh

[AI-86] Agentic Persona Control and Task State Tracking for Realistic User Simulation in Interactive Scenarios NEURIPS2025

【速读】:该论文旨在解决大规模测试对话式人工智能(Conversational AI)系统时面临的挑战,即如何在多样化的领域中生成真实且多样的用户交互,以捕捉广泛的用户行为模式。其核心问题在于现有方法难以模拟人类认知过程中的目标导向性、情境感知性和行为多样性,从而限制了对AI系统的全面评估。解决方案的关键在于提出一个基于多智能体(Multi-Agent)的框架,通过三个专业化代理协同工作:用户代理(User Agent)负责整体交互调度,状态跟踪代理(State Tracking Agent)维护结构化的任务状态,消息属性生成代理(Message Attributes Generation Agent)依据任务进展和预设人格(Persona)控制对话属性。该设计实现了对人类认知过程的可解释模拟,显著提升了仿真质量,在人格一致性、任务完成准确率、可解释性和真实性等方面优于单一大语言模型(Single-LLM)基线。

链接: https://arxiv.org/abs/2601.15290
作者: Hareeshwar Karthikeyan
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: - Accepted to 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Scaling Environments for Agents (SEA) - Paper contains 12 pages with 3 figures and 3 tables

点击查看摘要

Abstract:Testing conversational AI systems at scale across diverse domains necessitates realistic and diverse user interactions capturing a wide array of behavioral patterns. We present a novel multi-agent framework for realistic, explainable human user simulation in interactive scenarios, using persona control and task state tracking to mirror human cognitive processes during goal-oriented conversations. Our system employs three specialized AI agents: (1) a User Agent to orchestrate the overall interaction, (2) a State Tracking Agent to maintain structured task state, and (3) a Message Attributes Generation Agent that controls conversational attributes based on task progress and assigned persona. To validate our approach, we implement and evaluate the framework for guest ordering at a restaurant with scenarios rich in task complexity, behavioral diversity, and conversational ambiguity. Through systematic ablations, we evaluate the contributory efficacy of each agentic component to overall simulation quality in terms of persona adherence, task completion accuracy, explainability, and realism. Our experiments demonstrate that the complete multi-agent system achieves superior simulation quality compared to single-LLM baselines, with significant gains across all evaluation metrics. This framework establishes a powerful environment for orchestrating agents to simulate human users with cognitive plausibility, decomposing the simulation into specialized sub-agents that reflect distinct aspects of human thought processes applicable across interactive domains.
zh

[AI-87] LLM-based Multimodal Feedback Produces Equivalent Learning and Better Student Perceptions than Educator Feedback

【速读】:该论文旨在解决大规模教育场景下提供及时、精准且多模态反馈的挑战,以提升学生的学习效果、参与度和反馈感知质量。其解决方案的关键在于构建一个实时AI驱动的多模态反馈系统,该系统整合结构化文本解释与动态多媒体资源(包括最相关的幻灯片页面引用和流式AI语音讲解),从而实现可扩展、情境感知的个性化支持,显著降低教师工作负担的同时优化学生学习体验。

链接: https://arxiv.org/abs/2601.15280
作者: Chloe Qianhui Zhao,Jie Cao,Jionghao Lin,Kenneth R. Koedinger
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 11 pages, to be published at the 16th International Learning Analytics Knowledge Conference (LAK '26)

点击查看摘要

Abstract:Providing timely, targeted, and multimodal feedback helps students quickly correct errors, build deep understanding and stay motivated, yet making it at scale remains a challenge. This study introduces a real-time AI-facilitated multimodal feedback system that integrates structured textual explanations with dynamic multimedia resources, including the retrieved most relevant slide page references and streaming AI audio narration. In an online crowdsourcing experiment, we compared this system against fixed business-as-usual feedback by educators across three dimensions: (1) learning effectiveness, (2) learner engagement, (3) perceived feedback quality and value. Results showed that AI multimodal feedback achieved learning gains equivalent to original educator feedback while significantly outperforming it on perceived clarity, specificity, conciseness, motivation, satisfaction, and reducing cognitive load, with comparable correctness, trust, and acceptance. Process logs revealed distinct engagement patterns: for multiple-choice questions, educator feedback encouraged more submissions; for open-ended questions, AI-facilitated targeted suggestions lowered revision barriers and promoted iterative improvement. These findings highlight the potential of AI multimodal feedback to provide scalable, real-time, and context-aware support that both reduces instructor workload and enhances student experience.
zh

[AI-88] THOR: A Versatile Foundation Model for Earth Observation, Climate and Society Applications

【速读】:该论文旨在解决当前地球观测基础模型在架构上的刚性限制,以及对异构传感器数据处理能力不足和固定图像块(patch)尺寸带来的部署灵活性问题,这些问题制约了模型在现实场景中根据计算资源与精度需求进行动态调整的能力。解决方案的关键在于提出THOR——首个“计算自适应”(compute-adaptive)基础模型架构,其核心创新是通过一种新颖的随机化图像块和输入图像尺寸预训练策略,使单一预训练权重集能够在推理阶段以任意图像块大小灵活部署,从而实现无需重新训练即可动态平衡计算成本与特征分辨率的优化目标,同时首次统一处理来自Copernicus Sentinel-1、-2和-3(OLCI、SLSTR)卫星的多源异构遥感数据(分辨率范围从10米到1000米)。

链接: https://arxiv.org/abs/2601.16011
作者: Theodor Forgaard,Jarle H. Reksten,Anders U. Waldeland,Valerio Marsocci,Nicolas Longépé,Michael Kampffmeyer,Arnt-Børre Salberg
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
备注: 25 pages

点击查看摘要

Abstract:Current Earth observation foundation models are architecturally rigid, struggle with heterogeneous sensors and are constrained to fixed patch sizes. This limits their deployment in real-world scenarios requiring flexible compute-accuracy trade-offs. We propose THOR, a “compute-adaptive” foundation model that solves both input heterogeneity and deployment rigidity. THOR is the first architecture to unify data from Copernicus Sentinel-1, -2, and -3 (OLCI, SLSTR) satellites, processing their native 10 m to 1000 m resolutions in a single model. We pre-train THOR with a novel randomized patch and input image size strategy. This allows a single set of pre-trained weights to be deployed at inference with any patch size, enabling a dynamic trade-off between computational cost and feature resolution without retraining. We pre-train THOR on THOR Pretrain, a new, large-scale multi-sensor dataset, and demonstrate state-of-the-art performance on downstream benchmarks, particularly in data-limited regimes like the PANGAEA 10% split, validating that THOR’s flexible feature generation excels for diverse climate and society applications.
zh

[AI-89] Progressive Power Homotopy for Non-convex Optimization

【速读】:该论文旨在解决非凸优化问题中,如何在复杂且存在大量局部最优的优化景观(non-convex landscape)中高效收敛至全局最优解或其近似解的问题。针对此类问题,现有标准一阶方法(如随机梯度上升)常因陷入局部最优而失效。论文提出了一种新颖的一阶优化方法——渐进幂同伦法(Progressive Power Homotopy, Prog-PowerHP),其核心在于构造一个带幂变换(power transformation)和高斯平滑(Gaussian smoothing)的代理目标函数 $ F_{N,\sigma}(\boldsymbol{\mu}) = \mathbb{E}_{\mathbf{w} \sim \mathcal{N}(\boldsymbol{\mu}, \sigma^2 I_d),\, \mathbf{x} \sim \mathcal{D}}[e^{N f_{\mathbf{w}}(\mathbf{x})}] $,并在优化过程中逐步增加幂参数 $ N $ 并减小平滑尺度 $ \sigma $,从而引导优化路径逐渐逼近原问题的全局最优解。理论分析表明,在温和正则性条件下,该方法具有近乎 $ O(d^2 \varepsilon^{-2}) $ 的迭代复杂度,实验证明其在相位恢复和欠参数化两层神经网络训练等场景下显著优于传统方法,尤其适用于接近信息论极限的低样本比情形。

链接: https://arxiv.org/abs/2601.15915
作者: Chen Xu
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We propose a novel first-order method for non-convex optimization of the form $\max_{\mathbf{w}\in\mathbb{R}^d}\mathbb{E}_{\mathbf{x}\sim\mathcal{D}}[f_{\mathbf{w}}(\mathbf{x})]$, termed Progressive Power Homotopy (Prog-PowerHP). The method applies stochastic gradient ascent to a surrogate objective obtained by first performing a power transformation and then Gaussian smoothing, $F_{N,\sigma}(\boldsymbol{\mu}) := \mathbb{E}_{\mathbf{w}\sim\mathcal{N}(\boldsymbol{\mu},\sigma^2 I_d),\,\mathbf{x}\sim\mathcal{D}}[e^{N f_{\mathbf{w}}(\mathbf{x})}]$, while progressively increasing the power parameter $N$ and decreasing the smoothing scale $\sigma$ along the optimization trajectory. We prove that, under mild regularity conditions, Prog-PowerHP converges to a small neighborhood of the global optimum with an iteration complexity scaling nearly as $O(d^2\varepsilon^{-2})$. Empirically, Prog-PowerHP demonstrates clear advantages in phase retrieval when the samples-to-dimension ratio approaches the information-theoretic limit, and in training two-layer neural networks in under-parameterized regimes. These results suggest that Prog-PowerHP is particularly effective for navigating cluttered non-convex landscapes where standard first-order methods struggle.
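对代理目标 $ \log F_{N,\sigma}(\boldsymbol{\mu}) $ 做随机梯度上升,同时增大 $ N $、减小 $ \sigma $,可以用自归一化重要性加权写成几十行 numpy。以下的 $ N $、$ \sigma $ 日程与梯度估计量都是本文假设的简化版,仅用于演示同伦思路,并非论文的官方算法:

```python
import numpy as np

def prog_power_hp(f, dim, steps=300, lr=0.1, N0=1.0, sigma0=1.0,
                  n_samples=64, seed=0):
    """Prog-PowerHP 思路的最小示意:对 log F_{N,sigma}(mu) 做随机梯度上升,
    其中 F_{N,sigma}(mu) = E_{w~N(mu, sigma^2 I)}[exp(N f(w))]。"""
    rng = np.random.default_rng(seed)
    mu = rng.standard_normal(dim)
    for t in range(steps):
        N = N0 * (1.0 + t / 50.0)                 # 渐进增大幂参数
        sigma = max(sigma0 * 0.99 ** t, 0.05)     # 渐进减小平滑尺度
        w = mu + sigma * rng.standard_normal((n_samples, dim))
        vals = np.array([f(wi) for wi in w])
        logits = N * vals - (N * vals).max()      # 数值稳定
        weights = np.exp(logits) / np.exp(logits).sum()
        grad = (weights[:, None] * (w - mu)).sum(0) / sigma ** 2
        mu += lr * sigma ** 2 * grad              # 即朝加权样本均值移动
    return mu

# 玩具非凸目标:全局最优在 (1, 1),另在 (-1, -1) 附近有一个小的局部凸起
def f(w):
    return -np.sum((w - 1) ** 2) + 0.5 * np.exp(-8 * np.sum((w + 1) ** 2))

print(np.round(prog_power_hp(f, dim=2), 2))      # 应接近 [1. 1.]
```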
zh

[AI-90] Low-Dimensional Adaptation of Rectified Flow: A New Perspective through the Lens of Diffusion and Stochastic Localization

【速读】:该论文旨在解决生成式模型中采样效率低的问题,特别是如何利用目标分布支撑集的内在低维结构来加速采样过程。其核心贡献在于证明了修正流(Rectified Flow, RF)在采用精心设计的时间离散化方案并具备足够精确的漂移估计时,可实现迭代复杂度为 $ O(k/\varepsilon) $(含对数因子)的采样性能,其中 $ \varepsilon $ 为总变差距离精度,$ k $ 为目标分布的内在维度。解决方案的关键在于通过理论分析揭示RF对低维流形的自动适应性,并进一步建立扩散概率模型(DDPM)与随机修正流之间的新联系,据此提出一种鲁棒性更强的随机RF采样器,在较低漂移估计精度要求下仍能保持低维适应能力,并结合特定时间调度策略提升实际性能。

链接: https://arxiv.org/abs/2601.15500
作者: Saptarshi Roy,Alessandro Rinaldo,Purnamrita Sarkar
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
备注: 32 pages, 7 figures

点击查看摘要

Abstract:In recent years, Rectified flow (RF) has gained considerable popularity largely due to its generation efficiency and state-of-the-art performance. In this paper, we investigate the degree to which RF automatically adapts to the intrinsic low dimensionality of the support of the target distribution to accelerate sampling. We show that, using a carefully designed choice of the time-discretization scheme and with sufficiently accurate drift estimates, the RF sampler enjoys an iteration complexity of order $O(k/\varepsilon)$ (up to log factors), where $\varepsilon$ is the precision in total variation distance and $k$ is the intrinsic dimension of the target distribution. In addition, we show that the denoising diffusion probabilistic model (DDPM) procedure is equivalent to a stochastic version of RF by establishing a novel connection between these processes and stochastic localization. Building on this connection, we further design a stochastic RF sampler that also adapts to the low-dimensionality of the target distribution under milder requirements on the accuracy of the drift estimates, and also with a specific time schedule. We illustrate with simulations on the synthetic data and text-to-image data experiments the improved performance of the proposed samplers implementing the newly designed time-discretization schedules.
zh

[AI-91] OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation

【速读】:该论文旨在解决当前视觉编码器在图像理解(image understanding)与图像生成(image generation)任务中通常需要独立设计、难以共享统一表征的问题。其解决方案的关键在于提出了一种名为OpenVision 3的统一视觉编码器架构,该架构通过将VAE压缩后的图像潜在表示输入至ViT编码器,并在共享潜在空间中联合优化两种互补目标:一是利用ViT-VAE解码器进行图像重建以学习生成结构;二是结合对比学习和图文对齐任务强化语义特征。这种双驱动机制使编码器能够同时具备强大的语义理解和生成能力,从而实现跨任务的良好泛化性能。

链接: https://arxiv.org/abs/2601.15369
作者: Letian Zhang,Sucheng Ren,Yanqing Liu,Xianhang Li,Zeyu Wang,Yuyin Zhou,Huaxiu Yao,Zeyu Zheng,Weili Nie,Guilin Liu,Zhiding Yu,Cihang Xie
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents a family of advanced vision encoder, named OpenVision 3, that learns a single, unified visual representation that can serve both image understanding and image generation. Our core architecture is simple: we feed VAE-compressed image latents to a ViT encoder and train its output to support two complementary roles. First, the encoder output is passed to the ViT-VAE decoder to reconstruct the original image, encouraging the representation to capture generative structure. Second, the same representation is optimized with contrastive learning and image-captioning objectives, strengthening semantic features. By jointly optimizing reconstruction- and semantics-driven signals in a shared latent space, the encoder learns representations that synergize and generalize well across both regimes. We validate this unified design through extensive downstream evaluations with the encoder frozen. For multimodal understanding, we plug the encoder into the LLaVA-1.5 framework: it performs comparably with a standard CLIP vision encoder (e.g., 62.4 vs 62.2 on SeedBench, and 83.7 vs 82.9 on POPE). For generation, we test it under the RAE framework: ours substantially surpasses the standard CLIP-based encoder (e.g., gFID: 1.89 vs 2.54 on ImageNet). We hope this work can spur future research on unified modeling.

[AI-92] Q-Probe: Scaling Image Quality Assessment to High Resolution via Context-Aware Agentic Probing

【Quick Read】: This paper addresses two problems: existing reinforcement learning (RL)-based image quality assessment (IQA) models struggle to capture subtle local degradations in high-resolution scenarios, and directly adapting the "Thinking with Images" paradigm introduces a spurious "cropping-implies-degradation" bias and misinterprets natural depth-of-field as artifacts. The key to the solution is the Q-Probe framework, which introduces context-aware probing together with a three-stage training paradigm that progressively aligns the model with human preferences while eliminating causal bias through a novel context-aware cropping strategy, enabling accurate identification and assessment of fine-grained local degradations in high-resolution images.

Link: https://arxiv.org/abs/2601.15356
Authors: Xiang Li, XueHeng Li, Yu Wang, XuanHua He, ZhangChi Hu, WeiWei Yu, ChengJun Xie
Institution: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement Learning (RL) has empowered Multimodal Large Language Models (MLLMs) to achieve superior human preference alignment in Image Quality Assessment (IQA). However, existing RL-based IQA models typically rely on coarse-grained global views, failing to capture subtle local degradations in high-resolution scenarios. While emerging “Thinking with Images” paradigms enable multi-scale visual perception via zoom-in mechanisms, their direct adaptation to IQA induces spurious “cropping-implies-degradation” biases and misinterprets natural depth-of-field as artifacts. To address these challenges, we propose Q-Probe, the first agentic IQA framework designed to scale IQA to high resolution via context-aware probing. First, we construct Vista-Bench, a pioneering benchmark tailored for fine-grained local degradation analysis in high-resolution IQA settings. Furthermore, we propose a three-stage training paradigm that progressively aligns the model with human preferences, while simultaneously eliminating causal bias through a novel context-aware cropping strategy. Extensive experiments demonstrate that Q-Probe achieves state-of-the-art performance in high-resolution settings while maintaining superior efficacy across resolution scales.

[AI-93] OmniSpectra: A Unified Foundation Model for Native Resolution Astronomical Spectra

【Quick Read】: This paper addresses the limitations of current astronomical-spectra foundation models when handling multi-source, heterogeneous spectral data: existing models are tied to fixed-length inputs, specific wavelength ranges, and particular instrument configurations, making it hard to exploit large-scale, cross-survey spectra. The key to the solution is OmniSpectra, the first native-resolution foundation model, whose core innovations include adaptive patching to support spectra of arbitrary length, sinusoidal global wavelength encoding to align wavelengths across instruments, local positional embeddings via depthwise convolutions, and validity-aware self-attention masks, which together learn multi-scale spatial patterns while skipping attention over invalid patches. This design lets the model learn jointly from multiple real-world spectroscopic surveys, markedly improving zero-shot generalization and reaching state-of-the-art performance on tasks such as star and galaxy classification, redshift estimation, and property prediction.

Link: https://arxiv.org/abs/2601.15351
Authors: Md Khairul Islam, Judy Fox
Institution: Unknown
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present OmniSpectra, the first native-resolution foundation model for astronomy spectra. Unlike traditional models, which are limited to fixed-length input sizes or configurations, OmniSpectra handles spectra of any length at their original size, without resampling or interpolation. Despite the large-scale spectroscopic data from diverse surveys fueling the rapid growth of astronomy, existing foundation models are limited to a fixed wavelength range and specific instruments. OmniSpectra is the first foundation model to learn simultaneously from multiple real-world spectra surveys with different configurations at a large scale. We achieve this by designing a novel architecture with adaptive patching across variable lengths, sinusoidal global wavelength encoding, local positional embeddings through depthwise convolution, and validity-aware self-attention masks, allowing us to learn multi-scale spatial patterns while skipping attention for invalid patches. Even with limited training examples, OmniSpectra demonstrates excellent zero-shot generalization compared to methods tailored for specific tasks. This transfer learning capability makes this model the state-of-the-art across various astronomy tasks, including source classification, redshift estimation, and properties prediction for stars and galaxies. OmniSpectra reduces the need for training individual models for different tasks from scratch, establishing itself as the next-generation astronomy foundation model.
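As a rough illustration of how adaptive patching plus a validity mask lets variable-length inputs share one interface (the paper's actual operators are not given in the abstract; `patch_with_validity` and its parameters are hypothetical):

```python
import numpy as np

def patch_with_validity(flux: np.ndarray, patch: int = 16):
    """Split a variable-length spectrum into fixed-width patches plus a validity mask.

    Returns (patches, valid): patches has shape (n_patches, patch); valid[i] is
    False for patches made entirely of padding, so attention can skip them.
    """
    n = len(flux)
    n_patches = -(-n // patch)                      # ceiling division
    padded = np.zeros(n_patches * patch, dtype=flux.dtype)
    padded[:n] = flux
    patches = padded.reshape(n_patches, patch)
    valid = (np.arange(n_patches) * patch) < n      # at least one real sample
    return patches, valid

# Two surveys with different native lengths share the same interface.
short_spec = np.random.rand(1000)    # e.g., a low-resolution survey
long_spec = np.random.rand(8573)     # e.g., a high-resolution survey
for spec in (short_spec, long_spec):
    p, v = patch_with_validity(spec)
    print(p.shape, int(v.sum()), "valid patches")
```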

[AI-94] Learning Discrete Successor Transitions in Continuous Attractor Networks: Emergence Limits and Topological Constraints

【Quick Read】: This paper investigates whether continuous attractor networks (CANs) can still learn stable attractor-state transition dynamics when no external displacement signal is provided. Canonical CAN models rely on continuous displacement signals from sensorimotor systems (such as angular velocity) to drive states along the attractor manifold, and it has been unclear whether reliable attractor dynamics can form spontaneously when such signals are unavailable. The key to the solution is an experimental training framework that enforces stability over long free-run periods, steering the network away from the "shortcut" solutions that local learning produces by default (impulse-driven associative solutions that look accurate in the short term but lack persistent attractor dynamics) and toward genuine attractor-based transitions. The study finds that attractor dynamics with persistent stability emerge only under extended evaluation windows, and that topology strictly limits learning capacity: a ring topology achieves stable transitions over long horizons, while a folded snake topology faces a geometric limit at manifold discontinuities that neither curriculum learning nor basal ganglia-inspired gating can fully overcome.

Link: https://arxiv.org/abs/2601.15336
Authors: Daniel Brownell
Institution: Unknown
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: An open-source reference implementation is available at this https URL

Abstract:Continuous attractor networks (CANs) are a well-established class of models for representing low-dimensional continuous variables such as head direction, spatial position, and phase. In canonical spatial domains, transitions along the attractor manifold are driven by continuous displacement signals, such as angular velocity, provided by sensorimotor systems external to the CAN itself. When such signals are not explicitly provided as dedicated displacement inputs, it remains unclear whether attractor-based circuits can reliably acquire recurrent dynamics that support stable state transitions, or whether alternative predictive strategies dominate. In this work, we present an experimental framework for training CANs to perform successor-like transitions between stable attractor states in the absence of externally provided displacement signals. We compare two recurrent topologies, a circular ring and a folded snake manifold, and systematically vary the temporal regime under which stability is evaluated. We find that, under short evaluation windows, networks consistently converge to impulse-driven associative solutions that achieve high apparent accuracy yet lack persistent attractor dynamics. Only when stability is explicitly enforced over extended free-run periods do genuine attractor-based transition dynamics emerge. This suggests that shortcut solutions are the default outcome of local learning in recurrent networks, while attractor dynamics represent a constrained regime rather than a generic result. Furthermore, we demonstrate that topology strictly limits the capacity for learned transitions. While the continuous ring topology achieves perfect stability over long horizons, the folded snake topology hits a geometric limit characterized by failure at manifold discontinuities, which neither curriculum learning nor basal ganglia-inspired gating can fully overcome.

[AI-95] ECGomics: An Open Platform for AI-ECG Digital Biomarker Discovery

【Quick Read】: This paper aims to resolve a long-standing dichotomy in conventional electrocardiogram (ECG) analysis: expert-defined features are interpretable but insufficiently sensitive to latent patterns, while deep learning methods are accurate but act as data-hungry black boxes. The key to the solution is "ECGomics", a systematic paradigm and open-source platform that deconstructs cardiac signals along four dimensions, Structural, Intensity, Functional, and Comparative, fusing expert-set morphological rules with data-driven latent representations. This effectively bridges the gap between handcrafted features and deep learning embeddings, unifying diagnostic precision, interpretability, and data efficiency.

Link: https://arxiv.org/abs/2601.15326
Authors: Deyun Zhang, Jun Li, Shijia Geng, Yue Wang, Shijie Chen, Sumei Fan, Qinghao Zha, Shenda Hong
Institution: Unknown
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Comments:

Abstract:Background: Conventional electrocardiogram (ECG) analysis faces a persistent dichotomy: expert-driven features ensure interpretability but lack sensitivity to latent patterns, while deep learning offers high accuracy but functions as a black box with high data dependency. We introduce ECGomics, a systematic paradigm and open-source platform for the multidimensional deconstruction of cardiac signals into digital biomarkers. Methods: Inspired by the taxonomic rigor of genomics, ECGomics deconstructs cardiac activity across four dimensions: Structural, Intensity, Functional, and Comparative. This taxonomy synergizes expert-defined morphological rules with data-driven latent representations, effectively bridging the gap between handcrafted features and deep learning embeddings. Results: We operationalized this framework into a scalable ecosystem consisting of a web-based research platform and a mobile-integrated solution (this https URL). The web platform facilitates high-throughput analysis via precision parameter configuration, high-fidelity data ingestion, and 12-lead visualization, allowing for the systematic extraction of biomarkers across the four ECGomics dimensions. Complementarily, the mobile interface, integrated with portable sensors and a cloud-based engine, enables real-time signal acquisition and near-instantaneous delivery of structured diagnostic reports. This dual-interface architecture successfully transitions ECGomics from theoretical discovery to decentralized, real-world health management, ensuring professional-grade monitoring in diverse clinical and home-based settings. Conclusion: ECGomics harmonizes diagnostic precision, interpretability, and data efficiency. By providing a deployable software ecosystem, this paradigm establishes a robust foundation for digital biomarker discovery and personalized cardiovascular medicine.

[AI-96] Large Language Models as Simulative Agents for Neurodivergent Adult Psychometric Profiles

【Quick Read】: The problem this paper tackles is that standardized psychometric instruments for assessing adult neurodivergence (such as ADHD, high-functioning autism spectrum disorder (ASD), and Cognitive Disengagement Syndrome (CDS)) lack discriminant validity because of heavy symptom overlap, and struggle to capture fine-grained individual differences in neurodevelopmental traits. The key to the solution is to use large language models (LLMs) grounded in structured qualitative interview content to simulate real individuals' psychological profiles, generate responses to psychometric questionnaires, and validate their accuracy, stability, and sensitivity to variations in trait intensity. The study finds that both LLMs tested, GPT-4o and Qwen3-235B-A22B, simulate responses to the ASRS, BAARS-IV, and RAADS-R significantly above random-response levels, indicating that interview-grounded LLM simulation has potential for use as synthetic participants in early-stage psychometric research, though limitations remain on some subscales (such as Attention to Detail on the AQ).

Link: https://arxiv.org/abs/2601.15319
Authors: Francesco Chiappone, Davide Marocco, Nicola Milano
Institution: Unknown
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Adult neurodivergence, including Attention-Deficit/Hyperactivity Disorder (ADHD), high-functioning Autism Spectrum Disorder (ASD), and Cognitive Disengagement Syndrome (CDS), is marked by substantial symptom overlap that limits the discriminant sensitivity of standard psychometric instruments. While recent work suggests that Large Language Models (LLMs) can simulate human psychometric responses from qualitative data, it remains unclear whether they can accurately and stably model neurodevelopmental traits rather than broad personality characteristics. This study examines whether LLMs can generate psychometric responses that approximate those of real individuals when grounded in a structured qualitative interview, and whether such simulations are sensitive to variations in trait intensity. Twenty-six adults completed a 29-item open-ended interview and four standardized self-report measures (ASRS, BAARS-IV, AQ, RAADS-R). Two LLMs (GPT-4o and Qwen3-235B-A22B) were prompted to infer an individual psychological profile from interview content and then respond to each questionnaire in-role. Accuracy, reliability, and sensitivity were assessed using group-level comparisons, error metrics, exact-match scoring, and a randomized baseline. Both models outperformed random responses across instruments, with GPT-4o showing higher accuracy and reproducibility. Simulated responses closely matched human data for ASRS, BAARS-IV, and RAADS-R, while the AQ revealed subscale-specific limitations, particularly in Attention to Detail. Overall, the findings indicate that interview-grounded LLMs can produce coherent and above-chance simulations of neurodevelopmental traits, supporting their potential use as synthetic participants in early-stage psychometric research, while highlighting clear domain-specific constraints.

[AI-97] Beyond the Einstein-Bohr Debate: Cognitive Complementarity and the Emergence of Quantum Intuition

【Quick Read】: This paper asks how quantum complementarity can be theoretically unified across foundational physics and cognitive science, and in particular whether it is an ontological claim about quantum reality or an epistemic principle constraining what knowledge can be accessed and represented. The key to the solution is "cognitive complementarity", a structural principle for reasoning under non-classical uncertainty in which mutually constraining cognitive representations cannot be jointly optimized; on this basis, "quantum intuition" is defined as a testable cognitive capacity: the ability to sustain representational plurality, regulate the timing of commitment, and resolve perspective incompatibilities in a context-sensitive way. Grounded in shared informational constraints, the framework provides a naturalistic bridge between quantum measurement theory and cognitive science.

Link: https://arxiv.org/abs/2601.15314
Authors: Lalit Kumar Shukla
Institution: Unknown
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Comments: This interdisciplinary work bridges quantum foundations and cognitive science, proposing a formal extension of complementarity into cognitive reasoning and introducing the testable construct of quantum intuition

Abstract:Recent high-precision experimental confirmations of quantum complementarity have revitalized foundational debates about measurement, description, and realism. This article argues that complementarity is most productively interpreted as an epistemic principle, constraining what can be simultaneously accessed and represented, rather than as an ontological claim about quantum reality. Reexamining the Einstein-Bohr debate through this lens reveals a persistent tension between descriptive completeness and contextual meaning, a tension experiments clarify but do not dissolve. Building on this analysis, we introduce cognitive complementarity as a structural principle governing reasoning under non-classical uncertainty, where mutually constraining representations cannot be jointly optimized. Within this framework, we propose quantum intuition as a testable cognitive capacity: the ability to sustain representational plurality, regulate commitment timing, and resolve perspective-incompatibilities in a context-sensitive manner. Formulated as a naturalistic construct grounded in shared informational constraints, quantum intuition offers a principled bridge between quantum measurement theory and cognition. This work reframes the historical debate, extends epistemic lessons from quantum foundations into cognitive science, and outlines empirical pathways for studying decision-making in contexts of irreducible uncertainty.

[AI-98] Mind the Gap: Why Neural Memory Fails Under Semantic Density

【Quick Read】: This paper targets the "Stability Gap" that current AI architectures face in online learning: storing specific episodic facts in a neural network disrupts the stability of general semantic knowledge. Existing models rely on shared continuous parameters for both storage and inference, so performance collapses after only a handful of semantically related facts; the paper explains this via the "Orthogonality Constraint": write-time interference arises when writing and retrieval share the same parameter space. The key to the solution is "Knowledge Objects" (KOs), discrete, typed memory units with explicit version chains that work alongside conventional neural weights to form a true Complementary Learning Systems (CLS) design, separating stable episodic memory from continual semantic learning.

Link: https://arxiv.org/abs/2601.15313
Authors: Matt Beton, Simran Chana
Institution: Unknown
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: 24 pages, 5 figures

Abstract:The brain solves a problem that current AI architectures struggle to manage: storing specific episodic facts without corrupting general semantic knowledge. Neuroscience explains this through Complementary Learning Systems theory - a fast hippocampal system for episodic storage using pattern-separated representations, and a slow neocortical system for extracting statistical regularities. Current AI systems lack this separation, attempting both functions through neural weights alone. We identify the 'Stability Gap' in online neural memory: fast-weight mechanisms that write facts into shared continuous parameters collapse to near-random accuracy within tens of semantically related facts. Through semantic density (rho), we show collapse occurs with as few as N=5 facts at high density (rho > 0.6) or N ~ 20-75 at moderate density, a phenomenon we formalise as the Orthogonality Constraint. This failure persists even with perfect attention and unlimited context, arising from write-time interference when storage and retrieval share the same substrate. We also identify schema drift and version ambiguity as primary failure modes in production systems, observing 40-70% schema consistency and 0-100% clean correction rates. Context-based memory incurs 30-300% cost premium over selective retrieval. We propose Knowledge Objects (KOs): discrete, typed memory units with controlled vocabularies and explicit version chains. Paired with neural weights, KOs enable a true complementary learning architecture, suggesting reliable AI memory may require this bicameral design.
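The abstract characterizes KOs only by their properties (discrete, typed, explicit version chains); a minimal sketch of such a memory unit, with all field names hypothetical, might look like this:

```python
from dataclasses import dataclass, field
from typing import Optional
import time

@dataclass(frozen=True)
class KnowledgeObject:
    """A discrete, typed memory unit with an explicit version chain."""
    subject: str
    relation: str            # drawn from a controlled vocabulary in the paper's design
    value: str
    version: int = 1
    supersedes: Optional["KnowledgeObject"] = None   # previous version, if any
    created_at: float = field(default_factory=time.time)

    def update(self, new_value: str) -> "KnowledgeObject":
        """Write a correction as a new version instead of overwriting in place."""
        return KnowledgeObject(self.subject, self.relation, new_value,
                               version=self.version + 1, supersedes=self)

ko1 = KnowledgeObject("Paris", "capital_of", "France")
ko2 = ko1.update("France (verified)")
assert ko2.supersedes is ko1 and ko2.version == 2
```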

[AI-99] An Explainable Market Integrity Monitoring System with Multi-Source Attention Signals and Transparent Scoring

【Quick Read】: This paper addresses two challenges in monitoring financial market integrity: anomalous price/volume behavior can arise from many benign mechanisms, leading to false positives, and existing detection systems mostly rely on black-box models that are hard to audit and explain, limiting their practical use by compliance teams, exchanges, or researchers. The key to the solution is AIMM-X, an explainable monitoring pipeline that fuses market microstructure signals built from OHLCV (open, high, low, close, volume) time series with multi-source public attention signals (such as news and online-discussion proxies). It identifies candidate anomalous windows via transparent thresholding and aggregation, and produces an interpretable integrity score that decomposes into a small set of additive components, so practitioners can trace why each window was flagged and which factors drove it, supporting deeper downstream investigation.

Link: https://arxiv.org/abs/2601.15304
Authors: Sandeep Neela
Institution: Unknown
Subjects: Risk Management (q-fin.RM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint

Abstract:Market integrity monitoring is difficult because suspicious price/volume behavior can arise from many benign mechanisms, while modern detection systems often rely on opaque models that are hard to audit and communicate. We present AIMM-X, an explainable monitoring pipeline that combines market microstructure-style signals derived from OHLCV time series with multi-source public attention signals (e.g., news and online discussion proxies) to surface time windows that merit analyst review. The system detects candidate anomalous windows using transparent thresholding and aggregation, then assigns an interpretable integrity score decomposed into a small set of additive components, allowing practitioners to trace why a window was flagged and which factors drove the score. We provide an end-to-end, reproducible implementation that downloads data, constructs attention features, builds unified panels, detects windows, computes component signals, and generates summary figures/tables. Our goal is not to label manipulation, but to provide a practical, auditable screening tool that supports downstream investigation by compliance teams, exchanges, or researchers.
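The concrete signals, weights, and thresholds are not given in the abstract; the sketch below only illustrates the general pattern described, transparent thresholding plus an additive, traceable score, with made-up component names and weights.

```python
import numpy as np

def zscore(x: np.ndarray) -> np.ndarray:
    return (x - x.mean()) / (x.std() + 1e-9)

def integrity_scores(volume, ret, attention, weights=(0.4, 0.3, 0.3), threshold=2.0):
    """Flag windows where a weighted sum of component z-scores exceeds a threshold.

    Returns per-window components so an analyst can see which factor drove a flag.
    """
    comps = np.stack([zscore(volume), zscore(np.abs(ret)), zscore(attention)])
    score = np.asarray(weights) @ comps
    flags = score > threshold
    return score, flags, {"volume": comps[0], "abs_return": comps[1],
                          "attention": comps[2]}

rng = np.random.default_rng(1)
vol = rng.lognormal(size=200)
ret = rng.normal(0, 0.01, 200)
att = rng.poisson(5, 200).astype(float)
score, flags, parts = integrity_scores(vol, ret, att)
for t in np.flatnonzero(flags):
    print(f"window {t}: score={score[t]:.2f}, "
          f"volume={parts['volume'][t]:.2f}, attention={parts['attention'][t]:.2f}")
```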

Machine Learning

[LG-0] Domain-Incremental Continual Learning for Robust and Efficient Keyword Spotting in Resource Constrained Systems

Link: https://arxiv.org/abs/2601.16158
Authors: Prakash Dhungana, Sayed Ahmad Salehi
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
Comments: 12 pages, 8 figures, and 3 tables

Abstract:Keyword Spotting (KWS) systems with small footprint models deployed on edge devices face significant accuracy and robustness challenges due to domain shifts caused by varying noise and recording conditions. To address this, we propose a comprehensive framework for continual learning designed to adapt to new domains while maintaining computational efficiency. The proposed pipeline integrates a dual-input Convolutional Neural Network, utilizing both Mel Frequency Cepstral Coefficients (MFCC) and Mel-spectrogram features, supported by a multi-stage denoising process, involving discrete wavelet transform and spectral subtraction techniques, plus model and prototype update blocks. Unlike prior methods that restrict updates to specific layers, our approach updates the complete quantized model, made possible due to compact model architecture. A subset of input samples are selected during runtime using class prototypes and confidence-driven filtering, which are then pseudo-labeled and combined with rehearsal buffer for incremental model retraining. Experimental results on noisy test dataset demonstrate the framework's effectiveness, achieving 99.63% accuracy on clean data and maintaining robust performance (exceeding 94% accuracy) across diverse noisy environments, even at -10 dB Signal-to-Noise Ratio. The proposed framework confirms that integrating efficient denoising with prototype-based continual learning enables KWS models to operate autonomously and robustly in resource-constrained, dynamic environments.
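The runtime selection rule is described only qualitatively above; one plausible reading, with hypothetical thresholds, accepts a sample for pseudo-labeling when the top softmax confidence is high and the embedding sits close to the corresponding class prototype:

```python
import numpy as np

def select_for_pseudo_label(emb, probs, prototypes, conf_min=0.9, cos_min=0.7):
    """Accept a sample if the model is confident and the embedding agrees
    with the class prototype; returns (accepted, pseudo_label)."""
    label = int(np.argmax(probs))
    proto = prototypes[label]
    cos = emb @ proto / (np.linalg.norm(emb) * np.linalg.norm(proto) + 1e-9)
    accepted = bool(probs[label] >= conf_min and cos >= cos_min)
    return accepted, label

rng = np.random.default_rng(0)
prototypes = rng.normal(size=(10, 64))             # one prototype per keyword class
emb = prototypes[3] + 0.05 * rng.normal(size=64)   # a sample near class 3
probs = np.full(10, 0.01); probs[3] = 0.91
print(select_for_pseudo_label(emb, probs, prototypes))  # (True, 3)
```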

[LG-1] Beat-ssl: Capturing Local ECG Morphology through Heartbeat-level Contrastive Learning with Soft Targets

Link: https://arxiv.org/abs/2601.16147
Authors: Muhammad Ilham Rizqyawan, Peter Macfarlane, Stathis Hadjidemetriou, Fani Deligianni
Subjects: Machine Learning (cs.LG)
Comments: Accepted at ISBI 2026

Abstract:Obtaining labelled ECG data for developing supervised models is challenging. Contrastive learning (CL) has emerged as a promising pretraining approach that enables effective transfer learning with limited labelled data. However, existing CL frameworks either focus solely on global context or fail to exploit ECG-specific characteristics. Furthermore, these methods rely on hard contrastive targets, which may not adequately capture the continuous nature of feature similarity in ECG signals. In this paper, we propose Beat-SSL, a contrastive learning framework that performs dual-context learning through both rhythm-level and heartbeat-level contrasting with soft targets. We evaluated our pretrained model on two downstream tasks: 1) multilabel classification for global rhythm assessment, and 2) ECG segmentation to assess its capacity to learn representations across both contexts. We conducted an ablation study and compared the best configuration with three other methods, including one ECG foundation model. Despite the foundation model's broader pretraining, Beat-SSL reached 93% of its performance in the multilabel classification task and surpassed all other methods in the segmentation task by 4%.
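The abstract does not formalize its soft targets; a common construction, shown here as an assumption rather than the paper's exact loss, replaces the one-hot contrastive target with a similarity-derived distribution and minimizes cross-entropy against it:

```python
import numpy as np

def soft_contrastive_loss(z, z_aug, tau_model=0.1, tau_target=0.3):
    """Cross-entropy between a softened similarity target and the model's
    softmax over pairwise similarities (rows: anchors, cols: candidates)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    z_aug = z_aug / np.linalg.norm(z_aug, axis=1, keepdims=True)
    sim = z @ z_aug.T

    def softmax(s, tau):
        e = np.exp((s - s.max(axis=1, keepdims=True)) / tau)
        return e / e.sum(axis=1, keepdims=True)

    p = softmax(sim, tau_model)    # model distribution (sharp)
    q = softmax(sim, tau_target)   # softened target (hotter temperature)
    return float(-(q * np.log(p + 1e-12)).sum(axis=1).mean())

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 32))
z_aug = z + 0.1 * rng.normal(size=(8, 32))   # e.g., augmented heartbeat views
print(soft_contrastive_loss(z, z_aug))
```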

[LG-2] Computing Fixpoints of Learned Functions: Chaotic Iteration and Simple Stochastic Games

Link: https://arxiv.org/abs/2601.16142
Authors: Paolo Baldan, Sebastian Gurke, Barbara König, Florian Wittbold
Subjects: Logic in Computer Science (cs.LO); Machine Learning (cs.LG)
Comments:

Abstract:The problem of determining the (least) fixpoint of (higher-dimensional) functions over the non-negative reals frequently occurs when dealing with systems endowed with a quantitative semantics. We focus on the situation in which the functions of interest are not known precisely but can only be approximated. As a first contribution we generalize an iteration scheme called dampened Mann iteration, recently introduced in the literature. The improved scheme relaxes previous constraints on parameter sequences, allowing learning rates to converge to zero or not converge at all. While seemingly minor, this flexibility is essential to enable the implementation of chaotic iterations, where only a subset of components is updated in each step, allowing to tackle higher-dimensional problems. Additionally, by allowing learning rates to converge to zero, we can relax conditions on the convergence speed of function approximations, making the method more adaptable to various scenarios. We also show that dampened Mann iteration applies immediately to compute the expected payoff in various probabilistic models, including simple stochastic games, not covered by previous work.
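For reference, a Mann-type scheme updates by a convex combination of the current iterate and an approximate function value; the sketch below adds a chaotic variant that touches only a random coordinate block per step. The learning-rate sequence and the toy map are illustrative, not the paper's conditions:

```python
import numpy as np

def mann_iteration(f_approx, x0, steps=2000, alpha=lambda k: 1.0 / np.sqrt(k + 2),
                   block_frac=0.5, rng=None):
    """x_{k+1} = (1 - a_k) x_k + a_k f_approx(x_k), applied only on a random
    coordinate block each step (chaotic iteration)."""
    rng = rng or np.random.default_rng(0)
    x = x0.astype(float).copy()
    for k in range(steps):
        a = alpha(k)
        idx = rng.random(x.size) < block_frac   # coordinates updated this step
        fx = f_approx(x)
        x[idx] = (1 - a) * x[idx] + a * fx[idx]
    return x

# Toy monotone map over the non-negative reals with fixpoint x = 2c (solves x = 0.5 x + c).
c = np.array([1.0, 2.0, 0.5])
f = lambda x: 0.5 * x + c
print(mann_iteration(f, np.zeros(3)))   # converges toward [2, 4, 1]
```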

[LG-3] On the Intrinsic Dimensions of Data in Kernel Learning AISTATS2026

Link: https://arxiv.org/abs/2601.16139
Authors: Rustem Takhanov
Subjects: Machine Learning (cs.LG)
Comments: Accepted to The 29th International Conference on Artificial Intelligence and Statistics (AISTATS 2026)

Abstract:The manifold hypothesis suggests that the generalization performance of machine learning methods improves significantly when the intrinsic dimension of the input distribution's support is low. In the context of KRR, we investigate two alternative notions of intrinsic dimension. The first, denoted d_\rho , is the upper Minkowski dimension defined with respect to the canonical metric induced by a kernel function K on a domain \Omega . The second, denoted d_K , is the effective dimension, derived from the decay rate of Kolmogorov n -widths associated with K on \Omega . Given a probability measure \mu on \Omega , we analyze the relationship between these n -widths and eigenvalues of the integral operator \phi \to \int_\Omega K(\cdot,x)\phi(x)d\mu(x) . We show that, for a fixed domain \Omega , the Kolmogorov n -widths characterize the worst-case eigenvalue decay across all probability measures \mu supported on \Omega . These eigenvalues are central to understanding the generalization behavior of constrained KRR, enabling us to derive an excess error bound of order O(n^{-\frac{2+d_K}{2+2d_K}+\epsilon}) for any \epsilon > 0 , when the training set size n is large. We also propose an algorithm that estimates upper bounds on the n -widths using only a finite sample from \mu . For distributions close to uniform, we prove that \epsilon -accurate upper bounds on all n -widths can be computed with high probability using at most O\left(\epsilon^{-d_\rho}\log\frac{1}{\epsilon}\right) samples, with fewer required for small n . Finally, we compute the effective dimension d_K for various fractal sets and present additional numerical experiments. Our results show that, for kernels such as the Laplace kernel, the effective dimension d_K can be significantly smaller than the Minkowski dimension d_\rho , even though d_K = d_\rho provably holds on regular domains.
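For context, the KRR estimator underlying this analysis is the standard ridge solution in the kernel's function space; a minimal (unconstrained) version with the Laplace kernel, one of the kernels the paper highlights, on inputs lying on a low-dimensional curve inside a higher-dimensional ambient space:

```python
import numpy as np

def laplace_kernel(A, B, gamma=1.0):
    """k(x, x') = exp(-gamma * ||x - x'||), computed for all pairs."""
    d = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
    return np.exp(-gamma * d)

def krr_fit_predict(X, y, X_test, lam=1e-3, gamma=1.0):
    """Kernel ridge regression: alpha = (K + n*lam*I)^{-1} y, f(x) = k(x, X) @ alpha."""
    K = laplace_kernel(X, X, gamma)
    alpha = np.linalg.solve(K + lam * len(X) * np.eye(len(X)), y)
    return laplace_kernel(X_test, X, gamma) @ alpha

rng = np.random.default_rng(0)
# Inputs on a 1-D curve embedded in R^5: low intrinsic, high ambient dimension.
t = rng.uniform(0, 1, (200, 1))
X = np.hstack([np.sin(2 * np.pi * k * t) for k in range(1, 6)])
y = np.sin(4 * np.pi * t[:, 0]) + 0.05 * rng.normal(size=200)
t_test = rng.uniform(0, 1, (50, 1))
X_test = np.hstack([np.sin(2 * np.pi * k * t_test) for k in range(1, 6)])
mse = np.mean((krr_fit_predict(X, y, X_test) - np.sin(4 * np.pi * t_test[:, 0])) ** 2)
print(f"test MSE: {mse:.4f}")
```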

[LG-4] Variable Splitting Binary Tree Models Based on Bayesian Context Tree Models for Time Series Segmentation

Link: https://arxiv.org/abs/2601.16112
Authors: Yuta Nakahara, Shota Saito, Kohei Horinouchi, Koshi Shimada, Naoki Ichijo, Manabu Kobayashi, Toshiyasu Matsushima
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:We propose a variable splitting binary tree (VSBT) model based on Bayesian context tree (BCT) models for time series segmentation. Unlike previous applications of BCT models, the tree structure in our model represents interval partitioning on the time domain. Moreover, interval partitioning is represented by recursive logistic regression models. By adjusting logistic regression coefficients, our model can represent split positions at arbitrary locations within each interval. This enables more compact tree representations. For simultaneous estimation of both split positions and tree depth, we develop an effective inference algorithm that combines local variational approximation for logistic regression with the context tree weighting (CTW) algorithm. We present numerical examples on synthetic data demonstrating the effectiveness of our model and algorithm.

[LG-5] Benchmarking Deep Learning Models for Raman Spectroscopy Across Open-Source Datasets

Link: https://arxiv.org/abs/2601.16107
Authors: Adithya Sineesh, Akshita Kamsali
Subjects: Machine Learning (cs.LG)
Comments: 17 pages, 3 figures

Abstract:Deep learning classifiers for Raman spectroscopy are increasingly reported to outperform classical chemometric approaches. However, their evaluations are often conducted in isolation or compared against traditional machine learning methods or trivially adapted vision-based architectures that were not originally proposed for Raman spectroscopy. As a result, direct comparisons between existing deep learning models developed specifically for Raman spectral analysis on shared open-source datasets remain scarce. To the best of our knowledge, this study presents one of the first systematic benchmarks comparing three or more published Raman-specific deep learning classifiers across multiple open-source Raman datasets. We evaluate five representative deep learning architectures under a unified training and hyperparameter tuning protocol across three open-source Raman datasets selected to support standard evaluation, fine-tuning, and explicit distribution-shift testing. We report classification accuracies and macro-averaged F1 scores to provide a fair and reproducible comparison of deep learning models for Raman-spectra-based classification.

[LG-6] Explainable AI to Improve Machine Learning Reliability for Industrial Cyber-Physical Systems

Link: https://arxiv.org/abs/2601.16074
Authors: Annemarie Jutte, Uraz Odyurt
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Industrial Cyber-Physical Systems (CPS) are sensitive infrastructure from both safety and economics perspectives, making their reliability critically important. Machine Learning (ML), specifically deep learning, is increasingly integrated in industrial CPS, but the inherent complexity of ML models results in non-transparent operation. Rigorous evaluation is needed to prevent models from exhibiting unexpected behaviour on future, unseen data. Explainable AI (XAI) can be used to uncover model reasoning, allowing a more extensive analysis of behaviour. We apply XAI to improve the predictive performance of ML models intended for industrial CPS. We analyse the effects of components from time-series data decomposition on model predictions using SHAP values. Through this method, we observe evidence of a lack of sufficient contextual information during model training. By increasing the window size of data instances, informed by the XAI findings, we are able to improve model performance.

[LG-7] CLASP: An online learning algorithm for Convex Losses And Squared Penalties

Link: https://arxiv.org/abs/2601.16072
Authors: Ricardo N. Ferreira, Cláudia Soares, João Xavier
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:We study Constrained Online Convex Optimization (COCO), where a learner chooses actions iteratively, observes both unanticipated convex loss and convex constraint, and accumulates loss while incurring penalties for constraint violations. We introduce CLASP (Convex Losses And Squared Penalties), an algorithm that minimizes cumulative loss together with squared constraint violations. Our analysis departs from prior work by fully leveraging the firm non-expansiveness of convex projectors, a proof strategy not previously applied in this setting. For convex losses, CLASP achieves regret O\left(T^{\max\{\beta,1-\beta\}}\right) and cumulative squared penalty O\left(T^{1-\beta}\right) for any \beta \in (0,1) . Most importantly, for strongly convex problems, CLASP provides the first logarithmic guarantees on both regret and cumulative squared penalty. In the strongly convex case, the regret is upper bounded by O(\log T) and the cumulative squared penalty is also upper bounded by O(\log T) .
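CLASP's exact update is not given in the abstract; as a generic reference point for this problem class (not the algorithm itself), here is a projected primal step on the loss plus a squared constraint penalty, using the Euclidean projector whose firm non-expansiveness the analysis exploits:

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto a ball; such projectors are firmly non-expansive."""
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def cocp_step(x, grad_f, g_val, grad_g, eta=0.05, lam=10.0):
    """One primal step on f(x) + lam * max(g(x), 0)^2, followed by projection."""
    penalty_grad = 2.0 * lam * max(g_val, 0.0) * grad_g
    return project_ball(x - eta * (grad_f + penalty_grad))

# Toy rounds: loss f(x) = ||x - c||^2, constraint g(x) = x[0] - 0.2 <= 0.
x, c = np.zeros(2), np.array([1.0, 1.0])
for _ in range(200):
    g = x[0] - 0.2
    x = cocp_step(x, 2 * (x - c), g, np.array([1.0, 0.0]))
print(x)   # first coordinate settles near the constraint boundary 0.2
```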

[LG-8] Data-Driven Conditional Flexibility Index

Link: https://arxiv.org/abs/2601.16028
Authors: Moritz Wedemeyer, Eike Cramer, Alexander Mitsos, Manuel Dahmen
Subjects: Machine Learning (cs.LG)
Comments: manuscript (47 pages, 16 figures), supplementary material (7 pages, 1 figure, 2 tables)

Abstract:With the increasing flexibilization of processes, determining robust scheduling decisions has become an important goal. Traditionally, the flexibility index has been used to identify safe operating schedules by approximating the admissible uncertainty region using simple admissible uncertainty sets, such as hypercubes. Presently, available contextual information, such as forecasts, has not been considered to define the admissible uncertainty set when determining the flexibility index. We propose the conditional flexibility index (CFI), which extends the traditional flexibility index in two ways: by learning the parametrized admissible uncertainty set from historical data and by using contextual information to make the admissible uncertainty set conditional. This is achieved using a normalizing flow that learns a bijective mapping from a Gaussian base distribution to the data distribution. The admissible latent uncertainty set is constructed as a hypersphere in the latent space and mapped to the data space. By incorporating contextual information, the CFI provides a more informative estimate of flexibility by defining admissible uncertainty sets in regions that are more likely to be relevant under given conditions. Using an illustrative example, we show that no general statement can be made about data-driven admissible uncertainty sets outperforming simple sets, or conditional sets outperforming unconditional ones. However, both data-driven and conditional admissible uncertainty sets ensure that only regions of the uncertain parameter space containing realizations are considered. We apply the CFI to a security-constrained unit commitment example and demonstrate that the CFI can improve scheduling quality by incorporating temporal information.

[LG-9] Partially Lazy Gradient Descent for Smoothed Online Learning AISTATS2026

Link: https://arxiv.org/abs/2601.15984
Authors: Naram Mhaisen, George Iosifidis
Subjects: Machine Learning (cs.LG)
Comments: to appear in the proceedings of AISTATS 2026

Abstract:We introduce k-lazyGD, an online learning algorithm that bridges the gap between greedy Online Gradient Descent (OGD, for k=1 ) and lazy GD/dual-averaging (for k=T ), creating a spectrum between reactive and stable updates. We analyze this spectrum in Smoothed Online Convex Optimization (SOCO), where the learner incurs both hitting and movement costs. Our main contribution is establishing that laziness is possible without sacrificing hitting performance: we prove that k-lazyGD achieves the optimal dynamic regret \mathcal{O}(\sqrt{(P_T+1)T}) for any laziness slack k up to \Theta(\sqrt{T/P_T}) , where P_T is the comparator path length. This result formally connects the allowable laziness to the comparator's shifts, showing that k-lazyGD can retain the inherently small movements of lazy methods without compromising tracking ability. We base our analysis on the Follow the Regularized Leader (FTRL) framework, and derive a matching lower bound. Since the slack depends on P_T , an ensemble of learners with various slacks is used, yielding a method that is provably stable when it can be, and agile when it must be.
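The abstract pins down only the endpoints of the spectrum (greedy OGD at k=1, lazy GD/dual-averaging at k=T); one way to read the interpolation, sketched here as an assumption rather than the paper's exact rule, is to refresh the played point from the accumulated gradients only every k rounds:

```python
import numpy as np

def k_lazy_gd(grads, k, eta=0.1):
    """Play x_t from the regularized gradient sum, refreshed every k rounds.
    k=1 reacts every round (greedy); large k yields stable, rarely-moving plays."""
    g_sum = np.zeros_like(grads[0])
    x = np.zeros_like(grads[0])
    plays = []
    for t, g in enumerate(grads):
        if t % k == 0:
            # dual-averaging style: argmin <g_sum, x> + ||x||^2 / (2 * eta)
            x = -eta * g_sum
        plays.append(x.copy())
        g_sum += g
    return np.array(plays)

rng = np.random.default_rng(0)
grads = rng.normal(size=(100, 3))
for k in (1, 10, 100):
    p = k_lazy_gd(grads, k)
    movement = np.abs(np.diff(p, axis=0)).sum()   # proxy for movement cost
    print(f"k={k:3d}: total movement = {movement:.2f}")
```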

[LG-10] Predicting Healthcare System Visitation Flow by Integrating Hospital Attributes and Population Socioeconomics with Human Mobility Data

Link: https://arxiv.org/abs/2601.15977
Authors: Binbin Lin, Lei Zou, Hao Tian, Heng Cai, Yifan Yang, Bing Zhou
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments:

[LG-11] SoK: Challenges in Tabular Membership Inference Attacks

Link: https://arxiv.org/abs/2601.15874
Authors: Cristina Pêra, Tânia Carvalho, Maxime Cordy, Luís Antunes
Subjects: Machine Learning (cs.LG)
Comments: This paper is currently under review for the EuroS&P conference

Abstract:Membership Inference Attacks (MIAs) are currently a dominant approach for evaluating privacy in machine learning applications. Despite their significance in identifying records belonging to the training dataset, several concerns remain unexplored, particularly with regard to tabular data. In this paper, first, we provide an extensive review and analysis of MIAs considering two main learning paradigms: centralized and federated learning. We extend and refine the taxonomy for both. Second, we demonstrate the efficacy of MIAs in tabular data using several attack strategies, also including defenses. Furthermore, in a federated learning scenario, we consider the threat posed by an outsider adversary, which is often neglected. Third, we demonstrate the high vulnerability of single-outs (records with a unique signature) to MIAs. Lastly, we explore how MIAs transfer across model architectures. Our results point towards generally poor performance of these attacks on tabular data, which contrasts with the previous state-of-the-art. Notably, even attacks with limited attack performance can still successfully expose a large portion of single-outs. Moreover, our findings suggest that using different surrogate models makes MIAs more effective.

[LG-12] Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models

Link: https://arxiv.org/abs/2601.15801
Authors: Fengheng Chu, Jiahao Chen, Yuhong Wang, Jun Wang, Zhihui Fu, Shouling Ji, Songze Li
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:While Large Language Models (LLMs) are aligned to mitigate risks, their safety guardrails remain fragile against jailbreak attacks. This reveals limited understanding of components governing safety. Existing methods rely on local, greedy attribution that assumes independent component contributions. However, they overlook the cooperative interactions between different components in LLMs, such as attention heads, which jointly contribute to safety mechanisms. We propose \textbfGlobal \textbfOptimization for \textbfSafety \textbfVector Extraction (GOSV), a framework that identifies safety-critical attention heads through global optimization over all heads simultaneously. We employ two complementary activation repatching strategies: Harmful Patching and Zero Ablation. These strategies identify two spatially distinct sets of safety vectors with consistently low overlap, termed Malicious Injection Vectors and Safety Suppression Vectors, demonstrating that aligned LLMs maintain separate functional pathways for safety purposes. Through systematic analyses, we find that complete safety breakdown occurs when approximately 30% of total heads are repatched across all models. Building on these insights, we develop a novel inference-time white-box jailbreak method that exploits the identified safety vectors through activation repatching. Our attack substantially outperforms existing white-box attacks across all test models, providing strong evidence for the effectiveness of the proposed GOSV framework on LLM safety interpretability.

[LG-13] Next Generation Active Learning: Mixture of LLMs in the Loop

Link: https://arxiv.org/abs/2601.15773
Authors: Yuanyuan Qi, Xiaohao Yang, Jueqing Lu, Guoxiang Guo, Joanne Enticott, Gang Liu, Lan Du
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:With the rapid advancement and strong generalization capabilities of large language models (LLMs), they have been increasingly incorporated into active learning pipelines as annotators to reduce annotation costs. However, considering the annotation quality, labels generated by LLMs often fall short of real-world applicability. To address this, we propose a novel active learning framework, Mixture of LLMs in the Loop Active Learning, which replaces human annotations with labels generated through a Mixture-of-LLMs-based annotation model, aiming to enhance the robustness of LLM-based annotation by aggregating the strengths of multiple LLMs. To further mitigate the impact of noisy labels, we introduce annotation discrepancy and negative learning to identify unreliable annotations and enhance learning effectiveness. Extensive experiments demonstrate that our framework achieves performance comparable to human annotation and consistently outperforms single-LLM baselines and other LLM-ensemble-based approaches. Moreover, our framework is built on lightweight LLMs, enabling it to operate fully on local machines in real-world applications.

[LG-14] Rethinking Drug-Drug Interaction Modeling as Generalizable Relation Learning

Link: https://arxiv.org/abs/2601.15771
Authors: Dong Xu, Jiantao Wu, Qihua Pan, Sisi Yuan, Zexuan Zhu, Junkai Ji
Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments: 9 pages, 5 figures

Abstract:Drug-drug interaction (DDI) prediction is central to drug discovery and clinical development, particularly in the context of increasingly prevalent polypharmacy. Although existing computational methods achieve strong performance on standard benchmarks, they often fail to generalize to realistic deployment scenarios, where most candidate drug pairs involve previously unseen drugs and validated interactions are scarce. We demonstrate that proximity in the embedding spaces of prevailing molecule-centric DDI models does not reliably correspond to interaction labels, and that simply scaling up model capacity therefore fails to improve generalization. To address these limitations, we propose GenRel-DDI, a generalizable relation learning framework that reformulates DDI prediction as a relation-centric learning problem, in which interaction representations are learned independently of drug identities. This relation-level abstraction enables the capture of transferable interaction patterns that generalize to unseen drugs and novel drug pairs. Extensive experiments across multiple benchmark demonstrate that GenRel-DDI consistently and significantly outperforms state-of-the-art methods, with particularly large gains on strict entity-disjoint evaluations, highlighting the effectiveness and practical utility of relation learning for robust DDI prediction. The code is available at this https URL.

[LG-15] Communication-efficient Federated Graph Classification via Generative Diffusion Modeling

Link: https://arxiv.org/abs/2601.15722
Authors: Xiuling Wang, Xin Huang, Haibo Hu, Jianliang Xu
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Graph Neural Networks (GNNs) unlock new ways of learning from graph-structured data, proving highly effective in capturing complex relationships and patterns. Federated GNNs (FGNNs) have emerged as a prominent distributed learning paradigm for training GNNs over decentralized data. However, FGNNs face two significant challenges: high communication overhead from multiple rounds of parameter exchanges and non-IID data characteristics across clients. To address these issues, we introduce CeFGC, a novel FGNN paradigm that facilitates efficient GNN training over non-IID data by limiting communication between the server and clients to three rounds only. The core idea of CeFGC is to leverage generative diffusion models to minimize direct client-server communication. Each client trains a generative diffusion model that captures its local graph distribution and shares this model with the server, which then redistributes it back to all clients. Using these generative models, clients generate synthetic graphs combined with their local graphs to train local GNN models. Finally, clients upload their model weights to the server for aggregation into a global GNN model. We theoretically analyze the I/O complexity of communication volume to show that CeFGC reduces to a constant of three communication rounds only. Extensive experiments on several real graph datasets demonstrate the effectiveness and efficiency of CeFGC against state-of-the-art competitors, reflecting our superior performance on non-IID graphs by aligning local and global model objectives and enriching the training set with diverse graphs.

[LG-16] Balancing Security and Privacy: The Pivotal Role of AI in Modern Healthcare Systems

Link: https://arxiv.org/abs/2601.15697
Authors: Binu V P, Deepthy K Bhaskar, Minimol B
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:As digital threats continue to grow, organizations must find ways to enhance security while protecting user privacy. This paper explores how artificial intelligence (AI) plays a crucial role in achieving this balance. AI technologies can improve security by detecting threats, monitoring systems, and automating responses. However, using AI also raises privacy concerns that need careful consideration. We examine real-world examples from the healthcare sector to illustrate how organizations can implement AI solutions that strengthen security without compromising patient privacy. Additionally, we discuss the importance of creating transparent AI systems and adhering to privacy regulations. Overall, this paper provides insights and recommendations for integrating AI into healthcare security practices, helping organizations navigate the challenges of modern management while keeping patient data safe.

[LG-17] Beyond Hard Writes and Rigid Preservation: Soft Recursive Least-Squares for Lifelong LLM Editing

Link: https://arxiv.org/abs/2601.15686
Authors: Xinyu Wang, Sicheng Lyu, Yu Gu, Jerry Huang, Peng Lu, Yufei Cui, Xiao-Wen Chang
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Model editing updates a pre-trained LLM with new facts or rules without re-training, while preserving unrelated behavior. In real deployment, edits arrive as long streams, and existing editors often face a plasticity-stability dilemma: locate-then-edit “hard writes” can accumulate interference over time, while null-space-style “hard preservation” preserves only what is explicitly constrained, so past edits can be overwritten and unconstrained behaviors may deviate, degrading general capabilities in the many-edits regime. We propose RLSEdit, a recursive least-squares editor for long sequential editing. RLSEdit formulates editing as an online quadratic optimization with soft constraints, minimizing a cumulative key-value fitting objective with two regularizers that control for both deviation from the pre-trained weights and from a designated anchor mapping. The resulting update admits an efficient online recursion via the Woodbury identity, with per-edit cost independent of history length and scaling only with the current edit size. We further provide deviation bounds and an asymptotic characterization of the adherence-preservation trade-off in the many-edits regime. Experiments on multiple model families demonstrate stable scaling to 10K edits, outperforming strong baselines in both edit success and holistic stability – crucially retaining early edits, and preserving general capabilities on GLUE and held-out reasoning/code benchmarks.
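RLSEdit's full objective, with its two regularizers and anchor mapping, cannot be reconstructed from the abstract, but the constant-cost recursion it builds on is standard recursive least squares; a minimal sketch for fitting a linear key-to-value map with rank-one (Sherman-Morrison/Woodbury) updates:

```python
import numpy as np

class RLSMap:
    """Maintain W = argmin sum_i ||v_i - W k_i||^2 + lam ||W||_F^2 online,
    updating the inverse covariance P with the Sherman-Morrison identity so
    each edit costs O(d^2), independent of how many edits came before."""
    def __init__(self, d_in, d_out, lam=1.0):
        self.W = np.zeros((d_out, d_in))
        self.P = np.eye(d_in) / lam          # (sum k k^T + lam I)^{-1}

    def edit(self, k, v):
        Pk = self.P @ k
        self.P -= np.outer(Pk, Pk) / (1.0 + k @ Pk)   # rank-one inverse update
        gain = self.P @ k
        self.W += np.outer(v - self.W @ k, gain)      # correct only the residual

rng = np.random.default_rng(0)
true_W = rng.normal(size=(4, 8))
m = RLSMap(8, 4, lam=1e-2)
for _ in range(500):
    k = rng.normal(size=8)
    m.edit(k, true_W @ k)
print(np.abs(m.W - true_W).max())   # small after many consistent edits
```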

[LG-18] Bridging the Perception Gap: A Lightweight Coarse-to-Fine Architecture for Edge Audio Systems

Link: https://arxiv.org/abs/2601.15676
Authors: Hengfan Zhang, Yueqian Lin, Hai Helen Li, Yiran Chen
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 10 pages, 3 figures, 2 tables. Preprint

Abstract:Deploying Audio-Language Models (Audio-LLMs) on edge infrastructure exposes a persistent tension between perception depth and computational efficiency. Lightweight local models tend to produce passive perception - generic summaries that miss the subtle evidence required for multi-step audio reasoning - while indiscriminate cloud offloading incurs unacceptable latency, bandwidth cost, and privacy risk. We propose CoFi-Agent (Tool-Augmented Coarse-to-Fine Agent), a hybrid architecture targeting edge servers and gateways. It performs fast local perception and triggers conditional forensic refinement only when uncertainty is detected. CoFi-Agent runs an initial single-pass on a local 7B Audio-LLM, then a cloud controller gates difficult cases and issues lightweight plans for on-device tools such as temporal re-listening and local ASR. On the MMAR benchmark, CoFi-Agent improves accuracy from 27.20% to 53.60%, while achieving a better accuracy-efficiency trade-off than an always-on investigation pipeline. Overall, CoFi-Agent bridges the perception gap via tool-enabled, conditional edge-cloud collaboration under practical system constraints.

[LG-19] Dualformer: Time-Frequency Dual Domain Learning for Long-term Time Series Forecasting

Link: https://arxiv.org/abs/2601.15669
Authors: Jingjing Bai, Yoshinobu Kawahara
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Transformer-based models, despite their promise for long-term time series forecasting (LTSF), suffer from an inherent low-pass filtering effect that limits their effectiveness. This issue arises due to undifferentiated propagation of frequency components across layers, causing a progressive attenuation of high-frequency information crucial for capturing fine-grained temporal variations. To address this limitation, we propose Dualformer, a principled dual-domain framework that rethinks frequency modeling from a layer-wise perspective. Dualformer introduces three key components: (1) a dual-branch architecture that concurrently models complementary temporal patterns in both time and frequency domains; (2) a hierarchical frequency sampling module that allocates distinct frequency bands to different layers, preserving high-frequency details in lower layers while modeling low-frequency trends in deeper layers; and (3) a periodicity-aware weighting mechanism that dynamically balances contributions from the dual branches based on the harmonic energy ratio of inputs, supported theoretically by a derived lower bound. This design enables structured frequency modeling and adaptive integration of time-frequency features, effectively preserving high-frequency information and enhancing generalization. Extensive experiments conducted on eight widely used benchmarks demonstrate Dualformer’s robustness and superior performance, particularly on heterogeneous or weakly periodic data. Our code is publicly available at this https URL.

[LG-20] An Empirical Study on Ensemble-Based Transfer Learning Bayesian Optimisation with Mixed Variable Types

Link: https://arxiv.org/abs/2601.15640
Authors: Natasha Trinkle, Huong Ha, Jeffrey Chan
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 36 pages, 16 figures

Abstract:Bayesian optimisation is a sample-efficient method for finding a global optimum of expensive black-box objective functions. Historic datasets from related problems can be exploited to help improve performance of Bayesian optimisation by adapting transfer learning methods to various components of the Bayesian optimisation pipeline. In this study we perform an empirical analysis of various ensemble-based transfer learning Bayesian optimisation methods and pipeline components. We expand on previous work in the literature by contributing some specific pipeline components, and three new real-time transfer learning Bayesian optimisation benchmarks. In particular we propose to use a weighting strategy for ensemble surrogate model predictions based on regularised regression with weights constrained to be positive, and a related component for handling the case when transfer learning is not improving Bayesian optimisation performance. We find that in general, two components that help improve transfer learning Bayesian optimisation performance are warm start initialisation and constraining the weights used with the ensemble surrogate model to be positive.

[LG-21] Closing the Gap on the Sample Complexity of 1-Identification

Link: https://arxiv.org/abs/2601.15620
Authors: Zitian Li, Wang Chi Cheung
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:1-identification is a fundamental multi-armed bandit formulation on pure exploration. An agent aims to determine whether there exists a qualified arm whose mean reward is not less than a known threshold \mu_0 , or to output \textsf{None} if it believes such an arm does not exist. The agent needs to guarantee its output is correct with probability at least 1-\delta , while making the expected total number of pulls \mathbb{E}[\tau] as small as possible. We work on 1-identification with two main contributions. (1) We utilize an optimization formulation to derive a new lower bound on \mathbb{E}[\tau] when there is at least one qualified arm. (2) We design a new algorithm and derive tight upper bounds whose gap to the lower bounds is at most a polylogarithmic factor across all problem instances. Our result complements the analysis of \mathbb{E}[\tau] when there are multiple qualified arms, an open problem left by prior literature.
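The paper's algorithm and bounds are not reproduced here; for intuition, a classic confidence-bound baseline for this formulation stops as soon as some arm's lower confidence bound clears \mu_0 (output that arm) or every arm's upper bound drops below \mu_0 (output None):

```python
import numpy as np

def one_identification(pull, n_arms, mu0, delta=0.05, max_pulls=200_000):
    """Generic anytime confidence-bound baseline (not the paper's algorithm)."""
    counts = np.zeros(n_arms); sums = np.zeros(n_arms)
    for t in range(max_pulls):
        i = t % n_arms if t < n_arms else int(np.argmax(ucb))  # optimistic pulls
        sums[i] += pull(i); counts[i] += 1
        mean = sums / np.maximum(counts, 1)
        # Anytime Hoeffding-style radius; union bound over arms and time.
        rad = np.sqrt(np.log(2 * n_arms * (counts + 1) ** 2 / delta)
                      / np.maximum(2 * counts, 1))
        ucb, lcb = mean + rad, mean - rad
        if (lcb >= mu0).any():
            return int(np.argmax(lcb))        # a qualified arm, w.h.p.
        if (ucb < mu0).all():
            return None                       # no qualified arm, w.h.p.
    return None

rng = np.random.default_rng(0)
means = np.array([0.3, 0.45, 0.62])
print(one_identification(lambda i: rng.normal(means[i], 0.5), 3, mu0=0.5))  # -> 2
```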

[LG-22] Neural Nonlinear Shrinkage of Covariance Matrices for Minimum Variance Portfolio Optimization

Link: https://arxiv.org/abs/2601.15597
Authors: Liusha Yang, Siqi Zhao, Shuqi Chai
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:

Abstract:This paper introduces a neural network-based nonlinear shrinkage estimator of covariance matrices for the purpose of minimum variance portfolio optimization. It is a hybrid approach that integrates statistical estimation with machine learning. Starting from the Ledoit-Wolf (LW) shrinkage estimator, we decompose the LW covariance matrix into its eigenvalues and eigenvectors, and apply a lightweight transformer-based neural network to learn a nonlinear eigenvalue shrinkage function. Trained with portfolio risk as the loss function, the resulting precision matrix (the inverse covariance matrix) estimator directly targets portfolio risk minimization. By conditioning on the sample-to-dimension ratio, the approach remains scalable across different sample sizes and asset universes. Empirical results on stock daily returns from the Standard & Poor's 500 Index (S&P 500) demonstrate that the proposed method consistently achieves lower out-of-sample realized risk than benchmark approaches. This highlights the promise of integrating structural statistical models with data-driven learning.
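The learned transformer itself is not reproducible from the abstract, but the surrounding pipeline is explicit: eigendecompose the covariance, transform the eigenvalues, and plug the result into the minimum-variance weights w = S^{-1} 1 / (1^T S^{-1} 1). In the sketch below, a simple median floor stands in for the learned shrinkage function:

```python
import numpy as np

def min_variance_weights(returns, shrink=lambda lam: np.maximum(lam, np.median(lam))):
    """Eigenvalue-shrunk covariance -> w = S^{-1} 1 / (1^T S^{-1} 1).
    `shrink` is a placeholder for the learned nonlinear map on eigenvalues."""
    S = np.cov(returns, rowvar=False)
    lam, U = np.linalg.eigh(S)
    S_shrunk = (U * shrink(lam)) @ U.T      # U diag(f(lam)) U^T
    ones = np.ones(S.shape[0])
    w = np.linalg.solve(S_shrunk, ones)
    return w / w.sum()

rng = np.random.default_rng(0)
R = rng.normal(0.0005, 0.01, size=(250, 20))   # 250 days, 20 assets
w = min_variance_weights(R)
print(w.round(3), "sum =", round(float(w.sum()), 6))
```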

[LG-23] Deep Learning for Perishable Inventory Systems with Human Knowledge

Link: https://arxiv.org/abs/2601.15589
Authors: Xuan Liao, Zhenkang Peng, Ying Rong
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Managing perishable products with limited lifetimes is a fundamental challenge in inventory management, as poor ordering decisions can quickly lead to stockouts or excessive waste. We study a perishable inventory system with random lead times in which both the demand process and the lead time distribution are unknown. We consider a practical setting where orders are placed using limited historical data together with observed covariates and current system states. To improve learning efficiency under limited data, we adopt a marginal cost accounting scheme that assigns each order a single lifetime cost and yields a unified loss function for end-to-end learning. This enables training a deep learning-based policy that maps observed covariates and system states directly to order quantities. We develop two end-to-end variants: a purely black-box approach that outputs order quantities directly (E2E-BB), and a structure-guided approach that embeds the projected inventory level (PIL) policy, capturing inventory effects through explicit computation rather than additional learning (E2E-PIL). We further show that the objective induced by E2E-PIL is homogeneous of degree one, enabling a boosting technique from operational data analytics (ODA) that yields an enhanced policy (E2E-BPIL). Experiments on synthetic and real data establish a robust performance ordering: E2E-BB is dominated by E2E-PIL, which is further improved by E2E-BPIL. Using an excess-risk decomposition, we show that embedding heuristic policy structure reduces effective model complexity and improves learning efficiency with only a modest loss of flexibility. More broadly, our results suggest that deep learning-based decision tools are more effective and robust when guided by human knowledge, highlighting the value of integrating advanced analytics with inventory theory.

[LG-24] Enhanced Convergence in p-bit Based Simulated Annealing with Partial Deactivation for Large-Scale Combinatorial Optimization Problems

Link: https://arxiv.org/abs/2601.15561
Authors: Naoya Onizawa, Takahiro Hanyu
Subjects: Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: 17 pages

Abstract:This article critically investigates the limitations of the simulated annealing algorithm using probabilistic bits (pSA) in solving large-scale combinatorial optimization problems. The study begins with an in-depth analysis of the pSA process, focusing on the issues resulting from unexpected oscillations among p-bits. These oscillations hinder the energy reduction of the Ising model and thus obstruct the successful execution of pSA in complex tasks. Through detailed simulations, we unravel the root cause of this energy stagnation, identifying the feedback mechanism inherent to the pSA operation as the primary contributor to these disruptive oscillations. To address this challenge, we propose two novel algorithms, time average pSA (TApSA) and stalled pSA (SpSA). These algorithms are designed based on partial deactivation of p-bits and are thoroughly tested using Python simulations on maximum cut benchmarks, a typical class of combinatorial optimization problems. On the 16 benchmarks from 800 to 5,000 nodes, the proposed methods improve the normalized cut value by 0.8% to 98.4% on average in comparison with the conventional pSA.
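TApSA and SpSA are defined in the paper itself; the sketch below shows plain pSA on a max-cut instance with a random partial-deactivation mask, a stand-in for the idea of freezing a subset of p-bits per sweep rather than either proposed algorithm:

```python
import numpy as np

def psa_maxcut(W, sweeps=2000, beta0=0.1, beta1=3.0, deact_frac=0.3, seed=0):
    """Simulated annealing with p-bits on a max-cut instance (weight matrix W).
    Each sweep, a random fraction of p-bits is deactivated (frozen), a simple
    stand-in for the partial-deactivation idea behind TApSA/SpSA."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    s = rng.choice([-1.0, 1.0], size=n)
    betas = np.geomspace(beta0, beta1, sweeps)      # inverse-temperature schedule
    for beta in betas:
        active = rng.random(n) >= deact_frac        # frozen p-bits skip this sweep
        I = -W @ s                                  # antiferromagnetic local field
        upd = np.sign(np.tanh(beta * I) - rng.uniform(-1, 1, n))
        s = np.where(active, np.where(upd == 0, s, upd), s)
    cut = 0.25 * (W.sum() - s @ W @ s)              # sum_{i<j} w_ij (1 - s_i s_j)/2
    return s, cut

rng = np.random.default_rng(1)
A = (rng.random((40, 40)) < 0.2).astype(float)
W = np.triu(A, 1); W = W + W.T                      # random unweighted graph
print("cut value:", psa_maxcut(W)[1])
```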

[LG-25] Beyond validation loss: Clinically-tailored optimization metrics improve a model's clinical performance

Link: https://arxiv.org/abs/2601.15546
Authors: Charles B. Delahunt, Courosh Mehanian, Daniel E. Shea, Matthew P. Horning
Subjects: Machine Learning (cs.LG)
Comments: 16 pages, 9 figures

Abstract:A key task in ML is to optimize models at various stages, e.g. by choosing hyperparameters or picking a stopping point. A traditional ML approach is to use validation loss, i.e. to apply the training loss function on a validation set to guide these optimizations. However, ML for healthcare has a distinct goal from traditional ML: Models must perform well relative to specific clinical requirements, vs. relative to the loss function used for training. These clinical requirements can be captured more precisely by tailored metrics. Since many optimization tasks do not require the driving metric to be differentiable, they allow a wider range of options, including the use of metrics tailored to be clinically-relevant. In this paper we describe two controlled experiments which show how the use of clinically-tailored metrics provide superior model optimization compared to validation loss, in the sense of better performance on the clinical task. The use of clinically-relevant metrics for optimization entails some extra effort, to define the metrics and to code them into the pipeline. But it can yield models that better meet the central goal of ML for healthcare: strong performance in the clinic.

[LG-26] Machine learning-enhanced non-amnestic Alzheimer's disease diagnosis from MRI and clinical features

链接: https://arxiv.org/abs/2601.15530
作者: Megan A. Witherow,Michael L. Evans,Ahmed Temtam,Hamid Okhravi,Khan M. Iftekharuddin
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC); Quantitative Methods (q-bio.QM)
备注: Preprint of a manuscript submitted to Brain

点击查看摘要

Abstract:Alzheimer’s disease (AD), defined as an abnormal buildup of amyloid plaques and tau tangles in the brain, can be diagnosed with high accuracy based on protein biomarkers via PET or CSF analysis. However, due to the invasive nature of biomarker collection, most AD diagnoses are made in memory clinics using cognitive tests and evaluation of hippocampal atrophy based on MRI. While clinical assessment and hippocampal volume show high diagnostic accuracy for amnestic or typical AD (tAD), a substantial subgroup of AD patients with atypical presentation (atAD) are routinely misdiagnosed. To improve diagnosis of atAD patients, we propose a machine learning approach to distinguish between atAD and non-AD cognitive impairment using a clinical testing battery and MRI data collected as standard of care. We develop and evaluate our approach using 1410 subjects across four groups (273 tAD, 184 atAD, 235 non-AD, and 685 cognitively normal) collected from one private data set and two public data sets from the National Alzheimer’s Coordinating Center (NACC) and the Alzheimer’s Disease Neuroimaging Initiative (ADNI). We perform multiple atAD vs. non-AD classification experiments using clinical features and hippocampal volume as well as a comprehensive set of MRI features from across the brain. The best performance is achieved by incorporating additional important MRI features, which outperforms using hippocampal volume alone. Furthermore, we use the Boruta statistical approach to identify and visualize significant brain regions distinguishing between diagnostic groups. Our ML approach improves the percentage of correctly diagnosed atAD cases (the recall) from 52% to 69% for NACC and from 34% to 77% for ADNI, while achieving high precision. The proposed approach has important implications for improving diagnostic accuracy for non-amnestic atAD in clinical settings using only a clinical testing battery and MRI.

[LG-27] SAGE-FM: A lightweight and interpretable spatial transcriptomics foundation model

链接: https://arxiv.org/abs/2601.15504
作者: Xianghao Zhan,Jingyu Xu,Yuanning Zheng,Zinaida Good,Olivier Gevaert
类目: Machine Learning (cs.LG); Genomics (q-bio.GN); Quantitative Methods (q-bio.QM)
备注: 26 pages, 5 figures

点击查看摘要

Abstract:Spatial transcriptomics enables spatial gene expression profiling, motivating computational models that capture spatially conditioned regulatory relationships. We introduce SAGE-FM, a lightweight spatial transcriptomics foundation model based on graph convolutional networks (GCNs) trained with a masked central spot prediction objective. Trained on 416 human Visium samples spanning 15 organs, SAGE-FM learns spatially coherent embeddings that robustly recover masked genes, with 91% of masked genes showing significant correlations (p < 0.05). The embeddings generated by SAGE-FM outperform MOFA and existing spatial transcriptomics methods in unsupervised clustering and preservation of biological heterogeneity. SAGE-FM generalizes to downstream tasks, enabling 81% accuracy in pathologist-defined spot annotation in oropharyngeal squamous cell carcinoma and improving glioblastoma subtype prediction relative to MOFA. In silico perturbation experiments further demonstrate that the model captures directional ligand-receptor and upstream-downstream regulatory effects consistent with ground truth. These results demonstrate that simple, parameter-efficient GCNs can serve as biologically interpretable and spatially aware foundation models for large-scale spatial transcriptomics.
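
The masked central spot objective can be sketched in a few lines; the code below is an assumed, minimal one-layer GCN illustration, not the released SAGE-FM architecture.

```python
# Assumed minimal sketch of masked central-spot prediction with a one-layer GCN.
import torch
import torch.nn as nn

class MaskedSpotGCN(nn.Module):
    def __init__(self, n_genes, hidden=128):
        super().__init__()
        self.encode = nn.Linear(n_genes, hidden)
        self.decode = nn.Linear(hidden, n_genes)

    def forward(self, x, adj, center):
        # x: (n_spots, n_genes); adj: row-normalized adjacency; center: spot index
        x = x.clone()
        x[center] = 0.0                       # mask the central spot
        h = torch.relu(self.encode(adj @ x))  # aggregate neighbors (GCN step)
        return self.decode(h[center])         # reconstruct masked expression

model = MaskedSpotGCN(n_genes=2000)
x = torch.rand(50, 2000)           # spots x genes (stand-in data)
adj = torch.eye(50)                # row-normalized neighborhood graph (trivial here)
recon = model(x, adj, center=0)
loss = nn.MSELoss()(recon, x[0])   # compare against the true, unmasked expression
```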

[LG-28] Data-driven Lake Water Quality Forecasting for Time Series with Missing Data using Machine Learning

链接: https://arxiv.org/abs/2601.15503
作者: Rishit Chatterjee,Tahiya Chowdhury
类目: Machine Learning (cs.LG)
备注: 8 pages, 4 figures, 3 tables

点击查看摘要

Abstract:Volunteer-led lake monitoring yields irregular, seasonal time series with many gaps arising from ice cover, weather-related access constraints, and occasional human errors, complicating forecasting and early warning of harmful algal blooms. We study Secchi Disk Depth (SDD) forecasting on a 30-lake, data-rich subset drawn from three decades of in situ records collected across Maine lakes. Missingness is handled via Multiple Imputation by Chained Equations (MICE), and we evaluate performance with a normalized Mean Absolute Error (nMAE) metric for cross-lake comparability. Among six candidates, ridge regression provides the best mean test performance. Using ridge regression, we then quantify the minimal sample size, showing that under a backward, recent-history protocol, the model reaches within 5% of full-history accuracy with approximately 176 training samples per lake on average. We also identify a minimal feature set, where a compact four-feature subset matches the thirteen-feature baseline within the same 5% tolerance. Bringing these results together, we introduce a joint feasibility function that identifies the minimal training history and fewest predictors sufficient to achieve the target of staying within 5% of the complete-history, full-feature baseline. In our study, meeting the 5% accuracy target required about 64 recent samples and just one predictor per lake, highlighting the practicality of targeted monitoring. Hence, our joint feasibility strategy unifies recent-history length and feature choice under a fixed accuracy target, yielding a simple, efficient rule for setting sampling effort and measurement priorities for lake researchers.
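
A compact sketch of the described pipeline, MICE-style imputation, ridge regression, and the normalized MAE, is shown below on synthetic stand-in data; sklearn's IterativeImputer plays the role of MICE.

```python
# Sketch of the imputation + ridge pipeline with a normalized MAE metric.
# Data here are synthetic stand-ins for the lake records.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import Ridge

def nmae(y_true, y_pred):
    # MAE normalized by the mean observed SDD, for cross-lake comparability
    return np.mean(np.abs(y_true - y_pred)) / np.mean(y_true)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                          # stand-in covariates
y = 3.0 + X @ np.array([0.5, -0.3, 0.2, 0.1]) + rng.normal(0, 0.1, 300)
X[rng.random(X.shape) < 0.2] = np.nan                  # gaps, e.g. ice cover

imputer = IterativeImputer(max_iter=10, random_state=0)  # MICE-style imputation
X_train, X_test, y_train, y_test = X[:250], X[250:], y[:250], y[250:]
model = Ridge(alpha=1.0).fit(imputer.fit_transform(X_train), y_train)
print("nMAE:", nmae(y_test, model.predict(imputer.transform(X_test))))
```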

[LG-29] MARS: Unleashing the Power of Speculative Decoding via Margin-Aware Verification

链接: https://arxiv.org/abs/2601.15498
作者: Jingwei Song,Xinyu Wang,Hanbin Wang,Xiaoxuan Lei,Bill Shi,Shixin Han,Eric Yang,Xiao-Wen Chang,Lynn Ai
类目: Machine Learning (cs.LG)
备注: 11 pages, 5 figures

点击查看摘要

Abstract:Speculative Decoding (SD) accelerates autoregressive large language model (LLM) inference by decoupling generation and verification. While recent methods improve draft quality by tightly coupling the drafter with the target model, the verification mechanism itself remains largely unchanged, relying on strict token-level rejection sampling. In practice, modern LLMs frequently operate in low-margin regimes where the target model exhibits weak preference among top candidates. In such cases, rejecting plausible runner-up tokens yields negligible information gain while incurring substantial rollback cost, leading to a fundamental inefficiency in verification. We propose Margin-Aware Speculative Verification, a training-free and domain-agnostic verification strategy that adapts to the target model’s local decisiveness. Our method conditions verification on decision stability measured directly from the target logits and relaxes rejection only when strict verification provides minimal benefit. Importantly, the approach modifies only the verification rule and is fully compatible with existing target-coupled speculative decoding frameworks. Extensive experiments across model scales ranging from 8B to 235B demonstrate that our method delivers consistent and significant inference speedups over state-of-the-art baselines while preserving generation quality across diverse benchmarks.
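
The verification rule can be sketched as follows: the target's top-2 logit margin measures decisiveness, and a drafted runner-up token is accepted only in low-margin regimes. The threshold and the exact acceptance rule here are illustrative assumptions, not MARS's precise criterion.

```python
# Illustrative margin-aware verification rule for speculative decoding.
import torch

def verify_token(target_logits, draft_token, margin_threshold=0.5):
    top2 = torch.topk(target_logits, k=2)
    margin = (top2.values[0] - top2.values[1]).item()  # decision stability
    if draft_token == top2.indices[0].item():
        return True   # strict acceptance: draft matches the target argmax
    if margin < margin_threshold and draft_token == top2.indices[1].item():
        return True   # relaxed: plausible runner-up under weak target preference
    return False      # reject and roll back
```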

[LG-30] Early predicting of hospital admission using machine learning algorithms: Priority queues approach

链接: https://arxiv.org/abs/2601.15481
作者: Jakub Antczak,James Montgomery,Małgorzata O’Reilly,Zbigniew Palmowski,Richard Turner
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Emergency Department overcrowding is a critical issue that compromises patient safety and operational efficiency, necessitating accurate demand forecasting for effective resource allocation. This study evaluates and compares three distinct predictive models: Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors (SARIMAX), EXtreme Gradient Boosting (XGBoost) and Long Short-Term Memory (LSTM) networks for forecasting daily ED arrivals over a seven-day horizon. Utilizing data from an Australian tertiary referral hospital spanning January 2017 to December 2021, this research distinguishes itself by decomposing demand into eight specific ward categories and stratifying patients by clinical complexity. To address data distortions caused by the COVID-19 pandemic, the study employs the Prophet model to generate synthetic counterfactual values for the anomalous period. Experimental results demonstrate that all three proposed models consistently outperform a seasonal naive baseline. XGBoost demonstrated the highest accuracy for predicting total daily admissions with a Mean Absolute Error of 6.63, while the statistical SARIMAX model proved marginally superior for forecasting major complexity cases with an MAE of 3.77. The study concludes that while these techniques successfully reproduce regular day-to-day patterns, they share a common limitation in underestimating sudden, infrequent surges in patient volume.
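
The evaluation setup, a learned model with lagged arrivals against a seasonal naive baseline, can be sketched as below; sklearn's GradientBoostingRegressor stands in for XGBoost and the data are synthetic.

```python
# Sketch of the forecasting comparison: lag-feature boosting vs. seasonal naive.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
days = 730
arrivals = 50 + 10 * np.sin(2 * np.pi * np.arange(days) / 7) + rng.poisson(5, days)

# Lag features: the previous 7 days predict the next day
X = np.stack([arrivals[i:i + 7] for i in range(days - 7)])
y = arrivals[7:]
split = len(y) - 28                      # hold out the last 4 weeks

model = GradientBoostingRegressor().fit(X[:split], y[:split])
pred = model.predict(X[split:])
naive = X[split:, 0]                     # value 7 days earlier (seasonal naive)

def mae(a, b):
    return np.mean(np.abs(a - b))

print("model MAE:", mae(y[split:], pred), "naive MAE:", mae(y[split:], naive))
```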

[LG-31] Learning from Synthetic Data: Limitations of ERM

链接: https://arxiv.org/abs/2601.15468
作者: Kareem Amin,Alex Bie,Weiwei Kong,Umar Syed,Sergei Vassilvitskii
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:The prevalence and low cost of LLMs have led to a rise of synthetic content. From review sites to court documents, "natural" content has been contaminated by data points that appear similar to natural data, but are in fact LLM-generated. In this work we revisit fundamental learning theory questions in this, now ubiquitous, setting. We model this scenario as a sequence of learning tasks where the input is a mix of natural and synthetic data, and the learning algorithms are oblivious to the origin of any individual example. We study the possibilities and limitations of ERM in this setting. For the problem of estimating the mean of an arbitrary d-dimensional distribution, we find that while ERM converges to the true mean, it is outperformed by an algorithm that assigns non-uniform weights to examples from different generations of data. For the PAC learning setting, the disparity is even more stark. We find that ERM does not always converge to the true concept, echoing the model collapse literature. However, we show there are algorithms capable of learning the correct hypothesis for arbitrary VC classes and arbitrary amounts of contamination.
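
The mean-estimation result suggests a simple estimator shape, sketched below: down-weight examples from later, more synthetic-heavy generations rather than averaging uniformly. The geometric decay schedule is an illustrative assumption, not the paper's derived weighting.

```python
# Illustrative generation-weighted mean estimator for mixed natural/synthetic data.
import numpy as np

def weighted_generation_mean(samples_by_gen, decay=0.5):
    # samples_by_gen[g] holds the d-dimensional samples observed at generation g
    num, den = 0.0, 0.0
    for g, samples in enumerate(samples_by_gen):
        w = decay ** g                    # down-weight synthetic-heavy generations
        num = num + w * np.sum(samples, axis=0)
        den = den + w * len(samples)
    return num / den

rng = np.random.default_rng(0)
gens = [rng.normal(0.0, 1.0, size=(100, 3)),   # natural data
        rng.normal(0.1, 1.0, size=(100, 3))]   # drifted LLM-generated data
print(weighted_generation_mean(gens))
```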

[LG-32] Lattice: A Confidence-Gated Hybrid System for Uncertainty-Aware Sequential Prediction with Behavioral Archetypes

链接: https://arxiv.org/abs/2601.15423
作者: Lorian Bannis
类目: Machine Learning (cs.LG)
备注: 12 pages, 1 figure, uses this http URL

点击查看摘要

Abstract:We introduce Lattice, a hybrid sequential prediction system that conditionally activates learned behavioral structure using binary confidence gating. The system clusters behavior windows into behavioral archetypes and uses binary confidence gating to activate archetype-based scoring only when confidence exceeds a threshold, falling back to baseline predictions when uncertain. We validate Lattice on recommendation systems (MovieLens), scientific time-series (LIGO), and financial markets, using LSTM and transformer backbones. On MovieLens with LSTM, Lattice achieves +31.9% improvement over the LSTM baseline in HR@10 (p < 3.29 x 10^-25, 30 seeds), outperforming transformer baselines by 109.4% over SASRec and 218.6% over BERT4Rec. On LIGO and financial data, the system correctly refuses archetype activation when distribution shift occurs - a successful outcome demonstrating confidence gating prevents false activation. On transformer backbones, Lattice provides 0.0% improvement (neutral, no degradation), gracefully deferring when structure is already present. This bidirectional validation - activating when patterns apply, refusing when they don’t, and deferring when redundant - supports confidence gating as a promising architectural principle for managing epistemic uncertainty in safety-critical applications.
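
The gating logic itself is simple, as the sketch below shows; the confidence measure and threshold are assumed placeholders.

```python
# Illustrative binary confidence gating between archetype and baseline scores.
import numpy as np

def gated_scores(baseline_scores, archetype_scores, confidence, tau=0.8):
    if confidence >= tau:
        return archetype_scores   # activate learned behavioral structure
    return baseline_scores        # refuse/defer under uncertainty

# e.g. confidence = max softmax probability of the window's archetype assignment
probs = np.array([0.05, 0.85, 0.10])
conf = probs.max()
```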

[LG-33] Attention-Informed Surrogates for Navigating Power-Performance Trade-offs in HPC NEURIPS2025

链接: https://arxiv.org/abs/2601.15399
作者: Ashna Nawar Ahmed,Banooqa Banday,Terry Jones,Tanzima Z. Islam
类目: Machine Learning (cs.LG)
备注: 13 pages, 6 figures. Published in the MLForSys workshop at NeurIPS 2025. Link: this https URL

点击查看摘要

Abstract:High-Performance Computing (HPC) schedulers must balance user performance with facility-wide resource constraints. The task boils down to selecting the optimal number of nodes for a given job. We present a surrogate-assisted multi-objective Bayesian optimization (MOBO) framework to automate this complex decision. Our core hypothesis is that surrogate models informed by attention-based embeddings of job telemetry can capture performance dynamics more effectively than standard regression techniques. We pair this with an intelligent sample acquisition strategy to ensure the approach is data-efficient. On two production HPC datasets, our embedding-informed method consistently identified higher-quality Pareto fronts of runtime-power trade-offs compared to baselines. Furthermore, our intelligent data sampling strategy drastically reduced training costs while improving the stability of the results. To our knowledge, this is the first work to successfully apply embedding-informed surrogates in a MOBO framework to the HPC scheduling problem, jointly optimizing for performance and power on production workloads.

[LG-34] FedUMM: A General Framework for Federated Learning with Unified Multimodal Models

链接: https://arxiv.org/abs/2601.15390
作者: Zhaolong Su,Leheng Zhao,Xiaoying Wu,Ziyue Xu,Jindong Wang
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Unified multimodal models (UMMs) are emerging as strong foundation models that can do both generation and understanding tasks in a single architecture. However, they are typically trained in centralized settings where all training and downstream datasets are gathered in a central server, limiting the deployment in privacy-sensitive and geographically distributed scenarios. In this paper, we present FedUMM, a general federated learning framework for UMMs under non-IID multimodal data with low communication cost. Built on NVIDIA FLARE, FedUMM instantiates federation for a BLIP3o backbone via parameter-efficient fine-tuning: clients train lightweight LoRA adapters while freezing the foundation models, and the server aggregates only adapter updates. We evaluate on VQA v2 and the GenEval compositional generation benchmarks under Dirichlet-controlled heterogeneity with up to 16 clients. Results show slight degradation as client count and heterogeneity increase, while remaining competitive with centralized training. We further analyze computation–communication trade-offs and demonstrate that adapter-only federation reduces per-round communication by over an order of magnitude compared to full fine-tuning, enabling practical federated UMM training. This work provides empirical experience for future research on privacy-preserving federated unified multimodal models.
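
Adapter-only federation amounts to averaging small tensors, as in the sketch below; the key names and FedAvg-style weighting are illustrative assumptions rather than the FedUMM implementation.

```python
# Sketch of adapter-only aggregation: clients send just their LoRA adapter
# tensors and the server averages them, weighted by sample counts. The
# backbone weights never leave the clients.
import numpy as np

def aggregate_adapters(client_adapters, client_sizes):
    total = sum(client_sizes)
    keys = client_adapters[0].keys()
    return {
        k: sum((n / total) * adapters[k]
               for adapters, n in zip(client_adapters, client_sizes))
        for k in keys
    }

clients = [{"lora_A": np.ones((4, 2)), "lora_B": np.zeros((2, 4))},
           {"lora_A": np.zeros((4, 2)), "lora_B": np.ones((2, 4))}]
global_adapter = aggregate_adapters(clients, client_sizes=[100, 300])
```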

[LG-35] A Rolling-Space Branch-and-Price Algorithm for the Multi-Compartment Vehicle Routing Problem with Multiple Time Windows

链接: https://arxiv.org/abs/2601.16194
作者: El Mehdi Er Raqabi,Kevin Dalmeijer,Pascal Van Hentenryck
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper investigates the multi-compartment vehicle routing problem with multiple time windows (MCVRPMTW), an extension of the classical vehicle routing problem with time windows that considers vehicles equipped with multiple compartments and customers requiring service across several delivery time windows. The problem incorporates three key compartment-related features: (i) compartment flexibility in the number of compartments, (ii) item-to-compartment compatibility, and (iii) item-to-item compatibility. The problem also accommodates practical operational requirements such as driver breaks. To solve the MCVRPMTW, we develop an exact branch-and-price (BP) algorithm in which the pricing problem is solved using a labeling algorithm. Several acceleration strategies are introduced to limit symmetry during label extensions, improve the stability of dual solutions in column generation, and enhance the branching process. To handle large-scale instances, we propose a rolling-space BP algorithm that integrates clustering techniques into the solution framework. Extensive computational experiments on instances inspired by a real-world industrial application demonstrate the effectiveness of the proposed approach and provide useful managerial insights for practical implementation.

[LG-36] Beyond Predictive Uncertainty: Reliable Representation Learning with Structural Constraints

链接: https://arxiv.org/abs/2601.16174
作者: Yiyao Yang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
备注: 22 pages, 5 figures, 5 propositions

点击查看摘要

Abstract:Uncertainty estimation in machine learning has traditionally focused on the prediction stage, aiming to quantify confidence in model outputs while treating learned representations as deterministic and reliable by default. In this work, we challenge this implicit assumption and argue that reliability should be regarded as a first-class property of learned representations themselves. We propose a principled framework for reliable representation learning that explicitly models representation-level uncertainty and leverages structural constraints as inductive biases to regularize the space of feasible representations. Our approach introduces uncertainty-aware regularization directly in the representation space, encouraging representations that are not only predictive but also stable, well-calibrated, and robust to noise and structural perturbations. Structural constraints, such as sparsity, relational structure, or feature-group dependencies, are incorporated to define meaningful geometry and reduce spurious variability in learned representations, without assuming fully correct or noise-free structure. Importantly, the proposed framework is independent of specific model architectures and can be integrated with a wide range of representation learning methods.

[LG-37] Synthetic Augmentation in Imbalanced Learning: When It Helps, When It Hurts, and How Much to Add

链接: https://arxiv.org/abs/2601.16120
作者: Zhengchi Ma,Anru R. Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Imbalanced classification, where one class is observed far less frequently than the other, often causes standard training procedures to prioritize the majority class and perform poorly on rare but important cases. A classic and widely used remedy is to augment the minority class with synthetic examples, but two basic questions remain under-resolved: when does synthetic augmentation actually help, and how many synthetic samples should be generated? We develop a unified statistical framework for synthetic augmentation in imbalanced learning, studying models trained on imbalanced data augmented with synthetic minority samples and evaluated under the balanced population risk. Our theory shows that synthetic data is not always beneficial. In a "local symmetry" regime, imbalance is not the dominant source of error near the balanced optimum, so adding synthetic samples cannot improve learning rates and can even degrade performance by amplifying generator mismatch. When augmentation can help (a "local asymmetry" regime), the optimal synthetic size depends on generator accuracy and on whether the generator’s residual mismatch is directionally aligned with the intrinsic majority-minority shift. This structure can make the best synthetic size deviate from naive full balancing, sometimes by a small refinement and sometimes substantially when generator bias is systematic. Practically, we recommend Validation-Tuned Synthetic Size (VTSS): select the synthetic size by minimizing balanced validation loss over a range centered near the fully balanced baseline, while allowing meaningful departures when the data indicate them. Simulations and a real sepsis prediction study support the theory and illustrate when synthetic augmentation helps, when it cannot, and how to tune its quantity effectively.
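
A minimal VTSS sketch follows, with a Gaussian minority-class generator and a logistic classifier as assumed stand-ins for whatever generator and model a practitioner would use.

```python
# Minimal VTSS sketch: scan candidate synthetic sizes, keep the one that
# minimizes balanced validation loss. Generator and classifier are stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def balanced_log_loss(y, p):
    return 0.5 * (log_loss(y[y == 0], p[y == 0], labels=[0, 1]) +
                  log_loss(y[y == 1], p[y == 1], labels=[0, 1]))

def vtss(X, y, X_val, y_val, sizes):
    X_min = X[y == 1]
    mu, cov = X_min.mean(0), np.cov(X_min.T)
    rng = np.random.default_rng(0)
    best_m, best_loss = 0, np.inf
    for m in sizes:
        X_syn = (rng.multivariate_normal(mu, cov, size=m)
                 if m else np.empty((0, X.shape[1])))
        X_aug = np.vstack([X, X_syn])
        y_aug = np.concatenate([y, np.ones(m)])
        clf = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
        loss = balanced_log_loss(y_val, clf.predict_proba(X_val)[:, 1])
        if loss < best_loss:
            best_m, best_loss = m, loss
    return best_m

# e.g. best = vtss(X, y, X_val, y_val, sizes=range(0, 401, 50))
```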

[LG-38] On damage of interpolation to adversarial robustness in regression

链接: https://arxiv.org/abs/2601.16070
作者: Jingfu Peng,Yuhong Yang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
备注:

点击查看摘要

Abstract:Deep neural networks (DNNs) typically involve a large number of parameters and are trained to achieve zero or near-zero training error. Despite such interpolation, they often exhibit strong generalization performance on unseen data, a phenomenon that has motivated extensive theoretical investigations. Comforting results show that interpolation indeed may not affect the minimax rate of convergence under the squared error loss. Meanwhile, DNNs are well known to be highly vulnerable to adversarial perturbations in future inputs. A natural question then arises: can interpolation also escape from suboptimal performance under a future $X$-attack? In this paper, we investigate the adversarial robustness of interpolating estimators in a framework of nonparametric regression. A finding is that interpolating estimators must be suboptimal even under a subtle future $X$-attack, and achieving perfect fitting can substantially damage their robustness. An interesting phenomenon in the high interpolation regime, which we term the curse of simple size, is also revealed and discussed. Numerical experiments support our theoretical findings.

[LG-39] Risk reversal for least squares estimators under nested convex constraints

链接: https://arxiv.org/abs/2601.16041
作者: Omar Al-Ghattas
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注: 31 pages, 5 figures

点击查看摘要

Abstract:In constrained stochastic optimization, one naturally expects that imposing a stricter feasible set does not increase the statistical risk of an estimator defined by projection onto that set. In this paper, we show that this intuition can fail even in canonical settings. We study the Gaussian sequence model, a deliberately austere test bed, where for a compact, convex set $\Theta \subset \mathbb{R}^d$ one observes $Y = \theta^\star + \sigma Z$ with $Z \sim N(0, I_d)$, and seeks to estimate an unknown parameter $\theta^\star \in \Theta$. The natural estimator is the least squares estimator (LSE), which coincides with the Euclidean projection of $Y$ onto $\Theta$. We construct an explicit example exhibiting risk reversal: for sufficiently large noise, there exist nested compact convex sets $\Theta_S \subset \Theta_L$ and a parameter $\theta^\star \in \Theta_S$ such that the LSE constrained to $\Theta_S$ has strictly larger risk than the LSE constrained to $\Theta_L$. We further show that this phenomenon can persist at the level of worst-case risk, with the supremum risk over the smaller constraint set exceeding that over the larger one. We clarify this behavior by contrasting noise regimes. In the vanishing-noise limit, the risk admits a first-order expansion governed by the statistical dimension of the tangent cone at $\theta^\star$, and tighter constraints uniformly reduce risk. In contrast, in the diverging-noise regime, the risk is determined by global geometric interactions between the constraint set and random noise directions. Here, the embedding of $\Theta_S$ within $\Theta_L$ can reverse the risk ordering. These results reveal a previously unrecognized failure mode of projection-based estimators: in sufficiently noisy settings, tightening a constraint can paradoxically degrade statistical performance.

[LG-40] Machine Failure Detection Based on Projected Quantum Models

链接: https://arxiv.org/abs/2601.15641
作者: Larry Bowden,Qi Chu,Bernard Cena,Kentaro Ohno,Bob Parney,Deepak Sharma,Mitsuharu Takeori
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
备注: 9 pages

点击查看摘要

Abstract:Detecting machine failures promptly is of utmost importance in industry for maintaining efficiency and minimizing downtime. This paper introduces a failure detection algorithm based on quantum computing and a statistical change-point detection approach. Our method leverages the potential of projected quantum feature maps to enhance the precision of anomaly detection in machine monitoring systems. We empirically validate our approach on benchmark multi-dimensional time series datasets as well as on a real-world dataset comprising IoT sensor readings from operational machines, ensuring the practical relevance of our study. The algorithm was executed on IBM’s 133-qubit Heron quantum processor, demonstrating the feasibility of integrating quantum computing into industrial maintenance procedures. The presented results underscore the effectiveness of our quantum-based failure detection system, showcasing its capability to accurately identify anomalies in noisy time series data. This work not only highlights the potential of quantum computing in industrial diagnostics but also paves the way for more sophisticated quantum algorithms in the realm of predictive maintenance.

[LG-41] Non-Stationary Functional Bilevel Optimization

链接: https://arxiv.org/abs/2601.15363
作者: Jason Bohne,Ieva Petrulionyte,Michael Arbel,Julien Mairal,Paweł Polak
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Functional bilevel optimization (FBO) provides a powerful framework for hierarchical learning in function spaces, yet current methods are limited to static offline settings and perform suboptimally in online, non-stationary scenarios. We propose SmoothFBO, the first algorithm for non-stationary FBO with both theoretical guarantees and practical scalability. SmoothFBO introduces a time-smoothed stochastic hypergradient estimator that reduces variance through a window parameter, enabling stable outer-loop updates with sublinear regret. Importantly, the classical parametric bilevel case is a special reduction of our framework, making SmoothFBO a natural extension to online, non-stationary settings. Empirically, SmoothFBO consistently outperforms existing FBO methods in non-stationary hyperparameter optimization and model-based reinforcement learning, demonstrating its practical effectiveness. Together, these results establish SmoothFBO as a general, theoretically grounded, and practically viable foundation for bilevel optimization in online, non-stationary scenarios.
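
The core estimator can be sketched in a few lines: keep a window of recent stochastic hypergradients and update the outer variable on their average. The interfaces and window size below are illustrative placeholders.

```python
# Sketch of a time-smoothed hypergradient outer loop for non-stationary
# bilevel optimization. hypergrad_fn is a placeholder for any routine that
# returns a noisy hypergradient at time t.
import numpy as np
from collections import deque

def smooth_fbo_outer_loop(hypergrad_fn, theta, steps, window=10, lr=0.01):
    buf = deque(maxlen=window)
    for t in range(steps):
        buf.append(hypergrad_fn(theta, t))   # noisy hypergradient at time t
        g = np.mean(buf, axis=0)             # time-smoothed estimator
        theta = theta - lr * g               # outer (upper-level) update
    return theta

# e.g. theta = smooth_fbo_outer_loop(lambda th, t: th - 1.0, np.zeros(3), 500)
```

The window trades variance for bias: larger windows stabilize updates but react more slowly to drift, which matches the regret trade-off described above.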

[LG-42] USDs: A universal stabilizer decoder framework using symmetry

链接: https://arxiv.org/abs/2601.15361
作者: Hoshitaro Ohnishi,Hideo Mukai
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Quantum error correction is indispensable to achieving reliable quantum computation. When quantum information is encoded redundantly, a larger Hilbert space is constructed using multiple physical qubits, and the computation is performed within a designated subspace. When applying deep learning to the decoding of quantum error-correcting codes, a key challenge arises from the non-uniqueness between the syndrome measurements provided to the decoder and the corresponding error patterns that constitute the ground-truth labels. Building upon prior work that addressed this issue for the toric code by re-optimizing the decoder with respect to the symmetry inherent in the parity-check structure, we generalize this approach to arbitrary stabilizer codes. In our experiments, we employed multilayer perceptrons to approximate continuous functions that complement the syndrome measurements of the Color code and the Golay code. Using these models, we performed decoder re-optimization for each code. For the Color code, we achieved an improvement of approximately 0.8% in decoding accuracy at a physical error rate of 5%, while for the Golay code the accuracy increased by about 0.1%. Furthermore, from the evaluation of the geometric and algebraic structures in the continuous function approximation for each code, we showed that the design of generalized continuous functions is advantageous for learning the geometric structure inherent in the code. Our results also indicate that approximations that faithfully reproduce the code structure can have a significant impact on the effectiveness of reoptimization. This study demonstrates that the re-optimization technique previously shown to be effective for the Toric code can be generalized to address the challenge of label degeneracy that arises when applying deep learning to the decoding of stabilizer codes.

[LG-43] Robust X-Learner: Breaking the Curse of Imbalance and Heavy Tails via Robust Cross-Imputation

链接: https://arxiv.org/abs/2601.15360
作者: Eichi Uehara
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Methodology (stat.ME)
备注: 17 pages, 4 figures, 4 tables

点击查看摘要

Abstract:Estimating Heterogeneous Treatment Effects (HTE) in industrial applications such as AdTech and healthcare presents a dual challenge: extreme class imbalance and heavy-tailed outcome distributions. While the X-Learner framework effectively addresses imbalance through cross-imputation, we demonstrate that it is fundamentally vulnerable to “Outlier Smearing” when reliant on Mean Squared Error (MSE) minimization. In this failure mode, the bias from a few extreme observations (“whales”) in the minority group is propagated to the entire majority group during the imputation step, corrupting the estimated treatment effect structure. To resolve this, we propose the Robust X-Learner (RX-Learner). This framework integrates a redescending $\gamma$-divergence objective – structurally equivalent to the Welsch loss under Gaussian assumptions – into the gradient boosting machinery. We further stabilize the non-convex optimization using a Proxy Hessian strategy grounded in Majorization-Minimization (MM) principles. Empirical evaluation on a semi-synthetic Criteo Uplift dataset demonstrates that the RX-Learner reduces the Precision in Estimation of Heterogeneous Effect (PEHE) metric by 98.6% compared to the standard X-Learner, effectively decoupling the stable “Core” population from the volatile “Periphery”.
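
Under Gaussian assumptions the redescending objective is structurally a Welsch loss, whose boosting-style (grad, hess) form with an MM-style proxy Hessian can be sketched as below; the tuning constant c and this exact interface are assumptions, not the paper's released code.

```python
# Sketch of a redescending Welsch objective for gradient boosting, using the
# IRLS weight as an MM-style proxy Hessian. Follows the (grad, hess)
# custom-objective convention of common boosting libraries.
import numpy as np

def welsch_objective(y_true, y_pred, c=1.345):
    r = y_true - y_pred
    w = np.exp(-(r / c) ** 2)   # redescending weight: extreme residuals -> ~0
    grad = -r * w               # d/df of (c^2/2) * (1 - exp(-r^2 / c^2))
    hess = w                    # proxy Hessian (majorizer curvature), always > 0
    return grad, hess
```

Because the weight w decays to zero for large residuals, a few extreme "whales" contribute almost nothing to the fit, which is exactly the smearing protection described above.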

[LG-44] Statistical Reinforcement Learning in the Real World: A Survey of Challenges and Future Directions

链接: https://arxiv.org/abs/2601.15353
作者: Asim H. Gazi,Yongyi Guo,Daiqi Gao,Ziping Xu,Kelly W. Zhang,Susan A. Murphy
类目: Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has achieved remarkable success in real-world decision-making across diverse domains, including gaming, robotics, online advertising, public health, and natural language processing. Despite these advances, a substantial gap remains between RL research and its deployment in many practical settings. Two recurring challenges often underlie this gap. First, many settings offer limited opportunity for the agent to interact extensively with the target environment due to practical constraints. Second, many target environments often undergo substantial changes, requiring redesign and redeployment of RL systems (e.g., advancements in science and technology that change the landscape of healthcare delivery). Addressing these challenges and bridging the gap between basic research and application requires theory and methodology that directly inform the design, implementation, and continual improvement of RL systems in real-world settings. In this paper, we frame the application of RL in practice as a three-component process: (i) online learning and optimization during deployment, (ii) post- or between-deployment offline analyses, and (iii) repeated cycles of deployment and redeployment to continually improve the RL system. We provide a narrative review of recent advances in statistical RL that address these components, including methods for maximizing data utility for between-deployment inference, enhancing sample efficiency for online learning within-deployment, and designing sequences of deployments for continual improvement. We also outline future research directions in statistical RL that are use-inspired – aiming for impactful application of RL in practice.

[LG-45] Learning Nonlinear Heterogeneity in Physical Kolmogorov-Arnold Networks

链接: https://arxiv.org/abs/2601.15340
作者: Fabiana Taglietti,Andrea Pulici,Maxwell Roxburgh,Gabriele Seguini,Ian Vidamour,Stephan Menzel,Edoardo Franco,Michele Laus,Eleni Vasilaki,Michele Perego,Thomas J. Hayward,Marco Fanciulli,Jack C. Gartside
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Machine Learning (cs.LG); Adaptation and Self-Organizing Systems (nlin.AO); Applied Physics (physics.app-ph)
备注:

点击查看摘要

Abstract:Physical neural networks typically train linear synaptic weights while treating device nonlinearities as fixed. We show the opposite - by training the synaptic nonlinearity itself, as in Kolmogorov-Arnold Network (KAN) architectures, we yield markedly higher task performance per physical resource and improved performance-parameter scaling than conventional linear weight-based networks, demonstrating the ability of KAN topologies to exploit reconfigurable nonlinear physical dynamics. We experimentally realise physical KANs in silicon-on-insulator devices we term ‘Synaptic Nonlinear Elements’ (SYNEs), operating at room temperature, 0.1-1 microampere currents, and 2 MHz speeds with no observed degradation over 10^13 measurements and months-long timescales. We demonstrate nonlinear function regression, classification, and prediction of Li-Ion battery dynamics from noisy real-world multi-sensor data. Physical KANs outperform equivalently-parameterised software multilayer perceptron networks across all tasks, with up to two orders of magnitude fewer parameters, and two orders of magnitude fewer devices than linear weight-based physical networks. These results establish learned physical nonlinearity as a hardware-native computational primitive for compact and efficient learning systems, and SYNE devices as effective substrates for heterogeneous nonlinear computing.

信息检索

[IR-0] Unveiling and Simulating Short-Video Addiction Behaviors via Economic Addiction Theory

链接: https://arxiv.org/abs/2601.15975
作者: Chen Xu,Zhipeng Yi,Ruizi Wang,Wenjie Wang,Jun Xu,Maarten de Rijke
类目: Information Retrieval (cs.IR)
备注: Accepted in TheWebConf 2026

点击查看摘要

Abstract:Short-video applications have attracted substantial user traffic. However, these platforms also foster problematic usage patterns, commonly referred to as short-video addiction, which pose risks to both user health and the sustainable development of platforms. Prior studies on this issue have primarily relied on questionnaires or volunteer-based data collection, which are often limited by small sample sizes and population biases. In contrast, short-video platforms have large-scale behavioral data, offering a valuable foundation for analyzing addictive behaviors. To examine addiction-aware behavior patterns, we combine economic addiction theory with users’ implicit behavior captured by recommendation systems. Our analysis shows that short-video addiction follows functional patterns similar to traditional forms of addictive behavior (e.g., substance abuse) and that its intensity is consistent with findings from previous social science studies. To develop a simulator that can learn and model these patterns, we introduce a novel training framework, AddictSim. To consider the personalized addiction patterns, AddictSim uses a mean-to-adapted strategy with group relative policy optimization training. Experiments on two large-scale datasets show that AddictSim consistently outperforms existing training strategies. Our simulation results show that integrating diversity-aware algorithms can mitigate addictive behaviors well.

[IR-1] STAR: Semantic Table Representation with Header-Aware Clustering and Adaptive Weighted Fusion WWW2026

链接: https://arxiv.org/abs/2601.15860
作者: Shui-Hsiang Hsu,Tsung-Hsiang Chou,Chen-Jui Yu,Yao-Chung Fan
类目: Information Retrieval (cs.IR)
备注: Accepted at The Web Conference 2026 (WWW 2026)

点击查看摘要

Abstract:Table retrieval is the task of retrieving the most relevant tables from large-scale corpora given natural language queries. However, structural and semantic discrepancies between unstructured text and structured tables make embedding alignment particularly challenging. Recent methods such as QGpT attempt to enrich table semantics by generating synthetic queries, yet they still rely on coarse partial-table sampling and simple fusion strategies, which limit semantic diversity and hinder effective query-table alignment. We propose STAR (Semantic Table Representation), a lightweight framework that improves semantic table representation through semantic clustering and weighted fusion. STAR first applies header-aware K-means clustering to group semantically similar rows and selects representative centroid instances to construct a diverse partial table. It then generates cluster-specific synthetic queries to comprehensively cover the table’s semantic space. Finally, STAR employs weighted fusion strategies to integrate table and query embeddings, enabling fine-grained semantic alignment. This design enables STAR to capture complementary information from structured and textual sources, improving the expressiveness of table representations. Experiments on five benchmarks show that STAR achieves consistently higher Recall than QGpT on all datasets, demonstrating the effectiveness of semantic clustering and adaptive weighted fusion for robust table representation. Our code is available at this https URL.
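
The row-selection step can be sketched as below: header-aware row texts are embedded, clustered with K-means, and the rows nearest each centroid form the partial table; a weighted fusion then mixes table and query embeddings. Here embed() and the fusion weight are placeholders for any sentence-embedding model and tuning choice.

```python
# Sketch of header-aware row clustering and centroid-based row selection,
# plus a simple weighted embedding fusion. embed() is a placeholder.
import numpy as np
from sklearn.cluster import KMeans

def representative_rows(rows, headers, embed, k=5):
    # Prefix each cell with its column header so row text is header-aware
    texts = [" ; ".join(f"{h}: {v}" for h, v in zip(headers, row)) for row in rows]
    vecs = np.stack([embed(t) for t in texts])
    km = KMeans(n_clusters=min(k, len(rows)), n_init=10).fit(vecs)
    reps = []
    for c in range(km.n_clusters):
        idx = np.where(km.labels_ == c)[0]
        d = np.linalg.norm(vecs[idx] - km.cluster_centers_[c], axis=1)
        reps.append(rows[idx[np.argmin(d)]])   # row closest to the centroid
    return reps

def fused_embedding(table_vec, query_vecs, alpha=0.6):
    # Weighted fusion of the table embedding with synthetic-query embeddings
    return alpha * table_vec + (1 - alpha) * np.mean(query_vecs, axis=0)
```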

[IR-2] CGPT: Cluster-Guided Partial Tables with LLM-Generated Supervision for Table Retrieval WWW2026

链接: https://arxiv.org/abs/2601.15849
作者: Tsung-Hsiang Chou,Chen-Jui Yu,Shui-Hsiang Hsu,Yao-Chung Fan
类目: Information Retrieval (cs.IR)
备注: Accepted at The Web Conference 2026 (WWW 2026)

点击查看摘要

Abstract:General-purpose embedding models have demonstrated strong performance in text retrieval but remain suboptimal for table retrieval, where highly structured content leads to semantic compression and query-table mismatch. Recent LLM-based retrieval augmentation methods mitigate this issue by generating synthetic queries, yet they often rely on heuristic partial-table selection and seldom leverage these synthetic queries as supervision to improve the embedding model. We introduce CGPT, a training framework that enhances table retrieval through LLM-generated supervision. CGPT constructs semantically diverse partial tables by clustering table instances using K-means and sampling across clusters to broaden semantic coverage. An LLM then generates synthetic queries for these partial tables, which are used in hard-negative contrastive fine-tuning to refine the embedding model. Experiments across four public benchmarks (MimoTable, OTTQA, FetaQA, and E2E-WTQ) show that CGPT consistently outperforms retrieval baselines, including QGpT, with an average R@1 improvement of 16.54 percent. In a unified multi-domain corpus setting, CGPT further demonstrates strong cross-domain generalization and remains effective even when using smaller LLMs for synthetic query generation. These results indicate that semantically guided partial-table construction, combined with contrastive training from LLM-generated supervision, provides an effective and scalable paradigm for large-scale table retrieval. Our code is available at this https URL.

[IR-3] Blockchain-Based Spectrum Resource Securitization via Semi-Fungible Token-Lock

链接: https://arxiv.org/abs/2601.15594
作者: Zhixian Zhou,Bin Chen,Zhe Peng,Zhiming Liang,Ruijun Wu,Chen Sun,Shuo Wang
类目: Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:As 6G networks evolve, spectrum assets require flexible, dynamic, and efficient utilization, motivating blockchain-based spectrum securitization. Existing approaches based on ERC404-style hybrid token models rely on frequent minting and burning during asset transfers, which disrupt token identity continuity and increase on-chain overhead. This paper proposes the Semi-Fungible Token Lock (SFT-Lock) method, a lock/unlock-based mechanism that preserves NFT identity and historical traceability while enabling fractional ownership and transferability. By replacing mint/burn operations with deterministic state transitions, SFT-Lock ensures consistent lifecycle representation of spectrum assets and significantly reduces on-chain operations. Based on this mechanism, a modular smart contract architecture is designed to support spectrum authorization, securitization, and sharing, and a staking mechanism is introduced to enhance asset liquidity. Experimental results on a private Ethereum network demonstrate that, compared with ERC404-style hybrid token models, the proposed method achieves substantial gas savings while maintaining functional correctness and traceability.
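
A conceptual sketch in Python (not Solidity) of the lock/unlock idea follows: the NFT identity persists while fractional units move between owners via deterministic state transitions, with no mint/burn; fields and rules are illustrative.

```python
# Conceptual lock/unlock state machine for a semi-fungible spectrum token.
class SpectrumSFT:
    def __init__(self, token_id, total_units, owner):
        self.token_id = token_id              # NFT identity never changes
        self.balances = {owner: total_units}  # fungible units of the same asset
        self.locked = {}                      # owner -> units locked (e.g. staked)
        self.history = []                     # full lifecycle traceability

    def transfer(self, src, dst, units):
        free = self.balances.get(src, 0) - self.locked.get(src, 0)
        assert units <= free, "cannot move locked or unowned units"
        self.balances[src] -= units
        self.balances[dst] = self.balances.get(dst, 0) + units
        self.history.append(("transfer", src, dst, units))

    def lock(self, owner, units):
        assert self.balances.get(owner, 0) - self.locked.get(owner, 0) >= units
        self.locked[owner] = self.locked.get(owner, 0) + units
        self.history.append(("lock", owner, units))

    def unlock(self, owner, units):
        assert self.locked.get(owner, 0) >= units
        self.locked[owner] -= units
        self.history.append(("unlock", owner, units))
```

Because transfer, lock, and unlock only move units between states, the token's identity and its full history remain intact, which is the traceability property the paper emphasizes.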

[IR-4] MEDFORD in a Box: Improvements and Future Directions for a Metadata Description Language

链接: https://arxiv.org/abs/2601.15432
作者: Polina Shpilker,Benjamin Stubbs,Michael Sayers,Yumin Lee,Lenore Cowen,Donna Slonim,Shaun Wallace,Alva Couch,Noah M. Daniels
类目: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
备注: Extended version of “Cross-Referencing Metadata Through an Extension of the MEDFORD Language” from MTSR 2024

点击查看摘要

Abstract:Scientific research metadata is vital to ensure the validity, reusability, and cost-effectiveness of research efforts. The MEDFORD metadata language was previously introduced to simplify the process of writing and maintaining metadata for non-programmers. However, barriers to entry and usability remain, including limited automatic validation, difficulty of data transport, and user unfamiliarity with text file editing. To address these issues, we introduce MEDFORD-in-a-Box (MIAB), a documentation ecosystem to facilitate researcher adoption and earlier metadata capture. MIAB contains many improvements, including an updated MEDFORD parser with expanded validation routines and BagIt export capability. MIAB also includes an improved VS Code extension that supports these changes through a visual IDE. By simplifying metadata generation, this new tool supports the creation of correct, consistent, and reusable metadata, ultimately improving research reproducibility.

[IR-5] KnowTeX: Visualizing Mathematical Dependencies

链接: https://arxiv.org/abs/2601.15294
作者: Elif Uskuplu,Lawrence S. Moss,Valeria de Paiva
类目: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:Mathematical knowledge exists in many forms, ranging from informal textbooks and lecture notes to large formal proof libraries, yet moving between these representations remains difficult. Informal texts hide dependencies, while formal systems expose every detail in ways that are not always human-readable. Dependency graphs offer a middle ground by making visible the structure of results, definitions, and proofs. We present KnowTeX, a standalone, user-friendly tool that extends the ideas of Lean’s Blueprints, enabling the visualization of conceptual dependencies directly from LaTeX sources. Using a simple “uses” command, KnowTeX extracts relationships among statements and generates previewable graphs in DOT and TikZ formats. Applied to mathematical texts, such graphs clarify core results, support education and formalization, and provide a resource for aligning informal and formal mathematical representations. We argue that dependency graphs should become a standard feature of mathematical writing, benefiting both human readers and automated systems.
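
The extraction step can be sketched as a regex pass that pairs each labeled statement with its \uses{...} list and emits DOT; the exact KnowTeX syntax may differ, so treat this as an assumed approximation.

```python
# Illustrative extraction of "uses" dependencies from LaTeX into a DOT graph.
import re

def latex_to_dot(tex: str) -> str:
    edges = []
    # pair each \label{...} with the \uses{a, b, ...} in the same statement body
    for label, uses in re.findall(r"\\label\{([^}]*)\}.*?\\uses\{([^}]*)\}",
                                  tex, flags=re.DOTALL):
        for dep in (d.strip() for d in uses.split(",")):
            edges.append(f'  "{dep}" -> "{label}";')
    return "digraph deps {\n" + "\n".join(edges) + "\n}"

tex = r"""
\begin{theorem}\label{thm:main}\uses{lem:aux, def:space} ... \end{theorem}
"""
print(latex_to_dot(tex))
```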

附件下载

点击下载今日全部论文列表