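This post lists the latest papers retrieved from Arxiv.org on 2026-01-28. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive it by email on a schedule, leave your email address in the comments.

Note: the daily paper data is retrieved from Arxiv.org and refreshed automatically at around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.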

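Contents

Overview (2026-01-28)

562 papers were updated today, including:

  • Natural Language Processing: 64 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 159 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 91 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 175 papers (Machine Learning (cs.LG))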

Natural Language Processing

[NLP-0] Evaluation of Oncotimia: An LLM based system for supporting tumour boards

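[Quick Read]: This paper addresses the heavy documentation burden that multidisciplinary tumour boards (MDTBs) face in lung cancer decision-making, where large volumes of heterogeneous clinical information are processed by hand. The key to its solution is ONCOTIMIA, a modular and secure clinical tool that integrates generative AI (GenAI) and uses large language models (LLMs) to fill in lung cancer tumour board forms automatically. The system combines a multi-layer data lake, hybrid relational and vector storage, retrieval-augmented generation (RAG), and a rule-driven adaptive form model to turn unstructured clinical text into structured, standardised tumour board records, with the potential to significantly reduce documentation workload while preserving data quality.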

Link: https://arxiv.org/abs/2601.19899
Authors: Luis Lorenzo, Marcos Montana-Mendez, Sergio Figueiras, Miguel Boubeta, Cristobal Bernardo-Castineira
Affiliations: Bahía Software SLU
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 2 figures

Abstract:Multidisciplinary tumour boards (MDTBs) play a central role in oncology decision-making but require manual processes and structuring large volumes of heterogeneous clinical information, resulting in a substantial documentation burden. In this work, we present ONCOTIMIA, a modular and secure clinical tool designed to integrate generative artificial intelligence (GenAI) into oncology workflows and evaluate its application to the automatic completion of lung cancer tumour board forms using large language models (LLMs). The system combines a multi-layer data lake, hybrid relational and vector storage, retrieval-augmented generation (RAG) and a rule-driven adaptive form model to transform unstructured clinical documentation into structured and standardised tumour board records. We assess the performance of six LLMs deployed through AWS Bedrock on ten lung cancer cases, measuring both completion form accuracy and end-to-end latency. The results demonstrate high performance across models, with the best performing configuration achieving an 80% of correct field completion and clinically acceptable response time for most LLMs. Larger and more recent models exhibit best accuracies without incurring prohibitive latency. These findings provide empirical evidence that LLM-assisted autocompletion form is technically feasible and operationally viable in multidisciplinary lung cancer workflows and support its potential to significantly reduce documentation burden while preserving data quality.

[NLP-1] Post-LayerNorm Is Back: Stable, ExpressivE, and Deep

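[Quick Read]: This paper addresses the training instability that large language models (LLMs) face under depth scaling. Depth scaling offers theoretically superior expressivity, but current Transformer architectures struggle to train reliably at extreme depths. Post-LayerNorm (Post-LN) was replaced by Pre-LayerNorm (Pre-LN) in modern LLMs because of its instability at scale, which the paper traces to the ResNet-style residual pathway that introduces vanishing gradients in deep networks. The key solution is Keel, a Post-LN Transformer that replaces this residual path with a Highway-style connection, preserving gradient flow through the residual branch and preventing the signal from vanishing between the top and bottom layers. Keel trains stably at depths beyond 1000 layers without specialized initialization or complex optimization tricks, and consistently improves perplexity and depth-scaling characteristics over Pre-LN, opening a possible path toward future infinite-depth architectures.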

Link: https://arxiv.org/abs/2601.19895
Authors: Chen Chen, Lai Wei
Affiliations: ByteDance Seed
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Large language model (LLM) scaling is hitting a wall. Widening models yields diminishing returns, and extending context length does not improve fundamental expressivity. In contrast, depth scaling offers theoretically superior expressivity, yet current Transformer architectures struggle to train reliably at extreme depths. We revisit the Post-LayerNorm (Post-LN) formulation, whose instability at scale caused its replacement by Pre-LN in modern LLMs. We show that the central failure mode of Post-LN arises from the ResNet-style residual pathway, which introduces gradient vanishing in deep networks. We present Keel, a Post-LN Transformer that replaces this residual path with a Highway-style connection. This modification preserves the gradient flow through the residual branch, preventing signal vanishing from the top layers to the bottom. Unlike prior methods, Keel enables stable training at extreme depths without requiring specialized initialization or complex optimization tricks. Keel trains robustly at depths exceeding 1000 layers and consistently improves perplexity and depth-scaling characteristics over Pre-LN. These findings indicate that Post-LN, when paired with a Highway-style connection, provides a simple and effective foundation for building deeply scalable LLMs, opening the possibility for future infinite-depth architectures.

[NLP-2] Reflective Translation: Improving Low-Resource Machine Translation via Structured Self-Reflection NEURIPS2025 AAAI2025

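[Quick Read]: This paper addresses the poor machine translation quality of low-resource languages such as isiZulu and isiXhosa, which stems from limited parallel data and linguistic resources. The key to its solution is Reflective Translation, a prompt-based framework in which the model generates an initial translation, produces a structured self-critique, and then uses that reflection to generate a refined translation. The approach is model-agnostic and requires no fine-tuning. Experiments show consistent improvements in both BLEU and COMET between first- and second-pass translations, with average gains of up to +0.22 BLEU and +0.18 COMET, supporting structured self-reflection as an effective mechanism in low-resource settings.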

Link: https://arxiv.org/abs/2601.19871
Authors: Nicholas Cheng
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 3 figures, 6 tables. Accepted to the NeurIPS 2025 Workshop on Multilingual Representation Learning (Mexico City) and the AAAI 2025 Workshop on Language Models for Under-Resourced Communities (LM4UC). Code and data available at: this https URL

Abstract:Low-resource languages such as isiZulu and isiXhosa face persistent challenges in machine translation due to limited parallel data and linguistic resources. Recent advances in large language models suggest that self-reflection, prompting a model to critique and revise its own outputs, can improve reasoning quality and factual consistency. Building on this idea, this paper introduces Reflective Translation, a prompt-based framework in which a model generates an initial translation, produces a structured self-critique, and then uses this reflection to generate a refined translation. The approach is evaluated on English-isiZulu and English-isiXhosa translation using OPUS-100 and NTREX-African, across multiple prompting strategies and confidence thresholds. Results show consistent improvements in both BLEU and COMET scores between first- and second-pass translations, with average gains of up to +0.22 BLEU and +0.18 COMET. Statistical significance testing using paired nonparametric tests confirms that these improvements are robust. The proposed method is model-agnostic, requires no fine-tuning, and introduces a reflection-augmented dataset that can support future supervised or analysis-driven work. These findings demonstrate that structured self-reflection is a practical and effective mechanism for improving translation quality in low-resource settings.
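
The three passes described in the abstract (initial translation, structured self-critique, refined translation) can be sketched as a small pipeline. This is only an illustration: the prompt wording and the `generate` callable are hypothetical placeholders, not the paper's prompts or code, and the confidence thresholds mentioned in the abstract are not modelled.

```python
from typing import Callable, Dict

def reflective_translate(source: str, target_lang: str,
                         generate: Callable[[str], str]) -> Dict[str, str]:
    """Translate, critique, then refine.

    `generate` is any text-in/text-out model call (an assumption of this sketch).
    """
    # Pass 1: initial translation.
    draft = generate(f"Translate into {target_lang}:\n{source}")
    # Pass 2: structured self-critique of the draft.
    critique = generate(
        "List the errors in this translation (meaning, grammar, terminology).\n"
        f"Source: {source}\nTranslation: {draft}"
    )
    # Pass 3: refined translation conditioned on the critique.
    refined = generate(
        f"Rewrite the translation into {target_lang}, fixing the listed errors.\n"
        f"Source: {source}\nDraft: {draft}\nCritique: {critique}"
    )
    return {"draft": draft, "critique": critique, "refined": refined}
```

Because the model is passed in as a plain callable, the same function works with any hosted or local LLM client, which matches the model-agnostic, no-fine-tuning framing of the abstract.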

[NLP-3] Identifying and Transferring Reasoning-Critical Neurons: Improving LLM Inference Reliability via Activation Steering

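[Quick Read]: This paper addresses the limited reliability of large language models (LLMs) on complex reasoning tasks, and in particular how to improve reasoning accuracy without post-training or computationally expensive sampling strategies. The key to its solution is AdaRAS (Adaptive Reasoning Activation Steering), a lightweight test-time intervention framework. It identifies Reasoning-Critical Neurons (RCNs), a small subset of neurons strongly predictive of reasoning correctness, using a polarity-aware mean-difference criterion, and then adaptively steers their activations during inference, improving incorrect reasoning traces while avoiding degradation on cases that are already correct.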

Link: https://arxiv.org/abs/2601.19847
Authors: Fangan Dong, Zuming Yan, Xuri Ge, Zhiwei Xu, Mengqi Zhang, Xuanang Chen, Ben He, Xin Xin, Zhumin Chen, Ying Zhou
Affiliations: Shandong University; Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Despite the strong reasoning capabilities of recent large language models (LLMs), achieving reliable performance on challenging tasks often requires post-training or computationally expensive sampling strategies, limiting their practical efficiency. In this work, we first show that a small subset of neurons in LLMs exhibits strong predictive correlations with reasoning correctness. Based on this observation, we propose AdaRAS (Adaptive Reasoning Activation Steering), a lightweight test-time framework that improves reasoning reliability by selectively intervening on neuron activations. AdaRAS identifies Reasoning-Critical Neurons (RCNs) via a polarity-aware mean-difference criterion and adaptively steers their activations during inference, enhancing incorrect reasoning traces while avoiding degradation on already-correct cases. Experiments on 10 mathematics and coding benchmarks demonstrate consistent improvements, including over 13% gains on AIME-24 and AIME-25. Moreover, AdaRAS exhibits strong transferability across datasets and scalability to stronger models, outperforming post-training methods without additional training or sampling cost.
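
One plausible reading of a "polarity-aware mean-difference criterion" is sketched below: rank neurons by the gap between their mean activation on correct and incorrect traces, and keep the sign of the gap as the steering direction. The function names, the top-k rule, and the fixed step `alpha` are assumptions made for illustration; the paper's actual criterion and its adaptive gating are not reproduced here.

```python
import numpy as np

def select_neurons(acts_correct: np.ndarray, acts_incorrect: np.ndarray, k: int):
    """Pick the k neurons with the largest |mean(correct) - mean(incorrect)|.

    Inputs have shape (num_traces, num_neurons). Returns (indices, polarity),
    where polarity is +1 or -1, the sign of the mean difference.
    """
    diff = acts_correct.mean(axis=0) - acts_incorrect.mean(axis=0)
    idx = np.argsort(-np.abs(diff))[:k]
    return idx, np.sign(diff[idx])

def steer(activation: np.ndarray, idx: np.ndarray, polarity: np.ndarray,
          alpha: float = 1.0) -> np.ndarray:
    """Shift the selected neurons toward the direction seen on correct traces."""
    steered = activation.copy()
    steered[idx] += alpha * polarity
    return steered
```

A real system would also need a rule for when to intervene, since the abstract stresses avoiding degradation on already-correct cases; this sketch always applies the shift.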

[NLP-4] Neural Neural Scaling Laws

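[Quick Read]: This paper addresses the limitations of existing neural scaling laws in predicting downstream task performance. Predicting from validation perplexity has two key weaknesses: averaging token-level losses obscures task-specific signal, and no simple parametric family can describe the full range of scaling behaviors. The core of the solution is NeuNeu, a neural network that frames scaling-law prediction as time-series extrapolation. It combines observed accuracy trajectories with token-level validation losses and learns to predict future performance without assuming any bottleneck or functional form. The method reaches 2.04% mean absolute error (MAE) on 66 downstream tasks, a 38% reduction relative to logistic scaling laws (3.29% MAE), and generalizes zero-shot to unseen model families, parameter counts, and tasks.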

Link: https://arxiv.org/abs/2601.19831
Authors: Michael Y. Hu, Jane Pan, Ayush Rajesh Jhaveri, Nicholas Lourie, Kyunghyun Cho
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Neural scaling laws predict how language model performance improves with increased compute. While aggregate metrics like validation loss can follow smooth power-law curves, individual downstream tasks exhibit diverse scaling behaviors: some improve monotonically, others plateau, and some even degrade with scale. We argue that predicting downstream performance from validation perplexity suffers from two limitations: averaging token-level losses obscures signal, and no simple parametric family can capture the full spectrum of scaling behaviors. To address this, we propose Neural Neural Scaling Laws (NeuNeu), a neural network that frames scaling law prediction as time-series extrapolation. NeuNeu combines temporal context from observed accuracy trajectories with token-level validation losses, learning to predict future performance without assuming any bottleneck or functional form. Trained entirely on open-source model checkpoints from HuggingFace, NeuNeu achieves 2.04% mean absolute error in predicting model accuracy on 66 downstream tasks – a 38% reduction compared to logistic scaling laws (3.29% MAE). Furthermore, NeuNeu generalizes zero-shot to unseen model families, parameter counts, and downstream tasks. Our work suggests that predicting downstream scaling laws directly from data outperforms parametric alternatives.
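
The 38% figure in the abstract can be checked from the two MAE values, reading it as a relative reduction (an interpretation, since the abstract does not spell out the formula):

```latex
\[
\frac{\mathrm{MAE}_{\text{logistic}} - \mathrm{MAE}_{\text{NeuNeu}}}{\mathrm{MAE}_{\text{logistic}}}
= \frac{3.29 - 2.04}{3.29} = \frac{1.25}{3.29} \approx 0.380
\]
```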

[NLP-5] When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering

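[Quick Read]: This paper asks whether, for multi-hop reasoning in scientific domains, iterative retrieval-reasoning loops in retrieval-augmented generation (RAG) can outperform an idealized static RAG that supplies all gold evidence at once ("Gold Context"). The core challenge is designing mechanisms that improve model performance under sparse domain knowledge, heterogeneous evidence, and complex reasoning paths. The key to the solution is "Iterative RAG", a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping to form a staged retrieval-reasoning process. It consistently outperforms Gold Context on the chemistry question-answering dataset ChemKGMultiHopQA, with gains of up to 25.6 percentage points, especially for models without reasoning fine-tuning. Staged retrieval mitigates early hypothesis drift and context overload, indicating that staging is often more influential than the mere presence of ideal evidence.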

Link: https://arxiv.org/abs/2601.19827
Authors: Mahdi Astaraki, Mohammad Arshi Saloot, Ali Shiraee Kasmaee, Hamidreza Mahyar, Soheila Samiee
Affiliations: McMaster University; BASF Canada Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 27 pages, 15 figures

Abstract:Retrieval-Augmented Generation (RAG) extends large language models (LLMs) beyond parametric knowledge, yet it is unclear when iterative retrieval-reasoning loops meaningfully outperform static RAG, particularly in scientific domains with multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence. We provide the first controlled, mechanism-level diagnostic study of whether synchronized iterative retrieval and reasoning can surpass an idealized static upper bound (Gold Context) RAG. We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, measuring reliance on parametric memory; (ii) Gold Context, where all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate questions requiring genuine retrieval and analyze behavior with diagnostics spanning retrieval coverage gaps, anchor-carry drop, query quality, composition fidelity, and control calibration. Across models, Iterative RAG consistently outperforms Gold Context, with gains up to 25.6 percentage points, especially for non-reasoning fine-tuned models. Staged retrieval reduces late-hop failures, mitigates context overload, and enables dynamic correction of early hypothesis drift, but remaining failure modes include incomplete hop coverage, distractor latch trajectories, early stopping miscalibration, and high composition failure rates even with perfect retrieval. Overall, staged retrieval is often more influential than the mere presence of ideal evidence; we provide practical guidance for deploying and diagnosing RAG systems in specialized scientific settings and a foundation for more reliable, controllable iterative retrieval-reasoning frameworks.
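
The control flow of regime (iii), a training-free controller alternating retrieval, hypothesis refinement, and evidence-aware stopping, might look roughly like the loop below. The three callables, the query construction, and the hop budget are hypothetical stand-ins chosen for this sketch; the paper's controller is not reproduced here.

```python
from typing import Callable, List, Tuple

def iterative_rag(
    question: str,
    retrieve: Callable[[str], List[str]],
    refine: Callable[[str, List[str]], str],
    should_stop: Callable[[str, List[str]], bool],
    max_hops: int = 4,
) -> Tuple[str, List[str]]:
    """Alternate retrieval and hypothesis refinement until the stop rule fires."""
    evidence: List[str] = []
    hypothesis = ""
    for _ in range(max_hops):
        # Condition the next query on the current hypothesis (assumed design).
        query = f"{question} {hypothesis}".strip()
        for doc in retrieve(query):
            if doc not in evidence:  # keep evidence deduplicated
                evidence.append(doc)
        hypothesis = refine(question, evidence)
        if should_stop(hypothesis, evidence):
            break
    return hypothesis, evidence
```

In contrast, the Gold Context regime would correspond to a single call to `refine` with all oracle evidence supplied up front, which is what makes the two regimes easy to compare under the same model.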

[NLP-6] Zero-Shot Stance Detection in the Wild: Dynamic Target Generation and Multi-Target Adaptation

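[Quick Read]: This paper addresses the fact that in real-world social media, stance targets are neither predefined nor static, whereas conventional stance detection assumes given targets and adapts poorly to complex, changing settings. The key to its solution is a new task, zero-shot stance detection in the wild with Dynamic Target Generation and Multi-Target Adaptation (DGTA), which automatically identifies multiple target-stance pairs from text without prior target knowledge. The authors build a Chinese social media stance detection dataset, design multi-dimensional evaluation metrics, and explore integrated and two-stage fine-tuning strategies for large language models (LLMs). In the experiments, the two-stage fine-tuned Qwen2.5-7B attains the highest comprehensive target recognition score of 66.99%, and the integrated fine-tuned DeepSeek-R1-Distill-Qwen-7B reaches a stance detection F1 score of 79.26%.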

Link: https://arxiv.org/abs/2601.19802
Authors: Aohua Li, Yuanshuo Zhang, Ge Gao, Bo Chen, Xiaobing Zhao
Affiliations: Minzu University of China; National Language Resource Monitoring and Research Center of Minority Languages; Institute of National Security; School of Minority Languages and Literatures
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Current stance detection research typically relies on predicting stance based on given targets and text. However, in real-world social media scenarios, targets are neither predefined nor static but rather complex and dynamic. To address this challenge, we propose a novel task: zero-shot stance detection in the wild with Dynamic Target Generation and Multi-Target Adaptation (DGTA), which aims to automatically identify multiple target-stance pairs from text without prior target knowledge. We construct a Chinese social media stance detection dataset and design multi-dimensional evaluation metrics. We explore both integrated and two-stage fine-tuning strategies for large language models (LLMs) and evaluate various baseline models. Experimental results demonstrate that fine-tuned LLMs achieve superior performance on this task: the two-stage fine-tuned Qwen2.5-7B attains the highest comprehensive target recognition score of 66.99%, while the integrated fine-tuned DeepSeek-R1-Distill-Qwen-7B achieves a stance detection F1 score of 79.26%.

[NLP-7] LVLMs and Humans Ground Differently in Referential Communication

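[Quick Read]: This paper addresses the inaccurate prediction of human intent that arises when generative AI agents collaborate with human users without being able to model common ground. The key to its approach is a factorial experiment built on a multi-turn referential communication task, which systematically compares dialogue behavior across human-human, human-AI, AI-human, and AI-AI director-matcher pairs. The authors release a corpus of 356 dialogues (89 pairs over 4 rounds each) together with the data collection pipeline and analysis tools, exposing the limits of large vision-language models (LVLMs) in interactively resolving referring expressions and providing empirical evidence and tooling for improving human-AI collaborative understanding.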

Link: https://arxiv.org/abs/2601.19792
Authors: Peter Zeng, Weiling Li, Amie Paige, Zhengxiang Wang, Panagiotis Kaliosis, Dimitris Samaras, Gregory Zelinsky, Susan Brennan, Owen Rambow
Affiliations: Stony Brook University; Institute for Advanced Computational Science
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 24 pages, 16 figures, preprint

Abstract:For generative AI agents to partner effectively with human users, the ability to accurately predict human intent is critical. But this ability to collaborate remains limited by a critical deficit: an inability to model common ground. Here, we present a referential communication experiment with a factorial design involving director-matcher pairs (human-human, human-AI, AI-human, and AI-AI) that interact with multiple turns in repeated rounds to match pictures of objects not associated with any obvious lexicalized labels. We release the online pipeline for data collection, the tools and analyses for accuracy, efficiency, and lexical overlap, and a corpus of 356 dialogues (89 pairs over 4 rounds each) that unmasks LVLMs’ limitations in interactively resolving referring expressions, a crucial skill that underlies human language use.

[NLP-8] Strong Reasoning Isn't Enough: Evaluating Evidence Elicitation in Interactive Diagnosis

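[Quick Read]: This paper addresses how to evaluate an agent's ability to proactively elicit missing clinical evidence under uncertainty in interactive medical consultation; existing evaluations are mostly static or outcome-centric and neglect the evidence-gathering process itself. The key to its solution is an interactive evaluation framework built on a simulated patient and a simulated reporter grounded in atomic evidence, together with the Information Coverage Rate (ICR), a metric that quantifies how completely an agent uncovers the necessary evidence during interaction. The authors also build the EviMed benchmark to support systematic study and propose REFINE, a strategy that uses diagnostic verification to guide the agent in proactively resolving uncertainties. REFINE consistently outperforms baselines and lets smaller agents achieve superior performance under strong reasoning supervision.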

Link: https://arxiv.org/abs/2601.19773
Authors: Zhuohan Long, Zhijie Bao, Zhongyu Wei
Affiliations: Fudan University; Shanghai Innovation Institute
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Interactive medical consultation requires an agent to proactively elicit missing clinical evidence under uncertainty. Yet existing evaluations largely remain static or outcome-centric, neglecting the evidence-gathering process. In this work, we propose an interactive evaluation framework that explicitly models the consultation process using a simulated patient and a simulated reporter grounded in atomic evidences. Based on this representation, we introduce Information Coverage Rate (ICR) to quantify how completely an agent uncovers necessary evidence during interaction. To support systematic study, we build EviMed, an evidence-based benchmark spanning diverse conditions from common complaints to rare diseases, and evaluate 10 models with varying reasoning abilities. We find that strong diagnostic reasoning does not guarantee effective information collection, and this insufficiency acts as a primary bottleneck limiting performance in interactive settings. To address this, we propose REFINE, a strategy that leverages diagnostic verification to guide the agent in proactively resolving uncertainties. Extensive experiments demonstrate that REFINE consistently outperforms baselines across diverse datasets and facilitates effective model collaboration, enabling smaller agents to achieve superior performance under strong reasoning supervision. Our code can be found at this https URL.
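
The abstract describes ICR only as a measure of how completely an agent uncovers necessary evidence. A natural reading, assumed here and not confirmed by the source, is the fraction of required atomic evidence items that the agent elicited during the dialogue. The evidence labels in the usage example are invented.

```python
from typing import Iterable

def information_coverage_rate(required: Iterable[str], elicited: Iterable[str]) -> float:
    """Fraction of required atomic evidence items that were elicited.

    Assumed definition: |required ∩ elicited| / |required|.
    Elicited items that are not required do not change the score.
    """
    required_set = set(required)
    if not required_set:
        raise ValueError("at least one required evidence item is needed")
    return len(required_set & set(elicited)) / len(required_set)

# Hypothetical case: two of four required items were asked about.
score = information_coverage_rate(
    ["fever", "cough", "travel history", "chest pain"],
    ["fever", "cough", "headache"],
)
```

Under this reading, an agent can reach a correct diagnosis by luck while scoring low on ICR, which is the gap between reasoning and evidence gathering that the paper highlights.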

[NLP-9] TokenSeek: Memory Efficient Fine Tuning via Instance-Aware Token Ditching ICLR2026

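[Quick Read]: This paper addresses the inefficiency of fine-tuning large language models (LLMs) on downstream tasks, where activations dominate training memory consumption. Existing memory-efficient methods do optimize activations, but their data-agnostic nature often makes fine-tuning ineffective and unstable. The paper proposes TokenSeek, a universal plugin solution whose core innovation is instance-aware token seeking and ditching, which identifies and keeps the tokens that matter for each instance. This substantially lowers fine-tuning memory (for example, only 14.8% of the memory on Llama3.2 1B) with on-par or better performance, and the interpretable token seeking process offers insights for future research on token efficiency.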

Link: https://arxiv.org/abs/2601.19739
Authors: Runjia Zeng, Qifan Wang, Qiang Guan, Ruixiang Tang, Lifu Huang, Zhenting Wang, Xueling Zhang, Cheng Han, Dongfang Liu
Affiliations: Rochester Institute of Technology; Meta AI; Kent State University; Rutgers University; UC Davis; Accenture; University of Missouri-Kansas City
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ICLR 2026

Abstract:Fine tuning has been regarded as a de facto approach for adapting large language models (LLMs) to downstream tasks, but the high training memory consumption inherited from LLMs makes this process inefficient. Among existing memory efficient approaches, activation-related optimization has proven particularly effective, as activations consistently dominate overall memory consumption. Although prior arts offer various activation optimization strategies, their data-agnostic nature ultimately results in ineffective and unstable fine tuning. In this paper, we propose TokenSeek, a universal plugin solution for various transformer-based models through instance-aware token seeking and ditching, achieving significant fine-tuning memory savings (e.g., requiring only 14.8% of the memory on Llama3.2 1B) with on-par or even better performance. Furthermore, our interpretable token seeking process reveals the underlying reasons for its effectiveness, offering valuable insights for future research on token efficiency. Homepage: this https URL

[NLP-10] RvB: Automating AI System Hardening via Iterative Red-Blue Games

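[Quick Read]: This paper addresses a critical gap in AI security exposed by the dual offensive and defensive utility of large language models (LLMs): the lack of a unified framework for dynamic, iterative adversarial hardening. The core of its solution is the Red Team vs. Blue Team (RvB) framework, formulated as a training-free, sequential, imperfect-information game. The Red Team exposes vulnerabilities and the Blue Team learns effective defenses without parameter updates, which pushes the system toward fundamental defensive principles rather than overfitting to specific exploits. In dynamic code hardening against CVEs and guardrail optimization against jailbreaks, RvB achieves Defense Success Rates of 90% and 45% respectively while keeping False Positive Rates near 0%, significantly surpassing baselines.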

Link: https://arxiv.org/abs/2601.19726
Authors: Lige Huang, Zicheng Liu, Jie Zhang, Lewen Yan, Dongrui Liu, Jing Shao
Affiliations: Shanghai Artificial Intelligence Laboratory; Institute of Information Engineering, Chinese Academy of Sciences; Shanghai Jiao Tong University
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The dual offensive and defensive utility of Large Language Models (LLMs) highlights a critical gap in AI security: the lack of unified frameworks for dynamic, iterative adversarial adaptation hardening. To bridge this gap, we propose the Red Team vs. Blue Team (RvB) framework, formulated as a training-free, sequential, imperfect-information game. In this process, the Red Team exposes vulnerabilities, driving the Blue Team to learning effective solutions without parameter updates. We validate our framework across two challenging domains: dynamic code hardening against CVEs and guardrail optimization against jailbreaks. Our empirical results show that this interaction compels the Blue Team to learn fundamental defensive principles, leading to robust remediations that are not merely overfitted to specific exploits. RvB achieves Defense Success Rates of 90% and 45% across the respective tasks while maintaining near 0% False Positive Rates, significantly surpassing baselines. This work establishes the iterative adversarial interaction framework as a practical paradigm that automates the continuous hardening of AI systems.

[NLP-11] Component-Level Lesioning of Language Models Reveals Clinically Aligned Aphasia Phenotypes

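[Quick Read]: This paper asks how large language models (LLMs) can be systematically manipulated to simulate the language-production impairments of aphasia, providing scalable computational proxies for testing rehabilitation hypotheses and a clinically inspired framework for probing the functional organization of language. The key to its solution is a clinically grounded, component-level perturbation framework that selectively disrupts the functional components of an LLM linked to specific aphasia subtypes (such as Broca's and Wernicke's aphasia), applied through a unified intervention interface to both modular Mixture-of-Experts (MoE) models and dense Transformers. The method identifies subtype-linked components and induces graded impairments by progressively perturbing the top-k components, with outcomes quantified by Western Aphasia Battery (WAB) subtests summarized as the Aphasia Quotient (AQ), enabling controlled simulation and analysis of language impairment patterns.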

Link: https://arxiv.org/abs/2601.19723
Authors: Yifan Wang, Jichen Zheng, Jingyuan Sun, Yunhao Zhang, Chunyu Ye, Jixing Li, Chengqing Zong, Shaonan Wang
Affiliations: The University of Manchester; CAS (Chinese Academy of Sciences); University of Chinese Academy of Sciences; City University of Hong Kong; Hong Kong Polytechnic University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) increasingly exhibit human-like linguistic behaviors and internal representations that they could serve as computational simulators of language cognition. We ask whether LLMs can be systematically manipulated to reproduce language-production impairments characteristic of aphasia following focal brain lesions. Such models could provide scalable proxies for testing rehabilitation hypotheses, and offer a controlled framework for probing the functional organization of language. We introduce a clinically grounded, component-level framework that simulates aphasia by selectively perturbing functional components in LLMs, and apply it to both modular Mixture-of-Experts models and dense Transformers using a unified intervention interface. Our pipeline (i) identifies subtype-linked components for Broca’s and Wernicke’s aphasia, (ii) interprets these components with linguistic probing tasks, and (iii) induces graded impairments by progressively perturbing the top-k subtype-linked components, evaluating outcomes with Western Aphasia Battery (WAB) subtests summarized by Aphasia Quotient (AQ). Across architectures and lesioning strategies, subtype-targeted perturbations yield more systematic, aphasia-like regressions than size-matched random perturbations, and MoE modularity supports more localized and interpretable phenotype-to-component mappings. These findings suggest that modular LLMs, combined with clinically informed component perturbations, provide a promising platform for simulating aphasic language production and studying how distinct language functions degrade under targeted disruptions.

[NLP-12] SynCABEL: Synthetic Contextualized Augmentation for Biomedical Entity Linking

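[Quick Read]: This paper addresses the central bottleneck in supervised biomedical entity linking (BEL): the scarcity of expert-annotated training data. The key to its solution is SynCABEL (Synthetic Contextualized Augmentation for Biomedical Entity Linking), a framework that uses large language models (LLMs) to generate context-rich synthetic training examples for all candidate concepts in a target knowledge base, providing broad supervision without manual annotation. Combined with decoder-only models and guided inference, it sets new state-of-the-art results on three widely used multilingual benchmarks (MedMentions, QUAERO, and SPACCC) and reaches the performance of full human supervision with up to 60% less annotated data, substantially reducing reliance on costly expert labeling.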

Link: https://arxiv.org/abs/2601.19667
Authors: Adam Remaki, Christel Gérardin, Eulàlia Farré-Maduell, Martin Krallinger, Xavier Tannier
Affiliations: Sorbonne Université; Inserm; Université Sorbonne Paris Nord; Limics; Barcelona Supercomputing Center
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:We present SynCABEL (Synthetic Contextualized Augmentation for Biomedical Entity Linking), a framework that addresses a central bottleneck in supervised biomedical entity linking (BEL): the scarcity of expert-annotated training data. SynCABEL leverages large language models to generate context-rich synthetic training examples for all candidate concepts in a target knowledge base, providing broad supervision without manual annotation. We demonstrate that SynCABEL, when combined with decoder-only models and guided inference establish new state-of-the-art results across three widely used multilingual benchmarks: MedMentions for English, QUAERO for French, and SPACCC for Spanish. Evaluating data efficiency, we show that SynCABEL reaches the performance of full human supervision using up to 60% less annotated data, substantially reducing reliance on labor-intensive and costly expert labeling. Finally, acknowledging that standard evaluation based on exact code matching often underestimates clinically valid predictions due to ontology redundancy, we introduce an LLM-as-a-judge protocol. This analysis reveals that SynCABEL significantly improves the rate of clinically valid predictions. Our synthetic datasets, models, and code are released to support reproducibility and future research.

[NLP-13] One Token Is Enough: Improving Diffusion Language Models with a Sink Token

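[Quick Read]: This paper addresses the "moving sink" phenomenon in Diffusion Language Models (DLMs): sink tokens show low-norm representations in the Transformer's value space, and their positions are unstable across diffusion steps, which undermines inference robustness. The key to its solution is one extra special sink token, implemented through a modified attention mask so that it attends only to itself while remaining globally visible to all other tokens. Experiments show that this design stabilizes attention sinks and substantially improves model performance, and that the token's effect is independent of its position and carries negligible semantic content, supporting its role as a robust, dedicated structural sink.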

Link: https://arxiv.org/abs/2601.19657
Authors: Zihou Zhang, Zheyong Xie, Li Zhong, Haifeng Liu, Shaosheng Cao
Affiliations: Xiaohongshu.inc
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Diffusion Language Models (DLMs) have emerged as a compelling alternative to autoregressive approaches, enabling parallel text generation with competitive performance. Despite these advantages, there is a critical instability in DLMs: the moving sink phenomenon. Our analysis indicates that sink tokens exhibit low-norm representations in the Transformer’s value space, and that the moving sink phenomenon serves as a protective mechanism in DLMs to prevent excessive information mixing. However, their unpredictable positions across diffusion steps undermine inference robustness. To resolve this, we propose a simple but effective extra sink token implemented via a modified attention mask. Specifically, we introduce a special token constrained to attend solely to itself, while remaining globally visible to all other tokens. Experimental results demonstrate that introducing a single extra token stabilizes attention sinks, substantially improving model performance. Crucially, further analysis confirms that the effectiveness of this token is independent of its position and characterized by negligible semantic content, validating its role as a robust and dedicated structural sink.
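
The mask change described in the abstract, a special token that attends only to itself while staying visible to every other token, can be illustrated with a boolean attention mask. This is a sketch under assumptions: the base attention is taken to be fully bidirectional, and the function name and the `sink_pos` parameter are invented for the example.

```python
import numpy as np

def sink_attention_mask(seq_len: int, sink_pos: int = 0) -> np.ndarray:
    """Boolean mask for seq_len tokens plus one extra sink token.

    mask[q, k] is True when query position q may attend to key position k.
    """
    n = seq_len + 1
    mask = np.ones((n, n), dtype=bool)  # assumed bidirectional base attention
    mask[sink_pos, :] = False           # the sink query sees no other token ...
    mask[sink_pos, sink_pos] = True     # ... except itself
    # Column sink_pos stays True, so every token can still attend to the sink.
    return mask

mask = sink_attention_mask(3)
```

Only one row of the mask changes, so the modification adds a single token to the sequence and leaves attention among ordinary tokens untouched. The abstract reports that the effect does not depend on the token's position, which is why `sink_pos` is left as a free parameter here.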

[NLP-14] RATE: Reviewer Profiling and Annotation-free Training for Expertise Ranking in Peer Review Systems

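[Quick Read]: This paper addresses the evaluation bottleneck in reviewer assignment in the LLM era, where rapid topic shifts have made older benchmarks outdated and proxy signals poorly reflect how familiar a reviewer really is with a paper. The key to its solution is twofold. LR-bench is a high-fidelity, up-to-date benchmark built from 2024-2025 AI/NLP manuscripts with five-level self-assessed familiarity ratings. RATE is a reviewer-centric ranking framework that distills each reviewer's recent publications into compact keyword-based profiles and fine-tunes an embedding model with weak preference supervision built from heuristic retrieval signals, so that each manuscript can be matched directly against a reviewer profile. On both LR-bench and the CMU gold-standard dataset, the approach clearly outperforms strong embedding baselines.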

Link: https://arxiv.org/abs/2601.19637
Authors: Weicong Liu, Zixuan Yang, Yibo Zhao, Xiang Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 18 pages

Abstract:Reviewer assignment is increasingly critical yet challenging in the LLM era, where rapid topic shifts render many pre-2023 benchmarks outdated and where proxy signals poorly reflect true reviewer familiarity. We address this evaluation bottleneck by introducing LR-bench, a high-fidelity, up-to-date benchmark curated from 2024-2025 AI/NLP manuscripts with five-level self-assessed familiarity ratings collected via a large-scale email survey, yielding 1055 expert-annotated paper-reviewer-score annotations. We further propose RATE, a reviewer-centric ranking framework that distills each reviewer’s recent publications into compact keyword-based profiles and fine-tunes an embedding model with weak preference supervision constructed from heuristic retrieval signals, enabling matching each manuscript against a reviewer profile directly. Across LR-bench and the CMU gold-standard dataset, our approach consistently achieves state-of-the-art performance, outperforming strong embedding baselines by a clear margin. We release LR-bench at this https URL, and a GitHub repository at this https URL.

[NLP-15] Up to 36x Speedup: Mask-based Parallel Inference Paradigm for Key Information Extraction in MLLMs ICASSP2026

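[Quick Read]: This paper addresses the efficiency bottleneck that autoregressive inference creates in Key Information Extraction (KIE), especially on visually-rich documents (VrDs), where conventional methods generate semantically independent fields one after another and incur significant latency. The key to its solution is PIP, a Parallel Inference Paradigm that uses "[mask]" tokens as placeholders for all target values so that multiple fields are generated simultaneously in a single forward pass. To support the paradigm, the authors design a tailored mask pre-training strategy and construct large-scale supervised datasets. Experiments show a 5-36x inference speedup with negligible performance degradation.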

Link: https://arxiv.org/abs/2601.19613
Authors: Xinzhong Wang, Ya Guo, Jing Li, Huan Chen, Yi Tu, Yijie Hong, Gongshen Liu, Huijia Zhu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by ICASSP 2026

Abstract:Key Information Extraction (KIE) from visually-rich documents (VrDs) is a critical task, for which recent Large Language Models (LLMs) and Multi-Modal Large Language Models (MLLMs) have demonstrated strong potential. However, their reliance on autoregressive inference, which generates outputs sequentially, creates a significant efficiency bottleneck, especially as KIE tasks often involve extracting multiple, semantically independent fields. To overcome this limitation, we introduce PIP: a Parallel Inference Paradigm for KIE. Our approach reformulates the problem by using “[mask]” tokens as placeholders for all target values, enabling their simultaneous generation in a single forward pass. To facilitate this paradigm, we develop a tailored mask pre-training strategy and construct large-scale supervised datasets. Experimental results show that our PIP-models achieve a 5-36x inference speedup with negligible performance degradation compared to traditional autoregressive base models. By substantially improving efficiency while maintaining high accuracy, PIP paves the way for scalable and practical real-world KIE solutions.

[NLP-16] Explicit Multi-head Attention for Inter-head Interaction in Large Language Models

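【Quick read】: This paper addresses the performance limits that insufficient inter-head interaction imposes on attention in large language models, and in particular how to model cross-head interaction in multi-head attention more effectively. The key to the solution is a new mechanism, Multi-head Explicit Attention (MEA), with two components: a Head-level Linear Composition (HLC) module that applies learnable linear combinations separately to the key and value vectors across heads to strengthen cross-head information flow, and a head-level Group Normalization layer that aligns the statistics of the recombined heads and stabilizes training. The design makes pretraining more robust, allowing larger learning rates and faster convergence. With fewer attention heads, HLC can reconstruct low-rank "virtual heads", which enables KV-cache compression that halves memory usage with negligible loss on knowledge-intensive and scientific reasoning tasks and a 3.59% accuracy drop on Olympiad-level mathematical benchmarks.

Link: https://arxiv.org/abs/2601.19611
Authors: Runyu Peng,Yunhua Zhou,Demin Song,Kai Lv,Bo Wang,Qipeng Guo,Xipeng Qiu
Affiliations: Shanghai AI Laboratory; Fudan University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)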

Abstract:In large language models built upon the Transformer architecture, recent studies have shown that inter-head interaction can enhance attention performance. Motivated by this, we propose Multi-head Explicit Attention (MEA), a simple yet effective attention variant that explicitly models cross-head interaction. MEA consists of two key components: a Head-level Linear Composition (HLC) module that separately applies learnable linear combinations to the key and value vectors across heads, thereby enabling rich inter-head communication; and a head-level Group Normalization layer that aligns the statistical properties of the recombined heads. MEA shows strong robustness in pretraining, which allows the use of larger learning rates that lead to faster convergence, ultimately resulting in lower validation loss and improved performance across a range of tasks. Furthermore, we explore the parameter efficiency of MEA by reducing the number of attention heads and leveraging HLC to reconstruct them using low-rank “virtual heads”. This enables a practical key-value cache compression strategy that reduces KV-cache memory usage by 50% with negligible performance loss on knowledge-intensive and scientific reasoning tasks, and only a 3.59% accuracy drop for Olympiad-level mathematical benchmarks.
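
The head recombination described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy 4x2 mixing matrix, and the choice to normalize each recombined head without learned scale or shift are all hypothetical.

```python
import math

def mix_heads(heads, weights):
    """Recombine per-head vectors: out[g] = sum_h weights[g][h] * heads[h]."""
    dim = len(heads[0])
    return [
        [sum(w * head[d] for w, head in zip(row, heads)) for d in range(dim)]
        for row in weights
    ]

def normalize_head(vec, eps=1e-5):
    """Normalize one head to zero mean and unit variance (no learned affine)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

# Two stored key heads for a single token (head dimension 4).
stored_keys = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
# Hypothetical learned mixing matrix: 4 "virtual" heads rebuilt from 2 stored heads.
mixing = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, -1.0]]

mixed_keys = mix_heads(stored_keys, mixing)
virtual_keys = [normalize_head(h) for h in mixed_keys]
```

Only the two stored heads would need to live in the KV cache here, which is where a 50% saving would come from when four heads are reconstructed from two.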

[NLP-17] Decompose-and-Formalise: Recursively Verifiable Natural Language Inference

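【Quick read】: This paper addresses two core problems that arise when premises and hypotheses in natural language inference (NLI) are automatically formalised into logical expressions: long, syntactically complex inputs amplify autoformalisation errors and cause proofs to fail, and existing methods usually respond to a failure by regenerating the whole explanation, which is costly and makes the source of the error hard to localise. The key to the solution is a decompose-and-formalise framework that decomposes each premise-hypothesis pair into an entailment tree of atomic steps, verifies the tree bottom-up to pinpoint failing nodes, and refines those nodes locally using prover diagnostics instead of regenerating globally. It also introduces θ-substitution in an event-based logical form to enforce consistent argument-role bindings and improve the faithfulness of autoformalisation. The method raises explanation verification rates (by up to 48.9% over the state of the art) while reducing refinement iterations and runtime and preserving NLI accuracy.

Link: https://arxiv.org/abs/2601.19605
Authors: Xin Quan,Marco Valentino,Louise A. Dennis,André Freitas
Affiliations: University of Manchester; University of Sheffield; Idiap Research Institute; National Biomarker Centre, CRUK-MI
Subjects: Computation and Language (cs.CL)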

Abstract:Recent work has shown that integrating large language models (LLMs) with theorem provers (TPs) in neuro-symbolic pipelines helps with entailment verification and proof-guided refinement of explanations for natural language inference (NLI). However, scaling such refinement to naturalistic NLI remains difficult: long, syntactically rich inputs and deep multi-step arguments amplify autoformalisation errors, where a single local mismatch can invalidate the proof. Moreover, current methods often handle failures via costly global regeneration due to the difficulty of localising the responsible span or step from prover diagnostics. Aiming to address these problems, we propose a decompose-and-formalise framework that (i) decomposes premise-hypothesis pairs into an entailment tree of atomic steps, (ii) verifies the tree bottom-up to isolate failures to specific nodes, and (iii) performs local diagnostic-guided refinement instead of regenerating the whole explanation. Moreover, to improve faithfulness of autoformalisation, we introduce θ-substitution in an event-based logical form to enforce consistent argument-role bindings. Across a range of reasoning tasks using five LLM backbones, our method achieves the highest explanation verification rates, improving over the state-of-the-art by 26.2%, 21.7%, 21.6% and 48.9%, while reducing refinement iterations and runtime and preserving strong NLI accuracy.
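
Bottom-up verification of an entailment tree, with failures isolated to specific nodes, might look roughly like the sketch below. The `Step` class, the `failing_steps` function, and the toy `proves` callback (a stand-in for a real theorem-prover call) are hypothetical names introduced for illustration; the paper's actual data structures and prover interface are not given in the abstract.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Step:
    """One atomic entailment step: the children support the conclusion."""
    conclusion: str
    children: List["Step"] = field(default_factory=list)

def failing_steps(node: Step, proves: Callable[[Step], bool]) -> List[Step]:
    """Check children before their parent and return only the steps that fail."""
    failures: List[Step] = []
    for child in node.children:
        failures.extend(failing_steps(child, proves))
    if not proves(node):
        failures.append(node)
    return failures

# Toy tree; in practice each step would be autoformalised and sent to a prover.
tree = Step("Tom is alive", [
    Step("Tom is an animal", [Step("Tom is a cat"), Step("A cat is an animal")]),
    Step("Animals in this story are alive"),
])

# Stand-in prover that rejects exactly one intermediate step.
to_refine = failing_steps(tree, lambda s: s.conclusion != "Tom is an animal")
```

Only the steps in `to_refine` would then be passed back for local refinement, leaving the rest of the explanation untouched.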

[NLP-18] Yunque DeepResearch Technical Report

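【Quick read】: This paper addresses three core challenges for deep research in autonomous agents: escalating contextual noise in long-horizon tasks, fragility that leads to cascading errors, and a lack of modular extensibility. The authors propose Yunque DeepResearch, a hierarchical, modular, and robust framework built on three components: (1) a centralized multi-agent orchestration system that routes subtasks to an atomic capability pool of tools and specialized sub-agents; (2) a dynamic context management mechanism that structures completed sub-goals into semantic summaries to mitigate information overload; and (3) a proactive supervisor module that keeps the system resilient through anomaly detection and context pruning. The framework achieves state-of-the-art performance on GAIA, BrowseComp, BrowseComp-ZH, and Humanity's Last Exam.

Link: https://arxiv.org/abs/2601.19578
Authors: Yuxuan Cai,Xinyi Lai,Peng Yuan,Weiting Liu,Huajian Li,Mingda Li,Xinghua Wang,Shengxie Zheng,Yanchao Hao,Yuyang Yin,Zheng Wei
Affiliations: Tencent BAC; Tsinghua University; Fudan University
Subjects: Computation and Language (cs.CL)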

Abstract:Deep research has emerged as a transformative capability for autonomous agents, empowering Large Language Models to navigate complex, open-ended tasks. However, realizing its full potential is hindered by critical limitations, including escalating contextual noise in long-horizon tasks, fragility leading to cascading errors, and a lack of modular extensibility. To address these challenges, we introduce Yunque DeepResearch, a hierarchical, modular, and robust framework. The architecture is characterized by three key components: (1) a centralized Multi-Agent Orchestration System that routes subtasks to an Atomic Capability Pool of tools and specialized sub-agents; (2) a Dynamic Context Management mechanism that structures completed sub-goals into semantic summaries to mitigate information overload; and (3) a proactive Supervisor Module that ensures resilience through active anomaly detection and context pruning. Yunque DeepResearch achieves state-of-the-art performance across a range of agentic deep research benchmarks, including GAIA, BrowseComp, BrowseComp-ZH, and Humanity’s Last Exam. We open-source the framework, reproducible implementations, and application cases to empower the community.

[NLP-19] Benchmarks Saturate When The Model Gets Smarter Than The Judge

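【Quick read】: This paper addresses the reduced validity of benchmarks for Large Language Models (LLMs) caused by inaccurate datasets and flawed evaluation methods. Its core contribution is Omni-MATH-2, a manually revised version of Omni-MATH in which every problem was audited for LaTeX compilability, solvability, and verifiability, clutter was removed, and problems requiring a proof, an estimation, or an image were tagged, which significantly reduces dataset-induced noise. The authors also use expert annotations to assess the judge models and find that the original Omni-Judge is wrong in 96.4% of judge disagreements, so it cannot reliably differentiate model abilities even well before the benchmark saturates. The study concludes that high-quality data and reliable judges are both prerequisites for accurate evaluation of LLM performance.

Link: https://arxiv.org/abs/2601.19532
Authors: Marthe Ballon,Andres Algaba,Brecht Verbeken,Vincent Ginis
Affiliations: Data Analytics Lab, Vrije Universiteit Brussel, Pleinlaan 5, 1050 Brussel, Belgium; imec-SMIT, Vrije Universiteit Brussel, Pleinlaan 9, 1050 Brussels, Belgium; School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts 02138, USA
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 17 pages, 10 figures, 3 tables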

Abstract:Benchmarks are important tools to track progress in the development of Large Language Models (LLMs), yet inaccuracies in datasets and evaluation methods consistently undermine their effectiveness. Here, we present Omni-MATH-2, a manually revised version of the Omni-MATH dataset comprising a clean, exact-answer subset (n=4181) and a tagged, non-standard subset (n=247). Each problem was audited to ensure LaTeX compilability, solvability and verifiability, which involved adding missing figures or information, labeling problems requiring a proof, estimation or image, and removing clutter. This process significantly reduces dataset-induced noise, thereby providing a more precise assessment of model performance. The annotated dataset also allows us to evaluate judge-induced noise by comparing GPT-5 mini with the original Omni-Judge, revealing substantial discrepancies between judges on both the clean and tagged problem subsets. Expert annotations reveal that Omni-Judge is wrong in 96.4% of the judge disagreements, indicating its inability to differentiate between models’ abilities, even well before saturation of the benchmark occurs. As problems become more challenging, we find that increasingly competent judges become essential in order to prevent judge errors from masking genuine differences between models. Finally, neither judge identifies the present failure modes for the subset of tagged problems, demonstrating that dataset quality and judge reliability are both critical to develop accurate benchmarks of model performance.

[NLP-20] Enhancing Academic Paper Recommendations Using Fine-Grained Knowledge Entities and Multifaceted Document Embeddings

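【Quick read】: This paper addresses the inability of current academic paper recommendation systems to meet scholars' fine-grained needs during research, such as finding papers that use a particular research method or tackle a specific subtask. The key to the solution is a recommendation method that fuses multidimensional information: it integrates new types of fine-grained knowledge entities with document titles and abstracts and with citation data to produce combined paper vectors, and then recommends papers by the similarity between these vectors. On the STM-KG dataset the method reaches an average precision of 27.3% over the top 50 recommendations, an improvement of 6.7% over existing approaches.

Link: https://arxiv.org/abs/2601.19513
Authors: Haixu Xi,Heng Zhang,Chengzhi Zhang
Affiliations: Jiangsu University of Science and Technology; Central China Normal University; Nanjing University of Science and Technology
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Digital Libraries (cs.DL)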

Abstract:In the era of explosive growth in academic literature, the burden of literature review on scholars is increasing. Proactively recommending academic papers that align with scholars’ literature needs in the research process has become one of the crucial pathways to enhance research efficiency and stimulate innovative thinking. Current academic paper recommendation systems primarily focus on broad and coarse-grained suggestions based on general topic or field similarities. While these systems effectively identify related literature, they fall short in addressing scholars’ more specific and fine-grained needs, such as locating papers that utilize particular research methods, or tackle distinct research tasks within the same topic. To meet the diverse and specific literature needs of scholars in the research process, this paper proposes a novel academic paper recommendation method. This approach embeds multidimensional information by integrating new types of fine-grained knowledge entities, title and abstract of document, and citation data. Recommendations are then generated by calculating the similarity between combined paper vectors. The proposed recommendation method was evaluated using the STM-KG dataset, a knowledge graph that incorporates scientific concepts derived from papers across ten distinct domains. The experimental results indicate that our method outperforms baseline models, achieving an average precision of 27.3% among the top 50 recommendations. This represents an improvement of 6.7% over existing approaches.
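
As a rough illustration of recommending by similarity between combined paper vectors, the sketch below concatenates three per-paper views and ranks by cosine similarity. Concatenation as the fusion step, cosine as the similarity measure, and the toy 2-dimensional vectors are assumptions made for the example; the abstract does not specify how the views are combined.

```python
import math

def combine(entity_vec, text_vec, citation_vec):
    """Concatenate the three views into one paper vector (one possible fusion)."""
    return list(entity_vec) + list(text_vec) + list(citation_vec)

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recommend(query_id, papers, k=50):
    """Rank every other paper by similarity to the query paper, best first."""
    query = papers[query_id]
    scored = [(cosine(query, vec), pid) for pid, vec in papers.items() if pid != query_id]
    return [pid for _, pid in sorted(scored, reverse=True)[:k]]

# Toy vectors: (knowledge entities, title+abstract, citations).
papers = {
    "p1": combine([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]),
    "p2": combine([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),
    "p3": combine([0.0, 1.0], [0.0, 1.0], [1.0, 0.0]),
}
top = recommend("p1", papers, k=2)
```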

[NLP-21] ALRM: Agentic LLM for Robotic Manipulation

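【Quick read】: This paper addresses two problems in current LLM-based robotic manipulation frameworks: existing methods lack modular agentic mechanisms for closed-loop execution, which limits their ability to plan, reflect on outcomes, and revise actions; and existing manipulation benchmarks focus on low-level control without systematically evaluating multistep reasoning and linguistic variation. The key to the solution is Agentic LLM for Robot Manipulation (ALRM), a framework that integrates policy generation with agentic execution through a ReAct-style reasoning loop and supports two complementary modes: Code-as-Policy (CaP), which directly generates executable control code, and Tool-as-Policy (TaP), which executes actions through iterative planning and tool calls. The authors also build a new simulation benchmark of 56 tasks for systematic evaluation.

Link: https://arxiv.org/abs/2601.19510
Authors: Vitor Gaboardi dos Santos,Ibrahim Khadraoui,Ibrahim Farhat,Hamza Yous,Samy Teffahi,Hakim Hacid
Affiliations: unknown
Subjects: Robotics (cs.RO); Computation and Language (cs.CL)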

Abstract:Large Language Models (LLMs) have recently empowered agentic frameworks to exhibit advanced reasoning and planning capabilities. However, their integration in robotic control pipelines remains limited in two aspects: (1) prior LLM-based approaches often lack modular, agentic execution mechanisms, limiting their ability to plan, reflect on outcomes, and revise actions in a closed-loop manner; and (2) existing benchmarks for manipulation tasks focus on low-level control and do not systematically evaluate multistep reasoning and linguistic variation. In this paper, we propose Agentic LLM for Robot Manipulation (ALRM), an LLM-driven agentic framework for robotic manipulation. ALRM integrates policy generation with agentic execution through a ReAct-style reasoning loop, supporting two complementary modes: Code-as-Policy (CaP) for direct executable control code generation, and Tool-as-Policy (TaP) for iterative planning and tool-based action execution. To enable systematic evaluation, we also introduce a novel simulation benchmark comprising 56 tasks across multiple environments, capturing linguistically diverse instructions. Experiments with ten LLMs demonstrate that ALRM provides a scalable, interpretable, and modular approach for bridging natural language reasoning with reliable robotic execution. Results reveal Claude-4.1-Opus as the top closed-source model and Falcon-H1-7B as the top open-source model under CaP.

[NLP-22] Automated Safety Benchmarking: A Multi-agent Pipeline for LVLMs

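【Quick read】: This paper addresses the safety challenges of Large Vision-Language Models (LVLMs) in cross-modal tasks, and in particular the fact that existing safety benchmarks are labor-intensive to build, static in complexity, and limited in discriminative power, so they struggle to keep pace with rapidly evolving models and emerging risks. The key to the solution is VLSafetyBencher, described as the first automated system for LVLM safety benchmarking. It introduces four collaborating agents (Data Preprocessing, Generation, Augmentation, and Selection) that construct and select high-quality safety test samples. Experiments show that the system can build a benchmark within one week at minimal cost, and that the resulting benchmark yields a 70% gap in safety rate between the most and least safe models.

Link: https://arxiv.org/abs/2601.19507
Authors: Xiangyang Zhu,Yuan Tian,Zicheng Zhang,Qi Jia,Chunyi Li,Renrui Zhang,Heng Li,Zongrui Wang,Wei Sun
Affiliations: Shanghai AI Lab; PolyU HK; East China Normal University
Subjects: Computation and Language (cs.CL)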

Abstract:Large vision-language models (LVLMs) exhibit remarkable capabilities in cross-modal tasks but face significant safety challenges, which undermine their reliability in real-world applications. Efforts have been made to build LVLM safety evaluation benchmarks to uncover their vulnerability. However, existing benchmarks are hindered by their labor-intensive construction process, static complexity, and limited discriminative power. Thus, they may fail to keep pace with rapidly evolving models and emerging risks. To address these limitations, we propose VLSafetyBencher, the first automated system for LVLM safety benchmarking. VLSafetyBencher introduces four collaborative agents: Data Preprocessing, Generation, Augmentation, and Selection agents to construct and select high-quality samples. Experiments validate that VLSafetyBencher can construct high-quality safety benchmarks within one week at a minimal cost. The generated benchmark effectively distinguishes safety, with a safety rate disparity of 70% between the most and least safe models.

[NLP-23] GradPruner: Gradient-Guided Layer Pruning Enabling Efficient Fine-Tuning and Inference for LLMs ICLR2026

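【Quick read】: This paper addresses the low training efficiency and high compute cost of fine-tuning Large Language Models (LLMs) on downstream tasks, while also improving inference efficiency. Conventional structured pruning improves inference efficiency but typically needs extra training, knowledge distillation, or structure search, which adds further cost. The key to the solution is GradPruner, which accumulates the gradients of each parameter during the initial phase of fine-tuning into an Initial Gradient Information Accumulation Matrix (IGIA-Matrix), uses it to assess layer importance, and prunes layers accordingly. The pruned layers are then sparsified and merged into the remaining layers, merging only elements with the same sign to reduce interference from sign differences. Experiments report a 40% parameter reduction with only a 0.99% drop in accuracy.

Link: https://arxiv.org/abs/2601.19503
Authors: Wei Huang,Anda Cheng,Yinggui Wang
Affiliations: Ant Group
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by ICLR2026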

Abstract:Fine-tuning Large Language Models (LLMs) with downstream data is often considered time-consuming and expensive. Structured pruning methods are primarily employed to improve the inference efficiency of pre-trained models. Meanwhile, they often require additional time and memory for training, knowledge distillation, structure search, and other strategies, making efficient model fine-tuning challenging to achieve. To simultaneously enhance the training and inference efficiency of downstream task fine-tuning, we introduce GradPruner, which can prune layers of LLMs guided by gradients in the early stages of fine-tuning. GradPruner uses the cumulative gradients of each parameter during the initial phase of fine-tuning to compute the Initial Gradient Information Accumulation Matrix (IGIA-Matrix) to assess the importance of layers and perform pruning. We sparsify the pruned layers based on the IGIA-Matrix and merge them with the remaining layers. Only elements with the same sign are merged to reduce interference from sign variations. We conducted extensive experiments on two LLMs across eight downstream datasets, including medical, financial, and general benchmark tasks. The results demonstrate that GradPruner has achieved a parameter reduction of 40% with only a 0.99% decrease in accuracy. Our code is publicly available.
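
The two ideas in the abstract, scoring layers from accumulated gradients and merging only same-sign elements, can be sketched as follows. This is a simplified illustration and not the released GradPruner code: scoring a layer by the summed magnitude of its accumulated gradients is an assumption, the sparsification step is omitted, and both function names are hypothetical.

```python
def layer_importance(accumulated_grads):
    """Score each layer by the total magnitude of its accumulated gradients."""
    return [sum(abs(g) for g in layer) for layer in accumulated_grads]

def merge_same_sign(kept, pruned):
    """Add a pruned layer's weight into the kept layer only where signs agree."""
    return [k + p if k * p > 0 else k for k, p in zip(kept, pruned)]

# Toy accumulated gradients for three layers (flattened to short lists).
accumulated = [[0.9, -0.8, 0.7], [0.1, 0.0, -0.1], [0.5, 0.4, -0.6]]
scores = layer_importance(accumulated)
least_important = min(range(len(scores)), key=scores.__getitem__)  # layer 1 here

# Merge the pruned layer's weights into a neighbouring kept layer.
kept_weights = [0.5, -0.25, 1.0]
pruned_weights = [0.25, 0.5, -2.0]
merged = merge_same_sign(kept_weights, pruned_weights)  # [0.75, -0.25, 1.0]
```

Positions where the two layers disagree in sign keep the original weight, which is the interference the sign rule is meant to avoid.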

[NLP-24] ClaimPT: A Portuguese Dataset of Annotated Claims in News Articles

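【Quick read】: This paper addresses the slow progress of fact-checking research in low-resource languages, European Portuguese in particular, caused by the lack of high-quality annotated data. Fact-checking still depends heavily on manual verification and cannot keep up with the spread of misinformation online; automated methods have made progress in high-resource languages such as English but remain limited by data scarcity in Portuguese. The key to the solution is ClaimPT, a dataset of 1,308 news articles with 6,875 annotations of factual claims. The articles come from LUSA, the Portuguese News Agency, rather than from social media or parliamentary transcripts, so the data is closer to real journalistic content. Two trained annotators labeled each article and a curator validated all annotations, and baseline models are provided to establish initial benchmarks for later NLP and information retrieval (IR) work in low-resource settings.

Link: https://arxiv.org/abs/2601.19490
Authors: Ricardo Campos,Raquel Sequeira,Sara Nerea,Inês Cantante,Diogo Folques,Luís Filipe Cunha,João Canavilhas,António Branco,Alípio Jorge,Sérgio Nunes,Nuno Guimarães,Purificação Silvano
Affiliations: unknown
Subjects: Computation and Language (cs.CL)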

Abstract:Fact-checking remains a demanding and time-consuming task, still largely dependent on manual verification and unable to match the rapid spread of misinformation online. This is particularly important because debunking false information typically takes longer to reach consumers than the misinformation itself; accelerating corrections through automation can therefore help counter it more effectively. Although many organizations perform manual fact-checking, this approach is difficult to scale given the growing volume of digital content. These limitations have motivated interest in automating fact-checking, where identifying claims is a crucial first step. However, progress has been uneven across languages, with English dominating due to abundant annotated data. Portuguese, like other languages, still lacks accessible, licensed datasets, limiting research, NLP developments and applications. In this paper, we introduce ClaimPT, a dataset of European Portuguese news articles annotated for factual claims, comprising 1,308 articles and 6,875 individual annotations. Unlike most existing resources based on social media or parliamentary transcripts, ClaimPT focuses on journalistic content, collected through a partnership with LUSA, the Portuguese News Agency. To ensure annotation quality, two trained annotators labeled each article, with a curator validating all annotations according to a newly proposed scheme. We also provide baseline models for claim detection, establishing initial benchmarks and enabling future NLP and IR applications. By releasing ClaimPT, we aim to advance research on low-resource fact-checking and enhance understanding of misinformation in news media.

[NLP-25] Dynamic Multi-Expert Projectors with Stabilized Routing for Multilingual Speech Recognition

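【Quick read】: This paper addresses the difficulty a single projector has in modeling the diverse acoustic-to-semantic mappings of different languages in multilingual automatic speech recognition (ASR). The key to the solution is SMEAR-MoE, a stabilized Mixture-of-Experts (MoE) projector that ensures dense gradient flow to all experts, which prevents expert collapse while still allowing cross-lingual sharing, and thus yields more robust multilingual ASR.

Link: https://arxiv.org/abs/2601.19451
Authors: Isha Pandey,Ashish Mittal,Vartul Bahuguna,Ganesh Ramakrishnan
Affiliations: unknown
Subjects: Computation and Language (cs.CL)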

Abstract:Recent advances in LLM-based ASR connect frozen speech encoders with Large Language Models (LLMs) via lightweight projectors. While effective in monolingual settings, a single projector struggles to capture the diverse acoustic-to-semantic mappings required for multilingual ASR. To address this, we propose SMEAR-MoE, a stabilized Mixture-of-Experts projector that ensures dense gradient flow to all experts, preventing expert collapse while enabling cross-lingual sharing. We systematically compare monolithic, static multi-projector, and dynamic MoE designs across four Indic languages (Hindi, Marathi, Tamil, Telugu). Our SMEAR-MoE achieves strong performance, delivering up to a 7.6% relative WER reduction over the single-projector baseline, while maintaining comparable runtime efficiency. Analysis of expert routing further shows linguistically meaningful specialization, with related languages sharing experts. These results demonstrate that stable multi-expert projectors are key to scalable and robust multilingual ASR.
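
One way to give every expert a gradient on every input is to merge the experts' parameters with the router's soft probabilities and apply the single merged projector, instead of picking a hard top-k. The sketch below shows that soft-merging idea on toy 2x2 matrices; whether SMEAR-MoE works exactly this way is an assumption based on the name and the "dense gradient flow" description, and all function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def merge_experts(experts, probs):
    """Average expert weight matrices, weighted by the router probabilities."""
    rows, cols = len(experts[0]), len(experts[0][0])
    return [
        [sum(p * e[r][c] for p, e in zip(probs, experts)) for c in range(cols)]
        for r in range(rows)
    ]

def project(weight, x):
    """Apply the merged projector (matrix-vector product) to one feature vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

# Two toy expert projectors and a router that is undecided between them.
experts = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
probs = softmax([0.0, 0.0])             # [0.5, 0.5]
merged = merge_experts(experts, probs)  # every expert contributes
output = project(merged, [2.0, 4.0])    # [3.0, 3.0]
```

Because each expert's weights enter the merged matrix with a nonzero coefficient, none of them is starved of updates, which is the failure mode behind expert collapse.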

[NLP-26] KG-CRAFT: Knowledge Graph-based Contrastive Reasoning with LLMs for Enhancing Automated Fact-checking EACL2026

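【Quick read】: This paper addresses the accuracy of claim verification in automated fact-checking systems, that is, how to make better use of reliable evidence sources such as documents or knowledge bases to judge whether a statement is true. The key to the solution is KG-CRAFT, which builds a knowledge graph from claims and their associated reports, generates contextually relevant contrastive questions from the graph structure, uses these questions to distill the key information from the evidence reports, and synthesises it into a concise summary that large language models (LLMs) use for veracity assessment. This knowledge graph-based contrastive reasoning achieves a new state of the art on two real-world datasets, LIAR-RAW and RAWFC.

Link: https://arxiv.org/abs/2601.19447
Authors: Vítor N. Lourenço,Aline Paes,Tillman Weyde,Audrey Depeige,Mohnish Dubey
Affiliations: Universidade Federal Fluminense; Amazon; City St George’s, University of London
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to publication at the 19th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2026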

Abstract:Claim verification is a core component of automated fact-checking systems, aimed at determining the truthfulness of a statement by assessing it against reliable evidence sources such as documents or knowledge bases. This work presents KG-CRAFT, a method that improves automatic claim verification by leveraging large language models (LLMs) augmented with contrastive questions grounded in a knowledge graph. KG-CRAFT first constructs a knowledge graph from claims and associated reports, then formulates contextually relevant contrastive questions based on the knowledge graph structure. These questions guide the distillation of evidence-based reports, which are synthesised into a concise summary that is used for veracity assessment by LLMs. Extensive evaluations on two real-world datasets (LIAR-RAW and RAWFC) demonstrate that our method achieves a new state-of-the-art in predictive performance. Comprehensive analyses validate in detail the effectiveness of our knowledge graph-based contrastive reasoning approach in improving LLMs’ fact-checking capabilities.

[NLP-27] Ad Insertion in LLM-Generated Responses

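【Quick read】: This paper addresses a key challenge for the sustainable monetization of Large Language Models (LLMs): how to insert ads that are contextually coherent and incentive compatible while preserving user experience, ethical and regulatory compliance, and computational efficiency. Traditional search advertising based on static keywords fails to capture the fleeting, context-dependent user intents embedded in conversational flows. The core of the solution is two decoupling strategies: ad insertion is decoupled from response generation to ensure safety and explicit ad disclosure, and bidding is decoupled from specific user queries by letting advertisers bid on "genres" (high-level semantic clusters), which reduces the privacy risks and computational burden of bidding on sensitive real-time responses. The authors further show that applying the VCG auction mechanism within this framework yields approximately dominant strategy incentive compatibility (DSIC) and individual rationality (IR), approximately optimal social welfare, and high computational efficiency.

Link: https://arxiv.org/abs/2601.19435
Authors: Shengwei Xu,Zhaohua Chen,Xiaotie Deng,Zhiyi Huang,Grant Schoenebeck
Affiliations: University of Michigan; Peking University; The University of Hong Kong
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 31 pages, 8 figures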

Abstract:Sustainable monetization of Large Language Models (LLMs) remains a critical open challenge. Traditional search advertising, which relies on static keywords, fails to capture the fleeting, context-dependent user intents–the specific information, goods, or services a user seeks–embedded in conversational flows. Beyond the standard goal of social welfare maximization, effective LLM advertising imposes additional requirements on contextual coherence (ensuring ads align semantically with transient user intents) and computational efficiency (avoiding user interaction latency), as well as adherence to ethical and regulatory standards, including preserving privacy and ensuring explicit ad disclosure. Although various recent solutions have explored bidding on token-level and query-level, both categories of approaches generally fail to holistically satisfy this multifaceted set of constraints. We propose a practical framework that resolves these tensions through two decoupling strategies. First, we decouple ad insertion from response generation to ensure safety and explicit disclosure. Second, we decouple bidding from specific user queries by using “genres” (high-level semantic clusters) as a proxy. This allows advertisers to bid on stable categories rather than sensitive real-time response, reducing computational burden and privacy risks. We demonstrate that applying the VCG auction mechanism to this genre-based framework yields approximately dominant strategy incentive compatibility (DSIC) and individual rationality (IR), as well as approximately optimal social welfare, while maintaining high computational efficiency. Finally, we introduce an “LLM-as-a-Judge” metric to estimate contextual coherence. Our experiments show that this metric correlates strongly with human ratings (Spearman’s ρ ≈ 0.66), outperforming 80% of individual human evaluators.
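
To make the VCG step concrete, here is a minimal sketch of a textbook VCG position auction run on the bids for one genre. It is not the paper's mechanism: the slot weights (a stand-in for how much attention each ad position gets), the advertiser names, and the function names are hypothetical, and slot weights are assumed to be sorted from best to worst.

```python
def allocate(bids, slot_weights):
    """Give the best slots to the highest bidders; returns {advertiser: weight}."""
    ranked = sorted(bids, key=bids.get, reverse=True)
    return dict(zip(ranked, slot_weights))

def welfare(bids, allocation, exclude=None):
    """Total value of an allocation, optionally ignoring one advertiser."""
    return sum(bids[a] * w for a, w in allocation.items() if a != exclude)

def vcg_payments(bids, slot_weights):
    """Charge each winner the welfare loss its presence imposes on the others."""
    chosen = allocate(bids, slot_weights)
    payments = {}
    for winner in chosen:
        others = {a: b for a, b in bids.items() if a != winner}
        without_winner = welfare(others, allocate(others, slot_weights))
        with_winner = welfare(bids, chosen, exclude=winner)
        payments[winner] = without_winner - with_winner
    return payments

# Bids that three advertisers placed on a hypothetical "travel" genre.
travel_bids = {"a": 10.0, "b": 6.0, "c": 2.0}
payments = vcg_payments(travel_bids, slot_weights=[1.0, 0.5])
# payments == {"a": 4.0, "b": 1.0}
```

Each winner pays less than its own bid times its slot weight, and its payment depends only on the other advertisers' bids, which is the property behind truthful bidding in VCG.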

[NLP-28] Do LLMs Truly Benefit from Longer Context in Automatic Post-Editing?

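【Quick read】: This paper studies how well large language models (LLMs) perform document-level automatic post-editing (APE), and in particular whether they make effective use of document context when correcting errors. It systematically compares proprietary and open-weight models under a naive document-level prompting setup with simple one-shot prompts. Proprietary models reach near human-level APE quality whether or not document context is provided; they are more robust to data poisoning than open-weight models, but this robustness also shows that they largely fail to exploit document-level context, and their cost and latency are substantial. The study highlights the practical limits of current LLM-based APE and points to the need for more efficient long-context modeling for translation refinement.

Link: https://arxiv.org/abs/2601.19410
Authors: Ahrii Kim,Seong-heum Kim
Affiliations: unknown
Subjects: Computation and Language (cs.CL)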

Abstract:Automatic post-editing (APE) aims to refine machine translations by correcting residual errors. Although recent large language models (LLMs) demonstrate strong translation capabilities, their effectiveness for APE–especially under document-level context–remains insufficiently understood. We present a systematic comparison of proprietary and open-weight LLMs under a naive document-level prompting setup, analyzing APE quality, contextual behavior, robustness, and efficiency. Our results show that proprietary LLMs achieve near human-level APE quality even with simple one-shot prompting, regardless of whether document context is provided. While these models exhibit higher robustness to data poisoning attacks than open-weight counterparts, this robustness also reveals a limitation: they largely fail to exploit document-level context for contextual error correction. Furthermore, standard automatic metrics do not reliably reflect these qualitative improvements, highlighting the continued necessity of human evaluation. Despite their strong performance, the substantial cost and latency overheads of proprietary LLMs render them impractical for real-world APE deployment. Overall, our findings elucidate both the promise and current limitations of LLM-based document-aware APE, and point toward the need for more efficient long-context modeling approaches for translation refinement.
Related DOI: https://doi.org/10.36227/techrxiv.176107895.57699371/v1

[NLP-29] Binary Token-Level Classification with DeBERTa for All-Type MWE Identification: A Lightweight Approach with Linguistic Enhancement EACL2026

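【Quick read】: This paper addresses performance bottlenecks in multiword expression (MWE) identification, especially under data sparsity and class imbalance. The solution rests on three points: reformulating MWE detection as binary token-level START/END/INSIDE classification instead of span-based prediction; incorporating noun-phrase (NP) chunking and dependency features, which help with discontinuous and NOUN-type MWEs; and applying oversampling to counter severe class imbalance in the training data. With a DeBERTa-v3-large model that has 165x fewer parameters than Qwen-72B, the authors reach 69.8% F1 on the CoAM dataset, 12 points above the previous best result (57.8% F1), and confirm generalization on the STREUSLE dataset with 78.9% F1.

Link: https://arxiv.org/abs/2601.19360
Authors: Diego Rossini,Lonneke van der Plas
Affiliations: Università della Svizzera Italiana
Subjects: Computation and Language (cs.CL)
Comments: Accepted at Findings of EACL 2026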

Abstract:We present a comprehensive approach for multiword expression (MWE) identification that combines binary token-level classification, linguistic feature integration, and data augmentation. Our DeBERTa-v3-large model achieves 69.8% F1 on the CoAM dataset, surpassing the best results (Qwen-72B, 57.8% F1) on this dataset by 12 points while using 165x fewer parameters. We achieve this performance by (1) reformulating detection as binary token-level START/END/INSIDE classification rather than span-based prediction, (2) incorporating NP chunking and dependency features that help discontinuous and NOUN-type MWEs identification, and (3) applying oversampling that addresses severe class imbalance in the training data. We confirm the generalization of our method on the STREUSLE dataset, achieving 78.9% F1. These results demonstrate that carefully designed smaller models can substantially outperform LLMs on structured NLP tasks, with important implications for resource-constrained deployments.

[NLP-30] Cross-Examination Framework: A Task-Agnostic Diagnostic for Information Fidelity in Text-to-Text Generation

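【Quick read】: This paper addresses the failure of traditional text generation metrics such as BLEU and BERTScore to capture semantic fidelity, in particular their difficulty in detecting content omissions and factual contradictions in tasks such as translation, summarization, and clinical note generation. The key to the solution is the Cross-Examination Framework (CEF), a reference-free, multi-dimensional evaluation that treats the source and candidate texts as independent knowledge bases, generates verifiable questions from each, and cross-examines them to derive three interpretable scores: Coverage, Conformity, and Consistency. The main contributions include a systematic robustness analysis for selecting a stable judge model and a strong correlation between the reference-free and with-reference modes, which supports the reliability of the reference-free mode. Human expert validation further shows that the mismatching questions CEF identifies align more with meaning-altering semantic errors, especially entity-based and relational distortions, than with non-semantic errors.

Link: https://arxiv.org/abs/2601.19350
Authors: Tathagata Raha,Clement Christophe,Nada Saadi,Hamza A Javed,Marco AF Pimentel,Ronnie Rajan,Praveenkumar Kanithi
Affiliations: M42 Health
Subjects: Computation and Language (cs.CL)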

Abstract:Traditional metrics like BLEU and BERTScore fail to capture semantic fidelity in generative text-to-text tasks. We adapt the Cross-Examination Framework (CEF) for a reference-free, multi-dimensional evaluation by treating the source and candidate as independent knowledge bases. CEF generates verifiable questions from each text and performs a cross-examination to derive three interpretable scores: Coverage, Conformity, and Consistency. Validated across translation, summarization and clinical note-generation, our framework identifies critical errors, such as content omissions and factual contradictions, missed by standard metrics. A key contribution is a systematic robustness analysis to select a stable judge model. Crucially, the strong correlation between our reference-free and with-reference modes validates CEF’s reliability without gold references. Furthermore, human expert validation demonstrates that CEF mismatching questions align with meaning-altering semantic errors higher than with non-semantic errors, particularly excelling at identifying entity-based and relational distortions.

[NLP-31] When Benchmarks Leak: Inference-Time Decontamination for LLMs

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在基准测试(benchmark-based evaluation)中因测试集污染(test set contamination)而导致评估结果不可靠的问题。测试集污染指测试样本或其近似变体泄露至训练数据中,从而人为提升模型在基准上的表现。现有解决方案分为两类:一类是提前识别并移除污染项,但会改变原始评估集且在中重度污染下失效;另一类是在评估时抑制污染行为,却常干扰正常推理并导致干净输入性能显著下降。本文提出DeconIEP框架,其关键在于在评估阶段通过施加小而有界的输入嵌入空间扰动(input embedding space perturbations),引导模型避开由记忆驱动的捷径路径(memorization-driven shortcut pathways)。该方法借助一个相对较少污染的参考模型(reference model)学习实例自适应的扰动生成器(instance-adaptive perturbation generator),实现无需修改基准、不破坏原始模型结构即可有效去污,同时对良性任务性能影响最小。

Link: https://arxiv.org/abs/2601.19334
Authors: Jianzhe Chai,Yu Zhe,Jun Sakuma
Affiliations: Institute of Science Tokyo; RIKEN AIP
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Benchmark-based evaluation is the de facto standard for comparing large language models (LLMs). However, its reliability is increasingly threatened by test set contamination, where test samples or their close variants leak into training data and artificially inflate reported performance. To address this issue, prior work has explored two main lines of mitigation. One line attempts to identify and remove contaminated benchmark items before evaluation, but this inevitably alters the evaluation set itself and becomes unreliable when contamination is moderate or severe. The other line preserves the benchmark and instead suppresses contaminated behavior at evaluation time; however, such interventions often interfere with normal inference and lead to noticeable performance degradation on clean inputs. We propose DeconIEP, a decontamination framework that operates entirely during evaluation by applying small, bounded perturbations in the input embedding space. Guided by a relatively less-contaminated reference model, DeconIEP learns an instance-adaptive perturbation generator that steers the evaluated model away from memorization-driven shortcut pathways. Across multiple open-weight LLMs and benchmarks, extensive empirical results show that DeconIEP achieves strong decontamination effectiveness while incurring only minimal degradation in benign utility.
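
The bounded embedding-space perturbation described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say which norm bounds the perturbation, so an L2 ball of radius `eps` is used, and the names `project_l2` and `perturb_embedding` are hypothetical. In DeconIEP the perturbation comes from a learned, instance-adaptive generator; here it is simply a given vector.

```python
import math
from typing import List


def project_l2(delta: List[float], eps: float) -> List[float]:
    """Scale delta back onto the L2 ball of radius eps if it lies outside."""
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= eps or norm == 0.0:
        return list(delta)
    scale = eps / norm
    return [d * scale for d in delta]


def perturb_embedding(embedding: List[float], delta: List[float], eps: float) -> List[float]:
    """Add a norm-bounded perturbation to a single token embedding (hypothetical helper)."""
    if len(embedding) != len(delta):
        raise ValueError("embedding and delta must have the same dimension")
    bounded = project_l2(delta, eps)
    return [e + d for e, d in zip(embedding, bounded)]
```

The projection step is what keeps the intervention "small and bounded": however large the generator's raw output is, the embedding never moves by more than `eps` in L2 norm.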

[NLP-32] Formula-One Prompting: Adaptive Reasoning Through Equations For Applied Mathematics

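[Quick Read]: This paper addresses the fact that current prompting techniques for large language models (LLMs) do not explicitly model governing equations when solving applied mathematics problems in domains such as finance, physics, and cryptography. Chain-of-Thought (CoT) and Program-of-Thought (PoT) improve mathematical reasoning by structuring intermediate steps in natural language or code, but they do not explicitly exploit the step of recalling or deriving governing equations from the problem description. The core of the solution is Formula-One Prompting (F-1), a two-phase approach: it first formulates governing equations from the problem description as an intermediate representation, then adaptively selects one of three solving strategies (CoT, PoT, or direct computation) based on the generated equations, all within a single LLM call. Across five models and four benchmarks, F-1 outperforms CoT by +5.76% and PoT by +8.42% on average, with the largest gains in applied domains, which supports using equations as an intermediate representation for complex problem solving.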

Link: https://arxiv.org/abs/2601.19302
Authors: Natapong Nitarach,Pittawat Taveekitworachai,Kunat Pipatanakul
Affiliations: SCB 10X, SCBX Group
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Prompting techniques such as Chain-of-Thought (CoT) and Program-of-Thought (PoT) improve LLM mathematical reasoning by structuring intermediate steps in natural language or code. However, applied mathematics problems in domains like finance, physics, and cryptography often require recalling or deriving governing equations, a step that current approaches do not explicitly leverage. We propose Formula-One Prompting (F-1), a two-phase approach that uses mathematical equations as an intermediate representation before adaptive solving. F-1 first formulates governing equations from problem descriptions, then selects a solving strategy among CoT, PoT, or direct computation based on the generated equations, all within a single LLM call. Results across five models and four benchmarks show F-1 outperforms CoT by +5.76% and PoT by +8.42% on average. Crucially, gains are largest in applied domains: +13.30% on FinanceMath over CoT, and within OlympiadBench, larger gains on physics (+2.55%) than pure math (+0.44%). This demonstrates that F-1 is more effective than CoT in applied mathematics problems.
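
The two-phase, single-call structure described above could look roughly like the following prompt builder. This is a sketch under stated assumptions: the prompt wording, the `Answer:` output convention, and the function names `build_f1_prompt` and `parse_answer` are all hypothetical and are not taken from the paper.

```python
from typing import Optional

# Strategy labels taken from the abstract; the short names are our own.
STRATEGIES = ("CoT", "PoT", "direct")


def build_f1_prompt(problem: str) -> str:
    """Build one prompt that asks for equations first, then a strategy and a solution."""
    if not problem.strip():
        raise ValueError("problem must be non-empty")
    return (
        "Phase 1: Write the governing equations needed for the problem below.\n"
        "Phase 2: Based on those equations, choose one strategy "
        f"({', '.join(STRATEGIES)}) and solve the problem with it.\n"
        "Finish with a line of the form 'Answer: <value>'.\n\n"
        f"Problem: {problem.strip()}"
    )


def parse_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' line, or None if there is none."""
    for line in reversed(completion.splitlines()):
        if line.strip().lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return None
```

Because both phases live in one prompt, the strategy choice costs no extra model call; the trade-off is that the caller has to parse the phases out of a single completion.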

[NLP-33] MetaGen: Self-Evolving Roles and Topologies for Multi-Agent LLM Reasoning

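[Quick Read]: This paper addresses the task mismatch, poor adaptation to new evidence, and inflated inference cost that arise when multi-agent systems rely on a fixed role library and a static interaction topology for complex tasks. The key to the solution is MetaGen, a training-free framework that adapts both the role space and the collaboration topology at inference time: it generates and rewrites query-conditioned role specifications to maintain a controllable dynamic role pool, and instantiates a constrained execution graph around a minimal backbone. During execution, it iteratively updates role prompts and adjusts structural decisions using lightweight feedback signals, which makes task solving both efficient and adaptive.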

Link: https://arxiv.org/abs/2601.19290
Authors: Yimeng Wang,Jiaxing Zhao,Hongbin Xie,Hexing Ma,Yuzhen Lei,Shuangxue Liu,Xuan Song,Zichen Zhang,Haoran Zhang
Affiliations: School of Artificial Intelligence, Jilin University; Department of Computer Science and Engineering, Southern University of Science and Technology; School of Urban Planning and Design, Peking University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models are increasingly deployed as multi-agent systems, where specialized roles communicate and collaborate through structured interactions to solve complex tasks that often exceed the capacity of a single agent. However, most existing systems still rely on a fixed role library and an execution-frozen interaction topology, a rigid design choice that frequently leads to task mismatch, prevents timely adaptation when new evidence emerges during reasoning, and further inflates inference cost. We introduce MetaGen, a training-free framework that adapts both the role space and the collaboration topology at inference time, without updating base model weights. MetaGen generates and rewrites query-conditioned role specifications to maintain a controllable dynamic role pool, then instantiates a constrained execution graph around a minimal backbone. During execution, it iteratively updates role prompts and adjusts structural decisions using lightweight feedback signals. Experiments on code generation and multi-step reasoning benchmarks show that MetaGen improves the accuracy and cost tradeoff over strong multi-agent baselines.

[NLP-34] ReToP: Learning to Rewrite Electronic Health Records for Clinical Prediction WSDM2026

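[Quick Read]: This paper addresses the modeling challenges that the high dimensionality, heterogeneity, and sparsity of Electronic Health Records (EHRs) pose for clinical prediction, and in particular the task-agnostic nature of most existing Large Language Model (LLM) approaches: they use LLMs as EHR encoders or completion modules without fully integrating signals from the prediction task, which limits predictive performance. The key to the solution is Rewrite-To-Predict (ReToP), a framework that trains an EHR rewriter and a clinical predictor end to end. Its core innovations are: 1) generating synthetic pseudo-labels with clinically driven feature selection strategies to create diverse patient EHR rewrites for fine-tuning the rewriter; and 2) a novel Classifier Supervised Contribution (CSC) score that aligns the rewriter with the prediction objective so that it produces clinically relevant rewrites that directly improve prediction. Experiments show that ReToP outperforms strong baselines on three clinical tasks on MIMIC-IV and generalizes to unseen datasets and tasks with minimal fine-tuning.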

Link: https://arxiv.org/abs/2601.19286
Authors: Jesus Lovon-Melgarejo(IRIT),Jose G. Moreno(IRIT-IRIS),Christine Damase-Michel,Lynda Tamine(IRIT-IRIS)
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted by WSDM 2026

Abstract:Electronic Health Records (EHRs) provide crucial information for clinical decision-making. However, their high-dimensionality, heterogeneity, and sparsity make clinical prediction challenging. Large Language Models (LLMs) allowed progress towards addressing this challenge by leveraging parametric medical knowledge to enhance EHR data for clinical prediction tasks. Despite the significant achievements made so far, most of the existing approaches are fundamentally task-agnostic in the sense that they deploy LLMs as EHR encoders or EHR completion modules without fully integrating signals from the prediction tasks. This naturally hinders task performance accuracy. In this work, we propose Rewrite-To-Predict (ReToP), an LLM-based framework that addresses this limitation through an end-to-end training of an EHR rewriter and a clinical predictor. To cope with the lack of EHR rewrite training data, we generate synthetic pseudo-labels using clinical-driven feature selection strategies to create diverse patient rewrites for fine-tuning the EHR rewriter. ReToP aligns the rewriter with prediction objectives using a novel Classifier Supervised Contribution (CSC) score that enables the EHR rewriter to generate clinically relevant rewrites that directly enhance prediction. Our ReToP framework surpasses strong baseline models across three clinical tasks on MIMIC-IV. Moreover, the analysis of ReToP shows its generalizability to unseen datasets and tasks with minimal fine-tuning while preserving faithful rewrites and emphasizing task-relevant predictive features.

[NLP-35] Group Distributionally Robust Optimization-Driven Reinforcement Learning for LLM Reasoning

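[Quick Read]: This paper addresses the inefficient allocation of compute in reinforcement learning (RL) post-training of Large Language Models (LLMs), where methods such as Group Relative Policy Optimization (GRPO) use static uniform prompt sampling and a fixed number of rollouts per prompt. On heterogeneous, heavy-tailed reasoning data, this wastes compute on easy, already-solved patterns while under-training hard examples. The key to the solution is Multi-Adversary Group Distributionally Robust Optimization (GDRO), an optimization-first framework that uses an Online Difficulty Classifier to partition prompts dynamically into difficulty groups and defines two independent GDRO mechanisms: (1) Prompt-GDRO, which uses an EMA-debiased multiplicative-weights bandit sampler to target the intensive difficulty margin and upweight persistently hard groups without frequency bias; and (2) Rollout-GDRO, which uses a shadow-price controller to reallocate rollouts across groups under a fixed mean compute budget, maximizing gradient variance reduction on hard tasks. Both controllers come with no-regret guarantees. On the DAPO 14.1k dataset with Qwen3-Base models, Prompt-GDRO and Rollout-GDRO achieve average relative pass@8 gains of +10.6% and +10.1% over the GRPO baseline.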

Link: https://arxiv.org/abs/2601.19280
Authors: Kishan Panaganti,Zhenwen Liang,Wenhao Yu,Haitao Mi,Dong Yu
Affiliations: Tencent AI Lab in Bellevue
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Keywords: Large Language Models, Reasoning Models, Reinforcement Learning, Distributionally Robust Optimization, GRPO

Abstract:Recent progress in Large Language Model (LLM) reasoning is increasingly driven by the refinement of post-training loss functions and alignment strategies. However, standard Reinforcement Learning (RL) paradigms like Group Relative Policy Optimization (GRPO) remain constrained by static uniformity: uniform prompt sampling and a fixed number of rollouts per prompt. For heterogeneous, heavy-tailed reasoning data, this creates structural inefficiencies that waste compute on already-solved patterns while under-training the long tail of hard problems. To address this, we propose Multi-Adversary Group Distributionally Robust Optimization (GDRO), an optimization-first framework that moves beyond uniform reasoning models by dynamically adapting the training distribution. We introduce an Online Difficulty Classifier that partitions prompts into dynamic pass@k difficulty groups. We then propose two independent GDRO games for post-training: (1) Prompt-GDRO, which employs an EMA-debiased multiplicative-weights bandit sampler to target the intensive difficulty margin and upweight persistently hard groups without frequency bias; and (2) Rollout-GDRO, which uses a shadow-price controller to reallocate rollouts across groups, maximizing gradient variance reduction on hard tasks under a fixed mean budget (compute-neutral). We provide no-regret guarantees for both controllers and additionally a variance-proxy analysis motivating a square-root optimal rollout allocation for Rollout-GDRO. We validate our framework on the DAPO 14.1k dataset using Qwen3-Base models. Prompt-GDRO and Rollout-GDRO achieve average relative gains of +10.6% and +10.1%, respectively, in pass@8 accuracy across 1.7B, 4B, and 8B scales compared to the GRPO baseline. Qualitative analysis shows an emergent curriculum: the adversaries shift resources to the evolving reasoning frontier, enhancing the reasoning model’s performance. 
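
Two ingredients named in the abstract, a multiplicative-weights update over difficulty groups and a square-root rollout allocation under a fixed mean budget, can be sketched in a few lines. The sketch rests on several assumptions that are ours, not the paper's: group loss is any nonnegative number such as one minus the pass rate, the variance proxy is the Bernoulli variance p(1 - p), and the allocation is the textbook result that minimizing the sum of v_g / n_g under a fixed total gives n_g proportional to sqrt(v_g). The EMA debiasing, bandit feedback, and shadow-price controller are not modeled, and the function names are hypothetical.

```python
import math
from typing import List


def multiplicative_weights_step(weights: List[float], losses: List[float], eta: float) -> List[float]:
    """One exponential-weights update: groups with higher loss gain sampling mass."""
    raw = [w * math.exp(eta * loss) for w, loss in zip(weights, losses)]
    total = sum(raw)
    return [r / total for r in raw]


def sqrt_rollout_allocation(pass_rates: List[float], mean_budget: float) -> List[float]:
    """Split a fixed total of rollouts across groups in proportion to sqrt(variance proxy).

    Returns fractional rollout counts; rounding to integers is left out.
    """
    total_budget = mean_budget * len(pass_rates)
    # Bernoulli variance p(1 - p) as an assumed stand-in for per-group gradient variance.
    roots = [math.sqrt(p * (1.0 - p)) for p in pass_rates]
    norm = sum(roots)
    if norm == 0.0:
        # Every group is fully solved or fully failed: fall back to uniform.
        return [mean_budget] * len(pass_rates)
    return [total_budget * r / norm for r in roots]
```

The total number of rollouts is unchanged by the reallocation, which is what makes the scheme compute-neutral. Note that with this particular proxy, groups with pass rates near 0.5 receive the most rollouts; the paper's own proxy may weight groups differently.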

[NLP-36] DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference

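[Quick Read]: This paper addresses the high drafting latency of existing model-based speculative decoding methods, which turns the drafting stage into a performance bottleneck; designs such as EAGLE3 incur significant latency from multi-step autoregressive inference. The key to the solution is DART, which borrows the idea of parallel generation from diffusion-based large language models (dLLMs): it predicts logits for multiple future masked positions in a single forward pass, eliminating autoregressive rollouts in the draft model while keeping the design lightweight. An efficient tree pruning algorithm with N-gram-enforced semantic continuity then builds the draft token tree, reducing draft-stage overhead while maintaining high draft accuracy and improving end-to-end decoding speed.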

Link: https://arxiv.org/abs/2601.19278
Authors: Fuliang Liu,Xue Li,Ketai Zhao,Yinxi Gao,Ziyan Zhou,Zhonghui Zhang,Zhibin Wang,Wanchun Dou,Sheng Zhong,Chen Tian
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Speculative decoding is an effective and lossless approach for accelerating LLM inference. However, existing widely adopted model-based draft designs, such as EAGLE3, improve accuracy at the cost of multi-step autoregressive inference, resulting in high drafting latency and ultimately rendering the drafting stage itself a performance bottleneck. Inspired by diffusion-based large language models (dLLMs), we propose DART, which leverages parallel generation to reduce drafting latency. DART predicts logits for multiple future masked positions in parallel within a single forward pass based on hidden states of the target model, thereby eliminating autoregressive rollouts in the draft model while preserving a lightweight design. Based on these parallel logit predictions, we further introduce an efficient tree pruning algorithm that constructs high-quality draft token trees with N-gram-enforced semantic continuity. DART substantially reduces draft-stage overhead while preserving high draft accuracy, leading to significantly improved end-to-end decoding speed. Experimental results demonstrate that DART achieves a 2.03x–3.44x wall-clock time speedup across multiple datasets, surpassing EAGLE3 by 30% on average and offering a practical speculative decoding framework. Code is released at this https URL.
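
To see why speculative decoding is lossless, it helps to look at the verification step in its simplest form. The sketch below is generic greedy verification of a single linear draft, not DART's tree-based procedure; the function name `verify_greedy` and the integer token IDs are illustrative assumptions.

```python
from typing import List


def verify_greedy(draft: List[int], target: List[int]) -> List[int]:
    """Return the tokens committed after one verification pass.

    target[i] is the target model's greedy token given the context plus draft[:i],
    so target has exactly len(draft) + 1 entries.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must have one more entry than draft")
    accepted: List[int] = []
    for i, tok in enumerate(draft):
        if tok != target[i]:
            # First mismatch: keep the target's own token and stop.
            accepted.append(target[i])
            return accepted
        accepted.append(tok)
    # Every draft token matched, so the extra target token comes for free.
    accepted.append(target[-1])
    return accepted
```

Every committed token is one the target model would have produced on its own under greedy decoding, so the output is unchanged; the speedup comes from committing several tokens per target forward pass, which is why a cheaper drafting stage matters.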

[NLP-37] Riddle Quest: The Enigma of Words WWW

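[Quick Read]: This paper studies how to systematically generate and evaluate analogy-based riddles, and how well Large Language Models (LLMs) cover the answer set and handle ambiguity when solving them. The key to the solution is a simple four-module pipeline: a triples creator that builds structured facts about a concept, a semantic mapper that selects attributes useful for analogy, a stylized generator that turns those attributes into riddle clues, and a validator that collects all possible answers in order to assess how completely a model reasons. Experiments show that LLMs usually identify the main intended answer but often miss other valid interpretations, which highlights the value of riddles as a lightweight tool for probing reasoning coverage and ambiguity handling.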

Link: https://arxiv.org/abs/2601.19273
Authors: Niharika Sri Parasa,Chaitali Diwan,Srinath Srinivasa
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: This paper is submitted under ‘Demo track’ for WWW conference

Abstract:Riddles are concise linguistic puzzles that describe an object or idea through indirect, figurative, or playful clues. They are a longstanding form of creative expression, requiring the solver to interpret hints, recognize patterns, and draw inferences to identify the answers. In this work, we introduce a simple pipeline for creating and evaluating analogy-based riddles. The system includes a triples creator that builds structured facts about a concept, a semantic mapper that selects attributes useful for analogy, a stylized generator that turns them into riddle clues, and a validator that collects all possible answers the riddle could point to. We use this validator to study whether large language models can recover the full answer set for different riddle types. Our case study shows that while models often guess the main intended answer, they frequently miss other valid interpretations. This highlights the value of riddles as a lightweight tool for examining reasoning coverage and ambiguity handling in language models.
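
One simple way to quantify "recovering the full answer set" is the fraction of validator-approved answers that a model names. The metric below is an assumed illustration, not the paper's stated measure, and `answer_set_coverage` is a hypothetical name.

```python
from typing import Iterable


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def answer_set_coverage(model_answers: Iterable[str], valid_answers: Iterable[str]) -> float:
    """Fraction of the valid answer set that appears among the model's answers."""
    valid = {normalize(a) for a in valid_answers}
    if not valid:
        raise ValueError("valid_answers must be non-empty")
    found = {normalize(a) for a in model_answers} & valid
    return len(found) / len(valid)
```

A model that only ever returns the main intended answer scores 1/k on a riddle with k valid answers, which makes the gap between "guessed the answer" and "covered the interpretations" visible.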

[NLP-38] DiaDem: Advancing Dialogue Descriptions in Audiovisual Video Captioning for Multimodal Large Language Models

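[Quick Read]: This paper addresses the weak dialogue description accuracy of current audiovisual video captioning models, which struggle to produce captions that faithfully reflect dialogue content. The key to the solution is DiaDem, a model trained on a synthesized high-quality supervised fine-tuning (SFT) dataset and then refined with a difficulty-partitioned two-stage GRPO strategy to improve dialogue descriptions. The paper also introduces DiaDemBench, a benchmark that systematically evaluates speaker attribution accuracy and utterance transcription fidelity across diverse dialogue scenarios, supporting more precise and reliable dialogue-aware captioning.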

Link: https://arxiv.org/abs/2601.19267
Authors: Xinlong Chen,Weihong Lin,Jingyun Hua,Linli Yao,Yue Ding,Bozhou Li,Bohan Zeng,Yang Shi,Qiang Liu,Yuanxing Zhang,Pengfei Wan,Liang Wang,Tieniu Tan
Affiliations: New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Kling Team, Kuaishou Technology; Peking University; Nanjing University
Subjects: Computation and Language (cs.CL)
Comments: Project webpage: this https URL

Abstract:Accurate dialogue description in audiovisual video captioning is crucial for downstream understanding and generation tasks. However, existing models generally struggle to produce faithful dialogue descriptions within audiovisual captions. To mitigate this limitation, we propose DiaDem, a powerful audiovisual video captioning model capable of generating captions with more precise dialogue descriptions while maintaining strong overall performance. We first synthesize a high-quality dataset for SFT, then employ a difficulty-partitioned two-stage GRPO strategy to further enhance dialogue descriptions. To enable systematic evaluation of dialogue description capabilities, we introduce DiaDemBench, a comprehensive benchmark designed to evaluate models across diverse dialogue scenarios, emphasizing both speaker attribution accuracy and utterance transcription fidelity in audiovisual captions. Extensive experiments on DiaDemBench reveal even commercial models still exhibit substantial room for improvement in dialogue-aware captioning. Notably, DiaDem not only outperforms the Gemini series in dialogue description accuracy but also achieves competitive performance on general audiovisual captioning benchmarks, demonstrating its overall effectiveness.

[NLP-39] RPO-RAG: Aligning Small LLMs with Relation-aware Preference Optimization for Knowledge Graph Question Answering WWW

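[Quick Read]: This paper addresses the weak reasoning of small language models (small LLMs) on knowledge-intensive tasks, where they hallucinate, and the limitations of existing knowledge graph (KG) based Retrieval-Augmented Generation (RAG) methods in path sampling, alignment with KG reasoning objectives, and prompt organization. The key to the solution is the RPO-RAG framework, whose core innovations are: (1) a query-path semantic sampling strategy that provides more informative supervisory signals; (2) a relation-aware preference optimization that aligns training with intermediate KG reasoning signals such as relations; and (3) an answer-centered prompt design that organizes entities and reasoning paths in an interpretable, structured format. Experiments on the WebQSP and CWQ KGQA benchmarks show that RPO-RAG improves small models, even those under 3B parameters, narrows the gap to large models, and sets new state-of-the-art results on CWQ among models under 8B parameters.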

Link: https://arxiv.org/abs/2601.19225
Authors: Kaehyun Um,KyuHwan Yeom,Haerim Yang,Minyoung Choi,Hyeongjun Yang,Kyong-Ho Lee
Affiliations: Yonsei University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at The Web Conference (WWW) 2026

Abstract:Large Language Models (LLMs) have recently demonstrated remarkable reasoning abilities, yet hallucinate on knowledge-intensive tasks. Retrieval-augmented generation (RAG) mitigates this issue by grounding answers in external sources, e.g., knowledge graphs (KGs). However, existing KG-based RAG approaches rely on semantics-unaware path sampling and are weakly aligned with KG reasoning objectives, which limits further accuracy gains. They also feed retrieved paths directly into the reasoner without organizing them into answer-centered reasoning paths, hindering small LLMs’ ability to leverage the retrieved knowledge. Furthermore, prior works predominantly rely on large LLMs (e.g., ChatGPT/GPT-4) or assume backbones above 7B parameters, leaving sub-7B models underexplored. We address this gap with RPO-RAG, the first KG-based RAG framework specifically designed for small LLMs, to the best of our knowledge. RPO-RAG introduces three key innovations: (1) a query-path semantic sampling strategy that provides informative supervisory signals; (2) a relation-aware preference optimization that aligns training with intermediate KG reasoning signals (e.g., relation); and (3) an answer-centered prompt design that organizes entities and reasoning paths in an interpretable format. Extensive experiments on two benchmark Knowledge Graph Question Answering (KGQA) datasets, WebQSP and CWQ, demonstrate that RPO-RAG effectively bridges the performance gap between small and large language models. On WebQSP, it improves F1 by up to 8.8%, reflecting enhanced answer precision, while on CWQ it achieves new state-of-the-art results among models under 8B parameters in both Hit and F1. Overall, RPO-RAG substantially improves the reasoning capability of small LLMs, even under 3B parameters-highlighting their potential for resource-efficient and practical on-device KGQA applications.

[NLP-40] DREAMSTATE: Diffusing States and Parameters for Recurrent Large Language Models

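[Quick Read]: This paper addresses a research gap: the internal state of modern recurrent neural networks (RNNs) such as RWKV has rarely been studied as an editable knowledge representation. Prior work has established their strengths in short-range modeling and fixed-size states but has not examined how interpretable or controllable the state is. The key to the solution is the DREAMSTATE framework, which uses a conditional Diffusion Transformer (DiT) to model the probability manifold of the state directly, enabling state generation and editing. The paper further proposes a hybrid architecture in which a parallel DiT processes a variable-length global context to dynamically generate and adjust the WKV parameters of the core recurrent module, turning the fixed recurrence into a context-aware dynamic function that combines the local strengths of RNNs with global adaptability. Experiments show that the hybrid model can be trained stably with a multi-objective loss, which supports the feasibility of the design.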

Link: https://arxiv.org/abs/2601.19221
Authors: Liu Xiao
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Modern Recurrent Neural Networks (RNNs), such as RWKV, are distinguished by their powerful short-range modeling capabilities and efficient fixed-size states, which constitute a core advantage over standard Transformers. However, there is a significant lack of research into their internal state as an editable knowledge representation. To fill this gap, we first explore the representational properties of the RWKV state by proposing the DREAMSTATE framework. This framework utilizes a conditional Diffusion Transformer (DiT) to directly model the probability manifold of the state, enabling its generation and editing. The structural nature of this representation is validated through t-SNE visualizations and controlled generation experiments. After successfully uncovering and modeling the state’s representational potential, we further propose a novel hybrid architecture that combines the local advantages of RNNs with global context adaptability. This architecture features a parallel DiT that processes a variable-length global context to dynamically generate and adjust the core recurrent module’s WKV parameters, transforming the fixed recurrence mechanism into a context-aware dynamic function. Experiments demonstrate that this hybrid model can be trained stably via a multi-objective loss, validating its design feasibility. Our work not only opens a new research direction for RNN state representation but also provides a concrete architectural reference for future model design. The code is publicly available at: this https URL.

[NLP-41] A Hybrid Supervised-LLM Pipeline for Actionable Suggestion Mining in Unstructured Customer Reviews EACL2026

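[Quick Read]: This paper addresses the extraction of fine-grained, actionable suggestions from customer reviews, where such suggestions are embedded in mixed-intent, unstructured text. Existing methods either classify suggestion-bearing sentences or generate high-level summaries, and rarely isolate the precise improvement instructions a business needs. The key to the solution is a hybrid pipeline: a high-recall RoBERTa classifier, trained with a precision-recall surrogate loss to reduce unrecoverable false negatives, is followed by a controlled, instruction-tuned large language model (LLM) that performs suggestion extraction, categorization, clustering, and summarization. This design achieves higher extraction accuracy and cluster coherence than prompt-only, rule-based, and classifier-only baselines, and human evaluation confirms that the output suggestions are clear, faithful, and interpretable.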

Link: https://arxiv.org/abs/2601.19214
Authors: Aakash Trivedi,Aniket Upadhyay,Pratik Narang,Dhruv Kumar,Praveen Kumar
Affiliations: Birla Institute of Technology and Science, Pilani, India; Birdeye Inc., Palo Alto, California, USA
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to EACL 2026 Industry Track (to appear)

Abstract:Extracting actionable suggestions from customer reviews is essential for operational decision-making, yet these directives are often embedded within mixed-intent, unstructured text. Existing approaches either classify suggestion-bearing sentences or generate high-level summaries, but rarely isolate the precise improvement instructions businesses need. We evaluate a hybrid pipeline combining a high-recall RoBERTa classifier trained with a precision-recall surrogate to reduce unrecoverable false negatives with a controlled, instruction-tuned LLM for suggestion extraction, categorization, clustering, and summarization. Across real-world hospitality and food datasets, the hybrid system outperforms prompt-only, rule-based, and classifier-only baselines in extraction accuracy and cluster coherence. Human evaluations further confirm that the resulting suggestions and summaries are clear, faithful, and interpretable. Overall, our results show that hybrid reasoning architectures achieve meaningful improvements fine-grained actionable suggestion mining while highlighting challenges in domain adaptation and efficient local deployment.
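
The reason for a high-recall first stage is that a sentence dropped by the classifier can never be recovered by the LLM stage. The paper achieves this with a precision-recall surrogate during training; the sketch below substitutes a simpler, post-hoc technique (choosing a decision threshold on held-out scores to meet a recall target) purely to illustrate the gating idea. The names `pick_threshold` and `first_stage` are hypothetical.

```python
import math
from typing import List


def pick_threshold(scores: List[float], labels: List[int], min_recall: float) -> float:
    """A threshold t such that predicting score >= t reaches at least min_recall."""
    if not 0.0 < min_recall <= 1.0:
        raise ValueError("min_recall must be in (0, 1]")
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("need at least one positive example")
    k = math.ceil(min_recall * len(positives))  # positives that must be kept
    return positives[k - 1]


def first_stage(scores: List[float], threshold: float) -> List[int]:
    """Indices of sentences forwarded to the LLM extraction stage."""
    return [i for i, s in enumerate(scores) if s >= threshold]
```

Lowering the threshold trades precision for recall: more non-suggestion sentences reach the LLM, which can still reject them, while fewer true suggestions are lost for good.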

[NLP-42] How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability ICLR2026

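[Quick Read]: This paper addresses how attention-based language models learn and represent semantic associations from natural language data, with the aim of connecting deep learning with linguistic theory and giving large language models (LLMs) an interpretable mechanistic foundation. The key to the solution is a leading-term approximation of the gradients, which yields closed-form expressions for the weights early in training. These show that each set of transformer weights can be written as a simple composition of three basis functions (bigram, token-interchangeability, and context mappings), making explicit how each component captures semantic associations from corpus statistics. Experiments show that the theoretical weight characterizations closely match the weights learned by real LLMs, and qualitative analyses support the theory's value for interpreting how semantic associations form in transformers.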

Link: https://arxiv.org/abs/2601.19208
Authors: Shawn Im,Changdae Oh,Zhen Fang,Sharon Li
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: ICLR 2026

Abstract:Semantic associations such as the link between “bird” and “flew” are foundational for language modeling as they enable models to go beyond memorization and instead generalize and generate coherent text. Understanding how these associations are learned and represented in language models is essential for connecting deep learning with linguistic theory and developing a mechanistic foundation for large language models. In this work, we analyze how these associations emerge from natural language data in attention-based language models through the lens of training dynamics. By leveraging a leading-term approximation of the gradients, we develop closed-form expressions for the weights at early stages of training that explain how semantic associations first take shape. Through our analysis, we reveal that each set of weights of the transformer has closed-form expressions as simple compositions of three basis functions (bigram, token-interchangeability, and context mappings), reflecting the statistics of the text corpus and uncovering how each component of the transformer captures semantic associations based on these compositions. Experiments on real-world LLMs demonstrate that our theoretical weight characterizations closely match the learned weights, and qualitative analyses further show how our theorem shines light on interpreting the learned associations in transformers.

[NLP-43] Do Images Speak Louder than Words? Investigating the Effect of Textual Misinformation in VLMs EACL2026

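[Quick Read]: This paper addresses the limited robustness of Vision-Language Models (VLMs) when textual and visual information conflict, that is, whether a VLM still relies on visual evidence when the text prompt contradicts the image. The key to the solution is the CONTEXT-VQA (Conflicting Text) dataset, which combines image-question pairs with systematically generated persuasive prompts that deliberately conflict with the visual evidence, simulating textual manipulation that may occur in practice. On this basis the authors design an evaluation framework and test 11 state-of-the-art VLMs. The models prove vulnerable to misleading text, with an average performance drop of over 48.2% after a single round of persuasive conversation, which exposes a weakness in multimodal decision-making and underscores the need for better robustness against textual manipulation.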

Link: https://arxiv.org/abs/2601.19202
Authors: Chi Zhang,Wenxuan Ding,Jiale Liu,Mingrui Wu,Qingyun Wu,Ray Mooney
Affiliations: The University of Texas at Austin; Pennsylvania State University; New York University; University of Chinese Academy of Sciences; AG2ai, Inc.
Subjects: Computation and Language (cs.CL)
Comments: 24 pages, 10 figures. Accepted at EACL 2026 (main conference)

Abstract:Vision-Language Models (VLMs) have shown strong multimodal reasoning capabilities on Visual-Question-Answering (VQA) benchmarks. However, their robustness against textual misinformation remains under-explored. While existing research has studied the effect of misinformation in text-only domains, it is not clear how VLMs arbitrate between contradictory information from different modalities. To bridge the gap, we first propose the CONTEXT-VQA (i.e., Conflicting Text) dataset, consisting of image-question pairs together with systematically generated persuasive prompts that deliberately conflict with visual evidence. Then, a thorough evaluation framework is designed and executed to benchmark the susceptibility of various models to these conflicting multimodal inputs. Comprehensive experiments over 11 state-of-the-art VLMs reveal that these models are indeed vulnerable to misleading textual prompts, often overriding clear visual evidence in favor of the conflicting text, and show an average performance drop of over 48.2% after only one round of persuasive conversation. Our findings highlight a critical limitation in current VLMs and underscore the need for improved robustness against textual manipulation.

[NLP-44] Transparency-First Medical Language Models: Datasheets, Model Cards, and End-to-End Data Provenance for Clinical NLP

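[Quick Read]: This paper addresses the lack of transparency and auditability in clinical language models, particularly missing documentation of data sources, model construction, and governance. The key to the solution is TeMLM, a set of transparency-first release artifacts for clinical language models: a unified, machine-checkable release bundle comprising three artifacts, TeMLM-Card (model card), TeMLM-Datasheet (datasheet), and TeMLM-Provenance (provenance record), plus a lightweight conformance checklist, to support repeatable auditing and trustworthy deployment.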

Link: https://arxiv.org/abs/2601.19191
Authors: Olaf Yunus Laitinen Imanov,Taner Yilmaz,Ayse Tuba Tugrul,Melike Nesrin Zaman,Ozkan Gunalp,Duygu Erisken,Sila Burde Dulger,Rana Irem Turhan,Izzet Ozdemir,Derya Umut Kulali,Ozan Akbulut,Harun Demircioglu,Hasan Basri Kara,Berfin Tavan
Affiliations: TeMLM Foundation
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 12 pages, 9 figures, 15 tables. Technetium-I case study and ProtactiniumBERT-100M reference benchmarks

Abstract:We introduce TeMLM, a set of transparency-first release artifacts for clinical language models. TeMLM unifies provenance, data transparency, modeling transparency, and governance into a single, machine-checkable release bundle. We define an artifact suite (TeMLM-Card, TeMLM-Datasheet, TeMLM-Provenance) and a lightweight conformance checklist for repeatable auditing. We instantiate the artifacts on Technetium-I, a large-scale synthetic clinical NLP dataset with 498,000 notes, 7.74M PHI entity annotations across 10 types, and ICD-9-CM diagnosis labels, and report reference results for ProtactiniumBERT (about 100 million parameters) on PHI de-identification (token classification) and top-50 ICD-9 code extraction (multi-label classification). We emphasize that synthetic benchmarks are valuable for tooling and process validation, but models should be validated on real clinical data prior to deployment.

[NLP-45] Leveraging Sentence-oriented Augmentation and Transformer-Based Architecture for Vietnamese-Bahnaric Translation

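[Quick Read]: This paper addresses the scarcity of resources for Vietnamese-to-Bahnaric neural machine translation (NMT), with the aim of supporting the online accessibility and intergenerational transmission of Bahnaric, the language of the Bahnar ethnic minority in Vietnam. The key to the solution is to combine state-of-the-art NMT techniques with two flexible augmentation strategies that need no complex preprocessing, no additional trained systems, and no data beyond the existing parallel training corpora, while remaining compatible with a range of NMT models.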

Link: https://arxiv.org/abs/2601.19124
Authors: Tan Sang Nguyen,Quoc Nguyen Pham,Tho Quan
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The Bahnar people, an ethnic minority in Vietnam with a rich ancestral heritage, possess a language of immense cultural and historical significance. The government places a strong emphasis on preserving and promoting the Bahnaric language by making it accessible online and encouraging communication across generations. Recent advancements in artificial intelligence, such as Neural Machine Translation (NMT), have brought about a transformation in translation by improving accuracy and fluency. This, in turn, contributes to the revival of the language through educational efforts, communication, and documentation. Specifically, NMT is pivotal in enhancing accessibility for Bahnaric speakers, making information and content more readily available. Nevertheless, the translation of Vietnamese into Bahnaric faces practical challenges due to resource constraints, especially given the limited resources available for the Bahnaric language. To address this, we employ state-of-the-art techniques in NMT along with two augmentation strategies for domain-specific Vietnamese-Bahnaric translation task. Importantly, both approaches are flexible and can be used with various neural machine translation models. Additionally, they do not require complex data preprocessing steps, the training of additional systems, or the acquisition of extra data beyond the existing training parallel corpora.

[NLP-46] PsyProbe: Proactive and Interpretable Dialogue through User State Modeling for Exploratory Counseling EACL2026

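[Quick Read]: This paper addresses the predominantly reactive interaction of current LLM-based mental health dialogue systems, which lack systematic modeling of the user's psychological state and therefore limit proactive, in-depth therapeutic exploration. The key to the solution is PsyProbe, which tracks user psychological states in a structured way through the PPPPPI framework (Presenting, Predisposing, Precipitating, Perpetuating, Protective, Impact) combined with cognitive error detection. It integrates four core components: State Builder, Memory Construction, Strategy Planner, and Response Generator (with Question Ideation and Critic/Revision modules), to generate contextually appropriate, proactive questions and shift the counseling process from passive response to active exploration.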

链接: https://arxiv.org/abs/2601.19096
作者: Sohhyung Park,Hyunji Kang,Sungzoon Cho,Dongil Kim
机构: Seoul National University (首尔国立大学)
类目: Computation and Language (cs.CL)
备注: In Findings of the Association for Computational Linguistics: EACL 2026

点击查看摘要

Abstract:Recent advances in large language models have enabled mental health dialogue systems, yet existing approaches remain predominantly reactive, lacking systematic user state modeling for proactive therapeutic exploration. We introduce PsyProbe, a dialogue system designed for the exploration phase of counseling that systematically tracks user psychological states through the PPPPPI framework (Presenting, Predisposing, Precipitating, Perpetuating, Protective, Impact) augmented with cognitive error detection. PsyProbe combines State Builder for extracting structured psychological profiles, Memory Construction for tracking information gaps, Strategy Planner for Motivational Interviewing behavioral codes, and Response Generator with Question Ideation and Critic/Revision modules to generate contextually appropriate, proactive questions. We evaluate PsyProbe with 27 participants in real-world Korean counseling scenarios, including automatic evaluation across ablation modes, user evaluation, and expert evaluation by a certified counselor. The full PsyProbe model consistently outperforms baseline and ablation modes in automatic evaluation. User evaluation demonstrates significantly increased engagement intention and improved naturalness compared to baseline. Expert evaluation shows that PsyProbe substantially improves core issue understanding and achieves question rates comparable to professional counselors, validating the effectiveness of systematic state modeling and proactive questioning for therapeutic exploration.

[NLP-47] More at Stake: How Payoff and Language Shape LLM Agent Strategies in Cooperation Dilemmas

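[Quick Read]: This paper aims to understand the strategic behavior of large language models (LLMs) acting as autonomous agents in repeated social dilemmas such as the Prisoner's Dilemma, focusing on how payoff magnitude and linguistic context affect their decisions to cooperate or defect. The key to its approach is a quantifiable, comparable framework for behavioral analysis: a payoff-scaled Prisoner's Dilemma isolates the effect of incentive strength, and supervised classifiers label LLM decisions to identify whether they follow canonical conditional strategies from repeated games (such as tit-for-tat), revealing systematic behavioral intentions shaped jointly by model architecture and language. The approach enables auditing of LLM strategic behavior and finds that linguistic framing can sometimes match or exceed the effect of model architecture, providing an empirical basis for AI governance and multi-agent system design.

Link: https://arxiv.org/abs/2601.19082
Authors: Trung-Kiet Huynh, Dao-Sy Duy-Minh, Thanh-Bang Cao, Phong-Hao Le, Hong-Dan Nguyen, Nguyen Lam Phu Quy, Minh-Luan Nguyen-Vo, Hong-Phat Pham, Pham Phu Hoa, Thien-Kim Than, Chi-Nguyen Tran, Huy Tran, Gia-Thoai Tran-Le, Alessio Buscemi, Le Hong Trang, TheAnh Han
Affiliations: University of Science (HCMUS), Vietnam; Ho Chi Minh City University of Technology (HCMUT), Vietnam; Vietnam National University - Ho Chi Minh City (VNU-HCM), Vietnam; Luxembourg Institute of Science and Technology, Luxembourg; Teesside University, UK
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 14 pages, 10 figures, 4 tables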

Abstract:As LLMs increasingly act as autonomous agents in interactive and multi-agent settings, understanding their strategic behavior is critical for safety, coordination, and AI-driven social and economic systems. We investigate how payoff magnitude and linguistic context shape LLM strategies in repeated social dilemmas, using a payoff-scaled Prisoner’s Dilemma to isolate sensitivity to incentive strength. Across models and languages, we observe consistent behavioral patterns, including incentive-sensitive conditional strategies and cross-linguistic divergence. To interpret these dynamics, we train supervised classifiers on canonical repeated-game strategies and apply them to LLM decisions, revealing systematic, model- and language-dependent behavioral intentions, with linguistic framing sometimes matching or exceeding architectural effects. Our results provide a unified framework for auditing LLMs as strategic agents and highlight cooperation biases with direct implications for AI governance and multi-agent system design.
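
To make the "payoff-scaled Prisoner's Dilemma" idea concrete, here is a minimal sketch of a repeated game in which all payoffs are multiplied by a common factor. The base payoff values, the `scale` parameter, and the two hand-written strategies are illustrative assumptions, not the paper's actual setup; in the paper the players are LLMs rather than fixed rules.

```python
from typing import Callable, Dict, List, Tuple

# Base payoffs (player A, player B) with the usual ordering T > R > P > S.
# These numbers are illustrative and are not taken from the paper.
BASE: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}

Strategy = Callable[[List[str], List[str]], str]

def scaled_payoffs(scale: float) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """Multiply every payoff by `scale`; the dilemma's ordering is preserved."""
    return {moves: (a * scale, b * scale) for moves, (a, b) in BASE.items()}

def tit_for_tat(own: List[str], other: List[str]) -> str:
    """Cooperate first, then copy the opponent's previous move."""
    return "C" if not other else other[-1]

def always_defect(own: List[str], other: List[str]) -> str:
    return "D"

def play(a: Strategy, b: Strategy, rounds: int, scale: float) -> Tuple[float, float]:
    """Play a repeated game and return the cumulative payoffs of both players."""
    payoffs = scaled_payoffs(scale)
    hist_a: List[str] = []
    hist_b: List[str] = []
    total_a = total_b = 0.0
    for _ in range(rounds):
        move_a, move_b = a(hist_a, hist_b), b(hist_b, hist_a)
        pay_a, pay_b = payoffs[(move_a, move_b)]
        total_a += pay_a
        total_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return total_a, total_b
```

Fixed strategies behave identically at every scale, so any change in an LLM agent's move sequence as `scale` grows would reflect sensitivity to incentive strength rather than a change in the game's structure.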

[NLP-48] Optimizing Conversational Quality in Spoken Dialogue Systems with Reinforcement Learning from AI Feedback

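[Quick Read]: This paper addresses the limitations of reinforcement learning from human or AI feedback (RLHF/RLAIF) when applied to speech-in/speech-out dialogue systems (SDS). Existing methods mostly rely on a single semantic reward optimized at the utterance level, ignoring the multi-dimensional and multi-modal nature of conversational quality (semantic coherence, audio naturalness, speaker consistency, emotion alignment, and turn-taking behavior), and they fit poorly with duplex spoken dialogue systems that generate responses incrementally, block by block. The key to the solution is the first multi-reward RLAIF framework for SDS, which combines semantic, audio-quality, and emotion-consistency rewards and uses turn-level preference sampling with per-block log-probability aggregation to align incremental decoding under a single Direct Preference Optimization (DPO) objective. This bridges the mismatch between utterance-level preferences and blockwise decoding and yields consistent improvements across multiple dimensions of conversational quality.

Link: https://arxiv.org/abs/2601.19063
Authors: Siddhant Arora, Jinchuan Tian, Jiatong Shi, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe
Affiliations: Carnegie Mellon University; Sony Group Corporation
Categories: Computation and Language (cs.CL); Sound (cs.SD)
Comments: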

Abstract:Reinforcement learning from human or AI feedback (RLHF/RLAIF) for speech-in/speech-out dialogue systems (SDS) remains underexplored, with prior work largely limited to single semantic rewards applied at the utterance level. Such setups overlook the multi-dimensional and multi-modal nature of conversational quality, which encompasses semantic coherence, audio naturalness, speaker consistency, emotion alignment, and turn-taking behavior. Moreover, they are fundamentally mismatched with duplex spoken dialogue systems that generate responses incrementally, where agents must make decisions based on partial utterances. We address these limitations with the first multi-reward RLAIF framework for SDS, combining semantic, audio-quality, and emotion-consistency rewards. To align utterance-level preferences with incremental, blockwise decoding in duplex models, we apply turn-level preference sampling and aggregate per-block log-probabilities within a single DPO objective. We present the first systematic study of preference learning for improving SDS quality in both multi-turn Chain-of-Thought and blockwise duplex models, and release a multi-reward DPO dataset to support reproducible research. Experiments show that single-reward RLAIF selectively improves its targeted metric, while joint multi-reward training yields consistent gains across semantic quality and audio naturalness. These results highlight the importance of holistic, multi-reward alignment for practical conversational SDS.
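
As a rough illustration of how per-block log-probabilities could feed a single DPO objective, the sketch below sums block scores into one turn-level log-probability and applies the standard DPO loss. Summation as the aggregation rule, the `beta` default, and all function names are assumptions made for illustration; the abstract does not specify these details.

```python
import math
from typing import Sequence

def sequence_logprob(block_logprobs: Sequence[float]) -> float:
    """Aggregate per-block log-probabilities into one score (assumed: a plain sum)."""
    return sum(block_logprobs)

def dpo_loss(
    policy_chosen: Sequence[float],
    policy_rejected: Sequence[float],
    ref_chosen: Sequence[float],
    ref_rejected: Sequence[float],
    beta: float = 0.1,
) -> float:
    """Standard DPO loss, -log(sigmoid(beta * margin)), for one preference pair."""
    margin = (
        (sequence_logprob(policy_chosen) - sequence_logprob(ref_chosen))
        - (sequence_logprob(policy_rejected) - sequence_logprob(ref_rejected))
    )
    z = beta * margin
    # Numerically stable form of log(1 + exp(-z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls below that value as the policy assigns relatively more probability to the preferred response.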

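[NLP-49] Who's in Charge? Disempowerment Patterns in Real-World LLM Usage

[Quick Read]: This paper asks how the widespread real-world use of generative AI assistants affects human autonomy and empowerment, in particular whether there is "situational disempowerment potential" that leads users toward distorted perceptions, inauthentic value judgments, or actions misaligned with their own values. The key to its approach is a large-scale, privacy-preserving analysis of 1.5 million real conversations between users and an AI assistant, combining quantitative and qualitative methods to identify high-risk interaction patterns (such as validating persecution narratives or fully scripting value-laden personal communications). It also reveals a rising trend in these patterns over time and a counterintuitive positive association with user approval ratings, providing empirical evidence for designing AI systems that better support long-term human autonomy and flourishing.

Link: https://arxiv.org/abs/2601.19062
Authors: Mrinank Sharma, Miles McCain, Raymond Douglas, David Duvenaud
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: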

Abstract:Although AI assistants are now deeply embedded in society, there has been limited empirical study of how their usage affects human empowerment. We present the first large-scale empirical analysis of disempowerment patterns in real-world AI assistant interactions, analyzing 1.5 million consumer this http URL conversations using a privacy-preserving approach. We focus on situational disempowerment potential, which occurs when AI assistant interactions risk leading users to form distorted perceptions of reality, make inauthentic value judgments, or act in ways misaligned with their values. Quantitatively, we find that severe forms of disempowerment potential occur in fewer than one in a thousand conversations, though rates are substantially higher in personal domains like relationships and lifestyle. Qualitatively, we uncover several concerning patterns, such as validation of persecution narratives and grandiose identities with emphatic sycophantic language, definitive moral judgments about third parties, and complete scripting of value-laden personal communications that users appear to implement verbatim. Analysis of historical trends reveals an increase in the prevalence of disempowerment potential over time. We also find that interactions with greater disempowerment potential receive higher user approval ratings, possibly suggesting a tension between short-term user preferences and long-term human empowerment. Our findings highlight the need for AI systems designed to robustly support human autonomy and flourishing.

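[NLP-50] Principled Fine-tuning of LLMs from User-Edits: A Medley of Preference Supervision and Reward NEURIPS2025

[Quick Read]: This paper addresses how to fine-tune large language models (LLMs) using user-edit deployment data, which arises naturally in applications such as LLM-based writing assistants and coding agents. Feedback types such as preferences, supervised labels, and cost are usually studied separately in the literature; this paper is the first to analyze learning from them in a unified theoretical setting, showing that different learning algorithms have different trade-offs depending on the user, the data distribution, and the model class. The key to its solution is a simple ensembling procedure that learns jointly from these feedback types. On two domains adapted from Gao et al. 2024, it outperforms baselines that rely on a single feedback type and adapts robustly to different user-edit distributions at test time.

Link: https://arxiv.org/abs/2601.19055
Authors: Dipendra Misra, Aldo Pacchiano, Ta-Chung Chi, Ge Gao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: Accepted at NeurIPS 2025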

Abstract:We study how to fine-tune LLMs using user-edit deployment data consisting of a set of context, an agent’s response, and user edits. This deployment data is naturally generated by users in applications such as LLMs-based writing assistants and coding agents. The natural origin of user edits makes it a desired source for adapting and personalizing LLMs. In this setup, there emerges a unification of various feedback types namely preferences, supervised labels, and cost that are typically studied separately in the literature. In this paper, we initiate the theoretical investigation of learning from user edits. We first derive bounds for learning algorithms that learn from each of these feedback types. We prove that these algorithms have different trade-offs depending upon the user, data distribution, and model class. We then propose a simple ensembling procedure to jointly learn from these feedback types. On two domains adapted from Gao et al. 2024, we show our ensembling procedure outperforms these methods that learn from individual feedback. Further, we show that our proposed procedure can robustly adapt to different user-edit distributions at test time.

[NLP-51] Is Finer Better? The Limits of Microscaling Formats in Large Language Models ICLR2026

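[Quick Read]: This paper addresses an anomaly in microscaling quantization for model compression: when the block size falls below a certain threshold, the output quality of the quantized model degrades, contrary to the intuition that smaller blocks should represent tensor elements more precisely. Through experiments and theoretical analysis, the study shows that the anomaly is driven mainly by the interplay between narrow tensor distributions and the limited dynamic range of the quantized scales. The key to the solution is FP8 unsigned E5M3 (UE5M3) as a new hardware-friendly format for the scales, which achieves performance comparable to conventional FP8 unsigned E4M3 scales while removing the need for global scaling operations on weights and activations, improving the efficiency and stability of microscaling quantization in training and inference.

Link: https://arxiv.org/abs/2601.19026
Authors: Andrea Fasoli, Monodeep Kar, Chi-Chun Liu, Swagath Venkataramani, Viji Srinivasan, Leland Chang, Naigang Wang
Affiliations: IBM Research
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Computation and Language (cs.CL)
Comments: 31 pages, 17 figures, 3 tables; accepted to ICLR 2026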

Abstract:Microscaling data formats leverage per-block tensor quantization to enable aggressive model compression with limited loss in accuracy. Unlocking their potential for efficient training and inference necessitates hardware-friendly implementations that handle matrix multiplications in a native format and adopt efficient error-mitigation strategies. Herein, we report the emergence of a surprising behavior associated with microscaling quantization, whereas the output of a quantized model degrades as block size is decreased below a given threshold. This behavior clashes with the expectation that a smaller block size should allow for a better representation of the tensor elements. We investigate this phenomenon both experimentally and theoretically, decoupling the sources of quantization error behind it. Experimentally, we analyze the distributions of several Large Language Models and identify the conditions driving the anomalous behavior. Theoretically, we lay down a framework showing remarkable agreement with experimental data from pretrained model distributions and ideal ones. Overall, we show that the anomaly is driven by the interplay between narrow tensor distributions and the limited dynamic range of the quantized scales. Based on these insights, we propose the use of FP8 unsigned E5M3 (UE5M3) as a novel hardware-friendly format for the scales in FP4 microscaling data types. We demonstrate that UE5M3 achieves comparable performance to the conventional FP8 unsigned E4M3 scales while obviating the need of global scaling operations on weights and activations.
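
For readers unfamiliar with per-block quantization, the following sketch shows the basic mechanism under simplifying assumptions: each block shares one power-of-two scale and elements are rounded to a small signed integer grid. It is not FP4, and the scale here has an unbounded exponent range, so it illustrates only the intuitive expectation (smaller blocks give lower error) and does not reproduce the anomaly the paper attributes to scales with limited dynamic range. All names are hypothetical.

```python
import math
from typing import List

def quantize_block(block: List[float], levels: int = 7) -> List[float]:
    """Quantize one block to the grid {-levels..levels} * scale and dequantize."""
    absmax = max(abs(x) for x in block)
    if absmax == 0.0:
        return [0.0] * len(block)
    # Smallest power-of-two scale that keeps every element within the grid.
    scale = 2.0 ** math.ceil(math.log2(absmax / levels))
    return [round(x / scale) * scale for x in block]

def quantize(tensor: List[float], block_size: int) -> List[float]:
    """Apply block quantization independently to consecutive blocks."""
    out: List[float] = []
    for start in range(0, len(tensor), block_size):
        out.extend(quantize_block(tensor[start:start + block_size]))
    return out

def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two equally sized lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
```

With a tensor that mixes very small and large values, a block size that separates the two groups lets the small values keep their own scale instead of being rounded to zero.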

[NLP-52] FROST: Filtering Reasoning Outliers with Attention for Efficient Reasoning

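[Quick Read]: This paper addresses the inefficiency and unreliability caused by redundant reasoning paths when generative AI models tackle complex reasoning tasks. The key to its solution is FROST, an attention-aware method for efficient reasoning that uses attention weights to identify and prune uncritical reasoning paths ("reasoning outliers"), shortening reasoning trajectories and improving accuracy. While preserving the model's reasoning capacity, its sentence-level outlier removal makes the reasoning process more compact and stable, achieving an average 69.68% reduction in token usage and a 26.70% improvement in accuracy across several benchmarks.

Link: https://arxiv.org/abs/2601.19001
Authors: Haozheng Luo, Zhuolin Jiang, Md Zahid Hasan, Yan Chen, Soumalya Sarkar
Affiliations: Northwestern University; RTX Technology Research Center; Iowa State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: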

Abstract:We propose FROST, an attention-aware method for efficient reasoning. Unlike traditional approaches, FROST leverages attention weights to prune uncritical reasoning paths, yielding shorter and more reliable reasoning trajectories. Methodologically, we introduce the concept of reasoning outliers and design an attention-based mechanism to remove them. Theoretically, FROST preserves and enhances the model’s reasoning capacity while eliminating outliers at the sentence level. Empirically, we validate FROST on four benchmarks using two strong reasoning models (Phi-4-Reasoning and GPT-OSS-20B), outperforming state-of-the-art methods such as TALE and ThinkLess. Notably, FROST achieves an average 69.68% reduction in token usage and a 26.70% improvement in accuracy over the base model. Furthermore, in evaluations of attention outlier metrics, FROST reduces the maximum infinity norm by 15.97% and the average kurtosis by 91.09% compared to the base model. Code is available at this https URL

[NLP-53] Malicious Repurposing of Open Science Artefacts by Using Large Language Models

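[Quick Read]: This paper addresses the risk that generative AI could be maliciously exploited in scientific discovery, specifically that large language models (LLMs) could generate harmful research proposals by reinterpreting and repurposing open science artefacts (datasets, methods, and tools). The key to its approach is an end-to-end offensive evaluation pipeline. It first bypasses LLM safeguards through persuasion-based jailbreaking, then reinterprets NLP papers to identify and repurpose their open science artefacts by exploiting their vulnerabilities, and finally assesses the safety of the generated proposals with an evaluation framework spanning harmfulness, feasibility of misuse, and soundness of technicality. The study finds that LLMs acting as evaluators disagree strongly with one another, indicating that they cannot yet serve as reliable judges in malicious-use settings and that human expert involvement is essential for credible dual-use risk assessment.

Link: https://arxiv.org/abs/2601.18998
Authors: Zahra Hashemi, Zhiqiang Zhong, Jun Pang, Wei Zhao
Affiliations: University of Luxembourg; University of Aberdeen
Categories: Computation and Language (cs.CL)
Comments: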

Abstract:The rapid evolution of large language models (LLMs) has fuelled enthusiasm about their role in advancing scientific discovery, with studies exploring LLMs that autonomously generate and evaluate novel research ideas. However, little attention has been given to the possibility that such models could be exploited to produce harmful research by repurposing open science artefacts for malicious ends. We fill the gap by introducing an end-to-end pipeline that first bypasses LLM safeguards through persuasion-based jailbreaking, then reinterprets NLP papers to identify and repurpose their artefacts (datasets, methods, and tools) by exploiting their vulnerabilities, and finally assesses the safety of these proposals using our evaluation framework across three dimensions: harmfulness, feasibility of misuse, and soundness of technicality. Overall, our findings demonstrate that LLMs can generate harmful proposals by repurposing ethically designed open artefacts; however, we find that LLMs acting as evaluators strongly disagree with one another on evaluation outcomes: GPT-4.1 assigns higher scores (indicating greater potential harms, higher soundness and feasibility of misuse), Gemini-2.5-pro is markedly stricter, and Grok-3 falls between these extremes. This indicates that LLMs cannot yet serve as reliable judges in a malicious evaluation setup, making human evaluation essential for credible dual-use risk assessment.

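[NLP-54] LLMs versus the Halting Problem: Revisiting Program Termination Prediction

[Quick Read]: This paper asks whether large language models (LLMs) can reliably predict program termination, a problem that is undecidable in general (the Halting Problem). Its approach is an empirical evaluation of several LLMs on the C programs in the Termination category of the International Competition on Software Verification (SV-Comp) 2025. It finds that GPT-5 and Claude Sonnet-4.5, using test-time scaling, would rank just behind the top-ranked dedicated verification tool, suggesting strong reasoning ability about program behavior. The study also notes that although LLMs predict termination effectively, they often fail to provide a valid witness as proof, and their performance drops as program length increases, pointing to further work on the reasoning mechanisms and interpretability of LLMs on undecidable problems.

Link: https://arxiv.org/abs/2601.18987
Authors: Oren Sultan, Jordi Armengol-Estape, Pascal Kesseli, Julien Vanegue, Dafna Shahaf, Yossi Adi, Peter O’Hearn
Affiliations: Meta AI; The Hebrew University of Jerusalem; Bloomberg Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments: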

Abstract:Determining whether a program terminates is a central problem in computer science. Turing’s foundational result established the Halting Problem as undecidable, showing that no algorithm can universally determine termination for all programs and inputs. Consequently, automatic verification tools approximate termination, sometimes failing to prove or disprove; these tools rely on problem-specific architectures and abstractions, and are usually tied to particular programming languages. Recent success and progress in large language models (LLMs) raises the following question: can LLMs reliably predict program termination? In this work, we evaluate LLMs on a diverse set of C programs from the Termination category of the International Competition on Software Verification (SV-Comp) 2025. Our results suggest that LLMs perform remarkably well at predicting program termination, where GPT-5 and Claude Sonnet-4.5 would rank just behind the top-ranked tool (using test-time-scaling), and Code World Model (CWM) would place just behind the second-ranked tool. While LLMs are effective at predicting program termination, they often fail to provide a valid witness as a proof. Moreover, LLMs performance drops as program length increases. We hope these insights motivate further research into program termination and the broader potential of LLMs for reasoning about undecidable problems.

[NLP-55] Save the Good Prefix: Precise Error Penalization via Process-Supervised RL to Enhance LLM Reasoning

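[Quick Read]: This paper addresses the credit assignment problem in reinforcement learning (RL) for improving the reasoning of large language models (LLMs): existing methods rely on sparse outcome rewards and so cannot credit the correct intermediate steps in reasoning paths that are partially correct but ultimately fail. The authors propose Verifiable Prefix Policy Optimization (VPPO), whose core idea is to use a process reward model (PRM) during RL only to locate the first incorrect step, partitioning the trajectory into a verified correct prefix and an erroneous suffix. The prefix is rewarded and targeted penalties are applied only after the first detected error, giving stable, interpretable learning signals and improving credit assignment.

Link: https://arxiv.org/abs/2601.18984
Authors: Haolin Liu, Dian Yu, Sidi Lu, Yujun Zhou, Rui Liu, Zhenwen Liang, Haitao Mi, Chen-Yu Wei, Dong Yu
Affiliations: Tencent AI Lab; University of Virginia; University of Notre Dame; University of Maryland
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: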

Abstract:Reinforcement learning (RL) has emerged as a powerful framework for improving the reasoning capabilities of large language models (LLMs). However, most existing RL approaches rely on sparse outcome rewards, which fail to credit correct intermediate steps in partially successful solutions. Process reward models (PRMs) offer fine-grained step-level supervision, but their scores are often noisy and difficult to evaluate. As a result, recent PRM benchmarks focus on a more objective capability: detecting the first incorrect step in a reasoning path. However, this evaluation target is misaligned with how PRMs are typically used in RL, where their step-wise scores are treated as raw rewards to maximize. To bridge this gap, we propose Verifiable Prefix Policy Optimization (VPPO), which uses PRMs only to localize the first error during RL. Given an incorrect rollout, VPPO partitions the trajectory into a verified correct prefix and an erroneous suffix based on the first error, rewarding the former while applying targeted penalties only after the detected mistake. This design yields stable, interpretable learning signals and improves credit assignment. Across multiple reasoning benchmarks, VPPO consistently outperforms sparse-reward RL and prior PRM-guided baselines on both Pass@1 and Pass@K.
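
The prefix/suffix split described in this abstract can be sketched as a small reward-assignment routine. The concrete reward values, the choice to penalize the first erroneous step itself, and the handling of error-free rollouts are assumptions made for illustration; the abstract does not give VPPO's actual reward shaping.

```python
from typing import List, Optional

def first_error_index(step_is_correct: List[bool]) -> Optional[int]:
    """Index of the first step judged incorrect (e.g. by a PRM), or None."""
    for i, ok in enumerate(step_is_correct):
        if not ok:
            return i
    return None

def prefix_rewards(
    step_is_correct: List[bool],
    prefix_reward: float = 1.0,    # hypothetical value
    suffix_penalty: float = -1.0,  # hypothetical value
) -> List[float]:
    """Reward steps before the first error; penalize every step from it onward."""
    n = len(step_is_correct)
    k = first_error_index(step_is_correct)
    if k is None:
        # Assumed behavior: with no detected error, all steps get the prefix reward.
        return [prefix_reward] * n
    return [prefix_reward] * k + [suffix_penalty] * (n - k)
```

Only the position of the first error is used: later step verdicts are ignored, which matches the idea of relying on the PRM for localization rather than treating its raw scores as rewards.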

[NLP-56] Intent2QoS: Language Model-Driven Automation of Traffic Shaping Configurations

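[Quick Read]: This paper addresses the reliance of traffic shaping and Quality of Service (QoS) enforcement on low-level configuration that requires manual, error-prone setup. It proposes a framework that automatically converts high-level business intents (in natural or declarative language) into deployable Linux traffic control rules. The key is an end-to-end pipeline: first, a queuing-theoretic simulation with priority scheduling and Active Queue Management (AQM) builds a semantic model; second, a language model uses this model together with a traffic profile to generate sub-intents and configuration rules; finally, a rule-based critic checks the rules for correctness and policy compliance, giving a reliable mapping from abstract intent to executable configuration.

Link: https://arxiv.org/abs/2601.18974
Authors: Sudipta Acharya, Burak Kantarci
Affiliations: University of Ottawa
Categories: Networking and Internet Architecture (cs.NI); Computation and Language (cs.CL)
Comments: 6 pages, 4 figures, accepted to IEEE International Conference on Communications (ICC) 2026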

Abstract:Traffic shaping and Quality of Service (QoS) enforcement are critical for managing bandwidth, latency, and fairness in networks. These tasks often rely on low-level traffic control settings, which require manual setup and technical expertise. This paper presents an automated framework that converts high-level traffic shaping intents in natural or declarative language into valid and correct traffic control rules. To the best of our knowledge, we present the first end-to-end pipeline that ties intent translation in a queuing-theoretic semantic model and, with a rule-based critic, yields deployable Linux traffic control configuration sets. The framework has three steps: (1) a queuing simulation with priority scheduling and Active Queue Management (AQM) builds a semantic model; (2) a language model, using this semantic model and a traffic profile, generates sub-intents and configuration rules; and (3) a rule-based critic checks and adjusts the rules for correctness and policy compliance. We evaluate multiple language models by generating traffic control commands from business intents that comply with relevant standards for traffic control protocols. Experimental results on 100 intents show significant gains, with LLaMA3 reaching 0.88 semantic similarity and 0.87 semantic coverage, outperforming other models by over 30. A thorough sensitivity study demonstrates that AQM-guided prompting reduces variability threefold compared to zero-shot baselines.
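
The last two pipeline steps (rule generation followed by a rule-based critic) might look roughly like the sketch below, which validates a hand-written plan and renders it as Linux `tc` HTB commands. The `TrafficClass` structure, the two checks in `critic`, and the example device name are hypothetical; the paper's critic, its AQM handling, and its LLM-generated rules are not reproduced here.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TrafficClass:
    name: str        # hypothetical label for a sub-intent, e.g. "voice"
    minor_id: int    # class id suffix, e.g. 10 gives classid 1:10
    rate_mbit: int   # guaranteed rate
    ceil_mbit: int   # upper bound when borrowing spare bandwidth
    prio: int        # lower value is served first

def critic(classes: List[TrafficClass], link_mbit: int) -> List[str]:
    """Return rule violations; an empty list means the plan passes these checks."""
    problems: List[str] = []
    if sum(c.rate_mbit for c in classes) > link_mbit:
        problems.append("guaranteed rates exceed link capacity")
    for c in classes:
        if not c.rate_mbit <= c.ceil_mbit <= link_mbit:
            problems.append(f"{c.name}: need rate <= ceil <= link capacity")
    return problems

def render(classes: List[TrafficClass], dev: str, link_mbit: int,
           default_minor: int) -> List[str]:
    """Render an HTB hierarchy: root qdisc, one parent class, one leaf per class."""
    cmds = [
        f"tc qdisc add dev {dev} root handle 1: htb default {default_minor}",
        f"tc class add dev {dev} parent 1: classid 1:1 htb rate {link_mbit}mbit",
    ]
    for c in classes:
        cmds.append(
            f"tc class add dev {dev} parent 1:1 classid 1:{c.minor_id} "
            f"htb rate {c.rate_mbit}mbit ceil {c.ceil_mbit}mbit prio {c.prio}"
        )
    return cmds
```

Running `critic` before `render` keeps invalid plans from ever being turned into commands, which is the role the abstract assigns to the rule-based critic.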

[NLP-57] BabyReasoning Bench: Generating Developmentally-Inspired Reasoning Tasks for Evaluating Baby Language Models

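[Quick Read]: This paper addresses the mismatch between current evaluations of language model reasoning and the way baby language models are actually trained. Traditional benchmarks rest on adult-centric assumptions (broad world knowledge, complex instruction following, and mature pragmatic competence) and overlook the specific reasoning abilities that baby language models might have when trained on developmentally plausible input such as child-directed speech. The key to the solution is BabyReasoningBench, a new benchmark grounded in classic paradigms from developmental psychology, with 19 tasks covering theory of mind, analogical and relational reasoning, causal inference and intervention selection, and core reasoning primitives. Empirically, two GPT-2-based models pretrained on child-directed text show low overall performance with clear differences across tasks: scaling improves causal and physical reasoning, while belief attribution and pragmatics-sensitive tasks remain challenging. The benchmark thus offers a quantifiable developmental lens on how reasoning abilities emerge in child-like language learning.

Link: https://arxiv.org/abs/2601.18933
Authors: Kaustubh D. Dhole
Affiliations: Emory University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: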

Abstract:Traditional evaluations of reasoning capabilities of language models are dominated by adult-centric benchmarks that presuppose broad world knowledge, complex instruction following, and mature pragmatic competence. These assumptions are mismatched to baby language models trained on developmentally plausible input such as child-directed speech and early-childhood narratives, and they obscure which reasoning abilities (if any) emerge under such constraints. We introduce BabyReasoningBench, a GPT-5.2 generated benchmark of 19 reasoning tasks grounded in classic paradigms from developmental psychology, spanning theory of mind, analogical and relational reasoning, causal inference and intervention selection, and core reasoning primitives that are known to be confounded by memory and pragmatics. We find that two GPT-2 based baby language models (pretrained on 10M and 100M of child-directed speech text) show overall low but uneven performance, with dissociations across task families: scaling improves several causal and physical reasoning tasks, while belief attribution and pragmatics-sensitive tasks remain challenging. BabyReasoningBench provides a developmentally grounded lens for analyzing what kinds of reasoning are supported by child-like training distributions, and for testing mechanistic hypotheses about how such abilities emerge.

[NLP-58] SICL-AT: Another way to adapt Auditory LLM to low-resource task

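[Quick Read]: This paper addresses the performance drop of auditory large language models on low-resource or unfamiliar tasks, especially when labeled in-domain data is scarce or the test distribution does not match the training data, where direct fine-tuning tends to be brittle. Its core solution is Speech In-Context Learning Adaptation Training (SICL-AT), a post-training recipe that uses only high-resource speech data to strengthen the model's ability to adapt at inference time from a few in-context demonstrations. Experiments indicate that the method consistently outperforms direct fine-tuning in low-resource scenarios and generalizes to a range of audio understanding and reasoning tasks.

Link: https://arxiv.org/abs/2601.18904
Authors: Haolong Zheng, Siyin Wang, Zengrui Jin, Mark Hasegawa-Johnson
Affiliations: University of Illinois Urbana-Champaign; Tsinghua University
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: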

Abstract:Auditory Large Language Models (LLMs) have demonstrated strong performance across a wide range of speech and audio understanding tasks. Nevertheless, they often struggle when applied to low-resource or unfamiliar tasks. In case of labeled in-domain data is scarce or mismatched to the true test distribution, direct fine-tuning can be brittle. In-Context Learning (ICL) provides a training-free, inference-time solution by adapting auditory LLMs through conditioning on a few in-domain demonstrations. In this work, we first show that Vanilla ICL improves zero-shot performance across diverse speech and audio tasks for selected models which suggest this ICL adaptation capability can be generalized to multimodal setting. Building on this, we propose Speech In-Context Learning Adaptation Training (SICL-AT), a post-training recipe utilizes only high resource speech data intending to strengthen model’s in-context learning capability. The enhancement can generalize to audio understanding/reasoning task. Experiments indicate our proposed method consistently outperforms direct fine-tuning in low-resource scenario.

[NLP-59] Flatter Tokens are More Valuable for Speculative Draft Model Training

[Quick Read]: This paper addresses the heavy dependence on large-scale training data of speculative decoding (SD), a technique for accelerating large language model (LLM) inference. SD typically requires training a draft model on a large dataset, but not all training samples are equally valuable for improving the SD acceptance rate. The key contribution is "flatness", a new metric that quantifies how flat the target model's predictive distribution is, and, built on it, the Sample-level-flatness-based Dataset Distillation (SFDD) approach, which filters the training data to retain the most valuable samples. Experiments show that SFDD achieves over 2× training speedup using only 50% of the data while keeping the final model's inference speedup within 4% of the full-dataset baseline.

Link: https://arxiv.org/abs/2601.18902
Authors: Jiaming Fan, Daming Cao, Xiangzhong Luo, Jiale Fu, Chonghan Liu, Xu Yang
Affiliations: Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University); Southeast University; Nanjing University of Information Science and Technology; Qiyuan Tech
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Speculative Decoding (SD) is a key technique for accelerating Large Language Model (LLM) inference, but it typically requires training a draft model on a large dataset. We approach this problem from a data-centric perspective, finding that not all training samples contribute equally to the SD acceptance rate. Specifically, our theoretical analysis and empirical validation reveals that tokens inducing flatter predictive distributions from the target model are more valuable than those yielding sharply peaked distributions. Based on this insight, we propose flatness, a new metric to quantify this property, and develop the Sample-level-flatness-based Dataset Distillation (SFDD) approach, which filters the training data to retain only the most valuable samples. Experiments on the EAGLE framework demonstrate that SFDD can achieve over 2× training speedup using only 50% of the data, while keeping the final model’s inference speedup within 4% of the full-dataset baseline. This work introduces an effective, data-centric approach that substantially improves the training efficiency for Speculative Decoding. Our code is available at this https URL.
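
The abstract does not define the flatness metric, so the sketch below uses normalized entropy as a stand-in proxy (0 for a one-hot distribution, 1 for a uniform one) and keeps the samples whose token distributions are flattest on average. The proxy, the averaging rule, and the function names are assumptions for illustration only and should not be read as the paper's SFDD procedure.

```python
import math
from typing import List, Sequence

def normalized_entropy(probs: Sequence[float]) -> float:
    """Entropy divided by log(vocabulary size): 0 for one-hot, 1 for uniform."""
    if len(probs) < 2:
        return 0.0
    entropy = sum(-p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))

def sample_flatness(token_dists: Sequence[Sequence[float]]) -> float:
    """Sample-level score: mean of the per-token proxy values."""
    return sum(normalized_entropy(d) for d in token_dists) / len(token_dists)

def keep_flattest(samples: Sequence[Sequence[Sequence[float]]],
                  keep_ratio: float = 0.5) -> List[int]:
    """Return the indices of the `keep_ratio` fraction of samples scored flattest."""
    scores = [sample_flatness(s) for s in samples]
    k = max(1, int(len(samples) * keep_ratio))
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

In practice the token distributions would come from the target model's softmax outputs over the training data; here they are passed in directly as lists of probabilities.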

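[NLP-60] Self-Aware Knowledge Probing: Evaluating Language Models' Relational Knowledge through Confidence Calibration

[Quick Read]: This paper addresses the fact that existing knowledge probing methods, when evaluating how much relational knowledge a language model (LM) has acquired, ignore the calibration of the model's confidence. Traditional methods rely only on metrics such as prediction accuracy or precision and so cannot reflect the reliability of model outputs. The key to the solution is a new calibration probing framework covering three modalities of model confidence: intrinsic confidence, structural consistency, and semantic grounding. A systematic analysis of ten causal and six masked language models with this framework finds that most models, especially those pretrained with the masking objective, are overconfident, and that the best-calibrated scores come from confidence estimates that account for inconsistencies caused by statement rephrasing.

Link: https://arxiv.org/abs/2601.18901
Authors: Christopher Kissling, Elena Merdjanovska, Alan Akbik
Affiliations: Humboldt-Universität zu Berlin; Science of Intelligence
Categories: Computation and Language (cs.CL)
Comments: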

Abstract:Knowledge probing quantifies how much relational knowledge a language model (LM) has acquired during pre-training. Existing knowledge probes evaluate model capabilities through metrics like prediction accuracy and precision. Such evaluations fail to account for the model’s reliability, reflected in the calibration of its confidence scores. In this paper, we propose a novel calibration probing framework for relational knowledge, covering three modalities of model confidence: (1) intrinsic confidence, (2) structural consistency and (3) semantic grounding. Our extensive analysis of ten causal and six masked language models reveals that most models, especially those pre-trained with the masking objective, are overconfident. The best-calibrated scores come from confidence estimates that account for inconsistencies due to statement rephrasing. Moreover, even the largest pre-trained models fail to encode the semantics of linguistic confidence expressions accurately.
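
Calibration is commonly summarized with the expected calibration error (ECE), which compares average confidence with accuracy inside confidence bins. The sketch below is the standard equal-width-bin formulation; whether the paper uses ECE or another calibration measure is not stated in the abstract, so treat this as background rather than the authors' metric.

```python
from typing import List, Sequence, Tuple

def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1.0 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

An overconfident model, for example one that reports 0.9 confidence while being right half the time, gets a large ECE, whereas a model whose confidence matches its accuracy gets a value near zero.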

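[NLP-61] Language Family Matters: Evaluating LLM-Based ASR Across Linguistic Boundaries

[Quick Read]: This paper addresses the parameter redundancy and limited generalization that arise in resource-constrained multilingual automatic speech recognition (ASR) systems when a separate connector is trained for each language. Its core solution is a connector-sharing strategy based on linguistic family membership, with one shared connector per language family, which reduces the parameter count without sacrificing performance and improves generalization across domains and languages. By exploiting semantic and structural similarities among related languages, the approach offers a more efficient and scalable path to multilingual ASR deployment.

Link: https://arxiv.org/abs/2601.18899
Authors: Yuchen Zhang, Ravi Shekhar, Haralambos Mouratidis
Affiliations: Institute for Analytics and Data Science, University of Essex; School of Computer Science and Electronic Engineering, University of Essex
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: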

Abstract:Large Language Model (LLM)-powered Automatic Speech Recognition (ASR) systems achieve strong performance with limited resources by linking a frozen speech encoder to a pretrained LLM via a lightweight connector. Prior work trains a separate connector per language, overlooking linguistic relatedness. We propose an efficient and novel connector-sharing strategy based on linguistic family membership, enabling one connector per family, and empirically validate its effectiveness across two multilingual LLMs and two real-world corpora spanning curated and crowd-sourced speech. Our results show that family-based connectors reduce parameter count while improving generalization across domains, offering a practical and scalable strategy for multilingual ASR deployment.

[NLP-62] XProvence: Zero-Cost Multilingual Context Pruning for Retrieval-Augmented Generation ECIR2026

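[Quick read]: This paper addresses the performance degradation and wasted computation caused by redundant context in multilingual retrieval-augmented generation (RAG) systems. The authors propose XProvence, a multilingual zero-cost context pruning model whose core innovation is integrating an efficient zero-cost context pruning mechanism directly into the re-ranking module, so that RAG input contexts are filtered without additional computational overhead. Trained on 16 languages and supporting more than 100 languages through cross-lingual transfer, it improves the efficiency and generalization of multilingual RAG systems, compresses contexts with minimal-to-no performance loss on several multilingual question answering benchmarks, and outperforms strong existing baselines.

Link: https://arxiv.org/abs/2601.18886
Authors: Youssef Mohamed,Mohamed Elhoseiny,Thibault Formal,Nadezhda Chirkova
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Accepted to ECIR 2026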

Abstract:This paper introduces XProvence, a multilingual zero-cost context pruning model for retrieval-augmented generation (RAG), trained on 16 languages and supporting 100+ languages through effective cross-lingual transfer. Motivated by the growing use of RAG systems across diverse languages, we explore several strategies to generalize the Provence framework (which first integrated efficient zero-cost context pruning directly into the re-ranking model) beyond English. Across four multilingual question answering benchmarks, we show that XProvence can prune RAG contexts with minimal-to-no performance degradation and outperform strong baselines. Our model is available at this https URL.
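As a rough picture of context pruning, the sketch below drops sentences whose relevance score falls under a threshold before the context reaches the generator. In a system like the one described, such scores would come from the re-ranking model itself, which is what makes the pruning "zero-cost". Here the scores, the threshold, and the function name are hypothetical stand-ins rather than the XProvence interface.

```python
from typing import List, Sequence


def prune_context(
    sentences: Sequence[str],
    keep_scores: Sequence[float],
    threshold: float = 0.5,
) -> List[str]:
    """Keep only sentences whose (hypothetical) relevance score reaches the threshold."""
    if len(sentences) != len(keep_scores):
        raise ValueError("one score is required per sentence")
    return [s for s, score in zip(sentences, keep_scores) if score >= threshold]


# Toy passage with made-up scores for the question "Where is the Eiffel Tower?".
passage = [
    "The Eiffel Tower is in Paris.",
    "It was repainted several times.",
    "Paris is the capital of France.",
]
pruned = prune_context(passage, [0.95, 0.10, 0.60])
```

Only the two sentences scored at or above the threshold survive, so the generator sees a shorter context that still contains the answer.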

[NLP-63] Rethinking Discrete Speech Representation Tokens for Accent Generation

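[Quick read]: This paper addresses the unclear encoding of accent information in Discrete Speech Representation Tokens (DSRTs): prior work has focused on phonetic and speaker information, but the accessibility and separability of accent information have not been systematically studied. The key to the solution is a unified evaluation framework that measures the accessibility of accent information with a novel Accent ABX task and its recoverability with cross-accent Voice Conversion (VC) resynthesis. Using this framework, the study finds that fine-tuning with automatic speech recognition (ASR) supervision substantially reduces accent information, while simply shrinking the codebook does not effectively disentangle accent from phonetic and speaker information. The authors therefore design content-only and content-accent DSRTs that significantly outperform existing designs in controllable accent generation, providing empirical evidence and practical guidance for accent-aware DSRT design.

Link: https://arxiv.org/abs/2601.19786
Authors: Jinzuomu Zhong,Yi Wang,Korin Richmond,Peter Bell
Affiliations: Centre for Speech Technology Research, University of Edinburgh
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: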

Abstract:Discrete Speech Representation Tokens (DSRTs) have become a foundational component in speech generation. While prior work has extensively studied phonetic and speaker information in DSRTs, how accent information is encoded in DSRTs remains largely unexplored. In this paper, we present the first systematic investigation of accent information in DSRTs. We propose a unified evaluation framework that measures both accessibility of accent information via a novel Accent ABX task and recoverability via cross-accent Voice Conversion (VC) resynthesis. Using this framework, we analyse DSRTs derived from a variety of speech encoders. Our results reveal that accent information is substantially reduced when ASR supervision is used to fine-tune the encoder, but cannot be effectively disentangled from phonetic and speaker information through naive codebook size reduction. Based on these findings, we propose new content-only and content-accent DSRTs that significantly outperform existing designs in controllable accent generation. Our work highlights the importance of accent-aware evaluation and provides practical guidance for designing DSRTs for accent-controlled speech generation.
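For readers unfamiliar with ABX discrimination, the sketch below shows the generic test: given items A and X from the same category (here, the same accent) and B from a different one, a representation passes a trial when X lies closer to A than to B. The abstract does not detail the paper's Accent ABX task, so this is a simplified assumption that uses fixed-length vectors and Euclidean distance, whereas speech ABX setups typically compare variable-length sequences.

```python
import math
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]
Triple = Tuple[Vector, Vector, Vector]  # (A, B, X): A and X share a category, B does not


def euclidean(u: Vector, v: Vector) -> float:
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def abx_accuracy(
    triples: Sequence[Triple],
    distance: Callable[[Vector, Vector], float] = euclidean,
) -> float:
    """Fraction of triples where X is closer to A than to B; ties count as half."""
    if not triples:
        raise ValueError("at least one (A, B, X) triple is required")
    score = 0.0
    for a, b, x in triples:
        d_ax, d_bx = distance(a, x), distance(b, x)
        if d_ax < d_bx:
            score += 1.0
        elif d_ax == d_bx:
            score += 0.5
    return score / len(triples)


# Two toy trials: the first is discriminated correctly, the second is not.
toy_triples = [
    ([0.0, 0.0], [1.0, 1.0], [0.1, 0.0]),
    ([0.0, 0.0], [1.0, 1.0], [0.9, 1.0]),
]
toy_score = abx_accuracy(toy_triples)
```

An accuracy near 1.0 would suggest the category is easy to read off the representation, while a score near 0.5 is chance level.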

Computer Vision

[CV-0] DuwatBench: Bridging Language and Visual Heritage through an Arabic Calligraphy Benchmark for Multimodal Understanding EACL-2026

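[Quick read]: This paper addresses the weakness of current multimodal models on Arabic calligraphy, a complex vision-language task, and in particular their inability to recognize and align artistic, stylized script. The key to the solution is building and publicly releasing DuwatBench, a benchmark of 1,272 curated samples covering six classical and modern calligraphic styles, each paired with sentence-level detection annotations, which reflects real challenges in Arabic writing such as complex stroke patterns, dense ligatures, and stylistic variation. Evaluating 13 leading Arabic and multilingual multimodal models on the dataset exposes clear limitations in handling calligraphic variation, artistic distortion, and precise visual-text alignment, with the aim of advancing culturally grounded multimodal research and the fair inclusion of the Arabic language and its visual heritage in AI systems.

Link: https://arxiv.org/abs/2601.19898
Authors: Shubham Patle,Sara Ghaboura,Hania Tariq,Mohammad Usman Khan,Omkar Thawakar,Rao Muhammad Anwer,Salman Khan
Affiliations: Mohamed bin Zayed University of AI; NUCES; NUST; Australian National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to EACL-2026 (Main Track)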

Abstract:Arabic calligraphy represents one of the richest visual traditions of the Arabic language, blending linguistic meaning with artistic form. Although multimodal models have advanced across languages, their ability to process Arabic script, especially in artistic and stylized calligraphic forms, remains largely unexplored. To address this gap, we present DuwatBench, a benchmark of 1,272 curated samples containing about 1,475 unique words across six classical and modern calligraphic styles, each paired with sentence-level detection annotations. The dataset reflects real-world challenges in Arabic writing, such as complex stroke patterns, dense ligatures, and stylistic variations that often challenge standard text recognition systems. Using DuwatBench, we evaluated 13 leading Arabic and multilingual multimodal models and showed that while they perform well on clean text, they struggle with calligraphic variation, artistic distortions, and precise visual-text alignment. By publicly releasing DuwatBench and its annotations, we aim to advance culturally grounded multimodal research, foster fair inclusion of the Arabic language and visual heritage in AI systems, and support continued progress in this area. Our dataset (this https URL) and evaluation suite (this https URL) are publicly available.

[CV-1] VGGT-SLAM 2.0: Real time Dense Feed-forward Scene Reconstruction

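[Quick read]: This paper addresses the high-dimensional (15-degree-of-freedom) drift and planar degeneracy that VGGT-SLAM suffers when incrementally aligning submaps from real-time RGB image sequences, as well as the reconstruction ambiguity caused by unknown camera intrinsics. The key solutions are: first, a new factor graph design that removes these geometric degeneracies while still handling the reconstruction ambiguity of VGGT under unknown camera intrinsics; second, an analysis of VGGT's attention layers showing that one layer can be used directly for image retrieval verification, improving matching accuracy and loop closure without additional training. Experiments show roughly 23% lower pose error than the original VGGT-SLAM on the TUM dataset, with real-time performance when deployed online on a ground robot platform.

Link: https://arxiv.org/abs/2601.19887
Authors: Dominic Maggio,Luca Carlone
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: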

Abstract:We present VGGT-SLAM 2.0, a real-time RGB feed-forward SLAM system which substantially improves upon VGGT-SLAM for incrementally aligning submaps created from VGGT. Firstly, we remove high-dimensional 15-degree-of-freedom drift and planar degeneracy from VGGT-SLAM by creating a new factor graph design while still addressing the reconstruction ambiguity of VGGT given unknown camera intrinsics. Secondly, by studying the attention layers of VGGT, we show that one of the layers is well suited to assist in image retrieval verification for free without additional training, which enables both rejecting false positive matches and completing more loop closures. Finally, we conduct a suite of experiments which includes showing VGGT-SLAM 2.0 can easily be adapted for open-set object detection and demonstrating real-time performance while running online onboard a ground robot using a Jetson Thor. We also test in environments ranging from cluttered indoor apartments and office scenes to a 4,200 square foot barn, and we demonstrate that VGGT-SLAM 2.0 achieves the highest accuracy on the TUM dataset with about 23 percent less pose error than VGGT-SLAM. Code will be released upon publication.

[CV-2] SONIC: Spectral Oriented Neural Invariant Convolutions ICLR2026

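[Quick read]: This paper addresses two limitations: Convolutional Neural Networks (CNNs) struggle to capture global context or long-range dependencies, while Vision Transformers (ViTs) lack spatial inductive bias, depend on explicit positional encodings, and are tied to the initial patch size. The key to the solution is SONIC (Spectral Oriented Neural Invariant Convolutions), a continuous spectral parameterization that models convolutional operators with a small set of shared, orientation-selective components defining smooth responses over the full frequency domain, which yields filters with global receptive fields that adapt naturally across resolutions. With very few parameters, the method is markedly more robust to geometric transformations, noise, and resolution shifts, and matches or exceeds existing spatial and spectral architectures on multiple tasks.

Link: https://arxiv.org/abs/2601.19884
Authors: Gijs Joppe Moens,Regina Beets-Tan,Eduardo H. P. Pooch
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures. Accepted at ICLR 2026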

Abstract:Convolutional Neural Networks (CNNs) rely on fixed-size kernels scanning local patches, which limits their ability to capture global context or long-range dependencies without very deep architectures. Vision Transformers (ViTs), in turn, provide global connectivity but lack spatial inductive bias, depend on explicit positional encodings, and remain tied to the initial patch size. Bridging these limitations requires a representation that is both structured and global. We introduce SONIC (Spectral Oriented Neural Invariant Convolutions), a continuous spectral parameterisation that models convolutional operators using a small set of shared, orientation-selective components. These components define smooth responses across the full frequency domain, yielding global receptive fields and filters that adapt naturally across resolutions. Across synthetic benchmarks, large-scale image classification, and 3D medical datasets, SONIC shows improved robustness to geometric transformations, noise, and resolution shifts, and matches or exceeds convolutional, attention-based, and prior spectral architectures with an order of magnitude fewer parameters. These results demonstrate that continuous, orientation-aware spectral parameterisations provide a principled and scalable alternative to conventional spatial and spectral operators.

[CV-3] EgoHandICL: Egocentric 3D Hand Reconstruction with In-Context Learning ICLR2026

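[Quick read]: This paper addresses robust 3D hand reconstruction in egocentric vision, where the main challenges are depth ambiguity, self-occlusion, and complex hand-object interactions. Prior methods usually mitigate these issues by scaling training data or adding auxiliary cues, but perform poorly in unseen scenarios. The paper proposes EgoHandICL, the first in-context learning (ICL) framework for 3D hand reconstruction. Its key innovations are: 1) a complementary exemplar retrieval mechanism guided by vision-language models (VLMs) to improve semantic alignment; 2) a dedicated tokenizer for multimodal context, tailored to ICL; and 3) a masked autoencoder (MAE)-based architecture trained with hand-guided geometric and perceptual objectives, which improves visual consistency and robustness in challenging egocentric scenes.

Link: https://arxiv.org/abs/2601.19850
Authors: Binzhu Xie,Shi Qiu,Sicheng Zhang,Yinqiao Wang,Hao Xu,Muzammal Naseer,Chi-Wing Fu,Pheng-Ann Heng
Affiliations: The Chinese University of Hong Kong; Khalifa University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in ICLR 2026, Codebase: this https URL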

Abstract:Robust 3D hand reconstruction in egocentric vision is challenging due to depth ambiguity, self-occlusion, and complex hand-object interactions. Prior methods mitigate these issues by scaling training data or adding auxiliary cues, but they often struggle in unseen contexts. We present EgoHandICL, the first in-context learning (ICL) framework for 3D hand reconstruction that improves semantic alignment, visual consistency, and robustness under challenging egocentric conditions. EgoHandICL introduces complementary exemplar retrieval guided by vision-language models (VLMs), an ICL-tailored tokenizer for multimodal context, and a masked autoencoder (MAE)-based architecture trained with hand-guided geometric and perceptual objectives. Experiments on ARCTIC and EgoExo4D show consistent gains over state-of-the-art methods. We also demonstrate real-world generalization and improve EgoVLM hand-object interaction reasoning by using reconstructed hands as visual prompts. Code and data: this https URL

[CV-4] HexFormer: Hyperbolic Vision Transformer with Exponential Map Aggregation

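[Quick read]: This paper addresses the difficulty of modeling, within Euclidean geometry, the hierarchical and relational structures found in data across modalities such as images, text, and graphs. The key to the solution is to adopt hyperbolic geometry as a more natural representational framework and to propose HexFormer, a hyperbolic vision transformer built on exponential map aggregation. The mechanism aggregates information inside attention using the exponential map in hyperbolic space and, compared with conventional centroid-based averaging, yields more accurate and stable representations; the model outperforms Euclidean baselines and prior hyperbolic ViTs on multiple image classification datasets while showing more stable gradients and more robust training.

Link: https://arxiv.org/abs/2601.19849
Authors: Haya Alyoussef,Ahmad Bdeir,Diego Coello de Portugal Mecke,Tom Hanika,Niels Landwehr,Lars Schmidt-Thieme
Affiliations: Information Systems and Machine Learning Lab (ISMLL); Data Science Department; Hildesheim University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: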

Abstract:Data across modalities such as images, text, and graphs often contains hierarchical and relational structures, which are challenging to model within Euclidean geometry. Hyperbolic geometry provides a natural framework for representing such structures. Building on this property, this work introduces HexFormer, a hyperbolic vision transformer for image classification that incorporates exponential map aggregation within its attention mechanism. Two designs are explored: a hyperbolic ViT (HexFormer) and a hybrid variant (HexFormer-Hybrid) that combines a hyperbolic encoder with a Euclidean linear classification head. HexFormer incorporates a novel attention mechanism based on exponential map aggregation, which yields more accurate and stable aggregated representations than standard centroid-based averaging, showing that simpler approaches retain competitive merit. Experiments across multiple datasets demonstrate consistent performance improvements over Euclidean baselines and prior hyperbolic ViTs, with the hybrid variant achieving the strongest overall results. Additionally, this study provides an analysis of gradient stability in hyperbolic transformers. The results reveal that hyperbolic models exhibit more stable gradients and reduced sensitivity to warmup strategies compared to Euclidean architectures, highlighting their robustness and efficiency in training. Overall, these findings indicate that hyperbolic geometry can enhance vision transformer architectures by improving gradient stability and accuracy. In addition, relatively simple mechanisms such as exponential map aggregation can provide strong practical benefits.
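The abstract does not spell out how exponential map aggregation is computed. One plausible reading, sketched below purely as an assumption, is to pull hyperbolic points into the tangent space at the origin with the logarithmic map, take the attention-weighted sum there, and push the result back with the exponential map. The sketch uses the Poincaré ball with curvature -1; the model choice, function names, and weights are illustrative and may differ from HexFormer's actual formulation.

```python
import math
from typing import List, Sequence


def _norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def exp_map_origin(v: Sequence[float]) -> List[float]:
    """Exponential map at the origin of the Poincare ball (curvature -1)."""
    n = _norm(v)
    if n == 0.0:
        return [0.0] * len(v)
    scale = math.tanh(n) / n
    return [scale * x for x in v]


def log_map_origin(p: Sequence[float]) -> List[float]:
    """Logarithmic map at the origin; the input must lie strictly inside the unit ball."""
    n = _norm(p)
    if n == 0.0:
        return [0.0] * len(p)
    scale = math.atanh(n) / n
    return [scale * x for x in p]


def aggregate(points: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Weighted sum in the tangent space at the origin, mapped back onto the ball."""
    dim = len(points[0])
    tangent = [0.0] * dim
    for point, weight in zip(points, weights):
        for i, t in enumerate(log_map_origin(point)):
            tangent[i] += weight * t
    return exp_map_origin(tangent)


# Two points near the boundary, combined with equal (attention-like) weights.
mixed = aggregate([[0.9, 0.0], [0.0, 0.9]], [0.5, 0.5])
```

Because the exponential map always lands strictly inside the unit ball, the aggregate stays a valid hyperbolic point without an extra projection step.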

[CV-5] Query-Guided Spatial-Temporal-Frequency Interaction for Music Audio-Visual Question Answering

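[Quick read]: This paper addresses three shortcomings of existing Audio-Visual Question Answering (AVQA) methods: under-use of audio information, weak guidance from the textual question, and insufficient multimodal fusion. Its core solution is a Query-guided Spatial–Temporal–Frequency (QSTar) interaction mechanism that introduces question-guided clues and combines the distinctive frequency-domain characteristics of audio signals with spatial and temporal perception of video for more precise cross-modal alignment and understanding. It also introduces a prompting-inspired Query Context Reasoning (QCR) block that lets the model focus on semantically relevant audio and visual features, significantly improving AVQA performance.

Link: https://arxiv.org/abs/2601.19821
Authors: Kun Li,Michael Ying Yang,Sami Sebastian Brandt
Affiliations: University of Twente; University of Bath; IT University of Copenhagen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: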

Abstract:Audio–Visual Question Answering (AVQA) is a challenging multimodal task that requires jointly reasoning over audio, visual, and textual information in a given video to answer natural language questions. Inspired by recent advances in Video QA, many existing AVQA approaches primarily focus on visual information processing, leveraging pre-trained models to extract object-level and motion-level representations. However, in those methods, the audio input is primarily treated as complementary to video analysis, and the textual question information contributes minimally to audio–visual understanding, as it is typically integrated only in the final stages of reasoning. To address these limitations, we propose a novel Query-guided Spatial–Temporal–Frequency (QSTar) interaction method, which effectively incorporates question-guided clues and exploits the distinctive frequency-domain characteristics of audio signals, alongside spatial and temporal perception, to enhance audio–visual understanding. Furthermore, we introduce a Query Context Reasoning (QCR) block inspired by prompting, which guides the model to focus more precisely on semantically relevant audio and visual features. Extensive experiments conducted on several AVQA benchmarks demonstrate the effectiveness of our proposed method, achieving significant performance improvements over existing Audio QA, Visual QA, Video QA, and AVQA approaches. The code and pretrained models will be released after publication.

[CV-6] Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision

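[Quick read]: This paper addresses the limited ability of current Vision-Language Models (VLMs) to retain fine-grained visual information, which leads to coarse-grained multimodal understanding. The authors trace the problem to a text-dominant training paradigm that treats visual signals as passive conditional inputs rather than supervisory targets. The key to the solution is the Youtu-VL framework, built on the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, which shifts the optimization objective from "vision as input" to "vision as target": visual tokens are integrated directly into the prediction stream so that unified autoregressive supervision is applied to both visual details and linguistic content. This design lets the model support vision-centric tasks without task-specific modules and significantly improves multimodal understanding and visual perception.

Link: https://arxiv.org/abs/2601.19798
Authors: Zhixiang Wei,Yi Li,Zhehan Kan,Xinghua Jiang,Zuwei Long,Shifeng Liu,Hongze Shen,Wei Liu,Xiaoyu Tan,Haojia Lin,Yubo Zhu,Qianyu Li,Di Yin,Haoyu Cao,Weibo Gu,Xin Li,Yinsong Liu,Deqiang Jiang,Xing Sun,Yunsheng Wu,Mingkong Tang,Shuangyin Liu,Lexiang Tang,Haodong Lin,Junru Lu,Jiarui Qin,Lingfeng Qiao,Ruizhi Qiao,Bo Ke,Jianfeng He,Ke Li,Yangning Li,Yunhang Shen,Mengdan Zhang,Peixian Chen,Kun Yin,Bing Liu,Yunfei Wu,Huang Chen,Zhongpeng Cai,Xiaotian Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: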

Abstract:Despite the significant advancements represented by Vision-Language Models (VLMs), current architectures often exhibit limitations in retaining fine-grained visual information, leading to coarse-grained multimodal comprehension. We attribute this deficiency to a suboptimal training paradigm inherent in prevailing VLMs, which exhibits a text-dominant optimization bias by conceptualizing visual signals merely as passive conditional inputs rather than supervisory targets. To mitigate this, we introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, which fundamentally shifts the optimization objective from "vision-as-input" to "vision-as-target." By integrating visual tokens directly into the prediction stream, Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. Furthermore, we extend this paradigm to encompass vision-centric tasks, enabling a standard VLM to perform vision-centric tasks without task-specific additions. Extensive empirical evaluations demonstrate that Youtu-VL achieves competitive performance on both general multimodal tasks and vision-centric tasks, establishing a robust foundation for the development of comprehensive generalist visual agents.

[CV-7] Diffusion for De-Occlusion: Accessory-Aware Diffusion Inpainting for Robust Ear Biometric Recognition

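[Quick read]: This paper addresses the performance drop of ear biometric recognition systems under ear occlusions caused by accessories such as earrings and earphones, especially the reduced recognition accuracy in unconstrained imaging environments. The key to the solution is diffusion-based ear inpainting as a pre-processing step: given an occluded ear image and an automatically derived accessory mask, the model reconstructs clean, anatomically plausible ear regions while preserving local geometric coherence along key ear structures (helix, antihelix, concha, and lobule), improving the overall performance of vision-transformer-based ear recognition systems.

Link: https://arxiv.org/abs/2601.19795
Authors: Deeksha Arun,Kevin W. Bowyer,Patrick Flynn
Affiliations: University of Notre Dame
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: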

Abstract:Ear occlusions (arising from the presence of ear accessories such as earrings and earphones) can negatively impact performance in ear-based biometric recognition systems, especially in unconstrained imaging circumstances. In this study, we assess the effectiveness of a diffusion-based ear inpainting technique as a pre-processing aid to mitigate the issues of ear accessory occlusions in transformer-based ear recognition systems. Given an input ear image and an automatically derived accessory mask, the inpainting model reconstructs clean and anatomically plausible ear regions by synthesizing missing pixels while preserving local geometric coherence along key ear structures, including the helix, antihelix, concha, and lobule. We evaluate the effectiveness of this pre-processing aid in transformer-based recognition systems for several vision transformer models and different patch sizes for a range of benchmark datasets. Experiments show that diffusion-based inpainting can be a useful pre-processing aid to alleviate ear accessory occlusions to improve overall recognition performance.

[CV-8] GeoDiff3D: Self-Supervised 3D Scene Generation with Geometry-Constrained 2D Diffusion Guidance

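[Quick read]: This paper addresses weak structural modeling, heavy reliance on large-scale annotated data, and the structural artifacts, geometric inconsistencies, and degraded high-frequency details that current 3D scene generation methods tend to produce in complex scenes. The key to the solution is the GeoDiff3D framework, which uses coarse geometry as a structural anchor and a geometry-constrained 2D diffusion model to generate texture-rich reference images. It also introduces voxel-aligned 3D feature aggregation and a dual self-supervision strategy, which preserve scene coherence and fine details while substantially reducing dependence on labeled data, enabling efficient, high-quality 3D scene generation.

Link: https://arxiv.org/abs/2601.19785
Authors: Haozhi Zhu,Miaomiao Zhao,Dingyao Liu,Runze Tian,Yan Zhang,Jie Guo,Fenggen Yu
Affiliations: Nanjing University; Simon Fraser University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: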

Abstract:3D scene generation is a core technology for gaming, film/VFX, and VR/AR. Growing demand for rapid iteration, high-fidelity detail, and accessible content creation has further increased interest in this area. Existing methods broadly follow two paradigms - indirect 2D-to-3D reconstruction and direct 3D generation - but both are limited by weak structural modeling and heavy reliance on large-scale ground-truth supervision, often producing structural artifacts, geometric inconsistencies, and degraded high-frequency details in complex scenes. We propose GeoDiff3D, an efficient self-supervised framework that uses coarse geometry as a structural anchor and a geometry-constrained 2D diffusion model to provide texture-rich reference images. Importantly, GeoDiff3D does not require strict multi-view consistency of the diffusion-generated references and remains robust to the resulting noisy, inconsistent guidance. We further introduce voxel-aligned 3D feature aggregation and dual self-supervision to maintain scene coherence and fine details while substantially reducing dependence on labeled data. GeoDiff3D also trains with low computational cost and enables fast, high-quality 3D scene generation. Extensive experiments on challenging scenes show improved generalization and generation quality over existing baselines, offering a practical solution for accessible and efficient 3D scene construction.

[CV-9] PaW-ViT: A Patch-based Warping Vision Transformer for Robust Ear Verification

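[Quick read]: This paper addresses the mismatch between the morphological variation of ear biometrics and the positional sensitivity of transformer architectures: the rectangular tokens used in a conventional Vision Transformer (ViT) bring in information from outside the target object, which hurts recognition performance. The key to the solution is PaW-ViT (Patch-based Warping Vision Transformer), a preprocessing method grounded in anatomical knowledge that aligns token boundaries precisely to detected ear feature boundaries, giving greater robustness to variation in shape, size, and pose. By aligning to the natural curvature of the ear, it also produces more consistent token representations and more stable recognition across morphologies.

Link: https://arxiv.org/abs/2601.19771
Authors: Deeksha Arun,Kevin W. Bowyer,Patrick Flynn
Affiliations: University of Notre Dame
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: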

Abstract:The rectangular tokens common to vision transformer methods for visual recognition can strongly affect performance of these methods due to incorporation of information outside the objects to be recognized. This paper introduces PaW-ViT, Patch-based Warping Vision Transformer, a preprocessing approach rooted in anatomical knowledge that normalizes ear images to enhance the efficacy of ViT. By accurately aligning token boundaries to detected ear feature boundaries, PaW-ViT obtains greater robustness to shape, size, and pose variation. By aligning feature boundaries to natural ear curvature, it produces more consistent token representations for various morphologies. Experiments confirm the effectiveness of PaW-ViT on various ViT models (ViT-T, ViT-S, ViT-B, ViT-L) and yield reasonable alignment robustness to variation in shape, size, and pose. Our work aims to solve the disconnect between ear biometric morphological variation and transformer architecture positional sensitivity, presenting a possible avenue for authentication schemes.

[CV-10] WaterClear-GS: Optical-Aware Gaussian Splatting for Underwater Reconstruction and Restoration

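[Quick read]: This paper addresses slow rendering and inaccurate color restoration in underwater 3D reconstruction and appearance restoration, which stem from the complex optical properties of water such as wavelength-dependent attenuation and scattering. Existing Neural Radiance Fields (NeRF)-based methods perform poorly in real-time capability and color restoration, while 3D Gaussian Splatting (3DGS) has difficulty modeling complex volumetric scattering. The key to the solution is WaterClear-GS, the first pure 3DGS framework, which explicitly integrates the underwater optical properties of local attenuation and scattering into Gaussian primitives and thereby models physically plausible light transport without an auxiliary medium network. It adopts a dual-branch optimization strategy that ensures underwater photometric consistency while naturally recovering the water-free appearance, combined with depth-guided geometry regularization, a perception-driven image loss, exposure constraints, spatially adaptive regularization, and physically guided spectral regularization to improve local 3D coherence and visual naturalness. The method achieves strong performance on both novel view synthesis (NVS) and underwater image restoration (UIR) and supports real-time rendering.

Link: https://arxiv.org/abs/2601.19753
Authors: Xinrui Zhang,Yufeng Wang,Shuangkang Fang,Zesheng Wang,Dacheng Qi,Wenrui Ding
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: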

Abstract:Underwater 3D reconstruction and appearance restoration are hindered by the complex optical properties of water, such as wavelength-dependent attenuation and scattering. Existing Neural Radiance Fields (NeRF)-based methods struggle with slow rendering speeds and suboptimal color restoration, while 3D Gaussian Splatting (3DGS) inherently lacks the capability to model complex volumetric scattering effects. To address these issues, we introduce WaterClear-GS, the first pure 3DGS-based framework that explicitly integrates underwater optical properties of local attenuation and scattering into Gaussian primitives, eliminating the need for an auxiliary medium network. Our method employs a dual-branch optimization strategy to ensure underwater photometric consistency while naturally recovering water-free appearances. This strategy is enhanced by depth-guided geometry regularization and perception-driven image loss, together with exposure constraints, spatially-adaptive regularization, and physically guided spectral regularization, which collectively enforce local 3D coherence and maintain natural visual perception. Experiments on standard benchmarks and our newly collected dataset demonstrate that WaterClear-GS achieves outstanding performance on both novel view synthesis (NVS) and underwater image restoration (UIR) tasks, while maintaining real-time rendering. The code will be available at this https URL.
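To illustrate what "wavelength-dependent attenuation and scattering" does to a pixel, the sketch below applies a commonly used underwater image formation model: the observed color is the clear color attenuated with distance plus a backscatter term that saturates toward the veiling light. This is a generic per-pixel model given here as an assumption; the abstract does not state the exact formulation WaterClear-GS attaches to its Gaussian primitives, and every coefficient below is made up.

```python
import math
from typing import Sequence, Tuple


def degrade(
    clear_rgb: Sequence[float],
    depth_m: float,
    beta_direct: Sequence[float],
    beta_back: Sequence[float],
    veiling_light: Sequence[float],
) -> Tuple[float, ...]:
    """Per channel: I = J * exp(-beta_d * z) + B_inf * (1 - exp(-beta_b * z))."""
    return tuple(
        j * math.exp(-bd * depth_m) + b_inf * (1.0 - math.exp(-bb * depth_m))
        for j, bd, bb, b_inf in zip(clear_rgb, beta_direct, beta_back, veiling_light)
    )


def restore(
    observed_rgb: Sequence[float],
    depth_m: float,
    beta_direct: Sequence[float],
    beta_back: Sequence[float],
    veiling_light: Sequence[float],
) -> Tuple[float, ...]:
    """Invert the model: subtract backscatter, then undo the direct attenuation."""
    return tuple(
        (i - b_inf * (1.0 - math.exp(-bb * depth_m))) / math.exp(-bd * depth_m)
        for i, bd, bb, b_inf in zip(observed_rgb, beta_direct, beta_back, veiling_light)
    )


# Made-up coefficients: red attenuates fastest, and the veiling light is blue-green.
clear = (0.8, 0.6, 0.4)
beta_d = (0.40, 0.10, 0.05)
beta_b = (0.30, 0.10, 0.08)
veil = (0.05, 0.30, 0.40)

observed = degrade(clear, 5.0, beta_d, beta_b, veil)
recovered = restore(observed, 5.0, beta_d, beta_b, veil)
```

With known depth and coefficients the inversion is exact. The hard part in practice is estimating those quantities from images, which is where a learned scene representation comes in.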

[CV-11] Benchmarking Multimodal Large Language Models for Missing Modality Completion in Product Catalogues

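[Quick read]: This paper addresses missing multimodal information on e-commerce platforms (such as missing product images or text descriptions) caused by annotation errors or incomplete metadata, which harms product presentation and downstream applications such as recommender systems. Its core contribution is an evaluation framework, the Missing Modality Product Completion Benchmark (MMPCBench), comprising two sub-benchmarks for content quality completion and recommendation performance, on which six state-of-the-art Multimodal Large Language Models (MLLMs) are systematically evaluated on image-to-text and text-to-image completion across nine real-world e-commerce categories. Key findings: MLLMs capture high-level semantics but fall short on word-level and pixel- or patch-level alignment; there is no clear correlation between model size and performance; and fine-tuning with Group Relative Policy Optimization (GRPO) improves image-to-text completion but not text-to-image completion, exposing the limits of current MLLMs in real-world cross-modal generation.

Link: https://arxiv.org/abs/2601.19750
Authors: Junchen Fu,Wenhao Deng,Kaiwen Zheng,Alexandros Karatzoglou,Ioannis Arapakis,Yu Ye,Yongxin Ni,Joemon M. Jose,Xuri Ge
Affiliations: University of Glasgow; National University of Singapore; Shandong University
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: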

Abstract:Missing-modality information on e-commerce platforms, such as absent product images or textual descriptions, often arises from annotation errors or incomplete metadata, impairing both product presentation and downstream applications such as recommendation systems. Motivated by the multimodal generative capabilities of recent Multimodal Large Language Models (MLLMs), this work investigates a fundamental yet underexplored question: can MLLMs generate missing modalities for products in e-commerce scenarios? We propose the Missing Modality Product Completion Benchmark (MMPCBench), which consists of two sub-benchmarks: a Content Quality Completion Benchmark and a Recommendation Benchmark. We further evaluate six state-of-the-art MLLMs from the Qwen2.5-VL and Gemma-3 model families across nine real-world e-commerce categories, focusing on image-to-text and text-to-image completion tasks. Experimental results show that while MLLMs can capture high-level semantics, they struggle with fine-grained word-level and pixel- or patch-level alignment. In addition, performance varies substantially across product categories and model scales, and we observe no trivial correlation between model size and performance, in contrast to trends commonly reported in mainstream benchmarks. We also explore Group Relative Policy Optimization (GRPO) to better align MLLMs with this task. GRPO improves image-to-text completion but does not yield gains for text-to-image completion. Overall, these findings expose the limitations of current MLLMs in real-world cross-modal generation and represent an early step toward more effective missing-modality product completion.
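The central idea of GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt, so no separate value network is needed. The sketch below shows only that group-relative advantage step, under the usual mean-and-standard-deviation normalization. The reward values and the `eps` constant are illustrative assumptions, and how this paper defines rewards for product completion is not stated in the abstract.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and standard deviation of its own group."""
    if not rewards:
        raise ValueError("a group needs at least one reward")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; eps guards against a zero spread
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical completions for one product, scored by some reward function.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat their group's average get a positive advantage and are reinforced, those below it get a negative one, and a group with identical rewards contributes no learning signal.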

[CV-12] DiffStyle3D: Consistent 3D Gaussian Stylization via Attention Optimization

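[Quick read]: This paper addresses the shortcomings of existing 3D style transfer methods in modeling multi-view consistency: VGG- and CLIP-based methods struggle to maintain multi-view consistency within the model itself, while diffusion models can capture it but rely on denoising directions, which makes training unstable. The key to the solution is DiffStyle3D, a diffusion framework that optimizes directly in the latent space. It introduces an Attention-Aware Loss that aligns style features in the self-attention space while preserving content features, and a Geometry-Guided Multi-View Consistency mechanism that uses geometric information to strengthen cross-view correspondence modeling, together with a geometry-aware mask that avoids redundant optimization in regions where views overlap, significantly improving multi-view consistency and visual realism.

Link: https://arxiv.org/abs/2601.19717
Authors: Yitong Yang,Xuexin Liu,Yinglin Wang,Jing Wang,Hao Dou,Changshuo Wang,Shuting He
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: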

Abstract:3D style transfer enables the creation of visually expressive 3D content, enriching the visual appearance of 3D scenes and objects. However, existing VGG- and CLIP-based methods struggle to model multi-view consistency within the model itself, while diffusion-based approaches can capture such consistency but rely on denoising directions, leading to unstable training. To address these limitations, we propose DiffStyle3D, a novel diffusion-based paradigm for 3DGS style transfer that directly optimizes in the latent space. Specifically, we introduce an Attention-Aware Loss that performs style transfer by aligning style features in the self-attention space, while preserving original content through content feature alignment. Inspired by the geometric invariance of 3D stylization, we propose a Geometry-Guided Multi-View Consistency method that integrates geometric information into self-attention to enable cross-view correspondence modeling. Based on geometric information, we additionally construct a geometry-aware mask to prevent redundant optimization in overlapping regions across views, which further improves multi-view consistency. Extensive experiments show that DiffStyle3D outperforms state-of-the-art methods, achieving higher stylization quality and visual realism.

[CV-13] Self-Supervised Weight Templates for Scalable Vision Model Initialization

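[Quick read]: This paper addresses the gap between the need to deploy pre-trained models at diverse architecture sizes and conventional fixed-size pre-training, which cannot flexibly adapt to vision models of different depths and widths. The key to the solution is SWEET, a self-supervised framework whose core innovation is learning, under Tucker factorization, a shared weight template and size-specific weight scalers, giving a modular design that supports efficient initialization of variable-sized architectures. In addition, width-wise stochastic scaling further improves generalization under width expansion and encourages robust representation learning across widths.

Link: https://arxiv.org/abs/2601.19694
Authors: Yucheng Xie,Fu Feng,Ruixiao Shi,Jing Wang,Yong Rui,Xin Geng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: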

Abstract:The increasing scale and complexity of modern model parameters underscore the importance of pre-trained models. However, deployment often demands architectures of varying sizes, exposing limitations of conventional pre-training and fine-tuning. To address this, we propose SWEET, a self-supervised framework that performs constraint-based pre-training to enable scalable initialization in vision tasks. Instead of pre-training a fixed-size model, we learn a shared weight template and size-specific weight scalers under Tucker-based factorization, which promotes modularity and supports flexible adaptation to architectures with varying depths and widths. Target models are subsequently initialized by composing and reweighting the template through lightweight weight scalers, whose parameters can be efficiently learned from minimal training data. To further enhance flexibility in width expansion, we introduce width-wise stochastic scaling, which regularizes the template along width-related dimensions and encourages robust, width-invariant representations for improved cross-width generalization. Extensive experiments on classification, detection, segmentation and generation tasks demonstrate the state-of-the-art performance of SWEET for initializing variable-sized vision models.

[CV-14] DSVM-UNet: Enhancing VM-UNet with Dual Self-distillation for Medical Image Segmentation

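[Quick read]: This paper addresses the problem that current Vision Mamba based medical image segmentation models (such as VM-UNet) rely on increasingly complex architectures to improve semantic perception, which raises computational cost. The key to the solution is Dual Self-distillation (DS), which requires no change to the network structure: it aligns features at both the global and local levels, strengthening the model's representations while keeping the linear-time complexity advantage, and yields better segmentation performance and computational efficiency on the ISIC2017, ISIC2018, and Synapse benchmarks.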

Link: https://arxiv.org/abs/2601.19690
Authors: Renrong Shao, Dongyang Li, Dong Xia, Lin Shao, Jiangdong Lu, Fen Zheng, Lulu Zhang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 1 figures

Abstract:Vision Mamba models have been extensively researched in various fields, which address the limitations of previous models by effectively managing long-range dependencies with a linear-time overhead. Several prospective studies have further designed Vision Mamba based on UNet (VM-UNet) for medical image segmentation. These approaches primarily focus on optimizing architectural designs by creating more complex structures to enhance the model’s ability to perceive semantic features. In this paper, we propose a simple yet effective approach to improve the model by Dual Self-distillation for VM-UNet (DSVM-UNet) without any complex architectural designs. To achieve this goal, we develop double self-distillation methods to align the features at both the global and local levels. Extensive experiments conducted on the ISIC2017, ISIC2018, and Synapse benchmarks demonstrate that our approach achieves state-of-the-art performance while maintaining computational efficiency. Code is available at this https URL.

[CV-15] Video-KTR: Reinforcing Video Reasoning via Key Token Attribution ICLR2026

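[Quick read]: This paper addresses a limitation of existing video reasoning methods that apply reinforcement learning (RL): they rely on coarse sequence-level rewards or single-factor token selection and neglect the fine-grained links among visual inputs, temporal dynamics, and linguistic outputs, which limits both accuracy and interpretability. The key to the solution is the Video-KTR framework, which selects key tokens with three attribution signals for fine-grained policy optimization: (1) visual-aware tokens, identified via counterfactual masking, that reveal perceptual dependence; (2) temporally sensitive tokens detected through frame shuffling; and (3) high-entropy tokens that capture predictive uncertainty. RL updates are applied only to these semantically informative, modality-sensitive tokens, which improves performance and interpretability on complex video understanding tasks.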

Link: https://arxiv.org/abs/2601.19686
Authors: Ziyue Wang, Sheng Jin, Zhongrong Zuo, Jiawei Wu, Han Qiu, Qi She, Hao Zhang, Xudong Jiang
Affiliations: ByteDance; School of Electrical and Electronic Engineering, Nanyang Technological University; National University of Singapore; College of Computing and Data Science, Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICLR 2026

Abstract:Reinforcement learning (RL) has shown strong potential for enhancing reasoning in multimodal large language models, yet existing video reasoning methods often rely on coarse sequence-level rewards or single-factor token selection, neglecting fine-grained links among visual inputs, temporal dynamics, and linguistic outputs, limiting both accuracy and interpretability. We propose Video-KTR, a modality-aware policy shaping framework that performs selective, token-level RL by combining three attribution signals: (1) visual-aware tokens identified via counterfactual masking to reveal perceptual dependence; (2) temporal-aware tokens detected through frame shuffling to expose temporal sensitivity; and (3) high-entropy tokens signaling predictive uncertainty. By reinforcing only these key tokens, Video-KTR focuses learning on semantically informative, modality-sensitive content while filtering out low-value tokens. Across five challenging benchmarks, Video-KTR achieves state-of-the-art or highly competitive results, achieving 42.7% on Video-Holmes (surpassing GPT-4o) with consistent gains on both reasoning and general video understanding tasks. Ablation studies verify the complementary roles of the attribution signals and the robustness of targeted token-level updates. Overall, Video-KTR improves accuracy and interpretability, offering a simple, drop-in extension to RL for complex video reasoning. Our code and models are available at this https URL.
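
The third attribution signal, high-entropy tokens, can be sketched in a few lines. The example below ranks positions by the Shannon entropy of their next-token distribution and keeps a fixed fraction. The function names and the `keep_ratio` value are assumptions made for illustration; the counterfactual-masking and frame-shuffling signals, and the way Video-KTR combines all three, are not shown.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_high_entropy(dists, keep_ratio=0.5):
    """Return the sorted indices of the top `keep_ratio` tokens by entropy."""
    k = max(1, int(len(dists) * keep_ratio))
    ranked = sorted(range(len(dists)),
                    key=lambda i: token_entropy(dists[i]),
                    reverse=True)
    return sorted(ranked[:k])

# One probability distribution per generated token (toy values).
dists = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
]
key_tokens = select_high_entropy(dists, keep_ratio=0.5)

# A token-level RL loss could then be masked to these positions only.
mask = [1 if i in key_tokens else 0 for i in range(len(dists))]
```

Masking the loss rather than dropping tokens keeps the sequence intact for the forward pass while restricting gradient updates to the selected positions.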

[CV-16] SharpNet: Enhancing MLPs to Represent Functions with Controlled Non-differentiability

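[Quick read]: This paper addresses the limitation of multi-layer perceptrons (MLPs) when approximating functions with prescribed C^0 sharp features (such as edges and corners): because standard MLPs are inherently globally smooth, they struggle to represent functions that are continuous but non-smooth, and usually need extra post-processing to recover sharp features. The key to the solution is the SharpNet architecture, which enriches the network with an auxiliary feature function obtained as the solution of a Poisson equation with jump Neumann boundary conditions and evaluated through a fully differentiable local integral, so that feature locations and MLP parameters can be optimized jointly. This design gives SharpNet precise control over C^0-continuity: the function is C^0-continuous at the specified feature locations and smooth elsewhere, which preserves sharp geometric structure and avoids the smoothing of gradient discontinuities common in existing methods.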

Link: https://arxiv.org/abs/2601.19683
Authors: Hanting Niu, Junkai Deng, Fei Hou, Wencheng Wang, Ying He
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multi-layer perceptrons (MLPs) are a standard tool for learning and function approximation, but they inherently yield outputs that are globally smooth. As a result, they struggle to represent functions that are continuous yet deliberately non-differentiable (i.e., with prescribed C^0 sharp features) without relying on ad hoc post-processing. We present SharpNet, a modified MLP architecture capable of encoding functions with user-defined sharp features by enriching the network with an auxiliary feature function, which is defined as the solution to a Poisson equation with jump Neumann boundary conditions. It is evaluated via an efficient local integral that is fully differentiable with respect to the feature locations, enabling our method to jointly optimize both the feature locations and the MLP parameters to recover the target functions/models. The C^0-continuity of SharpNet is precisely controllable, ensuring C^0-continuity at the feature locations and smoothness elsewhere. We validate SharpNet on 2D problems and 3D CAD model reconstruction, and compare it against several state-of-the-art baselines. In both types of tasks, SharpNet accurately recovers sharp edges and corners while maintaining smooth behavior away from those features, whereas existing methods tend to smooth out gradient discontinuities. Both qualitative and quantitative evaluations highlight the benefits of our approach.

[CV-17] A new Image Similarity Metric for a Perceptual and Transparent Geometric and Chromatic Assessment

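[Quick read]: This paper addresses the lack of perceptual consistency in mainstream image similarity metrics, which perform poorly especially when texture distortion is present. The key to the solution is a new perceptual metric with two terms: the first uses the Earth Mover's Distance to quantify the texture dissimilarity between two images, and the second computes the chromatic dissimilarity in the Oklab perceptual color space. The metric is validated on the Berkeley-Adobe Perceptual Patch Similarity dataset, which contains complex shape and color distortions; the results show that it outperforms state-of-the-art methods and is more perceptually sensitive, especially under shape distortion. In addition, unlike black-box deep models that output only a similarity score, the metric provides visual explanations that support the score, making the similarity assessment interpretable and transparent.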

Link: https://arxiv.org/abs/2601.19680
Authors: Antonio Di Marino, Vincenzo Bevilacqua, Emanuel Di Nardo, Angelo Ciaramella, Ivanoe De Falco, Giovanna Sannino
Affiliations: Institute for High-Performance Computing and Networking (ICAR) - National Research Council (CNR); University of Naples Federico II; University of Naples Parthenope
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In the literature, several studies have shown that state-of-the-art image similarity metrics are not perceptual metrics; moreover, they have difficulty evaluating images, especially when texture distortion is also present. In this work, we propose a new perceptual metric composed of two terms. The first term evaluates the dissimilarity between the textures of two images using Earth Mover’s Distance. The second term evaluates the chromatic dissimilarity between two images in the Oklab perceptual color space. We evaluated the performance of our metric on a non-traditional dataset, called Berkeley-Adobe Perceptual Patch Similarity, which contains a wide range of complex distortions in shapes and colors. We have shown that our metric outperforms the state of the art, especially when images contain shape distortions, confirming also its greater perceptiveness. Furthermore, although deep black-box metrics could be very accurate, they only provide similarity scores between two images, without explaining their main differences and similarities. Our metric, on the other hand, provides visual explanations to support the calculated score, making the similarity assessment transparent and justified.
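
For readers unfamiliar with the Earth Mover's Distance used in the first term, here is a minimal sketch of its one-dimensional form, where the distance between two histograms on the same bins reduces to the summed absolute difference of their cumulative distributions. How the paper builds its texture descriptors and applies the distance to them is not stated in the abstract, so the function `emd_1d` and its inputs are assumptions for illustration only.

```python
def emd_1d(hist_a, hist_b):
    """Earth Mover's Distance between two 1D histograms on the same bins.

    Both histograms are normalized to unit mass. With unit bin spacing, the
    distance equals the sum of absolute differences between the two CDFs.
    """
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must share the same bins")
    total_a, total_b = sum(hist_a), sum(hist_b)
    if total_a <= 0 or total_b <= 0:
        raise ValueError("histograms must have positive mass")
    cdf_gap = 0.0   # running difference between the two CDFs
    distance = 0.0
    for a, b in zip(hist_a, hist_b):
        cdf_gap += a / total_a - b / total_b
        distance += abs(cdf_gap)
    return distance

# Moving all mass two bins to the right costs 2.0.
print(emd_1d([1, 0, 0], [0, 0, 1]))  # 2.0
```

Unlike a bin-by-bin difference, this distance grows with how far the mass has to move, which is why it is a natural fit for comparing shifted or deformed structures.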

[CV-18] KeepLoRA: Continual Learning with Residual Gradient Adaptation ICLR2026

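[Quick read]: This paper addresses three competing objectives in continual learning for pre-trained vision-language models: retaining pre-trained knowledge, preserving knowledge from sequentially learned tasks, and maintaining the plasticity to acquire new knowledge. The key to the solution is KeepLoRA. An analysis of how knowledge is encoded in the model parameter space shows that general knowledge lies mainly in the principal subspace, while task-specific knowledge is encoded in the residual subspace. KeepLoRA therefore restricts LoRA (Low-Rank Adaptation) parameter updates to a subspace orthogonal to both the principal subspace of the pre-trained model and the dominant directions of previous task features, which avoids interference with existing knowledge, balances the three objectives, and achieves state-of-the-art performance in experiments.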

Link: https://arxiv.org/abs/2601.19659
Authors: Mao-Lin Luo, Zi-Hao Zhou, Yi-Lin Zhang, Yuanyu Wan, Tong Wei, Min-Ling Zhang
Affiliations: Southeast University; Ministry of Education; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at ICLR 2026

Abstract:Continual learning for pre-trained vision-language models requires balancing three competing objectives: retaining pre-trained knowledge, preserving knowledge from a sequence of learned tasks, and maintaining the plasticity to acquire new knowledge. This paper presents a simple but effective approach called KeepLoRA to effectively balance these objectives. We first analyze the knowledge retention mechanism within the model parameter space and find that general knowledge is mainly encoded in the principal subspace, while task-specific knowledge is encoded in the residual subspace. Motivated by this finding, KeepLoRA learns new tasks by restricting LoRA parameter updates in the residual subspace to prevent interfering with previously learned capabilities. Specifically, we infuse knowledge for a new task by projecting its gradient onto a subspace orthogonal to both the principal subspace of pre-trained model and the dominant directions of previous task features. Our theoretical and empirical analyses confirm that KeepLoRA balances the three objectives and achieves state-of-the-art performance. The implementation code is available at this https URL.
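
The projection step described above can be sketched with plain vectors: orthonormalize the protected directions, then subtract the gradient's components along them. This is a generic orthogonal-projection example under stated assumptions; the names `project_out` and `protected` are hypothetical, and how KeepLoRA actually obtains the principal subspace and the dominant feature directions is not shown.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def orthonormalize(vectors, tol=1e-10):
    """Gram-Schmidt: return an orthonormal basis for span(vectors)."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis

def project_out(grad, protected):
    """Remove from grad every component lying in span(protected)."""
    residual = list(grad)
    for b in orthonormalize(protected):
        c = dot(residual, b)
        residual = [ri - c * bi for ri, bi in zip(residual, b)]
    return residual

protected = [
    [1.0, 0.0, 0.0, 0.0],  # stands in for a principal direction of the pre-trained weights
    [1.0, 1.0, 0.0, 0.0],  # stands in for a dominant direction of earlier task features
]
grad = [3.0, -2.0, 5.0, 1.0]
update = project_out(grad, protected)
```

The resulting `update` has zero inner product with every protected direction, so a step along it leaves the model's response in those directions unchanged to first order.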

[CV-19] Towards Governance-Oriented Low-Altitude Intelligence: A Management-Centric Multi-Modal Benchmark With Implicitly Coordinated Vision-Language Reasoning Framework

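[Quick read]: This paper addresses the difficulty current low-altitude vision systems have in delivering management-oriented anomaly understanding for smart city governance. The core challenge is that existing object-centric perception paradigms and loosely coupled vision-language pipelines cannot effectively support semantic reasoning and decision-making in real urban management scenarios. The key to the solution is GovLA-10K, the first management-oriented multi-modal benchmark, together with GovLA-Reasoner, a unified vision-language reasoning framework. GovLA-10K focuses on functionally salient targets and provides actionable management suggestions, while GovLA-Reasoner introduces an efficient feature adapter that implicitly coordinates discriminative representation sharing between the visual detector and the large language model (LLM). This couples fine-grained visual grounding with high-level contextual language reasoning and significantly improves performance without fine-tuning any task-specific component.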

Link: https://arxiv.org/abs/2601.19640
Authors: Hao Chang, Zhihui Wang, Lingxiang Wu, Peijin Wang, Wenhui Diao, Jinqiao Wang
Affiliations: Aerospace Information Research Institute, Chinese Academy of Sciences, China; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, China; Zidong Taichu (Beijing) Technology Co., Ltd., China; Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, China; Wuhan AI Research, Wuhan, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Low-altitude vision systems are becoming a critical infrastructure for smart city governance. However, existing object-centric perception paradigms and loosely coupled vision-language pipelines are still difficult to support management-oriented anomaly understanding required in real-world urban governance. To bridge this gap, we introduce GovLA-10K, the first management-oriented multi-modal benchmark for low-altitude intelligence, along with GovLA-Reasoner, a unified vision-language reasoning framework tailored for governance-aware aerial perception. Unlike existing studies that aim to exhaustively annotate all visible objects, GovLA-10K is deliberately designed around functionally salient targets that directly correspond to practical management needs, and further provides actionable management suggestions grounded in these observations. To effectively coordinate the fine-grained visual grounding with high-level contextual language reasoning, GovLA-Reasoner introduces an efficient feature adapter that implicitly coordinates discriminative representation sharing between the visual detector and the large language model (LLM). Extensive experiments show that our method significantly improves performance while avoiding the need of fine-tuning for any task-specific individual components. We believe our work offers a new perspective and foundation for future studies on management-aware low-altitude vision-language systems.

[CV-20] The role of self-supervised pretraining in differentially private medical image analysis

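[Quick read]: This paper addresses the substantial loss of diagnostic performance that differential privacy (DP) causes in medical image analysis, and in particular how the model initialization strategy affects performance, fairness, and generalization under DP. The key to the solution is a systematic evaluation of different initializations under full-model DP training. It finds that self-supervised DINOv3 initialization improves diagnostic utility under DP relative to ImageNet initialization, but remains inferior to domain-specific supervised pretraining (on MIMIC-CXR), which comes closest to the non-private baselines. Initialization choice also strongly influences demographic fairness and cross-dataset generalization. The results indicate that initialization strategy is a central determinant of the utility of differentially private medical imaging models.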

Link: https://arxiv.org/abs/2601.19618
Authors: Soroosh Tayebi Arasteh, Mina Farajiamiri, Mahshad Lotfinia, Behrus Hinrichs-Puladi, Jonas Bienzeisler, Mohamed Alhaskir, Mirabela Rusu, Christiane Kuhl, Sven Nebelung, Daniel Truhn
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Differential privacy (DP) provides formal protection for sensitive data but typically incurs substantial losses in diagnostic performance. Model initialization has emerged as a critical factor in mitigating this degradation, yet the role of modern self-supervised learning under full-model DP remains poorly understood. Here, we present a large-scale evaluation of initialization strategies for differentially private medical image analysis, using chest radiograph classification as a representative benchmark with more than 800,000 images. Using state-of-the-art ConvNeXt models trained with DP-SGD across realistic privacy regimes, we compare non-domain-specific supervised ImageNet initialization, non-domain-specific self-supervised DINOv3 initialization, and domain-specific supervised pretraining on MIMIC-CXR, the largest publicly available chest radiograph dataset. Evaluations are conducted across five external datasets spanning diverse institutions and acquisition settings. We show that DINOv3 initialization consistently improves diagnostic utility relative to ImageNet initialization under DP, but remains inferior to domain-specific supervised pretraining, which achieves performance closest to non-private baselines. We further demonstrate that initialization choice strongly influences demographic fairness, cross-dataset generalization, and robustness to data scale and model capacity under privacy constraints. The results establish initialization strategy as a central determinant of utility, fairness, and generalization in differentially private medical imaging.
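
The DP-SGD training mentioned above rests on one per-batch operation: clip each per-sample gradient to a fixed L2 norm, sum, add Gaussian noise scaled to that norm, and average. The sketch below shows that step in isolation. The function names and the values of `max_norm` and `noise_multiplier` are assumptions chosen for illustration; privacy accounting, and the study's actual ConvNeXt training setup, are out of scope.

```python
import random

def clip(grad, max_norm):
    """Scale grad so that its L2 norm is at most max_norm."""
    norm = sum(g * g for g in grad) ** 0.5
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_gradient(per_sample_grads, max_norm, noise_multiplier, rng):
    """Clip each per-sample gradient, sum, add Gaussian noise, then average."""
    dim = len(per_sample_grads[0])
    summed = [0.0] * dim
    for grad in per_sample_grads:
        for j, g in enumerate(clip(grad, max_norm)):
            summed[j] += g
    sigma = noise_multiplier * max_norm  # noise std is tied to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]

rng = random.Random(0)  # seeded only to make the toy run repeatable
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]
private_grad = dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Clipping bounds how much any single image can move the update, and the noise hides what remains; a well-initialized model needs fewer and smaller updates, which is one intuition for why initialization matters so much under DP.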

[CV-21] GMS-CAVP: Improving Audio-Video Correspondence with Multi-Scale Contrastive and Generative Pretraining

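[Quick read]: This paper addresses the weakness of existing joint video-audio (V-A) representation learning methods in modeling dense, multi-scale spatial-temporal correspondences, in particular their underuse of fine- to coarse-grained spatial-temporal structure. The key to the solution is the GMS-CAVP framework, built on two core mechanisms: a multi-scale contrastive learning strategy that captures semantic and temporal correspondences at different granularities, and a diffusion-based generative objective that goes beyond conventional discriminative contrastive learning to enable cross-modal translation and synthesis. Together they form a unified discriminative-generative formulation that improves V-A correspondence modeling and generation quality.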

Link: https://arxiv.org/abs/2601.19606
Authors: Shentong Mo, Zehua Chen, Jun Zhu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Recent advances in video-audio (V-A) understanding and generation have increasingly relied on joint V-A embeddings, which serve as the foundation for tasks such as cross-modal retrieval and generation. While prior methods like CAVP effectively model semantic and temporal correspondences between modalities using contrastive objectives, their performance remains suboptimal. A key limitation is the insufficient modeling of the dense, multi-scale nature of both video and audio signals: correspondences often span fine- to coarse-grained spatial-temporal structures, which are underutilized in existing frameworks. To this end, we propose GMS-CAVP, a novel framework that combines Multi-Scale Video-Audio Alignment and Multi-Scale Spatial-Temporal Diffusion-based pretraining objectives to enhance V-A correspondence modeling. First, GMS-CAVP introduces a multi-scale contrastive learning strategy that captures semantic and temporal relations across varying granularities. Second, we go beyond traditional contrastive learning by incorporating a diffusion-based generative objective, enabling modality translation and synthesis between video and audio. This unified discriminative-generative formulation facilitates deeper cross-modal understanding and paves the way for high-fidelity generation. Extensive experiments on VGGSound, AudioSet, and Panda70M demonstrate that GMS-CAVP outperforms previous methods in generation and retrieval.
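
One plausible reading of multi-scale contrastive alignment is to apply the same InfoNCE loss to features pooled over several temporal windows and average the results. The sketch below does that with toy 2D features. The pooling windows, the temperature, and all function names are assumptions made for illustration; GMS-CAVP's actual encoders, its scale definitions, and its diffusion-based objective are not reproduced here.

```python
import math

def avg_pool(seq, window):
    """Average consecutive frame features in non-overlapping windows."""
    return [[sum(col) / window for col in zip(*seq[i:i + window])]
            for i in range(0, len(seq) - window + 1, window)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def info_nce(video, audio, temperature=0.1):
    """Mean InfoNCE loss where video[i] and audio[i] form the positive pair."""
    loss = 0.0
    for i, v in enumerate(video):
        logits = [cosine(v, a) / temperature for a in audio]
        top = max(logits)  # subtract the max for numerical stability
        log_denom = top + math.log(sum(math.exp(x - top) for x in logits))
        loss += log_denom - logits[i]
    return loss / len(video)

def multi_scale_loss(video, audio, windows=(1, 2)):
    """Average the contrastive loss over several temporal pooling windows."""
    return sum(info_nce(avg_pool(video, w), avg_pool(audio, w))
               for w in windows) / len(windows)

# Toy per-frame features; the audio track is either aligned or time-reversed.
video = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = multi_scale_loss(video, video)
shuffled = multi_scale_loss(video, video[::-1])
```

In this toy setting the aligned pair yields a lower loss than the time-reversed pair at both scales, which is the behavior a temporal correspondence objective is meant to reward.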

[CV-22] Localized Latent Editing for Dose-Response Modeling in Botulinum Toxin Injection Planning

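[Quick read]: This paper addresses inconsistent treatment outcomes caused by botulinum toxin dosing that relies on clinical experience rather than quantitative evidence. The core of the solution is a localized latent editing framework: Region-Specific Latent Axis Discovery in the latent space of StyleGAN2 learns relaxation trajectories for specific facial muscle groups, and a dose-response prediction model is built on top of them, enabling precise simulation and control of local facial morphology changes without global side effects. The method combines generative simulation with interactive Human-in-the-Loop refinement, with the aim of making injection planning more quantitative and individualized.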

Link: https://arxiv.org/abs/2601.19593
Authors: Estèphe Arnaud, Mohamed Daoudi, Pierre Guerreschi
Affiliations: Univ. Lille; CNRS; Centrale Lille; Institut Mines-Télécom; UMR 9189 CRIStAL; IMT Nord Europe; Centre for Digital Systems; Lille University Hospital; SATT Nord
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Botulinum toxin (Botox) injections are the gold standard for managing facial asymmetry and aesthetic rejuvenation, yet determining the optimal dosage remains largely intuitive, often leading to suboptimal outcomes. We propose a localized latent editing framework that simulates Botulinum Toxin injection effects for injection planning through dose-response modeling. Our key contribution is a Region-Specific Latent Axis Discovery method that learns localized muscle relaxation trajectories in StyleGAN2’s latent space, enabling precise control over specific facial regions without global side effects. By correlating these localized latent trajectories with injected toxin units, we learn a predictive dose-response model. We rigorously compare two approaches: direct metric regression versus image-based generative simulation on a clinical dataset of N=360 images from 46 patients. On a hold-out test set, our framework demonstrates moderate-to-strong structural correlations for geometric asymmetry metrics, confirming that the generative model correctly captures the direction of morphological changes. While biological variability limits absolute precision, we introduce a hybrid “Human-in-the-Loop” workflow where clinicians interactively refine simulations, bridging the gap between pathological reconstruction and cosmetic planning.

[CV-23] ScenePilot-Bench: A Large-Scale Dataset and Benchmark for Evaluation of Vision-Language Models in Autonomous Driving

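[Quick read]: This paper addresses the lack of a systematic benchmark for evaluating vision-language models (VLMs) in autonomous driving, where performance boundaries in key capabilities such as scene understanding, spatial perception, and motion planning remain unclear. The key to the solution is ScenePilot-Bench, a large-scale first-person driving benchmark built on the ScenePilot-4K dataset of 3,847 hours of driving video with multi-granularity annotations (scene descriptions, risk assessments, key participant identification, and more). It defines a four-axis evaluation suite (scene understanding, spatial perception, motion planning, and GPT-Score) with safety-aware metrics and cross-region generalization settings, providing a comprehensive, quantifiable evaluation framework for VLMs in safety-critical autonomous driving tasks.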

Link: https://arxiv.org/abs/2601.19582
Authors: Yujin Wang, Yutong Zheng, Wenxian Fan, Tianyi Wang, Hongqing Chu, Daxin Tian, Bingzhao Gao, Jianqiang Wang, Hong Chen
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we introduce ScenePilot-Bench, a large-scale first-person driving benchmark designed to evaluate vision-language models (VLMs) in autonomous driving scenarios. ScenePilot-Bench is built upon ScenePilot-4K, a diverse dataset comprising 3,847 hours of driving videos, annotated with multi-granularity information including scene descriptions, risk assessments, key participant identification, ego trajectories, and camera parameters. The benchmark features a four-axis evaluation suite that assesses VLM capabilities in scene understanding, spatial perception, motion planning, and GPT-Score, with safety-aware metrics and cross-region generalization settings. We benchmark representative VLMs on ScenePilot-Bench, providing empirical analyses that clarify current performance boundaries and identify gaps for driving-oriented reasoning. ScenePilot-Bench offers a comprehensive framework for evaluating and advancing VLMs in safety-critical autonomous driving contexts.

[CV-24] QuaMo: Quaternion Motions for Vision-based 3D Human Kinematics Capture ICLR2026

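[Quick read]: This paper addresses implausible and jittery motion in vision-based 3D human motion capture caused by ignoring temporal consistency between frames, and in particular the discontinuity and unstable reconstruction that arise when existing kinematics-based methods rely on Euler angles. The key to the solution is QuaMo, a new method based on quaternion differential equations (QDE). It solves the QDE under the quaternion unit-sphere constraint to keep pose transitions continuous, and introduces a novel acceleration enhancement in which a meta-PD controller adaptively regulates the control signals during rapid pose changes, yielding accurate 3D human motion estimation without discontinuities.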

Link: https://arxiv.org/abs/2601.19580
Authors: Cuong Le, Pavlo Melnyk, Urs Waldmann, Mårten Wadenbäck, Bastian Wandt
Affiliations: Linköping University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures, accepted to ICLR 2026

Abstract:Vision-based 3D human motion capture from videos remains a challenge in computer vision. Traditional 3D pose estimation approaches often ignore the temporal consistency between frames, causing implausible and jittery motion. The emerging field of kinematics-based 3D motion capture addresses these issues by estimating the temporal transitioning between poses instead. A major drawback in current kinematics approaches is their reliance on Euler angles. Despite their simplicity, Euler angles suffer from discontinuity that leads to unstable motion reconstructions, especially in online settings where trajectory refinement is unavailable. Contrarily, quaternions have no discontinuity and can produce continuous transitions between poses. In this paper, we propose QuaMo, a novel Quaternion Motions method using quaternion differential equations (QDE) for human kinematics capture. We utilize the state-space model, an effective system for describing real-time kinematics estimations, with quaternion state and the QDE describing quaternion velocity. The corresponding angular acceleration is computed from a meta-PD controller with a novel acceleration enhancement that adaptively regulates the control signals as the human quickly changes to a new pose. Unlike previous work, our QDE is solved under the quaternion unit-sphere constraint that results in more accurate estimations. Experimental results show that our novel formulation of the QDE with acceleration enhancement accurately estimates 3D human kinematics with no discontinuity and minimal implausibilities. QuaMo outperforms comparable state-of-the-art methods on multiple datasets, namely Human3.6M, Fit3D, SportsPose and AIST. The code is available at this https URL
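
To make the quaternion kinematics concrete, the sketch below integrates the standard relation dq/dt = 0.5 * q * (0, omega) with explicit Euler steps and renormalizes after each step so the state stays on the unit sphere. This is a generic textbook integrator written under stated assumptions: the function names, step size, and test rotation are illustrative, and neither the paper's constrained QDE solver nor its meta-PD controller is reproduced.

```python
import math

def quat_mul(p, q):
    """Hamilton product of quaternions given as (w, x, y, z)."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (pw * qw - px * qx - py * qy - pz * qz,
            pw * qx + px * qw + py * qz - pz * qy,
            pw * qy - px * qz + py * qw + pz * qx,
            pw * qz + px * qy - py * qx + pz * qw)

def normalize(q):
    """Project a quaternion back onto the unit sphere."""
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def integrate(q, omega, dt, steps):
    """Euler-integrate dq/dt = 0.5 * q * (0, omega), renormalizing each step.

    omega is the body-frame angular velocity in rad/s.
    """
    for _ in range(steps):
        dq = quat_mul(q, (0.0,) + tuple(omega))
        q = normalize(tuple(c + 0.5 * dt * d for c, d in zip(q, dq)))
    return q

# Rotate about the z-axis at pi/2 rad/s for one second (a quarter turn).
q_end = integrate((1.0, 0.0, 0.0, 0.0), (0.0, 0.0, math.pi / 2),
                  dt=1e-3, steps=1000)
```

The result stays close to (cos(pi/4), 0, 0, sin(pi/4)) with unit norm, and the trajectory passes through no angle wrap-around of the kind that makes Euler-angle states discontinuous.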

[CV-25] MaDiS: Taming Masked Diffusion Language Models for Sign Language Generation

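[Quick read]: This paper addresses the unidirectional context modeling and slow inference that result from using autoregressive language models for sign language generation (SLG). The key solution is MaDiS, a masked-diffusion-based language model that captures bidirectional dependencies and generates multiple tokens in parallel for better efficiency. It also uses a tri-level cross-modal pretraining scheme (jointly optimizing token-level, latent-level, and 3D physical-space objectives) to learn richer and more grounded sign representations, combined with a novel unmasking strategy with temporal checkpoints, which reduces the combinatorial complexity of unmasking orders during fine-tuning, and a mixture-of-parts embedding layer, which fuses information from part-wise sign tokens. The result is superior performance on multiple benchmarks and nearly 30% lower inference latency.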

Link: https://arxiv.org/abs/2601.19577
Authors: Ronglai Zuo, Rolandos Alexandros Potamias, Qi Sun, Evangelos Ververas, Jiankang Deng, Stefanos Zafeiriou
Affiliations: Imperial College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Sign language generation (SLG) aims to translate written texts into expressive sign motions, bridging communication barriers for the Deaf and Hard-of-Hearing communities. Recent studies formulate SLG within the language modeling framework using autoregressive language models, which suffer from unidirectional context modeling and slow token-by-token inference. To address these limitations, we present MaDiS, a masked-diffusion-based language model for SLG that captures bidirectional dependencies and supports efficient parallel multi-token generation. We further introduce a tri-level cross-modal pretraining scheme that jointly learns from token-, latent-, and 3D physical-space objectives, leading to richer and more grounded sign representations. To accelerate model convergence in the fine-tuning stage, we design a novel unmasking strategy with temporal checkpoints, reducing the combinatorial complexity of unmasking orders by over 10^41 times. In addition, a mixture-of-parts embedding layer is developed to effectively fuse information stored in different part-wise sign tokens through learnable gates and well-optimized codebooks. Extensive experiments on CSL-Daily, Phoenix-2014T, and How2Sign demonstrate that MaDiS achieves superior performance across multiple metrics, including DTW error and two newly introduced metrics, SiBLEU and SiCLIP, while reducing inference latency by nearly 30%. Code and models will be released on our project page.

[CV-26] The S3LI Vulcano Dataset: A Dataset for Multi-Modal SLAM in Unstructured Planetary Environments

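[Quick read]: This paper addresses the difficulty of evaluating and developing multi-modal simultaneous localization and mapping (SLAM) and place recognition algorithms in complex natural environments, in particular the shortage of high-quality, diverse, annotated data for visual and LiDAR fusion. The key to the solution is the release of the S3LI Vulcano dataset, which contains multi-modal sequences recorded on the volcanic island of Vulcano in the Aeolian Islands, Sicily, Italy, covering a variety of terrains and textures such as basaltic or iron-rich rocks, formations from old lava channels, dry vegetation, and water. An accompanying open source toolkit generates ground truth poses and prepares labelled samples for place recognition tasks, giving algorithm developers a reliable benchmark and processing support.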

Link: https://arxiv.org/abs/2601.19557
Authors: Riccardo Giubilato, Marcus Gerhard Müller, Marco Sewtz, Laura Alejandra Encinar Gonzalez, John Folkesson, Rudolph Triebel
Affiliations: German Aerospace Center (DLR); KTH Royal Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted submission to the 2026 IEEE Aerospace Conference

Abstract:We release the S3LI Vulcano dataset, a multi-modal dataset towards development and benchmarking of Simultaneous Localization and Mapping (SLAM) and place recognition algorithms that rely on visual and LiDAR modalities. Several sequences are recorded on the volcanic island of Vulcano, from the Aeolian Islands in Sicily, Italy. The sequences provide users with data from a variety of environments, textures and terrains, including basaltic or iron-rich rocks, geological formations from old lava channels, as well as dry vegetation and water. The data (this http URL) is accompanied by an open source toolkit (this http URL) providing tools for generating ground truth poses as well as preparation of labelled samples for place recognition tasks.

[CV-27] A Non-Invasive 3D Gait Analysis Framework for Quantifying Psychomotor Retardation in Major Depressive Disorder

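[Quick read]: This paper addresses the objective, interpretable assessment of motor symptoms in Major Depressive Disorder (MDD), in particular Psychomotor Retardation (PMR). Conventional clinical assessment relies on subjective judgment, and existing 3D motion capture is hard to deploy in routine clinical settings because of its hardware requirements. The key to the solution is a non-invasive computational framework that extracts 297 explicit gait biomechanical features from monocular RGB video. It combines Gravity-View Coordinates with a trajectory-correction algorithm that exploits the closed-loop topology of an adapted Timed Up and Go (TUG) protocol to mitigate monocular depth errors, and adds a stability-based machine learning approach that identifies robust motor signatures from small clinical datasets. The method detects PMR with 83.3% accuracy and explains 64% of the variance in overall depression severity (R^2=0.64).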

Link: https://arxiv.org/abs/2601.19526
Authors: Fouad Boutaleb, Emery Pierson, Mohamed Daoudi, Clémence Nineuil, Ali Amad, Fabien D’Hondt
Affiliations: Univ. Lille; CNRS; Centrale Lille; Institut Mines-Télécom; UMR 9189 CRIStAL; LIX; École Polytechnique; IP Paris; IMT Nord Europe; Centre for Digital Systems; Inserm; CHU Lille; U1172 - LilNCog - Lille Neuroscience & Cognition
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Predicting the status of Major Depressive Disorder (MDD) from objective, non-invasive methods is an active research field. Yet, extracting automatically objective, interpretable features for a detailed analysis of the patient state remains largely unexplored. Among MDD’s symptoms, Psychomotor retardation (PMR) is a core item, yet its clinical assessment remains largely subjective. While 3D motion capture offers an objective alternative, its reliance on specialized hardware often precludes routine clinical use. In this paper, we propose a non-invasive computational framework that transforms monocular RGB video into clinically relevant 3D gait kinematics. Our pipeline uses Gravity-View Coordinates along with a novel trajectory-correction algorithm that leverages the closed-loop topology of our adapted Timed Up and Go (TUG) protocol to mitigate monocular depth errors. This novel pipeline enables the extraction of 297 explicit gait biomechanical biomarkers from a single camera capture. To address the challenges of small clinical datasets, we introduce a stability-based machine learning framework that identifies robust motor signatures while preventing overfitting. Validated on the CALYPSO dataset, our method achieves an 83.3% accuracy in detecting PMR and explains 64% of the variance in overall depression severity (R^2=0.64). Notably, our study reveals a strong link between reduced ankle propulsion and restricted pelvic mobility to the depressive motor phenotype. These results demonstrate that physical movement serves as a robust proxy for the cognitive state, offering a transparent and scalable tool for the objective monitoring of depression in standard clinical environments.

[CV-28] Mocap Anywhere: Towards Pairwise-Distance based Motion Capture in the Wild (for the Wild)

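[Quick read]: This paper addresses the difficulty traditional motion capture systems have in operating reliably in challenging environments (such as outdoor scenes or settings with strong lighting or magnetic interference), as well as the sensitivity of existing methods to individual body shape and their need for per-subject calibration. The key solution is a system that measures sparse pairwise distances (PWD) with body-mounted ultra-wideband (UWB) wireless sensors, combined with Wild-Poser (WiP), a lightweight real-time Transformer architecture that predicts 3D joint positions directly from noisy or corrupted PWD data. This enables general-purpose full-body motion capture without external cameras or individual body-shape modeling, with robust in-the-wild operation and generalization across species.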

Link: https://arxiv.org/abs/2601.19519
Authors: Ofir Abramovich, Ariel Shamir, Andreas Aristidou
Affiliations: Reichman University; CYENS Centre of Excellence; University of Cyprus
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: 14 pages, 15 figures

Abstract:We introduce a novel motion capture system that reconstructs full-body 3D motion using only sparse pairwise distance (PWD) measurements from body-mounted ultra-wideband (UWB) sensors. Using time-of-flight ranging between wireless nodes, our method eliminates the need for external cameras, enabling robust operation in uncontrolled and outdoor environments. Unlike traditional optical or inertial systems, our approach is shape-invariant and resilient to environmental constraints such as lighting and magnetic interference. At the core of our system is Wild-Poser (WiP for short), a compact, real-time Transformer-based architecture that directly predicts 3D joint positions from noisy or corrupted PWD measurements, which can later be used for joint rotation reconstruction via learned methods. WiP generalizes across subjects of varying morphologies, including non-human species, without requiring individual body measurements or shape fitting. Operating in real time, WiP achieves low joint position error and demonstrates accurate 3D motion reconstruction for both human and animal subjects in-the-wild. Our empirical analysis highlights its potential for scalable, low-cost, and general purpose motion capture in real-world settings.

[CV-29] Bridging Information Asymmetry: A Hierarchical Framework for Deterministic Blind Face Restoration

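[Quick read]: This paper addresses reconstruction uncertainty in blind face restoration caused by sparse input information, that is, the information asymmetry between low-quality inputs and high-quality outputs, which induces a one-to-many mapping and leads to stochastic uncertainty and hallucinatory artifacts. The key to the solution is Pref-Restore, a hierarchical framework that mitigates this imbalance with two complementary strategies. The first augments input density: an auto-regressive integrator reformulates textual instructions into dense latent queries, injecting high-level semantic stability to constrain the degraded signals. The second prunes the output distribution: on-policy reinforcement learning is integrated into the diffusion restoration loop for the first time, turning human preferences into differentiable constraints that explicitly penalize stochastic deviations and sharpen the posterior toward the desired high-fidelity outcomes.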

链接: https://arxiv.org/abs/2601.19506
作者: Zhengjian Yao,Jiakui Hu,Kaiwen Li,Hangzhou He,Xinliang Zhang,Shuang Zeng,Lei Zhu,Yanye Lu
机构: Peking University (北京大学); Peking University Health Science Center (北京大学医学部); National Biomedical Imaging Center (国家生物医学成像中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract: Blind face restoration remains a persistent challenge due to the inherent ill-posedness of reconstructing holistic structures from severely constrained observations. Current generative approaches, while capable of synthesizing realistic textures, often suffer from information asymmetry: the intrinsic disparity between the information-sparse low quality inputs and the information-dense high quality outputs. This imbalance leads to a one-to-many mapping, where insufficient constraints result in stochastic uncertainty and hallucinatory artifacts. To bridge this gap, we present Pref-Restore, a hierarchical framework that integrates discrete semantic logic with continuous texture generation to achieve deterministic, preference-aligned restoration. Our methodology fundamentally addresses this information disparity through two complementary strategies: (1) Augmenting Input Density: We employ an auto-regressive integrator to reformulate textual instructions into dense latent queries, injecting high-level semantic stability to constrain the degraded signals; (2) Pruning Output Distribution: We pioneer the integration of on-policy reinforcement learning directly into the diffusion restoration loop. By transforming human preferences into differentiable constraints, we explicitly penalize stochastic deviations, thereby sharpening the posterior distribution toward the desired high-fidelity outcomes. Extensive experiments demonstrate that Pref-Restore achieves state-of-the-art performance across synthetic and real-world benchmarks. Furthermore, empirical analysis confirms that our preference-aligned strategy significantly reduces solution entropy, establishing a robust pathway toward reliable and deterministic blind restoration.

[CV-30] Cortex-Grounded Diffusion Models for Brain Image Generation

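[Quick read]: This paper addresses three key limitations of real brain magnetic resonance imaging (MRI) datasets: scarcity of rare phenotypes, domain shift across scanners, and insufficient longitudinal coverage. Existing generative models mostly rely on weak conditioning signals such as labels or text, which lack anatomical grounding and often produce biologically implausible images. The core of the solution is Cor2Vox, a cortex-grounded framework in which high-resolution cortical surfaces guide a 3D shape-to-image Brownian bridge diffusion process, enabling topologically faithful synthesis and precise control over the underlying anatomy. The authors also build a large-scale statistical shape model of cortical morphology from over 33,000 UK Biobank scans to support generating realistic, diverse brain shapes. Cor2Vox outperforms many baseline methods and, across three applications, preserves cortical morphology at the sub-voxel level while remaining robust to variations in cortical geometry and disease phenotype.

Link: https://arxiv.org/abs/2601.19498
Authors: Fabian Bongratz, Yitong Li, Sama Elbaroudy, Christian Wachinger
Affiliations: Lab for AI in Medical Imaging, Technical University of Munich, Germany; Munich Center for Machine Learning, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: preprint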

Abstract:Synthetic neuroimaging data can mitigate critical limitations of real-world datasets, including the scarcity of rare phenotypes, domain shifts across scanners, and insufficient longitudinal coverage. However, existing generative models largely rely on weak conditioning signals, such as labels or text, which lack anatomical grounding and often produce biologically implausible outputs. To this end, we introduce Cor2Vox, a cortex-grounded generative framework for brain magnetic resonance image (MRI) synthesis that ties image generation to continuous structural priors of the cerebral cortex. It leverages high-resolution cortical surfaces to guide a 3D shape-to-image Brownian bridge diffusion process, enabling topologically faithful synthesis and precise control over underlying anatomies. To support the generation of new, realistic brain shapes, we developed a large-scale statistical shape model of cortical morphology derived from over 33,000 UK Biobank scans. We validated the fidelity of Cor2Vox based on traditional image quality metrics, advanced cortical surface reconstruction, and whole-brain segmentation quality, outperforming many baseline methods. Across three applications, namely (i) anatomically consistent synthesis, (ii) simulation of progressive gray matter atrophy, and (iii) harmonization of in-house frontotemporal dementia scans with public datasets, Cor2Vox preserved fine-grained cortical morphology at the sub-voxel level, exhibiting remarkable robustness to variations in cortical geometry and disease phenotype without retraining.
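As a rough illustration of what a Brownian bridge diffusion process looks like, the forward process in such models is often written as an interpolation between the target image x_0 and the conditioning input y (here, a representation derived from the cortical surface), with noise that vanishes at both endpoints. The symbols, the linear schedule m_t = t/T, and the scale constant c below are assumptions for illustration; the abstract does not state the exact schedule Cor2Vox uses.

```latex
x_t = (1 - m_t)\, x_0 + m_t\, y + \sqrt{\delta_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I),
\qquad m_t = \frac{t}{T},
\qquad \delta_t = c\, m_t (1 - m_t)
```

At t = 0 this gives x_0 and at t = T it gives y, both without noise, so the process is pinned to the structural prior at one end and to the image at the other.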

[CV-31] Fast Converging 3D Gaussian Splatting for 1-Minute Reconstruction SIGGRAPH

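[Quick read]: This paper addresses high-quality 3D Gaussian Splatting (3DGS) reconstruction under a strict one-minute time budget, in particular the differing camera pose quality in the SIGGRAPH Asia 3DGS Fast Reconstruction Challenge: the initial round uses noisy SLAM poses, while the final round uses highly accurate COLMAP poses. The key solution is a two-stage strategy. For the SLAM round, the pipeline combines reverse per-Gaussian parallel optimization and compact forward splatting (based on Taming-GS and Speedy-splat), load-balanced tiling, an anchor-based Neural-Gaussian representation, initialization from monocular depth and partially from feed-forward 3DGS models, and a global pose refinement module for robustness to noisy trajectories. For the COLMAP round, it disables pose refinement, reverts to standard 3DGS to eliminate MLP inference overhead, and adds multi-view consistency-guided Gaussian splitting (inspired by Fast-GS) plus a depth estimator that supervises rendered depth. The method achieved the top PSNR of 28.43 and ranked first in the competition.

Link: https://arxiv.org/abs/2601.19489
Authors: Ziyu Zhang, Tianle Liu, Diantao Tu, Shuhan Shen
Affiliations: University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences; Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: First Rank of SIGGRAPH Asia 2025 3DGS Challenge. Code available at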

Abstract:We present a fast 3DGS reconstruction pipeline designed to converge within one minute, developed for the SIGGRAPH Asia 3DGS Fast Reconstruction Challenge. The challenge consists of an initial round using SLAM-generated camera poses (with noisy trajectories) and a final round using COLMAP poses (highly accurate). To robustly handle these heterogeneous settings, we develop a two-stage solution. In the first round, we use reverse per-Gaussian parallel optimization and compact forward splatting based on Taming-GS and Speedy-splat, load-balanced tiling, an anchor-based Neural-Gaussian representation enabling rapid convergence with fewer learnable parameters, initialization from monocular depth and partially from feed-forward 3DGS models, and a global pose refinement module for noisy SLAM trajectories. In the final round, the accurate COLMAP poses change the optimization landscape; we disable pose refinement, revert from Neural-Gaussians back to standard 3DGS to eliminate MLP inference overhead, introduce multi-view consistency-guided Gaussian splitting inspired by Fast-GS, and introduce a depth estimator to supervise the rendered depth. Together, these techniques enable high-fidelity reconstruction under a strict one-minute budget. Our method achieved the top performance with a PSNR of 28.43 and ranked first in the competition.

[CV-32] Entropy-Guided k-Guard Sampling for Long-Horizon Autoregressive Video Generation

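[Quick read]: This paper addresses quality degradation in autoregressive video generation caused by the mismatch between static top-k/top-p sampling and video tokens, which have low semantic density and high spatio-temporal redundancy. Static sampling introduces unnecessary randomness in low-uncertainty regions (such as static backgrounds) and gets stuck in early errors in high-uncertainty regions (such as foreground objects); the resulting error accumulation severely degrades long-horizon quality. The key solution is Entropy-Guided k-Guard (ENkG) sampling, which adjusts the number of candidate tokens according to the entropy of each token's predicted distribution: fewer candidates in low-entropy regions to suppress redundant noise and preserve structure, more candidates in high-entropy regions to mitigate error compounding. The method is training-free, model-agnostic, adds negligible overhead, and improves perceptual quality and structural stability.

Link: https://arxiv.org/abs/2601.19488
Authors: Yizhao Han, Tianxing Shi, Zhao Wang, Zifan Xu, Zhiyuan Pu, Mingxiao Li, Qian Zhang, Wei Yin, Xiao-Xiao Long
Affiliations: Nanjing University; Horizon Robotics; China Mobile
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: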

Abstract:Autoregressive (AR) architectures have achieved significant successes in LLMs, inspiring explorations for video generation. In LLMs, top-p/top-k sampling strategies work exceptionally well: language tokens have high semantic density and low redundancy, so a fixed size of token candidates already strikes a balance between semantic accuracy and generation diversity. In contrast, video tokens have low semantic density and high spatio-temporal redundancy. This mismatch makes static top-k/top-p strategies ineffective for video decoders: they either introduce unnecessary randomness for low-uncertainty regions (static backgrounds) or get stuck in early errors for high-uncertainty regions (foreground objects). Prediction errors will accumulate as more frames are generated and eventually severely degrade long-horizon quality. To address this, we propose Entropy-Guided k-Guard (ENkG) sampling, a simple yet effective strategy that adapts sampling to token-wise dispersion, quantified by the entropy of each token’s predicted distribution. ENkG uses adaptive token candidate sizes: for low-entropy regions, it employs fewer candidates to suppress redundant noise and preserve structural integrity; for high-entropy regions, it uses more candidates to mitigate error compounding. ENkG is model-agnostic, training-free, and adds negligible overhead. Experiments demonstrate consistent improvements in perceptual quality and structural stability compared to static top-k/top-p strategies.
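The idea of entropy-adaptive candidate sizes can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the linear mapping from normalized entropy to k, the bounds `k_min` and `k_max`, and the function names are all assumptions, since the abstract does not give the exact rule.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def adaptive_k(probs, k_min=1, k_max=16):
    """Map normalized entropy in [0, 1] to a candidate count in [k_min, k_max].

    The linear mapping is an illustrative assumption, not the paper's rule.
    """
    max_entropy = math.log(len(probs))
    ratio = entropy(probs) / max_entropy if max_entropy > 0 else 0.0
    return k_min + round(ratio * (k_max - k_min))


def sample_token(probs, k_min=1, k_max=16, rng=random):
    """Sample from the top-k tokens, with k chosen from the entropy."""
    k = min(adaptive_k(probs, k_min, k_max), len(probs))
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


# A confident prediction (e.g. static background) and an uncertain one.
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]
```

With these toy distributions, the peaked case collapses to a single candidate (greedy decoding), while the flat case keeps all four, which mirrors the low-entropy and high-entropy behaviour described in the abstract.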

[CV-33] Dynamic Worlds Dynamic Humans: Generating Virtual Human-Scene Interaction Motion in Dynamic Scenes

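[Quick read]: This paper addresses the assumption, common to existing human-scene interaction (HSI) generation methods, that the scene is static, which does not match continuously changing real-world scenes. The key solution is Dyn-HSI, described as the first cognitive architecture for dynamic HSI, which gives virtual humans three human-like components for perceiving, remembering, and acting in dynamic environments: (1) Vision: Dynamic Scene-Aware Navigation that continuously perceives environmental changes and predicts the next waypoint; (2) Memory: a Hierarchical Experience Memory that stores and updates interaction experience during training, so prior knowledge can guide context-aware motion priming at inference and improve motion quality and generalization; (3) Control: a Human-Scene Interaction Diffusion Model that generates high-fidelity interaction motion conditioned on multimodal inputs. The authors construct a dynamic benchmark, Dyn-Scenes, and report that Dyn-HSI outperforms existing methods in both static and dynamic scenes.

Link: https://arxiv.org/abs/2601.19484
Authors: Yin Wang, Zhiying Leng, Haitian Liu, Frederick W. B. Li, Mu Li, Xiaohui Liang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: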

Abstract: Scenes are continuously undergoing dynamic changes in the real world. However, existing human-scene interaction generation methods typically treat the scene as static, which deviates from reality. Inspired by world models, we introduce Dyn-HSI, the first cognitive architecture for dynamic human-scene interaction, which endows virtual humans with three humanoid components. (1) Vision (human eyes): we equip the virtual human with Dynamic Scene-Aware Navigation, which continuously perceives changes in the surrounding environment and adaptively predicts the next waypoint. (2) Memory (human brain): we equip the virtual human with a Hierarchical Experience Memory, which stores and updates experiential data accumulated during training. This allows the model to leverage prior knowledge during inference for context-aware motion priming, thereby enhancing both motion quality and generalization. (3) Control (human body): we equip the virtual human with a Human-Scene Interaction Diffusion Model, which generates high-fidelity interaction motions conditioned on multimodal inputs. To evaluate performance in dynamic scenes, we extend the existing static human-scene interaction datasets to construct a dynamic benchmark, Dyn-Scenes. We conduct extensive qualitative and quantitative experiments to validate Dyn-HSI, showing that our method consistently outperforms existing approaches and generates high-quality human-scene interaction motions in both static and dynamic settings.

[CV-34] Towards Gold-Standard Depth Estimation for Tree Branches in UAV Forestry: Benchmarking Deep Stereo Matching Methods

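[Quick read]: This paper addresses the lack of cross-domain generalization in depth estimation for autonomous UAV operations in forests: existing evaluations focus on urban and indoor scenes, leaving vegetation-dense environments understudied. The key contribution is the first systematic zero-shot evaluation of eight stereo matching methods spanning iterative refinement, foundation model, diffusion-based, and 3D CNN paradigms. All methods use officially released pretrained weights (trained on Scene Flow) and are tested on four standard benchmarks (ETH3D, KITTI 2012/2015, Middlebury) plus a new 5,313-pair Canterbury Tree Branches dataset at 1920 × 1080 resolution. Foundation models perform best on structured scenes, and DEFOM (which the abstract groups with the foundation models) shows the most consistent cross-domain performance; its predictions are adopted as the gold-standard baseline and will serve as pseudo-ground-truth for future benchmarking.

Link: https://arxiv.org/abs/2601.19461
Authors: Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments: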

Abstract: Autonomous UAV forestry operations require robust depth estimation with strong cross-domain generalization, yet existing evaluations focus on urban and indoor scenarios, leaving a critical gap for vegetation-dense environments. We present the first systematic zero-shot evaluation of eight stereo methods spanning iterative refinement, foundation model, diffusion-based, and 3D CNN paradigms. All methods use officially released pretrained weights (trained on Scene Flow) and are evaluated on four standard benchmarks (ETH3D, KITTI 2012/2015, Middlebury) plus a novel 5,313-pair Canterbury Tree Branches dataset (1920 × 1080). Results reveal scene-dependent patterns: foundation models excel on structured scenes (BridgeDepth: 0.23 px on ETH3D; DEFOM: 4.65 px on Middlebury), while iterative methods show variable cross-benchmark performance (IGEV++: 0.36 px on ETH3D but 6.77 px on Middlebury; IGEV: 0.33 px on ETH3D but 4.99 px on Middlebury). Qualitative evaluation on the Tree Branches dataset establishes DEFOM as the gold-standard baseline for vegetation depth estimation, with superior cross-domain consistency (consistently ranking 1st-2nd across benchmarks, average rank 1.75). DEFOM predictions will serve as pseudo-ground-truth for future benchmarking.

[CV-35] DSTCS: Dual-Student Teacher Framework with Segment Anything Model for Semi-Supervised Pubic Symphysis Fetal Head Segmentation

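[Quick read]: This paper addresses accurate segmentation of the pubic symphysis and fetal head (PSFH) in intrapartum ultrasound images, a task made difficult by class imbalance, ambiguous boundaries, and noise, and further limited by the scarcity of high-quality annotated data. The key solution is DSTCS, a dual student-teacher framework that combines a convolutional neural network (CNN) with the Segment Anything Model (SAM); a cooperative learning mechanism between the CNN and SAM branches improves segmentation accuracy. The framework also adds a data augmentation strategy optimized for boundary processing and a novel loss function, and shows stronger robustness and performance on the MICCAI 2023 and 2024 PSFH segmentation benchmarks.

Link: https://arxiv.org/abs/2601.19446
Authors: Yalin Luo, Shun Long, Huijin Wang, Jieyun Bai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: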

Abstract:Segmentation of the pubic symphysis and fetal head (PSFH) is a critical procedure in intrapartum monitoring and is essential for evaluating labor progression and identifying potential delivery complications. However, achieving accurate segmentation remains a significant challenge due to class imbalance, ambiguous boundaries, and noise interference in ultrasound images, compounded by the scarcity of high-quality annotated data. Current research on PSFH segmentation predominantly relies on CNN and Transformer architectures, leaving the potential of more powerful models underexplored. In this work, we propose a Dual-Student and Teacher framework combining CNN and SAM (DSTCS), which integrates the Segment Anything Model (SAM) into a dual student-teacher architecture. A cooperative learning mechanism between the CNN and SAM branches significantly improves segmentation accuracy. The proposed scheme also incorporates a specialized data augmentation strategy optimized for boundary processing and a novel loss function. Extensive experiments on the MICCAI 2023 and 2024 PSFH segmentation benchmarks demonstrate that our method exhibits superior robustness and significantly outperforms existing techniques, providing a reliable segmentation tool for clinical practice.

[CV-36] RoamScene3D: Immersive Text-to-3D Scene Generation via Adaptive Object-aware Roaming

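[Quick read]: This paper addresses two core problems in generating immersive 3D scenes from text. First, existing methods that rely on 2D diffusion priors are spatially blind: they struggle to understand the semantic layout and to adaptively infer occluded content. Second, current inpainting models operate in 2D image space and struggle to plausibly fill holes caused by camera motion. The key solution is RoamScene3D, which uses a vision-language model (VLM) to build a scene graph encoding object relations; the graph guides the camera to perceive salient object boundaries and plan an adaptive roaming trajectory. A Motion-Injected Inpainting model, fine-tuned on a synthetic panoramic dataset that integrates authentic camera trajectories, makes inpainting adaptive to camera motion.

Link: https://arxiv.org/abs/2601.19433
Authors: Jisheng Chu, Wenrui Li, Rui Zhao, Wangmeng Zuo, Shifeng Chen, Xiaopeng Fan
Affiliations: Harbin Institute of Technology; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Nanyang Technological University; Peng Cheng Laboratory; Harbin Institute of Technology Suzhou Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: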

Abstract:Generating immersive 3D scenes from texts is a core task in computer vision, crucial for applications in virtual reality and game development. Despite the promise of leveraging 2D diffusion priors, existing methods suffer from spatial blindness and rely on predefined trajectories that fail to exploit the inner relationships among salient objects. Consequently, these approaches are unable to comprehend the semantic layout, preventing them from exploring the scene adaptively to infer occluded content. Moreover, current inpainting models operate in 2D image space, struggling to plausibly fill holes caused by camera motion. To address these limitations, we propose RoamScene3D, a novel framework that bridges the gap between semantic guidance and spatial generation. Our method reasons about the semantic relations among objects and produces consistent and photorealistic scenes. Specifically, we employ a vision-language model (VLM) to construct a scene graph that encodes object relations, guiding the camera to perceive salient object boundaries and plan an adaptive roaming trajectory. Furthermore, to mitigate the limitations of static 2D priors, we introduce a Motion-Injected Inpainting model that is fine-tuned on a synthetic panoramic dataset integrating authentic camera trajectories, making it adaptive to camera motion. Extensive experiments demonstrate that with semantic reasoning and geometric constraints, our method significantly outperforms state-of-the-art approaches in producing consistent and photorealistic scenes. Our code is available at this https URL.

[CV-37] Unveiling Perceptual Artifacts: A Fine-Grained Benchmark for Interpretable AI-Generated Image Detection

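[Quick read]: This paper addresses the fact that current AI-generated image (AIGI) detection methods mostly make binary decisions without interpretable evidence. Existing benchmarks cover a limited range of artifact types and lack detailed, localized annotations, so model decisions are hard to understand or verify. The key solution is X-AIGD, a fine-grained benchmark for explainable AIGI detection that provides pixel-level, categorized annotations of perceptual artifacts spanning low-level distortions, high-level semantics, and cognitive-level counterfactuals, supporting fine-grained evaluation of how detectors reach decisions. Analysis with the benchmark shows that existing detectors rely very little on perceptual artifacts and still depend mainly on uninterpretable features, and that explicitly aligning model attention with artifact regions can improve interpretability and generalization.

Link: https://arxiv.org/abs/2601.19430
Authors: Yao Xiao, Weiyan Chen, Jiahao Chen, Zijie Cao, Weijian Deng, Binbin Yang, Ziyi Dong, Xiangyang Ji, Wei Ke, Pengxu Wei, Liang Lin
Affiliations: Sun Yat-sen University; Peng Cheng Laboratory; Xi’an Jiaotong University; Australian National University; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: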

Abstract:Current AI-Generated Image (AIGI) detection approaches predominantly rely on binary classification to distinguish real from synthetic images, often lacking interpretable or convincing evidence to substantiate their decisions. This limitation stems from existing AIGI detection benchmarks, which, despite featuring a broad collection of synthetic images, remain restricted in their coverage of artifact diversity and lack detailed, localized annotations. To bridge this gap, we introduce a fine-grained benchmark towards eXplainable AI-Generated image Detection, named X-AIGD, which provides pixel-level, categorized annotations of perceptual artifacts, spanning low-level distortions, high-level semantics, and cognitive-level counterfactuals. These comprehensive annotations facilitate fine-grained interpretability evaluation and deeper insight into model decision-making processes. Our extensive investigation using X-AIGD provides several key insights: (1) Existing AIGI detectors demonstrate negligible reliance on perceptual artifacts, even at the most basic distortion level. (2) While AIGI detectors can be trained to identify specific artifacts, they still substantially base their judgment on uninterpretable features. (3) Explicitly aligning model attention with artifact regions can increase the interpretability and generalization of detectors. The data and code are available at: this https URL.

[CV-38] Tri-Reader: An Open-Access Multi-Stage AI Pipeline for First-Pass Lung Nodule Annotation in Screening CT

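[Quick read]: This paper addresses the inefficiency and heavy manual annotation burden of lung nodule detection and malignancy classification in screening CT, and the lack of broadly applicable automated tools across practices. The key solution is Tri-Reader, a unified tri-stage workflow (lung segmentation, nodule detection, malignancy classification) that integrates multiple open-access models trained on public datasets. It is designed to prioritize sensitivity while reducing the number of candidate nodules annotators must review, and it is evaluated on multiple internal and external datasets against expert annotations and dataset-provided reference standards.

Link: https://arxiv.org/abs/2601.19380
Authors: Fakrul Islam Tushar, Joseph Y. Lo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 1 figure, 2 tables, 20-page supplement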

Abstract:Using multiple open-access models trained on public datasets, we developed Tri-Reader, a comprehensive, freely available pipeline that integrates lung segmentation, nodule detection, and malignancy classification into a unified tri-stage workflow. The pipeline is designed to prioritize sensitivity while reducing the candidate burden for annotators. To ensure accuracy and generalizability across diverse practices, we evaluated Tri-Reader on multiple internal and external datasets as compared with expert annotations and dataset-provided reference standards.

[CV-39] Establishing dermatopathology encyclopedia DermpathNet with Artificial Intelligence-Based Workflow

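[Quick read]: This paper addresses the difficulty clinicians and dermatopathology trainees face in accessing high-quality, open-access dermatopathology image datasets for teaching, cross-referencing, and machine learning research. The key solution is a hybrid workflow that combines deep learning-based image modality classification with figure caption analysis to extract and classify images from the PubMed Central (PMC) repository. The result is DermpathNet, a large, semi-automatically curated open dataset reviewed by board-certified dermatopathologists. In validation, the hybrid approach reached an F-score of 90.4%, compared with 61.0% for keyword-based retrieval alone and 89.6% for deep learning alone.

Link: https://arxiv.org/abs/2601.19378
Authors: Ziyang Xu, Mingquan Lin, Yiliang Zhou, Zihan Xu, Seth J. Orlow, Zihan Xu, Shane A. Meehan, Alexandra Flamm, Ata S. Moshiri, Yifan Peng
Affiliations: Perelman Department of Dermatology, NYU Grossman School of Medicine, New York, USA; Department of Population Health Sciences, Weill Cornell Medicine, New York, USA; Division of Dermatopathology, Mount Sinai Health, New York, USA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by Scientific Data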

Abstract:Accessing high-quality, open-access dermatopathology image datasets for learning and cross-referencing is a common challenge for clinicians and dermatopathology trainees. To establish a comprehensive open-access dermatopathology dataset for educational, cross-referencing, and machine-learning purposes, we employed a hybrid workflow to curate and categorize images from the PubMed Central (PMC) repository. We used specific keywords to extract relevant images, and classified them using a novel hybrid method that combined deep learning-based image modality classification with figure caption analyses. Validation on 651 manually annotated images demonstrated the robustness of our workflow, with an F-score of 89.6% for the deep learning approach, 61.0% for the keyword-based retrieval method, and 90.4% for the hybrid approach. We retrieved over 7,772 images across 166 diagnoses and released this fully annotated dataset, reviewed by board-certified dermatopathologists. Using our dataset as a challenging task, we found the current image analysis algorithm from OpenAI inadequate for analyzing dermatopathology images. In conclusion, we have developed a large, peer-reviewed, open-access dermatopathology image dataset, DermpathNet, which features a semi-automated curation workflow.
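For readers comparing the three reported F-scores, the sketch below shows how an F-score is computed from true positives, false positives, and false negatives. It assumes the balanced F1 variant (`beta=1.0`); the abstract does not say which variant was used, and the counts in the example are invented for illustration, not taken from the paper.

```python
def f_score(true_pos, false_pos, false_neg, beta=1.0):
    """F-beta score from raw counts; beta=1.0 gives the balanced F1 score."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Invented counts: 8 correct hits, 2 false alarms, 2 misses.
example = f_score(true_pos=8, false_pos=2, false_neg=2)
```

Because the score is a harmonic mean, a method with high recall but poor precision (as keyword retrieval often is) scores much lower than one that balances the two.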

[CV-40] Pareto-Guided Optimization for Uncertainty-Aware Medical Image Segmentation

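[Quick read]: This paper addresses training instability in medical image segmentation caused by non-uniform uncertainty: boundary regions are far more ambiguous than interior regions, yet conventional training treats all pixels equally, producing unstable gradients early in optimization and hindering convergence toward Pareto-optimal solutions. The key solution is a region-wise curriculum strategy that learns first from high-confidence regions and gradually incorporates low-confidence ones to reduce gradient variance. A Pareto-consistent loss adaptively reshapes the loss landscape and constrains the convergence dynamics between interior and boundary regions, guiding the model toward Pareto-approximate solutions. A fuzzy labeling mechanism keeps binary confidence in non-boundary areas while allowing smooth transitions near boundaries, which stabilizes gradients and expands flat regions of the loss surface.

Link: https://arxiv.org/abs/2601.19365
Authors: Jinming Zhang, Xi Yang, Youpeng Yang, Haosen Shi, Yuyao Yan, Qiufeng Wang, Guangliang Cheng, Kaizhu Huang
Affiliations: Xi’an Jiaotong-Liverpool University; Zhejiang University; University of Liverpool; Duke Kunshan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: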

Abstract:Uncertainty in medical image segmentation is inherently non-uniform, with boundary regions exhibiting substantially higher ambiguity than interior areas. Conventional training treats all pixels equally, leading to unstable optimization during early epochs when predictions are unreliable. We argue that this instability hinders convergence toward Pareto-optimal solutions and propose a region-wise curriculum strategy that prioritizes learning from certain regions and gradually incorporates uncertain ones, reducing gradient variance. Methodologically, we introduce a Pareto-consistent loss that balances trade-offs between regional uncertainties by adaptively reshaping the loss landscape and constraining convergence dynamics between interior and boundary regions; this guides the model toward Pareto-approximate solutions. To address boundary ambiguity, we further develop a fuzzy labeling mechanism that maintains binary confidence in non-boundary areas while enabling smooth transitions near boundaries, stabilizing gradients, and expanding flat regions in the loss surface. Experiments on brain metastasis and non-metastatic tumor segmentation show consistent improvements across multiple configurations, with our method outperforming traditional crisp-set approaches in all tumor subregions.
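The fuzzy labeling idea (binary labels away from boundaries, smooth transitions near them) can be illustrated on a one-dimensional mask. This is only a sketch under stated assumptions: the linear ramp, the `band` width parameter, and the function name are hypothetical, and the paper's actual membership function is not given in the abstract.

```python
def fuzzy_labels_1d(mask, band=2):
    """Soften a binary 1D mask near class boundaries.

    Pixels farther than `band` from the nearest opposite-class pixel keep
    their binary label; closer pixels are pulled linearly toward 0.5.
    """
    n = len(mask)
    soft = []
    for i, value in enumerate(mask):
        # Distance (in pixels) to the nearest pixel of the opposite class.
        dists = [abs(i - j) for j in range(n) if mask[j] != value]
        dist = min(dists) if dists else band + 1
        t = min(dist, band + 1) / (band + 1)
        soft.append(0.5 + 0.5 * t if value == 1 else 0.5 - 0.5 * t)
    return soft


# Background on the left, foreground on the right, boundary in the middle.
soft = fuzzy_labels_1d([0, 0, 0, 0, 1, 1, 1, 1], band=1)
```

In this toy case only the two pixels adjacent to the boundary receive intermediate values (0.25 and 0.75), so the supervision signal changes gradually exactly where the annotation is least certain.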

[CV-41] Innovator-VL: A Multimodal Large Language Model for Scientific Discovery

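[Quick read]: This paper addresses the low data efficiency, poor reproducibility, and limited general capability of current scientific multimodal large models, which depend on massive domain-specific pretraining and opaque pipelines. The key points of the solution are: (1) a transparent, end-to-end reproducible training pipeline covering data collection, cleaning, preprocessing, supervised fine-tuning, reinforcement learning, and evaluation, with detailed optimization recipes; (2) strong data efficiency through careful data selection, reaching competitive performance on a range of scientific tasks with fewer than five million curated samples, which suggests effective reasoning can come from principled data selection rather than indiscriminate scaling; (3) integration of scientific alignment into a unified model without sacrificing general vision and multimodal reasoning capability.

Link: https://arxiv.org/abs/2601.19325
Authors: Zichen Wen, Boxue Yang, Shuang Chen, Yaojie Zhang, Yuhang Han, Junlong Ke, Cong Wang, Yicheng Fu, Jiawang Zhao, Jiangchao Yao, Xi Fang, Zhen Wang, Henxing Cai, Lin Yao, Zhifeng Gao, Yanhui Hong, Nang Yuan, Yixuan Li, Guojiang Zhao, Haoyi Tao, Nan Wang, Han Lyu, Guolin Ke, Ning Liao, Xiaoxing Wang, Kai Chen, Zhiyu Li, Feiyu Xiong, Sihan Hu, Kun Chen, Yanfeng Wang, Weinan E, Linfeng Zhang, Linfeng Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Innovator-VL tech report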

Abstract:We present Innovator-VL, a scientific multimodal large language model designed to advance understanding and reasoning across diverse scientific domains while maintaining excellent performance on general vision tasks. Contrary to the trend of relying on massive domain-specific pretraining and opaque pipelines, our work demonstrates that principled training design and transparent methodology can yield strong scientific intelligence with substantially reduced data requirements. (i) First, we provide a fully transparent, end-to-end reproducible training pipeline, covering data collection, cleaning, preprocessing, supervised fine-tuning, reinforcement learning, and evaluation, along with detailed optimization recipes. This facilitates systematic extension by the community. (ii) Second, Innovator-VL exhibits remarkable data efficiency, achieving competitive performance on various scientific tasks using fewer than five million curated samples without large-scale pretraining. These results highlight that effective reasoning can be achieved through principled data selection rather than indiscriminate scaling. (iii) Third, Innovator-VL demonstrates strong generalization, achieving competitive performance on general vision, multimodal reasoning, and scientific benchmarks. This indicates that scientific alignment can be integrated into a unified model without compromising general-purpose capabilities. Our practices suggest that efficient, reproducible, and high-performing scientific multimodal models can be built even without large-scale data, providing a practical foundation for future research.

[CV-42] Perception-to-Pursuit: Track-Centric Temporal Reasoning for Open-World Drone Detection and Autonomous Chasing ICCV2027

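[Quick read]: This paper addresses the disconnect between trajectory prediction and interception feasibility in autonomous drone pursuit: existing tracking methods optimize prediction accuracy but ignore the kinematic feasibility of interception, so their predicted trajectories are physically impossible to intercept 99.9% of the time. The key solution is Perception-to-Pursuit (P2P), a track-centric temporal reasoning framework that represents drone motion as compact 8-dimensional tokens (capturing velocity, acceleration, scale, and smoothness) and uses a 12-frame causal Transformer to reason about future behavior. It also introduces the Intercept Success Rate (ISR), a metric of pursuit feasibility under realistic interceptor constraints.

Link: https://arxiv.org/abs/2601.19318
Authors: Venkatakrishna Reddy Oruganti
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 2 figures, 3 tables, 15 references. Intended for submission to ICCV 2027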

Abstract:Autonomous drone pursuit requires not only detecting drones but also predicting their trajectories in a manner that enables kinematically feasible interception. Existing tracking methods optimize for prediction accuracy but ignore pursuit feasibility, resulting in trajectories that are physically impossible to intercept 99.9% of the time. We propose Perception-to-Pursuit (P2P), a track-centric temporal reasoning framework that bridges detection and actionable pursuit planning. Our method represents drone motion as compact 8-dimensional tokens capturing velocity, acceleration, scale, and smoothness, enabling a 12-frame causal transformer to reason about future behavior. We introduce the Intercept Success Rate (ISR) metric to measure pursuit feasibility under realistic interceptor constraints. Evaluated on the Anti-UAV-RGBT dataset with 226 real drone sequences, P2P achieves 28.12 pixel average displacement error and 0.597 ISR, representing a 77% improvement in trajectory prediction and 597x improvement in pursuit feasibility over tracking-only baselines, while maintaining perfect drone classification accuracy (100%). Our work demonstrates that temporal reasoning over motion patterns enables both accurate prediction and actionable pursuit planning.
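The reported 28.12-pixel figure is an average displacement error (ADE). A minimal sketch of the standard definition follows: the mean Euclidean distance between predicted and ground-truth positions over all predicted time steps. The function name and the toy trajectories are illustrative assumptions, and the paper's exact evaluation protocol (horizon length, per-sequence averaging) is not described in the abstract. ISR is not sketched here because its definition is not given.

```python
import math


def average_displacement_error(predicted, ground_truth):
    """Mean Euclidean distance between matching trajectory points."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and equally long")
    total = sum(math.dist(p, g) for p, g in zip(predicted, ground_truth))
    return total / len(predicted)


# Toy pixel-space trajectories: a perfect first step, then a 5-pixel miss.
ade = average_displacement_error([(0, 0), (3, 4)], [(0, 0), (0, 0)])
```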

[CV-43] Instance-Guided Radar Depth Estimation for 3D Object Detection

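[Quick read]: This paper addresses depth ambiguity and reduced robustness in monocular camera-based 3D object detection for autonomous driving, using Radar, which is resilient to poor lighting and adverse weather, while compensating for Radar's sparsity and low resolution. The key solution is an end-to-end framework with two components. First, InstaRadar, an instance segmentation-guided expansion method, increases Radar density and semantic alignment to produce a more structured representation. Second, the pre-trained RCDPT model replaces the depth module in the BEVDepth framework; with InstaRadar-enhanced inputs, this improves 3D detection under explicit depth supervision. InstaRadar achieves state-of-the-art results in Radar-guided depth estimation, and the combined framework yields steady gains over the BEVDepth baseline, though it still lags Radar-camera fusion models that extract BEV features directly.

Link: https://arxiv.org/abs/2601.19314
Authors: Chen-Chou Lo, Patrick Vandewalle
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to IPMV2026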

Abstract:Accurate depth estimation is fundamental to 3D perception in autonomous driving, supporting tasks such as detection, tracking, and motion planning. However, monocular camera-based 3D detection suffers from depth ambiguity and reduced robustness under challenging conditions. Radar provides complementary advantages such as resilience to poor lighting and adverse weather, but its sparsity and low resolution limit its direct use in detection frameworks. This motivates the need for effective Radar-camera fusion with improved preprocessing and depth estimation strategies. We propose an end-to-end framework that enhances monocular 3D object detection through two key components. First, we introduce InstaRadar, an instance segmentation-guided expansion method that leverages pre-trained segmentation masks to enhance Radar density and semantic alignment, producing a more structured representation. InstaRadar achieves state-of-the-art results in Radar-guided depth estimation, showing its effectiveness in generating high-quality depth features. Second, we integrate the pre-trained RCDPT into the BEVDepth framework as a replacement for its depth module. With InstaRadar-enhanced inputs, the RCDPT integration consistently improves 3D detection performance. Overall, these components yield steady gains over the baseline BEVDepth model, demonstrating the effectiveness of InstaRadar and the advantage of explicit depth supervision in 3D object detection. Although the framework lags behind Radar-camera fusion models that directly extract BEV features, since Radar serves only as guidance rather than an independent feature stream, this limitation highlights potential for improvement. Future work will extend InstaRadar to point cloud-like representations and integrate a dedicated Radar branch with temporal cues for enhanced BEV fusion.

[CV-44] Beyond Shadows: A Large-Scale Benchmark and Multi-Stage Framework for High-Fidelity Facial Shadow Removal ICASSP2026

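[Quick read]: This paper addresses the problem that facial shadows degrade image quality and limit the performance of vision algorithms in real-world scenes, and in particular the difficulty existing methods have in removing shadows while preserving texture under complex lighting. The key to its solution is the Augmented Shadow Face in the Wild (ASFW) dataset, the first large-scale real-world dataset for facial shadow removal, which contains 1,081 paired shadow and shadow-free face images created through a professional Photoshop workflow, with photorealistic shadow variations and accurate ground truth that bridge the domain gap between synthetic data and real scenes. The paper also proposes the Face Shadow Eraser (FSE) method to demonstrate the dataset's effectiveness; experiments show that ASFW improves shadow removal in real-world conditions and sets a new standard for the task.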

Link: https://arxiv.org/abs/2601.19309
Authors: Tailong Luo, Jiesong Bai, Jinyang Huang, Junyu Xia, Wangyu Wu, Xuhang Chen
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICASSP2026

Abstract:Facial shadows often degrade image quality and the performance of vision algorithms. Existing methods struggle to remove shadows while preserving texture, especially under complex lighting conditions, and they lack real-world paired datasets for training. We present the Augmented Shadow Face in the Wild (ASFW) dataset, the first large-scale real-world dataset for facial shadow removal, containing 1,081 paired shadow and shadow-free images created via a professional Photoshop workflow. ASFW offers photorealistic shadow variations and accurate ground truths, bridging the gap between synthetic and real domains. Deep models trained on ASFW demonstrate improved shadow removal in real-world conditions. We also introduce the Face Shadow Eraser (FSE) method to showcase the effectiveness of the dataset. Experiments demonstrate that ASFW enhances the performance of facial shadow removal models, setting new standards for this task.

[CV-45] ProMist-5K: A Comprehensive Dataset for Digital Emulation of Cinematic Pro-Mist Filter Effects ICASSP2026

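[Quick read]: This paper addresses the difficulty of accurately reproducing, in digital image processing, the soft halation, lowered contrast, and distinctive atmosphere produced by cinematic Pro-Mist filters. The key to its solution is the ProMist-5K dataset, built with a physically inspired pipeline in a scene-referred linear space and containing 20,000 high-resolution image pairs covering two filter densities (1/2 and 1/8) and two focal lengths (20mm and 50mm). Multiple stacked blur layers with carefully tuned weights model the varying intensity and spread of optical diffusion, simulating realistic glow and highlight diffusion and giving learning-based image translation models a controllable and consistent target domain.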

Link: https://arxiv.org/abs/2601.19295
Authors: Yingtie Lei, Zimeng Li, Chi-Man Pun, Wangyu Wu, Junke Yang, Xuhang Chen
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICASSP2026

Abstract:Pro-Mist filters are widely used in cinematography for their ability to create soft halation, lower contrast, and produce a distinctive, atmospheric style. These effects are difficult to reproduce digitally due to the complex behavior of light diffusion. We present ProMist-5K, a dataset designed to support cinematic style emulation. It is built using a physically inspired pipeline in a scene-referred linear space and includes 20,000 high-resolution image pairs across four configurations, covering two filter densities (1/2 and 1/8) and two focal lengths (20mm and 50mm). Unlike general style datasets, ProMist-5K focuses on realistic glow and highlight diffusion effects. Multiple blur layers and carefully tuned weighting are used to model the varying intensity and spread of optical diffusion. The dataset provides a consistent and controllable target domain that supports various image translation models and learning paradigms. Experiments show that the dataset works well across different training settings and helps capture both subtle and strong cinematic appearances. ProMist-5K offers a practical and physically grounded resource for film-inspired image transformation, bridging the gap between digital flexibility and traditional lens aesthetics. The dataset is available at this https URL.
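
The abstract describes modelling diffusion with multiple blur layers and tuned weights in a linear space. The following is only a minimal 1D sketch of that general idea, not the authors' pipeline: the function names, the Gaussian kernels, the clamped borders, and the example `(sigma, weight)` values are all assumptions made for illustration.

```python
import math


def gaussian_kernel(sigma):
    """Normalised 1D Gaussian kernel truncated at about 3 sigma (assumed choice)."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(signal, sigma):
    """Convolve a 1D signal with a Gaussian, clamping indices at the borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out


def diffuse(linear_signal, layers):
    """Blend the sharp signal with several blurred copies.

    layers is a list of (sigma, weight) pairs (hypothetical parameters);
    the weights must sum to less than 1 so some sharp signal remains.
    """
    total_w = sum(w for _, w in layers)
    if not 0.0 <= total_w < 1.0:
        raise ValueError("layer weights must sum to a value in [0, 1)")
    out = [(1.0 - total_w) * v for v in linear_signal]
    for sigma, w in layers:
        blurred = blur(linear_signal, sigma)
        out = [o + w * b for o, b in zip(out, blurred)]
    return out


# A single bright highlight on a dark background, in linear light.
spike = [0.0] * 5 + [10.0] + [0.0] * 5
glow = diffuse(spike, [(1.0, 0.2), (3.0, 0.1)])
```

In this sketch the highlight loses some of its peak value and spreads into its neighbours, while a flat region is left unchanged because every kernel is normalised. A stronger filter density would presumably correspond to larger weights or wider kernels, but that mapping is not specified in the abstract.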

[CV-46] A Multi-View Consistency Framework with Semi-Supervised Domain Adaptation

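[Quick read]: This paper addresses a problem in Semi-Supervised Domain Adaptation (SSDA): because labeled target-domain samples are limited, classes can be intrinsically similar in the feature space, which leads to biased predictions. The key to its solution is a multi-view consistency framework with two core mechanisms: a debiasing strategy that corrects class-wise prediction probabilities according to the model's prediction performance, and pseudo-negative labels derived from the model's predictions that strengthen the training signal. In addition, cross-domain affinity learning aligns features of the same class across domains, improving overall performance.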

Link: https://arxiv.org/abs/2601.19266
Authors: Yuting Hong, Li Dong, Xiaojie Qiu, Hui Xiao, Baochen Yao, Siming Zheng, Chengbin Peng
Affiliation: Ningbo University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 7 figures

Abstract:Semi-Supervised Domain Adaptation (SSDA) leverages knowledge from a fully labeled source domain to classify data in a partially labeled target domain. Due to the limited number of labeled samples in the target domain, there can be intrinsic similarity of classes in the feature space, which may result in biased predictions, even when the model is trained on a balanced dataset. To overcome this limitation, we introduce a multi-view consistency framework, which includes two views for training strongly augmented data. One is a debiasing strategy for correcting class-wise prediction probabilities according to the prediction performance of the model. The other involves leveraging pseudo-negative labels derived from the model predictions. Furthermore, we introduce a cross-domain affinity learning aimed at aligning features of the same class across different domains, thereby enhancing overall performance. Experimental results demonstrate that our method outperforms the competing methods on two standard domain adaptation datasets, DomainNet and Office-Home. Combining unsupervised domain adaptation and semi-supervised learning offers indispensable contributions to the industrial sector by enhancing model adaptability, reducing annotation costs, and improving performance.

[CV-47] Handcrafted Feature Fusion for Reliable Detection of AI-Generated Images

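[Quick read]: This paper addresses the threat to authenticity and trust in digital media posed by increasingly realistic images synthesized by generative AI, that is, how to detect fake content reliably. The key to its solution is a systematic evaluation of handcrafted features for classifying real versus synthetic images, combined with gradient-boosted ensemble learners such as LightGBM. The study finds that mixing several families of handcrafted features, including DCT, HOG, LBP, and GLCM, with a LightGBM classifier substantially improves detection (PR-AUC 0.9879, F1 0.9447) and gives better calibration and discrimination than simpler descriptors or simpler classifiers, which supports carefully engineered features and ensemble learning for synthetic image detection, particularly where interpretability and computational efficiency matter.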

Link: https://arxiv.org/abs/2601.19262
Authors: Syed Mehedi Hasan Nirob, Moqsadur Rahman, Shamim Ehsan, Summit Haque
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The rapid progress of generative models has enabled the creation of highly realistic synthetic images, raising concerns about authenticity and trust in digital media. Detecting such fake content reliably is an urgent challenge. While deep learning approaches dominate current literature, handcrafted features remain attractive for their interpretability, efficiency, and generalizability. In this paper, we conduct a systematic evaluation of handcrafted descriptors, including raw pixels, color histograms, Discrete Cosine Transform (DCT), Histogram of Oriented Gradients (HOG), Local Binary Patterns (LBP), Gray-Level Co-occurrence Matrix (GLCM), and wavelet features, on the CIFAKE dataset of real versus synthetic images. Using 50,000 training and 10,000 test samples, we benchmark seven classifiers ranging from Logistic Regression to advanced gradient-boosted ensembles (LightGBM, XGBoost, CatBoost). Results demonstrate that LightGBM consistently outperforms alternatives, achieving PR-AUC 0.9879, ROC-AUC 0.9878, F1 0.9447, and a Brier score of 0.0414 with mixed features, representing strong gains in calibration and discrimination over simpler descriptors. Across three configurations (baseline, advanced, mixed), performance improves monotonically, confirming that combining diverse handcrafted features yields substantial benefit. These findings highlight the continued relevance of carefully engineered features and ensemble learning for detecting synthetic images, particularly in contexts where interpretability and computational efficiency are critical.
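
The abstract reports a Brier score (calibration) alongside F1 (discrimination at a threshold). As a reminder of what those two numbers measure, here is a small dependency-free sketch using their standard definitions. The labels and probabilities below are made-up toy values, not data from the paper, and the 0.5 threshold is an assumption.

```python
def brier_score(y_true, y_prob):
    """Mean squared gap between the predicted probability and the 0/1 label."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def f1_score(y_true, y_prob, threshold=0.5):
    """F1 of the positive class after thresholding the probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 1)
    fp = sum(1 for y, yp in zip(y_true, y_pred) if y == 0 and yp == 1)
    fn = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example: 1 = synthetic image, 0 = real image (illustrative values only).
labels = [1, 0, 1, 1, 0]
probs = [0.9, 0.2, 0.6, 0.4, 0.1]
toy_brier = brier_score(labels, probs)
toy_f1 = f1_score(labels, probs)
```

A lower Brier score means the predicted probabilities sit closer to the true labels, which is why it is read as a calibration measure, whereas F1 depends only on which side of the threshold each prediction falls.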

[CV-48] TIGaussian: Disentangle Gaussians for Spatial-Awared Text-Image-3D Alignment

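[Quick read]: This paper addresses the difficulty of extracting 3D-modality features and the weak cross-modal alignment in multimodal pretraining, in particular the semantic gap between images, text, and 3D data such as point clouds and 3D Gaussian representations. The key to its solution is the TIGaussian framework, built on two core mechanisms. First, a multi-branch 3D Gaussian Splatting (3DGS) tokenizer decouples the intrinsic properties of 3DGS structures into compact latent representations, making feature extraction more generalizable. Second, a bidirectional cross-modal alignment strategy combines a multi-view feature fusion mechanism that uses diffusion priors to resolve perspective ambiguity in image-3D alignment with a text-3D projection module that adaptively maps 3D features into the text embedding space, strengthening semantic consistency between text and 3D.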

Link: https://arxiv.org/abs/2601.19247
Authors: Jiarun Liu, Qifeng Chen, Yiru Zhao, Minghua Liu, Baorui Ma, Sheng Yang
Affiliation: Cainiao Inc., Alibaba Group; Hillbot; Beijing Academy of Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While visual-language models have profoundly linked features between texts and images, the incorporation of 3D modality data, such as point clouds and 3D Gaussians, further enables pretraining for 3D-related tasks, e.g., cross-modal retrieval, zero-shot classification, and scene recognition. As challenges remain in extracting 3D modal features and bridging the gap between different modalities, we propose TIGaussian, a framework that harnesses 3D Gaussian Splatting (3DGS) characteristics to strengthen cross-modality alignment through a multi-branch 3DGS tokenizer and modality-specific 3D feature alignment strategies. Specifically, our multi-branch 3DGS tokenizer decouples the intrinsic properties of 3DGS structures into compact latent representations, enabling more generalizable feature extraction. To further bridge the modality gap, we develop a bidirectional cross-modal alignment strategy: a multi-view feature fusion mechanism that leverages diffusion priors to resolve perspective ambiguity in image-3D alignment, and a text-3D projection module that adaptively maps 3D features to the text embedding space for better text-3D alignment. Extensive experiments on various datasets demonstrate the state-of-the-art performance of TIGaussian in multiple tasks.

[CV-49] VC-Bench: Pioneering the Video Connecting Benchmark with a Dataset and Evaluation Metrics

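[Quick read]: This paper addresses the lack of a standardized evaluation benchmark for the Video Connecting task, which requires generating smooth intermediate content that connects two given video clips, as needed in practical applications such as video editing and vlogging. The key to its solution is VC-Bench, a structured dataset of 1,579 high-quality videos covering 15 main categories and 72 subcategories, together with three core evaluation metrics: Video Quality Score (VQS), Start-End Consistency Score (SECS), and Transition Smoothness Score (TSS). Together these form a comprehensive evaluation framework that goes beyond conventional quality-only metrics and supports systematic research on the task.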

Link: https://arxiv.org/abs/2601.19236
Authors: Zhiyu Yin, Zhipeng Liu, Kehai Chen, Lemao Liu, Jin Liu, Hong-Dong Li, Yang Xiang, Min Zhang
Affiliation: Harbin Institute of Technology, Shenzhen; Peng Cheng Laboratory; Central South University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:While current video generation focuses on text or image conditions, practical applications like video editing and vlogging often need to seamlessly connect separate clips. In our work, we introduce Video Connecting, an innovative task that aims to generate smooth intermediate video content between given start and end clips. However, the absence of standardized evaluation benchmarks has hindered the development of this task. To bridge this gap, we propose VC-Bench, a novel benchmark specifically designed for video connecting. It includes 1,579 high-quality videos collected from public platforms, covering 15 main categories and 72 subcategories to ensure diversity and structure. VC-Bench focuses on three core aspects: Video Quality Score (VQS), Start-End Consistency Score (SECS), and Transition Smoothness Score (TSS). Together, they form a comprehensive framework that moves beyond conventional quality-only metrics. We evaluated multiple state-of-the-art video generation models on VC-Bench. Experimental results reveal significant limitations in maintaining start-end consistency and transition smoothness, leading to lower overall coherence and fluidity. We expect that VC-Bench will serve as a pioneering benchmark to inspire and guide future research in video connecting. The evaluation metrics and dataset are publicly available at: this https URL.

[CV-50] Towards Pixel-Level VLM Perception via Simple Points Prediction

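[Quick read]: This paper addresses the lack of native pixel-level perception in Multimodal Large Language Models (MLLMs), with the goal of precisely segmenting object boundaries in images. Its core solution recasts segmentation as a simple sequence generation problem: the model predicts, directly in language space, a sequence of points (textual coordinates) describing the object boundary, without complex specialized architectures or auxiliary modules. The key innovation is a two-stage training strategy: supervised fine-tuning (SF) first produces initial point sequences, and Reinforcement Learning (RL) with an Intersection-over-Union (IoU) based reward then refines them so that predicted boundaries closely match ground-truth contours. Experiments show that a standard MLLM architecture has strong latent capacity for low-level perception, and that point-sequence prediction alone achieves segmentation performance comparable to, and often better than, complex task-specific methods, indicating that spatial understanding can emerge from simple point prediction.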

Link: https://arxiv.org/abs/2601.19228
Authors: Tianhui Song, Haoyu Lu, Hao Yang, Lin Sui, Haoning Wu, Zaida Zhou, Zhiqi Huang, Yiping Bao, Y.Charles, Xinyu Zhou, Limin Wang
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present SimpleSeg, a strikingly simple yet highly effective approach to endow Multimodal Large Language Models (MLLMs) with native pixel-level perception. Our method reframes segmentation as a simple sequence generation problem: the model directly predicts sequences of points (textual coordinates) delineating object boundaries, entirely within its language space. To achieve high fidelity, we introduce a two-stage SF→RL training pipeline, where Reinforcement Learning with an IoU-based reward refines the point sequences to accurately match ground-truth contours. We find that the standard MLLM architecture possesses a strong, inherent capacity for low-level perception that can be unlocked without any specialized architecture. On segmentation benchmarks, SimpleSeg achieves performance that is comparable to, and often surpasses, methods relying on complex, task-specific designs. This work shows that precise spatial understanding can emerge from simple point prediction, challenging the prevailing need for auxiliary components and paving the way for more unified and capable VLMs. Homepage: this https URL
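
To make the idea of an IoU-based reward over textual point sequences concrete, here is a rough sketch under stated assumptions. The `(x, y)` text format, the pixel-centre rasterisation, and every function name are hypothetical; the abstract does not specify how SimpleSeg encodes coordinates or computes its reward.

```python
import re


def parse_points(text):
    """Extract '(x, y)' pairs from generated text (assumed output format)."""
    pairs = re.findall(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)", text)
    return [(float(x), float(y)) for x, y in pairs]


def point_in_polygon(px, py, polygon):
    """Even-odd ray casting: count edge crossings to the right of the point."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


def rasterize(polygon, width, height):
    """Set of (col, row) pixels whose centres lie inside the polygon."""
    if len(polygon) < 3:
        return set()
    return {
        (c, r)
        for r in range(height)
        for c in range(width)
        if point_in_polygon(c + 0.5, r + 0.5, polygon)
    }


def iou_reward(pred_text, gt_polygon, width, height):
    """Reward in [0, 1]: mask IoU between the predicted and reference polygons."""
    pred = rasterize(parse_points(pred_text), width, height)
    gt = rasterize(gt_polygon, width, height)
    union = len(pred | gt)
    return len(pred & gt) / union if union else 0.0


gt_square = [(0, 0), (4, 0), (4, 4), (0, 4)]
full = iou_reward("(0, 0) (4, 0) (4, 4) (0, 4)", gt_square, 8, 8)
half = iou_reward("(0, 0) (2, 0) (2, 4) (0, 4)", gt_square, 8, 8)
bad = iou_reward("no coordinates here", gt_square, 8, 8)
```

One property of this formulation is that malformed output, with fewer than three parsable points, simply earns zero reward, so the same scalar penalises both bad geometry and bad formatting.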

[CV-51] UniPCB: A Unified Vision-Language Benchmark for Open-Ended PCB Quality Inspection

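[Quick read]: This paper addresses the weak performance of Multimodal Large Language Models (MLLMs) in complex industrial scenarios, especially Printed Circuit Board (PCB) quality inspection. The main challenges are densely packed components, complex wiring structures, and subtle defect patterns, together with the absence of a unified vision-language benchmark for quantitative evaluation, which stems from scarce data, fragmented datasets, and inconsistent standards. To address this, the authors propose UniPCB, the first unified vision-language benchmark for open-ended PCB quality inspection, and build PCB-GPT on top of it. PCB-GPT is trained on an instruction dataset generated by a systematic data pipeline, using a novel progressive curriculum that mimics how human experts learn. The core of the solution is, first, a standardized cross-scenario dataset and benchmark (UniPCB) and, second, a training scheme that follows the domain's learning logic (the progressive curriculum), which markedly improves performance on key tasks such as fine-grained defect localization, more than doubling that of the strongest baselines.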

Link: https://arxiv.org/abs/2601.19222
Authors: Fuxiang Sun, Xi Jiang, Jiansheng Wu, Haigang Zhang, Feng Zheng, Jinfeng Yang
Affiliation: Shenzhen Polytechnic University; Southern University of Science and Technology; University of Science and Technology Liaoning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal Large Language Models (MLLMs) show promise for general industrial quality inspection, but fall short in complex scenarios, such as Printed Circuit Board (PCB) inspection. PCB inspection poses unique challenges due to densely packed components, complex wiring structures, and subtle defect patterns that require specialized domain expertise. However, a high-quality, unified vision-language benchmark for quantitatively evaluating MLLMs across PCB inspection tasks remains absent, stemming not only from limited data availability but also from fragmented datasets and inconsistent standardization. To fill this gap, we propose UniPCB, the first unified vision-language benchmark for open-ended PCB quality inspection. UniPCB is built via a systematic pipeline that curates and standardizes data from disparate sources across three annotated scenarios. Furthermore, we introduce PCB-GPT, an MLLM trained on a new instruction dataset generated by this pipeline, utilizing a novel progressive curriculum that mimics the learning process of human experts. Evaluations on the UniPCB benchmark show that while existing MLLMs falter on domain-specific tasks, PCB-GPT establishes a new baseline. Notably, it more than doubles the performance on fine-grained defect localization compared to the strongest competitors, with significant advantages in localization and analysis. We will release the instruction data, benchmark, and model to facilitate future research.

[CV-52] Bridging Visual and Wireless Sensing: A Unified Radiation Field for 3D Radio Map Construction

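[Quick read]: This paper addresses the construction of high-fidelity environmental intelligence for next-generation wireless networks, in particular how to build accurate 3D radio maps that support spectrum-aware planning and environment-aware sensing. Existing methods usually treat optical and wireless knowledge as separate modalities and do not exploit the physical principles shared by light and electromagnetic propagation, which limits modeling accuracy and sample efficiency. The key to the solution is URF-GS, a unified radio-optical radiation field representation framework based on 3D Gaussian Splatting (3D-GS) and inverse rendering. It fuses visual and wireless sensing observations to recover scene geometry and material properties while accurately predicting radio behavior for arbitrary transmitter-receiver (Tx-Rx) configurations, yielding more accurate and more sample-efficient 3D radio map construction.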

Link: https://arxiv.org/abs/2601.19216
Authors: Chaozheng Wen, Jingwen Tong, Zehong Lin, Chenghong Bian, Jun Zhang
Affiliation: The Hong Kong University of Science and Technology (HKUST)
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: The code for this work will be publicly available at: this https URL

Abstract:The emerging applications of next-generation wireless networks (e.g., immersive 3D communication, low-altitude networks, and integrated sensing and communication) necessitate high-fidelity environmental intelligence. 3D radio maps have emerged as a critical tool for this purpose, enabling spectrum-aware planning and environment-aware sensing by bridging the gap between physical environments and electromagnetic signal propagation. However, constructing accurate 3D radio maps requires fine-grained 3D geometric information and a profound understanding of electromagnetic wave propagation. Existing approaches typically treat optical and wireless knowledge as distinct modalities, failing to exploit the fundamental physical principles governing both light and electromagnetic propagation. To bridge this gap, we propose URF-GS, a unified radio-optical radiation field representation framework for accurate and generalizable 3D radio map construction based on 3D Gaussian splatting (3D-GS) and inverse rendering. By fusing visual and wireless sensing observations, URF-GS recovers scene geometry and material properties while accurately predicting radio signal behavior at arbitrary transmitter-receiver (Tx-Rx) configurations. Experimental results demonstrate that URF-GS achieves up to a 24.7% improvement in spatial spectrum prediction accuracy and a 10x increase in sample efficiency for 3D radio map construction compared with neural radiance field (NeRF)-based methods. This work establishes a foundation for next-generation wireless networks by integrating perception, interaction, and communication through holistic radiation field reconstruction.

[CV-53] Contrastive Spectral Rectification: Test-Time Defense towards Zero-shot Adversarial Robustness of CLIP

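[Quick read]: This paper addresses the insufficient robustness of Vision-Language Models (VLMs) to adversarial examples (AEs), and in particular the limitations of existing test-time defenses: weak robustness against strong attacks, high inference latency, and poor applicability across tasks. The key to its solution is the finding that adversarial examples show severe feature inconsistency in the frequency domain, which the authors attribute to the model's inherent spectral bias. Building on this insight, they propose an efficient test-time defense called Contrastive Spectral Rectification (CSR): a spectral-guided contrastive objective is used to optimize, adaptively per input, a rectification perturbation that realigns the input with the natural data manifold. This substantially improves robustness across many classification tasks and against strong attacks such as AutoAttack while keeping inference overhead low.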

Link: https://arxiv.org/abs/2601.19210
Authors: Sen Nie, Jie Zhang, Zhuo Wang, Shiguang Shan, Xilin Chen
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages

Abstract:Vision-language models (VLMs) such as CLIP have demonstrated remarkable zero-shot generalization, yet remain highly vulnerable to adversarial examples (AEs). While test-time defenses are promising, existing methods fail to provide sufficient robustness against strong attacks and are often hampered by high inference latency and task-specific applicability. To address these limitations, we start by investigating the intrinsic properties of AEs, which reveals that AEs exhibit severe feature inconsistency under progressive frequency attenuation. We further attribute this to the model’s inherent spectral bias. Leveraging this insight, we propose an efficient test-time defense named Contrastive Spectral Rectification (CSR). CSR optimizes a rectification perturbation to realign the input with the natural manifold under a spectral-guided contrastive objective, which is applied input-adaptively. Extensive experiments across 16 classification benchmarks demonstrate that CSR outperforms the SOTA by an average of 18.1% against strong AutoAttack with modest inference overhead. Furthermore, CSR exhibits broad applicability across diverse visual tasks. Code is available at this https URL.

[CV-54] MATA: A Trainable Hierarchical Automaton System for Multi-Agent Visual Reasoning ICLR2026

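[Quick read]: This paper addresses the poor interpretability and tendency to hallucinate of current vision-language models when reasoning over complex queries, as well as the limitation that existing compositional methods rely on a single agent or a hand-crafted pipeline and cannot decide dynamically when to collaborate across complementary agents or compete among overlapping ones. The key to its solution is MATA (Multi-Agent hierarchical Trainable Automaton), a multi-agent system built as a hierarchical finite-state automaton. Top-level state transitions are decided by a trainable hyper agent; each agent corresponds to a state and runs a small rule-based sub-automaton for reliable micro-control; and all agents share a memory, so the execution history is transparent and traceable. Transition-trajectory trees are built and converted into memory-to-next-state pairs to form the MATA-SFT-90K dataset for supervised finetuning (SFT), so that a large language model (LLM) acting as the transition policy understands the task requirements and agent capabilities and can efficiently select the best agent for a visual reasoning task.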

Link: https://arxiv.org/abs/2601.19204
Authors: Zhixi Cai, Fucai Ke, Kevin Leo, Sukai Huang, Maria Garcia de la Banda, Peter J. Stuckey, Hamid Rezatofighi
Affiliation: Monash University
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2026

Abstract:Recent vision-language models have strong perceptual ability, but their implicit reasoning is hard to explain and easily generates hallucinations on complex queries. Compositional methods improve interpretability, but most rely on a single agent or hand-crafted pipeline and cannot decide when to collaborate across complementary agents or compete among overlapping ones. We introduce MATA (Multi-Agent hierarchical Trainable Automaton), a multi-agent system presented as a hierarchical finite-state automaton for visual reasoning whose top-level transitions are chosen by a trainable hyper agent. Each agent corresponds to a state in the hyper automaton, and runs a small rule-based sub-automaton for reliable micro-control. All agents read and write a shared memory, yielding a transparent execution history. To supervise the hyper agent’s transition policy, we build transition-trajectory trees and transform them into memory-to-next-state pairs, forming the MATA-SFT-90K dataset for supervised finetuning (SFT). The finetuned LLM, acting as the transition policy, understands the query and the capacity of agents, and it can efficiently choose the optimal agent to solve the task. Across multiple visual reasoning benchmarks, MATA achieves state-of-the-art results compared with monolithic and compositional baselines. The code and dataset are available at this https URL.
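
The control flow described here, a top-level automaton whose next state is picked by a policy that reads a shared memory, can be illustrated with a toy loop. This is only a sketch of the general pattern: the agent names, the memory layout, and the rule-based `rule_policy` standing in for the finetuned LLM are all hypothetical, and the per-agent sub-automata are reduced to plain functions.

```python
def run_hyper_automaton(query, agents, choose_next, max_steps=8):
    """Run agents as automaton states until the policy returns 'final'.

    agents maps a state name to a function that reads and writes memory;
    choose_next maps the current memory to the next state name.
    """
    memory = {"query": query, "facts": {}, "history": []}
    for _ in range(max_steps):
        state = choose_next(memory)
        if state == "final":
            break
        agents[state](memory)            # the agent updates the shared memory
        memory["history"].append(state)  # transparent execution trace
    return memory


def detector(memory):
    """Toy perception agent with hard-coded detections."""
    memory["facts"]["objects"] = ["cat", "cat", "dog"]


def counter(memory):
    """Toy reasoning agent: counts detections of the queried class."""
    target = memory["query"]["target"]
    memory["facts"]["answer"] = memory["facts"]["objects"].count(target)


def rule_policy(memory):
    """Hand-written stand-in for a learned memory-to-next-state policy."""
    facts = memory["facts"]
    if "objects" not in facts:
        return "detector"
    if "answer" not in facts:
        return "counter"
    return "final"


result = run_hyper_automaton(
    {"target": "cat"},
    {"detector": detector, "counter": counter},
    rule_policy,
)
```

Because the policy only sees the memory, each `(memory, next state)` pair along a run is exactly the kind of record that could serve as a supervised example for a learned policy, which is loosely how the abstract describes the memory-to-next-state training pairs.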

[CV-55] SNR-Edit: Structure-Aware Noise Rectification for Inversion-Free Flow-Based Editing

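[Quick read]: This paper addresses a problem in inversion-free image editing with flow-based generative models: existing methods construct the source trajectory from fixed Gaussian noise, which biases trajectory dynamics and causes structural degradation or quality loss. The key to its solution is the SNR-Edit framework, which uses structure-aware noise rectification to inject segmentation constraints into the initial noise, anchoring the stochastic component of the source trajectory to the real image's implicit inversion position. Without model tuning or inversion, this reduces trajectory drift during source-to-target transport and yields smooth, high-fidelity latent trajectory correction.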

Link: https://arxiv.org/abs/2601.19180
Authors: Lifan Jiang, Boxi Wu, Yuhang Pei, Tianrun Wu, Yongyuan Chen, Yan Zhao, Shiyu Yu, Deng Cai
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Inversion-free image editing using flow-based generative models challenges the prevailing inversion-based pipelines. However, existing approaches rely on fixed Gaussian noise to construct the source trajectory, leading to biased trajectory dynamics and causing structural degradation or quality loss. To address this, we introduce SNR-Edit, a training-free framework achieving faithful Latent Trajectory Correction via adaptive noise control. Mechanistically, SNR-Edit uses structure-aware noise rectification to inject segmentation constraints into the initial noise, anchoring the stochastic component of the source trajectory to the real image’s implicit inversion position and reducing trajectory drift during source–target transport. This lightweight modification yields smoother latent trajectories and ensures high-fidelity structural preservation without requiring model tuning or inversion. Across SD3 and FLUX, evaluations on PIE-Bench and SNR-Bench show that SNR-Edit delivers performance on pixel-level metrics and VLM-based scoring, while adding only about 1s overhead per image.

[CV-56] GTFMN: Guided Texture and Feature Modulation Network for Low-Light Image Enhancement and Super-Resolution

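[Quick read]: This paper addresses the difficulty of reconstruction in Low-light Image Super-Resolution (LLSR), caused by the coupled degradation of low resolution and poor illumination. The key to its solution is a decoupling strategy that splits LLSR into two sub-problems, illumination estimation and texture restoration. Specifically, the network has a dedicated Illumination Stream that predicts a spatially varying illumination map, and an Illumination Guided Modulation Block (IGM Block) that uses this map as an explicit guide to dynamically modulate features in the Texture Stream. The result is spatially adaptive enhancement that strengthens detail recovery in dark regions while preserving structure in well-lit regions.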

Link: https://arxiv.org/abs/2601.19157
Authors: Yongsong Huang, Tzu-Hsuan Peng, Tomo Miyazaki, Xiaofeng Liu, Chun-Ting Chou, Ai-Chun Pang, Shinichiro Omachi
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: © 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Abstract:Low-light image super-resolution (LLSR) is a challenging task due to the coupled degradation of low resolution and poor illumination. To address this, we propose the Guided Texture and Feature Modulation Network (GTFMN), a novel framework that decouples the LLSR task into two sub-problems: illumination estimation and texture restoration. First, our network employs a dedicated Illumination Stream whose purpose is to predict a spatially varying illumination map that accurately captures lighting distribution. Further, this map is utilized as an explicit guide within our novel Illumination Guided Modulation Block (IGM Block) to dynamically modulate features in the Texture Stream. This mechanism achieves spatially adaptive restoration, enabling the network to intensify enhancement in poorly lit regions while preserving details in well-exposed areas. Extensive experiments demonstrate that GTFMN achieves the best performance among competing methods on the OmniNormal5 and OmniNormal15 datasets, outperforming them in both quantitative metrics and visual quality.

[CV-57] LocationAgent: A Hierarchical Agent for Image Geolocation via Decoupling Strategy and Evidence from Parametric Knowledge

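[Quick read]: This paper addresses factual hallucinations and limited generalization in open-world settings that arise in image geolocation when existing methods bake knowledge into static memory. The key to its solution is a hierarchical localization agent, LocationAgent, which separates the reasoning logic kept inside the model from evidence verification handled by external tools. On one side, the RER architecture (Reasoner-Executor-Recorder) uses role separation and context compression to suppress drift in multi-step reasoning; on the other, a suite of clue exploration tools dynamically calls on external geographic knowledge to verify evidence, improving localization accuracy and robustness.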

Link: https://arxiv.org/abs/2601.19155
Authors: Qiujun Li, Zijin Xiao, Xulin Wang, Zhidan Ma, Cheng Yang, Haifeng Li
Affiliation: Central South University; ByteDance
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures, 3 tables

Abstract:Image geolocation aims to infer capture locations based on visual content. Fundamentally, this constitutes a reasoning process composed of hypothesis-verification cycles, requiring models to possess both geospatial reasoning capabilities and the ability to verify evidence against geographic facts. Existing methods typically internalize location knowledge and reasoning patterns into static memory via supervised training or trajectory-based reinforcement fine-tuning. Consequently, these methods are prone to factual hallucinations and generalization bottlenecks in open-world settings or scenarios requiring dynamic knowledge. To address these challenges, we propose a Hierarchical Localization Agent, called LocationAgent. Our core philosophy is to retain hierarchical reasoning logic within the model while offloading the verification of geographic evidence to external tools. To implement hierarchical reasoning, we design the RER architecture (Reasoner-Executor-Recorder), which employs role separation and context compression to prevent the drifting problem in multi-step reasoning. For evidence verification, we construct a suite of clue exploration tools that provide diverse evidence to support location reasoning. Furthermore, to address data leakage and the scarcity of Chinese data in existing datasets, we introduce CCL-Bench (China City Location Bench), an image geolocation benchmark encompassing various scene granularities and difficulty levels. Extensive experiments demonstrate that LocationAgent significantly outperforms existing methods by at least 30% in zero-shot settings.

[CV-58] TFFM: Topology-Aware Feature Fusion Module via Latent Graph Reasoning for Retinal Vessel Segmentation WACV2026

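[Quick read]: This paper addresses the topologically disconnected outputs (broken vessels, gaps) that standard convolutional architectures produce in retinal vessel segmentation. Such outputs can score well at the pixel level yet break the continuity of the vascular tree, so graph-based clinical analysis cannot be carried out reliably. The key to its solution is a topology-aware framework centered on the Topological Feature Fusion Module (TFFM), which maps local features into a latent graph space and uses Graph Attention Networks to capture global structural dependencies that fixed receptive fields struggle to model. A hybrid loss combines Tversky loss, to handle class imbalance, with soft clDice loss, to explicitly penalize topological disconnects. This markedly improves vessel connectivity: experiments show about 38% less vessel fragmentation than baselines, yielding topologically coherent vascular trees usable for automated biomarker quantification.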

Link: https://arxiv.org/abs/2601.19136
Authors: Iftekhar Ahmed, Shakib Absar, Aftar Ahmad Sami, Shadman Sakib, Debojyoti Biswas, Seraj Al Mahmud Mostafa
Affiliation: Leading University; University of Houston; University of Maryland, Baltimore County; Pennsylvania State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in WACV 2026 @ P2P-workshop as a full paper and selected for oral presentation

Abstract:Precise segmentation of retinal arteries and veins supports the diagnosis of systemic cardiovascular conditions. However, standard convolutional architectures often yield topologically disjointed segmentations, characterized by gaps and discontinuities that render reliable graph-based clinical analysis impossible despite high pixel-level accuracy. To address this, we introduce a topology-aware framework engineered to maintain vascular connectivity. Our architecture fuses a Topological Feature Fusion Module (TFFM) that maps local feature representations into a latent graph space, deploying Graph Attention Networks to capture global structural dependencies often missed by fixed receptive fields. Furthermore, we drive the learning process with a hybrid objective function, coupling Tversky loss for class imbalance with soft clDice loss to explicitly penalize topological disconnects. Evaluation on the Fundus-AVSeg dataset reveals state-of-the-art performance, achieving a combined Dice score of 90.97% and a 95% Hausdorff Distance of 3.50 pixels. Notably, our method decreases vessel fragmentation by approximately 38% relative to baselines, yielding topologically coherent vascular trees viable for automated biomarker quantification. We open-source our code at this https URL.
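
For readers unfamiliar with the Tversky term in the hybrid objective, the standard definition is TI = TP / (TP + α·FP + β·FN), with loss 1 − TI. The sketch below implements that textbook form on flat lists of soft predictions. The `alpha`, `beta`, and `eps` values are illustrative assumptions, since the abstract does not give the paper's settings, and the soft clDice term is not shown.

```python
def tversky_loss(pred, target, alpha=0.5, beta=0.5, eps=1e-7):
    """1 - Tversky index for soft predictions in [0, 1] and binary targets.

    alpha weights false positives and beta weights false negatives;
    alpha = beta = 0.5 reduces to the soft Dice loss.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - index


# Thin-vessel toy case: two vessel pixels, the prediction misses one of them.
target = [1, 1, 0, 0]
pred = [1, 0, 0, 0]
dice_like = tversky_loss(pred, target)                     # alpha = beta = 0.5
recall_heavy = tversky_loss(pred, target, alpha=0.3, beta=0.7)
```

Raising `beta` above `alpha` makes a missed vessel pixel cost more than a spurious one, which is the usual reason to prefer Tversky over Dice when the foreground class is rare.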

[CV-59] QA-ReID: Quality-Aware Query-Adaptive Convolution Leveraging Fused Global and Structural Cues for Clothes-Changing ReID

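[Quick read]: This paper addresses clothes-changing person re-identification (CC-ReID), that is, matching a person's identity correctly after the person has changed clothes. The core difficulty is the large visual difference caused by changes in clothing appearance. The key to its solution is the Quality-Aware Dual-Branch Matching (QA-ReID) framework, which jointly models RGB features and semantic parsing features to capture global appearance and clothing-invariant structural cues respectively, and fuses these heterogeneous features adaptively with a multi-modal attention mechanism. At the matching stage, a Quality-Aware Query Adaptive Convolution (QAConv-QA) adds pixel-level importance weighting and bidirectional consistency constraints, improving robustness to clothing changes.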

Link: https://arxiv.org/abs/2601.19133
Authors: Yuxiang Wang, Kunming Jiang, Tianxiang Zhang, Ke Tian, Gaozhe Jiang
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Unlike conventional person re-identification (ReID), clothes-changing ReID (CC-ReID) presents severe challenges due to substantial appearance variations introduced by clothing changes. In this work, we propose the Quality-Aware Dual-Branch Matching (QA-ReID), which jointly leverages RGB-based features and parsing-based representations to model both global appearance and clothing-invariant structural cues. These heterogeneous features are adaptively fused through a multi-modal attention module. At the matching stage, we further design the Quality-Aware Query Adaptive Convolution (QAConv-QA), which incorporates pixel-level importance weighting and bidirectional consistency constraints to enhance robustness against clothing variations. Extensive experiments demonstrate that QA-ReID achieves state-of-the-art performance on multiple benchmarks, including PRCC, LTCC, and VC-Clothes, and significantly outperforms existing approaches under cross-clothing scenarios.

[CV-60] CLIP-Guided Unsupervised Semantic-Aware Exposure Correction

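[Quick Read]: This paper addresses the loss of detail, color distortion, and reduced contrast that improper exposure causes in real-world images, focusing on two challenges: existing methods ignore object-level regional semantic information, which leads to color-shift artifacts; and real images lack ground-truth labels, which are very costly to produce by hand. The key to the solution is an unsupervised semantic-aware exposure correction network with three main contributions: (1) an adaptive semantic-aware fusion module that embeds semantic information extracted by a pre-trained Fast Segment Anything Model (FastSAM) into a shared image feature space, strengthening local semantic consistency; (2) a CLIP-based pseudo-ground-truth generator, fine-tuned to identify the exposure condition automatically and guide a tailored correction; (3) a semantic-prompt consistency loss that uses the priors of FastSAM and CLIP to constrain semantic consistency and image-prompt alignment, enabling unsupervised training without manual annotation.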

Link: https://arxiv.org/abs/2601.19129
Authors: Puzhen Wu, Han Weng, Quan Zheng, Yi Zhan, Hewei Wang, Yiming Li, Jiahui Han, Rui Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Improper exposure often leads to severe loss of details, color distortion, and reduced contrast. Exposure correction still faces two critical challenges: (1) the ignorance of object-wise regional semantic information causes the color shift artifacts; (2) real-world exposure images generally have no ground-truth labels, and its labeling entails massive manual editing. To tackle the challenges, we propose a new unsupervised semantic-aware exposure correction network. It contains an adaptive semantic-aware fusion module, which effectively fuses the semantic information extracted from a pre-trained Fast Segment Anything Model into a shared image feature space. Then the fused features are used by our multi-scale residual spatial mamba group to restore the details and adjust the exposure. To avoid manual editing, we propose a pseudo-ground truth generator guided by CLIP, which is fine-tuned to automatically identify exposure situations and instruct the tailored corrections. Also, we leverage the rich priors from the FastSAM and CLIP to develop a semantic-prompt consistency loss to enforce semantic consistency and image-prompt alignment for unsupervised training. Comprehensive experimental results illustrate the effectiveness of our method in correcting real-world exposure images and outperforms state-of-the-art unsupervised methods both numerically and visually.

[CV-61] Resolving Primitive-Sharing Ambiguity in Long-Tailed Industrial Point Cloud Segmentation via Spatial Context Constraints

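[Quick Read]: This paper addresses the systematic misclassification of safety-critical components (such as reducers and valves) in industrial point cloud segmentation, caused by extreme class imbalance (e.g., 215:1) combined with geometric ambiguity (tail classes share cylindrical primitives with head classes). Existing frequency-based re-weighting methods only mitigate the statistical imbalance and cannot resolve ambiguity arising from local geometric similarity. The key to the solution is two architecture-agnostic spatial context constraints: (1) Boundary-CB, an entropy-based constraint that emphasizes ambiguous boundary regions; (2) Density-CB, a density-based constraint that compensates for scan-dependent variations. Both are integrated into the Class-Balanced (CB) Loss framework as plug-and-play modules that require only replacing the loss function, with no change to the network. Tail-class performance improves (reducer IoU rises from 0% to 21.12%; valve improves by 24.3% relative) while head-class accuracy is preserved, so the geometric ambiguity is resolved without the typical head-tail trade-off, supporting reliable identification of safety-critical components in Digital Twin applications.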

Link: https://arxiv.org/abs/2601.19128
Authors: Chao Yin, Qing Han, Zhiwei Hou, Yue Liu, Anjin Dai, Hongda Hu, Ji Yang, Wei Yao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Industrial point cloud segmentation for Digital Twin construction faces a persistent challenge: safety-critical components such as reducers and valves are systematically misclassified. These failures stem from two compounding factors: such components are rare in training data, yet they share identical local geometry with dominant structures like pipes. This work identifies a dual crisis unique to industrial 3D data: extreme class imbalance (a 215:1 ratio) compounded by geometric ambiguity, where most tail classes share cylindrical primitives with head classes. Existing frequency-based re-weighting methods address statistical imbalance but cannot resolve geometric ambiguity. We propose spatial context constraints that leverage neighborhood prediction consistency to disambiguate locally similar structures. Our approach extends the Class-Balanced (CB) Loss framework with two architecture-agnostic mechanisms: (1) Boundary-CB, an entropy-based constraint that emphasizes ambiguous boundaries, and (2) Density-CB, a density-based constraint that compensates for scan-dependent variations. Both integrate as plug-and-play modules without network modifications, requiring only loss function replacement. On the Industrial3D dataset (610M points from water treatment facilities), our method achieves 55.74% mIoU with 21.7% relative improvement on tail-class performance (29.59% vs. 24.32% baseline) while preserving head-class accuracy (88.14%). Components with primitive-sharing ambiguity show dramatic gains: reducer improves from 0% to 21.12% IoU; valve improves by 24.3% relative. This resolves geometric ambiguity without the typical head-tail trade-off, enabling reliable identification of safety-critical components for automated knowledge extraction in Digital Twin applications.
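
The Class-Balanced Loss framework that the abstract extends is commonly formulated with an "effective number of samples" per class. The sketch below shows only that baseline weighting, under the assumption that the standard formulation applies; the value of `beta` is illustrative, and the paper's Boundary-CB and Density-CB terms are not reproduced.

```python
import numpy as np

def class_balanced_weights(counts, beta=0.999):
    """Per-class weights from the effective number of samples.

    counts : number of training samples (here, points) per class
    beta   : hyperparameter in [0, 1); the value is an assumption
    """
    counts = np.asarray(counts, dtype=float)
    # Effective number: E_n = (1 - beta^n) / (1 - beta)
    effective_num = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    weights = 1.0 / effective_num
    # Normalize so the weights sum to the number of classes
    return weights * len(counts) / weights.sum()
```

For a toy 215:1 split, `class_balanced_weights([215, 1])` gives the rare class a much larger weight than the frequent one. As the abstract notes, this kind of frequency-based weighting alone does not separate classes that look locally identical.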

[CV-62] Implicit Non-Causal Factors are Out via Dataset Splitting for Domain Generalization Object Detection

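[Quick Read]: This paper addresses the drop in generalization in open-world object detection caused by insufficiently domain-invariant representations, in particular the cross-domain differences introduced by implicit non-causal factors. Existing methods based on domain adversarial learning (DAL) tend to overlook these factors for two reasons: domain-discriminator methods depend on sparse domain labels (one label per dataset) and so capture only explicit non-causal factors; and non-causal factors induced by unidentified data bias are too implicit to be discerned by the conventional DAL paradigm. The key to the solution is an improved DAL method, GB-DAL, with two components: (1) a Prototype-based Granular Ball Splitting (PGBS) module that generates denser sub-domains from limited datasets, exposing more potential non-causal factors; (2) a Simulated Non-causal Factors (SNF) module, a data augmentation strategy based on adversarial-like perturbations that reduces the implicitness of non-causal factors and aids GB-DAL training. Experiments on multiple benchmarks show improved generalization in novel scenarios.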

Link: https://arxiv.org/abs/2601.19127
Authors: Zhilong Zhang, Lei Zhang, Qing He, Shuyin Xia, Guoyin Wang, Fuxiang Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear in IJCV

Abstract:Open world object detection faces a significant challenge in domain-invariant representation, i.e., implicit non-causal factors. Most domain generalization (DG) methods based on domain adversarial learning (DAL) pay much attention to learn domain-invariant information, but often overlook the potential non-causal factors. We unveil two critical causes: 1) The domain discriminator-based DAL method is subject to the extremely sparse domain label, i.e., assigning only one domain label to each dataset, thus can only associate explicit non-causal factor, which is incredibly limited. 2) The non-causal factors, induced by unidentified data bias, are excessively implicit and cannot be solely discerned by conventional DAL paradigm. Based on these key findings, inspired by the Granular-Ball perspective, we propose an improved DAL method, i.e., GB-DAL. The proposed GB-DAL utilizes Prototype-based Granular Ball Splitting (PGBS) module to generate more dense domains from limited datasets, akin to more fine-grained granular balls, indicating more potential non-causal factors. Inspired by adversarial perturbations akin to non-causal factors, we propose a Simulated Non-causal Factors (SNF) module as a means of data augmentation to reduce the implicitness of non-causal factors, and facilitate the training of GB-DAL. Comparative experiments on numerous benchmarks demonstrate that our method achieves better generalization performance in novel circumstances.

[CV-63] FBSDiff: Improved Frequency Band Substitution of Diffusion Features for Efficient and Highly Controllable Text-Driven Image-to-Image Translation

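[Quick Read]: This paper addresses how to achieve efficient, flexible, and controllable text-driven image-to-image (I2I) translation, in particular how to let the source image's visual information and the text prompt jointly guide generation without model training or fine-tuning. The key to the solution is the FBSDiff framework, which works from a frequency-domain perspective: by dynamically substituting the low-, mid-, and high-frequency bands of diffusion features, it realizes appearance-guided, layout-guided, and contour-guided I2I translation respectively, and tuning the bandwidth of the substituted band gives continuous control over the strength of the correlation between the images. FBSDiff++ further improves inference efficiency, supports source images of arbitrary resolution, and extends the method to localized manipulation and style-specific content creation.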

Link: https://arxiv.org/abs/2601.19115
Authors: Xiang Gao, Yunpeng Jia
Affiliations: Beijing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With large-scale text-to-image (T2I) diffusion models achieving significant advancements in open-domain image creation, increasing attention has been focused on their natural extension to the realm of text-driven image-to-image (I2I) translation, where a source image acts as visual guidance to the generated image in addition to the textual guidance provided by the text prompt. We propose FBSDiff, a novel framework adapting off-the-shelf T2I diffusion model into the I2I paradigm from a fresh frequency-domain perspective. Through dynamic frequency band substitution of diffusion features, FBSDiff realizes versatile and highly controllable text-driven I2I in a plug-and-play manner (without need for model training, fine-tuning, or online optimization), allowing appearance-guided, layout-guided, and contour-guided I2I translation by progressively substituting low-frequency band, mid-frequency band, and high-frequency band of latent diffusion features, respectively. In addition, FBSDiff flexibly enables continuous control over I2I correlation intensity simply by tuning the bandwidth of the substituted frequency band. To further promote image translation efficiency, flexibility, and functionality, we propose FBSDiff++ which improves upon FBSDiff mainly in three aspects: (1) accelerate inference speed by a large margin ($8.9\times$ speedup in inference) with refined model architecture; (2) improve the Frequency Band Substitution module to allow for input source images of arbitrary resolution and aspect ratio; (3) extend model functionality to enable localized image manipulation and style-specific content creation with only subtle adjustments to the core method. Extensive qualitative and quantitative experiments verify superiority of FBSDiff++ in I2I translation visual quality, efficiency, versatility, and controllability compared to related advanced approaches.
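
To make the idea of frequency band substitution concrete, the following toy sketch swaps the low-frequency band of one 2D array for that of another. It uses NumPy's 2D FFT on a single real-valued array as a stand-in; the transform, band boundaries, and substitution schedule that FBSDiff actually uses are not given in the abstract, so the function name `substitute_low_band` and the `radius` parameter are hypothetical.

```python
import numpy as np

def substitute_low_band(target_feat, source_feat, radius):
    """Replace the low-frequency band of target_feat with that of source_feat.

    Both inputs are real 2D arrays of the same shape. Frequencies whose
    integer index distance from DC is <= radius are taken from the source;
    all others are kept from the target.
    """
    assert target_feat.shape == source_feat.shape
    h, w = target_feat.shape
    fy = np.fft.fftfreq(h)[:, None] * h   # integer row frequencies
    fx = np.fft.fftfreq(w)[None, :] * w   # integer column frequencies
    mask = np.sqrt(fy ** 2 + fx ** 2) <= radius
    t_spec = np.fft.fft2(target_feat)
    s_spec = np.fft.fft2(source_feat)
    mixed = np.where(mask, s_spec, t_spec)
    # The mask is symmetric under frequency negation, so the result is real
    return np.real(np.fft.ifft2(mixed))
```

Widening `radius` copies more of the source into the output, which mirrors the abstract's point that the bandwidth of the substituted band controls how strongly the result correlates with the source image.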

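[CV-64] Reg-TTR, Test-Time Refinement for Fast, Robust, and Accurate Image Registration

[Quick Read]: This paper addresses the trade-off between accuracy and efficiency in current image registration methods: deep learning methods greatly speed up inference but lack robustness under domain shift, whereas traditional iterative methods are robust but computationally slow. The key to the solution is Reg-TTR, a test-time refinement (TTR) framework that refines the output of a pre-trained model at inference, combining the efficiency of deep learning with the accuracy of conventional registration techniques. It improves registration accuracy at the cost of only 21% additional inference time (0.56 s), narrowing the performance gap between registration foundation models and the best methods trained on specific datasets.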

Link: https://arxiv.org/abs/2601.19114
Authors: Lin Chen, Yue He, Fengting Zhang, Yaonan Wang, Fengming Lin, Xiang Chen, Min Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Traditional image registration methods are robust but slow due to their iterative nature. While deep learning has accelerated inference, it often struggles with domain shifts. Emerging registration foundation models offer a balance of speed and robustness, yet typically cannot match the peak accuracy of specialized models trained on specific datasets. To mitigate this limitation, we propose Reg-TTR, a test-time refinement framework that synergizes the complementary strengths of both deep learning and conventional registration techniques. By refining the predictions of pre-trained models at inference, our method delivers significantly improved registration accuracy at a modest computational cost, requiring only 21% additional inference time (0.56s). We evaluate Reg-TTR on two distinct tasks and show that it achieves state-of-the-art (SOTA) performance while maintaining inference speeds close to previous deep learning methods. As foundation models continue to emerge, our framework offers an efficient strategy to narrow the performance gap between registration foundation models and SOTA methods trained on specialized datasets. The source code will be publicly available following the acceptance of this work.

[CV-65] Glance and Focus Reinforcement for Pan-cancer Screening ICLR2026

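[Quick Read]: This paper addresses pan-cancer screening in large-scale CT scans. The core problem is that existing AI methods struggle to localize many types of tiny lesions in large CT volumes: extreme foreground-background imbalance keeps models from focusing on diseased regions, while redundant attention to healthy tissue lowers efficiency and raises false positives. The key to the solution is GF-Screen, a Glance and Focus reinforcement learning framework: a Glance model first selects sub-volumes likely to contain lesions, and a Focus model then segments them precisely; the Focus model's segmentation results serve as the reward signal for optimizing the Glance model. The authors also introduce a group relative learning paradigm that uses relative comparison within a group of sub-volumes to prioritize high-advantage predictions and discard low-advantage ones, improving efficiency and reducing false positives.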

Link: https://arxiv.org/abs/2601.19103
Authors: Linshan Wu, Jiaxin Zhuang, Hao Chen
Affiliations: The Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICLR 2026. Code is available at this https URL

Abstract:Pan-cancer screening in large-scale CT scans remains challenging for existing AI methods, primarily due to the difficulty of localizing diverse types of tiny lesions in large CT volumes. The extreme foreground-background imbalance significantly hinders models from focusing on diseased regions, while redundant focus on healthy regions not only decreases the efficiency but also increases false positives. Inspired by radiologists’ glance and focus diagnostic strategy, we introduce GF-Screen, a Glance and Focus reinforcement learning framework for pan-cancer screening. GF-Screen employs a Glance model to localize the diseased regions and a Focus model to precisely segment the lesions, where segmentation results of the Focus model are leveraged to reward the Glance model via Reinforcement Learning (RL). Specifically, the Glance model crops a group of sub-volumes from the entire CT volume and learns to select the sub-volumes with lesions for the Focus model to segment. Given that the selecting operation is non-differentiable for segmentation training, we propose to employ the segmentation results to reward the Glance model. To optimize the Glance model, we introduce a novel group relative learning paradigm, which employs group relative comparison to prioritize high-advantage predictions and discard low-advantage predictions within sub-volume groups, not only improving efficiency but also reducing false positives. In this way, for the first time, we effectively extend cutting-edge RL techniques to tackle the specific challenges in pan-cancer screening. Extensive experiments on 16 internal and 7 external datasets across 9 lesion types demonstrated the effectiveness of GF-Screen. Notably, GF-Screen leads the public validation leaderboard of MICCAI FLARE25 pan-cancer challenge, surpassing the FLARE24 champion solution by a large margin (+25.6% DSC and +28.2% NSD).
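
The "group relative comparison" in the abstract can be pictured with a small sketch. The version below normalizes each sub-volume's reward by the mean and standard deviation of its group, which is a common choice in group-relative policy optimization; whether GF-Screen uses exactly this formula is an assumption, and both function names are hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sub-volume relative to its own group.

    rewards : one scalar reward per sub-volume, e.g. a segmentation score
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def keep_high_advantage(rewards, threshold=0.0):
    """Indices of sub-volumes whose advantage exceeds the threshold."""
    adv = group_relative_advantages(rewards)
    return np.flatnonzero(adv > threshold)
```

For a group with rewards `[0.9, 0.1, 0.0, 0.8]`, only the first and last sub-volumes score above the group mean, so only they would be kept for further processing in this toy setup.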

[CV-66] m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning

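[Quick Read]: This paper addresses the brittleness of current vision-language models (VLMs) on spatial reasoning tasks, specifically their difficulty in geometrically aligning an abstract north-up overhead map with a first-person Street View image in order to infer the camera's viewing direction. The core contribution is m2sv (map-to-street-view), a scalable spatial reasoning benchmark comprising m2sv-20k, a geographically diverse dataset with controlled ambiguity, and m2sv-sft-11k, a set of structured reasoning traces for supervised fine-tuning. The best evaluated VLM reaches only 65.2% accuracy on m2sv, far below the human baseline of 95%, exposing systematic gaps in geometric alignment, evidence aggregation, and reasoning consistency and motivating future work on grounded spatial reasoning across viewpoints.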

Link: https://arxiv.org/abs/2601.19099
Authors: Yosub Shin, Michael Buriek, Igor Molybog
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision–language models (VLMs) achieve strong performance on many multimodal benchmarks but remain brittle on spatial reasoning tasks that require aligning abstract overhead representations with egocentric views. We introduce m2sv, a scalable benchmark for map-to-street-view spatial reasoning that asks models to infer camera viewing direction by aligning a north-up overhead map with a Street View image captured at the same real-world intersection. We release m2sv-20k, a geographically diverse benchmark with controlled ambiguity, along with m2sv-sft-11k, a curated set of structured reasoning traces for supervised fine-tuning. Despite strong performance on existing multimodal benchmarks, the best evaluated VLM achieves only 65.2% accuracy on m2sv, far below the human baseline of 95%. While supervised fine-tuning and reinforcement learning yield consistent gains, cross-benchmark evaluations reveal limited transfer. Beyond aggregate accuracy, we systematically analyze difficulty in map-to-street-view reasoning using both structural signals and human effort, and conduct an extensive failure analysis of adapted open models. Our findings highlight persistent gaps in geometric alignment, evidence aggregation, and reasoning consistency, motivating future work on grounded spatial reasoning across viewpoints.

[CV-67] Privacy-Preserving Model Transcription with Differentially Private Synthetic Distillation

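[Quick Read]: This paper addresses the privacy leakage risk of deployed deep learning models, where an attacker can recover sensitive training data or label knowledge from the model. To achieve data-free, privacy-preserving model conversion, the authors propose a cooperative-competitive learning method called differentially private synthetic distillation. Its core is a unified framework of three players (generator, teacher, student) optimized alternately: (1) the generator learns to produce synthetic data; (2) the teacher and student compute differentially private labels on the synthetic data, via flexible data or label noise perturbation; (3) the student is updated with the noisy labels, while the generator is trained adversarially with the student as discriminator. The approach comes with theoretical guarantees of differential privacy and convergence, the resulting generator can produce synthetic data for downstream tasks, and experiments report better performance than 26 state-of-the-art methods.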

Link: https://arxiv.org/abs/2601.19090
Authors: Bochao Liu, Shiming Ge, Pengju Wang, Shikun Li, Tongliang Liu
Affiliations: Institute of Information Engineering at Chinese Academy of Sciences; Beijing Institute of Astronautical Systems Engineering; Trustworthy Machine Learning Lab, School of Computer Science, The University of Sydney
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI)

Abstract:While many deep learning models trained on private datasets have been deployed in various practical tasks, they may pose a privacy leakage risk as attackers could recover informative data or label knowledge from models. In this work, we present privacy-preserving model transcription, a data-free model-to-model conversion solution to facilitate model deployment with a privacy guarantee. To this end, we propose a cooperative-competitive learning approach termed differentially private synthetic distillation that learns to convert a pretrained model (teacher) into its privacy-preserving counterpart (student) via a trainable generator without access to private data. The learning collaborates with three players in a unified framework and performs alternate optimization: (i) the generator is learned to generate synthetic data, (ii) the teacher and student accept the synthetic data and compute differentially private labels by flexible data or label noisy perturbation, and (iii) the student is updated with noisy labels and the generator is updated by taking the student as a discriminator for adversarial training. We theoretically prove that our approach can guarantee differential privacy and convergence. The transcribed student has good performance and privacy protection, while the resulting generator can generate private synthetic data for downstream tasks. Extensive experiments clearly demonstrate that our approach outperforms 26 state-of-the-art methods.

[CV-68] EPAS: Efficient Training with Progressive Activation Sharing

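[Quick Read]: This paper addresses the computational inefficiency in Transformer training and inference caused by redundant QK (query-key) or KV (key-value) activations in deeper attention layers. The key to the solution is Efficient training with Progressive Activation Sharing (EPAS): during training, decoder layers are gradually switched to activation-sharing mode, with the sharing region growing from the deep end of the model toward the shallow end, which cuts redundant compute and raises throughput. The method yields up to 11.1% higher training throughput, and at inference the length of the sharing region can be adjusted to the compute budget, giving up to 29% higher inference throughput, while the loss curve stays similar to the baseline's.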

Link: https://arxiv.org/abs/2601.19089
Authors: Rezaul Karim, Maryam Dialameh, Yang Liu, Boxing Chen, Walid Ahmed
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: This is a preprint of a paper accepted at the 39th Canadian Conference on Artificial Intelligence (Canadian AI 2026)

Abstract:We present a novel method for Efficient training with Progressive Activation Sharing (EPAS). This method bridges the progressive training paradigm with the phenomenon of redundant QK (or KV) activations across deeper layers of transformers. EPAS gradually grows a sharing region during training by switching decoder layers to activation sharing mode. This results in a throughput increase due to reduced compute. To utilize deeper layer redundancy, the sharing region starts from the deep end of the model and grows towards the shallow end. The EPAS trained models allow for variable region lengths of activation sharing for different compute budgets during inference. Empirical evaluations with QK activation sharing in LLaMA models ranging from 125M to 7B parameters show up to an 11.1% improvement in training throughput and up to a 29% improvement in inference throughput while maintaining a similar loss curve to the baseline models. Furthermore, applying EPAS in continual pretraining to transform TinyLLaMA into an attention-sharing model yields up to a 10% improvement in average accuracy over state-of-the-art methods, emphasizing the significance of progressive training in cross-layer activation sharing models.

[CV-69] Pixel-Grounded Retrieval for Knowledgeable Large Multimodal Models

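[Quick Read]: This paper addresses inaccurate reasoning in Visual Question Answering (VQA) caused by the lack of coordination between fine-grained image perception and external factual knowledge. Existing multimodal Retrieval-Augmented Generation (MM-RAG) systems improve factual accuracy but have no internal policy for when and how to retrieve. The key to the solution is PixSearch, the first end-to-end Segmenting Large Multimodal Model (LMM): during encoding it emits search tokens to trigger retrieval, chooses the query modality (text, image, or region) itself, and generates pixel-level masks that serve directly as visual queries, removing the need for modular pipelines (detectors, segmenters, captioners, and so on). Two-stage supervised fine-tuning teaches retrieval timing and query selection while preserving segmentation ability. On egocentric and entity-centric VQA benchmarks, PixSearch improves factual consistency and generalization, with a 19.7% relative accuracy gain on CRAG-MM over whole-image retrieval.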

Link: https://arxiv.org/abs/2601.19060
Authors: Jeonghwan Kim, Renjie Tao, Sanat Sharma, Jiaqi Wang, Kai Sun, Zhaojiang Lin, Seungwhan Moon, Lambert Mathias, Anuj Kumar, Heng Ji, Xin Luna Dong
Affiliations: Meta
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Visual Question Answering (VQA) often requires coupling fine-grained perception with factual knowledge beyond the input image. Prior multimodal Retrieval-Augmented Generation (MM-RAG) systems improve factual grounding but lack an internal policy for when and how to retrieve. We propose PixSearch, the first end-to-end Segmenting Large Multimodal Model (LMM) that unifies region-level perception and retrieval-augmented reasoning. During encoding, PixSearch emits search tokens to trigger retrieval, selects query modalities (text, image, or region), and generates pixel-level masks that directly serve as visual queries, eliminating the reliance on modular pipelines (detectors, segmenters, captioners, etc.). A two-stage supervised fine-tuning regimen with search-interleaved supervision teaches retrieval timing and query selection while preserving segmentation ability. On egocentric and entity-centric VQA benchmarks, PixSearch substantially improves factual consistency and generalization, yielding a 19.7% relative gain in accuracy on CRAG-MM compared to whole image retrieval, while retaining competitive reasoning performance on various VQA and text-only QA tasks.

[CV-70] NuiWorld: Exploring a Scalable Framework for End-to-End Controllable World Generation

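[Quick Read]: This paper addresses three challenges in world generation: controllability, scalability, and efficiency. Existing methods are limited by data scarcity, by fixed-resolution representations that degrade fidelity for large scenes, and by the high inference cost of training-free approaches. The key to the solution is the NuiWorld framework: first, a generative bootstrapping strategy synthesizes diverse scene data from a few input images to ease data scarcity; second, pseudo sketch labels enable controllable generation with some generalization to unseen sketches; third, scenes are represented as variable-size scene chunks compressed into a flattened vector-set representation, which greatly shortens the token length and improves training and inference efficiency while keeping geometric fidelity.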

Link: https://arxiv.org/abs/2601.19048
Authors: Han-Hung Lee, Cheng-Yu Yang, Yu-Lun Liu, Angel X. Chang
Affiliations: Simon Fraser University; National Yang Ming Chiao Tung University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:World generation is a fundamental capability for applications like video games, simulation, and robotics. However, existing approaches face three main obstacles: controllability, scalability, and efficiency. End-to-end scene generation models have been limited by data scarcity, while object-centric generation approaches rely on fixed-resolution representations, degrading fidelity for larger scenes. Training-free approaches, while flexible, are often slow and computationally expensive at inference time. We present NuiWorld, a framework that attempts to address these challenges. To overcome data scarcity, we propose a generative bootstrapping strategy that starts from a few input images. Leveraging recent 3D reconstruction and expandable scene generation techniques, we synthesize scenes of varying sizes and layouts, producing enough data to train an end-to-end model. Furthermore, our framework enables controllability through pseudo sketch labels, and demonstrates a degree of generalization to previously unseen sketches. Our approach represents scenes as a collection of variable scene chunks, which are compressed into a flattened vector-set representation. This significantly reduces the token length for large scenes, enabling consistent geometric fidelity across scene sizes while improving training and inference efficiency.

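[CV-71] NC-Reg: Neural Cortical Maps for Rigid Registration

[Quick Read]: This paper addresses the inefficiency, limited resolution, and difficulty of optimization of traditional discrete structures (such as meshes and grids) for representing cortical feature maps. The key to the solution is Neural Cortical Maps, a continuous and compact neural representation that learns from meshes of arbitrary size and provides features at any resolution. It enables efficient optimization on the sphere, running up to 30 times faster than classic barycentric interpolation for the same number of iterations. Combining it with gradient descent and a simulated annealing strategy, the authors propose NC-Reg, an algorithm for rigid registration of cortical surfaces that reaches sub-degree accuracy (<1°) and shows promise as a robust pre-alignment strategy in clinical settings.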

Link: https://arxiv.org/abs/2601.19042
Authors: Ines Vati, Pierrick Bourgeat, Rodrigo Santa Cruz, Vincent Dore, Olivier Salvado, Clinton Fookes, Léo Lebrat
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ISBI 2026

Abstract:We introduce neural cortical maps, a continuous and compact neural representation for cortical feature maps, as an alternative to traditional discrete structures such as grids and meshes. It can learn from meshes of arbitrary size and provide learnt features at any resolution. Neural cortical maps enable efficient optimization on the sphere and achieve runtimes up to 30 times faster than classic barycentric interpolation (for the same number of iterations). As a proof of concept, we investigate rigid registration of cortical surfaces and propose NC-Reg, a novel iterative algorithm that involves the use of neural cortical feature maps, gradient descent optimization and a simulated annealing strategy. Through ablation studies and subject-to-template experiments, our method demonstrates sub-degree accuracy ($<1^\circ$ from the global optimum), and serves as a promising robust pre-alignment strategy, which is critical in clinical settings.

[CV-72] Non-Invasive 3D Wound Measurement with RGB-D Imaging

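[Quick Read]: This paper addresses the need for accurate, efficient measurement in chronic wound monitoring and management. Traditional manual measurement is subjective and poorly repeatable, and so falls short of the clinical need to track objective quantities (perimeter, surface area, dimensions) over time. The key to the solution is a fast, non-invasive 3D wound measurement algorithm based on RGB-D imaging, which combines RGB-D odometry with B-spline surface reconstruction to generate detailed 3D wound meshes and compute clinically relevant measurements automatically. On realistic silicone wound phantoms the method reaches sub-millimetre reconstruction accuracy, with low variability across repeated captures and strong agreement with manual assessment; it also outperforms a state-of-the-art object-centric RGB-D reconstruction method while keeping runtimes suitable for real-time clinical deployment.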

Link: https://arxiv.org/abs/2601.19014
Authors: Lena Harkämper, Leo Lebrat, David Ahmedt-Aristizabal, Olivier Salvado, Mattias Heinrich, Rodrigo Santa Cruz
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Chronic wound monitoring and management require accurate and efficient wound measurement methods. This paper presents a fast, non-invasive 3D wound measurement algorithm based on RGB-D imaging. The method combines RGB-D odometry with B-spline surface reconstruction to generate detailed 3D wound meshes, enabling automatic computation of clinically relevant wound measurements such as perimeter, surface area, and dimensions. We evaluated our system on realistic silicone wound phantoms and measured sub-millimetre 3D reconstruction accuracy compared with high-resolution ground-truth scans. The extracted measurements demonstrated low variability across repeated captures and strong agreement with manual assessments. The proposed pipeline also outperformed a state-of-the-art object-centric RGB-D reconstruction method while maintaining runtimes suitable for real-time clinical deployment. Our approach offers a promising tool for automated wound assessment in both clinical and remote healthcare settings.
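
Once a 3D wound mesh exists, surface area and perimeter follow from elementary geometry. The sketch below assumes a triangle mesh given as a vertex array and a face-index array, with the wound outline being the mesh boundary; these input conventions and both function names are assumptions for illustration, not the paper's pipeline.

```python
import numpy as np

def mesh_surface_area(vertices, faces):
    """Sum of triangle areas; vertices is (N, 3), faces is (M, 3) indices."""
    v = np.asarray(vertices, dtype=float)
    f = np.asarray(faces, dtype=int)
    a, b, c = v[f[:, 0]], v[f[:, 1]], v[f[:, 2]]
    # Area of each triangle is half the norm of the edge cross product
    return 0.5 * np.linalg.norm(np.cross(b - a, c - a), axis=1).sum()

def mesh_boundary_perimeter(vertices, faces):
    """Total length of edges that belong to exactly one triangle."""
    v = np.asarray(vertices, dtype=float)
    f = np.asarray(faces, dtype=int)
    edges = np.concatenate([f[:, [0, 1]], f[:, [1, 2]], f[:, [2, 0]]])
    edges = np.sort(edges, axis=1)  # make edges direction-independent
    uniq, counts = np.unique(edges, axis=0, return_counts=True)
    boundary = uniq[counts == 1]
    return np.linalg.norm(v[boundary[:, 0]] - v[boundary[:, 1]], axis=1).sum()
```

The units of the result follow the units of the vertex coordinates, so a mesh in millimetres yields area in square millimetres and perimeter in millimetres.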

[CV-73] Anatomically-aware conformal prediction for medical image segmentation with random walks

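[Quick Read]: This paper addresses uncertainty quantification for the reliable deployment of deep learning in medical imaging: how to produce anatomically meaningful prediction sets while keeping statistical validity. Standard conformal prediction (CP) applied to segmentation often ignores anatomical context, giving fragmented, spatially incoherent, over-segmented prediction sets of limited clinical use. The key to the solution is Random-Walk Conformal Prediction (RW-CP), a framework that can be added on top of any segmentation model: it builds a k-nearest-neighbour graph from the features of a pre-trained vision foundation model and regularizes the non-conformity scores by random-walk diffusion, which makes the prediction sets more spatially coherent and stable, with more continuous and anatomically plausible boundaries, while keeping rigorous marginal coverage. At an allowable error rate of α = 0.1, it improves on standard CP baselines by up to 35.4%.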

Link: https://arxiv.org/abs/2601.18997
Authors: Mélanie Gaillochet, Christian Desrosiers, Hervé Lombaert
Affiliations: École de Technologie Supérieure; Mila - Quebec AI Institute; Polytechnique Montréal
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 13 pages

Abstract:The reliable deployment of deep learning in medical imaging requires uncertainty quantification that provides rigorous error guarantees while remaining anatomically meaningful. Conformal prediction (CP) is a powerful distribution-free framework for constructing statistically valid prediction intervals. However, standard applications in segmentation often ignore anatomical context, resulting in fragmented, spatially incoherent, and over-segmented prediction sets that limit clinical utility. To bridge this gap, this paper proposes Random-Walk Conformal Prediction (RW-CP), a model-agnostic framework which can be added on top of any segmentation method. RW-CP enforces spatial coherence to generate anatomically valid sets. Our method constructs a k-nearest neighbour graph from pre-trained vision foundation model features and applies a random walk to diffuse uncertainty. The random walk diffusion regularizes the non-conformity scores, making the prediction sets less sensitive to the conformal calibration parameter $\lambda$, ensuring more stable and continuous anatomical boundaries. RW-CP maintains rigorous marginal coverage while significantly improving segmentation quality. Evaluations on multi-modal public datasets show improvements of up to 35.4% compared to standard CP baselines, given an allowable error rate of $\alpha = 0.1$.
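
Two ingredients named in the abstract can be sketched in a few lines: the split-conformal threshold at error rate α, and a random-walk smoothing of per-node scores over a graph. The sketch assumes a dense adjacency matrix in which every node has at least one neighbour or a self-loop; the number of diffusion steps, the graph construction, and the function names are illustrative assumptions rather than the RW-CP procedure itself.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal quantile of calibration non-conformity scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or
    infinity when the calibration set is too small for that rank.
    """
    scores = np.sort(np.asarray(cal_scores, dtype=float))
    n = scores.size
    rank = int(np.ceil((n + 1) * (1.0 - alpha)))
    if rank > n:
        return np.inf
    return scores[rank - 1]

def random_walk_smooth(scores, adjacency, steps=1):
    """Diffuse node scores with a row-normalized transition matrix."""
    adjacency = np.asarray(adjacency, dtype=float)
    transition = adjacency / adjacency.sum(axis=1, keepdims=True)
    out = np.asarray(scores, dtype=float)
    for _ in range(steps):
        out = transition @ out  # each node takes the mean of its neighbours
    return out
```

Smoothing the scores before thresholding pulls isolated outliers toward their neighbours, which is one intuitive reading of why diffusion would yield less fragmented prediction sets.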

[CV-74] FreeOrbit4D: Training-Free Arbitrary Camera Redirection for Monocular Videos via Geometry-Complete 4D Reconstruction

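[Quick Read]: This paper addresses geometric ambiguity and temporal inconsistency in camera redirection of monocular videos under large viewpoint changes. Because a monocular video offers only limited spatio-temporal observation of a dynamic 3D scene, consistent geometry and motion are hard to recover when the target trajectory is far from the original one. The key to the solution is FreeOrbit4D, a training-free framework that decouples foreground and background reconstruction: the monocular video is unprojected into a static background point cloud and geometry-incomplete foreground point clouds in a unified global space; an object-centric multi-view diffusion model then synthesizes multi-view images, from which geometry-complete foreground point clouds are reconstructed; finally, the foreground is aligned to the global scene space through dense pixel-synchronized 3D-3D correspondences, and the resulting geometry-complete 4D proxy is projected onto the target camera viewpoints to give geometric guidance to a conditional video diffusion model, yielding more faithful redirected videos.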

Link: https://arxiv.org/abs/2601.18993
Authors: Wei Cao, Hao Zhang, Fengrui Tian, Yulun Wu, Yingying Li, Shenlong Wang, Ning Yu, Yaoyao Liu
Affiliations: University of Illinois Urbana-Champaign; University of Pennsylvania; Eyeline Labs
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: 14 pages, 10 figures

Abstract:Camera redirection aims to replay a dynamic scene from a single monocular video under a user-specified camera trajectory. However, large-angle redirection is inherently ill-posed: a monocular video captures only a narrow spatio-temporal view of a dynamic 3D scene, providing highly partial observations of the underlying 4D world. The key challenge is therefore to recover a complete and coherent representation from this limited input, with consistent geometry and motion. While recent diffusion-based methods achieve impressive results, they often break down under large-angle viewpoint changes far from the original trajectory, where missing visual grounding leads to severe geometric ambiguity and temporal inconsistency. To address this, we present FreeOrbit4D, an effective training-free framework that tackles this geometric ambiguity by recovering a geometry-complete 4D proxy as structural grounding for video generation. We obtain this proxy by decoupling foreground and background reconstructions: we unproject the monocular video into a static background and geometry-incomplete foreground point clouds in a unified global space, then leverage an object-centric multi-view diffusion model to synthesize multi-view images and reconstruct geometry-complete foreground point clouds in canonical object space. By aligning the canonical foreground point cloud to the global scene space via dense pixel-synchronized 3D–3D correspondences and projecting the geometry-complete 4D proxy onto target camera viewpoints, we provide geometric scaffolds that guide a conditional video diffusion model. Extensive experiments show that FreeOrbit4D produces more faithful redirected videos under challenging large-angle trajectories, and our geometry-complete 4D proxy further opens a potential avenue for practical applications such as edit propagation and 4D data generation. Project page and code will be released soon.

[CV-75] Pay Attention to Where You Look ICIP2025

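【Quick read】: This paper addresses a weakness of few-shot novel view synthesis (NVS): existing methods assume that every input view is equally important to the target view, which leads to suboptimal results. The core of the solution is an adaptable camera-weighting mechanism that adjusts the importance of each source view according to its relevance to the target view. Two variants are proposed: a deterministic scheme based on geometric properties such as Euclidean distance and angular difference, and a learned scheme that uses cross-attention to optimize the view weighting. The mechanism can be integrated into a variety of NVS algorithms and improves the accuracy and realism of the synthesized views.

Link: https://arxiv.org/abs/2601.18970
Authors: Alex Beriand,JhihYang Wu,Daniel Brignac,Natnael Daba,Abhijit Mahalanobis
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICIP 2025 Workshop on Generative AI for World Simulations and Communications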

Abstract:Novel view synthesis (NVS) has advanced with generative modeling, enabling photorealistic image generation. In few-shot NVS, where only a few input views are available, existing methods often assume equal importance for all input views relative to the target, leading to suboptimal results. We address this limitation by introducing a camera-weighting mechanism that adjusts the importance of source views based on their relevance to the target. We propose two approaches: a deterministic weighting scheme leveraging geometric properties like Euclidean distance and angular differences, and a cross-attention-based learning scheme that optimizes view weighting. Additionally, models can be further trained with our camera-weighting scheme to refine their understanding of view relevance and enhance synthesis quality. This mechanism is adaptable and can be integrated into various NVS algorithms, improving their ability to synthesize high-quality novel views. Our results demonstrate that adaptive view weighting enhances accuracy and realism, offering a promising direction for improving NVS.
Journal reference: International Conference on Image Processing 2025
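The deterministic variant mentioned in the abstract weights source views by Euclidean distance and angular difference to the target camera. One way this could look is sketched below; the cost function (distance plus a scaled angle), the softmax normalization, and the names `camera_weights`, `angle_scale`, and `temperature` are assumptions for illustration, not the paper's formulation.

```python
import math

def camera_weights(target_pos, target_dir, sources, angle_scale=1.0, temperature=1.0):
    """Return one weight per source view; closer, better-aligned views weigh more.

    `sources` is a list of (position, unit_view_direction) pairs.
    """
    costs = []
    for pos, direction in sources:
        dist = math.dist(target_pos, pos)                     # Euclidean distance
        cos_angle = sum(a * b for a, b in zip(target_dir, direction))
        angle = math.acos(max(-1.0, min(1.0, cos_angle)))     # angular difference (rad)
        costs.append(dist + angle_scale * angle)
    # Softmax over negative cost: weights are positive and sum to 1
    lowest = min(costs)
    exps = [math.exp(-(c - lowest) / temperature) for c in costs]
    total = sum(exps)
    return [e / total for e in exps]

target_pos, target_dir = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
sources = [
    ((0.1, 0.0, 0.0), (0.0, 0.0, 1.0)),   # near the target, same viewing direction
    ((2.0, 0.0, 0.0), (1.0, 0.0, 0.0)),   # far away, rotated by 90 degrees
]
weights = camera_weights(target_pos, target_dir, sources)
```

In this toy setup the first source view receives most of the weight because it is both closer and better aligned with the target view.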

[CV-76] Smart Split-Federated Learning over Noisy Channels for Embryo Image Segmentation

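【Quick read】: This paper studies how noise in the communication channels of Split-Federated (SplitFed) learning affects the training process and the quality of the final model. The key to the solution is a smart averaging strategy for aggregating gradient and model updates, which makes the system markedly more robust to channel noise. Experiments show that, compared with conventional averaging, the strategy tolerates channel noise two orders of magnitude stronger while maintaining the accuracy of the final model.

Link: https://arxiv.org/abs/2601.18948
Authors: Zahra Hafezi Kafshgari,Ivan V. Bajic,Parvaneh Saeedi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: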

Abstract:Split-Federated (SplitFed) learning is an extension of federated learning that places minimal requirements on the clients' computing infrastructure, since only a small portion of the overall model is deployed on the clients' hardware. In SplitFed learning, feature values, gradient updates, and model updates are transferred across communication channels. In this paper, we study the effects of noise in the communication channels on the learning process and the quality of the final model. We propose a smart averaging strategy for SplitFed learning with the goal of improving resilience against channel noise. Experiments on a segmentation model for embryo images show that the proposed smart averaging strategy is able to tolerate two orders of magnitude stronger noise in the communication channels compared to conventional averaging, while still maintaining the accuracy of the final model.

[CV-77] On the Role of Depth in Surgical Vision Foundation Models: An Empirical Study of RGB-D Pre-training

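【Quick read】: This paper addresses the fact that vision foundation models (VFMs) for surgical scene understanding rely mainly on unimodal RGB pre-training and overlook the complex 3D geometry of surgical environments. The key to the solution is to add depth information as a second input modality during pre-training: geometry-aware multimodal ViT architectures such as MultiMAE are pre-trained on a dataset of 1.4 million robotic surgical images paired with depth maps. Experiments show that models with explicit geometric tokenization substantially outperform unimodal baselines on object detection, segmentation, depth estimation, and pose estimation, and that models fine-tuned on only 25% of the labeled data surpass RGB-only models trained on the full dataset. Depth is used only during pre-training, so no architectural or runtime changes are needed at inference, which makes adoption straightforward.

Link: https://arxiv.org/abs/2601.18929
Authors: John J. Han,Adam Schmidt,Muhammad Abdullah Jamal,Chinedu Nwoye,Anita Rau,Jie Ying Wu,Omid Mohareri
Affiliations: Vanderbilt University; Intuitive Surgical Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: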

Abstract:Vision foundation models (VFMs) have emerged as powerful tools for surgical scene understanding. However, current approaches predominantly rely on unimodal RGB pre-training, overlooking the complex 3D geometry inherent to surgical environments. Although several architectures support multimodal or geometry-aware inputs in general computer vision, the benefits of incorporating depth information in surgical settings remain underexplored. We conduct a large-scale empirical study comparing eight ViT-based VFMs that differ in pre-training domain, learning objective, and input modality (RGB vs. RGB-D). For pre-training, we use a curated dataset of 1.4 million robotic surgical images paired with depth maps generated from an off-the-shelf network. We evaluate these models under both frozen-backbone and end-to-end fine-tuning protocols across eight surgical datasets spanning object detection, segmentation, depth estimation, and pose estimation. Our experiments yield several consistent findings. Models incorporating explicit geometric tokenization, such as MultiMAE, substantially outperform unimodal baselines across all tasks. Notably, geometric-aware pre-training enables remarkable data efficiency: models fine-tuned on just 25% of labeled data consistently surpass RGB-only models trained on the full dataset. Importantly, these gains require no architectural or runtime changes at inference; depth is used only during pre-training, making adoption straightforward. These findings suggest that multimodal pre-training offers a viable path towards building more capable surgical vision systems.

[CV-78] DeFM: Learning Foundation Representations from Depth for Robotics

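【Quick read】: This paper addresses the relative lack of research on representation learning for the depth modality in robot learning, in contrast to RGB, where large-scale foundation models have driven major progress. To fill this gap, the authors propose DeFM, a self-supervised foundation model trained entirely on depth images. It is pre-trained with a DINO-style self-distillation objective on a dataset of 60 million depth images and learns general-purpose representations that carry both geometric and semantic information. Two further ingredients are key: a novel input normalization strategy that retains metric awareness across multiple scales, and knowledge distillation that compresses DeFM into compact models for resource-constrained robots. Experiments show that DeFM achieves state-of-the-art performance on classification, segmentation, navigation, locomotion, and manipulation tasks and generalizes well from simulation to real-world environments.

Link: https://arxiv.org/abs/2601.18923
Authors: Manthan Patel,Jonas Frey,Mayank Mittal,Fan Yang,Alexander Hansson,Amir Bar,Cesar Cadena,Marco Hutter
Affiliations: Robotic Systems Lab (RSL), ETH Zurich; Stanford University; UC Berkeley; NVIDIA; Switzerland
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review, 19 pages, 15 Figures, 9 Tables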

Abstract:Depth sensors are widely deployed across robotic platforms, and advances in fast, high-fidelity depth simulation have enabled robotic policies trained on depth observations to achieve robust sim-to-real transfer for a wide range of tasks. Despite this, representation learning for depth modality remains underexplored compared to RGB, where large-scale foundation models now define the state of the art. To address this gap, we present DeFM, a self-supervised foundation model trained entirely on depth images for robotic applications. Using a DINO-style self-distillation objective on a curated dataset of 60M depth images, DeFM learns geometric and semantic representations that generalize to diverse environments, tasks, and sensors. To retain metric awareness across multiple scales, we introduce a novel input normalization strategy. We further distill DeFM into compact models suitable for resource-constrained robotic systems. When evaluated on depth-based classification, segmentation, navigation, locomotion, and manipulation benchmarks, DeFM achieves state-of-the-art performance and demonstrates strong generalization from simulation to real-world environments. We release all our pretrained models, which can be adopted off-the-shelf for depth-based robotic learning without task-specific fine-tuning. Webpage: this https URL

[CV-79] RealStats: A Rigorous Real-Only Statistical Framework for Fake Image Detection AISTATS2026

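【Quick read】: This paper addresses the detection of images produced by generative AI, in particular the limited robustness of existing detectors that lack formal interpretability and rely on implicit assumptions about fake content. The key to the solution is a statistically rigorous detection framework: it computes p-values over multiple test statistics and aggregates them with classical statistical ensembling to assess how well an image aligns with a unified real-image distribution, yielding an interpretable probability score. The method is training-free, generic, and flexible, which suits diverse and evolving detection settings.

Link: https://arxiv.org/abs/2601.18900
Authors: Haim Zisman,Uri Shaham
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 22 pages, 14 figures. Accepted to AISTATS 2026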

Abstract:As generative models continue to evolve, detecting AI-generated images remains a critical challenge. While effective detection methods exist, they often lack formal interpretability and may rely on implicit assumptions about fake content, potentially limiting robustness to distributional shifts. In this work, we introduce a rigorous, statistically grounded framework for fake image detection that focuses on producing a probability score interpretable with respect to the real-image population. Our method leverages the strengths of multiple existing detectors by combining training-free statistics. We compute p-values over a range of test statistics and aggregate them using classical statistical ensembling to assess alignment with the unified real-image distribution. This framework is generic, flexible, and training-free, making it well-suited for robust fake image detection across diverse and evolving settings.
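To make the "p-values plus classical ensembling" idea concrete, here is a minimal sketch, assuming an empirical p-value against a reference set of real-image statistics and Fisher's method as the combiner. Both choices, and the helper names `empirical_p_value` and `fisher_combine`, are illustrative assumptions; the abstract does not say which test statistics or which ensembling rule the authors use. Fisher's method also assumes the combined p-values are independent.

```python
import math

def empirical_p_value(stat, real_reference_stats):
    """Fraction of real-image reference statistics at least as extreme as `stat`."""
    n = len(real_reference_stats)
    count = sum(1 for r in real_reference_stats if r >= stat)
    return (count + 1) / (n + 1)      # add-one correction keeps p strictly positive

def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (chi-square, 2k dof)."""
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)   # X / 2 with X = -2 * sum(ln p)
    # Closed-form chi-square survival function for even degrees of freedom 2k:
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total

# Made-up reference statistics from real images and two made-up detector p-values
p_single = empirical_p_value(10.0, [1.0, 2.0, 3.0])
p_combined = fisher_combine([0.01, 0.02])
```

A small combined p-value indicates that the image is unlikely under the real-image population, which is the sense in which such a score is interpretable "with respect to the real-image population".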

[CV-80] Weakly supervised framework for wildlife detection and counting in challenging Arctic environments: a case study on caribou (Rangifer tarandus)

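【Quick read】: Against the backdrop of declining Arctic caribou populations, this paper addresses how to monitor the animals automatically over large areas with high accuracy. Manual interpretation of imagery is labor-intensive and error-prone, while automatic detection is hampered by heterogeneous backgrounds, class imbalance, and small or occluded targets. The key to the solution is a weakly supervised patch-level pretraining strategy built on the detection network's architecture: it learns early prior knowledge from coarse labels (empty vs. non-empty patches) and thereby makes the detection model (HerdNet) more robust across scenes and herd densities. The pretrained patch network reaches F1 scores of 93.7% on multi-herd imagery from 2017 and 92.6% on an independent 2019 test set. Transferred to detection, it yields consistent gains over ImageNet initialization on both positive-patch detection and full-image counting, which supports weakly supervised pretraining when labeled data are scarce.

Link: https://arxiv.org/abs/2601.18891
Authors: Ghazaleh Serati,Samuel Foucher,Jerome Theau
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 30 pages, 8 figures, submitted to Frontiers in Ecology and Evolution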

Abstract:Caribou populations across the Arctic have declined in recent decades, motivating scalable and accurate monitoring approaches to guide evidence-based conservation actions and policy decisions. Manual interpretation of aerial imagery is labor-intensive and error-prone, underscoring the need for automatic and reliable detection across varying scenes. Yet, such automatic detection is challenging due to severe background heterogeneity, dominant empty terrain (class imbalance), small or occluded targets, and wide variation in density and scale. To make the detection model (HerdNet) more robust to these challenges, a weakly supervised patch-level pretraining based on a detection network’s architecture is proposed. The detection dataset includes five caribou herds distributed across Alaska. By learning from empty vs. non-empty labels in this dataset, the approach produces early weakly supervised knowledge for enhanced detection compared to HerdNet, which is initialized from generic weights. Accordingly, the patch-based pretrain network attained high accuracy on multi-herd imagery (2017) and on an independent year’s (2019) test sets (F1: 93.7%/92.6%, respectively), enabling reliable mapping of regions containing animals to facilitate manual counting on large aerial imagery. Transferred to detection, initialization from weakly supervised pretraining yielded consistent gains over ImageNet weights on both positive patches (F1: 92.6%/93.5% vs. 89.3%/88.6%), and full-image counting (F1: 95.5%/93.3% vs. 91.5%/90.4%). Remaining limitations are false positives from animal-like background clutter and false negatives related to low animal density occlusions. Overall, pretraining on coarse labels prior to detection makes it possible to rely on weakly-supervised pretrained weights even when labeled data are limited, achieving results comparable to generic-weight initialization.

[CV-81] SelfieAvatar: Real-time Head Avatar reenactment from a Selfie Video

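【Quick read】: This paper addresses two core problems in head avatar reenactment driven by monocular video. First, methods based on the 3D Morphable Model (3DMM) struggle to reconstruct the entire head in real time, including non-facial regions and background details, so the resulting avatars lack realism. Second, methods based on generative adversarial networks (GANs) achieve high-quality reenactment but fall short on fine-grained textures such as wrinkles and hair, and they generally depend on large amounts of training data; few explore avatar reenactment from a single simple selfie video. The key to the solution is a method that combines 3DMMs with a StyleGAN-based generator and uses mixed loss functions during adversarial training to optimize foreground reconstruction and avatar image generation together. This recovers high-frequency details and yields richer, more realistic textures from a single selfie video.

Link: https://arxiv.org/abs/2601.18851
Authors: Wei Liang,Hui Yu,Derui Ding,Rachael E. Jack,Philippe G. Schyns
Affiliations: University of Shanghai for Science and Technology; University of Glasgow
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: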

Abstract:Head avatar reenactment focuses on creating animatable personal avatars from monocular videos, serving as a foundational element for applications like social signal understanding, gaming, human-machine interaction, and computer vision. Recent advances in 3D Morphable Model (3DMM)-based facial reconstruction methods have achieved remarkably high-fidelity face estimation. However, on the one hand, they struggle to capture the entire head, including non-facial regions and background details in real time, which is an essential aspect for producing realistic, high-fidelity head avatars. On the other hand, recent approaches leveraging generative adversarial networks (GANs) for head avatar generation from videos can achieve high-quality reenactments but encounter limitations in reproducing fine-grained head details, such as wrinkles and hair textures. In addition, existing methods generally rely on a large amount of training data, and rarely focus on using only a simple selfie video to achieve avatar reenactment. To address these challenges, this study introduces a method for detailed head avatar reenactment using a selfie video. The approach combines 3DMMs with a StyleGAN-based generator. A detailed reconstruction model is proposed, incorporating mixed loss functions for foreground reconstruction and avatar image generation during adversarial training to recover high-frequency details. Qualitative and quantitative evaluations on self-reenactment and cross-reenactment tasks demonstrate that the proposed method achieves superior head avatar reconstruction with rich and intricate textures compared to existing approaches.

[CV-82] Audio-Driven Talking Face Generation with Blink Embedding and Hash Grid Landmarks Encoding

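【Quick read】: This paper addresses the limited accuracy and efficiency with which dynamic Neural Radiance Fields (Dynamic NeRF) capture mouth movements when generating talking portraits. The key to the solution is an automatic method based on blink embedding and hash grid landmarks encoding: facial features serve as conditional features, audio features are integrated as residual terms, and the two are fused by a Dynamic Landmark Transformer, which substantially improves the realism and detail fidelity of the generated talking faces.

Link: https://arxiv.org/abs/2601.18849
Authors: Yuhui Zhang,Hui Yu,Wei Liang,Sunjie Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: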

Abstract:Dynamic Neural Radiance Fields (NeRF) have demonstrated considerable success in generating high-fidelity 3D models of talking portraits. Despite significant advancements in the rendering speed and generation quality, challenges persist in accurately and efficiently capturing mouth movements in talking portraits. To tackle this challenge, we propose an automatic method based on blink embedding and hash grid landmarks encoding in this study, which can substantially enhance the fidelity of talking faces. Specifically, we leverage facial features encoded as conditional features and integrate audio features as residual terms into our model through a Dynamic Landmark Transformer. Furthermore, we employ neural radiance fields to model the entire face, resulting in a lifelike face representation. Experimental evaluations have validated the superiority of our approach to existing methods.

[CV-83] Dynamic Mask-Based Backdoor Attack Against Vision AI Models: A Case Study on Mushroom Detection

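【Quick read】: This paper addresses the threat that dynamic backdoor attacks pose to deployed deep learning models, object detection models in particular. Backdoor attacks with static triggers are comparatively easy to detect; the paper instead proposes a backdoor attack based on dynamic masks. Its key idea is to use SAM (Segment Anything Model) to generate precise segmentation masks so that the trigger can be placed adaptively and stealthily. The approach raises the attack success rate and stealthiness while keeping accuracy on clean data high, exposes a serious security risk in data-outsourcing scenarios, and underscores the urgency of developing robust defenses.

Link: https://arxiv.org/abs/2601.18845
Authors: Zeineb Dridi,Jihen Bennaceur,Amine Ben Hassouna
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: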

Abstract:Deep learning has revolutionized numerous tasks within the computer vision field, including image classification, image segmentation, and object detection. However, the increasing deployment of deep learning models has exposed them to various adversarial attacks, including backdoor attacks. This paper presents a novel dynamic mask-based backdoor attack method, specifically designed for object detection models. We exploit a dataset poisoning technique to embed a malicious trigger, rendering any models trained on this compromised dataset vulnerable to our backdoor attack. We particularly focus on a mushroom detection dataset to demonstrate the practical risks posed by such attacks on critical real-life domains. Our work also emphasizes the importance of creating a detailed backdoor attack scenario to illustrate the significant risks associated with the outsourcing practice. Our approach leverages SAM, a recent and powerful image segmentation AI model, to create masks for dynamic trigger placement, introducing a new and stealthy attack method. Through extensive experimentation, we show that our sophisticated attack scenario maintains high accuracy on clean data with the YOLOv7 object detection model while achieving high attack success rates on poisoned samples. Our approach surpasses traditional methods for backdoor injection, which are based on static and consistent patterns. Our findings underscore the urgent need for robust countermeasures to protect deep learning models from these evolving adversarial threats.

[CV-84] GUIGuard: Toward a General Framework for Privacy-Preserving GUI Agents

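【Quick read】: This paper addresses the privacy risks that GUI agents face when carrying out automated tasks. Because GUI agents perceive and operate on-screen interfaces directly, they often access interfaces containing sensitive personal information and transmit screenshots to remote models; the exposure is especially severe in GUI workflows with multi-step interactions. The key to the solution is the GUIGuard framework, which has three stages: (1) privacy recognition, which locates and classifies the on-screen regions that may leak private information; (2) privacy protection, which masks or otherwise processes sensitive information without destroying task semantics; and (3) task execution under protection, which ensures the task can still be completed while privacy is preserved. Experiments show that current mainstream GUI agents are severely limited in privacy recognition. With the finely annotated GUIGuard-Bench benchmark and the staged protection design, the paper shows that task-planning semantics can be maintained under privacy protection.

Link: https://arxiv.org/abs/2601.18842
Authors: Yanxi Wang,Zhiling Zhang,Wenbo Zhou,Weiming Zhang,Jie Zhang,Qiannan Zhu,Yu Shi,Shuxin Zheng,Jiyan He
Affiliations: Beijing Normal University; Zhongguancun Academy; University of Science and Technology of China; A*STAR; Zhongguancun Institution of Artificial Intelligence
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: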

Abstract:GUI agents enable end-to-end automation through direct perception of and interaction with on-screen interfaces. However, these agents frequently access interfaces containing sensitive personal information, and screenshots are often transmitted to remote models, creating substantial privacy risks. These risks are particularly severe in GUI workflows: GUIs expose richer, more accessible private information, and privacy risks depend on interaction trajectories across sequential scenes. We propose GUIGuard, a three-stage framework for privacy-preserving GUI agents: (1) privacy recognition, (2) privacy protection, and (3) task execution under protection. We further construct GUIGuard-Bench, a cross-platform benchmark with 630 trajectories and 13,830 screenshots, annotated with region-level privacy grounding and fine-grained labels of risk level, privacy category, and task necessity. Evaluations reveal that existing agents exhibit limited privacy recognition, with state-of-the-art models achieving only 13.3% accuracy on Android and 1.4% on PC. Under privacy protection, task-planning semantics can still be maintained, with closed-source models showing stronger semantic consistency than open-source ones. Case studies on MobileWorld show that carefully designed protection strategies achieve higher task accuracy while preserving privacy. Our results highlight privacy recognition as a critical bottleneck for practical GUI agents. Project: this https URL

[CV-85] NavFormer: IGRF Forecasting in Moving Coordinate Frames

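【Quick read】: This paper addresses the fact that the output components of a triad magnetometer change with sensor attitude even when the International Geomagnetic Reference Field (IGRF) total intensity stays the same. The core challenge is to build rotation-invariant features that stabilize the spectral properties of the magnetometer measurements without introducing sign discontinuities. The key to the solution is a Canonical SPD module: it builds a canonical frame from a per-window Gram matrix and applies state-dependent spectral scaling in the original coordinates, which stabilizes the spectrum of the window-level second moments. Experiments show lower error than strong baselines in standard training, few-shot training, and zero-shot transfer.

Link: https://arxiv.org/abs/2601.18800
Authors: Yoontae Hwang,Dongwoo Lee,Minseok Choi,Yong Sup Ihn,Daham Kim,Deok-Young Lee
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: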

Abstract:Triad magnetometer components change with sensor attitude even when the IGRF total intensity target stays invariant. NavFormer forecasts this invariant target with rotation invariant scalar features and a Canonical SPD module that stabilizes the spectrum of window level second moments of the triads without sign discontinuities. The module builds a canonical frame from a Gram matrix per window and applies state dependent spectral scaling in the original coordinates. Experiments across five flights show lower error than strong baselines in standard training, few shot training, and zero shot transfer. The code is available at: this https URL
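The reason a Gram matrix is a natural starting point for attitude-independent features can be checked numerically: rotating every triad sample in a window changes the individual components but leaves all pairwise inner products unchanged. The sketch below assumes a window-by-window Gram matrix of inner products and a rotation about the z axis; the sample values and the helpers `rotate_z` and `gram_matrix` are made up, and the paper's exact Gram construction is not given in the abstract.

```python
import math

def rotate_z(v, angle):
    """Rotate a 3-vector about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1], v[2])

def gram_matrix(window):
    """Pairwise inner products of the triad samples in one window (T x T)."""
    return [[sum(a * b for a, b in zip(u, v)) for v in window] for u in window]

# Made-up triad magnetometer samples (arbitrary units) for one window
window = [(20.0, -5.0, 40.0), (21.0, -4.5, 39.5), (19.5, -5.5, 40.5)]
rotated = [rotate_z(v, 0.7) for v in window]   # same field, different sensor attitude

g_original = gram_matrix(window)
g_rotated = gram_matrix(rotated)
max_diff = max(abs(a - b)
               for row_a, row_b in zip(g_original, g_rotated)
               for a, b in zip(row_a, row_b))
```

Here `max_diff` is at floating-point noise level even though the x and y components of each sample change noticeably. The diagonal entries are the squared total intensities, which is the rotation-invariant scalar quantity the forecasting target is built on.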

[CV-86] Interpretable and backpropagation-free Green Learning for efficient multi-task echocardiographic segmentation and classification

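【Quick read】: This paper addresses two obstacles in assessing Left Ventricular Ejection Fraction (LVEF) from echocardiograms: manual assessment suffers from high inter-observer variability, and existing Deep Learning (DL) models are computationally intensive, data-hungry, and hard to interpret. The key to the solution is a backpropagation-free multi-task Green Learning (MTGL) framework. It uses an unsupervised VoxelHop encoder for hierarchical spatio-temporal feature extraction, combined with a multi-level regression decoder and an XG-Boost classifier, to perform Left Ventricle (LV) segmentation and LVEF classification simultaneously. On the EchoNet-Dynamic dataset the method reaches a classification accuracy of 94.3% and a Dice Similarity Coefficient of 0.912 with over an order of magnitude fewer parameters, which improves computational efficiency and clinical trustworthiness.

Link: https://arxiv.org/abs/2601.19743
Authors: Jyun-Ping Kao,Jiaxing Yang,C.-C. Jay Kuo,Jonghye Woo
Affiliations: University of Southern California; Massachusetts General Hospital; Harvard Medical School
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Jyun-Ping Kao and Jiaxing Yang contributed equally to this work. C.-C. Jay Kuo and Jonghye Woo are the senior authors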

Abstract:Echocardiography is a cornerstone for managing heart failure (HF), with Left Ventricular Ejection Fraction (LVEF) being a critical metric for guiding therapy. However, manual LVEF assessment suffers from high inter-observer variability, while existing Deep Learning (DL) models are often computationally intensive and data-hungry “black boxes” that impede clinical trust and adoption. Here, we propose a backpropagation-free multi-task Green Learning (MTGL) framework that performs simultaneous Left Ventricle (LV) segmentation and LVEF classification. Our framework integrates an unsupervised VoxelHop encoder for hierarchical spatio-temporal feature extraction with a multi-level regression decoder and an XG-Boost classifier. On the EchoNet-Dynamic dataset, our MTGL model achieves state-of-the-art classification and segmentation performance, attaining a classification accuracy of 94.3% and a Dice Similarity Coefficient (DSC) of 0.912, significantly outperforming several advanced 3D DL models. Crucially, our model achieves this with over an order of magnitude fewer parameters, demonstrating exceptional computational efficiency. This work demonstrates that the GL paradigm can deliver highly accurate, efficient, and interpretable solutions for complex medical image analysis, paving the way for more sustainable and trustworthy artificial intelligence in clinical practice.
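For readers unfamiliar with the reported segmentation metric, the Dice Similarity Coefficient is twice the overlap of two masks divided by the sum of their sizes. The snippet below is a generic illustration on flat 0/1 lists with made-up masks; it is not the authors' evaluation code, and the convention of returning 1.0 for two empty masks is an assumption.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice = 2 * |A and B| / (|A| + |B|) for two binary masks given as flat 0/1 lists."""
    intersection = sum(p and t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    return 1.0 if total == 0 else 2.0 * intersection / total

pred = [0, 1, 1, 1, 0, 0]   # made-up predicted left-ventricle mask
true = [0, 0, 1, 1, 1, 0]   # made-up ground-truth mask
score = dice_coefficient(pred, true)
```

With two of three foreground pixels overlapping in each mask, `score` equals 2/3; a DSC of 0.912 therefore indicates a much tighter overlap than this toy case.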

[CV-87] Learned split-spectrum metalens for obstruction-free broadband imaging in the visible

【Quick read】: This paper addresses the degradation of image quality caused by obstructions such as raindrops, fences, or dust, especially where mechanical cleaning is infeasible. Conventional solutions rely on bulky compound optics arrays or on computational inpainting, which sacrifice either compactness or imaging fidelity. The key innovation is a learned split-spectrum metalens: multi-band spectral filtering divides the spectrum of each RGB channel into pass bands and stop bands, and the metalens is trained so that light from distant objects is focused through the pass bands while light from near-depth obstructions is filtered by the stop bands; a neural network further enhances the optical signal. The design achieves broadband obstruction-free imaging with a relative PSNR gain of 32.29% and absolute gains of +13.54% mAP, +48.45% IoU, and +20.35% mIoU on object detection and semantic segmentation over a conventional hyperbolic design.

Link: https://arxiv.org/abs/2601.19403
Authors: Seungwoo Yoon,Dohyun Kang,Eunsue Choi,Sohyun Lee,Seoyeon Kim,Minho Choi,Hyeonsu Heo,Dong-ha Shin,Suha Kwak,Arka Majumdar,Junsuk Rho,Seung-Hwan Baek
Affiliations: Korea Advanced Institute of Science and Technology (KAIST); Pohang University of Science and Technology; Seoul National University; University of Washington; Allen Institute for AI; University of British Columbia; Korea Institute of Science and Technology
Categories: Optics (physics.optics); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Applied Physics (physics.app-ph)
Comments:

Abstract:Obstructions such as raindrops, fences, or dust degrade captured images, especially when mechanical cleaning is infeasible. Conventional solutions to obstructions rely on a bulky compound optics array or computational inpainting, which compromise compactness or fidelity. Metalenses composed of subwavelength meta-atoms promise compact imaging, but simultaneous achievement of broadband and obstruction-free imaging remains a challenge, since a metalens that images distant scenes across a broadband spectrum cannot properly defocus near-depth occlusions. Here, we introduce a learned split-spectrum metalens that enables broadband obstruction-free imaging. Our approach divides the spectrum of each RGB channel into pass and stop bands with multi-band spectral filtering and learns the metalens to focus light from far objects through pass bands, while filtering focused near-depth light through stop bands. This optical signal is further enhanced using a neural network. Our learned split-spectrum metalens achieves broadband and obstruction-free imaging with relative PSNR gains of 32.29% and improves object detection and semantic segmentation accuracies with absolute gains of +13.54% mAP, +48.45% IoU, and +20.35% mIoU over a conventional hyperbolic design. This promises robust obstruction-free sensing and vision for space-constrained systems, such as mobile robots, drones, and endoscopes.

[CV-88] AMGFormer: Adaptive Multi-Granular Transformer for Brain Tumor Segmentation with Missing Modalities

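【Quick read】: This paper addresses the instability of brain tumor segmentation from multimodal magnetic resonance imaging (MRI) when modalities are missing in clinical practice: the performance of existing methods varies by as much as 40% across modality combinations, which limits their clinical reliability. The key to the solution is the AMGFormer architecture, built from three synergistic modules: (1) the QuadIntegrator Bridge (QIB), which performs spatially adaptive fusion to keep predictions consistent regardless of which modalities are available; (2) the Multi-Granular Attention Orchestrator (MGAO), which focuses on pathological regions to reduce background sensitivity; and (3) Modality Quality-Aware Enhancement (MQAE), which prevents error propagation from corrupted sequences. On BraTS 2018 the method achieves Dice scores of 89.33% (WT), 82.70% (TC), and 67.23% (ET), with only 0.5% variance across the 15 modality combinations.

Link: https://arxiv.org/abs/2601.19349
Authors: Chengxiang Guo,Jian Wang,Junhua Fei,Xiao Li,Chunling Chen,Yun Jin
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: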

Abstract:Multimodal MRI is essential for brain tumor segmentation, yet missing modalities in clinical practice cause existing methods to exhibit 40% performance variance across modality combinations, rendering them clinically unreliable. We propose AMGFormer, achieving significantly improved stability through three synergistic modules: (1) QuadIntegrator Bridge (QIB) enabling spatially adaptive fusion maintaining consistent predictions regardless of available modalities, (2) Multi-Granular Attention Orchestrator (MGAO) focusing on pathological regions to reduce background sensitivity, and (3) Modality Quality-Aware Enhancement (MQAE) preventing error propagation from corrupted sequences. On BraTS 2018, our method achieves 89.33% WT, 82.70% TC, 67.23% ET Dice scores with 0.5% variance across 15 modality combinations, solving the stability crisis. Single-modality ET segmentation shows 40-81% relative improvements over state-of-the-art methods. The method generalizes to BraTS 2020/2021, achieving up to 92.44% WT, 89.91% TC, 84.57% ET. The model demonstrates potential for clinical deployment with 1.2s inference. Code: this https URL.
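The figure of 15 modality combinations follows from the four MRI sequences in the BraTS datasets: every non-empty subset of four modalities is a possible "available" set, and there are 2^4 - 1 = 15 of them. A short enumeration makes this explicit (the modality names are the standard BraTS sequences; the abstract itself does not list them).

```python
from itertools import combinations

modalities = ["FLAIR", "T1", "T1ce", "T2"]   # the four BraTS MRI sequences
subsets = [combo
           for size in range(1, len(modalities) + 1)
           for combo in combinations(modalities, size)]
single_modality = [s for s in subsets if len(s) == 1]
```

The list covers the four single-modality cases (the hardest setting the abstract highlights for ET segmentation) up to the full four-modality input.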

[CV-89] Magnetic Resonance Simulation of Effective Transverse Relaxation (T2*)

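【Quick read】: This paper addresses the efficient simulation of the effective transverse relaxation time $T_2^*$ in magnetic resonance imaging (MRI), and in particular the modeling of its reversible component $T_2^\prime$. Conventional simulators need 100 or more isochromats to approximate the Lorentzian function associated with $T_2^\prime$, which is computationally expensive. The key to the solution is a direct simulation method based on a linear phase model, which avoids the need for many isochromats, together with two acceleration techniques: analytic solutions (a 19-fold speed-up) and combined transitions (up to a 17-fold speed-up). Experiments show that $T_2^\prime$ is recovered as expected without 100+ isochromats per point, and that the computational time is only 2.0 to 2.7 times that of simulations without $T_2^\prime$.

Link: https://arxiv.org/abs/2601.19246
Authors: Hidenori Takeshima
Affiliations: Canon Medical Systems Corporation
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: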

Abstract:Purpose: To simulate effective transverse relaxation ($T_2^*$) as a part of MR simulation. $T_2^*$ consists of reversible ($T_2^\prime$) and irreversible ($T_2$) components. Whereas simulations of $T_2$ are easy, $T_2^\prime$ is not easily simulated if only magnetizations of individual isochromats are simulated. Theory and Methods: Efficient methods for simulating $T_2^\prime$ were proposed. To approximate the Lorentzian function of $T_2^\prime$ realistically, conventional simulators require 100+ isochromats. This approximation can be avoided by utilizing a linear phase model for simulating an entire Lorentzian function directly. To represent the linear phase model, the partial derivatives of the magnetizations with respect to the frequency axis were also simulated. To accelerate the simulations with these partial derivatives, the proposed methods introduced two techniques: analytic solutions, and combined transitions. For understanding the fundamental mechanism of the proposed method, a simple one-isochromat simulation was performed. For evaluating realistic cases, several pulse sequences were simulated using two phantoms with and without $T_2^\prime$ simulations. Results: The one-isochromat simulation demonstrated that $T_2^\prime$ simulations were possible. In the realistic cases, $T_2^\prime$ was recovered as expected without using 100+ isochromats for each point. The computational times with $T_2^\prime$ simulations were only 2.0 to 2.7 times longer than those without $T_2^\prime$ simulations. When the above-mentioned two techniques were utilized, the analytic solutions accelerated 19 times, and the combined transitions accelerated up to 17 times. Conclusion: Both theory and results showed that the proposed methods simulated $T_2^\prime$ efficiently by utilizing a linear model with a Lorentzian function, analytic solutions, and combined transitions.
Submission history: [v1] Tue, 27 Jan 2026 06:28:52 UTC (607 KB)
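For context, the conventional approach that the abstract contrasts against can be sketched in a few lines: many isochromats with Lorentzian-distributed off-resonance frequencies are averaged, and their mean signal approaches $\exp(-t/T_2^\prime)$ only when enough isochromats are used. The sketch below illustrates that baseline, not the paper's linear phase model. The relaxation times, the quantile-based sampling, and the helper `isochromat_decay` are assumptions chosen for illustration; the relation $1/T_2^* = 1/T_2 + 1/T_2^\prime$ is the standard definition.

```python
import math

def isochromat_decay(t, t2_prime, n_isochromats):
    """Mean transverse signal of isochromats with a Lorentzian off-resonance spread."""
    gamma = 1.0 / t2_prime                              # Lorentzian half-width (rad/s)
    total = 0.0
    for k in range(n_isochromats):
        u = (k + 0.5) / n_isochromats                   # evenly spaced quantiles in (0, 1)
        omega = gamma * math.tan(math.pi * (u - 0.5))   # Cauchy (Lorentzian) inverse CDF
        total += math.cos(omega * t)                    # sine parts cancel by symmetry
    return total / n_isochromats

t2, t2_prime, t = 0.080, 0.040, 0.030                   # seconds; illustrative values only
t2_star = 1.0 / (1.0 / t2 + 1.0 / t2_prime)             # 1/T2* = 1/T2 + 1/T2'
analytic = math.exp(-t / t2_prime)                      # ideal reversible decay
coarse = isochromat_decay(t, t2_prime, 100)             # roughly the "100+" regime
fine = isochromat_decay(t, t2_prime, 20000)             # many more isochromats
```

Comparing `coarse` and `fine` against `analytic` shows why a brute-force isochromat ensemble is expensive, which is the cost the proposed direct Lorentzian simulation is meant to avoid.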

[CV-90] Optimized k-means color quantization of digital images in machine-based and human perception-based colorspaces

【速读】:该论文旨在解决图像颜色量化(color quantization)过程中如何在减少颜色数量的同时最小化视觉质量损失的问题。其解决方案的关键在于比较不同色彩空间下k-means算法的性能表现,发现色彩空间的选择对量化效果具有显著影响:在低量化级别(k值较小)时,CIE-LUV色彩空间表现更优;而在高量化级别时,CIE-XYZ色彩空间优于RGB空间;而RGB空间在约一半情况下仍是最优选择。研究进一步通过分析图像中色相(hue)、色度(chromaticity)和亮度(luminance)分布,揭示了各色彩空间在不同图像特性下的适应性优势,从而为k-means颜色量化提供了基于人类感知的优化依据。

Link: https://arxiv.org/abs/2601.19117
Authors: Ranjan Maitra
Affiliation: Iowa State University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Comments: 25 pages, 11 figures, 5 tables, accepted in the Journal of Electronic Imaging

Abstract: Color quantization represents an image using a fraction of its original number of colors while only minimally losing its visual quality. The k-means algorithm is commonly used in this context, but has mostly been applied in the machine-based RGB colorspace composed of the three primary colors. However, some recent studies have indicated its improved performance in human perception-based colorspaces. We investigated the performance of k-means color quantization at four quantization levels in the RGB, CIE-XYZ, and CIE-LUV/CIE-HCL colorspaces, on 148 varied digital images spanning a wide range of scenes, subjects and settings. The Visual Information Fidelity (VIF) measure numerically assessed the quality of the quantized images, and showed that in about half of the cases, k-means color quantization is best in the RGB space, while at other times, and especially for higher quantization levels (k), the CIE-XYZ colorspace is where it usually does better. There are also some cases, especially at lower k, where the best performance is obtained in the CIE-LUV colorspace. Further analysis of the performances in terms of the distributions of the hue, chromaticity and luminance in an image presents a nuanced perspective and characterization of the images for which each colorspace is better for k-means color quantization.
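As a rough illustration of the technique the abstract above evaluates, here is a minimal sketch of k-means color quantization using plain Lloyd iterations. It is not the paper's implementation: the function name `kmeans_quantize`, the random initialisation, and the fixed iteration count are assumptions made for this example. The pixels are treated as plain 3-tuples, so the same routine could be run on RGB, CIE-XYZ, or CIE-LUV coordinates after whatever colorspace conversion one prefers (the conversion itself is not shown).

```python
import random

def kmeans_quantize(pixels, k, n_iter=20, seed=0):
    """Lloyd's k-means on a list of 3-tuples; returns (palette, labels).

    Hypothetical helper for illustration only, not the paper's code.
    """
    rng = random.Random(seed)
    # Initialise the palette from k randomly chosen pixel positions.
    centers = [tuple(p) for p in rng.sample(pixels, k)]

    def nearest(p):
        # Index of the center with the smallest squared Euclidean distance.
        return min(range(k),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))

    for _ in range(n_iter):
        labels = [nearest(p) for p in pixels]
        for j in range(k):
            members = [p for p, lab in zip(pixels, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(ch) / len(members) for ch in zip(*members))

    labels = [nearest(p) for p in pixels]
    return centers, labels

# Usage: two reddish and two bluish pixels reduced to a 2-color palette.
pixels = [(255, 0, 0), (250, 5, 5), (0, 0, 255), (5, 5, 250)]
palette, labels = kmeans_quantize(pixels, k=2)
quantized = [palette[lab] for lab in labels]
```

Each pixel is replaced by the mean color of its cluster, so the quantized image uses at most k distinct colors. Because Euclidean distance means something different in each colorspace, the same procedure can group pixels differently depending on the coordinates it is given, which is the effect the paper measures.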

Artificial Intelligence

[AI-0] HARMONI: Multimodal Personalization of Multi-User Human-Robot Interactions with LLMs

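Quick read: This paper addresses the lack of mechanisms for sustained personalization and dynamic adaptation in current human-robot interaction systems in multi-user environments, which limits their effectiveness in real-world settings. The key to the solution is HARMONI, a framework built on large language models (LLMs) that lets socially assistive robots manage long-term multi-user interactions through four core modules: a perception module that identifies active speakers and extracts multimodal input; a world modeling module that maintains representations of the environment and short-term conversational context; a user modeling module that updates long-term, speaker-specific profiles; and a generation module that produces contextually grounded and ethically informed responses. Empirical results show that the framework performs well in speaker identification, online memory updating, and ethically aligned personalization, outperforming baseline LLM-driven approaches.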

Link: https://arxiv.org/abs/2601.19839
Authors: Jeanne Malécot,Hamed Rahimi,Jeanne Cattoni,Marie Samson,Mouad Abrini,Mahdi Khoramshahi,Maribel Pino,Mohamed Chetouani
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Existing human-robot interaction systems often lack mechanisms for sustained personalization and dynamic adaptation in multi-user environments, limiting their effectiveness in real-world deployments. We present HARMONI, a multimodal personalization framework that leverages large language models to enable socially assistive robots to manage long-term multi-user interactions. The framework integrates four key modules: (i) a perception module that identifies active speakers and extracts multimodal input; (ii) a world modeling module that maintains representations of the environment and short-term conversational context; (iii) a user modeling module that updates long-term speaker-specific profiles; and (iv) a generation module that produces contextually grounded and ethically informed responses. Through extensive evaluation and ablation studies on four datasets, as well as a real-world scenario-driven user-study in a nursing home environment, we demonstrate that HARMONI supports robust speaker identification, online memory updating, and ethically aligned personalization, outperforming baseline LLM-driven approaches in user modeling accuracy, personalization quality, and user satisfaction.

[AI-1] Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models

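Quick read: This paper addresses the lag of current generative AI in non-abstract domains such as physical and spatial intelligence, focusing on how visual generation can improve chain-of-thought (CoT) reasoning. The key is to propose and test the visual superiority hypothesis: for tasks grounded in knowledge of the physical world, visual generation serves more naturally as an internal world model, whereas purely verbal world models hit bottlenecks from representational limitations and insufficient prior knowledge. Combining theoretical modeling with empirical study, the authors build the VisWorld-Eval evaluation suite and show on a unified multimodal model (UMM) that interleaved visual-verbal CoT significantly outperforms purely verbal CoT on tasks that favor visual world modeling, which points to a clear direction for developing more human-like multimodal AI.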

Link: https://arxiv.org/abs/2601.19834
Authors: Jialong Wu,Xiaoying Zhang,Hongyi Yuan,Xiangcheng Zhang,Tianhao Huang,Changjing He,Chaoyi Deng,Renrui Zhang,Youbin Wu,Mingsheng Long
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Project page: this https URL

Abstract:Humans construct internal world models and reason by manipulating the concepts within these models. Recent advances in AI, particularly chain-of-thought (CoT) reasoning, approximate such human cognitive abilities, where world models are believed to be embedded within large language models. Expert-level performance in formal and abstract domains such as mathematics and programming has been achieved in current systems by relying predominantly on verbal reasoning. However, they still lag far behind humans in domains like physical and spatial intelligence, which require richer representations and prior knowledge. The emergence of unified multimodal models (UMMs) capable of both verbal and visual generation has therefore sparked interest in more human-like reasoning grounded in complementary multimodal pathways, though their benefits remain unclear. From a world-model perspective, this paper presents the first principled study of when and how visual generation benefits reasoning. Our key position is the visual superiority hypothesis: for certain tasks–particularly those grounded in the physical world–visual generation more naturally serves as world models, whereas purely verbal world models encounter bottlenecks arising from representational limitations or insufficient prior knowledge. Theoretically, we formalize internal world modeling as a core component of CoT reasoning and analyze distinctions among different forms of world models. Empirically, we identify tasks that necessitate interleaved visual-verbal CoT reasoning, constructing a new evaluation suite, VisWorld-Eval. Controlled experiments on a state-of-the-art UMM show that interleaved CoT significantly outperforms purely verbal CoT on tasks that favor visual world modeling, but offers no clear advantage otherwise. Together, this work clarifies the potential of multimodal world modeling for more powerful, human-like multimodal AI.

[AI-2] Routing End User Queries to Enterprise Databases

Quick read: This paper addresses natural language query routing in multi-database enterprise environments, that is, how to send a user query to the most relevant database. The key to the solution is a modular, reasoning-driven reranking strategy that explicitly models three core dimensions: schema coverage, structural connectivity, and fine-grained semantic alignment. It consistently outperforms embedding-only and direct LLM-prompting baselines across all metrics.

Link: https://arxiv.org/abs/2601.19825
Authors: Saikrishna Sudarshan,Tanay Kulkarni,Manasi Patwardhan,Lovekesh Vig,Ashwin Srinivasan,Tanmay Tulsidas Verlekar
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 6 pages, 2 figures

Abstract:We address the task of routing natural language queries in multi-database enterprise environments. We construct realistic benchmarks by extending existing NL-to-SQL datasets. Our study shows that routing becomes increasingly challenging with larger, domain-overlapping DB repositories and ambiguous queries, motivating the need for more structured and robust reasoning-based solutions. By explicitly modelling schema coverage, structural connectivity, and fine-grained semantic alignment, the proposed modular, reasoning-driven reranking strategy consistently outperforms embedding-only and direct LLM-prompting baselines across all the metrics.

[AI-3] An Interpretable Recommendation Model for Psychometric Data With an Application to Gerontological Primary Care

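Quick read: This paper addresses key obstacles to using recommender systems in healthcare settings: clinical data are hard to obtain, users struggle to understand why a recommendation was made, following a recommendation may carry risk, and its effectiveness is uncertain. The core of the solution is a recommendation model that exploits the structure of psychometric data to produce visual explanations that are faithful to the model and interpretable by care professionals. The work focuses on gerontological primary care, shows the model's potential to support the creation of personalised care plans, and evaluates it through an offline performance comparison and a user study on the interpretability of the explanations.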

Link: https://arxiv.org/abs/2601.19824
Authors: Andre Paulino de Lima,Paula Castro,Suzana Carvalho Vaz de Andrade,Rosa Maria Marcucci,Ruth Caldeira de Melo,Marcelo Garcia Manzato
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Comments: 81 pages, 19 figures, 3 annexes

Abstract: There are challenges that must be overcome to make recommender systems useful in healthcare settings. The reasons are varied: the lack of publicly available clinical data, the difficulty that users may have in understanding the reasons why a recommendation was made, the risks that may be involved in following that recommendation, and the uncertainty about its effectiveness. In this work, we address these challenges with a recommendation model that leverages the structure of psychometric data to provide visual explanations that are faithful to the model and interpretable by care professionals. We focus on a narrow healthcare niche, gerontological primary care, to show that the proposed recommendation model can assist the attending professional in the creation of personalised care plans. We report results of a comparative offline performance evaluation of the proposed model on healthcare datasets that were collected by research partners in Brazil, as well as the results of a user study that evaluates the interpretability of the visual explanations the model generates. The results suggest that the proposed model can advance the application of recommender systems in this healthcare niche, which is expected to grow in demand, opportunities, and information technology needs as demographic changes become more pronounced.

[AI-4] Unsupervised Learning of Efficient Exploration: Pre-training Adaptive Policies via Self-Imposed Goals ICLR2026

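Quick read: This paper addresses how to generate, select, and learn from self-imposed goals during unsupervised pre-training so that reinforcement learning agents explore and adapt better in downstream tasks, particularly when the task distribution is broad and solving every task zero-shot is infeasible. The key to the solution is ULEE, an unsupervised meta-learning method with two core mechanisms: (i) optimizing for efficient multi-episode exploration and adaptation within a meta-learning framework, and (ii) guiding the training curriculum with evolving estimates of the agent's post-adaptation performance. ULEE combines an in-context learner with an adversarial goal-generation strategy that keeps training at the frontier of the agent's capabilities, which improves zero-shot and few-shot performance on novel objectives, environment dynamics, and map structures, and provides a strong initialization for later fine-tuning.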

Link: https://arxiv.org/abs/2601.19810
Authors: Octavio Pappalardo
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: To appear at ICLR 2026

Abstract:Unsupervised pre-training can equip reinforcement learning agents with prior knowledge and accelerate learning in downstream tasks. A promising direction, grounded in human development, investigates agents that learn by setting and pursuing their own goals. The core challenge lies in how to effectively generate, select, and learn from such goals. Our focus is on broad distributions of downstream tasks where solving every task zero-shot is infeasible. Such settings naturally arise when the target tasks lie outside of the pre-training distribution or when their identities are unknown to the agent. In this work, we (i) optimize for efficient multi-episode exploration and adaptation within a meta-learning framework, and (ii) guide the training curriculum with evolving estimates of the agent’s post-adaptation performance. We present ULEE, an unsupervised meta-learning method that combines an in-context learner with an adversarial goal-generation strategy that maintains training at the frontier of the agent’s capabilities. On XLand-MiniGrid benchmarks, ULEE pre-training yields improved exploration and adaptation abilities that generalize to novel objectives, environment dynamics, and map structures. The resulting policy attains improved zero-shot and few-shot performance, and provides a strong initialization for longer fine-tuning processes. It outperforms learning from scratch, DIAYN pre-training, and alternative curricula.

[AI-5] CASTER: Breaking the Cost-Performance Barrier in Multi-Agent Orchestration via Context-Aware Strategy for Task Efficient Routing

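Quick read: This paper addresses the computation wasted by static model allocation in graph-based Multi-Agent Systems (MAS): in complex cyclic workflows, deploying strong models uniformly spends too much compute on trivial sub-tasks. The key to the solution is CASTER (Context-Aware Strategy for Task Efficient Routing), a lightweight router for dynamic model selection. Its core is a Dual-Signal Router that combines semantic embeddings with structural meta-features to estimate task difficulty, trained with a "Cold Start to Iterative Evolution" paradigm in which the router improves itself from on-policy negative feedback on its own routing failures.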

Link: https://arxiv.org/abs/2601.19793
Authors: Shanyv Liu,Xuyang Yuan,Tao Chen,Zijun Zhan,Zhu Han,Danyang Zheng,Weishan Zhang,Shaohua Cao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Graph-based Multi-Agent Systems (MAS) enable complex cyclic workflows but suffer from inefficient static model allocation, where deploying strong models uniformly wastes computation on trivial sub-tasks. We propose CASTER (Context-Aware Strategy for Task Efficient Routing), a lightweight router for dynamic model selection in graph-based MAS. CASTER employs a Dual-Signal Router that combines semantic embeddings with structural meta-features to estimate task difficulty. During training, the router self-optimizes through a Cold Start to Iterative Evolution paradigm, learning from its own routing failures via on-policy negative feedback. Experiments using LLM-as-a-Judge evaluation across Software Engineering, Data Analysis, Scientific Discovery, and Cybersecurity demonstrate that CASTER reduces inference cost by up to 72.4% compared to strong-model baselines while matching their success rates, and consistently outperforms both heuristic routing and FrugalGPT across all domains.

[AI-6] Reimagining Peer Review Process Through Multi-Agent Mechanism Design ICSE

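Quick read: This paper addresses the systemic crisis of peer review in the software engineering research community, where growing submissions, misaligned incentives, and reviewer fatigue have led researchers to see the process as "broken." The key to the proposed solution is to model the research community as a stochastic multi-agent system and apply multi-agent reinforcement learning (MARL) to design incentive-compatible protocols. The specific interventions are a credit-based submission economy, MARL-optimized reviewer assignment, and hybrid verification of review consistency, with the aim of making the review ecosystem sustainable while taking equity into account.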

Link: https://arxiv.org/abs/2601.19778
Authors: Ahmad Farooq,Kamran Iqbal
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Software Engineering (cs.SE)
Comments: To appear in the Proceedings of the 2026 IEEE/ACM 48th International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE). 4 pages, 1 figure, 1 table

Abstract:The software engineering research community faces a systemic crisis: peer review is failing under growing submissions, misaligned incentives, and reviewer fatigue. Community surveys reveal that researchers perceive the process as “broken.” This position paper argues that these dysfunctions are mechanism design failures amenable to computational solutions. We propose modeling the research community as a stochastic multi-agent system and applying multi-agent reinforcement learning to design incentive-compatible protocols. We outline three interventions: a credit-based submission economy, MARL-optimized reviewer assignment, and hybrid verification of review consistency. We present threat models, equity considerations, and phased pilot metrics. This vision charts a research agenda toward sustainable peer review.

[AI-7] GAVEL: Towards rule-based safety through activation monitoring ICLR2026

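Quick read: This paper addresses the poor precision, limited flexibility, and lack of interpretability of current activation-based safety monitoring, problems that stem mainly from training detectors on broad misuse datasets. The key to the solution is a new rule-based paradigm for activation safety: model activations are represented as cognitive elements (CEs), fine-grained and interpretable factors such as "making a threat" or "payment processing," which can be composed to capture nuanced, domain-specific behaviors. On top of this representation, the authors build a framework that defines predicate rules over CEs and detects violations in real time, so practitioners can configure and update safeguards without retraining models or detectors, while supporting precise, customizable, transparent, and auditable AI governance.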

Link: https://arxiv.org/abs/2601.19768
Authors: Shir Rozenfeld,Rahul Pankajakshan,Itay Zloczower,Eyal Lenga,Gilad Gressel,Yisroel Mirsky
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted to ICLR 2026

Abstract: Large language models (LLMs) are increasingly paired with activation-based monitoring to detect and prevent harmful behaviors that may not be apparent at the surface-text level. However, existing activation safety approaches, trained on broad misuse datasets, struggle with poor precision, limited flexibility, and lack of interpretability. This paper introduces a new paradigm: rule-based activation safety, inspired by rule-sharing practices in cybersecurity. We propose modeling activations as cognitive elements (CEs), fine-grained, interpretable factors such as "making a threat" and "payment processing", that can be composed to capture nuanced, domain-specific behaviors with higher precision. Building on this representation, we present a practical framework that defines predicate rules over CEs and detects violations in real time. This enables practitioners to configure and update safeguards without retraining models or detectors, while supporting transparency and auditability. Our results show that compositional rule-based activation safety improves precision, supports domain customization, and lays the groundwork for scalable, interpretable, and auditable AI governance. We will release GAVEL as an open-source framework and provide an accompanying automated rule creation tool.

[AI-8] Agentic Design Patterns: A System-Theoretic Framework

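Quick read: This paper addresses the unreliability and brittleness of current agentic AI systems, which stem from hallucination and poor reasoning as well as from the ad-hoc way such systems are often designed. The key to the solution is a rigorous system-theoretic framework that deconstructs an agentic system into five core functional subsystems: Reasoning & World Model, Perception & Grounding, Action Execution, Learning & Adaptation, and Inter-Agent Communication. On this basis, the authors derive 12 reusable agentic design patterns, organized into four categories (Foundational, Cognitive & Decisional, Execution & Interaction, and Adaptive & Learning), giving a structured methodology for modular design, standardized communication, and improved reliability of agent systems.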

Link: https://arxiv.org/abs/2601.19752
Authors: Minh-Dung Dao,Quy Minh Le,Hoang Thanh Lam,Duc-Trong Le,Quoc-Viet Pham,Barry O’Sullivan,Hoang D. Nguyen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract: With the development of foundation models (FMs), agentic AI systems are getting more attention, yet their inherent issues like hallucination and poor reasoning, coupled with the frequent ad-hoc nature of system design, lead to unreliable and brittle applications. Existing efforts to characterise agentic design patterns often lack a rigorous systems-theoretic foundation, resulting in high-level or convenience-based taxonomies that are difficult to implement. This paper addresses this gap by introducing a principled methodology for engineering robust AI agents. We propose two primary contributions: first, a novel system-theoretic framework that deconstructs an agentic AI system into five core, interacting functional subsystems: Reasoning & World Model, Perception & Grounding, Action Execution, Learning & Adaptation, and Inter-Agent Communication. Second, derived from this architecture and directly mapped to a comprehensive taxonomy of agentic challenges, we present a collection of 12 agentic design patterns. These patterns - categorised as Foundational, Cognitive & Decisional, Execution & Interaction, and Adaptive & Learning - offer reusable, structural solutions to recurring problems in agent design. The utility of the framework is demonstrated by a case study on the ReAct framework, showing how the proposed patterns can rectify systemic architectural deficiencies. This work provides a foundational language and a structured methodology to standardise agentic design communication among researchers and engineers, leading to more modular, understandable, and reliable autonomous systems.

[AI-9] Veri-Sure: A Contract-Aware Multi-Agent Framework with Temporal Tracing and Formal Verification for Correct RTL Code Generation

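Quick read: This paper targets three bottlenecks in using large language models (LLMs) to generate Register-Transfer Level (RTL) code in Electronic Design Automation (EDA): (i) simulation-centric evaluation has limited test coverage and reliability; (ii) iterative debugging introduces regressions and repair hallucinations; and (iii) intent is reinterpreted across agent handoffs, causing semantic drift. The key to the solution is Veri-Sure, a multi-agent framework that establishes a design contract to align the agents' intent and uses a patching mechanism guided by static dependency slicing for precise, localized repairs. It also integrates a multi-branch verification pipeline that combines trace-driven temporal analysis with formal verification (assertion-based checking and boolean equivalence proofs), providing functional-correctness checks beyond pure simulation.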

Link: https://arxiv.org/abs/2601.19747
Authors: Jiale Liu,Taiyu Zhou,Tianqi Jiang
Affiliation: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:In the rapidly evolving field of Electronic Design Automation (EDA), the deployment of Large Language Models (LLMs) for Register-Transfer Level (RTL) design has emerged as a promising direction. However, silicon-grade correctness remains bottlenecked by: (i) limited test coverage and reliability of simulation-centric evaluation, (ii) regressions and repair hallucinations introduced by iterative debugging, and (iii) semantic drift as intent is reinterpreted across agent handoffs. In this work, we propose Veri-Sure, a multi-agent framework that establishes a design contract to align agents’ intent and uses a patching mechanism guided by static dependency slicing to perform precise, localized repairs. By integrating a multi-branch verification pipeline that combines trace-driven temporal analysis with formal verification consisting of assertion-based checking and boolean equivalence proofs, Veri-Sure enables functional correctness beyond pure simulations. We also introduce VerilogEval-v2-EXT, extending the original benchmark with 53 more industrial-grade design tasks and stratified difficulty levels, and show that Veri-Sure achieves state-of-the-art verified-correct RTL code generation performance, surpassing standalone LLMs and prior agentic systems.

[AI-10] Hyperbolic Additive Margin Softmax with Hierarchical Information for Speaker Verification ICASSP2026

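Quick read: This paper addresses the limited ability of conventional Euclidean-space speaker embedding learning to model hierarchical information in speaker features. The key to the solution is to use hyperbolic space, which represents hierarchical structure more efficiently. Two new loss functions are proposed, Hyperbolic Softmax (H-Softmax) and Hyperbolic Additive Margin Softmax (HAM-Softmax): speaker embeddings and class centers are projected into hyperbolic space and compared by hyperbolic distance, which incorporates hierarchical information, and HAM-Softmax additionally introduces a margin constraint to increase inter-class separability. Experiments show that H-Softmax and HAM-Softmax achieve average relative equal error rate (EER) reductions of 27.84% and 14.23% compared with standard Softmax and AM-Softmax, respectively, improving speaker verification performance while preserving the ability to model hierarchical structure.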

Link: https://arxiv.org/abs/2601.19709
Authors: Zhihua Fang,Liang He
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures, Accepted at ICASSP 2026

Abstract:Speaker embedding learning based on Euclidean space has achieved significant progress, but it is still insufficient in modeling hierarchical information within speaker features. Hyperbolic space, with its negative curvature geometric properties, can efficiently represent hierarchical information within a finite volume, making it more suitable for the feature distribution of speaker embeddings. In this paper, we propose Hyperbolic Softmax (H-Softmax) and Hyperbolic Additive Margin Softmax (HAM-Softmax) based on hyperbolic space. H-Softmax incorporates hierarchical information into speaker embeddings by projecting embeddings and speaker centers into hyperbolic space and computing hyperbolic distances. HAM-Softmax further enhances inter-class separability by introducing margin constraint on this basis. Experimental results show that H-Softmax and HAM-Softmax achieve average relative EER reductions of 27.84% and 14.23% compared with standard Softmax and AM-Softmax, respectively, demonstrating that the proposed methods effectively improve speaker verification performance and at the same time preserve the capability of hierarchical structure modeling. The code will be released at this https URL.
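The abstract above does not say which model of hyperbolic space the losses use, so the following is only a sketch under an assumption: it computes the geodesic distance in the Poincaré ball (curvature -1), a common choice for hyperbolic embeddings. The function name `poincare_distance` is hypothetical and is not taken from the paper or its code release.

```python
import math

def poincare_distance(x, y):
    """Geodesic distance between two points inside the unit Poincaré ball.

    Illustrative helper (assumed model, not the paper's implementation):
    d(x, y) = arcosh(1 + 2 * |x - y|^2 / ((1 - |x|^2) * (1 - |y|^2)))
    """
    sq_norm = lambda v: sum(a * a for a in v)
    nx, ny = sq_norm(x), sq_norm(y)
    if nx >= 1.0 or ny >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    diff = sq_norm([a - b for a, b in zip(x, y)])
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - nx) * (1.0 - ny)))

# The same Euclidean gap of 0.1 is a longer hyperbolic distance near the boundary.
near_origin = poincare_distance((0.0, 0.0), (0.1, 0.0))
near_boundary = poincare_distance((0.8, 0.0), (0.9, 0.0))
```

Distances grow rapidly toward the boundary of the ball, which is what lets tree-like (hierarchical) structure fit in a bounded region. A distance-based classifier could, for instance, use the negative distance between an embedding and each class center as the logit, optionally with a margin applied to the target class; that reading is an assumption and the paper should be consulted for the actual formulation.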

[AI-11] Out-of-Distribution Generalization via Invariant Trajectories for Multimodal Large Language Model Editing

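Quick read: This paper addresses the causal-underfit and causal-overfit problems that arise in knowledge editing for Multimodal Large Language Models (MLLMs) when editing relies on a rigid parameter-to-output mapping, which makes robust knowledge correction under cross-modal prompting difficult. The key to the solution is to reformulate MLLM knowledge editing as an out-of-distribution (OOD) generalization problem, identifying invariant causal trajectories to distinguish semantic shift from factual shift and thereby improve the reliability, locality, and generalization of edits. Concretely, the authors propose the ODEdit framework, which optimizes a tripartite OOD risk objective and adds a total variation penalty to stabilize edit trajectories against environmental variations, suppressing spurious correlations.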

Link: https://arxiv.org/abs/2601.19700
Authors: Jiajie Su,Haoyuan Wang,Xiaohua Feng,Yunshan Ma,Xiaobo Xia,Yuyuan Li,Xiaolin Zheng,Jianmao Xiao,Chaochao Chen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract: Knowledge editing emerges as a crucial technique for efficiently correcting incorrect or outdated knowledge in large language models (LLMs). Existing editing methods for unimodal LLMs rely on a rigid parameter-to-output mapping, which causes causal-underfit and causal-overfit in cascaded reasoning for Multimodal LLMs (MLLMs). In this paper, we reformulate MLLM editing as an out-of-distribution (OOD) generalization problem, where the goal is to distinguish semantic shift from factual shift and thus achieve robust editing among diverse cross-modal prompting. The key challenge of this OOD problem lies in identifying invariant causal trajectories that generalize accurately while suppressing spurious correlations. To address it, we propose ODEdit, a plug-and-play invariant learning based framework that optimizes the tripartite OOD risk objective to simultaneously enhance editing reliability, locality, and generality. We further introduce an edit trajectory invariant learning method, which integrates a total variation penalty into the risk minimization objective to stabilize edit trajectories against environmental variations. Theoretical analysis and extensive experiments demonstrate the effectiveness of ODEdit.

[AI-12] AlignCoder: Aligning Retrieval with Target Intent for Repository-Level Code Completion

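Quick read: This paper addresses the performance bottleneck of existing code large language models (code LLMs) on repository-level code completion, caused by their limited understanding of repository-specific context and domain knowledge. Retrieval-augmented generation (RAG) approaches have two fundamental problems here: misalignment between the query and the target code, and retrieval methods that cannot make effective use of inference information. The key to the solution is the AlignCoder framework, which introduces a query enhancement mechanism and a reinforcement learning based retriever training method. The former generates multiple candidate completions to build an enhanced query that bridges the semantic gap between the initial query and the target code; the latter trains AlignRetriever with reinforcement learning to use the inference information in the enhanced query for more accurate retrieval of code snippets.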

Link: https://arxiv.org/abs/2601.19697
Authors: Tianyue Jiang,Yanli Wang,Yanlin Wang,Daya Guo,Ensheng Shi,Yuchi Ma,Jiachi Chen,Zibin Zheng
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: To appear at ASE’25

Abstract:Repository-level code completion remains a challenging task for existing code large language models (code LLMs) due to their limited understanding of repository-specific context and domain knowledge. While retrieval-augmented generation (RAG) approaches have shown promise by retrieving relevant code snippets as cross-file context, they suffer from two fundamental problems: misalignment between the query and the target code in the retrieval process, and the inability of existing retrieval methods to effectively utilize the inference information. To address these challenges, we propose AlignCoder, a repository-level code completion framework that introduces a query enhancement mechanism and a reinforcement learning based retriever training method. Our approach generates multiple candidate completions to construct an enhanced query that bridges the semantic gap between the initial query and the target code. Additionally, we employ reinforcement learning to train an AlignRetriever that learns to leverage inference information in the enhanced query for more accurate retrieval. We evaluate AlignCoder on two widely-used benchmarks (CrossCodeEval and RepoEval) across five backbone code LLMs, demonstrating an 18.1% improvement in EM score compared to baselines on the CrossCodeEval benchmark. The results show that our framework achieves superior performance and exhibits high generalizability across various code LLMs and programming languages.

[AI-13] Cross-Domain Offshore Wind Power Forecasting: Transfer Learning Through Meteorological Clusters

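Quick read: This paper addresses the poor power-forecast accuracy of newly commissioned offshore wind farms that lack long-term local observations. Existing machine learning models perform well but usually need large volumes of site-specific data, which new farms do not yet have. The key to the solution is a transfer learning framework based on clustering by meteorological features: power output is grouped according to meteorological covariates, and forecasts come from an ensemble of expert models, each specialised in a distinct weather pattern. Because the experts are pre-trained on existing sites, the approach achieves accurate cross-domain forecasting with under five months of site-specific data, removing the need for a full year of local measurements while capturing transferable, climate-dependent dynamics.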

Link: https://arxiv.org/abs/2601.19674
Authors: Dominic Weisser,Chloé Hashimoto-Cullen,Benjamin Guedj
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Methodology (stat.ME)
Comments: 11 pages

Abstract:Ambitious decarbonisation targets are catalysing growth in orders of new offshore wind farms. For these newly commissioned plants to run, accurate power forecasts are needed from the onset. These allow grid stability, good reserve management and efficient energy trading. Despite machine learning models having strong performances, they tend to require large volumes of site-specific data that new farms do not yet have. To overcome this data scarcity, we propose a novel transfer learning framework that clusters power output according to covariate meteorological features. Rather than training a single, general-purpose model, we thus forecast with an ensemble of expert models, each trained on a cluster. As these pre-trained models each specialise in a distinct weather pattern, they adapt efficiently to new sites and capture transferable, climate-dependent dynamics. Through the expert models’ built-in calibration to seasonal and meteorological variability, we remove the industry-standard requirement of local measurements over a year. Our contributions are two-fold - we propose this novel framework and comprehensively evaluate it on eight offshore wind farms, achieving accurate cross-domain forecasting with under five months of site-specific data. Our experiments achieve a MAE of 3.52%, providing empirical verification that reliable forecasts do not require a full annual cycle. Beyond power forecasting, this climate-aware transfer learning method opens new opportunities for offshore wind applications such as early-stage wind resource assessment, where reducing data requirements can significantly accelerate project development whilst effectively mitigating its inherent risks.

[AI-14] A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models EACL2026

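Quick read: This paper addresses a gap in how the audio modality of Multimodal Large Language Models (MLLMs) is evaluated: existing benchmarks focus on isolated audio tasks such as speaker diarization or gender identification, so they cannot verify whether a model can reason across audio tasks of different categories. The key to the solution is Audio Reasoning Tasks (ART), a new benchmark designed to assess how well multimodal models solve problems that require reasoning over the audio signal.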

Link: https://arxiv.org/abs/2601.19673
Authors: Iwona Christop(1),Mateusz Czyżnikiewicz(2),Paweł Skórzewski(1),Łukasz Bondaruk(2),Jakub Kubiak(2),Marcin Lewandowski(2),Marek Kubis(1) ((1) Adam Mickiewicz University, (2) Samsung R&D Institute Poland)
Affiliation: Adam Mickiewicz University; Samsung R&D Institute Poland
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: 31 pages, 2 figures, accepted to EACL 2026

Abstract:The present benchmarks for testing the audio modality of multimodal large language models concentrate on testing various audio tasks such as speaker diarization or gender identification in isolation. Whether a multimodal model can answer the questions that require reasoning skills to combine audio tasks of different categories, cannot be verified with their use. To address this issue, we propose Audio Reasoning Tasks (ART), a new benchmark for assessing the ability of multimodal models to solve problems that require reasoning over audio signal.

[AI-15] ProToken: Token-Level Attribution for Federated Large Language Models

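Quick read: This paper addresses the difficulty of tracing which client contributed to each generated token when federated Large Language Models are deployed in critical applications, a gap that hinders debugging, malicious client identification, fair reward allocation, and trust verification. The key to the solution is ProToken, a provenance method for token-level attribution built on two insights: (1) transformer architectures concentrate task-specific signals in later blocks, so strategic layer selection keeps the computation tractable; and (2) gradient-based relevance weighting filters out irrelevant neural activations and focuses attribution on the neurons that directly influence token generation. Across 16 configurations, ProToken reaches 98% average attribution accuracy and stays accurate as the number of clients is scaled up.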

Link: https://arxiv.org/abs/2601.19672
Authors: Waris Gill,Ahmad Humayun,Ali Anwar,Muhammad Ali Gulzar
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:Federated Learning (FL) enables collaborative training of Large Language Models (LLMs) across distributed data sources while preserving privacy. However, when federated LLMs are deployed in critical applications, it remains unclear which client(s) contributed to specific generated responses, hindering debugging, malicious client identification, fair reward allocation, and trust verification. We present ProToken, a novel Provenance methodology for Token-level attribution in federated LLMs that addresses client attribution during autoregressive text generation while maintaining FL privacy constraints. ProToken leverages two key insights to enable provenance at each token: (1) transformer architectures concentrate task-specific signals in later blocks, enabling strategic layer selection for computational tractability, and (2) gradient-based relevance weighting filters out irrelevant neural activations, focusing attribution on neurons that directly influence token generation. We evaluate ProToken across 16 configurations spanning four LLM architectures (Gemma, Llama, Qwen, SmolLM) and four domains (medical, financial, mathematical, coding). ProToken achieves 98% average attribution accuracy in correctly localizing responsible client(s), and maintains high accuracy when the number of clients is scaled, validating its practical viability for real-world deployment settings.

[AI-16] Robustness of Constraint Automata for Description Logics with Concrete Domains

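[Quick read]: This paper studies the decidability and complexity of the ontology consistency problem for description logics with concrete domains. Existing tableaux-based and type-elimination methods have partly addressed the problem without reaching the optimal complexity upper bound. The paper proposes an automata-based approach whose key idea is to enrich automaton transitions with symbolic constraints, so that consistency reduces to the nonemptiness problem for such automata. It shows that nonemptiness is in EXPTIME when the concrete domains satisfy a few simple properties, which gives an EXPTIME upper bound for consistency by reduction. The approach also keeps EXPTIME membership under extensions such as inverse roles, functional role names, and constraint assertions, which illustrates its robustness.

Link: https://arxiv.org/abs/2601.19644
Authors: Stéphane Demri, Tianwen Gu
Affiliations: unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: Extended version of a paper accepted at CSL’26, Paris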

Abstract:Decidability or complexity issues about the consistency problem for description logics with concrete domains have already been analysed with tableaux-based or type elimination methods. Concrete domains in ontologies are essential to consider concrete objects and predefined relations. In this work, we expose an automata-based approach leading to the optimal upper bound EXPTIME, that is designed by enriching the transitions with symbolic constraints. We show that the nonemptiness problem for such automata belongs to EXPTIME if the concrete domains satisfy a few simple properties. Then, we provide a reduction from the consistency problem for ontologies, yielding EXPTIME-membership. Thanks to the expressivity of constraint automata, the results are extended to additional ingredients such as inverse roles, functional role names and constraint assertions, while maintaining EXPTIME-membership, which illustrates the robustness of the approach.

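[AI-17] Tracking Drift: Variation-Aware Entropy Scheduling for Non-Stationary Reinforcement Learning

[Quick read]: This paper addresses the mismatch in exploration strength caused by environment drift in real-world reinforcement learning. With a static entropy coefficient or target entropy, conventional methods over-explore while the environment is stable and under-explore after drift, so recovery is slow, and there is no theoretical guidance on how exploration strength should scale with drift magnitude. The key to the solution is Adaptive Entropy Scheduling (AES). Its core idea is to reduce entropy scheduling under non-stationarity to a one-dimensional, round-by-round trade-off between quickly tracking the optimum after drift and avoiding needless randomness while the environment is stable. AES uses drift proxies that can be observed online to adjust the entropy coefficient/temperature dynamically, so exploration strength follows the degree of drift with almost no structural changes and very little computational overhead.

Link: https://arxiv.org/abs/2601.19624
Authors: Tongxi Wang, Zhuoyang Xia, Xinran Chen, Shan Liu
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: none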

Abstract:Real-world reinforcement learning often faces environment drift, but most existing methods rely on static entropy coefficients/target entropy, causing over-exploration during stable periods and under-exploration after drift (thus slow recovery), and leaving unanswered the principled question of how exploration intensity should scale with drift magnitude. We prove that entropy scheduling under non-stationarity can be reduced to a one-dimensional, round-by-round trade-off, faster tracking of the optimal solution after drift vs. avoiding gratuitous randomness when the environment is stable, so exploration strength can be driven by measurable online drift signals. Building on this, we propose AES (Adaptive Entropy Scheduling), which adaptively adjusts the entropy coefficient/temperature online using observable drift proxies during training, requiring almost no structural changes and incurring minimal overhead. Across 4 algorithm variants, 12 tasks, and 4 drift modes, AES significantly reduces the fraction of performance degradation caused by drift and accelerates recovery after abrupt changes.
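
The idea of driving the entropy coefficient from an online drift signal can be sketched as follows. This is only an illustration, not the AES algorithm from the paper: the drift proxy (gap between a fast and a slow moving average), the linear schedule, and every name and default value (`DriftProxy`, `entropy_coef`, `alpha_min`, `alpha_max`, `gain`) are assumptions made for this example.

```python
class DriftProxy:
    """Gap between a fast and a slow moving average of an online signal.

    The signal (e.g. per-episode return or TD error) is an assumption of
    this sketch; the abstract only mentions "observable drift proxies".
    """

    def __init__(self, fast=0.3, slow=0.03):
        self.fast, self.slow = fast, slow
        self.fast_avg = None
        self.slow_avg = None

    def update(self, signal):
        if self.fast_avg is None:
            # First observation initialises both averages.
            self.fast_avg = self.slow_avg = float(signal)
        else:
            self.fast_avg += self.fast * (signal - self.fast_avg)
            self.slow_avg += self.slow * (signal - self.slow_avg)
        return abs(self.fast_avg - self.slow_avg)


def entropy_coef(drift, alpha_min=0.01, alpha_max=0.2, gain=1.0):
    """Hypothetical schedule: more measured drift means more exploration."""
    return min(alpha_max, max(alpha_min, alpha_min + gain * drift))


proxy = DriftProxy()
# Stable phase: the signal does not move, so the coefficient stays at its floor.
stable = [entropy_coef(proxy.update(1.0)) for _ in range(50)]
# Abrupt change: the fast average reacts first and the coefficient rises.
after_shift = entropy_coef(proxy.update(0.0))
print(stable[-1], after_shift)
```

In an entropy-regularised learner such as SAC, the returned value would stand in for the fixed temperature; how the paper actually couples the two is not stated in the abstract.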

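[AI-18] Algorithmic Prompt-Augmentation for Efficient LLM-Based Heuristic Design for A* Search

[Quick read]: This paper addresses the fact that heuristic functions are traditionally handcrafted, which is time-consuming and requires expertise, with the goal of improving the performance of A* search. The key to the solution is a new domain-agnostic prompt augmentation strategy, Algorithmic-Contextual EoH (A-CEoH), which embeds the A* code in the prompt to exploit in-context learning and thereby automates heuristic generation. Experiments show that the method significantly improves heuristic quality and can even outperform expert-designed heuristics.

Link: https://arxiv.org/abs/2601.19622
Authors: Thomas Bömer, Nico Koltermann, Max Disselnmeyer, Bastian Amberg, Anne Meyer
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: accepted at EvoStar conference; Code: this https URL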

Abstract:Heuristic functions are essential to the performance of tree search algorithms such as A*, where their accuracy and efficiency directly impact search outcomes. Traditionally, such heuristics are handcrafted, requiring significant expertise. Recent advances in large language models (LLMs) and evolutionary frameworks have opened the door to automating heuristic design. In this paper, we extend the Evolution of Heuristics (EoH) framework to investigate the automated generation of guiding heuristics for A* search. We introduce a novel domain-agnostic prompt augmentation strategy that includes the A* code into the prompt to leverage in-context learning, named Algorithmic-Contextual EoH (A-CEoH). To evaluate the effectiveness of A-CEoH, we study two problem domains: the Unit-Load Pre-Marshalling Problem (UPMP), a niche problem from warehouse logistics, and the classical sliding puzzle problem (SPP). Our computational experiments show that A-CEoH can significantly improve the quality of the generated heuristics and even outperform expert-designed heuristics.
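
To make concrete what "guiding heuristic for A*" means here, the sketch below runs a generic A* with a pluggable heuristic on the 3x3 sliding puzzle. It is not the paper's code: the function names (`a_star`, `puzzle_neighbors`, `manhattan`) and the choice of the Manhattan-distance heuristic are assumptions for illustration. In a setup like the one the abstract describes, the `heuristic` argument is the part an LLM would be asked to generate.

```python
import heapq
from itertools import count

GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)  # 0 marks the blank tile


def a_star(start, is_goal, neighbors, heuristic):
    """Return the cost of a cheapest path from start to a goal, or None."""
    tie = count()  # tie-breaker so the heap never compares states directly
    frontier = [(heuristic(start), next(tie), 0, start)]
    best_g = {start: 0}
    while frontier:
        _, _, g, state = heapq.heappop(frontier)
        if g > best_g.get(state, float("inf")):
            continue  # stale queue entry, a cheaper path was already found
        if is_goal(state):
            return g
        for nxt, cost in neighbors(state):
            new_g = g + cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(
                    frontier, (new_g + heuristic(nxt), next(tie), new_g, nxt)
                )
    return None


def puzzle_neighbors(state):
    """Yield (next_state, step_cost) for every legal blank move."""
    i = state.index(0)
    r, c = divmod(i, 3)
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < 3 and 0 <= nc < 3:
            j = nr * 3 + nc
            s = list(state)
            s[i], s[j] = s[j], s[i]
            yield tuple(s), 1


def manhattan(state):
    """Sum of tile distances to their goal cells (admissible for this puzzle)."""
    total = 0
    for idx, tile in enumerate(state):
        if tile:
            goal_idx = tile - 1
            total += abs(idx // 3 - goal_idx // 3) + abs(idx % 3 - goal_idx % 3)
    return total


start = (1, 2, 3, 4, 5, 6, 0, 7, 8)
moves = a_star(start, lambda s: s == GOAL, puzzle_neighbors, manhattan)
print(moves)
```

Swapping `manhattan` for `lambda s: 0` gives the same answer with more node expansions, which is the sense in which heuristic quality affects search effort.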

[AI-19] R3: Replay Reflection and Ranking Rewards for LLM Reinforcement Learning

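[Quick read]: This paper addresses the unstable and inefficient training of large reasoning models (LRMs) on complex tasks caused by the collapse of intra-group advantages. Existing group-based policy optimization methods rely on advantage gaps induced by high-quality samples in the same batch, and these gaps are hard to maintain on challenging tasks. The key to the solution is a reinforcement learning mechanism named R³ with three core components: (1) a cross-context replay strategy that maintains the intra-group advantage by recalling valuable samples from historical trajectories of the same query; (2) an in-context self-reflection mechanism that lets the model refine its outputs using past failures; (3) a structural entropy ranking reward that ranks truncated or failed samples relative to each other by token-level entropy patterns, capturing both local exploration and global stability. The method clearly outperforms the base models on math benchmarks while using fewer reasoning tokens.

Link: https://arxiv.org/abs/2601.19620
Authors: Zhizheng Jiang, Kang Zhao, Weikai Xu, Xinkui Lin, Wei Liu, Jian Luan, Shuo Shang, Peng Han
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: none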

Abstract:Large reasoning models (LRMs) aim to solve diverse and complex problems through structured reasoning. Recent advances in group-based policy optimization methods have shown promise in enabling stable advantage estimation without reliance on process-level annotations. However, these methods rely on advantage gaps induced by high-quality samples within the same batch, which makes the training process fragile and inefficient when intra-group advantages collapse under challenging tasks. To address these problems, we propose a reinforcement learning mechanism named R³ that works along three directions: (1) a cross-context Replay strategy that maintains the intra-group advantage by recalling valuable examples from historical trajectories of the same query, (2) an in-context self-Reflection mechanism enabling models to refine outputs by leveraging past failures, and (3) a structural entropy Ranking reward, which assigns relative rewards to truncated or failed samples by ranking responses based on token-level entropy patterns, capturing both local exploration and global stability. We implement our method on Deepseek-R1-Distill-Qwen-1.5B and train it on the DeepscaleR-40k in the math domain. Experiments demonstrate our method achieves SoTA performance on several math benchmarks, representing significant improvements and fewer reasoning tokens over the base models. Code and model will be released.
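
The "advantage collapse" mentioned above can be seen in a few lines. The sketch assumes the common group-normalised advantage (reward minus group mean, divided by group standard deviation); the abstract does not give the exact estimator, and `group_advantages` and `eps` are names chosen for this example.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-normalised advantages for responses sampled from one query."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes: the single correct response gets a positive advantage.
mixed = group_advantages([1.0, 0.0, 0.0, 0.0])

# Hard query where every sample fails: all advantages are zero,
# so this group contributes no learning signal.
collapsed = group_advantages([0.0, 0.0, 0.0, 0.0])
print(mixed, collapsed)
```

Under this assumption, replaying a past successful trajectory for the same query into the second group would restore a non-zero spread, which appears to be the motivation for the replay component.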

[AI-20] Safe Exploration via Policy Priors

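[Quick read]: This paper addresses safe exploration for reinforcement learning (RL) agents that learn and adapt online in real-world environments, beyond controlled simulation. The key to the solution is a framework called SOOPER, which uses suboptimal but conservative policies (for example, obtained from offline data or simulators) as priors, explores optimistically with probabilistic dynamics models, and falls back to the conservative policy when exploration becomes too risky. Theoretical analysis shows that SOOPER guarantees safety throughout learning and converges to an optimal policy by bounding the cumulative regret.

Link: https://arxiv.org/abs/2601.19612
Authors: Manuel Wendl, Yarden As, Manish Prajapat, Anton Pollak, Stelian Coros, Andreas Krause
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: none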

Abstract:Safe exploration is a key requirement for reinforcement learning (RL) agents to learn and adapt online, beyond controlled (e.g. simulated) environments. In this work, we tackle this challenge by utilizing suboptimal yet conservative policies (e.g., obtained from offline data or simulators) as priors. Our approach, SOOPER, uses probabilistic dynamics models to optimistically explore, yet pessimistically fall back to the conservative policy prior if needed. We prove that SOOPER guarantees safety throughout learning, and establish convergence to an optimal policy by bounding its cumulative regret. Extensive experiments on key safe RL benchmarks and real-world hardware demonstrate that SOOPER is scalable, outperforms the state-of-the-art and validate our theoretical guarantees in practice.

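[AI-21] ComAgent: Multi-LLM based Agentic AI Empowered Intelligent Wireless Networks

[Quick read]: This paper addresses complex cross-layer optimization in 6G networks, specifically how to turn high-level user intents into executable mathematical formulations and simulations efficiently and accurately, a step that is slow and error-prone when done by hand. The key to the solution is ComAgent, an agentic AI framework built on multiple large language models (LLMs). Through a closed-loop Perception-Planning-Action-Reflection cycle, it coordinates specialized agents for literature search, coding, and scoring, and it decomposes problems, corrects its own errors, and refines results iteratively. This raises the level of automation and accuracy in going from user intent to an executable optimization scheme.

Link: https://arxiv.org/abs/2601.19607
Authors: Haoyun Li, Ming Xiao, Kezhi Wang, Robert Schober, Dong In Kim, Yong Liang Guan
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: none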

Abstract:Emerging 6G networks rely on complex cross-layer optimization, yet manually translating high-level intents into mathematical formulations remains a bottleneck. While Large Language Models (LLMs) offer promise, monolithic approaches often lack sufficient domain grounding, constraint awareness, and verification capabilities. To address this, we present ComAgent, a multi-LLM agentic AI framework. ComAgent employs a closed-loop Perception-Planning-Action-Reflection cycle, coordinating specialized agents for literature search, coding, and scoring to autonomously generate solver-ready formulations and reproducible simulations. By iteratively decomposing problems and self-correcting errors, the framework effectively bridges the gap between user intent and execution. Evaluations demonstrate that ComAgent achieves expert-comparable performance in complex beamforming optimization and outperforms monolithic LLMs across diverse wireless tasks, highlighting its potential for automating design in emerging wireless networks.

[AI-22] Intersectional Fairness via Mixed-Integer Optimization

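[Quick read]: This paper addresses the fairness and transparency of AI models in high-risk domains such as finance and healthcare, in particular how to achieve fairness at the level of intersecting groups when regulatory frameworks define "bias" only vaguely. The core challenge is that conventional methods struggle to guarantee fair behaviour on intersections of several protected groups while keeping the model intrinsically interpretable. The key to the solution is a unified framework based on Mixed-Integer Optimization (MIO), which trains classifiers that are both intersectionally fair and intrinsically interpretable. The classifiers keep intersectional bias below an acceptable threshold while maintaining performance, offering a robust option for AI deployment in regulated industries.

Link: https://arxiv.org/abs/2601.19595
Authors: Jiří Němeček, Mark Kozdoba, Illia Kryvoviaz, Tomáš Pevný, Jakub Mareček
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 17 pages, 10 figures, 1 table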

Abstract:The deployment of Artificial Intelligence in high-risk domains, such as finance and healthcare, necessitates models that are both fair and transparent. While regulatory frameworks, including the EU’s AI Act, mandate bias mitigation, they are deliberately vague about the definition of bias. In line with existing research, we argue that true fairness requires addressing bias at the intersections of protected groups. We propose a unified framework that leverages Mixed-Integer Optimization (MIO) to train intersectionally fair and intrinsically interpretable classifiers. We prove the equivalence of two measures of intersectional fairness (MSD and SPSF) in detecting the most unfair subgroup and empirically demonstrate that our MIO-based algorithm improves performance in finding bias. We train high-performing, interpretable classifiers that bound intersectional bias below an acceptable threshold, offering a robust solution for regulated industries and beyond.

[AI-23] From Atoms to Chains: Divergence-Guided Reasoning Curriculum for Unlabeled LLM Domain Adaptation

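[Quick read]: This paper addresses the challenge of adapting large language models (LLMs) to specialized domains without human-annotated data, and in particular the tendency of conventional knowledge distillation to become coarse-grained mimicry, which makes the student inefficient and may pass on the teacher's reasoning flaws. The key to the solution is the Divergence-Guided Reasoning Curriculum (DGRC) framework. When the student and teacher disagree on a reasoning path, DGRC performs a diagnostic analysis of the two paths and generates high-confidence atomic question-answer pairs that target the student's knowledge gaps. The same atomic pairs serve as factual criteria for filtering the teacher's original reasoning chains, which yields a verified Chain-of-Thought (CoT) curriculum that teaches the student to integrate atomic knowledge into complete reasoning.

Link: https://arxiv.org/abs/2601.19588
Authors: Yongqi Wang, Xiaofeng Ji, Jie Wang, Qingbin Li, Xiao Xiong, Zheming Yang, Jian Xu, Minghui Qiu, Xinxiao Wu
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Code: this https URL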

Abstract:Adapting Large Language Models (LLMs) to specialized domains without human-annotated data is a crucial yet formidable challenge. Widely adopted knowledge distillation methods often devolve into coarse-grained mimicry, where the student model inefficiently targets its own weaknesses and risks inheriting the teacher’s reasoning flaws. This exposes a critical pedagogical dilemma: how to devise a reliable curriculum when the teacher itself is not an infallible expert. Our work resolves this by capitalizing on a key insight: while LLMs may exhibit fallibility in complex, holistic reasoning, they often exhibit high fidelity on focused, atomic sub-problems. Based on this, we propose Divergence-Guided Reasoning Curriculum (DGRC), which constructs a learning path from atomic knowledge to reasoning chains by dynamically deriving two complementary curricula from disagreements in reasoning pathways. When a student and teacher produce conflicting results, DGRC directs the teacher to perform a diagnostic analysis: it analyzes both reasoning paths to formulate atomic queries that target the specific points of divergence, and then self-answers these queries to create high-confidence atomic question-answer pairs. These pairs then serve a dual purpose: (1) providing an atomic curriculum to rectify the student’s knowledge gaps, and (2) serving as factual criteria to filter the teacher’s original reasoning chains, yielding a verified CoT curriculum that teaches the student how to integrate atomic knowledge into complete reasoning paths. Experiments across the medical and legal domains on student models of various sizes demonstrate the effectiveness of our DGRC framework. Notably, our method achieves a 7.76% relative improvement for the 1.5B student model in the medical domain over strong unlabeled baseline.

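[AI-24] LLM-Enhanced Reinforcement Learning for Long-Term User Satisfaction in Interactive Recommendation

[Quick read]: This paper addresses content homogeneity and filter-bubble effects that arise in interactive recommender systems from overfitting to short-term user preferences, as well as the limitation that existing methods mostly work in static or one-shot settings and do not model how user interests evolve. The key to the solution is LERL, a reinforcement learning (RL) framework enhanced by a large language model (LLM). Its core is a hierarchical design: a high-level LLM-driven planner selects semantically diverse content categories to widen exploration, and a low-level RL policy recommends personalized items within the selected semantic space. This narrows the action space, improves planning efficiency, reduces exposure to redundant content, and significantly improves long-term user satisfaction.

Link: https://arxiv.org/abs/2601.19585
Authors: Chongjun Xia, Yanchun Peng, Xianzhi Wang
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: none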

Abstract:Interactive recommender systems can dynamically adapt to user feedback, but often suffer from content homogeneity and filter bubble effects due to overfitting short-term user preferences. While recent efforts aim to improve content diversity, they predominantly operate in static or one-shot settings, neglecting the long-term evolution of user interests. Reinforcement learning provides a principled framework for optimizing long-term user satisfaction by modeling sequential decision-making processes. However, its application in recommendation is hindered by sparse, long-tailed user-item interactions and limited semantic planning capabilities. In this work, we propose LLM-Enhanced Reinforcement Learning (LERL), a novel hierarchical recommendation framework that integrates the semantic planning power of LLM with the fine-grained adaptability of RL. LERL consists of a high-level LLM-based planner that selects semantically diverse content categories, and a low-level RL policy that recommends personalized items within the selected semantic space. This hierarchical design narrows the action space, enhances planning efficiency, and mitigates overexposure to redundant content. Extensive experiments on real-world datasets demonstrate that LERL significantly improves long-term user satisfaction when compared with state-of-the-art baselines. The implementation of LERL is available at this https URL.

[AI-25] Learning Adaptive Parallel Execution for Efficient Code Localization

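[Quick read]: This paper addresses inefficient code localization in automated software development pipelines, in particular the 34.9% redundant invocation rate observed when tools run in parallel, which cancels out the speed benefit of parallelism. The key to the solution is FuseSearch, which recasts the problem as a joint quality-efficiency optimization task. It defines "tool efficiency" (unique information gain per invocation) as the core metric and trains with a two-phase combination of supervised fine-tuning (SFT) and reinforcement learning (RL), so the agent adjusts its search breadth to the task context and moves smoothly from exploration to refinement. On SWE-bench Verified this yields 84.7% file-level and 56.4% function-level F₁ scores with a 93.6% speedup, 67.7% fewer turns, and 68.9% fewer tokens, indicating that efficiency-aware training suppresses noisy redundant signals and supports high-performance, cost-effective localization.

Link: https://arxiv.org/abs/2601.19568
Authors: Ke Xu, Siyang Xiao, Ming Liang, Yichen Yu, Zhixiang Wang, Jingxuan Xu, Dajun Chen, Wei Jiang, Yong Li
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 13 pages, 4 figures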

Abstract:Code localization constitutes a key bottleneck in automated software development pipelines. While concurrent tool execution can enhance discovery speed, current agents demonstrate a 34.9% redundant invocation rate, which negates parallelism benefits. We propose FuseSearch, reformulating parallel code localization as a joint quality-efficiency optimization task. Through defining tool efficiency, the ratio of unique information gain to invocation count, we utilize a two-phase SFT and RL training approach for learning adaptive parallel strategies. Different from fixed-breadth approaches, FuseSearch dynamically modulates search breadth according to task context, evolving from exploration phases to refinement stages. Evaluated on SWE-bench Verified, FuseSearch-4B achieves SOTA-level performance (84.7% file-level and 56.4% function-level F₁ scores) with 93.6% speedup, utilizing 67.7% fewer turns and 68.9% fewer tokens. Results indicate that efficiency-aware training naturally improves quality through eliminating noisy redundant signals, enabling high-performance cost-effective localization agents.
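
The tool-efficiency ratio defined above can be illustrated with a small calculation. This sketch assumes that "unique information gain" is measured as the number of distinct items (here, file paths) returned across calls, and that a call is redundant when it returns nothing new; the paper may measure both differently. The function names and the sample data are made up for the example.

```python
def tool_efficiency(call_results):
    """Distinct items discovered per tool invocation (assumed definition)."""
    if not call_results:
        return 0.0
    seen = set()
    for result in call_results:
        seen.update(result)
    return len(seen) / len(call_results)


def redundant_rate(call_results):
    """Fraction of invocations that return no previously unseen item."""
    if not call_results:
        return 0.0
    seen, redundant = set(), 0
    for result in call_results:
        new_items = set(result) - seen
        if not new_items:
            redundant += 1
        seen.update(new_items)
    return redundant / len(call_results)


# Hypothetical results of four search calls issued by an agent.
calls = [{"a.py", "b.py"}, {"b.py"}, {"c.py"}, {"a.py", "c.py"}]
print(tool_efficiency(calls), redundant_rate(calls))
```

With these four calls, three distinct files are found, so the efficiency is 0.75, and two of the four calls add nothing new.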

[AI-26] AROMMA: Unifying Olfactory Embeddings for Single Molecules and Mixtures

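[Quick read]: This paper addresses the fact that public olfaction datasets are small and split between single molecules and mixtures, which limits the learning of general odor representations. Existing methods either learn embeddings for single molecules only or handle mixtures through similarity or pairwise label prediction, so the two kinds of representation stay separate and unaligned. The key to the solution is the AROMMA framework, which encodes each molecule with a chemical foundation model and composes two-molecule mixtures with an attention-based aggregator that ensures permutation invariance while modelling asymmetric molecular interactions. It also aligns odor descriptor sets with knowledge distillation and class-aware pseudo-labeling to fill in missing mixture annotations. The method reaches state-of-the-art performance on both single-molecule and molecule-pair datasets, with up to 19.1% AUROC improvement.

Link: https://arxiv.org/abs/2601.19561
Authors: Dayoung Kang, JongWon Kim, Jiho Park, Keonseock Lee, Ji-Woong Choi, Jinhyun So
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: none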

Abstract:Public olfaction datasets are small and fragmented across single molecules and mixtures, limiting learning of generalizable odor representations. Recent works either learn single-molecule embeddings or address mixtures via similarity or pairwise label prediction, leaving representations separate and unaligned. In this work, we propose AROMMA, a framework that learns a unified embedding space for single molecules and two-molecule mixtures. Each molecule is encoded by a chemical foundation model and the mixtures are composed by an attention-based aggregator, ensuring both permutation invariance and asymmetric molecular interactions. We further align odor descriptor sets using knowledge distillation and class-aware pseudo-labeling to enrich missing mixture annotations. AROMMA achieves state-of-the-art performance in both single-molecule and molecule-pair datasets, with up to 19.1% AUROC improvement, demonstrating a robust generalization in two domains.

[AI-27] Scale-Consistent State-Space Dynamics via Fractal of Stationary Transformations

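[Quick read]: This paper addresses the fact that deep learning models offer no structural guarantee on the validity of intermediate representations, which makes early stopping and adaptive computation ill-posed. The key to the solution is a structural requirement for state-space models, scale-consistent latent dynamics, from which the authors derive the Fractal of Stationary Transformations (FROST). FROST enforces a self-similar representation manifold through a fractal inductive bias. Under this geometry, intermediate states correspond to different resolutions of the same representation, and the analysis establishes contraction and stable convergence across iterations. As a result, early stopping can be formulated as a ranking mechanism driven by intrinsic feature quality rather than by an extrinsic objective.

Link: https://arxiv.org/abs/2601.19551
Authors: Geunhyeok Yu, Hyoseok Hwang
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages (excluding 2 pages of references), 3 tables, 2 figures. Appendix: 4 pages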

Abstract:Recent deep learning models increasingly rely on depth without structural guarantees on the validity of intermediate representations, rendering early stopping and adaptive computation ill-posed. We address this limitation by formulating a structural requirement for state-space model’s scale-consistent latent dynamics across iterative refinement, and derive Fractal of Stationary Transformations (FROST), which enforces a self-similar representation manifold through a fractal inductive bias. Under this geometry, intermediate states correspond to different resolutions of a shared representation, and we provide a geometric analysis establishing contraction and stable convergence across iterations. As a consequence of this scale-consistent structure, halting naturally admits a ranking-based formulation driven by intrinsic feature quality rather than extrinsic objectives. Controlled experiments on ImageNet-100 empirically verify the predicted scale-consistent behavior, showing that adaptive efficiency emerges from the aligned latent geometry.

[AI-28] SLM-SS: Speech Language Model for Generative Speech Separation

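[Quick read]: This paper addresses the fact that current neural speech separation (SS) methods improve signal-level metrics but struggle to preserve the intelligibility of the separated speech, which hurts downstream tasks such as speech recognition. The key to the solution is to model speech separation as discrete multi-codebook sequence generation, using an encoder-decoder architecture to map quantized speech mixtures to target token sequences. Two modelling strategies are used, autoregressive and non-autoregressive, with the non-autoregressive model improving decoding efficiency for residual tokens. Experiments on LibriMix show better preservation of intelligibility than existing methods and stronger linguistic consistency across several downstream tasks.

Link: https://arxiv.org/abs/2601.19533
Authors: Tianhua Li, Chenda Li, Wei Wang, Xin Zhou, Xihui Chen, Jianqing Gao, Yanmin Qian
Affiliations: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: none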

Abstract:Speech separation (SS) has advanced significantly with neural network-based methods, showing improved performance on signal-level metrics. However, these methods often struggle to maintain speech intelligibility in the separated signals, which can negatively affect the performance of downstream tasks such as speech recognition. In this work, we propose SLM-SS, a novel approach that applies speech language models to SS, aiming to enhance the intelligibility and coherence of the separated signals. We frame SS as discrete multi-codebook sequence generation, using Encoder-Decoder models to map quantized speech mixtures to target tokens. In addition to the autoregressive modeling strategy, we introduce a non-autoregressive model to improve decoding efficiency for residual tokens. Experimental results on the LibriMix dataset demonstrate that our approach shows significantly better preservation of speech intelligibility, leading to improved linguistic consistency in a variety of downstream tasks compared to existing approaches.

[AI-29] Fuzzy expert system for the process of collecting and purifying acidic water: a digital twin approach

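[Quick read]: This paper addresses the treatment of sour water, that is, wastewater containing acidic substances such as hydrogen sulfide and carbon dioxide, in order to lower emission risks, reduce equipment corrosion, allow water reuse, and cut operating costs. The core of the solution is a fuzzy expert system combined with a custom digital twin, which controls key process parameters automatically by mimicking human decision-making. A high-fidelity simulation of the industrial process is built in Honeywell UniSim Design R492, valve dynamics are modelled in MATLAB, and OPC DA provides real-time data exchange between controller and simulator. The controller uses a split-range strategy and is tested with several defuzzification methods over 105 scenarios. The result is a simple, intuitive, general-purpose control framework suited to process optimization and automation in a range of industrial settings.

Link: https://arxiv.org/abs/2601.19527
Authors: Temirbolat Maratuly, Pakizar Shamoi, Timur Samigulin
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: none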

Abstract:Purifying sour water is essential for reducing emissions, minimizing corrosion risks, enabling the reuse of treated water in industrial or domestic applications, and ultimately lowering operational costs. Moreover, automating the purification process helps reduce the risk of worker harm by limiting human involvement. Crude oil contains acidic components such as hydrogen sulfide, carbon dioxide, and other chemical compounds. During processing, these substances are partially released into sour water. If not properly treated, sour water poses serious environmental threats and accelerates the corrosion of pipelines and equipment. This paper presents a fuzzy expert system, combined with a custom-generated digital twin, developed from a documented industrial process to maintain key parameters at desired levels by mimicking human reasoning. The control strategy is designed to be simple and intuitive, allowing junior or non-expert personnel to interact with the system effectively. The digital twin was developed using Honeywell UniSim Design R492 to simulate real industrial behavior accurately. Valve dynamics were modeled through system identification in MATLAB, and real-time data exchange between the simulator and controller was established using OPC DA. The fuzzy controller applies split-range control to two valves and was tested under 21 different initial pressure conditions using five distinct defuzzification strategies, resulting in a total of 105 unique test scenarios. System performance was evaluated using both error-based metrics (MSE, RMSE, MAE, IAE, ISE, ITAE) and dynamic response metrics, including overshoot, undershoot, rise time, fall time, settling time, and steady-state error. A web-based simulation interface was developed in Python using the Streamlit framework. Although demonstrated here for sour water treatment, the proposed fuzzy expert system is general-purpose.
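
Two details in the abstract lend themselves to a short illustration: split-range control of two valves, and the 21 x 5 = 105 test matrix. The sketch below is not the paper's controller. The 50% split point, the valve names, and the five defuzzification labels (the familiar centroid, bisector, MOM, SOM, LOM family) are assumptions, since the abstract does not specify them, and the initial pressures are represented only by placeholder indices.

```python
from itertools import product


def split_range(u):
    """Map one controller output (0-100 %) onto two valve openings (0-100 %).

    Assumed layout: valve A covers the lower half of the output range,
    valve B the upper half.
    """
    u = min(100.0, max(0.0, float(u)))
    valve_a = min(u, 50.0) * 2.0
    valve_b = max(u - 50.0, 0.0) * 2.0
    return valve_a, valve_b


# Test matrix: 21 initial pressure conditions x 5 defuzzification strategies.
initial_pressure_ids = range(21)  # placeholder indices, not real pressures
defuzz_methods = ["centroid", "bisector", "mom", "som", "lom"]  # assumed names
scenarios = list(product(initial_pressure_ids, defuzz_methods))

print(split_range(25), split_range(75), len(scenarios))
```

A fuzzy inference step would produce the single output `u` from the pressure error; `split_range` then decides which valve acts on it.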

[AI-30] AACR-Bench: Evaluating Automatic Code Review with Holistic Repository-Level Context

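[Quick read]: This paper addresses two core problems in current evaluation benchmarks for Automated Code Review (ACR): they lack multi-language support with repository-level context, which limits how well results generalize, and they rely on noisy, incomplete annotations taken from raw Pull Request (PR) comments, which narrows the range of detectable defects. The key to the solution is AACR-Bench, a comprehensive benchmark that uses an "AI-assisted, Expert-verified" annotation pipeline, provides full cross-file context across multiple languages, and increases defect coverage by 285%, so that the real ACR performance of large language models (LLMs) can be measured more accurately.

Link: https://arxiv.org/abs/2601.19494
Authors: Lei Zhang, Yongda Yu, Minghui Yu, Xinxin Guo, Zhengqi Zhuang, Guoping Rong, Dong Shao, Haifeng Shen, Hongyu Kuang, Zhengfeng Li, Boge Wang, Guoan Zhang, Bangyu Xiang, Xiaobing Xu
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: none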

Abstract:High-quality evaluation benchmarks are pivotal for deploying Large Language Models (LLMs) in Automated Code Review (ACR). However, existing benchmarks suffer from two critical limitations: first, the lack of multi-language support in repository-level contexts, which restricts the generalizability of evaluation results; second, the reliance on noisy, incomplete ground truth derived from raw Pull Request (PR) comments, which constrains the scope of issue detection. To address these challenges, we introduce AACR-Bench, a comprehensive benchmark that provides full cross-file context across multiple programming languages. Unlike traditional datasets, AACR-Bench employs an “AI-assisted, Expert-verified” annotation pipeline to uncover latent defects often overlooked in original PRs, resulting in a 285% increase in defect coverage. Extensive evaluations of mainstream LLMs on AACR-Bench reveal that previous assessments may have either misjudged or only partially captured model capabilities due to data limitations. Our work establishes a more rigorous standard for ACR evaluation and offers new insights on LLM-based ACR, i.e., the granularity/level of context and the choice of retrieval methods significantly impact ACR performance, and this influence varies depending on the LLM, the programming language, and the LLM usage paradigm, e.g., whether an Agent architecture is employed. The code, data, and other artifacts of our evaluation set are available at this https URL.

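[AI-31] LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment

[Quick read]: This paper addresses two failure modes of safety-aligned LLMs: jailbreak (responding to harmful inputs) and over-refusal (declining benign queries). Existing vector steering methods adjust the magnitude of the answer vector v_a, but reducing jailbreaks this way increases over-refusal and vice versa, a fundamental trade-off. The paper identifies the root cause: LLMs encode the decision to answer (v_a) and the judgment of input safety (benign vector v_b) as nearly orthogonal directions and treat them as independent processes. The proposed method, LLM-VA, aligns v_a with v_b through closed-form weight updates, so that willingness to answer depends causally on the safety assessment, without fine-tuning or architectural changes. It locates the vectors at each layer with support vector machines (SVMs), selects safety-relevant layers, and aligns the vectors iteratively with minimum-norm weight modifications. On 12 LLMs it achieves 11.45% higher F1 than the best baseline while preserving 95.92% utility.

Link: https://arxiv.org/abs/2601.19487
Authors: Haonan Zhang, Dongxia Wang, Yi Liu, Kexin Chen, Wenhai Wang
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: none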

Abstract:Safety-aligned LLMs suffer from two failure modes: jailbreak (answering harmful inputs) and over-refusal (declining benign queries). Existing vector steering methods adjust the magnitude of answer vectors, but this creates a fundamental trade-off – reducing jailbreak increases over-refusal and vice versa. We identify the root cause: LLMs encode the decision to answer (answer vector v_a ) and the judgment of input safety (benign vector v_b ) as nearly orthogonal directions, treating them as independent processes. We propose LLM-VA, which aligns v_a with v_b through closed-form weight updates, making the model’s willingness to answer causally dependent on its safety assessment – without fine-tuning or architectural changes. Our method identifies vectors at each layer using SVMs, selects safety-relevant layers, and iteratively aligns vectors via minimum-norm weight modifications. Experiments on 12 LLMs demonstrate that LLM-VA achieves 11.45% higher F1 than the best baseline while preserving 95.92% utility, and automatically adapts to each model’s safety bias without manual tuning. Code and models are available at this https URL.
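
Two notions from this entry can be made concrete with toy numbers: "nearly orthogonal directions" (cosine similarity close to zero) and a closed-form, minimum-norm weight modification. The sketch uses the standard rank-one result that, for a linear map W and input x, the smallest change in Frobenius norm that makes W x equal a target t is (t - W x) xᵀ / (xᵀ x). Whether LLM-VA uses exactly this update is not stated in the abstract, so treat it as a generic illustration; all names and values are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def min_norm_update(W, x, target):
    """Return W + (target - W x) x^T / (x^T x), the minimum-norm rank-one fix."""
    y = matvec(W, x)
    xx = sum(xi * xi for xi in x)
    return [
        [w + (t - yi) * xj / xx for w, xj in zip(row, x)]
        for row, yi, t in zip(W, y, target)
    ]


v_a = [1.0, 0.0]  # toy "answer" direction
v_b = [0.0, 1.0]  # toy "benign" direction, orthogonal to v_a
print(cosine(v_a, v_b))

W = [[1.0, 0.0], [0.0, 1.0]]
x = [1.0, 2.0]
target = [3.0, 4.0]
W_new = min_norm_update(W, x, target)
print(matvec(W_new, x))
```

After the update, `W_new` maps `x` to `target` while inputs orthogonal to `x` are mapped exactly as before, which is what makes the rank-one form minimal.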

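[AI-32] Time-to-Injury Forecasting in Elite Female Football: A DeepHit Survival Approach

[Quick read]: This paper addresses the limited practical value of existing football injury prediction methods, which rely on static pre-season data and binary outcomes (injured or not). The key to the solution is to use a DeepHit neural network for time-to-injury survival modelling on longitudinal athlete monitoring data (training load, match performance, and wellness), which gives individualised risk estimates that change over time. With a multilayer perceptron backbone the model reaches a concordance index of 0.762, and Shapley Additive Explanations (SHAP) make it interpretable by identifying key risk factors consistent with clinical knowledge, providing accurate, explainable, and actionable support for injury prevention across competitive levels.

Link: https://arxiv.org/abs/2601.19479
Authors: Victoria Catterall, Cise Midoglu, Stephen Lynch
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: none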

Abstract:Injury occurrence in football poses significant challenges for athletes and teams, carrying personal, competitive, and financial consequences. While machine learning has been applied to injury prediction before, existing approaches often rely on static pre-season data and binary outcomes, limiting their real-world utility. This study investigates the feasibility of using a DeepHit neural network to forecast time-to-injury from longitudinal athlete monitoring data, while providing interpretable predictions. The analysis utilised the publicly available SoccerMon dataset, containing two seasons of training, match, and wellness records from elite female footballers. Data was pre-processed through cleaning, feature engineering, and the application of three imputation strategies. Baseline models (Random Forest, XGBoost, Logistic Regression) were optimised via grid search for benchmarking, while the DeepHit model, implemented with a multilayer perceptron backbone, was evaluated using chronological and leave-one-player-out (LOPO) validation. DeepHit achieved a concordance index of 0.762, outperforming baseline models and delivering individualised, time-varying risk estimates. Shapley Additive Explanations (SHAP) identified clinically relevant predictors consistent with established risk factors, enhancing interpretability. Overall, this study provides a novel proof of concept: survival modelling with DeepHit shows strong potential to advance injury forecasting in football, offering accurate, explainable, and actionable insights for injury prevention across competitive levels.
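
For readers unfamiliar with the concordance index quoted above, here is a minimal sketch of Harrell's C-index for right-censored data. The abstract does not say which concordance variant was used (DeepHit work often reports a time-dependent version), so this simpler form is an assumption, and ties in event time are simply treated as not comparable. The sample data are invented.

```python
def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when subject i had an observed event
    and did so strictly before subject j's event or censoring time.
    Higher risk is expected to mean an earlier event.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    return concordant / comparable if comparable else float("nan")


# Days until injury (or end of follow-up), event flags, predicted risk scores.
times = [5, 10, 15, 20]
events = [1, 0, 1, 1]
good_risks = [0.9, 0.4, 0.6, 0.1]
bad_risks = [0.1, 0.4, 0.6, 0.9]
print(concordance_index(times, events, good_risks))
print(concordance_index(times, events, bad_risks))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives some context for the reported 0.762.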

[AI-33] APC-RL: Exceeding Data-Driven Behavior Priors with Adaptive Policy Composition

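[Quick read]: This paper addresses the challenge of using demonstration data in reinforcement learning (RL). Existing methods usually assume that demonstrations are optimal and fully aligned with the target task, whereas in practice they are often sparse, suboptimal, or misaligned, which degrades performance. The key to the solution is Adaptive Policy Composition (APC), a hierarchical model that adaptively composes multiple data-driven Normalizing Flow (NF) priors. Rather than forcing adherence to any single prior, APC estimates how applicable each prior is to the target task and uses the helpful ones for exploration, refining useful priors and bypassing misaligned ones where necessary. Across a range of benchmarks it accelerates learning, stays robust under misalignment, and uses suboptimal demonstrations to bootstrap exploration without the degradation caused by following them too strictly.

Link: https://arxiv.org/abs/2601.19452
Authors: Finn Rietz, Pedro Zuidberg dos Martires, Johannes Andreas Stork
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: none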

Abstract:Incorporating demonstration data into reinforcement learning (RL) can greatly accelerate learning, but existing approaches often assume demonstrations are optimal and fully aligned with the target task. In practice, demonstrations are frequently sparse, suboptimal, or misaligned, which can degrade performance when these demonstrations are integrated into RL. We propose Adaptive Policy Composition (APC), a hierarchical model that adaptively composes multiple data-driven Normalizing Flow (NF) priors. Instead of enforcing strict adherence to the priors, APC estimates each prior’s applicability to the target task while leveraging them for exploration. Moreover, APC either refines useful priors, or sidesteps misaligned ones when necessary to optimize downstream reward. Across diverse benchmarks, APC accelerates learning when demonstrations are aligned, remains robust under severe misalignment, and leverages suboptimal demonstrations to bootstrap exploration while avoiding performance degradation caused by overly strict adherence to suboptimal demonstrations.

[AI-34] Sim-and-Human Co-training for Data-Efficient and Generalizable Robotic Manipulation

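[Quick read]: This paper targets the limited generalization of robots performing manipulation tasks in real-world scenes, which stems from two bottlenecks: the sim-to-real visual gap, which makes policies trained only on synthetic data hard to transfer to real environments, and the human-to-robot embodiment gap, which prevents pure human demonstration data from transferring directly to robot control. The key to the solution is SimHum, a co-training framework that draws a kinematic prior from robot actions in simulation and a visual prior from real-world human observations. Fusing these complementary sources markedly improves data efficiency and generalization on real-world tasks under a limited data budget.

Link: https://arxiv.org/abs/2601.19406
Authors: Kaipeng Fang,Weiqing Liang,Yuyang Li,Ji Zhang,Pengpeng Zeng,Lianli Gao,Jingkuan Song,Heng Tao Shen
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: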

Abstract:Synthetic simulation data and real-world human data provide scalable alternatives to circumvent the prohibitive costs of robot data collection. However, these sources suffer from the sim-to-real visual gap and the human-to-robot embodiment gap, respectively, which limits the policy’s generalization to real-world scenarios. In this work, we identify a natural yet underexplored complementarity between these sources: simulation offers the robot action that human data lacks, while human data provides the real-world observation that simulation struggles to render. Motivated by this insight, we present SimHum, a co-training framework to simultaneously extract kinematic prior from simulated robot actions and visual prior from real-world human observations. Based on the two complementary priors, we achieve data-efficient and generalizable robotic manipulation in real-world tasks. Empirically, SimHum outperforms the baseline by up to 40% under the same data collection budget, and achieves a 62.5% OOD success with only 80 real data, outperforming the real-only baseline by 7.1×. Videos and additional information can be found at the project website (this https URL).

[AI-35] RPO: Reinforcement Fine-Tuning with Partial Reasoning Optimization

[Quick read]: This paper addresses the substantial computational overhead of reinforcement fine-tuning (RFT) for large language models, which arises from generating a complete reasoning trajectory. The core solution is Reinforcement Fine-Tuning with Partial Reasoning Optimization (RPO). Its key idea is to use an experience cache so that only the suffix of a reasoning path is generated rather than the full sequence, cutting token generation during the rollout phase of training by about 95%. While matching the performance of conventional algorithms such as GRPO and DAPO, the method shortens training time by 90% for the 1.5B model and 72% for the 7B model.

Link: https://arxiv.org/abs/2601.19404
Authors: Hongzhu Yi,Xinming Wang,Zhenghao zhang,Tianyu Zong,Yuanxiang Wang,Jun Xie,Tao Yu,Haopeng Jin,Zhepeng Wang,Kaixin Xu,Feng Chen,Jiahuan Chen,Yujia Yang,Zhenyu Guan,Bingkang Shi,Jungang Xu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Within the domain of large language models, reinforcement fine-tuning algorithms necessitate the generation of a complete reasoning trajectory beginning from the input query, which incurs significant computational overhead during the rollout phase of training. To address this issue, we analyze the impact of different segments of the reasoning path on the correctness of the final result and, based on these insights, propose Reinforcement Fine-Tuning with Partial Reasoning Optimization (RPO), a plug-and-play reinforcement fine-tuning algorithm. Unlike traditional reinforcement fine-tuning algorithms that generate full reasoning paths, RPO trains the model by generating suffixes of the reasoning path using an experience cache. RPO reduces token generation during the rollout phase of training by approximately 95%, greatly lowering the theoretical time overhead. Compared with full-path reinforcement fine-tuning algorithms, RPO reduces the training time of the 1.5B model by 90% and the 7B model by 72%. At the same time, it can be integrated with typical algorithms such as GRPO and DAPO, enabling them to achieve training acceleration while maintaining performance comparable to the original algorithms. Our code is open-sourced at this https URL.

[AI-36] PROTEUS: SLA-Aware Routing via Lagrangian RL for Multi-LLM Serving Systems

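[Quick read]: This paper addresses LLM inference serving, where cost and quality requirements vary by customer tier, time of day, and query criticality. Conventional LLM routers require offline parameter tuning and cannot take an accuracy target directly; the relationship between their parameters and outcomes is non-monotonic and dataset-dependent, which makes service quality hard to guarantee. The key to the solution is PROTEUS (Polymorphic Router for Operational Target Enforcement with Unified SLA), which introduces Lagrangian dual control: a dual variable λ learned during training tracks constraint violations and is fed to the policy network as a conditioning input, so that an accuracy target τ specified at runtime is translated into routing decisions that satisfy it. A single model can thus serve accuracy targets across [0.85, 0.95], achieving consistent floor compliance and clearly outperforming existing baselines.

Link: https://arxiv.org/abs/2601.19402
Authors: Amit Singh Bhatti,Vishal Vaddina,Dagnachew Birru
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: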

Abstract:Production LLM deployments serve diverse workloads where cost and quality requirements vary by customer tier, time of day, and query criticality. Model serving systems accept latency SLOs directly. LLM routers do not. They force operators to tune parameters offline and guess what accuracy might result. The relationship between parameters and outcomes is indirect, non-monotonic, and dataset-dependent. Operators need to specify accuracy targets, not infer them from opaque settings. We present PROTEUS (Polymorphic Router for Operational Target Enforcement with Unified SLA), a router that accepts accuracy targets tau as runtime input. PROTEUS uses Lagrangian dual control. A learned dual variable lambda tracks constraint violations during training and conditions the policy network. This lets the router translate specified tau values into routing decisions that satisfy them. A single trained model serves the full accuracy spectrum without this http URL evaluate on RouterBench (11 models, 405K queries) and SPROUT (14 models, 45K queries). PROTEUS achieves consistent floor compliance where accuracy meets or exceeds tau. The target-response correlation reaches 0.97 to 0.98. The closest baseline, OmniRouter, meets floors only 22% of the time despite also using Lagrangian optimization. PROTEUS operates across tau in [0.85, 0.95] from a single model. On RouterBench it achieves 90.1% accuracy, within 1.3% of oracle. On SPROUT it achieves 94.0% accuracy, within 4.6% of oracle. Cost savings reach 89.8% versus the best fixed model.
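
The Lagrangian dual control idea can be illustrated with a toy sketch. This is not the PROTEUS implementation: the model table, the `lr` step size, and the function names are hypothetical, and the learned policy network is replaced here by a simple argmax over a Lagrangian score.

```python
def dual_update(lam, tau, observed_accuracy, lr=100.0):
    """Projected dual ascent: raise lambda when accuracy falls short of tau."""
    violation = tau - observed_accuracy  # positive when the floor is violated
    return max(0.0, lam + lr * violation)


def route(models, lam):
    """Pick the model that maximizes lam * accuracy - cost.

    This is the minimizer of the Lagrangian cost + lam * (tau - accuracy),
    since the tau term does not depend on the chosen model.
    """
    return max(models, key=lambda m: lam * m["acc"] - m["cost"])


# Hypothetical two-model pool (numbers are illustrative only).
MODELS = [
    {"name": "small", "acc": 0.80, "cost": 1.0},
    {"name": "large", "acc": 0.95, "cost": 10.0},
]
```

With `lam = 0` the router picks the cheapest model; as repeated violations push `lam` up, accuracy is weighted more heavily and the router switches to the more accurate model.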

[AI-37] Residual Tokens Enhance Masked Autoencoders for Speech Modeling ICASSP2026

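[Quick read]: This paper addresses the fact that current speech modeling, relying only on explicit attributes such as pitch, content, and speaker identity, cannot fully capture the richness of natural speech. The key to the solution is RT-MAE, a new masked autoencoder framework that introduces unsupervised residual trainable tokens to encode information not explained by the explicitly labeled factors (e.g., timbre variations, noise, emotion). This improves expressivity while preserving content and speaker similarity, and enables natural denoising in speech enhancement.

Link: https://arxiv.org/abs/2601.19399
Authors: Samir Sadok,Stéphane Lathuilière,Xavier Alameda-Pineda
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Submitted to ICASSP 2026 (accepted)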

Abstract:Recent speech modeling relies on explicit attributes such as pitch, content, and speaker identity, but these alone cannot capture the full richness of natural speech. We introduce RT-MAE, a novel masked autoencoder framework that augments the supervised attribute-based modeling with unsupervised residual trainable tokens, designed to encode the information not explained by explicit labeled factors (e.g., timbre variations, noise, emotion, etc.). Experiments show that RT-MAE improves reconstruction quality, preserving content and speaker similarity while enhancing expressivity. We further demonstrate its applicability to speech enhancement, removing noise at inference while maintaining controllability and naturalness.

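[AI-38] Teaching Machine Learning Fundamentals with LEGO Robotics

[Quick read]: This paper addresses the difficulty that adolescents (aged 12-17) face when first encountering machine learning: the high programming barrier and abstract nature of the concepts hinder understanding and dampen interest. The key to the solution is Machine Learning with Bricks, a visual, programming-free teaching platform built on LEGO robotics. It uses interactive visualizations to make three core algorithms concrete: K-nearest neighbors (KNN), linear regression, and Q-learning. Students learn the principles through hands-on activities such as collecting data, training models, and interacting with robots. The empirical results show significant gains in conceptual understanding, more positive perceptions of AI, and higher motivation to learn, supporting the effectiveness of a tangible, visualization-based teaching strategy for young learners.

Link: https://arxiv.org/abs/2601.19376
Authors: Viacheslav Sydora,Guner Dilsad Er,Michael Muehlebach
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 10 pages, 8 figures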

Abstract:This paper presents the web-based platform Machine Learning with Bricks and an accompanying two-day course designed to teach machine learning concepts to students aged 12 to 17 through programming-free robotics activities. Machine Learning with Bricks is an open source platform and combines interactive visualizations with LEGO robotics to teach three core algorithms: KNN, linear regression, and Q-learning. Students learn by collecting data, training models, and interacting with robots via a web-based interface. Pre- and post-surveys with 14 students demonstrate significant improvements in conceptual understanding of machine learning algorithms, positive shifts in AI perception, high platform usability, and increased motivation for continued learning. This work demonstrates that tangible, visualization-based approaches can make machine learning concepts accessible and engaging for young learners while maintaining technical depth. The platform is freely available at this https URL, with video tutorials guiding students through the experiments at this https URL.

[AI-39] Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection

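[Quick read]: This paper addresses the fact that large language models (LLMs) remain vulnerable to adversarial attacks after alignment, and in particular the limitations of existing activation steering methods: activation addition needs careful coefficient tuning and is sensitive to layer-specific norm variations, directional ablation offers only binary control, and Angular Steering, although it provides continuous control, violates norm preservation, causing distribution shift and generation collapse, especially in models below 7B parameters. The key to the solution is Selective Steering, which has two core innovations: a mathematically rigorous norm-preserving rotation formulation that keeps the activation distribution intact, and a discriminative layer selection mechanism that intervenes only at layers whose feature representations show opposite-signed class alignment. Together these give efficient, stable, and controllable adjustment of model behavior.

Link: https://arxiv.org/abs/2601.19375
Authors: Quy-Anh Dang,Chris Ngo
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: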

Abstract:Despite significant progress in alignment, large language models (LLMs) remain vulnerable to adversarial attacks that elicit harmful behaviors. Activation steering techniques offer a promising inference-time intervention approach, but existing methods suffer from critical limitations: activation addition requires careful coefficient tuning and is sensitive to layer-specific norm variations, while directional ablation provides only binary control. Recent work on Angular Steering introduces continuous control via rotation in a 2D subspace, but its practical implementation violates norm preservation, causing distribution shift and generation collapse, particularly in models below 7B parameters. We propose Selective Steering, which addresses these limitations through two key innovations: (1) a mathematically rigorous norm-preserving rotation formulation that maintains activation distribution integrity, and (2) discriminative layer selection that applies steering only where feature representations exhibit opposite-signed class alignment. Experiments across nine models demonstrate that Selective Steering achieves 5.5x higher attack success rates than prior methods while maintaining zero perplexity violations and approximately 100% capability retention on standard benchmarks. Our approach provides a principled, efficient framework for controllable and stable LLM behavior modification. Code: this https URL
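
A minimal NumPy sketch of the two ingredients named in the abstract follows. It is an illustration under stated assumptions, not the authors' code: the abstract does not give the exact formulas, so `rotate_in_plane` uses a standard orthogonal rotation inside a 2D subspace (which preserves the vector norm by construction), and `select_layers` assumes that "opposite-signed class alignment" means the per-layer mean projections of the two classes have opposite signs. All function and argument names are hypothetical.

```python
import numpy as np


def rotate_in_plane(h, b1, b2, theta):
    """Rotate activation h by angle theta inside span{b1, b2}.

    The component of h orthogonal to the plane is left unchanged, so the
    overall transform is orthogonal and ||output|| == ||h||.
    """
    # Gram-Schmidt: build an orthonormal basis (e1, e2) of the plane.
    e1 = b1 / np.linalg.norm(b1)
    b2_orth = b2 - (b2 @ e1) * e1
    e2 = b2_orth / np.linalg.norm(b2_orth)
    # Coordinates of h in the plane, and the out-of-plane residual.
    c1, c2 = h @ e1, h @ e2
    residual = h - c1 * e1 - c2 * e2
    # Standard 2D rotation of the in-plane coordinates.
    r1 = np.cos(theta) * c1 - np.sin(theta) * c2
    r2 = np.sin(theta) * c1 + np.cos(theta) * c2
    return residual + r1 * e1 + r2 * e2


def select_layers(mean_proj_class_a, mean_proj_class_b):
    """Return indices of layers where the two class means have opposite signs."""
    return [
        i
        for i, (a, b) in enumerate(zip(mean_proj_class_a, mean_proj_class_b))
        if a * b < 0
    ]
```

Because the rotation only mixes the two in-plane coordinates, the steering angle can vary continuously without changing the activation norm, which is the property the abstract says Angular Steering's practical implementation loses.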

[AI-40] Revisiting Parameter Server in LLM Post-Training ICLR’26

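[Quick read]: This paper addresses load imbalance in large language model (LLM) post-training caused by the high variance of sequence lengths. This imbalance makes collective communication, on which conventional data parallel (DP) training depends, less efficient and leaves devices under-utilized. The key to the solution is On-Demand Communication (ODC), which brings the parameter server (PS) paradigm into Fully Sharded Data Parallel (FSDP) by replacing the collective reduce-scatter and all-gather operations with point-to-point communication. This lowers the synchronization barrier from once per layer to once per minibatch and decouples the workload on each device, so faster workers are no longer blocked by slower ones, and it supports finer-grained load balancing at the minibatch level.

Link: https://arxiv.org/abs/2601.19362
Authors: Xinyi Wan,Penghui Qi,Guangxing Huang,Chaoyi Ruan,Min Lin,Jialin Li
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: Accepted in ICLR’26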

Abstract:Modern data parallel (DP) training favors collective communication over parameter servers (PS) for its simplicity and efficiency under balanced workloads. However, the balanced workload assumption no longer holds in large language model (LLM) post-training due to the high variance in sequence lengths. Under imbalanced workloads, collective communication creates synchronization barriers, leading to under-utilization of devices with smaller workloads. This change in training dynamics calls for a revisit of the PS paradigm for its robustness to such imbalance. We propose On-Demand Communication (ODC), which adapts PS into Fully Sharded Data Parallel (FSDP) by replacing collective all-gather and reduce-scatter with direct point-to-point communication. Compared to FSDP, ODC reduces the synchronization barrier from once per layer to once per minibatch and decouples the workload on each device so that faster workers are not stalled. It also enables simpler and more effective load balancing at the minibatch level. Across diverse LLM post-training tasks, ODC consistently improves device utilization and training throughput, achieving up to a 36% speedup over standard FSDP. These results demonstrate that ODC is a superior fit for the prevalent imbalanced workloads in LLM post-training. Our implementation of ODC and integration with FSDP is open-sourced at this https URL.
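
Why fewer barriers help under imbalance can be seen with a toy timing model. This is an illustrative calculation only, not the ODC implementation: it ignores communication cost entirely, and the function names and the `times[d][s]` layout (compute time of device `d` on segment `s`, where a segment is the work between two consecutive barriers) are assumptions made for the sketch.

```python
def step_time_barrier_every_segment(times):
    """All devices wait for the slowest one after every segment."""
    n_segments = len(times[0])
    return sum(max(device[s] for device in times) for s in range(n_segments))


def step_time_barrier_once(times):
    """Devices run all segments independently and sync once at the end."""
    return max(sum(device) for device in times)


# Two devices whose slow and fast segments are interleaved
# (e.g. long and short sequences landing on different devices).
IMBALANCED = [
    [1.0, 3.0],  # device 0
    [3.0, 1.0],  # device 1
]
```

In this example frequent barriers cost 3 + 3 = 6 time units, while a single barrier costs max(4, 4) = 4. In general the maximum of sums never exceeds the sum of maxima, so reducing barrier frequency cannot make this idealized step time worse.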

[AI-41] Robust Uncertainty Estimation under Distribution Shift via Difference Reconstruction

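[Quick read]: This paper addresses the lack of reliable uncertainty estimates for deep learning models in high-stakes applications such as medical imaging. Existing methods typically use the difference between an input sample and its reconstruction as a proxy for uncertainty, but this is degraded by information loss and sensitivity to superficial details, which limits performance. The key to the solution is Difference Reconstruction Uncertainty Estimation (DRUE), which reconstructs the input from two intermediate layers and uses the discrepancy between the two outputs as the uncertainty score, mitigating these limitations. Experiments show that DRUE clearly outperforms existing methods on multiple out-of-distribution datasets with distribution shift, indicating greater robustness and reliability.

Link: https://arxiv.org/abs/2601.19341
Authors: Xinran Xu,Li Rong Wang,Xiuyi Fan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: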

Abstract:Estimating uncertainty in deep learning models is critical for reliable decision-making in high-stakes applications such as medical imaging. Prior research has established that the difference between an input sample and its reconstructed version produced by an auxiliary model can serve as a useful proxy for uncertainty. However, directly comparing reconstructions with the original input is degraded by information loss and sensitivity to superficial details, which limits its effectiveness. In this work, we propose Difference Reconstruction Uncertainty Estimation (DRUE), a method that mitigates this limitation by reconstructing inputs from two intermediate layers and measuring the discrepancy between their outputs as the uncertainty score. To evaluate uncertainty estimation in practice, we follow the widely used out-of-distribution (OOD) detection paradigm, where in-distribution (ID) training data are compared against datasets with increasing domain shift. Using glaucoma detection as the ID task, we demonstrate that DRUE consistently achieves superior AUC and AUPR across multiple OOD datasets, highlighting its robustness and reliability under distribution shift. This work provides a principled and effective framework for enhancing model reliability in uncertain environments.

[AI-42] SETA: Statistical Fault Attribution for Compound AI Systems ICSE2026

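[Quick read]: This paper addresses the challenges of testing multi-network systems for robustness and safety on complex inference tasks, in particular that existing robustness testing methods designed for single-network models do not scale to multi-module pipelines. The key to the solution is a modular robustness testing framework that applies a set of perturbations to test data, supports component-wise system analysis to isolate the sources of errors, and allows reasoning about how errors propagate across neural network modules. The framework is architecture- and modality-agnostic and applies across domains; it has been validated on a real-world autonomous rail inspection system composed of multiple deep networks, where it enables fine-grained robustness analysis beyond conventional end-to-end metrics.

Link: https://arxiv.org/abs/2601.19337
Authors: Sayak Chowdhury,Meenakshi D’Souza
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: Accepted to CAIN 2026 co-hosted with ICSE 2026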

Abstract:Modern AI systems increasingly comprise multiple interconnected neural networks to tackle complex inference tasks. Testing such systems for robustness and safety entails significant challenges. Current state-of-the-art robustness testing techniques, whether black-box or white-box, have been proposed and implemented for single-network models and do not scale well to multi-network pipelines. We propose a modular robustness testing framework that applies a given set of perturbations to test data. Our testing framework supports (1) a component-wise system analysis to isolate errors and (2) reasoning about error propagation across the neural network modules. The testing framework is architecture and modality agnostic and can be applied across domains. We apply the framework to a real-world autonomous rail inspection system composed of multiple deep networks and successfully demonstrate how our approach enables fine-grained robustness analysis beyond conventional end-to-end metrics.

[AI-43] From Observations to Events: Event-Aware World Model for Reinforcement Learning ICLR2026

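[Quick read]: This paper addresses two weaknesses of model-based reinforcement learning (MBRL): poor generalization across structurally similar scenes and sensitivity to spurious features such as texture or color changes. The core solution is the Event-Aware World Model (EAWM). Its key idea is to derive events from raw observations with an automated event generator and to identify event boundaries with a Generic Event Segmentor (GES), which shapes a representation space oriented toward meaningful spatio-temporal transitions and makes policy learning more efficient without manual labels.

Link: https://arxiv.org/abs/2601.19336
Authors: Zhao-Han Peng,Shaohui Li,Zhi Li,Shulan Ruan,Yu Liu,You He
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 43 pages, accepted by ICLR 2026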

Abstract:While model-based reinforcement learning (MBRL) improves sample efficiency by learning world models from raw observations, existing methods struggle to generalize across structurally similar scenes and remain vulnerable to spurious variations such as textures or color shifts. From a cognitive science perspective, humans segment continuous sensory streams into discrete events and rely on these key events for decision-making. Motivated by this principle, we propose the Event-Aware World Model (EAWM), a general framework that learns event-aware representations to streamline policy learning without requiring handcrafted labels. EAWM employs an automated event generator to derive events from raw observations and introduces a Generic Event Segmentor (GES) to identify event boundaries, which mark the start and end time of event segments. Through event prediction, the representation space is shaped to capture meaningful spatio-temporal transitions. Beyond this, we present a unified formulation of seemingly distinct world model architectures and show the broad applicability of our methods. Experiments on Atari 100K, Craftax 1M, DeepMind Control 500K, and DMC-GB2 500K demonstrate that EAWM consistently boosts the performance of strong MBRL baselines by 10%-45%, setting new state-of-the-art results across benchmarks. Our code is released at this https URL.

[AI-44] StableQAT: Stable Quantization-Aware Training at Ultra-Low Bitwidths

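[Quick read]: This paper addresses the optimization difficulties of quantization-aware training (QAT) at very low bitwidths (2-4 bits), which arise from gradient mismatch, training instability, or excessive computational overhead. Existing approaches based on the straight-through estimator (STE) or soft quantizers often suffer from distorted gradient signals and poor convergence. The key to the solution is StableQAT, a unified and efficient QAT framework that uses a discrete Fourier analysis of the rounding operator to construct a novel, lightweight, and theoretically grounded surrogate for backpropagation. The surrogate strictly generalizes STE (which is a special case) and yields smooth, bounded, and inexpensive gradients, markedly improving training stability and performance in ultra-low-bit settings.

Link: https://arxiv.org/abs/2601.19320
Authors: Tianyi Chen,Sihan Chen,Xiaoyi Qu,Dan Zhao,Ruomei Yan,Jongwoo Ko,Luming Liang,Pashmina Cameron
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: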

Abstract:Quantization-aware training (QAT) is essential for deploying large models under strict memory and latency constraints, yet achieving stable and robust optimization at ultra-low bitwidths remains challenging. Common approaches based on the straight-through estimator (STE) or soft quantizers often suffer from gradient mismatch, instability, or high computational overhead. As such, we propose StableQAT, a unified and efficient QAT framework that stabilizes training in ultra low-bit settings via a novel, lightweight, and theoretically grounded surrogate for backpropagation derived from a discrete Fourier analysis of the rounding operator. StableQAT strictly generalizes STE as the latter arises as a special case of our more expressive surrogate family, yielding smooth, bounded, and inexpensive gradients that improve QAT training performance and stability across various hyperparameter choices. In experiments, StableQAT exhibits stable and efficient QAT at 2-4 bit regimes, demonstrating improved training stability, robustness, and superior performance with negligible training overhead against standard QAT techniques. Our code is available at this https URL.
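
The abstract does not spell out the surrogate, so the sketch below is only a guess at the flavor of the idea, not StableQAT itself. It uses the textbook Fourier series of the sawtooth x - round(x) = Σ_{k≥1} (-1)^{k+1} sin(2πkx)/(πk), truncates it to `n_terms` terms, and differentiates the result. With zero terms the gradient is identically 1, which is exactly the straight-through estimator. The function names and the truncation scheme are assumptions.

```python
import numpy as np


def fourier_round(x, n_terms):
    """Truncated Fourier-series approximation of round(x)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    k = np.arange(1, n_terms + 1)
    signs = (-1.0) ** (k + 1)
    # Partial sum of the sawtooth x - round(x).
    sawtooth = (signs * np.sin(2 * np.pi * x[:, None] * k) / (np.pi * k)).sum(axis=-1)
    return x - sawtooth


def fourier_round_grad(x, n_terms):
    """Derivative of fourier_round with respect to x.

    n_terms = 0 gives a constant gradient of 1, i.e. the STE.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    k = np.arange(1, n_terms + 1)
    signs = (-1.0) ** (k + 1)
    return 1.0 - 2.0 * (signs * np.cos(2 * np.pi * x[:, None] * k)).sum(axis=-1)
```

In a QAT setting the forward pass would still use the true `round`, and a gradient like `fourier_round_grad` would replace the constant-1 backward pass of STE. How the paper damps or bounds the higher-order terms is not stated in the abstract and is not modeled here.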

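[AI-45] Balancing Sustainability And Performance: The Role Of Small-Scale LLMs In Agentic Artificial Intelligence Systems

[Quick read]: This paper addresses the high energy consumption of large language model (LLM) inference in agentic AI systems and the sustainability challenge it poses. The core contribution is to verify that smaller open-weights models can substantially reduce energy consumption without sacrificing responsiveness or output quality. The key novelty is a comparative analysis across multiple models that quantifies the trade-off between efficiency and performance, from which the authors derive practical guidelines for sustainable AI design, including optimal batch size configuration and compute resource allocation, giving empirical evidence and a practical path toward environmentally responsible, scalable AI systems.

Link: https://arxiv.org/abs/2601.19311
Authors: Anh Khoa Ngo Ho,Martin Chauvin,Simon Gosset,Philippe Cordier,Boris Gamazaychikov
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: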

Abstract:As large language models become integral to agentic artificial intelligence systems, their energy demands during inference may pose significant sustainability challenges. This study investigates whether deploying smaller-scale language models can reduce energy consumption without compromising responsiveness and output quality in multi-agent, real-world environments. We conduct a comparative analysis across language models of varying scales to quantify trade-offs between efficiency and performance. Results show that smaller open-weights models can lower energy usage while preserving task quality. Building on these findings, we propose practical guidelines for sustainable artificial intelligence design, including optimal batch size configuration and computation resource allocation. These insights offer actionable strategies for developing scalable, environmentally responsible artificial intelligence systems.

[AI-46] Curiosity Driven Knowledge Retrieval for Mobile Agents

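[Quick read]: This paper addresses the limited performance of mobile agents in complex applications caused by incomplete knowledge and weak generalization. The core solution is a curiosity-driven knowledge retrieval framework that formalizes uncertainty during execution as a curiosity score. When the score exceeds a threshold, the system retrieves external information from documentation, code repositories, and historical trajectories and organizes it into structured AppCards, which encode functional semantics, parameter conventions, interface mappings, and interaction patterns. The agent then selectively integrates relevant AppCards into its reasoning, compensating for knowledge blind spots and improving planning reliability.

Link: https://arxiv.org/abs/2601.19306
Authors: Sijia Li,Xiaoyu Tan,Shahir Ali,Niels Schmidt,Gengchen Ma,Xihe Qiu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: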

Abstract:Mobile agents have made progress toward reliable smartphone automation, yet performance in complex applications remains limited by incomplete knowledge and weak generalization to unseen environments. We introduce a curiosity driven knowledge retrieval framework that formalizes uncertainty during execution as a curiosity score. When this score exceeds a threshold, the system retrieves external information from documentation, code repositories, and historical trajectories. Retrieved content is organized into structured AppCards, which encode functional semantics, parameter conventions, interface mappings, and interaction patterns. During execution, an enhanced agent selectively integrates relevant AppCards into its reasoning process, thereby compensating for knowledge blind spots and improving planning reliability. Evaluation on the AndroidWorld benchmark shows consistent improvements across backbones, with an average gain of six percentage points and a new state of the art success rate of 88.8% when combined with GPT-5. Analysis indicates that AppCards are particularly effective for multi step and cross application tasks, while improvements depend on the backbone model. Case studies further confirm that AppCards reduce ambiguity, shorten exploration, and support stable execution trajectories. Task trajectories are publicly available at this https URL.
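
The threshold-triggered retrieval described above might look like the following sketch. The abstract does not define the curiosity score, so normalized Shannon entropy of the agent's action distribution is used here purely as a stand-in; the function names and the default threshold of 0.5 are likewise hypothetical.

```python
import math


def curiosity_score(action_probs):
    """Normalized Shannon entropy in [0, 1] of an action distribution.

    Assumed definition for illustration; the paper may use a different one.
    """
    n = len(action_probs)
    if n < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in action_probs if p > 0)
    return entropy / math.log(n)


def should_retrieve(action_probs, threshold=0.5):
    """Trigger external knowledge retrieval when uncertainty is high."""
    return curiosity_score(action_probs) > threshold
```

A confident agent (one dominant action) scores near 0 and proceeds without retrieval, while a near-uniform distribution scores near 1 and would trigger a lookup of relevant AppCards.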

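[AI-47] Talos: Optimizing Top-K Accuracy in Recommender Systems WWW’26

[Quick read]: This paper addresses two core challenges in optimizing Top-K accuracy in recommender systems (RS). First, metrics such as Precision@K and Recall@K depend on exact ranking positions, which is computationally expensive and hard to optimize. Second, recommender systems are prone to distribution shifts caused by evolving user preferences or data biases, which further undermines model stability and robustness. The key to the solution is Talos, a new loss function whose central idea is a quantile technique that turns complex ranking-dependent operations into simple comparisons between predicted scores and learned thresholds, greatly reducing optimization complexity. Talos also uses a sampling-based regression algorithm to estimate the thresholds efficiently, adds a constraint term to prevent score inflation, and employs a tailored surrogate function to handle discontinuity and strengthen robustness to distribution shift.

Link: https://arxiv.org/abs/2601.19276
Authors: Shengjia Zhang,Weiqin Yang,Jiawei Chen,Peng Wu,Yuegang Sun,Gang Wang,Qihao Shi,Can Wang
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by WWW’26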

Abstract:Recommender systems (RS) aim to retrieve a small set of items that best match individual user preferences. Naturally, RS place primary emphasis on the quality of the Top-K results rather than performance across the entire item set. However, estimating Top-K accuracy (e.g., Precision@K, Recall@K) requires determining the ranking positions of items, which imposes substantial computational overhead and poses significant challenges for optimization. In addition, RS often suffer from distribution shifts due to evolving user preferences or data biases, further complicating the task. To address these issues, we propose Talos, a loss function that is specifically designed to optimize Top-K recommendation accuracy. Talos leverages a quantile technique that replaces complex ranking-dependent operations with simpler comparisons between predicted scores and learned score thresholds. We further develop a sampling-based regression algorithm for efficient and accurate threshold estimation, and introduce a constraint term to maintain optimization stability by preventing score inflation. Additionally, we incorporate a tailored surrogate function to address discontinuity and enhance robustness against distribution shifts. Comprehensive theoretical analyses and empirical experiments are conducted to demonstrate the effectiveness, efficiency, convergence, and distributional robustness of Talos. The code is available at this https URL.
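
The quantile idea can be checked on a small example: an item is in the Top-K exactly when its score is at least the K-th largest score, so a ranking operation can be replaced by a threshold comparison. The sketch below only demonstrates that equivalence with NumPy and assumes distinct scores. It is not the Talos loss, which learns the thresholds and adds a surrogate and a constraint term; the function names are hypothetical.

```python
import numpy as np


def precision_at_k_by_rank(scores, labels, k):
    """Precision@K computed the usual way: sort, then take the top K."""
    top_k = np.argsort(-scores)[:k]
    return labels[top_k].sum() / k


def precision_at_k_by_threshold(scores, labels, k):
    """Precision@K computed by comparing scores to a quantile threshold.

    The K-th largest score plays the role of the threshold; no full
    ranking is needed once the threshold is known. Assumes no tied scores.
    """
    threshold = np.partition(scores, -k)[-k]  # K-th largest score
    in_top_k = scores >= threshold
    return labels[in_top_k].sum() / k
```

Once the threshold is treated as a quantity to be estimated rather than computed by sorting, each item's contribution depends only on a comparison between its own score and that threshold.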

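[AI-48] Tactile Memory with Soft Robot: Robust Object Insertion via Masked Encoding and Soft Wrist

[Quick read]: This paper addresses robots' lack of effective tactile memory and adaptability in contact-rich tasks (such as peg-in-hole insertion under uncertainty), i.e., how to use past tactile experience for safe, robust, and flexible manipulation. The key to the solution is the Tactile Memory with Soft Robot (TaMeSo-bot) system, whose core is the Masked Tactile Trajectory Transformer (MAT³). Through masked-token prediction, the model jointly captures spatiotemporal interactions among robot actions, distributed tactile feedback, force-torque measurements, and proprioceptive signals, so that it autonomously extracts task-relevant features and reconstructs missing sensory information, adapting to unseen scenarios without explicit subtask segmentation.

Link: https://arxiv.org/abs/2601.19275
Authors: Tatsuya Kamijo,Mai Nishimura,Cristian C. Beltran-Hernandez,Nodoka Shibasaki,Masashi Hamaya
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication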

Abstract:Tactile memory, the ability to store and retrieve touch-based experience, is critical for contact-rich tasks such as key insertion under uncertainty. To replicate this capability, we introduce Tactile Memory with Soft Robot (TaMeSo-bot), a system that integrates a soft wrist with tactile retrieval-based control to enable safe and robust manipulation. The soft wrist allows safe contact exploration during data collection, while tactile memory reuses past demonstrations via retrieval for flexible adaptation to unseen scenarios. The core of this system is the Masked Tactile Trajectory Transformer (MAT³), which jointly models spatiotemporal interactions between robot actions, distributed tactile feedback, force-torque measurements, and proprioceptive signals. Through masked-token prediction, MAT³ learns rich spatiotemporal representations by inferring missing sensory information from context, autonomously extracting task-relevant features without explicit subtask segmentation. We validate our approach on peg-in-hole tasks with diverse pegs and conditions in real-robot experiments. Our extensive evaluation demonstrates that MAT³ achieves higher success rates than the baselines over all conditions and shows remarkable capability to adapt to unseen pegs and conditions.

[AI-49] Decoupled Split Learning via Auxiliary Loss

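[Quick read]: This paper addresses the high communication overhead and large memory footprint that end-to-end backpropagation causes in conventional split learning. The core solution is a beyond-backpropagation training method: the client adds a small auxiliary classifier at its split point to provide a local error signal, so gradients no longer need to be sent back from the server, while the server trains on the intermediate activations transmitted by the client using the true loss function. This decoupling lets client and server train their model partitions semi-independently, cutting communication by 50% (only forward activations are transmitted, not gradients) and peak memory usage by up to 58%.

Link: https://arxiv.org/abs/2601.19261
Authors: Anower Zihad,Felix Owino,Haibo Yang,Ming Tang,Chao Huang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: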

Abstract:Split learning is a distributed training paradigm where a neural network is partitioned between clients and a server, which allows data to remain at the client while only intermediate activations are shared. Traditional split learning relies on end-to-end backpropagation across the client-server split point. This incurs a large communication overhead (i.e., forward activations and backward gradients need to be exchanged every iteration) and significant memory use (for storing activations and gradients). In this paper, we develop a beyond-backpropagation training method for split learning. In this approach, the client and server train their model partitions semi-independently, using local loss signals instead of propagated gradients. In particular, the client’s network is augmented with a small auxiliary classifier at the split point to provide a local error signal, while the server trains on the client’s transmitted activations using the true loss function. This decoupling removes the need to send backward gradients, which cuts communication costs roughly in half and also reduces memory overhead (as each side only stores local activations for its own backward pass). We evaluate our approach on CIFAR-10 and CIFAR-100. Our experiments show two key results. First, the proposed approach achieves performance on par with standard split learning that uses backpropagation. Second, it significantly reduces communication (for transmitting activations and gradients) by 50% and peak memory usage by up to 58%.
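
The "roughly in half" communication claim follows from simple accounting, sketched below. The function name, the parameters, and the 4-byte value size are assumptions for illustration; the calculation assumes the backward gradient at the split point has the same shape as the forward activation, which holds for standard backpropagation.

```python
def bytes_per_iteration(batch_size, activation_dim, send_gradients, bytes_per_value=4):
    """Bytes crossing the client-server split point in one training iteration."""
    forward = batch_size * activation_dim * bytes_per_value
    # The gradient w.r.t. the activations has the same shape as the activations.
    backward = forward if send_gradients else 0
    return forward + backward
```

For example, with a batch of 128 and a 4096-dimensional split activation, standard split learning moves 4 MiB per iteration, while the decoupled variant moves 2 MiB because only the forward activations are sent.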

[AI-50] GhostUI: Unveiling Hidden Interactions in Mobile UI

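[Quick Read]: This paper addresses two problems: hidden interactions in mobile apps are hard for users to discover, and mobile agents cannot reliably identify and handle them when automating tasks. The core challenge is that current mobile agents based on vision language models (VLMs) lack an understanding of hidden gestures (such as long presses and swipes) and the interface state changes they trigger. The key to the solution is the GhostUI dataset, which contains before-and-after screenshots, simplified view hierarchies, gesture metadata, and task descriptions. This gives VLMs enough context to learn hidden interaction patterns and accurately predict post-interaction interface states, improving the robustness and generalization of mobile task automation.

Link: https://arxiv.org/abs/2601.19258
Authors: Minkyu Kweon,Seokhyeon Park,Soohyun Lee,You Been Lee,Jeongmin Rhee,Jinwook Seo
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted at ACM CHI Conference on Human Factors in Computing Systems (CHI '26)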

Abstract:Modern mobile applications rely on hidden interactions–gestures without visual cues like long presses and swipes–to provide functionality without cluttering interfaces. While experienced users may discover these interactions through prior use or onboarding tutorials, their implicit nature makes them difficult for most users to uncover. Similarly, mobile agents–systems designed to automate tasks on mobile user interfaces, powered by vision language models (VLMs)–struggle to detect veiled interactions or determine actions for completing tasks. To address this challenge, we present GhostUI, a new dataset designed to enable the detection of hidden interactions in mobile applications. GhostUI provides before-and-after screenshots, simplified view hierarchies, gesture metadata, and task descriptions, allowing VLMs to better recognize concealed gestures and anticipate post-interaction states. Quantitative evaluations with VLMs show that models fine-tuned on GhostUI outperform baseline VLMs, particularly in predicting hidden interactions and inferring post-interaction screens, underscoring GhostUI’s potential as a foundation for advancing mobile task automation.

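[AI-51] LLM-Assisted Logic Rule Learning: Scaling Human Expertise for Time Series Anomaly Detection

[Quick Read]: This paper addresses two challenges in time series anomaly detection for supply chains: traditional unsupervised methods can mine data patterns, but their results are often inconsistent with business requirements and domain knowledge; and manual expert analysis cannot scale to supply chains with millions of products. The key to the solution is to use large language models (LLMs) to systematically encode human expert knowledge into interpretable, logic-based rules, yielding accurate and understandable anomaly detection. The approach has three stages: LLM-based labeling of training data guided by domain knowledge; automated generation and iterative improvement of symbolic rules through LLM-driven optimization; and rule augmentation with business-relevant anomaly categories, supported by LLMs, to improve interpretability. The method outperforms unsupervised learning methods in both detection accuracy and interpretability and, compared with deploying an LLM directly for anomaly detection, gives deterministic, low-latency, low-cost results that are better suited to production.

Link: https://arxiv.org/abs/2601.19255
Authors: Haoting Zhang,Shekhar Jain
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: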

Abstract:Time series anomaly detection is critical for supply chain management to take proactive operations, but faces challenges: classical unsupervised anomaly detection based on exploiting data patterns often yields results misaligned with business requirements and domain knowledge, while manual expert analysis cannot scale to millions of products in the supply chain. We propose a framework that leverages large language models (LLMs) to systematically encode human expertise into interpretable, logic-based rules for detecting anomaly patterns in supply chain time series data. Our approach operates in three stages: 1) LLM-based labeling of training data instructed by domain knowledge, 2) automated generation and iterative improvements of symbolic rules through LLM-driven optimization, and 3) rule augmentation with business-relevant anomaly categories supported by LLMs to enhance interpretability. The experiment results showcase that our approach outperforms the unsupervised learning methods in both detection accuracy and interpretability. Furthermore, compared to direct LLM deployment for time series anomaly detection, our approach provides consistent, deterministic results with low computational latency and cost, making it ideal for production deployment. The proposed framework thus demonstrates how LLMs can bridge the gap between scalable automation and expert-driven decision-making in operational settings.
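
To make the idea of an interpretable, deterministic, logic-based rule concrete, here is a minimal sketch of what one such symbolic rule could look like once learned. The rule, its thresholds, and the demand series are hypothetical and are not taken from the paper.

```python
from statistics import mean

def sudden_drop_rule(series, window=4, drop_ratio=0.5):
    """Flag index t when the value falls below drop_ratio times the mean
    of the previous `window` points. Thresholds are hypothetical."""
    flagged = []
    for t in range(window, len(series)):
        baseline = mean(series[t - window:t])
        if baseline > 0 and series[t] < drop_ratio * baseline:
            flagged.append(t)
    return flagged

# Made-up weekly demand for one product.
weekly_demand = [100, 98, 103, 101, 40, 99, 102]
print(sudden_drop_rule(weekly_demand))
```

A rule of this kind runs without any LLM call at inference time, which is consistent with the low latency and determinism the abstract emphasizes.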

[AI-52] GLOVE: Global Verifier for LLM Memory-Environment Realignment

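[Quick Read]: This paper addresses the performance degradation of existing memory-enhanced large language models (LLMs) in dynamic environments, where memory validity cannot be reliably verified. Traditional methods usually rely on external evaluators that provide task-specific success signals, or on internal model cognition (such as reflection) to edit memory entries, but these assumptions often fail in practical settings with environmental drift. The core of the proposed solution is the Global Verifier (GLOVE), whose key innovation is a relative notion of truth: active probing detects inconsistencies between retrieved memories and fresh observations, so that memory can be realigned with the environment and updated without ground-truth supervision or strong reliance on model introspection. Experiments show that GLOVE substantially improves agent success rates on diverse benchmarks with controlled environmental drift, offering a robust path toward self-evolving cognitive agents.

Link: https://arxiv.org/abs/2601.19249
Authors: Xingkun Yin,Hongyang Du
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: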

Abstract:Most existing memory-enhanced Large Language Model (LLM) approaches implicitly assume that memory validity can be established either through external evaluators that provide task-specific success signals or through internal model cognition, such as reflection, for editing memory entries. However, these assumptions often break down in practical environments with dynamic drifts. We propose the Global Verifier (GLOVE), a framework that introduces a new design dimension for LLM memory systems by establishing a relative notion of truth. Through active probing to detect inconsistencies between retrieved memories and fresh observations, GLOVE enables memory-environment realignment by verifying and updating memory without access to ground-truth supervision or strong reliance on model introspection. We evaluate GLOVE on diverse benchmarks spanning web navigation, planning, and control, augmented with controlled environmental drifts that introduce non-stationarity beyond the original benchmark settings. Our results show that GLOVE substantially improves agent success rates, suggesting a robust pathway to cognitive agents capable of self-evolving.

[AI-53] Beyond In-Domain Detection: SpikeScore for Cross-Domain Hallucination Detection

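[Quick Read]: This paper addresses cross-domain hallucination detection for generative AI in real-world applications: existing methods perform well when training and test data come from the same domain but generalize poorly across domains. The key to the solution is a general phenomenon that the authors identify and exploit: multi-turn dialogues initiated by a hallucination show larger uncertainty fluctuations than factual ones. Based on this, they propose a new metric, SpikeScore, which quantifies abrupt uncertainty fluctuations in multi-turn dialogues, and show through theoretical analysis and empirical validation that it separates hallucinated from non-hallucinated responses across domains, substantially improving the cross-domain generalization of detectors.

Link: https://arxiv.org/abs/2601.19245
Authors: Yongxin Deng,Zhen Fang,Yixuan Li,Ling Chen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: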

Abstract:Hallucination detection is critical for deploying large language models (LLMs) in real-world applications. Existing hallucination detection methods achieve strong performance when the training and test data come from the same domain, but they suffer from poor cross-domain generalization. In this paper, we study an important yet overlooked problem, termed generalizable hallucination detection (GHD), which aims to train hallucination detectors on data from a single domain while ensuring robust performance across diverse related domains. In studying GHD, we simulate multi-turn dialogues following the LLM's initial response and observe an interesting phenomenon: hallucination-initiated multi-turn dialogues universally exhibit larger uncertainty fluctuations than factual ones across different domains. Based on this phenomenon, we propose a new score, SpikeScore, which quantifies abrupt fluctuations in multi-turn dialogues. Through both theoretical analysis and empirical validation, we demonstrate that SpikeScore achieves strong cross-domain separability between hallucinated and non-hallucinated responses. Experiments across multiple LLMs and benchmarks demonstrate that the SpikeScore-based detection method outperforms representative baselines in cross-domain generalization and surpasses advanced generalization-oriented methods, verifying the effectiveness of our method in cross-domain hallucination detection.
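
The abstract does not give the formula for SpikeScore, so the following is only a stand-in that illustrates what "quantifying abrupt fluctuations" across dialogue turns might mean. The function `spike_score` and the per-turn uncertainty values are hypothetical; the paper's actual definition may differ.

```python
def spike_score(uncertainties):
    """Largest absolute change in uncertainty between consecutive turns.
    An illustrative stand-in, not the paper's definition."""
    if len(uncertainties) < 2:
        return 0.0
    return max(abs(b - a) for a, b in zip(uncertainties, uncertainties[1:]))

# Made-up per-turn uncertainty traces for two simulated dialogues.
factual = [0.20, 0.22, 0.19, 0.21]
hallucinated = [0.25, 0.70, 0.30, 0.65]
print(spike_score(factual) < spike_score(hallucinated))
```

A detector built on such a score would threshold it, with the threshold fit on a single source domain.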

[AI-54] Structure-based RNA Design by Step-wise Optimization of Latent Diffusion Model AAAI2026

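[Quick Read]: This paper addresses insufficient optimization of structural objectives in RNA inverse folding. Existing methods focus mostly on sequence recovery and cannot precisely control non-differentiable structural metrics such as secondary structure consistency (SS), minimum free energy (MFE), and local distance difference test (LDDT), so the designed RNA sequences fall short in structural accuracy. The key to the solution is a new framework that combines a latent diffusion model (LDM) with reinforcement learning (RL): Step-wise Optimization of Latent Diffusion Model (SOLD). Its core innovation is to use policy-driven reward optimization from RL to fine-tune single-step noise efficiently without sampling the full diffusion trajectory, which markedly improves joint optimization of multiple structural objectives. Experiments show that the method outperforms the baseline model and current state-of-the-art methods on all metrics.

Link: https://arxiv.org/abs/2601.19232
Authors: Qi Si,Xuyang Liu,Penglei Wang,Xin Guo,Yuan Qi,Yuan Cheng
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages (7 pages content + 2 pages references + 11 pages appendix), 11 figures, 8 tables. Source code available at this https URL. Accepted to AAAI 2026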

Abstract:RNA inverse folding, designing sequences to form specific 3D structures, is critical for therapeutics, gene regulation, and synthetic biology. Current methods, focused on sequence recovery, struggle to address structural objectives like secondary structure consistency (SS), minimum free energy (MFE), and local distance difference test (LDDT), leading to suboptimal structural accuracy. To tackle this, we propose a reinforcement learning (RL) framework integrated with a latent diffusion model (LDM). Drawing inspiration from the success of diffusion models in RNA inverse folding, which adeptly model complex sequence-structure interactions, we develop an LDM incorporating pre-trained RNA-FM embeddings from a large-scale RNA model. These embeddings capture co-evolutionary patterns, markedly improving sequence recovery accuracy. However, existing approaches, including diffusion-based methods, cannot effectively handle non-differentiable structural objectives. By contrast, RL excels in this task by using policy-driven reward optimization to navigate complex, non-gradient-based objectives, offering a significant advantage over traditional methods. In summary, we propose the Step-wise Optimization of Latent Diffusion Model (SOLD), a novel RL framework that optimizes single-step noise without sampling the full diffusion trajectory, achieving efficient refinement of multiple structural objectives. Experimental results demonstrate SOLD surpasses its LDM baseline and state-of-the-art methods across all metrics, establishing a robust framework for RNA inverse folding with profound implications for biotechnological and therapeutic applications.

[AI-55] MAGNET: Towards Adaptive GUI Agents with Memory-Driven Knowledge Evolution

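[Quick Read]: This paper addresses task failures of mobile GUI agents when frequent updates change UI appearance and reorganize workflows, leaving the training data mismatched with the current environment. The key to the solution is MAGNET, a memory-driven adaptive agent framework with a dual-level memory: stationary memory maps diverse visual features to stable functional semantics for robust action grounding, and procedural memory captures stable task intents across different workflows. A dynamic memory evolution mechanism continuously refines both memories and prioritizes frequently accessed knowledge, improving agent performance and generalization in evolving software environments.

Link: https://arxiv.org/abs/2601.19199
Authors: Libo Sun,Jiwen Zhang,Siyuan Wang,Zhongyu Wei
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: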

Abstract:Mobile GUI agents powered by large foundation models enable autonomous task execution, but frequent updates altering UI appearance and reorganizing workflows cause agents trained on historical data to fail. Despite surface changes, functional semantics and task intents remain fundamentally stable. Building on this insight, we introduce MAGNET, a memory-driven adaptive agent framework with dual-level memory: stationary memory linking diverse visual features to stable functional semantics for robust action grounding and procedural memory capturing stable task intents across varying workflows. We propose a dynamic memory evolution mechanism that continuously refines both memories by prioritizing frequently accessed knowledge. Online benchmark AndroidWorld evaluations show substantial improvements over baselines, while offline benchmarks confirm consistent gains under distribution shifts. These results validate that leveraging stable structures across interface changes improves agent performance and generalization in evolving software environments.

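[AI-56] HELM: A Human-Centered Evaluation Framework for LLM-Powered Recommender Systems

[Quick Read]: This paper addresses the over-reliance of current recommender system evaluation on traditional accuracy metrics (such as precision and recall), which neglects user-centered dimensions such as intent alignment, explanation quality, interaction naturalness, trust and transparency, and fairness and diversity. The core solution is HELM, a comprehensive evaluation framework that systematically assesses generative-AI-powered recommender systems along five key human-centered dimensions. Its effectiveness is validated by a large-scale expert evaluation (847 recommendation scenarios) and cross-domain experiments (movies, books, restaurants), revealing quality dimensions of the user experience that traditional metrics cannot capture.

Link: https://arxiv.org/abs/2601.19197
Authors: Sushant Mehta
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: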

Abstract:The integration of Large Language Models (LLMs) into recommendation systems has introduced unprecedented capabilities for natural language understanding, explanation generation, and conversational interactions. However, existing evaluation methodologies focus predominantly on traditional accuracy metrics, failing to capture the multifaceted human-centered qualities that determine the real-world user experience. We introduce HELM (Human-centered Evaluation for LLM-powered recoMmenders), a comprehensive evaluation framework that systematically assesses LLM-powered recommender systems across five human-centered dimensions: Intent Alignment, Explanation Quality, Interaction Naturalness, Trust & Transparency, and Fairness & Diversity. Through extensive experiments involving three state-of-the-art LLM-based recommenders (GPT-4, LLaMA-3.1, and P5) across three domains (movies, books, and restaurants), and rigorous evaluation by 12 domain experts using 847 recommendation scenarios, we demonstrate that HELM reveals critical quality dimensions invisible to traditional metrics. Our results show that while GPT-4 achieves superior explanation quality (4.21/5.0) and interaction naturalness (4.35/5.0), it exhibits a significant popularity bias (Gini coefficient 0.73) compared to traditional collaborative filtering (0.58). We release HELM as an open-source toolkit to advance human-centered evaluation practices in the recommender systems community.
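
The popularity-bias figures above are Gini coefficients. As a reference point, here is a minimal sketch of the standard Gini computation over per-item exposure counts. How the paper aggregates exposure is not stated in the abstract, so the input format here is an assumption.

```python
def gini(counts):
    """Gini coefficient of non-negative exposure counts.
    0 means perfectly even exposure; values near 1 mean a few items dominate."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over ascending values, with ranks starting at 1.
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

print(gini([5, 5, 5, 5]))   # even exposure
print(gini([0, 0, 0, 1]))   # one item receives all exposure
```

With n items, the maximum attainable value is (n - 1) / n, so scores from catalogs of very different sizes are not directly comparable.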

[AI-57] CoReTab: Improving Multimodal Table Understanding with Code-driven Reasoning EACL’26

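[Quick Read]: This paper addresses the lack of explicit multi-step reasoning supervision in existing multimodal table understanding datasets (such as MMTab), which leads models to generate short answers with insufficient accuracy and weak interpretability. The key to the solution is the CoReTab framework, which couples multi-step reasoning with executable Python code to produce scalable, interpretable, and automatically verifiable annotations, improving reasoning ability and transparency on tasks such as table question answering, fact verification, and table structure understanding.

Link: https://arxiv.org/abs/2601.19193
Authors: Van-Quang Nguyen,Takayuki Okatani
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: accepted to EACL’26 (main conference)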

Abstract:Existing datasets for multimodal table understanding, such as MMTab, primarily provide short factual answers without explicit multi-step reasoning supervision. Models trained on these datasets often generate brief responses that offer insufficient accuracy and limited interpretability regarding how the models arrive at the final answer. We introduce CoReTab, a code-driven reasoning framework that produces scalable, interpretable, and automatically verifiable annotations by coupling multi-step reasoning with executable Python code. Using the CoReTab framework, we curate a dataset of 115K verified samples averaging 529 tokens per response and fine-tune open-source MLLMs through a three-stage pipeline. We evaluate the resulting model trained on CoReTab across 17 MMTab benchmarks spanning table question answering, fact verification, and table structure understanding. Our model achieves significant gains of +6.2%, +5.7%, and +25.6%, respectively, over MMTab-trained baselines, while producing transparent and verifiable reasoning traces. These results establish CoReTab as a robust and generalizable supervision framework for improving multi-step reasoning in multimodal table understanding.
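
One way to read "automatically verifiable annotations" is that each annotation carries code whose output can be checked against the gold answer. The sketch below illustrates that idea only. The `answer` variable convention, the helper name, and the toy table are hypothetical, and the paper's actual verification pipeline is not described in the abstract.

```python
def verify_annotation(code, table, expected):
    """Run annotation code against the table and compare its `answer`
    variable with the expected label. Only run trusted code this way."""
    namespace = {"table": table}
    try:
        exec(code, namespace)
    except Exception:
        return False  # Code that fails to execute is rejected.
    return namespace.get("answer") == expected

# Toy table and annotation code, made up for illustration.
table = [
    {"item": "A", "sales": 120},
    {"item": "B", "sales": 75},
]
code = "answer = max(table, key=lambda row: row['sales'])['item']"
print(verify_annotation(code, table, "A"))
```

A sample whose code crashes or disagrees with the label would simply be dropped from the curated set.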

[AI-58] CollectiveKV: Decoupling and Sharing Collaborative Information in Sequential Recommendation ICLR2026

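[Quick Read]: This paper addresses the high storage overhead of the KV cache (key-value cache) in sequential recommendation systems. With a large user base and long history sequences, conventional KV caching struggles to meet the storage and latency requirements of real deployments. The key to the solution is CollectiveKV, a cross-user sharing mechanism that exploits the similarity of KV sequences across users. Singular value decomposition (SVD) analysis shows that KV information splits into a shareable global part and a user-specific local part. During inference, each user retrieves high-dimensional shared KV from a learnable global KV pool and concatenates it with low-dimensional user-specific KV, compressing the KV cache to 0.8% of its original size while maintaining or even improving model performance.

Link: https://arxiv.org/abs/2601.19178
Authors: Jingyu Li,Zhaocheng Du,Qianhui Zhu,kaiyuan Li,Zhicheng Zhang,Song-Li Wu,Chaolang Li,Pengwen Dai
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by ICLR 2026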

Abstract:Sequential recommendation models are widely used in applications, yet they face stringent latency requirements. Mainstream models leverage the Transformer attention mechanism to improve performance, but its computational complexity grows with the sequence length, leading to a latency challenge for long sequences. Consequently, KV cache technology has recently been explored in sequential recommendation systems to reduce inference latency. However, KV cache introduces substantial storage overhead in sequential recommendation systems, which often have a large user base with potentially very long user history sequences. In this work, we observe that KV sequences across different users exhibit significant similarities, indicating the existence of collaborative signals in KV. Furthermore, we analyze the KV using singular value decomposition (SVD) and find that the information in KV can be divided into two parts: the majority of the information is shareable across users, while a small portion is user-specific. Motivated by this, we propose CollectiveKV, a cross-user KV sharing mechanism. It captures the information shared across users through a learnable global KV pool. During inference, each user retrieves high-dimensional shared KV from the pool and concatenates them with low-dimensional user-specific KV to obtain the final KV. Experiments on five sequential recommendation models and three datasets show that our method can compress the KV cache to only 0.8% of its original size, while maintaining or even enhancing model performance.
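
The retrieve-and-concatenate step can be sketched in a few lines. This is an illustration under stated assumptions: the abstract does not say how a user selects entries from the pool, so the per-position integer indices (`shared_ids`) are hypothetical, as are all dimensions and values.

```python
def assemble_kv(pool, shared_ids, user_kv):
    """Concatenate a shared high-dimensional vector with a user-specific
    low-dimensional vector for each position in the sequence."""
    return [pool[i] + u for i, u in zip(shared_ids, user_kv)]

pool = [[0.1] * 6, [0.2] * 6, [0.3] * 6]   # global pool: 3 entries of dim 6
shared_ids = [2, 0, 2, 1]                  # hypothetical per-position pool indices
user_kv = [[1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1]]  # user part, dim 2

kv = assemble_kv(pool, shared_ids, user_kv)
per_user_stored = len(shared_ids) + sum(len(u) for u in user_kv)
full_cache = sum(len(row) for row in kv)
print(per_user_stored, full_cache)
```

Only the indices and the low-dimensional part are stored per user; the pool is stored once and amortized over all users, which is where the compression would come from in this reading.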

[AI-59] A Scalable Inter-edge Correlation Modeling in CopulaGNN for Link Sign Prediction ICLR2026

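[Quick Read]: This paper addresses link sign prediction on signed graphs, i.e., determining whether the relationship represented by an edge is positive or negative. Conventional graph neural network methods rely on the graph homophily assumption, which negative edges violate, so regular methods fail unless auxiliary structures are introduced to handle negative edges. The authors instead model the latent statistical dependency among edges with a Gaussian copula, and improve efficiency and scalability in two ways: first, they represent the correlation matrix as a Gramian of edge embeddings, significantly reducing the number of parameters; second, they reformulate the conditional probability distribution to greatly reduce inference cost. Theoretical analysis proves linear convergence, and experiments show much faster convergence with prediction performance competitive with state-of-the-art models.

Link: https://arxiv.org/abs/2601.19175
Authors: Jinkyu Sung,Myunggeum Jee,Joonseok Lee
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Comments: Accepted to ICLR 2026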

Abstract:Link sign prediction on a signed graph is a task to determine whether the relationship represented by an edge is positive or negative. Since the presence of negative edges violates the graph homophily assumption that adjacent nodes are similar, regular graph methods have not been applicable without auxiliary structures to handle them. We aim to directly model the latent statistical dependency among edges with the Gaussian copula and its corresponding correlation matrix, extending CopulaGNN. However, a naive modeling of edge-edge relations is computationally intractable even for a graph with moderate scale. To address this, we propose to 1) represent the correlation matrix as a Gramian of edge embeddings, significantly reducing the number of parameters, and 2) reformulate the conditional probability distribution to dramatically reduce the inference cost. We theoretically verify scalability of our method by proving its linear convergence. Also, our extensive experiments demonstrate that it achieves significantly faster convergence than baselines, maintaining competitive prediction performance to the state-of-the-art models.
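
To see why a Gramian of edge embeddings can serve as a correlation matrix, consider the minimal sketch below. It assumes the embeddings are non-zero and normalized to unit length, which is one common construction and not necessarily the paper's. The Gramian of unit vectors is symmetric, positive semidefinite, and has a unit diagonal, and it needs m x d parameters for m edges instead of m(m-1)/2 free correlations.

```python
import math

def gramian_correlation(embeddings):
    """Correlation matrix R = Z Z^T, where rows of Z are unit-normalized
    edge embeddings (assumed non-zero)."""
    unit = []
    for e in embeddings:
        norm = math.sqrt(sum(x * x for x in e))
        unit.append([x / norm for x in e])
    return [[sum(a * b for a, b in zip(zi, zj)) for zj in unit] for zi in unit]

# Three hypothetical edge embeddings of dimension 2.
edge_embeddings = [[1.0, 0.0], [1.0, 1.0], [0.0, -2.0]]
R = gramian_correlation(edge_embeddings)
for row in R:
    print([round(v, 3) for v in row])
```

Each entry is the cosine similarity between two edge embeddings, so it always lies in [-1, 1].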

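[AI-60] SHIELD: An Auto-Healing Agentic Defense Framework for LLM Resource Exhaustion Attacks

[Quick Read]: This paper addresses the growing threat of sponge attacks on generative AI systems, which induce the model to consume excessive computational resources and cause denial of service (DoS). Existing defenses either rely on statistical filters, which fail on semantically meaningful attacks, or use static large language model (LLM) detectors, which cannot adapt as attack strategies evolve. The key to the solution is SHIELD, a multi-agent, self-healing defense framework. At its core is a three-stage Defense Agent that integrates semantic similarity retrieval, pattern matching, and LLM-based reasoning. A Knowledge Updating Agent and a Prompt Optimization Agent form a closed self-healing loop: when an attack bypasses detection, the system automatically updates its knowledge base and refines its defense instructions, countering evolving resource-exhaustion threats.

Link: https://arxiv.org/abs/2601.19174
Authors: Nirhoshan Sivaroopan,Kanchana Thilakarathna,Albert Zomaya,Manu,Yi Guo,Jo Plested,Tim Lynar,Jack Yang,Wangli Yang
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: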

Abstract:Sponge attacks increasingly threaten LLM systems by inducing excessive computation and DoS. Existing defenses either rely on statistical filters that fail on semantically meaningful attacks or use static LLM-based detectors that struggle to adapt as attack strategies evolve. We introduce SHIELD, a multi-agent, auto-healing defense framework centered on a three-stage Defense Agent that integrates semantic similarity retrieval, pattern matching, and LLM-based reasoning. Two auxiliary agents, a Knowledge Updating Agent and a Prompt Optimization Agent, form a closed self-healing loop: when an attack bypasses detection, the system updates an evolving knowledge base and refines defense instructions. Extensive experiments show that SHIELD consistently outperforms perplexity-based and standalone LLM defenses, achieving high F1 scores across both non-semantic and semantic sponge attacks, demonstrating the effectiveness of agentic self-healing against evolving resource-exhaustion threats.

[AI-61] Bridging Gulfs in UI Generation through Semantic Guidance

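[Quick Read]: This paper addresses the gulf of execution and the gulf of evaluation that arise in generative AI for user interface (UI) design because users struggle to express design intent accurately and to evaluate and refine generated results. The key to the solution is an explicit semantic representation that serves as an intermediate layer between human and AI: a thematic analysis of UI prompting guidelines distills hierarchical and interdependent design semantics, and a system built on them lets users specify semantics, visualize semantic relationships, and see how semantics map to the generated UI. This makes design requirements explicit and outputs interpretable, improving users' perceived control over intent expression and outcome interpretation, and supporting more predictable, iterative UI refinement.

Link: https://arxiv.org/abs/2601.19171
Authors: Seokhyeon Park,Soohyun Lee,Eugene Choi,Hyunwoo Kim,Minkyu Kweon,Yumin Song,Jinwook Seo
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26)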

Abstract:While generative AI enables high-fidelity UI generation from text prompts, users struggle to articulate design intent and to evaluate or refine results, creating gulfs of execution and evaluation. To understand the information needed for UI generation, we conducted a thematic analysis of UI prompting guidelines, identifying key design semantics and discovering that they are hierarchical and interdependent. Leveraging these findings, we developed a system that enables users to specify semantics, visualize relationships, and extract how semantics are reflected in generated UIs. By making semantics serve as an intermediate representation between human intent and AI output, our system bridges both gulfs by making requirements explicit and outcomes interpretable. A comparative user study suggests that our approach enhances users’ perceived control over intent expression and outcome interpretation, and facilitates more predictable, iterative refinement. Our work demonstrates how explicit semantic representation enables systematic and explainable exploration of design possibilities in AI-driven UI design.

[AI-62] Multi-Agent Procedural Graph Extraction with Structural and Logical Refinement

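[Quick Read]: This paper addresses two challenges in automatically extracting procedural graphs from natural language: the generated structures are often ill-formed, and the logical flow is misinterpreted relative to the source text. The authors propose a multi-agent framework, \model, which casts procedural graph extraction as multi-round reasoning and iterates through three stages to improve structure and logic together: a graph builder agent performs the initial extraction; a simulation agent diagnoses and explains structural defects; and a semantic agent aligns the flow logic with semantic cues in the text. The key innovation is an interpretable and controllable feedback mechanism: high-priority feedback is injected into subsequent prompts as natural language, so that different error types can be located and repaired without supervision or parameter updates, substantially improving structural correctness and logical consistency.

Link: https://arxiv.org/abs/2601.19170
Authors: Wangyang Ying,Yanchi Liu,Xujiang Zhao,Wei Cheng,Zhengzhang Chen,Wenchao Yu,Yanjie Fu,Haifeng Chen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: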

Abstract:Automatically extracting workflows as procedural graphs from natural language is promising yet underexplored, demanding both structural validity and logical alignment. While recent large language models (LLMs) show potential for procedural graph extraction, they often produce ill-formed structures or misinterpret logical flows. We present \model, a multi-agent framework that formulates procedural graph extraction as a multi-round reasoning process with dedicated structural and logical refinement. The framework iterates through three stages: (1) a graph extraction phase with the graph builder agent, (2) a structural feedback phase in which a simulation agent diagnoses and explains structural defects, and (3) a logical feedback phase in which a semantic agent aligns semantics between flow logic and linguistic cues in the source text. Important feedback is prioritized and expressed in natural language, which is injected into subsequent prompts, enabling interpretable and controllable refinement. This modular design allows agents to target distinct error types without supervision or parameter updates. Experiments demonstrate that \model achieves substantial improvements in both structural correctness and logical consistency over strong baselines.

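[AI-63] TS-Debate: Multimodal Collaborative Debate for Zero-Shot Time Series Reasoning

[Quick Read]: This paper addresses the insufficient numeric fidelity, modality interference, and unprincipled cross-modal integration of large language models (LLMs) in time series (TS) analysis. The core solution is the TS-Debate framework, which assigns dedicated expert agents to textual context, visual patterns, and numerical signals, and coordinates them through a structured debate protocol. Reviewer agents evaluate each agent's claims with a verification-conflict-calibration mechanism, supported by lightweight code execution and numerical lookup for programmatic verification. This design preserves modality fidelity, exposes conflicting evidence, and suppresses numeric hallucinations, and it outperforms strong baselines on multiple public benchmarks without task-specific fine-tuning.

Link: https://arxiv.org/abs/2601.19151
Authors: Patara Trirat,Jin Myung Kwak,Jay Heo,Heejun Lee,Sung Ju Hwang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Code will be available at this https URL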

Abstract:Recent progress at the intersection of large language models (LLMs) and time series (TS) analysis has revealed both promise and fragility. While LLMs can reason over temporal structure given carefully engineered context, they often struggle with numeric fidelity, modality interference, and principled cross-modal integration. We present TS-Debate, a modality-specialized, collaborative multi-agent debate framework for zero-shot time series reasoning. TS-Debate assigns dedicated expert agents to textual context, visual patterns, and numerical signals, preceded by explicit domain knowledge elicitation, and coordinates their interaction via a structured debate protocol. Reviewer agents evaluate agent claims using a verification-conflict-calibration mechanism, supported by lightweight code execution and numerical lookup for programmatic verification. This architecture preserves modality fidelity, exposes conflicting evidence, and mitigates numeric hallucinations without task-specific fine-tuning. Across 20 tasks spanning three public benchmarks, TS-Debate achieves consistent and significant performance improvements over strong baselines, including standard multimodal debate in which all agents observe all inputs.

[AI-64] Length-Adaptive Interest Network for Balancing Long and Short Sequence Modeling in CTR Prediction AAAI2026

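[Quick Read]: This paper addresses the performance degradation in modern click-through rate (CTR) prediction models caused by heterogeneous user behavior sequence lengths. In particular, increasing the input sequence length worsens performance for short-sequence users, which the authors attribute to attention polarization and length imbalance in the training data. The key to the solution is a plug-and-play Length-Adaptive Interest Network (LAIN) that introduces sequence length explicitly as a conditioning signal: a Spectral Length Encoder maps length to a continuous representation, Length-Conditioned Prompting injects global contextual cues into the long- and short-term behavior branches, and Length-Modulated Attention adapts attention sharpness to sequence length, balancing long- and short-sequence modeling.

Link: https://arxiv.org/abs/2601.19142
Authors: Zhicheng Zhang,Zhaocheng Du,Jieming Zhu,Jiwei Tang,Fengyuan Lu,Wang Jiaheng,Song-Li Wu,Qianhui Zhu,Jingyu Li,Hai-Tao Zheng,Zhenhua Dong
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at AAAI 2026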

Abstract:User behavior sequences in modern recommendation systems exhibit significant length heterogeneity, ranging from sparse short-term interactions to rich long-term histories. While longer sequences provide more context, we observe that increasing the maximum input sequence length in existing CTR models paradoxically degrades performance for short-sequence users due to attention polarization and length imbalance in training data. To address this, we propose LAIN (Length-Adaptive Interest Network), a plug-and-play framework that explicitly incorporates sequence length as a conditioning signal to balance long- and short-sequence modeling. LAIN consists of three lightweight components: a Spectral Length Encoder that maps length into continuous representations, Length-Conditioned Prompting that injects global contextual cues into both long- and short-term behavior branches, and Length-Modulated Attention that adaptively adjusts attention sharpness based on sequence length. Extensive experiments on three real-world benchmarks across five strong CTR backbones show that LAIN consistently improves overall performance, achieving up to 1.15% AUC gain and 2.25% log loss reduction. Notably, our method significantly improves accuracy for short-sequence users without sacrificing long-sequence effectiveness. Our work offers a general, efficient, and deployable solution to mitigate length-induced bias in sequential recommendation.
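
"Adjusting attention sharpness based on sequence length" can be pictured as a softmax whose temperature depends on the length. The sketch below is an assumption-laden illustration: the schedule `temperature_for_length`, its direction (longer sequences get sharper attention), and the parameter `alpha` are hypothetical and are not taken from the paper.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    shifted = [s / temperature for s in scores]
    peak = max(shifted)
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def temperature_for_length(length, alpha=0.5):
    """Hypothetical schedule: longer sequences get a lower temperature."""
    return 1.0 / (1.0 + alpha * math.log(length))

def length_modulated_attention(scores, length, alpha=0.5):
    """Attention weights whose sharpness depends on the sequence length."""
    return softmax(scores, temperature_for_length(length, alpha))

short = length_modulated_attention([2.0, 1.0, 0.5], length=5)
long_ = length_modulated_attention([2.0, 1.0, 0.5], length=200)
print(max(short), max(long_))
```

In a real model the schedule would presumably be learned from the length encoding rather than fixed by hand.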

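[AI-65] AgenticSCR: An Autonomous Agentic Secure Code Review for Immature Vulnerabilities Detection

[Quick Read]: This paper addresses the challenge of secure code review at the pre-commit stage: detecting immature, context-dependent vulnerabilities under tight latency and limited-context constraints. Existing approaches based on static analysis tools (SAST) are noisy and prone to misses, while standalone large language models (LLMs) are limited by context window length and lack explicit tool use. The proposed solution is AgenticSCR, an agentic AI system that combines an LLM with autonomous decision-making, tool invocation, and code navigation, augmented with security-focused semantic memories to better recognize immature vulnerabilities. Empirical results show that AgenticSCR substantially outperforms SAST tools and a static LLM baseline at localizing, detecting, and explaining immature vulnerabilities, and generates more correct review comments for four out of five vulnerability types, supporting the effectiveness and potential of agentic AI for pre-commit secure code review.

Link: https://arxiv.org/abs/2601.19138
Authors: Wachiraphan Charoenwet,Kla Tantithamthavorn,Patanamon Thongtanunam,Hong Yi Lin,Minwoo Jeong,Ming Wu
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: Under Review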

Abstract:Secure code review is critical at the pre-commit stage, where vulnerabilities must be caught early under tight latency and limited-context constraints. Existing SAST-based checks are noisy and often miss immature, context-dependent vulnerabilities, while standalone Large Language Models (LLMs) are constrained by context windows and lack explicit tool use. Agentic AI systems, which combine LLMs with autonomous decision-making, tool invocation, and code navigation, offer a promising alternative, but their effectiveness for pre-commit secure code review is not yet well understood. In this work, we introduce AgenticSCR, an agentic AI for secure code review that detects immature vulnerabilities during the pre-commit stage, augmented by security-focused semantic memories. Using our own curated benchmark of immature vulnerabilities, tailored to pre-commit secure code review, we empirically evaluate how accurately AgenticSCR localizes, detects, and explains immature vulnerabilities. Our results show that AgenticSCR achieves at least a 153% relative increase in the percentage of correct code review comments over the static LLM-based baseline, and also substantially surpasses SAST tools. Moreover, AgenticSCR generates more correct comments in four out of five vulnerability types, consistently and significantly outperforming all other baselines. These findings highlight the importance of Agentic Secure Code Review, paving the way towards an emerging research area of immature vulnerability detection.

[AI-66] In-Network Collective Operations: Game Changer or Challenge for AI Workloads?

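[Quick Read]: This paper addresses the performance bottleneck of collective operations in artificial intelligence (AI) workloads by accelerating them with in-network collective operations (INC). The key idea is to move the execution of collective operations from compute nodes into the network, in two forms: Edge-INC, implemented at the node level, and Core-INC, embedded within network switches. This architectural shift has the potential to significantly reduce communication latency and improve bandwidth utilization, overcoming the performance limits of conventional CPU/GPU and network cooperation. The paper further identifies six key technical obstacles and offers predictions for the future development of INC, providing a theoretical basis and practical direction for co-optimization at the intersection of AI and networking.

Link: https://arxiv.org/abs/2601.19132
Authors: Torsten Hoefler,Mikhail Khalilov,Josiah Clark,Surendra Anubolu,Mohan Kalkunte,Karen Schramm,Eric Spada,Duncan Roweth,Keith Underwood,Adrian Caulfield,Abdul Kabbani,Amirreza Rastegari
Affiliation: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Performance (cs.PF); Systems and Control (eess.SY)
Comments: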

Abstract:This paper summarizes the opportunities of in-network collective operations (INC) for accelerated collective operations in AI workloads. We provide sufficient detail to make this important field accessible to non-experts in AI or networking, fostering a connection between these communities. Consider two types of INC: Edge-INC, where the system is implemented at the node level, and Core-INC, where the system is embedded within network switches. We outline the potential performance benefits as well as six key obstacles in the context of both Edge-INC and Core-INC that may hinder their adoption. Finally, we present a set of predictions for the future development and application of INC.

[AI-67] Exploring Weaknesses in Function Call Models via Reinforcement Learning: An Adversarial Data Augmentation Approach

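[Quick Read]: This paper addresses the insufficient generalization and robustness of the function call (FC) capabilities of current Large Language Models (LLMs). Existing methods fine-tune on manually annotated or model-generated data but are constrained by fixed patterns and data distributions, which makes it hard to improve performance in complex scenarios involving external tools. The key to the solution is an adversarial data augmentation method based on reinforcement learning (RL): a query model is trained to generate adversarial queries that target the weaknesses of the FC model, and the two models are trained in an iterative, alternating fashion under a zero-sum game formulation, which systematically identifies and corrects weaknesses in the ability of LLMs to interact with external tools.

Link: https://arxiv.org/abs/2601.19122
Authors: Weiran Guo, Bing Bo, Shaoxiang Wu, Jingsheng Yang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: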

Abstract:Function call capabilities have become crucial for Large Language Models (LLMs), enabling them to interact more effectively with external tools and APIs. Existing methods for improving the function call capabilities of LLMs rely on data obtained either through manual annotation or automated generation by models, and use this data to finetune the LLMs. However, these methods often lack targeted design and are constrained by fixed patterns and data distributions, which limits their effectiveness in enhancing the generalization and robustness of function call LLMs. To address this limitation, we propose a novel adversarial data augmentation method that employs reinforcement learning to systematically identify and target the weaknesses of function call LLMs. Our training framework introduces a query model trained with reinforcement learning (RL) to generate adversarial queries that are specifically designed to challenge function call (FC) models. This approach adopts a zero sum game formulation, where the query model and the FC model engage in iterative alternating training. Overall, our method advances the development of more robust FC models and provides a systematic way to identify and correct weaknesses in the ability of LLMs to interact with external tools.

[AI-68] LLMs as Orchestrators: Constraint-Compliant Multi-Agent Optimization for Recommendation Systems

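[Quick Read]: This paper addresses the difficulty recommendation systems have in satisfying hard business constraints (such as fairness and coverage) during multi-objective optimization, where frequent constraint violations in real deployments lead to unacceptable consequences. Existing methods usually treat constraints as soft penalties or focus only on item scoring and interaction, and offer no guarantee of feasibility. The key to the solution is the DualAgent-Rec framework, which performs constrained optimization cooperatively through two agents: an Exploitation Agent that prioritizes recommendation accuracy under hard constraints, and an Exploration Agent that promotes diversity through unconstrained Pareto search. An LLM-based coordinator allocates resources between the agents dynamically, and an adaptive ε-relaxation mechanism ensures that the final solutions are feasible. Experiments show 100% constraint satisfaction and a 4–6% improvement in Pareto hypervolume over strong baselines, while maintaining a competitive accuracy-diversity trade-off.

Link: https://arxiv.org/abs/2601.19121
Authors: Guilin Zhang, Kai Zhao, Jeffrey Friedman, Xu Chu
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 8 pages, 5 figures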

Abstract:Recommendation systems must optimize multiple objectives while satisfying hard business constraints such as fairness and coverage. For example, an e-commerce platform may require every recommendation list to include items from multiple sellers and at least one newly listed product; violating such constraints–even once–is unacceptable in production. Prior work on multi-objective recommendation and recent LLM-based recommender agents largely treat constraints as soft penalties or focus on item scoring and interaction, leading to frequent violations in real-world deployments. How to leverage LLMs for coordinating constrained optimization in recommendation systems remains underexplored. We propose DualAgent-Rec, an LLM-coordinated dual-agent framework for constrained multi-objective e-commerce recommendation. The framework separates optimization into an Exploitation Agent that prioritizes accuracy under hard constraints and an Exploration Agent that promotes diversity through unconstrained Pareto search. An LLM-based coordinator adaptively allocates resources between agents based on optimization progress and constraint satisfaction, while an adaptive epsilon-relaxation mechanism guarantees feasibility of final solutions. Experiments on the Amazon Reviews 2023 dataset demonstrate that DualAgent-Rec achieves 100% constraint satisfaction and improves Pareto hypervolume by 4-6% over strong baselines, while maintaining competitive accuracy-diversity trade-offs. These results indicate that LLMs can act as effective orchestration agents for deployable and constraint-compliant recommendation systems.
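
The e-commerce example in the abstract (every list must cover multiple sellers and contain at least one newly listed product) can be written as a hard feasibility check. The sketch below is only an illustration of that idea: the `Item` type, the function names, and the thresholds are hypothetical and are not DualAgent-Rec's actual interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str
    seller: str
    is_new: bool  # True for a newly listed product


def satisfies_hard_constraints(rec_list, min_sellers=2, min_new=1):
    """Return True only if the list meets both hard constraints."""
    sellers = {item.seller for item in rec_list}
    new_items = sum(1 for item in rec_list if item.is_new)
    return len(sellers) >= min_sellers and new_items >= min_new


def constraint_satisfaction_rate(rec_lists, min_sellers=2, min_new=1):
    """Fraction of recommendation lists that are feasible."""
    if not rec_lists:
        raise ValueError("need at least one recommendation list")
    feasible = sum(
        satisfies_hard_constraints(r, min_sellers, min_new) for r in rec_lists
    )
    return feasible / len(rec_lists)


if __name__ == "__main__":
    good = [Item("a", "seller-1", False), Item("b", "seller-2", True)]
    bad = [Item("a", "seller-1", True), Item("c", "seller-1", False)]
    print(constraint_satisfaction_rate([good, bad]))
```

Treating the check as a boolean filter rather than a penalty term is what distinguishes a hard constraint from the soft-penalty formulations the abstract criticizes.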

[AI-69] RobustExplain: Evaluating Robustness of LLM-Based Explanation Agents for Recommendation

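[Quick Read]: This paper addresses the robustness of explanations generated by Generative AI in recommender systems, i.e., whether LLM-generated explanations remain stable and trustworthy when real user behavior is noisy (accidental clicks, temporal inconsistencies, missing values, and evolving preferences). The key to the solution is RobustExplain, the first systematic evaluation framework for this problem: it introduces five kinds of realistic user behavior perturbations and a multi-dimensional robustness metric (covering semantic, keyword, structural, and length consistency), and uses them to evaluate LLMs of several sizes (7B–70B) quantitatively. This establishes task-level robustness baselines, shows that current models are only moderately robust and that larger models improve stability by up to about 8%, and provides a key evaluation dimension for trustworthy, agent-driven recommender systems.

Link: https://arxiv.org/abs/2601.19120
Authors: Guilin Zhang, Kai Zhao, Jeffrey Friedman, Xu Chu
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 4 figures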

Abstract:Large Language Models (LLMs) are increasingly used to generate natural-language explanations in recommender systems, acting as explanation agents that reason over user behavior histories. While prior work has focused on explanation fluency and relevance under fixed inputs, the robustness of LLM-generated explanations to realistic user behavior noise remains largely unexplored. In real-world web platforms, interaction histories are inherently noisy due to accidental clicks, temporal inconsistencies, missing values, and evolving preferences, raising concerns about explanation stability and user trust. We present RobustExplain, the first systematic evaluation framework for measuring the robustness of LLM-generated recommendation explanations. RobustExplain introduces five realistic user behavior perturbations evaluated across multiple severity levels and a multi-dimensional robustness metric capturing semantic, keyword, structural, and length consistency. Our goal is to establish a principled, task-level evaluation framework and initial robustness baselines, rather than to provide a comprehensive leaderboard across all available LLMs. Experiments on four representative LLMs (7B–70B) show that current models exhibit only moderate robustness, with larger models achieving up to 8% higher stability. Our results establish the first robustness benchmarks for explanation agents and highlight robustness as a critical dimension for trustworthy, agent-driven recommender systems at web scale.
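
As a rough illustration of what a consistency-based robustness score might look like, the sketch below compares an explanation generated from a clean history with one generated from a perturbed history. It covers only two of the four dimensions named in the abstract (keyword and length consistency); the definitions, weights, and function names are assumptions made for this example and are not the paper's metric. Semantic and structural consistency would need an embedding model or a parser and are left out.

```python
import re


def _tokens(text):
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_consistency(original, perturbed):
    """Jaccard overlap between the token sets of the two explanations."""
    a, b = set(_tokens(original)), set(_tokens(perturbed))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def length_consistency(original, perturbed):
    """Ratio of the shorter token count to the longer one."""
    la, lb = len(_tokens(original)), len(_tokens(perturbed))
    if max(la, lb) == 0:
        return 1.0
    return min(la, lb) / max(la, lb)


def robustness_score(original, perturbed, w_keyword=0.5, w_length=0.5):
    """Weighted average of the two consistency terms (weights are assumptions)."""
    return (w_keyword * keyword_consistency(original, perturbed)
            + w_length * length_consistency(original, perturbed))


if __name__ == "__main__":
    clean = "You liked sci-fi movies"
    noisy = "You liked action movies"
    print(round(robustness_score(clean, noisy), 3))
```

A score of 1.0 means the explanation did not change under perturbation; lower values indicate that noise in the interaction history altered the wording or length of the explanation.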

[AI-70] Uncertainty-Aware 3D Emotional Talking Face Synthesis with Emotion Prior Distillation ICASSP2026

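[Quick Read]: This paper addresses two key problems in 3D emotional talking face synthesis. The first is poor audio-visual emotion alignment, which shows up as difficulty extracting emotion features from audio and insufficient control over emotional micro-expressions. The second is that multi-view fusion strategies use uniform weights and ignore differences in uncertainty and feature quality across views, which degrades rendering quality. The core of the solution is the UA-3DTalk framework, whose key innovations are: (1) a Prior Extraction module that separates audio into content-synchronized features and person-specific complementary features, improving emotion alignment and individualized expression; (2) an Emotion Distillation module that combines a multi-modal attention-weighted fusion mechanism with 4D Gaussian encoding and multi-resolution code-books, enabling fine-grained audio emotion extraction and precise control of micro-expressions; and (3) an Uncertainty-based Deformation module that estimates view-specific aleatoric and epistemic uncertainty to achieve adaptive multi-view fusion, and uses a multi-head decoder to optimize the Gaussian primitives, overcoming the limitations of uniform-weight fusion.

Link: https://arxiv.org/abs/2601.19112
Authors: Nanhan Shen, Zhilei Liu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD)
Comments: Accepted by ICASSP 2026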

Abstract:Emotional Talking Face synthesis is pivotal in multimedia and signal processing, yet existing 3D methods suffer from two critical challenges: poor audio-vision emotion alignment, manifested as difficult audio emotion extraction and inadequate control over emotional micro-expressions; and a one-size-fits-all multi-view fusion strategy that overlooks uncertainty and feature quality differences, undermining rendering quality. We propose UA-3DTalk, Uncertainty-Aware 3D Emotional Talking Face Synthesis with emotion prior distillation, which has three core modules: the Prior Extraction module disentangles audio into content-synchronized features for alignment and person-specific complementary features for individualization; the Emotion Distillation module introduces a multi-modal attention-weighted fusion mechanism and 4D Gaussian encoding with multi-resolution code-books, enabling fine-grained audio emotion extraction and precise control of emotional micro-expressions; the Uncertainty-based Deformation deploys uncertainty blocks to estimate view-specific aleatoric (input noise) and epistemic (model parameters) uncertainty, realizing adaptive multi-view fusion and incorporating a multi-head decoder for Gaussian primitive optimization to mitigate the limitations of uniform-weight fusion. Extensive experiments on regular and emotional datasets show UA-3DTalk outperforms state-of-the-art methods like DEGSTalk and EDTalk by 5.2% in E-FID for emotion alignment, 3.1% in SyncC for lip synchronization, and 0.015 in LPIPS for rendering quality. Project page: this https URL
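
The contrast between uniform-weight fusion and uncertainty-aware fusion can be illustrated with generic inverse-variance weighting, where views with higher estimated uncertainty contribute less. This is a textbook technique used here only as a stand-in: the abstract does not specify how UA-3DTalk's uncertainty blocks weight the views, and the function name and inputs are hypothetical.

```python
def fuse_views(features, variances, eps=1e-8):
    """Fuse per-view feature vectors with inverse-variance weights.

    features:  list of equal-length feature vectors, one per view
    variances: one non-negative uncertainty estimate per view
    """
    if not features or len(features) != len(variances):
        raise ValueError("need one variance per view")
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("all feature vectors must have the same length")

    weights = [1.0 / (v + eps) for v in variances]
    total = sum(weights)
    return [
        sum(w * f[i] for w, f in zip(weights, features)) / total
        for i in range(dim)
    ]


if __name__ == "__main__":
    views = [[1.0, 2.0], [3.0, 4.0]]
    print(fuse_views(views, [1.0, 1.0]))     # equal uncertainty: plain average
    print(fuse_views(views, [0.01, 100.0]))  # second view is mostly ignored
```

With equal variances the result reduces to the uniform average, so uniform-weight fusion is the special case in which every view is assumed equally reliable.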

[AI-71] Detecting and Correcting Hallucinations in LLM-Generated Code via Deterministic AST Analysis

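[Quick Read]: This paper addresses the Knowledge Conflicting Hallucinations (KCHs) that Large Language Models (LLMs) frequently introduce during code generation: subtle semantic errors (such as non-existent API parameters) that are hard for static checking tools to catch but cause runtime failures. The key to the solution is a deterministic post-processing framework: generated code is parsed into an Abstract Syntax Tree (AST) and validated against a Knowledge Base (KB) built dynamically via library introspection, so that API-level and identifier-level conflicts are detected and corrected automatically by deterministic rules. The method does not execute the generated code. On a manually curated dataset of 200 Python snippets it achieves 100% precision and 87.6% recall and automatically repairs 77.0% of the identified KCHs, supporting its reliability and practicality as an alternative to probabilistic repair.

Link: https://arxiv.org/abs/2601.19106
Authors: Dipin Khati, Daniel Rodriguez-Cardenas, Paul Pantzer, Denys Poshyvanyk
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted to FORGE 2026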

Abstract:Large Language Models (LLMs) for code generation boost productivity but frequently introduce Knowledge Conflicting Hallucinations (KCHs), subtle, semantic errors, such as non-existent API parameters, that evade linters and cause runtime failures. Existing mitigations like constrained decoding or non-deterministic LLM-in-the-loop repair are often unreliable for these errors. This paper investigates whether a deterministic, static-analysis framework can reliably detect and auto-correct KCHs. We propose a post-processing framework that parses generated code into an Abstract Syntax Tree (AST) and validates it against a dynamically-generated Knowledge Base (KB) built via library introspection. This non-executing approach uses deterministic rules to find and fix both API and identifier-level conflicts. On a manually-curated dataset of 200 Python snippets, our framework detected KCHs with 100% precision and 87.6% recall (0.934 F1-score), and successfully auto-corrected 77.0% of all identified hallucinations. Our findings demonstrate that this deterministic post-processing approach is a viable and reliable alternative to probabilistic repair, offering a clear path toward trustworthy code generation.
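
The general idea of validating an AST against facts obtained by introspection can be sketched with Python's standard `ast`, `importlib`, and `inspect` modules. The example below is a minimal illustration under several assumptions, not the authors' framework: it only handles calls of the form `module.function(...)` on modules brought in with a plain `import`, it checks keyword argument names only, and the function name `find_unknown_keywords` is hypothetical. It imports the referenced library for introspection but never executes the generated snippet itself.

```python
import ast
import importlib
import inspect


def find_unknown_keywords(source: str) -> list[tuple[int, str, str]]:
    """Return (line, call name, problem) for calls that conflict with the library."""
    tree = ast.parse(source)

    # Map local names to module names for `import x` and `import x as y`.
    modules = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules[alias.asname or alias.name] = alias.name

    problems = []
    for node in ast.walk(tree):
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        base = node.func.value
        if not (isinstance(base, ast.Name) and base.id in modules):
            continue

        call_name = f"{base.id}.{node.func.attr}"
        module = importlib.import_module(modules[base.id])
        func = getattr(module, node.func.attr, None)
        if func is None:
            problems.append((node.lineno, call_name, "<missing attribute>"))
            continue

        try:
            params = inspect.signature(func).parameters
        except (TypeError, ValueError):
            continue  # no introspectable signature (e.g. some builtins)
        if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
            continue  # accepts **kwargs, so any keyword name may be valid

        for kw in node.keywords:
            if kw.arg is not None and kw.arg not in params:
                problems.append((node.lineno, call_name, kw.arg))
    return problems


if __name__ == "__main__":
    snippet = "import os\nos.makedirs('out', exists_ok=True)\n"
    print(find_unknown_keywords(snippet))
```

Here `exists_ok` is a plausible-looking misspelling of the real `exist_ok` parameter of `os.makedirs`, the kind of error a syntax-only linter would not flag. A correction step could then suggest the closest valid parameter name, for example with `difflib.get_close_matches`.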

[AI-72] FloydNet: A Learning Paradigm for Global Relational Reasoning

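[Quick Read]: This paper addresses the local bottleneck that the message-passing mechanism imposes on Graph Neural Networks (GNNs) in complex multi-step reasoning tasks, which limits global, holistic reasoning. The key to the solution is the FloydNet architecture, which abandons local message passing in favor of a dynamic programming (DP) approach: it maintains a global all-pairs relationship tensor and learns a generalized DP operator that progressively refines it, yielding a task-specific relational calculus. This design captures long-range dependencies and provably achieves 3-WL (2-FWL) expressive power. Experiments show near-perfect performance on the CLRS-30 algorithmic benchmark and a rate of finding optimal solutions for the general Traveling Salesman Problem (TSP) that significantly exceeds strong heuristics, supporting DP-style iterative refinement as a powerful and practical new paradigm for high-level graph reasoning.

Link: https://arxiv.org/abs/2601.19094
Authors: Jingcheng Yu, Mingliang Zeng, Qiwei Ye
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 29 pages, 9 figures, 14 tables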

Abstract:Developing models capable of complex, multi-step reasoning is a central goal in artificial intelligence. While representing problems as graphs is a powerful approach, Graph Neural Networks (GNNs) are fundamentally constrained by their message-passing mechanism, which imposes a local bottleneck that limits global, holistic reasoning. We argue that dynamic programming (DP), which solves problems by iteratively refining a global state, offers a more powerful and suitable learning paradigm. We introduce FloydNet, a new architecture that embodies this principle. In contrast to local message passing, FloydNet maintains a global, all-pairs relationship tensor and learns a generalized DP operator to progressively refine it. This enables the model to develop a task-specific relational calculus, providing a principled framework for capturing long-range dependencies. Theoretically, we prove that FloydNet achieves 3-WL (2-FWL) expressive power, and its generalized form aligns with the k-FWL hierarchy. FloydNet demonstrates state-of-the-art performance across challenging domains: it achieves near-perfect scores (often 99%) on the CLRS-30 algorithmic benchmark, finds exact optimal solutions for the general Traveling Salesman Problem (TSP) at rates significantly exceeding strong heuristics, and empirically matches the 3-WL test on the BREC benchmark. Our results establish this learned, DP-style refinement as a powerful and practical alternative to message passing for high-level graph reasoning.
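
For orientation, the classical Floyd-Warshall shortest-path algorithm is a familiar hand-written example of a DP that iteratively refines a global all-pairs table. The sketch below is that textbook algorithm, shown only to make the "refine an all-pairs state" idea concrete; the connection to the model's name is an inference, and FloydNet replaces the fixed min-plus update with a learned operator whose form the abstract does not spell out.

```python
import math


def floyd_warshall(weights):
    """All-pairs shortest paths on a dense weight matrix.

    weights[i][j] is the edge weight from i to j, or math.inf if absent.
    The input is not modified.
    """
    n = len(weights)
    dist = [row[:] for row in weights]
    for k in range(n):          # allow node k as an intermediate stop
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k   # refine the global pair state
    return dist


if __name__ == "__main__":
    inf = math.inf
    graph = [
        [0, 4, 7],
        [inf, 0, 1],
        [inf, inf, 0],
    ]
    print(floyd_warshall(graph))
```

Each update of `dist[i][j]` reads two other entries of the same table, `dist[i][k]` and `dist[k][j]`, so information between distant nodes is combined directly rather than propagated hop by hop through neighbors.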

[AI-73] Axe: A Simple Unified Layout Abstraction for Machine Learning Compilers

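[Quick Read]: This paper addresses the problem of coordinating the placement of data and compute across device meshes, memory hierarchies, and heterogeneous accelerators in modern deep learning workloads. The core challenge is to manage tensor layout strategies such as sharding, replication, offsets, and tiling efficiently enough for high-performance distributed computation. The key to the solution is a hardware-aware abstraction, Axe Layout, which maps logical tensor coordinates to a multi-axis physical space via named axes. It unifies operations ranging from inter-device distribution to on-device layout and supports composing thread-local control with collective operators in a single kernel, bringing performance on the latest GPU devices and in multi-device environments close to that of hand-tuned kernels.

Link: https://arxiv.org/abs/2601.19092
Authors: Bohan Hou, Hongyi Jin, Guanjie Wang, Jinqi Chen, Yaxing Cai, Lijie Yang, Zihao Ye, Yaoyao Ding, Ruihang Lai, Tianqi Chen
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments: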

Abstract:Scaling modern deep learning workloads demands coordinated placement of data and compute across device meshes, memory hierarchies, and heterogeneous accelerators. We present Axe Layout, a hardware-aware abstraction that maps logical tensor coordinates to a multi-axis physical space via named axes. Axe unifies tiling, sharding, replication, and offsets across inter-device distribution and on-device layouts, enabling collective primitives to be expressed consistently from device meshes to threads. Building on Axe, we design a multi-granularity, distribution-aware DSL and compiler that composes thread-local control with collective operators in a single kernel. Experiments show that our unified approach can bring performance close to hand-tuned kernels across the latest GPU devices, multi-device environments, and accelerator backends.
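
To make "mapping logical tensor coordinates to a multi-axis physical space via named axes" more concrete, the sketch below maps an element of a 2-D tensor to a device on a 2-D mesh plus a row-major offset inside that device's shard. It is a simplified illustration with hypothetical names (`shard_coordinate`, `map_2d`, mesh axes `"x"` and `"y"`) and even block sharding only; it does not reflect the actual Axe Layout notation or API.

```python
def shard_coordinate(index, extent, num_devices):
    """Map a logical index on one axis to (device, local index) under block sharding."""
    if extent % num_devices != 0:
        raise ValueError("extent must divide evenly across devices")
    if not 0 <= index < extent:
        raise IndexError("index out of range")
    shard_size = extent // num_devices
    return divmod(index, shard_size)


def map_2d(row, col, shape, mesh):
    """Map element (row, col) to a mesh coordinate and an offset in the local shard.

    shape: (rows, cols) of the logical tensor
    mesh:  named mesh axes, e.g. {"x": 2, "y": 2}; rows are sharded over "x",
           columns over "y" (an assumption made for this example)
    """
    dev_r, local_r = shard_coordinate(row, shape[0], mesh["x"])
    dev_c, local_c = shard_coordinate(col, shape[1], mesh["y"])
    local_cols = shape[1] // mesh["y"]
    return {"device": (dev_r, dev_c), "offset": local_r * local_cols + local_c}


if __name__ == "__main__":
    print(map_2d(5, 6, shape=(8, 8), mesh={"x": 2, "y": 2}))
```

The same two-level decomposition (which unit owns the element, and where it sits inside that unit) can be repeated for thread blocks, warps, and threads, which is the sense in which one abstraction could cover both inter-device and on-device layouts.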

[AI-74] Out-of-Distribution Generalization for Neural Physics Solvers

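[Quick Read]: This paper addresses the poor generalization of neural physics solvers under out-of-distribution changes in partial differential equation (PDE) parameters, geometries, or initial conditions, which limits their use for exploring novel designs and making long-horizon predictions in scientific discovery. The key to the solution is the NOVA framework, which learns physics-aligned representations from a small initial set of scenarios and thereby simulates complex nonlinear systems (such as heat transfer, diffusion-reaction, and fluid flow) accurately out of distribution, reducing errors by 1–2 orders of magnitude while also improving the stability of long-time dynamical rollouts and the efficiency of generative design.

Link: https://arxiv.org/abs/2601.19091
Authors: Zhao Wei, Chin Chun Ooi, Jian Cheng Wong, Abhishek Gupta, Pao-Hsiung Chiu, Yew-Soon Ong
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: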

Abstract:Neural physics solvers are increasingly used in scientific discovery, given their potential for rapid in silico insights into physical, materials, or biological systems and their long-time evolution. However, poor generalization beyond their training support limits exploration of novel designs and long-time horizon predictions. We introduce NOVA, a route to generalizable neural physics solvers that can provide rapid, accurate solutions to scenarios even under distributional shifts in partial differential equation parameters, geometries and initial conditions. By learning physics-aligned representations from an initial sparse set of scenarios, NOVA consistently achieves 1-2 orders of magnitude lower out-of-distribution errors than data-driven baselines across complex, nonlinear problems including heat transfer, diffusion-reaction and fluid flow. We further showcase NOVA’s dual impact on stabilizing long-time dynamical rollouts and improving generative design through application to the simulation of nonlinear Turing systems and fluidic chip optimization. Unlike neural physics solvers that are constrained to retrieval and/or emulation within an a priori space, NOVA enables reliable extrapolation beyond known regimes, a key capability given the need for exploration of novel hypothesis spaces in scientific discovery.

[AI-75] HalluJudge: A Reference-Free Hallucination Detection for Context Misalignment in Code Review Automation

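[Quick Read]: This paper addresses the lack of trustworthiness caused by hallucinations when Generative AI is used for code review automation, i.e., generated review comments that are not grounded in the actual code. The key to the solution is the design and implementation of HalluJudge, a framework that assesses the reliability of generated comments based on context alignment. It includes four strategies, ranging from direct assessment to structured multi-branch reasoning (e.g., Tree-of-Thoughts), and detects hallucinations without a reference. Evaluated on enterprise-scale software projects, it achieves an F1 score of 0.85 at an average cost of 0.009, and 67% of its assessments agree with developer preferences in real production, which helps build developer trust in AI-assisted code review.

Link: https://arxiv.org/abs/2601.19072
Authors: Kla Tantithamthavorn, Hong Yi Lin, Patanamon Thongtanunam, Wachiraphan Charoenwet, Minwoo Jeong, Ming Wu
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Under Review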

Abstract:Large Language Models (LLMs) have shown strong capabilities in code review automation, such as review comment generation, yet they suffer from hallucinations, where the generated review comments are ungrounded in the actual code, which poses a significant challenge to the adoption of LLMs in code review workflows. To address this, we explore effective and scalable methods for hallucination detection in LLM-generated code review comments without a reference. In this work, we design HalluJudge, which aims to assess the grounding of generated review comments based on context alignment. HalluJudge includes four key strategies ranging from direct assessment to structured multi-branch reasoning (e.g., Tree-of-Thoughts). We conduct a comprehensive evaluation of these assessment strategies across Atlassian’s enterprise-scale software projects to examine the effectiveness and cost-efficiency of HalluJudge. Furthermore, we analyze the alignment between HalluJudge’s judgment and developer preference for the actual LLM-generated code review comments in real-world production. Our results show that the hallucination assessment in HalluJudge is cost-effective with an F1 score of 0.85 and an average cost of 0.009. On average, 67% of the HalluJudge assessments are aligned with the developer preference for the actual LLM-generated review comments in online production. Our results suggest that HalluJudge can serve as a practical safeguard to reduce developers’ exposure to hallucinated comments, fostering trust in AI-assisted code reviews.

[AI-76] Dynamic Cogeneration of Bug Reproduction Test in Agentic Program Repair

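[Quick Read]: This paper addresses the fact that current Automated Program Repair (APR) systems usually handle fix generation and Bug Reproduction Test (BRT) generation separately, which makes it hard for developers to validate AI-generated patches and requires maintaining separate generation pipelines, adding engineering complexity. The key to the solution is a "cogeneration" strategy in which the APR agent is instructed to generate both the fix and the corresponding BRT in the same patch, which increases confidence in the patch and reduces cross-component coordination costs. Experiments show that this approach generates BRTs for at least as many bugs as a dedicated BRT generator without lowering the generation rate of plausible fixes, improving efficiency and maintainability in large-scale deployment.

Link: https://arxiv.org/abs/2601.19066
Authors: Runxiang Cheng, Michele Tufano, José Cambronero, Renyao Wei, Sherry Shi, Grant Uy, Pat Rondon, Franjo Ivančić
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: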

Abstract:Bug Reproduction Tests (BRTs) have been used in many agentic Automated Program Repair (APR) systems, primarily for validating promising fixes and aiding fix generation. In practice, when developers submit a patch, they often implement the BRT alongside the fix. Our experience deploying agentic APR reveals that developers similarly desire a BRT within AI-generated patches to increase their confidence. However, canonical APR systems tend to generate BRTs and fixes separately, or focus on producing only the fix in the final patch. In this paper, we study agentic APR in the context of cogeneration, where the APR agent is instructed to generate both a fix and a BRT in the same patch. We evaluate the effectiveness of different cogeneration strategies on 120 human-reported bugs at Google and characterize different cogeneration strategies by their influence on APR agent behavior. We develop and evaluate patch selectors that account for test change information to select patches with plausible fixes (and plausible BRTs). Finally, we analyze the root causes of failed cogeneration trajectories. Importantly, we show that cogeneration allows the APR agent to generate BRTs for at least as many bugs as a dedicated BRT agent, without compromising the generation rate of plausible fixes, thereby reducing engineering effort in maintaining and coordinating separate generation pipelines for fix and BRT at scale.

[AI-77] From Answer Givers to Design Mentors: Guiding LLMs with the Cognitive Apprenticeship Model

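[Quick Read]: This paper addresses the problem that current Large Language Models (LLMs) in design-support settings often give generic, one-off suggestions that do little to stimulate users' design reflection and deeper engagement. The key to the solution is to apply the Cognitive Apprenticeship Model: structured prompting turns the model's six instructional methods (modeling, coaching, scaffolding, articulation, reflection, and exploration) into concrete interaction strategies, guiding the LLM to act as a design mentor and prompting users toward deeper design reasoning and feedback exchanges.

Link: https://arxiv.org/abs/2601.19053
Authors: Yongsu Ahn, Lejun R Liao, Benjamin Bach, Nam Wook Kim
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: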

Abstract:Design feedback helps practitioners improve their artifacts while also fostering reflection and design reasoning. Large Language Models (LLMs) such as ChatGPT can support design work, but often provide generic, one-off suggestions that limit reflective engagement. We investigate how to guide LLMs to act as design mentors by applying the Cognitive Apprenticeship Model, which emphasizes demonstrating reasoning through six methods: modeling, coaching, scaffolding, articulation, reflection, and exploration. We operationalize these instructional methods through structured prompting and evaluate them in a within-subjects study with data visualization practitioners. Participants interacted with both a baseline LLM and an instructional LLM designed with cognitive apprenticeship prompts. Surveys, interviews, and conversational log analyses compared experiences across conditions. Our findings show that cognitively informed prompts elicit deeper design reasoning and more reflective feedback exchanges, though the baseline is sometimes preferred depending on task types or experience levels. We distill design considerations for AI-assisted feedback systems that foster reflective practice.

[AI-78] A Unifying View of Coverage in Linear Off-Policy Evaluation ICLR2026

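[Quick Read]: This paper addresses statistical efficiency in linear off-policy evaluation (linear OPE), and in particular how to characterize the coverage parameter accurately enough to obtain tight finite-sample error bounds in the minimal setting where only the target value function is assumed to be linearly representable in the features (linear realizability). Coverage parameters used in prior analyses often lack theoretical consistency or are disconnected from standard definitions, leaving a fragmented understanding of algorithm performance. The paper gives a novel finite-sample analysis of the classic algorithm LSTDQ (least-squares temporal-difference learning for Q-functions). Its key innovation is a new notion of coverage, feature-dynamics coverage, which can be interpreted as linear coverage in the dynamical system induced by feature evolution and which yields more natural error bounds from an instrumental-variable perspective. Under additional assumptions such as Bellman completeness, the definition recovers the coverage parameters known for those specific settings, giving a unified understanding of coverage in linear OPE.

Link: https://arxiv.org/abs/2601.19030
Authors: Philip Amortila, Audrey Huang, Akshay Krishnamurthy, Nan Jiang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: To appear at ICLR 2026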

Abstract:Off-policy evaluation (OPE) is a fundamental task in reinforcement learning (RL). In the classic setting of linear OPE, finite-sample guarantees often take the form $\text{Evaluation error} \le \text{poly}(C^\pi, d, 1/n, \log(1/\delta))$, where $d$ is the dimension of the features and $C^\pi$ is a coverage parameter that characterizes the degree to which the visited features lie in the span of the data distribution. While such guarantees are well-understood for several popular algorithms under stronger assumptions (e.g. Bellman completeness), the understanding is lacking and fragmented in the minimal setting where only the target value function is linearly realizable in the features. Despite recent interest in tight characterizations of the statistical rate in this setting, the right notion of coverage remains unclear, and candidate definitions from prior analyses have undesirable properties and are starkly disconnected from more standard definitions in the literature. We provide a novel finite-sample analysis of a canonical algorithm for this setting, LSTDQ. Inspired by an instrumental-variable view, we develop error bounds that depend on a novel coverage parameter, the feature-dynamics coverage, which can be interpreted as linear coverage in an induced dynamical system for feature evolution. With further assumptions – such as Bellman-completeness – our definition successfully recovers the coverage parameters specialized to those settings, finally yielding a unified understanding for coverage in linear OPE.
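
For readers unfamiliar with LSTDQ, the estimator itself is short: with features $\phi(s,a)$, discount $\gamma$, and target policy $\pi$, it solves $A\theta = b$ where $A = \sum_i \phi_i(\phi_i - \gamma\phi'_i)^\top$, $b = \sum_i \phi_i r_i$, and $\phi'_i = \phi(s'_i, \pi(s'_i))$. The sketch below is a dependency-free version of this standard estimator on a two-state toy problem; the helper names and the toy data are assumptions for illustration, and nothing here reproduces the paper's analysis or its coverage parameter.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (features lack coverage)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def lstdq(transitions, gamma):
    """Estimate theta such that Q(s, a) is approximately phi(s, a) . theta.

    transitions: list of (phi, reward, phi_next), where phi_next holds the
    features of (s', pi(s')) for the target policy pi.
    """
    d = len(transitions[0][0])
    a = [[0.0] * d for _ in range(d)]
    b = [0.0] * d
    for phi, reward, phi_next in transitions:
        for i in range(d):
            b[i] += phi[i] * reward
            for j in range(d):
                a[i][j] += phi[i] * (phi[j] - gamma * phi_next[j])
    return solve_linear(a, b)


if __name__ == "__main__":
    # Toy chain with one-hot features: state 0 (reward 0) moves to state 1,
    # which loops on itself with reward 1.
    data = [([1.0, 0.0], 0.0, [0.0, 1.0]), ([0.0, 1.0], 1.0, [0.0, 1.0])]
    print(lstdq(data, gamma=0.5))
```

The singular-matrix error is where coverage enters in practice: if the data never excite some direction of feature space, $A$ cannot be inverted reliably and the estimate is undefined or unstable.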

[AI-79] EVEREST: An Evidential Tail-Aware Transformer for Rare-Event Time-Series Forecasting

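[Quick Read]: This paper addresses the challenges of forecasting rare events in multivariate time-series data, including severe class imbalance, long-range dependencies, and distributional uncertainty. The key to the solution is EVEREST, a Transformer-based probabilistic rare-event forecasting model with four core components: (i) a learnable attention bottleneck for soft aggregation of temporal dynamics; (ii) an evidential head that estimates epistemic and aleatoric uncertainty via a Normal-Inverse-Gamma distribution; (iii) an extreme-value head that models tail risk with a Generalized Pareto Distribution; and (iv) a lightweight precursor head for early-event detection. These modules are optimized jointly during training, while deployment uses a single classification head with no inference overhead, giving calibrated, tail-aware predictions with attention-based interpretability.

Link: https://arxiv.org/abs/2601.19022
Authors: Antanas Zilinskas, Robert N. Shorten, Jakub Marecek
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: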

Abstract:Forecasting rare events in multivariate time-series data is challenging due to severe class imbalance, long-range dependencies, and distributional uncertainty. We introduce EVEREST, a transformer-based architecture for probabilistic rare-event forecasting that delivers calibrated predictions and tail-aware risk estimation, with auxiliary interpretability via attention-based signal attribution. EVEREST integrates four components: (i) a learnable attention bottleneck for soft aggregation of temporal dynamics; (ii) an evidential head for estimating aleatoric and epistemic uncertainty via a Normal–Inverse–Gamma distribution; (iii) an extreme-value head that models tail risk using a Generalized Pareto Distribution; and (iv) a lightweight precursor head for early-event detection. These modules are jointly optimized with a composite loss (focal loss, evidential NLL, and a tail-sensitive EVT penalty) and act only at training time; deployment uses a single classification head with no inference overhead (approximately 0.81M parameters). On a decade of space-weather data, EVEREST achieves state-of-the-art True Skill Statistic (TSS) of 0.973/0.970/0.966 at 24/48/72-hour horizons for C-class flares. The model is compact, efficient to train on commodity hardware, and applicable to high-stakes domains such as industrial monitoring, weather, and satellite diagnostics. Limitations include reliance on fixed-length inputs and exclusion of image-based modalities, motivating future extensions to streaming and multimodal forecasting.
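
The True Skill Statistic (TSS) reported above is the true positive rate minus the false positive rate, which makes it insensitive to class imbalance in a way that plain accuracy is not. The snippet below is a small sketch of that standard definition; the function name and the example counts are made up for illustration and are not results from the paper.

```python
def true_skill_statistic(tp, fn, fp, tn):
    """TSS = TP / (TP + FN) - FP / (FP + TN), ranging from -1 to 1."""
    positives = tp + fn
    negatives = fp + tn
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    return tp / positives - fp / negatives


if __name__ == "__main__":
    # A classifier that always predicts "no event" on a 1%-positive dataset
    # is 99% accurate but has no skill.
    print(true_skill_statistic(tp=0, fn=10, fp=0, tn=990))
    # A classifier that catches 8 of 10 events with a 10% false alarm rate.
    print(true_skill_statistic(tp=8, fn=2, fp=99, tn=891))
```

Because both terms are rates within their own class, changing the ratio of events to non-events leaves the score unchanged, which is why it is a common choice for rare-event forecasting.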

[AI-80] Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective ICLR2026

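[Quick Read]: This paper addresses the trade-off between cache eviction policy and query routing for Key-Value (KV) caching in Large Language Model (LLM) inference under limited memory. The conventional Least Recently Used (LRU) eviction algorithm performs poorly with dynamic online query arrivals, especially in multi-LLM serving, where load balancing and cache hit rate are inherently in conflict. The paper proposes the first unified mathematical model of the trade-off between KV cache eviction and query routing, and from it designs a combined scheme that integrates a provably competitive randomized cache eviction algorithm with learning-based adaptive query routing, jointly optimizing load balance and cache efficiency. Experiments across multiple benchmarks and prefix-sharing settings show substantial gains over state-of-the-art methods: up to 6.92× higher cache hit rate, 11.96× lower latency, 14.06× lower time-to-first-token (TTFT), and 77.4% higher throughput.

Link: https://arxiv.org/abs/2601.18999
Authors: Fangzhou Wu, Sandeep Silwal, Qiuyi (Richard) Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICLR 2026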

Abstract:KV caching is a fundamental technique for accelerating Large Language Model (LLM) inference by reusing key-value (KV) pairs from previous queries, but its effectiveness under limited memory is highly sensitive to the eviction policy. The default Least Recently Used (LRU) eviction algorithm struggles with dynamic online query arrivals, especially in multi-LLM serving scenarios, where balancing query load across workers and maximizing cache hit rate of each worker are inherently conflicting objectives. We give the first unified mathematical model that captures the core trade-offs between KV cache eviction and query routing. Our analysis reveals the theoretical limitations of existing methods and leads to principled algorithms that integrate provably competitive randomized KV cache eviction with learning-based methods to adaptively route queries with evolving patterns, thus balancing query load and cache hit rate. Our theoretical results are validated by extensive experiments across 4 benchmarks and 3 prefix-sharing settings, demonstrating improvements of up to $6.92\times$ in cache hit rate, $11.96\times$ reduction in latency, $14.06\times$ reduction in time-to-first-token (TTFT), and 77.4% increase in throughput over the state-of-the-art methods. Our code is available at this https URL.
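
As background on what "provably competitive randomized eviction" can mean, the sketch below implements the classic randomized marking algorithm from the online paging literature, which evicts a uniformly random unmarked entry and is known to be competitive against an oblivious adversary. It is shown as a generic example only: the abstract does not say which randomized rule the paper uses, and the class name and its counters are hypothetical.

```python
import random


class RandomizedMarkingCache:
    """Fixed-capacity cache using the randomized marking eviction rule."""

    def __init__(self, capacity, seed=None):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.marked = set()     # entries used in the current phase
        self.unmarked = set()   # entries not yet used in the current phase
        self.rng = random.Random(seed)
        self.hits = 0
        self.misses = 0

    def access(self, key):
        """Record an access; return True on a cache hit, False on a miss."""
        if key in self.marked:
            self.hits += 1
            return True
        if key in self.unmarked:
            self.unmarked.remove(key)
            self.marked.add(key)
            self.hits += 1
            return True

        self.misses += 1
        if len(self.marked) + len(self.unmarked) >= self.capacity:
            if not self.unmarked:
                # Every entry is marked: start a new phase by unmarking all.
                self.unmarked, self.marked = self.marked, set()
            victim = self.rng.choice(sorted(self.unmarked))
            self.unmarked.remove(victim)
        self.marked.add(key)
        return False


if __name__ == "__main__":
    cache = RandomizedMarkingCache(capacity=2, seed=0)
    for prefix in ["a", "b", "a", "c", "a", "b", "c"]:
        cache.access(prefix)
    print(cache.hits, cache.misses)
```

Unlike LRU, the victim cannot be predicted from the request order alone, which is what protects the policy against adversarial or cyclic access patterns that make LRU miss on every request.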

[AI-81] When Does Adaptation Win? Scaling Laws for Meta-Learning in Quantum Control

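[Quick Read]: This paper addresses the control problem that device heterogeneity and environmental drift create for quantum hardware, namely the difficult choice between underperforming non-adaptive controllers and costly per-device calibration. The key to the solution is a scaling-law lower bound for meta-learning that quantifies the adaptation gain (the expected fidelity improvement from task-specific gradient steps): the gain saturates exponentially with the number of gradient steps and scales linearly with task variance, which gives a computable criterion for deciding whether adaptation is worth its overhead. Empirically, adaptation yields 40% fidelity gains on two-qubit gates under extreme out-of-distribution conditions (10× the training noise), with implications for reducing per-device calibration time on cloud quantum processors. The same laws hold in classical linear-quadratic control, indicating that they arise from general optimization geometry rather than quantum-specific physics.

Link: https://arxiv.org/abs/2601.18973
Authors: Nima Leclerc, Chris Miller, Nicholas Brawand
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Quantum Physics (quant-ph)
Comments: 28 pages, 11 figures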

Abstract:Quantum hardware suffers from intrinsic device heterogeneity and environmental drift, forcing practitioners to choose between suboptimal non-adaptive controllers or costly per-device recalibration. We derive a scaling law lower bound for meta-learning showing that the adaptation gain (expected fidelity improvement from task-specific gradient steps) saturates exponentially with gradient steps and scales linearly with task variance, providing a quantitative criterion for when adaptation justifies its overhead. Validation on quantum gate calibration shows negligible benefits for low-variance tasks but 40% fidelity gains on two-qubit gates under extreme out-of-distribution conditions ($10\times$ the training noise), with implications for reducing per-device calibration time on cloud quantum processors. Further validation on classical linear-quadratic control confirms these laws emerge from general optimization geometry rather than quantum-specific physics. Together, these results offer a transferable framework for decision-making in adaptive control.
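
One simple functional form that matches the qualitative description above (exponential saturation in the number of gradient steps, linear scaling in task variance) is $G(K) = c\,\sigma^2\,(1 - e^{-\beta K})$. The sketch below uses that form to show how such a law could drive an adapt-or-not decision. The exact expression, the constants `scale` and `rate`, and the function names are all assumptions made for this illustration; the paper's actual bound may differ.

```python
import math


def adaptation_gain(steps, task_variance, scale=1.0, rate=0.5):
    """Assumed gain model: scale * variance * (1 - exp(-rate * steps))."""
    if steps < 0 or task_variance < 0:
        raise ValueError("steps and task_variance must be non-negative")
    return scale * task_variance * (1.0 - math.exp(-rate * steps))


def adaptation_is_worthwhile(steps, task_variance, overhead, scale=1.0, rate=0.5):
    """Adapt only if the modeled gain exceeds the cost of adapting."""
    return adaptation_gain(steps, task_variance, scale, rate) > overhead


if __name__ == "__main__":
    for variance in (0.01, 0.4):
        gains = [round(adaptation_gain(k, variance), 4) for k in (0, 1, 5, 20)]
        print(f"variance={variance}: gains={gains}")
    print(adaptation_is_worthwhile(steps=5, task_variance=0.01, overhead=0.05))
    print(adaptation_is_worthwhile(steps=5, task_variance=0.4, overhead=0.05))
```

Under this model, low-variance task families never clear the overhead no matter how many steps are taken, while high-variance families get most of the available gain within the first few steps, which mirrors the contrast the abstract reports between low-variance tasks and out-of-distribution two-qubit gates.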

[AI-82] Fauna Sprout: A lightweight, approachable, developer-ready humanoid robot

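[Quick Read]: This paper addresses the difficulty of deploying humanoid robots safely, flexibly, and over the long term in human environments: existing systems are either closed industrial machines or hard-to-operate academic prototypes, which limits practical use and progress in robotics. The key to the solution is the Sprout platform, which achieves physical safety through a lightweight design, compliant control, limited joint torques, and soft exteriors; integrates whole-body control, manipulation with integrated grippers, and virtual-reality-based teleoperation in a unified hardware-software stack; and adds an expressive head to support social interaction. Together these features lower the physical and technical barriers to deployment and support the development of embodied intelligence in real human environments.

Link: https://arxiv.org/abs/2601.18963
Authors: Fauna Robotics: Diego Aldarondo, Ana Pervan, Daniel Corbalan, Dave Petrillo, Bolun Dai, Aadhithya Iyer, Nina Mortensen, Erik Pearson, Sridhar Pandian Arunachalam, Emma Reznick, David Weis, Jacob Davison, Samuel Patterson, Tess Carella, Michael Suguitan, David Ye, Oswaldo Ferro, Nilesh Suriyarachchi, Spencer Ling, Erik Su, Daniel Giebisch, Peter Traver, Sam Fonseca, Mack Mor, Rohan Singh, Sertac Guven, Kangni Liu, Yaswanth Kumar Orru, Ashiq Rahman Anwar Batcha, Shruthi Ravindranath, Silky Arora, Hugo Ponte, Dez Hernandez, Utsav Chaudhary, Zack Walker, Michael Kelberman, Ivan Veloz, Christina Santa Lucia, Kat Casale, Helen Han, Michael Gromis, Michael Mignatti, Jason Reisman, Kelleher Guerin, Dario Narvaez, Christopher Anderson, Anthony Moschella, Robert Cochran, Josh Merel
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: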

Abstract:Recent advances in learned control, large-scale simulation, and generative models have accelerated progress toward general-purpose robotic controllers, yet the field still lacks platforms suitable for safe, expressive, long-term deployment in human environments. Most existing humanoids are either closed industrial systems or academic prototypes that are difficult to deploy and operate around people, limiting progress in robotics. We introduce Sprout, a developer platform designed to address these limitations through an emphasis on safety, expressivity, and developer accessibility. Sprout adopts a lightweight form factor with compliant control, limited joint torques, and soft exteriors to support safe operation in shared human spaces. The platform integrates whole-body control, manipulation with integrated grippers, and virtual-reality-based teleoperation within a unified hardware-software stack. An expressive head further enables social interaction – a domain that remains underexplored on most utilitarian humanoids. By lowering physical and technical barriers to deployment, Sprout expands access to capable humanoid platforms and provides a practical basis for developing embodied intelligence in real human environments.

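[AI-83] Tricky²: Towards a Benchmark for Evaluating Human and LLM Error Interactions

[Quick Read]: This paper addresses the unclear interaction between the logic and data-misuse errors that generative AI introduces into software development and the defects written by human developers. The key to its solution is Tricky², a hybrid dataset built with a taxonomy-guided prompting framework that injects machine-originated errors generated by GPT-5 and OpenAI-oss-20b while preserving the original human defects and program structure. The result is three splits (human-only, LLM-only, and human+LLM) that support systematic analysis of mixed-origin error behavior, multi-bug repair robustness, and the reliability of hybrid human-machine code.

Link: https://arxiv.org/abs/2601.18949
Authors: Cole Granger, Dipin Khati, Daniel Rodriguez-Cardenas, Denys Poshyvanyk
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
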
Abstract:Large language models (LLMs) are increasingly integrated into software development workflows, yet they often introduce subtle logic or data-misuse errors that differ from human bugs. To study how these two error types interact, we construct Tricky², a hybrid dataset that augments the existing TrickyBugs corpus of human-written defects with errors injected by both GPT-5 and OpenAI-oss-20b across C++, Python, and Java programs. Our approach uses a taxonomy-guided prompting framework to generate machine-originated bugs while preserving original human defects and program structure. The resulting corpus spans human-only, LLM-only, and human+LLM splits, enabling analysis of mixed-origin error behavior, multi-bug repair robustness, and reliability in hybrid human-machine code. This paper outlines the dataset construction pipeline and illustrates its use through small-scale baseline evaluations of classification, localization, and repair tasks.

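[AI-84] Neural Theorem Proving for Verification Conditions: A Real-World Benchmark ICLR’26

[Quick Read]: This paper targets a core bottleneck in program verification: automatically proving verification conditions (VCs). Existing automated theorem provers (ATPs) struggle with the hard VCs that arise in real-world programs, which forces extensive manual proof effort and hampers practical use. The key contribution is Neural Theorem Proving for Verification Conditions (NTP4VC), the first real-world multi-language benchmark for automated VC proving. Built from industrial-grade projects such as the Linux and Contiki-OS kernels, it uses Why3 and Frama-C to generate semantically equivalent test cases across the formal languages Isabelle, Lean, and Rocq. The authors evaluate both general-purpose large language models (LLMs) and LLMs fine-tuned for theorem proving on the benchmark, finding that current methods show promise but remain far from practical, which points to directions for future research.

Link: https://arxiv.org/abs/2601.18944
Authors: Qiyuan Xu, Xiaokun Luan, Renxi Wang, Joshua Ong Jun Leang, Peixin Wang, Haonan Li, Wenda Li, Conrad Watt
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE)
Comments: Accepted in ICLR’26
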
Abstract:Theorem proving is fundamental to program verification, where the automated proof of Verification Conditions (VCs) remains a primary bottleneck. Real-world program verification frequently encounters hard VCs that existing Automated Theorem Provers (ATPs) cannot prove, leading to a critical need for extensive manual proofs that burden practical application. While Neural Theorem Proving (NTP) has achieved significant success in mathematical competitions, demonstrating the potential of machine learning approaches to formal reasoning, its application to program verification–particularly VC proving–remains largely unexplored. Despite existing work on annotation synthesis and verification-related theorem proving, no benchmark has specifically targeted this fundamental bottleneck: automated VC proving. This work introduces Neural Theorem Proving for Verification Conditions (NTP4VC), presenting the first real-world multi-language benchmark for this task. From real-world projects such as Linux and Contiki-OS kernel, our benchmark leverages industrial pipelines (Why3 and Frama-C) to generate semantically equivalent test cases across formal languages of Isabelle, Lean, and Rocq. We evaluate large language models (LLMs), both general-purpose and those fine-tuned for theorem proving, on NTP4VC. Results indicate that although LLMs show promise in VC proving, significant challenges remain for program verification, highlighting a large gap and opportunity for future research.

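[AI-85] Configurable p-Neurons Using Modular p-Bits ISCAS2026

[Quick Read]: This paper addresses the limited range of activation functions in current neural networks built on probabilistic bits (p-bits): existing designs support only sigmoidal probabilistic activation functions and lack a flexible way to realize other widely used ones. The key to the solution is a re-engineered p-bit whose stochastic signal path is decoupled from its input data path. The resulting modular p-bit enables configurable probabilistic activation functions, including probabilistic versions of Logistic Sigmoid, Tanh, and ReLU. Spintronic (CMOS + sMTJ) designs and a digital-CMOS FPGA implementation demonstrate feasibility, and stochastic unit sharing cuts hardware resource usage by an order of magnitude (10x).

Link: https://arxiv.org/abs/2601.18943
Authors: Saleh Bunaiyan, Mohammad Alsharif, Abdelrahman S. Abdelrahman, Hesham ElSawy, Suraj S. Cheema, Suhaib A. Fahmy, Kerem Y. Camsari, Feras Al-Dirini
Affiliations: Unknown
Categories: Emerging Technologies (cs.ET); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: Accepted for presentation at IEEE ISCAS 2026 as a lecture
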
Abstract:Probabilistic bits (p-bits) have recently been employed in neural networks (NNs) as stochastic neurons with sigmoidal probabilistic activation functions. Nonetheless, there remain a wealth of other probabilistic activation functions that are yet to be explored. Here we re-engineer the p-bit by decoupling its stochastic signal path from its input data path, giving rise to a modular p-bit that enables the realization of probabilistic neurons (p-neurons) with a range of configurable probabilistic activation functions, including a probabilistic version of the widely used Logistic Sigmoid, Tanh and Rectified Linear Unit (ReLU) activation functions. We present spintronic (CMOS + sMTJ) designs that show wide and tunable probabilistic ranges of operation. Finally, we experimentally implement digital-CMOS versions on an FPGA, with stochastic unit sharing, and demonstrate an order of magnitude (10x) saving in required hardware resources compared to conventional digital p-bit implementations.
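
For context, the conventional sigmoidal p-bit that the abstract starts from is commonly modelled in the p-bit literature as m = sgn(tanh(I) - r), with r drawn uniformly from [-1, 1], so that the average output tracks tanh(I). The sketch below simulates only that standard behavioural model; it is not the authors' modular design, and the function names `pbit_sample` and `pbit_mean` are hypothetical.

```python
import math
import random


def pbit_sample(input_current: float, rng: random.Random) -> int:
    """Draw one output (+1 or -1) from a conventional sigmoidal p-bit.

    Behavioural model: m = sgn(tanh(I) - r), r ~ Uniform[-1, 1].
    """
    r = rng.uniform(-1.0, 1.0)
    return 1 if math.tanh(input_current) > r else -1


def pbit_mean(input_current: float, n_samples: int = 20000, seed: int = 0) -> float:
    """Average many samples; the mean should approach tanh(I)."""
    rng = random.Random(seed)
    total = sum(pbit_sample(input_current, rng) for _ in range(n_samples))
    return total / n_samples


if __name__ == "__main__":
    for current in (-2.0, 0.0, 0.5, 2.0):
        print(current, round(pbit_mean(current), 3), round(math.tanh(current), 3))
```

Because the probability of a +1 output is (1 + tanh(I)) / 2, the empirical mean is a sigmoid-shaped function of the input, which is the single activation shape that the paper sets out to generalize.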

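[AI-86] Toward Learning POMDPs Beyond Full-Rank Actions and State Observability

[Quick Read]: This paper studies how an autonomous agent can learn and reason about systems with hidden states (such as furniture with hidden locking mechanisms) when the state space is unknown, by estimating the parameters of a Partially Observable Markov Decision Process (POMDP) from action-observation sequences. The core challenge is building transition and observation models from those sequences: spectral methods such as Predictive State Representations (PSRs) directly estimate the number of hidden states but do not yield explicit transition and observation likelihoods, which matter for downstream reasoning. The key idea is to combine PSRs with tensor decomposition. PSRs learn the transition and observation matrices up to a similarity transform, which tensor methods can then estimate; the resulting model is determined up to a partition of states, where states in the same partition share the same observation distributions for actions whose transition matrices are full-rank. Experiments suggest that, given enough data, the partition-level transition models match the performance of PSRs when used with standard sampling-based POMDP solvers, and the explicit likelihoods can also be used to specify planner behavior.

Link: https://arxiv.org/abs/2601.18930
Authors: Seiji Shaw, Travis Manderson, Chad Kessens, Nicholas Roy
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
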
Abstract:We are interested in enabling autonomous agents to learn and reason about systems with hidden states, such as furniture with hidden locking mechanisms. We cast this problem as learning the parameters of a discrete Partially Observable Markov Decision Process (POMDP). The agent begins with knowledge of the POMDP’s actions and observation spaces, but not its state space, transitions, or observation models. These properties must be constructed from action-observation sequences. Spectral approaches to learning models of partially observable domains, such as learning Predictive State Representations (PSRs), are known to directly estimate the number of hidden states. These methods cannot, however, yield direct estimates of transition and observation likelihoods, which are important for many downstream reasoning tasks. Other approaches leverage tensor decompositions to estimate transition and observation likelihoods but often assume full state observability and full-rank transition matrices for all actions. To relax these assumptions, we study how PSRs learn transition and observation matrices up to a similarity transform, which may be estimated via tensor methods. Our method learns observation matrices and transition matrices up to a partition of states, where the states in a single partition have the same observation distributions corresponding to actions whose transition matrices are full-rank. Our experiments suggest that these partition-level transition models learned by our method, with a sufficient amount of data, meets the performance of PSRs as models to be used by standard sampling-based POMDP solvers. Furthermore, the explicit observation and transition likelihoods can be leveraged to specify planner behavior after the model has been learned.
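
To illustrate why explicit transition and observation likelihoods matter downstream, here is a minimal sketch of the standard discrete POMDP belief update, b'(s') ∝ O(o | s', a) Σ_s T(s' | s, a) b(s). It is textbook Bayes filtering, not the paper's learning method, and the two-state "locked drawer" numbers in the usage lines are made up for illustration.

```python
def belief_update(belief, T, O, action, observation):
    """Bayes-filter update for a discrete POMDP.

    belief: list of P(s)
    T[a][s][s_next]: transition probability
    O[a][s_next][o]: observation probability after taking action a
    """
    n_states = len(belief)
    unnormalized = []
    for s_next in range(n_states):
        # Prediction step: probability of landing in s_next.
        predicted = sum(T[action][s][s_next] * belief[s] for s in range(n_states))
        # Correction step: weight by the likelihood of the observation.
        unnormalized.append(O[action][s_next][observation] * predicted)
    normalizer = sum(unnormalized)
    if normalizer == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [u / normalizer for u in unnormalized]


# Hypothetical example: state 0 = locked, state 1 = unlocked.
# Action 0 = "pull", which does not change the lock state.
T = [[[1.0, 0.0], [0.0, 1.0]]]
# Observation 0 = "stuck", observation 1 = "opens".
O = [[[1.0, 0.0], [0.1, 0.9]]]

posterior = belief_update([0.5, 0.5], T, O, action=0, observation=0)
print(posterior)
```

A PSR can predict future observations without ever exposing `T` and `O`; an update like this one needs them explicitly, which is the gap the abstract describes.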

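[AI-87] RIFT: Reordered Instruction Following Testbed To Evaluate Instruction Following in Singular Multistep Prompt Structures ACL

[Quick Read]: This paper examines how well large language models (LLMs) maintain the flow of instructions in complex workflows, noting that existing benchmarks conflate task complexity with prompt structure and so cannot isolate the effect of structure on performance. The key to its solution is RIFT (Reordered Instruction Following Testbed), which uses rephrased Jeopardy! question-answer pairs and two prompt structures, linear prompts and jumping prompts, to decouple prompt content from structure and measure sensitivity to instruction order in isolation. Under jumping prompts, accuracy drops by up to 72%, revealing a strong dependence on positional continuity. About 50% of failures stem from instruction-order violations and semantic drift, suggesting that current models treat instruction following as a sequential pattern rather than a reasoning skill.

Link: https://arxiv.org/abs/2601.18924
Authors: Andrew Jaffe, Noah Reicin, Jinho D. Choi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 13 pages, 5 figures, submitted to ACL ARR
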
Abstract:Large Language Models (LLMs) are increasingly relied upon for complex workflows, yet their ability to maintain flow of instructions remains underexplored. Existing benchmarks conflate task complexity with structural ordering, making it difficult to isolate the impact of prompt topology on performance. We introduce RIFT, Reordered Instruction Following Testbed, to assess instruction following by disentangling structure from content. Using rephrased Jeopardy! question-answer pairs, we test LLMs across two prompt structures: linear prompts, which progress sequentially, and jumping prompts, which preserve identical content but require non-sequential traversal. Across 10,000 evaluations spanning six state-of-the-art open-source LLMs, accuracy dropped by up to 72% under jumping conditions (compared to baseline), revealing a strong dependence on positional continuity. Error analysis shows that approximately 50% of failures stem from instruction-order violations and semantic drift, indicating that current architectures internalize instruction following as a sequential pattern rather than a reasoning skill. These results reveal structural sensitivity as a fundamental limitation in current architectures, with direct implications for applications requiring non-sequential control flow such as workflow automation and multi-agent systems.

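[AI-88] Explainable Uncertainty Quantification for Wastewater Treatment Energy Prediction via Interval Type-2 Neuro-Fuzzy System

[Quick Read]: This paper addresses the lack of explainable uncertainty quantification in energy forecasting for wastewater treatment plants, which is essential for risk-aware decision-making in safety-critical infrastructure. Existing machine learning models provide point predictions but cannot explain where predictive uncertainty comes from, which limits trust and practical use in operations. The key to the solution is an Interval Type-2 Adaptive Neuro-Fuzzy Inference System (IT2-ANFIS) that produces interpretable prediction intervals through its fuzzy rule structure and decomposes uncertainty across three levels: the feature level identifies which variables introduce ambiguity, the rule level reveals confidence in local models, and the instance level quantifies overall prediction uncertainty. Validated on Melbourne Water’s Eastern Treatment Plant dataset, the method achieves predictive performance comparable to first-order ANFIS with substantially lower variance across training runs, and it links prediction confidence to operating conditions and input variables.

Link: https://arxiv.org/abs/2601.18897
Authors: Qusai Khaled, Bahjat Mallak, Uzay Kaymak, Laura Genga
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to 21st International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU2026)
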
Abstract:Wastewater treatment plants consume 1-3% of global electricity, making accurate energy forecasting critical for operational optimization and sustainability. While machine learning models provide point predictions, they lack explainable uncertainty quantification essential for risk-aware decision-making in safety-critical infrastructure. This study develops an Interval Type-2 Adaptive Neuro-Fuzzy Inference System (IT2-ANFIS) that generates interpretable prediction intervals through fuzzy rule structures. Unlike black-box probabilistic methods, the proposed framework decomposes uncertainty across three levels: feature-level footprints of uncertainty identify which variables introduce ambiguity, rule-level analysis reveals confidence in local models, and instance-level intervals quantify overall prediction uncertainty. Validated on Melbourne Water’s Eastern Treatment Plant dataset, IT2-ANFIS achieves comparable predictive performance to first-order ANFIS with substantially reduced variance across training runs, while providing explainable uncertainty estimates that link prediction confidence directly to operational conditions and input variables.

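[AI-89] Representational Homomorphism Predicts and Improves Compositional Generalization In Transformer Language Model

[Quick Read]: This paper addresses the persistent challenge of compositional generalization in neural networks, that is, their limited ability to interpret novel combinations of familiar components. The key to its solution is Homomorphism Error (HE), a structural metric that quantifies how far the relationship between the expression algebra and the model's hidden-state space deviates from an approximate homomorphism. HE is instantiated on SCAN-style tasks for unary composition (modifier HE) and binary composition (sequence HE) by learning representation-level operators that predict composed representations from their parts, which gives insight into why failures arise at the representational level. Experiments show that HE predicts out-of-distribution (OOD) compositional generalization (R^2 = 0.73 between modifier HE and OOD accuracy), and that explicitly enforcing low modifier HE during training significantly reduces HE and improves OOD accuracy (p = 0.023), indicating that HE can serve as both a diagnostic and an actionable training signal.

Link: https://arxiv.org/abs/2601.18858
Authors: Zhiyu An, Wan Du
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
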
Abstract:Compositional generalization, the ability to interpret novel combinations of familiar components, remains a persistent challenge for neural networks. Behavioral evaluations reveal when models fail but offer limited insight into why failures arise at the representational level. We introduce Homomorphism Error (HE), a structural metric that quantifies deviations from approximate homomorphisms between the expression algebra and a model’s hidden-state space. We instantiate HE for two compositional operators in SCAN-style tasks: modifier HE for unary composition and sequence HE for binary composition, measured by learning representation-level operators that predict composed representations from their parts. Across controlled experiments with small decoder-only Transformers, HE predicts out-of-distribution (OOD) compositional generalization under noise injection, achieving R^2 = 0.73 correlation between modifier HE and OOD accuracy. Ablations show that model depth has minimal effect on either HE or OOD accuracy, training data coverage exhibits threshold effects (insufficient coverage sharply increases HE and degrades OOD performance), and randomly inserted noise tokens systematically increase HE. Finally, we test if HE-regularized training improves OOD accuracy. Experiment shows that explicitly enforcing low modifier HE during training significantly reduces modifier HE (p = 1.1 × 10^-4) and sequence HE (p = 0.001) and yields a statistically significant improvement in OOD accuracy (p = 0.023). Together, these results indicate the potential of HE to be both a diagnostic and an actionable training signal for improving compositional generalization. Code to reproduce our experiments is open-sourced.

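[AI-90] MulVul: Retrieval-augmented Multi-Agent Code Vulnerability Detection via Cross-Model Prompt Evolution

[Quick Read]: This paper addresses two challenges that large language models (LLMs) face in real-world vulnerability detection: the heterogeneity of vulnerability patterns limits a single unified model, and manual prompt engineering does not scale to a massive number of weakness categories. Its solution is MulVul, a retrieval-augmented multi-agent framework with a coarse-to-fine strategy. A Router agent first predicts the top-k coarse categories and forwards the input to specialized Detector agents, which identify the exact vulnerability types; both kinds of agent use retrieval tools to source evidence from vulnerability knowledge bases and reduce hallucinations. The key innovation is Cross-Model Prompt Evolution, in which a generator LLM iteratively refines candidate prompts while a distinct executor LLM validates them, which mitigates the self-correction bias of single-model optimization and improves both adaptability to diverse vulnerability patterns and detection accuracy.

Link: https://arxiv.org/abs/2601.18847
Authors: Zihan Wu, Jie Xu, Yun Peng, Chun Yong Chong, Xiaohua Jia
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
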
Abstract:Large Language Models (LLMs) struggle to automate real-world vulnerability detection due to two key limitations: the heterogeneity of vulnerability patterns undermines the effectiveness of a single unified model, and manual prompt engineering for massive weakness categories is unscalable. To address these challenges, we propose MulVul, a retrieval-augmented multi-agent framework designed for precise and broad-coverage vulnerability detection. MulVul adopts a coarse-to-fine strategy: a Router agent first predicts the top-k coarse categories and then forwards the input to specialized Detector agents, which identify the exact vulnerability types. Both agents are equipped with retrieval tools to actively source evidence from vulnerability knowledge bases to mitigate hallucinations. Crucially, to automate the generation of specialized prompts, we design Cross-Model Prompt Evolution, a prompt optimization mechanism where a generator LLM iteratively refines candidate prompts while a distinct executor LLM validates their effectiveness. This decoupling mitigates the self-correction bias inherent in single-model optimization. Evaluated on 130 CWE types, MulVul achieves 34.79% Macro-F1, outperforming the best baseline by 41.5%. Ablation studies validate cross-model prompt evolution, which boosts performance by 51.6% over manual prompts by effectively handling diverse vulnerability patterns.

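[AI-91] LLM Driven Design of Continuous Optimization Problems with Controllable High-level Properties

[Quick Read]: This paper addresses the limited structural diversity of existing test suites for continuous black-box optimization benchmarking, such as BBOB, which constrains comprehensive evaluation of optimization algorithms. The key to its solution is embedding a large language model (LLM) in an evolutionary loop using the LLaMEA framework: natural-language descriptions of target landscape properties (multimodality, separability, basin-size homogeneity, and others) guide the LLM to generate problem code with clearly defined high-level structure, and candidates are scored by property predictors based on Exploratory Landscape Analysis (ELA). A fitness-sharing mechanism in ELA space increases population diversity and steers generation away from redundant landscapes. Basin-of-attraction analysis, statistical testing, and visual inspection verify that many of the generated functions exhibit the intended structural traits, and a t-SNE embedding shows that they expand the BBOB instance space rather than forming an unrelated cluster. The result is a broad, interpretable, and reproducible library of benchmark problems for landscape analysis and downstream tasks such as automated algorithm selection.

Link: https://arxiv.org/abs/2601.18846
Authors: Urban Skvorc, Niki van Stein, Moritz Seiler, Britta Grimme, Thomas Bäck, Heike Trautmann
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 17 pages, accepted at EvoApplications 2026
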
Abstract:Benchmarking in continuous black-box optimisation is hindered by the limited structural diversity of existing test suites such as BBOB. We explore whether large language models embedded in an evolutionary loop can be used to design optimisation problems with clearly defined high-level landscape characteristics. Using the LLaMEA framework, we guide an LLM to generate problem code from natural-language descriptions of target properties, including multimodality, separability, basin-size homogeneity, search-space homogeneity and global-local optima contrast. Inside the loop we score candidates through ELA-based property predictors. We introduce an ELA-space fitness-sharing mechanism that increases population diversity and steers the generator away from redundant landscapes. A complementary basin-of-attraction analysis, statistical testing and visual inspection, verifies that many of the generated functions indeed exhibit the intended structural traits. In addition, a t-SNE embedding shows that they expand the BBOB instance space rather than forming an unrelated cluster. The resulting library provides a broad, interpretable, and reproducible set of benchmark problems for landscape analysis and downstream tasks such as automated algorithm selection.
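
The abstract does not spell out how its ELA-space fitness sharing is computed. As a rough illustration of the general idea, the sketch below implements classic fitness sharing from the evolutionary computation literature: each candidate's fitness is divided by a niche count that grows with the number of close neighbours in feature space. Treating the feature vectors as ELA features, the triangular sharing kernel, and the names `sharing` and `shared_fitness` are all assumptions, and fitness is assumed to be maximized.

```python
import math


def sharing(distance: float, sigma: float, alpha: float = 1.0) -> float:
    """Sharing kernel: 1 at distance 0, falling to 0 at the niche radius sigma."""
    if distance < sigma:
        return 1.0 - (distance / sigma) ** alpha
    return 0.0


def shared_fitness(fitness, features, sigma, alpha=1.0):
    """Divide each raw fitness by its niche count in feature space.

    fitness: raw fitness values (higher is better)
    features: one feature vector per candidate (assumed here to be ELA features)
    """
    result = []
    for i, raw in enumerate(fitness):
        # The niche count includes the candidate itself, so it is at least 1.
        niche_count = sum(
            sharing(math.dist(features[i], other), sigma, alpha) for other in features
        )
        result.append(raw / niche_count)
    return result


# Two identical landscapes split their fitness; the distinct one keeps its own.
print(shared_fitness([1.0, 1.0, 1.0], [(0.0, 0.0), (0.0, 0.0), (5.0, 5.0)], sigma=1.0))
```

Under this scheme a generator that keeps producing near-duplicate landscapes is penalized, which matches the stated goal of steering away from redundant problems.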

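[AI-92] Reducing False Positives in Static Bug Detection with LLMs: An Empirical Study in Industry

[Quick Read]: This paper addresses the inefficiency caused by high false positive rates when industrial static analysis tools (SATs) are used in practice, especially in large-scale enterprise software where false alarms demand substantial manual inspection. The key to its approach is using large language models (LLMs) to filter false alarms. The authors build a dataset of 433 alarms (328 false positives and 105 true positives) from Tencent's Advertising and Marketing Services software and empirically evaluate a range of LLM-based false alarm reduction techniques. Hybrid techniques that combine LLMs with static analysis eliminate 94-98% of false positives while keeping recall high, at a per-alarm cost as low as 2.1-109.5 seconds and 0.0011-0.12 in monetary cost, which is orders of magnitude cheaper than manual review.

Link: https://arxiv.org/abs/2601.18844
Authors: Xueying Du, Jiayi Feng, Yi Zou, Wei Xu, Jie Ma, Wei Zhang, Sisi Liu, Xin Peng, Yiling Lou
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
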
Abstract:Static analysis tools (SATs) are widely adopted in both academia and industry for improving software quality, yet their practical use is often hindered by high false positive rates, especially in large-scale enterprise systems. These false alarms demand substantial manual inspection, creating severe inefficiencies in industrial code review. While recent work has demonstrated the potential of large language models (LLMs) for false alarm reduction on open-source benchmarks, their effectiveness in real-world enterprise settings remains unclear. To bridge this gap, we conduct the first comprehensive empirical study of diverse LLM-based false alarm reduction techniques in an industrial context at Tencent, one of the largest IT companies in China. Using data from Tencent’s enterprise-customized SAT on its large-scale Advertising and Marketing Services software, we construct a dataset of 433 alarms (328 false positives, 105 true positives) covering three common bug types. Through interviewing developers and analyzing the data, our results highlight the prevalence of false positives, which wastes substantial manual effort (e.g., 10-20 minutes of manual inspection per alarm). Meanwhile, our results show the huge potential of LLMs for reducing false alarms in industrial settings (e.g., hybrid techniques of LLM and static analysis eliminate 94-98% of false positives with high recall). Furthermore, LLM-based techniques are cost-effective, with per-alarm costs as low as 2.1-109.5 seconds and 0.0011-0.12, representing orders-of-magnitude savings compared to manual review. Finally, our case analysis further identifies key limitations of LLM-based false alarm reduction in industrial settings.

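[AI-93] CanaryBench: Stress Testing Privacy Leakage in Cluster-Level Conversation Summaries

[Quick Read]: This paper addresses the privacy risk that cluster-level conversation summaries in large language model systems may leak personally identifying information (PII) or uniquely traceable strings. Even when raw conversations are never exposed, a summary that contains unique identifiers copied from individual conversations can still leak private data. The key to its solution is CanaryBench, a simple and reproducible stress test: secret strings ("canaries") are planted in synthetic conversations to simulate sensitive identifiers, the conversations are embedded with TF-IDF and clustered with k-means, summaries are generated, and any canary that appears in a published summary counts as a measurable leak. Without defenses, 96.15% of canary-containing clusters leak (50 of 52). A lightweight defense that combines a minimum cluster-size publication threshold (k-min = 25) with regex-based redaction eliminates the measured canary and PII indicator leakage while maintaining a similar cluster-coherence proxy.

Link: https://arxiv.org/abs/2601.18834
Authors: Deep Mehta
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 13 pages, 4 figures. Code repository: this https URL
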
Abstract:Aggregate analytics over conversational data are increasingly used for safety monitoring, governance, and product analysis in large language model systems. A common practice is to embed conversations, cluster them, and publish short textual summaries describing each cluster. While raw conversations may never be exposed, these derived summaries can still pose privacy risks if they contain personally identifying information (PII) or uniquely traceable strings copied from individual conversations. We introduce CanaryBench, a simple and reproducible stress test for privacy leakage in cluster-level conversation summaries. CanaryBench generates synthetic conversations with planted secret strings (“canaries”) that simulate sensitive identifiers. Because canaries are known a priori, any appearance of these strings in published summaries constitutes a measurable leak. Using TF-IDF embeddings and k-means clustering on 3,000 synthetic conversations (24 topics) with a canary injection rate of 0.60, we evaluate an intentionally extractive example snippet summarizer that models quote-like reporting. In this configuration, we observe canary leakage in 50 of 52 canary-containing clusters (cluster-level leakage rate 0.961538), along with nonzero regex-based PII indicator counts. A minimal defense combining a minimum cluster-size publication threshold (k-min = 25) and regex-based redaction eliminates measured canary leakage and PII indicator hits in the reported run while maintaining a similar cluster-coherence proxy. We position this work as a societal impacts contribution centered on privacy risk measurement for published analytics artifacts rather than raw user data.
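
The measurement and defense described above can be sketched in a few lines. This is a toy illustration rather than the CanaryBench code: clustering is skipped (clusters are given directly), and the canary format `CANARY-XXXXXXXX`, the first-conversation summarizer, and all function names are hypothetical stand-ins.

```python
import re

# Hypothetical canary format; the paper's actual format is not given here.
CANARY_PATTERN = re.compile(r"CANARY-[0-9A-F]{8}")


def extractive_summary(conversations):
    """Intentionally leaky summarizer: quotes the first conversation verbatim."""
    return conversations[0]


def defended_summary(conversations, k_min):
    """Suppress clusters smaller than k_min, then redact canary-like strings."""
    if len(conversations) < k_min:
        return None  # cluster too small to publish
    return CANARY_PATTERN.sub("[REDACTED]", extractive_summary(conversations))


def leakage_rate(clusters, summarize):
    """Fraction of canary-containing clusters whose published summary leaks one."""
    containing = 0
    leaked = 0
    for conversations in clusters.values():
        canaries = {c for text in conversations for c in CANARY_PATTERN.findall(text)}
        if not canaries:
            continue
        containing += 1
        summary = summarize(conversations)
        if summary is not None and any(c in summary for c in canaries):
            leaked += 1
    return leaked / containing if containing else 0.0


clusters = {
    0: ["my id is CANARY-DEADBEEF", "how do I reset my password"],
    1: ["what is the weather today", "token CANARY-0123ABCD here"],
    2: ["hello there"],
}
print(leakage_rate(clusters, extractive_summary))
print(leakage_rate(clusters, lambda c: defended_summary(c, k_min=2)))
```

The reported cluster-level rate follows the same ratio: 50 leaking clusters out of 52 canary-containing clusters gives 50 / 52 ≈ 0.961538.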

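[AI-94] Agentic Business Process Management Systems

[Quick Read]: This paper considers how business process management (BPM) systems should change with the rise of generative and agentic AI, shifting from automation to autonomy and from design-driven to data-driven process management. The key to its proposal is an architectural vision for Agentic Business Process Management Systems (A-BPMS), a new class of platforms that integrate autonomy, reasoning, and learning so that agents can sense process states, identify improvement opportunities, and act to maintain and optimize performance. Such systems must support a continuum of processes, from human-driven to fully autonomous, redefining the boundaries of process automation and governance.

Link: https://arxiv.org/abs/2601.18833
Authors: Marlon Dumas, Fredrik Milani, David Chapela-Campa
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Presented at the BPM’2025 conference on Artificial Intelligence for Business Process Management (AI4BPM)
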
Abstract:Since the early 90s, the evolution of the Business Process Management (BPM) discipline has been punctuated by successive waves of automation technologies. Some of these technologies enable the automation of individual tasks, while others focus on orchestrating the execution of end-to-end processes. The rise of Generative and Agentic Artificial Intelligence (AI) is opening the way for another such wave. However, this wave is poised to be different because it shifts the focus from automation to autonomy and from design-driven management of business processes to data-driven management, leveraging process mining techniques. This position paper, based on a keynote talk at the 2025 Workshop on AI for BPM, outlines how process mining has laid the foundations on top of which agents can sense process states, reason about improvement opportunities, and act to maintain and optimize performance. The paper proposes an architectural vision for Agentic Business Process Management Systems (A-BPMS): a new class of platforms that integrate autonomy, reasoning, and learning into process management and execution. The paper contends that such systems must support a continuum of processes, spanning from human-driven to fully autonomous, thus redefining the boundaries of process automation and governance.

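[AI-95] The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning

[Quick Read]: This paper addresses the trade-off between computational cost and coverage quality that arises when scaling test-time compute for long chain-of-thought (CoT) reasoning: existing methods either incur high training expense or yield redundant trajectories. The key to its solution is The Geometric Reasoner (TGR), a training-free framework that scores candidate latent anchors at each chunk boundary using a lightweight look-ahead estimate combined with soft geometric regularizers that encourage smooth trajectories and diverse exploration. Chunk-wise KV cache resets keep memory linear in chunk length under strict memory bounds. On Qwen3-8B, TGR improves robust trajectory coverage, measured by the area under the Pass@k curve (AUC), by up to 13 points, with an overhead of only about 1.1–1.3 times.

Link: https://arxiv.org/abs/2601.18832
Authors: Ren Zhuang, Ben Wang, Shuifa Sun
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 figures
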
Abstract:Scaling test-time compute enhances long chain-of-thought (CoT) reasoning, yet existing approaches face a fundamental trade-off between computational cost and coverage quality: either incurring high training expense or yielding redundant trajectories. We introduce The Geometric Reasoner (TGR), a training-free framework that performs manifold-informed latent foresight search under strict memory bounds. At each chunk boundary, TGR scores candidate latent anchors via a lightweight look-ahead estimate combined with soft geometric regularizers that encourage smooth trajectories and diverse exploration. Chunk-wise KV cache resets keep memory linear in chunk length. On challenging math and code benchmarks, TGR improves robust trajectory coverage, measured by the area under the Pass@k curve (AUC), by up to 13 points on Qwen3-8B, with negligible overhead of about 1.1–1.3 times.
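
As a reading aid for the coverage metric, the sketch below computes Pass@k with the widely used unbiased estimator 1 - C(n-c, k) / C(n, k) for n samples with c correct, and then a trapezoidal area under the Pass@k curve. The abstract does not say how the AUC is normalized or which k values it spans, so the `pass_at_k_auc` definition here (k from 1 to `k_max`, divided by the k range) is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_auc(n: int, c: int, k_max: int) -> float:
    """Trapezoidal area under the Pass@k curve for k = 1..k_max, scaled to [0, 1].

    The normalization is an illustrative choice, not taken from the paper.
    """
    values = [pass_at_k(n, c, k) for k in range(1, k_max + 1)]
    if len(values) == 1:
        return values[0]
    area = sum((a + b) / 2.0 for a, b in zip(values, values[1:]))
    return area / (k_max - 1)


print(pass_at_k(4, 1, 1), pass_at_k(4, 1, 2), pass_at_k_auc(4, 1, 2))
```

Summarizing the whole curve rather than a single k rewards methods whose sampled trajectories stay diverse as the sampling budget grows.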

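[AI-96] CP Loss: Channel-wise Perceptual Loss for Time Series Forecasting ICASSP2026

[Quick Read]: This paper addresses modeling bias caused by channel heterogeneity in multi-channel time series forecasting: channel-agnostic loss functions such as mean squared error (MSE) struggle to capture channel-specific dynamics such as sharp fluctuations or trend shifts. The key to its solution is Channel-wise Perceptual Loss (CP Loss), which learns a perceptual space adapted to each channel. A learnable channel-wise filter decomposes the raw signal into disentangled multi-scale representations that form the perceptual space, and the filter is optimized jointly with the main forecasting model so that the space is explicitly oriented toward the prediction task. The loss is then computed within each channel's perceptual space.

Link: https://arxiv.org/abs/2601.18829
Authors: Yaohua Zha, Chunlin Fan, Peiyuan Liu, Yong Jiang, Tao Dai, Hai Wu, Shu-Tao Xia
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to ICASSP 2026
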
Abstract:Multi-channel time-series data, prevalent across diverse applications, is characterized by significant heterogeneity in its different channels. However, existing forecasting models are typically guided by channel-agnostic loss functions like MSE, which apply a uniform metric across all channels. This often leads to a failure to capture channel-specific dynamics such as sharp fluctuations or trend shifts. To address this, we propose a Channel-wise Perceptual Loss (CP Loss). Its core idea is to learn a unique perceptual space for each channel that is adapted to its characteristics, and to compute the loss within this space. Specifically, we first design a learnable channel-wise filter that decomposes the raw signal into disentangled multi-scale representations, which form the basis of our perceptual space. Crucially, the filter is optimized jointly with the main forecasting model, ensuring that the learned perceptual space is explicitly oriented towards the prediction task. Finally, losses are calculated within these perceptual spaces to optimize the model. Code is available at this https URL.

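[AI-97] IPBC: An Interactive Projection-Based Framework for Human-in-the-Loop Semi-Supervised Clustering of High-Dimensional Data

[Quick Read]: This paper addresses why high-dimensional datasets are hard to cluster: distance metrics lose their usefulness, and clusters collapse or overlap after dimensionality reduction (the "curse of dimensionality"). The key to its solution is Interactive Project-Based Clustering (IPBC), a framework that reframes clustering as an iterative, user-guided visual analysis process. A nonlinear projection module is coupled with a feedback loop in which users adjust viewing angles and supply simple constraints such as must-link or cannot-link relationships. These constraints reshape the projection objective, gradually pulling semantically related points together and pushing unrelated points apart, so that a conventional clustering algorithm run on the optimized 2D layout identifies groups more reliably. An explainability component then maps each cluster back to the original feature space, producing interpretable rules or feature rankings.

Link: https://arxiv.org/abs/2601.18828
Authors: Mohammad Zare
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
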
Abstract:High-dimensional datasets are increasingly common across scientific and industrial domains, yet they remain difficult to cluster effectively due to the diminishing usefulness of distance metrics and the tendency of clusters to collapse or overlap when projected into lower dimensions. Traditional dimensionality reduction techniques generate static 2D or 3D embeddings that provide limited interpretability and do not offer a mechanism to leverage the analyst’s intuition during exploration. To address this gap, we propose Interactive Project-Based Clustering (IPBC), a framework that reframes clustering as an iterative human-guided visual analysis process. IPBC integrates a nonlinear projection module with a feedback loop that allows users to modify the embedding by adjusting viewing angles and supplying simple constraints such as must-link or cannot-link relationships. These constraints reshape the objective of the projection model, gradually pulling semantically related points closer together and pushing unrelated points further apart. As the projection becomes more structured and expressive through user interaction, a conventional clustering algorithm operating on the optimized 2D layout can more reliably identify distinct groups. An additional explainability component then maps each discovered cluster back to the original feature space, producing interpretable rules or feature rankings that highlight what distinguishes each cluster. Experiments on various benchmark datasets show that only a small number of interactive refinement steps can substantially improve cluster quality. Overall, IPBC turns clustering into a collaborative discovery process in which machine representation and human insight reinforce one another.

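[AI-98] Automated structural testing of LLM-based agents: methods, framework and case studies

[Quick Read]: This paper addresses the limitations of current testing for agents driven by large language models (LLMs): existing approaches rely on acceptance-level evaluation from the user's perspective, which requires manual evaluation, is hard to automate, does not support root cause analysis, and needs expensive test environments. The key to its solution is structural testing: OpenTelemetry-based traces capture agent trajectories, mocking enforces reproducible LLM behavior, and assertions automate test verification. This allows agent components and their interactions to be tested at a deeper technical level and brings software engineering best practices to agents, including the test automation pyramid, regression testing, test-driven development, and multi-language testing, with higher coverage, better reusability, and earlier defect detection.

Link: https://arxiv.org/abs/2601.18827
Authors: Jens Kohl, Otto Kruse, Youssef Mostafa, Andre Luckow, Karsten Schroer, Thomas Riedl, Ryan French, David Katz, Manuel P. Luitz, Tanrajbir Takher, Ken E. Friedl, Céline Laurent-Winter
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures. Preprint of an accepted paper at IEEE BigData 2025 (main track). Source code for the introduced methods and framework available at this https URL
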
Abstract:LLM-based agents are rapidly being adopted across diverse domains. Since they interact with users without supervision, they must be tested extensively. Current testing approaches focus on acceptance-level evaluation from the user’s perspective. While intuitive, these tests require manual evaluation, are difficult to automate, do not facilitate root cause analysis, and incur expensive test environments. In this paper, we present methods to enable structural testing of LLM-based agents. Our approach utilizes traces (based on OpenTelemetry) to capture agent trajectories, employs mocking to enforce reproducible LLM behavior, and adds assertions to automate test verification. This enables testing agent components and interactions at a deeper technical level within automated workflows. We demonstrate how structural testing enables the adaptation of software engineering best practices to agents, including the test automation pyramid, regression testing, test-driven development, and multi-language testing. In representative case studies, we demonstrate automated execution and faster root-cause analysis. Collectively, these methods reduce testing costs and improve agent quality through higher coverage, reusability, and earlier defect detection. We provide an open source reference implementation on GitHub.
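The three ingredients named in the abstract (traces, mocking, assertions) can be illustrated with a small sketch. This is not the paper's framework: the toy `run_agent` function is hypothetical, and a plain Python list stands in for OpenTelemetry spans so the example stays self-contained. Only `unittest.mock.Mock` from the standard library is used.

```python
from unittest.mock import Mock

def run_agent(llm, tools, question, trace):
    """Toy agent: ask the LLM which tool to call, call it, then ask for an answer."""
    plan = llm(f"PLAN: {question}")
    trace.append(("llm.plan", plan))            # stand-in for a trace span
    result = tools[plan](question)
    trace.append((f"tool.{plan}", result))
    answer = llm(f"ANSWER: {result}")
    trace.append(("llm.answer", answer))
    return answer

def test_agent_uses_calculator():
    # Mocking: the LLM replies are fixed, so the run is reproducible.
    llm = Mock(side_effect=["calculator", "The result is 4."])
    tools = {"calculator": Mock(return_value="4")}
    trace = []
    answer = run_agent(llm, tools, "What is 2 + 2?", trace)
    # Assertions on the trajectory, not only on the final answer.
    assert [name for name, _ in trace] == ["llm.plan", "tool.calculator", "llm.answer"]
    tools["calculator"].assert_called_once_with("What is 2 + 2?")
    assert answer == "The result is 4."
    return trace
```

Because the assertions target individual steps of the trajectory, a failure points at the component that misbehaved, which is the root-cause benefit the abstract describes.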

[AI-99] Differential Voting: Loss Functions For Axiomatically Diverse Aggregation of Heterogeneous Preferences

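[Quick read]: This paper addresses the fact that reinforcement learning from human feedback (RLHF) implicitly relies on a single preference-aggregation principle (typically the Bradley-Terry-Luce model), which leaves its normative assumptions opaque and restricts the axiomatic properties of the learned reward. Existing methods collapse diverse individual preferences into one utility function without stating the underlying voting mechanism or the axioms it satisfies. The key to the solution is the Differential Voting framework, which constructs instance-wise differentiable loss functions whose population-level optima provably correspond to classical social choice rules (BTL, Copeland, and Kemeny), and analyzes each loss's calibration properties, gradient fields, and limiting behavior as the smoothing parameters vanish. This makes preference aggregation an explicit, controllable design choice with principled trade-offs between axiomatic guarantees and optimization stability.

Link: https://arxiv.org/abs/2601.18824
Authors: Zhiyu An,Duaa Nakshbandi,Wan Du
Affiliation: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments: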

Abstract:Reinforcement learning from human feedback (RLHF) implicitly aggregates heterogeneous human preferences into a single utility function, even though the underlying utilities of the participants are in practice diverse. Hence, RLHF can be viewed as a form of voting, where the aggregation mechanism is defined by the loss function. Although Arrow’s Impossibility Theorem suggests that different mechanisms satisfy different sets of desirable axioms, most existing methods rely on a single aggregation principle, typically the Bradley-Terry-Luce (BTL) model, which corresponds to Borda count voting. This restricts the axiomatic properties of the learned reward and obscures the normative assumptions embedded in optimization. In this work, we introduce Differential Voting, a unifying framework that constructs instance-wise, differentiable loss functions whose population-level optima provably correspond to distinct classical voting rules. We develop differentiable surrogates for majority-based aggregation (BTL), Copeland, and Kemeny rules, and formally analyze their calibration properties, gradient fields, and limiting behavior as smoothing parameters vanish. For each loss, we establish consistency with the corresponding social choice rule and characterize the axioms it satisfies or violates. Our analysis shows how design choices in loss geometry, such as margin sensitivity and boundary concentration, directly translate into normative aggregation behavior. Differential Voting makes preference aggregation an explicit and controllable design choice in RLHF, enabling principled trade-offs between axiomatic guarantees and optimization stability. Code to reproduce our experiments is open-sourced.
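The claim that different aggregation rules can disagree on the same preferences is easy to check on a toy example. The sketch below computes classical (non-differentiable) Borda and Copeland scores; it does not reproduce the paper's differentiable surrogates, and the ballot profile is made up for illustration.

```python
from itertools import combinations

def borda_scores(ballots):
    """Each ballot is a ranking (best first); a candidate earns
    one point per candidate ranked below it."""
    scores = {c: 0 for c in ballots[0]}
    for ballot in ballots:
        for position, candidate in enumerate(ballot):
            scores[candidate] += len(ballot) - 1 - position
    return scores

def copeland_scores(ballots):
    """A candidate earns one point per pairwise majority contest it wins."""
    scores = {c: 0 for c in ballots[0]}
    for a, b in combinations(ballots[0], 2):
        a_wins = sum(ballot.index(a) < ballot.index(b) for ballot in ballots)
        b_wins = len(ballots) - a_wins
        if a_wins > b_wins:
            scores[a] += 1
        elif b_wins > a_wins:
            scores[b] += 1
    return scores

# Hypothetical profile: three voters rank A > B > C, two rank B > C > A.
ballots = [("A", "B", "C")] * 3 + [("B", "C", "A")] * 2
borda = borda_scores(ballots)        # {'A': 6, 'B': 7, 'C': 2}
copeland = copeland_scores(ballots)  # {'A': 2, 'B': 1, 'C': 0}
```

Borda selects B while Copeland selects A, so the choice of rule (and, in the RLHF setting, the choice of loss) changes the outcome.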

[AI-100] Agentic Digital Twins: A Taxonomy of Capabilities for Understanding Possible Futures

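[Quick read]: This paper addresses the structural complexity and blurred functional boundaries that arise as AI turns digital twins (DTs) from representational tools into agentic DTs that act on their own. The key to the solution is a taxonomy built on three dimensions: locus of agency (external, internal, distributed), tightness of coupling (loose, tight, constitutive), and model evolution (static, adaptive, reconstructive). These span a 27-configuration space, from which nine illustrative configurations are identified and grouped into three clusters: “The Present” (existing tools and emerging steering systems), “The Threshold” (where emergent properties appear and coupling becomes constitutive), and “The Frontier” (where systems gain reconstructive capabilities). The framework shows how constitutive coupling lets agentic DTs take part in constituting physical systems, marking the shift from DTs as mirror worlds to DTs as architects of new ontologies.

Link: https://arxiv.org/abs/2601.18799
Authors: Christopher Burr,Mark Enzer,Jason Shepherd,David Wagg
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: 32 pages, 6 figures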

Abstract:As digital twins (DTs) evolve to become more agentic through the integration of artificial intelligence (AI), they acquire capabilities that extend beyond dynamic representation of their target systems. This paper presents a taxonomy of agentic DTs organised around three fundamental dimensions: the locus of agency (external, internal, distributed), the tightness of coupling (loose, tight, constitutive), and model evolution (static, adaptive, reconstructive). From the resulting 27-configuration space, we identify nine illustrative configurations grouped into three clusters: “The Present” (existing tools and emerging steering systems), “The Threshold” (where emergent properties appear and coupling becomes constitutive), and “The Frontier” (where systems gain reconstructive capabilities). Our analysis explores how agentic DTs exercise performative power–not merely representing physical systems but actively participating in constituting them. Using traffic navigation systems as examples, we show how even passive tools can exhibit emergent performativity, while advanced configurations risk performative lock-in. Drawing on performative prediction theory, we trace a progression from passive tools through active steering to ontological reconstruction, examining how constitutive coupling enables systems to create self-validating realities. Understanding these configurations is essential for navigating the transformation from DTs as mirror worlds to DTs as architects of new ontologies.
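The 27-configuration space is simply the Cartesian product of the three dimensions, which a few lines of Python can enumerate. The variable names are illustrative; the nine configurations the authors single out are not listed in the abstract and are therefore not reproduced here.

```python
from itertools import product

# The three dimensions and their values, as named in the abstract.
LOCUS = ("external", "internal", "distributed")
COUPLING = ("loose", "tight", "constitutive")
EVOLUTION = ("static", "adaptive", "reconstructive")

# Every (locus, coupling, evolution) triple: 3 * 3 * 3 = 27 configurations.
configurations = list(product(LOCUS, COUPLING, EVOLUTION))

# Example query: all configurations with constitutive coupling.
constitutive = [c for c in configurations if c[1] == "constitutive"]
```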

[AI-101] Encoder-Free ECG-Language Models

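[Quick read]: This paper addresses the architectural and training complexity that ECG-Language Models (ELMs) incur by depending on pretrained ECG encoders for automated electrocardiogram (ECG) interpretation. The core of the solution is ELF, an encoder-free ELM that replaces the encoder with a single linear projection layer trained jointly with the large language model (LLM). Across five datasets, ELF matches or exceeds existing ELMs that use more complex encoders and training pipelines, and adding architectural biases does not clearly improve on the single projection.

Link: https://arxiv.org/abs/2601.18798
Authors: William Han,Tony Chen,Chaojing Duan,Xiaoyu Song,Yihang Yao,Yuzhe Yang,Michael A. Rosenberg,Emerson Liu,Ding Zhao
Affiliation: Unknown
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
Comments: 16 pages, 4 figures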

Abstract:ECG-Language Models (ELMs) extend recent progress in Multimodal Large Language Models (MLLMs) to automated ECG interpretation. However, most ELMs follow Vision-Language Model (VLM) designs and depend on pretrained ECG encoders, adding architectural and training complexity. Inspired by encoder-free VLMs, we introduce ELF, an encoder-free ELM that replaces the ECG encoder with a single projection layer trained jointly with the LLM. Across five datasets, ELF matches or exceeds state-of-the-art ELMs that use far more complex encoders and training pipelines. We also test whether adding architectural biases to ELF improves performance and find that the single linear projection remains competitive. Finally, we show that ELF, and potentially other ELMs, often rely more on benchmark artifacts and language priors than ECG-derived information, highlighting limitations in current evaluation practices and ELM design. All data and code are available at this https URL.
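As a rough picture of what replacing the encoder with a single projection layer can mean, the sketch below cuts a multi-lead signal into time patches and maps each patch to one embedding with one matrix multiplication. The patching scheme, the 12-lead input, and all sizes are assumptions for illustration; the abstract does not specify how ELF tokenizes the signal.

```python
import numpy as np

def project_ecg(ecg, patch_len, W, b):
    """Split a (leads, samples) ECG into non-overlapping time patches and
    map each flattened patch to one embedding with a single linear layer."""
    leads, samples = ecg.shape
    n_patches = samples // patch_len
    trimmed = ecg[:, : n_patches * patch_len]
    # (leads, n_patches, patch_len) -> (n_patches, leads * patch_len)
    patches = trimmed.reshape(leads, n_patches, patch_len).transpose(1, 0, 2)
    patches = patches.reshape(n_patches, leads * patch_len)
    return patches @ W + b

rng = np.random.default_rng(0)
ecg = rng.standard_normal((12, 1000))        # hypothetical 12-lead recording
patch_len, d_model = 50, 64                  # hypothetical sizes
W = rng.standard_normal((12 * patch_len, d_model)) * 0.02
b = np.zeros(d_model)
tokens = project_ecg(ecg, patch_len, W, b)   # one embedding per time patch
```

In a full model, `W` and `b` would be trained jointly with the LLM and the resulting rows would be fed to it alongside the text token embeddings.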

[AI-102] M-SGWR: Multiscale Similarity and Geographically Weighted Regression

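[Quick read]: This paper addresses a limitation of traditional local regression models such as Geographically Weighted Regression (GWR) and Multiscale GWR (MGWR), which describe spatial relationships through geographic proximity alone and so struggle to reflect how locations interact in an era of globalization and digital connectivity. The key to the solution is a new multiscale local regression framework, M-SGWR, which characterizes spatial interaction along two dimensions: geographic proximity and attribute (variable) similarity. For each predictor, a geographic weight matrix and an attribute-similarity weight matrix are built separately and combined through an optimized parameter alpha that controls their relative contribution to local model fitting. As with variable-specific bandwidths in MGWR, the optimal alpha varies by predictor, so the model can capture geographic, mixed, or non-spatial (remote similarity) effects. In two simulation experiments and one empirical application, M-SGWR outperforms GWR, SGWR, and MGWR on all goodness-of-fit metrics.

Link: https://arxiv.org/abs/2601.19888
Authors: M. Naser Lessani,Zhenlong Li,Manzhu Yu,Helen Greatrex,Chan Shen
Affiliation: Unknown
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: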

Abstract:The first law of geography is a cornerstone of spatial analysis, emphasizing that nearby and related locations tend to be more similar; however, defining what constitutes “near” and “related” remains challenging, as different phenomena exhibit distinct spatial patterns. Traditional local regression models, such as Geographically Weighted Regression (GWR) and Multiscale GWR (MGWR), quantify spatial relationships solely through geographic proximity. In an era of globalization and digital connectivity, however, geographic proximity alone may be insufficient to capture how locations are interconnected. To address this limitation, we propose a new multiscale local regression framework, termed M-SGWR, which characterizes spatial interaction across two dimensions: geographic proximity and attribute (variable) similarity. For each predictor, geographic and attribute-based weight matrices are constructed separately and then combined using an optimized parameter, alpha, which governs their relative contribution to local model fitting. Analogous to variable-specific bandwidths in MGWR, the optimal alpha varies by predictor, allowing the model to flexibly account for geographic, mixed, or non-spatial (remote similarity) effects. Results from two simulation experiments and one empirical application demonstrate that M-SGWR consistently outperforms GWR, SGWR, and MGWR across all goodness-of-fit metrics.
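The weight-blending idea can be sketched as a weighted least-squares fit at one location. Several simplifications are assumed here and are not taken from the paper: a convex combination `alpha * W_geo + (1 - alpha) * W_attr`, Gaussian kernels, fixed bandwidths, and a single alpha shared by all predictors (the paper lets alpha vary by predictor).

```python
import numpy as np

def gaussian_kernel(dist, bandwidth):
    """Distance-decay weights in (0, 1]; the kernel choice is an assumption."""
    return np.exp(-0.5 * (dist / bandwidth) ** 2)

def local_coefficients(X, y, coords, i, alpha, bw_geo, bw_attr):
    """Weighted least squares at location i with blended weights."""
    d_geo = np.linalg.norm(coords - coords[i], axis=1)   # geographic distance
    d_attr = np.linalg.norm(X - X[i], axis=1)            # attribute distance
    w = alpha * gaussian_kernel(d_geo, bw_geo) + (1 - alpha) * gaussian_kernel(d_attr, bw_attr)
    Xd = np.column_stack([np.ones(len(X)), X])           # add intercept column
    XtW = Xd.T * w                                       # X^T W without forming W
    return np.linalg.solve(XtW @ Xd, XtW @ y)

rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(30, 2))
X = rng.normal(size=(30, 1))
y = 1.0 + 2.0 * X[:, 0]                                  # noise-free toy response
beta = local_coefficients(X, y, coords, i=0, alpha=0.5, bw_geo=3.0, bw_attr=1.0)
```

Setting `alpha=1` recovers purely geographic weighting, while `alpha=0` weights observations by attribute similarity only, which corresponds to the "remote similarity" case in the abstract.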

[AI-103] AI Cap-and-Trade: Efficiency Incentives for Accessibility and Sustainability

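[Quick read]: This paper addresses the inefficiency, cost, and environmental burden caused by AI development's emphasis on model scale and compute. Because the industry favors hyper-scaling over efficiency, reliance on expensive computational resources sidelines academics and smaller companies while energy use and emissions grow. The key to the solution is a market-based incentive: a cap-and-trade system for AI computation, which the authors show provably reduces computation in AI deployment, lowering emissions and monetizing efficiency to the benefit of academics and smaller companies.

Link: https://arxiv.org/abs/2601.19886
Authors: Marco Bornstein,Amrit Singh Bedi
Affiliation: Unknown
Categories: General Economics (econ.GN); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT)
Comments: 22 pages, 2 figures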

Abstract:The race for artificial intelligence (AI) dominance often prioritizes scale over efficiency. Hyper-scaling is the common industry approach: larger models, more data, and as many computational resources as possible. Using more resources is a simpler path to improved AI performance. Thus, efficiency has been de-emphasized. Consequently, the need for costly computational resources has marginalized academics and smaller companies. Simultaneously, increased energy expenditure, due to growing AI use, has led to mounting environmental costs. In response to accessibility and sustainability concerns, we argue for research into, and implementation of, market-based methods that incentivize AI efficiency. We believe that incentivizing efficient operations and approaches will reduce emissions while opening new opportunities for academics and smaller companies. As a call to action, we propose a cap-and-trade system for AI. Our system provably reduces computations for AI deployment, thereby lowering emissions and monetizing efficiency to the benefit of academics and smaller companies.

[AI-104] Revisiting Incremental Stochastic Majorization-Minimization Algorithms with Applications to Mixture of Experts

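[Quick read]: This paper addresses high-volume streaming data, for which batch algorithms that need repeated passes over the full dataset are impractical, and proposes an incremental stochastic Majorization-Minimization (MM) algorithm as an alternative. The key is that it relaxes strong requirements of the Expectation-Maximization (EM) algorithm, such as an explicit latent-variable representation, which broadens applicability and adds algorithmic flexibility. The authors prove that the iterates converge to a stationary point where the gradient of the objective vanishes. Empirically, the method outperforms several widely used stochastic optimizers on softmax-gated mixture of experts (MoE) regression and yields stable gains in predictive performance on real bioinformatics data.

Link: https://arxiv.org/abs/2601.19811
Authors: TrungKhang Tran,TrungTin Nguyen,Gersende Fort,Tung Doan,Hien Duy Nguyen,Binh T. Nguyen,Florence Forbes,Christopher Drovandi
Affiliation: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Comments: TrungKhang Tran and TrungTin Nguyen are co-first authors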

Abstract:Processing high-volume, streaming data is increasingly common in modern statistics and machine learning, where batch-mode algorithms are often impractical because they require repeated passes over the full dataset. This has motivated incremental stochastic estimation methods, including the incremental stochastic Expectation-Maximization (EM) algorithm formulated via stochastic approximation. In this work, we revisit and analyze an incremental stochastic variant of the Majorization-Minimization (MM) algorithm, which generalizes incremental stochastic EM as a special case. Our approach relaxes key EM requirements, such as explicit latent-variable representations, enabling broader applicability and greater algorithmic flexibility. We establish theoretical guarantees for the incremental stochastic MM algorithm, proving consistency in the sense that the iterates converge to a stationary point characterized by a vanishing gradient of the objective. We demonstrate these advantages on a softmax-gated mixture of experts (MoE) regression problem, for which no stochastic EM algorithm is available. Empirically, our method consistently outperforms widely used stochastic optimizers, including stochastic gradient descent, root mean square propagation, adaptive moment estimation, and second-order clipped stochastic optimization. These results support the development of new incremental stochastic algorithms, given the central role of softmax-gated MoE architectures in contemporary deep neural networks for heterogeneous data modeling. Beyond synthetic experiments, we also validate practical effectiveness on two real-world datasets, including a bioinformatics study of dent maize genotypes under drought stress that integrates high-dimensional proteomics with ecophysiological traits, where incremental stochastic MM yields stable gains in predictive performance.
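A textbook-style toy problem shows the general pattern of a stochastic MM update: build a surrogate from each incoming sample, average the surrogate's sufficient statistics with a decaying step size, and minimize the averaged surrogate in closed form. The sketch below estimates a median from a stream. It is not the paper's algorithm or its MoE application; the function name, the `eps` clamp, and the step-size schedule are assumptions.

```python
import numpy as np

def stochastic_mm_median(stream, theta0=0.0, eps=0.1, decay=0.6):
    """Estimate the minimizer of E|x - theta| from a stream, one sample at a time.

    Each |x - theta| is upper-bounded at the current iterate theta_k by
    (x - theta)^2 / (2 * d) + d / 2 with d = max(|x - theta_k|, eps); the bound
    is tight at theta_k whenever |x - theta_k| >= eps. The averaged surrogate is
    quadratic in theta, so it is summarized by two running statistics,
    s_w ~ E[1/d] and s_wx ~ E[x/d], and minimized at s_wx / s_w.
    """
    theta, s_w, s_wx = theta0, 1.0, theta0
    for n, x in enumerate(stream, start=1):
        step = (n + 1) ** -decay            # decaying stochastic-approximation step
        w = 1.0 / max(abs(x - theta), eps)
        s_w += step * (w - s_w)             # update surrogate statistics
        s_wx += step * (w * x - s_wx)
        theta = s_wx / s_w                  # closed-form surrogate minimizer
    return theta

rng = np.random.default_rng(0)
estimate = stochastic_mm_median(rng.normal(3.0, 1.0, size=5000))
```

No latent-variable model is involved, which mirrors the abstract's point that MM needs only a majorizing surrogate rather than the complete-data structure EM relies on.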

[AI-105] Quantum Circuit Pre-Synthesis: Learning Local Edits to Reduce T-count

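[Quick read]: This paper addresses the cost of T gates in quantum circuit compilation for fault-tolerant quantum computing with stabilizer codes, where the number of T gates largely determines whether a circuit can be run. Local compilation methods scale to large circuits, but composing them gives suboptimal T-count or circuit depth, and their performance depends strongly on how the circuit is represented. The paper proposes Q-PreSyn: given a set of local edits that preserve circuit equivalence, a reinforcement learning (RL) agent finds effective sequences of edits that produce circuit representations from which subsequent synthesis yields a lower T-count. Experiments show up to a 20% T-count reduction on circuits with up to 25 qubits, with no additional approximation error introduced before synthesis.

Link: https://arxiv.org/abs/2601.19738
Authors: Daniele Lizzio Bosco,Lukasz Cincio,Giuseppe Serra,M. Cerezo
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10+5 pages, 10 figures, 3 algorithms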

Abstract:Compiling quantum circuits into Clifford+T gates is a central task for fault-tolerant quantum computing using stabilizer codes. In the near term, T gates will dominate the cost of fault-tolerant implementations, and any reduction in the number of such expensive gates could mean the difference between being able to run a circuit or not. While exact synthesis is exponentially hard in the number of qubits, local synthesis approaches are commonly used to compile large circuits by decomposing them into substructures. However, composing local methods leads to suboptimal compilations in key metrics such as T-count or circuit depth, and their performance strongly depends on circuit representation. In this work, we address this challenge by proposing Q-PreSyn, a strategy that, given a set of local edits preserving circuit equivalence, uses an RL agent to identify effective sequences of such actions and thereby obtain circuit representations that yield a reduced T-count upon synthesis. Experimental results of our proposed strategy, applied on top of well-known synthesis algorithms, show up to a 20% reduction in T-count on circuits with up to 25 qubits, without introducing any additional approximation error prior to synthesis.
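To illustrate what T-count and an equivalence-preserving local edit look like, the sketch below counts T gates in a gate list and applies two exact rewrites, T·T = S and T·T† = I, to neighbouring single-qubit gates. It is a plain peephole pass, not the RL-guided pre-synthesis the paper proposes, and the `(gate, qubit)` list format is a hypothetical representation.

```python
def t_count(circuit):
    """Number of T and T-dagger gates in a list of (gate, qubit) pairs."""
    return sum(gate in ("T", "Tdg") for gate, _ in circuit)

def merge_adjacent_t(circuit):
    """Apply two exact local rewrites on neighbouring single-qubit gates:
    T T -> S and T Tdg -> identity (same qubit, directly adjacent in the list)."""
    out = []
    for gate, qubit in circuit:
        if out and out[-1][1] == qubit:
            prev = out[-1][0]
            if prev == "T" and gate == "T":
                out[-1] = ("S", qubit)      # T^2 = S, a Clifford gate
                continue
            if {prev, gate} == {"T", "Tdg"}:
                out.pop()                   # T and T-dagger cancel
                continue
        out.append((gate, qubit))
    return out

circuit = [("H", 0), ("T", 0), ("T", 0), ("T", 1), ("Tdg", 1), ("T", 0)]
optimized = merge_adjacent_t(circuit)       # [('H', 0), ('S', 0), ('T', 0)]
```

Here the T-count drops from 5 to 1. Which edits to apply, and in what order, is the search problem that the paper hands to an RL agent.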

[AI-106] SAM Audio Judge: A Unified Multimodal Framework for Perceptual Evaluation of Audio Separation

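[Quick read]: This paper addresses two problems in evaluating audio separation: subjective listening tests are costly and hard to scale, and existing objective metrics are misaligned with human perception. The key to the solution is SAM Audio Judge (SAJ), a multimodal, fine-grained, reference-free objective metric that needs no ground-truth signals. It accepts three kinds of prompt (text, visual, and span), covers three audio domains (speech, music, and general sound events), and scores four dimensions (recall, precision, faithfulness, and overall quality). It also shows potential for data filtering, pseudo-labeling, and reranking.

Link: https://arxiv.org/abs/2601.19702
Authors: Helin Wang,Bowen Shi,Andros Tjandra,John Hoffman,Yi-Chiao Wu,Apoorv Vyas,Najim Dehak,Ann Lee,Wei-Ning Hsu
Affiliation: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Comments: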

Abstract:Performance evaluation remains a complex challenge in audio separation, and existing evaluation metrics are often misaligned with human perception, coarse-grained, and reliant on ground-truth signals. On the other hand, subjective listening tests remain the gold standard for real-world evaluation, but they are expensive, time-consuming, and difficult to scale. This paper addresses the growing need for automated systems capable of evaluating audio separation without human intervention. The proposed evaluation metric, SAM Audio Judge (SAJ), is a multimodal fine-grained reference-free objective metric, which shows high alignment with human perception. SAJ supports three audio domains (speech, music and general sound events) and three prompt inputs (text, visual and span), covering four different dimensions of evaluation (recall, precision, faithfulness, and overall). SAM Audio Judge also shows potential applications in data filtering, pseudo-labeling large datasets and reranking in audio separation models. We release our code and pre-trained models at: this https URL.

[AI-107] PCEvo: Path-Consistent Molecular Representation via Virtual Evolutionary

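[Quick read]: This paper addresses molecular representation learning in the few-shot setting, where scarce supervision leads to brittle structure-property relationships, high prediction error, and poor generalization. The key to the solution is PCEvo, which enumerates multiple chemically feasible edit paths between similar molecule pairs under topological dependency constraints and turns the labels of the two endpoint molecules into stepwise supervision along each virtual evolutionary path. A path-consistency objective then enforces consistent predictions across different paths connecting the same pair, which improves robustness and generalization under limited labels.

Link: https://arxiv.org/abs/2601.19257
Authors: Kun Li,Longtao Hu,Yida Xiong,Jiajun Yu,Hongzhi Zhang,Jiameng Chen,Xiantao Cai,Jia Wu,Wenbin Hu
Affiliation: Unknown
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures, 5 tables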

Abstract:Molecular representation learning aims to learn vector embeddings that capture molecular structure and geometry, thereby enabling property prediction and downstream scientific applications. In many AI for science tasks, labeled data are expensive to obtain and therefore limited in availability. Under the few-shot setting, models trained with scarce supervision often learn brittle structure-property relationships, resulting in substantially higher prediction errors and reduced generalization to unseen molecules. To address this limitation, we propose PCEvo, a path-consistent representation method that learns from virtual paths through dynamic structural evolution. PCEvo enumerates multiple chemically feasible edit paths between retrieved similar molecular pairs under topological dependency constraints. It transforms the labels of the two molecules into stepwise supervision along each virtual evolutionary path. It introduces a path-consistency objective that enforces prediction invariance across alternative paths connecting the same two molecules. Comprehensive experiments on the QM9 and MoleculeNet datasets demonstrate that PCEvo substantially improves the few-shot generalization performance of baseline methods. The code is available at this https URL.

[AI-108] EnzyPGM: Pocket-conditioned Generative Model for Substrate-specific Enzyme Design

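[Quick read]: This paper addresses the inability of generative models for enzyme design to model the interactions between substrate-binding pockets and substrates, which limits the generation of enzymes with precise catalytic environments. The key to the solution is the EnzyPGM framework, built on two modules: Residue-atom Bi-scale Attention (RBA), which jointly models intra-residue dependencies and fine-grained interactions between pocket residues and substrate atoms, and Residue Function Fusion (RFF), which incorporates enzyme function priors into residue representations. Together they let the model generate enzymes and substrate-specific binding pockets jointly, conditioned on functional priors and substrates.

Link: https://arxiv.org/abs/2601.19205
Authors: Zefeng Lin,Zhihang Zhang,Weirong Zhu,Tongchang Han,Xianyong Fang,Tianfan Fu,Xiaohua Xu
Affiliation: Unknown
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures, under review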

Abstract:Designing enzymes with substrate-binding pockets is a critical challenge in protein engineering, as catalytic activity depends on the precise interaction between pockets and substrates. Currently, generative models dominate functional protein design but cannot model pocket-substrate interactions, which limits the generation of enzymes with precise catalytic environments. To address this issue, we propose EnzyPGM, a unified framework that jointly generates enzymes and substrate-binding pockets conditioned on functional priors and substrates, with a particular focus on learning accurate pocket-substrate interactions. At its core, EnzyPGM includes two main modules: a Residue-atom Bi-scale Attention (RBA) module that jointly models intra-residue dependencies and fine-grained interactions between pocket residues and substrate atoms, and a Residue Function Fusion (RFF) module that incorporates enzyme function priors into residue representations. Also, we curate EnzyPock, an enzyme-pocket dataset comprising 83,062 enzyme-substrate pairs across 1,036 four-level enzyme families. Extensive experiments demonstrate that EnzyPGM achieves state-of-the-art performance on EnzyPock. Notably, EnzyPGM reduces the average binding energy by 0.47 kcal/mol relative to EnzyGen, showing its superior performance on substrate-specific enzyme design. The code and dataset will be released later.
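The residue-to-atom half of such a bi-scale attention can be pictured as ordinary cross-attention in which pocket residues query substrate atoms. The sketch below is generic scaled dot-product cross-attention, not the RBA module itself; the feature sizes and random weights are placeholders, and the intra-residue part is not modeled.

```python
import numpy as np

def cross_attention(residues, atoms, Wq, Wk, Wv):
    """Each pocket residue (query) attends over all substrate atoms (keys/values)."""
    Q, K, V = residues @ Wq, atoms @ Wk, atoms @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over atoms
    return weights @ V, weights

rng = np.random.default_rng(0)
residues = rng.standard_normal((8, 16))   # 8 hypothetical pocket residues
atoms = rng.standard_normal((5, 16))      # 5 hypothetical substrate atoms
Wq, Wk, Wv = (rng.standard_normal((16, 16)) * 0.1 for _ in range(3))
updated, weights = cross_attention(residues, atoms, Wq, Wk, Wv)
```

Row `i` of `weights` indicates which substrate atoms residue `i` draws on, which is the kind of fine-grained residue-atom interaction the abstract refers to.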

[AI-109] Reinforcement Learning for Quantum Technology

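[Quick read]: This review covers a range of problems in quantum technology, including quantum state preparation, the design and optimization of high-fidelity quantum gates, the automated construction of quantum circuits (for example for variational quantum eigensolvers and architecture search), quantum feedback control, and quantum error correction. The common approach is reinforcement learning (RL), a family of machine learning algorithms in which an agent makes adaptive decisions through interaction with the quantum device. The review highlights experimental implementations and the growing role of RL in the development of quantum technologies.

Link: https://arxiv.org/abs/2601.18953
Authors: Marin Bukov,Florian Marquardt
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Quantum Gases (cond-mat.quant-gas); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: review article; comments are welcome!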

Abstract:Many challenges arising in Quantum Technology can be successfully addressed using a set of machine learning algorithms collectively known as reinforcement learning (RL), based on adaptive decision-making through interaction with the quantum device. After a concise and intuitive introduction to RL aimed at a broad physics readership, we discuss the key ideas and core concepts in reinforcement learning with a particular focus on quantum systems. We then survey recent progress in RL in all relevant areas. We discuss state preparation in few- and many-body quantum systems, the design and optimization of high-fidelity quantum gates, and the automated construction of quantum circuits, including applications to variational quantum eigensolvers and architecture search. We further highlight the interactive capabilities of RL agents, emphasizing recent progress in quantum feedback control and quantum error correction, and briefly discuss quantum reinforcement learning as well as applications to quantum metrology. The review concludes with a discussion of open challenges – such as scalability, interpretability, and integration with experimental platforms – and outlines promising directions for future research. Throughout, we highlight experimental implementations that exemplify the increasing role of reinforcement learning in shaping the development of quantum technologies.

[AI-110] Lossy Image Compression – A Frequent Sequence Mining perspective employing efficient Clustering

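[Quick read]: This paper addresses the handling of redundant data in lossy image compression, focusing on the Discrete Cosine Transform (DCT) phase of JPEG. The key to the solution is to replace the DCT with a combination of closed frequent sequence mining and k-means clustering. K-means is applied in parallel to all blocks of each image component to reduce compression time, and the conventional GSP algorithm is refined with a novel pruning strategy that reduces the cardinality of the pattern set and hence the code table size, yielding gains in compression ratio and image quality.

Link: https://arxiv.org/abs/2601.18821
Authors: Avinash Kadimisetty,Oswald C,Sivaselvan B,Alekhya Kadimisetty
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: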

Abstract:This work explores the scope of Frequent Sequence Mining in the domain of Lossy Image Compression. The proposed work is based on the idea of clustering pixels and using the cluster identifiers in the compression. The DCT phase in JPEG is replaced with a combination of closed frequent sequence mining and k-means clustering to handle the redundant data effectively. This method focuses mainly on applying k-means clustering in parallel to all blocks of each component of the image to reduce the compression time. The conventional GSP algorithm is refined to optimize the cardinality of patterns through a novel pruning strategy, thus achieving a good reduction in the code table size. Simulations of the proposed algorithm indicate significant gains in compression ratio and quality in relation to the existing alternatives.
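The clustering step, replacing pixel values with cluster identifiers, can be sketched with a small k-means (Lloyd's algorithm) on one image block. The sequence-mining and pruning stages are not shown, and the 4x4 block, `k=2`, and the evenly spaced initialization are assumptions for illustration.

```python
import numpy as np

def kmeans_quantize(values, k, iters=20):
    """Lloyd's algorithm on scalar pixel values; returns (labels, centroids)."""
    values = values.astype(float).ravel()
    centroids = np.linspace(values.min(), values.max(), k)   # simple initialization
    for _ in range(iters):
        # Assign each pixel to its nearest centroid.
        labels = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        # Move each non-empty centroid to the mean of its pixels.
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = values[labels == c].mean()
    return labels, centroids

# Hypothetical 4x4 block with a dark region and a bright region.
block = np.array([[10, 12, 11, 200],
                  [ 9, 13, 198, 202],
                  [11, 10, 201, 199],
                  [12, 197, 200, 203]])
labels, centroids = kmeans_quantize(block, k=2)
reconstructed = centroids[labels].reshape(block.shape)       # lossy reconstruction
```

The block is now described by 16 one-bit labels plus two centroid values, and the label sequence is the kind of input on which frequent sequence mining could look for repeated patterns.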

[AI-111] LabelKAN – Kolmogorov-Arnold Networks for Inter-Label Learning: Avian Community Learning

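[Quick read]: This paper addresses the difficulty of incorporating species-species relationships (community-level interactions) into species distribution models, which matters most for rare or hard-to-predict species and hence for setting targets under frameworks such as the Kunming-Montreal Global Biodiversity Framework (GBF). The key to the solution is LabelKAN, a framework based on Kolmogorov-Arnold Networks (KANs) that learns inter-label connections from the predictions for each label (species). It improves predictive performance for avian species distributions, especially for rare and hard-to-predict species.

Link: https://arxiv.org/abs/2601.18818
Authors: Marc Grimson,Joshua Fan,Courtney L. Davis,Dylan van Bramer,Daniel Fink,Carla P. Gomes
Affiliation: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Populations and Evolution (q-bio.PE)
Comments: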

Abstract:Global biodiversity loss is accelerating, prompting international efforts such as the Kunming-Montreal Global Biodiversity Framework (GBF) and the United Nations Sustainable Development Goals to direct resources toward halting species declines. A key challenge in achieving this goal is having access to robust methodologies to understand where species occur and how they relate to each other within broader ecological communities. Recent deep learning-based advances in joint species distribution modeling have shown improved predictive performance, but effectively incorporating community-level learning, taking into account species-species relationships in addition to species-environment relationships, remains an outstanding challenge. We introduce LabelKAN, a novel framework based on Kolmogorov-Arnold Networks (KANs) to learn inter-label connections from predictions of each label. When modeling avian species distributions, LabelKAN achieves substantial gains in predictive performance across the vast majority of species. In particular, our method demonstrates strong improvements for rare and difficult-to-predict species, which are often the most important when setting biodiversity targets under frameworks like GBF. These performance gains also translate to more confident predictions of the species' spatial patterns as well as more confident predictions of community structure. We illustrate how LabelKAN leads to qualitative and quantitative improvements with a focused application on the Great Blue Heron, an emblematic species in freshwater ecosystems that has experienced significant population declines across the United States in recent years. Using the LabelKAN framework, we are able to identify communities and species in New York that will be most sensitive to further declines in Great Blue Heron populations.

[AI-112] Lightweight Quantum-Enhanced ResNet for Coronary Angiography Classification: A Hybrid Quantum-Classical Feature Enhancement Framework

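[Quick read]: This paper addresses two problems in coronary angiography (CAG): interpretation of single-frame images depends on operator experience, and conventional deep learning models struggle to model complex vascular morphology and fine-grained texture features. The key to the solution is a Lightweight Quantum-Enhanced ResNet (LQER). A pretrained ResNet18 serves as the classical feature extractor, and a parameterized quantum circuit (PQC) is introduced in the high-level semantic feature space. The quantum module uses data re-uploading and entanglement structures, and its output is fused with the classical features through a residual connection, which supports end-to-end hybrid optimization under tightly controlled quantum resources. On the test set, accuracy exceeds 90% and improves on the classical ResNet18 baseline, with better discrimination of positive lesions under class imbalance.

Link: https://arxiv.org/abs/2601.18814
Authors: Jingsong Xia
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: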

Abstract:Background: Coronary angiography (CAG) is the cornerstone imaging modality for evaluating coronary artery stenosis and guiding interventional decision-making. However, interpretation based on single-frame angiographic images remains highly operator-dependent, and conventional deep learning models still face challenges in modeling complex vascular morphology and fine-grained texture features. Methods: We propose a Lightweight Quantum-Enhanced ResNet (LQER) for binary classification of coronary angiography images. A pretrained ResNet18 is employed as a classical feature extractor, while a parameterized quantum circuit (PQC) is introduced at the high-level semantic feature space for quantum feature enhancement. The quantum module utilizes data re-uploading and entanglement structures, followed by residual fusion with classical features, enabling end-to-end hybrid optimization with a strictly controlled number of [...]. Results: On an independent test set, the proposed LQER outperformed the classical ResNet18 baseline in accuracy, AUC, and F1-score, achieving a test accuracy exceeding 90%. The results demonstrate that lightweight quantum feature enhancement improves discrimination of positive lesions, particularly under class-imbalanced [...]. Conclusion: This study validates a practical hybrid quantum–classical learning paradigm for coronary angiography analysis, providing a feasible pathway for deploying quantum machine learning in medical imaging applications.

[AI-113] Artificial Neural Network in Cosmic Landscape

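[Quick read]: This paper addresses the exponential computational complexity of traditional numerical simulations of multi-field inflation models as the number of fields grows. The key to the solution is to use the universal approximation property of artificial neural networks (ANNs), specifically the multilayer perceptron, to generate the inflationary landscape, illustrated numerically on a toy model with multiple light fields.

Link: https://arxiv.org/abs/1707.02800
Authors: Junyu Liu
Affiliation: Unknown
Categories: High Energy Physics - Theory (hep-th); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); General Relativity and Quantum Cosmology (gr-qc)
Comments: v2, add some new contents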

Abstract:In this paper we propose that artificial neural network, the basis of machine learning, is useful to generate the inflationary landscape from a cosmological point of view. Traditional numerical simulations of a global cosmic landscape typically need an exponential complexity when the number of fields is large. However, a basic application of artificial neural network could solve the problem based on the universal approximation theorem of the multilayer perceptron. A toy model in inflation with multiple light fields is investigated numerically as an example of such an application.

Machine Learning

[LG-0] Self-Distillation Enables Continual Learning

Link: https://arxiv.org/abs/2601.19897
Authors: Idan Shenfeld,Mehul Damani,Jonas Hübotter,Pulkit Agrawal
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Continual learning, enabling models to acquire new skills and knowledge without degrading existing capabilities, remains a fundamental challenge for foundation models. While on-policy reinforcement learning can reduce forgetting, it requires explicit reward functions that are often unavailable. Learning from expert demonstrations, the primary alternative, is dominated by supervised fine-tuning (SFT), which is inherently off-policy. We introduce Self-Distillation Fine-Tuning (SDFT), a simple method that enables on-policy learning directly from demonstrations. SDFT leverages in-context learning by using a demonstration-conditioned model as its own teacher, generating on-policy training signals that preserve prior capabilities while acquiring new skills. Across skill learning and knowledge acquisition tasks, SDFT consistently outperforms SFT, achieving higher new-task accuracy while substantially reducing catastrophic forgetting. In sequential learning experiments, SDFT enables a single model to accumulate multiple skills over time without performance regression, establishing on-policy distillation as a practical path to continual learning from demonstrations.

[LG-1] RHSIA: Real-time Hemodynamics Surrogation for Non-idealized Intracranial Aneurysms

Link: https://arxiv.org/abs/2601.19876
Authors: Yiying Sheng, Wenhao Ding, Dylan Roi, Leonard Leong Litt Yeo, Hwa Liang Leo, Choon Hwai Yap
Subjects: Machine Learning (cs.LG)

Click to view abstract

Abstract:Extensive studies have suggested that fluid mechanical markers of intracranial aneurysms (IAs) derived from Computational Fluid Dynamics (CFD) can indicate disease progression risks, but to date this has not been translated clinically. This is because CFD requires specialized expertise and is time-consuming and low throughput, making it difficult to support clinical trials. A deep learning model that maps IA morphology to biomechanical markers can address this, enabling physicians to obtain these markers in real time without performing CFD. Here, we show that a Graph Transformer model that incorporates temporal information, which is supervised by large CFD data, can accurately predict Wall Shear Stress (WSS) across the cardiac cycle from IA surface meshes. The model effectively captures the temporal variations of the WSS pattern, achieving a Structural Similarity Index (SSIM) of up to 0.981 and a maximum-based relative L2 error of 2.8%. Ablation studies and SOTA comparison confirmed its optimality. Further, as pulsatile CFD data is computationally expensive to generate and sample sizes are limited, we engaged a strategy of injecting a large amount of steady-state CFD data, which are extremely low-cost to generate, as augmentation. This approach enhances network performance substantially when the pulsatile CFD sample size is small. Our study provides a proof of concept that temporal sequences of cardiovascular fluid mechanical parameters can be computed in real time using a deep learning model from the geometric mesh, and this is achievable even with a small pulsatile CFD sample size. Our approach is likely applicable to other cardiovascular scenarios.

[LG-2] Bandits in Flux: Adversarial Constraints in Dynamic Environments AISTATS2026

Link: https://arxiv.org/abs/2601.19867
Authors: Tareq Si Salem
Subjects: Machine Learning (cs.LG)
Comments: Accepted to AISTATS 2026

Click to view abstract

Abstract:We investigate the challenging problem of adversarial multi-armed bandits operating under time-varying constraints, a scenario motivated by numerous real-world applications. To address this complex setting, we propose a novel primal-dual algorithm that extends online mirror descent through the incorporation of suitable gradient estimators and effective constraint handling. We provide theoretical guarantees establishing sublinear dynamic regret and sublinear constraint violation for our proposed policy. Our algorithm achieves state-of-the-art performance in terms of both regret and constraint violation. Empirical evaluations demonstrate the superiority of our approach.

[LG-3] Calibration without Ground Truth

Link: https://arxiv.org/abs/2601.19862
Authors: Yuqing Kong, Mingyu Song, Yizhou Wang, Yifan Wu
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)

Click to view abstract

Abstract:Villalobos et al. [2024] predict that publicly available human text will be exhausted within the next decade. Thus, improving models without access to ground-truth labels becomes increasingly important. We propose a label-free post-processing framework that improves a strong but miscalibrated model using a weaker yet better-calibrated reference. Our framework guarantees a strict performance improvement under any proper loss. Our approach is based on a characterization of when strict improvement is possible: when the strong and reference models are not mutually calibrated. We formalize this condition, connect it to arbitrage and no-trade results from economics, and develop an efficient Bregman projection algorithm that guarantees worst-case loss reduction without labels. Experiments on representative LLMs across varying scales demonstrate that our label-free method significantly reduces proper losses and calibration errors, achieving performance competitive with supervised baselines.

[LG-4] A Multi-directional Meta-Learning Framework for Class-Generalizable Anomaly Detection

Link: https://arxiv.org/abs/2601.19833
Authors: Padmaksha Roy, Lamine Mili, Almuatazbellah Boker
Subjects: Machine Learning (cs.LG)

Click to view abstract

Abstract:In this paper, we address the problem of class-generalizable anomaly detection, where the objective is to develop a unified model by focusing our learning on the available normal data and a small amount of anomaly data in order to detect the completely unseen anomalies, also referred to as the out-of-distribution (OOD) classes. Adding to this challenge is the fact that the anomaly data is rare and costly to label. To achieve this, we propose a multidirectional meta-learning algorithm – at the inner level, the model aims to learn the manifold of the normal data (representation); at the outer level, the model is meta-tuned with a few anomaly samples to maximize the softmax confidence margin between the normal and anomaly samples (decision surface calibration), treating normals as in-distribution (ID) and anomalies as out-of-distribution (OOD). By iteratively repeating this process over multiple episodes of predominantly normal and a small number of anomaly samples, we realize a multidirectional meta-learning framework. This two-level optimization, enhanced by multidirectional training, enables stronger generalization to unseen anomaly classes.

[LG-5] Learn and Verify: A Framework for Rigorous Verification of Physics-Informed Neural Networks

Link: https://arxiv.org/abs/2601.19818
Authors: Kazuaki Tanaka, Kohei Yatabe
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 13 pages, 10 figures

Click to view abstract

Abstract:The numerical solution of differential equations using neural networks has become a central topic in scientific computing, with Physics-Informed Neural Networks (PINNs) emerging as a powerful paradigm for both forward and inverse problems. However, unlike classical numerical methods that offer established convergence guarantees, neural network-based approximations typically lack rigorous error bounds. Furthermore, the non-deterministic nature of their optimization makes it difficult to mathematically certify their accuracy. To address these challenges, we propose a “Learn and Verify” framework that provides computable, mathematically rigorous error bounds for the solutions of differential equations. By combining a novel Doubly Smoothed Maximum (DSM) loss for training with interval arithmetic for verification, we compute rigorous a posteriori error bounds as machine-verifiable proofs. Numerical experiments on nonlinear Ordinary Differential Equations (ODEs), including problems with time-varying coefficients and finite-time blow-up, demonstrate that the proposed framework successfully constructs rigorous enclosures of the true solutions, establishing a foundation for trustworthy scientific machine learning.

[LG-6] Component-Aware Pruning Framework for Neural Network Controllers via Gradient-Based Importance Estimation

Link: https://arxiv.org/abs/2601.19794
Authors: Ganesh Sundaram, Jonas Ulmen, Daniel Görges
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 8 pages, Submitted to the 2026 IFAC World Congress

Click to view abstract

Abstract:The transition from monolithic to multi-component neural architectures in advanced neural network controllers poses substantial challenges due to the high computational complexity of the latter. Conventional model compression techniques for complexity reduction, such as structured pruning based on norm-based metrics to estimate the relative importance of distinct parameter groups, often fail to capture functional significance. This paper introduces a component-aware pruning framework that utilizes gradient information to compute three distinct importance metrics during training: Gradient Accumulation, Fisher Information, and Bayesian Uncertainty. Experimental results with an autoencoder and a TD-MPC agent demonstrate that the proposed framework reveals critical structural dependencies and dynamic shifts in importance that static heuristics often miss, supporting more informed compression decisions.

[LG-7] To Grok Grokking: Provable Grokking in Ridge Regression

Link: https://arxiv.org/abs/2601.19791
Authors: Mingyue Xu, Gal Vardi, Itay Safran
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Click to view abstract

Abstract:We study grokking, the onset of generalization long after overfitting, in a classical ridge regression setting. We prove end-to-end grokking results for learning over-parameterized linear regression models using gradient descent with weight decay. Specifically, we prove that the following stages occur: (i) the model overfits the training data early during training; (ii) poor generalization persists long after overfitting has manifested; and (iii) the generalization error eventually becomes arbitrarily small. Moreover, we show, both theoretically and empirically, that grokking can be amplified or eliminated in a principled manner through proper hyperparameter tuning. To the best of our knowledge, these are the first rigorous quantitative bounds on the generalization delay (which we refer to as the “grokking time”) in terms of training hyperparameters. Lastly, going beyond the linear setting, we empirically demonstrate that our quantitative bounds also capture the behavior of grokking on non-linear neural networks. Our results suggest that grokking is not an inherent failure mode of deep learning, but rather a consequence of specific training conditions, and thus does not require fundamental changes to the model architecture or learning algorithm to avoid.

[LG-8] Knowledge-Aware Evolution for Streaming Federated Continual Learning with Category Overlap and without Task Identifiers

Link: https://arxiv.org/abs/2601.19788
Authors: Sixing Tan, Xianmin Liu
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Click to view abstract

Abstract:Federated Continual Learning (FCL) leverages inter-client collaboration to balance new knowledge acquisition and prior knowledge retention in non-stationary data. However, existing batch-based FCL methods lack adaptability to streaming scenarios featuring category overlap between old and new data and absent task identifiers, leading to indistinguishability of old and new knowledge, uncertain task assignments for samples, and knowledge [...]. To address this, we propose a streaming federated continual learning setting: per federated learning (FL) round, clients process streaming data with disjoint samples and potentially overlapping categories without task identifiers, necessitating sustained inference capability for all prior categories after each FL [...], we introduce FedKACE: 1) an adaptive inference model switching mechanism that enables unidirectional switching from local model to global model to achieve a trade-off between personalization and generalization; 2) an adaptive gradient-balanced replay scheme that reconciles new knowledge learning and old knowledge retention under overlapping-class scenarios; 3) a kernel spectral boundary buffer maintenance scheme that preserves high-information and high-boundary-influence samples to optimize cross-round knowledge retention. Experiments across multiple scenarios and regret analysis demonstrate the effectiveness of FedKACE.

[LG-9] The Effect of Architecture During Continual Learning

Link: https://arxiv.org/abs/2601.19766
Authors: Allyson Hahn, Krishnan Raghavan
Subjects: Machine Learning (cs.LG)

Click to view abstract

Abstract:Continual learning is a challenge for models with static architecture, as they fail to adapt when data distributions evolve across tasks. We introduce a mathematical framework that jointly models architecture and weights in a Sobolev space, enabling a rigorous investigation into the role of neural network architecture in continual learning and its effect on the forgetting loss. We derive necessary conditions for the continual learning solution and prove that learning only model weights is insufficient to mitigate catastrophic forgetting under distribution shifts. Consequently, we prove that by learning the architecture and weights simultaneously at each task, we can reduce catastrophic forgetting. To learn weights and architecture simultaneously, we formulate continual learning as a bilevel optimization problem: the upper level selects an optimal architecture for a given task, while the lower level computes optimal weights via dynamic programming over all tasks. To solve the upper level problem, we introduce a derivative-free direct search algorithm to determine the optimal architecture. Once found, we must transfer knowledge from the current architecture to the optimal one. However, the optimal architecture will result in a weights parameter space different from the current architecture (i.e., dimensions of weights matrices will not match). To bridge the dimensionality gap, we develop a low-rank transfer mechanism to map knowledge across architectures of mismatched dimensions. Empirical studies across regression and classification problems, including feedforward, convolutional, and graph neural networks, demonstrate that learning the optimal architecture and weights simultaneously yields substantially improved performance (up to two orders of magnitude), reduced forgetting, and enhanced robustness to noise compared with static architecture approaches.

[LG-10] Provable Learning of Random Hierarchy Models and Hierarchical Shallow-to-Deep Chaining

Link: https://arxiv.org/abs/2601.19756
Authors: Yunwei Ren, Yatin Dandi, Florent Krzakala, Jason D. Lee
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Click to view abstract

Abstract:The empirical success of deep learning is often attributed to deep networks’ ability to exploit hierarchical structure in data, constructing increasingly complex features across layers. Yet despite substantial progress in deep learning theory, most optimization results still focus on networks with only two or three layers, leaving the theoretical understanding of hierarchical learning in genuinely deep models limited. This leads to a natural question: can we prove that deep networks, trained by gradient-based methods, can efficiently exploit hierarchical structure? In this work, we consider Random Hierarchy Models – a hierarchical context-free grammar introduced by arXiv:2307.02129 and conjectured to separate deep and shallow networks. We prove that, under mild conditions, a deep convolutional network can be efficiently trained to learn this function class. Our proof builds on a general observation: if intermediate layers can receive clean signal from the labels and the relevant features are weakly identifiable, then layerwise training each individual layer suffices to hierarchically learn the target function.

[LG-11] GraphDLG: Exploring Deep Leakage from Gradients in Federated Graph Learning

Link: https://arxiv.org/abs/2601.19745
Authors: Shuyue Wei, Wantong Chen, Tongyu Wei, Chen Gong, Yongxin Tong, Lizhen Cui
Subjects: Machine Learning (cs.LG)

Click to view abstract

Abstract:Federated graph learning (FGL) has recently emerged as a promising privacy-preserving paradigm that enables distributed graph learning across multiple data owners. A critical privacy concern in federated learning is whether an adversary can recover raw data from shared gradients, a vulnerability known as deep leakage from gradients (DLG). However, most prior studies on the DLG problem focused on image or text data, and it remains an open question whether graphs can be effectively recovered, particularly when the graph structure and node features are uniquely entangled in GNNs. In this work, we first theoretically analyze the components in FGL and derive a crucial insight: once the graph structure is recovered, node features can be obtained through a closed-form recursive rule. Building on this analysis, we propose GraphDLG, a novel approach to recover raw training graphs from shared gradients in FGL, which can utilize randomly generated graphs or client-side training graphs as auxiliaries to enhance recovery. Extensive experiments demonstrate that GraphDLG outperforms existing solutions by successfully decoupling the graph structure and node features, achieving improvements of over 5.46% (by MSE) for node feature reconstruction and over 25.04% (by AUC) for graph structure reconstruction.

[LG-12] Stability and Generalization of Nonconvex Optimization with Heavy-Tailed Noise

Link: https://arxiv.org/abs/2601.19730
Authors: Hongxu Chen, Ke Wei, Xiaoming Yuan, Luo Luo
Subjects: Machine Learning (cs.LG)

Click to view abstract

Abstract:Empirical evidence indicates that stochastic optimization with heavy-tailed gradient noise is more appropriate to characterize the training of machine learning models than that with standard bounded gradient variance noise. Most existing works on this phenomenon focus on the convergence of optimization errors, while the analysis for generalization bounds under the heavy-tailed gradient noise remains limited. In this paper, we develop a general framework for establishing generalization bounds under heavy-tailed noise. Specifically, we introduce a truncation argument to achieve the generalization error bound based on the algorithmic stability under the assumption of bounded $p$-th centered moment with $p \in (1,2]$. Building on this framework, we further provide the stability and generalization analysis for several popular stochastic algorithms under heavy-tailed noise, including clipped and normalized stochastic gradient descent, as well as their mini-batch and momentum variants.
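
For readers unfamiliar with the two algorithm families named in the abstract, the following is a minimal sketch of the textbook clipped and normalized SGD update rules. It is not taken from the paper: the function names, the threshold `tau`, and the stabilizer `eps` are illustrative assumptions, and the mini-batch and momentum variants are not shown.

```python
import math


def l2_norm(grad):
    """Euclidean norm of a gradient given as a list of floats."""
    return math.sqrt(sum(g * g for g in grad))


def clipped_sgd_step(weights, grad, lr, tau):
    """One clipped SGD step: rescale the gradient so its norm is at most tau.

    `tau` is a hypothetical clipping threshold chosen by the user.
    """
    norm = l2_norm(grad)
    scale = 1.0 if norm <= tau else tau / norm
    return [w - lr * scale * g for w, g in zip(weights, grad)]


def normalized_sgd_step(weights, grad, lr, eps=1e-12):
    """One normalized SGD step: move a fixed distance lr along -grad / ||grad||.

    `eps` is an assumed stabilizer that avoids division by zero.
    """
    norm = l2_norm(grad)
    return [w - lr * g / (norm + eps) for w, g in zip(weights, grad)]
```

Both rules bound the step length regardless of how large a single stochastic gradient is, which is why they are the usual choice when gradient noise has heavy tails.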

[LG-13] Improving Policy Exploitation in Online Reinforcement Learning with Instant Retrospect Action

Link: https://arxiv.org/abs/2601.19720
Authors: Gong Gao, Weidong Zhao, Xianhui Liu, Ning Jia
Subjects: Machine Learning (cs.LG)

Click to view abstract

Abstract:Existing value-based online reinforcement learning (RL) algorithms suffer from slow policy exploitation due to ineffective exploration and delayed policy updates. To address these challenges, we propose an algorithm called Instant Retrospect Action (IRA). Specifically, we propose Q-Representation Discrepancy Evolution (RDE) to facilitate Q-network representation learning, enabling discriminative representations for neighboring state-action pairs. In addition, we impose explicit policy constraints through Greedy Action Guidance (GAG). This is achieved by backtracking historical actions, which effectively enhances the policy update process. Our proposed method relies on providing the learning algorithm with accurate k-nearest-neighbor action value estimates and learning to design a fast-adaptable policy through policy constraints. We further propose the Instant Policy Update (IPU) mechanism, which enhances policy exploitation by systematically increasing the frequency of policy updates. We also find that the early-stage training conservatism of the IRA method can alleviate the overestimation bias problem in value-based RL. Experimental results show that IRA can significantly improve the learning efficiency and final performance of online RL algorithms on eight MuJoCo continuous control tasks.

[LG-14] Rethinking Divisive Hierarchical Clustering from a Distributional Perspective

Link: https://arxiv.org/abs/2601.19718
Authors: Kaifeng Zhang, Kai Ming Ting, Tianrun Liang, Qiuran Zhao
Subjects: Machine Learning (cs.LG)

Click to view abstract

Abstract:We find that current objective-based Divisive Hierarchical Clustering (DHC) methods produce a dendrogram that lacks three desired properties, i.e., no unwarranted splitting, grouping similar clusters into the same subset, and ground-truth correspondence. This shortcoming has its root cause in the use of a set-oriented bisecting assessment criterion. We show that this shortcoming can be addressed by using a distributional kernel instead of the set-oriented criterion, and that the resultant clusters achieve a new distribution-oriented objective to maximize the total similarity of all clusters (TSC). Our theoretical analysis shows that the resultant dendrogram guarantees a lower bound of TSC. The empirical evaluation shows the effectiveness of our proposed method on artificial and Spatial Transcriptomics (bioinformatics) datasets. Our proposed method successfully creates a dendrogram that is consistent with the biological regions in a Spatial Transcriptomics dataset, whereas other contenders fail.

[LG-15] Scalable Exploration for High-Dimensional Continuous Control via Value-Guided Flow ICLR2026

Link: https://arxiv.org/abs/2601.19707
Authors: Yunyue Wei, Chenhui Zuo, Yanan Sui
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted by ICLR 2026

Click to view abstract

Abstract:Controlling high-dimensional systems in biological and robotic applications is challenging due to expansive state-action spaces, where effective exploration is critical. Commonly used exploration strategies in reinforcement learning are largely undirected with sharp degradation as action dimensionality grows. Many existing methods resort to dimensionality reduction, which constrains policy expressiveness and forfeits system flexibility. We introduce Q-guided Flow Exploration (Qflex), a scalable reinforcement learning method that conducts exploration directly in the native high-dimensional action space. During training, Qflex traverses actions from a learnable source distribution along a probability flow induced by the learned value function, aligning exploration with task-relevant gradients rather than isotropic noise. Our proposed method substantially outperforms representative online reinforcement learning baselines across diverse high-dimensional continuous-control benchmarks. Qflex also successfully controls a full-body human musculoskeletal model to perform agile, complex movements, demonstrating superior scalability and sample efficiency in very high-dimensional settings. Our results indicate that value-guided flows offer a principled and practical route to exploration at scale.

[LG-16] LoPRo: Enhancing Low-Rank Quantization via Permuted Block-Wise Rotation

Link: https://arxiv.org/abs/2601.19675
Authors: Hongyaoxing Gu, Lijuan Hu, Liye Yu, Haowei Li, Fangfang Liu
Subjects: Machine Learning (cs.LG)

Click to view abstract

Abstract:Post-training quantization (PTQ) enables effective model compression while preserving relatively high accuracy. Current weight-only PTQ methods primarily focus on the challenging sub-3-bit regime, where approaches often suffer significant accuracy degradation, typically requiring fine-tuning to achieve competitive performance. In this work, we revisit the fundamental characteristics of weight quantization and analyze the challenges in quantizing the residual matrix under low-rank approximation. We propose LoPRo, a novel fine-tuning-free PTQ algorithm that enhances residual matrix quantization by applying block-wise permutation and Walsh-Hadamard transformations to rotate columns of similar importance, while explicitly preserving the quantization accuracy of the most salient column blocks. Furthermore, we introduce a mixed-precision fast low-rank decomposition based on rank-1 sketch (R1SVD) to further minimize quantization costs. Experiments demonstrate that LoPRo outperforms existing fine-tuning-free PTQ methods at both 2-bit and 3-bit quantization, achieving accuracy comparable to fine-tuning baselines. Specifically, LoPRo achieves state-of-the-art quantization accuracy on LLaMA-2 and LLaMA-3 series models while delivering up to a 4× speedup. In the MoE model Mixtral-8x7B, LoPRo completes quantization within 2.5 hours, simultaneously reducing perplexity by 0.4 and improving accuracy by 8%. Moreover, compared to other low-rank quantization methods, LoPRo achieves superior accuracy with a significantly lower rank, while maintaining high inference efficiency and minimal additional latency.
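
The Walsh-Hadamard rotation mentioned in the abstract can be illustrated with a generic sketch. The code below is a standard orthonormal fast Walsh-Hadamard transform applied block by block to one weight row; it is not LoPRo itself. The permutation step, the handling of salient column blocks, and R1SVD are not reproduced, and the names `fwht` and `blockwise_rotate` as well as the block size are assumptions made for illustration.

```python
import math


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform of a list of floats.

    The length must be a power of two. Because the transform is scaled by
    1/sqrt(n), it preserves the Euclidean norm and is its own inverse.
    """
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        # Butterfly step: combine entries that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def blockwise_rotate(row, block):
    """Apply the transform independently to consecutive blocks of one row."""
    if len(row) % block:
        raise ValueError("row length must be a multiple of the block size")
    out = []
    for start in range(0, len(row), block):
        out.extend(fwht(row[start:start + block]))
    return out
```

With a block size of 4, the row `[8, 0, 0, 0, 1, 1, 1, 1]` becomes `[4, 4, 4, 4, 2, 0, 0, 0]`: the single outlier in the first block is spread evenly across its block, which is the usual motivation for rotating weights before low-bit quantization.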

[LG-17] Grasynda: Graph-based Synthetic Time Series Generation

Link: https://arxiv.org/abs/2601.19668
Authors: Luis Amorim, Moises Santos, Paulo J. Azevedo, Carlos Soares, Vitor Cerqueira
Subjects: Machine Learning (cs.LG)
Comments: Accepted in IDA’26

Click to view abstract

Abstract:Data augmentation is a crucial tool in time series forecasting, especially for deep learning architectures that require a large training sample size to generalize effectively. However, extensive datasets are not always available in real-world scenarios. Although many data augmentation methods exist, their limitations include the use of transformations that do not adequately preserve data properties. This paper introduces Grasynda, a novel graph-based approach for synthetic time series generation that: (1) converts univariate time series into a network structure using a graph representation, where each state is a node and each transition is represented as a directed edge; and (2) encodes their temporal dynamics in a transition probability matrix. We performed an extensive evaluation of Grasynda as a data augmentation method for time series forecasting. We use three neural network variations on six benchmark datasets. The results indicate that Grasynda consistently outperforms other time series data augmentation methods, including ones used in state-of-the-art time series foundation models. The method and all experiments are publicly available.
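
The two steps described in the abstract (states as nodes, transitions as weighted directed edges) can be sketched as a simple Markov-chain generator. This is an illustrative reading, not the authors' implementation: the equal-frequency binning, the way sampled states are mapped back to values, and the names `fit_transition_matrix` and `generate` are all assumptions.

```python
import bisect
import random


def fit_transition_matrix(series, n_states):
    """Discretize a univariate series into states and estimate transitions.

    Assumption: states are equal-frequency (quantile) bins of the values.
    """
    ordered = sorted(series)
    edges = [ordered[(len(ordered) * k) // n_states] for k in range(1, n_states)]
    states = [bisect.bisect_right(edges, x) for x in series]
    # counts[a][b] is the weight of the directed edge a -> b.
    counts = [[0] * n_states for _ in range(n_states)]
    for src, dst in zip(states, states[1:]):
        counts[src][dst] += 1
    # Rows without outgoing edges fall back to the empirical state frequencies.
    freq = [states.count(s) / len(states) for s in range(n_states)]
    probs = []
    for row in counts:
        total = sum(row)
        probs.append([c / total for c in row] if total else list(freq))
    return edges, states, probs


def generate(series, n_states, length, seed=0):
    """Random-walk over the state graph and emit observed values per state."""
    rng = random.Random(seed)
    _, states, probs = fit_transition_matrix(series, n_states)
    members = [[] for _ in range(n_states)]
    for s, x in zip(states, series):
        members[s].append(x)
    state = states[0]
    synthetic = []
    for _ in range(length):
        # Assumption: a state is decoded by resampling a value seen in that state.
        synthetic.append(rng.choice(members[state]))
        state = rng.choices(range(n_states), weights=probs[state])[0]
    return synthetic
```

Calling `generate(series, n_states=10, length=len(series))` would return a synthetic series of the same length whose state-to-state dynamics follow the estimated transition matrix.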

[LG-18] The Geometric Mechanics of Contrastive Representation Learning: Alignment Potentials, Entropic Dispersion, and Cross-Modal Divergence

Link: https://arxiv.org/abs/2601.19597
Authors: Yichao Cai, Zhen Zhang, Yuhang Liu, Javen Qinfeng Shi
Subjects: Machine Learning (cs.LG)

Click to view abstract

Abstract:While InfoNCE powers modern contrastive learning, its geometric mechanisms remain under-characterized beyond the canonical alignment–uniformity decomposition. We present a measure-theoretic framework that models learning as the evolution of representation measures on a fixed embedding manifold. By establishing value and gradient consistency in the large-batch limit, we bridge the stochastic objective to explicit deterministic energy landscapes, uncovering a fundamental geometric bifurcation between the unimodal and multimodal regimes. In the unimodal setting, the intrinsic landscape is strictly convex with a unique Gibbs equilibrium; here, entropy acts merely as a tie-breaker, clarifying “uniformity” as a constrained expansion within the alignment basin. In contrast, the symmetric multimodal objective contains a persistent negative symmetric divergence term that remains even after kernel sharpening. We show that this term induces barrier-driven co-adaptation, enforcing a population-level modality gap as a structural geometric necessity rather than an initialization artifact. Our results shift the analytical lens from pointwise discrimination to population geometry, offering a principled basis for diagnosing and controlling distributional misalignment.

[LG-19] GenCP: Towards Generative Modeling Paradigm of Coupled Physics ICLR2026

Link: https://arxiv.org/abs/2601.19541
Authors: Tianrun Gao, Haoren Zheng, Wenhao Deng, Haodong Feng, Tao Zhang, Ruiqi Feng, Qianyi Chen, Tailin Wu
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments: Accepted at ICLR 2026

Click to view abstract

Abstract:Real-world physical systems are inherently complex, often involving the coupling of multiple physics, making their simulation both highly valuable and challenging. Many mainstream approaches face challenges when dealing with decoupled data. Besides, they also suffer from low efficiency and fidelity in strongly coupled spatio-temporal physical systems. Here we propose GenCP, a novel and elegant generative paradigm for coupled multiphysics simulation. By formulating coupled-physics modeling as a probability modeling problem, our key innovation is to integrate probability density evolution in generative modeling with iterative multiphysics coupling, thereby enabling training on data from decoupled simulation and inferring coupled physics during sampling. We also utilize operator-splitting theory in the space of probability evolution to establish error controllability guarantees for this “conditional-to-joint” sampling scheme. We evaluate our paradigm on a synthetic setting and three challenging multi-physics scenarios to demonstrate both principled insight and superior application performance of GenCP. Code is available at this repo: this http URL.

[LG-20] Posterior Distribution-assisted Evolutionary Dynamic Optimization as an Online Calibrator for Complex Social Simulations

Link: https://arxiv.org/abs/2601.19481
Authors: Peng Yang, Zhenhua Yang, Boquan Jiang, Chenkai Wang, Ke Tang, Xin Yao
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)

Click to view abstract

Abstract:The calibration of simulators for complex social systems aims to identify the optimal parameters for which the simulator’s output best matches the target data observed from the system. As many social systems may change internally over time, calibration naturally becomes an online task, requiring parameters to be updated continuously to maintain the simulator’s fidelity. In this work, the online setting is first formulated as a dynamic optimization problem (DOP), requiring the search for a sequence of optimal parameters that fit the simulator to real system changes. However, in contrast to traditional DOP formulations, online calibration explicitly incorporates the observational data as the driver of environmental dynamics. Due to this fundamental difference, existing Evolutionary Dynamic Optimization (EDO) methods, despite being extensively studied for black-box DOPs, are ill-equipped to handle such a scenario. As a result, online calibration problems constitute a new set of challenging DOPs. Here, we propose to explicitly learn the posterior distributions of the parameters and the observational data, thereby facilitating both change detection and environmental adaptation of existing EDOs for this scenario. We thus present a pretrained posterior model for implementation, and fine-tune it during the optimization. Extensive tests on both economic and financial simulators verify that the posterior distribution substantially improves EDOs on such DOPs, which are widespread in social science.

[LG-21] On the Expressiveness of State Space Models via Temporal Logics

Link: https://arxiv.org/abs/2601.19467
Authors: Eric Alsmann, Lowejatan Noori, Martin Lange
Subjects: Logic in Computer Science (cs.LO); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)

Click to view abstract

Abstract:We investigate the expressive power of state space models (SSM), which have recently emerged as a potential alternative to transformer architectures in large language models. Building on recent work, we analyse SSM expressiveness through fragments and extensions of linear temporal logic over finite traces. Our results show that the expressive capabilities of SSM vary substantially depending on the underlying gating mechanism. We further distinguish between SSM operating over fixed-width arithmetic (quantised models), whose expressive power remains within regular languages, and SSM with unbounded precision, which can capture counting properties and non-regular languages. In addition, we provide a systematic comparison between these different SSM variants and known results on transformers, thereby clarifying how the two architectures relate in terms of expressive power.

[LG-22] Fixed Aggregation Features Can Rival GNNs

Link: https://arxiv.org/abs/2601.19449
Authors: Celia Rubio-Madrigal, Rebekka Burkholz
Subjects: Machine Learning (cs.LG)

Click to view abstract

Abstract:Graph neural networks (GNNs) are widely believed to excel at node representation learning through trainable neighborhood aggregations. We challenge this view by introducing Fixed Aggregation Features (FAFs), a training-free approach that transforms graph learning tasks into tabular problems. This simple shift enables the use of well-established tabular methods, offering strong interpretability and the flexibility to deploy diverse classifiers. Across 14 benchmarks, well-tuned multilayer perceptrons trained on FAFs rival or outperform state-of-the-art GNNs and graph transformers on 12 tasks – often using only mean aggregation. The only exceptions are the Roman Empire and Minesweeper datasets, which typically require unusually deep GNNs. To explain the theoretical possibility of non-trainable aggregations, we connect our findings to Kolmogorov-Arnold representations and discuss when mean aggregation can be sufficient. In conclusion, our results call for (i) richer benchmarks benefiting from learning diverse neighborhood aggregations, (ii) strong tabular baselines as standard, and (iii) employing and advancing tabular models for graph data to gain new insights into related tasks.
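As a rough illustration of the idea described in this abstract, the sketch below builds tabular features by repeatedly averaging neighbor features with no trainable weights. The function names (`mean_aggregate`, `fixed_aggregation_features`), the `hops` parameter, and the choice to concatenate every hop are assumptions made for this example; the abstract does not spell out the authors' exact feature construction.

```python
def mean_aggregate(features, neighbors):
    """One round of fixed (non-trainable) mean aggregation over each node's neighbors."""
    dim = len(features[0])
    out = []
    for nbrs in neighbors:
        if not nbrs:
            out.append([0.0] * dim)  # isolated node: no neighbor signal
            continue
        out.append([sum(features[j][d] for j in nbrs) / len(nbrs) for d in range(dim)])
    return out


def fixed_aggregation_features(features, neighbors, hops=2):
    """Concatenate raw features with their 1..hops-hop mean aggregates (assumed layout)."""
    rows = [list(f) for f in features]
    current = features
    for _ in range(hops):
        current = mean_aggregate(current, neighbors)
        for row, agg in zip(rows, current):
            row.extend(agg)
    return rows


# Toy path graph 0 - 1 - 2 with one scalar feature per node.
features = [[1.0], [2.0], [3.0]]
neighbors = [[1], [0, 2], [1]]
table = fixed_aggregation_features(features, neighbors, hops=2)
```

Each row of `table` is an ordinary feature vector, so it could be handed to any tabular classifier (an MLP, gradient-boosted trees, and so on) without further graph processing.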

[LG-23] From Internal Diagnosis to External Auditing: A VLM-Driven Paradigm for Online Test-Time Backdoor Defense

Link: https://arxiv.org/abs/2601.19448
Authors: Binyan Xu, Fan Yang, Xilin Dai, Di Tang, Kehuan Zhang
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Notes: 19 pages, 10 figures, 12 tables

Abstract:Deep Neural Networks remain inherently vulnerable to backdoor attacks. Traditional test-time defenses largely operate under the paradigm of internal diagnosis methods like model repairing or input robustness, yet these approaches are often fragile under advanced attacks as they remain entangled with the victim model’s corrupted parameters. We propose a paradigm shift from Internal Diagnosis to External Semantic Auditing, arguing that effective defense requires decoupling safety from the victim model via an independent, semantically grounded auditor. To this end, we present a framework harnessing Universal Vision-Language Models (VLMs) as evolving semantic gatekeepers. We introduce PRISM (Prototype Refinement Inspection via Statistical Monitoring), which overcomes the domain gap of general VLMs through two key mechanisms: a Hybrid VLM Teacher that dynamically refines visual prototypes online, and an Adaptive Router powered by statistical margin monitoring to calibrate gating thresholds in real-time. Extensive evaluation across 17 datasets and 11 attack types demonstrates that PRISM achieves state-of-the-art performance, suppressing Attack Success Rate to 1% on CIFAR-10 while improving clean accuracy, establishing a new standard for model-agnostic, externalized security.

[LG-24] OSIRIS: Bridging Analog Circuit Design and Machine Learning with Scalable Dataset Generation

Link: https://arxiv.org/abs/2601.19439
Authors: Giuseppe Chiari, Michele Piccoli, Davide Zoni
Categories: Machine Learning (cs.LG)

Abstract:The automation of analog integrated circuit (IC) design remains a longstanding challenge, primarily due to the intricate interdependencies among physical layout, parasitic effects, and circuit-level performance. These interactions impose complex constraints that are difficult to accurately capture and optimize using conventional design methodologies. Although recent advances in machine learning (ML) have shown promise in automating specific stages of the analog design flow, the development of holistic, end-to-end frameworks that integrate these stages and iteratively refine layouts using post-layout, parasitic-aware performance feedback is still in its early stages. Furthermore, progress in this direction is hindered by the limited availability of open, high-quality datasets tailored to the analog domain, restricting both the benchmarking and the generalizability of ML-based techniques. To address these limitations, we present OSIRIS, a scalable dataset generation pipeline for analog IC design. OSIRIS systematically explores the design space of analog circuits while producing comprehensive performance metrics and metadata, thereby enabling ML-driven research in electronic design automation (EDA). In addition, we release a dataset consisting of 87,100 circuit variations generated with OSIRIS, accompanied by a reinforcement learning (RL)-based baseline method that exploits OSIRIS for analog design optimization.

[LG-25] Task-Centric Policy Optimization from Misaligned Motion Priors

Link: https://arxiv.org/abs/2601.19411
Authors: Ziang Zheng, Kai Feng, Yi Nie, Shentao Qin
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract: Humanoid control often leverages motion priors from human demonstrations to encourage natural behaviors. However, such demonstrations are frequently suboptimal or misaligned with robotic tasks due to embodiment differences, retargeting errors, and task-irrelevant variations, causing naïve imitation to degrade task performance. Conversely, task-only reinforcement learning admits many task-optimal solutions, often resulting in unnatural or unstable motions. This exposes a fundamental limitation of linear reward mixing in adversarial imitation learning. We propose Task-Centric Motion Priors (TCMP), a task-priority adversarial imitation framework that treats imitation as a conditional regularizer rather than a co-equal objective. TCMP maximizes task improvement while incorporating imitation signals only when they are compatible with task progress, yielding an adaptive, geometry-aware update that preserves task-feasible descent and suppresses harmful imitation under misalignment. We provide theoretical analysis of gradient conflict and task-priority stationary points, and validate our claims through humanoid control experiments demonstrating robust task performance with consistent motion style under noisy demonstrations.

[LG-26] SEAFormer: A Spatial Proximity and Edge-Aware Transformer for Real-World Vehicle Routing Problems

Link: https://arxiv.org/abs/2601.19395
Authors: Saeed Nasehi Basharzad, Farhana Choudhury, Egemen Tanin
Categories: Machine Learning (cs.LG)
Notes: 26 pages

Abstract:Real-world Vehicle Routing Problems (RWVRPs) require solving complex, sequence-dependent challenges at scale with constraints such as delivery time window, replenishment or recharging stops, asymmetric travel cost, etc. While recent neural methods achieve strong results on large-scale classical VRP benchmarks, they struggle to address RWVRPs because their strategies overlook sequence dependencies and underutilize edge-level information, which are precisely the characteristics that define the complexity of RWVRPs. We present SEAFormer, a novel transformer that incorporates both node-level and edge-level information in decision-making through two key innovations. First, our Clustered Proximity Attention (CPA) exploits locality-aware clustering to reduce the complexity of attention from O(n^2) to O(n) while preserving global perspective, allowing SEAFormer to efficiently train on large instances. Second, our lightweight edge-aware module captures pairwise features through residual fusion, enabling effective incorporation of edge-based information and faster convergence. Extensive experiments across four RWVRP variants with various scales demonstrate that SEAFormer achieves superior results over state-of-the-art methods. Notably, SEAFormer is the first neural method to solve 1,000+ node RWVRPs effectively, while also achieving superior performance on classic VRPs, making it a versatile solution for both research benchmarks and real-world applications.

[LG-27] DSP-Reg: Domain-Sensitive Parameter Regularization for Robust Domain Generalization

Link: https://arxiv.org/abs/2601.19394
Authors: Xudong Han, Senkang Hu, Yihang Tao, Yu Guo, Philip Birch, Sam Tak Wu Kwong, Yuguang Fang
Categories: Machine Learning (cs.LG)

Abstract:Domain Generalization (DG) is a critical area that focuses on developing models capable of performing well on data from unseen distributions, which is essential for real-world applications. Existing approaches primarily concentrate on learning domain-invariant features, which assume that a model robust to variations in the source domains will generalize well to unseen target domains. However, these approaches neglect a deeper analysis at the parameter level, which makes the model hard to explicitly differentiate between parameters sensitive to domain shifts and those robust, potentially hindering its overall ability to generalize. In order to address these limitations, we first build a covariance-based parameter sensitivity analysis framework to quantify the sensitivity of each parameter in a model to domain shifts. By computing the covariance of parameter gradients across multiple source domains, we can identify parameters that are more susceptible to domain variations, which serves as our theoretical foundation. Based on this, we propose Domain-Sensitive Parameter Regularization (DSP-Reg), a principled framework that guides model optimization by a soft regularization technique that encourages the model to rely more on domain-invariant parameters while suppressing those that are domain-specific. This approach provides a more granular control over the model’s learning process, leading to improved robustness and generalization to unseen domains. Extensive experiments on benchmarks, such as PACS, VLCS, OfficeHome, and DomainNet, demonstrate that DSP-Reg outperforms state-of-the-art approaches, achieving an average accuracy of 66.7% and surpassing all baselines.
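The sensitivity analysis mentioned in this abstract can be pictured with a small sketch. It assumes that only the diagonal of the gradient covariance (the per-parameter variance across source domains) is used as the sensitivity score, and it pairs that score with a hypothetical weighted L2 penalty. Both `gradient_variance` and `sensitivity_penalty` are illustrative names and forms, not the authors' published formulation.

```python
def gradient_variance(domain_grads):
    """Per-parameter variance of gradients across domains (diagonal of the covariance)."""
    n_domains = len(domain_grads)
    n_params = len(domain_grads[0])
    variances = []
    for p in range(n_params):
        values = [grads[p] for grads in domain_grads]
        mean = sum(values) / n_domains
        variances.append(sum((v - mean) ** 2 for v in values) / n_domains)
    return variances


def sensitivity_penalty(params, sensitivities, strength=1.0):
    """Hypothetical soft penalty: shrink parameters in proportion to their sensitivity."""
    return strength * sum(s * w ** 2 for s, w in zip(sensitivities, params))


# Three source domains, two parameters. The first parameter receives the same
# gradient in every domain; the second one disagrees across domains.
domain_grads = [[0.5, 1.0], [0.5, -1.0], [0.5, 0.0]]
sens = gradient_variance(domain_grads)
penalty = sensitivity_penalty([2.0, 3.0], sens)
```

In this toy setup the first parameter gets zero sensitivity and is left unpenalized, while the second one, whose gradient changes sign between domains, carries the whole penalty.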

[LG-28] High-quality data augmentation for code comment classification ICSE

Link: https://arxiv.org/abs/2601.19383
Authors: Thomas Borsani, Andrea Rosani, Giuseppe Di Fatta
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Notes: Accepted at the NLBSE Workshop (co-located with ICSE)

Abstract: Code comments serve a crucial role in software development for documenting functionality, clarifying design choices, and assisting with issue tracking. They capture developers’ insights about the surrounding source code, serving as an essential resource for both human comprehension and automated analysis. Nevertheless, since comments are in natural language, they present challenges for machine-based code understanding. To address this, recent studies have applied natural language processing (NLP) and deep learning techniques to classify comments according to developers’ intentions. However, existing datasets for this task suffer from size limitations and class imbalance, as they rely on manual annotations and may not accurately represent the distribution of comments in real-world codebases. To overcome this issue, we introduce new synthetic oversampling and augmentation techniques based on high-quality data generation to enhance the NLBSE’26 challenge datasets. Our Synthetic Quality Oversampling Technique and Augmentation Technique (Q-SYNTH) yield promising results, improving the base classifier by 2.56%.

[LG-29] CHEHAB RL: Learning to Optimize Fully Homomorphic Encryption Computations

Link: https://arxiv.org/abs/2601.19367
Authors: Bilel Sefsaf, Abderraouf Dandani, Abdessamed Seddiki, Arab Mohammed, Eduardo Chielle, Michail Maniatakos, Riyadh Baghdadi
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract: Fully Homomorphic Encryption (FHE) enables computations directly on encrypted data, but its high computational cost remains a significant barrier. Writing efficient FHE code is a complex task requiring cryptographic expertise, and finding the optimal sequence of program transformations is often intractable. In this paper, we propose CHEHAB RL, a novel framework that leverages deep reinforcement learning (RL) to automate FHE code optimization. Instead of relying on predefined heuristics or combinatorial search, our method trains an RL agent to learn an effective policy for applying a sequence of rewriting rules to automatically vectorize scalar FHE code while reducing instruction latency and noise growth. The proposed approach supports the optimization of both structured and unstructured code. To train the agent, we synthesize a diverse dataset of computations using a large language model (LLM). We integrate our proposed approach into the CHEHAB FHE compiler and evaluate it on a suite of benchmarks, comparing its performance against Coyote, a state-of-the-art vectorizing FHE compiler. The results show that our approach generates code that is $5.3\times$ faster in execution and accumulates $2.54\times$ less noise, while the compilation process itself is $27.9\times$ faster than Coyote (geometric means).

[LG-30] GraphSB: Boosting Imbalanced Node Classification on Graphs through Structural Balance

Link: https://arxiv.org/abs/2601.19352
Authors: Zhixiao Wang, Chaofan Zhu, Qihan Feng, Jian Zhang, Xiaobin Rui, Philip S Yu
Categories: Machine Learning (cs.LG)

Abstract:Imbalanced node classification is a critical challenge in graph learning, where most existing methods typically utilize Graph Neural Networks (GNNs) to learn node representations. These methods can be broadly categorized into the data-level and the algorithm-level. The former aims to synthesize minority-class nodes to mitigate quantity imbalance, while the latter tries to optimize the learning process to highlight minority classes. However, neither of them addresses the inherently imbalanced graph structure, which is a fundamental factor that incurs majority-class dominance and minority-class assimilation in GNNs. Our theoretical analysis further supports this critical insight. Therefore, we propose GraphSB (Graph Structural Balance), a novel framework that incorporates Structural Balance as a key strategy to address the underlying imbalanced graph structure before node synthesis. Structural Balance performs a two-stage structure optimization: Structure Enhancement that mines hard samples near decision boundaries through dual-view analysis and enhances connectivity for minority classes through adaptive augmentation, and Relation Diffusion that propagates the enhanced minority context while simultaneously capturing higher-order structural dependencies. Thus, GraphSB balances structural distribution before node synthesis, enabling more effective learning in GNNs. Extensive experiments demonstrate that GraphSB significantly outperforms the state-of-the-art methods. More importantly, the proposed Structural Balance can be seamlessly integrated into state-of-the-art methods as a simple plug-and-play module, increasing their accuracy by an average of 4.57%.

[LG-31] Metric k-clustering using only Weak Comparison Oracles

Link: https://arxiv.org/abs/2601.19333
Authors: Rahul Raychaudhury, Aryan Esmailpour, Sainyam Galhotra, Stavros Sintos
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)

Abstract: Clustering is a fundamental primitive in unsupervised learning. However, classical algorithms for $k$-clustering (such as $k$-median and $k$-means) assume access to exact pairwise distances – an unrealistic requirement in many modern applications. We study clustering in the Rank-model (R-model), where access to distances is entirely replaced by a quadruplet oracle that provides only relative distance comparisons. In practice, such an oracle can represent learned models or human feedback, and is expected to be noisy and entail an access cost. Given a metric space with $n$ input items, we design randomized algorithms that, using only a noisy quadruplet oracle, compute a set of $O(k \cdot \mathsf{polylog}(n))$ centers along with a mapping from the input items to the centers such that the clustering cost of the mapping is at most a constant times the optimum $k$-clustering cost. Our method achieves a query complexity of $O(n \cdot k \cdot \mathsf{polylog}(n))$ for arbitrary metric spaces and improves to $O((n+k^2) \cdot \mathsf{polylog}(n))$ when the underlying metric has bounded doubling dimension. When the metric has bounded doubling dimension we can further improve the approximation from constant to $1+\varepsilon$, for any arbitrarily small constant $\varepsilon \in (0,1)$, while preserving the same asymptotic query complexity. Our framework demonstrates how noisy, low-cost oracles, such as those derived from large language models, can be systematically integrated into scalable clustering algorithms. Journal reference: ICLR 2026

[LG-32] Generalizable IoT Traffic Representations for Cross-Network Device Identification

Link: https://arxiv.org/abs/2601.19315
Authors: Arunan Sivanathan, David Warren, Deepak Mishra, Sushmita Ruj, Natasha Fernandes, Quan Z. Sheng, Minh Tran, Ben Luo, Daniel Coscia, Gustavo Batista, Hassan Habibi Gharakaheili
Categories: Machine Learning (cs.LG)
Notes: 15 pages, 15 figures

Abstract:Machine learning models have demonstrated strong performance in classifying network traffic and identifying Internet-of-Things (IoT) devices, enabling operators to discover and manage IoT assets at scale. However, many existing approaches rely on end-to-end supervised pipelines or task-specific fine-tuning, resulting in traffic representations that are tightly coupled to labeled datasets and deployment environments, which can limit generalizability. In this paper, we study the problem of learning generalizable traffic representations for IoT device identification. We design compact encoder architectures that learn per-flow embeddings from unlabeled IoT traffic and evaluate them using a frozen-encoder protocol with a simple supervised classifier. Our specific contributions are threefold. (1) We develop unsupervised encoder–decoder models that learn compact traffic representations from unlabeled IoT network flows and assess their quality through reconstruction-based analysis. (2) We show that these learned representations can be used effectively for IoT device-type classification using simple, lightweight classifiers trained on frozen embeddings. (3) We provide a systematic benchmarking study against the state-of-the-art pretrained traffic encoders, showing that larger models do not necessarily yield more robust representations for IoT traffic. Using more than 18 million real IoT traffic flows collected across multiple years and deployment environments, we learn traffic representations from unlabeled data and evaluate device-type classification on disjoint labeled subsets, achieving macro F1-scores exceeding 0.9 for device-type classification and demonstrating robustness under cross-environment deployment.

[LG-33] LightSBB-M: Bridging Schrödinger and Bass for Generative Diffusion Modeling

Link: https://arxiv.org/abs/2601.19312
Authors: Alexandre Alouadi, Pierre Henry-Labordère, Grégoire Loeper, Othmane Mazhar, Huyên Pham, Nizar Touzi
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Computation (stat.CO); Machine Learning (stat.ML)

Abstract: The Schrödinger Bridge and Bass (SBB) formulation, which jointly controls drift and volatility, is an established extension of the classical Schrödinger Bridge (SB). Building on this framework, we introduce LightSBB-M, an algorithm that computes the optimal SBB transport plan in only a few iterations. The method exploits a dual representation of the SBB objective to obtain analytic expressions for the optimal drift and volatility, and it incorporates a tunable parameter beta greater than zero that interpolates between pure drift (the Schrödinger Bridge) and pure volatility (Bass martingale transport). We show that LightSBB-M achieves the lowest 2-Wasserstein distance on synthetic datasets against state-of-the-art SB and diffusion baselines with up to 32 percent improvement. We also illustrate the generative capability of the framework on an unpaired image-to-image translation task (adult to child faces in FFHQ). These findings demonstrate that LightSBB-M provides a scalable, high-fidelity SBB solver that outperforms existing SB and diffusion baselines across both synthetic and real-world generative tasks. The code is available at this https URL.

[LG-34] Queue Length Regret Bounds for Contextual Queueing Bandits

Link: https://arxiv.org/abs/2601.19300
Authors: Seoungbin Bae, Garyeong Kang, Dabeen Lee
Categories: Machine Learning (cs.LG)

Abstract: We introduce contextual queueing bandits, a new context-aware framework for scheduling while simultaneously learning unknown service rates. Individual jobs carry heterogeneous contextual features, based on which the agent chooses a job and matches it with a server to maximize the departure rate. The service/departure rate is governed by a logistic model of the contextual feature with an unknown server-specific parameter. To evaluate the performance of a policy, we consider queue length regret, defined as the difference in queue length between the policy and the optimal policy. The main challenge in the analysis is that the lists of remaining job features in the queue may differ under our policy versus the optimal policy for a given time step, since they may process jobs in different orders. To address this, we propose the idea of policy-switching queues equipped with a sophisticated coupling argument. This leads to a novel queue length regret decomposition framework, allowing us to understand the short-term effect of choosing a suboptimal job-server pair and its long-term effect on queue state differences. We show that our algorithm, CQB-$\varepsilon$, achieves a regret upper bound of $\widetilde{\mathcal{O}}(T^{-1/4})$. We also consider the setting of adversarially chosen contexts, for which our second algorithm, CQB-Opt, achieves a regret upper bound of $\mathcal{O}(\log^2 T)$. Lastly, we provide experimental results that validate our theoretical findings.

[LG-35] Process-Aware Procurement Lead Time Prediction for Shipyard Delay Mitigation

Link: https://arxiv.org/abs/2601.19296
Authors: Yongjae Lee, Eunhee Park, Daesan Park, Dongho Kim, Jongho Choi, Hyerim Bae
Categories: Machine Learning (cs.LG)

Abstract:Accurately predicting procurement lead time (PLT) remains a challenge in engineered-to-order industries such as shipbuilding and plant construction, where delays in a single key component can disrupt project timelines. In shipyards, pipe spools are critical components; installed deep within hull blocks soon after steel erection, any delay in their procurement can halt all downstream tasks. Recognizing their importance, existing studies predict PLT using the static physical attributes of pipe spools. However, procurement is inherently a dynamic, multi-stakeholder business process involving a continuous sequence of internal and external events at the shipyard, factors often overlooked in traditional approaches. To address this issue, this paper proposes a novel framework that combines event logs, dataset records of the procurement events, with static attributes to predict PLT. The temporal attributes of each event are extracted to reflect the continuity and temporal context of the process. Subsequently, a deep sequential neural network combined with a multi-layered perceptron is employed to integrate these static and dynamic features, enabling the model to capture both structural and contextual information in procurement. Comparative experiments are conducted using real-world pipe spool procurement data from a globally renowned South Korean shipbuilding corporation. Three tasks are evaluated, which are production, post-processing, and procurement lead time prediction. The results show a 22.6% to 50.4% improvement in prediction performance in terms of mean absolute error over the best-performing existing approaches across the three tasks. These findings indicate the value of considering procurement process information for more accurate PLT prediction.

[LG-36] Smoothing the Score Function for Generalization in Diffusion Models: An Optimization-based Explanation Framework

Link: https://arxiv.org/abs/2601.19285
Authors: Xinyu Zhou, Jiawei Zhang, Stephen J. Wright
Categories: Machine Learning (cs.LG)
Notes: 61 pages, 32 figures

Abstract:Diffusion models achieve remarkable generation quality, yet face a fundamental challenge known as memorization, where generated samples can replicate training samples exactly. We develop a theoretical framework to explain this phenomenon by showing that the empirical score function (the score function corresponding to the empirical distribution) is a weighted sum of the score functions of Gaussian distributions, in which the weights are sharp softmax functions. This structure causes individual training samples to dominate the score function, resulting in sampling collapse. In practice, approximating the empirical score function with a neural network can partially alleviate this issue and improve generalization. Our theoretical framework explains why: In training, the neural network learns a smoother approximation of the weighted sum, allowing the sampling process to be influenced by local manifolds rather than single points. Leveraging this insight, we propose two novel methods to further enhance generalization: (1) Noise Unconditioning enables each training sample to adaptively determine its score function weight to increase the effect of more training samples, thereby preventing single-point dominance and mitigating collapse. (2) Temperature Smoothing introduces an explicit parameter to control the smoothness. By increasing the temperature in the softmax weights, we naturally reduce the dominance of any single training sample and mitigate memorization. Experiments across multiple datasets validate our theoretical analysis and demonstrate the effectiveness of the proposed methods in improving generalization while maintaining high generation quality.
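To make the softmax-weight structure described in this abstract concrete, here is a one-dimensional sketch. It assumes the empirical distribution is smoothed with Gaussian noise of scale `sigma`, so the score is a weighted sum of the per-sample Gaussian scores `(x_i - x) / sigma**2`. The way `temperature` divides the logits is an assumption about how Temperature Smoothing could enter; the abstract does not give the exact form.

```python
import math


def softmax_weights(x, samples, sigma, temperature=1.0):
    """Softmax over negative squared distances; temperature > 1 flattens the weights."""
    logits = [-((x - xi) ** 2) / (2.0 * sigma ** 2 * temperature) for xi in samples]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def empirical_score(x, samples, sigma, temperature=1.0):
    """Weighted sum of Gaussian scores centered at each training sample."""
    weights = softmax_weights(x, samples, sigma, temperature)
    return sum(w * (xi - x) / sigma ** 2 for w, xi in zip(weights, samples))


samples = [-1.0, 0.0, 2.0]
sharp = softmax_weights(0.1, samples, sigma=0.1, temperature=1.0)
smooth = softmax_weights(0.1, samples, sigma=0.1, temperature=50.0)
```

At low noise the nearest training point takes almost all of the weight in `sharp`, which is the single-point dominance the abstract links to memorization. Raising the temperature spreads the weight over more samples in `smooth`.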

[LG-37] Output Feedback Stabilization of Linear Systems via Policy Gradient Methods

Link: https://arxiv.org/abs/2601.19284
Authors: Ankang Zhang, Ming Chi, Xiaoling Wang, Lintao Ye
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
Notes: 31 pages, 2 figures

Abstract: Stabilizing a dynamical system is a fundamental problem that serves as a cornerstone for many complex tasks in the field of control systems. The problem becomes challenging when the system model is unknown. Among the Reinforcement Learning (RL) algorithms that have been successfully applied to solve problems pertaining to unknown linear dynamical systems, the policy gradient (PG) method stands out due to its ease of implementation and its ability to solve the problem in a model-free manner. However, most of the existing works on PG methods for unknown linear dynamical systems assume full-state feedback. In this paper, we take a step towards model-free learning for partially observable linear dynamical systems with output feedback and focus on the fundamental stabilization problem of the system. We propose an algorithmic framework that stretches the boundary of PG methods to the problem without global convergence guarantees. We show that by leveraging a zeroth-order PG update based on system trajectories and its convergence to stationary points, the proposed algorithms return a stabilizing output feedback policy for discrete-time linear dynamical systems. We also explicitly characterize the sample complexity of our algorithm and verify the effectiveness of the algorithm using numerical examples.
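The zeroth-order policy gradient update mentioned in this abstract can be illustrated on a toy problem. The sketch below is not the authors' algorithm: it assumes a scalar system `x[t+1] = a*x[t] + b*u[t]` with output `y = c*x`, a static output-feedback gain `u = -gain*y`, and a short finite-horizon cost as a stand-in objective. All numeric values and the names `rollout_cost` and `zeroth_order_gradient` are hypothetical.

```python
import random


def rollout_cost(gain, a=1.2, b=1.0, c=1.0, horizon=5, x0=1.0):
    """Sum of squared states along one trajectory under u = -gain * y."""
    x, cost = x0, 0.0
    for _ in range(horizon + 1):
        cost += x * x
        y = c * x          # only the output is observed
        u = -gain * y      # static output feedback
        x = a * x + b * u
    return cost


def zeroth_order_gradient(cost_fn, gain, radius, rng):
    """Two-point estimate that uses cost evaluations only, no model derivatives."""
    direction = rng.choice([-1.0, 1.0])
    delta = cost_fn(gain + radius * direction) - cost_fn(gain - radius * direction)
    return delta / (2.0 * radius) * direction


rng = random.Random(0)
gain = 0.0  # the open loop is unstable because |a| > 1
for _ in range(200):
    gain -= 0.01 * zeroth_order_gradient(rollout_cost, gain, radius=0.01, rng=rng)
closed_loop = 1.2 - 1.0 * gain * 1.0  # a - b * gain * c
```

The update only queries trajectory costs, which is what makes it model-free. In this toy case the learned gain moves the closed-loop pole `a - b*gain*c` inside the unit circle.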

[LG-38] Whitespaces Don’t Lie: Feature-Driven and Embedding-Based Approaches for Detecting Machine-Generated Code

Link: https://arxiv.org/abs/2601.19264
Authors: Syed Mehedi Hasan Nirob, Shamim Ehsan, Moqsadur Rahman, Summit Haque
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)

Abstract:Large language models (LLMs) have made it remarkably easy to synthesize plausible source code from natural language prompts. While this accelerates software development and supports learning, it also raises new risks for academic integrity, authorship attribution, and responsible AI use. This paper investigates the problem of distinguishing human-written from machine-generated code by comparing two complementary approaches: feature-based detectors built from lightweight, interpretable stylometric and structural properties of code, and embedding-based detectors leveraging pretrained code encoders. Using a recent large-scale benchmark dataset of 600k human-written and AI-generated code samples, we find that feature-based models achieve strong performance (ROC-AUC 0.995, PR-AUC 0.995, F1 0.971), while embedding-based models with CodeBERT embeddings are also very competitive (ROC-AUC 0.994, PR-AUC 0.994, F1 0.965). Analysis shows that features tied to indentation and whitespace provide particularly discriminative cues, whereas embeddings capture deeper semantic patterns and yield slightly higher precision. These findings underscore the trade-offs between interpretability and generalization, offering practical guidance for deploying robust code-origin detection in academic and industrial contexts.

[LG-39] E-QRGMM: Efficient Generative Metamodeling for Covariate-Dependent Uncertainty Quantification

Link: https://arxiv.org/abs/2601.19256
Authors: Zhiyang Liang, Qingkai Zhang
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract: Covariate-dependent uncertainty quantification in simulation-based inference is crucial for high-stakes decision-making but remains challenging due to the limitations of existing methods such as conformal prediction and classical bootstrap, which struggle with covariate-specific conditioning. We propose Efficient Quantile-Regression-Based Generative Metamodeling (E-QRGMM), a novel framework that accelerates the quantile-regression-based generative metamodeling (QRGMM) approach by integrating cubic Hermite interpolation with gradient estimation. Theoretically, we show that E-QRGMM preserves the convergence rate of the original QRGMM while reducing grid complexity from $O(n^{1/2})$ to $O(n^{1/5})$ for the majority of quantile levels, thereby substantially improving computational efficiency. Empirically, E-QRGMM achieves a superior trade-off between distributional accuracy and training speed compared to both QRGMM and other advanced deep generative models on synthetic and practical datasets. Moreover, by enabling bootstrap-based construction of confidence intervals for arbitrary estimands of interest, E-QRGMM provides a practical solution for covariate-dependent uncertainty quantification.
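Cubic Hermite interpolation, which this abstract names as the ingredient that lets a coarser quantile grid suffice, can be sketched as follows. The example interpolates a scalar quantile curve from values and slopes on a coarse grid of quantile levels. Here the slopes are supplied directly, whereas the abstract says they come from gradient estimation. The function names and the toy curve are assumptions for illustration only.

```python
def cubic_hermite(t, t0, t1, p0, p1, m0, m1):
    """Interpolate at t in [t0, t1] from endpoint values (p0, p1) and slopes (m0, m1)."""
    h = t1 - t0
    s = (t - t0) / h
    h00 = 2 * s ** 3 - 3 * s ** 2 + 1
    h10 = s ** 3 - 2 * s ** 2 + s
    h01 = -2 * s ** 3 + 3 * s ** 2
    h11 = s ** 3 - s ** 2
    return h00 * p0 + h10 * h * m0 + h01 * p1 + h11 * h * m1


def interpolate_quantile(level, grid, values, slopes):
    """Evaluate a quantile curve at `level` from a coarse grid of levels."""
    for i in range(len(grid) - 1):
        if grid[i] <= level <= grid[i + 1]:
            return cubic_hermite(level, grid[i], grid[i + 1],
                                 values[i], values[i + 1],
                                 slopes[i], slopes[i + 1])
    raise ValueError("level lies outside the grid")


# Toy quantile curve q(tau) = tau**3 sampled on three levels only.
grid = [0.2, 0.5, 0.8]
values = [tau ** 3 for tau in grid]
slopes = [3 * tau ** 2 for tau in grid]
estimate = interpolate_quantile(0.35, grid, values, slopes)
```

Because the interpolant matches both values and slopes at the grid points, it follows a smooth curve much more closely than piecewise-linear interpolation on the same grid, which is the intuition behind using fewer grid points.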

[LG-40] Contrast-Source-Based Physics-Driven Neural Network for Inverse Scattering Problems

Link: https://arxiv.org/abs/2601.19243
Authors: Yutong Du, Zicheng Liu
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

Abstract:Deep neural networks (DNNs) have recently been applied to inverse scattering problems (ISPs) due to their strong nonlinear mapping capabilities. However, supervised DNN solvers require large-scale datasets, which limits their generalization in practical applications. Untrained neural networks (UNNs) address this issue by updating weights from measured electric fields and prior physical knowledge, but existing UNN solvers suffer from long inference time. To overcome these limitations, this paper proposes a contrast-source-based physics-driven neural network (CSPDNN), which predicts the induced current distribution to improve efficiency and incorporates an adaptive total variation loss for robust reconstruction under varying contrast and noise conditions. The improved imaging performance is validated through comprehensive numerical simulations and experimental data.

[LG-41] Accelerated Multiple Wasserstein Gradient Flows for Multi-objective Distributional Optimization

Link: https://arxiv.org/abs/2601.19220
Authors: Dai Hai Nguyen, Duc Dung Nguyen, Atsuyoshi Nakamura, Hiroshi Mamitsuka
Categories: Machine Learning (cs.LG)

Abstract:We study multi-objective optimization over probability distributions in Wasserstein space. Recently, Nguyen et al. (2025) introduced the Multiple Wasserstein Gradient Descent (MWGraD) algorithm, which exploits the geometric structure of Wasserstein space to jointly optimize multiple objectives. Building on this approach, we propose an accelerated variant, A-MWGraD, inspired by Nesterov’s acceleration. We analyze the continuous-time dynamics and establish convergence to weakly Pareto optimal points in probability space. Our theoretical results show that A-MWGraD achieves a convergence rate of O(1/t^2) for geodesically convex objectives and O(e^{-\sqrt{\beta} t}) for \beta-strongly geodesically convex objectives, improving upon the O(1/t) rate of MWGraD in the geodesically convex setting. We further introduce a practical kernel-based discretization for A-MWGraD and demonstrate through numerical experiments that it consistently outperforms MWGraD in convergence speed and sampling efficiency on multi-target sampling tasks.

[LG-42] Foresight Learning for SEC Risk Prediction

Link: https://arxiv.org/abs/2601.19189
Authors: Benjamin Turtel, Paul Wilczewski, Danny Franklin, Kris Skotheim
Categories: Machine Learning (cs.LG)
Notes:

Click to view abstract

Abstract:Risk disclosures in SEC filings describe potential adverse events but rarely quantify their likelihood, limiting their usefulness for probabilistic analysis. A central obstacle is the absence of large-scale, risk-level supervision linking disclosed risks to realized outcomes. We introduce a fully automated data generation pipeline that converts qualitative SEC risk disclosures into temporally grounded supervision using only public data. For each filing, the pipeline generates firm-specific, time-bounded risk queries from the Risk Factors section and labels them by automatically resolving outcomes against subsequent disclosures. Using this dataset of risk queries and outcomes grounded in SEC filings, we train a compact large language model to estimate the probability that a disclosed risk will materialize within a specified horizon. Despite its modest size, the resulting model substantially improves over pretrained and heuristic baselines, and outperforms frontier general-purpose models, including GPT-5, on probabilistic accuracy and calibration. More broadly, this work demonstrates that Foresight Learning enables scalable and fully automated training of domain-specific expert models using only raw, chronological, in-domain text – without proprietary data, external corpora, or manual annotation. The resulting models achieve frontier-level performance while remaining deployable on a single GPU. This result suggests a general pathway for learning calibrated, decision-relevant signals from naturally occurring enterprise documents. To support transparency and reproducibility, we open-source the evaluation dataset used in this study. 
Evaluation Data: this https URL Data Generation Platform: this https URL SDK: this https URL

[LG-43] Learning Ordered Representations in Latent Space for Intrinsic Dimension Estimation via Principal Component Autoencoder

Link: https://arxiv.org/abs/2601.19179
Authors: Qipeng Zhan, Zhuoping Zhou, Zexuan Wang, Li Shen
Categories: Machine Learning (cs.LG)
Notes:

Click to view abstract

Abstract:Autoencoders have long been considered a nonlinear extension of Principal Component Analysis (PCA). Prior studies have demonstrated that linear autoencoders (LAEs) can recover the ordered, axis-aligned principal components of PCA by incorporating non-uniform \ell_2 regularization or by adjusting the loss function. However, these approaches become insufficient in the nonlinear setting, as the remaining variance cannot be properly captured independently of the nonlinear mapping. In this work, we propose a novel autoencoder framework that integrates non-uniform variance regularization with an isometric constraint. This design serves as a natural generalization of PCA, enabling the model to preserve key advantages, such as ordered representations and variance retention, while remaining effective for nonlinear dimensionality reduction tasks.

[LG-44] Analysis of Shuffling Beyond Pure Local Differential Privacy

Link: https://arxiv.org/abs/2601.19154
Authors: Shun Takagi, Seng Pei Liew
Categories: Data Structures and Algorithms (cs.DS); Cryptography and Security (cs.CR); Information Theory (cs.IT); Machine Learning (cs.LG)
Notes:

Click to view abstract

Abstract:Shuffling is a powerful way to amplify privacy of a local randomizer in private distributed data analysis, but existing analyses mostly treat the local differential privacy (DP) parameter \varepsilon_0 as the only knob and give generic upper bounds that can be loose and do not even characterize how shuffling amplifies privacy for basic mechanisms such as the Gaussian mechanism. We revisit the privacy blanket bound of Balle et al. (the blanket divergence) and develop an asymptotic analysis that applies to a broad class of local randomizers under mild regularity assumptions, without requiring pure local DP. Our key finding is that the leading term of the blanket divergence depends on the local mechanism only through a single scalar parameter \chi, which we call the shuffle index. By applying this asymptotic analysis to both upper and lower bounds, we obtain a tight band for \delta_n in the shuffled mechanism’s (\varepsilon_n,\delta_n)-DP guarantee. Moreover, we derive a simple structural necessary and sufficient condition on the local randomizer under which the blanket-divergence-based upper and lower bounds coincide asymptotically. k-RR families with k \ge 3 satisfy this condition, while for generalized Gaussian mechanisms the condition may not hold but the resulting band remains tight. Finally, we complement the asymptotic theory with an FFT-based algorithm for computing the blanket divergence at finite n, which offers rigorously controlled relative error and near-linear running time in n, providing a practical numerical analysis for shuffle DP.

[LG-45] GPCR-Filter: a deep learning framework for efficient and precise GPCR modulator discovery

Link: https://arxiv.org/abs/2601.19149
Authors: Jingjie Ning, Xiangzhen Shen, Li Hou, Shiyi Shen, Jiahao Yang, Junrui Li, Hong Shan, Sanan Wu, Sihan Gao, Huaqiang Eric Xu, Xinheng He
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Notes:

Click to view abstract

Abstract:G protein-coupled receptors (GPCRs) govern diverse physiological processes and are central to modern pharmacology. Yet discovering GPCR modulators remains challenging because receptor activation often arises from complex allosteric effects rather than direct binding affinity, and conventional assays are slow, costly, and not optimized for capturing these dynamics. Here we present GPCR-Filter, a deep learning framework specifically developed for GPCR modulator discovery. We assembled a high-quality dataset of over 90,000 experimentally validated GPCR-ligand pairs, providing a robust foundation for training and evaluation. GPCR-Filter integrates the ESM-3 protein language model for high-fidelity GPCR sequence representations with graph neural networks that encode ligand structures, coupled through an attention-based fusion mechanism that learns receptor-ligand functional relationships. Across multiple evaluation settings, GPCR-Filter consistently outperforms state-of-the-art compound-protein interaction models and exhibits strong generalization to unseen receptors and ligands. Notably, the model successfully identified micromolar-level agonists of the 5-HT1A receptor with distinct chemical frameworks. These results establish GPCR-Filter as a scalable and effective computational approach for GPCR modulator discovery, advancing AI-assisted drug development for complex signaling systems.

[LG-46] Native LLM and MLLM Inference at Scale on Apple Silicon

Link: https://arxiv.org/abs/2601.19139
Authors: Wayner Barrios
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET)
Notes:

Click to view abstract

Abstract:The growing adoption of Apple Silicon for machine learning development has created demand for efficient inference solutions that leverage its unique unified memory architecture. However, existing tools either lack native optimization (PyTorch MPS) or focus solely on text models (this http URL), leaving multimodal workloads underserved. We present vllm-mlx, a framework for efficient LLM and MLLM inference on Apple Silicon built natively on MLX. For text models, we achieve 21% to 87% higher throughput than this http URL across models ranging from Qwen3-0.6B to Nemotron-30B, while providing continuous batching that scales to 4.3x aggregate throughput at 16 concurrent requests. For multimodal models, we introduce content-based prefix caching that eliminates redundant vision encoding by identifying identical images through content hashing, regardless of input format. Our evaluation on Apple M4 Max demonstrates throughput of up to 525 tokens per second on text models and 28x speedup on repeated image queries, reducing multimodal latency from 21.7 seconds to under 1 second. Video analysis with up to 64 frames achieves 24.7x cache speedup. We release our implementation as open source to support efficient inference on consumer Apple hardware.
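
The content-based caching idea in this abstract can be illustrated with a minimal sketch. It is not the vllm-mlx implementation or API: `VisionCache`, `to_pixels`, and the stand-in encoder are hypothetical names, and raw bytes stand in for decoded pixel data. The point is only that hashing the image content, rather than the wrapper it arrived in, lets the same image hit the cache whether it was passed as bytes or as a base64 string.

```python
import base64
import hashlib
from typing import Callable, Dict, List, Union


def to_pixels(image: Union[bytes, str]) -> bytes:
    """Normalize an input to its raw content (hypothetical helper).

    Assumption: a str input is base64-encoded content; bytes are already raw.
    """
    if isinstance(image, str):
        return base64.b64decode(image)
    return image


class VisionCache:
    """Cache encoder outputs keyed by a SHA-256 hash of the image content."""

    def __init__(self, encode: Callable[[bytes], List[float]]):
        self._encode = encode
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, image: Union[bytes, str]) -> List[float]:
        pixels = to_pixels(image)
        key = hashlib.sha256(pixels).hexdigest()
        if key in self._store:
            self.hits += 1          # identical content: skip the encoder
            return self._store[key]
        self.misses += 1
        features = self._encode(pixels)
        self._store[key] = features
        return features
```

A real system would hash decoded pixel arrays and bound the cache size; both are left out here to keep the sketch short.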

[LG-47] TinyTorch: Building Machine Learning Systems from First Principles

Link: https://arxiv.org/abs/2601.19107
Authors: Vijay Janapa Reddi
Categories: Machine Learning (cs.LG)
Notes:

Click to view abstract

Abstract:Machine learning systems engineering requires a deep understanding of framework internals. Yet most current education separates algorithms from systems. Students learn gradient descent without measuring memory usage, and attention mechanisms without profiling computational cost. This split leaves graduates unprepared to debug real production failures and widens the gap between machine learning research and reliable deployment. We present TinyTorch, a 20-module curriculum in which students implement the core components of PyTorch, including tensors, autograd, optimizers, and neural networks, entirely in pure Python. The curriculum is built around three pedagogical principles. Progressive disclosure gradually introduces complexity as students build confidence. Systems-first integration embeds memory and performance awareness from the very beginning. Historical milestone validation guides students to recreate key breakthroughs, from the Perceptron in 1958 to modern Transformers, using only code they have written themselves. TinyTorch requires only a laptop with 4GB of RAM and no GPU, making machine learning systems education accessible worldwide. Its goal is to prepare the next generation of AI engineers, practitioners who understand not only what machine learning systems do, but why they work and how to make them scale. The curriculum is available as open source at this http URL slash tinytorch.

[LG-48] OWLEYE: Zero-Shot Learner for Cross-Domain Graph Data Anomaly Detection ICLR2026

Link: https://arxiv.org/abs/2601.19102
Authors: Lecheng Zheng, Dongqi Fu, Zihao Li, Jingrui He
Categories: Machine Learning (cs.LG)
Notes: Accepted by ICLR 2026

Click to view abstract

Abstract:Graph data is informative to represent complex relationships such as transactions between accounts, communications between devices, and dependencies among machines or processes. Correspondingly, graph anomaly detection (GAD) plays a critical role in identifying anomalies across various domains, including finance, cybersecurity, manufacturing, etc. Facing the large-volume and multi-domain graph data, nascent efforts attempt to develop foundational generalist models capable of detecting anomalies in unseen graphs without retraining. To the best of our knowledge, the different feature semantics and dimensions of cross-domain graph data heavily hinder the development of the graph foundation model, leaving further in-depth continual learning and inference capabilities a quite open problem. Hence, we propose OWLEYE, a novel zero-shot GAD framework that learns transferable patterns of normal behavior from multiple graphs, with a threefold contribution. First, OWLEYE proposes a cross-domain feature alignment module to harmonize feature distributions, which preserves domain-specific semantics during alignment. Second, with aligned features, to enable continuous learning capabilities, OWLEYE designs the multi-domain multi-pattern dictionary learning to encode shared structural and attribute-based patterns. Third, for achieving the in-context learning ability, OWLEYE develops a truncated attention-based reconstruction module to robustly detect anomalies without requiring labeled data for unseen graph-structured data. Extensive experiments on real-world datasets demonstrate that OWLEYE achieves superior performance and generalizability compared to state-of-the-art baselines, establishing a strong foundation for scalable and label-efficient anomaly detection.

[LG-49] Speed is Confidence

Link: https://arxiv.org/abs/2601.19085
Authors: Joshua V. Dillon
Categories: Machine Learning (cs.LG)
Notes:

Click to view abstract

Abstract:Biological neural systems must be fast but are energy-constrained. Evolution’s solution: act on the first signal. Winner-take-all circuits and time-to-first-spike coding implicitly treat when a neuron fires as an expression of confidence. We apply this principle to ensembles of Tiny Recursive Models (TRM). By basing the ensemble prediction solely on the first to halt rather than averaging predictions, we achieve 97.2% puzzle accuracy on Sudoku-Extreme while using 10x less compute than test-time augmentation (the baseline achieves 86.1% single-pass, 97.3% with TTA). Inference speed is an implicit indication of confidence. But can this capability be manifested as a training-only cost? Evidently yes: by maintaining K = 4 parallel latent states during training but backpropping only through the lowest-loss “winner,” a single model achieves 96.9% +/- 0.6% puzzle accuracy with a single forward pass, matching TTA performance without any test-time augmentation. As in nature, this work was also resource constrained: all experimentation used a single RTX 5090. This necessitated efficiency and compelled our invention of a modified SwiGLU which made Muon viable. With Muon and K = 1 training, we exceed TRM baseline performance in 7k steps (40 min). Higher accuracy requires 36k steps: 1.5 hours for K = 1, 6 hours for K = 4.
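
The "first to halt" selection rule described in this abstract can be sketched in a few lines. This is an illustration under assumptions, not the paper's code: each ensemble member is assumed to report a `(prediction, halt_step)` pair, and both function names are hypothetical. A majority vote is included only for contrast.

```python
from collections import Counter
from typing import Hashable, List, Tuple

Result = Tuple[Hashable, int]  # (prediction, step at which the member halted)


def first_to_halt(results: List[Result]) -> Hashable:
    """Return the prediction of the member that halted earliest.

    Ties are broken by list order, because min() keeps the first minimum.
    """
    return min(results, key=lambda r: r[1])[0]


def majority_vote(results: List[Result]) -> Hashable:
    """Contrast: the most common prediction, ignoring halting time."""
    return Counter(pred for pred, _ in results).most_common(1)[0][0]
```

With results `[("A", 5), ("B", 2), ("A", 9)]`, the first-to-halt rule picks `"B"` while the majority vote picks `"A"`, which shows how the two rules can disagree.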

[LG-50] Critical Organization of Deep Neural Networks and p-Adic Statistical Field Theories

Link: https://arxiv.org/abs/2601.19070
Authors: W. A. Zúñiga-Galindo
Categories: Machine Learning (cs.LG)
Notes:

Click to view abstract

Abstract:We rigorously study the thermodynamic limit of deep neural networks (DNNs) and recurrent neural networks (RNNs), assuming that the activation functions are sigmoids. A thermodynamic limit is a continuous neural network, where the neurons form a continuous space with infinitely many points. We show that such a network admits a unique state in a certain region of the parameter space, which depends continuously on the parameters. This state breaks into an infinite number of states outside the mentioned region of parameter space. Then, the critical organization is a bifurcation in the parameter space, where a network transitions from a unique state to infinitely many states. We use p-adic integers to codify hierarchical structures. Indeed, we present an algorithm that recasts the hierarchical topologies used in DNNs and RNNs as p-adic tree-like structures. In this framework, the hierarchical and the critical organizations are connected. We study rigorously the critical organization of a toy model, a hierarchical edge detector for grayscale images based on p-adic cellular neural networks. The critical organization of such a network can be described as a strange attractor. In the second part, we study random versions of DNNs and RNNs. In this case, the network parameters are generalized Gaussian random variables in a space of quadratic integrable functions. We compute the probability distribution of the output given the input, in the infinite-width case. We show that it admits a power-type expansion, where the constant term is a Gaussian distribution.

[LG-51] Thought-Transfer: Indirect Targeted Poisoning Attacks on Chain-of-Thought Reasoning Models

Link: https://arxiv.org/abs/2601.19061
Authors: Harsh Chaudhari, Ethan Rathbum, Hanna Foerster, Jamie Hayes, Matthew Jagielski, Milad Nasr, Ilia Shumailov, Alina Oprea
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Notes:

Click to view abstract

Abstract:Chain-of-Thought (CoT) reasoning has emerged as a powerful technique for enhancing large language models’ capabilities by generating intermediate reasoning steps for complex tasks. A common practice for equipping LLMs with reasoning is to fine-tune pre-trained models using CoT datasets from public repositories like HuggingFace, which creates new attack vectors targeting the reasoning traces themselves. While prior works have shown the possibility of mounting backdoor attacks in CoT-based models, these attacks require explicit inclusion of triggered queries with flawed reasoning and incorrect answers in the training set to succeed. Our work unveils a new class of Indirect Targeted Poisoning attacks in reasoning models that manipulate responses of a target task by transferring CoT traces learned from a different task. Our “Thought-Transfer” attack can influence the LLM output on a target task by manipulating only the training samples’ CoT traces, while leaving the queries and answers unchanged, resulting in a form of “clean label” poisoning. Unlike prior targeted poisoning attacks that explicitly require target task samples in the poisoned data, we demonstrate that thought-transfer achieves 70% success rates in injecting targeted behaviors into entirely different domains that are never present in training. Training on poisoned reasoning data also improves the model’s performance by 10-15% on multiple benchmarks, providing incentives for a user to use our poisoned reasoning dataset. Our findings reveal a novel threat vector enabled by reasoning models, which is not easily defended by existing mitigations.

[LG-52] HEATACO: Heatmap-Guided Ant Colony Decoding for Large-Scale Travelling Salesman Problems

Link: https://arxiv.org/abs/2601.19041
Authors: Bo-Cheng Lin, Yi Mei, Mengjie Zhang
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Notes:

Click to view abstract

Abstract:Heatmap-based non-autoregressive solvers for large-scale Travelling Salesman Problems output dense edge-probability scores, yet final performance largely hinges on the decoder that must satisfy degree-2 constraints and form a single Hamiltonian tour. Greedy commitment can cascade into irreparable mistakes at large N , whereas MCTS-guided local search is accurate but compute-heavy and highly engineered. We instead treat the heatmap as a soft edge prior and cast decoding as probabilistic tour construction under feasibility constraints, where the key is to correct local mis-rankings via inexpensive global coordination. Based on this view, we introduce HeatACO, a plug-and-play Max-Min Ant System decoder whose transition policy is softly biased by the heatmap while pheromone updates provide lightweight, instance-specific feedback to resolve global conflicts; optional 2-opt/3-opt post-processing further improves tour quality. On TSP500/1K/10K, using heatmaps produced by four pretrained predictors, HeatACO+2opt achieves gaps down to 0.11%/0.23%/1.15% with seconds-to-minutes CPU decoding for fixed heatmaps, offering a better quality–time trade-off than greedy decoding and published MCTS-based decoders. Finally, we find the gains track heatmap reliability: under distribution shift, miscalibration and confidence collapse bound decoding improvements, suggesting heatmap generalisation is a primary lever for further progress.

[LG-53] OATS: Online Data Augmentation for Time Series Foundation Models

Link: https://arxiv.org/abs/2601.19040
Authors: Junwei Deng, Chang Xu, Jiaqi W. Ma, Ming Jin, Chenghao Liu, Jiang Bian
Categories: Machine Learning (cs.LG)
Notes:

Click to view abstract

Abstract:Time Series Foundation Models (TSFMs) are a powerful paradigm for time series analysis and are often enhanced by synthetic data augmentation to improve the training data quality. Existing augmentation methods, however, typically rely on heuristics and static paradigms. Motivated by dynamic data optimization, which shows that the contribution of samples varies across training stages, we propose OATS (Online Data Augmentation for Time Series Foundation Models), a principled strategy that generates synthetic data tailored to different training steps. OATS leverages valuable training samples as principled guiding signals and dynamically generates high-quality synthetic data conditioned on them. We further design a diffusion-based framework to produce realistic time series and introduce an explore-exploit mechanism to balance efficiency and effectiveness. Experiments on TSFMs demonstrate that OATS consistently outperforms regular training and yields substantial performance gains over static data augmentation baselines across six validation datasets and two TSFM architectures. The code is available at the link this https URL.

[LG-54] XIMP: Cross Graph Inter-Message Passing for Molecular Property Prediction

Link: https://arxiv.org/abs/2601.19037
Authors: Anatol Ehrlich, Lorenz Kummer, Vojtech Voracek, Franka Bause, Nils M. Kriege
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
Notes:

Click to view abstract

Abstract:Accurate molecular property prediction is central to drug discovery, yet graph neural networks often underperform in data-scarce regimes and fail to surpass traditional fingerprints. We introduce cross-graph inter-message passing (XIMP), which performs message passing both within and across multiple related graph representations. For small molecules, we combine the molecular graph with scaffold-aware junction trees and pharmacophore-encoding extended reduced graphs, integrating complementary abstractions. While prior work is either limited to a single abstraction or non-iterative communication across graphs, XIMP supports an arbitrary number of abstractions and both direct and indirect communication between them in each layer. Across ten diverse molecular property prediction tasks, XIMP outperforms state-of-the-art baselines in most cases, leveraging interpretable abstractions as an inductive bias that guides learning toward established chemical concepts, enhancing generalization in low-data settings.

[LG-55] Unravelling the (In)compatibility of Statistical-Parity and Equalized-Odds

Link: https://arxiv.org/abs/2601.19035
Authors: Mortaza S. Bargh, Sunil Choenni, Floris ter Braak
Categories: Machine Learning (cs.LG)
Notes:

Click to view abstract

Abstract:A key challenge in employing data, algorithms and data-driven systems is to adhere to the principle of fairness and justice. Statistical fairness measures belong to an important category of technical/formal mechanisms for detecting fairness issues in data and algorithms. In this contribution we study the relations between two types of statistical fairness measures, namely Statistical-Parity and Equalized-Odds. The Statistical-Parity measure does not rely on having ground truth, i.e., (objectively) labeled target attributes. This makes Statistical-Parity a suitable measure in practice for assessing fairness in data and data classification algorithms. Therefore, Statistical-Parity is adopted in many legal and professional frameworks for assessing algorithmic fairness. The Equalized-Odds measure, on the contrary, relies on having (reliable) ground-truth, which is not always feasible in practice. Nevertheless, there are several situations where the Equalized-Odds definition should be satisfied to enforce false prediction parity among sensitive social groups. We present a novel analysis of the relation between Statistical-Parity and Equalized-Odds, depending on the base-rates of sensitive groups. The analysis intuitively shows how and when base-rate imbalance causes incompatibility between Statistical-Parity and Equalized-Odds measures. As such, our approach provides insight into (how to make design) trade-offs between these measures in practice. Further, based on our results, we plead for examining base-rate (im)balance and investigating the possibility of such an incompatibility before enforcing or relying on the Statistical-Parity criterion. The insights provided, we foresee, may trigger initiatives to improve or adjust the current practice and/or the existing legal frameworks.
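
The role of base rates mentioned in this abstract can be made concrete with a standard identity (not necessarily the derivation used in the paper). By the law of total probability, a group's positive-prediction rate is TPR times its base rate plus FPR times one minus its base rate. If Equalized-Odds holds, both groups share the same TPR and FPR, so the Statistical-Parity gap reduces to (TPR - FPR) times the difference in base rates. The function names below are hypothetical.

```python
def positive_rate(base_rate: float, tpr: float, fpr: float) -> float:
    """P(prediction = 1 | group), via the law of total probability."""
    return tpr * base_rate + fpr * (1.0 - base_rate)


def statistical_parity_gap(base_a: float, base_b: float,
                           tpr: float, fpr: float) -> float:
    """Statistical-Parity gap between two groups when Equalized-Odds holds,
    i.e. when both groups share the same TPR and FPR.

    Algebraically this equals (tpr - fpr) * (base_a - base_b).
    """
    return positive_rate(base_a, tpr, fpr) - positive_rate(base_b, tpr, fpr)
```

Under this identity the gap vanishes only when the base rates are equal or when TPR equals FPR, meaning the classifier's output carries no information about the label. For example, base rates 0.5 and 0.2 with TPR 0.8 and FPR 0.1 give positive rates 0.45 and 0.24.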

[LG-56] A Framework for Evaluating Faithfulness in Explainable AI for Machine Anomalous Sound Detection Using Frequency-Band Perturbation

Link: https://arxiv.org/abs/2601.19017
Authors: Alexander Buck, Georgina Cosma, Iain Phillips, Paul Conway, Patrick Baker
Categories: Sound (cs.SD); Machine Learning (cs.LG)
Notes: 16 pages, 24 figures

Click to view abstract

Abstract:Explainable AI (XAI) is commonly applied to anomalous sound detection (ASD) models to identify which time-frequency regions of an audio signal contribute to an anomaly decision. However, most audio explanations rely on qualitative inspection of saliency maps, leaving open the question of whether these attributions accurately reflect the spectral cues the model uses. In this work, we introduce a new quantitative framework for evaluating XAI faithfulness in machine-sound analysis by directly linking attribution relevance to model behaviour through systematic frequency-band removal. This approach provides an objective measure of whether an XAI method for machine ASD correctly identifies frequency regions that influence an ASD model’s predictions. By using four widely adopted methods, namely Integrated Gradients, Occlusion, Grad-CAM and SmoothGrad, we show that XAI techniques differ in reliability, with Occlusion demonstrating the strongest alignment with true model sensitivity and gradient-based methods often failing to accurately capture spectral dependencies. The proposed framework offers a reproducible way to benchmark audio explanations and enables more trustworthy interpretation of spectrogram-based ASD systems.

[LG-57] Accelerated training of Gaussian processes using banded square exponential covariances ICASSP2026

Link: https://arxiv.org/abs/2601.19007
Authors: Emily C. Ehrhardt, Felipe Tobar
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Notes: Accepted at IEEE ICASSP 2026

Click to view abstract

Abstract:We propose a novel approach to computationally efficient GP training based on the observation that square-exponential (SE) covariance matrices contain several off-diagonal entries extremely close to zero. We construct a principled procedure to eliminate those entries to produce a banded-matrix approximation to the original covariance, whose inverse and determinant can be computed at a reduced computational cost, thus contributing to an efficient approximation to the likelihood function. We provide a theoretical analysis of the proposed method to preserve the structure of the original covariance in the 1D setting with SE kernel, and validate its computational efficiency against the variational free energy approach to sparse GPs.
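
The observation behind this abstract, that SE covariance entries decay quickly away from the diagonal for sorted 1D inputs, can be seen with a naive thresholding sketch. This is not the paper's principled procedure: simply zeroing small entries is an assumption made here for illustration and does not by itself guarantee a positive-definite result. The function names and the tolerance `tol` are hypothetical.

```python
import math
from typing import List, Tuple


def se_kernel(x: float, y: float, lengthscale: float = 1.0,
              variance: float = 1.0) -> float:
    """Square-exponential covariance between two scalar inputs."""
    return variance * math.exp(-0.5 * ((x - y) / lengthscale) ** 2)


def banded_se_matrix(xs: List[float], lengthscale: float = 1.0,
                     tol: float = 1e-8) -> Tuple[List[List[float]], int]:
    """Build the SE covariance for sorted 1D inputs, zeroing entries below tol.

    Returns the thresholded matrix and its bandwidth, i.e. the largest
    |i - j| with a surviving non-zero entry.
    """
    n = len(xs)
    K = [[0.0] * n for _ in range(n)]
    bandwidth = 0
    for i in range(n):
        for j in range(n):
            k = se_kernel(xs[i], xs[j], lengthscale)
            if k >= tol:
                K[i][j] = k
                bandwidth = max(bandwidth, abs(i - j))
    return K, bandwidth
```

On a unit-spaced grid with lengthscale 1 and `tol = 1e-8`, entries survive only up to six positions off the diagonal, since exp(-18) is about 1.5e-8 while exp(-24.5) is about 2.3e-11. A banded solver could then exploit that structure.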

[LG-58] Recommending Composite Items Using Multi-Level Preference Information: A Joint Interaction Modeling Approach

Link: https://arxiv.org/abs/2601.19005
Authors: Xuan Bi, Yaqiong Wang, Gediminas Adomavicius, Shawn Curley
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Notes:

Click to view abstract

Abstract:With the advancement of machine learning and artificial intelligence technologies, recommender systems have been increasingly used across a vast variety of platforms to efficiently and effectively match users with items. As application contexts become more diverse and complex, there is a growing need for more sophisticated recommendation techniques. One example is the composite item (for example, fashion outfit) recommendation where multiple levels of user preference information might be available and relevant. In this study, we propose JIMA, a joint interaction modeling approach that uses a single model to take advantage of all data from different levels of granularity and incorporate interactions to learn the complex relationships among lower-order (atomic item) and higher-order (composite item) user preferences as well as domain expertise (e.g., on the stylistic fit). We comprehensively evaluate the proposed method and compare it with advanced baselines through multiple simulation studies as well as with real data in both offline and online settings. The results consistently demonstrate the superior performance of the proposed approach.

[LG-59] Attention-Enhanced Graph Filtering for False Data Injection Attack Detection and Localization

Link: https://arxiv.org/abs/2601.18981
Authors: Ruslan Abdulin, Mohammad Rasoul Narimani
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Optimization and Control (math.OC)
Notes:

Click to view abstract

Abstract:The increasing deployment of Internet-of-Things (IoT)-enabled measurement devices in modern power systems has expanded the cyberattack surface of the grid. As a result, this critical infrastructure is increasingly exposed to cyberattacks, including false data injection attacks (FDIAs) that compromise measurement integrity and threaten reliable system operation. Existing FDIA detection methods primarily exploit spatial correlations and network topology using graph-based learning; however, these approaches often rely on high-dimensional representations and shallow classifiers, limiting their ability to capture local structural dependencies and global contextual relationships. Moreover, naively incorporating Transformer architectures can result in overly deep models that struggle to model localized grid dynamics. This paper proposes a joint FDIA detection and localization framework that integrates auto-regressive moving average (ARMA) graph convolutional filters with an Encoder-Only Transformer architecture. The ARMA-based graph filters provide robust, topology-aware feature extraction and adaptability to abrupt spectral changes, while the Transformer encoder leverages self-attention to capture long-range dependencies among grid elements without sacrificing essential local context. The proposed method is evaluated using real-world load data from the New York Independent System Operator (NYISO) applied to the IEEE 14- and 300-bus systems. Numerical results demonstrate that the proposed model effectively exploits both the state and topology of the power grid, achieving high accuracy in detecting FDIA events and localizing compromised nodes.

[LG-60] Towards Self-Optimizing Electron Microscope: Robust Tuning of Aberration Coefficients via Physics-Aware Multi-Objective Bayesian Optimization

Link: https://arxiv.org/abs/2601.18972
Authors: Utkarsh Pratiush, Austin Houston, Richard Liu, Gerd Duscher, Sergei Kalinin
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)

Abstract:Realizing high-throughput aberration-corrected Scanning Transmission Electron Microscopy (STEM) exploration of atomic structures requires rapid tuning of multipole probe correctors while compensating for the inevitable drift of the optical column. While automated alignment routines exist, conventional approaches rely on serial, gradient-free searches (e.g., Nelder-Mead) that are sample-inefficient and struggle to correct multiple interacting parameters simultaneously. Conversely, emerging deep learning methods offer speed but often lack the flexibility to adapt to varying sample conditions without extensive retraining. Here, we introduce a Multi-Objective Bayesian Optimization (MOBO) framework for rapid, data-efficient aberration correction. Importantly, this framework does not prescribe a single notion of image quality; instead, it enables user-defined, physically motivated reward formulations (e.g., symmetry-induced objectives) and uses Pareto fronts to expose the resulting trade-offs between competing experimental priorities. By using Gaussian Process regression to model the aberration landscape probabilistically, our workflow actively selects the most informative lens settings to evaluate next, rather than performing an exhaustive blind search. We demonstrate that this active learning loop is more robust than traditional optimization algorithms and effectively tunes focus, astigmatism, and higher-order aberrations. By balancing competing objectives, this approach enables “self-optimizing” microscopy by dynamically sustaining optimal performance during experiments.

[LG-61] Vector-Valued Distributional Reinforcement Learning Policy Evaluation: A Hilbert Space Embedding Approach

Link: https://arxiv.org/abs/2601.18952
Authors: Mehrdad Mohammadi, Qi Zheng, Ruoqing Zhu
Subjects: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)

Abstract:We propose an (offline) multi-dimensional distributional reinforcement learning framework (KE-DRL) that leverages Hilbert space mappings to estimate the kernel mean embedding of the multi-dimensional value distribution under a proposed target policy. In our setting, the state-action variables are multi-dimensional and continuous. By mapping probability measures into a reproducing kernel Hilbert space via kernel mean embeddings, our method replaces Wasserstein metrics with an integral probability metric. This enables efficient estimation in multi-dimensional state-action spaces and reward settings, where direct computation of Wasserstein distances is computationally challenging. Theoretically, we establish contraction properties of the distributional Bellman operator under our proposed metric involving the Matern family of kernels and provide uniform convergence guarantees. Simulations and empirical results demonstrate robust off-policy evaluation and recovery of the kernel mean embedding under mild assumptions, namely, Lipschitz continuity and boundedness of the kernels, highlighting the potential of embedding-based approaches in complex real-world decision-making scenarios and risk evaluation.

[LG-62] A Few Bad Neurons: Isolating and Surgically Correcting Sycophancy

Link: https://arxiv.org/abs/2601.18939
Authors: Claire O’Brien, Jessica Seto, Dristi Roy, Aditya Dwivedi, Sunishchal Dev, Kevin Zhu, Sean O’Brien, Ashwinee Panda, Ryan Lagasse
Subjects: Machine Learning (cs.LG)
Comments: Accepted to NeurIPS Workshop on CogInterp and NeurIPS Workshop on Reliable ML 2025

Abstract:Behavioral alignment in large language models (LLMs) is often achieved through broad fine-tuning, which can result in undesired side effects like distributional shift and low interpretability. We propose a method for alignment that identifies and updates only the neurons most responsible for a given behavior, a targeted approach that allows for fine-tuning with significantly less data. Using sparse autoencoders (SAEs) and linear probes, we isolate the 3% of MLP neurons most predictive of a target behavior, decode them into residual space, and fine-tune only those neurons using gradient masking. We demonstrate this approach on the task of reducing sycophantic behavior, where our method matches or exceeds state-of-the-art performance on four benchmarks (Syco-Bench, NLP, POLI, PHIL) using Gemma-2-2B and 9B models. Our results show that sparse, neuron-level updates offer a scalable and precise alternative to full-model fine-tuning, remaining effective even when little data is available.

[LG-63] FSD-CAP: Fractional Subgraph Diffusion with Class-Aware Propagation for Graph Feature Imputation

Link: https://arxiv.org/abs/2601.18938
Authors: Xin Qiao, Shijie Sun, Anqi Dong, Cong Hua, Xia Zhao, Longfei Zhang, Guangming Zhu, Liang Zhang
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
Comments: 31 pages, 12 figures

Abstract:Imputing missing node features in graphs is challenging, particularly under high missing rates. Existing methods based on latent representations or global diffusion often fail to produce reliable estimates, and may propagate errors across the graph. We propose FSD-CAP, a two-stage framework designed to improve imputation quality under extreme sparsity. In the first stage, a graph-distance-guided subgraph expansion localizes the diffusion process. A fractional diffusion operator adjusts propagation sharpness based on local structure. In the second stage, imputed features are refined using class-aware propagation, which incorporates pseudo-labels and neighborhood entropy to promote consistency. We evaluated FSD-CAP on multiple datasets. With 99.5% of features missing across five benchmark datasets, FSD-CAP achieves average accuracies of 80.06% (structural) and 81.01% (uniform) in node classification, close to the 81.31% achieved by a standard GCN with full features. For link prediction under the same setting, it reaches AUC scores of 91.65% (structural) and 92.41% (uniform), compared to 95.06% for the fully observed case. Furthermore, FSD-CAP demonstrates superior performance on both large-scale and heterophily datasets when compared to other models.

[LG-64] Bi-Level Online Provisioning and Scheduling with Switching Costs and Cross-Level Constraints

Link: https://arxiv.org/abs/2601.18936
Authors: Jialei Liu, C. Emre Koksal, Ming Shi
Subjects: Machine Learning (cs.LG)

Abstract:We study a bi-level online provisioning and scheduling problem motivated by network resource allocation, where provisioning decisions are made at a slow time scale while queue-/state-dependent scheduling is performed at a fast time scale. We model this two-time-scale interaction using an upper-level online convex optimization (OCO) problem and a lower-level constrained Markov decision process (CMDP). Existing OCO typically assumes stateless decisions and thus cannot capture MDP network dynamics such as queue evolution. Meanwhile, CMDP algorithms typically assume a fixed constraint threshold, whereas in provisioning-and-scheduling systems, the threshold varies with online budget decisions. To address these gaps, we study bi-level OCO-CMDP learning under switching costs (budget reprovisioning/system reconfiguration) and cross-level constraints that couple budgets to scheduling decisions. Our new algorithm solves this learning problem via several non-trivial developments, including a carefully designed dual feedback that returns the budget multiplier as sensitivity information for the upper-level update and a lower level that solves a budget-adaptive safe exploration problem via an extended occupancy-measure linear program. We establish near-optimal regret and high-probability satisfaction of the cross-level constraints.

[LG-65] Accelerating Large-Scale Cheminformatics Using a Byte-Offset Indexing Architecture for Terabyte-Scale Data Integration

Link: https://arxiv.org/abs/2601.18921
Authors: Malikussaid, Septian Caesar Floresko, Sutiyo
Subjects: Databases (cs.DB); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 6 pages, 3 figures, 5 equations, 3 algorithms, 4 tables, to be published in ICoICT 2026, unabridged version exists as arXiv:2512.24643v1

Abstract:The integration of large-scale chemical databases represents a critical bottleneck in modern cheminformatics research, particularly for machine learning applications requiring high-quality, multi-source validated datasets. This paper presents a case study of integrating three major public chemical repositories: PubChem (176 million compounds), ChEMBL, and eMolecules, to construct a curated dataset for molecular property prediction. We investigate whether byte-offset indexing can practically overcome brute-force scalability limits while preserving data integrity at hundred-million scale. Our results document the progression from an intractable brute-force search algorithm with projected 100-day runtime to a byte-offset indexing architecture achieving 3.2-hour completion, a 740-fold performance improvement through algorithmic complexity reduction from O(N×M) to O(N+M). Systematic validation of 176 million database entries revealed hash collisions in InChIKey molecular identifiers, necessitating pipeline reconstruction using collision-free full InChI strings. We present performance benchmarks, quantify trade-offs between storage overhead and scientific rigor, and compare our approach with alternative large-scale integration strategies. The resulting system successfully extracted 435,413 validated compounds and demonstrates generalizable principles for large-scale scientific data integration where uniqueness constraints exceed hash-based identifier capabilities.
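
As a rough illustration of the byte-offset indexing idea mentioned in this abstract (not the authors' pipeline), the sketch below scans a file once to record where each record starts, then answers lookups with a direct seek instead of a rescan. The tab-separated layout, the `first_field` helper, and the sample identifiers are all hypothetical choices made for the example.

```python
import io


def build_offset_index(stream, key_of):
    """Scan a binary stream once and map each record key to its byte offset."""
    index = {}
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            break
        index[key_of(line)] = offset
    return index


def lookup(stream, index, key):
    """Seek straight to a record instead of rescanning the whole stream."""
    offset = index.get(key)
    if offset is None:
        return None
    stream.seek(offset)
    return stream.readline().rstrip(b"\n")


def first_field(line):
    # Hypothetical record layout: tab-separated, identifier in the first column.
    return line.split(b"\t", 1)[0]


# In-memory stand-in for a large file on disk; the records are made up.
records = io.BytesIO(b"ID1\tCCO\nID2\tc1ccccc1\nID3\tCC(=O)O\n")
index = build_offset_index(records, first_field)
```

Building the index costs one pass over the data, and each later lookup is a dictionary access plus one seek, which is the general shape of the O(N×M) to O(N+M) reduction the abstract reports.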

[LG-66] One Global Model Many Behaviors: Stockout-Aware Feature Engineering and Dynamic Scaling for Multi-Horizon Retail Demand Forecasting with a Cost-Aware Ordering Policy (VN2 Winner Report)

Link: https://arxiv.org/abs/2601.18919
Authors: Bartosz Szabłowski
Subjects: Machine Learning (cs.LG)
Comments: 13 pages, 5 figures. Technical report/winner report for the VN2 Inventory Planning Challenge (2025)

Abstract:Inventory planning for retail chains requires translating demand forecasts into ordering decisions, including asymmetric shortages and holding costs. The VN2 Inventory Planning Challenge formalizes this setting as a weekly decision-making cycle with a two-week product delivery lead time, where the total cost is defined as the shortage cost plus the holding cost. This report presents the winning VN2 solution: a two-stage predict-then-optimize pipeline that combines a single global multi-horizon forecasting model with a cost-aware ordering policy. The forecasting model is trained in a global paradigm, jointly using all available time series. A gradient-boosted decision tree (GBDT) model implemented in CatBoost is used as the base learner. The model incorporates stockout-aware feature engineering to address censored demand during out-of-stock periods, per-series scaling to focus learning on time-series patterns rather than absolute levels, and time-based observation weights to reflect shifts in demand patterns. In the decision stage, inventory is projected to the start of the delivery week, and a target stock level is calculated that explicitly trades off shortage and holding costs. Evaluated by the official competition simulation in six rounds, the solution achieved first place by combining a strong global forecasting model with a lightweight cost-aware policy. Although developed for the VN2 setting, the proposed approach can be extended to real-world applications and additional operational constraints.

[LG-67] GraIP: A Benchmarking Framework For Neural Graph Inverse Problems

Link: https://arxiv.org/abs/2601.18917
Authors: Semih Cantürk, Andrei Manolache, Arman Mielke, Chendi Qian, Antoine Siraudin, Christopher Morris, Mathias Niepert, Guy Wolf
Subjects: Machine Learning (cs.LG)

Abstract:A wide range of graph learning tasks, such as structure discovery, temporal graph analysis, and combinatorial optimization, focus on inferring graph structures from data, rather than making predictions on given graphs. However, the respective methods to solve such problems are often developed in an isolated, task-specific manner and thus lack a unifying theoretical foundation. Here, we provide a stepping stone towards the formation of such a foundation and further development by introducing the Neural Graph Inverse Problem (GraIP) conceptual framework, which formalizes and reframes a broad class of graph learning tasks as inverse problems. Unlike discriminative approaches that directly predict target variables from given graph inputs, the GraIP paradigm addresses inverse problems, i.e., it relies on observational data and aims to recover the underlying graph structure by reversing the forward process, such as message passing or network dynamics, that produced the observed outputs. We demonstrate the versatility of GraIP across various graph learning tasks, including rewiring, causal discovery, and neural relational inference. We also propose benchmark datasets and metrics for each GraIP domain considered, and characterize and empirically evaluate existing baseline methods used to solve them. Overall, our unifying perspective bridges seemingly disparate applications and provides a principled approach to structural learning in constrained and combinatorial settings while encouraging cross-pollination of existing methods across graph inverse problems.

[LG-68] ASEHybrid: When Geometry Matters Beyond Homophily in Graph Neural Networks

Link: https://arxiv.org/abs/2601.18912
Authors: Shalima Binta Manir, Tim Oates
Subjects: Machine Learning (cs.LG)
Comments: 16 pages, 1 figure, 2 tables

Abstract:Standard message-passing graph neural networks (GNNs) often struggle on graphs with low homophily, yet homophily alone does not explain this behavior, as graphs with similar homophily levels can exhibit markedly different performance and some heterophilous graphs remain easy for vanilla GCNs. Recent work suggests that label informativeness (LI), the mutual information between labels of adjacent nodes, provides a more faithful characterization of when graph structure is useful. In this work, we develop a unified theoretical framework that connects curvature-guided rewiring and positional geometry through the lens of label informativeness, and instantiate it in a practical geometry-aware architecture, ASEHybrid. Our analysis provides a necessary-and-sufficient characterization of when geometry-aware GNNs can improve over feature-only baselines: such gains are possible if and only if graph structure carries label-relevant information beyond node features. Theoretically, we relate adjusted homophily and label informativeness to the spectral behavior of label signals under Laplacian smoothing, show that degree-based Forman curvature does not increase expressivity beyond the one-dimensional Weisfeiler–Lehman test but instead reshapes information flow, and establish convergence and Lipschitz stability guarantees for a curvature-guided rewiring process. Empirically, we instantiate ASEHybrid using Forman curvature and Laplacian positional encodings and conduct controlled ablations on Chameleon, Squirrel, Texas, Tolokers, and Minesweeper, observing gains precisely on label-informative heterophilous benchmarks where graph structure provides label-relevant information beyond node features, and no meaningful improvement in high-baseline regimes.

[LG-69] How Is Uncertainty Propagated in Knowledge Distillation?

Link: https://arxiv.org/abs/2601.18909
Authors: Ziyao Cui, Jian Pei
Subjects: Machine Learning (cs.LG)

Abstract:Knowledge distillation transfers behavior from a teacher to a student model, but the process is inherently stochastic: teacher outputs, student training, and student inference can all be random. Collapsing these uncertainties to a single point estimate can distort what is learned. We systematically study how uncertainty propagates through knowledge distillation across three representative model classes–linear regression, feed-forward neural networks, and large language models (LLMs)–and propose simple corrections. We distinguish inter-student uncertainty (variance across independently distilled students) from intra-student uncertainty (variance of a single student’s predictive distribution), showing that standard single-response knowledge distillation suppresses intra-student variance while leaving substantial inter-student variability. To address these mismatches, we introduce two variance-aware strategies: averaging multiple teacher responses, which reduces noise at rate O(1/k), and variance-weighting, which combines teacher and student estimates via inverse-variance weighting to yield a minimum-variance estimator. We provide formal guarantees in linear regression, validate the methods in neural networks, and demonstrate empirical gains in LLM distillation, including reduced systematic noise and hallucination. These results reframe knowledge distillation as an uncertainty transformation and show that variance-aware distillation produces more stable students that better reflect teacher uncertainty.
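
The two variance-aware strategies named in this abstract rest on textbook statistics, which a minimal sketch can make concrete. The function names below are illustrative, the estimates are assumed independent and unbiased with known variances, and nothing here reproduces the paper's actual procedure.

```python
def variance_of_mean(single_response_variance, k):
    """Variance of the average of k independent responses: sigma^2 / k."""
    return single_response_variance / k


def inverse_variance_combine(est_a, var_a, est_b, var_b):
    """Combine two independent unbiased estimates with inverse-variance weights.

    Each estimate is weighted by 1 / variance, which gives the
    minimum-variance linear combination of the two.
    """
    w_a = 1.0 / var_a
    w_b = 1.0 / var_b
    combined = (w_a * est_a + w_b * est_b) / (w_a + w_b)
    combined_var = 1.0 / (w_a + w_b)
    return combined, combined_var


# Hypothetical numbers: a teacher estimate and a student estimate of one target.
teacher_estimate, teacher_var = 1.0, 1.0
student_estimate, student_var = 3.0, 1.0
combined, combined_var = inverse_variance_combine(
    teacher_estimate, teacher_var, student_estimate, student_var
)
```

With equal variances the combination reduces to a plain average, and the combined variance is never larger than the smaller of the two inputs.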

[LG-70] Enhancing Speech Emotion Recognition using Dynamic Spectral Features and Kalman Smoothing

Link: https://arxiv.org/abs/2601.18908
Authors: Marouane El Hizabri, Abdelfattah Bezzaz, Ismail Hayoukane, Youssef Taki
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Abstract:Speech Emotion Recognition systems often use static features like Mel-Frequency Cepstral Coefficients (MFCCs), Zero Crossing Rate (ZCR), and Root Mean Square Energy (RMSE). Because of this, they can misclassify emotions when there is acoustic noise in vocal signals. To address this, we added dynamic spectral features (Deltas and Delta-Deltas) along with the Kalman Smoothing algorithm. This approach reduces noise and improves emotion classification. Since emotion changes over time, the Kalman Smoothing filter also helped make the classifier outputs more stable. Tests on the RAVDESS dataset showed that this method achieved a state-of-the-art accuracy of 87% and reduced misclassification between emotions with similar acoustic features.
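
For readers unfamiliar with delta and delta-delta features, the sketch below shows one common way to compute them: a regression over neighbouring frames with edge padding. The abstract does not state which formula or window width the authors use, so the formula, the `width` parameter, and the sample track are assumptions for illustration only.

```python
def delta(track, width=2):
    """Regression-based delta of a 1-D feature track.

    d_t = sum_{n=1..width} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..width} n^2)
    The track is padded at both ends by repeating the edge values.
    """
    denom = 2 * sum(n * n for n in range(1, width + 1))
    padded = [track[0]] * width + list(track) + [track[-1]] * width
    out = []
    for t in range(width, width + len(track)):
        num = sum(n * (padded[t + n] - padded[t - n]) for n in range(1, width + 1))
        out.append(num / denom)
    return out


# Hypothetical track of one coefficient over seven frames (a linear ramp).
ramp = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
deltas = delta(ramp)          # first-order dynamics
delta_deltas = delta(deltas)  # second-order dynamics
```

Applying `delta` twice gives the delta-delta track; in practice both are stacked alongside the static features for every frame.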

[LG-71] Analysis of Control Bellman Residual Minimization for Markov Decision Problem

Link: https://arxiv.org/abs/2601.18840
Authors: Donghwan Lee, Hyukjun Yang
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Markov decision problems are most commonly solved via dynamic programming. Another approach is Bellman residual minimization, which directly minimizes the squared Bellman residual objective function. However, compared to dynamic programming, this approach has received relatively less attention, mainly because it is often less efficient in practice and can be more difficult to extend to model-free settings such as reinforcement learning. Nonetheless, Bellman residual minimization has several advantages that make it worth investigating, such as more stable convergence with function approximation for value functions. While Bellman residual methods for policy evaluation have been widely studied, methods for policy optimization (control tasks) have been scarcely explored. In this paper, we establish foundational results for the control Bellman residual minimization for policy optimization.

[LG-72] Time series forecasting with Hahn Kolmogorov-Arnold networks

Link: https://arxiv.org/abs/2601.18837
Authors: Md Zahidul Hasan, A. Ben Hamza, Nizar Bouguila
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Recent Transformer- and MLP-based models have demonstrated strong performance in long-term time series forecasting, yet Transformers remain limited by their quadratic complexity and permutation-equivariant attention, while MLPs exhibit spectral bias. We propose HaKAN, a versatile model based on Kolmogorov-Arnold Networks (KANs), leveraging Hahn polynomial-based learnable activation functions and providing a lightweight and interpretable alternative for multivariate time series forecasting. Our model integrates channel independence, patching, a stack of Hahn-KAN blocks with residual connections, and a bottleneck structure comprised of two fully connected layers. The Hahn-KAN block consists of inter- and intra-patch KAN layers to effectively capture both global and local temporal patterns. Extensive experiments on various forecasting benchmarks demonstrate that our model consistently outperforms recent state-of-the-art methods, with ablation studies validating the effectiveness of its core components.

[LG-73] How Much Temporal Modeling is Enough? A Systematic Study of Hybrid CNN-RNN Architectures for Multi-Label ECG Classification

Link: https://arxiv.org/abs/2601.18830
Authors: Alireza Jafari, Fatemeh Jafari
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 17 pages, 10 figures

Abstract:Accurate multi-label classification of electrocardiogram (ECG) signals remains challenging due to the coexistence of multiple cardiac conditions, pronounced class imbalance, and long-range temporal dependencies in multi-lead recordings. Although recent studies increasingly rely on deep and stacked recurrent architectures, the necessity and clinical justification of such architectural complexity have not been rigorously examined. In this work, we perform a systematic comparative evaluation of convolutional neural networks (CNNs) combined with multiple recurrent configurations, including LSTM, GRU, Bidirectional LSTM (BiLSTM), and their stacked variants, for multi-label ECG classification on the PTB-XL dataset comprising 23 diagnostic categories. The CNN component serves as a morphology-driven baseline, while recurrent layers are progressively integrated to assess their contribution to temporal modeling and generalization performance. Experimental results indicate that a CNN integrated with a single BiLSTM layer achieves the most favorable trade-off between predictive performance and model complexity. This configuration attains superior Hamming loss (0.0338), macro-AUPRC (0.4715), micro-F1 score (0.6979), and subset accuracy (0.5723) compared with deeper recurrent combinations. Although stacked recurrent models occasionally improve recall for specific rare classes, our results provide empirical evidence that increasing recurrent depth yields diminishing returns and may degrade generalization due to reduced precision and overfitting. These findings suggest that architectural alignment with the intrinsic temporal structure of ECG signals, rather than increased recurrent depth, is a key determinant of robust performance and clinically relevant deployment.

[LG-74] VAE with Hyperspherical Coordinates: Improving Anomaly Detection from Hypervolume-Compressed Latent Space

Link: https://arxiv.org/abs/2601.18823
Authors: Alejandro Ascarate, Leo Lebrat, Rodrigo Santa Cruz, Clinton Fookes, Olivier Salvado
Subjects: Machine Learning (cs.LG)

Abstract:Variational autoencoders (VAE) encode data into lower-dimensional latent vectors before decoding those vectors back to data. Once trained, one can hope to detect out-of-distribution (abnormal) latent vectors, but several issues arise when the latent space is high dimensional. This includes an exponential growth of the hypervolume with the dimension, which severely affects the generative capacity of the VAE. In this paper, we draw insights from high dimensional statistics: in these regimes, the latent vectors of a standard VAE are distributed on the "equators" of a hypersphere, challenging the detection of anomalies. We propose to formulate the latent variables of a VAE using hyperspherical coordinates, which allows compressing the latent vectors towards a given direction on the hypersphere, thereby allowing for a more expressive approximate posterior. We show that this improves both the fully unsupervised and OOD anomaly detection ability of the VAE, achieving the best performance on the datasets we considered, outperforming existing methods. For the unsupervised and OOD modalities, respectively, these are: i) detecting unusual landscape from the Mars Rover camera and unusual Galaxies from ground based imagery (complex, real world datasets); ii) standard benchmarks like CIFAR-10 and subsets of ImageNet as the in-distribution (ID) class.

[LG-75] Variational Quantum Circuit-Based Reinforcement Learning for Dynamic Portfolio Optimization

Link: https://arxiv.org/abs/2601.18811
Authors: Vincent Gurgul, Ying Chen, Stefan Lessmann
Subjects: Machine Learning (cs.LG); Computational Finance (q-fin.CP); Portfolio Management (q-fin.PM); Quantum Physics (quant-ph)

Abstract:This paper presents a Quantum Reinforcement Learning (QRL) solution to the dynamic portfolio optimization problem based on Variational Quantum Circuits. The implemented QRL approaches are quantum analogues of the classical neural-network-based Deep Deterministic Policy Gradient and Deep Q-Network algorithms. Through an empirical evaluation on real-world financial data, we show that our quantum agents achieve risk-adjusted performance comparable to, and in some cases exceeding, that of classical Deep RL models with several orders of magnitude more parameters. In addition to improved parameter efficiency, quantum agents exhibit reduced variability across market regimes, indicating robust behaviour under changing conditions. However, while quantum circuit execution is inherently fast at the hardware level, practical deployment on cloud-based quantum systems introduces substantial latency, making end-to-end runtime currently dominated by infrastructural overhead and limiting practical applicability. Taken together, our results suggest that QRL is theoretically competitive with state-of-the-art classical reinforcement learning and may become practically advantageous as deployment overheads diminish. This positions QRL as a promising paradigm for dynamic decision-making in complex, high-dimensional, and non-stationary environments such as financial markets. The complete codebase is released as open source at: this https URL

[LG-76] Latent Structural Similarity Networks for Unsupervised Discovery in Multivariate Time Series

Link: https://arxiv.org/abs/2601.18803
Authors: Olusegun Owoeye
Subjects: Machine Learning (cs.LG)

Abstract:This paper proposes a task-agnostic discovery layer for multivariate time series that constructs a relational hypothesis graph over entities without assuming linearity, stationarity, or a downstream objective. The method learns window-level sequence representations using an unsupervised sequence-to-sequence autoencoder, aggregates these representations into entity-level embeddings, and induces a sparse similarity network by thresholding a latent-space similarity measure. This network is intended as an analyzable abstraction that compresses the pairwise search space and exposes candidate relationships for further investigation, rather than as a model optimized for prediction, trading, or any decision rule. The framework is demonstrated on a challenging real-world dataset of hourly cryptocurrency returns, illustrating how latent similarity induces coherent network structure; a classical econometric relation is also reported as an external diagnostic lens to contextualize discovered edges.
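
The last step described in this abstract, inducing a sparse network by thresholding a latent-space similarity, can be sketched in a few lines. Cosine similarity, the threshold value, and the toy embeddings below are assumptions made for illustration; the abstract does not specify the similarity measure or how the threshold is chosen.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def similarity_network(embeddings, threshold):
    """Return undirected edges between entities whose similarity meets the threshold."""
    names = sorted(embeddings)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                edges.append((a, b))
    return edges


# Hypothetical entity-level embeddings (for example, one vector per asset).
embeddings = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0]}
edges = similarity_network(embeddings, threshold=0.9)
```

Raising the threshold makes the graph sparser, so it acts as the knob that trades coverage of candidate relationships against the size of the pairwise search space.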

[LG-77] Generative Latent Alignment for Interpretable Radar Based Occupancy Detection in Ambient Assisted Living

Link: https://arxiv.org/abs/2601.19853
Authors: Huy Trinh
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)

Abstract:In this work, we study how to make mmWave radar presence detection more interpretable for Ambient Assisted Living (AAL) settings, where camera-based sensing raises privacy concerns. We propose a Generative Latent Alignment (GLA) framework that combines a lightweight convolutional variational autoencoder with a frozen CLIP text encoder to learn a low-dimensional latent representation of radar Range-Angle (RA) heatmaps. The latent space is softly aligned with two semantic anchors corresponding to “empty room” and “person present”, and Grad-CAM is applied in this aligned latent space to visualize which spatial regions support each presence decision. On our mmWave radar dataset, we qualitatively observe that the “person present” class produces compact Grad-CAM blobs that coincide with strong RA returns, whereas “empty room” samples yield diffuse or no evidence. We also conduct an ablation study using unrelated text prompts, which degrades both reconstruction and localization, suggesting that radar-specific anchors are important for meaningful explanations in this setting.

[LG-78] Regularized f-Divergence Kernel Tests

Link: https://arxiv.org/abs/2601.19755
Authors: Mónica Ribero, Antonin Schrab, Arthur Gretton
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We propose a framework to construct practical kernel-based two-sample tests from the family of f-divergences. The test statistic is computed from the witness function of a regularized variational representation of the divergence, which we estimate using kernel methods. The proposed test is adaptive over hyperparameters such as the kernel bandwidth and the regularization parameter. We provide theoretical guarantees for statistical test power across our family of f-divergence estimates. While our test covers a variety of f-divergences, we bring particular focus to the Hockey-Stick divergence, motivated by its applications to differential privacy auditing and machine unlearning evaluation. For two-sample testing, experiments demonstrate that different f-divergences are sensitive to different localized differences, illustrating the importance of leveraging diverse statistics. For machine unlearning, we propose a relative test that distinguishes true unlearning failures from safe distributional variations.

[LG-79] Learning the Intrinsic Dimensionality of Fermi-Pasta-Ulam-Tsingou Trajectories: A Nonlinear Approach using a Deep Autoencoder Model

Link: https://arxiv.org/abs/2601.19567
Authors: Gionni Marchetti
Subjects: Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
Comments: 10 pages, 10 figures. Preliminary results were presented in November 2025 at the IUPAP Conference on Computational Physics, CP2025 XXXVI, Oak Ridge National Laboratory in Oak Ridge

Abstract:We address the intrinsic dimensionality (ID) of high-dimensional trajectories, comprising $n_s = 4{,}000{,}000$ data points, of the Fermi-Pasta-Ulam-Tsingou (FPUT) $\beta$ model with $N = 32$ oscillators. To this end, a deep autoencoder (DAE) model is employed to infer the ID in the weakly nonlinear regime ($\beta \lesssim 1$). We find that the trajectories lie on a nonlinear manifold of dimension $m^\ast = 2$ embedded in a 64-dimensional phase space. The DAE further reveals that this dimensionality increases to $m^\ast = 3$ at $\beta = 1.1$, coinciding with a symmetry breaking transition, in which additional energy modes with even wave numbers $k = 2, 4$ become excited. Finally, we discuss the limitations of the linear approach based on principal component analysis (PCA), which fails to capture the underlying structure of the data and therefore yields unreliable results in most cases.

[LG-80] Generalizable Equivariant Diffusion Models for Non-Abelian Lattice Gauge Theory

Link: https://arxiv.org/abs/2601.19552
Authors: Gert Aarts, Diaa E. Habibi, Andreas Ipp, David I. Müller, Thomas R. Ranner, Lingxiao Wang, Wei Wang, Qianteng Zhu
Subjects: High Energy Physics - Lattice (hep-lat); Machine Learning (cs.LG)
Comments: 9 pages, 5 figures, 2 tables

Abstract:We demonstrate that gauge equivariant diffusion models can accurately model the physics of non-Abelian lattice gauge theory using the Metropolis-adjusted annealed Langevin algorithm (MAALA), as exemplified by computations in two-dimensional U(2) and SU(2) gauge theories. Our network architecture is based on lattice gauge equivariant convolutional neural networks (L-CNNs), which respect local and global symmetries on the lattice. Models are trained on a single ensemble generated using a traditional Monte Carlo method. By studying Wilson loops of various size as well as the topological susceptibility, we find that the diffusion approach generalizes remarkably well to larger inverse couplings and lattice sizes with negligible loss of accuracy while retaining moderately high acceptance rates.

[LG-81] Improved Convergence Rates of Muon Optimizer for Nonconvex Optimization

Link: https://arxiv.org/abs/2601.19400
Authors: Shuntaro Nagashima,Hideaki Iiduka
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The Muon optimizer has recently attracted attention due to its orthogonalized first-order updates, and a deeper theoretical understanding of its convergence behavior is essential for guiding practical applications; however, existing convergence guarantees are either coarse or obtained under restrictive analytical settings. In this work, we establish sharper convergence guarantees for the Muon optimizer through a direct and simplified analysis that does not rely on restrictive assumptions on the update rule. Our results improve upon existing bounds by achieving faster convergence rates while covering a broader class of problem settings. These findings provide a more accurate theoretical characterization of Muon and offer insights applicable to a broader class of orthogonalized first-order methods.

[LG-82] Optimal Asynchronous Stochastic Nonconvex Optimization under Heavy-Tailed Noise

Link: https://arxiv.org/abs/2601.19379
Authors: Yidong Wu,Luo Luo
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:This paper considers the problem of asynchronous stochastic nonconvex optimization with heavy-tailed gradient noise and arbitrarily heterogeneous computation times across workers. We propose an asynchronous normalized stochastic gradient descent algorithm with momentum. The analysis shows that our method achieves the optimal time complexity under the assumption of a bounded $p$th-order central moment with $p \in (1,2]$. We also provide numerical experiments to show the effectiveness of the proposed method.
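
To make the "normalized SGD with momentum" ingredient concrete, here is a minimal single-worker sketch. The asynchronous scheduling across workers is not modeled, and the function name, the exponential-average form of the momentum, and the default hyperparameters are assumptions rather than the paper's algorithm. The point it shows is that normalization caps the step length at `lr`, however large a heavy-tailed gradient sample is.

```python
import math

def normalized_momentum_step(x, m, grad, lr=0.1, beta=0.9):
    """One normalized-momentum update (illustrative, single worker)."""
    # Exponential moving average of stochastic gradients (assumed form).
    m_new = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, grad)]
    norm = math.sqrt(sum(v * v for v in m_new))
    if norm == 0.0:
        return list(x), m_new  # no direction to move in
    # Only the direction of the momentum is used, so the step has length lr.
    x_new = [xi - lr * mi / norm for xi, mi in zip(x, m_new)]
    return x_new, m_new
```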

[LG-83] SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper ICASSP2026

Link: https://arxiv.org/abs/2601.19194
Authors: Alexander Polok,Dominik Klement,Samuele Cornell,Matthew Wiesner,Jan Černocký,Sanjeev Khudanpur,Lukáš Burget
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Comments: Accepted to ICASSP 2026

Click to view abstract

Abstract:Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a major challenge. While some approaches achieve strong performance when fine-tuned on specific domains, few systems generalize well across out-of-domain datasets. Our prior work, Diarization-Conditioned Whisper (DiCoW), leverages speaker diarization outputs as conditioning information and, with minimal fine-tuning, demonstrated strong multilingual and multi-domain performance. In this paper, we address a key limitation of DiCoW: ambiguity in Silence-Target-Non-target-Overlap (STNO) masks, where two or more fully overlapping speakers may have nearly identical conditioning despite differing transcriptions. We introduce SE-DiCoW (Self-Enrolled Diarization-Conditioned Whisper), which uses diarization output to locate an enrollment segment anywhere in the conversation where the target speaker is most active. This enrollment segment is used as fixed conditioning via cross-attention at each encoder layer. We further refine DiCoW with improved data segmentation, model initialization, and augmentation. Together, these advances yield substantial gains: SE-DiCoW reduces macro-averaged tcpWER by 52.4% relative to the original DiCoW on the EMMA MT-ASR benchmark.
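
The STNO ambiguity described above can be seen in a small sketch. It assumes the diarization output is a binary speaker-by-frame activity matrix and that frames are labeled hard as Silence, Target, Non-target, or Overlap; the enrollment rule (pick the window with the most target-active frames) is a simplified stand-in for "where the target speaker is most active", not the paper's exact criterion.

```python
def stno_labels(activity, target):
    """Per-frame S/T/N/O labels for one target speaker.

    activity[s][t] is 1 if speaker s is active in frame t.
    """
    labels = []
    for t in range(len(activity[target])):
        tgt = activity[target][t] == 1
        others = any(
            activity[s][t] == 1 for s in range(len(activity)) if s != target
        )
        if tgt and others:
            labels.append("O")  # overlap
        elif tgt:
            labels.append("T")  # target only
        elif others:
            labels.append("N")  # non-target only
        else:
            labels.append("S")  # silence
    return labels

def enrollment_window(activity, target, width):
    """Start index of the window with the most target-active frames.

    Ties go to the earliest window (an arbitrary illustrative choice).
    """
    row = activity[target]
    best_start, best_score = 0, -1
    for start in range(len(row) - width + 1):
        score = sum(row[start:start + width])
        if score > best_score:
            best_start, best_score = start, score
    return best_start
```

For two speakers who are active in exactly the same frames, `stno_labels` returns identical sequences for both targets, which is the case where extra conditioning from an enrollment segment is needed to tell them apart.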

[LG-84] Double Fairness Policy Learning: Integrating Action Fairness and Outcome Fairness in Decision-making

Link: https://arxiv.org/abs/2601.19186
Authors: Zeyu Bian,Lan Wang,Chengchun Shi,Zhengling Qi
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Fairness is a central pillar of trustworthy machine learning, especially in domains where accuracy- or profit-driven optimization is insufficient. While most fairness research focuses on supervised learning, fairness in policy learning remains less explored. Because policy learning is interventional, it induces two distinct fairness targets: action fairness (equitable action assignments) and outcome fairness (equitable downstream consequences). Crucially, equalizing actions does not generally equalize outcomes when groups face different constraints or respond differently to the same action. We propose a novel double fairness learning (DFL) framework that explicitly manages the trade-off among three objectives: action fairness, outcome fairness, and value maximization. We integrate fairness directly into a multi-objective optimization problem for policy learning and employ a lexicographic weighted Tchebyshev method that recovers Pareto solutions beyond convex settings, with theoretical guarantees on the regret bounds. Our framework is flexible and accommodates various commonly used fairness notions. Extensive simulations demonstrate improved performance relative to competing methods. In applications to a motor third-party liability insurance dataset and an entrepreneurship training dataset, DFL substantially improves both action and outcome fairness while incurring only a modest reduction in overall value.
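
A small sketch may help show why a weighted Tchebyshev scalarization can reach Pareto solutions that a weighted sum cannot. It works over a finite set of hypothetical candidate policies, each summarized by three losses to minimize (action unfairness, outcome unfairness, value shortfall); the candidate values, weights, and ideal point are invented for illustration, and the paper's actual optimization over a policy class is not reproduced.

```python
def weighted_tchebyshev(objectives, weights, ideal):
    """Largest weighted deviation from the ideal point."""
    return max(w * abs(f - z) for f, w, z in zip(objectives, weights, ideal))

def select_policy(candidates, weights, ideal):
    """Lexicographic weighted Tchebyshev selection over a finite set.

    Stage 1 minimizes the largest weighted deviation; stage 2 breaks ties
    by the weighted sum of deviations.
    """
    def key(name):
        objs = candidates[name]
        primary = weighted_tchebyshev(objs, weights, ideal)
        secondary = sum(w * abs(f - z) for f, w, z in zip(objs, weights, ideal))
        return (primary, secondary)
    return min(candidates, key=key)

# Hypothetical candidates: (action unfairness, outcome unfairness, value loss).
CANDIDATES = {
    "A": (0.0, 1.0, 0.5),
    "B": (1.0, 0.0, 0.5),
    "C": (0.6, 0.6, 0.5),  # Pareto-optimal, but in a non-convex part of the front
}
```

With equal weights and ideal point (0, 0, 0), `select_policy` returns "C", whereas no positive-weight linear combination of the three losses ranks "C" first.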

[LG-85] Convergence of Muon with Newton-Schulz ICLR2026

Link: https://arxiv.org/abs/2601.19156
Authors: Gyu Yeol Kim,Min-hwan Oh
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Accepted at ICLR 2026

Click to view abstract

Abstract:We analyze Muon as originally proposed and used in practice – using the momentum orthogonalization with a few Newton-Schulz steps. The prior theoretical results replace this key step in Muon with an exact SVD-based polar factor. We prove that Muon with Newton-Schulz converges to a stationary point at the same rate as the SVD-polar idealization, up to a constant factor for a given number q of Newton-Schulz steps. We further analyze this constant factor and prove that it converges to 1 doubly exponentially in q and improves with the degree of the polynomial used in Newton-Schulz for approximating the orthogonalization direction. We also prove that Muon removes the typical square-root-of-rank loss compared to its vector-based counterpart, SGD with momentum. Our results explain why Muon with a few low-degree Newton-Schulz steps matches exact-polar (SVD) behavior at a much faster wall-clock time and explain how much momentum matrix orthogonalization via Newton-Schulz benefits over the vector-based optimizer. Overall, our theory justifies the practical Newton-Schulz design of Muon, narrowing its practice-theory gap.
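
For readers unfamiliar with the Newton-Schulz step, the following sketch runs the classical cubic iteration X ← 1.5·X − 0.5·X·Xᵀ·X on a Frobenius-normalized matrix, which drives every nonzero singular value toward 1 and so approximates the polar factor without an SVD. This is the textbook degree-3 variant chosen for clarity; practical Muon implementations may use different polynomial coefficients, and the momentum bookkeeping is left out. The helper names are illustrative.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product for small dense matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def transpose(a):
    return [list(row) for row in zip(*a)]

def newton_schulz(g, steps=5):
    """Approximate the polar factor of g with `steps` cubic iterations."""
    fro = math.sqrt(sum(v * v for row in g for v in row))
    if fro == 0.0:
        return [list(row) for row in g]
    # After normalization all singular values lie in (0, 1], inside the
    # region where the cubic iteration converges.
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * x[i][j] - 0.5 * xxt_x[i][j] for j in range(len(x[0]))]
            for i in range(len(x))
        ]
    return x
```

On the diagonal matrix `[[2.0, 0.0], [0.0, 0.5]]`, whose polar factor is the identity, a few steps already give a rough approximation and about ten steps reach it to high precision, consistent with the fast convergence in the number of steps q described in the abstract.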

[LG-86] Transformer Learning of Chaotic Collective Dynamics in Many-Body Systems

Link: https://arxiv.org/abs/2601.19080
Authors: Ho Jang,Gia-Wei Chern
Categories: Computational Physics (physics.comp-ph); Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures

Click to view abstract

Abstract:Learning reduced descriptions of chaotic many-body dynamics is fundamentally challenging: although microscopic equations are Markovian, collective observables exhibit strong memory and exponential sensitivity to initial conditions and prediction errors. We show that a self-attention-based transformer framework provides an effective approach for modeling such chaotic collective dynamics directly from time-series data. By selectively reweighting long-range temporal correlations, the transformer learns a non-Markovian reduced description that overcomes intrinsic limitations of conventional recurrent architectures. As a concrete demonstration, we study the one-dimensional semiclassical Holstein model, where interaction quenches induce strongly nonlinear and chaotic dynamics of the charge-density-wave order parameter. While pointwise predictions inevitably diverge at long times, the transformer faithfully reproduces the statistical “climate” of the chaos, including temporal correlations and characteristic decay scales. Our results establish self-attention as a powerful mechanism for learning effective reduced dynamics in chaotic many-body systems.

[LG-87] C2NP: A Benchmark for Learning Scale-Dependent Geometric Invariances in 3D Materials Generation

Link: https://arxiv.org/abs/2601.19076
Authors: Can Polat,Erchin Serpedin,Mustafa Kurban,Hasan Kurban
Categories: Materials Science (cond-mat.mtrl-sci); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments:

Click to view abstract

Abstract:Generative models for materials have achieved strong performance on periodic bulk crystals, yet their ability to generalize across scale transitions to finite nanostructures remains largely untested. We introduce Crystal-to-Nanoparticle (C2NP), a systematic benchmark for evaluating generative models when moving between infinite crystalline unit cells and finite nanoparticles, where surface effects and size-dependent distortions dominate. C2NP defines two complementary tasks: (i) generating nanoparticles of specified radii from periodic unit cells, testing whether models capture surface truncation and geometric constraints; and (ii) recovering bulk lattice parameters and space-group symmetry from finite particle configurations, assessing whether models can infer underlying crystallographic order despite surface perturbations. Using diverse materials as a structurally consistent testbed, we construct over 170,000 nanoparticle configurations by carving particles from supercells derived from DFT-relaxed crystal unit cells, and introduce size-based splits that separate interpolation from extrapolation regimes. Experiments with state-of-the-art approaches, including diffusion, flow-matching, and variational models, show that even when losses are low, models often fail geometrically under distribution shift, yielding large lattice-recovery errors and near-zero joint accuracy on structure and symmetry. Overall, our results suggest that current methods rely on template memorization rather than scalable physical generalization. C2NP offers a controlled, reproducible framework for diagnosing these failures, with immediate applications to nanoparticle catalyst design, nanostructured hydrides for hydrogen storage, and materials discovery. Dataset and code are available at this https URL.

[LG-88] Smooth embeddings in contracting recurrent networks driven by regular dynamics: A synthesis for neural representation

Link: https://arxiv.org/abs/2601.19019
Authors: Vikas N. O’Reilly-Shah,Alessandro Maria Selvitella
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
Comments: 27 pages, 1 figure, 2 tables

Click to view abstract

Abstract:Recurrent neural networks trained for time-series prediction often develop latent trajectories that preserve qualitative structure of the dynamical systems generating their inputs. Recent empirical work has documented topology-preserving latent organization in trained recurrent models, and recent theoretical results in reservoir computing establish conditions under which the synchronization map is an embedding. Here we synthesize these threads into a unified account of when contracting recurrent networks yield smooth, topology-preserving internal representations for a broad and biologically relevant class of inputs: regular dynamics on invariant circles and tori. Our contribution is an integrated framework that assembles (i) generalized synchronization and embedding guarantees for contracting reservoirs, (ii) regularity mechanisms ensuring differentiability of the synchronization map under mild constraints, and (iii) a base-system viewpoint in which the invariant manifold generating the input stream is treated as the driving system. In this regular setting, the conditions commonly viewed as restrictive in chaotic-attractor analyses become mild and readily satisfied by standard contractive architectures. The framework clarifies how representational content in recurrent circuits is inherently historical: the network state encodes finite windows of input history rather than instantaneous stimuli. By consolidating disparate empirical and theoretical results under common assumptions, the synthesis yields concrete, testable expectations about when prediction-trained recurrent circuits should (or should not) form smooth latent embeddings and how required state dimension scales with the intrinsic dimension of the driving dynamics. 

[LG-89] Collaborative Compressors in Distributed Mean Estimation with Limited Communication Budget

Link: https://arxiv.org/abs/2601.18950
Authors: Harsh Vardhan,Arya Mazumdar
Categories: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Distributed high-dimensional mean estimation is a common aggregation routine in distributed optimization methods. Most of these applications call for a communication-constrained setting where the vectors whose mean is to be estimated have to be compressed before sharing. One could independently encode and decode these to achieve compression, but that overlooks the fact that these vectors are often close to each other. To exploit these similarities, Suresh et al. (2022), Jhunjhunwala et al. (2021), and Jiang et al. (2023) recently proposed multiple correlation-aware compression schemes. However, in most cases, the correlations have to be known for these schemes to work. Moreover, theoretical analysis of the graceful degradation of these correlation-aware compression schemes with increasing dissimilarity is limited to the $\ell_2$-error in the literature. In this paper, we propose four different collaborative compression schemes that agnostically exploit the similarities among vectors in a distributed setting. Our schemes are all simple to implement and computationally efficient, while resulting in big savings in communication. The analysis of our proposed schemes shows how the $\ell_2$, $\ell_\infty$, and cosine estimation errors vary with the degree of similarity among vectors.
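
The intuition that nearby vectors compress better can be sketched with unbiased stochastic rounding. The code below is not one of the four schemes from the paper: it is a generic illustration in which every client quantizes its offset from a shared reference vector onto a coarse grid. The reference, the `radius` bound on offsets, and all names are assumptions. When vectors are close together, a small `radius` suffices, so the same number of levels gives a finer grid than quantizing each vector over its full range.

```python
import random

def stochastic_quantize(v, lo, hi, levels, rng):
    """Unbiased rounding of each coordinate onto `levels` grid points in [lo, hi]."""
    step = (hi - lo) / (levels - 1)
    out = []
    for x in v:
        pos = (min(max(x, lo), hi) - lo) / step
        floor = int(pos)
        frac = pos - floor
        # Round up with probability equal to the fractional part.
        idx = floor + (1 if rng.random() < frac else 0)
        idx = min(idx, levels - 1)
        out.append(lo + idx * step)
    return out

def estimate_mean(vectors, reference, radius, levels, rng):
    """Average of vectors reconstructed from quantized offsets to `reference`."""
    dim = len(reference)
    total = [0.0] * dim
    for v in vectors:
        offset = [a - b for a, b in zip(v, reference)]
        q = stochastic_quantize(offset, -radius, radius, levels, rng)
        for j in range(dim):
            total[j] += reference[j] + q[j]
    return [t / len(vectors) for t in total]
```

As long as every offset lies within `[-radius, radius]`, each coordinate of the estimate is within one grid step, `2 * radius / (levels - 1)`, of the true mean.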

[LG-90] Advances in Diffusion-Based Generative Compression

Link: https://arxiv.org/abs/2601.18932
Authors: Yibo Yang,Stephan Mandt
Categories: Image and Video Processing (eess.IV); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Preprint

Click to view abstract

Abstract:Popularized by their strong image generation performance, diffusion and related methods for generative modeling have found widespread success in visual media applications. In particular, diffusion methods have enabled new approaches to data compression, where realistic reconstructions can be generated at extremely low bit-rates. This article provides a unifying review of recent diffusion-based methods for generative lossy compression, with a focus on image compression. These methods generally encode the source into an embedding and employ a diffusion model to iteratively refine it in the decoding procedure, such that the final reconstruction approximately follows the ground truth data distribution. The embedding can take various forms and is typically transmitted via an auxiliary entropy model, and recent methods also explore the use of diffusion models themselves for information transmission via channel simulation. We review representative approaches through the lens of rate-distortion-perception theory, highlighting the role of common randomness and connections to inverse problems, and identify open challenges.

[LG-91] Implicit Q-Learning and SARSA: Liberating Policy Control from Step-Size Calibration

Link: https://arxiv.org/abs/2601.18907
Authors: Hwanwoo Kim,Eric Laber
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Q-learning and SARSA are foundational reinforcement learning algorithms whose practical success depends critically on step-size calibration. Step-sizes that are too large can cause numerical instability, while step-sizes that are too small can lead to slow progress. We propose implicit variants of Q-learning and SARSA that reformulate their iterative updates as fixed-point equations. This yields an adaptive step-size adjustment that scales inversely with feature norms, providing automatic regularization without manual tuning. Our non-asymptotic analyses demonstrate that implicit methods maintain stability over significantly broader step-size ranges. Under favorable conditions, they permit arbitrarily large step-sizes while achieving comparable convergence rates. Empirical validation across benchmark environments spanning discrete and continuous state spaces shows that implicit Q-learning and SARSA exhibit substantially reduced sensitivity to step-size selection, achieving stable performance with step-sizes that would cause standard methods to fail.
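
One way to read the fixed-point reformulation, under the assumption of linear function approximation Q(s, a) = φ(s, a)ᵀθ, is to evaluate the current-state value at the new iterate: θ' = θ + α(target − φᵀθ')φ. Solving this for θ' gives θ' = θ + [α / (1 + α‖φ‖²)](target − φᵀθ)φ, so the effective step shrinks with the feature norm. The sketch below contrasts this with the standard update; it is a plausible illustration of the idea, not necessarily the exact scheme analyzed in the paper, and all names are hypothetical.

```python
def q_value(theta, phi):
    """Linear action-value estimate phi^T theta."""
    return sum(t * f for t, f in zip(theta, phi))

def td_target(theta, reward, next_phis, gamma):
    """Q-learning target; next_phis holds features of each next action."""
    best_next = max((q_value(theta, p) for p in next_phis), default=0.0)
    return reward + gamma * best_next

def standard_update(theta, phi, reward, next_phis, alpha, gamma):
    delta = td_target(theta, reward, next_phis, gamma) - q_value(theta, phi)
    return [t + alpha * delta * f for t, f in zip(theta, phi)]

def implicit_update(theta, phi, reward, next_phis, alpha, gamma):
    delta = td_target(theta, reward, next_phis, gamma) - q_value(theta, phi)
    # Closed-form solution of the fixed-point equation (assumed form):
    # the step is damped by 1 + alpha * ||phi||^2.
    scale = alpha / (1.0 + alpha * sum(f * f for f in phi))
    return [t + scale * delta * f for t, f in zip(theta, phi)]
```

With a terminal transition (no next actions), reward 10, features `[1.0, 2.0]`, and step-size 100, the standard update overshoots the target by orders of magnitude, while the implicit update moves the prediction close to 10 without passing it.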

[LG-92] Statistical Inference for Explainable Boosting Machines AISTATS2026

Link: https://arxiv.org/abs/2601.18857
Authors: Haimo Fang,Kevin Tan,Jonathan Pipping,Giles Hooker
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Accepted to AISTATS 2026 (poster)

Click to view abstract

Abstract:Explainable boosting machines (EBMs) are popular “glass-box” models that learn a set of univariate functions using boosting trees. These achieve explainability through visualizations of each feature’s effect. However, unlike linear model coefficients, uncertainty quantification for the learned univariate functions requires computationally intensive bootstrapping, making it hard to know which features truly matter. We provide an alternative using recent advances in statistical inference for gradient boosting, deriving methods for statistical inference as well as end-to-end theoretical guarantees. Using a moving average instead of a sum of trees (Boulevard regularization) allows the boosting process to converge to a feature-wise kernel ridge regression. This produces asymptotically normal predictions that achieve the minimax-optimal mean squared error for fitting Lipschitz GAMs with $p$ features at rate $O(p n^{-2/3})$, successfully avoiding the curse of dimensionality. We then construct prediction intervals for the response and confidence intervals for each learned univariate function with a runtime independent of the number of datapoints, enabling further explainability within EBMs.

[LG-93] Deep g-Pricing for CSI 300 Index Options with Volatility Trajectories and Market Sentiment

Link: https://arxiv.org/abs/2601.18804
Authors: Yilun Zhang,Zheng Tang,Hexiang Sun,Yufeng Shi
Categories: Computational Finance (q-fin.CP); Machine Learning (cs.LG); Probability (math.PR); Pricing of Securities (q-fin.PR)
Comments: 25 pages, 6 figures, 10 tables. Submitted to IMA Journal of Management Mathematics

Click to view abstract

Abstract:Option pricing in real markets faces fundamental challenges. The Black–Scholes–Merton (BSM) model assumes constant volatility and uses a linear generator $g(t,x,y,z) = -ry$, while lacking explicit behavioral factors, resulting in systematic departures from observed dynamics. This paper extends the BSM model by learning a nonlinear generator within a deep Forward–Backward Stochastic Differential Equation (FBSDE) framework. We propose a dual-network architecture where the value network $u_\theta$ learns option prices and the generator network $g_\phi$ characterizes the pricing mechanism, with the hedging strategy $Z_t = \sigma_t X_t \nabla_x u_\theta$ obtained via automatic differentiation. The framework adopts forward recursion from a learnable initial condition $Y_0 = u_\theta(0,\cdot)$, naturally accommodating volatility trajectory and sentiment features. Empirical results on CSI 300 index options show that our method reduces Mean Absolute Error (MAE) by 32.2% and Mean Absolute Percentage Error (MAPE) by 35.3% compared with BSM. Interpretability analysis indicates that architectural improvements are effective across all option types, while the information advantage is asymmetric between calls and puts. Specifically, call option improvements are primarily driven by sentiment features, whereas put options show more balanced contributions from volatility trajectory and sentiment features. This finding aligns with economic intuition regarding option pricing mechanisms.
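
For reference, the baseline this abstract compares against can be written in a few lines. The sketch below implements the standard closed-form BSM price of a European call with constant volatility, plus the two error metrics named in the abstract. It covers only the baseline and the metrics, not the paper's deep FBSDE model, and the function and parameter names are illustrative.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bsm_call(spot, strike, rate, vol, maturity):
    """Black-Scholes-Merton price of a European call (no dividends)."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol ** 2) * maturity) / (
        vol * math.sqrt(maturity)
    )
    d2 = d1 - vol * math.sqrt(maturity)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)

def mae(predicted, observed):
    """Mean absolute error."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)

def mape(predicted, observed):
    """Mean absolute percentage error, returned as a fraction."""
    return sum(abs(p - o) / abs(o) for p, o in zip(predicted, observed)) / len(observed)
```

Comparing a learned pricer with this baseline would then amount to computing `mae` and `mape` for both sets of predictions against observed market prices.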

Information Retrieval

[IR-0] Reimagining Social Robots as Recommender Systems: Foundations, Framework, and Applications

Link: https://arxiv.org/abs/2601.19761
Authors: Jin Huang,Fethiye Irmak Doğan,Hatice Gunes
Categories: Robotics (cs.RO); Information Retrieval (cs.IR)
Comments: HRI 2026

Click to view abstract

Abstract:Personalization in social robots refers to the ability of the robot to meet the needs and/or preferences of an individual user. Existing approaches typically rely on large language models (LLMs) to generate context-aware responses based on user metadata and historical interactions or on adaptive methods such as reinforcement learning (RL) to learn from users’ immediate reactions in real time. However, these approaches fall short of comprehensively capturing user preferences (including long-term, short-term, and fine-grained aspects) and of using them to rank and select actions, proactively personalize interactions, and ensure ethically responsible adaptations. To address these limitations, we propose drawing on recommender systems (RSs), which specialize in modeling user preferences and providing personalized recommendations. To ensure the integration of RS techniques is well-grounded and seamless throughout the social robot pipeline, we (i) align the paradigms underlying social robots and RSs, (ii) identify key techniques that can enhance personalization in social robots, and (iii) design them as modular, plug-and-play components. This work not only establishes a framework for integrating RS techniques into social robots but also opens a pathway for deep collaboration between the RS and HRI communities, accelerating innovation in both fields.

[IR-1] Differentiable Semantic ID for Generative Recommendation

Link: https://arxiv.org/abs/2601.19711
Authors: Junchen Fu,Xuri Ge,Alexandros Karatzoglou,Ioannis Arapakis,Suzan Verberne,Joemon M. Jose,Zhaochun Ren
Categories: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Generative recommendation provides a novel paradigm in which each item is represented by a discrete semantic ID (SID) learned from rich content. Most existing methods treat SIDs as predefined and train recommenders under static indexing. In practice, SIDs are typically optimized only for content reconstruction rather than recommendation accuracy. This leads to an objective mismatch: the system optimizes an indexing loss to learn the SID and a recommendation loss for interaction prediction, but because the tokenizer is trained independently, the recommendation loss cannot update it. A natural approach is to make semantic indexing differentiable so that recommendation gradients can directly influence SID learning, but this often causes codebook collapse, where only a few codes are used. We attribute this issue to early deterministic assignments that limit codebook exploration, resulting in imbalance and unstable optimization. In this paper, we propose DIGER (Differentiable Semantic ID for Generative Recommendation), a first step toward effective differentiable semantic IDs for generative recommendation. DIGER introduces Gumbel noise to explicitly encourage early-stage exploration over codes, mitigating codebook collapse and improving code utilization. To balance exploration and convergence, we further design two uncertainty decay strategies that gradually reduce the Gumbel noise, enabling a smooth transition from early exploration to exploitation of learned SIDs. Extensive experiments on multiple public datasets demonstrate consistent improvements from differentiable semantic IDs. These results confirm the effectiveness of aligning indexing and recommendation objectives through differentiable SIDs and highlight differentiable semantic indexing as a promising research direction. 
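
The role of Gumbel noise in code assignment can be sketched as follows. The code perturbs the negative distances from an item embedding to each codebook entry with scaled Gumbel noise and applies a softmax, so early training can explore codes other than the nearest one; a decay schedule then shrinks the noise. The linear schedule and all function and parameter names are hypothetical, since the abstract does not specify the two decay strategies used by DIGER.

```python
import math
import random

def gumbel_soft_assign(distances, noise_scale, temperature, rng):
    """Soft assignment over codes from distances plus scaled Gumbel noise."""
    logits = []
    for d in distances:
        # Clamp the uniform sample away from 0 and 1 to keep the logs finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        logits.append((-d + noise_scale * gumbel) / temperature)
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]

def linear_decay(step, total_steps, start=1.0, end=0.0):
    """Hypothetical noise schedule: linear from `start` to `end`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac
```

With `noise_scale=0.0` the assignment is deterministic and peaks at the nearest code; with larger noise early in training, farther codes are sometimes preferred, which is the exploration effect the abstract attributes to Gumbel noise.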

[IR-2] Comparing how Large Language Models perform against keyword-based searches for social science research data discovery

Link: https://arxiv.org/abs/2601.19559
Authors: Mark Green,Maura Halstead,Caroline Jay,Richard Kingston,Alex Singleton,David Topping
Categories: Information Retrieval (cs.IR); Applications (stat.AP)
Comments:

Click to view abstract

Abstract:This paper evaluates the performance of a large language model (LLM) based semantic search tool relative to a traditional keyword-based search for data discovery. Using real-world search behaviour, we compare outputs from a bespoke semantic search system applied to UKRI data services with the Consumer Data Research Centre (CDRC) keyword search. Analysis is based on 131 of the most frequently used search terms extracted from CDRC search logs between December 2023 and October 2024. We assess differences in the volume, overlap, ranking, and relevance of returned datasets using descriptive statistics, qualitative inspection, and quantitative similarity measures, including exact dataset overlap, Jaccard similarity, and cosine similarity derived from BERT embeddings. Results show that the semantic search consistently returns a larger number of results than the keyword search and performs particularly well for place based, misspelled, obscure, or complex queries. While the semantic search does not capture all keyword based results, the datasets returned are overwhelmingly semantically similar, with high cosine similarity scores despite lower exact overlap. Rankings of the most relevant results differ substantially between tools, reflecting contrasting prioritisation strategies. Case studies demonstrate that the LLM based tool is robust to spelling errors, interprets geographic and contextual relevance effectively, and supports natural-language queries that keyword search fails to resolve. Overall, the findings suggest that LLM driven semantic search offers a substantial improvement for data discovery, complementing rather than fully replacing traditional keyword-based approaches.
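
The overlap measures named in the abstract are simple to state in code. The sketch below computes exact dataset overlap and Jaccard similarity between two result lists, and cosine similarity between two embedding vectors. The dataset identifiers in the usage note are made up, and producing the BERT embeddings themselves is outside the sketch.

```python
import math

def exact_overlap(results_a, results_b):
    """Number of datasets returned by both tools."""
    return len(set(results_a) & set(results_b))

def jaccard(results_a, results_b):
    """Intersection over union of the two result sets."""
    union = set(results_a) | set(results_b)
    if not union:
        return 1.0  # two empty result sets are treated as identical
    return len(set(results_a) & set(results_b)) / len(union)

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

For example, keyword results `["d1", "d2", "d3"]` and semantic results `["d2", "d3", "d4", "d5"]` share two datasets out of five distinct ones, giving a Jaccard similarity of 0.4. Low exact overlap of this kind can coexist with high cosine similarity between the returned datasets' embeddings, which is the pattern the abstract reports.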

[IR-3] LURE-RAG: Lightweight Utility-driven Reranking for Efficient RAG

Link: https://arxiv.org/abs/2601.19535
Authors: Manish Chandra,Debasis Ganguly,Iadh Ounis
Categories: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Most conventional Retrieval-Augmented Generation (RAG) pipelines rely on relevance-based retrieval, which often misaligns with utility – that is, whether the retrieved passages actually improve the quality of the generated text specific to a downstream task such as question answering or query-based summarization. Existing utility-driven retrieval approaches for RAG have two limitations: first, they are resource-intensive, typically requiring query encoding, and second, they do not use a listwise ranking loss during training. The latter limitation is particularly critical, as the relative order between documents directly affects generation in RAG. To address this gap, we propose Lightweight Utility-driven Reranking for Efficient RAG (LURE-RAG), a framework that augments any black-box retriever with an efficient LambdaMART-based reranker. Unlike prior methods, LURE-RAG trains the reranker with a listwise ranking loss guided by LLM utility, thereby directly optimizing the ordering of retrieved documents. Experiments on two standard datasets demonstrate that LURE-RAG achieves competitive performance, reaching 97-98% of the state-of-the-art dense neural baseline, while remaining efficient in both training and inference. Moreover, its dense variant, UR-RAG, significantly outperforms the best existing baseline by up to 3%.

[IR-4] Masked Diffusion Generative Recommendation

Link: https://arxiv.org/abs/2601.19501
Authors: Lingyu Mu,Hao Deng,Haibo Xing,Jinxin Hu,Yu Zhang,Xiaoyi Zeng,Jing Zhang
Categories: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Generative recommendation (GR) typically first quantizes continuous item embeddings into multi-level semantic IDs (SIDs), and then generates the next item via autoregressive decoding. Although existing methods are already competitive in terms of recommendation performance, directly inheriting the autoregressive decoding paradigm from language models still suffers from three key limitations: (1) autoregressive decoding struggles to jointly capture global dependencies among the multi-dimensional features associated with different positions of SID; (2) using a unified, fixed decoding path for the same item implicitly assumes that all users attend to item attributes in the same order; (3) autoregressive decoding is inefficient at inference time and struggles to meet real-time requirements. To tackle these challenges, we propose MDGR, a Masked Diffusion Generative Recommendation framework that reshapes the GR pipeline from three perspectives: codebook, training, and inference. (1) We adopt a parallel codebook to provide a structural foundation for diffusion-based GR. (2) During training, we adaptively construct masking supervision signals along both the temporal and sample dimensions. (3) During inference, we develop a warm-up-based two-stage parallel decoding strategy for efficient generation of SIDs. Extensive experiments on multiple public and industrial-scale datasets show that MDGR outperforms ten state-of-the-art baselines by up to 10.78%. Furthermore, by deploying MDGR on a large-scale online advertising platform, we achieve a 1.20% increase in revenue, demonstrating its practical value. The code will be released upon acceptance.

[IR-5] UniRec: Unified Multimodal Encoding for LLM-Based Recommendations

Link: https://arxiv.org/abs/2601.19423
Authors: Zijie Lei,Tao Feng,Zhigang Hua,Yan Xie,Guanyu Lin,Shuang Yang,Ge Liu,Jiaxuan You
Categories: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Large language models have recently shown promise for multimodal recommendation, particularly with text and image inputs. Yet real-world recommendation signals extend far beyond these modalities. To reflect this, we formalize recommendation features into four modalities: text, images, categorical features, and numerical attributes, and highlight the unique challenges this heterogeneity poses for LLMs in understanding multimodal information. In particular, these challenges arise not only across modalities but also within them, as attributes such as price, rating, and time may all be numeric yet carry distinct semantic meanings. Beyond this intra-modality ambiguity, another major challenge is the nested structure of recommendation signals, where user histories are sequences of items, each associated with multiple attributes. To address these challenges, we propose UniRec, a unified multimodal encoder for LLM-based recommendation. UniRec first employs modality-specific encoders to produce consistent embeddings across heterogeneous signals. It then adopts a triplet representation, comprising attribute name, type, and value, to separate schema from raw inputs and preserve semantic distinctions. Finally, a hierarchical Q-Former models the nested structure of user interactions while maintaining their layered organization. Across multiple real-world benchmarks, UniRec outperforms state-of-the-art multimodal and LLM-based recommenders by up to 15%, and extensive ablation studies further validate the contributions of each component.

[IR-6] Physics-Informed Neuro-Symbolic Recommender System: A Dual-Physics Approach for Personalized Nutrition

Link: https://arxiv.org/abs/2601.19244
Authors: Chayan Banerjee
Category: Information Retrieval (cs.IR)

Abstract:Traditional e-commerce recommender systems primarily optimize for user engagement and purchase likelihood, often neglecting the rigid physiological constraints required for human health. Standard collaborative filtering algorithms are structurally blind to these hard limits, frequently suggesting bundles that fail to meet specific total daily energy expenditure and macronutrient balance requirements. To address this disconnect, this paper introduces a Physics-Informed Neuro-Symbolic Recommender System that integrates nutritional science directly into the recommendation pipeline via a dual-layer architecture. The framework begins by constructing a semantic knowledge graph using sentence-level encoders to strictly align commercial products with authoritative nutritional data. During the training phase, an implicit physics regularizer applies a differentiable thermodynamic loss function, ensuring that learned latent embeddings reflect nutritional plausibility rather than simple popularity. Subsequently, during the inference phase, an explicit physics optimizer employs simulated annealing and elastic quantity optimization to generate discrete grocery bundles that strictly adhere to the user’s protein and caloric targets.
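
The inference-phase step described above (simulated annealing over discrete bundles) can be sketched generically as follows. This is a minimal illustration under stated assumptions, not the paper's optimizer: the `CATALOG` values, the relative-deviation `cost`, the linear cooling schedule, and the single-unit moves are all invented for the example, and the abstract's "elastic quantity optimization" is approximated here by simple integer quantity changes.

```python
import math
import random

# Hypothetical catalog: product -> (kcal per unit, grams of protein per unit).
CATALOG = {
    "oats": (380, 13),
    "chicken": (165, 31),
    "rice": (360, 7),
    "yogurt": (60, 10),
}


def cost(quantities, target_kcal, target_protein):
    """Sum of relative deviations from the calorie and protein targets."""
    kcal = sum(q * CATALOG[name][0] for name, q in quantities.items())
    protein = sum(q * CATALOG[name][1] for name, q in quantities.items())
    return (abs(kcal - target_kcal) / target_kcal
            + abs(protein - target_protein) / target_protein)


def anneal(target_kcal, target_protein, steps=5000, t0=1.0, seed=0):
    """Search integer quantities with simulated annealing; return (bundle, cost)."""
    rng = random.Random(seed)
    names = sorted(CATALOG)
    current = {name: 0 for name in names}
    current_cost = cost(current, target_kcal, target_protein)
    best, best_cost = dict(current), current_cost
    for step in range(steps):
        # Linear cooling schedule; the small floor avoids division by zero.
        temperature = t0 * (1 - step / steps) + 1e-6
        candidate = dict(current)
        name = rng.choice(names)
        # Move: add or remove one unit of a random product, never below zero.
        candidate[name] = max(0, candidate[name] + rng.choice((-1, 1)))
        candidate_cost = cost(candidate, target_kcal, target_protein)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = dict(current), current_cost
    return best, best_cost


bundle, bundle_cost = anneal(target_kcal=2000, target_protein=120)
```

Accepting occasional worse moves while the temperature is high lets the search leave poor local bundles; as the temperature falls, it behaves more like greedy descent toward the targets.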

[IR-7] Propagating Similarity Mitigating Uncertainty: Similarity Propagation-enhanced Uncertainty for Multimodal Recommendation ICASSP2026

Link: https://arxiv.org/abs/2601.19198
Authors: Xinzhuo Wu, Hongbo Wang, Yuan Lin, Kan Xu, Liang Yang, Hongfei Lin
Category: Information Retrieval (cs.IR)
Comments: Accepted by ICASSP 2026

Abstract:Multimodal Recommendation (MMR) systems are crucial for modern platforms but are often hampered by inherent noise and uncertainty in modal features, such as blurry images, diverse visual appearances, or ambiguous text. Existing methods often overlook this modality-specific uncertainty, leading to ineffective feature fusion. Furthermore, they fail to leverage rich similarity patterns among users and items to refine representations and their corresponding uncertainty estimates. To address these challenges, we propose a novel framework, Similarity Propagation-enhanced Uncertainty for Multimodal Recommendation (SPUMR). SPUMR explicitly models and mitigates uncertainty by first constructing the Modality Similarity Graph and the Collaborative Similarity Graph to refine representations from both content and behavioral perspectives. The Uncertainty-aware Preference Aggregation module then adaptively fuses the refined multimodal features, assigning greater weight to more reliable modalities. Extensive experiments on three benchmark datasets demonstrate that SPUMR achieves significant improvements over existing leading methods.
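
The abstract does not specify how the Uncertainty-aware Preference Aggregation module computes its weights. As a stand-in, the sketch below uses plain inverse-variance weighting, a generic technique that also gives more weight to more reliable inputs. The function name `fuse` and the toy feature vectors and variances are assumptions made for illustration only.

```python
def fuse(features, variances):
    """Fuse per-modality vectors with weights proportional to 1 / variance.

    features: list of equal-length vectors, one per modality.
    variances: one positive uncertainty estimate per modality.
    Returns (fused_vector, weights).
    """
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    weights = [p / total for p in precisions]
    dim = len(features[0])
    fused = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return fused, weights


# Toy example: a sharp image feature (low variance) and an ambiguous text feature.
image_feature, text_feature = [1.0, 0.0], [0.0, 1.0]
fused, weights = fuse([image_feature, text_feature], variances=[0.5, 2.0])
```

With these toy numbers the image modality receives weight 0.8 and the text modality 0.2, so the fused vector leans toward the less uncertain input.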

[IR-8] Accelerating Generative Recommendation via Simple Categorical User Sequence Compression WSDM’26

Link: https://arxiv.org/abs/2601.19158
Authors: Qijiong Liu, Lu Fan, Zhongzhou Liu, Xiaoyu Dong, Yuankai Luo, Guoyuan An, Nuo Chen, Wei Guo, Yong Liu, Xiao-Ming Wu
Category: Information Retrieval (cs.IR)
Comments: WSDM’26 Accepted Paper

Abstract:Although generative recommenders demonstrate improved performance with longer sequences, their real-time deployment is hindered by substantial computational costs. To address this challenge, we propose a simple yet effective method for compressing long-term user histories by leveraging inherent item categorical features, thereby preserving user interests while enhancing efficiency. Experiments on two large-scale datasets demonstrate that, compared to the influential HSTU model, our approach achieves up to a 6x reduction in computational cost and up to 39% higher accuracy at comparable cost (i.e., similar sequence length).
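
The abstract states only that long-term histories are compressed using item categorical features; the exact scheme is not given. One plausible reading, sketched below purely as an assumption, keeps the most recent interactions as item tokens and collapses older ones into a single summary token per category. The function `compress_history`, the token layout, and the `keep_recent` parameter are hypothetical.

```python
from collections import Counter


def compress_history(history, keep_recent=3):
    """Shorten a user history given as (item_id, category) pairs, oldest first.

    Older interactions become one ("CAT", category, count) token per category,
    ordered by first appearance; the last keep_recent interactions are kept
    as ("ITEM", item_id, category) tokens.
    """
    split = max(len(history) - keep_recent, 0)
    older, recent = history[:split], history[split:]
    counts = Counter(category for _, category in older)
    summary = [("CAT", category, n) for category, n in counts.items()]
    return summary + [("ITEM", item_id, category) for item_id, category in recent]


history = [(1, "a"), (2, "b"), (3, "a"), (4, "c"), (5, "a"), (6, "b")]
compressed = compress_history(history, keep_recent=2)
```

In this toy case six interactions shrink to five tokens; the saving grows with history length because the number of summary tokens is bounded by the number of categories rather than the number of items.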

Attachment download

Click to download today's full paper list