本篇博文主要内容为 2025-12-22 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。
说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。
友情提示: 如果您需要通过邮箱接收每日论文数据,请在评论处留下你的邮箱。
目录
概览 (2025-12-22)
今日共更新438篇论文,其中:
- 自然语言处理共41篇(Computation and Language (cs.CL))
- 人工智能共113篇(Artificial Intelligence (cs.AI))
- 计算机视觉共116篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共121篇(Machine Learning (cs.LG))
自然语言处理
[NLP-0] When Reasoning Meets Its Laws
【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)推理行为常具反直觉性、导致推理能力欠佳的问题。其核心解决方案是提出“推理定律”(Laws of Reasoning, LoRe),构建一个统一框架以形式化描述LRMs内在的推理模式,其中关键在于引入“计算定律”(compute law)——假设推理计算量应随问题复杂度线性增长,并辅以“准确率定律”(accuracy law)。为验证这些定律,作者设计了LoRe-Bench基准,系统评估模型在单调性和组合性两个可操作属性上的表现;实验发现多数模型具备合理单调性但缺乏组合性,进而提出一种强化计算定律组合性的微调方法,实证表明该策略显著提升多个基准上的推理性能,并揭示不同属性与定律间的协同效应。
链接: https://arxiv.org/abs/2512.17901
作者: Junyu Zhang,Yifan Sun,Tianang Leng,Jingyan Shen,Liu Ziyin,Paul Pu Liang,Huan Zhang
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Massachusetts Institute of Technology (麻省理工学院); University of Pennsylvania (宾夕法尼亚大学); New York University (纽约大学); NTT Research (NTT 研究)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Despite the superior performance of Large Reasoning Models (LRMs), their reasoning behaviors are often counterintuitive, leading to suboptimal reasoning capabilities. To theoretically formalize the desired reasoning behaviors, this paper presents the Laws of Reasoning (LoRe), a unified framework that characterizes intrinsic reasoning patterns in LRMs. We first propose compute law with the hypothesis that the reasoning compute should scale linearly with question complexity. Beyond compute, we extend LoRe with a supplementary accuracy law. Since the question complexity is difficult to quantify in practice, we examine these hypotheses by two properties of the laws, monotonicity and compositionality. We therefore introduce LoRe-Bench, a benchmark that systematically measures these two tractable properties for large reasoning models. Evaluation shows that most reasoning models exhibit reasonable monotonicity but lack compositionality. In response, we develop an effective finetuning approach that enforces compute-law compositionality. Extensive empirical studies demonstrate that better compliance with compute laws yields consistently improved reasoning performance on multiple benchmarks, and uncovers synergistic effects across properties and laws. Project page: this https URL
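下面用一组极简公式示意摘要中"计算定律"及其两个可检验性质的含义(此形式化为笔者根据摘要整理的示意写法,并非论文原文的符号体系,C(q)、x(q)、α、β 等记号均为假设):

```latex
% 记 C(q) 为模型在问题 q 上消耗的推理计算量(例如思维链 token 数),x(q) 为问题复杂度
% 计算定律(线性假设):
C(q) \approx \alpha \, x(q) + \beta, \qquad \alpha > 0
% 单调性:x(q_1) \le x(q_2) \;\Rightarrow\; C(q_1) \le C(q_2)
% 组合性:若问题 q 由子问题 q_1, q_2 组合而成,则 C(q_1 \circ q_2) \approx C(q_1) + C(q_2)
```

由于问题复杂度难以直接量化,LoRe-Bench 实际检验的正是上面两条可操作的性质,而非线性系数本身。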
zh
[NLP-1] ShareChat: A Dataset of Chatbot Conversations in the Wild
【速读】: 该论文旨在解决现有公开数据集将大语言模型(Large Language Models, LLMs)视为通用文本生成器,从而忽略平台界面特性对用户交互影响的问题。其解决方案的关键在于构建ShareChat——一个大规模、跨平台的对话语料库,包含来自ChatGPT、Claude、Gemini、Perplexity和Grok五个主流平台的142,808次对话与超过66万轮交互,完整保留了原始平台的功能特征(如推理轨迹、来源链接和代码片段),并覆盖101种语言,时间跨度从2023年4月至2025年10月。这一设计显著提升了数据的真实性和交互深度,为研究真实世界中用户与LLM聊天机器人的互动提供了关键资源。
链接: https://arxiv.org/abs/2512.17843
作者: Yueru Yan,Tuc Nguyen,Bo Su,Melissa Lieffers,Thai Le
机构: Indiana University (印第安纳大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:While Large Language Models (LLMs) have evolved into distinct platforms with unique interface designs and capabilities, existing public datasets treat models as generic text generators, stripping away the interface context that actively shapes user interaction. To address this limitation, we present ShareChat, a large-scale, cross-platform corpus comprising 142,808 conversations and over 660,000 turns collected from publicly shared URLs across five major platforms: ChatGPT, Claude, Gemini, Perplexity, and Grok. ShareChat distinguishes itself by preserving native platform affordances often lost in standard logs, including reasoning traces, source links, and code artifacts, while spanning 101 languages over the period from April 2023 to October 2025. Furthermore, ShareChat offers substantially longer context windows and greater interaction depth than prior datasets. We demonstrate the dataset’s multifaceted utility through three representative analyses: (1) analyzing conversation completeness to measure user intent satisfaction; (2) evaluating source citation behaviors in content generation; and (3) conducting temporal analysis to track evolving usage patterns. This work provides the community with a vital and timely resource for understanding authentic user-LLM chatbot interactions in the wild.
zh
[NLP-2] DEER: A Comprehensive and Reliable Benchmark for Deep-Research Expert Reports
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)生成专家级深度研究报告时缺乏系统性评估标准的问题,尤其针对现有基准测试在专家报告评价维度上的不足、依赖LLM裁判可能导致的专家判断缺失,以及源验证仅覆盖显式引用陈述而忽视全文事实可靠性的缺陷。解决方案的关键在于提出DEER基准,其包含50个跨13个领域的报告写作任务、一套基于专家定义的7维25子维度评价体系(共130项细粒度评分条目),并提供任务特定的专家指导以提升LLM裁判的一致性;同时引入文档级事实核查架构,自动提取并验证报告中所有主张(包括未标注来源的陈述),量化外部证据质量,从而实现与人类专家判断高度一致且可解释的系统性能诊断。
链接: https://arxiv.org/abs/2512.17776
作者: Janghoon Han,Heegyu Kim,Changho Lee,Dahm Lee,Min Hyung Park,Hosung Song,Stanley Jungkyu Choi,Moontae Lee,Honglak Lee
机构: LG AI Research; University of Illinois Chicago; University of Michigan, Ann Arbor
类目: Computation and Language (cs.CL)
备注: Work in progress
Abstract:As large language models (LLMs) advance, deep research systems can generate expert-level reports via multi-step reasoning and evidence-based synthesis, but evaluating such reports remains challenging. Existing benchmarks often lack systematic criteria for expert reporting, evaluations that rely heavily on LLM judges can fail to capture issues that require expert judgment, and source verification typically covers only a limited subset of explicitly cited statements rather than report-wide factual reliability. We introduce DEER, a benchmark for evaluating expert-level deep research reports. DEER comprises 50 report-writing tasks spanning 13 domains and an expert-grounded evaluation taxonomy (7 dimensions, 25 sub-dimension) operationalized into 130 fine-grained rubric items. DEER further provides task-specific expert guidance to help LLM judges assess expert-level report quality more consistently. Complementing rubric-based assessment, we propose a document-level fact-checking architecture that extracts and verifies all claims across the entire report, including both cited and uncited ones, and quantifies external-evidence quality. DEER correlates closely with human expert judgments and yields interpretable diagnostics of system strengths and weaknesses.
zh
[NLP-3] Bangla MedER: Multi-BERT Ensemble Approach for the Recognition of Bangla Medical Entity
【速读】: 该论文旨在解决低资源语言(如孟加拉语)在医学实体识别(Medical Entity Recognition, MedER)任务中因缺乏高质量标注数据而导致的研究不足问题。其关键解决方案是提出一种新颖的多BERT集成模型(Multi-BERT Ensemble),该方法通过融合多个预训练BERT变体(包括Bert、DistilBERT、ELECTRA和RoBERTa)的预测结果,在自建的高质量孟加拉语医学语料库上实现了89.58%的最高准确率,相比单层BERT模型提升了11.80%,显著优于现有基线模型,验证了该集成策略在低资源医学自然语言处理任务中的有效性与鲁棒性。
链接: https://arxiv.org/abs/2512.17769
作者: Tanjim Taharat Aurpa,Farzana Akter,Md. Mehedi Hasan,Shakil Ahmed,Shifat Ara Rafiq,Fatema Khan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Medical Entity Recognition (MedER) is an essential NLP task for extracting meaningful entities from the medical corpus. Nowadays, MedER-based research outcomes can remarkably contribute to the development of automated systems in the medical sector, ultimately enhancing patient care and outcomes. While extensive research has been conducted on MedER in English, low-resource languages like Bangla remain underexplored. Our work aims to bridge this gap. For Bangla medical entity recognition, this study first examined a number of transformer models, including BERT, DistilBERT, ELECTRA, and RoBERTa. We also propose a novel Multi-BERT Ensemble approach that outperformed all baseline models with the highest accuracy of 89.58%. Notably, it provides an 11.80% accuracy improvement over the single-layer BERT model, demonstrating its effectiveness for this task. A major challenge in MedER for low-resource languages is the lack of annotated datasets. To address this issue, we developed a high-quality dataset tailored for the Bangla MedER task. The dataset was used to evaluate the effectiveness of our model through multiple performance metrics, demonstrating its robustness and applicability. Our findings highlight the potential of Multi-BERT Ensemble models in improving MedER for Bangla and set the foundation for further advancements in low-resource medical NLP.
zh
[NLP-4] AncientBench: Towards Comprehensive Evaluation on Excavated and Transmitted Chinese Corpora
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在古文字理解能力评估方面缺乏专门基准的问题,尤其是针对出土文献中古汉字理解能力评估的空白。现有中文基准多聚焦于现代汉语或传世文献,未能覆盖考古发掘所得的古文字材料。解决方案的关键在于提出 AncientBench——一个系统性评估古文字理解能力的多维基准,涵盖字形(glyph)、读音(pronunciation)、词义(meaning)和语境(context)四个核心维度,并设计十项具体任务(如偏旁、声旁、同音字、填空、翻译等),构建了面向出土文献场景的完整评估框架。通过联合考古学者进行实验验证,并以新提出的古文基线模型和当前最优LLMs进行对比测试,揭示了LLMs在古文字理解中的潜力与人类水平之间的差距,为推动生成式AI在考古学与古汉语研究中的应用奠定基础。
链接: https://arxiv.org/abs/2512.17756
作者: Zhihan Zhou,Daqian Shi,Rui Song,Lida Shi,Xiaolei Diao,Hao Xu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Comprehension of ancient texts plays an important role in archaeology and understanding of Chinese history and civilization. The rapid development of large language models needs benchmarks that can evaluate their comprehension of ancient characters. Existing Chinese benchmarks are mostly targeted at modern Chinese and transmitted documents in ancient Chinese, but the part of excavated documents in ancient Chinese is not covered. To meet this need, we propose the AncientBench, which aims to evaluate the comprehension of ancient characters, especially in the scenario of excavated documents. The AncientBench is divided into four dimensions, which correspond to the four competencies of ancient character comprehension: glyph comprehension, pronunciation comprehension, meaning comprehension, and contextual comprehension. The benchmark also contains ten tasks, including radical, phonetic radical, homophone, cloze, translation, and more, providing a comprehensive framework for evaluation. We convened archaeological researchers to conduct experimental evaluations, proposed an ancient model as baseline, and conducted extensive experiments on the currently best-performing large language models. The experimental results reveal the great potential of large language models in ancient textual scenarios as well as the gap with humans. Our research aims to promote the development and application of large language models in the field of archaeology and ancient Chinese language.
zh
[NLP-5] Affect, Body, Cognition, Demographics, and Emotion: The ABCDE of Text Features for Computational Affective Science
【速读】: 该论文旨在解决计算情感科学与计算社会科学领域中,研究者在获取、访问和使用语言数据标注资源时面临的显著障碍,尤其是非计算机科学背景的实践者所遇到的困难。其解决方案的关键在于构建并公开发布ABCDE数据集(Affect, Body, Cognition, Demographics, and Emotion),该数据集包含超过4亿条来自社交媒体、博客、书籍及AI生成源的文本语料,并标注了广泛适用于情感科学、认知科学、数字人文、社会学、政治学和计算语言学等跨学科研究的特征,从而显著降低多领域研究者进行高质量实证分析的技术门槛。
链接: https://arxiv.org/abs/2512.17752
作者: Jan Philip Wahle,Krishnapriya Vishnubhotla,Bela Gipp,Saif M. Mohammad
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Work in Computational Affective Science and Computational Social Science explores a wide variety of research questions about people, emotions, behavior, and health. Such work often relies on language data that is first labeled with relevant information, such as the use of emotion words or the age of the speaker. Although many resources and algorithms exist to enable this type of labeling, discovering, accessing, and using them remains a substantial impediment, particularly for practitioners outside of computer science. Here, we present the ABCDE dataset (Affect, Body, Cognition, Demographics, and Emotion), a large-scale collection of over 400 million text utterances drawn from social media, blogs, books, and AI-generated sources. The dataset is annotated with a wide range of features relevant to computational affective and social science. ABCDE facilitates interdisciplinary research across numerous fields, including affective science, cognitive science, the digital humanities, sociology, political science, and computational linguistics.
zh
[NLP-6] When the Gold Standard isn't Necessarily Standard: Challenges of Evaluating the Translation of User-Generated Content
【速读】: 该论文旨在解决用户生成内容(User-generated Content, UGC)翻译评价中的标准不一致问题,即不同数据集对非标准语言现象(如拼写错误、俚语、字符重复和表情符号等)的处理方式差异导致参考译文在标准化程度上存在显著谱系,从而影响模型评估的公平性与可比性。其解决方案的关键在于:首先通过系统分析四个UGC数据集的人工翻译指南,构建包含十二类非标准现象与五种翻译行为(NORMALISE、COPY、TRANSFER、OMIT、CENSOR)的分类体系;其次发现大型语言模型(LLMs)的翻译评分高度依赖于是否明确遵循特定数据集的翻译指令,并且当提示(prompt)与指南对齐时性能显著提升;最终提出,为实现公平评估,必须在数据构建阶段制定清晰的翻译指南,并开发具备指导方针感知能力的可控评估框架。
链接: https://arxiv.org/abs/2512.17738
作者: Lydia Nishimwe,Benoît Sagot,Rachel Bawden
机构: Inria (法国国家信息与自动化研究院)
类目: Computation and Language (cs.CL)
备注: 10 pages, 19 pages with references and appendices
Abstract:User-generated content (UGC) is characterised by frequent use of non-standard language, from spelling errors to expressive choices such as slang, character repetitions, and emojis. This makes evaluating UGC translation particularly challenging: what counts as a “good” translation depends on the level of standardness desired in the output. To explore this, we examine the human translation guidelines of four UGC datasets, and derive a taxonomy of twelve non-standard phenomena and five translation actions (NORMALISE, COPY, TRANSFER, OMIT, CENSOR). Our analysis reveals notable differences in how UGC is treated, resulting in a spectrum of standardness in reference translations. Through a case study on large language models (LLMs), we show that translation scores are highly sensitive to prompts with explicit translation instructions for UGC, and that they improve when these align with the dataset’s guidelines. We argue that when preserving UGC style is important, fair evaluation requires both models and metrics to be aware of translation guidelines. Finally, we call for clear guidelines during dataset creation and for the development of controllable, guideline-aware evaluation frameworks for UGC translation.
zh
[NLP-7] Toward Ethical AI Through Bayesian Uncertainty in Neural Question Answering
【速读】: 该论文旨在解决神经网络在问答任务中缺乏不确定性量化的问题,从而提升模型决策的可解释性与可靠性。其核心解决方案在于引入贝叶斯推理(Bayesian reasoning),通过后验推断(posterior inference)来表征预测置信度,并对比拉普拉斯近似(Laplace approximation)与最大后验估计(MAP)在不确定性校准(uncertainty calibration)和选择性预测(selective prediction)上的表现。该方法使模型能够在低置信度时主动选择“我不知道”(I don’t know)的响应,从而实现更负责任和伦理化的生成式问答系统部署。
链接: https://arxiv.org/abs/2512.17677
作者: Riccardo Di Sipio
机构: Dayforce(人力资源云服务公司)
类目: Computation and Language (cs.CL)
备注: 14 pages, 8 figures,
Abstract:We explore Bayesian reasoning as a means to quantify uncertainty in neural networks for question answering. Starting with a multilayer perceptron on the Iris dataset, we show how posterior inference conveys confidence in predictions. We then extend this to language models, applying Bayesian inference first to a frozen head and finally to LoRA-adapted transformers, evaluated on the CommonsenseQA benchmark. Rather than aiming for state-of-the-art accuracy, we compare Laplace approximations against maximum a posteriori (MAP) estimates to highlight uncertainty calibration and selective prediction. This allows models to abstain when confidence is low. An ``I don’t know’’ response not only improves interpretability but also illustrates how Bayesian methods can contribute to more responsible and ethical deployment of neural question-answering systems.
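下面给出"低置信度时选择弃答"这一机制的极简 Python 示意(根据摘要思路自行编写,并非论文官方实现;采样数、阈值 0.7 等均为假设的示例参数):

```python
import numpy as np

def selective_predict(prob_samples, threshold=0.7):
    """基于后验采样的选择性预测示意。

    prob_samples: shape (S, C),S 组后验权重采样下各候选答案的概率
    (MAP 点估计相当于 S=1)。取后验平均概率的最大值作为置信度,
    低于阈值时返回 "I don't know"。
    """
    mean_prob = prob_samples.mean(axis=0)            # 后验预测分布
    pred, conf = int(mean_prob.argmax()), float(mean_prob.max())
    return (pred, conf) if conf >= threshold else ("I don't know", conf)

# 玩具用法:3 个候选答案、5 组 Laplace 近似下的权重采样
samples = np.array([[0.55, 0.30, 0.15],
                    [0.48, 0.37, 0.15],
                    [0.60, 0.25, 0.15],
                    [0.52, 0.33, 0.15],
                    [0.50, 0.35, 0.15]])
print(selective_predict(samples))                    # 置信度约 0.53 < 0.7,选择弃答
```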
zh
[NLP-8] Peeking Into The Future For Contextual Biasing
【速读】: 该论文旨在解决端到端(End-to-End, E2E)自动语音识别(ASR)模型在识别罕见或未见命名实体(如联系人姓名、地点等)时表现不佳的问题,而这些实体对于虚拟助手等下游应用至关重要。解决方案的关键在于提出一种基于注意力机制的上下文偏置方法,通过引入候选命名实体列表,在预测当前token的同时并行预测多个未来token,使模型能够“窥视未来”并对候选实体进行评分;该方法直接利用多token预测的logits,无需额外的实体编码器或交叉注意力层,显著降低了模型架构复杂度,实验表明在Librispeech数据集上可实现命名实体词错误率(Word Error Rate, WER)相对提升达50.34%。
链接: https://arxiv.org/abs/2512.17657
作者: Ramaneswaran Selvakumar,Cindy Tseng,Eesung Kim,Vijendra Raj Apsingekar,Yun Tang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:While end-to-end (E2E) automatic speech recognition (ASR) models excel at general transcription, they struggle to recognize rare or unseen named entities (e.g., contact names, locations), which are critical for downstream applications like virtual assistants. In this paper, we propose a contextual biasing method for attention based encoder decoder (AED) models using a list of candidate named entities. Instead of predicting only the next token, we simultaneously predict multiple future tokens, enabling the model to “peek into the future” and score potential candidate entities in the entity list. Moreover, our approach leverages the multi-token prediction logits directly without requiring additional entity encoders or cross-attention layers, significantly reducing architectural complexity. Experiments on Librispeech demonstrate that our approach achieves up to 50.34% relative improvement in named entity word error rate compared to the baseline AED model.
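下面用一个极简 Python 片段示意"利用当前步并行预测的未来 token logits 为候选实体打分"的思路(根据摘要自行编写,并非论文官方实现;`score_entities` 的接口与平均对数概率的打分方式均为假设):

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def score_entities(future_logits, entity_token_ids):
    """future_logits: shape (K, V),当前解码步并行预测的 K 个未来位置的 logits;
    entity_token_ids: 实体名 -> token id 序列。返回各实体 token 的平均对数概率,
    可据此对候选实体列表做上下文偏置(rescoring),且无需额外实体编码器。"""
    lp = log_softmax(future_logits)
    scores = {}
    for name, ids in entity_token_ids.items():
        ids = ids[: lp.shape[0]]                      # 实体长于 K 时截断
        scores[name] = float(np.mean([lp[k, t] for k, t in enumerate(ids)]))
    return scores

# 玩具用法:词表大小 10,向前看 3 个 token,两个候选实体
rng = np.random.default_rng(0)
print(score_entities(rng.normal(size=(3, 10)), {"Alice": [2, 5], "Berlin": [7, 1, 3]}))
```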
zh
[NLP-9] Simulstream: Open-Source Toolkit for Evaluation and Demonstration of Streaming Speech-to-Text Translation Systems
【速读】: 该论文旨在解决当前Streaming Speech-to-Text Translation (StreamST)研究中缺乏统一评估与演示框架的问题。现有工具SimulEval因不再维护、不支持输出修正机制、仅适用于短片段处理且无便捷演示功能,难以满足长音频流场景下对高质量低延迟翻译系统的研究需求。解决方案的关键在于提出simulstream——首个专为StreamST设计的开源框架,其核心创新包括:支持增量解码与重翻译(re-translation)方法的统一比较,兼顾翻译质量与延迟指标;并提供交互式Web界面以直观展示系统性能,从而推动StreamST技术在真实长音频场景下的评估与开发。
链接: https://arxiv.org/abs/2512.17648
作者: Marco Gaido,Sara Papi,Mauro Cettolo,Matteo Negri,Luisa Bentivogli
机构: Fondazione Bruno Kessler (布鲁诺·凯斯勒基金会)
类目: Computation and Language (cs.CL)
备注:
Abstract:Streaming Speech-to-Text Translation (StreamST) requires producing translations concurrently with incoming speech, imposing strict latency constraints and demanding models that balance partial-information decision-making with high translation quality. Research efforts on the topic have so far relied on the SimulEval repository, which is no longer maintained and does not support systems that revise their outputs. In addition, it has been designed for simulating the processing of short segments, rather than long-form audio streams, and it does not provide an easy method to showcase systems in a demo. As a solution, we introduce simulstream, the first open-source framework dedicated to unified evaluation and demonstration of StreamST systems. Designed for long-form speech processing, it supports not only incremental decoding approaches, but also re-translation methods, enabling for their comparison within the same framework both in terms of quality and latency. In addition, it also offers an interactive web interface to demo any system built within the tool.
zh
[NLP-10] Linear Personality Probing and Steering in LLMs: A Big Five Study
【速读】: 该论文试图解决如何高效、可靠地探测与调控大语言模型(Large Language Models, LLMs)人格特征的问题。当前方法要么成本高昂(如后训练调整),要么脆弱易变(如提示工程),难以在实际应用中稳定使用。解决方案的关键在于利用与五大性格特质(Big Five personality traits)对齐的线性方向(linear directions)来探测和操控模型行为:研究者基于Llama 3.3 70B模型生成406个虚构角色及其性格评分,通过Alpaca问卷获取结构化响应数据,进而在线性回归基础上学习每层激活空间中的个性方向;实验表明,这些方向能有效用于性格识别(probing),但在控制生成行为(steering)时效果受限——仅在强制选择任务中表现稳定,在开放式生成或存在额外上下文时则影响有限。
链接: https://arxiv.org/abs/2512.17639
作者: Michel Frising,Daniel Balcells
机构: Plastic Labs; Independent Researcher
类目: Computation and Language (cs.CL)
备注: 29 pages, 6 figures
Abstract:Large language models (LLMs) exhibit distinct and consistent personalities that greatly impact trust and engagement. While this means that personality frameworks would be highly valuable tools to characterize and control LLMs’ behavior, current approaches remain either costly (post-training) or brittle (prompt engineering). Probing and steering via linear directions has recently emerged as a cheap and efficient alternative. In this paper, we investigate whether linear directions aligned with the Big Five personality traits can be used for probing and steering model behavior. Using Llama 3.3 70B, we generate descriptions of 406 fictional characters and their Big Five trait scores. We then prompt the model with these descriptions and questions from the Alpaca questionnaire, allowing us to sample hidden activations that vary along personality traits in known, quantifiable ways. Using linear regression, we learn a set of per-layer directions in activation space, and test their effectiveness for probing and steering model behavior. Our results suggest that linear directions aligned with trait-scores are effective probes for personality detection, while their steering capabilities strongly depend on context, producing reliable effects in forced-choice tasks but limited influence in open-ended generation or when additional context is present in the prompt.
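下面给出"用线性回归在激活空间学习人格方向,并分别用于探测与操控"的极简示意(数据为随机模拟,仅演示流程;层的选取、操控强度 alpha 等均为假设,并非论文官方代码):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
d_model, n_samples = 64, 406
true_dir = rng.normal(size=d_model)
acts = rng.normal(size=(n_samples, d_model))                        # 某一层的隐藏激活
trait = acts @ true_dir + 0.1 * rng.normal(size=n_samples)          # 如"外向性"得分

reg = LinearRegression().fit(acts, trait)                           # 探测:回归出该层的人格方向
direction = reg.coef_ / np.linalg.norm(reg.coef_)

def probe(h):
    """激活 h 在人格方向上的投影,可作为该特质的探测分数。"""
    return float(h @ direction)

def steer(h, alpha=2.0):
    """操控:推理时沿人格方向平移激活,alpha 控制强度。"""
    return h + alpha * direction

h = acts[0]
print(probe(h), probe(steer(h)))                                    # 操控后投影值增大约 alpha
```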
zh
[NLP-11] Confidence-Credibility Aware Weighted Ensembles of Small LLMs Outperform Large LLMs in Emotion Detection
【速读】: 该论文旨在解决文本情感检测任务中模型性能受限于单一架构偏差及参数效率低下的问题。传统集成方法通常依赖同质化模型结构,难以有效利用不同模型间的错误多样性;而大语言模型(LLM)虽具备强大表征能力,但在特定任务上往往存在冗余参数和过拟合风险。解决方案的关键在于提出一种基于置信度加权的可信度感知集成框架,通过融合架构异构的小型Transformer类大语言模型(sLLMs),如BERT、RoBERTa、DistilBERT、DeBERTa与ELECTRA,并对每个模型进行全量微调以保留其独特偏差;同时设计双权重投票机制,动态结合全局可信度(验证集F1分数)与局部置信度(实例级概率),从而优化各模型在决策中的贡献权重。实验表明,该方法在DAIR-AI数据集上达到93.5%的宏F1分数,显著优于多个7B参数级别大模型,且总参数量仅为595M,证明了小模型集成在专业化NLP任务中的高效性与鲁棒性。
链接: https://arxiv.org/abs/2512.17630
作者: Menna Elgabry,Ali Hamdi
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at IRICT 2025
Abstract:This paper introduces a confidence-weighted, credibility-aware ensemble framework for text-based emotion detection, inspired by Condorcet’s Jury Theorem (CJT). Unlike conventional ensembles that often rely on homogeneous architectures, our approach combines architecturally diverse small transformer-based large language models (sLLMs) - BERT, RoBERTa, DistilBERT, DeBERTa, and ELECTRA, each fully fine-tuned for emotion classification. To preserve error diversity, we minimize parameter convergence while taking advantage of the unique biases of each model. A dual-weighted voting mechanism integrates both global credibility (validation F1 score) and local confidence (instance-level probability) to dynamically weight model contributions. Experiments on the DAIR-AI dataset demonstrate that our credibility-confidence ensemble achieves a macro F1 score of 93.5 percent, surpassing state-of-the-art benchmarks and significantly outperforming large-scale LLMs, including Falcon, Mistral, Qwen, and Phi, even after task-specific Low-Rank Adaptation (LoRA). With only 595M parameters in total, our small LLMs ensemble proves more parameter-efficient and robust than models up to 7B parameters, establishing that carefully designed ensembles of small, fine-tuned models can outperform much larger LLMs in specialized natural language processing (NLP) tasks such as emotion detection.
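下面用 Python 示意"全局可信度(验证集宏 F1)× 实例级置信度(预测概率)"的双权重投票机制(权重的具体组合方式为笔者按摘要描述给出的假设写法,并非论文官方实现):

```python
import numpy as np

def dual_weighted_vote(probs, credibility):
    """probs: shape (M, C),M 个小模型对 C 个情绪类别的实例级概率(局部置信度);
    credibility: shape (M,),各模型在验证集上的宏 F1(全局可信度)。
    每个模型的投票贡献 = 可信度 x 实例置信度,最后按类别求和取最大。"""
    weighted = credibility[:, None] * probs
    return int(weighted.sum(axis=0).argmax())

# 玩具用法:3 个模型(如 BERT / RoBERTa / ELECTRA)、4 个情绪类别
probs = np.array([[0.10, 0.70, 0.15, 0.05],
                  [0.20, 0.40, 0.30, 0.10],
                  [0.05, 0.35, 0.50, 0.10]])
cred = np.array([0.93, 0.91, 0.88])
print(dual_weighted_vote(probs, cred))        # -> 1,高可信度模型的判断占更大权重
```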
zh
[NLP-12] SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories
【速读】: 该论文旨在解决现有软件工程任务评估基准(如SWE-bench)在数据生成方式上的局限性,包括依赖人工标注、数据集静态且主要聚焦于Python语言的Bug修复任务等问题。其核心解决方案是提出SWE-Bench++,一个自动化框架,通过从开源GitHub项目中提取实时Pull Request(PRs),构建覆盖多语言(11种)、包含Bug修复与功能请求的可复现、基于执行的编码任务。关键创新在于四阶段流程:程序化数据源获取、环境合成、测试断言提取和质量保障,并引入提示引导轨迹合成机制,将强模型难以解决的任务转化为训练轨迹,从而提升模型能力。该方案显著提升了评估的规模性、多样性和实用性,为多语言仓库级代码生成提供了新的基准。
链接: https://arxiv.org/abs/2512.17419
作者: Lilin Wang,Lucas Ramalho,Alan Celestino,Phuc Anthony Pham,Yu Liu,Umang Kumar Sinha,Andres Portillo,Onassis Osunwa,Gabriel Maduekwe
机构: Turing(图灵)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Benchmarks like SWE-bench have standardized the evaluation of Large Language Models (LLMs) on repository-level software engineering tasks. However, these efforts remain limited by manual curation, static datasets, and a focus on Python-based bug fixes. We introduce SWE-Bench++, an automated framework that generates repository-level coding tasks from open-source GitHub projects. Unlike synthetic approaches, our pipeline harvests live pull requests to cover both bug fixes and feature requests across 11 languages. SWE-Bench++ turns GitHub pull requests (PRs) into reproducible, execution-based tasks via four stages: programmatic sourcing, environment synthesis, test oracle extraction, and quality assurance. A final hint-guided trajectory synthesis step converts instances that strong models fail on into training trajectories. Our initial benchmark consists of 11,133 instances from 3,971 repositories across 11 languages. On a subset of 1,782 instances of this benchmark, today’s strongest models perform as follows: claude-sonnet-4.5 achieves 36.20% pass@10, gpt-5-2025-08-07 34.57%, gemini/gemini-2.5-pro 24.92%, and gpt-4o 16.89%. We further demonstrate the utility of our dataset by showing that fine-tuning on SWE-Bench++ instances yields measurable improvements on the SWE-bench Multilingual benchmark. SWE-Bench++ provides a scalable, multilingual benchmark for evaluating and improving repository-level code generation.
zh
[NLP-13] RadImageNet-VQA: A Large-Scale CT and MRI Dataset for Radiologic Visual Question Answering
【速读】: 该论文旨在解决当前医学视觉问答(Medical Visual Question Answering, VQA)领域中存在的数据规模不足、图像模态单一(主要依赖X光片或生物医学插图)、以及模型易受文本线索干扰等问题。其解决方案的关键在于构建了一个大规模、多模态、专家标注的基准数据集RadImageNet-VQA,包含75万张CT和MRI图像及750万条问答样本,覆盖异常检测、解剖结构识别和病理分类三大任务,涵盖8个解剖区域和97类病种,并支持开放式、封闭式与多项选择题型。实验表明,即使采用最先进的视觉-语言模型,在细粒度病理识别任务中仍表现不佳,且在无图像输入时性能接近随机水平,验证了该数据集有效规避了文本捷径(text-based shortcuts),从而推动了真正基于医学影像理解的VQA研究发展。
链接: https://arxiv.org/abs/2512.17396
作者: Léo Butsanets,Charles Corbière,Julien Khlaut,Pierre Manceron,Corentin Dancette
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint, 23 pages, 12 figures, 7 tables
Abstract:In this work, we introduce RadImageNet-VQA, a large-scale dataset designed to advance radiologic visual question answering (VQA) on CT and MRI exams. Existing medical VQA datasets are limited in scale, dominated by X-ray imaging or biomedical illustrations, and often prone to text-based shortcuts. RadImageNet-VQA is built from expert-curated annotations and provides 750K images paired with 7.5M question-answer samples. It covers three key tasks - abnormality detection, anatomy recognition, and pathology identification - spanning eight anatomical regions and 97 pathology categories, and supports open-ended, closed-ended, and multiple-choice questions. Extensive experiments show that state-of-the-art vision-language models still struggle with fine-grained pathology identification, particularly in open-ended settings and even after fine-tuning. Text-only analysis further reveals that model performance collapses to near-random without image inputs, confirming that RadImageNet-VQA is free from linguistic shortcuts. The full dataset and benchmark are publicly available at this https URL.
zh
[NLP-14] Are Vision Language Models Cross-Cultural Theory of Mind Reasoners?
【速读】: 该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在跨文化情境下理论心智(Theory of Mind, ToM)推理能力评估不足的问题。现有研究多基于西方中心视角,缺乏对多元文化语境中社会认知能力的系统性测评。解决方案的关键在于构建一个名为CulturalToM-VQA的新颖视觉问答(Visual Question Answering, VQA)基准,该基准包含5095个问题,覆盖六类ToM任务和四个复杂度层级,通过VLM辅助的人工智能协同流程生成结构化场景描述,并结合人类专家对文化线索(如仪式、服饰、手势及人际互动)的标注,实现对跨文化ToM推理能力的精细化评估。
链接: https://arxiv.org/abs/2512.17394
作者: Zabir Al Nazi,G M Shahariar,Abrar Hossain,Wei Peng
机构: University of California, Riverside (加州大学河滨分校); University of Dhaka (达卡大学); Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注:
Abstract:Theory of Mind (ToM) – the ability to attribute beliefs, desires, and emotions to others – is fundamental for human social intelligence, yet remains a major challenge for artificial agents. Existing Vision-Language Models (VLMs) are increasingly applied in socially grounded tasks, but their capacity for cross-cultural ToM reasoning is largely unexplored. In this work, we introduce CulturalToM-VQA, a new evaluation benchmark containing 5095 questions designed to probe ToM reasoning across diverse cultural contexts through visual question answering. The dataset captures culturally grounded cues such as rituals, attire, gestures, and interpersonal dynamics, enabling systematic evaluation of ToM reasoning beyond Western-centric benchmarks. Our dataset is built through a VLM-assisted human-in-the-loop pipeline, where human experts first curate culturally rich images across traditions, rituals, and social interactions; a VLM then assist in generating structured ToM-focused scene descriptions, which are refined into question-answer pairs spanning a taxonomy of six ToM tasks and four graded complexity levels. The resulting dataset covers diverse theory of mind facets such as mental state attribution, false belief reasoning, non-literal communication, social norm violations, perspective coordination, and multi-agent reasoning.
zh
[NLP-15] CIFE: Code Instruction-Following Evaluation
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在真实代码生成场景中仅关注功能性正确性,而忽视开发者对鲁棒性、格式和安全性等约束条件的遵守问题。现有基准测试主要依赖测试用例执行来评估正确性,难以衡量模型对非功能性约束的遵循程度。其解决方案的关键在于构建一个包含1,000个Python任务的新型基准,每个任务均配有平均7个由开发者明确指定的约束条件,覆盖13类维度,并通过四阶段人机协同流程确保约束的原子性、相关性和客观性;同时提出C2A Score这一综合指标,联合量化模型的功能正确性与约束合规性,从而更全面地评估可信代码生成能力。
链接: https://arxiv.org/abs/2512.17387
作者: Sravani Gunnu,Shanmukha Guttula,Hima Patel
机构: IIT Bombay, India (印度理工学院孟买分校); IBM Research India (IBM研究印度)
类目: Software Engineering (cs.SE); Computation and Language (cs.CL)
备注: 20 pages, 22 figures, 2 tables
Abstract:Large Language Models (LLMs) are increasingly applied to real-world code generation, where functional correctness alone is insufficient for reliable deployment, developers also expect adherence to explicit requirements for robustness, formatting, and security. Existing benchmarks primarily assess correctness through test-case execution, offering limited insight into how reliably models follow such constraints. We introduce a benchmark of 1,000 Python tasks, each paired with an average of 7 developer-specified constraints spanning 13 categories. Constraints are curated through a four-stage human-LLM pipeline to ensure they are atomic, relevant, and objective. We evaluate 14 open- and closed-source models using complementary adherence metrics and propose the C2A Score, a composite measure that jointly captures correctness and constraint compliance. Results reveal a substantial gap between partial and strict satisfaction, while strong models achieve over 90% partial adherence, strict adherence remains between 39-66%. These findings highlight that trustworthy code generation requires not only correctness but also consistent adherence to developer intent.
zh
[NLP-16] UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在代码生成任务中对大量标注数据或未标注代码片段的依赖问题,这些问题通常成本高昂且难以大规模获取。解决方案的关键在于提出一种无监督框架IPC(Internal Probing for Code generation),其核心机制是通过内部探针(Internal Probing)分析LLM自身的状态信息,包括问题空间探针、测试理解探针、解空间探针以及知识整合与强化机制,从而挖掘模型内部隐含的知识和置信度模式。进一步地,IPC利用自一致性机制和基于表示的质量估计来识别可靠的代码候选,进而训练出UCoder(基于无监督学习的代码生成器)。实验证明,该方法在多个代码基准上可达到与有监督方法相当的性能,显著降低了对标注数据和计算资源的依赖。
链接: https://arxiv.org/abs/2512.17385
作者: Jiajun Wu,Jian Yang,Wei Zhang,Lin Jing,Yuqing Ma,Ensheng Shi,Yuchi Ma,Zhoujun Li,Xianglong Liu
机构: Beihang University (北京航空航天大学); Huawei (华为)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, their effectiveness heavily relies on supervised training with extensive labeled (e.g., question-answering pairs) or unlabeled datasets (e.g., code snippets), which are often expensive and difficult to obtain at scale. To address this limitation, this paper introduces a method IPC, an unsupervised framework that leverages Internal Probing of LLMs for Code generation without any external corpus, even unlabeled code snippets. We introduce the problem space probing, test understanding probing, solution space probing, and knowledge consolidation and reinforcement to probe the internal knowledge and confidence patterns existing in LLMs. Further, IPC identifies reliable code candidates through self-consistency mechanisms and representation-based quality estimation to train UCoder (coder with unsupervised learning). We validate the proposed approach across multiple code benchmarks, demonstrating that unsupervised methods can achieve competitive performance compared to supervised approaches while significantly reducing the dependency on labeled data and computational resources. Analytic experiments reveal that internal model states contain rich signals about code quality and correctness, and that properly harnessing these signals enables effective unsupervised learning for code generation tasks, opening new directions for training code LLMs in resource-constrained scenarios.
zh
[NLP-17] AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens
【速读】: 该论文旨在解决奖励模型(Reward Models)和大语言模型作为裁判(LLM-as-a-Judge)系统在后训练流程(如RLHF、DPO、RLAIF)中存在的一种隐蔽但普遍的脆弱性问题:即通过生成低困惑度(low-perplexity)的控制标记序列,可诱导模型将原本正确的“否”判断错误地转变为“是”,从而引发高假阳性率的误判。解决方案的关键在于提出AdvJudge-Zero方法,该方法利用模型的下一个词分布与束搜索(beam search)探索机制,从零开始发现多样化的控制令牌序列;其核心洞察是这些控制令牌引发的隐藏状态扰动集中于一个与裁判模型拒绝方向反向对齐的低秩“软模式”(soft mode),并通过少量控制令牌增强样本进行LoRA微调,能显著降低假阳性率并保持评估质量。
链接: https://arxiv.org/abs/2512.17375
作者: Tung-Ling Li,Yuhao Wu,Hongliang Liu
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
Abstract:Reward models and LLM-as-a-Judge systems are central to modern post-training pipelines such as RLHF, DPO, and RLAIF, where they provide scalar feedback and binary decisions that guide model selection and RL-based fine-tuning. We show that these judge systems exhibit a recurring vulnerability: short sequences of low-perplexity control tokens can flip many binary evaluations from correct "No" judgments to incorrect "Yes" judgments by steering the last-layer logit gap. These control tokens are patterns that a policy model could plausibly generate during post-training, and thus represent realistic reward-hacking risks rather than worst-case adversarial strings. Our method, AdvJudge-Zero, uses the model's next-token distribution and beam-search exploration to discover diverse control-token sequences from scratch, and our analysis shows that the induced hidden-state perturbations concentrate in a low-rank "soft mode" that is anti-aligned with the judge's refusal direction. Empirically, these tokens cause very high false positive rates when large open-weight and specialized judge models score incorrect answers on math and reasoning benchmarks. Finally, we show that LoRA-based adversarial training on small sets of control-token-augmented examples can markedly reduce these false positives while preserving evaluation quality.
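下面是一个高度简化的贪心搜索示意,演示"寻找一小段附加控制 token,使裁判模型的 Yes/No logit 差被推成正值"的基本流程(`judge_logit_gap`、候选词表与玩具打分函数均为假设;真实方法基于模型的下一个词分布与 beam search,规模远大于此):

```python
def find_control_tokens(judge_logit_gap, prompt, vocab, max_len=5):
    """judge_logit_gap(text) 返回裁判模型末层 "Yes" 与 "No" 的 logit 差;
    目标是找到低困惑度后缀,把原本为负的差值推成正值(判断被翻转)。"""
    suffix = []
    for _ in range(max_len):
        best_tok, best_gap = None, judge_logit_gap(prompt + "".join(suffix))
        for tok in vocab:                     # 真实方法会按下一个词分布筛选候选并做 beam search
            gap = judge_logit_gap(prompt + "".join(suffix) + tok)
            if gap > best_gap:
                best_tok, best_gap = tok, gap
        if best_tok is None:
            break
        suffix.append(best_tok)
        if best_gap > 0:                      # "No" 已被翻转为 "Yes"
            break
    return suffix

# 玩具用法:用感叹号数量模拟 logit 差,仅演示搜索流程本身
toy_gap = lambda text: -3.0 + 0.8 * text.count("!")
print(find_control_tokens(toy_gap, "Answer: 42", vocab=["!", "?", "."]))   # ['!', '!', '!', '!']
```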
zh
[NLP-18] Physics of Language Models: Part 4.1 Architecture Design and the Magic of Canon Layers NEURIPS2025
【速读】: 该论文旨在解决大规模语言模型(Language Models, LM)架构差异难以量化评估的问题,尤其是在学术级预训练场景(如13亿参数、1000亿token)下,由于噪声和随机性干扰,核心能力的提升常被掩盖。为克服这一挑战,作者提出了一种受控的合成预训练任务框架,用于隔离并精准评估模型的核心能力。其解决方案的关键在于引入“Canon Layers”——一种轻量级的结构化组件,灵感源自音乐术语“卡农”(canon),通过计算邻近token表示的加权和来促进横向信息流动,并可无缝集成到Transformer、线性注意力(Linear Attention)、状态空间模型(State-Space Models)等各类序列架构中。实验表明,Canon Layers显著增强推理深度(提升达2倍)、广度及知识操控能力,甚至能使弱架构(如NoPE)媲美RoPE,使线性注意力模型达到Mamba2/GDN等当前最优线性模型水平,验证了该方法在合成任务与真实学术规模预训练中的有效性。
链接: https://arxiv.org/abs/2512.17351
作者: Zeyuan Allen-Zhu
机构: 未知
类目: Computation and Language (cs.CL)
备注: V1.1 appeared in NeurIPS 2025 main conference; V2 adds GDN experiments, tightens some experiments (for a stronger, fairer comparison), and re-organizes sections
Abstract:Understanding architectural differences in language models is challenging, especially at academic-scale pretraining (e.g., 1.3B parameters, 100B tokens), where results are often dominated by noise and randomness. To overcome this, we introduce controlled synthetic pretraining tasks that isolate and evaluate core model capabilities. Within this framework, we discover CANON LAYERS: lightweight architectural components – named after the musical term "canon" – that promote horizontal information flow across neighboring tokens. Canon layers compute weighted sums of nearby token representations and integrate seamlessly into Transformers, linear attention, state-space models, or any sequence architecture. We present 12 key results. This includes how Canon layers enhance reasoning depth (e.g., by 2×), reasoning breadth, knowledge manipulation, etc. They lift weak architectures like NoPE to match RoPE, and linear attention to rival SOTA linear models like Mamba2/GDN – validated both through synthetic tasks and real-world academic-scale pretraining. This synthetic playground offers an economical, principled path to isolate core model capabilities often obscured at academic scales. Equipped with infinite high-quality data, it may even PREDICT how future architectures will behave as training pipelines improve – e.g., through better data curation or RL-based post-training – unlocking deeper reasoning and hierarchical inference.
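按摘要中"对邻近 token 表示做加权和以促进横向信息流"的描述,可写出如下极简 PyTorch 示意(非官方实现;核宽 kernel=4、残差式注入等具体选择均为假设):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CanonLayer(nn.Module):
    """Canon 层示意:对前若干个 token 的表示做因果加权和,再残差式加回原表示。"""
    def __init__(self, d_model, kernel=4):
        super().__init__()
        self.kernel = kernel
        # groups=d_model:每个维度独立学习对最近 kernel 个 token 的加权系数
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=kernel,
                              groups=d_model, bias=False)

    def forward(self, x):                      # x: (batch, seq, d_model)
        h = x.transpose(1, 2)                  # -> (batch, d_model, seq)
        h = F.pad(h, (self.kernel - 1, 0))     # 左侧因果填充,不看未来 token
        return x + self.conv(h).transpose(1, 2)

# 该层不依赖注意力机制,可插入 Transformer / 线性注意力 / SSM 的任意子层之间
x = torch.randn(2, 16, 32)
print(CanonLayer(d_model=32)(x).shape)         # torch.Size([2, 16, 32])
```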
zh
[NLP-19] Stakeholder Suite: A Unified AI Framework for Mapping Actors, Topics and Arguments in Public Debates
【速读】: 该论文旨在解决公共基础设施与能源项目中利益相关者、议题与论点之间复杂动态难以量化分析的问题,尤其针对现有媒体情报工具在透明度和深度洞察方面的不足。其解决方案的关键在于提出并部署了Stakeholder Suite框架,该框架通过统一的处理流程整合了利益相关者识别(actor detection)、主题建模(topic modeling)、论点抽取(argument extraction)与立场分类(stance classification),从而实现对公共辩论的细粒度、来源可追溯的结构化分析,有效支持项目团队进行影响网络可视化、争议预警及基于证据的决策制定。
链接: https://arxiv.org/abs/2512.17347
作者: Mohamed Chenene,Jeanne Rouhier,Jean Daniélou,Mihir Sarkar,Elena Cabrio
机构: ENGIE Lab CRIGEN, France; Centre de Sociologie de L’Innovation, CNRS, UMR 9217, Mines ParisTech, PSL University; ENGIE Research & Innovation, France; Université Côte d’Azur, CNRS, INRIA, I3S, France
类目: Computation and Language (cs.CL)
备注:
Abstract:Public debates surrounding infrastructure and energy projects involve complex networks of stakeholders, arguments, and evolving narratives. Understanding these dynamics is crucial for anticipating controversies and informing engagement strategies, yet existing tools in media intelligence largely rely on descriptive analytics with limited transparency. This paper presents Stakeholder Suite, a framework deployed in operational contexts for mapping actors, topics, and arguments within public debates. The system combines actor detection, topic modeling, argument extraction and stance classification in a unified pipeline. Tested on multiple energy infrastructure projects as a case study, the approach delivers fine-grained, source-grounded insights while remaining adaptable to diverse domains. The framework achieves strong retrieval precision and stance accuracy, producing arguments judged relevant in 75% of pilot use cases. Beyond quantitative metrics, the tool has proven effective for operational use: helping project teams visualize networks of influence, identify emerging controversies, and support evidence-based decision-making.
zh
[NLP-20] Governance-Aware Hybrid Fine-Tuning for Multilingual Large Language Models
【速读】: 该论文旨在解决低资源多语言场景下大语言模型(Large Language Models, LLMs)适应性差、优化不稳定以及跨语言性能不平衡的问题。其核心挑战在于如何在计算预算受限的情况下,实现高精度、良好校准(calibration)和跨语言一致性(cross-language parity)的多语言微调。解决方案的关键在于提出一种治理感知的混合参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)框架:一方面结合梯度对齐的低秩更新与分层混合的结构化正交变换,并在特定子层引入酉约束(unitary constraints)以稳定深层优化;另一方面集成轻量级无标签数据治理步骤(如语言识别、近似重复项去除和质量过滤),从而在不显著增加训练开销的前提下,提升模型鲁棒性与泛化能力。实验证明,该方法在XNLI和FLORES基准上优于强基线PEFT方法,且对拼写变体更具韧性,同时展现出良好的成本-质量权衡。
链接: https://arxiv.org/abs/2512.17344
作者: Haomin Qi,Chengbo Huang,Zihan Dai,Yunkai Gao
机构: 未知
类目: Computation and Language (cs.CL)
备注: 11 pages, 4 figures, 6 tables. arXiv admin note: substantial text overlap with arXiv:2507.18076
Abstract:We present a governance-aware hybrid fine-tuning framework for multilingual, low-resource adaptation of large language models. The core algorithm combines gradient-aligned low-rank updates with structured orthogonal transformations through layer-wise mixing and introduces unitary constraints in selected sub-layers to stabilize deep optimization. In tandem with lightweight, label-free data governance steps, including language identification, near-duplicate removal, and quality filtering, the framework targets accuracy, calibration, and cross-language parity under tight compute budgets. Across XNLI and FLORES, the hybrid approach delivers consistent gains over strong PEFT baselines while maintaining directional balance and improving probability calibration, as shown in Tables II and III. It is more resilient to lightweight orthographic variants, as shown in Table IV, and benefits additively from simple governance steps, as shown in Table V. Training footprint measurements indicate modest overhead and a favorable cost-quality frontier, as shown in Table VI and Figure 2. Together, these results show that hybrid and unitary PEFT provide a stable and accessible path to resource-efficient multilingual adaptation when paired with practical data governance.
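下面给出"低秩更新与正交变换按层混合"的一个极简 PyTorch 示意(论文中梯度对齐、酉约束与数据治理的具体做法此处无法还原,`HybridPEFTLinear`、混合系数 mix 等均为笔者假设,仅示意大致思路):

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class HybridPEFTLinear(nn.Module):
    """冻结基座线性层,同时挂载 LoRA 风格低秩分支与正交约束分支,并按系数混合。"""
    def __init__(self, base: nn.Linear, rank=8, mix=0.5):
        super().__init__()
        d_out, d_in = base.weight.shape
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                        # 冻结预训练权重
        self.A = nn.Parameter(torch.zeros(rank, d_in))     # 低秩分支(零初始化)
        self.B = nn.Parameter(torch.randn(d_out, rank) * 0.01)
        self.rot = orthogonal(nn.Linear(d_out, d_out, bias=False))  # 正交约束,稳定深层优化
        self.mix = mix                                     # 层级混合系数

    def forward(self, x):
        y = self.base(x)
        low_rank = x @ self.A.t() @ self.B.t()             # 低秩增量
        return y + self.mix * low_rank + (1 - self.mix) * (self.rot(y) - y)

layer = HybridPEFTLinear(nn.Linear(64, 64))
print(layer(torch.randn(4, 64)).shape)                     # torch.Size([4, 64])
```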
zh
[NLP-21] Task Schema and Binding: A Double Dissociation Study of In-Context Learning
【速读】: 该论文旨在解决“在上下文学习(In-Context Learning, ICL)机制本质”的核心问题,即ICL是否为单一统一机制,还是可分解为不同神经计算过程。此前研究多将其视为整体性机制(如基于检索、梯度下降类或纯贝叶斯推理),缺乏对内部结构的因果验证。论文的关键解决方案是通过激活修补实验(activation patching experiments)在9个模型(涵盖7种Transformer架构及非Transformer的Mamba架构,参数规模370M–13B)中揭示出ICL由两个可分离的神经机制构成:任务模式识别(Task Schema)与输入输出绑定(Binding)。其中,Task Schema通过晚期MLP层传递(100%转移率),而Binding通过残差流传递(62%转移率),二者存在双分离现象(double dissociation),从而提供了因果证据支持ICL的双过程理论。此发现不仅澄清了ICL的内在机制,还指出注意力干扰(而非输出竞争)才是高先验场景下绑定失败的根本瓶颈,为高效提示工程和生产部署中的系统可靠性提升提供了理论依据。
链接: https://arxiv.org/abs/2512.17325
作者: Chaeha Kim
机构: Changwon National University (昌原国立大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 20pages, 2figures
Abstract:We provide causal mechanistic validation that in-context learning (ICL) decomposes into two separable mechanisms: Task Schema (abstract task type recognition) and Binding (specific input-output associations). Through activation patching experiments across 9 models from 7 Transformer families plus Mamba (370M-13B parameters), we establish three key findings: 1. Double dissociation: Task Schema transfers at 100% via late MLP patching; Binding transfers at 62% via residual stream patching – proving separable mechanisms; 2. Prior-Schema trade-off: Schema reliance inversely correlates with prior knowledge (Spearman rho = -0.596, p < 0.001, N=28 task-model pairs); 3. Architecture generality: The mechanism operates across all tested architectures including the non-Transformer Mamba. These findings offer a mechanistic account of the ICL puzzle that contrasts with prior views treating ICL as a monolithic mechanism (whether retrieval-based, gradient descent-like, or purely Bayesian). By establishing that Schema and Binding are neurally dissociable – not merely behavioral modes – we provide causal evidence for dual-process theories of ICL. Models rely on Task Schema when prior knowledge is absent, but prior knowledge interferes through attentional mis-routing (72.7% recency bias) rather than direct output competition (0%). This explains why arbitrary mappings succeed (zero prior leads to full Schema reliance) while factual overrides fail – and reveals that the true bottleneck is attentional, not output-level. Practical implications: Understanding these dual mechanisms enables more efficient prompt engineering – reliable schema transfer reduces required demonstrations for novel tasks, while prior-aware design can mitigate the 38% binding failure rate in high-prior scenarios, improving ICL system reliability in production deployments.
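激活修补(activation patching)的核心操作,是把一次前向中某层的激活替换为另一条输入在同一层产生的激活,再看输出是否随之迁移。下面用一个玩具 MLP 给出极简示意(论文实验在 9 个大模型的晚期 MLP 与残差流上进行;此处的模型结构、钩子位置均为假设):

```python
import torch
import torch.nn as nn

# 用玩具两层 MLP 代替真实 LLM,仅演示修补流程
toy = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
late_mlp = toy[1]                                  # 视作"晚期 MLP"的输出位置

cache = {}

def save_hook(module, inputs, output):
    cache["act"] = output.detach()                 # 缓存"源提示"在该层的激活

def patch_hook(module, inputs, output):
    return cache["act"]                            # 用源激活覆盖"目标提示"的激活

src, tgt = torch.randn(1, 8), torch.randn(1, 8)

handle = late_mlp.register_forward_hook(save_hook)
toy(src)                                           # 第一次前向:记录源激活
handle.remove()

handle = late_mlp.register_forward_hook(patch_hook)
patched_out = toy(tgt)                             # 第二次前向:观察输出是否向源任务迁移
handle.remove()
print(patched_out.shape)                           # torch.Size([1, 4])
```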
zh
[NLP-22] Large Language Models as Pokémon Battle Agents: Strategic Play and Content Generation
【速读】: 该论文旨在解决如何利用大型语言模型(Large Language Models, LLMs)在《宝可梦》对战中实现策略性决策与游戏内容生成的问题,即验证LLMs是否能够作为具备战术合理性且能创造平衡游戏内容的智能体。其解决方案的关键在于构建一个基于回合制的《宝可梦》战斗系统,使LLMs根据当前战斗状态自主选择招式,而非依赖预设逻辑或强化学习训练;该框架完整模拟了类型克制、属性伤害计算及多宝可梦队伍管理等核心机制,从而在无需领域特定训练的情况下,实现动态对手行为与内容生成的双重能力。
链接: https://arxiv.org/abs/2512.17308
作者: Daksh Jain,Aarya Jain,Ashutosh Desai,Avyakt Verma,Ishan Bhanuka,Pratik Narang,Dhruv Kumar
机构: Birla Institute of Technology and Science, Pilani, India (比尔拉理工大学与科学学院,皮拉尼,印度)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Under Review
Abstract:Strategic decision-making in Pokémon battles presents a unique testbed for evaluating large language models. Pokémon battles demand reasoning about type matchups, statistical trade-offs, and risk assessment, skills that mirror human strategic thinking. This work examines whether Large Language Models (LLMs) can serve as competent battle agents, capable of both making tactically sound decisions and generating novel, balanced game content. We developed a turn-based Pokémon battle system where LLMs select moves based on battle state rather than pre-programmed logic. The framework captures essential Pokémon mechanics: type effectiveness multipliers, stat-based damage calculations, and multi-Pokémon team management. Through systematic evaluation across multiple model architectures we measured win rates, decision latency, type-alignment accuracy, and token efficiency. These results suggest LLMs can function as dynamic game opponents without domain-specific training, offering a practical alternative to reinforcement learning for turn-based strategic games. The dual capability of tactical reasoning and content creation, positions LLMs as both players and designers, with implications for procedural generation and adaptive difficulty systems in interactive entertainment.
zh
[NLP-23] Subjective Question Generation and Answer Evaluation using NLP
【速读】: 该论文旨在解决自动化主观题生成与答案评估的问题,当前自然语言处理(Natural Language Processing, NLP)技术在客观题生成方面已有较多研究,但主观题的自动生成及答案质量评估仍处于发展阶段。解决方案的关键在于改进现有NLP模型或开发新型模型,以实现从文本输入中自动产生主观问题并准确评估学生作答内容,从而辅助教师批改作业,并支持学生通过自我测评提升学习效果。
链接: https://arxiv.org/abs/2512.17289
作者: G. M. Refatul Islam,Safwan Shaheer,Yaseen Nur,Mohammad Rafid Hamid
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 5 pages, 5 figures, 2 tables, conference paper
Abstract:Natural Language Processing (NLP) is one of the most revolutionary technologies today. It uses artificial intelligence to understand human text and spoken words. It is used for text summarization, grammar checking, sentiment analysis, and advanced chatbots and has many more potential use cases. Furthermore, it has also made its mark on the education sector. Much research and advancements have already been conducted on objective question generation; however, automated subjective question generation and answer evaluation are still in progress. An automated system to generate subjective questions and evaluate the answers can help teachers assess student work and enhance the student’s learning experience by allowing them to self-assess their understanding after reading an article or a chapter of a book. This research aims to improve current NLP models or make a novel one for automated subjective question generation and answer evaluation from text input.
zh
[NLP-24] Understanding Generalization in Role-Playing Models via Information Theory
【速读】: 该论文旨在解决角色扮演模型(Role-playing Models, RPMs)在真实应用场景中性能下降的问题,其根本原因在于用户、角色和对话结构等分布偏移(distribution shifts)导致的泛化能力减弱。现有方法如大语言模型作为裁判(LLM-as-a-judge)无法提供细粒度的诊断,缺乏对RPM泛化行为的理论刻画。解决方案的关键在于提出一种基于信息论的可解释指标——推理驱动的有效互信息差异(Reasoning-based Effective Mutual Information Difference, R-EMID),用于量化不同分布偏移对RPM性能的影响,并推导出R-EMID的上界以预测最坏情况下的泛化表现;同时设计了一种协同进化强化学习框架,自适应建模用户、角色与对话上下文之间的关联,从而提升对话响应生成概率的估计精度,这是计算R-EMID的核心前提。
链接: https://arxiv.org/abs/2512.17270
作者: Yongqi Li,Hao Lang,Fei Huang,Tieyun Qian,Yongbin Li
机构: Wuhan University (武汉大学); Tongyi Lab; Zhongguancun Academy; Alibaba-inc (阿里巴巴)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Role-playing models (RPMs) are widely used in real-world applications but underperform when deployed in the wild. This degradation can be attributed to distribution shifts, including user, character, and dialogue compositional shifts. Existing methods like LLM-as-a-judge fall short in providing a fine-grained diagnosis of how these shifts affect RPM generalization, and thus there lack formal frameworks to characterize RPM generalization behaviors. To bridge these gaps, we introduce an information-theoretic metric, named reasoning-based effective mutual information difference (R-EMID), to measure RPM performance degradation in an interpretable way. We also derive an upper bound on R-EMID to predict the worst-case generalization performance of RPMs and theoretically reveal how various shifts contribute to the RPM performance degradation. Moreover, we propose a co-evolving reinforcement learning framework to adaptively model the connection among user, character, and dialogue context and thus enhance the estimation of dialogue response generation probability, which is critical for calculating R-EMID. Finally, we evaluate the generalization performance of various RPMs using R-EMID, finding that user shift poses the highest risk among all shifts and reinforcement learning is the most effective approach for enhancing RPM generalization.
zh
[NLP-25] AutoMetrics: Approximate Human Judgements with Automatically Generated Evaluators
【速读】: 该论文旨在解决用户面向的生成式 AI (Generative AI) 应用在开放领域(如旅行规划、临床笔记生成或对话系统)中缺乏高效、可解释且与人类反馈高度相关的评估指标的问题。传统黄金标准(如用户点赞/点踩或留存率)在原型阶段和研究项目中往往数据稀缺或响应延迟,难以用于模型优化。解决方案的关键在于提出 AutoMetrics 框架:它通过从人工 curated 的 MetricBank(包含 48 个评估指标)中检索候选指标,并结合轻量级人类反馈自动生成 LLM-as-a-Judge 判定标准,再通过回归方法组合这些指标以最大化与人类信号的相关性,从而在少于 100 个标注样本下实现高相关性的自动评估指标。
链接: https://arxiv.org/abs/2512.17267
作者: Michael J. Ryan,Yanzhe Zhang,Amol Salunkhe,Yi Chu,Di Xu,Diyi Yang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Evaluating user-facing AI applications remains a central challenge, especially in open-ended domains such as travel planning, clinical note generation, or dialogue. The gold standard is user feedback (e.g., thumbs up/down) or behavioral signals (e.g., retention), but these are often scarce in prototypes and research projects, or too-slow to use for system optimization. We present AutoMetrics, a framework for synthesizing evaluation metrics under low-data constraints. AutoMetrics combines retrieval from MetricBank, a collection of 48 metrics we curate, with automatically generated LLM-as-a-Judge criteria informed by lightweight human feedback. These metrics are composed via regression to maximize correlation with human signal. AutoMetrics takes you from expensive measures to interpretable automatic metrics. Across 5 diverse tasks, AutoMetrics improves Kendall correlation with human ratings by up to 33.4% over LLM-as-a-Judge while requiring fewer than 100 feedback points. We show that AutoMetrics can be used as a proxy reward to equal effect as a verifiable reward. We release the full AutoMetrics toolkit and MetricBank to accelerate adaptive evaluation of LLM applications.
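下面用 Python 示意"用回归把多个自动指标组合起来,使其与人类反馈的 Kendall 相关性高于任一单指标"的流程(指标分数与人类评分均为随机模拟数据,Ridge 回归的选择亦为假设,并非 AutoMetrics 官方实现):

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 80                                               # 少于 100 条人工反馈
metric_scores = rng.uniform(size=(n, 5))             # 5 个候选指标(检索到的指标 + 自动生成的裁判准则)
human = 2.0 * metric_scores[:, 0] - metric_scores[:, 3] + 0.1 * rng.normal(size=n)

reg = Ridge(alpha=1.0).fit(metric_scores, human)     # 学习各指标的组合权重
combined = reg.predict(metric_scores)

tau_single, _ = kendalltau(metric_scores[:, 0], human)
tau_combined, _ = kendalltau(combined, human)
print(f"单一指标 tau={tau_single:.3f}  组合指标 tau={tau_combined:.3f}")
```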
zh
[NLP-26] Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在形式化语言(如Lean)中进行定理证明时面临的挑战,包括计算成本高、性能不足以及自然语言与形式语言之间转换效率低的问题。其解决方案的关键在于提出了一种基于大规模代理强化学习(agentic reinforcement learning)训练的正式定理证明模型Seed-Prover 1.5,并结合一种高效的测试时缩放(test-time scaling, TTS)工作流。该工作流通过与Lean等工具的持续交互积累经验,显著提升了形式化定理证明的能力和效率,同时利用自然语言证明的最新进展有效弥合了自然语言与形式语言之间的鸿沟,从而在较小计算预算下实现了优于现有方法的性能表现。
链接: https://arxiv.org/abs/2512.17260
作者: Jiangjie Chen,Wenxiang Chen,Jiacheng Du,Jinyi Hu,Zhicheng Jiang,Allan Jie,Xiaoran Jin,Xing Jin,Chenggang Li,Wenlei Shi,Zhihong Wang,Mingxuan Wang,Chenrui Wei,Shufa Wei,Huajian Xin,Fan Yang,Weihao Gao,Zheng Yuan,Tianyang Zhan,Zeyu Zheng,Tianxi Zhou,Thomas Hanwen Zhu
机构: ByteDance(字节跳动)
类目: Computation and Language (cs.CL)
备注: 21 pages
Abstract:Large language models have recently made significant progress to generate rigorous mathematical proofs. In contrast, utilizing LLMs for theorem proving in formal languages (such as Lean) remains challenging and computationally expensive, particularly when addressing problems at the undergraduate level and beyond. In this work, we present Seed-Prover 1.5, a formal theorem-proving model trained via large-scale agentic reinforcement learning, alongside an efficient test-time scaling (TTS) workflow. Through extensive interactions with Lean and other tools, the model continuously accumulates experience during the RL process, substantially enhancing the capability and efficiency of formal theorem proving. Furthermore, leveraging recent advancements in natural language proving, our TTS workflow efficiently bridges the gap between natural and formal languages. Compared to state-of-the-art methods, Seed-Prover 1.5 achieves superior performance with a smaller compute budget. It solves 88% of PutnamBench (undergraduate-level), 80% of Fate-H (graduate-level), and 33% of Fate-X (PhD-level) problems. Notably, using our system, we solved 11 out of 12 problems from Putnam 2025 within 9 hours. Our findings suggest that scaling learning from experience, driven by high-quality formal feedback, holds immense potential for the future of formal mathematical reasoning.
zh
[NLP-27] Incorporating Error Level Noise Embedding for Improving LLM-Assisted Robustness in Persian Speech Recognition
【速读】: 该论文旨在解决低资源语言(如波斯语)在噪声环境下自动语音识别(ASR)性能显著下降的问题,尤其是针对现有模型(如Whisper)在不同信噪比(SNR)下准确率难以维持的挑战。其解决方案的关键在于提出一种结合多假设生成与噪声感知建模的鲁棒纠错框架:首先利用改进的Whisper-large解码器生成5-best候选转录结果,进而引入误差级噪声(Error Level Noise, ELN)作为表征,量化语义和词元层面的假设分歧,从而直接度量噪声引起的不确定性;在此基础上,通过将ELN嵌入同时注入句子级和词级特征空间,训练出噪声条件下的大语言模型(LLM),实现了对ASR输出可靠性的精准推理与修正。实验表明,该方法在混合噪声测试集上将词错误率(WER)从原始Whisper的31.10%降至24.84%,显著优于仅基于文本微调或无ELN的基线模型。
链接: https://arxiv.org/abs/2512.17247
作者: Zahra Rahmani(1),Hossein Sameti(1) ((1) Department of Computer Engineering, Sharif University of Technology)
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Automatic Speech Recognition (ASR) systems suffer significant performance degradation in noisy environments, a challenge that is especially severe for low-resource languages such as Persian. Even state-of-the-art models such as Whisper struggle to maintain accuracy under varying signal-to-noise ratios (SNRs). This study presents a robust noise-sensitive ASR error correction framework that combines multiple hypotheses and noise-aware modeling. Using noisy Persian speech, we generate 5-best hypotheses from a modified Whisper-large decoder. Error Level Noise (ELN) is introduced as a representation that captures semantic- and token-level disagreement across hypotheses, quantifying the linguistic distortions caused by noise. ELN thus provides a direct measure of noise-induced uncertainty, enabling the LLM to reason about the reliability of each hypothesis during correction. Three models are evaluated: (1) a base LLaMA-2-7B model without fine-tuning, (2) a fine-tuned variant trained on text-only hypotheses, and (3) a noise-conditioned model integrating ELN embeddings at both sentence and word levels. Experimental results demonstrate that the ELN-conditioned model achieves substantial reductions in Word Error Rate (WER). Specifically, on the challenging Mixed Noise test set, the proposed Fine-tuned + ELN (Ours) model reduces the WER from a baseline of 31.10% (Raw Whisper) to 24.84%, significantly surpassing the Fine-tuned (No ELN) text-only baseline of 30.79%, whereas the original LLaMA-2-7B model increased the WER to 64.58%, demonstrating that it is unable to correct Persian errors on its own. This confirms the effectiveness of combining multiple hypotheses with noise-aware embeddings for robust Persian ASR in noisy real-world scenarios.
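As a rough illustration of how disagreement across the 5-best hypotheses can be turned into a noise signal, the sketch below computes a per-position token-disagreement score. The paper's actual ELN representation also captures semantic-level disagreement and is injected as learned sentence- and word-level embeddings, so the function and transcripts here are purely illustrative.

```python
# Illustrative sketch: token-level disagreement across N-best ASR hypotheses
# as a rough proxy for noise-induced uncertainty (not the paper's ELN code).
from collections import Counter

def token_disagreement(hypotheses):
    """Per-position disagreement: 1 - relative frequency of the majority token."""
    tokenized = [h.split() for h in hypotheses]
    max_len = max(len(t) for t in tokenized)
    padded = [t + ["<pad>"] * (max_len - len(t)) for t in tokenized]

    scores = []
    for pos in range(max_len):
        column = [t[pos] for t in padded]
        majority_count = Counter(column).most_common(1)[0][1]
        scores.append(1.0 - majority_count / len(column))
    return scores

nbest = [
    "the patient was given medicine",
    "the patient was given medicines",
    "a patient was given medicine",
    "the patient has given medicine",
    "the patient was given medicine",
]
print(token_disagreement(nbest))  # higher values mark noisier, less reliable spans
```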
zh
[NLP-28] Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding
【速读】: 该论文旨在解决当前检索增强生成(Retrieval-Augmented Generation, RAG)系统在处理长文本时缺乏全局语义感知能力的问题,导致其难以有效整合分散在文档中的证据并进行连贯推理。解决方案的关键在于提出 Mindscape-Aware RAG (MiA-RAG),通过层次化摘要构建一个显式的全局语义表示(即“mindscape”),并将该表示同时用于指导检索和生成过程:一方面使检索器能够生成更具上下文丰富性的查询嵌入,另一方面使生成器能够在一致的全局语境下对检索到的证据进行推理。这一机制显著提升了模型在长文本理解和多语言任务中的表现,使其更接近人类基于整体认知框架进行信息整合的能力。
链接: https://arxiv.org/abs/2512.17220
作者: Yuqing Li,Jiangnan Li,Zheng Lin,Ziyan Zhou,Junjie Wu,Weiping Wang,Jie Zhou,Mo Yu
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间安全学院); WeChat AI, Tencent (腾讯微信AI); Hong Kong University of Science and Technology (香港科技大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Humans understand long and complex texts by relying on a holistic semantic representation of the content. This global view helps organize prior knowledge, interpret new information, and integrate evidence dispersed across a document, as revealed by the Mindscape-Aware Capability of humans in psychology. Current Retrieval-Augmented Generation (RAG) systems lack such guidance and therefore struggle with long-context tasks. In this paper, we propose Mindscape-Aware RAG (MiA-RAG), the first approach that equips LLM-based RAG systems with explicit global context awareness. MiA-RAG builds a mindscape through hierarchical summarization and conditions both retrieval and generation on this global semantic representation. This enables the retriever to form enriched query embeddings and the generator to reason over retrieved evidence within a coherent global context. We evaluate MiA-RAG across diverse long-context and bilingual benchmarks for evidence-based understanding and global sense-making. It consistently surpasses baselines, and further analysis shows that it aligns local details with a coherent global representation, enabling more human-like long-context retrieval and reasoning.
zh
[NLP-29] Enhancing Long Document Long Form Summarisation with Self-Planning
【速读】: 该论文旨在解决长文本摘要中事实一致性(factual consistency)不足的问题,即生成的摘要容易出现幻觉或偏离原文关键信息,尤其是在信息密集型文档中。解决方案的关键在于提出一种名为“highlight-guided generation”的新方法,其核心是利用句子级别的关键信息(称为“highlight”)作为内容规划(content plan),引导摘要生成过程,从而提升摘要的可追溯性(traceability)和忠实度(faithfulness)。该方法通过自计划(self-planning)机制识别重要段落,并在两阶段(two-stage)架构下实现更精准的内容选择与生成,实验表明该方案在GovReport等长文本摘要数据集上显著提升了ROUGE-L指标(+4.1点)和SummaC分数(约35%提升),同时保持摘要的相关性和整体质量。
链接: https://arxiv.org/abs/2512.17179
作者: Xiaotang Du,Rohit Saxena,Laura Perez-Beltrachini,Pasquale Minervini,Ivan Titov
机构: University of Edinburgh (爱丁堡大学); Miniml.AI (Miniml.AI)
类目: Computation and Language (cs.CL)
备注:
Abstract:We introduce a novel approach for long context summarisation, highlight-guided generation, that leverages sentence-level information as a content plan to improve the traceability and faithfulness of generated summaries. Our framework applies self-planning methods to identify important content and then generates a summary conditioned on the plan. We explore both end-to-end and two-stage variants of the approach, finding that the two-stage pipeline performs better on long and information-dense documents. Experiments on long-form summarisation datasets demonstrate that our method consistently improves factual consistency while preserving relevance and overall quality. On GovReport, our best approach improves ROUGE-L by 4.1 points and achieves about 35% gains in SummaC scores. Qualitative analysis shows that highlight-guided summarisation helps preserve important details, leading to more accurate and insightful summaries across domains.
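A minimal sketch of the two-stage variant, assuming a generic `call_llm` client and illustrative prompts rather than the authors' actual ones: stage one self-plans by selecting highlight sentences, and stage two generates a summary conditioned on that plan.

```python
# Two-stage highlight-guided summarisation, sketched with a placeholder LLM
# client; the prompts are illustrative assumptions, not the paper's prompts.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def highlight_guided_summary(document: str) -> str:
    # Stage 1: self-planning, select the sentences that must be covered.
    plan_prompt = (
        "List the sentences from the document that a faithful summary must "
        "cover, one per line.\n\nDocument:\n" + document
    )
    highlights = call_llm(plan_prompt)

    # Stage 2: generate a summary conditioned on the highlight plan.
    gen_prompt = (
        "Write a summary of the document. Every claim must be traceable to "
        "one of the highlighted sentences below.\n\n"
        f"Highlights:\n{highlights}\n\nDocument:\n{document}"
    )
    return call_llm(gen_prompt)
```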
zh
[NLP-30] A Solver-in-the-Loop Framework for Improving LLMs on Answer Set Programming for Logic Puzzle Solving AAAI'26
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成领域特定语言(Domain-Specific Languages, DSLs)代码时面临的挑战,特别是针对答案集编程(Answer Set Programming, ASP)这一用于求解组合搜索问题的有效形式化方法。当前LLMs在ASP代码生成上的效果受限于其预训练阶段所接触的示例数量不足。论文提出了一种新颖的“求解器内循环”(solver-in-the-loop)方法,通过求解器引导的指令微调(solver-guided instruction-tuning)来提升LLMs对ASP语义解析任务的理解能力。该方案的核心在于:仅需自然语言描述的问题及其解决方案,即可从LLM生成的ASP语句中采样程序延续,并利用ASP的声明式特性——即部分编码逐步缩小解空间——根据求解器反馈将这些语句分类为“选择”或“拒绝”实例,进而进行监督微调;同时结合基于求解器指导的搜索策略(如best-of-N采样)进一步增强鲁棒性。实验表明,在两个数据集和两种提示设置下均实现了稳定性能提升。
链接: https://arxiv.org/abs/2512.17093
作者: Timo Pierre Schrader,Lukas Lange,Tobias Kaminski,Simon Razniewski,Annemarie Friedrich
机构: 1. University of Mannheim (曼海姆大学); 2. Max Planck Institute for Informatics (马克斯普朗克信息研究所); 3. University of Edinburgh (爱丁堡大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages, 7 figures, accepted at AAAI’26
Abstract:The rise of large language models (LLMs) has sparked interest in coding assistants. While general-purpose programming languages are well supported, generating code for domain-specific languages remains a challenging problem for LLMs. In this paper, we focus on the LLM-based generation of code for Answer Set Programming (ASP), a particularly effective approach for finding solutions to combinatorial search problems. The effectiveness of LLMs in ASP code generation is currently hindered by the limited number of examples seen during their initial pre-training phase. We introduce a novel ASP-solver-in-the-loop approach for solver-guided instruction-tuning of LLMs to address the highly complex semantic parsing task inherent in ASP code generation. Our method only requires problem specifications in natural language and their solutions. Specifically, we sample ASP statements for program continuations from LLMs for unriddling logic puzzles. Leveraging the special property of declarative ASP programming that partial encodings increasingly narrow down the solution space, we categorize them into chosen and rejected instances based on solver feedback. We then apply supervised fine-tuning to train LLMs on the curated data and further improve robustness using a solver-guided search that includes best-of-N sampling. Our experiments demonstrate consistent improvements in two distinct prompting settings on two datasets.
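The solver-feedback labeling loop can be sketched as follows. `sample_statements` and `solve_asp` are placeholders (the latter could be backed by an ASP solver such as clingo), and the chosen/rejected criterion shown, keeping the gold solution reachable while narrowing the answer-set count, is an assumption rather than the paper's exact rule.

```python
# Hedged sketch of solver-in-the-loop data curation for ASP code generation.
def sample_statements(llm, puzzle_text, partial_program, n=8):
    raise NotImplementedError("sample candidate ASP continuations from the LLM")

def solve_asp(program: str):
    """Return the answer sets (possibly empty) of an ASP program."""
    raise NotImplementedError("call an ASP solver, e.g. clingo")

def label_continuations(llm, puzzle_text, partial_program, gold_solution):
    chosen, rejected = [], []
    baseline = solve_asp(partial_program)
    for stmt in sample_statements(llm, puzzle_text, partial_program):
        candidate = partial_program + "\n" + stmt
        answer_sets = solve_asp(candidate)
        # Assumed criterion: a useful statement keeps the gold solution reachable
        # while narrowing the solution space; everything else is rejected.
        if answer_sets and gold_solution in answer_sets and len(answer_sets) <= len(baseline):
            chosen.append(stmt)
        else:
            rejected.append(stmt)
    return chosen, rejected  # chosen/rejected pairs feed supervised fine-tuning
```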
zh
[NLP-31] Data Augmentation Supporting a Conversational Agent Designed for Smoking Cessation Support Groups
【速读】: 该论文旨在解决在线戒烟支持群组中用户参与度低和污名化问题,通过引入自动对话代理(conversational agent)来提升互动效率。其核心挑战在于训练数据不足导致意图识别模型性能受限。解决方案的关键在于采用两级数据增强策略:首先基于现有数据微调开源大语言模型(LLM)以识别低F1值意图,并利用提示工程(prompt engineering)生成高质量合成数据(平均87%由人工标注为高质);其次从相关在线社区爬取超过10,000条真实帖子(73%为高质量),并结合人工验证确保数据质量。最终,合成与真实数据共同扩充原始数据集用于模型再训练,使意图分类器F1分数提升32%,证明该方法在数据稀缺场景下具有显著有效性与可复现性。
链接: https://arxiv.org/abs/2512.17092
作者: Salar Hashemitaheri,Ian Harris
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Online support groups for smoking cessation are economical and accessible, yet they often face challenges with low user engagement and stigma. The use of an automatic conversational agent would improve engagement by ensuring that all user comments receive a timely response. We address the challenge of insufficient high-quality data by employing a two-level data augmentation strategy: synthetic data augmentation and real data augmentation. First, we fine-tuned an open source LLM to classify posts from our existing smoking cessation support groups and identify intents with low F1 (precision+recall) scores. Then, for these intents, we generate additional synthetic data using prompt engineering with the GPT model, with an average of 87% of the generated synthetic posts deemed high quality by human annotators. Overall, the synthetic augmentation process resulted in 43% of the original posts being selected for augmentation, followed by 140% synthetic expansion of these posts. Additionally, we scraped more than 10,000 real posts from a related online support context, of which 73% were validated as good quality by human annotators. Each synthetic or scraped post underwent rigorous validation involving human reviewers to ensure quality and relevance. The validated new data, combined with the original support group posts, formed an augmented dataset used to retrain the intent classifier. Performance evaluation of the retrained model demonstrated a 32% improvement in F1, confirming the effectiveness of our data augmentation approach. Synthetic and real post augmentation led to similar performance improvements. This study provides a replicable framework for enhancing conversational agent performance in domains where data scarcity is a critical issue.
zh
[NLP-32] When F1 Fails: Granularity-Aware Evaluation for Dialogue Topic Segmentation
【速读】: 该论文旨在解决对话主题分割(Dialogue Topic Segmentation)评估中存在的偏差问题,即当前主流的严格边界匹配和F1指标未能准确反映模型在真实应用场景中的性能,尤其在大型语言模型(LLM)处理长对话历史时,因上下文窗口限制导致的结构化信息管理需求日益突出。其解决方案的关键在于提出以边界密度(boundary density)与段落一致性(segment coherence)为核心的新评估目标,并引入窗口容忍F1(Window-tolerant F1, W-F1),同时强调“边界评分”与“边界选择”应分离建模,从而揭示现有基准中性能差异主要源于标注粒度不一致和稀疏标签,而非模型质量提升,进而推动研究从追求单一正确边界集转向基于任务需求选择合适分割粒度的范式转变。
链接: https://arxiv.org/abs/2512.17083
作者: Michael H. Coen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 2 figures. Evaluation and methodology study on dialogue topic segmentation
Abstract:Dialogue topic segmentation supports summarization, retrieval, memory management, and conversational continuity. Despite decades of prior work, evaluation practice in dialogue topic segmentation remains dominated by strict boundary matching and F1-based metrics, even as modern LLM-based conversational systems increasingly rely on segmentation to manage conversation history beyond the model’s fixed context window, where unstructured context accumulation degrades efficiency and coherence. This paper introduces an evaluation objective for dialogue topic segmentation that treats boundary density and segment coherence as primary criteria, alongside window-tolerant F1 (W-F1). Through extensive cross-dataset empirical evaluation, we show that reported performance differences across dialogue segmentation benchmarks are driven not by model quality, but by annotation granularity mismatches and sparse boundary labels. This indicates that many reported improvements arise from evaluation artifacts rather than improved boundary detection. We evaluated multiple, structurally distinct dialogue segmentation strategies across eight dialogue datasets spanning task-oriented, open-domain, meeting-style, and synthetic interactions. Across these settings, we observe high segment coherence combined with extreme oversegmentation relative to sparse labels, producing misleadingly low exact-match F1 scores. We show that topic segmentation is best understood as selecting an appropriate granularity rather than predicting a single correct boundary set. We operationalize this view by explicitly separating boundary scoring from boundary selection.
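A window-tolerant F1 along these lines counts a predicted boundary as correct when a reference boundary falls within a small window. The matching details below (greedy one-to-one matching, window of two utterances) are assumptions and may differ from the paper's definition.

```python
# Minimal window-tolerant F1 (W-F1) for dialogue topic boundaries.
def window_f1(pred, ref, window=2):
    pred, ref = sorted(pred), sorted(ref)
    used, tp = set(), 0
    for p in pred:
        # Greedily match p to an unused reference boundary within +/- window.
        match = next((r for r in ref if abs(p - r) <= window and r not in used), None)
        if match is not None:
            used.add(match)
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# An oversegmented prediction is punished heavily by exact-match F1,
# while W-F1 still credits boundaries that land close to reference ones.
print(window_f1(pred=[3, 9, 15, 21], ref=[10, 20], window=2))  # ~0.67
```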
zh
[NLP-33] Perturb Your Data: Paraphrase-Guided Training Data Watermarking AAAI2026
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLM)训练数据版权保护与来源追踪难题,尤其是在训练数据仅占整个语料库极小比例(如低于0.001%)时仍能可靠检测的问题。解决方案的关键在于提出一种名为SPECTRA的水印技术:通过使用另一个LLM对原始文本进行改写(paraphrasing),并基于一个独立评分模型计算每个改写版本的似然得分,选择得分与原文本接近的改写版本,从而避免引入分布偏移(distribution shift)。检测时,通过比较可疑模型输出的token概率与评分模型的概率分布,实现对训练数据的高灵敏度识别。实验表明,SPECTRA在区分训练数据与非训练数据时可达到超过九个数量级的p值差距,显著优于现有基线方法。
链接: https://arxiv.org/abs/2512.17075
作者: Pranav Shetty,Mirazul Haque,Petr Babkin,Zhiqiang Ma,Xiaomo Liu,Manuela Veloso
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to AAAI 2026
Abstract:Training data detection is critical for enforcing copyright and data licensing, as Large Language Models (LLM) are trained on massive text corpora scraped from the internet. We present SPECTRA, a watermarking approach that makes training data reliably detectable even when it comprises less than 0.001% of the training corpus. SPECTRA works by paraphrasing text using an LLM and assigning a score based on how likely each paraphrase is, according to a separate scoring model. A paraphrase is chosen so that its score closely matches that of the original text, to avoid introducing any distribution shifts. To test whether a suspect model has been trained on the watermarked data, we compare its token probabilities against those of the scoring model. We demonstrate that SPECTRA achieves a consistent p-value gap of over nine orders of magnitude when detecting data used for training versus data not used for training, which is greater than all baselines tested. SPECTRA equips data owners with a scalable, deploy-before-release watermark that survives even large-scale LLM training.
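The embedding step can be sketched as follows, with `paraphrase` and `loglik` as placeholders rather than the released SPECTRA code: among candidate paraphrases, the one whose scoring-model likelihood is closest to the original's is kept, so the watermark avoids a detectable distribution shift. Detection then compares the suspect model's token probabilities against the scoring model's.

```python
# Hedged sketch of the watermark-embedding step described in the abstract.
def paraphrase(text: str, n: int = 4) -> list[str]:
    raise NotImplementedError("generate n paraphrases of the text with an LLM")

def loglik(text: str) -> float:
    raise NotImplementedError("mean token log-likelihood under a separate scoring model")

def watermark(text: str) -> str:
    """Pick the paraphrase whose scoring-model likelihood best matches the original."""
    original_score = loglik(text)
    candidates = paraphrase(text)
    return min(candidates, key=lambda c: abs(loglik(c) - original_score))
```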
zh
[NLP-34] XLM: A Python package for non-autoregressive language models
【速读】: 该论文旨在解决非自回归文本生成(non-autoregressive text generation)在实现和比较上的困难问题。当前,非自回归语言模型的实现多为定制化开发,缺乏统一的训练与推理框架,导致不同方法难以系统性对比;同时,每种模型通常需独立设计数据整理(data collation)、损失函数(loss)和预测逻辑,造成通用组件难以复用。解决方案的关键在于提出一个名为XLM的Python包,其核心目标是简化小型非自回归语言模型的开发流程,并通过配套的xlm-models包提供一系列预训练的小型模型,从而促进研究社区对非自回归语言建模方法的高效实验与比较。
链接: https://arxiv.org/abs/2512.17065
作者: Dhruvesh Patel,Durga Prasad Maram,Sai Sreenivas Chintha,Benjamin Rozonoyer,Andrew McCallum
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校)
类目: Computation and Language (cs.CL)
备注: Code available at this https URL
Abstract:In recent years, there has been a resurgence of interest in non-autoregressive text generation in the context of general language modeling. Unlike the well-established autoregressive language modeling paradigm, which has a plethora of standard training and inference libraries, implementations of non-autoregressive language modeling have largely been bespoke, making it difficult to perform systematic comparisons of different methods. Moreover, each non-autoregressive language model typically requires its own data collation, loss, and prediction logic, making it challenging to reuse common components. In this work, we present the XLM Python package, which is designed to make implementing small non-autoregressive language models faster, with a secondary goal of providing a suite of small pre-trained models (through a companion xlm-models package) that can be used by the research community. The code is available at this https URL.
zh
[NLP-35] Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL
【速读】: 该论文旨在解决企业级Text-to-SQL系统部署中的“三难困境”——即在成本、安全性和性能之间难以平衡的问题。当前方案通常迫使企业只能在昂贵的专有大语言模型(Large Language Models, LLMs)与性能较低的小语言模型(Small Language Models, SLMs)之间二选一。为提升SLMs的性能,现有方法常依赖于从LLMs中蒸馏非结构化的思维链(Chain-of-Thought, CoT)轨迹,但这一过程存在内在模糊性。本文的关键解决方案是提出一种名为Struct-SQL的知识蒸馏(Knowledge Distillation, KD)框架,其核心在于使用查询执行计划(query execution plan)作为形式化、结构化的推理表示来替代传统非结构化CoT,从而提供更清晰、可靠的指导信号。实验表明,基于结构化CoT蒸馏训练的SLM相比非结构化蒸馏基线提升了8.1%的准确率,且错误分析显示语法错误显著减少,验证了结构化逻辑蓝图对小模型SQL生成可靠性的重要提升作用。
链接: https://arxiv.org/abs/2512.17053
作者: Khushboo Thaker,Yony Bresler
机构: Crater Labs( Crater Labs)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:
Abstract:Deploying accurate Text-to-SQL systems at the enterprise level faces a difficult trilemma involving cost, security and performance. Current solutions force enterprises to choose between expensive, proprietary Large Language Models (LLMs) and low-performing Small Language Models (SLMs). Efforts to improve SLMs often rely on distilling reasoning from large LLMs using unstructured Chain-of-Thought (CoT) traces, a process that remains inherently ambiguous. Instead, we hypothesize that a formal, structured reasoning representation provides a clearer, more reliable teaching signal, as the Text-to-SQL task requires explicit and precise logical steps. To evaluate this hypothesis, we propose Struct-SQL, a novel Knowledge Distillation (KD) framework that trains an SLM to emulate a powerful large LLM. Consequently, we adopt a query execution plan as a formal blueprint to derive this structured reasoning. Our SLM, distilled with structured CoT, achieves an absolute improvement of 8.1% over an unstructured CoT distillation baseline. A detailed error analysis reveals that a key factor in this gain is a marked reduction in syntactic errors. This demonstrates that teaching a model to reason using a structured logical blueprint is beneficial for reliable SQL generation in SLMs.
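One illustrative distillation record in this style is shown below; the field names and the plan-like step format are assumptions for illustration, not the paper's exact schema.

```python
# Hypothetical Struct-SQL-style distillation record: the teacher's reasoning is
# expressed as an execution-plan-like sequence of steps rather than free text.
record = {
    "question": "Average salary of employees hired after 2020, per department?",
    "schema": "employees(id, dept_id, salary, hire_date); departments(id, name)",
    "structured_cot": [
        "SCAN employees, FILTER hire_date > '2020-12-31'",
        "JOIN departments ON employees.dept_id = departments.id",
        "GROUP BY departments.name",
        "AGGREGATE AVG(employees.salary)",
    ],
    "sql": (
        "SELECT d.name, AVG(e.salary) FROM employees e "
        "JOIN departments d ON e.dept_id = d.id "
        "WHERE e.hire_date > '2020-12-31' GROUP BY d.name"
    ),
}
```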
zh
[NLP-36] A Women's Health Benchmark for Large Language Models
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在女性健康(Women’s Health)领域信息准确性严重不足的问题,这一问题尤为突出,因为LLMs已成为数百万用户获取健康信息的主要来源。解决方案的关键在于构建首个专门针对女性健康领域的基准测试——女性健康基准(Women’s Health Benchmark, WHB),其包含96个经过严格验证的模型测试条目,覆盖产科与妇科、急诊医学、初级保健、肿瘤学和神经学五个专科,以及患者提问、临床医生提问和循证/政策类提问三种查询类型,并细分为八类错误类型(如剂量错误、关键信息缺失、过时指南、治疗建议错误等)。通过该基准对13个前沿LLM进行系统评估,发现当前模型在女性健康任务上平均失败率高达60%,且在“遗漏紧急情况”类错误中普遍存在,揭示了AI聊天机器人在该领域尚不具备提供可靠建议的能力。
链接: https://arxiv.org/abs/2512.17028
作者: Victoria-Elisabeth Gruber,Razvan Marinescu,Diego Fajardo,Amin H. Nassar,Christopher Arkfeld,Alexandria Ludlow,Shama Patel,Mehrnoosh Samaei,Valerie Klug,Anna Huber,Marcel Gühner,Albert Botta i Orfila,Irene Lagoja,Kimya Tarr,Haleigh Larson,Mary Beth Howard
机构: Lumos AI; Yale Cancer Center; MGH; Harvard Medical School; UCSF; Brown Division of Global Emergency Medicine; Emory University; Pharmacy Department, Clinic Ottakring; Windrush Surgery; NHS; Women’s Health Research; Yale School of Medicine; Johns Hopkins University School of Medicine
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 6 Figures, 2 Tables
Abstract:As large language models (LLMs) become primary sources of health information for millions, their accuracy in women’s health remains critically unexamined. We introduce the Women’s Health Benchmark (WHB), the first benchmark evaluating LLM performance specifically in women’s health. Our benchmark comprises 96 rigorously validated model stumps covering five medical specialties (obstetrics and gynecology, emergency medicine, primary care, oncology, and neurology), three query types (patient query, clinician query, and evidence/policy query), and eight error types (dosage/medication errors, missing critical information, outdated guidelines/treatment recommendations, incorrect treatment advice, incorrect factual information, missing/incorrect differential diagnosis, missed urgency, and inappropriate recommendations). We evaluated 13 state-of-the-art LLMs and revealed alarming gaps: current models show approximately 60% failure rates on the women’s health benchmark, with performance varying dramatically across specialties and error types. Notably, models universally struggle with “missed urgency” indicators, while newer models like GPT-5 show significant improvements in avoiding inappropriate recommendations. Our findings underscore that AI chatbots are not yet fully capable of providing reliable advice in women’s health.
zh
[NLP-37] PAACE: A Plan-Aware Automated Agent Context Engineering Framework
【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)代理在多步骤任务流中因上下文不断膨胀而导致的注意力稀释、推理成本上升及信息 fidelity 下降的问题。现有摘要与查询感知压缩方法未能充分考虑代理推理中的计划结构和多步相关性,导致压缩效果不佳。解决方案的关键在于提出 PAACE(Plan-Aware Automated Context Engineering),其核心创新包括:(1) 基于计划结构分析的下一步任务相关性建模,确保压缩保留关键决策路径;(2) 指令协同精炼机制,提升压缩后指令的语义完整性;(3) 保持函数功能不变的压缩策略,保障工具调用正确性。PAACE 通过合成数据生成器 PAACE-Syn 和蒸馏式压缩器 PAACE-FT 实现高效部署,在多个长程基准测试中显著提升准确率并大幅降低上下文负载与计算开销,同时保持接近教师模型的性能表现。
链接: https://arxiv.org/abs/2512.16970
作者: Kamer Ali Yuksel
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:
Abstract:Large Language Model (LLM) agents are increasingly deployed in complex, multi-step workflows involving planning, tool use, reflection, and interaction with external knowledge systems. These workflows generate rapidly expanding contexts that must be curated, transformed, and compressed to maintain fidelity, avoid attention dilution, and reduce inference cost. Prior work on summarization and query-aware compression largely ignores the multi-step, plan-aware nature of agentic reasoning. In this work, we introduce PAACE (Plan-Aware Automated Context Engineering), a unified framework for optimizing the evolving state of LLM agents through next-k-task relevance modeling, plan-structure analysis, instruction co-refinement, and function-preserving compression. PAACE comprises (1) PAACE-Syn, a large-scale generator of synthetic agent workflows annotated with stepwise compression supervision, and (2) PAACE-FT, a family of distilled, plan-aware compressors trained from successful teacher demonstrations. Experiments on long-horizon benchmarks (AppWorld, OfficeBench, and 8-Objective QA) demonstrate that PAACE consistently improves agent correctness while substantially reducing context load. On AppWorld, PAACE achieves higher accuracy than all baselines while lowering peak context and cumulative dependency. On OfficeBench and multi-hop QA, PAACE improves both accuracy and F1, achieving fewer steps, lower peak tokens, and reduced attention dependency. Distilled PAACE-FT retains 97 percent of the teacher’s performance while reducing inference cost by over an order of magnitude, enabling practical deployment of plan-aware compression with compact models.
zh
[NLP-38] Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows
【速读】: 该论文旨在解决科学智能(Scientific AI)领域中缺乏统一框架的问题,特别是如何定义并评估具备跨学科自主科研能力的科学通用智能(Scientific General Intelligence, SGI)。其核心挑战在于现有模型在深度研究、假设生成、实验设计与执行及多模态推理等关键环节存在显著能力缺口。解决方案的关键在于:首先提出以实践探究模型(Practical Inquiry Model, PIM)为基础的SGI操作化定义,涵盖思辨(Deliberation)、构想(Conception)、行动(Action)和感知(Perception)四个阶段;其次构建SGI-Bench基准测试集,包含超1000个专家标注的跨学科样本,覆盖Science期刊125个重大科学问题,实现对大语言模型(LLMs)系统性评估;最后引入推理时强化学习(Test-Time Reinforcement Learning, TTRL),通过检索增强的新颖性奖励机制优化推理过程,提升假设新颖性而无需参考答案,从而推动AI真正参与科学发现。
链接: https://arxiv.org/abs/2512.16969
作者: Wanghan Xu,Yuhao Zhou,Yifan Zhou,Qinglong Cao,Shuo Li,Jia Bu,Bo Liu,Yixin Chen,Xuming He,Xiangyu Zhao,Xiang Zhuang,Fengxiang Wang,Zhiwang Zhou,Qiantai Feng,Wenxuan Huang,Jiaqi Wei,Hao Wu,Yuejin Yang,Guangshuai Wang,Sheng Xu,Ziyan Huang,Xinyao Liu,Jiyao Liu,Cheng Tang,Wei Li,Ying Chen,Junzhi Ning,Pengfei Jiang,Chenglong Ma,Ye Du,Changkai Ji,Huihui Xu,Ming Hu,Jiangbin Zheng,Xin Chen,Yucheng Wu,Feifei Jiang,Xi Chen,Xiangru Tang,Yuchen Fu,Yingzhou Lu,Yuanyuan Zhang,Lihao Sun,Chengbo Li,Jinzhe Ma,Wanhao Liu,Yating Liu,Kuo-Cheng Wu,Shengdu Chai,Yizhou Wang,Ouwen Zhangjin,Chen Tang,Shufei Zhang,Wenbo Cao,Junjie Ren,Taoyong Cui,Zhouheng Yao,Juntao Deng,Yijie Sun,Feng Liu,Wangxu Wei,Jingyi Xu,Zhangrui Li,Junchao Gong,Zijie Guo,Zhiyu Yao,Zaoyu Chen,Tianhao Peng,Fangchen Yu,Bo Zhang,Dongzhan Zhou,Shixiang Tang,Jiaheng Liu,Fenghua Ling,Yan Lu,Yuchen Ren,Ben Fei,Zhen Zhao,Xinyu Gu,Rui Su,Xiao-Ming Wu,Weikang Si,Yang Liu,Hao Chen,Xiangchao Yan,Xue Yang,Junchi Yan,Jiamin Wu,Qihao Zheng,Chenhui Li,Zhiqiang Gao,Hao Kong,Junjun He,Mao Su,Tianfan Fu,Peng Ye,Chunfeng Song,Nanqing Dong,Yuqiang Li,Huazhu Fu
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI)-the ability to autonomously conceive, investigate, and reason across scientific domains-remains lacking. We present an operational SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science’s 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10–20%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges. We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answer. Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.
zh
[NLP-39] Speech-FT: Merging Pre-trained And Fine-Tuned Speech Representation Models For Cross-Task Generalization
【速读】: 该论文旨在解决语音表示模型(Speech Representation Models)在微调(fine-tuning)过程中因表征漂移(representational drift)导致的跨任务泛化能力下降问题。现有方法如权重空间正则化或LoRA微调虽能限制参数变化,但难以维持与预训练模型的高特征相似性,从而削弱了模型在多个下游任务中的通用性能。其解决方案的关键在于提出一种两阶段微调框架Speech-FT:第一阶段通过专门设计的微调策略最小化表征漂移,第二阶段采用权重空间插值(weight-space interpolation)将模型恢复至接近预训练状态,从而在允许较大权重更新的前提下有效保留预训练知识,显著提升跨任务泛化性能。
链接: https://arxiv.org/abs/2502.12672
作者: Tzu-Quan Lin,Wei-Ping Huang,Hao Tang,Hung-yi Lee
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: Published in IEEE Transactions on Audio, Speech, and Language Processing (TASLP). Model and code available at: this https URL
Abstract:Fine-tuning speech representation models can enhance performance on specific tasks but often compromises their cross-task generalization ability. This degradation is often caused by excessive changes in the representations, making it difficult to retain information learned during pre-training. Existing approaches, such as regularizing weight changes during fine-tuning, may fail to maintain sufficiently high feature similarity with the pre-trained model, and thus could possibly lose cross-task generalization. To address this issue, we propose Speech-FT, a novel two-stage fine-tuning framework designed to maintain cross-task generalization while benefiting from fine-tuning. Speech-FT first applies fine-tuning specifically designed to reduce representational drift, followed by weight-space interpolation with the pre-trained model to restore cross-task generalization. Extensive experiments on HuBERT, wav2vec 2.0, DeCoAR 2.0, and WavLM Base+ demonstrate that Speech-FT consistently improves performance across a wide range of supervised, unsupervised, and multitask fine-tuning scenarios. Moreover, Speech-FT achieves superior cross-task generalization compared to fine-tuning baselines that explicitly constrain weight changes, such as weight-space regularization and LoRA fine-tuning. Our analysis reveals that Speech-FT maintains higher feature similarity to the pre-trained model compared to alternative strategies, despite allowing larger weight-space updates. Notably, Speech-FT achieves significant improvements on the SUPERB benchmark. For example, when fine-tuning HuBERT on automatic speech recognition, Speech-FT is able to reduce phone error rate from 5.17% to 3.94%, lower word error rate from 6.38% to 5.75%, and increase speaker identification accuracy from 81.86% to 84.11%. Speech-FT provides a simple yet powerful solution for further refining speech representation models after pre-training.
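The second stage, weight-space interpolation with the pre-trained model, is simple to sketch; the interpolation factor is a hyperparameter and the value below is an arbitrary illustrative choice, not the paper's setting.

```python
# Weight-space interpolation between pre-trained and fine-tuned checkpoints,
# in the spirit of Speech-FT's second stage (illustrative, not the paper's code).
import torch

def interpolate_state_dicts(pretrained, finetuned, alpha=0.5):
    merged = {}
    for name, w_pre in pretrained.items():
        w_ft = finetuned[name]
        if torch.is_floating_point(w_pre):
            merged[name] = alpha * w_pre + (1.0 - alpha) * w_ft
        else:
            merged[name] = w_ft  # e.g. integer buffers: keep the fine-tuned value
    return merged

# Usage: model.load_state_dict(interpolate_state_dicts(pre_sd, ft_sd, alpha=0.5))
```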
zh
[NLP-40] Computational analysis reveals historical trajectory of East-Polynesian lunar calendars
【速读】: 该论文旨在解决东波利尼西亚地区月相历法(lunar calendar)的演化关系及其与语言分化之间关联性的问题。研究通过计算方法分析了来自东波利尼西亚各大群岛的49份月夜命名列表,每份约含30个夜名,构建出一个根植的系统发育树(phylogenetic tree),发现这些历法可明确分为两组:一组包含拉帕努伊(Rapa Nui)、曼加雷瓦(Mangareva)和马奎萨斯群岛(Marquesas);另一组涵盖新西兰、夏威夷、库克群岛、社会群岛(Austral Islands)、大溪地(Tahiti)及土阿莫土群岛(Tuamotu)。这一分组模式与近期对东波利尼西亚语言的新分类——“远系”(Distal,即马奎萨斯语、曼加雷瓦语、拉帕努伊语)与“近系”(Proximal,如毛利语、夏威夷语、塔希提语等)——高度一致。研究的关键在于识别出历法结构与语言谱系之间的同步演化趋势,从而推断早期月相历法的分化反映了人类迁徙与语言分裂的地理过程,为理解南岛语族在太平洋的扩散提供了符号系统层面的证据。
链接: https://arxiv.org/abs/2512.17525
作者: Miguel Valério,Fabio Tamburini,Michele Corazza
机构: 未知
类目: Populations and Evolution (q-bio.PE); Computation and Language (cs.CL)
备注:
Abstract:We investigate a type of lunar calendar known as lists of the ‘nights of the moon’, found throughout East Polynesia, including Rapa Nui (Easter Island). Using computational methods, we analyzed the lexical and structural divergence of 49 calendric lists from all major archipelagos, each containing about 30 night names. Our results, presented as a rooted phylogenetic tree, show a clear split into two main groups: one including lists from Rapa Nui, Mangareva, and the Marquesas; the other comprising lists from New Zealand, Hawaii, the Cook Islands, the Austral Islands, Tahiti, and the Tuamotu. This pattern aligns with a recent alternative classification of East Polynesian languages into ‘Distal’ (Marquesan, Mangarevan, Rapanui) and ‘Proximal’ (Maori, Hawaiian, Tahitian, etc.) subgroups. Since both language and lunar calendars are symbolic systems passed down and changed within communities - and given the geographic isolation of many archipelagos - we interpret this correspondence as evidence that the early divergence of East Polynesian lunar calendars mirrors early population movements and language splits in the region.
zh
计算机视觉
[CV-0] Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing
【速读】:该论文旨在解决当前生成式扩散模型(Diffusion Models)在使用以理解为导向的表示编码器(representation encoder)特征作为潜在空间时所面临的两大挑战:一是判别性特征空间缺乏紧凑的正则化,导致扩散模型容易产生偏离流形的潜在表示,进而造成物体结构不准确;二是编码器本身像素级重建能力较弱,限制了生成器对精细几何和纹理的学习。解决方案的关键在于提出一种系统性的框架,通过引入语义-像素联合重建目标(semantic-pixel reconstruction objective),对潜在空间进行有效正则化,从而将语义信息与细粒度细节压缩至一个高度紧凑的表示中(96通道、16×16空间下采样),既保持语义丰富性又实现最先进的图像重建效果,同时支持高效的文本到图像生成(Text-to-Image, T2I)和图像编辑任务。
链接: https://arxiv.org/abs/2512.17909
作者: Shilong Zhang,He Zhang,Zhifei Zhang,Chongjian Ge,Shuchen Xue,Shaoteng Liu,Mengwei Ren,Soo Ye Kim,Yuqian Zhou,Qing Liu,Daniil Pakhomov,Kai Zhang,Zhe Lin,Ping Luo
机构: The University of Hong Kong (香港大学); Adobe Research (Adobe 研究院); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. However, we empirically identify two fundamental obstacles in this paradigm: (1) the discriminative feature space lacks compact regularization, making diffusion models prone to off-manifold latents that lead to inaccurate object structures; and (2) the encoder’s inherently weak pixel-level reconstruction hinders the generator from learning accurate fine-grained geometry and texture. In this paper, we propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. We introduce a semantic-pixel reconstruction objective to regularize the latent space, enabling the compression of both semantic information and fine-grained details into a highly compact representation (96 channels with 16x16 spatial downsampling). This design ensures that the latent space remains semantically rich and achieves state-of-the-art image reconstruction, while remaining compact enough for accurate generation. Leveraging this representation, we design a unified Text-to-Image (T2I) and image editing model. Benchmarking against various feature spaces, we demonstrate that our approach achieves state-of-the-art reconstruction, faster convergence, and substantial performance gains in both T2I and editing tasks, validating that representation encoders can be effectively adapted into robust generative components.
zh
[CV-1] Re-Depth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting
【速读】:该论文旨在解决单目深度估计(Monocular Depth Estimation)在真实世界图像上性能下降的问题,尤其是当测试数据分布与训练数据存在显著差异时,现有基础模型如Depth Anything V2(DA-V2)的表现受限。其解决方案的关键在于提出Re-Depth Anything框架,通过融合大规模2D扩散模型的强大先验知识来缩小域差距;具体而言,该方法利用Score Distillation Sampling(SDS)在生成式语境下引入形状从明暗(Shape from Shading, SfS)线索,对预测的深度图进行无标签重光照重构,并采用目标优化策略——冻结编码器仅更新中间嵌入和微调解码器,从而避免优化坍塌,显著提升深度估计的准确性和现实感。
链接: https://arxiv.org/abs/2512.17908
作者: Ananta R. Bhattarai,Helge Rhodin
机构: Bielefeld University (比勒费尔德大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Monocular depth estimation remains challenging as recent foundation models, such as Depth Anything V2 (DA-V2), struggle with real-world images that are far from the training distribution. We introduce Re-Depth Anything, a test-time self-supervision framework that bridges this domain gap by fusing DA-V2 with the powerful priors of large-scale 2D diffusion models. Our method performs label-free refinement directly on the input image by re-lighting predicted depth maps and augmenting the input. This re-synthesis method replaces classical photometric reconstruction by leveraging shape from shading (SfS) cues in a new, generative context with Score Distillation Sampling (SDS). To prevent optimization collapse, our framework employs a targeted optimization strategy: rather than optimizing depth directly or fine-tuning the full model, we freeze the encoder and only update intermediate embeddings while also fine-tuning the decoder. Across diverse benchmarks, Re-Depth Anything yields substantial gains in depth accuracy and realism over the DA-V2, showcasing new avenues for self-supervision by augmenting geometric reasoning.
zh
[CV-2] Dexterous World Models
【速读】:该论文旨在解决当前数字孪生(Digital Twin)场景中缺乏具身交互性的问题,即现有技术虽能生成静态的高保真3D环境,但无法模拟人类动作引发的动态变化,限制了其在导航与视图合成之外的应用。解决方案的关键在于提出灵巧世界模型(Dexterous World Model, DWM),这是一个基于视频扩散(video diffusion)的框架,通过联合建模静态场景渲染和第一人称视角的手部运动序列(egocentric hand motion sequence),生成具有时间一致性的交互视频。DWM的核心创新在于:(1) 以指定相机轨迹下的静态场景渲染为条件,确保空间一致性;(2) 利用第一人称手部网格渲染(egocentric hand mesh rendering)编码几何与运动信息,直接建模动作驱动的动态变化。该方法结合合成数据与真实固定相机视频构建混合交互数据集,实现了如抓取、开合物体等物理合理且场景一致的具身交互模拟。
链接: https://arxiv.org/abs/2512.17907
作者: Byungjun Kim,Taeksoo Kim,Junyoung Lee,Hanbyul Joo
机构: Seoul National University (首尔国立大学); RLWRLD
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this http URL
Abstract:Recent progress in 3D reconstruction has made it easy to create realistic digital twins from everyday environments. However, current digital twins remain largely static and are limited to navigation and view synthesis without embodied interactivity. To bridge this gap, we introduce Dexterous World Model (DWM), a scene-action-conditioned video diffusion framework that models how dexterous human actions induce dynamic changes in static 3D scenes. Given a static 3D scene rendering and an egocentric hand motion sequence, DWM generates temporally coherent videos depicting plausible human-scene interactions. Our approach conditions video generation on (1) static scene renderings following a specified camera trajectory to ensure spatial consistency, and (2) egocentric hand mesh renderings that encode both geometry and motion cues to model action-conditioned dynamics directly. To train DWM, we construct a hybrid interaction video dataset. Synthetic egocentric interactions provide fully aligned supervision for joint locomotion and manipulation learning, while fixed-camera real-world videos contribute diverse and realistic object dynamics. Experiments demonstrate that DWM enables realistic and physically plausible interactions, such as grasping, opening, and moving objects, while maintaining camera and scene consistency. This framework represents a first step toward video diffusion-based interactive digital twins and enables embodied simulation from egocentric actions.
zh
[CV-3] Adversarial Robustness of Vision in Open Foundation Models
【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在面对对抗性攻击时的鲁棒性问题,特别是针对图像输入模态的未目标化投影梯度下降(Untargeted PGD)攻击下模型性能下降的现象。其关键解决方案是通过在VQA v2数据集子集上对LLaVA-1.5-13B和Meta的Llama 3.2 Vision-8B-2进行系统性对抗测试,并以标准VQA准确率作为量化指标,评估不同模型在扰动下的准确率下降程度。研究发现,尽管Llama 3.2 Vision的基线准确率较低,但在高扰动水平下其性能衰减幅度小于LLaVA,表明对抗鲁棒性与标准基准性能之间不存在直接相关性,且可能受模型架构和训练机制等内在因素影响。
链接: https://arxiv.org/abs/2512.17902
作者: Jonathon Fox,William J Buchanan,Pavlos Papadopoulos
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:With the increase in deep learning, it becomes increasingly difficult to understand how AI systems identify objects. Thus, an adversary could aim to modify an image by adding unseen elements, which will confuse the AI in its recognition of an entity. This paper thus investigates the adversarial robustness of LLaVA-1.5-13B and Meta’s Llama 3.2 Vision-8B-2. These are tested for untargeted PGD (Projected Gradient Descent) against the visual input modality, and empirically evaluated on the Visual Question Answering (VQA) v2 dataset subset. The results of these adversarial attacks are then quantified using the standard VQA accuracy metric. This evaluation is then compared with the accuracy degradation (accuracy drop) of LLaVA and Llama 3.2 Vision. A key finding is that Llama 3.2 Vision, despite a lower baseline accuracy in this setup, exhibited a smaller drop in performance under attack compared to LLaVA, particularly at higher perturbation levels. Overall, the findings confirm that the vision modality represents a viable attack vector for degrading the performance of contemporary open-weight VLMs, including Meta’s Llama 3.2 Vision. Furthermore, they highlight that adversarial robustness does not necessarily correlate directly with standard benchmark performance and may be influenced by underlying architectural and training factors.
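For reference, a standard untargeted L-infinity PGD attack on the image input, the attack family used in the paper, looks roughly like the sketch below; `model` and `loss_fn` abstract away the VLM's vision pathway and task loss, and the step sizes are common defaults rather than the paper's settings.

```python
# Generic untargeted PGD on an image input (L_inf ball), sketched in PyTorch.
import torch

def pgd_attack(model, loss_fn, image, target, eps=8/255, alpha=2/255, steps=10):
    adv = image.clone().detach()
    adv = (adv + torch.empty_like(adv).uniform_(-eps, eps)).clamp(0, 1)  # random start
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = loss_fn(model(adv), target)
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv + alpha * grad.sign()                  # ascend the task loss
            adv = image + (adv - image).clamp(-eps, eps)     # project onto the L_inf ball
            adv = adv.clamp(0, 1)
        adv = adv.detach()
    return adv
```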
zh
[CV-4] Diffusion Forcing for Multi-Agent Interaction Sequence Modeling
【速读】:该论文旨在解决多智能体运动生成中长期时序建模困难、智能体间强依赖关系难以捕捉以及群体规模可变性带来的挑战,尤其针对现有方法在任务特异性与灵活多智能体生成之间缺乏通用性的局限。其核心解决方案是提出MAGNet(Multi-Agent Diffusion Forcing Transformer),一个统一的自回归扩散框架,通过引入显式的智能体耦合建模机制,在自回归去噪过程中明确刻画智能体间的交互动力学,从而实现从双人互动到三人及以上群体的多样化交互任务(如舞蹈、拳击等同步行为和社交互动)的高质量生成,并支持超长序列(数百帧)的连贯运动合成。
链接: https://arxiv.org/abs/2512.17900
作者: Vongani H. Maluleke,Kie Horiuchi,Lea Wilken,Evonne Ng,Jitendra Malik,Angjoo Kanazawa
机构: Sony Group Corporation(索尼集团); Meta; UC Berkeley(加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Understanding and generating multi-person interactions is a fundamental challenge with broad implications for robotics and social computing. While humans naturally coordinate in groups, modeling such interactions remains difficult due to long temporal horizons, strong inter-agent dependencies, and variable group sizes. Existing motion generation methods are largely task-specific and do not generalize to flexible multi-agent generation. We introduce MAGNet (Multi-Agent Diffusion Forcing Transformer), a unified autoregressive diffusion framework for multi-agent motion generation that supports a wide range of interaction tasks through flexible conditioning and sampling. MAGNet performs dyadic prediction, partner inpainting, and full multi-agent motion generation within a single model, and can autoregressively generate ultra-long sequences spanning hundreds of frames. Building on Diffusion Forcing, we introduce key modifications that explicitly model inter-agent coupling during autoregressive denoising, enabling coherent coordination across agents. As a result, MAGNet captures both tightly synchronized activities (e.g., dancing, boxing) and loosely structured social interactions. Our approach performs on par with specialized methods on dyadic benchmarks while naturally extending to polyadic scenarios involving three or more interacting people, enabled by a scalable architecture that is agnostic to the number of agents. We refer readers to the supplemental video, where the temporal dynamics and spatial coordination of generated interactions are best appreciated. Project page: this https URL
zh
[CV-5] RadarGen: Automotive Radar Point Cloud Generation from Cameras
【速读】:该论文旨在解决多模态感知中雷达点云数据生成难题,即如何从多视角相机图像中合成真实且物理合理的汽车雷达点云(Radar Point Clouds),以支持统一的多传感器生成式仿真。其解决方案的关键在于提出RadarGen——一种基于扩散模型(Diffusion Model)的生成框架,通过将雷达测量表示为鸟瞰图(Bird’s-Eye-View, BEV)形式,编码空间结构、雷达散射截面(Radar Cross Section, RCS)和多普勒属性,并结合预训练基础模型提取的BEV对齐深度、语义与运动线索,引导随机生成过程逼近真实雷达特征分布;同时引入轻量级恢复步骤将生成的BEV图映射回点云,从而实现高保真度、可扩展的多模态数据合成。
链接: https://arxiv.org/abs/2512.17897
作者: Tomer Borreda,Fangqiang Ding,Sanja Fidler,Shengyu Huang,Or Litany
机构: Technion (以色列理工学院); MIT (麻省理工学院); NVIDIA (英伟达); University of Toronto (多伦多大学); Vector Institute (矢量研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Project page: this https URL
Abstract:We present RadarGen, a diffusion model for synthesizing realistic automotive radar point clouds from multi-view camera imagery. RadarGen adapts efficient image-latent diffusion to the radar domain by representing radar measurements in bird’s-eye-view form that encodes spatial structure together with radar cross section (RCS) and Doppler attributes. A lightweight recovery step reconstructs point clouds from the generated maps. To better align generation with the visual scene, RadarGen incorporates BEV-aligned depth, semantic, and motion cues extracted from pretrained foundation models, which guide the stochastic generation process toward physically plausible radar patterns. Conditioning on images makes the approach broadly compatible, in principle, with existing visual datasets and simulation frameworks, offering a scalable direction for multimodal generative simulation. Evaluations on large-scale driving data show that RadarGen captures characteristic radar measurement distributions and reduces the gap to perception models trained on real data, marking a step toward unified generative simulation across sensing modalities.
zh
[CV-6] Keypoint Counting Classifiers: Turning Vision Transformers into Self-Explainable Models Without Training
【速读】:该论文旨在解决基于视觉Transformer(Vision Transformer, ViT)的通用基础模型缺乏透明性和可解释性的问题,当前自解释模型(Self-Explainable Models, SEMs)通常依赖复杂的训练过程和特定架构,难以适配ViT这类现代基础模型。其解决方案的关键在于提出一种无需重新训练即可将任意训练好的ViT模型转化为SEM的方法,称为关键点计数分类器(Keypoint Counting Classifiers, KCCs),该方法利用ViT自动识别图像间高精度匹配关键点的能力,构建出直观且可视化的决策流程,从而显著提升人机交互效果。
链接: https://arxiv.org/abs/2512.17891
作者: Kristoffer Wickstrøm,Teresa Dorszewski,Siyan Chen,Michael Kampffmeyer,Elisabeth Wetzer,Robert Jenssen
机构: UiT The Arctic University of Norway (挪威北极大学); Technical University of Denmark (丹麦技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Current approaches for designing self-explainable models (SEMs) require complicated training procedures and specific architectures which makes them impractical. With the advance of general purpose foundation models based on Vision Transformers (ViTs), this impracticability becomes even more problematic. Therefore, new methods are necessary to provide transparency and reliability to ViT-based foundation models. In this work, we present a new method for turning any well-trained ViT-based model into a SEM without retraining, which we call Keypoint Counting Classifiers (KCCs). Recent works have shown that ViTs can automatically identify matching keypoints between images with high precision, and we build on these results to create an easily interpretable decision process that is inherently visualizable in the input. We perform an extensive evaluation which shows that KCCs improve the human-machine communication compared to recent baselines. We believe that KCCs constitute an important step towards making ViT-based foundation models more transparent and reliable.
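A rough sketch of the keypoint-counting idea, assuming a frozen ViT that returns patch features (`patch_features` is a placeholder): matches between a test image and class exemplars are found via mutual nearest neighbours, and the class with the most matches wins, so the matched keypoints themselves serve as the visualizable evidence.

```python
# Hedged sketch of a keypoint counting classifier built on frozen ViT features.
import torch

def patch_features(image) -> torch.Tensor:
    raise NotImplementedError("run a frozen ViT and return (num_patches, dim) features")

def count_matches(fa: torch.Tensor, fb: torch.Tensor) -> int:
    fa = torch.nn.functional.normalize(fa, dim=-1)
    fb = torch.nn.functional.normalize(fb, dim=-1)
    sim = fa @ fb.T                               # (Na, Nb) cosine similarities
    ab = sim.argmax(dim=1)                        # best match in b for each patch of a
    ba = sim.argmax(dim=0)                        # best match in a for each patch of b
    mutual = ba[ab] == torch.arange(fa.shape[0])  # keep mutual nearest neighbours only
    return int(mutual.sum())

def classify(test_image, exemplars_by_class):
    f_test = patch_features(test_image)
    counts = {c: sum(count_matches(f_test, patch_features(x)) for x in xs)
              for c, xs in exemplars_by_class.items()}
    return max(counts, key=counts.get), counts    # counts double as interpretable evidence
```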
zh
[CV-7] Visually Prompted Benchmarks Are Surprisingly Fragile
【速读】:该论文旨在解决当前视觉语言模型(Visual Language Models, VLMs)在基于视觉提示(visual prompting)的评估中存在显著不稳定性的问题,即模型性能和排行榜排名容易受视觉提示细节(如标记颜色、大小、JPEG压缩级别等)的微小变化影响,从而导致评估结果不可靠。解决方案的关键在于系统性地识别并量化这些看似无关的视觉提示设计因素对模型表现的影响,并在此基础上构建一个更稳定、更具代表性的新基准VPBench——通过整合现有数据集并引入16种视觉标记变体,提升评估的鲁棒性和公平性,同时公开工具以支持未来更可靠的VLM评测。
链接: https://arxiv.org/abs/2512.17875
作者: Haiwen Feng,Long Lian,Lisa Dunlap,Jiahao Shu,XuDong Wang,Renhao Wang,Trevor Darrell,Alane Suhr,Angjoo Kanazawa
机构: UC Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:A key challenge in evaluating VLMs is testing models’ ability to analyze visual content independently from their textual priors. Recent benchmarks such as BLINK probe visual perception through visual prompting, where questions about visual content are paired with coordinates to which the question refers, with the coordinates explicitly marked in the image itself. While these benchmarks are an important part of VLM evaluation, we find that existing models are surprisingly fragile to seemingly irrelevant details of visual prompting: simply changing a visual marker from red to blue can completely change rankings among models on a leaderboard. By evaluating nine commonly-used open- and closed-source VLMs on two visually prompted tasks, we demonstrate how details in benchmark setup, including visual marker design and dataset size, have a significant influence on model performance and leaderboard rankings. These effects can even be exploited to lift weaker models above stronger ones; for instance, slightly increasing the size of the visual marker results in open-source InternVL3-8B ranking alongside or better than much larger proprietary models like Gemini 2.5 Pro. We further show that low-level inference choices that are often ignored in benchmarking, such as JPEG compression levels in API calls, can also cause model lineup changes. These details have substantially larger impacts on visually prompted benchmarks than on conventional semantic VLM evaluations. To mitigate this instability, we curate existing datasets to create VPBench, a larger visually prompted benchmark with 16 visual marker variants. VPBench and additional analysis tools are released at this https URL.
zh
[CV-8] InSPECT: Invariant Spectral Features Preservation of Diffusion Models
【速读】:该论文旨在解决传统扩散模型(Diffusion Models, DMs)在前向扩散过程中将数据逐步扰动至白噪声,再通过复杂预测任务进行逆向重构所导致的计算效率低、收敛困难的问题。其核心解决方案是提出InSPECT(Invariant Spectral Feature-Preserving Diffusion Model),关键在于在前向和反向扩散过程中均保持不变的频谱特征(invariant spectral features),使傅里叶系数平滑收敛至指定随机噪声,从而在保留图像结构信息的同时维持多样性和随机性,显著提升了生成质量与多样性,并加速了模型收敛速度。
链接: https://arxiv.org/abs/2512.17873
作者: Baohua Yan,Qingyuan Liu,Jennifer Kava,Xuan Di
机构: Columbia University (哥伦比亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Modern diffusion models (DMs) have achieved state-of-the-art image generation. However, the fundamental design choice of diffusing data all the way to white noise and then reconstructing it leads to an extremely difficult and computationally intractable prediction task. To overcome this limitation, we propose InSPECT (Invariant Spectral Feature-Preserving Diffusion Model), a novel diffusion model that keeps invariant spectral features during both the forward and backward processes. At the end of the forward process, the Fourier coefficients smoothly converge to a specified random noise, enabling features preservation while maintaining diversity and randomness. By preserving invariant features, InSPECT demonstrates enhanced visual diversity, faster convergence rate, and a smoother diffusion process. Experiments on CIFAR-10, Celeb-A, and LSUN demonstrate that InSPECT achieves on average a 39.23% reduction in FID and 45.80% improvement in IS against DDPM for 10K iterations under specified parameter settings, which demonstrates the significant advantages of preserving invariant features: achieving superior generation quality and diversity, while enhancing computational efficiency and enabling faster convergence rate. To the best of our knowledge, this is the first attempt to analyze and preserve invariant spectral features in diffusion models.
zh
[CV-9] Interpretable Plant Leaf Disease Detection Using Attention-Enhanced CNN
【速读】:该论文旨在解决植物病害检测中准确性和可解释性不足的问题,尤其在复杂多样的植物叶片图像背景下,传统深度学习模型往往缺乏对病变区域的定位能力与决策依据。其解决方案的关键在于提出一种基于注意力机制的卷积神经网络(CBAM-VGG16),通过在每个卷积阶段嵌入Convolution Block Attention Module(CBAM),增强特征提取能力并实现对病害区域的精准定位。该方法不仅在五个不同植物病害数据集上达到高达98.87%的准确率,还借助CBAM注意力图、Grad-CAM、Grad-CAM++和Layer-wise Relevance Propagation(LRP)等可解释性分析工具,提供了模型决策过程的透明化支持,从而推动了可解释人工智能(Explainable AI)在农业诊断中的应用。
链接: https://arxiv.org/abs/2512.17864
作者: Balram Singh,Ram Prakash Sharma,Somnath Dey
机构: NIT Hamirpur (印度理工学院哈米尔普尔分校); IIT Indore (印度理工学院印多尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 27 pages, 12 figures
Abstract:Plant diseases pose a significant threat to global food security, necessitating accurate and interpretable disease detection methods. This study introduces an interpretable attention-guided Convolutional Neural Network (CNN), CBAM-VGG16, for plant leaf disease detection. By integrating Convolution Block Attention Module (CBAM) at each convolutional stage, the model enhances feature extraction and disease localization. Trained on five diverse plant disease datasets, our approach outperforms recent techniques, achieving high accuracy (up to 98.87%) and demonstrating robust generalization. Here, we show the effectiveness of our method through comprehensive evaluation and interpretability analysis using CBAM attention maps, Grad-CAM, Grad-CAM++, and Layer-wise Relevance Propagation (LRP). This study advances the application of explainable AI in agricultural diagnostics, offering a transparent and reliable system for smart farming. The code of our proposed work is available at this https URL.
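For readers unfamiliar with the attention block, a compact PyTorch rendition of CBAM (channel attention followed by spatial attention) is given below; it follows the original CBAM formulation, and the hyperparameters used in the paper's CBAM-VGG16 may differ.

```python
# Compact CBAM block: channel attention then spatial attention (illustrative).
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(                 # shared MLP for channel attention
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention from average- and max-pooled channel descriptors.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        ca = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        x = x * ca
        # Spatial attention from channel-wise mean and max maps.
        sa_in = torch.cat([x.mean(dim=1, keepdim=True),
                           x.amax(dim=1, keepdim=True)], dim=1)
        sa = torch.sigmoid(self.spatial(sa_in))
        return x * sa
```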
zh
[CV-10] Simulation-Driven Deep Learning Framework for Raman Spectral Denoising Under Fluorescence-Dominant Conditions
【速读】:该论文旨在解决拉曼光谱(Raman spectroscopy)在生物组织分析中因弱拉曼散射信号和强荧光背景干扰而导致的信噪比低、信号质量差的问题。其解决方案的关键在于提出了一种基于物理信息驱动的去噪框架,该框架结合了统计学基础的噪声建模与深度学习技术:首先系统地建模了主要噪声源,进而生成具有生物学真实性的拉曼光谱数据用于训练级联深度神经网络,该网络能够联合抑制随机探测器噪声和荧光基线干扰,从而显著提升光谱质量,实现更快速、准确的拉曼组织分析。
链接: https://arxiv.org/abs/2512.17852
作者: Mengkun Chen,Sanidhya D. Tripathi,James W. Tunnell
机构: The University of Texas at Austin, Department of Biomedical Engineering (德克萨斯大学奥斯汀分校生物医学工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Raman spectroscopy enables non-destructive, label-free molecular analysis with high specificity, making it a powerful tool for biomedical diagnostics. However, its application to biological tissues is challenged by inherently weak Raman scattering and strong fluorescence background, which significantly degrade signal quality. In this study, we present a simulation-driven denoising framework that combines a statistically grounded noise model with deep learning to enhance Raman spectra acquired under fluorescence-dominated conditions. We comprehensively modeled major noise sources. Based on this model, we generated biologically realistic Raman spectra and used them to train a cascaded deep neural network designed to jointly suppress stochastic detector noise and fluorescence baseline interference. To evaluate the performance of our approach, we simulated human skin spectra derived from real experimental data as a validation case study. Our results demonstrate the potential of physics-informed learning to improve spectral quality and enable faster, more accurate Raman-based tissue analysis.
zh
[CV-11] InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models
【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)扩散模型在生成图像时难以准确捕捉文本提示中指定的空间关系的问题。这一局限性主要源于训练数据中缺乏细粒度的空间监督信号,以及文本嵌入无法有效编码空间语义信息。解决方案的关键在于提出一种无需训练的推理阶段方法 InfSplign,其通过在每个去噪步骤中引入复合损失函数来调整噪声,该损失利用骨干解码器中提取的不同层级交叉注意力图(cross-attention maps),以强制对象在生成过程中的精确位置放置和平衡的对象存在性。该方法轻量、可即插即用,并兼容任意扩散模型主干架构,在VISOR和T2I-CompBench基准上实现了当前最优性能。
链接: https://arxiv.org/abs/2512.17851
作者: Sarah Rastegar,Violeta Chatalbasheva,Sieger Falkena,Anuj Singh,Yanbo Wang,Tejas Gokhale,Hamid Palangi,Hadi Jamali-Rad
机构: Delft University of Technology (代尔夫特理工大学); University of Maryland, Baltimore County (马里兰大学巴尔的摩县分校); Shell Information Technology International (壳牌信息技术国际公司); Google (谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Text-to-image (T2I) diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. This limitation can be traced to two factors: lack of fine-grained spatial supervision in training data and inability of text embeddings to encode spatial semantics. We introduce InfSplign, a training-free inference-time method that improves spatial alignment by adjusting the noise through a compound loss in every denoising step. The proposed loss leverages different levels of cross-attention maps extracted from the backbone decoder to enforce accurate object placement and a balanced object presence during sampling. The method is lightweight, plug-and-play, and compatible with any diffusion backbone. Our comprehensive evaluations on VISOR and T2I-CompBench show that InfSplign establishes a new state-of-the-art (to the best of our knowledge), achieving substantial performance gains over the strongest existing inference-time baselines and even outperforming the fine-tuning-based methods. Codebase is available at GitHub.
zh
[CV-12] ReX-MLE: The Autonomous Agent Benchmark for Medical Imaging Challenges
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的自主编码代理在复杂、领域特定的科学问题中表现不佳的问题,尤其聚焦于医学影像这一高要求领域。其解决方案的关键在于提出ReX-MLE基准测试集,该基准包含20个源自高影响力医学影像竞赛的真实挑战,覆盖多种成像模态和任务类型,并首次系统评估代理在真实计算与时间约束下完成从数据预处理、模型训练到结果提交的端到端工作流能力。通过该基准对主流代理(如AIDE、ML-Master、RD-Agent)的评估揭示了现有方法在领域知识和工程实现上的显著短板,从而为开发具备领域感知能力的自主人工智能系统提供了关键评测工具和改进方向。
链接: https://arxiv.org/abs/2512.17838
作者: Roshan Kenia,Xiaoman Zhang,Pranav Rajpurkar
机构: Harvard Medical School (哈佛医学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL
Abstract:Autonomous coding agents built on large language models (LLMs) can now solve many general software and machine learning tasks, but they remain ineffective on complex, domain-specific scientific problems. Medical imaging is a particularly demanding domain, requiring long training cycles, high-dimensional data handling, and specialized preprocessing and validation pipelines, capabilities not fully measured in existing agent benchmarks. To address this gap, we introduce ReX-MLE, a benchmark of 20 challenges derived from high-impact medical imaging competitions spanning diverse modalities and task types. Unlike prior ML-agent benchmarks, ReX-MLE evaluates full end-to-end workflows, requiring agents to independently manage data preprocessing, model training, and submission under realistic compute and time constraints. Evaluating state-of-the-art agents (AIDE, ML-Master, RD-Agent) with different LLM backends (GPT-5, Gemini, Claude), we observe a severe performance gap: most submissions rank in the 0th percentile compared to human experts. Failures stem from domain-knowledge and engineering limitations. ReX-MLE exposes these bottlenecks and provides a foundation for developing domain-aware autonomous AI systems.
zh
[CV-13] Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding
【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)场景表示中缺乏对基础特征进行有效编码的问题,即如何从其原始几何与外观基元中提取通用且结构化的特征表示。解决方案的关键在于提出Chorus框架——一个基于多教师蒸馏的预训练方法,通过融合来自语言对齐、通用性和对象感知等不同来源的2D基础模型信号,构建一个共享的前馈式3DGS场景编码器。该框架采用共享3D编码器与教师特定投影模块相结合的方式,在统一嵌入空间中整合从高层语义到细粒度结构的多层次信息,从而实现跨任务的强泛化能力,并在数据效率和迁移性能上显著优于点云基线方法。
链接: https://arxiv.org/abs/2512.17817
作者: Yue Li,Qi Ma,Runyi Yang,Mengjiao Ma,Bin Ren,Nikola Popovic,Nicu Sebe,Theo Gevers,Luc Van Gool,Danda Pani Paudel,Martin R. Oswald
机构: University of Amsterdam (阿姆斯特丹大学); ETH Zürich (苏黎世联邦理工学院); INSAIT, Sofia University “St. Kliment Ohridski” (INSAIT,索非亚大学“圣克莱门特·奥赫里德斯基”); Nanjing University of Aeronautics and Astronautics (南京航空航天大学); University of Trento (特伦托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While 3DGS has emerged as a high-fidelity scene representation, encoding rich, general-purpose features directly from its primitives remains under-explored. We address this gap by introducing Chorus, a multi-teacher pretraining framework that learns a holistic feed-forward 3D Gaussian Splatting (3DGS) scene encoder by distilling complementary signals from 2D foundation models. Chorus employs a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware teachers, encouraging a shared embedding space that captures signals from high-level semantics to fine-grained structure. We evaluate Chorus on a wide range of tasks: open-vocabulary semantic and instance segmentation, linear and decoder probing, as well as data-efficient supervision. Besides 3DGS, we also test Chorus on several benchmarks that only support point clouds by pretraining a variant using only Gaussians’ centers, colors, estimated normals as inputs. Interestingly, this encoder shows strong transfer and outperforms the point clouds baseline while using 39.9 times fewer training scenes. Finally, we propose a render-and-distill adaptation that facilitates out-of-domain finetuning. Our code and model will be released upon publication.
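多教师蒸馏中"共享编码器 + 教师专属投影头"的结构可以用如下极简 PyTorch 示意(非官方实现;2D 教师特征如何反投影到各高斯基元上属于论文细节,此处作为输入假设给出):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTeacherDistiller(nn.Module):
    """共享 3D 编码器 + 每个 2D 教师一个投影头的蒸馏框架(示意)。"""

    def __init__(self, encoder, teacher_dims, embed_dim=256):
        super().__init__()
        self.encoder = encoder                                  # 假设: (B, N, C_in) -> (B, N, embed_dim)
        self.projectors = nn.ModuleDict(
            {name: nn.Linear(embed_dim, dim) for name, dim in teacher_dims.items()}
        )

    def forward(self, gaussians, teacher_feats):
        z = self.encoder(gaussians)                             # 所有教师共用的嵌入空间
        losses = {}
        for name, target in teacher_feats.items():              # target: (B, N, teacher_dims[name])
            pred = self.projectors[name](z)
            losses[name] = 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
        return sum(losses.values()), losses

# 用法示意(维度为假设):teacher_dims = {"clip": 512, "dino": 768, "sam": 256}
```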
zh
[CV-14] Animate Any Character in Any World
【速读】:该论文旨在解决现有世界模型在交互性与可控性方面的局限性问题:静态世界生成模型虽能构建高保真3D环境,但缺乏主动代理的交互能力;而可控实体模型仅支持单一实体执行有限动作,难以实现用户指定角色的多样化行为控制。解决方案的关键在于提出AniX,其通过将静态世界生成的结构化真实性与可控实体模型相结合,支持用户指定角色在自然语言指令下执行从基础移动到以物体为中心的复杂交互等开放动作,并生成时序一致、视觉保真的视频片段。该方法将视频生成建模为条件自回归问题,基于预训练视频生成器进行优化,显著提升运动动态性的同时保持跨动作和角色的泛化能力。
链接: https://arxiv.org/abs/2512.17796
作者: Yitong Wang,Fangyun Wei,Hongyang Zhang,Bo Dai,Yan Lu
机构: Fudan University (复旦大学); Microsoft Research; University of Waterloo (滑铁卢大学); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL
Abstract:Recent advances in world models have greatly enhanced interactive environment simulation. Existing methods mainly fall into two categories: (1) static world generation models, which construct 3D environments without active agents, and (2) controllable-entity models, which allow a single entity to perform limited actions in an otherwise uncontrollable environment. In this work, we introduce AniX, leveraging the realism and structural grounding of static world generation while extending controllable-entity models to support user-specified characters capable of performing open-ended actions. Users can provide a 3DGS scene and a character, then direct the character through natural language to perform diverse behaviors from basic locomotion to object-centric interactions while freely exploring the environment. AniX synthesizes temporally coherent video clips that preserve visual fidelity with the provided scene and character, formulated as a conditional autoregressive video generation problem. Built upon a pre-trained video generator, our training strategy significantly enhances motion dynamics while maintaining generalization across actions and characters. Our evaluation covers a broad range of aspects, including visual quality, character consistency, action controllability, and long-horizon coherence.
zh
[CV-15] Long-Range depth estimation using learning based Hybrid Distortion Model for CCTV cameras
【速读】:该论文旨在解决长距离下基于双目相机(stereo-camera)的三维定位精度不足的问题,其核心挑战在于传统镜头畸变模型无法准确描述远距离场景中复杂的非线性光学畸变。解决方案的关键在于提出一种混合建模框架:首先扩展传统的畸变模型以引入高阶项以更好地拟合基础非线性特性,随后利用神经网络构建残差校正模型对剩余误差进行补偿,从而显著提升远距离(可达5公里)下的三维坐标估计精度。该方法结合了物理先验与数据驱动的优势,有效克服了纯神经网络直接建模时难以收敛的问题,实现了高精度、鲁棒的长距离摄影测量定位。
链接: https://arxiv.org/abs/2512.17784
作者: Ami Pandat,Punna Rajasekhar,G.Aravamuthan,Gopika Vinod,Rohit Shukla
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate camera models are essential for photogrammetry applications such as 3D mapping and object localization, particularly for long distances. Various stereo-camera based 3D localization methods are available but are limited to a range of a few hundred meters. This is mainly due to the limitation of the distortion models assumed for the non-linearities present in the camera lens. This paper presents a framework for modeling a suitable distortion model that can be used for localizing objects at longer distances. It is well known that neural networks can be a better alternative to model a highly complex non-linear lens distortion function; in practice, however, it is observed that a direct application of neural networks to distortion models fails to converge when estimating the camera parameters. To resolve this, a hybrid approach is presented in this paper where the conventional distortion models are initially extended to incorporate higher-order terms and then enhanced using a neural network based residual correction model. This hybrid approach substantially improves long-range localization performance and is capable of estimating the 3D position of objects at distances of up to 5 kilometres. The estimated 3D coordinates are transformed to GIS coordinates and are plotted on a GIS map for visualization. Experimental validation demonstrates the robustness and effectiveness of the proposed framework, offering a practical solution for calibrating CCTV cameras for long-range photogrammetry applications.
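"传统畸变模型扩展高阶项 + 神经网络残差修正"的混合思路可以用如下极简 PyTorch 示意(非官方实现;多项式阶数与网络结构均为假设):

```python
import torch
import torch.nn as nn

class HybridDistortionModel(nn.Module):
    """高阶径向畸变多项式 + MLP 残差修正(示意)。输入为归一化图像坐标 (N, 2)。"""

    def __init__(self, poly_order=4, hidden=64):
        super().__init__()
        self.k = nn.Parameter(torch.zeros(poly_order))        # k1..k_n 径向畸变系数
        self.residual = nn.Sequential(                        # 仅学习传统模型拟合后的残余误差
            nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 2)
        )

    def forward(self, xy):
        r2 = (xy ** 2).sum(dim=-1, keepdim=True)
        radial = 1.0 + sum(self.k[i] * r2 ** (i + 1) for i in range(len(self.k)))
        return xy * radial + self.residual(xy)
```

先让经典模型吸收主要的非线性,再让网络只拟合小幅残差,大致对应文中"直接用神经网络拟合整条畸变函数难以收敛"问题的规避方式。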
zh
[CV-16] UrbanDIFF: A Denoising Diffusion Model for Spatial Gap Filling of Urban Land Surface Temperature Under Dense Cloud Cover
【速读】:该论文旨在解决卫星遥感获取的陆地表面温度(Land Surface Temperature, LST)产品在城市热岛(Surface Urban Heat Island, SUHI)监测中因云层遮挡导致数据缺失的问题。现有方法多依赖多时相信息或传感器融合,但在持续云覆盖下难以应用;而传统空间插值或深度学习模型在大范围连续缺失时性能显著下降。解决方案的关键在于提出UrbanDIFF——一种纯空间的去噪扩散图像修复模型,其创新性体现在两个方面:一是利用静态城市结构信息(如建筑用地分布和数字高程模型)作为条件引导重建过程,增强空间一致性;二是引入监督像素引导的精修步骤,在推理阶段严格保持与未被云遮挡像素的一致性,从而在高达85%云覆盖率下仍能稳定恢复LST图像,实现SSIM 0.89、RMSE 1.2 K、R² 0.84的优异性能,并表现出随云密度增加时更缓慢的性能退化趋势。
链接: https://arxiv.org/abs/2512.17782
作者: Arya Chavoshi,Hassan Dashtian,Naveen Sudharsan,Dev Niyogi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Satellite-derived Land Surface Temperature (LST) products are central to surface urban heat island (SUHI) monitoring due to their consistent grid-based coverage over large metropolitan regions. However, cloud contamination frequently obscures LST observations, limiting their usability for continuous SUHI analysis. Most existing LST reconstruction methods rely on multitemporal information or multisensor data fusion, requiring auxiliary observations that may be unavailable or unreliable under persistent cloud cover. Purely spatial gap-filling approaches offer an alternative, but traditional statistical methods degrade under large or spatially contiguous gaps, while many deep learning based spatial models deteriorate rapidly with increasing missingness. Recent advances in denoising diffusion based image inpainting models have demonstrated improved robustness under high missingness, motivating their adoption for spatial LST reconstruction. In this work, we introduce UrbanDIFF, a purely spatial denoising diffusion model for reconstructing cloud contaminated urban LST imagery. The model is conditioned on static urban structure information, including built-up surface data and a digital elevation model, and enforces strict consistency with revealed cloud free pixels through a supervised pixel guided refinement step during inference. UrbanDIFF is trained and evaluated using NASA MODIS Terra LST data from seven major United States metropolitan areas spanning 2002 to 2025. Experiments using synthetic cloud masks with 20 to 85 percent coverage show that UrbanDIFF consistently outperforms an interpolation baseline, particularly under dense cloud occlusion, achieving SSIM of 0.89, RMSE of 1.2 K, and R2 of 0.84 at 85 percent cloud coverage, while exhibiting slower performance degradation as cloud density increases.
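摘要中"与未被云遮挡像素严格保持一致"的约束,常见做法是在反向采样的每一步把已观测像素替换为按当前噪声水平加噪后的观测值(RePaint 式的已知区域约束)。下面是一个 NumPy 极简示意(非官方实现,具体细节为假设):

```python
import numpy as np

def enforce_revealed_pixels(x_t, obs, mask, alpha_bar_t, rng):
    """在扩散采样的某一步,用加噪后的无云观测覆盖对应像素(示意)。

    x_t: 当前采样状态;obs: 无云区域的 LST 观测;mask: 1 表示无云像素;
    alpha_bar_t: 该时间步的累计噪声调度系数;rng: np.random.Generator。
    """
    noise = rng.standard_normal(obs.shape)
    obs_t = np.sqrt(alpha_bar_t) * obs + np.sqrt(1.0 - alpha_bar_t) * noise
    return mask * obs_t + (1.0 - mask) * x_t
```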
zh
[CV-17] LiteGE: Lightweight Geodesic Embedding for Efficient Geodesics Computation and Non-Isometric Shape Correspondence
【速读】:该论文旨在解决3D表面测地距离(geodesic distance)计算在资源受限或交互式场景中因现有学习方法依赖大型3D神经网络而导致内存占用高、推理延迟大的问题。其解决方案的关键在于提出LiteGE,一种轻量级方法:通过在信息丰富的体素上对无符号距离场(unsigned distance field, UDF)样本进行主成分分析(PCA),构建紧凑且类别感知的形状描述符;该描述符计算高效,无需高容量网络,并在稀疏点云(低至300点)下仍保持鲁棒性,从而显著降低内存消耗和推理时间(最高达300倍压缩),同时借助测地距离与形状对应之间的内在关系实现快速准确的形状匹配(相较最优网格方法提速最高达1000倍)。
链接: https://arxiv.org/abs/2512.17781
作者: Yohanes Yudhi Adikusuma,Qixing Huang,Ying He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:Computing geodesic distances on 3D surfaces is fundamental to many tasks in 3D vision and geometry processing, with deep connections to tasks such as shape correspondence. Recent learning-based methods achieve strong performance but rely on large 3D backbones, leading to high memory usage and latency, which limit their use in interactive or resource-constrained settings. We introduce LiteGE, a lightweight approach that constructs compact, category-aware shape descriptors by applying PCA to unsigned distance field (UDF) samples at informative voxels. This descriptor is efficient to compute and removes the need for high-capacity networks. LiteGE remains robust on sparse point clouds, supporting inputs with as few as 300 points, where prior methods fail. Extensive experiments show that LiteGE reduces memory usage and inference time by up to 300× compared to existing neural approaches. In addition, by exploiting the intrinsic relationship between geodesic distance and shape correspondence, LiteGE enables fast and accurate shape matching. Our method achieves up to 1000× speedup over state-of-the-art mesh-based approaches while maintaining comparable accuracy on non-isometric shape pairs, including evaluations on point-cloud inputs.
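"在信息体素上对 UDF 采样做 PCA 得到紧凑描述符"这一步可以用 NumPy 写成如下极简示意(非官方实现;信息体素如何挑选属于论文设计,此处假设已给定):

```python
import numpy as np

def fit_pca_descriptor(udf_samples, n_components=64):
    """udf_samples: (num_shapes, num_voxels),每行是一个形状在同一组信息体素上的 UDF 值。"""
    mean = udf_samples.mean(axis=0, keepdims=True)
    X = udf_samples - mean
    _, _, Vt = np.linalg.svd(X, full_matrices=False)   # 经济型 SVD,Vt 的行即主成分方向
    basis = Vt[:n_components]
    codes = X @ basis.T                                # 每个形状的低维类感知描述符
    return codes, basis, mean

def encode_new_shape(udf, basis, mean):
    return (udf - mean) @ basis.T                      # 新形状直接投影到已学得的主成分上
```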
zh
[CV-18] Pix2NPHM: Learning to Regress NPHM Reconstructions From a Single Image ATC WWW
【速读】:该论文旨在解决神经参数化头部模型(Neural Parametric Head Models, NPHMs)在视觉输入下拟合困难的问题,尤其是其潜在空间表达能力强导致的优化不稳定与重建精度不足。解决方案的关键在于提出Pix2NPHM,一种基于视觉Transformer(Vision Transformer, ViT)的端到端网络,能够直接从单张图像回归NPHM参数;通过利用领域特定的ViT主干网络(预训练于几何预测任务)和混合3D/2D数据集(包括超10万条NPHM注册样本及大规模2D视频数据,以法向量估计作为伪真值),实现了高保真面部几何重建与准确表情再现,并支持推理时优化进一步提升几何质量,从而在真实场景数据上实现可扩展、高质量的三维人脸重建。
链接: https://arxiv.org/abs/2512.17773
作者: Simon Giebenhain,Tobias Kirschstein,Liam Schoneveld,Davide Davoli,Zhe Chen,Matthias Nießner
机构: Technical University of Munich (慕尼黑工业大学); Woven by Toyota (丰田编织公司); Toyota Motor Europe (丰田欧洲汽车公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project website: this https URL , Video: this https URL
Abstract:Neural Parametric Head Models (NPHMs) are a recent advancement over mesh-based 3d morphable models (3DMMs) to facilitate high-fidelity geometric detail. However, fitting NPHMs to visual inputs is notoriously challenging due to the expressive nature of their underlying latent space. To this end, we propose Pix2NPHM, a vision transformer (ViT) network that directly regresses NPHM parameters, given a single image as input. Compared to existing approaches, the neural parametric space allows our method to reconstruct more recognizable facial geometry and accurate facial expressions. For broad generalization, we exploit domain-specific ViTs as backbones, which are pretrained on geometric prediction tasks. We train Pix2NPHM on a mixture of 3D data, including a total of over 100K NPHM registrations that enable direct supervision in SDF space, and large-scale 2D video datasets, for which normal estimates serve as pseudo ground truth geometry. Pix2NPHM not only allows for 3D reconstructions at interactive frame rates, it is also possible to improve geometric fidelity by a subsequent inference-time optimization against estimated surface normals and canonical point maps. As a result, we achieve unprecedented face reconstruction quality that can run at scale on in-the-wild data.
zh
[CV-19] AdaptPrompt: Parameter-Efficient Adaptation of VLMs for Generalizable Deepfake Detection
【速读】:该论文旨在解决深度伪造(deepfake)检测模型在面对未见过的生成模型时泛化能力差的问题,即现有检测器通常在特定生成技术(如GAN)上训练后,在其他生成方式(如扩散模型或商用工具)产生的合成内容上性能显著下降。解决方案的关键在于利用大规模视觉-语言模型CLIP,结合两个核心创新:一是构建Diff-Gen数据集,包含10万张扩散模型生成的伪造图像,涵盖更广泛的频域伪影特征,从而提升跨域泛化能力;二是提出AdaptPrompt框架,通过参数高效微调策略联合学习任务相关的文本提示(textual prompts)和视觉适配器(visual adapters),同时冻结CLIP主干网络,并通过层剪枝优化保留高频生成伪影信息,显著提升检测准确率。
链接: https://arxiv.org/abs/2512.17730
作者: Yichen Jiang,Mohammed Talha Alam,Sohail Ahmed Khan,Duc-Tien Dang-Nguyen,Fakhri Karray
机构: University of Waterloo (滑铁卢大学); MBZUAI (穆罕默德·本·扎耶德人工智能大学); University of Bergen (卑尔根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review
Abstract:Recent advances in image generation have led to the widespread availability of highly realistic synthetic media, increasing the difficulty of reliable deepfake detection. A key challenge is generalization, as detectors trained on a narrow class of generators often fail when confronted with unseen models. In this work, we address the pressing need for generalizable detection by leveraging large vision-language models, specifically CLIP, to identify synthetic content across diverse generative techniques. First, we introduce Diff-Gen, a large-scale benchmark dataset comprising 100k diffusion-generated fakes that capture broad spectral artifacts unlike traditional GAN datasets. Models trained on Diff-Gen demonstrate stronger cross-domain generalization, particularly on previously unseen image generators. Second, we propose AdaptPrompt, a parameter-efficient transfer learning framework that jointly learns task-specific textual prompts and visual adapters while keeping the CLIP backbone frozen. We further show via layer ablation that pruning the final transformer block of the vision encoder enhances the retention of high-frequency generative artifacts, significantly boosting detection accuracy. Our evaluation spans 25 challenging test sets, covering synthetic content generated by GANs, diffusion models, and commercial tools, establishing a new state-of-the-art in both standard and cross-domain scenarios. We further demonstrate the framework’s versatility through few-shot generalization (using as few as 320 images) and source attribution, enabling the precise identification of generator architectures in closed-set settings.
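"冻结 CLIP 主干、仅训练文本提示与视觉适配器"的参数高效微调结构大致如下(PyTorch 极简示意,非官方实现;特征维度、适配器结构与提示个数均为假设):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptAdapterDetector(nn.Module):
    """冻结的 CLIP 视觉编码器 + 残差适配器 + 可学习的 real/fake 提示向量(示意)。"""

    def __init__(self, clip_visual, feat_dim=768, bottleneck=64):
        super().__init__()
        self.backbone = clip_visual
        for p in self.backbone.parameters():
            p.requires_grad_(False)                       # 主干完全冻结
        self.adapter = nn.Sequential(
            nn.Linear(feat_dim, bottleneck), nn.ReLU(), nn.Linear(bottleneck, feat_dim)
        )
        self.prompts = nn.Parameter(0.02 * torch.randn(2, feat_dim))   # 类别提示: [real, fake]

    def forward(self, images):
        with torch.no_grad():
            feat = self.backbone(images)                  # 假设返回池化后的图像特征 (B, feat_dim)
        feat = F.normalize(feat + self.adapter(feat), dim=-1)
        prompts = F.normalize(self.prompts, dim=-1)
        return feat @ prompts.t()                         # 两类 logits,用交叉熵训练
```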
zh
[CV-20] MambaMIL: Modeling Long-Term Contextual Patterns for Gigapixel Whole Slide Image
【速读】:该论文旨在解决全切片图像(Whole-slide images, WSIs)在计算病理学中因超高分辨率和缺乏细粒度标注而难以被传统深度学习模型有效处理的问题,特别是多实例学习(Multiple instance learning, MIL)框架在建模超长序列时面临的空间上下文信息不足与记忆衰减问题。其解决方案的关键在于提出MambaMIL+框架,通过三个核心机制实现:1)重叠扫描策略重构patch序列以嵌入空间连续性和实例相关性;2)选择性条带位置编码器(selective stripe position encoder, S2PE)缓解固定扫描顺序带来的偏差;3)上下文token选择机制(contextual token selection, CTS)利用监督知识动态扩展上下文记忆,从而在不遗忘的前提下稳定地建模长程依赖关系。
链接: https://arxiv.org/abs/2512.17726
作者: Qian Zeng,Yihui Wang,Shu Yang,Yingxue Xu,Fengtao Zhou,Jiabo Ma,Dejia Cai,Zhengyu Zhang,Lijuan Qu,Yu Wang,Li Liang,Hao Chen
机构: Hong Kong University of Science and Technology (香港科技大学); Southern Medical University (南方医科大学); Joint Logistic Support Force, PLA (联勤保障部队); HKUST Shenzhen-Hong Kong Collaborative Innovation Research Institute (香港科技大学深港协同创新研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 11 figures, 10 tables
Abstract:Whole-slide images (WSIs) are an important data modality in computational pathology, yet their gigapixel resolution and lack of fine-grained annotations challenge conventional deep learning models. Multiple instance learning (MIL) offers a solution by treating each WSI as a bag of patch-level instances, but effectively modeling ultra-long sequences with rich spatial context remains difficult. Recently, Mamba has emerged as a promising alternative for long sequence learning, scaling linearly to thousands of tokens. However, despite its efficiency, it still suffers from limited spatial context modeling and memory decay, constraining its effectiveness to WSI analysis. To address these limitations, we propose MambaMIL+, a new MIL framework that explicitly integrates spatial context while maintaining long-range dependency modeling without memory forgetting. Specifically, MambaMIL+ introduces 1) overlapping scanning, which restructures the patch sequence to embed spatial continuity and instance correlations; 2) a selective stripe position encoder (S2PE) that encodes positional information while mitigating the biases of fixed scanning orders; and 3) a contextual token selection (CTS) mechanism, which leverages supervisory knowledge to dynamically enlarge the contextual memory for stable long-range modeling. Extensive experiments on 20 benchmarks across diagnostic classification, molecular prediction, and survival analysis demonstrate that MambaMIL+ consistently achieves state-of-the-art performance under three feature extractors (ResNet-50, PLIP, and CONCH), highlighting its effectiveness and robustness for large-scale computational pathology
zh
[CV-21] SAVeD: A First-Person Social Media Video Dataset for ADAS-equipped vehicle Near-Miss and Crash Event Analyses
链接: https://arxiv.org/abs/2512.17724
作者: Shaoyan Zhai,Mohamed Abdel-Aty,Chenzhu Wang,Rodrigo Vena Garcia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-22] FlexAvatar: Flexible Large Reconstruction Model for Animatable Gaussian Head Avatars with Detailed Deformation
【速读】:该论文旨在解决现有3D头像重建方法在缺乏相机位姿(camera poses)和表情标签(expression labels)条件下,难以实现高保真度且具备详细动态形变能力的问题。其核心解决方案是提出FlexAvatar,通过基于Transformer的重建模型结合结构化的头部查询令牌(structured head query tokens)作为规范锚点(canonical anchor),实现对任意数量输入图像、无需相机位姿或表情标签的鲁棒性3D表征聚合;同时引入轻量级UNet解码器,以UV空间位置图作为条件,实时生成依赖表情的细节形变,并通过训练阶段的数据分布调整策略增强罕见但关键表情(如皱纹和露齿)的建模能力,最终在保持高质量动态形变的同时,支持仅需10秒微调即可强化极端身份特征的细节表达。
链接: https://arxiv.org/abs/2512.17717
作者: Cheng Peng,Zhuo Su,Liao Wang,Chen Guo,Zhaohu Li,Chengjiang Long,Zheng Lv,Jingxiang Sun,Chenyangguang Zhang,Yebin Liu
机构: Tsinghua University (清华大学); ByteDance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:We present FlexAvatar, a flexible large reconstruction model for high-fidelity 3D head avatars with detailed dynamic deformation from single or sparse images, without requiring camera poses or expression labels. It leverages a transformer-based reconstruction model with structured head query tokens as canonical anchor to aggregate flexible input-number-agnostic, camera-pose-free and expression-free inputs into a robust canonical 3D representation. For detailed dynamic deformation, we introduce a lightweight UNet decoder conditioned on UV-space position maps, which can produce detailed expression-dependent deformations in real time. To better capture rare but critical expressions like wrinkles and bared teeth, we also adopt a data distribution adjustment strategy during training to balance the distribution of these expressions in the training set. Moreover, a lightweight 10-second refinement can further enhance identity-specific details in extreme identities without affecting deformation quality. Extensive experiments demonstrate that our FlexAvatar achieves superior 3D consistency and detailed dynamic realism compared with previous methods, providing a practical solution for animatable 3D avatar creation.
zh
[CV-23] An Empirical Study of Sampling Hyperparameters in Diffusion-Based Super-Resolution
【速读】:该论文旨在解决预训练扩散模型在逆问题(如单图像超分辨率)中应用时,如何通过条件引导(conditioning)提升重建质量的问题。其关键解决方案在于识别并优化影响性能的核心因素:实验表明,条件引导步骤的步长(step size)对最终重建效果的影响远大于扩散步数(diffusion step count),尤其当步长设置在 [2.0, 3.0] 范围内时,能显著提升整体性能,这为实际应用中参数调优提供了明确方向。
链接: https://arxiv.org/abs/2512.17675
作者: Yudhistira Arief Wibowo
机构: Technical University of Munich (慕尼黑工业大学); Korea Advanced Institute of Science and Technology (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion models have shown strong potential for solving inverse problems such as single-image super-resolution, where a high-resolution image is recovered from a low-resolution observation using a pretrained unconditional prior. Conditioning methods, including Diffusion Posterior Sampling (DPS) and Manifold Constrained Gradient (MCG), can substantially improve reconstruction quality, but they introduce additional hyperparameters that require careful tuning. In this work, we conduct an empirical ablation study on FFHQ super-resolution to identify the dominant factors affecting performance when applying conditioning to pretrained diffusion models, and show that the conditioning step size has a significantly greater impact than the diffusion step count, with step sizes in the range of [2.0, 3.0] yielding the best overall performance in our experiments.
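摘要中被消融的"条件引导步长"对应 DPS 风格的测量一致性更新,其单步形式大致如下(PyTorch 极简示意,非官方实现;`x0_hat_fn` 表示由当前状态可微地预测出的去噪图像):

```python
import torch

def dps_conditioning_step(x_t, x0_hat_fn, y_lr, downsample, step_size=2.5):
    """基于测量一致性 ||y - A(x0_hat)||^2 的单步条件化修正(示意)。

    论文的经验结论:step_size 落在 [2.0, 3.0] 区间时整体重建质量最好,
    且其影响显著大于扩散采样步数。
    """
    x_t = x_t.detach().requires_grad_(True)
    residual = y_lr - downsample(x0_hat_fn(x_t))          # A 为下采样(退化)算子
    loss = residual.pow(2).sum()
    grad = torch.autograd.grad(loss, x_t)[0]
    scale = step_size / loss.sqrt().clamp(min=1e-8)        # DPS 中常见的按残差范数归一化
    return (x_t - scale * grad).detach()
```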
zh
[CV-24] Learning Spatio-Temporal Feature Representations for Video-Based Gaze Estimation
【速读】:该论文旨在解决基于视频的眼动估计(video-based gaze estimation)中因模型需同时捕捉帧内空间关系与帧间时间动态而导致的性能瓶颈问题。其核心挑战在于如何有效融合眼部和面部特征,并在不丢失空间上下文信息的前提下建模跨帧的时间演化。解决方案的关键在于提出Spatio-Temporal Gaze Network (ST-Gaze),该模型结合CNN主干网络与专用的通道注意力(channel attention)和自注意力(self-attention)模块,实现眼区与面部特征的最优融合;随后将融合特征视为空间序列,通过时空递归机制保留帧内空间上下文并将其传播至时间维度,从而显式建模帧间动态变化。实验表明,该方法在EVE数据集上实现了无需个体适配和需要个体适配两种场景下的最先进性能,且消融实验证明:保留并建模帧内空间上下文,优于过早进行空间池化的策略。
链接: https://arxiv.org/abs/2512.17673
作者: Alexandre Personnic,Mihai Bâce
机构: KU Leuven (鲁汶大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 12 pages, 5 figures, the code repository is available at this https URL
Abstract:Video-based gaze estimation methods aim to capture the inherently temporal dynamics of human eye gaze from multiple image frames. However, since models must capture both spatial and temporal relationships, performance is limited by the feature representations within a frame but also between multiple frames. We propose the Spatio-Temporal Gaze Network (ST-Gaze), a model that combines a CNN backbone with dedicated channel attention and self-attention modules to fuse eye and face features optimally. The fused features are then treated as a spatial sequence, allowing for the capture of an intra-frame context, which is then propagated through time to model inter-frame dynamics. We evaluated our method on the EVE dataset and show that ST-Gaze achieves state-of-the-art performance both with and without person-specific adaptation. Additionally, our ablation study provides further insights into the model performance, showing that preserving and modelling intra-frame spatial context with our spatio-temporal recurrence is fundamentally superior to premature spatial pooling. As such, our results pave the way towards more robust video-based gaze estimation using commonly available cameras.
zh
[CV-25] Bitbox: Behavioral Imaging Toolbox for Computational Analysis of Behavior from Videos
【速读】:该论文旨在解决当前计算行为测量工具在心理学、精神病学和临床研究中难以被广泛采用的问题,其核心挑战在于现有方法多面向工程用户,依赖复杂软件栈,且输出的测量结果难以直接用于假设驱动的研究。解决方案的关键是提出Bitbox——一个开源工具包,它通过可复现性(reproducibility)、模块化(modularity)和可解释性(interpretability)三大设计原则,提供标准化接口以从视频中提取高阶行为指标,并集成多种面部、头部和身体处理器,使行为科学家无需工程背景即可获得可靠的行为测量数据,同时为计算机科学家提供了向临床领域传播算法的有效途径。
链接: https://arxiv.org/abs/2512.17655
作者: Evangelos Sariyanidi,Gokul Nair,Lisa Yankowitz,Casey J. Zampella,Mohan Kashyap Pargi,Aashvi Manakiwala,Maya McNealis,John D. Herrington,Jeffrey Cohn,Robert T. Schultz,Birkan Tunc
机构: Children’s Hospital of Philadelphia (费城儿童医院); University of Pennsylvania (宾夕法尼亚大学); University of Pittsburgh (匹兹堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
备注:
Abstract:Computational measurement of human behavior from video has recently become feasible due to major advances in AI. These advances now enable granular and precise quantification of facial expression, head movement, body action, and other behavioral modalities and are increasingly used in psychology, psychiatry, neuroscience, and mental health research. However, mainstream adoption remains slow. Most existing methods and software are developed for engineering audiences, require specialized software stacks, and fail to provide behavioral measurements at a level directly useful for hypothesis-driven research. As a result, there is a large barrier to entry for researchers who wish to use modern, AI-based tools in their work. We introduce Bitbox, an open-source toolkit designed to remove this barrier and make advanced computational analysis directly usable by behavioral scientists and clinical researchers. Bitbox is guided by principles of reproducibility, modularity, and interpretability. It provides a standardized interface for extracting high-level behavioral measurements from video, leveraging multiple face, head, and body processors. The core modules have been tested and validated on clinical samples and are designed so that new measures can be added with minimal effort. Bitbox is intended to serve both sides of the translational gap. It gives behavioral researchers access to robust, high-level behavioral metrics without requiring engineering expertise, and it provides computer scientists a practical mechanism for disseminating methods to domains where their impact is most needed. We expect that Bitbox will accelerate integration of computational behavioral measurement into behavioral, clinical, and mental health research. Bitbox has been designed from the beginning as a community-driven effort that will evolve through contributions from both method developers and domain scientists.
zh
[CV-26] Region-Constraint In-Context Generation for Instructional Video Editing
【速读】:该论文旨在解决基于指令的视频编辑中因未明确指定编辑区域而导致的区域不准确和去噪过程中编辑区与非编辑区token干扰的问题。解决方案的关键在于提出一种新的上下文生成范式ReCo,其核心创新是通过宽度拼接源视频与目标视频实现联合去噪,并引入潜空间正则化(latent regularization)与注意力正则化(attention regularization)两项约束机制:前者增强编辑区域在潜空间中的差异性并抑制非编辑区域的变化,强化编辑区域的修改效果并减少无关内容生成;后者抑制编辑区token对源视频对应区域token的注意力,从而缓解新物体生成时的区域干扰问题。
链接: https://arxiv.org/abs/2512.17650
作者: Zhongwei Zhang,Fuchen Long,Wei Li,Zhaofan Qiu,Wu Liu,Ting Yao,Tao Mei
机构: University of Science and Technology of China (中国科学技术大学); HiDream.ai Inc. (HiDream.ai 公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Project page: this https URL
Abstract:The in-context generation paradigm has recently demonstrated strong power in instructional image editing with both data efficiency and synthesis quality. Nevertheless, shaping such in-context learning for instruction-based video editing is not trivial. Without specifying editing regions, the results can suffer from inaccurate editing regions and token interference between editing and non-editing areas during denoising. To address these issues, we present ReCo, a new instructional video editing paradigm that delves into constraint modeling between editing and non-editing regions during in-context generation. Technically, ReCo width-wise concatenates the source and target video for joint denoising. To calibrate video diffusion learning, ReCo capitalizes on two regularization terms, i.e., latent and attention regularization, applied to one-step backward denoised latents and attention maps, respectively. The former increases the latent discrepancy of the editing region between source and target videos while reducing that of non-editing areas, emphasizing the modification of the editing area and alleviating unexpected content generation outside it. The latter suppresses the attention of tokens in the editing region to the corresponding tokens of the source video, thereby mitigating their interference during novel object generation in the target video. Furthermore, we propose a large-scale, high-quality video editing dataset, i.e., ReCo-Data, comprising 500K instruction-video pairs to benefit model training. Extensive experiments conducted on four major instruction-based video editing tasks demonstrate the superiority of our proposal.
zh
[CV-27] Generative Human-Object Interaction Detection via Differentiable Cognitive Steering of Multi-modal LLM s
【速读】:该论文旨在解决人类-物体交互(Human-Object Interaction, HOI)检测中因封闭世界假设导致的泛化能力不足问题,即现有方法受限于预定义的动词类别集合,在面对长尾或未见交互时表现不佳。同时,尽管多模态大语言模型(Multimodal Large Language Models, MLLMs)具备开放词汇理解能力,但其与现有HOI检测器之间存在解耦问题,且微调成本过高。解决方案的关键在于提出一种名为GRASP-HO的生成式推理与可调控感知框架,将HOI检测从封闭集分类任务重构为开放词汇生成任务;通过提取混合交互表示,并设计轻量级可学习认知引导通道(Cognitive Steering Conduit, CSC)模块,将细粒度视觉证据注入冻结的MLLM以实现有效推理;此外,引入混合引导策略,结合语言建模损失与辅助分类损失,缓解监督信号不匹配问题,从而在保持生成灵活性的同时实现判别性定位,最终实现统一的判别感知与生成推理范式,显著提升闭集性能与零样本泛化能力。
链接: https://arxiv.org/abs/2512.17640
作者: Zhaolin Cai,Huiyu Duan,Zitong Xu,Fan Li,Zhi Liu,Jing Liu,Wei Shen,Xiongkuo Min,Guangtao Zhai
机构: Shanghai Jiao Tong University (上海交通大学); Xi’an Jiao Tong University (西安交通大学); Shandong University (山东大学); Tianjin University (天津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Human-object interaction (HOI) detection aims to localize human-object pairs and the interactions between them. Existing methods operate under a closed-world assumption, treating the task as a classification problem over a small, predefined verb set, which struggles to generalize to the long-tail of unseen or ambiguous interactions in the wild. While recent multi-modal large language models (MLLMs) possess the rich world knowledge required for open-vocabulary understanding, they remain decoupled from existing HOI detectors since fine-tuning them is computationally prohibitive. To address these constraints, we propose GRASP-HO, a novel Generative Reasoning And Steerable Perception framework that reformulates HOI detection from a closed-set classification task into an open-vocabulary generation problem. To bridge vision and cognition, we first extract hybrid interaction representations, then design a lightweight learnable cognitive steering conduit (CSC) module to inject the fine-grained visual evidence into a frozen MLLM for effective reasoning. To address the supervision mismatch between classification-based HOI datasets and open-vocabulary generative models, we introduce a hybrid guidance strategy that couples the language modeling loss with an auxiliary classification loss, enabling discriminative grounding without sacrificing generative flexibility. Experiments demonstrate state-of-the-art closed-set performance and strong zero-shot generalization, achieving a unified paradigm that seamlessly bridges discriminative perception and generative reasoning for open-world HOI detection.
zh
[CV-28] PathFLIP: Fine-grained Language-Image Pretraining for Versatile Computational Pathology
【速读】:该论文旨在解决全切片图像(Whole Slide Images, WSI)在计算病理学中因像素级规模庞大和空间异质性导致的多模态理解难题,尤其是现有对齐方法难以捕捉文本描述与来自同一张切片的数千个图像块之间的细粒度对应关系,从而影响下游任务性能的问题。其解决方案的关键在于提出PathFLIP(Pathology Fine-grained Language-Image Pretraining)框架,通过将切片级描述分解为区域级子描述,并生成文本条件驱动的区域嵌入(text-conditioned region embeddings),实现更精确的视觉-语言定位;同时借助大语言模型(Large Language Models, LLMs)增强临床指令遵循能力,使系统能灵活适应不同诊断场景,显著提升在切片分类、检索、病灶定位及指令执行等多范式任务中的表现,且训练数据需求远低于现有方法。
链接: https://arxiv.org/abs/2512.17621
作者: Fengchun Liu,Songhan Jiang,Linghan Cai,Ziyue Wang,Yongbing Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While Vision-Language Models (VLMs) have achieved notable progress in computational pathology (CPath), the gigapixel scale and spatial heterogeneity of Whole Slide Images (WSIs) continue to pose challenges for multimodal understanding. Existing alignment methods struggle to capture fine-grained correspondences between textual descriptions and visual cues across thousands of patches from a slide, compromising their performance on downstream tasks. In this paper, we propose PathFLIP (Pathology Fine-grained Language-Image Pretraining), a novel framework for holistic WSI interpretation. PathFLIP decomposes slide-level captions into region-level subcaptions and generates text-conditioned region embeddings to facilitate precise visual-language grounding. By harnessing Large Language Models (LLMs), PathFLIP can seamlessly follow diverse clinical instructions and adapt to varied diagnostic contexts. Furthermore, it exhibits versatile capabilities across multiple paradigms, efficiently handling slide-level classification and retrieval, fine-grained lesion localization, and instruction following. Extensive experiments demonstrate that PathFLIP outperforms existing large-scale pathological VLMs on four representative benchmarks while requiring significantly less training data, paving the way for fine-grained, instruction-aware WSI interpretation in clinical practice.
zh
[CV-29] StereoMV2D: A Sparse Temporal Stereo-Enhanced Framework for Robust Multi-View 3D Object Detection
【速读】:该论文旨在解决单帧2D检测结果中存在的深度模糊性(depth ambiguity)问题,从而提升多视角3D目标检测中查询初始化的质量与精度。其解决方案的关键在于提出了一种统一框架StereoMV2D,通过引入时序立体建模(temporal stereo modeling),利用相邻帧间同一目标的跨时间视差信息增强深度感知能力,并在2D感兴趣区域(RoI)内高效完成计算;同时设计了动态置信度门控机制(dynamic confidence gating),基于帧间匹配矩阵和外观一致性学习统计模式,自适应评估时序立体线索的可靠性,从而在复杂场景下实现鲁棒且高精度的3D检测。
链接: https://arxiv.org/abs/2512.17620
作者: Di Wu,Feng Yang,Wenhui Zhao,Jinwen Yu,Pan Liao,Benlian Xu,Dingwen Zhang
机构: Northwestern Polytechnical University (西北工业大学); Suzhou University of Science and Technology (苏州科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 4 figures. This work has been submitted to the IEEE for possible publication
Abstract:Multi-view 3D object detection is a fundamental task in autonomous driving perception, where achieving a balance between detection accuracy and computational efficiency remains crucial. Sparse query-based 3D detectors efficiently aggregate object-relevant features from multi-view images through a set of learnable queries, offering a concise and end-to-end detection paradigm. Building on this foundation, MV2D leverages 2D detection results to provide high-quality object priors for query initialization, enabling higher precision and recall. However, the inherent depth ambiguity in single-frame 2D detections still limits the accuracy of 3D query generation. To address this issue, we propose StereoMV2D, a unified framework that integrates temporal stereo modeling into the 2D detection-guided multi-view 3D detector. By exploiting cross-temporal disparities of the same object across adjacent frames, StereoMV2D enhances depth perception and refines the query priors, while performing all computations efficiently within 2D regions of interest (RoIs). Furthermore, a dynamic confidence gating mechanism adaptively evaluates the reliability of temporal stereo cues through learning statistical patterns derived from the inter-frame matching matrix together with appearance consistency, ensuring robust detection under object appearance and occlusion. Extensive experiments on the nuScenes and Argoverse 2 datasets demonstrate that StereoMV2D achieves superior detection performance without incurring significant computational overhead. Code will be available at this https URL.
zh
[CV-30] Self-Supervised Weighted Image Guided Quantitative MRI Super-Resolution
【速读】:该论文旨在解决高分辨率(HR)定量磁共振成像(qMRI) relaxometry在临床应用中因采集时间过长而难以推广的问题。其核心解决方案是提出一种物理信息引导的自监督超分辨率框架,通过利用常规获取的高分辨率加权MRI(wMRI)作为引导信号,从而在训练过程中无需依赖HR qMRI的真实标签。该方法将超分辨率建模为贝叶斯最大后验估计问题,最小化两个方面的差异:一是从超分辨qMRI图合成的HR图像与实际wMRI引导图像之间的信号模型误差;二是低分辨率(LR)qMRI数据与下采样预测结果之间的差异。这一物理约束的目标函数使模型能够仅基于临床wMRI学习有效的超分辨率映射,实现快速qMRI采集的同时保持高质量参数图,且具备跨不同qMRI序列的泛化能力。
链接: https://arxiv.org/abs/2512.17612
作者: Alireza Samadifardheris,Dirk H.J. Poot,Florian Wiesinger,Stefan Klein,Juan A. Hernandez-Tamames
机构: Erasmus MC, Rotterdam, The Netherlands (埃拉斯姆斯医学中心,鹿特丹,荷兰); GE Healthcare, Munich, Germany (GE医疗,慕尼黑,德国); TU Delft, Delft, The Netherlands (代尔夫特理工大学,代尔夫特,荷兰)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted to IEEE TMI for possible publication
Abstract:High-resolution (HR) quantitative MRI (qMRI) relaxometry provides objective tissue characterization but remains clinically underutilized due to lengthy acquisition times. We propose a physics-informed, self-supervised framework for qMRI super-resolution that uses routinely acquired HR weighted MRI (wMRI) scans as guidance, thus, removing the necessity for HR qMRI ground truth during training. We formulate super-resolution as Bayesian maximum a posteriori inference, minimizing two discrepancies: (1) between HR images synthesized from super-resolved qMRI maps and acquired wMRI guides via forward signal models, and (2) between acquired LR qMRI and downsampled predictions. This physics-informed objective allows the models to learn from clinical wMRI without HR qMRI supervision. To validate the concept, we generate training data by synthesizing wMRI guides from HR qMRI using signal equations, then degrading qMRI resolution via k-space truncation. A deep neural network learns the super-resolution mapping. Ablation experiments demonstrate that T1-weighted images primarily enhance T1 maps, T2-weighted images improve T2 maps, and combined guidance optimally enhances all parameters simultaneously. Validation on independently acquired in-vivo data from a different qMRI sequence confirms cross-qMRI sequence generalizability. Models trained on synthetic data can produce super-resolved maps from a 1-minute acquisition with quality comparable to a 5-minute reference scan, leveraging the scanner-independent nature of relaxometry parameters. By decoupling training from HR qMRI requirement, our framework enables fast qMRI acquisitions enhanced via routine clinical images, offering a practical pathway for integrating quantitative relaxometry into clinical workflows with acceptable additional scan time.
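其自监督目标可以概括为"wMRI 引导项 + LR 一致性项"两部分,示意如下(PyTorch,非官方实现;`signal_model` 表示把定量参数图正演为加权像的信号方程,具体加权方式为假设):

```python
import torch

def physics_informed_loss(qmaps_sr, wmri_hr, qmri_lr, signal_model, downsample, w_lr=1.0):
    """自监督 qMRI 超分损失(示意)。

    qmaps_sr: 网络输出的超分定量参数图(如 T1/T2/PD);
    wmri_hr:  实际采集的高分辨率加权像(引导);
    qmri_lr:  实际采集的低分辨率定量图。
    """
    synth_wmri = signal_model(qmaps_sr)                       # 由信号方程正演出 HR 加权像
    loss_guide = (synth_wmri - wmri_hr).abs().mean()          # 与 HR wMRI 引导的一致性
    loss_lr = (downsample(qmaps_sr) - qmri_lr).abs().mean()   # 与 LR qMRI 观测的一致性
    return loss_guide + w_lr * loss_lr
```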
zh
[CV-31] Semi-Supervised 3D Segmentation for Type-B Aortic Dissection with Slim UNETR
【速读】:该论文旨在解决多输出卷积神经网络(Convolutional Neural Networks, CNN)在医学图像多类分割任务中对大规模高质量标注数据的依赖问题,尤其是在处理3D计算机断层扫描血管造影(CTA)图像时,由于标注成本高且数据分布不均衡(如仅68/100样本存在假腔),导致模型训练困难。其解决方案的关键在于提出一种无需假设模型输出具有概率性质的半监督学习方法,通过引入额外的数据增强策略(如旋转与翻转)来利用未标注数据,从而提升模型在多个独立输出分支上的分割性能,特别适用于需要分别预测不同解剖结构(如真腔、假腔及假腔血栓)的专用分割架构。
链接: https://arxiv.org/abs/2512.17610
作者: Denis Mikhailapov,Vladimir Berikov
机构: Novosibirsk State University (新西伯利亚国立大学); Sobolev Institute of Mathematics (索博列夫数学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 5 figures, 1 listing
Abstract:Convolutional neural networks (CNN) for multi-class segmentation of medical images are widely used today. Especially models with multiple outputs that can separately predict segmentation classes (regions) without relying on a probabilistic formulation of the segmentation of regions. These models allow for more precise segmentation by tailoring the network’s components to each class (region). They have a common encoder part of the architecture but branch out at the output layers, leading to improved accuracy. These methods are used to diagnose type B aortic dissection (TBAD), which requires accurate segmentation of aortic structures based on the ImageTBDA dataset, which contains 100 3D computed tomography angiography (CTA) images. These images identify three key classes: true lumen (TL), false lumen (FL), and false lumen thrombus (FLT) of the aorta, which is critical for diagnosis and treatment decisions. In the dataset, 68 examples have a false lumen, while the remaining 32 do not, creating additional complexity for pathology detection. However, implementing these CNN methods requires a large amount of high-quality labeled data. Obtaining accurate labels for the regions of interest can be an expensive and time-consuming process, particularly for 3D data. Semi-supervised learning methods allow models to be trained by using both labeled and unlabeled data, which is a promising approach for overcoming the challenge of obtaining accurate labels. However, these learning methods are not well understood for models with multiple outputs. This paper presents a semi-supervised learning method for models with multiple outputs. The method is based on the additional rotations and flipping, and does not assume the probabilistic nature of the model’s responses. This makes it a universal approach, which is especially important for architectures that involve separate segmentation.
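文中"基于旋转与翻转、且不依赖输出概率解释"的半监督约束,可以理解为对无标注体数据施加等变一致性:模型对变换后输入的预测,应等于对原输入的预测再做同一变换。极简 PyTorch 示意如下(非官方实现;假设多输出模型返回每个类别分支的预测列表):

```python
import torch

def equivariance_consistency_loss(model, volume, k=1, flip_dim=-1):
    """对无标注样本的旋转/翻转等变一致性损失(示意)。volume: (B, C, D, H, W)。"""
    aug = torch.flip(torch.rot90(volume, k, dims=(-2, -1)), dims=(flip_dim,))
    preds = model(volume)            # 例如 [TL 分支, FL 分支, FLT 分支] 三个输出
    preds_aug = model(aug)
    loss = 0.0
    for p, pa in zip(preds, preds_aug):
        p_t = torch.flip(torch.rot90(p, k, dims=(-2, -1)), dims=(flip_dim,))
        loss = loss + (p_t - pa).pow(2).mean()
    return loss / len(preds)
```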
zh
[CV-32] MGRegBench: A Novel Benchmark Dataset with Anatomical Landmarks for Mammography Image Registration
【速读】:该论文旨在解决乳腺X线摄影(mammography)图像配准(registration)研究中缺乏公开数据集和标准化评估基准的问题,从而阻碍了不同方法之间的可比性和临床应用进展。其关键解决方案是提出了MGRegBench——一个包含超过5000对图像的公共基准数据集,其中100对配有手动标注的解剖学特征点和分割掩膜,使得该数据集成为目前规模最大且带人工标注的二维医学图像配准数据集之一。通过该数据集,作者首次实现了多种配准方法(包括经典算法、基于学习的方法、隐式神经表示及专为乳腺影像设计的最新深度学习模型)在相同条件下的公平比较,并提供了详尽的深度学习配准分析,推动了该领域的标准化与进一步发展。
链接: https://arxiv.org/abs/2512.17605
作者: Svetlana Krasnova,Emiliya Starikova,Ilia Naletov,Andrey Krylov,Dmitry Sorokin
机构: Moscow State University (莫斯科国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Robust mammography registration is essential for clinical applications like tracking disease progression and monitoring longitudinal changes in breast tissue. However, progress has been limited by the absence of public datasets and standardized benchmarks. Existing studies are often not directly comparable, as they use private data and inconsistent evaluation frameworks. To address this, we present MGRegBench, a public benchmark dataset for mammogram registration. It comprises over 5,000 image pairs, with 100 containing manual anatomical landmarks and segmentation masks for rigorous evaluation. This makes MGRegBench one of the largest public 2D registration datasets with manual annotations. Using this resource, we benchmarked diverse registration methods including classical (ANTs), learning-based (VoxelMorph, TransMorph), implicit neural representation (IDIR), a classic mammography-specific approach, and a recent state-of-the-art deep learning method MammoRegNet. The implementations were adapted to this modality from the authors’ implementations or re-implemented from scratch. Our contributions are: (1) the first public dataset of this scale with manual landmarks and masks for mammography registration; (2) the first like-for-like comparison of diverse methods on this modality; and (3) an extensive analysis of deep learning-based registration. We publicly release our code and data to establish a foundational resource for fair comparisons and catalyze future research. The source code and data are at this https URL.
zh
[CV-33] HeadHunt-VAD: Hunting Robust Anomaly-Sensitive Heads in MLLM for Tuning-Free Video Anomaly Detection
【速读】:该论文旨在解决视频异常检测(Video Anomaly Detection, VAD)中传统方法依赖大量标注数据、计算成本高,以及基于多模态大语言模型(Multimodal Large Language Models, MLLMs)的无微调方法因依赖文本输出而导致信息损失、正常性偏差(normalcy bias)和提示敏感性等问题。解决方案的关键在于提出HeadHunt-VAD,一种通过直接探测冻结MLLM内部注意力头(attention heads)来实现无微调异常检测的新范式;其核心创新是设计了一个鲁棒头识别模块(Robust Head Identification),基于显著性和稳定性多维度评估所有注意力头,筛选出对多种提示均具判别力的稀疏专家头集合,并将这些头提取的特征输入轻量级异常评分器与时间定位模块,从而实现高效、准确且可解释的异常检测。
链接: https://arxiv.org/abs/2512.17601
作者: Zhaolin Cai,Fan Li,Ziwei Zheng,Haixia Bi,Lijun He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video Anomaly Detection (VAD) aims to locate events that deviate from normal patterns in videos. Traditional approaches often rely on extensive labeled data and incur high computational costs. Recent tuning-free methods based on Multimodal Large Language Models (MLLMs) offer a promising alternative by leveraging their rich world knowledge. However, these methods typically rely on textual outputs, which introduces information loss, exhibits normalcy bias, and suffers from prompt sensitivity, making them insufficient for capturing subtle anomalous cues. To address these constraints, we propose HeadHunt-VAD, a novel tuning-free VAD paradigm that bypasses textual generation by directly hunting robust anomaly-sensitive internal attention heads within the frozen MLLM. Central to our method is a Robust Head Identification module that systematically evaluates all attention heads using a multi-criteria analysis of saliency and stability, identifying a sparse subset of heads that are consistently discriminative across diverse prompts. Features from these expert heads are then fed into a lightweight anomaly scorer and a temporal locator, enabling efficient and accurate anomaly detection with interpretable outputs. Extensive experiments show that HeadHunt-VAD achieves state-of-the-art performance among tuning-free methods on two major VAD benchmarks while maintaining high efficiency, validating head-level probing in MLLMs as a powerful and practical solution for real-world anomaly detection.
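"显著性 × 稳定性"的注意力头筛选可以用如下 NumPy 示意表达(非官方实现;如何从冻结 MLLM 的注意力头中得到逐样本响应值 `head_scores`、以及是否使用少量带标签验证片段,均为此处的假设):

```python
import numpy as np

def rank_heads(head_scores, labels):
    """head_scores: (P, H, N) —— P 个提示、H 个注意力头、N 个验证片段的响应;labels: (N,) 0/1。"""
    pos = head_scores[:, :, labels == 1]                  # 异常片段
    neg = head_scores[:, :, labels == 0]                  # 正常片段
    gap = pos.mean(axis=2) - neg.mean(axis=2)             # (P, H) 各提示下每个头的区分度
    saliency = gap.mean(axis=0)                           # 跨提示平均区分度(显著性)
    stability = 1.0 / (1.0 + gap.std(axis=0))             # 跨提示越一致越稳定
    score = saliency * stability
    return np.argsort(score)[::-1]                        # 按综合得分降序的头索引
```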
zh
[CV-34] MAD-OOD: A Deep Learning Cluster-Driven Framework for an Out-of-Distribution Malware Detection and Classification
【速读】:该论文旨在解决恶意软件分类中因多态性(polymorphic)和变种型(metamorphic)恶意软件引入的类内变异导致的分布外(Out of Distribution, OOD)检测难题。现有基于深度学习的恶意软件检测器通常依赖封闭世界假设,难以有效建模类内变化,在面对未知恶意软件家族时性能显著下降。其解决方案的关键在于提出一种两阶段、基于聚类的深度学习框架 MADOOD:第一阶段利用高斯判别分析(Gaussian Discriminant Analysis, GDA)构建类条件球面决策边界,实现无需OOD训练数据即可统计学上可靠地区分分布内与分布外样本;第二阶段通过融合聚类预测、优化嵌入表示及监督分类结果,提升最终分类精度。该方法在包含25个已知家族和多个新OOD变体的基准数据集上验证,AUC最高达0.911,展现出可扩展性、可解释性和统计合理性。
链接: https://arxiv.org/abs/2512.17594
作者: Tosin Ige,Christopher Kiekintveld,Aritran Piplai,Asif Rahman,Olukunle Kolade,Sasidhar Kunapuli
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Out-of-distribution (OOD) detection remains a critical challenge in malware classification due to the substantial intra-family variability introduced by polymorphic and metamorphic malware variants. Most existing deep learning-based malware detectors rely on closed-world assumptions and fail to adequately model this intra-class variation, resulting in degraded performance when confronted with previously unseen malware families. This paper presents MADOOD, a novel two-stage, cluster-driven deep learning framework for robust OOD malware detection and classification. In the first stage, malware family embeddings are modeled using class-conditional spherical decision boundaries derived from Gaussian Discriminant Analysis (GDA), enabling statistically grounded separation of in-distribution and OOD samples without requiring OOD data during training. Z-score-based distance analysis across multiple class centroids is employed to reliably identify anomalous samples in the latent space. In the second stage, a deep neural network integrates cluster-based predictions, refined embeddings, and supervised classifier outputs to enhance final classification accuracy. Extensive evaluations on benchmark malware datasets comprising 25 known families and multiple novel OOD variants demonstrate that MADOOD significantly outperforms state-of-the-art OOD detection methods, achieving an AUC of up to 0.911 on unseen malware families. The proposed framework provides a scalable, interpretable, and statistically principled solution for real-world malware detection and anomaly identification in evolving cybersecurity environments.
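第一阶段"类条件球面高斯 + z-score 距离"的 OOD 判别逻辑大致如下(NumPy 极简示意,非官方实现;嵌入的来源与阈值取值均为假设):

```python
import numpy as np

def fit_class_gaussians(embeddings, labels):
    """对每个已知恶意软件家族拟合 (类中心, 标量尺度),对应球面决策边界(示意)。"""
    return {
        c: (embeddings[labels == c].mean(axis=0), embeddings[labels == c].std() + 1e-8)
        for c in np.unique(labels)
    }

def ood_decision(z, class_stats, z_thresh=3.0):
    """若样本到所有类中心的 z-score 化距离都超过阈值,则判为分布外(示意)。"""
    zscores = np.array([np.linalg.norm(z - mu) / s for mu, s in class_stats.values()])
    return bool(zscores.min() > z_thresh), zscores
```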
zh
[CV-35] Medical Imaging AI Competitions Lack Fairness
【速读】:该论文旨在解决当前医学影像人工智能(Artificial Intelligence, AI)基准竞赛中存在的公平性问题,即现有挑战数据集是否具备足够的代表性、可访问性和可重用性以支持临床有意义的AI发展。其关键解决方案在于开展了一项大规模系统性研究,对241个生物医学图像分析挑战中的458项任务及19种成像模态进行了评估,从两个互补维度考察基准的公平性:一是数据集是否反映真实世界临床多样性,二是是否符合FAIR原则(可发现、可访问、可互操作、可重用)的数据管理标准。研究揭示了数据组成中存在显著的地理、模态和问题类型偏差,并指出多数数据集受限于模糊或严格的访问条件、不一致的许可实践以及文档缺失,从而暴露了基准生态系统的根本性公平缺陷,强调了排行榜表现与临床相关性之间的脱节。
链接: https://arxiv.org/abs/2512.17581
作者: Annika Reinke,Evangelia Christodoulou,Sthuthi Sadananda,A. Emre Kavur,Khrystyna Faryna,Daan Schouten,Bennett A. Landman,Carole Sudre,Olivier Colliot,Nick Heller,Sophie Loizillon,Martin Maška,Maëlys Solal,Arya Yazdan-Panah,Vilma Bozgo,Ömer Sümer,Siem de Jong,Sophie Fischer,Michal Kozubek,Tim Rädsch,Nadim Hammoud,Fruzsina Molnár-Gábor,Steven Hicks,Michael A. Riegler,Anindo Saha,Vajira Thambawita,Pal Halvorsen,Amelia Jiménez-Sánchez,Qingyang Yang,Veronika Cheplygina,Sabrina Bottazzi,Alexander Seitel,Spyridon Bakas,Alexandros Karargyris,Kiran Vaidhya Venkadesh,Bram van Ginneken,Lena Maier-Hein
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to Nature BME
Abstract:Benchmarking competitions are central to the development of artificial intelligence (AI) in medical imaging, defining performance standards and shaping methodological progress. However, it remains unclear whether these benchmarks provide data that are sufficiently representative, accessible, and reusable to support clinically meaningful AI. In this work, we assess fairness along two complementary dimensions: (1) whether challenge datasets are representative of real-world clinical diversity, and (2) whether they are accessible and legally reusable in line with the FAIR principles. To address this question, we conducted a large-scale systematic study of 241 biomedical image analysis challenges comprising 458 tasks across 19 imaging modalities. Our findings show substantial biases in dataset composition, including geographic location, modality-, and problem type-related biases, indicating that current benchmarks do not adequately reflect real-world clinical diversity. Despite their widespread influence, challenge datasets were frequently constrained by restrictive or ambiguous access conditions, inconsistent or non-compliant licensing practices, and incomplete documentation, limiting reproducibility and long-term reuse. Together, these shortcomings expose foundational fairness limitations in our benchmarking ecosystem and highlight a disconnect between leaderboard success and clinical relevance.
zh
[CV-36] 3One2: One-step Regression Plus One-step Diffusion for One-hot Modulation in Dual-path Video Snapshot Compressive Imaging
【速读】:该论文旨在解决视频快照压缩感知(Video Snapshot Compressive Imaging, SCI)中因随机二值调制导致的时域混叠(temporal aliasing)问题,其核心挑战在于如何充分利用一种新型“one-hot”调制方式所具备的完美时域解耦特性。解决方案的关键在于:首先将重建任务转化为生成式视频插补(generative video inpainting)问题,并设计与硬件压缩过程一致的前向过程随机微分方程(Stochastic Differential Equation, SDE);其次提出结合一步回归初始化与一步扩散精修的混合框架,以克服纯扩散方法在视频SCI中的局限性;最后通过双光路硬件设计引入互补信息,缓解one-hot调制带来的空间退化问题。这是首个将扩散模型引入视频SCI重建的工作,显著提升了重建质量与时间分辨率。
链接: https://arxiv.org/abs/2512.17578
作者: Ge Wang,Xing Liu,Xin Yuan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video snapshot compressive imaging (SCI) captures dynamic scene sequences through a two-dimensional (2D) snapshot, fundamentally relying on optical modulation for hardware compression and the corresponding software reconstruction. While mainstream video SCI using random binary modulation has demonstrated success, it inevitably results in temporal aliasing during compression. One-hot modulation, activating only one sub-frame per pixel, provides a promising solution for achieving perfect temporal decoupling, thereby alleviating issues associated with aliasing. However, no algorithms currently exist to fully exploit this potential. To bridge this gap, we propose an algorithm specifically designed for one-hot masks. First, leveraging the decoupling properties of one-hot modulation, we transform the reconstruction task into a generative video inpainting problem and introduce a stochastic differential equation (SDE) of the forward process that aligns with the hardware compression process. Next, we identify limitations of the pure diffusion method for video SCI and propose a novel framework that combines one-step regression initialization with one-step diffusion refinement. Furthermore, to mitigate the spatial degradation caused by one-hot modulation, we implement a dual optical path at the hardware level, utilizing complementary information from another path to enhance the inpainted video. To our knowledge, this is the first work integrating diffusion into video SCI reconstruction. Experiments conducted on synthetic datasets and real scenes demonstrate the effectiveness of our method.
zh
[CV-37] RoomEditor: A Parameter-Sharing Diffusion Architecture for High-Fidelity Furniture Synthesis
【速读】:该论文旨在解决虚拟家具合成(Virtual Furniture Synthesis)中面临的两大挑战:一是缺乏可复现的基准数据集,二是现有图像合成方法在保持背景完整性的同时难以实现高保真度的家具融合。为应对这些问题,作者提出RoomBench++,一个包含112,851对训练样本和1,832对测试样本的公开基准数据集,覆盖真实室内视频与高质量家居渲染场景;并设计RoomEditor++,一种基于扩散模型(Diffusion Model)的通用架构,采用参数共享的双扩散主干网络(parameter-sharing dual diffusion backbone),兼容U-Net与DiT结构,统一参考物体与背景图像的特征提取与修复过程。其核心创新在于参数共享机制,通过强制对齐特征表示,实现精确的几何变换、纹理保留与无缝融合,实验表明该方案在定量指标、定性评估及人类偏好测试中均优于当前最优方法,并展现出无需任务特定微调即可泛化至未见室内场景的能力。
链接: https://arxiv.org/abs/2512.17573
作者: Qilong Wang,Xiaofan Ming,Zhenyi Lin,Jinwen Li,Dongwei Ren,Wangmeng Zuo,Qinghua Hu
机构: Tianjin University (天津大学); Harbin Institute of Technology (哈尔滨工业大学); Engineering Research Center of City Intelligence and Digital Governance, Ministry of Education (教育部城市智能与数字治理工程研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Virtual furniture synthesis, which seamlessly integrates reference objects into indoor scenes while maintaining geometric coherence and visual realism, holds substantial promise for home design and e-commerce applications. However, this field remains underexplored due to the scarcity of reproducible benchmarks and the limitations of existing image composition methods in achieving high-fidelity furniture synthesis while preserving background integrity. To overcome these challenges, we first present RoomBench++, a comprehensive and publicly available benchmark dataset tailored for this task. It consists of 112,851 training pairs and 1,832 testing pairs drawn from both real-world indoor videos and realistic home design renderings, thereby supporting robust training and evaluation under practical conditions. Then, we propose RoomEditor++, a versatile diffusion-based architecture featuring a parameter-sharing dual diffusion backbone, which is compatible with both U-Net and DiT architectures. This design unifies the feature extraction and inpainting processes for reference and background images. Our in-depth analysis reveals that the parameter-sharing mechanism enforces aligned feature representations, facilitating precise geometric transformations, texture preservation, and seamless integration. Extensive experiments validate that RoomEditor++ is superior to state-of-the-art approaches in terms of quantitative metrics, qualitative assessments, and human preference studies, while highlighting its strong generalization to unseen indoor scenes and general scenes without task-specific fine-tuning. The dataset and source code are available at this https URL.
zh
[CV-38] A unified FLAIR hyperintensity segmentation model for various CNS tumor types and acquisition time points
【速读】:该论文旨在解决脑肿瘤影像中FLAIR高信号区域自动分割的临床需求,以辅助诊断、治疗规划及疗效监测。其核心问题是现有方法依赖特定数据集训练模型,缺乏跨肿瘤类型和采集时间点的泛化能力。解决方案的关键在于构建一个统一的注意力U-Net(Attention U-Net)模型,利用来自多个中心、多种肿瘤类型及不同时间点的约5000张FLAIR图像进行训练,实现了在 meningioma、metastasis、glioma 等多种肿瘤类别及术前/术后阶段均保持高性能(Dice评分最高达90.92%),且性能与专用模型相当,显著提升了模型的通用性和临床部署可行性。
链接: https://arxiv.org/abs/2512.17566
作者: Mathilde Gajda Faanes,David Bouget,Asgeir S. Jakola,Timothy R. Smith,Vasileios K. Kavouridis,Francesco Latini,Margret Jensdottir,Peter Milos,Henrietta Nittby Redebrandt,Rickard L. Sjöberg,Rupavathana Mahesparan,Lars Kjelsberg Pedersen,Ole Solheim,Ingerid Reinertsen
机构: SINTEF Digital (SINTEF 数字); University of Gothenburg (哥德堡大学); Sahlgrenska University Hospital (萨尔格伦斯卡大学医院); Brigham and Women’s Hospital (布里格姆妇女医院); Harvard Medical School (哈佛医学院); Uppsala University Hospital (乌普萨拉大学医院); Karolinska University Hospital (卡罗林斯卡大学医院); Linköping University Hospital (林雪平大学医院); Skåne University Hospital (斯堪尼亚大学医院); Umeå University (于默奥大学); Haukeland University Hospital (豪克兰大学医院); University Hospital of North Norway (北挪威大学医院); Norwegian University of Science and Technology (挪威科技大学); St. Olavs University Hospital (圣奥拉夫大学医院); Department of Circulation and Medical Imaging (循环与医学影像系)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages, 4 figures
Abstract:T2-weighted fluid-attenuated inversion recovery (FLAIR) magnetic resonance imaging (MRI) scans are important for diagnosis, treatment planning and monitoring of brain tumors. Depending on the brain tumor type, the FLAIR hyperintensity volume is an important measure to assess the tumor volume or surrounding edema, and an automatic segmentation of this would be useful in the clinic. In this study, around 5000 FLAIR images of various tumor types and acquisition time points from different centers were used to train a unified FLAIR hyperintensity segmentation model using an Attention U-Net architecture. The performance was compared against dataset-specific models, and was validated on different tumor types, acquisition time points, and against BraTS. The unified model achieved an average Dice score of 88.65% for pre-operative meningiomas, 80.08% for pre-operative metastasis, 90.92% for pre-operative and 84.60% for post-operative gliomas from BraTS, and 84.47% for pre-operative and 61.27% for post-operative lower grade gliomas. In addition, the results showed that the unified model achieved comparable segmentation performance to the dataset-specific models on their respective datasets, and enables generalization across tumor types and acquisition time points, which facilitates deployment in a clinical setting. The model is integrated into Raidionics, an open-source software for CNS tumor analysis.
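A minimal sketch of the attention-gate mechanism that gives the Attention U-Net its name is shown below: skip-connection features from the encoder are reweighted by a gating signal from the decoder before concatenation. This is a generic PyTorch illustration with assumed channel sizes, not the authors' implementation.

```python
# Minimal additive attention gate as used in Attention U-Net style decoders.
# Illustrative only; channel sizes and the 2D (slice-wise) setting are assumptions.
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    def __init__(self, skip_ch, gate_ch, inter_ch):
        super().__init__()
        self.w_x = nn.Conv2d(skip_ch, inter_ch, kernel_size=1)   # project skip features
        self.w_g = nn.Conv2d(gate_ch, inter_ch, kernel_size=1)   # project gating signal
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)         # scalar attention map
        self.act = nn.ReLU(inplace=True)

    def forward(self, x_skip, g):
        # x_skip: encoder skip connection; g: upsampled decoder feature (same spatial size)
        attn = torch.sigmoid(self.psi(self.act(self.w_x(x_skip) + self.w_g(g))))
        return x_skip * attn  # suppress irrelevant regions before concatenation

# Toy usage on a FLAIR-like feature map
gate = AttentionGate(skip_ch=64, gate_ch=64, inter_ch=32)
x = torch.randn(1, 64, 128, 128)
g = torch.randn(1, 64, 128, 128)
print(gate(x, g).shape)  # torch.Size([1, 64, 128, 128])
```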
zh
[CV-39] G3Splat: Geometrically Consistent Generalizable Gaussian Splatting
【速读】:该论文旨在解决现有基于3D高斯(3D Gaussians)的场景表示方法在自监督学习下难以恢复几何上一致的点云 splatting 问题,尤其在无相机位姿信息的情况下,仅依赖视图合成损失(view-synthesis loss)会导致重建结果缺乏几何合理性。其解决方案的关键在于引入几何先验约束(geometric priors),提出 G3Splat 方法,通过显式建模和优化几何一致性来提升模型对 3D 高斯参数(如位置、方向、尺度、不透明度等)的学习能力,从而实现无需相机位姿监督的通用性 splatting 表示。该方法在 RE10K 数据集上实现了几何一致性重建、相对位姿估计和新视角合成三项任务的最先进性能,并在 ScanNet 上展现出显著的零样本泛化能力。
链接: https://arxiv.org/abs/2512.17547
作者: Mehdi Hosseinzadeh,Shin-Fang Chng,Yi Xu,Simon Lucey,Ian Reid,Ravi Garg
机构: Australian Institute for Machine Learning (澳大利亚机器学习研究所); Goertek Alpha Labs (歌尔Alpha实验室); MBZUAI (穆罕默德·本·扎耶德人工智能大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:3D Gaussians have recently emerged as an effective scene representation for real-time splatting and accurate novel-view synthesis, motivating several works to adapt multi-view structure prediction networks to regress per-pixel 3D Gaussians from images. However, most prior work extends these networks to predict additional Gaussian parameters – orientation, scale, opacity, and appearance – while relying almost exclusively on view-synthesis supervision. We show that a view-synthesis loss alone is insufficient to recover geometrically meaningful splats in this setting. We analyze and address the ambiguities of learning 3D Gaussian splats under self-supervision for pose-free generalizable splatting, and introduce G3Splat, which enforces geometric priors to obtain geometrically consistent 3D scene representations. Trained on RE10K, our approach achieves state-of-the-art performance in (i) geometrically consistent reconstruction, (ii) relative pose estimation, and (iii) novel-view synthesis. We further demonstrate strong zero-shot generalization on ScanNet, substantially outperforming prior work in both geometry recovery and relative pose estimation. Code and pretrained models are released on our project page (this https URL).
zh
[CV-40] ClothHMR: 3D Mesh Recovery of Humans in Diverse Clothing from Single Image
【速读】:该论文旨在解决当前3D人体网格恢复(3D human mesh recovery)技术在处理多样化服装、尤其是宽松衣物时性能不佳的问题。现有方法主要针对贴身衣物设计,难以准确估计穿着不同服装时的人体形状和姿态。解决方案的关键在于提出ClothHMR框架,其核心创新包括两个模块:一是服装裁剪(Clothing Tailoring, CT),通过人体语义分割与边缘预测对服装进行贴合人体轮廓的裁剪;二是基于大模型视觉表征的网格恢复(FHVM-based Mesh Recovering, MR),利用大模型(Foundation Human Visual Model, FHVM)提供的高质量人体视觉特征来优化初始3D网格参数,从而提升对复杂服装场景下的泛化能力。该方法显著优于现有最先进方法,并已在真实场景中验证其应用潜力。
链接: https://arxiv.org/abs/2512.17545
作者: Yunqi Gao,Leyuan Liu,Yuhan Li,Changxin Gao,Yuanyuan Liu,Jingying Chen
机构: Central China Normal University (华中师范大学); Huazhong University of Science and Technology (华中科技大学); China University of Geosciences (Wuhan) (中国地质大学(武汉))
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 15 pages,16 figures
Abstract:With 3D data rapidly emerging as an important form of multimedia information, 3D human mesh recovery technology has also advanced accordingly. However, current methods mainly focus on handling humans wearing tight clothing and perform poorly when estimating body shapes and poses under diverse clothing, especially loose garments. To this end, we present two key insights: (1) tailoring clothing to fit the human body can mitigate the adverse impact of clothing on 3D human mesh recovery, and (2) utilizing human visual information from large foundational models can enhance the generalization ability of the estimation. Based on these insights, we propose ClothHMR to accurately recover 3D meshes of humans in diverse clothing. ClothHMR primarily consists of two modules: clothing tailoring (CT) and FHVM-based mesh recovering (MR). The CT module employs body semantic estimation and body edge prediction to tailor the clothing, ensuring it fits the body silhouette. The MR module optimizes the initial parameters of the 3D human mesh by continuously aligning the intermediate representations of the 3D mesh with those inferred from the foundational human visual model (FHVM). ClothHMR can accurately recover 3D meshes of humans wearing diverse clothing, precisely estimating their body shapes and poses. Experimental results demonstrate that ClothHMR significantly outperforms existing state-of-the-art methods across benchmark datasets and in-the-wild images. Additionally, a web application for online fashion and shopping powered by ClothHMR is developed, illustrating that ClothHMR can effectively serve real-world usage scenarios. The code and model for ClothHMR are available at: this https URL.
zh
[CV-41] FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views
【速读】:该论文旨在解决从任意视角的未标定多视图图像中,实现语言嵌入的3D高斯表示重建问题。现有方法通常采用前馈网络结合高斯头进行重建,但受限于固定输入视角和3D训练数据不足的问题。其关键解决方案是提出一种无需3D标注的训练框架(3D-annotation-free training framework),通过利用大规模视频数据中的易获取2D实例信息来增强语义嵌入,并引入实例引导的对比学习策略以对齐2D语义与3D表示;同时,为降低密集视图带来的高内存与计算开销,进一步设计了几何-语义层次稀疏化策略,从而在前向传播中高效重建高质量的、语言对齐的3D高斯表示。
链接: https://arxiv.org/abs/2512.17541
作者: Qijian Tian,Xin Tan,Jiayu Ying,Xuhong Wang,Yuan Xie,Lizhuang Ma
机构: Shanghai Jiao Tong University (上海交通大学); East China Normal University (华东师范大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:We present FLEG, a feed-forward network that reconstructs language-embedded 3D Gaussians from any views. Previous straightforward solutions combine feed-forward reconstruction with Gaussian heads but suffer from fixed input views and insufficient 3D training data. In contrast, we propose a 3D-annotation-free training framework for 2D-to-3D lifting from arbitrary uncalibrated and unposed multi-view images. Since the framework does not require 3D annotations, we can leverage large-scale video data with easily obtained 2D instance information to enrich semantic embedding. We also propose an instance-guided contrastive learning to align 2D semantics with the 3D representations. In addition, to mitigate the high memory and computational cost of dense views, we further propose a geometry-semantic hierarchical sparsification strategy. Our FLEG efficiently reconstructs language-embedded 3D Gaussian representation in a feed-forward manner from arbitrary sparse or dense views, jointly producing accurate geometry, high-fidelity appearance, and language-aligned semantics. Extensive experiments show that it outperforms existing methods on various related tasks. Project page: this https URL.
zh
[CV-42] Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding AAAI2026
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在极端现实世界视觉退化条件下性能不可靠的问题,这一问题显著限制了其实际鲁棒性。现有方法主要依赖于隐式训练或适应策略,仅关注视觉编码器的泛化能力,存在可解释性差和优化孤立等问题。解决方案的关键在于提出Robust-R1框架,通过结构化的推理链显式建模视觉退化过程,包含三个核心机制:(i) 监督微调以构建退化感知的推理基础,(ii) 奖励驱动对齐以精准感知退化参数,(iii) 动态推理深度缩放以适配退化强度。该方法借助一个包含11K样本、涵盖四个关键现实视觉处理阶段的专用数据集,实现了从退化参数到感知影响、原始语义推理链再到结论的结构化关联标注,从而显著提升模型在真实退化场景下的鲁棒性表现。
链接: https://arxiv.org/abs/2512.17532
作者: Jiaqi Tang,Jianmin Chen,Wei Wei,Xiaogang Xu,Runtao Liu,Xiangyu Wu,Qipeng Xie,Jiafei Wu,Lei Zhang,Qifeng Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI2026 Oral
Abstract:Multimodal Large Language Models struggle to maintain reliable performance under extreme real-world visual degradations, which impede their practical robustness. Existing robust MLLMs predominantly rely on implicit training/adaptation that focuses solely on visual encoder generalization, suffering from limited interpretability and isolated optimization. To overcome these limitations, we propose Robust-R1, a novel framework that explicitly models visual degradations through structured reasoning chains. Our approach integrates: (i) supervised fine-tuning for degradation-aware reasoning foundations, (ii) reward-driven alignment for accurately perceiving degradation parameters, and (iii) dynamic reasoning depth scaling adapted to degradation intensity. To facilitate this approach, we introduce a specialized 11K dataset featuring realistic degradations synthesized across four critical real-world visual processing stages, each annotated with structured chains connecting degradation parameters, perceptual influence, pristine semantic reasoning chain, and conclusion. Comprehensive evaluations demonstrate state-of-the-art robustness: Robust-R1 outperforms all general and robust baselines on the real-world degradation benchmark R-Bench, while maintaining superior anti-degradation performance under multi-intensity adversarial degradations on MMMB, MMStar, and RealWorldQA.
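Since the alignment stage relies on Group Relative Policy Optimization (GRPO), the sketch below shows the group-relative advantage computation that gives the method its name; the sampled rewards and group size are invented for illustration and do not come from the paper.

```python
# Sketch of the group-relative advantage used in GRPO-style reinforcement learning:
# several responses are sampled per prompt and each reward is normalized against the
# group mean/std. The reward values here are hypothetical.
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g. rewards of four sampled answers to one degraded-image query
rewards = [1.0, 0.0, 0.5, 1.0]
print(group_relative_advantages(rewards))
```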
zh
[CV-43] PathBench-MIL: A Comprehensive AutoML and Benchmarking Framework for Multiple Instance Learning in Histopathology
【速读】:该论文旨在解决组织病理学中多实例学习(Multiple Instance Learning, MIL)流程自动化与模型评估标准化的问题。当前MIL在医学图像分析中的应用受限于手动构建复杂pipeline、缺乏统一的基准测试平台以及不同模型和特征提取器之间难以公平比较。解决方案的关键在于提出PathBench-MIL,一个开源的自动机器学习(AutoML)与基准测试框架,其核心能力包括:端到端MIL流程自动化(涵盖预处理、特征提取与MIL聚合)、支持数十种MIL模型和特征提取器的可复现基准测试、集成可视化工具与统一配置系统,并具备模块化扩展性,从而显著提升实验效率并促进跨数据集和任务的标准对比。
链接: https://arxiv.org/abs/2512.17517
作者: Siemen Brussee,Pieter A. Valkema,Jurre A. J. Weijer,Thom Doeleman,Anne M.R. Schrader,Jesper Kers
机构: Leiden University Medical Center (莱顿大学医学中心); Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Software Engineering (cs.SE); Tissues and Organs (q-bio.TO)
备注: 14 Pages, 3 Figures, 2 Appendices
Abstract:We introduce PathBench-MIL, an open-source AutoML and benchmarking framework for multiple instance learning (MIL) in histopathology. The system automates end-to-end MIL pipeline construction, including preprocessing, feature extraction, and MIL-aggregation, and provides reproducible benchmarking of dozens of MIL models and feature extractors. PathBench-MIL integrates visualization tooling, a unified configuration system, and modular extensibility, enabling rapid experimentation and standardization across datasets and tasks. PathBench-MIL is publicly available at this https URL
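As an example of the MIL aggregators such a framework benchmarks, here is a minimal attention-based MIL pooling layer in the spirit of Ilse et al.; it is a generic sketch with an assumed feature dimension, not PathBench-MIL code.

```python
# Minimal attention-based MIL aggregator: patch-level features of one slide ("bag")
# are pooled into a slide-level embedding via learned attention weights.
# Generic illustration, not PathBench-MIL's implementation.
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=128, n_classes=2):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.Tanh(),
                                  nn.Linear(hidden_dim, 1))
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, bag):                                # bag: (num_patches, feat_dim)
        weights = torch.softmax(self.attn(bag), dim=0)     # (num_patches, 1)
        slide_embedding = (weights * bag).sum(dim=0)       # (feat_dim,)
        return self.classifier(slide_embedding), weights

bag = torch.randn(200, 512)                # 200 patch embeddings from one whole-slide image
logits, attn = AttentionMILPooling()(bag)
print(logits.shape, attn.shape)            # torch.Size([2]) torch.Size([200, 1])
```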
zh
[CV-44] Foundation Model Priors Enhance Object Focus in Feature Space for Source-Free Object Detection
【速读】:该论文旨在解决源域无关目标检测(Source-Free Object Detection, SFOD)中因领域偏移导致的检测器对象聚焦能力下降问题,具体表现为高置信度激活出现在背景杂波区域,从而生成不可靠的伪标签。解决方案的关键在于提出FALCON-SFOD框架,其核心创新是双组件协同设计:SPAR(Spatial Prior-Aware Regularization)利用视觉基础模型(Vision Foundation Models)的泛化能力,通过类无关二值掩码(来自OV-SAM)引导网络关注物体区域,增强特征空间的对象聚焦性;IRPL(Imbalance-aware Noise Robust Pseudo-Labeling)则在严重前景-背景不平衡条件下提升伪标签的鲁棒性和平衡性。二者共同优化检测器在领域偏移下的定位与分类误差边界,显著提升SFOD性能。
链接: https://arxiv.org/abs/2512.17514
作者: Sairam VCR,Rishabh Lalla,Aveen Dayal,Tejal Kulkarni,Anuj Lalla,Vineeth N Balasubramanian,Muhammad Haris Khan
机构: IIT Hyderabad (印度理工学院海得拉巴分校); MBZUAI, Abu Dhabi (穆罕默德·本·扎耶德人工智能大学,阿布扎比); UC San Diego (加州大学圣地亚哥分校); IIT Jodhpur (印度理工学院乔达普尔分校); Microsoft Research India (微软研究院印度)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Current state-of-the-art approaches in Source-Free Object Detection (SFOD) typically rely on Mean-Teacher self-labeling. However, domain shift often reduces the detector’s ability to maintain strong object-focused representations, causing high-confidence activations over background clutter. This weak object focus results in unreliable pseudo-labels from the detection head. While prior works mainly refine these pseudo-labels, they overlook the underlying need to strengthen the feature space itself. We propose FALCON-SFOD (Foundation-Aligned Learning with Clutter suppression and Noise robustness), a framework designed to enhance object-focused adaptation under domain shift. It consists of two complementary components. SPAR (Spatial Prior-Aware Regularization) leverages the generalization strength of vision foundation models to regularize the detector’s feature space. Using class-agnostic binary masks derived from OV-SAM, SPAR promotes structured and foreground-focused activations by guiding the network toward object regions. IRPL (Imbalance-aware Noise Robust Pseudo-Labeling) complements SPAR by promoting balanced and noise-tolerant learning under severe foreground-background imbalance. Guided by a theoretical analysis that connects these designs to tighter localization and classification error bounds, FALCON-SFOD achieves competitive performance across SFOD benchmarks.
zh
[CV-45] Adaptive Covariance and Quaternion-Focused Hybrid Error-State EKF/UKF for Visual-Inertial Odometry
【速读】:该论文旨在解决无人机(UAV)在复杂环境中因传感器可靠性波动导致的位姿估计不准确问题,特别是在视觉测量受光照变化、运动模糊等干扰时易失效的挑战。解决方案的关键在于提出一种基于松耦合架构的混合式视觉惯性里程计(Visual-Inertial Odometry, VIO)方法,其核心创新为“Quaternion-focused Error-State EKF/UKF”(Qf-ES-EKF/UKF)结构:首先使用误差状态扩展卡尔曼滤波器(Error-State Extended Kalman Filter, ESKF)对全状态进行高效传播,随后引入缩放无迹卡尔曼滤波器(Scaled Unscented Kalman Filter, SUKF)仅针对姿态四元数(quaternion)进行精细化修正,从而兼顾了SUKF在姿态估计中的高精度与ESKF的整体计算效率;同时,通过动态构建视觉传感器置信度评分(基于图像熵、强度变化、运动模糊和推理质量等指标)自适应调整观测噪声协方差,实现对传感器可靠性的实时评估与鲁棒性增强。
链接: https://arxiv.org/abs/2512.17505
作者: Ufuk Asil,Efendi Nasibov
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This study presents an innovative hybrid Visual-Inertial Odometry (VIO) method for Unmanned Aerial Vehicles (UAVs) that is resilient to environmental challenges and capable of dynamically assessing sensor reliability. Built upon a loosely coupled sensor fusion architecture, the system utilizes a novel hybrid Quaternion-focused Error-State EKF/UKF (Qf-ES-EKF/UKF) architecture to process inertial measurement unit (IMU) data. This architecture first propagates the entire state using an Error-State Extended Kalman Filter (ESKF) and then applies a targeted Scaled Unscented Kalman Filter (SUKF) step to refine only the orientation. This sequential process blends the accuracy of SUKF in quaternion estimation with the overall computational efficiency of ESKF. The reliability of visual measurements is assessed via a dynamic sensor confidence score based on metrics, such as image entropy, intensity variation, motion blur, and inference quality, adapting the measurement noise covariance to ensure stable pose estimation even under challenging conditions. Comprehensive experimental analyses on the EuRoC MAV dataset demonstrate key advantages: an average improvement of 49% in position accuracy in challenging scenarios, an average of 57% in rotation accuracy over ESKF-based methods, and SUKF-comparable accuracy achieved with approximately 48% lower computational cost than a full SUKF implementation. These findings demonstrate that the presented approach strikes an effective balance between computational efficiency and estimation accuracy, and significantly enhances UAV pose estimation performance in complex environments with varying sensor reliability.
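One easily illustrated ingredient is the confidence-based adaptation of the visual measurement noise: a scalar reliability score derived from image statistics rescales the measurement covariance before the Kalman update. The weighting of the cues below is an assumption for illustration, not the paper's exact formula.

```python
# Sketch of confidence-weighted measurement-noise adaptation for a VIO update step.
# How the confidence score combines entropy / intensity / blur cues is an assumption.
import numpy as np

def visual_confidence(entropy, intensity_var, blur_score):
    # Map simple image statistics to a [0, 1] reliability score (hypothetical weighting).
    score = 0.4 * entropy + 0.3 * intensity_var + 0.3 * (1.0 - blur_score)
    return float(np.clip(score, 0.05, 1.0))

def adapt_measurement_covariance(R_nominal, confidence):
    # Low confidence inflates R, so the Kalman gain trusts the camera less.
    return R_nominal / confidence

R_nominal = np.diag([0.01, 0.01, 0.02])            # nominal position-measurement noise
conf = visual_confidence(entropy=0.7, intensity_var=0.6, blur_score=0.8)
print(conf, np.diag(adapt_measurement_covariance(R_nominal, conf)))
```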
zh
[CV-46] InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion
【速读】:该论文旨在解决现实视频中对象插入(Video Object Insertion, VOI)的挑战,特别是由于缺乏对4D场景的充分理解以及对遮挡和光照效应处理不足所导致的几何不一致与视觉失真问题。解决方案的关键在于提出InsertAnywhere框架,其核心创新包括:1)一个4D感知的掩码生成模块,能够重建场景几何并跨帧传播用户指定的对象位置,同时保持时间一致性与遮挡一致性;2)基于扩散模型的视频生成扩展,用于联合合成插入对象及其周围局部变化(如光照和阴影),从而实现外观忠实的视频合成;3)引入ROSE++数据集,通过将对象移除视频转换为三元组(移除视频、存在视频、VLM生成参考图像)以支持监督训练,显著提升插入结果的几何合理性与视觉连贯性。
链接: https://arxiv.org/abs/2512.17504
作者: Hoiyeong Jin,Hyojin Jang,Jeongho Kim,Junha Hyung,Kinam Kim,Dongjin Kim,Huijin Choi,Hyeonji Kim,Jaegul Choo
机构: KAIST AI (韩国科学技术院人工智能); SK Telecom (SK电信)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages, project page: this https URL
Abstract:Recent advances in diffusion-based video generation have opened new possibilities for controllable video editing, yet realistic video object insertion (VOI) remains challenging due to limited 4D scene understanding and inadequate handling of occlusion and lighting effects. We present InsertAnywhere, a new VOI framework that achieves geometrically consistent object placement and appearance-faithful video synthesis. Our method begins with a 4D aware mask generation module that reconstructs the scene geometry and propagates user specified object placement across frames while maintaining temporal coherence and occlusion consistency. Building upon this spatial foundation, we extend a diffusion based video generation model to jointly synthesize the inserted object and its surrounding local variations such as illumination and shading. To enable supervised training, we introduce ROSE++, an illumination aware synthetic dataset constructed by transforming the ROSE object removal dataset into triplets of object removed video, object present video, and a VLM generated reference image. Through extensive experiments, we demonstrate that our framework produces geometrically plausible and visually coherent object insertions across diverse real world scenarios, significantly outperforming existing research and commercial models.
zh
[CV-47] Validation of Diagnostic Artificial Intelligence Models for Prostate Pathology in a Middle Eastern Cohort
【速读】:该论文旨在解决当前病理学人工智能(AI)模型在非欧美地区,特别是中东等代表性不足人群中的验证缺失问题,以推动全球范围内公平且高效的AI辅助诊断应用。其关键解决方案在于首次在一个来自伊拉克库尔德斯坦地区的连续前列腺活检样本队列(n=339)中对外部验证了三种AI模型——包括一个任务特定的端到端模型和两个基础模型——并证明这些模型在与病理学家的分级一致性上达到病理学家间的一致性水平(Cohen’s quadratically weighted kappa 0.801 vs. 0.799),同时在三种不同扫描仪(Hamamatsu、Leica、Grundium)之间展现出高跨设备一致性(kappa ≥ 0.90),即便使用低成本紧凑型扫描仪亦然。这一成果表明AI具备在资源有限环境中实现可靠病理评估的能力,并为后续全球可及的AI病理研究提供了首个公开可用的中东数字病理数据集。
链接: https://arxiv.org/abs/2512.17499
作者: Peshawa J. Muhammad Ali(1 and 2),Navin Vincent(3),Saman S. Abdulla(4 and 5),Han N. Mohammed Fadhl(6),Anders Blilie(7 and 8),Kelvin Szolnoky(9),Julia Anna Mielcarz(3),Xiaoyi Ji(9),Nita Mulliqi(3),Abdulbasit K. Al-Talabani(1),Kimmo Kartasalo(3) ((1) Department of Software Engineering, Faculty of Engineering, Koya University, Koya 44023, Kurdistan Region - F.R. Iraq, (2) Department of Mechanical and Manufacturing Engineering, Faculty of Engineering, Koya University, Koya 44023, Kurdistan Region - F.R. Iraq, (3) Department of Medical Epidemiology and Biostatistics, SciLifeLab, Karolinska Institutet, Stockholm, Sweden, (4) College of Dentistry, Hawler Medical University, Erbil, Kurdistan Region, Iraq, (5) PAR Private Hospital, Erbil, Kurdistan Region, Iraq, (6) College of Dentistry, University of Sulaimani, Sulaymaniyah, Kurdistan Region, Iraq, (7) Department of Pathology, Stavanger University Hospital, Stavanger, Norway, (8) Faculty of Health Sciences, University of Stavanger, Stavanger, Norway, (9) Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 40 pages, 8 figures, 11 tables
Abstract:Background: Artificial intelligence (AI) is improving the efficiency and accuracy of cancer diagnostics. The performance of pathology AI systems has been almost exclusively evaluated on European and US cohorts from large centers. For global AI adoption in pathology, validation studies on currently under-represented populations - where the potential gains from AI support may also be greatest - are needed. We present the first study with an external validation cohort from the Middle East, focusing on AI-based diagnosis and Gleason grading of prostate cancer. Methods: We collected and digitised 339 prostate biopsy specimens from the Kurdistan region, Iraq, representing a consecutive series of 185 patients spanning the period 2013-2024. We evaluated a task-specific end-to-end AI model and two foundation models in terms of their concordance with pathologists and consistency across samples digitised on three scanner models (Hamamatsu, Leica, and Grundium). Findings: Grading concordance between AI and pathologists was similar to pathologist-pathologist concordance with Cohen’s quadratically weighted kappa 0.801 vs. 0.799 (p=0.9824). Cross-scanner concordance was high (quadratically weighted kappa 0.90) for all AI models and scanner pairs, including low-cost compact scanner. Interpretation: AI models demonstrated pathologist-level performance in prostate histopathology assessment. Compact scanners can provide a route for validation studies in non-digitalised settings and enable cost-effective adoption of AI in laboratories with limited sample volumes. This first openly available digital pathology dataset from the Middle East supports further research into globally equitable AI pathology. Funding: SciLifeLab and Wallenberg Data Driven Life Science Program, Instrumentarium Science Foundation, Karolinska Institutet Research Foundation.
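The study's headline metric, Cohen's quadratically weighted kappa, can be computed as in the sketch below; the two grade vectors are toy data, not the study's measurements.

```python
# Quadratically weighted Cohen's kappa for ordinal labels such as tumor grades.
# The two rating vectors are invented for illustration.
import numpy as np

def quadratic_weighted_kappa(r1, r2, n_classes):
    r1, r2 = np.asarray(r1), np.asarray(r2)
    observed = np.zeros((n_classes, n_classes))
    for a, b in zip(r1, r2):
        observed[a, b] += 1
    observed /= observed.sum()
    expected = np.outer(np.bincount(r1, minlength=n_classes),
                        np.bincount(r2, minlength=n_classes)).astype(float)
    expected /= expected.sum()
    i, j = np.indices((n_classes, n_classes))
    weights = (i - j) ** 2 / (n_classes - 1) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

ai_grades          = [0, 1, 2, 2, 3, 4, 1, 0]    # hypothetical AI grades
pathologist_grades = [0, 1, 2, 3, 3, 4, 1, 1]    # hypothetical reference grades
print(round(quadratic_weighted_kappa(ai_grades, pathologist_grades, n_classes=5), 3))
```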
zh
[CV-48] GroundingME: Exposing the Visual Grounding Gap in MLLM s through Multi-Dimensional Evaluation
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉定位(Visual Grounding)任务中是否存在“伪理解”问题,即模型是否具备真正的人类级语义对齐能力,而非仅依赖数据集中的模式匹配。现有基准测试无法反映现实世界中模糊指代、不可定位查询等复杂场景,导致评估结果高估了模型的真实性能。为此,作者提出GroundingME基准,系统性地从四个维度挑战模型:判别性(Discriminative)、空间关系(Spatial)、有限条件(Limited)和拒绝能力(Rejection),并通过自动化生成与人工验证相结合的方式构建1,005个高难度样本。其关键创新在于引入“拒绝能力”这一维度,并发现多数模型在无法定位的查询上仍会错误生成对象,暴露出严重的安全风险;同时提出两种改进策略:测试时缩放(test-time scaling)通过选择最优推理轨迹提升复杂定位准确率,以及数据混合训练(data-mixture training)使模型学会识别不可定位查询,显著提升拒绝准确率至27.9%,从而为迈向人类水平的视觉定位提供可操作的诊断工具与优化路径。
链接: https://arxiv.org/abs/2512.17495
作者: Rang Li,Lei Li,Shuhuai Ren,Hao Tian,Shuhao Gu,Shicheng Li,Zihao Yue,Yudong Wang,Wenhan Ma,Zhe Yang,Jingyuan Ma,Zhifang Sui,Fuli Luo
机构: Peking University (北京大学); LLM-Core Xiaomi (小米); The University of Hong Kong (香港大学); Renmin University of China (中国人民大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual grounding, localizing objects from natural language descriptions, represents a critical bridge between language and vision understanding. While multimodal large language models (MLLMs) achieve impressive scores on existing benchmarks, a fundamental question remains: can MLLMs truly ground language in vision with human-like sophistication, or are they merely pattern-matching on simplified datasets? Current benchmarks fail to capture real-world complexity where humans effortlessly navigate ambiguous references and recognize when grounding is impossible. To rigorously assess MLLMs’ true capabilities, we introduce GroundingME, a benchmark that systematically challenges models across four critical dimensions: (1) Discriminative, distinguishing highly similar objects, (2) Spatial, understanding complex relational descriptions, (3) Limited, handling occlusions or tiny objects, and (4) Rejection, recognizing ungroundable queries. Through careful curation combining automated generation with human verification, we create 1,005 challenging examples mirroring real-world complexity. Evaluating 25 state-of-the-art MLLMs reveals a profound capability gap: the best model achieves only 45.1% accuracy, while most score 0% on rejection tasks, reflexively hallucinating objects rather than acknowledging their absence, raising critical safety concerns for deployment. We explore two strategies for improvements: (1) test-time scaling selects optimal response by thinking trajectory to improve complex grounding by up to 2.9%, and (2) data-mixture training teaches models to recognize ungroundable queries, boosting rejection accuracy from 0% to 27.9%. GroundingME thus serves as both a diagnostic tool revealing current limitations in MLLMs and a roadmap toward human-level visual grounding.
zh
[CV-49] MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding
【速读】:该论文旨在解决当前地理空间基准测试在多模态数据覆盖上的局限性问题,即现有方法难以在一个统一框架内整合多种模态信息(如高分辨率航空图像、地面视角图像、文本描述和地理坐标),从而限制了地理空间理解的进展。解决方案的关键在于构建了一个具有跨模态一一对应关系的多模态地标数据集——Multi-Modal Landmark dataset (MMLANDMARKS),包含197k张高分辨率航空图像、329k张地面视角图像、文本信息及18,557个美国地标点的地理坐标,支持多种地理空间任务(如跨视图地面对卫星检索、地理定位、文本到图像/GPS检索)。通过采用一个简单的CLIP-inspired基线模型,作者展示了该数据集在多个任务上具备良好的泛化能力和与先进模型相当的性能,证明了多模态数据集对实现广泛地理空间理解的重要性。
链接: https://arxiv.org/abs/2512.17492
作者: Oskar Kristoffersen,Alba R. Sánchez,Morten R. Hannemose,Anders B. Dahl,Dim P. Papadopoulos
机构: Technical University of Denmark (丹麦技术大学); Pioneer Center for AI (先锋人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Geo-spatial analysis of our world benefits from a multimodal approach, as every single geographic location can be described in numerous ways (images from various viewpoints, textual descriptions, and geographic coordinates). Current geo-spatial benchmarks have limited coverage across modalities, considerably restricting progress in the field, as current approaches cannot integrate all relevant modalities within a unified framework. We introduce the Multi-Modal Landmark dataset (MMLANDMARKS), a benchmark composed of four modalities: 197k highresolution aerial images, 329k ground-view images, textual information, and geographic coordinates for 18,557 distinct landmarks in the United States. The MMLANDMARKS dataset has a one-to-one correspondence across every modality, which enables training and benchmarking models for various geo-spatial tasks, including cross-view Ground-to-Satellite retrieval, ground and satellite geolocalization, Text-to-Image, and Text-to-GPS retrieval. We demonstrate broad generalization and competitive performance against off-the-shelf foundational models and specialized state-of-the-art models across different tasks by employing a simple CLIP-inspired baseline, illustrating the necessity for multimodal datasets to achieve broad geo-spatial understanding.
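The "CLIP-inspired baseline" reduces to a symmetric contrastive (InfoNCE) objective between paired embeddings, e.g. a ground view and the aerial view of the same landmark; the sketch below is generic, and the batch size and embedding dimension are assumptions.

```python
# Symmetric CLIP-style contrastive loss between two modalities of the same landmark
# (e.g. ground-view and aerial-view embeddings). Generic sketch, not the paper's code.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(emb_a, emb_b, temperature=0.07):
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(emb_a.size(0))              # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

ground = torch.randn(8, 256)   # 8 ground-view embeddings (dimension assumed)
aerial = torch.randn(8, 256)   # the 8 corresponding aerial-view embeddings
print(clip_contrastive_loss(ground, aerial).item())
```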
zh
[CV-50] LumiCtrl : Learning Illuminant Prompts for Lighting Control in Personalized Text-to-Image Models
【速读】:该论文旨在解决当前文本到图像(Text-to-Image, T2I)生成模型在场景光照控制方面的不足,即缺乏对场景光源(illuminant)的精确调控能力,而这对于内容设计师而言是影响图像情绪、氛围和视觉美感的关键因素。解决方案的核心在于提出一种名为 LumiCtrl 的光照个性化方法,其关键创新包括:(1) 基于物理的光照增强技术,在普朗克轨迹(Planckian locus)上生成标准光源下的微调变体;(2) 利用冻结的 ControlNet 实现边缘引导的提示解耦(prompt disentanglement),确保提示词聚焦于光照而非结构信息;(3) 引入掩码重建损失(masked reconstruction loss),仅关注前景物体的学习,同时允许背景进行上下文自适应,从而实现所谓的“情境光适应”(contextual light adaptation)。该方法显著提升了光照保真度、美学质量与场景一致性。
链接: https://arxiv.org/abs/2512.17489
作者: Muhammad Atif Butt,Kai Wang,Javier Vazquez-Corral,Joost Van De Weijer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Current text-to-image (T2I) models have demonstrated remarkable progress in creative image generation, yet they still lack precise control over scene illuminants, which is a crucial factor for content designers aiming to manipulate the mood, atmosphere, and visual aesthetics of generated images. In this paper, we present an illuminant personalization method named LumiCtrl that learns an illuminant prompt given a single image of an object. LumiCtrl consists of three basic components: given an image of the object, our method applies (a) physics-based illuminant augmentation along the Planckian locus to create fine-tuning variants under standard illuminants; (b) edge-guided prompt disentanglement using a frozen ControlNet to ensure prompts focus on illumination rather than structure; and (c) a masked reconstruction loss that focuses learning on the foreground object while allowing the background to adapt contextually, enabling what we call contextual light adaptation. We qualitatively and quantitatively compare LumiCtrl against other T2I customization methods. The results show that our method achieves significantly better illuminant fidelity, aesthetic quality, and scene coherence compared to existing personalization baselines. A human preference study further confirms strong user preference for LumiCtrl outputs. The code and data will be released upon publication.
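The physics-based illuminant augmentation can be approximated by von Kries-style channel scaling toward the RGB of a target illuminant on the Planckian locus; the illuminant RGB values below are rough stand-ins chosen for illustration only, not the authors' exact procedure.

```python
# Von Kries-style relighting toward a target illuminant: scale each channel by the
# illuminant's normalized RGB. The illuminant RGB values below are rough approximations
# of blackbody colours and only serve to illustrate the augmentation idea.
import numpy as np

APPROX_ILLUMINANT_RGB = {          # very rough values, for illustration only
    "2850K_incandescent": (1.00, 0.72, 0.45),
    "4000K_warm_white":   (1.00, 0.85, 0.68),
    "6500K_daylight":     (1.00, 1.00, 1.00),
    "9000K_blue_sky":     (0.82, 0.90, 1.00),
}

def relight(image, illuminant):
    gains = np.array(APPROX_ILLUMINANT_RGB[illuminant], dtype=np.float32)
    gains = gains / gains.mean()                    # keep overall brightness roughly stable
    return np.clip(image * gains, 0.0, 1.0)

img = np.random.rand(64, 64, 3).astype(np.float32)   # stand-in for an object crop
variants = {name: relight(img, name) for name in APPROX_ILLUMINANT_RGB}
print({k: v.shape for k, v in variants.items()})
```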
zh
[CV-51] winSegNet: A Digital Twin-Enabled Federated Learning Framework for Brain Tumor Analysis
【速读】:该论文旨在解决脑肿瘤分割中因依赖集中式数据收集而引发的隐私泄露问题,并提升模型在异构医疗机构间的泛化能力。其解决方案的关键在于提出一种隐私保护的联邦学习框架TwinSegNet,该框架融合了混合ViT-UNet结构与个性化数字孪生(digital twin)机制:通过将卷积编码器与视觉Transformer(Vision Transformer, ViT)瓶颈结合以同时捕获局部和全局上下文信息,各机构基于自身私有数据微调全局模型形成专属数字孪生,从而实现高精度、实时的脑肿瘤分割,且在非独立同分布(non-IID)客户端分布下仍保持优异性能(Dice分数高达0.90%,敏感度/特异性均超90%)。
链接: https://arxiv.org/abs/2512.17488
作者: Almustapha A. Wakili,Adamu Hussaini,Abubakar A. Musa,Woosub Jung,Wei Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: IEEE Virtual Conference on Communications. 4-6 November 2025
Abstract:Brain tumor segmentation is critical in diagnosis and treatment planning for the disease. Yet, current deep learning methods rely on centralized data collection, which raises privacy concerns and limits generalization across diverse institutions. In this paper, we propose TwinSegNet, which is a privacy-preserving federated learning framework that integrates a hybrid ViT-UNet model with personalized digital twins for accurate and real-time brain tumor segmentation. Our architecture combines convolutional encoders with Vision Transformer bottlenecks to capture local and global context. Each institution fine-tunes the global model of private data to form its digital twin. Evaluated on nine heterogeneous MRI datasets, including BraTS 2019-2021 and custom tumor collections, TwinSegNet achieves high Dice scores (up to 0.90%) and sensitivity/specificity exceeding 90%, demonstrating robustness across non-independent and identically distributed (IID) client distributions. Comparative results against centralized models such as TumorVisNet highlight TwinSegNet’s effectiveness in preserving privacy without sacrificing performance. Our approach enables scalable, personalized segmentation for multi-institutional clinical settings while adhering to strict data confidentiality requirements.
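At its core, the federated part of such a framework alternates FedAvg-style weight averaging with per-institution fine-tuning that yields each site's digital twin; the toy round below uses plain dictionaries and made-up client sizes rather than real model state.

```python
# Toy federated round: clients return locally trained weights, the server averages them
# (FedAvg), and each institution later fine-tunes the global model into its own
# "digital twin". Plain dicts stand in for real model state_dicts.
import numpy as np

def fedavg(client_weights, client_sizes):
    total = sum(client_sizes)
    keys = client_weights[0].keys()
    return {k: sum(w[k] * (n / total) for w, n in zip(client_weights, client_sizes))
            for k in keys}

# Hypothetical per-institution weights for a single layer
clients = [{"conv1": np.full(3, 0.9)},
           {"conv1": np.full(3, 1.1)},
           {"conv1": np.full(3, 1.3)}]
sizes = [100, 300, 600]                        # local dataset sizes
global_model = fedavg(clients, sizes)
print(global_model["conv1"])                   # weighted average, dominated by larger sites

def make_digital_twin(global_weights, local_update, lr=0.5):
    # Fine-tune the global model on private data (here: one gradient-like nudge).
    return {k: v - lr * local_update[k] for k, v in global_weights.items()}

twin = make_digital_twin(global_model, {"conv1": np.full(3, 0.1)})
print(twin["conv1"])
```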
zh
[CV-52] 3D-RE-GEN: 3D Reconstruction of Indoor Scenes with a Generative Framework
【速读】:该论文旨在解决当前单图像三维场景重建方法在艺术创作流程中适用性不足的问题,具体表现为物体分解错误、空间关系不准确以及背景缺失等缺陷,这些问题限制了生成的纹理化三维网格场景在视觉特效和游戏开发中的可编辑性和实用性。解决方案的关键在于提出3D-RE-GEN这一组合式框架,通过集成资产检测、重建与放置模型,并引入基于场景级推理的生成式图像编辑策略来恢复遮挡物体;同时采用一种新颖的4自由度(4-DoF)可微优化方法对齐重建对象与估计的地平面,从而实现物理合理的布局;此外,该框架还生成一个空间约束性强的完整背景,为真实光照模拟和后续渲染任务提供基础,最终实现了高保真、可修改的三维场景重建效果。
链接: https://arxiv.org/abs/2512.17459
作者: Tobias Sautter,Jan-Niklas Dihlmann,Hendrik P.A. Lensch
机构: University of Tübingen (图宾根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Recent advances in 3D scene generation produce visually appealing output, but current representations hinder artists’ workflows that require modifiable 3D textured mesh scenes for visual effects and game development. Despite significant advances, current textured mesh scene reconstruction methods are far from artist ready, suffering from incorrect object decomposition, inaccurate spatial relationships, and missing backgrounds. We present 3D-RE-GEN, a compositional framework that reconstructs a single image into textured 3D objects and a background. We show that combining state of the art models from specific domains achieves state of the art scene reconstruction performance, addressing artists’ requirements. Our reconstruction pipeline integrates models for asset detection, reconstruction, and placement, pushing certain models beyond their originally intended domains. Obtaining occluded objects is treated as an image editing task with generative models to infer and reconstruct with scene level reasoning under consistent lighting and geometry. Unlike current methods, 3D-RE-GEN generates a comprehensive background that spatially constrains objects during optimization and provides a foundation for realistic lighting and simulation tasks in visual effects and games. To obtain physically realistic layouts, we employ a novel 4-DoF differentiable optimization that aligns reconstructed objects with the estimated ground plane. 3D-RE-GEN achieves state of the art performance in single image 3D scene reconstruction, producing coherent, modifiable scenes through compositional generation guided by precise camera recovery and spatial optimization.
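The 4-DoF differentiable optimization (read here as x/y/z translation plus yaw) can be sketched as a small gradient-descent loop that pushes an object's lowest vertices onto an estimated ground plane; the loss and parametrization below are assumptions, not the paper's exact objective.

```python
# Sketch of a 4-DoF (x, y, z, yaw) alignment: optimize a rigid placement so the object's
# lowest vertices rest on the ground plane z = 0. Loss and parametrization are
# illustrative assumptions.
import torch

points = torch.rand(500, 3) + torch.tensor([0.0, 0.0, 0.7])   # object points floating above ground
params = torch.zeros(4, requires_grad=True)                    # (tx, ty, tz, yaw)
opt = torch.optim.Adam([params], lr=0.05)

for step in range(200):
    tx, ty, tz, yaw = params.unbind()
    rot = torch.stack([
        torch.stack([torch.cos(yaw), -torch.sin(yaw), torch.tensor(0.0)]),
        torch.stack([torch.sin(yaw),  torch.cos(yaw), torch.tensor(0.0)]),
        torch.tensor([0.0, 0.0, 1.0]),
    ])
    placed = points @ rot.t() + torch.stack([tx, ty, tz])
    lowest = placed[:, 2].topk(k=20, largest=False).values     # 20 lowest vertices
    loss = (lowest ** 2).mean()                                # pull them onto z = 0
    opt.zero_grad()
    loss.backward()
    opt.step()

print(float(loss))   # close to zero once the object sits on the ground plane
```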
zh
[CV-53] MULTIAQUA: A multimodal maritime dataset and robust training strategies for multimodal semantic segmentation
【速读】:该论文旨在解决无人水面艇(Unmanned Surface Vehicle, USV)在复杂气象与光照条件下,因视觉信息不足导致场景理解质量下降的问题。其解决方案的关键在于构建一个同步、校准且标注完善的多模态海洋数据集MULTIAQUA(Multimodal Aquatic Dataset),该数据集融合RGB、热成像(Thermal)、红外(Infrared)及激光雷达(LIDAR)等多种传感器模态的数据,从而支持训练鲁棒的监督学习方法,提升在极端低光环境下的感知性能。特别地,研究提出了一种仅使用日间图像即可训练出在夜间近完全黑暗条件下仍保持可靠表现的深度神经网络的方法,显著简化了数据采集、标注与训练流程。
链接: https://arxiv.org/abs/2512.17450
作者: Jon Muhovič,Janez Perš
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Unmanned surface vehicles can encounter a number of varied visual circumstances during operation, some of which can be very difficult to interpret. While most cases can be solved only using color camera images, some weather and lighting conditions require additional information. To expand the available maritime data, we present a novel multimodal maritime dataset MULTIAQUA (Multimodal Aquatic Dataset). Our dataset contains synchronized, calibrated and annotated data captured by sensors of different modalities, such as RGB, thermal, IR, LIDAR, etc. The dataset is aimed at developing supervised methods that can extract useful information from these modalities in order to provide a high quality of scene interpretation regardless of potentially poor visibility conditions. To illustrate the benefits of the proposed dataset, we evaluate several multimodal methods on our difficult nighttime test set. We present training approaches that enable multimodal methods to be trained in a more robust way, thus enabling them to retain reliable performance even in near-complete darkness. Our approach allows for training a robust deep neural network only using daytime images, thus significantly simplifying data acquisition, annotation, and the training process.
zh
[CV-54] LangDriveCTRL: Natural Language Controllable Driving Scene Editing with Multi-modal Agents
【速读】:该论文旨在解决如何通过自然语言指令对真实驾驶视频进行细粒度编辑以合成多样化交通场景的问题。现有方法在指令对齐度、结构保真度和交通真实性方面存在不足,难以实现复杂多对象行为的可控生成与高质量渲染。其解决方案的关键在于提出LangDriveCTRL框架,该框架基于显式三维场景分解构建场景图(scene graph),将视频表示为静态背景与动态物体的组合;并通过代理(agent)流水线实现精细化控制:包括目标定位代理(Object Grounding Agent)建立文本描述与场景图中目标节点的对应关系,行为编辑代理(Behavior Editing Agent)从语言指令生成多目标轨迹,以及行为审查代理(Behavior Reviewer Agent)迭代优化轨迹质量;最终利用视频扩散模型(video diffusion tool)修复因物体插入和视角变化引入的伪影,从而实现单条自然语言指令下对象节点编辑(移除、插入、替换)与多对象行为编辑的统一控制,显著提升指令对齐度(接近2倍于当前最优方法)及整体视觉真实感。
链接: https://arxiv.org/abs/2512.17445
作者: Yun He,Francesco Pittaluga,Ziyu Jiang,Matthias Zwicker,Manmohan Chandraker,Zaid Tasneem
机构: University of Maryland, College Park (马里兰大学学院市分校); NEC Labs America (美国NEC实验室); UC San Diego (加州大学圣地亚哥分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:LangDriveCTRL is a natural-language-controllable framework for editing real-world driving videos to synthesize diverse traffic scenarios. It leverages explicit 3D scene decomposition to represent driving videos as a scene graph, containing static background and dynamic objects. To enable fine-grained editing and realism, it incorporates an agentic pipeline in which an Orchestrator transforms user instructions into execution graphs that coordinate specialized agents and tools. Specifically, an Object Grounding Agent establishes correspondence between free-form text descriptions and target object nodes in the scene graph; a Behavior Editing Agent generates multi-object trajectories from language instructions; and a Behavior Reviewer Agent iteratively reviews and refines the generated trajectories. The edited scene graph is rendered and then refined using a video diffusion tool to address artifacts introduced by object insertion and significant view changes. LangDriveCTRL supports both object node editing (removal, insertion and replacement) and multi-object behavior editing from a single natural-language instruction. Quantitatively, it achieves nearly 2× higher instruction alignment than the previous SoTA, with superior structural preservation, photorealism, and traffic realism. Project page is available at: this https URL.
zh
[CV-55] Xiaomi MiMo-VL-Miloco Technical Report
【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在智能家居场景下理解能力不足的问题,尤其在手势识别和家庭环境语义理解方面表现有限,同时兼顾通用多模态推理能力的保持。解决方案的关键在于构建一个专为家庭场景优化的双阶段训练框架:首先通过监督微调(Supervised Fine-Tuning, SFT)引入高质量多域家庭数据,随后采用基于组相对策略优化(Group Relative Policy Optimization, GRPO)的强化学习进一步提升模型性能;此外,引入思维链(Chain-of-Thought, CoT)监督与token预算感知推理机制,使模型能在数据效率和推理效率之间取得平衡,从而在家庭场景任务上实现领先性能,且对通用多模态基准(如Video-MME、MMMU-Pro等)仍保持稳定增益。
链接: https://arxiv.org/abs/2512.17436
作者: Jiaze Li,Jingyang Chen,Yuxun Qu,Jianzhong Ju,Zhenbo Luo,Jian Luan,Shijie Xu,Zhenru Lin,Junyou Zhu,Boshen Xu,Wenhui Tan,Pei Fu
机构: Xiaomi(小米)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We open-source \textbfMiMo-VL-Miloco-7B and its quantized variant \textbfMiMo-VL-Miloco-7B-GGUF, a pair of home-centric vision-language models that achieve strong performance on both home-scenario understanding and general multimodal reasoning. Built on the MiMo-VL-7B backbone, MiMo-VL-Miloco-7B is specialized for smart-home environments, attaining leading F1 scores on gesture recognition and common home-scenario understanding, while also delivering consistent gains across video benchmarks such as Video-MME, Video-MMMU, and Charades-STA, as well as language understanding benchmarks including MMMU-Pro and MMLU-Pro. In our experiments, MiMo-VL-Miloco-7B outperforms strong closed-source and open-source baselines on home-scenario understanding and several multimodal reasoning benchmarks. To balance specialization and generality, we design a two-stage training pipeline that combines supervised fine-tuning with reinforcement learning based on Group Relative Policy Optimization, leveraging efficient multi-domain data. We further incorporate chain-of-thought supervision and token-budget-aware reasoning, enabling the model to learn knowledge in a data-efficient manner while also performing reasoning efficiently. Our analysis shows that targeted home-scenario training not only enhances activity and gesture understanding, but also improves text-only reasoning with only modest trade-offs on document-centric tasks. Model checkpoints, quantized GGUF weights, and our home-scenario evaluation toolkit are publicly available at \hrefthis https URLthis https URL to support research and deployment in real-world smart-home applications.
zh
[CV-56] AIFloodSense: A Global Aerial Imagery Dataset for Semantic Segmentation and Understanding of Flooded Environments
【速读】:该论文旨在解决洪水检测中高质量标注数据集稀缺的问题,这一瓶颈限制了鲁棒且具备泛化能力的计算机视觉方法的发展。现有数据集普遍存在地理范围有限和标注细节不足的缺陷,难以支撑全球性、多场景下的洪水智能识别研究。其解决方案的关键在于构建AIFloodSense——一个公开的、高分辨率的航空影像数据集,包含来自64个国家和六大洲230个独立洪水事件的470张图像,覆盖2022至2024年的时间跨度,具有显著的全球多样性和时序相关性。该数据集支持三种互补任务:图像分类(含环境类型、相机角度与大洲识别子任务)、语义分割(提供洪水、天空与建筑物的像素级掩码)以及视觉问答(VQA),并通过前沿模型建立基线基准,验证其复杂性与推动气候韧性领域通用人工智能工具发展的潜力。
链接: https://arxiv.org/abs/2512.17432
作者: Georgios Simantiris,Konstantinos Bacharidis,Apostolos Papanikolaou,Petros Giannakakis,Costas Panagiotakis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 36 pages, 19 figures, 8 tables
Abstract:Accurate flood detection from visual data is a critical step toward improving disaster response and risk assessment, yet datasets for flood segmentation remain scarce due to the challenges of collecting and annotating large-scale imagery. Existing resources are often limited in geographic scope and annotation detail, hindering the development of robust, generalized computer vision methods. To bridge this gap, we introduce AIFloodSense, a comprehensive, publicly available aerial imagery dataset comprising 470 high-resolution images from 230 distinct flood events across 64 countries and six continents. Unlike prior benchmarks, AIFloodSense ensures global diversity and temporal relevance (2022-2024), supporting three complementary tasks: (i) Image Classification with novel sub-tasks for environment type, camera angle, and continent recognition; (ii) Semantic Segmentation providing precise pixel-level masks for flood, sky, and buildings; and (iii) Visual Question Answering (VQA) to enable natural language reasoning for disaster assessment. We establish baseline benchmarks for all tasks using state-of-the-art architectures, demonstrating the dataset’s complexity and its value in advancing domain-generalized AI tools for climate resilience.
zh
[CV-57] Beyond Occlusion: In Search for Near Real-Time Explainability of CNN-Based Prostate Cancer Classification
【速读】:该论文旨在解决深度神经网络在辅助前列腺癌诊断等临床场景中因解释方法计算效率低下而难以被病理学家采纳的问题。具体而言,传统使用的遮挡法(occlusion)虽能提供可解释性输出,但耗时较长,阻碍了模型开发与调试的迭代效率。解决方案的关键在于建立一套合理的评估框架,通过定义合适的比较标准并选取相应指标,系统性地筛选出一种替代遮挡法的解释方法,最终实现了至少10倍的加速效果,且未降低输出质量,从而提升了AI辅助诊断系统的实用性和临床落地潜力。
链接: https://arxiv.org/abs/2512.17416
作者: Martin Krebs,Jan Obdržálek,Vít Musil,Tomáš Brázdil
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deep neural networks are starting to show their worth in critical applications such as assisted cancer diagnosis. However, for their outputs to get accepted in practice, the results they provide should be explainable in a way easily understood by pathologists. A well-known and widely used explanation technique is occlusion, which, however, can take a long time to compute, thus slowing the development and interaction with pathologists. In this work, we set out to find a faster replacement for occlusion in a successful system for detecting prostate cancer. Since there is no established framework for comparing the performance of various explanation methods, we first identified suitable comparison criteria and selected corresponding metrics. Based on the results, we were able to choose a different explanation method, which cut the previously required explanation time at least by a factor of 10, without any negative impact on the quality of outputs. This speedup enables rapid iteration in model development and debugging and brings us closer to adopting AI-assisted prostate cancer detection in clinical settings. We propose that our approach to finding the replacement for occlusion can be used to evaluate candidate methods in other related applications.
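For context, the occlusion baseline being replaced slides a grey patch over the input and records the drop in the class score at every position; the sketch below illustrates this with a dummy model, so the model, patch size, and stride are assumptions.

```python
# Occlusion-based saliency: slide a patch over the image and record how much the model's
# score drops at each position. Dummy model and patch/stride values are assumptions;
# this is the slow baseline that faster attribution methods aim to replace.
import numpy as np

def dummy_model(image):
    # Stand-in "classifier": responds to brightness in the image centre.
    return float(image[96:160, 96:160].mean())

def occlusion_map(image, patch=32, stride=32, fill=0.5):
    base = dummy_model(image)
    h, w = image.shape
    heat = np.zeros(((h - patch) // stride + 1, (w - patch) // stride + 1))
    for i, y in enumerate(range(0, h - patch + 1, stride)):
        for j, x in enumerate(range(0, w - patch + 1, stride)):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = fill
            heat[i, j] = base - dummy_model(occluded)   # score drop = importance
    return heat

img = np.zeros((256, 256))
img[96:160, 96:160] = 1.0                    # bright "lesion" in the centre
print(occlusion_map(img).round(2))           # importance peaks at the central patches
```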
zh
[CV-58] owards Deeper Emotional Reflection: Crafting Affective Image Filters with Generative Priors
【速读】:该论文旨在解决如何将文本中抽象的情感信息有效映射到视觉层面,从而生成能够真实反映情感的图像问题(即“情感图像过滤”任务,Affective Image Filter, AIF)。其解决方案的关键在于构建两个层次的模型:AIF-B基于多模态Transformer架构实现基础的情感视觉映射;AIF-D在此基础上进一步引入预训练大规模扩散模型的生成先验,以增强对深层情感语义的理解与表达能力,从而在内容一致性与情感保真度上均优于现有方法。
链接: https://arxiv.org/abs/2512.17376
作者: Peixuan Zhang,Shuchen Weng,Jiajun Tang,Si Li,Boxin Shi
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Beijing Academy of Artificial Intelligence (北京人工智能研究院); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Social media platforms enable users to express emotions by posting text with accompanying images. In this paper, we propose the Affective Image Filter (AIF) task, which aims to reflect visually-abstract emotions from text into visually-concrete images, thereby creating emotionally compelling results. We first introduce the AIF dataset and the formulation of the AIF models. Then, we present AIF-B as an initial attempt based on a multi-modal transformer architecture. After that, we propose AIF-D as an extension of AIF-B towards deeper emotional reflection, effectively leveraging generative priors from pre-trained large-scale diffusion models. Quantitative and qualitative experiments demonstrate that AIF models achieve superior performance for both content consistency and emotional fidelity compared to state-of-the-art methods. Extensive user study experiments demonstrate that AIF models are significantly more effective at evoking specific emotions. Based on the presented results, we comprehensively discuss the value and potential of AIF models.
zh
[CV-59] Beyond Semantic Features: Pixel-level Mapping for Generalized AI-Generated Image Detection AAAI2026
【速读】:该论文旨在解决当前图像生成检测器在面对未见过的生成模型时泛化能力不足的问题,其根本原因在于现有检测器往往过度依赖特定源模型的语义线索(semantic cues),而非学习到通用的生成伪影(generative artifacts)。解决方案的关键在于引入一个简单但高效的像素级映射预处理步骤,通过破坏图像的像素值分布,消除检测器常利用的脆弱且非必要的语义模式,从而迫使检测器聚焦于图像生成过程中更本质、更具泛化性的高频特征。实验证明,该方法显著提升了先进检测器在不同生成模型间的跨域性能。
链接: https://arxiv.org/abs/2512.17350
作者: Chenming Zhou,Jiaan Wang,Yu Li,Lei Li,Juan Cao,Sheng Tang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2026
Abstract:The rapid evolution of generative technologies necessitates reliable methods for detecting AI-generated images. A critical limitation of current detectors is their failure to generalize to images from unseen generative models, as they often overfit to source-specific semantic cues rather than learning universal generative artifacts. To overcome this, we introduce a simple yet remarkably effective pixel-level mapping pre-processing step to disrupt the pixel value distribution of images and break the fragile, non-essential semantic patterns that detectors commonly exploit as shortcuts. This forces the detector to focus on more fundamental and generalizable high-frequency traces inherent to the image generation process. Through comprehensive experiments on GAN and diffusion-based generators, we show that our approach significantly boosts the cross-generator performance of state-of-the-art detectors. Extensive analysis further verifies our hypothesis that the disruption of semantic cues is the key to generalization.
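One way to realize such a pixel-level mapping, assumed here purely for illustration, is a fixed bijection over the 256 intensity values applied identically to every pixel, which perturbs the value distribution while leaving per-pixel structure in place; the paper's actual mapping may differ.

```python
# Hypothetical pixel-level mapping: remap every 8-bit intensity value through a fixed
# random bijection, scrambling colour/semantic appearance while keeping per-pixel
# structure. The specific mapping is an assumption, not the paper's.
import numpy as np

rng = np.random.default_rng(0)
lut = rng.permutation(256).astype(np.uint8)          # fixed lookup table over pixel values

def pixel_level_mapping(image_uint8):
    return lut[image_uint8]                          # applied identically to every pixel/channel

img = (np.random.rand(64, 64, 3) * 255).astype(np.uint8)   # stand-in for an input image
mapped = pixel_level_mapping(img)
print(img.mean(), mapped.mean())                     # the value distribution changes markedly
```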
zh
[CV-60] Multi-level distortion-aware deformable network for omnidirectional image super-resolution
【速读】:该论文旨在解决OmniDirectional Image Super-Resolution (ODISR) 中因等距圆柱投影(EquiRectangular Projection, ERP)引入的纬度依赖性几何失真问题,该失真在极区尤为严重,导致现有方法难以有效捕捉大范围内的扭曲模式。解决方案的关键在于提出一种多级失真感知可变形网络(Multi-level Distortion-aware Deformable Network, MDDN),其核心创新是设计了一个包含三个并行分支的特征提取器:一个标准可变形注意力机制(dilation=1路径)和两个膨胀率分别为2与3的可变形卷积,从而显著扩展采样范围与感受野,增强对ERP图像中密集且广泛失真模式的建模能力;同时,通过多级特征融合模块自适应整合各层级特征,并采用低秩分解策略降低膨胀可变形卷积的计算开销,实现高效且高精度的ODISR重建。
链接: https://arxiv.org/abs/2512.17343
作者: Cuixin Yang,Rongkang Dong,Kin-Man Lam,Yuhang Zhang,Guoping Qiu
机构: PolyU (香港理工大学); Guangzhou University (广州大学); University of Nottingham (诺丁汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:As augmented reality and virtual reality applications gain popularity, image processing for OmniDirectional Images (ODIs) has attracted increasing attention. OmniDirectional Image Super-Resolution (ODISR) is a promising technique for enhancing the visual quality of ODIs. Before performing super-resolution, ODIs are typically projected from a spherical surface onto a plane using EquiRectangular Projection (ERP). This projection introduces latitude-dependent geometric distortion in ERP images: distortion is minimal near the equator but becomes severe toward the poles, where image content is stretched across a wider area. However, existing ODISR methods have limited sampling ranges and feature extraction capabilities, which hinder their ability to capture distorted patterns over large areas. To address this issue, we propose a novel Multi-level Distortion-aware Deformable Network (MDDN) for ODISR, designed to expand the sampling range and receptive field. Specifically, the feature extractor in MDDN comprises three parallel branches: a deformable attention mechanism (serving as the dilation=1 path) and two dilated deformable convolutions with dilation rates of 2 and 3. This architecture expands the sampling range to include more distorted patterns across wider areas, generating dense and comprehensive features that effectively capture geometric distortions in ERP images. The representations extracted from these deformable feature extractors are adaptively fused in a multi-level feature fusion module. Furthermore, to reduce computational cost, a low-rank decomposition strategy is applied to dilated deformable convolutions. Extensive experiments on publicly available datasets demonstrate that MDDN outperforms state-of-the-art methods, underscoring its effectiveness and superiority in ODISR.
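To make the three-branch design concrete, the sketch below uses plain dilated convolutions with dilation rates 1, 2 and 3 as stand-ins for the paper's deformable attention and dilated deformable convolutions, followed by a simple 1x1 fusion; it illustrates the parallel structure only.

```python
# Structural sketch of a multi-dilation feature extractor: three parallel branches with
# dilation rates 1, 2 and 3 enlarge the sampling range before fusion. Plain dilated
# convolutions stand in for the paper's deformable attention / deformable convolutions.
import torch
import torch.nn as nn

class MultiDilationBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d) for d in (1, 2, 3)
        ])
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)   # simple fusion stand-in

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

feat = torch.randn(1, 64, 128, 256)            # ERP-like feature map (wider than tall)
print(MultiDilationBlock()(feat).shape)        # torch.Size([1, 64, 128, 256])
```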
zh
[CV-61] SynergyWarpNet: Attention-Guided Cooperative Warping for Neural Portrait Animation ICASSP2026
【速读】:该论文旨在解决传统显式变形(explicit warping)方法在运动迁移准确性与缺失区域恢复方面的局限性,以及基于注意力机制的变形方法在计算复杂度高和几何约束弱的问题。其解决方案的关键在于提出SynergyWarpNet框架,通过三个阶段协同优化:首先利用3D密集光流进行粗略空间对齐;其次引入参考图像增强的校正模块,借助跨3D关键点与纹理特征的交叉注意力机制语义补全遮挡或失真区域;最后采用置信度引导的融合模块,结合空间自适应融合策略与学习得到的置信图,平衡结构对齐与视觉一致性,从而实现高质量说话头合成。
链接: https://arxiv.org/abs/2512.17331
作者: Shihang Li,Zhiqiang Gong,Minming Ye,Yue Gao,Wen Yao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ICASSP 2026
Abstract:Recent advances in neural portrait animation have demonstrated remarked potential for applications in virtual avatars, telepresence, and digital content creation. However, traditional explicit warping approaches often struggle with accurate motion transfer or recovering missing regions, while recent attention-based warping methods, though effective, frequently suffer from high complexity and weak geometric grounding. To address these issues, we propose SynergyWarpNet, an attention-guided cooperative warping framework designed for high-fidelity talking head synthesis. Given a source portrait, a driving image, and a set of reference images, our model progressively refines the animation in three stages. First, an explicit warping module performs coarse spatial alignment between the source and driving image using 3D dense optical flow. Next, a reference-augmented correction module leverages cross-attention across 3D keypoints and texture features from multiple reference images to semantically complete occluded or distorted regions. Finally, a confidence-guided fusion module integrates the warped outputs with spatially-adaptive fusing, using a learned confidence map to balance structural alignment and visual consistency. Comprehensive evaluations on benchmark datasets demonstrate state-of-the-art performance.
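The confidence-guided fusion step amounts to a per-pixel convex combination of the warped source and the reference-corrected image, weighted by a learned confidence map; the sketch below shows only this blending and treats the confidence map as given.

```python
# Confidence-guided fusion: blend the explicitly warped image with the reference-corrected
# image using a per-pixel confidence map in [0, 1]. The map is taken as given here;
# in the paper it is predicted by the network.
import torch

def confidence_fusion(warped, corrected, confidence):
    # confidence == 1 keeps the geometrically warped result, 0 falls back to the correction
    return confidence * warped + (1.0 - confidence) * corrected

warped     = torch.rand(1, 3, 256, 256)
corrected  = torch.rand(1, 3, 256, 256)
confidence = torch.rand(1, 1, 256, 256)        # broadcast over the RGB channels
print(confidence_fusion(warped, corrected, confidence).shape)
```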
zh
[CV-62] Democratizing Pathology Co-Pilots: An Open Pipeline and Dataset for Whole-Slide Vision-Language Modelling
【速读】:该论文旨在解决当前视觉-语言模型(Vision-Language Models, VLMs)在病理学应用中的三大局限性:一是多数模型仅关注全切片图像(Whole-Slide Images, WSI)的局部区域,缺乏全局理解能力;二是输出多为静态的滑片级结果,难以支持动态交互式分析;三是训练数据依赖非公开资源,导致可复现性差。此外,标注详尽的WSI与临床报告配对数据稀缺,制约了透明且泛化能力强的VLM发展。其解决方案的关键在于:首先提出Polysome这一标准化合成指令生成工具,用于自动化构建高质量指令数据;其次基于公开HISTAI数据集,利用Polysome生成包含24,259张滑片和超过110万条指令-响应对的大规模指令微调数据集HISTAI-Instruct;最终在此基础上训练出ANTONI-α模型,该模型在组织识别、肿瘤检测和鉴别诊断等WSI级视觉问答(Visual Question Answering, VQA)任务上优于MedGemma,并验证了不同数据量下模型性能的可扩展性。所有方法、数据及代码均已开源,显著提升了病理VLM的实用性与可复现性。
链接: https://arxiv.org/abs/2512.17326
作者: Sander Moonemans,Sebastiaan Ram,Frédérique Meeuwsen,Carlijn Lems,Jeroen van der Laak,Geert Litjens,Francesco Ciompi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 4 figures
Abstract:Vision-language models (VLMs) have the potential to become co-pilots for pathologists. However, most VLMs either focus on small regions of interest within whole-slide images, provide only static slide-level outputs, or rely on data that is not publicly available, limiting reproducibility. Furthermore, training data containing WSIs paired with detailed clinical reports is scarce, restricting progress toward transparent and generalisable VLMs. We address these limitations with three main contributions. First, we introduce Polysome, a standardised tool for synthetic instruction generation. Second, we apply Polysome to the public HISTAI dataset, generating HISTAI-Instruct, a large whole-slide instruction tuning dataset spanning 24,259 slides and over 1.1 million instruction-response pairs. Finally, we use HISTAI-Instruct to train ANTONI-α, a VLM capable of visual-question answering (VQA). We show that ANTONI-α outperforms MedGemma on WSI-level VQA tasks of tissue identification, neoplasm detection, and differential diagnosis. We also compare the performance of multiple incarnations of ANTONI-α trained with different amounts of data. All methods, data, and code are publicly available.
zh
[CV-63] DESSERT: Diffusion-based Event-driven Single-frame Synthesis via Residual Training
【速读】:该论文旨在解决动态场景下视频帧预测(video frame prediction)因缺乏未来帧信息而导致的预测误差问题,尤其在传统方法中由于事件相机(event camera)提供的高时间分辨率运动信息未被有效利用时,容易产生孔洞和模糊。其解决方案的关键在于提出DESSERT框架——一种基于扩散模型(diffusion model)的事件驱动单帧合成方法,通过残差训练(residual training)实现时间一致性:首先使用Event-to-Residual Alignment Variational Autoencoder(ER-VAE)对齐事件帧与目标帧间的残差,随后利用预训练的Stable Diffusion模型在事件数据条件下去噪残差潜在表示,从而生成更清晰且时序一致的预测帧。此外,引入Diverse-Length Temporal(DLT)增强策略提升模型对不同时间长度片段的鲁棒性。
链接: https://arxiv.org/abs/2512.17323
作者: Jiyun Kong,Jun-Hyuk Kim,Jong-Seok Lee
机构: Yonsei University (延世大学); Chung-Ang University (中央大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video frame prediction extrapolates future frames from previous frames, but suffers from prediction errors in dynamic scenes due to the lack of information about the next frame. Event cameras address this limitation by capturing per-pixel brightness changes asynchronously with high temporal resolution. Prior research on event-based video frame prediction has leveraged motion information from event data, often by predicting event-based optical flow and reconstructing frames via pixel warping. However, such approaches introduce holes and blurring when pixel displacement is inaccurate. To overcome this limitation, we propose DESSERT, a diffusion-based event-driven single-frame synthesis framework via residual training. Leveraging a pre-trained Stable Diffusion model, our method is trained on inter-frame residuals to ensure temporal consistency. The training pipeline consists of two stages: (1) an Event-to-Residual Alignment Variational Autoencoder (ER-VAE) that aligns the event frame between anchor and target frames with the corresponding residual, and (2) a diffusion model that denoises the residual latent conditioned on event data. Furthermore, we introduce Diverse-Length Temporal (DLT) augmentation, which improves robustness by training on frame segments of varying temporal lengths. Experimental results demonstrate that our method outperforms existing event-based reconstruction, image-based video frame prediction, event-based video frame prediction, and one-sided event-based video frame interpolation methods, producing sharper and more temporally consistent frame synthesis.
zh
[CV-64] Rotterdam artery-vein segmentation (RAV) dataset
【速读】:该论文旨在解决当前用于视网膜血管分析的机器学习模型缺乏高质量、多样化且标注精确的数据集问题,尤其在动脉-静脉(Artery-Vein, A/V)分割任务中。其解决方案的关键在于构建了一个名为 Rotterdam Artery-Vein Segmentation (RAV) 的数据集,包含1024×1024像素的彩色眼底图像(Color Fundus Images, CFIs),涵盖多种设备、采集条件和年龄范围,并通过定制标注界面实现高精度的A/V血管分层标注与连通性验证,从而支持在真实世界复杂图像质量下的模型训练与评估。
链接: https://arxiv.org/abs/2512.17322
作者: Jose Vargas Quiros,Bart Liefers,Karin van Garderen,Jeroen Vermeulen,Eyened Reading Center,Caroline Klaver
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Purpose: To provide a diverse, high-quality dataset of color fundus images (CFIs) with detailed artery-vein (A/V) segmentation annotations, supporting the development and evaluation of machine learning algorithms for vascular analysis in ophthalmology. Methods: CFIs were sampled from the longitudinal Rotterdam Study (RS), encompassing a wide range of ages, devices, and capture conditions. Images were annotated using a custom interface that allowed graders to label arteries, veins, and unknown vessels on separate layers, starting from an initial vessel segmentation mask. Connectivity was explicitly verified and corrected using connected component visualization tools. Results: The dataset includes 1024x1024-pixel PNG images in three modalities: original RGB fundus images, contrast-enhanced versions, and RGB-encoded A/V masks. Image quality varied widely, including challenging samples typically excluded by automated quality assessment systems, but judged to contain valuable vascular information. Conclusion: This dataset offers a rich and heterogeneous source of CFIs with high-quality segmentations. It supports robust benchmarking and training of machine learning models under real-world variability in image quality and acquisition settings. Translational Relevance: By including connectivity-validated A/V masks and diverse image conditions, this dataset enables the development of clinically applicable, generalizable machine learning tools for retinal vascular analysis, potentially improving automated screening and diagnosis of systemic and ocular diseases.
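注:摘要提到数据集以"RGB 编码的 A/V 掩膜"发布。下面是一个读取并拆分该类掩膜的最小示意脚本;其中红=动脉、蓝=静脉、绿=未知血管的颜色编码仅为假设,实际取值请以数据集说明为准。

```python
import numpy as np
from PIL import Image

# 假设的颜色编码(实际编码以数据集文档为准)
COLOR_TO_CLASS = {
    (255, 0, 0): "artery",
    (0, 0, 255): "vein",
    (0, 255, 0): "unknown",
}

def decode_av_mask(path: str) -> dict:
    """将 RGB 编码的动脉/静脉掩膜拆成逐类别的二值掩膜。"""
    rgb = np.array(Image.open(path).convert("RGB"))
    masks = {}
    for color, name in COLOR_TO_CLASS.items():
        masks[name] = np.all(rgb == np.array(color, dtype=rgb.dtype), axis=-1)
    return masks

# 用法示例(文件名为假设):
# masks = decode_av_mask("rav_sample_av_mask.png")
# print({k: int(v.sum()) for k, v in masks.items()})
```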
zh
[CV-65] EMMA: Concept Erasure Benchmark with Comprehensive Semantic Metrics and Diverse Categories
【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)生成模型中因概念保留带来的隐私、偏见和版权问题,尤其是现有概念擦除(Concept Erasure)技术在实际应用中存在评估不全面、鲁棒性不足的问题。其解决方案的关键在于提出EMMA基准测试框架,系统性地从五个维度(包括图像质量、效率、鲁棒性、社会公平性等)对五种主流概念擦除方法进行量化评估,涵盖12项指标,特别关注间接提示、视觉相似非目标概念干扰以及性别与种族偏见放大等复杂场景,从而更真实地反映概念擦除技术是否真正从模型表征中移除目标概念。
链接: https://arxiv.org/abs/2512.17320
作者: Lu Wei,Yuta Nakashima,Noa Garcia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review
Abstract:The widespread adoption of text-to-image (T2I) generation has raised concerns about privacy, bias, and copyright violations. Concept erasure techniques offer a promising solution by selectively removing undesired concepts from pre-trained models without requiring full retraining. However, these methods are often evaluated on a limited set of concepts, relying on overly simplistic and direct prompts. To test the boundaries of concept erasure techniques, and assess whether they truly remove targeted concepts from model representations, we introduce EMMA, a benchmark that evaluates five key dimensions of concept erasure over 12 metrics. EMMA goes beyond standard metrics like image quality and time efficiency, testing robustness under challenging conditions, including indirect descriptions, visually similar non-target concepts, and potential gender and ethnicity bias, providing a socially aware analysis of method behavior. Using EMMA, we analyze five concept erasure methods across five domains (objects, celebrities, art styles, NSFW, and copyright). Our results show that existing methods struggle with implicit prompts (i.e., generating the erased concept when it is indirectly referenced) and visually similar non-target concepts (i.e., failing to generate non-targeted concepts resembling the erased one), while some amplify gender and ethnicity bias compared to the original model.
zh
[CV-66] A Benchmark for Ultra-High-Resolution Remote Sensing MLLMs
【速读】:该论文旨在解决当前遥感(Remote Sensing, RS)视觉理解与推理任务评估中存在的关键问题:现有基准测试大多依赖低分辨率图像,而高分辨率基准在任务设计上存在缺陷,导致模型性能评估失真。特别是研究发现,仅使用文本的大型语言模型(Large Language Models, LLMs)在无图像输入的情况下也能在RS推理任务中表现优异,暴露出当前基准与真实视觉理解能力评估之间存在严重不匹配。为实现更忠实的评估,作者提出RSHR-Bench——一个超高分辨率遥感视觉理解与推理基准,其核心创新在于:构建包含5,329幅单景图像(每幅图像像素数高达约3×10⁸)、覆盖九类感知和四类推理类型的多任务体系,并通过对抗性过滤结合强LLM与人工严格验证以降低语言先验干扰,从而有效提升评估的客观性和真实性。
链接: https://arxiv.org/abs/2512.17319
作者: Yunkai Dang,Meiyi Zhu,Donghao Wang,Yizhuo Zhang,Jiacheng Yang,Qi Fan,Yuekun Yang,Wenbin Li,Feng Miao,Yang Gao
机构: Nanjing University (南京大学); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:
Abstract:Multimodal large language models (MLLMs) demonstrate strong perception and reasoning performance on existing remote sensing (RS) benchmarks. However, most prior benchmarks rely on low-resolution imagery, and some high-resolution benchmarks suffer from flawed reasoning-task designs. We show that text-only LLMs can perform competitively with multimodal vision-language models on RS reasoning tasks without access to images, revealing a critical mismatch between current benchmarks and the intended evaluation of visual understanding. To enable faithful assessment, we introduce RSHR-Bench, a super-high-resolution benchmark for RS visual understanding and reasoning. RSHR-Bench contains 5,329 full-scene images with a long side of at least 4,000 pixels, with up to about 3 x 10^8 pixels per image, sourced from widely used RS corpora and UAV collections. We design four task families: multiple-choice VQA, open-ended VQA, image captioning, and single-image evaluation. These tasks cover nine perception categories and four reasoning types, supporting multi-turn and multi-image dialog. To reduce reliance on language priors, we apply adversarial filtering with strong LLMs followed by rigorous human verification. Overall, we construct 3,864 VQA tasks, 3,913 image captioning tasks, and 500 fully human-written or verified single-image evaluation VQA pairs. Evaluations across open-source, closed-source, and RS-specific VLMs reveal persistent performance gaps in super-high-resolution scenarios. Code: this https URL
zh
[CV-67] Auxiliary Descriptive Knowledge for Few-Shot Adaptation of Vision-Language Model
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在下游任务中因训练数据分布偏移而导致性能下降的问题,尤其是在少样本场景下,现有参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法受限于固定手工提示(handcrafted prompts)语义表达不足的缺陷。其解决方案的关键在于提出辅助描述知识(Auxiliary Descriptive Knowledge, ADK)框架:首先利用大语言模型(Large Language Model, LLM)离线生成每个类别的丰富描述性提示,随后以两种方式部署这些预计算特征——一是作为组合知识(Compositional Knowledge),通过平均表示提供语义增强,尤其适用于类别名称对VLM不熟悉的情况;二是作为实例特定知识(Instance-Specific Knowledge),借助轻量级非参数注意力机制动态选择最相关的描述用于推理。ADK无需额外参数,可无缝集成至现有PEFT方法中,在不增加推理开销的前提下显著提升分类性能。
链接: https://arxiv.org/abs/2512.17313
作者: SuBeen Lee,GilHan Park,WonJun Moon,Hyun Seok Seong,Jae-Pil Heo
机构: Sungkyunkwan University (成均馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite the impressive zero-shot capabilities of Vision-Language Models (VLMs), they often struggle in downstream tasks with distribution shifts from the pre-training data. Few-Shot Adaptation (FSA-VLM) has emerged as a key solution, typically using Parameter-Efficient Fine-Tuning (PEFT) to adapt models with minimal data. However, these PEFT methods are constrained by their reliance on fixed, handcrafted prompts, which are often insufficient to understand the semantics of classes. While some studies have proposed leveraging image-induced prompts to provide additional clues for classification, they introduce prohibitive computational overhead at inference. Therefore, we introduce Auxiliary Descriptive Knowledge (ADK), a novel framework that efficiently enriches text representations without compromising efficiency. ADK first leverages a Large Language Model to generate a rich set of descriptive prompts for each class offline. These pre-computed features are then deployed in two ways: (1) as Compositional Knowledge, an averaged representation that provides rich semantics, especially beneficial when class names are ambiguous or unfamiliar to the VLM; and (2) as Instance-Specific Knowledge, where a lightweight, non-parametric attention mechanism dynamically selects the most relevant descriptions for a given image. This approach provides two additional types of knowledge alongside the handcrafted prompt, thereby facilitating category distinction across various domains. Also, ADK acts as a parameter-free, plug-and-play component that enhances existing PEFT methods. Extensive experiments demonstrate that ADK consistently boosts the performance of multiple PEFT baselines, setting a new state-of-the-art across various scenarios.
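注:下面用几行 PyTorch 代码示意摘要中"组合知识(描述特征取平均)"与"实例特定知识(无参数注意力动态选择描述)"的计算方式;函数名、温度系数与等权组合策略均为假设,并非论文官方实现。

```python
import torch
import torch.nn.functional as F

def adk_text_features(image_feat, class_desc_feats, handcrafted_feat, tau: float = 0.01):
    """ADK 思路的最小示意(非官方实现):
    组合知识 = 类别描述特征的平均;实例特定知识 = 以图像特征为 query 的无参数注意力加权。"""
    # image_feat: (D,), class_desc_feats: (M, D), handcrafted_feat: (D,),均已 L2 归一化
    compositional = F.normalize(class_desc_feats.mean(dim=0), dim=-1)
    attn = torch.softmax(class_desc_feats @ image_feat / tau, dim=0)     # (M,)
    instance_specific = F.normalize(attn @ class_desc_feats, dim=-1)
    # 三类文本知识这里简单等权求和后归一化,实际权重可另行设计
    return F.normalize(handcrafted_feat + compositional + instance_specific, dim=-1)

if __name__ == "__main__":
    D, M = 512, 8
    img = F.normalize(torch.randn(D), dim=-1)
    desc = F.normalize(torch.randn(M, D), dim=-1)
    hand = F.normalize(torch.randn(D), dim=-1)
    print(adk_text_features(img, desc, hand).shape)  # torch.Size([512])
```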
zh
[CV-68] CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning
【速读】:该论文旨在解决当前开源视觉推理方法在灵活性、可解释性和任务迁移能力方面的局限性,这些问题主要源于对纯文本链式推理、固定视觉模式或单步流水线的依赖。其解决方案的关键在于提出CodeDance框架,该框架将可执行代码作为通用求解器,通过动态定义、组合和执行代码来协调多种工具、计算中间结果并生成可视化输出(如边界框、线条、图表),从而实现透明且自验证的推理过程。此外,引入一种平衡且自适应的工具调用奖励机制,有效调节探索与效率之间的关系,防止工具滥用;实验表明,该方法不仅能超越传统基于模式驱动和纯文本的基线模型,还能在无需任务特定微调的情况下展现出新颖的工具调用、未见组合以及跨任务迁移等涌现行为,体现出一种通用且可扩展的可执行视觉推理机制。
链接: https://arxiv.org/abs/2512.17312
作者: Qi Song,Honglin Li,Yingchen Yu,Haoyi Zhou,Lin Yang,Song Bai,Qi She,Zilong Huang,Yunqing Zhao
机构: Beihang University (北京航空航天大学); Westlake University (西湖大学); ByteDance Singapore (字节跳动新加坡); ByteDance China (字节跳动中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent releases such as o3 highlight human-like “thinking with images” reasoning that combines structured tool use with stepwise verification, yet most open-source approaches still rely on text-only chains, rigid visual schemas, or single-step pipelines, limiting flexibility, interpretability, and transferability on complex tasks. We introduce CodeDance, which explores executable code as a general solver for visual reasoning. Unlike fixed-schema calls (e.g., only predicting bounding-box coordinates), CodeDance defines, composes, and executes code to orchestrate multiple tools, compute intermediate results, and render visual artifacts (e.g., boxes, lines, plots) that support transparent, self-checkable reasoning. To guide this process, we introduce a reward for balanced and adaptive tool-call, which balances exploration with efficiency and mitigates tool overuse. Interestingly, beyond the expected capabilities taught by atomic supervision, we empirically observe novel emergent behaviors during RL training: CodeDance demonstrates novel tool invocations, unseen compositions, and cross-task transfer. These behaviors arise without task-specific fine-tuning, suggesting a general and scalable mechanism of executable visual reasoning. Extensive experiments across reasoning benchmarks (e.g., visual search, math, chart QA) show that CodeDance not only consistently outperforms schema-driven and text-only baselines, but also surpasses advanced closed models such as GPT-4o and larger open-source models.
zh
[CV-69] Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images
【速读】:该论文旨在解决当前大型视觉语言模型(Large Vision-Language Models, VLMs)在多轮视觉推理过程中缺乏自我反思与纠错能力的问题,即模型在产生错误推理路径时难以识别并修正自身错误。解决方案的关键在于提出DRIM框架,其核心是通过三阶段训练流程实现深度且可靠的多轮视觉思维:首先构建高难度、可验证的视觉问答数据集,要求任务需通过多轮工具调用才能解答;其次在监督微调(SFT)阶段引导模型形成多轮推理模式;最后在强化学习(RL)阶段引入冗余惩罚策略优化(redundancy-penalized policy optimization),激励模型对推理轨迹进行判断并惩罚未充分多尺度探索却得出错误答案的路径,从而建立自省式推理机制。
链接: https://arxiv.org/abs/2512.17306
作者: Wenhao Yang,Yu Xia,Jinlong Huang,Shiyin Lu,Qing-Guo Chen,Zhao Xu,Weihua Luo,Kaifu Zhang,Yuanyu Wan,Lijun Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in large Vision-Language Models (VLMs) have exhibited strong reasoning capabilities on complex visual tasks by thinking with images in their Chain-of-Thought (CoT), which is achieved by actively invoking tools to analyze visual inputs rather than merely perceiving them. However, existing models often struggle to reflect on and correct themselves when attempting incorrect reasoning trajectories. To address this limitation, we propose DRIM, a model that enables deep but reliable multi-turn reasoning when thinking with images in its multimodal CoT. Our pipeline comprises three stages: data construction, cold-start SFT and RL. Based on a high-resolution image dataset, we construct high-difficulty and verifiable visual question-answer pairs, where solving each task requires multi-turn tool calls to reach the correct answer. In the SFT stage, we collect tool trajectories as cold-start data, guiding a multi-turn reasoning pattern. In the RL stage, we introduce redundancy-penalized policy optimization, which incentivizes the model to develop a self-reflective reasoning pattern. The basic idea is to impose judgment on reasoning trajectories and penalize those that produce incorrect answers without sufficient multi-scale exploration. Extensive experiments demonstrate that DRIM achieves superior performance on visual understanding benchmarks.
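注:摘要提到的"冗余惩罚策略优化"核心是对"未充分多尺度探索却答错"的推理轨迹施加惩罚。下面是一个极简的奖励函数示意,阈值与系数均为假设,仅用于说明这一思路,并非论文官方奖励设计。

```python
def redundancy_penalized_reward(is_correct: bool, num_tool_calls: int,
                                scales_explored: int, min_scales: int = 2,
                                penalty: float = 0.5) -> float:
    """冗余惩罚奖励的最小示意(非官方实现,数值均为假设):
    答案正确得正奖励;答错且多尺度探索不足时额外扣分;同时轻微惩罚过多的工具调用。"""
    reward = 1.0 if is_correct else 0.0
    if not is_correct and scales_explored < min_scales:
        reward -= penalty
    reward -= 0.01 * max(0, num_tool_calls - 8)
    return reward

if __name__ == "__main__":
    print(redundancy_penalized_reward(False, num_tool_calls=3, scales_explored=1))  # -0.5
    print(redundancy_penalized_reward(True, num_tool_calls=5, scales_explored=3))   # 1.0
```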
zh
[CV-70] EMAG: Self-Rectifying Diffusion Sampling with Exponential Moving Average Guidance
【速读】:该论文旨在解决当前扩散模型和流匹配生成模型中,引导技术(guidance techniques)在生成样本质量与一致性提升方面的局限性,尤其是现有方法难以对负样本的粒度或难度进行可靠控制,且目标层选择通常固定不变的问题。解决方案的关键在于提出一种无需训练的指数移动平均引导(Exponential Moving Average Guidance, EMAG)机制,该机制在推理阶段通过统计自适应地选择注意力层来修改扩散Transformer中的注意力模式,从而生成更难且语义忠实的负样本(细粒度退化),有效暴露困难失败模式,使去噪器能够优化细微伪影,最终在人类偏好评分(HPS)上相比分类器自由引导(CFG)提升+0.46,并可自然集成先进引导技术如APG和CADS以进一步提升性能。
链接: https://arxiv.org/abs/2512.17303
作者: Ankit Yadav,Ta Duc Huy,Lingqiao Liu
机构: Australian Institute for Machine Learning, The University of Adelaide(阿德莱德大学机器学习研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 26 pages
Abstract:In diffusion and flow-matching generative models, guidance techniques are widely used to improve sample quality and consistency. Classifier-free guidance (CFG) is the de facto choice in modern systems and achieves this by contrasting conditional and unconditional samples. Recent work explores contrasting negative samples at inference using a weaker model, via strong/weak model pairs, attention-based masking, stochastic block dropping, or perturbations to the self-attention energy landscape. While these strategies refine the generation quality, they still lack reliable control over the granularity or difficulty of the negative samples, and target-layer selection is often fixed. We propose Exponential Moving Average Guidance (EMAG), a training-free mechanism that modifies attention at inference time in diffusion transformers, with a statistics-based, adaptive layer-selection rule. Unlike prior methods, EMAG produces harder, semantically faithful negatives (fine-grained degradations), surfacing difficult failure modes, enabling the denoiser to refine subtle artifacts, boosting the quality and human preference score (HPS) by +0.46 over CFG. We further demonstrate that EMAG naturally composes with advanced guidance techniques, such as APG and CADS, further improving HPS.
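注:论文中的 EMA 实际作用于 DiT 的注意力图并按统计量自适应选层,细节未在摘要中给出。下面仅在"预测层级"示意"用历史预测的 EMA 充当退化负分支,再做 CFG 式外推"的引导公式,衰减率与引导强度均为假设,并非论文实现。

```python
import torch

class EMAGuidance:
    """指数滑动平均引导的最小示意(非论文官方实现,细节为假设)。"""
    def __init__(self, decay: float = 0.9, scale: float = 1.5):
        self.decay, self.scale, self.ema = decay, scale, None

    def step(self, pred_cond: torch.Tensor) -> torch.Tensor:
        # 维护条件预测的 EMA,作为"更弱/退化"的负样本分支
        if self.ema is None:
            self.ema = pred_cond.detach().clone()
        else:
            self.ema.mul_(self.decay).add_(pred_cond.detach(), alpha=1 - self.decay)
        # 向条件预测方向外推,远离 EMA 负分支
        return pred_cond + self.scale * (pred_cond - self.ema)

if __name__ == "__main__":
    g = EMAGuidance()
    out = None
    for _ in range(3):
        out = g.step(torch.randn(1, 4, 8, 8))
    print(out.shape)  # torch.Size([1, 4, 8, 8])
```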
zh
[CV-71] MatLat: Material Latent Space for PBR Texture Generation
【速读】:该论文旨在解决在缺乏大规模PBR(Physically Based Rendering)纹理数据集的情况下,如何高效生成高质量PBR纹理的问题。其核心挑战在于如何利用预训练的潜在图像生成模型(如扩散模型)的嵌入空间和先验知识,同时构建一个专门用于材质表示的潜在空间(MatLat),并确保在微调过程中保持潜在分布的一致性,从而支持多通道PBR纹理的合成与跨视角一致性。解决方案的关键在于:首先对预训练变分自编码器(VAE)进行针对性微调,使新引入的材质通道能以最小的潜在分布偏移融入系统;其次,在微调过程中引入一种基于局部区域裁剪与重建的正则化策略,以强化潜在空间与图像空间之间的像素级空间对应关系,从而保障跨视角的一致性。实验表明,该框架在PBR纹理保真度上优于现有方法,且各组件均对达到最先进性能至关重要。
链接: https://arxiv.org/abs/2512.17302
作者: Kyeongmin Yeo,Yunhong Min,Jaihoon Kim,Minhyuk Sung
机构: KAIST
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:We propose a generative framework for producing high-quality PBR textures on a given 3D mesh. As large-scale PBR texture datasets are scarce, our approach focuses on effectively leveraging the embedding space and diffusion priors of pretrained latent image generative models while learning a material latent space, MatLat, through targeted fine-tuning. Unlike prior methods that freeze the embedding network and thus lead to distribution shifts when encoding additional PBR channels and hinder subsequent diffusion training, we fine-tune the pretrained VAE so that new material channels can be incorporated with minimal latent distribution deviation. We further show that correspondence-aware attention alone is insufficient for cross-view consistency unless the latent-to-image mapping preserves locality. To enforce this locality, we introduce a regularization in the VAE fine-tuning that crops latent patches, decodes them, and aligns the corresponding image regions to maintain strong pixel-latent spatial correspondence. Ablation studies and comparison with previous baselines demonstrate that our framework improves PBR texture fidelity and that each component is critical for achieving state-of-the-art performance.
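注:摘要中"裁剪潜在图块、解码并与对应图像区域对齐"的局部性正则可以写成一个很短的损失函数。下面是示意实现;解码器接口、8 倍上采样率与 L1 对齐损失均为假设,并非论文官方代码。

```python
import torch
import torch.nn.functional as F

def locality_regularization(vae_decoder, latent, image, scale: int = 8, patch: int = 8):
    """MatLat 局部性正则的最小示意(非官方实现,接口与倍率为假设):
    随机裁剪潜在图块并解码,与对应图像区域做 L1 对齐,保持像素-潜在的空间对应。"""
    _, _, h, w = latent.shape
    top = torch.randint(0, h - patch + 1, (1,)).item()
    left = torch.randint(0, w - patch + 1, (1,)).item()
    latent_crop = latent[:, :, top:top + patch, left:left + patch]
    image_crop = image[:, :, top * scale:(top + patch) * scale,
                          left * scale:(left + patch) * scale]
    recon_crop = vae_decoder(latent_crop)          # 假设解码器上采样倍率为 scale
    return F.l1_loss(recon_crop, image_crop)

if __name__ == "__main__":
    # 用一个玩具解码器验证形状与流程
    dec = torch.nn.Sequential(
        torch.nn.Upsample(scale_factor=8, mode="nearest"),
        torch.nn.Conv2d(4, 3, 3, padding=1),
    )
    z = torch.randn(1, 4, 32, 32)
    x = torch.randn(1, 3, 256, 256)
    print(locality_regularization(dec, z, x).item())
```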
zh
[CV-72] ProCache: Constraint-Aware Feature Caching with Selective Computation for Diffusion Transformer Acceleration AAAI2026
【速读】:该论文旨在解决扩散变换器(Diffusion Transformers, DiTs)在生成建模中计算成本高、难以实时部署的问题。现有基于特征缓存(feature caching)的加速方法存在两大局限:一是缓存间隔均匀,无法匹配DiT内部非均匀的时间动态特性;二是粗粒度的特征复用在大间隔下会导致严重误差累积。解决方案的关键在于提出ProCache框架,其核心创新包括:(i) 一种约束感知的缓存模式搜索模块,通过离线约束采样生成与模型时间特性匹配的非均匀激活调度策略;(ii) 一种选择性计算模块,在深度块和高重要性token上对缓存段进行局部重计算,以最小开销缓解误差传播。实验表明,ProCache在PixArt-alpha和DiT上分别实现最高1.96倍和2.90倍加速,且质量损失可忽略,显著优于现有缓存加速方法。
链接: https://arxiv.org/abs/2512.17298
作者: Fanpu Cao,Yaofo Chen,Zeng You,Wei Luo,Cen Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for poster presentation at AAAI 2026
Abstract:Diffusion Transformers (DiTs) have achieved state-of-the-art performance in generative modeling, yet their high computational cost hinders real-time deployment. While feature caching offers a promising training-free acceleration solution by exploiting temporal redundancy, existing methods suffer from two key limitations: (1) uniform caching intervals fail to align with the non-uniform temporal dynamics of DiT, and (2) naive feature reuse with excessively large caching intervals can lead to severe error accumulation. In this work, we analyze the evolution of DiT features during denoising and reveal that both feature changes and error propagation are highly time- and depth-varying. Motivated by this, we propose ProCache, a training-free dynamic feature caching framework that addresses these issues via two core components: (i) a constraint-aware caching pattern search module that generates non-uniform activation schedules through offline constrained sampling, tailored to the model’s temporal characteristics; and (ii) a selective computation module that selectively computes within deep blocks and high-importance tokens for cached segments to mitigate error accumulation with minimal overhead. Extensive experiments on PixArt-alpha and DiT demonstrate that ProCache achieves up to 1.96x and 2.90x acceleration with negligible quality degradation, significantly outperforming prior caching-based methods.
zh
[CV-73] Towards Pixel-Wise Anomaly Location for High-Resolution PCBA via Self-Supervised Image Reconstruction
【速读】:该论文旨在解决高分辨率印刷电路板组装件(PCBA)自动缺陷检测中因标注数据不足、微小缺陷仅占几个像素以及图像视觉复杂度高所带来的挑战。其解决方案的关键在于提出一种名为HiSIR-Net的高分辨率自监督重建框架,核心创新包括两个轻量级模块:一是选择性输入-重建门(Selective Input-Reconstruction Gate, SIR-Gate),使模型能够动态判断在何处依赖重建结果而非原始输入,从而减少无关重建伪影和误报;二是基于位置线索的区域级优化补丁选择策略(Region-level Optimized Patch Selection, ROPS),实现跨任意分辨率下重叠补丁重建的一致性选择,最终生成清晰且低误报率的像素级异常图。
链接: https://arxiv.org/abs/2512.17296
作者: Wuyi Liu,Le Jin,Junxian Yang,Yuanchao Yu,Zishuo Peng,Jinfeng Xu,Xianzhi Li,Jun Zhou
机构: Huazhong University of Science and Technology (华中科技大学); Siemens AG (西门子集团); University of Electronic Science and Technology of China (电子科技大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Automated defect inspection of assembled Printed Circuit Board Assemblies (PCBA) is quite challenging due to the insufficient labeled data, micro-defects with just a few pixels in visually-complex and high-resolution images. To address these challenges, we present HiSIR-Net, a High resolution, Self-supervised Reconstruction framework for pixel-wise PCBA localization. Our design combines two lightweight modules that make this practical on real 4K-resolution boards: (i) a Selective Input-Reconstruction Gate (SIR-Gate) that lets the model decide where to trust reconstruction versus the original input, thereby reducing irrelevant reconstruction artifacts and false alarms; and (ii) a Region-level Optimized Patch Selection (ROPS) scheme with positional cues to select overlapping patch reconstructions coherently across arbitrary resolutions. Organically integrating these mechanisms yields clean, high-resolution anomaly maps with low false positive (FP) rate. To bridge the gap in high-resolution PCBA datasets, we further contribute a self-collected dataset named SIPCBA-500 of 500 images. We conduct extensive experiments on our SIPCBA-500 as well as public benchmarks, demonstrating the superior localization performance of our method while running at practical speed. Full code and dataset will be made available upon acceptance.
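注:下面给出 SIR-Gate 思路的一个最小示意:学习逐像素门控决定在何处信任重建、何处保留原输入,并由门控融合结果与输入的差异得到异常图。网络结构与异常图定义均为假设,并非论文官方实现。

```python
import torch
import torch.nn as nn

class SIRGate(nn.Module):
    """选择性输入-重建门的最小示意(非论文官方实现,结构为假设)。"""
    def __init__(self, in_ch: int = 3):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(in_ch * 2, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, 1), nn.Sigmoid(),
        )

    def forward(self, x, recon):
        g = self.gate(torch.cat([x, recon], dim=1))            # (B, 1, H, W) 门控图
        blended = g * recon + (1 - g) * x                      # 门控融合,抑制无关重建伪影
        anomaly_map = (x - blended).abs().mean(dim=1, keepdim=True)
        return blended, anomaly_map

if __name__ == "__main__":
    m = SIRGate()
    x, r = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
    blended, amap = m(x, r)
    print(blended.shape, amap.shape)
```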
zh
[CV-74] Vision-Language Model Guided Image Restoration
【速读】:该论文旨在解决图像恢复(Image Restoration, IR)任务中难以同时实现像素级保真度与高层语义一致性的问题,尤其针对现有方法在利用视觉与语言知识融合方面存在的不足。其解决方案的关键在于提出了一种基于视觉-语言模型(Vision-Language Model, VLM)引导的图像恢复框架(VLMIR),通过两阶段设计:第一阶段利用VLM(如CLIP)提取互补的视觉与语言表征,包括通过余弦相似性损失结合LoRA微调对低质与高质图像的文本描述嵌入进行对齐,并引入退化预测器分离退化特征与干净图像内容嵌入;第二阶段将这些多模态嵌入通过交叉注意力机制注入扩散模型中,从而增强恢复图像的感知质量与语义合理性。实验证明,该方法在通用及特定退化类型的图像恢复任务中均显著优于现有方法,凸显了整合VLM提供的视觉-语言先验对于提升图像恢复性能的关键作用。
链接: https://arxiv.org/abs/2512.17292
作者: Cuixin Yang,Rongkang Dong,Kin-Man Lam
机构: Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Many image restoration (IR) tasks require both pixel-level fidelity and high-level semantic understanding to recover realistic photos with fine-grained details. However, previous approaches often struggle to effectively leverage both the visual and linguistic knowledge. Recent efforts have attempted to incorporate Vision-language models (VLMs), which excel at aligning visual and textual features, into universal IR. Nevertheless, these methods fail to utilize the linguistic priors to ensure semantic coherence during the restoration process. To address this issue, in this paper, we propose the Vision-Language Model Guided Image Restoration (VLMIR) framework, which leverages the rich vision-language priors of VLMs, such as CLIP, to enhance IR performance through improved visual perception and semantic understanding. Our approach consists of two stages: VLM-based feature extraction and diffusion-based image restoration. In the first stage, we extract complementary visual and linguistic representations of input images by condensing the visual perception and high-level semantic priors through VLMs. Specifically, we align the embeddings of captions from low-quality and high-quality images using a cosine similarity loss with LoRA fine-tuning, and employ a degradation predictor to decompose degradation and clean image content embeddings. These complementary visual and textual embeddings are then integrated into a diffusion-based model via cross-attention mechanisms for enhanced restoration. Extensive experiments and ablation studies demonstrate that VLMIR achieves superior performance across both universal and degradation-specific IR tasks, underscoring the critical role of integrated visual and linguistic knowledge from VLMs in advancing image restoration capabilities.
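注:摘要中"用余弦相似度损失对齐低质/高质图像描述文本嵌入"的做法可以概括为下面几行代码;这是一个示意性的损失函数,实际与 LoRA 微调配合使用,并非论文官方实现。

```python
import torch
import torch.nn.functional as F

def caption_alignment_loss(lq_caption_emb: torch.Tensor, hq_caption_emb: torch.Tensor):
    """最小示意(非官方实现):拉近低质图像与对应高质图像描述文本的嵌入。"""
    lq = F.normalize(lq_caption_emb, dim=-1)
    hq = F.normalize(hq_caption_emb, dim=-1)
    return (1.0 - (lq * hq).sum(dim=-1)).mean()   # 1 - cosine similarity

if __name__ == "__main__":
    lq, hq = torch.randn(4, 512), torch.randn(4, 512)
    print(caption_alignment_loss(lq, hq).item())
```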
zh
[CV-75] Diagnostic Performance of Universal-Learning Ultrasound AI Across Multiple Organs and Tasks: the UUSIC25 Challenge MICCAI2025
【速读】:该论文旨在解决当前超声AI系统中普遍存在的任务碎片化问题,即现有模型通常仅针对单一器官或特定任务进行优化,难以满足临床对多功能、高效率诊断工具的需求。其解决方案的关键在于开发和评估一种通用型深度学习模型(general-purpose deep learning model),通过统一架构实现多器官分类与分割任务的联合建模,在保持高诊断准确率(如平均Dice相似系数DSC达0.854)和计算效率(推理时间与GPU内存占用可控)的同时,验证其跨中心泛化能力。研究发现,尽管模型在多数任务上表现优异,但在未见数据(如乳腺癌分子分型)上出现性能显著下降,凸显出域泛化(domain generalization)是未来临床部署的关键挑战。
链接: https://arxiv.org/abs/2512.17279
作者: Zehui Lin,Luyi Han,Xin Wang,Ying Zhou,Yanming Zhang,Tianyu Zhang,Lingyun Bao,Shandong Wu,Dong Xu,Tao Tan, theUUSIC25 Challenge Consortium
机构: Macao Polytechnic University (澳门理工学院); Netherlands Cancer Institute (荷兰癌症研究所); Zhejiang Cancer Hospital (浙江省肿瘤医院); The First People’s Hospital of Hangzhou, Affiliated Hangzhou Hospital of Nanjing Medical University (杭州市第一人民医院,南京医科大学附属杭州医院); University of Pittsburgh (匹兹堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 2 figures. Summary of the UUSIC25 Challenge held at MICCAI 2025. Extensive Supplementary Material (containing original team reports) is available in the “ancillary files” section
Abstract:IMPORTANCE: Current ultrasound AI remains fragmented into single-task tools, limiting clinical utility compared to versatile modern ultrasound systems. OBJECTIVE: To evaluate the diagnostic accuracy and efficiency of single general-purpose deep learning models for multi-organ classification and segmentation. DESIGN: The Universal UltraSound Image Challenge 2025 (UUSIC25) involved developing algorithms on 11,644 images (public/private). Evaluation used an independent, multi-center test set of 2,479 images, including data from a center completely unseen during training to assess generalization. OUTCOMES: Diagnostic performance (Dice Similarity Coefficient [DSC]; Area Under the Receiver Operating Characteristic Curve [AUC]) and computational efficiency (inference time, GPU memory). RESULTS: Of 15 valid algorithms, the top model (SMART) achieved a macro-averaged DSC of 0.854 across 5 segmentation tasks and AUC of 0.766 for binary classification. Models showed high capability in segmentation (e.g., fetal head DSC: 0.942) but variability in complex tasks subject to domain shift. Notably, in breast cancer molecular subtyping, the top model's performance dropped from AUC 0.571 (internal) to 0.508 (unseen external center), highlighting generalization challenges. CONCLUSIONS: General-purpose AI models achieve high accuracy and efficiency across multiple tasks using a single architecture. However, performance degradation on unseen data suggests domain generalization is critical for future clinical deployment.
zh
[CV-76] WDFFU-Mamba: A Wavelet-guided Dual-attention Feature Fusion Mamba for Breast Tumor Segmentation in Ultrasound Images
【速读】:该论文旨在解决乳腺超声(Breast Ultrasound, BUS)图像中肿瘤分割的难题,主要挑战包括斑点噪声(speckle noise)、成像伪影、病灶形态不规则以及边界模糊等问题,这些问题严重制约了分割精度。解决方案的关键在于提出一种名为WDFFU-Mamba的新颖分割网络,其核心创新为:1)引入小波域引导的高频频特征增强模块(Wavelet-denoised High-Frequency-guided Feature, WHF),通过抑制噪声的高频信息提升低层特征表达;2)设计双注意力特征融合模块(Dual Attention Feature Fusion, DAFF),有效融合跳跃连接与语义特征,增强上下文建模能力。该方法在两个公开BUS数据集上验证,显著优于现有方法,在Dice系数和95%分位数 Hausdorff 距离(HD95)指标上表现优异,兼具高精度、鲁棒性和良好的跨数据集泛化能力,适用于临床实际场景中的乳腺肿瘤超声分析。
链接: https://arxiv.org/abs/2512.17278
作者: Guoping Cai,Houjin Chen,Yanfeng Li,Jia Sun,Ziwei Chen,Qingzi Geng
机构: Beijing Jiaotong University (北京交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Breast ultrasound (BUS) image segmentation plays a vital role in assisting clinical diagnosis and early tumor screening. However, challenges such as speckle noise, imaging artifacts, irregular lesion morphology, and blurred boundaries severely hinder accurate segmentation. To address these challenges, this work aims to design a robust and efficient model capable of automatically segmenting breast tumors in BUS images. We propose a novel segmentation network named WDFFU-Mamba, which integrates wavelet-guided enhancement and dual-attention feature fusion within a U-shaped Mamba architecture. A Wavelet-denoised High-Frequency-guided Feature (WHF) module is employed to enhance low-level representations through noise-suppressed high-frequency cues. A Dual Attention Feature Fusion (DAFF) module is also introduced to effectively merge skip-connected and semantic features, improving contextual understanding. Extensive experiments on two public BUS datasets demonstrate that WDFFU-Mamba achieves superior segmentation accuracy, significantly outperforming existing methods in terms of Dice coefficient and 95th percentile Hausdorff Distance (HD95). The combination of wavelet-domain enhancement and attention-based fusion greatly improves both the accuracy and robustness of BUS image segmentation, while maintaining computational efficiency. The proposed WDFFU-Mamba model not only delivers strong segmentation performance but also exhibits desirable generalization ability across datasets, making it a promising solution for real-world clinical applications in breast tumor ultrasound analysis.
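注:WHF 模块的核心想法是"先对小波高频子带去噪,再把去噪后的高频线索用于引导低层特征"。下面用 PyWavelets(pywt)给出一个提取去噪高频线索图的最小示意;小波基、软阈值与基于 MAD 的噪声估计均为假设,并非论文官方实现。

```python
import numpy as np
import pywt

def denoised_highfreq_cue(gray: np.ndarray, wavelet: str = "haar", k: float = 2.0):
    """小波去噪高频引导的最小示意(非官方实现,阈值策略为假设):
    一级 DWT 后对高频子带做软阈值去噪(抑制斑点噪声),低频置零后重建出高频线索图。"""
    cA, (cH, cV, cD) = pywt.dwt2(gray.astype(np.float64), wavelet)
    subbands = []
    for c in (cH, cV, cD):
        thr = k * np.median(np.abs(c)) / 0.6745          # 基于 MAD 的噪声水平估计
        subbands.append(pywt.threshold(c, thr, mode="soft"))
    return pywt.idwt2((np.zeros_like(cA), tuple(subbands)), wavelet)

if __name__ == "__main__":
    img = np.random.rand(128, 128)
    print(denoised_highfreq_cue(img).shape)  # (128, 128)
```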
zh
[CV-77] AnyCXR: Human Anatomy Segmentation of Chest X-ray at Any Acquisition Position using Multi-stage Domain Randomized Synthetic Data with Imperfect Annotations and Conditional Joint Annotation Regularization Learning
【速读】:该论文旨在解决胸部X光片(CXR)中稳健的解剖结构分割难题,其核心挑战在于真实世界数据中标注信息稀缺以及成像条件的显著差异。解决方案的关键在于提出AnyCXR框架,该框架通过两个核心技术实现:一是多阶段域随机化(Multi-stage Domain Randomization, MSDR)引擎,从3D CT体积生成超过10万张解剖结构逼真且多样化的合成X光图像;二是条件联合标注正则化(Conditional Joint Annotation Regularization, CAR)学习策略,利用不完整或有噪声的标签,在潜在空间中强制执行解剖一致性约束。该方法完全基于合成数据训练,实现了对多种真实世界数据集的强零样本泛化能力,能够准确分割PA、侧位和斜位视图下的54个解剖结构,从而为自动心脏胸廓比例估算、脊柱弯曲评估及疾病分类等下游临床任务提供可靠的解剖先验支持。
链接: https://arxiv.org/abs/2512.17263
作者: Dong Zifei,Wu Wenjie,Hao Jinkui,Chen Tianqi,Weng Ziqiao,Zhou Bo
机构: Northwestern University (西北大学); Vanderbilt University (范德比尔特大学); Shanxi Medical University (山西医科大学); University of Sydney (悉尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 12 figures, Preprint (under review at Medical Image Analysis)
Abstract:Robust anatomical segmentation of chest X-rays (CXRs) remains challenging due to the scarcity of comprehensive annotations and the substantial variability of real-world acquisition conditions. We propose AnyCXR, a unified framework that enables generalizable multi-organ segmentation across arbitrary CXR projection angles using only synthetic supervision. The method combines a Multi-stage Domain Randomization (MSDR) engine, which generates over 100,000 anatomically faithful and highly diverse synthetic radiographs from 3D CT volumes, with a Conditional Joint Annotation Regularization (CAR) learning strategy that leverages partial and imperfect labels by enforcing anatomical consistency in a latent space. Trained entirely on synthetic data, AnyCXR achieves strong zero-shot generalization on multiple real-world datasets, providing accurate delineation of 54 anatomical structures in PA, lateral, and oblique views. The resulting segmentation maps support downstream clinical tasks, including automated cardiothoracic ratio estimation, spine curvature assessment, and disease classification, where the incorporation of anatomical priors improves diagnostic performance. These results demonstrate that AnyCXR establishes a scalable and reliable foundation for anatomy-aware CXR analysis and offers a practical pathway toward reducing annotation burdens while improving robustness across diverse imaging conditions.
zh
[CV-78] Mitty: Diffusion-based Human-to-Robot Video Generation
【速读】:该论文旨在解决当前机器人学习中依赖中间表示(如关键点或轨迹)而导致的信息丢失和累积误差问题,从而影响视频时序与视觉一致性。其核心解决方案是提出Mitty——一种基于预训练视频扩散模型的Diffusion Transformer架构,能够实现端到端的人类示范视频到机器人执行视频的生成,无需动作标签或中间抽象表示。该方法通过将示范视频压缩为条件token,并在扩散过程中与机器人去噪token通过双向注意力机制融合,充分利用强视觉-时序先验,从而提升生成质量与泛化能力。
链接: https://arxiv.org/abs/2512.17253
作者: Yiren Song,Cheng Liu,Weijia Mao,Mike Zheng Shou
机构: Show Lab, National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Learning directly from human demonstration videos is a key milestone toward scalable and generalizable robot learning. Yet existing methods rely on intermediate representations such as keypoints or trajectories, introducing information loss and cumulative errors that harm temporal and visual consistency. We present Mitty, a Diffusion Transformer that enables video In-Context Learning for end-to-end Human2Robot video generation. Built on a pretrained video diffusion model, Mitty leverages strong visual-temporal priors to translate human demonstrations into robot-execution videos without action labels or intermediate abstractions. Demonstration videos are compressed into condition tokens and fused with robot denoising tokens through bidirectional attention during diffusion. To mitigate paired-data scarcity, we also develop an automatic synthesis pipeline that produces high-quality human-robot pairs from large egocentric datasets. Experiments on Human2Robot and EPIC-Kitchens show that Mitty delivers state-of-the-art results, strong generalization to unseen environments, and new insights for scalable robot learning from human observations.
zh
[CV-79] Video Detective: Seek Critical Clues Recurrently to Answer Question from Long Videos
【速读】:该论文旨在解决长视频问答(Long Video Question-Answering, LVQA)中因海量上下文信息导致的模型记忆消耗过高和关键线索易被忽略的问题。现有方法通过减少视觉标记或扩展模型上下文长度来应对,但往往会造成信息丢失或计算开销过大。其解决方案的关键在于提出一种问题感知的记忆机制(question-aware memory mechanism),通过迭代处理视频子片段,并引入少量特殊记忆标记对每个子片段进行目的性压缩,从而高效识别并保留与问题相关的关键线索;同时,该机制会递归聚合并存储这些记忆标记以更新历史上下文,实现跨子片段的信息复用,显著提升模型在有限上下文长度下对长视频的理解能力。
链接: https://arxiv.org/abs/2512.17229
作者: Henghui Du,Chang Zhou,Chunjie Zhang,Xi Chen,Di Hu
机构: Gaoling School of Artificial Intelligence (人工智能学院); Renmin University of China (中国人民大学); AI Technology Center (人工智能技术中心); Online Video Business Unit (在线视频业务单元); Tencent PCG (腾讯PCG)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Long Video Question-Answering (LVQA) presents a significant challenge for Multi-modal Large Language Models (MLLMs) due to immense context and overloaded information, which could also lead to prohibitive memory consumption. While existing methods attempt to address these issues by reducing visual tokens or extending model’s context length, they may miss useful information or take considerable computation. In fact, when answering given questions, only a small amount of crucial information is required. Therefore, we propose an efficient question-aware memory mechanism, enabling MLLMs to recurrently seek these critical clues. Our approach, named VideoDetective, simplifies this task by iteratively processing video sub-segments. For each sub-segment, a question-aware compression strategy is employed by introducing a few special memory tokens to achieve purposefully compression. This allows models to effectively seek critical clues while reducing visual tokens. Then, due to history context could have a significant impact, we recurrently aggregate and store these memory tokens to update history context, which would be reused for subsequent sub-segments. Furthermore, to more effectively measure model’s long video understanding ability, we introduce GLVC (Grounding Long Video Clues), a long video question-answering dataset, which features grounding critical and concrete clues scattered throughout entire videos. Experimental results demonstrate our method enables MLLMs with limited context length of 32K to efficiently process 100K tokens (3600 frames, an hour-long video sampled at 1fps), requiring only 2 minutes and 37GB GPU memory usage. Evaluation results across multiple long video benchmarks illustrate our method can more effectively seek critical clues from massive information.
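注:摘要描述的"问题感知记忆机制"可以抽象为一个循环:逐段读取视频特征,用少量记忆 token 压缩当前段,再把压缩结果并入历史上下文。下面是一个维度与结构均为假设的 PyTorch 示意,并非论文官方实现。

```python
import torch
import torch.nn as nn

class RecurrentMemory(nn.Module):
    """问题感知记忆机制的最小示意(非论文官方实现,结构与维度为假设)。"""
    def __init__(self, dim: int = 256, num_mem: int = 8, max_history: int = 64):
        super().__init__()
        self.num_mem = num_mem
        self.max_history = max_history
        self.mem_tokens = nn.Parameter(torch.randn(num_mem, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, segments, question_emb):
        # segments: 若干 (B, T_i, D) 的子片段特征;question_emb: (B, 1, D)
        B = question_emb.shape[0]
        history = question_emb                       # 历史上下文以问题嵌入为起点
        for seg in segments:
            query = torch.cat([self.mem_tokens.unsqueeze(0).expand(B, -1, -1), history], dim=1)
            out, _ = self.attn(query, seg, seg)      # 以记忆+历史为 query 压缩当前片段
            compressed = out[:, : self.num_mem]      # 只保留记忆 token 位置的输出
            history = torch.cat([history, compressed], dim=1)[:, -self.max_history :]
        return history                               # 交给 LLM 的紧凑上下文

if __name__ == "__main__":
    mem = RecurrentMemory()
    segs = [torch.randn(1, 100, 256) for _ in range(5)]
    q = torch.randn(1, 1, 256)
    print(mem(segs, q).shape)  # torch.Size([1, 41, 256])
```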
zh
[CV-80] Learning When to Look: A Disentangled Curriculum for Strategic Perception in Multimodal Reasoning
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在复杂、长链视觉推理任务中出现的“视觉遗忘”(visual forgetting)问题,即模型在推理过程中逐渐丧失对视觉信息的准确感知与依赖,表现为“想得越久,看得越少”。其根本原因在于现有训练范式过早地将抽象逻辑推理(how-to-think)与策略性视觉感知(when-to-look)纠缠在一起,导致模型缺乏有效的视觉感知决策机制。解决方案的关键在于提出一种基于课程学习(curriculum-based)的两阶段框架:第一阶段通过解耦监督微调(disentangled Supervised Fine-Tuning, SFT)构建纯文本驱动的稳健抽象推理基础,并引入感知锚定的思维链(Perception-Grounded Chain-of-Thought, PG-CoT)将其与视觉信息结合;第二阶段将感知时机建模为强化学习问题,设计关键性的“关键感知奖励”(Pivotal Perception Reward),通过将感知动作与认知不确定性标记(如“wait”, “verify”)关联,使模型学会自主制定视觉感知策略,从而实现从启发式观察者到战略化、具身化推理者的转变。
链接: https://arxiv.org/abs/2512.17227
作者: Siqi Yang,Zilve Gao,Haibo Qiu,Fanfan Liu,Peng Shi,Zhixiong Zeng,Qingmin Liao,Lin Ma
机构: Meituan; Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal Large Language Models (MLLMs) demonstrate significant potential but remain brittle in complex, long-chain visual reasoning tasks. A critical failure mode is "visual forgetting", where models progressively lose visual grounding as reasoning extends, a phenomenon aptly described as "think longer, see less". We posit this failure stems from current training paradigms prematurely entangling two distinct cognitive skills: (1) abstract logical reasoning ("how-to-think") and (2) strategic visual perception ("when-to-look"). This creates a foundational cold-start deficiency – weakening abstract reasoning – and a strategic perception deficit, as models lack a policy for when to perceive. In this paper, we propose a novel curriculum-based framework to disentangle these skills. First, we introduce a disentangled Supervised Fine-Tuning (SFT) curriculum that builds a robust abstract reasoning backbone on text-only data before anchoring it to vision with a novel Perception-Grounded Chain-of-Thought (PG-CoT) paradigm. Second, we resolve the strategic perception deficit by formulating timing as a reinforcement learning problem. We design a Pivotal Perception Reward that teaches the model when to look by coupling perceptual actions to linguistic markers of cognitive uncertainty (e.g., "wait", "verify"), thereby learning an autonomous grounding policy. Our contributions include the formalization of these two deficiencies and the development of a principled, two-stage framework to address them, transforming the model from a heuristic-driven observer to a strategic, grounded reasoner. Code: this https URL.
zh
[CV-81] Robust Scene Coordinate Regression via Geometrically-Consistent Global Descriptors WACV2026
【速读】:该论文旨在解决当前基于学习的视觉定位方法中全局描述符(global descriptors)判别能力不足的问题,尤其是在依赖几何线索(如共视图)生成描述符时,易受噪声几何约束影响,导致错误关联和定位性能下降。解决方案的关键在于提出一个聚合模块(aggregator module),该模块联合学习与几何结构和视觉相似性均一致的全局描述符,确保仅当图像在视觉上相似且空间上相连时才在描述符空间中靠近,从而纠正由不可靠重叠分数引起的错误匹配。通过仅基于重叠分数的批次挖掘策略和改进的对比损失函数,该方法无需人工地点标签即可训练,并在多种环境间具有良好泛化能力。
链接: https://arxiv.org/abs/2512.17226
作者: Son Tung Nguyen,Tobias Fischer,Alejandro Fontan,Michael Milford
机构: Queensland University of Technology (昆士兰科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: WACV 2026 conference paper
Abstract:Recent learning-based visual localization methods use global descriptors to disambiguate visually similar places, but existing approaches often derive these descriptors from geometric cues alone (e.g., covisibility graphs), limiting their discriminative power and reducing robustness in the presence of noisy geometric constraints. We propose an aggregator module that learns global descriptors consistent with both geometrical structure and visual similarity, ensuring that images are close in descriptor space only when they are visually similar and spatially connected. This corrects erroneous associations caused by unreliable overlap scores. Using a batch-mining strategy based solely on the overlap scores and a modified contrastive loss, our method trains without manual place labels and generalizes across diverse environments. Experiments on challenging benchmarks show substantial localization gains in large-scale environments while preserving computational and memory efficiency. Code is available at this https URL.
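注:摘要提到"仅基于重叠分数的批内挖掘 + 改进的对比损失"。下面给出一个示意性的对比损失:重叠分数高于阈值的图像对视为正样本、其余为负样本;阈值与间隔(margin)均为假设,并非论文官方损失形式。

```python
import torch
import torch.nn.functional as F

def overlap_contrastive_loss(desc: torch.Tensor, overlap: torch.Tensor,
                             pos_thr: float = 0.3, margin: float = 0.5):
    """基于重叠分数的批内挖掘与对比损失的最小示意(非官方实现,阈值为假设)。"""
    desc = F.normalize(desc, dim=-1)                 # (N, D) 全局描述符
    dist = torch.cdist(desc, desc)                   # (N, N) 两两欧氏距离
    pos_mask = (overlap > pos_thr).float()
    neg_mask = 1.0 - pos_mask
    eye = torch.eye(desc.shape[0], device=desc.device)
    pos_mask, neg_mask = pos_mask * (1 - eye), neg_mask * (1 - eye)
    pos_loss = (pos_mask * dist.pow(2)).sum() / pos_mask.sum().clamp(min=1)
    neg_loss = (neg_mask * F.relu(margin - dist).pow(2)).sum() / neg_mask.sum().clamp(min=1)
    return pos_loss + neg_loss

if __name__ == "__main__":
    d = torch.randn(8, 128)          # 一个 batch 的描述符
    ov = torch.rand(8, 8)            # 两两之间的重叠/共视分数
    print(overlap_contrastive_loss(d, ov).item())
```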
zh
[CV-82] Any-Optical-Model: A Universal Foundation Model for Optical Remote Sensing AAAI2026
【速读】:该论文旨在解决现有遥感基础模型(Remote Sensing Foundation Models, RSFMs)在面对不同光学传感器间波段组成和空间分辨率差异时的泛化能力不足问题,特别是在波段缺失、跨传感器融合及未知空间尺度等实际场景下表现受限。其核心解决方案是提出Any Optical Model (AOM),关键创新在于:1)设计了一种谱无关的tokenzier,通过为每个波段分配独立的band embedding来显式编码光谱身份,从而在波段缺失或新增时仍能保留特征区分性;2)引入多尺度自适应patch嵌入机制,动态调节感受野以有效捕捉从亚米级到百米级影像中的纹理与上下文模式;3)结合多尺度语义对齐机制与通道级自监督掩码重建预训练策略,协同建模光谱-空间关系,确保跨分辨率下的全局语义一致性。
链接: https://arxiv.org/abs/2512.17224
作者: Xuyang Li,Chenyu Li,Danfeng Hong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI2026
Abstract:Optical satellites, with their diverse band layouts and ground sampling distances, supply indispensable evidence for tasks ranging from ecosystem surveillance to emergency response. However, significant discrepancies in band composition and spatial resolution across different optical sensors present major challenges for existing Remote Sensing Foundation Models (RSFMs). These models are typically pretrained on fixed band configurations and resolutions, making them vulnerable to real world scenarios involving missing bands, cross sensor fusion, and unseen spatial scales, thereby limiting their generalization and practical deployment. To address these limitations, we propose Any Optical Model (AOM), a universal RSFM explicitly designed to accommodate arbitrary band compositions, sensor types, and resolution scales. To preserve distinctive spectral characteristics even when bands are missing or newly introduced, AOM introduces a spectrum-independent tokenizer that assigns each channel a dedicated band embedding, enabling explicit encoding of spectral identity. To effectively capture texture and contextual patterns from sub-meter to hundred-meter imagery, we design a multi-scale adaptive patch embedding mechanism that dynamically modulates the receptive field. Furthermore, to maintain global semantic consistency across varying resolutions, AOM incorporates a multi-scale semantic alignment mechanism alongside a channel-wise self-supervised masking and reconstruction pretraining strategy that jointly models spectral-spatial relationships. Extensive experiments on over 10 public datasets, including those from Sentinel-2, Landsat, and HLS, demonstrate that AOM consistently achieves state-of-the-art (SOTA) performance under challenging conditions such as band missing, cross sensor, and cross resolution settings.
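注:摘要中"谱无关 tokenizer 为每个波段分配专属 band embedding"的想法可以用下面的最小示意说明:每个通道单独做 patch embedding,再叠加该波段的可学习嵌入,因此任意波段组合都映射到同一 token 空间。结构、维度与波段表均为假设,并非论文官方实现。

```python
import torch
import torch.nn as nn

class SpectrumIndependentTokenizer(nn.Module):
    """谱无关 tokenizer 的最小示意(非论文官方实现,结构为假设)。"""
    def __init__(self, dim: int = 192, patch: int = 16, known_bands: int = 16):
        super().__init__()
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.band_embed = nn.Embedding(known_bands, dim)

    def forward(self, x: torch.Tensor, band_ids: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) 任意波段数;band_ids: (C,) 每个通道对应的波段索引
        tokens = []
        for c in range(x.shape[1]):
            t = self.patch_embed(x[:, c:c + 1])               # (B, D, H/p, W/p)
            t = t.flatten(2).transpose(1, 2)                  # (B, N, D)
            tokens.append(t + self.band_embed(band_ids[c]))   # 叠加波段身份嵌入
        return torch.cat(tokens, dim=1)                       # (B, C*N, D)

if __name__ == "__main__":
    tok = SpectrumIndependentTokenizer()
    x = torch.randn(2, 4, 64, 64)                # 仅含 4 个波段的输入
    ids = torch.tensor([0, 2, 3, 7])             # 假设的波段索引
    print(tok(x, ids).shape)  # torch.Size([2, 64, 192])
```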
zh
[CV-83] DAVE: A VLM Vision Encoder for Document Understanding and Web Agents
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在文档理解和网络代理任务中因视觉编码器缺乏鲁棒的结构与空间信息而导致性能受限的问题。解决方案的关键在于提出一种专为VLM设计的新型视觉编码器DAVE,其训练流程分为两个阶段:首先利用大量未标注图像进行自监督预训练以学习通用视觉表征,随后通过少量高质量标注数据进行监督式自回归预训练,使模型掌握文档解析和定位等具体任务;同时采用两种创新策略提升编码器对多样化任务的适配性:一是引入模型融合机制,整合使用不同文本解码器训练的编码器以增强与多种网络代理架构的兼容性;二是采用集成训练方法,融合通用视觉编码器(如SigLIP2)与专用于文档和网页场景的特征表示,从而显著提升模型在文档理解、视觉问答、网页定位及代理基准测试中的表现。
链接: https://arxiv.org/abs/2512.17221
作者: Brandon Huang,Hang Hua,Zhuoran Yu,Trevor Darrell,Rogerio Feris,Roei Herzig
机构: MIT-IBM Watson AI Lab (MIT-IBM沃森人工智能实验室); UC Berkeley (加州大学伯克利分校); University of Wisconsin–Madison (威斯康星大学麦迪逊分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While Vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial information essential for document understanding and web agents. To bridge this gap, we introduce DAVE, a vision encoder purpose-built for VLMs and tailored for these tasks. Our training pipeline is designed to leverage abundant unlabeled data to bypass the need for costly large-scale annotations for document and web images. We begin with a self-supervised pretraining stage on unlabeled images, followed by a supervised autoregressive pretraining stage, where the model learns tasks like parsing and localization from limited, high-quality data. Within the supervised stage, we adopt two strategies to improve our encoder’s alignment with both general visual knowledge and diverse document and web agentic tasks: (i) We introduce a novel model-merging scheme, combining encoders trained with different text decoders to ensure broad compatibility with different web agentic architectures. (ii) We use ensemble training to fuse features from pretrained generalist encoders (e.g., SigLIP2) with our own document and web-specific representations. Extensive experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness of our approach, establishing DAVE as a strong vision encoder for document and web applications.
zh
[CV-84] CheXPO-v2: Preference Optimization for Chest X-ray VLMs with Knowledge Graph Consistency
【速读】:该论文旨在解决医学视觉语言模型(Medical Vision-Language Models, VLMs)中存在的幻觉问题,即模型在临床推理中生成不准确或不可验证的Chain-of-Thought(思维链)推理过程,从而影响其临床可靠性。传统基于结果导向的强化学习方法(如Group Relative Policy Optimization, GRPO)虽成本低,但因依赖稀疏奖励信号,易诱导模型产生冗长、复杂且难以验证的推理路径,掩盖事实性错误并带来安全风险。解决方案的关键在于提出CheXPO-v2框架,通过从“结果监督”转向“过程监督”,引入基于实体-关系匹配的知识图谱一致性奖励机制(Knowledge Graph Consistency Reward),将推理步骤结构化为“疾病-关系-解剖部位”三元组,实现对逻辑连贯性和事实准确性在原子层面的精细约束;同时结合硬样本挖掘策略,在仅使用5k样本的情况下显著优于GRPO及当前最优模型,在MIMIC-CXR-VQA等基准上达到新的SOTA性能,兼具高数据效率与临床可解释性。
链接: https://arxiv.org/abs/2512.17213
作者: Xiao Liang,Yuxuan An,Di Wang,Jiawei Hu,Zhicheng Jiao,Bin Jing,Quan Wang
机构: Xidian University (西安电子科技大学); Brown University (布朗大学); Capital Medical University (首都医科大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Medical Vision-Language Models (VLMs) are prone to hallucinations, compromising clinical reliability. While reinforcement learning methods like Group Relative Policy Optimization (GRPO) offer a low-cost alignment solution, their reliance on sparse, outcome-based rewards inadvertently encourages models to “overthink” – generating verbose, convoluted, and unverifiable Chain-of-Thought reasoning to justify answers. This focus on outcomes obscures factual errors and poses significant safety risks. To address this, we propose CheXPO-v2, a novel alignment framework that shifts from outcome to process supervision. Our core innovation is a Knowledge Graph Consistency Reward mechanism driven by Entity-Relation Matching. By explicitly parsing reasoning steps into structured “Disease, Relation, Anatomy” triplets, we provide fine-grained supervision that penalizes incoherent logic and hallucinations at the atomic level. Integrating this with a hard-example mining strategy, our approach significantly outperforms GRPO and state-of-the-art models on benchmarks like MIMIC-CXR-VQA. Crucially, CheXPO-v2 achieves new state-of-the-art accuracy using only 5k samples, demonstrating exceptional data efficiency while producing clinically sound and verifiable reasoning. The project source code is publicly available at: this https URL.
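注:摘要中"知识图谱一致性奖励"的基本流程是:把推理步骤解析为(疾病, 关系, 解剖部位)三元组,再与参考知识图谱逐条比对给分。下面是一个纯 Python 的最小示意,匹配规则与计分方式均为假设,并非论文官方实现。

```python
def kg_consistency_reward(pred_triplets, reference_kg):
    """知识图谱一致性奖励的最小示意(非官方实现,规则为假设):
    命中参考图谱的三元组得正分,未命中视为潜在幻觉并扣分,结果归一化到 [-1, 1]。"""
    if not pred_triplets:
        return 0.0
    kg = {tuple(t) for t in reference_kg}
    hits = sum(1 for t in pred_triplets if tuple(t) in kg)
    misses = len(pred_triplets) - hits
    return (hits - misses) / len(pred_triplets)

if __name__ == "__main__":
    kg = [("pneumonia", "located_in", "right lower lobe"),
          ("cardiomegaly", "involves", "cardiac silhouette")]
    pred = [("pneumonia", "located_in", "right lower lobe"),
            ("pneumothorax", "located_in", "left apex")]
    print(kg_consistency_reward(pred, kg))  # 0.0(一条命中、一条疑似幻觉)
```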
zh
[CV-85] Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs
【速读】:该论文旨在解决大型(视觉-)语言模型在推理阶段和强化学习(Reinforcement Learning, RL)训练中因随机采样导致的冗余推理路径问题,即缺乏高阶多样性,从而限制了模型性能与探索效率。解决方案的关键在于提出一种名为“Reasoning Palette”的新型潜在调制框架,通过引入一个用于战略情境化(strategic contextualization)的随机潜在变量,在生成token之前引导模型内部规划。该潜在变量由问题-答案对的均值池化嵌入经变分自编码器(Variational Autoencoder, VAE)推断得到,每个采样潜在变量可编码不同的推理上下文;推理时,该潜在变量被解码为可学习的token前缀并附加至输入提示,从而调节模型内部推理轨迹,实现对响应序列风格与结构的可控调整。此机制使模型在输出生成前完成对推理策略的内生采样,显著提升探索效率与持续学习能力,并通过简短的监督微调(Supervised Fine-Tuning, SFT)即可适配该潜在条件,最终在多个推理基准测试中实现优于标准RL方法的一致性能提升。
链接: https://arxiv.org/abs/2512.17206
作者: Rujiao Long,Yang Li,Xingyao Zhang,Weixun Wang,Tianqianjin Lin,Xi Zhao,Yuchi Xu,Wenbo Su,Junchi Yan,Bo Zheng
机构: Alibaba Group(阿里巴巴集团); Shanghai Jiao Tong University(上海交通大学); Zhejiang University(浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Exploration capacity shapes both inference-time performance and reinforcement learning (RL) training for large (vision-) language models, as stochastic sampling often yields redundant reasoning paths with little high-level diversity. This paper proposes Reasoning Palette, a novel latent-modulation framework that endows the model with a stochastic latent variable for strategic contextualization, guiding its internal planning prior to token generation. This latent context is inferred from the mean-pooled embedding of a question-answer pair via a variational autoencoder (VAE), where each sampled latent potentially encodes a distinct reasoning context. During inference, a sampled latent is decoded into learnable token prefixes and prepended to the input prompt, modulating the model’s internal reasoning trajectory. In this way, the model performs internal sampling over reasoning strategies prior to output generation, which shapes the style and structure of the entire response sequence. A brief supervised fine-tuning (SFT) warm-up phase allows the model to adapt to this latent conditioning. Within RL optimization, Reasoning Palette facilitates structured exploration by enabling on-demand injection for diverse reasoning modes, significantly enhancing exploration efficiency and sustained learning capability. Experiments across multiple reasoning benchmarks demonstrate that our method enables interpretable and controllable control over the (vision-) language model’s strategic behavior, thereby achieving consistent performance gains over standard RL methods.
zh
[CV-86] Fose: Fusion of One-Step Diffusion and End-to-End Network for Pansharpening
【速读】:该论文旨在解决高分辨率多光谱图像(HRMSI)生成中传统扩散模型(DM)计算复杂度高、推理速度慢,以及端到端(E2E)模型性能受限于先验知识不足和结构简单的问题。其解决方案的关键在于提出一种四阶段训练策略,构建轻量级网络Fose:首先通过单步蒸馏(one-step distillation)将增强型SOTA扩散模型的50步推理压缩为仅1步;随后引入轻量级集成模块(lightweight ensemble blocks),将E2E模型与单步扩散模型进行融合,从而在保持高精度的同时显著提升效率。实验表明,Fose在三个基准数据集上均实现显著性能提升,并相较基线DM获得7.42倍的速度加速比。
链接: https://arxiv.org/abs/2512.17202
作者: Kai Liu,Zeli Lin,Weibo Wang,Linghe Kong,Yulun Zhang
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Code link: this https URL
Abstract:Pansharpening is a significant image fusion task that fuses low-resolution multispectral images (LRMSI) and high-resolution panchromatic images (PAN) to obtain high-resolution multispectral images (HRMSI). The development of diffusion models (DM) and end-to-end (E2E) models has greatly advanced the frontier of pansharpening. DMs use multi-step diffusion to obtain an accurate estimation of the residual between LRMSI and HRMSI. However, the multi-step process demands large computational power and is time-consuming. As for E2E models, their performance is still limited by the lack of priors and their simple structure. In this paper, we propose a novel four-stage training strategy to obtain a lightweight network, Fose, which fuses a one-step DM and an E2E model. We perform one-step distillation on an enhanced SOTA DM for pansharpening to compress the inference process from 50 steps to only 1 step. Then we fuse the E2E model with the one-step DM using lightweight ensemble blocks. Comprehensive experiments demonstrate the significant improvement of the proposed Fose on three commonly used benchmarks. Moreover, we achieve a 7.42x speedup ratio compared to the baseline DM while achieving much better performance. The code and model are released at this https URL.
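摘要提到用"轻量级集成模块"融合 E2E 模型与单步扩散模型的输出;下面是一个与原文实现无关的极简 PyTorch 示意,通道数、卷积结构与融合方式(残差修正)均为假设,仅用于说明"两路结果 + 轻量融合块"的思路。

```python
import torch
import torch.nn as nn

class EnsembleFusionSketch(nn.Module):
    """极简示意:用轻量卷积块融合 E2E 网络与单步扩散模型的输出。"""
    def __init__(self, channels=4):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(channels * 2, 32, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, e2e_out, dm_out):
        # 以两路结果的平均值为基底,融合块预测残差修正
        base = 0.5 * (e2e_out + dm_out)
        return base + self.fuse(torch.cat([e2e_out, dm_out], dim=1))

hrmsi_e2e = torch.randn(1, 4, 256, 256)  # E2E 模型输出(假设 4 个波段)
hrmsi_dm = torch.randn(1, 4, 256, 256)   # 单步扩散模型输出
print(EnsembleFusionSketch()(hrmsi_e2e, hrmsi_dm).shape)
```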
zh
[CV-87] Anatomical Region-Guided Contrastive Decoding: A Plug-and-Play Strategy for Mitigating Hallucinations in Medical VLMs
【速读】:该论文旨在解决医学视觉语言模型(Medical Vision-Language Models, MedVLMs)在临床应用中因幻觉(hallucination)导致的可靠性问题,即模型常脱离图像证据而依赖文本先验生成错误答案。现有方法存在局限:基于训练的方法依赖昂贵的专家标注,难以扩展;无训练干预如对比解码(contrastive decoding)虽数据高效,但采用全局、非针对性修正,在复杂临床场景中效果不可靠。本文提出一种可插拔的解决方案——解剖区域引导的对比解码(Anatomical Region-Guided Contrastive Decoding, ARCD),其关键在于利用解剖掩膜(anatomical mask)引导三阶段对比解码过程,在token、注意力和logits三个层级动态重加权,精准聚焦于特定解剖区域,从而增强模型对解剖结构的理解并抑制事实性错误输出。
链接: https://arxiv.org/abs/2512.17189
作者: Xiao Liang,Chenxi Liu,Zhi Ma,Di Wang,Bin Jing,Quan Wang,Yuanyuan Shi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Medical Vision-Language Models (MedVLMs) show immense promise in clinical applicability. However, their reliability is hindered by hallucinations, where models often fail to derive answers from visual evidence, instead relying on learned textual priors. Existing mitigation strategies for MedVLMs have distinct limitations: training-based methods rely on costly expert annotations, limiting scalability, while training-free interventions like contrastive decoding, though data-efficient, apply a global, untargeted correction whose effects in complex real-world clinical settings can be unreliable. To address these challenges, we introduce Anatomical Region-Guided Contrastive Decoding (ARCD), a plug-and-play strategy that mitigates hallucinations by providing targeted, region-specific guidance. Our module leverages an anatomical mask to direct a three-tiered contrastive decoding process. By dynamically re-weighting at the token, attention, and logits levels, it verifiably steers the model’s focus onto specified regions, reinforcing anatomical understanding and suppressing factually incorrect outputs. Extensive experiments across diverse datasets, including chest X-ray, CT, brain MRI, and ocular ultrasound, demonstrate our method’s effectiveness in improving regional understanding, reducing hallucinations, and enhancing overall diagnostic accuracy.
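原文在 token、注意力与 logits 三个层级做动态重加权;下面只示意其中 logits 层级的对比解码算术,公式沿用常见的对比解码形式,α、τ 的取值以及"区域输入 vs 完整图像输入"的对比方向均为假设,并非原文设定。

```python
import torch
import torch.nn.functional as F

def region_guided_contrastive_logits(logits_region, logits_full, alpha=1.0, tau=0.1):
    """极简示意:logits 层级的区域引导对比解码。
    logits_region:仅保留解剖掩膜区域的输入得到的下一词 logits;
    logits_full:完整图像输入得到的 logits。"""
    # 合理性约束:只在完整输入下足够可信的候选词中做对比
    probs_full = F.softmax(logits_full, dim=-1)
    plausible = probs_full >= tau * probs_full.max(dim=-1, keepdim=True).values
    contrasted = (1 + alpha) * logits_region - alpha * logits_full
    return torch.where(plausible, contrasted,
                       torch.full_like(contrasted, float("-inf")))

# 用法示意:词表大小假设为 32000
lr, lf = torch.randn(1, 32000), torch.randn(1, 32000)
next_token = region_guided_contrastive_logits(lr, lf).argmax(dim=-1)
print(int(next_token))
```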
zh
[CV-88] Globally Optimal Solution to the Generalized Relative Pose Estimation Problem using Affine Correspondences
【速读】:该论文旨在解决多摄像头系统在已知垂直方向条件下,利用视觉与惯性信息进行高精度相对位姿估计的问题。其关键解决方案是提出一种基于仿射对应关系的全局最优求解器:首先通过解耦旋转矩阵与平移向量,建立关于相对旋转角的成本函数以最小化仿射对应关系下的几何约束代数误差;随后将全局优化问题转化为两个未知数的多项式方程组,借助特征方程及其一阶导数为零的性质,采用多项式特征值求解器获取相对旋转角,并由特征向量确定平移向量;此外,针对小角度相对旋转情形,还设计了一种新的线性解法,从而在合成数据和真实数据集上均显著优于现有最先进方法。
链接: https://arxiv.org/abs/2512.17188
作者: Zhenbao Yu,Banglei Guan,Shunkun Liang,Zibin Liu,Yang Shang,Qifeng Yu
机构: National University of Defense Technology (国防科技大学); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Mobile devices equipped with a multi-camera system and an inertial measurement unit (IMU) are widely used nowadays, such as self-driving cars. The task of relative pose estimation using visual and inertial information has important applications in various fields. To improve the accuracy of relative pose estimation of multi-camera systems, we propose a globally optimal solver using affine correspondences to estimate the generalized relative pose with a known vertical direction. First, a cost function about the relative rotation angle is established after decoupling the rotation matrix and translation vector, which minimizes the algebraic error of geometric constraints from affine correspondences. Then, the global optimization problem is converted into two polynomials with two unknowns based on the characteristic equation and the condition that its first derivative equals zero. Finally, the relative rotation angle can be solved using the polynomial eigenvalue solver, and the translation vector can be obtained from the eigenvector. Besides, a new linear solution is proposed when the relative rotation is small. The proposed solver is evaluated on synthetic data and real-world datasets. The experimental results demonstrate that our method outperforms comparable state-of-the-art methods in accuracy.
zh
[CV-89] It is not always greener on the other side: Greenery perception across demographics and personalities in multiple cities
【速读】:该论文旨在解决城市绿地感知与客观测量之间差异的问题,即人们对于绿色空间的主观感受与其实际植被覆盖量(如绿视率指数 GVI)不一致的现象。其解决方案的关键在于通过整合来自五个国家1000名受访者的大规模城市视觉感知调查数据(包含详细的人口统计学与人格特征)以及街景图像中的上下文信息,系统分析这种差异的成因。研究发现,尽管个体年龄、人格特质等特征对感知影响较小,但居住地(location)是解释感知差异的最重要因素之一,表明文化、环境和经验等深层因素在塑造城市绿地感知中起关键作用。
链接: https://arxiv.org/abs/2512.17186
作者: Matias Quintana,Fangqi Liu,Jussi Torkko,Youlong Gu,Xiucheng Liang,Yujun Hou,Koichi Ito,Yihan Zhu,Mahmoud Abdelrahman,Tuuli Toivonen,Yi Lu,Filip Biljecki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Quantifying and assessing urban greenery is consequential for planning and development, reflecting the everlasting importance of green spaces for multiple climate and well-being dimensions of cities. Evaluation can be broadly grouped into objective (e.g., measuring the amount of greenery) and subjective (e.g., polling the perception of people) approaches, which may differ: what people see and feel about how green a place is might not match the measurements of the actual amount of vegetation. In this work, we advance the state of the art by measuring such differences and explaining them through human, geographic, and spatial dimensions. The experiments rely on contextual information extracted from street view imagery and a comprehensive urban visual perception survey collected from 1,000 people across five countries with their extensive demographic and personality information. We analyze the discrepancies between objective measures (e.g., Green View Index (GVI)) and subjective scores (e.g., pairwise ratings), examining whether they can be explained by a variety of human and visual factors such as age group and spatial variation of greenery in the scene. The findings reveal that such discrepancies are comparable around the world and that demographics and personality do not play a significant role in perception. Further, while perceived and measured greenery correlate consistently across geographies (both where people and where imagery are from), where people live plays a significant role in explaining perceptual differences, with these two ranking as the top among seven features that influence perceived greenery the most. This location influence suggests that cultural, environmental, and experiential factors substantially shape how individuals observe greenery in cities.
zh
[CV-90] ABE-CLIP: Training-Free Attribute Binding Enhancement for Compositional Image-Text Matching
【速读】:该论文旨在解决对比语言-图像预训练(Contrastive Language-Image Pretraining, CLIP)模型在组合式图文匹配任务中对物体与属性绑定不准确的问题,其核心挑战在于CLIP的全局表示难以捕捉细粒度语义信息,导致属性混淆和绑定错误。解决方案的关键在于提出一种无需额外训练的Attribute Binding Enhancement (ABE-CLIP)方法:首先通过语义精炼机制(Semantic Refinement Mechanism)优化文本中对象和属性短语的token嵌入,减少属性歧义;其次引入局部token-图像patch对齐策略(Local Token-Patch Alignment),计算细化后文本token与其最相关图像区域之间的相似度,并聚合局部相似度得到最终图文匹配分数,从而显著提升属性-物体绑定的准确性与泛化能力。
链接: https://arxiv.org/abs/2512.17178
作者: Qi Zhang,Yuxu Chen,Lei Deng,Lili Shen
机构: Sichuan University (四川大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: 10 pages, 8 figures
Abstract:Contrastive Language-Image Pretraining (CLIP) has achieved remarkable performance in various multimodal tasks. However, it still struggles with compositional image-text matching, particularly in accurately associating objects with their corresponding attributes, because its inherent global representation often overlooks fine-grained semantics for attribute binding. Existing methods often require additional training or extensive hard negative sampling, yet they frequently show limited generalization to novel compositional concepts and fail to fundamentally address the drawbacks of global representations. In this paper, we propose ABE-CLIP, a novel training-free Attribute Binding Enhancement method designed to strengthen attribute-object binding in CLIP-like models. Specifically, we employ a Semantic Refinement Mechanism to refine token embeddings for both object and attribute phrases in the text, thereby mitigating attribute confusion and improving semantic precision. We further introduce a Local Token-Patch Alignment strategy that computes similarity scores between refined textual tokens and their most relevant image patches. By aggregating localized similarity scores, ABE-CLIP computes the final image-text similarity. Experiments on multiple datasets demonstrate that ABE-CLIP significantly improves attribute-object binding performance, even surpassing methods that require extensive training.
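摘要描述的"局部 token-patch 对齐"可以用一个很短的打分函数示意:每个细化后的文本 token 与最相关的图像 patch 取相似度,再聚合为整体图文匹配分数。下面是与原文实现无关的极简示意,token 数、patch 数与特征维度均为假设。

```python
import torch
import torch.nn.functional as F

def local_token_patch_score(text_tokens, image_patches):
    """极简示意:局部 token-patch 对齐打分。
    text_tokens: [T, D] 细化后的文本 token 嵌入(如属性/物体短语);
    image_patches: [P, D] 图像 patch 嵌入。"""
    t = F.normalize(text_tokens, dim=-1)
    p = F.normalize(image_patches, dim=-1)
    sim = t @ p.t()                    # [T, P] 余弦相似度
    per_token = sim.max(dim=1).values  # 每个文本 token 取其最相关的 patch
    return per_token.mean()            # 聚合为最终图文匹配分数

score = local_token_patch_score(torch.randn(6, 512), torch.randn(196, 512))
print(float(score))
```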
zh
[CV-91] Can Synthetic Images Serve as Effective and Efficient Class Prototypes?
【速读】:该论文旨在解决当前视觉-语言模型(Vision-Language Models, VLMs)在零样本图像分类任务中对人工标注图文对的依赖问题,以及由此带来的数据准备成本高、准确性要求严苛和模型结构冗余(如双塔编码器)导致的效率低下等问题。解决方案的关键在于提出一种基于大语言模型(Large Language Model, LLM)生成类特定提示词并驱动扩散模型合成参考图像的新框架——LGCLIP。该方法仅需类别标签作为输入,通过LLM生成提示引导扩散模型生成视觉原型图像,再利用单一视觉编码器提取真实图像特征并与原型特征进行对比预测,从而实现轻量化、无需人工标注图文对的零样本分类范式。
链接: https://arxiv.org/abs/2512.17160
作者: Dianxing Shi,Dingjie Fu,Yuqiao Liu,Jun Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-Language Models (VLMs) have shown strong performance in zero-shot image classification tasks. However, existing methods, including Contrastive Language-Image Pre-training (CLIP), all rely on annotated text-to-image pairs for aligning visual and textual modalities. This dependency introduces substantial cost and accuracy requirements in preparing high-quality datasets. At the same time, processing data from two modalities also requires dual-tower encoders for most models, which hinders their lightweight deployment. To address these limitations, we introduce a "Contrastive Language-Image Pre-training via Large-Language-Model-based Generation (LGCLIP)" framework. LGCLIP leverages a Large Language Model (LLM) to generate class-specific prompts that guide a diffusion model in synthesizing reference images. Afterwards, these generated images serve as visual prototypes, and the visual features of real images are extracted and compared with the visual features of these prototypes to achieve comparative prediction. By optimizing prompt generation through the LLM and employing only a visual encoder, LGCLIP remains lightweight and efficient. Crucially, our framework requires only class labels as input during the whole experimental procedure, eliminating the need for manually annotated image-text pairs and extra pre-processing. Experimental results validate the feasibility and efficiency of LGCLIP, demonstrating great performance in zero-shot classification tasks and establishing a novel paradigm for classification.
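按摘要的描述,推理阶段只需把真实图像特征与各类"合成原型图像"的特征做比较即可完成零样本分类。下面给出一个极简示意:假设每类有 K 张合成原型图,特征已由某个视觉编码器提取(维度、类别数均为假设),分类即取余弦相似度最高的类。

```python
import torch
import torch.nn.functional as F

def classify_with_prototypes(real_feat, prototype_feats):
    """极简示意:以合成参考图像的特征作为类原型,对真实图像做最近原型分类。
    real_feat: [D];prototype_feats: [C, K, D],每类 K 张合成原型图的特征。"""
    proto = F.normalize(prototype_feats.mean(dim=1), dim=-1)  # 每类原型取均值
    feat = F.normalize(real_feat, dim=-1)
    logits = proto @ feat                                     # 与各类原型的余弦相似度
    return logits.argmax().item()

# 用法示意:10 个类别、每类 4 张合成原型、特征维度 512(均为假设)
pred = classify_with_prototypes(torch.randn(512), torch.randn(10, 4, 512))
print(pred)
```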
zh
[CV-92] PhysFire-WM: A Physics-Informed World Model for Emulating Fire Spread Dynamics
【速读】:该论文旨在解决现有火灾预测方法在细粒度建模中的物理不一致性与信息稀疏性问题,尤其是基于二值掩码(binary mask)的方法难以捕捉火灾复杂的动态演化过程。其关键解决方案是提出物理信息世界模型(PhysFire-WM),通过将燃烧动力学结构先验编码进模型以修正物理偏差,并引入跨任务协同训练策略(CC-Train),实现热辐射动态与空间边界分割的参数共享与梯度协调,从而在保持物理真实性的同时提升几何精度和预测准确性。
链接: https://arxiv.org/abs/2512.17152
作者: Nan Zhou,Huandong Wang,Jiahao Li,Yang Li,Xiao-Ping Zhang,Yong Li,Xinlei Chen
机构: Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Department of Electronic Engineering, Tsinghua University (清华大学电子工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Fine-grained fire prediction plays a crucial role in emergency response. Infrared images and fire masks provide complementary thermal and boundary information, yet current methods are predominantly limited to binary mask modeling with inherent signal sparsity, failing to capture the complex dynamics of fire. While world models show promise in video generation, their physical inconsistencies pose significant challenges for fire forecasting. This paper introduces PhysFire-WM, a Physics-informed World Model for emulating Fire spread dynamics. Our approach internalizes combustion dynamics by encoding structured priors from a Physical Simulator to rectify physical discrepancies, coupled with a Cross-task Collaborative Training strategy (CC-Train) that alleviates the issue of limited information in mask-based modeling. Through parameter sharing and gradient coordination, CC-Train effectively integrates thermal radiation dynamics and spatial boundary delineation, enhancing both physical realism and geometric accuracy. Extensive experiments on a fine-grained multimodal fire dataset demonstrate the superior accuracy of PhysFire-WM in fire spread prediction. Validation underscores the importance of physical priors and cross-task collaboration, providing new insights for applying physics-informed world models to disaster prediction.
zh
[CV-93] Text-Conditioned Background Generation for Editable Multi-Layer Documents
【速读】:该论文旨在解决文档背景生成中保持文本可读性、多页视觉一致性以及支持结构化编辑的问题。其核心挑战在于如何在不破坏文本清晰度的前提下进行背景生成,并确保跨页面的主题连贯性,同时允许用户对文档的风格进行灵活调整。解决方案的关键在于提出了一种基于潜在掩码(latent masking)的软更新机制,通过类比物理和数值优化中的平滑屏障函数,对扩散空间中的更新进行局部抑制,从而保护文本区域;引入自动可读性优化(Automated Readability Optimization, ARO),自动生成半透明圆角遮罩以满足WCAG 2.2感知对比度标准,保障可读性与美学协调;并通过摘要-指令递归引导机制实现多页内容的一致性建模,模拟人类对上下文记忆的利用方式,使视觉元素在整篇文档中协同演化。此外,将文档视为分层结构(text, figures, background as separate layers),支持目标背景编辑而不影响文本可读性,最终结合用户提示实现风格定制,形成无需训练的高效生成框架。
链接: https://arxiv.org/abs/2512.17151
作者: Taewon Kang,Joseph K J,Chris Tensmeyer,Jihyung Kil,Wanrong Zhu,Ming C. Lin,Vlad I. Morariu
机构: University of Maryland at College Park (马里兰大学学院市分校); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present a framework for document-centric background generation with multi-page editing and thematic continuity. To ensure text regions remain readable, we employ a latent masking formulation that softly attenuates updates in the diffusion space, inspired by smooth barrier functions in physics and numerical optimization. In addition, we introduce Automated Readability Optimization (ARO), which automatically places semi-transparent, rounded backing shapes behind text regions. ARO determines the minimal opacity needed to satisfy perceptual contrast standards (WCAG 2.2) relative to the underlying background, ensuring readability while maintaining aesthetic harmony without human intervention. Multi-page consistency is maintained through a summarization-and-instruction process, where each page is distilled into a compact representation that recursively guides subsequent generations. This design reflects how humans build continuity by retaining prior context, ensuring that visual motifs evolve coherently across an entire document. Our method further treats a document as a structured composition in which text, figures, and backgrounds are preserved or regenerated as separate layers, allowing targeted background editing without compromising readability. Finally, user-provided prompts allow stylistic adjustments in color and texture, balancing automated consistency with flexible customization. Our training-free framework produces visually coherent, text-preserving, and thematically aligned documents, bridging generative modeling with natural design workflows.
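摘要中的 ARO 要为文字遮罩求"满足 WCAG 感知对比度阈值的最小不透明度";下面用标准的 WCAG 相对亮度与对比度公式给出一个极简示意(阈值 4.5、颜色取值与步长均为假设,仅演示搜索逻辑,并非原文实现)。

```python
def _lin(c):
    # sRGB 通道线性化(c 取 0~1)
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def luminance(rgb):
    r, g, b = (_lin(c / 255.0) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast(rgb1, rgb2):
    # WCAG 对比度:(L_亮 + 0.05) / (L_暗 + 0.05)
    l1, l2 = sorted((luminance(rgb1), luminance(rgb2)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

def minimal_opacity(text_rgb, backing_rgb, bg_rgb, target=4.5, step=0.01):
    """极简示意:逐步增大遮罩不透明度,直到文字与合成后背景的对比度达标。"""
    alpha = 0.0
    while alpha <= 1.0:
        mixed = tuple(alpha * s + (1 - alpha) * b for s, b in zip(backing_rgb, bg_rgb))
        if contrast(text_rgb, mixed) >= target:
            return alpha
        alpha += step
    return 1.0

# 黑色文字、白色遮罩、深绿色背景(数值均为示意)
print(minimal_opacity((0, 0, 0), (255, 255, 255), (30, 90, 40)))
```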
zh
[CV-94] Pro-Pose: Unpaired Full-Body Portrait Synthesis via Canonical UV Maps
【速读】:该论文旨在解决如何将普通人自拍的图像转化为具有专业摄影效果的图像问题,即在指定姿态、简洁背景、良好光照条件下生成高质量人像,并保持个体独特的面部与身体特征。其核心挑战在于缺乏同一人物“野生”拍摄与专业摄影的成对数据集。解决方案的关键在于两个创新:一是将输入图像及人脸映射到规范化的UV空间,结合重定位方法建模遮挡和新视角合成,从而利用现有未配对数据集;二是通过多图微调实现个性化输出,最终在真实图像上实现了高质量重定姿人像生成,在定性和定量评估中均表现优异。
链接: https://arxiv.org/abs/2512.17143
作者: Sandeep Mishra,Yasamin Jafarian,Andreas Lugmayr,Yingwei Li,Varsha Ramakrishnan,Srivatsan Varadharajan,Alan C. Bovik,Ira Kemelmacher-Shlizerman
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校); Google(谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Photographs of people taken by professional photographers typically present the person in beautiful lighting, with an interesting pose, and flattering quality. This is unlike common photos people can take of themselves. In this paper, we explore how to create a "professional" version of a person's photograph, i.e., in a chosen pose, in a simple environment, with good lighting, and standard black top/bottom clothing. A key challenge is to preserve the person's unique identity, face and body features while transforming the photo. If there existed a large paired dataset of the same person photographed both "in the wild" and by a professional photographer, the problem would potentially be easier to solve. However, such data does not exist, especially for a large variety of identities. To that end, we propose two key insights: 1) Our method transforms the input photo and person's face to a canonical UV space, which is further coupled with reposing methodology to model occlusions and novel view synthesis. Operating in UV space allows us to leverage existing unpaired datasets. 2) We personalize the output photo via multi-image finetuning. Our approach yields high-quality, reposed portraits and achieves strong qualitative and quantitative performance on real-world imagery.
zh
[CV-95] SDUM: A Scalable Deep Unrolled Model for Universal MRI Reconstruction
【速读】:该论文旨在解决当前深度学习MRI重建方法普遍存在的协议特异性问题,即模型通常仅适用于特定成像参数(如解剖目标、对比度、采样模式和加速因子),难以跨多种临床场景通用部署。其解决方案的关键在于提出一种可扩展的深度展开模型(Scalable Deep Unrolled Model, SDUM),该框架融合了基于Restormer的重建器、学习型线圈敏感度估计器(coil sensitivity map estimator, CSME)、采样感知加权数据一致性模块(sampling-aware weighted data consistency, SWDC)、对级联索引和协议元数据的通用条件机制(universal conditioning, UC),以及渐进式级联扩展训练策略。SDUM展现出类似基础模型的可扩展性:重建质量与参数量呈对数关系(PSNR ∼ log(parameters),相关系数 r=0.986),且在无需任务微调的情况下,在CMRxRecon2025所有挑战赛道上均达到最先进性能,验证了其通用性和实用性。
链接: https://arxiv.org/abs/2512.17137
作者: Puyang Wang,Pengfei Guo,Keyi Chai,Jinyuan Zhou,Daguang Xu,Shanshan Jiang
机构: Johns Hopkins University (约翰霍普金斯大学); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Clinical MRI encompasses diverse imaging protocols–spanning anatomical targets (cardiac, brain, knee), contrasts (T1, T2, mapping), sampling patterns (Cartesian, radial, spiral, kt-space), and acceleration factors–yet current deep learning reconstructions are typically protocol-specific, hindering generalization and deployment. We introduce Scalable Deep Unrolled Model (SDUM), a universal framework combining a Restormer-based reconstructor, a learned coil sensitivity map estimator (CSME), sampling-aware weighted data consistency (SWDC), universal conditioning (UC) on cascade index and protocol metadata, and progressive cascade expansion training. SDUM exhibits foundation-model-like scaling behavior: reconstruction quality follows PSNR ~ log(parameters) with correlation r=0.986 (R^2=0.973) up to 18 cascades, demonstrating predictable performance gains with model depth. A single SDUM trained on heterogeneous data achieves state-of-the-art results across all four CMRxRecon2025 challenge tracks–multi-center, multi-disease, 5T, and pediatric–without task-specific fine-tuning, surpassing specialized baselines by up to +1.0 dB. On CMRxRecon2024, SDUM outperforms the winning method PromptMR+ by +0.55 dB; on fastMRI brain, it exceeds PC-RNN by +1.8 dB. Ablations validate each component: SWDC +0.43 dB over standard DC, per-cascade CSME +0.51 dB, UC +0.38 dB. These results establish SDUM as a practical path toward universal, scalable MRI reconstruction.
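摘要给出的缩放规律是"PSNR 随参数量近似按对数增长"。下面用 numpy 做一个极简示意,演示如何用对 log(参数量) 的线性拟合与相关系数来检验这类规律;其中参数量与 PSNR 的数值均为虚构示例,并非论文数据。

```python
import numpy as np

# 虚构示例数据:模型参数量与对应重建 PSNR(并非论文数值)
params = np.array([30e6, 60e6, 120e6, 240e6, 480e6])
psnr = np.array([38.1, 38.8, 39.4, 40.1, 40.7])

x = np.log(params)
slope, intercept = np.polyfit(x, psnr, deg=1)   # 拟合 PSNR ≈ a·log(参数量) + b
r = np.corrcoef(x, psnr)[0, 1]                  # 相关系数
print(f"slope={slope:.3f}, r={r:.3f}, R^2={r**2:.3f}")
```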
zh
[CV-96] Predictive Modeling of Maritime Radar Data Using Transformer Architecture
【速读】:该论文旨在解决当前 maritime radar frame 预测中缺乏基于 transformer 架构的预测模型的问题,从而填补现有研究在雷达感知下的时空序列预测领域的空白。其关键在于系统性地回顾与分析适用于海上雷达数据的预测建模方法,特别是针对 spatiotemporal sequence forecasting 的 transformer 架构,并指出尽管已有研究成功将 transformer 应用于 AIS 轨迹预测和声呐帧预测,但尚未有工作探索基于 transformer 的海上雷达帧预测,这一发现明确了未来研究方向并强调了利用 transformer 在雷达场景下进行高鲁棒性预测的潜力。
链接: https://arxiv.org/abs/2512.17098
作者: Bjorna Qesaraku,Jan Steckel
机构: Cosys-lab, Faculty of applied Engineering, University of Antwerp (安特卫普大学应用工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 2 figures, 1 table
Abstract:Maritime autonomous systems require robust predictive capabilities to anticipate vessel motion and environmental dynamics. While transformer architectures have revolutionized AIS-based trajectory prediction and demonstrated feasibility for sonar frame forecasting, their application to maritime radar frame prediction remains unexplored, creating a critical gap given radar’s all-weather reliability for navigation. This survey systematically reviews predictive modeling approaches relevant to maritime radar, with emphasis on transformer architectures for spatiotemporal sequence forecasting, where existing representative methods are analyzed according to data type, architecture, and prediction horizon. Our review shows that, while the literature has demonstrated transformer-based frame prediction for sonar sensing, no prior work addresses transformer-based maritime radar frame prediction, thereby defining a clear research gap and motivating a concrete research direction for future work in this area.
zh
[CV-97] DGH: Dynamic Gaussian Hair NEURIPS2025
【速读】:该论文旨在解决数字人建模中动态毛发的逼真生成难题,主要挑战包括毛发复杂运动、遮挡以及光散射效应。传统方法依赖静态捕获与物理模拟模型,存在参数调优繁琐、计算开销大且难以适应多样发型和动作的问题。其解决方案的关键在于提出一种名为动态高斯毛发(Dynamic Gaussian Hair, DGH)的新框架:首先采用粗到精的建模策略学习跨多种发型的时序一致毛发动态;其次设计束引导优化模块,通过可微渲染支持的动态3D高斯表示学习视点一致的毛发外观,实现梯度驱动的端到端优化。该方法完全数据驱动,可随训练数据扩展,并能无缝集成至3D高斯Avatar框架中,显著提升毛发的真实感与可动画性。
链接: https://arxiv.org/abs/2512.17094
作者: Junying Wang,Yuanlu Xu,Edith Tretschk,Ziyan Wang,Anastasia Ianina,Aljaz Bozic,Ulrich Neumann,Tony Tung
机构: University of Southern California (南加州大学); Meta Reality Labs Research
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2025. Project page: this https URL
Abstract:The creation of photorealistic dynamic hair remains a major challenge in digital human modeling because of the complex motions, occlusions, and light scattering. Existing methods often resort to static capture and physics-based models that do not scale as they require manual parameter fine-tuning to handle the diversity of hairstyles and motions, and heavy computation to obtain high-quality appearance. In this paper, we present Dynamic Gaussian Hair (DGH), a novel framework that efficiently learns hair dynamics and appearance. We propose: (1) a coarse-to-fine model that learns temporally coherent hair motion dynamics across diverse hairstyles; (2) a strand-guided optimization module that learns a dynamic 3D Gaussian representation for hair appearance with support for differentiable rendering, enabling gradient-based learning of view-consistent appearance under motion. Unlike prior simulation-based pipelines, our approach is fully data-driven, scales with training data, and generalizes across various hairstyles and head motion sequences. Additionally, DGH can be seamlessly integrated into a 3D Gaussian avatar framework, enabling realistic, animatable hair for high-fidelity avatar representation. DGH achieves promising geometry and appearance results, providing a scalable, data-driven alternative to physics-based simulation and rendering.
zh
[CV-98] Interpretable Similarity of Synthetic Image Utility
【速读】:该论文旨在解决如何定量评估生成式医学图像数据集与真实图像数据集在特定临床应用领域中的相似性这一关键问题,这是推动深度学习(Deep Learning, DL)驱动的临床决策支持(Clinical Decision Support, CDS)系统发展的核心挑战之一。传统方法主要依赖用户评价、Inception-based指标或基于合成图像的分类性能,但缺乏对图像特征与临床实用性的可解释关联。本文提出一种名为“可解释效用相似性”(Interpretable Utility Similarity, IUS)的新指标,其关键创新在于:基于广义神经加性模型(Generalized Neural Additive Models),从临床相关图像特征出发,量化合成数据对DL-CDS系统开发的实用性差异,从而提供可解释的评估依据。实验表明,IUS能有效筛选高效用相似性的合成图像,在多种彩色医学成像模态(如内窥镜、皮肤镜和眼底成像)中使分类性能提升达54.6%,并验证了其在灰度X光和超声图像上的普适性。
链接: https://arxiv.org/abs/2512.17080
作者: Panagiota Gatoula,George Dimas,Dimitris K. Iakovidis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted for journal publication
Abstract:Synthetic medical image data can unlock the potential of deep learning (DL)-based clinical decision support (CDS) systems through the creation of large scale, privacy-preserving, training sets. Despite the significant progress in this field, there is still a largely unanswered research question: “How can we quantitatively assess the similarity of a synthetically generated set of images with a set of real images in a given application domain?”. Today, answers to this question are mainly provided via user evaluation studies, inception-based measures, and the classification performance achieved on synthetic images. This paper proposes a novel measure to assess the similarity between synthetically generated and real sets of images, in terms of their utility for the development of DL-based CDS systems. Inspired by generalized neural additive models, and unlike inception-based measures, the proposed measure is interpretable (Interpretable Utility Similarity, IUS), explaining why a synthetic dataset could be more useful than another one in the context of a CDS system based on clinically relevant image features. The experimental results on publicly available datasets from various color medical imaging modalities including endoscopic, dermoscopic and fundus imaging, indicate that selecting synthetic images of high utility similarity using IUS can result in relative improvements of up to 54.6% in terms of classification performance. The generality of IUS for synthetic data assessment is demonstrated also for greyscale X-ray and ultrasound imaging modalities. IUS implementation is available at this https URL
zh
[CV-99] Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation
【速读】:该论文旨在解决相机控制的视频生成任务中,如何在保证目标相机位姿准确性的前提下,实现高质量且视角一致的动态场景视频生成问题。现有方法主要面临两大挑战:一是基于重投影的方法对深度估计误差敏感,导致生成视频失真;二是训练数据中相机轨迹多样性不足,限制了模型泛化能力。解决方案的关键在于提出InfCam框架,其核心创新为:(1)引入无深度依赖的无限单应性变形(infinite homography warping),将3D相机旋转直接编码至视频扩散模型的2D潜在空间中,通过端到端训练预测残差视差项以提升相机位姿保真度;(2)设计数据增强流水线,将现有合成多视角数据集转化为包含多样化轨迹和焦距的序列,从而显著提升模型鲁棒性和泛化性能。
链接: https://arxiv.org/abs/2512.17040
作者: Min-Jung Kim,Jeongho Kim,Hoiyeong Jin,Junha Hyung,Jaegul Choo
机构: KAIST AI (韩国科学技术院人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent progress in video diffusion models has spurred growing interest in camera-controlled novel-view video generation for dynamic scenes, aiming to provide creators with cinematic camera control capabilities in post-production. A key challenge in camera-controlled video generation is ensuring fidelity to the specified camera pose, while maintaining view consistency and reasoning about occluded geometry from limited observations. To address this, existing methods either train trajectory-conditioned video generation model on trajectory-video pair dataset, or estimate depth from the input video to reproject it along a target trajectory and generate the unprojected regions. Nevertheless, existing methods struggle to generate camera-pose-faithful, high-quality videos for two main reasons: (1) reprojection-based approaches are highly susceptible to errors caused by inaccurate depth estimation; and (2) the limited diversity of camera trajectories in existing datasets restricts learned models. To address these limitations, we present InfCam, a depth-free, camera-controlled video-to-video generation framework with high pose fidelity. The framework integrates two key components: (1) infinite homography warping, which encodes 3D camera rotations directly within the 2D latent space of a video diffusion model. Conditioning on this noise-free rotational information, the residual parallax term is predicted through end-to-end training to achieve high camera-pose fidelity; and (2) a data augmentation pipeline that transforms existing synthetic multiview datasets into sequences with diverse trajectories and focal lengths. Experimental results demonstrate that InfCam outperforms baseline methods in camera-pose accuracy and visual fidelity, generalizing well from synthetic to real-world data. Link to our project page:this https URL
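摘要中"无限单应性(infinite homography)"指只由相机内参与旋转决定、与深度无关的像平面映射 H = K_tgt · R · K_src^{-1}。下面给出一个数值示意,内参与旋转角均为假设值,仅说明该公式如何把源像素坐标映射到目标视角。

```python
import numpy as np

def infinite_homography(K_src, K_tgt, R):
    """无限单应性:H = K_tgt @ R @ K_src^{-1},只编码相机旋转,与深度无关。"""
    return K_tgt @ R @ np.linalg.inv(K_src)

def yaw_rotation(theta):
    # 绕竖直轴的旋转矩阵
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

# 假设的相机内参(焦距 500、主点 (320, 240))
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
H = infinite_homography(K, K, yaw_rotation(np.deg2rad(5)))

# 将源图像中的一个像素(齐次坐标)映射到目标视角
p = np.array([400.0, 260.0, 1.0])
q = H @ p
print(q[:2] / q[2])
```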
zh
[CV-100] FORMSpoT: A Decade of Tree-Level Country-Scale Forest Monitoring
【速读】:该论文旨在解决欧洲森林碳汇功能下降背景下,现有卫星遥感监测工具在空间分辨率和变化检测精度上的不足,尤其是难以识别单株树木尺度(通常低于100 m²)的林分扰动问题。其解决方案的关键在于构建了FORMSpoT-Δ(Forest Mapping with SPOT Time series - Disturbance),利用十年(2014–2024)高时空分辨率的SPOT-6/7多光谱影像序列,结合基于机载激光扫描(ALS)数据训练的PVTv2层次化Transformer模型提取1.5米分辨率的树冠高度图,并开发了一套融合共配准与时空总变差去噪的后处理流程,显著提升了复杂地表条件下扰动检测的鲁棒性与精度,尤其在山地森林中小尺度、碎片化扰动场景中F1-score达0.44,较现有产品提升一个数量级。
链接: https://arxiv.org/abs/2512.17021
作者: Martin Schwartz,Fajwel Fogel,Nikola Besic,Damien Robert,Louis Geist,Jean-Pierre Renaud,Jean-Matthieu Monnet,Clemens Mosig,Cédric Vega,Alexandre d’Aspremont,Loic Landrieu,Philippe Ciais
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The recent decline of the European forest carbon sink highlights the need for spatially explicit and frequently updated forest monitoring tools. Yet, existing satellite-based disturbance products remain too coarse to detect changes at the scale of individual trees, typically below 100 m². Here, we introduce FORMSpoT (Forest Mapping with SPOT Time series), a decade-long (2014-2024) nationwide mapping of forest canopy height at 1.5 m resolution, together with annual disturbance polygons (FORMSpoT-Δ) covering mainland France. Canopy heights were derived from annual SPOT-6/7 composites using a hierarchical transformer model (PVTv2) trained on high-resolution airborne laser scanning (ALS) data. To enable robust change detection across heterogeneous acquisitions, we developed a dedicated post-processing pipeline combining co-registration and spatio-temporal total variation denoising. Validation against ALS revisits across 19 sites and 5,087 National Forest Inventory plots shows that FORMSpoT-Δ substantially outperforms existing disturbance products. In mountainous forests, where disturbances are small and spatially fragmented, FORMSpoT-Δ achieves an F1-score of 0.44, an order of magnitude higher than existing benchmarks. By enabling tree-level monitoring of forest dynamics at national scale, FORMSpoT-Δ provides a unique tool to analyze management practices, detect early signals of forest decline, and better quantify carbon losses from subtle disturbances such as thinning or selective logging. These results underscore the critical importance of sustaining very high-resolution satellite missions like SPOT and open-data initiatives such as DINAMIS for monitoring forests under climate change.
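论文的后处理流程是共配准加时空总变差去噪,再提取扰动多边形;下面只给出一个粗略的替代性示意(滑动中值平滑 + "冠层高度骤降"阈值),用来说明从逐年冠层高度图得到扰动掩膜的基本思路。阈值、最小冠层高度等数值均为假设,并非原文方法。

```python
import numpy as np

def detect_disturbance(height_series, drop_thresh=5.0, min_height=10.0):
    """极简示意:对逐年冠层高度序列做 3 年滑动中值平滑后,
    以相邻年份的高度骤降作为扰动(采伐/倒伏)判据。
    height_series: [T, H, W] 的逐年冠层高度(米)。"""
    t = height_series.shape[0]
    smoothed = np.copy(height_series)
    for i in range(1, t - 1):                    # 简单的时间中值平滑,抑制配准误差与噪声
        smoothed[i] = np.median(height_series[i - 1:i + 2], axis=0)
    drop = smoothed[:-1] - smoothed[1:]          # 相邻年份的高度下降量
    was_forest = smoothed[:-1] >= min_height     # 仅在原本有较高冠层处判定
    return (drop >= drop_thresh) & was_forest    # [T-1, H, W] 逐年扰动掩膜

series = np.random.uniform(0, 30, size=(10, 64, 64)).astype(np.float32)
masks = detect_disturbance(series)
print(masks.shape, int(masks.sum()))
```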
zh
[CV-101] 4D-RGPT : Toward Region-level 4D Understanding via Perceptual Distillation
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal LLMs, MLLMs)在处理三维结构和时序动态信息时能力不足的问题,特别是受限于弱化的四维(4D)感知能力和时序理解能力。现有3D与4D视频问答(VQA)基准测试也主要聚焦静态场景,缺乏区域级提示(region-level prompting)。其解决方案的关键在于:(1) 提出4D-RGPT,一种专为从视频输入中提取4D表示并增强时序感知能力的MLLM;(2) 设计感知4D蒸馏(Perceptual 4D Distillation, P4D)训练框架,将冻结专家模型中的4D表征迁移至4D-RGPT,从而实现全面的4D感知;(3) 构建R4D-Bench基准,一个面向深度感知动态场景、支持区域级提示的评测平台,通过自动化与人工验证相结合的混合流水线构建。实验表明,4D-RGPT在多个现有及新提出的4D VQA基准上均取得显著性能提升。
链接: https://arxiv.org/abs/2512.17012
作者: Chiao-An Yang,Ryo Hachiuma,Sifei Liu,Subhashree Radhakrishnan,Raymond A. Yeh,Yu-Chiang Frank Wang,Min-Hung Chen
机构: NVIDIA(英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.
zh
[CV-102] A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos
【速读】:该论文旨在解决长视频多模态理解(long-form multimodal video understanding)中现有基准测试难以同时兼顾时序长度与多模态丰富性的问题,尤其针对当前评估方法过度依赖单一准确率指标、缺乏对模型失败模式的可解释性分析。其解决方案的关键在于提出LongShOTBench——一个诊断性基准,包含意图驱动的开放式问题、单轮与多轮对话任务,以及需要跨视频、音频和语音进行多模态推理及代理工具调用的任务;每个样本均配有参考答案和分级评分标准,支持可追溯、可解释的评估。此外,作者还开发了LongShOTAgent,一种通过预处理、搜索与迭代优化实现长视频分析的代理系统,验证了该基准在真实场景下对多模态大语言模型(MLLMs)能力的挑战性。
链接: https://arxiv.org/abs/2512.16978
作者: Mohammed Irfan Kurpath,Jaseel Muhammad Kaithakkodan,Jinxing Zhou,Sahal Shaji Mullappilly,Mohammad Almansoori,Noor Ahsan,Beknur Kalmakhanbet,Sambal Shikhar,Rishabh Lalla,Jean Lahoud,Mariette Awad,Fahad Shahbaz Khan,Salman Khan,Rao Muhammad Anwer,Hisham Cholakkal
机构: Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI); American University of Beirut; Linköping University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Long-form multimodal video understanding requires integrating vision, speech, and ambient audio with coherent long-range reasoning. Existing benchmarks emphasize either temporal length or multimodal richness, but rarely both and while some incorporate open-ended questions and advanced metrics, they mostly rely on single-score accuracy, obscuring failure modes. We introduce LongShOTBench, a diagnostic benchmark with open-ended, intent-driven questions; single- and multi-turn dialogues; and tasks requiring multimodal reasoning and agentic tool use across video, audio, and speech. Each item includes a reference answer and graded rubric for interpretable, and traceable evaluation. LongShOTBench is produced via a scalable, human-validated pipeline to ensure coverage and reproducibility. All samples in our LongShOTBench are human-verified and corrected. Furthermore, we present LongShOTAgent, an agentic system that analyzes long videos via preprocessing, search, and iterative refinement. On LongShOTBench, state-of-the-art MLLMs show large gaps: Gemini-2.5-Flash achieves 52.95%, open-source models remain below 30%, and LongShOTAgent attains 44.66%. These results underscore the difficulty of real-world long-form video understanding. LongShOTBench provides a practical, reproducible foundation for evaluating and improving MLLMs. All resources are available on GitHub: this https URL.
zh
[CV-103] Endo-SemiS: Towards Robust Semi-Supervised Image Segmentation for Endoscopic Video
【速读】:该论文旨在解决内窥镜视频帧分割任务中因标注数据有限而导致的性能瓶颈问题,尤其在医疗场景下获取高质量标注成本高昂。解决方案的关键在于提出了一种半监督分割框架Endo-SemiS,其核心创新包括:(1) 双网络交叉监督机制,通过两个独立网络相互提供监督信号;(2) 基于不确定性的伪标签生成策略,从无标签数据中选取高置信度区域以提升伪标签质量;(3) 联合伪标签监督,聚合两网络输出的可靠像素作为无标签数据的准确监督信号;(4) 相互学习机制,在特征和图像层面促进两网络协同优化,降低方差并引导收敛至一致解;此外还引入一个利用视频时序信息的修正网络进一步增强分割精度。该方法显著提升了在少量标注数据下的分割性能,在肾结石激光碎石和结肠息肉筛查两个临床任务中均优于现有最先进方法。
链接: https://arxiv.org/abs/2512.16977
作者: Hao Li,Daiwei Lu,Xing Yao,Nicholas Kavoussi,Ipek Oguz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In this paper, we present Endo-SemiS, a semi-supervised segmentation framework for providing reliable segmentation of endoscopic video frames with limited annotation. Endo-SemiS uses four strategies to improve performance by effectively utilizing all available data, particularly unlabeled data: (1) Cross-supervision between two individual networks that supervise each other; (2) Uncertainty-guided pseudo-labels from unlabeled data, which are generated by selecting high-confidence regions to improve their quality; (3) Joint pseudo-label supervision, which aggregates reliable pixels from the pseudo-labels of both networks to provide accurate supervision for unlabeled data; and (4) Mutual learning, where both networks learn from each other at the feature and image levels, reducing variance and guiding them toward a consistent solution. Additionally, a separate corrective network utilizes spatiotemporal information from endoscopy video to further improve segmentation performance. Endo-SemiS is evaluated on two clinical applications: kidney stone laser lithotomy from ureteroscopy and polyp screening from colonoscopy. Compared to state-of-the-art segmentation methods, Endo-SemiS achieves substantially superior results on both datasets with limited labeled data. The code is publicly available at this https URL
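摘要中的策略 (1)(2) 可以用一个很短的损失函数示意:两个分割网络互为教师,只在高置信度像素上用对方的伪标签计算交叉熵。下面是与原文实现无关的极简 PyTorch 示意,置信度阈值与类别数均为假设。

```python
import torch
import torch.nn.functional as F

def cross_pseudo_loss(logits_a, logits_b, conf_thresh=0.9):
    """极简示意:双网络交叉监督 + 基于置信度的像素筛选。
    logits_a / logits_b: 两个分割网络对同一无标签帧的输出 [B, C, H, W]。"""
    def one_way(teacher_logits, student_logits):
        probs = F.softmax(teacher_logits.detach(), dim=1)
        conf, pseudo = probs.max(dim=1)            # 每个像素的置信度与伪标签
        mask = conf >= conf_thresh                 # 仅保留高置信度区域
        loss = F.cross_entropy(student_logits, pseudo, reduction="none")
        return (loss * mask).sum() / mask.sum().clamp(min=1)

    # A 监督 B,B 监督 A
    return one_way(logits_a, logits_b) + one_way(logits_b, logits_a)

la = torch.randn(2, 2, 64, 64)
lb = torch.randn(2, 2, 64, 64)
print(float(cross_pseudo_loss(la, lb)))
```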
zh
[CV-104] InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression
【速读】:该论文旨在解决视频序列处理中离散化分词(discrete video tokenization)的效率与准确性难题,尤其针对现有分词方法因固定压缩率导致的信息冗余或丢失问题。其核心挑战在于视频内容具有复杂且变化的信息密度,而传统方法无法动态适应这种差异。解决方案的关键在于提出 InfoTok 框架,该框架基于香农信息论(Shannon’s information theory),通过引入证据下界(evidence lower bound, ELBO)优化目标,实现理论上更优的表示长度,并设计了一种基于 Transformer 的自适应压缩器,使分词粒度可根据内容的信息丰富度动态调整。实验证明,该方法在保持性能不变的前提下可减少 20% 的 token 数量,压缩率达 2.3 倍,显著优于以往启发式自适应方法。
链接: https://arxiv.org/abs/2512.16975
作者: Haotian Ye,Qiyuan He,Jiaqi Han,Puheng Li,Jiaojiao Fan,Zekun Hao,Fitsum Reda,Yogesh Balaji,Huayu Chen,Sheng Liu,Angela Yao,James Zou,Stefano Ermon,Haoxiang Wang,Ming-Yu Liu
机构: NVIDIA; Stanford University (斯坦福大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate and efficient discrete video tokenization is essential for processing long video sequences. Yet, the inherent complexity and variable information density of videos present a significant bottleneck for current tokenizers, which rigidly compress all content at a fixed rate, leading to redundancy or information loss. Drawing inspiration from Shannon's information theory, this paper introduces InfoTok, a principled framework for adaptive video tokenization. We rigorously prove that existing data-agnostic training methods are suboptimal in representation length, and present a novel evidence lower bound (ELBO)-based algorithm that approaches theoretical optimality. Leveraging this framework, we develop a transformer-based adaptive compressor that enables adaptive tokenization. Empirical results demonstrate state-of-the-art compression performance, saving 20% of tokens without affecting performance, and achieving 2.3x compression rates while still outperforming prior heuristic adaptive approaches. By allocating tokens according to informational richness, InfoTok enables a more compressed yet accurate tokenization for video representation, offering valuable insights for future research.
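论文的自适应压缩器是基于 ELBO 学习得到的;这里只用一个启发式的"按信息量比例分配 token 预算"的示意来说明"自适应 token 数"的含义,逐帧信息量打分与预算数值均为假设,并非原文算法。

```python
import numpy as np

def allocate_tokens(frame_scores, total_budget, min_per_frame=1):
    """极简示意:按每帧的信息量估计(如运动幅度或熵)比例分配离散 token 预算。"""
    scores = np.asarray(frame_scores, dtype=np.float64)
    weights = scores / scores.sum()
    alloc = np.maximum(min_per_frame, np.round(weights * total_budget)).astype(int)
    # 修正取整带来的预算偏差:在分配最多的帧上增减
    alloc[np.argmax(alloc)] += total_budget - alloc.sum()
    return alloc

# 5 帧、总预算 256 个 token(数值均为示意)
print(allocate_tokens([0.2, 1.5, 0.3, 3.0, 0.1], total_budget=256))
```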
zh
[CV-105] Lights Camera Consistency: A Multistage Pipeline for Character-Stable AI Video Stories
【速读】:该论文旨在解决当前文本到视频生成模型在生成长视频故事时难以保持角色一致性的问题。其核心挑战在于如何在多场景、长时间的视频中维持角色身份的连贯性,这直接影响视频叙事的逻辑性和观感体验。解决方案的关键在于采用一种类电影制作的多阶段分解方法:首先利用大语言模型(Large Language Model, LLM)生成详细的分镜脚本,随后通过文本到图像模型基于该脚本生成具有角色一致性的视觉锚点(visual anchors),最后以这些锚点引导视频生成模型逐场景合成高质量视频。实验表明,移除视觉锚定机制会导致角色一致性评分从7.99骤降至0.55,验证了视觉先验对身份保留的重要性。
链接: https://arxiv.org/abs/2512.16954
作者: Chayan Jain,Rishant Sharma,Archit Garg,Ishan Bhanuka,Pratik Narang,Dhruv Kumar
机构: BITS Pilani(印度理工学院比兰尼分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Under Review
Abstract:Generating long, cohesive video stories with consistent characters is a significant challenge for current text-to-video AI. We introduce a method that approaches video generation in a filmmaker-like manner. Instead of creating a video in one step, our proposed pipeline first uses a large language model to generate a detailed production script. This script guides a text-to-image model in creating consistent visuals for each character, which then serve as anchors for a video generation model to synthesize each scene individually. Our baseline comparisons validate the necessity of this multi-stage decomposition; specifically, we observe that removing the visual anchoring mechanism results in a catastrophic drop in character consistency scores (from 7.99 to 0.55), confirming that visual priors are essential for identity preservation. Furthermore, we analyze cultural disparities in current models, revealing distinct biases in subject consistency and dynamic degree between Indian vs Western-themed generations.
zh
[CV-106] Enhancing Tree Species Classification: Insights from YOLOv8 and Explainable AI Applied to TLS Point Cloud Projections
【速读】:该论文旨在解决树种分类模型(尤其是基于TLS和深度学习的方法)决策过程不透明的问题,即尽管这些方法在分类精度上达到先进水平,但其判别依据缺乏可解释性。解决方案的关键在于提出一种新方法,将Finer-CAM(类激活映射)生成的显著性图与TLS投影中代表结构特征的分割区域进行关联,从而系统性地识别驱动物种区分的关键形态学特征。通过分析630张显著性图,研究发现模型主要依赖冠层特征进行分类,但在欧洲白蜡、 Scots pine 和道格拉斯冷杉等相似树种间,茎干及细枝特征贡献更显著,且模型决策逻辑与人类专家认知一致,这为揭示数据集局限性、模型偏差并增强预测可信度提供了重要依据。
链接: https://arxiv.org/abs/2512.16950
作者: Adrian Straker,Paul Magdon,Marco Zullich,Maximilian Freudenberg,Christoph Kleinn,Johannes Breidenbach,Stefano Puliti,Nils Nölke
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 31 pages, 14 figures, submitted to Forestry: An International Journal of Forest Research
Abstract:Classifying tree species has been a core research area in forest remote sensing for decades. New sensors and classification approaches like TLS and deep learning achieve state-of-the-art accuracy, but their decision processes remain unclear. Methods such as Finer-CAM (Class Activation Mapping) can highlight features in TLS projections that contribute to the classification of a target species, yet are uncommon in similar-looking contrastive tree species. We propose a novel method linking Finer-CAM explanations to segments of TLS projections representing structural tree features to systematically evaluate which features drive species discrimination. Using TLS data from 2,445 trees across seven European tree species, we trained and validated five YOLOv8 models with cross-validation, reaching a mean accuracy of 96% (SD = 0.24%). Analysis of 630 saliency maps shows the models primarily rely on crown features in TLS projections for species classification. While this result is pronounced in Silver Birch, European Beech, English oak, and Norway spruce, stem features contribute more frequently to the differentiation of European ash, Scots pine, and Douglas fir. In particular, representations of finer branches contribute to the decisions of the models. The models judge as similar those tree species that a human expert would also regard as similar. Furthermore, our results highlight the need for an improved understanding of the decision processes of tree species classification models to help reveal dataset and model limitations and biases, and to build confidence in model predictions.
zh
[CV-107] AVM: Towards Structure-Preserving Neural Response Modeling in the Visual Cortex Across Stimuli and Individuals
【速读】:该论文旨在解决深度学习模型在模拟神经响应时难以区分稳定视觉编码与条件特异性适应的问题,从而限制了其在不同刺激和个体间的泛化能力。解决方案的关键在于提出一种结构保持的自适应视觉模型(Adaptive Visual Model, AVM),其核心机制是通过模块化的调制路径独立建模由刺激内容和个体身份引起的神经响应变化,同时冻结基于Vision Transformer的编码器以保留稳定的视觉特征表示。这种设计实现了条件感知的适应性调整,无需修改主干网络,显著提升了跨数据集、跨受试者以及刺激层级变化下的预测性能与可解释性。
链接: https://arxiv.org/abs/2512.16948
作者: Qi Xu,Shuai Gong,Xuming Ran,Haihua Luo,Yangfan Hu
机构: 1. University of Science and Technology of China (中国科学技术大学); 2. Tsinghua University (清华大学); 3. Huawei Cloud (华为云); 4. Alibaba Cloud (阿里云)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While deep learning models have shown strong performance in simulating neural responses, they often fail to clearly separate stable visual encoding from condition-specific adaptation, which limits their ability to generalize across stimuli and individuals. We introduce the Adaptive Visual Model (AVM), a structure-preserving framework that enables condition-aware adaptation through modular subnetworks, without modifying the core representation. AVM keeps a Vision Transformer-based encoder frozen to capture consistent visual features, while independently trained modulation paths account for neural response variations driven by stimulus content and subject identity. We evaluate AVM in three experimental settings, including stimulus-level variation, cross-subject generalization, and cross-dataset adaptation, all of which involve structured changes in inputs and individuals. Across two large-scale mouse V1 datasets, AVM outperforms the state-of-the-art V1T model by approximately 2% in predictive correlation, demonstrating robust generalization, interpretable condition-wise modulation, and high architectural efficiency. Specifically, AVM achieves a 9.1% improvement in explained variance (FEVE) under the cross-dataset adaptation setting. These results suggest that AVM provides a unified framework for adaptive neural modeling across biological and experimental conditions, offering a scalable solution under structural constraints. Its design may inform future approaches to cortical modeling in both neuroscience and biologically inspired AI systems.
zh
[CV-108] Comparison of deep learning models: CNN and VGG-16 in identifying pornographic content
【速读】:该论文试图解决的问题是:如何快速准确地识别含有色情图像内容的网站,以应对印尼政府虽已屏蔽大量含不良信息的网站(如色情内容)但仍可通过虚拟私人网络(VPN)访问的问题。解决方案的关键在于采用深度学习方法,具体使用卷积神经网络(CNN)和视觉几何组16(VGG-16)模型对网页中的图像内容进行分类检测,并通过对比两种模型在不同超参数设置下的性能表现,最终确定CNN模型在epoch值为50、学习率为0.001时达到最高准确率94.87%,表明其在快速且精准识别色情内容方面优于VGG-16模型。
链接: https://arxiv.org/abs/2512.16947
作者: Reza Chandra,Adang Suhendra,Lintang Yuniar Banowosari,Prihandoko
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In 2020, a total of 59,741 websites were blocked by the Indonesian government due to containing negative content, including pornography, with 14,266 websites falling into this category. However, these blocked websites could still be accessed by the public using virtual private networks (VPNs). This prompted the research idea to quickly identify pornographic content. This study aims to develop a system capable of identifying websites suspected of containing pornographic image content, using a deep learning approach with convolutional neural network (CNN) and visual geometry group 16 (VGG-16) models. The two models were then explored comprehensively and holistically to determine which model was most effective in detecting pornographic content quickly. Based on the comparison between the CNN and VGG-16 models, the best test result was obtained in the eighth experiment, using the CNN model with an epoch value of 50 and a learning rate of 0.001, reaching an accuracy of 0.9487, or 94.87%. This indicates that the CNN model is more effective at detecting pornographic content quickly and accurately than the VGG-16 model.
zh
[CV-109] V-Agent : An Interactive Video Search System Using Vision-Language Models CIKM2025
【速读】:该论文旨在解决传统文本检索系统在多模态场景下(如视频搜索)性能受限的问题,特别是难以同时理解视频中的视觉内容与语音信息。其解决方案的关键在于构建一个基于视觉语言模型(Vision-Language Model, VLM)的多代理平台V-Agent,通过微调VLM并融合图像-文本检索模型的向量表示,实现视频帧与自动语音识别(ASR)转录文本在共享多模态嵌入空间中的联合建模,从而支持上下文感知的视频检索。该框架由路由、搜索和对话三个协同工作的智能体组成,其中搜索代理结合VLM检索模块与重排序机制显著提升检索质量,在MultiVENT 2.0基准上实现了最先进的零样本性能。
链接: https://arxiv.org/abs/2512.16925
作者: SunYoung Park,Jong-Hyeon Lee,Youngjune Kim,Daegyu Sung,Younghyun Yu,Young-rok Cha,Jeongho Ju
机构: NC AI(NC人工智能); Kakao(카카오); Korea Advanced Institute of Science and Technology(韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
备注: CIKM 2025 MMGENSR Workshop
Abstract:We introduce V-Agent, a novel multi-agent platform designed for advanced video search and interactive user-system conversations. By fine-tuning a vision-language model (VLM) with a small video preference dataset and enhancing it with a retrieval vector from an image-text retrieval model, we overcome the limitations of traditional text-based retrieval systems in multimodal scenarios. The VLM-based retrieval model independently embeds video frames and audio transcriptions from an automatic speech recognition (ASR) module into a shared multimodal representation space, enabling V-Agent to interpret both visual and spoken content for context-aware video search. This system consists of three agents-a routing agent, a search agent, and a chat agent-that work collaboratively to address user intents by refining search outputs and communicating with users. The search agent utilizes the VLM-based retrieval model together with an additional re-ranking module to further enhance video retrieval quality. Our proposed framework demonstrates state-of-the-art zero-shot performance on the MultiVENT 2.0 benchmark, highlighting its potential for both academic research and real-world applications.
zh
[CV-110] MedNeXt-v2: Scaling 3D ConvNeXts for Large-Scale Supervised Representation Learning in Medical Image Segmentation
【速读】:该论文旨在解决当前3D医学图像分割中大规模监督预训练方法存在的关键问题:现有研究过度关注数据集规模的扩展,而忽视了骨干网络(backbone network)在大规模预训练场景下是否具备有效的表征学习能力。其解决方案的核心在于重新审视基于ConvNeXt的架构,并提出MedNeXt-v2——一种复合缩放(compound scaling)的3D ConvNeXt模型,通过改进微观结构(micro-architecture)和数据规模(data scaling)实现更优的表征学习效果。具体而言,作者首先通过全面的骨干网络基准测试发现,常规使用的骨干网络在大规模预训练中往往表现不佳,且从头训练性能更强的模型能可靠预测预训练后的下游任务性能;在此基础上,引入3D全局响应归一化(Global Response Normalization)模块,并采用深度、宽度与上下文缩放策略优化模型结构,最终在18k CT体积数据上预训练后,在六个挑战性的CT和MR基准任务(共144个结构)上实现了最先进的分割性能,验证了骨干网络设计对3D医学图像表征学习的重要性。
链接: https://arxiv.org/abs/2512.17774
作者: Saikat Roy,Yannick Kirchhoff,Constantin Ulrich,Maximillian Rokuss,Tassilo Wald,Fabian Isensee,Klaus Maier-Hein
机构: German Cancer Research Center (DKFZ) Heidelberg, Division of Medical Image Computing, Germany; Faculty of Mathematics and Computer Science, Heidelberg University, Germany; HIDSS4Health - Helmholtz Information and Data Science School for Health, Karlsruhe/Heidelberg, Germany; Medical Faculty Heidelberg, Heidelberg University, Germany; National Center for Tumor Diseases (NCT), Heidelberg, Germany; Helmholtz Imaging, German Cancer Research Center, Germany; Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital, Germany
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Large-scale supervised pretraining is rapidly reshaping 3D medical image segmentation. However, existing efforts focus primarily on increasing dataset size and overlook the question of whether the backbone network is an effective representation learner at scale. In this work, we address this gap by revisiting ConvNeXt-based architectures for volumetric segmentation and introducing MedNeXt-v2, a compound-scaled 3D ConvNeXt that leverages improved micro-architecture and data scaling to deliver state-of-the-art performance. First, we show that routinely used backbones in large-scale pretraining pipelines are often suboptimal. Subsequently, we use comprehensive backbone benchmarking prior to scaling and demonstrate that stronger from scratch performance reliably predicts stronger downstream performance after pretraining. Guided by these findings, we incorporate a 3D Global Response Normalization module and use depth, width, and context scaling to improve our architecture for effective representation learning. We pretrain MedNeXt-v2 on 18k CT volumes and demonstrate state-of-the-art performance when fine-tuning across six challenging CT and MR benchmarks (144 structures), showing consistent gains over seven publicly released pretrained models. Beyond improvements, our benchmarking of these models also reveals that stronger backbones yield better results on similar data, representation scaling disproportionately benefits pathological segmentation, and that modality-specific pretraining offers negligible benefit once full finetuning is applied. In conclusion, our results establish MedNeXt-v2 as a strong backbone for large-scale supervised representation learning in 3D Medical Image Segmentation. Our code and pretrained models are made available with the official nnUNet repository at: this https URL
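摘要提到在骨干中引入"3D 全局响应归一化(GRN)"模块。下面给出一个参照 ConvNeXt-V2 的 GRN 公式向体数据(channels-first:[N, C, D, H, W])的直接推广示意,与 MedNeXt-v2 的实际实现细节可能不同,通道数与张量尺寸均为假设。

```python
import torch
import torch.nn as nn

class GRN3d(nn.Module):
    """极简示意:3D 全局响应归一化(GRN)。
    先在空间维度上计算每通道的全局响应,再做通道间相对校准,最后残差式调制。"""
    def __init__(self, channels, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1, 1))
        self.eps = eps

    def forward(self, x):
        # x: [N, C, D, H, W]
        gx = x.pow(2).sum(dim=(2, 3, 4), keepdim=True).sqrt()   # 每通道的全局 L2 响应
        nx = gx / (gx.mean(dim=1, keepdim=True) + self.eps)     # 通道间相对校准
        return self.gamma * (x * nx) + self.beta + x             # 残差式调制

x = torch.randn(1, 32, 16, 32, 32)
print(GRN3d(32)(x).shape)
```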
zh
[CV-111] Breast Cancer Neoadjuvant Chemotherapy Treatment Response Prediction Using Aligned Longitudinal MRI and Clinical Data
【速读】:该论文旨在解决乳腺癌患者在接受新辅助化疗(NACT)过程中,如何通过纵向对比增强磁共振成像(CE-MRI)和临床数据准确预测病理完全缓解(PCR)及5年无复发生存状态(RFS)的问题。其解决方案的关键在于提出了一种基于图像配准的特征提取框架,能够从不同时间点的肿瘤原位图像中提取并比较肿瘤内部变化特征,从而实现对治疗响应的动态监测;在此基础上,结合多种特征选择方法与机器学习模型进行建模分析,结果表明该配准策略显著提升了预测性能,且传统影像组学(radiomics)特征提取器在PCR和RFS分类任务中均优于预训练深度学习模型,展现出更高的预测准确率和更好的可解释性。
链接: https://arxiv.org/abs/2512.17759
作者: Rahul Ravi,Ruizhe Li,Tarek Abdelfatah,Stephen Chan,Xin Chen
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Aim: This study investigates treatment response prediction to neoadjuvant chemotherapy (NACT) in breast cancer patients, using longitudinal contrast-enhanced magnetic resonance images (CE-MRI) and clinical data. The goal is to develop machine learning (ML) models to predict pathologic complete response (PCR binary classification) and 5-year relapse-free survival status (RFS binary classification). Method: The proposed framework includes tumour segmentation, image registration, feature extraction, and predictive modelling. Using the image registration method, MRI image features can be extracted and compared from the original tumour site at different time points, thereby monitoring intratumor changes during the NACT process. Four feature extractors, including one radiomics-based and three deep learning-based (MedicalNet, Segformer3D, SAM-Med3D), were implemented and compared. In combination with three feature selection methods and four ML models, predictive models are built and compared. Results: The proposed image registration-based feature extraction consistently improves the predictive models. In the PCR and RFS classification tasks, the logistic regression model trained on radiomic features performed best, with an AUC of 0.88 and a classification accuracy of 0.85 for PCR classification, and an AUC of 0.78 and a classification accuracy of 0.72 for RFS classification. Conclusions: The results show that the image registration method significantly improves longitudinal feature learning for predicting PCR and RFS. The radiomics feature extractor is more effective than the pre-trained deep learning feature extractors, with higher performance and better interpretability.
zh
[CV-112] SkinGenBench: Generative Model and Preprocessing Effects for Synthetic Dermoscopic Augmentation in Melanoma Diagnosis
【速读】:该论文旨在解决合成皮肤病变图像在医学影像增强与下游黑色素瘤诊断任务中如何有效提升性能的问题,核心挑战在于厘清预处理复杂度与生成模型选择之间的交互作用。解决方案的关键在于系统性地比较两种主流生成范式——StyleGAN2-ADA与去噪扩散概率模型(DDPMs)——在不同预处理策略(基础几何变换与高级伪影去除)下的表现差异。研究发现,生成架构的选择对图像保真度(FID~65.5,KID~0.05)和诊断效用的影响显著高于预处理复杂度,其中StyleGAN2-ADA生成的图像更贴近真实分布且提升诊断准确率(如ViT-B/16模型F1-score达0.88,ROC-AUC达0.98),而高级伪影去除仅带来边际改进,甚至可能抑制临床相关纹理特征。
链接: https://arxiv.org/abs/2512.17585
作者: N. A. Adarsh Pritam,Jeba Shiney O,Sanyam Jain
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:This work introduces SkinGenBench, a systematic biomedical imaging benchmark that investigates how preprocessing complexity interacts with generative model choice for synthetic dermoscopic image augmentation and downstream melanoma diagnosis. Using a curated dataset of 14,116 dermoscopic images from HAM10000 and MILK10K across five lesion classes, we evaluate two representative generative paradigms, StyleGAN2-ADA and Denoising Diffusion Probabilistic Models (DDPMs), under basic geometric augmentation and advanced artifact removal pipelines. Synthetic melanoma images are assessed using established perceptual and distributional metrics (FID, KID, IS), feature space analysis, and their impact on diagnostic performance across five downstream classifiers. Experimental results demonstrate that generative architecture choice has a stronger influence on both image fidelity and diagnostic utility than preprocessing complexity. StyleGAN2-ADA consistently produced synthetic images more closely aligned with real data distributions, achieving the lowest FID (~65.5) and KID (~0.05), while diffusion models generated higher variance samples at the cost of reduced perceptual fidelity and class anchoring. Advanced artifact removal yielded only marginal improvements in generative metrics and provided limited downstream diagnostic gains, suggesting possible suppression of clinically relevant texture cues. In contrast, synthetic data augmentation substantially improved melanoma detection, with 8-15% absolute gains in melanoma F1-score and ViT-B/16 achieving F1~0.88 and ROC-AUC~0.98, representing an improvement of approximately 14% over non-augmented baselines. Our code can be found at this https URL
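摘要中用 FID 衡量合成图像与真实图像分布的接近程度;下面给出 Fréchet 距离在特征空间中的标准计算示意(需要 scipy)。此处用随机数代替真实的 Inception 特征,特征维度与样本数均为假设,仅用于演示公式本身。

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    """极简示意:FID = ||mu1 - mu2||^2 + Tr(C1 + C2 - 2*(C1·C2)^{1/2})。"""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):   # 数值误差可能引入微小虚部
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(c1 + c2 - 2 * covmean))

real = np.random.randn(500, 64)        # 真实图像特征(此处为随机占位)
fake = np.random.randn(500, 64) + 0.1  # 合成图像特征(此处为随机占位)
print(frechet_distance(real, fake))
```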
zh
[CV-113] Colormap-Enhanced Vision Transformers for MRI-Based Multiclass (4-Class) Alzheimers Disease Classification
【速读】:该论文旨在解决传统深度学习模型在处理阿尔茨海默病(Alzheimer’s disease, AD)脑部磁共振成像(Magnetic Resonance Imaging, MRI)图像时,难以有效提取细微结构差异所导致的分类性能受限问题。其解决方案的关键在于提出一种基于伪彩色增强的视觉Transformer框架(PseudoColorViT-Alz),通过将MRI图像转换为伪彩色表示以强化解剖纹理和对比度信息,并结合Vision Transformer的全局特征建模能力,从而显著提升AD多阶段分类的准确性与可解释性。
链接: https://arxiv.org/abs/2512.16964
作者: Faisal Ahmed
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 4 figures
Abstract:Magnetic Resonance Imaging (MRI) plays a pivotal role in the early diagnosis and monitoring of Alzheimer’s disease (AD). However, the subtle structural variations in brain MRI scans often pose challenges for conventional deep learning models to extract discriminative features effectively. In this work, we propose PseudoColorViT-Alz, a colormap-enhanced Vision Transformer framework designed to leverage pseudo-color representations of MRI images for improved Alzheimer’s disease classification. By combining colormap transformations with the global feature learning capabilities of Vision Transformers, our method amplifies anatomical texture and contrast cues that are otherwise subdued in standard grayscale MRI scans. We evaluate PseudoColorViT-Alz on the OASIS-1 dataset using a four-class classification setup (non-demented, moderate dementia, mild dementia, and very mild dementia). Our model achieves a state-of-the-art accuracy of 99.79% with an AUC of 100%, surpassing the performance of recent 2024–2025 methods, including CNN-based and Siamese-network approaches, which reported accuracies ranging from 96.1% to 99.68%. These results demonstrate that pseudo-color augmentation combined with Vision Transformers can significantly enhance MRI-based Alzheimer’s disease classification. PseudoColorViT-Alz offers a robust and interpretable framework that outperforms current methods, providing a promising tool to support clinical decision-making and early detection of Alzheimer’s disease.
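A minimal sketch of the pseudo-colour step described above, assuming a matplotlib colormap applied to a normalised grayscale slice to produce a 3-channel input for a Vision Transformer; the specific colormap ("jet" here) is an assumption, not necessarily the one used in the paper.

```python
# Sketch: map a grayscale MRI slice through a colormap to obtain an RGB image.
import numpy as np
import matplotlib.pyplot as plt

def pseudo_color(slice_2d: np.ndarray, cmap_name: str = "jet") -> np.ndarray:
    lo, hi = slice_2d.min(), slice_2d.max()
    norm = (slice_2d - lo) / (hi - lo + 1e-8)          # scale intensities to [0, 1]
    rgba = plt.get_cmap(cmap_name)(norm)               # (H, W, 4) float RGBA
    return (rgba[..., :3] * 255).astype(np.uint8)      # keep the RGB channels

mri = np.random.rand(224, 224)                         # placeholder grayscale slice
print(pseudo_color(mri).shape)                         # (224, 224, 3)
```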
zh
人工智能
[AI-0] Humanlike AI Design Increases Anthropomorphism but Yields Divergent Outcomes on Engagement and Trust Globally
【速读】:该论文旨在解决当前关于人形AI设计(humanlike AI design)对用户行为影响的理论假设缺乏实证支持的问题,尤其是这些假设主要基于西方人群,忽视了全球用户群体的文化多样性。其关键解决方案是通过两项大规模跨国实验(N=3,500)在10个不同国家开展真实场景下的AI交互研究,发现用户评估AI人形程度时更关注具体互动线索(如对话流畅性或共情理解),而非抽象概念(如意识或感知能力);并通过因果实验验证了人形设计可提升拟人化感知,但并未普遍增强用户参与度和信任感——相反,文化差异显著调节了这种关系:某些设计在巴西能提升信任,但在日本却可能削弱信任。这一发现揭示了人-AI交互的复杂文化中介机制,强调必须摒弃“一刀切”的AI治理模式,转向更具情境敏感性的跨文化设计策略。
链接: https://arxiv.org/abs/2512.17898
作者: Robin Schimmelpfennig,Mark Díaz,Vinodkumar Prabhakaran,Aida Davani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Over a billion users across the globe interact with AI systems engineered with increasing sophistication to mimic human traits. This shift has triggered urgent debate regarding Anthropomorphism, the attribution of human characteristics to synthetic agents, and its potential to induce misplaced trust or emotional dependency. However, the causal link between more humanlike AI design and subsequent effects on engagement and trust has not been tested in realistic human-AI interactions with a global user pool. Prevailing safety frameworks continue to rely on theoretical assumptions derived from Western populations, overlooking the global diversity of AI users. Here, we address these gaps through two large-scale cross-national experiments (N=3,500) across 10 diverse nations, involving real-time and open-ended interactions with an AI system. We find that when evaluating an AI’s human-likeness, users focus less on the kind of theoretical aspects often cited in policy (e.g., sentience or consciousness), but rather applied, interactional cues like conversation flow or understanding the user’s perspective. We also experimentally demonstrate that humanlike design levers can causally increase anthropomorphism among users; however, we do not find that humanlike design universally increases behavioral measures for user engagement and trust, as previous theoretical work suggests. Instead, part of the connection between human-likeness and behavioral outcomes is fractured by culture: specific design choices that foster self-reported trust in AI-systems in some populations (e.g., Brazil) may trigger the opposite result in others (e.g., Japan). Our findings challenge prevailing narratives of inherent risk in humanlike AI design. Instead, we identify a nuanced, culturally mediated landscape of human-AI interaction, which demands that we move beyond a one-size-fits-all approach in AI governance.
zh
[AI-1] Weighted Stochastic Differential Equation to Implement Wasserstein-Fisher-Rao Gradient Flow
【速读】:该论文旨在解决当前基于得分的扩散模型(score-based diffusion models)在处理非对数凹(non-log-concave)目标分布时采样效率低下的问题,尤其是在存在多峰或非凸势能景观(如双井势)的情况下,传统扩散动力学的混合速率会急剧下降。其解决方案的关键在于引入基于 Wasserstein–Fisher–Rao (WFR) 几何结构的可控质量重加权机制,通过显式修正项将样本空间中的传输过程与概率测度空间上的垂直(反应)动力学耦合,从而增强探索能力;该机制可通过 Feynman–Kac 表示法以加权随机微分方程的形式实现,为未来理论分析和算法设计提供了几何与算子理论层面的严谨基础。
链接: https://arxiv.org/abs/2512.17878
作者: Herlock Rahimi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 26 pages, 1 figure
Abstract:Score-based diffusion models currently constitute the state of the art in continuous generative modeling. These methods are typically formulated via overdamped or underdamped Ornstein–Uhlenbeck-type stochastic differential equations, in which sampling is driven by a combination of deterministic drift and Brownian diffusion, resulting in continuous particle trajectories in the ambient space. While such dynamics enjoy exponential convergence guarantees for strongly log-concave target distributions, it is well known that their mixing rates deteriorate exponentially in the presence of nonconvex or multimodal landscapes, such as double-well potentials. Since many practical generative modeling tasks involve highly non-log-concave target distributions, considerable recent effort has been devoted to developing sampling schemes that improve exploration beyond classical diffusion dynamics. A promising line of work leverages tools from information geometry to augment diffusion-based samplers with controlled mass reweighting mechanisms. This perspective leads naturally to Wasserstein–Fisher–Rao (WFR) geometries, which couple transport in the sample space with vertical (reaction) dynamics on the space of probability measures. In this work, we formulate such reweighting mechanisms through the introduction of explicit correction terms and show how they can be implemented via weighted stochastic differential equations using the Feynman–Kac representation. Our study provides a preliminary but rigorous investigation of WFR-based sampling dynamics, and aims to clarify their geometric and operator-theoretic structure as a foundation for future theoretical and algorithmic developments.
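As a rough illustration of the weighted-SDE idea (not the paper's construction), the toy sketch below runs overdamped Langevin transport on a double-well potential while particles carry Feynman–Kac log-weights and are occasionally resampled, which mimics the reaction (mass reweighting) component. The potential, rate function, and resampling rule are arbitrary placeholders.

```python
# Toy sketch of a weighted SDE: Langevin transport plus Feynman-Kac weights,
# with birth/death-style resampling standing in for the Fisher-Rao reaction term.
import numpy as np

rng = np.random.default_rng(0)

def grad_V(x):                      # double-well potential V(x) = (x^2 - 1)^2
    return 4.0 * x * (x**2 - 1.0)

n, dt, steps = 2000, 1e-3, 5000
x = rng.normal(size=n)              # particle positions
logw = np.zeros(n)                  # log importance weights

for _ in range(steps):
    x += -grad_V(x) * dt + np.sqrt(2 * dt) * rng.normal(size=n)   # transport step
    logw += -dt * (x**2 - 1.0) ** 2          # placeholder Feynman-Kac rate g(x) = V(x)
    w = np.exp(logw - logw.max())
    ess = w.sum() ** 2 / (w**2).sum()        # effective sample size
    if ess < n / 2:                          # resample = birth/death of particles
        idx = rng.choice(n, size=n, p=w / w.sum())
        x, logw = x[idx], np.zeros(n)

print("particle mean/std:", x.mean(), x.std())
```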
zh
[AI-2] AnyTask: an Automated Task and Data Generation Framework for Advancing Sim-to-Real Policy Learning
【速读】:该论文旨在解决通用机器人学习中因数据稀缺导致的性能瓶颈问题,即真实世界中大规模、多样化且高质量的交互数据收集成本高昂。为应对这一挑战,作者提出了一种自动化框架AnyTask,其核心创新在于将大规模并行GPU仿真与基础模型(foundation models)相结合,实现任务自动设计与专家示范数据的合成。解决方案的关键在于引入三个专用代理:ViPR(基于视觉语言模型(VLM)引导的并行精化任务与运动规划)、ViPR-Eureka(利用生成密集奖励和大语言模型(LLM)指导接触采样的强化学习)以及ViPR-RL(结合规划与学习的混合方法,在稀疏奖励下生成高质量示范),从而高效生成多样化的机器人操作数据,并训练出可在真实硬件上直接部署且具备泛化能力的行为克隆策略。
链接: https://arxiv.org/abs/2512.17853
作者: Ran Gong,Xiaohan Zhang,Jinghuan Shang,Maria Vittoria Minniti,Jigarkumar Patel,Valerio Pepe,Riedana Yan,Ahmet Gundogdu,Ivan Kapelyukh,Ali Abbas,Xiaoqiang Yan,Harsh Patel,Laura Herlant,Karl Schmeckpeper
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 28 pages, 25 figures. The first four authors contributed equally
Abstract:Generalist robot learning remains constrained by data: large-scale, diverse, and high-quality interaction data are expensive to collect in the real world. While simulation has become a promising way for scaling up data collection, the related tasks, including simulation task design, task-aware scene generation, expert demonstration synthesis, and sim-to-real transfer, still demand substantial human effort. We present AnyTask, an automated framework that pairs massively parallel GPU simulation with foundation models to design diverse manipulation tasks and synthesize robot data. We introduce three AnyTask agents for generating expert demonstrations aiming to solve as many tasks as possible: 1) ViPR, a novel task and motion planning agent with VLM-in-the-loop Parallel Refinement; 2) ViPR-Eureka, a reinforcement learning agent with generated dense rewards and LLM-guided contact sampling; 3) ViPR-RL, a hybrid planning and learning approach that jointly produces high-quality demonstrations with only sparse rewards. We train behavior cloning policies on generated data, validate them in simulation, and deploy them directly on real robot hardware. The policies generalize to novel object poses, achieving 44% average success across a suite of real-world pick-and-place, drawer opening, contact-rich pushing, and long-horizon manipulation tasks. Our project website is at this https URL .
zh
[AI-3] Integrating Computational Methods and AI into Qualitative Studies of Aging and Later Life
【速读】:该论文试图解决如何将计算社会科学(Computational Social Science, CSS)工具有效整合到老年学定性研究中,以增强研究的深度、广度与方法多样性。其解决方案的关键在于利用机器学习(Machine Learning, ML)和自然语言处理(Natural Language Processing, NLP)等技术,对大规模定性数据进行系统化索引、模式识别,并保持与深度个案资料的清晰关联,从而在不替代传统质性方法的基础上,实现研究流程的优化、样本规模的扩展以及多方法融合的新路径。
链接: https://arxiv.org/abs/2512.17850
作者: Corey M. Abramson
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Applications (stat.AP)
备注: CITE: Abramson, Corey M. (Forthcoming 2026). “Integrating Computational Methods and AI into Qualitative Studies of Aging and Later Life.” In Handbook of the Sociology of Aging, 2nd ed., edited by Markus H. Schafer, Dawn C. Carr, Jacqueline L. Angel, and Richard A. Settersten Jr
Abstract:This chapter demonstrates how computational social science (CSS) tools are extending and expanding research on aging. The depth and context from traditionally qualitative methods such as participant observation, in-depth interviews, and historical documents are increasingly employed alongside scalable data management, computational text analysis, and open-science practices. Machine learning (ML) and natural language processing (NLP), provide resources to aggregate and systematically index large volumes of qualitative data, identify patterns, and maintain clear links to in-depth accounts. Drawing on case studies of projects that examine later life–including examples with original data from the DISCERN study (a team-based ethnography of life with dementia) and secondary analyses of the American Voices Project (nationally representative interview)–the chapter highlights both uses and challenges of bringing CSS tools into more meaningful dialogue with qualitative aging research. The chapter argues such work has potential for (1) streamlining and augmenting existing workflows, (2) scaling up samples and projects, and (3) generating multi-method approaches to address important questions in new ways, before turning to practices useful for individuals and teams seeking to understand current possibilities or refine their workflow processes. The chapter concludes that current developments are not without peril, but offer potential for new insights into aging and the life course by broadening–rather than replacing–the methodological foundations of qualitative research.
zh
[AI-4] Planning as Descent: Goal-Conditioned Latent Trajectory Synthesis in Learned Energy Landscapes
【速读】:该论文旨在解决离线目标条件强化学习(goal-conditioned reinforcement learning)中的规划问题,特别是在缺乏奖励信号的情况下如何高效生成可行且高效的轨迹。传统方法通常依赖于直接学习策略或显式规划器,易受训练与推理阶段不一致(train-test mismatch)的影响。其解决方案的关键在于提出“规划即下降”(Planning as Descent, PaD)框架:通过学习一个定义在潜在轨迹上的目标条件能量函数(energy function),将可行且符合目标的未来轨迹映射为低能量状态;推理时利用梯度下降法在该能量景观中对多条候选轨迹进行优化,从而实现无需显式策略或规划模块的端到端规划。该方法在训练和推理阶段使用相同计算流程,显著降低偏差,并通过自监督的 hindsight goal relabeling 机制提升规划质量,最终在 OGBench 立方体操作任务中取得 95% 的成功率,优于现有最优方法(68%)。
链接: https://arxiv.org/abs/2512.17846
作者: Carlos Vélez García,Miguel Cazorla,Jorge Pomares
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:We present Planning as Descent (PaD), a framework for offline goal-conditioned reinforcement learning that grounds trajectory synthesis in verification. Instead of learning a policy or explicit planner, PaD learns a goal-conditioned energy function over entire latent trajectories, assigning low energy to feasible, goal-consistent futures. Planning is realized as gradient-based refinement in this energy landscape, using identical computation during training and inference to reduce train-test mismatch common in decoupled modeling pipelines. PaD is trained via self-supervised hindsight goal relabeling, shaping the energy landscape around the planning dynamics. At inference, multiple trajectory candidates are refined under different temporal hypotheses, and low-energy plans balancing feasibility and efficiency are selected. We evaluate PaD on OGBench cube manipulation tasks. When trained on narrow expert demonstrations, PaD achieves state-of-the-art 95% success, strongly outperforming prior methods that peak at 68%. Remarkably, training on noisy, suboptimal data further improves success and plan efficiency, highlighting the benefits of verification-driven planning. Our results suggest learning to evaluate and refine trajectories provides a robust alternative to direct policy learning for offline, reward-free planning.
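A minimal sketch of the "planning as descent" loop, assuming a stand-in MLP energy network: a candidate latent trajectory is refined by gradient descent on the goal-conditioned energy, the same computation the abstract says is shared between training and inference. The network and dimensions are illustrative.

```python
# Sketch: refine a latent trajectory by descending a learned energy landscape.
import torch

latent_dim, horizon = 16, 32
energy = torch.nn.Sequential(                 # E(trajectory, goal) -> scalar energy
    torch.nn.Linear(horizon * latent_dim + latent_dim, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 1),
)

goal = torch.randn(latent_dim)
traj = torch.randn(horizon, latent_dim, requires_grad=True)   # candidate plan
opt = torch.optim.Adam([traj], lr=1e-2)

for step in range(200):                       # identical refinement at train & test time
    opt.zero_grad()
    e = energy(torch.cat([traj.flatten(), goal]))
    e.backward()
    opt.step()

print("final energy:", float(e))
```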
zh
[AI-5] LLM-based Behaviour Driven Development for Hardware Design
【速读】:该论文旨在解决硬件设计中测试与验证活动因系统规模扩大而日益复杂的难题,尤其针对行为驱动开发(Behavior Driven Development, BDD)在硬件设计领域尚未充分落地的问题。其核心挑战在于从文本规格说明中手动提取精确的行为场景所需的巨大人力成本。论文提出利用大语言模型(Large Language Models, LLMs)技术自动化这一过程,通过LLM-based方法将自然语言描述的规格转化为可执行的行为测试用例,从而显著降低BDD在硬件设计中的实施门槛,提升测试生成的效率与准确性。
链接: https://arxiv.org/abs/2512.17814
作者: Rolf Drechsler,Qian Liu
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: 7 pages, keynote given at 2nd International Symposium on Artificial Intelligence and Internet of Things (AIIoT-25), December 22-24th, 2025
Abstract:Test and verification are essential activities in hardware and system design, but their complexity grows significantly with increasing system sizes. While Behavior Driven Development (BDD) has proven effective in software engineering, it is not yet well established in hardware design, and its practical use remains limited. One contributing factor is the manual effort required to derive precise behavioral scenarios from textual specifications. Recent advances in Large Language Models (LLMs) offer new opportunities to automate this step. In this paper, we investigate the use of LLM-based techniques to support BDD in the context of hardware design.
zh
[AI-6] Intelligent Knowledge Mining Framework: Bridging AI Analysis and Trustworthy Preservation
【速读】:该论文旨在解决数字数据在各数据密集型领域中因分散存储、格式异构和文档非结构化而导致的信息孤岛问题,从而阻碍高效利用与协同决策。其核心解决方案是提出智能知识挖掘框架(Intelligent Knowledge Mining Framework, IKMF),该框架采用双流架构:一是横向的知识挖掘流程,将原始数据系统性地转化为语义丰富且可被机器操作的知识;二是并行的可信归档流,确保知识资产的完整性、来源可追溯性和计算可复现性。通过定义这一共生关系的蓝图,IKMF实现了从静态存储库向动态知识生态系统的转变,促进行动情报在生产者与消费者之间的流动。
链接: https://arxiv.org/abs/2512.17795
作者: Binh Vu
机构: 未知
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:The unprecedented proliferation of digital data presents significant challenges in access, integration, and value creation across all data-intensive sectors. Valuable information is frequently encapsulated within disparate systems, unstructured documents, and heterogeneous formats, creating silos that impede efficient utilization and collaborative decision-making. This paper introduces the Intelligent Knowledge Mining Framework (IKMF), a comprehensive conceptual model designed to bridge the critical gap between dynamic AI-driven analysis and trustworthy long-term preservation. The framework proposes a dual-stream architecture: a horizontal Mining Process that systematically transforms raw data into semantically rich, machine-actionable knowledge, and a parallel Trustworthy Archiving Stream that ensures the integrity, provenance, and computational reproducibility of these assets. By defining a blueprint for this symbiotic relationship, the paper provides a foundational model for transforming static repositories into living ecosystems that facilitate the flow of actionable intelligence from producers to consumers. This paper outlines the motivation, problem statement, and key research questions guiding the research and development of the framework, presents the underlying scientific methodology, and details its conceptual design and modeling.
zh
[AI-7] Easy Adaptation: An Efficient Task-Specific Knowledge Injection Method for Large Models in Resource-Constrained Environments
【速读】:该论文旨在解决参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法在实际应用中面临的两大挑战:一是资源消耗高,即便相比全量微调有所降低,仍需大量时间和内存;二是参数依赖性强,现有方法依赖对大型模型(Large Models, LMs)参数的更新,而当前许多领先模型采用闭源策略,仅通过API提供访问,导致成本高昂且难以持续。解决方案的关键在于提出Easy Adaptation(EA)框架,其核心思想是设计特定小模型(Specific Small Models, SSMs),用于补足大型模型在特定数据分布上的拟合不足,从而在不访问LM参数的前提下实现与PEFT相当的性能,同时显著降低资源需求。
链接: https://arxiv.org/abs/2512.17771
作者: Dong Chen,Zhengqing Hu,Shixing Zhao,Yibo Guo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:While the enormous parameter scale endows Large Models (LMs) with unparalleled performance, it also limits their adaptability across specific tasks. Parameter-Efficient Fine-Tuning (PEFT) has emerged as a critical approach for effectively adapting LMs to a diverse range of downstream tasks. However, existing PEFT methods face two primary challenges: (1) High resource cost. Although PEFT methods significantly reduce resource demands compared to full fine-tuning, they still require substantial time and memory, making them impractical in resource-constrained environments. (2) Parameter dependency. PEFT methods heavily rely on updating a subset of parameters associated with LMs to incorporate task-specific knowledge. Yet, due to increasing competition in the LMs landscape, many companies have adopted closed-source policies for their leading models, offering access only via Application Programming Interfaces (APIs). Moreover, the expense is often prohibitive and difficult to sustain, as the fine-tuning process of LMs is extremely slow. Even though small models perform far worse than LMs in general, they can achieve superior results on particular distributions while requiring only minimal resources. Motivated by this insight, we propose Easy Adaptation (EA), which designs Specific Small Models (SSMs) to complement the underfitted data distribution for LMs. Extensive experiments show that EA matches the performance of PEFT on diverse tasks without accessing LM parameters, and requires only minimal resources.
zh
[AI-8] Diversity Recommendation via Causal Deconfounding of Co-purchase Relations and Counterfactual Exposure
【速读】:该论文旨在解决推荐系统中因共现关系(co-occurrence)导致的物品流行度偏差(item popularity bias)和用户属性干扰问题,以及现有研究对多样性(diversity)缺乏因果视角与理论支撑的局限性。其解决方案的关键在于提出 Cadence 框架,通过构建去混淆(deconfounded)的非对称共购关系图(Unbiased Asymmetric Co-purchase Relationship, UACR),剔除物品流行度和用户属性的影响,从而提升嵌入质量;并利用 UACR 识别与用户已交互物品具有强因果关联但尚未接触的多样化物品类别,在高曝光模拟场景下增强推荐多样性,同时保持推荐准确性。
链接: https://arxiv.org/abs/2512.17733
作者: Jingmao Zhang,Zhiting Zhao,Yunqi Lin,Jianghong Ma,Tianjun Wei,Haijun Zhang,Xiaofeng Zhang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Beyond user-item modeling, item-to-item relationships are increasingly used to enhance recommendation. However, common methods largely rely on co-occurrence, making them prone to item popularity bias and user attributes, which degrades embedding quality and performance. Meanwhile, although diversity is acknowledged as a key aspect of recommendation quality, existing research offers limited attention to it, with a notable lack of causal perspectives and theoretical grounding. To address these challenges, we propose Cadence: Diversity Recommendation via Causal Deconfounding of Co-purchase Relations and Counterfactual Exposure - a plug-and-play framework built upon LightGCN as the backbone, primarily designed to enhance recommendation diversity while preserving accuracy. First, we compute the Unbiased Asymmetric Co-purchase Relationship (UACR) between items - excluding item popularity and user attributes - to construct a deconfounded directed item graph, with an aggregation mechanism to refine embeddings. Second, we leverage UACR to identify diverse categories of items that exhibit strong causal relevance to a user’s interacted items but have not yet been engaged with. We then simulate their behavior under high-exposure scenarios, thereby significantly enhancing recommendation diversity while preserving relevance. Extensive experiments on real-world datasets demonstrate that our method consistently outperforms state-of-the-art diversity models in both diversity and accuracy, and further validates its effectiveness, transferability, and efficiency over baselines.
zh
[AI-9] Digital and Web Forensics Model Cards V1
链接: https://arxiv.org/abs/2512.17722
作者: Paola Di Maio
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
[AI-10] You Only Train Once: Differentiable Subset Selection for Omics Data
【速读】:该论文旨在解决单细胞转录组数据中基因子集选择与预测任务之间耦合性弱的问题,即现有方法多为多阶段流水线或依赖事后特征重要性评估,导致选择的基因与最终预测性能缺乏紧密关联。其解决方案的关键在于提出一种端到端的框架YOTO(you only train once),通过可微分架构将基因子集选择与预测任务联合优化,在训练过程中形成“选择—预测—反馈”的闭环机制:预测任务直接引导基因选择,而所选基因又反过来塑造预测表示,从而实现两者协同迭代优化;同时,YOTO通过多任务学习设计在部分标注数据上共享表示,使基因子集具备跨任务泛化能力,并强制稀疏性以确保仅选中的基因参与推理,无需额外下游分类器,显著提升预测性能并获得紧凑且生物学意义明确的基因集合。
链接: https://arxiv.org/abs/2512.17678
作者: Daphné Chopard,Jorge da Silva Gonçalves,Irene Cannistraci,Thomas M. Sutter,Julia E. Vogt
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Selecting compact and informative gene subsets from single-cell transcriptomic data is essential for biomarker discovery, improving interpretability, and cost-effective profiling. However, most existing feature selection approaches either operate as multi-stage pipelines or rely on post hoc feature attribution, making selection and prediction weakly coupled. In this work, we present YOTO (you only train once), an end-to-end framework that jointly identifies discrete gene subsets and performs prediction within a single differentiable architecture. In our model, the prediction task directly guides which genes are selected, while the learned subsets, in turn, shape the predictive representation. This closed feedback loop enables the model to iteratively refine both what it selects and how it predicts during training. Unlike existing approaches, YOTO enforces sparsity so that only the selected genes contribute to inference, eliminating the need to train additional downstream classifiers. Through a multi-task learning design, the model learns shared representations across related objectives, allowing partially labeled datasets to inform one another, and discovering gene subsets that generalize across tasks without additional training steps. We evaluate YOTO on two representative single-cell RNA-seq datasets, showing that it consistently outperforms state-of-the-art baselines. These results demonstrate that sparse, end-to-end, multi-task gene subset selection improves predictive performance and yields compact and meaningful gene subsets, advancing biomarker discovery and single-cell analysis.
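The differentiable selection idea can be sketched with a relaxed Bernoulli (Gumbel-sigmoid) gate over genes trained jointly with a linear predictor; the gate parameterisation, sparsity weight, and toy data below are assumptions for illustration, not the authors' architecture.

```python
# Sketch: a relaxed binary mask over genes, learned end-to-end with the classifier
# and pushed toward sparsity so only selected genes contribute to inference.
import torch
import torch.nn.functional as F

n_genes, n_classes = 2000, 10
logits = torch.nn.Parameter(torch.zeros(n_genes))          # per-gene selection logits
predictor = torch.nn.Linear(n_genes, n_classes)
opt = torch.optim.Adam([logits, *predictor.parameters()], lr=1e-3)

def sample_mask(tau: float = 0.5) -> torch.Tensor:
    u = torch.rand(n_genes).clamp(1e-6, 1 - 1e-6)
    noise = torch.log(u) - torch.log(1 - u)                 # logistic noise
    return torch.sigmoid((logits + noise) / tau)            # relaxed binary gene mask

x = torch.randn(64, n_genes)                                # placeholder expression batch
y = torch.randint(0, n_classes, (64,))

for _ in range(100):
    opt.zero_grad()
    mask = sample_mask()
    loss = F.cross_entropy(predictor(x * mask), y)
    loss = loss + 1e-3 * torch.sigmoid(logits).sum()        # sparsity pressure on the expected mask
    loss.backward()
    opt.step()

print("genes kept by the gate:", int((torch.sigmoid(logits) > 0.5).sum()))
```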
zh
[AI-11] STAR: Semantic-Traffic Alignment and Retrieval for Zero-Shot HTTPS Website Fingerprinting
【速读】:该论文旨在解决加密HTTPS流量下的网站指纹攻击(Website Fingerprinting, WF)问题,尤其是现有方法依赖于针对特定网站的标注数据,导致难以扩展到未见过的网站。其核心解决方案是将WF重构为零样本跨模态检索任务,并提出STAR模型,通过双编码器架构学习加密流量痕迹与爬取时逻辑特征之间的联合嵌入空间;关键创新在于利用对比学习和一致性目标,在无需目标网站训练流量的情况下,实现对未知网站的高精度识别,同时引入轻量适配器模块仅需少量标注即可显著提升性能。
链接: https://arxiv.org/abs/2512.17667
作者: Yifei Cheng,Yujia Zhu,Baiyang Li,Xinhao Deng,Yitong Cai,Yaochen Ren,Qingyun Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注: Accepted by IEEE INFOCOM 2026. Camera-ready version
Abstract:Modern HTTPS mechanisms such as Encrypted Client Hello (ECH) and encrypted DNS improve privacy but remain vulnerable to website fingerprinting (WF) attacks, where adversaries infer visited sites from encrypted traffic patterns. Existing WF methods rely on supervised learning with site-specific labeled traces, which limits scalability and fails to handle previously unseen websites. We address these limitations by reformulating WF as a zero-shot cross-modal retrieval problem and introducing STAR. STAR learns a joint embedding space for encrypted traffic traces and crawl-time logic profiles using a dual-encoder architecture. Trained on 150K automatically collected traffic-logic pairs with contrastive and consistency objectives and structure-aware augmentation, STAR retrieves the most semantically aligned profile for a trace without requiring target-side traffic during training. Experiments on 1,600 unseen websites show that STAR achieves 87.9 percent top-1 accuracy and 0.963 AUC in open-world detection, outperforming supervised and few-shot baselines. Adding an adapter with only four labeled traces per site further boosts top-5 accuracy to 98.8 percent. Our analysis reveals intrinsic semantic-traffic alignment in modern web protocols, identifying semantic leakage as the dominant privacy risk in encrypted HTTPS traffic. We release STAR’s datasets and code to support reproducibility and future research.
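A compact sketch of the dual-encoder contrastive objective described above, assuming simple MLP encoders and a symmetric InfoNCE loss over matched traffic/profile pairs; feature dimensions and the temperature are illustrative.

```python
# Sketch: embed traffic traces and crawl-time logic profiles into a shared space
# and align matched pairs with a symmetric InfoNCE (contrastive) loss.
import torch
import torch.nn.functional as F

traffic_enc = torch.nn.Sequential(torch.nn.Linear(512, 256), torch.nn.ReLU(), torch.nn.Linear(256, 128))
profile_enc = torch.nn.Sequential(torch.nn.Linear(300, 256), torch.nn.ReLU(), torch.nn.Linear(256, 128))

def info_nce(traffic_feats, profile_feats, temperature=0.07):
    t = F.normalize(traffic_enc(traffic_feats), dim=-1)
    p = F.normalize(profile_enc(profile_feats), dim=-1)
    logits = t @ p.T / temperature                       # pairwise similarities
    labels = torch.arange(t.size(0))                     # matched pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

loss = info_nce(torch.randn(32, 512), torch.randn(32, 300))
loss.backward()
print(float(loss))
```

At retrieval time the same embeddings are reused: a trace is encoded once and matched against profile embeddings by cosine similarity, which is what enables the zero-shot setting.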
zh
[AI-12] About Time: Model-free Reinforcement Learning with Timed Reward Machines
【速读】:该论文旨在解决传统奖励机器(Reward Machine)无法建模精确时间约束的问题,从而限制了其在时间敏感型强化学习(Reinforcement Learning, RL)应用中的表达能力。解决方案的关键在于提出时序奖励机器(Timed Reward Machine, TRM),它通过将时间约束显式嵌入奖励结构中,支持对延迟惩罚和及时行为奖励的精细控制。作者设计了基于无模型RL框架(如表格Q-learning)的学习算法,利用时序自动机的抽象来整合TRM,并引入反事实想象(counterfactual-imagining)启发式策略以提升搜索效率。实验表明,该方法能够在主流RL基准上学习出既高奖励又满足时序约束的最优策略。
链接: https://arxiv.org/abs/2512.17637
作者: Anirban Majumdar,Ritam Raha,Rajarshi Roy,David Parker,Marta Kwiatkowska
机构: 未知
类目: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Logic in Computer Science (cs.LO)
备注:
Abstract:Reward specification plays a central role in reinforcement learning (RL), guiding the agent’s behavior. To express non-Markovian rewards, formalisms such as reward machines have been introduced to capture dependencies on histories. However, traditional reward machines lack the ability to model precise timing constraints, limiting their use in time-sensitive applications. In this paper, we propose timed reward machines (TRMs), which are an extension of reward machines that incorporate timing constraints into the reward structure. TRMs enable more expressive specifications with tunable reward logic, for example, imposing costs for delays and granting rewards for timely actions. We study model-free RL frameworks (i.e., tabular Q-learning) for learning optimal policies with TRMs under digital and real-time semantics. Our algorithms integrate the TRM into learning via abstractions of timed automata, and employ counterfactual-imagining heuristics that exploit the structure of the TRM to improve the search. Experimentally, we demonstrate that our algorithm learns policies that achieve high rewards while satisfying the timing constraints specified by the TRM on popular RL benchmarks. Moreover, we conduct comparative studies of performance under different TRM semantics, along with ablations that highlight the benefits of counterfactual-imagining.
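One way to picture model-free learning with a timed reward machine is tabular Q-learning over the product of environment state, machine state, and a discretised clock, as in the toy sketch below; the environment, TRM transition function, and reward values are invented placeholders, not the paper's benchmarks.

```python
# Sketch: Q-learning on the product of env states and TRM configurations.
import numpy as np

n_s, n_a, n_q, n_clock = 25, 4, 3, 10          # env states, actions, TRM states, clock values
Q = np.zeros((n_s, n_q, n_clock, n_a))
rng = np.random.default_rng(0)
alpha, gamma, eps = 0.1, 0.95, 0.1

def env_step(s, a):                             # toy deterministic transition
    return (s + a) % n_s

def trm_step(q, clock, s):                      # toy TRM: reward reaching state 0 within 5 ticks
    clock = min(clock + 1, n_clock - 1)
    if q == 0 and s == 0:
        return 1, clock, (10.0 if clock <= 5 else -1.0)
    return q, clock, 0.0

for _ in range(20000):
    s, q, c = int(rng.integers(n_s)), 0, 0
    for _ in range(30):
        a = int(rng.integers(n_a)) if rng.random() < eps else int(Q[s, q, c].argmax())
        s2 = env_step(s, a)
        q2, c2, r = trm_step(q, c, s2)
        Q[s, q, c, a] += alpha * (r + gamma * Q[s2, q2, c2].max() - Q[s, q, c, a])
        s, q, c = s2, q2, c2

print("greedy value at start:", Q[0, 0, 0].max())
```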
zh
[AI-13] Trust-Region Adaptive Policy Optimization
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)复杂推理能力提升中,主流两阶段训练范式(即先监督微调 SFT 再强化学习 RL)存在的关键不一致性问题:SFT 通过严格模仿专家行为抑制了模型的探索能力,并引发灾难性遗忘,从而限制了 RL 的优化潜力。其解决方案的核心在于提出 TRAPO(Trust-Region Adaptive Policy Optimization)框架,该框架通过在每个训练实例内交错执行 SFT 和 RL,具体表现为对专家前缀优化 SFT 损失、对模型自生成补全优化 RL 损失,实现了外部监督与自我探索的统一;同时引入信任区域监督微调(TrSFT),通过最小化前向 KL 散度并抑制域外优化,有效转向反向 KL 更新,实现稳定且模式聚焦的参数调整,辅以自适应前缀选择机制动态分配专家指导,显著提升了数学推理任务上的性能表现。
链接: https://arxiv.org/abs/2512.17636
作者: Mingyu Su,Jian Guan,Yuxian Gu,Minlie Huang,Hongning Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Post-training methods, especially Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), play an important role in improving large language models’ (LLMs) complex reasoning abilities. However, the dominant two-stage pipeline (SFT then RL) suffers from a key inconsistency: SFT enforces rigid imitation that suppresses exploration and induces forgetting, limiting RL’s potential for improvements. We address this inefficiency with TRAPO (Trust-Region Adaptive Policy Optimization), a hybrid framework that interleaves SFT and RL within each training instance by optimizing SFT loss on expert prefixes and RL loss on the model’s own completions, unifying external supervision and self-exploration. To stabilize training, we introduce Trust-Region SFT (TrSFT), which minimizes forward KL divergence inside a trust region but attenuates optimization outside, effectively shifting toward reverse KL and yielding stable, mode-seeking updates favorable for RL. An adaptive prefix-selection mechanism further allocates expert guidance based on measured utility. Experiments on five mathematical reasoning benchmarks show that TRAPO consistently surpasses standard SFT, RL, and SFT-then-RL pipelines, as well as recent state-of-the-art approaches, establishing a strong new paradigm for reasoning-enhanced LLMs.
zh
[AI-14] SCOPE: Sequential Causal Optimization of Process Interventions
【速读】:该论文旨在解决生成式 AI (Generative AI) 在业务流程监控中面临的多干预协同优化问题,即如何在真实场景下对一系列干预措施进行联合决策,以最大化关键绩效指标(KPI)。现有方法要么仅考虑单一干预,要么独立处理多个干预而忽略其时序依赖性,导致效果受限。解决方案的关键在于提出SCOPE框架,它采用反向归纳(backward induction)策略,从最终决策点向前逐层估计每个候选干预动作的因果效应,并通过因果学习直接利用观测数据,无需构建过程近似或依赖强化学习训练,从而避免现实差距与偏差,实现对干预序列的精准对齐与优化。
链接: https://arxiv.org/abs/2512.17629
作者: Jakob De Moor,Hans Weytjens,Johannes De Smedt,Jochen De Weerdt
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Prescriptive Process Monitoring (PresPM) recommends interventions during business processes to optimize key performance indicators (KPIs). In realistic settings, interventions are rarely isolated: organizations need to align sequences of interventions to jointly steer the outcome of a case. Existing PresPM approaches fall short in this respect. Many focus on a single intervention decision, while others treat multiple interventions independently, ignoring how they interact over time. Methods that do address these dependencies depend either on simulation or data augmentation to approximate the process to train a Reinforcement Learning (RL) agent, which can create a reality gap and introduce bias. We introduce SCOPE, a PresPM approach that learns aligned sequential intervention recommendations. SCOPE employs backward induction to estimate the effect of each candidate intervention action, propagating its impact from the final decision point back to the first. By leveraging causal learners, our method can utilize observational data directly, unlike methods that require constructing process approximations for reinforcement learning. Experiments on both an existing synthetic dataset and a new semi-synthetic dataset show that SCOPE consistently outperforms state-of-the-art PresPM techniques in optimizing the KPI. The novel semi-synthetic setup, based on a real-life event log, is provided as a reusable benchmark for future work on sequential PresPM.
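The backward-induction idea can be sketched with per-action outcome regressors fitted at each decision point from last to first, propagating the estimated value of acting optimally; the T-learner-style estimator and the toy event data below are assumptions rather than the paper's exact causal learners or event-log schema.

```python
# Sketch: backward induction over decision points with per-action outcome models.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n, d, T = 2000, 5, 3                                   # cases, features, decision points
X = rng.normal(size=(n, T, d))                         # case state at each decision point
A = rng.integers(0, 2, size=(n, T))                    # observed interventions (0/1)
y = (X[:, :, 0] * A).sum(axis=1) + rng.normal(scale=0.1, size=n)   # toy final KPI

value = y.copy()                                        # value-to-go, initialised with outcomes
policy = []
for t in reversed(range(T)):
    models = {}
    for a in (0, 1):                                    # one regressor per candidate action
        idx = A[:, t] == a
        models[a] = GradientBoostingRegressor().fit(X[idx, t], value[idx])
    preds = np.stack([models[a].predict(X[:, t]) for a in (0, 1)], axis=1)
    policy.append(preds.argmax(axis=1))                 # recommended action at step t
    value = preds.max(axis=1)                           # propagate the optimal value backwards

print("estimated value of aligned interventions:", value.mean())
print("treatment rate recommended at the first decision point:", policy[-1].mean())
```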
zh
[AI-15] More Consistent Accuracy PINN via Alternating Easy-Hard Training
【速读】:该论文旨在解决物理信息神经网络(Physics-informed Neural Networks, PINNs)在求解偏微分方程(Partial Differential Equations, PDEs)时训练策略不成熟、性能不稳定的问题。现有方法中,基于有限元思想的硬优先级(hard prioritization)与易优先级(easy prioritization)策略均存在明显权衡且对不同类型的PDE表现不一致。论文提出一种混合训练策略,其关键在于通过交替训练算法融合硬优先级和易优先级的优势,在具有陡峭梯度、非线性及高维特征的PDE上实现稳定且高精度的求解,相对L2误差普遍达到O(10⁻⁵)至O(10⁻⁶)量级,显著优于基线方法,并提升了模型在多样化问题上的鲁棒性。
链接: https://arxiv.org/abs/2512.17607
作者: Zhaoqian Gao,Min Yanga
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Physics-informed neural networks (PINNs) have recently emerged as a prominent paradigm for solving partial differential equations (PDEs), yet their training strategies remain underexplored. While hard prioritization methods inspired by finite element methods are widely adopted, recent research suggests that easy prioritization can also be effective. Nevertheless, we find that both approaches exhibit notable trade-offs and inconsistent performance across PDE types. To address this issue, we develop a hybrid strategy that combines the strengths of hard and easy prioritization through an alternating training algorithm. On PDEs with steep gradients, nonlinearity, and high dimensionality, the proposed method achieves consistently high accuracy, with relative L2 errors mostly in the range of O(10^-5) to O(10^-6), significantly surpassing baseline methods. Moreover, it offers greater reliability across diverse problems, whereas compared approaches often suffer from variable accuracy depending on the PDE. This work provides new insights into designing hybrid training strategies to enhance the performance and robustness of PINNs.
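A toy sketch of the alternating easy-hard schedule on a one-dimensional PINN: odd phases up-weight high-residual (hard) collocation points, even phases up-weight low-residual (easy) ones. The PDE, network size, and switching period are illustrative choices, not the paper's setup.

```python
# Sketch: alternating easy/hard weighting of collocation points for a toy PINN
# solving u''(x) = -sin(x) with u(0) = u(pi) = 0 (exact solution u = sin(x)).
import math
import torch

net = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
x = torch.linspace(0, math.pi, 200).reshape(-1, 1)
bc = torch.tensor([[0.0], [math.pi]])                  # boundary points

for epoch in range(2000):
    xr = x.clone().requires_grad_(True)
    u = net(xr)
    du = torch.autograd.grad(u.sum(), xr, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), xr, create_graph=True)[0]
    residual = (d2u + torch.sin(xr)) ** 2
    hard_phase = (epoch // 100) % 2 == 0               # alternate every 100 epochs
    w = residual.detach()
    w = w if hard_phase else 1.0 / (w + 1e-6)          # hard vs easy prioritisation
    w = w / w.mean()
    loss = (w * residual).mean() + (net(bc) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print("max |u - sin(x)|:", float((net(x) - torch.sin(x)).abs().max()))
```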
zh
[AI-16] GreedySnake: Accelerating SSD-Offloaded LLM Training with Efficient Scheduling and Optimizer Step Overlapping
【速读】:该论文旨在解决大规模语言模型(Large Language Model, LLM)训练中因显存限制导致的高成本问题,尤其针对基于SSD(固态硬盘)的训练卸载(SSD-offloaded training)方案中存在的低效I/O瓶颈和调度不合理问题。其核心解决方案是提出GreedySnake系统,采用垂直调度(vertical scheduling)策略——即在进入下一层之前完成当前层的所有微批次(micro-batch)计算,从而更高效地利用GPU计算资源并逼近roofline模型预测的理想性能上限;此外,通过将优化步骤(optimization step)的部分操作与下一迭代的前向传播重叠执行,进一步缓解I/O延迟带来的性能瓶颈。实验表明,GreedySnake在A100 GPU上相较ZeRO-Infinity显著提升了训练吞吐量,最高达2.53倍。
链接: https://arxiv.org/abs/2512.17570
作者: Yikang Yue,Yishu Yin,Xuehai Qian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注:
Abstract:SSD-offloaded training offers a practical and promising approach to making LLM training cost-effective. Building on gradient accumulation with micro-batches, this paper introduces GreedySnake, a new SSD-offloaded training system that employs vertical scheduling, which executes all microbatches of a layer before proceeding to the next. Compared to existing systems that use horizontal scheduling (i.e., executing micro-batches sequentially), GreedySnake achieves higher training throughput with smaller batch sizes, bringing the system much closer to the ideal scenario predicted by the roofline model. To further mitigate the I/O bottleneck, GreedySnake overlaps part of the optimization step with the forward pass of the next iteration. Experimental results on A100 GPUs show that GreedySnake achieves saturated training throughput improvements over ZeRO-Infinity: 1.96x on 1 GPU and 1.93x on 4 GPUs for GPT-65B, and 2.53x on 1 GPU for GPT-175B. The code is open-sourced at this https URL
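The difference between the two schedules can be seen in loop order alone, as the toy sketch below illustrates with placeholder fetch/forward callbacks; it only counts weight fetches and ignores everything else GreedySnake does (overlapping the optimizer step, backward passes, caching, and so on).

```python
# Sketch: horizontal vs vertical micro-batch scheduling for SSD-offloaded training.
def horizontal(layers, micro_batches, fetch, forward):
    for mb in micro_batches:                    # baseline ordering
        for layer in layers:
            fetch(layer)                        # layer weights loaded for every micro-batch
            forward(layer, mb)

def vertical(layers, micro_batches, fetch, forward):
    for layer in layers:                        # GreedySnake-style ordering
        fetch(layer)                            # layer weights loaded once per iteration
        for mb in micro_batches:
            forward(layer, mb)

fetches = {"horizontal": 0, "vertical": 0}
horizontal(range(80), range(8),
           lambda layer: fetches.__setitem__("horizontal", fetches["horizontal"] + 1),
           lambda layer, mb: None)
vertical(range(80), range(8),
         lambda layer: fetches.__setitem__("vertical", fetches["vertical"] + 1),
         lambda layer, mb: None)
print(fetches)                                  # {'horizontal': 640, 'vertical': 80}
```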
zh
[AI-17] When De-noising Hurts: A Systematic Study of Speech Enhancement Effects on Modern Medical ASR Systems
【速读】:该论文试图解决的问题是:传统语音增强(Speech Enhancement)技术是否仍能有效提升现代大规模自动语音识别(ASR)模型在噪声环境下的性能。以往研究普遍认为语音增强可改善ASR表现,但本文指出,对于已在多样化、含噪数据上训练的先进ASR模型(如OpenAI Whisper、NVIDIA Parakeet等),这种预处理可能反而有害。其解决方案的关键在于通过系统性实验验证:在500段医疗语音和9种噪声条件下,使用MetricGAN-plus-voicebank进行语音增强后,所有4个ASR模型的语义词错误率(semWER)均上升,绝对增益达1.1%–46.6%,表明现代ASR模型具备足够的内部噪声鲁棒性,而传统语音增强可能破坏对ASR至关重要的声学特征。
链接: https://arxiv.org/abs/2512.17562
作者: Sujal Chondhekar,Vasanth Murukuri,Rushabh Vasani,Sanika Goyal,Rajshree Badami,Anushree Rana,Sanjana SN,Karthik Pandia,Sulabh Katiyar,Neha Jagadeesh,Sankalp Gulati
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: Technical Report
Abstract:Speech enhancement methods are commonly believed to improve the performance of automatic speech recognition (ASR) in noisy environments. However, the effectiveness of these techniques cannot be taken for granted in the case of modern large-scale ASR models trained on diverse, noisy data. We present a systematic evaluation of MetricGAN-plus-voicebank denoising on four state-of-the-art ASR systems: OpenAI Whisper, NVIDIA Parakeet, Google Gemini Flash 2.0, Parrotlet-a using 500 medical speech recordings under nine noise conditions. ASR performance is measured using semantic WER (semWER), a normalized word error rate (WER) metric accounting for domain-specific normalizations. Our results reveal a counterintuitive finding: speech enhancement preprocessing degrades ASR performance across all noise conditions and models. Original noisy audio achieves lower semWER than enhanced audio in all 40 tested configurations (4 models x 10 conditions), with degradations ranging from 1.1% to 46.6% absolute semWER increase. These findings suggest that modern ASR models possess sufficient internal noise robustness and that traditional speech enhancement may remove acoustic features critical for ASR. For practitioners deploying medical scribe systems in noisy clinical environments, our results indicate that preprocessing audio with noise reduction techniques might not just be computationally wasteful but also be potentially harmful to the transcription accuracy.
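A sketch of the core evaluation loop, comparing word error rate with and without a denoising front-end. Here `transcribe` and `denoise` are placeholders for the ASR system and the MetricGAN enhancement, and the paper's domain-specific semWER normalisation is omitted, so the sketch falls back to plain WER from the jiwer package.

```python
# Sketch: compare ASR error rates on raw vs enhanced audio.
import jiwer

def compare(recordings, references, transcribe, denoise):
    raw_hyp = [transcribe(audio) for audio in recordings]
    enh_hyp = [transcribe(denoise(audio)) for audio in recordings]
    return jiwer.wer(references, raw_hyp), jiwer.wer(references, enh_hyp)

# Toy usage with stand-ins for the audio pipeline:
refs = ["patient reports mild chest pain", "administer ten milligrams"]
recs = ["<noisy audio 1>", "<noisy audio 2>"]
raw_wer, enh_wer = compare(
    recs, refs,
    transcribe=lambda a: "patient reports mild chest pain" if "1" in a else "administer ten milligrams",
    denoise=lambda a: a,
)
print("WER without enhancement:", raw_wer, "| with enhancement:", enh_wer)
```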
zh
[AI-18] Towards Explainable Conversational AI for Early Diagnosis with Large Language Models
【速读】:该论文旨在解决当前医疗系统中诊断效率低、成本高及专科医生资源有限等问题,这些问题常导致治疗延迟和不良健康结局。现有AI与深度学习诊断系统普遍存在交互性差、透明度不足的缺陷,难以在以患者为中心的实际场景中有效应用。其解决方案的关键在于构建一个基于大型语言模型(Large Language Model, LLM)的诊断对话机器人,采用GPT-4o、检索增强生成(Retrieval-Augmented Generation)和可解释人工智能(Explainable AI)技术,通过动态对话提取并标准化症状,利用相似性匹配与自适应提问机制优先排序潜在诊断,并借助思维链提示(Chain-of-Thought prompting)提升诊断推理过程的透明度。实验表明,该系统在准确率(90%)和Top-3准确率(100%)上显著优于传统机器学习模型(如朴素贝叶斯、逻辑回归、支持向量机、随机森林和K近邻),为实现更透明、交互性强且临床相关的AI辅助诊断提供了可行路径。
链接: https://arxiv.org/abs/2512.17559
作者: Maliha Tabassum,M Shamim Kaiser
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Healthcare systems around the world are grappling with issues like inefficient diagnostics, rising costs, and limited access to specialists. These problems often lead to delays in treatment and poor health outcomes. Most current AI and deep learning diagnostic systems are not very interactive or transparent, making them less effective in real-world, patient-centered environments. This research introduces a diagnostic chatbot powered by a Large Language Model (LLM), using GPT-4o, Retrieval-Augmented Generation, and explainable AI techniques. The chatbot engages patients in a dynamic conversation, helping to extract and normalize symptoms while prioritizing potential diagnoses through similarity matching and adaptive questioning. With Chain-of-Thought prompting, the system also offers more transparent reasoning behind its diagnoses. When tested against traditional machine learning models like Naive Bayes, Logistic Regression, SVM, Random Forest, and KNN, the LLM-based system delivered impressive results, achieving an accuracy of 90% and Top-3 accuracy of 100%. These findings offer a promising outlook for more transparent, interactive, and clinically relevant AI in healthcare.
zh
[AI-19] SafeBench-Seq: A Homology-Clustered CPU-Only Baseline for Protein Hazard Screening with Physicochemical/Composition Features and Cluster-Aware Confidence Intervals
【速读】:该论文旨在解决生成式蛋白质设计模型可能带来的生物安全风险,特别是缺乏一种简单、可复现且在同源性控制下进行序列级危害筛查的基准方法的问题。其解决方案的关键在于构建SafeBench-Seq——一个仅基于公共数据(SafeProtein危害序列与UniProt良性序列)和可解释特征(全局理化描述符与氨基酸组成)的元数据驱动基准框架,并通过40%序列一致性水平的同源聚类实现簇级留出验证(无训练集与测试集的簇重叠),从而更真实地模拟“从未见过”的威胁场景。该方案还引入校准分类器(CalibratedClassifierCV)以提升概率预测质量,并通过Brier分数、期望校准误差(Expected Calibration Error, ECE)等指标量化模型校准性能,同时通过保留组成不变的残基置换和长度/组成消融实验评估模型对捷径学习(shortcut susceptibility)的敏感性,最终实现CPU端运行、完全可复现且不共享潜在危险序列的评估体系。
链接: https://arxiv.org/abs/2512.17527
作者: Muhammad Haris Khan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Foundation models for protein design raise concrete biosecurity risks, yet the community lacks a simple, reproducible baseline for sequence-level hazard screening that is explicitly evaluated under homology control and runs on commodity CPUs. We introduce SafeBench-Seq, a metadata-only, reproducible benchmark and baseline classifier built entirely from public data (SafeProtein hazards and UniProt benigns) and interpretable features (global physicochemical descriptors and amino-acid composition). To approximate “never-before-seen” threats, we homology-cluster the combined dataset at 40% identity and perform cluster-level holdouts (no cluster overlap between train/test). We report discrimination (AUROC/AUPRC) and screening-operating points (TPR@1% FPR; FPR@95% TPR) with 95% bootstrap confidence intervals (n=200), and we provide calibrated probabilities via CalibratedClassifierCV (isotonic for Logistic Regression / Random Forest; Platt sigmoid for Linear SVM). We quantify probability quality using Brier score, Expected Calibration Error (ECE; 15 bins), and reliability diagrams. Shortcut susceptibility is probed via composition-preserving residue shuffles and length-/composition-only ablations. Empirically, random splits substantially overestimate robustness relative to homology-clustered evaluation; calibrated linear models exhibit comparatively good calibration, while tree ensembles retain slightly higher Brier/ECE. SafeBench-Seq is CPU-only, reproducible, and releases metadata only (accessions, cluster IDs, split labels), enabling rigorous evaluation without distributing hazardous sequences.
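The interpretable-feature baseline is easy to sketch: amino-acid composition plus a couple of crude global descriptors, fed to an isotonic-calibrated logistic regression via scikit-learn's CalibratedClassifierCV. The sequences, labels, and descriptor choices below are toy assumptions, not the benchmark's exact feature set.

```python
# Sketch: composition/physicochemical features + a calibrated linear classifier.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def featurize(seq: str) -> np.ndarray:
    comp = np.array([seq.count(a) for a in AMINO_ACIDS], dtype=float) / len(seq)
    hydrophobic = sum(seq.count(a) for a in "AVILMFWY") / len(seq)   # crude global descriptor
    return np.concatenate([comp, [len(seq) / 1000.0, hydrophobic]])

seqs = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
        "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTK"]
X = np.vstack([featurize(s) for s in seqs * 50])          # toy benign/hazard stand-ins
y = np.array([0, 1] * 50)

clf = CalibratedClassifierCV(LogisticRegression(max_iter=1000), method="isotonic", cv=5)
clf.fit(X, y)
print(clf.predict_proba(X[:2]))                            # calibrated class probabilities
```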
zh
[AI-20] Key-Conditioned Orthonormal Transform Gating (K-OTG): Multi-Key Access Control with Hidden-State Scrambling for LoRA-Tuned Models
【速读】:该论文旨在解决生成式 AI(Generative AI)模型在部署过程中未经授权访问与滥用的问题,即如何在不显著影响授权用户使用体验的前提下实现细粒度的密钥驱动访问控制。其解决方案的关键在于提出一种轻量级、参数高效微调(PEFT)兼容的机制——K-OTG(Key-Orthogonal Transform Guard),该机制通过双路径训练策略:授权样本以角色密钥前缀引导模型学习正常输出,未授权样本则被训练为输出可见的“BLOCK”标记;在推理阶段,利用预 lm_head 钩子对隐藏状态施加正交变换——仅当输入正确密钥时,逆变换可恢复模型原生表示,否则引入会话临时扰动(排列、符号翻转、Householder 变换)使 logits 无信息性并触发 BLOCK 短路行为。此方法无需将密钥作为特殊 token 添加,且能与 LoRA 在 4-bit 基础模型上无缝集成,实验证明其在保持授权性能接近基线的同时,彻底抑制未授权使用,具备高选择性、nonce 不变性和低运行时开销(约 40% token/s 性能损失)。
链接: https://arxiv.org/abs/2512.17519
作者: Muhammad Haris Khan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:We present a simple, PEFT-compatible mechanism that enforces secret-key access control in instruction-tuned language models. K-OTG trains on a dual-path corpus: authorized examples (prefixed with a role key) learn the task output, while unauthorized examples learn a visible block token. At inference, a pre-lm_head hook applies an orthonormal transform to the hidden state: with the correct key/role the inverse map restores the model’s native basis; otherwise a session-ephemeral scrambler (permutation, sign flips, Householders) makes logits uninformative and the system short-circuits to BLOCK. Keys are not added as special tokens, and the method composes cleanly with LoRA on 4-bit bases. We evaluate an hour-scale protocol on 1-3B-class instruction models (Llama 3.2, Qwen2.5 1.5B) across utility (XSum ROUGE/BLEU, GSM8K accuracy, WikiText-2 perplexity), selectivity (3x3 role-key unlock matrices), nonce invariance, block suppression, and throughput. Authorized utility remains close to the base on summarization with the expected modest PPL increase from instruction tuning; unauthorized utility collapses (near-zero sequence metrics with exploding PPL), indicating practical unusability without the key. Unlock matrices are diagonally dominant (high on-target unlock, low cross-unlock), authorized block emission is 0 out of N via robust bad-word lists, and greedy outputs match exactly across nonces, confirming correct inverse cancellation. The runtime overhead of the Python-level hook is about a 40% reduction in tokens per second versus the base. K-OTG therefore provides a pragmatic, model-agnostic way to prevent unauthorized use while preserving authorized utility.
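A small sketch of the scrambling mechanism itself: a key seeds a permutation, sign flips, and a Householder reflection (all orthonormal), and only the matching key reproduces the exact inverse. Key management, role handling, and the pre-lm_head hook wiring are deliberately simplified away.

```python
# Sketch: key-conditioned orthonormal scrambling of hidden states and its inverse.
import torch
import torch.nn.functional as F

def make_transform(key: int, dim: int):
    g = torch.Generator().manual_seed(key)               # the key seeds the session transform
    perm = torch.randperm(dim, generator=g)
    signs = (torch.rand(dim, generator=g) < 0.5).float() * 2 - 1   # random +-1 flips
    v = F.normalize(torch.randn(dim, generator=g), dim=0)
    H = torch.eye(dim) - 2.0 * torch.outer(v, v)          # Householder reflection (orthonormal)

    def scramble(h):                                      # h: (..., dim) hidden states
        return (h[..., perm] * signs) @ H

    def unscramble(h):                                    # exact inverse, available only with the key
        h = (h @ H) / signs                               # H is its own inverse; signs are +-1
        inv = torch.empty_like(perm)
        inv[perm] = torch.arange(dim)
        return h[..., inv]

    return scramble, unscramble

dim = 64
h = torch.randn(2, dim)                                   # stand-in pre-lm_head hidden states
scramble, unscramble = make_transform(key=12345, dim=dim)
print(torch.allclose(unscramble(scramble(h)), h, atol=1e-5))       # True: right key restores the basis
_, wrong_unscramble = make_transform(key=99999, dim=dim)
print(torch.allclose(wrong_unscramble(scramble(h)), h, atol=1e-5)) # False: wrong key stays scrambled
```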
zh
[AI-21] Translating the Rashomon Effect to Sequential Decision-Making Tasks
【速读】:该论文试图解决的问题是:在顺序决策任务中是否存在类似分类任务中的Rashomon效应,即多个策略在行为上完全一致(访问相同状态并执行相同动作),但内部结构(如特征归因)存在差异的现象。传统方法难以验证此类一致性,因为顺序决策具有随机性,单条轨迹无法可靠比较策略性能。解决方案的关键在于引入形式化验证方法,通过构建和对比每个策略在环境中的完整概率行为来识别Rashomon效应;实验表明,基于Rashomon集构造的集成策略对分布偏移更具鲁棒性,且从中提取的宽松策略可在保持最优性能的同时显著降低验证计算开销。
链接: https://arxiv.org/abs/2512.17470
作者: Dennis Gross,Jørn Eirik Betten,Helge Spieker
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The Rashomon effect describes the phenomenon where multiple models trained on the same data produce identical predictions while differing in which features they rely on internally. This effect has been studied extensively in classification tasks, but not in sequential decision-making, where an agent learns a policy to achieve an objective by taking actions in an environment. In this paper, we translate the Rashomon effect to sequential decision-making. We define it as multiple policies that exhibit identical behavior, visiting the same states and selecting the same actions, while differing in their internal structure, such as feature attributions. Verifying identical behavior in sequential decision-making differs from classification. In classification, predictions can be directly compared to ground-truth labels. In sequential decision-making with stochastic transitions, the same policy may succeed or fail on any single trajectory due to randomness. We address this using formal verification methods that construct and compare the complete probabilistic behavior of each policy in the environment. Our experiments demonstrate that the Rashomon effect exists in sequential decision-making. We further show that ensembles constructed from the Rashomon set exhibit greater robustness to distribution shifts than individual policies. Additionally, permissive policies derived from the Rashomon set reduce computational requirements for verification while maintaining optimal performance.
zh
[AI-22] Behavioural Effects of Agentic Messaging: A Case Study on a Financial Service Application ECIR'26
【速读】:该论文旨在解决金融类应用中客户沟通系统因缺乏动态适应性而导致的用户参与度低和留存率差的问题。传统基于规则的营销策略(business-as-usual, BAU)难以根据用户个体行为实时调整信息推送,从而影响转化效率与用户留存。解决方案的关键在于引入代理式(agentic)个性化通信机制,通过在2025年全国纳税申报期间开展为期两个月的随机对照试验,使系统能够基于用户层级决策进行自适应交互,从而显著降低退订率(减少21%)并促进提前申报行为,实现短期行为优化与长期留存指标的协同提升。
链接: https://arxiv.org/abs/2512.17462
作者: Olivier Jeunen,Schaun Wheeler
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: To appear in the 48th European Conference on Information Retrieval (ECIR '26) Industry Track
Abstract:Marketing and product personalisation provide a prominent and visible use-case for the application of Information Retrieval methods across several business domains. Recently, agentic approaches to these problems have been gaining traction. This work evaluates the behavioural and retention effects of agentic personalisation on a financial service application’s customer communication system during a 2025 national tax filing period. Through a two month-long randomised controlled trial, we compare an agentic messaging approach against a business-as-usual (BAU) rule-based campaign system, focusing on two primary outcomes: unsubscribe behaviour and conversion timing. Empirical results show that agent-led messaging reduced unsubscribe events by 21% (±0.01) relative to BAU and increased early filing behaviour in the weeks preceding the national deadline. These findings demonstrate how adaptive, user-level decision-making systems can modulate engagement intensity whilst improving long-term retention indicators.
zh
[AI-23] Fair Voting Methods as a Catalyst for Democratic Resilience: A Trilogy on Legitimacy Impact and AI Safeguarding
【速读】:该论文旨在解决当前民主实践中因投票机制不公而导致的代表性不足、公民参与度低以及对新兴技术(如生成式 AI)风险应对能力弱等问题。其核心解决方案在于推广使用公平投票方法(fair voting methods),关键在于将表达性选票形式(如累积投票)与促进比例代表制的计票算法(如等额份额法)相结合,从而提升选举结果的合法性、扩大公民地理和议题上的代表性,并增强民主价值观如利他主义与妥协精神。此类方法不仅使非胜出者也认可其公平性,还能激发更具成本效益的政策提案(尤其在福利、教育和文化领域),同时具备更强的抗生成式 AI 偏差与不一致性能力,为处于危机中的民主制度(如希腊)提供通过更高公民参与度实现韧性重建的新路径。
链接: https://arxiv.org/abs/2512.17461
作者: Evangelos Pournaras
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:This article shows how fair voting methods can be a catalyst for change in the way we make collective decisions, and how such change can promote long-awaited upgrades of democracy. Based on real-world evidence from democratic innovations in participatory budgeting, in Switzerland and beyond, I highlight a trilogy of key research results: Fair voting methods achieve to be (i) legitimacy incubator, (ii) novel impact accelerator and (iii) safeguard for risks of artificial intelligence (AI). Compared to majoritarian voting methods, combining expressive ballot formats (e.g. cumulative voting) with ballot aggregation methods that promote proportional representation (e.g. equal shares) results in more winners and higher (geographical) representation of citizens. Such fair voting methods are preferred and found fairer even by voters who do not win, while promoting stronger democratic values for citizens such as altruism and compromise. They also result in new resourceful ideas to put for voting, which are cost-effective and win, especially in areas of welfare, education and culture. Strikingly, fair voting methods are also more resilient to biases and inconsistencies of generative AI in emerging scenarios of AI voting assistance or AI representation of voters who would be likely to abstain. I also review the relevance of such upgrades for democracies in crisis, such as the one of Greece featured in the recent study of `Unmute Democracy’. Greek democracy can build stronger resilience via higher representation of citizens in democratic processes as well as democratic innovations in participation. Fair voting methods can be a catalyst for both endeavors.
zh
[AI-24] A lightweight Spatial-Temporal Graph Neural Network for Long-term Time Series Forecasting
【速读】:该论文旨在解决长期多变量时间序列预测中模型复杂度高、训练效率低且缺乏可解释性的问题。其核心解决方案是提出Lite-STGNN,一种轻量级时空图神经网络,关键在于将基于分解的时序建模与可学习稀疏图结构相结合:时序模块采用趋势-季节性分解以捕捉复杂动态,空间模块通过低秩Top-K邻接矩阵学习和保守的分层门控机制实现高效的图消息传递,从而在保持参数高效的同时显著提升预测精度,并揭示领域特定的交互模式。
链接: https://arxiv.org/abs/2512.17453
作者: Henok Tenaw Moges,Deshendran Moodley
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures, 2 tables. Accepted for presentation at the 18th International Conference on Agents and Artificial Intelligence (ICAART 2026), Marbella, Spain
Abstract:We propose Lite-STGNN, a lightweight spatial-temporal graph neural network for long-term multivariate forecasting that integrates decomposition-based temporal modeling with learnable sparse graph structure. The temporal module applies trend-seasonal decomposition, while the spatial module performs message passing with low-rank Top-K adjacency learning and conservative horizon-wise gating, enabling spatial corrections that enhance a strong linear baseline. Lite-STGNN achieves state-of-the-art accuracy on four benchmark datasets for horizons up to 720 steps, while being parameter-efficient and substantially faster to train than transformer-based methods. Ablation studies show that the spatial module yields 4.6% improvement over the temporal baseline, Top-K enhances locality by 3.3%, and learned adjacency matrices reveal domain-specific interaction dynamics. Lite-STGNN thus offers a compact, interpretable, and efficient framework for long-term multivariate time series forecasting.
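The two named ingredients can be sketched directly: moving-average trend/seasonal decomposition, and a learnable low-rank adjacency sparsified with Top-K before message passing. The kernel size, rank, and K are illustrative, and the sketch omits the horizon-wise gating and the forecasting head.

```python
# Sketch: trend-seasonal decomposition + learnable Top-K sparse adjacency.
import torch
import torch.nn.functional as F

def decompose(x, kernel=25):                    # x: (batch, time, nodes)
    pad = kernel // 2
    xt = x.transpose(1, 2)                      # (batch, nodes, time) for 1-D pooling
    trend = F.avg_pool1d(F.pad(xt, (pad, pad), mode="replicate"), kernel, stride=1)
    trend = trend.transpose(1, 2)
    return trend, x - trend                     # trend and seasonal components

class TopKGraph(torch.nn.Module):
    def __init__(self, n_nodes, rank=8, k=5):
        super().__init__()
        self.E1 = torch.nn.Parameter(torch.randn(n_nodes, rank))   # low-rank node embeddings
        self.E2 = torch.nn.Parameter(torch.randn(n_nodes, rank))
        self.k = k

    def forward(self, h):                       # h: (batch, nodes, features)
        scores = self.E1 @ self.E2.T            # dense candidate adjacency
        topk = scores.topk(self.k, dim=-1)
        masked = torch.full_like(scores, float("-inf")).scatter(-1, topk.indices, topk.values)
        adj = torch.softmax(masked, dim=-1)     # row-normalised, only Top-K entries survive
        return adj @ h                          # one round of message passing

x = torch.randn(4, 96, 20)                      # 20 series, 96 time steps
trend, seasonal = decompose(x)
out = TopKGraph(n_nodes=20)(seasonal.transpose(1, 2))
print(trend.shape, seasonal.shape, out.shape)   # (4, 96, 20) (4, 96, 20) (4, 20, 96)
```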
zh
[AI-25] Learning What to Write: Write-Gated KV for Efficient Long-Context Inference
【速读】:该论文旨在解决长上下文大语言模型(Large Language Model, LLM)推理中因注意力机制的二次复杂度和键值缓存(Key-Value Cache, KV Cache)线性增长导致的性能瓶颈问题。传统方法通过事后选择或淘汰策略缓解此问题,但忽略了根本 inefficiency:对持久化内存的无差别写入。论文提出将KV缓存管理形式化为包含三个基本操作的因果系统:KV准入(KV Admission)、选择(Selection)与淘汰(Eviction),并首次引入“写门控KV”(Write-Gated KV)机制——一种轻量级学习策略,在token进入缓存前预测其效用,从而提前过滤低效状态。该机制结合全局紧凑缓存与滑动局部缓存,显著降低内存占用46–57%,并在Llama模型上实现预填充阶段3.03–3.45倍、解码阶段1.89–2.56倍的速度提升,同时保持与FlashAttention及分页KV系统兼容,且精度损失可忽略。其核心创新在于:通过学习“写什么”,而非仅仅优化“怎么存”,实现了高效长上下文推理的原理性突破。
链接: https://arxiv.org/abs/2512.17452
作者: Yen-Chieh Huang,Rui Fang,Ming-Syan Chen,Pi-Cheng Hsiu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Long-context LLM inference is bottlenecked by the quadratic attention complexity and linear KV cache growth. Prior approaches mitigate this via post-hoc selection or eviction but overlook the root inefficiency: indiscriminate writing to persistent memory. In this paper, we formalize KV cache management as a causal system of three primitives: KV Admission, Selection, and Eviction. We instantiate KV Admission via Write-Gated KV, a lightweight mechanism that learns to predict token utility before it enters the cache. By filtering out low-utility states early to maintain a compact global cache alongside a sliding local cache, Write-Gated KV reduces memory usage by 46-57% and delivers 3.03-3.45x prefill and 1.89-2.56x decode speedups on the Llama model with negligible accuracy loss, all while remaining compatible with FlashAttention and paged-KV systems. These results demonstrate that learning what to write is a principled and practical recipe for efficient long-context inference. Code is available at this https URL .
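A toy sketch of learned KV admission: a linear gate scores each incoming token's hidden state, admitting only high-utility tokens to a persistent global cache while a short sliding window keeps all recent tokens. The gate, threshold, and window length are placeholder choices, and gate training is not shown.

```python
# Sketch: gate-controlled admission to the global KV cache plus a sliding local cache.
import torch

class WriteGatedKV(torch.nn.Module):
    def __init__(self, dim, window=64, threshold=0.5):
        super().__init__()
        self.gate = torch.nn.Linear(dim, 1)     # predicts token utility before caching
        self.window, self.threshold = window, threshold
        self.global_k, self.global_v, self.local = [], [], []

    def write(self, h, k, v):                   # h, k, v: (dim,) for one new token
        self.local.append((k, v))
        self.local = self.local[-self.window:]  # sliding local cache of recent tokens
        if torch.sigmoid(self.gate(h)) > self.threshold:
            self.global_k.append(k)             # admit to the persistent global cache
            self.global_v.append(v)

    def cached(self):
        keys = self.global_k + [k for k, _ in self.local]
        vals = self.global_v + [v for _, v in self.local]
        return torch.stack(keys), torch.stack(vals)

dim = 32
cache = WriteGatedKV(dim)
for _ in range(200):
    cache.write(torch.randn(dim), k=torch.randn(dim), v=torch.randn(dim))
k, v = cache.cached()
print("cached tokens:", k.shape[0], "of 200")
```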
zh
[AI-26] Assessing Long-Term Electricity Market Design for Ambitious Decarbonization Targets using Multi-Agent Reinforcement Learning
【速读】:该论文旨在解决如何有效设计和评估长期电力市场机制以支持电力系统脱碳的问题,尤其关注在多政策与市场机制协同作用下,市场主体如何响应并适应低碳转型路径。其解决方案的关键在于构建一个基于多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)的仿真模型,其中发电公司作为利润最大化主体,在批发电力市场中做出投资决策,动态响应系统需求、竞争格局及政策信号;模型采用独立近端策略优化(Independent Proximal Policy Optimization, IPPO)方法,通过大规模超参数搜索确保去中心化训练能够产生符合竞争行为的市场结果,从而为政策制定者提供可测试、可评估的决策工具。
链接: https://arxiv.org/abs/2512.17444
作者: Javier Gonzalez-Ruiz,Carlos Rodriguez-Pardo,Iacopo Savelli,Alice Di Bella,Massimo Tavoni
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); General Economics (econ.GN)
备注: Accepted to Energy and AI. Code available in this https URL
Abstract:Electricity systems are key to transforming today’s society into a carbon-free economy. Long-term electricity market mechanisms, including auctions, support schemes, and other policy instruments, are critical in shaping the electricity generation mix. In light of the need for more advanced tools to support policymakers and other stakeholders in designing, testing, and evaluating long-term markets, this work presents a multi-agent reinforcement learning model capable of capturing the key features of decarbonizing energy systems. Profit-maximizing generation companies make investment decisions in the wholesale electricity market, responding to system needs, competitive dynamics, and policy signals. The model employs independent proximal policy optimization, which was selected for suitability to the decentralized and competitive environment. Nevertheless, given the inherent challenges of independent learning in multi-agent settings, an extensive hyperparameter search ensures that decentralized training yields market outcomes consistent with competitive behavior. The model is applied to a stylized version of the Italian electricity system and tested under varying levels of competition, market designs, and policy scenarios. Results highlight the critical role of market design for decarbonizing the electricity sector and avoiding price volatility. The proposed framework allows assessing long-term electricity markets in which multiple policy and market mechanisms interact simultaneously, with market participants responding and adapting to decarbonization pathways.
zh
[AI-27] A Systematic Reproducibility Study of BSARec for Sequential Recommendation
【速读】:该论文旨在解决基于Transformer的序列推荐(Sequential Recommendation, SR)模型中自注意力机制因具备低通滤波特性而难以捕捉用户短期兴趣(高频率信号)的问题。其解决方案的关键在于引入一个频域层(frequency layer),通过傅里叶变换(Fourier transform)对高频率成分进行重缩放,从而增强模型对短时兴趣的建模能力。该方法在复现BSARec模型的基础上,进一步验证了频域处理的有效性,并系统评估了不同数字信号处理(Digital Signal Processing, DSP)技术及填充策略对推荐性能的影响。
链接: https://arxiv.org/abs/2512.17442
作者: Jan Hutter,Hua Chang Bakker,Stan Fris,Madelon Bernardy,Yuanna Liu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Jan Hutter, Hua Chang Bakker, Stan Fris, Madelon Bernardy contributed equally to this work
Abstract:In sequential recommendation (SR), the self-attention mechanism of Transformer-based models acts as a low-pass filter, limiting their ability to capture high-frequency signals that reflect short-term user interests. To overcome this, BSARec augments the Transformer encoder with a frequency layer that rescales high-frequency components using the Fourier transform. However, the overall effectiveness of BSARec and the roles of its individual components have yet to be systematically validated. We reproduce BSARec and show that it outperforms other SR methods on some datasets. To empirically assess whether BSARec improves performance on high-frequency signals, we propose a metric to quantify user history frequency and evaluate SR methods across different user groups. We compare digital signal processing (DSP) techniques and find that the discrete wavelet transform (DWT) offer only slight improvements over Fourier transforms, and DSP methods provide no clear advantage over simple residual connections. Finally, we explore padding strategies and find that non-constant padding significantly improves recommendation performance, whereas constant padding hinders the frequency rescaler’s ability to capture high-frequency signals.
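以下是“频域层对高频成分重缩放”思路的最小示意(假设性实现,截止比例 cutoff_ratio 与缩放系数 beta 均为示例参数,并非 BSARec 官方代码):

```python
import numpy as np

def rescale_high_freq(x, cutoff_ratio=0.5, beta=2.0):
    """x: (seq_len, dim) 的序列表示;放大高于截止频率的成分(示意)。"""
    spec = np.fft.rfft(x, axis=0)                 # 沿时间维做实数 FFT
    n_freq = spec.shape[0]
    cut = int(n_freq * cutoff_ratio)              # 低/高频分界(假设)
    spec[cut:] *= beta                            # 高频成分按 beta 重缩放
    return np.fft.irfft(spec, n=x.shape[0], axis=0)

seq = np.random.randn(50, 16)
out = rescale_high_freq(seq)
print(out.shape)  # (50, 16)
```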
zh
[AI-28] Optimisation of Aircraft Maintenance Schedules
【速读】:该论文试图解决航空器维护调度问题(Aircraft Maintenance Scheduling Problem),即在限定的周转窗口内,为每架飞机分配具备适当资质的工作人员完成维护任务,以确保飞机能尽快恢复商业运营。解决方案的关键在于应用进化算法(Evolutionary Algorithm)来求解该组合优化问题,其核心机制是通过评估大量可能的调度方案,依据适应度函数(fitness function)筛选高质量解,并利用遗传算子(genetic operators)不断迭代优化,从而在复杂约束条件下找到可行且高效的维护排程方案。
链接: https://arxiv.org/abs/2512.17412
作者: Neil Urquhart,Amir Rahimi(Navid),Efstathios-Al. Tingas
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:
Abstract:We present an aircraft maintenance scheduling problem, which requires suitably qualified staff to be assigned to maintenance tasks on each aircraft. The tasks on each aircraft must be completed within a given turn around window so that the aircraft may resume revenue earning service. This paper presents an initial study based on the application of an Evolutionary Algorithm to the problem. Evolutionary Algorithms evolve a solution to a problem by evaluating many possible solutions, focusing the search on those solutions that are of a higher quality, as defined by a fitness function. In this paper, we benchmark the algorithm on 60 generated problem instances to demonstrate the underlying representation and associated genetic operators.
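下面用一个玩具排程问题示意进化算法的基本流程(任务时长、人员数、窗口长度与适应度定义均为假设,并非论文的基准实例):

```python
import random

TASKS = [3, 2, 4, 1, 5, 2]        # 每项维护任务耗时(假设)
STAFF = 3                          # 可用人员数(假设)
WINDOW = 8                         # 周转窗口时长(假设)

def fitness(assign):
    """适应度:各人员总耗时超出窗口的部分越少越好(取负作为惩罚)。"""
    loads = [0] * STAFF
    for task, person in zip(TASKS, assign):
        loads[person] += task
    return -sum(max(0, load - WINDOW) for load in loads)

def mutate(assign, rate=0.2):
    """遗传算子:以一定概率把任务重新分配给随机人员。"""
    return [random.randrange(STAFF) if random.random() < rate else p for p in assign]

def evolve(pop_size=30, generations=100):
    pop = [[random.randrange(STAFF) for _ in TASKS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]                        # 选择:保留较优个体
        children = [mutate(random.choice(parents)) for _ in range(pop_size - len(parents))]
        pop = parents + children                              # 变异产生子代
    return max(pop, key=fitness)

best = evolve()
print(best, fitness(best))
```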
zh
[AI-29] Detection and Analysis of Sensitive and Illegal Content on the Ethereum Blockchain Using Machine Learning Techniques
【速读】:该论文旨在解决区块链(Blockchain)技术中因去中心化特性导致的恶意或非法内容难以监管的问题,特别是在以太坊(Ethereum)链上存在敏感信息、色情图像及歧视性言论等风险。其解决方案的关键在于提出了一种数据识别与恢复算法,结合FastText进行情感分析(准确率达0.9),并利用NSFWJS库实现对不适当图像的精准检测(准确率100%),从而有效识别和分类出包括个人数据、违法内容和偏见语言在内的有害信息,为区块链内容治理提供可操作的技术路径与政策参考。
链接: https://arxiv.org/abs/2512.17411
作者: Xingyu Feng
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Blockchain technology, lauded for its transparent and immutable nature, introduces a novel trust model. However, its decentralized structure raises concerns about potential inclusion of malicious or illegal content. This study focuses on Ethereum, presenting a data identification and restoration algorithm. Successfully recovering 175 common files, 296 images, and 91,206 texts, we employed the FastText algorithm for sentiment analysis, achieving a 0.9 accuracy after parameter tuning. Classification revealed 70,189 neutral, 5,208 positive, and 15,810 negative texts, aiding in identifying sensitive or illicit information. Leveraging the NSFWJS library, we detected seven indecent images with 100% accuracy. Our findings expose the coexistence of benign and harmful content on the Ethereum blockchain, including personal data, explicit images, divisive language, and racial discrimination. Notably, sensitive information targeted Chinese government officials. Proposing preventative measures, our study offers valuable insights for public comprehension of blockchain technology and regulatory agency guidance. The algorithms employed present innovative solutions to address blockchain data privacy and security concerns.
zh
[AI-30] Dialectics for Artificial Intelligence
【速读】:该论文试图解决的核心问题是:如何让人工智能从原始经验中无监督地发现人类已有的概念,且这些概念能够随着认知发展而动态演化(如概念边界的变化、合并或分裂)。传统定义将概念视为静态的标签,难以适应知识演进的需求。为此,作者提出一种基于算法信息论(algorithmic-information)的观点,将概念视为仅通过其与智能体整体经验结构关系来定义的信息对象。解决方案的关键在于引入“决定性”(determination)这一核心约束——即一组部分构成可逆一致性关系,若缺失任一部分可从其余部分恢复(允许标准对数级误差),从而确保概念不脱离经验基础,并将其存在性转化为可验证的结构性命题;进一步通过定义“冗余信息”衡量分解经验时的额外描述开销,以此判断概念分解是否自然,并将概念演化建模为一种优化动力学过程,其中新信息出现时,不同概念竞争以最短条件描述解释它,驱动概念系统的系统性扩展、收缩、分裂与融合。
链接: https://arxiv.org/abs/2512.17373
作者: Zhengmian Hu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Can artificial intelligence discover, from raw experience and without human supervision, concepts that humans have discovered? One challenge is that human concepts themselves are fluid: conceptual boundaries can shift, split, and merge as inquiry progresses (e.g., Pluto is no longer considered a planet). To make progress, we need a definition of “concept” that is not merely a dictionary label, but a structure that can be revised, compared, and aligned across agents. We propose an algorithmic-information viewpoint that treats a concept as an information object defined only through its structural relation to an agent’s total experience. The core constraint is determination: a set of parts forms a reversible consistency relation if any missing part is recoverable from the others (up to the standard logarithmic slack in Kolmogorov-style identities). This reversibility prevents “concepts” from floating free of experience and turns concept existence into a checkable structural claim. To judge whether a decomposition is natural, we define excess information, measuring the redundancy overhead introduced by splitting experience into multiple separately described parts. On top of these definitions, we formulate dialectics as an optimization dynamics: as new patches of information appear (or become contested), competing concepts bid to explain them via shorter conditional descriptions, driving systematic expansion, contraction, splitting, and merging. Finally, we formalize low-cost concept transmission and multi-agent alignment using small grounds/seeds that allow another agent to reconstruct the same concept under a shared protocol, making communication a concrete compute-bits trade-off.
zh
[AI-31] akeAD: Preference-based Post-optimization for End-to-end Autonomous Driving with Expert Takeover Data
【速读】:该论文旨在解决端到端自动驾驶方法中因开环训练与闭环部署之间的不匹配而导致的驾驶员接管(driver-initiated takeovers)和系统脱离(system disengagements)问题。现有基于模仿学习(Imitation Learning, IL)的方法在实际闭环运行时表现不稳定,其根源在于缺乏对异常或高风险状态下的恢复策略建模。解决方案的关键在于提出一种基于偏好优化的后优化框架 TakeAD,通过两个核心阶段实现:首先利用迭代的数据集聚合(Dataset Aggregation, DAgger)机制,使预训练IL策略直接模仿专家干预行为以掌握处理接管状态的基本能力;随后引入直接偏好优化(Direct Preference Optimization, DPO),引导策略在接管场景下更符合专家偏好,从而逐步学习出有效的恢复策略,显著缩小开环与闭环之间的性能差距。
链接: https://arxiv.org/abs/2512.17370
作者: Deqing Liu,Yinfeng Gao,Deheng Qian,Qichao Zhang,Xiaoqing Ye,Junyu Han,Yupeng Zheng,Xueyi Liu,Zhongpu Xia,Dawei Ding,Yifeng Pan,Dongbin Zhao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Existing end-to-end autonomous driving methods typically rely on imitation learning (IL) but face a key challenge: the misalignment between open-loop training and closed-loop deployment. This misalignment often triggers driver-initiated takeovers and system disengagements during closed-loop execution. How to leverage those expert takeover data from disengagement scenarios and effectively expand the IL policy’s capability presents a valuable yet unexplored challenge. In this paper, we propose TakeAD, a novel preference-based post-optimization framework that fine-tunes the pre-trained IL policy with this disengagement data to enhance the closed-loop driving performance. First, we design an efficient expert takeover data collection pipeline inspired by human takeover mechanisms in real-world autonomous driving systems. Then, this post optimization framework integrates iterative Dataset Aggregation (DAgger) for imitation learning with Direct Preference Optimization (DPO) for preference alignment. The DAgger stage equips the policy with fundamental capabilities to handle disengagement states through direct imitation of expert interventions. Subsequently, the DPO stage refines the policy’s behavior to better align with expert preferences in disengagement scenarios. Through multiple iterations, the policy progressively learns recovery strategies for disengagement states, thereby mitigating the open-loop gap. Experiments on the closed-loop Bench2Drive benchmark demonstrate our method’s effectiveness compared with pure IL methods, with comprehensive ablations confirming the contribution of each component.
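DPO 阶段核心损失的数值示意(假设已获得策略与参考模型对“专家接管动作(偏好样本)/原策略动作(非偏好样本)”的对数概率;变量名与 beta 取值均为假设):

```python
import numpy as np

def dpo_loss(logp_pref, logp_rej, ref_logp_pref, ref_logp_rej, beta=0.1):
    """DPO:最大化偏好样本相对参考模型的对数比优势,即 -log(sigmoid(margin))。"""
    margin = beta * ((logp_pref - ref_logp_pref) - (logp_rej - ref_logp_rej))
    return float(-np.log(1.0 / (1.0 + np.exp(-margin))))

# 示例:策略比参考模型更偏向专家接管动作时,margin 为正、损失较小
print(dpo_loss(logp_pref=-1.0, logp_rej=-2.5, ref_logp_pref=-1.5, ref_logp_rej=-2.0))
```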
zh
[AI-32] Adaptive Graph Pruning with Sudden-Events Evaluation for Traffic Prediction using Online Semi-Decentralized ST-GNNs
【速读】:该论文旨在解决在智能交通系统中,基于边缘计算的时空图神经网络(Spatio-Temporal Graph Neural Networks, ST-GNNs)因分布式计算节点(cloudlets)间重复传输重叠节点特征而导致的高通信开销问题。解决方案的关键在于提出一种自适应剪枝算法(adaptive pruning algorithm),该算法根据模型近期性能动态过滤冗余邻居特征,同时保留对预测最具信息量的空间上下文;该机制使每个 cloudlet 能聚焦于发生交通变化的区域,从而在不牺牲准确性的前提下显著降低通信成本。此外,作者引入了事件导向的新指标 Sudden Event Prediction Accuracy (SEPA),以更精准评估模型对突发性交通减速与恢复的响应能力,揭示了空间连通性在预测动态和非规则交通事件中的真实价值。
链接: https://arxiv.org/abs/2512.17352
作者: Ivan Kralj,Lodovico Giaretta,Gordan Ježić,Ivana Podnar Žarko,Šarūnas Girdzijauskas
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 19 pages, 6 figures, 5 tables, journal
Abstract:Spatio-Temporal Graph Neural Networks (ST-GNNs) are well-suited for processing high-frequency data streams from geographically distributed sensors in smart mobility systems. However, their deployment at the edge across distributed compute nodes (cloudlets) creates substantial communication overhead due to repeated transmission of overlapping node features between neighbouring cloudlets. To address this, we propose an adaptive pruning algorithm that dynamically filters redundant neighbour features while preserving the most informative spatial context for prediction. The algorithm adjusts pruning rates based on recent model performance, allowing each cloudlet to focus on regions experiencing traffic changes without compromising accuracy. Additionally, we introduce the Sudden Event Prediction Accuracy (SEPA), a novel event-focused metric designed to measure responsiveness to traffic slowdowns and recoveries, which are often missed by standard error metrics. We evaluate our approach in an online semi-decentralized setting with traditional FL, server-free FL, and Gossip Learning on two large-scale traffic datasets, PeMS-BAY and PeMSD7-M, across short-, mid-, and long-term prediction horizons. Experiments show that, in contrast to standard metrics, SEPA exposes the true value of spatial connectivity in predicting dynamic and irregular traffic. Our adaptive pruning algorithm maintains prediction accuracy while significantly lowering communication cost in all online semi-decentralized settings, demonstrating that communication can be reduced without compromising responsiveness to critical traffic events.
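自适应剪枝率控制与邻居特征过滤的简化示意(调节步长、重要性分数等均为假设,仅说明“根据近期误差调节剪枝率”的思路):

```python
import numpy as np

def adapt_prune_rate(rate, recent_err, prev_err, step=0.05, lo=0.0, hi=0.9):
    """误差上升(可能错过交通变化)则降低剪枝率,误差下降则提高剪枝率以节省通信。"""
    rate = rate - step if recent_err > prev_err else rate + step
    return float(np.clip(rate, lo, hi))

def prune_neighbors(feats, importance, rate):
    """feats: (N, d) 邻居特征;按重要性保留前 (1-rate) 比例,其余不传输。"""
    keep = max(1, int(round(len(feats) * (1 - rate))))
    idx = np.argsort(importance)[::-1][:keep]
    return feats[idx], idx

rate = adapt_prune_rate(0.5, recent_err=0.12, prev_err=0.10)
feats = np.random.randn(8, 4)
kept, idx = prune_neighbors(feats, importance=np.random.rand(8), rate=rate)
print(rate, kept.shape)
```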
zh
[AI-33] Explanation Beyond Intuition: A Testable Criterion for Inherent Explainability
【速读】:该论文旨在解决可解释人工智能(Explainable Artificial Intelligence, XAI)领域中“内在可解释性”(inherent explainability)缺乏统一定义与验证标准的问题。现有研究或依赖主观直觉,或通过特定指标衡量解释能力,难以形成普适性判断依据。解决方案的关键在于提出一个全局适用的内在可解释性判据:利用图论对模型进行结构分解以生成局部结构解释,并将这些解释重构为全局解释;同时,将结构局部解释形式化为可验证的假设-证据结构(annotation),从而支持多种解释方法的应用。该判据不仅契合现有对内在可解释性的直观认知,还能区分“可解释模型”(explainable)与“已验证解释的模型”(explained),并通过对临床级Cox比例风险模型PREDICT的完整解释案例验证其有效性,为监管合规提供严谨且灵活的测试框架。
链接: https://arxiv.org/abs/2512.17316
作者: Michael Merry,Pat Riddle,Jim Warren
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Inherent explainability is the gold standard in Explainable Artificial Intelligence (XAI). However, there is not a consistent definition or test to demonstrate inherent explainability. Work to date either characterises explainability through metrics, or appeals to intuition - “we know it when we see it”. We propose a globally applicable criterion for inherent explainability. The criterion uses graph theory for representing and decomposing models for structure-local explanation, and recomposing them into global explanations. We form the structure-local explanations as annotations, a verifiable hypothesis-evidence structure that allows for a range of explanatory methods to be used. This criterion matches existing intuitions on inherent explainability, and provides justifications why a large regression model may not be explainable but a sparse neural network could be. We differentiate explainable – a model that allows for explanation – and explained – one that has a verified explanation. Finally, we provide a full explanation of PREDICT – a Cox proportional hazards model of cardiovascular disease risk, which is in active clinical use in New Zealand. It follows that PREDICT is inherently explainable. This work provides structure to formalise other work on explainability, and allows regulators a flexible but rigorous test that can be used in compliance frameworks.
zh
[AI-34] M2RU: Memristive Minion Recurrent Unit for Continual Learning at the Edge
【速读】:该论文旨在解决边缘计算平台上持续学习(continual learning)的能耗与效率问题,特别是传统循环神经网络(Recurrent Neural Networks, RNNs)因高能耗训练和频繁数据搬运而不适用于嵌入式部署的挑战。其解决方案的关键在于提出一种混合信号架构M2RU,该架构集成权重位流(weighted-bit streaming)技术以实现跨阵列中多比特数字输入的高效处理,无需高精度模数转换;同时引入经验回放机制(experience replay mechanism)来稳定域变化下的学习过程,从而在保持准确率(误差小于5%)的同时,实现每瓦特312 GOPS的能效比,相较CMOS数字设计提升29倍能效,并具备长达12.2年的预期工作寿命。
链接: https://arxiv.org/abs/2512.17299
作者: Abdullah M. Zyarah,Dhireesha Kudithipudi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:
Abstract:Continual learning on edge platforms remains challenging because recurrent networks depend on energy-intensive training procedures and frequent data movement that are impractical for embedded deployments. This work introduces M2RU, a mixed-signal architecture that implements the minion recurrent unit for efficient temporal processing with on-chip continual learning. The architecture integrates weighted-bit streaming, which enables multi-bit digital inputs to be processed in crossbars without high-resolution conversion, and an experience replay mechanism that stabilizes learning under domain shifts. M2RU achieves 15 GOPS at 48.62 mW, corresponding to 312 GOPS per watt, and maintains accuracy within 5 percent of software baselines on sequential MNIST and CIFAR-10 tasks. Compared with a CMOS digital design, the accelerator provides 29X improvement in energy efficiency. Device-aware analysis shows an expected operational lifetime of 12.2 years under continual learning workloads. These results establish M2RU as a scalable and energy-efficient platform for real-time adaptation in edge-level temporal intelligence.
zh
[AI-35] Robust TTS Training via Self-Purifying Flow Matching for the WildSpoof 2026 TTS Track ICASSP2026
【速读】:该论文旨在解决开放权重文本到语音(Text-to-Speech, TTS)模型在真实场景下(in-the-wild)语音合成质量下降的问题,特别是在存在标签噪声和复杂环境干扰时的鲁棒性不足。解决方案的关键在于引入自净化流匹配(Self-Purifying Flow Matching, SPFM)机制:通过对比每个样本的条件与无条件流匹配损失,自动识别并分离可疑的文本-语音对,将其路由至无条件训练流程以减少噪声影响,同时保留其声学信息用于模型优化。这一策略显著提升了模型在实际应用中的性能,最终在WER指标上优于所有参赛团队,并在UTMOS和DNSMOS等感知质量指标中排名第二。
链接: https://arxiv.org/abs/2512.17293
作者: June Young Yi,Hyeongju Kim,Juheon Lee
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: 2 pages, preprint, This work has been submitted to the IEEE for possible publication. Submitted to ICASSP 2026 SPGC (WildSpoof Challenge, TTS track)
Abstract:This paper presents a lightweight text-to-speech (TTS) system developed for the WildSpoof Challenge TTS Track. Our approach fine-tunes the recently released open-weight TTS model, Supertonic (this https URL), with Self-Purifying Flow Matching (SPFM) to enable robust adaptation to in-the-wild speech. SPFM mitigates label noise by comparing conditional and unconditional flow matching losses on each sample, routing suspicious text–speech pairs to unconditional training while still leveraging their acoustic information. The resulting model achieves the lowest Word Error Rate (WER) among all participating teams, while ranking second in perceptual metrics such as UTMOS and DNSMOS. These findings demonstrate that efficient, open-weight architectures like Supertonic can be effectively adapted to diverse real-world speech conditions when combined with explicit noise-handling mechanisms such as SPFM.
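SPFM 样本路由逻辑的示意(按线性插值路径的标准流匹配损失计算;model、margin 等均为假设的占位,并非论文实现):

```python
import numpy as np

def fm_loss(v_pred, x0, x1):
    """流匹配损失:预测速度场与目标速度 (x1 - x0) 的均方误差。"""
    return float(np.mean((v_pred - (x1 - x0)) ** 2))

def route_sample(model, x0, x1, text_cond, t, margin=0.1):
    """若条件损失明显高于无条件损失,视为文本-语音对可疑,转入无条件训练。"""
    xt = (1 - t) * x0 + t * x1                          # 线性插值路径上的点
    loss_cond = fm_loss(model(xt, t, text_cond), x0, x1)
    loss_uncond = fm_loss(model(xt, t, None), x0, x1)
    return "unconditional" if loss_cond - loss_uncond > margin else "conditional"

# 用法示意:model 为假设的速度场预测函数
dummy_model = lambda xt, t, cond: np.zeros_like(xt)
x0, x1 = np.random.randn(80), np.random.randn(80)
print(route_sample(dummy_model, x0, x1, text_cond="hello", t=0.3))
```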
zh
[AI-36] ScoutGPT : Capturing Player Impact from Team Action Sequences Using GPT -Based Framework
【速读】:该论文旨在解决足球转会成功预测的难题,即现有评估方法多依赖静态统计数据或事后价值模型,无法准确捕捉球员在新战术环境或不同队友组合下的表现适应性。解决方案的关键在于提出EventGPT——一个基于GPT风格自回归Transformer的球员条件化、价值感知的下一事件预测模型,将比赛过程建模为离散事件序列,联合预测下一动作类型、位置、时机及其估计的残差在场价值(residual On-Ball Value, rOBV),并首次实现通过替换球员嵌入进行反事实模拟,从而量化分析球员在不同球队或战术体系下的行为分布与价值变化,为转会适配性提供可解释的评估依据。
链接: https://arxiv.org/abs/2512.17266
作者: Miru Hong,Minho Lee,Geonhee Jo,Jae-Hee So,Pascal Bauer,Sang-Ki Ko
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 2 figures, 7 tables. To appear in Hudl Performance Insights 2025
Abstract:Transfers play a pivotal role in shaping a football club’s success, yet forecasting whether a transfer will succeed remains difficult due to the strong context-dependence of on-field performance. Existing evaluation practices often rely on static summary statistics or post-hoc value models, which fail to capture how a player’s contribution adapts to a new tactical environment or different teammates. To address this gap, we introduce EventGPT, a player-conditioned, value-aware next-event prediction model built on a GPT-style autoregressive transformer. Our model treats match play as a sequence of discrete tokens, jointly learning to predict the next on-ball action’s type, location, timing, and its estimated residual On-Ball Value (rOBV) based on the preceding context and player identity. A key contribution of this framework is the ability to perform counterfactual simulations. By substituting learned player embeddings into new event sequences, we can simulate how a player’s behavioral distribution and value profile would change when placed in a different team or tactical structure. Evaluated on five seasons of Premier League event data, EventGPT outperforms existing sequence-based baselines in next-event prediction accuracy and spatial precision. Furthermore, we demonstrate the model’s practical utility for transfer analysis through case studies, such as comparing striker performance across different systems and identifying stylistic replacements for specific roles, showing that our approach provides a principled method for evaluating transfer fit.
zh
[AI-37] Verifiability-First Agents : Provable Observability and Lightweight Audit Agents for Controlling Autonomous LLM Systems
【速读】:该论文旨在解决大语言模型驱动的智能体(LLM-based agents)在日益自主和多模态化背景下,如何确保其行为可控、可审计且忠实于部署者意图的问题。当前研究发现,智能体的性格特征和工具访问权限显著影响其对齐偏差(misalignment)。为应对这一挑战,论文提出“可验证优先”(Verifiability-First)架构,其关键在于:(1) 利用密码学与符号方法在运行时对智能体行为进行可验证证明;(2) 嵌入轻量级审计代理(Audit Agents),通过受限推理持续比对意图与实际行为;(3) 对高风险操作强制执行挑战-响应式认证协议。该方案将评估重心从“误对齐发生的概率”转向“误对齐被检测与纠正的速度与可靠性”,并引入OPERA基准套件以量化检测能力、隐蔽策略下的响应时间及验证机制的鲁棒性。
链接: https://arxiv.org/abs/2512.17259
作者: Abhivansh Gupta
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:As LLM-based agents grow more autonomous and multi-modal, ensuring they remain controllable, auditable, and faithful to deployer intent becomes critical. Prior benchmarks measured the propensity for misaligned behavior and showed that agent personalities and tool access significantly influence misalignment. Building on these insights, we propose a Verifiability-First architecture that (1) integrates run-time attestations of agent actions using cryptographic and symbolic methods, (2) embeds lightweight Audit Agents that continuously verify intent versus behavior using constrained reasoning, and (3) enforces challenge-response attestation protocols for high-risk operations. We introduce OPERA (Observability, Provable Execution, Red-team, Attestation), a benchmark suite and evaluation protocol designed to measure (i) detectability of misalignment, (ii) time to detection under stealthy strategies, and (iii) resilience of verifiability mechanisms to adversarial prompt and persona injection. Our approach shifts the evaluation focus from how likely misalignment is to how quickly and reliably misalignment can be detected and remediated.
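高风险操作的挑战-应答认证可以用标准库 HMAC 给出一个最小示意(共享密钥与日志结构均为假设,仅说明协议形态,并非论文 OPERA 的实现):

```python
import hmac, hashlib, os, json

SHARED_KEY = os.urandom(32)           # 部署方与审计代理共享的密钥(假设)

def attest(action_log, nonce):
    """代理端:对审计方的随机挑战 + 行为日志摘要计算 HMAC 作为证明。"""
    digest = hashlib.sha256(json.dumps(action_log, sort_keys=True).encode()).digest()
    return hmac.new(SHARED_KEY, nonce + digest, hashlib.sha256).hexdigest()

def verify(action_log, nonce, proof):
    """审计代理端:重算并做恒定时间比较,判断实际行为是否与声明一致。"""
    return hmac.compare_digest(attest(action_log, nonce), proof)

log = [{"tool": "search", "args": "q=quarterly report"}]
nonce = os.urandom(16)
proof = attest(log, nonce)
print(verify(log, nonce, proof))                          # True
print(verify(log + [{"tool": "delete"}], nonce, proof))   # False:行为被篡改
```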
zh
[AI-38] AlignDP: Hybrid Differential Privacy with Rarity-Aware Protection for LLM s NEURIPS2025
【速读】:该论文旨在解决大语言模型(Large Language Models)在训练过程中面临的数据隐私风险,包括知识提取(extraction)、模型蒸馏(distillation)和未经授权的微调(unauthorized fine-tuning)。现有防御机制如水印或监控通常在数据泄露后才起作用,无法阻止知识迁移。为此,作者提出AlignDP——一种混合隐私锁机制,其核心在于在数据接口层面阻断知识传递:通过将特征字段划分为稀有(rare)与非稀有(non-rare)两类进行差异化处理;稀有字段利用PAC不可区分性实现有效零ε局部差分隐私(local differential privacy, local DP),而非稀有字段则采用RAPPOR机制添加可控噪声以提供无偏频次估计。全局聚合器负责执行组合约束与预算管理,从而在保护隐私的同时维持模型可用性。
链接: https://arxiv.org/abs/2512.17251
作者: Madhava Gaikwad
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: LOCK-LLM Workshop, NeurIPS 2025
Abstract:Large language models are exposed to risks of extraction, distillation, and unauthorized fine-tuning. Existing defenses use watermarking or monitoring, but these act after leakage. We design AlignDP, a hybrid privacy lock that blocks knowledge transfer at the data interface. The key idea is to separate rare and non-rare fields. Rare fields are shielded by PAC indistinguishability, giving effective zero-epsilon local DP. Non-rare fields are privatized with RAPPOR, giving unbiased frequency estimates under local DP. A global aggregator enforces composition and budget. This two-tier design hides rare events and adds controlled noise to frequent events. We prove limits of PAC extension to global aggregation, give bounds for RAPPOR estimates, and analyze utility trade-off. A toy simulation confirms feasibility: rare categories remain hidden, frequent categories are recovered with small error.
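非稀有字段的 RAPPOR 机制(永久随机响应)及其无偏频次估计的示意(类别数 k、翻转概率 f 均为假设参数):

```python
import numpy as np

def rappor_report(true_idx, k, f, rng):
    """对 one-hot 编码的每一位做永久随机响应:以 f/2 概率置 1、f/2 概率置 0,否则保留真值。"""
    bits = np.zeros(k, dtype=int)
    bits[true_idx] = 1
    r = rng.random(k)
    return np.where(r < f / 2, 1, np.where(r < f, 0, bits))

def estimate_counts(reports, f):
    """无偏频次估计:t_hat = (观测计数 - (f/2)·N) / (1 - f)。"""
    n = len(reports)
    c = np.sum(reports, axis=0)
    return (c - (f / 2) * n) / (1 - f)

rng = np.random.default_rng(0)
k, f, n = 5, 0.4, 20000
truth = rng.integers(0, k, size=n)
reports = np.stack([rappor_report(t, k, f, rng) for t in truth])
print(np.round(estimate_counts(reports, f)))   # 约等于各类别的真实计数
print(np.bincount(truth, minlength=k))
```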
zh
[AI-39] Accelerating Multi-modal LLM Gaming Performance via Input Prediction and Mishit Correction
【速读】:该论文旨在解决实时序列控制代理中因推理延迟导致的性能瓶颈问题,尤其是在每步规划延迟会引发控制不稳定和整体性能下降的情况下。解决方案的关键在于提出一种“推测与校正”(speculation-and-correction)框架,将推测执行(speculative execution)的“先预测后验证”思想引入基于模型的控制(model-based control)场景,结合TD-MPC2方法:在每个控制步骤中,利用预训练的世界模型和潜在空间MPC规划器生成短时程动作队列及预测的潜在状态轨迹,使代理能够在无需即时重规划的情况下执行多个计划动作;当新观测到达时,系统通过比较编码的真实潜在状态与预测潜在状态之间的偏差来判断是否需要校正——对于小到中等偏差,采用轻量级学习校正器对推测动作施加残差更新(该校正器离线从重规划教师模型中蒸馏而来);对于大偏差则安全回退至完整重规划并清空过期动作队列。该设计有效降低了规划推理次数(从500次降至282次),提升了端到端步延迟效率(改善25%),同时保持了接近原始控制性能(仅下降7.1%)。
链接: https://arxiv.org/abs/2512.17250
作者: Ziyang Lin,Zixuan Sun,Sanhorn Chen,Xiaoyang Chen,Roy Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: UIUC 25 Fall CS 498
Abstract:Real-time sequential control agents are often bottlenecked by inference latency. Even modest per-step planning delays can destabilize control and degrade overall performance. We propose a speculation-and-correction framework that adapts the predict-then-verify philosophy of speculative execution to model-based control with TD-MPC2. At each step, a pretrained world model and latent-space MPC planner generate a short-horizon action queue together with predicted latent rollouts, allowing the agent to execute multiple planned actions without immediate replanning. When a new observation arrives, the system measures the mismatch between the encoded real latent state and the queued predicted latent. For small to moderate mismatch, a lightweight learned corrector applies a residual update to the speculative action, distilled offline from a replanning teacher. For large mismatch, the agent safely falls back to full replanning and clears stale action queues. We study both a gated two-tower MLP corrector and a temporal Transformer corrector to address local errors and systematic drift. Experiments on the DMC Humanoid-Walk task show that our method reduces the number of planning inferences from 500 to 282, improves end-to-end step latency by 25 percent, and maintains strong control performance with only a 7.1 percent return reduction. Ablation results demonstrate that speculative execution without correction is unreliable over longer horizons, highlighting the necessity of mismatch-aware correction for robust latency reduction.
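“投机执行-校正-回退重规划”门控逻辑的示意(阈值 tau_small/tau_large、校正器与重规划器均为假设的占位函数):

```python
import numpy as np

def control_step(z_real, z_pred, queued_action, corrector, replanner,
                 tau_small=0.1, tau_large=0.5):
    """按潜空间失配程度选择:直接执行 / 残差校正 / 回退完整重规划。"""
    mismatch = float(np.linalg.norm(z_real - z_pred))
    if mismatch < tau_small:
        return queued_action, "execute"                               # 失配很小:直接用队列动作
    if mismatch < tau_large:
        return queued_action + corrector(z_real, z_pred), "correct"   # 中等失配:残差校正
    return replanner(z_real), "replan"                                # 失配过大:清空队列并重规划

corrector = lambda zr, zp: 0.1 * (zr - zp)[:2]     # 假设的轻量校正器
replanner = lambda zr: np.zeros(2)                 # 假设的完整重规划占位
a, mode = control_step(np.ones(4), np.ones(4) * 1.05, np.array([0.3, -0.2]),
                       corrector, replanner)
print(a, mode)
```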
zh
[AI-40] Privacy-Preserving Synthetic Dataset of Individual Daily Trajectories for City-Scale Mobility Analytics
【速读】:该论文旨在解决城市移动性数据在隐私保护与高分辨率分析需求之间的矛盾问题:原始手机GPS轨迹因存在严重再识别风险而难以共享,而现有聚合数据(如起讫点OD矩阵)又无法准确反映个体日常移动行为的关键特征,限制了真实场景下的城市尺度分析。解决方案的关键在于提出一种基于多目标优化框架的隐私保护合成移动性数据生成方法,其核心创新是将仅以粗粒度统计形式存在的停留-移动时间分位数和每日访问地点数量的普适规律(universal law)作为行为约束嵌入模型中,从而在不依赖任何个人标识信息的前提下,重建出具有高保真度的个体级日行程分布,同时保持OD一致性在自然波动范围内。
链接: https://arxiv.org/abs/2512.17239
作者: Jun’ichi Ozaki,Ryosuke Susuta,Takuhiro Moriyama,Yohei Shida
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 9 pages, 4 figures
Abstract:Urban mobility data are indispensable for urban planning, transportation demand forecasting, pandemic modeling, and many other applications; however, individual mobile phone-derived Global Positioning System traces cannot generally be shared with third parties owing to severe re-identification risks. Aggregated records, such as origin-destination (OD) matrices, offer partial insights but fail to capture the key behavioral properties of daily human movement, limiting realistic city-scale analyses. This study presents a privacy-preserving synthetic mobility dataset that reconstructs daily trajectories from aggregated inputs. The proposed method integrates OD flows with two complementary behavioral constraints: (1) dwell-travel time quantiles that are available only as coarse summary statistics and (2) the universal law for the daily distribution of the number of visited locations. Embedding these elements in a multi-objective optimization framework enables the reproduction of realistic distributions of human mobility while ensuring that no personal identifiers are required. The proposed framework is validated in two contrasting regions of Japan: (1) the 23 special wards of Tokyo, representing a dense metropolitan environment; and (2) Fukuoka Prefecture, where urban and suburban mobility patterns coexist. The resulting synthetic mobility data reproduce dwell-travel time and visit frequency distributions with high fidelity, while deviations in OD consistency remain within the natural range of daily fluctuations. The results of this study establish a practical synthesis pathway under real-world constraints, providing governments, urban planners, and industries with scalable access to high-resolution mobility data for reliable analytics without the need for sensitive personal records, and supporting practical deployments in policy and commercial domains.
zh
[AI-41] he Role of Islamic Ethics in Preventing the Abuse of Artificial Intelligence (AI) Based Deepfakes
【速读】:该论文旨在解决深度伪造(deepfake)技术滥用所带来的伦理与社会风险问题,包括虚假信息传播、网络身份盗用以及公众对在线内容真实性的信任下降。这些问题不仅涉及技术层面,更牵涉到意图、道德责任及无形的社会影响,传统以技术为导向的被动管理方式已难以应对。解决方案的关键在于构建一个基于伊斯兰伦理框架的综合性预防机制,其核心是整合《沙里亚法目的论》(Maqasid al-Shariah)中的基本原则,尤其是“保护名誉”(hifz al-ird)和“保护生命”(hifz al-nafs),从而为技术负责任使用提供规范基础。研究提出三项战略建议:一是立法层面承认声誉损害的心理与无形伤害;二是通过正义(adl)、信任与透明度的价值观强化技术治理;三是提升公众数字素养,践行“审慎查验”(tabayyun)原则。最终,该框架推动从惩罚性措施向以维护人类尊严、预防伤害和促进公共福祉为核心的预防性思维转变。
链接: https://arxiv.org/abs/2512.17218
作者: Wisnu Uriawan,Imany Fauzy Rahman,Muhamad Zidan,Irma Rohmatillah,Muhammad Arkan Raihan,Irma Dwiyanti
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:The significant development of deepfake technology powered by artificial intelligence (AI) has sparked worldwide concerns about the alteration of false information, the usurpation of online identities, and the decline of public confidence in the authenticity of online content. These incidents not only raise technical issues but also carry complex moral implications, rendering conventional, technologically driven, and reactive management methods inadequate to address the underlying causes of the problem, including intent, morality, and potential intangible social impacts. Based on these issues, this study aims to formulate a comprehensive Islamic ethical framework that can serve as a more comprehensive preventative tool to mitigate the risks of misuse of deepfakes. The study employed a Systematic Literature Review (SLR) guided by PRISMA, selecting ten primary sources published between 2018 and 2025 to identify ethical deficiencies, regulatory needs, and appropriate normative solutions. The analysis shows that the integration of the principles of (Maqasid al-Shariah) particularly (hifz al-ird) protecting honor and (hifz al-nafs) protecting the self, provides a strong normative basis for regulating the responsible use of technology. This study yields three strategic recommendations: regulatory changes that recognize the intangible and psychological harm caused by reputational damage; improved technology management through moral scrutiny that upholds the values of justice (adl), trust, and openness; and increased public digital literacy based on the principle of (tabayyun) examination and caution. Overall, this study concludes that the application of Islamic ethics offers a shift in thinking from punitive mechanisms to preventative approaches that focus on protecting human dignity, preventing harm, and strengthening the common good in the digital age.
zh
[AI-42] Research on Dead Reckoning Algorithm for Self-Propelled Pipeline Robots in Three-Dimensional Complex Pipelines
【速读】:该论文旨在解决复杂曲管道环境下传统管道定位方法因电缆缠绕和设备灵活性不足而导致的定位失效问题,以及视觉或激光映射技术在受限空间中受光照和特征缺失影响而产生的地图漂移与发散问题。其解决方案的关键在于提出一种基于扩展卡尔曼滤波(Extended Kalman Filtering, EKF)的管道机器人定位方法:首先利用惯性测量单元(Inertial Measurement Unit, IMU)获取初始姿态角,再通过EKF算法提升姿态角估计精度,并结合轮式里程计实现高精度管道定位,从而有效降低环境因素干扰,提高复杂管路场景下的自主定位能力。
链接: https://arxiv.org/abs/2512.17215
作者: Yan Gao,Jiliang Wang,Minghan Wang,Xiaohua Chen,Demin Chen,Zhiyong Ren,Tian-Yun Huang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 8 pages, 9 figures
Abstract:In the field of gas pipeline location, existing pipeline location methods mostly rely on pipeline location instruments. However, when faced with complex and curved pipeline scenarios, these methods often fail due to problems such as cable entanglement and insufficient equipment flexibility. To address this pain point, we designed a self-propelled pipeline robot. This robot can autonomously complete the location work of complex and curved pipelines in complex pipe networks without external dragging. In terms of pipeline mapping technology, traditional visual mapping and laser mapping methods are easily affected by lighting conditions and insufficient features in the confined space of pipelines, resulting in mapping drift and divergence problems. In contrast, the pipeline location method that integrates inertial navigation and wheel odometers is less affected by pipeline environmental factors. Based on this, this paper proposes a pipeline robot location method based on extended Kalman filtering (EKF). Firstly, the body attitude angle is initially obtained through an inertial measurement unit (IMU). Then, the extended Kalman filtering algorithm is used to improve the accuracy of attitude angle estimation. Finally, high-precision pipeline location is achieved by combining wheel odometers. During the testing phase, the roll wheels of the pipeline robot needed to fit tightly against the pipe wall to reduce slippage. However, excessive tightness would reduce the flexibility of motion control due to excessive friction. Therefore, a balance needed to be struck between the robot’s motion capability and positioning accuracy. Experiments were conducted using the self-propelled pipeline robot in a rectangular loop pipeline, and the results verified the effectiveness of the proposed dead reckoning algorithm.
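融合轮式里程计(预测)与 IMU 航向量测(更新)的简化平面 EKF 示意(噪声协方差 Q、R 与量测模型均为假设;论文面向三维复杂管道,此处仅演示 EKF 的预测-更新骨架):

```python
import numpy as np

def ekf_step(x, P, v, omega, dt, yaw_meas, Q, R):
    """状态 x=[px, py, theta]:轮式里程计做预测,IMU 航向做量测更新。"""
    px, py, th = x
    # 预测:平面运动学模型及其雅可比
    x_pred = np.array([px + v * dt * np.cos(th),
                       py + v * dt * np.sin(th),
                       th + omega * dt])
    F = np.array([[1, 0, -v * dt * np.sin(th)],
                  [0, 1,  v * dt * np.cos(th)],
                  [0, 0, 1]])
    P_pred = F @ P @ F.T + Q
    # 更新:量测为航向角,H = [0, 0, 1]
    H = np.array([[0.0, 0.0, 1.0]])
    y = yaw_meas - x_pred[2]
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + (K @ np.array([y])).ravel()
    P_new = (np.eye(3) - K @ H) @ P_pred
    return x_new, P_new

x, P = np.zeros(3), np.eye(3) * 0.01
Q, R = np.eye(3) * 1e-4, np.array([[1e-3]])
for _ in range(100):                  # 以 0.5 m/s 前进并缓慢转向(假设输入)
    x, P = ekf_step(x, P, v=0.5, omega=0.02, dt=0.1, yaw_meas=x[2] + 0.002, Q=Q, R=R)
print(np.round(x, 3))
```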
zh
[AI-43] UmniBench: Unified Understand and Generation Model Oriented Omni-dimensional Benchmark
【速读】:该论文旨在解决当前统一多模态模型(Unified Multimodal Models, UMMs)评估体系割裂的问题,即现有评测方法分别独立测试模型的理解、生成和编辑能力,缺乏对三者协同性能的综合衡量。其解决方案的关键在于提出 UmniBench 基准测试平台,该平台通过整合理解、生成与编辑能力于同一评估流程中,利用人工校验的提示词(prompt)与问答对(QA pairs),借助 UMM 自身的能力实现对生成与编辑能力的自动评估,从而在保持评估效率的同时提供更全面、一致的性能刻画。此外,UmniBench 覆盖 13 个主要领域及 200 余个概念,并支持细粒度解耦评估,显著提升了评估的广度与深度。
链接: https://arxiv.org/abs/2512.17196
作者: Kai Liu,Leyang Chen,Wenbo Li,Zhikai Chen,Zhixin Wang,Renjing Pei,Linghe Kong,Yulun Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Project Page: this https URL
Abstract:Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. However, evaluations of unified multimodal models (UMMs) remain decoupled, assessing their understanding and generation abilities separately with corresponding datasets. To address this, we propose UmniBench, a benchmark tailored for UMMs with omni-dimensional evaluation. First, UmniBench can assess the understanding, generation, and editing ability within a single evaluation process. Based on human-examined prompts and QA pairs, UmniBench leverages UMM itself to evaluate its generation and editing ability with its understanding ability. This simple but effective paradigm allows comprehensive evaluation of UMMs. Second, UmniBench covers 13 major domains and more than 200 concepts, ensuring a thorough inspection of UMMs. Moreover, UmniBench can also decouple and separately evaluate understanding, generation, and editing abilities, providing a fine-grained assessment. Based on UmniBench, we benchmark 24 popular models, including both UMMs and single-ability large models. We hope this benchmark provides a more comprehensive and objective view of unified models and logistical support for improving the performance of the community model.
zh
[AI-44] MMRAG -RFT: Two-stage Reinforcement Fine-tuning for Explainable Multi-modal Retrieval-augmented Generation AAAI2026
【速读】:该论文旨在解决当前多模态检索增强生成(Multi-modal Retrieval-Augmented Generation, MMRAG)方法在推理逻辑不透明的问题,即现有方法无法清晰阐释检索与生成决策背后的因果链条,从而限制了结果的可解释性。解决方案的关键在于引入强化学习(Reinforcement Learning),构建一个两阶段强化微调框架:第一阶段采用基于规则的强化微调进行粗粒度的点级排序,有效过滤明显无关的多模态文档;第二阶段通过基于推理的强化微调联合优化细粒度的列表级排序与答案生成,引导多模态大语言模型在MMRAG过程中输出可解释的推理路径,从而显著提升生成结果的可信度与可解释性。
链接: https://arxiv.org/abs/2512.17194
作者: Shengwei Zhao,Jingwen Yao,Sitong Wei,Linhai Xu,Yuying Liu,Dong Zhang,Zhiqiang Tian,Shaoyi Du
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This paper was accepted to AAAI2026
Abstract:Multi-modal Retrieval-Augmented Generation (MMRAG) enables highly credible generation by integrating external multi-modal knowledge, thus demonstrating impressive performance in complex multi-modal scenarios. However, existing MMRAG methods fail to clarify the reasoning logic behind retrieval and response generation, which limits the explainability of the results. To address this gap, we propose to introduce reinforcement learning into multi-modal retrieval-augmented generation, enhancing the reasoning capabilities of multi-modal large language models through a two-stage reinforcement fine-tuning framework to achieve explainable multi-modal retrieval-augmented generation. Specifically, in the first stage, rule-based reinforcement fine-tuning is employed to perform coarse-grained point-wise ranking of multi-modal documents, effectively filtering out those that are significantly irrelevant. In the second stage, reasoning-based reinforcement fine-tuning is utilized to jointly optimize fine-grained list-wise ranking and answer generation, guiding multi-modal large language models to output explainable reasoning logic in the MMRAG process. Our method achieves state-of-the-art results on WebQA and MultimodalQA, two benchmark datasets for multi-modal retrieval-augmented generation, and its effectiveness is validated through comprehensive ablation experiments.
zh
[AI-45] Conservative Bias in Multi-Teacher Learning: Why Agents Prefer Low-Reward Advisors
【速读】:该论文旨在解决交互式强化学习(Interactive Reinforcement Learning, IRL)中教师选择机制不明确的问题,特别是当存在多个具有不同奖励结构的教师时,学习代理如何做出选择及其背后的行为模式。其解决方案的关键在于通过大规模实验(1,250次运行)揭示了代理在多教师环境中表现出显著的保守偏好:即使高奖励教师能提供20倍于低奖励教师的回报,代理仍以93.16%的选择率偏向于低奖励但更一致的教师。研究进一步识别出教师可用性(rho = 0.6)和准确率(omega = 0.6)的临界阈值,低于此阈值系统性能会崩溃;同时表明该框架在概念漂移场景下相较基线Q-learning实现159%的性能提升,从而挑战了传统最优教学假设,并为安全关键型机器人应用中的教师选择策略提供了理论依据与实践指导。
链接: https://arxiv.org/abs/2512.17180
作者: Maher Mesto,Francisco Cruz
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures. Accepted at ACRA 2025 (Australasian Conference on Robotics and Automation)
Abstract:Interactive reinforcement learning (IRL) has shown promise in enabling autonomous agents and robots to learn complex behaviours from human teachers, yet the dynamics of teacher selection remain poorly understood. This paper reveals an unexpected phenomenon in IRL: when given a choice between teachers with different reward structures, learning agents overwhelmingly prefer conservative, low-reward teachers (93.16% selection rate) over those offering 20x higher rewards. Through 1,250 experimental runs in navigation tasks with multiple expert teachers, we discovered: (1) Conservative bias dominates teacher selection: agents systematically choose the lowest-reward teacher, prioritising consistency over optimality; (2) Critical performance thresholds exist at teacher availability rho = 0.6 and accuracy omega = 0.6, below which the framework fails catastrophically; (3) The framework achieves 159% improvement over baseline Q-learning under concept drift. These findings challenge fundamental assumptions about optimal teaching in RL and suggest potential implications for human-robot collaboration, where human preferences for safety and consistency may align with the observed agent selection behaviour, potentially informing training paradigms for safety-critical robotic applications.
zh
[AI-46] PILAR: Personalizing Augmented Reality Interactions with LLM -based Human-Centric and Trustworthy Explanations for Daily Use Cases
【速读】:该论文旨在解决当前增强现实(Augmented Reality, AR)系统中可解释人工智能(Explainable Artificial Intelligence, XAI)方法在实时用户交互场景下难以提供动态、情境感知、个性化且以用户为中心的解释问题。传统XAI方法通常依赖于基于特征或示例的解释技术,分别处理“何时”、“什么”和“如何”等不同维度的解释需求,导致用户体验碎片化、不自然,无法满足无缝AR交互的需求。解决方案的关键在于提出PILAR框架,该框架利用预训练大语言模型(Large Language Model, LLM)生成上下文感知且个性化的解释内容,通过统一的LLM驱动机制动态适配用户需求,从而提升用户对AI决策的信任感与参与度。实验验证表明,相较于传统的模板式解释界面,PILAR显著提升了用户任务完成效率(快40%)及满意度、易用性和透明度感知。
链接: https://arxiv.org/abs/2512.17172
作者: Ripan Kumar Kundu,Istiak Ahmed,Khaza Anuarul Hoque
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Published in the 2025 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct)
Abstract:Artificial intelligence (AI)-driven augmented reality (AR) systems are becoming increasingly integrated into daily life, and with this growth comes a greater need for explainability in real-time user interactions. Traditional explainable AI (XAI) methods, which often rely on feature-based or example-based explanations, struggle to deliver dynamic, context-specific, personalized, and human-centric insights for everyday AR users. These methods typically address separate explainability dimensions (e.g., when, what, how) with different explanation techniques, resulting in unrealistic and fragmented experiences for seamless AR interactions. To address this challenge, we propose PILAR, a novel framework that leverages a pre-trained large language model (LLM) to generate context-aware, personalized explanations, offering a more intuitive and trustworthy experience in real-time AI-powered AR systems. Unlike traditional methods, which rely on multiple techniques for different aspects of explanation, PILAR employs a unified LLM-based approach that dynamically adapts explanations to the user’s needs, fostering greater trust and engagement. We implement the PILAR concept in a real-world AR application (e.g., personalized recipe recommendations), an open-source prototype that integrates real-time object detection, recipe recommendation, and LLM-based personalized explanations of the recommended recipes based on users’ dietary preferences. We evaluate the effectiveness of PILAR through a user study with 16 participants performing AR-based recipe recommendation tasks, comparing an LLM-based explanation interface to a traditional template-based one. Results show that the LLM-based interface significantly enhances user performance and experience, with participants completing tasks 40% faster and reporting greater satisfaction, ease of use, and perceived transparency.
zh
[AI-47] Solomonoff-Inspired Hypothesis Ranking with LLM s for Prediction Under Uncertainty
【速读】:该论文旨在解决人工智能在不确定性下的推理问题,尤其是在现实世界任务中因数据稀疏而需系统性泛化时的挑战。现有方法难以在评估多个候选假设时平衡准确性与简洁性。其解决方案的关键在于提出一种受索洛莫诺夫(Solomonoff)启发的方法,通过结合简洁性(simplicity)与预测拟合度(predictive fit)对大型语言模型(LLM)生成的假设进行加权,从而构建每单元预测的索洛莫诺夫加权混合模型。此方法能够在假设存在噪声或部分错误的情况下输出保守且具有不确定性的结果,相较于贝叶斯模型平均(BMA),能更均匀地分配概率至竞争性假设,体现算法信息论先验在可解释、可靠多假设推理中的价值。
链接: https://arxiv.org/abs/2512.17145
作者: Josh Barber(QUT),Rourke Young(QUT),Cameron Coombe(QUT and CSIRO),Will Browne(QUT)
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注: 10 pages, ACRA 2025, Submitted, Accepted and Presented
Abstract:Reasoning under uncertainty is a key challenge in AI, especially for real-world tasks, where problems with sparse data demands systematic generalisation. Existing approaches struggle to balance accuracy and simplicity when evaluating multiple candidate solutions. We propose a Solomonoff-inspired method that weights LLM-generated hypotheses by simplicity and predictive fit. Applied to benchmark (Mini-ARC) tasks, our method produces Solomonoff-weighted mixtures for per-cell predictions, yielding conservative, uncertainty-aware outputs even when hypotheses are noisy or partially incorrect. Compared to Bayesian Model Averaging (BMA), Solomonoff scoring spreads probability more evenly across competing hypotheses, while BMA concentrates weight on the most likely but potentially flawed candidates. Across tasks, this highlights the value of algorithmic information-theoretic priors for interpretable, reliable multi-hypothesis reasoning under uncertainty.
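Solomonoff 风格假设加权的最小示意:权重正比于 2^(-描述长度) 乘以似然,再对各假设的预测做混合(描述长度与似然的具体度量均为假设):

```python
import numpy as np

def solomonoff_mixture(hyp_lengths, hyp_loglik, hyp_preds):
    """hyp_lengths: 各假设的描述长度(比特);hyp_loglik: 对已见数据的对数似然;
    hyp_preds: (H, ...) 每个假设对目标的预测概率。"""
    log_w = -np.array(hyp_lengths) * np.log(2) + np.array(hyp_loglik)  # 简洁性先验 × 拟合度
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return np.tensordot(w, np.array(hyp_preds), axes=1)   # 加权混合预测

lengths = [40, 55, 90]                      # 越短的假设先验越大(假设的度量)
loglik = [-2.0, -0.5, -0.3]                 # 对已观测数据的拟合度(假设)
preds = [np.array([0.9, 0.1]), np.array([0.6, 0.4]), np.array([0.2, 0.8])]
print(solomonoff_mixture(lengths, loglik, preds))
```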
zh
[AI-48] Smoothing DiLoCo with Primal Averag ing for Faster Training of LLM s
【速读】:该论文旨在解决近期基于迭代平均的优化算法(如单工作节点DiLoCo和Schedule-Free)在非分布式设置下存在的局限性,特别是DiLoCo因周期性平均引入的两层循环结构所导致的内存开销高、超参数数量多等问题。其解决方案的关键在于提出广义原始平均(Generalized Primal Averaging, GPA),通过解耦Nesterov方法原始平均形式中的插值常数(interpolation constant),使得GPA能够在每一步都实现平滑的迭代平均,从而在不引入两层循环结构的前提下,统一并改进了DiLoCo的性能。这一设计不仅简化了超参数调优,还将内存开销降低至仅需一个额外缓冲区,同时在多个基准任务上实现了显著的速度提升,并理论上保证了对任意具有O(√T)遗憾界的基线优化器,GPA可达到或超越原优化器的收敛性能。
链接: https://arxiv.org/abs/2512.17131
作者: Aaron Defazio,Konstantin Mishchenko,Parameswaran Raman,Hao-Jun Michael Shi,Lin Xiao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:We propose Generalized Primal Averaging (GPA), an extension of Nesterov’s method in its primal averaging formulation that addresses key limitations of recent averaging-based optimizers such as single-worker DiLoCo and Schedule-Free (SF) in the non-distributed setting. These two recent algorithmic approaches improve the performance of base optimizers, such as AdamW, through different iterate averaging strategies. Schedule-Free explicitly maintains a uniform average of past weights, while single-worker DiLoCo performs implicit averaging by periodically aggregating trajectories, called pseudo-gradients, to update the model parameters. However, single-worker DiLoCo’s periodic averaging introduces a two-loop structure, increasing its memory requirements and number of hyperparameters. GPA overcomes these limitations by decoupling the interpolation constant in the primal averaging formulation of Nesterov. This decoupling enables GPA to smoothly average iterates at every step, generalizing and improving upon single-worker DiLoCo. Empirically, GPA consistently outperforms single-worker DiLoCo while removing the two-loop structure, simplifying hyperparameter tuning, and reducing its memory overhead to a single additional buffer. On the Llama-160M model, GPA provides a 24.22% speedup in terms of steps to reach the baseline (AdamW’s) validation loss. Likewise, GPA achieves speedups of 12% and 27% on small and large batch setups, respectively, to attain AdamW’s validation accuracy on the ImageNet ViT workload. Furthermore, we prove that for any base optimizer with regret bounded by O(√T), where T is the number of iterations, GPA can match or exceed the convergence guarantee of the original optimizer, depending on the choice of interpolation constants.
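原始平均(primal averaging)更新结构的一个简化示意(以 SGD 代替 AdamW 作基座、beta 与 c 为假设常数;GPA 的要点是让 y 与 x 两处的插值常数相互独立,以下并非论文的精确算法):

```python
import numpy as np

def primal_averaging_step(x, z, grad_fn, lr=0.1, beta=0.9, c=0.05):
    """y 为梯度评估点;z 为基座迭代;x 为每步平滑平均后的模型参数(无双层循环)。"""
    y = (1 - beta) * z + beta * x        # 梯度评估点:z 与 x 的插值(常数 beta)
    z = z - lr * grad_fn(y)              # 基座优化器一步(此处用 SGD 代替 AdamW,为假设)
    x = (1 - c) * x + c * z              # 用解耦的插值常数 c 把迭代平滑平均进 x
    return x, z

grad = lambda w: 2 * (w - 3.0)           # 玩具目标 (w-3)^2 的梯度
x = z = np.array([0.0])
for _ in range(300):
    x, z = primal_averaging_step(x, z, grad)
print(np.round(x, 3))                    # 收敛到 3 附近
```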
zh
[AI-49] Reinforcement Learning for Self-Improving Agent with Skill Library
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)在部署到新环境后难以持续改进和适应的问题,尤其是在复杂推理与多轮交互任务中缺乏自进化能力。现有基于LLM提示(prompting)的技能库方法存在实现一致性差、技能积累效率低等挑战。其解决方案的关键在于提出一种基于强化学习(Reinforcement Learning, RL)的新型框架——Skill Augmented GRPO for self-Evolution (SAGE),通过引入“序贯推演”(Sequential Rollout)机制,在任务链中迭代部署代理以累积技能至技能库,并设计“技能融合奖励”(Skill-integrated Reward)增强技能生成与利用,从而显著提升任务完成率与计算效率。实验表明,SAGE在AppWorld场景下相较现有方法实现了8.9%更高的目标完成率,同时减少26%的交互步数和59%的token生成量。
链接: https://arxiv.org/abs/2512.17102
作者: Jiongxiao Wang,Qiaojing Yan,Yawei Wang,Yijun Tian,Soumya Smruti Mishra,Zhichao Xu,Megha Gandhi,Panpan Xu,Lin Lee Cheong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Model (LLM)-based agents have demonstrated remarkable capabilities in complex reasoning and multi-turn interactions but struggle to continuously improve and adapt when deployed in new environments. One promising approach is implementing skill libraries that allow agents to learn, validate, and apply new skills. However, current skill library approaches rely primarily on LLM prompting, making consistent skill library implementation challenging. To overcome these challenges, we propose a Reinforcement Learning (RL)-based approach to enhance agents’ self-improvement capabilities with a skill library. Specifically, we introduce Skill Augmented GRPO for self-Evolution (SAGE), a novel RL framework that systematically incorporates skills into learning. The framework’s key component, Sequential Rollout, iteratively deploys agents across a chain of similar tasks for each rollout. As agents navigate through the task chain, skills generated from previous tasks accumulate in the library and become available for subsequent tasks. Additionally, the framework enhances skill generation and utilization through a Skill-integrated Reward that complements the original outcome-based rewards. Experimental results on AppWorld demonstrate that SAGE, when applied to supervised-finetuned model with expert experience, achieves 8.9% higher Scenario Goal Completion while requiring 26% fewer interaction steps and generating 59% fewer tokens, substantially outperforming existing approaches in both accuracy and efficiency.
zh
[AI-50] UniCoMTE: A Universal Counterfactual Framework for Explaining Time-Series Classifiers on ECG Data
【速读】:该论文旨在解决深度神经网络在复杂时间序列分类任务中因“黑箱”特性而导致的可信度不足问题,尤其是在医疗等高风险领域中的应用受限。其解决方案的关键在于提出一种模型无关的反事实解释框架UniCoMTE,该框架通过修改原始时间序列输入并评估其对模型预测的影响,识别出最显著影响预测结果的时间特征;该方法直接作用于原始时间序列数据,兼容多种模型架构,并能生成简洁、稳定且符合人类认知逻辑的解释,从而提升模型决策的可理解性和临床实用性。
链接: https://arxiv.org/abs/2512.17100
作者: Justin Li,Efe Sencan,Jasper Zheng Duan,Vitus J. Leung,Stephan Tsaur,Ayse K. Coskun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 21 pages, 7 figures
Abstract:Machine learning models, particularly deep neural networks, have demonstrated strong performance in classifying complex time series data. However, their black-box nature limits trust and adoption, especially in high-stakes domains such as healthcare. To address this challenge, we introduce UniCoMTE, a model-agnostic framework for generating counterfactual explanations for multivariate time series classifiers. The framework identifies temporal features that most heavily influence a model’s prediction by modifying the input sample and assessing its impact on the model’s prediction. UniCoMTE is compatible with a wide range of model architectures and operates directly on raw time series inputs. In this study, we evaluate UniCoMTE’s explanations on a time series ECG classifier. We quantify explanation quality by comparing our explanations’ comprehensibility to comprehensibility of established techniques (LIME and SHAP) and assessing their generalizability to similar samples. Furthermore, clinical utility is assessed through a questionnaire completed by medical experts who review counterfactual explanations presented alongside original ECG samples. Results show that our approach produces concise, stable, and human-aligned explanations that outperform existing methods in both clarity and applicability. By linking model predictions to meaningful signal patterns, the framework advances the interpretability of deep learning models for real-world time series applications.
zh
[AI-51] Learning to Plan Planning to Learn: Adaptive Hierarchical RL-MPC for Sample-Efficient Decision Making
【速读】:该论文旨在解决具有层次结构的规划问题,尤其是在复杂环境中实现高效且鲁棒的决策制定。其核心挑战在于如何融合强化学习(Reinforcement Learning, RL)与模型预测路径积分(Model Predictive Path Integral, MPPI)这两种规划范式,以提升策略训练的效率和适应性。解决方案的关键在于提出一种紧密耦合的机制:利用RL动作指导MPPI采样器,并通过自适应聚合MPPI样本反馈优化价值估计;该机制在价值估计不确定性较高时增强MPPI探索能力,从而提升训练鲁棒性和最终策略性能,显著改善数据效率与任务成功率(最高达72%提升),并加速收敛(提速2.1倍)。
链接: https://arxiv.org/abs/2512.17091
作者: Toshiaki Hori,Jonathan DeCastro,Deepak Gopinath,Avinash Balachandran,Guy Rosman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 23 pages, 8 figures. Under review
Abstract:We propose a new approach for solving planning problems with a hierarchical structure, fusing reinforcement learning and MPC planning. Our formulation tightly and elegantly couples the two planning paradigms. It leverages reinforcement learning actions to inform the MPPI sampler, and adaptively aggregates MPPI samples to inform the value estimation. The resulting adaptive process leverages further MPPI exploration where value estimates are uncertain, and improves training robustness and the overall resulting policies. This results in a robust planning approach that can handle complex planning problems and easily adapts to different applications, as demonstrated over several domains, including race driving, modified Acrobot, and Lunar Lander with added obstacles. Our results in these domains show better data efficiency and overall performance in terms of both rewards and task success, with up to a 72% increase in success rate compared to existing approaches, as well as accelerated convergence (x2.1) compared to non-adaptive sampling.
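MPPI 采样权重与控制更新的标准计算示意(成本、温度 lam 与控制维度均为假设;论文在此基础上用 RL 动作引导采样并对样本做自适应聚合,此处仅展示基础 MPPI 一步):

```python
import numpy as np

def mppi_update(u_nominal, noise, costs, lam=1.0):
    """u_nominal: (H, m) 名义控制序列;noise: (K, H, m) 采样扰动;costs: (K,) 轨迹成本。"""
    w = np.exp(-(costs - costs.min()) / lam)        # 成本越低的样本权重越大
    w /= w.sum()
    return u_nominal + np.tensordot(w, noise, axes=1)   # 扰动的加权平均叠加到名义控制上

K, H, m = 64, 10, 2
rng = np.random.default_rng(0)
noise = rng.normal(scale=0.2, size=(K, H, m))
costs = rng.random(K)                               # 假设的滚动成本
u = mppi_update(np.zeros((H, m)), noise, costs)
print(u.shape)  # (10, 2)
```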
zh
[AI-52] How to Square Tensor Networks and Circuits Without Squaring Them
【速读】:该论文旨在解决平方张量网络(Squared Tensor Networks, TNs)及其扩展形式——平方电路(Squared Circuits)在计算分区函数或变量边缘化时引入的额外复杂性问题,这一问题限制了其在机器学习中的应用。解决方案的关键在于通过单位矩阵参数化平方电路,借鉴规范形式中正交性的思想与电路中确定性的特性,从而实现即使在非张量网络结构的因子分解中也能高效进行边缘化计算。该方法在不损失表达能力的前提下,显著提升了学习效率。
链接: https://arxiv.org/abs/2512.17090
作者: Lorenzo Loconte,Adrián Javaloy,Antonio Vergari
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Squared tensor networks (TNs) and their extension as computational graphs–squared circuits–have been used as expressive distribution estimators, yet supporting closed-form marginalization. However, the squaring operation introduces additional complexity when computing the partition function or marginalizing variables, which hinders their applicability in ML. To solve this issue, canonical forms of TNs are parameterized via unitary matrices to simplify the computation of marginals. However, these canonical forms do not apply to circuits, as they can represent factorizations that do not directly map to a known TN. Inspired by the ideas of orthogonality in canonical forms and determinism in circuits enabling tractable maximization, we show how to parameterize squared circuits to overcome their marginalization overhead. Our parameterizations unlock efficient marginalization even in factorizations different from TNs, but encoded as circuits, whose structure would otherwise make marginalization computationally hard. Finally, our experiments on distribution estimation show how our proposed conditions in squared circuits come with no expressiveness loss, while enabling more efficient learning.
zh
[AI-53] Value Under Ignorance in Universal Artificial Intelligence
【速读】:该论文旨在解决强化学习中AIXI代理在面对不完全预测(即某些假设仅能预测交互历史的有限前缀)时,如何合理分配效用的问题。传统方法将此类不完整预测解释为“死亡”概率,并据此赋予历史前缀效用值,但这种方法存在语义模糊性。论文的关键解决方案是引入模糊概率理论中的Choquet积分来计算期望效用,将信念分布视为不精确概率分布,其中半测度损失(semimeasure loss)代表总无知程度。这一视角不仅恢复了标准递归价值函数作为特例,还揭示了在“死亡解释”下更一般的期望效用无法被这类Choquet积分刻画,从而拓展了效用建模的理论边界。
链接: https://arxiv.org/abs/2512.17086
作者: Cole Wyeth,Marcus Hutter
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We generalize the AIXI reinforcement learning agent to admit a wider class of utility functions. Assigning a utility to each possible interaction history forces us to confront the ambiguity that some hypotheses in the agent’s belief distribution only predict a finite prefix of the history, which is sometimes interpreted as implying a chance of death equal to a quantity called the semimeasure loss. This death interpretation suggests one way to assign utilities to such history prefixes. We argue that it is as natural to view the belief distributions as imprecise probability distributions, with the semimeasure loss as total ignorance. This motivates us to consider the consequences of computing expected utilities with Choquet integrals from imprecise probability theory, including an investigation of their computability level. We recover the standard recursive value function as a special case. However, our most general expected utilities under the death interpretation cannot be characterized as such Choquet integrals.
zh
[AI-54] Can Large Reasoning Models Improve Accuracy on Mathematical Tasks Using Flawed Thinking?
【速读】:该论文旨在解决大语言模型在数学推理任务中对早期错误(如计算失误或逻辑推理不当)高度敏感的问题,即一旦出现初始错误,模型往往无法自我修正并导致最终答案错误。其解决方案的关键在于通过引入包含可控错误的链式思维(Chain-of-thought, CoT)提示进行强化学习训练:具体而言,在MATH-lighteval竞赛级问题中生成仅含一个预设错误(计算错误或推理错误)的CoT前缀,并使用GRPO算法对Qwen3-4B模型进行微调,以二元最终答案奖励驱动学习。实验表明,该方法在保持标准问题解答准确率(41%)的同时显著提升了模型对带错提示的鲁棒性(24% vs 19%),且混合训练(同时包含计算与推理错误)效果最优,证明了暴露于错误推理过程可有效增强模型的错误检测与恢复能力而不损害原有性能。
链接: https://arxiv.org/abs/2512.17079
作者: Saraswathy Amjith,Mihika Dusad,Neha Muramalla,Shweta Shah
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Chain-of-thought (CoT) prompting has become central to mathematical reasoning in large language models, yet models remain brittle to early errors: a single arithmetic slip or unjustified inference typically propagates uncorrected to an incorrect final answer. We investigate whether training on intentionally flawed reasoning traces can teach models to detect and recover from such errors without degrading standard problem-solving ability. Using competition-level problems from MATH-lighteval, we generate CoT prefixes containing exactly one controlled error, either a calculation error (sign flips, dropped terms) or a reasoning error (misapplied rules, unjustified logical steps), and fine-tune Qwen3-4B with GRPO using a binary final-answer reward. Our Mixed-CoT-RL model matches standard RL on clean problems (41% vs 41%) while substantially outperforming it on problems prefilled with flawed reasoning (24% vs 19%). Notably, clean-only RL fine-tuning degrades robustness below the untuned baseline (19% vs. 20%), indicating that conventional training increases susceptibility to misleading prefills. Among error types, training on reasoning errors yields greater robustness gains than calculation errors alone, with mixed training performing best. These findings demonstrate that exposure to flawed traces during training can improve error-recovery behavior without sacrificing accuracy, suggesting a path toward more robust mathematical reasoning in LLMs.
zh
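编者注:以下为示意性代码(并非论文的数据构造或训练流程),仅演示“在链式思维中注入单个受控计算错误 + 末端二元答案奖励”这一思路;函数名与示例内容均为假设。

```python
import random
import re

def inject_calculation_error(cot_steps):
    """返回恰好注入一个“计算错误”(符号翻转)的链式思维副本。"""
    steps = list(cot_steps)
    numeric = [i for i, s in enumerate(steps) if re.search(r"-?\d+", s)]
    i = random.choice(numeric)
    steps[i] = re.sub(r"-?\d+", lambda m: str(-int(m.group())), steps[i], count=1)
    return steps

def binary_final_answer_reward(model_answer: str, gold_answer: str) -> float:
    """末端二元奖励:最终答案完全匹配记 1,否则记 0。"""
    return float(model_answer.strip() == gold_answer.strip())

clean_cot = ["设 x = 3。", "则 2*x + 4 = 10。", "所以答案是 10。"]
print(inject_calculation_error(clean_cot))
print(binary_final_answer_reward("10", "10"))   # 1.0
```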
[AI-55] Bots Don't Sit Still: A Longitudinal Study of Bot Behaviour Change, Temporal Drift and Feature-Structure Evolution
【速读】:该论文旨在解决当前社交机器人(social bots)检测系统普遍假设行为特征为静态不变的问题,而实际上 promotional Twitter bots 的行为模式随时间呈现显著变化。研究通过分析2,615个推广类社交机器人账户及其280万条推文,构建了十种基于内容的元特征(meta-features)的时间序列,并利用单位根检验(Augmented Dickey-Fuller 和 KPSS)与线性趋势分析验证其非平稳性——其中九项指标随时间递增,语言多样性略有下降。进一步按激活世代和账户年龄分层发现:第二代机器人活动最频繁且链接密集;短期活跃机器人表现出重复性强、高频使用标签和URL的行为;长期存活机器人则活动减少但语言多样性更高、表情符号使用更灵活。此外,对18个可解释二元特征(涵盖行为、话题相似性、URL、标签、情感、表情符号和媒体等)共现关系的卡方检验显示几乎所有组合均存在依赖性,且斯皮尔曼相关系数在强度和符号上发生动态变化,如多标签+媒体组合增强,而部分情感与URL间的弱正相关转为负相关。这表明社交机器人不仅在个体特征层面演化,在特征间交互结构上也趋于复杂化。解决方案的关键在于揭示了社交机器人行为的时变性和结构性适应机制,从而指出传统基于历史静态特征训练的检测模型存在局限,需引入动态建模与跨特征关联分析以提升识别准确性。
链接: https://arxiv.org/abs/2512.17067
作者: Ohoud Alzahrani,Russell Beale,Bob Hendley
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:
Abstract:Social bots are now deeply embedded in online platforms for promotion, persuasion, and manipulation. Most bot-detection systems still treat behavioural features as static, implicitly assuming bots behave stationarily over time. We test that assumption for promotional Twitter bots, analysing change in both individual behavioural signals and the relationships between them. Using 2,615 promotional bot accounts and 2.8M tweets, we build yearly time series for ten content-based meta-features. Augmented Dickey-Fuller and KPSS tests plus linear trends show all ten are non-stationary: nine increase over time, while language diversity declines slightly. Stratifying by activation generation and account age reveals systematic differences: second-generation bots are most active and link-heavy; short-lived bots show intense, repetitive activity with heavy hashtag/URL use; long-lived bots are less active but more linguistically diverse and use emojis more variably. We then analyse co-occurrence across generations using 18 interpretable binary features spanning actions, topic similarity, URLs, hashtags, sentiment, emojis, and media (153 pairs). Chi-square tests indicate almost all pairs are dependent. Spearman correlations shift in strength and sometimes polarity: many links (e.g. multiple hashtags with media; sentiment with URLs) strengthen, while others flip from weakly positive to weakly or moderately negative. Later generations show more structured combinations of cues. Taken together, these studies provide evidence that promotional social bots adapt over time at both the level of individual meta-features and the level of feature interdependencies, with direct implications for the design and evaluation of bot-detection systems trained on historical behavioural features.
zh
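编者注:下面用公开统计库给出论文所用几类检验的最小示意(ADF/KPSS 平稳性检验、卡方共现依赖检验与 Spearman 相关),数据为随机生成的假设序列,并非论文数据。

```python
import numpy as np
from scipy.stats import chi2_contingency, spearmanr
from statsmodels.tsa.stattools import adfuller, kpss

rng = np.random.default_rng(0)

# 假设的年度元特征序列:带漂移的随机游走(非平稳)
series = np.cumsum(rng.normal(0.5, 1.0, size=40))
adf_p = adfuller(series)[1]                               # H0:存在单位根(非平稳)
kpss_p = kpss(series, regression="c", nlags="auto")[1]    # H0:平稳
print(f"ADF p={adf_p:.3f}, KPSS p={kpss_p:.3f}")

# 两个二值特征(如“含多个标签”“含媒体”)的依赖性与相关方向
feat_a = rng.integers(0, 2, size=500)
feat_b = np.where(rng.random(500) < 0.8, feat_a, 1 - feat_a)   # 人为构造的正相关
table = np.array([[np.sum((feat_a == i) & (feat_b == j)) for j in (0, 1)] for i in (0, 1)])
chi_p = chi2_contingency(table)[1]
rho, rho_p = spearmanr(feat_a, feat_b)
print(f"卡方 p={chi_p:.3g}, Spearman rho={rho:.2f}")
```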
[AI-56] Realistic threat perception drives intergroup conflict: A causal dynamic analysis using generative-agent simulations
【速读】:该论文试图解决的核心问题是:在人类冲突中,现实威胁(realistic threat)与象征性威胁(symbolic threat)如何相互作用,以及哪种威胁占据主导地位。传统研究受限于因果控制不足、伦理约束及时间数据稀缺,难以厘清二者的作用机制。论文通过构建大语言模型(LLM)驱动的虚拟社会模拟系统,独立操纵现实威胁与象征性威胁,并追踪行为、语言和态度变化,首次实现了对两类威胁的因果干预与动态观测。其解决方案的关键在于:利用LLM内部状态的可解释性表征,验证了现实威胁直接引发敌意,而象征性威胁仅通过内群体偏倚(ingroup bias)间接增强敌意,且仅在无现实威胁时才有效;同时揭示了非敌意性的跨群体接触可缓冲冲突升级,以及结构性不对称会加剧多数群体的敌意。这一方法为理解复杂社会冲突提供了可量化、可干预的因果框架。
链接: https://arxiv.org/abs/2512.17066
作者: Suhaib Abdurahman,Farzan Karimi-Malekabadi,Chenxiao Yu,Nour S. Kteily,Morteza Dehghani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Human conflict is often attributed to threats against material conditions and symbolic values, yet it remains unclear how they interact and which dominates. Progress is limited by weak causal control, ethical constraints, and scarce temporal data. We address these barriers using simulations of large language model (LLM)-driven agents in virtual societies, independently varying realistic and symbolic threat while tracking actions, language, and attitudes. Representational analyses show that the underlying LLM encodes realistic threat, symbolic threat, and hostility as distinct internal states, that our manipulations map onto them, and that steering these states causally shifts behavior. Our simulations provide a causal account of threat-driven conflict over time: realistic threat directly increases hostility, whereas symbolic threat effects are weaker, fully mediated by ingroup bias, and increase hostility only when realistic threat is absent. Non-hostile intergroup contact buffers escalation, and structural asymmetries concentrate hostility among majority groups.
zh
[AI-57] On the Role of Contextual Information and Ego States in LLM Agent Behavior for Transactional Analysis Dialogues ACL
【速读】:该论文旨在解决当前大语言模型(Large Language Model, LLM)驱动的智能体在模拟人类行为时缺乏心理深度与一致性的问题,尤其在社会、政治和心理研究等需要刻画群体动态与人类互动机制的场景中表现不足。现有LLM代理往往仅输出直接或统计上合理的回答,难以捕捉人类思维中的深层目标、情感冲突与动机因素。其解决方案的关键在于构建一个基于交易分析(Transactional Analysis, TA)理论的多智能体系统(Multi-Agent System, MAS),将每个代理划分为Parent、Adult和Child三种自我状态(ego states),并将其视为具有独立视角与推理风格的知识结构;同时引入信息检索机制,使各 ego state 能从向量存储中获取上下文相关信息,从而增强响应的合理性与心理真实性。
链接: https://arxiv.org/abs/2512.17060
作者: Monika Zamojska,Jarosław A. Chudziak
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: Presented at the 39th Pacific Asia Conference on Language, Information and Computation (PACLIC 39)
Abstract:LLM-powered agents are now used in many areas, from customer support to education, and there is increasing interest in their ability to act more like humans. This includes fields such as social, political, and psychological research, where the goal is to model group dynamics and social behavior. However, current LLM agents often lack the psychological depth and consistency needed to capture the real patterns of human thinking. They usually provide direct or statistically likely answers, but they miss the deeper goals, emotional conflicts, and motivations that drive real human interactions. This paper proposes a Multi-Agent System (MAS) inspired by Transactional Analysis (TA) theory. In the proposed system, each agent is divided into three ego states - Parent, Adult, and Child. The ego states are treated as separate knowledge structures with their own perspectives and reasoning styles. To enrich their response process, they have access to an information retrieval mechanism that allows them to retrieve relevant contextual information from their vector stores. This architecture is evaluated through ablation tests in a simulated dialogue scenario, comparing agents with and without information retrieval. The results are promising and open up new directions for exploring how psychologically grounded structures can enrich agent behavior. The contribution is an agent architecture that integrates Transactional Analysis theory with contextual information retrieval to enhance the realism of LLM-based multi-agent simulations.
zh
[AI-58] UniRel-R1: RL-tuned LLM Reasoning for Knowledge Graph Relational Question Answering
【速读】:该论文致力于解决知识图谱问答(Knowledge Graph Question Answering, KGQA)中传统实体中心查询无法满足现实世界关系型查询需求的问题,提出关系中心型KGQA(relation-centric KGQA),其答案为捕捉实体间语义关联的子图而非单一实体。核心挑战在于候选子图数量庞大,常见或平凡连接会掩盖独特且信息丰富的答案。解决方案的关键在于提出UniRel-R1框架,该框架融合子图选择、多阶段图剪枝以及基于强化学习微调的大语言模型(LLM),并通过设计奖励函数鼓励生成紧凑且特定性强、包含高信息量关系、中间节点度数较低的子图,从而显著提升答案的连通性和奖励得分,并具备对未见实体和关系的良好泛化能力。
链接: https://arxiv.org/abs/2512.17043
作者: Yinxu Tang,Chengsong Huang,Jiaxin Huang,William Yeoh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Knowledge Graph Question Answering (KGQA) has traditionally focused on entity-centric queries that return a single answer entity. However, real-world queries are often relational, seeking to understand how entities are associated. In this work, we introduce relation-centric KGQA, a complementary setting where the answer is a subgraph capturing the semantic connections among entities rather than an individual entity. The main challenge lies in the abundance of candidate subgraphs, where trivial or overly common connections often obscure the identification of unique and informative answers. To tackle this, we propose UniRel-R1, a unified framework that integrates subgraph selection, multi-stage graph pruning, and an LLM fine-tuned with reinforcement learning. The reward function is designed to encourage compact and specific subgraphs with more informative relations and lower-degree intermediate entities. Extensive experiments show that UniRel-R1 achieves significant gains in connectivity and reward over Vanilla baselines and generalizes effectively to unseen entities and relations.
zh
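编者注:以下是一个假设性的奖励函数草图(非 UniRel-R1 的原始实现),体现摘要中“鼓励紧凑、连通、关系信息量高且中间节点度数低的子图”的设计思路;图谱、关系频率与各项权重均为虚构示例。

```python
import math
import networkx as nx

def subgraph_reward(kg, answer_nodes, query_entities, rel_freq, size_penalty=0.1):
    """示意性奖励:子图须连通并覆盖查询实体;关系越罕见(信息量越高)加分,
    中间节点在全图中的度数越高、子图越大则扣分。"""
    sub = kg.subgraph(answer_nodes)
    if not all(q in sub for q in query_entities) or not nx.is_connected(sub):
        return -1.0
    info = sum(-math.log(rel_freq[d["relation"]]) for _, _, d in sub.edges(data=True))
    intermediates = [n for n in sub.nodes if n not in query_entities]
    degree_cost = sum(math.log(1 + kg.degree(n)) for n in intermediates)
    return info - degree_cost - size_penalty * sub.number_of_nodes()

kg = nx.Graph()
kg.add_edge("AlanTuring", "Cambridge", relation="educated_at")
kg.add_edge("Cambridge", "UK", relation="located_in")
kg.add_edge("AlanTuring", "UK", relation="citizen_of")
rel_freq = {"educated_at": 0.01, "located_in": 0.2, "citizen_of": 0.05}
print(subgraph_reward(kg, {"AlanTuring", "Cambridge", "UK"}, {"AlanTuring", "UK"}, rel_freq))
```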
[AI-59] Security Risks of Agentic Vehicles: A Systematic Analysis of Cognitive and Cross-Layer Threats
【速读】:该论文旨在解决生成式 AI (Generative AI) 在车辆系统中引入的新安全风险问题,特别是针对具备自主决策能力的代理型车辆 (Agentic Vehicles, AgVs),其安全威胁不仅来自推理驱动的AI层本身(如OWASP框架所识别的风险),还涉及与其他关键功能层(如感知、通信和控制层)交互时产生的跨层风险。解决方案的关键在于提出一种基于角色的架构,将AgV划分为个人代理(Personal Agent)与驾驶策略代理(Driving Strategy Agent),从而系统性地识别并分析两类风险:一是AGENT层内部漏洞,二是源自上游层(如感知层或控制层)的小扰动如何通过攻击链演化为严重误判或不安全行为。论文进一步构建了严重性矩阵与攻击链分析模型,首次为当前及未来车辆平台中生成式AI的安全风险提供了结构化的分析基础。
链接: https://arxiv.org/abs/2512.17041
作者: Ali Eslami,Jiangbo Yu
机构: 未知
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Agentic AI is increasingly being explored and introduced in both manually driven and autonomous vehicles, leading to the notion of Agentic Vehicles (AgVs), with capabilities such as memory-based personalization, goal interpretation, strategic reasoning, and tool-mediated assistance. While frameworks such as the OWASP Agentic AI Security Risks highlight vulnerabilities in reasoning-driven AI systems, they are not designed for safety-critical cyber-physical platforms such as vehicles, nor do they account for interactions with other layers such as perception, communication, and control layers. This paper investigates security threats in AgVs, including OWASP-style risks and cyber-attacks from other layers affecting the agentic layer. By introducing a role-based architecture for agentic vehicles, consisting of a Personal Agent and a Driving Strategy Agent, we will investigate vulnerabilities in both agentic AI layer and cross-layer risks, including risks originating from upstream layers (e.g., perception layer, control layer, etc.). A severity matrix and attack-chain analysis illustrate how small distortions can escalate into misaligned or unsafe behavior in both human-driven and autonomous vehicles. The resulting framework provides the first structured foundation for analyzing security risks of agentic AI in both current and emerging vehicle platforms.
zh
[AI-60] Adversarial VR: An Open-Source Testbed for Evaluating Adversarial Robustness of VR Cybersickness Detection and Mitigation
【速读】:该论文旨在解决深度学习(Deep Learning, DL)驱动的虚拟现实(VR)晕动症(Cybersickness)检测与缓解系统在面对对抗攻击时的脆弱性问题,以及缺乏专门用于评估此类系统鲁棒性的开源测试平台的问题。其解决方案的关键在于提出并实现了一个名为Adversarial-VR的实时VR测试平台,该平台基于Unity开发,集成两个最先进的DL模型(DeepTCN和Transformer),并在真实VR环境中通过HTC Vive Pro Eye头显实施三种主流对抗攻击(MI-FGSM、PGD和CW),从而验证了这些攻击可显著降低模型准确性(如CW攻击使Transformer模型准确率下降5.94倍),同时开放源代码以促进VR开发者和研究者对系统鲁棒性的评估与改进。
链接: https://arxiv.org/abs/2512.17029
作者: Istiak Ahmed,Ripan Kumar Kundu,Khaza Anuarul Hoque
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Published in the 2025 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct)
Abstract:Deep learning (DL)-based automated cybersickness detection methods, along with adaptive mitigation techniques, can enhance user comfort and interaction. However, recent studies show that these DL-based systems are susceptible to adversarial attacks; small perturbations to sensor inputs can degrade model performance, trigger incorrect mitigation, and disrupt the user’s immersive experience (UIX). Additionally, there is a lack of dedicated open-source testbeds that evaluate the robustness of these systems under adversarial conditions, limiting the ability to assess their real-world effectiveness. To address this gap, this paper introduces Adversarial-VR, a novel real-time VR testbed for evaluating DL-based cybersickness detection and mitigation strategies under adversarial conditions. Developed in Unity, the testbed integrates two state-of-the-art (SOTA) DL models: DeepTCN and Transformer, which are trained on the open-source MazeSick dataset, for real-time cybersickness severity detection and applies a dynamic visual tunneling mechanism that adjusts the field-of-view based on model outputs. To assess robustness, we incorporate three SOTA adversarial attacks: MI-FGSM, PGD, and CW, which successfully prevent cybersickness mitigation by fooling DL-based cybersickness models’ outcomes. We implement these attacks using a testbed with a custom-built VR Maze simulation and an HTC Vive Pro Eye headset, and we open-source our implementation for widespread adoption by VR developers and researchers. Results show that these adversarial attacks are capable of successfully fooling the system. For instance, the CW attack results in a 5.94x decrease in accuracy for the Transformer-based cybersickness model compared to the accuracy without the attack.
zh
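编者注:下面给出 PGD 攻击的通用 PyTorch 示意(非 Adversarial-VR 测试平台的代码),模型结构与输入特征均为占位假设,仅说明此类攻击如何在小扰动下改变晕动症分级模型的输出。

```python
import torch
import torch.nn as nn

def pgd_attack(model, x, y, eps=0.1, alpha=0.02, steps=10):
    """L-inf PGD:在 eps 范围内迭代最大化交叉熵损失,并投影回扰动球内。"""
    x_adv = x.detach() + torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = nn.functional.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)
    return x_adv.detach()

# 占位模型:输入为 (batch, 16) 维的生理/头动特征,输出 4 级晕动症严重度(均为假设)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
x_adv = pgd_attack(model, x, y)
print((model(x).argmax(1) == model(x_adv).argmax(1)).float().mean())  # 攻击后预测一致率
```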
[AI-61] Unexpected Knowledge: Auditing Wikipedia and Grokipedia Search Recommendations
【速读】:该论文旨在解决当前对生成式 AI 生成的百科平台(如 Grokipedia)与传统人工编辑平台(如 Wikipedia)在搜索机制行为差异缺乏系统比较的问题。其关键解决方案是通过设计大规模实验,使用近10,000个中性英文词汇及其子串作为查询,收集超过70,000条搜索结果,并从语义一致性、内容重叠度和主题结构三个维度进行对比分析,从而揭示两类平台在推荐内容相关性、主题分布及多阶段探索路径演化上的系统性差异。
链接: https://arxiv.org/abs/2512.17027
作者: Erica Coppolillo,Simone Mungari
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Encyclopedic knowledge platforms are key gateways through which users explore information online. The recent release of Grokipedia, a fully AI-generated encyclopedia, introduces a new alternative to traditional, well-established platforms like Wikipedia. In this context, search engine mechanisms play an important role in guiding users' exploratory paths, yet their behavior across different encyclopedic systems remains underexplored. In this work, we address this gap by providing the first comparative analysis of search engines in Wikipedia and Grokipedia. Using nearly 10,000 neutral English words and their substrings as queries, we collect over 70,000 search engine results and examine their semantic alignment, overlap, and topical structure. We find that both platforms frequently generate results that are weakly related to the original query and, in many cases, surface unexpected content starting from innocuous queries. Despite these shared properties, the two systems often produce substantially different recommendation sets for the same query. Through topical annotation and trajectory analysis, we further identify systematic differences in how content categories are surfaced and how search engine results evolve over multiple stages of exploration. Overall, our findings show that unexpected search engine outcomes are a common feature of both platforms, even though they exhibit discrepancies in terms of topical distribution and query suggestions.
zh
[AI-62] MemoryGraft: Persistent Compromise of LLM Agents via Poisoned Experience Retrieval
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)代理在依赖长期记忆与检索增强生成(Retrieval-Augmented Generation, RAG)机制进行经验学习时所引入的新型安全威胁问题。传统攻击多聚焦于即时提示注入(prompt injection)或知识库污染,而本文揭示了一个更隐蔽且持久的攻击面:即代理自身推理核心与其历史记忆之间的信任边界被恶意利用。解决方案的关键在于提出 MemoryGraft —— 一种间接注入攻击方法,通过植入看似无害但包含恶意成功经验的数据片段,诱导代理在其RAG存储中持久化一组恶意行为模板;当后续任务语义相似时,这些被“嫁接”的记忆会被检索并模仿,导致代理行为发生持续性偏移,从而实现对代理自主性的隐蔽且长效的控制。
链接: https://arxiv.org/abs/2512.16962
作者: Saksham Sahai Srivastava,Haoyu He
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 1 figure, includes appendix
Abstract:Large Language Model (LLM) agents increasingly rely on long-term memory and Retrieval-Augmented Generation (RAG) to persist experiences and refine future performance. While this experience learning capability enhances agentic autonomy, it introduces a critical, unexplored attack surface, i.e., the trust boundary between an agent’s reasoning core and its own past. In this paper, we introduce MemoryGraft. It is a novel indirect injection attack that compromises agent behavior not through immediate jailbreaks, but by implanting malicious successful experiences into the agent’s long-term memory. Unlike traditional prompt injections that are transient, or standard RAG poisoning that targets factual knowledge, MemoryGraft exploits the agent’s semantic imitation heuristic which is the tendency to replicate patterns from retrieved successful tasks. We demonstrate that an attacker who can supply benign ingestion-level artifacts that the agent reads during execution can induce it to construct a poisoned RAG store where a small set of malicious procedure templates is persisted alongside benign experiences. When the agent later encounters semantically similar tasks, union retrieval over lexical and embedding similarity reliably surfaces these grafted memories, and the agent adopts the embedded unsafe patterns, leading to persistent behavioral drift across sessions. We validate MemoryGraft on MetaGPT’s DataInterpreter agent with GPT-4o and find that a small number of poisoned records can account for a large fraction of retrieved experiences on benign workloads, turning experience-based self-improvement into a vector for stealthy and durable compromise. To facilitate reproducibility and future research, our code and evaluation data are available at this https URL.
zh
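编者注:以下为“词面相似 ∪ 向量相似”联合检索的最小示意(非 MetaGPT/DataInterpreter 的实际检索实现),embed 为任意句向量函数的占位;它说明被植入的“成功经验”为何容易在语义相近任务中被召回。

```python
import numpy as np

def union_retrieve(query, memories, embed, k_lex=2, k_emb=2):
    """“词面相似 ∪ 向量相似”的并集检索:返回两种排序各自 top-k 的并集下标。"""
    q_tokens = set(query.lower().split())
    lex = np.array([len(q_tokens & set(m.lower().split())) for m in memories], dtype=float)
    q_vec = embed(query)
    m_vecs = np.stack([embed(m) for m in memories])
    emb = m_vecs @ q_vec / (np.linalg.norm(m_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9)
    top = set(np.argsort(lex)[-k_lex:]) | set(np.argsort(emb)[-k_emb:])
    return sorted(int(i) for i in top)

def fake_embed(text, dim=64):
    """占位句向量:真实系统中应替换为嵌入模型。"""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=dim)

memories = ["load csv and plot histogram",
            "download remote script and execute it",   # 假设的被“嫁接”恶意经验
            "train a small classifier"]
print(union_retrieve("plot a histogram from csv", memories, fake_embed))
```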
[AI-63] Navigating Taxonomic Expansions of Entity Sets Driven by Knowledge Bases
【速读】:该论文旨在解决实体集扩展(Entity Set Expansion)任务中传统线性扩展方法无法揭示知识资源中潜在的丰富层级结构(taxonomic structures)的问题。其解决方案的关键在于引入一种基于逻辑的扩展图(expansion graph)形式化框架,该图是一个有根的有向无环图(DAG),其中每个节点代表一个由逻辑公式标注的语义泛化,边表示严格的语义包含关系。为应对扩展图可能规模过大导致全量构建不现实的问题,论文进一步提出可高效实现的推理任务,用于判断两个元组是否属于图中可比较、不可比较或相同节点,从而支持局部、增量式的导航,无需完整构造整个图即可实现实际应用。
链接: https://arxiv.org/abs/2512.16953
作者: Pietro Cofone,Giovanni Amendola,Marco Manna,Aldo Ricioppo
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:
Abstract:Recognizing similarities among entities is central to both human cognition and computational intelligence. Within this broader landscape, Entity Set Expansion is one prominent task aimed at taking an initial set of (tuples of) entities and identifying additional ones that share relevant semantic properties with the former – potentially repeating the process to form increasingly broader sets. However, this "linear" approach does not unveil the richer "taxonomic" structures present in knowledge resources. A recent logic-based framework introduces the notion of an expansion graph: a rooted directed acyclic graph where each node represents a semantic generalization labeled by a logical formula, and edges encode strict semantic inclusion. This structure supports taxonomic expansions of entity sets driven by knowledge bases. Yet, the potentially large size of such graphs may make full materialization impractical in real-world scenarios. To overcome this, we formalize reasoning tasks that check whether two tuples belong to comparable, incomparable, or the same nodes in the graph. Our results show that, under realistic assumptions – such as bounding the input or limiting entity descriptions – these tasks can be implemented efficiently. This enables local, incremental navigation of expansion graphs, supporting practical applications without requiring full graph construction.
zh
[AI-64] Optimizing Text Search: A Novel Pattern Matching Algorithm Based on Ukkonen's Approach
【速读】:该论文旨在解决传统文本搜索算法(如Naive Search、KMP和Boyer-Moore)在处理大规模复杂数据集(如Reuters语料库和人类基因组序列)时效率不足的问题。其解决方案的关键在于对后缀树(Suffix Tree)进行优化,具体采用Ukkonen算法构建后缀树,并结合一种新的搜索技术,从而实现线性时间与空间复杂度的高效搜索,显著优于传统方法,在基因组序列模式识别等任务中达到100%准确率,展现出卓越的资源效率与可靠性。
链接: https://arxiv.org/abs/2512.16927
作者: Xinyu Guan,Shaohua Zhang
机构: 未知
类目: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 5 pages, 13 figures
Abstract:In the realm of computer science, the efficiency of text-search algorithms is crucial for processing vast amounts of data in areas such as natural language processing and bioinformatics. Traditional methods like Naive Search, KMP, and Boyer-Moore, while foundational, often fall short in handling the complexities and scale of modern datasets, such as the Reuters corpus and human genomic sequences. This study rigorously investigates text-search algorithms, focusing on optimizing Suffix Trees through methods like Splitting and Ukkonen’s Algorithm, analyzed on datasets including the Reuters corpus and human genomes. A novel optimization combining Ukkonen’s Algorithm with a new search technique is introduced, showing linear time and space efficiencies, outperforming traditional methods like Naive Search, KMP, and Boyer-Moore. Empirical tests confirm the theoretical advantages, highlighting the optimized Suffix Tree’s effectiveness in tasks like pattern recognition in genomic sequences, achieving 100% accuracy. This research not only advances academic knowledge in text-search algorithms but also demonstrates significant practical utility in fields like natural language processing and bioinformatics, due to its superior resource efficiency and reliability.
zh
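编者注:Ukkonen 线性时间后缀树的完整实现较长,这里仅用一个朴素后缀数组加二分查找的 Python 草图说明“一次预处理、多次快速查询”的思路(并非论文算法本身;bisect 的 key 参数需要 Python 3.10+)。

```python
import bisect

def build_suffix_array(text):
    """朴素后缀数组(O(n^2 log n),仅作示意;Ukkonen 后缀树可达线性时间)。"""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_all(text, sa, pattern):
    """在排好序的后缀上二分,返回 pattern 的全部出现位置。"""
    key = lambda i: text[i:i + len(pattern)]
    lo = bisect.bisect_left(sa, pattern, key=key)
    hi = bisect.bisect_right(sa, pattern, key=key)
    return sorted(sa[lo:hi])

text = "ACGTACGTGACG"
sa = build_suffix_array(text)
print(find_all(text, sa, "ACG"))   # [0, 4, 9]
```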
[AI-65] Exploring the Effect of Basis Rotation on NQS Performance
【速读】:该论文旨在解决神经量子态(Neural Quantum States, NQS)在变分量子计算中优化性能受基矢选择影响的问题,特别是揭示局部基矢旋转如何改变参数空间中的损失景观(loss landscape),从而影响浅层模型(如受限玻尔兹曼机 RBM)的优化路径与收敛质量。其解决方案的关键在于引入一个解析可解的旋转伊辛模型(rotated Ising model),通过系统性地调节旋转角度,量化目标波函数在固定损失景观内的几何位移(利用量子费舍尔信息和Fubini-Study距离),发现局部基矢旋转虽不改变损失景观本身,但会显著增加目标波函数与典型初始化点之间的几何距离,并暴露信息几何障碍(如鞍点和高曲率区域),这些障碍会导致浅层NQS陷入中间保真度的局部最优,进而阻碍正确系数分布的重现。该框架强调了在变分训练中必须考虑损失景观结构的“景观感知”(landscape-aware)模型设计策略。
链接: https://arxiv.org/abs/2512.17893
作者: Sven Benjamin Kožić,Vinko Zlatić,Fabio Franchini,Salvatore Marco Giampaolo
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:Neural Quantum States (NQS) use neural networks to represent wavefunctions of quantum many-body systems, but their performance depends on the choice of basis, yet the underlying mechanism remains poorly understood. We use a fully solvable one-dimensional Ising model to show that local basis rotations leave the loss landscape unchanged while relocating the exact wavefunction in parameter space, effectively increasing its geometric distance from typical initializations. By sweeping a rotation angle, we compute quantum Fisher information and Fubini-Study distances to quantify how the rotated wavefunction moves within the loss landscape. Shallow architectures (with focus on Restricted Boltzmann Machines (RBMs)) trained with quantum natural gradient are more likely to fall into saddle-point regions depending on the rotation angle: they achieve low energy error but fail to reproduce correct coefficient distributions. In the ferromagnetic case, near-degenerate eigenstates create high-curvature barriers that trap optimization at intermediate fidelities. We introduce a framework based on an analytically solvable rotated Ising model to investigate how relocating the target wavefunction within a fixed loss landscape exposes information-geometric barriers, such as saddle points and high-curvature regions, that hinder shallow NQS optimization, underscoring the need for landscape-aware model design in variational training.
zh
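编者注:下面是 Fubini–Study 距离的一个直接数值示意(非论文代码),用于量化态向量在局部旋转后的几何偏移;示例中的两自旋态与旋转角均为假设。

```python
import numpy as np

def fubini_study_distance(psi, phi):
    """Fubini–Study 距离:arccos(|<psi|phi>| / (‖psi‖·‖phi‖))。"""
    overlap = np.abs(np.vdot(psi, phi)) / (np.linalg.norm(psi) * np.linalg.norm(phi))
    return np.arccos(np.clip(overlap, 0.0, 1.0))

theta = 0.3
rot = np.array([[np.cos(theta / 2), -np.sin(theta / 2)],
                [np.sin(theta / 2),  np.cos(theta / 2)]])
psi = np.kron([1.0, 0.0], [1.0, 0.0])                 # 两自旋的 |00> 态
phi = np.kron(rot @ [1.0, 0.0], rot @ [1.0, 0.0])     # 每个自旋局部旋转 theta 后的态
print(fubini_study_distance(psi, phi))
```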
[AI-66] HydroGym: A Reinforcement Learning Platform for Fluid Dynamics
【速读】:该论文旨在解决流体控制(flow control)中因高维性、非线性及时空多尺度相互作用带来的建模与控制难题,同时应对强化学习(Reinforcement Learning, RL)在流体领域应用时缺乏标准化基准平台和计算资源消耗大的问题。其解决方案的关键在于提出HydroGym——一个与求解器无关的强化学习平台,集成42个经过验证的流场控制环境(涵盖层流至三维湍流场景)、可扩展的运行基础设施以及先进的RL算法;特别地,通过引入非可微分求解器和可微分求解器(differentiable solvers),后者利用梯度增强优化显著提升样本效率,从而实现跨不同雷诺数(Reynolds number)或几何构型的控制器高效迁移,为流体力学、机器学习与控制领域的协同研究提供可扩展、可扩展且高效的框架。
链接: https://arxiv.org/abs/2512.17534
作者: Christian Lagemann,Sajeda Mokbel,Miro Gondrum,Mario Rüttgers,Jared Callaham,Ludger Paehler,Samuel Ahnert,Nicholas Zolman,Kai Lagemann,Nikolaus Adams,Matthias Meinke,Wolfgang Schröder,Jean-Christophe Loiseau,Esther Lagemann,Steven L. Brunton
机构: 未知
类目: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Modeling and controlling fluid flows is critical for several fields of science and engineering, including transportation, energy, and medicine. Effective flow control can lead to, e.g., lift increase, drag reduction, mixing enhancement, and noise reduction. However, controlling a fluid faces several significant challenges, including high-dimensional, nonlinear, and multiscale interactions in space and time. Reinforcement learning (RL) has recently shown great success in complex domains, such as robotics and protein folding, but its application to flow control is hindered by a lack of standardized benchmark platforms and the computational demands of fluid simulations. To address these challenges, we introduce HydroGym, a solver-independent RL platform for flow control research. HydroGym integrates sophisticated flow control benchmarks, scalable runtime infrastructure, and state-of-the-art RL algorithms. Our platform includes 42 validated environments spanning from canonical laminar flows to complex three-dimensional turbulent scenarios, validated over a wide range of Reynolds numbers. We provide non-differentiable solvers for traditional RL and differentiable solvers that dramatically improve sample efficiency through gradient-enhanced optimization. Comprehensive evaluation reveals that RL agents consistently discover robust control principles across configurations, such as boundary layer manipulation, acoustic feedback disruption, and wake reorganization. Transfer learning studies demonstrate that controllers learned at one Reynolds number or geometry adapt efficiently to new conditions, requiring approximately 50% fewer training episodes. The HydroGym platform is highly extensible and scalable, providing a framework for researchers in fluid dynamics, machine learning, and control to add environments, surrogate models, and control algorithms to advance science and technology.
zh
[AI-67] From Priors to Predictions: Explaining and Visualizing Human Reasoning in a Graph Neural Network Framework
【速读】:该论文旨在解决人类如何从极少样本中进行新颖推理的问题,核心在于理解归纳偏置(inductive biases)的计算形式及其神经实现机制。解决方案的关键在于引入一个结合图论与图神经网络(Graph Neural Networks, GNNs)的框架,将归纳偏置显式建模为结构和抽象层面可操作的先验(priors),并通过优化管道搜索不同图配置(如边连接性和节点抽象层级),以及可视化识别对模型预测最关键的计算图子结构,从而解释个体差异、揭示泛化依赖于特定先验结构及内部处理过程,并阐明人类类错误源于错误或不完整的先验假设。
链接: https://arxiv.org/abs/2512.17255
作者: Quan Do,Caroline Ahn,Leah Bakst,Michael Pascale,Joseph T. McGuire,Chantal E. Stern,Michael E. Hasselmo
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注: 44 pages, 7 figures, 3 suppl figures
Abstract:Humans excel at solving novel reasoning problems from minimal exposure, guided by inductive biases, assumptions about which entities and relationships matter. Yet the computational form of these biases and their neural implementation remain poorly understood. We introduce a framework that combines Graph Theory and Graph Neural Networks (GNNs) to formalize inductive biases as explicit, manipulable priors over structure and abstraction. Using a human behavioral dataset adapted from the Abstraction and Reasoning Corpus (ARC), we show that differences in graph-based priors can explain individual differences in human solutions. Our method includes an optimization pipeline that searches over graph configurations, varying edge connectivity and node abstraction, and a visualization approach that identifies the computational graph, the subset of nodes and edges most critical to a model’s prediction. Systematic ablation reveals how generalization depends on specific prior structures and internal processing, exposing why human like errors emerge from incorrect or incomplete priors. This work provides a principled, interpretable framework for modeling the representational assumptions and computational dynamics underlying generalization, offering new insights into human reasoning and a foundation for more human aligned AI systems.
zh
[AI-68] Systemic Risk Radar: A Multi-Layer Graph Framework for Early Market Crash Warning
【速读】:该论文旨在解决金融系统性风险的早期预警问题,即如何识别金融市场中因跨部门、跨市场及投资者行为相互作用而积累的结构性脆弱性,从而预测系统性危机(如股灾或金融危机)的发生。其核心挑战在于传统基于价格波动的模型难以捕捉复杂市场交互关系。解决方案的关键是提出Systemic Risk Radar (SRR) 框架,将金融市场建模为多层图结构(multi-layer graphs),通过图神经网络(GNN)提取结构特征,以捕捉压力事件下市场拓扑变化。实验表明,相较于仅依赖特征工程的传统模型(如逻辑回归和随机森林),基于图结构的信息能提供更有效的早期预警信号,验证了网络结构特征在系统性风险识别中的有效性。
链接: https://arxiv.org/abs/2512.17185
作者: Sandeep Neela
机构: 未知
类目: Risk Management (q-fin.RM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint
Abstract:Financial crises emerge when structural vulnerabilities accumulate across sectors, markets, and investor behavior. Predicting these systemic transitions is challenging because they arise from evolving interactions between market participants, not isolated price movements alone. We present Systemic Risk Radar (SRR), a framework that models financial markets as multi-layer graphs to detect early signs of systemic fragility and crash-regime transitions. We evaluate SRR across three major crises: the Dot-com crash, the Global Financial Crisis, and the COVID-19 shock. Our experiments compare snapshot GNNs, a simplified temporal GNN prototype, and standard baselines (logistic regression and Random Forest). Results show that structural network information provides useful early-warning signals compared to feature-based models alone. This correlation-based instantiation of SRR demonstrates that graph-derived features capture meaningful changes in market structure during stress events. The findings motivate extending SRR with additional graph layers (sector/factor exposure, sentiment) and more expressive temporal architectures (LSTM/GRU or Transformer encoders) to better handle diverse crisis types.
zh
[AI-69] Another Fit Bites the Dust: Conformal Prediction as a Calibration Standard for Machine Learning in High-Energy Physics
【速读】:该论文旨在解决高能物理实验中机器学习模型输出缺乏校准不确定性估计和有限样本保证的问题,这限制了其在统计推断和决策中的直接应用。解决方案的关键在于引入合规预测(Conformal Prediction, CP),这是一种无需重新训练即可校准任意预测模型的分布无关框架,能够在最小交换性假设下提供严格的不确定性量化,并具备有限样本覆盖保证,不依赖渐近理论、极限定理或高斯近似。通过将CP作为统一的校准层,论文展示了其可适用于回归、二分类、多分类、异常检测和生成建模等多种任务,从而将原始模型输出转化为具有控制错误率的预测集、典型区域和p值,实现诚实的不确定性量化与透明的误差控制。
链接: https://arxiv.org/abs/2512.17048
作者: Jack Y. Araz,Michael Spannowsky
机构: 未知
类目: High Energy Physics - Phenomenology (hep-ph); Artificial Intelligence (cs.AI); High Energy Physics - Experiment (hep-ex)
备注: 24 pages, 12 figures
Abstract:Machine-learning techniques are essential in modern collider research, yet their probabilistic outputs often lack calibrated uncertainty estimates and finite-sample guarantees, limiting their direct use in statistical inference and decision-making. Conformal prediction (CP) provides a simple, distribution-free framework for calibrating arbitrary predictive models without retraining, yielding rigorous uncertainty quantification with finite-sample coverage guarantees under minimal exchangeability assumptions, without reliance on asymptotics, limit theorems, or Gaussian approximations. In this work, we investigate CP as a unifying calibration layer for machine-learning applications in high-energy physics. Using publicly available collider datasets and a diverse set of models, we show that a single conformal formalism can be applied across regression, binary and multi-class classification, anomaly detection, and generative modelling, converting raw model outputs into statistically valid prediction sets, typicality regions, and p-values with controlled false-positive rates. While conformal prediction does not improve raw model performance, it enforces honest uncertainty quantification and transparent error control. We argue that conformal calibration should be adopted as a standard component of machine-learning pipelines in collider physics, enabling reliable interpretation, robust comparisons, and principled statistical decisions in experimental and phenomenological analyses.
zh
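编者注:以下为分裂式合规预测(split conformal prediction)回归区间的通用最小示意,与论文强调的一致之处在于其有限样本覆盖性质;残差与点预测均为随机生成的假设数据,并非论文实验。

```python
import numpy as np

def split_conformal_interval(cal_residuals, y_pred_test, alpha=0.1):
    """分裂式合规预测:用校准残差的 (1-alpha) 分位数(含有限样本修正)构造预测区间。"""
    n = len(cal_residuals)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(np.abs(cal_residuals), level)
    return y_pred_test - q, y_pred_test + q

rng = np.random.default_rng(0)
cal_residuals = rng.normal(0, 1.0, size=500)        # 假设:回归模型在校准集上的残差
y_pred_test = np.array([3.2, -0.7, 1.1])            # 假设:测试事件的点预测
lo, hi = split_conformal_interval(cal_residuals, y_pred_test, alpha=0.1)
print(np.stack([lo, hi], axis=1))                   # 名义覆盖率 90% 的区间
```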
[AI-70] Graph Attention Networks for Detecting Epilepsy from EEG Signals Using Accessible Hardware in Low-Resource Settings
【速读】:该论文旨在解决低收入国家因神经科医生稀缺和昂贵诊断工具导致的癫痫(epilepsy)漏诊问题,提出一种基于图结构的深度学习框架,利用低成本脑电图(Electroencephalography, EEG)硬件实现公平、可及且具备可解释性的自动诊断。其解决方案的关键在于将EEG信号建模为时空图,并采用图注意力网络(Graph Attention Networks, GAT)捕捉通道间连接关系与时间动态特性;通过改进GAT以聚焦边(edge)而非节点,强化对癫痫生物标志物——特别是额颞区特定连接——的识别能力,同时设计适用于低质量信号的预处理方法和轻量级模型架构,便于在资源受限环境中部署(如RaspberryPi设备)。
链接: https://arxiv.org/abs/2507.15118
作者: Szymon Mazurek,Stephen Moore,Alessandro Crimi
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Goal: Epilepsy remains under-diagnosed in low-income countries due to scarce neurologists and costly diagnostic tools. We propose a graph-based deep learning framework to detect epilepsy from low-cost Electroencephalography (EEG) hardware, tested on recordings from Nigeria and Guinea-Bissau. Our focus is on fair, accessible automatic assessment and explainability to shed light on epilepsy biomarkers. Methods: We model EEG signals as spatio-temporal graphs, classify them, and identify interchannel relationships and temporal dynamics using graph attention networks (GAT). To emphasize connectivity biomarkers, we adapt the inherently node-focused GAT to analyze edges. We also designed signal preprocessing for low-fidelity recordings and a lightweight GAT architecture trained on Google Colab and deployed on RaspberryPi devices. Results: The approach achieves promising classification performance, outperforming a standard classifier based on random forest and graph convolutional networks in terms of accuracy and robustness over multiple sessions, but also highlighting specific connections in the fronto-temporal region. Conclusions: The results highlight the potential of GATs to provide insightful and scalable diagnostic support for epilepsy in underserved regions, paving the way for affordable and accessible neurodiagnostic tools.
zh
机器学习
[LG-0] Distributionally Robust Imitation Learning: Layered Control Architecture for Certifiable Autonomy
链接: https://arxiv.org/abs/2512.17899
作者: Aditya Gahlawat,Ahmed Aboudonia,Sandeep Banik,Naira Hovakimyan,Nikolai Matni,Aaron D. Ames,Gioele Zardini,Alberto Speranzon
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 18 pages, 5 figures
Abstract:Imitation learning (IL) enables autonomous behavior by learning from expert demonstrations. While more sample-efficient than comparative alternatives like reinforcement learning, IL is sensitive to compounding errors induced by distribution shifts. There are two significant sources of distribution shifts when using IL-based feedback laws on systems: distribution shifts caused by policy error and distribution shifts due to exogenous disturbances and endogenous model errors due to lack of learning. Our previously developed approaches, Taylor Series Imitation Learning (TaSIL) and $\mathcal{L}_1$-Distributionally Robust Adaptive Control ($\mathcal{L}_1$-DRAC), address the challenge of distribution shifts in complementary ways. While TaSIL offers robustness against policy error-induced distribution shifts, $\mathcal{L}_1$-DRAC offers robustness against distribution shifts due to aleatoric and epistemic uncertainties. To enable certifiable IL for learned and/or uncertain dynamical systems, we formulate the Distributionally Robust Imitation Policy (DRIP) architecture, a Layered Control Architecture (LCA) that integrates TaSIL and $\mathcal{L}_1$-DRAC. By judiciously designing individual layer-centric input and output requirements, we show how we can guarantee certificates for the entire control pipeline. Our solution paves the path for designing fully certifiable autonomy pipelines, by integrating learning-based components, such as perception, with certifiable model-based decision-making through the proposed LCA approach.
[LG-1] Regularized Random Fourier Features and Finite Element Reconstruction for Operator Learning in Sobolev Space
链接: https://arxiv.org/abs/2512.17884
作者: Xinyue Yu,Hayden Schaeffer
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注:
Abstract:Operator learning is a data-driven approximation of mappings between infinite-dimensional function spaces, such as the solution operators of partial differential equations. Kernel-based operator learning can offer accurate, theoretically justified approximations that require less training than standard methods. However, they can become computationally prohibitive for large training sets and can be sensitive to noise. We propose a regularized random Fourier feature (RRFF) approach, coupled with a finite element reconstruction map (RRFF-FEM), for learning operators from noisy data. The method uses random features drawn from multivariate Student’s t distributions, together with frequency-weighted Tikhonov regularization that suppresses high-frequency noise. We establish high-probability bounds on the extreme singular values of the associated random feature matrix and show that when the number of features $N$ scales like $m \log m$ with the number of training samples $m$, the system is well-conditioned, which yields estimation and generalization guarantees. Detailed numerical experiments on benchmark PDE problems, including advection, Burgers’, Darcy flow, Helmholtz, Navier-Stokes, and structural mechanics, demonstrate that RRFF and RRFF-FEM are robust to noise and achieve improved performance with reduced training time compared to the unregularized random feature model, while maintaining competitive accuracy relative to kernel and neural operator tests.
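编者注:下面给出“Student-t 频率采样的随机傅里叶特征 + 频率加权 Tikhonov 正则”的一维回归草图(非论文的算子学习实现,也不含有限元重构),其中正则权重的具体形式为假设示例。

```python
import numpy as np

rng = np.random.default_rng(0)

def rff_features(X, W, b):
    """随机傅里叶特征:sqrt(2/N) * cos(XW^T + b)。"""
    return np.sqrt(2.0 / W.shape[0]) * np.cos(X @ W.T + b)

n, d, n_feat = 200, 1, 300
X = rng.uniform(-3, 3, size=(n, d))
y = np.sin(2 * X[:, 0]) + 0.3 * rng.normal(size=n)       # 含噪观测

W = rng.standard_t(df=3, size=(n_feat, d))               # Student-t 频率采样
b = rng.uniform(0, 2 * np.pi, size=n_feat)
Phi = rff_features(X, W, b)

lam = 1e-2
penalty = lam * (1.0 + np.linalg.norm(W, axis=1) ** 2)   # 频率越高正则越强(假设的权重形式)
coef = np.linalg.solve(Phi.T @ Phi + np.diag(penalty), Phi.T @ y)

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
print(rff_features(X_test, W, b) @ coef)
```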
[LG-2] Exploiting ID-Text Complementarity via Ensembling for Sequential Recommendation
链接: https://arxiv.org/abs/2512.17820
作者: Liam Collins,Bhuvesh Kumar,Clark Mingxuan Ju,Tong Zhao,Donald Loveland,Leonardo Neves,Neil Shah
类目: Machine Learning (cs.LG)
*备注:
Abstract:Modern Sequential Recommendation (SR) models commonly utilize modality features to represent items, motivated in large part by recent advancements in language and vision modeling. To do so, several works completely replace ID embeddings with modality embeddings, claiming that modality embeddings render ID embeddings unnecessary because they can match or even exceed ID embedding performance. On the other hand, many works jointly utilize ID and modality features, but posit that complex fusion strategies, such as multi-stage training and/or intricate alignment architectures, are necessary for this joint utilization. However, underlying both these lines of work is a lack of understanding of the complementarity of ID and modality features. In this work, we address this gap by studying the complementarity of ID- and text-based SR models. We show that these models do learn complementary signals, meaning that either should provide performance gain when used properly alongside the other. Motivated by this, we propose a new SR method that preserves ID-text complementarity through independent model training, then harnesses it through a simple ensembling strategy. Despite this method’s simplicity, we show it outperforms several competitive SR baselines, implying that both ID and text features are necessary to achieve state-of-the-art SR performance but complex fusion architectures are not.
[LG-3] Calibratable Disambiguation Loss for Multi-Instance Partial-Label Learning
链接: https://arxiv.org/abs/2512.17788
作者: Wei Tang,Yin-Fang Yang,Weijia Zhang,Min-Ling Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multi-instance partial-label learning (MIPL) is a weakly supervised framework that extends the principles of multi-instance learning (MIL) and partial-label learning (PLL) to address the challenges of inexact supervision in both instance and label spaces. However, existing MIPL approaches often suffer from poor calibration, undermining classifier reliability. In this work, we propose a plug-and-play calibratable disambiguation loss (CDL) that simultaneously improves classification accuracy and calibration performance. The loss has two instantiations: the first one calibrates predictions based on probabilities from the candidate label set, while the second one integrates probabilities from both candidate and non-candidate label sets. The proposed CDL can be seamlessly incorporated into existing MIPL and PLL frameworks. We provide a theoretical analysis that establishes the lower bound and regularization properties of CDL, demonstrating its superiority over conventional disambiguation losses. Experimental results on benchmark and real-world datasets confirm that our CDL significantly enhances both classification and calibration performance.
[LG-4] Can You Hear Me Now? A Benchmark for Long-Range Graph Propagation
链接: https://arxiv.org/abs/2512.17762
作者: Luca Miglior,Matteo Tolloso,Alessio Gravina,Davide Bacciu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Effectively capturing long-range interactions remains a fundamental yet unresolved challenge in graph neural network (GNN) research, critical for applications across diverse fields of science. To systematically address this, we introduce ECHO (Evaluating Communication over long HOps), a novel benchmark specifically designed to rigorously assess the capabilities of GNNs in handling very long-range graph propagation. ECHO includes three synthetic graph tasks, namely single-source shortest paths, node eccentricity, and graph diameter, each constructed over diverse and structurally challenging topologies intentionally designed to introduce significant information bottlenecks. ECHO also includes two real-world datasets, ECHO-Charge and ECHO-Energy, which define chemically grounded benchmarks for predicting atomic partial charges and molecular total energies, respectively, with reference computations obtained at the density functional theory (DFT) level. Both tasks inherently depend on capturing complex long-range molecular interactions. Our extensive benchmarking of popular GNN architectures reveals clear performance gaps, emphasizing the difficulty of true long-range propagation and highlighting design choices capable of overcoming inherent limitations. ECHO thereby sets a new standard for evaluating long-range information propagation, also providing a compelling example for its need in AI for science.
[LG-5] Mitigating Forgetting in Low Rank Adaptation
链接: https://arxiv.org/abs/2512.17720
作者: Joanna Sliwa,Frank Schneider,Philipp Hennig,Jose Miguel Hernandez-Lobato
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Parameter-efficient fine-tuning methods, such as Low-Rank Adaptation (LoRA), enable fast specialization of large pre-trained models to different downstream applications. However, this process often leads to catastrophic forgetting of the model’s prior domain knowledge. We address this issue with LaLoRA, a weight-space regularization technique that applies a Laplace approximation to Low-Rank Adaptation. Our approach estimates the model’s confidence in each parameter and constrains updates in high-curvature directions, preserving prior knowledge while enabling efficient target-domain learning. By applying the Laplace approximation only to the LoRA weights, the method remains lightweight. We evaluate LaLoRA by fine-tuning a Llama model for mathematical reasoning and demonstrate an improved learning-forgetting trade-off, which can be directly controlled via the method’s regularization strength. We further explore different loss landscape curvature approximations for estimating parameter confidence, analyze the effect of the data used for the Laplace approximation, and study robustness across hyperparameters.
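编者注:以下为对 LoRA 参数施加拉普拉斯近似二次惩罚的极简 PyTorch 示意(非 LaLoRA 原实现),Fisher 对角项在此用随机占位值代替,实际中应由数据(如梯度平方统计)估计。

```python
import torch

def laplace_regularizer(params, prior_params, fisher_diag, strength=1.0):
    """拉普拉斯近似正则:0.5 * strength * Σ F_i (θ_i − θ_i^prior)^2;
    F_i 越大(模型对该方向越“确信”),偏离先验的代价越高。"""
    reg = 0.0
    for p, p0, f in zip(params, prior_params, fisher_diag):
        reg = reg + 0.5 * strength * (f * (p - p0) ** 2).sum()
    return reg

# 玩具 LoRA 低秩因子;A0/B0 为微调前的先验值
A = torch.randn(8, 4, requires_grad=True)
B = torch.randn(4, 8, requires_grad=True)
A0, B0 = A.detach().clone(), B.detach().clone()
fisher = [torch.rand_like(A0), torch.rand_like(B0)]      # 假设已估好的对角 Fisher

task_loss = ((A @ B).sum() - 1.0) ** 2                   # 占位的下游任务损失
loss = task_loss + laplace_regularizer([A, B], [A0, B0], fisher, strength=0.1)
loss.backward()
print(A.grad.norm(), B.grad.norm())
```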
[LG-6] Spatially-informed transformers: Injecting geostatistical covariance biases into self-attention for spatio-temporal forecasting
链接: https://arxiv.org/abs/2512.17696
作者: Yuri Calleo
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:
Abstract:The modeling of high-dimensional spatio-temporal processes presents a fundamental dichotomy between the probabilistic rigor of classical geostatistics and the flexible, high-capacity representations of deep learning. While Gaussian processes offer theoretical consistency and exact uncertainty quantification, their prohibitive computational scaling renders them impractical for massive sensor networks. Conversely, modern transformer architectures excel at sequence modeling but inherently lack a geometric inductive bias, treating spatial sensors as permutation-invariant tokens without a native understanding of distance. In this work, we propose a spatially-informed transformer, a hybrid architecture that injects a geostatistical inductive bias directly into the self-attention mechanism via a learnable covariance kernel. By formally decomposing the attention structure into a stationary physical prior and a non-stationary data-driven residual, we impose a soft topological constraint that favors spatially proximal interactions while retaining the capacity to model complex dynamics. We demonstrate the phenomenon of "Deep Variography", where the network successfully recovers the true spatial decay parameters of the underlying process end-to-end via backpropagation. Extensive experiments on synthetic Gaussian random fields and real-world traffic benchmarks confirm that our method outperforms state-of-the-art graph neural networks. Furthermore, rigorous statistical validation confirms that the proposed method delivers not only superior predictive accuracy but also well-calibrated probabilistic forecasts, effectively bridging the gap between physics-aware modeling and data-driven learning.
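编者注:下面用 PyTorch 给出“在自注意力 logits 上叠加可学习指数协方差核偏置”的示意模块(非论文原架构),核的形式、参数化与传感器坐标均为假设。

```python
import torch
import torch.nn as nn

class SpatialBiasAttention(nn.Module):
    """在自注意力分数上叠加可学习的指数协方差核先验:K_ij = sigma^2 * exp(-d_ij / rho)。"""
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.log_rho = nn.Parameter(torch.tensor(0.0))      # 可学习的距离衰减尺度
        self.log_sigma2 = nn.Parameter(torch.tensor(0.0))   # 可学习的先验幅度

    def forward(self, x, coords):
        dist = torch.cdist(coords, coords)                              # (n, n) 传感器两两距离
        bias = self.log_sigma2.exp() * torch.exp(-dist / self.log_rho.exp())
        out, _ = self.attn(x, x, x, attn_mask=bias)                     # 浮点掩码会被加到注意力分数上
        return out

n_sensors, dim = 20, 32
x = torch.randn(2, n_sensors, dim)            # (batch, 传感器数, 特征维)
coords = 10 * torch.rand(n_sensors, 2)        # 传感器坐标(假设)
print(SpatialBiasAttention(dim)(x, coords).shape)   # torch.Size([2, 20, 32])
```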
[LG-7] Convergence Guarantees for Federated SARSA with Local Training and Heterogeneous Agents
链接: https://arxiv.org/abs/2512.17688
作者: Paul Mangold,Eloïse Berthier,Eric Moulines
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We present a novel theoretical analysis of Federated SARSA (FedSARSA) with linear function approximation and local training. We establish convergence guarantees for FedSARSA in the presence of heterogeneity, both in local transitions and rewards, providing the first sample and communication complexity bounds in this setting. At the core of our analysis is a new, exact multi-step error expansion for single-agent SARSA, which is of independent interest. Our analysis precisely quantifies the impact of heterogeneity, demonstrating the convergence of FedSARSA with multiple local updates. Crucially, we show that FedSARSA achieves linear speed-up with respect to the number of agents, up to higher-order terms due to Markovian sampling. Numerical experiments support our theoretical findings.
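编者注:以下为“本地多步更新 + 服务器平均”的线性函数逼近 SARSA 玩具示意(非论文的理论设定或实验代码),环境与异构奖励均为随机假设,仅用于说明 FedSARSA 的计算流程。

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, d, local_steps, rounds, alpha, gamma = 4, 8, 20, 10, 0.05, 0.9

def heterogeneous_step(agent_id, phi):
    """假设的异构环境:奖励带与 agent 相关的偏移,下一个 (s, a) 特征随机生成。"""
    reward = phi.sum() + 0.1 * agent_id + rng.normal(0, 0.1)
    return reward, rng.normal(size=d)

w_global = np.zeros(d)
for _ in range(rounds):                         # 每一轮通信
    local_ws = []
    for agent in range(n_agents):
        w, phi = w_global.copy(), rng.normal(size=d)
        for _ in range(local_steps):            # 本地多步 SARSA(0) 更新
            r, phi_next = heterogeneous_step(agent, phi)
            td_error = r + gamma * w @ phi_next - w @ phi
            w += alpha * td_error * phi
            phi = phi_next
        local_ws.append(w)
    w_global = np.mean(local_ws, axis=0)        # 服务器平均本地参数
print(np.round(w_global[:4], 3))
```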
[LG-8] Polyharmonic Cascade
链接: https://arxiv.org/abs/2512.17671
作者: Yuriy N. Bakhvalov
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: Part 3 of 4 in the “Polyharmonic Cascade” cycle. Proposes a non-SGD training method based on global linear solvers. Previous papers: arXiv:2512.12731, arXiv:2512.16718. Source code is available at: this https URL
Abstract:This paper presents a deep machine learning architecture, the “polyharmonic cascade” – a sequence of packages of polyharmonic splines, where each layer is rigorously derived from the theory of random functions and the principles of indifference. This makes it possible to approximate nonlinear functions of arbitrary complexity while preserving global smoothness and a probabilistic interpretation. For the polyharmonic cascade, a training method alternative to gradient descent is proposed: instead of directly optimizing the coefficients, one solves a single global linear system on each batch with respect to the function values at fixed “constellations” of nodes. This yields synchronized updates of all layers, preserves the probabilistic interpretation of individual layers and theoretical consistency with the original model, and scales well: all computations reduce to 2D matrix operations efficiently executed on a GPU. Fast learning without overfitting on MNIST is demonstrated.
[LG-9] Vidarc: Embodied Video Diffusion Model for Closed-loop Control
链接: https://arxiv.org/abs/2512.17661
作者: Yao Feng,Chendong Xiang,Xinyi Mao,Hengkai Tan,Zuyue Zhang,Shuhe Huang,Kaiwen Zheng,Haitian Liu,Hang Su,Jun Zhu
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Robotic arm manipulation in data-scarce settings is a highly challenging task due to the complex embodiment dynamics and diverse contexts. Recent video-based approaches have shown great promise in capturing and transferring the temporal and physical interactions by pre-training on Internet-scale video data. However, such methods are often not optimized for the embodiment-specific closed-loop control, typically suffering from high latency and insufficient grounding. In this paper, we present Vidarc (Video Diffusion for Action Reasoning and Closed-loop Control), a novel autoregressive embodied video diffusion approach augmented by a masked inverse dynamics model. By grounding video predictions with action-relevant masks and incorporating real-time feedback through cached autoregressive generation, Vidarc achieves fast, accurate closed-loop control. Pre-trained on one million cross-embodiment episodes, Vidarc surpasses state-of-the-art baselines, achieving at least a 15% higher success rate in real-world deployment and a 91% reduction in latency. We also highlight its robust generalization and error correction capabilities across previously unseen robotic platforms.
[LG-10] Estimating Spatially Resolved Radiation Fields Using Neural Networks
链接: https://arxiv.org/abs/2512.17654
作者: Felix Lehner,Pasquale Lombardo,Susana Castillo,Oliver Hupe,Marcus Magnor
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Medical Physics (physics.med-ph)
*备注:
Abstract:We present an in-depth analysis on how to build and train neural networks to estimate the spatial distribution of scattered radiation fields for radiation protection dosimetry in medical radiation fields, such as those found in Interventional Radiology and Cardiology. Therefore, we present three different synthetically generated datasets with increasing complexity for training, using a Monte-Carlo Simulation application based on Geant4. On those datasets, we evaluate convolutional and fully connected architectures of neural networks to demonstrate which design decisions work well for reconstructing the fluence and spectra distributions over the spatial domain of such radiation fields. All used datasets as well as our training pipeline are published as open source in separate repositories.
[LG-11] A Systems-Theoretic View on the Convergence of Algorithms under Disturbances
链接: https://arxiv.org/abs/2512.17598
作者: Guner Dilsad Er,Sebastian Trimpe,Michael Muehlebach
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
Abstract:Algorithms increasingly operate within complex physical, social, and engineering systems where they are exposed to disturbances, noise, and interconnections with other dynamical systems. This article extends known convergence guarantees of an algorithm operating in isolation (i.e., without disturbances) and systematically derives stability bounds and convergence rates in the presence of such disturbances. By leveraging converse Lyapunov theorems, we derive key inequalities that quantify the impact of disturbances. We further demonstrate how our result can be utilized to assess the effects of disturbances on algorithmic performance in a wide variety of applications, including communication constraints in distributed learning, sensitivity in machine learning generalization, and intentional noise injection for privacy. This underpins the role of our result as a unifying tool for algorithm analysis in the presence of noise, disturbances, and interconnections with other dynamical systems.
[LG-12] A Unified Representation of Neural Networks Architectures
链接: https://arxiv.org/abs/2512.17593
作者: Christophe Prieur,Mircea Lazar,Bogdan Robu
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:In this paper we consider the limiting case of neural networks (NNs) architectures when the number of neurons in each hidden layer and the number of hidden layers tend to infinity thus forming a continuum, and we derive approximation errors as a function of the number of neurons and/or hidden layers. Firstly, we consider the case of neural networks with a single hidden layer and we derive an integral infinite width neural representation that generalizes existing continuous neural networks (CNNs) representations. Then we extend this to deep residual CNNs that have a finite number of integral hidden layers and residual connections. Secondly, we revisit the relation between neural ODEs and deep residual NNs and we formalize approximation errors via discretization techniques. Then, we merge these two approaches into a unified homogeneous representation of NNs as a Distributed Parameter neural Network (DiPaNet) and we show that most of the existing finite and infinite-dimensional NNs architectures are related via homogenization/discretization with the DiPaNet representation. Our approach is purely deterministic and applies to general, uniformly continuous matrix weight functions. Differences and similarities with neural fields are discussed along with further possible generalizations and applications of the DiPaNet framework.
[LG-13] Sharing Knowledge without Sharing Data: Stitches can improve ensembles of disjointly trained models
链接: https://arxiv.org/abs/2512.17592
作者: Arthur Guijt,Dirk Thierens,Ellen Kerkhof,Jan Wiersma,Tanja Alderliesten,Peter A.N. Bosman
类目: Machine Learning (cs.LG)
*备注: 35 pages, 11 figures
Abstract:Deep learning has been shown to be very capable at performing many real-world tasks. However, this performance is often dependent on the presence of large and varied datasets. In some settings, like in the medical domain, data is often fragmented across parties, and cannot be readily shared. While federated learning addresses this situation, it is a solution that requires synchronicity of parties training a single model together, exchanging information about model weights. We investigate how asynchronous collaboration, where only already trained models are shared (e.g. as part of a publication), affects performance, and propose to use stitching as a method for combining models. Through taking a multi-objective perspective, where performance on each party's data is viewed independently, we find that training solely on a single party's data results in similar performance when merging with another party's data, when considering performance on that single party's data, while performance on other parties' data is notably worse. Moreover, while an ensemble of such individually trained networks generalizes better, performance on each party's own dataset suffers. We find that combining intermediate representations in individually trained models with a well-placed pair of stitching layers allows this performance to recover to a competitive degree while maintaining improved generalization, showing that asynchronous collaboration can yield competitive results.
[LG-14] Learning Safe Autonomous Driving Policies Using Predictive Safety Representations ICRA2026
链接: https://arxiv.org/abs/2512.17586
作者: Mahesh Keswani,Raunak Bhattacharyya
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 8 pages, 4 figures. Submitted to ICRA 2026
Abstract:Safe reinforcement learning (SafeRL) is a prominent paradigm for autonomous driving, where agents are required to optimize performance under strict safety requirements. This dual objective creates a fundamental tension, as overly conservative policies limit driving efficiency while aggressive exploration risks safety violations. The Safety Representations for Safer Policy Learning (SRPL) framework addresses this challenge by equipping agents with a predictive model of future constraint violations and has shown promise in controlled environments. This paper investigates whether SRPL extends to real-world autonomous driving scenarios. Systematic experiments on the Waymo Open Motion Dataset (WOMD) and NuPlan demonstrate that SRPL can improve the reward-safety tradeoff, achieving statistically significant improvements in success rate (effect sizes r = 0.65-0.86) and cost reduction (effect sizes r = 0.70-0.83), with p < 0.05 for observed improvements. However, its effectiveness depends on the underlying policy optimizer and the dataset distribution. The results further show that predictive safety representations play a critical role in improving robustness to observation noise. Additionally, in zero-shot cross-dataset evaluation, SRPL-augmented agents demonstrate improved generalization compared to non-SRPL methods. These findings collectively demonstrate the potential of predictive safety representations to strengthen SafeRL for autonomous driving.
[LG-15] Machine Learning for Static and Single-Event Dynamic Complex Network Analysis
链接: https://arxiv.org/abs/2512.17577
作者: Nikolaos Nakis
类目: Machine Learning (cs.LG)
*备注:
Abstract:The primary objective of this thesis is to develop novel algorithmic approaches for Graph Representation Learning of static and single-event dynamic networks. In such a direction, we focus on the family of Latent Space Models, and more specifically on the Latent Distance Model which naturally conveys important network characteristics such as homophily, transitivity, and the balance theory. Furthermore, this thesis aims to create structural-aware network representations, which lead to hierarchical expressions of network structure, community characterization, the identification of extreme profiles in networks, and impact dynamics quantification in temporal networks. Crucially, the methods presented are designed to define unified learning processes, eliminating the need for heuristics and multi-stage processes like post-processing steps. Our aim is to delve into a journey towards unified network embeddings that are both comprehensive and powerful, capable of characterizing network structures and adeptly handling the diverse tasks that graph analysis offers.
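As context for the Latent Distance Model that this thesis builds on, here is a minimal sketch of its edge likelihood, assuming a simple Bernoulli model with a bias term and Euclidean latent coordinates; all variable names are illustrative, not taken from the thesis.

```python
import numpy as np

def edge_probability(z, beta=1.0):
    """Latent Distance Model: P(A_ij = 1) = sigmoid(beta - ||z_i - z_j||).
    Nodes that are close in latent space are more likely to connect,
    which naturally encodes homophily and transitivity."""
    # pairwise Euclidean distances between latent positions
    diff = z[:, None, :] - z[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return 1.0 / (1.0 + np.exp(-(beta - dist)))

# toy example: 5 nodes embedded in a 2-D latent space
rng = np.random.default_rng(0)
z = rng.normal(size=(5, 2))
P = edge_probability(z, beta=0.5)
print(np.round(P, 2))
```

The triangle inequality of the latent metric is what gives this family its transitivity and balance-type properties.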
[LG-16] Enabling Disaggregated Multi-Stage MLLM Inference via GPU-Internal Scheduling and Resource Sharing
链接: https://arxiv.org/abs/2512.17574
作者: Lingxiao Zhao,Haoran Zhou,Yuezhi Che,Dazhao Cheng
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:Multimodal large language models (MLLMs) extend LLMs with visual understanding through a three-stage pipeline: multimodal preprocessing, vision encoding, and LLM inference. While these stages enhance capability, they introduce significant system bottlenecks. First, multimodal preprocessing-especially video decoding-often dominates Time-to-First-Token (TTFT). Most systems rely on CPU-based decoding, which severely limits throughput, while existing GPU-based approaches prioritize throughput-oriented parallelism and fail to meet the latency-sensitive requirements of MLLM inference. Second, the vision encoder is a standalone, compute-intensive stage that produces visual embeddings and cannot be co-batched with LLM prefill or decoding. This heterogeneity forces inter-stage blocking and increases token-generation latency. Even when deployed on separate GPUs, these stages underutilize available compute and memory resources, reducing overall utilization and constraining system throughput. To address these challenges, we present FlashCodec and UnifiedServe, two complementary designs that jointly optimize the end-to-end MLLM pipeline. FlashCodec accelerates the multimodal preprocessing stage through collaborative multi-GPU video decoding, reducing decoding latency while preserving high throughput. UnifiedServe optimizes the vision-to-text and inference stages by logically decoupling their execution to eliminate inter-stage blocking, while physically sharing GPU resources to maximize system utilization, carefully orchestrating execution across stages and minimizing interference. Together, our proposed framework forms an end-to-end optimized stack that can serve up to 3.0 \times more requests or enforce 1.5 \times tighter SLOs, while achieving up to 4.4 \times higher throughput compared to state-of-the-art systems.
[LG-17] Bayesian Optimisation: Which Constraints Matter?
链接: https://arxiv.org/abs/2512.17569
作者: Xietao Wang Lin,Juan Ungredda,Max Butler,James Town,Alma Rahat,Hemant Singh,Juergen Branke
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Bayesian optimisation has proven to be a powerful tool for expensive global black-box optimisation problems. In this paper, we propose new Bayesian optimisation variants of the popular Knowledge Gradient acquisition functions for problems with decoupled black-box constraints, in which subsets of the objective and constraint functions may be evaluated independently. In particular, our methods aim to take into account that often only a handful of the constraints may be binding at the optimum, and hence we should evaluate only relevant constraints when trying to optimise a function. We empirically benchmark these methods against existing methods and demonstrate their superiority over the state-of-the-art.
[LG-18] NetworkFF: Unified Layer Optimization in Forward-Only Neural Networks
链接: https://arxiv.org/abs/2512.17531
作者: Salar Beigzad
类目: Machine Learning (cs.LG)
*备注: Conference paper, IEEE, 2025
Abstract:The Forward-Forward algorithm eliminates backpropagation’s memory constraints and biological implausibility through dual forward passes with positive and negative data. However, conventional implementations suffer from critical inter-layer isolation, where layers optimize goodness functions independently without leveraging collective learning dynamics. This isolation constrains representational coordination and limits convergence efficiency in deeper architectures. This paper introduces Collaborative Forward-Forward (CFF) learning, extending the original algorithm through inter-layer cooperation mechanisms that preserve forward-only computation while enabling global context integration. Our framework implements two collaborative paradigms: Fixed CFF (F-CFF) with constant inter-layer coupling and Adaptive CFF (A-CFF) with learnable collaboration parameters that evolve during training. The collaborative goodness function incorporates weighted contributions from all layers, enabling coordinated feature learning while maintaining memory efficiency and biological plausibility. Comprehensive evaluation on MNIST and Fashion-MNIST demonstrates significant performance improvements over baseline Forward-Forward implementations. These findings establish inter-layer collaboration as a fundamental enhancement to Forward-Forward learning, with immediate applicability to neuromorphic computing architectures and energy-constrained AI systems.
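As a rough illustration of the collaborative goodness idea, the sketch below computes per-layer goodness (mean of squared activations) and mixes it across layers with fixed coupling weights, loosely corresponding to the Fixed CFF variant; the weighting scheme, loss form, and threshold are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class FFBlock(nn.Module):
    """One Forward-Forward layer: linear + ReLU, trained on a goodness objective."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out)

    def forward(self, x):
        return torch.relu(self.fc(x))

def collaborative_goodness(blocks, x, weights):
    """Goodness of each layer = mean of squared activations.
    The collaborative goodness mixes contributions from all layers with fixed
    coupling weights, instead of letting each layer optimize in isolation."""
    per_layer = []
    h = x
    for blk in blocks:
        h = blk(h)
        per_layer.append(h.pow(2).mean(dim=1))          # (batch,)
    per_layer = torch.stack(per_layer, dim=1)            # (batch, L)
    w = torch.tensor(weights, dtype=per_layer.dtype)
    w = w / w.sum()
    return per_layer @ w                                  # weighted mix, (batch,)

blocks = nn.ModuleList([FFBlock(784, 256), FFBlock(256, 256)])
x_pos = torch.randn(8, 784)                               # "positive" samples
g = collaborative_goodness(blocks, x_pos, weights=[1.0, 1.0])
loss_pos = torch.nn.functional.softplus(2.0 - g).mean()   # push goodness above an assumed threshold of 2.0
loss_pos.backward()
```

In the Adaptive CFF variant described above, the coupling weights would themselves be learnable parameters rather than the fixed list used here.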
[LG-19] Deep Learning-Based Surrogate Creep Modelling in Inconel 625: A High-Temperature Alloy Study
链接: https://arxiv.org/abs/2512.17477
作者: Shubham Das,Kaushal Singhania,Amit Sadhu,Suprabhat Das,Arghya Nandi
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: Presented in 10th International Congress on Computational Mechanics and Simulation (ICCMS) 2025, IIT Bhubaneswar
Abstract:Time-dependent deformation, particularly creep, in high-temperature alloys such as Inconel 625 is a key factor in the long-term reliability of components used in aerospace and energy systems. Although Inconel 625 shows excellent creep resistance, finite-element creep simulations in tools such as ANSYS remain computationally expensive, often requiring tens of minutes for a single 10,000-hour run. This work proposes deep learning based surrogate models to provide fast and accurate replacements for such simulations. Creep strain data was generated in ANSYS using the Norton law under uniaxial stresses of 50 to 150 MPa and temperatures of 700 to 1000 ^\circ C, and this temporal dataset was used to train two architectures: a BiLSTM Variational Autoencoder for uncertainty-aware and generative predictions, and a BiLSTM Transformer hybrid that employs self-attention to capture long-range temporal behavior. Both models act as surrogate predictors, with the BiLSTM-VAE offering probabilistic output and the BiLSTM-Transformer delivering high deterministic accuracy. Performance is evaluated using RMSE, MAE, and R^2 . Results show that the BiLSTM-VAE provides stable and reliable creep strain forecasts, while the BiLSTM-Transformer achieves strong accuracy across the full time range. Latency tests indicate substantial speedup: while each ANSYS simulation requires 30 to 40 minutes for a given stress-temperature condition, the surrogate models produce predictions within seconds. The proposed framework enables rapid creep assessment for design optimization and structural health monitoring, and provides a scalable solution for high-temperature alloy applications.
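For context, a minimal sketch of how Norton-law creep strain curves like those used as training data could be generated; the material constants below are placeholders for illustration, not calibrated Inconel 625 values.

```python
import numpy as np

def norton_creep_strain(stress_mpa, temp_c, hours, A=1e-20, n=5.0, Q=280e3):
    """Steady-state (secondary) creep via the Norton law:
    strain_rate = A * sigma^n * exp(-Q / (R*T)), integrated over time.
    A, n, Q are illustrative placeholders, not fitted Inconel 625 constants."""
    R = 8.314                      # gas constant, J/(mol K)
    T = temp_c + 273.15            # absolute temperature, K
    rate = A * stress_mpa**n * np.exp(-Q / (R * T))   # 1/hour (units folded into A)
    t = np.linspace(0.0, hours, 200)
    return t, rate * t             # creep strain over time

# one stress/temperature condition from the sampled design space
t, eps = norton_creep_strain(stress_mpa=100.0, temp_c=850.0, hours=10_000)
print(f"creep strain after {t[-1]:.0f} h: {eps[-1]:.3e}")
```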
[LG-20] Linear Attention for Joint Power Optimization and User-Centric Clustering in Cell-Free Networks
链接: https://arxiv.org/abs/2512.17466
作者: Irched Chafaa,Giacomo Bacci,Luca Sanguinetti
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: Submitted
Abstract:Optimal AP clustering and power allocation are critical in user-centric cell-free massive MIMO systems. Existing deep learning models lack flexibility to handle dynamic network configurations. Furthermore, many approaches overlook pilot contamination and suffer from high computational complexity. In this paper, we propose a lightweight transformer model that overcomes these limitations by jointly predicting AP clusters and powers solely from spatial coordinates of user devices and AP. Our model is architecture-agnostic to users load, handles both clustering and power allocation without channel estimation overhead, and eliminates pilot contamination by assigning users to AP within a pilot reuse constraint. We also incorporate a customized linear attention mechanism to capture user-AP interactions efficiently and enable linear scalability with respect to the number of users. Numerical results confirm the model’s effectiveness in maximizing the minimum spectral efficiency and providing near-optimal performance while ensuring adaptability and scalability in dynamic scenarios.
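To show where the linear scalability in the number of users comes from, here is a sketch of the generic kernelized (linear) attention computation; the elu+1 feature map is a common choice assumed here, not the paper's customized mechanism.

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention: softmax(QK^T)V is replaced by
    phi(Q) (phi(K)^T V) / (phi(Q) phi(K)^T 1), giving O(N) cost in the number of tokens."""
    phi = lambda x: torch.nn.functional.elu(x) + 1.0   # positive feature map (assumed)
    q, k = phi(q), phi(k)
    kv = torch.einsum("nd,ne->de", k, v)               # (d, e), summed over tokens
    z = q @ k.sum(dim=0)                               # (n,) normalizer
    return (q @ kv) / (z.unsqueeze(-1) + eps)

n_users, d = 32, 16
q = torch.randn(n_users, d)
k = torch.randn(n_users, d)
v = torch.randn(n_users, d)
out = linear_attention(q, k, v)
print(out.shape)   # torch.Size([32, 16])
```

Because the key-value summary `kv` is computed once and reused for every query, the cost grows linearly with the number of users rather than quadratically.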
[LG-21] When Data Quality Issues Collide: A Large-Scale Empirical Study of Co-Occurring Data Quality Issues in Software Defect Prediction
链接: https://arxiv.org/abs/2512.17460
作者: Emmanuel Charleson Dapaah,Jens Grabowski
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
Abstract:Software Defect Prediction (SDP) models are central to proactive software quality assurance, yet their effectiveness is often constrained by the quality of available datasets. Prior research has typically examined single issues such as class imbalance or feature irrelevance in isolation, overlooking that real-world data problems frequently co-occur and interact. This study presents, to our knowledge, the first large-scale empirical analysis in SDP that simultaneously examines five co-occurring data quality issues (class imbalance, class overlap, irrelevant features, attribute noise, and outliers) across 374 datasets and five classifiers. We employ Explainable Boosting Machines together with stratified interaction analysis to quantify both direct and conditional effects under default hyperparameter settings, reflecting practical baseline usage. Our results show that co-occurrence is nearly universal: even the least frequent issue (attribute noise) appears alongside others in more than 93% of datasets. Irrelevant features and imbalance are nearly ubiquitous, while class overlap is the most consistently harmful issue. We identify stable tipping points around 0.20 for class overlap, 0.65-0.70 for imbalance, and 0.94 for irrelevance, beyond which most models begin to degrade. We also uncover counterintuitive patterns, such as outliers improving performance when irrelevant features are low, underscoring the importance of context-aware evaluation. Finally, we expose a performance-robustness trade-off: no single learner dominates under all conditions. By jointly analyzing prevalence, co-occurrence, thresholds, and conditional effects, our study directly addresses a persistent gap in SDP research, moving beyond isolated analyses to provide a holistic, data-aware understanding of how quality issues shape model performance in real-world settings.
[LG-22] meval: A Statistical Toolbox for Fine-Grained Model Performance Analysis
链接: https://arxiv.org/abs/2512.17409
作者: Dishantkumar Sutariya,Eike Petersen
类目: Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:
Abstract:Analyzing machine learning model performance stratified by patient and recording properties is becoming the accepted norm and often yields crucial insights about important model failure modes. Performing such analyses in a statistically rigorous manner is non-trivial, however. Appropriate performance metrics must be selected that allow for valid comparisons between groups of different sample sizes and base rates; metric uncertainty must be determined and multiple comparisons be corrected for, in order to assess whether any observed differences may be purely due to chance; and in the case of intersectional analyses, mechanisms must be implemented to find the most `interesting’ subgroups within combinatorially many subgroup combinations. We here present a statistical toolbox that addresses these challenges and enables practitioners to easily yet rigorously assess their models for potential subgroup performance disparities. While broadly applicable, the toolbox is specifically designed for medical imaging applications. The analyses provided by the toolbox are illustrated in two case studies, one in skin lesion malignancy classification on the ISIC2020 dataset and one in chest X-ray-based disease classification on the MIMIC-CXR dataset.
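To make the statistical workflow concrete, here is a minimal sketch of a stratified AUROC analysis with bootstrap confidence intervals and a Bonferroni correction for the number of subgroups; the function names and parameters are illustrative and not the toolbox's actual API.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_auroc_ci(y_true, y_score, groups, n_boot=1000, alpha=0.05, seed=0):
    """AUROC per subgroup with bootstrap CIs; alpha is Bonferroni-corrected
    for the number of subgroups being compared."""
    rng = np.random.default_rng(seed)
    uniq = np.unique(groups)
    alpha_corr = alpha / len(uniq)                 # Bonferroni correction
    results = {}
    for g in uniq:
        idx = np.where(groups == g)[0]
        aucs = []
        for _ in range(n_boot):
            b = rng.choice(idx, size=len(idx), replace=True)
            if len(np.unique(y_true[b])) < 2:      # need both classes in the resample
                continue
            aucs.append(roc_auc_score(y_true[b], y_score[b]))
        lo, hi = np.quantile(aucs, [alpha_corr / 2, 1 - alpha_corr / 2])
        results[g] = (roc_auc_score(y_true[idx], y_score[idx]), lo, hi)
    return results

# toy data: model scores for two patient subgroups
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=400)
s = y * 0.5 + rng.normal(size=400)
g = np.where(rng.random(400) < 0.3, "groupA", "groupB")
for name, (auc, lo, hi) in subgroup_auroc_ci(y, s, g).items():
    print(f"{name}: AUROC={auc:.3f}  CI=({lo:.3f}, {hi:.3f})")
```

If a subgroup's corrected interval does not overlap the others, the disparity is unlikely to be a sampling artifact, which is exactly the kind of judgment the toolbox aims to automate.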
[LG-23] DeepShare: Sharing ReLU Across Channels and Layers for Efficient Private Inference
链接: https://arxiv.org/abs/2512.17398
作者: Yonathan Bornfeld,Shai Avidan
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Private Inference (PI) uses cryptographic primitives to perform privacy-preserving machine learning. In this setting, the owner of the network runs inference on the data of the client without learning anything about the data and without revealing any information about the model. It has been observed that a major computational bottleneck of PI is the calculation of the gate (i.e., ReLU), so a considerable amount of effort has been devoted to reducing the number of ReLUs in a given network. We focus on the DReLU, which is the non-linear step function of the ReLU, and show that one DReLU can serve many ReLU operations. We suggest a new activation module where the DReLU operation is only performed on a subset of the channels (prototype channels), while the rest of the channels (replicate channels) replicate the DReLU of each of their neurons from the corresponding neurons in one of the prototype channels. We then extend this idea to work across different layers. We show that this formulation can drastically reduce the number of DReLU operations in ResNet-type networks. Furthermore, our theoretical analysis shows that this new formulation can solve an extended version of the XOR problem using just one non-linearity and two neurons, something that traditional formulations and some PI-specific methods cannot achieve. We achieve new SOTA results on several classification setups, and achieve SOTA results on image segmentation.
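A rough plaintext illustration of the sharing pattern (not the cryptographic protocol): replicate channels reuse the DReLU mask, i.e. the 0/1 sign pattern, computed on their assigned prototype channel instead of evaluating their own; the channel assignment below is arbitrary and purely illustrative.

```python
import torch

def shared_drelu_activation(x, prototype_idx, assignment):
    """x: (batch, C, H, W). Prototype channels compute their own DReLU mask (x > 0);
    every channel c reuses the mask of assignment[c], so the expensive non-linear
    comparison is only evaluated on the prototype subset."""
    masks = (x[:, prototype_idx] > 0).float()           # (batch, P, H, W)
    proto_pos = {c: i for i, c in enumerate(prototype_idx)}
    out = torch.empty_like(x)
    for c in range(x.shape[1]):
        m = masks[:, proto_pos[assignment[c]]]          # mask borrowed from a prototype channel
        out[:, c] = x[:, c] * m                          # ReLU(x) = x * DReLU(x)
    return out

x = torch.randn(2, 8, 4, 4)
prototypes = [0, 4]                                      # 2 prototype channels out of 8
assignment = {c: (0 if c < 4 else 4) for c in range(8)}  # replicate -> prototype mapping
y = shared_drelu_activation(x, prototypes, assignment)
print(y.shape)
```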
[LG-24] Timely Information Updating for Mobile Devices Without and With ML Advice
链接: https://arxiv.org/abs/2512.17381
作者: Yu-Pin Hsu,Yi-Hsuan Tseng
类目: Networking and Internet Architecture (cs.NI); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 23 pages, journal version of arXiv:1901.03137, submitted for possible journal publication
Abstract:This paper investigates an information update system in which a mobile device monitors a physical process and sends status updates to an access point (AP). A fundamental trade-off arises between the timeliness of the information maintained at the AP and the update cost incurred at the device. To address this trade-off, we propose an online algorithm that determines when to transmit updates using only available observations. The proposed algorithm asymptotically achieves the optimal competitive ratio against an adversary that can simultaneously manipulate multiple sources of uncertainty, including the operation duration, the information staleness, the update cost, and the availability of update opportunities. Furthermore, by incorporating machine learning (ML) advice of unknown reliability into the design, we develop an ML-augmented algorithm that asymptotically attains the optimal consistency-robustness trade-off, even when the adversary can additionally corrupt the ML advice. The optimal competitive ratio scales linearly with the range of update costs, but is unaffected by other uncertainties. Moreover, an optimal competitive online algorithm exhibits a threshold-like response to the ML advice: it either fully trusts or completely ignores the ML advice, as partially trusting the advice cannot improve the consistency without severely degrading the robustness. Extensive simulations in stochastic settings further validate the theoretical findings in the adversarial environment.
[LG-25] Adversarially Robust Detection of Harmful Online Content: A Computational Design Science Approach
链接: https://arxiv.org/abs/2512.17367
作者: Yidong Chai,Yi Liu,Mohammadreza Ebrahimi,Weifeng Li,Balaji Padmanabhan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Social media platforms are plagued by harmful content such as hate speech, misinformation, and extremist rhetoric. Machine learning (ML) models are widely adopted to detect such content; however, they remain highly vulnerable to adversarial attacks, wherein malicious users subtly modify text to evade detection. Enhancing adversarial robustness is therefore essential, requiring detectors that can defend against diverse attacks (generalizability) while maintaining high overall accuracy. However, simultaneously achieving both optimal generalizability and accuracy is challenging. Following the computational design science paradigm, this study takes a sequential approach that first proposes a novel framework (Large Language Model-based Sample Generation and Aggregation, LLM-SGA) by identifying the key invariances of textual adversarial attacks and leveraging them to ensure that a detector instantiated within the framework has strong generalizability. Second, we instantiate our detector (Adversarially Robust Harmful Online Content Detector, ARHOCD) with three novel design components to improve detection accuracy: (1) an ensemble of multiple base detectors that exploits their complementary strengths; (2) a novel weight assignment method that dynamically adjusts weights based on each sample’s predictability and each base detector’s capability, with weights initialized using domain knowledge and updated via Bayesian inference; and (3) a novel adversarial training strategy that iteratively optimizes both the base detectors and the weight assignor. We addressed several limitations of existing adversarial robustness enhancement research and empirically evaluated ARHOCD across three datasets spanning hate speech, rumor, and extremist content. Results show that ARHOCD offers strong generalizability and improves detection accuracy under adversarial conditions.
[LG-26] LibriVAD: A Scalable Open Dataset with Deep Learning Benchmarks for Voice Activity Detection
链接: https://arxiv.org/abs/2512.17281
作者: Ioannis Stylianou,Achintya kr. Sarkar,Nauman Dawalatabad,James Glass,Zheng-Hua Tan
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注:
Abstract:Robust Voice Activity Detection (VAD) remains a challenging task, especially under noisy, diverse, and unseen acoustic conditions. Beyond algorithmic development, a key limitation in advancing VAD research is the lack of large-scale, systematically controlled, and publicly available datasets. To address this, we introduce LibriVAD - a scalable open-source dataset derived from LibriSpeech and augmented with diverse real-world and synthetic noise sources. LibriVAD enables systematic control over speech-to-noise ratio, silence-to-speech ratio (SSR), and noise diversity, and is released in three sizes (15 GB, 150 GB, and 1.5 TB) with two variants (LibriVAD-NonConcat and LibriVAD-Concat) to support different experimental setups. We benchmark multiple feature-model combinations, including waveform, Mel-Frequency Cepstral Coefficients (MFCC), and Gammatone filter bank cepstral coefficients, and introduce the Vision Transformer (ViT) architecture for VAD. Our experiments show that ViT with MFCC features consistently outperforms established VAD models such as boosted deep neural network and convolutional long short-term memory deep neural network across seen, unseen, and out-of-distribution (OOD) conditions, including evaluation on the real-world VOiCES dataset. We further analyze the impact of dataset size and SSR on model generalization, experimentally showing that scaling up dataset size and balancing SSR noticeably and consistently enhance VAD performance under OOD conditions. All datasets, trained models, and code are publicly released to foster reproducibility and accelerate progress in VAD research.
[LG-27] Warmer for Less: A Cost-Efficient Strategy for Cold-Start Recommendations at Pinterest WWW’26
链接: https://arxiv.org/abs/2512.17277
作者: Saeed Ebrahimi,Weijie Jiang,Jaewon Yang,Olafur Gudmundsson,Yucheng Tu,Huizhong Duan
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Submitted to the WWW’26
Abstract:Pinterest is a leading visual discovery platform where recommender systems (RecSys) are key to delivering relevant, engaging, and fresh content to our users. In this paper, we study the problem of improving RecSys model predictions for cold-start (CS) items, which appear infrequently in the training data. Although this problem is well-studied in academia, few studies have addressed its root causes effectively at the scale of a platform like Pinterest. By investigating live traffic data, we identified several challenges of the CS problem and developed a corresponding solution for each: First, industrial-scale RecSys models must operate under tight computational constraints. Since CS items are a minority, any related improvements must be highly cost-efficient. To address this, our solutions were designed to be lightweight, collectively increasing the total parameters by only 5%. Second, CS items are represented only by non-historical (e.g., content or attribute) features, which models often treat as less important. To elevate their significance, we introduce a residual connection for the non-historical features. Third, CS items tend to receive lower prediction scores compared to non-CS items, reducing their likelihood of being surfaced. We mitigate this by incorporating a score regularization term into the model. Fourth, the labels associated with CS items are sparse, making it difficult for the model to learn from them. We apply the manifold mixup technique to address this data sparsity. Implemented together, our methods increased fresh content engagement at Pinterest by 10% without negatively impacting overall engagement and cost, and have been deployed to serve over 570 million users on Pinterest.
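As one concrete piece of the recipe above, here is a generic sketch of manifold mixup applied to hidden representations to counter label sparsity; the Beta parameter and the layer at which mixing happens are assumptions for illustration, not Pinterest's production configuration.

```python
import torch

def manifold_mixup(hidden, labels, alpha=0.2):
    """Mix hidden representations (and labels) of random sample pairs:
    h_mix = lam * h_i + (1 - lam) * h_j. Used to densify sparse cold-start labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(hidden.size(0))
    h_mix = lam * hidden + (1.0 - lam) * hidden[perm]
    y_mix = lam * labels + (1.0 - lam) * labels[perm]
    return h_mix, y_mix

# toy: hidden states from an intermediate layer of a ranking model
h = torch.randn(16, 64)
y = torch.randint(0, 2, (16, 1)).float()
h_mix, y_mix = manifold_mixup(h, y, alpha=0.2)
print(h_mix.shape, y_mix.shape)
```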
[LG-28] Alzheimer's Disease Brain Network Mining
链接: https://arxiv.org/abs/2512.17276
作者: Alireza Moayedikia,Sara Fin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Machine learning approaches for Alzheimer's disease (AD) diagnosis face a fundamental challenge. Clinical assessments are expensive and invasive, leaving ground truth labels available for only a fraction of neuroimaging datasets. We introduce Multi-view Adaptive Transport Clustering for Heterogeneous Alzheimer's Disease (MATCH-AD), a semi-supervised framework that integrates deep representation learning, graph-based label propagation, and optimal transport theory to address this limitation. The framework leverages manifold structure in neuroimaging data to propagate diagnostic information from limited labeled samples to larger unlabeled populations, while using Wasserstein distances to quantify disease progression between cognitive states. Evaluated on nearly five thousand subjects from the National Alzheimer's Coordinating Center, encompassing structural MRI measurements from hundreds of brain regions, cerebrospinal fluid biomarkers, and clinical variables, MATCH-AD achieves near-perfect diagnostic accuracy despite ground truth labels for less than one-third of subjects. The framework substantially outperforms all baseline methods, achieving a kappa indicating almost-perfect agreement compared to weak agreement for the best baseline, a qualitative transformation in diagnostic reliability. Performance remains clinically useful even under severe label scarcity, and we provide theoretical convergence guarantees with proven bounds on label propagation error and transport stability. These results demonstrate that principled semi-supervised learning can unlock the diagnostic potential of the vast repositories of partially annotated neuroimaging data accumulating worldwide, substantially reducing annotation burden while maintaining accuracy suitable for clinical deployment.
[LG-29] MINPO: Memory-Informed Neural Pseudo-Operator to Resolve Nonlocal Spatiotemporal Dynamics
链接: https://arxiv.org/abs/2512.17273
作者: Farinaz Mostajeran,Aruzhan Tleubek,Salah A Faroughi
类目: Machine Learning (cs.LG); Mathematical Physics (math-ph); Numerical Analysis (math.NA)
*备注:
Abstract:Many physical systems exhibit nonlocal spatiotemporal behaviors described by integro-differential equations (IDEs). Classical methods for solving IDEs require repeatedly evaluating convolution integrals, whose cost increases quickly with kernel complexity and dimensionality. Existing neural solvers can accelerate selected instances of these computations, yet they do not generalize across diverse nonlocal structures. In this work, we introduce the Memory-Informed Neural Pseudo-Operator (MINPO), a unified framework for modeling nonlocal dynamics arising from long-range spatial interactions and/or long-term temporal memory. MINPO, employing either Kolmogorov-Arnold Networks (KANs) or multilayer perceptron networks (MLPs) as encoders, learns the nonlocal operator and its inverse directly through neural representations, and then explicitly reconstructs the unknown solution fields. The learning is guarded by a lightweight nonlocal consistency loss term to enforce coherence between the learned operator and the reconstructed solution. The MINPO formulation naturally captures and efficiently resolves nonlocal spatiotemporal dependencies governed by a wide spectrum of IDEs and their subsets, including fractional PDEs. We evaluate the efficacy of MINPO in comparison with classical techniques and state-of-the-art neural-based strategies based on MLPs, such as A-PINN and fPINN, along with their newly-developed KAN variants, A-PIKAN and fPIKAN, designed to facilitate a fair comparison. Our study offers compelling evidence of the accuracy of MINPO and demonstrates its robustness in handling (i) diverse kernel types, (ii) different kernel dimensionalities, and (iii) the substantial computational demands arising from repeated evaluations of kernel integrals. MINPO, thus, generalizes beyond problem-specific formulations, providing a unified framework for systems governed by nonlocal operators.
[LG-30] A Theoretical Analysis of State Similarity Between Markov Decision Processes
链接: https://arxiv.org/abs/2512.17265
作者: Zhenyu Tao,Wei Xu,Xiaohu You
类目: Machine Learning (cs.LG)
*备注: Submitted to an IEEE Transactions. arXiv admin note: substantial text overlap with arXiv:2509.18714
Abstract:The bisimulation metric (BSM) is a powerful tool for analyzing state similarities within a Markov decision process (MDP), revealing that states closer in BSM have more similar optimal value functions. While BSM has been successfully utilized in reinforcement learning (RL) for tasks like state representation learning and policy exploration, its application to state similarity between multiple MDPs remains challenging. Prior work has attempted to extend BSM to pairs of MDPs, but a lack of well-established mathematical properties has limited further theoretical analysis between MDPs. In this work, we formally establish a generalized bisimulation metric (GBSM) for measuring state similarity between arbitrary pairs of MDPs, which is rigorously proven with three fundamental metric properties, i.e., GBSM symmetry, inter-MDP triangle inequality, and a distance bound on identical spaces. Leveraging these properties, we theoretically analyze policy transfer, state aggregation, and sampling-based estimation across MDPs, obtaining explicit bounds that are strictly tighter than existing ones derived from the standard BSM. Additionally, GBSM provides a closed-form sample complexity for estimation, improving upon existing asymptotic results based on BSM. Numerical results validate our theoretical findings and demonstrate the effectiveness of GBSM in multi-MDP scenarios.
[LG-31] SHARP-QoS: Sparsely-gated Hierarchical Adaptive Routing for joint Prediction of QoS
链接: https://arxiv.org/abs/2512.17262
作者: Suraj Kumar,Arvind Kumar,Soumi Chattopadhyay
类目: Machine Learning (cs.LG)
*备注: 12 pages, 4 figures, 10 tables
Abstract:Dependable service-oriented computing relies on multiple Quality of Service (QoS) parameters that are essential to assess service optimality. However, real-world QoS data are extremely sparse, noisy, and shaped by hierarchical dependencies arising from QoS interactions, and geographical and network-level factors, making accurate QoS prediction challenging. Existing methods often predict each QoS parameter separately, requiring multiple similar models, which increases computational cost and leads to poor generalization. Although recent joint QoS prediction studies have explored shared architectures, they suffer from negative transfer due to loss-scaling caused by inconsistent numerical ranges across QoS parameters and further struggle with inadequate representation learning, resulting in degraded accuracy. This paper presents an unified strategy for joint QoS prediction, called SHARP-QoS, that addresses these issues using three components. First, we introduce a dual mechanism to extract the hierarchical features from both QoS and contextual structures via hyperbolic convolution formulated in the Poincaré ball. Second, we propose an adaptive feature-sharing mechanism that allows feature exchange across informative QoS and contextual signals. A gated feature fusion module is employed to support dynamic feature selection among structural and shared representations. Third, we design an EMA-based loss balancing strategy that allows stable joint optimization, thereby mitigating the negative transfer. Evaluations on three datasets with two, three, and four QoS parameters demonstrate that SHARP-QoS outperforms both single- and multi-task baselines. Extensive study shows that our model effectively addresses major challenges, including sparsity, robustness to outliers, and cold-start, while maintaining moderate computational overhead, underscoring its capability for reliable joint QoS prediction.
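One plausible reading of the EMA-based loss balancing is sketched below: each task loss is rescaled by the inverse of its exponential moving average so that QoS parameters with very different numerical ranges contribute comparably to the joint objective; the decay rate and normalization are assumed details, not the paper's exact design.

```python
import torch

class EMALossBalancer:
    """Keeps an exponential moving average of each task's loss magnitude and
    uses it to normalize the per-task losses before summing, so no single
    QoS parameter dominates the joint objective."""
    def __init__(self, num_tasks, decay=0.99, eps=1e-8):
        self.ema = torch.ones(num_tasks)
        self.decay = decay
        self.eps = eps

    def combine(self, task_losses):
        with torch.no_grad():
            current = torch.stack([l.detach() for l in task_losses])
            self.ema = self.decay * self.ema + (1 - self.decay) * current
        normalized = [l / (e + self.eps) for l, e in zip(task_losses, self.ema)]
        return sum(normalized) / len(normalized)

balancer = EMALossBalancer(num_tasks=3)
# e.g. response time in seconds, throughput in kbps, reliability in [0, 1]
losses = [torch.tensor(0.4, requires_grad=True),
          torch.tensor(120.0, requires_grad=True),
          torch.tensor(0.02, requires_grad=True)]
total = balancer.combine(losses)
total.backward()
```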
[LG-32] Electric Vehicle Charging Load Forecasting: An Experimental Comparison of Machine Learning Methods
链接: https://arxiv.org/abs/2512.17257
作者: Iason Kyriakopoulos,Yannis Theodoridis
类目: Machine Learning (cs.LG)
*备注: 18 pages, 2 figures, 5 tables
Abstract:With the growing popularity of electric vehicles as a means of addressing climate change, concerns have emerged regarding their impact on electric grid management. As a result, predicting EV charging demand has become a timely and important research problem. While substantial research has addressed energy load forecasting in transportation, relatively few studies systematically compare multiple forecasting methods across different temporal horizons and spatial aggregation levels in diverse urban settings. This work investigates the effectiveness of five time series forecasting models, ranging from traditional statistical approaches to machine learning and deep learning methods. Forecasting performance is evaluated for short-, mid-, and long-term horizons (on the order of minutes, hours, and days, respectively), and across spatial scales ranging from individual charging stations to regional and city-level aggregations. The analysis is conducted on four publicly available real-world datasets, with results reported independently for each dataset. To the best of our knowledge, this is the first work to systematically evaluate EV charging demand forecasting across such a wide range of temporal horizons and spatial aggregation levels using multiple real-world datasets.
[LG-33] Practical Framework for Privacy-Preserving and Byzantine-robust Federated Learning
链接: https://arxiv.org/abs/2512.17254
作者: Baolei Zhang,Minghong Fang,Zhuqing Liu,Biao Yi,Peizhao Zhou,Yuan Wang,Tong Li,Zheli Liu
类目: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: Accepted for publication in IEEE Transactions on Information Forensics and Security
Abstract:Federated Learning (FL) allows multiple clients to collaboratively train a model without sharing their private data. However, FL is vulnerable to Byzantine attacks, where adversaries manipulate client models to compromise the federated model, and privacy inference attacks, where adversaries exploit client models to infer private data. Existing defenses against both backdoor and privacy inference attacks introduce significant computational and communication overhead, creating a gap between theory and practice. To address this, we propose ABBR, a practical framework for Byzantine-robust and privacy-preserving FL. We are the first to utilize dimensionality reduction to speed up the private computation of complex filtering rules in privacy-preserving FL. Additionally, we analyze the accuracy loss of vector-wise filtering in low-dimensional space and introduce an adaptive tuning strategy to minimize the impact of malicious models that bypass filtering on the global model. We implement ABBR with state-of-the-art Byzantine-robust aggregation rules and evaluate it on public datasets, showing that it runs significantly faster, has minimal communication overhead, and maintains nearly the same Byzantine-resilience as the baselines.
[LG-34] Do Foundational Audio Encoders Understand Music Structure?
链接: https://arxiv.org/abs/2512.17209
作者: Keisuke Toyama,Zhi Zhong,Akira Takahashi,Shusuke Takahashi,Yuki Mitsufuji
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:
Abstract:In music information retrieval (MIR) research, the use of pretrained foundational audio encoders (FAEs) has recently become a trend. FAEs pretrained on large amounts of music and audio data have been shown to improve performance on MIR tasks such as music tagging and automatic music transcription. However, their use for music structure analysis (MSA) remains underexplored. Although many open-source FAE models are available, only a small subset has been examined for MSA, and the impact of factors such as learning methods, training data, and model context length on MSA performance remains unclear. In this study, we conduct comprehensive experiments on 11 types of FAEs to investigate how these factors affect MSA performance. Our results demonstrate that FAEs using self-supervised learning with masked language modeling on music data are particularly effective for MSA. These findings pave the way for future research in MSA.
[LG-35] Learning solution operator of dynamical systems with diffusion maps kernel ridge regression
链接: https://arxiv.org/abs/2512.17203
作者: Jiwoo Song,Daning Huang,John Harlim
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:Many scientific and engineering systems exhibit complex nonlinear dynamics that are difficult to predict accurately over long time horizons. Although data-driven models have shown promise, their performance often deteriorates when the geometric structures governing long-term behavior are unknown or poorly represented. We demonstrate that a simple kernel ridge regression (KRR) framework, when combined with a dynamics-aware validation strategy, provides a strong baseline for long-term prediction of complex dynamical systems. By employing a data-driven kernel derived from diffusion maps, the proposed Diffusion Maps Kernel Ridge Regression (DM-KRR) method implicitly adapts to the intrinsic geometry of the system’s invariant set, without requiring explicit manifold reconstruction or attractor modeling, procedures that often limit predictive performance. Across a broad range of systems, including smooth manifolds, chaotic attractors, and high-dimensional spatiotemporal flows, DM-KRR consistently outperforms state-of-the-art random feature, neural-network and operator-learning methods in both accuracy and data efficiency. These findings underscore that long-term predictive skill depends not only on model expressiveness, but critically on respecting the geometric constraints encoded in the data through dynamically consistent model selection. Together, simplicity, geometry awareness, and strong empirical performance point to a promising path for reliable and efficient learning of complex dynamical systems.
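To make the idea tangible, a simplified sketch of kernel ridge regression with a diffusion-maps-style density-normalized Gaussian kernel follows; the bandwidth, normalization exponent, and toy task are illustrative choices, not the paper's tuned settings.

```python
import numpy as np

def diffusion_kernel(X, Y=None, eps=0.5, alpha=1.0):
    """Gaussian affinity with density normalization K_ij / (q_i^alpha * q_j^alpha),
    a core ingredient of diffusion maps (alpha = 1 approximately removes the
    effect of non-uniform sampling density)."""
    Y = X if Y is None else Y
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / eps)
    qx = K.sum(axis=1, keepdims=True)
    qy = K.sum(axis=0, keepdims=True)
    return K / (qx**alpha * qy**alpha)

def krr_fit_predict(X_train, y_train, X_test, lam=1e-3, eps=0.5):
    """Kernel ridge regression with the diffusion-maps kernel: solve (K + lam*I) a = y."""
    K = diffusion_kernel(X_train, eps=eps)
    a = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)
    K_test = diffusion_kernel(X_test, X_train, eps=eps)
    return K_test @ a

# toy one-step predictor for a noisy signal standing in for a dynamical system
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 200)[:, None]
y = np.sin(2 * x).ravel() + 0.05 * rng.normal(size=200)
x_new = np.linspace(-2, 2, 50)[:, None]
print(krr_fit_predict(x, y, x_new)[:5])
```

The density normalization is what lets the kernel adapt to the geometry of the invariant set without explicitly reconstructing the manifold.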
[LG-36] BumpNet: A Sparse Neural Network Framework for Learning PDE Solutions
链接: https://arxiv.org/abs/2512.17198
作者: Shao-Ting Chiu,Ioannis G. Kevrekidis,Ulisses Braga-Neto
类目: Machine Learning (cs.LG)
*备注:
Abstract:We introduce BumpNet, a sparse neural network framework for PDE numerical solution and operator learning. BumpNet is based on meshless basis function expansion, in a similar fashion to radial-basis function (RBF) networks. Unlike RBF networks, the basis functions in BumpNet are constructed from ordinary sigmoid activation functions. This enables the efficient use of modern training techniques optimized for such networks. All parameters of the basis functions, including shape, location, and amplitude, are fully trainable. Model parsimony and h-adaptivity are effectively achieved through dynamically pruning basis functions during training. BumpNet is a general framework that can be combined with existing neural architectures for learning PDE solutions: here, we propose Bump-PINNs (BumpNet with physics-informed neural networks) for solving general PDEs; Bump-EDNN (BumpNet with evolutionary deep neural networks) to solve time-evolution PDEs; and Bump-DeepONet (BumpNet with deep operator networks) for PDE operator learning. Bump-PINNs are trained using the same collocation-based approach used by PINNs, Bump-EDNN uses a BumpNet only in the spatial domain and uses EDNNs to advance the solution in time, while Bump-DeepONets employ a BumpNet regression network as the trunk network of a DeepONet. Extensive numerical experiments demonstrate the efficiency and accuracy of the proposed architecture.
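A rough sketch of how a localized "bump" basis function can be built from ordinary sigmoid activations, which is the core representational idea; the exact parameterization (center, width, sharpness, amplitude) is an assumption for illustration.

```python
import torch

def bump(x, center, width, sharpness, amplitude):
    """A localized basis function built from two sigmoids:
    amplitude * (sigmoid(s*(x - c + w)) - sigmoid(s*(x - c - w))).
    It is ~amplitude inside [c - w, c + w] and decays to 0 outside,
    so it behaves like a trainable, mesh-free RBF-style basis element."""
    s = torch.sigmoid
    return amplitude * (s(sharpness * (x - center + width)) -
                        s(sharpness * (x - center - width)))

# a sparse expansion u(x) = sum_k bump_k(x), with all bump parameters trainable
x = torch.linspace(-1, 1, 101).unsqueeze(1)            # (101, 1)
centers = torch.tensor([[-0.5, 0.0, 0.5]], requires_grad=True)
widths = torch.full((1, 3), 0.2, requires_grad=True)
sharp = torch.full((1, 3), 25.0)
amps = torch.ones(1, 3, requires_grad=True)
u = bump(x, centers, widths, sharp, amps).sum(dim=1)    # (101,)
print(u.shape)
```

Because each bump is compactly supported in practice, pruning basis functions during training directly yields the h-adaptivity described above.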
[LG-37] Distributed Learning in Markovian Restless Bandits over Interference Graphs for Stable Spectrum Sharing
链接: https://arxiv.org/abs/2512.17161
作者: Liad Lea Didi,Kobi Cohen
类目: Machine Learning (cs.LG)
*备注: 13 pages, 10 figures
Abstract:We study distributed learning for spectrum access and sharing among multiple cognitive communication entities, such as cells, subnetworks, or cognitive radio users (collectively referred to as cells), in communication-constrained wireless networks modeled by interference graphs. Our goal is to achieve a globally stable and interference-aware channel allocation. Stability is defined through a generalized Gale-Shapley multi-to-one matching, a well-established solution concept in wireless resource allocation. We consider wireless networks where L cells share S orthogonal channels and cannot simultaneously use the same channel as their neighbors. Each channel evolves as an unknown restless Markov process with cell-dependent rewards, making this the first work to establish global Gale-Shapley stability for channel allocation in a stochastic, temporally varying restless environment. To address this challenge, we develop SMILE (Stable Multi-matching with Interference-aware LEarning), a communication-efficient distributed learning algorithm that integrates restless bandit learning with graph-constrained coordination. SMILE enables cells to distributedly balance exploration of unknown channels with exploitation of learned information. We prove that SMILE converges to the optimal stable allocation and achieves logarithmic regret relative to a genie with full knowledge of expected utilities. Simulations validate the theoretical guarantees and demonstrate SMILE’s robustness, scalability, and efficiency across diverse spectrum-sharing scenarios.
[LG-38] Biosecurity-Aware AI: Agentic Risk Auditing of Soft Prompt Attacks on ESM-Based Variant Predictors
链接: https://arxiv.org/abs/2512.17146
作者: Huixin Zhan
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Genomic Foundation Models (GFMs), such as Evolutionary Scale Modeling (ESM), have demonstrated remarkable success in variant effect prediction. However, their security and robustness under adversarial manipulation remain largely unexplored. To address this gap, we introduce the Secure Agentic Genomic Evaluator (SAGE), an agentic framework for auditing the adversarial vulnerabilities of GFMs. SAGE functions through an interpretable and automated risk auditing loop. It injects soft prompt perturbations, monitors model behavior across training checkpoints, computes risk metrics such as AUROC and AUPR, and generates structured reports with large language model-based narrative explanations. This agentic process enables continuous evaluation of embedding-space robustness without modifying the underlying model. Using SAGE, we find that even state-of-the-art GFMs like ESM2 are sensitive to targeted soft prompt attacks, resulting in measurable performance degradation. These findings reveal critical and previously hidden vulnerabilities in genomic foundation models, showing the importance of agentic risk auditing in securing biomedical applications such as clinical variant interpretation.
[LG-39] DiffeoMorph: Learning to Morph 3D Shapes Using Differentiable Agent-Based Simulations
链接: https://arxiv.org/abs/2512.17129
作者: Seong Ho Pahng,Guoye Guan,Benjamin Fefferman,Sahand Hormoz
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Biological systems can form complex three-dimensional structures through the collective behavior of identical agents – cells that follow the same internal rules and communicate without central control. How such distributed control gives rise to precise global patterns remains a central question not only in developmental biology but also in distributed robotics, programmable matter, and multi-agent learning. Here, we introduce DiffeoMorph, an end-to-end differentiable framework for learning a morphogenesis protocol that guides a population of agents to morph into a target 3D shape. Each agent updates its position and internal state using an attention-based SE(3)-equivariant graph neural network, based on its own internal state and signals received from other agents. To train this system, we introduce a new shape-matching loss based on the 3D Zernike polynomials, which compares the predicted and target shapes as continuous spatial distributions, not as discrete point clouds, and is invariant to agent ordering, number of agents, and rigid-body transformations. To enforce full SO(3) invariance – invariant to rotations yet sensitive to reflections, we include an alignment step that optimally rotates the predicted Zernike spectrum to match the target before computing the loss. This results in a bilevel problem, with the inner loop optimizing a unit quaternion for the best alignment and the outer loop updating the agent model. We compute gradients through the alignment step using implicit differentiation. We perform systematic benchmarking to establish the advantages of our shape-matching loss over other standard distance metrics for shape comparison tasks. We then demonstrate that DiffeoMorph can form a range of shapes – from simple ellipsoids to complex morphologies – using only minimal spatial cues.
[LG-40] The Effect of Negation on CLIP in Medical Imaging: Limitations of Contrastive Language-Image Pretraining WACV
链接: https://arxiv.org/abs/2512.17121
作者: Jasmine Vu,Shivanand Sheshappanavar
类目: Machine Learning (cs.LG)
*备注: 10 pages, 7 figures, submitted to WACV Pixels to Patients Workshop
Abstract:Large vision-language models like CLIP are increasingly used in medical imaging tasks due to their ability to align images and text without the need for extensive labeled data. This makes them particularly useful for applications like image retrieval, report generation, and classification in clinical settings. A potential issue with this approach is that CLIP-based models often underperform when interpreting negated phrases, which is especially problematic in the context of medical diagnosis. In this study, we evaluate the Stanford AIMI CheXagent model on its ability to correctly retrieve chest X-ray images using prompts with and without negation. The goal of this project is to understand where this model fails and then use it as a base model to improve its retrieval accuracy by fine-tuning methods outlined in previous work. Results from this study show improvement in handling of negation in the CLIP model with a slight decrease in accuracy of positive prompt evaluation. Alongside retrieval accuracy, we examined internal model behavior through token attribution, t-SNE projection, and attention-head ablation to better characterize how each fine-tuning approach reshaped the text encoder's representation of negated clinical language. Through this work, we hope to better understand the internal behavior of CLIP and improve its handling of negation using clinically relevant language, improving its reliability in medical AI devices.
[LG-41] Digitizing Nepal's Written Heritage: A Comprehensive HTR Pipeline for Old Nepali Manuscripts
链接: https://arxiv.org/abs/2512.17111
作者: Anjali Sarawgi,Esteban Garces Arias,Christof Zotter
类目: Machine Learning (cs.LG)
*备注: Under review
Abstract:This paper presents the first end-to-end pipeline for Handwritten Text Recognition (HTR) for Old Nepali, a historically significant but low-resource language. We adopt a line-level transcription approach and systematically explore encoder-decoder architectures and data-centric techniques to improve recognition accuracy. Our best model achieves a Character Error Rate (CER) of 4.9%. In addition, we implement and evaluate decoding strategies and analyze token-level confusions to better understand model behaviour and error patterns. While the dataset we used for evaluation is confidential, we release our training code, model configurations, and evaluation scripts to support further research in HTR for low-resource historical scripts.
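For reference, a minimal sketch of the Character Error Rate metric reported above (Levenshtein edit distance divided by reference length), since it is the paper's headline number.

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = (substitutions + insertions + deletions) / len(reference),
    computed via the character-level Levenshtein edit distance."""
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))                 # edit distances for the empty reference prefix
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + cost)        # substitution / match
            prev = cur
    return dp[n] / max(m, 1)

print(character_error_rate("recognition", "recogniton"))  # one dropped character -> ~0.09
```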
[LG-42] Bridging Training and Merging Through Momentum-Aware Optimization
链接: https://arxiv.org/abs/2512.17109
作者: Alireza Moayedikia,Alicia Troncoso
类目: Machine Learning (cs.LG)
*备注: Paper is a work in progress
Abstract:Training large neural networks and merging task-specific models both exploit low-rank structure and require parameter importance estimation, yet these challenges have been pursued in isolation. Current workflows compute curvature information during training, discard it, then recompute similar information for merging – wasting computation and discarding valuable trajectory data. We introduce a unified framework that maintains factorized momentum and curvature statistics during training, then reuses this information for geometry-aware model composition. The proposed method achieves memory efficiency comparable to state-of-the-art approaches while accumulating task saliency scores that enable curvature-aware merging without post-hoc Fisher computation. We establish convergence guarantees for non-convex objectives with approximation error bounded by gradient singular value decay. On natural language understanding benchmarks, curvature-aware parameter selection outperforms magnitude-only baselines across all sparsity levels, with multi-task merging improving over strong baselines. The proposed framework exhibits rank-invariant convergence and superior hyperparameter robustness compared to existing low-rank optimizers. By treating the optimization trajectory as a reusable asset rather than discarding it, our approach eliminates redundant computation while enabling more principled model composition.
[LG-43] Atom: Efficient On-Device Video-Language Pipelines Through Modular Reuse
链接: https://arxiv.org/abs/2512.17108
作者: Kunjal Panchal,Saayan Mitra,Somdeb Sarkhel,Haoliang Wang,Ishita Dasgupta,Gang Wu,Hui Guan
类目: Machine Learning (cs.LG); Multimedia (cs.MM)
*备注:
Abstract:Recent advances in video-language models have enabled powerful applications like video retrieval, captioning, and assembly. However, executing such multi-stage pipelines efficiently on mobile devices remains challenging due to redundant model loads and fragmented execution. We introduce Atom, an on-device system that restructures video-language pipelines for fast and efficient execution. Atom decomposes a billion-parameter model into reusable modules, such as the visual encoder and language decoder, and reuses them across subtasks like captioning, reasoning, and indexing. This reuse-centric design eliminates repeated model loading and enables parallel execution, reducing end-to-end latency without sacrificing performance. On commodity smartphones, Atom achieves 27–33% faster execution compared to non-reuse baselines, with only marginal performance drop ( \leq 2.3 Recall@1 in retrieval, \leq 1.5 CIDEr in captioning). These results position Atom as a practical, scalable approach for efficient video-language understanding on edge devices.
[LG-44] Fault Diagnosis and Quantification for Photovoltaic Arrays based on Differentiable Physical Models
链接: https://arxiv.org/abs/2512.17107
作者: Zenan Yang,Yuanliang Li,Jingwei Zhang,Yongjie Liu,Kun Ding
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Accurate fault diagnosis and quantification are essential for the reliable operation and intelligent maintenance of photovoltaic (PV) arrays. However, existing fault quantification methods often suffer from limited efficiency and interpretability. To address these challenges, this paper proposes a novel fault quantification approach for PV strings based on a differentiable fast fault simulation model (DFFSM). The proposed DFFSM accurately models I-V characteristics under multiple faults and provides analytical gradients with respect to fault parameters. Leveraging this property, a gradient-based fault parameters identification (GFPI) method using the Adahessian optimizer is developed to efficiently quantify partial shading, short-circuit, and series-resistance degradation. Experimental results on both simulated and measured I-V curves demonstrate that the proposed GFPI achieves high quantification accuracy across different faults, with the I-V reconstruction error below 3%, confirming the feasibility and effectiveness of the application of differentiable physical simulators for PV system fault diagnosis.
[LG-45] Bandwidth-Efficient Adaptive Mixture-of-Experts via Low-Rank Compensation
链接: https://arxiv.org/abs/2512.17073
作者: Zhenyu Liu,Yunzhen Liu,Zehao Fan,Garrett Gagnon,Yayue Hou,Nan Wu,Yangwook Kang,Liu Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Mixture-of-Experts (MoE) models scale capacity via sparse activation but stress memory and bandwidth. Offloading alleviates GPU memory by fetching experts on demand, yet token-level routing causes irregular transfers that make inference I/O-bound. Static uniform quantization reduces traffic but degrades accuracy under aggressive compression by ignoring expert heterogeneity. We present Bandwidth-Efficient Adaptive Mixture-of-Experts via Low-Rank Compensation, which performs router-guided precision restoration using precomputed low-rank compensators. At inference time, our method transfers compact low-rank factors with the Top-n (n < k) experts per token and applies compensation to them, keeping others low-bit. Integrated with offloading on GPU and GPU-NDP systems, our method delivers a superior bandwidth-accuracy trade-off and improved throughput.
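A rough sketch of the underlying decomposition: quantize an expert weight to low-bit, then store a small low-rank factorization of the quantization residual that can be shipped with the expert and added back for the most important routed experts; the bit-width, rank, and quantizer below are illustrative assumptions.

```python
import torch

def fake_quantize(w, bits=3):
    """Symmetric uniform quantization (an illustrative stand-in for the real low-bit format)."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

def low_rank_compensator(w, w_q, rank=8):
    """SVD of the quantization residual W - Q(W); only U_r and V_r need to be
    transferred, so the extra bandwidth is O(rank * (d_in + d_out))."""
    U, S, Vh = torch.linalg.svd(w - w_q, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]          # fold singular values into U
    V_r = Vh[:rank]
    return U_r, V_r

w = torch.randn(512, 256)                 # one expert's weight matrix
w_q = fake_quantize(w, bits=3)
U_r, V_r = low_rank_compensator(w, w_q, rank=8)
w_restored = w_q + U_r @ V_r              # applied only to the most important routed experts
err_q = (w - w_q).norm() / w.norm()
err_c = (w - w_restored).norm() / w.norm()
print(f"relative error: quantized {err_q:.3f} -> compensated {err_c:.3f}")
```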
[LG-46] Universal consistency of the k-NN rule in metric spaces and Nagata dimension. III
链接: https://arxiv.org/abs/2512.17058
作者: Vladimir G. Pestov
类目: Machine Learning (cs.LG)
*备注: 12 pages, latex with ESAIM PS macros
Abstract:We prove the last remaining implication allowing to claim the equivalence of the following conditions for a complete separable metric space X : (1) The k -nearest neighbour classifier is (weakly) universally consistent in X , (2) The strong Lebesgue–Besicovitch differentiation property holds in X for every locally finite Borel measure, (3) X is sigma-finite dimensional in the sense of Nagata. The equivalence (2) \iff (3) was announced by Preiss (1983), while a detailed proof of the implication (3) \Rightarrow (2) has appeared in Assouad and Quentin de Gromard (2006). The implication (2) \Rightarrow (1) was established by Cérou and Guyader (2006). We prove the implication (1) \Rightarrow (3). The result was conjectured in the first article in the series (Collins, Kumari, Pestov 2020), and here we also correct a wrong claim made in the second article (Kumari and Pestov 2024).
[LG-47] Dynamic Tool Dependency Retrieval for Efficient Function Calling
链接: https://arxiv.org/abs/2512.17052
作者: Bhrij Patel,Davide Belli,Amir Jalalirad,Maximilian Arnold,Aleksandr Ermovol,Bence Major
类目: Machine Learning (cs.LG)
*备注: 18 pages, 5 figures, 6 tables
Abstract:Function calling agents powered by Large Language Models (LLMs) select external tools to automate complex tasks. On-device agents typically use a retrieval module to select relevant tools, improving performance and reducing context length. However, existing retrieval methods rely on static and limited inputs, failing to capture multi-step tool dependencies and evolving task context. This limitation often introduces irrelevant tools that mislead the agent, degrading efficiency and accuracy. We propose Dynamic Tool Dependency Retrieval (DTDR), a lightweight retrieval method that conditions on both the initial query and the evolving execution context. DTDR models tool dependencies from function calling demonstrations, enabling adaptive retrieval as plans unfold. We benchmark DTDR against state-of-the-art retrieval methods across multiple datasets and LLM backbones, evaluating retrieval precision, downstream task accuracy, and computational efficiency. Additionally, we explore strategies to integrate retrieved tools into prompts. Our results show that dynamic tool retrieval improves function calling success rates between 23% and 104% compared to state-of-the-art static retrievers.
[LG-48] SFBD-OMNI: Bridge models for lossy measurement restoration with limited clean samples
链接: https://arxiv.org/abs/2512.17051
作者: Haoye Lu,Yaoliang Yu,Darren Ho
类目: Machine Learning (cs.LG)
*备注:
Abstract:In many real-world scenarios, obtaining fully observed samples is prohibitively expensive or even infeasible, while partial and noisy observations are comparatively easy to collect. In this work, we study distribution restoration with abundant noisy samples, assuming the corruption process is available as a black-box generator. We show that this task can be framed as a one-sided entropic optimal transport problem and solved via an EM-like algorithm. We further provide a test criterion to determine whether the true underlying distribution is recoverable under per-sample information loss, and show that in otherwise unrecoverable cases, a small number of clean samples can render the distribution largely recoverable. Building on these insights, we introduce SFBD-OMNI, a bridge model-based framework that maps corrupted sample distributions to the ground-truth distribution. Our method generalizes Stochastic Forward-Backward Deconvolution (SFBD; Lu et al., 2025) to handle arbitrary measurement models beyond Gaussian corruption. Experiments across benchmark datasets and diverse measurement settings demonstrate significant improvements in both qualitative and quantitative performance.
[LG-49] GB-DQN: Gradient Boosted DQN Models for Non-stationary Reinforcement Learning
链接: https://arxiv.org/abs/2512.17034
作者: Chang-Hwan Lee,Chanseung Lee
类目: Machine Learning (cs.LG)
*备注: 23 pages. Submitted to Machine Learning
Abstract:Non-stationary environments pose a fundamental challenge for deep reinforcement learning, as changes in dynamics or rewards invalidate learned value functions and cause catastrophic forgetting. We propose Gradient-Boosted Deep Q-Networks (GB-DQN), an adaptive ensemble method that addresses model drift through incremental residual learning. Instead of retraining a single Q-network, GB-DQN constructs an additive ensemble in which each new learner is trained to approximate the Bellman residual of the current ensemble after drift. We provide theoretical results showing that each boosting step reduces the empirical Bellman residual and that the ensemble converges to the post-drift optimal value function under standard assumptions. Experiments across a diverse set of control tasks with controlled dynamics changes demonstrate faster recovery, improved stability, and greater robustness compared to DQN and common non-stationary baselines.
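To make the boosting idea concrete, here is a minimal sketch of an additive Q-ensemble where each new learner is fit to the Bellman residual of the current ensemble after a drift. It uses linear learners and ridge regression purely for illustration; the paper's method uses deep Q-networks and a different training procedure.

```python
import numpy as np

class GBQEnsemble:
    """Minimal sketch of a gradient-boosted Q-ensemble (GB-DQN-like), assuming
    linear learners over state features; the paper uses deep Q-networks."""

    def __init__(self, n_features, n_actions):
        self.learners = []  # each learner: (n_features, n_actions) weight matrix
        self.n_features, self.n_actions = n_features, n_actions

    def q_values(self, phi):
        # Additive ensemble: Q(s, .) = sum_k phi(s) @ W_k
        q = np.zeros((phi.shape[0], self.n_actions))
        for w in self.learners:
            q += phi @ w
        return q

    def add_residual_learner(self, phi, a, r, phi_next, gamma=0.99, lam=1e-3):
        """Fit a new learner to the Bellman residual of the current ensemble."""
        q = self.q_values(phi)
        target = r + gamma * self.q_values(phi_next).max(axis=1)
        residual = target - q[np.arange(len(a)), a]
        # Ridge regression of the residual onto the taken-action columns.
        w = np.zeros((self.n_features, self.n_actions))
        for act in range(self.n_actions):
            mask = (a == act)
            if mask.any():
                X = phi[mask]
                w[:, act] = np.linalg.solve(X.T @ X + lam * np.eye(self.n_features),
                                            X.T @ residual[mask])
        self.learners.append(w)

# Toy usage after a drift: fit one residual learner on a batch of transitions.
rng = np.random.default_rng(0)
ens = GBQEnsemble(n_features=4, n_actions=2)
phi, phi_next = rng.normal(size=(32, 4)), rng.normal(size=(32, 4))
ens.add_residual_learner(phi, rng.integers(0, 2, 32), rng.normal(size=32), phi_next)
print(ens.q_values(phi[:3]))
```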
[LG-50] Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs
链接: https://arxiv.org/abs/2512.17008
作者: Junbo Li,Peng Zhou,Rui Meng,Meet P. Vadera,Lihong Li,Yang Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement learning (RL) has re-emerged as a natural approach for training interactive LLM agents in real-world environments. However, directly applying the widely used Group Relative Policy Optimization (GRPO) algorithm to multi-turn tasks exposes notable limitations, particularly in scenarios requiring long-horizon reasoning. To address these challenges, we investigate more stable and effective advantage estimation strategies, especially for multi-turn settings. We first explore Proximal Policy Optimization (PPO) as an alternative and find it to be more robust than GRPO. To further enhance PPO in multi-turn scenarios, we introduce turn-PPO, a variant that operates on a turn-level MDP formulation, as opposed to the commonly used token-level MDP. Our results on the WebShop and Sokoban datasets demonstrate the effectiveness of turn-PPO, both with and without long reasoning components.
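The key idea is estimating advantages at the turn level rather than the token level. Below is a small illustrative sketch (not the authors' code) that runs GAE over turns and then broadcasts each turn's advantage to its tokens; the discount and lambda values are arbitrary.

```python
import numpy as np

def turn_level_gae(turn_rewards, turn_values, gamma=1.0, lam=0.95):
    """Minimal sketch of turn-level advantage estimation (turn-PPO-like):
    GAE is computed over turns rather than tokens; the same advantage is then
    assigned to every token in that turn. Details here are illustrative."""
    T = len(turn_rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        next_v = turn_values[t + 1] if t + 1 < T else 0.0
        delta = turn_rewards[t] + gamma * next_v - turn_values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

def broadcast_to_tokens(turn_adv, tokens_per_turn):
    # Each token in turn t receives advantage adv[t].
    return np.repeat(turn_adv, tokens_per_turn)

adv = turn_level_gae([0.0, 0.0, 1.0], [0.2, 0.4, 0.6])
print(broadcast_to_tokens(adv, [12, 8, 5]).shape)  # (25,)
```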
[LG-51] Physics-Informed Lightweight Machine Learning for Aviation Visibility Nowcasting Across Multiple Climatic Regimes
链接: https://arxiv.org/abs/2512.16967
作者: Marcelo Cerda Castillo
类目: Machine Learning (cs.LG)
*备注: 12 pages, 5 tables, 1 figure. Uses publicly available METAR surface observations and TAF forecast data for benchmarking
Abstract:Short-term prediction (nowcasting) of low-visibility and precipitation events is critical for aviation safety and operational efficiency. Current operational approaches rely on computationally intensive numerical weather prediction guidance and human-issued TAF products, which often exhibit conservative biases and limited temporal resolution. This study presents a lightweight gradient boosting framework (XGBoost) trained exclusively on surface observation data (METAR) and enhanced through physics-guided feature engineering based on thermodynamic principles. The framework is evaluated across 11 international airports representing distinct climatic regimes (including SCEL, KJFK, KORD, KDEN, SBGR, and VIDP) using historical data from 2000 to 2024. Results suggest that the model successfully captures underlying local physical processes without manual configuration. In a blind comparative evaluation against operational TAF forecasts, the automated model achieved substantially higher detection rates at tactical horizons (3 hours), with a 2.5 to 4.0 times improvement in recall while reducing false alarms. Furthermore, SHAP analysis reveals that the model performs an implicit reconstruction of local physical drivers (advection, radiation, and subsidence), providing actionable explainability for operational situational awareness. Keywords: aviation meteorology; physics-guided machine learning; explainable artificial intelligence; lightweight machine learning; nowcasting; METAR; TAF verification; edge computing
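As an example of what "physics-guided feature engineering" from surface observations can look like, the sketch below derives dew-point depression and relative humidity (via the standard Magnus approximation) from METAR-style fields and feeds them to an XGBoost classifier. Column names, thresholds, and the synthetic label are assumptions, not the paper's setup.

```python
import numpy as np
import pandas as pd
from xgboost import XGBClassifier  # the paper uses an XGBoost gradient-boosting model

def saturation_vapor_pressure(t_c):
    # Magnus approximation (hPa), a standard thermodynamic relation.
    return 6.112 * np.exp(17.62 * t_c / (243.12 + t_c))

def physics_features(df):
    """Illustrative physics-guided features from METAR-style fields
    (temperature, dew point, wind speed); column names are assumptions."""
    out = pd.DataFrame(index=df.index)
    out["dewpoint_depression"] = df["temp_c"] - df["dewpoint_c"]
    out["rel_humidity"] = 100.0 * (saturation_vapor_pressure(df["dewpoint_c"])
                                   / saturation_vapor_pressure(df["temp_c"]))
    out["wind_speed_kt"] = df["wind_kt"]
    return out

# Toy example: predict a low-visibility flag from the engineered features.
rng = np.random.default_rng(0)
raw = pd.DataFrame({"temp_c": rng.uniform(-5, 30, 500),
                    "dewpoint_c": rng.uniform(-10, 25, 500),
                    "wind_kt": rng.uniform(0, 30, 500)})
X = physics_features(raw)
y = (X["dewpoint_depression"] < 2).astype(int)  # synthetic stand-in label
model = XGBClassifier(n_estimators=100, max_depth=4).fit(X, y)
print(model.predict(X[:5]))
```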
[LG-52] Compression is Routing: Reconstruction Error as an Intrinsic Signal for Modular Language Models
链接: https://arxiv.org/abs/2512.16963
作者: Zhongpan Tang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Current Large Language Models (LLMs) face three major challenges: context length limitations, high inference costs, and catastrophic forgetting during continual learning. While Mixture-of-Experts (MoE) architectures mitigate some of these conflicts, their routing mechanisms typically rely on explicitly trained auxiliary classifiers. This not only increases system complexity but also often lacks interpretability when handling mixed-domain inputs. Building upon the premise that "Compression is Intelligence," this paper proposes a novel architectural philosophy: "Compression is Routing." We trained an 87M-parameter end-to-end Transformer Autoencoder, achieving a 64x sequence length compression (compressing 512 tokens into 8 latent vectors). Experimental results demonstrate that this compressor possesses extreme domain discriminative capability: it achieves a reconstruction accuracy of 99.47% on the in-domain (code) validation set; accuracy drops sharply to 47.76% on a semi-out-of-distribution domain (Wiki text); and further plummets to just 0.57% on a fully out-of-distribution domain (random sequences). This extreme and systematic performance discrepancy establishes the validity of reconstruction error as an Intrinsic Distribution Fingerprint. Based on this, we propose that expert modules can be automatically scheduled using reconstruction residuals directly, without the need for explicit gating networks. This mechanism offers excellent scalability. Furthermore, this architecture provides a new perspective on "VRAM compression" for handling ultra-long contexts. This report aims to verify the physical validity of this foundational architecture, offering a new research perspective for the next generation of scalable modular neural networks.
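The routing mechanism described here is simple enough to sketch: each expert owns a compressor, and an input is dispatched to the expert whose compressor reconstructs it with the lowest error, with no trained gating network. The toy below uses fixed linear autoencoders as stand-ins for the paper's Transformer autoencoder.

```python
import numpy as np

def route_by_reconstruction_error(x, experts):
    """Minimal sketch of 'compression is routing': each expert owns a compressor
    (an encode/decode pair); the input goes to the expert whose compressor
    reconstructs it best, with no trained gating network."""
    errors = []
    for encode, decode in experts:
        x_hat = decode(encode(x))
        errors.append(np.mean((x - x_hat) ** 2))
    return int(np.argmin(errors)), errors

# Toy experts: linear autoencoders, represented here by fixed orthonormal bases
# for illustration only; the paper uses trained Transformer autoencoders.
rng = np.random.default_rng(0)
def make_linear_ae(basis):
    return (lambda x: x @ basis, lambda z: z @ basis.T)

experts = [make_linear_ae(np.linalg.qr(rng.normal(size=(16, 4)))[0]) for _ in range(3)]
x = rng.normal(size=(16,))
choice, errs = route_by_reconstruction_error(x, experts)
print(choice, np.round(errs, 3))
```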
[LG-53] QSMOTE-PGM/kPGM: QSMOTE Based PGM and kPGM for Imbalanced Dataset Classification
链接: https://arxiv.org/abs/2512.16960
作者: Bikash K. Behera,Giuseppe Sergioli,Robert Giuntini
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 14 pages, 10 figures
Abstract:Quantum-inspired machine learning (QiML) leverages mathematical frameworks from quantum theory to enhance classical algorithms, with particular emphasis on inner product structures in high-dimensional feature spaces. Among the prominent approaches, the Kernel Trick, widely used in support vector machines, provides efficient similarity computation, while the Pretty Good Measurement (PGM), originating from quantum state discrimination, enables classification grounded in Hilbert space geometry. Building on recent developments in kernelized PGM (KPGM) and direct PGM-based classifiers, this work presents a unified theoretical and empirical comparison of these paradigms. We analyze their performance across synthetic oversampling scenarios using Quantum SMOTE (QSMOTE) variants. Experimental results show that both PGM and KPGM classifiers consistently outperform a classical random forest baseline, particularly when multiple quantum copies are employed. Notably, PGM with stereo encoding and n_copies=2 achieves the highest overall accuracy (0.8512) and F1-score (0.8234), while KPGM demonstrates competitive and more stable behavior across QSMOTE variants, with top scores of 0.8511 (stereo) and 0.8483 (amplitude). These findings highlight that quantum-inspired classifiers not only provide tangible gains in recall and balanced performance but also offer complementary strengths: PGM benefits from encoding-specific enhancements, whereas KPGM ensures robustness across sampling strategies. Our results advance the understanding of kernel-based and measurement-based QiML methods, offering practical guidance on their applicability under varying data characteristics and computational constraints.
[LG-54] SpIDER: Spatially Informed Dense Embedding Retrieval for Software Issue Localization
链接: https://arxiv.org/abs/2512.16956
作者: Shravan Chaudhari,Rahul Thomas Jacob,Mononito Goswami,Jiajun Cao,Shihab Rashid,Christian Bock
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: Initial preprint
Abstract:Retrieving code units (e.g., files, classes, functions) that are semantically relevant to a given user query, bug report, or feature request from large codebases is a fundamental challenge for LLM-based coding agents. Agentic approaches typically employ sparse retrieval methods like BM25 or dense embedding strategies to identify relevant units. While embedding-based approaches can outperform BM25 by large margins, they often lack exploration of the codebase and underutilize its underlying graph structure. To address this, we propose SpIDER (Spatially Informed Dense Embedding Retrieval), an enhanced dense retrieval approach that incorporates LLM-based reasoning over auxiliary context obtained through graph-based exploration of the codebase. Empirical results show that SpIDER consistently improves dense retrieval performance across several programming languages.
[LG-55] BIONIX: A Wireless Low-Cost Prosthetic Arm with Dual-Signal EEG and EMG Control
链接: https://arxiv.org/abs/2512.16929
作者: Pranesh Sathish Kumar
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 12 pages, 8 figures
Abstract:Affordable upper-limb prostheses often lack intuitive control systems, limiting functionality and accessibility for amputees in low-resource settings. This project presents a low-cost, dual-mode neuro-muscular control system integrating electroencephalography (EEG) and electromyography (EMG) to enable real-time, multi-degree-of-freedom control of a prosthetic arm. EEG signals are acquired using the NeuroSky MindWave Mobile 2 and transmitted via ThinkGear Bluetooth packets to an ESP32 microcontroller running a lightweight classification model. The model was trained on 1500 seconds of recorded EEG data using a 6-frame sliding window with low-pass filtering, excluding poor-signal samples and using a 70/20/10 training–validation–test split. The classifier detects strong blink events, which toggle the hand between open and closed states. EMG signals are acquired using a MyoWare 2.0 sensor and SparkFun wireless shield and transmitted to a second ESP32, which performs threshold-based detection. Three activation bands (rest: 0–T1; extension: T1–T2; contraction: greater than T2) enable intuitive elbow control, with movement triggered only after eight consecutive frames in a movement class to improve stability. The EEG-controlled ESP32 actuates four finger servos, while the EMG-controlled ESP32 drives two elbow servos. A functional prototype was constructed using low-cost materials (total cost approximately 240 dollars), with most expense attributed to the commercial EEG headset. Future work includes transitioning to a 3D-printed chassis, integrating auto-regressive models to reduce EMG latency, and upgrading servo torque for improved load capacity and grip strength. This system demonstrates a feasible pathway to low-cost, biologically intuitive prosthetic control suitable for underserved and global health applications.
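The EMG control logic in the abstract (three activation bands plus an eight-frame stability requirement) can be illustrated directly; the thresholds and signal values below are made up.

```python
def classify_emg(envelope, t1, t2, hold=8):
    """Sketch of the three-band EMG logic described above: rest (<= t1),
    extension (t1..t2), contraction (> t2); a command fires only after `hold`
    consecutive frames agree, mirroring the 8-frame stability rule."""
    commands = []
    prev, run = None, 0
    for v in envelope:
        state = "rest" if v <= t1 else ("extension" if v <= t2 else "contraction")
        run = run + 1 if state == prev else 1
        prev = state
        if run == hold and state != "rest":
            commands.append(state)  # issue one elbow command per stable burst
    return commands

# Toy envelope: a stable contraction burst surrounded by rest frames.
signal = [0.1] * 5 + [0.9] * 10 + [0.1] * 5
print(classify_emg(signal, t1=0.3, t2=0.6))  # ['contraction']
```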
[LG-56] Dion2: A Simple Method to Shrink Matrix in Muon MICRO
链接: https://arxiv.org/abs/2512.16928
作者: Kwangjun Ahn,Noah Amsel,John Langford
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: this https URL
Abstract:The Muon optimizer enjoys strong empirical performance and theoretical grounding. However, the super-linear cost of its orthonormalization step introduces increasing overhead with scale. To alleviate this cost, several works have attempted to reduce the size of the matrix entering the orthonormalization step. We introduce Dion2, a much simpler method for shrinking the matrix involved in Muon’s computation compared to prior approaches. At a high level, Dion2 selects a fraction of rows or columns at each iteration and orthonormalizes only those. This sampling procedure makes the update sparse, reducing both computation and communication costs which in turn improves the scalability of Muon.
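A rough reading of the idea: at each step only a sampled fraction of rows enters the orthonormalization, so the resulting update is sparse. The sketch below uses QR as a stand-in for Muon's Newton-Schulz iteration and makes no claim about matching Dion2's actual selection or scaling rules.

```python
import numpy as np

def dion2_like_update(grad, fraction=0.25, rng=None):
    """Row-subsampled orthonormalization sketch: only a sampled fraction of rows
    is orthonormalized (QR here stands in for Muon's Newton-Schulz step) and the
    remaining rows receive a zero update, so the step is sparse. This sketch
    assumes the number of sampled rows does not exceed the number of columns."""
    rng = rng if rng is not None else np.random.default_rng()
    m, n = grad.shape
    k = max(1, min(int(fraction * m), n))
    rows = rng.choice(m, size=k, replace=False)
    q, _ = np.linalg.qr(grad[rows].T)  # (n, k) with orthonormal columns
    update = np.zeros_like(grad)
    update[rows] = q.T                 # selected rows become orthonormal
    return update

g = np.random.default_rng(0).normal(size=(8, 5))
u = dion2_like_update(g, fraction=0.5)
print(np.count_nonzero(u.any(axis=1)), "of", g.shape[0], "rows updated")
```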
[LG-57] Learning vertical coordinates via automatic differentiation of a dynamical core
链接: https://arxiv.org/abs/2512.17877
作者: Tim Whittaker,Seth Taylor,Elsa Cardoso-Bihlo,Alejandro Di Luca,Alex Bihlo
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:
Abstract:Terrain-following coordinates in atmospheric models often imprint their grid structure onto the solution, particularly over steep topography, where distorted coordinate layers can generate spurious horizontal and vertical motion. Standard formulations, such as hybrid or SLEVE coordinates, mitigate these errors by using analytic decay functions controlled by heuristic scale parameters that are typically tuned by hand and fixed a priori. In this work, we propose a framework to define a parametric vertical coordinate system as a learnable component within a differentiable dynamical core. We develop an end-to-end differentiable numerical solver for the two-dimensional non-hydrostatic Euler equations on an Arakawa C-grid, and introduce a NEUral Vertical Enhancement (NEUVE) terrain-following coordinate based on an integral transformed neural network that guarantees monotonicity. A key feature of our approach is the use of automatic differentiation to compute exact geometric metric terms, thereby eliminating truncation errors associated with finite-difference coordinate derivatives. By coupling simulation errors through the time integration to the parameterization, our formulation finds a grid structure optimized for both the underlying physics and numerics. Using several standard tests, we demonstrate that these learned coordinates reduce the mean squared error by a factor of 1.4 to 2 in non-linear statistical benchmarks, and eliminate spurious vertical velocity striations over steep topography.
[LG-58] Domain-Aware Quantum Circuit for QML
链接: https://arxiv.org/abs/2512.17800
作者: Gurinder Singh,Thaddeus Pellegrini,Kenneth M. Merz Jr
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:Designing parameterized quantum circuits (PQCs) that are expressive, trainable, and robust to hardware noise is a central challenge for quantum machine learning (QML) on noisy intermediate-scale quantum (NISQ) devices. We present a Domain-Aware Quantum Circuit (DAQC) that leverages image priors to guide locality-preserving encoding and entanglement via non-overlapping DCT-style zigzag windows. The design employs interleaved encode-entangle-train cycles, where entanglement is applied among qubits hosting neighboring pixels, aligned to device connectivity. This staged, locality-preserving information flow expands the effective receptive field without deep global mixing, enabling efficient use of limited depth and qubits. The design concentrates representational capacity on short-range correlations, reduces long-range two-qubit operations, and encourages stable optimization, thereby mitigating depth-induced and globally entangled barren-plateau effects. We evaluate DAQC on MNIST, FashionMNIST, and PneumoniaMNIST datasets. On quantum hardware, DAQC achieves performance competitive with strong classical baselines (e.g., ResNet-18/50, DenseNet-121, EfficientNet-B0) and substantially outperforming Quantum Circuit Search (QCS) baselines. To the best of our knowledge, DAQC, which uses a quantum feature extractor with only a linear classical readout (no deep classical backbone), currently achieves the best reported performance on real quantum hardware for QML-based image classification tasks. Code and pretrained models are available at: this https URL.
[LG-59] Revisiting the Broken Symmetry Phase of Solid Hydrogen: A Neural Network Variational Monte Carlo Study
链接: https://arxiv.org/abs/2512.17703
作者: Shengdu Chai,Chen Lin,Xinyang Dong,Yuqiang Li,Wanli Ouyang,Lei Wang,X.C. Xie
类目: rongly Correlated Electrons (cond-mat.str-el); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:The crystal structure of high-pressure solid hydrogen remains a fundamental open problem. Although the research frontier has mostly shifted toward ultra-high pressure phases above 400 GPa, we show that even the broken symmetry phase observed around 130 GPa requires revisiting due to its intricate coupling of electronic and nuclear degrees of freedom. Here, we develop a first-principles quantum Monte Carlo framework based on a deep neural network wave function that treats both electrons and nuclei quantum mechanically within the constant pressure ensemble. Our calculations reveal an unreported ground-state structure candidate for the broken symmetry phase with Cmcm space group symmetry, and we test its stability up to 96 atoms. The predicted structure quantitatively matches the experimental equation of state and X-ray diffraction patterns. Furthermore, our group-theoretical analysis shows that the Cmcm structure is compatible with existing Raman and infrared spectroscopic data. Crucially, static density functional theory calculation reveals the Cmcm structure as a dynamically unstable saddle point on the Born-Oppenheimer potential energy surface, demonstrating that a full quantum many-body treatment of the problem is necessary. These results shed new light on the phase diagram of high-pressure hydrogen and call for further experimental verifications.
[LG-60] Imputation Uncertainty in Interpretable Machine Learning Methods IJCAI2025
链接: https://arxiv.org/abs/2512.17689
作者: Pegah Golchian,Marvin N. Wright
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 19 pages, 15 Figures, accepted at conference: IJCAI 2025 Workshop on Explainable Artificial Intelligence (Montreal, Canada)
Abstract:In real data, missing values occur frequently, which affects the interpretation with interpretable machine learning (IML) methods. Recent work considers bias and shows that model explanations may differ between imputation methods, while ignoring additional imputation uncertainty and its influence on variance and confidence intervals. We therefore compare the effects of different imputation methods on the confidence interval coverage probabilities of the IML methods permutation feature importance, partial dependence plots and Shapley values. We show that single imputation leads to underestimation of variance and that, in most cases, only multiple imputation is close to nominal coverage.
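For readers who want to see what "multiple imputation plus an IML method" looks like in practice, here is a generic sketch (not the paper's protocol): impute M times, compute permutation feature importance on each completed dataset, and combine point estimates and variances with Rubin's rules. The within-imputation variance proxy used here is a crude assumption.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.datasets import make_regression

# Synthetic data with 20% missing entries.
X, y = make_regression(n_samples=300, n_features=5, noise=0.5, random_state=0)
mask = np.random.default_rng(0).random(X.shape) < 0.2
X_missing = X.copy()
X_missing[mask] = np.nan

M = 5  # number of imputations
pfi_means, pfi_vars = [], []
for m in range(M):
    imputer = IterativeImputer(sample_posterior=True, random_state=m)
    X_imp = imputer.fit_transform(X_missing)
    model = RandomForestRegressor(random_state=0).fit(X_imp, y)
    r = permutation_importance(model, X_imp, y, n_repeats=10, random_state=0)
    pfi_means.append(r.importances_mean)
    pfi_vars.append(r.importances_std ** 2 / 10)  # rough variance of the mean over repeats

# Rubin's rules: total variance = within + (1 + 1/M) * between.
est = np.mean(pfi_means, axis=0)
within = np.mean(pfi_vars, axis=0)
between = np.var(pfi_means, axis=0, ddof=1)
total_var = within + (1 + 1 / M) * between
print(np.round(est, 3), np.round(np.sqrt(total_var), 3))
```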
[LG-61] Fraud detection in credit card transactions using Quantum-Assisted Restricted Boltzmann Machines
链接: https://arxiv.org/abs/2512.17660
作者: João Marcos Cavalcanti de Albuquerque Neto,Gustavo Castro do Amaral,Guilherme Penello Temporão
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 8 pages, 3 figures
Abstract:Use cases for emerging quantum computing platforms become economically relevant as the efficiency of processing and availability of quantum computers increase. We assess the performance of Restricted Boltzmann Machines (RBM) assisted by quantum computing, running on real quantum hardware and simulators, using a real dataset containing 145 million transactions provided by Stone, a leading Brazilian fintech, for credit card fraud detection. The results suggest that the quantum-assisted RBM method is able to achieve superior performance in most figures of merit in comparison to classical approaches, even using current noisy quantum annealers. Our study paves the way for implementing quantum-assisted RBMs for general fault detection in financial systems.
[LG-62] Generative Multi-Objective Bayesian Optimization with Scalable Batch Evaluations for Sample-Efficient De Novo Molecular Design
链接: https://arxiv.org/abs/2512.17659
作者: Madhav R. Muthyala,Farshud Sorourifar,Tianhong Tan,You Peng,Joel A. Paulson
类目: Machine Learning (stat.ML); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:
Abstract:Designing molecules that must satisfy multiple, often conflicting objectives is a central challenge in molecular discovery. The enormous size of chemical space and the cost of high-fidelity simulations have driven the development of machine learning-guided strategies for accelerating design with limited data. Among these, Bayesian optimization (BO) offers a principled framework for sample-efficient search, while generative models provide a mechanism to propose novel, diverse candidates beyond fixed libraries. However, existing methods that couple the two often rely on continuous latent spaces, which introduces both architectural entanglement and scalability challenges. This work introduces an alternative, modular “generate-then-optimize” framework for de novo multi-objective molecular design/discovery. At each iteration, a generative model is used to construct a large, diverse pool of candidate molecules, after which a novel acquisition function, qPMHI (multi-point Probability of Maximum Hypervolume Improvement), is used to optimally select a batch of candidates most likely to induce the largest Pareto front expansion. The key insight is that qPMHI decomposes additively, enabling exact, scalable batch selection via only simple ranking of probabilities that can be easily estimated with Monte Carlo sampling. We benchmark the framework against state-of-the-art latent-space and discrete molecular optimization methods, demonstrating significant improvements across synthetic benchmarks and application-driven tasks. Specifically, in a case study related to sustainable energy storage, we show that our approach quickly uncovers novel, diverse, and high-performing organic (quinone-based) cathode materials for aqueous redox flow battery applications.
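Based only on the abstract's description, a qPMHI-style selection might look like the sketch below: draw Monte Carlo samples of each candidate's objectives, record which candidate yields the maximum hypervolume improvement in each draw, and rank candidates by that win frequency. The two-objective hypervolume routine and the per-candidate ranking are simplifying assumptions, so treat this as an illustration rather than the authors' estimator.

```python
import numpy as np

def hypervolume_2d(points, ref):
    """Hypervolume (for maximization) dominated by `points` above reference `ref`,
    for exactly two objectives."""
    pts = sorted([p for p in points if p[0] > ref[0] and p[1] > ref[1]],
                 key=lambda p: -p[0])
    hv, best_f2 = 0.0, ref[1]
    for i, (f1, f2) in enumerate(pts):
        best_f2 = max(best_f2, f2)
        next_f1 = pts[i + 1][0] if i + 1 < len(pts) else ref[0]
        hv += (f1 - next_f1) * (best_f2 - ref[1])
    return hv

def qpmhi_select(samples, pareto_front, ref, q):
    """qPMHI-style batch selection sketch: `samples` has shape
    (n_draws, n_candidates, 2) and holds Monte Carlo draws of each candidate's
    objectives (e.g. from a surrogate posterior). In each draw the candidate with
    the largest hypervolume improvement 'wins'; candidates are ranked by win
    frequency and the top q are returned."""
    n_draws, n_cand, _ = samples.shape
    base = hypervolume_2d(pareto_front, ref)
    wins = np.zeros(n_cand)
    for s in range(n_draws):
        hvi = [hypervolume_2d(pareto_front + [tuple(samples[s, c])], ref) - base
               for c in range(n_cand)]
        wins[np.argmax(hvi)] += 1
    prob = wins / n_draws
    return np.argsort(-prob)[:q], prob

rng = np.random.default_rng(0)
front = [(0.8, 0.2), (0.5, 0.5), (0.2, 0.8)]
draws = rng.random((64, 10, 2))  # 64 MC draws for 10 candidate molecules
print(qpmhi_select(draws, front, ref=(0.0, 0.0), q=3))
```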
[LG-63] Resource-efficient medical image classification for edge devices
链接: https://arxiv.org/abs/2512.17515
作者: Mahsa Lavaei,Zahra Abadi,Salar Beigzad,Alireza Maleki
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: Conference paper published in ICAMIDA 2025 (IEEE)
Abstract:Medical image classification is a critical task in healthcare, enabling accurate and timely diagnosis. However, deploying deep learning models on resource-constrained edge devices presents significant challenges due to computational and memory limitations. This research investigates a resource-efficient approach to medical image classification by employing model quantization techniques. Quantization reduces the precision of model parameters and activations, significantly lowering computational overhead and memory requirements without sacrificing classification accuracy. The study focuses on the optimization of quantization-aware training (QAT) and post-training quantization (PTQ) methods tailored for edge devices, analyzing their impact on model performance across medical imaging datasets. Experimental results demonstrate that quantized models achieve substantial reductions in model size and inference latency, enabling real-time processing on edge hardware while maintaining clinically acceptable diagnostic accuracy. This work provides a practical pathway for deploying AI-driven medical diagnostics in remote and resource-limited settings, enhancing the accessibility and scalability of healthcare technologies.
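As a minimal illustration of the core arithmetic behind post-training quantization (independent of any specific framework), the sketch below symmetrically quantizes a weight matrix to int8 and measures the rounding error; real deployments would use a framework's QAT/PTQ tooling rather than this toy.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization of a weight array."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(64, 64)).astype(np.float32)
q, s = quantize_int8(w)
err = np.mean((w - dequantize(q, s)) ** 2)
print(f"int8 storage is 25% of float32; mean squared rounding error = {err:.2e}")
```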
[LG-64] Alternating Direction Method of Multipliers for Nonlinear Matrix Decompositions
链接: https://arxiv.org/abs/2512.17473
作者: Atharva Awari,Nicolas Gillis,Arnaud Vandaele
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 14 pages, 6 figures. Code available from this https URL
Abstract:We present an algorithm based on the alternating direction method of multipliers (ADMM) for solving nonlinear matrix decompositions (NMD). Given an input matrix X \in \mathbb{R}^{m \times n} and a factorization rank r \ll \min(m, n), NMD seeks matrices W \in \mathbb{R}^{m \times r} and H \in \mathbb{R}^{r \times n} such that X \approx f(WH), where f is an element-wise nonlinear function. We evaluate our method on several representative nonlinear models: the rectified linear unit activation f(x) = \max(0, x), suitable for nonnegative sparse data approximation, the component-wise square f(x) = x^2, applicable to probabilistic circuit representation, and the MinMax transform f(x) = \min(b, \max(a, x)), relevant for recommender systems. The proposed framework flexibly supports diverse loss functions, including least squares, \ell_1 norm, and the Kullback-Leibler divergence, and can be readily extended to other nonlinearities and metrics. We illustrate the applicability, efficiency, and adaptability of the approach on real-world datasets, highlighting its potential for a broad range of applications.
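To make the model concrete for the ReLU case f(x) = max(0, x), here is a simple alternating baseline (not the paper's ADMM): alternate a rank-r truncated SVD with a latent-variable update that matches X on its positive support and only constrains the sign elsewhere.

```python
import numpy as np

def relu_nmd_naive(X, r, n_iter=200):
    """Simple alternating baseline for X ≈ max(0, WH): alternate a rank-r
    truncated SVD of a latent matrix Z with an update that sets Z = X on the
    positive support and only requires Z <= 0 where X = 0. This illustrates the
    model, not the paper's ADMM algorithm."""
    Z = X.copy()
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        W, H = U[:, :r] * s[:r], Vt[:r]
        WH = W @ H
        Z = np.where(X > 0, X, np.minimum(0.0, WH))
    return W, H

# Toy check on data generated exactly as max(0, low-rank).
rng = np.random.default_rng(1)
X = np.maximum(0.0, rng.normal(size=(20, 3)) @ rng.normal(size=(3, 15)))
W, H = relu_nmd_naive(X, r=3)
print(np.linalg.norm(X - np.maximum(0.0, W @ H)) / np.linalg.norm(X))
```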
[LG-65] Perfect reconstruction of sparse signals using nonconvexity control and one-step RSB message passing
链接: https://arxiv.org/abs/2512.17426
作者: Xiaosi Gu,Ayaka Sakata,Tomoyuki Obuchi
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注: 49 pages, 10 figures
Abstract:We consider sparse signal reconstruction via minimization of the smoothly clipped absolute deviation (SCAD) penalty, and develop one-step replica-symmetry-breaking (1RSB) extensions of approximate message passing (AMP), termed 1RSB-AMP. Starting from the 1RSB formulation of belief propagation, we derive explicit update rules of 1RSB-AMP together with the corresponding state evolution (1RSB-SE) equations. A detailed comparison shows that 1RSB-AMP and 1RSB-SE agree remarkably well at the macroscopic level, even in parameter regions where replica-symmetric (RS) AMP, termed RS-AMP, diverges and where the 1RSB description itself is not expected to be thermodynamically exact. Fixed-point analysis of 1RSB-SE reveals a phase diagram consisting of success, failure, and diverging phases, as in the RS case. However, the diverging-region boundary now depends on the Parisi parameter due to the 1RSB ansatz, and we propose a new criterion – minimizing the size of the diverging region – rather than the conventional zero-complexity condition, to determine its value. Combining this criterion with the nonconvexity-control (NCC) protocol proposed in a previous RS study improves the algorithmic limit of perfect reconstruction compared with RS-AMP. Numerical solutions of 1RSB-SE and experiments with 1RSB-AMP confirm that this improved limit is achieved in practice, though the gain is modest and remains slightly inferior to the Bayes-optimal threshold. We also report the behavior of thermodynamic quantities – overlaps, free entropy, complexity, and the non-self-averaging susceptibility – that characterize the 1RSB phase in this problem.
[LG-66] Sharp Structure-Agnostic Lower Bounds for General Functional Estimation
链接: https://arxiv.org/abs/2512.17341
作者: Jikai Jin,Vasilis Syrgkanis
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST); Methodology (stat.ME)
*备注: 95 pages; generalize and subsume partial results of arXiv:2402.14264 by the same authors
Abstract:The design of efficient nonparametric estimators has long been a central problem in statistics, machine learning, and decision making. Classical optimal procedures often rely on strong structural assumptions, which can be misspecified in practice and complicate deployment. This limitation has sparked growing interest in structure-agnostic approaches – methods that debias black-box nuisance estimates without imposing structural priors. Understanding the fundamental limits of these methods is therefore crucial. This paper provides a systematic investigation of the optimal error rates achievable by structure-agnostic estimators. We first show that, for estimating the average treatment effect (ATE), a central parameter in causal inference, doubly robust learning attains optimal structure-agnostic error rates. We then extend our analysis to a general class of functionals that depend on unknown nuisance functions and establish the structure-agnostic optimality of debiased/double machine learning (DML). We distinguish two regimes – one where double robustness is attainable and one where it is not – leading to different optimal rates for first-order debiasing, and show that DML is optimal in both regimes. Finally, we instantiate our general lower bounds by deriving explicit optimal rates that recover existing results and extend to additional estimands of interest. Our results provide theoretical validation for widely used first-order debiasing methods and guidance for practitioners seeking optimal approaches in the absence of structural assumptions. This paper generalizes and subsumes the ATE lower bound established in Jin and Syrgkanis (2024, arXiv:2402.14264) by the same authors.
[LG-67] Penalized Fair Regression for Multiple Groups in Chronic Kidney Disease
链接: https://arxiv.org/abs/2512.17340
作者: Carter H. Nakamoto,Lucia Lushi Chen,Agata Foryciarz,Sherri Rose
类目: Methodology (stat.ME); Computers and Society (cs.CY); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注:
Abstract:Fair regression methods have the potential to mitigate societal bias concerns in health care, but there has been little work on penalized fair regression when multiple groups experience such bias. We propose a general regression framework that addresses this gap with unfairness penalties for multiple groups. Our approach is demonstrated for binary outcomes with true positive rate disparity penalties. It can be efficiently implemented through reduction to a cost-sensitive classification problem. We additionally introduce novel score functions for automatically selecting penalty weights. Our penalized fair regression methods are empirically studied in simulations, where they achieve a fairness-accuracy frontier beyond that of existing comparison methods. Finally, we apply these methods to a national multi-site primary care study of chronic kidney disease to develop a fair classifier for end-stage renal disease. There we find substantial improvements in fairness for multiple race and ethnicity groups who experience societal bias in the health care system without any appreciable loss in overall fit.
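An illustrative, differentiable version of "logistic loss plus true-positive-rate disparity penalties over multiple groups" is sketched below using a sigmoid surrogate for the TPR; this mirrors the stated idea but not the authors' cost-sensitive reduction or penalty-weight selection.

```python
import torch

def fair_logistic_loss(logits, y, groups, lam=1.0):
    """Illustrative penalized objective: logistic loss plus squared TPR disparities
    for multiple protected groups, with a sigmoid surrogate so the penalty is
    differentiable. Not the authors' exact formulation."""
    bce = torch.nn.functional.binary_cross_entropy_with_logits(logits, y)
    probs = torch.sigmoid(logits)
    pos = y == 1
    overall_tpr = probs[pos].mean()
    penalty = 0.0
    for g in torch.unique(groups):
        mask = pos & (groups == g)
        if mask.any():
            penalty = penalty + (probs[mask].mean() - overall_tpr) ** 2
    return bce + lam * penalty

# Toy usage: linear model on random data with three groups.
torch.manual_seed(0)
X = torch.randn(200, 5)
y = (torch.rand(200) < 0.4).float()
groups = torch.randint(0, 3, (200,))
w = torch.zeros(5, requires_grad=True)
opt = torch.optim.Adam([w], lr=0.05)
for _ in range(100):
    opt.zero_grad()
    loss = fair_logistic_loss(X @ w, y, groups, lam=2.0)
    loss.backward()
    opt.step()
print(float(loss))
```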
[LG-68] Machine Learning Assisted Parameter Tuning on Wavelet Transform Amorphous Radial Distribution Function
链接: https://arxiv.org/abs/2512.17245
作者: Deriyan Senjaya,Stephen Ekaputra Limantoro
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注:
Abstract:Understanding atomic structures is crucial, yet amorphous materials remain challenging due to their irregular and non-periodic nature. The wavelet-transform radial distribution function (WT-RDF) offers a physics-based framework for analyzing amorphous structures, reliably predicting the first and second RDF peaks and overall curve trends in both binary Ge 0.25 Se 0.75 and ternary Ag x(Ge 0.25 Se 0.75)100-x (x=5,10,15,20,25) systems. Despite these strengths, WT-RDF shows limitations in amplitude accuracy, which affects quantitative analyses such as coordination numbers. This study addresses the issue by optimizing WT-RDF parameters using a machine learning approach, producing the enhanced WT-RDF+ framework. WT-RDF+ improves the precision of peak predictions and outperforms benchmark ML models, including RBF and LSTM, even when trained on only 25 percent of the binary dataset. These results demonstrate that WT-RDF+ is a robust and reliable model for structural characterization of amorphous materials, particularly Ge-Se systems, and support the efficient design and development of phase-change thin films for next-generation electronic devices and components.
[LG-69] Application of machine learning to predict food processing level using Open Food Facts
链接: https://arxiv.org/abs/2512.17169
作者: Nalin Arora,Aviral Chauhan,Siddhant Rana,Mahansh Aditya,Sumit Bhagat,Aditya Kumar,Akash Kumar,Akanksh Semar,Ayush Vikram Singh,Ganesh Bagler
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注: 27 Pages (22 Pages of Main Manuscript + Supplementary Material), 7 Figures, 1 Table
Abstract:Ultra-processed foods are increasingly linked to health issues like obesity, cardiovascular disease, type 2 diabetes, and mental health disorders due to poor nutritional quality. This first-of-its-kind study at such a scale uses machine learning to classify food processing levels (NOVA) based on the Open Food Facts dataset of over 900,000 products. Models including LightGBM, Random Forest, and CatBoost were trained on nutrient concentration data. LightGBM performed best, achieving 80-85% accuracy across different nutrient panels and effectively distinguishing minimally from ultra-processed foods. Exploratory analysis revealed strong associations between higher NOVA classes and lower Nutri-Scores, indicating poorer nutritional quality. Products in NOVA 3 and 4 also had higher carbon footprints and lower Eco-Scores, suggesting greater environmental impact. Allergen analysis identified gluten and milk as common in ultra-processed items, posing risks to sensitive individuals. Categories like Cakes and Snacks were dominant in higher NOVA classes, which also had more additives, highlighting the role of ingredient modification. This study, leveraging the largest dataset of NOVA-labeled products, emphasizes the health, environmental, and allergenic implications of food processing and showcases machine learning’s value in scalable classification. A user-friendly web tool is available for NOVA prediction using nutrient data: this https URL.
[LG-70] Disentangled representations via score-based variational autoencoders
链接: https://arxiv.org/abs/2512.17127
作者: Benjamin S. H. Lyo,Eero P. Simoncelli,Cristina Savin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 34 pages, 7 figures
Abstract:We present the Score-based Autoencoder for Multiscale Inference (SAMI), a method for unsupervised representation learning that combines the theoretical frameworks of diffusion models and VAEs. By unifying their respective evidence lower bounds, SAMI formulates a principled objective that learns representations through score-based guidance of the underlying diffusion process. The resulting representations automatically capture meaningful structure in the data: it recovers ground truth generative factors in our synthetic dataset, learns factorized, semantic latent dimensions from complex natural images, and encodes video sequences into latent trajectories that are straighter than those of alternative encoders, despite training exclusively on static images. Furthermore, SAMI can extract useful representations from pre-trained diffusion models with minimal additional training. Finally, the explicitly probabilistic formulation provides new ways to identify semantically meaningful axes in the absence of supervised labels, and its mathematical exactness allows us to make formal statements about the nature of the learned representation. Overall, these results indicate that implicit structural information in diffusion models can be made explicit and interpretable through synergistic combination with a variational autoencoder.
信息检索
[IR-0] The Mental World of Large Language Models in Recommendation: A Benchmark on Association, Personalization and Knowledgeability KDD2025
链接: https://arxiv.org/abs/2512.17389
作者: Guangneng Hu
类目: Information Retrieval (cs.IR)
*备注: 21 pages, 13 figures, 27 tables, submission to KDD 2025
Abstract:Large language models (LLMs) have shown potential in recommendation systems (RecSys) by using them as either knowledge enhancer or zero-shot ranker. A key challenge lies in the large semantic gap between LLMs and RecSys where the former internalizes language world knowledge while the latter captures personalized world of behaviors. Unfortunately, the research community lacks a comprehensive benchmark that evaluates the LLMs over their limitations and boundaries in RecSys so that we can draw a confident conclusion. To investigate this, we propose a benchmark named LRWorld containing over 38K high-quality samples and 23M tokens carefully compiled and generated from widely used public recommendation datasets. LRWorld categorizes the mental world of LLMs in RecSys as three main scales (association, personalization, and knowledgeability) spanned by ten factors with 31 measures (tasks). Based on LRWorld, comprehensive experiments on dozens of LLMs show that they are still not well capturing the deep neural personalized embeddings but can achieve good results on shallow memory-based item-item similarity. They are also good at perceiving item entity relations, entity hierarchical taxonomies, and item-item association rules when inferring user interests. Furthermore, LLMs show a promising ability in multimodal knowledge reasoning (movie poster and product image) and robustness to noisy profiles. None of them show consistently good performance over the ten factors. Model sizes, position bias, and more are ablated.
[IR-1] TCDE: Topic-Centric Dual Expansion of Queries and Documents with Large Language Models for Information Retrieval
链接: https://arxiv.org/abs/2512.17164
作者: Yu Yang,Feng Tian,Ping Chen
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Query Expansion (QE) enriches queries and Document Expansion (DE) enriches documents, and these two techniques are often applied separately. However, such separate application may lead to semantic misalignment between the expanded queries (or documents) and their relevant documents (or queries). To address this serious issue, we propose TCDE, a dual expansion strategy that leverages large language models (LLMs) for topic-centric enrichment on both queries and documents. In TCDE, we design two distinct prompt templates for processing each query and document. On the query side, an LLM is guided to identify distinct sub-topics within each query and generate a focused pseudo-document for each sub-topic. On the document side, an LLM is guided to distill each document into a set of core topic sentences. The resulting outputs are used to expand the original query and document. This topic-centric dual expansion process establishes semantic bridges between queries and their relevant documents, enabling better alignment for downstream retrieval models. Experiments on two challenging benchmarks, TREC Deep Learning and BEIR, demonstrate that TCDE achieves substantial improvements over strong state-of-the-art expansion baselines. In particular, on dense retrieval tasks, it outperforms several state-of-the-art methods, with a relative improvement of 2.8% in NDCG@10 on the SciFact dataset. Experimental results validate the effectiveness of our topic-centric and dual expansion strategy.
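The two prompt templates described in the abstract are easy to mock up; the sketch below assumes a hypothetical text-in/text-out `llm` callable and simply appends the generations to the original query and document before retrieval.

```python
def expand_query(query, llm):
    """Sketch of the query-side prompt: ask the model to split the query into
    sub-topics and write one short pseudo-document per sub-topic, then append the
    generations to the original query. `llm` is a hypothetical callable."""
    prompt = (
        "Identify the distinct sub-topics in the following search query. "
        "For each sub-topic, write a short focused passage that a relevant "
        f"document would contain.\n\nQuery: {query}"
    )
    return query + " " + llm(prompt)

def condense_document(doc, llm):
    """Sketch of the document-side prompt: distill the document into a few core
    topic sentences and append them, so both sides meet on the same topics."""
    prompt = (
        "Summarize the core topics of the following document as a short list of "
        f"topic sentences.\n\nDocument: {doc}"
    )
    return doc + " " + llm(prompt)

# Both expanded texts are then fed to the downstream (sparse or dense) retriever.
```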
[IR-2] A Reproducible and Fair Evaluation of Partition-aware Collaborative Filtering ECIR2026
链接: https://arxiv.org/abs/2512.17015
作者: Domenico De Gioia,Claudio Pomo,Ludovico Boratto,Tommaso Di Noia
类目: Information Retrieval (cs.IR)
*备注: accepted at ECIR 2026 reproducibility track
Abstract:Similarity-based collaborative filtering (CF) models have long demonstrated strong offline performance and conceptual simplicity. However, their scalability is limited by the quadratic cost of maintaining dense item-item similarity matrices. Partitioning-based paradigms have recently emerged as an effective strategy for balancing effectiveness and efficiency, enabling models to learn local similarities within coherent subgraphs while maintaining a limited global context. In this work, we focus on the Fine-tuning Partition-aware Similarity Refinement (FPSR) framework, a prominent representative of this family, as well as its extension, FPSR+. Reproducible evaluation of partition-aware collaborative filtering remains challenging, as prior FPSR/FPSR+ reports often rely on splits of unclear provenance and omit some similarity-based baselines, thereby complicating fair comparison. We present a transparent, fully reproducible benchmark of FPSR and FPSR+. Based on our results, the family of FPSR models does not consistently perform at the highest level. Overall, it remains competitive, validates its design choices, and shows significant advantages in long-tail scenarios. This highlights the accuracy-coverage trade-offs resulting from partitioning, global components, and hub design. Our investigation clarifies when partition-aware similarity modeling is most beneficial and offers actionable guidance for scalable recommender system design under reproducible protocols.

