This blog post presents the latest paper list retrieved from arXiv.org on 2025-03-26. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: The daily paper data is retrieved from arXiv.org and updated automatically around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.


Overview (2025-03-26)

A total of 494 new papers were updated today, including:

  • Natural Language Processing: 53 papers (Computation and Language, cs.CL)
  • Artificial Intelligence: 124 papers (cs.AI)
  • Computer Vision: 173 papers (Computer Vision and Pattern Recognition, cs.CV)
  • Machine Learning: 124 papers (cs.LG)

Natural Language Processing

[NLP-0] CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning

[Quick Read]: This paper addresses the limitations of large vision-language models (LVLMs) on high-fidelity representation-learning tasks (such as generating image or text embeddings for retrieval) while preserving their generative abilities. Existing approaches fine-tune LVLMs for representation learning but typically lose generative capability, because the training paradigms of representation learning and generation conflict. To resolve this trade-off, the paper proposes CAFe, a contrastive-autoregressive fine-tuning framework. Its key idea is to combine a contrastive learning objective with autoregressive language modeling, unifying representation learning and generation; the resulting model achieves state-of-the-art performance on both multimodal retrieval and generation benchmarks and mitigates problems such as object hallucination (OH). By synergizing embedding and generation in a single model, CAFe lays a foundation for future multimodal models that combine retrieval precision with coherent output generation.

Link: https://arxiv.org/abs/2503.19900
Authors: Hao Yu,Zhuokai Zhao,Shen Yan,Lukasz Korycki,Jianyu Wang,Baosheng He,Jiayi Liu,Lizhu Zhang,Xiangjun Fan,Hanchao Yu
Affiliations: Meta
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The rapid advancement of large vision-language models (LVLMs) has driven significant progress in multimodal tasks, enabling models to interpret, reason, and generate outputs across both visual and textual domains. While excelling in generative tasks, existing LVLMs often face limitations in tasks requiring high-fidelity representation learning, such as generating image or text embeddings for retrieval. Recent work has proposed finetuning LVLMs for representational learning, but the fine-tuned model often loses its generative capabilities due to the representational learning training paradigm. To address this trade-off, we introduce CAFe, a contrastive-autoregressive fine-tuning framework that enhances LVLMs for both representation and generative tasks. By integrating a contrastive objective with autoregressive language modeling, our approach unifies these traditionally separate tasks, achieving state-of-the-art results in both multimodal retrieval and multimodal generative benchmarks, including object hallucination (OH) mitigation. CAFe establishes a novel framework that synergizes embedding and generative functionalities in a single model, setting a foundation for future multimodal models that excel in both retrieval precision and coherent output generation.
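To make the training recipe concrete: the abstract describes combining a contrastive objective with autoregressive language modeling. Below is a minimal PyTorch sketch of such a joint loss, assuming pooled image/text embeddings and next-token logits are already available; the function name, tensor shapes, and the weight `alpha` are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def joint_contrastive_ar_loss(lm_logits, target_ids, img_emb, txt_emb,
                              temperature=0.07, alpha=1.0):
    """Sketch of a CAFe-style joint objective: autoregressive CE + InfoNCE.

    lm_logits: (B, T, V) next-token logits; target_ids: (B, T) labels.
    img_emb, txt_emb: (B, D) pooled embeddings of paired images/texts.
    `temperature` and `alpha` are assumed hyperparameters."""
    # Autoregressive language-modeling loss (next-token prediction).
    ar_loss = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                              target_ids.reshape(-1))
    # Symmetric InfoNCE: matched image/text pairs are positives.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t() / temperature            # (B, B) similarity matrix
    labels = torch.arange(sim.size(0), device=sim.device)
    con_loss = (F.cross_entropy(sim, labels) +
                F.cross_entropy(sim.t(), labels)) / 2
    return ar_loss + alpha * con_loss
```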

[NLP-1] CausalRAG: Integrating Causal Graphs into Retrieval-Augmented Generation

[Quick Read]: This paper tackles two key problems of conventional Retrieval-Augmented Generation (RAG) systems: contextual integrity broken by text chunking, and over-reliance on semantic similarity for retrieval. The core of the proposed solution is to bring causal graphs into the retrieval process: by constructing and tracing causal relationships, CausalRAG preserves contextual continuity and improves retrieval precision, yielding more accurate and interpretable answers. Evaluations show that CausalRAG outperforms both regular RAG and graph-based RAG across several metrics, suggesting that grounding retrieval in causal reasoning holds great promise for knowledge-intensive tasks.

Link: https://arxiv.org/abs/2503.19878
Authors: Nengbo Wang,Xiaotian Han,Jagdip Singh,Jing Ma,Vipin Chaudhary
Affiliations: Department of Computer and Data Sciences, Case Western Reserve University; Department of Design and Innovation, Case Western Reserve University
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Large language models (LLMs) have revolutionized natural language processing (NLP), particularly through Retrieval-Augmented Generation (RAG), which enhances LLM capabilities by integrating external knowledge. However, traditional RAG systems face critical limitations, including disrupted contextual integrity due to text chunking, and over-reliance on semantic similarity for retrieval. To address these issues, we propose CausalRAG, a novel framework that incorporates causal graphs into the retrieval process. By constructing and tracing causal relationships, CausalRAG preserves contextual continuity and improves retrieval precision, leading to more accurate and interpretable responses. We evaluate CausalRAG against regular RAG and graph-based RAG approaches, demonstrating its superiority across several metrics. Our findings suggest that grounding retrieval in causal reasoning provides a promising approach to knowledge-intensive tasks.

[NLP-2] Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators

[Quick Read]: This paper asks whether a language model's (LM) evaluation capability can be improved by spending more test-time compute. The key is to use reasoning models (LMs that natively produce long chains of thought) as evaluators and to exploit extra test-time compute in two ways: having the reasoning model generate longer reasoning chains (i.e., more reasoning tokens), and evaluating not only the overall output (outcome evaluation) but also each individual step (process evaluation). Experiments show that evaluator performance improves monotonically as the number of reasoning tokens grows, and that reranking multiple generations with these more accurate evaluators makes extra evaluation-time compute as effective as extra generation-time compute for improving an LM's problem-solving ability.

Link: https://arxiv.org/abs/2503.19877
Authors: Seungone Kim,Ian Wu,Jinu Lee,Xiang Yue,Seongyun Lee,Mingyeong Moon,Kiril Gashteovski,Carolin Lawrence,Julia Hockenmaier,Graham Neubig,Sean Welleck
Affiliations: CMU; UIUC; KAIST AI; NEC Laboratories Europe; Ss. Cyril and Methodius University of Skopje
Categories: Computation and Language (cs.CL)
Comments: Work in progress

Abstract:As language model (LM) outputs get more and more natural, it is becoming more difficult than ever to evaluate their quality. Simultaneously, increasing LMs’ “thinking” time through scaling test-time compute has proven an effective technique to solve challenging problems in domains such as math and code. This raises a natural question: can an LM’s evaluation capability also be improved by spending more test-time compute? To answer this, we investigate employing reasoning models (LMs that natively generate long chain-of-thought reasoning) as evaluators. Specifically, we examine methods to leverage more test-time compute by (1) using reasoning models, and (2) prompting these models to evaluate not only the response as a whole (i.e., outcome evaluation) but also assess each step in the response separately (i.e., process evaluation). In experiments, we observe that the evaluator’s performance improves monotonically when generating more reasoning tokens, similar to the trends observed in LM-based generation. Furthermore, we use these more accurate evaluators to rerank multiple generations, and demonstrate that spending more compute at evaluation time can be as effective as using more compute at generation time in improving an LM’s problem-solving capability.
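The reranking experiment can be pictured as a best-of-N loop in which evaluation-time compute buys a better pick among fixed generations. A hedged sketch, with `generate` and `evaluate` as hypothetical stand-ins for the generator LM and the reasoning-model evaluator (not APIs from the paper):

```python
from typing import Callable

def best_of_n(question: str,
              generate: Callable[[str], str],
              evaluate: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidate answers, score each with an evaluator LM, and
    return the highest-scoring one (outcome evaluation). A process
    evaluator would instead score each reasoning step and aggregate
    the per-step scores."""
    candidates = [generate(question) for _ in range(n)]
    scores = [evaluate(question, c) for c in candidates]
    return max(zip(scores, candidates))[1]
```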

[NLP-3] Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking

[Quick Read]: This paper targets the limitations of large language models (LLMs) in handling long texts and in reinforcement learning (RL) training efficiency. It proposes a simple yet effective test-time scaling approach called Multi-round Thinking, which iteratively refines the model's reasoning by feeding the previous round's answer back as a prompt, yielding significant performance gains. The key is using multiple rounds of iteration to progressively improve reasoning, producing stable and consistent gains across benchmark datasets.

Link: https://arxiv.org/abs/2503.19855
Authors: Xiaoyu Tian,Sitong Zhao,Haotian Wang,Shuaiting Chen,Yunjie Ji,Yiping Peng,Han Zhao,Xiangang Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Recent advances in large language models (LLMs), such as OpenAI-o1 and DeepSeek-R1, have demonstrated the effectiveness of test-time scaling, where extended reasoning processes substantially enhance model performance. Despite this, current models are constrained by limitations in handling long texts and reinforcement learning (RL) training efficiency. To address these issues, we propose a simple yet effective test-time scaling approach Multi-round Thinking. This method iteratively refines model reasoning by leveraging previous answers as prompts for subsequent rounds. Extensive experiments across multiple models, including QwQ-32B and DeepSeek-R1, consistently show performance improvements on various benchmarks such as AIME 2024, MATH-500, GPQA-diamond, and LiveCodeBench. For instance, the accuracy of QwQ-32B improved from 80.3% (Round 1) to 82.1% (Round 2) on the AIME 2024 dataset, while DeepSeek-R1 showed a similar increase from 79.7% to 82.0%. These results confirm that Multi-round Thinking is a broadly applicable, straightforward approach to achieving stable enhancements in model performance, underscoring its potential for future developments in test-time scaling techniques. The key prompt: “{original question prompt} The assistant’s previous answer is: <answer> {last round answer} </answer>, and please re-answer.”
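Given the key prompt quoted at the end of the abstract, the whole method reduces to a short loop. A sketch, where `generate` is a hypothetical call into the underlying model:

```python
def multi_round_thinking(question: str, generate, rounds: int = 2) -> str:
    """Iteratively re-prompt the model with its own previous answer,
    following the tag format of the key prompt quoted in the abstract."""
    answer = generate(question)
    for _ in range(rounds - 1):
        prompt = (f"{question}\n"
                  f"The assistant's previous answer is: "
                  f"<answer> {answer} </answer>, and please re-answer.")
        answer = generate(prompt)
    return answer
```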

[NLP-4] A Comparative Analysis of Word Segmentation, Part-of-Speech Tagging, and Named Entity Recognition for Historical Chinese Sources, 1900-1950 NAACL2025

[Quick Read]: This paper addresses the challenges that traditional NLP tools face when analyzing historical Chinese texts from 1900-1950, caused by the logographic script, the absence of natural word boundaries, and significant linguistic change. It compares large language models (LLMs) with traditional tools (such as Jieba and spaCy) on word segmentation, part-of-speech (POS) tagging, and named entity recognition (NER) to explore more effective text-analysis methods. The key insight is that the strong in-context learning ability of LLMs improves accuracy on historical texts without extensive domain-specific training data, albeit at considerably higher computational cost.

Link: https://arxiv.org/abs/2503.19844
Authors: Zhao Fang,Liang-Chun Wu,Xuening Kong,Spencer Dean Stewart
Affiliations: University of Chicago; Purdue University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to NLP4DH 2025 at NAACL 2025

Abstract:This paper compares large language models (LLMs) and traditional natural language processing (NLP) tools for performing word segmentation, part-of-speech (POS) tagging, and named entity recognition (NER) on Chinese texts from 1900 to 1950. Historical Chinese documents pose challenges for text analysis due to their logographic script, the absence of natural word boundaries, and significant linguistic changes. Using a sample dataset from the Shanghai Library Republican Journal corpus, traditional tools such as Jieba and spaCy are compared to LLMs, including GPT-4o, Claude 3.5, and the GLM series. The results show that LLMs outperform traditional methods in all metrics, albeit at considerably higher computational costs, highlighting a trade-off between accuracy and efficiency. Additionally, LLMs better handle genre-specific challenges such as poetry and temporal variations (i.e., pre-1920 versus post-1920 texts), demonstrating that their contextual learning capabilities can advance NLP approaches to historical texts by reducing the need for domain-specific training data.

[NLP-5] Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy NAACL2025

[Quick Read]: This paper addresses a limitation of meta-evaluation of automatic evaluation metrics: existing approaches focus on the absolute and relative quality of metrics over arbitrary system outputs, ignoring how effective a metric is in the specific context where it is applied. The key contribution is a contextual meta-evaluation method based on comparing the local accuracy of evaluation metrics. By analyzing how local metric accuracy varies across evaluation contexts in translation, speech recognition, and ranking tasks, the paper demonstrates the importance of adopting context-specific rather than global metric evaluations, enabling more precise measurement of NLP system performance for scientific inquiry, production model development, and policy enforcement.

Link: https://arxiv.org/abs/2503.19828
Authors: Athiya Deviyani,Fernando Diaz
Affiliations: Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments: Accepted to NAACL 2025 (Findings)

Abstract:Meta-evaluation of automatic evaluation metrics – assessing evaluation metrics themselves – is crucial for accurately benchmarking natural language processing systems and has implications for scientific inquiry, production model development, and policy enforcement. While existing approaches to metric meta-evaluation focus on general statements about the absolute and relative quality of metrics across arbitrary system outputs, in practice, metrics are applied in highly contextual settings, often measuring the performance for a highly constrained set of system outputs. For example, we may only be interested in evaluating a specific model or class of models. We introduce a method for contextual metric meta-evaluation by comparing the local metric accuracy of evaluation metrics. Across translation, speech recognition, and ranking tasks, we demonstrate that the local metric accuracies vary both in absolute value and relative effectiveness as we shift across evaluation contexts. This observed variation highlights the importance of adopting context-specific metric evaluations over global ones.
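One way to read “local metric accuracy” operationally is pairwise agreement with human judgments, computed only over outputs from the evaluation context of interest. A sketch under that assumption (the paper defines its own formal measure, which may differ):

```python
def local_metric_accuracy(pairs, metric_score, human_score):
    """Fraction of output pairs, drawn from a single evaluation context
    (e.g., outputs of one model class), on which the metric ranks the
    pair the same way humans do. `pairs` is a list of (output_a, output_b)."""
    agree = sum(
        (metric_score(a) > metric_score(b)) == (human_score(a) > human_score(b))
        for a, b in pairs
    )
    return agree / len(pairs)
```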

[NLP-6] SemEval-2025 Task 9: The Food Hazard Detection Challenge SEMEVAL2025

[Quick Read]: This paper addresses text-based food-hazard prediction with long-tail distributed classes. The task is split into two subtasks: (1) predicting whether a web text implies one of the food-hazard categories and identifying the associated food category, and (2) a finer-grained classification assigning specific labels to both the hazard and the product. The key findings are that synthetic data generated by large language models is highly effective for oversampling long-tail distributions, and that fine-tuned encoder-only, encoder-decoder, and decoder-only systems reach comparable maximum performance on both subtasks. The team also gradually released a novel set of 6,644 manually labeled food-incident reports (under a CC BY-NC-SA 4.0 license).

Link: https://arxiv.org/abs/2503.19800
Authors: Korbinian Randl,John Pavlopoulos,Aron Henriksson,Tony Lindgren,Juli Bakagianni
Affiliations: Stockholm University; Athens University of Economics and Business; Archimedes, Athena Research Center; Agroknow
Categories: Computation and Language (cs.CL)
Comments: Under review for SemEval 2025

Abstract:In this challenge, we explored text-based food hazard prediction with long tail distributed classes. The task was divided into two subtasks: (1) predicting whether a web text implies one of ten food-hazard categories and identifying the associated food category, and (2) providing a more fine-grained classification by assigning a specific label to both the hazard and the product. Our findings highlight that large language model-generated synthetic data can be highly effective for oversampling long-tail distributions. Furthermore, we find that fine-tuned encoder-only, encoder-decoder, and decoder-only systems achieve comparable maximum performance across both subtasks. During this challenge, we gradually released (under CC BY-NC-SA 4.0) a novel set of 6,644 manually labeled food-incident reports.

[NLP-7] Gemma 3 Technical Report

[Quick Read]: This paper introduces Gemma 3, a new generation of lightweight open multimodal models extending the Gemma family. The problem it addresses is how to add vision understanding, wider language coverage, and longer context (at least 128K tokens) while keeping the parameter count modest (1 to 27 billion). To counter the KV-cache memory blow-up caused by long context, the key solution is an architectural change: increasing the ratio of local to global attention layers and keeping the span of local attention short. Trained with distillation, Gemma 3 outperforms Gemma 2 in both pre-trained and instruction-tuned versions. A novel post-training recipe notably improves math, chat, instruction-following, and multilingual abilities, making Gemma3-4B-IT competitive with the much larger Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro. All models are released to the community.

Link: https://arxiv.org/abs/2503.19786
Authors: Gemma Team:Aishwarya Kamath,Johan Ferret,Shreya Pathak,Nino Vieillard,Ramona Merhej,Sarah Perrin,Tatiana Matejovicova,Alexandre Ramé,Morgane Rivière,Louis Rouillard,Thomas Mesnard,Geoffrey Cideron,Jean-bastien Grill,Sabela Ramos,Edouard Yvinec,Michelle Casbon,Etienne Pot,Ivo Penchev,Gaël Liu,Francesco Visin,Kathleen Kenealy,Lucas Beyer,Xiaohai Zhai,Anton Tsitsulin,Robert Busa-Fekete,Alex Feng,Noveen Sachdeva,Benjamin Coleman,Yi Gao,Basil Mustafa,Iain Barr,Emilio Parisotto,David Tian,Matan Eyal,Colin Cherry,Jan-Thorsten Peter,Danila Sinopalnikov,Surya Bhupatiraju,Rishabh Agarwal,Mehran Kazemi,Dan Malkin,Ravin Kumar,David Vilar,Idan Brusilovsky,Jiaming Luo,Andreas Steiner,Abe Friesen,Abhanshu Sharma,Abheesht Sharma,Adi Mayrav Gilady,Adrian Goedeckemeyer,Alaa Saade,Alex Feng,Alexander Kolesnikov,Alexei Bendebury,Alvin Abdagic,Amit Vadi,András György,André Susano Pinto,Anil Das,Ankur Bapna,Antoine Miech,Antoine Yang,Antonia Paterson,Ashish Shenoy,Ayan Chakrabarti,Bilal Piot,Bo Wu,Bobak Shahriari,Bryce Petrini,Charlie Chen,Charline Le Lan,Christopher A. Choquette-Choo,CJ Carey,Cormac Brick,Daniel Deutsch,Danielle Eisenbud,Dee Cattle,Derek Cheng,Dimitris Paparas,Divyashree Shivakumar Sreepathihalli,Doug Reid,Dustin Tran,Dustin Zelle,Eric Noland,Erwin Huizenga,Eugene Kharitonov,Frederick Liu,Gagik Amirkhanyan,Glenn Cameron,Hadi Hashemi,Hanna Klimczak-Plucińska,Harman Singh,Harsh Mehta,Harshal Tushar Lehri,Hussein Hazimeh,Ian Ballantyne,Idan Szpektor,Ivan Nardini
Affiliations: Google DeepMind
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context - at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span on local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community.
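The KV-cache fix is easiest to see as a layer schedule: most layers attend over a short sliding window (so their cache size is bounded by the window, not the context length), with an occasional global layer. A sketch; the 5:1 ratio and 1024-token window are illustrative values, not necessarily Gemma 3's exact configuration:

```python
def attention_schedule(num_layers: int, local_per_global: int = 5,
                       window: int = 1024):
    """Build an interleaved attention pattern: every (local_per_global + 1)-th
    layer attends globally; the rest use a short sliding window, so the KV
    cache of most layers stays bounded regardless of context length."""
    schedule = []
    for i in range(num_layers):
        if (i + 1) % (local_per_global + 1) == 0:
            schedule.append(("global", None))
        else:
            schedule.append(("local", window))
    return schedule
```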

[NLP-8] Writing as a testbed for open ended agents

[Quick Read]: This paper studies how large language models (LLMs) perform on open-ended tasks, specifically their ability to act as collaborative co-writers that autonomously suggest and implement text improvements. Writing, with its vast solution space and clearly subjective evaluation criteria, serves as the testbed for studying how LLMs succeed when success lacks a clear, objective definition, demanding broad exploration and adaptable strategies. The key is the analysis of three prominent models (Gemini 1.5 Pro, Claude 3.5 Sonnet, and GPT-4o), focusing on how their action diversity, human alignment, and iterative-improvement capability affect overall performance. The work thereby establishes a benchmarking framework for autonomous writing agents and highlights fundamental challenges and potential solutions for building systems that excel across diverse open-ended domains.

Link: https://arxiv.org/abs/2503.19711
Authors: Sian Gooding,Lucia Lopez-Rivilla,Edward Grefenstette
Affiliations: Google DeepMind
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Open-ended tasks are particularly challenging for LLMs due to the vast solution space, demanding both expansive exploration and adaptable strategies, especially when success lacks a clear, objective definition. Writing, with its vast solution space and subjective evaluation criteria, provides a compelling testbed for studying such problems. In this paper, we investigate the potential of LLMs to act as collaborative co-writers, capable of suggesting and implementing text improvements autonomously. We analyse three prominent LLMs - Gemini 1.5 Pro, Claude 3.5 Sonnet, and GPT-4o - focusing on how their action diversity, human alignment, and iterative improvement capabilities impact overall performance. This work establishes a framework for benchmarking autonomous writing agents and, more broadly, highlights fundamental challenges and potential solutions for building systems capable of excelling in diverse open-ended domains.

[NLP-9] Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models

[Quick Read]: This paper addresses the weak spatial reasoning of existing vision-language models (VLMs): the spatial components in current benchmarks fail to isolate spatial reasoning from related tasks such as object detection or semantic comprehension. The paper takes a multi-faceted approach: it first delineates the core elements of spatial reasoning (spatial relations, orientation and navigation, mental rotation, and spatial visualization) and then assesses models on both synthetic and real-world images, bridging controlled and naturalistic settings. Analyzing 13 state-of-the-art VLMs reveals profound shortcomings: average accuracy across the 13 models approximates random chance, highlighting spatial reasoning as a persistent obstacle. The work both exposes the pressing need to improve spatial reasoning in VLMs and lays a solid foundation for future exploration.

Link: https://arxiv.org/abs/2503.19707
Authors: Ilias Stogiannidis,Steven McDonagh,Sotirios A. Tsaftaris
Affiliations: The University of Edinburgh
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 8 main pages, 4 pages Appendix, 5 figures

Abstract:Vision-Language Models (VLMs) have recently emerged as powerful tools, excelling in tasks that integrate visual and textual comprehension, such as image captioning, visual question answering, and image-text retrieval. However, existing benchmarks for VLMs include spatial components, which often fail to isolate spatial reasoning from related tasks such as object detection or semantic comprehension. In this paper, we address these deficiencies with a multi-faceted approach towards understanding spatial reasoning. Informed by the diverse and multi-dimensional nature of human spatial reasoning abilities, we present a detailed analysis that first delineates the core elements of spatial reasoning: spatial relations, orientation and navigation, mental rotation, and spatial visualization, and then assesses the performance of these models in both synthetic and real-world images, bridging controlled and naturalistic contexts. We analyze 13 state-of-the-art Vision-Language Models, uncovering pivotal insights into their spatial reasoning performance. Our results reveal profound shortcomings in current VLMs, with average accuracy across the 13 models approximating random chance, highlighting spatial reasoning as a persistent obstacle. This work not only exposes the pressing need to advance spatial reasoning within VLMs but also establishes a solid platform for future exploration. Code available on GitHub (this https URL) and dataset available on HuggingFace (this https URL).

[NLP-10] HausaNLP at SemEval-2025 Task 2: Entity-Aware Fine-tuning vs. Prompt Engineering in Entity-Aware Machine Translation

[Quick Read]: This paper addresses the entity-aware machine translation (EA-MT) task: building translation models that accurately translate English sentences into multiple target languages, with a particular focus on named entities, which often pose difficulties for MT systems. The task covers 10 target languages with English as the source. The key is designing and optimizing translation systems that effectively capture and correctly translate named entities; the paper details the systems employed, reports results, and discusses insights and directions for improvement.

Link: https://arxiv.org/abs/2503.19702
Authors: Abdulhamid Abubakar,Hamidatu Abdulkadir,Ibrahim Rabiu Abdullahi,Abubakar Auwal Khalid,Ahmad Mustapha Wali,Amina Aminu Umar,Maryam Bala,Sani Abdullahi Sani,Ibrahim Said Ahmad,Shamsuddeen Hassan Muhammad,Idris Abdulmumin,Vukosi Marivate
Affiliations: HausaNLP; Kaduna State University; Ahmadu Bello University; University of the Witwatersrand; Northeastern University; Imperial College; Data Science for Social Impact, University of Pretoria
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This paper presents our findings for SemEval 2025 Task 2, a shared task on entity-aware machine translation (EA-MT). The goal of this task is to develop translation models that can accurately translate English sentences into target languages, with a particular focus on handling named entities, which often pose challenges for MT systems. The task covers 10 target languages with English as the source. In this paper, we describe the different systems we employed, detail our results, and discuss insights gained from our experiments.

[NLP-11] AdaptiVocab: Enhancing LLM Efficiency in Focused Domains through Lightweight Vocabulary Adaptation

[Quick Read]: This paper addresses the high computational overhead that general-purpose large language models (LLMs) incur in low-resource domains, especially the forward pass required at every step of autoregressive decoding. It takes a novel view of domain adaptation: reduce latency and compute by adapting the vocabulary to the focused domain of interest. The key is AdaptiVocab, an end-to-end vocabulary-adaptation method for improving LLM efficiency in low-resource domains. AdaptiVocab works with any tokenizer and architecture, replacing existing tokens with domain-specific n-gram-based tokens and thereby reducing the number of tokens needed for both input processing and output generation. It initializes new n-token embeddings as an exponentially weighted combination of existing embeddings and uses a lightweight fine-tuning phase that runs efficiently on a single GPU. Experiments show that AdaptiVocab cuts token usage by over 25% without sacrificing performance.

Link: https://arxiv.org/abs/2503.19693
Authors: Itay Nakash,Nitay Calderon,Eyal Ben David,Elad Hoffer,Roi Reichart
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have shown impressive versatility as general purpose models. However, their broad applicability comes at a high-cost computational overhead, particularly in auto-regressive decoding where each step requires a forward pass. In domain-specific settings, general-purpose capabilities are unnecessary and can be exchanged for efficiency. In this work, we take a novel perspective on domain adaptation, reducing latency and computational costs by adapting the vocabulary to focused domains of interest. We introduce AdaptiVocab, an end-to-end approach for vocabulary adaptation, designed to enhance LLM efficiency in low-resource domains. AdaptiVocab can be applied to any tokenizer and architecture, modifying the vocabulary by replacing tokens with domain-specific n-gram-based tokens, thereby reducing the number of tokens required for both input processing and output generation. AdaptiVocab initializes new n-token embeddings using an exponentially weighted combination of existing embeddings and employs a lightweight fine-tuning phase that can be efficiently performed on a single GPU. We evaluate two 7B LLMs across three niche domains, assessing efficiency, generation quality, and end-task performance. Our results show that AdaptiVocab reduces token usage by over 25% without compromising performance.
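The embedding initialization described above, an exponentially weighted combination of the constituent tokens' embeddings, can be sketched directly. The decay value and the choice to weight later tokens more heavily are assumptions, not details taken from the paper:

```python
import torch

def init_ngram_embedding(embeddings: torch.Tensor, token_ids: list[int],
                         decay: float = 0.5) -> torch.Tensor:
    """Initialize an embedding for a new n-gram token as an exponentially
    weighted average of its constituent tokens' embeddings.

    embeddings: (V, D) existing embedding table; token_ids: ids of the
    tokens the n-gram replaces. Later tokens get higher weight here."""
    n = len(token_ids)
    weights = torch.tensor([decay ** (n - 1 - i) for i in range(n)])
    weights = weights / weights.sum()
    vecs = embeddings[torch.tensor(token_ids)]       # (n, D)
    return (weights.unsqueeze(1) * vecs).sum(dim=0)  # (D,)
```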

[NLP-12] A multitask transformer to sign language translation using motion gesture primitives

[Quick Read]: This paper addresses the major social gap faced by the deaf community in communicating effectively with hearing society, in particular the challenge that sign language, the community's main communication tool, has no formal written form. The work focuses on automatic translation between spatiotemporal sign-language expressions and natural-language text, the central open problem. Most existing approaches use encoder-decoder architectures with attention modules to strengthen non-linear correspondences, but they typically require complex training and architectural schemes to achieve reasonable predictions and remain limited by the redundant background information in video sequences.

The key contribution is a multitask Transformer architecture that includes a gloss learning representation. By introducing a dense motion representation, it strengthens gesture features and incorporates kinematic information, a key component of sign language. This representation discards background information, exploits the geometry of the signs, and adds spatiotemporal representations that help align gestures with glosses as an intermediate textual representation. Experiments show BLEU-4 scores of 72.64% (split 1) and 14.64% (split 2) on the CoL-SLTD dataset, outperforming the state of the art, and a competitive BLEU-4 of 11.58% on the RWTH-PHOENIX-Weather 2014 T dataset. The core novelty is thus that gloss learning combined with dense motion representations improves sign-language translation while reducing background noise.

Link: https://arxiv.org/abs/2503.19668
Authors: Fredy Alejandro Mendoza López,Jefferson Rodriguez,Fabio Martínez
Affiliations: Biomedical Imaging, Vision and Learning Laboratory (BIVL2ab); Universidad Industrial de Santander, Bucaramanga (UIS); Colombia
Categories: Computation and Language (cs.CL)
Comments: 32 pages, 10 tables, 13 figures

Abstract:The absence of effective communication for the deaf population represents the main social gap in this community. Furthermore, sign language, the main communication tool of deaf people, is unlettered, i.e., there is no formal written representation. In consequence, the main challenge today is the automatic translation between spatiotemporal sign representations and natural text language. Recent approaches are based on encoder-decoder architectures, where the most relevant strategies integrate attention modules to enhance non-linear correspondences; besides, many of these approximations require complex training and architectural schemes to achieve reasonable predictions, because of the absence of intermediate text projections. However, they are still limited by the redundant background information of the video sequences. This work introduces a multitask transformer architecture that includes a gloss learning representation to achieve a more suitable translation. The proposed approach also includes a dense motion representation that enhances gestures and includes kinematic information, a key component in sign language. From this representation it is possible to avoid background information and exploit the geometry of the signs; in addition, it includes spatiotemporal representations that facilitate the alignment between gestures and glosses as an intermediate textual representation. The proposed approach outperforms the state of the art evaluated on the CoL-SLTD dataset, achieving a BLEU-4 of 72.64% in split 1, and a BLEU-4 of 14.64% in split 2. Additionally, the strategy was validated on the RWTH-PHOENIX-Weather 2014 T dataset, achieving a competitive BLEU-4 of 11.58%.

[NLP-13] HausaNLP at SemEval-2025 Task 3: Towards a Fine-Grained Model-Aware Hallucination Detection

[Quick Read]: This paper targets the identification of hallucinations and related overgeneration errors in multilingual large language models (LLMs), specifically detecting which text spans in LLM outputs constitute hallucinations across 14 languages. The key is a nuanced, model-aware approach using natural language inference (NLI): a ModernBERT model fine-tuned on a synthetic dataset of 400 samples, achieving an Intersection over Union (IoU) of 0.032 and a correlation score of 0.422. The model's confidence scores thus correlate moderately with the actual presence of hallucinations, but the low IoU indicates little overlap between predicted hallucination spans and the gold annotations, reflecting how intricate and context-dependent hallucination detection is.

Link: https://arxiv.org/abs/2503.19650
Authors: Maryam Bala,Amina Imam Abubakar,Abdulhamid Abubakar,Abdulkadir Shehu Bichi,Hafsa Kabir Ahmad,Sani Abdullahi Sani,Idris Abdulmumin,Shamsuddeen Hassan Muhamad,Ibrahim Said Ahmad
Affiliations: HausaNLP; University of Abuja; Bayero University Kano; Data Science for Social Impact, University of Pretoria; Imperial College London; Northeastern University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper presents our findings of the Multilingual Shared Task on Hallucinations and Related Observable Overgeneration Mistakes, MU-SHROOM, which focuses on identifying hallucinations and related overgeneration errors in large language models (LLMs). The shared task involves detecting specific text spans that constitute hallucinations in the outputs generated by LLMs in 14 languages. To address this task, we aim to provide a nuanced, model-aware understanding of hallucination occurrences and severity in English. We used natural language inference and fine-tuned a ModernBERT model using a synthetic dataset of 400 samples, achieving an Intersection over Union (IoU) score of 0.032 and a correlation score of 0.422. These results indicate a moderately positive correlation between the model’s confidence scores and the actual presence of hallucinations. The IoU score indicates that our model has a relatively low overlap between the predicted hallucination span and the truth annotation. The performance is unsurprising, given the intricate nature of hallucination detection. Hallucinations often manifest subtly, relying on context, making pinpointing their exact boundaries formidable.
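For reference, the IoU reported above can be computed over character-level spans as follows; representing spans as half-open (start, end) index ranges is an assumption about the annotation format:

```python
def chars(spans):
    """Expand (start, end) spans, end exclusive, into a set of character indices."""
    return {i for s, e in spans for i in range(s, e)}

def span_iou(pred_spans, gold_spans):
    """Intersection over Union between predicted and gold hallucination spans."""
    pred, gold = chars(pred_spans), chars(gold_spans)
    union = pred | gold
    return len(pred & gold) / len(union) if union else 1.0
```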

[NLP-14] Exploring Cultural Nuances in Emotion Perception Across 15 African Languages

[Quick Read]: This paper addresses the understudied question of how emotions are expressed in African languages, with the goal of enabling effective emotion-detection tools for them. The key is a cross-linguistic analysis of emotion expression in 15 African languages along four dimensions: text length, sentiment polarity, emotion co-occurrence, and intensity variation. The analysis reveals language-specific patterns of emotional expression, underscores the need for language-specific approaches to emotion detection, and identifies opportunities for transfer learning across related languages.

Link: https://arxiv.org/abs/2503.19642
Authors: Ibrahim Said Ahmad,Shiran Dudy,Tadesse Destaw Belay,Idris Abdulmumin,Seid Muhie Yimam,Shamsuddeen Hassan Muhammad,Kenneth Church
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Understanding how emotions are expressed across languages is vital for building culturally-aware and inclusive NLP systems. However, emotion expression in African languages is understudied, limiting the development of effective emotion detection tools in these languages. In this work, we present a cross-linguistic analysis of emotion expression in 15 African languages. We examine four key dimensions of emotion representation: text length, sentiment polarity, emotion co-occurrence, and intensity variations. Our findings reveal diverse language-specific patterns in emotional expression – with Somali texts typically longer, while others like IsiZulu and Algerian Arabic show more concise emotional expression. We observe a higher prevalence of negative sentiment in several Nigerian languages compared to lower negativity in languages like IsiXhosa. Further, emotion co-occurrence analysis demonstrates strong cross-linguistic associations between specific emotion pairs (anger-disgust, sadness-fear), suggesting universal psychological connections. Intensity distributions show multimodal patterns with significant variations between language families; Bantu languages display similar yet distinct profiles, while Afroasiatic languages and Nigerian Pidgin demonstrate wider intensity ranges. These findings highlight the need for language-specific approaches to emotion detection while identifying opportunities for transfer learning across related languages.

[NLP-15] 1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training

[Quick Read]: This paper presents AM-DeepSeek-R1-Distilled, a large-scale, high-quality dataset of reasoning problems built to advance reasoning-oriented large language models (LLMs). The central problems are avoiding test-set contamination and producing challenging, diverse reasoning problems. The solution collects high-quality problems from many open-source datasets, applies semantic deduplication and careful cleaning, and ensures that all responses are distilled from verified reasoning models (predominantly DeepSeek-R1). Verification is task-specific: math problems are checked against reference answers, code problems against test cases, and other tasks with a reward model. The key is this rigorous data-processing and verification pipeline, which markedly raises data quality; the AM-Distill-Qwen models trained on this data perform strongly on multiple benchmarks, confirming the approach's effectiveness.

Link: https://arxiv.org/abs/2503.19633
Authors: Han Zhao,Haotian Wang,Yiping Peng,Sitong Zhao,Xiaoyu Tian,Shuaiting Chen,Yunjie Ji,Xiangang Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The AM-DeepSeek-R1-Distilled is a large-scale dataset with thinking traces for general reasoning tasks, composed of high-quality and challenging reasoning problems. These problems are collected from a multitude of open-source datasets, subjected to semantic deduplication and meticulous cleaning to eliminate test set contamination. All responses within the dataset are distilled from reasoning models (predominantly DeepSeek-R1) and have undergone rigorous verification procedures. Mathematical problems are validated by checking against reference answers, code problems are verified using test cases, and other tasks are evaluated with the aid of a reward model. The AM-Distill-Qwen-32B model, which was trained through only simple Supervised Fine-Tuning (SFT) using this batch of data, outperformed the DeepSeek-R1-Distill-Qwen-32B model on four benchmarks: AIME2024, MATH-500, GPQA-Diamond, and LiveCodeBench. Additionally, the AM-Distill-Qwen-72B model surpassed the DeepSeek-R1-Distill-Llama-70B model on all benchmarks as well. We are releasing these 1.4 million problems and their corresponding responses to the research community with the objective of fostering the development of powerful reasoning-oriented Large Language Models (LLMs). The dataset was published at this https URL.
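The task-specific verification described in the abstract can be sketched as a small dispatcher. All field names, the normalization, and the reward threshold are illustrative assumptions; a real pipeline would also sandbox code execution:

```python
def verify(sample: dict) -> bool:
    """Route a distilled response to the matching verifier (sketch)."""
    if sample["task"] == "math":
        # Compare against the reference answer after light normalization.
        norm = lambda s: s.strip().replace(" ", "").lower()
        return norm(sample["model_answer"]) == norm(sample["reference"])
    if sample["task"] == "code":
        # Run the generated function against the provided test cases.
        scope: dict = {}
        exec(sample["model_code"], scope)   # no sandboxing: sketch only
        fn = scope[sample["entry_point"]]
        return all(fn(*args) == expected for args, expected in sample["tests"])
    # Other tasks: fall back to a reward-model score threshold (assumed).
    return sample.get("reward_score", 0.0) > 0.5
```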

[NLP-16] Lean Formalization of Generalization Error Bound by Rademacher Complexity

[Quick Read]: This paper formalizes the generalization error bound via Rademacher complexity in the Lean 4 theorem prover, quantifying the gap between a learning machine's performance on training data and on unseen test data. Traditional tools such as PAC learning and VC dimension are hard to apply to settings like deep learning and kernel methods, whereas Rademacher complexity, which measures the complexity of the hypothesis class, applies broadly across machine learning. The key is the formalization of core concepts such as the empirical and population Rademacher complexities, together with rigorous formal proofs of McDiarmid's inequality, Hoeffding's lemma, and the symmetrization argument, establishing a systematic path to the generalization error bound.

Link: https://arxiv.org/abs/2503.19605
Authors: Sho Sonoda,Kazumi Kasaura,Yuma Mizuno,Kei Tsukamoto,Naoto Onda
Affiliations: RIKEN AIP; OMRON SINIC X Corporation; University College Dublin; The University of Tokyo; AutoRes
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Statistics Theory (math.ST)
Comments:

Abstract:We formalize the generalization error bound using Rademacher complexity in the Lean 4 theorem prover. Generalization error quantifies the gap between a learning machine’s performance on given training data versus unseen test data, and Rademacher complexity serves as an estimate of this error based on the complexity of learning machines, or hypothesis class. Unlike traditional methods such as PAC learning and VC dimension, Rademacher complexity is applicable across diverse machine learning scenarios including deep learning and kernel methods. We formalize key concepts and theorems, including the empirical and population Rademacher complexities, and establish generalization error bounds through formal proofs of McDiarmid’s inequality, Hoeffding’s lemma, and symmetrization arguments.
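For reference, the classical statement being formalized: for a function class $\mathcal{F}$ with values in $[0,1]$ and an i.i.d. sample $Z_1,\dots,Z_n$, with probability at least $1-\delta$,

```latex
\sup_{f \in \mathcal{F}}
  \left( \mathbb{E}[f(Z)] - \frac{1}{n}\sum_{i=1}^{n} f(Z_i) \right)
  \;\le\; 2\,\mathfrak{R}_n(\mathcal{F})
  \;+\; \sqrt{\frac{\log(1/\delta)}{2n}},
\qquad
\mathfrak{R}_n(\mathcal{F})
  = \mathbb{E}\!\left[ \sup_{f \in \mathcal{F}}
      \frac{1}{n}\sum_{i=1}^{n} \sigma_i f(Z_i) \right]
```

where $\sigma_1,\dots,\sigma_n$ are independent Rademacher (random sign) variables. The proof follows exactly the route named in the abstract: symmetrization plus McDiarmid's inequality. The precise constants in the paper's Lean development may differ slightly.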

[NLP-17] The Greatest Good Benchmark: Measuring LLMs' Alignment with Utilitarian Moral Dilemmas

[Quick Read]: This paper asks how to design language models (LMs) that are beneficial to humanity and free from harm, maximizing the well-being of all persons. The key solution is the Greatest Good Benchmark, which evaluates the moral judgments of large language models (LLMs) on utilitarian dilemmas, revealing their internal "artificial moral compass". The analysis of 15 diverse LLMs finds consistently encoded moral preferences that diverge from established moral theories and from lay population moral standards, most notably a marked preference for impartial beneficence and a rejection of instrumental harm, offering important insight into the moral alignment of LLMs.

Link: https://arxiv.org/abs/2503.19598
Authors: Giovanni Franco Gabriel Marraffini,Andrés Cotton,Noe Fabian Hsueh,Axel Fridman,Juan Wisznia,Luciano Del Corro
Affiliations: Universidad de Buenos Aires; Universidad Torcuato Di Tella; Lumina Labs; Facultad de Ciencias Exactas y Naturales; Escuela de Negocios, Laboratorio de Neurociencia
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The question of how to make decisions that maximise the well-being of all persons is very relevant to design language models that are beneficial to humanity and free from harm. We introduce the Greatest Good Benchmark to evaluate the moral judgments of LLMs using utilitarian dilemmas. Our analysis across 15 diverse LLMs reveals consistently encoded moral preferences that diverge from established moral theories and lay population moral standards. Most LLMs have a marked preference for impartial beneficence and rejection of instrumental harm. These findings showcase the ‘artificial moral compass’ of LLMs, offering insights into their moral alignment.

[NLP-18] Distinct social-linguistic processing between humans and large audio-language models: Evidence from model-brain alignment

[Quick Read]: This study asks whether large audio-language models (LALMs) process speaker-contextualized language through mechanisms that parallel human cognition. It focuses on the models' sensitivity to speaker-content incongruency, covering both social stereotype violations (e.g., a man claiming to regularly get manicures) and biological knowledge violations (e.g., a man claiming to be pregnant).

The key is to compare the processing patterns of two LALMs (Qwen2-Audio and Ultravox 0.5) with human EEG responses, using surprisal and entropy metrics from the models to measure sensitivity to speaker-content incongruency. Qwen2-Audio showed increased surprisal for speaker-incongruent content, and its surprisal values significantly predicted human N400 responses, whereas Ultravox 0.5 showed limited sensitivity to speaker characteristics. Neither model reproduced the human distinction between social violations (eliciting N400 effects) and biological violations (eliciting P600 effects). These results reveal both the potential and the limitations of current LALMs in processing speaker-contextualized language, and point to differences between human and LALM social-linguistic processing mechanisms.

Link: https://arxiv.org/abs/2503.19586
Authors: Hanlin Wu,Xufeng Duan,Zhenguang Cai
Affiliations: Department of Linguistics and Modern Languages, The Chinese University of Hong Kong; Brain and Mind Institute, The Chinese University of Hong Kong
Categories: Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
Comments: Accepted by the 14th edition of the Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2025)

Abstract:Voice-based AI development faces unique challenges in processing both linguistic and paralinguistic information. This study compares how large audio-language models (LALMs) and humans integrate speaker characteristics during speech comprehension, asking whether LALMs process speaker-contextualized language in ways that parallel human cognitive mechanisms. We compared two LALMs’ (Qwen2-Audio and Ultravox 0.5) processing patterns with human EEG responses. Using surprisal and entropy metrics from the models, we analyzed their sensitivity to speaker-content incongruency across social stereotype violations (e.g., a man claiming to regularly get manicures) and biological knowledge violations (e.g., a man claiming to be pregnant). Results revealed that Qwen2-Audio exhibited increased surprisal for speaker-incongruent content and its surprisal values significantly predicted human N400 responses, while Ultravox 0.5 showed limited sensitivity to speaker characteristics. Importantly, neither model replicated the human-like processing distinction between social violations (eliciting N400 effects) and biological violations (eliciting P600 effects). These findings reveal both the potential and limitations of current LALMs in processing speaker-contextualized language, and suggest differences in social-linguistic processing mechanisms between humans and LALMs.

[NLP-19] Multi-agent Application System in Office Collaboration Scenarios

[Quick Read]: This paper aims to improve office-collaboration efficiency and work quality, particularly addressing the interaction challenges that multi-agent systems face in complex, dynamic environments. The key is a multi-agent application system integrating artificial intelligence, machine learning, and natural language processing, built around an agent architecture that separates the Plan from the Solver; techniques such as multi-turn query rewriting and business tool retrieval strengthen the agents' multi-intent and multi-turn dialogue capabilities. The system further improves decision quality through personalized collaboration support and data-analysis tools, and implements task allocation, progress monitoring, and information sharing. Experiments and evaluations validate its effectiveness, with particularly strong performance in query understanding, task planning, and tool calling in real business applications.

Link: https://arxiv.org/abs/2503.19584
Authors: Songtao Sun,Jingyi Li,Yuanfei Dong,Haoguang Liu,Chenxin Xu,Fuyang Li,Qiang Liu
Affiliations: Kingsoft Office Software Inc.
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: Technical report

Abstract:This paper introduces a multi-agent application system designed to enhance office collaboration efficiency and work quality. The system integrates artificial intelligence, machine learning, and natural language processing technologies, achieving functionalities such as task allocation, progress monitoring, and information sharing. The agents within the system are capable of providing personalized collaboration support based on team members’ needs and incorporate data analysis tools to improve decision-making quality. The paper also proposes an intelligent agent architecture that separates Plan and Solver, and through techniques such as multi-turn query rewriting and business tool retrieval, it enhances the agent’s multi-intent and multi-turn dialogue capabilities. Furthermore, the paper details the design of tools and multi-turn dialogue in the context of office collaboration scenarios, and validates the system’s effectiveness through experiments and evaluations. Ultimately, the system has demonstrated outstanding performance in real business applications, particularly in query understanding, task planning, and tool calling. Looking forward, the system is expected to play a more significant role in addressing complex interaction issues within dynamic environments and large-scale multi-agent systems.

[NLP-20] Context-Efficient Retrieval with Factual Decomposition NAACL2025

[Quick Read]: This paper addresses how to use information from an external corpus effectively to improve question-answering performance under constrained context budgets. The key is to pre-process the external corpus into semi-structured "atomic facts". This makes retrieval more efficient and, by limiting the amount of retrieved text, shrinks the context and improves inference efficiency, with especially clear gains on question-answering tasks.

Link: https://arxiv.org/abs/2503.19574
Authors: Yanhong Li,David Yunis,David McAllester,Jiawei Zhou
Affiliations: University of Chicago / Toyota Technological Institute at Chicago; Toyota Technological Institute at Chicago; Stony Brook University
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: NAACL 2025 Main Conference

Abstract:There has recently been considerable interest in incorporating information retrieval into large language models (LLMs). Retrieval from a dynamically expanding external corpus of text allows a model to incorporate current events and can be viewed as a form of episodic memory. Here we demonstrate that pre-processing the external corpus into semi-structured “atomic facts” makes retrieval more efficient. More specifically, we demonstrate that our particular form of atomic facts improves performance on various question answering tasks when the amount of retrieved text is limited. Limiting the amount of retrieval reduces the size of the context and improves inference efficiency.
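A hedged sketch of the decompose-then-retrieve idea: facts are extracted and embedded offline, and only the top-k facts enter the context at question time. `extract_facts` and `embed` are hypothetical helpers, not the paper's API:

```python
def build_fact_index(corpus, extract_facts, embed):
    """Offline: split each document into short atomic facts and embed them."""
    facts = [fact for doc in corpus for fact in extract_facts(doc)]
    return facts, [embed(f) for f in facts]

def retrieve_facts(question, facts, fact_vecs, embed, k=5):
    """Online: return the k facts most similar to the question, yielding a
    much smaller context than retrieving whole passages."""
    q = embed(question)
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    ranked = sorted(zip(fact_vecs, facts),
                    key=lambda pair: dot(q, pair[0]), reverse=True)
    return [f for _, f in ranked[:k]]
```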

[NLP-21] Scaling Laws of Synthetic Data for Language Models

[Quick Read]: This paper asks whether synthetic data scales predictably for LLM pre-training, i.e., whether it can substitute for the rapidly depleting supply of raw web data. The key is SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets by automatically extracting and recombining high-level concepts across multiple documents with a graph algorithm. This enables a systematic study of the scaling laws of synthetic data, showing that it reliably follows the rectified scaling law and that SynthLLM outperforms existing synthetic data generation and augmentation methods in both performance and scalability.

Link: https://arxiv.org/abs/2503.19551
Authors: Zeyu Qin,Qingxiu Dong,Xingxing Zhang,Li Dong,Xiaolong Huang,Ziyi Yang,Mahmoud Khademi,Dongdong Zhang,Hany Hassan Awadalla,Yi R. Fung,Weizhu Chen,Minhao Cheng,Furu Wei
Affiliations: Microsoft; Hong Kong University of Science and Technology; Peking University; Pennsylvania State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) achieve strong performance across diverse tasks, largely driven by high-quality web data used in pre-training. However, recent studies indicate this data source is rapidly depleting. Synthetic data emerges as a promising alternative, but it remains unclear whether synthetic datasets exhibit predictable scalability comparable to raw pre-training data. In this work, we systematically investigate the scaling laws of synthetic data by introducing SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm. Key findings from our extensive mathematical experiments on SynthLLM include: (1) SynthLLM generates synthetic data that reliably adheres to the rectified scaling law across various model sizes; (2) Performance improvements plateau near 300B tokens; and (3) Larger models approach optimal performance with fewer training tokens. For instance, an 8B model peaks at 1T tokens, while a 3B model requires 4T. Moreover, comparisons with existing synthetic data generation and augmentation methods demonstrate that SynthLLM achieves superior performance and scalability. Our findings highlight synthetic data as a scalable and reliable alternative to organic pre-training corpora, offering a viable path toward continued improvement in model performance.

[NLP-22] FLEX: A Benchmark for Evaluating Robustness of Fairness in Large Language Models NAACL2025

[Quick Read]: This paper addresses the inadequate fairness evaluation of large language models (LLMs) under carefully crafted bias-inducing prompts: existing benchmarks may miss the intrinsic weakness that models can produce biased responses even under simple adversarial instructions. To fill this critical gap, the paper introduces FLEX (Fairness Benchmark in LLM under Extreme Scenarios), which tests whether LLMs can sustain fairness when exposed to prompts designed to induce bias. FLEX's key innovation is integrating prompts that amplify potential biases into the fairness assessment, probing robustness more thoroughly. Comparative experiments against existing benchmarks show that traditional evaluations may underestimate the inherent risks in models, underscoring the need for more stringent LLM evaluation benchmarks to guarantee safety and fairness.

Link: https://arxiv.org/abs/2503.19540
Authors: Dahyun Jung,Seungyoon Lee,Hyeonseok Moon,Chanjun Park,Heuiseok Lim
Affiliations: Korea University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to NAACL 2025 findings

Abstract:Recent advancements in Large Language Models (LLMs) have significantly enhanced interactions between users and models. These advancements concurrently underscore the need for rigorous safety evaluations due to the manifestation of social biases, which can lead to harmful societal impacts. Despite these concerns, existing benchmarks may overlook the intrinsic weaknesses of LLMs, which can generate biased responses even with simple adversarial instructions. To address this critical gap, we introduce a new benchmark, Fairness Benchmark in LLM under Extreme Scenarios (FLEX), designed to test whether LLMs can sustain fairness even when exposed to prompts constructed to induce bias. To thoroughly evaluate the robustness of LLMs, we integrate prompts that amplify potential biases into the fairness assessment. Comparative experiments between FLEX and existing benchmarks demonstrate that traditional evaluations may underestimate the inherent risks in models. This highlights the need for more stringent LLM evaluation benchmarks to guarantee safety and fairness.

[NLP-23] DomainCQA: Crafting Expert-Level QA from Domain-Specific Charts

[Quick Read]: This paper addresses the fact that current Chart Question Answering (CQA) benchmarks mainly evaluate general-purpose capabilities and fail to capture domain-specific challenges. It proposes DomainCQA, a systematic methodology for constructing domain-specific CQA benchmarks, and validates it by building AstroChart, a CQA benchmark for astronomy. The key finding is that the main challenge for existing multimodal large language models (MLLMs) is chart reasoning and combining chart information with domain knowledge for deeper analysis and summarization, rather than domain-specific knowledge itself, exposing a critical gap in current benchmarks. By providing a scalable and rigorous framework, DomainCQA enables more precise assessment and improvement of MLLMs for domain-specific applications.

Link: https://arxiv.org/abs/2503.19498
Authors: Ling Zhong,Yujing Lu,Jing Yang,Weiming Li,Peng Wei,Yongheng Wang,Manni Duan,Qing Zhang
Affiliations: Zhejiang Lab; National Astronomical Observatory, Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 6 figures

Abstract:Chart Question Answering (CQA) benchmarks are essential for evaluating the capability of Multimodal Large Language Models (MLLMs) to interpret visual data. However, current benchmarks focus primarily on the evaluation of general-purpose CQA but fail to adequately capture domain-specific challenges. We introduce DomainCQA, a systematic methodology for constructing domain-specific CQA benchmarks, and demonstrate its effectiveness by developing AstroChart, a CQA benchmark in the field of astronomy. Our evaluation shows that chart reasoning and combining chart information with domain knowledge for deeper analysis and summarization, rather than domain-specific knowledge, pose the primary challenge for existing MLLMs, highlighting a critical gap in current benchmarks. By providing a scalable and rigorous framework, DomainCQA enables more precise assessment and improvement of MLLMs for domain-specific applications.

[NLP-24] KSHSeek: Data-Driven Approaches to Mitigating and Detecting Knowledge-Shortcut Hallucinations in Generative Models

[Quick Read]: This paper targets factual hallucinations in large language models (LLMs) on natural language generation (NLG) tasks that arise from knowledge shortcuts, i.e., spurious correlations in the data that persist even in correct, defect-free data. The key solution has two parts: a high-similarity pruning algorithm at the data-preprocessing stage that reduces spurious correlations in the data, and a dedicated detection method for knowledge-shortcut hallucinations used to evaluate the mitigation strategy. Experiments show the approach effectively reduces knowledge-shortcut hallucinations, especially in fine-tuning tasks, without hurting question-answering performance, introducing a new paradigm for mitigating specific hallucination types in generative models and improving their robustness and reliability in real-world applications.

Link: https://arxiv.org/abs/2503.19482
Authors: Zhiwei Wang,Zhongxin Liu,Ying Li,Hongyu Sun,Meng Xu,Yuqing Zhang
Affiliations: University of Chinese Academy of Sciences; Xidian University; Hainan University; University of Waterloo
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 34 figures

Abstract:The emergence of large language models (LLMs) has significantly advanced the development of natural language processing (NLP), especially in text generation tasks like question answering. However, model hallucinations remain a major challenge in natural language generation (NLG) tasks due to their complex causes. We systematically expand on the causes of factual hallucinations from the perspective of knowledge shortcuts, analyzing hallucinations arising from correct and defect-free data and demonstrating that knowledge-shortcut hallucinations are prevalent in generative models. To mitigate this issue, we propose a high similarity pruning algorithm at the data preprocessing level to reduce spurious correlations in the data. Additionally, we design a specific detection method for knowledge-shortcut hallucinations to evaluate the effectiveness of our mitigation strategy. Experimental results show that our approach effectively reduces knowledge-shortcut hallucinations, particularly in fine-tuning tasks, without negatively impacting model performance in question answering. This work introduces a new paradigm for mitigating specific hallucination issues in generative models, enhancing their robustness and reliability in real-world applications.
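The abstract does not spell out the pruning algorithm, but a high-similarity filter over sentence embeddings is one plausible reading. A sketch under that assumption, with `embed` as a hypothetical embedding function and 0.95 as an arbitrary threshold:

```python
import numpy as np

def prune_high_similarity(texts, embed, threshold=0.95):
    """Greedily keep a text only if its embedding is not too similar to
    any already-kept text, thinning out near-duplicates that can act as
    spurious correlations in the training data."""
    kept, kept_vecs = [], []
    for t in texts:
        v = embed(t)
        v = v / np.linalg.norm(v)
        if all(float(v @ u) < threshold for u in kept_vecs):
            kept.append(t)
            kept_vecs.append(v)
    return kept
```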

[NLP-25] ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

[Quick Read]: This paper addresses the difficulty of integrating large language models (LLMs) with external search for complex multi-hop reasoning tasks. Whereas traditional approaches rely on supervised data for the reasoning steps, the key solution is ReSearch, a framework that trains LLMs via reinforcement learning to treat search operations as integral components of the reasoning chain, without any supervised data on reasoning steps. Text-based thinking guides when and how to search, and the search results in turn shape subsequent reasoning. Despite training on a single dataset, the models generalize strongly across benchmarks, and advanced reasoning behaviors such as reflection and self-correction emerge naturally during reinforcement learning.

Link: https://arxiv.org/abs/2503.19470
Authors: Mingyang Chen,Tianpeng Li,Haoze Sun,Yijie Zhou,Chenzheng Zhu,Fan Yang,Zenan Zhou,Weipeng Chen,Haofen Wang,Jeff Z. Pan,Wen Zhang,Huajun Chen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work in progress

Abstract:Large Language Models (LLMs) have shown remarkable capabilities in reasoning, exemplified by the success of OpenAI-o1 and DeepSeek-R1. However, integrating reasoning with external search processes remains challenging, especially for complex multi-hop questions requiring multiple retrieval steps. We propose ReSearch, a novel framework that trains LLMs to Reason with Search via reinforcement learning without using any supervised data on reasoning steps. Our approach treats search operations as integral components of the reasoning chain, where when and how to perform searches is guided by text-based thinking, and search results subsequently influence further reasoning. We train ReSearch on Qwen2.5-7B(-Instruct) and Qwen2.5-32B(-Instruct) models and conduct extensive experiments. Despite being trained on only one dataset, our models demonstrate strong generalizability across various benchmarks. Analysis reveals that ReSearch naturally elicits advanced reasoning capabilities such as reflection and self-correction during the reinforcement learning process.

[NLP-26] Enhancing Small Language Models for Cross-Lingual Generalized Zero-Shot Classification with Soft Prompt Tuning NAACL2025

[Quick Read]: This paper addresses the challenges of zero-shot classification (ZSC) in multilingual and low-resource settings, where existing methods rely on large labeled datasets or manually crafted language-specific prompts, limiting adaptability and effectiveness across languages. The key innovation is RoSPrompt, a lightweight, data-efficient method for training soft prompts that improves cross-lingual ZSC while ensuring robust generalization under data distribution shifts. RoSPrompt targets small multilingual pre-trained language models (PLMs), letting them leverage high-resource languages to boost performance in low-resource settings without extensive fine-tuning or heavy compute. Its core advantage lies in this flexibility and data efficiency, enabling effective knowledge transfer across languages and tasks.

Link: https://arxiv.org/abs/2503.19469
Authors: Fred Philippy,Siwen Guo,Cedric Lothritz,Jacques Klein,Tegawendé F. Bissyandé
Affiliations: Zortify Labs, Zortify S.A.; SnT, University of Luxembourg; Luxembourg Institute of Science and Technology (LIST)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Workshop on Language Models for Underserved Communities (co-located with NAACL 2025)

Abstract:In NLP, Zero-Shot Classification (ZSC) has become essential for enabling models to classify text into categories unseen during training, particularly in low-resource languages and domains where labeled data is scarce. While pretrained language models (PLMs) have shown promise in ZSC, they often rely on large training datasets or external knowledge, limiting their applicability in multilingual and low-resource scenarios. Recent approaches leveraging natural language prompts reduce the dependence on large training datasets but struggle to effectively incorporate available labeled data from related classification tasks, especially when these datasets originate from different languages or distributions. Moreover, existing prompt-based methods typically rely on manually crafted prompts in a specific language, limiting their adaptability and effectiveness in cross-lingual settings. To address these challenges, we introduce RoSPrompt, a lightweight and data-efficient approach for training soft prompts that enhance cross-lingual ZSC while ensuring robust generalization across data distribution shifts. RoSPrompt is designed for small multilingual PLMs, enabling them to leverage high-resource languages to improve performance in low-resource settings without requiring extensive fine-tuning or high computational costs. We evaluate our approach on multiple multilingual PLMs across datasets covering 106 languages, demonstrating strong cross-lingual transfer performance and robust generalization capabilities over unseen classes.

[NLP-27] DeCAP: Context-Adaptive Prompt Generation for Debiasing Zero-shot Question Answering in Large Language Models NAACL2025

[Quick Read]: This paper tackles the problem that large language models (LLMs), while strong at zero-shot question answering (QA), expose biases in their internal knowledge when facing socially sensitive questions, degrading performance. Existing zero-shot methods are efficient but ignore context and fail to stop bias from propagating into the answers. The key of the proposed DeCAP (debiasing via Context-Adaptive Prompt generation) is a Question Ambiguity Detection module that takes appropriate debiasing actions based on the context, and a Neutral Answer Guidance Generation module that suppresses the LLMs' tendency toward subjective judgments, minimizing the propagation of bias from their internal knowledge. Experiments across eight LLMs show that DeCAP achieves state-of-the-art zero-shot debiased QA performance, demonstrating its efficacy in improving the fairness and accuracy of LLMs in diverse QA settings.

Link: https://arxiv.org/abs/2503.19426
Authors: Suyoung Bae,YunSeok Choi,Jee-Hyong Lee
Affiliations: Sungkyunkwan University, South Korea
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Accepted to NAACL 2025 main. 20 pages, 3 figures

Abstract:While Large Language Models (LLMs) excel in zero-shot Question Answering (QA), they tend to expose biases in their internal knowledge when faced with socially sensitive questions, leading to a degradation in performance. Existing zero-shot methods are efficient but fail to consider context and prevent bias propagation in the answers. To address this, we propose DeCAP, a method for debiasing LLMs using Context-Adaptive Prompt Generation. DeCAP leverages a Question Ambiguity Detection to take appropriate debiasing actions based on the context and a Neutral Answer Guidance Generation that guides the LLMs to make objective judgments about the context, minimizing the propagation of bias from their internal knowledge. Our various experiments across eight LLMs show that DeCAP achieves state-of-the-art zero-shot debiased QA performance. This demonstrates DeCAP’s efficacy in enhancing the fairness and accuracy of LLMs in diverse QA settings.

[NLP-28] QUAD: Quantization and Parameter-Efficient Tuning of LLM with Activation Decomposition

[Quick Read]: This paper addresses the accuracy degradation that activation outliers cause when quantizing large language models (LLMs), particularly in medium-sized models such as Llama-3-8B. The key innovation is QUAD (Quantization with Activation Decomposition), a framework that uses singular value decomposition (SVD) to suppress activation outliers and enable effective 4-bit quantization. QUAD estimates activation singular vectors offline from calibration data to construct an orthogonal transformation matrix P that shifts outliers into additional full-precision dimensions while quantizing the remaining components to 4 bits. It also supports parameter-efficient fine-tuning through adaptable full-precision outlier weights, narrowing the accuracy gap between quantized and full-precision models. Experiments show 94%-96% accuracy under W4A4 quantization and 98% under W4A4/A8 with parameter-efficient fine-tuning.

Link: https://arxiv.org/abs/2503.19353
Authors: Yuxuan Hu,Xiaodong Chen,Cuiping Li,Hong Chen,Jing Zhang
Affiliations: Renmin University of China
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 18 pages, 8 figures, 8 tables

Abstract:Large Language Models (LLMs) excel in diverse applications but suffer inefficiency due to massive scale. While quantization reduces computational costs, existing methods degrade accuracy in medium-sized LLMs (e.g., Llama-3-8B) due to activation outliers. To address this, we propose QUAD (Quantization with Activation Decomposition), a framework leveraging Singular Value Decomposition (SVD) to suppress activation outliers for effective 4-bit quantization. QUAD estimates activation singular vectors offline using calibration data to construct an orthogonal transformation matrix P, shifting outliers to additional dimensions in full precision while quantizing rest components to 4-bit. Additionally, QUAD enables parameter-efficient fine-tuning via adaptable full-precision outlier weights, narrowing the accuracy gap between quantized and full-precision models. Experiments demonstrate that QUAD achieves 94% ~ 96% accuracy under W4A4 quantization and 98% accuracy with W4A4/A8 and parameter-efficient fine-tuning for Llama-3 and Qwen-2.5 models. Our code is available at this https URL.
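The construction of the transform P can be sketched from the abstract's description alone: SVD the calibration activations and use the singular-vector basis as the rotation, keeping the leading (outlier-heavy) directions in full precision. Shapes and the split point r are assumptions, not the paper's exact procedure:

```python
import torch

def build_quad_rotation(calib_acts: torch.Tensor) -> torch.Tensor:
    """calib_acts: (N, D) activations collected offline on calibration
    data, with N >= D assumed. Returns an orthogonal P whose columns are
    the activation singular vectors, ordered by singular value."""
    _, _, Vh = torch.linalg.svd(calib_acts, full_matrices=False)
    return Vh.t()  # (D, D)

# Usage sketch: rotate activations with x @ P; the first r coordinates
# (outlier-heavy directions) stay in full precision, while the remaining
# D - r rotated coordinates are quantized to 4-bit. r is an assumed choice.
```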

[NLP-29] Substance over Style: Evaluating Proactive Conversational Coaching Agents

[Quick Read]: This paper addresses the design and evaluation of multi-turn conversational coaching systems, a setting with initially unclear goals, subjective evaluation criteria, and mixed-initiative dialogue. The key is the implementation of five multi-turn coaching agents with distinct conversational styles, evaluated through a user study collecting first-person feedback on 155 conversations. Users are found to highly value core functionality, while stylistic components lacking core functionality are viewed negatively; comparing user feedback with third-party evaluations from health experts and a language model further reveals significant misalignment across evaluation approaches. These findings offer insights for designing and evaluating conversational coaching agents and, more broadly, for improving human-centered NLP applications.

链接: https://arxiv.org/abs/2503.19328
作者: Vidya Srinivas,Xuhai Xu,Xin Liu,Kumar Ayush,Isaac Galatzer-Levy,Shwetak Patel,Daniel McDuff,Tim Althoff
机构: Google Research; Paul G. Allen School of Computer Science & Engineering, University of Washington (保罗·G·艾伦计算机科学与工程学院,华盛顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While NLP research has made strides in conversational tasks, many approaches focus on single-turn responses with well-defined objectives or evaluation criteria. In contrast, coaching presents unique challenges with initially undefined goals that evolve through multi-turn interactions, subjective evaluation criteria, mixed-initiative dialogue. In this work, we describe and implement five multi-turn coaching agents that exhibit distinct conversational styles, and evaluate them through a user study, collecting first-person feedback on 155 conversations. We find that users highly value core functionality, and that stylistic components in absence of core components are viewed negatively. By comparing user feedback with third-person evaluations from health experts and an LM, we reveal significant misalignment across evaluation approaches. Our findings provide insights into design and evaluation of conversational coaching agents and contribute toward improving human-centered NLP applications.

[NLP-30] Iterative Hypothesis Generation for Scientific Discovery with Monte Carlo Nash Equilibrium Self-Refining Trees

【Quick Read】: This paper targets a fundamental challenge in scientific research: efficiently generating hypotheses that are both novel and empirically grounded. Traditional approaches rely on human intuition and domain expertise, while purely LLM-based methods struggle to produce hypotheses that are simultaneously innovative and reliable. The key is the Monte Carlo Nash Equilibrium Self-Refine Tree (MC-NEST), a framework that couples Monte Carlo Tree Search (MCTS) with Nash-equilibrium strategies to iteratively refine and validate hypotheses, dynamically balancing exploration and exploitation. MC-NEST uses adaptive sampling that prioritizes high-potential hypotheses while preserving diversity in the search space. Experiments across biomedicine, social science, and computer science show that MC-NEST outperforms state-of-the-art prompt-based methods at generating high-quality, empirically grounded hypotheses. The framework also supports structured human-AI collaboration, ensuring that LLMs augment rather than replace human creativity, and its design emphasizes transparency and human oversight for responsible AI use.

Link: https://arxiv.org/abs/2503.19309
Authors: Gollam Rabby, Diyana Muhammed, Prasenjit Mitra, Sören Auer
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Scientific hypothesis generation is a fundamentally challenging task in research, requiring the synthesis of novel and empirically grounded insights. Traditional approaches rely on human intuition and domain expertise, while purely large language model (LLM) based methods often struggle to produce hypotheses that are both innovative and reliable. To address these limitations, we propose the Monte Carlo Nash Equilibrium Self-Refine Tree (MC-NEST), a novel framework that integrates Monte Carlo Tree Search with Nash Equilibrium strategies to iteratively refine and validate hypotheses. MC-NEST dynamically balances exploration and exploitation through adaptive sampling strategies, which prioritize high-potential hypotheses while maintaining diversity in the search space. We demonstrate the effectiveness of MC-NEST through comprehensive experiments across multiple domains, including biomedicine, social science, and computer science. MC-NEST achieves average scores of 2.65, 2.74, and 2.80 (on a 1-3 scale) for novelty, clarity, significance, and verifiability metrics on the social science, computer science, and biomedicine datasets, respectively, outperforming state-of-the-art prompt-based methods, which achieve 2.36, 2.51, and 2.52 on the same datasets. These results underscore MC-NEST’s ability to generate high-quality, empirically grounded hypotheses across diverse domains. Furthermore, MC-NEST facilitates structured human-AI collaboration, ensuring that LLMs augment human creativity rather than replace it. By addressing key challenges such as iterative refinement and the exploration-exploitation balance, MC-NEST sets a new benchmark in automated hypothesis generation. Additionally, MC-NEST’s ethical design enables responsible AI use, emphasizing transparency and human supervision in hypothesis generation.
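
The abstract does not spell out the algorithm, but the MCTS backbone it names follows a standard select-expand-evaluate-backpropagate loop. The sketch below shows that loop with UCT selection; the `refine` and `score` callables (an LLM rewrite step and a rubric scorer) are hypothetical placeholders, and the paper's Nash-equilibrium selection layer is not reproduced here.

```python
import math

class Node:
    def __init__(self, hypothesis, parent=None):
        self.hypothesis, self.parent = hypothesis, parent
        self.children, self.visits, self.value = [], 0, 0.0

def uct(node, c=1.4):
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts_refine(seed, refine, score, iters=50):
    root = Node(seed)
    for _ in range(iters):
        node = root
        while node.children:                          # 1. select
            node = max(node.children, key=uct)
        child = Node(refine(node.hypothesis), node)   # 2. expand (LLM rewrite)
        node.children.append(child)
        reward = score(child.hypothesis)              # 3. evaluate (rubric)
        while child:                                  # 4. backpropagate
            child.visits += 1
            child.value += reward
            child = child.parent
    return max(root.children, key=lambda n: n.visits).hypothesis
```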

[NLP-31] Machine-assisted writing evaluation: Exploring pre-trained language models in analyzing argumentative moves

【Quick Read】: This paper addresses how to analyze argumentative moves in a learner corpus efficiently and accurately. Traditional approaches rely on qualitative analysis and manual coding, which limits both efficiency and generalizability. The key solution is to use pre-trained language models (PLMs), with BERT as one implementation, to automatically annotate six move types: claim, data, counter-claim, counter-data, rebuttal, and non-argument. The results show that PLMs analyze argumentative moves reliably (overall F1 of 0.743, surpassing existing models in the field) and that PLM-labeled moves effectively capture developmental patterns in learners' writing and predict writing quality. By making the evaluation of student writing faster and more accurate, the study underscores the transformative potential of integrating artificial intelligence into language education.

Link: https://arxiv.org/abs/2503.19279
Authors: Wenjuan Qin, Weiran Wang, Yuming Yang, Tao Gui
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:The study investigates the efficacy of pre-trained language models (PLMs) in analyzing argumentative moves in a longitudinal learner corpus. Prior studies on argumentative moves often rely on qualitative analysis and manual coding, limiting their efficiency and generalizability. The study aims to: 1) to assess the reliability of PLMs in analyzing argumentative moves; 2) to utilize PLM-generated annotations to illustrate developmental patterns and predict writing quality. A longitudinal corpus of 1643 argumentative texts from 235 English learners in China is collected and annotated into six move types: claim, data, counter-claim, counter-data, rebuttal, and non-argument. The corpus is divided into training, validation, and application sets annotated by human experts and PLMs. We use BERT as one of the implementations of PLMs. The results indicate a robust reliability of PLMs in analyzing argumentative moves, with an overall F1 score of 0.743, surpassing existing models in the field. Additionally, PLM-labeled argumentative moves effectively capture developmental patterns and predict writing quality. Over time, students exhibit an increase in the use of data and counter-claims and a decrease in non-argument moves. While low-quality texts are characterized by a predominant use of claims and data supporting only oneside position, mid- and high-quality texts demonstrate an integrative perspective with a higher ratio of counter-claims, counter-data, and rebuttals. This study underscores the transformative potential of integrating artificial intelligence into language education, enhancing the efficiency and accuracy of evaluating students’ writing. The successful application of PLMs can catalyze the development of educational technology, promoting a more data-driven and personalized learning environment that supports diverse educational needs.
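
For readers who want to reproduce the setup, classifying a sentence into the six move types is a standard sequence-classification task. A minimal sketch with Hugging Face Transformers follows; the checkpoint name and label order are assumptions, and the freshly initialized classification head must of course be fine-tuned on annotated moves before its predictions mean anything.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MOVES = ["claim", "data", "counter-claim", "counter-data",
         "rebuttal", "non-argument"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(MOVES))  # head is untrained here

def classify_move(sentence: str) -> str:
    """Predict the argumentative move of one sentence (after fine-tuning)."""
    batch = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**batch).logits
    return MOVES[int(logits.argmax(dim=-1))]

print(classify_move("Opponents may argue that the data is outdated."))
```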

[NLP-32] CoMAC: Conversational Agent for Multi-Source Auxiliary Context with Sparse and Symmetric Latent Interactions PAKDD2025

【Quick Read】: This paper addresses the limitation that existing methods for grounding dialogue responses in multiple auxiliary sources (e.g., knowledge bases and personas) struggle to extract the relevant information efficiently. In particular, combining versatile conversational ability with adherence to known facts and adaptation to large variations in user preferences and belief systems remains difficult, which continues to hinder broad adoption of conversational AI tools. The key is CoMAC (Conversational Agent for Multi-Source Auxiliary Context with Sparse and Symmetric Latent Interactions), which employs specialized encoding streams and post-fusion grounding networks over multiple data sources to identify the persona and knowledge information relevant to a conversation. CoMAC also introduces a novel text-similarity metric that enables bi-directional information sharing across sources while focusing on a selective subset of meaningful words. Experiments show that CoMAC significantly improves persona and knowledge prediction accuracy as well as response-generation quality over two state-of-the-art methods.

Link: https://arxiv.org/abs/2503.19274
Authors: Junfeng Liu, Christopher T. Symons, Ranga Raju Vatsavai
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: The 29th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD2025)

Abstract:Recent advancements in AI-driven conversational agents have exhibited immense potential of AI applications. Effective response generation is crucial to the success of these agents. While extensive research has focused on leveraging multiple auxiliary data sources (e.g., knowledge bases and personas) to enhance response generation, existing methods often struggle to efficiently extract relevant information from these sources. There are still clear limitations in the ability to combine versatile conversational capabilities with adherence to known facts and adaptation to large variations in user preferences and belief systems, which continues to hinder the wide adoption of conversational AI tools. This paper introduces a novel method, Conversational Agent for Multi-Source Auxiliary Context with Sparse and Symmetric Latent Interactions (CoMAC), for conversation generation, which employs specialized encoding streams and post-fusion grounding networks for multiple data sources to identify relevant persona and knowledge information for the conversation. CoMAC also leverages a novel text similarity metric that allows bi-directional information sharing among multiple sources and focuses on a selective subset of meaningful words. Our experiments show that CoMAC improves the relevant persona and knowledge prediction accuracies and response generation quality significantly over two state-of-the-art methods.

[NLP-33] MARS: Memory-Enhanced Agents with Reflective Self-improvement

【Quick Read】: This paper addresses challenges that large language models (LLMs) face in dynamic environments: continuous decision-making, the lack of long-term memory, and limited context windows. The key is MARS (Memory-Enhanced Agents with Reflective Self-improvement), a framework of three agents (the User, the Assistant, and the Checker) that integrates iterative feedback, a reflection mechanism, and a memory-optimization mechanism based on the Ebbinghaus forgetting curve, significantly enhancing the agents' ability to handle multi-tasking and long-span information.

Link: https://arxiv.org/abs/2503.19271
Authors: Xuechen Liang, Meiling Tao, Yinghui Xia, Jianhui Wang, Kun Li, Yijin Wang, Jingsong Yang, Tianyu Shi, Yuantao Wang, Miao Zhang, Xueqian Wang
Affiliations: East China Jiaotong University; Xidian University; AutoAgents Co., Ltd. (Beijing); University of Electronic Science and Technology of China; Xiamen University; University of Minnesota - Twin Cities; Faculty of Applied Science and Engineering, University of Toronto; Beijing University of Technology; Shenzhen International Graduate School, Tsinghua University
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large language models (LLMs) have made significant advances in the field of natural language processing, but they still face challenges such as continuous decision-making, lack of long-term memory, and limited context windows in dynamic environments. To address these issues, this paper proposes an innovative framework Memory-Enhanced Agents with Reflective Self-improvement. The MARS framework comprises three agents: the User, the Assistant, and the Checker. By integrating iterative feedback, reflective mechanisms, and a memory optimization mechanism based on the Ebbinghaus forgetting curve, it significantly enhances the agents capabilities in handling multi-tasking and long-span information.
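
The Ebbinghaus forgetting curve the abstract mentions has a standard exponential form, R = exp(-t/S), where t is the time since last access and S is memory strength. A memory store could use it to decide what to retain, as in this sketch (our illustration; the retention threshold and the strengthen-on-access rule are assumptions, not details from the paper):

```python
import math, time

class EbbinghausMemory:
    def __init__(self, threshold: float = 0.3):
        self.items = {}            # key -> (content, strength S, last access)
        self.threshold = threshold

    def add(self, key, content, strength: float = 1.0):
        self.items[key] = (content, strength, time.time())

    def recall(self, key):
        content, s, _ = self.items[key]
        self.items[key] = (content, s + 1.0, time.time())  # access strengthens
        return content

    def prune(self):
        now = time.time()
        for key in list(self.items):
            content, s, last = self.items[key]
            retention = math.exp(-(now - last) / (s * 3600))  # R = exp(-t/S)
            if retention < self.threshold:
                del self.items[key]
```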

[NLP-34] PHEONA: An Evaluation Framework for Large Language Model-based Approaches to Computational Phenotyping

【Quick Read】: This paper addresses the time and resource cost of computational phenotyping in biomedical research, where traditional methods typically require extensive manual data review. Advances in machine learning and natural language processing have helped, but further improvement is needed, and few studies have explored large language models (LLMs) for these tasks despite their known strengths on text. The key is PHEONA (Evaluation of PHEnotyping for Observational Health Data), an evaluation framework that makes context-specific considerations explicit, applied here to concept classification within a broader phenotyping process for Acute Respiratory Failure (ARF) respiratory support therapies. On the sampled concepts, the authors achieve high classification accuracy, suggesting that LLM-based methods can improve the accuracy and efficiency of computational phenotyping.

Link: https://arxiv.org/abs/2503.19265
Authors: Sarah Pungitore, Shashank Yadav, Vignesh Subbian
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 2 figures, 5 tables, submitted to 2025 AMIA Annual Symposium

Abstract:Computational phenotyping is essential for biomedical research but often requires significant time and resources, especially since traditional methods typically involve extensive manual data review. While machine learning and natural language processing advancements have helped, further improvements are needed. Few studies have explored using Large Language Models (LLMs) for these tasks despite known advantages of LLMs for text-based tasks. To facilitate further research in this area, we developed an evaluation framework, Evaluation of PHEnotyping for Observational Health Data (PHEONA), that outlines context-specific considerations. We applied and demonstrated PHEONA on concept classification, a specific task within a broader phenotyping process for Acute Respiratory Failure (ARF) respiratory support therapies. From the sample concepts tested, we achieved high classification accuracy, suggesting the potential for LLM-based methods to improve computational phenotyping processes.

[NLP-35] Linguistic Blind Spots of Large Language Models NAACL2025

【Quick Read】: This paper studies how well large language models (LLMs) perform fine-grained linguistic annotation, such as detecting nouns and verbs or identifying more complex syntactic structures like clauses. The core question is whether LLMs possess the precise syntactic and semantic understanding needed for reliable, detailed linguistic analysis, and whether their (even correct) outputs genuinely reflect an understanding of the input. Through a series of experiments, the paper shows that recent state-of-the-art LLMs have limited efficacy on linguistic queries and struggle with linguistically complex inputs; even the most capable model tested (Llama3-70b) makes notable errors in detecting syntactic structures, such as misidentifying embedded clauses, failing to recognize verb phrases, and confusing complex nominals with clauses. These findings provide insights to inform future LLM design and development.

Link: https://arxiv.org/abs/2503.19260
Authors: Jiali Cheng, Hadi Amiri
Affiliations: University of Massachusetts Lowell
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: NAACL 2025 Cognitive Modeling and Computational Linguistics Workshop

Abstract:Large language models (LLMs) are the foundation of many AI applications today. However, despite their remarkable proficiency in generating coherent text, questions linger regarding their ability to perform fine-grained linguistic annotation tasks, such as detecting nouns or verbs, or identifying more complex syntactic structures like clauses in input texts. These tasks require precise syntactic and semantic understanding of input text, and when LLMs underperform on specific linguistic structures, it raises concerns about their reliability for detailed linguistic analysis and whether their (even correct) outputs truly reflect an understanding of the inputs. In this paper, we empirically study the performance of recent LLMs on fine-grained linguistic annotation tasks. Through a series of experiments, we find that recent LLMs show limited efficacy in addressing linguistic queries and often struggle with linguistically complex inputs. We show that the most capable LLM (Llama3-70b) makes notable errors in detecting linguistic structures, such as misidentifying embedded clauses, failing to recognize verb phrases, and confusing complex nominals with clauses. Our results provide insights to inform future advancements in LLM design and development.

[NLP-36] SCI-IDEA: Context-Aware Scientific Ideation Using Token and Sentence Embeddings

【Quick Read】: This paper addresses the challenge of generating innovative, high-quality, and context-aware scientific ideas with support from large language models (LLMs); existing approaches still struggle to produce ideas that are simultaneously novel, exciting, feasible, and effective. The key is SCI-IDEA, a framework that combines LLM prompting strategies with "Aha Moment" detection for iterative idea refinement. SCI-IDEA extracts essential facets from research publications and assesses generated ideas along several dimensions, including novelty, excitement, feasibility, and effectiveness. Comprehensive experiments across multiple LLM configurations validate SCI-IDEA's effectiveness in facilitating structured, flexible exploration of context-aware scientific ideas, and the paper also addresses ethical considerations such as intellectual credit, potential misuse, and the balance between human creativity and AI-driven ideation.

Link: https://arxiv.org/abs/2503.19257
Authors: Farhana Keya, Gollam Rabby, Prasenjit Mitra, Sahar Vahdati, Sören Auer, Yaser Jaradeh
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL)
Comments:

Abstract:Every scientific discovery starts with an idea inspired by prior work, interdisciplinary concepts, and emerging challenges. Recent advancements in large language models (LLMs) trained on scientific corpora have driven interest in AI-supported idea generation. However, generating context-aware, high-quality, and innovative ideas remains challenging. We introduce SCI-IDEA, a framework that uses LLM prompting strategies and Aha Moment detection for iterative idea refinement. SCI-IDEA extracts essential facets from research publications, assessing generated ideas on novelty, excitement, feasibility, and effectiveness. Comprehensive experiments validate SCI-IDEA’s effectiveness, achieving average scores of 6.84, 6.86, 6.89, and 6.84 (on a 1-10 scale) across novelty, excitement, feasibility, and effectiveness, respectively. Evaluations employed GPT-4o, GPT-4.5, DeepSeek-32B (each under 2-shot prompting), and DeepSeek-70B (3-shot prompting), with token-level embeddings used for Aha Moment detection. Similarly, it achieves scores of 6.87, 6.86, 6.83, and 6.87 using GPT-4o under 5-shot prompting, GPT-4.5 under 3-shot prompting, DeepSeek-32B under zero-shot chain-of-thought prompting, and DeepSeek-70B under 5-shot prompting with sentence-level embeddings. We also address ethical considerations such as intellectual credit, potential misuse, and balancing human creativity with AI-driven ideation. Our results highlight SCI-IDEA’s potential to facilitate the structured and flexible exploration of context-aware scientific ideas, supporting innovation while maintaining ethical standards.

[NLP-37] A Survey of Large Language Model Agents for Question Answering

【Quick Read】: This survey addresses the significant limitations of traditional question answering (QA) systems, namely their substantial data requirements and their difficulty generalizing to new environments. The key idea it reviews is using a large language model (LLM) as the core reasoning engine of a QA agent: by enabling interaction with external environments, LLM-based agents achieve superior QA results compared to traditional QA pipelines and naive LLM QA systems. The paper systematically organizes the design of LLM agents for QA around the key stages of planning, question understanding, information retrieval, and answer generation, and it identifies ongoing challenges and future research directions for improving LLM agent QA systems.

Link: https://arxiv.org/abs/2503.19213
Authors: Murong Yue
Affiliations: George Mason University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:This paper surveys the development of large language model (LLM)-based agents for question answering (QA). Traditional agents face significant limitations, including substantial data requirements and difficulty in generalizing to new environments. LLM-based agents address these challenges by leveraging LLMs as their core reasoning engine. These agents achieve superior QA results compared to traditional QA pipelines and naive LLM QA systems by enabling interaction with external environments. We systematically review the design of LLM agents in the context of QA tasks, organizing our discussion across key stages: planning, question understanding, information retrieval, and answer generation. Additionally, this paper identifies ongoing challenges and explores future research directions to enhance the performance of LLM agent QA systems.

[NLP-38] Towards Terminology Management Automation for Arabic

【Quick Read】: This paper targets the automation of terminology management for Arabic: extracting parallel term lists that match foreign-language terms to their Arabic counterparts from field-specific texts, in order to make term translation and usage in specialized Arabic academic books more consistent and to provide automated support for cross-lingual text processing. The key is to exploit naturally occurring term translations: the method considers several co-occurring candidate phrases of varying lengths and computes multiple similarity metrics, including lexicographic, phonetic, morphological, and semantic ones, to decide the best match. The authors experiment with heuristic, machine-learning, and ML-with-post-processing approaches, evaluated on a novel curated dataset and an existing expert-reviewed industry parallel corpus; the best approach achieves 94.9% precision and 92.4% recall.

Link: https://arxiv.org/abs/2503.19211
Authors: Mahdi Nasser, Laura Sayyah, Fadi A. Zaraket
Affiliations: Arab Center for Research and Policy Studies, Doha
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This paper presents a method and supporting tools for automation of terminology management for Arabic. The tools extract lists of parallel terminology matching terms in foreign languages to their Arabic counterparts from field specific texts. This has significant implications as it can be used to improve consistent translation and use of terms in specialized Arabic academic books, and provides automated aid for enhancing cross lingual text processing. This automation of terminology management aims to reduce processing time, and ensure use of consistent and correct terminology. The extraction takes advantage of naturally occurring term translations. It considers several candidate phrases of varying lengths that co-occur next to the foreign terms. Then it computes several similarity metrics, including lexicographic, phonetic, morphological, and semantic ones to decide the problem. We experiment with heuristic, machine learning, and ML with post processing approaches. This paper reports on a novel curated dataset for the task, an existing expert reviewed industry parallel corpora, and on the performance of the three approaches. The best approach achieved 94.9% precision and 92.4% recall.
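
As a toy version of the matching step, the sketch below scores candidate Arabic phrases against a foreign term by combining two of the similarity families the paper names: lexicographic (edit-ratio over a romanization) and semantic (a caller-supplied similarity function). The `romanize` and `semantic_sim` callables are hypothetical placeholders; the real system also uses phonetic and morphological signals.

```python
from difflib import SequenceMatcher

def lexicographic_sim(a: str, b: str) -> float:
    """Edit-ratio similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()

def best_candidate(foreign_term, candidates, romanize, semantic_sim, w=0.5):
    """Pick the Arabic phrase most similar to `foreign_term`.

    `romanize` and `semantic_sim` are stand-ins for the paper's
    transliteration and semantic components.
    """
    def score(cand):
        lex = lexicographic_sim(foreign_term.lower(), romanize(cand).lower())
        sem = semantic_sim(foreign_term, cand)
        return w * lex + (1 - w) * sem
    return max(candidates, key=score)
```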

[NLP-39] Overtrained Language Models Are Harder to Fine-Tune

【Quick Read】: This paper questions whether pre-training large language models (LLMs) on ever-growing token budgets always improves downstream performance. Challenging the conventional assumption that more pre-training is better, it shows that extended pre-training can make models harder to fine-tune and thereby degrade final performance, a phenomenon the authors term catastrophic overtraining.

The key contribution is identifying, through controlled experiments and theoretical analysis, the root cause of catastrophic overtraining: a systematic increase in the broad sensitivity of pre-trained parameters, which makes the model more fragile under any modification, including fine-tuning. The findings call for a critical reassessment of pre-training design that accounts for the downstream adaptability of the model.

Link: https://arxiv.org/abs/2503.19206
Authors: Jacob Mitchell Springer, Sachin Goyal, Kaiyue Wen, Tanishq Kumar, Xiang Yue, Sadhika Malladi, Graham Neubig, Aditi Raghunathan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 72 pages, 65 figures, 6 tables

Abstract:Large language models are pre-trained on ever-growing token budgets under the assumption that better pre-training performance translates to improved downstream models. In this work, we challenge this assumption and show that extended pre-training can make models harder to fine-tune, leading to degraded final performance. We term this phenomenon catastrophic overtraining. For example, the instruction-tuned OLMo-1B model pre-trained on 3T tokens leads to over 2% worse performance on multiple standard LLM benchmarks than its 2.3T token counterpart. Through controlled experiments and theoretical analysis, we show that catastrophic overtraining arises from a systematic increase in the broad sensitivity of pre-trained parameters to modifications, including but not limited to fine-tuning. Our findings call for a critical reassessment of pre-training design that considers the downstream adaptability of the model.

[NLP-40] Browsing Lost Unformed Recollections: A Benchmark for Tip-of-the-Tongue Search and Reasoning

【Quick Read】: This paper probes the performance limits of general AI assistants on complex search-and-reasoning tasks. It introduces BLUR (Browsing Lost Unformed Recollections), a tip-of-the-tongue known-item benchmark of 573 real-world validated questions that demand searching and reasoning over multi-modal, multilingual inputs as well as proficient tool use. The key point is the gap the benchmark exposes: humans score 98% on average, while the best-performing system scores around 56%. To facilitate progress on this challenging use case, the authors release 350 questions through a public leaderboard, retain the answers to 250 of them, and keep the remainder as a private test set.

Link: https://arxiv.org/abs/2503.19193
Authors: Sky CH-Wang, Darshan Deshpande, Smaranda Muresan, Anand Kannappan, Rebecca Qian
Affiliations: Patronus AI; Columbia University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
Comments:

Abstract:We introduce Browsing Lost Unformed Recollections, a tip-of-the-tongue known-item search and reasoning benchmark for general AI assistants. BLUR introduces a set of 573 real-world validated questions that demand searching and reasoning across multi-modal and multilingual inputs, as well as proficient tool use, in order to excel on. Humans easily ace these questions (scoring on average 98%), while the best-performing system scores around 56%. To facilitate progress toward addressing this challenging and aspirational use case for general AI assistants, we release 350 questions through a public leaderboard, retain the answers to 250 of them, and have the rest as a private test set.

[NLP-41] Protein Structure-Function Relationship: A Kernel-PCA Approach for Reaction Coordinate Identification

【Quick Read】: This paper addresses modeling protein structure-function relationships and quantifying how reaction coordinates affect protein properties. The key is a Kernel-PCA model that combines kernel methods with principal component analysis to extract meaningful patterns from high-dimensional protein data obtained from molecular dynamics (MD) simulations, together with a network-based approach that uncovers correlations in the dynamic behavior of residues associated with a specific protein property. The model not only identifies reaction coordinates effectively (demonstrated on a G protein-coupled receptor) but also ranks them by their relative impact on protein properties, providing a powerful tool for protein structure-function analysis and visualization.

Link: https://arxiv.org/abs/2503.19186
Authors: Parisa Mollaei, Amir Barati Farimani
Affiliations: Carnegie Mellon University (CMU)
Categories: Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
Comments: 28 pages, 10 figures

Abstract:In this study, we propose a Kernel-PCA model designed to capture structure-function relationships in a protein. This model also enables ranking of reaction coordinates according to their impact on protein properties. By leveraging machine learning techniques, including Kernel and principal component analysis (PCA), our model uncovers meaningful patterns in high-dimensional protein data obtained from molecular dynamics (MD) simulations. The effectiveness of our model in accurately identifying reaction coordinates has been demonstrated through its application to a G protein-coupled receptor. Furthermore, this model utilizes a network-based approach to uncover correlations in the dynamic behavior of residues associated with a specific protein property. These findings underscore the potential of our model as a powerful tool for protein structure-function analysis and visualization.
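
Kernel PCA itself is a one-liner with scikit-learn. The sketch below runs it on random stand-in data shaped like MD frame features; the feature-ranking step at the end (correlating each input feature with the leading kernel component) is our own crude proxy for the paper's ranking of reaction coordinates, not their actual method.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Toy stand-in for MD-derived features: frames x residue descriptors.
rng = np.random.default_rng(0)
frames = rng.normal(size=(500, 40))   # hypothetical 500 frames, 40 features

kpca = KernelPCA(n_components=3, kernel="rbf", gamma=0.1)
embedding = kpca.fit_transform(frames)

# Crude importance proxy: correlate each feature with the leading kernel
# principal component and rank features (candidate reaction coordinates).
scores = np.abs([np.corrcoef(frames[:, j], embedding[:, 0])[0, 1]
                 for j in range(frames.shape[1])])
ranking = np.argsort(scores)[::-1]
print("top candidate coordinates:", ranking[:5])
```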

[NLP-42] Evaluating Bias in LLMs for Job-Resume Matching: Gender, Race, and Education NAACL2025

【Quick Read】: This paper examines the performance and fairness of large language models (LLMs) in automated hiring, using a job-resume matching task (in English and in a U.S. context) to surface potential biases. The key contribution is an analysis of how factors such as gender, race, and educational background influence model decisions, and the finding that while recent models have reduced biases tied to explicit attributes like gender and race, implicit biases concerning educational background remain significant. The paper therefore argues for ongoing evaluation and the development of advanced bias-mitigation strategies to ensure that LLMs enable equitable hiring practices in industry settings.

Link: https://arxiv.org/abs/2503.19182
Authors: Hayate Iso, Pouya Pezeshkpour, Nikita Bhutani, Estevam Hruschka
Affiliations: Megagon Labs
Categories: Computation and Language (cs.CL)
Comments: NAACL 2025: Industry Track

Abstract:Large Language Models (LLMs) offer the potential to automate hiring by matching job descriptions with candidate resumes, streamlining recruitment processes, and reducing operational costs. However, biases inherent in these models may lead to unfair hiring practices, reinforcing societal prejudices and undermining workplace diversity. This study examines the performance and fairness of LLMs in job-resume matching tasks within the English language and U.S. context. It evaluates how factors such as gender, race, and educational background influence model decisions, providing critical insights into the fairness and reliability of LLMs in HR applications. Our findings indicate that while recent models have reduced biases related to explicit attributes like gender and race, implicit biases concerning educational background remain significant. These results highlight the need for ongoing evaluation and the development of advanced bias mitigation strategies to ensure equitable hiring practices when using LLMs in industry settings.

[NLP-43] Language Model Uncertainty Quantification with Attention Chain

【Quick Read】: This paper tackles uncertainty quantification (UQ) for large language model (LLM) predictions on tasks that involve intermediate reasoning steps. Existing methods work well for short, directly answerable questions, but with reasoning the probabilities of answer tokens are conditioned on a vast space of preceding reasoning tokens: direct marginalization is infeasible, and the probability dependency inflates estimates, causing overconfidence in UQ. The key innovation is UQAC (Uncertainty Quantification via Attention Chain), an efficient method that narrows the reasoning space to a tractable size by constructing an "attention chain" of tokens deemed semantically crucial to the final answer. Starting from the answer tokens, a backtracking procedure uses attention weights to identify the most influential predecessors and iterates until the input tokens are reached; similarity filtering and probability thresholding further refine the chain, allowing the marginal probabilities of the answer tokens to be approximated and used as the model's confidence. Experiments on multiple reasoning benchmarks with advanced open-source LLMs show that UQAC consistently delivers reliable UQ estimates with high computational efficiency.

Link: https://arxiv.org/abs/2503.19168
Authors: Yinghao Li, Rushi Qiang, Lama Moukheiber, Chao Zhang
Affiliations: Georgia Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: 33 pages, 7 figures, 30 tables

Abstract:Accurately quantifying a large language model’s (LLM) predictive uncertainty is crucial for judging the reliability of its answers. While most existing research focuses on short, directly answerable questions with closed-form outputs (e.g., multiple-choice), involving intermediate reasoning steps in LLM responses is increasingly important. This added complexity complicates uncertainty quantification (UQ) because the probabilities assigned to answer tokens are conditioned on a vast space of preceding reasoning tokens. Direct marginalization is infeasible, and the dependency inflates probability estimates, causing overconfidence in UQ. To address this, we propose UQAC, an efficient method that narrows the reasoning space to a tractable size for marginalization. UQAC iteratively constructs an “attention chain” of tokens deemed “semantically crucial” to the final answer via a backtracking procedure. Starting from the answer tokens, it uses attention weights to identify the most influential predecessors, then iterates this process until reaching the input tokens. Similarity filtering and probability thresholding further refine the resulting chain, allowing us to approximate the marginal probabilities of the answer tokens, which serve as the LLM’s confidence. We validate UQAC on multiple reasoning benchmarks with advanced open-source LLMs, demonstrating that it consistently delivers reliable UQ estimates with high computational efficiency.
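
The backtracking procedure can be pictured as a graph walk over the attention matrix: start from the answer tokens, repeatedly add the most-attended earlier tokens, and stop at the inputs. A minimal sketch follows (our illustration; the `top_k` cut and the weight threshold stand in for the paper's similarity filtering and probability thresholding):

```python
import numpy as np

def attention_chain(attn: np.ndarray, answer_ids, top_k=3, thresh=0.05):
    """attn[i, j]: attention from token i to earlier token j (layer-averaged)."""
    chain = set(answer_ids)
    frontier = list(answer_ids)
    while frontier:
        t = frontier.pop()
        # Most influential predecessors of token t.
        for p in np.argsort(attn[t])[::-1][:top_k]:
            p = int(p)
            if p < t and attn[t, p] >= thresh and p not in chain:
                chain.add(p)
                frontier.append(p)
    return sorted(chain)  # tokens kept for approximate marginalization
```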

[NLP-44] MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks

【Quick Read】: This paper addresses the vulnerability of multimodal large language models (MLLMs) to cross-modal jailbreaks: text-filtering safety mechanisms have progressed substantially, yet they cannot stop attacks that exploit a model's cross-modal reasoning. The key is MIRAGE, a framework that uses narrative-driven context and role immersion to circumvent safety mechanisms. It systematically decomposes a toxic query into environment-role-action triplets, then builds a multi-turn visual-storytelling sequence of images and text with Stable Diffusion that guides the target model through an engaging detective narrative, progressively lowering its defenses and subtly steering its reasoning until harmful responses are elicited. On six mainstream MLLMs, MIRAGE achieves state-of-the-art attack success rates, up to 17.5% above the best baselines. The paper further shows that role immersion and structured semantic reconstruction can activate inherent model biases that lead to spontaneous violation of ethical safeguards, underscoring critical weaknesses in current multimodal safety mechanisms and the urgent need for more robust cross-modal defenses.

Link: https://arxiv.org/abs/2503.19134
Authors: Wenhao You, Bryan Hooi, Yiwei Wang, Youke Wang, Zong Ke, Ming-Hsuan Yang, Zi Huang, Yujun Cai
Affiliations: University of Waterloo; National University of Singapore; University of California, Merced; University of Alberta; University of Queensland
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:

Abstract:While safety mechanisms have significantly progressed in filtering harmful text inputs, MLLMs remain vulnerable to multimodal jailbreaks that exploit their cross-modal reasoning capabilities. We present MIRAGE, a novel multimodal jailbreak framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models (MLLMs). By systematically decomposing the toxic query into environment, role, and action triplets, MIRAGE constructs a multi-turn visual storytelling sequence of images and text using Stable Diffusion, guiding the target model through an engaging detective narrative. This process progressively lowers the model’s defences and subtly guides its reasoning through structured contextual cues, ultimately eliciting harmful responses. In extensive experiments on the selected datasets with six mainstream MLLMs, MIRAGE achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines. Moreover, we demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model’s spontaneous violation of ethical safeguards. These results highlight critical weaknesses in current multimodal safety mechanisms and underscore the urgent need for more robust defences against cross-modal threats.

[NLP-45] Overcoming Vocabulary Mismatch: Vocabulary-agnostic Teacher Guided Language Modeling

【Quick Read】: This paper addresses the significant challenge that vocabulary mismatches between teacher and student language models pose, namely divergent token sequences and output distributions. The key is VocAgnoLM (Vocabulary-agnostic Teacher Guided Language Modeling), which bridges the gap caused by vocabulary mismatch with two techniques: (1) Token-level Lexical Alignment, which aligns token sequences across mismatched vocabularies, and (2) Teacher Guided Loss, which uses the teacher model's loss to guide student training effectively. The authors demonstrate the method by training a 1B student with various 7B teachers that have different vocabularies; notably, with Qwen2.5-Math-Instruct, a teacher sharing only about 6% of its vocabulary with TinyLlama, VocAgnoLM yields a 46% improvement over naive continual pretraining. It also consistently benefits from stronger teacher models, offering a robust solution to vocabulary mismatch in language modeling.

Link: https://arxiv.org/abs/2503.19123
Authors: Haebin Shin, Lei Ji, Xiao Liu, Yeyun Gong
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Using large teacher models to guide the training of smaller student models has become the prevailing paradigm for efficient and effective learning. However, vocabulary mismatches between teacher and student language models pose significant challenges in language modeling, resulting in divergent token sequences and output distributions. To overcome these limitations, we propose Vocabulary-agnostic Teacher Guided Language Modeling (VocAgnoLM), a novel approach that bridges the gap caused by vocabulary mismatch through two key methods: (1) Token-level Lexical Alignment, which aligns token sequences across mismatched vocabularies, and (2) Teacher Guided Loss, which leverages the loss of teacher model to guide effective student training. We demonstrate its effectiveness in language modeling with 1B student model using various 7B teacher models with different vocabularies. Notably, with Qwen2.5-Math-Instruct, a teacher model sharing only about 6% of its vocabulary with TinyLlama, VocAgnoLM achieves a 46% performance improvement compared to naive continual pretraining. Furthermore, we demonstrate that VocAgnoLM consistently benefits from stronger teacher models, providing a robust solution to vocabulary mismatches in language modeling.
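
Token-level lexical alignment can be grounded in character offsets: if teacher and student tokenize the same text, each student token maps to the teacher tokens whose character spans overlap it. A minimal sketch, under the assumption that both tokenizations concatenate back to the identical string (real tokenizers need offset mappings to handle special prefixes):

```python
def char_spans(tokens):
    """Character span (start, end) of each token in the concatenated text."""
    spans, pos = [], 0
    for t in tokens:
        spans.append((pos, pos + len(t)))
        pos += len(t)
    return spans

def align_tokens(student_toks, teacher_toks):
    """Map each student token index to overlapping teacher token indices."""
    s_spans, t_spans = char_spans(student_toks), char_spans(teacher_toks)
    mapping = {}
    for i, (s0, s1) in enumerate(s_spans):
        mapping[i] = [j for j, (t0, t1) in enumerate(t_spans)
                      if max(s0, t0) < min(s1, t1)]
    return mapping

# align_tokens(["un", "believ", "able"], ["unbe", "lievable"])
# -> {0: [0], 1: [0, 1], 2: [1]}
```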

[NLP-46] Where is this coming from? Making groundedness count in the evaluation of Document VQA models NAACL

【Quick Read】: This paper argues that the evaluation metrics used by popular Document VQA benchmarks fail to account for the semantic and multimodal groundedness of model outputs: hallucinations and major semantic errors are scored the same way as well-grounded answers, so evaluation scores do not reflect a model's reasoning capabilities. The key is a new evaluation methodology that scores predictions by their groundedness, considering both the semantic characteristics of the output and its multimodal placement within the input document. The methodology is parameterized so users can configure the score according to their preferences. The authors validate it against human judgment, show its potential impact on existing popular leaderboards, and demonstrate through extensive analyses that it better indicates model robustness and tends to reward better-calibrated answers.

Link: https://arxiv.org/abs/2503.19120
Authors: Armineh Nourbakhsh, Siddharth Parekh, Pranav Shetty, Zhao Jin, Sameena Shah, Carolyn Rose
Affiliations: Language Technologies Institute, Carnegie Mellon University; J.P. Morgan, New York
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to NAACL Findings 2025

Abstract:Document Visual Question Answering (VQA) models have evolved at an impressive rate over the past few years, coming close to or matching human performance on some benchmarks. We argue that common evaluation metrics used by popular benchmarks do not account for the semantic and multimodal groundedness of a model’s outputs. As a result, hallucinations and major semantic errors are treated the same way as well-grounded outputs, and the evaluation scores do not reflect the reasoning capabilities of the model. In response, we propose a new evaluation methodology that accounts for the groundedness of predictions with regard to the semantic characteristics of the output as well as the multimodal placement of the output within the input document. Our proposed methodology is parameterized in such a way that users can configure the score according to their preferences. We validate our scoring methodology using human judgment and show its potential impact on existing popular leaderboards. Through extensive analyses, we demonstrate that our proposed method produces scores that are a better indicator of a model’s robustness and tends to give higher rewards to better-calibrated answers.

[NLP-47] Understanding and Improving Information Preservation in Prompt Compression for LLMs

【Quick Read】: This paper addresses the trade-offs of prompt compression in information-intensive tasks, where rapidly growing prompt lengths raise computational requirements, degrade performance, and induce biases from irrelevant or redundant information. The key is a holistic evaluation framework that analyzes compression methods in depth along three axes beyond compression ratio: (i) downstream task performance, (ii) grounding in the input context, and (iii) information preservation. Using this framework, the paper shows that state-of-the-art soft and hard compression methods struggle to preserve key details of the original prompt, limiting their performance on complex tasks. It further demonstrates that modifying soft prompting methods to better control the granularity of the compressed information substantially improves their effectiveness: up to +23% in downstream task performance, more than +8 BERTScore points in grounding, and 2.7x more entities preserved during compression.

Link: https://arxiv.org/abs/2503.19114
Authors: Weronika Łajewska, Momchil Hardalov, Laura Aina, Neha Anna John, Hang Su, Lluís Màrquez
Affiliations: University of Stavanger; AWS AI Labs; Technical University of Catalonia (UPC)
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 21 pages, 6 figures, 23 tables

Abstract:Recent advancements in large language models (LLMs) have enabled their successful application to a broad range of tasks. However, in information-intensive tasks, the prompt length can grow fast, leading to increased computational requirements, performance degradation, and induced biases from irrelevant or redundant information. Recently, various prompt compression techniques have been introduced to optimize the trade-off between reducing input length and retaining performance. We propose a holistic evaluation framework that allows for in-depth analysis of prompt compression methods. We focus on three key aspects, besides compression ratio: (i) downstream task performance, (ii) grounding in the input context, and (iii) information preservation. Through this framework, we investigate state-of-the-art soft and hard compression methods, showing that they struggle to preserve key details from the original prompt, limiting their performance on complex tasks. We demonstrate that modifying soft prompting methods to control better the granularity of the compressed information can significantly improve their effectiveness – up to +23% in downstream task performance, more than +8 BERTScore points in grounding, and 2.7x more entities preserved in compression.
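
Of the three axes, information preservation is the easiest to make concrete: count which entities from the original prompt survive compression. A toy version using spaCy NER follows; this illustrates the general idea rather than the paper's exact metric, and it assumes the `en_core_web_sm` model is installed.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def entity_preservation(original: str, compressed: str) -> float:
    """Fraction of original-prompt entities still present after compression."""
    orig_ents = {e.text.lower() for e in nlp(original).ents}
    comp_ents = {e.text.lower() for e in nlp(compressed).ents}
    if not orig_ents:
        return 1.0
    return len(orig_ents & comp_ents) / len(orig_ents)
```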

[NLP-48] Masks and Mimicry: Strategic Obfuscation and Impersonation Attacks on Authorship Verification NAACL2025

【Quick Read】: This paper addresses the security risks that attacks powered by large language models (LLMs) pose to authorship verification models. As LLMs improve tasks such as accurate authorship identification, they simultaneously hand malicious actors new attack vectors, including untargeted authorship obfuscation (masking an author's writing style) and targeted authorship impersonation (mimicking it), both while preserving the semantics of the original text. The key contribution is an evaluation of the adversarial robustness of an authorship verification model against these potent LLM-based attacks: by perturbing an accurate verifier, the authors achieve maximum attack success rates of 92% for obfuscation and 78% for impersonation.

Link: https://arxiv.org/abs/2503.19099
Authors: Kenneth Alperin, Rohan Leekha, Adaku Uchendu, Trang Nguyen, Srilakshmi Medarametla, Carlos Levya Capote, Seth Aycock, Charlie Dagli
Affiliations: MIT Lincoln Laboratory, The University of Virginia, University of Puerto Rico-Mayaguez, University of Amsterdam
Categories: Computation and Language (cs.CL)
Comments: Accepted at NLP4DH Workshop @ NAACL 2025

Abstract:The increasing use of Artificial Intelligence (AI) technologies, such as Large Language Models (LLMs) has led to nontrivial improvements in various tasks, including accurate authorship identification of documents. However, while LLMs improve such defense techniques, they also simultaneously provide a vehicle for malicious actors to launch new attack vectors. To combat this security risk, we evaluate the adversarial robustness of authorship models (specifically an authorship verification model) to potent LLM-based attacks. These attacks include untargeted methods - \textitauthorship obfuscation and targeted methods - \textitauthorship impersonation. For both attacks, the objective is to mask or mimic the writing style of an author while preserving the original texts’ semantics, respectively. Thus, we perturb an accurate authorship verification model, and achieve maximum attack success rates of 92% and 78% for both obfuscation and impersonation attacks, respectively.

[NLP-49] Rankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation

【Quick Read】: This paper examines potential biases in information retrieval (IR) arising from the interplay between LLM-based rankers, judges, and content-creation assistants. The key contributions are a synthesis of existing research plus novel experiment designs that probe how LLM-based rankers and assistants influence LLM-based judges. The study provides the first empirical evidence that LLM judges exhibit significant bias toward LLM-based rankers, and it observes that LLM judges have limited ability to discern subtle differences in system performance. Contrary to some previous findings, the preliminary study finds no evidence of bias against AI-generated content. The paper closes with initial guidelines and a research agenda to ensure the reliable use of LLMs in IR evaluation.

Link: https://arxiv.org/abs/2503.19092
Authors: Krisztian Balog, Donald Metzler, Zhen Qin
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) are increasingly integral to information retrieval (IR), powering ranking, evaluation, and AI-assisted content creation. This widespread adoption necessitates a critical examination of potential biases arising from the interplay between these LLM-based components. This paper synthesizes existing research and presents novel experiment designs that explore how LLM-based rankers and assistants influence LLM-based judges. We provide the first empirical evidence of LLM judges exhibiting significant bias towards LLM-based rankers. Furthermore, we observe limitations in LLM judges’ ability to discern subtle system performance differences. Contrary to some previous findings, our preliminary study does not find evidence of bias against AI-generated content. These results highlight the need for a more holistic view of the LLM-driven information ecosystem. To this end, we offer initial guidelines and a research agenda to ensure the reliable use of LLMs in IR evaluation.

[NLP-50] LLM-Based Insight Extraction for Contact Center Analytics and Cost-Efficient Deployment

【Quick Read】: This paper addresses efficiency and cost optimization for automated call-driver generation and related contact-center tasks. The key is a cost-efficient large language model (LLM) system design comprising (1) a comprehensive evaluation of proprietary, open-weight, and fine-tuned models, (2) cost-saving strategies, and (3) the corresponding cost analysis for production deployment. The system serves as the foundation for tasks such as topic modeling, incoming-call classification, trend detection, and FAQ generation, delivering actionable insights for contact-center agents and administrators.

Link: https://arxiv.org/abs/2503.19090
Authors: Varsha Embar, Ritvik Shrivastava, Vinay Damodaran, Travis Mehlinger, Yu-Chung Hsiao, Karthik Raghunathan
Affiliations: Cisco Systems
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models have transformed the Contact Center industry, manifesting in enhanced self-service tools, streamlined administrative processes, and augmented agent productivity. This paper delineates our system that automates call driver generation, which serves as the foundation for tasks such as topic modeling, incoming call classification, trend detection, and FAQ generation, delivering actionable insights for contact center agents and administrators to consume. We present a cost-efficient LLM system design, with 1) a comprehensive evaluation of proprietary, open-weight, and fine-tuned models and 2) cost-efficient strategies, and 3) the corresponding cost analysis when deployed in production environments.

[NLP-51] LookAhead Tuning: Safer Language Models via Partial Answer Previews

【Quick Read】: This paper addresses the problem that fine-tuning large language models (LLMs) for specific domains often undermines their previously established safety alignment. The key of the proposed solution, LookAhead Tuning, is a pair of simple, low-resource, and effective data-driven methods that modify the training data by previewing partial answer prefixes, minimizing perturbations to the initial token distributions and thereby preserving the model's inherent safety mechanisms.

Link: https://arxiv.org/abs/2503.19041
Authors: Kangwei Liu, Mengru Wang, Yujie Luo, Lin Yuan, Mengshu Sun, Ningyu Zhang, Lei Liang, Zhiqiang Zhang, Jun Zhou, Huajun Chen
Affiliations: Zhejiang University; Ant Group; Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Work in progress

Abstract:Fine-tuning enables large language models (LLMs) to adapt to specific domains, but often undermines their previously established safety alignment. To mitigate the degradation of model safety during fine-tuning, we introduce LookAhead Tuning, which comprises two simple, low-resource, and effective data-driven methods that modify training data by previewing partial answer prefixes. Both methods aim to preserve the model’s inherent safety mechanisms by minimizing perturbations to initial token distributions. Comprehensive experiments demonstrate that LookAhead Tuning effectively maintains model safety without sacrificing robust performance on downstream tasks. Our findings position LookAhead Tuning as a reliable and efficient solution for the safe and effective adaptation of LLMs. Code is released at this https URL.
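
One plausible reading of "previewing partial answer prefixes" is a data transform that exposes the first few answer tokens inside the prompt, so fine-tuning barely shifts the distribution of the earliest generated tokens. The sketch below is our guess at such a transform, not the paper's recipe; the field names and preview length are assumptions.

```python
def lookahead_example(instruction: str, answer: str, n_preview: int = 6) -> dict:
    """Build a training pair whose prompt previews the answer's first tokens.

    Hypothetical illustration of the 'partial answer preview' idea.
    """
    preview = " ".join(answer.split()[:n_preview])
    prompt = (
        f"{instruction}\n"
        f"(Hint: the answer begins with \"{preview}\" ...)\n"
        f"Answer:"
    )
    return {"prompt": prompt, "completion": " " + answer}

# Usage:
# lookahead_example("Summarize the contract clause.", "The clause limits ...")
```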

[NLP-52] SRMIR: Shadow Reward Models Based on Introspective Reasoning for LLM Alignment

【Quick Read】: This paper addresses three main limitations of current methods for aligning large language models (LLMs) with human preferences and values: (1) reliance on costly human annotation; (2) the alignment tax; and (3) shallow alignment that is vulnerable to jailbreak attacks. In addition, existing alignment datasets often suffer from uneven distributions, overrepresenting some topics while neglecting others. The key is SRMIR (Shadow Reward Models Based on Introspective Reasoning), inspired by shadow models in membership inference attacks. The method first constructs a balanced safety Chain of Draft (CoD) dataset across seven harmful types, using structured prompts that leverage the introspective reasoning capabilities of LLMs, then trains a set of specialized shadow reward models to guide policy optimization through Group Relative Policy Optimization (GRPO). Two integration strategies, a linear combination and a categorized approach, are compared; the latter achieves superior alignment despite higher computational cost. Experiments across several LLMs show that SRMIR significantly outperforms existing methods.

Link: https://arxiv.org/abs/2503.18991
Authors: Ruoxi Cheng, Shuirong Cao
Affiliations: Alibaba Group; Nanjing University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Aligning large language models (LLMs) with human preferences and values is vital for application. However, current alignment methods face three main limitations: (1) reliance on costly human annotation; (2) alignment tax; (3) shallow alignment vulnerable to jailbreak attacks. Additionally, current alignment datasets often suffer from uneven distributions, leading to overrepresentation of some topics and neglect of others. To address these issues, we propose SRMIR (Shadow Reward Models Based on Introspective Reasoning), inspired by shadow models in membership inference attacks. We first construct a balanced safety Chain of Draft (CoD) dataset across 7 harmful types with structured prompt leveraging the introspective reasoning capabilities of LLMs, then train a set of specialized reward models to guide policy optimization through Group Relative Policy Optimization (GRPO). We apply two strategies, linear combination and categorized approach, to integrate shadow reward models for policy optimization. By comparison, we find that the latter achieves superior alignment despite higher computational costs. Experiments across several LLMs demonstrate SRMIR significantly outperforms existing methods.
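
The two integration strategies differ only in how the shadow reward models are combined: a weighted sum versus routing by harm category. A schematic sketch follows (our illustration; the harm-type router is a hypothetical classifier):

```python
from typing import Callable, Dict, List

RewardModel = Callable[[str, str], float]  # (prompt, response) -> score

def linear_reward(rms: List[RewardModel], weights: List[float],
                  prompt: str, response: str) -> float:
    """Linear combination: every shadow model scores every sample."""
    return sum(w * rm(prompt, response) for rm, w in zip(rms, weights))

def categorized_reward(rms: Dict[str, RewardModel],
                       classify_harm_type: Callable[[str], str],
                       prompt: str, response: str) -> float:
    """Categorized approach: route to the shadow model for the harm type."""
    return rms[classify_harm_type(prompt)](prompt, response)
```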

Computer Vision

[CV-0] EventFly: Event Camera Perception from Ground to the Sky CVPR2025

【Quick Read】: This paper addresses cross-platform adaptation for event-based dense perception, which is crucial for deploying event cameras across settings as different as vehicles, drones, and quadrupeds, each with its own motion dynamics, viewpoints, and class distributions. The key is EventFly, a framework for robust cross-platform adaptation in event camera perception built on three components: i) Event Activation Prior (EAP), which identifies high-activation regions in the target domain to minimize prediction entropy, fostering confident, domain-adaptive predictions; ii) EventBlend, a data-mixing strategy that fuses source and target event voxel grids based on EAP-driven similarity and density maps to enhance feature alignment; and iii) EventMatch, a dual-discriminator technique that aligns features across source, target, and blended domains for better domain-invariant learning. For holistic evaluation of cross-platform adaptation, the paper also introduces EXPo, a large-scale benchmark with diverse samples spanning vehicle, drone, and quadruped platforms. Extensive experiments validate the approach, demonstrating substantial gains over popular adaptation methods.

Link: https://arxiv.org/abs/2503.19916
Authors: Lingdong Kong, Dongyue Lu, Xiang Xu, Lai Xing Ng, Wei Tsang Ooi, Benoit R. Cottereau
Affiliations: National University of Singapore; CNRS@CREATE; Nanjing University of Aeronautics and Astronautics; Institute for Infocomm Research, A*STAR; IPAL, CNRS IRL 2955, Singapore; CerCo, CNRS UMR 5549, Université Toulouse III
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: CVPR 2025; 30 pages, 8 figures, 16 tables; Project Page at this https URL

Abstract:Cross-platform adaptation in event-based dense perception is crucial for deploying event cameras across diverse settings, such as vehicles, drones, and quadrupeds, each with unique motion dynamics, viewpoints, and class distributions. In this work, we introduce EventFly, a framework for robust cross-platform adaptation in event camera perception. Our approach comprises three key components: i) Event Activation Prior (EAP), which identifies high-activation regions in the target domain to minimize prediction entropy, fostering confident, domain-adaptive predictions; ii) EventBlend, a data-mixing strategy that integrates source and target event voxel grids based on EAP-driven similarity and density maps, enhancing feature alignment; and iii) EventMatch, a dual-discriminator technique that aligns features from source, target, and blended domains for better domain-invariant learning. To holistically assess cross-platform adaptation abilities, we introduce EXPo, a large-scale benchmark with diverse samples across vehicle, drone, and quadruped platforms. Extensive experiments validate our effectiveness, demonstrating substantial gains over popular adaptation methods. We hope this work can pave the way for more adaptive, high-performing event perception across diverse and complex environments.

[CV-1] Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models

【Quick Read】: This paper addresses learning object-object spatial relationships (OOR), i.e., the relative 3D placement of object pairs, by leveraging synthetic 3D samples generated from pre-trained 2D diffusion models. The key hypothesis is that images synthesized by 2D diffusion models inherently capture plausible, realistic OOR cues, enabling an efficient way to collect a 3D dataset for learning OOR over unbounded object categories. The approach first synthesizes diverse images that capture plausible OOR cues and uplifts them into 3D samples, then trains a score-based OOR diffusion model on this collection to learn the distribution of relative spatial relationships between object pairs. It further extends pairwise OOR to multi-object OOR by enforcing consistency across pairwise relations and preventing object collisions. Extensive experiments demonstrate robustness across diverse object-object spatial relationships and applicability to real-world 3D scene-arrangement tasks using the OOR diffusion model.

Link: https://arxiv.org/abs/2503.19914
Authors: Sangwon Beak, Hyeonwoo Kim, Hanbyul Joo
Affiliations: Seoul National University; RLWRLD
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:We present a method for learning 3D spatial relationships between object pairs, referred to as object-object spatial relationships (OOR), by leveraging synthetically generated 3D samples from pre-trained 2D diffusion models. We hypothesize that images synthesized by 2D diffusion models inherently capture plausible and realistic OOR cues, enabling efficient ways to collect a 3D dataset to learn OOR for various unbounded object categories. Our approach begins by synthesizing diverse images that capture plausible OOR cues, which we then uplift into 3D samples. Leveraging our diverse collection of plausible 3D samples for the object pairs, we train a score-based OOR diffusion model to learn the distribution of their relative spatial relationships. Additionally, we extend our pairwise OOR to multi-object OOR by enforcing consistency across pairwise relations and preventing object collisions. Extensive experiments demonstrate the robustness of our method across various object-object spatial relationships, along with its applicability to real-world 3D scene arrangement tasks using the OOR diffusion model.

[CV-2] PartRM: Modeling Part-Level Dynamics with Large Cross-State Reconstruction Model CVPR2025

【Quick Read】: This paper addresses the need to model part-level object dynamics more accurately, as required for predicting future states from current observations and actions. Existing approaches such as Puppet-Master rely on fine-tuning large pre-trained video diffusion models, which is impractical for real-world use due to the limitations of 2D video representations and slow processing times. The key is PartRM, a novel 4D reconstruction framework that simultaneously models appearance, geometry, and part-level motion from multi-view images of a static object. PartRM builds on large 3D Gaussian reconstruction models, leveraging their extensive knowledge of appearance and geometry, and introduces the PartDrag-4D dataset (multi-view observations of part-level dynamics across over 20,000 states) to mitigate the scarcity of 4D data. A multi-scale drag-embedding module captures dynamics at varying granularities, and a two-stage training process that learns motion and appearance sequentially prevents catastrophic forgetting during fine-tuning. Experiments show that PartRM establishes a new state of the art in part-level motion learning and can be applied to manipulation tasks in robotics.

Link: https://arxiv.org/abs/2503.19913
Authors: Mingju Gao, Yike Pan, Huan-ang Gao, Zongzheng Zhang, Wenyi Li, Hao Dong, Hao Tang, Li Yi, Hao Zhao
Affiliations: Tsinghua University; University of Michigan; Peking University; BAAI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025. Project Page: this https URL

Abstract:As interest grows in world models that predict future states from current observations and actions, accurately modeling part-level dynamics has become increasingly relevant for various applications. Existing approaches, such as Puppet-Master, rely on fine-tuning large-scale pre-trained video diffusion models, which are impractical for real-world use due to the limitations of 2D video representation and slow processing times. To overcome these challenges, we present PartRM, a novel 4D reconstruction framework that simultaneously models appearance, geometry, and part-level motion from multi-view images of a static object. PartRM builds upon large 3D Gaussian reconstruction models, leveraging their extensive knowledge of appearance and geometry in static objects. To address data scarcity in 4D, we introduce the PartDrag-4D dataset, providing multi-view observations of part-level dynamics across over 20,000 states. We enhance the model’s understanding of interaction conditions with a multi-scale drag embedding module that captures dynamics at varying granularities. To prevent catastrophic forgetting during fine-tuning, we implement a two-stage training process that focuses sequentially on motion and appearance learning. Experimental results show that PartRM establishes a new state-of-the-art in part-level motion learning and can be applied in manipulation tasks in robotics. Our code, data, and models are publicly available to facilitate future research.

[CV-3] SuperFlow: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining

【Quick Read】: This paper addresses the limitation that existing LiDAR representation-learning methods focus on spatial alignment between LiDAR and camera sensors while neglecting the temporal dynamics critical for capturing motion and scene continuity in driving scenarios. The key is SuperFlow++, a framework that integrates spatiotemporal cues in both pretraining and downstream tasks using consecutive LiDAR-camera pairs, built from four components: (1) a view-consistency alignment module that unifies semantic information across camera views; (2) a dense-to-sparse consistency regularization that strengthens feature robustness under varying point-cloud densities; (3) a flow-based contrastive learning approach that models temporal relationships for improved scene understanding; and (4) a temporal voting strategy that propagates semantic information across LiDAR scans to improve prediction consistency. Evaluations on 11 heterogeneous LiDAR datasets show that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions; moreover, scaling the 2D and 3D backbones during pretraining reveals emergent properties that offer deeper insights into developing scalable 3D foundation models.

Link: https://arxiv.org/abs/2503.19912
Authors: Xiang Xu, Lingdong Kong, Hui Shuai, Wenwei Zhang, Liang Pan, Kai Chen, Ziwei Liu, Qingshan Liu
Affiliations: College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, China; School of Computing, Department of Computer Science, National University of Singapore, CNRS@CREATE, Singapore; School of Computer Science, Nanjing University of Posts and Telecommunications, China; Shanghai AI Laboratory, China; S-Lab, Nanyang Technological University, Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Preprint; 15 pages, 6 figures, 10 tables; Code at this https URL

Abstract:LiDAR representation learning has emerged as a promising approach to reducing reliance on costly and labor-intensive human annotations. While existing methods primarily focus on spatial alignment between LiDAR and camera sensors, they often overlook the temporal dynamics critical for capturing motion and scene continuity in driving scenarios. To address this limitation, we propose SuperFlow++, a novel framework that integrates spatiotemporal cues in both pretraining and downstream tasks using consecutive LiDAR-camera pairs. SuperFlow++ introduces four key components: (1) a view consistency alignment module to unify semantic information across camera views, (2) a dense-to-sparse consistency regularization mechanism to enhance feature robustness across varying point cloud densities, (3) a flow-based contrastive learning approach that models temporal relationships for improved scene understanding, and (4) a temporal voting strategy that propagates semantic information across LiDAR scans to improve prediction consistency. Extensive evaluations on 11 heterogeneous LiDAR datasets demonstrate that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions. Furthermore, by scaling both 2D and 3D backbones during pretraining, we uncover emergent properties that provide deeper insights into developing scalable 3D foundation models. With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving. The code is publicly available at this https URL

[CV-4] CoLLM : A Large Language Model for Composed Image Retrieval CVPR25 CVPR2025

【Quick Read】: This paper addresses the scarcity of training data for Composed Image Retrieval (CIR), a task that retrieves images based on a multimodal query. Conventional training triplets (a reference image, a textual description of desired modifications, and the target image) are expensive and time-consuming to acquire, while existing zero-shot alternatives have significant limitations: synthetic triplets suffer from limited scale, lack of diversity, and unnatural modification text, and image-caption pairs hinder joint embedding learning of the multimodal query due to the absence of triplet data, making complex, nuanced modification texts hard to handle. The key is CoLLM, a one-stop framework that generates triplets on the fly from image-caption pairs, enabling supervised training without manual annotation, and leverages Large Language Models (LLMs) to produce joint embeddings of reference images and modification texts for deeper multimodal fusion. The paper additionally introduces MTCIR, a large-scale dataset of 3.4M samples, and refines the CIRR and Fashion-IQ benchmarks to enhance evaluation reliability. Experiments show state-of-the-art performance across multiple CIR benchmarks and settings, with MTCIR yielding up to 15% improvement and the refined benchmarks providing more reliable evaluation metrics.

Link: https://arxiv.org/abs/2503.19910
Authors: Chuong Huynh, Jinyu Yang, Ashish Tawari, Mubarak Shah, Son Tran, Raffay Hamid, Trishul Chilimbi, Abhinav Shrivastava
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: CVPR 2025. Project page: this https URL

Abstract:Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query. Typical training data consists of triplets containing a reference image, a textual description of desired modifications, and the target image, which are expensive and time-consuming to acquire. The scarcity of CIR datasets has led to zero-shot approaches utilizing synthetic triplets or leveraging vision-language models (VLMs) with ubiquitous web-crawled image-caption pairs. However, these methods have significant limitations: synthetic triplets suffer from limited scale, lack of diversity, and unnatural modification text, while image-caption pairs hinder joint embedding learning of the multimodal query due to the absence of triplet data. Moreover, existing approaches struggle with complex and nuanced modification texts that demand sophisticated fusion and understanding of vision and language modalities. We present CoLLM, a one-stop framework that effectively addresses these limitations. Our approach generates triplets on-the-fly from image-caption pairs, enabling supervised training without manual annotation. We leverage Large Language Models (LLMs) to generate joint embeddings of reference images and modification texts, facilitating deeper multimodal fusion. Additionally, we introduce Multi-Text CIR (MTCIR), a large-scale dataset comprising 3.4M samples, and refine existing CIR benchmarks (CIRR and Fashion-IQ) to enhance evaluation reliability. Experimental results demonstrate that CoLLM achieves state-of-the-art performance across multiple CIR benchmarks and settings. MTCIR yields competitive results, with up to 15% performance improvement. Our refined benchmarks provide more reliable evaluation metrics for CIR models, contributing to the advancement of this important field.

[CV-5] FullDiT: Multi-Task Video Generative Foundation Model with Full Attention

【Quick Read】: This paper addresses the limited controllability of existing video generative foundation models for fine-grained, multi-condition video creation. Adapter-based approaches (e.g., ControlNet) enable additional controls with minimal fine-tuning, but they run into branch conflicts between independently trained adapters, parameter redundancy that raises computational cost, and performance below full fine-tuning when multiple conditions are combined. The key is FullDiT, a unified foundation model for video generation that integrates multiple conditions seamlessly through unified full-attention mechanisms: by fusing multi-task conditions into a unified sequence representation and exploiting the long-context learning ability of full self-attention to capture condition dynamics, FullDiT reduces parameter overhead, avoids condition conflicts, and exhibits scalability and emergent ability. The paper further introduces FullBench for multi-task video generation evaluation; experiments show state-of-the-art results, highlighting the efficacy of full attention in complex multi-task video generation.

Link: https://arxiv.org/abs/2503.19907
Authors: Xuan Ju, Weicai Ye, Quande Liu, Qiulin Wang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Qiang Xu
Affiliations: Kuaishou Technology; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Current video generative foundation models primarily focus on text-to-video tasks, providing limited control for fine-grained video content creation. Although adapter-based approaches (e.g., ControlNet) enable additional controls with minimal fine-tuning, they encounter challenges when integrating multiple conditions, including: branch conflicts between independently trained adapters, parameter redundancy leading to increased computational cost, and suboptimal performance compared to full fine-tuning. To address these challenges, we introduce FullDiT, a unified foundation model for video generation that seamlessly integrates multiple conditions via unified full-attention mechanisms. By fusing multi-task conditions into a unified sequence representation and leveraging the long-context learning ability of full self-attention to capture condition dynamics, FullDiT reduces parameter overhead, avoids condition conflicts, and shows scalability and emergent ability. We further introduce FullBench for multi-task video generation evaluation. Experiments demonstrate that FullDiT achieves state-of-the-art results, highlighting the efficacy of full-attention in complex multi-task video generation.
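As a rough illustration of the unified-sequence idea described above, the sketch below concatenates several kinds of condition tokens with noisy video tokens and runs plain full self-attention over the joint sequence. All tensor shapes, module sizes, and token names are illustrative assumptions rather than FullDiT's actual configuration.

```python
# Minimal sketch: fuse all condition tokens and noisy video tokens into one
# sequence and process it with full self-attention, so no per-condition
# adapter branches are needed. Sizes are illustrative only.
import torch
import torch.nn as nn

dim = 256
block = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)

B = 2
video_tokens  = torch.randn(B, 1024, dim)  # noisy video latents
camera_tokens = torch.randn(B, 32, dim)    # camera-trajectory condition
id_tokens     = torch.randn(B, 16, dim)    # identity condition
depth_tokens  = torch.randn(B, 1024, dim)  # depth condition

# One unified sequence; full attention lets conditions interact with each
# other and with the video tokens in a single pass.
seq = torch.cat([camera_tokens, id_tokens, depth_tokens, video_tokens], dim=1)
out = block(seq)

# Only the video positions are read out as denoised latents.
denoised_video = out[:, -video_tokens.shape[1]:]
print(denoised_video.shape)  # torch.Size([2, 1024, 256])
```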

[CV-6] AvatarArtist: Open-Domain 4D Avatarization CVPR2025

【Quick Read】: This paper addresses the problem of creating open-domain 4D avatars from portrait images in arbitrary styles. To this end, it proposes a practical training paradigm that combines generative adversarial networks (GANs) and diffusion models, with parametric triplanes chosen as the intermediate 4D representation. The key is to use a 2D diffusion prior to strengthen the 4D GAN's ability to handle diverse data distributions; the synergy between the two is used to build a multi-domain image-triplane dataset, which in turn drives the development of a general 4D avatar generator. Experiments show that the resulting model, AvatarArtist, produces high-quality 4D avatars with strong robustness across source image domains.

Link: https://arxiv.org/abs/2503.19906
Authors: Hongyu Liu, Xuan Wang, Ziyu Wan, Yue Ma, Jingye Chen, Yanbo Fan, Yujun Shen, Yibing Song, Qifeng Chen
Affiliations: HKUST; Ant Group; City University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025

Abstract:This work focuses on open-domain 4D avatarization, with the purpose of creating a 4D avatar from a portrait image in an arbitrary style. We select parametric triplanes as the intermediate 4D representation and propose a practical training paradigm that takes advantage of both generative adversarial networks (GANs) and diffusion models. Our design stems from the observation that 4D GANs excel at bridging images and triplanes without supervision yet usually face challenges in handling diverse data distributions. A robust 2D diffusion prior emerges as the solution, assisting the GAN in transferring its expertise across various domains. The synergy between these experts permits the construction of a multi-domain image-triplane dataset, which drives the development of a general 4D avatar creator. Extensive experiments suggest that our model, AvatarArtist, is capable of producing high-quality 4D avatars with strong robustness to various source image domains. The code, the data, and the models will be made publicly available to facilitate future studies…

[CV-7] Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better CVPR2025

【Quick Read】: This paper tackles temporal consistency in video prediction: traditional components such as temporal attention and 3D convolution can struggle with significant object motion and fail to capture long-range temporal dependencies in dynamic scenes. The key to the solution is the Tracktention layer, a novel architectural component that explicitly integrates motion information through point tracks, i.e., sequences of corresponding points across frames. This design improves temporal alignment, handles complex object motion effectively, and keeps feature representations consistent over time. The Tracktention layer is computationally efficient, easy to integrate into existing models such as Vision Transformers, and can upgrade image-only models into state-of-the-art video models, in some cases even outperforming models designed natively for video prediction. Experiments show significantly improved temporal consistency in video depth prediction and video colorization.

Link: https://arxiv.org/abs/2503.19904
Authors: Zihang Lai, Andrea Vedaldi
Affiliations: Visual Geometry Group (VGG), University of Oxford
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: CVPR 2025. Project website: this http URL

Abstract:Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts. Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion and may not capture long-range temporal dependencies in dynamic scenes. To address this gap, we propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks, i.e., sequences of corresponding points across frames. By incorporating these motion cues, the Tracktention Layer enhances temporal alignment and effectively handles complex object motions, maintaining consistent feature representations over time. Our approach is computationally efficient and can be seamlessly integrated into existing models, such as Vision Transformers, with minimal modification. It can be used to upgrade image-only models to state-of-the-art video ones, sometimes outperforming models natively designed for video prediction. We demonstrate this on video depth prediction and video colorization, where models augmented with the Tracktention Layer exhibit significantly improved temporal consistency compared to baselines.

[CV-8] Scaling Vision Pre-Training to 4K Resolution CVPR2025

【Quick Read】: This paper addresses the limitation of vision pre-training in perceiving high-resolution visual details. Current contrastive vision pre-training is restricted to low resolutions (e.g., 378 x 378 pixels) by the quadratic cost of processing larger images. The paper proposes PS3, which scales CLIP-style vision pre-training to 4K resolution at near-constant cost. The key is to selectively process local regions and contrast them with local detailed captions, which dramatically reduces computational overhead while still enabling high-resolution representation learning. The pre-trained PS3 can both encode the global image at low resolution and selectively process local high-resolution regions conditioned on a text prompt, giving multi-modal large language models (MLLMs) more efficient high-resolution visual perception.

Link: https://arxiv.org/abs/2503.19903
Authors: Baifeng Shi, Boyi Li, Han Cai, Yao Lu, Sifei Liu, Marco Pavone, Jan Kautz, Song Han, Trevor Darrell, Pavlo Molchanov, Hongxu Yin
Affiliations: UC Berkeley; NVIDIA
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025. Project Page: this https URL

Abstract:High-resolution perception of visual details is crucial for daily tasks. Current vision pre-training, however, is still limited to low resolutions (e.g., 378 x 378 pixels) due to the quadratic cost of processing larger images. We introduce PS3 that scales CLIP-style vision pre-training to 4K resolution with a near-constant cost. Instead of contrastive learning on global image representation, PS3 is pre-trained by selectively processing local regions and contrasting them with local detailed captions, enabling high-resolution representation learning with greatly reduced computational overhead. The pre-trained PS3 is able to both encode the global image at low resolution and selectively process local high-resolution regions based on their saliency or relevance to a text prompt. When applying PS3 to multi-modal LLM (MLLM), the resulting model, named VILA-HD, significantly improves high-resolution visual perception compared to baselines without high-resolution vision pre-training such as AnyRes and S^2 while using up to 4.3x fewer tokens. PS3 also unlocks appealing scaling properties of VILA-HD, including scaling up resolution for free and scaling up test-time compute for better performance. Compared to the state of the art, VILA-HD outperforms previous MLLMs such as NVILA and Qwen2-VL across multiple benchmarks and achieves better efficiency than latest token pruning approaches. Finally, we find current benchmarks do not require 4K-resolution perception, which motivates us to propose 4KPro, a new benchmark of image QA at 4K resolution, on which VILA-HD outperforms all previous MLLMs, including a 14.5% improvement over GPT-4o, and a 3.2% improvement and 2.96x speedup over Qwen2-VL.

[CV-9] ICE: Intrinsic Concept Extraction from a Single Image via Diffusion Models CVPR2025

【Quick Read】: This paper addresses the inherent ambiguity that modern generative models, especially diffusion-based text-to-image (T2I) models, face when learning visual concepts accurately from a single image; existing methods lack a systematic way to reliably extract interpretable underlying intrinsic concepts. The key to the solution is ICE (Intrinsic Concept Extraction), a framework that uses only a T2I model to automatically and systematically extract intrinsic concepts from a single image. ICE has two pivotal stages: the first uses an automatic concept localization module to pinpoint relevant text-based concepts and their corresponding masks in the image, streamlining concept initialization and providing precise guidance for subsequent analysis; the second delves into each identified mask and decomposes object-level concepts into intrinsic and general concepts, enabling a more granular and interpretable breakdown of visual elements. The core contribution is thus the ability to extract intrinsic concepts from a single image efficiently and in an unsupervised manner.

Link: https://arxiv.org/abs/2503.19902
Authors: Fernando Julio Cendra, Kai Han
Affiliations: Visual AI Lab, The University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025, Project page: this https URL

Abstract:The inherent ambiguity in defining visual concepts poses significant challenges for modern generative models, such as the diffusion-based Text-to-Image (T2I) models, in accurately learning concepts from a single image. Existing methods lack a systematic way to reliably extract the interpretable underlying intrinsic concepts. To address this challenge, we present ICE, short for Intrinsic Concept Extraction, a novel framework that exclusively utilizes a T2I model to automatically and systematically extract intrinsic concepts from a single image. ICE consists of two pivotal stages. In the first stage, ICE devises an automatic concept localization module to pinpoint relevant text-based concepts and their corresponding masks within the image. This critical stage streamlines concept initialization and provides precise guidance for subsequent analysis. The second stage delves deeper into each identified mask, decomposing the object-level concepts into intrinsic concepts and general concepts. This decomposition allows for a more granular and interpretable breakdown of visual elements. Our framework demonstrates superior performance on intrinsic concept extraction from a single image in an unsupervised manner. Project page: this https URL

[CV-10] TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization CVPR2025

【Quick Read】: This paper addresses the fact that current Human-Scene Interaction (HSI) synthesis methods mostly develop separate, task-specific controllers, which limits their ability to handle complex interaction tasks that require integrating multiple skills, such as sitting down while carrying an object. The key to the solution is TokenHSI, a single unified transformer policy that models the humanoid's proprioception as a separate shared token and combines it with distinct task tokens via a masking mechanism, enabling knowledge sharing across skills and flexible adaptation, while support for variable-length inputs improves generalization to new scenarios. By training additional task tokenizers, the method can modify the geometry of interaction targets and coordinate multiple skills to solve complex tasks. Experiments show substantial gains in versatility, adaptability, and extensibility across diverse HSI tasks.

Link: https://arxiv.org/abs/2503.19901
Authors: Liang Pan, Zeshi Yang, Zhiyang Dou, Wenjia Wang, Buzhen Huang, Bo Dai, Taku Komura, Jingbo Wang
Affiliations: Shanghai AI Laboratory; The University of Hong Kong; Southeast University; Feeling AI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025

Abstract:Synthesizing diverse and physically plausible Human-Scene Interactions (HSI) is pivotal for both computer animation and embodied AI. Despite encouraging progress, current methods mainly focus on developing separate controllers, each specialized for a specific interaction task. This significantly hinders the ability to tackle a wide variety of challenging HSI tasks that require the integration of multiple skills, e.g., sitting down while carrying an object. To address this issue, we present TokenHSI, a single, unified transformer-based policy capable of multi-skill unification and flexible adaptation. The key insight is to model the humanoid proprioception as a separate shared token and combine it with distinct task tokens via a masking mechanism. Such a unified policy enables effective knowledge sharing across skills, thereby facilitating the multi-task training. Moreover, our policy architecture supports variable length inputs, enabling flexible adaptation of learned skills to new scenarios. By training additional task tokenizers, we can not only modify the geometries of interaction targets but also coordinate multiple skills to address complex tasks. The experiments demonstrate that our approach can significantly improve versatility, adaptability, and extensibility in various HSI tasks. Website: this https URL

[CV-11] Scaling Down Text Encoders of Text-to-Image Diffusion Models CVPR2025

【Quick Read】: This paper examines the parameter redundancy of large text encoders (such as T5-XXL) in diffusion models: although these encoders markedly improve the understanding of complex prompts and the quality of generated text, their sheer size imposes unnecessary computational burden. Moreover, even though the T5 series is trained on the C4 corpus, which contains a large amount of non-visual data, diffusion models with a T5 encoder do not respond to non-visual prompts, indicating redundant representational power. To answer the question of whether such a large text encoder is really needed, the paper trains a series of T5 encoders via vision-based knowledge distillation, on a dataset constructed to satisfy three criteria (image quality, semantic understanding, and text rendering) so that the students fully inherit the teacher's capabilities. The key result is a distilled model the size of T5-base that generates images of quality comparable to T5-XXL while being 50 times smaller, greatly lowering the GPU requirements of running high-end models such as FLUX and SD3 and making high-quality text-to-image generation more accessible.

Link: https://arxiv.org/abs/2503.19897
Authors: Lifu Wang, Daqing Liu, Xinchen Liu, Xiaodong He
Affiliations: JD Explore Academy, JD.com Inc.; Georgia Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by CVPR 2025

Abstract:Text encoders in diffusion models have rapidly evolved, transitioning from CLIP to T5-XXL. Although this evolution has significantly enhanced the models’ ability to understand complex prompts and generate text, it also leads to a substantial increase in the number of parameters. Despite T5 series encoders being trained on the C4 natural language corpus, which includes a significant amount of non-visual data, diffusion models with T5 encoder do not respond to those non-visual prompts, indicating redundancy in representational power. Therefore, it raises an important question: “Do we really need such a large text encoder?” In pursuit of an answer, we employ vision-based knowledge distillation to train a series of T5 encoder models. To fully inherit its capabilities, we constructed our dataset based on three criteria: image quality, semantic understanding, and text-rendering. Our results demonstrate the scaling down pattern that the distilled T5-base model can generate images of comparable quality to those produced by T5-XXL, while being 50 times smaller in size. This reduction in model size significantly lowers the GPU requirements for running state-of-the-art models such as FLUX and SD3, making high-quality text-to-image generation more accessible.
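To make the distillation recipe more concrete, here is a minimal sketch of a feature-distillation step in which a small student text encoder is trained to match a frozen teacher's prompt embeddings. The projection layer, MSE objective, and all dimensions are assumptions for illustration; the paper's actual vision-based distillation objective is not spelled out in the abstract.

```python
# Sketch: train a small student encoder so its (projected) prompt embeddings
# match a frozen T5-XXL-sized teacher. All sizes and the loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher_dim, student_dim, seq_len, batch = 4096, 768, 77, 8

student = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=student_dim, nhead=8, batch_first=True),
    num_layers=2,
)
proj = nn.Linear(student_dim, teacher_dim)  # map student width to teacher width
opt = torch.optim.AdamW(list(student.parameters()) + list(proj.parameters()), lr=1e-4)

# Stand-ins for tokenized prompt features and frozen teacher outputs.
student_in = torch.randn(batch, seq_len, student_dim)
with torch.no_grad():
    teacher_out = torch.randn(batch, seq_len, teacher_dim)  # teacher features

student_out = proj(student(student_in))
loss = F.mse_loss(student_out, teacher_out)  # pull student toward teacher
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```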

[CV-12] Visuo-Tactile Object Pose Estimation for a Multi-Finger Robot Hand with Low-Resolution In-Hand Tactile Sensing ICRA

【Quick Read】: This paper addresses the perception challenge of estimating the 3D pose of a grasped object under heavy occlusion: because the robot's own hand occludes the object, purely visual methods struggle to estimate the pose of held objects accurately. The key is to combine visual information and proprioception with binary, low-resolution tactile contact measurements distributed across the interior surfaces of an articulated robotic hand. The visuo-tactile pose estimation problem is formulated probabilistically in a factor graph, and the object pose is optimized with a robust cost function to reduce the influence of visual or tactile outlier readings. Experiments show that fusing these modalities substantially improves pose accuracy under high occlusion and high visual noise.

Link: https://arxiv.org/abs/2503.19893
Authors: Lukas Mack, Felix Grüninger, Benjamin A. Richardson, Regine Lendway, Katherine J. Kuchenbecker, Joerg Stueckler
Affiliations: University of Augsburg, Germany; Embodied Vision Group, Max Planck Institute for Intelligent Systems; Robotics ZWE, Max Planck Institute for Intelligent Systems; Haptic Intelligence Department, Max Planck Institute for Intelligent Systems
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at the IEEE International Conference on Robotics and Automation (ICRA), 2025

Abstract:Accurate 3D pose estimation of grasped objects is an important prerequisite for robots to perform assembly or in-hand manipulation tasks, but object occlusion by the robot’s own hand greatly increases the difficulty of this perceptual task. Here, we propose that combining visual information and proprioception with binary, low-resolution tactile contact measurements from across the interior surface of an articulated robotic hand can mitigate this issue. The visuo-tactile object-pose-estimation problem is formulated probabilistically in a factor graph. The pose of the object is optimized to align with the three kinds of measurements using a robust cost function to reduce the influence of visual or tactile outlier readings. The advantages of the proposed approach are first demonstrated in simulation: a custom 15-DoF robot hand with one binary tactile sensor per link grasps 17 YCB objects while observed by an RGB-D camera. This low-resolution in-hand tactile sensing significantly improves object-pose estimates under high occlusion and also high visual noise. We also show these benefits through grasping tests with a preliminary real version of our tactile hand, obtaining reasonable visuo-tactile estimates of object pose at approximately 13.3 Hz on average.
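The robust-cost idea can be illustrated with a toy alignment problem: a 2D pose is fit to point measurements containing gross outliers, with a Huber loss playing the role of the paper's robust cost function. The full factor-graph formulation over visual, proprioceptive, and tactile factors is reduced here to a single least-squares problem; all numbers are synthetic.

```python
# Toy robust pose fit: recover a rotation + translation despite outliers,
# using scipy's built-in Huber loss to down-weight bad measurements.
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
model_pts = rng.normal(size=(40, 2))           # object points in object frame
theta_true, t_true = 0.4, np.array([0.3, -0.2])
R = lambda a: np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
meas = model_pts @ R(theta_true).T + t_true + 0.01 * rng.normal(size=(40, 2))
meas[:5] += 1.0                                # a few gross outlier readings

def residuals(x):
    a, tx, ty = x
    pred = model_pts @ R(a).T + np.array([tx, ty])
    return (pred - meas).ravel()

fit = least_squares(residuals, x0=np.zeros(3), loss="huber", f_scale=0.05)
print(fit.x)  # close to [0.4, 0.3, -0.2] despite the outliers
```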

[CV-13] Mask2DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation CVPR2025

【Quick Read】: This paper targets multi-scene video generation, a task with broader applications but less exploration than single-scene generation. The key to the solution is Mask^2DiT, which inserts a symmetric binary mask at every attention layer of the Diffusion Transformer (DiT) architecture, establishing fine-grained one-to-one alignment between video segments and their text annotations while preserving temporal coherence across visual tokens. A segment-level conditional mask further conditions each newly generated segment on the preceding ones, enabling the DiT architecture to extend scenes auto-regressively. Qualitative and quantitative experiments show that Mask^2DiT maintains visual consistency across segments while keeping each segment semantically aligned with its text description.

Link: https://arxiv.org/abs/2503.19881
Authors: Tianhao Qi, Jianlong Yuan, Wanquan Feng, Shancheng Fang, Jiawei Liu, SiYu Zhou, Qian He, Hongtao Xie, Yongdong Zhang
Affiliations: University of Science and Technology of China; Bytedance Intelligent Creation; Yuanshi Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025

Abstract:Sora has unveiled the immense potential of the Diffusion Transformer (DiT) architecture in single-scene video generation. However, the more challenging task of multi-scene video generation, which offers broader applications, remains relatively underexplored. To bridge this gap, we propose Mask^2DiT, a novel approach that establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations. Specifically, we introduce a symmetric binary mask at each attention layer within the DiT architecture, ensuring that each text annotation applies exclusively to its respective video segment while preserving temporal coherence across visual tokens. This attention mechanism enables precise segment-level textual-to-visual alignment, allowing the DiT architecture to effectively handle video generation tasks with a fixed number of scenes. To further equip the DiT architecture with the ability to generate additional scenes based on existing ones, we incorporate a segment-level conditional mask, which conditions each newly generated segment on the preceding video segments, thereby enabling auto-regressive scene extension. Both qualitative and quantitative experiments confirm that Mask^2DiT excels in maintaining visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description. Our project page is this https URL.
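A minimal sketch of how a symmetric binary attention mask of the kind described above could be constructed: each text annotation attends only to its own segment's visual tokens (and vice versa), while visual tokens attend to all visual tokens to preserve temporal coherence. Token counts and the sequence layout are illustrative assumptions, not the paper's settings.

```python
# Build a symmetric boolean attention mask for n_scenes (text, video) pairs.
# Layout of the token sequence: [text_0 .. text_{n-1} | vis_0 .. vis_{n-1}].
import torch

n_scenes, text_len, vis_len = 3, 4, 8
T = n_scenes * text_len            # total text tokens
V = n_scenes * vis_len             # total visual tokens
N = T + V

mask = torch.zeros(N, N, dtype=torch.bool)   # True = attention allowed
mask[T:, T:] = True                          # visual <-> visual: always on
for s in range(n_scenes):
    t0, t1 = s * text_len, (s + 1) * text_len
    v0, v1 = T + s * vis_len, T + (s + 1) * vis_len
    mask[t0:t1, t0:t1] = True                # text attends within itself
    mask[t0:t1, v0:v1] = True                # text_s -> vis_s only
    mask[v0:v1, t0:t1] = True                # vis_s -> text_s (symmetric)

assert torch.equal(mask, mask.T)             # symmetry check
```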

[CV-14] GENIUS: A Generative Framework for Universal Multimodal Search CVPR2025

【Quick Read】: This paper aims to close the performance gap between generative retrieval and embedding-based retrieval and to move beyond task-specific designs toward universality across modalities and domains. The key innovation of the proposed GENIUS framework is modality-decoupled semantic quantization, which converts multimodal data into discrete identifiers (IDs) that encode both modality and semantics. To improve generalization, GENIUS further introduces a query-augmentation strategy that interpolates between queries and their targets so the model adapts to diverse query forms. These innovations let GENIUS clearly surpass prior generative retrieval methods on the M-BEIR benchmark while keeping retrieval speed high and consistent across database sizes; with additional re-ranking, its results approach those of embedding-based methods in some settings while retaining the efficiency advantage.

Link: https://arxiv.org/abs/2503.19868
Authors: Sungyeon Kim, Xinliang Zhu, Xiaofan Lin, Muhammet Bastan, Douglas Gray, Suha Kwak
Affiliations: Amazon; POSTECH
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to CVPR 2025

Abstract:Generative retrieval is an emerging approach in information retrieval that generates identifiers (IDs) of target data based on a query, providing an efficient alternative to traditional embedding-based retrieval methods. However, existing models are task-specific and fall short of embedding-based retrieval in performance. This paper proposes GENIUS, a universal generative retrieval framework supporting diverse tasks across multiple modalities and domains. At its core, GENIUS introduces modality-decoupled semantic quantization, transforming multimodal data into discrete IDs encoding both modality and semantics. Moreover, to enhance generalization, we propose a query augmentation that interpolates between a query and its target, allowing GENIUS to adapt to varied query forms. Evaluated on the M-BEIR benchmark, it surpasses prior generative methods by a clear margin. Unlike embedding-based retrieval, GENIUS consistently maintains high retrieval speed across database size, with competitive performance across multiple benchmarks. With additional re-ranking, GENIUS often achieves results close to those of embedding-based methods while preserving efficiency.
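A small sketch of the query-augmentation idea: each training query embedding is interpolated toward its target embedding so the retriever sees a smoother range of query forms. The mixing distribution and normalization below are assumptions for illustration.

```python
# Interpolate queries toward their targets to diversify training queries.
import torch
import torch.nn.functional as F

def augment_queries(q, tgt, alpha_max=0.5):
    """Mix each query embedding with its target embedding."""
    alpha = torch.rand(q.shape[0], 1) * alpha_max   # per-sample mixing weight
    mixed = (1 - alpha) * q + alpha * tgt
    return F.normalize(mixed, dim=-1)

q = F.normalize(torch.randn(16, 512), dim=-1)    # query embeddings
tgt = F.normalize(torch.randn(16, 512), dim=-1)  # matching target embeddings
print(augment_queries(q, tgt).shape)             # torch.Size([16, 512])
```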

[CV-15] Towards Online Multi-Modal Social Interaction Understanding

【Quick Read】: This paper addresses the real-time feedback requirement of multimodal social interaction understanding (MMSI) in practical human-robot interaction systems: conventional models depend on both past and future context and therefore transfer poorly to real-world use. To bridge this gap, the paper proposes an online MMSI setting and a framework called Online-MMSI-VLM. The key is the combination of two complementary strategies: multi-party conversation forecasting, which simulates potential future utterances in a coarse-to-fine manner to enrich the linguistic context, and social-aware visual prompting, which uses multi-modal large language models to highlight social dynamics in video via bounding boxes and per-person, per-frame body keypoints, emphasizing visual social cues such as gaze and gesture. Experiments on three tasks and two datasets show state-of-the-art performance, significantly above the baseline models.

Link: https://arxiv.org/abs/2503.19851
Authors: Xinpeng Li, Shijian Deng, Bolin Lai, Weiguo Pian, James M. Rehg, Yapeng Tian
Affiliations: The University of Texas at Dallas; Georgia Institute of Technology; University of Illinois Urbana-Champaign
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal social interaction understanding (MMSI) is critical in human-robot interaction systems. In real-world scenarios, AI agents are required to provide real-time feedback. However, existing models often depend on both past and future contexts, which hinders them from applying to real-world problems. To bridge this gap, we propose an online MMSI setting, where the model must resolve MMSI tasks using only historical information, such as recorded dialogues and video streams. To address the challenges of missing the useful future context, we develop a novel framework, named Online-MMSI-VLM, that leverages two complementary strategies: multi-party conversation forecasting and social-aware visual prompting with multi-modal large language models. First, to enrich linguistic context, the multi-party conversation forecasting simulates potential future utterances in a coarse-to-fine manner, anticipating upcoming speaker turns and then generating fine-grained conversational details. Second, to effectively incorporate visual social cues like gaze and gesture, social-aware visual prompting highlights the social dynamics in video with bounding boxes and body keypoints for each person and frame. Extensive experiments on three tasks and two datasets demonstrate that our method achieves state-of-the-art performance and significantly outperforms baseline models, indicating its effectiveness on Online-MMSI. The code and pre-trained models will be publicly released at: this https URL.

[CV-16] FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs

【Quick Read】: This paper tackles information retrieval in long videos (up to an hour), where even state-of-the-art vision-language models (VLMs) struggle to pinpoint the small set of frames that contains the answer. The key innovations of the proposed FALCONEye video agent are: 1) a meta-architecture designed for long videos, better suited than existing short-video approaches; 2) an efficient exploration algorithm that locates information using short clips, captions, and answer confidence; and 3) a calibration analysis of state-of-the-art VLMs for answer confidence. FALCONEye is built on a small VLM and a medium-sized LLM and runs on standard computational resources. The paper also releases FALCON-Bench, a benchmark for long-video answer search that emphasizes open-ended question evaluation. Experiments show FALCONEye outperforms the state of the art on FALCON-Bench and performs similarly or better on related benchmarks.

Link: https://arxiv.org/abs/2503.19850
Authors: Carlos Plou, Cesar Borja, Ruben Martinez-Cantin, Ana C. Murillo
Affiliations: DIIS-I3A, University of Zaragoza
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Information retrieval in hour-long videos presents a significant challenge, even for state-of-the-art Vision-Language Models (VLMs), particularly when the desired information is localized within a small subset of frames. Long video data presents challenges for VLMs due to context window limitations and the difficulty of pinpointing frames containing the answer. Our novel video agent, FALCONEye, combines a VLM and a Large Language Model (LLM) to search for relevant information along the video and locate the frames with the answer. FALCONEye's novelty relies on 1) the proposed meta-architecture, which is better suited to tackle hour-long videos than the short-video approaches in the state of the art; 2) a new efficient exploration algorithm to locate the information using short clips, captions, and answer confidence; and 3) our calibration analysis of state-of-the-art VLMs for answer confidence. Our agent is built over a small-size VLM and a medium-size LLM, making it accessible to run on standard computational resources. We also release FALCON-Bench, a benchmark to evaluate long (average 1 hour) Video Answer Search challenges, highlighting the need for open-ended question evaluation. Our experiments show FALCONEye's superior performance over the state of the art on FALCON-Bench, and similar or better performance on related benchmarks.

[CV-17] Attention IoU: Examining Biases in CelebA using Attention Maps CVPR2025

【Quick Read】: This paper addresses the fact that computer vision models exhibit and amplify biases across many datasets and tasks, while existing measures focus on data distributions and subgroup performance and overlook the model's internal workings. The key contribution is the Attention-IoU (Attention Intersection over Union) metric and related scores, which use attention maps to reveal biases in a model's internal representations and to identify the image features that may cause them. The paper validates Attention-IoU on the synthetic Waterbirds dataset, then analyzes CelebA, where the metric uncovers correlations beyond accuracy disparities; an investigation of individual attributes through the protected attribute Male shows the distinct ways biases manifest in CelebA. Finally, by subsampling the training set to change attribute correlations, the authors show that Attention-IoU reveals potential confounding variables not present in the dataset labels.

Link: https://arxiv.org/abs/2503.19846
Authors: Aaron Serianni, Tyler Zhu, Vikram V. Ramaswamy, Olga Russakovsky
Affiliations: Princeton University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: To appear in CVPR 2025. Code and data is available at this https URL. 15 pages, 14 figures, including appendix

Abstract:Computer vision models have been shown to exhibit and amplify biases across a wide array of datasets and tasks. Existing methods for quantifying bias in classification models primarily focus on dataset distribution and model performance on subgroups, overlooking the internal workings of a model. We introduce the Attention-IoU (Attention Intersection over Union) metric and related scores, which use attention maps to reveal biases within a model’s internal representations and identify image features potentially causing the biases. First, we validate Attention-IoU on the synthetic Waterbirds dataset, showing that the metric accurately measures model bias. We then analyze the CelebA dataset, finding that Attention-IoU uncovers correlations beyond accuracy disparities. Through an investigation of individual attributes through the protected attribute of Male, we examine the distinct ways biases are represented in CelebA. Lastly, by subsampling the training set to change attribute correlations, we demonstrate that Attention-IoU reveals potential confounding variables not present in dataset labels.
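The core computation is easy to prototype: given two non-negative heatmaps, a soft intersection-over-union measures how much they focus on the same image regions. The sketch below uses a generic soft-IoU form; the paper's exact normalization may differ.

```python
# Soft IoU between two heatmaps, e.g., an attribute's attention map and a
# reference map for a protected attribute. Both maps are normalized to sum
# to one before computing elementwise min/max.
import torch

def soft_iou(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    a = a / (a.sum() + eps)
    b = b / (b.sum() + eps)
    inter = torch.minimum(a, b).sum()        # soft intersection
    union = torch.maximum(a, b).sum()        # soft union
    return inter / (union + eps)

attn = torch.rand(14, 14)      # attention heatmap for one attribute
mask = torch.rand(14, 14)      # heatmap for the protected attribute
print(float(soft_iou(attn, mask)))  # 1.0 would mean identical focus
```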

[CV-18] FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model CVPR2025

【Quick Read】: This paper addresses three major challenges of instruction-based image editing: complex scenarios, semantic consistency, and fine-grained edits. The key to the proposed FireEdit framework is a REgion-aware Vision Language Model: additional region tokens strengthen the VLM's fine-grained visual perception, while a Time-Aware Target Injection module and a Hybrid Visual Cross-Attention module respectively adjust the guidance strength dynamically across denoising stages of the diffusion model (by combining timestep embeddings with text embeddings) and enhance visual details, preserving semantic consistency between the edited result and the source image. Experiments show clear advantages in understanding editing instructions and maintaining high semantic consistency, surpassing state-of-the-art instruction-based image editing methods.

Link: https://arxiv.org/abs/2503.19839
Authors: Jun Zhou, Jiahao Li, Zunnan Xu, Hanhui Li, Yiji Cheng, Fa-Ting Hong, Qin Lin, Qinglin Lu, Xiaodan Liang
Affiliations: Shenzhen Campus of Sun Yat-sen University; Hunyuan, Tencent; Tsinghua University; HKUST
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025

Abstract:Currently, instruction-based image editing methods have made significant progress by leveraging the powerful cross-modal understanding capabilities of vision language models (VLMs). However, they still face challenges in three key areas: 1) complex scenarios; 2) semantic consistency; and 3) fine-grained editing. To address these issues, we propose FireEdit, an innovative Fine-grained Instruction-based image editing framework that exploits a REgion-aware VLM. FireEdit is designed to accurately comprehend user instructions and ensure effective control over the editing process. Specifically, we enhance the fine-grained visual perception capabilities of the VLM by introducing additional region tokens. Relying solely on the output of the LLM to guide the diffusion model may lead to suboptimal editing results. Therefore, we propose a Time-Aware Target Injection module and a Hybrid Visual Cross Attention module. The former dynamically adjusts the guidance strength at various denoising stages by integrating timestep embeddings with the text embeddings. The latter enhances visual details for image editing, thereby preserving semantic consistency between the edited result and the source image. By combining the VLM enhanced with fine-grained region tokens and the time-dependent diffusion model, FireEdit demonstrates significant advantages in comprehending editing instructions and maintaining high semantic consistency. Extensive experiments indicate that our approach surpasses the state-of-the-art instruction-based image editing methods. Our project is available at this https URL.

[CV-19] AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers CVPR

【Quick Read】: This paper asks how to generate high-fidelity, temporally coherent whole-body human videos from audio, with accurate lip-sync and naturally expressed co-speech gestures of the hands and face; most existing methods only drive facial motion, yielding incoherent head and body dynamics. The key is the AudCast framework with a cascaded Diffusion-Transformers (DiTs) paradigm: an audio-conditioned holistic human DiT first directly drives the motion of an arbitrary human body and produces vivid gesture dynamics, and a regional refinement DiT then uses regional 3D fitting as a bridge to reform the signals and enhance the hand and face details that are hard to handle, producing the final result.

Link: https://arxiv.org/abs/2503.19824
Authors: Jiazhi Guan, Kaisiyuan Wang, Zhiliang Xu, Quanwei Yang, Yasheng Sun, Shengyi He, Borong Liang, Yukang Cao, Yingying Li, Haocheng Feng, Errui Ding, Jingdong Wang, Youjian Zhao, Hang Zhou, Ziwei Liu
Affiliations: DCST, Tsinghua University; Baidu Inc.; S-Lab, Nanyang Technological University; Zhongguancun Laboratory; University of Science and Technology of China; KAUST
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted to IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. Project page: this https URL

Abstract:Despite the recent progress of audio-driven video generation, existing methods mostly focus on driving facial movements, leading to non-coherent head and body dynamics. Moving forward, it is desirable yet challenging to generate holistic human videos with both accurate lip-sync and delicate co-speech gestures with respect to the given audio. In this work, we propose AudCast, a generalized audio-driven human video generation framework adopting a cascade Diffusion-Transformers (DiTs) paradigm, which synthesizes holistic human videos based on a reference image and a given audio. 1) Firstly, an audio-conditioned Holistic Human DiT architecture is proposed to directly drive the movements of any human body with vivid gesture dynamics. 2) Then to enhance hand and face details that are well known to be difficult to handle, a Regional Refinement DiT leverages regional 3D fitting as the bridge to reform the signals, producing the final results. Extensive experiments demonstrate that our framework generates high-fidelity audio-driven holistic human videos with temporal coherence and fine facial and hand details. Resources can be found at this https URL.

[CV-20] Domain-incremental White Blood Cell Classification with Privacy-aware Continual Learning

【Quick Read】: This paper addresses domain shift in white blood cell (WBC) classification in dynamic clinical environments, caused by varying sample sources (e.g., blood or bone marrow) and differing imaging conditions across hospitals. Conventional deep learning models suffer catastrophic forgetting in such settings, and foundation models, although generally robust, degrade when the inference distribution differs from the training distribution. The key is a generative-replay-based continual learning (CL) strategy: lightweight generators mimic past data through synthetic latent representations, enabling privacy-preserving replay that prevents forgetting in foundation models for WBC classification. Experiments show the approach effectively mitigates catastrophic forgetting and preserves model performance across shifting data distributions.

Link: https://arxiv.org/abs/2503.19819
Authors: Pratibha Kumari, Afshin Bozorgpour, Daniel Reisenbüchler, Edgar Jost, Martina Crysandt, Christian Matek, Dorit Merhof
Affiliations: University of Regensburg, 93053, Germany; University Hospital RWTH Aachen, Aachen, Germany; University Hospital Erlangen, Erlangen, Germany; Fraunhofer Institute for Digital Medicine MEVIS, Bremen, Germany
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:White blood cell (WBC) classification plays a vital role in hematology for diagnosing various medical conditions. However, it faces significant challenges due to domain shifts caused by variations in sample sources (e.g., blood or bone marrow) and differing imaging conditions across hospitals. Traditional deep learning models often suffer from catastrophic forgetting in such dynamic environments, while foundation models, though generally robust, experience performance degradation when the distribution of inference data differs from that of the training data. To address these challenges, we propose a generative replay-based Continual Learning (CL) strategy designed to prevent forgetting in foundation models for WBC classification. Our method employs lightweight generators to mimic past data with a synthetic latent representation to enable privacy-preserving replay. To showcase the effectiveness, we carry out extensive experiments with a total of four datasets with different task ordering and four backbone models including ResNet50, RetCCL, CTransPath, and UNI. Experimental results demonstrate that conventional fine-tuning methods degrade performance on previously learned tasks and struggle with domain shifts. In contrast, our continual learning strategy effectively mitigates catastrophic forgetting, preserving model performance across varying domains. This work presents a practical solution for maintaining reliable WBC classification in real-world clinical settings, where data distributions frequently evolve.
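A schematic of the generative-replay loop described above, assuming features from a frozen backbone: synthetic latent features sampled from a lightweight generator trained on an earlier domain are mixed with real current-domain features when updating the classifier head, so earlier domains are rehearsed without storing patient data. The generator interface, label handling, and all sizes are simplifying assumptions.

```python
# Generative-replay step: mix real current-domain features with synthetic
# latents replayed from a generator of a previous domain.
import torch
import torch.nn as nn

feat_dim, n_classes = 512, 10
head = nn.Linear(feat_dim, n_classes)        # classifier over backbone features
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

class LatentGenerator(nn.Module):            # stand-in replay generator
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(64, feat_dim)
    def sample(self, n):
        return self.net(torch.randn(n, 64))

old_gen = LatentGenerator()                  # assumed trained on an old domain

real_x = torch.randn(32, feat_dim)           # current-domain features
real_y = torch.randint(0, n_classes, (32,))
with torch.no_grad():
    replay_x = old_gen.sample(32)            # privacy-preserving replay
# In practice the generator is class-conditional; random labels are a stand-in.
replay_y = torch.randint(0, n_classes, (32,))

loss = ce(head(torch.cat([real_x, replay_x])), torch.cat([real_y, replay_y]))
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```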

[CV-21] LENVIZ: A High-Resolution Low-Exposure Night Vision Benchmark Dataset

【Quick Read】: This paper targets low-light image enhancement, a critical problem because images captured in low illumination typically suffer from severe noise, poor contrast, and missing detail, making quality improvement especially difficult. The key contribution is the Low Exposure Night Vision (LENVIZ) dataset, a comprehensive multi-exposure benchmark with over 230K frames covering 24K real-world indoor and outdoor scenes, together with high-quality human-generated ground truth: each multi-exposure low-light scene was carefully curated and edited by expert photographers to ensure optimal image quality. The paper also analyzes current state-of-the-art low-light enhancement techniques on the dataset and highlights potential areas of improvement.

Link: https://arxiv.org/abs/2503.19804
Authors: Manjushree Aithal, Rosaura G. VidalMata, Manikandtan Kartha, Gong Chen, Eashan Adhikarla, Lucas N. Kirsten, Zhicheng Fu, Nikhil A. Madhusudhana, Joe Nasti
Affiliations: Lenovo Research (Chicago, IL); Lehigh University (Bethlehem, PA); Motorola Mobility, Comercio de Produtos Eletronicos Ltda (Jaguariuna)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Dataset will be released upon publication

Abstract:Low-light image enhancement is crucial for a myriad of applications, from night vision and surveillance to autonomous driving. However, due to the inherent limitations that come hand in hand with capturing images in low-illumination environments, the task of enhancing such scenes still presents a formidable challenge. To advance research in this field, we introduce our Low Exposure Night Vision (LENVIZ) Dataset, a comprehensive multi-exposure benchmark dataset for low-light image enhancement comprising over 230K frames that showcase 24K real-world indoor and outdoor scenes, with and without humans. Captured using 3 different camera sensors, LENVIZ offers a wide range of lighting conditions, noise levels, and scene complexities, making it the largest publicly available up-to-4K-resolution benchmark in the field. LENVIZ includes high-quality human-generated ground truth, for which each multi-exposure low-light scene has been meticulously curated and edited by expert photographers to ensure optimal image quality. Furthermore, we also conduct a comprehensive analysis of current state-of-the-art low-light image enhancement techniques on our dataset and highlight potential areas of improvement.

[CV-22] SeLIP: Similarity Enhanced Contrastive Language Image Pretraining for Multi-modal Head MRI

【Quick Read】: This paper addresses the limitation that medical deep learning (DL) models are held back in practice by the scarcity of manually annotated samples. Noting that clinical radiology examinations come with reports that describe the images, the authors develop a multi-modal foundation model for head MRI via contrastive learning on the images and the corresponding radiology findings. The key is a contrastive learning framework that integrates a mixed syntactic and semantic similarity matching metric, reducing the need for the extremely large datasets required by conventional contrastive frameworks. The proposed Similarity Enhanced Contrastive Language Image Pretraining (SeLIP) extracts more useful features and performs well on downstream tasks including image-text retrieval, classification, and image segmentation, underscoring the importance of accounting for similarities among the texts describing different images when building medical image foundation models.

Link: https://arxiv.org/abs/2503.19801
Authors: Zhiyang Liu, Dong Yang, Minghao Zhang, Hanyu Sun, Hong Wu, Huiying Wang, Wen Shen, Chao Chai, Shuang Xia
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Although deep learning (DL) methods have shown tremendous potential in many medical image analysis tasks, practical applications of medical DL models are limited by the lack of sufficient data samples with manual annotations. Noting that clinical radiology examinations are associated with radiology reports that describe the images, we propose to develop a foundation model for multi-modal head MRI by using contrastive learning on the images and the corresponding radiology findings. In particular, a contrastive learning framework is proposed in which a mixed syntactic and semantic similarity matching metric is integrated to reduce the need for the extremely large datasets required by conventional contrastive learning frameworks. Our proposed similarity enhanced contrastive language image pretraining (SeLIP) is able to effectively extract more useful features. Experiments revealed that SeLIP performs well on many downstream tasks, including image-text retrieval, classification, and image segmentation, which highlights the importance of considering the similarities among texts describing different images when developing medical image foundation models.
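One way to realize the similarity-matching idea is to soften the one-hot targets of a CLIP-style contrastive loss with a text-text similarity matrix, so reports that describe similar findings are not treated as pure negatives. The sketch below uses an assumed mixing weight and cosine similarity; the paper's mixed syntactic-semantic metric is more elaborate.

```python
# CLIP-style contrastive loss with similarity-softened targets.
import torch
import torch.nn.functional as F

B, d = 8, 256
img = F.normalize(torch.randn(B, d), dim=-1)      # image embeddings
txt = F.normalize(torch.randn(B, d), dim=-1)      # report embeddings
logits = img @ txt.T / 0.07                       # image-to-text logits

with torch.no_grad():
    sim = F.softmax((txt @ txt.T) / 0.1, dim=-1)  # text-text similarity
targets = 0.8 * torch.eye(B) + 0.2 * sim          # soften the one-hot targets

# Symmetric loss over both retrieval directions; soft targets are supported
# by F.cross_entropy when given class probabilities.
loss = F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets.T)
print(float(loss))
```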

[CV-23] Unpaired Object-Level SAR-to-Optical Image Translation for Aircraft with Keypoints-Guided Diffusion Models

【Quick Read】: This paper addresses object-level unpaired SAR-to-optical image translation, where research has been limited by the scarcity of paired data and the difficulty of accurately preserving contour and texture details for complex targets. The key innovation is a keypoint-guided diffusion model (KeypointDiff) that introduces supervision on target class and azimuth angle via keypoints, together with a training strategy for unpaired data; built on the classifier-free guidance diffusion architecture, a class-angle guidance module (CAGM) injects class and angle information into the generation process. Adversarial and consistency losses further improve image fidelity and detail quality for aircraft targets, and a pre-trained keypoint detector removes the need for manually labeled class and azimuth information at sampling time, enabling fully automated SAR-to-optical translation. Experiments show the method outperforms existing approaches on multiple metrics and exhibits strong zero-shot generalization to unseen aircraft types.

Link: https://arxiv.org/abs/2503.19798
Authors: Ruixi You, Hecheng Jia, Feng Xu
Affiliations: Key Laboratory for Information Science of Electromagnetic Waves (Ministry of Education), School of Information Science and Technology, Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Synthetic Aperture Radar (SAR) imagery provides all-weather, all-day, and high-resolution imaging capabilities but its unique imaging mechanism makes interpretation heavily reliant on expert knowledge, limiting interpretability, especially in complex target tasks. Translating SAR images into optical images is a promising solution to enhance interpretation and support downstream tasks. Most existing research focuses on scene-level translation, with limited work on object-level translation due to the scarcity of paired data and the challenge of accurately preserving contour and texture details. To address these issues, this study proposes a keypoint-guided diffusion model (KeypointDiff) for SAR-to-optical image translation of unpaired aircraft targets. This framework introduces supervision on target class and azimuth angle via keypoints, along with a training strategy for unpaired data. Based on the classifier-free guidance diffusion architecture, a class-angle guidance module (CAGM) is designed to integrate class and angle information into the diffusion generation process. Furthermore, adversarial loss and consistency loss are employed to improve image fidelity and detail quality, tailored for aircraft targets. During sampling, aided by a pre-trained keypoint detector, the model eliminates the requirement for manually labeled class and azimuth information, enabling automated SAR-to-optical translation. Experimental results demonstrate that the proposed method outperforms existing approaches across multiple metrics, providing an efficient and effective solution for object-level SAR-to-optical translation and downstream tasks. Moreover, the method exhibits strong zero-shot generalization to untrained aircraft types with the assistance of the keypoint detector.

[CV-24] PAVE: Patching and Adapting Video Large Language Models CVPR2025

【Quick Read】: This paper studies how to adapt pre-trained video large language models (Video LLMs) effectively to new tasks that involve additional modalities or data types (such as audio or 3D information). The key is PAVE, a flexible framework that introduces lightweight adapters, called "patches," which add only a small number of parameters and operations without modifying the base architecture or its pre-trained weights. This enables effective adaptation to diverse downstream tasks, including audio-visual question answering, 3D reasoning, multi-view video recognition, and high-frame-rate video understanding. PAVE significantly improves the base model, matching or surpassing state-of-the-art task-specific models while adding only about 0.1% extra FLOPs and parameters; it also supports multi-task learning and generalizes well across different Video LLMs. The code has been released.

Link: https://arxiv.org/abs/2503.19794
Authors: Zhuoming Liu, Yiquan Li, Khoi Duc Nguyen, Yiwu Zhong, Yin Li
Affiliations: University of Wisconsin-Madison; The Chinese University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: CVPR2025 Camera Ready

Abstract:Pre-trained video large language models (Video LLMs) exhibit remarkable reasoning capabilities, yet adapting these models to new tasks involving additional modalities or data types (e.g., audio or 3D information) remains challenging. In this paper, we present PAVE, a flexible framework for adapting pre-trained Video LLMs to downstream tasks with side-channel signals, such as audio, 3D cues, or multi-view videos. PAVE introduces lightweight adapters, referred to as “patches,” which add a small number of parameters and operations to a base model without modifying its architecture or pre-trained weights. In doing so, PAVE can effectively adapt the pre-trained base model to support diverse downstream tasks, including audio-visual question answering, 3D reasoning, multi-view video recognition, and high frame rate video understanding. Across these tasks, PAVE significantly enhances the performance of the base model, surpassing state-of-the-art task-specific models while incurring a minor cost of ~0.1% additional FLOPs and parameters. Further, PAVE supports multi-task learning and generalizes well across different Video LLMs. Our code is available at this https URL.
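A minimal example of what such a "patch" could look like: a small cross-attention adapter that injects side-channel tokens (e.g., audio) into the frozen base model's visual tokens, with a zero-initialized gate so the patched model starts out identical to the base. The module design and dimensions are illustrative assumptions, not PAVE's actual adapter.

```python
# A lightweight cross-attention "patch" for a frozen base model.
import torch
import torch.nn as nn

class SideChannelPatch(nn.Module):
    def __init__(self, dim=1024, side_dim=256, heads=8):
        super().__init__()
        self.proj = nn.Linear(side_dim, dim)           # lift side tokens
        self.xattn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))       # zero-init: identity

    def forward(self, vis_tokens, side_tokens):
        side = self.proj(side_tokens)
        fused, _ = self.xattn(vis_tokens, side, side)  # vis queries side info
        return vis_tokens + torch.tanh(self.gate) * fused

patch = SideChannelPatch()
vis = torch.randn(2, 196, 1024)    # tokens from the frozen video LLM
aud = torch.randn(2, 50, 256)      # audio side-channel tokens
print(patch(vis, aud).shape)       # torch.Size([2, 196, 1024])
```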

[CV-25] In the Blink of an Eye: Instant Game Map Editing using a Generative-AI Smart Brush

【Quick Read】: This paper addresses automated generation of complex, high-resolution texture edits in high-quality AAA 3D game environments, where traditional methods struggle with the unique complexity and domain-specific challenges of such scenes and prior work has mostly targeted simpler data distributions. The key innovation is a "Smart Brush" for map editing that helps artists modify selected regions of a game map efficiently and seamlessly. At its core, the approach combines generative adversarial networks (GANs) and diffusion models in two brush variants for efficient, context-aware generation. This hybrid workflow improves both artistic flexibility and production efficiency by reducing the manual rework of details, bridging the gap between automation and creative control in game development. Evaluations show the GAN-based brush yields the sharpest, most detailed results and best preserves image context, outperforming the other state-of-the-art models assessed.

Link: https://arxiv.org/abs/2503.19793
Authors: Vitaly Gnatyuk, Valeriia Koriukina, Ilya Levoshevich, Pavel Nurminskiy, Guenter Wallner
Affiliations: Wargaming (Berlin, Germany); Wargaming (Nicosia, Cyprus); Johannes Kepler University Linz (Linz, Austria)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With video games steadily increasing in complexity, automated generation of game content has found widespread interest. However, the task of 3D gaming map art creation remains underexplored to date due to its unique complexity and domain-specific challenges. While recent works have addressed related topics such as retro-style level generation and procedural terrain creation, these works primarily focus on simpler data distributions. To the best of our knowledge, we are the first to demonstrate the application of modern AI techniques for high-resolution texture manipulation in complex, highly detailed AAA 3D game environments. We introduce a novel Smart Brush for map editing, designed to assist artists in seamlessly modifying selected areas of a game map with minimal effort. By leveraging generative adversarial networks and diffusion models we propose two variants of the brush that enable efficient and context-aware generation. Our hybrid workflow aims to enhance both artistic flexibility and production efficiency, enabling the refinement of environments without manually reworking every detail, thus helping to bridge the gap between automation and creative control in game development. A comparative evaluation of our two methods against adapted versions of several state-of-the-art models shows that our GAN-based brush produces the sharpest and most detailed outputs while preserving image context, whereas the evaluated state-of-the-art models tend toward blurrier results and exhibit difficulties in maintaining contextual consistency.

[CV-26] SITA: Structurally Imperceptible and Transferable Adversarial Attacks for Stylized Image Generation

【Quick Read】: This paper addresses the shortcomings of existing defenses that protect visual artworks from unauthorized use, such as poor transferability, high computational cost, and the introduction of conspicuous noise that degrades the aesthetic quality of the original artwork. The key of the proposed Structurally Imperceptible and Transferable Adversarial (SITA) attack is a CLIP-based destylization loss that decouples and disrupts an image's robust style representation; this disruption hinders style extraction during stylized image generation and thereby impairs the overall stylization process. In addition, SITA embeds perturbations within imperceptible structural details of the image, effectively preventing style extraction without compromising the artwork's visual quality. The method also removes the need for a surrogate diffusion model, markedly improving transferability and computational efficiency while keeping the noise imperceptible.

Link: https://arxiv.org/abs/2503.19791
Authors: Jingdan Kang, Haoxin Yang, Yan Cai, Huaidong Zhang, Xuemiao Xu, Yong Du, Shengfeng He
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Image generation technology has brought significant advancements across various fields but has also raised concerns about data misuse and potential rights infringements, particularly with respect to creating visual artworks. Current methods aimed at safeguarding artworks often employ adversarial attacks. However, these methods face challenges such as poor transferability, high computational costs, and the introduction of noticeable noise, which compromises the aesthetic quality of the original artwork. To address these limitations, we propose a Structurally Imperceptible and Transferable Adversarial (SITA) attacks. SITA leverages a CLIP-based destylization loss, which decouples and disrupts the robust style representation of the image. This disruption hinders style extraction during stylized image generation, thereby impairing the overall stylization process. Importantly, SITA eliminates the need for a surrogate diffusion model, leading to significantly reduced computational overhead. The method’s robust style feature disruption ensures high transferability across diverse models. Moreover, SITA introduces perturbations by embedding noise within the imperceptible structural details of the image. This approach effectively protects against style extraction without compromising the visual quality of the artwork. Extensive experiments demonstrate that SITA offers superior protection for artworks against unauthorized use in stylized generation. It significantly outperforms existing methods in terms of transferability, computational efficiency, and noise imperceptibility. Code is available at this https URL.

[CV-27] Fine-Grained Erasure in Text-to-Image Diffusion-based Foundation Models CVPR2025

【Quick Read】: This paper addresses the failure of existing unlearning algorithms for text-to-image generative models to preserve knowledge of semantically related concepts when removing a target concept, a challenge known as adjacency. The key of the proposed FADE (Fine-grained Attenuation for Diffusion Erasure), which brings adjacency-aware unlearning to diffusion models, lies in its two components: (1) the Concept Neighborhood, which identifies the adjacency set of concepts related to the target, and (2) Mesh Modules, a structured combination of Expungement, Adjacency, and Guidance loss components. Together these enable precise erasure of target concepts while maximally preserving fidelity on related and unrelated concepts. On datasets such as Stanford Dogs, Oxford Flowers, CUB, I2P, Imagenette, and ImageNet1k, FADE removes target concepts with minimal impact on correlated ones and improves retention by at least 12% over state-of-the-art methods.

Link: https://arxiv.org/abs/2503.19783
Authors: Kartik Thakral, Tamar Glaser, Tal Hassner, Mayank Vatsa, Richa Singh
Affiliations: IIT Jodhpur; Harman International; Weir AI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in CVPR 2025

Abstract:Existing unlearning algorithms in text-to-image generative models often fail to preserve the knowledge of semantically related concepts when removing specific target concepts: a challenge known as adjacency. To address this, we propose FADE (Fine-grained Attenuation for Diffusion Erasure), introducing adjacency-aware unlearning in diffusion models. FADE comprises two components: (1) the Concept Neighborhood, which identifies an adjacency set of related concepts, and (2) Mesh Modules, employing a structured combination of Expungement, Adjacency, and Guidance loss components. These enable precise erasure of target concepts while preserving fidelity across related and unrelated concepts. Evaluated on datasets like Stanford Dogs, Oxford Flowers, CUB, I2P, Imagenette, and ImageNet1k, FADE effectively removes target concepts with minimal impact on correlated concepts, achieving at least a 12% improvement in retention performance over state-of-the-art methods.
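The Concept Neighborhood component can be sketched as a nearest-neighbor search in a text-embedding space: embed candidate concept names and keep the k most similar ones to the erasure target as the adjacency set to protect. The `embed` stand-in below uses random vectors purely to show the mechanics; a real implementation would use the diffusion model's text encoder.

```python
# Build an adjacency set by cosine similarity in a (stand-in) embedding space.
import torch
import torch.nn.functional as F

concepts = ["golden retriever", "labrador", "beagle", "tabby cat", "sports car"]
target = "golden retriever"

torch.manual_seed(0)
table = {c: F.normalize(torch.randn(512), dim=0) for c in concepts}
def embed(name):
    return table[name]                       # stand-in for a real text encoder

t = embed(target)
sims = torch.tensor([float(t @ embed(c)) for c in concepts])
k = 3
neighborhood = [concepts[i] for i in sims.topk(k).indices.tolist()
                if concepts[i] != target]
print(neighborhood)  # adjacency set to protect during unlearning
```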

[CV-28] LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation

【速读】:该论文旨在解决开放词汇语义分割(open-vocabulary semantic segmentation)中的挑战,特别是在不依赖额外训练的情况下实现高质量分割。传统视觉语言模型(Vision-and-Language Models, VLMs)主要优化跨模态对齐(cross-modal alignment),而忽略了模态内相似性(intra-modal similarity),导致初始逐块(per-patch)预测在精细边界区域的表现受限。为解决此问题,论文提出了一种无需训练的方法,其关键在于通过标签传播(label propagation)技术来增强VLM的初始预测。该方法利用视觉模型(Vision Model, VM)更好地捕捉块间关系,并在像素级别应用标签传播作为细化步骤,显著提升了类别边界附近的分割精度。此外,通过在整个图像上进行推理而非基于窗口处理,该方法能够捕获全局上下文交互。这一系列改进使得所提出的LPOSS+方法在多个数据集上实现了无训练方法中的最新性能(state-of-the-art)。

Link: https://arxiv.org/abs/2503.19777
Authors: Vladan Stojnić, Yannis Kalantidis, Jiří Matas, Giorgos Tolias
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:We propose a training-free method for open-vocabulary semantic segmentation using Vision-and-Language Models (VLMs). Our approach enhances the initial per-patch predictions of VLMs through label propagation, which jointly optimizes predictions by incorporating patch-to-patch relationships. Since VLMs are primarily optimized for cross-modal alignment and not for intra-modal similarity, we use a Vision Model (VM) that is observed to better capture these relationships. We address resolution limitations inherent to patch-based encoders by applying label propagation at the pixel level as a refinement step, significantly improving segmentation accuracy near class boundaries. Our method, called LPOSS+, performs inference over the entire image, avoiding window-based processing and thereby capturing contextual interactions across the full image. LPOSS+ achieves state-of-the-art performance among training-free methods, across a diverse set of datasets. Code: this https URL
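The propagation step itself is compact: starting from the VLM's per-patch class scores Y0, iterate Y &lt;- alpha * S @ Y + (1 - alpha) * Y0 over a normalized patch-affinity graph S built from vision-model features. The sketch below shows this generic scheme; graph-construction details in the paper (thresholds, normalization, pixel-level refinement) may differ.

```python
# Label propagation over a patch-affinity graph.
import torch
import torch.nn.functional as F

n_patches, n_classes, alpha = 196, 21, 0.9

feats = F.normalize(torch.randn(n_patches, 384), dim=-1)   # VM patch features
W = (feats @ feats.T).clamp(min=0)                         # patch affinities
W.fill_diagonal_(0)
d = W.sum(-1).clamp(min=1e-8)
S = W / d[:, None]                                         # row-normalized graph

Y0 = torch.softmax(torch.randn(n_patches, n_classes), -1)  # VLM patch scores
Y = Y0.clone()
for _ in range(50):                                        # diffuse to convergence
    Y = alpha * (S @ Y) + (1 - alpha) * Y0

pred = Y.argmax(-1)                                        # refined patch labels
print(pred.shape)  # torch.Size([196])
```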

[CV-29] Resilient Sensor Fusion under Adverse Sensor Failures via Multi-Modal Expert Fusion CVPR2025

【Quick Read】: This paper addresses the sharp performance drop of modern autonomous-driving perception systems under severe sensor failures (reduced LiDAR beams, LiDAR drop, limited field of view, camera drop, occlusion, etc.); current multi-modal fusion architectures rely on inter-modality dependencies and cope poorly with such failures. The key innovation of the proposed MoME, an efficient and robust LiDAR-camera 3D object detector, is a mixture-of-experts approach that fully decouples modality dependencies: three parallel expert decoders decode object queries from camera-only features, LiDAR-only features, or their combination, respectively. Within the Multi-Expert Decoding (MED) framework, an Adaptive Query Router (AQR) dynamically selects the most suitable expert decoder for each query based on the quality of the camera and LiDAR features, ensuring robust detection across diverse failure scenarios. On the nuScenes-R benchmark, MoME achieves state-of-the-art performance and clearly outperforms existing models in extreme weather and sensor-failure conditions.

Link: https://arxiv.org/abs/2503.19776
Authors: Konyul Park, Yecheol Kim, Daehun Kim, Jun Won Choi
Affiliations: Seoul National University; Hanyang University; LG AI Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025

Abstract:Modern autonomous driving perception systems utilize complementary multi-modal sensors, such as LiDAR and cameras. Although sensor fusion architectures enhance performance in challenging environments, they still suffer significant performance drops under severe sensor failures, such as LiDAR beam reduction, LiDAR drop, limited field of view, camera drop, and occlusion. This limitation stems from inter-modality dependencies in current sensor fusion frameworks. In this study, we introduce an efficient and robust LiDAR-camera 3D object detector, referred to as MoME, which can achieve robust performance through a mixture of experts approach. Our MoME fully decouples modality dependencies using three parallel expert decoders, which use camera features, LiDAR features, or a combination of both to decode object queries, respectively. We propose Multi-Expert Decoding (MED) framework, where each query is decoded selectively using one of three expert decoders. MoME utilizes an Adaptive Query Router (AQR) to select the most appropriate expert decoder for each query based on the quality of camera and LiDAR features. This ensures that each query is processed by the best-suited expert, resulting in robust performance across diverse sensor failure scenarios. We evaluated the performance of MoME on the nuScenes-R benchmark. Our MoME achieved state-of-the-art performance in extreme weather and sensor failure conditions, significantly outperforming the existing models across various sensor failure scenarios.
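A schematic of routing object queries to one of three expert decoders with a small router network, in the spirit of the Adaptive Query Router. The expert and router definitions are placeholders: the real decoders are full transformer decoders conditioned on camera/LiDAR features, and the real router scores feature quality.

```python
# Route each object query to exactly one of three expert decoders.
import torch
import torch.nn as nn

dim, n_queries = 256, 100
experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(3)])  # C, L, C+L
router = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 3))

queries = torch.randn(n_queries, dim)
choice = router(queries).argmax(dim=-1)          # pick one expert per query

out = torch.empty_like(queries)
for e, expert in enumerate(experts):             # decode each group separately
    idx = (choice == e).nonzero(as_tuple=True)[0]
    if idx.numel():
        out[idx] = expert(queries[idx])
print(out.shape, choice.bincount(minlength=3))   # queries per expert
```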

[CV-30] BiPrompt-SAM: Enhancing Image Segmentation via Explicit Selection between Point and Text Prompts

【速读】:该论文旨在解决现有分割方法中如何有效融合点提示(point prompts)和文本提示(text prompts)这两种互补模态以实现最优分割性能的问题。论文的关键创新在于提出了一种名为BiPrompt-SAM的双模态提示分割框架,通过显式的候选选择机制融合点提示和文本提示的优势。具体而言,该方法利用了Segment Anything Model (SAM) 生成多个掩码候选的能力,并结合来自文本提示的语义引导掩码,基于相似性度量显式选择最合适的候选掩码。这一过程可被视为一种简化的Mixture of Experts (MoE) 系统,其中点提示和文本提示模块作为不同的“专家”,而相似性评分充当基础的“门控网络”。实验结果表明,该显式双模态选择方法能够有效结合点提示的空间精确性和文本提示的语义丰富性,在处理语义复杂对象、多个相似对象以及部分遮挡等场景时表现出色。

Link: https://arxiv.org/abs/2503.19769
Authors: Suzhe Xu, Jialin Peng, Chengyuan Zhang
Affiliations: Huaqiao University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Segmentation is a fundamental task in computer vision, with prompt-driven methods gaining prominence due to their flexibility. The recent Segment Anything Model (SAM) has demonstrated powerful point-prompt segmentation capabilities, while text-based segmentation models offer rich semantic understanding. However, existing approaches rarely explore how to effectively combine these complementary modalities for optimal segmentation performance. This paper presents BiPrompt-SAM, a novel dual-modal prompt segmentation framework that fuses the advantages of point and text prompts through an explicit selection mechanism. Specifically, we leverage SAM’s inherent ability to generate multiple mask candidates, combined with a semantic guidance mask from text prompts, and explicitly select the most suitable candidate based on similarity metrics. This approach can be viewed as a simplified Mixture of Experts (MoE) system, where the point and text modules act as distinct “experts,” and the similarity scoring serves as a rudimentary “gating network.” We conducted extensive evaluations on both the Endovis17 medical dataset and RefCOCO series natural image datasets. On Endovis17, BiPrompt-SAM achieved 89.55% mDice and 81.46% mIoU, comparable to state-of-the-art specialized medical segmentation models. On the RefCOCO series datasets, our method attained 87.1%, 86.5%, and 85.8% IoU, significantly outperforming existing approaches. Experiments demonstrate that our explicit dual-selection method effectively combines the spatial precision of point prompts with the semantic richness of text prompts, particularly excelling in scenarios involving semantically complex objects, multiple similar objects, and partial occlusions. BiPrompt-SAM not only provides a simple yet effective implementation but also offers a new perspective on multi-modal prompt fusion.
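The explicit selection step is straightforward to prototype: score each point-prompt mask candidate against the text-derived semantic mask with IoU and keep the best one. The masks below are random stand-ins for SAM's candidates and the semantic mask.

```python
# Select the SAM mask candidate that best matches a text-guided semantic mask.
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

rng = np.random.default_rng(0)
candidates = rng.random((3, 64, 64)) > 0.5   # stand-in SAM mask candidates
semantic = rng.random((64, 64)) > 0.5        # stand-in text-guided mask

scores = [iou(m, semantic) for m in candidates]
best = candidates[int(np.argmax(scores))]    # final segmentation output
print(scores, best.shape)
```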

[CV-31] OpenLex3D: A New Evaluation Benchmark for Open-Vocabulary 3D Scene Representations

【Quick Read】: This paper addresses the fact that evaluations of open-vocabulary representations for 3D scene understanding are limited to closed semantic sets that fail to capture the richness of language. The solution is the OpenLex3D benchmark, which provides new label annotations for 23 scenes from Replica, ScanNet++, and HM3D, capturing real-world linguistic variability through synonymous object categories and additional nuanced descriptions. The key is the introduction of an open-set 3D semantic segmentation task and an object retrieval task, which yield insights on feature precision, segmentation, and downstream capabilities; evaluating existing 3D open-vocabulary methods on OpenLex3D exposes failure cases and avenues for improvement.

Link: https://arxiv.org/abs/2503.19764
Authors: Christina Kassab, Sacha Morin, Martin Büchner, Matías Mattamala, Kumaraditya Gupta, Abhinav Valada, Liam Paull, Maurice Fallon
Affiliations: University of Oxford; Université de Montréal; University of Freiburg; Mila - Quebec AI Institute; Canada CIFAR AI Chair
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:3D scene understanding has been transformed by open-vocabulary language models that enable interaction via natural language. However, the evaluation of these representations is limited to closed-set semantics that do not capture the richness of language. This work presents OpenLex3D, a dedicated benchmark to evaluate 3D open-vocabulary scene representations. OpenLex3D provides entirely new label annotations for 23 scenes from Replica, ScanNet++, and HM3D, which capture real-world linguistic variability by introducing synonymical object categories and additional nuanced descriptions. By introducing an open-set 3D semantic segmentation task and an object retrieval task, we provide insights on feature precision, segmentation, and downstream capabilities. We evaluate various existing 3D open-vocabulary methods on OpenLex3D, showcasing failure cases, and avenues for improvement. The benchmark is publicly available at: this https URL.

[CV-32] Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

【Quick Read】: This paper addresses the limited adaptability of existing vision-language-action (VLA) models to heterogeneous action spaces: conventional approaches rely on compact action heads that predict discretized or continuous actions, which constrains them when action spaces vary. The key innovation of Dita, a scalable framework, is to directly denoise continuous action sequences through a unified multimodal diffusion process. Unlike prior methods that condition denoising on fused embeddings via shallow networks, Dita adopts in-context conditioning, enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations; this design explicitly models action deltas and environmental nuances. Scaling the diffusion action denoiser alongside the Transformer architecture allows effective integration of cross-embodiment datasets spanning diverse camera views, observation scenes, tasks, and action spaces, which enhances robustness to variations and supports long-horizon tasks. Dita reaches state-of-the-art or comparable performance in simulation and, with only 10-shot fine-tuning using third-person camera inputs alone, adapts robustly to real-world environmental variations and complex long-horizon tasks.

链接: https://arxiv.org/abs/2503.19757
作者: Zhi Hou,Tianyi Zhang,Yuwen Xiong,Haonan Duan,Hengjun Pu,Ronglei Tong,Chengyang Zhao,Xizhou Zhu,Yu Qiao,Jifeng Dai,Yuntao Chen
机构: Shanghai AI Lab; College of Computer Science and Technology, Zhejiang University (浙江大学计算机科学与技术学院); MMLab, The Chinese University of Hong Kong (香港中文大学多媒体实验室); Peking University (北京大学); SenseTime Research (商汤科技研究部); Tsinghua University (清华大学); Center for Artificial Intelligence and Robotics, HKISI, CAS (中国科学院自动化研究所人工智能与机器人中心); robodita.github.io
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint; this https URL ;

点击查看摘要

Abstract:While recent vision-language-action models trained on diverse robot datasets exhibit promising generalization capabilities with limited in-domain data, their reliance on compact action heads to predict discretized or continuous actions constrains adaptability to heterogeneous action spaces. We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences through a unified multimodal diffusion process. Departing from prior methods that condition denoising on fused embeddings via shallow networks, Dita employs in-context conditioning – enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. This design explicitly models action deltas and environmental nuances. By scaling the diffusion action denoiser alongside the Transformer’s scalability, Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces. Such synergy enhances robustness against various variances and facilitates the successful execution of long-horizon tasks. Evaluations across extensive benchmarks demonstrate state-of-the-art or comparative performance in simulation. Notably, Dita achieves robust real-world adaptation to environmental variances and complex long-horizon tasks through 10-shot finetuning, using only third-person camera inputs. The architecture establishes a versatile, lightweight and open-source baseline for generalist robot policy learning. Project Page: this https URL.
zh

[CV-33] ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation

【速读】:该论文试图解决在交互式闭环评估中,端到端 (End-to-end, E2E) 自动驾驶方法因因果推理能力有限而难以做出正确决策的问题。当前方法尝试利用视觉语言模型 (Vision-Language Models, VLMs) 的强大理解和推理能力来解决这一困境,但受限于语义推理空间与动作空间中纯数值轨迹输出之间的差距,现有 VLMs 在闭环评估中的表现仍不理想。为应对这一挑战,论文提出了一种名为 ORION 的整体框架,其关键在于通过视觉语言指令驱动的动作生成实现端到端自动驾驶。ORION 结合了 QT-Former 用于聚合长期历史上下文,大型语言模型 (Large Language Model, LLM) 用于驾驶场景推理,以及生成式规划器用于精确轨迹预测,并进一步对齐语义推理空间与动作空间,以实现视觉问答 (Visual Question-Answering, VQA) 和规划任务的统一端到端优化。

链接: https://arxiv.org/abs/2503.19755
作者: Haoyu Fu,Diankun Zhang,Zongchuang Zhao,Jianfeng Cui,Dingkang Liang,Chong Zhang,Dingyuan Zhang,Hongwei Xie,Bing Wang,Xiang Bai
机构: Huazhong University of Science and Technology (华中科技大学); Xiaomi EV (小米汽车)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:End-to-end (E2E) autonomous driving methods still struggle to make correct decisions in interactive closed-loop evaluation due to limited causal reasoning capability. Current methods attempt to leverage the powerful understanding and reasoning abilities of Vision-Language Models (VLMs) to resolve this dilemma. However, the problem is still open that few VLMs for E2E methods perform well in the closed-loop evaluation due to the gap between the semantic reasoning space and the purely numerical trajectory output in the action space. To tackle this issue, we propose ORION, a holistic E2E autonomous driving framework by vision-language instructed action generation. ORION uniquely combines a QT-Former to aggregate long-term history context, a Large Language Model (LLM) for driving scenario reasoning, and a generative planner for precision trajectory prediction. ORION further aligns the reasoning space and the action space to implement a unified E2E optimization for both visual question-answering (VQA) and planning tasks. Our method achieves an impressive closed-loop performance of 77.74 Driving Score (DS) and 54.62% Success Rate (SR) on the challenge Bench2Drive datasets, which outperforms state-of-the-art (SOTA) methods by a large margin of 14.28 DS and 19.61% SR.
zh

[CV-34] A Survey on Event-driven 3D Reconstruction: Development under Different Categories

【速读】:该论文旨在综述基于事件驱动(Event-driven)的3D重建方法,涵盖双目、单目及多模态系统,并分类总结了几何方法、学习方法以及混合方法的最新进展。此外,还探讨了神经辐射场(Neural Radiance Fields)和基于事件数据的3D高斯点阵投射(3D Gaussian Splatting)等新兴趋势。论文按时间顺序组织相关工作,以展示领域内的创新与进展。为推动未来研究,论文还指出了数据集、实验设计、评估标准、事件表示等方面的关键研究空白与发展方向。

解决方案的关键在于综合分析不同类型的事件驱动3D重建方法,并通过分类整理现有技术,揭示其几何原理、学习框架及混合模型的优势与局限性,同时引入新兴技术趋势,为后续研究提供明确的方向指引。

链接: https://arxiv.org/abs/2503.19753
作者: Chuanzhi Xu,Haoxian Zhou,Haodong Chen,Vera Chung,Qiang Qu
机构: School of Computer Science (计算机学院), The University of Sydney (悉尼大学), NSW (新南威尔士), Australia (澳大利亚)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 1 figure, 6 tables

点击查看摘要

Abstract:Event cameras have gained increasing attention for 3D reconstruction due to their high temporal resolution, low latency, and high dynamic range. They capture per-pixel brightness changes asynchronously, allowing accurate reconstruction under fast motion and challenging lighting conditions. In this survey, we provide a comprehensive review of event-driven 3D reconstruction methods, including stereo, monocular, and multimodal systems. We further categorize recent developments based on geometric, learning-based, and hybrid approaches. Emerging trends, such as neural radiance fields and 3D Gaussian splatting with event data, are also covered. The related works are structured chronologically to illustrate the innovations and progression within the field. To support future research, we also highlight key research gaps and future research directions in dataset, experiment, evaluation, event representation, etc.
zh

[CV-35] Surg-3M: A Dataset and Foundation Model for Perception in Surgical Settings

【速读】:该论文旨在解决传统手术视觉数据集规模小、多样性不足的问题,这些问题限制了计算机辅助手术系统的发展。为突破这些限制,论文提出了Surg-3M数据集,通过新颖的数据聚合管道从在线资源中收集了超过4000个高分辨率手术视频和超过300万张高质量图像,覆盖多种手术类型。该数据集显著扩大了现有手术视觉数据的规模和范围,并引入了两个新的任务。解决方案的关键在于在Surg-3M上预训练的SurgFM自监督基础模型,该模型融合了ConvNeXt、DINO架构以及创新的增强蒸馏方法,在手术阶段识别(AutoLaparo、M2CAI16和Cholec80上的Jaccard指标分别提升+8.9pp、+4.7pp和+3.9pp)、动作识别(CholecT50上mAP提升+3.1pp)和工具存在检测(Cholec80上mAP提升+4.6pp)等下游任务中均超越现有最先进模型;即使仅使用一半数据量,SurgFM仍能在AutoLaparo上超越最先进模型,并在Cholec80上达到最先进水平。

链接: https://arxiv.org/abs/2503.19740
作者: Chengan Che,Chao Wang,Tom Vercauteren,Sophia Tsoka,Luis C. Garcia-Peraza-Herrera
机构: King’s College London (伦敦国王学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages

点击查看摘要

Abstract:Advancements in computer-assisted surgical procedures heavily rely on accurate visual data interpretation from camera systems used during surgeries. Traditional open-access datasets focusing on surgical procedures are often limited by their small size, typically consisting of fewer than 100 videos with less than 100K images. To address these constraints, a new dataset called Surg-3M has been compiled using a novel aggregation pipeline that collects high-resolution videos from online sources. Featuring an extensive collection of over 4K surgical videos and more than 3 million high-quality images from multiple procedure types, Surg-3M offers a comprehensive resource surpassing existing alternatives in size and scope, including two novel tasks. To demonstrate the effectiveness of this dataset, we present SurgFM, a self-supervised foundation model pretrained on Surg-3M that achieves impressive results in downstream tasks such as surgical phase recognition, action recognition, and tool presence detection. Combining key components from ConvNeXt, DINO, and an innovative augmented distillation method, SurgFM exhibits exceptional performance compared to specialist architectures across various benchmarks. Our experimental results show that SurgFM outperforms state-of-the-art models in multiple downstream tasks, including significant gains in surgical phase recognition (+8.9pp, +4.7pp, and +3.9pp of Jaccard in AutoLaparo, M2CAI16, and Cholec80), action recognition (+3.1pp of mAP in CholecT50) and tool presence detection (+4.6pp of mAP in Cholec80). Moreover, even when using only half of the data, SurgFM outperforms state-of-the-art models in AutoLaparo and achieves state-of-the-art performance in Cholec80. Both Surg-3M and SurgFM have significant potential to accelerate progress towards developing autonomous robotic surgery systems.
zh

[CV-36] FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion

【速读】:该论文旨在解决图像-事件联合深度估计方法在泛化性方面面临的两大挑战:1)标注数据集的匮乏导致跨模态监督不足;2)静态图像与动态事件流之间固有的频率不匹配,导致特征融合效果不佳。为应对这些挑战,论文提出了一种名为Frequency-decoupled Unified Self-supervised Encoder (FUSE) 的解决方案,其关键是结合两个协同组件:Parameter-efficient Self-supervised Transfer (PST),通过潜在空间对齐实现跨模态知识迁移,并借助图像基础模型缓解数据稀缺问题;以及Frequency-Decoupled Fusion (FreDFuse) 模块,显式分离高频边缘特征与低频结构成分,通过物理感知融合解决模态特定的频率不匹配问题。这一综合方法使得FUSE能够构建通用的图像-事件编码器,仅需轻量级解码器适配即可应用于目标数据集,从而显著提升性能并增强实际部署能力。

链接: https://arxiv.org/abs/2503.19739
作者: Pihai Sun(1),Junjun Jiang(1),Yuanqi Yao(1),Youyu Chen(1),Wenbo Zhao(1),Kui Jiang(1),Xianming Liu(1) ((1) Faculty of Computing, Harbin Institute of Technology)
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures

点击查看摘要

Abstract:Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability stemming from two factors: 1) limited annotated image-event-depth datasets causing insufficient cross-modal supervision, and 2) inherent frequency mismatches between static images and dynamic event streams with distinct spatiotemporal patterns, leading to ineffective feature fusion. To address this dual challenge, we propose Frequency-decoupled Unified Self-supervised Encoder (FUSE) with two synergistic components: The Parameter-efficient Self-supervised Transfer (PST) establishes cross-modal knowledge transfer through latent space alignment with image foundation models, effectively mitigating data scarcity by enabling joint encoding without depth ground truth. Complementing this, we propose the Frequency-Decoupled Fusion module (FreDFuse) to explicitly decouple high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches through physics-aware fusion. This combined approach enables FUSE to construct a universal image-event encoder that only requires lightweight decoder adaptation for target datasets. Extensive experiments demonstrate state-of-the-art performance with 14% and 24.9% improvements in Abs.Rel on MVSEC and DENSE datasets. The framework exhibits remarkable zero-shot adaptability to challenging scenarios including extreme lighting and motion blur, significantly advancing real-world deployment capabilities. The source code for our method is publicly available at: this https URL
zh
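
针对摘要中"将高频边缘特征与低频结构成分显式解耦后再融合"的思想,下面给出一个基于高斯模糊做低/高频分离的 PyTorch 概念示意。滤波核、融合权重以及"结构偏向图像、边缘偏向事件"的分配规则均为便于演示的假设,并非 FreDFuse 的真实实现。

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(ksize: int = 5, sigma: float = 1.0) -> torch.Tensor:
    ax = torch.arange(ksize) - ksize // 2
    g = torch.exp(-ax.float() ** 2 / (2 * sigma ** 2))
    k2d = torch.outer(g, g)
    return (k2d / k2d.sum()).view(1, 1, ksize, ksize)

def split_freq(feat: torch.Tensor, ksize: int = 5, sigma: float = 1.0):
    """低频 = 高斯模糊结果;高频 = 原特征 - 低频(逐通道 depthwise 卷积)。"""
    c = feat.shape[1]
    k = gaussian_kernel(ksize, sigma).repeat(c, 1, 1, 1).to(feat)
    low = F.conv2d(feat, k, padding=ksize // 2, groups=c)
    return low, feat - low

def fuse(img_feat: torch.Tensor, evt_feat: torch.Tensor) -> torch.Tensor:
    """概念性融合:结构成分偏向图像特征,边缘成分偏向事件特征(权重为假设)。"""
    img_low, img_high = split_freq(img_feat)
    evt_low, evt_high = split_freq(evt_feat)
    low = 0.7 * img_low + 0.3 * evt_low     # 低频结构:静态图像更可靠
    high = 0.3 * img_high + 0.7 * evt_high  # 高频边缘:事件流更敏感
    return low + high

x_img = torch.randn(1, 8, 32, 32)
x_evt = torch.randn(1, 8, 32, 32)
print(fuse(x_img, x_evt).shape)  # torch.Size([1, 8, 32, 32])
```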

[CV-37] PCM: Picard Consistency Model for Fast Parallel Sampling of Diffusion Models CVPR2025

【速读】:该论文旨在解决扩散模型在视觉、文本和机器人领域取得显著进展的同时,因顺序去噪过程而导致的生成速度缓慢的问题。针对这一问题,已有工作引入了基于Picard迭代的并行采样方法:它虽然有效减少了顺序步骤并确保与原始输出的精确收敛,但并不保证更快的收敛速度,实践中仍可能导致生成缓慢。论文的关键解决方案是提出一种新的并行化方案——Picard一致性模型(PCM)。PCM受一致性模型的启发,直接训练模型以预测收敛轨迹任意阶段的不动点解(即最终输出),从而显著减少Picard迭代所需的生成步骤。此外,论文还引入了模型切换(model switching)的新概念,以弥补PCM的局限并确保精确收敛。实验结果表明,PCM在图像生成和机器人控制等任务中相比顺序采样加速达2.71倍,相比Picard迭代加速达1.77倍。

链接: https://arxiv.org/abs/2503.19731
作者: Junhyuk So,Jiwoong Shin,Chaeyeon Jang,Eunhyeok Park
机构: Department of Computer Science and Engineering, POSTECH (计算机科学与工程系, POSTECH); Graduate School of Artificial Intelligence, POSTECH (人工智能研究生院, POSTECH)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the CVPR 2025

点击查看摘要

Abstract:Recently, diffusion models have achieved significant advances in vision, text, and robotics. However, they still face slow generation speeds due to sequential denoising processes. To address this, a parallel sampling method based on Picard iteration was introduced, effectively reducing sequential steps while ensuring exact convergence to the original output. Nonetheless, Picard iteration does not guarantee faster convergence, which can still result in slow generation in practice. In this work, we propose a new parallelization scheme, the Picard Consistency Model (PCM), which significantly reduces the number of generation steps in Picard iteration. Inspired by the consistency model, PCM is directly trained to predict the fixed-point solution, or the final output, at any stage of the convergence trajectory. Additionally, we introduce a new concept called model switching, which addresses PCM’s limitations and ensures exact convergence. Extensive experiments demonstrate that PCM achieves up to a 2.71x speedup over sequential sampling and a 1.77x speedup over Picard iteration across various tasks, including image generation and robotic control.
zh
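
为帮助理解摘要中 Picard 迭代的并行形式,下面在一个玩具 ODE 上演示其不动点更新:整条轨迹作为整体反复刷新,每轮中各时间步的更新互不依赖、可并行执行,收敛即停止。漂移函数与步数均为假设,仅示意迭代结构,并非 PCM 本身(PCM 在此基础上进一步训练网络直接预测不动点)。

```python
import numpy as np

def drift(x: np.ndarray, t: float) -> np.ndarray:
    """玩具漂移项(假设);实际扩散采样中此处为去噪网络的输出。"""
    return -x * (1.0 - t)

def picard_sampling(x0: np.ndarray, num_steps: int = 20,
                    num_iters: int = 10, tol: float = 1e-6) -> np.ndarray:
    ts = np.linspace(0.0, 1.0, num_steps + 1)
    dt = ts[1] - ts[0]
    traj = np.tile(x0, (num_steps + 1, 1))  # 初始猜测:整条轨迹都等于初始点
    for _ in range(num_iters):
        # 各时间步的漂移项互不依赖,可并行计算(此处以循环示意)
        drifts = np.stack([drift(traj[i], ts[i]) for i in range(num_steps)])
        new_traj = traj.copy()
        new_traj[1:] = x0 + np.cumsum(drifts * dt, axis=0)  # 并行不动点更新
        if np.max(np.abs(new_traj - traj)) < tol:  # 收敛即提前停止
            traj = new_traj
            break
        traj = new_traj
    return traj[-1]

x0 = np.ones(4)
print(picard_sampling(x0))
```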

[CV-38] CamSAM2: Segment Anything Accurately in Camouflaged Videos

【速读】:该论文旨在解决视频伪装物体分割(VCOS)任务中,现有方法在处理简单提示(如点或框)时对伪装场景分割能力不足的问题。特别是针对SAM2模型在分割伪装视频时表现欠佳的情况,论文提出了一种增强SAM2能力的方法,命名为Camouflaged SAM2 (CamSAM2),且无需修改SAM2的参数。

解决方案的关键在于引入了三个创新模块:首先,通过引入解伪装标记(decamouflaged token),提供特征调整的灵活性;其次,分别设计了隐式目标感知融合(IOF)和显式目标感知融合(EOF)模块,充分利用当前帧和历史帧中的细粒度高分辨率特征;最后,提出了对象原型生成(OPG)模块,利用前几帧的高质量特征抽象并记忆伪装对象的原型信息。这些技术显著提升了模型在伪装场景下的分割性能,同时仅增加了极少量可学习参数。实验结果表明,CamSAM2在三个VCOS数据集上大幅超越SAM2,尤其在MoCA-Mask数据集上的点击提示和SUN-SEG-Hard数据集上的掩码提示中分别取得了12.2 mDice和19.6 mDice的提升。

链接: https://arxiv.org/abs/2503.19730
作者: Yuli Zhou,Guolei Sun,Yawei Li,Yuqian Fu,Luca Benini,Ender Konukoglu
机构: Computer Vision Laboratory, ETH Zurich (计算机视觉实验室, 瑞士联邦理工学院 Zurich 校区); University of Zurich (苏黎世大学); Integrated Systems Laboratory, ETH Zurich (集成系统实验室, 瑞士联邦理工学院 Zurich 校区); INSAIT, Sofia University “St. Kliment Ohridski” (索非亚大学 INSAIT 学院); University of Bologna (博洛尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Video camouflaged object segmentation (VCOS), aiming at segmenting camouflaged objects that seamlessly blend into their environment, is a fundamental vision task with various real-world applications. With the release of SAM2, video segmentation has witnessed significant progress. However, SAM2’s capability of segmenting camouflaged videos is suboptimal, especially when given simple prompts such as point and box. To address the problem, we propose Camouflaged SAM2 (CamSAM2), which enhances SAM2’s ability to handle camouflaged scenes without modifying SAM2’s parameters. Specifically, we introduce a decamouflaged token to provide the flexibility of feature adjustment for VCOS. To make full use of fine-grained and high-resolution features from the current frame and previous frames, we propose implicit object-aware fusion (IOF) and explicit object-aware fusion (EOF) modules, respectively. Object prototype generation (OPG) is introduced to abstract and memorize object prototypes with informative details using high-quality features from previous frames. Extensive experiments are conducted to validate the effectiveness of our approach. While CamSAM2 only adds negligible learnable parameters to SAM2, it substantially outperforms SAM2 on three VCOS datasets, especially achieving 12.2 mDice gains with click prompt on MoCA-Mask and 19.6 mDice gains with mask prompt on SUN-SEG-Hard, with Hiera-T as the backbone. The code will be available at: this https URL.
zh
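
摘要中的对象原型生成(OPG)大致可理解为:利用历史帧的高质量特征与掩码,抽象并记忆目标的原型向量。下面用"掩码平均池化 + 指数滑动平均"给出一个概念性 PyTorch 草图;池化与动量更新方式均为假设,并非论文实现。

```python
import torch

def masked_avg_pool(feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """feat: (C, H, W) 特征图;mask: (H, W) 二值掩码。返回 (C,) 原型向量。"""
    m = mask.float().unsqueeze(0)  # (1, H, W)
    return (feat * m).sum(dim=(1, 2)) / m.sum().clamp(min=1.0)

class PrototypeMemory:
    """跨帧记忆对象原型:指数滑动平均(动量值为假设)。"""
    def __init__(self, momentum: float = 0.9):
        self.momentum = momentum
        self.proto = None

    def update(self, feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        p = masked_avg_pool(feat, mask)
        self.proto = p if self.proto is None else \
            self.momentum * self.proto + (1 - self.momentum) * p
        return self.proto

mem = PrototypeMemory()
for _ in range(3):  # 模拟连续三帧
    feat = torch.randn(16, 32, 32)
    mask = torch.rand(32, 32) > 0.5
    proto = mem.update(feat, mask)
print(proto.shape)  # torch.Size([16])
```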

[CV-39] EventMamba: Enhancing Spatio-Temporal Locality with State Space Models for Event-Based Video Reconstruction

【速读】:该论文旨在解决现有基于 Mamba 的视觉模型在事件驱动任务中的局限性,尤其是其难以满足事件驱动视频重建 (Event-based Video Reconstruction, EBVR) 对空间平移不变性和局部时空事件关系建模的要求。传统 Mamba 算法采用静态窗口划分和标准重塑扫描方法,导致局部连接信息的显著损失。为克服这些限制,论文提出了 EventMamba,这是一种专为 EBVR 设计的模型。其关键创新在于引入随机窗口偏移 (Random Window Offset, RWO) 来取代固定的窗口划分,并在时空域中采用一致的遍历序列化方法以保持相邻事件在空间和时间上的紧密联系。这些改进使 EventMamba 在保留 Mamba 强大建模能力的同时,显著提升了事件数据的时空局部性,从而大幅提高了视频重建的计算速度与视觉质量。

链接: https://arxiv.org/abs/2503.19721
作者: Chengjie Ge,Xueyang Fu,Peng He,Kunyu Wang,Chengzhi Cao,Zheng-Jun Zha
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Leveraging its robust linear global modeling capability, Mamba has notably excelled in computer vision. Despite its success, existing Mamba-based vision models have overlooked the nuances of event-driven tasks, especially in video reconstruction. Event-based video reconstruction (EBVR) demands spatial translation invariance and close attention to local event relationships in the spatio-temporal domain. Unfortunately, conventional Mamba algorithms apply static window partitions and standard reshape scanning methods, leading to significant losses in local connectivity. To overcome these limitations, we introduce EventMamba–a specialized model designed for EBVR tasks. EventMamba innovates by incorporating random window offset (RWO) in the spatial domain, moving away from the restrictive fixed partitioning. Additionally, it features a new consistent traversal serialization approach in the spatio-temporal domain, which maintains the proximity of adjacent events both spatially and temporally. These enhancements enable EventMamba to retain Mamba’s robust modeling capabilities while significantly preserving the spatio-temporal locality of event data. Comprehensive testing on multiple datasets shows that EventMamba markedly enhances video reconstruction, drastically improving computation speed while delivering superior visual quality compared to Transformer-based methods.
zh
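
摘要中的随机窗口偏移(RWO)可以用"先随机循环平移特征图、再按固定网格划分窗口"来近似理解:窗口边界在训练中随机落位,从而缓解固定划分造成的局部连接损失。下面是一个 PyTorch 概念示意(以循环平移实现偏移属于假设,并非论文实现)。

```python
import torch

def random_window_partition(x: torch.Tensor, win: int = 8,
                            training: bool = True) -> torch.Tensor:
    """x: (B, C, H, W),H、W 需为 win 的整数倍。
    训练时先做随机循环平移,再划分为 (B*nW, C, win, win) 的窗口。"""
    b, c, h, w = x.shape
    if training:
        dh = int(torch.randint(0, win, (1,)))
        dw = int(torch.randint(0, win, (1,)))
        x = torch.roll(x, shifts=(dh, dw), dims=(2, 3))  # 随机偏移窗口边界
    x = x.view(b, c, h // win, win, w // win, win)
    x = x.permute(0, 2, 4, 1, 3, 5).contiguous()
    return x.view(-1, c, win, win)

x = torch.randn(2, 16, 32, 32)
print(random_window_partition(x).shape)  # torch.Size([32, 16, 8, 8])
```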

[CV-40] On What Depends the Robustness of Multi-source Models to Missing Data in Earth Observation?

【速读】:该论文试图解决的问题是如何理解并提升多源地球观测(Earth Observation, EO)模型在面对缺失数据时的预测性能。论文的关键在于评估六种最先进的多源模型在单个数据源缺失或仅有一个数据源可用情况下的预测表现,并揭示模型效能与任务性质、数据源互补性及模型设计之间的复杂关系。研究发现,移除某些数据源反而可能提高预测性能,这挑战了“整合所有可用数据总是有益”的传统假设,从而引发对模型复杂度以及数据源必要性的深刻反思,为更精简的EO应用方法提供了潜在方向。

链接: https://arxiv.org/abs/2503.19719
作者: Francisco Mena,Diego Arenas,Miro Miranda,Andreas Dengel
机构: University of Kaiserslautern-Landau (RPTU)(凯泽斯劳滕-兰道大学); German Research Center for Artificial Intelligence (DFKI)(德国人工智能研究中心)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IEEE International Geoscience and Remote Sensing Symposium 2025

点击查看摘要

Abstract:In recent years, the development of robust multi-source models has emerged in the Earth Observation (EO) field. These are models that leverage data from diverse sources to improve predictive accuracy when there is missing data. Despite these advancements, the factors influencing the varying effectiveness of such models remain poorly understood. In this study, we evaluate the predictive performance of six state-of-the-art multi-source models in predicting scenarios where either a single data source is missing or only a single source is available. Our analysis reveals that the efficacy of these models is intricately tied to the nature of the task, the complementarity among data sources, and the model design. Surprisingly, we observe instances where the removal of certain data sources leads to improved predictive performance, challenging the assumption that incorporating all available data is always beneficial. These findings prompt critical reflections on model complexity and the necessity of all collected data sources, potentially shaping the way for more streamlined approaches in EO applications.
zh
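
论文的核心实验协议是比较"移除单个数据源"与"仅保留单个数据源"两种缺失场景下的预测性能。下面用 NumPy 写一个示意评估循环:以 None 表示缺失模态,用一个占位模型代替真实多源模型;数据源名称与缺失处理方式均为假设。

```python
import numpy as np

SOURCES = ["optical", "radar", "weather"]  # 假设的三个 EO 数据源

def predict(inputs: dict) -> np.ndarray:
    """占位模型:对可用数据源取均值(真实实验中应替换为多源模型)。"""
    feats = [v for v in inputs.values() if v is not None]
    return np.mean(feats, axis=0)

def evaluate(inputs: dict, target: np.ndarray) -> float:
    pred = predict(inputs)
    return float(np.mean((pred - target) ** 2))  # MSE,越小越好

rng = np.random.default_rng(0)
data = {s: rng.normal(size=100) for s in SOURCES}
target = np.mean(list(data.values()), axis=0) + rng.normal(scale=0.1, size=100)

print("all sources:", evaluate(data, target))
for s in SOURCES:  # 场景一:移除单个数据源
    drop = {k: (None if k == s else v) for k, v in data.items()}
    print(f"without {s}:", evaluate(drop, target))
for s in SOURCES:  # 场景二:仅保留单个数据源
    only = {k: (v if k == s else None) for k, v in data.items()}
    print(f"only {s}:", evaluate(only, target))
```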

[CV-41] Semi-SD: Semi-Supervised Metric Depth Estimation via Surrounding Cameras for Autonomous Driving

【速读】:该论文旨在解决自动驾驶场景下基于周边摄像头的度量深度估计问题,特别是多摄像头设置中的尺度不确定性挑战。论文的关键解决方案在于提出了一种名为Semi-SD的新框架,其核心包括:(1) 设计了一个统一的空间-时间-语义融合模块,用于构建视觉融合特征;(2) 引入了面向周边摄像头及相邻帧的交叉注意力组件,以优化尺度信息细化与时间特征匹配;(3) 提出了一种利用周边摄像头、其估计深度以及外参的位姿估计算法,有效解决了多摄像头设置中的尺度模糊问题;(4) 结合语义世界模型与单目深度估计模型对深度估计进行监督,进一步提升了深度估计质量。这些创新点共同构成了Semi-SD框架的核心优势。

链接: https://arxiv.org/abs/2503.19713
作者: Yusen Xie,Zhengmin Huang,Shaojie Shen,Jun Ma
机构: The Hong Kong University of Science and Technology (香港科技大学); The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology (香港科技大学电子与计算机工程系)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we introduce Semi-SD, a novel metric depth estimation framework tailored for surrounding camera setups in autonomous driving. In this work, the input data consists of adjacent surrounding frames and camera parameters. We propose a unified spatial-temporal-semantic fusion module to construct the visual fused features. Cross-attention components for surrounding cameras and adjacent frames are utilized to focus on metric scale information refinement and temporal feature matching. Building on this, we propose a pose estimation framework using surrounding cameras, their corresponding estimated depths, and extrinsic parameters, which effectively addresses the scale ambiguity in multi-camera setups. Moreover, the semantic world model and the monocular depth estimation world model are integrated to supervise the depth estimation, which improves the quality of depth estimation. We evaluate our algorithm on DDAD and nuScenes datasets, and the results demonstrate that our method achieves state-of-the-art performance in terms of surrounding camera based depth estimation quality. The source code will be available on this https URL.
zh

[CV-42] Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations CVPR2025

【速读】:该论文旨在解决从第一人称(egocentric)和第三人称(exocentric)视频中学习视点不变表征的问题,这一领域由于视角、运动模式及上下文的显著差异而未被充分探索。论文的关键在于提出了一种名为“Bootstrap Your Own Views (BYOV)”的新方法,通过因果时间动态和跨视图对齐的促进机制,实现从无配对的第一人称和第三人称视频中进行细粒度视点不变视频表征学习。该方法强调捕捉人类动作的组合特性作为稳健跨视图理解的基础,并通过自视图掩码预测和跨视图掩码预测的设计,同时学习视点不变且强大的表征。实验结果表明,BYOV在四个下游第一人称与第三人称视频任务的所有指标上均显著优于现有方法。

链接: https://arxiv.org/abs/2503.19706
作者: Jungin Park,Jiyoung Lee,Kwanghoon Sohn
机构: Yonsei University (延世大学); Ewha Womans University (梨花女子大学); NAVER AI Lab (NAVER人工智能实验室); Korea Institute of Science and Technology (KIST)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2025 Camera-ready

点击查看摘要

Abstract:View-invariant representation learning from egocentric (first-person, ego) and exocentric (third-person, exo) videos is a promising approach toward generalizing video understanding systems across multiple viewpoints. However, this area has been underexplored due to the substantial differences in perspective, motion patterns, and context between ego and exo views. In this paper, we propose a novel masked ego-exo modeling that promotes both causal temporal dynamics and cross-view alignment, called Bootstrap Your Own Views (BYOV), for fine-grained view-invariant video representation learning from unpaired ego-exo videos. We highlight the importance of capturing the compositional nature of human actions as a basis for robust cross-view understanding. Specifically, self-view masking and cross-view masking predictions are designed to learn view-invariant and powerful representations concurrently. Experimental results demonstrate that our BYOV significantly surpasses existing approaches with notable gains across all metrics in four downstream ego-exo video tasks. The code is available at this https URL.
zh

[CV-43] High-Quality Spatial Reconstruction and Orthoimage Generation Using Efficient 2D Gaussian Splatting

【速读】:该论文旨在解决传统真数字正射影像图(True Digital Orthophoto Map, TDOM)生成方法中存在的计算复杂度高且易出错的问题,特别是依赖于繁琐的数字表面模型(Digital Surface Model, DSM)和遮挡检测的过程。论文提出了一种基于2D高斯点绘制(2D Gaussian Splatting, 2DGS)的新技术,无需显式的DSM和遮挡检测,通过深度图生成获取每个像素的空间信息,并以高精度重建场景。其关键在于采用分而治之策略,在较低资源成本下实现高质量的大规模场景重建与复杂地形及薄结构的高精度建模,同时保持高效性。实验结果验证了该方法在大规模场景重建和高精度地形建模方面的有效性。

链接: https://arxiv.org/abs/2503.19703
作者: Qian Wang,Zhihao Zhan,Jialei He,Zhituo Tu,Xiang Zhu,Jie Yuan
机构: School of Electronic Science and Engineering, Nanjing University (南京大学); TopXGun Robotics (TopXGun机器人), Nanjing (南京)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Highly accurate geometric precision and dense image features characterize True Digital Orthophoto Maps (TDOMs), which are in great demand for applications such as urban planning, infrastructure management, and environmental monitoring. Traditional TDOM generation methods need sophisticated processes, such as Digital Surface Models (DSM) and occlusion detection, which are computationally expensive and prone to errors. This work presents an alternative technique rooted in 2D Gaussian Splatting (2DGS), free of explicit DSM and occlusion detection. With depth map generation, spatial information for every pixel within the TDOM is retrieved and can reconstruct the scene with high precision. Divide-and-conquer strategy achieves excellent GS training and rendering with high-resolution TDOMs at a lower resource cost, which preserves higher quality of rendering on complex terrain and thin structure without a decrease in efficiency. Experimental results demonstrate the efficiency of large-scale scene reconstruction and high-precision terrain modeling. This approach provides accurate spatial data, which assists users in better planning and decision-making based on maps.
zh

[CV-44] Optimization of MedSAM model based on bounding box adaptive perturbation algorithm

【速读】:该论文旨在解决MedSAM模型在医学图像分割任务中的两个主要问题:一是训练过程中扰动窗口设置的限制可能导致小组织或器官与相邻结构错误分割,从而产生分割误差;二是当处理具有不规则形状和复杂结构的目标时,MedSAM在缩小边界框提示下的分割性能不佳。为了解决这些问题,论文提出了一种边界框自适应扰动算法来优化训练过程,其关键是通过调整扰动策略以减少小目标的分割误差,并提升模型在缩小边界框提示下的准确性,从而增强MedSAM模型在复杂医学成像任务中的鲁棒性和可靠性。

链接: https://arxiv.org/abs/2503.19700
作者: Boyi Li,Ye Yuan,Wenjun Tan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 6 figures, 3 Tables

点击查看摘要

Abstract:The MedSAM model, built upon the SAM framework, enhances medical image segmentation through generalizable training but still exhibits notable limitations. First, constraints in the perturbation window settings during training can cause MedSAM to incorrectly segment small tissues or organs together with adjacent structures, leading to segmentation errors. Second, when dealing with medical image targets characterized by irregular shapes and complex structures, segmentation often relies on narrowing the bounding box to refine segmentation intent. However, MedSAM’s performance under reduced bounding box prompts remains suboptimal. To address these challenges, this study proposes a bounding box adaptive perturbation algorithm to optimize the training process. The proposed approach aims to reduce segmentation errors for small targets and enhance the model’s accuracy when processing reduced bounding box prompts, ultimately improving the robustness and reliability of the MedSAM model for complex medical imaging tasks.
zh
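
摘要未给出扰动算法的具体形式;下面给出一种假设性的"自适应"实现思路:扰动幅度按边界框自身宽高等比例采样,使小目标只受到小幅扰动,从而降低把相邻组织一并框入的风险。超参 max_ratio 及采样分布均为演示用假设。

```python
import numpy as np

def adaptive_box_perturbation(box, img_h, img_w, max_ratio=0.1, rng=None):
    """box: (x1, y1, x2, y2)。扰动幅度与框的宽高成比例(max_ratio 为假设超参),
    小目标因此只会受到小幅扰动,避免将相邻组织一并框入。"""
    rng = np.random.default_rng() if rng is None else rng
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    dx = rng.uniform(-max_ratio, max_ratio, size=2) * w  # x1、x2 的扰动
    dy = rng.uniform(-max_ratio, max_ratio, size=2) * h  # y1、y2 的扰动
    nx1 = np.clip(x1 + dx[0], 0, img_w - 1)
    ny1 = np.clip(y1 + dy[0], 0, img_h - 1)
    nx2 = np.clip(x2 + dx[1], nx1 + 1, img_w)  # 保证框仍有正面积
    ny2 = np.clip(y2 + dy[1], ny1 + 1, img_h)
    return (nx1, ny1, nx2, ny2)

print(adaptive_box_perturbation((40, 50, 80, 90), img_h=256, img_w=256,
                                rng=np.random.default_rng(0)))
```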

[CV-45] Unlocking the Hidden Potential of CLIP in Generalizable Deepfake Detection

【速读】:该论文旨在解决检测部分篡改面部深度伪造(Face Deepfakes)的难题,这类伪造仅对特定面部特征进行微妙改动,同时保留整体上下文,比完全合成的面孔更具检测挑战性。论文的关键解决方案在于利用对比语言图像预训练(Contrastive Language-Image Pre-training, CLIP)模型中的ViT-L/14视觉编码器,通过参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)技术如LN-tuning,仅调整模型的一小部分参数以保持CLIP的预训练知识并减少过拟合。此外,论文设计了一套面向面部图像的定制化预处理流程,并结合L2归一化及超球面流形上的度量学习等正则化策略来提升泛化能力。实验结果表明,该方法在多个数据集上表现出与更复杂现有技术相当或更好的检测准确性。

链接: https://arxiv.org/abs/2503.19683
作者: Andrii Yermakov,Jan Cech,Jiri Matas
机构: Visual Recognition Group (视觉识别小组); Faculty of Electrical Engineering (电气工程学院); Czech Technical University in Prague (捷克技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper tackles the challenge of detecting partially manipulated facial deepfakes, which involve subtle alterations to specific facial features while retaining the overall context, posing a greater detection difficulty than fully synthetic faces. We leverage the Contrastive Language-Image Pre-training (CLIP) model, specifically its ViT-L/14 visual encoder, to develop a generalizable detection method that performs robustly across diverse datasets and unknown forgery techniques with minimal modifications to the original model. The proposed approach utilizes parameter-efficient fine-tuning (PEFT) techniques, such as LN-tuning, to adjust a small subset of the model’s parameters, preserving CLIP’s pre-trained knowledge and reducing overfitting. A tailored preprocessing pipeline optimizes the method for facial images, while regularization strategies, including L2 normalization and metric learning on a hyperspherical manifold, enhance generalization. Trained on the FaceForensics++ dataset and evaluated in a cross-dataset fashion on Celeb-DF-v2, DFDC, FFIW, and others, the proposed method achieves competitive detection accuracy comparable to or outperforming much more complex state-of-the-art techniques. This work highlights the efficacy of CLIP’s visual encoder in facial deepfake detection and establishes a simple, powerful baseline for future research, advancing the field of generalizable deepfake detection. The code is available at: this https URL
zh
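
摘要中的 LN-tuning 是一种参数高效微调:冻结视觉编码器的全部权重,仅放开 LayerNorm 的仿射参数。下面用一个小型 Transformer 编码器(代替 CLIP ViT-L/14,仅作结构占位)演示这一做法,并附上摘要提到的嵌入 L2 归一化;具体训练细节为假设。

```python
import torch
import torch.nn as nn

def enable_ln_tuning(model: nn.Module) -> list:
    """冻结全部参数,仅放开 LayerNorm 的 weight/bias,返回可训练参数列表。"""
    for p in model.parameters():
        p.requires_grad = False
    trainable = []
    for m in model.modules():
        if isinstance(m, nn.LayerNorm):
            for p in m.parameters():
                p.requires_grad = True
                trainable.append(p)
    return trainable

# 用一个小 Transformer 编码器代替 CLIP ViT-L/14 做演示(结构为占位)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2)
params = enable_ln_tuning(encoder)
total = sum(p.numel() for p in encoder.parameters())
tuned = sum(p.numel() for p in params)
print(f"trainable: {tuned}/{total} ({100 * tuned / total:.2f}%)")

# 摘要还提到对嵌入做 L2 归一化,以配合超球面流形上的度量学习:
feat = torch.randn(8, 64)
feat = nn.functional.normalize(feat, dim=-1)  # 特征落在单位超球面上
```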

[CV-46] MultimodalStudio: A Heterogeneous Sensor Dataset and Framework for Neural Rendering across Multiple Imaging Modalities CVPR2025

【速读】:该论文旨在解决神经辐射场(NeRF)在跨异构成像模态(如RGB、单色、近红外、偏振和多光谱等)之间学习和传递信息能力的研究不足问题,这一不足主要源于有限的多模态训练数据。论文的关键创新在于提出了MultimodalStudio (MMS),它包含两个核心组成部分:MMS-DATA和MMS-FW。其中,MMS-DATA是一个包含32个场景、涵盖5种不同成像模态的多模态多视角数据集;MMS-FW是一种新型的模块化多模态NeRF框架,能够处理原始多通道数据并支持任意数量的多模态设备。通过广泛的实验验证,论文展示了MMS-FW能够在MMS-DATA上实现跨模态信息迁移,并生成比单一模态更高的渲染质量,从而推动多模态体绘制及相关领域的研究发展。

链接: https://arxiv.org/abs/2503.19673
作者: Federico Lincetto,Gianluca Agresti,Mattia Rossi,Pietro Zanuttigh
机构: University of Padova (帕多瓦大学); Sony Europe B.V (索尼欧洲公司)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2025

点击查看摘要

Abstract:Neural Radiance Fields (NeRF) have shown impressive performances in the rendering of 3D scenes from arbitrary viewpoints. While RGB images are widely preferred for training volume rendering models, the interest in other radiance modalities is also growing. However, the capability of the underlying implicit neural models to learn and transfer information across heterogeneous imaging modalities has seldom been explored, mostly due to the limited training data availability. For this purpose, we present MultimodalStudio (MMS): it encompasses MMS-DATA and MMS-FW. MMS-DATA is a multimodal multi-view dataset containing 32 scenes acquired with 5 different imaging modalities: RGB, monochrome, near-infrared, polarization and multispectral. MMS-FW is a novel modular multimodal NeRF framework designed to handle multimodal raw data and able to support an arbitrary number of multi-channel devices. Through extensive experiments, we demonstrate that MMS-FW trained on MMS-DATA can transfer information between different imaging modalities and produce higher quality renderings than using single modalities alone. We publicly release the dataset and the framework, to promote the research on multimodal volume rendering and beyond.
zh

[CV-47] fine-CLIP: Enhancing Zero-Shot Fine-Grained Surgical Action Recognition with Vision-Language Models

【速读】:本文旨在解决现有基于CLIP的视觉语言模型在细粒度手术活动(尤其是动作三元组)零样本识别上的局限性。当前模型依赖全局图像特征,忽视了复杂任务所需的细粒度语义和上下文细节,并且未能利用三元组中的层次结构,从而限制了其对新型三元组的泛化能力。为了解决这些问题,论文提出了一种名为fine-CLIP的方法,其关键是学习以对象为中心的特征并利用三元组定义中的层次结构。具体而言,fine-CLIP通过三个组件实现这一目标:层次提示建模以捕捉共享语义、基于LoRA的视觉主干适应以增强特征提取,以及基于图的特征凝练策略将相似的块特征聚合成有意义的对象簇。此外,为了评估模型从基础类到新类(base-to-novel)的泛化能力,作者引入了一个新的基准测试,在CholecT50数据集上设置了Unseen-Target和Unseen-Instrument-Verb两种场景。实验结果显示,fine-CLIP在F1和mAP指标上取得了显著提升,有效增强了对新型手术三元组的零样本识别性能。

链接: https://arxiv.org/abs/2503.19670
作者: Saurav Sharma,Didier Mutter,Nicolas Padoy
机构: University of Strasbourg (斯特拉斯堡大学), CNRS (法国国家科学研究中心), INSERM (法国国家健康与医学研究院), ICube (ICube), UMR7357 (联合研究单位7357), France (法国); IHU Strasbourg (斯特拉斯堡IHU), Strasbourg, France (法国); University Hospital of Strasbourg (斯特拉斯堡大学医院), France (法国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 3 tables, 3 figures

点击查看摘要

Abstract:While vision-language models like CLIP have advanced zero-shot surgical phase recognition, they struggle with fine-grained surgical activities, especially action triplets. This limitation arises because current CLIP formulations rely on global image features, which overlook the fine-grained semantics and contextual details crucial for complex tasks like zero-shot triplet recognition. Furthermore, these models do not explore the hierarchical structure inherent in triplets, reducing their ability to generalize to novel triplets. To address these challenges, we propose fine-CLIP, which learns object-centric features and leverages the hierarchy in triplet formulation. Our approach integrates three components: hierarchical prompt modeling to capture shared semantics, LoRA-based vision backbone adaptation for enhanced feature extraction, and a graph-based condensation strategy that groups similar patch features into meaningful object clusters. Since triplet classification is a challenging task, we introduce an alternative yet meaningful base-to-novel generalization benchmark with two settings on the CholecT50 dataset: Unseen-Target, assessing adaptability to triplets with novel anatomical structures, and Unseen-Instrument-Verb, where models need to generalize to novel instrument-verb interactions. fine-CLIP shows significant improvements in F1 and mAP, enhancing zero-shot recognition of novel surgical triplets.
zh

[CV-48] CoSimGen: Controllable Diffusion Model for Simultaneous Image and Mask Generation

【速读】:该论文致力于解决跨模态配对数据集(图像与分割掩码)标注资源匮乏的问题,特别是医疗影像、遥感及计算机视觉领域中高质量同时生成图像与分割掩码的挑战。现有生成模型通常仅针对单一模态输出,缺乏适应性条件机制,难以满足复杂场景下的可控生成需求。论文的关键在于提出CoSimGen框架,基于扩散模型实现可控的同时图像与掩码生成。其创新点包括通过语义文本提示、上下文空间嵌入以及时间步频谱嵌入实现直观的条件控制,并结合对比三元组损失、扩散损失和对抗损失以增强可控制性和训练效率。此外,低分辨率输出通过超分辨率技术提升至高分辨率,确保生成结果的保真度与条件一致性。

链接: https://arxiv.org/abs/2503.19661
作者: Rupak Bose,Chinedu Innocent Nwoye,Aditya Bhat,Nicolas Padoy
机构: ICube (UMR7357 CNRS INSERM University of Strasbourg) (法国); IHU Strasbourg (法国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 14 figure, 2 tables, project page at this https URL

点击查看摘要

Abstract:The acquisition of annotated datasets with paired images and segmentation masks is a critical challenge in domains such as medical imaging, remote sensing, and computer vision. Manual annotation demands significant resources, faces ethical constraints, and depends heavily on domain expertise. Existing generative models often target single-modality outputs, either images or segmentation masks, failing to address the need for high-quality, simultaneous image-mask generation. Additionally, these models frequently lack adaptable conditioning mechanisms, restricting control over the generated outputs and limiting their applicability for dataset augmentation and rare scenario simulation. We propose CoSimGen, a diffusion-based framework for controllable simultaneous image and mask generation. Conditioning is intuitively achieved through (1) text prompts grounded in class semantics, (2) spatial embedding of context prompts to provide spatial coherence, and (3) spectral embedding of timestep information to model noise levels during diffusion. To enhance controllability and training efficiency, the framework incorporates contrastive triplet loss between text and class embeddings, alongside diffusion and adversarial losses. Initial low-resolution outputs 128 x 128 are super-resolved to 512 x 512, producing high-fidelity images and masks with strict adherence to conditions. We evaluate CoSimGen on metrics such as FID, KID, LPIPS, Class FID, Positive predicted value for image fidelity and semantic alignment of generated samples over 4 diverse datasets. CoSimGen achieves state-of-the-art performance across all datasets, achieving the lowest KID of 0.11 and LPIPS of 0.53 across datasets.
zh
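
摘要提到在文本嵌入与类别嵌入之间施加对比三元组损失。下面给出标准 triplet margin 损失的最小 PyTorch 示意,嵌入来源与 margin 取值均为假设,仅说明该损失的一般形式。

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin: float = 0.2):
    """anchor: 文本提示嵌入;positive: 同类类别嵌入;negative: 异类类别嵌入。
    目标:锚点到同类的距离比到异类的距离至少小 margin。"""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

anchor = F.normalize(torch.randn(16, 128), dim=-1)
pos = F.normalize(torch.randn(16, 128), dim=-1)
neg = F.normalize(torch.randn(16, 128), dim=-1)
print(triplet_loss(anchor, pos, neg).item())
```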

[CV-49] BiblioPage: A Dataset of Scanned Title Pages for Bibliographic Metadata Extraction ICDAR2025

【速读】:该论文旨在解决手动数字化书目元数据(bibliographic metadata)效率低下且劳动密集的问题,尤其是在格式高度可变的历史档案和现实世界档案中。尽管机器学习领域已取得进展,但缺乏专用的数据集阻碍了自动化进程。为填补这一空白,论文引入了BiblioPage数据集,该数据集包含来自14个捷克图书馆的约2,000个专著标题页,并以结构化书目元数据进行标注。每个标题页不仅标注了16个元数据属性(如标题、贡献者和出版信息),还提供了边界框形式的精确位置信息。

解决方案的关键在于结合目标检测模型(如YOLO和DETR)与基于Transformer的OCR技术,用于从扫描的标题页中提取结构化信息(最高达到52 mAP与59 F1),并进一步评估视觉大型语言模型(如Llama 3.2-Vision和GPT-4o)的性能,其中最佳模型的F1分数达到67。这为书目元数据的自动化提取提供了一个实际可用的基准。

链接: https://arxiv.org/abs/2503.19658
作者: Jan Kohút,Martin Dočekal,Michal Hradiš,Marek Vaško
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submitted to ICDAR2025 conference

点击查看摘要

Abstract:Manual digitization of bibliographic metadata is time consuming and labor intensive, especially for historical and real-world archives with highly variable formatting across documents. Despite advances in machine learning, the absence of dedicated datasets for metadata extraction hinders automation. To address this gap, we introduce BiblioPage, a dataset of scanned title pages annotated with structured bibliographic metadata. The dataset consists of approximately 2,000 monograph title pages collected from 14 Czech libraries, spanning a wide range of publication periods, typographic styles, and layout structures. Each title page is annotated with 16 bibliographic attributes, including title, contributors, and publication metadata, along with precise positional information in the form of bounding boxes. To extract structured information from this dataset, we evaluated object detection models such as YOLO and DETR combined with transformer-based OCR, achieving a maximum mAP of 52 and an F1 score of 59. Additionally, we assess the performance of various visual large language models, including Llama 3.2-Vision and GPT-4o, with the best model reaching an F1 score of 67. BiblioPage serves as a real-world benchmark for bibliographic metadata extraction, contributing to document understanding, document question answering, and document information extraction. Dataset and evaluation scripts are available at: this https URL
zh

[CV-50] RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models

【速读】:该论文试图解决Vision-Language Models (VLMs) 在红外视觉任务(RGB-Thermal Vision)评估方面的不足问题。现有评估主要局限于基于RGB图像的基准数据集,而缺乏针对红外图像理解能力的有效评估手段。此外,现有的可见光与红外图像数据集或任务特定性强,或缺乏高质量标注以支持严格的模型评估。为了解决这些问题,论文提出了RGB-Th-Bench,这是一个全面的评估框架,包含14个技能维度、超过1600个专家注释的Yes/No问题,并设计了两种准确性度量方法:标准的问题级准确率和更严格的技能级准确率,以确保对模型性能进行全面且鲁棒的评估。关键解决方案在于创建一个涵盖多技能维度的高质量RGB-Thermal图像对数据集及其对应的评估体系,同时揭示了当前VLMs在热成像理解上的局限性以及预训练数据中缺乏大规模应用特定且专家注释的热成像描述数据对这一重要因素的影响。

链接: https://arxiv.org/abs/2503.19654
作者: Mehdi Moshtaghi,Siavash H. Khajavi,Joni Pajarinen
机构: Aalto University (阿尔托大学); KTH Royal Institute of Technology (皇家理工学院); Detectium Oy (Detectium Oy)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce RGB-Th-Bench, the first benchmark designed to evaluate the ability of Vision-Language Models (VLMs) to comprehend RGB-Thermal image pairs. While VLMs have demonstrated remarkable progress in visual reasoning and multimodal understanding, their evaluation has been predominantly limited to RGB-based benchmarks, leaving a critical gap in assessing their capabilities in infrared vision tasks. Existing visible-infrared datasets are either task-specific or lack high-quality annotations necessary for rigorous model evaluation. To address these limitations, RGB-Th-Bench provides a comprehensive evaluation framework covering 14 distinct skill dimensions, with a total of 1,600+ expert-annotated Yes/No questions. The benchmark employs two accuracy metrics: a standard question-level accuracy and a stricter skill-level accuracy, which evaluates model robustness across multiple questions within each skill dimension. This design ensures a thorough assessment of model performance, including resilience to adversarial and hallucinated responses. We conduct extensive evaluations on 19 state-of-the-art VLMs, revealing significant performance gaps in RGB-Thermal understanding. Our results show that even the strongest models struggle with thermal image comprehension, with performance heavily constrained by their RGB-based capabilities. Additionally, the lack of large-scale application-specific and expert-annotated thermal-caption-pair datasets in pre-training is an important reason of the observed performance gap. RGB-Th-Bench highlights the urgent need for further advancements in multimodal learning to bridge the gap between visible and thermal image understanding. The dataset is available through this link, and the evaluation code will also be made publicly available.
zh
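
摘要定义的两种指标可以很直接地写成代码:问题级准确率逐题统计;技能级准确率更严格,要求同一技能维度下的所有问题全部答对才记 1 分。下面是一个示意实现(评测记录的数据结构为假设)。

```python
from collections import defaultdict

def question_level_acc(records):
    """records: [(skill, correct), ...],correct 为布尔值。"""
    return sum(c for _, c in records) / len(records)

def skill_level_acc(records):
    """同一技能的全部问题均答对,该技能才计为通过。"""
    by_skill = defaultdict(list)
    for skill, correct in records:
        by_skill[skill].append(correct)
    return sum(all(v) for v in by_skill.values()) / len(by_skill)

records = [("color", True), ("color", True),
           ("temperature", True), ("temperature", False),
           ("counting", True)]
print(question_level_acc(records))  # 0.8
print(skill_level_acc(records))     # 2/3 ≈ 0.667
```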

[CV-51] OpenSDI: Spotting Diffusion-Generated Images in the Open World

【速读】:该论文旨在解决开放世界环境中检测扩散生成图像(Diffusion-Generated Images)的挑战,定义为OpenSDI。为应对这一挑战,研究者提出了一个名为OpenSDI数据集(OpenSDID)的新基准,其特点在于利用大型视觉语言模型模拟开放世界的扩散操作,并包含全局与局部扩散模型操纵图像的检测和定位任务。论文的关键解决方案是提出了一种协同预训练模型(Synergizing Pretrained Models, SPM)方案,通过提示(prompting)和注意力机制(attending strategies)整合多个预训练基础模型,以增强在OpenSDI场景中的泛化能力。在此基础上,引入了MaskCLIP模型,将对比语言图像预训练(Contrastive Language-Image Pre-Training, CLIP)与掩码自编码器(Masked Autoencoder, MAE)相结合。实验结果表明,MaskCLIP在OpenSDID上的评估显著优于当前最先进的方法,在定位和检测任务中分别实现了14.23%(IoU)和2.05%(准确率)的相对改进。

链接: https://arxiv.org/abs/2503.19653
作者: Yabin Wang,Zhiwu Huang,Xiaopeng Hong
机构: Xi’an Jiaotong University (西安交通大学); University of Southampton (南安普顿大学); Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper identifies OpenSDI, a challenge for spotting diffusion-generated images in open-world settings. In response to this challenge, we define a new benchmark, the OpenSDI dataset (OpenSDID), which stands out from existing datasets due to its diverse use of large vision-language models that simulate open-world diffusion-based manipulations. Another outstanding feature of OpenSDID is its inclusion of both detection and localization tasks for images manipulated globally and locally by diffusion models. To address the OpenSDI challenge, we propose a Synergizing Pretrained Models (SPM) scheme to build up a mixture of foundation models. This approach exploits a collaboration mechanism with multiple pretrained foundation models to enhance generalization in the OpenSDI context, moving beyond traditional training by synergizing multiple pretrained models through prompting and attending strategies. Building on this scheme, we introduce MaskCLIP, an SPM-based model that aligns Contrastive Language-Image Pre-Training (CLIP) with Masked Autoencoder (MAE). Extensive evaluations on OpenSDID show that MaskCLIP significantly outperforms current state-of-the-art methods for the OpenSDI challenge, achieving remarkable relative improvements of 14.23% in IoU (14.11% in F1) and 2.05% in accuracy (2.38% in F1) compared to the second-best model in localization and detection tasks, respectively. Our dataset and code are available at this https URL.
zh

[CV-52] Show or Tell? Effectively prompting Vision-Language Models for semantic segmentation

【速读】:该论文试图解决如何有效提示(prompt)视觉语言模型(Vision-Language Models, VLMs)以完成语义分割任务的问题。论文通过系统评估近期几种模型在MESS分布外数据集集合上的分割性能,发现VLMs相比专为特定分割任务训练的专家模型(specialist models)在Intersection-over-Union指标上平均落后约30%。论文的关键解决方案是引入一种可扩展的提示方案——少样本提示语义分割(few-shot prompted semantic segmentation),它结合开放词汇分割(open-vocabulary segmentation)与少样本学习(few-shot learning)的思想,并提出PromptMatcher方法,以免训练的方式融合文本提示和视觉提示,取得当前最优结果,分别以2.5%和3.5%的优势超越最佳文本提示VLM与最佳视觉提示VLM。此外,研究发现文本提示和视觉提示具有互补性:一种模态失败的大量样本恰能被另一种模态解决,若能预判最有效的提示模态,可带来11%的性能提升。

链接: https://arxiv.org/abs/2503.19647
作者: Niccolo Avogaro,Thomas Frick,Mattia Rigotti,Andrea Bartezzaghi,Filip Janicki,Cristiano Malossi,Konrad Schindler,Roy Assaf
机构: IBM Research (IBM研究); ETH Zurich (瑞士联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (VLMs) are increasingly being regarded as foundation models that can be instructed to solve diverse tasks by prompting, without task-specific training. We examine the seemingly obvious question: how to effectively prompt VLMs for semantic segmentation. To that end, we systematically evaluate the segmentation performance of several recent models guided by either text or visual prompts on the out-of-distribution MESS dataset collection. We introduce a scalable prompting scheme, few-shot prompted semantic segmentation, inspired by open-vocabulary segmentation and few-shot learning. It turns out that VLMs lag far behind specialist models trained for a specific segmentation task, by about 30% on average on the Intersection-over-Union metric. Moreover, we find that text prompts and visual prompts are complementary: each one of the two modes fails on many examples that the other one can solve. Our analysis suggests that being able to anticipate the most effective prompt modality can lead to a 11% improvement in performance. Motivated by our findings, we propose PromptMatcher, a remarkably simple training-free baseline that combines both text and visual prompts, achieving state-of-the-art results outperforming the best text-prompted VLM by 2.5%, and the top visual-prompted VLM by 3.5% on few-shot prompted semantic segmentation.
zh
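
摘要称 PromptMatcher 是免训练地结合文本提示与视觉提示的基线,但未给出具体融合规则。下面给出一种假设性的实现思路:让两种模态各自产生候选掩码,选取跨模态一致性(IoU)最高的一对并取其交集。这只是演示"双模态结果融合"的概念草图,并非论文实现。

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def combine_prompts(text_masks, visual_masks):
    """text_masks / visual_masks: 两种提示模态各自产生的候选掩码列表。
    选取跨模态 IoU 最高的一对候选,返回其交集作为最终预测(规则为假设)。"""
    best, best_pair = -1.0, None
    for tm in text_masks:
        for vm in visual_masks:
            s = iou(tm, vm)
            if s > best:
                best, best_pair = s, (tm, vm)
    tm, vm = best_pair
    return np.logical_and(tm, vm), best

rng = np.random.default_rng(0)
t_masks = [rng.random((32, 32)) > 0.5 for _ in range(2)]
v_masks = [rng.random((32, 32)) > 0.5 for _ in range(2)]
mask, score = combine_prompts(t_masks, v_masks)
print(mask.shape, f"agreement IoU = {score:.3f}")
```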

[CV-53] Burst Image Super-Resolution with Mamba

【速读】:该论文致力于解决快速连续拍摄的多帧低分辨率图像到单帧高分辨率图像的超分辨问题(Burst Image Super-Resolution, BISR)。传统方法基于卷积网络或Transformer架构,但后者存在自注意力机制计算复杂度呈二次增长的问题。论文提出的关键解决方案是引入Mamba模块构建BurstMamba架构:通过解耦任务为专门的空间模块(用于关键帧超分辨)和时间模块(用于亚像素先验提取),在保持计算效率的同时实现多帧信息的有效融合。此外,论文还提出了两种创新策略:一是基于光流的序列化方法,在状态更新时对齐burst序列以保留亚像素细节;二是基于小波变换的状态空间更新规则重参数化,优先传递高频特征以优化多帧到关键帧的信息传递。这些方法使所提框架在SyntheticSR、RealBSR-RGB和RealBSR-RAW等公开基准数据集上达到了当前最优性能(SOTA)。

链接: https://arxiv.org/abs/2503.19634
作者: Ozan Unal,Steven Marty,Dengxin Dai
机构: Computer Vision Lab, Huawei Research Center Zurich (华为研究中东欧中心计算机视觉实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Burst image super-resolution (BISR) aims to enhance the resolution of a keyframe by leveraging information from multiple low-resolution images captured in quick succession. In the deep learning era, BISR methods have evolved from fully convolutional networks to transformer-based architectures, which, despite their effectiveness, suffer from the quadratic complexity of self-attention. We see Mamba as the next natural step in the evolution of this field, offering a comparable global receptive field and selective information routing with only linear time complexity. In this work, we introduce BurstMamba, a Mamba-based architecture for BISR. Our approach decouples the task into two specialized branches: a spatial module for keyframe super-resolution and a temporal module for subpixel prior extraction, striking a balance between computational efficiency and burst information integration. To further enhance burst processing with Mamba, we propose two novel strategies: (i) optical flow-based serialization, which aligns burst sequences only during state updates to preserve subpixel details, and (ii) a wavelet-based reparameterization of the state-space update rules, prioritizing high-frequency features for improved burst-to-keyframe information passing. Our framework achieves SOTA performance on public benchmarks of SyntheticSR, RealBSR-RGB, and RealBSR-RAW.
zh

[CV-54] DynOPETs: A Versatile Benchmark for Dynamic Object Pose Estimation and Tracking in Moving Camera Scenarios

【速读】:该论文旨在解决动态场景中移动相机捕捉的物体位姿估计数据集匮乏的问题,这严重阻碍了鲁棒位姿估计模型的发展与评估。论文的关键解决方案在于提出了一个新的数据集DynOPETs以及专门的数据采集与标注流水线,用于无约束环境下的物体位姿估计与跟踪。其创新性的标注方法结合了位姿估计和位姿跟踪技术生成伪标签,并通过位姿图优化进行精炼,从而为从移动相机观察到的动态物体提供精确的位姿标注。这一方案的核心在于高效且精准的标注流程,它显著提升了数据质量和可用性,为相关领域的研究提供了有力支持。

链接: https://arxiv.org/abs/2503.19625
作者: Xiangting Meng,Jiaqi Yang,Mingshu Chen,Chenxin Yan,Yujiao Shi,Wenchao Ding,Laurent Kneip
机构: ShanghaiTech University, Mobile Peception Lab. (上海科技大学,移动感知实验室); Fudan University, Multi-Agent Robotic Systems Lab. (复旦大学,多智能体机器人系统实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In the realm of object pose estimation, scenarios involving both dynamic objects and moving cameras are prevalent. However, the scarcity of corresponding real-world datasets significantly hinders the development and evaluation of robust pose estimation models. This is largely attributed to the inherent challenges in accurately annotating object poses in dynamic scenes captured by moving cameras. To bridge this gap, this paper presents a novel dataset DynOPETs and a dedicated data acquisition and annotation pipeline tailored for object pose estimation and tracking in such unconstrained environments. Our efficient annotation method innovatively integrates pose estimation and pose tracking techniques to generate pseudo-labels, which are subsequently refined through pose graph optimization. The resulting dataset offers accurate pose annotations for dynamic objects observed from moving cameras. To validate the effectiveness and value of our dataset, we perform comprehensive evaluations using 18 state-of-the-art methods, demonstrating its potential to accelerate research in this challenging domain. The dataset will be made publicly available to facilitate further exploration and advancement in the field.
zh

[CV-55] Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark Analysis and Mitigation

【速读】:该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在视频模态中的幻觉问题(hallucination),即生成看似正确但实际错误的响应,从而限制其可靠性和适用性。论文的关键解决方案是提出了一种基于视频理解任务的全面评估基准 HAVEN,并通过监督推理微调(Supervised Reasoning Fine-Tuning, SRFT)和直接偏好优化(Direct Preference Optimization, TDPO)的方法设计了一种视频思维模型来缓解 LMMs 的幻觉问题。其中,SRFT 提升了模型的推理能力,而 TDPO 减少了思维过程中的幻觉现象。实验结果表明,该方法在幻觉评估的准确性上提升了 7.65%,并将偏差分数降低了 4.5%。

链接: https://arxiv.org/abs/2503.19622
作者: Hongcheng Gao,Jiashu Qu,Jingyi Tang,Baolong Bi,Yue Liu,Hongyu Chen,Li Liang,Li Su,Qingming Huang
机构: University of Chinese Academy of Sciences (中国科学院大学); University of Cincinnati (辛辛那提大学); Key Lab of Intell. Info. Process., Inst. of Comput. Tech., CAS (中科院计算技术研究所智能信息处理重点实验室); National University of Singapore (新加坡国立大学); Beijing Jiaotong University (北京交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The hallucination of large multimodal models (LMMs), providing responses that appear correct but are actually incorrect, limits their reliability and applicability. This paper aims to study the hallucination problem of LMMs in video modality, which is dynamic and more challenging compared to static modalities like images and text. From this motivation, we first present a comprehensive benchmark termed HAVEN for evaluating hallucinations of LMMs in video understanding tasks. It is built upon three dimensions, i.e., hallucination causes, hallucination aspects, and question formats, resulting in 6K questions. Then, we quantitatively study 7 influential factors on hallucinations, e.g., duration time of videos, model sizes, and model reasoning, via experiments of 16 LMMs on the presented benchmark. In addition, inspired by recent thinking models like OpenAI o1, we propose a video-thinking model to mitigate the hallucinations of LMMs via supervised reasoning fine-tuning (SRFT) and direct preference optimization (TDPO)-- where SRFT enhances reasoning capabilities while TDPO reduces hallucinations in the thinking process. Extensive experiments and analyses demonstrate the effectiveness. Remarkably, it improves the baseline by 7.65% in accuracy on hallucination evaluation and reduces the bias score by 4.5%. The code and data are public at this https URL.
zh

[CV-56] SACB-Net: Spatial-awareness Convolutions for Medical Image Registration CVPR2025

【速读】:该论文旨在解决现有基于深度学习的图像配准方法在捕捉特征图非局部区域的空间变化信息方面的不足,这一不足源于这些方法依赖于空间共享的卷积核。这种局限性导致变形场估计次优。为了解决此问题,论文提出了一种名为3D Spatial-Awareness Convolution Block (SACB) 的模块,其关键在于通过利用特征相似性估计特征图内的空间聚类,并随后在不同区域参数化自适应卷积核。这种自适应机制能够生成针对空间变化定制的卷积核(包括权重和偏置),从而使网络能够有效捕获空间变化的信息。基于SACB,作者进一步构建了一个金字塔流估计器(命名为SACB-Net),以促进多尺度流合成,特别适用于处理大变形情况。实验结果验证了SACB的有效性和SACB-Net相对于当前最先进的基于学习的配准方法的优势。

链接: https://arxiv.org/abs/2503.19592
作者: Xinxing Cheng,Tianyang Zhang,Wenqi Lu,Qingjie Meng,Alejandro F. Frangi,Jinming Duan
机构: School of Computer Science, University of Birmingham (伯明翰大学), UK; Department of Computing and Mathematics, Manchester Metropolitan University (曼彻斯特城市大学), UK; Division of Informatics, Imaging and Data Sciences, University of Manchester (曼彻斯特大学), UK; Centre for Computational Imaging and Modelling in Medicine, University of Manchester (曼彻斯特大学), UK; Department of Computing, Imperial College London (帝国理工学院), UK
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025

点击查看摘要

Abstract:Deep learning-based image registration methods have shown state-of-the-art performance and rapid inference speeds. Despite these advances, many existing approaches fall short in capturing spatially varying information in non-local regions of feature maps due to the reliance on spatially-shared convolution kernels. This limitation leads to suboptimal estimation of deformation fields. In this paper, we propose a 3D Spatial-Awareness Convolution Block (SACB) to enhance the spatial information within feature representations. Our SACB estimates the spatial clusters within feature maps by leveraging feature similarity and subsequently parameterizes the adaptive convolution kernels across diverse regions. This adaptive mechanism generates the convolution kernels (weights and biases) tailored to spatial variations, thereby enabling the network to effectively capture spatially varying information. Building on SACB, we introduce a pyramid flow estimator (named SACB-Net) that integrates SACBs to facilitate multi-scale flow composition, particularly addressing large deformations. Experimental results on the brain IXI and LPBA datasets as well as Abdomen CT datasets demonstrate the effectiveness of SACB and the superiority of SACB-Net over the state-of-the-art learning-based registration methods. The code is available at this https URL .
zh

[CV-57] Video Anomaly Detection with Contours - A Study

【速读】:该论文试图解决基于人体姿态的视频异常检测(Pose-based Video Anomaly Detection)中的问题,重点关注如何利用正常人体行为的重复运动模式来识别异常事件。不同于传统方法依赖于人体骨架表示,本文探索了使用二维轮廓(2D contours)来学习正常行为模式的可能性,并假设这种转变能够为未来研究覆盖更多物体类别提供机会。论文的关键在于将问题表述为回归和分类任务,并提出两种不同的轮廓数据表示技术;同时,所有方法均基于浅层神经网络以降低计算复杂度,最终在六个基准数据集上验证了所提方案的有效性。

链接: https://arxiv.org/abs/2503.19588
作者: Mia Siemon,Ivan Nikolov,Thomas B. Moeslund,Kamal Nasrollahi
机构: Milestone Systems A/S (Milestone Systems A/S); Department of Architecture, Design and Media Technology, Aalborg University (建筑、设计与媒体技术系,奥尔堡大学); Milestone Systems A/S (Milestone Systems A/S)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In Pose-based Video Anomaly Detection prior art is rooted on the assumption that abnormal events can be mostly regarded as a result of uncommon human behavior. Opposed to utilizing skeleton representations of humans, however, we investigate the potential of learning recurrent motion patterns of normal human behavior using 2D contours. Keeping all advantages of pose-based methods, such as increased object anonymization, the shift from human skeletons to contours is hypothesized to leave the opportunity to cover more object categories open for future research. We propose formulating the problem as a regression and a classification task, and additionally explore two distinct data representation techniques for contours. To further reduce the computational complexity of Pose-based Video Anomaly Detection solutions, all methods in this study are based on shallow Neural Networks from the field of Deep Learning, and evaluated on the three most prominent benchmark datasets within Video Anomaly Detection and their human-related counterparts, totaling six datasets. Our results indicate that this novel perspective on Pose-based Video Anomaly Detection marks a promising direction for future research.
zh

[CV-58] SINR: Sparsity Driven Compressed Implicit Neural Representations

【速读】:该论文旨在解决隐式神经表示(INRs)信号压缩效率较低的问题。现有方法主要依赖于直接量化与熵编码或基于可学习变换的潜在码生成,其性能受限于所采用的量化和编码方案。论文提出了一种名为SINR的新算法,其关键在于利用INR权重形成的向量空间中的模式,并通过字典内的高维稀疏码来压缩这些向量空间。进一步分析表明,用于生成稀疏码的字典原子无需被学习或传输即可成功恢复INR权重。该方法能够与任何现有的基于INR的信号压缩技术集成,并显著降低INRs在多种配置下的存储需求,同时保持高质量解码能力,适用于图像、占用场以及神经辐射场等多种数据模态。

链接: https://arxiv.org/abs/2503.19576
作者: Dhananjaya Jayasundara,Sudarshan Rajagopalan,Yasiru Ranasinghe,Trac D. Tran,Vishal M. Patel
机构: Johns Hopkins University (约翰斯·霍普kins大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Implicit Neural Representations (INRs) are increasingly recognized as a versatile data modality for representing discretized signals, offering benefits such as infinite query resolution and reduced storage requirements. Existing signal compression approaches for INRs typically employ one of two strategies: 1. direct quantization with entropy coding of the trained INR; 2. deriving a latent code on top of the INR through a learnable transformation. Thus, their performance is heavily dependent on the quantization and entropy coding schemes employed. In this paper, we introduce SINR, an innovative compression algorithm that leverages the patterns in the vector spaces formed by weights of INRs. We compress these vector spaces using a high-dimensional sparse code within a dictionary. Further analysis reveals that the atoms of the dictionary used to generate the sparse code do not need to be learned or transmitted to successfully recover the INR weights. We demonstrate that the proposed approach can be integrated with any existing INR-based signal compression technique. Our results indicate that SINR achieves substantial reductions in storage requirements for INRs across various configurations, outperforming conventional INR-based compression baselines. Furthermore, SINR maintains high-quality decoding across diverse data modalities, including images, occupancy fields, and Neural Radiance Fields.
zh

[CV-59] Improved tissue sodium concentration quantification in breast cancer by reducing partial volume effects: a preliminary study

【速读】:该论文试图解决钠(23Na)MRI中因部分容积效应(PVE)导致的组织钠浓度(TSC)定量误差问题。解决方案的关键在于利用基于压缩感知(Compressed Sensing, CS)的先进图像重建算法,包括加权全变分(wTV)、方向全变分(dTV)、解剖学引导全变分(AG-TV)以及自适应组合(ADC)重建方法,以改善图像质量和提高TSC量化准确性。研究结果显示不同重建方法在肿瘤显影和TSC估计上的差异可能与它们减少PVE的鲁棒性有关。

链接: https://arxiv.org/abs/2503.19570
作者: Olgica Zaric,Carmen Leser,Vladimir Juras,Alex Farr,Malina Gologan,Stanislas Rapacchi,Laura Villazan Garcia,Christian Singer,Siegfried Trattnig,Christian Licht,Ramona Woitek
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Introduction: In sodium (23Na) MRI, partial volume effects (PVE) are one of the most common causes of errors in the quantification of tissue sodium concentration (TSC) in vivo. Advanced image reconstruction algorithms, such as compressed sensing (CS), have been shown to potentially reduce PVE. Therefore, we investigated the feasibility of CS-based methods for image quality and TSC quantification accuracy improvement in patients with breast cancer (BC). Subjects and Methods: Three healthy participants and 12 female participants with BC were examined on a 7T MRI scanner in this study. We reconstructed 23Na-MRI images using the weighted total variation (wTV) and directional total variation (dTV), anatomically guided total variation (AG-TV), and adaptive combine (ADC) reconstruction and performed image quality assessment. We evaluated agreement in tumor volumes delineated on sodium data using the Dice score and performed TSC quantification for different image reconstruction approaches. Results: All methods provided sodium images of the breast with good quality. The mean Dice scores for wTV, dTV, and AG-TV were 65%, 72%, and 75%, respectively. In the breast tumors, average TSC values were 83.0, 72.0, 80.0, and 84.0 mmol/L, respectively. There was a significant difference between dTV and wTV (p < 0.001), as well as between dTV and AG-TV (p < 0.001) and dTV and ADC algorithm (p < 0.001). Conclusion: The results of this study showed that there are differences in tumor appearance and TSC estimations that may depend on the type of image reconstruction and parameters used, most likely due to differences in their robustness in reducing PVE.
zh

[CV-60] Dance Like a Chicken: Low-Rank Stylization for Human Motion Diffusion MDM

【速读】:该论文致力于解决文本到运动生成模型在处理细腻风格属性(如“Chicken”风格)时面临的挑战,由于风格特定数据的稀缺性,现有方法通常通过拉近生成先验与参考风格的方式生成动作,但容易导致分布外的低质量结果。论文的关键在于提出LoRA-MDM框架,其核心思想是通过调整生成先验以包含目标风格,同时保持整体分布不变,而非在生成过程中逐个修改动作,从而实现复杂动作的泛化与编辑能力。LoRA-MDM仅需少量样本即可学习适应先验以包含参考风格,并能在不同文本提示下利用该风格进行生成,通过低秩适应在语义上有意义地移动运动流形,实现在参考样本中未见动作的真实风格融合,同时保留分布结构以支持风格混合和运动编辑等高级操作。
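
LoRA-MDM 的关键是冻结生成先验、只训练低秩增量。下面用一个最小的 LoRA 线性层示意这种“保持先验分布、低秩适配”的一般做法(层结构、秩与缩放系数均为假设,并非论文官方代码):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """冻结原线性层,仅训练低秩增量 B@A,对应用少量风格样本适配运动先验的思路。"""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # 保持先验分布不被破坏
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # 初始为零,训练前不改变输出
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512), rank=4)
out = layer(torch.randn(2, 16, 512))        # (batch, tokens, dim)
print(out.shape)
```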

链接: https://arxiv.org/abs/2503.19557
作者: Haim Sawdayee,Chuan Guo,Guy Tevet,Bing Zhou,Jian Wang,Amit H. Bermano
机构: Tel Aviv University (特拉维夫大学); Snap Inc. (Snap Inc.); Tel Aviv University (特拉维夫大学); Snap Inc. (Snap Inc.); Snap Inc. (Snap Inc.); Tel Aviv University (特拉维夫大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page at this https URL

点击查看摘要

Abstract:Text-to-motion generative models span a wide range of 3D human actions but struggle with nuanced stylistic attributes such as a “Chicken” style. Due to the scarcity of style-specific data, existing approaches pull the generative prior towards a reference style, which often results in out-of-distribution low quality generations. In this work, we introduce LoRA-MDM, a lightweight framework for motion stylization that generalizes to complex actions while maintaining editability. Our key insight is that adapting the generative prior to include the style, while preserving its overall distribution, is more effective than modifying each individual motion during generation. Building on this idea, LoRA-MDM learns to adapt the prior to include the reference style using only a few samples. The style can then be used in the context of different textual prompts for generation. The low-rank adaptation shifts the motion manifold in a semantically meaningful way, enabling realistic style infusion even for actions not present in the reference samples. Moreover, preserving the distribution structure enables advanced operations such as style blending and motion editing. We compare LoRA-MDM to state-of-the-art stylized motion generation methods and demonstrate a favorable balance between text fidelity and style consistency.
zh

[CV-61] Practical Fine-Tuning of Autoregressive Models on Limited Handwritten Texts ICDAR2025

【速读】:该论文旨在解决光学字符识别(OCR)模型在用户逐步纠正自动识别结果以获得最终转录文本的过程中,如何有效地进行渐进式适应的问题。论文的关键在于利用最先进的基于变换器的模型支持这种适应过程,并通过少量标注数据(如仅16行)即可启动微调,从而逐渐减轻标注员的工作负担。研究还探讨了模型组件的作用,提出了可靠的停止准则,并展示了通过基于置信度的选择策略可以将标注成本减半,同时保持相同的性能表现。
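
文中“基于置信度挑选信息量最大的文本行,从而将标注成本减半”的策略,可以用几行代码说明(示意实现;置信度数值与标注预算均为假设):

```python
import numpy as np

def select_lines_for_annotation(confidences, budget):
    """按置信度升序挑选最不确定的文本行交给标注员,其余行沿用模型转写。
    confidences: 每行 OCR 识别的平均字符置信度(0~1);budget: 可标注的行数。"""
    order = np.argsort(confidences)          # 置信度最低的行信息量最大
    return order[:budget]

conf = np.random.default_rng(0).uniform(0.5, 1.0, size=256)  # 假设 256 行的置信度
picked = select_lines_for_annotation(conf, budget=128)        # 标注预算减半
print(f"送标行数: {len(picked)}, 被选行平均置信度: {conf[picked].mean():.3f}")
```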

链接: https://arxiv.org/abs/2503.19546
作者: Jan Kohút,Michal Hradiš
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ICDAR2025 conference

点击查看摘要

Abstract:A common use case for OCR applications involves users uploading documents and progressively correcting automatic recognition to obtain the final transcript. This correction phase presents an opportunity for progressive adaptation of the OCR model, making it crucial to adapt early, while ensuring stability and reliability. We demonstrate that state-of-the-art transformer-based models can effectively support this adaptation, gradually reducing the annotator’s workload. Our results show that fine-tuning can reliably start with just 16 lines, yielding a 10% relative improvement in CER, and scale up to 40% with 256 lines. We further investigate the impact of model components, clarifying the roles of the encoder and decoder in the fine-tuning process. To guide adaptation, we propose reliable stopping criteria, considering both direct approaches and global trend analysis. Additionally, we show that OCR models can be leveraged to cut annotation costs by half through confidence-based selection of informative lines, achieving the same performance with fewer annotations.
zh

[CV-62] Tiling artifacts and trade-offs of feature normalization in the segmentation of large biological images

【速读】:该论文旨在解决显微镜图像分割中的拼接伪影问题,这是显微成像、医学影像或遥感领域中非常大的图像分割的常见挑战。论文指出,尽管滑动窗口推理理论上可以实现无缝拼接的预测结果,但实践中许多流行的处理流程仍然受到拼接伪影的影响。研究发现,这些问题的根本原因在于神经网络中的归一化层。为了解决这一问题,论文提出了检测归一化问题的指标,并探讨了无伪影与高质量预测之间的权衡,使用了三个不同的显微镜数据集作为示例。解决方案的关键在于提出使用BatchRenorm作为最合适的归一化策略,这种方法能够有效消除拼接伪影并提升模型在新数据集上的迁移性能,从而提高训练好的网络的可重用性。
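
为说明归一化层依赖逐 tile 统计量时滑窗推理为何会产生拼接伪影,下面给出一个最小的重叠滑窗推理示意(其中 fake_model 用“逐 tile 标准化”模拟这一问题,尺寸与重叠量均为假设):

```python
import numpy as np

def sliding_window_predict(image, model, tile=256, overlap=32):
    """重叠滑窗推理并做平均拼接;若网络内部归一化依赖每个 tile 的统计量,
    接缝处仍会出现伪影(这正是论文建议改用 BatchRenorm 的原因)。"""
    H, W = image.shape
    step = tile - overlap
    out = np.zeros((H, W)); weight = np.zeros((H, W))
    for y in range(0, max(H - overlap, 1), step):
        for x in range(0, max(W - overlap, 1), step):
            patch = image[y:y+tile, x:x+tile]
            pred = model(patch)                     # 假设 model 返回同形状的预测
            out[y:y+patch.shape[0], x:x+patch.shape[1]] += pred
            weight[y:y+patch.shape[0], x:x+patch.shape[1]] += 1
    return out / np.maximum(weight, 1)

# 用“逐 tile 标准化”的假模型演示统计量漂移如何制造拼接差异
fake_model = lambda p: (p - p.mean()) / (p.std() + 1e-6)
img = np.random.default_rng(0).normal(size=(600, 600))
print(sliding_window_predict(img, fake_model).shape)
```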

链接: https://arxiv.org/abs/2503.19545
作者: Elena Buglakova,Anwai Archit,Edoardo D’Imprima,Julia Mahamid,Constantin Pape,Anna Kreshuk
机构: European Molecular Biology Laboratory (欧洲分子生物学实验室), Heidelberg; Institute of Computer Science, University of Göttingen (哥廷根大学计算机科学研究所); Cluster of Excellence ‘Multiscale Bioimaging: from Molecular Machines to Networks of Excitable Cells‘ (MBExC) (卓越集群“多尺度生物成像:从分子机器到兴奋性细胞网络”), Georg-August-University Göttingen (乔治-奥古斯都-哥廷根大学); IRCCS Humanitas Research Hospital (IRCCS人类研究医院), Milan (米兰)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Segmentation of very large images is a common problem in microscopy, medical imaging or remote sensing. The problem is usually addressed by sliding window inference, which can theoretically lead to seamlessly stitched predictions. However, in practice many of the popular pipelines still suffer from tiling artifacts. We investigate the root cause of these issues and show that they stem from the normalization layers within the neural networks. We propose indicators to detect normalization issues and further explore the trade-offs between artifact-free and high-quality predictions, using three diverse microscopy datasets as examples. Finally, we propose to use BatchRenorm as the most suitable normalization strategy, which effectively removes tiling artifacts and enhances transfer performance, thereby improving the reusability of trained networks for new datasets.
zh

[CV-63] Scene-agnostic Pose Regression for Visual Localization CVPR2025

【速读】:该论文旨在解决视觉定位任务中三种主流方法(Absolute Pose Regression, Relative Pose Regression 和 Visual Odometry)在适应未知环境时各自存在的局限性。Absolute Pose Regression 虽然能够预测6D相机位姿但缺乏对未知环境的适应能力;Relative Pose Regression 虽然泛化性能更好但需要依赖大规模图像检索数据库;而 Visual Odometry 在未见过的环境中表现良好但存在开放轨迹下的累积误差问题。为应对这一困境,论文引入了一种新的任务——Scene-agnostic Pose Regression (SPR),其关键在于通过一种灵活的方式实现精确的姿态回归,同时无需重新训练或依赖数据库。为此,作者构建了一个大规模数据集 360SPR,并提出了 SPR-Mamba 模型以双分支方式解决 SPR 问题。实验结果表明,所提出的 SPR 方法及其模型在未知场景下显著优于其他方法。

链接: https://arxiv.org/abs/2503.19543
作者: Junwei Zheng,Ruiping Liu,Yufan Chen,Zhenfang Chen,Kailun Yang,Jiaming Zhang,Rainer Stiefelhagen
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); ETH Zurich (瑞士苏黎世联邦理工学院); Hunan University (湖南大学); MIT-IBM Watson AI Lab (麻省理工学院-IBM Watson人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025. Project page: this https URL

点击查看摘要

Abstract:Absolute Pose Regression (APR) predicts 6D camera poses but lacks the adaptability to unknown environments without retraining, while Relative Pose Regression (RPR) generalizes better yet requires a large image retrieval database. Visual Odometry (VO) generalizes well in unseen environments but suffers from accumulated error in open trajectories. To address this dilemma, we introduce a new task, Scene-agnostic Pose Regression (SPR), which can achieve accurate pose regression in a flexible way while eliminating the need for retraining or databases. To benchmark SPR, we created a large-scale dataset, 360SPR, with over 200K photorealistic panoramas, 3.6M pinhole images and camera poses in 270 scenes at three different sensor heights. Furthermore, a SPR-Mamba model is initially proposed to address SPR in a dual-branch manner. Extensive experiments and studies demonstrate the effectiveness of our SPR paradigm, dataset, and model. In the unknown scenes of both 360SPR and 360Loc datasets, our method consistently outperforms APR, RPR and VO. The dataset and code are available at this https URL.
zh

[CV-64] One Framework to Rule Them All: Unifying RL-Based and RL-Free Methods in RLHF

【速读】:本文旨在解决基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)和大型推理模型(Large Reasoning Models, LRMs)中的方法设计问题。论文的关键在于重新审视多种基于强化学习(RL-based)和非基于强化学习(RL-free)的算法,并通过神经结构化多臂老虎机预测(neural structured bandit prediction)的视角提供一个清晰的概念框架,揭示这些看似独立的方法之间的深层联系。此外,通过在完整的强化学习背景下推导标准RLHF目标,证明其与神经结构化多臂老虎机预测的等价性,并通过对近端策略优化(Proximal Policy Optimization, PPO)原理的再分析,识别需要调整的领域,最终引入广义强化优化(Generalized Reinforce Optimization, GRO)框架,实现RL-based和RL-free方法在RLHF中的无缝集成。
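
在“把 RLHF 视作神经结构化多臂老虎机预测”的视角下,最基础的更新形式是带基线的 REINFORCE。以下示意只演示该视角本身,并非论文提出的 GRO 算法(奖励函数、维度与超参均为假设):

```python
import torch
import torch.nn.functional as F

# 状态为提示,动作为完整回复,奖励一次性给出:最简的带基线 REINFORCE 更新
torch.manual_seed(0)
policy = torch.nn.Linear(16, 4)            # 4 个候选“臂”(回复)
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

for step in range(100):
    prompt = torch.randn(32, 16)           # 一批提示的特征
    dist = torch.distributions.Categorical(logits=policy(prompt))
    arm = dist.sample()
    reward = (arm == 2).float()            # 假设奖励模型偏好第 2 个臂
    baseline = reward.mean()               # 均值基线降低方差
    loss = -(dist.log_prob(arm) * (reward - baseline)).mean()
    opt.zero_grad(); loss.backward(); opt.step()

probs = F.softmax(policy(torch.randn(512, 16)), dim=-1)
print("臂 2 的平均选择概率:", probs[:, 2].mean().item())
```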

链接: https://arxiv.org/abs/2503.19523
作者: Xin Cai
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this article, we primarily examine a variety of RL-based and RL-free methods designed to address Reinforcement Learning from Human Feedback (RLHF) and Large Reasoning Models (LRMs). We begin with a concise overview of the typical steps involved in RLHF and LRMs. Next, we reinterpret several RL-based and RL-free algorithms through the perspective of neural structured bandit prediction, providing a clear conceptual framework that uncovers a deeper connection between these seemingly distinct approaches. Following this, we briefly review some core principles of reinforcement learning, drawing attention to an often-overlooked aspect in existing RLHF studies. This leads to a detailed derivation of the standard RLHF objective within a full RL context, demonstrating its equivalence to neural structured bandit prediction. Finally, by reinvestigating the principles behind Proximal Policy Optimization (PPO), we pinpoint areas needing adjustment, which culminates in the introduction of the Generalized Reinforce Optimization (GRO) framework, seamlessly integrating RL-based and RL-free methods in RLHF. We look forward to the community’s efforts to empirically validate GRO and invite constructive feedback.
zh

[CV-65] RoboFlamingo-Plus: Fusion of Depth and RGB Perception with Vision-Language Models for Enhanced Robotic Manipulation

【速读】:本文针对机器人技术在复杂多模态交互与操作任务中的挑战,致力于解决在三维环境中融合RGB与深度信息以及依据语言指令执行任务的问题。现有方法虽有所进展,但在深度信息处理及语言引导的任务执行方面仍存在不足。为应对这些挑战,论文提出RoboFlamingo-Plus,其关键在于将深度数据融入视觉-语言模型(Vision-Language Models, VLMs),并通过结合预训练的视觉Transformer(Vision Transformer, ViT)与重采样技术实现RGB与深度信息的精细融合,并利用跨注意力机制优化特征整合,以更好地关联语言线索。这种创新性的输入适配与特征提取方式显著提升了机器人在复杂环境下的操作性能,使RoboFlamingo-Plus相比现有方法提升了10%-20%的操纵能力。
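
文中用交叉注意力机制整合 RGB 与深度特征;其最小形式可以直接用 PyTorch 的多头注意力表达(token 数与维度为假设,非论文模块本身):

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
rgb = torch.randn(2, 49, 256)     # RGB 分支的视觉 token(假设为 7x7 特征图展平)
depth = torch.randn(2, 49, 256)   # 深度分支经重采样器提取后的 token
fused, _ = attn(query=rgb, key=depth, value=depth)  # 以 RGB 为查询、深度为键值做跨模态整合
print(fused.shape)                # torch.Size([2, 49, 256])
```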

链接: https://arxiv.org/abs/2503.19510
作者: Sheng Wang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As robotic technologies advance towards more complex multimodal interactions and manipulation tasks, the integration of advanced Vision-Language Models (VLMs) has become a key driver in the field. Despite progress with current methods, challenges persist in fusing depth and RGB information within 3D environments and executing tasks guided by linguistic instructions. In response to these challenges, we have enhanced the existing RoboFlamingo framework by introducing RoboFlamingo-Plus, which incorporates depth data into VLMs to significantly improve robotic manipulation performance. Our research achieves a nuanced fusion of RGB and depth information by integrating a pre-trained Vision Transformer (ViT) with a resampling technique, closely aligning this combined data with linguistic cues for superior multimodal understanding. The novelty of RoboFlamingo-Plus lies in its adaptation of inputs for depth data processing, leveraging a pre-trained resampler for depth feature extraction, and employing cross-attention mechanisms for optimal feature integration. These improvements allow RoboFlamingo-Plus to not only deeply understand 3D environments but also easily perform complex, language-guided tasks in challenging settings. Experimental results show that RoboFlamingo-Plus boosts robotic manipulation by 10-20% over current methods, marking a significant advancement. Codes and model weights are public at RoboFlamingo-Plus.
zh

[CV-66] Improved Alignment of Modalities in Large Vision Language Models

【速读】:该论文旨在解决跨多样化视觉语言任务(如图像描述生成和视觉问答)中统一视觉-语言模型对齐的挑战。现有方法要么需要非常大的语言模型,要么依赖庞大的数据集,效率较低且难以充分利用已有资源。论文的关键解决方案是提出了一种自回归视觉-语言模型的训练策略,并设计了四个训练阶段以实现视觉模型与语言模型的有效对齐,使语言模型具备处理视觉输入的能力。此外,通过引入特定的注意力掩码优化基于Transformer的语言模型,提升了视觉特征的质量。关键创新点包括:注意力掩码不应应用于视觉输入;语言模型在AI生成的数据上收敛更快;预训练阶段需进一步加强模型对齐;模型可轻松适应下游任务,如医疗健康领域的视觉问答。最终,该方法在较小的数据集上训练的小型模型,在COCO和Flickr30k等基准测试中的CIDEr分数表现优于更大的VILA-13B模型,并接近GIT-2模型的表现。
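
针对“注意力掩码不应施加于视觉输入”这一发现,可以构造“视觉 token 之间双向可见、文本 token 保持因果”的混合掩码,示意如下(纯演示代码,非论文训练实现):

```python
import torch

def build_vlm_attention_mask(n_vis: int, n_txt: int):
    """构造混合掩码:视觉块内全可见(不加因果掩码),文本块保持自回归因果可见,
    文本可以看到全部视觉 token。返回 bool 矩阵,True 表示允许注意。"""
    n = n_vis + n_txt
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_vis, :n_vis] = True                               # 视觉块:双向注意
    mask[n_vis:, n_vis:] = torch.tril(torch.ones(n_txt, n_txt, dtype=torch.bool))  # 文本块:因果
    mask[n_vis:, :n_vis] = True                               # 文本查询可见所有视觉 token
    return mask

m = build_vlm_attention_mask(n_vis=4, n_txt=6)
print(m.int())
```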

链接: https://arxiv.org/abs/2503.19508
作者: Kartik Jangra,Aman Kumar Singh,Yashwani Mann,Geetanjali Rathee
机构: Netaji Subhas University of Technology (NSUT)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advancements in vision-language models have achieved remarkable results in making language models understand vision inputs. However, a unified approach to align these models across diverse tasks such as image captioning and visual question answering remains a challenge. Existing methods either require very big language models or very big datasets which is not efficient in utilizing existing models. This paper addresses this gap and devises a training strategy of auto-regressive vision-language models, to unify vision-language tasks like image-captioning and visual question answering. We propose four training stages for aligning the vision model with the language model, in other words, the language model is given an ability to process visual inputs. We also devise different attention masks for training transformer-based language models that improve the quality of visual features. Further, we introduce some findings, 1) the attention mask should not be applied on visual inputs, 2) the Language model converges faster on AI-generated data, 3) More work should be done in the alignment stage during the pre-training of the model, 4) the model can easily adapt to any downstream tasks like visual question answering on healthcare datasets like PathVQA. After training the model for one epoch for all the stages, it outperforms large models like VILA-13 billion models on common benchmarks like CIDEr scores on COCO and Flickr30k datasets and achieves very close scores to GIT-2 on the same dataset despite being a much smaller model trained on a much smaller dataset. All of the training is done using best practices available like multi-GPU parallel training, lower-precision training with 16-bit float numbers, faster attention (SDPA), and gradient accumulation, and completed the training within 12 hours.
zh

[CV-67] Adaptive Weighted Parameter Fusion with CLIP for Class-Incremental Learning ICME2025

【速读】:该论文旨在解决类增量学习(Class-incremental Learning, CIL)中的灾难性遗忘问题,即在模型优化新类别时不可避免地会丢失先前类别的知识。为应对这一挑战,论文的关键解决方案是设计了一种自适应加权参数融合方法,并结合对比语言-图像预训练(Contrastive Language-Image Pre-training, CLIP),通过最大程度保留参数矩阵的有效信息以及引入平衡因子来协调相邻任务的数据分布对齐与区分能力,从而在保持旧知识的同时有效容纳新信息,减少模型辨别能力的部分损失。实验结果验证了所提出方法的优越性。

链接: https://arxiv.org/abs/2503.19503
作者: Juncen Guo,Xiaoguang Zhu,Liangyu Teng,Hao Yang,Jing Liu,Yang Liu,Liang Song
机构: Fudan University (复旦大学); University of California, Davis (加州大学戴维斯分校); The University of British Columbia (不列颠哥伦比亚大学); Soochow University (苏州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICME2025

点击查看摘要

Abstract:Class-incremental Learning (CIL) enables the model to incrementally absorb knowledge from new classes and build a generic classifier across all previously encountered classes. When the model optimizes with new classes, the knowledge of previous classes is inevitably erased, leading to catastrophic forgetting. Addressing this challenge requires making a trade-off between retaining old knowledge and accommodating new information. However, this balancing process often requires sacrificing some information, which can lead to a partial loss in the model’s ability to discriminate between classes. To tackle this issue, we design the adaptive weighted parameter fusion with Contrastive Language-Image Pre-training (CLIP), which not only takes into account the variability of the data distribution of different tasks, but also retains all the effective information of the parameter matrix to the greatest extent. In addition, we introduce a balance factor that can balance the data distribution alignment and distinguishability of adjacent tasks. Experimental results on several traditional benchmarks validate the superiority of the proposed method.
zh

[CV-68] Pose-Based Fall Detection System: Efficient Monitoring on Standard CPUs

【速读】:该论文旨在解决养老院老年人跌倒检测的问题,传统方法通常依赖于需要专用硬件的传感器系统或需要高计算资源和GPU的视频模型,这可能导致成本高昂且实施困难。论文提出了一种无需额外传感器或高性能硬件的鲁棒跌倒检测系统。其关键是利用MediaPipe框架进行姿态估计,并结合基于阈值的分析和投票机制来区分跌倒与非跌倒活动,通过分析运动、身体姿势及关键姿态点,在标准CPU上实现低计算开销的实时处理,同时采用20帧缓冲区以减少误报并保持高精度,从而提供了一种实用且经济高效的解决方案,用于提升养老院居民的安全性。
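
“阈值判定加 20 帧缓冲投票”的组合可以写成一个很小的状态机;下面的示意与论文思路一致,但高度阈值、角度阈值与投票比例均为假设值:

```python
from collections import deque

class FallDetector:
    """对每帧姿态特征做阈值判定,在 20 帧缓冲内多数帧判为跌倒才报警,以抑制误报。"""
    def __init__(self, buffer_size=20, vote_ratio=0.6, height_thresh=0.3):
        self.buffer = deque(maxlen=buffer_size)
        self.vote_ratio = vote_ratio
        self.height_thresh = height_thresh

    def update(self, hip_y_norm, torso_angle_deg):
        # hip_y_norm: 髋部归一化高度(0 为画面顶部, 1 为底部);torso_angle_deg: 躯干与竖直方向夹角
        frame_is_fall = hip_y_norm > 1 - self.height_thresh and torso_angle_deg > 60
        self.buffer.append(frame_is_fall)
        full = len(self.buffer) == self.buffer.maxlen
        return full and sum(self.buffer) >= self.vote_ratio * len(self.buffer)

det = FallDetector()
for t in range(30):                       # 模拟一段先站立后倒地的序列
    fallen = t > 8
    alarm = det.update(hip_y_norm=0.9 if fallen else 0.4,
                       torso_angle_deg=80 if fallen else 5)
print("最终报警:", alarm)
```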

链接: https://arxiv.org/abs/2503.19501
作者: Vinayak Mali,Saurabh Jaiswal
机构: Indian Institue of Technology Kharagpur (印度理工学院克勒格布尔)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 4 Pages, 2 figures, 2 code block, 1 flow chart

点击查看摘要

Abstract:Falls among elderly residents in assisted living homes pose significant health risks, often leading to injuries and a decreased quality of life. Current fall detection solutions typically rely on sensor-based systems that require dedicated hardware, or on video-based models that demand high computational resources and GPUs for real-time processing. In contrast, this paper presents a robust fall detection system that does not require any additional sensors or high-powered hardware. The system uses pose estimation techniques, combined with threshold-based analysis and a voting mechanism, to effectively distinguish between fall and non-fall activities. For pose detection, we leverage MediaPipe, a lightweight and efficient framework that enables real-time processing on standard CPUs with minimal computational overhead. By analyzing motion, body position, and key pose points, the system processes pose features with a 20-frame buffer, minimizing false positives and maintaining high accuracy even in real-world settings. This unobtrusive, resource-efficient approach provides a practical solution for enhancing resident safety in old age homes, without the need for expensive sensors or high-end computational resources.
zh

[CV-69] Exploring Disentangled and Controllable Human Image Synthesis: From End-to-End to Stage-by-Stage

【速读】:该论文旨在解决在人物图像合成中实现细粒度可控性的长期挑战,现有方法难以同时以解耦方式控制视角(viewpoint)、姿态(pose)、服饰(clothing)和身份(identity)等关键因素。论文的关键创新在于引入了一个新的解耦且可控的人物合成任务,并在统一框架下显式分离与操控这四个因素。为实现这一目标,论文首先开发了一种端到端生成模型用于因子解耦,并基于MVHumanNet进行训练。然而,由于MVHumanNet与真实场景数据之间的领域差距(domain gap),直接应用该模型效果不佳。为此,论文探索将虚拟试穿(Virtual Try-On, VTON)数据集作为潜在解决方案。实验表明,简单地将VTON数据集作为附加数据训练端到端模型会降低性能,主要原因是两组数据形式不一致,破坏了解耦过程。为更好地利用这两种数据集,论文提出了一种分阶段(stage-by-stage)的框架,将人物图像生成分解为三个顺序步骤:带服饰的A姿态生成、背面合成以及姿态和视角控制。这种结构化流水线在不同阶段实现了更高效的资源利用,显著提升了可控性和泛化能力,特别是在真实场景中的表现。大量实验验证了分阶段方法在视觉保真度和解耦质量方面均优于端到端模型,为实际应用提供了可扩展的解决方案。

链接: https://arxiv.org/abs/2503.19486
作者: Zhengwentai Sun,Heyuan Li,Xihe Yang,Keru Zheng,Shuliang Ning,Yihao Zhi,Hongjie Liao,Chenghong Li,Shuguang Cui,Xiaoguang Han
机构: FNii, CUHKSZ (香港中文大学(深圳)); SSE, CUHKSZ (香港中文大学(深圳) )
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Achieving fine-grained controllability in human image synthesis is a long-standing challenge in computer vision. Existing methods primarily focus on either facial synthesis or near-frontal body generation, with limited ability to simultaneously control key factors such as viewpoint, pose, clothing, and identity in a disentangled manner. In this paper, we introduce a new disentangled and controllable human synthesis task, which explicitly separates and manipulates these four factors within a unified framework. We first develop an end-to-end generative model trained on MVHumanNet for factor disentanglement. However, the domain gap between MVHumanNet and in-the-wild data produces unsatisfactory results, motivating the exploration of virtual try-on (VTON) dataset as a potential solution. Through experiments, we observe that simply incorporating the VTON dataset as additional data to train the end-to-end model degrades performance, primarily due to the inconsistency in data forms between the two datasets, which disrupts the disentanglement process. To better leverage both datasets, we propose a stage-by-stage framework that decomposes human image generation into three sequential steps: clothed A-pose generation, back-view synthesis, and pose and view control. This structured pipeline enables better dataset utilization at different stages, significantly improving controllability and generalization, especially for in-the-wild scenarios. Extensive experiments demonstrate that our stage-by-stage approach outperforms end-to-end models in both visual fidelity and disentanglement quality, offering a scalable solution for real-world tasks. Additional demos are available on the project page: this https URL.
zh

[CV-70] GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

【速读】:本文旨在解决生成式模型与判别式模型结合中的关键挑战,即如何有效利用生成模型增强判别式模型(如CLIP)的视觉表征能力。传统方法通过将CLIP的视觉特征作为条件进行重建来提升表示效果,但其背后的原理尚未被充分探索。论文的关键发现在于,视觉上完美的生成结果并不总是最优的增强策略,核心在于从生成模型中有效提取细粒度知识的同时抑制无关信息。为此,论文深入研究了三个关键方面:(1) 条件机制发现仅使用全局视觉标记作为条件是最有效的策略;(2) 去噪配置提出两阶段训练策略以优先学习有用的视觉知识,并证明轻量级去噪器可显著提升性能;(3) 生成范式验证了连续和离散去噪器均能取得理想效果。最终,基于这些探索提出了名为GenHancer的方法,该方法在MMVP-VLM基准上始终优于现有技术,例如在OpenAICLIP上的提升达6.0%。增强后的CLIP还可进一步集成到多模态大语言模型中,以改善视觉为中心的任务性能。所有模型和代码均已公开发布。

链接: https://arxiv.org/abs/2503.19480
作者: Shijie Ma,Yuying Ge,Teng Wang,Yuxin Guo,Yixiao Ge,Ying Shan
机构: ARC Lab, Tencent PCG (腾讯PCG弧光实验室); Institute of Automation, CAS (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project released at: this https URL

点击查看摘要

Abstract:The synergy between generative and discriminative models receives growing attention. While discriminative Contrastive Language-Image Pre-Training (CLIP) excels in high-level semantics, it struggles with perceiving fine-grained visual details. Generally, to enhance representations, generative models take CLIP’s visual features as conditions for reconstruction. However, the underlying principle remains underexplored. In this work, we empirically found that visually perfect generations are not always optimal for representation enhancement. The essence lies in effectively extracting fine-grained knowledge from generative models while mitigating irrelevant information. To explore critical factors, we delve into three aspects: (1) Conditioning mechanisms: We found that even a small number of local tokens can drastically reduce the difficulty of reconstruction, leading to collapsed training. We thus conclude that utilizing only global visual tokens as conditions is the most effective strategy. (2) Denoising configurations: We observed that end-to-end training introduces extraneous information. To address this, we propose a two-stage training strategy to prioritize learning useful visual knowledge. Additionally, we demonstrate that lightweight denoisers can yield remarkable improvements. (3) Generation paradigms: We explore both continuous and discrete denoisers with desirable outcomes, validating the versatility of our method. Through our in-depth explorations, we have finally arrived at an effective method, namely GenHancer, which consistently outperforms prior arts on the MMVP-VLM benchmark, e.g., 6.0% on OpenAICLIP. The enhanced CLIP can be further plugged into multimodal large language models for better vision-centric performance. All the models and codes are made publicly available.
zh

[CV-71] Tell Me What You Can't See

【速读】:该论文旨在解决刑事调查中因高质量图像稀缺或过时导致的人脸识别准确性下降的问题。论文提出了一种新颖的法医头像增强框架,通过可定制的数据增强技术生成额外的高保真图像,同时保持原始数据的生物特征完整性和一致性。解决方案的关键在于结合数据增强与生物特征一致性维护的技术手段,以显著提升不同法医场景下的人脸识别准确率和鲁棒性。

链接: https://arxiv.org/abs/2503.19478
作者: Saverio Cavasin,Pietro Biasetton,Mattia Tamiazzo,Mauro Conti,Simone Milani
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注: 16 pages, 58 images

点击查看摘要

Abstract:During criminal investigations, images of persons of interest directly influence the success of identification procedures. However, law enforcement agencies often face challenges related to the scarcity of high-quality images or their obsolescence, which can affect the accuracy and success of people searching processes. This paper introduces a novel forensic mugshot augmentation framework aimed at addressing these limitations. Our approach enhances the identification probability of individuals by generating additional, high-quality images through customizable data augmentation techniques, while maintaining the biometric integrity and consistency of the original data. Several experimental results show that our method significantly improves identification accuracy and robustness across various forensic scenarios, demonstrating its effectiveness as a trustworthy tool for law enforcement applications. Index Terms: Digital Forensics, Person re-identification, Feature extraction, Data augmentation, Visual-Language models.
zh

[CV-72] A-MESS: Anchor based Multimodal Embedding with Semantic Synchronization for Multimodal Intent Recognition ICME2025

【速读】:该论文旨在解决多模态意图识别(Multimodal Intent Recognition, MIR)领域中现有方法难以充分捕捉模态间内在关联以及忽视相应语义表示的问题。为应对这些局限性,论文提出了基于锚点的多模态嵌入与语义同步(Anchor-based Multimodal Embedding with Semantic Synchronization, A-MESS)框架。其关键在于首先设计了基于锚点的多模态嵌入(A-ME)模块,采用基于锚点的嵌入融合机制整合多模态输入;其次开发了一种结合三元组对比学习管道的语义同步(SS)策略,通过大型语言模型生成的标签描述同步多模态表示,从而优化整个过程。实验结果表明,A-MESS 达到了当前最先进水平,并为多模态表征及下游任务提供了重要见解。
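
语义同步策略中的三元组对比损失可以用几行 PyTorch 写出。下例中锚点与正负样本的含义按论文思路标注,但张量本身是随机占位:

```python
import torch
import torch.nn.functional as F

def triplet_contrastive_loss(anchor, positive, negative, margin=0.2):
    """三元组对比损失:拉近多模态表示(anchor)与对应意图标签的
    LLM 生成描述嵌入(positive),推远不匹配意图的描述(negative)。"""
    d_pos = 1 - F.cosine_similarity(anchor, positive)
    d_neg = 1 - F.cosine_similarity(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

torch.manual_seed(0)
fused = torch.randn(8, 256)        # 锚点:融合后的多模态意图表示
pos_desc = torch.randn(8, 256)     # 正样本:对应意图标签的文本描述嵌入
neg_desc = torch.randn(8, 256)     # 负样本:其他意图的描述嵌入
print(triplet_contrastive_loss(fused, pos_desc, neg_desc).item())
```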

链接: https://arxiv.org/abs/2503.19474
作者: Yaomin Shen,Xiaojian Lin,Wei Fan
机构: XR System Application Research Center, Nanchang Research Institute of Zhejiang University (南昌研究院,浙江大学); Institute for AI Industry Research(AIR), Tsinghua University (清华大学人工智能产业研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accept by ICME2025

点击查看摘要

Abstract:In the domain of multimodal intent recognition (MIR), the objective is to recognize human intent by integrating a variety of modalities, such as language text, body gestures, and tones. However, existing approaches face difficulties in adequately capturing the intrinsic connections between the modalities and overlook the corresponding semantic representations of intent. To address these limitations, we present the Anchor-based Multimodal Embedding with Semantic Synchronization (A-MESS) framework. We first design an Anchor-based Multimodal Embedding (A-ME) module that employs an anchor-based embedding fusion mechanism to integrate multimodal inputs. Furthermore, we develop a Semantic Synchronization (SS) strategy with the Triplet Contrastive Learning pipeline, which optimizes the process by synchronizing multimodal representation with label descriptions produced by the large language model. Comprehensive experiments indicate that our A-MESS achieves state-of-the-art performance and provides substantial insight into multimodal representation and downstream tasks.
zh

[CV-73] Noisier2Inverse: Self-Supervised Learning for Image Reconstruction with Correlated Noise

【速读】:该论文旨在解决一般逆向问题(General Inverse Problems)中由于测量噪声统计相关性导致的传统自监督深度学习方法效果不佳的问题。论文提出了一种名为Noisier2Inverse的无校正(correction-free)自监督深度学习方法,能够在无需真实样本(ground truth samples)的情况下学习重建函数,并适用于如CT扫描中探测器缺陷或光子散射引起的噪声相关性,以及显微镜成像和地震成像中因物理交互引入的噪声依赖性场景。该方法的关键在于通过生成更噪数据(noisier data)来训练重建网络,同时其损失函数在测量空间中运作,并被设计为恢复外推图像(extrapolated image)而非原始噪声图像,从而避免了传统方法在推理阶段因不适定性(ill-posedness)而需要额外外推步骤的局限性。实验表明,该方法显著优于现有考虑相关噪声的自监督方法。
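
该方法的两个要点,即构造统计相关的噪声并在测量空间对“外推目标”做回归,可用如下玩具代码示意(前向算子、网络与外推目标的具体形式均为演示性假设,请以论文推导为准):

```python
import torch

def correlated_noise(shape, kernel_size=5):
    """用卷积平滑白噪声得到统计相关的噪声(模拟探测器缺陷或散射引入的相关性)。"""
    white = torch.randn(shape)
    kernel = torch.ones(1, 1, kernel_size) / kernel_size
    return torch.nn.functional.conv1d(white.unsqueeze(1), kernel,
                                      padding=kernel_size // 2).squeeze(1)

# Noisier2Inverse 式训练的一步示意(A 为前向算子,net 为重建网络,均为占位)
A = lambda x: x                                  # 占位:恒等前向算子
net = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 64))
y = torch.randn(16, 64)                          # 含相关噪声的测量
y_noisier = y + correlated_noise(y.shape)        # 关键步骤:再叠加一份同分布相关噪声
target = 2 * y - y_noisier                       # 示意性的外推目标,损失定义在测量空间
loss = ((A(net(y_noisier)) - target) ** 2).mean()
loss.backward()
print(loss.item())
```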

链接: https://arxiv.org/abs/2503.19468
作者: Nadja Gruber,Johannes Schwab,Markus Haltmeier,Ander Biguri,Clemens Dlaska,Gyeongha Hwang
机构: Digital Cardiology Lab, Medical University of Innsbruck (数字心脏病实验室,因斯布鲁克医科大学); University Clinic of Internal Medicine III, Cardiology and Angiology, Medical University of Innsbruck (内科三诊所,心血管科和血管科,因斯布鲁克医科大学); Department of Mathematics, University of Innsbruck (因斯布鲁克大学数学系); Department of Applied Mathematics and Theoretical Physics (DAMTP), University of Cambridge (英国剑桥大学应用数学和理论物理系); Department of Mathematics, Yeungnam University (韩国岭南大学数学系)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:We propose Noisier2Inverse, a correction-free self-supervised deep learning approach for general inverse problems. The proposed method learns a reconstruction function without the need for ground truth samples and is applicable in cases where measurement noise is statistically correlated. This includes computed tomography, where detector imperfections or photon scattering create correlated noise patterns, as well as microscopy and seismic imaging, where physical interactions during measurement introduce dependencies in the noise structure. Similar to Noisier2Noise, a key step in our approach is the generation of noisier data from which the reconstruction network learns. However, unlike Noisier2Noise, the proposed loss function operates in measurement space and is trained to recover an extrapolated image instead of the original noisy one. This eliminates the need for an extrapolation step during inference, which would otherwise suffer from ill-posedness. We numerically demonstrate that our method clearly outperforms previous self-supervised approaches that account for correlated noise.
zh

[CV-74] AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset ACCV

【速读】:本文旨在解决现有扩散模型在视频生成任务中推理步骤多、计算效率低的问题。论文的关键在于提出了一种名为AccVideo的新方法,通过利用预训练视频扩散模型生成的有效去噪轨迹作为合成数据集,消除无用数据点,从而减少推理步骤以加速模型。此外,基于合成数据集,设计了一种基于轨迹的少步引导机制,利用关键数据点学习噪声到视频的映射,实现更少步骤内的视频生成。同时,引入对抗训练策略以对齐学生模型输出分布与合成数据集的分布,提升视频质量。实验表明,AccVideo相比教师模型在生成速度上提升了8.5倍,且保持了相当的性能,同时生成的视频质量与分辨率更高。

链接: https://arxiv.org/abs/2503.19462
作者: Haiyu Zhang,Xinyuan Chen,Yaohui Wang,Xihui Liu,Yunhong Wang,Yu Qiao
机构: Beihang University (北京航空航天大学); Shanghai AI Laboratory (上海人工智能实验室); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Diffusion models have achieved remarkable progress in the field of video generation. However, their iterative denoising nature requires a large number of inference steps to generate a video, which is slow and computationally expensive. In this paper, we begin with a detailed analysis of the challenges present in existing diffusion distillation methods and propose a novel efficient method, namely AccVideo, to reduce the inference steps for accelerating video diffusion models with synthetic dataset. We leverage the pretrained video diffusion model to generate multiple valid denoising trajectories as our synthetic dataset, which eliminates the use of useless data points during distillation. Based on the synthetic dataset, we design a trajectory-based few-step guidance that utilizes key data points from the denoising trajectories to learn the noise-to-video mapping, enabling video generation in fewer steps. Furthermore, since the synthetic dataset captures the data distribution at each diffusion timestep, we introduce an adversarial training strategy to align the output distribution of the student model with that of our synthetic dataset, thereby enhancing the video quality. Extensive experiments demonstrate that our model achieves 8.5x improvements in generation speed compared to the teacher model while maintaining comparable performance. Compared to previous accelerating methods, our approach is capable of generating videos with higher quality and resolution, i.e., 5-seconds, 720x1280, 24fps.
zh

[CV-75] GaussianUDF: Inferring Unsigned Distance Functions through 3D Gaussian Splatting

【速读】:该论文旨在解决从多视图图像重建开放曲面(open surfaces)的问题,特别是在利用3D高斯点溅射(3D Gaussians splatting, 3DGS)方法学习连续且隐式的无符号距离函数(unsigned distance functions, UDFs)时面临的挑战。由于3D高斯表示是一种离散且显式的场景表达方式,难以有效地学习连续的UDF表示。为了解决这一问题,论文提出了一种创新方法,通过在表面上拟合薄而平的二维高斯平面(2D Gaussian planes),并结合基于梯度的推理和自监督机制来监督表面附近及远场区域内的无符号距离值。关键在于引入新的约束和策略,以确保二维高斯参数的学习过程更加稳定,并提供更可靠的自监督信号,从而应对UDFs零水平集附近复杂梯度场带来的挑战。实验结果表明,所提方法在常用基准数据集和真实数据上的数值与视觉比较中展示了其在重建开放曲面的准确性、效率、完整性和边界清晰度方面的优势。

链接: https://arxiv.org/abs/2503.19458
作者: Shujuan Li,Yu-Shen Liu,Zhizhong Han
机构: School of Software, Tsinghua University (清华大学软件学院), Beijing, China; Department of Computer Science, Wayne State University (韦恩州立大学计算机科学系), Detroit, USA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reconstructing open surfaces from multi-view images is vital in digitalizing complex objects in daily life. A widely used strategy is to learn unsigned distance functions (UDFs) by checking if their appearance conforms to the image observations through neural rendering. However, it is still hard to learn continuous and implicit UDF representations through 3D Gaussians splatting (3DGS) due to the discrete and explicit scene representation, i.e., 3D Gaussians. To resolve this issue, we propose a novel approach to bridge the gap between 3D Gaussians and UDFs. Our key idea is to overfit thin and flat 2D Gaussian planes on surfaces, and then, leverage the self-supervision and gradient-based inference to supervise unsigned distances in both near and far area to surfaces. To this end, we introduce novel constraints and strategies to constrain the learning of 2D Gaussians to pursue more stable optimization and more reliable self-supervision, addressing the challenges brought by complicated gradient field on or near the zero level set of UDFs. We report numerical and visual comparisons with the state-of-the-art on widely used benchmarks and real data to show our advantages in terms of accuracy, efficiency, completeness, and sharpness of reconstructed open surfaces with boundaries. Project page: this https URL
zh

[CV-76] G-DexGrasp: Generalizable Dexterous Grasping Synthesis Via Part-Aware Prior Retrieval and Prior-Assisted Generation

【速读】:该论文致力于解决在未见过物体类别和多样化任务指令下,灵巧抓取合成难以泛化的问题。解决方案的关键在于提出了一种检索增强生成方法G-DexGrasp,通过检索可泛化的抓取先验(包括细粒度接触部位及与功能相关的抓取实例分布),为后续合成流水线提供指导。具体而言,细粒度接触部位和功能作为通用引导,利用生成模型推断未见物体的合理抓取配置,而相关抓取分布则起到正则化作用,确保合成抓取在后优化阶段的合理性。实验验证了所提方法在泛化能力上的有效性,并展示了相对于现有方法的显著性能提升。

链接: https://arxiv.org/abs/2503.19457
作者: Juntao Jian,Xiuping Liu,Zixuan Chen,Manyi Li,Jian Liu,Ruizhen Hu
机构: Dalian University of Technology (大连理工大学); Shandong University (山东大学); Shenyang University of Technology (沈阳工业大学); Shenzhen University (深圳大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 11 pages, 5 figures

点击查看摘要

Abstract:Recent advances in dexterous grasping synthesis have demonstrated significant progress in producing reasonable and plausible grasps for many task purposes. But it remains challenging to generalize to unseen object categories and diverse task instructions. In this paper, we propose G-DexGrasp, a retrieval-augmented generation approach that can produce high-quality dexterous hand configurations for unseen object categories and language-based task instructions. The key is to retrieve generalizable grasping priors, including the fine-grained contact part and the affordance-related distribution of relevant grasping instances, for the following synthesis pipeline. Specifically, the fine-grained contact part and affordance act as generalizable guidance to infer reasonable grasping configurations for unseen objects with a generative model, while the relevant grasping distribution plays as regularization to guarantee the plausibility of synthesized grasps during the subsequent refinement optimization. Our comparison experiments validate the effectiveness of our key designs for generalization and demonstrate the remarkable performance against the existing approaches. Project page: this https URL
zh

[CV-77] SparseGS-W: Sparse-View 3D Gaussian Splatting in the Wild with Generative Priors

【速读】:本文旨在解决从非约束的真实世界图像中合成大规模场景新视角的问题,现有方法在稀疏输入条件下难以有效优化每幅图像的外观和瞬态遮挡,导致显著伪影。为应对这一挑战,论文提出了一种名为SparseGS-W的新框架,基于3D Gaussian Splatting (3DGS) 技术,能够利用最少五张训练图像重建复杂的室外场景并处理遮挡和外观变化。关键在于引入了几何先验和受限扩散先验来弥补极稀疏输入中多视图信息的缺失,并设计了一个即插即用的受限新视角增强模块,在高斯优化过程中迭代提升渲染新视角的质量;同时提出了一个灵活去除遮挡的模块,利用受限扩散先验的高质量修复能力。这两个模块可以从任何用户提供的参考图像中提取外观特征,实现一致光照场景的灵活建模。实验表明,SparseGS-W在全参考和非参考指标上均达到最先进的性能。

链接: https://arxiv.org/abs/2503.19452
作者: Yiqing Li,Xuan Wang,Jiawei Wu,Yikun Ma,Zhi Jin
机构: Sun Yat-sen University (中山大学); Ant Research (蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Synthesizing novel views of large-scale scenes from unconstrained in-the-wild images is an important but challenging task in computer vision. Existing methods, which optimize per-image appearance and transient occlusion through implicit neural networks from dense training views (approximately 1000 images), struggle to perform effectively under sparse input conditions, resulting in noticeable artifacts. To this end, we propose SparseGS-W, a novel framework based on 3D Gaussian Splatting that enables the reconstruction of complex outdoor scenes and handles occlusions and appearance changes with as few as five training images. We leverage geometric priors and constrained diffusion priors to compensate for the lack of multi-view information from extremely sparse input. Specifically, we propose a plug-and-play Constrained Novel-View Enhancement module to iteratively improve the quality of rendered novel views during the Gaussian optimization process. Furthermore, we propose an Occlusion Handling module, which flexibly removes occlusions utilizing the inherent high-quality inpainting capability of constrained diffusion priors. Both modules are capable of extracting appearance features from any user-provided reference image, enabling flexible modeling of illumination-consistent scenes. Extensive experiments on the PhotoTourism and Tanks and Temples datasets demonstrate that SparseGS-W achieves state-of-the-art performance not only in full-reference metrics, but also in commonly used non-reference metrics such as FID, ClipIQA, and MUSIQ.
zh

[CV-78] Towards Robust Time-of-Flight Depth Denoising with Confidence-Aware Diffusion Model

【速读】:该论文致力于解决传统飞行时间(Time-of-Flight, ToF)深度传感器因非线性深度构建过程导致的深度图中噪声方差极大甚至出现无效区域的问题。此外,基于深度神经网络 (DNN) 的现有去噪方法在处理严重噪声污染时,由于对 ToF 数据分布的先验知识有限而表现欠佳。为应对这些挑战,论文提出了一种名为 DepthCAD 的新方法,其关键在于结合 Stable Diffusion 中丰富的先验知识以确保全局结构平滑,并通过置信度引导调整扩散过程来保持局部度量准确性。为了将预训练图像扩散模型应用于 ToF 深度去噪任务,该方法先对原始 ToF 相关测量值进行动态范围归一化,在相关测量上执行扩散,之后再转换为深度图。实验结果表明,所提出的方案达到了最先进的性能水平,且在真实数据上的评估进一步验证了其对实际 ToF 噪声的鲁棒性。

链接: https://arxiv.org/abs/2503.19448
作者: Changyong He,Jin Zeng,Jiawei Zhang,Jiajie Guo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Time-of-Flight (ToF) sensors efficiently capture scene depth, but the nonlinear depth construction procedure often results in extremely large noise variance or even invalid areas. Recent methods based on deep neural networks (DNNs) achieve enhanced ToF denoising accuracy but tend to struggle when presented with severe noise corruption due to limited prior knowledge of ToF data distribution. In this paper, we propose DepthCAD, a novel ToF denoising approach that ensures global structural smoothness by leveraging the rich prior knowledge in Stable Diffusion and maintains local metric accuracy by steering the diffusion process with confidence guidance. To adopt the pretrained image diffusion model to ToF depth denoising, we apply the diffusion on raw ToF correlation measurements with dynamic range normalization before converting to depth maps. Experimental results validate the state-of-the-art performance of the proposed scheme, and the evaluation on real data further verifies its robustness against real-world ToF noise.
zh

[CV-79] COB-GS: Clear Object Boundaries in 3DGS Segmentation Based on Boundary-Adaptive Gaussian Splitting

【速读】:该论文旨在解决基于3D Gaussian Splatting (3DGS) 的物体分割在准确描绘物体边界方面存在的挑战。现有方法因高斯基元的体积特性及训练过程中缺乏语义引导,常导致边界模糊不清。为应对这些难题,论文提出了一种名为Clear Object Boundaries for 3DGS Segmentation (COB-GS) 的方法,其关键在于通过联合优化语义信息与视觉信息来明确区分交织高斯基元的模糊边界。具体而言,COB-GS 引入了边界自适应高斯分裂技术以利用语义梯度统计识别并分割模糊的高斯基元,同时通过修正3DGS场景中降质的次优纹理,特别是沿着优化后的边界结构,从而在保持高质量视觉效果的同时显著提升分割精度和鲁棒性。

链接: https://arxiv.org/abs/2503.19443
作者: Jiaxin Zhang,Junjun Jiang,Youyu Chen,Kui Jiang,Xianming Liu
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate object segmentation is crucial for high-quality scene understanding in the 3D vision domain. However, 3D segmentation based on 3D Gaussian Splatting (3DGS) struggles with accurately delineating object boundaries, as Gaussian primitives often span across object edges due to their inherent volume and the lack of semantic guidance during training. In order to tackle these challenges, we introduce Clear Object Boundaries for 3DGS Segmentation (COB-GS), which aims to improve segmentation accuracy by clearly delineating blurry boundaries of interwoven Gaussian primitives within the scene. Unlike existing approaches that remove ambiguous Gaussians and sacrifice visual quality, COB-GS, as a 3DGS refinement method, jointly optimizes semantic and visual information, allowing the two different levels to cooperate with each other effectively. Specifically, for the semantic guidance, we introduce a boundary-adaptive Gaussian splitting technique that leverages semantic gradient statistics to identify and split ambiguous Gaussians, aligning them closely with object boundaries. For the visual optimization, we rectify the degraded suboptimal texture of the 3DGS scene, particularly along the refined boundary structures. Experimental results show that COB-GS substantially improves segmentation accuracy and robustness against inaccurate masks from pre-trained model, yielding clear boundaries while preserving high visual quality. Code is available at this https URL.
zh

[CV-80] Quantifying the Ease of Reproducing Training Data in Unconditional Diffusion Models

【速读】:该论文试图解决扩散模型(Diffusion Models)在训练过程中可能过度记忆(memorization)训练数据而导致的潜在版权问题。论文的关键解决方案在于提出一种方法,通过量化无条件扩散模型中再现训练数据的难易程度来应对这一挑战。具体而言,该方法利用反向扩散过程中满足Langevin方程的样本群体平均运动所遵循的一阶常微分方程(ODE),建立图像与其潜空间中噪声版本之间的1-to-1映射关系。由于该ODE可逆且初始噪声图像是随机采样的,因此图像投影到潜空间后的面积体积可以表示生成这些图像的概率。通过测量此过程中的体积增长率,论文成功实现了对训练数据再现难易程度的量化评估。该方法具有较低的计算复杂度,从而能够有效检测并修改容易被记忆的训练样本,进而提升训练数据的质量。
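
沿该一阶 ODE 积分速度场的散度即可累计出对数体积增长率,而散度通常用 Hutchinson 迹估计近似。下面的示意用线性层代替真实的去噪网络速度场(属演示假设),便于和精确散度对照:

```python
import torch

def divergence_hutchinson(f, x, n_samples=8):
    """Hutchinson 迹估计:div f(x) = E[v^T (∂f/∂x) v]。
    沿概率流 ODE 对散度积分即得对数体积增长率(示意实现,非论文代码)。"""
    div = torch.zeros(x.shape[0])
    for _ in range(n_samples):
        v = torch.randn_like(x)
        x_ = x.detach().requires_grad_(True)
        out = (f(x_) * v).sum()
        grad = torch.autograd.grad(out, x_)[0]   # 得到 J^T v
        div += (grad * v).sum(dim=1)             # v^T J v,期望为迹
    return div / n_samples

f = torch.nn.Linear(4, 4)          # 假设的速度场(真实情形为去噪网络诱导的 ODE 漂移)
x = torch.randn(16, 4)
est = divergence_hutchinson(f, x)
print("估计散度均值:", est.mean().item(), " 精确值:", torch.diagonal(f.weight).sum().item())
```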

链接: https://arxiv.org/abs/2503.19429
作者: Masaya Hasegawa,Koji Yasuda
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models, which have been advancing rapidly in recent years, may generate samples that closely resemble the training data. This phenomenon, known as memorization, may lead to copyright issues. In this study, we propose a method to quantify the ease of reproducing training data in unconditional diffusion models. The average of a sample population following the Langevin equation in the reverse diffusion process moves according to a first-order ordinary differential equation (ODE). This ODE establishes a 1-to-1 correspondence between images and their noisy counterparts in the latent space. Since the ODE is reversible and the initial noisy images are sampled randomly, the volume of an image’s projected area represents the probability of generating those images. We examined the ODE, which projects images to latent space, and succeeded in quantifying the ease of reproducing training data by measuring the volume growth rate in this process. Given the relatively low computational complexity of this method, it allows us to enhance the quality of training data by detecting and modifying the easily memorized training samples.
zh

[CV-81] EmoHead: Emotional Talking Head via Manipulating Semantic Expression Parameters

【速读】:该论文旨在解决从音频输入生成具有特定情感的 Talking Head 视频这一重要且复杂的人机交互挑战。由于情感是一个高度抽象且边界模糊的概念,需要解耦的情感表达参数来生成情感丰富的 Talking Head 视频。为实现这一目标,论文提出了 EmoHead 方法,通过语义表达参数合成 Talking Head 视频。其解决方案的关键在于设计了一个可由情感标签指定的音频-表情模块(audio-expression module),用于增强不同情感下音频输入与表情预测之间的相关性,并利用预训练的超平面沿垂直方向探测以优化面部运动。最终,精炼后的表情参数被用于正则化神经辐射场(Neural Radiance Fields, NeRF),从而实现情感一致的 Talking Head 视频生成。实验结果表明,语义表情参数在重建质量和可控性方面表现出色。

链接: https://arxiv.org/abs/2503.19416
作者: Xuli Shen,Hua Cai,Dingding Yu,Weilin Shen,Qing Xu,Xiangyang Xue
机构: Fudan University (复旦大学); UniDT (未知)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating emotion-specific talking head videos from audio input is an important and complex challenge for human-machine interaction. However, emotion is a highly abstract concept with ambiguous boundaries, and it necessitates disentangled expression parameters to generate emotionally expressive talking head videos. In this work, we present EmoHead to synthesize talking head videos via semantic expression parameters. To predict expression parameters for arbitrary audio input, we apply an audio-expression module that can be specified by an emotion tag. This module aims to enhance correlation from audio input across various emotions. Furthermore, we leverage a pre-trained hyperplane to refine facial movements by probing along the vertical direction. Finally, the refined expression parameters regularize neural radiance fields and facilitate the emotion-consistent generation of talking head videos. Experimental results demonstrate that semantic expression parameters lead to better reconstruction quality and controllability.
zh

[CV-82] A Prototype-Guided Coarse Annotations Refining Approach for Whole Slide Images

【速读】:本文旨在解决在全片扫描图像(Whole Slide Images, WSIs)中生成细粒度标注成本高昂的问题,同时现有粗略标注精化方法因依赖大量训练样本或清洁数据集,难以捕捉幻灯片内和幻灯片间的潜在语义模式,从而限制其精度。为应对这一挑战,本文提出了一种基于原型的方法。关键在于引入局部到全局的策略构建非冗余的代表性原型,通过联合建模幻灯片内的局部语义与幻灯片间的上下文关系实现这一目标;随后设计了一个基于原型的伪标签模块来优化粗略标注;最后采用动态数据采样和再微调策略训练补丁分类器。实验结果表明,该方法在包含淋巴、肝脏和结直肠癌的三个公开WSI数据集上的性能显著优于现有的最先进的方法。
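
“先构建类别原型、再按最近原型生成伪标签以精化粗标注”的流程可以用 KMeans 粗略示意(特征与粗标注为随机占位,原型构建方式为简化假设,并非论文的局部到全局策略):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
patch_feats = rng.standard_normal((1000, 128))       # 假设:WSI 补丁特征
coarse_labels = rng.integers(0, 2, size=1000)        # 粗标注(0: 正常, 1: 病变),含噪声

# 为每个类构建少量代表性原型(此处用 KMeans 近似“非冗余原型”的构建)
prototypes, proto_labels = [], []
for c in (0, 1):
    km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(patch_feats[coarse_labels == c])
    prototypes.append(km.cluster_centers_)
    proto_labels += [c] * 4
prototypes = np.concatenate(prototypes)
proto_labels = np.array(proto_labels)

# 原型引导的伪标签:按最近原型重新标注,精化粗标注
dist = np.linalg.norm(patch_feats[:, None] - prototypes[None], axis=-1)
refined = proto_labels[dist.argmin(1)]
print("与粗标注不一致而被修正的补丁比例:", (refined != coarse_labels).mean())
```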

链接: https://arxiv.org/abs/2503.19407
作者: Bingjian Yao,Weiping Lin,Yan He,Zheng Wang,Liangsheng Wang
机构: Department of Computer Science and Technology, School of Informatics, Xiamen University (厦门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages

点击查看摘要

Abstract:The fine-grained annotations in whole slide images (WSIs) show the boundaries of various pathological regions. However, generating such detailed annotation is often costly, whereas the coarse annotations are relatively simpler to produce. Existing methods for refining coarse annotations often rely on extensive training samples or clean datasets, and fail to capture both intra-slide and inter-slide latent semantic patterns, limiting their precision. In this paper, we propose a prototype-guided approach. Specifically, we introduce a local-to-global approach to construct non-redundant representative prototypes by jointly modeling intra-slide local semantics and inter-slide contextual relationships. Then a prototype-guided pseudo-labeling module is proposed for refining coarse annotations. Finally, we employ dynamic data sampling and re-finetuning strategy to train a patch classifier. Extensive experiments on three publicly available WSI datasets, covering lymph, liver, and colorectal cancers, demonstrate that our method significantly outperforms existing state-of-the-art (SOTA) methods. The code will be available.
zh

[CV-83] M2CD: A Unified MultiModal Framework for Optical-SAR Change Detection with Mixture of Experts and Self-Distillation

【速读】:该论文旨在解决跨模态变化检测(Cross-Modal Change Detection, CD)在光学图像与合成孔径雷达(SAR)图像之间面临的挑战。现有基于权重量化Siamese网络的方法难以有效学习光学与SAR图像之间的跨模态数据分布,尤其是在灾后场景等极端条件下,SAR的主动成像能力更具优势。为应对这一挑战,论文提出了一种统一的多模态变化检测框架M²CD。其关键创新点在于:1)引入混合专家模块(Mixture of Experts, MoE),显式处理多样化的模态数据,提升模型学习多模态数据分布的能力;2)设计光学到SAR引导路径(Optical-to-SAR Guided Path, O2SP),并在训练过程中实施自蒸馏技术,以减少不同模态特征空间的差异,进一步减轻模型的学习负担。实验结果表明,基于Transformer的MiT-b1版本M²CD在光学-SAR变化检测任务中超越所有现有最先进的方法。
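
显式处理多模态分布的 MoE 模块,其最小形式是用门控网络对若干专家的输出加权求和。下面的结构仅为演示性假设,并非论文中的具体模块:

```python
import torch
import torch.nn as nn

class ModalityMoE(nn.Module):
    """最简 MoE 示意:门控网络按输入特征(隐含模态差异)对专家输出加权,
    让不同专家分别擅长光学与 SAR 的数据分布。"""
    def __init__(self, dim=64, n_experts=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_experts)
        ])
        self.gate = nn.Linear(dim, n_experts)

    def forward(self, x):                        # x: (B, N, dim)
        w = torch.softmax(self.gate(x), dim=-1)  # (B, N, n_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, N, dim, E)
        return (outs * w.unsqueeze(2)).sum(-1)

moe = ModalityMoE()
print(moe(torch.randn(2, 16, 64)).shape)
```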

链接: https://arxiv.org/abs/2503.19406
作者: Ziyuan Liu,Jiawei Zhang,Wenyu Wang,Yuantao Gu
机构: Department of Electronic Engineering, Beijing National Research Center for Information Science and Technology, Tsinghua University (清华大学); College of Communications Engineering, Army Engineering University of PLA (中国人民解放军陆军工程大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 2 figures

点击查看摘要

Abstract:Most existing change detection (CD) methods focus on optical images captured at different times, and deep learning (DL) has achieved remarkable success in this domain. However, in extreme scenarios such as disaster response, synthetic aperture radar (SAR), with its active imaging capability, is more suitable for providing post-event data. This introduces new challenges for CD methods, as existing weight-sharing Siamese networks struggle to effectively learn the cross-modal data distribution between optical and SAR images. To address this challenge, we propose a unified MultiModal CD framework, M²CD. We integrate Mixture of Experts (MoE) modules into the backbone to explicitly handle diverse modalities, thereby enhancing the model’s ability to learn multimodal data distributions. Additionally, we innovatively propose an Optical-to-SAR guided path (O2SP) and implement self-distillation during training to reduce the feature space discrepancy between different modalities, further alleviating the model’s learning burden. We design multiple variants of M²CD based on both CNN and Transformer backbones. Extensive experiments validate the effectiveness of the proposed framework, with the MiT-b1 version of M²CD outperforming all state-of-the-art (SOTA) methods in optical-SAR CD tasks.
zh

[CV-84] Multi-modal 3D Pose and Shape Estimation with Computed Tomography

【速读】:该论文旨在解决围手术期护理中精确估计患者床上三维姿势与形状(Pose and Shape Estimation, PSE)的问题,特别是应对因床品遮挡和复杂体位导致的传统基于RGB-D、红外或压力图方法的精度不足挑战。论文的关键在于提出了一种多模态的患者床上三维PSE网络mPSE-CT,该网络融合了从常规获取的计算机断层扫描(Computed Tomography, CT)扫描中提取的详细几何特征与深度图。其解决方案的核心包括一个利用概率对应对齐的形状估计模块、一个采用优化神经网络的姿势估计模块,以及一个最终参数混合模块,这些模块共同实现了对被遮挡身体区域的鲁棒重建,并显著提升了估计的三维人体网格模型的准确性。通过在临床场景下使用自有的全身刚性模型和志愿者数据集进行验证,mPSE-CT在姿态和形状估计方面分别比现有最佳方法提高了23%和49.16%,展示了其在改善围手术期复杂环境下的临床结果方面的潜力。

链接: https://arxiv.org/abs/2503.19405
作者: Mingxiao Tu,Hoijoon Jung,Alireza Moghadam,Jineel Raythatha,Lachlan Allan,Jeremy Hsu,Andre Kyme,Jinman Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In perioperative care, precise in-bed 3D patient pose and shape estimation (PSE) can be vital in optimizing patient positioning in preoperative planning, enabling accurate overlay of medical images for augmented reality-based surgical navigation, and mitigating risks of prolonged immobility during recovery. Conventional PSE methods relying on modalities such as RGB-D, infrared, or pressure maps often struggle with occlusions caused by bedding and complex patient positioning, leading to inaccurate estimation that can affect clinical outcomes. To address these challenges, we present the first multi-modal in-bed patient 3D PSE network that fuses detailed geometric features extracted from routinely acquired computed tomography (CT) scans with depth maps (mPSE-CT). mPSE-CT incorporates a shape estimation module that utilizes probabilistic correspondence alignment, a pose estimation module with a refined neural network, and a final parameters mixing module. This multi-modal network robustly reconstructs occluded body regions and enhances the accuracy of the estimated 3D human mesh model. We validated mPSE-CT using proprietary whole-body rigid phantom and volunteer datasets in clinical scenarios. mPSE-CT outperformed the best-performing prior method by 23% and 49.16% in pose and shape estimation respectively, demonstrating its potential for improving clinical outcomes in challenging perioperative environments.
zh

[CV-85] LangBridge: Interpreting Image as a Combination of Language Embeddings

【速读】:该论文试图解决主流Large Vision-Language Models (LVLMs) 中视觉与语言模态对齐机制不清晰以及MLP适配器在切换大型语言模型 (LLMs) 主干网络时需重新训练的问题。论文的关键在于发现MLP适配器通过逐步将视觉嵌入投影到由相应文本嵌入张成的子空间来实现模态对齐的工作原理,并基于此提出LangBridge,一种显式映射视觉标记到LLM词汇嵌入线性组合的新型适配器。这种设计不仅实现了跨不同LLMs的预训练自由适配器迁移,同时保持了性能,且其插拔式设计确保了在多个LLMs间高效复用,几乎无性能损失。
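
LangBridge 的核心操作,即把视觉 token 显式映射为 LLM 词表嵌入的线性组合,可以用如下草图表达(维度与 softmax 归一化均为假设;更换 LLM 时只需替换词表嵌入,桥接器本身可复用):

```python
import torch
import torch.nn as nn

class LangBridgeSketch(nn.Module):
    """示意 LangBridge 思想:预测视觉 token 在词表上的组合系数,
    再与 LLM 词表嵌入做线性组合,从而使适配器与具体 LLM 解耦。"""
    def __init__(self, vis_dim=1024, vocab_size=32000):
        super().__init__()
        self.to_vocab_weights = nn.Linear(vis_dim, vocab_size)   # 预测词表上的组合系数

    def forward(self, vis_tokens, vocab_embedding: nn.Embedding):
        coeff = torch.softmax(self.to_vocab_weights(vis_tokens), dim=-1)  # (B, N, V)
        return coeff @ vocab_embedding.weight                    # (B, N, llm_dim)

bridge = LangBridgeSketch()
vocab = nn.Embedding(32000, 896)            # 换一个 LLM 只需换它的词表嵌入
out = bridge(torch.randn(2, 16, 1024), vocab)
print(out.shape)                            # torch.Size([2, 16, 896])
```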

链接: https://arxiv.org/abs/2503.19404
作者: Jiaqi Liao,Yuwei Niu,Fanqing Meng,Hao Li,Changyao Tian,Yinuo Du,Yuwen Xiong,Dianqi Li,Xizhou Zhu,Li Yuan,Jifeng Dai,Yu Cheng
机构: OpenGVLab (上海人工智能实验室); The Chinese University of Hong Kong (香港中文大学); Tsinghua University (清华大学); SenseTime Research (商汤研究); Shanghai Jiao Tong University (上海交通大学); Peking University (北京大学); PengCheng Laboratory (鹏城实验室); Chongqing University (重庆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The code and weights will be open-sourced. Project page: this https URL

点击查看摘要

Abstract:Recent years have witnessed remarkable advances in Large Vision-Language Models (LVLMs), which have achieved human-level performance across various complex vision-language tasks. Following LLaVA’s paradigm, mainstream LVLMs typically employ a shallow MLP for visual-language alignment through a two-stage training process: pretraining for cross-modal alignment followed by instruction tuning. While this approach has proven effective, the underlying mechanisms of how MLPs bridge the modality gap remain poorly understood. Although some research has explored how LLMs process transformed visual tokens, few studies have investigated the fundamental alignment mechanism. Furthermore, the MLP adapter requires retraining whenever switching LLM backbones. To address these limitations, we first investigate the working principles of MLP adapters and discover that they learn to project visual embeddings into subspaces spanned by corresponding text embeddings progressively. Based on this insight, we propose LangBridge, a novel adapter that explicitly maps visual tokens to linear combinations of LLM vocabulary embeddings. This innovative design enables pretraining-free adapter transfer across different LLMs while maintaining performance. Our experimental results demonstrate that a LangBridge adapter pre-trained on Qwen2-0.5B can be directly applied to larger models such as LLaMA3-8B or Qwen2.5-14B while maintaining competitive performance. Overall, LangBridge enables interpretable vision-language alignment by grounding visual representations in LLM vocab embedding, while its plug-and-play design ensures efficient reuse across multiple LLMs with nearly no performance degradation. See our project page at this https URL
zh

[CV-86] raF-Align: Trajectory-aware Feature Alignment for Asynchronous Multi-agent Perception CVPR2025

【速读】:该论文旨在解决协同感知中的异步通信导致的时间延迟问题,这一问题会引发空间和语义特征的错位,从而增加实时观测融合的复杂性。论文的关键解决方案是提出TraF-Align框架,该框架通过预测从过去观测到当前时间对象特征的轨迹,学习特征流路径。关键创新在于沿这些轨迹生成时间有序的采样点,引导当前查询注意力聚焦于相关的历史特征,从而重建当前时间特征并促进多帧间的语义交互。这种方法能够校正空间错位,确保跨代理的语义一致性,并有效补偿运动影响,实现特征的协调融合。

链接: https://arxiv.org/abs/2503.19391
作者: Zhiying Song,Lei Yang,Fuxi Wen,Jun Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
备注: Accepted to CVPR 2025

点击查看摘要

Abstract:Cooperative perception presents significant potential for enhancing the sensing capabilities of individual vehicles, however, inter-agent latency remains a critical challenge. Latencies cause misalignments in both spatial and semantic features, complicating the fusion of real-time observations from the ego vehicle with delayed data from others. To address these issues, we propose TraF-Align, a novel framework that learns the flow path of features by predicting the feature-level trajectory of objects from past observations up to the ego vehicle’s current time. By generating temporally ordered sampling points along these paths, TraF-Align directs attention from the current-time query to relevant historical features along each trajectory, supporting the reconstruction of current-time features and promoting semantic interaction across multiple frames. This approach corrects spatial misalignment and ensures semantic consistency across agents, effectively compensating for motion and achieving coherent feature fusion. Experiments on two real-world datasets, V2V4Real and DAIR-V2X-Seq, show that TraF-Align sets a new benchmark for asynchronous cooperative perception.
zh

[CV-87] Exploring Textual Semantics Diversity for Image Transmission in Semantic Communication Systems using Visual Language Model

【速读】:该论文试图解决传统语义通信系统中图像语义特征提取数量不足导致重建精度低的问题,这限制了其实际应用。论文的关键解决方案是提出了一种多文本传输语义通信(Multi-SC)系统,利用视觉语言模型(VLM)辅助图像语义信号的传输。不同于以往的图像传输语义通信系统,该系统通过修改后的大型语言与视觉助手(LLaVA)将图像分割为多个块并提取多段文本信息,同时结合语义分割标签与语义文本进行图像恢复,从而显著提升了重建准确性。

链接: https://arxiv.org/abs/2503.19386
作者: Peishan Huang,Dong Li
机构: School of Computer Science and Engineering, Macau University of Science and Technology (澳门科技大学计算机科学与工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:In recent years, the rapid development of machine learning has brought reforms and challenges to traditional communication systems. Semantic communication has emerged as an effective strategy to extract relevant semantic signals, such as semantic segmentation labels and image features, for image transmission. However, an insufficient number of extracted semantic features will potentially result in low reconstruction accuracy, which hinders practical application and remains a challenging open problem. In order to fill this gap, this letter proposes a multi-text transmission semantic communication (Multi-SC) system, which uses the visual language model (VLM) to assist in the transmission of image semantic signals. Unlike previous image transmission semantic communication systems, the proposed system divides the image into multiple blocks and extracts multiple pieces of text information from the image using a modified large language and visual assistant (LLaVA), and combines semantic segmentation tags with semantic text for image recovery. Simulation results show that the proposed text semantics diversity scheme can significantly improve the reconstruction accuracy compared with related works.
zh

[CV-88] Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing

【速读】:本文旨在解决预训练流模型(Pretrained Flow Models)在推理时间扩展(Inference-Time Scaling)方面的挑战。与扩散模型(Diffusion Models)不同,流模型因其确定性生成过程(Deterministic Generative Process),无法直接应用扩散模型中高效的推理时间扩展方法。为了解决这一问题,论文提出了三个关键方案:1) 基于随机微分方程(SDE-Based Generation)的生成方式,使流模型能够采用粒子采样(Particle Sampling);2) 插值转换(Interpolant Conversion),扩展搜索空间以增强样本多样性;3) 滚动预算强制(Rollover Budget Forcing, RBF),通过自适应分配各时间步的计算资源来最大化预算利用率。实验结果表明,基于SDE的生成方法,尤其是基于方差保持插值(Variance-Preserving Interpolant)的生成方式,在流模型的推理时间扩展中显著提升了粒子采样的性能,同时结合RBF的方法表现出最佳性能,优于现有所有方法。

链接: https://arxiv.org/abs/2503.19385
作者: Jaihoon Kim,Taehoon Yoon,Jisung Hwang,Minhyuk Sung
机构: KAIST (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page: this https URL

点击查看摘要

Abstract:We propose an inference-time scaling approach for pretrained flow models. Recently, inference-time scaling has gained significant attention in LLMs and diffusion models, improving sample quality or better aligning outputs with user preferences by leveraging additional computation. For diffusion models, particle sampling has allowed more efficient scaling due to the stochasticity at intermediate denoising steps. On the contrary, while flow models have gained popularity as an alternative to diffusion models–offering faster generation and high-quality outputs in state-of-the-art image and video generative models–efficient inference-time scaling methods used for diffusion models cannot be directly applied due to their deterministic generative process. To enable efficient inference-time scaling for flow models, we propose three key ideas: 1) SDE-based generation, enabling particle sampling in flow models, 2) Interpolant conversion, broadening the search space and enhancing sample diversity, and 3) Rollover Budget Forcing (RBF), an adaptive allocation of computational resources across timesteps to maximize budget utilization. Our experiments show that SDE-based generation, particularly variance-preserving (VP) interpolant-based generation, improves the performance of particle sampling methods for inference-time scaling in flow models. Additionally, we demonstrate that RBF with VP-SDE achieves the best performance, outperforming all previous inference-time scaling approaches.
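下面给出一个极简的Python/PyTorch示意,说明摘要中“SDE化生成 + 粒子采样”的思路:给确定性流ODE注入噪声使其变为SDE,多条粒子轨迹由此产生多样性,再按奖励得分重采样。其中 velocity(x, t) 与 reward(x) 的接口、重采样规则及噪声幅度 sigma 均为本文示意所做的假设;论文还包含插值转换与RBF预算分配,此处未涉及。

```python
import torch

def sde_particle_sampling(velocity, reward, x0, steps=50, K=8, sigma=0.5):
    """示意:将确定性流模型的ODE采样改为SDE,并做粒子重采样。
    velocity(x, t) 为预训练流模型速度场, reward(x) 为逐粒子打分器,
    二者接口均为假设;sigma 为假设的噪声幅度超参数。"""
    x = x0.unsqueeze(0).repeat(K, *([1] * x0.dim()))   # 复制出 K 个粒子
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((K,), i * dt)
        x = x + velocity(x, t) * dt                        # 确定性漂移(原ODE步)
        x = x + sigma * (dt ** 0.5) * torch.randn_like(x)  # 注入噪声: ODE -> SDE
        w = torch.softmax(reward(x), dim=0)                # 奖励归一化为重采样权重
        x = x[torch.multinomial(w, K, replacement=True)]   # 向高奖励粒子倾斜
    return x[reward(x).argmax()]                           # 返回得分最高的粒子
```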
zh

[CV-89] MVPortrait: Text-Guided Motion and Emotion Control for Multi-view Vivid Portrait Animation CVPR2025

【速读】:该论文旨在解决现有肖像动画方法在生成多视角、可控且富有表现力的动画时面临的挑战,包括缺乏对头部运动和面部表情的显式控制,以及无法从多个视角生成视频等问题。此外,尽管文本引导的肖像动画具有友好的用户交互性,但其潜力尚未被充分挖掘。论文提出了一种新颖的两阶段文本引导框架——MVPortrait(多视角生动肖像),以生成符合描述动作和情感的多视角生动动画。

解决方案的关键在于引入FLAME(Face, Lips, and Expressions Model)作为中间表示,并将其有效嵌入参数空间,从而实现对面部运动、表情及视点变换的统一建模。第一阶段分别训练基于文本输入的FLAME动作扩散模型和情感扩散模型;第二阶段则基于参考肖像图像和第一阶段生成的多视角FLAME渲染序列,训练一个多视角视频生成模型。通过这一方式,MVPortrait不仅提升了动作与情感控制能力,还实现了视点一致性,同时成为首个兼容文本、语音和视频作为驱动信号的可控肖像动画框架。

链接: https://arxiv.org/abs/2503.19383
作者: Yukang Lin,Hokit Fung,Jianjin Xu,Zeping Ren,Adela S.M. Lau,Guosheng Yin,Xiu Li
机构: Tsinghua University (清华大学); The University of Hong Kong (香港大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025

点击查看摘要

Abstract:Recent portrait animation methods have made significant strides in generating realistic lip synchronization. However, they often lack explicit control over head movements and facial expressions, and cannot produce videos from multiple viewpoints, resulting in less controllable and expressive animations. Moreover, text-guided portrait animation remains underexplored, despite its user-friendly nature. We present a novel two-stage text-guided framework, MVPortrait (Multi-view Vivid Portrait), to generate expressive multi-view portrait animations that faithfully capture the described motion and emotion. MVPortrait is the first to introduce FLAME as an intermediate representation, effectively embedding facial movements, expressions, and view transformations within its parameter space. In the first stage, we separately train the FLAME motion and emotion diffusion models based on text input. In the second stage, we train a multi-view video generation model conditioned on a reference portrait image and multi-view FLAME rendering sequences from the first stage. Experimental results exhibit that MVPortrait outperforms existing methods in terms of motion and emotion control, as well as view consistency. Furthermore, by leveraging FLAME as a bridge, MVPortrait becomes the first controllable portrait animation framework that is compatible with text, speech, and video as driving signals.
zh

[CV-90] Interpretable Generative Models through Post-hoc Concept Bottlenecks CVPR2025

【速读】:该论文试图解决概念瓶颈模型(CBM)在构建高效且可扩展的生成式解释性模型时面临的挑战,具体包括昂贵的从头训练需求以及依赖于真实图像和劳动密集型概念监督的问题。为了解决这些问题,论文提出了两种新颖且低成本的方法:概念瓶颈自动编码器(CB-AE)和概念控制器(CC)。解决方案的关键在于通过后验技术实现无需真实数据即可进行高效、可扩展的训练,并仅需极少量甚至无需概念监督,同时确保方法能够广泛适用于多种现代生成模型家族,如生成对抗网络(GANs)和扩散模型(Diffusion Models)。实验结果表明,所提出的方法在多个标准数据集上的解释性和可控性显著优于现有工作,平均提升约25%,并且训练速度提高了4到15倍。

链接: https://arxiv.org/abs/2503.19377
作者: Akshay Kulkarni,Ge Yan,Chung-En Sun,Tuomas Oikarinen,Tsui-Wei Weng
机构: University of California San Diego (加州大学圣地亚哥分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: CVPR 2025. Project Page: this https URL

点击查看摘要

Abstract:Concept bottleneck models (CBM) aim to produce inherently interpretable models that rely on human-understandable concepts for their predictions. However, existing approaches to design interpretable generative models based on CBMs are not yet efficient and scalable, as they require expensive generative model training from scratch as well as real images with labor-intensive concept supervision. To address these challenges, we present two novel and low-cost methods to build interpretable generative models through post-hoc techniques and we name our approaches: concept-bottleneck autoencoder (CB-AE) and concept controller (CC). Our proposed approaches enable efficient and scalable training without the need of real data and require only minimal to no concept supervision. Additionally, our methods generalize across modern generative model families including generative adversarial networks and diffusion models. We demonstrate the superior interpretability and steerability of our methods on numerous standard datasets like CelebA, CelebA-HQ, and CUB with large improvements (average ~25%) over the prior work, while being 4-15x faster to train. Finally, a large-scale user study is performed to validate the interpretability and steerability of our methods.
zh

[CV-91] DeClotH: Decomposable 3D Cloth and Human Body Reconstruction from a Single Image CVPR2025

【速读】:该论文旨在解决从单张图像中单独重建三维衣物和人体的问题,而非将穿着衣物的人体视为单一对象。这一任务具有挑战性,因为衣物与人体之间的严重遮挡使得准确推断几何形状和纹理变得困难。尽管近期基于文本到图像扩散模型的三维人体重建方法取得了显著成果,但直接应用此类方法通常会导致错误指导,尤其是在重建三维衣物方面。为了解决这些挑战,论文提出了框架中的两个核心设计:首先,利用衣物和人体的三维模板模型作为正则化项,提供强大的几何先验以缓解遮挡引起的错误重建;其次,引入专门设计的衣物扩散模型,以提供关于衣物外观的上下文信息,从而增强三维衣物的重建效果。实验结果表明,所提出的方法在重建三维衣物和人体方面非常有效。

链接: https://arxiv.org/abs/2503.19373
作者: Hyeongjin Nam,Donghwan Kim,Jeongtaek Oh,Kyoung Mu Lee
机构: Dept. of ECE&ASRI, Seoul National University (首尔国立大学); IPAI, Seoul National University (首尔国立大学); KRAFTON
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Published at CVPR 2025, 17 pages including the supplementary material

点击查看摘要

Abstract:Most existing methods of 3D clothed human reconstruction from a single image treat the clothed human as a single object without distinguishing between cloth and human body. In this regard, we present DeClotH, which separately reconstructs 3D cloth and human body from a single image. This task remains largely unexplored due to the extreme occlusion between cloth and the human body, making it challenging to infer accurate geometries and textures. Moreover, while recent 3D human reconstruction methods have achieved impressive results using text-to-image diffusion models, directly applying such an approach to this problem often leads to incorrect guidance, particularly in reconstructing 3D cloth. To address these challenges, we propose two core designs in our framework. First, to alleviate the occlusion issue, we leverage 3D template models of cloth and human body as regularizations, which provide strong geometric priors to prevent erroneous reconstruction by the occlusion. Second, we introduce a cloth diffusion model specifically designed to provide contextual information about cloth appearance, thereby enhancing the reconstruction of 3D cloth. Qualitative and quantitative experiments demonstrate that our proposed approach is highly effective in reconstructing both 3D cloth and the human body. More qualitative results are provided at this https URL.
zh

[CV-92] EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models

【速读】:该论文旨在解决生成式文本到视频(Text-to-Video, T2V)模型中视频运动可控性有限的问题。现有方法通常依赖于参考视频的运动表示来引导生成,但它们往往采用针对特定样本的优化策略,导致计算负担较高。为应对这一挑战,论文提出了一种名为EfficientMT的新颖且高效的端到端框架。其关键是通过利用少量合成配对的运动转移样本,将预训练的T2V模型适配为通用的运动转移框架,从而能够精确捕捉并再现多样的运动模式。具体而言,EfficientMT重新利用了T2V模型的主干网络以提取参考视频的时间信息,并进一步设计了一个缩放模块来提炼与运动相关的信息。此外,引入了一种时间整合机制,以无缝地将参考运动特征融入视频生成过程中。经过在自收集的合成配对样本上的训练后,EfficientMT实现了无需测试时优化的一般视频运动转移任务,同时在效率和灵活的运动控制能力方面优于现有方法。

链接: https://arxiv.org/abs/2503.19369
作者: Yufei Cai,Hu Han,Yuxiang Wei,Shiguang Shan,Xilin Chen
机构: Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences (CAS); Institute of Computing Technology, CAS; University of the Chinese Academy of Sciences; Harbin Institute of Technology
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The progress on generative models has led to significant advances on text-to-video (T2V) generation, yet the motion controllability of generated videos remains limited. Existing motion transfer methods explored the motion representations of reference videos to guide generation. Nevertheless, these methods typically rely on a sample-specific optimization strategy, resulting in high computational burdens. In this paper, we propose EfficientMT, a novel and efficient end-to-end framework for video motion transfer. By leveraging a small set of synthetic paired motion transfer samples, EfficientMT effectively adapts a pretrained T2V model into a general motion transfer framework that can accurately capture and reproduce diverse motion patterns. Specifically, we repurpose the backbone of the T2V model to extract temporal information from reference videos, and further propose a scaler module to distill motion-related information. Subsequently, we introduce a temporal integration mechanism that seamlessly incorporates reference motion features into the video generation process. After training on our self-collected synthetic paired samples, EfficientMT enables general video motion transfer without requiring test-time optimization. Extensive experiments demonstrate that our EfficientMT outperforms existing methods in efficiency while maintaining flexible motion controllability. Our code will be available at this https URL.
zh

[CV-93] VGAT: A Cancer Survival Analysis Framework Transitioning from Generative Visual Question Answering to Genomic Reconstruction

【速读】:该论文旨在解决在资源匮乏地区因缺乏基因组测序数据而阻碍多模态癌症生存分析临床应用的问题,试图仅利用全片图像(Whole-Slide Images, WSI)实现生存预测。论文的关键解决方案是提出了一种名为Visual-Genomic Answering-Guided Transformer (VGAT) 的框架,它通过整合视觉问答(Visual Question Answering, VQA)技术来重建基因组模态,从视觉特征中推导出稳定的基因组表示,从而规避原始基因组数据的高维挑战;同时,引入基于聚类的视觉提示模块,选择性增强判别性WSI patch,减少未过滤图像区域带来的噪声干扰。实验表明,该方法在五个TCGA数据集上优于现有仅基于WSI的方法,证明了无需基因组测序即可实现基因组引导推理的可行性,弥合了多模态研究与资源受限临床环境之间的差距。

链接: https://arxiv.org/abs/2503.19367
作者: Zizhi Chen,Minghao Han,Xukun Zhang,Shuwei Ma,Tao Liu,Xing Wei,Lihua Zhang
机构: Academy for Engineering and Technology, Fudan University (复旦大学工程与技术学院); Institute of Metaverse & Intelligent Medicine, Fudan University (复旦大学元宇宙与智能医学研究所); Engineering Research Center of AI and Robotics, Ministry of Education (教育部人工智能与机器人工程研究中心); Jilin Provincial Key Laboratory of Intelligence Science and Engineering (吉林省级智能科学与工程重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal learning combining pathology images and genomic sequences enhances cancer survival analysis but faces clinical implementation barriers due to limited access to genomic sequencing in under-resourced regions. To enable survival prediction using only whole-slide images (WSI), we propose the Visual-Genomic Answering-Guided Transformer (VGAT), a framework integrating Visual Question Answering (VQA) techniques for genomic modality reconstruction. By adapting VQA’s text feature extraction approach, we derive stable genomic representations that circumvent dimensionality challenges in raw genomic data. Simultaneously, a cluster-based visual prompt module selectively enhances discriminative WSI patches, addressing noise from unfiltered image regions. Evaluated across five TCGA datasets, VGAT outperforms existing WSI-only methods, demonstrating the viability of genomic-informed inference without sequencing. This approach bridges multimodal research and clinical feasibility in resource-constrained settings. The code link is this https URL.
zh

[CV-94] ImageSet2Text: Describing Sets of Images through Text

【速读】:该论文旨在解决大规模图像集合自然语言描述生成的问题,提出了一种名为ImageSet2Text的新方法。解决方案的关键在于结合视觉-语言基础模型,通过迭代提取图像子集的关键概念,并将其编码为结构化图谱,同时利用外部知识图谱和基于CLIP的验证进行洞察优化。这种方法以概念瓶颈模型(CBMs)和视觉问答(VQA)链为灵感,不仅提升了描述的准确性与完整性,还增强了跨图像集合的解释性与总结质量。

链接: https://arxiv.org/abs/2503.19361
作者: Piera Riccio,Francesco Galati,Kajetan Schweighofer,Noa Garcia,Nuria Oliver
机构: ELLIS Alicante (ELLIS阿尔坎特拉); Johannes Kepler University Linz (约翰内斯·开普勒林茨大学); The University of Osaka (大阪大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce ImageSet2Text, a novel approach that leverages vision-language foundation models to automatically create natural language descriptions of image sets. Inspired by concept bottleneck models (CBMs) and based on visual-question answering (VQA) chains, ImageSet2Text iteratively extracts key concepts from image subsets, encodes them into a structured graph, and refines insights using an external knowledge graph and CLIP-based validation. This iterative process enhances interpretability and enables accurate and detailed set-level summarization. Through extensive experiments, we evaluate ImageSet2Text’s descriptions on accuracy, completeness, readability and overall quality, benchmarking it against existing vision-language models and introducing new datasets for large-scale group image captioning.
zh

[CV-95] Show and Segment: Universal Medical Image Segmentation via In-Context Learning CVPR2025

【速读】:该论文旨在解决医学图像分割领域因解剖结构多样性、成像模态差异及分割任务复杂性导致的泛化难题。当前基于深度学习的方法通常需要针对特定任务进行训练或微调,难以适应未见过的类别。为应对这一挑战,论文提出了一种名为Iris的新框架,其核心创新在于无需微调即可通过参考示例灵活适应新任务。关键解决方案在于引入了一个轻量级的上下文任务编码模块,该模块从参考图像-标签对中提取任务特异性信息,并将其作为引导信号用于目标对象分割。通过将任务编码与推理过程解耦,Iris支持多种策略,包括单样本推理、上下文示例集成以及对象级上下文检索和上下文化调整。实验结果表明,Iris在分布内任务上表现接近特定任务模型,在分布外数据和未知类别上展现出卓越的泛化能力,同时其任务编码模块能够自动发现跨数据集和模态的解剖关系,为医学对象提供无监督的解剖学洞见。

链接: https://arxiv.org/abs/2503.19359
作者: Yunhe Gao,Di Liu,Zhuowei Li,Yunsheng Li,Dongdong Chen,Mu Zhou,Dimitris N. Metaxas
机构: Microsoft GenAI (微软生成式人工智能); Rutgers University (罗格斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025

点击查看摘要

Abstract:Medical image segmentation remains challenging due to the vast diversity of anatomical structures, imaging modalities, and segmentation tasks. While deep learning has made significant advances, current approaches struggle to generalize as they require task-specific training or fine-tuning on unseen classes. We present Iris, a novel In-context Reference Image guided Segmentation framework that enables flexible adaptation to novel tasks through the use of reference examples without fine-tuning. At its core, Iris features a lightweight context task encoding module that distills task-specific information from reference context image-label pairs. This rich context embedding information is used to guide the segmentation of target objects. By decoupling task encoding from inference, Iris supports diverse strategies from one-shot inference and context example ensemble to object-level context example retrieval and in-context tuning. Through comprehensive evaluation across twelve datasets, we demonstrate that Iris performs strongly compared to task-specific models on in-distribution tasks. On seven held-out datasets, Iris shows superior generalization to out-of-distribution data and unseen classes. Further, Iris’s task encoding module can automatically discover anatomical relationships across datasets and modalities, offering insights into medical objects without explicit anatomical supervision.
zh

[CV-96] From Sparse to Dense: Camera Relocalization with Scene-Specific Detector from Feature Gaussian Splatting CVPR2025

【速读】:该论文旨在解决视觉重定位(Visual Relocalization)问题,即在未知环境中精确估计相机位置与姿态。现有方法多依赖粗略到精细的局部化流程,需要先进行图像检索再执行特征匹配,而这些方法可能因检索不准确或匹配效率低下导致性能受限。论文提出了一种名为STDLoc的新方法,其关键在于引入基于特征高斯分布(Feature Gaussian)的场景表示,并设计了一种新的从稀疏到密集的局部化范式。通过这一场景表示,STDLoc提出了面向匹配的高斯采样策略及特定场景检测器,实现了高效且鲁棒的初始位姿估计。此外,利用密集特征匹配将查询图像特征图与高斯特征场对齐,进一步提升了定位精度。实验表明,STDLoc在室内和室外数据集上的定位精度和召回率均优于当前最先进的方法。

链接: https://arxiv.org/abs/2503.19358
作者: Zhiwei Huang,Hailin Yu,Yichun Shentu,Jin Yuan,Guofeng Zhang
机构: State Key Lab of CAD & CG, Zhejiang University (浙江大学 CAD&CG 国家重点实验室); SenseTime Research (商汤科技研究部)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 12 figures, CVPR 2025

点击查看摘要

Abstract:This paper presents a novel camera relocalization method, STDLoc, which leverages Feature Gaussian as scene representation. STDLoc is a full relocalization pipeline that can achieve accurate relocalization without relying on any pose prior. Unlike previous coarse-to-fine localization methods that require image retrieval first and then feature matching, we propose a novel sparse-to-dense localization paradigm. Based on this scene representation, we introduce a novel matching-oriented Gaussian sampling strategy and a scene-specific detector to achieve efficient and robust initial pose estimation. Furthermore, based on the initial localization results, we align the query feature map to the Gaussian feature field by dense feature matching to enable accurate localization. The experiments on indoor and outdoor datasets show that STDLoc outperforms current state-of-the-art localization methods in terms of localization accuracy and recall.
zh

[CV-97] Correcting Deviations from Normality: A Reformulated Diffusion Model for Multi-Class Unsupervised Anomaly Detection

【速读】:该论文旨在解决基于重建的无监督异常检测方法在多类别场景下难以保持结构完整性并准确恢复异常区域正常内容的问题。现有扩散模型(Diffusion Models)虽擅长从纯噪声生成图像,但在选择性地修改图像中的异常区域同时保留正常区域方面存在困难,这可能导致正常区域在重建过程中退化,从而削弱异常检测的效果。论文的关键解决方案在于提出了一种名为“偏差校正扩散”(\Ours) 的新模型,通过将异常建模为潜在空间中的噪声,实现了对正常区域的保护以及仅针对异常区域的变换操作。这种选择性的方法显著提升了重建质量,从而更有效地实现了复杂图像中异常区域的无监督检测与定位。实验结果表明,该方法在多个知名异常检测数据集上的像素级 AUPRC 提升了 11%-14%,优于当前最先进的方法。

链接: https://arxiv.org/abs/2503.19357
作者: Farzad Beizaee,Gregory A. Lodygensky,Christian Desrosiers,Jose Dolz
机构: ÉTS Montreal (蒙特利尔ÉTS工程学院); CHU-Sainte-Justine Montreal (蒙特利尔圣贾斯汀中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in diffusion models have spurred research into their application for Reconstruction-based unsupervised anomaly detection. However, these methods may struggle with maintaining structural integrity and recovering the anomaly-free content of abnormal regions, especially in multi-class scenarios. Furthermore, diffusion models are inherently designed to generate images from pure noise and struggle to selectively alter anomalous regions of an image while preserving normal ones. This leads to potential degradation of normal regions during reconstruction, hampering the effectiveness of anomaly detection. This paper introduces a reformulation of the standard diffusion model geared toward selective region alteration, allowing the accurate identification of anomalies. By modeling anomalies as noise in the latent space, our proposed Deviation Correction Diffusion model preserves the normal regions and encourages transformations exclusively on anomalous areas. This selective approach enhances the reconstruction quality, facilitating effective unsupervised detection and localization of anomaly regions. Comprehensive evaluations demonstrate the superiority of our method in accurately identifying and localizing anomalies in complex images, with pixel-level AUPRC improvements of 11-14% over state-of-the-art models on well-known anomaly detection datasets. The code is available at this https URL
zh

[CV-98] Can Vision-Language Models Answer Face to Face Questions in the Real-World?

【速读】:该论文试图解决的问题是:现有AI模型是否能够在连接摄像头和麦克风的情况下,实时描述和回答关于摄像头前实时场景与事件的问题,从而实现与用户的自然交互。这一目标长期以来被认为是AI领域的重要挑战,也是构建实用化AI助手和人形机器人与人类日常互动的前提条件。

论文的关键解决方案是引入了一个新的数据集和基准——Qualcomm交互视频数据集(IVD),该数据集基于实时问答设置,用户通过摄像头和音频输入提出问题,系统需即时作答。研究发现,当前模型在该任务上的表现远低于人类水平,并分析了性能差距的主要来源。然而,研究表明,针对所需感知技能对模型进行微调可以在很大程度上缩小这一差距。

链接: https://arxiv.org/abs/2503.19356
作者: Reza Pourreza,Rishit Dagli,Apratim Bhattacharyya,Sunny Panchal,Guillaume Berger,Roland Memisevic
机构: Qualcomm AI Research (高通人工智能研究); University of Toronto (多伦多大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:AI models have made significant strides in recent years in their ability to describe and answer questions about real-world images. They have also made progress in the ability to converse with users in real-time using audio input. This raises the question: have we reached the point where AI models, connected to a camera and microphone, can converse with users in real-time about scenes and events that are unfolding live in front of the camera? This has been a long-standing goal in AI and is a prerequisite for real-world AI assistants and humanoid robots to interact with humans in everyday situations. In this work, we introduce a new dataset and benchmark, the Qualcomm Interactive Video Dataset (IVD), which allows us to assess the extent to which existing models can support these abilities, and to what degree these capabilities can be instilled through fine-tuning. The dataset is based on a simple question-answering setup, where users ask questions that the system has to answer, in real-time, based on the camera and audio input. We show that existing models fall far behind human performance on this task, and we identify the main sources for the performance gap. However, we also show that for many of the required perceptual skills, fine-tuning on this form of data can significantly reduce this gap.
zh

[CV-99] ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models

【速读】:该论文致力于解决视觉-语言模型(Vision-Language Models, VLMs)在时空推理能力上的不足,特别是其难以有效分析移动物体的运动学要素(如行进距离、速度等)。为弥合这一差距,论文的关键创新在于构建了一个包含运动学指令微调的数据集STKit及其对应的基准测试STKit-Bench,该数据集结合了带有3D标注的真实世界视频,详细描述了物体的运动动力学信息。此外,为了扩展到缺乏3D标注的视频数据,论文提出了一种自动化的伪标签生成管道,利用真实尺度的4D重建技术实现高效标注。基于此,论文开发了增强时空推理能力的ST-VLM模型,并证明其在多个时空推理任务中表现出色,同时具备跨领域和跨任务的稳健泛化能力。

链接: https://arxiv.org/abs/2503.19355
作者: Dohwan Ko,Sihyeon Kim,Yumin Suh,Vijay Kumar B.G,Minseo Yoon,Manmohan Chandraker,Hyunwoo J. Kim
机构: Korea University (韩国大学); NEC Labs America (NEC美国实验室); UC San Diego (加州大学圣地亚哥分校); KAIST (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spatio-temporal reasoning is essential in understanding real-world environments in various fields, eg, autonomous driving and sports analytics. Recent advances have improved the spatial reasoning ability of Vision-Language Models (VLMs) by introducing large-scale data, but these models still struggle to analyze kinematic elements like traveled distance and speed of moving objects. To bridge this gap, we construct a spatio-temporal reasoning dataset and benchmark involving kinematic instruction tuning, referred to as STKit and STKit-Bench. They consist of real-world videos with 3D annotations, detailing object motion dynamics: traveled distance, speed, movement direction, inter-object distance comparisons, and relative movement direction. To further scale such data construction to videos without 3D labels, we propose an automatic pipeline to generate pseudo-labels using 4D reconstruction in real-world scale. With our kinematic instruction tuning data for spatio-temporal reasoning, we present ST-VLM, a VLM enhanced for spatio-temporal reasoning, which exhibits outstanding performance on STKit-Bench. Furthermore, we show that ST-VLM generalizes robustly across diverse domains and tasks, outperforming baselines on other spatio-temporal benchmarks (eg, ActivityNet, TVQA+). Finally, by integrating learned spatio-temporal reasoning with existing abilities, ST-VLM enables complex multi-step reasoning. Project page: this https URL.
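作为直观参考,下面用一小段NumPy代码示意如何从带3D标注的物体轨迹中推导摘要所述的运动学量(行进距离、平均速度、运动方向)。函数名与返回字段为本文示意所设,并非STKit数据集的实际格式。

```python
import numpy as np

def kinematic_labels(traj, fps=10):
    """示意:从 (T, 3) 的世界系物体轨迹(单位:米)推导运动学标注。
    返回字段名为假设,仅用于说明思路。"""
    steps = np.diff(traj, axis=0)                  # 逐帧位移向量
    dist = np.linalg.norm(steps, axis=1).sum()     # 总行进距离 (m)
    duration = (len(traj) - 1) / fps               # 经过时间 (s)
    speed = dist / duration                        # 平均速度 (m/s)
    direction = traj[-1] - traj[0]                 # 净位移
    direction = direction / (np.linalg.norm(direction) + 1e-8)  # 单位方向向量
    return {"distance_m": dist, "speed_mps": speed, "direction": direction}
```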
zh

[CV-100] Multi-Object Sketch Animation by Scene Decomposition and Motion Planning

【速读】:该论文致力于解决多对象草图动画(multi-object sketch animation)中的两个核心挑战:对象感知运动建模(object-aware motion modeling)和复杂运动优化(complex motion optimization)。当前单对象草图动画方法在多对象场景下表现不佳,主要源于上述挑战。为应对这些问题,论文提出了一种基于迭代优化的Score Distillation Sampling (SDS) 方法,名为MoSketch,且无需额外训练数据。其解决方案的关键在于设计了四个模块:基于LLM的场景分解(LLM-based scene decomposition)、基于LLM的运动规划(LLM-based motion planning)、运动细化网络(motion refinement network)以及组合式SDS(compositional SDS),通过分而治之的策略逐一攻克挑战。大量定性和定量实验验证了该方法相对于现有草图动画方法的优越性,标志着迈向多对象草图动画的重要一步。

链接: https://arxiv.org/abs/2503.19351
作者: Jingyu Liu,Zijie Xin,Yuhan Fu,Ruixiang Zhao,Bangxiang Lan,Xirong Li
机构: Renmin University of China (中国人民大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 17 figures

点击查看摘要

Abstract:Sketch animation, which brings static sketches to life by generating dynamic video sequences, has found widespread applications in GIF design, cartoon production, and daily entertainment. While current sketch animation methods perform well in single-object sketch animation, they struggle in multi-object scenarios. By analyzing their failures, we summarize two challenges of transitioning from single-object to multi-object sketch animation: object-aware motion modeling and complex motion optimization. For multi-object sketch animation, we propose MoSketch based on iterative optimization through Score Distillation Sampling (SDS), without any other data for training. We propose four modules: LLM-based scene decomposition, LLM-based motion planning, motion refinement network and compositional SDS, to tackle the two challenges in a divide-and-conquer strategy. Extensive qualitative and quantitative experiments demonstrate the superiority of our method over existing sketch animation approaches. MoSketch takes a pioneering step towards multi-object sketch animation, opening new avenues for future research and applications. The code will be released.
zh

[CV-101] Stop Walking in Circles! Bailing Out Early in Projected Gradient Descent CVPR

【速读】:该论文试图解决在对抗鲁棒性评估中基于投影梯度下降(PGD)方法计算开销过大的问题。当前最佳实践建议使用数千次迭代来生成单张图像的对抗样本,这导致了高昂的计算成本。论文的关键解决方案是提出了一种基于循环检测的早期终止策略,通过利用实际实现PGD时的几何特性,在不牺牲攻击强度的前提下显著加速PGD过程,同时保持与标准PGD相同的模型鲁棒性估计能力。这种方法使得原本计算上不可行的鲁棒性评估成为可能。

链接: https://arxiv.org/abs/2503.19347
作者: Philip Doldo,Derek Everett,Amol Khanna,Andre T Nguyen,Edward Raff
机构: Booz Allen Hamilton (博思艾伦咨询公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: To appear in the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

点击查看摘要

Abstract:Projected Gradient Descent (PGD) under the L∞ ball has become one of the de facto methods used in adversarial robustness evaluation for computer vision (CV) due to its reliability and efficacy, making it a strong and easy-to-implement iterative baseline. However, PGD is computationally demanding to apply, especially since the current best-practice recommendation is to use thousands of iterations to generate an adversarial example for a single image. In this work, we introduce a simple novel method for early termination of PGD based on cycle detection, exploiting the geometry of how PGD is implemented in practice, and show that it can produce large speedup factors while providing the exact same estimate of model robustness as standard PGD. This method substantially speeds up PGD without sacrificing any attack strength, enabling evaluations of robustness that were previously computationally intractable.
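下面是一个极简的PGD提前终止示意(Python/PyTorch):给定当前迭代点后,PGD的上升步、投影与裁剪均为确定性操作,因此一旦迭代点与某个历史状态完全重合,后续必然进入循环,可以立即终止。论文利用的是PGD实际实现中的几何性质做循环检测,此处以对迭代点做精确字节指纹的方式近似,仅为示意。

```python
import torch

def pgd_with_cycle_detection(model, x, y, eps=8/255, alpha=2/255, max_iters=1000):
    """L∞ PGD,迭代点重复出现时提前终止(精确指纹匹配为本示意的简化假设)。"""
    loss_fn = torch.nn.CrossEntropyLoss()
    x_adv = x.clone().detach()
    seen = set()  # 历史迭代点的状态指纹(示意实现,内存开销未优化)
    for _ in range(max_iters):
        key = x_adv.detach().cpu().numpy().tobytes()  # 当前迭代点的精确指纹
        if key in seen:
            break  # 检测到循环: 之后只会重复, 提前退出
        seen.add(key)
        x_adv.requires_grad_(True)
        grad = torch.autograd.grad(loss_fn(model(x_adv), y), x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()          # 符号梯度上升一步
            x_adv = x + (x_adv - x).clamp(-eps, eps)     # 投影回 L∞ 球
            x_adv = x_adv.clamp(0, 1).detach()           # 保持合法像素范围
    return x_adv
```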
zh

[CV-102] BADGR: Bundle Adjustment Diffusion Conditioned by GRadients for Wide-Baseline Floor Plan Reconstruction

【速读】:该论文试图解决从宽基线RGB全景图中精确重建相机位姿和楼层平面布局的问题,这是一个具有挑战性且尚未完全解决的任务。论文提出了一种名为BADGR的新型扩散模型,其关键在于联合执行重建与束调整(Bundle Adjustment, BA),通过利用来自数十张输入图像(密度各异)的一维地板边界预测,从粗略状态逐步优化相机位姿和布局。不同于引导型扩散模型,BADGR以单步Levenberg-Marquardt优化器得到的密集实体输出为条件,并通过最小化重投影误差来确保视图一致性。此外,去噪扩散过程中的布局生成目标补充了BA优化,提供了额外的基于跨图像共视特征的学习布局结构约束,这些约束有助于BADGR对空间关系进行合理推测(如墙壁相邻、共线性等),并结合全局上下文缓解密集边界观测带来的误差。BADGR仅在二维楼层平面数据上训练,简化了数据获取流程,增强了数据增强的鲁棒性,并支持多种输入密度。实验结果验证了该方法在不同输入密度下显著优于现有最先进的相机位姿和楼层平面布局重建技术。

链接: https://arxiv.org/abs/2503.19340
作者: Yuguang Li,Ivaylo Boyadzhiev,Zixuan Liu,Linda Shapiro,Alex Colburn
机构: University of Washington (华盛顿大学); Zillow Group (Zillow 集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reconstructing precise camera poses and floor plan layouts from wide-baseline RGB panoramas is a difficult and unsolved problem. We introduce BADGR, a novel diffusion model that jointly performs reconstruction and bundle adjustment (BA) to refine poses and layouts from a coarse state, using 1D floor boundary predictions from dozens of images of varying input densities. Unlike a guided diffusion model, BADGR is conditioned on dense per-entity outputs from a single-step Levenberg-Marquardt (LM) optimizer and is trained to predict camera and wall positions while minimizing reprojection errors for view-consistency. The objective of layout generation from the denoising diffusion process complements BA optimization by providing additional learned layout-structural constraints on top of the co-visible features across images. These constraints help BADGR make plausible guesses on spatial relations that constrain the pose graph, such as wall adjacency and collinearity, and learn to mitigate errors from dense boundary observations with global contexts. BADGR trains exclusively on 2D floor plans, simplifying data acquisition, enabling robust augmentation, and supporting a variety of input densities. Our experiments and analysis validate our method, which significantly outperforms state-of-the-art pose and floor plan layout reconstruction across different input densities.
zh

[CV-103] Divide-and-Conquer: Dual-Hierarchical Optimization for Semantic 4D Gaussian Spatting ICME2025

【速读】:该论文致力于解决动态场景理解中基于高斯点 splatting 方法存在的问题,主要挑战在于动态与静态部分的区分及其相互作用导致的传统更新策略容易产生显著伪影和噪声。论文的关键解决方案是提出 Dual-Hierarchical Optimization (DHO),它通过分而治之的方式结合 Hierarchical Gaussian Flow 和 Hierarchical Gaussian Guidance。前者有效实现了静态与动态渲染及特征的划分,后者则有助于缓解纹理复杂场景中动态前景渲染失真的问题。实验表明,该方法在合成数据集和真实世界数据集上均优于基线方法,并支持多种下游任务。

链接: https://arxiv.org/abs/2503.19332
作者: Zhiying Yan,Yiyuan Liang,Shilv Cai,Tao Zhang,Sheng Zhong,Luxin Yan,Xu Zou
机构: Huazhong University of Science and Technology, China (华中科技大学); National Key Laboratory of Multispectral Information Intelligent Processing Technology, China (多光谱信息智能处理技术国家重点实验室); Nanyang Technological University, Singapore (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICME 2025

点击查看摘要

Abstract:Semantic 4D Gaussians can be used for reconstructing and understanding dynamic scenes, which exhibit temporal variations absent from static scenes. Directly applying static methods to understand dynamic scenes will fail to capture the temporal features. Few works focus on dynamic scene understanding based on Gaussian Splatting, since employing the same update strategy for both dynamic and static parts, regardless of the distinction and interaction between Gaussians, leads to significant artifacts and noise. We propose Dual-Hierarchical Optimization (DHO), which consists of Hierarchical Gaussian Flow and Hierarchical Gaussian Guidance in a divide-and-conquer manner. The former implements effective division of static and dynamic rendering and features. The latter helps to mitigate the issue of dynamic foreground rendering distortion in textured complex scenes. Extensive experiments show that our method consistently outperforms the baselines on both synthetic and real-world datasets, and supports various downstream tasks. Project Page: this https URL.
zh

[CV-104] ChA-MAEViT: Unifying Channel-Aware Masked Autoencoders and Multi-Channel Vision Transformers for Improved Cross-Channel Learning

【速读】:该论文旨在解决多通道成像(Multi-Channel Imaging, MCI)领域中基于掩码自编码器(Masked Autoencoders, MAEs)方法的局限性。传统MAEs通常依赖于随机块掩码,假设图像在不同通道间存在显著冗余,从而通过跨通道相关性重建被遮掩的内容。然而,在MCI场景中,各通道可能提供互补信息且特征重叠较少,这种假设不再成立。因此,现有MAEs主要学习单个通道内的局部结构,而未能充分挖掘跨通道交互作用,限制了其在MCI任务中的性能。

为了解决上述问题,论文提出了一种名为ChA-MAEViT的方法,其核心解决方案包括四个关键策略:(1) 动态通道-块掩码设计,迫使模型不仅重建被遮掩的图像块,还需恢复缺失的通道,从而增强跨通道依赖关系并提高对不同通道配置的鲁棒性;(2) 记忆标记机制,作为长期记忆辅助工具以促进通道间的信息共享,应对结构多样化的通道重建挑战;(3) 混合标记融合模块,将细粒度的块标记与全局类别标记结合,捕获更丰富的表征;(4) 面向通道感知的轻量级解码器,利用通道标记高效重建图像块。实验结果表明,ChA-MAEViT在卫星影像和显微镜数据集上的表现显著优于最先进的MCI-ViTs,提升了3.0%-21.5%,凸显了跨通道交互在MCI任务中的重要性。

链接: https://arxiv.org/abs/2503.19331
作者: Chau Pham,Juan C. Caicedo,Bryan A. Plummer
机构: Boston University (波士顿大学); University of Wisconsin-Madison (威斯康星大学麦迪逊分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Prior work using Masked Autoencoders (MAEs) typically relies on random patch masking based on the assumption that images have significant redundancies across different channels, allowing for the reconstruction of masked content using cross-channel correlations. However, this assumption does not hold in Multi-Channel Imaging (MCI), where channels may provide complementary information with minimal feature overlap. Thus, these MAEs primarily learn local structures within individual channels from patch reconstruction, failing to fully leverage cross-channel interactions and limiting their MCI effectiveness. In this paper, we present ChA-MAEViT, an MAE-based method that enhances feature learning across MCI channels via four key strategies: (1) dynamic channel-patch masking, which compels the model to reconstruct missing channels in addition to masked patches, thereby enhancing cross-channel dependencies and improving robustness to varying channel configurations; (2) memory tokens, which serve as long-term memory aids to promote information sharing across channels, addressing the challenges of reconstructing structurally diverse channels; (3) hybrid token fusion module, which merges fine-grained patch tokens with a global class token to capture richer representations; and (4) Channel-Aware Decoder, a lightweight decoder utilizes channel tokens to effectively reconstruct image patches. Experiments on satellite and microscopy datasets, CHAMMI, JUMP-CP, and So2Sat, show that ChA-MAEViT significantly outperforms state-of-the-art MCI-ViTs by 3.0-21.5%, highlighting the importance of cross-channel interactions in MCI.
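下面用一小段PyTorch代码示意第(1)点“动态通道-块掩码”的核心思路:在随机遮蔽图像块之外,再随机遮蔽整条通道,迫使模型依赖跨通道信息完成重建。张量布局与两个掩码比例均为本文示意所做的假设。

```python
import torch

def channel_patch_mask(x, patch_ratio=0.5, channel_ratio=0.25):
    """示意:同时随机遮蔽图像块与整条通道。
    x: (B, C, N, D) 即 批次 x 通道 x 每通道patch数 x 嵌入维;两个比例为假设超参数。"""
    B, C, N, _ = x.shape
    patch_mask = torch.rand(B, C, N, device=x.device) < patch_ratio  # 逐patch随机遮蔽
    chan_mask = torch.rand(B, C, device=x.device) < channel_ratio    # 整通道随机遮蔽
    mask = patch_mask | chan_mask[:, :, None]                        # 两种遮蔽取并集
    x_masked = x.masked_fill(mask[..., None], 0.0)                   # 被遮蔽token置零(示意)
    return x_masked, mask                                            # mask 供重建损失使用
```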
zh

[CV-105] MATT-GS: Masked Attention-based 3DGS for Robot Perception and Object Detection IROS

【速读】:该论文旨在解决工业和智能工厂环境中机器人感知与物体检测面临的挑战,特别是在复杂场景下提高三维高斯点云(3D Gaussian Splatting, 3DGS)建模的视觉保真度和细节保留能力。论文的关键创新在于提出了一种基于掩码注意力机制的新方法:首先利用U2-Net进行背景去除,以隔离目标物体并减少冗余信息;其次引入基于Sobel滤波器的注意力机制,增强细粒度特征的捕捉能力,特别是对螺丝、电线及复杂纹理等关键细节的提取。通过定量评估(如L1损失、结构相似性指数SSIM和峰值信噪比PSNR),验证了所提方法在提升视觉真实感和细节保留方面的显著优势,从而有效增强了机器人在工业环境中的视觉识别与操作能力。

链接: https://arxiv.org/abs/2503.19330
作者: Jee Won Lee,Hansol Lim,SooYeun Yang,Jongseong Brad Choi
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: This work has been submitted to the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) for possible publication

点击查看摘要

Abstract:This paper presents a novel masked attention-based 3D Gaussian Splatting (3DGS) approach to enhance robotic perception and object detection in industrial and smart factory environments. U2-Net is employed for background removal to isolate target objects from raw images, thereby minimizing clutter and ensuring that the model processes only relevant data. Additionally, a Sobel filter-based attention mechanism is integrated into the 3DGS framework to enhance fine details - capturing critical features such as screws, wires, and intricate textures essential for high-precision tasks. We validate our approach using quantitative metrics, including L1 loss, SSIM, and PSNR, comparing the performance of the background-removed and attention-incorporated 3DGS model against the ground truth images and the original 3DGS training baseline. The results demonstrate significant improvements in visual fidelity and detail preservation, highlighting the effectiveness of our method in enhancing robotic vision for object recognition and manipulation in complex industrial settings.
zh

[CV-106] Long-Context Autoregressive Video Modeling with Next-Frame Prediction

【速读】:该论文旨在解决长上下文视频生成建模的问题,特别是视频自回归(Video Autoregressive, Video AR)在充分利用扩展时间上下文时面临的挑战。传统方法如Token AR和视频扩散Transformer在处理长视频序列时存在收敛性差及计算开销大的问题。此外,现有旋转位置编码(Rotary Position Embedding, RoPE)缺乏对远程上下文的有效时间衰减,难以有效推广到长视频序列。

论文的关键解决方案包括:1)提出FlexRoPE技术,在推理阶段引入灵活的时间衰减,增强RoPE的远程上下文外推能力;2)设计长短时上下文建模机制,通过高分辨率短时窗口确保细粒度的时间一致性,同时利用较少的token编码长程信息。这些方法共同实现了在可控token上下文长度内训练长视频序列,并显著提升了视频自回归建模的性能,提供了一个简单而有效的基线模型。

链接: https://arxiv.org/abs/2503.19325
作者: Yuchao Gu,Weijia Mao,Mike Zheng Shou
机构: Show Lab, National University of Singapore (新加坡国立大学 Show 实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page at this https URL

点击查看摘要

Abstract:Long-context autoregressive modeling has significantly advanced language generation, but video generation still struggles to fully utilize extended temporal contexts. To investigate long-context video modeling, we introduce Frame AutoRegressive (FAR), a strong baseline for video autoregressive modeling. Just as language models learn causal dependencies between tokens (i.e., Token AR), FAR models temporal causal dependencies between continuous frames, achieving better convergence than Token AR and video diffusion transformers. Building on FAR, we observe that long-context vision modeling faces challenges due to visual redundancy. Existing RoPE lacks effective temporal decay for remote context and fails to extrapolate well to long video sequences. Additionally, training on long videos is computationally expensive, as vision tokens grow much faster than language tokens. To tackle these issues, we propose balancing locality and long-range dependency. We introduce FlexRoPE, a test-time technique that adds flexible temporal decay to RoPE, enabling extrapolation to 16x longer vision contexts. Furthermore, we propose long short-term context modeling, where a high-resolution short-term context window ensures fine-grained temporal consistency, while an unlimited long-term context window encodes long-range information using fewer tokens. With this approach, we can train on long video sequences with a manageable token context length. We demonstrate that FAR achieves state-of-the-art performance in both short- and long-video generation, providing a simple yet effective baseline for video autoregressive modeling.
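下面给出一个极简示意,说明“为远程上下文加入时间衰减”的思路:构造一个随帧间距离线性衰减的注意力偏置,加到注意力logits上,以降低远距离帧的权重。真实的FlexRoPE是把可伸缩的时间衰减融入RoPE本身并在测试时生效,此处的加性偏置形式与衰减率 gamma 均为简化假设。

```python
import torch

def temporal_decay_bias(num_frames, tokens_per_frame, gamma=0.02):
    """示意:构造随帧间距离衰减的加性注意力偏置。gamma 为假设的衰减率超参数。"""
    t = torch.arange(num_frames).repeat_interleave(tokens_per_frame)  # 每个token所属帧号
    dist = (t[:, None] - t[None, :]).abs().float()  # token对之间的帧距离
    return -gamma * dist                            # 加到注意力logits上, 远帧被降权
```

用法示意:`logits = q @ k.transpose(-2, -1) / d ** 0.5 + bias`,随后再做softmax。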
zh

[CV-107] ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning

【速读】:该论文旨在解决文本到图像情境学习(Text-to-Image In-Context Learning, T2I-ICL)中的上下文推理问题。尽管统一多模态大语言模型(Unified Multimodal Large Language Models, MLLMs)近年来取得了显著进展,但在T2I-ICL场景中仍面临上下文推理能力不足的挑战。为了解决这一局限性,论文提出了一种新颖的框架,在图像生成之前引入名为ImageGen-CoT的思维过程作为关键解决方案。为了规避生成无结构且无效推理步骤的问题,研究团队开发了一个自动化的管道来精心构建高质量的ImageGen-CoT数据集,并通过微调MLLMs以增强其上下文推理能力。此外,通过探索测试时扩展策略,论文进一步提出了一个混合扩展方法,在生成多个ImageGen-CoT链后,对每条链进行多次采样以生成多张图像,从而进一步提升性能。实验结果表明,使用ImageGen-CoT数据集进行微调可使SEED-X在T2I-ICL任务上的性能显著提升80%。

链接: https://arxiv.org/abs/2503.19312
作者: Jiaqi Liao,Zhengyuan Yang,Linjie Li,Dianqi Li,Kevin Lin,Yu Cheng,Lijuan Wang
机构: Microsoft; The Chinese University of Hong Kong
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:In this work, we study the problem of Text-to-Image In-Context Learning (T2I-ICL). While Unified Multimodal LLMs (MLLMs) have advanced rapidly in recent years, they struggle with contextual reasoning in T2I-ICL scenarios. To address this limitation, we propose a novel framework that incorporates a thought process called ImageGen-CoT prior to image generation. To avoid generating unstructured ineffective reasoning steps, we develop an automatic pipeline to curate a high-quality ImageGen-CoT dataset. We then fine-tune MLLMs using this dataset to enhance their contextual reasoning capabilities. To further enhance performance, we explore test-time scale-up strategies and propose a novel hybrid scaling approach. This approach first generates multiple ImageGen-CoT chains and then produces multiple images for each chain via sampling. Extensive experiments demonstrate the effectiveness of our proposed method. Notably, fine-tuning with the ImageGen-CoT dataset leads to a substantial 80% performance gain for SEED-X on T2I-ICL tasks. See our project page at this https URL. Code and model weights will be open-sourced.
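下面是摘要中“混合扩展”策略的一个极简示意:先采样多条ImageGen-CoT推理链,再对每条链采样多张图像,最后按打分挑选最优结果。mllm、generator、scorer 三个可调用对象及其接口均为本文示意所设,分别代表微调后的MLLM、图像生成器与排序指标。

```python
def hybrid_scale(mllm, generator, scorer, prompt, n_chains=4, n_images=4):
    """混合测试时扩展示意:多链 x 多采样, 取得分最高的图像。
    mllm(prompt)->推理链字符串; generator(prompt, chain)->图像;
    scorer(image, prompt)->float。三个接口均为假设。"""
    best_img, best_score = None, float("-inf")
    for _ in range(n_chains):
        chain = mllm(prompt)                  # 采样一条 ImageGen-CoT 推理链
        for _ in range(n_images):
            img = generator(prompt, chain)    # 以该链为条件采样一张图像
            s = scorer(img, prompt)
            if s > best_score:
                best_img, best_score = img, s
    return best_img
```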
zh

[CV-108] LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text

【速读】:该论文旨在解决现有遥感视觉-语言基础模型(VLFM)在处理长文本时的技术瓶颈以及因短文本信息不足导致的“幻觉”问题。论文的关键解决方案在于提出了一种新的视觉-语言基础模型LRSCLIP和一个多模态数据集LRS2M。具体而言,通过整合多源遥感数据并采用大语言模型标注策略构建了包含200万图像-文本对的LRS2M数据集,首次提供了短文本和长文本,从而解决了现有数据集中语义粒度限制的问题;同时,设计了基于Long-CLIP的KPS模块的LRSCLIP架构,扩展了CLIP的文本处理能力,并通过双文本损失加权机制实现了细粒度的跨模态特征对齐。这些创新显著提升了模型在零样本长文本跨模态检索、短文本跨模态检索、图像分类及语义定位等任务中的性能。

链接: https://arxiv.org/abs/2503.19311
作者: Weizhi Chen,Jingbo Chen,Yupeng Deng,Jiansheng Chen,Yuman Feng,Zhihao Xi,Diyou Liu,Kai Li,Yu Meng
机构: Aerospace Information Research Institute, Chinese Academy of Sciences (中国科学院空天信息创新研究院); School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences (中国科学院大学电子电气与通信工程学院); School of Information Network Security, People’s Public Security University of China (中国人民公安大学信息安全学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 17 pages, 12 figures

点击查看摘要

Abstract:This study addresses the technical bottlenecks in handling long text and the “hallucination” issue caused by insufficient short text information in remote sensing vision-language foundation models (VLFM). We propose a novel vision-language foundation model, LRSCLIP, and a multimodal dataset, LRS2M. The main contributions are as follows: (1) By integrating multi-source remote sensing data and adopting a large language model labeling strategy, we construct the LRS2M dataset, which contains 2 million image-text pairs, providing both short and long texts for the first time, thus solving the problem of semantic granularity limitations in existing datasets; (2) The design of the LRSCLIP architecture based on Long-CLIP’s KPS module, which extends CLIP’s text processing capacity and achieves fine-grained cross-modal feature alignment through a dual-text loss weighting mechanism. Experimental results show that LRSCLIP improves retrieval accuracy by 10%-20% over the Long-CLIP baseline in the zero-shot long-text cross-modal retrieval task. For the zero-shot short-text cross-modal retrieval task, LRSCLIP achieves improvements over the current best model, GeoRSCLIP, with increases of 0.17%, 0.67%, and 0.92% in Text to Image R@1, Image to Text R@1, and mR on RSITMD, respectively, and 0.04%, 2.93%, and 1.28% on RSICD. In the zero-shot image classification task (average accuracy=75.75%) and semantic localization task (Rmi=0.7653), LRSCLIP achieves state-of-the-art performance. These results validate the dual advantages of fine-grained semantic understanding and global feature matching in LRSCLIP. This work provides a new benchmark model and data support for remote sensing multimodal learning. The related code has been open-sourced and is available at this https URL.
zh

[CV-109] A Comprehensive Analysis of Mamba for 3D Volumetric Medical Image Segmentation

【速读】:本文旨在解决在高分辨率3D医学图像分割任务中,Mamba模型是否能够替代Transformer,并提升多尺度表征学习能力,以及是否需要复杂的扫描策略以充分发挥其潜力的问题。关键在于通过引入自定义设计的三维深度可分离卷积增强U-shape Mamba网络(UlikeMamba),使其在保持计算效率的同时显著提高分割精度。此外,提出的多尺度Mamba模块能够更有效地捕捉细粒度细节与全局上下文信息,尤其在复杂分割任务中优于基于Transformer的方法。同时,研究还评估了不同扫描策略,表明简单方法通常足够有效,而提出的Tri-scan方法在最具挑战性的情况下表现出显著优势。通过这些改进,论文提出了一种新型3D医学图像分割网络,将Mamba定位为一种超越现有领先模型如nnUNet、CoTr和U-Mamba的变革性力量。

链接: https://arxiv.org/abs/2503.19308
作者: Chaohan Wang,Yutong Xie,Qi Chen,Yuyin Zhou,Qi Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Mamba, with its selective State Space Models (SSMs), offers a more computationally efficient solution than Transformers for long-range dependency modeling. However, there is still a debate about its effectiveness in high-resolution 3D medical image segmentation. In this study, we present a comprehensive investigation into Mamba’s capabilities in 3D medical image segmentation by tackling three pivotal questions: Can Mamba replace Transformers? Can it elevate multi-scale representation learning? Is complex scanning necessary to unlock its full potential? We evaluate Mamba’s performance across three large public benchmarks-AMOS, TotalSegmentator, and BraTS. Our findings reveal that UlikeMamba, a U-shape Mamba-based network, consistently surpasses UlikeTrans, a U-shape Transformer-based network, particularly when enhanced with custom-designed 3D depthwise convolutions, boosting accuracy and computational efficiency. Further, our proposed multi-scale Mamba block demonstrates superior performance in capturing both fine-grained details and global context, especially in complex segmentation tasks, surpassing Transformer-based counterparts. We also critically assess complex scanning strategies, finding that simpler methods often suffice, while our Tri-scan approach delivers notable advantages in the most challenging scenarios. By integrating these advancements, we introduce a new network for 3D medical image segmentation, positioning Mamba as a transformative force that outperforms leading models such as nnUNet, CoTr, and U-Mamba, offering competitive accuracy with superior computational efficiency. This study provides key insights into Mamba’s unique advantages, paving the way for more efficient and accurate approaches to 3D medical imaging.
zh

[CV-110] Analyzing the Synthetic-to-Real Domain Gap in 3D Hand Pose Estimation CVPR2025

【速读】:该论文旨在解决3D手部姿态估计中的合成数据与真实数据之间的差距(synthetic-to-real gap),这是现有生成式手部姿态估计方法面临的主要挑战之一。论文的关键在于通过系统性分析识别出手臂前部(forearm)、图像频率统计特性、手部姿态以及物体遮挡(object occlusions)等核心影响因素,并提出了一种高质量数据合成管道(data synthesis pipeline)。基于此,作者证明了当整合这些关键组件后,合成手部数据可以达到与真实数据相同的精度水平,从而为仅使用合成数据进行手部姿态估计铺平了道路。

链接: https://arxiv.org/abs/2503.19307
作者: Zhuoran Zhao,Linlin Yang,Pengzhan Sun,Pan Hui,Angela Yao
机构: The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); Communication University of China (中国传媒大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR2025

点击查看摘要

Abstract:Recent synthetic 3D human datasets for the face, body, and hands have pushed the limits on photorealism. Face recognition and body pose estimation have achieved state-of-the-art performance using synthetic training data alone, but for the hand, there is still a large synthetic-to-real gap. This paper presents the first systematic study of the synthetic-to-real gap of 3D hand pose estimation. We analyze the gap and identify key components such as the forearm, image frequency statistics, hand pose, and object occlusions. To facilitate our analysis, we propose a data synthesis pipeline to synthesize high-quality data. We demonstrate that synthetic hand data can achieve the same level of accuracy as real data when integrating our identified components, paving the path to use synthetic data alone for hand pose estimation. Code and data are available at: this https URL.
zh

[CV-111] BIMII-Net: Brain-Inspired Multi-Iterative Interactive Network for RGB-T Road Scene Semantic Segmentation

【速读】:该论文旨在解决现有RGB-T语义分割模型在融合多模态信息时存在的简单相加或拼接策略以及未能充分利用不同层次信息差异性的问题。论文的关键在于提出了一种名为Brain-Inspired Multi-Iteration Interaction Network (BIMII-Net) 的新型网络架构。首先,通过设计基于脑启发模型的深度连续耦合神经网络(DCCNN),满足道路场景中精确纹理和局部信息提取的需求;其次,在特征融合阶段引入跨显式注意力增强融合模块(CEAEF-Module),以加强多模态信息间的交互与表达能力;最后,构建互补交互多层解码器结构,包括浅层特征迭代模块(SFI-Module)、深层特征迭代模块(DFI-Module)及多特征增强模块(MFE-Module),协同提取纹理细节与全局骨架信息,并通过多模块联合监督进一步优化分割结果。实验结果表明,BIMII-Net在脑启发计算领域达到了最先进的性能,并在多个RGB-T数据集上表现出强大的泛化能力。

链接: https://arxiv.org/abs/2503.19303
作者: Hanshuo Qiu,Jie Jiang,Ruoli Yang,Lixin Zhan,Jizhao Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:RGB-T road scene semantic segmentation enhances visual scene understanding in complex environments characterized by inadequate illumination or occlusion by fusing information from RGB and thermal images. Nevertheless, existing RGB-T semantic segmentation models typically depend on simple addition or concatenation strategies or ignore the differences between information at different levels. To address these issues, we proposed a novel RGB-T road scene semantic segmentation network called Brain-Inspired Multi-Iteration Interaction Network (BIMII-Net). First, to meet the requirements of accurate texture and local information extraction in road scenarios like autonomous driving, we proposed a deep continuous-coupled neural network (DCCNN) architecture based on a brain-inspired model. Second, to enhance the interaction and expression capabilities among multi-modal information, we designed a cross explicit attention-enhanced fusion module (CEAEF-Module) in the feature fusion stage of BIMII-Net to effectively integrate features at different levels. Finally, we constructed a complementary interactive multi-layer decoder structure, incorporating the shallow-level feature iteration module (SFI-Module), the deep-level feature iteration module (DFI-Module), and the multi-feature enhancement module (MFE-Module) to collaboratively extract texture details and global skeleton information, with multi-module joint supervision further optimizing the segmentation results. Experimental results demonstrate that BIMII-Net achieves state-of-the-art (SOTA) performance in the brain-inspired computing domain and outperforms most existing RGB-T semantic segmentation methods. It also exhibits strong generalization capabilities on multiple RGB-T datasets, proving the effectiveness of brain-inspired computer models in multi-modal image segmentation tasks.
zh

[CV-112] Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval

【速读】:本文旨在解决零样本组合图像检索(Zero-shot Composed Image Retrieval, ZS-CIR)任务中,由于缺乏标注三元组数据而导致性能受限的问题。传统方法通过预训练一个文本反转网络将图像映射为单一伪词令牌来实现这一任务,但其粗粒度的文本反转可能无法精确捕捉图像的完整内容。为了解决此问题,论文提出了一种新颖的细粒度文本反转网络FTI4CIR。该方案的关键在于包含两个主要组件:一是细粒度伪词令牌映射,它将图像映射为主体导向的伪词令牌及多个属性导向的伪词令牌,以全面表达图像的文本形式;二是基于BLIP生成图像标题模板的三重描述语义正则化,用于联合对齐细粒度伪词令牌与真实词嵌入空间,从而提升模型对图像细节的理解能力。实验结果表明,所提方法在三个基准数据集上的优越性。

链接: https://arxiv.org/abs/2503.19296
作者: Haoqiang Lin,Haokun Wen,Xuemeng Song,Meng Liu,Yupeng Hu,Liqiang Nie
机构: Shandong University(Qingdao)(山东大学(青岛)); Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)); Shandong Jianzhu University(Jinan)(山东建筑大学(济南)); Shandong University(Jinan)(山东大学(济南))
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Composed Image Retrieval (CIR) allows users to search target images with a multimodal query, comprising a reference image and a modification text that describes the user’s modification demand over the reference image. Nevertheless, due to the expensive labor cost of training data annotation, recent researchers have shifted to the challenging task of zero-shot CIR (ZS-CIR), which targets fulfilling CIR without annotated triplets. The pioneer ZS-CIR studies focus on converting the CIR task into a standard text-to-image retrieval task by pre-training a textual inversion network that can map a given image into a single pseudo-word token. Despite their significant progress, their coarse-grained textual inversion may be insufficient to capture the full content of the image accurately. To overcome this issue, in this work, we propose a novel Fine-grained Textual Inversion Network for ZS-CIR, named FTI4CIR. In particular, FTI4CIR comprises two main components: fine-grained pseudo-word token mapping and tri-wise caption-based semantic regularization. The former maps the image into a subject-oriented pseudo-word token and several attribute-oriented pseudo-word tokens to comprehensively express the image in the textual form, while the latter works on jointly aligning the fine-grained pseudo-word tokens to the real-word token embedding space based on a BLIP-generated image caption template. Extensive experiments conducted on three benchmark datasets demonstrate the superiority of our proposed method.
zh

[CV-113] Exploring Semantic Feature Discrimination for Perceptual Image Super-Resolution and Opinion-Unaware No-Reference Image Quality Assessment CVPR2025

【速读】:该论文旨在解决现有基于生成对抗网络(GAN)的图像超分辨率(SR)方法在提升感知质量时通常直接对图像进行粗粒度判别,而忽视图像语义信息的问题,这使得超分辨率网络(SRN)难以学习到精细且与语义相关的纹理细节。为了解决这一问题,论文提出了语义特征判别方法(SFD)。其关键是设计了一个特征判别器(Feat-D),用于从CLIP中区分逐像素的中间语义特征,并使超分辨率图像的特征分布与高质量图像对齐;同时引入可学习提示对(LPP),以对抗方式在CLIP更抽象的输出特征上进行文本引导判别(TG-D),进一步增强判别能力。通过Feat-D和TG-D的结合,SFD能够有效区分低质量和高质量图像的语义特征分布,从而促使SRN生成更加真实且语义相关的纹理。此外,基于训练好的Feat-D和LPP,还提出了一种新的意见无关(opinion-unaware)无参考图像质量评估方法(SFD-IQA),在无需额外针对性训练的情况下显著提升了OU NR-IQA性能。实验结果验证了所提方法的有效性。

链接: https://arxiv.org/abs/2503.19295
作者: Guanglu Dong,Xiangyu Liao,Mingyang Li,Guihuan Guo,Chao Ren
机构: College of Electronics and Information Engineering, Sichuan University (四川大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted to CVPR2025

点击查看摘要

Abstract:Generative Adversarial Networks (GANs) have been widely applied to image super-resolution (SR) to enhance the perceptual quality. However, most existing GAN-based SR methods typically perform coarse-grained discrimination directly on images and ignore the semantic information of images, making it challenging for the super resolution networks (SRN) to learn fine-grained and semantic-related texture details. To alleviate this issue, we propose a semantic feature discrimination method, SFD, for perceptual SR. Specifically, we first design a feature discriminator (Feat-D), to discriminate the pixel-wise middle semantic features from CLIP, aligning the feature distributions of SR images with that of high-quality images. Additionally, we propose a text-guided discrimination method (TG-D) by introducing learnable prompt pairs (LPP) in an adversarial manner to perform discrimination on the more abstract output feature of CLIP, further enhancing the discriminative ability of our method. With both Feat-D and TG-D, our SFD can effectively distinguish between the semantic feature distributions of low-quality and high-quality images, encouraging SRN to generate more realistic and semantic-relevant textures. Furthermore, based on the trained Feat-D and LPP, we propose a novel opinion-unaware no-reference image quality assessment (OU NR-IQA) method, SFD-IQA, greatly improving OU NR-IQA performance without any additional targeted training. Extensive experiments on classical SISR, real-world SISR, and OU NR-IQA tasks demonstrate the effectiveness of our proposed methods.
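下面用hinge形式的GAN损失示意Feat-D式特征判别的计算方式:判别器在CLIP的逐像素中间特征上区分超分结果与高质量图像,生成端(SRN)则朝着让判别器无法区分的方向优化。摘要未说明具体的对抗损失形式,这里的hinge损失与各接口均为本文示意所做的假设。

```python
import torch.nn.functional as F

def feature_discriminator_losses(feat_d, clip_feats_sr, clip_feats_hq):
    """示意:Feat-D式特征判别的hinge对抗损失。
    feat_d 将逐像素CLIP特征映射为真/假logits, 接口为假设。"""
    d_real = feat_d(clip_feats_hq)                 # 高质量图像特征的判别输出
    d_fake = feat_d(clip_feats_sr.detach())        # 超分特征(截断梯度)的判别输出
    loss_d = F.relu(1 - d_real).mean() + F.relu(1 + d_fake).mean()  # 判别器损失
    loss_g = -feat_d(clip_feats_sr).mean()         # 生成端(SRN)的对抗损失
    return loss_d, loss_g
```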
zh

[CV-114] ISPDiffuser: Learning RAW-to-sRGB Mappings with Texture-Aware Diffusion Models and Histogram-Guided Color Consistency AAAI2025

【速读】:该论文旨在解决智能手机传感器捕获的RAW数据转换为高质量sRGB图像过程中存在的细节差异和色彩失真问题。现有基于学习的方法虽能达到与复杂手工设计的相机ISP方案相当的结果,但在细节恢复和色彩一致性方面仍存在不足。论文的关键解决方案是提出了一种名为ISPDiffuser的扩散模型驱动解耦框架,将RAW到sRGB的映射分解为灰度空间中的细节重建以及从灰度到sRGB的颜色一致性映射。具体而言,该框架引入了一个纹理感知的扩散模型,利用扩散模型的生成能力专注于局部细节恢复,并通过设计纹理增强损失来引导模型生成更精细的纹理细节;同时,还提出了一个以颜色直方图为指导的颜色一致性模块,结合颜色一致性损失约束学习到的颜色信息,从而实现精确的颜色映射。实验结果表明,ISPDiffuser在定量和视觉效果上均优于当前最先进的方法。
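
其中"以颜色直方图为指导学习颜色一致性"的做法,可以用一个可微软直方图加逐通道直方图差异损失来粗略示意(非论文官方实现;分桶数 bins 与核宽 sigma 均为假设):

```python
# 概念性示意:直方图引导的颜色一致性损失
import torch

def soft_histogram(x, bins=32, sigma=0.02):
    """对取值在 [0,1] 的单通道图像计算可微的软直方图。"""
    centers = torch.linspace(0, 1, bins, device=x.device).view(1, -1)
    v = x.reshape(x.shape[0], -1, 1)                       # (B, N, 1)
    w = torch.exp(-((v - centers) ** 2) / (2 * sigma**2))  # 高斯核软分配
    h = w.sum(dim=1)
    return h / h.sum(dim=1, keepdim=True)                  # 归一化为分布

def color_consistency_loss(pred_rgb, ref_rgb):
    """逐通道比较预测 sRGB 与参考图的颜色直方图差异。"""
    loss = 0.0
    for c in range(3):
        loss = loss + torch.abs(
            soft_histogram(pred_rgb[:, c]) - soft_histogram(ref_rgb[:, c])
        ).sum(dim=1).mean()
    return loss

pred, ref = torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64)
print(color_consistency_loss(pred, ref))
```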

链接: https://arxiv.org/abs/2503.19283
作者: Yang Ren,Hai Jiang,Menglong Yang,Wei Li,Shuaicheng Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2025

点击查看摘要

Abstract:RAW-to-sRGB mapping, or the simulation of the traditional camera image signal processor (ISP), aims to generate DSLR-quality sRGB images from raw data captured by smartphone sensors. Despite achieving comparable results to sophisticated handcrafted camera ISP solutions, existing learning-based methods still struggle with detail disparity and color distortion. In this paper, we present ISPDiffuser, a diffusion-based decoupled framework that separates the RAW-to-sRGB mapping into detail reconstruction in grayscale space and color consistency mapping from grayscale to sRGB. Specifically, we propose a texture-aware diffusion model that leverages the generative ability of diffusion models to focus on local detail recovery, in which a texture enrichment loss is further proposed to prompt the diffusion model to generate more intricate texture details. Subsequently, we introduce a histogram-guided color consistency module that utilizes color histogram as guidance to learn precise color information for grayscale to sRGB color consistency mapping, with a color consistency loss designed to constrain the learned color information. Extensive experimental results show that the proposed ISPDiffuser outperforms state-of-the-art competitors both quantitatively and visually. The code is available at this https URL.
zh

[CV-115] Multiscale Feature Importance-based Bit Allocation for End-to-End Feature Coding for Machines

【速读】:该论文旨在解决特征编码(Feature Coding for Machines, FCM)在远程智能分析中的高效压缩问题,以满足未来智能视觉应用的需求。论文的关键在于提出了一种基于多尺度特征重要性的比特分配方法(Multiscale Feature Importance-based Bit Allocation, MFIBA),用于端到端的FCM。其解决方案的核心包括:首先通过多尺度特征重要性预测模块(Multiscale Feature Importance Prediction, MFIP)评估各尺度特征的重要性;其次构建任务损失-率模型,量化使用压缩特征时的任务精度损失与编码比特率之间的关系;最后设计MFIBA算法,合理分配多尺度特征的编码比特,使其与特征重要性相匹配。实验结果表明,该方法在多种机器视觉任务中实现了显著的比特率节省,验证了其通用性和适应性。
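
"按特征重要性分配编码比特"这一核心思想可用如下极简示意说明(非论文官方实现;实际方法还需结合任务损失-率模型求解,这里仅演示按重要性权重成比例分配,min_bits 为假设的下限):

```python
# 概念性示意:按多尺度特征重要性分配编码比特
import numpy as np

def allocate_bits(importance, total_bits, min_bits=0.0):
    """将总比特预算按重要性权重成比例分配到各尺度特征。
    importance: 各尺度重要性权重(例如由 MFIP 模块预测),和不必为 1。"""
    w = np.asarray(importance, dtype=float)
    w = w / w.sum()
    bits = min_bits + (total_bits - min_bits * len(w)) * w
    return bits

# 假设 4 个尺度的特征,重要性随任务与图像实例变化
imp = [0.5, 0.25, 0.15, 0.10]
print(allocate_bits(imp, total_bits=1000, min_bits=50))
# -> 重要性高的尺度获得更多比特,编码更精细
```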

链接: https://arxiv.org/abs/2503.19278
作者: Junle Liu,Yun Zhang,Zixi Guo
机构: School of Electronics and Communication Engineering, Sun Yat-Sen University (中山大学)(Shenzhen, Guangdong, China); Faculty of Information and Science and Engineering, Ningbo University (宁波大学)(Ningbo, Zhejiang, China)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Feature Coding for Machines (FCM) aims to compress intermediate features effectively for remote intelligent analytics, which is crucial for future intelligent visual applications. In this paper, we propose a Multiscale Feature Importance-based Bit Allocation (MFIBA) for end-to-end FCM. First, we find that the importance of features for machine vision tasks varies with the scales, object size, and image instances. Based on this finding, we propose a Multiscale Feature Importance Prediction (MFIP) module to predict the importance weight for each scale of features. Second, we propose a task loss-rate model to establish the relationship between the task accuracy losses of using compressed features and the bitrate of encoding these features. Finally, we develop a MFIBA for end-to-end FCM, which is able to assign coding bits of multiscale features more reasonably based on their importance. Experimental results demonstrate that when combined with a retrained Efficient Learned Image Compression (ELIC), the proposed MFIBA achieves an average of 38.202% bitrate savings in object detection compared to the anchor ELIC. Moreover, the proposed MFIBA achieves an average of 17.212% and 36.492% feature bitrate savings for instance segmentation and keypoint detection, respectively. When the proposed MFIBA is applied to the LIC-TCM, it achieves an average of 18.103%, 19.866% and 19.597% bit rate savings on three machine vision tasks, respectively, which validates that the proposed MFIBA has good generalizability and adaptability to different machine vision tasks and FCM base codecs.
zh

[CV-116] Context-Aware Semantic Segmentation: Enhancing Pixel-Level Understanding with Large Language Models for Advanced Vision Applications

【速读】:该论文旨在解决语义分割在捕捉对象间上下文和语义关系方面的局限性,特别是现有模型(如CNN和Transformer架构)难以区分语义相似的对象或理解复杂场景上下文的问题。论文的关键解决方案在于提出了一种新颖的上下文感知语义分割框架,将大型语言模型(Large Language Models, LLMs)与最先进的视觉主干网络相结合。该框架利用Swin Transformer进行鲁棒的视觉特征提取,并通过GPT-4增强语义理解以生成文本嵌入。引入的交叉注意力机制用于对齐视觉和语言特征,从而更有效地推理场景上下文。此外,采用图神经网络(Graph Neural Networks, GNNs)建模场景中的对象关系,捕捉传统模型忽略的依赖关系。实验结果表明,该方法在COCO和Cityscapes等基准数据集上不仅提升了像素级准确性(mIoU),还增强了上下文理解能力(mAP)。
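
其中"交叉注意力对齐视觉与语言特征"这一步,可用如下PyTorch草图示意(非论文官方实现;维度 d_vis、d_txt 等均为假设):

```python
# 概念性示意:视觉特征作 Query、文本嵌入作 Key/Value 的交叉注意力融合
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, d_vis=768, d_txt=1024, d_model=256, n_heads=8):
        super().__init__()
        self.proj_v = nn.Linear(d_vis, d_model)   # 假设视觉主干特征维度 d_vis
        self.proj_t = nn.Linear(d_txt, d_model)   # 假设文本嵌入维度 d_txt
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, vis_tokens, txt_tokens):
        q = self.proj_v(vis_tokens)               # (B, HW, d)
        kv = self.proj_t(txt_tokens)              # (B, L, d)
        fused, _ = self.attn(q, kv, kv)           # 每个像素 token 聚合语言上下文
        return fused + q                          # 残差保留原始视觉信息

m = CrossAttentionFusion()
out = m(torch.randn(2, 196, 768), torch.randn(2, 12, 1024))
print(out.shape)  # torch.Size([2, 196, 256])
```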

链接: https://arxiv.org/abs/2503.19276
作者: Ben Rahman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Semantic segmentation has made significant strides in pixel-level image understanding, yet it remains limited in capturing contextual and semantic relationships between objects. Current models, such as CNN and Transformer-based architectures, excel at identifying pixel-level features but fail to distinguish semantically similar objects (e.g., “doctor” vs. “nurse” in a hospital scene) or understand complex contextual scenarios (e.g., differentiating a running child from a regular pedestrian in autonomous driving). To address these limitations, we proposed a novel Context-Aware Semantic Segmentation framework that integrates Large Language Models (LLMs) with state-of-the-art vision backbones. Our hybrid model leverages the Swin Transformer for robust visual feature extraction and GPT-4 for enriching semantic understanding through text embeddings. A Cross-Attention Mechanism is introduced to align vision and language features, enabling the model to reason about context more effectively. Additionally, Graph Neural Networks (GNNs) are employed to model object relationships within the scene, capturing dependencies that are overlooked by traditional models. Experimental results on benchmark datasets (e.g., COCO, Cityscapes) demonstrate that our approach outperforms the existing methods in both pixel-level accuracy (mIoU) and contextual understanding (mAP). This work bridges the gap between vision and language, paving the path for more intelligent and context-aware vision systems in applications including autonomous driving, medical imaging, and robotics.
zh

[CV-117] DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation Instruct-Masking Tuning

【速读】:该论文致力于解决视觉推理(Visual Reasoning, VR)领域中基于大型语言模型(Large Language Models, LLMs)的组合式视觉推理方法所面临的性能瓶颈问题。这些方法虽然在提升视觉理解能力方面展现出潜力,但受限于冻结的LLMs缺乏工具意识,且面临训练数据有限、不完美的工具会引入错误并降低数据收集效率、难以在含噪工作流上微调等挑战。论文的关键解决方案是提出DWIM框架:i) 差异感知训练工作流生成(Discrepancy-aware Training Workflow generation),用于评估工具使用并提取更可行的工作流以增强训练;ii) 指令掩码微调(Instruct-Masking fine-tuning),通过引导模型仅克隆有效的操作来生成更具实用性的解决方案。实验结果表明,DWIM在多种VR任务中达到了最先进的性能,并表现出强大的泛化能力。
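
Instruct-Masking的核心是在微调时只对被验证为有效的动作token计算损失。下面是一个按掩码屏蔽交叉熵损失的极简示意(非论文官方实现;张量形状与掩码来源均为假设):

```python
# 概念性示意:Instruct-Masking 微调的损失计算,仅克隆"有效动作" token
import torch
import torch.nn.functional as F

def instruct_masking_loss(logits, targets, effective_mask):
    """logits: (B, T, V) 模型输出;targets: (B, T) 工作流 token;
    effective_mask: (B, T) 布尔张量,仅被验证为有效的工具调用/动作为 True。"""
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    loss = loss * effective_mask.float()   # 屏蔽噪声步骤,不反传其梯度
    return loss.sum() / effective_mask.float().sum().clamp(min=1.0)

logits = torch.randn(2, 10, 100)
targets = torch.randint(0, 100, (2, 10))
mask = torch.rand(2, 10) > 0.5
print(instruct_masking_loss(logits, targets, mask))
```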

链接: https://arxiv.org/abs/2503.19263
作者: Fucai Ke,Vijay Kumar B G,Xingjian Leng,Zhixi Cai,Zaid Khan,Weiqing Wang,Pari Delir Haghighi,Hamid Rezatofighi,Manmohan Chandraker
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual reasoning (VR), which is crucial in many fields for enabling human-like visual understanding, remains highly challenging. Recently, compositional visual reasoning approaches, which leverage the reasoning abilities of large language models (LLMs) with integrated tools to solve problems, have shown promise as more effective strategies than end-to-end VR methods. However, these approaches face limitations, as frozen LLMs lack tool awareness in VR, leading to performance bottlenecks. While leveraging LLMs for reasoning is widely used in other domains, they are not directly applicable to VR due to limited training data, imperfect tools that introduce errors and reduce data collection efficiency in VR, and challenges in fine-tuning on noisy workflows. To address these challenges, we propose DWIM: i) Discrepancy-aware training Workflow generation, which assesses tool usage and extracts more viable workflows for training; and ii) Instruct-Masking fine-tuning, which guides the model to only clone effective actions, enabling the generation of more practical solutions. Our experiments demonstrate that DWIM achieves state-of-the-art performance across various VR tasks, exhibiting strong generalization on multiple widely-used datasets.
zh

[CV-118] Learning Hazing to Dehazing: Towards Realistic Haze Generation for Real-World Image Dehazing CVPR2025

【速读】:该论文旨在解决现有真实世界图像去雾方法过度依赖预训练模型及其相关训练数据的问题,并探索生成式扩散模型在重度雾霾场景下恢复严重失真信息的潜力。然而,这些扩散模型由于采样过程耗时较长,其去雾应用尚未得到充分利用。为了解决这些问题,论文提出了一种新的去雾框架,包括一个现实雾霾图像生成模块(HazeGen)和一个基于扩散模型的去雾模块(DiffDehaze)。其中,HazeGen 利用预训练文本到图像扩散模型中嵌入的真实雾霾图像生成性扩散先验知识,通过专门设计的混合训练策略和融合采样策略,生成高质量的逼真且多样化的雾霾图像,作为 DiffDehaze 的训练数据。为了缓解基于扩散模型方法的效率和保真度问题,DiffDehaze 引入了一种加速保真采样过程(AccSamp),其核心是瓦片统计对齐操作(AlignOp),能够在极少量采样步骤内提供干净且忠实的去雾估计,从而显著降低计算复杂度并实现有效的保真度引导。关键在于 HazeGen 和 AccSamp 的结合,分别解决了训练数据生成与高效去雾的核心挑战。
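
AlignOp的"瓦片统计对齐"可以理解为逐瓦片匹配均值与标准差。以下为一个概念性草图(非论文官方实现;tile大小及"H、W可被tile整除"均为示例性简化):

```python
# 概念性示意:瓦片统计对齐操作(AlignOp 风格),逐瓦片匹配一阶/二阶统计量
import torch

def align_op(x, ref, tile=8, eps=1e-6):
    """将 x 每个 tile 的均值与标准差对齐到参考估计 ref,以提供保真引导。
    x, ref: (B, C, H, W),此处简化假设 H、W 可被 tile 整除。"""
    B, C, H, W = x.shape
    xt = x.reshape(B, C, H // tile, tile, W // tile, tile)
    rt = ref.reshape(B, C, H // tile, tile, W // tile, tile)
    mu_x = xt.mean(dim=(3, 5), keepdim=True)
    sd_x = xt.std(dim=(3, 5), keepdim=True) + eps
    mu_r = rt.mean(dim=(3, 5), keepdim=True)
    sd_r = rt.std(dim=(3, 5), keepdim=True)
    out = (xt - mu_x) / sd_x * sd_r + mu_r     # 标准化后套用参考统计量
    return out.reshape(B, C, H, W)

x, ref = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
print(align_op(x, ref).shape)
```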

链接: https://arxiv.org/abs/2503.19262
作者: Ruiyi Wang,Yushuo Zheng,Zicheng Zhang,Chunyi Li,Shuaicheng Liu,Guangtao Zhai,Xiaohong Liu
机构: Shanghai Jiao Tong University (上海交通大学); University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025

点击查看摘要

Abstract:Existing real-world image dehazing methods primarily attempt to fine-tune pre-trained models or adapt their inference procedures, thus heavily relying on the pre-trained models and associated training data. Moreover, restoring heavily distorted information under dense haze requires generative diffusion models, whose potential in dehazing remains underutilized partly due to their lengthy sampling processes. To address these limitations, we introduce a novel hazing-dehazing pipeline consisting of a Realistic Hazy Image Generation framework (HazeGen) and a Diffusion-based Dehazing framework (DiffDehaze). Specifically, HazeGen harnesses robust generative diffusion priors of real-world hazy images embedded in a pre-trained text-to-image diffusion model. By employing specialized hybrid training and blended sampling strategies, HazeGen produces realistic and diverse hazy images as high-quality training data for DiffDehaze. To alleviate the inefficiency and fidelity concerns associated with diffusion-based methods, DiffDehaze adopts an Accelerated Fidelity-Preserving Sampling process (AccSamp). The core of AccSamp is the Tiled Statistical Alignment Operation (AlignOp), which can provide a clean and faithful dehazing estimate within a small fraction of sampling steps to reduce complexity and enable effective fidelity guidance. Extensive experiments demonstrate the superior dehazing performance and visual quality of our approach over existing methods. The code is available at this https URL.
zh

[CV-119] Adaptive Multi-Order Graph Regularized NMF with Dual Sparsity for Hyperspectral Unmixing

【速读】:该论文旨在解决高光谱解混(Hyperspectral Unmixing, HU)中的两个主要问题:一是现有基于非负矩阵分解(Nonnegative Matrix Factorization, NMF)与图学习的方法大多仅关注一阶或二阶最近邻关系,无法充分表征数据的内在结构;二是这些方法通常需要手动调整参数,缺乏自适应性。为解决上述问题,论文提出了一种新颖的自适应多阶图正则化非负矩阵分解方法(MOGNMF),其关键在于引入了多阶图正则化以全面挖掘全局与局部信息,并通过数据驱动的方式自适应学习相关参数,同时嵌入双稀疏性约束以增强鲁棒性。此外,还开发了一种交替最小化算法以有效求解模型。实验结果表明,该方法在模拟和真实高光谱数据上的解混性能更优。

链接: https://arxiv.org/abs/2503.19258
作者: Hui Chen,Liangyu Liu,Xianchao Xiu,Wanquan Liu
机构: School of Automation Engineering, Shanghai University of Electric Power (上海电力大学自动化工程学院); School of Mechatronic Engineering and Automation, Shanghai University (上海大学机电工程与自动化学院); School of Intelligent Systems Engineering, Sun Yat-sen University (中山大学智能系统工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Hyperspectral unmixing (HU) is a critical yet challenging task in remote sensing. However, existing nonnegative matrix factorization (NMF) methods with graph learning mostly focus on first-order or second-order nearest neighbor relationships and usually require manual parameter tuning, which fails to characterize intrinsic data structures. To address the above issues, we propose a novel adaptive multi-order graph regularized NMF method (MOGNMF) with three key features. First, multi-order graph regularization is introduced into the NMF framework to exploit global and local information comprehensively. Second, these parameters associated with the multi-order graph are learned adaptively through a data-driven approach. Third, dual sparsity is embedded to obtain better robustness, i.e., \ell_{1/2}-norm on the abundance matrix and \ell_{2,1}-norm on the noise matrix. To solve the proposed model, we develop an alternating minimization algorithm whose subproblems have explicit solutions, thus ensuring effectiveness. Experiments on simulated and real hyperspectral data indicate that the proposed method delivers better unmixing results.
zh

[CV-120] Beyond Object Categories: Multi-Attribute Reference Understanding for Visual Grounding

【速读】:该论文旨在解决现有指代表达理解(Referring Expression Comprehension, REC)方法受限于物体类别描述和单一属性意图描述的问题,这些问题限制了其在实际场景中的应用。在自然的人机交互中,用户通常通过个体状态、意图以及伴随的手势来表达需求,而非详细的物体描述。为应对这一挑战,论文提出了一种名为Multi-ref EC的新任务框架,该框架整合了状态描述、推导意图和具身手势以定位目标物体。解决方案的关键在于引入State-Intention-Gesture Attributes Reference (SIGAR) 数据集,该数据集结合了状态和意图表达与具身参考,并通过在SIGAR数据集上的大量实验验证了有序多属性参考对提升定位性能的重要性,表明单一属性参考不足以支持自然的人机交互场景。研究结果强调了多属性参考表达在推进视觉-语言理解方面的重要性。

链接: https://arxiv.org/abs/2503.19240
作者: Hao Guo,Jianfei Zhu,Wei Fan,Chunzhi Yi,Feng Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Referring expression comprehension (REC) aims at achieving object localization based on natural language descriptions. However, existing REC approaches are constrained by object category descriptions and single-attribute intention descriptions, hindering their application in real-world scenarios. In natural human-robot interactions, users often express their desires through individual states and intentions, accompanied by guiding gestures, rather than detailed object descriptions. To address this challenge, we propose Multi-ref EC, a novel task framework that integrates state descriptions, derived intentions, and embodied gestures to locate target objects. We introduce the State-Intention-Gesture Attributes Reference (SIGAR) dataset, which combines state and intention expressions with embodied references. Through extensive experiments with various baseline models on SIGAR, we demonstrate that properly ordered multi-attribute references contribute to improved localization performance, revealing that single-attribute reference is insufficient for natural human-robot interaction scenarios. Our findings underscore the importance of multi-attribute reference expressions in advancing visual-language understanding.
zh

[CV-121] HoGS: Unified Near and Far Object Reconstruction via Homogeneous Gaussian Splatting CVPR’25

【速读】:该论文旨在解决基于3D Gaussian splatting (3DGS) 的方法在处理远处物体时因依赖笛卡尔坐标系而导致性能受限的问题,尤其是在重建无界室外环境中的远处物体时。论文的关键创新在于将齐次坐标(homogeneous coordinates)引入3DGS框架,提出了一种名为齐次高斯点阵化 (Homogeneous Gaussian Splatting, HoGS) 的方法。通过采用射影几何(projective geometry)原理,HoGS 提供了一种统一的表示方式,能够有效提升远处物体的渲染精度,同时保持近处物体的高质量渲染和快速训练速度及实时渲染能力。
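
齐次坐标表示远近物体的直观含义,可用下面几行代码示意(非论文官方实现):同一方向上 w 越小,对应的笛卡尔位置越远,因此只需优化有界的 (x, y, z, w) 即可覆盖无界场景中的远处物体。

```python
# 概念性示意:用齐次坐标表示高斯中心,远处物体对应较小的 w 分量
import torch

def homogeneous_to_cartesian(p_h, eps=1e-8):
    """p_h: (N, 4) 齐次坐标 (x, y, z, w) -> (N, 3) 笛卡尔坐标。"""
    return p_h[:, :3] / (p_h[:, 3:4] + eps)

p_h = torch.tensor([[1.0, 2.0, 3.0, 1.0],    # 近处点 (1, 2, 3)
                    [1.0, 2.0, 3.0, 0.01]])  # 远处点 (100, 200, 300)
print(homogeneous_to_cartesian(p_h))
```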

链接: https://arxiv.org/abs/2503.19232
作者: Xinpeng Liu,Zeyi Huang,Fumio Okura,Yasuyuki Matsushita
机构: The University of Osaka (大阪大学); Microsoft Research Asia – Tokyo (微软亚洲研究院 – 东京)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’25)

点击查看摘要

Abstract:Novel view synthesis has demonstrated impressive progress recently, with 3D Gaussian splatting (3DGS) offering efficient training time and photorealistic real-time rendering. However, reliance on Cartesian coordinates limits 3DGS’s performance on distant objects, which is important for reconstructing unbounded outdoor environments. We found that, despite its ultimate simplicity, using homogeneous coordinates, a concept on the projective geometry, for the 3DGS pipeline remarkably improves the rendering accuracies of distant objects. We therefore propose Homogeneous Gaussian Splatting (HoGS) incorporating homogeneous coordinates into the 3DGS framework, providing a unified representation for enhancing near and distant objects. HoGS effectively manages both expansive spatial positions and scales particularly in outdoor unbounded environments by adopting projective geometry principles. Experiments show that HoGS significantly enhances accuracy in reconstructing distant objects while maintaining high-quality rendering of nearby objects, along with fast training speed and real-time rendering capability. Our implementations are available on our project page this https URL.
zh

[CV-122] Face Spoofing Detection using Deep Learning

【速读】:该论文旨在解决数字图像欺骗(Digital Image Spoofing)在生物特征认证系统中的安全威胁问题,特别是针对依赖面部识别的系统。论文通过评估三种基于视觉的模型(MobileNetV2、ResNet50 和 Vision Transformer, ViT)在图像分类中的反欺骗检测(Spoof Detection)性能,试图找到适用于提高图像识别系统安全性的最佳方案。论文使用了一个包含150,986张图像的数据集,并通过准确率(Accuracy)、精确率(Precision)、召回率(Recall)和F1分数(F1 Score)等指标来比较模型的效果。

解决方案的关键在于模型的选择与优化。研究发现,MobileNetV2 在测试数据集和验证数据集上的表现均优于其他模型,尤其在测试数据集上实现了更高的准确率(91.59%)、精确率(91.72%)、召回率(91.59%)和F1分数(91.58%),并且在验证数据集上也表现出色(97.17%准确率)。此外,MobileNetV2 展现出更快的收敛速度和更好的泛化能力,尽管两者都存在过拟合现象。这些特性使得 MobileNetV2 成为反欺骗检测应用中可靠且实用的选择,特别是在需要对新数据保持高可靠性的情境下。因此,论文强调了模型选择在安全敏感场景中的重要性,并建议使用 MobileNetV2 实现实际部署。

链接: https://arxiv.org/abs/2503.19223
作者: Najeebullah,Maaz Salman,Zar Nawab Khan Swati
机构: Department of Computer Science, Faculty of Natural Science and Engineering, KIU, Gilgit (计算机科学系,自然科学与工程学院,KIU,吉尔吉特); Pukyong National University (釜庆国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 26 pages, 9 figures,3 tables

点击查看摘要

Abstract:Digital image spoofing has emerged as a significant security threat in biometric authentication systems, particularly those relying on facial recognition. This study evaluates the performance of three vision-based models, MobileNetV2, ResNet50, and Vision Transformer (ViT), for spoof detection in image classification, utilizing a dataset of 150,986 images divided into training (140,002), testing (10,984), and validation (39,574) sets. Spoof detection is critical for enhancing the security of image recognition systems, and this research compares the models' effectiveness through accuracy, precision, recall, and F1 score metrics. Results reveal that MobileNetV2 outperforms other architectures on the test dataset, achieving an accuracy of 91.59%, precision of 91.72%, recall of 91.59%, and F1 score of 91.58%, compared to ViT's 86.54%, 88.28%, 86.54%, and 86.39%, respectively. On the validation dataset, MobileNetV2 and ViT excel, with MobileNetV2 slightly ahead at 97.17% accuracy versus ViT's 96.36%. MobileNetV2 demonstrates faster convergence during training and superior generalization to unseen data, despite both models showing signs of overfitting. These findings highlight MobileNetV2's balanced performance and robustness, making it the preferred choice for spoof detection applications where reliability on new data is essential. The study underscores the importance of model selection in security-sensitive contexts and suggests MobileNetV2 as a practical solution for real-world deployment.
zh

[CV-123] On Symmetries in Convolutional Weights ICLR2025

【速读】:该论文试图探索卷积神经网络(Convolutional Neural Networks, CNNs)中每一层均值 k×k 权重核的空间对称性,并研究这种对称性的成因及其在不同数据集和模型中的表现,同时分析特定网络架构选择对其的影响。论文的关键在于揭示这种对称性与期望特性(如平移一致性和翻转一致性)之间的关联,并提出这种对称性可能构成卷积神经网络的一种内在归纳偏置(inductive bias)。
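
下面的小脚本示意了如何计算某一卷积层的平均k×k核,并用它与自身180°旋转版本的相对误差来度量中心对称性(非论文官方实现;对称性度量方式为本文假设):

```python
# 概念性示意:计算卷积层的平均 3x3 核并度量其中心对称性
import torch
import torchvision.models as models

def mean_kernel_symmetry(weight):
    """weight: (out_c, in_c, k, k);返回平均核及其与 180° 旋转版本的相对误差。"""
    mean_k = weight.mean(dim=(0, 1))                   # 所有神经元核的均值
    rotated = torch.flip(mean_k, dims=(0, 1))          # 绕中心旋转 180°
    score = (mean_k - rotated).norm() / mean_k.norm()  # 越小越对称
    return mean_k, score.item()

# 对称性出现在训练好的网络中,故加载预训练权重(首次运行会下载权重)
model = models.resnet18(weights="IMAGENET1K_V1")
for name, m in model.named_modules():
    if isinstance(m, torch.nn.Conv2d) and m.kernel_size == (3, 3):
        _, s = mean_kernel_symmetry(m.weight.detach())
        print(f"{name}: asymmetry={s:.3f}")
        break  # 示意:仅查看第一个 3x3 卷积层
```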

链接: https://arxiv.org/abs/2503.19215
作者: Bilal Alsallakh,Timothy Wroge,Vivek Miglani,Narine Kokhlikyan
机构: Voxel AI Labs; Meta AI (Meta); FAIR (Facebook AI Research)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the ICLR 2025 Workshop on Weight Space Learning (WSL)

点击查看摘要

Abstract:We explore the symmetry of the mean k x k weight kernel in each layer of various convolutional neural networks. Unlike individual neurons, the mean kernels in internal layers tend to be symmetric about their centers instead of favoring specific directions. We investigate why this symmetry emerges in various datasets and models, and how it is impacted by certain architectural choices. We show how symmetry correlates with desirable properties such as shift and flip consistency, and might constitute an inherent inductive bias in convolutional neural networks.
zh

[CV-124] FRESA:Feedforward Reconstruction of Personalized Skinned Avatars from Few Images CVPR2025

【速读】:该论文旨在解决从少量图像快速重建个性化三维人体 avatar 并实现真实动画的问题。现有方法因需对不同主体进行数小时的优化而限制了其实用性。论文的关键在于学习了一个来自上千名穿着衣服的人体的通用先验,实现了即时前馈生成与零样本泛化。具体而言,通过联合推断个性化 avatar 形状、皮肤权重及姿态相关变形,而非使用共享皮肤权重,显著提升了整体几何保真度并减少了变形伪影。此外,设计了一种 3D 规范化过程以生成像素对齐的初始条件,解决了姿态变化归一化及规范形状与皮肤权重之间的耦合歧义,有助于重建精细的几何细节。最终,通过在大规模捕捉数据集上的端到端训练,模型能够生成比现有技术更真实的重建和动画,并可直接应用于随意拍摄的手机照片输入。

链接: https://arxiv.org/abs/2503.19207
作者: Rong Wang,Fabian Prada,Ziyan Wang,Zhongshi Jiang,Chengxiang Yin,Junxuan Li,Shunsuke Saito,Igor Santesteban,Javier Romero,Rohan Joshi,Hongdong Li,Jason Saragih,Yaser Sheikh
机构: Australian National University (澳大利亚国立大学); Meta Reality Labs Research (Meta现实实验室研究)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published in CVPR 2025

点击查看摘要

Abstract:We present a novel method for reconstructing personalized 3D human avatars with realistic animation from only a few images. Due to the large variations in body shapes, poses, and cloth types, existing methods mostly require hours of per-subject optimization during inference, which limits their practical applications. In contrast, we learn a universal prior from over a thousand clothed humans to achieve instant feedforward generation and zero-shot generalization. Specifically, instead of rigging the avatar with shared skinning weights, we jointly infer personalized avatar shape, skinning weights, and pose-dependent deformations, which effectively improves overall geometric fidelity and reduces deformation artifacts. Moreover, to normalize pose variations and resolve coupled ambiguity between canonical shapes and skinning weights, we design a 3D canonicalization process to produce pixel-aligned initial conditions, which helps to reconstruct fine-grained geometric details. We then propose a multi-frame feature aggregation to robustly reduce artifacts introduced in canonicalization and fuse a plausible avatar preserving person-specific identities. Finally, we train the model in an end-to-end framework on a large-scale capture dataset, which contains diverse human subjects paired with high-quality 3D scans. Extensive experiments show that our method generates more authentic reconstruction and animation than state-of-the-arts, and can be directly generalized to inputs from casually taken phone photos. Project page and code is available at this https URL.
zh

[CV-125] Benchmarking Object Detectors under Real-World Distribution Shifts in Satellite Imagery CVPR2025

【速读】:该论文试图解决现实世界中目标分布与源数据分布不一致导致的目标检测模型性能显著下降的问题。为应对这一挑战,论文聚焦于领域泛化(Domain Generalisation, DG)任务,旨在使模型在未见过的目标分布下仍能保持鲁棒性和泛化能力,而无需在训练阶段访问目标分布。论文的关键在于引入了一个新的基准数据集——Real-World Distribution Shifts (RWDS),它包含三个针对人道主义和气候变化应用设计的新型DG基准数据集,专注于跨气候带以及不同灾难和地理区域的领域偏移研究。这是首次专门为实际高影响力场景下的目标检测任务定制的领域泛化基准数据集,为评估未来目标检测模型的鲁棒性和泛化能力提供了重要资源。

链接: https://arxiv.org/abs/2503.19202
作者: Sara Al-Emadi,Yin Yang,Ferda Ofli
机构: Qatar Computing Research Institute, HBKU (卡塔尔计算研究所,哈马德本哈利法大学); College of Science and Engineering, HBKU (科学与工程学院,哈马德本哈利法大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2025

点击查看摘要

Abstract:Object detectors have achieved remarkable performance in many applications; however, these deep learning models are typically designed under the i.i.d. assumption, meaning they are trained and evaluated on data sampled from the same (source) distribution. In real-world deployment, however, target distributions often differ from source data, leading to substantial performance degradation. Domain Generalisation (DG) seeks to bridge this gap by enabling models to generalise to Out-Of-Distribution (OOD) data without access to target distributions during training, enhancing robustness to unseen conditions. In this work, we examine the generalisability and robustness of state-of-the-art object detectors under real-world distribution shifts, focusing particularly on spatial domain shifts. Despite the need, a standardised benchmark dataset specifically designed for assessing object detection under realistic DG scenarios is currently lacking. To address this, we introduce Real-World Distribution Shifts (RWDS), a suite of three novel DG benchmarking datasets that focus on humanitarian and climate change applications. These datasets enable the investigation of domain shifts across (i) climate zones and (ii) various disasters and geographic regions. To our knowledge, these are the first DG benchmarking datasets tailored for object detection in real-world, high-impact contexts. We aim for these datasets to serve as valuable resources for evaluating the robustness and generalisation of future object detection models. Our datasets and code are available at this https URL.
zh

[CV-126] Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces CVPR2025

【速读】:该论文旨在解决从定位的RGB-D图像预测真实世界室内环境的功能性3D场景图(functional 3D scene graphs)的问题。传统3D场景图侧重于物体的空间关系,而功能性3D场景图则捕捉物体、交互元素及其功能关系。由于缺乏标注数据,论文的关键解决方案是利用基础模型,包括视觉语言模型(Visual Language Models, VLMs)和大型语言模型(Large Language Models, LLMs),来编码功能性知识。实验表明,该方法在扩展的SceneFun3D数据集和新收集的FunGraph3D数据集上显著优于适配的基线模型(如Open3DSG和ConceptGraph),验证了其在建模复杂场景功能方面的有效性,并展示了其在3D问答和机器人操作等下游任务中的应用潜力。

链接: https://arxiv.org/abs/2503.19199
作者: Chenyangguang Zhang,Alexandros Delitzas,Fangjinhua Wang,Ruida Zhang,Xiangyang Ji,Marc Pollefeys,Francis Engelmann
机构: Tsinghua University (清华大学); ETH Zürich (瑞士联邦理工学院); MPI for Informatics (马克斯·普朗克计算机科学研究所); Microsoft (微软); Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2025

点击查看摘要

Abstract:We introduce the task of predicting functional 3D scene graphs for real-world indoor environments from posed RGB-D images. Unlike traditional 3D scene graphs that focus on spatial relationships of objects, functional 3D scene graphs capture objects, interactive elements, and their functional relationships. Due to the lack of training data, we leverage foundation models, including visual language models (VLMs) and large language models (LLMs), to encode functional knowledge. We evaluate our approach on an extended SceneFun3D dataset and a newly collected dataset, FunGraph3D, both annotated with functional 3D scene graphs. Our method significantly outperforms adapted baselines, including Open3DSG and ConceptGraph, demonstrating its effectiveness in modeling complex scene functionalities. We also demonstrate downstream applications such as 3D question answering and robotic manipulation using functional 3D scene graphs. See our project page at this https URL
zh

[CV-127] FDS: Frequency-Aware Denoising Score for Text-Guided Latent Diffusion Image Editing

【速读】:该论文致力于解决文本引导的图像编辑(Text-to-Image, T2I)模型在实际应用中经常引入局部细节丢失和色彩变化等非预期修改的问题。论文分析认为,这些问题源于模型在优化过程中对所有频率带宽的无差别处理,而实际上仅部分频率可能需要调整。为此,论文提出了一种简单但有效的方法,通过小波变换将图像分解为多个空间分辨率下的不同频率带宽,并在局部空间区域内实现特定频率带的选择性优化,从而实现精确编辑。方法的关键在于利用频率选择性和空间定位能力,以避免全局优化带来的副作用,同时支持多尺度细节调整。此外,论文还扩展了该方法至三维纹理编辑领域,通过对三平面表示进行频域分解,实现了三维纹理的频率感知调整。量化评估和用户研究验证了该方法在生成高质量且精准编辑结果方面的有效性。
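
"小波分解后仅修改选定频带的局部区域"这一流程可用PyWavelets粗略示意如下(非论文官方实现;小波基、分解层数、编辑区域与衰减系数均为假设):

```python
# 概念性示意:小波多频带分解 + 频带与空间区域双重选择性的编辑
import numpy as np
import pywt

img = np.random.rand(128, 128).astype(np.float32)

# 两级二维小波分解:coeffs = [LL2, (LH2, HL2, HH2), (LH1, HL1, HH1)]
coeffs = pywt.wavedec2(img, wavelet="haar", level=2)

# 仅在某个局部空间区域内衰减最细尺度的高频细节
LH1, HL1, HH1 = [c.copy() for c in coeffs[2]]
region = (slice(0, 16), slice(0, 16))  # 细尺度子带中的局部区域(假设)
for band in (LH1, HL1, HH1):
    band[region] *= 0.2                # 只改动该频带、该区域
coeffs[2] = (LH1, HL1, HH1)

edited = pywt.waverec2(coeffs, wavelet="haar")
print(edited.shape)  # 其余频带与区域保持不变,实现选择性编辑
```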

链接: https://arxiv.org/abs/2503.19191
作者: Yufan Ren,Zicong Jiang,Tong Zhang,Søren Forchhammer,Sabine Süsstrunk
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages (main paper)

点击查看摘要

Abstract:Text-guided image editing using Text-to-Image (T2I) models often fails to yield satisfactory results, frequently introducing unintended modifications, such as the loss of local detail and color changes. In this paper, we analyze these failure cases and attribute them to the indiscriminate optimization across all frequency bands, even though only specific frequencies may require adjustment. To address this, we introduce a simple yet effective approach that enables the selective optimization of specific frequency bands within localized spatial regions for precise edits. Our method leverages wavelets to decompose images into different spatial resolutions across multiple frequency bands, enabling precise modifications at various levels of detail. To extend the applicability of our approach, we provide a comparative analysis of different frequency-domain techniques. Additionally, we extend our method to 3D texture editing by performing frequency decomposition on the triplane representation, enabling frequency-aware adjustments for 3D textures. Quantitative evaluations and user studies demonstrate the effectiveness of our method in producing high-quality and precise edits.
zh

[CV-128] HOIGPT : Learning Long Sequence Hand-Object Interaction with Language Models

【速读】:该论文试图解决三维手部与物体交互(Hand-Object Interaction, HOI)的描述生成和序列生成问题,提供了一种从多样化条件信号(如文本、物体、部分序列等)生成高质量3D HOI序列的全面解决方案。论文的关键创新在于提出了HOIGPT方法,其核心通过大型语言模型实现HOI序列与自然语言描述之间的双向变换预测。此外,为了使大语言模型更好地理解HOI,论文引入了两个关键技术:一是基于物理约束的手部-物体分解VQ-VAE(hand-object decomposed VQ-VAE)的新颖HOI分词器,用于离散化HOI序列;二是经过运动感知训练的语言模型,能够同时处理和生成文本与HOI标记。这些创新显著提升了在文本生成和HOI生成任务上的性能表现。

链接: https://arxiv.org/abs/2503.19157
作者: Mingzhen Huang,Fu-Jen Chu,Bugra Tekin,Kevin J Liang,Haoyu Ma,Weiyao Wang,Xingyu Chen,Pierre Gleize,Hongfei Xue,Siwei Lyu,Kris Kitani,Matt Feiszli,Hao Tang
机构: FAIR, Meta (FAIR, Meta); State University of New York at Buffalo (纽约州立大学布法罗分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce HOIGPT, a token-based generative method that unifies 3D hand-object interactions (HOI) perception and generation, offering the first comprehensive solution for captioning and generating high-quality 3D HOI sequences from a diverse range of conditional signals (e.g., text, objects, partial sequences). At its core, HOIGPT utilizes a large language model to predict the bidirectional transformation between HOI sequences and natural language descriptions. Given text inputs, HOIGPT generates a sequence of hand and object meshes; given (partial) HOI sequences, HOIGPT generates text descriptions and completes the sequences. To facilitate HOI understanding with a large language model, this paper introduces two key innovations: (1) a novel physically grounded HOI tokenizer, the hand-object decomposed VQ-VAE, for discretizing HOI sequences, and (2) a motion-aware language model trained to process and generate both text and HOI tokens. Extensive experiments demonstrate that HOIGPT sets new state-of-the-art performance on both text generation (+2.01% R Precision) and HOI generation (-2.56 FID) across multiple tasks and benchmarks.
zh

[CV-129] Out-of-distribution evaluations of channel agnostic masked autoencoders in fluorescence microscopy

【速读】:该论文旨在解决因实验条件、干扰物及荧光标记变化引起的多种分布偏移(distribution-shift)对高内涵筛选计算机视觉模型泛化能力的影响问题。现有基于迁移学习的模型评估方法难以区分不同来源的分布偏移,限制了对模型设计和训练策略如何影响泛化性能的理解。为解决此问题,论文提出了一种通过JUMP-CP数据集隔离分布偏移来源的评估方案,以实现针对特定分布偏移源的泛化能力评估。此外,论文还引入了一种通道无关的掩码自编码器(channel-agnostic masked autoencoder, Campfire),其通过共享解码器处理多种荧光标记,有效扩展至包含大量不同荧光标记的数据集,并展示了其在跨实验批次、干扰物及荧光标记上的泛化能力以及从一种细胞类型到另一种细胞类型的成功迁移学习能力。关键在于提出了一种能够分离和评估分布偏移来源的评估框架以及设计有效的多通道通用自编码器架构。

链接: https://arxiv.org/abs/2503.19149
作者: Christian John Hurry,Jinjie Zhang,Olubukola Ishola,Emma Slade,Cuong Q. Nguyen
机构: GSK AIML (GSK 人工智能与机器学习)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 5 figures

点击查看摘要

Abstract:Developing computer vision for high-content screening is challenging due to various sources of distribution-shift caused by changes in experimental conditions, perturbagens, and fluorescent markers. The impact of different sources of distribution-shift is confounded in typical evaluations of models based on transfer learning, which limits interpretations of how changes to model design and training affect generalisation. We propose an evaluation scheme that isolates sources of distribution-shift using the JUMP-CP dataset, allowing researchers to evaluate generalisation with respect to specific sources of distribution-shift. We then present a channel-agnostic masked autoencoder, Campfire, which, via a shared decoder for all channels, scales effectively to datasets containing many different fluorescent markers, and show that it generalises to out-of-distribution experimental batches, perturbagens, and fluorescent markers, and also demonstrates successful transfer learning from one cell type to another.
zh

[CV-130] Risk-Based Thresholding for Reliable Anomaly Detection in Concentrated Solar Power Plants

【速读】:本文旨在解决集中式太阳能发电(CSP)厂高温太阳能接收器运行风险监测的问题,这些风险包括冻结、变形和腐蚀,可能导致昂贵的停机和维护。为应对这一挑战,论文提出了一种框架,用于生成更可靠的决策阈值,并对任意选定的风险函数提供有限样本覆盖保证。关键解决方案在于引入拒绝机制以处理高风险预测,将其转交给领域专家;同时开发了一种密度预测方法,通过计算观测图像在先前一系列图像序列下的可能性来评估异常分数。此外,通过对两个月内两个CSP厂的多种训练场景部署结果进行分析,提供了优化维护操作的宝贵见解。最后,鉴于数据集的保密性,论文还提供了扩展的模拟数据集,利用生成模型的最新进展创建多样化的热成像模拟。
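
"具有有限样本覆盖保证的阈值"可以借助分裂共形预测的分位数思路来粗略示意(非论文官方实现;论文针对任意选定的风险函数,此处仅演示覆盖率这一特例,弃权区间宽度 band 为假设):

```python
# 概念性示意:带有限样本保证的异常分数阈值 + 弃权机制
import numpy as np

def conformal_threshold(normal_scores, alpha=0.05):
    """返回阈值 t,使新正常样本分数超过 t 的概率 <= alpha(可交换性假设下成立)。"""
    n = len(normal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))   # 有限样本修正
    return np.sort(normal_scores)[min(k, n) - 1]

def decide(score, t, band=0.2):
    """分数落在阈值附近的不确定区间时弃权,交由领域专家复核。"""
    if score > t + band:
        return "anomaly"
    if score < t - band:
        return "normal"
    return "abstain"

rng = np.random.default_rng(0)
cal = rng.normal(0, 1, size=500)              # 验证集上正常样本的异常分数
t = conformal_threshold(cal, alpha=0.05)
print(t, decide(t + 0.05, t))
```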

链接: https://arxiv.org/abs/2503.19146
作者: Yorick Estievenart,Sukanya Patra,Souhaib Ben Taieb
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Efficient and reliable operation of Concentrated Solar Power (CSP) plants is essential for meeting the growing demand for sustainable energy. However, high-temperature solar receivers face severe operational risks, such as freezing, deformation, and corrosion, resulting in costly downtime and maintenance. To monitor CSP plants, cameras mounted on solar receivers record infrared images at irregular intervals ranging from one to five minutes throughout the day. Anomalous images can be detected by thresholding an anomaly score, where the threshold is chosen to optimize metrics such as the F1-score on a validation set. First, this work proposes a framework for generating more reliable decision thresholds with finite-sample coverage guarantees on any chosen risk function. Our framework also incorporates an abstention mechanism, allowing high-risk predictions to be deferred to domain experts. Second, we propose a density forecasting method to estimate the likelihood of an observed image given a sequence of previously observed images, using this likelihood as its anomaly score. Third, we analyze the deployment results of our framework across multiple training scenarios over several months for two CSP plants. This analysis provides valuable insights to our industry partner for optimizing maintenance operations. Finally, given the confidential nature of our dataset, we provide an extended simulated dataset, leveraging recent advancements in generative modeling to create diverse thermal images that simulate multiple CSP plants. Our code is publicly available.
zh

[CV-131] Compositional Caching for Training-free Open-vocabulary Attribute Detection CVPR2025

【速读】:该论文旨在解决现有属性检测方法因依赖人工标注而存在的细节描述不一致(如颜色与色阶的差异)以及属性集固定导致的可扩展性不足的问题。论文提出了一种名为“组合缓存”(Compositional Caching, ComCa) 的无训练方法,用于开放词汇量的属性检测。其关键在于利用网络规模数据库和大型语言模型生成辅助图像缓存,并为缓存中的图像赋予软属性标签以反映属性的组合特性。这些软标签在推理阶段通过输入图像与缓存图像之间的相似性进行聚合,从而优化底层视觉-语言模型 (Vision-Language Models, VLMs) 的预测结果。这种方法具有模型无关性,可兼容多种 VLM,并在公开数据集上的实验中显著优于零样本和基于缓存的基线方法。
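
推理阶段"按相似度聚合缓存软标签并修正VLM预测"的过程可示意如下(非论文官方实现;温度 beta 与融合系数 alpha 为假设):

```python
# 概念性示意:缓存软属性标签的相似度加权聚合与 VLM 预测融合
import torch
import torch.nn.functional as F

def cache_refined_logits(img_feat, cache_feats, cache_soft_labels,
                         vlm_logits, beta=5.0, alpha=0.5):
    """img_feat: (D,) 输入图像特征;cache_feats: (N, D) 缓存图像特征;
    cache_soft_labels: (N, A) 软属性标签;vlm_logits: (A,) VLM 原始属性得分。"""
    sim = F.cosine_similarity(img_feat[None], cache_feats, dim=-1)  # (N,)
    w = torch.softmax(beta * sim, dim=0)                            # 相似度加权
    cache_logits = w @ cache_soft_labels                            # (A,)
    return alpha * cache_logits + (1 - alpha) * vlm_logits          # 融合修正

D, N, A = 512, 100, 20
out = cache_refined_logits(torch.randn(D), torch.randn(N, D),
                           torch.rand(N, A), torch.randn(A))
print(out.shape)  # torch.Size([20])
```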

链接: https://arxiv.org/abs/2503.19145
作者: Marco Garosi,Alessandro Conti,Gaowen Liu,Elisa Ricci,Massimiliano Mancini
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025. Project website at this https URL

点击查看摘要

Abstract:Attribute detection is crucial for many computer vision tasks, as it enables systems to describe properties such as color, texture, and material. Current approaches often rely on labor-intensive annotation processes which are inherently limited: objects can be described at an arbitrary level of detail (e.g., color vs. color shades), leading to ambiguities when the annotators are not instructed carefully. Furthermore, they operate within a predefined set of attributes, reducing scalability and adaptability to unforeseen downstream applications. We present Compositional Caching (ComCa), a training-free method for open-vocabulary attribute detection that overcomes these constraints. ComCa requires only the list of target attributes and objects as input, using them to populate an auxiliary cache of images by leveraging web-scale databases and Large Language Models to determine attribute-object compatibility. To account for the compositional nature of attributes, cache images receive soft attribute labels. Those are aggregated at inference time based on the similarity between the input and cache images, refining the predictions of underlying Vision-Language Models (VLMs). Importantly, our approach is model-agnostic, compatible with various VLMs. Experiments on public datasets demonstrate that ComCa significantly outperforms zero-shot and cache-based baselines, competing with recent training-based methods, proving that a carefully designed training-free approach can successfully address open-vocabulary attribute detection.
zh

[CV-132] Stochastic Poisson Surface Reconstruction with One Solve using Geometric Gaussian Processes

【速读】:该论文旨在解决在仅部分表面信息可用或扫描分阶段进行的情况下,传统Poisson表面重建算法难以有效处理不确定性的问题。现有方法通过引入高斯过程模型来表征不确定性,但其两阶段流程(先进行高斯过程插值,再全局求解偏微分方程)计算成本高昂。论文的关键创新在于利用几何高斯过程的最新技术,将插值与表面重建整合为单阶段操作,每样本仅需一次线性求解即可完成。此外,所提方法支持局部空间查询重建表面采样点,无需依赖问题相关的体网格或网格化表示。这种能力不仅提升了重建质量,还实现了局部概率碰撞检测、无冗余点的光线投射以及基于切片的下一视图规划等功能,同时避免了中间计算中对核矩阵逆近似为对角矩阵的需求。

链接: https://arxiv.org/abs/2503.19136
作者: Sidhanth Holalkere,David S. Bindel,Silvia Sellán,Alexander Terenin
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Poisson Surface Reconstruction is a widely-used algorithm for reconstructing a surface from an oriented point cloud. To facilitate applications where only partial surface information is available, or scanning is performed sequentially, a recent line of work proposes to incorporate uncertainty into the reconstructed surface via Gaussian process models. The resulting algorithms first perform Gaussian process interpolation, then solve a set of volumetric partial differential equations globally in space, resulting in a computationally expensive two-stage procedure. In this work, we apply recently-developed techniques from geometric Gaussian processes to combine interpolation and surface reconstruction into a single stage, requiring only one linear solve per sample. The resulting reconstructed surface samples can be queried locally in space, without the use of problem-dependent volumetric meshes or grids. These capabilities enable one to (a) perform probabilistic collision detection locally around the region of interest, (b) perform ray casting without evaluating points not on the ray’s trajectory, and (c) perform next-view planning on a per-slice basis. They also improve reconstruction quality, by not requiring one to approximate kernel matrix inverses with diagonal matrices as part of intermediate computations. Results show that our approach provides a cleaner, more-principled, and more-flexible stochastic surface reconstruction pipeline.
zh

[CV-133] Your ViT is Secretly an Image Segmentation Model WWW CVPR2025

【速读】:该论文试图解决在图像分割任务中引入多尺度特征生成、融合及预测所需的复杂任务特定组件的问题。论文的关键发现是,通过足够大的模型和大规模预训练,Vision Transformer (ViT) 本身可以学习到这些任务特定组件所引入的归纳偏置。基于此,论文提出了 Encoder-only Mask Transformer (EoMT),它重新利用纯 ViT 架构来执行图像分割任务。EoMT 的关键在于其架构简洁性,避免了传统方法中的任务特定组件,从而显著提高了预测速度(例如,在 ViT-L 上高达 4 倍),同时在多种模型规模下实现了与现有先进方法相当的分割精度。这表明将计算资源用于扩展 ViT 本身比增加架构复杂度更具优势。

链接: https://arxiv.org/abs/2503.19108
作者: Tommie Kerssies,Niccolò Cavagnero,Alexander Hermans,Narges Norouzi,Giuseppe Averta,Bastian Leibe,Gijs Dubbelman,Daan de Geus
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025. Code: this https URL

点击查看摘要

Abstract:Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks. To apply single-scale ViTs to image segmentation, existing methods adopt a convolutional adapter to generate multi-scale features, a pixel decoder to fuse these features, and a Transformer decoder that uses the fused features to make predictions. In this paper, we show that the inductive biases introduced by these task-specific components can instead be learned by the ViT itself, given sufficiently large models and extensive pre-training. Based on these findings, we introduce the Encoder-only Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct image segmentation. With large-scale models and pre-training, EoMT obtains a segmentation accuracy similar to state-of-the-art models that use task-specific components. At the same time, EoMT is significantly faster than these methods due to its architectural simplicity, e.g., up to 4x faster with ViT-L. Across a range of model sizes, EoMT demonstrates an optimal balance between segmentation accuracy and prediction speed, suggesting that compute resources are better spent on scaling the ViT itself rather than adding architectural complexity. Code: this https URL.
zh

[CV-134] Anomaly Detection Using Computer Vision: A Comparative Analysis of Class Distinction and Performance Metrics

【速读】:该论文旨在解决计算机视觉领域中的异常检测问题,特别是在高安全性环境下的实时监控系统中,实现对授权人员(admin)、入侵者及非人类实体的有效区分与分类。论文的关键在于结合OpenCV的传统计算机视觉技术和基于TensorFlow的深度学习方法,利用MobileNetV2模型优化实时性能,并通过迁移学习、批量归一化及Adam优化算法确保稳定且高效的训练过程。此外,通过广泛的数据预处理(如图像增强与标准化)提高模型泛化能力,以及深入分析特征提取技术与训练策略对分类效果的影响,最终实现了90.20%的管理员识别精度、98.60%的入侵者检测精度和75.80%的非人类场景检测精度,同时保持平均30帧/秒的处理速率。这表明,先进的特征选择与数据增强技术显著提升了检测性能,尤其是对于人机场景的区分。
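
文中提到的MobileNetV2迁移学习、批量归一化与Adam优化的组合,大致对应如下的TensorFlow训练配置草图(非论文官方实现;网络头部结构与各超参数均为假设):

```python
# 概念性示意:MobileNetV2 迁移学习的三分类(admin / 入侵者 / 非人类)配置
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False                       # 冻结主干,仅训练分类头(迁移学习)

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.BatchNormalization(),    # 批量归一化,稳定训练
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(3, activation="softmax"),  # 三类输出
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```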

链接: https://arxiv.org/abs/2503.19100
作者: Md. Barkat Ullah Tusher,Shartaz Khan Akash,Amirul Islam Showmik
机构: AIUB (亚洲国际大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 4 figures

点击查看摘要

Abstract:This paper showcases an experimental study on anomaly detection using computer vision. The study focuses on class distinction and performance evaluation, combining OpenCV with deep learning techniques while employing a TensorFlow-based convolutional neural network for real-time face recognition and classification. The system effectively distinguishes among three classes: authorized personnel (admin), intruders, and non-human entities. A MobileNetV2-based deep learning model is utilized to optimize real-time performance, ensuring high computational efficiency without compromising accuracy. Extensive dataset preprocessing, including image augmentation and normalization, enhances the model's generalization capabilities. Our analysis demonstrates classification accuracies of 90.20% for admin, 98.60% for intruders, and 75.80% for non-human detection, while maintaining an average processing rate of 30 frames per second. The study leverages transfer learning, batch normalization, and Adam optimization to achieve stable and robust learning, and a comparative analysis of class differentiation strategies highlights the impact of feature extraction techniques and training methodologies. The results indicate that advanced feature selection and data augmentation significantly enhance detection performance, particularly in distinguishing human from non-human scenes. As an experimental study, this research provides critical insights into optimizing deep learning-based surveillance systems for high-security environments and improving the accuracy and efficiency of real-time anomaly detection.
zh

[CV-135] Uncertainty-Aware Decomposed Hybrid Networks

【速读】:本文旨在解决图像识别算法鲁棒性不足的问题,当前模型通常依赖大量标注数据。为应对这一挑战,论文提出了一种混合方法,结合神经网络的适应性与领域特定准不变算子的可解释性、透明性和鲁棒性。关键在于将识别任务分解为多个专注于不同特征的任务特定算子,并通过一种针对这些算子设计的新颖置信度测量来支持,该测量使网络能够优先利用可靠特征并处理噪声。此设计增强了透明性和鲁棒性,特别是在低数据场景下显著提升了性能。实验结果表明,该方法在交通标志检测中表现出色,尤其是在半监督和无监督场景下,凸显了其在数据受限应用中的潜力。

链接: https://arxiv.org/abs/2503.19096
作者: Sina Ditzel,Achref Jaziri,Iuliia Pliushch,Visvanathan Ramesh
机构: Goethe University (歌德大学); Flanders Make (弗拉芒Make)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The robustness of image recognition algorithms remains a critical challenge, as current models often depend on large quantities of labeled data. In this paper, we propose a hybrid approach that combines the adaptability of neural networks with the interpretability, transparency, and robustness of domain-specific quasi-invariant operators. Our method decomposes the recognition into multiple task-specific operators that focus on different characteristics, supported by a novel confidence measurement tailored to these operators. This measurement enables the network to prioritize reliable features and accounts for noise. We argue that our design enhances transparency and robustness, leading to improved performance, particularly in low-data regimes. Experimental results in traffic sign detection highlight the effectiveness of the proposed method, especially in semi-supervised and unsupervised scenarios, underscoring its potential for data-constrained applications.
zh

[CV-136] Clustering data by reordering them

【速读】:该论文试图解决通过分组分析元素时面临的复杂性和噪声干扰问题,特别是在数据驱动的研究领域中。解决方案的关键在于提出了一种基于元素间相似性进行自动分组的新算法。该算法首先根据元素间的距离重新排序数据,然后利用易于理解的参数实现自动分析,同时显式考虑了噪声的影响以应对多样化的问题场景。论文展示了该方法在生物分子构象分类、基因序列聚类、细胞分型、图像分析以及实验条件归类等领域的应用。
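
算法的两个步骤(按距离重排、在相邻距离跳变处切分)可用如下极简实现示意(非论文官方实现;贪心最近邻排序与基于中位数的切分阈值均为本文假设):

```python
# 概念性示意:先按距离重排数据,再在相邻距离"跳变"处切分成组,并显式处理噪声
import numpy as np

def reorder_by_distance(X):
    """贪心最近邻遍历,得到"相似元素相邻"的排列。"""
    n = len(X)
    order, visited = [0], {0}
    for _ in range(n - 1):
        d = np.linalg.norm(X - X[order[-1]], axis=1)
        d[list(visited)] = np.inf
        j = int(d.argmin())
        order.append(j); visited.add(j)
    return order

def cut_into_groups(X, order, factor=3.0):
    """相邻距离显著大于中位数(噪声尺度的粗略估计)处切分。"""
    gaps = np.linalg.norm(np.diff(X[order], axis=0), axis=1)
    thr = factor * np.median(gaps)
    groups, cur = [], [order[0]]
    for i, g in enumerate(gaps):
        if g > thr:
            groups.append(cur); cur = []
        cur.append(order[i + 1])
    groups.append(cur)
    return groups

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (20, 2))])
print([len(g) for g in cut_into_groups(X, reorder_by_distance(X))])  # 约 [20, 20]
```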

链接: https://arxiv.org/abs/2503.19067
作者: Axel Descamps,Sélène Forget,Aliénor Lahlou,Claire Lavergne,Camille Berthelot,Guillaume Stirnemann,Rodolphe Vuilleumier,Nicolas Chéron
机构: Chimie Physique et Chimie du Vivant (CPCV), Département de chimie, École Normale Supérieure, PSL University, Sorbonne Université, CNRS (化学与生命化学, 化学系, 巴黎高等师范学院, 巴黎科学与文学研究型大学, 国家科学研究中心), 75005 Paris, France; Sony Computer Science Laboratories (索尼计算机科学实验室), Paris 75005, France; Institut Pasteur, Université Paris Cité, CNRS UMR 3525, INSERM UA12 (巴斯德研究所, 巴黎城市大学, 国家科学研究中心联合研究单位 3525, 法国国家健康与医学研究院独立附属单位 12), Comparative Functional Genomics Group, Paris 75015, France; Université Paris Cité, BioSPC, F-75205 Paris, France
类目: Computer Vision and Pattern Recognition (cs.CV); Biomolecules (q-bio.BM)
备注: 60 pages, 21 figures

点击查看摘要

Abstract:Grouping elements into families to analyse them separately is a standard analysis procedure in many areas of sciences. We propose herein a new algorithm based on the simple idea that members from a family look like each other, and don’t resemble elements foreign to the family. After reordering the data according to the distance between elements, the analysis is automatically performed with easily-understandable parameters. Noise is explicitly taken into account to deal with the variety of problems of a data-driven world. We applied the algorithm to sort biomolecules conformations, gene sequences, cells, images, and experimental conditions.
zh

[CV-137] WikiAutoGen: Towards Multi-Modal Wikipedia-Style Article Generation

【速读】:该论文旨在解决传统知识发现与收集任务依赖大量人工投入且现有自动化方法主要局限于文本生成、忽视多模态内容重要性的问题。论文提出的关键解决方案是WikiAutoGen系统,其创新之处在于不仅检索和整合相关文本,还同时引入图像内容以增强生成文章的信息深度与视觉吸引力。此外,通过多视角自省机制(multi-perspective self-reflection mechanism),从不同角度评估检索到的内容,进一步提升生成内容的事实准确性、全面性和连贯性。同时,为了评估多模态知识生成的效果,论文还构建了WikiSeek基准数据集,包含具有文本及图像表示的主题配对维基百科文章。实验结果表明,WikiAutoGen在WikiSeek基准上的表现较之前方法提升了8%-29%,生成的文章更精准、连贯且视觉丰富。

链接: https://arxiv.org/abs/2503.19065
作者: Zhongyu Yang,Jun Chen,Dannong Xu,Junjie Fei,Xiaoqian Shen,Liangbing Zhao,Chun-Mei Feng,Mohamed Elhoseiny
机构: King Abdullah University of Science and Technology (国王阿卜杜拉科技大学); Lanzhou University (兰州大学); The University of Sydney (悉尼大学); IHPC, A*STAR
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project in this https URL

点击查看摘要

Abstract:Knowledge discovery and collection are intelligence-intensive tasks that traditionally require significant human effort to ensure high-quality outputs. Recent research has explored multi-agent frameworks for automating Wikipedia-style article generation by retrieving and synthesizing information from the internet. However, these methods primarily focus on text-only generation, overlooking the importance of multimodal content in enhancing informativeness and engagement. In this work, we introduce WikiAutoGen, a novel system for automated multimodal Wikipedia-style article generation. Unlike prior approaches, WikiAutoGen retrieves and integrates relevant images alongside text, enriching both the depth and visual appeal of generated content. To further improve factual accuracy and comprehensiveness, we propose a multi-perspective self-reflection mechanism, which critically assesses retrieved content from diverse viewpoints to enhance reliability, breadth, and coherence, etc. Additionally, we introduce WikiSeek, a benchmark comprising Wikipedia articles with topics paired with both textual and image-based representations, designed to evaluate multimodal knowledge generation on more challenging topics. Experimental results show that WikiAutoGen outperforms previous methods by 8%-29% on our WikiSeek benchmark, producing more accurate, coherent, and visually enriched Wikipedia-style articles. We show some of our generated examples in this https URL .
zh

[CV-138] Color Transfer with Modulated Flows AAAI2025

【速读】:该论文试图解决图像之间颜色迁移(color transfer)的问题,目标是调整目标图像的颜色分布以匹配参考图像的颜色分布。解决方案的关键在于提出了一种基于修正流(rectified flows)的新方法——调制流(Modulated Flows, ModFlows)。该方法利用流的双射性质引入一个通用的中间颜色分布,并构建了一个修正流的数据集。通过在该数据集上训练编码器来预测新图像的修正模型权重,ModFlows 实现了无需额外微调即可为新颜色分布对生成最优传输计划的能力。此外,训练后的编码器还提供了与图像颜色风格相关的嵌入表示。该方法能够在保持内容相似性的同时实现高质量的颜色迁移,并能够处理 4K 分辨率图像。

链接: https://arxiv.org/abs/2503.19062
作者: Maria Larchenko,Alexander Lobashev,Dmitry Guskov,Vladimir Vladimirovich Palyulin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI 2025

点击查看摘要

Abstract:In this work, we introduce Modulated Flows (ModFlows), a novel approach for color transfer between images based on rectified flows. The primary goal of the color transfer is to adjust the colors of a target image to match the color distribution of a reference image. Our technique is based on optimal transport and executes color transfer as an invertible transformation within the RGB color space. The ModFlows utilizes the bijective property of flows, enabling us to introduce a common intermediate color distribution and build a dataset of rectified flows. We train an encoder on this dataset to predict the weights of a rectified model for new images. After training on a set of optimal transport plans, our approach can generate plans for new pairs of distributions without additional fine-tuning. We additionally show that the trained encoder provides an image embedding, associated only with its color style. The presented method is capable of processing 4K images and achieves the state-of-the-art performance in terms of content and style similarity. Our source code is available at this https URL
zh

[CV-139] Color Conditional Generation with Sliced Wasserstein Guidance CVPR2025

【速读】:该论文旨在解决基于颜色条件生成图像时,现有方法常导致生成图像中颜色语义不一致的问题。传统方法通过先从文本提示生成图像,再应用颜色风格迁移来固定颜色,但容易产生与上下文无关的随机颜色。为克服此局限,论文提出了一种无需训练的方案——SW-Guidance,其关键在于修改扩散模型的采样过程,引入生成图像颜色分布与参考调色板之间的可微分Sliced 1-Wasserstein距离,从而在保持参考颜色相似性的同时确保生成图像的语义一致性。
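
其核心量,即生成图像颜色分布与参考调色板之间的可微Sliced 1-Wasserstein距离,可按如下方式计算(非论文官方实现;投影数 n_proj 与分位数插值方式为假设):

```python
# 概念性示意:两个颜色分布之间可微的 Sliced 1-Wasserstein 距离
import torch

def sliced_wasserstein_1(x, y, n_proj=64):
    """x: (N, 3)、y: (M, 3) 的 RGB 像素集合;随机方向投影后比较一维分位数。"""
    proj = torch.randn(3, n_proj, device=x.device)
    proj = proj / proj.norm(dim=0, keepdim=True)          # 单位投影方向
    xp = torch.sort(x @ proj, dim=0).values               # (N, n_proj)
    yp = torch.sort(y @ proj, dim=0).values               # (M, n_proj)
    if xp.shape[0] != yp.shape[0]:                        # 样本数不同时插值分位数
        q = torch.linspace(0, 1, xp.shape[0], device=x.device)
        idx = q * (yp.shape[0] - 1)
        lo, hi = idx.floor().long(), idx.ceil().long()
        frac = (idx - lo.float()).unsqueeze(1)
        yp = yp[lo] * (1 - frac) + yp[hi] * frac
    return (xp - yp).abs().mean()

x = torch.rand(1024, 3, requires_grad=True)  # 生成图像的像素颜色
y = torch.rand(2048, 3)                      # 参考调色板分布
d = sliced_wasserstein_1(x, y)
d.backward()                                  # 可微:梯度可回传到扩散采样过程
print(d.item())
```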

链接: https://arxiv.org/abs/2503.19034
作者: Alexander Lobashev,Maria Larchenko,Dmitry Guskov
机构: Skolkovo Institute of Science and Technology (斯科尔科沃科学技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025

点击查看摘要

Abstract:We propose SW-Guidance, a training-free approach for image generation conditioned on the color distribution of a reference image. While it is possible to generate an image with fixed colors by first creating an image from a text prompt and then applying a color style transfer method, this approach often results in semantically meaningless colors in the generated image. Our method solves this problem by modifying the sampling process of a diffusion model to incorporate the differentiable Sliced 1-Wasserstein distance between the color distribution of the generated image and the reference palette. Our method outperforms state-of-the-art techniques for color-conditional generation in terms of color similarity to the reference, producing images that not only match the reference colors but also maintain semantic coherence with the original text prompt. Our source code is available at this https URL.
zh

[CV-140] DiffV2IR: Visible-to-Infrared Diffusion Model via Vision-Language Understanding

【速读】:该论文旨在解决可见光到红外图像翻译(Visible-to-Infrared Image Translation, V2IR)中的三个主要挑战:实现语义感知的翻译、处理红外图像的多样波长谱以及应对红外数据集的稀缺性。当前主流方法通常将V2IR视为常规图像到图像合成任务,往往忽视了这些特定问题。为了解决这些问题,论文提出了DiffV2IR,这是一种包含两个关键要素的新框架:渐进学习模块(Progressive Learning Module, PLM)和视觉-语言理解模块(Vision-Language Understanding Module, VLUM)。其中,PLM采用自适应扩散模型架构,通过多阶段知识学习从全波段过渡到目标波长来实现红外转换;而VLUM则引入统一的视觉-语言理解以提升V2IR翻译效果。此外,研究团队还构建了一个大规模红外数据集IR-500K,包含50万张在不同环境条件下由多种场景和物体拍摄的红外图像。通过结合PLM、VLUM以及IR-500K数据集,DiffV2IR显著提升了V2IR任务的表现,并通过实验验证了其高质量翻译能力和广泛适用性。相关代码、数据集及DiffV2IR模型将在指定链接处提供。

链接: https://arxiv.org/abs/2503.19012
作者: Lingyan Ran,Lidong Wang,Guangcong Wang,Peng Wang,Yanning Zhang
机构: Northwestern Polytechnical University; Great Bay University (西工大); 大湾区大学
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:The task of translating visible-to-infrared images (V2IR) is inherently challenging due to three main obstacles: 1) achieving semantic-aware translation, 2) managing the diverse wavelength spectrum in infrared imagery, and 3) the scarcity of comprehensive infrared datasets. Current leading methods tend to treat V2IR as a conventional image-to-image synthesis challenge, often overlooking these specific issues. To address this, we introduce DiffV2IR, a novel framework for image translation comprising two key elements: a Progressive Learning Module (PLM) and a Vision-Language Understanding Module (VLUM). PLM features an adaptive diffusion model architecture that leverages multi-stage knowledge learning to infrared transition from full-range to target wavelength. To improve V2IR translation, VLUM incorporates unified Vision-Language Understanding. We also collected a large infrared dataset, IR-500K, which includes 500,000 infrared images compiled by various scenes and objects under various environmental conditions. Through the combination of PLM, VLUM, and the extensive IR-500K dataset, DiffV2IR markedly improves the performance of V2IR. Experiments validate DiffV2IR’s excellence in producing high-quality translations, establishing its efficacy and broad applicability. The code, dataset, and DiffV2IR model will be available at this https URL.
zh

[CV-141] RomanTex: Decoupling 3D-aware Rotary Positional Embedded Multi-Attention Network for Texture Synthesis

【速读】:该论文旨在解决现有方法在为三维几何体生成高质量纹理时面临的挑战,特别是因多视角图像不一致导致的接缝和重影伪影问题。同时,传统基于三维的纹理合成方法虽能缓解上述问题,但往往忽略了二维扩散模型的先验知识,限制了其在真实物体上的应用。论文的关键在于提出了一种名为RomanTex的多视角纹理生成框架,该框架结合了多注意力网络与底层三维表示,并通过创新的三维感知旋转位置嵌入实现更高效的特征提取。此外,通过在多注意力模块中引入解耦特性,增强了模型在图像到纹理任务中的鲁棒性,实现了语义正确的后视图合成。最后,引入了几何相关的分类器自由引导(CFG)机制,进一步提升了对几何体和图像的对齐效果。这些创新点共同构成了论文的核心解决方案。

链接: https://arxiv.org/abs/2503.19011
作者: Yifei Feng,Mingxin Yang,Shuhui Yang,Sheng Zhang,Jiaao Yu,Zibo Zhao,Yuhong Liu,Jie Jiang,Chunchao Guo
机构: Tencent Hunyuan (腾讯混元); ShanghaiTech University (上海科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures

点击查看摘要

Abstract:Painting textures for existing geometries is a critical yet labor-intensive process in 3D asset generation. Recent advancements in text-to-image (T2I) models have led to significant progress in texture generation. Most existing research approaches this task by first generating images in 2D spaces using image diffusion models, followed by a texture baking process to achieve UV texture. However, these methods often struggle to produce high-quality textures due to inconsistencies among the generated multi-view images, resulting in seams and ghosting artifacts. In contrast, 3D-based texture synthesis methods aim to address these inconsistencies, but they often neglect 2D diffusion model priors, making them challenging to apply to real-world objects. To overcome these limitations, we propose RomanTex, a multiview-based texture generation framework that integrates a multi-attention network with an underlying 3D representation, facilitated by our novel 3D-aware Rotary Positional Embedding. Additionally, we incorporate a decoupling characteristic in the multi-attention block to enhance the model’s robustness in image-to-texture task, enabling semantically-correct back-view synthesis. Furthermore, we introduce a geometry-related Classifier-Free Guidance (CFG) mechanism to further improve the alignment with both geometries and images. Quantitative and qualitative evaluations, along with comprehensive user studies, demonstrate that our method achieves state-of-the-art results in texture quality and consistency.

[CV-142] Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval CVPR2025

[Quick Read]: This paper tackles text-to-video retrieval (T2VR). The key to the solution, Video-ColBERT, lies in three mechanisms: fine-grained spatial and temporal token-wise interaction, query and visual expansions, and a dual sigmoid loss during training. Together these components enable efficient fine-grained similarity assessment between queries and videos and produce strong yet mutually compatible representations for encoding video content, improving performance on common text-to-video retrieval benchmarks.

Link: https://arxiv.org/abs/2503.19009
Authors: Arun Reddy,Alexander Martin,Eugene Yang,Andrew Yates,Kate Sanders,Kenton Murray,Reno Kriz,Celso M. de Melo,Benjamin Van Durme,Rama Chellappa
Affiliations: Johns Hopkins University (约翰斯·霍普金斯大学); Human Language Technology Center of Excellence (语言技术中心卓越中心); Johns Hopkins Applied Physics Laboratory (约翰斯·霍普金斯应用物理实验室); DEVCOM Army Research Laboratory (陆军研究实验室DEVCOM)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2025. 13 pages, 4 figures. Approved for public release: distribution unlimited

Click to view abstract

Abstract:In this work, we tackle the problem of text-to-video retrieval (T2VR). Inspired by the success of late interaction techniques in text-document, text-image, and text-video retrieval, our approach, Video-ColBERT, introduces a simple and efficient mechanism for fine-grained similarity assessment between queries and videos. Video-ColBERT is built upon 3 main components: a fine-grained spatial and temporal token-wise interaction, query and visual expansions, and a dual sigmoid loss during training. We find that this interaction and training paradigm leads to strong individual, yet compatible, representations for encoding video content. These representations lead to increases in performance on common text-to-video retrieval benchmarks compared to other bi-encoder methods.
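
The late-interaction scoring that Video-ColBERT builds on can be summarized in a few lines. The sketch below shows a generic ColBERT-style MaxSim score between query tokens and spatio-temporal video tokens; the tensor shapes, the normalization, and the omission of the paper's query/visual expansions are simplifying assumptions, not the authors' exact implementation.

```python
# A minimal sketch of ColBERT-style late interaction for text-to-video scoring.
import torch

def late_interaction_score(query_tokens: torch.Tensor,
                           video_tokens: torch.Tensor) -> torch.Tensor:
    """query_tokens: (Q, D) L2-normalized query token embeddings.
    video_tokens: (V, D) L2-normalized spatio-temporal video token embeddings.
    Returns a scalar relevance score: for each query token, take the max
    cosine similarity over all video tokens (MaxSim), then sum."""
    sim = query_tokens @ video_tokens.T          # (Q, V) cosine similarities
    return sim.max(dim=1).values.sum()           # MaxSim per query token, summed

# Toy usage with random embeddings
q = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
v = torch.nn.functional.normalize(torch.randn(64, 128), dim=-1)
print(late_interaction_score(q, v))
```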

[CV-143] DisentTalk: Cross-lingual Talking Face Generation via Semantic Disentangled Diffusion Model

[Quick Read]: This paper addresses two fundamental problems in existing talking face generation methods: 3DMM-based approaches maintain temporal consistency but lack fine-grained regional control, while Stable Diffusion-based approaches support spatial manipulation but suffer from temporal inconsistency. Integrating the two is further hindered by incompatible control mechanisms and the semantic entanglement of facial representations. The key of the proposed DisentTalk is a data-driven semantic disentanglement framework that decomposes 3DMM expression parameters into meaningful subspaces for finer-grained facial control. On top of this disentangled representation, a hierarchical latent diffusion architecture operates in 3DMM parameter space and incorporates region-aware attention to ensure both spatial precision and temporal coherence. To address the scarcity of high-quality Chinese training data, the authors also introduce CHDTF, a Chinese high-definition talking face dataset. Experiments show the method outperforms existing approaches on lip synchronization, expression quality, and temporal consistency.

Link: https://arxiv.org/abs/2503.19001
Authors: Kangwei Liu,Junwu Liu,Yun Cao,Jinlin Guo,Xiaowei Yi
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间学院); Laboratory for Big Data and Decision, School of System Engineering, National University of Defense Technology (国防科技大学系统工程学院大数据与决策实验室)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Recent advances in talking face generation have significantly improved facial animation synthesis. However, existing approaches face fundamental limitations: 3DMM-based methods maintain temporal consistency but lack fine-grained regional control, while Stable Diffusion-based methods enable spatial manipulation but suffer from temporal inconsistencies. The integration of these approaches is hindered by incompatible control mechanisms and semantic entanglement of facial representations. This paper presents DisentTalk, introducing a data-driven semantic disentanglement framework that decomposes 3DMM expression parameters into meaningful subspaces for fine-grained facial control. Building upon this disentangled representation, we develop a hierarchical latent diffusion architecture that operates in 3DMM parameter space, integrating region-aware attention mechanisms to ensure both spatial precision and temporal coherence. To address the scarcity of high-quality Chinese training data, we introduce CHDTF, a Chinese high-definition talking face dataset. Extensive experiments show superior performance over existing methods across multiple metrics, including lip synchronization, expression quality, and temporal consistency. Project Page: this https URL.

[CV-144] Improving Food Image Recognition with Noisy Vision Transformer

[Quick Read]: This paper targets the challenge that food image recognition is hampered by the high variability and complexity of food images. To improve food classification, the authors explore the potential of the Noisy Vision Transformer (NoisyViT), which injects noise into the learning process to reduce task complexity and adjust the entropy of the system, thereby improving model accuracy. The key idea is to exploit noise to strengthen the model's generalization and classification precision. The authors fine-tune NoisyViT on three benchmark datasets and show it surpasses state-of-the-art food recognition models, demonstrating its potential for dietary assessment, nutritional monitoring, and healthcare applications.

Link: https://arxiv.org/abs/2503.18997
Authors: Tonmoy Ghosh,Edward Sazonov
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Click to view abstract

Abstract:Food image recognition is a challenging task in computer vision due to the high variability and complexity of food images. In this study, we investigate the potential of Noisy Vision Transformers (NoisyViT) for improving food classification performance. By introducing noise into the learning process, NoisyViT reduces task complexity and adjusts the entropy of the system, leading to enhanced model accuracy. We fine-tune NoisyViT on three benchmark datasets: Food2K (2,000 categories, ~1M images), Food-101 (101 categories, ~100K images), and CNFOOD-241 (241 categories, ~190K images). The performance of NoisyViT is evaluated against state-of-the-art food recognition models. Our results demonstrate that NoisyViT achieves Top-1 accuracies of 95%, 99.5%, and 96.6% on Food2K, Food-101, and CNFOOD-241, respectively, significantly outperforming existing approaches. This study underscores the potential of NoisyViT for dietary assessment, nutritional monitoring, and healthcare applications, paving the way for future advancements in vision-based food computing. Code for reproducing NoisyViT for food recognition is available at NoisyViT_Food.
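
The core NoisyViT idea described above, injecting noise into the learning process, can be illustrated with a minimal PyTorch sketch. Where exactly the noise enters the network and its distribution are assumptions here; the snippet merely shows one plausible way to perturb ViT token features at training time only.

```python
# A hedged sketch of training-time noise injection into ViT token features,
# in the spirit of NoisyViT; the injection point and noise form are assumptions.
import torch
import torch.nn as nn

class NoisyLinearProjection(nn.Module):
    """Wraps a linear layer and adds Gaussian noise to its output when training."""
    def __init__(self, dim: int, noise_scale: float = 0.1):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.noise_scale = noise_scale

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (B, N, D)
        out = self.proj(tokens)
        if self.training:                         # noise only during fine-tuning
            out = out + self.noise_scale * torch.randn_like(out)
        return out
```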

[CV-145] SG-Tailor: Inter-Object Commonsense Relationship Reasoning for Scene Graph Manipulation

[Quick Read]: This paper addresses the challenging problem of plausibly manipulating scene graphs by adding nodes or modifying edges, where even a single edge modification can trigger conflicts because of the intricate interdependencies among nodes, making tasks such as node addition or reasoning about inter-node relationships intractable. The key is SG-Tailor, an autoregressive model that predicts conflict-free relationships between any two nodes. SG-Tailor not only infers inter-object interactions, including generating commonsense edges for newly added nodes, but also resolves conflicts caused by edge modifications through a Cut-And-Stitch strategy, producing coherent, manipulated scene graphs for downstream tasks. For node addition, the model queries the target node against the other nodes in the graph to predict appropriate relationships; for edge modification, the above strategy resolves conflicts and globally adjusts the graph. Experiments show SG-Tailor outperforms competing methods by a large margin and can be seamlessly integrated as a plug-in module for scene generation and robotic manipulation.

Link: https://arxiv.org/abs/2503.18988
Authors: Haoliang Shang,Hanyu Wu,Guangyao Zhai,Boyang Sun,Fangjinhua Wang,Federico Tombari,Marc Pollefeys
Affiliations: ETH Zurich (苏黎世联邦理工学院); TU Munich (慕尼黑工业大学); MCML; Google (谷歌); Microsoft; Inf. ETH Zurich (苏黎世联邦理工学院信息学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: The code will be available at this https URL

Click to view abstract

Abstract:Scene graphs capture complex relationships among objects, serving as strong priors for content generation and manipulation. Yet, reasonably manipulating scene graphs – whether by adding nodes or modifying edges – remains a challenging and untouched task. Tasks such as adding a node to the graph or reasoning about a node’s relationships with all others are computationally intractable, as even a single edge modification can trigger conflicts due to the intricate interdependencies within the graph. To address these challenges, we introduce SG-Tailor, an autoregressive model that predicts the conflict-free relationship between any two nodes. SG-Tailor not only infers inter-object relationships, including generating commonsense edges for newly added nodes but also resolves conflicts arising from edge modifications to produce coherent, manipulated graphs for downstream tasks. For node addition, the model queries the target node and other nodes from the graph to predict the appropriate relationships. For edge modification, SG-Tailor employs a Cut-And-Stitch strategy to solve the conflicts and globally adjust the graph. Extensive experiments demonstrate that SG-Tailor outperforms competing methods by a large margin and can be seamlessly integrated as a plug-in module for scene generation and robotic manipulation tasks.

[CV-146] LoRA Subtraction for Drift-Resistant Space in Exemplar-Free Continual Learning

[Quick Read]: This paper addresses catastrophic forgetting caused by feature drift in continual learning (CL), especially in the exemplar-free continual learning (EFCL) setting where samples from previous tasks cannot be retained, making prior knowledge hard to preserve. The key idea is the Drift-Resistant Space (DRS), which handles feature drift without explicit feature modeling or stored information from old tasks. To build the DRS, the authors propose a parameter-efficient fine-tuning method, Low-Rank Adaptation Subtraction (LoRA-), which subtracts the LoRA weights of old tasks from the initial pre-trained weights before processing new task data. This improves stability and efficiency while simplifying implementation, and stabilizing feature drift further allows better plasticity through a triplet loss. Experiments show state-of-the-art results across multiple datasets, especially on long task sequences.

Link: https://arxiv.org/abs/2503.18985
Authors: Xuan Liu,Xiaobin Chang
Affiliations: School of Artificial Intelligence, Sun Yat-sen University (中山大学); Key Laboratory of Intelligent Assessment Technology for Sustainable Tourism, Ministry of Culture and Tourism, Sun Yat-sen University (文化和旅游部智能旅游技术重点实验室,中山大学); Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China (教育部机器智能与先进计算重点实验室)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:In continual learning (CL), catastrophic forgetting often arises due to feature drift. This challenge is particularly prominent in the exemplar-free continual learning (EFCL) setting, where samples from previous tasks cannot be retained, making it difficult to preserve prior knowledge. To address this issue, some EFCL methods aim to identify feature spaces that minimize the impact on previous tasks while accommodating new ones. However, they rely on static features or outdated statistics stored from old tasks, which prevents them from capturing the dynamic evolution of the feature space in CL, leading to performance degradation over time. In this paper, we introduce the Drift-Resistant Space (DRS), which effectively handles feature drifts without requiring explicit feature modeling or the storage of previous tasks. A novel parameter-efficient fine-tuning approach called Low-Rank Adaptation Subtraction (LoRA-) is proposed to develop the DRS. This method subtracts the LoRA weights of old tasks from the initial pre-trained weight before processing new task data to establish the DRS for model training. Therefore, LoRA- enhances stability, improves efficiency, and simplifies implementation. Furthermore, stabilizing feature drifts allows for better plasticity by learning with a triplet loss. Our method consistently achieves state-of-the-art results, especially for long task sequences, across multiple datasets.
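
The LoRA subtraction at the heart of the DRS is straightforward to express in code. The sketch below assumes each old task left behind a pair of low-rank factors (A, B) and removes their updates from the frozen pre-trained weight before new-task training; the variable names and scaling factor are illustrative, not the paper's exact formulation.

```python
# A minimal sketch of the LoRA-subtraction (LoRA-) idea behind the DRS.
import torch

def drift_resistant_weight(w_pretrained: torch.Tensor,
                           old_loras: list[tuple[torch.Tensor, torch.Tensor]],
                           alpha: float = 1.0) -> torch.Tensor:
    """w_pretrained: (out, in) frozen pre-trained weight.
    old_loras: list of (A, B) low-rank factors from previous tasks,
               where each LoRA update is B @ A with A: (r, in), B: (out, r).
    Returns the weight defining the drift-resistant space for the new task."""
    w = w_pretrained.clone()
    for A, B in old_loras:
        w -= alpha * (B @ A)      # subtract each old task's low-rank update
    return w
```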

[CV-147] A Real-Time Human Action Recognition Model for Assisted Living

[Quick Read]: This paper aims to predict health risks for elderly and vulnerable populations in assisted living environments through real-time video monitoring with computer vision, focusing on accurate and efficient real-time recognition of Falling, Staggering, and Chest Pain. The solution combines a deep learning model with a live video prediction and alert system, and uses transfer learning to train four state-of-the-art action recognition models (UniFormerV2, TimeSformer, I3D, and SlowFast) on the NTU RGB+D 60 dataset. TimeSformer is ultimately selected as the core of the real-time action recognition model for its leading weighted macro F1 score (95.33%), recall (95.49%), and precision (95.19%) together with significantly higher inference throughput, striking the best balance between performance and efficiency.

Link: https://arxiv.org/abs/2503.18957
Authors: Yixuan Wang,Paul Stynes,Pramod Pathak,Cristina Muntean
Affiliations: National College of Ireland (爱尔兰国立理工学院); Technological University (理工大学); National College of Ireland (爱尔兰国立理工学院); National College of Ireland (爱尔兰国立理工学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages

Click to view abstract

Abstract:Ensuring the safety and well-being of elderly and vulnerable populations in assisted living environments is a critical concern. Computer vision presents an innovative and powerful approach to predicting health risks through video monitoring, employing human action recognition (HAR) technology. However, real-time prediction of human actions with high performance and efficiency is a challenge. This research proposes a real-time human action recognition model that combines a deep learning model and a live video prediction and alert system, in order to predict falls, staggering and chest pain for residents in assisted living. Six thousand RGB video samples from the NTU RGB+D 60 dataset were selected to create a dataset with four classes: Falling, Staggering, Chest Pain, and Normal, with the Normal class comprising 40 daily activities. Transfer learning technique was applied to train four state-of-the-art HAR models on a GPU server, namely, UniFormerV2, TimeSformer, I3D, and SlowFast. Results of the four models are presented in this paper based on class-wise and macro performance metrics, inference efficiency, model complexity and computational costs. TimeSformer is proposed for developing the real-time human action recognition model, leveraging its leading macro F1 score (95.33%), recall (95.49%), and precision (95.19%) along with significantly higher inference throughput compared to the others. This research provides insights to enhance safety and health of the elderly and people with chronic illnesses in assisted living environments, fostering sustainable care, smarter communities and industry innovation.
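
A live prediction and alert loop of the kind described above can be sketched as a sliding window over incoming frames. The model handle, clip length, preprocessing, and alert threshold below are placeholders rather than the paper's actual pipeline.

```python
# A hedged sketch of the live prediction loop: buffer frames into fixed-length
# clips, classify each clip, and raise an alert on high-confidence risk classes.
from collections import deque
import torch

CLASSES = ["Falling", "Staggering", "Chest Pain", "Normal"]
RISK = {"Falling", "Staggering", "Chest Pain"}

def monitor(frame_stream, model, clip_len=8, threshold=0.8):
    buffer = deque(maxlen=clip_len)
    for frame in frame_stream:                   # frame: (3, H, W) tensor
        buffer.append(frame)
        if len(buffer) == clip_len:
            clip = torch.stack(list(buffer)).unsqueeze(0)  # (1, T, 3, H, W)
            probs = model(clip).softmax(dim=-1).squeeze(0)
            conf = float(probs.max())
            label = CLASSES[int(probs.argmax())]
            if label in RISK and conf >= threshold:
                print(f"ALERT: {label} ({conf:.2f})")      # alert system hook
```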

[CV-148] Is there a future for AI without representation?

[Quick Read]: This paper investigates the prospects of AI without representation in general, focusing on Rodney Brooks' proposal. The crux of Brooks' scheme is the rejection of central control in intelligent agents; his systems contain as much or as little representation as traditional AI. The traditional view that representation is necessary for intelligence implicitly presupposes that intelligence requires central control. Recent cognitive science, however, suggests we should abandon the image of intelligent agents as central representation processors. If this paradigm shift is achieved, Brooks' proposal for non-centralized cognition appears promising for building fully intelligent agents, though not for conscious agents and thus not for human-like AI.

Link: https://arxiv.org/abs/2503.18955
Authors: Vincent C. Müller
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:This paper investigates the prospects of AI without representation in general, and the proposals of Rodney Brooks in particular. What turns out to be characteristic of Brooks’ proposal is the rejection of central control in intelligent agents; his systems have as much or as little representation as traditional AI. The traditional view that representation is necessary for intelligence presupposes that intelligence requires central control. However, much of recent cognitive science suggests that we should dispose of the image of intelligent agents as central representation processors. If this paradigm shift is achieved, Brooks’ proposal for non-centralized cognition without representation appears promising for full-blown intelligent agents - though not for conscious agents and thus not for human-like AI.

[CV-149] Unpaired Translation of Chest X-ray Images for Lung Opacity Diagnosis via Adaptive Activation Masks and Cross-Domain Alignment

[Quick Read]: This paper addresses the problem that lung opacities in chest X-ray radiographs (CXRs) obscure anatomical structures, blur lung borders, and complicate lesion localization, which severely degrades segmentation precision and diagnostic accuracy. The key is an unpaired CXR translation framework that uses adaptive activation masks to selectively modify opacity regions, converting CXRs with lung opacities into opacity-free counterparts while preserving semantic features. A cross-domain alignment mechanism further ensures that the translated images agree with the feature maps and predicted labels of a pre-trained lesion classifier, improving the interpretability of the translation process. Experiments on the RSNA, MIMIC-CXR-JPG, and JSRT datasets show the method outperforms existing approaches and markedly improves lung border segmentation and lesion classification accuracy.

Link: https://arxiv.org/abs/2503.19860
Authors: Junzhi Ning,Dominic Marshall,Yijian Gao,Xiaodan Xing,Yang Nan,Yingying Fang,Sheng Zhang,Matthieu Komorowski,Guang Yang
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Chest X-ray radiographs (CXRs) play a pivotal role in diagnosing and monitoring cardiopulmonary diseases. However, lung opacities in CXRs frequently obscure anatomical structures, impeding clear identification of lung borders and complicating the localization of pathology. This challenge significantly hampers segmentation accuracy and precise lesion identification, which are crucial for diagnosis. To tackle these issues, our study proposes an unpaired CXR translation framework that converts CXRs with lung opacities into counterparts without lung opacities while preserving semantic features. Central to our approach is the use of adaptive activation masks to selectively modify opacity regions in lung CXRs. Cross-domain alignment ensures translated CXRs without opacity issues align with feature maps and prediction labels from a pre-trained CXR lesion classifier, facilitating the interpretability of the translation process. We validate our method using RSNA, MIMIC-CXR-JPG and JSRT datasets, demonstrating superior translation quality through lower Frechet Inception Distance (FID) and Kernel Inception Distance (KID) scores compared to existing methods (FID: 67.18 vs. 210.4, KID: 0.01604 vs. 0.225). Evaluation on RSNA opacity, MIMIC acute respiratory distress syndrome (ARDS) patient CXRs and JSRT CXRs show our method enhances segmentation accuracy of lung borders and improves lesion classification, further underscoring its potential in clinical settings (RSNA: mIoU: 76.58% vs. 62.58%, Sensitivity: 85.58% vs. 77.03%; MIMIC ARDS: mIoU: 86.20% vs. 72.07%, Sensitivity: 92.68% vs. 86.85%; JSRT: mIoU: 91.08% vs. 85.6%, Sensitivity: 97.62% vs. 95.04%). Our approach advances CXR imaging analysis, especially in investigating segmentation impacts through image translation techniques.

[CV-150] GyralNet Subnetwork Partitioning via Differentiable Spectral Modularity Optimization

[Quick Read]: This paper addresses three main challenges faced by existing methods for analyzing three-hinge gyri (3HGs): the sub-voxel scale of 3HGs at typical neuroimaging resolutions, the computational complexity of establishing cross-subject correspondences, and the oversimplification of treating 3HGs as independent nodes while ignoring their community-level relationships. To overcome these limitations, the paper proposes a fully differentiable subnetwork partitioning framework that uses a spectral modularity maximization strategy to modularize the organization of 3HGs within GyralNet. The key is to incorporate topological structural similarity and DTI-derived connectivity patterns as attribute features, yielding a biologically meaningful representation of cortical organization. Experiments show the method effectively partitions GyralNet at the individual level while preserving the community-level consistency of 3HGs across subjects, providing a solid foundation for understanding brain connectivity.

Link: https://arxiv.org/abs/2503.19823
Authors: Yan Zhuang,Minheng Chen,Chao Cao,Tong Chen,Jing Zhang,Xiaowei Yu,Yanjun Lyu,Lu Zhang,Tianming Liu,Dajiang Zhu
Affiliations: Unknown
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures

Click to view abstract

Abstract:Understanding the structural and functional organization of the human brain requires a detailed examination of cortical folding patterns, among which the three-hinge gyrus (3HG) has been identified as a key structural landmark. GyralNet, a network representation of cortical folding, models 3HGs as nodes and gyral crests as edges, highlighting their role as critical hubs in cortico-cortical connectivity. However, existing methods for analyzing 3HGs face significant challenges, including the sub-voxel scale of 3HGs at typical neuroimaging resolutions, the computational complexity of establishing cross-subject correspondences, and the oversimplification of treating 3HGs as independent nodes without considering their community-level relationships. To address these limitations, we propose a fully differentiable subnetwork partitioning framework that employs a spectral modularity maximization optimization strategy to modularize the organization of 3HGs within GyralNet. By incorporating topological structural similarity and DTI-derived connectivity patterns as attribute features, our approach provides a biologically meaningful representation of cortical organization. Extensive experiments on the Human Connectome Project (HCP) dataset demonstrate that our method effectively partitions GyralNet at the individual level while preserving the community-level consistency of 3HGs across subjects, offering a robust foundation for understanding brain connectivity.
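
Spectral modularity maximization becomes differentiable once hard cluster labels are relaxed into soft assignments. The sketch below optimizes a generic negative-modularity loss, -Q with Q = Tr(S^T B S) / 2m, over a toy graph; it illustrates the principle only and omits the paper's attribute features and GyralNet-specific construction.

```python
# A minimal sketch of differentiable spectral modularity maximization.
import torch

def modularity_loss(logits: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
    """logits: (N, K) unnormalized cluster assignments; adj: (N, N) adjacency."""
    S = logits.softmax(dim=-1)                  # soft assignment matrix
    deg = adj.sum(dim=1)                        # node degrees
    two_m = adj.sum()                           # total edge weight * 2
    B = adj - torch.outer(deg, deg) / two_m     # modularity matrix
    return -torch.trace(S.T @ B @ S) / two_m    # minimize negative modularity Q

# Toy usage: optimize soft assignments for a random symmetric graph
adj = (torch.rand(20, 20) > 0.7).float()
adj = ((adj + adj.T) > 0).float()
adj.fill_diagonal_(0)
logits = torch.randn(20, 4, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.05)
for _ in range(100):
    opt.zero_grad()
    loss = modularity_loss(logits, adj)
    loss.backward()
    opt.step()
```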

[CV-151] GRN+: A Simplified Generative Reinforcement Network for Tissue Layer Analysis in 3D Ultrasound Images for Chronic Low-back Pain

【速读】:该论文旨在解决在3D超声图像中手动区分多种软组织以进行定量分析所面临的劳动密集型问题。为了解决这一挑战,论文提出了一种名为GRN+的新颖多模型框架,用于在标注数据有限的情况下实现层分割的自动化。GRN+的关键创新在于结合了基于ResNet的生成器和U-Net分割模型,并通过“分割引导增强(Segmentation-guided Enhancement, SGE)”方法,利用分割模型指导生成器产生新的图像及其对应的掩码,同时调整生成器权重以最小化分割损失梯度。此外,为了确保训练过程的稳定性并避免梯度爆炸,采用了两阶段反向传播策略:第一阶段同时传播分割损失至生成器和分割模型,第二阶段则专注于优化分割模型本身,从而利用生成的图像进一步精炼掩码预测。这些技术使得GRN+能够在仅使用5%标注数据的情况下超越其他半监督方法,并在完全标注的数据集上实现更高的Dice系数,同时降低计算成本。因此,GRN+不仅提供了准确的组织分割能力,还显著减少了计算开销和对大量标注数据的依赖性。

链接: https://arxiv.org/abs/2503.19736
作者: Zixue Zeng,Xiaoyan Zhao,Matthew Cartier,Xin Meng,Jiantao Pu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D ultrasound delivers high-resolution, real-time images of soft tissues, which is essential for pain research. However, manually distinguishing various tissues for quantitative analysis is labor-intensive. To streamline this process, we developed and validated GRN+, a novel multi-model framework that automates layer segmentation with minimal annotated data. GRN+ combines a ResNet-based generator and a U-Net segmentation model. Through a method called Segmentation-guided Enhancement (SGE), the generator produces new images and matching masks under the guidance of the segmentation model, with its weights adjusted according to the segmentation loss gradient. To prevent gradient explosion and secure stable training, a two-stage backpropagation strategy was implemented: the first stage propagates the segmentation loss through both the generator and segmentation model, while the second stage concentrates on optimizing the segmentation model alone, thereby refining mask prediction using the generated images. Tested on 69 fully annotated 3D ultrasound scans from 29 subjects with six manually labeled tissue layers, GRN+ outperformed all other semi-supervised methods in terms of the Dice coefficient using only 5% labeled data, despite not using unlabeled data for unsupervised training. Additionally, when applied to fully annotated datasets, GRN+ with SGE achieved a 2.16% higher Dice coefficient while incurring lower computational costs compared to other models. Overall, GRN+ provides accurate tissue segmentation while reducing both computational expenses and the dependency on extensive annotations, making it an effective tool for 3D ultrasound analysis in cLBP patients.
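
The two-stage backpropagation described above can be captured in a short training step. Everything below (the module interfaces, optimizers, and the form of the generator's output) is an assumption made for illustration; it is not the authors' released code.

```python
# A hedged sketch of SGE's two-stage backpropagation: stage 1 updates the
# generator and segmenter jointly; stage 2 refines the segmenter alone on the
# generated image-mask pair.
import torch

def train_step(generator, segmenter, seg_loss_fn, x, y, opt_g, opt_s):
    # Stage 1: propagate the segmentation loss through both networks
    gen_img, gen_mask = generator(x, y)
    loss1 = seg_loss_fn(segmenter(gen_img), gen_mask)
    opt_g.zero_grad(); opt_s.zero_grad()
    loss1.backward()
    opt_g.step(); opt_s.step()

    # Stage 2: optimize only the segmentation model on the generated pair
    loss2 = seg_loss_fn(segmenter(gen_img.detach()), gen_mask.detach())
    opt_s.zero_grad()
    loss2.backward()
    opt_s.step()
    return loss1.item(), loss2.item()
```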

[CV-152] InterSliceBoost: Identifying Tissue Layers in Three-dimensional Ultrasound Images for Chronic Lower Back Pain (cLBP) Assessment

[Quick Read]: This paper tackles the challenge that tissue segmentation in 3D ultrasound for chronic low back pain (cLBP) research is time-consuming and error-prone because of the sheer number of image slices to annotate. The key of the proposed InterSliceBoost is to train a segmentation model on a partially annotated dataset without sacrificing performance by generating image-mask pairs (IMPs) between adjacent slices. It consists of a residual-block-based inter-slice generator and a segmentation model: the generator extracts features from adjacent IMPs and uses differential features to generate new inter-slice pairs, and the segmentation model is trained on the partially annotated data together with the generated pairs. Experiments show that InterSliceBoost, trained on only 33% of slices, outperforms a conventional model trained on fully annotated images, performing particularly well on layer-wise tissue segmentation under partial annotation.

Link: https://arxiv.org/abs/2503.19735
Authors: Zixue Zeng,Matthew Cartier,Xiaoyan Zhao,Pengyu Chen,Xin Meng,Zhiyu Sheng,Maryam Satarpour,John M Cormack,Allison C. Bean,Ryan P. Nussbaum,Maya Maurer,Emily Landis-Walkenhorst,Kang Kim,Ajay D. Wasan,Jiantao Pu
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Available studies on chronic lower back pain (cLBP) typically focus on one or a few specific tissues rather than conducting a comprehensive layer-by-layer analysis. Since three-dimensional (3-D) images often contain hundreds of slices, manual annotation of these anatomical structures is both time-consuming and error-prone. We aim to develop and validate a novel approach called InterSliceBoost to enable the training of a segmentation model on a partially annotated dataset without compromising segmentation performance. The architecture of InterSliceBoost includes two components: an inter-slice generator and a segmentation model. The generator utilizes residual block-based encoders to extract features from adjacent image-mask pairs (IMPs). Differential features are calculated and input into a decoder to generate inter-slice IMPs. The segmentation model is trained on partially annotated datasets (e.g., skipping 1, 2, 3, or 7 images) and the generated inter-slice IMPs. To validate the performance of InterSliceBoost, we utilized a dataset of 76 B-mode ultrasound scans acquired on 29 subjects enrolled in an ongoing cLBP study. InterSliceBoost, trained on only 33% of the image slices, achieved a mean Dice coefficient of 80.84% across all six layers on the independent test set, with Dice coefficients of 73.48%, 61.11%, 81.87%, 95.74%, 83.52% and 88.74% for segmenting dermis, superficial fat, superficial fascial membrane, deep fat, deep fascial membrane, and muscle. This performance is significantly higher than the conventional model trained on fully annotated images (p < 0.05). InterSliceBoost can effectively segment the six tissue layers depicted on 3-D B-mode ultrasound images in settings with partial annotations.

[CV-153] Single Shot AI-assisted quantification of KI-67 proliferation index in breast cancer

[Quick Read]: This paper addresses the strong subjectivity, inter-observer variability, and limited reproducibility of quantitative Ki-67 proliferation marker assessment in breast cancer tissue sections, where conventional visual estimation and manual counting fall short of the needs of molecular subtyping and personalized treatment planning. The key is an AI-assisted approach based on the YOLOv8 object detection framework that automates the detection and scoring of Ki-67-positive cells. The model is trained on expert-annotated high-resolution images and its architecture is tuned for performance, yielding an efficient, scalable, and objective Ki-67 scoring method.

Link: https://arxiv.org/abs/2503.19606
Authors: Deepti Madurai Muthu,Priyanka S,Lalitha Rani N,P. G. Kubendran Amos
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM); Tissues and Organs (q-bio.TO)
Comments:

Click to view abstract

Abstract:Reliable quantification of Ki-67, a key proliferation marker in breast cancer, is essential for molecular subtyping and informed treatment planning. Conventional approaches, including visual estimation and manual counting, suffer from interobserver variability and limited reproducibility. This study introduces an AI-assisted method using the YOLOv8 object detection framework for automated Ki-67 scoring. High-resolution digital images (40x magnification) of immunohistochemically stained tumor sections were captured from Ki-67 hotspot regions and manually annotated by a domain expert to distinguish Ki-67-positive and negative tumor cells. The dataset was augmented and divided into training (80%), validation (10%), and testing (10%) subsets. Among the YOLOv8 variants tested, the Medium model achieved the highest performance, with a mean Average Precision at 50% Intersection over Union (mAP50) exceeding 85% for Ki-67-positive cells. The proposed approach offers an efficient, scalable, and objective alternative to conventional scoring methods, supporting greater consistency in Ki-67 evaluation. Future directions include developing user-friendly clinical interfaces and expanding to multi-institutional datasets to enhance generalizability and facilitate broader adoption in diagnostic practice.
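
With the ultralytics package, a YOLOv8-based Ki-67 workflow of the kind described can be sketched in a few lines. The dataset YAML, the class index convention (0 = Ki-67-positive, 1 = negative), and the tile path are assumptions; only the general YOLO API usage is taken as given.

```python
# A sketch of training YOLOv8 for Ki-67 cell detection and deriving the
# proliferation index; paths and class ids below are placeholder assumptions.
from ultralytics import YOLO

model = YOLO("yolov8m.pt")                       # medium variant, as in the study
model.train(data="ki67.yaml", epochs=100, imgsz=640)

results = model("hotspot_tile.png")[0]           # run detection on one hotspot tile
classes = results.boxes.cls.tolist()             # assume 0 = positive, 1 = negative
pos = classes.count(0.0)
neg = classes.count(1.0)
ki67_index = 100.0 * pos / max(pos + neg, 1)     # percent Ki-67-positive tumor cells
print(f"Ki-67 index: {ki67_index:.1f}%")
```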

[CV-154] GIViC: Generative Implicit Video Compression

[Quick Read]: This paper aims to close the gap that video compression methods based on implicit neural representations (INRs) cannot reach state-of-the-art (SOTA) performance under the same coding configuration as conventional or autoencoder-based codecs. The key of the proposed generative implicit video compression framework GIViC is a new implicit diffusion process that performs diffusive sampling over a coarse-to-fine spatiotemporal decomposition, progressing gradually from coarse-grained full-sequence diffusion to fine-grained per-token diffusion. A novel Hierarchical Gated Linear Attention transformer (HGLA) is also integrated, dual-factorizing global dependency modeling along the scale and sequential axes. Under the Random Access (RA) configuration, GIViC yields BD-rate savings of 15.94%, 22.46%, and 8.52% over VVC VTM, DCVC-FM, and NVRC respectively, and is reportedly the first INR-based video codec to outperform VVC VTM under the RA configuration.

Link: https://arxiv.org/abs/2503.19604
Authors: Ge Gao,Siyue Teng,Tianhao Peng,Fan Zhang,David Bull
Affiliations: Visual Information Lab, University of Bristol (视觉信息实验室,布里斯托大学)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:While video compression based on implicit neural representations (INRs) has recently demonstrated great potential, existing INR-based video codecs still cannot achieve state-of-the-art (SOTA) performance compared to their conventional or autoencoder-based counterparts given the same coding configuration. In this context, we propose a Generative Implicit Video Compression framework, GIViC, aiming at advancing the performance limits of this type of coding methods. GIViC is inspired by the characteristics that INRs share with large language and diffusion models in exploiting long-term dependencies. Through the newly designed implicit diffusion process, GIViC performs diffusive sampling across coarse-to-fine spatiotemporal decompositions, gradually progressing from coarser-grained full-sequence diffusion to finer-grained per-token diffusion. A novel Hierarchical Gated Linear Attention-based transformer (HGLA), is also integrated into the framework, which dual-factorizes global dependency modeling along scale and sequential axes. The proposed GIViC model has been benchmarked against SOTA conventional and neural codecs using a Random Access (RA) configuration (YUV 4:2:0, GOPSize=32), and yields BD-rate savings of 15.94%, 22.46% and 8.52% over VVC VTM, DCVC-FM and NVRC, respectively. As far as we are aware, GIViC is the first INR-based video codec that outperforms VTM based on the RA coding configuration. The source code will be made available.

[CV-155] Prompt-Guided Dual-Path UNet with Mamba for Medical Image Segmentation

[Quick Read]: This paper addresses the limitations of existing CNN- and Transformer-based U-Net architectures for medical image segmentation: CNNs struggle to model long-range dependencies, while Transformers incur quadratic computational complexity. Although Mamba, a state space model, can model long-range interactions with linear complexity, existing Mamba-based methods still have two problems: their network designs lack perception of the original input data, and they focus on global information while neglecting local detail. The key of the proposed prompt-guided CNN-Mamba dual-path U-Net (PGM-UNet) is a prompt-guided residual Mamba module that adaptively extracts dynamic visual prompts from the raw input to guide Mamba in capturing global information, combined with a local-global information fusion network consisting of a local information extraction module, the prompt-guided residual Mamba module, and a multi-focus attention fusion module. Inspired by Kolmogorov-Arnold Networks (KANs), a multi-scale information extraction module additionally captures richer contextual information without changing resolution. Experiments show PGM-UNet significantly outperforms state-of-the-art methods on multiple medical image segmentation tasks.

Link: https://arxiv.org/abs/2503.19589
Authors: Shaolei Zhang,Jinyan Liu,Tianyi Qian,Xuesong Li
Affiliations: School of Computer Science & Technology, Beijing Institute of Technology (北京理工大学); Qiyuan Laboratory (启元实验室)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Convolutional neural networks (CNNs) and transformers are widely employed in constructing UNet architectures for medical image segmentation tasks. However, CNNs struggle to model long-range dependencies, while transformers suffer from quadratic computational complexity. Recently, Mamba, a type of State Space Models, has gained attention for its exceptional ability to model long-range interactions while maintaining linear computational complexity. Despite the emergence of several Mamba-based methods, they still present the following limitations: first, their network designs generally lack perceptual capabilities for the original input data; second, they primarily focus on capturing global information, while often neglecting local details. To address these challenges, we propose a prompt-guided CNN-Mamba dual-path UNet, termed PGM-UNet, for medical image segmentation. Specifically, we introduce a prompt-guided residual Mamba module that adaptively extracts dynamic visual prompts from the original input data, effectively guiding Mamba in capturing global information. Additionally, we design a local-global information fusion network, comprising a local information extraction module, a prompt-guided residual Mamba module, and a multi-focus attention fusion module, which effectively integrates local and global information. Furthermore, inspired by Kolmogorov-Arnold Networks (KANs), we develop a multi-scale information extraction module to capture richer contextual information without altering the resolution. We conduct extensive experiments on the ISIC-2017, ISIC-2018, DIAS, and DRIVE datasets. The results demonstrate that the proposed method significantly outperforms state-of-the-art approaches in multiple medical image segmentation tasks.

[CV-156] Single-Step Latent Consistency Model for Remote Sensing Image Super-Resolution

[Quick Read]: This paper addresses the slow inference of diffusion models (DMs) in remote sensing image super-resolution (RSISR), where iterative sampling limits real-time use. The key of the proposed latent consistency model for super-resolution (LCMSR) is a two-stage design: the first stage pretrains a residual autoencoder that encodes the differential information between high-resolution (HR) and low-resolution (LR) images, moving the diffusion process into a latent space to cut computational cost; the second stage performs consistency diffusion learning on the distribution of residual encodings in the latent space, conditioned on LR images. The consistency constraint enforces that predictions at any two timesteps along the reverse diffusion trajectory agree, enabling a direct mapping from noise to data. As a result, LCMSR reduces the 50-1000+ iterative steps of traditional diffusion models to a single step, greatly improving efficiency while maintaining high-quality output.

Link: https://arxiv.org/abs/2503.19505
Authors: Xiaohui Sun,Jiangwei Mo,Hanlin Wu,Jie Ma
Affiliations: School of Information Science and Technology (信息科学与技术学院), Beijing Foreign Studies University (北京外国语大学)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Recent advancements in diffusion models (DMs) have greatly advanced remote sensing image super-resolution (RSISR). However, their iterative sampling processes often result in slow inference speeds, limiting their application in real-time tasks. To address this challenge, we propose the latent consistency model for super-resolution (LCMSR), a novel single-step diffusion approach designed to enhance both efficiency and visual quality in RSISR tasks. Our proposal is structured into two distinct stages. In the first stage, we pretrain a residual autoencoder to encode the differential information between high-resolution (HR) and low-resolution (LR) images, transitioning the diffusion process into a latent space to reduce computational costs. The second stage focuses on consistency diffusion learning, which aims to learn the distribution of residual encodings in the latent space, conditioned on LR images. The consistency constraint enforces that predictions at any two timesteps along the reverse diffusion trajectory remain consistent, enabling direct mapping from noise to data. As a result, the proposed LCMSR reduces the iterative steps of traditional diffusion models from 50-1000 or more to just a single step, significantly improving efficiency. Experimental results demonstrate that LCMSR effectively balances efficiency and performance, achieving inference times comparable to non-diffusion models while maintaining high-quality output.
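
The consistency constraint, agreement between predictions at any two timesteps of the reverse trajectory, is typically trained with a student/teacher pair. The sketch below shows one generic form of such a loss on latent residual encodings; the noising schedule, the parameterization, and the omission of conditioning on the LR image are simplifying assumptions.

```python
# A hedged sketch of a consistency training objective: predictions at two
# timesteps on the same noised trajectory are pulled together so that noise
# eventually maps to data in a single step.
import torch
import torch.nn.functional as F

def consistency_loss(student, teacher, z0, t, t_next, noise):
    """z0: (B, D) clean latent residual encodings; t > t_next are timesteps."""
    zt = z0 + t.view(-1, 1) * noise              # noised latent at time t
    zt_next = z0 + t_next.view(-1, 1) * noise    # noised latent at earlier time
    pred_t = student(zt, t)                      # student's estimate of z0 from t
    with torch.no_grad():
        pred_next = teacher(zt_next, t_next)     # target from the earlier step
    return F.mse_loss(pred_t, pred_next)         # enforce agreement along trajectory
```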

[CV-157] TFIC: End-to-End Text-Focused Image Compression for Coding for Machines

[Quick Read]: This paper addresses the failure of traditional image compression to preserve task-critical information for machine-oriented tasks such as optical character recognition (OCR). It presents an image compression system designed to retain text-specific features for subsequent OCR. The key is an efficient encoding process that takes only half the time of the OCR module, making it especially suitable for devices with limited computational capacity; where on-device OCR is prohibitive, images are compressed and the text recovered later. By significantly improving text extraction accuracy at low bitrates, even surpassing OCR on uncompressed images, the method can serve as a local pre-processing step that enhances downstream OCR.

Link: https://arxiv.org/abs/2503.19495
Authors: Stefano Della Fiore,Alessandro Gnutti,Marco Dalai,Pierangelo Migliorati,Riccardo Leonardi
Affiliations: University of Brescia (意大利布雷西亚大学)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Traditional image compression methods aim to faithfully reconstruct images for human perception. In contrast, Coding for Machines focuses on compressing images to preserve information relevant to a specific machine task. In this paper, we present an image compression system designed to retain text-specific features for subsequent Optical Character Recognition (OCR). Our encoding process requires half the time needed by the OCR module, making it especially suitable for devices with limited computational capacity. In scenarios where on-device OCR is computationally prohibitive, images are compressed and later processed to recover the text content. Experimental results demonstrate that our method achieves significant improvements in text extraction accuracy at low bitrates, even improving over the accuracy of OCR performed on uncompressed images, thus acting as a local pre-processing step.

[CV-158] ASP-VMUNet: Atrous Shifted Parallel Vision Mamba U-Net for Skin Lesion Segmentation

[Quick Read]: This paper addresses skin lesion segmentation, a key challenge in computer vision that requires precisely separating pathological features from healthy skin for diagnosis. Traditional CNNs are limited by narrow receptive fields, while Transformer models carry heavy computational burdens. The key of the proposed Atrous Shifted Parallel Vision Mamba UNet (ASP-VMUNet) is to build on the efficient and scalable Mamba architecture: an atrous scan technique reduces background interference and expands the receptive field, strengthening Mamba's scanning, while a Parallel Vision Mamba (PVM) layer and a shift round operation optimize feature segmentation and enrich inter-segment information exchange. A supplementary CNN branch with a Selective-Kernel (SK) block further refines segmentation by blending local and global context. Tested on four benchmark datasets (ISIC16/17/18 and PH2) and validated by ablation studies, ASP-VMUNet shows superior skin lesion segmentation performance.

Link: https://arxiv.org/abs/2503.19427
Authors: Muyi Bao,Shuchang Lyu,Zhaoyang Xu,Qi Zhao,Changyu Zeng,Wenpei Bai,Guangliang Cheng
Affiliations: School of Advanced Technology, Xi’an Jiaotong-Liverpool University (西安交通大学利物浦大学先进技术学院); School of Electronics and Information Engineering, Beihang University (北京航空航天大学电子与信息工程学院); Department of Paediatrics, Cambridge University (剑桥大学儿科学系); Department of Gynecology and Obstetrics, Beijing Shijitan Hospital, Capital Medical University (首都医科大学北京世纪坛医院妇产科); Department of Computer Science, University of Liverpool (利物浦大学计算机科学系)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Skin lesion segmentation is a critical challenge in computer vision, as it requires accurately separating pathological features from healthy skin for reliable diagnosis. Traditional Convolutional Neural Networks (CNNs) are limited by narrow receptive fields, and Transformers face significant computational burdens. This paper presents a novel skin lesion segmentation framework, the Atrous Shifted Parallel Vision Mamba UNet (ASP-VMUNet), which integrates the efficient and scalable Mamba architecture to overcome limitations in traditional CNNs and computationally demanding Transformers. The framework introduces an atrous scan technique that minimizes background interference and expands the receptive field, enhancing Mamba’s scanning capabilities. Additionally, the inclusion of a Parallel Vision Mamba (PVM) layer and a shift round operation optimizes feature segmentation and fosters rich inter-segment information exchange. A supplementary CNN branch with a Selective-Kernel (SK) Block further refines the segmentation by blending local and global contextual information. Tested on four benchmark datasets (ISIC16/17/18 and PH2), ASP-VMUNet demonstrates superior performance in skin lesion segmentation, validated by comprehensive ablation studies. This approach not only advances medical image segmentation but also highlights the benefits of hybrid architectures in medical imaging technology. Our code is available at this https URL.

[CV-159] Wavelet-based Global-Local Interaction Network with Cross-Attention for Multi-View Diabetic Retinopathy Detection ICME

[Quick Read]: This paper addresses the difficulty of learning lesions with highly variable sizes and scattered locations in multi-view diabetic retinopathy (DR) detection, together with the failure of existing methods to account for the correlation and redundancy of lesion information when fusing multiple views. The key is a two-branch network that captures both local lesion features and their global dependencies, using the high-frequency component of the wavelet transform to extract lesion edge information, which is then enhanced by global semantics to ease the learning of difficult lesions. A cross-view fusion module further improves multi-view fusion and reduces redundancy. Experiments on large public datasets demonstrate the effectiveness of the method.

Link: https://arxiv.org/abs/2503.19329
Authors: Yongting Hu,Yuxin Lin,Chengliang Liu,Xiaoling Luo,Xiaoyan Dou,Qihao Xu,Yong Xu
Affiliations: School of Computer Science and Technology, Harbin Institute of Technology (哈尔滨工业大学), Shenzhen, China; Shenzhen Key Laboratory of Visual Object Detection and Recognition (深圳视觉目标检测与识别重点实验室), Shenzhen, China; College of Computer Science and Software Engineering, Shenzhen University (深圳大学), Shenzhen, China; Ophthalmology Department, Shenzhen Second People’s Hospital (深圳市第二人民医院), Shenzhen, China
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE International Conference on Multimedia and Expo (ICME) 2025

Click to view abstract

Abstract:Multi-view diabetic retinopathy (DR) detection has recently emerged as a promising method to address the issue of incomplete lesions faced by single-view DR. However, it is still challenging due to the variable sizes and scattered locations of lesions. Furthermore, existing multi-view DR methods typically merge multiple views without considering the correlations and redundancies of lesion information across them. Therefore, we propose a novel method to overcome the challenges of difficult lesion information learning and inadequate multi-view fusion. Specifically, we introduce a two-branch network to obtain both local lesion features and their global dependencies. The high-frequency component of the wavelet transform is used to exploit lesion edge information, which is then enhanced by global semantic to facilitate difficult lesion learning. Additionally, we present a cross-view fusion module to improve multi-view fusion and reduce redundancy. Experimental results on large public datasets demonstrate the effectiveness of our method. The code is open sourced on this https URL.
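
The high-frequency wavelet component used for lesion edges can be obtained directly from a 2D discrete wavelet transform. The snippet below, assuming the PyWavelets package and a Haar basis, combines the three detail sub-bands into a single edge-strength map; the paper's actual wavelet choice and how the map is fused into the network are not specified here.

```python
# A sketch of extracting lesion edge cues from the high-frequency sub-bands
# of a 2D discrete wavelet transform (PyWavelets assumed).
import numpy as np
import pywt

def high_frequency_edges(gray: np.ndarray, wavelet: str = "haar") -> np.ndarray:
    """gray: (H, W) fundus image channel. Returns an edge-strength map built
    from the horizontal, vertical, and diagonal detail coefficients."""
    _, (cH, cV, cD) = pywt.dwt2(gray, wavelet)   # discard the low-pass band cA
    return np.sqrt(cH**2 + cV**2 + cD**2)        # half-resolution edge map
```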

[CV-160] Adaptive Wavelet Filters as Practical Texture Feature Amplifiers for Parkinsons Disease Screening in OCT

【速读】:该论文旨在解决帕金森病(Parkinson’s Disease, PD)早期筛查中基于光学相干断层成像(Optical Coherence Tomography, OCT)图像的自动化诊断问题。目前的研究尚未充分利用视网膜层纹理特征作为生物标志物来提升深度神经网络(Deep Neural Networks, DNNs)的诊断性能。为此,论文提出了一种新颖的自适应小波滤波器(Adaptive Wavelet Filter, AWF),作为实用的纹理特征增强器,通过频域学习技术充分挖掘纹理特征的优势,以提升DNN在PD筛查中的表现。AWF的关键创新在于首先通过通道混合器增强纹理特征表示的多样性,然后借助精心设计的小波滤波令牌混合器突出信息丰富的纹理特征。结合AWF与DNN主干网络构建了AWFNet模型,并进一步引入平衡置信度(Balanced Confidence, BC)损失函数,通过挖掘样本级预测概率及类别频率先验信息,进一步提高PD筛查的性能和可信度。

链接: https://arxiv.org/abs/2503.19292
作者: Xiaoqing Zhang,Hanfeng Shi,Xiangyu Li,Haili Ye,Tao Xu,Na Li,Yan Hu,Fan Lv,Jiangfan Chen,Jiang Liu
机构: Center for High Performance Computing and Shenzhen Key Laboratory of Intelligent Bioinformatics, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院高性能计算中心和智能生物信息学深圳市重点实验室), Shenzhen, China; Research Institute of Trustworthy Autonomous Systems and Department of Computer Science and Engineering, Southern University of Science and Technology (南方科技大学可信自主系统研究院和计算机科学与工程系), Shenzhen 518055, China; State Key Laboratory of Ophthalmology, Optometry and Vision Science, Wenzhou Medical University (温州医科大学眼视光国家重点实验室), Wenzhou, China; Centre for Computational Science and Mathematical Modelling, Coventry University (考文垂大学计算科学与数学建模中心), Coventry, UK; School of Computer Science, University of Nottingham Ningbo China (宁波诺丁汉大学计算机科学学院), Ningbo 315100, China; School of Ophthalmology and Optometry, Wenzhou Medical University (温州医科大学眼视光学院), Wenzhou 325035, China; Department of Electronic and Information Engineering, Changchun University (长春大学电子与信息工程学院), Changchun 130022, China; IEEE Publication Technology Group (IEEE出版技术组), Piscataway, NJ (新泽西州皮斯卡塔韦)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Parkinson’s disease (PD) is a prevalent neurodegenerative disorder globally. The eye’s retina is an extension of the brain and has great potential in PD screening. Recent studies have suggested that texture features extracted from retinal layers can be adopted as biomarkers for PD diagnosis under optical coherence tomography (OCT) images. Frequency domain learning techniques can enhance the feature representations of deep neural networks (DNNs) by decomposing frequency components involving rich texture features. Additionally, previous works have not exploited texture features for automated PD screening in OCT. Motivated by the above analysis, we propose a novel Adaptive Wavelet Filter (AWF) that serves as the Practical Texture Feature Amplifier to fully leverage the merits of texture features to boost the PD screening performance of DNNs with the aid of frequency domain learning. Specifically, AWF first enhances texture feature representation diversities via channel mixer, then emphasizes informative texture feature representations with the well-designed adaptive wavelet filtering token mixer. By combining the AWFs with the DNN stem, AWFNet is constructed for automated PD screening. Additionally, we introduce a novel Balanced Confidence (BC) Loss by mining the potential of sample-wise predicted probabilities of all classes and class frequency prior, to further boost the PD screening performance and trustworthiness of AWFNet. The extensive experiments manifest the superiority of our AWFNet and BC over state-of-the-art methods in terms of PD screening performance and trustworthiness.

[CV-161] L2FMamba: Lightweight Light Field Image Super-Resolution with State Space Model

[Quick Read]: This paper addresses the excessive computational complexity of Transformer-based methods for light field image super-resolution. The key innovation is the LF-VSSM module, which efficiently captures long-range spatial-angular dependencies in light field images via progressive feature extraction: it successively extracts spatial features within sub-aperture images, spatial-angular features between sub-aperture images, and spatial-angular features between light field pixels. On this basis, a lightweight network, L²FMamba, integrates LF-VSSM to exploit light field properties for super-resolution while sidestepping the computational burden of Transformer-based approaches. Experiments show the method reduces parameters and complexity while achieving superior super-resolution performance with faster inference.

Link: https://arxiv.org/abs/2503.19253
Authors: Zeqiang Wei,Kai Jin,Zeyi Hou,Kuan Song,Xiuzhuang Zhou
Affiliations: School of Artificial Intelligence, Beijing University of Posts and Telecommunications (北京邮电大学人工智能学院); Bigo Technology Pte. Ltd.; Explorer Global (Suzhou) Artificial Intelligence Technology Co., Ltd. (探索者全球(苏州)人工智能科技有限公司)
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the IEEE for possible publication

Click to view abstract

Abstract:Transformers bring significantly improved performance to the light field image super-resolution task due to their long-range dependency modeling capability. However, the inherently high computational complexity of their core self-attention mechanism has increasingly hindered their advancement in this task. To address this issue, we first introduce the LF-VSSM block, a novel module inspired by progressive feature extraction, to efficiently capture critical long-range spatial-angular dependencies in light field images. LF-VSSM successively extracts spatial features within sub-aperture images, spatial-angular features between sub-aperture images, and spatial-angular features between light field image pixels. On this basis, we propose a lightweight network, L^2 FMamba (Lightweight Light Field Mamba), which integrates the LF-VSSM block to leverage light field features for super-resolution tasks while overcoming the computational challenges of Transformer-based approaches. Extensive experiments on multiple light field datasets demonstrate that our method reduces the number of parameters and complexity while achieving superior super-resolution performance with faster inference speed.

[CV-162] Limited-angle x-ray nano-tomography with machine-learning enabled iterative reconstruction engine

[Quick Read]: This paper tackles the 'missing wedge' problem in tomography, where geometric constraints restrict the angular range of acquired projections, causing severe artifacts and degraded resolution in the reconstructed image. The key solution is the Perception Fused Iterative Tomography Reconstruction Engine, which integrates a convolutional neural network (CNN) carrying perceptional knowledge as a smart regularizer into an iterative solver and uses the Alternating Direction Method of Multipliers (ADMM) to optimize in both the physics and image domains, achieving physically coherent and visually enhanced results. Experiments on datasets from various x-ray microscopy techniques show markedly improved reconstruction even with a missing wedge of over 100 degrees, where conventional methods fail, and improved reconstruction from sparse projections despite the network not being trained for that scenario, demonstrating the robustness and generality of the method.

Link: https://arxiv.org/abs/2503.19248
Authors: Chonghang Zhao,Mingyuan Ge,Xiaogang Yang,Yong S. Chu,Hanfei Yan
Affiliations: Unknown
Subjects: Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:A long-standing challenge in tomography is the ‘missing wedge’ problem, which arises when the acquisition of projection images within a certain angular range is restricted due to geometrical constraints. This incomplete dataset results in significant artifacts and poor resolution in the reconstructed image. To tackle this challenge, we propose an approach dubbed Perception Fused Iterative Tomography Reconstruction Engine, which integrates a convolutional neural network (CNN) with perceptional knowledge as a smart regularizer into an iterative solving engine. We employ the Alternating Direction Method of Multipliers to optimize the solution in both physics and image domains, thereby achieving a physically coherent and visually enhanced result. We demonstrate the effectiveness of the proposed approach using various experimental datasets obtained with different x-ray microscopy techniques. All show significantly improved reconstruction even with a missing wedge of over 100 degrees - a scenario where conventional methods fail. Notably, it also improves the reconstruction in case of sparse projections, despite the network not being specifically trained for that. This demonstrates the robustness and generality of our method of addressing commonly occurring challenges in 3D x-ray imaging applications for real-world problems.
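
Plugging a trained CNN into an ADMM solver follows the standard plug-and-play pattern, which the sketch below illustrates. The projector pair `A`/`At`, the CNN prior, the step sizes, and the iteration counts are all placeholder assumptions; only the alternation between a data-fidelity x-update, a CNN-based z-update, and a dual update reflects the approach described.

```python
# A hedged sketch of plug-and-play ADMM for limited-angle tomography: the data
# term enforces consistency with measured projections, and a trained CNN acts
# as the smart regularizer.
import numpy as np

def admm_reconstruct(A, At, cnn, sino, shape, rho=1.0, iters=30, inner=10):
    """A / At: forward and adjoint projection operators; cnn: denoising prior."""
    x = np.zeros(shape); z = np.zeros(shape); u = np.zeros(shape)
    for _ in range(iters):
        # x-update: gradient steps on ||A x - sino||^2 + rho ||x - (z - u)||^2
        for _ in range(inner):
            grad = At(A(x) - sino) + rho * (x - (z - u))
            x -= 1e-3 * grad
        z = cnn(x + u)          # z-update: apply the learned perception prior
        u += x - z              # dual-variable update
    return x
```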

[CV-163] PSO-UNet: Particle Swarm-Optimized U-Net Framework for Precise Multimodal Brain Tumor Segmentation GECCO2025

[Quick Read]: This paper addresses the need for accurate yet computationally efficient medical image segmentation in brain tumor analysis, where the complexity of multimodal MRI datasets and diverse tumor morphologies make manual tuning or alternative optimizers ill-suited to the complex hyperparameter search space. The key is PSO-UNet, which couples Particle Swarm Optimization (PSO) with the U-Net architecture to dynamically optimize hyperparameters, explicitly the number of filters, kernel size, and learning rate, thereby improving segmentation performance while reducing computational complexity. PSO-UNet achieves Dice Similarity Coefficients (DSC) of 0.9578 and 0.9523 and Intersection over Union (IoU) scores of 0.9194 and 0.9097 on the BraTS 2021 and Figshare datasets respectively, with only 7.8 million parameters and a runtime of roughly 906 seconds, outperforming comparable U-Net-based frameworks. The results indicate robust generalization across MRI modalities and tumor classifications and clear advantages over conventional hyperparameter tuning; future work will explore hybrid optimization strategies and validate the framework against other bio-inspired algorithms for robustness and scalability.

Link: https://arxiv.org/abs/2503.19152
Authors: Shoffan Saifullah,Rafał Dreżewski
Affiliations: Faculty of Computer Science, AGH University of Krakow (Kraków)
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 6 figures, 4 tables, GECCO 2025 Conference

Click to view abstract

Abstract:Medical image segmentation, particularly for brain tumor analysis, demands precise and computationally efficient models due to the complexity of multimodal MRI datasets and diverse tumor morphologies. This study introduces PSO-UNet, which integrates Particle Swarm Optimization (PSO) with the U-Net architecture for dynamic hyperparameter optimization. Unlike traditional manual tuning or alternative optimization approaches, PSO effectively navigates complex hyperparameter search spaces, explicitly optimizing the number of filters, kernel size, and learning rate. PSO-UNet substantially enhances segmentation performance, achieving Dice Similarity Coefficients (DSC) of 0.9578 and 0.9523 and Intersection over Union (IoU) scores of 0.9194 and 0.9097 on the BraTS 2021 and Figshare datasets, respectively. Moreover, the method reduces computational complexity significantly, utilizing only 7.8 million parameters and executing in approximately 906 seconds, markedly faster than comparable U-Net-based frameworks. These outcomes underscore PSO-UNet’s robust generalization capabilities across diverse MRI modalities and tumor classifications, emphasizing its clinical potential and clear advantages over conventional hyperparameter tuning methods. Future research will explore hybrid optimization strategies and validate the framework against other bio-inspired algorithms to enhance its robustness and scalability.
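
A bare-bones PSO over the three hyperparameters named above (filter count, kernel size, learning rate) looks as follows. The `evaluate_unet` fitness function, search bounds, and swarm settings are illustrative assumptions; a real run would round the discrete dimensions and briefly train a U-Net inside the fitness call.

```python
# A minimal PSO sketch over [n_filters, kernel_size, learning_rate].
import numpy as np

def pso(evaluate_unet, n_particles=8, iters=10, w=0.7, c1=1.5, c2=1.5):
    lo = np.array([16, 3, 1e-5]); hi = np.array([64, 7, 1e-2])   # search bounds
    x = lo + np.random.rand(n_particles, 3) * (hi - lo)          # positions
    v = np.zeros_like(x)                                         # velocities
    pbest = x.copy()
    pbest_val = np.array([evaluate_unet(p) for p in x])          # e.g. val Dice
    gbest = pbest[pbest_val.argmax()]
    for _ in range(iters):
        r1, r2 = np.random.rand(2, n_particles, 3)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
        vals = np.array([evaluate_unet(p) for p in x])
        improved = vals > pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[pbest_val.argmax()]
    return gbest   # best [n_filters, kernel_size, learning_rate] found
```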

[CV-164] TrackRAD2025 challenge dataset: Real-time tumor tracking for MRI-guided radiotherapy

[Quick Read]: This paper supports the development and evaluation of real-time tumor localization (tracking) algorithms for MRI-guided radiotherapy. The key contribution is a multi-institutional real-time MRI time series dataset of sagittal 2D cine MRI from 585 patients, acquired on MRI-linacs from different vendors across six centers (3 Dutch, 1 German, 1 Australian, and 1 Chinese). The dataset covers tumors in the thorax, abdomen, and pelvis, and for 108 cases the irradiation target or tracking surrogate was manually segmented on every temporal frame. The data are randomly split into a public training set (527 cases) and a private test set (58 cases) to support algorithm development and validation within the TrackRAD2025 challenge, enabling more accurate motion management and adaptive treatment strategies.

Link: https://arxiv.org/abs/2503.19119
Authors: Yiling Wang,Elia Lombardo,Adrian Thummerer,Tom Blöcker,Yu Fan,Yue Zhao,Christianna Iris Papadopoulou,Coen Hurkmans,Rob H.N. Tijssen,Pia A.W. Görts,Shyama U. Tetar,Davide Cusumano,Martijn P.W. Intven,Pim Borman,Marco Riboldi,Denis Dudáš,Hilary Byrne,Lorenzo Placidi,Marco Fusella,Michael Jameson,Miguel Palacios,Paul Cobussen,Tobias Finazzi,Cornelis J.A. Haasbeek,Paul Keall,Christopher Kurz,Guillaume Landry,Matteo Maspero
Affiliations: Unknown
Subjects: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures, 2 tables; submitted to Medical Physics

Click to view abstract

Abstract:Purpose: Magnetic resonance imaging (MRI) to visualize anatomical motion is becoming increasingly important when treating cancer patients with radiotherapy. Hybrid MRI-linear accelerator (MRI-linac) systems allow real-time motion management during irradiation. This paper presents a multi-institutional real-time MRI time series dataset from different MRI-linac vendors. The dataset is designed to support developing and evaluating real-time tumor localization (tracking) algorithms for MRI-guided radiotherapy within the TrackRAD2025 challenge (this https URL). Acquisition and validation methods: The dataset consists of sagittal 2D cine MRIs in 585 patients from six centers (3 Dutch, 1 German, 1 Australian, and 1 Chinese). Tumors in the thorax, abdomen, and pelvis acquired on two commercially available MRI-linacs (0.35 T and 1.5 T) were included. For 108 cases, irradiation targets or tracking surrogates were manually segmented on each temporal frame. The dataset was randomly split into a public training set of 527 cases (477 unlabeled and 50 labeled) and a private testing set of 58 cases (all labeled). Data Format and Usage Notes: The data is publicly available under the TrackRAD2025 collection: this https URL. Both the images and segmentations for each patient are available in metadata format. Potential Applications: This novel clinical dataset will enable the development and evaluation of real-time tumor localization algorithms for MRI-guided radiotherapy. By enabling more accurate motion management and adaptive treatment strategies, this dataset has the potential to advance the field of radiotherapy significantly.

[CV-165] 3D Structural Phenotype of the Optic Nerve Head at the Intersection of Glaucoma and Myopia - A Key to Improving Glaucoma Diagnosis in Myopic Populations

[Quick Read]: This paper characterizes the 3D structural phenotypes of the optic nerve head (ONH) in patients with glaucoma, high myopia, and concurrent high myopia and glaucoma, and evaluates how ONH morphology varies across these conditions. The key is a specialized ensemble network consisting of an encoder that compresses high-dimensional input into a latent vector, a decoder that reconstructs point clouds from the latent vector, and a classifier that categorizes point clouds into the four ONH conditions (healthy, high myopia, glaucoma, and the combined condition). The approach achieves accurate classification (micro-average AUC of 0.92 ± 0.03) and effective point cloud reconstruction, while revealing structural variations across conditions, including changes in retinal and connective tissue thickness, tilting and stretching of the disc and scleral canal opening, and shallow or deep excavation of the optic cup.

Link: https://arxiv.org/abs/2503.19083
Authors: Swati Sharma,Fabian A. Braeu,Thanadet Chuangsuwanich,Tin A. Tun,Quan V Hoang,Rachel Chong,Shamira Perera,Ching-Lin Ho,Rahat Husain,Martin L. Buist,Tin Aung,Michaël J.A. Girard
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 Pages, 2 Tables, 6 Figures, 1 Appendix

Click to view abstract

Abstract:Purpose: To characterize the 3D structural phenotypes of the optic nerve head (ONH) in patients with glaucoma, high myopia, and concurrent high myopia and glaucoma, and to evaluate their variations across these conditions. Participants: A total of 685 optical coherence tomography (OCT) scans from 754 subjects of Singapore-Chinese ethnicity, including 256 healthy (H), 94 highly myopic (HM), 227 glaucomatous (G), and 108 highly myopic with glaucoma (HMG) cases. Methods: We segmented the retinal and connective tissues from OCT volumes and their boundary edges were converted into 3D point clouds. To classify the 3D point clouds into four ONH conditions, i.e., H, HM, G, and HMG, a specialized ensemble network was developed, consisting of an encoder to transform high-dimensional input data into a compressed latent vector, a decoder to reconstruct point clouds from the latent vector, and a classifier to categorize the point clouds into the four ONH conditions. Results: The classification network achieved high accuracy, distinguishing H, HM, G, and HMG classes with a micro-average AUC of 0.92 \pm 0.03 on an independent test set. The decoder effectively reconstructed point clouds, achieving a Chamfer loss of 0.013 \pm 0.002. Dimensionality reduction clustered ONHs into four distinct groups, revealing structural variations such as changes in retinal and connective tissue thickness, tilting and stretching of the disc and scleral canal opening, and alterations in optic cup morphology, including shallow or deep excavation, across the four conditions. Conclusions: This study demonstrated that ONHs exhibit distinct structural signatures across H, HM, G, and HMG conditions. The findings further indicate that ONH morphology provides sufficient information for classification into distinct clusters, with principal components capturing unique structural patterns within each group.
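
The Chamfer loss reported for the decoder is a symmetric nearest-neighbor distance between point clouds, which can be written compactly in PyTorch. The normalization convention below (mean over points, summed over both directions) is an assumption; papers vary on this detail.

```python
# A minimal Chamfer distance between two point clouds.
import torch

def chamfer_distance(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """p: (N, 3), q: (M, 3). Mean nearest-neighbor squared distance, both ways."""
    d = torch.cdist(p, q) ** 2                   # (N, M) pairwise squared distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

# Toy usage: compare a reconstructed cloud against the target
target = torch.randn(1024, 3)
recon = target + 0.01 * torch.randn_like(target)
print(chamfer_distance(recon, target))
```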

[CV-166] Foundation Model for Whole-Heart Segmentation: Leveraging Student-Teacher Learning in Multi-Modal Medical Imaging

[Quick Read]: This paper addresses the modality-specific biases and heavy reliance on large labeled datasets that limit existing methods for whole-heart segmentation from cardiac CT and MRI scans. The proposed foundation model uses a self-supervised learning (SSL) framework with a student-teacher architecture. The key is pretraining on large unlabeled data with an xLSTM backbone that captures long-range spatial dependencies and complex anatomical structures in 3D medical images, while multi-modal pretraining ensures strong generalization across CT and MRI, mitigating modality-specific variation and improving segmentation accuracy in diverse clinical settings. An xLSTM-UNet-based architecture is further introduced for the downstream whole-heart segmentation task and shows robust performance on few-label CT and MRI datasets.

Link: https://arxiv.org/abs/2503.19005
Authors: Abdul Qayyum,Moona Mazher,Devran Ugurlu,Jose Alonso Solis Lemus,Cristobal Rodero,Steven A Niederer
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Whole-heart segmentation from CT and MRI scans is crucial for cardiovascular disease analysis, yet existing methods struggle with modality-specific biases and the need for extensive labeled datasets. To address these challenges, we propose a foundation model for whole-heart segmentation using a self-supervised learning (SSL) framework based on a student-teacher architecture. Our model is pretrained on a large, unlabeled dataset of CT and MRI scans, leveraging the xLSTM backbone to capture long-range spatial dependencies and complex anatomical structures in 3D medical images. By incorporating multi-modal pretraining, our approach ensures strong generalization across both CT and MRI modalities, mitigating modality-specific variations and improving segmentation accuracy in diverse clinical settings. The use of large-scale unlabeled data significantly reduces the dependency on manual annotations, enabling robust performance even with limited labeled data. We further introduce an xLSTM-UNet-based architecture for downstream whole-heart segmentation tasks, demonstrating its effectiveness on few-label CT and MRI datasets. Our results validate the robustness and adaptability of the proposed model, highlighting its potential for advancing automated whole-heart segmentation in medical imaging.
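
Student-teacher SSL frameworks of this kind usually keep the teacher as an exponential moving average (EMA) of the student. The sketch below shows that update in isolation; the momentum value and how the xLSTM-UNet student and teacher consume different views are assumptions not specified by the abstract.

```python
# A hedged sketch of the EMA teacher update common to student-teacher SSL.
import torch

@torch.no_grad()
def ema_update(student: torch.nn.Module, teacher: torch.nn.Module, m: float = 0.996):
    """teacher <- m * teacher + (1 - m) * student, applied parameter-wise."""
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(m).add_(ps, alpha=1.0 - m)
```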
zh
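
下面给出学生-教师自监督预训练核心循环的一个极简 PyTorch 示意:教师参数按指数滑动平均(EMA)跟随学生,不参与反向传播。骨干网络、增广方式与损失均为演示用假设(论文实际使用 xLSTM 编码 3D 体数据),并非论文官方实现。

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# 学生网络:论文用 xLSTM 编码 3D 体数据,这里用小型 3D CNN 代替以便演示
student = nn.Sequential(
    nn.Conv3d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, 64),
)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)            # 教师不反传,只按 EMA 跟随学生

opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
momentum = 0.996

for step in range(100):
    x = torch.randn(4, 1, 32, 32, 32)           # 无标注 CT/MRI 体块(随机数据示意)
    view1 = x + 0.05 * torch.randn_like(x)      # 两个随机增广视图(假设的简单扰动)
    view2 = x + 0.05 * torch.randn_like(x)
    s_out = student(view1)
    with torch.no_grad():
        t_out = teacher(view2)
    # 学生输出与教师目标对齐(此处用余弦损失示意)
    loss = 1 - F.cosine_similarity(s_out, t_out, dim=-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                        # EMA 更新教师
        for ps, pt in zip(student.parameters(), teacher.parameters()):
            pt.mul_(momentum).add_(ps, alpha=1 - momentum)
```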

[CV-167] FACE: Few-shot Adapter with Cross-view Fusion for Cross-subject EEG Emotion Recognition

【速读】:该论文旨在解决跨被试脑电(EEG)情感识别中的两个主要挑战:显著的被试间变异性(inter-subject variability)和复杂的被试内变异性(intra-subject variability)。现有方法主要通过领域自适应或泛化策略应对这些挑战,但通常需要大量目标被试数据或在未见过的被试上表现出有限的泛化性能。此外,尽管最近的少样本学习范式尝试克服这些限制,但在利用有限样本进行特定被试适应时仍容易发生灾难性过拟合(catastrophic overfitting)。

论文提出的解决方案关键在于Few-Shot Adapter with Cross-view Fusion (FACE),它结合了动态多视角融合方法与有效的特定被试适应机制。具体而言,FACE引入了一个跨视角融合模块,通过特定被试的融合权重动态整合全局脑连接性和局部模式,从而提供互补的情感信息。此外,还设计了一个少样本适配器模块,通过增强适配器结构的元学习(meta-learning)能力,在减少过拟合的同时实现对未见被试的快速适应。实验结果表明,FACE在三个公开的EEG情感识别基准数据集上实现了优于当前最先进方法的泛化性能,为有限标注数据的跨被试场景提供了实用的解决方案。

链接: https://arxiv.org/abs/2503.18998
作者: Haiqi Liu,C. L. Philip Chen,Tong Zhang
机构: Guangdong Provincial Key Laboratory of Computational AI Models and Cognitive Intelligence (广东省计算智能模型重点实验室), School of Computer Science and Engineering (计算机科学与工程学院), South China University of Technology (华南理工大学), Guangzhou 510006, China (中国); Pazhou Lab (琶洲实验室), Guangzhou 510335, China (中国); Engineering Research Center of the Ministry of Education on Health Intelligent Perception and Paralleled Digital-Human (教育部健康智能感知与平行数字人工程研究中心), Guangzhou, China (中国)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review

点击查看摘要

Abstract:Cross-subject EEG emotion recognition is challenged by significant inter-subject variability and intricately entangled intra-subject variability. Existing works have primarily addressed these challenges through domain adaptation or generalization strategies. However, they typically require extensive target subject data or demonstrate limited generalization performance to unseen subjects. Recent few-shot learning paradigms attempt to address these limitations but often encounter catastrophic overfitting during subject-specific adaptation with limited samples. This article introduces the few-shot adapter with a cross-view fusion method called FACE for cross-subject EEG emotion recognition, which leverages dynamic multi-view fusion and effective subject-specific adaptation. Specifically, FACE incorporates a cross-view fusion module that dynamically integrates global brain connectivity with localized patterns via subject-specific fusion weights to provide complementary emotional information. Moreover, the few-shot adapter module is proposed to enable rapid adaptation for unseen subjects while reducing overfitting by enhancing adapter structures with meta-learning. Experimental results on three public EEG emotion recognition benchmarks demonstrate FACE’s superior generalization performance over state-of-the-art methods. FACE provides a practical solution for cross-subject scenarios with limited labeled data.
zh
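
"特定被试融合权重"的思路可以用如下假设性示意来理解:由被试嵌入动态生成两个视角的权重,再对全局连接特征与局部模式特征加权融合。模块结构与维度均为演示假设,并非论文原始代码。

```python
import torch
import torch.nn as nn

class CrossViewFusion(nn.Module):
    """按被试动态加权融合全局连接特征与局部模式特征(示意)。"""
    def __init__(self, subj_dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(subj_dim, 32), nn.ReLU(),
                                  nn.Linear(32, 2))      # 两个视角的融合权重
    def forward(self, global_feat, local_feat, subj_emb):
        w = torch.softmax(self.gate(subj_emb), dim=-1)   # (B, 2),随被试变化
        return w[:, :1] * global_feat + w[:, 1:] * local_feat

fusion = CrossViewFusion(subj_dim=16)
g = torch.randn(8, 128)   # 全局脑连接特征(示意)
l = torch.randn(8, 128)   # 局部模式特征(示意)
s = torch.randn(8, 16)    # 被试嵌入(示意)
fused = fusion(g, l, s)   # (8, 128)
```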

[CV-168] Automated diagnosis of lung diseases using vision transformer: a comparative study on chest x-ray classification

【速读】:该论文旨在解决肺部异常诊断中的自动化分类问题,以减少对人工干预的依赖。论文针对包括肺炎在内的多种肺部疾病,利用包含3,475张胸部X光片的数据集,通过二分类和多分类方法评估了多种预训练深度学习模型(如CNN、ResNet50、DenseNet、CheXNet、U-Net)以及两种迁移学习算法(Vision Transformer (ViT) 和 Shifted Window (Swin))。关键在于采用Vision Transformer (ViT) 方法,其在二分类任务中达到了99%的准确率,在多分类任务中达到了95.25%的准确率,从而显著提升了肺部疾病的自动诊断性能。

链接: https://arxiv.org/abs/2503.18973
作者: Muhammad Ahmad,Sardar Usman,Ildar Batyrshin,Muhammad Muzammil,K. Sajid,M. Hasnain,Muhammad Jalal,Grigori Sidorov
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Background: Lung disease is a significant health issue, particularly in children and elderly individuals. It often results from lung infections and is one of the leading causes of mortality in children. Globally, lung-related diseases claim many lives each year, making early and accurate diagnoses crucial. Radiographs are valuable tools for the diagnosis of such conditions. The most prevalent lung diseases, including pneumonia, asthma, allergies, chronic obstructive pulmonary disease (COPD), bronchitis, emphysema, and lung cancer, represent significant public health challenges. Early prediction of these conditions is critical, as it allows for the identification of risk factors and implementation of preventive measures to reduce the likelihood of disease onset. Methods: In this study, we utilized a dataset comprising 3,475 chest X-ray images sourced from Mendeley Data provided by Talukder, M. A. (2023) [14], categorized into three classes: normal, lung opacity, and pneumonia. We applied five pre-trained deep learning models, including CNN, ResNet50, DenseNet, CheXNet, and U-Net, as well as two transfer learning algorithms, Vision Transformer (ViT) and Shifted Window (Swin), to classify these images. This approach aims to address diagnostic issues in lung abnormalities by reducing reliance on human intervention through automated classification systems. Our analysis was conducted in both binary and multiclass settings. Results: In the binary classification, we focused on distinguishing between normal and viral pneumonia cases, whereas in the multi-class classification, all three classes (normal, lung opacity, and viral pneumonia) were included. Our proposed methodology (ViT) achieved remarkable performance, with accuracy rates of 99% for binary classification and 95.25% for multiclass classification.
zh
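
作为参考,用 torchvision 预训练 ViT 在三类胸片(normal / lung opacity / pneumonia)上做微调的最小示意如下;学习率、数据管线等均为假设,并非论文的原始训练设置。

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

weights = ViT_B_16_Weights.IMAGENET1K_V1
model = vit_b_16(weights=weights)
# 把 1000 类 ImageNet 分类头换成 3 类输出
model.heads.head = nn.Linear(model.heads.head.in_features, 3)

preprocess = weights.transforms()   # 官方预处理:缩放到 224x224 并做 ImageNet 归一化
opt = torch.optim.AdamW(model.parameters(), lr=3e-5)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """假设 images 已经过 preprocess、labels 取值 0/1/2。"""
    model.train()
    logits = model(images)
    loss = criterion(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```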

人工智能

[AI-0] A proposal for an incident regime that tracks and counters threats to national security posed by AI systems

链接: https://arxiv.org/abs/2503.19887
作者: Alejandro Ortega
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent progress in AI capabilities has heightened concerns that AI systems could pose a threat to national security, for example, by making it easier for malicious actors to perform cyberattacks on critical national infrastructure, or through loss of control of autonomous AI systems. In parallel, federal legislators in the US have proposed nascent ‘AI incident regimes’ to identify and counter similar threats. In this paper, we consolidate these two trends and present a proposal for a legally mandated post-deployment AI incident regime that aims to counter potential national security threats from AI systems. We start the paper by introducing the concept of ‘security-critical’ to describe sectors that pose extreme risks to national security, before arguing that ‘security-critical’ describes civilian nuclear power, aviation, life science dual-use research of concern, and frontier AI development. We then present in detail our AI incident regime proposal, justifying each component of the proposal by demonstrating its similarity to US domestic incident regimes in other ‘security-critical’ sectors. Finally, we sketch a hypothetical scenario where our proposed AI incident regime deals with an AI cyber incident. Our proposed AI incident regime is split into three phases. The first phase revolves around a novel operationalization of what counts as an ‘AI incident’ and we suggest that AI providers must create a ‘national security case’ before deploying a frontier AI system. The second and third phases spell out that AI providers should notify a government agency about incidents, and that the government agency should be involved in amending AI providers’ security and safety procedures, in order to counter future threats to national security. Our proposal is timely, given ongoing policy interest in the potential national security threats posed by AI systems.

[AI-1] Dynamics of Structured Complex-Valued Hopfield Neural Networks

链接: https://arxiv.org/abs/2503.19885
作者: Rama Murthy Garimella,Marcos Eduardo Valle,Guilherme Vieira,Anil Rayala,Dileep Munugoti
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, we explore the dynamics of structured complex-valued Hopfield neural networks (CvHNNs), which arise when the synaptic weight matrix possesses specific structural properties. We begin by analyzing CvHNNs with a Hermitian synaptic weight matrix and establish the existence of four-cycle dynamics in CvHNNs with skew-Hermitian weight matrices operating synchronously. Furthermore, we introduce two new classes of complex-valued matrices: braided Hermitian and braided skew-Hermitian matrices. We demonstrate that CvHNNs utilizing these matrix types exhibit cycles of length eight when operating in full parallel update mode. Finally, we conduct extensive computational experiments on synchronous CvHNNs, exploring other synaptic weight matrix structures. The findings provide a comprehensive overview of the dynamics of structured CvHNNs, offering insights that may contribute to developing improved associative memory models when integrated with suitable learning rules.
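
同步(全并行)更新的复值 Hopfield 网络及其周期检测可以用如下数值示意来把握。其中激活函数取 z/|z| 仅为演示性假设,论文的具体激活与更新设定请以原文为准。

```python
import numpy as np

def csign(z, eps=1e-12):
    return z / (np.abs(z) + eps)            # 复数"符号"激活(演示假设)

def run_cvhnn(W, x0, steps=64):
    """同步迭代;若状态回到历史某步,返回检测到的周期长度。"""
    states = [x0]
    x = x0
    for t in range(1, steps + 1):
        x = csign(W @ x)                    # 全并行(同步)更新
        for k, s in enumerate(states):
            if np.allclose(x, s, atol=1e-9):
                return t - k                # 周期长度
        states.append(x)
    return None

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6)) + 1j * rng.normal(size=(6, 6))
W_skew = (A - A.conj().T) / 2               # 斜 Hermitian:W^H = -W
x0 = csign(rng.normal(size=6) + 1j * rng.normal(size=6))
print("detected cycle length:", run_cvhnn(W_skew, x0))
```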

[AI-2] Geometric Meta-Learning via Coupled Ricci Flow: Unifying Knowledge Representation and Quantum Entanglement

链接: https://arxiv.org/abs/2503.19867
作者: Ming Lei,Christophe Baehr
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Geometric Topology (math.GT); Quantum Physics (quant-ph)
*备注: 9 pages, submitted to IEEE PAMI

点击查看摘要

Abstract:This paper establishes a unified framework integrating geometric flows with deep learning through three fundamental innovations. First, we propose a thermodynamically coupled Ricci flow that dynamically adapts parameter space geometry to loss landscape topology, formally proved to preserve isometric knowledge embedding (Theorem~\refthm:isometric). Second, we derive explicit phase transition thresholds and critical learning rates (Theorem~\refthm:critical) through curvature blowup analysis, enabling automated singularity resolution via geometric surgery (Lemma~\reflem:surgery). Third, we establish an AdS/CFT-type holographic duality (Theorem~\refthm:ads) between neural networks and conformal field theories, providing entanglement entropy bounds for regularization design. Experiments demonstrate 2.1 \times convergence acceleration and 63% topological simplification while maintaining \mathcalO(N\log N) complexity, outperforming Riemannian baselines by 15.2% in few-shot accuracy. Theoretically, we prove exponential stability (Theorem~\refthm:converge) through a new Lyapunov function combining Perelman entropy with Wasserstein gradient flows, fundamentally advancing geometric deep learning.

[AI-3] Guarding against artificial intelligence–hallucinated citations: the case for full-text reference deposit

链接: https://arxiv.org/abs/2503.19848
作者: Alex Glynn
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
*备注: 3 pages

点击查看摘要

Abstract:The tendency of generative artificial intelligence (AI) systems to “hallucinate” false information is well-known; AI-generated citations to non-existent sources have made their way into the reference lists of peer-reviewed publications. Here, I propose a solution to this problem, taking inspiration from the Transparency and Openness Promotion (TOP) data sharing guidelines, the clash of generative AI with the American judiciary, and the precedent set by submissions of prior art to the United States Patent and Trademark Office. Journals should require authors to submit the full text of each cited source along with their manuscripts, thereby preventing authors from citing any material whose full text they cannot produce. This solution requires limited additional work on the part of authors or editors while effectively immunizing journals against hallucinated references.

[AI-4] Bitstream Collisions in Neural Image Compression via Adversarial Perturbations

链接: https://arxiv.org/abs/2503.19817
作者: Jordan Madden,Lhamo Dorje,Xiaohua Li
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Neural image compression (NIC) has emerged as a promising alternative to classical compression techniques, offering improved compression ratios. Despite its progress towards standardization and practical deployment, there has been minimal exploration into its robustness and security. This study reveals an unexpected vulnerability in NIC - bitstream collisions - where semantically different images produce identical compressed bitstreams. Utilizing a novel whitebox adversarial attack algorithm, this paper demonstrates that adding carefully crafted perturbations to semantically different images can cause their compressed bitstreams to collide exactly. The collision vulnerability poses a threat to the practical usability of NIC, particularly in security-critical applications. The cause of the collision is analyzed, and a simple yet effective mitigation method is presented.
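
白盒碰撞攻击的核心思路是:熵编码是确定性的,只要扰动后图像的量化潜变量与目标图像完全一致,压缩比特流就会碰撞。下面是一个高度简化的示意(编码器、损失与超参均为假设,并非论文算法本身):

```python
import torch
import torch.nn as nn

# 假设 enc 是任意可微的神经编码器;实际 NIC 的编码器要复杂得多
enc = nn.Sequential(nn.Conv2d(3, 8, 5, stride=4, padding=2), nn.Flatten())

def quantize_ste(y):
    return y + (torch.round(y) - y).detach()   # 直通估计量化,保留梯度

x_target = torch.rand(1, 3, 64, 64)
x_src = torch.rand(1, 3, 64, 64)
delta = torch.zeros_like(x_src, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-2)
with torch.no_grad():
    y_target = torch.round(enc(x_target))      # 目标图像的量化潜变量

for step in range(500):
    y = quantize_ste(enc((x_src + delta).clamp(0, 1)))
    # 碰撞项(潜变量对齐)+ 扰动幅度约束
    loss = (y - y_target).pow(2).sum() + 0.01 * delta.pow(2).sum()
    opt.zero_grad(); loss.backward(); opt.step()

collided = torch.equal(torch.round(enc((x_src + delta).clamp(0, 1))), y_target)
print("latents collide:", bool(collided))      # 潜变量一致 => 比特流一致
```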

[AI-5] hinking agents for zero-shot generalization to qualitatively novel tasks

链接: https://arxiv.org/abs/2503.19815
作者: Thomas Miconi,Kevin McKee,Yicong Zheng,Jed McCaleb
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Intelligent organisms can solve truly novel problems which they have never encountered before, either in their lifetime or their evolution. An important component of this capacity is the ability to “think”, that is, to mentally manipulate objects, concepts and behaviors in order to plan and evaluate possible solutions to novel problems, even without environment interaction. To generate problems that are truly qualitatively novel, while still solvable zero-shot (by mental simulation), we use the combinatorial nature of environments: we train the agent while withholding a specific combination of the environment’s elements. The novel test task, based on this combination, is thus guaranteed to be truly novel, while still mentally simulable since the agent has been exposed to each individual element (and their pairwise interactions) during training. We propose a method to train agents endowed with world models to make use of their mental simulation abilities, by selecting tasks based on the difference between the agent’s pre-thinking and post-thinking performance. When tested on the novel, withheld problem, the resulting agent successfully simulated alternative scenarios and used the resulting information to guide its behavior in the actual environment, solving the novel task in a single real-environment trial (zero-shot).

[AI-6] Guidelines For The Choice Of The Baseline in XAI Attribution Methods

链接: https://arxiv.org/abs/2503.19813
作者: Cristian Morasso,Giorgio Dolci,Ilaria Boscolo Galazzo,Sergey M. Plis,Gloria Menegaz
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Given the broad adoption of artificial intelligence, it is essential to provide evidence that AI models are reliable, trustable, and fair. To this end, the emerging field of eXplainable AI develops techniques to probe such requirements, counterbalancing the hype pushing the pervasiveness of this technology. Among the many facets of this issue, this paper focuses on baseline attribution methods, aiming at deriving a feature attribution map at the network input relying on a “neutral” stimulus usually called “baseline”. The choice of the baseline is crucial as it determines the explanation of the network behavior. In this framework, this paper has the twofold goal of shedding light on the implications of the choice of the baseline and providing a simple yet effective method for identifying the best baseline for the task. To achieve this, we propose a decision boundary sampling method, since the baseline, by definition, lies on the decision boundary, which naturally becomes the search domain. Experiments are performed on synthetic examples and validated relying on state-of-the-art methods. Despite being limited to the experimental scope, this contribution is relevant as it offers clear guidelines and a simple proxy for baseline selection, reducing ambiguity and enhancing deep models’ reliability and trust.
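
由于基线按定义位于决策边界上,可以在边界两侧样本的连线上做二分查找来采样边界点。下面是这一思路的假设性示意(并非论文确切的采样器):

```python
import numpy as np

def boundary_baseline(model, x_pos, x_neg, iters=40):
    """在两侧类别样本的连线上二分查找决策边界点,作为候选基线(示意)。
    model(x) 返回正类概率;x_pos、x_neg 分别位于边界两侧。"""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        x_mid = (1 - mid) * x_pos + mid * x_neg
        if model(x_mid) >= 0.5:
            lo = mid            # 仍在正类一侧,向 x_neg 方向推进
        else:
            hi = mid
    return (1 - lo) * x_pos + lo * x_neg

# 玩具例子:一维 sigmoid 分类器,决策边界在 x = 0
toy_model = lambda x: 1.0 / (1.0 + np.exp(-4 * x))
baseline = boundary_baseline(toy_model, np.array(2.0), np.array(-3.0))
print(baseline)   # 应收敛到接近 0.0 的边界点
```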

[AI-7] Simulating Tracking Data to Advance Sports Analytics Research AAMAS

链接: https://arxiv.org/abs/2503.19809
作者: David Radke,Kyle Tilbury
类目: Artificial Intelligence (cs.AI)
*备注: 2 pages, 2 figures, Proceedings of the 24th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS)

点击查看摘要

Abstract:Advanced analytics have transformed how sports teams operate, particularly in episodic sports like baseball. Their impact on continuous invasion sports, such as soccer and ice hockey, has been limited due to increased game complexity and restricted access to high-resolution game tracking data. In this demo, we present a method to collect and utilize simulated soccer tracking data from the Google Research Football environment to support the development of models designed for continuous tracking data. The data is stored in a schema that is representative of real tracking data and we provide processes that extract high-level features and events. We include examples of established tracking data models to showcase the efficacy of the simulated data. We address the scarcity of publicly available tracking data, providing support for research at the intersection of artificial intelligence and sports analytics.

[AI-8] Splitting Answer Set Programs with respect to Intensionality Statements (Extended Version) AAAI2023

链接: https://arxiv.org/abs/2503.19762
作者: Jorge Fandinno,Yuliya Lierler
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注: Extended version of the paper published in AAAI 2023

点击查看摘要

Abstract:Splitting a logic program allows us to reduce the task of computing its stable models to similar tasks for its subprograms. This can be used to increase solving performance and prove program correctness. We generalize the conditions under which this technique is applicable, by considering not only dependencies between predicates but also their arguments and context. This allows splitting programs commonly used in practice to which previous results were not applicable.

[AI-9] Inducing Personality in LLM-Based Honeypot Agents: Measuring the Effect on Human-Like Agenda Generation

链接: https://arxiv.org/abs/2503.19752
作者: Lewis Newsham,Ryan Hyland,Daniel Prince
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: 11 pages, 1 figure, 6 tables. Accepted to NLPAICS 2024

点击查看摘要

Abstract:This paper presents SANDMAN, an architecture for cyber deception that leverages Language Agents to emulate convincing human simulacra. Our ‘Deceptive Agents’ serve as advanced cyber decoys, designed for high-fidelity engagement with attackers by extending the observation period of attack behaviours. Through experimentation, measurement, and analysis, we demonstrate how a prompt schema based on the five-factor model of personality systematically induces distinct ‘personalities’ in Large Language Models. Our results highlight the feasibility of persona-driven Language Agents for generating diverse, realistic behaviours, ultimately improving cyber deception strategies.
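
基于大五人格(five-factor model)的提示模板可以写成如下示意;维度取值与措辞均为假设,并非论文附录中的原始 schema:

```python
# 大五人格系统提示模板(示意;维度取值与措辞均为假设)
PERSONA = {
    "openness": "high",
    "conscientiousness": "low",
    "extraversion": "high",
    "agreeableness": "medium",
    "neuroticism": "low",
}

def build_system_prompt(persona: dict) -> str:
    traits = "; ".join(f"{k}: {v}" for k, v in persona.items())
    return (
        "You are simulating an office worker planning their day. "
        f"Act consistently with this five-factor personality profile: {traits}. "
        "Produce an hour-by-hour agenda in first person."
    )

print(build_system_prompt(PERSONA))
```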

[AI-10] Invertible Koopman neural operator for data-driven modeling of partial differential equations

链接: https://arxiv.org/abs/2503.19717
作者: Yuhong Jin,Andong Cong,Lei Hou,Qiang Gao,Xiangdong Ge,Chonglong Zhu,Yongzhi Feng,Jun Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 25 pages, 10 figures

点击查看摘要

Abstract:Koopman operator theory is a popular candidate for data-driven modeling because it provides a global linearization representation for nonlinear dynamical systems. However, existing Koopman operator-based methods suffer from shortcomings in constructing the well-behaved observable function and its inverse, and are not efficient enough when dealing with partial differential equations (PDEs). To address these issues, this paper proposes the Invertible Koopman Neural Operator (IKNO), a novel data-driven modeling approach inspired by the Koopman operator theory and neural operator. IKNO leverages an Invertible Neural Network to parameterize observable function and its inverse simultaneously under the same learnable parameters, explicitly guaranteeing the reconstruction relation, thus eliminating the dependency on the reconstruction loss, which is an essential improvement over the original Koopman Neural Operator (KNO). The structured linear matrix inspired by the Koopman operator theory is parameterized to learn the evolution of observables’ low-frequency modes in the frequency space rather than directly in the observable space, which keeps IKNO resolution-invariant like other neural operators. Moreover, with preprocessing such as interpolation and dimension expansion, IKNO can be extended to operator learning tasks defined on non-Cartesian domains. We fully support the above claims based on rich numerical and real-world examples and demonstrate the effectiveness of IKNO and superiority over other neural operators.
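
"同一组参数同时给出观测函数及其逆"可以用可逆耦合层(RealNVP 风格)来理解:正逆变换共享同一个子网络,重建关系天然成立。下面是这一思想的简化示意(Koopman 算子此处用普通线性层代替论文中的频域参数化):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """可逆耦合层:同一组参数同时定义观测函数及其逆(示意)。"""
    def __init__(self, dim):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, 64), nn.Tanh(),
                                 nn.Linear(64, 2 * (dim - self.half)))
    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=-1)
        return torch.cat([x1, x2 * torch.exp(s) + t], dim=-1)
    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        s, t = self.net(y1).chunk(2, dim=-1)
        return torch.cat([y1, (y2 - t) * torch.exp(-s)], dim=-1)

dim = 32
phi = AffineCoupling(dim)              # 观测函数 g(可逆)
K = nn.Linear(dim, dim, bias=False)    # Koopman 线性演化算子(示意)

def predict_next(u_t):
    z = phi(u_t)                 # 升到观测空间
    z_next = K(z)                # 在观测空间线性演化
    return phi.inverse(z_next)   # 用同一组参数逆回物理空间,无需重建损失

u = torch.randn(4, dim)
u_next = predict_next(u)
# 可逆性自检:g^{-1}(g(u)) == u
assert torch.allclose(phi.inverse(phi(u)), u, atol=1e-5)
```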

[AI-11] Decoupled Dynamics Framework with Neural Fields for 3D Spatio-temporal Prediction of Vehicle Collisions

链接: https://arxiv.org/abs/2503.19712
作者: Sanghyuk Kim,Minsik Seo,Namwoo Kang
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
*备注: 24 pages, 13 figures

点击查看摘要

Abstract:This study proposes a neural framework that predicts 3D vehicle collision dynamics by independently modeling global rigid-body motion and local structural deformation. Unlike approaches directly predicting absolute displacement, this method explicitly separates the vehicle’s overall translation and rotation from its structural deformation. Two specialized networks form the core of the framework: a quaternion-based Rigid Net for rigid motion and a coordinate-based Deformation Net for local deformation. By independently handling fundamentally distinct physical phenomena, the proposed architecture achieves accurate predictions without requiring separate supervision for each component. The model, trained on only 10% of available simulation data, significantly outperforms baseline models, including single multi-layer perceptron (MLP) and deep operator networks (DeepONet), with prediction errors reduced by up to 83%. Extensive validation demonstrates strong generalization to collision conditions outside the training range, accurately predicting responses even under severe impacts involving extreme velocities and large impact angles. Furthermore, the framework successfully reconstructs high-resolution deformation details from low-resolution inputs without increased computational effort. Consequently, the proposed approach provides an effective, computationally efficient method for rapid and reliable assessment of vehicle safety across complex collision scenarios, substantially reducing the required simulation data and time while preserving prediction fidelity.

[AI-12] Optimal Path Planning and Cost Minimization for a Drone Delivery System Via Model Predictive Control

链接: https://arxiv.org/abs/2503.19699
作者: Muhammad Al-Zafar Khan,Jamal Al-Karaki
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: 15 pages, 5 figures, Submitted to the 2025 International Conference on Artificial Intelligence, Computer, Data Sciences and Applications

点击查看摘要

Abstract:In this study, we formulate the drone delivery problem as a control problem and solve it using Model Predictive Control (MPC). Two experiments are performed: the first on a less challenging grid-world environment with lower dimensionality, and the second with higher dimensionality and added complexity. The MPC method was benchmarked against three popular Multi-Agent Reinforcement Learning (MARL) algorithms: Independent Q-Learning (IQL), Joint Action Learners (JAL), and Value-Decomposition Networks (VDN). It was shown that the MPC method solved the problem more quickly and required fewer drones to achieve a minimized cost and navigate the optimal path.
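
滚动时域控制(MPC)的基本模式可以用一个单无人机网格世界的小例子说明:每步穷举有限时域内的动作序列、取代价最小序列的首个动作并滚动执行。规模与代价函数均为演示假设:

```python
import itertools
import numpy as np

ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]   # 上下左右 + 悬停

def mpc_step(pos, goal, horizon=3):
    """穷举 horizon 步动作序列,返回总代价最小序列的首个动作(小规模示意)。"""
    best_cost, best_a = float("inf"), (0, 0)
    for seq in itertools.product(ACTIONS, repeat=horizon):
        p = np.array(pos)
        cost = 0.0
        for a in seq:
            p = p + np.array(a)
            cost += 1.0 + np.abs(p - goal).sum()   # 每步代价 + 曼哈顿距离惩罚
        if cost < best_cost:
            best_cost, best_a = cost, seq[0]
    return best_a

pos, goal = np.array([0, 0]), np.array([4, 3])
for t in range(20):                      # 滚动时域执行:只执行首个动作后重新规划
    a = mpc_step(pos, goal)
    pos = pos + np.array(a)
    if (pos == goal).all():
        print("reached goal at step", t + 1)
        break
```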

[AI-13] Deep Learning for Speech Emotion Recognition: A CNN Approach Utilizing Mel Spectrograms

链接: https://arxiv.org/abs/2503.19677
作者: Niketa Penumajji
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
*备注: 5 pages, 8 figures

点击查看摘要

Abstract:This paper explores the application of Convolutional Neural Networks (CNNs) for classifying emotions in speech through Mel Spectrogram representations of audio files. Traditional methods such as Gaussian Mixture Models and Hidden Markov Models have proven insufficient for practical deployment, prompting a shift towards deep learning techniques. By transforming audio data into a visual format, the CNN model autonomously learns to identify intricate patterns, enhancing classification accuracy. The developed model is integrated into a user-friendly graphical interface, facilitating real-time predictions and potential applications in educational environments. The study aims to advance the understanding of deep learning in speech emotion recognition, assess the model’s feasibility, and contribute to the integration of technology in learning contexts.
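
从音频到 log-Mel 谱再到 CNN 分类的典型管线大致如下(类别数、网络规模等均为假设,非论文原始配置):

```python
import numpy as np
import librosa
import torch
import torch.nn as nn

def wav_to_logmel(path, sr=22050, n_mels=64):
    """读取音频并转成 log-Mel 谱,形状为 (n_mels, T)。"""
    y, sr = librosa.load(path, sr=sr, duration=3.0)
    m = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(m, ref=np.max)

emotion_cnn = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(32, 8),    # 假设 8 种情绪类别
)

spec = torch.randn(1, 1, 64, 130)      # 一批 log-Mel 输入(随机数据示意)
logits = emotion_cnn(spec)
pred = logits.argmax(dim=-1)           # 预测的情绪类别
```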

[AI-14] owards Reliable Time Series Forecasting under Future Uncertainty: Ambiguity and Novelty Rejection Mechanisms

链接: https://arxiv.org/abs/2503.19656
作者: Ninghui Feng,Songning Lai,Xin Zhou,Jiayu Yang,Kunlong Feng,Zhenxiao Yin,Fobao Zhou,Zhangyi Hu,Yutao Yue,Yuxuan Liang,Boyu Wang,Hang Zhao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In real-world time series forecasting, uncertainty and lack of reliable evaluation pose significant challenges. Notably, forecasting errors often arise from underfitting in-distribution data and failing to handle out-of-distribution inputs. To enhance model reliability, we introduce a dual rejection mechanism combining ambiguity and novelty rejection. Ambiguity rejection, using prediction error variance, allows the model to abstain under low confidence, assessed through historical error variance analysis without future ground truth. Novelty rejection, employing Variational Autoencoders and Mahalanobis distance, detects deviations from training data. This dual approach improves forecasting reliability in dynamic environments by reducing errors and adapting to data changes, advancing reliability in complex scenarios.
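
双重拒绝机制的两个判据可以分别写成:基于历史误差方差的歧义拒绝,以及基于潜表示马氏距离的新颖拒绝。以下为简化示意,论文用 VAE 获得潜表示,这里以通用特征向量代替:

```python
import numpy as np

def ambiguity_reject(recent_errors, threshold):
    """歧义拒绝:历史预测误差方差过大时放弃输出(无需未来真值)。"""
    return np.var(recent_errors) > threshold

def novelty_reject(z, mu, cov_inv, threshold):
    """新颖拒绝:潜表示与训练分布的马氏距离过大时判为分布外。"""
    d = z - mu
    return float(d @ cov_inv @ d) > threshold

# 用训练集潜表示(这里以随机数据示意)估计均值与协方差
train_z = np.random.randn(1000, 16)
mu = train_z.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(train_z, rowvar=False) + 1e-6 * np.eye(16))

z_new = np.random.randn(16) * 3          # 刻意放大的"异常"样本
print(novelty_reject(z_new, mu, cov_inv, threshold=40.0))   # 预期 True
print(ambiguity_reject([0.1, 0.9, 0.2, 1.5], threshold=0.2))  # 误差波动大 => True
```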

[AI-15] Analyzable Chain-of-Musical-Thought Prompting for High-Fidelity Music Generation

链接: https://arxiv.org/abs/2503.19611
作者: Max W. Y. Lam,Yijin Xing,Weiya You,Jingcheng Wu,Zongyu Yin,Fuqiang Jiang,Hangyu Liu,Feng Liu,Xingda Li,Wei-Tsung Lu,Hanyu Chen,Tong Feng,Tianwei Zhao,Chien-Hung Liu,Xuchen Song,Yang Li,Yahui Zhou
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
*备注: Preprint

点击查看摘要

Abstract:Autoregressive (AR) models have demonstrated impressive capabilities in generating high-fidelity music. However, the conventional next-token prediction paradigm in AR models does not align with the human creative process in music composition, potentially compromising the musicality of generated samples. To overcome this limitation, we introduce MusiCoT, a novel chain-of-thought (CoT) prompting technique tailored for music generation. MusiCoT empowers the AR model to first outline an overall music structure before generating audio tokens, thereby enhancing the coherence and creativity of the resulting compositions. By leveraging the contrastive language-audio pretraining (CLAP) model, we establish a chain of “musical thoughts”, making MusiCoT scalable and independent of human-labeled data, in contrast to conventional CoT methods. Moreover, MusiCoT allows for in-depth analysis of music structure, such as instrumental arrangements, and supports music referencing – accepting variable-length audio inputs as optional style references. This innovative approach effectively addresses copying issues, positioning MusiCoT as a vital practical method for music prompting. Our experimental results indicate that MusiCoT consistently achieves superior performance across both objective and subjective metrics, producing music quality that rivals state-of-the-art generation models. Our samples are available at this https URL.

[AI-16] Enabling Rapid Shared Human-AI Mental Model Alignment via the After-Action Review AAAI2025

链接: https://arxiv.org/abs/2503.19607
作者: Edward Gu,Ho Chit Siu,Melanie Platt,Isabelle Hurley,Jaime Peña,Rohan Paleja
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: Accepted to the Cooperative Multi-Agent Systems Decision-making and Learning: Human-Multi-Agent Cognitive Fusion Workshop at AAAI 2025

点击查看摘要

Abstract:In this work, we present two novel contributions toward improving research in human-machine teaming (HMT): 1) a Minecraft testbed to accelerate testing and deployment of collaborative AI agents and 2) a tool to allow users to revisit and analyze behaviors within an HMT episode to facilitate shared mental model development. Our browser-based Minecraft testbed allows for rapid testing of collaborative agents in a continuous-space, real-time, partially-observable environment with real humans, without the cumbersome setup typical of human-AI interaction user studies. As Minecraft has an extensive player base and a rich ecosystem of pre-built AI agents, we hope this contribution can help to facilitate rapid research in the design of new collaborative agents and in understanding different human factors within HMT. Our mental model alignment tool facilitates user-led post-mission analysis by including video displays of first-person perspectives of the team members (i.e., the human and AI) that can be replayed, and a chat interface that leverages GPT-4 to provide answers to various queries regarding the AI’s experiences and model details.

[AI-17] Innate Reasoning is Not Enough: In-Context Learning Enhances Reasoning Large Language Models with Less Overthinking

链接: https://arxiv.org/abs/2503.19602
作者: Yuyao Ge,Shenghua Liu,Yiwei Wang,Lingrui Mei,Lizhe Chen,Baolong Bi,Xueqi Cheng
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) have introduced Reasoning Large Language Models (RLLMs), which employ extended thinking processes with reflection and self-correction capabilities, demonstrating the effectiveness of test-time scaling. RLLMs exhibit innate Chain-of-Thought (CoT) reasoning capability obtained from training, leading to a natural question: “Is CoT prompting, a popular In-Context Learning (ICL) method for chat LLMs, necessary to enhance the reasoning capability of RLLMs?” In this work, we present the first comprehensive analysis of the impacts of Zero-shot CoT and Few-shot CoT on RLLMs across mathematical reasoning tasks. We examine models ranging from 1.5B to 32B parameters, finding that contrary to concerns, CoT prompting significantly enhances RLLMs’ performance in most scenarios. Our results reveal distinct patterns: large-capacity models show minimal improvement on simple tasks but substantial gains on complex problems, while smaller models exhibit the opposite behavior. Further analysis demonstrates that CoT prompting effectively controls the distribution of the numbers of thinking tokens and reasoning steps, reducing excessive reflections by approximately 90% in some cases. Moreover, attention logits analysis reveals the RLLMs’ overfitting to reflection-related words, which is mitigated by external CoT guidance. Notably, our experiments indicate that for RLLMs, one-shot CoT consistently yields superior performance compared to Few-shot CoT approaches. Our findings provide important insights for optimizing RLLMs’ performance through appropriate prompting strategies.
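
Zero-shot CoT 与 Few-shot CoT 的提示构造本身很简单,示意如下(措辞为通行写法,非论文附录原文);论文发现对 RLLM 而言 one-shot CoT 往往优于多-shot:

```python
# Zero-shot CoT 与 Few-shot CoT 提示构造示意
def zero_shot_cot(question: str) -> str:
    return f"{question}\nLet's think step by step."

def few_shot_cot(question: str, exemplars: list[tuple[str, str]]) -> str:
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"{shots}\n\nQ: {question}\nA:"

exemplar = ("What is 12 * 7?",
            "First, 12 * 7 = 84. The answer is 84.")

# one-shot CoT:只给一个带推理过程的示例
print(few_shot_cot("What is 23 * 6?", [exemplar]))
```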

[AI-18] HoarePrompt: Structural Reasoning About Program Correctness in Natural Language

链接: https://arxiv.org/abs/2503.19599
作者: Dimitrios Stamatios Bouras,Yihan Dai,Tairan Wang,Yingfei Xiong,Sergey Mechtaev
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:While software requirements are often expressed in natural language, verifying the correctness of a program against natural language requirements is a hard and underexplored problem. Large language models (LLMs) are promising candidates for addressing this challenge, however our experience shows that they are ineffective in this task, often failing to detect even straightforward bugs. To address this gap, we introduce HoarePrompt, a novel approach that adapts fundamental ideas from program analysis and verification to natural language artifacts. Drawing inspiration from the strongest postcondition calculus, HoarePrompt employs a systematic, step-by-step process in which an LLM generates natural language descriptions of reachable program states at various points in the code. To manage loops, we propose few-shot-driven k-induction, an adaptation of the k-induction method widely used in model checking. Once program states are described, HoarePrompt leverages the LLM to assess whether the program, annotated with these state descriptions, conforms to the natural language requirements. For evaluating the quality of classifiers of program correctness with respect to natural language requirements, we constructed CoCoClaNeL, a challenging dataset of solutions to programming competition problems. Our experiments show that HoarePrompt improves the MCC by 62% compared to directly using Zero-shot-CoT prompts for correctness classification. Furthermore, HoarePrompt outperforms a classifier that assesses correctness via LLM-based test generation by increasing the MCC by 93%. The inductive reasoning mechanism contributes a 28% boost to MCC, underscoring its effectiveness in managing loops.

[AI-19] A Contradiction-Centered Model for the Emergence of Swarm Intelligence

链接: https://arxiv.org/abs/2503.19585
作者: Wenpin Jiao
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注: 21 pages, in Chinese language

点击查看摘要

Abstract:The phenomenon of emergence of swarm intelligence exists widely in nature and human society. People have been exploring the root cause of emergence of swarm intelligence and trying to establish general theories and models for emergence of swarm intelligence. However, the existing theories or models do not grasp the essence of swarm intelligence, so they lack generality and are difficult to explain various phenomena of emergence of swarm intelligence. In this paper, a contradiction-centered model for the emergence of swarm intelligence is proposed, in which the internal contradictions of individuals determine their behavior and properties, individuals are related and interact within the swarm because of competing and occupying environmental resources, interactions and swarm potential affect the internal contradictions of individuals and their distribution in the swarm, and the swarm intelligence is manifested as the specific distribution of individual contradictions. This model completely explains the conditions, dynamics, pathways, formations and processes of the emergence of swarm intelligence. In order to verify the validity of this model, several swarm intelligence systems are implemented and analyzed in this paper. The experimental results show that the model has good generality and can be used to describe the emergence of various swarm intelligence.

[AI-20] FedMM-X: A Trustworthy and Interpretable Framework for Federated Multi-Modal Learning in Dynamic Environments

链接: https://arxiv.org/abs/2503.19564
作者: Sree Bhargavi Balija
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As artificial intelligence systems increasingly operate in real-world environments, the integration of multi-modal data sources such as vision, language, and audio presents both unprecedented opportunities and critical challenges for achieving trustworthy intelligence. In this paper, we propose a novel framework that unifies federated learning with explainable multi-modal reasoning to ensure trustworthiness in decentralized, dynamic settings. Our approach, called FedMM-X (Federated Multi-Modal Explainable Intelligence), leverages cross-modal consistency checks, client-level interpretability mechanisms, and dynamic trust calibration to address challenges posed by data heterogeneity, modality imbalance, and out-of-distribution generalization. Through rigorous evaluation across federated multi-modal benchmarks involving vision-language tasks, we demonstrate improved performance in both accuracy and interpretability while reducing vulnerabilities to adversarial and spurious correlations. Further, we introduce a novel trust score aggregation method to quantify global model reliability under dynamic client participation. Our findings pave the way toward developing robust, interpretable, and socially responsible AI systems in real-world environments.

[AI-21] VectorFit: Adaptive Singular Bias Vector Fine-Tuning of Pre-trained Foundation Models

链接: https://arxiv.org/abs/2503.19530
作者: Suhas G Hegde,Shilpy Kaur,Aruna Tiwari
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Popular PEFT methods achieve parameter efficiency by assuming that incremental weight updates are inherently low-rank, which often leads to a performance gap compared to full fine-tuning. While recent methods have attempted to address this limitation, they typically lack sufficient parameter and memory efficiency. We propose VectorFit, an effective and easily deployable approach that adaptively trains the singular vectors and biases of pre-trained weight matrices. We demonstrate that the utilization of structural and transformational characteristics of pre-trained weights enables high-rank updates comparable to those of full fine-tuning. As a result, VectorFit achieves superior performance with 9X fewer trainable parameters compared to state-of-the-art PEFT methods. Through extensive experiments over 17 datasets spanning diverse language and vision tasks such as natural language understanding and generation, question answering, image classification, and image generation, we exhibit that VectorFit consistently outperforms baselines, even in extremely low-budget scenarios.
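
"只训练与预训练权重奇异分解相关的少量向量与偏置"这一思想可以用如下简化示意理解:冻结 U、V,仅训练奇异值向量 σ 与偏置。VectorFit 的确切参数化(含对奇异向量的自适应训练)请以论文为准:

```python
import torch
import torch.nn as nn

class SingularVectorLinear(nn.Module):
    """把预训练权重分解为 U diag(σ) V^T,仅训练 σ 与偏置的简化示意。"""
    def __init__(self, pretrained: nn.Linear):
        super().__init__()
        U, S, Vh = torch.linalg.svd(pretrained.weight.data, full_matrices=False)
        self.register_buffer("U", U)          # 冻结左奇异向量
        self.register_buffer("Vh", Vh)        # 冻结右奇异向量
        self.sigma = nn.Parameter(S.clone())  # 可训练奇异值向量
        self.bias = nn.Parameter(pretrained.bias.data.clone())  # 可训练偏置
    def forward(self, x):
        W = self.U @ torch.diag(self.sigma) @ self.Vh
        return x @ W.T + self.bias

layer = SingularVectorLinear(nn.Linear(64, 64))
n_train = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print("trainable params:", n_train)   # 64 个 σ + 64 个偏置,远小于 64*64 的全量权重
```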

[AI-22] SMT-EX: An Explainable Surrogate Modeling Toolbox for Mixed-Variables Design Exploration

链接: https://arxiv.org/abs/2503.19496
作者: Mohammad Daffa Robani,Paul Saves,Pramudita Satria Palar,Lavi Rizki Zuhal,Joseph Morlier
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Surrogate models are of high interest for many engineering applications, serving as cheap-to-evaluate time-efficient approximations of black-box functions to help engineers and practitioners make decisions and understand complex systems. As such, the need for explainability methods is rising and many studies have been performed to facilitate knowledge discovery from surrogate models. To respond to these enquiries, this paper introduces SMT-EX, an enhancement of the open-source Python Surrogate Modeling Toolbox (SMT) that integrates explainability techniques into a state-of-the-art surrogate modelling framework. More precisely, SMT-EX includes three key explainability methods: Shapley Additive Explanations, Partial Dependence Plots, and Individual Conditional Expectations. A dedicated explainability dependency of SMT has been developed for this purpose, which can be easily activated once the surrogate model is built, offering a user-friendly and efficient tool for swift insight extraction. The effectiveness of SMT-EX is showcased through two test cases. The first case is a 10-variable wing weight problem with purely continuous variables and the second one is a 3-variable mixed-categorical cantilever beam bending problem. Relying on SMT-EX analyses for these problems, we demonstrate its versatility in addressing a diverse range of problem characteristics. SMT-Explainability is freely available on GitHub: this https URL.

[AI-23] Data-centric Federated Graph Learning with Large Language Models

链接: https://arxiv.org/abs/2503.19455
作者: Bo Yan,Zhongjian Zhang,Huabin Sun,Mengmei Zhang,Yang Cao,Chuan Shi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: ongoing work

点击查看摘要

Abstract:In federated graph learning (FGL), a complete graph is divided into multiple subgraphs stored in each client due to privacy concerns, and all clients jointly train a global graph model by only transmitting model parameters. A pain point of FGL is the heterogeneity problem, where nodes or structures present non-IID properties among clients (e.g., different node label distributions), dramatically undermining the convergence and performance of FGL. To address this, existing efforts focus on design strategies at the model level, i.e., they design models to extract common knowledge to mitigate heterogeneity. However, these model-level strategies fail to fundamentally address the heterogeneity problem as the model needs to be designed from scratch when transferring to other tasks. Motivated by large language models (LLMs) having achieved remarkable success, we aim to utilize LLMs to fully understand and augment local text-attributed graphs, to address data heterogeneity at the data level. In this paper, we propose a general framework LLM4FGL that innovatively decomposes the task of LLM for FGL into two sub-tasks theoretically. Specifically, for each client, it first utilizes the LLM to generate missing neighbors and then infers connections between generated nodes and raw nodes. To improve the quality of generated nodes, we design a novel federated generation-and-reflection mechanism for LLMs, without the need to modify the parameters of the LLM but relying solely on the collective feedback from all clients. After neighbor generation, all the clients utilize a pre-trained edge predictor to infer the missing edges. Furthermore, our framework can seamlessly integrate as a plug-in with existing FGL methods. Experiments on three real-world datasets demonstrate the superiority of our method compared to advanced baselines.

[AI-24] VecTrans: LLM Transformation Framework for Better Auto-vectorization on High-performance CPU

链接: https://arxiv.org/abs/2503.19449
作者: Zhongchun Zheng,Long Cheng,Lu Li,Rodrigo C. O. Rocha,Tianyi Liu,Wei Wei,Xianwei Zhang,Yaoqing Gao
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated great capabilities in code generation, yet their effective application in compiler optimizations remains an open challenge due to issues such as hallucinations and a lack of domain-specific reasoning. Vectorization, a crucial optimization for enhancing code performance, often fails because of the compiler’s inability to recognize complex code patterns, which commonly require extensive empirical expertise. LLMs, with their ability to capture intricate patterns, thus provide a promising solution to this challenge. This paper presents VecTrans, a novel framework that leverages LLMs to enhance compiler-based code vectorization. VecTrans first employs compiler analysis to identify potentially vectorizable code regions. It then utilizes an LLM to refactor these regions into patterns that are more amenable to the compiler’s auto-vectorization. To ensure semantic correctness, VecTrans further integrates a hybrid validation mechanism at the intermediate representation (IR) level. With the above efforts, VecTrans combines the adaptability of LLMs with the precision of compiler vectorization, thereby effectively opening up the vectorization opportunities. Experimental results show that among all 50 TSVC functions unvectorizable by Clang, GCC, and BiShengCompiler, VecTrans successfully vectorizes 23 cases (46%) and achieves an average speedup of 2.02x, greatly surpassing state-of-the-art performance.

[AI-25] Quantifying Symptom Causality in Clinical Decision Making: An Exploration Using CausaLM

链接: https://arxiv.org/abs/2503.19394
作者: Mehul Shetty,Connor Jordan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Current machine learning approaches to medical diagnosis often rely on correlational patterns between symptoms and diseases, risking misdiagnoses when symptoms are ambiguous or common across multiple conditions. In this work, we move beyond correlation to investigate the causal influence of key symptoms, specifically “chest pain”, on diagnostic predictions. Leveraging the CausaLM framework, we generate counterfactual text representations in which target concepts are effectively “forgotten”, enabling a principled estimation of the causal effect of that concept on a model’s predicted disease distribution. By employing Textual Representation-based Average Treatment Effect (TReATE), we quantify how the presence or absence of a symptom shapes the model’s diagnostic outcomes, and contrast these findings against correlation-based baselines such as CONEXP. Our results offer deeper insight into the decision-making behavior of clinical NLP models and have the potential to inform more trustworthy, interpretable, and causally-grounded decision support tools in medical practice.

[AI-26] Causal invariant geographic network representations with feature and structural distribution shifts

链接: https://arxiv.org/abs/2503.19382
作者: Yuhan Wang,Silu He,Qinyao Luo,Hongyuan Yuan,Ling Zhao,Jiawei Zhu,Haifeng Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 15 pages, 3 figures, 8 tables

点击查看摘要

Abstract:The existing methods learn geographic network representations through deep graph neural networks (GNNs) based on the i.i.d. assumption. However, the spatial heterogeneity and temporal dynamics of geographic data make the out-of-distribution (OOD) generalisation problem particularly salient. Such representations are particularly sensitive to distribution shifts (feature and structural shifts) between the testing and training data, and these shifts are the main causes of the OOD generalisation problem. Spurious correlations are present between invariant and background representations due to selection biases and environmental effects, making the model more likely to learn background representations. The existing approaches focus on background representation changes that are determined by shifts in the feature distributions of nodes in the training and test data while ignoring changes in the proportional distributions of heterogeneous and homogeneous neighbour nodes, which we refer to as structural distribution shifts. We propose a feature-structure mixed invariant representation learning (FSM-IRL) model that accounts for both feature distribution shifts and structural distribution shifts. To address structural distribution shifts, we introduce a sampling method based on causal attention, encouraging the model to identify nodes possessing strong causal relationships with labels or nodes that are more similar to the target node. Inspired by the Hilbert-Schmidt independence criterion, we implement a reweighting strategy to maximise the orthogonality of the node representations, thereby mitigating the spurious correlations among the node representations and suppressing the learning of background representations. Our experiments demonstrate that FSM-IRL exhibits strong learning capabilities on both geographic and social network datasets in OOD scenarios.

[AI-27] Flow to Learn: Flow Matching on Neural Network Parameters ICLR

链接: https://arxiv.org/abs/2503.19371
作者: Daniel Saragih,Deyu Cao,Tejas Balaji,Ashwin Santhosh
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at the ICLR Workshop on Neural Network Weights as a New Data Modality 2025

点击查看摘要

Abstract:Foundational language models show a remarkable ability to learn new concepts during inference via context data. However, similar work for images lags behind. To address this challenge, we introduce FLoWN, a flow matching model that learns to generate neural network parameters for different tasks. Our approach models the flow on latent space, while conditioning the process on context data. Experiments verify that FLoWN attains various desiderata for a meta-learning model. In addition, it matches or exceeds baselines on in-distribution tasks, provides better initializations for classifier training, and is performant on out-of-distribution few-shot tasks while having a fine-tuning mechanism to improve performance.
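
条件流匹配(flow matching)的训练目标可以写成:沿直线插值路径学习速度场,使其回归恒定速度 x1 - x0。下面是在"展平的网络参数向量"上的最小示意(论文实际在潜空间建模,维度与上下文编码均为假设):

```python
import torch
import torch.nn as nn

param_dim, ctx_dim = 256, 32
# 速度场网络 v(x_t, ctx, t):输入插值点、任务上下文与时间
v_net = nn.Sequential(nn.Linear(param_dim + ctx_dim + 1, 512), nn.SiLU(),
                      nn.Linear(512, param_dim))
opt = torch.optim.Adam(v_net.parameters(), lr=1e-4)

def fm_loss(x1, ctx):
    x0 = torch.randn_like(x1)                     # 先验噪声
    t = torch.rand(x1.size(0), 1)
    x_t = (1 - t) * x0 + t * x1                   # 直线插值路径
    target = x1 - x0                              # 该路径的恒定速度
    v = v_net(torch.cat([x_t, ctx, t], dim=-1))
    return (v - target).pow(2).mean()

x1 = torch.randn(16, param_dim)   # "好参数"样本(随机数据示意)
ctx = torch.randn(16, ctx_dim)    # 任务上下文嵌入(示意)
loss = fm_loss(x1, ctx)
loss.backward(); opt.step()
```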

[AI-28] Efficient IoT Intrusion Detection with an Improved Attention-Based CNN-BiLSTM Architecture

链接: https://arxiv.org/abs/2503.19339
作者: Amna Naeem,Muazzam A. Khan,Nada Alasbali,Jawad Ahmad,Aizaz Ahmad Khattak,Muhammad Shahbaz Khan
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The ever-increasing security vulnerabilities in the Internet-of-Things (IoT) systems require improved threat detection approaches. This paper presents a compact and efficient approach to detect botnet attacks by employing an integrated approach that consists of traffic pattern analysis, temporal support learning, and focused feature extraction. The proposed attention-based model benefits from a hybrid CNN-BiLSTM architecture and achieves 99% classification accuracy in detecting botnet attacks utilizing the N-BaIoT dataset, while maintaining high precision and recall across various scenarios. The proposed model’s performance is further validated by key parameters, such as the Matthews Correlation Coefficient and Cohen’s kappa coefficient. The close-to-ideal results for these parameters demonstrate the proposed model’s ability to detect botnet attacks accurately and efficiently in practical settings and on unseen data. The proposed model proved to be a powerful defense mechanism for IoT networks to face emerging security challenges.
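
混合 CNN-BiLSTM 加注意力的结构骨架大致如下(层数、维度与 N-BaIoT 特征维数均为演示假设,非论文原始配置):

```python
import torch
import torch.nn as nn

class CNNBiLSTMAttn(nn.Module):
    """1D CNN 提特征 -> BiLSTM 建时序 -> 注意力汇聚 -> 分类(结构示意)。"""
    def __init__(self, in_feats, n_classes=2):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv1d(in_feats, 32, 3, padding=1),
                                  nn.ReLU(), nn.MaxPool1d(2))
        self.lstm = nn.LSTM(32, 64, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(128, 1)
        self.fc = nn.Linear(128, n_classes)
    def forward(self, x):            # x: (B, T, in_feats)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)   # (B, T/2, 32)
        h, _ = self.lstm(h)                                # (B, T/2, 128)
        w = torch.softmax(self.attn(h), dim=1)             # 时间步注意力权重
        ctx = (w * h).sum(dim=1)                           # 加权汇聚
        return self.fc(ctx)

model = CNNBiLSTMAttn(in_feats=115)   # N-BaIoT 数据集约有 115 维流量统计特征
logits = model(torch.randn(8, 20, 115))
```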

[AI-29] Process or Result? Manipulated Ending Tokens Can Mislead Reasoning LLMs to Ignore the Correct Reasoning Steps

链接: https://arxiv.org/abs/2503.19326
作者: Yu Cui,Bryan Hooi,Yujun Cai,Yiwei Wang
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent reasoning large language models (LLMs) have demonstrated remarkable improvements in mathematical reasoning capabilities through long Chain-of-Thought. The reasoning tokens of these models enable self-correction within reasoning chains, enhancing robustness. This motivates our exploration: how vulnerable are reasoning LLMs to subtle errors in their input reasoning chains? We introduce “Compromising Thought” (CPT), a vulnerability where models presented with reasoning tokens containing manipulated calculation results tend to ignore correct reasoning steps and adopt incorrect results instead. Through systematic evaluation across multiple reasoning LLMs, we design three increasingly explicit prompting methods to measure CPT resistance, revealing that models struggle significantly to identify and correct these manipulations. Notably, contrary to existing research suggesting structural alterations affect model performance more than content modifications, we find that local ending token manipulations have greater impact on reasoning outcomes than structural changes. Moreover, we discover a security vulnerability in DeepSeek-R1 where tampered reasoning tokens can trigger complete reasoning cessation. Our work enhances understanding of reasoning robustness and highlights security considerations for reasoning-intensive applications.

[AI-30] Observation Adaptation via Annealed Importance Resampling for Partially Observable Markov Decision Processes ICAPS2025

链接: https://arxiv.org/abs/2503.19302
作者: Yunuo Zhang,Baiting Luo,Ayan Mukhopadhyay,Abhishek Dubey
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Accepted as Oral Presentation to ICAPS 2025

点击查看摘要

Abstract:Partially observable Markov decision processes (POMDPs) are a general mathematical model for sequential decision-making in stochastic environments under state uncertainty. POMDPs are often solved online, which enables the algorithm to adapt to new information in real time. Online solvers typically use bootstrap particle filters based on importance resampling for updating the belief distribution. Since directly sampling from the ideal state distribution given the latest observation and previous state is infeasible, particle filters approximate the posterior belief distribution by propagating states and adjusting weights through prediction and resampling steps. However, in practice, the importance resampling technique often leads to particle degeneracy and sample impoverishment when the state transition model poorly aligns with the posterior belief distribution, especially when the received observation is highly informative. We propose an approach that constructs a sequence of bridge distributions between the state-transition and optimal distributions through iterative Monte Carlo steps, better accommodating noisy observations in online POMDP solvers. Our algorithm demonstrates significantly superior performance compared to state-of-the-art methods when evaluated across multiple challenging POMDP domains.
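
在转移分布与最优(后验)分布之间构造一串"桥分布",相当于把似然按回火系数逐步引入,并在必要时重采样。以下是这一思路的高度简化示意(扰动步代替了正式的 MCMC 移动,并非论文算法的忠实实现):

```python
import numpy as np

def annealed_update(particles, weights, obs_loglik, transition, n_bridges=5):
    """通过一串回火后验(桥分布)逐步引入高信息量观测(示意)。
    obs_loglik(x) 返回 log g(o|x);transition(x) 从状态转移模型采样。"""
    x = transition(particles)                    # 先按转移模型预测
    betas = np.linspace(0.0, 1.0, n_bridges + 1)
    w = weights.copy()
    for b0, b1 in zip(betas[:-1], betas[1:]):
        w = w * np.exp((b1 - b0) * obs_loglik(x))   # 按回火增量引入似然
        w = w / w.sum()
        ess = 1.0 / np.sum(w ** 2)
        if ess < 0.5 * len(x):                   # 有效样本数过低则重采样
            idx = np.random.choice(len(x), size=len(x), p=w)
            x, w = x[idx], np.full(len(x), 1.0 / len(x))
        x = x + 0.05 * np.random.randn(*x.shape)  # 用简单抖动代替 MCMC 移动步
    return x, w

# 玩具用法:一维高斯状态、观测模型很尖锐(高信息量观测)
parts = np.random.randn(500)
obs = 2.5
x, w = annealed_update(parts, np.full(500, 1 / 500),
                       obs_loglik=lambda x: -0.5 * ((x - obs) / 0.1) ** 2,
                       transition=lambda x: x + 0.1 * np.random.randn(x.size))
```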

[AI-31] No Black Box Anymore: Demystifying Clinical Predictive Modeling with Temporal-Feature Cross Attention Mechanism

链接: https://arxiv.org/abs/2503.19285
作者: Yubo Li,Xinyu Yao,Rema Padman
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 10 pages, 3 figures, submitted to AMIA 2025

点击查看摘要

Abstract:Despite the outstanding performance of deep learning models in clinical prediction tasks, explainability remains a significant challenge. Inspired by transformer architectures, we introduce the Temporal-Feature Cross Attention Mechanism (TFCAM), a novel deep learning framework designed to capture dynamic interactions among clinical features across time, enhancing both predictive accuracy and interpretability. In an experiment with 1,422 patients with Chronic Kidney Disease, predicting progression to End-Stage Renal Disease, TFCAM outperformed LSTM and RETAIN baselines, achieving an AUROC of 0.95 and an F1-score of 0.69. Beyond performance gains, TFCAM provides multi-level explainability by identifying critical temporal periods, ranking feature importance, and quantifying how features influence each other across time before affecting predictions. Our approach addresses the “black box” limitations of deep learning in healthcare, offering clinicians transparent insights into disease progression mechanisms while maintaining state-of-the-art predictive performance.
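
"时间-特征交叉注意力"可以理解为先沿时间轴、再沿特征轴做注意力,并把特征间注意力权重用作解释。下面是一个结构示意(并非 TFCAM 的确切实现):

```python
import torch
import torch.nn as nn

class TemporalFeatureCrossAttn(nn.Module):
    """先沿时间、再沿特征两个轴做注意力的简化示意(非论文确切结构)。"""
    def __init__(self, d_model=32):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.feat_attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
    def forward(self, x):               # x: (B, T, F, d_model)
        B, T, F_, D = x.shape
        h = x.permute(0, 2, 1, 3).reshape(B * F_, T, D)
        h, _ = self.time_attn(h, h, h)  # 每个特征沿时间交互
        h = h.reshape(B, F_, T, D).permute(0, 2, 1, 3).reshape(B * T, F_, D)
        h, fw = self.feat_attn(h, h, h, need_weights=True)  # 每个时间步内特征间交互
        # fw 形状 (B*T, F, F):可视为特征间相互影响强度,用于可解释性分析
        return h.reshape(B, T, F_, D), fw.reshape(B, T, F_, F_)

m = TemporalFeatureCrossAttn()
out, attn = m(torch.randn(2, 10, 6, 32))   # 2 个病人、10 个时间步、6 个临床特征
```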

[AI-32] CubeRobot: Grounding Language in Rubik's Cube Manipulation via Vision-Language Model

链接: https://arxiv.org/abs/2503.19281
作者: Feiyang Wang,Xiaomin Yu,Wangyu Wu
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Proving Rubik’s Cube theorems at the high level represents a notable milestone in human-level spatial imagination and logical thinking and reasoning. Traditional Rubik’s Cube robots, relying on complex vision systems and fixed algorithms, often struggle to adapt to complex and dynamic scenarios. To overcome this limitation, we introduce CubeRobot, a novel vision-language model (VLM) tailored for solving 3x3 Rubik’s Cubes, empowering embodied agents with multimodal understanding and execution capabilities. We used the CubeCoT image dataset, which contains multiple-level tasks (43 subtasks in total) that humans are unable to handle, encompassing various cube states. We incorporate a dual-loop VisionCoT architecture and Memory Stream, a paradigm for extracting task-related features from VLM-generated planning queries, thus enabling CubeRobot to plan independently, make decisions, reflect, and manage high- and low-level Rubik’s Cube tasks separately. Furthermore, in low-level Rubik’s Cube restoration tasks, CubeRobot achieved an accuracy rate of 100%, matched by 100% in medium-level tasks, and achieved an accuracy rate of 80% in high-level tasks.

[AI-33] LogicLearner: A Tool for the Guided Practice of Propositional Logic Proofs

链接: https://arxiv.org/abs/2503.19280
作者: Amogh Inamdar,Uzay Macar,Michel Vazirani,Michael Tarnow,Zarina Mustapha,Natalia Dittren,Sam Sadeh,Nakul Verma,Ansaf Salleb-Aouissi
类目: Discrete Mathematics (cs.DM); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 32 pages, 27 figures, open-source codebase linked in paper

点击查看摘要

Abstract:The study of propositional logic – fundamental to the theory of computing – is a cornerstone of the undergraduate computer science curriculum. Learning to solve logical proofs requires repeated guided practice, but undergraduate students often lack access to on-demand tutoring in a judgment-free environment. In this work, we highlight the need for guided practice tools in undergraduate mathematics education and outline the desiderata of an effective practice tool. We accordingly develop LogicLearner, a web application for guided logic proof practice. LogicLearner consists of an interface to attempt logic proofs step-by-step and an automated proof solver to generate solutions on the fly, allowing users to request guidance as needed. We pilot LogicLearner as a practice tool in two semesters of an undergraduate discrete mathematics course and receive strongly positive feedback for usability and pedagogical value in student surveys. To the best of our knowledge, LogicLearner is the only learning tool that provides an end-to-end practice environment for logic proofs with immediate, judgment-free feedback.
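
To make the notion of a checkable proof step concrete, here is a toy sketch of verifying a single modus ponens application. It is vastly simpler than LogicLearner's solver, and the tuple encoding of formulas is an assumption made purely for illustration.

```python
# Formulas are nested tuples, e.g. ('->', 'p', 'q') for p -> q (assumed encoding).
def modus_ponens(premises, conclusion):
    """Return True if `conclusion` follows from `premises` by modus ponens."""
    for a in premises:
        for b in premises:
            if isinstance(b, tuple) and b[0] == '->' and b[1] == a and b[2] == conclusion:
                return True
    return False

# From p and p -> q, conclude q.
assert modus_ponens({'p', ('->', 'p', 'q')}, 'q')
```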

[AI-34] NeoRL-2: Near Real-World Benchmarks for Offline Reinforcement Learning with Extended Realistic Scenarios

链接: https://arxiv.org/abs/2503.19267
作者: Songyi Gao,Zuolin Tu,Rong-Jun Qin,Yi-Hao Sun,Xiong-Hui Chen,Yang Yu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Offline reinforcement learning (RL) aims to learn from historical data without requiring (costly) access to the environment. To facilitate offline RL research, we previously introduced NeoRL, which highlighted that datasets from real-world tasks are often conservative and limited. With years of experience applying offline RL to various domains, we have identified additional real-world challenges. These include extremely conservative data distributions produced by deployed control systems, delayed action effects caused by high-latency transitions, external factors arising from the uncontrollable variance of transitions, and global safety constraints that are difficult to evaluate during the decision-making process. These challenges are underrepresented in previous benchmarks but frequently occur in real-world tasks. To address this, we constructed the extended Near Real-World Offline RL Benchmark (NeoRL-2), which consists of 7 datasets from 7 simulated tasks along with their corresponding evaluation simulators. Benchmarking results from state-of-the-art offline RL approaches demonstrate that current methods often struggle to outperform the data-collection behavior policy, highlighting the need for more effective methods. We hope NeoRL-2 will accelerate the development of reinforcement learning algorithms for real-world applications. The benchmark project page is available at this https URL.

[AI-35] LLM Benchmarking with LLaMA2: Evaluating Code Development Performance Across Multiple Programming Languages

链接: https://arxiv.org/abs/2503.19217
作者: Patrick Diehl,Nojoud Nader,Maxim Moraru,Steven R. Brandt
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rapid evolution of large language models (LLMs) has opened new possibilities for automating various tasks in software development. This paper evaluates the capabilities of the Llama 2-70B model in automating these tasks for scientific applications written in commonly used programming languages. Using representative test problems, we assess the model’s capacity to generate code, documentation, and unit tests, as well as its ability to translate existing code between commonly used programming languages. Our comprehensive analysis evaluates the compilation, runtime behavior, and correctness of the generated and translated code. Additionally, we assess the quality of automatically generated code, documentation and unit tests. Our results indicate that while Llama 2-70B frequently generates syntactically correct and functional code for simpler numerical tasks, it encounters substantial difficulties with more complex, parallelized, or distributed computations, requiring considerable manual corrections. We identify key limitations and suggest areas for future improvements to better leverage AI-driven automation in scientific computing workflows.

[AI-36] Continual Reinforcement Learning for HVAC Systems Control: Integrating Hypernetworks and Transfer Learning

链接: https://arxiv.org/abs/2503.19212
作者: Gautham Udayakumar Bekal,Ahmed Ghareeb,Ashish Pujari
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Buildings with Heating, Ventilation, and Air Conditioning (HVAC) systems play a crucial role in ensuring indoor comfort and efficiency. While traditionally governed by physics-based models, the emergence of big data has enabled data-driven methods like Deep Reinforcement Learning (DRL). However, Reinforcement Learning (RL)-based techniques often suffer from sample inefficiency and limited generalization, especially across varying HVAC systems. We introduce a model-based reinforcement learning framework that uses a Hypernetwork to continuously learn environment dynamics across tasks with different action spaces. This enables efficient synthetic rollout generation and improved sample usage. Our approach demonstrates strong backward transfer in a continual learning setting: after training on a second task, minimal fine-tuning on the first task allows rapid convergence within just 5 episodes, outperforming Model-Free Reinforcement Learning (MFRL) and effectively mitigating catastrophic forgetting. These findings have significant implications for reducing energy consumption and operational costs in building management, thus supporting global sustainability goals. Keywords: Deep Reinforcement Learning, HVAC Systems Control, Hypernetworks, Transfer and Continual Learning, Catastrophic Forgetting
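
As a sketch of the core idea, a hypernetwork maps a task embedding to the weights of a dynamics model, so one network can serve tasks with different dynamics. The dimensions, the single-layer dynamics model, and the unbatched task embedding below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class HyperDynamics(nn.Module):
    """Hypernetwork sketch: a task embedding generates the weights of a
    one-layer dynamics model used for synthetic rollouts. Dims are assumed."""
    def __init__(self, task_dim=8, state_dim=6, action_dim=2, hidden=64):
        super().__init__()
        self.in_dim = state_dim + action_dim
        self.out_dim = state_dim
        # Hypernetwork: task embedding -> flattened weight matrix and bias.
        self.hyper = nn.Sequential(
            nn.Linear(task_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, self.in_dim * self.out_dim + self.out_dim),
        )

    def forward(self, task_emb, state, action):
        # task_emb: 1-D tensor (task_dim,) describing the HVAC task.
        params = self.hyper(task_emb)
        W = params[: self.in_dim * self.out_dim].view(self.out_dim, self.in_dim)
        b = params[self.in_dim * self.out_dim:]
        sa = torch.cat([state, action], dim=-1)
        return sa @ W.T + b  # predicted next state

emb, s, a = torch.randn(8), torch.randn(6), torch.randn(2)
next_s = HyperDynamics()(emb, s, a)
```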

[AI-37] A Shared Low-Rank Adaptation Approach to Personalized RLHF AISTATS2025

链接: https://arxiv.org/abs/2503.19201
作者: Renpu Liu,Peng Wang,Donghao Li,Cong Shen,Jing Yang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Published as a conference paper at AISTATS 2025

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal technique for aligning artificial intelligence systems with human values, achieving remarkable success in fine-tuning large language models. However, existing RLHF frameworks often assume that human preferences are relatively homogeneous and can be captured by a single, unified reward model. This assumption overlooks the inherent diversity and heterogeneity across individuals, limiting the adaptability of RLHF to personalized scenarios and risking misalignments that can diminish user satisfaction and trust in AI systems. In this paper, we address these challenges by introducing Low-Rank Adaptation (LoRA) into the personalized RLHF framework. We apply LoRA in the aggregated parameter space of all personalized reward functions, thereby enabling efficient learning of personalized reward models from potentially limited local datasets. Our approach exploits potential shared structures among the local ground-truth reward models while allowing for individual adaptation, without relying on restrictive assumptions about shared representations as in prior works. We further establish sample complexity guarantees for our method. Theoretical analysis demonstrates the effectiveness of the proposed approach in capturing both shared and individual-specific structures within heterogeneous human preferences, addressing the dual challenge of personalization requirements and practical data constraints. Experimental results on real-world datasets corroborate the efficiency of our algorithm in the personalized RLHF setting.
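
For readers unfamiliar with LoRA, the snippet below sketches the basic parameterization: a frozen base weight plus a trainable low-rank update. It only illustrates the mechanism; the paper's contribution of sharing low-rank structure across personalized reward models is not reproduced here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: frozen base weight plus a low-rank update A @ B.
    Rank and scaling are conventional defaults, not the paper's settings."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # shared base weights stay frozen
        self.A = nn.Parameter(torch.zeros(base.out_features, rank))
        self.B = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.scale = alpha / rank

    def forward(self, x):
        # Base projection plus the scaled low-rank correction.
        return self.base(x) + self.scale * (x @ self.B.T @ self.A.T)

layer = LoRALinear(nn.Linear(128, 64))
y = layer(torch.randn(10, 128))  # only A and B receive gradients
```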

[AI-38] Mining-Gym: A Configurable RL Benchmarking Environment for Truck Dispatch Scheduling

链接: https://arxiv.org/abs/2503.19195
作者: Chayan Banerjee,Kien Nguyen,Clinton Fookes
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: 11 pages, 8 figures

点击查看摘要

Abstract:Mining process optimization, particularly truck dispatch scheduling, is a critical factor in enhancing the efficiency of open-pit mining operations. However, the dynamic and stochastic nature of mining environments, characterized by uncertainties such as equipment failures, truck maintenance, and variable haul cycle times, poses significant challenges for traditional optimization methods. While Reinforcement Learning (RL) has shown promise in adaptive decision-making for mining logistics, its practical deployment requires rigorous evaluation in realistic and customizable simulation environments. The lack of standardized benchmarking environments limits fair algorithm comparisons, reproducibility, and the real-world applicability of RL-based approaches in open-pit mining settings. To address this challenge, we introduce Mining-Gym, a configurable, open-source benchmarking environment designed for training, testing, and comparing RL algorithms in mining process optimization. Built on Discrete Event Simulation (DES) and seamlessly integrated with the OpenAI Gym interface, Mining-Gym provides a structured testbed that enables the direct application of advanced RL algorithms from Stable Baselines. The framework models key mining-specific uncertainties, such as equipment failures, queue congestion, and the stochasticity of mining processes, ensuring a realistic and adaptive learning environment. Additionally, Mining-Gym features a graphical user interface (GUI) for intuitive mine-site configuration, a comprehensive data logging system, a built-in KPI dashboard, and real-time visual representation of the mine site. These capabilities facilitate standardized, reproducible evaluations across multiple RL strategies and baseline heuristics.
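
To suggest what a Gym-compatible dispatch environment looks like, here is a toy skeleton using the classic Gym API. The state, dynamics, and reward are invented placeholders and do not reflect Mining-Gym's actual interface or its DES backend.

```python
import gym
import numpy as np
from gym import spaces

class ToyDispatchEnv(gym.Env):
    """Toy truck-dispatch skeleton: the agent picks a shovel for each truck.
    All quantities here are illustrative, not Mining-Gym's model."""
    def __init__(self, n_shovels=3):
        super().__init__()
        self.action_space = spaces.Discrete(n_shovels)  # which shovel to dispatch to
        self.observation_space = spaces.Box(0.0, 1.0, shape=(n_shovels,))
        self.queues = np.zeros(n_shovels, dtype=np.float32)  # normalized queue lengths

    def reset(self):
        self.queues = np.random.rand(self.action_space.n).astype(np.float32)
        return self.queues.copy()

    def step(self, action):
        reward = -float(self.queues[action])  # penalize joining long queues
        self.queues[action] = min(1.0, self.queues[action] + 0.1)
        self.queues = np.clip(self.queues - 0.05, 0.0, 1.0)  # queues drain over time
        return self.queues.copy(), reward, False, {}
```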

[AI-39] SoK: How Robust is Audio Watermarking in Generative AI models?

链接: https://arxiv.org/abs/2503.19176
作者: Yizhu Wen,Ashwin Innuganti,Aaron Bien Ramos,Hanqing Guo,Qiben Yan
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Audio watermarking is increasingly used to verify the provenance of AI-generated content, enabling applications such as detecting AI-generated speech, protecting music IP, and defending against voice cloning. To be effective, audio watermarks must resist removal attacks that distort signals to evade detection. While many schemes claim robustness, these claims are typically tested in isolation and against a limited set of attacks. A systematic evaluation against diverse removal attacks is lacking, hindering practical deployment. In this paper, we investigate whether recent watermarking schemes that claim robustness can withstand a broad range of removal attacks. First, we introduce a taxonomy covering 22 audio watermarking schemes. Next, we summarize their underlying technologies and potential vulnerabilities. We then present a large-scale empirical study to assess their robustness. To support this, we build an evaluation framework encompassing 22 types of removal attacks (109 configurations) including signal-level, physical-level, and AI-induced distortions. We reproduce 9 watermarking schemes using open-source code, identify 8 new highly effective attacks, and highlight 11 key findings that expose the fundamental limitations of these methods across 3 public datasets. Our results reveal that none of the surveyed schemes can withstand all tested distortions. This evaluation offers a comprehensive view of how current watermarking methods perform under real-world threats. Our demo and code are available at this https URL.

[AI-40] AssertionForge: Enhancing Formal Verification Assertion Generation with Structured Representation of Specifications and RTL

链接: https://arxiv.org/abs/2503.19174
作者: Yunsheng Bai,Ghaith Bany Hamad,Syed Suhaib,Haoxing Ren
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Generating SystemVerilog Assertions (SVAs) from natural language specifications remains a major challenge in formal verification (FV) due to the inherent ambiguity and incompleteness of specifications. Existing LLM-based approaches, such as AssertLLM, focus on extracting information solely from specification documents, often failing to capture essential internal signal interactions and design details present in the RTL code, leading to incomplete or incorrect assertions. We propose a novel approach that constructs a Knowledge Graph (KG) from both specifications and RTL, using a hardware-specific schema with domain-specific entity and relation types. We create an initial KG from the specification and then systematically fuse it with information extracted from the RTL code, resulting in a unified, comprehensive KG. This combined representation enables a more thorough understanding of the design and allows for a multi-resolution context synthesis process which is designed to extract diverse verification contexts from the KG. Experiments on four designs demonstrate that our method significantly enhances SVA quality over prior methods. This structured representation not only improves FV but also paves the way for future research in tasks like code generation and design understanding.

[AI-41] Information-Seeking Decision Strategies Mitigate Risk in Dynamic Uncertain Environments

链接: https://arxiv.org/abs/2503.19107
作者: Nicholas W. Barendregt,Joshua I. Gold,Krešimir Josić,Zachary P. Kilpatrick
类目: Artificial Intelligence (cs.AI); Probability (math.PR); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:To survive in dynamic and uncertain environments, individuals must develop effective decision strategies that balance information gathering and decision commitment. Models of such strategies often prioritize either optimizing tangible payoffs, like reward rate, or gathering information to support a diversity of (possibly unknown) objectives. However, our understanding of the relative merits of these two approaches remains incomplete, in part because direct comparisons have been limited to idealized, static environments that lack the dynamic complexity of the real world. Here we compared the performance of normative reward- and information-seeking strategies in a dynamic foraging task. Both strategies show similar transitions between exploratory and exploitative behaviors as environmental uncertainty changes. However, we find subtle disparities in the actions they take, resulting in meaningful performance differences: whereas reward-seeking strategies generate slightly more reward on average, information-seeking strategies provide more consistent and predictable outcomes. Our findings support the adaptive value of information-seeking behaviors that can mitigate risk with minimal reward loss.

[AI-42] The Case for “Thick Evaluations” of Cultural Representation in AI

链接: https://arxiv.org/abs/2503.19075
作者: Rida Qadri,Mark Diaz,Ding Wang,Michael Madaio
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 14 pages

点击查看摘要

Abstract:Generative AI image models have been increasingly evaluated for their (in)ability to represent non-Western cultures. We argue that these evaluations operate through reductive ideals of representation, abstracted from how people define their own representation and neglecting the inherently interpretive and contextual nature of cultural representation. In contrast to these ‘thin’ evaluations, we introduce the idea of ‘thick evaluations’: a more granular, situated, and discursive measurement framework for evaluating representations of social worlds in AI images, steeped in communities’ own understandings of representation. We develop this evaluation framework through workshops in South Asia, by studying the ‘thick’ ways in which people interpret and assign meaning to images of their own cultures. We introduce practices for thicker evaluations of representation that expand the understanding of representation underpinning AI evaluations and, by co-constructing metrics with communities, bring measurement in line with the experiences of communities on the ground.

[AI-43] HingeRLC-GAN: Combating Mode Collapse with Hinge Loss and RLC Regularization

链接: https://arxiv.org/abs/2503.19074
作者: Osman Goni,Himadri Saha Arka,Mithun Halder,Mir Moynuddin Ahmed Shibly,Swakkhar Shatabda
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advances in Generative Adversarial Networks (GANs) have demonstrated their capability for producing high-quality images. However, mode collapse remains a significant challenge: it occurs when the generator produces a limited number of data patterns that do not reflect the diversity of the training dataset. This study addresses this issue by proposing a number of architectural changes aimed at increasing the diversity and stability of GAN models. We start by improving the loss function with Wasserstein loss and Gradient Penalty to better capture the full range of data variations. We also investigate various network architectures and conclude that ResNet significantly contributes to increased diversity. Building on these findings, we introduce HingeRLC-GAN, a novel approach that combines RLC Regularization and the Hinge loss function. With an FID score of 18 and a KID score of 0.001, our approach outperforms existing methods by effectively balancing training stability and increased diversity.
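
The hinge loss component is standard and compact enough to state directly. The sketch below gives the usual hinge formulation for GANs; the RLC regularization that the paper combines it with is not reproduced here.

```python
import torch
import torch.nn.functional as F

def d_hinge_loss(real_logits, fake_logits):
    """Discriminator hinge loss: E[max(0, 1 - D(x))] + E[max(0, 1 + D(G(z)))].
    A standard formulation, not HingeRLC-GAN's full objective."""
    return F.relu(1.0 - real_logits).mean() + F.relu(1.0 + fake_logits).mean()

def g_hinge_loss(fake_logits):
    """Generator hinge loss: -E[D(G(z))]."""
    return -fake_logits.mean()
```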

[AI-44] Graph-Level Label-Only Membership Inference Attack against Graph Neural Networks

链接: https://arxiv.org/abs/2503.19070
作者: Jiazhu Dai,Yubing Lu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) are widely used for graph-structured data but are vulnerable to membership inference attacks (MIAs) in graph classification tasks, which determine if a graph was part of the training dataset, potentially causing data leakage. Existing MIAs rely on prediction probability vectors, but they become ineffective when only prediction labels are available. We propose a Graph-level Label-Only Membership Inference Attack (GLO-MIA), which is based on the intuition that the target model’s predictions on training data are more stable than those on testing data. GLO-MIA generates a set of perturbed graphs for the target graph by adding perturbations to its effective features and queries the target model with the perturbed graphs to obtain their prediction labels, which are then used to calculate a robustness score for the target graph. Finally, by comparing the robustness score with a predefined threshold, the membership of the target graph can be inferred correctly with high probability. Our evaluation on three datasets and four GNN models shows that GLO-MIA achieves an attack accuracy of up to 0.825, outperforming baseline work by 8.5% and closely matching the performance of probability-based MIAs, even with only prediction labels.
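
The label-only intuition can be sketched compactly: perturb the target graph, query only predicted labels, and measure label stability. The callables and the fixed threshold below are simplifications assumed for illustration, not the paper's scoring details.

```python
import numpy as np

def robustness_score(model_predict, graph, perturb, n_perturb=50, threshold=0.8):
    """Label-only membership sketch: fraction of perturbed copies whose
    predicted label matches the unperturbed one. `model_predict` and
    `perturb` are placeholder callables."""
    base_label = model_predict(graph)
    perturbed_labels = [model_predict(perturb(graph)) for _ in range(n_perturb)]
    score = np.mean([lab == base_label for lab in perturbed_labels])
    # Training members are expected to yield more stable (higher) scores.
    return score, score >= threshold  # True -> inferred as a training member
```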

[AI-45] Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization EUROSYS2025

链接: https://arxiv.org/abs/2503.19050
作者: Zhanda Zhu,Christina Giannoula,Muralidhar Andoorveedu,Qidong Su,Karttikeya Mangalam,Bojian Zheng,Gennady Pekhimenko
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注: Accepted by EuroSys 2025

点击查看摘要

Abstract:Various parallelism, such as data, tensor, and pipeline parallelism, along with memory optimizations like activation checkpointing, redundancy elimination, and offloading, have been proposed to accelerate distributed training for Large Language Models. To find the best combination of these techniques, automatic distributed training systems are proposed. However, existing systems only tune a subset of optimizations, due to the lack of overlap awareness, inability to navigate the vast search space, and ignoring the inter-microbatch imbalance, leading to sub-optimal performance. To address these shortcomings, we propose Mist, a memory, overlap, and imbalance-aware automatic distributed training system that comprehensively co-optimizes all memory footprint reduction techniques alongside parallelism. Mist is based on three key ideas: (1) fine-grained overlap-centric scheduling, orchestrating optimizations in an overlapped manner, (2) symbolic-based performance analysis that predicts runtime and memory usage using symbolic expressions for fast tuning, and (3) imbalance-aware hierarchical tuning, decoupling the process into an inter-stage imbalance and overlap aware Mixed Integer Linear Programming problem and an intra-stage Dual-Objective Constrained Optimization problem, and connecting them through Pareto frontier sampling. Our evaluation results show that Mist achieves an average of 1.28× (up to 1.73×) and 1.27× (up to 2.04×) speedup compared to state-of-the-art manual system Megatron-LM and state-of-the-art automatic system Aceso, respectively.

[AI-46] Evolutionary Policy Optimization

链接: https://arxiv.org/abs/2503.19037
作者: Jianren Wang,Yifan Su,Abhinav Gupta,Deepak Pathak
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Website at this https URL

点击查看摘要

Abstract:Despite its extreme sample inefficiency, on-policy reinforcement learning has become a fundamental tool in real-world applications. With recent advances in GPU-driven simulation, the ability to collect vast amounts of data for RL training has scaled exponentially. However, studies show that current on-policy methods, such as PPO, fail to fully leverage the benefits of parallelized environments, leading to performance saturation beyond a certain scale. In contrast, Evolutionary Algorithms (EAs) excel at increasing diversity through randomization, making them a natural complement to RL. However, existing EvoRL methods have struggled to gain widespread adoption due to their extreme sample inefficiency. To address these challenges, we introduce Evolutionary Policy Optimization (EPO), a novel policy gradient algorithm that combines the strengths of EA and policy gradients. We show that EPO significantly improves performance across diverse and challenging environments, demonstrating superior scalability with parallelized simulations.

[AI-47] Option Discovery Using LLM-guided Semantic Hierarchical Reinforcement Learning

链接: https://arxiv.org/abs/2503.19007
作者: Chak Lam Shek,Pratap Tokekar
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable promise in reasoning and decision-making, yet their integration with Reinforcement Learning (RL) for complex robotic tasks remains underexplored. In this paper, we propose an LLM-guided hierarchical RL framework, termed LDSC, that leverages LLM-driven subgoal selection and option reuse to enhance sample efficiency, generalization, and multi-task adaptability. Traditional RL methods often suffer from inefficient exploration and high computational cost. Hierarchical RL helps with these challenges, but existing methods often fail to reuse options effectively when faced with new tasks. To address these limitations, we introduce a three-stage framework that uses LLMs for subgoal generation given a natural language description of the task, a reusable option learning and selection method, and an action-level policy, enabling more effective decision-making across diverse tasks. By incorporating LLMs for subgoal prediction and policy guidance, our approach improves exploration efficiency and enhances learning performance. On average, LDSC outperforms the baseline by 55.9% in average reward, demonstrating its effectiveness in complex RL settings. More details and experiment videos can be found at this https URL.

[AI-48] Computational Thinking with Computer Vision: Developing AI Competency in an Introductory Computer Science Course AAAI2025

链接: https://arxiv.org/abs/2503.19006
作者: Tahiya Chowdhury
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 7 pages, 3 figures, 3 tables, Proceedings of AAAI 2025

点击查看摘要

Abstract:Developing competency in artificial intelligence is becoming increasingly crucial for computer science (CS) students at all levels of the CS curriculum. However, most previous research focuses on advanced CS courses, as traditional introductory courses provide limited opportunities to develop AI skills and knowledge. This paper introduces an introductory CS course where students learn computational thinking through computer vision, a sub-field of AI, as an application context. The course aims to achieve computational thinking outcomes alongside critical thinking outcomes that expose students to AI approaches and their societal implications. Through experiential activities such as individual projects and reading discussions, our course seeks to balance technical learning and critical thinking goals. Our evaluation, based on pre- and post-course surveys, shows an improved sense of belonging, self-efficacy, and AI ethics awareness among students. The results suggest that an AI-focused context can enhance participation and employability, student-selected projects support self-efficacy, and ethically grounded AI instruction can be effective for interdisciplinary audiences. Students’ discussions on reading assignments demonstrated deep engagement with the complex challenges in today’s AI landscape. Finally, we share insights on scaling such courses for larger cohorts and improving the learning experience for introductory CS students.

[AI-49] Enhanced prediction of spine surgery outcomes using advanced machine learning techniques and oversampling methods

链接: https://arxiv.org/abs/2503.18996
作者: José Alberto Benítez-Andrades,Camino Prada-García,Nicolás Ordás-Reyes,Marta Esteban Blanco,Alicia Merayo,Antonio Serrano-García
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The study proposes an advanced machine learning approach to predict spine surgery outcomes by incorporating oversampling techniques and grid search optimization. A variety of models including GaussianNB, ComplementNB, KNN, Decision Tree, and optimized versions with RandomOverSampler and SMOTE were tested on a dataset of 244 patients, which included pre-surgical, psychometric, socioeconomic, and analytical variables. The enhanced KNN models achieved up to 76% accuracy and a 67% F1-score, while grid-search optimization further improved performance. The findings underscore the potential of these advanced techniques to aid healthcare professionals in decision-making, with future research needed to refine these models on larger and more diverse datasets.
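
A pipeline of this kind is straightforward to express with scikit-learn and imbalanced-learn. The sketch below combines SMOTE oversampling with a KNN grid search; the data placeholders and the parameter grid are assumptions that may differ from the paper's setup.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# X, y stand in for the pre-surgical, psychometric, socioeconomic, and
# analytical features and surgery outcomes described in the abstract.
pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),  # oversample the minority class
    ("knn", KNeighborsClassifier()),
])
grid = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": [3, 5, 7, 9],
                "knn__weights": ["uniform", "distance"]},
    scoring="f1",
    cv=5,
)
# grid.fit(X, y)  # best_params_ / best_score_ then summarize the search
```

Using imblearn's `Pipeline` (rather than sklearn's) ensures SMOTE is applied only to the training folds during cross-validation, avoiding leakage into validation data.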

[AI-50] LLMs in the Classroom: Outcomes and Perceptions of Questions Written with the Aid of AI AAAI2025

链接: https://arxiv.org/abs/2503.18995
作者: Gavin Witsken,Igor Crk,Eren Gultepe
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: Accepted to AAAI 2025 Technical Track on AI Alignment

点击查看摘要

Abstract:We randomly deploy questions constructed with and without the use of the LLM tool and gauge the ability of the students to correctly answer, as well as their ability to correctly perceive the difference between human-authored and LLM-authored questions. In determining whether the questions written with the aid of ChatGPT were consistent with the instructor’s questions and source text, we computed representative vectors of both the human and ChatGPT questions using SBERT and compared cosine similarity to the course textbook. A non-significant Mann-Whitney U test (z = 1.018, p = .309) suggests that students were unable to perceive whether questions were written with or without the aid of ChatGPT. However, student scores on LLM-authored questions were almost 9% lower (z = 2.702, p < .01). This result may indicate that either the AI questions were more difficult or that the students were more familiar with the instructor’s style of questions. Overall, the study suggests that while there is potential for using LLM tools to aid in the construction of assessments, care must be taken to ensure that the questions are fair, well-composed, and relevant to the course material.
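
The similarity computation described can be sketched with the sentence-transformers library. The specific model name below is an assumption, since the paper only states that SBERT was used.

```python
from sentence_transformers import SentenceTransformer, util

# Model choice is an assumption; any SBERT checkpoint would illustrate the idea.
model = SentenceTransformer("all-MiniLM-L6-v2")

def similarity_to_text(questions, reference_text):
    """Mean cosine similarity between question embeddings and a reference
    embedding (e.g., course textbook text), sketching the comparison."""
    q_emb = model.encode(questions, convert_to_tensor=True)
    ref_emb = model.encode(reference_text, convert_to_tensor=True)
    return util.cos_sim(q_emb, ref_emb).mean().item()

# similarity_to_text(["What is a binary tree?"], "Chapter 4: Trees ...")
```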

[AI-51] HH4AI: A methodological Framework for AI Human Rights impact assessment under the EU AI Act

链接: https://arxiv.org/abs/2503.18994
作者: Paolo Ceravolo,Ernesto Damiani,Maria Elisa D’Amico,Bianca de Teffe Erb,Simone Favaro,Nannerel Fiano,Paolo Gambatesa,Simone La Porta,Samira Maghool,Lara Mauri,Niccolo Panigada,Lorenzo Maria Ratto Vaquer,Marta A. Tamborini
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 19 pages, 7 figures, 1 table

点击查看摘要

Abstract:This paper introduces the HH4AI Methodology, a structured approach to assessing the impact of AI systems on human rights, focusing on compliance with the EU AI Act and addressing technical, ethical, and regulatory challenges. The paper highlights AI’s transformative nature, driven by autonomy, data, and goal-oriented design, and how the EU AI Act promotes transparency, accountability, and safety. A key challenge is defining and assessing “high-risk” AI systems across industries, complicated by the lack of universally accepted standards and AI’s rapid evolution. To address these challenges, the paper explores the relevance of ISO/IEC and IEEE standards, focusing on risk management, data quality, bias mitigation, and governance. It proposes a Fundamental Rights Impact Assessment (FRIA) methodology, a gate-based framework designed to isolate and assess risks through phases including an AI system overview, a human rights checklist, an impact assessment, and a final output phase. A filtering mechanism tailors the assessment to the system’s characteristics, targeting areas like accountability, AI literacy, data governance, and transparency. The paper illustrates the FRIA methodology through a fictional case study of an automated healthcare triage service. The structured approach enables systematic filtering, comprehensive risk assessment, and mitigation planning, effectively prioritizing critical risks and providing clear remediation strategies. This promotes better alignment with human rights principles and enhances regulatory compliance.

[AI-52] Balanced Direction from Multifarious Choices: Arithmetic Meta-Learning for Domain Generalization

链接: https://arxiv.org/abs/2503.18987
作者: Xiran Wang,Jian Zhang,Lei Qi,Yinghuan Shi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Domain generalization is proposed to address distribution shift, arising from statistical disparities between training source and unseen target domains. The widely used first-order meta-learning algorithms demonstrate strong performance for domain generalization by leveraging the gradient matching theory, which aims to establish balanced parameters across source domains to reduce overfitting to any particular domain. However, our analysis reveals that there are actually numerous directions to achieve gradient matching, with current methods representing just one possible path. These methods overlook another critical factor: the balanced parameters should also be close to the centroid of the optimal parameters of each source domain. To address this, we propose a simple yet effective arithmetic meta-learning method with arithmetic-weighted gradients. This approach, while adhering to the principles of gradient matching, promotes a more precise balance by estimating the centroid between domain-specific optimal parameters. Experimental results validate the effectiveness of our strategy.

[AI-53] SplitFrozen: Split Learning with Device-side Model Frozen for Fine-Tuning LLM on Heterogeneous Resource-Constrained Devices

链接: https://arxiv.org/abs/2503.18986
作者: Jian Ma,Xinchen Lyu,Jun Jiang,Qimei Cui,Haipeng Yao,Xiaofeng Tao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Fine-tuning large language models (LLMs) on private, on-device data can empower tailored personalized AI agents. However, fine-tuning LLMs on resource-constrained edge devices faces significant challenges, including excessive computation overhead, device heterogeneity, and data imbalance. This paper proposes SplitFrozen, a split learning framework that enables efficient LLM fine-tuning by strategically freezing device-side model layers while centralizing parameter-efficient fine-tuning on the server. Our framework partitions LLMs into device-side frozen layers and server-side fine-tuning layers, where heterogeneous resource-constrained devices execute only forward propagation. To minimize server-side training costs, we integrate Low-Rank Adaptation (LoRA) into the server-side layers. A pipeline parallelism strategy further optimizes training efficiency by decoupling device-server computations and leveraging decomposed backward propagation. Experiments on GPT-2 with the MRPC, MNLI-matched, and SST-2 datasets demonstrate that SplitFrozen outperforms FedLoRA and SplitLoRA by 69.4% in model accuracy under extremely imbalanced data, while reducing device-side computation by up to 86.8% and total training time by 50.2%. Experiments also validate the scalability of SplitFrozen on a content generation task using the Llama-3.2 model on the GSM8K dataset.
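
The device/server partition at the heart of the framework can be sketched as freezing the first layers of a decoder stack. The helper below is an illustrative assumption, not SplitFrozen's code, and omits the activation transfer to the server and the LoRA wrapping of the server-side layers.

```python
import torch.nn as nn

def split_and_freeze(layers: nn.ModuleList, cut: int):
    """Split a layer stack at `cut`: the device side runs forward-only and is
    frozen; the server side is fine-tuned (e.g., wrapped with LoRA).
    Activations at the cut point would be shipped device -> server."""
    device_side = layers[:cut]   # forward propagation only, on the edge device
    server_side = layers[cut:]   # trainable layers hosted on the server
    for p in device_side.parameters():
        p.requires_grad = False
    return device_side, server_side
```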

[AI-54] he Misinterpretable Evidence Conveyed by Arbitrary Codes

链接: https://arxiv.org/abs/2503.18984
作者: Guido Fioretti
类目: Artificial Intelligence (cs.AI); Adaptation and Self-Organizing Systems (nlin.AO)
*备注: 22 pages, 4 figures, 1 table

点击查看摘要

Abstract:Evidence Theory is a mathematical framework for handling imprecise reasoning in the context of a judge evaluating testimonies or a detective evaluating cues, rather than a gambler playing games of chance. In comparison to Probability Theory, it is better equipped to deal with ambiguous information and novel possibilities. Furthermore, the arrival and evaluation of testimonies implies a communication channel. This paper explores the possibility of employing Evidence Theory to represent arbitrary communication codes between and within living organisms. In this paper, different schemes are explored for living organisms incapable of anticipation, animals sufficiently sophisticated to be capable of extrapolation, and humans capable of reading one another’s minds.
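
For readers new to Evidence Theory, the snippet below gives a textbook implementation of Dempster's rule of combination for two mass functions, the kind of testimony fusion the theory is built on. It is background illustration only, not the paper's communication-code model.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Dempster's rule: combine two mass functions whose focal elements are
    frozensets; conflicting mass (empty intersections) is renormalized away."""
    combined, conflict = {}, 0.0
    for (a, w1), (b, w2) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + w1 * w2
        else:
            conflict += w1 * w2
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

# Two witnesses assign belief over suspects {x, y}:
m1 = {frozenset({'x'}): 0.6, frozenset({'x', 'y'}): 0.4}
m2 = {frozenset({'y'}): 0.5, frozenset({'x', 'y'}): 0.5}
print(dempster_combine(m1, m2))  # {x}: 3/7, {y}: 2/7, {x, y}: 2/7
```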

[AI-55] Confronting Catastrophic Risk: The International Obligation to Regulate Artificial Intelligence

链接: https://arxiv.org/abs/2503.18983
作者: Bryan Druzin,Anatole Boute,Michael Ramsden
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:While artificial intelligence (AI) holds enormous promise, many experts in the field are warning that there is a non-trivial chance that the development of AI poses an existential threat to humanity. Existing regulatory initiatives do not address this threat, focusing instead on discrete AI-related risks such as consumer safety, cybersecurity, data protection, and privacy. In the absence of regulatory action to address the possible risk of human extinction by AI, the question arises: what legal obligations, if any, does public international law impose on states to regulate its development? Grounded in the precautionary principle, we argue that there exists an international obligation to mitigate the threat of human extinction by AI. Often invoked in relation to environmental regulation and the regulation of potentially harmful technologies, the principle holds that in situations where there is the potential for significant harm, even in the absence of full scientific certainty, preventive measures should not be postponed if delayed action may result in irreversible consequences. We argue that the precautionary principle is a general principle of international law and, therefore, that there is a positive obligation on states under the right to life within international human rights law to proactively take regulatory action to mitigate the potential existential risk of AI. This is significant because, if an international obligation to regulate the development of AI can be established under international law, then the basic legal framework would be in place to address this evolving threat.

[AI-56] Generative Data Imputation for Sparse Learner Performance Data Using Generative Adversarial Imputation Networks

链接: https://arxiv.org/abs/2503.18982
作者: Liang Zhang,Jionghao Lin,John Sabatini,Diego Zapata-Rivera,Carol Forsyth,Yang Jiang,John Hollander,Xiangen Hu,Arthur C. Graesser
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Learner performance data collected by Intelligent Tutoring Systems (ITSs), such as responses to questions, is essential for modeling and predicting learners’ knowledge states. However, missing responses due to skips or incomplete attempts create data sparsity, challenging accurate assessment and personalized instruction. To address this, we propose a generative imputation approach using Generative Adversarial Imputation Networks (GAIN). Our method features a three-dimensional (3D) framework (learners, questions, and attempts), flexibly accommodating various sparsity levels. Enhanced by convolutional neural networks and optimized with a least squares loss function, the GAIN-based method aligns input and output dimensions to question-attempt matrices along the learners’ dimension. Extensive experiments using datasets from AutoTutor Adult Reading Comprehension (ARC), ASSISTments, and MATHia demonstrate that our approach significantly outperforms tensor factorization and alternative GAN methods in imputation accuracy across different attempt scenarios. Bayesian Knowledge Tracing (BKT) further validates the effectiveness of the imputed data by estimating learning parameters: initial knowledge (P(L0)), learning rate (P(T)), guess rate (P(G)), and slip rate (P(S)). Results indicate the imputed data enhances model fit and closely mirrors original distributions, capturing underlying learning behaviors reliably. Kullback-Leibler (KL) divergence assessments confirm minimal divergence, showing the imputed data preserves essential learning characteristics effectively. These findings underscore GAIN’s capability as a robust imputation tool in ITSs, alleviating data sparsity and supporting adaptive, individualized instruction, ultimately leading to more precise and responsive learner assessments and improved educational outcomes.
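
A GAIN-style generator objective can be sketched as an adversarial term on missing entries plus a reconstruction term on observed ones. The least-squares adversarial form below follows the abstract's mention of a least squares loss, but the weighting, shapes, and `alpha` are assumptions, not the paper's exact objective.

```python
import torch

def gain_generator_loss(x, mask, x_imputed, d_prob, alpha=10.0):
    """Sketch of a GAIN-style generator loss.

    mask: 1 where the entry is observed, 0 where missing.
    d_prob: discriminator's estimate that each entry is observed.
    The generator tries to make imputed (missing) entries look observed
    while reconstructing the observed entries faithfully.
    """
    adversarial = ((1 - mask) * (d_prob - 1.0) ** 2).mean()   # least squares form
    reconstruction = (mask * (x - x_imputed) ** 2).mean()
    return adversarial + alpha * reconstruction
```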

[AI-57] FedSKD: Aggregation-free Model-heterogeneous Federated Learning using Multi-dimensional Similarity Knowledge Distillation

链接: https://arxiv.org/abs/2503.18981
作者: Ziqiao Weng,Weidong Cai,Bo Zhou
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 13 pages, 5 figure, 7 tables

点击查看摘要

Abstract:Federated learning (FL) enables privacy-preserving collaborative model training without direct data sharing. Model-heterogeneous FL (MHFL) extends this paradigm by allowing clients to train personalized models with heterogeneous architectures tailored to their computational resources and application-specific needs. However, existing MHFL methods predominantly rely on centralized aggregation, which introduces scalability and efficiency bottlenecks, or impose restrictions requiring partially identical model architectures across clients. While peer-to-peer (P2P) FL removes server dependence, it suffers from model drift and knowledge dilution, limiting its effectiveness in heterogeneous settings. To address these challenges, we propose FedSKD, a novel MHFL framework that facilitates direct knowledge exchange through round-robin model circulation, eliminating the need for centralized aggregation while allowing fully heterogeneous model architectures across clients. FedSKD’s key innovation lies in multi-dimensional similarity knowledge distillation, which enables bidirectional cross-client knowledge transfer at batch, pixel/voxel, and region levels for heterogeneous models in FL. This approach mitigates catastrophic forgetting and model drift through progressive reinforcement and distribution alignment while preserving model heterogeneity. Extensive evaluations on fMRI-based autism spectrum disorder diagnosis and skin lesion classification demonstrate that FedSKD outperforms state-of-the-art heterogeneous and homogeneous FL baselines, achieving superior personalization (client-specific accuracy) and generalization (cross-institutional adaptability). These findings underscore FedSKD’s potential as a scalable and robust solution for real-world medical federated learning applications.

[AI-58] Threshold Crossings as Tail Events for Catastrophic AI Risk

链接: https://arxiv.org/abs/2503.18979
作者: Elija Perrier
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: Under peer review

点击查看摘要

Abstract:We analyse circumstances in which bifurcation-driven jumps in AI systems give rise to emergent heavy-tailed outcome distributions. By analysing how a control parameter’s random fluctuations near a catastrophic threshold generate extreme outcomes, we demonstrate in what circumstances the probability of a sudden, large-scale transition aligns closely with the tail probability of the resulting damage distribution. Our results contribute to research on the monitoring, mitigation and control of AI systems when seeking to manage potentially catastrophic AI risk.

[AI-59] Synthetic media and computational capitalism: towards a critical theory of artificial intelligence

链接: https://arxiv.org/abs/2503.18976
作者: David M. Berry
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 13 pages

点击查看摘要

Abstract:This paper develops a critical theory of artificial intelligence, within a historical constellation where computational systems increasingly generate cultural content that destabilises traditional distinctions between human and machine production. Through this analysis, I introduce the concept of the algorithmic condition, a cultural moment when machine-generated work not only becomes indistinguishable from human creation but actively reshapes our understanding of ideas of authenticity. This transformation, I argue, moves beyond false consciousness towards what I call post-consciousness, where the boundaries between individual and synthetic consciousness become porous. Drawing on critical theory and extending recent work on computational ideology, I develop three key theoretical contributions: first, the concept of the Inversion to describe a new computational turn in algorithmic society; second, automimetric production as a framework for understanding emerging practices of automated value creation; and third, constellational analysis as a methodological approach for mapping the complex interplay of technical systems, cultural forms and political economic structures. Through these contributions, I argue that we need new critical methods capable of addressing both the technical specificity of AI systems and their role in restructuring forms of life under computational capitalism. The paper concludes by suggesting that critical reflexivity is needed to engage with the algorithmic condition without being subsumed by it and that it represents a growing challenge for contemporary critical theory.

[AI-60] LLMs as Planning Modelers: A Survey for Leveraging Large Language Models to Construct Automated Planning Models

链接: https://arxiv.org/abs/2503.18971
作者: Marcus Tantakoun,Xiaodan Zhu,Christian Muise
类目: Artificial Intelligence (cs.AI)
*备注: 20 pages, 3 figures, 3 appendices

点击查看摘要

Abstract:Large Language Models (LLMs) excel in various natural language tasks but often struggle with long-horizon planning problems requiring structured reasoning. This limitation has drawn interest in integrating neuro-symbolic approaches within the Automated Planning (AP) and Natural Language Processing (NLP) communities. However, identifying optimal AP deployment frameworks can be daunting. This paper aims to provide a timely survey of the current research with an in-depth analysis, positioning LLMs as tools for extracting and refining planning models to support reliable AP planners. By systematically reviewing the current state of research, we highlight methodologies, and identify critical challenges and future directions, hoping to contribute to the joint research on NLP and Automated Planning.

[AI-61] MedAgent-Pro: Towards Multi-modal Evidence-based Medical Diagnosis via Reasoning Agentic Workflow

链接: https://arxiv.org/abs/2503.18968
作者: Ziyue Wang,Junde Wu,Chang Han Low,Yueming Jin
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Developing reliable AI systems to assist human clinicians in multi-modal medical diagnosis has long been a key objective for researchers. Recently, Multi-modal Large Language Models (MLLMs) have gained significant attention and achieved success across various domains. With strong reasoning capabilities and the ability to perform diverse tasks based on user instructions, they hold great potential for enhancing medical diagnosis. However, directly applying MLLMs to the medical domain still presents challenges. They lack detailed perception of visual inputs, limiting their ability to perform quantitative image analysis, which is crucial for medical diagnostics. Additionally, MLLMs often exhibit hallucinations and inconsistencies in reasoning, whereas clinical diagnoses must adhere strictly to established criteria. To address these challenges, we propose MedAgent-Pro, an evidence-based reasoning agentic system designed to achieve reliable, explainable, and precise medical diagnoses. This is accomplished through a hierarchical workflow: at the task level, knowledge-based reasoning generates reliable diagnostic plans for specific diseases following retrieved clinical criteria, while at the case level, multiple tool agents process multi-modal inputs, analyze different indicators according to the plan, and provide a final diagnosis based on both quantitative and qualitative evidence. Comprehensive experiments on both 2D and 3D medical diagnosis tasks demonstrate the superiority and effectiveness of MedAgent-Pro, while case studies further highlight its reliability and interpretability. The code is available at this https URL.

[AI-62] Unifying EEG and Speech for Emotion Recognition: A Two-Step Joint Learning Framework for Handling Missing EEG Data During Inference

链接: https://arxiv.org/abs/2503.18964
作者: Upasana Tiwari,Rupayan Chakraborty,Sunil Kumar Kopparapu
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Computer interfaces are advancing towards using multi-modalities to enable better human-computer interactions. The use of automatic emotion recognition (AER) can make the interactions natural and meaningful, thereby enhancing the user experience. Though speech is the most direct and intuitive modality for AER, it is not reliable because it can be intentionally faked by humans. On the other hand, physiological modalities like EEG are more reliable and impossible to fake. However, using EEG in realistic scenarios is infeasible because of the need for a specialized recording setup. In this paper, one of our primary aims is to ride on the reliability of the EEG modality to facilitate robust AER on the speech modality. Our approach uses both modalities during training to reliably identify emotion at the time of inference, even in the absence of the more reliable EEG modality. We propose a two-step joint multi-modal learning approach (JMML) that exploits both the intra- and inter-modal characteristics to construct emotion embeddings that enrich the performance of AER. In the first step, using JEC-SSL, intra-modal learning is done independently on the individual modalities. This is followed by inter-modal learning using the proposed extended variant of the deep canonically correlated cross-modal autoencoder (E-DCC-CAE). The approach learns the joint properties of both modalities by mapping them into a common representation space, such that the modalities are maximally correlated. These emotion embeddings hold properties of both modalities, thereby enhancing the performance of the ML classifier used for AER. Experimental results show the efficacy of the proposed approach. To the best of our knowledge, this is the first attempt to combine speech and EEG with a joint multi-modal learning approach for reliable AER.

[AI-63] Advancing Deep Learning through Probability Engineering: A Pragmatic Paradigm for Modern AI

链接: https://arxiv.org/abs/2503.18958
作者: Jianyi Zhang
类目: Artificial Intelligence (cs.AI); Probability (math.PR); Machine Learning (stat.ML)
*备注: Ph.D. dissertation

点击查看摘要

Abstract:Recent years have witnessed the rapid progression of deep learning, pushing us closer to the realization of AGI (Artificial General Intelligence). Probabilistic modeling is critical to many of these advancements, as it provides a foundational framework for capturing data distributions. However, as the scale and complexity of AI applications grow, traditional probabilistic modeling faces escalating challenges: high-dimensional parameter spaces, heterogeneous data sources, and evolving real-world requirements often render classical approaches insufficiently flexible. This paper proposes a novel concept, Probability Engineering, which treats the already-learned probability distributions within deep learning as engineering artifacts. Rather than merely fitting or inferring distributions, we actively modify and reinforce them to better address the diverse and evolving demands of modern AI. Specifically, Probability Engineering introduces novel techniques and constraints to refine existing probability distributions, improving their robustness, efficiency, adaptability, or trustworthiness. We showcase this paradigm through a series of applications spanning Bayesian deep learning, Edge AI (including federated learning and knowledge distillation), and Generative AI (such as text-to-image generation with diffusion models and high-quality text generation with large language models). These case studies demonstrate how probability distributions once treated as static objects can be engineered to meet the diverse and evolving requirements of large-scale, data-intensive, and trustworthy AI systems. By systematically expanding and strengthening the role of probabilistic modeling, Probability Engineering paves the way for more robust, adaptive, efficient, and trustworthy deep learning solutions in today’s fast-growing AI era.

[AI-64] International Agreements on AI Safety: Review and Recommendations for a Conditional AI Safety Treaty

链接: https://arxiv.org/abs/2503.18956
作者: Rebecca Scholefield,Samuel Martin,Otto Barten
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 34 pages

点击查看摘要

Abstract:The malicious use or malfunction of advanced general-purpose AI (GPAI) poses risks that, according to leading experts, could lead to the ‘marginalisation or extinction of humanity.’ To address these risks, there are an increasing number of proposals for international agreements on AI safety. In this paper, we review recent (2023-) proposals, identifying areas of consensus and disagreement, and drawing on related literature to assess their feasibility. We focus our discussion on risk thresholds, regulations, types of international agreement and five related processes: building scientific consensus, standardisation, auditing, verification and incentivisation. Based on this review, we propose a treaty establishing a compute threshold above which development requires rigorous oversight. This treaty would mandate complementary audits of models, information security and governance practices, overseen by an international network of AI Safety Institutes (AISIs) with authority to pause development if risks are unacceptable. Our approach combines immediately implementable measures with a flexible structure that can adapt to ongoing research.

[AI-65] Forming Auxiliary High-confident Instance-level Loss to Promote Learning from Label Proportions CVPR2025

链接: https://arxiv.org/abs/2411.10364
作者: Tianhao Ma,Han Chen,Juncheng Hu,Yungang Zhu,Ximing Li
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted as a conference paper at CVPR 2025

点击查看摘要

Abstract:Learning from label proportions (LLP), i.e., a challenging weakly-supervised learning task, aims to train a classifier by using bags of instances and the proportions of classes within bags, rather than annotated labels for each instance. Beyond the traditional bag-level loss, the mainstream methodology of LLP is to incorporate an auxiliary instance-level loss with pseudo-labels formed by predictions. Unfortunately, we empirically observed that the pseudo-labels are often inaccurate due to over-smoothing, especially for scenarios with large bag sizes, hurting the classifier induction. To alleviate this problem, we suggest a novel LLP method, namely Learning from Label Proportions with Auxiliary High-confident Instance-level Loss (L^2P-AHIL). Specifically, we propose a dual entropy-based weight (DEW) method to adaptively measure the confidences of pseudo-labels. It simultaneously emphasizes accurate predictions at the bag level and avoids overly smoothed predictions. We then form a high-confident instance-level loss with DEW, and jointly optimize it with the bag-level loss in a self-training manner. The experimental results on benchmark datasets show that L^2P-AHIL can surpass the existing baseline methods, and the performance gain can be more significant as the bag size increases. The implementation of our method is available at this https URL.
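
As a simplified glimpse of the confidence-weighting idea, the snippet below derives a weight from a single prediction entropy. The paper's DEW additionally combines bag-level and instance-level entropies, which is not reproduced here.

```python
import torch

def entropy_weight(probs, eps=1e-8):
    """Confidence weight from prediction entropy: peaked (low-entropy)
    predictions get weights near 1, overly smoothed ones near 0.
    A single-entropy sketch of the intuition behind DEW."""
    entropy = -(probs * (probs + eps).log()).sum(dim=-1)
    max_entropy = torch.log(torch.tensor(float(probs.shape[-1])))
    return 1.0 - entropy / max_entropy  # in [0, 1], higher = more confident

w = entropy_weight(torch.tensor([[0.9, 0.05, 0.05], [0.34, 0.33, 0.33]]))
# first (confident) prediction gets a much larger weight than the second
```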

[AI-66] Recover from Horcrux: A Spectrogram Augmentation Method for Cardiac Feature Monitoring from Radar Signal Components

链接: https://arxiv.org/abs/2503.19649
作者: Yuanyuan Zhang,Sijie Xiong,Rui Yang,EngGee Lim,Yutao Yue
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Radar-based wellness monitoring is becoming an effective measurement to provide accurate vital signs in a contactless manner, but data scarcity hinders related research on deep-learning-based methods. Data augmentation is commonly used to enrich the dataset by modifying the existing data, but most augmentation techniques can only couple with classification tasks. To enable augmentation for regression tasks, this research proposes a spectrogram augmentation method, Horcrux, for radar-based cardiac feature monitoring (e.g., heartbeat detection, electrocardiogram reconstruction) with both classification and regression tasks involved. The proposed method is designed to increase the diversity of input samples while the augmented spectrogram remains faithful to the original ground truth vital sign. In addition, Horcrux injects zero values in specific areas to enhance the awareness of the deep learning model on subtle cardiac features, improving performance on the limited dataset. Experimental results show that Horcrux achieves an overall improvement of 16.20% in cardiac monitoring and has the potential to be extended to other spectrogram-based tasks. The code will be released upon publication.
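
The zero-injection idea can be sketched as blanking small time-frequency patches of a spectrogram. Patch counts, sizes, and the random placement below are assumptions; the paper injects zeros in specific areas rather than at random.

```python
import numpy as np

def zero_inject(spec, rng, n_patches=3, max_frac=0.1):
    """Blank small rectangular patches of a 2-D spectrogram (freq x time).
    Random placement is a simplification of the paper's targeted policy."""
    spec = spec.copy()
    f_bins, t_bins = spec.shape
    for _ in range(n_patches):
        fh = rng.integers(1, max(2, int(f_bins * max_frac)))
        tw = rng.integers(1, max(2, int(t_bins * max_frac)))
        f0 = rng.integers(0, f_bins - fh)
        t0 = rng.integers(0, t_bins - tw)
        spec[f0:f0 + fh, t0:t0 + tw] = 0.0
    return spec

rng = np.random.default_rng(0)
aug = zero_inject(np.abs(np.random.randn(64, 128)), rng)
```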

[AI-67] Towards Long-Range ENSO Prediction with an Explainable Deep Learning Model

链接: https://arxiv.org/abs/2503.19502
作者: Qi Chen,Yinghao Cui,Guobin Hong,Karumuri Ashok,Yuchun Pu,Xiaogu Zheng,Xuanze Zhang,Wei Zhong,Peng Zhan,Zhonglei Wang
类目: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:El Niño-Southern Oscillation (ENSO) is a prominent mode of interannual climate variability with far-reaching global impacts. Its evolution is governed by intricate air-sea interactions, posing significant challenges for long-term prediction. In this study, we introduce CTEFNet, a multivariate deep learning model that synergizes convolutional neural networks and transformers to enhance ENSO forecasting. By integrating multiple oceanic and atmospheric predictors, CTEFNet extends the effective forecast lead time to 20 months while mitigating the impact of the spring predictability barrier, outperforming both dynamical models and state-of-the-art deep learning approaches. Furthermore, CTEFNet offers physically meaningful and statistically significant insights through gradient-based sensitivity analysis, revealing the key precursor signals that govern ENSO dynamics, which align with well-established theories and reveal new insights about inter-basin interactions among the Pacific, Atlantic, and Indian Oceans. The CTEFNet’s superior predictive skill and interpretable sensitivity assessments underscore its potential for advancing climate prediction. Our findings highlight the importance of multivariate coupling in ENSO evolution and demonstrate the promise of deep learning in capturing complex climate dynamics with enhanced interpretability.

[AI-68] Minimum Volume Conformal Sets for Multivariate Regression

链接: https://arxiv.org/abs/2503.19068
作者: Sacha Braun,Liviu Aolaritei,Michael I. Jordan,Francis Bach
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME); Other Statistics (stat.OT)
*备注:

点击查看摘要

Abstract:Conformal prediction provides a principled framework for constructing predictive sets with finite-sample validity. While much of the focus has been on univariate response variables, existing multivariate methods either impose rigid geometric assumptions or rely on flexible but computationally expensive approaches that do not explicitly optimize prediction set volume. We propose an optimization-driven framework based on a novel loss function that directly learns minimum-volume covering sets while ensuring valid coverage. This formulation naturally induces a new nonconformity score for conformal prediction, which adapts to the residual distribution and covariates. Our approach optimizes over prediction sets defined by arbitrary norm balls, including single and multi-norm formulations. Additionally, by jointly optimizing both the predictive model and predictive uncertainty, we obtain prediction sets that are tight, informative, and computationally efficient, as demonstrated in our experiments on real-world datasets.

[AI-69] Forecasting Labor Demand: Predicting JOLT Job Openings using Deep Learning Model

链接: https://arxiv.org/abs/2503.19048
作者: Kyungsu Kim
类目: Econometrics (econ.EM); Artificial Intelligence (cs.AI)
*备注: 19 pages, 7 figures

点击查看摘要

Abstract:This thesis studies the effectiveness of the Long Short-Term Memory (LSTM) model in forecasting future Job Openings and Labor Turnover Survey (JOLT) data in the United States. Drawing on multiple economic indicators from various sources, the data are fed directly into the LSTM model to predict JOLT job openings in subsequent periods. The performance of the LSTM model is compared with conventional autoregressive approaches, including ARIMA, SARIMA, and Holt-Winters. Findings suggest that the LSTM model outperforms these traditional models in predicting JOLT job openings, as it not only captures the dependent variable's trends but also harmonizes with key economic factors. These results highlight the potential of deep learning techniques in capturing complex temporal dependencies in economic data, offering valuable insights for policymakers and stakeholders in developing data-driven labor market strategies.
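
For readers who want a starting point, here is a minimal PyTorch sketch of the kind of LSTM regressor the thesis describes; the indicator count, window length, and hyperparameters are illustrative assumptions, not the thesis's configuration.

```python
import torch
import torch.nn as nn

class JoltLSTM(nn.Module):
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):             # x: (batch, window, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # predict next-period job openings

model = JoltLSTM(n_features=8)        # e.g., 8 monthly indicators (assumed)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(32, 12, 8)            # 12-month windows of 8 indicators (dummy data)
y = torch.randn(32, 1)
opt.zero_grad()
loss_fn(model(x), y).backward()
opt.step()
```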

[AI-70] CAE: Repurposing the Critic as an Explorer in Deep Reinforcement Learning

链接: https://arxiv.org/abs/2503.18980
作者: Yexin Li,Pring Wong,Hanfang Zhang,Shuo Chen,Siyuan Qi
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Exploration remains a critical challenge in reinforcement learning, as many existing methods either lack theoretical guarantees or fall short of practical effectiveness. In this paper, we introduce CAE, a lightweight algorithm that repurposes the value networks in standard deep RL algorithms to drive exploration without introducing additional parameters. CAE utilizes any linear multi-armed bandit technique and incorporates an appropriate scaling strategy, enabling efficient exploration with provable sub-linear regret bounds and practical stability. Notably, it is simple to implement, requiring only around 10 lines of code. In complex tasks where learning an effective value network proves challenging, we propose CAE+, an extension of CAE that incorporates an auxiliary network. This extension increases the parameter count by less than 1% while maintaining implementation simplicity, adding only about 10 additional lines of code. Experiments on MuJoCo and MiniHack show that both CAE and CAE+ outperform state-of-the-art baselines, bridging the gap between theoretical rigor and practical efficiency.
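
The abstract's "around 10 lines of code" suggests something close to a LinUCB-style optimism bonus computed on the critic's features. The sketch below is our guess at the shape of such a component; the feature source and scaling are assumptions, not the paper's algorithm.

```python
import numpy as np

class LinUCBExplorer:
    def __init__(self, feat_dim, lam=1.0, beta=1.0):
        self.A_inv = np.eye(feat_dim) / lam   # inverse feature covariance
        self.beta = beta

    def bonus(self, phi):
        # Optimistic bonus: beta * sqrt(phi^T A^{-1} phi).
        return self.beta * np.sqrt(phi @ self.A_inv @ phi)

    def update(self, phi):
        # Sherman-Morrison rank-1 update of the inverse covariance.
        Av = self.A_inv @ phi
        self.A_inv -= np.outer(Av, Av) / (1.0 + phi @ Av)

explorer = LinUCBExplorer(feat_dim=8)
phi = np.random.randn(8)     # e.g., the critic's penultimate-layer features
print(explorer.bonus(phi))
explorer.update(phi)
```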

[AI-71] Machine Learning - Driven Materials Discovery: Unlocking Next-Generation Functional Materials - A minireview

链接: https://arxiv.org/abs/2503.18975
作者: Dilshod Nematov,Mirabbos Hojamberdiev
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rapid advancement of machine learning and artificial intelligence (AI)-driven techniques is revolutionizing materials discovery, property prediction, and material design by minimizing human intervention and accelerating scientific progress. This review provides a comprehensive overview of smart, machine learning (ML)-driven approaches, emphasizing their role in predicting material properties, discovering novel compounds, and optimizing material structures. Key methodologies ranging from deep learning, graph neural networks, and Bayesian optimization to automated generative models, such as generative adversarial networks (GANs) and variational autoencoders (VAEs) enable the autonomous design of materials with tailored functionalities. By leveraging AutoML frameworks (e.g., AutoGluon, TPOT, and this http URL), researchers can automate the model selection, hyperparameter tuning, and feature engineering, significantly improving the efficiency of materials informatics. Furthermore, the integration of AI-driven robotic laboratories and high-throughput computing has established a fully automated pipeline for rapid synthesis and experimental validation, drastically reducing the time and cost of material discovery. This review highlights real-world applications of automated ML-driven approaches in predicting mechanical, thermal, electrical, and optical properties of materials, demonstrating successful cases in superconductors, catalysts, photovoltaics, and energy storage systems. We also address key challenges, such as data quality, interpretability, and the integration of AutoML with quantum computing, which are essential for future advancements. Ultimately, the synergy between AI, automated experimentation, and computational modeling transforms the way the materials are discovered, optimized, and designed, paving the way for next-generation innovations in energy, electronics, and nanotechnology.

[AI-72] On the Hopf-Cole Transform for Control-affine Schrödinger Bridge

链接: https://arxiv.org/abs/2503.17640
作者: Alexis Teter,Abhishek Halder
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The purpose of this note is to clarify the importance of the relation \boldsymbol{g}\boldsymbol{g}^\top \propto \boldsymbol{\sigma}\boldsymbol{\sigma}^\top in solving control-affine Schrödinger bridge problems via the Hopf-Cole transform, where \boldsymbol{g}, \boldsymbol{\sigma} are the control and noise coefficients, respectively. We show that the Hopf-Cole transform applied to the conditions of optimality for generic control-affine Schrödinger bridge problems, i.e., without the assumption \boldsymbol{g}\boldsymbol{g}^\top \propto \boldsymbol{\sigma}\boldsymbol{\sigma}^\top, gives a pair of forward-backward PDEs that are neither linear nor equation-level decoupled. We explain how the resulting PDEs can be interpreted as nonlinear forward-backward advection-diffusion-reaction equations, where the nonlinearity stems from additional drift and reaction terms involving the gradient of the log-likelihood, a.k.a. the score. These additional drift and reaction terms vanish when \boldsymbol{g}\boldsymbol{g}^\top \propto \boldsymbol{\sigma}\boldsymbol{\sigma}^\top, and the resulting boundary-coupled system of linear PDEs can then be solved by dynamic Sinkhorn recursions. A key takeaway of our work is that the numerical solution of the generic control-affine Schrödinger bridge requires further algorithmic development, possibly generalizing the dynamic Sinkhorn recursion or otherwise.

机器学习

[LG-0] RCC-PFL: Robust Client Clustering under Noisy Labels in Personalized Federated Learning

链接: https://arxiv.org/abs/2503.19886
作者: Abdulmoneam Ali,Ahmed Arafa
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
*备注: to appear in the 2025 IEEE International Conference on Communications

点击查看摘要

Abstract:We address the problem of cluster identity estimation in a personalized federated learning (PFL) setting in which users aim to learn different personal models. The backbone of effective learning in such a setting is to cluster users into groups whose objectives are similar. A typical approach in the literature is to achieve this by training users’ data on different proposed personal models and assign them to groups based on which model achieves the lowest value of the users’ loss functions. This process is to be done iteratively until group identities converge. A key challenge in such a setting arises when users have noisy labeled data, which may produce misleading values of their loss functions, and hence lead to ineffective clustering. To overcome this challenge, we propose a label-agnostic data similarity-based clustering algorithm, coined RCC-PFL, with three main advantages: the cluster identity estimation procedure is independent from the training labels; it is a one-shot clustering algorithm performed prior to the training; and it requires fewer communication rounds and less computation compared to iterative-based clustering methods. We validate our proposed algorithm using various models and datasets and show that it outperforms multiple baselines in terms of average accuracy and variance reduction.

[LG-1] Extensions of regret-minimization algorithm for optimal design

链接: https://arxiv.org/abs/2503.19874
作者: Youguang Chen,George Biros
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We explore extensions and applications of the regret minimization framework introduced in prior work for solving optimal experimental design problems. Specifically, we incorporate the entropy regularizer into this framework, leading to a novel sample selection objective and a provable sample complexity bound that guarantees a (1+\epsilon)-near-optimal solution. We further extend the method to handle regularized optimal design settings. As an application, we use our algorithm to select a small set of representative samples from image classification datasets without relying on label information. To evaluate the quality of the selected samples, we train a logistic regression model and compare performance against several baseline sampling strategies. Experimental results on MNIST, CIFAR-10, and a 50-class subset of ImageNet show that our approach consistently outperforms competing methods in most cases.

[LG-2] An Overview of Low-Rank Structures in the Training and Adaptation of Large Models

链接: https://arxiv.org/abs/2503.19859
作者: Laura Balzano,Tianjiao Ding,Benjamin D. Haeffele,Soo Min Kwon,Qing Qu,Peng Wang,Zhangyang Wang,Can Yaras
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC); Computation (stat.CO); Machine Learning (stat.ML)
*备注: Authors are listed alphabetically; 27 pages, 10 figures

点击查看摘要

Abstract:The rise of deep learning has revolutionized data processing and prediction in signal processing and machine learning, yet the substantial computational demands of training and deploying modern large-scale deep models present significant challenges, including high computational costs and energy consumption. Recent research has uncovered a widespread phenomenon in deep networks: the emergence of low-rank structures in weight matrices and learned representations during training. These implicit low-dimensional patterns provide valuable insights for improving the efficiency of training and fine-tuning large-scale models. Practical techniques inspired by this phenomenon, such as low-rank adaptation (LoRA) and low-rank training, enable significant reductions in computational cost while preserving model performance. In this paper, we present a comprehensive review of recent advances in exploiting low-rank structures for deep learning and shed light on their mathematical foundations. Mathematically, we present two complementary perspectives on understanding the low-rankness in deep networks: (i) the emergence of low-rank structures throughout the whole optimization dynamics of gradient descent and (ii) the implicit regularization effects that induce such low-rank structures at convergence. From a practical standpoint, studying the low-rank learning dynamics of gradient descent offers a mathematical foundation for understanding the effectiveness of LoRA in fine-tuning large-scale models and inspires parameter-efficient low-rank training strategies. Furthermore, the implicit low-rank regularization effect helps explain the success of various masked training approaches in deep neural networks, ranging from dropout to masked self-supervised learning.

[LG-3] Capacity-Constrained Online Learning with Delays: Scheduling Frameworks and Regret Trade-offs

链接: https://arxiv.org/abs/2503.19856
作者: Alexander Ryabchenko,Idan Attias,Daniel M. Roy
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study online learning with oblivious losses and delays under a novel "capacity constraint" that limits how many past rounds can be tracked simultaneously for delayed feedback. Under "clairvoyance" (i.e., delay durations are revealed upfront each round) and/or "preemptibility" (i.e., we have the ability to stop tracking previously chosen round feedback), we establish matching upper and lower bounds (up to logarithmic terms) on achievable regret, characterizing the optimal "capacity" needed to match the minimax rates of classical delayed online learning, which implicitly assume unlimited capacity. For K actions and total delay D over T rounds, under clairvoyance and assuming capacity C = \Omega(\log(T)), we achieve regret \widetilde{\Theta}(\sqrt{TK + DK/C + D\log(K)}) for bandits and \widetilde{\Theta}(\sqrt{(D+T)\log(K)}) for full-information feedback. When replacing clairvoyance with preemptibility, we require a known maximum delay bound d_{\max}, adding \widetilde{O}(d_{\max}) to the regret. For fixed delays d (i.e., D = Td), the minimax regret is \Theta\bigl(\sqrt{TK(1+d/C) + Td\log(K)}\bigr) and the optimal capacity is \Theta\bigl(\min\{K/\log(K), d\}\bigr) in the bandit setting, while in the full-information setting, the minimax regret is \Theta\bigl(\sqrt{T(d+1)\log(K)}\bigr) and the optimal capacity is \Theta(1). Our algorithms achieve minimax-optimal regret across all capacity levels, with performance gracefully degrading under suboptimal capacity. For round-dependent and fixed delays, our upper bounds are achieved using novel scheduling policies, based on Pareto-distributed proxy delays and batching techniques. Crucially, our work unifies delayed bandits, label-efficient learning, and online scheduling frameworks, demonstrating that robust online learning under delayed feedback is possible with surprisingly modest tracking capacity.

[LG-4] PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch

链接: https://arxiv.org/abs/2503.19779
作者: Abhishek Ghosh,Ajay Nayak,Ashish Panwar,Arkaprava Basu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:CUDA Graphs – a recent hardware feature introduced for NVIDIA GPUs – aim to reduce CPU launch overhead by capturing and launching a series of GPU tasks (kernels) as a DAG. However, deploying CUDA Graphs faces several challenges today due to the static structure of a graph. It also incurs performance overhead due to data copies. In fact, we show a counter-intuitive result: deploying CUDA Graphs hurts performance in many cases. We introduce PyGraph, a novel approach to automatically harness the power of CUDA Graphs within PyTorch2. Driven by three key observations, PyGraph embodies three novel optimizations: it enables wider deployment of CUDA Graphs, reduces GPU kernel parameter copy overheads, and selectively deploys CUDA Graphs based on a cost-benefit analysis. PyGraph seamlessly integrates with PyTorch2's compilation toolchain, enabling efficient use of CUDA Graphs without manual modifications to the code. We evaluate PyGraph across various machine learning benchmarks, demonstrating substantial performance improvements over PyTorch2.
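
PyGraph's own API is not shown in the abstract, but the stock PyTorch route to CUDA Graphs, torch.compile with mode="reduce-overhead", illustrates the mechanism the paper builds on (requires a CUDA-capable GPU):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1024)
).cuda()

# "reduce-overhead" asks the inductor backend to capture kernels into CUDA
# Graphs, amortizing CPU launch overhead across replays.
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(64, 1024, device="cuda")
for _ in range(3):            # the first iterations warm up / capture the graph
    y = compiled(x)
torch.cuda.synchronize()
```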

[LG-5] How to RETIRE Tabular Data in Favor of Discrete Digital Signal Representation

链接: https://arxiv.org/abs/2503.19733
作者: Paweł Zyblewski,Szymon Wojciechowski
类目: Machine Learning (cs.LG)
*备注: 16 pages, 6 figures, 2 tables

点击查看摘要

Abstract:The successes achieved by deep neural networks in computer vision tasks have led in recent years to the emergence of a new research area dubbed Multi-Dimensional Encoding (MDE). Methods belonging to this family aim to transform tabular data into a homogeneous form of discrete digital signals (images) to apply convolutional networks to initially unsuitable problems. Despite successive emerging works, the pool of multi-dimensional encoding methods remains small, and the scope of research on existing modality encoding techniques is quite limited. To contribute to this area of research, we propose the Radar-based Encoding from Tabular to Image REpresentation (RETIRE), which allows tabular data to be represented as radar graphs, capturing the feature characteristics of each problem instance. RETIRE was compared with a pool of state-of-the-art MDE algorithms as well as with XGBoost in terms of classification accuracy and computational complexity. In addition, an analysis was carried out regarding transferability and explainability to provide more insight into both RETIRE and existing MDE techniques. The results obtained, supported by statistical analysis, confirm the superiority of RETIRE over other established MDE methods.
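
A radar-graph encoding in the spirit of RETIRE can be sketched with matplotlib; the normalization and rendering choices below are our assumptions, not the paper's.

```python
import numpy as np
import matplotlib.pyplot as plt

def row_to_radar_image(row, path="instance.png"):
    row = np.asarray(row, dtype=float)
    rng = row.max() - row.min()
    row = (row - row.min()) / (rng + 1e-12)            # normalize features to [0, 1]
    angles = np.linspace(0, 2 * np.pi, len(row), endpoint=False)
    angles = np.concatenate([angles, angles[:1]])      # close the polygon
    values = np.concatenate([row, row[:1]])
    fig = plt.figure(figsize=(2, 2), dpi=64)           # small image for a CNN
    ax = fig.add_subplot(111, polar=True)
    ax.plot(angles, values)
    ax.fill(angles, values, alpha=0.3)
    ax.set_xticks([])
    ax.set_yticks([])
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)

row_to_radar_image([5.1, 3.5, 1.4, 0.2])               # e.g., one iris-like sample
```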

[LG-6] Towards Efficient Training of Graph Neural Networks: A Multiscale Approach

链接: https://arxiv.org/abs/2503.19666
作者: Eshed Gal,Moshe Eliasof,Carola-Bibiane Schönlieb,Eldad Haber,Eran Treister
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have emerged as a powerful tool for learning and inferring from graph-structured data, and are widely used in a variety of applications, often considering large amounts of data and large graphs. However, training on such data requires large memory and extensive computations. In this paper, we introduce a novel framework for efficient multiscale training of GNNs, designed to integrate information across multiscale representations of a graph. Our approach leverages a hierarchical graph representation, taking advantage of coarse graph scales in the training process, where each coarse scale graph has fewer nodes and edges. Based on this approach, we propose a suite of GNN training methods: such as coarse-to-fine, sub-to-full, and multiscale gradient computation. We demonstrate the effectiveness of our methods on various datasets and learning tasks.

[LG-7] Enhancing Graphical Lasso: A Robust Scheme for Non-Stationary Mean Data

链接: https://arxiv.org/abs/2503.19651
作者: Samuel Rey,Ernesto Curbelo,Luca Martino,Fernando Llorente,Antonio G. Marques
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This work addresses the problem of graph learning from data following a Gaussian Graphical Model (GGM) with a time-varying mean. Graphical Lasso (GL), the standard method for estimating sparse precision matrices, assumes that the observed data follows a zero-mean Gaussian distribution. However, this assumption is often violated in real-world scenarios where the mean evolves over time due to external influences, trends, or regime shifts. When the mean is not properly accounted for, applying GL directly can lead to estimating a biased precision matrix, hence hindering the graph learning task. To overcome this limitation, we propose Graphical Lasso with Adaptive Targeted Adaptive Importance Sampling (GL-ATAIS), an iterative method that jointly estimates the time-varying mean and the precision matrix. Our approach integrates Bayesian inference with frequentist estimation, leveraging importance sampling to obtain an estimate of the mean while using a regularized maximum likelihood estimator to infer the precision matrix. By iteratively refining both estimates, GL-ATAIS mitigates the bias introduced by time-varying means, leading to more accurate graph recovery. Our numerical evaluation demonstrates the impact of properly accounting for time-dependent means and highlights the advantages of GL-ATAIS over standard GL in recovering the true graph structure.
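
As a rough, single-pass stand-in for GL-ATAIS (which iterates Bayesian mean estimation via adaptive importance sampling with the GL fit), one can detrend with a moving average and then run scikit-learn's GraphicalLasso on the residuals:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def graphical_lasso_detrended(X, alpha=0.1, window=10):
    # Stand-in mean estimate: centered moving average per variable.
    kernel = np.ones(window) / window
    mean = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, X)
    gl = GraphicalLasso(alpha=alpha).fit(X - mean)   # GL on detrended residuals
    return gl.precision_, mean

X = np.random.randn(200, 5) + np.linspace(0, 3, 200)[:, None]  # drifting mean
precision, mean_hat = graphical_lasso_detrended(X)
```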

[LG-8] An Efficient Data Reuse with Tile-Based Adaptive Stationary for Transformer Accelerators ISCAS2025

链接: https://arxiv.org/abs/2503.19640
作者: Tseng-Jen Li,Tian-Sheuan Chang
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: to be published in IEEE International Symposium on Circuits and Systems (IEEE ISCAS 2025)

点击查看摘要

Abstract:Transformer-based models have become the de facto backbone across many fields, such as computer vision and natural language processing. However, as these models scale in size, external memory access (EMA) for weights and activations becomes a critical bottleneck due to its significantly higher energy consumption compared to internal computations. While most prior work has focused on optimizing the self-attention mechanism, little attention has been given to optimizing data transfer during linear projections, where EMA costs are equally important. In this paper, we propose the Tile-based Adaptive Stationary (TAS) scheme that selects the input or weight stationary in a tile granularity, based on the input sequence length. Our experimental results demonstrate that TAS can significantly reduce EMA by more than 97% compared to traditional stationary schemes, while being compatible with various attention optimization techniques and hardware accelerators.

[LG-9] Red Teaming with Artificial Intelligence-Driven Cyberattacks: A Scoping Review

链接: https://arxiv.org/abs/2503.19626
作者: Mays Al-Azzawi,Dung Doan,Tuomo Sipola,Jari Hautamäki,Tero Kokkonen
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: An earlier version first published in Good Practices and New Perspectives in Information Systems and Technologies (pp. 129-138), 2024 by Springer Nature

点击查看摘要

Abstract:The progress of artificial intelligence (AI) has made sophisticated methods available for cyberattacks and red team activities. These AI attacks can automate the process of penetrating a target or collecting sensitive data. The new methods can also accelerate the execution of the attacks. This review article examines the use of AI technologies in cybersecurity attacks. It also tries to describe typical targets for such attacks. We employed a scoping review methodology to analyze articles and identify AI methods, targets, and models that red teams can utilize to simulate cybercrime. From the 470 records screened, 11 were included in the review. Various cyberattack methods were identified, targeting sensitive data, systems, social media profiles, passwords, and URLs. The application of AI in cybercrime to develop versatile attack models presents an increasing threat. Furthermore, AI-based techniques in red team use can provide new ways to address these issues.

[LG-10] Optimization through In-Context Learning and Iterative LLM Prompting for Nuclear Engineering Design Problems

链接: https://arxiv.org/abs/2503.19620
作者: M. Rizki Oktavian,Anirudh Tunga,Amandeep Bakshi,Michael J. Mueterthies,J. Thomas Gruenwald,Jonathan Nistor
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: Codes and data are available upon request

点击查看摘要

Abstract:The optimization of nuclear engineering designs, such as nuclear fuel assembly configurations, involves managing competing objectives like reactivity control and power distribution. This study explores the use of Optimization by Prompting, an iterative approach utilizing large language models (LLMs), to address these challenges. The method is straightforward to implement, requiring no hyperparameter tuning or complex mathematical formulations. Optimization problems can be described in plain English, with only an evaluator and a parsing script needed for execution. The in-context learning capabilities of LLMs enable them to understand problem nuances, therefore, they have the potential to surpass traditional metaheuristic optimization methods. This study demonstrates the application of LLMs as optimizers to Boiling Water Reactor (BWR) fuel lattice design, showing the capability of commercial LLMs to achieve superior optimization results compared to traditional methods.
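
The Optimization-by-Prompting loop is simple enough to sketch; `llm` and `evaluate` below are placeholders for an LLM call and a lattice-physics evaluator, respectively, and the prompt wording is illustrative.

```python
def optimize_by_prompting(llm, evaluate, n_rounds=20):
    history = []   # (candidate, score) pairs shown back to the model
    for _ in range(n_rounds):
        past = "\n".join(f"{c} -> score {s:.3f}" for c, s in history[-8:])
        prompt = ("Propose an improved BWR fuel lattice configuration.\n"
                  "Previous attempts and scores:\n" + past)
        candidate = llm(prompt)                            # hypothetical LLM call
        history.append((candidate, evaluate(candidate)))   # assumed evaluator
    return max(history, key=lambda t: t[1])                # best (candidate, score)

# Stub demo with dummy callables; replace with real LLM and evaluator.
best = optimize_by_prompting(llm=lambda p: "lattice-A", evaluate=lambda c: 0.5)
```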

[LG-11] Learning to chain-of-thought with Jensen's evidence lower bound

链接: https://arxiv.org/abs/2503.19618
作者: Yunhao Tang,Sid Wang,Rémi Munos
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a way to optimize chain-of-thought with reinforcement learning, but without an external reward function. Our algorithm relies on viewing chain-of-thought as a latent variable in a probabilistic inference problem. In contrast to the full evidence lower bound, we propose to apply a much simpler Jensen's lower bound, which yields tractable objectives with simple algorithmic components (e.g., without the need for a parametric approximate posterior), making it more conducive to modern large-scale training. The lower bound approach naturally interpolates other methods such as supervised fine-tuning and online reinforcement learning, whose practical trade-offs we illustrate. Finally, we show that on mathematical reasoning problems, optimizing with Jensen's lower bound is as effective as policy gradient with external reward. Taken together, our results serve as a proof of concept for this new algorithmic paradigm's potential in more general applications.
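
The bound in question is the standard Jensen step with the chain-of-thought treated as a latent variable:

```latex
% Jensen's lower bound with the chain-of-thought z as a latent variable:
\log p_\theta(y \mid x)
  = \log \mathbb{E}_{z \sim p_\theta(\cdot \mid x)}\left[ p_\theta(y \mid x, z) \right]
  \;\geq\; \mathbb{E}_{z \sim p_\theta(\cdot \mid x)}\left[ \log p_\theta(y \mid x, z) \right].
```

Because the expectation is taken under the model's own distribution over z rather than a learned posterior q(z | x, y), no parametric approximate posterior has to be trained, which is what makes the objective tractable at scale.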

[LG-12] RL-finetuning LLM s from on- and off-policy data with a single algorithm

链接: https://arxiv.org/abs/2503.19612
作者: Yunhao Tang,Taco Cohen,David W. Zhang,Michal Valko,Rémi Munos
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce a novel reinforcement learning algorithm (AGRO, for Any-Generation Reward Optimization) for fine-tuning large-language models. AGRO leverages the concept of generation consistency, which states that the optimal policy satisfies the notion of consistency across any possible generation of the model. We derive algorithms that find optimal solutions via the sample-based policy gradient and provide theoretical guarantees on their convergence. Our experiments demonstrate the effectiveness of AGRO in both on-policy and off-policy settings, showing improved performance on the mathematical reasoning dataset over baseline algorithms.

[LG-13] Optimizing Language Models for Inference Time Objectives using Reinforcement Learning

链接: https://arxiv.org/abs/2503.19595
作者: Yunhao Tang,Kunhao Zheng,Gabriel Synnaeve,Rémi Munos
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we investigate the merits of explicitly optimizing for inference-time algorithmic performance during model training. We show how optimizing for inference-time performance can improve overall model efficacy. We consider generic inference-time objectives with k samples, with a focus on pass@k and majority voting as two main applications. With language model training on reasoning datasets, we showcase the performance trade-off enabled by training with such objectives. When training on code generation tasks, we show that the approach significantly improves pass@k objectives compared to the baseline method.
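
For reference, pass@k is usually computed with the unbiased combinatorial estimator of Chen et al. (2021), which is also the natural target when optimizing it directly: given n samples of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k).

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k draws (without replacement) from
    # n samples, c of them correct, is correct.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=20, c=3, k=5))   # chance that >= 1 of 5 draws is correct
```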

[LG-14] Boosting the Transferability of Audio Adversarial Examples with Acoustic Representation Optimization ICME2025

链接: https://arxiv.org/abs/2503.19591
作者: Weifei Jin,Junjie Su,Hejia Wang,Yulin Ye,Jie Hao
类目: Sound (cs.SD); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted to ICME 2025

点击查看摘要

Abstract:With the widespread application of automatic speech recognition (ASR) systems, their vulnerability to adversarial attacks has been extensively studied. However, most existing adversarial examples are generated on specific individual models, resulting in a lack of transferability. In real-world scenarios, attackers often cannot access detailed information about the target model, making query-based attacks unfeasible. To address this challenge, we propose a technique called Acoustic Representation Optimization that aligns adversarial perturbations with low-level acoustic characteristics derived from speech representation models. Rather than relying on model-specific, higher-layer abstractions, our approach leverages fundamental acoustic representations that remain consistent across diverse ASR architectures. By enforcing an acoustic representation loss to guide perturbations toward these robust, lower-level representations, we enhance the cross-model transferability of adversarial examples without degrading audio quality. Our method is plug-and-play and can be integrated with any existing attack methods. We evaluate our approach on three modern ASR models, and the experimental results demonstrate that our method significantly improves the transferability of adversarial examples generated by previous methods while preserving the audio quality.

[LG-15] Post-Hoc Calibrated Anomaly Detection

链接: https://arxiv.org/abs/2503.19577
作者: Sean Gloumeau
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep unsupervised anomaly detection has seen improvements from a supervised binary classification paradigm in which auxiliary external data is included in the training set as anomalous data, a process referred to as outlier exposure. This opens the possibility of exploring the efficacy of post-hoc calibration for anomaly detection and localization. Post-hoc Platt scaling and Beta calibration are found to improve results with gradient-based input perturbation, as well as post-hoc training with a strictly proper loss of a base model initially trained on an unsupervised loss. Post-hoc calibration is also found at times to be more effective using random synthesized spectral data as labeled anomalous data in the calibration set, suggesting that outlier exposure is superior only for initial training.
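
Platt scaling itself is a one-liner once calibration scores and outlier-exposure labels are available; the snippet below is a generic illustration on synthetic scores, not the paper's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
cal_scores = np.concatenate([rng.normal(0, 1, 500), rng.normal(3, 1, 50)])
cal_labels = np.concatenate([np.zeros(500), np.ones(50)])   # 1 = anomalous

# Platt scaling: a 1-D logistic regression mapping raw scores to probabilities.
platt = LogisticRegression().fit(cal_scores.reshape(-1, 1), cal_labels)

test_scores = np.array([[0.2], [4.1]])
print(platt.predict_proba(test_scores)[:, 1])   # calibrated anomaly probabilities
```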

[LG-16] Noise Resilient Over-The-Air Federated Learning In Heterogeneous Wireless Networks

链接: https://arxiv.org/abs/2503.19549
作者: Zubair Shaban,Nazreen Shah,Ranjitha Prasad
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:In 6G wireless networks, Artificial Intelligence (AI)-driven applications demand the adoption of Federated Learning (FL) to enable efficient and privacy-preserving model training across distributed devices. Over-The-Air Federated Learning (OTA-FL) exploits the superposition property of multiple access channels, allowing edge users in 6G networks to efficiently share spectral resources and perform low-latency global model aggregation. However, these advantages come with challenges, as traditional OTA-FL techniques suffer from the joint effects of Additive White Gaussian Noise (AWGN) at the server, fading, and both data and system heterogeneity at the participating edge devices. In this work, we propose the novel Noise Resilient Over-the-Air Federated Learning (NoROTA-FL) framework to jointly tackle these challenges in federated wireless networks. In NoROTA-FL, the local optimization problems find controlled inexact solutions, which manifests as an additional proximal constraint at the clients. This approach provides robustness against straggler-induced partial work, heterogeneity, noise, and fading. From a theoretical perspective, we leverage the zeroth- and first-order inexactness and establish convergence guarantees for non-convex optimization problems in the presence of heterogeneous data and varying system capabilities. Experimentally, we validate NoROTA-FL on real-world datasets, including FEMNIST, CIFAR10, and CIFAR100, demonstrating its robustness in noisy and heterogeneous environments. Compared to state-of-the-art baselines such as COTAF and FedProx, NoROTA-FL achieves significantly more stable convergence and higher accuracy, particularly in the presence of stragglers.
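
The "controlled inexact solution with an additional proximal constraint" echoes the FedProx-style client update, sketched here under that assumption (NoROTA-FL's exact formulation may differ): the local objective adds (mu/2) * ||w - w_global||^2.

```python
import torch

def local_step(model, global_params, batch, loss_fn, opt, mu=0.1):
    # One proximal local update: task loss plus a pull toward the global model.
    x, y = batch
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    prox = sum(((p - g.detach()) ** 2).sum()
               for p, g in zip(model.parameters(), global_params))
    (loss + 0.5 * mu * prox).backward()
    opt.step()
```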

[LG-17] DataPlatter: Boosting Robotic Manipulation Generalization with Minimal Costly Data

链接: https://arxiv.org/abs/2503.19516
作者: Liming Zheng,Feng Yan,Fanfan Liu,Chengjian Feng,Yufeng Zhong,Yiyang Huang,Lin Ma
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The growing adoption of Vision-Language-Action (VLA) models in embodied AI intensifies the demand for diverse manipulation demonstrations. However, high costs associated with data collection often result in insufficient data coverage across all scenarios, which limits the performance of the models. It is observed that the spatial reasoning phase (SRP) in large workspaces dominates the failure cases. Fortunately, this data can be collected at low cost, underscoring the potential of leveraging inexpensive data to improve model performance. In this paper, we introduce the DataPlatter method, a framework that decouples training trajectories into distinct task stages and leverages abundant, easily collectible SRP data to enhance the VLA model's generalization. Through analysis we demonstrate that sub-task-specific training with additional SRP data in a proper proportion can act as a performance catalyst for robot manipulation, maximizing the utilization of costly physical interaction phase (PIP) data. Experiments show that by introducing a large proportion of cost-effective SRP trajectories into a limited set of PIP data, we can achieve a maximum improvement of 41% in success rate in zero-shot scenes, while gaining the ability to transfer manipulation skills to novel targets.

[LG-18] Bayesian Optimization of a Lightweight and Accurate Neural Network for Aerodynamic Performance Prediction

链接: https://arxiv.org/abs/2503.19479
作者: James M. Shihua,Paul Saves,Rhea P. Liem,Joseph Morlier
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Ensuring high accuracy and efficiency of predictive models is paramount in the aerospace industry, particularly in the context of multidisciplinary design and optimization processes. These processes often require numerous evaluations of complex objective functions, which can be computationally expensive and time-consuming. To build efficient and accurate predictive models, we propose a new approach that leverages Bayesian Optimization (BO) to optimize the hyper-parameters of a lightweight and accurate Neural Network (NN) for aerodynamic performance prediction. To clearly describe the interplay between design variables, hierarchical and categorical kernels are used in the BO formulation. We demonstrate the efficiency of our approach through two comprehensive case studies, where the optimized NN significantly outperforms baseline models and other publicly available NNs in terms of accuracy and parameter efficiency. For the drag coefficient prediction task, the Mean Absolute Percentage Error (MAPE) of our optimized model drops from 0.1433% to 0.0163%, which is nearly an order of magnitude improvement over the baseline model. Additionally, our model achieves a MAPE of 0.82% on a benchmark aircraft self-noise prediction problem, significantly outperforming existing models (where their MAPE values are around 2 to 3%) while requiring fewer computational resources. The results highlight the potential of our framework to enhance the scalability and performance of NNs in large-scale MDO problems, offering a promising solution for the aerospace industry.
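
The paper builds a custom BO with hierarchical and categorical kernels; as an accessible stand-in, an off-the-shelf tuner such as Optuna expresses the same search-space idea. The objective below is a dummy placeholder, to be replaced by a real train-and-evaluate routine.

```python
import optuna

def train_and_eval_mape(n_layers, width, lr, act):
    # Placeholder objective: substitute the real training/evaluation here.
    return (n_layers - 2) ** 2 + abs(width - 64) / 64 + lr   # dummy MAPE-like score

def objective(trial):
    n_layers = trial.suggest_int("n_layers", 1, 4)
    width = trial.suggest_int("width", 16, 256, log=True)
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    act = trial.suggest_categorical("activation", ["relu", "tanh"])
    return train_and_eval_mape(n_layers, width, lr, act)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```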

[LG-19] Extracting Interpretable Logic Rules from Graph Neural Networks

链接: https://arxiv.org/abs/2503.19476
作者: Chuqin Geng,Zhaoyue Wang,Ziyu Zhao,Haolin Ye,Xujie Si
类目: Machine Learning (cs.LG)
*备注: 12 pages, 4 figures

点击查看摘要

Abstract:Graph neural networks (GNNs) operate over both input feature spaces and combinatorial graph structures, making it challenging to understand the rationale behind their predictions. As GNNs gain widespread popularity and demonstrate success across various domains, such as drug discovery, studying their interpretability has become a critical task. To address this, many explainability methods have been proposed, with recent efforts shifting from instance-specific explanations to global concept-based explainability. However, these approaches face several limitations, such as relying on predefined concepts and explaining only a limited set of patterns. To address this, we propose a novel framework, LOGICXGNN, for extracting interpretable logic rules from GNNs. LOGICXGNN is model-agnostic, efficient, and data-driven, eliminating the need for predefined concepts. More importantly, it can serve as a rule-based classifier and even outperform the original neural models. Its interpretability facilitates knowledge discovery, as demonstrated by its ability to extract detailed and accurate chemistry knowledge that is often overlooked by existing methods. Another key advantage of LOGICXGNN is its ability to generate new graph instances in a controlled and transparent manner, offering significant potential for applications such as drug design. We empirically demonstrate these merits through experiments on real-world datasets such as MUTAG and BBBP.

[LG-20] A Probabilistic Neuro-symbolic Layer for Algebraic Constraint Satisfaction

链接: https://arxiv.org/abs/2503.19466
作者: Leander Kurscheidt,Paolo Morettin,Roberto Sebastiani,Andrea Passerini,Antonio Vergari
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In safety-critical applications, guaranteeing the satisfaction of constraints over continuous environments is crucial, e.g., an autonomous agent should never crash into obstacles or go off-road. Neural models struggle in the presence of these constraints, especially when they involve intricate algebraic relationships. To address this, we introduce a differentiable probabilistic layer that guarantees the satisfaction of non-convex algebraic constraints over continuous variables. This probabilistic algebraic layer (PAL) can be seamlessly plugged into any neural architecture and trained via maximum likelihood without requiring approximations. PAL defines a distribution over conjunctions and disjunctions of linear inequalities, parameterized by polynomials. This formulation enables efficient and exact renormalization via symbolic integration, which can be amortized across different data points and easily parallelized on a GPU. We showcase PAL and our integration scheme on a number of benchmarks for algebraic constraint integration and on real-world trajectory data.

[LG-21] Multi-Agent Deep Reinforcement Learning for Safe Autonomous Driving with RICS-Assisted MEC

链接: https://arxiv.org/abs/2503.19418
作者: Xueyao Zhang,Bo Yang,Xuelin Cao,Zhiwen Yu,George C. Alexandropoulos,Yan Zhang,Merouane Debbah,Chau Yuen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Environment sensing and fusion via onboard sensors are envisioned to be widely applied in future autonomous driving networks. This paper considers a vehicular system with multiple self-driving vehicles that is assisted by multi-access edge computing (MEC), where image data collected by the sensors is offloaded from cellular vehicles to the MEC server using vehicle-to-infrastructure (V2I) links. Sensory data can also be shared among surrounding vehicles via vehicle-to-vehicle (V2V) communication links. To improve spectrum utilization, the V2V links may reuse the same frequency spectrum as V2I links, which may cause severe interference. To tackle this issue, we leverage reconfigurable intelligent computational surfaces (RICSs) to jointly enable V2I reflective links and mitigate interference appearing at the V2V links. Traditional algorithms fall short in addressing this problem because they assume quasi-static channel state information, which restricts their ability to adapt to dynamic environmental changes and leads to poor performance under frequently varying channel conditions. In this paper, we therefore formulate the problem at hand as a Markov game. Our novel formulation is applied to time-varying channels subject to multi-user interference and introduces a collaborative learning mechanism among users. The considered optimization problem is solved via a driving safety-enabled multi-agent deep reinforcement learning (DS-MADRL) approach that capitalizes on the RICS presence. Our extensive numerical investigations showcase that the proposed reinforcement learning approach achieves faster convergence and significant enhancements in both data rate and driving safety, as compared to various state-of-the-art benchmarks.

[LG-22] Towards Build Optimization Using Digital Twins

链接: https://arxiv.org/abs/2503.19381
作者: Henri Aïdasso,Francis Bordeleau,Ali Tizghadam
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: Accepted at the 21st International Conference on Predictive Models and Data Analytics in Software Engineering - PROMISE 2025

点击查看摘要

Abstract:Despite the indisputable benefits of Continuous Integration (CI) pipelines (or builds), CI still presents significant challenges regarding long durations, failures, and flakiness. Prior studies addressed CI challenges in isolation, yet these issues are interrelated and require a holistic approach for effective optimization. To bridge this gap, this paper proposes a novel idea of developing Digital Twins (DTs) of build processes to enable global and continuous improvement. To support such an idea, we introduce the CI Build process Digital Twin (CBDT) framework as a minimum viable product. This framework offers digital shadowing functionalities, including real-time build data acquisition and continuous monitoring of build process performance metrics. Furthermore, we discuss guidelines and challenges in the practical implementation of CBDTs, including (1) modeling different aspects of the build process using Machine Learning, (2) exploring what-if scenarios based on historical patterns, and (3) implementing prescriptive services such as automated failure and performance repair to continuously improve build processes.

[LG-23] Social Network User Profiling for Anomaly Detection Based on Graph Neural Networks

链接: https://arxiv.org/abs/2503.19380
作者: Yiwei Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study proposes a risk pricing anomaly detection method for social network user portraits based on graph neural networks (GNNs), aiming to improve the ability to identify abnormal users in social network environments. In view of the limitations of traditional methods in social network data modeling, this paper combines graph autoencoders (GAEs) and graph attention networks (GATs) to achieve accurate detection of abnormal users through dynamic aggregation of neighbor features and reconstruction error evaluation. The Facebook Page-Page Network dataset is used in the experiment and compared with VAE, GNN, Transformer and GAE. The results show that the proposed method achieves the best performance in AUC, F1-score, Precision and Recall, verifying its effectiveness. In addition, this paper explores the computational efficiency of the model in large-scale data and looks forward to combining self-supervised learning, federated learning, and other technologies in the future to improve the robustness and privacy protection of risk assessment. The research results can provide efficient anomaly detection solutions for financial risk control, social security management, and other fields.
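
The GAE half of such a pipeline is compact in PyTorch Geometric; the per-node score below (mean edge-reconstruction error, with a dense adjacency for illustration) is our simplification and omits the paper's GAT branch.

```python
import torch
from torch_geometric.nn import GAE, GCNConv

class Encoder(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, latent_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, latent_dim)

    def forward(self, x, edge_index):
        return self.conv2(self.conv1(x, edge_index).relu(), edge_index)

model = GAE(Encoder(128, 64, 32))

def node_anomaly_scores(x, edge_index):
    z = model.encode(x, edge_index)
    adj_rec = torch.sigmoid(z @ z.t())            # reconstructed adjacency
    adj = torch.zeros_like(adj_rec)               # dense ground-truth adjacency
    adj[edge_index[0], edge_index[1]] = 1.0
    return ((adj - adj_rec) ** 2).mean(dim=1)     # per-node reconstruction error

x = torch.randn(6, 128)
edge_index = torch.tensor([[0, 1, 2, 3, 4], [1, 2, 3, 4, 5]])
print(node_anomaly_scores(x, edge_index))         # untrained demo scores
```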

[LG-24] Data-driven Mesoscale Weather Forecasting Combining Swin-Unet and Diffusion Models

链接: https://arxiv.org/abs/2503.19354
作者: Yuta Hirabayashi,Daisuke Matsuoka
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Data-driven weather prediction models exhibit promising performance and advance continuously. In particular, diffusion models represent fine-scale details without spatial smoothing, which is crucial for mesoscale predictions, such as heavy rainfall forecasting. However, the applications of diffusion models to mesoscale prediction remain limited. To address this gap, this study proposes an architecture that combines a diffusion model with Swin-Unet as a deterministic model, achieving mesoscale predictions while maintaining flexibility. The proposed architecture trains the two models independently, allowing the diffusion model to remain unchanged when the deterministic model is updated. Comparisons using the Fractions Skill Score and power spectral analysis demonstrate that incorporating the diffusion model leads to improved accuracy compared to predictions without it. These findings underscore the potential of the proposed architecture to enhance mesoscale predictions, particularly for strong rainfall events, while maintaining flexibility.

[LG-25] Optimal Parameter Adaptation for Safety-Critical Control via Safe Barrier Bayesian Optimization

链接: https://arxiv.org/abs/2503.19349
作者: Shengbo Wang,Ke Li,Zheng Yan,Zhenyuan Guo,Song Zhu,Guanghui Wen,Shiping Wen
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Preprint manuscript, review only

点击查看摘要

Abstract:Safety is of paramount importance in control systems to avoid costly risks and catastrophic damages. The control barrier function (CBF) method, a promising solution for safety-critical control, poses a new challenge of enhancing control performance due to its direct modification of the original control design and the introduction of uncalibrated parameters. In this work, we shed light on the crucial role of configurable parameters in the CBF method for performance enhancement, with a systematic categorization. Based on that, we propose a novel framework combining the CBF method with Bayesian optimization (BO) to optimize the safe control performance. Considering feasibility/safety-critical constraints, we develop a safe version of BO using the barrier-based interior method to efficiently search for promising feasible configurable parameters. Furthermore, we provide theoretical criteria for our framework regarding safety and optimality. An essential advantage of our framework is that it can work in model-agnostic environments, leaving sufficient flexibility in designing objective and constraint functions. Finally, simulation experiments on swing-up control and high-fidelity adaptive cruise control are conducted to demonstrate the effectiveness of our framework.

[LG-26] Membership Inference Attacks on Large-Scale Models: A Survey

链接: https://arxiv.org/abs/2503.19338
作者: Hengyu Wu,Yang Cao
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:The adoption of the Large Language Model (LLM) has accelerated dramatically since the ChatGPT from OpenAI went online in November 2022. Recent advances in Large Multimodal Models (LMMs), which process diverse data types and enable interaction through various channels, have expanded beyond the text-to-text limitations of early LLMs, attracting significant and concurrent attention from both researchers and industry. While LLMs and LMMs are starting to spread widely, concerns about their privacy risks are increasing as well. Membership Inference Attacks (MIAs), techniques used to determine whether a particular data point was part of a model’s training set, serve as a key metric for assessing the privacy vulnerabilities of machine learning models. Hu et al. show that various machine learning algorithms are vulnerable to MIA. Despite extensive studies on MIAs in traditional models, there remains a lack of systematic surveys addressing their effectiveness and implications in modern large-scale models like LLMs and LMMs. In this paper, we systematically reviewed recent studies of MIA against LLMs and LMMs. We analyzed and categorized each attack based on their methodology and scenario and discussed the limitations in existing research. Additionally, we examine privacy concerns associated with the fine-tuning process. Finally, we provided some suggestions for future research in this direction.

[LG-27] E-PINNs: Epistemic Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2503.19333
作者: Ashish S. Nair,Bruno Jacob,Amanda A. Howard,Jan Drgona,Panos Stinis
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 27 pages, 13 figures

点击查看摘要

Abstract:Physics-informed neural networks (PINNs) have demonstrated promise as a framework for solving forward and inverse problems involving partial differential equations. Despite recent progress in the field, it remains challenging to quantify uncertainty in these networks. While approaches such as Bayesian PINNs (B-PINNs) provide a principled approach to capturing uncertainty through Bayesian inference, they can be computationally expensive for large-scale applications. In this work, we propose Epistemic Physics-Informed Neural Networks (E-PINNs), a framework that leverages a small network, the epinet, to efficiently quantify uncertainty in PINNs. The proposed approach works as an add-on to existing, pre-trained PINNs with a small computational overhead. We demonstrate the applicability of the proposed framework in various test cases and compare the results with B-PINNs using Hamiltonian Monte Carlo (HMC) posterior estimation and dropout-equipped PINNs (Dropout-PINNs). Our experiments show that E-PINNs provide similar coverage to B-PINNs, with often comparable sharpness, while being computationally more efficient. This observation, combined with E-PINNs’ more consistent uncertainty estimates and better calibration compared to Dropout-PINNs for the examples presented, indicates that E-PINNs offer a promising approach in terms of accuracy-efficiency trade-off.

[LG-28] How to optimize K-means?

链接: https://arxiv.org/abs/2503.19324
作者: Qi Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Center-based clustering algorithms (e.g., K-means) are popular for clustering tasks, but they usually struggle to achieve high accuracy on complex datasets. We believe the main reason is that traditional center-based clustering algorithms identify only one clustering center in each cluster. When the distribution of a dataset is complex, a single clustering center cannot strongly represent distant objects within the cluster. How to optimize existing center-based clustering algorithms is thus a valuable research question. In this paper, we propose a general optimization method called ECAC, which can optimize different center-based clustering algorithms. ECAC is independent of the clustering principle and is embedded as a component between the center process and the category assignment process of center-based clustering algorithms. Specifically, ECAC identifies several extended centers for each clustering center. The extended centers act as relays to expand the representative capability of the clustering center in a complex cluster, thus improving the accuracy of center-based clustering algorithms. We conducted numerous experiments to verify the robustness and effectiveness of ECAC. ECAC is robust to diverse datasets and diverse clustering centers. After ECAC optimization, the accuracy (NMI as well as RI) of center-based clustering algorithms improves by an average of 33.4% and 64.1%, respectively, and even K-means accurately identifies complex-shaped clusters.
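
A toy version of the extended-center idea is easy to write down; the selection rule below (the farthest assigned points act as relays) is our guess, not necessarily ECAC's rule.

```python
import numpy as np
from sklearn.cluster import KMeans

def ecac_like(X, k=3, n_ext=2):
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    reps, rep_labels = [], []
    for c in range(k):
        pts = X[km.labels_ == c]
        d = np.linalg.norm(pts - km.cluster_centers_[c], axis=1)
        ext = pts[np.argsort(d)[-n_ext:]]          # farthest members as relays
        reps.append(np.vstack([km.cluster_centers_[c], ext]))
        rep_labels += [c] * (1 + len(ext))
    reps = np.vstack(reps)
    rep_labels = np.array(rep_labels)
    # Re-assign every point to its nearest representative (center or relay).
    dists = np.linalg.norm(X[:, None, :] - reps[None], axis=2)
    return rep_labels[dists.argmin(axis=1)]

labels = ecac_like(np.random.randn(300, 2))
```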

[LG-29] RGL: A Graph-Centric Modular Framework for Efficient Retrieval-Augmented Generation on Graphs

链接: https://arxiv.org/abs/2503.19314
作者: Yuan Li,Jun Hu,Jiaxin Jiang,Zemin Liu,Bryan Hooi,Bingsheng He
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in graph learning have paved the way for innovative retrieval-augmented generation (RAG) systems that leverage the inherent relational structures in graph data. However, many existing approaches suffer from rigid, fixed settings and significant engineering overhead, limiting their adaptability and scalability. Additionally, the RAG community has largely overlooked the decades of research in the graph database community regarding the efficient retrieval of interesting substructures on large-scale graphs. In this work, we introduce the RAG-on-Graphs Library (RGL), a modular framework that seamlessly integrates the complete RAG pipeline-from efficient graph indexing and dynamic node retrieval to subgraph construction, tokenization, and final generation-into a unified system. RGL addresses key challenges by supporting a variety of graph formats and integrating optimized implementations for essential components, achieving speedups of up to 143x compared to conventional methods. Moreover, its flexible utilities, such as dynamic node filtering, allow for rapid extraction of pertinent subgraphs while reducing token consumption. Our extensive evaluations demonstrate that RGL not only accelerates the prototyping process but also enhances the performance and applicability of graph-based RAG systems across a range of tasks.

[LG-30] UniMoMo: Unified Generative Modeling of 3D Molecules for De Novo Binder Design

链接: https://arxiv.org/abs/2503.19300
作者: Xiangzhe Kong,Zishen Zhang,Ziting Zhang,Rui Jiao,Jianzhu Ma,Kai Liu,Wenbing Huang,Yang Liu
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: preprint

点击查看摘要

Abstract:The design of target-specific molecules such as small molecules, peptides, and antibodies is vital for biological research and drug discovery. Existing generative methods are restricted to single-domain molecules, failing to address versatile therapeutic needs or utilize cross-domain transferability to enhance model performance. In this paper, we introduce Unified generative Modeling of 3D Molecules (UniMoMo), the first framework capable of designing binders of multiple molecular domains using a single model. In particular, UniMoMo unifies the representations of different molecules as graphs of blocks, where each block corresponds to either a standard amino acid or a molecular fragment. Based on these unified representations, UniMoMo utilizes a geometric latent diffusion model for 3D molecular generation, featuring an iterative full-atom autoencoder to compress blocks into latent space points, followed by an E(3)-equivariant diffusion process. Extensive benchmarks across peptides, antibodies, and small molecules demonstrate the superiority of our unified framework over existing domain-specific models, highlighting the benefits of multi-domain training.

[LG-31] Data-Driven ML-assisted Approaches to Problem Well-Posedness

链接: https://arxiv.org/abs/2503.19255
作者: Tom Bertalan,George A. Kevrekidis,Eleni D Koronaki,Siddhartha Mishra,Elizaveta Rebrova,Yannis G. Kevrekidis
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Classically, to solve differential equation problems, it is necessary to specify sufficient initial and/or boundary conditions so as to allow the existence of a unique solution. Well-posedness of differential equation problems thus involves studying the existence and uniqueness of solutions, and their dependence on such pre-specified conditions. However, in part due to mathematical necessity, these conditions are usually specified “to arbitrary precision” only on (appropriate portions of) the boundary of the space-time domain. This does not mirror how data acquisition is performed in realistic situations, where one may observe entire “patches” of solution data at arbitrary space-time locations; alternatively, one might have access to more than one solution stemming from the same differential operator. In our short work, we demonstrate how standard tools from machine and manifold learning can be used to infer, in a data-driven manner, certain well-posedness features of differential equation problems, for initial/boundary condition combinations under which rigorous existence/uniqueness theorems are not known. Our study naturally combines a data assimilation perspective with an operator-learning one.

[LG-32] Analytic DAG Constraints for Differentiable DAG Learning ICLR2025

链接: https://arxiv.org/abs/2503.19218
作者: Zhen Zhang,Ignavier Ng,Dong Gong,Yuhang Liu,Mingming Gong,Biwei Huang,Kun Zhang,Anton van den Hengel,Javen Qinfeng Shi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted to ICLR 2025

点击查看摘要

Abstract:Recovering the underlying Directed Acyclic Graph (DAG) structures from observational data presents a formidable challenge, partly due to the combinatorial nature of the DAG-constrained optimization problem. Recently, researchers have identified gradient vanishing as one of the primary obstacles in differentiable DAG learning and have proposed several DAG constraints to mitigate this issue. By developing the necessary theory to establish a connection between analytic functions and DAG constraints, we demonstrate that analytic functions from the set \{f(x) = c_0 + \sum_{i=1}^{\infty} c_i x^i \mid \forall i > 0, c_i > 0; r = \lim_{i\rightarrow\infty} c_i/c_{i+1} > 0\} can be employed to formulate effective DAG constraints. Furthermore, we establish that this set of functions is closed under several functional operators, including differentiation, summation, and multiplication. Consequently, these operators can be leveraged to create novel DAG constraints based on existing ones. Using these properties, we design a series of DAG constraints and develop an efficient algorithm to evaluate them. Experiments in various settings demonstrate that our DAG constraints outperform previous state-of-the-art comparators. Our implementation is available at this https URL.
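
One well-known member of this function family is f(x) = e^x, whose series coefficients 1/i! are all positive and whose radius of convergence is infinite (hence positive); it yields the classic constraint h(W) = tr(exp(W ∘ W)) - d, which vanishes exactly on acyclic weighted graphs. A minimal sketch:

```python
import numpy as np
from scipy.linalg import expm

def dag_constraint_exp(W: np.ndarray) -> float:
    """h(W) = tr(exp(W ∘ W)) - d, an analytic DAG constraint built from
    f(x) = e^x (all series coefficients 1/i! are positive).
    h(W) = 0 exactly when the weighted graph W is acyclic."""
    d = W.shape[0]
    return float(np.trace(expm(W * W)) - d)  # W * W is elementwise

# acyclic (strictly upper-triangular) vs. cyclic example
W_dag = np.triu(np.random.rand(4, 4), k=1)
W_cyc = W_dag.copy()
W_cyc[3, 0] = 0.5                            # closes a directed cycle
print(dag_constraint_exp(W_dag))             # ~0
print(dag_constraint_exp(W_cyc))             # > 0
```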

[LG-33] Byzantine Resilient Federated Multi-Task Representation Learning

链接: https://arxiv.org/abs/2503.19209
作者: Tuan Le,Shana Moothedath
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we propose BR-MTRL, a Byzantine-resilient multi-task representation learning framework that handles faulty or malicious agents. Our approach leverages representation learning through a shared neural network model, where all clients share fixed layers, except for a client-specific final layer. This structure captures shared features among clients while enabling individual adaptation, making it a promising approach for leveraging client data and computational power in heterogeneous federated settings to learn personalized models. To learn the model, we employ an alternating gradient descent strategy: each client optimizes its local model, updates its final layer, and sends estimates of the shared representation to a central server for aggregation. To defend against Byzantine agents, we employ geometric median aggregation for robust client-server communication. Our method enables personalized learning while maintaining resilience in distributed settings. We implemented the proposed alternating gradient descent algorithm in a federated testbed built using Amazon Web Services (AWS) platform and compared its performance with various benchmark algorithms and their variations. Through extensive experiments using real-world datasets, including CIFAR-10 and FEMNIST, we demonstrated the effectiveness and robustness of our approach and its transferability to new unseen clients with limited data, even in the presence of Byzantine adversaries.
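
The geometric median mentioned above is typically computed with the Weiszfeld iteration; below is a minimal sketch of this robust aggregator, which (unlike the coordinate-wise mean) tolerates a minority of arbitrarily corrupted client updates. The toy updates are illustrative data.

```python
import numpy as np

def geometric_median(points: np.ndarray, iters: int = 100,
                     eps: float = 1e-8) -> np.ndarray:
    """Weiszfeld iteration for the geometric median of client updates,
    a standard robust aggregator for Byzantine-resilient averaging."""
    z = points.mean(axis=0)
    for _ in range(iters):
        d = np.linalg.norm(points - z, axis=1)
        w = 1.0 / np.maximum(d, eps)          # guard against division by zero
        z_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < eps:
            break
        z = z_new
    return z

# honest updates near [1, 1] plus one Byzantine outlier
updates = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9], [100.0, -100.0]])
print(geometric_median(updates))  # stays near [1, 1]; the mean would not
```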

[LG-34] Graph neural networks extrapolate out-of-distribution for shortest paths

链接: https://arxiv.org/abs/2503.19173
作者: Robert R. Nerem,Samantha Chen,Sanjoy Dasgupta,Yusu Wang
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:Neural networks (NNs), despite their success and wide adoption, still struggle to extrapolate out-of-distribution (OOD), i.e., to inputs that are not well-represented by their training dataset. Addressing the OOD generalization gap is crucial when models are deployed in environments significantly different from the training set, such as applying Graph Neural Networks (GNNs) trained on small graphs to large, real-world graphs. One promising approach for achieving robust OOD generalization is the framework of neural algorithmic alignment, which incorporates ideas from classical algorithms by designing neural architectures that resemble specific algorithmic paradigms (e.g. dynamic programming). The hope is that trained models of this form would have superior OOD capabilities, in much the same way that classical algorithms work for all instances. We rigorously analyze the role of algorithmic alignment in achieving OOD generalization, focusing on graph neural networks (GNNs) applied to the canonical shortest path problem. We prove that GNNs, trained to minimize a sparsity-regularized loss over a small set of shortest path instances, exactly implement the Bellman-Ford (BF) algorithm for shortest paths. In fact, if a GNN minimizes this loss within an error of \epsilon , it implements the BF algorithm with an error of O(\epsilon) . Consequently, despite limited training data, these GNNs are guaranteed to extrapolate to arbitrary shortest-path problems, including instances of any size. Our empirical results support our theory by showing that NNs trained by gradient descent are able to minimize this loss and extrapolate in practice.
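
The alignment in question: a GNN layer whose aggregation is a min over (neighbor value + edge weight) coincides with one Bellman-Ford relaxation step. The sketch below shows that update with the learned components replaced by the exact min/+ operations; it illustrates the algorithmic correspondence, not the paper's trained model.

```python
import math

def bellman_ford_gnn_layer(h, edges):
    """One 'message passing' round that coincides with a Bellman-Ford
    relaxation: h[v] <- min(h[v], min over in-edges (h[u] + w))."""
    new_h = list(h)
    for u, v, w in edges:
        new_h[v] = min(new_h[v], h[u] + w)
    return new_h

edges = [(0, 1, 2.0), (1, 2, 1.0), (0, 2, 5.0)]
h = [0.0, math.inf, math.inf]        # distances from source node 0
for _ in range(2):                   # n - 1 rounds suffice here
    h = bellman_ford_gnn_layer(h, edges)
print(h)  # [0.0, 2.0, 3.0]
```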

[LG-35] Integrating Biological-Informed Recurrent Neural Networks for Glucose-Insulin Dynamics Modeling

链接: https://arxiv.org/abs/2503.19158
作者: Stefano De Carli,Nicola Licini,Davide Previtali,Fabio Previdi,Antonio Ferramosca
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: Accepted for publication in the proceedings of the Engineering Diabetes Technologies (EDT 2025). 7 pages, 2 figures and 1 table

点击查看摘要

Abstract:Type 1 Diabetes (T1D) management is a complex task due to many variability factors. Artificial Pancreas (AP) systems have alleviated patient burden by automating insulin delivery through advanced control algorithms. However, the effectiveness of these systems depends on accurate modeling of glucose-insulin dynamics, which traditional mathematical models often fail to capture due to their inability to adapt to patient-specific variations. This study introduces a Biological-Informed Recurrent Neural Network (BIRNN) framework to address these limitations. The BIRNN leverages a Gated Recurrent Unit (GRU) architecture augmented with physics-informed loss functions that embed physiological constraints, ensuring a balance between predictive accuracy and consistency with biological principles. The framework is validated using the commercial UVA/Padova simulator, outperforming traditional linear models in glucose prediction accuracy and reconstruction of unmeasured states, even under circadian variations in insulin sensitivity. The results demonstrate the potential of BIRNN for personalized glucose regulation and future adaptive control strategies in AP systems.
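
The general shape of such a physics-informed objective is a data-fitting term plus a penalty on violations of domain constraints. In the sketch below, the GRU predictor is generic and the residual (penalizing negative glucose predictions) is a placeholder assumption, not the paper's actual glucose-insulin constraints.

```python
import torch
import torch.nn as nn

class BIRNNSketch(nn.Module):
    """GRU predictor trained with data loss + physics penalty (sketch)."""
    def __init__(self, n_in, n_hidden=32):
        super().__init__()
        self.gru = nn.GRU(n_in, n_hidden, batch_first=True)
        self.head = nn.Linear(n_hidden, 1)

    def forward(self, x):
        out, _ = self.gru(x)
        return self.head(out).squeeze(-1)

def physics_residual(glucose_pred):
    # Placeholder constraint: penalize physiologically impossible
    # (negative) glucose values. The actual BIRNN embeds knowledge of
    # glucose-insulin dynamics instead.
    return torch.relu(-glucose_pred).mean()

model = BIRNNSketch(n_in=3)
x = torch.randn(8, 24, 3)            # (batch, time, features) dummy data
y = torch.randn(8, 24)
pred = model(x)
loss = nn.functional.mse_loss(pred, y) + 0.1 * physics_residual(pred)
loss.backward()
```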

[LG-36] Activation Functions Considered Harmful: Recovering Neural Network Weights through Controlled Channels

链接: https://arxiv.org/abs/2503.19142
作者: Jesse Spielman,David Oswald,Mark Ryan,Jo Van Bulck
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 17 pages, 5 figures

点击查看摘要

Abstract:With high-stakes machine learning applications increasingly moving to untrusted end-user or cloud environments, safeguarding pre-trained model parameters becomes essential for protecting intellectual property and user privacy. Recent advancements in hardware-isolated enclaves, notably Intel SGX, hold the promise to secure the internal state of machine learning applications even against compromised operating systems. However, we show that privileged software adversaries can exploit input-dependent memory access patterns in common neural network activation functions to extract secret weights and biases from an SGX enclave. Our attack leverages the SGX-Step framework to obtain a noise-free, instruction-granular page-access trace. In a case study of an 11-input regression network using the Tensorflow Microlite library, we demonstrate complete recovery of all first-layer weights and biases, as well as partial recovery of parameters from deeper layers under specific conditions. Our novel attack technique requires only 20 queries per input per weight to obtain all first-layer weights and biases with an average absolute error of less than 1%, improving over prior model stealing attacks. Additionally, a broader ecosystem analysis reveals the widespread use of activation functions with input-dependent memory access patterns in popular machine learning frameworks (either directly or via underlying math libraries). Our findings highlight the limitations of deploying confidential models in SGX enclaves and emphasise the need for stricter side-channel validation of machine learning implementations, akin to the vetting efforts applied to secure cryptographic libraries.

[LG-37] Paving the way for scientific foundation models: enhancing generalization and robustness in PDEs with constraint-aware pre-training

链接: https://arxiv.org/abs/2503.19081
作者: Amin Totounferoush,Serge Kotchourko,Michael W. Mahoney,Steffen Staab
类目: Machine Learning (cs.LG)
*备注: pre-print, 31 pages

点击查看摘要

Abstract:Partial differential equations (PDEs) govern a wide range of physical systems, but solving them efficiently remains a major challenge. The idea of a scientific foundation model (SciFM) is emerging as a promising tool for learning transferable representations across diverse domains. However, SciFMs require large amounts of solution data, which may be scarce or computationally expensive to generate. To maximize generalization while reducing data dependence, we propose incorporating PDE residuals into pre-training either as the sole learning signal or in combination with data loss to compensate for limited or infeasible training data. We evaluate this constraint-aware pre-training across three key benchmarks: (i) generalization to new physics, where material properties, e.g., the diffusion coefficient, are shifted with respect to the training distribution; (ii) generalization to entirely new PDEs, requiring adaptation to different operators; and (iii) robustness against noisy fine-tuning data, ensuring stability in real-world applications. Our results show that pre-training with PDE constraints significantly enhances generalization, outperforming models trained solely on solution data across all benchmarks. These findings prove the effectiveness of our proposed constraint-aware pre-training as a crucial component for SciFMs, providing a scalable approach to data-efficient, generalizable PDE solvers.
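
A minimal sketch of what a PDE-residual loss looks like, here for the 1D diffusion equation u_t = D u_xx with a toy MLP standing in for the foundation model; the collocation sampling and the equation itself are illustrative choices, not the paper's benchmarks.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))

def pde_residual_loss(net, n_points=256, D=0.1):
    """Residual of u_t = D * u_xx at random collocation points; usable
    as the sole pre-training signal or added to a data loss."""
    xt = torch.rand(n_points, 2, requires_grad=True)   # columns: (x, t)
    u = net(xt)
    grads = torch.autograd.grad(u.sum(), xt, create_graph=True)[0]
    u_x, u_t = grads[:, 0], grads[:, 1]
    u_xx = torch.autograd.grad(u_x.sum(), xt, create_graph=True)[0][:, 0]
    return ((u_t - D * u_xx) ** 2).mean()

loss = pde_residual_loss(net)
loss.backward()
```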

[LG-38] Near-optimal Active Reconstruction

链接: https://arxiv.org/abs/2503.18999
作者: Daniel Yang
类目: Machine Learning (cs.LG)
*备注: Thesis submitted on 12.04.2023

点击查看摘要

Abstract:With the growing practical interest in vision-based tasks for autonomous systems, the need for efficient and sophisticated methods continues to grow. In the rush to develop new methods with the aim of outperforming the current state of the art, an analysis of the underlying theory is often neglected and simply replaced with empirical evaluations in simulated or real-world experiments. While such methods might yield favorable performance in practice, they are often less well understood, which prevents them from being applied in safety-critical systems. The goal of this work is to design an algorithm for the Next Best View (NBV) problem in the context of active object reconstruction, for which we can provide qualitative performance guarantees with respect to true optimality. To the best of our knowledge, no previous work in this field addresses such an analysis for their proposed methods. Based on existing work on Gaussian process optimization, we rigorously derive sublinear bounds for the cumulative regret of our algorithm, which guarantees near-optimality. Complementing this, we evaluate the performance of our algorithm empirically within our simulation framework. We further provide additional insights through an extensive study of potential objective functions and analyze the differences to the results of related work.

[LG-39] A Novel Hat-Shaped Device-Cloud Collaborative Inference Framework for Large Language Models

链接: https://arxiv.org/abs/2503.18989
作者: Zuan Xie,Yang Xu,Hongli Xu,Yunming Liao,Zhiwei Yao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have catalyzed a substantial surge in demand for LLM services. While traditional cloud-based LLM services satisfy high-accuracy requirements, they fall short in meeting critical demands for low delay and enhanced privacy. To address these limitations, we propose HAT, a novel device-cloud collaborative inference framework that leverages the complementary strengths of U-shaped inference and speculative decoding. HAT partitions the LLM into three submodels, and the input and output submodels, stacked with a lightweight adapter network, are deployed as a small language model (SLM) on each end device. Meanwhile, the middle submodel, encompassing the majority of the LLM’s decoder layers, is hosted in the cloud to perform speculative decoding with on-device SLMs. During inference, HAT exchanges hidden states (rather than raw tokens) of input or draft tokens between devices and the cloud, thereby incurring substantial communication delays. Besides, processing hidden states of long prompts will exacerbate computation delays in the cloud, further compromising inference efficiency. To improve efficiency, we introduce a prompt chunking mechanism that segments long prompts into shorter chunks, enabling parallel transmission and processing. Furthermore, HAT is implemented to dynamically determine optimal chunk sizes for devices handling long prompts, thereby improving overall inference speed. Extensive experiments are conducted on a physical testbed comprising 30 NVIDIA Jetson devices and a server with 8 NVIDIA A6000 GPUs. Experimental results demonstrate that HAT achieves promising performance improvements, reducing TTFT by 41% to 54% and TBT by 41% to 77% compared to the baselines.

[LG-40] A Survey on Structured State Space Sequence (S4) Models

链接: https://arxiv.org/abs/2503.18970
作者: Shriyank Somvanshi,Md Monzurul Islam,Mahmuda Sultana Mimi,Sazzad Bin Bashar Polock,Gaurab Chhetri,Subasish Das
类目: Machine Learning (cs.LG)
*备注: 30 pages, 8 figures, 3 tables

点击查看摘要

Abstract:Recent advancements in sequence modeling have led to the emergence of Structured State Space Models (SSMs) as an efficient alternative to Recurrent Neural Networks (RNNs) and Transformers, addressing challenges in long-range dependency modeling and computational efficiency. While RNNs suffer from vanishing gradients and sequential inefficiencies, and Transformers face quadratic complexity, SSMs leverage structured recurrence and state-space representations to achieve superior long-sequence processing with linear or near-linear complexity. This survey provides a comprehensive review of SSMs, tracing their evolution from the foundational S4 model to its successors like Mamba, Simplified Structured State Space Sequence Model (S5), and Jamba, highlighting their improvements in computational efficiency, memory optimization, and inference speed. By comparing SSMs with traditional sequence models across domains such as natural language processing (NLP), speech recognition, vision, and time-series forecasting, we demonstrate their advantages in handling long-range dependencies while reducing computational overhead. Despite their potential, challenges remain in areas such as training optimization, hybrid modeling, and interpretability. This survey serves as a structured guide for researchers and practitioners, detailing the advancements, trade-offs, and future directions of SSM-based architectures in AI and deep learning.
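
The core recurrence shared by the S4 family, after discretizing the continuous system x'(t) = Ax(t) + Bu(t), y(t) = Cx(t), is a linear scan. The sketch below omits the structured parameterizations (e.g., HiPPO initialization, diagonal-plus-low-rank A) that make these models effective; the toy matrices are placeholders.

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, u):
    """Linear state-space recurrence: x_k = Ā x_{k-1} + B̄ u_k, y_k = C x_k.
    Runs in O(L) for sequence length L, unlike attention's O(L^2)."""
    x = np.zeros(A_bar.shape[0])
    ys = []
    for u_k in u:
        x = A_bar @ x + B_bar * u_k
        ys.append(C @ x)
    return np.array(ys)

n = 4
A_bar = 0.9 * np.eye(n)            # toy stable discretized state matrix
B_bar = np.ones(n)
C = np.ones(n) / n
print(ssm_scan(A_bar, B_bar, C, u=np.sin(np.linspace(0, 3, 10))))
```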

[LG-41] Representative Ranking for Deliberation in the Public Sphere

链接: https://arxiv.org/abs/2503.18962
作者: Manon Revel,Smitha Milli,Tyler Lu,Jamelle Watson-Daniels,Max Nickel
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Online comment sections, such as those on news sites or social media, have the potential to foster informal public deliberation. However, this potential is often undermined by the frequency of toxic or low-quality exchanges that occur in these settings. To combat this, platforms increasingly leverage algorithmic ranking to facilitate higher-quality discussions, e.g., by using civility classifiers or forms of prosocial ranking. Yet, these interventions may also inadvertently reduce the visibility of legitimate viewpoints, undermining another key aspect of deliberation: representation of diverse views. We seek to remedy this problem by introducing guarantees of representation into these methods. In particular, we adopt the notion of justified representation (JR) from the social choice literature and incorporate a JR constraint into the comment ranking setting. We find that enforcing JR leads to greater inclusion of diverse viewpoints while still being compatible with optimizing for user engagement or other measures of conversational quality.
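
In approval-ballot terms, a selected set W of k comments satisfies JR (Aziz et al.) when no group of at least n/k users shares an approved candidate while none of them approves anything in W; this gives a direct polynomial-time checker. The ballots below are toy data.

```python
def satisfies_jr(approvals, W, k):
    """Justified representation check: W of size k fails JR iff some
    candidate c is approved by >= n/k voters, none of whom approves
    any member of W."""
    n = len(approvals)
    W = set(W)
    unrepresented = [A for A in approvals if not (set(A) & W)]
    candidates = {c for A in approvals for c in A}
    for c in candidates:
        if sum(c in A for A in unrepresented) >= n / k:
            return False
    return True

# 4 users, select k=2 comments; users 3 and 4 form a cohesive group on "c"
approvals = [{"a"}, {"a"}, {"c"}, {"c"}]
print(satisfies_jr(approvals, W={"a", "b"}, k=2))  # False: group on "c" ignored
print(satisfies_jr(approvals, W={"a", "c"}, k=2))  # True
```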

[LG-42] An Approach to Analyze Niche Evolution in XCS Models

链接: https://arxiv.org/abs/2503.18961
作者: Pier Luca Lanzi
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present an approach to identify and track the evolution of niches in XCS that can be applied to any XCS model and any problem. It exploits the underlying principles of the evolutionary component of XCS, and therefore, it is independent of the representation used. It also employs information already available in XCS and thus requires minimal modifications to an existing XCS implementation. We present experiments on binary single-step and multi-step problems involving non-overlapping and highly overlapping solutions. We show that our approach can identify and evaluate the number of niches in the population; we also show that it can be used to identify the composition of active niches so as to track their evolution over time, allowing for a more in-depth analysis of XCS behavior.

[LG-43] Identification of Average Treatment Effects in Nonparametric Panel Models

链接: https://arxiv.org/abs/2503.19873
作者: Susan Athey,Guido Imbens
类目: Econometrics (econ.EM); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:This paper studies identification of average treatment effects in a panel data setting. It introduces a novel nonparametric factor model and proves identification of average treatment effects. The identification proof is based on the introduction of a consistent estimator. Underlying the proof is a result that there is a consistent estimator for the expected outcome in the absence of the treatment for each unit and time period; this result can be applied more broadly, for example in problems of decompositions of group-level differences in outcomes, such as the much-studied gender wage gap.

[LG-44] Ab-initio simulation of excited-state potential energy surfaces with transferable deep quantum Monte Carlo

链接: https://arxiv.org/abs/2503.19847
作者: Zeno Schätzle,P. Bernát Szabó,Alice Cuzzocrea,Frank Noé
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 21 pages, 4 figures

点击查看摘要

Abstract:The accurate quantum chemical calculation of excited states is a challenging task, often requiring computationally demanding methods. When entire ground and excited potential energy surfaces (PESs) are desired, e.g., to predict the interaction of light excitation and structural changes, one is often forced to use cheaper computational methods at the cost of reduced accuracy. Here we introduce a novel method for the geometrically transferable optimization of neural network wave functions that leverages weight sharing and dynamical ordering of electronic states. Our method enables the efficient prediction of ground and excited-state PESs and their intersections at the highest accuracy, demonstrating up to two orders of magnitude cost reduction compared to single-point calculations. We validate our approach on three challenging excited-state PESs, including ethylene, the carbon dimer, and the methylenimmonium cation, indicating that transferable deep-learning QMC can pave the way towards highly accurate simulation of excited-state dynamics.

[LG-45] IgCraft: A versatile sequence generation framework for antibody discovery and engineering

链接: https://arxiv.org/abs/2503.19821
作者: Matthew Greenig,Haowen Zhao,Vladimir Radenkovic,Aubin Ramon,Pietro Sormanni
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Designing antibody sequences to better resemble those observed in natural human repertoires is a key challenge in biologics development. We introduce IgCraft: a multi-purpose model for paired human antibody sequence generation, built on Bayesian Flow Networks. IgCraft presents one of the first unified generative modeling frameworks capable of addressing multiple antibody sequence design tasks with a single model, including unconditional sampling, sequence inpainting, inverse folding, and CDR motif scaffolding. Our approach achieves competitive results across the full spectrum of these tasks while constraining generation to the space of human antibody sequences, exhibiting particular strengths in CDR motif scaffolding (grafting) where we achieve state-of-the-art performance in terms of humanness and preservation of structural properties. By integrating previously separate tasks into a single scalable generative model, IgCraft provides a versatile platform for sampling human antibody sequences under a variety of contexts relevant to antibody discovery and engineering. Model code and weights are publicly available at this http URL.

[LG-46] A Systematic Review of EEG-based Machine Intelligence Algorithms for Depression Diagnosis and Monitoring

链接: https://arxiv.org/abs/2503.19820
作者: Amir Nassibi,Christos Papavassiliou,Ildar Rakhmatulin,Danilo Mandic,S. Farokh Atashzar
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Depression disorder is a serious health condition that has affected the lives of millions of people around the world. Diagnosis of depression is a challenging practice that relies heavily on subjective assessments and, in most cases, suffers from late detection. Electroencephalography (EEG) biomarkers have been suggested and investigated in recent years as a potentially transformative objective practice. In this article, for the first time, a detailed systematic review of EEG-based depression diagnosis approaches is conducted using advanced machine learning techniques and statistical analyses. For this, 938 potentially relevant articles (since 1985) were initially detected and filtered into 139 relevant articles based on the review scheme ‘preferred reporting items for systematic reviews and meta-analyses (PRISMA).’ This article compares and discusses the selected articles and categorizes them according to the type of machine learning techniques and statistical analyses. Algorithms, preprocessing techniques, extracted features, and data acquisition systems are discussed and summarized. This review paper explains the existing challenges of the current algorithms and sheds light on the future direction of the field. This systematic review outlines the issues and challenges in machine intelligence for EEG-based depression diagnosis that can be addressed in future studies and possibly in future wearable technologies.

[LG-47] Interpretable Deep Regression Models with Interval-Censored Failure Time Data

链接: https://arxiv.org/abs/2503.19763
作者: Changhui Yuan,Shishun Zhao,Shuwei Li,Xinyuan Song,Zhao Chen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Deep neural networks (DNNs) have become powerful tools for modeling complex data structures through sequentially integrating simple functions in each hidden layer. In survival analysis, recent advances of DNNs primarily focus on enhancing model capabilities, especially in exploring nonlinear covariate effects under right censoring. However, deep learning methods for interval-censored data, where the unobservable failure time is only known to lie in an interval, remain underexplored and limited to specific data type or model. This work proposes a general regression framework for interval-censored data with a broad class of partially linear transformation models, where key covariate effects are modeled parametrically while nonlinear effects of nuisance multi-modal covariates are approximated via DNNs, balancing interpretability and flexibility. We employ sieve maximum likelihood estimation by leveraging monotone splines to approximate the cumulative baseline hazard function. To ensure reliable and tractable estimation, we develop an EM algorithm incorporating stochastic gradient descent. We establish the asymptotic properties of parameter estimators and show that the DNN estimator achieves minimax-optimal convergence. Extensive simulations demonstrate superior estimation and prediction accuracy over state-of-the-art methods. Applying our method to the Alzheimer’s Disease Neuroimaging Initiative dataset yields novel insights and improved predictive performance compared to traditional approaches.

[LG-48] Data-efficient rapid prediction of urban airflow and temperature fields for complex building geometries

链接: https://arxiv.org/abs/2503.19708
作者: Shaoxiang Qin,Dongxue Zhan,Ahmed Marey,Dingyang Geng,Theodore Potsis,Liangzhu Leon Wang
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurately predicting urban microclimate, including wind speed and temperature, based solely on building geometry requires capturing complex interactions between buildings and airflow, particularly long-range wake effects influenced by directional geometry. Traditional methods relying on computational fluid dynamics (CFD) are prohibitively expensive for large-scale simulations, while data-driven approaches struggle with limited training data and the need to model both local and far-field dependencies. In response, we propose a novel framework that leverages a multi-directional distance feature (MDDF) combined with localized training to achieve effective wind field predictions with minimal CFD data. By reducing the problem’s dimensionality, localized training effectively increases the number of training samples, while MDDF encodes the surrounding geometric information to accurately model wake dynamics and flow redirection. Trained on only 24 CFD simulations, our localized Fourier neural operator (Local-FNO) model generates full 3D wind velocity and temperature predictions in under one minute, yielding a 500-fold speedup over conventional CFD methods. With mean absolute errors of 0.3 m/s for wind speed and 0.3 ^\circ C for temperature on unseen urban configurations, our method demonstrates strong generalization capabilities and significant potential for practical urban applications.

[LG-49] Kernel Learning Assisted Synthesis Condition Exploration for Ternary Spinel

链接: https://arxiv.org/abs/2503.19637
作者: Yutong Liu,Mehrad Ansari,Robert Black,Jason Hattrick-Simpers
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

Abstract:Machine learning and high-throughput experimentation have greatly accelerated the discovery of mixed metal oxide catalysts by leveraging their compositional flexibility. However, the lack of established synthesis routes for solid-state materials remains a significant challenge in inorganic chemistry. An interpretable machine learning model is therefore essential, as it provides insights into the key factors governing phase formation. Here, we focus on the formation of single-phase Fe_2(ZnCo)O_4, synthesized via a high-throughput co-precipitation method. We combined a kernel classification model with a novel application of global SHAP analysis to pinpoint the experimental features most critical to single-phase synthesizability by interpreting the contributions of each feature. Global SHAP analysis reveals that precursor and precipitating agent contributions to single-phase spinel formation align closely with established crystal growth theories. These results not only underscore the importance of interpretable machine learning in refining synthesis protocols but also establish a framework for data-informed experimental design in inorganic synthesis.

[LG-50] Causal Bayesian Optimization with Unknown Graphs

链接: https://arxiv.org/abs/2503.19554
作者: Jean Durand,Yashas Annadani,Stefan Bauer,Sonali Parbhoo
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Causal Bayesian Optimization (CBO) is a methodology designed to optimize an outcome variable by leveraging known causal relationships through targeted interventions. Traditional CBO methods require a fully and accurately specified causal graph, which is a limitation in many real-world scenarios where such graphs are unknown. To address this, we propose a new method for the CBO framework that operates without prior knowledge of the causal graph. Consistent with causal bandit theory, we demonstrate through theoretical analysis that focusing on the direct causal parents of the target variable is sufficient for optimization, and provide empirical validation in the context of CBO. Furthermore, we introduce a new method that learns a Bayesian posterior over the direct parents of the target variable. This allows us to optimize the outcome variable while simultaneously learning the causal structure. Our contributions include a derivation of the closed-form posterior distribution for the linear case. In the nonlinear case where the posterior is not tractable, we present a Gaussian Process (GP) approximation that still enables CBO by inferring the parents of the outcome variable. The proposed method performs competitively with existing benchmarks and scales well to larger graphs, making it a practical tool for real-world applications where causal information is incomplete.

[LG-51] A novel forecasting framework combining virtual samples and enhanced Transformer models for tourism demand forecasting

链接: https://arxiv.org/abs/2503.19423
作者: Tingting Diao,Xinzhang Wu,Lina Yang,Ling Xiao,Yunxuan Dong
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate tourism demand forecasting is hindered by limited historical data and complex spatiotemporal dependencies among tourist origins. A novel forecasting framework integrating virtual sample generation and a novel Transformer predictor addresses constraints arising from restricted data availability. A spatiotemporal GAN produces realistic virtual samples by dynamically modeling spatial correlations through a graph convolutional network, and an enhanced Transformer captures local patterns with causal convolutions and long-term dependencies with self-attention, eliminating autoregressive decoding. A joint training strategy refines virtual sample generation based on predictor feedback to maintain robust performance under data-scarce conditions. Experimental evaluations on real-world daily and monthly tourism demand datasets indicate a reduction in average MASE by 18.37% compared to conventional Transformer-based models, demonstrating improved forecasting accuracy. The integration of adaptive spatiotemporal sample augmentation with a specialized Transformer can effectively address limited-data forecasting scenarios in tourism management.

[LG-52] Centroid Decision Forest

链接: https://arxiv.org/abs/2503.19306
作者: Amjad Ali,Zardad Khan,Saeed Aldahmani
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: This article has 11 pages, 6 figures, and 3 tables and has been submitted to the “IEEE Transactions on Pattern Analysis and Machine Intelligence” journal

点击查看摘要

Abstract:This paper introduces the centroid decision forest (CDF), a novel ensemble learning framework that redefines the splitting strategy and tree building in the ordinary decision trees for high-dimensional classification. The splitting approach in CDF differs from traditional decision trees in that the class separability score (CSS) determines the selection of the most discriminative features at each node to construct centroids of the partitions (daughter nodes). The splitting criterion uses the Euclidean distance measurements from each class centroid to achieve a splitting mechanism that is more flexible and robust. Centroids are constructed by computing the mean feature values of the selected features for each class, ensuring a class-representative division of the feature space. This centroid-driven approach enables CDF to capture complex class structures while maintaining interpretability and scalability. To evaluate CDF, 23 high-dimensional datasets are used to assess its performance against different state-of-the-art classifiers through classification accuracy and Cohen’s kappa statistic. The experimental results show that CDF outperforms conventional methods, establishing its effectiveness and flexibility for high-dimensional classification problems.
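
A sketch of one centroid-based split: rank features by a simple between-to-within variance ratio (a stand-in for the paper's CSS, which may be defined differently), build per-class centroids on the top features, and route each sample to its nearest centroid.

```python
import numpy as np

def centroid_split(X, y, n_features=5):
    """Illustrative centroid-based node split: pick the most separable
    features, build one centroid per class, and route each sample to
    its nearest centroid (Euclidean distance)."""
    classes = np.unique(y)
    overall = X.mean(axis=0)
    between = sum((X[y == c].mean(axis=0) - overall) ** 2 for c in classes)
    within = sum(X[y == c].var(axis=0) for c in classes) + 1e-12
    top = np.argsort(between / within)[::-1][:n_features]  # separability proxy
    centroids = np.stack([X[y == c][:, top].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X[:, top][:, None, :] - centroids[None], axis=2)
    return top, centroids, np.argmin(dists, axis=1)  # branch index per sample

X = np.random.default_rng(0).normal(size=(100, 30))
y = np.repeat([0, 1], 50)
X[y == 1, :3] += 2.0                     # make three features informative
top, cents, branch = centroid_split(X, y)
print(top[:3], (branch == y).mean())     # informative features dominate
```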

[LG-53] Universal Architectures for the Learning of Polyhedral Norms and Convex Regularization Functionals

链接: https://arxiv.org/abs/2503.19190
作者: Michael Unser,Stanislas Ducotterd
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:This paper addresses the task of learning convex regularizers to guide the reconstruction of images from limited data. By imposing that the reconstruction be amplitude-equivariant, we narrow down the class of admissible functionals to those that can be expressed as a power of a seminorm. We then show that such functionals can be approximated to arbitrary precision with the help of polyhedral norms. In particular, we identify two dual parameterizations of such systems: (i) a synthesis form with an \ell_1 -penalty that involves some learnable dictionary; and (ii) an analysis form with an \ell_\infty -penalty that involves a trainable regularization operator. After having provided geometric insights and proved that the two forms are universal, we propose an implementation that relies on a specific architecture (tight frame with a weighted \ell_1 penalty) that is easy to train. We illustrate its use for denoising and the reconstruction of biomedical images. We find that the proposed framework outperforms the sparsity-based methods of compressed sensing, while it offers essentially the same convergence and robustness guarantees.

[LG-54] High Probability Complexity Bounds of Trust-Region Stochastic Sequential Quadratic Programming with Heavy-Tailed Noise

链接: https://arxiv.org/abs/2503.19091
作者: Yuchen Fang,Javad Lavaei,Katya Scheinberg,Sen Na
类目: Optimization and Control (math.OC); Computational Complexity (cs.CC); Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注: 50 pages, 5 figures

点击查看摘要

Abstract:In this paper, we consider nonlinear optimization problems with a stochastic objective and deterministic equality constraints. We propose a Trust-Region Stochastic Sequential Quadratic Programming (TR-SSQP) method and establish its high-probability iteration complexity bounds for identifying first- and second-order \epsilon -stationary points. In our algorithm, we assume that exact objective values, gradients, and Hessians are not directly accessible but can be estimated via zeroth-, first-, and second-order probabilistic oracles. Compared to existing complexity studies of SSQP methods that rely on a zeroth-order oracle with sub-exponential tail noise (i.e., light-tailed) and focus mostly on first-order stationarity, our analysis accommodates irreducible and heavy-tailed noise in the zeroth-order oracle and significantly extends the analysis to second-order stationarity. We show that under weaker noise conditions, our method achieves the same high-probability first-order iteration complexity bounds, while also exhibiting promising second-order iteration complexity bounds. Specifically, the method identifies a first-order \epsilon -stationary point in \mathcal{O}(\epsilon^{-2}) iterations and a second-order \epsilon -stationary point in \mathcal{O}(\epsilon^{-3}) iterations with high probability, provided that \epsilon is lower bounded by a constant determined by the irreducible noise level in estimation. We validate our theoretical findings and evaluate the practical performance of our method on CUTEst benchmark test set.

[LG-55] Detecting Arbitrary Planted Subgraphs in Random Graphs

链接: https://arxiv.org/abs/2503.19069
作者: Dor Elimelech,Wasim Huleihel
类目: atistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Combinatorics (math.CO); Probability (math.PR)
*备注: 110 pages

点击查看摘要

Abstract:The problems of detecting and recovering planted structures/subgraphs in Erdős-Rényi random graphs have received significant attention over the past three decades, leading to many exciting results and mathematical techniques. However, prior work has largely focused on specific ad hoc planted structures and inferential settings, while a general theory has remained elusive. In this paper, we bridge this gap by investigating the detection of an arbitrary planted subgraph \Gamma = \Gamma_n in an Erdős-Rényi random graph \mathcal{G}(n, q_n) , where the edge probability within \Gamma is p_n . We examine both the statistical and computational aspects of this problem and establish the following results. In the dense regime, where the edge probabilities p_n and q_n are fixed, we tightly characterize the information-theoretic and computational thresholds for detecting \Gamma , and provide conditions under which a computational-statistical gap arises. Most notably, these thresholds depend on \Gamma only through its number of edges, maximum degree, and maximum subgraph density. Our lower and upper bounds are general and apply to any value of p_n and q_n as functions of n . Accordingly, we also analyze the sparse regime where q_n = \Theta(n^{-\alpha}) and p_n - q_n = \Theta(q_n) , with \alpha\in[0,2] , as well as the critical regime where p_n = 1-o(1) and q_n = \Theta(n^{-\alpha}) , both of which have been widely studied, for specific choices of \Gamma . For these regimes, we show that our bounds are tight for all planted subgraphs investigated in the literature thus far, and many more. Finally, we identify conditions under which detection undergoes a sharp phase transition, where the boundaries at which algorithms succeed or fail shift abruptly as a function of q_n .

[LG-56] Learning Beamforming Codebooks for Active Sensing with Reconfigurable Intelligent Surface

链接: https://arxiv.org/abs/2503.19046
作者: Zhongze Zhang,Wei Yu
类目: ignal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: Accepted in IEEE Transactions on Wireless Communications

点击查看摘要

Abstract:This paper explores the design of beamforming codebooks for the base station (BS) and for the reconfigurable intelligent surfaces (RISs) in an active sensing scheme for uplink localization, in which the mobile user transmits a sequence of pilots to the BS through reflection at the RISs, and the BS and the RISs are adaptively configured by carefully choosing BS beamforming codeword and RIS codewords from their respective codebooks in a sequential manner to progressively focus onto the user. Most existing codebook designs for RIS are not tailored for active sensing, by which we mean the choice of the next codeword should depend on the measurements made so far, and the sequence of codewords should dynamically focus reflection toward the user. Moreover, most existing codeword selection methods rely on exhaustive search in beam training to identify the codeword with the highest signal-to-noise ratio (SNR), thus incurring substantial pilot overhead as the size of the codebook scales. This paper proposes learning-based approaches for codebook construction and for codeword selection for active sensing. The proposed learning approach aims to locate a target in the service area by recursively selecting a sequence of BS beamforming codewords and RIS codewords from the respective codebooks as more measurements become available without exhaustive beam training. The codebook design and the codeword selection fuse key ideas from the vector quantized-variational autoencoder (VQ-VAE) and the long short-term memory (LSTM) network to learn respectively the discrete function space of the codebook and the temporal dependencies between measurements.

[LG-57] Quantum Complex-Valued Self-Attention Model

链接: https://arxiv.org/abs/2503.19002
作者: Fu Chen,Qinglin Zhao,Li Feng,Longfei Tang,Yangbin Lin,Haitao Huang
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The self-attention mechanism has revolutionized classical machine learning, yet its quantum counterpart remains underexplored in fully harnessing the representational power of quantum states. Current quantum self-attention models exhibit a critical limitation by neglecting the indispensable phase information inherent in quantum systems when compressing attention weights into real-valued overlaps. To address this fundamental gap, we propose the Quantum Complex-Valued Self-Attention Model (QCSAM), the first framework that explicitly leverages complex-valued similarities between quantum states to capture both amplitude and phase relationships. Simultaneously, we enhance the standard Linear Combination of Unitaries (LCUs) method by introducing a Complex LCUs (CLCUs) framework that natively supports complex-valued coefficients. This framework enables the weighting of corresponding quantum values using fixed quantum complex self-attention weights, while also supporting trainable complex-valued parameters for value aggregation and quantum multi-head attention. Experimental evaluations on MNIST and Fashion-MNIST demonstrate our model’s superiority over recent quantum self-attention architectures including QKSAN, QSAN, and GQHAN, with multi-head configurations showing consistent advantages over single-head variants. We systematically evaluate model scalability through qubit configurations ranging from 3 to 8 qubits and multi-class classification tasks spanning 2 to 4 categories. Through comprehensive ablation studies, we establish the critical advantage of complex-valued quantum attention weights over real-valued alternatives.

信息检索

[IR-0] How Generative IR Retrieves Documents Mechanistically

链接: https://arxiv.org/abs/2503.19715
作者: Anja Reusch,Yonatan Belinkov
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Generative Information Retrieval (GenIR) is a novel paradigm in which a transformer encoder-decoder model predicts document rankings based on a query in an end-to-end fashion. These GenIR models have received significant attention due to their simple retrieval architecture while maintaining high retrieval effectiveness. However, in contrast to established retrieval architectures like cross-encoders or bi-encoders, their internal computations remain largely unknown. Therefore, this work studies the internal retrieval process of GenIR models by applying methods based on mechanistic interpretability, such as patching and vocabulary projections. By replacing the GenIR encoder with one trained on fewer documents, we demonstrate that the decoder is the primary component responsible for successful retrieval. Our patching experiments reveal that not all components in the decoder are crucial for the retrieval process. More specifically, we find that a pass through the decoder can be divided into three stages: (I) the priming stage, which contributes important information for activating subsequent components in later layers; (II) the bridging stage, where cross-attention is primarily active to transfer query information from the encoder to the decoder; and (III) the interaction stage, where predominantly MLPs are active to predict the document identifier. Our findings indicate that interaction between query and document information occurs only in the last stage. We hope our results promote a better understanding of GenIR models and foster future research to overcome the current challenges associated with these models.

[IR-1] Beyond Relevance: An Adaptive Exploration-Based Framework for Personalized Recommendations

链接: https://arxiv.org/abs/2503.19525
作者: Edoardo Bianchi
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recommender systems must balance personalization, diversity, and robustness to cold-start scenarios to remain effective in dynamic content environments. This paper introduces an adaptive, exploration-based recommendation framework that adjusts to evolving user preferences and content distributions to promote diversity and novelty without compromising relevance. The system represents items using sentence-transformer embeddings and organizes them into semantically coherent clusters through an online algorithm with adaptive thresholding. A user-controlled exploration mechanism enhances diversity by selectively sampling from under-explored clusters. Experiments on the MovieLens dataset show that enabling exploration reduces intra-list similarity from 0.34 to 0.26 and increases unexpectedness to 0.73, outperforming collaborative filtering and popularity-based baselines. A/B testing with 300 simulated users reveals a strong link between interaction history and preference for diversity, with 72.7% of long-term users favoring exploratory recommendations. Computational analysis confirms that clustering and recommendation processes scale linearly with the number of clusters. These results demonstrate that adaptive exploration effectively mitigates over-specialization while preserving personalization and efficiency.
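
The exploration mechanism described can be approximated as probabilistic sampling from the least-visited cluster; in the toy sketch below, random vectors stand in for the sentence-transformer embeddings, and the epsilon-style `explore_p` rule is an illustrative assumption, not the paper's exact policy.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
items = rng.normal(size=(200, 16))            # stand-in for item embeddings
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(items)
visits = np.zeros(8)                          # per-cluster exposure counts

def recommend(relevance, explore_p=0.3):
    """With probability explore_p, sample an item from the least-visited
    cluster; otherwise return the most relevant item."""
    if rng.random() < explore_p:
        cluster = int(np.argmin(visits))      # most under-explored cluster
        pool = np.flatnonzero(labels == cluster)
        choice = int(rng.choice(pool))
    else:
        choice = int(np.argmax(relevance))
    visits[labels[choice]] += 1
    return choice

print(recommend(relevance=rng.random(200)))
```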

[IR-2] Enhanced Bloom’s Educational Taxonomy for Fostering Information Literacy in the Era of Large Language Models

链接: https://arxiv.org/abs/2503.19434
作者: Yiming Luo,Ting Liu,Patrick Cheong-Iao Pang,Dana McKay,Ziqi Chen,George Buchanan,Shanton Chang
类目: Information Retrieval (cs.IR)
*备注: 25 Pages, 5 figures, submitted to the journal Computers & Education, currently under peer review

点击查看摘要

Abstract:The advent of Large Language Models (LLMs) has profoundly transformed the paradigms of information retrieval and problem-solving, enabling students to acquire information more efficiently to support learning. However, there is currently a lack of standardized evaluation frameworks that guide learners in effectively leveraging LLMs. This paper proposes an LLM-driven Bloom’s Educational Taxonomy that aims to recognize and evaluate students’ information literacy (IL) with LLMs, and to formalize and guide students’ practice-based activities of using LLMs to solve complex problems. The framework delineates the IL corresponding to the cognitive abilities required to use LLMs into two distinct stages: Exploration & Action and Creation & Metacognition. It further subdivides these into seven phases: Perceiving, Searching, Reasoning, Interacting, Evaluating, Organizing, and Curating. Through the case presentation, the analysis demonstrates the framework’s applicability and feasibility, supporting its role in fostering IL among students with varying levels of prior knowledge.

附件下载

点击下载今日全部论文列表